Tutorial: 

The purpose of this tutorial is to demonstrate a basic topic analysis method. 

The dataset (news.tsv) comes from the MIND dataset for the news recommendation task. 

MIND was collected from anonymized behavior logs of the Microsoft News website. It contains a random sample of 1 million users who had at least 5 news clicks during the 6 weeks from October 12 to November 22, 2019. More information about the dataset can be found at: https://www.kaggle.com/datasets/arashnic/mind-news-dataset?resource=download 

The news.tsv file has 8 columns, separated by tabs:

* News ID

* Category

* SubCategory

* Title

* Abstract

* URL

* Title Entities (entities contained in the title of this news)

* Abstract Entities (entities contained in the abstract of this news)

Further reference: https://colab.research.google.com/github/alvinntnu/NTNU_ENC2045_LECTURES/blob/main/nlp/topic-modeling-naive.ipynb#scrollTo=_7szxgqRYD-l


In [None]:
! pip install pyLDAvis

Collecting pyLDAvis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting funcy
  Downloading funcy-1.17-py2.py3-none-any.whl (33 kB)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (PEP 517) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-3.3.1-py2.py3-none-any.whl size=136898 sha256=faaabc01ab9c54fbf255e807a8beb0b930ebc7a7a47bcbefdb596dc1cc528dd8
  Stored in directory: /root/.cache/pip/wheels/c9/21/f6/17bcf2667e8a68532ba2fbf6d5c72fdf4c7f7d9abfa4852d2f
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-1.17 pyLDAvis-3.3.1


In [None]:
#import libraries
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as esw
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
import nltk
# Download the NLTK resources used for lemmatization, POS tagging, and tokenization
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import matplotlib.pyplot as plt

import string

# allow display of multiple outputs by running one code cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Get the current working directory
os.getcwd()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

'/content/drive/MyDrive/Queens/Data Analytics'

Change the path in the following cell to your own Google Drive directory.

In [None]:
#Read dataset from google drive
from google.colab import drive
drive.mount('/content/drive')
os.chdir('/content/drive/MyDrive/Queens/Data Analytics/')

Mounted at /content/drive


## Task 1: Let's take a look at the news title and abstract content

In [None]:
# Load dataset
header_list = ["NewsID", "Category", "SubCategory", "Title", "Abstract", "URL", "Title Entities", "Abstract Entities"]

df = pd.read_csv("news.tsv", sep="\t", names=header_list)
df.head(5)


| | NewsID | Category | SubCategory | Title | Abstract | URL | Title Entities | Abstract Entities |
|---|---|---|---|---|---|---|---|---|
| 0 | N55528 | lifestyle | lifestyleroyals | The Brands Queen Elizabeth, Prince Charles, an... | Shop the notebooks, jackets, and more that the... | https://assets.msn.com/labs/mind/AAGH0ET.html | [{"Label": "Prince Philip, Duke of Edinburgh",... | [] |
| 1 | N19639 | health | weightloss | 50 Worst Habits For Belly Fat | These seemingly harmless habits are holding yo... | https://assets.msn.com/labs/mind/AAB19MK.html | [{"Label": "Adipose tissue", "Type": "C", "Wik... | [{"Label": "Adipose tissue", "Type": "C", "Wik... |
| 2 | N61837 | news | newsworld | The Cost of Trump's Aid Freeze in the Trenches... | Lt. Ivan Molchanets peeked over a parapet of s... | https://assets.msn.com/labs/mind/AAJgNsz.html | [] | [{"Label": "Ukraine", "Type": "G", "WikidataId... |
| 3 | N53526 | health | voices | I Was An NBA Wife. Here's How It Affected My M... | I felt like I was a fraud, and being an NBA wi... | https://assets.msn.com/labs/mind/AACk2N6.html | [] | [{"Label": "National Basketball Association", ... |
| 4 | N38324 | health | medical | How to Get Rid of Skin Tags, According to a De... | They seem harmless, but there's a very good re... | https://assets.msn.com/labs/mind/AAAKEkt.html | [{"Label": "Skin tag", "Type": "C", "WikidataI... | [{"Label": "Skin tag", "Type": "C", "WikidataI... |


In [None]:
# Check whether the two target fields (Title and Abstract) contain NaN values.
df['Title'].isnull().sum()

0

In [None]:
df['Abstract'].isnull().sum()

2666

The results above show that the Title field has no missing values, while the Abstract field has 2666 empty entries. We should remove those rows before running topic analysis on the abstracts.

In [None]:
# Keep Category, SubCategory and NewsID so that we can later compare the topic assigned to each news article with its manually annotated category and subcategory.
df_abstract_analysis = df.dropna(subset=['Abstract'])[['NewsID', 'Abstract', 'Category','SubCategory']]
df_abstract_analysis.head(10)

| | NewsID | Abstract | Category | SubCategory |
|---|---|---|---|---|
| 0 | N55528 | Shop the notebooks, jackets, and more that the... | lifestyle | lifestyleroyals |
| 1 | N19639 | These seemingly harmless habits are holding yo... | health | weightloss |
| 2 | N61837 | Lt. Ivan Molchanets peeked over a parapet of s... | news | newsworld |
| 3 | N53526 | I felt like I was a fraud, and being an NBA wi... | health | voices |
| 4 | N38324 | They seem harmless, but there's a very good re... | health | medical |
| 5 | N2073 | Several fines came down against NFL players fo... | sports | football_nfl |
| 6 | N49186 | There won't be a chill down to your bones this... | weather | weathertopstories |
| 7 | N59295 | Three people have died in a supermarket fire a... | news | newsworld |
| 8 | N24510 | Every confirmed or expected PS5 game we can't ... | entertainment | gaming |
| 9 | N39237 | When there are active closings, view them here... | news | newsscienceandtechnology |


In [None]:
# Prepare two text corpora: one for titles, one for abstracts

all_titles = df.Title
all_abstracts = df_abstract_analysis.Abstract

all_titles.head(50)
all_abstracts.head(50)

0     The Brands Queen Elizabeth, Prince Charles, an...
1                         50 Worst Habits For Belly Fat
2     The Cost of Trump's Aid Freeze in the Trenches...
3     I Was An NBA Wife. Here's How It Affected My M...
4     How to Get Rid of Skin Tags, According to a De...
5     Should NFL be able to fine players for critici...
6     It's been Orlando's hottest October ever so fa...
7     Chile: Three die in supermarket fire amid prot...
8     Best PS5 games: top PlayStation 5 titles to lo...
9        How to report weather-related closings, delays
10    50 Foods You Should Never Eat, According to He...
11    Trying to Make a Ram 3500 as Quick as a Viper ...
12    25 Biggest Grocery Store Mistakes Making You G...
13    Instagram Filters with Plastic Surgery-Inspire...
14    Michigan apple recall: Nearly 2,300 crates cou...
15    Kate Middleton's Best Hairstyles Through the Y...
16              Stars who got fired from major projects
17    Newark Liberty Airport's Terminal One a $2

0     Shop the notebooks, jackets, and more that the...
1     These seemingly harmless habits are holding yo...
2     Lt. Ivan Molchanets peeked over a parapet of s...
3     I felt like I was a fraud, and being an NBA wi...
4     They seem harmless, but there's a very good re...
5     Several fines came down against NFL players fo...
6     There won't be a chill down to your bones this...
7     Three people have died in a supermarket fire a...
8     Every confirmed or expected PS5 game we can't ...
9     When there are active closings, view them here...
10                               This is so depressing.
11    The 2019 Ram 3500's new Cummins diesel has 100...
12    From picking up free goodies to navigating the...
13    In an effort to combat some of the negative me...
14    A Michigan produce company has recalled nearly...
15    The Duchess of Cambridge knows her way around ...
16    Take a look back at the celebs who got the boo...
17    The project, which is the bi-state agency'

In [None]:
all_abstracts[1]

'These seemingly harmless habits are holding you back and keeping you from shedding that unwanted belly fat for good.'

In [None]:
all_titles[1]

'50 Worst Habits For Belly Fat'

The text appears to be fairly formal and clean, with no HTML tags. 

## Task 2: Perform standard text preprocessing steps


In [None]:
# First, check the most frequent words in this corpus, excluding standard English stop words

def get_top_n_words(corpus, n=None):
    """
    List the top n words in a vocabulary according to occurrence in a text corpus.
    English stop words (e.g. "is", "the") are removed before counting.
    
    get_top_n_words(["I love Python", "Python is a language programming", "Hello world", "I love the world"]) -> 
    [('python', 2),
     ('world', 2),
     ('love', 2),
     ('hello', 1),
     ('programming', 1),
     ('language', 1)]
    (the order of words with equal counts may differ)
    """
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)    # total count of each term over all documents
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(all_abstracts, 50)
for word, freq in common_words:
    print(word, freq)

said 6904
new 5465
year 4778
police 3977
state 3412
week 3322
game 2841
season 2829
time 2828
president 2792
county 2698
trump 2694
city 2692
just 2690
old 2664
night 2648
according 2610
home 2462
tuesday 2436
sunday 2354
people 2317
monday 2308
wednesday 2279
man 2196
school 2174
team 2114
day 2084
thursday 2025
years 2010
saturday 1942
friday 1912
morning 1815
house 1792
like 1778
news 1744
high 1707
10 1630
say 1554
says 1551
officials 1454
world 1449
make 1432
road 1402
sign 1368
second 1304
area 1300
best 1299
department 1293
near 1283
2019 1274


Some common words such as "said", "say", "says", "state", "just", "2019", and "area" carry little topical information and could be removed. You are encouraged to investigate how different custom stop word lists affect the quality of the output topics.
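One quick way to explore a custom stop word list is to compare the most frequent terms with and without the extra words. The sketch below is only an illustration: the helper `top_words` and the three toy sentences are hypothetical and are not part of the tutorial's pipeline. It assumes scikit-learn is installed and passes the stop words as a plain list.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS


def top_words(corpus, extra_stop_words=(), n=5):
    """Return the n most frequent terms after removing English and extra stop words."""
    stop_words = list(ENGLISH_STOP_WORDS) + list(extra_stop_words)
    vec = CountVectorizer(stop_words=stop_words)
    # Sum the document-term matrix over documents to get one total count per term
    counts = np.asarray(vec.fit_transform(corpus).sum(axis=0)).ravel()
    pairs = [(term, int(counts[idx])) for term, idx in vec.vocabulary_.items()]
    # Sort by descending count, then alphabetically so ties are deterministic
    return sorted(pairs, key=lambda p: (-p[1], p[0]))[:n]


# Hypothetical toy corpus
docs = [
    "Police said the road was closed",
    "The mayor said schools will reopen",
    "Police said the game was cancelled",
]

baseline = top_words(docs, n=3)
customized = top_words(docs, extra_stop_words=["said"], n=3)
print(baseline)
print(customized)
```

With the real data, you could call the same helper on `all_abstracts` with different extra lists and see which high-frequency words remain.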

In [None]:
# Combine scikit-learn's standard English stop words with the corpus-specific
# high-frequency words identified above (frequent words with no discriminative power)
cachedStopWords = set(["said", "say", "says", "state", "just", "2019", "area"]) | set(esw)      #esw: English Stop Words
lemmatizer=WordNetLemmatizer()
#ps = PorterStemmer()

def lemmatize_article(sentence):
    """Tokenize a text and lemmatize each token using its part-of-speech tag."""
    sentence = word_tokenize(sentence)
    res = ''
    for word, tag in pos_tag(sentence):
        # Use the first letter of the Penn Treebank tag as the WordNet POS ('n', 'v', 'r').
        # Note: adjective tags start with 'J', so they never match 'a' and adjectives are left unchanged.
        wntag = tag[0].lower()
        wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
        word = lemmatizer.lemmatize(word, wntag) if wntag else word
        res += word + ' '
    return res
    
def remove_stop_words(sentence):
    return ' '.join([word for word in sentence.split() if word not in cachedStopWords])
    
def remove_short(sentence):
    # Drop tokens shorter than 3 characters
    return ' '.join([word for word in sentence.split() if len(word) >= 3])
    
def remove_digits(sentence):
    # Drop tokens that consist only of digits
    return ' '.join([i for i in sentence.split() if not i.isdigit()])
    
def preprocess(all_texts):
    """Lowercase, strip punctuation, lemmatize, then remove stop words, short tokens and numbers."""
    all_texts = list(map(lambda x: x.lower(), all_texts))
    all_texts = list(map(lambda x: x.translate(str.maketrans('', '', string.punctuation)), all_texts))
    all_texts = list(map(lambda x: lemmatize_article(x), all_texts))
    all_texts = list(map(lambda x: x.strip(), all_texts))
    all_texts = list(map(lambda x: remove_stop_words(x), all_texts))
    all_texts = list(map(lambda x: remove_short(x), all_texts))
    all_texts = list(map(lambda x: remove_digits(x), all_texts))
    return all_texts
    

In [None]:
# preprocess title and abstract
all_titles_processed = preprocess(all_titles)
all_titles_processed

all_abstracts_processed = preprocess(all_abstracts)
all_abstracts_processed

['brand queen elizabeth prince charles prince philip swear',
 'worst habit belly fat',
 'cost trump aid freeze trench ukraine war',
 'nba wife heres affect mental health',
 'rid skin tag accord dermatologist',
 'nfl able fine player criticize officiate',
 'orlandos hottest october far cooler temperature way',
 'chile die supermarket amid protest',
 'best ps5 game playstation title look forward',
 'report weatherrelated closing delay',
 'food eat accord health expert',
 'try make ram quick viper require disassembly',
 'biggest grocery store mistake make gain weight',
 'instagram filter plastic surgeryinspired effect soon disappear',
 'michigan apple recall nearly crate contaminate listeria',
 'kate middleton best hairstyle year',
 'star major project',
 'newark liberty airport terminal billion transformative project',
 'gmc yukon denali',
 'john dorsey admit talk washington tango',
 'elijah cummings lie capitol thursday',
 'abandoned theme park explore thrill chill nostalgia',
 'ford br

['shop notebooks jacket royal live',
 'seemingly harmless habit hold shed unwanted belly fat good',
 'ivan molchanets peek parapet sand bag line war ukraine helmet prop trick sniper perforate multiple hole',
 'felt like fraud nba wife didnt help fact nearly destroy',
 'harmless theres good reason shouldnt ignore post rid skin tag accord dermatologist appear reader digest',
 'fine come nfl player criticize officiate week bad look league',
 'wont chill bone halloween orlando unless count sweat drip armpit',
 'people die supermarket angry protest chile enter seventh day mayor capital city santiago sunday',
 'confirm expect ps5 game wait play',
 'active closing view wxii news receive number phone email viewer question sign newsletter report closure visit wxiireportclosingcom weather closing vieweroperated employee wxiitv wxii12com enter information come straight schoolbusinessinstitution enter information',
 'depressing',
 'ram 3500s new cummins diesel lbft torque work drag strip',
 'pick 

## Task 3: Analyze the abstracts first; you can perform a similar analysis on titles

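We use max_df and min_df to filter out terms whose document frequency is strictly higher or lower than a given threshold. A float value is interpreted as a proportion of documents, and an integer value as an absolute number of documents. Further reading: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html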
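To see what these two thresholds do, here is a minimal sketch on a hypothetical four-document corpus (the sentences and the threshold values are chosen only for illustration, not taken from the dataset). It compares the vocabulary of an unfiltered `CountVectorizer` with one that uses `max_df=0.5` and `min_df=2`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus. Document frequencies:
#   storm: 3 of 4 documents (75%)
#   city, school, game: 2 documents each
#   hits, closes, delays, wins: 1 document each
docs = [
    "storm hits city",
    "storm closes school",
    "storm delays game",
    "city school wins game",
]

# No filtering: every term is kept
full_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)

# max_df=0.5 drops terms found in more than 50% of the documents ("storm");
# min_df=2 drops terms found in fewer than 2 documents
filtered = CountVectorizer(max_df=0.5, min_df=2).fit(docs)
kept = sorted(filtered.vocabulary_)
dropped = sorted(set(full_vocab) - set(kept))

print("kept:", kept)
print("dropped:", dropped)
```

The same comparison on the real corpus shows how many terms a given pair of thresholds removes before the LDA model is fitted.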

In [None]:
# let's just try with 10 topics
# max_df = 0.5: ignore terms that occur in more than 50% of the documents
# min_df = 10: ignore terms that occur in fewer than 10 documents
tf_vectorizer = CountVectorizer(max_df = 0.5,
                                min_df = 10)
dtm_tf = tf_vectorizer.fit_transform(all_abstracts_processed)

# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=10, random_state=0)
lda_tf.fit(dtm_tf)

pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)


LatentDirichletAllocation(random_state=0)



In [None]:
# Print the top 10 words per topic
n_words = 10
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
feature_names = tf_vectorizer.get_feature_names()

topic_list = []
# lda_tf.components_ is a 2D array of shape (n_topics, n_terms) with the weight of each term in each topic
for topic_idx, topic in enumerate(lda_tf.components_):
    # argsort() returns term indices in ascending order of weight;
    # [-n_words:] keeps the n_words largest and [::-1] puts the largest first
    top_ids = topic.argsort()[-n_words:][::-1]
    top_n = [feature_names[i] for i in top_ids]
    top_features = ' '.join(top_n)
    topic_list.append(f"topic_{'_'.join(top_n[:3])}") 

    print(f"Topic {topic_idx}: {top_features}")

Topic 0: los angeles injury week season california practice start home nfl
Topic 1: game win night season team week play point sunday brown
Topic 2: school student city year high day county veteran million official
Topic 3: weather news snow week storm wind report national city local
Topic 4: new candidate democratic presidential health year die race campaign cancer
Topic 5: new city patriot open business price week restaurant photo company
Topic 6: police man county crash road accord car near morning officer
Topic 7: president trump house impeachment donald court charge white republican inquiry
Topic 8: make home year like know time look family come want
Topic 9: team game season series win football coach world year week




In [None]:
lda_tf.components_[0]
#tf_vectorizer.get_feature_names()

array([3.3965686 , 4.76766051, 2.63003216, ..., 0.10000445, 0.10000575,
       0.1       ])

It seems that we have topics related to sports, family, car accidents, politics, weather, and schools.

## Task 4: Previously, we directly set the number of topics to 10. However, there might be a better choice. So how do we select a good number of topics?

We can package the above code and vary the number of topics from 10 to 30. For each model, let's print its topics and check their alignment with the categories. Note that in practice you would not observe annotated labels such as category. Instead, you should vary the number of topics and check whether the top words returned for each topic are coherent (that is, they relate to a single theme that can be easily interpreted).
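Judging coherence by eye can be complemented with a simple co-occurrence score. The sketch below implements a UMass-style coherence measure, which is not used in this tutorial and is offered only as one possible option. The function `umass_coherence` and the toy matrix are hypothetical. It assumes a dense document-term array; a sparse matrix such as `dtm_tf` would first need `.toarray()`, which may be memory-hungry on the full corpus. It also assumes every top word occurs in at least one document.

```python
import numpy as np


def umass_coherence(doc_term, top_word_ids):
    """UMass-style coherence of one topic.

    doc_term:      2D array (documents x terms) of term counts.
    top_word_ids:  column indices of the topic's top words, most probable first.
    Sums log((D(w_m, w_l) + 1) / D(w_l)) over all pairs with l ranked before m,
    where D counts the documents containing the given word(s).
    """
    present = np.asarray(doc_term) > 0
    score = 0.0
    for m in range(1, len(top_word_ids)):
        for l in range(m):
            w_m, w_l = top_word_ids[m], top_word_ids[l]
            co_df = np.sum(present[:, w_m] & present[:, w_l])  # documents with both words
            df_l = np.sum(present[:, w_l])                     # documents with the higher-ranked word
            score += np.log((co_df + 1) / df_l)
    return float(score)


# Hypothetical toy matrix: terms 0 and 1 always co-occur, terms 2 and 3 always co-occur
toy_dtm = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

coherent = umass_coherence(toy_dtm, [0, 1])    # words that appear together
incoherent = umass_coherence(toy_dtm, [0, 2])  # words that never appear together
print(coherent, incoherent)
```

Higher (less negative) scores suggest that a topic's top words tend to appear in the same documents. Averaging the score over all topics gives one number per model that can be compared across topic counts, alongside manual inspection.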

Another note: sklearn's perplexity calculation has some known problems, see https://github.com/scikit-learn/scikit-learn/issues/6777, so we do not use it here.

In [None]:
def print_top_word_for_topic(lda_tf, vectorizer):
    # Print the top 10 words per topic, using the vocabulary of the vectorizer the model was fitted with
    n_words = 10
    feature_names = vectorizer.get_feature_names()
    
    for topic_idx, topic in enumerate(lda_tf.components_):
        # indices of the n_words highest-weighted terms, largest first
        top_ids = topic.argsort()[-n_words:][::-1]
        top_n = [feature_names[i] for i in top_ids]
        top_features = ' '.join(top_n)

        print(f"Topic {topic_idx}: {top_features}")


def varying_topic_number(topicnum):
    # Fit an LDA model with topicnum topics on the processed abstracts and print its topics
    tf_vectorizer = CountVectorizer(max_df = 0.5, min_df = 10)
    dtm_tf = tf_vectorizer.fit_transform(all_abstracts_processed)
    lda_tf = LatentDirichletAllocation(n_components=topicnum, random_state=0)
    lda_tf.fit(dtm_tf)

    print_top_word_for_topic(lda_tf, tf_vectorizer)


In [None]:
varying_topic_number(15)


Topic 0: los injury angeles season practice week make year quarterback game
Topic 1: game win night season point play team sunday score week
Topic 2: school student high county charge year child district university woman
Topic 3: news whats week heres today report weather local link recent
Topic 4: candidate democratic presidential die warren campaign film race new elizabeth
Topic 5: week new patriot price look heres england apartment team photo
Topic 6: police man crash county accord car road officer near morning
Topic 7: president trump house impeachment donald white inquiry republican public democrat
Topic 8: home make like year family know day time veteran best
Topic 9: game team win season series football coach world week national
Topic 10: weather snow wind day california storm morning power temperature cold
Topic 11: new york health manager report orleans open city time look
Topic 12: year know make million stock tour think world company look
Topic 13: city year park county new 



What's next? Can you propose a method to compare the document-topic distribution with the category and subcategory labels of the news articles? This may give you some confidence in using LDA to predict the category of new articles.
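One possible starting point, sketched here under stated assumptions, is to assign each document its most probable topic and cross-tabulate that against the category label. The document-topic matrix below is hypothetical toy data standing in for the output of `lda_tf.transform(dtm_tf)`, and the category values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic matrix (rows: documents, columns: topics).
# With the real model this would come from lda_tf.transform(dtm_tf).
doc_topic = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.1, 0.2, 0.7],
])

# Hypothetical category labels, one per document
categories = pd.Series(["sports", "sports", "weather", "news", "news"], name="Category")

# The dominant topic of a document is the topic with the highest probability
dominant_topic = pd.Series(doc_topic.argmax(axis=1), name="DominantTopic")

# Share of each category's documents that falls into each topic (each row sums to 1)
table = pd.crosstab(categories, dominant_topic, normalize="index")
print(table)
```

With the real data, the labels would be `df_abstract_analysis['Category']` after `reset_index(drop=True)`, so that its index lines up with the rows of the document-topic matrix. A category whose documents concentrate in one or two topics suggests that those topics capture it well.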