## Data

We are using articles from NPR (National Public Radio), obtained from their website, [www.npr.org](http://www.npr.org).

In [1]:
import numpy as np
import pandas as pd


In [2]:
df=pd.read_csv('npr.csv')

In [3]:
df.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


We don't have topics associated with the articles. Let's use topic modeling (NMF first, then LDA) to group the articles and find the corresponding topics.

### Text preprocessing

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [6]:
tfdtm=tfidf.fit_transform(df['Article'])

In [7]:
tfdtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

## NMF

In [8]:
from sklearn.decomposition import NMF

In [9]:
nmf_model = NMF(n_components=7,random_state=42)

In [10]:
nmf_model.fit(tfdtm)



NMF(n_components=7, random_state=42)

In [11]:
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([tfidf.get_feature_names_out()[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 15 WORDS FOR TOPIC #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


THE TOP 15 WORDS FOR TOPIC #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


THE TOP 15 WORDS FOR TOPIC #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


THE TOP 15 WORDS FOR TOPIC #5
['love', 've', 'don', 'album', 'way', 'time', 'song', '

### Attaching Discovered Topic Labels to Original Articles

In [12]:
nmf_topic_results = nmf_model.transform(tfdtm)

In [13]:
df['NMF-Topic']= nmf_topic_results.argmax(axis=1)
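
As a small illustration of what `argmax(axis=1)` does here, the sketch below uses a made-up weight matrix (three documents, four topics; the numbers are assumptions for illustration only). Each row holds one document's topic weights, and the label is the column with the largest weight.

```python
import numpy as np

# Hypothetical document-topic weights: rows are documents, columns are topics
toy_weights = np.array([
    [0.02, 0.41, 0.00, 0.07],
    [0.30, 0.05, 0.12, 0.00],
    [0.00, 0.00, 0.09, 0.25],
])

# For each row, pick the index of the largest weight
toy_labels = toy_weights.argmax(axis=1)
print(toy_labels)  # [1 0 3]
```

Note that NMF weights are not probabilities, so a row does not have to sum to 1; only the relative size within a row matters for the label.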

## LDA

There's one big difference: LDA is a probabilistic model of word counts, so it expects raw term frequencies rather than TF-IDF weights. That means we need to use a CountVectorizer as the vectorizer instead of a TfidfVectorizer. I tried TF-IDF at first and the results were tragic; a bit of research online pointed me to this fix.
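
The difference between the two vectorizers is easy to see on a tiny made-up corpus (the two documents below are illustrative assumptions). CountVectorizer returns integer counts, while TfidfVectorizer returns fractional, length-normalized weights.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical two-document corpus
toy_docs = ["trump trump clinton", "clinton sanders"]

# Columns are sorted alphabetically: clinton, sanders, trump
counts = CountVectorizer().fit_transform(toy_docs).toarray()
weights = TfidfVectorizer().fit_transform(toy_docs).toarray()

print(counts)            # [[1 0 2]
                         #  [1 1 0]]
print(weights.round(2))  # fractional values, each row scaled to unit length
```

Integer counts like the first matrix are the kind of input the LDA cells below are fitted on.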



In [14]:
from sklearn.feature_extraction.text import CountVectorizer 

In [15]:
cv= CountVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [16]:
dtm = cv.fit_transform(df['Article'])

In [17]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [18]:
from sklearn.decomposition import LatentDirichletAllocation

In [19]:
# Let's arbitrarily set the topic count to 7 at first
LDA = LatentDirichletAllocation(n_components=7,random_state=42)

In [20]:
LDA.fit(dtm)

LatentDirichletAllocation(n_components=7, random_state=42)

In [21]:
len(cv.get_feature_names_out())


54777

### Showing words per topic

In [22]:
len(LDA.components_[0])

54777

In [23]:
single_topic = LDA.components_[0]

In [24]:
# Returns the indices that would sort this array.
single_topic.argsort()

array([ 2475, 18302, 35285, ..., 22673, 42561, 42993], dtype=int64)

In [25]:
#Top 10 words in topic
single_topic.argsort()[-10:]

array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993], dtype=int64)

In [26]:
top_word_indices=single_topic.argsort()[-10:]

In [27]:
for i in top_word_indices:
    print(cv.get_feature_names_out()[i])

new
percent
government
company
million
care
people
health
said
says
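
The `argsort` trick above can be checked by hand on a small example. This is a sketch with a made-up five-word vocabulary and made-up weights; it also shows how to list the top words from strongest to weakest, since `argsort()[-n:]` returns them in ascending order.

```python
import numpy as np

# Hypothetical vocabulary and topic weights, for illustration only
toy_vocab = np.array(['health', 'music', 'said', 'trump', 'vote'])
toy_topic = np.array([0.5, 3.2, 9.1, 7.4, 1.0])

# Indices that sort the weights from smallest to largest
order = toy_topic.argsort()
print(order)  # [0 4 1 3 2]

# Last three indices are the three heaviest words, weakest first
top3 = toy_vocab[order[-3:]]
print(top3)   # ['music' 'trump' 'said']

# Reverse the order to list the strongest word first
top3_desc = toy_vocab[order[::-1][:3]]
print(top3_desc)  # ['said' 'trump' 'music']
```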


In [28]:
for index,topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([cv.get_feature_names_out()[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['companies', 'money', 'year', 'federal', '000', 'new', 'percent', 'government', 'company', 'million', 'care', 'people', 'health', 'said', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['military', 'house', 'security', 'russia', 'government', 'npr', 'reports', 'says', 'news', 'people', 'told', 'police', 'president', 'trump', 'said']


THE TOP 15 WORDS FOR TOPIC #2
['way', 'world', 'family', 'home', 'day', 'time', 'water', 'city', 'new', 'years', 'food', 'just', 'people', 'like', 'says']


THE TOP 15 WORDS FOR TOPIC #3
['time', 'new', 'don', 'years', 'medical', 'disease', 'patients', 'just', 'children', 'study', 'like', 'women', 'health', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['voters', 'vote', 'election', 'party', 'new', 'obama', 'court', 'republican', 'campaign', 'people', 'state', 'president', 'clinton', 'said', 'trump']


THE TOP 15 WORDS FOR TOPIC #5
['years', 'going', 've', 'life', 'don', 'new', 'way', 'music', 'really', 'time', 'know', 'think',

In [29]:
from sklearn.model_selection import GridSearchCV

In [30]:
search_params = {
  'n_components': [5, 7, 8],
  }


In [31]:
model = LatentDirichletAllocation(learning_method='online')


In [32]:
gridsearch = GridSearchCV(model, param_grid=search_params, n_jobs=-1, verbose=1)
gridsearch.fit(dtm)


Fitting 5 folds for each of 3 candidates, totalling 15 fits


GridSearchCV(estimator=LatentDirichletAllocation(learning_method='online'),
             n_jobs=-1, param_grid={'n_components': [5, 7, 8]}, verbose=1)

In [33]:
print("Best Model's Params: ", gridsearch.best_params_)
print("Best Log Likelihood Score: ", gridsearch.best_score_)


Best Model's Params:  {'n_components': 5}
Best Log Likelihood Score:  -8182521.601141015
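
The score reported by the grid search comes from the estimator's own `score` method, which for LatentDirichletAllocation is an approximate log-likelihood (higher, i.e. closer to zero, is better). The sketch below calls `score` and the related `perplexity` method directly on a small random count matrix; the data, seed, and variable names are assumptions for illustration, not the NPR corpus.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix: 40 "documents", 12 "words"
rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(40, 12))

toy_scores = {}
for k in (2, 3):
    toy_lda = LatentDirichletAllocation(n_components=k, random_state=0)
    toy_lda.fit(toy_counts)
    # score: approximate log-likelihood (higher is better)
    # perplexity: derived from the same bound (lower is better)
    toy_scores[k] = (toy_lda.score(toy_counts), toy_lda.perplexity(toy_counts))
    print(k, toy_scores[k])
```

Because the log-likelihood is a sum over the scored documents, its magnitude depends on how much text is scored, so values are only comparable between models evaluated on the same data.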


In [34]:
LDA5 = LatentDirichletAllocation(n_components=5,random_state=42)

In [35]:
LDA5.fit(dtm)

LatentDirichletAllocation(n_components=5, random_state=42)

In [36]:
for index,topic in enumerate(LDA5.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([cv.get_feature_names_out()[i] for i in topic.argsort()[-15:]])
    print('\n')

THE TOP 15 WORDS FOR TOPIC #0
['government', 'money', 'federal', 'million', '000', 'state', 'year', 'company', 'percent', 'new', 'care', 'said', 'health', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['war', 'house', 'russia', 'security', 'npr', 'government', 'reports', 'news', 'told', 'people', 'says', 'police', 'president', 'trump', 'said']


THE TOP 15 WORDS FOR TOPIC #2
['world', 'don', 'music', 'life', 'really', 'think', 'way', 'years', 'know', 'new', 'time', 'people', 'just', 'says', 'like']


THE TOP 15 WORDS FOR TOPIC #3
['years', 'university', 'new', 'don', 'time', 'students', 'study', 'children', 'health', 'women', 'just', 'school', 'like', 'people', 'says']


THE TOP 15 WORDS FOR TOPIC #4
['vote', 'election', 'party', 'just', 'court', 'obama', 'new', 'republican', 'campaign', 'state', 'people', 'president', 'clinton', 'said', 'trump']




NMF, by contrast, can't be scored this way (at least in scikit-learn!): the estimator has no `score` method, so a grid search has nothing to rank candidates by. Without a likelihood-style number for comparing models, the main way to judge the NMF topics is to read them and decide how coherent they feel.
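
One partial handle that scikit-learn does expose on a fitted NMF model is the `reconstruction_err_` attribute, which measures how closely the factorization reproduces the input matrix. The sketch below, on a made-up random matrix, is only a rough illustration: reconstruction error is not a likelihood, it tends to shrink as components are added, and it says nothing about whether the topics are interpretable.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative matrix standing in for a TF-IDF matrix
rng = np.random.default_rng(0)
toy_matrix = rng.random((30, 10))

errors = {}
for k in (2, 4, 6):
    toy_nmf = NMF(n_components=k, random_state=42, max_iter=500)
    toy_nmf.fit(toy_matrix)
    # Distance between the input and its low-rank reconstruction
    errors[k] = float(toy_nmf.reconstruction_err_)
    print(k, round(errors[k], 3))
```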

### Attaching Discovered Topic Labels to Original Articles

In [37]:
topic_results = LDA5.transform(dtm)

In [38]:
topic_results.shape

(11992, 5)

In [39]:
topic_results[0].round(2)

array([0.01, 0.67, 0.  , 0.  , 0.32])

In [40]:
topic_results[0].argmax()

1

In [47]:
df['LDA_Topic'] = topic_results.argmax(axis=1)

In [48]:
df.head(10)

Unnamed: 0,Article,NMF-Topic,Topic,Topic_name,LDA_Topic
0,"In the Washington of 2016, even when the polic...",1,1,national security,1
1,Donald Trump has used Twitter — his prefe...,1,1,national security,1
2,Donald Trump is unabashedly praising Russian...,1,1,national security,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,1,national security,1
4,"From photography, illustration and video, to d...",6,2,Media,2
5,I did not want to join yoga class. I hated tho...,5,3,education,3
6,With a who has publicly supported the debunk...,0,3,education,3
7,"I was standing by the airport exit, debating w...",0,2,Media,2
8,"If movies were trying to be more realistic, pe...",0,3,education,3
9,"Eighteen years ago, on New Year’s Eve, David F...",5,2,Media,2


In [43]:
topic_label={0:'healthcare',1:'national security',2:'Media',3:'education',4:'election'}

In [49]:
df['LDA_Topic_name']=df['LDA_Topic'].map(topic_label)
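
As a quick check of how `map` behaves with a label dictionary, here is a sketch on a made-up four-row frame (the `toy` DataFrame and the topic id 6 are assumptions for illustration). Ids found in the dictionary get their label; an id missing from it becomes `NaN`, which is worth watching for if the topic count and the label dictionary ever get out of sync.

```python
import pandas as pd

topic_label = {0: 'healthcare', 1: 'national security', 2: 'Media',
               3: 'education', 4: 'election'}

# Hypothetical topic ids; 6 has no entry in topic_label
toy = pd.DataFrame({'LDA_Topic': [1, 4, 0, 6]})
toy['LDA_Topic_name'] = toy['LDA_Topic'].map(topic_label)
print(toy)
```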

In [54]:
df.sample(5)

Unnamed: 0,Article,NMF-Topic,LDA_Topic,LDA_Topic_name
4379,On the first day of the Consumer Electronics S...,0,0,healthcare
5828,Puerto Rico is losing people. Due to a reces...,0,3,education
4955,"One month down, two to go. For unemployed adul...",6,0,healthcare
3298,Milwaukee has the nation’s publicly funded v...,6,3,education
221,"In 1889, Bethlehem Steel brought engineer Fred...",5,2,Media
