**Topic Modeling Assessment Project**

**About the Dataset:**

The dataset (`npr.csv`) contains 12,007 news articles from NPR (National Public Radio), one article per row in the `Article` column. The goal is to discover the topics these articles cover, first with Latent Dirichlet Allocation (LDA) and then with Non-negative Matrix Factorization (NMF).

Import libraries and read the CSV file

In [66]:
import pandas as pd

In [67]:
npr = pd.read_csv("npr.csv",engine='python', error_bad_lines=False)

In [68]:
npr.drop(npr.filter(regex="Unname"),axis=1, inplace=True)
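The two cells above can be tried on a small in-memory CSV. This is a minimal sketch: the `raw` string is made up for illustration, and it assumes pandas 1.3 or later, where `on_bad_lines="skip"` replaces the `error_bad_lines=False` argument used in this notebook.

```python
import io

import pandas as pd

# Made-up CSV: the header has two empty column names, and one row has too many fields.
raw = (
    "Article,,\n"
    "first article,,\n"
    "second article,,\n"
    "broken,row,with,too,many,fields\n"
    "third article,,\n"
)

# Rows with more fields than the header are skipped instead of raising an error.
df = pd.read_csv(io.StringIO(raw), engine="python", on_bad_lines="skip")
print(list(df.columns))  # pandas names the empty header cells "Unnamed: 1", "Unnamed: 2"

# Select every column whose name matches the regex, then drop those columns.
unnamed_cols = df.filter(regex="Unname").columns
clean = df.drop(columns=unnamed_cols)
print(clean)
```

Passing `.columns` to `drop(columns=...)` is a more explicit spelling of the `drop(..., axis=1)` call in the notebook; both remove the same columns.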

In [69]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [70]:
len(npr)

12007

**Preprocessing**

Task: Use count vectorization (`CountVectorizer`) to create the document-term matrix for LDA, which works on raw term counts. You may want to explore the `max_df` and `min_df` parameters.
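Both `CountVectorizer` and `TfidfVectorizer` accept `max_df` and `min_df`. As a rough illustration of what they prune, here is a sketch on a four-sentence toy corpus (the sentences are made up; the parameter values match the ones used in this notebook).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the mat",
    "the fish swam",
]

# No pruning: every token of two or more characters is kept.
full = CountVectorizer().fit(docs)

# max_df=0.9 drops terms found in more than 90% of documents ("the").
# min_df=2 drops terms found in fewer than 2 documents ("cat", "dog", ...).
pruned = CountVectorizer(max_df=0.9, min_df=2).fit(docs)

print(sorted(full.vocabulary_))
print(sorted(pruned.vocabulary_))  # ['mat', 'on', 'sat']
```

A float value is read as a fraction of documents and an integer as an absolute document count, which is why `max_df=0.9` and `min_df=2` can be mixed.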

In [71]:
from sklearn.feature_extraction.text import CountVectorizer

In [72]:
cv = CountVectorizer(max_df=0.9,min_df=2,stop_words="english")

In [73]:
dtm = cv.fit_transform(npr["Article"])

In [74]:
dtm

<12007x54731 sparse matrix of type '<class 'numpy.int64'>'
	with 3029517 stored elements in Compressed Sparse Row format>

**Latent Dirichlet Allocation**

Using scikit-learn, create an instance of LDA with 7 expected components (use `random_state=42`).
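Before fitting the full corpus, it can help to see the shapes LDA produces on something small. The sketch below uses a made-up six-document corpus and 2 topics instead of 7; the topics it finds are not meaningful, only the shapes and the row sums are.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration.
docs = [
    "health care insurance patients",
    "patients health study research",
    "insurance care health percent",
    "music album song band",
    "song music film album",
    "band album music tour",
]

counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit_transform(counts)

# One row of word weights per topic, one column per vocabulary term.
print(lda.components_.shape)
# One row per document, one column per topic.
print(doc_topic.shape)
# Each document's topic weights form a probability distribution.
print(np.allclose(doc_topic.sum(axis=1), 1.0))
```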

In [75]:
from sklearn.decomposition import LatentDirichletAllocation

In [76]:
LDA = LatentDirichletAllocation(n_components=7,random_state=42)

In [77]:
LDA.fit(dtm)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=7, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [78]:
#Grab the vocabulary of words

In [79]:
len(cv.get_feature_names())

54731

In [80]:
type(cv.get_feature_names())

list

In [81]:
import random

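# randint is inclusive on both ends, so the upper bound is the last valid index
random_word_id = random.randint(0, len(cv.get_feature_names()) - 1)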

cv.get_feature_names()[random_word_id]

'craton'

In [82]:
#Grab the topics

In [83]:
import numpy as np

**Print out the top 15 most common words for each of the 7 topics.**

In [84]:
for i,topic in enumerate(LDA.components_): 
  print(f"Top 15 words for topic #{i}")
  print([cv.get_feature_names()[index] for index in topic.argsort()[-15:]])
  print("\n")
  print("\n")

Top 15 words for topic #0
['university', 'insurance', 'medical', 'year', 'like', 'patients', 'research', 'new', 'said', 'study', 'care', 'percent', 'people', 'health', 'says']




Top 15 words for topic #1
['says', 'know', 'film', 'life', 'said', 'album', 'song', 'way', 'years', 'world', 'time', 'new', 'just', 'music', 'like']




Top 15 words for topic #2
['water', 'home', 'country', 'day', 'virus', 'don', 'world', 'zika', 'time', 'said', 'like', 'just', 'years', 'people', 'says']




Top 15 words for topic #3
['reported', 'security', 'china', 'company', 'npr', 'city', 'according', 'state', 'new', 'reports', 'people', 'government', 'police', 'says', 'said']




Top 15 words for topic #4
['want', 've', 'way', 'going', 'students', 'time', 'don', 'really', 'know', 'school', 'think', 'just', 'people', 'like', 'says']




Top 15 words for topic #5
['states', 'court', 'election', 'republican', 'white', 'people', 'new', 'obama', 'house', 'state', 'campaign', 'clinton', 'president', 'said', '
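The slicing in the loop above relies on `argsort` returning indices in ascending order of weight, so within each printed list the strongest word comes last. A minimal sketch with a made-up five-word vocabulary and made-up weights:

```python
import numpy as np

# Hypothetical vocabulary and one topic's word weights, for illustration only.
vocab = np.array(["care", "music", "health", "song", "vote"])
weights = np.array([2.0, 0.5, 9.0, 0.1, 4.0])

# argsort sorts ascending, so the last k indices are the k largest weights.
top3_ascending = weights.argsort()[-3:]
# Reverse the slice to list the strongest word first.
top3_descending = top3_ascending[::-1]

print(vocab[top3_ascending])   # ['care' 'vote' 'health']
print(vocab[top3_descending])  # ['health' 'vote' 'care']
```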

In [85]:
topic_results = LDA.transform(dtm)

In [86]:
# Check which topics the first article is most strongly associated with
topic_results[0].round(2)

array([0.  , 0.  , 0.  , 0.22, 0.  , 0.78, 0.  ])

In [87]:
# Assign each article the topic with the highest probability
npr["topic"] = topic_results.argmax(axis=1)
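Note that `argmax(axis=1)` keeps only the strongest topic per article. The first article above also carries a weight of 0.22 on topic 3, and that information is dropped. A small sketch of the same step (only the first row reuses values from the output above; the other rows are made up) that also keeps the winning weight as a rough confidence score:

```python
import numpy as np

doc_topic = np.array([
    [0.00, 0.00, 0.00, 0.22, 0.00, 0.78, 0.00],  # same values as topic_results[0] above
    [0.10, 0.60, 0.05, 0.05, 0.10, 0.05, 0.05],  # made up
    [0.70, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05],  # made up
])

# Index of the largest value in each row (axis=1 runs across the topic columns).
dominant = doc_topic.argmax(axis=1)
# The weight of that winning topic; low values flag articles that mix topics.
confidence = doc_topic.max(axis=1)

print(dominant)    # [5 1 0]
print(confidence)  # [0.78 0.6  0.7 ]
```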

In [88]:
# Labels chosen from the top words printed for each LDA topic above.
# Topic #6's word list is cut off in that output, so it keeps a placeholder label.
my_topic_dict = {0:"health",1:"music",2:"world",3:"politics",4:"edu",5:"election",6:"topic_6"}

In [89]:
npr["topic_label"] = npr["topic"].map(my_topic_dict)

In [90]:
npr

Unnamed: 0,Article,topic,topic_label
0,"In the Washington of 2016, even when the polic...",5,election
1,Donald Trump has used Twitter  —   his prefe...,5,election
2,Donald Trump is unabashedly praising Russian...,5,election
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,politics
4,"From photography, illustration and video, to d...",3,politics
...,...,...,...
12002,The number of law enforcement officers shot an...,3,politics
12003,"Trump is busy these days with   victory tours,...",5,election
12004,It’s always interesting for the Goats and Soda...,2,world
12005,The election of Donald Trump was a surprise to...,5,election


**Topic Modeling using Non-negative Matrix Factorization**

In [92]:
npr = pd.read_csv("npr.csv",engine='python', error_bad_lines=False)

In [93]:
npr.head()

Unnamed: 0,Article,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,...,Unnamed: 190,Unnamed: 191,Unnamed: 192,Unnamed: 193,Unnamed: 194,Unnamed: 195,Unnamed: 196,Unnamed: 197,Unnamed: 198,Unnamed: 199,Unnamed: 200,Unnamed: 201,Unnamed: 202,Unnamed: 203,Unnamed: 204,Unnamed: 205,Unnamed: 206,Unnamed: 207,Unnamed: 208,Unnamed: 209,Unnamed: 210,Unnamed: 211,Unnamed: 212,Unnamed: 213,Unnamed: 214,Unnamed: 215,Unnamed: 216,Unnamed: 217,Unnamed: 218,Unnamed: 219,Unnamed: 220,Unnamed: 221,Unnamed: 222,Unnamed: 223,Unnamed: 224,Unnamed: 225,Unnamed: 226,Unnamed: 227,Unnamed: 228,Unnamed: 229
0,"In the Washington of 2016, even when the polic...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,Donald Trump has used Twitter — his prefe...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,Donald Trump is unabashedly praising Russian...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,"Updated at 2:50 p. m. ET, Russian President Vl...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,"From photography, illustration and video, to d...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [94]:
npr.drop(npr.filter(regex="Unname"),axis=1, inplace=True)

In [95]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [96]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [97]:
tfidf = TfidfVectorizer(max_df=0.9,min_df=2,stop_words="english")

In [98]:
dtm = tfidf.fit_transform(npr["Article"])

In [99]:
dtm

<12007x54731 sparse matrix of type '<class 'numpy.float64'>'
	with 3029517 stored elements in Compressed Sparse Row format>
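The matrix has the same shape and the same number of stored elements as the count matrix from the LDA section; only the values differ (floats instead of integer counts). As a sketch of where those values come from, the code below recomputes TF-IDF by hand on a made-up three-document corpus and compares it with `TfidfVectorizer`. It assumes the vectorizer's default settings (smoothed IDF and L2 row normalization).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration.
docs = ["health care costs", "health insurance plan", "music album review"]

# Raw term counts (the term-frequency part).
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears.
doc_freq = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency, as used by the default settings.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Terms that appear in many documents (here `health`) get a smaller IDF and therefore less weight than terms that are specific to one document.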

In [100]:
from sklearn.decomposition import NMF

In [101]:
nmf_model = NMF(n_components=7,random_state=42)

In [102]:
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=7, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)
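NMF approximates the document-term matrix V as a product W·H of two non-negative matrices: W holds document-topic weights (what `transform` returns) and H holds topic-word weights (`components_`). The sketch below illustrates the idea with the classic multiplicative update rules in plain NumPy. It is not scikit-learn's implementation, which uses a coordinate-descent solver (`solver='cd'` in the output above); the function name `nmf_multiplicative` and the matrix `V` are made up for illustration.

```python
import numpy as np


def nmf_multiplicative(V, n_components, n_iter=200, seed=42, eps=1e-10):
    """Factor a non-negative matrix V into W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_components))
    H = rng.random((n_components, n_terms))
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy matrix: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

W, H = nmf_multiplicative(V, n_components=2)
error = np.linalg.norm(V - W @ H)

print(W.shape, H.shape)  # (4, 2) (2, 4)
print(error < np.linalg.norm(V))
print(W.argmax(axis=1))  # dominant component per document
```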

In [103]:
tfidf.get_feature_names()[2300]

'albania'

In [104]:
for i,topic in enumerate(nmf_model.components_): 
  print(f"Top 15 words for topic #{i}")
  print([tfidf.get_feature_names()[index] for index in topic.argsort()[-15:]])
  print("\n")
  print("\n")

Top 15 words for topic #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']




Top 15 words for topic #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']




Top 15 words for topic #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']




Top 15 words for topic #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'president', 'attack', 'reports', 'court', 'said', 'police']




Top 15 words for topic #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']




Top 15 words for topic #5
['love', 've', 'don', 'album', 'way', '

In [105]:
my_topic_dict = {0:"health",1:"election",2:"legis",3:"politics",4:"election",5:"music",6:"edu"}

In [106]:
topic_result = nmf_model.transform(dtm)

In [107]:
topic_result.argmax(axis=1)

array([1, 1, 1, ..., 0, 4, 3])

In [108]:
npr["topic"] = topic_result.argmax(axis=1)

In [109]:
npr.head()

Unnamed: 0,Article,topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6


In [110]:
npr["topic_label"] = npr["topic"].map(my_topic_dict)
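One thing to watch with `Series.map` and a dictionary: any topic id missing from the dictionary silently becomes `NaN`. A minimal sketch with a made-up, deliberately incomplete label dictionary (the `"unlabeled"` fallback is an illustrative choice, not part of the notebook):

```python
import pandas as pd

# Hypothetical label dictionary that is missing topic 2 on purpose.
labels = {0: "health", 1: "election"}
df = pd.DataFrame({"topic": [0, 1, 2]})

# Keys not found in the dictionary map to NaN rather than raising an error.
df["topic_label"] = df["topic"].map(labels)
missing = int(df["topic_label"].isna().sum())
print(missing)  # 1

# Replace the gaps with an explicit fallback label.
df["topic_label"] = df["topic_label"].fillna("unlabeled")
print(df)
```

Checking `isna().sum()` after mapping is a cheap way to confirm that every topic id has a label.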

In [111]:
npr.head()

Unnamed: 0,Article,topic,topic_label
0,"In the Washington of 2016, even when the polic...",1,election
1,Donald Trump has used Twitter — his prefe...,1,election
2,Donald Trump is unabashedly praising Russian...,1,election
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,politics
4,"From photography, illustration and video, to d...",6,edu
