# Non-Negative Matrix Factorization (NMF)

Non-Negative Matrix Factorization (NMF) is an unsupervised technique, so the model is not trained on labeled topics; it discovers the topics from the data itself.

NMF works by decomposing (factorizing) high-dimensional vectors into a lower-dimensional representation.

These lower-dimensional vectors are non-negative, which also means their coefficients are non-negative.

Given the original matrix (A), NMF produces two matrices (W and H) whose product approximates A. Here A is articles by words (the original TF-IDF matrix), W is articles by topics (the weight of each topic in each article), and H is topics by words (the weight of each word in each topic). In scikit-learn, H is stored in `nmf.components_` and W is what `nmf.transform` returns.
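To make the factorization concrete, here is a minimal sketch that factorizes a tiny made-up matrix with the classic multiplicative update rules. It assumes the layout where A is documents by words, so W has one row per document and H has one row per topic. The function name `nmf_multiplicative`, the toy counts and the iteration count are illustrative assumptions; scikit-learn's `NMF` uses its own solvers, so this is not a description of what the library does internally.

```python
import numpy as np

def nmf_multiplicative(A, n_topics, n_iter=200, seed=0, eps=1e-10):
    """Approximate a non-negative matrix A as W @ H (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = A.shape
    W = rng.random((n_docs, n_topics))   # documents x topics
    H = rng.random((n_topics, n_words))  # topics x words
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio,
        # so W and H can never become negative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-term matrix: 4 documents x 5 words (made-up values).
A = np.array([
    [3.0, 2.0, 0.0, 0.0, 1.0],
    [2.0, 3.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 4.0, 3.0, 0.0],
    [0.0, 1.0, 3.0, 4.0, 0.0],
])

W, H = nmf_multiplicative(A, n_topics=2)
print(W.shape, H.shape)      # (4, 2) (2, 5)
print(np.round(W @ H, 1))    # approximate reconstruction of A
```

The product `W @ H` only approximates A; the fewer topics you allow, the coarser the approximation.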

More info: https://towardsdatascience.com/topic-modeling-articles-with-nmf-8c6b2a227a45


In [1]:
import pandas as pd
npr=pd.read_csv('npr.csv')
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
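# Ignore words found in more than 90% of articles or in fewer than 2 articles, plus English stop words
tfidf=TfidfVectorizer(max_df=0.9,min_df=2,stop_words='english')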

In [4]:
# Document-term matrix: one row per article, one column per vocabulary word
dtm=tfidf.fit_transform(npr['Article'])

In [18]:
from sklearn.decomposition import NMF
nmf=NMF(n_components=7,random_state=42)

In [19]:
nmf.fit(dtm)



NMF(n_components=7, random_state=42)

In [20]:
tfidf.get_feature_names_out()[50000]

'transcribe'

In [21]:
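feature_names=tfidf.get_feature_names_out()  # look up the vocabulary once, not on every iteration
for i,topic in enumerate(nmf.components_):
    print(f"Top 10 words for Topic {i} :")
    # argsort is ascending, so the last 10 indices are the highest-weighted words
    print(feature_names[topic.argsort()[-10:]].tolist())
    print('\n')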

Top 10 words for Topic 0 :
['disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


Top 10 words for Topic 1 :
['election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


Top 10 words for Topic 2 :
['tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


Top 10 words for Topic 3 :
['isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


Top 10 words for Topic 4 :
['party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


Top 10 words for Topic 5 :
['time', 'song', 'life', 'really', 'know', 'people', 'think', 'just', 'music', 'like']


Top 10 words for Topic 6 :
['devos', 'children', 'college', 'kids', 'teachers', 'student', 'education', 'schools', 'school', 'students']
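The ordering of each list above follows from how `argsort` works: it sorts ascending, so slicing the last entries returns the highest-weighted words with the strongest one at the end. A small sketch with a made-up vocabulary and made-up weights (both are assumptions for illustration only):

```python
import numpy as np

# Hypothetical vocabulary and one topic's word weights.
words = np.array(["apple", "bank", "court", "drum", "election"])
weights = np.array([0.1, 0.7, 0.0, 0.3, 0.9])

top3 = weights.argsort()[-3:]        # indices of the 3 largest weights, ascending
print(words[top3].tolist())          # ['drum', 'bank', 'election']
print(words[top3[::-1]].tolist())    # ['election', 'bank', 'drum']
```

Reversing the slice gives the words from most to least important, which can be easier to read when labeling topics by hand.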




In [22]:
# Topic weights for every article: shape (number of articles, number of topics)
results=nmf.transform(dtm)

In [23]:
results[0].round(2)

array([0.  , 0.12, 0.  , 0.06, 0.02, 0.  , 0.  ])
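Each row of this matrix holds one article's weight on each of the seven topics, and the largest weight marks the dominant topic. The sketch below reuses the rounded row shown above; the normalization into shares is only an assumed convenience for reading the numbers, since NMF weights are not probabilities.

```python
import numpy as np

# Rounded topic weights for the first article, copied from the output above.
row = np.array([0.0, 0.12, 0.0, 0.06, 0.02, 0.0, 0.0])

dominant = int(row.argmax())   # index of the largest weight
shares = row / row.sum()       # relative weights, for readability only

print(dominant)                # 1
print(shares.round(2))
```

Here topic 1 dominates, but topics 3 and 4 also contribute, which a single `argmax` label hides.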

In [24]:
# Assign each article the topic with the largest weight
npr['Topic']=results.argmax(axis=1)

In [25]:
npr

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6
...,...,...
11987,The number of law enforcement officers shot an...,3
11988,"Trump is busy these days with victory tours,...",1
11989,It’s always interesting for the Goats and Soda...,0
11990,The election of Donald Trump was a surprise to...,4


In [27]:
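# Hand-assigned labels based on each topic's top words (the label for topic 4 is an assumed choice)
top_dict={0:'health',1:'election',2:'legislation',3:'politics',4:'primaries',5:'music',6:'education'}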

In [None]:
npr["Topic Label"]=npr['Topic'].map(top_dict)
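When topic ids are turned into readable labels with a dictionary, it is worth checking that every id has an entry, because `Series.map` returns `NaN` for any value that is missing from the dictionary. A minimal sketch with a made-up series and a deliberately incomplete dictionary (the names `topics`, `labels` and `missing` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical topic ids and a label dictionary that omits topic 4 on purpose.
topics = pd.Series([1, 3, 4, 0])
labels = {0: "health", 1: "election", 3: "politics"}

mapped = topics.map(labels)    # ids without a dictionary entry become NaN
print(mapped.tolist())

# List the topic ids that have no label yet.
missing = sorted(set(topics.unique().tolist()) - set(labels))
print(missing)                 # [4]
```

Running a check like this before relying on the label column avoids silently unlabeled rows.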