# Non-Negative Matrix Factorization

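Non-Negative Matrix Factorization (NMF) is an **unsupervised** machine learning technique that performs dimensionality reduction and clustering at the same time. It pairs naturally with TF-IDF features, because TF-IDF values are never negative.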

![](NonNegMatrix.png)
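
To make the factorization concrete, here is a minimal sketch on a made-up 4 x 3 matrix. The values in `V`, the names `toy_nmf`, `W` and `H`, and the choice of `init='nndsvda'` are assumptions for illustration only; they are not part of the NPR workflow below.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative matrix: 4 "documents" x 3 "terms" (made-up values).
V = np.array([
    [1.0, 0.5, 0.0],
    [2.0, 1.0, 0.0],
    [0.0, 0.2, 1.0],
    [0.0, 0.4, 2.0],
])

# Factor V into W (documents x topics) and H (topics x terms).
toy_nmf = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=500)
W = toy_nmf.fit_transform(V)
H = toy_nmf.components_

print(W.shape, H.shape)                  # (4, 2) (2, 3)
print((W >= 0).all(), (H >= 0).all())    # both factors stay non-negative
print(np.round(W @ H, 2))                # approximate reconstruction of V
```

Each row of `W` says how strongly a document loads on each topic, and each row of `H` says how strongly each term contributes to a topic. The same two pieces show up later as `nmf_model.transform(dtm)` and `nmf_model.components_`.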

In [1]:
import pandas as pd

In [2]:
# NPR dataset
npr = pd.read_csv('npr.csv')
npr.head() 

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


## Preprocess 


max_df: float in range [0.0, 1.0] or int, default=1.0
- When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.


min_df: float in range [0.0, 1.0] or int, default=1
- When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
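
A small sketch of how the two thresholds prune the vocabulary. The four documents and the names `docs` and `vec` are invented for this example; the thresholds match the ones used on the NPR data below.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: "apple" is in 4/4 documents, "banana" and "cherry"
# in 2/4, and "date" in 1/4.
docs = [
    "apple banana",
    "apple cherry",
    "apple banana date",
    "apple cherry",
]

# max_df=0.95 drops terms found in more than 95% of documents ("apple").
# min_df=2 drops terms found in fewer than 2 documents ("date").
vec = TfidfVectorizer(max_df=0.95, min_df=2)
vec.fit(docs)

print(sorted(vec.vocabulary_))   # ['banana', 'cherry']
```

Note that a float is read as a proportion of documents and an integer as an absolute count, so `min_df=2` and `min_df=2.0` do not mean the same thing (the latter is out of range).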

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english') 

# document term matrix
dtm = tfidf.fit_transform(npr['Article']) 
dtm
# articles x num of words 

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

## NMF (non-negative matrix factorization)

In [5]:
from sklearn.decomposition import NMF 

# n_components is number of topics
nmf_model = NMF(n_components=7,random_state=42) 

# This can take a while: we're dealing with a large number of documents!
nmf_model.fit(dtm) 

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=7, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

In [12]:
# look up a single vocabulary word by its column index, then count the vocabulary
print(tfidf.get_feature_names()[3400] ) 

print(len(tfidf.get_feature_names()) )

aquarius
54777


In [8]:
# get random words from feature names

import random

for i in range(10):
    random_word_id = random.randint(0,54776)
    print(tfidf.get_feature_names()[random_word_id]) 

revamp
fallow
hawaii
teased
lethal
navalny
retainer
pittsburg
2001
mcafee


In [13]:
for i in range(10):
    random_word_id = random.randint(0,54776)
    print(tfidf.get_feature_names()[random_word_id]) 

predictable
criticized
bag
wrestled
inhuman
pulverized
ravens
minor
detestable
naltrexone


In [14]:
len(nmf_model.components_) 

7

In [19]:
single_topic = nmf_model.components_[0]

# Returns the indices that would sort this array.
single_topic.argsort() 

# Word least representative of this topic
single_topic[18302] 

# Word most representative of this topic
single_topic[42993] 

# Top 10 words for this topic:
single_topic.argsort()[-10:] 

array([14441, 36310, 53989, 52615, 47218, 53152, 19307, 36283, 54692,
       42993])
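
Since `argsort` returns indices rather than values, the slicing above can be easy to misread. A toy sketch, with a made-up `weights` array standing in for `single_topic`:

```python
import numpy as np

# Made-up topic weights for a five-word vocabulary.
weights = np.array([0.10, 0.90, 0.00, 0.40, 0.70])

# argsort returns the indices that would sort the array in ascending order.
order = weights.argsort()
print(order)                   # [2 0 3 4 1]

# The last entries are the indices of the largest weights.
top3 = order[-3:]
print(top3)                    # [3 4 1]

# Reverse the slice to list the strongest word first.
top3_desc = order[-3:][::-1]
print(top3_desc)               # [1 4 3]
```

So `order[0]` is the index of the least representative word and `order[-1]` is the index of the most representative one.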

In [20]:
top_word_indices = single_topic.argsort()[-10:] 

for index in top_word_indices:
    print(tfidf.get_feature_names()[index]) 

disease
percent
women
virus
study
water
food
people
zika
says
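
The lookup can be wrapped in a small helper. This is only a sketch: `top_words`, `toy_vocab` and `toy_topic` are hypothetical names with made-up values. On scikit-learn 1.0 or newer, the array returned by `tfidf.get_feature_names_out()` could be passed as `feature_names`; `get_feature_names()`, used in this notebook, was removed in scikit-learn 1.2.

```python
import numpy as np

def top_words(topic_weights, feature_names, n=10):
    """Return the n highest-weighted words for one topic, strongest first."""
    top_idx = np.argsort(topic_weights)[-n:][::-1]
    return [str(feature_names[i]) for i in top_idx]

# Toy stand-ins for nmf_model.components_[0] and the vectorizer vocabulary.
toy_vocab = np.array(['disease', 'food', 'says', 'virus', 'zika'])
toy_topic = np.array([0.2, 0.5, 0.9, 0.3, 0.8])

print(top_words(toy_topic, toy_vocab, n=3))   # ['says', 'zika', 'food']
```

Fetching the feature names once and reusing them also avoids rebuilding the vocabulary list on every iteration of the loop.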


In [21]:
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print('\n') 

THE TOP 15 WORDS FOR TOPIC #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


THE TOP 15 WORDS FOR TOPIC #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 15 WORDS FOR TOPIC #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


THE TOP 15 WORDS FOR TOPIC #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


THE TOP 15 WORDS FOR TOPIC #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


THE TOP 15 WORDS FOR TOPIC #5
['love', 've', 'don', 'al

## Attaching discovered labels to articles

In [22]:
topic_results = nmf_model.transform(dtm) 

In [23]:
topic_results.shape

(11992, 7)

In [24]:
topic_results[0] 

topic_results[0].round(2) # rounding values

array([0.        , 0.12075603, 0.00140297, 0.05919954, 0.01518909,
       0.        , 0.        ])

In [25]:
topic_results[0].argmax() # first article belongs to topic 1

1

In [26]:
topic_results.argmax(axis=1) 

array([1, 1, 1, ..., 0, 4, 3])
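
A toy sketch of what `argmax(axis=1)` does on a document-by-topic matrix. The values in `toy_results` are invented; `dominant` and `strength` are illustrative names.

```python
import numpy as np

# Made-up document-topic weights: 3 documents x 3 topics.
toy_results = np.array([
    [0.00, 0.12, 0.06],
    [0.30, 0.01, 0.02],
    [0.05, 0.05, 0.40],
])

# axis=1 takes the argmax across each row, giving one topic per document.
dominant = toy_results.argmax(axis=1)
print(dominant)                # [1 0 2]

# The matching row maximum shows how strong each assignment is.
strength = toy_results.max(axis=1)
print(strength)                # [0.12 0.3  0.4 ]
```

Keeping the row maximum alongside the label can help flag documents whose strongest topic weight is still small.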

In [28]:
npr['Topic'] = topic_results.argmax(axis=1) 

npr.head(10) 

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6
5,I did not want to join yoga class. I hated tho...,5
6,With a who has publicly supported the debunk...,0
7,"I was standing by the airport exit, debating w...",0
8,"If movies were trying to be more realistic, pe...",0
9,"Eighteen years ago, on New Year’s Eve, David F...",5


In [42]:
print('Topic:',npr.Topic[1])

Topic: 1


## Make a dictionary for labels by topic numbers

In [44]:
# the topic numbers are arbitrary, so map them to readable labels with a dictionary
mytopic_dict = {0:'Random',
                1:"Politics", 
                2:'Random', 
                3:'Legislation',
                4:'CurrentEvents',
                5:'Health',
                6:'Education'}

npr['Topic_Label'] = npr['Topic'].map(mytopic_dict) 
npr.head() 


Unnamed: 0,Article,Topic,Topic_Label
0,"In the Washington of 2016, even when the polic...",1,Politics
1,Donald Trump has used Twitter — his prefe...,1,Politics
2,Donald Trump is unabashedly praising Russian...,1,Politics
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,Legislation
4,"From photography, illustration and video, to d...",6,Education
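
One thing to watch with `Series.map` and a dictionary: any topic number missing from the dictionary becomes `NaN` rather than raising an error. A small sketch, with a made-up frame `toy` and a deliberately incomplete `toy_labels`:

```python
import pandas as pd

# Made-up topic assignments; topic 9 has no entry in the label dictionary.
toy = pd.DataFrame({'Topic': [1, 3, 6, 9]})
toy_labels = {1: 'Politics', 3: 'Legislation', 6: 'Education'}

# map looks up each value in the dictionary; missing keys become NaN.
toy['Topic_Label'] = toy['Topic'].map(toy_labels)
print(toy['Topic_Label'].isna().sum())   # 1

# An explicit fallback keeps unlabeled topics visible.
filled = toy['Topic'].map(toy_labels).fillna('Unlabeled')
print(filled.tolist())   # ['Politics', 'Legislation', 'Education', 'Unlabeled']
```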


In [45]:
print(npr.Article[5]) 

I did not want to join yoga class. I hated those   beatific instructors. I worried that the people in the class could fold up like origami and I’d fold up like a bread stick. I understood the need for stretchy clothes but not for total anatomical disclosure. But my hip joints hurt and so did my shoulders, and my upper back hurt even more than my lower back and my brain would. not. shut. up. I asked my doctor about medication and he said he didn’t like the side effects and was pretty sure I wouldn’t, either. So I signed up for Gentle Mind and Body Yoga, the   of yoga classes. I think the principle is that you get into some pose that has cosmic implications and then hold the pose until you are enlightened or bored silly. I like the bridge pose, where you lie flat on your back and put a rubber block under your butt. I purely hate the eagle pose, where you wind your arms around each other and then wrap your legs around each other and stand on one foot I drop like a sprayed mosquito. The te