# Non-Negative Matrix Factorization

## Data

We will use articles scraped from the NPR (National Public Radio) website, [www.npr.org](http://www.npr.org).

In [1]:
import pandas as pd
import time
import random

In [2]:
npr = pd.read_csv('npr.csv')

In [3]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


## Preprocessing

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

If `max_df` is 0.9, any term that appears in more than 90% of the documents is discarded.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

If `min_df` is 0.2, only terms that appear in at least 20% of the documents are kept.
If `min_df` is 3, a term must appear in at least 3 documents to be kept.
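The effect of both thresholds is easiest to see on a tiny corpus. The sketch below uses a made-up four-document corpus (the documents and variable names are assumptions for illustration, not part of the NPR data) and checks which terms survive.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical four-document corpus, used only to illustrate the thresholds
docs = [
    "apple banana",
    "apple cherry",
    "apple banana date",
    "apple cherry",
]

# 'apple' is in 4 of 4 documents (100% > 75%), so max_df=0.75 drops it.
# 'date' is in only 1 document (fewer than 2), so min_df=2 drops it.
vectorizer = TfidfVectorizer(max_df=0.75, min_df=2)
vectorizer.fit(docs)
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)

# The comparison is strict: 'banana' is in exactly 50% of the documents,
# which is not *higher* than max_df=0.5, so it is kept.
strict = TfidfVectorizer(max_df=0.5).fit(docs)
kept_strict = sorted(strict.vocabulary_)
print(kept_strict)
```

Terms that sit exactly on a threshold are kept in both cases, which matches the "strictly higher" and "strictly lower" wording in the parameter descriptions.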

In [6]:
# Instantiate the TF-IDF vectorizer
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [7]:
# Create Sparse Matrix
dtm = tfidf.fit_transform(npr['Article'])

In [8]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
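The repr above reports the shape and the number of stored elements, which is enough to see how sparse the document-term matrix is. A small sketch, using the numbers printed above plus a hypothetical toy matrix to show where those numbers come from:

```python
from scipy.sparse import csr_matrix

# Shape and stored-element count copied from the dtm repr above
n_docs, n_terms, n_stored = 11992, 54777, 3033388
density = n_stored / (n_docs * n_terms)
print(f"{density:.4%} of the entries are stored")

# The same quantities on a tiny hypothetical matrix
toy = csr_matrix([[0.0, 0.5, 0.0],
                  [0.0, 0.0, 0.0],
                  [0.3, 0.0, 0.7]])
toy_density = toy.nnz / (toy.shape[0] * toy.shape[1])
print(toy.nnz, toy_density)
```

Well under 1% of the entries are stored, which is why the vectorizer returns a sparse matrix rather than a dense array.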

## NMF

In [9]:
# Import NMF from scikit-learn
from sklearn.decomposition import NMF

In [10]:
# Instantiate the NMF model with 7 topics
nmf_model = NMF(n_components=7, init='nndsvd', verbose=1, random_state=42)

In [11]:
# Fit NMF to extract topics from the documents
tic = time.time()
nmf_model.fit(dtm)
toc = time.time()

print('time taken to get the topics using NMF', toc-tic)

violation: 1.0
violation: 0.553642076184776
violation: 0.31958662645684316
violation: 0.20949907625508032
violation: 0.1460629930083162
violation: 0.10451811641075846
violation: 0.078691359801454
violation: 0.06255213269132107
violation: 0.051608583356583394
violation: 0.043435686870770145
violation: 0.03703146826579178
violation: 0.032103729716281566
violation: 0.02861219708010926
violation: 0.026177341604661994
violation: 0.02448243736089255
violation: 0.02335826898405569
violation: 0.02269526006017361
violation: 0.0224089111227523
violation: 0.02245512143907345
violation: 0.022774084300130514
violation: 0.023352143778731297
violation: 0.024159933635366802
violation: 0.025161466587481446
violation: 0.026340251819171132
violation: 0.027681052987260243
violation: 0.02914113389425621
violation: 0.030708228937549042
violation: 0.032323716949936285
violation: 0.033948897655824824
violation: 0.03545968508980596
violation: 0.03673317723712804
violation: 0.03757099695674111
violation: 0.0380
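NMF approximates a non-negative matrix V (documents by terms) as a product W @ H, where H holds one row of term weights per topic (`components_`) and W holds one row of topic weights per document. The following is a minimal sketch on a small random matrix (the data and sizes are assumptions chosen for illustration) that checks the shapes and the non-negativity of both factors.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative data: 6 "documents" by 5 "terms"
rng = np.random.default_rng(0)
V = rng.random((6, 5))

model = NMF(n_components=2, init='nndsvd', random_state=42, max_iter=500)
W = model.fit_transform(V)   # document-by-topic weights, shape (6, 2)
H = model.components_        # topic-by-term weights, shape (2, 5)

print(W.shape, H.shape)
non_negative = bool((W >= 0).all() and (H >= 0).all())
print(non_negative)

# Frobenius norm of the residual: how far W @ H is from V
residual = np.linalg.norm(V - W @ H)
print(residual)
```

In the notebook, V is `dtm` with shape (11992, 54777), so H has shape (7, 54777) and W has shape (11992, 7).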

### Grab the vocabulary

In [12]:
len(tfidf.get_feature_names())

54777

In [18]:
tfidf.get_feature_names()[1587]

'acquisition'

In [17]:
for i in range(20):
    random_word_id = random.randint(0,54776)
    print(tfidf.get_feature_names()[random_word_id])

transnational
disembodied
muchita
decked
reinvigorate
wwii
stymied
machado
copeny
configurations
pulsing
funniest
406
puns
psalm
destruction
interlocutor
fuzzy
tensing
shames
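Two practical notes on the cells above. In scikit-learn 1.0 and later the vocabulary is returned by `get_feature_names_out()`, and `get_feature_names()` was removed in 1.2, so the calls above only run on older releases. Each call also rebuilds the whole list, so it is cheaper to fetch it once and index into it. A sketch that works on either version, using a hypothetical three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus, used only to have a fitted vectorizer to query
docs = ["zika virus study", "virus study results", "election results"]
vec = TfidfVectorizer().fit(docs)

# Fetch the vocabulary once, using whichever method this version provides
if hasattr(vec, "get_feature_names_out"):
    feature_names = list(vec.get_feature_names_out())
else:
    feature_names = vec.get_feature_names()

print(len(feature_names))
print(feature_names[0])
```

With the notebook's objects, the same pattern would be `feature_names = tfidf.get_feature_names_out()` followed by `feature_names[1587]`.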


## Grab the Topics

In [20]:
# The fitted model stores one row of term weights per topic in components_

len(nmf_model.components_)

7

In [21]:
nmf_model.components_

array([[0.00000000e+00, 2.49950821e-01, 0.00000000e+00, ...,
        1.70313822e-03, 2.37544362e-04, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 8.22048918e-02, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [0.00000000e+00, 3.12379960e-02, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [5.89723338e-03, 0.00000000e+00, 1.50186440e-03, ...,
        7.06428924e-04, 5.85500542e-04, 6.89536542e-04],
       [4.01763234e-03, 5.31643833e-02, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

In [22]:
len(nmf_model.components_[0])

54777

In [23]:
single_topic = nmf_model.components_[0]

In [24]:
# Returns the indices that would sort this array.
single_topic.argsort()

array([    0, 27208, 27206, ..., 36283, 54692, 42993], dtype=int64)
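`argsort()` returns indices ordered from the smallest value to the largest, so the last entries point at the highest-weighted terms. A small sketch with made-up weights:

```python
import numpy as np

# Hypothetical term weights for one topic
weights = np.array([0.0, 2.0, 0.5, 1.2])

order = weights.argsort()     # indices from smallest to largest weight
print(order)                  # [0 2 3 1]

top_two = order[-2:][::-1]    # the two largest, strongest first
print(top_two)                # [1 3]
```

When many weights are tied, as with the zeros in `components_`, the order among the tied indices carries no meaning, so the first index returned is only one of several equally weak terms.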

In [25]:
# Word least representative of this topic
single_topic[0]

0.0

In [26]:
# Word most representative of this topic
single_topic[42993]

2.0050551654185758

In [27]:
# Top 10 words for this topic:
single_topic.argsort()[-10:]

array([14441, 36310, 53989, 52615, 47218, 53152, 19307, 36283, 54692,
       42993], dtype=int64)

In [28]:
top_word_indices = single_topic.argsort()[-10:]

In [30]:
# Let's see those words:
top_words = []
for index in top_word_indices:
    top_word = tfidf.get_feature_names()[index]
    top_words.append(top_word)
print(top_words)

['disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


In [31]:
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 30 WORDS FOR TOPIC #{index}')
    print([tfidf.get_feature_names()[i] for i in topic.argsort()[-30:]])
    print('\n')

THE TOP 30 WORDS FOR TOPIC #0
['risk', 'medical', 'cases', 'company', 'year', 'just', 'world', 'drug', 'children', '000', 'years', 'brain', 'university', 'researchers', 'scientists', 'new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


THE TOP 30 WORDS FOR TOPIC #1
['party', 'speech', 'tax', 'foreign', 'washington', 'senate', 'cruz', 'business', 'news', 'committee', 'intelligence', 'office', 'nominee', 'republicans', 'comey', 'gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 30 WORDS FOR TOPIC #2
['budget', 'states', 'medicare', 'costs', 'medical', 'services', 'percent', 'patients', 'premiums', 'plans', 'insurers', 'federal', 'said', 'aca', 'repeal', 'senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'i
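The loop above can be wrapped in a small helper that looks up the vocabulary once and lists the strongest terms first. This is one possible sketch; `top_terms` and the toy inputs are hypothetical names introduced here, not part of scikit-learn.

```python
import numpy as np

def top_terms(components, feature_names, n_top=3):
    """Return, for each topic row, the n_top highest-weighted terms, strongest first."""
    feature_names = np.asarray(feature_names)
    result = []
    for topic in components:
        idx = np.argsort(topic)[::-1][:n_top]   # descending by weight
        result.append(feature_names[idx].tolist())
    return result

# Hypothetical 2-topic, 4-term example
components = np.array([[0.1, 0.9, 0.0, 0.5],
                       [0.7, 0.0, 0.8, 0.2]])
names = ['food', 'zika', 'trump', 'virus']

topics = top_terms(components, names, n_top=2)
print(topics)
```

With the notebook's objects the call would look like `top_terms(nmf_model.components_, feature_names, n_top=30)`, where `feature_names` is the vocabulary list fetched once from `tfidf`.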

### Attaching Discovered Topic Labels to Original Articles

In [32]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [33]:
dtm.shape

(11992, 54777)

In [34]:
len(npr)

11992

In [35]:
# Weight of each topic in each document (non-negative NMF coefficients, not probabilities)

topic_results = nmf_model.transform(dtm)

violation: 1.0
violation: 0.3782356590667793
violation: 0.048282636027319595
violation: 0.006546528403533499
violation: 0.0018206224678587921
violation: 0.0003989653844738552
violation: 6.938078415141977e-05
Converged at iteration 7


In [36]:
# Check the shape: one row per document, one column per topic

topic_results.shape

(11992, 7)

In [37]:
# Each document gets a non-negative weight for each topic
topic_results[0]

array([0.        , 0.12075603, 0.00140297, 0.05919954, 0.01518909,
       0.        , 0.        ])

In [38]:
# Round to two digits
# Topic #1 has the largest weight (0.12); the weights are not probabilities and do not sum to 1
topic_results[0].round(2)

array([0.  , 0.12, 0.  , 0.06, 0.02, 0.  , 0.  ])
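NMF weights are unnormalized, so they should not be read as probabilities. If relative shares are wanted, one option (an assumption here, not something the notebook does) is to divide each row by its sum. A sketch using the weights printed above for the first article:

```python
import numpy as np

# Topic weights of the first article, copied from the output above
doc_weights = np.array([0.0, 0.12075603, 0.00140297, 0.05919954,
                        0.01518909, 0.0, 0.0])

total = doc_weights.sum()
# Guard against an all-zero row, which cannot be normalized
shares = doc_weights / total if total > 0 else doc_weights

print(shares.round(2))
print(shares.argmax())
```

After normalization the shares sum to 1 and topic #1 accounts for roughly 61% of this article's total weight. This is a relative share, not a calibrated probability.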

In [40]:
topic_results[0].argmax()

1

In [43]:
# Let's look at the top 30 words for topic #1, then at the article itself
topic_check = nmf_model.components_[1]
top_word_indices = topic_check.argsort()[-30:]

top_words = []
for index in top_word_indices:
    top_word = tfidf.get_feature_names()[index]
    top_words.append(top_word)
print(top_words)

['party', 'speech', 'tax', 'foreign', 'washington', 'senate', 'cruz', 'business', 'news', 'committee', 'intelligence', 'office', 'nominee', 'republicans', 'comey', 'gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


In [44]:
npr['Article'][0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

### Combining with Original Data

In [45]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [46]:
topic_results.argmax(axis=1)

array([1, 1, 1, ..., 0, 4, 3], dtype=int64)
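`argmax(axis=1)` picks, for every row, the column with the largest weight. One edge case is worth knowing: a row whose weights are all zero still gets label 0. A sketch with hypothetical weights:

```python
import numpy as np

# Hypothetical document-by-topic weights; the last row is all zeros
topic_weights = np.array([[0.0, 0.12, 0.06],
                          [0.30, 0.0, 0.10],
                          [0.0, 0.0, 0.0]])

labels = topic_weights.argmax(axis=1)        # dominant topic per document
has_signal = topic_weights.max(axis=1) > 0   # False where no topic has weight

print(labels)
print(has_signal)
```

Checking `has_signal` before trusting a label avoids assigning topic 0 to documents that match no topic at all.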

In [47]:
npr['Topic'] = topic_results.argmax(axis=1)

In [48]:
npr.head(10)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6
5,I did not want to join yoga class. I hated tho...,5
6,With a who has publicly supported the debunk...,0
7,"I was standing by the airport exit, debating w...",0
8,"If movies were trying to be more realistic, pe...",0
9,"Eighteen years ago, on New Year’s Eve, David F...",5


In [49]:
set(npr['Topic'])

{0, 1, 2, 3, 4, 5, 6}

In [50]:
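# Let's assign a name to each topic number.
# Check each label against the topic's top words: topic #0 above, for example, is dominated by health terms.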
topic_dict ={0:'Govt_Militry', 
             1:'White_house' , 
             2:'Editorial' , 
             3:'Education' , 
             4:'Politics' , 
             5:'World_news' , 
             6:'Business'   
}

In [51]:
npr['Topic_name'] = npr['Topic'].map(topic_dict)
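`Series.map` with a dictionary replaces each value by its dictionary entry, and any value missing from the dictionary becomes `NaN`. A quick sketch on a hypothetical frame (the labels are placeholders) that checks for unmapped topics and counts articles per label:

```python
import pandas as pd

# Hypothetical topic ids; 9 has no entry in the dictionary on purpose
df = pd.DataFrame({'Topic': [1, 1, 3, 0, 9]})
names = {0: 'topic_a', 1: 'topic_b', 3: 'topic_c'}

df['Topic_name'] = df['Topic'].map(names)

missing = int(df['Topic_name'].isna().sum())   # ids without a label
counts = df['Topic_name'].value_counts()       # articles per label

print(missing)
print(counts)
```

Running `npr['Topic_name'].value_counts()` after the mapping above would show how the articles are spread across the seven labels.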

In [52]:
npr.head(20)

Unnamed: 0,Article,Topic,Topic_name
0,"In the Washington of 2016, even when the polic...",1,White_house
1,Donald Trump has used Twitter — his prefe...,1,White_house
2,Donald Trump is unabashedly praising Russian...,1,White_house
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3,Education
4,"From photography, illustration and video, to d...",6,Business
5,I did not want to join yoga class. I hated tho...,5,World_news
6,With a who has publicly supported the debunk...,0,Govt_Militry
7,"I was standing by the airport exit, debating w...",0,Govt_Militry
8,"If movies were trying to be more realistic, pe...",0,Govt_Militry
9,"Eighteen years ago, on New Year’s Eve, David F...",5,World_news


In [53]:
npr.tail(20)

Unnamed: 0,Article,Topic,Topic_name
11972,In a disappointment to Alzheimer’s patients an...,0,Govt_Militry
11973,"In my early 20s, smitten by the mythic underpi...",5,World_news
11974,It’s been a lively year for social media maven...,3,Education
11975,This is not a review. It started out as one: I...,5,World_news
11976,"On a summer’s day in December, a warehouse in ...",5,World_news
11977,"Elections aren’t exactly cozy, even in the bes...",4,Politics
11978,"Although her oldest child, Ben, is 10 years ol...",0,Govt_Militry
11979,"When a political scandal explodes in France, t...",0,Govt_Militry
11980,The darkest moment for American police this ye...,3,Education
11981,Russia was ordered to vacate two compounds it ...,3,Education
