# Non-Negative Matrix Factorization

Let's repeat the topic modeling task from the previous lecture, this time using NMF (non-negative matrix factorization) instead of LDA.

## Data

We will be using articles scraped from the NPR (National Public Radio) website, [www.npr.org](http://www.npr.org).

In [1]:
import pandas as pd

npr = pd.read_csv('npr.csv')
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


Notice that the articles come without topic labels! Let's use NMF to try to discover clusters of related articles.

## Preprocessing

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

**`max_df`**` : float in range [0.0, 1.0] or int, default=1.0`<br>
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

**`min_df`**` : float in range [0.0, 1.0] or int, default=1`<br>
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

In [3]:
tfidf = TfidfVectorizer(max_df=.95, min_df=2, stop_words='english')

In [4]:
dtm = tfidf.fit_transform(npr['Article'])

In [5]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

## NMF

In [6]:
from sklearn.decomposition import NMF

In [7]:
nmf_model = NMF(n_components=7, random_state=42)

In [8]:
nmf_model.fit(dtm)

## Displaying Topics

In [10]:
import random

len(tfidf.get_feature_names_out())

54777

In [12]:
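feature_names = tfidf.get_feature_names_out()
for i in range(10):
    # random.randint is inclusive on both ends, so stop at the last valid index
    random_idx = random.randint(0, len(feature_names) - 1)
    print(feature_names[random_idx])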

walsh
tensions
trucks
forked
novelistic
transitioned
roswell
attache
abject
benne


In [13]:
nmf_model.components_.shape

(7, 54777)
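The shape above has one row per topic and one column per vocabulary term. As a rough sketch of where that comes from, NMF approximates a non-negative document-term matrix as the product of a document-topic matrix and a topic-term matrix. The example below uses a small random matrix as a stand-in for `dtm`; the names `X`, `W`, and `H` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Stand-in for a document-term matrix: 6 "documents", 10 "terms" (random, non-negative)
rng = np.random.default_rng(0)
X = rng.random((6, 10))

toy_model = NMF(n_components=3, random_state=42, max_iter=500)
W = toy_model.fit_transform(X)   # document-topic weights
H = toy_model.components_        # topic-term weights

print(W.shape)  # (6, 3)
print(H.shape)  # (3, 10)

# Both factors are non-negative, and W @ H has the same shape as X
approx = W @ H
print(approx.shape)  # (6, 10)
```

By the same logic, 7 components fitted on a matrix with 54777 terms give a `components_` array of shape `(7, 54777)`.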

In [15]:
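feature_names = tfidf.get_feature_names_out()
for i, topic in enumerate(nmf_model.components_):
    print(f"THE TOP 20 WORDS FOR TOPIC #{i}")
    # argsort is ascending, so the last 20 indices belong to the highest-weighted words
    print([feature_names[index] for index in topic.argsort()[-20:]])
    print("\n")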

THE TOP 20 WORDS FOR TOPIC #0
['years', 'brain', 'researchers', 'university', 'scientists', 'new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


THE TOP 20 WORDS FOR TOPIC #1
['intelligence', 'office', 'nominee', 'republicans', 'comey', 'gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


THE TOP 20 WORDS FOR TOPIC #2
['insurers', 'federal', 'said', 'aca', 'repeal', 'senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


THE TOP 20 WORDS FOR TOPIC #3
['killed', 'reported', 'military', 'justice', 'city', 'officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


THE TOP 20 WORDS FOR TOPIC #
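
The word lists above are ordered from lowest to highest weight, because `argsort()` sorts in ascending order and the slice `[-20:]` keeps the last 20 indices. A small sketch with made-up weights and a made-up five-word vocabulary shows the pattern, along with how to reverse it if the strongest word should come first:

```python
import numpy as np

# Hypothetical topic weights and vocabulary, for illustration only
toy_weights = np.array([0.1, 0.7, 0.05, 0.4, 0.9])
toy_vocab = np.array(["apple", "bank", "cat", "dog", "election"])

top3_ascending = toy_weights.argsort()[-3:]   # indices of the 3 largest weights, smallest first
print(toy_vocab[top3_ascending].tolist())     # ['dog', 'bank', 'election']

top3_descending = top3_ascending[::-1]        # reverse so the strongest word comes first
print(toy_vocab[top3_descending].tolist())    # ['election', 'bank', 'dog']
```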

## Attaching Discovered Topic Labels to Original Articles

In [16]:
topic_result = nmf_model.transform(dtm)

In [17]:
topic_result.argmax(axis=1)

array([1, 1, 1, ..., 0, 4, 3])
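
Each row of the transformed matrix holds one article's weight for every topic, so `argmax(axis=1)` picks the index of the strongest topic per article. A minimal sketch with a made-up 3x3 weight matrix (the values are hypothetical, not taken from `topic_result`):

```python
import numpy as np

# Rows are documents, columns are topics (illustrative values only)
toy_topic_weights = np.array([
    [0.02, 0.30, 0.01],
    [0.25, 0.00, 0.05],
    [0.00, 0.10, 0.40],
])

# Index of the largest weight in each row, i.e. the dominant topic per document
dominant_topic = toy_topic_weights.argmax(axis=1)
print(dominant_topic)  # [1 0 2]
```

Unlike LDA's output, these NMF weights are not probabilities and need not sum to 1 per row; only their relative size within a row matters for the `argmax`.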

In [18]:
npr['Topic'] = topic_result.argmax(axis=1)

In [19]:
npr.head(10)

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",6
5,I did not want to join yoga class. I hated tho...,5
6,With a who has publicly supported the debunk...,0
7,"I was standing by the airport exit, debating w...",0
8,"If movies were trying to be more realistic, pe...",0
9,"Eighteen years ago, on New Year’s Eve, David F...",5
