# Unsupervised Learning on a Text Data Set: Topic Modelling (NMF)

### NMF stands for non-negative matrix factorization. It is applied here to the TF-IDF document-term matrix produced by `TfidfVectorizer`.
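To make the factorization concrete, here is a minimal sketch on a made-up 4 x 5 non-negative matrix (the values and the choice of `init="nndsvda"` are assumptions for illustration, not taken from the NPR data). NMF approximates the document-term matrix `V` as a product `W @ H`, where `W` holds document-topic weights and `H` holds topic-term weights.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative "document-term" matrix: 4 documents x 5 terms (made-up values).
V = np.array([
    [1.0, 0.9, 0.0, 0.0, 0.1],
    [0.8, 1.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.9, 0.8],
    [0.1, 0.0, 0.9, 1.0, 0.7],
])

# Factorize V into two topics; both factors are constrained to be non-negative.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(V)   # document-topic weights, shape (4, 2)
H = model.components_        # topic-term weights, shape (2, 5)

print(W.shape, H.shape)
print(np.round(W @ H, 2))    # approximate reconstruction of V
```

In the notebook, `nmf_model.components_` plays the role of `H` and `nmf_model.transform(dtm)` returns the matrix corresponding to `W`.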

### I use `NMF` from `sklearn.decomposition`; LDA (`LatentDirichletAllocation`) can also be used for topic modelling.
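For comparison, a hedged sketch of the LDA alternative on a tiny made-up corpus (the `docs` list and the parameter values are assumptions for illustration). LDA is usually fit on raw term counts, so this sketch uses `CountVectorizer` rather than `TfidfVectorizer`.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, only to show the API; it is not the NPR data.
docs = [
    "the senate passed the health insurance bill",
    "voters in the primary election chose a candidate",
    "the new album has a great song",
    "insurance coverage and health care costs",
    "the campaign courted voters before the election",
    "the band played every song from the album",
]

# LDA models word counts, so build a count-based document-term matrix.
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=42)
doc_topics = lda.fit_transform(counts)  # each row is a topic distribution

print(doc_topics.shape)
print(doc_topics.argmax(axis=1))  # most probable topic per document
```

Unlike the NMF weights, each row of `doc_topics` sums to 1, so the values can be read as topic probabilities.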

## Data

The data set consists of articles scraped from the NPR (National Public Radio) website, [www.npr.org](http://www.npr.org).

In [1]:
import pandas as pd
npr=pd.read_csv('npr.csv')

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
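# Ignore terms found in more than 95% of documents or in fewer than 2 documents; drop English stop words
tfidf=TfidfVectorizer(max_df=0.95,min_df=2,stop_words='english')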

In [4]:
dtm=tfidf.fit_transform(npr['Article'])

In [5]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [6]:
from sklearn.decomposition import NMF

In [7]:
nmf_model=NMF(n_components=7,random_state=42)

In [8]:
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=7, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

In [10]:
tfidf.get_feature_names_out()[230]  # use get_feature_names() on scikit-learn older than 1.0

'1842'

In [13]:
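feature_names = tfidf.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
for index, topic in enumerate(nmf_model.components_):
    print(f"Top 15 words #{index}")
    # argsort is ascending, so the last 15 indices are the highest-weighted terms
    print([feature_names[j] for j in topic.argsort()[-15:]])
    print('\n')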

Top 15 words #0
['new', 'research', 'like', 'patients', 'health', 'disease', 'percent', 'women', 'virus', 'study', 'water', 'food', 'people', 'zika', 'says']


Top 15 words #1
['gop', 'pence', 'presidential', 'russia', 'administration', 'election', 'republican', 'obama', 'white', 'house', 'donald', 'campaign', 'said', 'president', 'trump']


Top 15 words #2
['senate', 'house', 'people', 'act', 'law', 'tax', 'plan', 'republicans', 'affordable', 'obamacare', 'coverage', 'medicaid', 'insurance', 'care', 'health']


Top 15 words #3
['officers', 'syria', 'security', 'department', 'law', 'isis', 'russia', 'government', 'state', 'attack', 'president', 'reports', 'court', 'said', 'police']


Top 15 words #4
['primary', 'cruz', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']


Top 15 words #5
['love', 've', 'don', 'album', 'way', 'time', 'song', 'life', 'really', 'know', 'people', 'think', 'just', 'm

In [14]:
# Document-topic weights: one row per article, one column per topic
topic_results = nmf_model.transform(dtm)

In [15]:
topic_results[0]

array([0.        , 0.12075603, 0.00140297, 0.05919954, 0.01518909,
       0.        , 0.        ])

In [16]:
topic_results.argmax(axis=1)

array([1, 1, 1, ..., 0, 4, 3], dtype=int64)
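The row-wise `argmax` above can be illustrated on a small made-up weight matrix (the numbers are assumptions, not values from `topic_results`). Each row is one document, and the index of its largest weight becomes its topic.

```python
import numpy as np

# Made-up document-topic weights: 4 documents x 3 topics.
toy_weights = np.array([
    [0.00, 0.12, 0.01],
    [0.30, 0.05, 0.00],
    [0.02, 0.02, 0.40],
    [0.00, 0.00, 0.00],  # a document with no weight on any topic
])

dominant = toy_weights.argmax(axis=1)  # index of the largest weight in each row
strength = toy_weights.max(axis=1)     # the largest weight itself

print(dominant)  # [1 0 2 0]
print(strength)
```

Note that `argmax` returns the first index on ties, so an all-zero row is assigned topic 0. Checking the maximum weight alongside the index is one way to spot such weakly assigned documents.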

In [17]:
npr['topic']=topic_results.argmax(axis=1)

In [18]:
npr.head()

|   | Article | topic |
|---|---|---|
| 0 | In the Washington of 2016, even when the polic... | 1 |
| 1 | Donald Trump has used Twitter — his prefe... | 1 |
| 2 | Donald Trump is unabashedly praising Russian... | 1 |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... | 3 |
| 4 | From photography, illustration and video, to d... | 6 |


In [19]:
mytopic_dict={0:'health',1:'election',2:'legis',3:'politics',4:'election',5:'music',6:'education'}

In [21]:
npr['Topic_Label']=npr['topic'].map(mytopic_dict)

In [22]:
npr.head(10)

|   | Article | topic | Topic_Label |
|---|---|---|---|
| 0 | In the Washington of 2016, even when the polic... | 1 | election |
| 1 | Donald Trump has used Twitter — his prefe... | 1 | election |
| 2 | Donald Trump is unabashedly praising Russian... | 1 | election |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... | 3 | politics |
| 4 | From photography, illustration and video, to d... | 6 | education |
| 5 | I did not want to join yoga class. I hated tho... | 5 | music |
| 6 | With a who has publicly supported the debunk... | 0 | health |
| 7 | I was standing by the airport exit, debating w... | 0 | health |
| 8 | If movies were trying to be more realistic, pe... | 0 | health |
| 9 | Eighteen years ago, on New Year’s Eve, David F... | 5 | music |
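A possible follow-up check, sketched on a toy frame rather than the real `npr` data (the `toy` frame and `labels` dictionary are made up for illustration): map the topic indices to labels, confirm no index was left unmapped, and count how many documents fall under each label.

```python
import pandas as pd

# Toy stand-in for the 'topic' column; the values are made up.
toy = pd.DataFrame({"topic": [1, 1, 3, 6, 5, 0, 0]})
labels = {0: "health", 1: "election", 3: "politics", 5: "music", 6: "education"}

# Series.map leaves NaN for any topic index missing from the dictionary.
toy["Topic_Label"] = toy["topic"].map(labels)
unmapped = toy["Topic_Label"].isna().sum()

# Number of documents per label, largest first.
counts = toy["Topic_Label"].value_counts()

print(unmapped)
print(counts)
```

The same two checks can be run on `npr['Topic_Label']` to see whether one topic dominates the corpus.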
