# Purpose of this notebook:
## Understanding NMF (Non-negative Matrix Factorisation)

***Non-negative Matrix Factorisation (NMF) is a linear-algebraic model that factors a matrix of high-dimensional, non-negative vectors into a low-dimensional representation.***


***Like principal component analysis (PCA), NMF is a dimensionality-reduction technique; unlike PCA, it relies on the input vectors being non-negative.***


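***By factoring the vectors into a lower-dimensional form, NMF constrains the coefficients to be non-negative as well.
Given the original non-negative matrix A, we obtain two non-negative matrices W and H such that A ≈ WH. In this notebook A is the document-term matrix, W holds the topic weights of each document, and H holds the term weights of each topic.***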
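
To make the factorisation concrete, here is a minimal sketch on a toy matrix. It assumes the classic multiplicative-update rule for the Frobenius loss; the function name `nmf_multiplicative` and the toy matrix are illustrative assumptions, and this is not the solver scikit-learn uses by default (the `NMF` object later in this notebook reports `solver='cd'`).

```python
import numpy as np

def nmf_multiplicative(A, n_components, n_iter=500, eps=1e-10, seed=0):
    """Approximate a non-negative matrix A as W @ H (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    # Start from strictly positive random factors
    W = rng.random((n_rows, n_components)) + eps
    H = rng.random((n_components, n_cols)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio,
        # so W and H can never become negative
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy matrix with two obvious "topics": columns 0-1 and columns 2-3
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 1.0, 1.0],
])

W, H = nmf_multiplicative(A, n_components=2)
print(W.shape, H.shape)      # shapes of the two factors
print(np.round(W @ H, 2))    # reconstruction, to compare with A
```

Each row of `W` says how strongly a row of `A` uses each component, and each row of `H` says how strongly a component uses each column.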

In [0]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv("/content/npr.csv")
df.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [3]:
df.isnull().sum()

Article    0
dtype: int64

In [5]:
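blanks = []
for review in df.Article:
  # isalpha() only matches strings made purely of letters; a blank
  # article is one that is empty or contains only whitespace
  if isinstance(review, str) and not review.strip():
    blanks.append(review)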

blanks

[]

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
tfidf = TfidfVectorizer(stop_words='english',max_df=0.9,min_df=2)

In [8]:
dtm = tfidf.fit_transform(df['Article'])
dtm

<11992x54777 sparse matrix of type '<class 'numpy.float64'>'
	with 3033388 stored elements in Compressed Sparse Row format>
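
The `max_df=0.9` and `min_df=2` arguments passed to `TfidfVectorizer` prune the vocabulary by document frequency: a float `max_df` drops terms that occur in more than that share of documents, and an integer `min_df` drops terms that occur in fewer than that many documents. The sketch below illustrates the idea on a made-up corpus with plain whitespace splitting; it is a simplification and does not reproduce scikit-learn's tokeniser or stop-word handling.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the mat",
    "the fish swam",
]

# Document frequency: in how many documents each term appears at least once
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

n_docs = len(docs)
max_df = 0.9  # proportion of documents
min_df = 2    # absolute number of documents

# Keep terms that are neither too rare nor too common
kept = sorted(
    term for term, count in doc_freq.items()
    if min_df <= count <= max_df * n_docs
)
print(kept)
```

Here "the" is dropped for appearing in every document, and words that occur in a single document are dropped as too rare.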

In [30]:
tfidf.get_feature_names()[53211]

'way'

In [0]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=5,random_state=101)
nmf

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=5, random_state=101, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

In [14]:
nfm_fit  = nmf.fit(dtm)
nfm_fit

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=5, random_state=101, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

In [21]:
nfm_fit.components_.shape

(5, 54777)

In [25]:
nfm_fit.components_[0].argsort()[-15:]

array([47210, 53211, 28606, 33390, 43172, 15008, 32729, 49459, 27439,
       39848, 49183, 26752, 36283, 28659, 42993])
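
The `argsort()[-15:]` idiom above returns the positions of the 15 largest weights, ordered from smallest to largest. A small sketch with made-up weights and a made-up vocabulary (both are assumptions for illustration) shows the mechanics with the top 3 instead of the top 15:

```python
import numpy as np

# Hypothetical term weights for one topic and a matching toy vocabulary
weights = np.array([0.10, 0.70, 0.05, 0.40, 0.90])
vocab = ["apple", "bank", "car", "dog", "earth"]

# argsort sorts ascending, so the last 3 indices are the 3 largest weights
top3_ascending = weights.argsort()[-3:]

# Reverse the slice to list the strongest term first
top3_descending = top3_ascending[::-1]

print([vocab[i] for i in top3_ascending])
print([vocab[i] for i in top3_descending])
```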

In [31]:
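# Look up the vocabulary once instead of rebuilding it for every word.
# (get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn releases.)
feature_names = tfidf.get_feature_names()

for index, topic in enumerate(nfm_fit.components_):
  print(f"The topic is {index}")
  # argsort is ascending, so the last 15 indices are the highest-weighted terms
  print([feature_names[item] for item in topic.argsort()[-15:]])
  print("\n")
  print("\n")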

The topic is 0
['students', 'way', 'life', 'new', 'school', 'don', 'music', 'time', 'know', 'really', 'think', 'just', 'people', 'like', 'says']




The topic is 1
['comey', 'republicans', 'presidential', 'administration', 'russia', 'election', 'republican', 'obama', 'white', 'donald', 'house', 'campaign', 'said', 'president', 'trump']




The topic is 2
['medical', 'plan', 'affordable', 'zika', 'tax', 'obamacare', 'people', 'patients', 'percent', 'coverage', 'medicaid', 'says', 'insurance', 'care', 'health']




The topic is 3
['russia', 'city', 'says', 'security', 'isis', 'department', 'law', 'attack', 'president', 'government', 'state', 'reports', 'court', 'said', 'police']




The topic is 4
['republican', 'said', 'election', 'democrats', 'percent', 'party', 'delegates', 'vote', 'state', 'democratic', 'hillary', 'campaign', 'voters', 'sanders', 'clinton']






In [16]:
nfm_clustering = nfm_fit.transform(dtm)
nfm_clustering

array([[0.        , 0.12184922, 0.        , 0.05267914, 0.01464958],
       [0.        , 0.12273515, 0.        , 0.02986155, 0.        ],
       [0.        , 0.14119805, 0.        , 0.03553825, 0.02195947],
       ...,
       [0.02774447, 0.        , 0.02757206, 0.01984823, 0.00348949],
       [0.00656824, 0.04035902, 0.        , 0.        , 0.12750412],
       [0.02019833, 0.00533462, 0.00628332, 0.02649567, 0.01174886]])

In [22]:
nfm_clustering.shape

(11992, 5)

In [39]:
nfm_clustering[9].argmax()

0

In [40]:
df['Article_topic'] = nfm_clustering.argmax(axis=1)
df.head()

Unnamed: 0,Article,Article_topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",3
4,"From photography, illustration and video, to d...",3
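
Assigning each article to `argmax(axis=1)` keeps only the strongest topic and discards how mixed the article is. The sketch below uses a small made-up document-topic matrix (the values are illustrative, not taken from the fitted model) to show the assignment next to each row's dominant share, which can help flag articles that do not belong clearly to one topic.

```python
import numpy as np

# Hypothetical document-topic weights: 3 documents, 5 topics
doc_topic = np.array([
    [0.00, 0.12, 0.00, 0.05, 0.01],
    [0.03, 0.00, 0.03, 0.02, 0.00],
    [0.01, 0.04, 0.00, 0.00, 0.13],
])

# Index of the largest weight in each row (ties resolve to the first index)
dominant = doc_topic.argmax(axis=1)

# Normalise each row so the weights sum to 1, then read off the top share
shares = doc_topic / doc_topic.sum(axis=1, keepdims=True)
dominant_share = shares.max(axis=1)

for doc_id, (topic, share) in enumerate(zip(dominant, dominant_share)):
    print(f"document {doc_id}: topic {topic}, share {share:.2f}")
```

The second row is split evenly between two topics, so its dominant share is low even though `argmax` still returns a single label.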


##### ***After reviewing the top 15 words of each topic, I built a dictionary that maps each `Article_topic` value to the label I believe best describes that topic. These labels are subjective and may vary from person to person.***

In [41]:
my_dict = {0: "Education", 1: "Election Campaign", 2: "HealthCare", 3: "Security", 4: "Election Votes"}
my_dict

{0: 'Education',
 1: 'Election Campaign',
 2: 'HealthCare',
 3: 'Security',
 4: 'Election Votes'}

In [42]:
df['Article_topic'] = df['Article_topic'].map(my_dict)
df.head()

Unnamed: 0,Article,Article_topic
0,"In the Washington of 2016, even when the polic...",Election Campaign
1,Donald Trump has used Twitter — his prefe...,Election Campaign
2,Donald Trump is unabashedly praising Russian...,Election Campaign
3,"Updated at 2:50 p. m. ET, Russian President Vl...",Security
4,"From photography, illustration and video, to d...",Security
