# Machine Learning Foundation
## Non-Negative Matrix Factorization

### Data
We'll be using the BBC dataset: news articles collected from 5 different topics. The data has already been preprocessed.

### Data Setup

In [1]:
with open('data/bbc.mtx') as f:
    content=f.readlines()

In [2]:
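# Remove the first two lines of the file. The second one, echoed below,
# is the size line: rows, columns, number of non-zero entries.
content.pop(0)
content.pop(0)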

'9635 2225 286774\n'

In [6]:
len(content)

286774
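The size line popped above can double as a sanity check on the remaining lines. A minimal sketch, assuming the usual Matrix Market coordinate layout of rows, columns, and non-zero count; `size_line` is hard-coded here from the output above rather than captured from `content.pop(0)`:

```python
# Size line as echoed by the second pop above (hard-coded for illustration).
size_line = '9635 2225 286774\n'

# Matrix Market coordinate size line: rows, columns, non-zero entries.
n_rows, n_cols, n_nonzero = map(int, size_line.split())
print(n_rows, n_cols, n_nonzero)

# In the notebook, this should hold once both header lines are removed:
# assert len(content) == n_nonzero
```

The non-zero count matches the `len(content)` reported above, which suggests no entry lines were lost while reading.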

### Part 1

In [8]:
content[0].split()

['1', '1', '1.0']

In [12]:
tuple(map(int,map(float,content[0].split())))

(1, 1, 1)

In [13]:
sparsemat=[tuple(map(int,map(float,x.split()))) for x in content]

In [14]:
sparsemat[:8]

[(1, 1, 1),
 (1, 7, 2),
 (1, 11, 1),
 (1, 14, 1),
 (1, 15, 2),
 (1, 19, 2),
 (1, 21, 1),
 (1, 29, 1)]

### Part 2: Preparing Sparse Matrix data for NMF

In [15]:
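import numpy as np
from scipy.sparse import coo_matrix

# Unpack the (row, column, value) triplets into parallel lists
rows = [x[0] for x in sparsemat]
cols = [x[1] for x in sparsemat]
values = [x[2] for x in sparsemat]

# Build the sparse matrix in coordinate (COO) format
coo = coo_matrix((values, (rows, cols)))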

In [17]:
coo.shape

(9636, 2226)
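The shape is one larger in each dimension than the 9635 x 2225 given in the size line. The indices in the file start at 1, while `coo_matrix` treats them as zero-based and infers the shape from the largest index, which leaves an empty row 0 and column 0. A sketch of one possible adjustment, shown on a tiny made-up triplet list (`triplets`, `shape`, and `small` are illustrative names, not part of the dataset):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical 1-based (row, col, value) triplets for a 2 x 3 matrix.
triplets = [(1, 1, 1), (1, 3, 2), (2, 2, 5)]
shape = (2, 3)

rows = np.array([t[0] for t in triplets]) - 1  # shift to zero-based
cols = np.array([t[1] for t in triplets]) - 1
values = np.array([t[2] for t in triplets])

# Passing shape explicitly avoids relying on the largest index seen.
small = coo_matrix((values, (rows, cols)), shape=shape)
print(small.toarray())
```

Applying the same shift to the notebook's `rows` and `cols` with `shape=(9635, 2225)` would remove the padding. The outputs shown in this notebook were produced without it.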

### NMF
NMF decomposes a non-negative matrix of documents and words into the product of two smaller non-negative matrices. One of them can be interpreted as the "loadings" or "weights" of each word on a topic; the other gives the weight of each topic in each document.
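To make the decomposition concrete, here is a toy sketch on a made-up matrix. The matrix and the names `V`, `W`, `H`, and `toy` are illustrative only; the product `W @ H` should roughly reproduce `V`, and both factors stay non-negative.

```python
import numpy as np
from sklearn.decomposition import NMF

# Made-up 4 x 6 count matrix: two blocks of co-occurring columns.
V = np.array([
    [3, 2, 4, 0, 0, 0],
    [2, 3, 3, 0, 0, 0],
    [0, 0, 0, 4, 2, 3],
    [0, 0, 0, 3, 3, 2],
], dtype=float)

toy = NMF(n_components=2, init='random', random_state=0, max_iter=1000)
W = toy.fit_transform(V)   # shape (4, 2): row-by-topic weights
H = toy.components_        # shape (2, 6): topic-by-column weights

print(W.shape, H.shape)
print(np.round(W @ H, 1))  # approximate reconstruction of V
```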

### Part 3

In [19]:
from sklearn.decomposition import NMF

model=NMF(n_components=5, init='random', random_state=818)
doc_topic=model.fit_transform(coo)

In [20]:
doc_topic.shape

(9636, 5)

In [21]:
np.argmax(doc_topic, axis=1)

array([0, 0, 2, ..., 4, 4, 4], dtype=int64)

In [22]:
np.sum(doc_topic, axis=1)

array([0.        , 5.21421641, 3.62323002, ..., 0.02509809, 0.02379184,
       0.01625975])
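The row sums vary widely, and the first one is exactly zero, so the raw weights are not proportions and are hard to compare across rows. If proportions are wanted, one option is to divide each row by its sum while guarding against empty rows. A sketch on made-up numbers (`weights` and `proportions` are illustrative names):

```python
import numpy as np

# Made-up topic weights for three rows; the first row is all zeros,
# like the first entry of the sums shown above.
weights = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 3.0, 0.0],
    [0.5, 0.0, 0.5],
])

row_sums = weights.sum(axis=1, keepdims=True)
# Divide only where the sum is non-zero; all-zero rows stay at zero.
proportions = np.divide(weights, row_sums,
                        out=np.zeros_like(weights), where=row_sums > 0)
print(proportions)
```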

### Part 4
Check out the components of this model.

In [23]:
model.components_.shape

(5, 2226)

`components_` has five rows, one per "topic". Each row holds a weight for every column of the input matrix, which the code below treats as words. The exercise is to get a list of the top 10 words for each topic; the code below keeps the top 12. We can just store this in a list of lists.
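One thing worth checking at this point is the orientation of the input. scikit-learn's `NMF.fit_transform` expects an array of shape `(n_samples, n_features)`, and `components_` has shape `(n_components, n_features)`, so its columns are words only if the columns of the input are words. The size line `9635 2225` can be compared against the number of lines in `bbc.terms` and `bbc.docs` to see which axis is which. A sketch of transposing before fitting, on a made-up matrix and under the assumption that rows are terms and columns are documents:

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.decomposition import NMF

# Made-up term-by-document counts: 4 terms (rows), 3 documents (columns).
term_doc = coo_matrix(np.array([
    [2, 0, 1],
    [1, 0, 2],
    [0, 3, 0],
    [0, 2, 1],
], dtype=float))

doc_term = term_doc.T.tocsr()  # documents as rows, terms as columns

toy = NMF(n_components=2, init='random', random_state=0, max_iter=1000)
doc_weights = toy.fit_transform(doc_term)

print(doc_weights.shape)      # one row per document
print(toy.components_.shape)  # one column per term
```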

In [26]:
with open('data/bbc.terms') as f:
    content=f.readlines()

words=[c.split()[0] for c in content]

In [27]:
content[0]

'ad\n'

In [38]:
words[:10]

['ad',
 'sale',
 'boost',
 'time',
 'warner',
 'profit',
 'quarterli',
 'media',
 'giant',
 'jump']

In [39]:
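topic_words = []

for r in model.components_:
    # Pair each weight with its column index and keep the 12 largest
    a = sorted([(v, i) for i, v in enumerate(r)], reverse=True)[0:12]
    # Look up the word at each of those column indices
    topic_words.append([words[e[1]] for e in a])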

In [40]:
topic_words[:5]

[['bondi',
  'stanlei',
  'continent',
  'mortgag',
  'bare',
  'least',
  'extent',
  '200',
  'leav',
  'frustrat',
  'yuan',
  'industri'],
 ['manipul',
  'teenag',
  'drawn',
  'go',
  'prosecutor',
  'herbert',
  'host',
  'protest',
  'hike',
  'nation',
  'calcul',
  'power'],
 ['dimens',
  'hous',
  'march',
  'wider',
  'owner',
  'intend',
  'declin',
  'forc',
  'posit',
  'founder',
  'york',
  'unavail'],
 ['rome',
  'ft',
  'regain',
  'lawmak',
  'outright',
  'resum',
  'childhood',
  'greatest',
  'citi',
  'stagnat',
  'crown',
  'bodi'],
 ['build',
  'empir',
  'isol',
  '£12',
  'restructur',
  'closer',
  'plung',
  'depreci',
  'durham',
  'race',
  'juli',
  'segreg']]
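The same ranking can also be done with `np.argsort`, which avoids building a list of tuples. A minimal sketch on made-up weights; `weights`, `vocab`, and `k` are illustrative, and ties may come out in a different order than with the `sorted` approach above:

```python
import numpy as np

# Made-up weights for one topic and a matching made-up vocabulary.
weights = np.array([0.1, 0.7, 0.0, 0.4, 0.9])
vocab = ['ad', 'sale', 'boost', 'time', 'profit']

k = 3
top_idx = np.argsort(weights)[::-1][:k]  # indices of the k largest weights
top_words = [vocab[i] for i in top_idx]
print(top_words)
```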

The original data had 5 topics. Each document's topic is the prefix of its name in bbc.docs.

In [41]:
with open('data/bbc.docs') as d:
    doc_content = d.readlines()
    
doc_content[:8]

['business.001\n',
 'business.002\n',
 'business.003\n',
 'business.004\n',
 'business.005\n',
 'business.006\n',
 'business.007\n',
 'business.008\n']
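To confirm the number of topics, the label before the dot can be tallied. A sketch on a few lines in the format shown above; `sample_lines` is made up for illustration, and in the notebook `doc_content` would take its place:

```python
from collections import Counter

# A few lines in the format shown above; 'sport.001' is a made-up example
# of a second label, the real label names come from data/bbc.docs.
sample_lines = ['business.001\n', 'business.002\n', 'sport.001\n']

labels = [line.split('.')[0] for line in sample_lines]
label_counts = Counter(labels)
print(label_counts)
```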

In [44]:
[x.split(".")[0] for x in doc_content]

['business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',
 'business',