# NMF / sparse matrix pair

This pair problem covers a few techniques: some specific to working with sparse matrices, and some for using NMF to look at the top words for given topics.

## Data

We'll be using the BBC dataset. These are articles collected from 5 different topics, with the data pre-processed. 

You can download the data by clicking [here](http://mlg.ucd.ie/files/datasets/bbc.zip). **Save the data and unzip it in the same folder as this notebook!** The data consists of a few files:

* *bbc.mtx* is a list: first column is **wordID**, second is **articleID** and the third is the number of times that word appeared in that article.
* *bbc.terms* is just a list of words.
* *bbc.docs* is a list of articles listed by topic.

At a high level, we're going to:

1. Turn the `bbc.mtx` file into a sparse matrix.
1. Decompose that sparse matrix using NMF.
1. Use the resulting components of NMF to analyze the topics that result.

## Read in the data

You can read in the data with these lines:

In [14]:
with open('./bbc/bbc.mtx') as f:
    content = f.readlines()

In [15]:
# Drop the two header lines; the second pop returns the size line shown below.
content.pop(0)
content.pop(0)

'9635 2225 286774\n'
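The string returned by the second `pop` looks like a size line. Assuming the file follows the Matrix Market coordinate layout (suggested by the `.mtx` extension, not stated in this notebook), the three numbers would be the row count, the column count, and the number of stored entries. A minimal sketch of parsing it, where `parse_size_line` is a hypothetical helper and the interpretation of rows as words and columns as articles is an assumption based on the column description of `bbc.mtx`:

```python
def parse_size_line(line):
    """Split a 'rows cols entries' line into three integers."""
    n_rows, n_cols, n_entries = (int(tok) for tok in line.split())
    return n_rows, n_cols, n_entries

# The string below is the value returned by the second pop above.
# Assumption: rows are words and columns are articles.
n_terms, n_docs, n_entries = parse_size_line('9635 2225 286774\n')
print(n_terms, n_docs, n_entries)
```

Keeping these numbers around is useful later, because they can be passed as an explicit `shape` when building the sparse matrix.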

## Exercise

Turn this into a list of tuples representing a sparse matrix. Remember how the file is laid out:

* *bbc.mtx* is a list: first column is **wordID**, second is **articleID** and the third is the number of times that word appeared in that article.

So, if word 1 appears in article 3, 2 times, one element of your list will be:

`(1, 3, 2)`

In [16]:
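# Each line is "wordID articleID count"; go through float first in case counts are written like 1.0.
sparsemat = [tuple(map(int, map(float, c.split()))) for c in content]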
sparsemat

[(1, 1, 1),
 (1, 7, 2),
 (1, 11, 1),
 (1, 14, 1),
 (1, 15, 2),
 (1, 19, 2),
 (1, 21, 1),
 (1, 29, 1),
 (1, 30, 1),
 (1, 33, 1),
 (1, 35, 1),
 (1, 41, 1),
 (1, 45, 1),
 (1, 47, 2),
 (1, 50, 1),
 (1, 52, 1),
 (1, 53, 3),
 (1, 55, 1),
 (1, 61, 1),
 (1, 62, 1),
 (1, 63, 1),
 (1, 65, 1),
 (1, 69, 1),
 (1, 80, 1),
 (1, 81, 1),
 (1, 86, 1),
 (1, 93, 1),
 (1, 99, 1),
 (1, 100, 1),
 (1, 104, 1),
 (1, 105, 1),
 (1, 106, 1),
 (1, 112, 1),
 (1, 116, 3),
 (1, 117, 1),
 (1, 118, 2),
 (1, 120, 1),
 (1, 121, 1),
 (1, 126, 1),
 (1, 131, 4),
 (1, 133, 1),
 (1, 134, 3),
 (1, 138, 2),
 (1, 140, 2),
 (1, 143, 1),
 (1, 151, 2),
 (1, 152, 2),
 (1, 175, 1),
 (1, 176, 1),
 (1, 177, 1),
 (1, 180, 1),
 (1, 184, 1),
 (1, 189, 1),
 (1, 194, 1),
 (1, 195, 1),
 (1, 198, 1),
 (1, 201, 1),
 (1, 206, 1),
 (1, 207, 3),
 (1, 208, 1),
 (1, 217, 1),
 (1, 219, 1),
 (1, 221, 1),
 (1, 231, 1),
 (1, 241, 2),
 (1, 252, 1),
 (1, 253, 2),
 (1, 255, 2),
 (1, 263, 1),
 (1, 273, 1),
 (1, 275, 1),
 (1, 279, 2),
 (1, 281, 1),
 ...

## Turn this into an array

Now, we're going to learn a cool function to turn sparse matrix objects like this into 2D arrays. Check out the [coo matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html) documentation and the example below:

In [17]:
from scipy.sparse import coo_matrix
import numpy as np
row  = np.array([0, 3, 1, 0])
col  = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
coo_matrix((data, (row, col)), shape=(4, 4)).toarray()

array([[4, 0, 9, 0],
       [0, 7, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 5]])
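Calling `.toarray()` is convenient for small examples, but the sparse object can also be inspected without densifying it. A small sketch reusing the same toy values (the variable names `toy` and `toy_csr` are made up for illustration):

```python
import numpy as np
from scipy.sparse import coo_matrix

row = np.array([0, 3, 1, 0])
col = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])
toy = coo_matrix((data, (row, col)), shape=(4, 4))

print(toy.shape)   # dimensions of the full matrix
print(toy.nnz)     # number of explicitly stored entries

toy_csr = toy.tocsr()        # CSR format supports row slicing
print(toy_csr[0].toarray())  # densify a single row only
```

For a matrix the size of the BBC data this matters: only the stored entries take memory, whereas the dense array also holds every zero.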

## Exercise:

Turn the sparse matrix object you created above into an array using `coo_matrix` and `.toarray()`.

In [18]:
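from scipy.sparse import coo_matrix
# Split the (wordID, articleID, count) triplets into parallel lists.
rows = [x[0] for x in sparsemat]
cols = [x[1] for x in sparsemat]
values = [x[2] for x in sparsemat]
# IDs start at 1, so the inferred shape is (9636, 2226) with an empty row 0 and column 0.
# Rows are words and columns are articles. Call coo.toarray() for the dense version.
coo = coo_matrix((values, (rows, cols)))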
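Two details of this cell are worth checking. The IDs in the file start at 1, so using them directly as indices leaves an unused row 0 and column 0 (which is why the shapes further down are 9636 and 2226 rather than 9635 and 2225). Also, rows here are words and columns are articles, whereas scikit-learn estimators expect one row per sample. One possible way to handle both is sketched below on made-up triplets standing in for `sparsemat`; the names `triplets`, `term_doc`, and `doc_term` are hypothetical:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical (wordID, articleID, count) triplets with 1-based IDs.
triplets = [(1, 1, 1), (1, 3, 2), (2, 2, 5), (4, 3, 1)]
n_terms, n_docs = 4, 3  # would come from the size line of the .mtx file

rows = np.array([t[0] for t in triplets]) - 1  # shift to 0-based word index
cols = np.array([t[1] for t in triplets]) - 1  # shift to 0-based article index
values = np.array([t[2] for t in triplets])

term_doc = coo_matrix((values, (rows, cols)), shape=(n_terms, n_docs))
doc_term = term_doc.T.tocsr()  # one row per article, one column per word

print(doc_term.shape)
print(doc_term.toarray())
```

With this orientation, row `i` is article `i` and column `j` lines up with the `j`-th entry of a 0-based word list.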

## NMF

NMF (non-negative matrix factorization) decomposes a non-negative matrix of documents and words into two smaller non-negative matrices, one of which can be interpreted as the "loadings" or "weights" of each word on a topic.

Check out [the NMF documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) and the [examples of topic extraction using NMF and LDA](http://scikit-learn.org/0.18/auto_examples/applications/topics_extraction_with_nmf_lda.html).
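To make the decomposition concrete, here is a small sketch on an invented document-by-word count matrix (not the BBC data). It assumes documents are rows and words are columns, so `fit_transform` returns a documents-by-topics matrix and `components_` is topics-by-words. The `init` and `max_iter` settings are arbitrary choices for this toy case:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy counts: 4 documents, 6 words, with two obvious groups of words.
X = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 4, 1, 2],
    [0, 0, 0, 2, 3, 3],
], dtype=float)

toy_model = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=500)
W = toy_model.fit_transform(X)  # documents x topics
H = toy_model.components_       # topics x words

print(W.shape, H.shape)
print(np.round(W @ H, 1))       # low-rank approximation of X
```

The product `W @ H` only approximates `X`; the point is that both factors are non-negative, which is what makes the rows of `H` readable as word weights.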

## Exercise

Import `NMF`, define a model object with 5 components, and `fit_transform` the data you created above.

In [19]:
from sklearn.decomposition import NMF
model = NMF(n_components=5, init='random', random_state=818)
doc_topic = model.fit_transform(coo)

doc_topic.shape
# rows follow the first index of `coo`, which is wordID (plus the unused row 0),
# so this is 9636 rows by five latent features, not one row per article

(9636, 5)

In [20]:
# find the feature with the highest value in each row
np.argmax(doc_topic, axis=1)

array([0, 0, 2, ..., 4, 4, 4])

## Exercise: 

Check out the `components_` attribute of this model:

In [21]:
model.components_.shape

(5, 2226)

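This is five rows, one per "topic". Each row of `components_` holds one weight per column of the matrix that was fitted, so it contains the weights of each word on that topic only when the fitted matrix has articles as rows and words as columns. With the matrix built above, the 2226 columns come from the articleID index, so the matrix would need to be transposed before fitting to get word weights here. The exercise is to _get a list of the top 10 words for each topic_. You can just store this in a list of lists.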

**Note:** Just like you read in the data above, you'll have to read in the words from the `bbc.terms` file.
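One way to pull out the top words is sketched below on an invented `components` array and vocabulary, both hypothetical stand-ins for `model.components_` and the contents of `bbc.terms`. It assumes column `j` of the components corresponds to `vocab[j]`, which holds only if the fitted matrix had one 0-based column per word:

```python
import numpy as np

vocab = ['market', 'goal', 'film', 'share', 'match', 'actor']
components = np.array([
    [0.9, 0.0, 0.1, 0.8, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.1, 0.9, 0.0],
    [0.1, 0.0, 0.8, 0.0, 0.0, 0.6],
])

def top_words(components, vocab, k):
    """Return the k highest-weighted words for each row of components."""
    result = []
    for weights in components:
        top_idx = np.argsort(weights)[::-1][:k]  # indices by descending weight
        result.append([vocab[i] for i in top_idx])
    return result

print(top_words(components, vocab, k=2))
```

Using `np.argsort` avoids building a list of tuples for every topic, which helps when the vocabulary has thousands of entries.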

In [22]:
with open('./bbc/bbc.terms') as f:
    content = f.readlines()
words = [c.split()[0] for c in content]

In [23]:
topic_words = []
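for r in model.components_:
    # sort (weight, column index) pairs by descending weight;
    # note this slice keeps 12 entries per topic, use [0:10] for exactly ten
    a = sorted([(v,i) for i,v in enumerate(r)],reverse=True)[0:12]
    # column index i is looked up directly in `words`, which assumes 0-based word columns
    topic_words.append([words[e[1]] for e in a])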

In [24]:
topic_words

[['bondi',
  'stanlei',
  'continent',
  'mortgag',
  'bare',
  'least',
  'extent',
  '200',
  'leav',
  'frustrat',
  'yuan',
  'industri'],
 ['manipul',
  'teenag',
  'drawn',
  'go',
  'prosecutor',
  'herbert',
  'host',
  'protest',
  'hike',
  'nation',
  'calcul',
  'power'],
 ['dimens',
  'hous',
  'march',
  'wider',
  'owner',
  'intend',
  'declin',
  'forc',
  'posit',
  'founder',
  'york',
  'unavail'],
 ['rome',
  'ft',
  'regain',
  'lawmak',
  'outright',
  'resum',
  'childhood',
  'greatest',
  'citi',
  'stagnat',
  'crown',
  'bodi'],
 ['build',
  'empir',
  'isol',
  '£12',
  'restructur',
  'closer',
  'plung',
  'depreci',
  'durham',
  'race',
  'juli',
  'segreg']]

The original data had 5 topics, as listed in `bbc.docs`. 

```
Business
Entertainment
Politics
Sport
Tech
```

In "real life", we would have found a way to use these labels to inform the model. For this little demo, we can just compare the recovered topics to the original ones. Ideally each recovered topic lines up with one of these labels, though the order will generally differ, which is to be expected: NMF components have no inherent ordering.
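One rough way to make that comparison is to count how often each recovered topic co-occurs with each original label. The sketch below assumes a dominant topic index is available per article (for example from an `argmax` over a documents-by-topics matrix) and that a label can be read off each `bbc.docs` entry by taking the part before the dot. The `doc_ids` and `assigned` values are invented for illustration:

```python
from collections import Counter

doc_ids = ['business.001', 'business.002', 'sport.001', 'sport.002', 'tech.001']
assigned = [2, 2, 0, 0, 1]  # hypothetical dominant topic index per article

labels = [d.split('.')[0] for d in doc_ids]

# Count how often each (recovered topic, original label) pair occurs.
pair_counts = Counter(zip(assigned, labels))
for (topic, label), n in sorted(pair_counts.items()):
    print(f"topic {topic}: {label} x{n}")
```

If each topic index pairs mostly with a single label, the recovered topics line up with the original ones up to a reordering.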

In [25]:
with open('./bbc/bbc.docs') as d:
    content = d.readlines()
    
content

['business.001\n',
 'business.002\n',
 'business.003\n',
 'business.004\n',
 'business.005\n',
 'business.006\n',
 'business.007\n',
 'business.008\n',
 'business.009\n',
 'business.010\n',
 'business.011\n',
 'business.012\n',
 'business.013\n',
 'business.014\n',
 'business.015\n',
 'business.016\n',
 'business.017\n',
 'business.018\n',
 'business.019\n',
 'business.020\n',
 'business.021\n',
 'business.022\n',
 'business.023\n',
 'business.024\n',
 'business.025\n',
 'business.026\n',
 'business.027\n',
 'business.028\n',
 'business.029\n',
 'business.030\n',
 'business.031\n',
 'business.032\n',
 'business.033\n',
 'business.034\n',
 'business.035\n',
 'business.036\n',
 'business.037\n',
 'business.038\n',
 'business.039\n',
 'business.040\n',
 'business.041\n',
 'business.042\n',
 'business.043\n',
 'business.044\n',
 'business.045\n',
 'business.046\n',
 'business.047\n',
 'business.048\n',
 'business.049\n',
 'business.050\n',
 'business.051\n',
 'business.052\n',
 'business.0