# Non-Negative Matrix Factorization

In [17]:
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.decomposition import NMF

Below, we read data from a text file that encodes articles as embedded values. The data is stored in sparse coordinate form `i j v`, where `i` and `j` are the row and column indices and `v` is the value of the cell. The first two lines of the file are header lines, so we skip them.

In [9]:
with open("./data/bbc.mtx", 'r') as f:
    content = f.readlines()[2:]
    
content[:5]

['1 1 1.0\n', '1 7 2.0\n', '1 11 1.0\n', '1 14 1.0\n', '1 15 2.0\n']

First we process the data into numbers, representing values at specific coordinates.

In [10]:
# Parse each "i j v" line into three integers (values are truncated via float -> int)
sparsemat = np.array(
    [[int(float(x)) for x in c.split()] for c in content]
)

sparsemat[25:29]

array([[  1,  86,   1],
       [  1,  93,   1],
       [  1,  99,   1],
       [  1, 100,   1]])

Then we can use `coo_matrix` from scipy to build a sparse matrix in coordinate format (also known as ijv or triplet format). In this format the data is encoded such that `A[i[k], j[k]] = data[k]`.

For each data point, we pass the value itself and the row and column coordinates it should have in the matrix representation. Note that the indices in the file appear to start at 1, so passing them unchanged leaves row 0 and column 0 of the matrix empty.

In [13]:
coo = coo_matrix(
    (sparsemat[:,2], (sparsemat[:, 0], sparsemat[:, 1]))
)

In [16]:
coo.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 5, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], shape=(9636, 2226))
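The all-zero first row above is what one would expect if the file uses 1-based indices. As a minimal sketch, assuming that is the case, the triplets can be shifted to 0-based indices before building the matrix. The `lines` list and the name `toy` below are made up for illustration and are not part of the BBC data.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy triplets in the same "i j v" layout, with 1-based indices (made-up values)
lines = ["1 1 1.0\n", "1 3 2.0\n", "2 2 4.0\n", "3 1 5.0\n"]

triplets = np.array([line.split() for line in lines], dtype=float)
rows = triplets[:, 0].astype(int) - 1  # shift 1-based row indices to 0-based
cols = triplets[:, 1].astype(int) - 1  # shift 1-based column indices to 0-based
vals = triplets[:, 2]                  # keep the values as floats

toy = coo_matrix((vals, (rows, cols)))
print(toy.shape)
print(toy.toarray())
```

With the shift applied, the matrix has no padding row or column, so row `k` of the matrix corresponds to line `k` of any accompanying index file that is read into a 0-based Python list.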

### NMF

The NMF algorithm finds an approximate factorization of the original matrix, $X \approx WH$, in which both $W$ and $H$ contain only non-negative entries. For text data such as ours, this can be interpreted as a weight decomposition that gives the weight of each word on each topic.

The input of the NMF algorithm is a matrix of shape (n_samples, n_features). The decomposition returns two matrices, of shape (n_samples, n_components) and (n_components, n_features).

Calling `fit_transform` on our input data returns the first of these, the matrix that maps the n_samples onto the n_components.

In [18]:
model = NMF(n_components=5, init='random', random_state=818)
doc_topic = model.fit_transform(coo)

doc_topic.shape

(9636, 5)
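To make the shapes concrete, here is a small sketch on random non-negative data (not the BBC matrix). The names `X_toy`, `toy_model`, `W` and `H` are illustrative only, and the sizes are arbitrary.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X_toy = rng.random((6, 4))  # 6 samples, 4 features, all entries non-negative

toy_model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = toy_model.fit_transform(X_toy)  # shape (n_samples, n_components)
H = toy_model.components_           # shape (n_components, n_features)

print(W.shape, H.shape)

# Relative error of the approximation X ≈ W @ H (Frobenius norm)
rel_err = np.linalg.norm(X_toy - W @ H) / np.linalg.norm(X_toy)
print(rel_err)
```

The product `W @ H` only approximates `X_toy`; the error generally shrinks as `n_components` grows.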

For each of the input samples, we can find which component has the greatest weight.

In [19]:
np.argmax(doc_topic, axis=1)

array([0, 0, 2, ..., 4, 4, 4], shape=(9636,))
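As a quick sanity check, it can help to count how many samples land on each component. The sketch below uses a small hand-written weight matrix in place of `doc_topic`; the values are invented.

```python
import numpy as np

# Made-up sample-by-component weights standing in for doc_topic
weights = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.6],
    [0.5, 0.4, 0.1],
])

dominant = np.argmax(weights, axis=1)  # index of the largest weight in each row
counts = np.bincount(dominant, minlength=weights.shape[1])  # samples per component

print(dominant)
print(counts)
```

A very uneven count, or a component that attracts almost no samples, is a hint that the factorization may not be separating the data as intended.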

The matrix that maps components to features is stored in `model.components_` and has the following shape.

In [20]:
model.components_.shape

(5, 2226)

To map the components back to the meaning of the words, we need the original embedding information, that is, the list that gives the term for each index.

In [22]:
with open("./data/bbc.terms") as f:
    content = f.readlines()
words = [c.split()[0] for c in content]
words[:10]

['ad',
 'sale',
 'boost',
 'time',
 'warner',
 'profit',
 'quarterli',
 'media',
 'giant',
 'jump']

In [42]:
topic_words = []
# Go through each of the components
# Each row corresponds to a topic, where each value is the topic's word weight
for c in model.components_:
    # Sort each index by weight, keeping the index value 
    a = sorted(
        [(v, i) for i, v in enumerate(c)], reverse=True
    )
    # Take the top 12 highest-ranked words
    a = a[:12]
    # Grab the word that corresponds to the embedding index
    topic_words.append(
        [words[e[1]] for e in a]
    )
    
topic_words

[['bondi',
  'stanlei',
  'continent',
  'mortgag',
  'bare',
  'least',
  'extent',
  '200',
  'leav',
  'frustrat',
  'yuan',
  'industri'],
 ['manipul',
  'teenag',
  'drawn',
  'go',
  'prosecutor',
  'herbert',
  'host',
  'protest',
  'hike',
  'nation',
  'calcul',
  'power'],
 ['dimens',
  'hous',
  'march',
  'wider',
  'owner',
  'intend',
  'declin',
  'forc',
  'posit',
  'founder',
  'york',
  'unavail'],
 ['rome',
  'ft',
  'regain',
  'lawmak',
  'outright',
  'resum',
  'childhood',
  'greatest',
  'citi',
  'stagnat',
  'crown',
  'bodi'],
 ['build',
  'empir',
  'isol',
  '£12',
  'restructur',
  'closer',
  'plung',
  'depreci',
  'durham',
  'race',
  'juli',
  'segreg']]
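The same top-word lookup can be written with `np.argsort`. The sketch below uses a made-up vocabulary and component matrix, and it assumes that the columns of the component matrix line up one-to-one with the vocabulary; the assertion makes that assumption explicit.

```python
import numpy as np

# Made-up vocabulary and topic-by-word weights, for illustration only
vocab = ["goal", "match", "market", "profit", "film"]
components = np.array([
    [0.1, 0.0, 2.0, 1.5, 0.2],
    [3.0, 2.5, 0.0, 0.1, 0.4],
])

# Each column must correspond to exactly one vocabulary entry
assert components.shape[1] == len(vocab)

top_k = 2
# Sort column indices by ascending weight, reverse, and keep the first top_k
top_idx = np.argsort(components, axis=1)[:, ::-1][:, :top_k]
top_words = [[vocab[i] for i in row] for row in top_idx]

print(top_words)
```

If the shape check fails on real data, the word lookup would silently return unrelated terms, so it is worth running before reading anything into the groupings.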

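These word groupings *should* correspond to the original topics of the documents, listed below. In the output above, the correspondence is hard to see. A likely reason is the orientation of the matrix: its shape, (9636, 2226), suggests that rows index terms and columns index documents. If so, `model.components_` holds one weight per document rather than per word, and `words[e[1]]` looks up words by document index. Transposing `coo` before fitting, and shifting the 1-based indices down by one, would be the first thing to check.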

In [41]:
with open("./data/bbc.docs") as f:
    doc_topics = np.unique(list(map(lambda l : l.split(".")[0], f.readlines())))
doc_topics

array(['business', 'entertainment', 'politics', 'sport', 'tech'],
      dtype='<U13')
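One way to check the correspondence is to tabulate each document's label against its dominant component. The sketch below uses invented labels and assignments; with the real data, the labels would come from the prefix before the `.` in each line of `bbc.docs`, and the assignments from the `argmax` over the document-topic weights.

```python
from collections import Counter

import numpy as np

# Made-up labels and dominant components, one entry per document
labels = np.array(["business", "business", "sport", "sport", "tech", "tech"])
dominant = np.array([0, 0, 1, 1, 2, 1])

# Count how often each (label, component) pair occurs
table = Counter(zip(labels.tolist(), dominant.tolist()))

for (label, component), n in sorted(table.items()):
    print(label, component, n)
```

If the components match the topics well, each label should concentrate on a single component, up to a relabeling of the component indices.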