# NMF

## Imports

In [1]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt

In [2]:
%matplotlib inline
np.set_printoptions(suppress=True)

# Dataset - Newsgroup Dataset

Newsgroups are discussion groups on Usenet, which was popular in the 80s and 90s before the web really took off. This dataset includes roughly 18,000 newsgroup posts spread across 20 topics.

For this demo we use only a subset of those topics:

* Atheism
* Religion
* Graphics
* Space

In [3]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

In [4]:
newsgroups_train.filenames.shape, newsgroups_train.target.shape

((2034,), (2034,))

In [9]:
print(newsgroups_train.data[144])

Archive-name: space/probe
Last-modified: $Date: 93/04/01 14:39:19 $

PLANETARY PROBES - HISTORICAL MISSIONS

    This section was lightly adapted from an original posting by Larry Klaes
    (klaes@verga.enet.dec.com), mostly minor formatting changes. Matthew
    Wiener (weemba@libra.wistar.upenn.edu) contributed the section on
    Voyager, and the section on Sakigake was obtained from ISAS material
    posted by Yoshiro Yamada (yamada@yscvax.ysc.go.jp).

US PLANETARY MISSIONS


    MARINER (VENUS, MARS, & MERCURY FLYBYS AND ORBITERS)

    MARINER 1, the first U.S. attempt to send a spacecraft to Venus, failed
    minutes after launch in 1962. The guidance instructions from the ground
    stopped reaching the rocket due to a problem with its antenna, so the
    onboard computer took control. However, there turned out to be a bug in
    the guidance software, and the rocket promptly went off course, so the
    Range Safety Officer destroyed it. Although the bug is sometimes claimed
    t

In [11]:
np.array(newsgroups_train.target_names)[newsgroups_train.target[144]]

'sci.space'

In [12]:
num_topics, num_top_words = 6, 8

# Data processing

In [16]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk
vectorizer = CountVectorizer(stop_words='english') #, tokenizer=LemmaTokenizer())

In [17]:
vectors = vectorizer.fit_transform(newsgroups_train.data).toarray() # dense ndarray, (documents, vocab)
vectors.shape #, vectors.nnz / vectors.shape[0], row_means.shape


(2034, 26576)

`vectors` is our matrix A, the document-term count matrix: one row per document and one column per vocabulary word.
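
To make the shape of A concrete, here is a minimal sketch that builds a document-term count matrix by hand using only the standard library. The three-document corpus and the names `docs`, `toy_vocab` and `A_toy` are made up for illustration; the sketch also skips the stop-word removal and tokenization rules that `CountVectorizer` applies, so it approximates the idea rather than reproducing the library's behaviour.

```python
from collections import Counter

# Hypothetical toy corpus (not taken from the newsgroups data).
docs = ["space launch nasa", "god jesus bible god", "nasa space space"]

# Vocabulary: the sorted set of unique tokens, one column per term.
toy_vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document; each entry counts how often the term occurs in that document.
A_toy = []
for doc in docs:
    counts = Counter(doc.split())
    A_toy.append([counts[term] for term in toy_vocab])

print(toy_vocab)
for row in A_toy:
    print(row)
```

Each row is mostly zeros, which is also true of the real matrix: a post uses only a small fraction of the 26,576 vocabulary words.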

In [29]:
print(len(newsgroups_train.data), vectors.shape)

2034 (2034, 26576)


In [20]:
vocab = np.array(vectorizer.get_feature_names())

In [21]:
vocab.shape

(26576,)

In [23]:
vocab[7000:7020]

array(['cosmonauts', 'cosmos', 'cosponsored', 'cost', 'costa', 'costar',
       'costing', 'costly', 'costruction', 'costs', 'cosy', 'cote',
       'couched', 'couldn', 'council', 'councils', 'counsel',
       'counselees', 'counselor', 'count'], dtype='<U80')

# Helper Functions

In [24]:
num_top_words=8

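def show_topics(a):
    # Each row of `a` holds one topic's weights over the vocabulary.
    # argsort is ascending, so the reversed slice picks the num_top_words largest weights.
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = [top_words(t) for t in a]
    return [' '.join(t) for t in topic_words]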

# NMF
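
NMF approximates a nonnegative matrix A of shape (documents, vocab) by a product W H, where W is (documents, topics), H is (topics, vocab), and both factors are constrained to be nonnegative. The sketch below illustrates the idea on a small random matrix using the classic multiplicative update rules. It is not the solver `decomposition.NMF` uses (its default is coordinate descent, as far as I know), and the data and the names `A_toy`, `k` and `errors` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nonnegative "document-term" matrix: 6 documents x 5 terms.
A_toy = rng.random((6, 5))
k = 2  # number of topics

# Random nonnegative starting factors.
W = rng.random((6, k))
H = rng.random((k, 5))
eps = 1e-10  # guards against division by zero

errors = []
for _ in range(200):
    # Multiplicative updates keep W and H nonnegative as long as they start nonnegative.
    H *= (W.T @ A_toy) / (W.T @ W @ H + eps)
    W *= (A_toy @ H.T) / (W @ H @ H.T + eps)
    # Frobenius norm of the residual A - W H.
    errors.append(np.linalg.norm(A_toy - W @ H))

print(W.shape, H.shape)
print(errors[0], errors[-1])
```

The rows of H play the role of topics (weights over terms) and the rows of W say how strongly each document expresses each topic, which is why `show_topics` is applied to H.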

<img src="nmf_doc.png" > 

In [26]:
m,n=vectors.shape
d=5  # num topics

In [27]:
clf = decomposition.NMF(n_components=d, random_state=1)

W1 = clf.fit_transform(vectors)
H1 = clf.components_



In [28]:
show_topics(H1)

['jpeg image gif file color images format quality',
 'edu graphics pub mail 128 ray ftp send',
 'space launch satellite nasa commercial satellites year market',
 'jesus god people matthew atheists does atheism said',
 'image data available software processing ftp edu analysis']

## Using Tf-idf

Term Frequency-Inverse Document Frequency (TF-IDF) is a way to normalize term counts by taking into account how often they appear in a document, how long the document is, and how common or rare the term is across the corpus.
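
The three ingredients can be sketched by hand with the standard library. The example assumes a smoothed IDF of ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which I believe matches the `TfidfVectorizer` defaults, but treat that as an assumption rather than a statement about the library. The corpus and the names `docs`, `toy_vocab`, `idf` and `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the newsgroups data).
docs = ["space launch nasa", "god jesus bible god", "nasa space space"]
n_docs = len(docs)
tokenized = [doc.split() for doc in docs]
toy_vocab = sorted({w for toks in tokenized for w in toks})

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in toks for toks in tokenized) for term in toy_vocab}

# Smoothed inverse document frequency: rare terms get larger weights.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in toy_vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)                           # how often each term appears
    raw = [counts[t] * idf[t] for t in toy_vocab]      # scale counts by term rarity
    norm = math.sqrt(sum(x * x for x in raw))          # L2 norm compensates for document length
    return [x / norm for x in raw]

tfidf = [tfidf_row(toks) for toks in tokenized]
for row in tfidf:
    print([round(x, 3) for x in row])
```

In the first document every word occurs once, yet `launch` ends up with a higher weight than `space` or `nasa` because it appears in only one document.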

In [30]:
vectorizer_tfidf = TfidfVectorizer(stop_words='english')
vectors_tfidf = vectorizer_tfidf.fit_transform(newsgroups_train.data) # (documents, vocab)

In [31]:
W1 = clf.fit_transform(vectors_tfidf)
H1 = clf.components_



In [32]:
show_topics(H1)

['people don think just like objective say morality',
 'graphics thanks files image file program windows know',
 'space nasa launch shuttle orbit moon lunar earth',
 'ico bobbe tek beauchaine bronx manhattan sank queens',
 'god jesus bible believe christian atheism does belief']

In [34]:
H1.shape

(5, 26576)

In [35]:
clf.reconstruction_err_

43.71292605795278