# 20 Newsgroups Dataset

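Dataset: posts from Usenet discussion groups (newsgroups), a forum system popular in the 1980s and 1990s
- about 18,000 documents spread over 20 newsgroups
- commonly used to demonstrate topic models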

## Topic modelling problem
- Using SVD
- Using NMF

In [2]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# pick out 4 topics only
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

## Cluster the data into topics using unsupervised SVD
- fit a `CountVectorizer` to build the document-term count matrix
- decompose that matrix with SVD and read the topics off the right singular vectors

In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vectorizer = CountVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(newsgroups_train.data).todense()
%time u, s, v = np.linalg.svd(vectors, full_matrices=False)

Wall time: 27.5 s


In [5]:
print(u.shape, s.shape, v.shape)

(2034, 2034) (2034,) (2034, 26576)
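
To make the shapes above concrete, here is a small sketch on a random matrix (the matrix `A` and its size are made up for illustration and are not part of the dataset). With `full_matrices=False`, `np.linalg.svd` returns the reduced factors, the columns of `u` and the rows of `v` are orthonormal, and multiplying the factors back together recovers the input. The third return value is already transposed, which is why the topics are read from its rows.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 10))  # hypothetical stand-in for the document-term matrix

u, s, v = np.linalg.svd(A, full_matrices=False)
k = min(A.shape)

shapes = (u.shape, s.shape, v.shape)              # reduced SVD shapes
reconstructed = u @ np.diag(s) @ v                # should match A up to rounding
u_orthonormal = np.allclose(u.T @ u, np.eye(k))   # columns of u are orthonormal
v_orthonormal = np.allclose(v @ v.T, np.eye(k))   # rows of v are orthonormal
singular_values_sorted = bool(np.all(np.diff(s) <= 0))  # largest first

print(shapes)
print(np.allclose(A, reconstructed), u_orthonormal, v_orthonormal)
```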


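Argsort each row of `v` and use the resulting indices to look up words in the vectorizer's feature names to get the topics. Note that `np.argsort` sorts in ascending order, so the first 10 indices point at the most negative entries of each row. Because the sign of a singular vector is arbitrary, these can still be the words that characterize a topic.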

In [41]:
np.array(np.argsort(v[0])[:, :10])[0]

array([13816, 12642,  8956, 10286, 11444, 12652, 11163,  7506, 19372,
       10798], dtype=int64)

In [47]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = np.array(vectorizer.get_feature_names_out())
for t in range(5):
    print([vocab[i] for i in np.array(np.argsort(v[t])[:, :10])[0]])

['jpeg', 'image', 'edu', 'file', 'graphics', 'images', 'gif', 'data', 'pub', 'ftp']
['edu', 'graphics', 'data', 'space', 'pub', 'mail', '128', '3d', 'ray', 'nasa']
['space', 'jesus', 'launch', 'god', 'people', 'satellite', 'matthew', 'atheists', 'does', 'time']
['space', 'launch', 'satellite', 'commercial', 'nasa', 'satellites', 'market', 'year', 'data', 'jpeg']
['jpeg', 'graphics', 'space', 'pub', 'edu', 'ray', 'mail', 'send', 'launch', 'file']


### Randomized SVD
#### Why? Shortcomings of classical decomposition algorithms:
- Matrices are "stupendously big".
- Data are often missing or inaccurate. Why spend extra computational resources when the imprecision of the input limits the precision of the output?
- Data transfer now plays a major role in the running time of algorithms. Techniques that require fewer passes over the data may be substantially faster, even if they require more flops (floating point operations).
- It is important to take advantage of GPUs.
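
As a rough illustration of the idea, the sketch below approximates the top `k` singular triplets by projecting the matrix onto a small random subspace and running an exact SVD only on the reduced problem. The function name `randomized_svd_sketch`, its parameters, and the test matrix are assumptions made for this example. It is a minimal version of the general technique, not the implementation used by any particular library (scikit-learn ships its own `randomized_svd` in `sklearn.utils.extmath`).

```python
import numpy as np

def randomized_svd_sketch(A, k, n_oversamples=10, n_iter=2, seed=0):
    """Approximate the top-k SVD of A with a random projection (illustrative only)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Project the columns of A onto a small random subspace.
    Omega = rng.standard_normal((n, k + n_oversamples))
    Q, _ = np.linalg.qr(A @ Omega)
    # A few power iterations sharpen the subspace when singular values decay slowly.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Solve a small exact SVD in the reduced space, then map back.
    B = Q.T @ A
    u_small, s, vt = np.linalg.svd(B, full_matrices=False)
    u = Q @ u_small
    return u[:, :k], s[:k], vt[:k]

# Hypothetical test matrix: rank 5 plus a little noise.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 300))
A += 1e-6 * rng.standard_normal(A.shape)

u, s, vt = randomized_svd_sketch(A, k=5)
s_exact = np.linalg.svd(A, compute_uv=False)[:5]
rel_err = np.linalg.norm(A - (u * s) @ vt) / np.linalg.norm(A)
print(u.shape, s.shape, vt.shape)
```

The exact SVD is only ever computed on the small matrix `B`, whose row count is `k + n_oversamples` rather than the full number of documents, which is where the savings come from on large inputs.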

## Non-negative matrix factorization (NMF)
Rather than constraining the factors to be orthogonal, constrain them to be non-negative. Non-negative factors are often more interpretable.
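
To show what the non-negativity constraint means in practice, here is a minimal sketch of NMF using multiplicative updates on a small made-up matrix. The function `nmf_multiplicative`, its parameters, and the matrix `V` are assumptions for illustration. This is one classical way to fit NMF and is not necessarily the solver that a given library uses by default.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative V (m x n) as W @ H with W, H >= 0 (illustrative only)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

# Hypothetical non-negative matrix with an exact rank-3 structure.
rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 30))

W, H, errors = nmf_multiplicative(V, k=3)
print(W.shape, H.shape)
print(errors[0], errors[-1])  # reconstruction error after the first and last update
```

Each row of `H` plays the role of a topic (a non-negative weighting over words), and each row of `W` says how strongly a document uses each topic.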

In [50]:
from sklearn.decomposition import NMF
m,n=vectors.shape
n_topics=5
clf = NMF(n_components=n_topics, random_state=1)

# np.asarray avoids passing an np.matrix, which newer scikit-learn versions reject
W1 = clf.fit_transform(np.asarray(vectors))
H1 = clf.components_

for t in range(5):
    print([vocab[i] for i in np.array(np.argsort(H1[t])[:, :10])[0]])

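IndexError: too many indices for array

`H1` is a plain NumPy array, so `H1[t]` is 1-D and the 2-D index `[:, :10]` fails. The same expression worked for `v` because `vectors` is an `np.matrix`, so `np.linalg.svd` returned matrices whose rows stay 2-D. The next cell drops the extra index.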

In [58]:
np.argsort(H1[0])[:10]

for t in range(5):
    print([vocab[i] for i in np.array(np.argsort(H1[t])[:10])])

['intergalactic', 'tr', 'hoover', 'hop', 'tps', 'hoped', 'hopes', 'hopkins', 'tpa', 'horns']
['intergalactic', 'libertines', 'liberation', 'liberating', 'liberated', 'liberals', 'liberally', 'libertopian', 'liberal', 'libel']
['zyxel', 'chin', 'chimps', 'chimpanzees', 'iowa', 'ipa', 'ipc', 'ipcs', 'children', 'ipl']
['00', 'huygens', 'husc6', 'huntsville', 'hungary', 'hulls', 'huji', 'huisman', 'huffman', 'hues']
['intergalactic', 'lilac', 'lillee', 'lilly', 'limitation', 'limitations', 'limmat', 'limrick', 'lindabury', 'lindbergh']
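
The word lists above look unrelated to any topic, most likely because `np.argsort` sorts in ascending order: with non-negative NMF weights, the first 10 indices point at the smallest (near-zero) entries rather than the largest. The sketch below shows the difference on a tiny made-up example. The names `toy_vocab`, `toy_H`, and `top_words` are hypothetical and the weights are invented for illustration.

```python
import numpy as np

# Hypothetical vocabulary and topic-by-word weights (made up for illustration).
toy_vocab = np.array(['space', 'launch', 'god', 'jesus', 'image', 'jpeg'])
toy_H = np.array([
    [0.9, 0.7, 0.0, 0.1, 0.0, 0.0],   # a "space" topic
    [0.0, 0.1, 0.8, 0.6, 0.0, 0.0],   # a "religion" topic
])

def top_words(weights, vocab, n_top):
    """Return the n_top words with the largest weights, highest first."""
    order = np.argsort(weights)[::-1][:n_top]
    return [str(w) for w in vocab[order]]

# Ascending order picks zero-weight words, which say nothing about the topic.
lowest = [str(w) for w in toy_vocab[np.argsort(toy_H[0])[:2]]]
# Descending order picks the words that define each topic.
highest = [top_words(row, toy_vocab, 2) for row in toy_H]

print(lowest)
print(highest)
```

Applied to the notebook, the same idea would be `np.argsort(H1[t])[::-1][:10]` inside the loop.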
