The output of a topic model actually reflects the ability to cluster for the corpus. This is because documents with a similar topic probability distribution can be grouped together. Nonetheless, a topic model is not only a clustering algorithm. In contrast to other black-box algorithms, a topic model can interpret the clustering results by the word probability distributions over topics. Meanwhile, it allows data to come from a mixture of topics rather than from only one topic. These characteristics may be crucial for various applications.


**text source:** *An overview of topic modeling and its current applications in bioinformatics* Lin Liu, Lin Tang, Wen Dong, Shaowen Yao & Wei Zhou
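The two properties named in the passage (documents as mixtures of topics, and clustering by similar topic distributions) can be sketched on a toy corpus. This is a minimal illustration only: `toy_docs`, `toy_lda`, and the choice of two topics are assumptions made for the example and are not part of the notebook's dataset.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (not taken from 20 Newsgroups).
toy_docs = [
    "rocket orbit launch moon rocket",
    "orbit moon shuttle launch",
    "god faith bible belief faith",
    "bible faith god church",
    "rocket launch faith bible",  # deliberately mixes both themes
]

counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row is one document's probability distribution over the 2 topics.
doc_topic = toy_lda.transform(counts)
print(doc_topic.round(2))

# Hard clustering: assign each document to its most probable topic.
clusters = doc_topic.argmax(axis=1)
print(clusters)
```

Each row of `doc_topic` sums to 1, so a document can belong partly to several topics; taking the `argmax` discards that information and reduces the model to an ordinary clustering.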

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import decomposition
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')

In [3]:
data = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
data.keys()

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [4]:
print('Data:', len(data.data))
print('Target:', len(data.target_names), '\n', data.target_names)

Data: 2034
Target: 4 
 ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


In [5]:
# train_newsgroup = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
# test_newsgroup = fetch_20newsgroups(subset='test', categories=categories, remove=remove)

In [6]:
print("\n".join(data.data[:3]))

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuries.

 >In article <1993Apr19.020359.26996@sq.sq.com>, msb@sq.sq.c

In [7]:
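def show_top_words(feature_names, model, n_top_words):
    """Print the n_top_words highest-weighted words of each topic in model.components_."""
    for i, topic in enumerate(model.components_):
        # argsort() is ascending, so slice from the end to get the largest weights first
        top_indices = topic.argsort()[:-n_top_words - 1:-1]
        print("Topic #%d: %s" % (i, " ".join(feature_names[j] for j in top_indices)))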
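The slice `[:-n_top_words-1:-1]` used in `show_top_words` is compact but easy to misread. The sketch below walks through it on a made-up weight vector; `vocab`, `weights`, and `n_top` are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical weights of one topic over a 5-word vocabulary.
vocab = ["apple", "banana", "cherry", "date", "elder"]
weights = np.array([0.1, 0.7, 0.05, 0.4, 0.3])
n_top = 3

# argsort() returns indices in ascending order of weight.
# The slice [:-n_top - 1:-1] walks backwards from the last index for
# n_top steps, giving the indices of the n_top largest weights,
# largest first.
top_idx = weights.argsort()[:-n_top - 1:-1]
top_words = [vocab[j] for j in top_idx]
print(top_words)

# An equivalent, more explicit form: reverse, then take the first n_top.
alt_idx = weights.argsort()[::-1][:n_top]
```

Both forms select the same indices; the explicit version may be easier to read when performance is not a concern.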

In [16]:
no_topics = 20
no_features = 1000

### **Non-negative Matrix Factorization (NMF)**
linear-algebraic model
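As a rough sketch of what "linear-algebraic" means here: NMF approximates a non-negative document-term matrix `V` by the product of two smaller non-negative matrices, `W` (document-topic weights) and `H` (topic-term weights). The matrix below is a hypothetical toy example, and `toy_nmf` is a name introduced only for this illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative document-term matrix: 4 documents x 6 terms.
V = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 0, 0, 0, 1],
    [0, 0, 0, 3, 2, 2],
    [0, 1, 0, 2, 3, 2],
], dtype=float)

toy_nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = toy_nmf.fit_transform(V)   # document-topic weights, shape (4, 2)
H = toy_nmf.components_        # topic-term weights, shape (2, 6)

print(W.shape, H.shape)

# Relative reconstruction error of the approximation V ≈ W @ H.
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(rel_error)
```

The rows of `H` play the same role as `model.components_` in `show_top_words`: each row holds one topic's weights over the vocabulary.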

In [17]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
%time tfidf = tfidf_vectorizer.fit_transform(data.data) # matrix of (documents, vocab)
tfidf.shape

CPU times: user 334 ms, sys: 134 µs, total: 334 ms
Wall time: 337 ms


(2034, 1000)

In [27]:
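# Note: the `alpha` argument was deprecated in scikit-learn 1.0 and removed in 1.2
# (replaced by `alpha_W` / `alpha_H`); this cell ran on an older release.
%time nmf = decomposition.NMF(n_components=no_topics, random_state=0, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)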

CPU times: user 394 ms, sys: 150 ms, total: 544 ms
Wall time: 376 ms


In [28]:
# On scikit-learn >= 1.0, use get_feature_names_out(); get_feature_names() was removed in 1.2.
tfidf_features = tfidf_vectorizer.get_feature_names()
show_top_words(tfidf_features, nmf, 10)

Topic #0: people like good time say way make religion really life
Topic #1: graphics comp computer 3d package book library good group help
Topic #2: space nasa shuttle launch station orbit moon lunar earth sci
Topic #3: god believe atheism satan tells belief exist existence bible faith
Topic #4: thanks advance mail hi help looking email appreciated know post
Topic #5: objective morality moral values natural science claim freedom animals word
Topic #6: edu university mac pub michael cs email send info class
Topic #7: files file image format program gif use ftp tiff images
Topic #8: jesus christian christians christ bible christianity sin faith love law
Topic #9: com bob said stay away little really info material bobby
Topic #10: does know anybody exactly saying sure heard expected simply ftp
Topic #11: right hear mind sure nice wrong tell finally people let
Topic #12: think don animals know wouldn try things read little posting
Topic #13: mode card vesa vga driver windows color video 25

### **Latent Dirichlet Allocation (LDA)**
probabilistic generative model
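To make "generative" concrete, the sketch below samples one short document the way LDA assumes documents arise: draw a topic mixture for the document, then for each word position draw a topic and a word from that topic. The vocabulary, the topic-word matrix `beta`, and the Dirichlet parameter are hypothetical values picked for illustration, not quantities estimated in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = np.array(["space", "orbit", "nasa", "god", "faith", "bible"])
n_topics, doc_length = 2, 8

# Hypothetical topic-word distributions (each row sums to 1).
beta = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # a "space" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # a "religion" topic
])

# 1. Draw the document's topic mixture from a Dirichlet prior.
theta = rng.dirichlet(alpha=np.full(n_topics, 0.5))

# 2. For each word position, draw a topic, then a word from that topic.
topics = rng.choice(n_topics, size=doc_length, p=theta)
words = [str(rng.choice(vocab, p=beta[z])) for z in topics]

print(theta.round(2))
print(" ".join(words))
```

Fitting LDA runs this story in reverse: given only the observed words, it infers plausible values for the per-document mixtures and the topic-word distributions.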

In [20]:
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
%time tf = tf_vectorizer.fit_transform(data.data)
tf.shape

CPU times: user 334 ms, sys: 959 µs, total: 335 ms
Wall time: 336 ms


(2034, 1000)

In [21]:
%time lda = decomposition.LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0).fit(tf)

CPU times: user 4.36 s, sys: 1.18 s, total: 5.54 s
Wall time: 4.15 s


In [22]:
# On scikit-learn >= 1.0, use get_feature_names_out(); get_feature_names() was removed in 1.2.
tf_features = tf_vectorizer.get_feature_names()
show_top_words(tf_features, lda, 10)

Topic #0: mode cheers kent spacecraft memory time order interested long cross
Topic #1: law matthew 10 john 00 van dr 16 25 20
Topic #2: values value driver vesa science rules held objective nature include
Topic #3: sky pictures look thank significant doubt nice long idea venus
Topic #4: street ago years public request reading communications washington going dc
Topic #5: 000 100 200 observations 300 payload planet km usa 40
Topic #6: radius p2 p3 p1 sin define return 60 include make
Topic #7: post thread key issues lines long mr posts articles looking
Topic #8: edu gif mac windows works won michael ibm ideas class
Topic #9: god jesus bible christian does faith believe belief christians word
Topic #10: just new like year cost said years don time problem
Topic #11: nasa center dc washington news research space ames funding new
Topic #12: objective morality standard word science information observations russian basis defined
Topic #13: image jpeg software file program color files gif use 

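Comparing the top words of the topics shown above, NMF yields more coherent topics than LDA on this 20 Newsgroups subset. Several NMF topics map directly onto a source category (Topic #2 onto sci.space, Topics #7 and #13 onto comp.graphics, Topic #8 onto the religion groups), whereas many LDA topics mix unrelated terms or are dominated by numbers (Topics #1 and #5). Note that LDA was fitted with only `max_iter=5`, so this is a qualitative comparison under these particular settings.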