<img src='img/anaconda-logo.png' align='left' style="padding:10px">
<br>
*Copyright Continuum 2012-2016 All Rights Reserved.*

# Accelerate Natural Language Processing: LDA Topic Clustering

LDA (Latent Dirichlet Allocation) is an unsupervised learning algorithm that extracts topics from documents.  A trained LDA model can transform a document into the semantic space: a vector describing how likely the document is to belong to each topic.

## Table of Contents
* [LDA Topic Clustering](#LDA-Topic-Clustering)
	* [Load data](#Load-data)
	* [Build dictionary](#Build-dictionary)
	* [Build corpus](#Build-corpus)
	* [Training](#Training)
	* [Find topic from documents](#Find-topic-from-documents)


In [None]:
""" Example using GenSim's LDA and sklearn. """
import numpy as np

# Requires the accelerated gensim build: LdaJitModel is not in the original gensim distribution
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.models import LdaModel, LdaJitModel

## Load data

In [None]:
from sklearn.datasets import fetch_20newsgroups

_20newsgroups_ dataset

See http://scikit-learn.org/stable/datasets/twenty_newsgroups.html

In [None]:
rand = np.random.RandomState(8675309)  # set random seed for better reproducibility

cats = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

traindata = fetch_20newsgroups(subset='train',                  # using the training set
                          categories=cats,                      # four different categories
                          shuffle=True,                         # shuffle the data
                          remove=('headers', 'footers', 'quotes'), # clean the data
                          random_state=rand)

Four very different topics are selected so that we can easily see the expected result with shorter training time.

In [None]:
print('number of documents', len(traindata.data))
print('number of characters', sum(len(d) for d in traindata.data))

## Build dictionary

* Tokenize and preprocess the documents
    * normalize the words and remove stopwords
* Build dictionary
* Filter out words that are too infrequent (not enough information) or too frequent (probably meaningless)

In [None]:
def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

id2word = Dictionary(map(tokenize, traindata.data))
print(id2word)

# keep only words that appear in at least 10 documents and in no more than 97% of all documents
id2word.filter_extremes(no_below=10, no_above=0.97)
print(id2word)
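
The filtering step above can be sketched without gensim. This is a minimal illustration of document-frequency filtering, not gensim's implementation: the function name `filter_by_document_frequency` and the toy documents are hypothetical, and only the `no_below` / `no_above` parameter names are borrowed from the call above.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, no_below, no_above):
    """Return the set of tokens that survive document-frequency filtering.

    A token is kept when it appears in at least `no_below` documents and in
    at most a fraction `no_above` of all documents.
    """
    num_docs = len(tokenized_docs)
    # Count each token once per document, regardless of repeats inside it.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / num_docs <= no_above
    }


# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ["space", "orbit", "launch"],
    ["space", "graphics", "image"],
    ["space", "image", "orbit"],
    ["space", "religion"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))  # ['image', 'orbit']
```

Here "space" is dropped because it occurs in every document, while "launch", "graphics" and "religion" are dropped because each occurs in only one.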

## Build corpus

In [None]:
corpus = [id2word.doc2bow(tokenize(doc)) for doc in traindata.data]
print(corpus[0])
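
Each corpus entry is a bag-of-words: a list of `(token_id, count)` pairs for one document. The sketch below shows the idea in plain Python. The helper `to_bag_of_words` and the `toy_token2id` mapping are hypothetical stand-ins for illustration, not part of gensim.

```python
from collections import Counter


def to_bag_of_words(tokens, token2id):
    """Convert a token list into sorted (token_id, count) pairs.

    Tokens that are missing from `token2id` (for example, words removed
    by the dictionary filter) are silently dropped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Hypothetical dictionary mapping tokens to integer ids.
toy_token2id = {"image": 0, "orbit": 1, "space": 2}
toy_bow = to_bag_of_words(["space", "orbit", "space", "launch"], toy_token2id)
print(toy_bow)  # [(1, 1), (2, 2)]
```

The word order of the document is discarded; only which known words occur, and how often, is kept.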

## Training

Using the standard model

In [None]:
%%time
# Fit LDA.
lda = LdaModel(corpus, id2word=id2word, num_topics=5, passes=10)

In [None]:
lda.print_topics()

Training is faster with ``LdaJitModel``, an optimized version of ``LdaModel`` that uses Numba to speed up the critical components of the training procedure.

In [None]:
%%time
# Fit LDA.
lda_jit = LdaJitModel(corpus, id2word=id2word, num_topics=5, passes=10)

In [None]:
lda_jit.print_topics()

**Note:** due to randomness in the training and the low number of passes, the topics from the two models may not match exactly.
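
One rough way to check how closely two sets of topics agree is to compare their top terms. The sketch below is an assumption about how such a check could look, not something the notebook does: the helpers `jaccard` and `best_matches` are hypothetical, and the term lists are made up rather than taken from a trained model. In practice the lists would be built from each model's topics.

```python
def jaccard(a, b):
    """Jaccard similarity of two term collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(topics_a, topics_b):
    """For each topic in topics_a, find the most similar topic in topics_b.

    Returns a list of (index_in_topics_b, similarity) pairs.
    """
    return [
        max(
            ((j, jaccard(terms_a, terms_b)) for j, terms_b in enumerate(topics_b)),
            key=lambda pair: pair[1],
        )
        for terms_a in topics_a
    ]


# Made-up top terms for two hypothetical models.
toy_a = [["space", "nasa", "orbit"], ["god", "people", "believe"]]
toy_b = [["god", "jesus", "people"], ["space", "orbit", "launch"]]
matches = best_matches(toy_a, toy_b)
print(matches)  # [(1, 0.5), (0, 0.5)]
```

Topic ids are arbitrary, so the same theme can appear under different ids in the two models; matching by term overlap sidesteps that.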

Train for real (more passes)

In [None]:
%%time
# Fit LDA.
lda_jit = LdaJitModel(corpus, id2word=id2word, num_topics=5, passes=50)

In [None]:
lda_jit.print_topics()

## Find topic from documents

In [None]:
testdata = fetch_20newsgroups(subset='test',  # now switching to the test dataset
                              categories=cats,
                              shuffle=True,
                              remove=('headers', 'footers', 'quotes'),
                              random_state=rand)  # seeded so that the shuffled order is reproducible

In [None]:
idx = 2
doc = testdata.data[idx]
print('expected topic:\n', testdata.target_names[testdata.target[idx]])
print('content:\n', doc[:1000])

# create bag-of-words
bow = id2word.doc2bow(tokenize(doc))
# transform to semantic space
vector = lda[bow]
# get best topic
best_topicid = max(vector, key=lambda x: abs(x[1]))[0]
lda.show_topic(best_topicid)
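
The "best topic" step can be illustrated on its own. The sketch below assumes the transformed document is a sparse list of `(topic_id, weight)` pairs, as the code above treats it. The helper `dominant_topic` and the numbers are hypothetical. Because the weights are probabilities, they are non-negative, so the sketch omits `abs`.

```python
def dominant_topic(topic_vector):
    """Return the (topic_id, weight) pair with the largest weight."""
    return max(topic_vector, key=lambda pair: pair[1])


# Hypothetical sparse topic vector: topics with negligible weight are absent.
toy_vector = [(0, 0.05), (2, 0.80), (4, 0.15)]
topic_id, weight = dominant_topic(toy_vector)
print(topic_id, weight)  # 2 0.8
```

Note that `max` raises `ValueError` on an empty sequence, so callers should handle documents whose vector is empty.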

Assign topics for all documents

In [None]:
from collections import defaultdict

doc_topics = defaultdict(list)

for docid, doc in enumerate(testdata.data):
    bow = id2word.doc2bow(tokenize(doc))
    if bow:  # if not empty doc
        # Get vector in semantic space.
        # Each dimension corresponds to topic.
        vector = lda[bow]   
        # Use the "strongest" topic as the representing topic
        topicid = max(vector, key=lambda x: abs(x[1]))[0]
        doc_topics[topicid].append(docid)

Print assigned topics

In [None]:
for topicid, documents in doc_topics.items():
    print('=' * 80)
    print("Inferred Topic Terms:", lda.print_topic(topicid, topn=5))

    # show up to three documents; a topic may have fewer than three assigned
    for i, docid in enumerate(documents[:3]):
        print(str(i).center(80, '-'))
        print("Expected Category:", testdata.target_names[testdata.target[docid]])
        print("Document:")
        print(testdata.data[docid].lstrip()[:500])
        print()
    print()
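
Reading sample documents gives a qualitative picture. A per-topic tally of expected categories is a compact complement. The sketch below is an optional addition under stated assumptions: `category_counts_per_topic` is a hypothetical helper, and the toy inputs only mimic the shapes of `doc_topics`, `testdata.target` and `testdata.target_names`.

```python
from collections import Counter


def category_counts_per_topic(doc_topics, targets, target_names):
    """Count, for each inferred topic, how many documents fall in each expected category."""
    return {
        topicid: Counter(target_names[targets[docid]] for docid in docids)
        for topicid, docids in doc_topics.items()
    }


# Toy stand-ins: topic id -> document ids, per-document label index, label names.
toy_doc_topics = {0: [0, 2], 3: [1, 3, 4]}
toy_targets = [1, 0, 1, 0, 1]
toy_names = ["comp.graphics", "sci.space"]

summary = category_counts_per_topic(toy_doc_topics, toy_targets, toy_names)
for topicid, counts in summary.items():
    print(topicid, counts.most_common())
```

A topic dominated by a single category suggests the model has separated that category cleanly; a topic with mixed counts suggests overlap.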
