## Using Gensim

### topic modeling
https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html

 - create an id => word mapping, aka dictionary  **gensim.corpora.Dictionary**
 - transform a document into a bag-of-words vector, using a dictionary
 - transform a stream of documents into a stream of vectors **doc2bow**
 - transform between vector streams, using topic models
 - store and save trained models, for persistence
 - use manual and semi-automated methods to evaluate the quality of a topic model
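
To make the first two steps concrete, here is a minimal pure-Python sketch of what an id => word dictionary and a bag-of-words conversion do. The helpers `build_dictionary` and `doc2bow` are hypothetical stand-ins written for illustration; they are not gensim's implementation, and the ids gensim assigns may differ.

```python
from collections import Counter


def build_dictionary(documents):
    """Assign an integer id to every distinct token (hypothetical helper)."""
    token2id = {}
    for doc in documents:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def doc2bow(token2id, doc):
    """Return a sorted list of (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())


documents = [["bank", "river", "water"], ["bank", "money", "bank"]]
token2id = build_dictionary(documents)
id2word = {idx: token for token, idx in token2id.items()}
bows = [doc2bow(token2id, doc) for doc in documents]
print(token2id)
print(bows)
```

Each document becomes a sparse vector: only the ids of words that actually occur are listed, together with their counts.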

### Training LDA
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/lda_training_tips.ipynb

In [None]:
# Read data.

import io
import os

# Folder containing all NIPS papers.
data_dir = 'nipstxt/'

# Folders containing individual NIPS papers.
yrs = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
dirs = ['nips' + yr for yr in yrs]

# Read all texts into a list.
docs = []
for yr_dir in dirs:
    files = os.listdir(data_dir + yr_dir)
    for filen in files:
        # Note: ignoring characters that cause encoding errors.
        # io.open accepts the `errors` argument on both Python 2 and Python 3.
        with io.open(data_dir + yr_dir + '/' + filen, errors='ignore', encoding='utf-8') as fid:
            txt = fid.read()
        docs.append(txt)

### Pre-process and vectorize the documents
Among other things, we will:

 - Split the documents into tokens.
 - Lemmatize the tokens.
 - Compute bigrams.
 - Compute a bag-of-words representation of the data.

First we tokenize the text using a regular expression tokenizer from NLTK. We remove numeric tokens and tokens that are only a single character, as they don't tend to be useful, and the dataset contains a lot of them.

In [None]:
# Tokenize the documents.

from nltk.tokenize import RegexpTokenizer

# Split the documents into tokens.
tokenizer = RegexpTokenizer(r'\w+')
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
    docs[idx] = tokenizer.tokenize(docs[idx])  # Split into words.

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isdigit()] for doc in docs]

# Remove words that are only one character.
docs = [[token.lower() for token in doc if len(token) > 1] for doc in docs]
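
To see the effect of these filters on a single string, here is a small sketch that uses the standard-library `re` module instead of NLTK. The function name `preprocess` and the sample sentence are made up for illustration; the pattern `\w+` is the one used in the cell above.

```python
import re


def preprocess(text):
    """Lowercase, split on runs of word characters, then drop numeric
    tokens and single-character tokens."""
    tokens = re.findall(r"\w+", text.lower())
    tokens = [t for t in tokens if not t.isdigit()]
    return [t for t in tokens if len(t) > 1]


sample = "In 1987 a 2nd-order network was trained on 10 x 10 images"
tokens = preprocess(sample)
print(tokens)
```

Note that "2nd" survives because it contains a digit but is not purely numeric, while "1987", "10", "a" and "x" are dropped.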

We use the WordNet lemmatizer from NLTK. A lemmatizer is preferred over a stemmer in this case because it produces more readable words. Output that is easy to read is very desirable in topic modelling.

In [None]:
# Lemmatize the documents.

from nltk.stem.wordnet import WordNetLemmatizer

# Lemmatize all words in documents.
lemmatizer = WordNetLemmatizer()
docs = [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]

### New Term Topics Methods and Document Coloring
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb

We're setting up our corpus now. We want to show off the new get_term_topics and get_document_topics functionalities, and a good way to do so is to play around with words which might have different meanings in different contexts.

The word bank is a good candidate here, where it can mean either the financial institution or a river bank. In the toy corpus presented, there are 11 documents, 5 river related and 6 finance related.

In [1]:
from gensim.corpora import Dictionary
from gensim.models import ldamodel
import numpy
%matplotlib inline



In [2]:
texts = [['bank','river','shore','water'],
        ['river','water','flow','fast','tree'],
        ['bank','water','fall','flow'],
        ['bank','bank','water','rain','river'],
        ['river','water','mud','tree'],
        ['money','transaction','bank','finance'],
        ['bank','borrow','money'], 
        ['bank','finance'],
        ['finance','money','sell','bank'],
        ['borrow','sell'],
        ['bank','loan','sell']]

# Create the dictionary: a mapping between each word and an integer id, built from a list of documents.
dictionary = Dictionary(texts)

# Convert each document (a list of words) into a bag-of-words vector of (word id, count) pairs.
corpus = [dictionary.doc2bow(text) for text in texts]

In [14]:
print(corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1)], [(1, 1), (3, 1), (4, 1), (5, 1), (6, 1)], [(0, 1), (3, 1), (5, 1), (7, 1)], [(0, 2), (1, 1), (3, 1), (8, 1)], [(1, 1), (3, 1), (6, 1), (9, 1)], [(0, 1), (10, 1), (11, 1), (12, 1)], [(0, 1), (11, 1), (13, 1)], [(0, 1), (10, 1)], [(0, 1), (10, 1), (11, 1), (14, 1)], [(13, 1), (14, 1)], [(0, 1), (14, 1), (15, 1)]]


In [3]:

numpy.random.seed(27) # setting random seed to get the same results each time.
model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=23, iterations = 400, minimum_phi_value=0.05)

In [4]:
model.show_topics()

[(0,
  u'0.184*"water" + 0.150*"river" + 0.147*"bank" + 0.083*"flow" + 0.083*"tree" + 0.050*"fast" + 0.050*"fall" + 0.050*"shore" + 0.050*"mud" + 0.050*"rain"'),
 (1,
  u'0.214*"bank" + 0.134*"money" + 0.134*"sell" + 0.134*"finance" + 0.095*"borrow" + 0.057*"transaction" + 0.057*"loan" + 0.019*"water" + 0.019*"river" + 0.019*"rain"')]

### get_term_topics

The function get_term_topics returns the probability of a particular word belonging to each topic, omitting topics for which that probability falls below a minimum threshold. A few examples:

In [5]:
model.get_term_topics('water')

[(0, 0.17019226)]

In [6]:
model.get_term_topics('finance')

[(1, 0.11727806)]

In [7]:
model.get_term_topics('bank')

[(0, 0.13316578), (1, 0.1992695)]
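
The shape of these results can be mimicked with a simplified sketch. It assumes the rounded word probabilities printed by show_topics above and a hypothetical cut-off of 0.05; gensim computes its own normalisation, so its numbers (for example 0.170 rather than 0.184 for water) differ from the ones below. `term_topics` is an illustrative helper, not a gensim function.

```python
# Word probabilities copied (rounded) from the show_topics() output above.
topic_word = {
    0: {"water": 0.184, "river": 0.150, "bank": 0.147, "flow": 0.083, "tree": 0.083},
    1: {"bank": 0.214, "money": 0.134, "sell": 0.134, "finance": 0.134, "water": 0.019},
}


def term_topics(word, min_probability=0.05):
    """Hypothetical helper: topics where `word` has probability >= min_probability."""
    return [(topic_id, probs[word])
            for topic_id, probs in sorted(topic_word.items())
            if probs.get(word, 0.0) >= min_probability]


print(term_topics("water"))
print(term_topics("finance"))
print(term_topics("bank"))
```

As in the real output, an ambiguous word such as bank is reported for both topics, while water and finance each map to a single topic.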

### get_document_topics and Document Word-Topic Coloring
get_document_topics is an existing gensim method that uses the inference function to compute the sufficient statistics and estimate the topic distribution of a document.

The new addition is the ability to obtain the topic distribution for each word in the document. Let us test this with two documents that both contain the word bank, one in the finance context and one in the river context.

When per_word_topics is set to True, get_document_topics returns, along with the standard document topic proportions, each word type followed by a list of topic ids sorted from most to least likely.

phi_values contains the phi values for each topic for that particular word, scaled by feature length. Phi is essentially the probability of that word in that document belonging to a particular topic. The next few lines should illustrate this.

In [9]:
bow_water = ['bank','water','bank']
bow_finance = ['bank','finance','bank']

In [10]:
bow = model.id2word.doc2bow(bow_water) # convert to bag of words format first
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)

print(doc_topics)
print(word_topics)
print(phi_values)

[(0, 0.74007785), (1, 0.25992215)]
[(0, [0, 1]), (3, [0])]
[(0, [(0, 1.4690419), (1, 0.53095806)]), (3, [(0, 0.99194235)])]
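
One way to read "scaled by feature length" is that the phi values of a word sum to the number of times it occurs in the document: for bank, which appears twice, 1.469 + 0.531 is about 2. The sketch below divides each phi value by the word count to recover a per-occurrence probability. The bag-of-words vector is an assumption derived from the ids in the printed corpus (bank = 0, water = 3).

```python
# Values copied from the output above for the document ['bank', 'water', 'bank'].
bow = [(0, 2), (3, 1)]  # assumed (word id, count): 'bank' twice, 'water' once
phi_values = [(0, [(0, 1.4690419), (1, 0.53095806)]), (3, [(0, 0.99194235)])]

counts = dict(bow)
# Divide each phi value by the word's count to get a per-occurrence probability.
per_occurrence = {
    word_id: [(topic_id, phi / counts[word_id]) for topic_id, phi in topic_phis]
    for word_id, topic_phis in phi_values
}
for word_id, topic_probs in sorted(per_occurrence.items()):
    print(word_id, [(t, round(p, 3)) for t, p in topic_probs])
```

Topics missing from a word's list (topic 1 for water here) are presumably those whose phi fell below the minimum_phi_value of 0.05 passed when the model was created.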


In [11]:

bow = model.id2word.doc2bow(bow_finance) # convert to bag of words format first
doc_topics, word_topics, phi_values = model.get_document_topics(bow, per_word_topics=True)

word_topics

[(0, [1, 0]), (10, [1])]
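
The word_topics lists are what makes document colouring possible: each word can be painted with the colour of its most likely topic. The following is a text-only sketch of that idea. The colour names, the `color_words` helper, and the small id2word mapping (read off the printed corpus) are assumptions for illustration; no plotting is involved.

```python
# Ids read off the printed corpus; colours are an arbitrary, hypothetical choice.
id2word = {0: "bank", 3: "water", 10: "finance"}
topic_colors = {0: "blue", 1: "red"}


def color_words(word_topics):
    """Map each word to the colour of its most likely topic."""
    return {id2word[word_id]: topic_colors[topics[0]]
            for word_id, topics in word_topics if topics}


# Per-word topic rankings copied from the two outputs above.
water_doc = color_words([(0, [0, 1]), (3, [0])])
finance_doc = color_words([(0, [1, 0]), (10, [1])])
print(water_doc)
print(finance_doc)
```

The word bank ends up with a different colour in each document, which is exactly the context sensitivity the per-word topics are meant to expose.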

#### get_document_topics for entire corpus

In [15]:
all_topics = model.get_document_topics(corpus, per_word_topics=True)

for doc_topics, word_topics, phi_values in all_topics:
    print('New Document \n')
    print('Document topics: {}'.format(doc_topics))
    print('Word topics: {}'.format(word_topics))
    print('Phi values: {}'.format(phi_values))
    print(" ")
    print('-------------- \n')

New Document 

Document topics: [(0, 0.88324851), (1, 0.11675149)]
Word topics: [(0, [0, 1]), (1, [0]), (2, [0]), (3, [0])]
Phi values: [(0, [(0, 0.92862684), (1, 0.071373127)]), (1, [(0, 0.99784774)]), (2, [(0, 0.99170369)]), (3, [(0, 0.99827558)])]
 
-------------- 

New Document 

Document topics: [(0, 0.9146843), (1, 0.085315689)]
Word topics: [(1, [0]), (3, [0]), (4, [0]), (5, [0]), (6, [0])]
Phi values: [(1, [(0, 0.998752)]), (3, [(0, 0.99900019)]), (4, [(0, 0.99527323)]), (5, [(0, 0.9975462)]), (6, [(0, 0.99754167)])]
 
-------------- 

New Document 

Document topics: [(0, 0.88262689), (1, 0.11737306)]
Word topics: [(0, [0, 1]), (3, [0]), (5, [0]), (7, [0])]
Phi values: [(0, [(0, 0.92778981), (1, 0.072210215)]), (3, [(0, 0.99825376)]), (5, [(0, 0.99571902)]), (7, [(0, 0.99160117)])]
 
-------------- 

New Document 

Document topics: [(0, 0.88899761), (1, 0.11100236)]
Word topics: [(0, [0, 1]), (1, [0]), (3, [0]), (8, [0])]
Phi values: [(0, [(0, 1.8475267), (1, 0.15247336)]), (1,

In case you want to store doc_topics, word_topics and phi_values for all the documents in the corpus in a variable and later access details of a particular document using its index, it can be done in the following manner:

In [16]:
topics = model.get_document_topics(corpus, per_word_topics=True)
all_topics = [(doc_topics, word_topics, word_phis) for doc_topics, word_topics, word_phis in topics]

In [17]:
print(topics[2])

([(0, 0.88262308), (1, 0.11737685)], [(0, [0, 1]), (3, [0]), (5, [0]), (7, [0])], [(0, [(0, 0.92778468), (1, 0.072215326)]), (3, [(0, 0.99825364)]), (5, [(0, 0.99571872)]), (7, [(0, 0.99160057)])])


In [18]:
for doc in all_topics:
    print('New Document \n')
    print('Document topic: {}'.format(doc[0]))
    print('Word topic: {}'.format(doc[1]))
    print('Phi value: {}'.format(doc[2]))
    print(" ")
    print('-------------- \n')

New Document 

Document topic: [(0, 0.88325179), (1, 0.11674824)]
Word topic: [(0, [0, 1]), (1, [0]), (2, [0]), (3, [0])]
Phi value: [(0, [(0, 0.92863131), (1, 0.071368754)]), (1, [(0, 0.99784786)]), (2, [(0, 0.99170423)]), (3, [(0, 0.99827564)])]
 
-------------- 

New Document 

Document topic: [(0, 0.91467839), (1, 0.085321568)]
Word topic: [(1, [0]), (3, [0]), (4, [0]), (5, [0]), (6, [0])]
Phi value: [(1, [(0, 0.9987517)]), (3, [(0, 0.99899995)]), (4, [(0, 0.99527264)]), (5, [(0, 0.99754578)]), (6, [(0, 0.99754131)])]
 
-------------- 

New Document 

Document topic: [(0, 0.88263309), (1, 0.11736689)]
Word topic: [(0, [0, 1]), (3, [0]), (5, [0]), (7, [0])]
Phi value: [(0, [(0, 0.92779815), (1, 0.072201908)]), (3, [(0, 0.998254)]), (5, [(0, 0.99571955)]), (7, [(0, 0.99160224)])]
 
-------------- 

New Document 

Document topic: [(0, 0.8890214), (1, 0.1109786)]
Word topic: [(0, [0, 1]), (1, [0]), (3, [0]), (8, [0])]
Phi value: [(0, [(0, 1.8475925), (1, 0.15240757)]), (1, [(0, 0.99769

### Evaluation and interpretation

We can compute the topic coherence of each topic. Below we display the average topic coherence and print the topics in order of topic coherence.

Note that we use the "Umass" topic coherence measure here (see docs, https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.top_topics). Gensim has also recently obtained an implementation of the "AKSW" topic coherence measure (see accompanying blog post, http://rare-technologies.com/what-is-topic-coherence/).
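
For intuition, here is a pure-Python sketch of a UMass-style score computed from document co-occurrence counts: for each pair of top words, it adds log((D(w_m, w_k) + 1) / D(w_k)), where D counts the documents containing the given words. The function `umass_coherence` and the toy documents are illustrative; gensim's smoothing and aggregation may differ, so the numbers it reports will not necessarily match. The sketch assumes every top word occurs in at least one document.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence over ordered pairs of top words (sketch)."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for k in range(m):
            joint = doc_freq(top_words[m], top_words[k])
            score += math.log((joint + 1.0) / doc_freq(top_words[k]))
    return score


docs = [["a", "b"], ["a", "b"], ["a"], ["c"]]
coherent = umass_coherence(["a", "b"], docs)    # words that co-occur
incoherent = umass_coherence(["a", "c"], docs)  # words that never co-occur
print(coherent, incoherent)
```

Scores closer to zero indicate top words that tend to appear in the same documents; more negative scores indicate words that rarely co-occur.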

In [23]:
num_topics=2
top_topics = model.top_topics(corpus, topn=5)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / num_topics
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(top_topics)

Average topic coherence: -3.5992.
[([(0.18381169, u'water'),
   (0.15034011, u'river'),
   (0.1473269, u'bank'),
   (0.083456308, u'flow'),
   (0.083444633, u'tree')],
  -3.5424295468056668),
 ([(0.21429448, u'bank'),
   (0.1337216, u'money'),
   (0.13366649, u'sell'),
   (0.13365366, u'finance'),
   (0.095357426, u'borrow')],
  -3.6559046803328732)]


In [28]:
model.show_topic(1, topn=5)  # top 5 words of topic 1 (the finance topic)

[(u'bank', 0.21429448),
 (u'money', 0.1337216),
 (u'sell', 0.13366649),
 (u'finance', 0.13365366),
 (u'borrow', 0.095357426)]

In [29]:
# select top 5 words for each of the 2 LDA topics
top_words = [[word for word, _  in model.show_topic(topicno, topn=5)] for topicno in range(model.num_topics)]
print(top_words)

[[u'water', u'river', u'bank', u'flow', u'tree'], [u'bank', u'money', u'sell', u'finance', u'borrow']]


### topic coherence pipeline
https://nbviewer.jupyter.org/github/dsquareindia/gensim/blob/280375fe14adea67ce6384ba7eabf362b05e6029/docs/notebooks/topic_coherence_tutorial.ipynb

In [31]:
import numpy as np
import logging
import pyLDAvis.gensim
import json
import warnings
warnings.filterwarnings('ignore')  # To ignore all warnings that arise here to enhance clarity

from gensim.models.coherencemodel import CoherenceModel
from gensim.models.ldamodel import LdaModel
from gensim.corpora.dictionary import Dictionary
from numpy import array

#### Set up logging

In [34]:
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
logging.debug("test")

DEBUG:root:test


#### Set up corpus
As stated in table 2 from this paper http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf , this corpus essentially has two classes of documents. The first five are about human-computer interaction and the other four are about graphs. We will be setting up two LDA models: one trained with 50 passes over the corpus and up to 100 iterations per document, and the other with a single pass and a single iteration. The first ("good") model should therefore capture the underlying pattern of the corpus better than the "bad" model, and in theory its topic coherence should be higher.

In [35]:
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

In [36]:
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

INFO:gensim.corpora.dictionary:adding document #0 to Dictionary(0 unique tokens: [])
INFO:gensim.corpora.dictionary:built Dictionary(12 unique tokens: [u'minors', u'graph', u'system', u'trees', u'eps']...) from 9 documents (total 29 corpus positions)


#### Set up two topic models
We'll be setting up two different LDA topic models: a good one and a bad one. To build the "good" topic model, we'll simply train it with more passes and iterations than the bad one. The u_mass coherence should therefore, in theory, be better for the good model than for the bad one, since it should produce more "human-interpretable" topics.

In [60]:
goodLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=100, num_topics=2,passes=50)
badLdaModel = LdaModel(corpus=corpus, id2word=dictionary, iterations=1, num_topics=2)

INFO:gensim.models.ldamodel:using symmetric alpha at 0.5
INFO:gensim.models.ldamodel:using symmetric eta at 0.5
INFO:gensim.models.ldamodel:using serial LDA version on this node
INFO:gensim.models.ldamodel:running online (multi-pass) LDA training, 2 topics, 50 passes over the supplied corpus of 9 documents, updating model once every 9 documents, evaluating perplexity every 9 documents, iterating 100x with a convergence threshold of 0.001000
DEBUG:gensim.models.ldamodel:bound: at document #0
INFO:gensim.models.ldamodel:-3.296 per-word bound, 9.8 perplexity estimate based on a held-out corpus of 9 documents with 29 words
INFO:gensim.models.ldamodel:PROGRESS: pass 0, at document #9/9
DEBUG:gensim.models.ldamodel:performing inference on a chunk of 9 documents
DEBUG:gensim.models.ldamodel:8/9 documents converged within 100 iterations
DEBUG:gensim.models.ldamodel:updating topics
INFO:gensim.models.ldamodel:topic #0 (0.500): 0.177*"system" + 0.115*"eps" + 0.104*"interface" + 0.102*"human" + 0

DEBUG:gensim.models.ldamodel:updating topics
INFO:gensim.models.ldamodel:topic #0 (0.500): 0.226*"system" + 0.137*"eps" + 0.136*"human" + 0.136*"interface" + 0.097*"computer" + 0.090*"user" + 0.031*"trees" + 0.030*"survey" + 0.030*"time" + 0.029*"response"
INFO:gensim.models.ldamodel:topic #1 (0.500): 0.152*"graph" + 0.151*"trees" + 0.109*"minors" + 0.108*"response" + 0.108*"time" + 0.108*"survey" + 0.104*"user" + 0.054*"computer" + 0.039*"system" + 0.023*"interface"
INFO:gensim.models.ldamodel:topic diff=0.017609, rho=0.316228
DEBUG:gensim.models.ldamodel:bound: at document #0
INFO:gensim.models.ldamodel:-2.950 per-word bound, 7.7 perplexity estimate based on a held-out corpus of 9 documents with 29 words
INFO:gensim.models.ldamodel:PROGRESS: pass 9, at document #9/9
DEBUG:gensim.models.ldamodel:performing inference on a chunk of 9 documents
DEBUG:gensim.models.ldamodel:9/9 documents converged within 100 iterations
DEBUG:gensim.models.ldamodel:updating topics
INFO:gensim.models.ldamod

DEBUG:gensim.models.ldamodel:performing inference on a chunk of 9 documents
DEBUG:gensim.models.ldamodel:9/9 documents converged within 100 iterations
DEBUG:gensim.models.ldamodel:updating topics
INFO:gensim.models.ldamodel:topic #0 (0.500): 0.231*"system" + 0.133*"eps" + 0.133*"human" + 0.133*"interface" + 0.108*"computer" + 0.094*"user" + 0.029*"time" + 0.029*"response" + 0.029*"survey" + 0.028*"trees"
INFO:gensim.models.ldamodel:topic #1 (0.500): 0.157*"graph" + 0.157*"trees" + 0.112*"minors" + 0.110*"survey" + 0.110*"response" + 0.110*"time" + 0.101*"user" + 0.044*"computer" + 0.030*"system" + 0.023*"interface"
INFO:gensim.models.ldamodel:topic diff=0.005804, rho=0.229416
DEBUG:gensim.models.ldamodel:bound: at document #0
INFO:gensim.models.ldamodel:-2.946 per-word bound, 7.7 perplexity estimate based on a held-out corpus of 9 documents with 29 words
INFO:gensim.models.ldamodel:PROGRESS: pass 18, at document #9/9
DEBUG:gensim.models.ldamodel:performing inference on a chunk of 9 doc

INFO:gensim.models.ldamodel:PROGRESS: pass 26, at document #9/9
DEBUG:gensim.models.ldamodel:performing inference on a chunk of 9 documents
DEBUG:gensim.models.ldamodel:9/9 documents converged within 100 iterations
DEBUG:gensim.models.ldamodel:updating topics
INFO:gensim.models.ldamodel:topic #0 (0.500): 0.231*"system" + 0.131*"eps" + 0.131*"human" + 0.131*"interface" + 0.113*"computer" + 0.096*"user" + 0.029*"time" + 0.029*"response" + 0.029*"survey" + 0.027*"trees"
INFO:gensim.models.ldamodel:topic #1 (0.500): 0.159*"graph" + 0.159*"trees" + 0.114*"minors" + 0.111*"survey" + 0.111*"response" + 0.111*"time" + 0.099*"user" + 0.038*"computer" + 0.027*"system" + 0.023*"interface"
INFO:gensim.models.ldamodel:topic diff=0.002803, rho=0.188982
DEBUG:gensim.models.ldamodel:bound: at document #0
INFO:gensim.models.ldamodel:-2.945 per-word bound, 7.7 perplexity estimate based on a held-out corpus of 9 documents with 29 words
INFO:gensim.models.ldamodel:PROGRESS: pass 27, at document #9/9
DEBUG

DEBUG:gensim.models.ldamodel:bound: at document #0
INFO:gensim.models.ldamodel:-2.944 per-word bound, 7.7 perplexity estimate based on a held-out corpus of 9 documents with 29 words
INFO:gensim.models.ldamodel:PROGRESS: pass 35, at document #9/9
DEBUG:gensim.models.ldamodel:performing inference on a chunk of 9 documents
DEBUG:gensim.models.ldamodel:9/9 documents converged within 100 iterations
DEBUG:gensim.models.ldamodel:updating topics
INFO:gensim.models.ldamodel:topic #0 (0.500): 0.230*"system" + 0.130*"eps" + 0.130*"human" + 0.130*"interface" + 0.116*"computer" + 0.097*"user" + 0.029*"time" + 0.029*"response" + 0.029*"survey" + 0.027*"trees"
INFO:gensim.models.ldamodel:topic #1 (0.500): 0.160*"graph" + 0.160*"trees" + 0.114*"minors" + 0.112*"survey" + 0.112*"response" + 0.112*"time" + 0.098*"user" + 0.035*"computer" + 0.027*"system" + 0.023*"interface"
INFO:gensim.models.ldamodel:topic diff=0.001494, rho=0.164399
DEBUG:gensim.models.ldamodel:bound: at document #0
INFO:gensim.models

INFO:gensim.models.ldamodel:topic diff=0.000895, rho=0.149071
DEBUG:gensim.models.ldamodel:bound: at document #0
INFO:gensim.models.ldamodel:-2.944 per-word bound, 7.7 perplexity estimate based on a held-out corpus of 9 documents with 29 words
INFO:gensim.models.ldamodel:PROGRESS: pass 44, at document #9/9
DEBUG:gensim.models.ldamodel:performing inference on a chunk of 9 documents
DEBUG:gensim.models.ldamodel:9/9 documents converged within 100 iterations
DEBUG:gensim.models.ldamodel:updating topics
INFO:gensim.models.ldamodel:topic #0 (0.500): 0.229*"system" + 0.129*"eps" + 0.129*"human" + 0.129*"interface" + 0.118*"computer" + 0.097*"user" + 0.030*"time" + 0.030*"response" + 0.029*"survey" + 0.027*"trees"
INFO:gensim.models.ldamodel:topic #1 (0.500): 0.161*"graph" + 0.161*"trees" + 0.115*"minors" + 0.112*"survey" + 0.112*"response" + 0.112*"time" + 0.098*"user" + 0.034*"computer" + 0.026*"system" + 0.023*"interface"
INFO:gensim.models.ldamodel:topic diff=0.000842, rho=0.147442
DEBUG:g

### Using U_Mass Coherence

In [65]:
goodcm = CoherenceModel(model=goodLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')
badcm = CoherenceModel(model=badLdaModel, corpus=corpus, dictionary=dictionary, coherence='u_mass')

DEBUG:gensim.models.coherencemodel:Setting topics to those of the model: LdaModel(num_terms=12, num_topics=2, decay=0.5, chunksize=2000)
DEBUG:gensim.models.coherencemodel:Setting topics to those of the model: LdaModel(num_terms=12, num_topics=2, decay=0.5, chunksize=2000)


In [66]:
print(goodcm)

Coherence_Measure(seg=<function s_one_pre at 0x000000000D8DA358>, prob=<function p_boolean_document at 0x000000000D8E7BA8>, conf=<function log_conditional_probability at 0x000000000D9906D8>, aggr=<function arithmetic_mean at 0x000000000D990908>)


#### Interpreting the topics
As we will see below using LDA visualization, the better model comes up with two topics composed of the following words:

goodLdaModel:
- Topic 1: More weight assigned to words such as "system", "user", "eps", "interface" etc., which captures the first set of documents.
- Topic 2: More weight assigned to words such as "graph", "trees", "survey", which captures the topic in the second set of documents.

badLdaModel:
- Topic 1: More weight assigned to words such as "system", "user", "trees", "graph", which doesn't make the topic clear enough.
- Topic 2: More weight assigned to words such as "system", "trees", "graph", "user", which is similar to the first topic. Hence neither topic is human-interpretable.

Therefore, the topic coherence of goodLdaModel should be greater than that of badLdaModel, since the topics it comes up with are more human-interpretable. We will see this using the u_mass and c_v topic coherence measures.

In [67]:
print(goodLdaModel.show_topics())
print(badLdaModel.show_topics())

[(0, u'0.229*"system" + 0.129*"eps" + 0.129*"human" + 0.129*"interface" + 0.118*"computer" + 0.098*"user" + 0.030*"time" + 0.030*"response" + 0.029*"survey" + 0.027*"trees"'), (1, u'0.161*"graph" + 0.161*"trees" + 0.115*"minors" + 0.112*"survey" + 0.112*"response" + 0.112*"time" + 0.097*"user" + 0.033*"computer" + 0.026*"system" + 0.023*"interface"')]
[(0, u'0.113*"trees" + 0.109*"user" + 0.098*"graph" + 0.093*"system" + 0.088*"minors" + 0.081*"eps" + 0.077*"survey" + 0.076*"time" + 0.072*"computer" + 0.071*"interface"'), (1, u'0.148*"system" + 0.097*"graph" + 0.087*"user" + 0.084*"human" + 0.084*"response" + 0.084*"trees" + 0.075*"interface" + 0.074*"computer" + 0.070*"time" + 0.070*"survey"')]


### Visualize topic models

In [43]:
pyLDAvis.enable_notebook()

In [44]:
pyLDAvis.gensim.prepare(goodLdaModel, corpus, dictionary)

DEBUG:gensim.models.ldamodel:performing inference on a chunk of 9 documents
DEBUG:gensim.models.ldamodel:9/9 documents converged within 50 iterations


In [45]:
pyLDAvis.gensim.prepare(badLdaModel, corpus, dictionary)

DEBUG:gensim.models.ldamodel:performing inference on a chunk of 9 documents
DEBUG:gensim.models.ldamodel:0/9 documents converged within 1 iterations


In [68]:
print(goodcm.get_coherence())
print(badcm.get_coherence())

-14.6431250635
-14.7199176976


### Using C_V coherence

In [62]:
goodcm = CoherenceModel(model=goodLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')
badcm = CoherenceModel(model=badLdaModel, texts=texts, dictionary=dictionary, coherence='c_v')

DEBUG:gensim.models.coherencemodel:Setting topics to those of the model: LdaModel(num_terms=12, num_topics=2, decay=0.5, chunksize=2000)
DEBUG:gensim.models.coherencemodel:Setting topics to those of the model: LdaModel(num_terms=12, num_topics=2, decay=0.5, chunksize=2000)


In [63]:
print(goodcm)
print(badcm)

Coherence_Measure(seg=<function s_one_set at 0x000000000D8DA438>, prob=<function p_boolean_sliding_window at 0x000000000D9904A8>, conf=<function cosine_similarity at 0x000000000D990DD8>, aggr=<function arithmetic_mean at 0x000000000D990908>)
Coherence_Measure(seg=<function s_one_set at 0x000000000D8DA438>, prob=<function p_boolean_sliding_window at 0x000000000D9904A8>, conf=<function cosine_similarity at 0x000000000D990DD8>, aggr=<function arithmetic_mean at 0x000000000D990908>)


In [64]:
print(goodcm.get_coherence())
print(badcm.get_coherence())

INFO:gensim.topic_coherence.probability_estimation:using ParallelWordOccurrenceAccumulator(processes=3, batch_size=64) to estimate probabilities from sliding windows
INFO:gensim.topic_coherence.text_analysis:3 accumulators retrieved from output queue
INFO:gensim.topic_coherence.text_analysis:accumulated word occurrence stats for 9 virtual documents
INFO:gensim.topic_coherence.probability_estimation:using ParallelWordOccurrenceAccumulator(processes=3, batch_size=64) to estimate probabilities from sliding windows


0.383841355374


INFO:gensim.topic_coherence.text_analysis:3 accumulators retrieved from output queue
INFO:gensim.topic_coherence.text_analysis:accumulated word occurrence stats for 9 virtual documents


0.383841355374


#### Conclusion
Hence, as we can see, the u_mass coherence of the good LDA model (-14.64) is higher (better) than that of the bad LDA model (-14.72), although on a corpus this small the gap is narrow and the two c_v values printed above are identical. The good LDA model usually comes up with topics that are more human-interpretable, whereas badLdaModel fails to distinguish between the two topics and produces topics that are not clear to a human. Topic coherence measures such as u_mass and c_v capture this by assigning a number to the interpretability of the topics, so they can be used to compare different topic models based on their human-interpretability.

## Using scikit-learn
Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
http://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py

In [69]:
from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20


def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

In [70]:

# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document or in at least 95% of the documents are removed.

print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))

Downloading 20news dataset. This may take a few minutes.
INFO:sklearn.datasets.twenty_newsgroups:Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
INFO:sklearn.datasets.twenty_newsgroups:Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Loading dataset...
done in 63.461s.


NMF is run on tf-idf features, so a TfidfVectorizer is used to build a tf-idf weighted bag-of-words matrix.

LDA, on the other hand, being a probabilistic graphical model (i.e. dealing with probabilities), only requires raw counts, so a CountVectorizer is used.
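
The difference between the two feature types can be shown on a tiny hand-made corpus. The sketch below builds raw counts and tf-idf rows in plain Python. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 row normalisation, which to my understanding is scikit-learn's default behaviour; the three documents are invented for illustration.

```python
import math
from collections import Counter

docs = [["graph", "trees"], ["graph", "minors", "trees"], ["graph", "user"]]
vocab = sorted({token for doc in docs for token in doc})
n_docs = len(docs)

# Raw term counts: the kind of input LDA works with.
counts = [[Counter(doc)[term] for term in vocab] for doc in docs]

# Smoothed inverse document frequency (assumed formula, see lead-in).
df = {term: sum(term in doc for doc in docs) for term in vocab}
idf = {term: math.log((1.0 + n_docs) / (1.0 + df[term])) + 1.0 for term in vocab}


def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are left as-is)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else row


# tf-idf rows: the kind of input used for NMF.
tfidf = [l2_normalize([c * idf[t] for c, t in zip(row, vocab)]) for row in counts]
print(vocab)
print(counts)
print([[round(x, 3) for x in row] for row in tfidf])
```

A word such as graph, which occurs in every document, keeps its raw count but receives the lowest idf weight, so rarer words dominate the tf-idf rows.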

In [71]:
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
print()

Extracting tf-idf features for NMF...
done in 0.605s.
Extracting tf features for LDA...
done in 0.952s.
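
The max_df=0.95 and min_df=2 arguments passed to both vectorizers implement the document-frequency heuristics described earlier. A rough sketch of that filter, ignoring stop words and max_features, could look like this; `filter_vocabulary` and the toy documents are hypothetical.

```python
def filter_vocabulary(docs, min_df=2, max_df=0.95):
    """Keep terms found in at least `min_df` documents and in at most a
    `max_df` fraction of all documents (hypothetical helper)."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return sorted(t for t, n in df.items() if min_df <= n <= max_df * n_docs)


docs = [["the", "graph", "trees"], ["the", "graph", "minors"],
        ["the", "user", "system"], ["the", "user", "time"]]
kept = filter_vocabulary(docs)
print(kept)
```

Here "the" is dropped because it appears in every document, and the words that occur only once are dropped as too rare.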



In [77]:
# Fit the NMF model
print("Fitting the NMF model (Frobenius norm) with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=2000 and n_features=1000...
done in 0.500s.

Topics in NMF model (Frobenius norm):
Topic #0: just people don think like know time good make way really say right ve want did ll new use years
Topic #1: windows use dos using window program os drivers application help software pc running ms screen files version card code work
Topic #2: god jesus bible faith christian christ christians does heaven sin believe lord life church mary atheism belief human love religion
Topic #3: thanks know does mail advance hi info interested email anybody looking card help like appreciated information send list video need
Topic #4: car cars tires miles 00 new engine insurance price condition oil power speed good 000 brake year models used bought
Topic #5: edu soon com send university internet mit ftp mail cc pub article information hope program mac email home contact blood
Topic #6: file problem files format win sound ftp pub read save sit

In [73]:
# Fit the NMF model
print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
          l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in NMF model (generalized Kullback-Leibler divergence):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=2000 and n_features=1000...
done in 2.915s.


In [75]:
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 4.666s.

Topics in LDA model:
Topic #0: edu com mail send graphics ftp pub available contact university list faq ca information cs 1993 program sun uk mit
Topic #1: don like just know think ve way use right good going make sure ll point got need really time doesn
Topic #2: christian think atheism faith pittsburgh new bible radio games alt lot just religion like book read play time subject believe
Topic #3: drive disk windows thanks use card drives hard version pc software file using scsi help does new dos controller 16
Topic #4: hiv health aids disease april medical care research 1993 light information study national service test led 10 page new drug
Topic #5: god people does just good don jesus say israel way life know true fact time law want believe make think
Topic #6: 55 10 11 18 15 team game 19 period play 23 12 13 flyers 20 25 22 17 24 16
Topic #7: car year just cars new engine like bike good oil i

https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730
<br>https://towardsdatascience.com/improving-the-interpretation-of-topic-models-87fd2ee3847d

In [92]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np

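def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    # H: topic-word matrix (n_topics x n_features); W: document-topic matrix (n_documents x n_topics).
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        # Highest-weighted words for this topic, in descending order of weight.
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
        # Documents with the largest weight on this topic, in descending order of weight.
        top_doc_indices = np.argsort(W[:, topic_idx])[::-1][0:no_top_documents]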
        for doc_index in top_doc_indices:
            print(documents[doc_index])
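
The slicing used in `display_topics` can be hard to read at a glance. The following is a minimal sketch of the same ranking idiom on small made-up matrices. The values in `H` and `W`, the shortened `feature_names` list, and the helper names `top_word_indices` and `top_doc_indices` are assumptions for illustration and do not come from the fitted models.

```python
import numpy as np

# Hypothetical vocabulary and factor matrices (illustrative values only).
feature_names = ["graph", "trees", "user", "time", "system"]
H = np.array([[0.9, 0.7, 0.0, 0.1, 0.2],   # topic 0: weight per word
              [0.0, 0.1, 0.8, 0.6, 0.3]])  # topic 1: weight per word
W = np.array([[0.9, 0.0],                  # document 0: weight per topic
              [0.1, 0.7],
              [0.5, 0.2],
              [0.0, 0.4]])


def top_word_indices(topic, n):
    # argsort() sorts ascending; the slice [:-n - 1:-1] walks backwards
    # from the end and stops after n items, giving the n largest first.
    return topic.argsort()[:-n - 1:-1]


def top_doc_indices(W, topic_idx, n):
    # Sort one column of W ascending, reverse it, and keep the first n.
    return np.argsort(W[:, topic_idx])[::-1][:n]


top_words = [[feature_names[i] for i in top_word_indices(topic, 2)] for topic in H]
top_docs = [top_doc_indices(W, k, 2).tolist() for k in range(H.shape[0])]

print(top_words)  # [['graph', 'trees'], ['user', 'time']]
print(top_docs)   # [[0, 2], [1, 3]]
```

Both slices produce the same kind of result (indices of the largest entries, largest first); the function above simply uses one form for words and the other for documents.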

In [93]:
# Single line documents from http://web.eecs.utk.edu/~berry/order/node4.html#SECTION00022000000000000000
documents = [
            "Human machine interface for Lab ABC computer applications",
            "A survey of user opinion of computer system response time",
            "The EPS user interface management system",
            "System and human system engineering testing of EPS",
            "Relation of user-perceived response time to error measurement",
            "The generation of random, binary, unordered trees",
            "The intersection graph of paths in trees",
            "Graph minors IV: Widths of trees and quasi-ordering",
            "Graph minors: A survey"
            ]

# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

# LDA is a probabilistic graphical model over word counts, so it is given raw term counts rather than tf-idf weights
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names()

no_topics = 2

# Run NMF
nmf_model = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
nmf_W = nmf_model.transform(tfidf)
nmf_H = nmf_model.components_

# Run LDA
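# n_components is the parameter name used by the LDA cell earlier in this notebook (older releases called it n_topics).
lda_model = LatentDirichletAllocation(n_components=no_topics, max_iter=5,
                                      learning_method='online',
                                      learning_offset=50.,
                                      random_state=0).fit(tf)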
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

no_top_words = 4
no_top_documents = 4
display_topics(nmf_H, nmf_W, tfidf_feature_names, documents, no_top_words, no_top_documents)
display_topics(lda_H, lda_W, tf_feature_names, documents, no_top_words, no_top_documents)

Topic 0:
trees graph minors survey
Graph minors IV: Widths of trees and quasi-ordering
The intersection graph of paths in trees
The generation of random, binary, unordered trees
Graph minors: A survey
Topic 1:
user time response interface
A survey of user opinion of computer system response time
Relation of user-perceived response time to error measurement
The EPS user interface management system
Human machine interface for Lab ABC computer applications
Topic 0:
user response time computer
A survey of user opinion of computer system response time
Relation of user-perceived response time to error measurement
The EPS user interface management system
Human machine interface for Lab ABC computer applications
Topic 1:
trees graph human minors
Graph minors IV: Widths of trees and quasi-ordering
Graph minors: A survey
The intersection graph of paths in trees
The generation of random, binary, unordered trees
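
The comments in the cell above note that NMF is given tf-idf features while LDA is given raw term counts. As a rough illustration of how the two representations differ, here is a small pure-Python sketch over three of the sample documents. It assumes a textbook weighting, `tf * ln(n_docs / df)`, with no stop-word removal; this is not necessarily the exact formula that `TfidfVectorizer` applies, and the helper names `tokenize` and `tfidf` are hypothetical.

```python
import math
import re
from collections import Counter

# Three of the sample documents used above.
sample_docs = [
    "The intersection graph of paths in trees",
    "Graph minors: A survey",
    "A survey of user opinion of computer system response time",
]


def tokenize(text):
    # Lowercase and keep alphabetic runs only (simplified tokenizer).
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(d) for d in sample_docs]

# Raw term counts: non-negative integers, one Counter per document.
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term appears.
n_docs = len(tokenized)
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(doc_counts):
    # Textbook variant (an assumption): count times log inverse document frequency.
    return {t: c * math.log(n_docs / df[t]) for t, c in doc_counts.items()}


weights = [tfidf(c) for c in counts]

print(counts[2]["of"], counts[2]["user"])       # integer counts
print(weights[2]["survey"], weights[2]["user"])  # real-valued weights
```

In the third document, "user" appears in only one document and so receives a larger weight than "survey", which appears in two, even though both occur once. The counts stay integers, while the tf-idf values are real numbers.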
