**SESSION 7. TOPIC MODELS: EVALUATION AND ANALYSIS**

Get modules ready and available:

In [1]:
import sys

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import  CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

from pprint import pprint

# Plotting tools
!{sys.executable} -m pip install pyLDAvis
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)



**NOTE:** We reuse the text normalization function from last time. It is defined in the Text_Normalization_Function.ipynb notebook, which should be in the same folder as the notebook you are using right now.

Running that notebook with the line below makes the function available here:

In [2]:
%run ./Text_Normalization_Function.ipynb

Collecting html.parser
  Downloading https://files.pythonhosted.org/packages/fa/ae/4b752c60868d26d6d14e89882ade7204fd73543e1bde64b6e9b01c1d9856/html-parser-0.2.tar.gz
Building wheels for collected packages: html.parser
  Building wheel for html.parser (setup.py) ... done
  Stored in directory: /Users/corrine/Library/Caches/pip/wheels/f5/5e/9f/dbce0d6a89f44b3f30fba0a9b1b24a288882ea2e235e515d7b
Successfully built html.parser
Installing collected packages: html.parser
Successfully installed html.parser
Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  ['<', 'p', '>', 'The', 'circus', 'dog', 'in', 'a', 'plissé', 'skirt', 'jumped', 'over', 'Python', 'who', 'was', "n't", 'that', 'large', ',', 'just', '3', 'feet', 'long.', '<', '/p', '>']
Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  <p>The circus dog in a plissé skirt jumped over 

Make sure the text normalization function is working properly by running it on the test corpus below:

In [3]:
test_text = "<p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>"
test_corpus = [test_text]
test_corpus.append(test_text)
normalized_test_corpus = normalize_corpus(test_corpus)

print("Original corpus:  ", test_corpus,"\n")
print("Processed corpus: ", normalized_test_corpus)

Original corpus:   ["<p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>", "<p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>"] 

Processed corpus:  ['circus dog plisse skirt jump python large foot long', 'circus dog plisse skirt jump python large foot long']


Define a function for getting keywords (words with highest weights) from the estimated topic model:

In [4]:
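def get_topic_words(vectorizer, lda_model, n_words):
    """Return the n_words highest-weighted words for each topic of a fitted LDA model."""
    # NOTE: newer scikit-learn releases replace get_feature_names() with get_feature_names_out().
    keywords = np.array(vectorizer.get_feature_names())
    topic_words = []
    for topic_weights in lda_model.components_:
        # Negate the weights so that argsort lists the largest weights first.
        top_word_locs = (-topic_weights).argsort()[:n_words]
        topic_words.append(keywords.take(top_word_locs).tolist())
    return topic_words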
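
To see what the function above does, here is a minimal sketch on a made-up weight matrix with 2 topics and 5 words. The vocabulary and weights are invented for illustration; they stand in for the vectorizer's feature names and for lda_model.components_.

```python
import numpy as np

# Hypothetical vocabulary and topic-word weights (2 topics x 5 words).
toy_vocab = np.array(["bible", "god", "image", "nasa", "space"])
toy_components = np.array([
    [0.1, 0.2, 3.0, 4.0, 5.0],   # a "space/graphics"-like topic
    [6.0, 7.0, 0.3, 0.1, 0.2],   # a "religion"-like topic
])

toy_topic_words = []
for topic_weights in toy_components:
    # Negating the weights makes argsort return indices from largest to smallest weight.
    top_word_locs = (-topic_weights).argsort()[:3]
    toy_topic_words.append(toy_vocab.take(top_word_locs).tolist())

print(toy_topic_words)
```

Each inner list holds the top 3 words of one topic, ordered from the highest weight down.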

**** TOPIC MODELING: NEWS ****

The dataset here is the one we used for doing classification. 
The newsgroup posts cover 4 topics: atheism, religion, computer graphics, and space science. 
Of course, we will not use this information for topic modeling.

Download and set up the data:

In [5]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
dataset = fetch_20newsgroups(shuffle=True, 
                             random_state=1, 
                             categories = categories, 
                             remove=('headers', 'footers', 'quotes'))
news_corpus = dataset.data

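Normalize the corpus and create a "bag-of-words" representation of the data. Note that CountVectorizer() is called with its default settings below, so the vocabulary is not capped; pass max_features=1000 to limit it to the 1000 most frequent terms.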

NOTE: It will take a couple of minutes to get the data ready! 

In [6]:
normalized_corpus_news = normalize_corpus(news_corpus)
bow_vectorizer_news = CountVectorizer()
bow_news_corpus = bow_vectorizer_news.fit_transform(normalized_corpus_news)
bow_feature_names_news = bow_vectorizer_news.get_feature_names()

Set number of topics:

In [7]:
no_topics_news = 3

Run the topic model (LDA). NOTE: It will take a couple of minutes for the estimation to finish!

In [8]:
lda_news = LatentDirichletAllocation(n_components=no_topics_news, max_iter=100,random_state = 42).fit(bow_news_corpus)



Display results:

In [9]:
no_top_words_news = 10
topic_words = get_topic_words(vectorizer = bow_vectorizer_news, 
                              lda_model = lda_news, 
                              n_words = no_top_words_news)
pd.DataFrame(topic_words, 
             columns = ["word_" + str(i) for i in range(no_top_words_news)],
             index = ["Topic_" + str(i) for i in range(len(topic_words))]) 

Unnamed: 0,word_0,word_1,word_2,word_3,word_4,word_5,word_6,word_7,word_8,word_9
Topic_0,space,image,use,file,program,system,data,nasa,launch,edu
Topic_1,people,think,god,know,like,good,use,believe,thing,even
Topic_2,jesus,matthew,god,ra,word,christian,men,day,greek,john


Display the word vectors (words are in alphabetical order) for each topic. Each column is a topic, and the weights in each column sum to 1:

In [10]:
word_weights = lda_news.components_ / lda_news.components_.sum(axis=1)[:, np.newaxis]
word_weights_df = pd.DataFrame(word_weights.T, 
                               index = bow_feature_names_news, 
                               columns = ["Topic_" + str(i) for i in range(no_topics_news)])
word_weights_df.head(10)

Unnamed: 0,Topic_0,Topic_1,Topic_2
000062david42,1.3e-05,4e-06,1.4e-05
000100255pixel,1.3e-05,4e-06,1.4e-05
000usd,2.3e-05,4e-06,1.4e-05
001200201pixel,1.3e-05,4e-06,1.4e-05
00index,1.3e-05,4e-06,1.4e-05
00pm,2.2e-05,4e-06,1.5e-05
01a,1.3e-05,4e-06,1.4e-05
023b,2.5e-05,4e-06,1.4e-05
04g,1.3e-05,4e-06,1.4e-05
054589e,1.3e-05,4e-06,1.4e-05
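
The normalization in the cell above divides each row of components_ by its row sum, so every topic's word weights add up to 1. A small sketch with invented numbers (toy_components is a hypothetical stand-in for lda_news.components_):

```python
import numpy as np

# Hypothetical unnormalized topic-word weights (2 topics x 3 words).
toy_components = np.array([[2.0, 6.0, 2.0],
                           [1.0, 1.0, 2.0]])

# Row sums reshaped to a column, shape (2, 1), so the division broadcasts across each row.
row_sums = toy_components.sum(axis=1)[:, np.newaxis]
toy_word_weights = toy_components / row_sums

print(toy_word_weights)
print(toy_word_weights.sum(axis=1))  # each topic's weights now sum to 1
```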


Now, sort by word weights in Topic 0 (descending order) and look at the weights of the 10 highest-weighted words in Topic 0:

In [11]:
word_weights_df.sort_values(by='Topic_0',ascending=False).head(10)

Unnamed: 0,Topic_0,Topic_1,Topic_2
space,0.010149,0.000285,1.6e-05
image,0.008256,0.000194,1.8e-05
use,0.006876,0.004251,2.9e-05
file,0.005271,0.00017,1.5e-05
program,0.005198,0.000125,1.4e-05
system,0.004487,0.001968,1.5e-05
data,0.004383,6.8e-05,3e-05
nasa,0.00408,5.4e-05,1.4e-05
launch,0.003944,3.1e-05,1.4e-05
edu,0.003907,0.000265,0.000847


Now, sort by word weights in Topic 1 (descending order) and look at the weights of the 10 highest-weighted words in Topic 1:

In [12]:
word_weights_df.sort_values(by='Topic_1',ascending=False).head(10)

Unnamed: 0,Topic_0,Topic_1,Topic_2
people,0.000416,0.007939,2.6e-05
think,0.000479,0.007441,0.0001
god,1.3e-05,0.007198,0.004349
know,0.002233,0.006194,0.000448
like,0.001948,0.005216,0.000126
good,0.001758,0.004366,0.000504
use,0.006876,0.004251,2.9e-05
believe,0.000136,0.004238,3.9e-05
thing,0.000673,0.004225,0.000955
even,0.000537,0.004203,0.000212


Let's assign a dominant topic to each document in our corpus. 

To do this, we need each document to be represented as a bag-of-words. 
Each word in the document is associated with some topic. Word weights in a word vector for a topic provide a measure of that association.
E.g., if you sum the weights for Topic 0 across all words in a document, multiplying each weight by the word's frequency, you'll get a measure of association of that document with Topic 0.

The .transform method does that for you in Python and returns normalized values, so each document's topic weights sum to 1:

In [13]:
lda_news_output = lda_news.transform(bow_news_corpus)

Create a nice-looking dataframe:

In [14]:
doc_names = ["Doc_" + str(i) for i in range(len(normalized_corpus_news))]
topic_names = ["Topic_" + str(i) for i in range(no_topics_news)]
df_document_topic = pd.DataFrame(np.round(lda_news_output, 4), columns=topic_names, index=doc_names)
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic
df_document_topic[0:4]

Unnamed: 0,Topic_0,Topic_1,Topic_2,dominant_topic
Doc_0,0.0004,0.4359,0.5637,2
Doc_1,0.8054,0.1928,0.0018,0
Doc_2,0.0094,0.9461,0.0445,1
Doc_3,0.9879,0.006,0.0061,0
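
The intuition described above (weights summed over a document's words, then normalized) can be sketched with invented numbers. This is only the intuition, not the inference procedure that .transform actually runs; all arrays below are hypothetical.

```python
import numpy as np

# Hypothetical word weights: rows are topics, columns are words.
toy_word_weights = np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.3, 0.6]])
# Hypothetical bag-of-words counts: rows are documents, columns are words.
toy_bow = np.array([[3, 1, 0],
                    [0, 1, 4]])

# Frequency-weighted sum of word weights for each (document, topic) pair.
raw_scores = toy_bow @ toy_word_weights.T
# Normalize so that each document's topic scores sum to 1.
toy_doc_topic = raw_scores / raw_scores.sum(axis=1, keepdims=True)
# The dominant topic is the column with the largest score in each row.
toy_dominant_topic = toy_doc_topic.argmax(axis=1)

print(np.round(toy_doc_topic, 4))
print(toy_dominant_topic)
```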


**** INTERACTIVE TOPIC VISUALIZATION ****

You can visualize the topics: topic size, frequency of words in a topic versus the whole corpus, etc. You can also rank words (terms) in a topic by relevancy: do you want rare and exclusive terms (i.e., found mostly in that topic) OR terms that are used frequently in that topic but are not necessarily exclusive to it?

Relevancy weight parameter is λ (0 ≤ λ ≤ 1): you can adjust it!

* small λ highlights potentially rare, but exclusive terms for the selected topic;
* large values of λ (near 1) highlight frequent, but not necessarily exclusive, terms for the selected topic;

Relevancy is measured as: 

    Relevancy = λ log[p(term | topic)] + (1 - λ) log[p(term | topic)/p(term)], 
   
   where p(term | topic) stands for term (word) weight, i.e. frequency, in a topic and p(term) stands for term's weight (frequency) in a corpus.
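
To make the role of λ concrete, here is a minimal sketch of the relevancy formula above applied to two terms. The probabilities are invented for illustration, and the helper name relevance is hypothetical (it is not part of pyLDAvis).

```python
import numpy as np

def relevance(p_term_given_topic, p_term, lam):
    """Relevancy = lam * log p(term|topic) + (1 - lam) * log[p(term|topic) / p(term)]."""
    p_term_given_topic = np.asarray(p_term_given_topic, dtype=float)
    p_term = np.asarray(p_term, dtype=float)
    return lam * np.log(p_term_given_topic) + (1 - lam) * np.log(p_term_given_topic / p_term)

toy_terms = ["use", "nasa"]
p_w_given_t = [0.02, 0.01]   # "use" is the more frequent term inside the topic...
p_w = [0.02, 0.002]          # ...but it is just as common corpus-wide; "nasa" is rare overall

top_term_by_lambda = {}
for lam in (1.0, 0.0):
    scores = relevance(p_w_given_t, p_w, lam)
    top_term_by_lambda[lam] = toy_terms[int(np.argmax(scores))]

print(top_term_by_lambda)
```

With λ = 1 the ranking depends only on in-topic frequency, so the frequent term wins; with λ = 0 it depends only on the ratio p(term | topic)/p(term), so the exclusive term wins.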

Additional information on how to use this visualization:
* http://www.kennyshirley.com/LDAvis/
* https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf



In [15]:
pyLDAvis.enable_notebook()
visualization_panel = pyLDAvis.sklearn.prepare(lda_news, bow_news_corpus, bow_vectorizer_news, mds='tsne')
visualization_panel

**LOG-LIKELIHOOD, PERPLEXITY AND COHERENCE SCORES**

Log-likelihood, perplexity, and coherence scores do not have a baseline or a threshold value. 
They are used to compare and discriminate between models estimated on the same data.

We will use CoherenceModel() from the gensim module to compute coherence scores for our LDA topic model.

In [17]:
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary



CoherenceModel() needs:
* an array of topics in the form [["cat","dog","python"],["java","python","ruby"]],
* a corpus with each document represented as a bag-of-words, and
* a dictionary of the corpus.
    
We will create those elements now using a tokenized corpus:

In [18]:
news_corpus_tokenized = [tokenize_text(doc) for doc in normalized_corpus_news]
news_dictionary = Dictionary(news_corpus_tokenized)
news_corpus_bow = [news_dictionary.doc2bow(doc) for doc in news_corpus_tokenized]

Now estimate an LDA topic model with 4 topics and extract the top 20 keywords for each topic:

In [54]:
no_topics_news = 4
no_keywords_news = 20
lda_news = LatentDirichletAllocation(n_components=no_topics_news, max_iter=100,random_state = 42).fit(bow_news_corpus) 
topic_keywords = get_topic_words(vectorizer = bow_vectorizer_news, lda_model = lda_news, n_words=no_keywords_news)
pd.DataFrame(topic_keywords, 
             columns = ["word_" + str(i) for i in range(no_keywords_news)],
             index = ["Topic_" + str(i) for i in range(len(topic_keywords))]) 



Unnamed: 0,word_0,word_1,word_2,word_3,word_4,word_5,word_6,word_7,word_8,word_9,word_10,word_11,word_12,word_13,word_14,word_15,word_16,word_17,word_18,word_19
Topic_0,space,image,use,file,program,data,system,launch,nasa,edu,available,software,satellite,graphic,format,jpeg,include,ftp,orbit,information
Topic_1,think,people,like,know,god,could,use,well,good,thing,time,take,believe,even,point,way,many,give,much,post
Topic_2,p2,den,p3,p1,com,de,men,van,radius,navy,presentation,bob,edu,het,double,vice,dr,material,een,stay
Topic_3,jesus,god,christian,bible,know,people,word,matthew,day,law,even,child,point,good,man,christ,time,ra,think,many


Now let's compute the coherence score for the model:

In [56]:
cm = CoherenceModel(topics=topic_keywords, 
                    corpus = news_corpus_bow , 
                    dictionary = news_dictionary, coherence='u_mass')
print("Coherence score for the model: ", np.round(cm.get_coherence(),4))  # get coherence value

Coherence score for the model:  -3.9419


You can also see a coherence score by topic:

In [57]:
print("Coherence score by topic: ", np.round(cm.get_coherence_per_topic(),4))

Coherence score by topic:  [ -1.8577  -1.2833 -11.1032  -1.5232]
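
A topic scores low on u_mass when its keywords rarely appear together in the same documents. The sketch below illustrates this with a hand-rolled version of the measure. It assumes the commonly cited formulation, log[(D(w_i, w_j) + 1) / D(w_i)] averaged over keyword pairs, where D counts documents; gensim's implementation differs in smoothing details, so the numbers will not match CoherenceModel exactly. The function name and the toy documents are hypothetical.

```python
import math

def umass_coherence(topic_words, tokenized_docs):
    """Average of log[(D(w_i, w_j) + 1) / D(w_i)] over pairs with w_i ranked above w_j."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = []
    for j in range(1, len(topic_words)):
        for i in range(j):
            w_i, w_j = topic_words[i], topic_words[j]  # w_i is the higher-ranked word
            scores.append(math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_i)))
    return sum(scores) / len(scores)

toy_docs = [["space", "nasa", "launch"],
            ["space", "nasa", "orbit"],
            ["god", "jesus", "bible"],
            ["god", "bible", "space"]]

coherent_score = umass_coherence(["space", "nasa"], toy_docs)
incoherent_score = umass_coherence(["space", "jesus"], toy_docs)
print(coherent_score, incoherent_score)
```

The pair that co-occurs in documents gets a score near zero, while the pair that never co-occurs gets a clearly negative score, which mirrors the gap between Topic_2 and the other topics above.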


**EXERCISE 1**

Compare the coherence-scores-based evaluation for models with 2, 3, and 4 topics with your human-judgment-based evaluation of those models. What do you find? 

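The coherence scores are -1.4506, -3.7562, and -3.9419 for the models with 2, 3, and 4 topics, respectively. Since u_mass scores closer to zero indicate more coherent topics, the score favors the 2-topic model, which doesn't agree with the human-judgment-based evaluation.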

Log-Likelihood Score:

In [36]:
print("Log-Likelihood (higher values are better): ", lda_news.score(bow_news_corpus))

Log-Likelihood (higher values are better):  -1661737.6873378977


Perplexity Score:

In [37]:
print("Perplexity (lower values are better): ", lda_news.perplexity(bow_news_corpus))

Perplexity (lower values are better):  4315.084570393613
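
Perplexity and log-likelihood are two views of the same quantity. The sketch below assumes the relation perplexity = exp(-log-likelihood / number of word tokens), which is how scikit-learn is understood to derive one from the other; treat that relation as an assumption. It uses the two values printed above to back out the implied token count.

```python
import math

log_likelihood = -1661737.6873378977   # value printed by lda_news.score(...) above
perplexity = 4315.084570393613         # value printed by lda_news.perplexity(...) above

# Assumed relation: perplexity = exp(-log_likelihood / n_tokens),
# so the token count implied by the two printed values is:
implied_n_tokens = -log_likelihood / math.log(perplexity)
print("Implied number of word tokens:", round(implied_n_tokens))

# Sanity check: plugging the implied count back in reproduces the perplexity.
recomputed_perplexity = math.exp(-log_likelihood / implied_n_tokens)
print("Recomputed perplexity:", recomputed_perplexity)
```

If the assumption holds, the implied count should be close to bow_news_corpus.sum(), the total number of word occurrences in the bag-of-words matrix.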


**EXERCISE 2**

Compare the perplexity and log-likelihood evaluation for models with 2, 3, and 4 topics with your human-judgment-based evaluation of those models. What do you find? 

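The log-likelihood is about -1664.7k, -1661.7k, and -1660.9k with 2, 3, and 4 topics, and the perplexity is 4379, 4315, and 4298, respectively. Both measures improve as the number of topics grows (log-likelihood rises, perplexity falls), which agrees with the human-judgment-based evaluation.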

**EXERCISE 3**

Write a simple script that selects the best model automatically. 
You select the criterion for the "best model" (log-likelihood, perplexity, or coherence score). 
You can vary both the parameter alpha and the number of topics, or just the number of topics.

In [60]:
best, best_perplexity = None, float("inf")
for no_topics_news in range(2, 5):
    lda_news = LatentDirichletAllocation(n_components=no_topics_news, max_iter=100, random_state=42).fit(bow_news_corpus)
    # Compute perplexity once per model; lower values are better.
    current_perplexity = lda_news.perplexity(bow_news_corpus)
    if current_perplexity < best_perplexity:
        best_perplexity, best = current_perplexity, lda_news
topic_keywords = get_topic_words(vectorizer = bow_vectorizer_news, lda_model = best, n_words=no_keywords_news)
pd.DataFrame(topic_keywords, 
             columns = ["word_" + str(i) for i in range(no_keywords_news)],
             index = ["Topic_" + str(i) for i in range(len(topic_keywords))]) 



Unnamed: 0,word_0,word_1,word_2,word_3,word_4,word_5,word_6,word_7,word_8,word_9,word_10,word_11,word_12,word_13,word_14,word_15,word_16,word_17,word_18,word_19
Topic_0,space,image,use,file,program,data,system,launch,nasa,edu,available,software,satellite,graphic,format,jpeg,include,ftp,orbit,information
Topic_1,think,people,like,know,god,could,use,well,good,thing,time,take,believe,even,point,way,many,give,much,post
Topic_2,p2,den,p3,p1,com,de,men,van,radius,navy,presentation,bob,edu,het,double,vice,dr,material,een,stay
Topic_3,jesus,god,christian,bible,know,people,word,matthew,day,law,even,child,point,good,man,christ,time,ra,think,many


In [61]:
best

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=100,
             mean_change_tol=0.001, n_components=4, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=42,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)
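
The selection loop can be generalized to any criterion. Below is a minimal sketch of such a helper; the name select_best and its arguments are hypothetical. It is applied to the scores reported in Exercises 1 and 2 rather than to refitted models, to show that the "best" model depends on the criterion you pick.

```python
def select_best(candidates, score_fn, higher_is_better=False):
    """Return (candidate, score) with the best score according to score_fn."""
    best_candidate, best_score = None, None
    for candidate in candidates:
        score = score_fn(candidate)
        if best_score is None:
            is_better = True
        elif higher_is_better:
            is_better = score > best_score
        else:
            is_better = score < best_score
        if is_better:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score

# Scores reported above for models with 2, 3, and 4 topics.
reported_perplexity = {2: 4379, 3: 4315, 4: 4298}
reported_coherence = {2: -1.4506, 3: -3.7562, 4: -3.9419}

# Perplexity: lower is better. Coherence (u_mass): higher (closer to zero) is better.
best_by_perplexity = select_best(reported_perplexity, reported_perplexity.get)
best_by_coherence = select_best(reported_coherence, reported_coherence.get,
                                higher_is_better=True)

print("Best by perplexity:", best_by_perplexity)
print("Best by coherence: ", best_by_coherence)
```

In the notebook itself, the candidates could be numbers of topics (or pairs of number of topics and alpha), with score_fn fitting a model and returning the chosen score.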