# LAB 4. TOPIC MODELING - ANSWERS

### **<font color=green>INSTRUCTIONS:</font>** <br> 

**<font color=green> 1. Look for EXERCISES in the script (3 in total).</font>** <br>

**<font color=green> 2. Each student INDIVIDUALLY uploads this script with their answers embedded to Canvas by the end of the lab session or by Wednesday, 11:59pm CT (St. Louis time).</font>** 

### Lab Objectives

1. Learn how to estimate a topic model in Python (using the sklearn package)
2. Get familiar with the output of a topic model
3. Visualize topics in a text corpus
4. Evaluate and discriminate between topic models

### Session Prep
Below we install the modules we need and define the text normalization function we used in Lab 3, as well as two additional functions we need for today only.

**Important:** Make sure the Text_Normalization_Function.ipynb file is in the same directory as the current notebook

In [1]:
#the module 'sys' allows installing modules from inside Jupyter
import sys

!{sys.executable} -m pip install numpy
import numpy as np

!{sys.executable} -m pip install pandas
import pandas as pd

#Natural Language ToolKit (NLTK)
!{sys.executable} -m pip install nltk
import nltk

!{sys.executable} -m pip install scikit-learn #the package is published on PyPI as 'scikit-learn' and imported as 'sklearn'
from sklearn import metrics
#from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import  CountVectorizer #bag-of-words vectorizer 
from sklearn.decomposition import LatentDirichletAllocation #package for LDA

# Plotting tools

from pprint import pprint
!{sys.executable} -m pip install pyLDAvis #visualizing LDA
import pyLDAvis
import pyLDAvis.sklearn

import matplotlib.pyplot as plt
%matplotlib inline

#define text normalization function
%run ./Text_Normalization_Function.ipynb #defining text normalization function

#ignore warnings about future changes in functions as they take too much space
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)





[nltk_data] Downloading package stopwords to /Users/lilia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/lilia/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/lilia/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /Users/lilia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  ['<', 'p', '>', 'The', 'circus', 'dog', 'in', 'a', 'plissé', 'skirt', 'jumped', 'over', 'Python', 'who', 'was', "n't", 'that', 'large', ',', 'just', '3', 'feet', 'long.', '<', '/p', '>']
Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  <p>The circus dog in a plissé skirt jumped over Python who was not that large, just 3 feet long.</p>
Original:   <p>The circus dog in a plissé skirt jumped over Python who wasn't that large, just 3 feet long.</p>
Processed:  [('<', 'a'), ('p', 'n'), ('>', 'v'), ('the', None), ('circus', 'n'), ('dog', 'n'), ('in', None), ('a', None), ('plissé', 'n'), ('skirt', 'n'), ('jumped', 'v'), ('over', None), ('python', 'n'), ('who', None), ('was', 'v'), ("n't", 'r'), ('that', None), ('large', 'a'), (',', None), ('just', 'r'), ('3', None), ('feet', 'n'), ('long.', 'a'), 

Below we define two functions that will display the results of fitting a topic model, to be used later:

*Note: these functions are not the focus of the lab, therefore we'll not be discussing them, but you are welcome to explore and dig into them later if you prefer*

In [2]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        
def get_topic_words(vectorizer, lda_model, n_words):
    keywords = np.array(vectorizer.get_feature_names())
    topic_words = []
    for topic_weights in lda_model.components_:
        top_word_locs = (-topic_weights).argsort()[:n_words]
        topic_words.append(keywords.take(top_word_locs).tolist())
    return topic_words
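
The two helpers above pick the top words of a topic with two different `argsort` idioms. The small sketch below, using made-up weights that are not part of the lab data, suggests that both idioms select the same indices when there are no ties:

```python
import numpy as np

# Hypothetical weights of four words in one topic (not taken from the lab data).
topic = np.array([0.1, 0.5, 0.2, 0.9])
n = 2

# Idiom used in display_topics: sort ascending, then walk backwards n steps.
top_a = topic.argsort()[:-n - 1:-1]

# Idiom used in get_topic_words: sort the negated weights, keep the first n.
top_b = (-topic).argsort()[:n]

print(top_a.tolist(), top_b.tolist())  # [3, 1] [3, 1]
```

With tied weights the two idioms may order the tied words differently, which is harmless for displaying top words.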

### Toy Data Example
We'll start with working on a toy dataset. It will allow us to grasp the full results of a topic model before moving to high-dimensional realistic data.

#### Define and Prep Toy Data
Let's use a toy corpus on animals and programming, similar to the one in the lecture. Let's define it:

In [3]:
toy_corpus = ["The fox jumps over the dog", 
              "The fox is very clever and quick", 
              "The dog is slow and lazy", 
              "The cat is smarter than the fox and the dog but it can never learn Java", 
              "Python is an excellent programming language", 
              "Java and Ruby are other programming languages", 
              "Python and Java are very popular programming languages", 
              "Python programs are smaller than Java programs"]

Let's **normalize** our toy_corpus and call the normalized corpus **normalized_toy_corpus**:

In [5]:
normalized_toy_corpus = normalize_corpus(toy_corpus) 
normalized_toy_corpus

['fox jump dog',
 'fox clever quick',
 'dog slow lazy',
 'cat smarter fox dog never learn java',
 'python excellent programming language',
 'java ruby programming language',
 'python java popular programming language',
 'python program small java program']

Since for topic modeling we need text data in the **Bag-of-Words** representation, let's **vectorize** our normalized_toy_corpus and call it **bow_toy_corpus**:

In [6]:
#define the bag-of-words vectorizer:
bow_vectorizer = CountVectorizer()

#vectorize the normalized data:
bow_toy_corpus = bow_vectorizer.fit_transform(normalized_toy_corpus)

In [8]:
bow_toy_corpus

<8x20 sparse matrix of type '<class 'numpy.int64'>'
	with 33 stored elements in Compressed Sparse Row format>

Have a look at the Bag-of-Words representation of our corpus: **It never hurts to know what your data look like :)** Note the absence of stopwords and other differences from the raw data:

In [9]:
pd.DataFrame(data = bow_toy_corpus.todense(), columns = bow_vectorizer.get_feature_names())

Unnamed: 0,cat,clever,dog,excellent,fox,java,jump,language,lazy,learn,never,popular,program,programming,python,quick,ruby,slow,small,smarter
0,0,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
3,1,0,1,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,1
4,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,0,0,0,0
5,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,1,0,0,0
6,0,0,0,0,0,1,0,1,0,0,0,1,0,1,1,0,0,0,0,0
7,0,0,0,0,0,1,0,0,0,0,0,0,2,0,1,0,0,0,1,0
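
To see where the counts in this table come from, here is a minimal hand-rolled sketch for two of the normalized documents. It only mimics the counting step; it ignores CountVectorizer's own tokenization rules, so treat it as an illustration rather than a replacement:

```python
from collections import Counter

# Two normalized documents copied from normalized_toy_corpus above.
docs = ["fox jump dog", "python program small java program"]

# Vocabulary: all distinct tokens in alphabetical order (one column per word).
vocab = sorted({word for doc in docs for word in doc.split()})

# One row of counts per document.
rows = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)
for row in rows:
    print(row)
```

The second document gets a count of 2 for "program", matching the only entry larger than 1 in the table above.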


#### Topic Model (via Latent Dirichlet Allocation) on Toy Data
Now let's **model topics** in our toy data. Given that the toy corpus is so small, we know all "topics" it contains (**what are they?**) and it will be easy for us:<br> 1) to check if the topic model results make sense; <br>2) see all the results that the topic model produces.  <br><br>
We will be using the **LatentDirichletAllocation** function which we already imported earlier (see Session Prep). The function has the following **parameters** to be set:
1. Number of topics to model: **n_components**
2. Parameter vector for the Dirichlet distribution for *topics*: **doc_topic_prior**
3. Parameter vector for the Dirichlet distribution for *words* in a topic: **topic_word_prior**

Notes on **parameter vectors for the Dirichlet distributions**: <br>
1. Although the Dirichlet distribution parameters are represented by a **vector**, for simplicity we provide one number for each parameter vector. For example, if we set the number of topics to 2 (n_components=2), the parameter vector for the Dirichlet distribution for *topics* should be a two-dimensional vector. We set doc_topic_prior=0.5 and the LatentDirichletAllocation function internally creates a two-dimensional vector (0.5,0.5). Similar logic applies to the parameter vector for the Dirichlet distribution for *words*.<br><br>
2. Remember that we need **sparsity** in the distribution of topics across documents (i.e., some documents have a zero probability of containing some of the topics) and *sparsity* in the distribution of words in topics (i.e., some words have a zero probability of being present in some topics). To induce sparsity, we need to set doc_topic_prior and topic_word_prior between 0 and 1.
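
The effect of the prior on sparsity can be illustrated by sampling directly from a symmetric Dirichlet distribution. This is a sketch under assumptions: the number of topics (4), the sample size and the alpha values are chosen only for illustration, and the samples stand in for document-topic mixtures rather than coming from a fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_topics = 4
mean_max = {}

for alpha in (0.1, 1.0, 10.0):
    # Each row is one simulated topic mixture; every row sums to 1.
    samples = rng.dirichlet([alpha] * n_topics, size=1000)
    # Average weight of the largest topic: close to 1 means one topic dominates.
    mean_max[alpha] = samples.max(axis=1).mean()
    print(f"alpha={alpha}: average largest topic weight = {mean_max[alpha]:.2f}")
```

Values of alpha below 1 tend to push most of the mass onto one or two topics (the remaining weights are close to zero), while large values spread the mass almost evenly.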

Now, let's set the parameters and estimate the topic model:

In [10]:
lda_toy_corpus = LatentDirichletAllocation(n_components=2, max_iter=500,
                                           doc_topic_prior = 0.9,
                                           topic_word_prior = 0.9).fit(bow_toy_corpus)

Display results by showing 15 **most frequent (top)** words for each topic (we use **function display_topics** defined in Session Prep):

In [11]:
no_top_words = 15
display_topics(lda_toy_corpus, bow_vectorizer.get_feature_names(), no_top_words)

Topic 0:
java python language programming program popular small excellent ruby fox dog clever quick lazy slow
Topic 1:
dog fox smarter never learn cat jump lazy slow quick clever java programming language python


Note that topics do not have names or labels. **Topics are just collections of words**, following the definition of a topic in text mining. <br><br> To be precise, topics are **word vectors**, where each vector element is the **weight** (relative frequency) of the word in a topic. Let's have a look at those "word vectors". Can you see below that each word vector (topic) lies on a **simplex**, i.e., its weights are non-negative and sum to 1?

In [12]:
word_weights = lda_toy_corpus.components_ / lda_toy_corpus.components_.sum(axis=1)[:, np.newaxis]
pd.DataFrame(word_weights.T, index = bow_vectorizer.get_feature_names()).T

Unnamed: 0,cat,clever,dog,excellent,fox,java,jump,language,lazy,learn,never,popular,program,programming,python,quick,ruby,slow,small,smarter
0,0.025956,0.026564,0.026766,0.050982,0.026766,0.109888,0.026407,0.105756,0.026564,0.025956,0.025956,0.051208,0.07848,0.105756,0.105892,0.026564,0.050859,0.026564,0.051163,0.025956
1,0.055088,0.054435,0.113486,0.028202,0.113486,0.053817,0.054604,0.028622,0.054435,0.055088,0.055088,0.027959,0.028293,0.028622,0.028476,0.054435,0.028333,0.054435,0.028007,0.055088
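
The simplex property can be checked numerically. The sketch below applies the same row normalization as the cell above to a small hypothetical matrix that stands in for `components_`:

```python
import numpy as np

# Hypothetical unnormalized topic-word weights: 2 topics x 3 words.
components = np.array([[4.0, 1.0, 1.0],
                       [0.5, 0.5, 3.0]])

# Same normalization as in the cell above: divide each row by its sum.
word_weights = components / components.sum(axis=1)[:, np.newaxis]

print(word_weights.sum(axis=1))    # one sum per topic, each equal to 1
print((word_weights >= 0).all())   # no weight is negative
```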


### **<font color=green>EXERCISE 1:</font>**

**<font color=green>1.1. Adjust doc_topic_prior (alpha) and topic_word_prior (beta) and observe the changes in topic representation.</font>** 

**<font color=green>1.2. You are likely to be not 100% satisfied with the model performance, even after all the adjustments. The word "java" might still be appearing among the top words in the "animals" topic. Why? Looking at the text corpus might help to find the answer.</font>**

**Answer 1.1:** <br>

Code:

In [15]:
lda_toy_corpus = LatentDirichletAllocation(n_components=2, max_iter=500,
                                           doc_topic_prior = 0.8,
                                           topic_word_prior = 0.8).fit(bow_toy_corpus)
display_topics(lda_toy_corpus, bow_vectorizer.get_feature_names(), no_top_words)
word_weights = lda_toy_corpus.components_ / lda_toy_corpus.components_.sum(axis=1)[:, np.newaxis]
pd.DataFrame(word_weights.T, index = bow_vectorizer.get_feature_names()).T

Topic 0:
java python language programming program popular small excellent ruby dog fox lazy slow clever quick
Topic 1:
fox dog smarter never learn cat jump quick clever lazy slow java programming language python


Unnamed: 0,cat,clever,dog,excellent,fox,java,jump,language,lazy,learn,never,popular,program,programming,python,quick,ruby,slow,small,smarter
0,0.024198,0.024631,0.024706,0.051596,0.024706,0.113631,0.024541,0.10986,0.024631,0.024198,0.024198,0.051761,0.080823,0.10986,0.109955,0.024631,0.051511,0.024631,0.051735,0.024198
1,0.055741,0.055275,0.118095,0.026268,0.118095,0.053886,0.055372,0.026492,0.055275,0.055741,0.055741,0.026091,0.026277,0.026492,0.02639,0.055275,0.026359,0.055275,0.026118,0.055741


Discussion: 

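Lowering alpha and beta from 0.9 to 0.8 leaves the set of top words in each topic unchanged and only reorders words with tied weights (e.g., "dog" and "fox"). The weights themselves become slightly more concentrated: in Topic 0 the weight of "java" rises from 0.1099 to 0.1136, while the weight of the off-topic word "cat" falls from 0.0260 to 0.0242.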

**Answer 1.2:**

Discussion:

The corpus has only 8 documents, and one of them, "The cat is smarter than the fox and the dog but it can never learn Java", is the only document that contains both animal and programming words. "Java" is the only programming word in that document, so it co-occurs with animal words and receives some weight in the "animals" topic, while the other programming words do not.

### Topic Modeling on Real Data

The dataset here is the one we used for Text Classification in Lab 3. The newsgroup posts cover 4 topics: **atheism, religion, computer graphics, and space science**. Of course, we will *not use* class label information for topic modeling.

Download and set up the data (**news_corpus**):

In [16]:
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
dataset = fetch_20newsgroups(shuffle=True, 
                             random_state=1, 
                             categories = categories, 
                             remove=('headers', 'footers', 'quotes'))
news_corpus = dataset.data

Now, let's normalize the corpus and create the Bag-of-Words representation of the data. We'll limit the number of features to **1000 most frequent features** to compute the topic model faster. 

In [17]:
#normalize data
normalized_corpus_news = normalize_corpus(news_corpus)

#define a Bag-of-Words vectorizer
bow_vectorizer_news = CountVectorizer(max_features=1000)

#vectorize data
bow_news_corpus = bow_vectorizer_news.fit_transform(normalized_corpus_news)

Now let's fit the topic model. We need to **set the number of topics** first. We are *lucky to know* that there are **4 topics** (atheism, religion, computer graphics, and space science) and it will allow us to judge the performance of the topic model better.

**Note**: It will take a couple of minutes for the estimation to finish. The larger the number of iterations (max_iter) you allow for, the longer it takes.

In [18]:
lda_news = LatentDirichletAllocation(n_components=4, max_iter=100,
                                     doc_topic_prior = 0.25,
                                     topic_word_prior = 0.25).fit(bow_news_corpus)

Display results with top 10 words for each topic:

In [19]:
no_top_words_news = 10
display_topics(lda_news, bow_vectorizer_news.get_feature_names(), no_top_words_news)

Topic 0:
space nasa launch satellite orbit mission earth year system use
Topic 1:
think people like know could good time well thing take
Topic 2:
god jesus christian believe religion bible atheist argument belief atheism
Topic 3:
image file use edu program software graphic format jpeg data


Display **word vectors** (words are in alphabetical order) for each topic. Each column is a topic:

In [20]:
word_weights = lda_news.components_ / lda_news.components_.sum(axis=1)[:, np.newaxis]
word_weights_df = pd.DataFrame(word_weights.T, 
                               index = bow_vectorizer_news.get_feature_names(), 
                               columns = ["Topic_" + str(i) for i in range(4)])
word_weights_df.head(10)

Unnamed: 0,Topic_0,Topic_1,Topic_2,Topic_3
3d,1e-05,6e-06,1e-05,0.00505
able,0.000273,0.002062,1e-05,0.001274
ac,0.000412,7e-06,4.4e-05,0.001232
accept,1.1e-05,0.001713,0.002364,0.000295
acceptable,0.000315,0.000729,0.000145,7.4e-05
access,0.000956,0.000197,1.1e-05,0.001925
accord,0.000318,0.000771,0.0017,9e-06
account,0.000102,0.000421,0.001012,5.5e-05
act,0.000929,0.000439,0.00324,1e-05
action,0.000124,0.00217,0.00037,3.8e-05


Now, **sort by word weights in Topic 0** (descending order) and see the weights by 10 most frequent words in Topic 0:

In [21]:
word_weights_df.sort_values(by='Topic_0',ascending=False).head(10)

Unnamed: 0,Topic_0,Topic_1,Topic_2,Topic_3
space,0.041186,7e-06,1e-05,0.000869
nasa,0.016677,7e-06,1e-05,9e-06
launch,0.016113,7e-06,1e-05,9e-06
satellite,0.011472,6e-06,1e-05,9e-06
orbit,0.010221,7e-06,1e-05,9e-06
mission,0.009662,7e-06,0.000123,9e-06
earth,0.009652,0.000414,0.00101,9e-06
year,0.009609,0.005065,0.000331,9e-06
system,0.00887,0.005121,1e-05,0.007598
use,0.008455,0.006959,0.005839,0.016321


### Topic Model Visualization

You can **visualize** the topics: topic size, frequency of words in a topic and so on.

In this visualization, you can rank words in a topic by **relevancy**: do you want rare and exclusive terms (i.e. found mostly in that topic) or terms that are used frequently in that topic, but not always exclusive to that topic? Relevancy parameter is λ (0 ≤ λ ≤ 1). You can adjust it:

* small λ highlights potentially rare, but exclusive terms for the selected topic;
* large values of λ (near 1) highlight frequent, but not necessarily exclusive, terms for the selected topic;

Relevancy is measured as: 

    Relevancy = λ log[p(term | topic)] + (1 - λ) log[p(term | topic)/p(term)], 
   
   where p(term | topic) stands for word weight in a topic and p(term) stands for word's weight in a corpus.
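
A small sketch of how λ changes the ranking, following the formula above. The two terms and their probabilities are hypothetical, chosen so that one term is frequent but common across the corpus and the other is rarer but nearly exclusive to the topic:

```python
import math

def relevancy(p_term_given_topic, p_term, lam):
    """Relevancy of a term for a topic, following the formula above."""
    return (lam * math.log(p_term_given_topic)
            + (1 - lam) * math.log(p_term_given_topic / p_term))

# Hypothetical (p(term | topic), p(term)) pairs, not taken from the lab output.
terms = {"use": (0.016, 0.010),    # frequent in the topic, common everywhere
         "jpeg": (0.004, 0.001)}   # rarer in the topic, but almost exclusive

rankings = {}
for lam in (0.2, 1.0):
    rankings[lam] = sorted(terms, key=lambda t: relevancy(*terms[t], lam),
                           reverse=True)
    print(lam, rankings[lam])
```

With λ = 1 the frequent term "use" ranks first; with λ = 0.2 the exclusive term "jpeg" overtakes it.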

Additional information on how to use this visualization:
* http://www.kennyshirley.com/LDAvis/
* https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf

We installed the **pyLDAvis** module required for this visualization in Session Prep. Now let's use it:

In [22]:
#prepare to display result in the Jupyter notebook
pyLDAvis.enable_notebook()

#run the visualization [mds selects the method used to lay out the "distance" between topics]
pyLDAvis.sklearn.prepare(lda_news, bow_news_corpus, bow_vectorizer_news, mds='tsne')

### **<font color=green>EXERCISE 2:</font>**
    
**<font color=green>2.1. Fit a topic model with 3 topics (n_components = 3). The script is provided below. Important! Note that the model with three topics is called lda_news_3_topics.</font>**

**<font color=green>2.2. Use the visualization tool to answer the following question: Which topic is the most common / largest topic in the corpus? Can you give a name to that topic? List 5 most relevant and exclusive terms for that topic (with lambda = 0.2).</font>**

**<font color=green>2.3. You fit the model with 3 topics, but you know that the dataset has four classes (topics). Which 2 of the 4 classes ('atheism', 'religion','computer graphics', 'space science') were grouped together into one topic by the topic model when you fit it with 3 topics? Why?</font>**

Your answer (need to add lines of code related to visualization):

**Answer 2.1**:

Code:

In [23]:
#fit the LDA model with 3 topics
lda_news_3_topics = LatentDirichletAllocation(n_components=3, max_iter=100,
                                     doc_topic_prior = 0.25,
                                     topic_word_prior = 0.25).fit(bow_news_corpus)


Discussion:

**Answer 2.2:**

Code:

In [25]:
display_topics(lda_news_3_topics, bow_vectorizer_news.get_feature_names(), no_top_words_news)

Topic 0:
god people think know like jesus good thing believe even
Topic 1:
image file use edu program software graphic format jpeg data
Topic 2:
space nasa launch year satellite orbit system use earth mission


In [24]:
#prepare to display result in the Jupyter notebook
pyLDAvis.enable_notebook()

#run the visualization [mds selects the method used to lay out the "distance" between topics]
pyLDAvis.sklearn.prepare(lda_news_3_topics, bow_news_corpus, bow_vectorizer_news, mds='tsne')

Discussion:

The topics are:
Topic 0:
god people think know like jesus good thing believe even
Topic 1:
image file use edu program software graphic format jpeg data
Topic 2:
space nasa launch year satellite orbit system use earth mission

We can name them "beliefs", "computer graphics" and "space science".

**Answer 2.3:**

Atheism and religion are grouped together.
Both classes are about faith and belief, so their posts share much of the same vocabulary (e.g., "god", "jesus", "believe").

### How To Find Dominant Topic in a Document

Each document typically contains several topics. One of the topics is **dominant**, i.e. it is the largest topic in the document. That topic gives you an answer to the question: **What is this document about?** In other words, the document's dominant topic **summarizes** the document. 

Let's assign a dominant topic to **each document** in our corpus. Weights in a word vector for a topic provide a measure of association for the word with the topic. If you sum weights for a particular topic across all words in a document, you'll get the weight of that topic in the document.

The **.transform** method of our fitted model **lda_news** computes the weights of each topic in documents: 

In [27]:
lda_news_topic_weights = lda_news.transform(bow_news_corpus)

Let's convert lda_news_topic_weights into a nice-looking dataframe and have a look at the computed topic weights in documents:

In [28]:
#array of document "names" and topic "names" ("names" are just indices)
doc_names = ["Doc_" + str(i) for i in range(len(normalized_corpus_news))]
topic_names = ["Topic_" + str(i) for i in range(4)]

#convert to dataframe
df_document_topic = pd.DataFrame(np.round(lda_news_topic_weights, 4), columns=topic_names, index=doc_names)
df_document_topic.head(5)

Unnamed: 0,Topic_0,Topic_1,Topic_2,Topic_3
Doc_0,0.0005,0.0005,0.9986,0.0004
Doc_1,0.0193,0.3822,0.0026,0.5959
Doc_2,0.0106,0.3423,0.6364,0.0106
Doc_3,0.2103,0.0107,0.0108,0.7682
Doc_4,0.0422,0.0424,0.0417,0.8737


You can see that in document Doc_0 the **dominant topic** is Topic_2 as it has the weight of 0.9986. The weights across the 4 topics sum up to 1. Let's add a column that shows dominant topic for each document:

In [29]:
#vector of indices for columns with the highest value by each row in df_document_topic
dominant_topic = np.argmax(df_document_topic.values, axis=1)

#add dominant_topic as a column to df_document_topic
df_document_topic['dominant_topic'] = dominant_topic
df_document_topic.head(5)

Unnamed: 0,Topic_0,Topic_1,Topic_2,Topic_3,dominant_topic
Doc_0,0.0005,0.0005,0.9986,0.0004,2
Doc_1,0.0193,0.3822,0.0026,0.5959,3
Doc_2,0.0106,0.3423,0.6364,0.0106,2
Doc_3,0.2103,0.0107,0.0108,0.7682,3
Doc_4,0.0422,0.0424,0.0417,0.8737,3
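
Once every document has a dominant topic, a natural follow-up is to count how many documents each topic dominates. The sketch below does this for the five documents shown in the table above; using `np.bincount` is one possible choice and not part of the original lab code.

```python
import numpy as np

# Topic weights of Doc_0 ... Doc_4, copied from the table above.
weights = np.array([[0.0005, 0.0005, 0.9986, 0.0004],
                    [0.0193, 0.3822, 0.0026, 0.5959],
                    [0.0106, 0.3423, 0.6364, 0.0106],
                    [0.2103, 0.0107, 0.0108, 0.7682],
                    [0.0422, 0.0424, 0.0417, 0.8737]])

# Dominant topic per document: index of the largest weight in each row.
dominant = weights.argmax(axis=1)

# Number of documents dominated by each topic (minlength keeps empty topics).
counts = np.bincount(dominant, minlength=weights.shape[1])

print(dominant.tolist())  # [2, 3, 2, 3, 3]
print(counts.tolist())    # [0, 0, 2, 3]
```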


### Topic Model Evaluation: Log-likelihood, Perplexity and Coherence Scores

Log-likelihood, Perplexity and Coherence Score are **measures of performance** for a topic model. They are used for comparing and discriminating between topic models estimated on the same data. Log-likelihood, perplexity and coherence scores **do not have** a baseline or a threshold value and therefore are useful only for comparing models. 

How do you specify different models? You can set **different number of topics** and also play with the **parameters of the Dirichlet distributions**. 

#### Coherence Score

We will use a function **CoherenceModel()** from the **gensim** module (you can also explore that package as it can be used to estimate an LDA model). The sklearn module does not have the functionality to compute the coherence score. Let's install the gensim package and the functions needed:

In [30]:
!{sys.executable} -m pip install gensim
import gensim

from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary



The function CoherenceModel() needs as **inputs**:

**1. Dictionary of the corpus**<br>
**2. Corpus with each document represented as Bag-of-Words**<br>
**3. An array of top words for each topic: we'll have top 20 words for each topic** 
  
We will now create those objects:

In [31]:
#tokenizing the corpus
news_corpus_tokenized = [tokenize_text(normalized_corpus_news[doc_id]) for doc_id in range(len(normalized_corpus_news))]

#Dictionary of the corpus:
news_dictionary = Dictionary(news_corpus_tokenized)

#Bag-of-words representation for each document of the corpus:
news_corpus_bow = [news_dictionary.doc2bow(doc) for doc in news_corpus_tokenized]

#top 20 words for each topic (using the function defined in session prep)
topic_topwords = get_topic_words(vectorizer = bow_vectorizer_news, lda_model = lda_news, n_words=20)

Now let's compute **the coherence score for the model overall**. We use one of the coherence metrics "u-mass" which measures semantic similarity of words in a topic, but there are other metrics as well.

*Note: You can check out different coherence metrics here if you are interested: https://dl.acm.org/doi/abs/10.1145/2684822.2685324*
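
To build some intuition for what a "u_mass"-style score rewards, here is a simplified sketch on the toy corpus. It averages log((D(w_i, w_j) + 1) / D(w_j)) over pairs of top words, where D counts the documents containing the given words. This is an assumption-laden illustration: the helper names `doc_freq` and `umass_like` are made up, and gensim's implementation may differ in details such as smoothing and aggregation.

```python
import math
from itertools import combinations

# Tokenized toy documents, copied from normalized_toy_corpus.
doc_sets = [set(d.split()) for d in [
    "fox jump dog", "fox clever quick", "dog slow lazy",
    "cat smarter fox dog never learn java",
    "python excellent programming language",
    "java ruby programming language",
    "python java popular programming language",
    "python program small java program"]]

def doc_freq(*words):
    """Number of documents that contain all the given words."""
    return sum(all(w in s for w in words) for s in doc_sets)

def umass_like(top_words):
    """Average of log((D(w_i, w_j) + 1) / D(w_j)) over pairs of top words,
    where w_j is ranked before w_i in the list."""
    scores = [math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
              for w_j, w_i in combinations(top_words, 2)]
    return sum(scores) / len(scores)

coherent = umass_like(["java", "python", "programming"])  # words that co-occur
mixed = umass_like(["java", "fox", "lazy"])               # words that rarely do
print(round(coherent, 3), round(mixed, 3))
```

Word lists whose members appear in the same documents score closer to zero, which is why higher (less negative) values are read as more coherent.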

In [32]:
cm = CoherenceModel(topics=topic_topwords, 
                    corpus = news_corpus_bow , 
                    dictionary = news_dictionary, coherence='u_mass')
print("Coherence score for the model: ", np.round(cm.get_coherence(), 4))  # get coherence value

Coherence score for the model:  -1.4361


You can also see **coherence scores by topic**:

In [33]:
print("Coherence score by topic (higher values are better): ", np.round(cm.get_coherence_per_topic(),4))

Coherence score by topic (higher values are better):  [-1.3688 -1.315  -1.3276 -1.7331]


#### Log-Likelihood Score

To compute the log-likelihood score we use the **.score** method of our fitted LDA model:

In [34]:
print("Log-Likelihood (higher values are better): ", lda_news.score(bow_news_corpus))

Log-Likelihood (higher values are better):  -741280.999399245


#### Perplexity Score

To compute the Perplexity score we use the **.perplexity** method of our fitted LDA model:

In [35]:
print("Perplexity (lower values are better): ", lda_news.perplexity(bow_news_corpus))

Perplexity (lower values are better):  574.401639146991
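
The two scores are closely linked. Assuming perplexity is computed as exp(-log-likelihood / number of word tokens), which is how scikit-learn appears to define it but should be checked against your installed version, the numbers reported in this lab can be cross-checked with plain arithmetic. The sketch below infers the token count from the 4-topic scores above and uses it to predict the perplexity of the 3-topic model from its log-likelihood (both 3-topic values are copied from the output in Answer 3.2 further down).

```python
import math

ll_4, perp_4 = -741280.999399245, 574.401639146991     # 4-topic model (above)
ll_3, perp_3 = -743322.6031640773, 584.5410049797958   # 3-topic model (Answer 3.2)

# Token count implied by the assumed relationship for the 4-topic model.
n_tokens = -ll_4 / math.log(perp_4)

# Predict the 3-topic perplexity from its log-likelihood and the same count.
perp_3_predicted = math.exp(-ll_3 / n_tokens)

print(round(n_tokens), round(perp_3_predicted, 2), round(perp_3, 2))
```

If the assumption holds, the predicted and reported perplexities should agree closely, which also shows that on a fixed corpus the two metrics always rank models in the same order.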


### **<font color=green>EXERCISE 3</font>**

**<font color=green>Compare the coherence score, perplexity score and the log-likelihood for models with 2, 3, and 4 topics with your human-judgment-based evaluation of those models. What do you find? </font>**

**<font color=green>What you need to do:</font>**

**<font color=green>3.1. For model with 4 topics - All code work is done: The model and evaluation metrics are already computed above. You just need to look up the values for the coherence, perplexity and log-likelihood for the model with 4 topics above and discuss what you observe. You might be interested in looking at coherence score by topic as well;</font>**

**<font color=green>3.2. For model with 3 topics - The model is computed in Exercise 2. You need to compute the perplexity, log-likelihood and coherence scores for the model with 3 topics (the lines for the coherence score are provided below) and discuss your results;</font>**

**<font color=green>3.3. For model with 2 topics - You need to fit the model with 2 topics and compute all 3 evaluation metrics; discuss your results.</font>**

**Answer 3.1:**

Discussion:

Model with 4 topics, doc_topic_prior = 0.25, topic_word_prior = 0.25:
Log-Likelihood (higher values are better):  -740888.159696916
Perplexity (lower values are better):  572.4709222471453
Coherence score for the model: (higher values are better) -1.461



In [42]:
#test
#Fit LDA with 4 topics:
lda_news_test = LatentDirichletAllocation(n_components=4, max_iter=100,
                                     doc_topic_prior = 0.3,
                                     topic_word_prior = 0.3).fit(bow_news_corpus)


#Log-Likelihood:
print("Log-Likelihood (higher values are better): ", lda_news_test.score(bow_news_corpus))

#Perplexity score:
print("Perplexity (lower values are better): ", lda_news_test.perplexity(bow_news_corpus))


#Coherence score for 4 topics:
topic_topwords_test = get_topic_words(vectorizer = bow_vectorizer_news, lda_model = lda_news_test, n_words=20)
cm_2_test = CoherenceModel(topics=topic_topwords_test, 
                             corpus = news_corpus_bow, 
                             dictionary = news_dictionary, coherence='u_mass')

#Overall coherence score for the model:
print("Coherence score for the model: (higher values are better)", np.round(cm_2_test.get_coherence(), 3))  


Log-Likelihood (higher values are better):  -741177.8069456882
Perplexity (lower values are better):  573.8938422547003
Coherence score for the model: (higher values are better) -1.448


**Answer 3.2:**

Code (complete the lines):

In [36]:
#Log-Likelihood (add code):
print("Log-Likelihood (higher values are better): ", lda_news_3_topics.score(bow_news_corpus))

#Perplexity score (add code):
print("Perplexity (lower values are better): ", lda_news_3_topics.perplexity(bow_news_corpus))

#Coherence score for 3 topics:
topic_topwords_3_topics = get_topic_words(vectorizer = bow_vectorizer_news, lda_model = lda_news_3_topics, n_words=20)
cm_3_topics = CoherenceModel(topics=topic_topwords_3_topics, 
                             corpus = news_corpus_bow, 
                             dictionary = news_dictionary, coherence='u_mass')

#Overall coherence score for the model:
print("Coherence score for the model: (higher values are better)", np.round(cm_3_topics.get_coherence(), 3))  

Log-Likelihood (higher values are better):  -743322.6031640773
Perplexity (lower values are better):  584.5410049797958
Coherence score for the model: (higher values are better) -1.442


Discussion:

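The model with 4 topics fits the data better than the model with 3 topics on two of the three metrics: its log-likelihood is higher (about -741,000 vs. -743,322.60) and its perplexity is lower (about 573 vs. 584.54). The coherence scores are too close to separate the two models: -1.442 for 3 topics vs. values between -1.436 and -1.461 across the 4-topic runs.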

**Answer 3.3:**

Code:

In [38]:
display_topics(lda_news_2_topics, bow_vectorizer_news.get_feature_names(), no_top_words_news)

Topic 0:
god people think know like jesus thing good believe even
Topic 1:
space image use file program system data edu nasa launch


In [37]:
#Fit LDA with 2 topics:
lda_news_2_topics = LatentDirichletAllocation(n_components=2, max_iter=100,
                                     doc_topic_prior = 0.25,
                                     topic_word_prior = 0.25).fit(bow_news_corpus)


#Log-Likelihood:
print("Log-Likelihood (higher values are better): ", lda_news_2_topics.score(bow_news_corpus))

#Perplexity score:
print("Perplexity (lower values are better): ", lda_news_2_topics.perplexity(bow_news_corpus))


#Coherence score for 2 topics:
topic_topwords_2_topics = get_topic_words(vectorizer = bow_vectorizer_news, lda_model = lda_news_2_topics, n_words=20)
cm_2_topics = CoherenceModel(topics=topic_topwords_2_topics, 
                             corpus = news_corpus_bow, 
                             dictionary = news_dictionary, coherence='u_mass')

#Overall coherence score for the model:
print("Coherence score for the model: (higher values are better)", np.round(cm_2_topics.get_coherence(), 3))  


Log-Likelihood (higher values are better):  -751830.7869127416
Perplexity (lower values are better):  628.7592216537156
Coherence score for the model: (higher values are better) -1.518


Discussion:

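The model with 2 topics performs worst on all three metrics: it has the lowest log-likelihood (-751,830.79), the highest perplexity (628.76) and the lowest coherence score (-1.518). This agrees with human judgment, since its Topic 1 mixes computer graphics and space science words.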

**Overall discussion for EXERCISE 3**:



<br>**NOTE:** Generally, you can write a simple script that selects the best topic model **automatically** based on a criterion for "best model" (log-likelihood, perplexity, or coherence score). The script can vary both parameters of the Dirichlet distributions and the number of topics, or just the number of topics.
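
A minimal sketch of such a script is shown below. It varies only the number of topics and uses perplexity as the criterion. The function name `select_n_topics`, the prior of 0.5, the fixed seed and the tiny stand-in corpus are all assumptions made for illustration; in the lab you would pass `bow_news_corpus` instead.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus: the normalized toy corpus from earlier in the lab.
docs = ["fox jump dog", "fox clever quick", "dog slow lazy",
        "cat smarter fox dog never learn java",
        "python excellent programming language",
        "java ruby programming language",
        "python java popular programming language",
        "python program small java program"]
bow = CountVectorizer().fit_transform(docs)

def select_n_topics(bow, candidates, prior=0.5, max_iter=50, seed=0):
    """Fit one LDA model per candidate number of topics and return
    (best_k, perplexities), where best_k has the lowest perplexity."""
    perplexities = {}
    for k in candidates:
        lda = LatentDirichletAllocation(n_components=k, max_iter=max_iter,
                                        doc_topic_prior=prior,
                                        topic_word_prior=prior,
                                        random_state=seed).fit(bow)
        perplexities[k] = lda.perplexity(bow)  # lower values are better
    best_k = min(perplexities, key=perplexities.get)
    return best_k, perplexities

best_k, perplexities = select_n_topics(bow, candidates=[2, 3, 4])
print(best_k, perplexities)
```

Like the rest of the lab, this scores each model on the same data it was fitted on; scoring on held-out documents would be a stricter comparison. Fixing `random_state` keeps the selection repeatable across runs.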