# Topic Models and their visualization
### Author: Eni Mustafaraj

This tutorial shows how to perform topic modeling with the `sklearn` library. It also introduces `pyLDAvis`, a tool for visualizing topic models that works with the results of both `sklearn` and `gensim`.
Here we will see it in action with `sklearn`, using the 20 newsgroups dataset as provided by scikit-learn.

**Table of Contents**

1. [Prepare the notebook materials](#sec1)  
2. [Convert to document-term matrices](#sec2)  
3. [Fit Latent Dirichlet Allocation models](#sec3) 
4. [Visualizing the models with pyLDAvis](#sec4)
5. [YOUR TASK: Interpret the models](#sec5)

Create a virtual environment:
`python -m venv myenv`

Activate it:
- Windows: `myenv\Scripts\activate`
- macOS/Linux: `source myenv/bin/activate`

Install the packages:
`pip install -r requirements.txt`

<a id="sec1"></a>
## 1. Prepare the notebook materials

From `sklearn` we'll be using one of its datasets, the feature extractors, and the LDA model.

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

Now install `pyLDAvis`:

In [2]:
!pip install pyLDAvis



Import the needed packages and enable pyLDAvis to work inside the notebook:

In [5]:
# Import pyLDAvis and enable notebook visualization
import pyLDAvis
try:
    import pyLDAvis.sklearn  # older pyLDAvis releases
except ModuleNotFoundError:
    # Newer pyLDAvis releases renamed the module to pyLDAvis.lda_model;
    # alias it so the pyLDAvis.sklearn calls further down keep working
    import pyLDAvis.lda_model
    pyLDAvis.sklearn = pyLDAvis.lda_model
pyLDAvis.enable_notebook()

### Load 20 newsgroups dataset

Below, the 20 newsgroups dataset available in `sklearn` is loaded. The headers, footers and quotes are removed, so that irrelevant words don't participate in the analysis.

In [8]:
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
docs_raw = newsgroups.data
print(len(docs_raw))

11314


<a id="sec2"></a>
## 2. Convert to document-term matrices

Next, the raw documents are converted into a document-term matrix (DTM), first as raw counts (with `CountVectorizer`) and then as TF-IDF values (with `TfidfVectorizer`).

We create two vectorizers to see how the choice of feature representation affects the resulting models.

In [9]:
# Create the TF vector representation, which only counts the terms in each document

tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5, 
                                min_df = 10)

dtm_tf = tf_vectorizer.fit_transform(docs_raw)
print(dtm_tf.shape)

(11314, 9144)


Let's look at some of the features (i.e. words in the corpus):

In [10]:
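# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
tf_vectorizer.get_feature_names_out()[500:520].tolist()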

['asks',
 'asleep',
 'aspect',
 'aspects',
 'ass',
 'assault',
 'assaults',
 'assemble',
 'assembled',
 'assembler',
 'assembly',
 'assert',
 'asserted',
 'asserting',
 'assertion',
 'assertions',
 'asserts',
 'assess',
 'assessment',
 'asshole']

To compare the results of the TF and TF-IDF representations, we will use the same parameters for both.
This is why we initialize the `TfidfVectorizer` with the parameters of the `CountVectorizer`.

In [12]:
tf_vectorizer.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.int64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 0.5,
 'max_features': None,
 'min_df': 10,
 'ngram_range': (1, 1),
 'preprocessor': None,
 'stop_words': 'english',
 'strip_accents': 'unicode',
 'token_pattern': '\\b[a-zA-Z]{3,}\\b',
 'tokenizer': None,
 'vocabulary': None}

Initialize the `TfidfVectorizer` and build its document-term matrix:

In [11]:
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())

dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
print(dtm_tfidf.shape)



(11314, 9144)
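The TF-IDF matrix has the same shape as the count matrix, but its entries are weights rather than counts. The sketch below reproduces the weighting by hand on a tiny, made-up count matrix, assuming the `TfidfVectorizer` defaults (smoothed IDF and L2 row normalization). The function name `tfidf` and the toy numbers are illustrative only.

```python
import math

# A tiny hypothetical count matrix: 3 documents x 3 terms.
toy_counts = [
    [2, 1, 0],
    [0, 1, 0],
    [0, 1, 3],
]


def tfidf(counts):
    """Apply smoothed IDF weighting, then scale each row to unit L2 norm."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # Leave all-zero rows as zeros instead of dividing by zero.
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


toy_tfidf = tfidf(toy_counts)
for row in toy_tfidf:
    print([round(v, 3) for v in row])
```

The middle term occurs in every document, so its IDF is the minimum value of 1, while the rarer terms receive larger weights. This down-weighting of common words is the main difference the LDA model sees between the two representations.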


<a id="sec3"></a>
## 3. Fit Latent Dirichlet Allocation models

We will fit two LDA models, each with 20 topics. Because the dataset is large, training the models takes a little while.

In [13]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tf.fit(dtm_tf)

lda_tf

LatentDirichletAllocation(n_components=20, random_state=0)

In [19]:
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tfidf.fit(dtm_tfidf)

LatentDirichletAllocation(n_components=20, random_state=0)

Given that the models are probability distributions over topics and words, we will focus on their visualization to learn more about them.
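Before turning to the visualization, a quick textual check is to list the highest-weighted words of each topic. The sketch below assumes that each row of a fitted model's `components_` attribute holds one topic's unnormalized word weights, so dividing a row by its sum gives a distribution over words. The helper `top_words_per_topic` and the toy numbers are hypothetical and only illustrate the idea.

```python
def top_words_per_topic(components, feature_names, n_top=3):
    """For each topic, return the n_top (word, probability) pairs with the largest weight."""
    topics = []
    for weights in components:
        total = sum(weights)
        # Pair every word with its weight and sort by weight, largest first.
        ranked = sorted(zip(feature_names, weights), key=lambda p: p[1], reverse=True)
        topics.append([(word, w / total) for word, w in ranked[:n_top]])
    return topics


# Hypothetical word weights for 2 topics over a 4-word vocabulary.
toy_components = [
    [8.0, 1.0, 0.5, 0.5],
    [0.2, 0.3, 6.0, 3.5],
]
toy_features = ["space", "nasa", "hockey", "team"]

toy_topics = top_words_per_topic(toy_components, toy_features, n_top=2)
for i, topic in enumerate(toy_topics):
    print(i, [(word, round(p, 2)) for word, p in topic])
```

With the models fitted above, a call such as `top_words_per_topic(lda_tf.components_, tf_vectorizer.get_feature_names_out(), n_top=10)` should work the same way, assuming a scikit-learn version that provides `get_feature_names_out()`.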

<a id="sec4"></a>
## 4. Visualizing the models with pyLDAvis

There is a single function to call, `prepare`, which takes three arguments: the fitted LDA model, the document-term matrix, and the vectorizer.

In [10]:
# the TF representation model

pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)

In [21]:
# The TFIDF representation model

pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)

### [Optional] Using different MDS functions

With `sklearn` installed, other MDS functions, such as MMDS and TSNE, can be used for plotting if the default PCoA is not satisfactory.

**Acronyms:**  
MDS = Multi-Dimensional Scaling  
PCoA = Principal Coordinates Analysis  
MMDS = Metric Multi-Dimensional Scaling  
TSNE = t-distributed Stochastic Neighbor Embedding  

That is, what changes is how the topics are positioned with respect to one another in the 2D plane. Below, we show the same graph for the TF model, but with a different scaling technique.
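All three scaling options start from pairwise distances between topics. As far as I understand, pyLDAvis measures these with the Jensen-Shannon divergence between the topics' word distributions and then passes the distance matrix to the chosen MDS function. The sketch below shows that divergence on made-up distributions; the function names and numbers are illustrative and are not taken from pyLDAvis itself.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping terms where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Symmetric divergence: average KL of p and q to their midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical word distributions for three topics over a 4-word vocabulary.
topic_a = [0.7, 0.2, 0.1, 0.0]
topic_b = [0.6, 0.3, 0.1, 0.0]
topic_c = [0.0, 0.1, 0.2, 0.7]

print(jensen_shannon(topic_a, topic_b))  # similar topics: small value
print(jensen_shannon(topic_a, topic_c))  # dissimilar topics: larger value
```

Topics whose word distributions are close should land near each other in the plot, whichever scaling method is used. The methods differ in how faithfully they preserve these distances when squeezing them into two dimensions.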

In [20]:
# Notice the fourth parameter

pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='mmds')

In [22]:
# Different value for the 4th parameter

pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='tsne')

<a id="sec5"></a>
## 5. Your task: Interpret the models

First, watch the video on slide 23 about using pyLDAvis with the 20 Newsgroup dataset.  
If you need more context, read these other sources:

- [The 20 Newsgroup dataset](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)
- [Homepage for the 20 Newsgroup dataset](http://qwone.com/~jason/20Newsgroups/)

Then, play with the four visualizations above to get a better sense of the topics.

**Questions to answer below:**

1. Repeat the two final visualizations (MMDS and TSNE) for the TF-IDF representation.
2. Which of the four visualizations seems most helpful to you for understanding the topics? Explain.
3. Can you map 