# Topic Modelling with LDA


## Doing LDA in scikit-learn

Let's build an LDA model with [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) from the Brown corpus. We'll start by generating a document-term matrix, with one row per document and one column per word type. We'll exclude stopwords, since they mostly aren't relevant to topic modelling.

In [1]:
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from collections import Counter
from nltk.corpus import brown, stopwords
EN_STOPWORDS = set(stopwords.words("english"))

def preprocess(doc):
    # Lowercase, then keep only alphabetic tokens that are not stopwords
    doc = [w.lower() for w in doc]
    return [w for w in doc if w.isalpha() and w not in EN_STOPWORDS]

# One bag-of-words Counter per Brown document
raw_feature_dicts = []
for fileid in brown.fileids():
    tokens = preprocess(brown.words(fileid))
    raw_feature_dicts.append(Counter(tokens))

vectorizer = DictVectorizer()
X = vectorizer.fit_transform(raw_feature_dicts)
print(X.shape)

(500, 40097)


> The entries in the matrix are word counts. The Brown corpus has 500 documents, and 40,097 distinct word types remain after preprocessing.
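To make the structure of the matrix concrete, here is a minimal sketch that builds the same kind of count matrix by hand for three toy documents. The documents and the names `toy_docs`, `vocab` and `toy_X` are made up for illustration, and this only approximates what DictVectorizer does (which, among other things, returns a sparse matrix rather than a dense array).

```python
from collections import Counter
import numpy as np

# Three tiny, made-up "documents" that have already been preprocessed
toy_docs = [
    ["music", "opera", "music"],
    ["game", "club", "game", "game"],
    ["music", "game"],
]
counts = [Counter(doc) for doc in toy_docs]

# One column per distinct word type, in sorted order
vocab = sorted(set(w for doc in toy_docs for w in doc))
col = {w: j for j, w in enumerate(vocab)}

# Rows are documents, columns are word types, entries are counts
toy_X = np.zeros((len(toy_docs), len(vocab)), dtype=int)
for i, c in enumerate(counts):
    for w, n in c.items():
        toy_X[i, col[w]] = n

print(vocab)
print(toy_X)
```

Each row can be read as a bag of words for one document; word order inside the document is lost.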

There are a lot of options for training LDA. A few of the more important ones:

- n_components: the number of topics in the model
- doc_topic_prior: the Dirichlet hyperparameter for document/topic distributions (alpha); values closer to 0 concentrate each document on fewer topics, while larger values spread it more evenly across topics
- topic_word_prior: the Dirichlet hyperparameter for topic/word distributions (beta); values closer to 0 concentrate each topic on fewer words, while larger values spread it more evenly across the vocabulary
- learning_method: "online" or "batch"; "online" is generally faster on large corpora
- max_iter: the maximum number of passes through the corpus
- evaluate_every: how often (in iterations) to compute perplexity, which helps assess convergence
- verbose: the verbosity level; higher values print more progress information
- random_state: a fixed seed, so that results are reproducible

There are many other options related to learning rates, but they are probably not worth fiddling with!
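The effect of the two Dirichlet hyperparameters can be seen by sampling from a symmetric Dirichlet directly. The sketch below is independent of scikit-learn: it uses NumPy's `Generator.dirichlet`, and the helper `mean_peak` and the two alpha values are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 20

def mean_peak(alpha, n_samples=1000):
    """Average size of the largest component in symmetric Dirichlet samples."""
    samples = rng.dirichlet([alpha] * n_topics, size=n_samples)
    return samples.max(axis=1).mean()

sparse_peak = mean_peak(0.01)   # prior close to 0
even_peak = mean_peak(10.0)     # large prior

print(f"alpha=0.01: largest topic share averages {sparse_peak:.2f}")
print(f"alpha=10:   largest topic share averages {even_peak:.2f}")
```

With the small value, most of the probability mass in each sample sits on a single component; with the large value, the mass is spread much more evenly.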

As with HMMs, perplexity gives an *intrinsic* measure of the quality of a topic model. Here's the formula again:

\begin{equation}
\text{perplexity} = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_{2} p(x_{i})}
\end{equation}

Here $N$ is the total number of words in the corpus, and $p(x_{i})$ is the probability the model assigns to the $i$-th word.

Lower perplexity means the model fits the data better, so we want to see it go down during training.
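As a worked instance of the formula, the following sketch computes perplexity from a list of per-token probabilities. The function name `perplexity` and the toy probabilities are assumptions for illustration; scikit-learn reports an approximation computed internally, so this is not how the numbers in the training log are produced.

```python
import math

def perplexity(word_probs):
    """Perplexity of a corpus, given the model probability of each word token."""
    n = len(word_probs)
    avg_log2 = sum(math.log2(p) for p in word_probs) / n
    return 2 ** (-avg_log2)

# A model that gives every token probability 1/8 is as "confused" as a
# uniform choice among 8 words
uniform = perplexity([1 / 8] * 10)

# Assigning higher probability to the observed tokens lowers perplexity
better = perplexity([1 / 2] * 10)

print(uniform, better)
```

One way to read the result: a perplexity of $k$ corresponds to the model being, on average, as uncertain as if it were choosing uniformly among $k$ words.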

In [2]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=20, learning_method="online", max_iter=100,
                                evaluate_every=1, verbose=1, random_state=0)
theta = lda.fit_transform(X)

iteration: 1 of max_iter: 100, perplexity: 65840.4194
iteration: 2 of max_iter: 100, perplexity: 36309.0057
iteration: 3 of max_iter: 100, perplexity: 23529.4010
iteration: 4 of max_iter: 100, perplexity: 17179.0895
iteration: 5 of max_iter: 100, perplexity: 13764.8107
iteration: 6 of max_iter: 100, perplexity: 11821.8795
iteration: 7 of max_iter: 100, perplexity: 10667.0038
iteration: 8 of max_iter: 100, perplexity: 9955.8109
iteration: 9 of max_iter: 100, perplexity: 9504.0442
iteration: 10 of max_iter: 100, perplexity: 9208.6315
iteration: 11 of max_iter: 100, perplexity: 9009.8673
iteration: 12 of max_iter: 100, perplexity: 8872.2725
iteration: 13 of max_iter: 100, perplexity: 8774.0631
iteration: 14 of max_iter: 100, perplexity: 8701.5370
iteration: 15 of max_iter: 100, perplexity: 8646.4253
iteration: 16 of max_iter: 100, perplexity: 8603.5607
iteration: 17 of max_iter: 100, perplexity: 8569.5504
iteration: 18 of max_iter: 100, perplexity: 8541.4568
iteration: 19 of max_iter: 100

The direct output of the fit_transform function is the theta matrix, in which each row is the topic distribution for one text. This can be viewed as a form of dimensionality reduction: each document goes from 40,097 word counts to 20 topic proportions.

In [3]:
theta.shape

(500, 20)

> We have 500 documents and 20 topics. Each row gives the probability of every topic for one document:

In [4]:
theta[0]

array([4.58295149e-05, 4.58295148e-05, 4.58295143e-05, 4.58295143e-05,
       4.58295142e-05, 4.58295147e-05, 4.58295144e-05, 4.58295142e-05,
       9.99129239e-01, 4.58295144e-05, 4.58295142e-05, 4.58295142e-05,
       4.58295142e-05, 4.58295142e-05, 4.58295153e-05, 4.58295143e-05,
       4.58295151e-05, 4.58295142e-05, 4.58295142e-05, 4.58295142e-05])

In [5]:
sum(theta[0])

1.0
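Since each row of theta is a distribution over topics, a common next step is to pick out the dominant topic of each document. Here is a small sketch on a made-up matrix; `toy_theta` and its values are purely illustrative and do not come from the model above.

```python
import numpy as np

# Made-up document/topic matrix: 3 documents, 4 topics, each row sums to 1
toy_theta = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.05, 0.05, 0.60, 0.30],
    [0.25, 0.25, 0.25, 0.25],
])

dominant_topic = toy_theta.argmax(axis=1)   # index of the largest entry per row
dominant_share = toy_theta.max(axis=1)      # how much of the document it covers

for doc_id, (k, share) in enumerate(zip(dominant_topic, dominant_share)):
    print(f"document {doc_id}: topic {k} ({share:.2f})")
```

Note that `argmax` returns the first index on ties, so a document spread evenly across topics (like the last row) still gets a "dominant" topic; checking the share alongside the index helps spot such cases.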

The matrix we referred to as $\beta$, i.e. the probability distribution over words for each topic, isn't directly available from the sklearn model. Instead, the **components_** matrix holds "pseudocounts" of how often words were associated with particular topics. (They aren't actual counts because the inference algorithm allows partial assignments of topics to words.) We can get the probabilities $\beta$ just by normalizing the rows.

In [6]:
lda.components_.shape  # each row is a topic, each column is a word type

(20, 40097)

In [7]:
lda.components_[0]

array([0.05000102, 0.05000089, 0.05000103, ..., 0.05000099, 0.05000072,
       0.05000084])

In [8]:
sum(lda.components_[0])   # does not sum up to 1

3398.110741826974

In [9]:
beta = lda.components_/np.sum(lda.components_,axis=1,keepdims=True)    # normalizing pseudocounts 

In [10]:
beta[0]   # now we turn this into a probability distribution

array([1.47143585e-05, 1.47143197e-05, 1.47143615e-05, ...,
       1.47143499e-05, 1.47142714e-05, 1.47143053e-05])

In [11]:
sum(beta[0])  # each row should sum up to 1

0.9999999999999971
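Because of floating-point rounding, a normalized row may sum to a value very slightly different from 1, as above. A hedged sketch of how one might check the normalization with a tolerance instead of exact equality, using made-up pseudocounts (`toy_components` is not the model's matrix):

```python
import numpy as np

# Made-up pseudocounts: 2 topics, 3 word types
toy_components = np.array([
    [0.05, 3.20, 0.75],
    [7.10, 0.05, 0.05],
])

# keepdims=True keeps the row sums as a (2, 1) column so the division
# broadcasts across each row
toy_beta = toy_components / toy_components.sum(axis=1, keepdims=True)

# Compare with a tolerance rather than == 1.0
rows_ok = np.allclose(toy_beta.sum(axis=1), 1.0)
print(rows_ok)
```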

To make sense of what a topic means, we need to know the words associated with it. Let's use argsort to grab the indices of the 5 highest-probability words for each topic, and see what they are by looking them up in the features from the vectorizer. The topics have no inherent order.

In [12]:
max_words = np.argsort(-beta, axis=1)[:, :5]  # indices of the 5 most probable words per topic
features = vectorizer.get_feature_names_out()
for i, top_k in enumerate(max_words):
    print("Topic %u:" % i)
    for ind in top_k:
        print(features[ind])
    print()

Topic 0:
hanover
p
good
portland
emory

Topic 1:
one
may
world
must
even

Topic 2:
dallas
gin
stock
secrets
cotton

Topic 3:
af
cells
surface
used
p

Topic 4:
one
said
new
also
first

Topic 5:
aircraft
missile
texas
nuclear
bombers

Topic 6:
clay
mold
pieces
place
cut

Topic 7:
vacation
midwest
festival
yosemite
locales

Topic 8:
one
would
new
time
may

Topic 9:
god
christ
jesus
bible
born

Topic 10:
one
time
said
new
would

Topic 11:
music
musical
orchestra
opera
composer

Topic 12:
mason
watercolor
roy
mosque
sophia

Topic 13:
kowalski
comedie
hengesbach
verdict
maxwell

Topic 14:
said
would
one
like
could

Topic 15:
khrushchev
china
class
junior
chinese

Topic 16:
game
palmer
club
player
baseball

Topic 17:
stein
huff
fiedler
leavitt
buchheister

Topic 18:
new
said
one
would
two

Topic 19:
sections
staining
minutes
nonspecific
tissue



Exercise: based on the highest-probability topic words given above, let's predict another word that should have high probability for one topic and low probability for another, and see if we're right.

In [13]:
word = "god"
# get_feature_names_out() returns a NumPy array, which has no .index() method,
# so look up the column in the vectorizer's vocabulary_ mapping instead
index = vectorizer.vocabulary_[word]

We have a somewhat high probability for "god" in the religious topic (topic 9):

In [None]:
beta[9][index]

0.048867588740816974

In the "military" topic (topic 5), by contrast, the probability of the word "god" is much lower:

In [None]:
beta[5][index]

1.4188496404285626e-05
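The same comparison can be done for all topics at once by reading a whole column of $\beta$. The sketch below uses a made-up three-topic matrix and vocabulary mapping; `topics_for_word`, `toy_beta` and `toy_vocab` are hypothetical names, not part of scikit-learn.

```python
import numpy as np

# Made-up vocabulary mapping and topic/word matrix (rows sum to 1)
toy_vocab = {"bible": 0, "god": 1, "missile": 2}
toy_beta = np.array([
    [0.30, 0.60, 0.10],   # a "religious" topic
    [0.05, 0.05, 0.90],   # a "military" topic
    [0.40, 0.20, 0.40],
])

def topics_for_word(word, beta, vocab):
    """Return topic indices sorted from highest to lowest probability of `word`."""
    column = beta[:, vocab[word]]
    return np.argsort(-column)

ranking = topics_for_word("god", toy_beta, toy_vocab)
print(ranking)
```

Keep in mind that the column holds the probability of the word given each topic; the values in a column are not a distribution over topics and need not sum to 1.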

In [None]:
import sys
!{sys.executable} -m pip install pyldavis



In [None]:
import pyLDAvis
import pyLDAvis.sklearn  # note: newer pyLDAvis releases (3.4+) rename this module to pyLDAvis.lda_model
pyLDAvis.enable_notebook()

pyLDAvis.sklearn.prepare(lda_model=lda, dtm=X, vectorizer=vectorizer)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.4s finished
