## Latent Dirichlet Allocation (LDA) - dummy dataset

Musat Bianca-Stefania

407 Artificial Intelligence

In [300]:
# import all necessary packages
import os
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import download
download('stopwords')
download('wordnet')
download('punkt')

import numpy as np
from collections import Counter
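# the model below relies on the PyMC 2 API (pm.MCMC, pm.Container, pm.CompletedDirichlet)
!pip install pymc
import pymc as pm
import matplotlib.pyplot as plt
from matplotlib import rcParams
from pymc3 import traceplot  # only needed for the commented-out traceplot call in lda()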
from math import log, exp

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Transform string data from dummy dataset into numbers so that it can be used by LDA

We will define a vocabulary which is a dictionary that maps each word to a number. We also define an index_to_word dictionary which allows us to easily trace back each word from the associated number.

In [301]:
dataset = [["aaa", "bbb", "aaa"], ["bbb", "aaa", "bbb"], ["aaa", "bbb", "bbb", "aaa"], ["uuu", "vvv"], ["uuu", "vvv", "vvv"], ["uuu", "vvv", "vvv", "uuu"]]

In [302]:
NO_FILES = 3

In [303]:
def make_vocab(dataset):
    vocabulary = {}  # dict {word : index}
    index_to_word = {}  # dict {index : word}
    index = 0

    for item in dataset:
        for word in item:
            if word not in vocabulary:
                vocabulary[word] = index
                index_to_word[index] = word
                index += 1
    return vocabulary, index_to_word

In [304]:
vocab, index_to_word = make_vocab(dataset)
print("Vocabulary length: ", len(vocab))

Vocabulary length:  4


In [305]:
def process_dataset(dataset):  # transform dataset into number representation
    for i, item in enumerate(dataset):
        for j, word in enumerate(item):
            if word in vocab:
                dataset[i][j] = vocab[word]
            else:  # if word not in vocabulary, assign new index
                dataset[i][j] = len(vocab)

In [306]:
process_dataset(dataset)

In [307]:
print("Dataset: ", dataset)

Dataset:  [[0, 1, 0], [1, 0, 1], [0, 1, 1, 0], [2, 3], [2, 3, 3], [2, 3, 3, 2]]


# LDA

In the next section I have implemented and described the LDA model.

LDA is a probabilistic topic model in which each document in the dataset exhibits multiple topics in different proportions, each topic being a distribution over the predefined vocabulary.

LDA assumes that the documents are created via a generative process that looks like this: we first choose a distribution over topics for each document; then, for each word in that document, we choose a random topic from that distribution and choose a word from that topic's distribution over the vocabulary. As stated before, this generative process leads to documents belonging to multiple topics in different proportions.

LDA reverse-engineers this process. In order to do that, it has to draw a topic for each word in each document, and then to draw the actual word from the distribution associated with that topic. Both the topic and the word are drawn from a Categorical distribution. The parameters of these 2 distributions are drawn from a Dirichlet distribution, as this is the conjugate prior of the Categorical.

z[m,n] = Categorical(theta[m])  # drawing a topic for each word (each word n in each document m)

w[m,n] = Categorical(phi[z[m,n]])   # drawing the physical word (each word n in each document m)

theta[m] = Dirichlet(alpha)  # topic distribution for document m (it will be K dimensional, as there are K topics in total)

phi[k] = Dirichlet(beta)  # word distribution for topic k (it will be V dimensional, as there are V possible words in the dataset)
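
The four sampling statements above can be read forwards as a recipe for generating a corpus. The block below is a minimal NumPy sketch of that forward process, not part of the original notebook; the function name `generate_corpus` and the document lengths are assumptions chosen for illustration.

```python
import numpy as np

def generate_corpus(doc_lengths, n_topics, vocab_size, alpha, beta, seed=0):
    """Sample documents from the LDA generative process (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # phi[k] ~ Dirichlet(beta): one word distribution per topic
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    corpus, topics = [], []
    for n_words in doc_lengths:
        # theta[m] ~ Dirichlet(alpha): topic distribution of this document
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # z[m,n] ~ Categorical(theta[m]): one topic per word position
        z = rng.choice(n_topics, size=n_words, p=theta)
        # w[m,n] ~ Categorical(phi[z[m,n]]): the word itself
        w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        corpus.append(w)
        topics.append(z)
    return corpus, topics, phi

corpus, topics, phi_true = generate_corpus(
    [3, 3, 4], n_topics=2, vocab_size=4, alpha=0.2, beta=0.3
)
for words, z in zip(corpus, topics):
    print("words:", words, "topics:", z)
```

Inference runs this recipe backwards: the words are observed, and theta, phi and z are the unknowns.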

In [308]:
K = 2  # number of categories/topics
M = len(dataset)  # number of documents in the dataset
V = len(vocab)  # vocabulary length

In [309]:
def lda(dataset, alpha=np.ones(K), beta=np.ones(V)):
    Nm = []  # number of words in each document
    for d in dataset:
      Nm.append(len(d))

    # draw word distribution for each topic (V dimensional, as there are V possible words in the dataset)
    phi1 = [pm.Dirichlet("phi1_%i" % i, theta=beta) for i in range(K)]
    phi = [pm.CompletedDirichlet("phi_%i" % i, phi1[i]) for i in range(K)]
    phi = pm.Container(phi)

    # draw topic distribution for each document (K dimensional, as there are K topics in total)
    theta1 = [pm.Dirichlet("theta1_%i" % i, theta=alpha) for i in range(M)]
    theta = [pm.CompletedDirichlet("theta_%i" % i, theta1[i]) for i in range(M)]
    theta = pm.Container(theta)

    rand_topic = []  # randomly choose a topic for each word in each document
    for i in range(M):
      rand_topic.append(np.random.randint(K, size=Nm[i]))

    # draw a topic for each word in each document
    Z = [pm.Categorical("Z1_%i" % i, p=theta[i], size=Nm[i], value=rand_topic[i]) for i in range(M)]
    Z = pm.Container(Z)

    # draw the word itself from the word distribution
    W = [pm.Categorical("w_%d_%i" % (d,i), p = pm.Lambda("z_%d_%i" % (d,i), lambda z=Z[d][i], phi=phi : phi[z]), value=dataset[d][i], observed=True) for d in range(M) for i in range(Nm[d])]
    W = pm.Container(W)

    # create LDA model
    model = pm.Model([theta, phi, Z, W, phi1, theta1])

    # sample values
    mcmc = pm.MCMC(model)
    trace = mcmc.sample(iter=100000, burn=8000)
  
    print("\nThe topic of each word in each document:\n")
    for m in range(M):  
      tr = mcmc.trace('Z1_%i' % m)[100000 - 8000 - 1]
      print(tr)

    print("\nThe mean topic of each word in each document:\n")
    for m in range(M):  
      tr = mcmc.trace('Z1_%i' % m)[-1000:-1].mean(axis=0)
      print(tr)
    
    # for m in range(M):  
    #   traceplot(mcmc.trace('Z1_%i'% m )[-1000:-1].mean(axis=0))

    print("\nThe distribution of words for each topic:\n")
    for k in range(K):
      print("Topic ", k)
      for i, j in enumerate(mcmc.trace('phi_%i' % k)[-1000:-1].mean(axis=0)[0]):
          print("   ", index_to_word[i], " ", j)
      print()

    print("\nTopic distribution for each document:\n")
    for m in range(M):
      print("Document ", m)
      for i, j in enumerate(mcmc.trace('theta_%i' % m)[-1000:-1].mean(axis=0)[0]):
          print("Topic ", i, " ", j)
      print()

    return (theta, phi, Z, W)
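
The model above depends on the PyMC 2 sampler. As a cross-check that needs only NumPy, the same model can also be fitted with a collapsed Gibbs sampler. This is a different inference technique from the one used in this notebook, sketched here under assumptions: the function name `collapsed_gibbs_lda`, the symmetric scalar `alpha` and `beta`, and the iteration count are all illustrative choices.

```python
import numpy as np

def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.2, beta=0.3,
                        n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch, symmetric priors)."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))   # words of document d assigned to topic k
    nkw = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic k
    nk = np.zeros(n_topics)                 # total words assigned to topic k
    # random initial topic for every word
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1
            nkw[k, w] += 1
            nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # remove the current assignment from the counts
                k = z[d][n]
                ndk[d, k] -= 1
                nkw[k, w] -= 1
                nk[k] -= 1
                # conditional distribution of the topic given all other assignments
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1
                nkw[k, w] += 1
                nk[k] += 1
    # point estimates of theta and phi from the final counts
    theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
    phi = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
    return z, theta, phi

docs = [[0, 1, 0], [1, 0, 1], [0, 1, 1, 0], [2, 3], [2, 3, 3], [2, 3, 3, 2]]
z_gibbs, theta_gibbs, phi_gibbs = collapsed_gibbs_lda(docs, n_topics=2, vocab_size=4)
print(theta_gibbs.round(2))
```

As with the PyMC runs, topic labels are arbitrary, so topic 0 in one run may correspond to topic 1 in another.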


## Choosing alpha and beta

The alpha and beta parameters are fixed, so it is important to choose them carefully. They affect the sparsity of the model in the following way:

- alpha controls the document-topic density, so we want a higher alpha when documents are expected to mix many topics
- beta controls the topic-word density, so we want a higher beta when each topic is expected to spread over many words of the vocabulary

In our case, we have a small number of topics and each document is expected to use only one of them, so we would rather choose a small alpha; the vocabulary is small too, and each topic uses only two words, so we should also choose a small beta. However, I have tested a number of values and the results are presented below.
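
To make the effect of the concentration parameter concrete, the sketch below draws samples from a symmetric Dirichlet and reports the average size of the largest component. It is an illustration only; the helper name `mean_largest_component` and the values 0.1 and 10.0 are assumptions, not settings used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_largest_component(concentration, dim, n_samples=5000):
    """Average of the largest probability in samples from a symmetric Dirichlet."""
    samples = rng.dirichlet(np.full(dim, concentration), size=n_samples)
    return samples.max(axis=1).mean()

# small concentration: almost all mass on one topic (sparse)
sparse = mean_largest_component(0.1, dim=2)
# large concentration: mass spread nearly evenly (dense)
dense = mean_largest_component(10.0, dim=2)
print("concentration 0.1 :", round(sparse, 3))
print("concentration 10.0:", round(dense, 3))
```

A value close to 1 means that a typical sample puts nearly all of its mass on a single component, which is the behaviour we want for documents that belong to one topic.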

## Choosing a way to measure model success

As a measure of success, I have chosen to implement the accuracy per topic. The documents are grouped by their real topic (NO_FILES consecutive documents per group). For each group, I take as its correct topic the one that is most frequent among the documents of the group, and then I compute the fraction of words that have been correctly assigned to that topic.

In [310]:
# compute the accuracy for each topic
def topic_acc(Z_matrix):
    indx = 1
    counters = []
    maxims = []
    lengths = []
    print()
    for lst in Z_matrix:
        counters.append(Counter(lst))  # for each document, count how many words have been assigned to each topic
        maxims.append(max(lst, key=lst.count))  # for each document, get the most common topic
        lengths.append(len(lst))  # for each document, get the number of words in it
        if indx % NO_FILES == 0:  # a full group of NO_FILES documents has been read
            group_topic = max(maxims, key=maxims.count)  # most common topic in the group
            acc = 0
            for c, length in zip(counters, lengths):
                acc += c[group_topic] / length  # correctly assigned words / total number of words

            print("Topic: ", group_topic)
            print("Topic accuracy ", acc / NO_FILES)
            print("Most frequent topic in each document belonging to the real topic: ", maxims)
            print()
            counters = []
            maxims = []
            lengths = []
        indx += 1
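
As a worked example of this measure, the sketch below computes the accuracy for a single group of documents and returns it instead of printing it. The helper name `group_accuracy` is an assumption for illustration, and unlike `topic_acc` it divides by the size of the group it receives rather than by `NO_FILES`.

```python
from collections import Counter

def group_accuracy(group):
    """Return (group topic, accuracy) for documents that share one real topic."""
    # most common topic inside each document
    majorities = [Counter(doc).most_common(1)[0][0] for doc in group]
    # most common of those per-document topics is taken as the group's topic
    group_topic = Counter(majorities).most_common(1)[0][0]
    # fraction of words in each document assigned to the group's topic
    per_doc = [list(doc).count(group_topic) / len(doc) for doc in group]
    return group_topic, sum(per_doc) / len(group)

# topic assignments of three documents that should share a topic
topic, acc = group_accuracy([[0, 1, 0], [1, 1, 0], [0, 0, 0, 0]])
print(topic, acc)  # topic 0, accuracy (2/3 + 1/3 + 4/4) / 3 = 2/3
```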

In [251]:
(theta, phi, Z, W) = lda(dataset)



 [-----------------100%-----------------] 100000 of 100000 complete in 268.2 sec
The topic of each word in each document:

[0 1 0]
[1 1 0]
[0 0 0 0]
[1 1]
[1 1 1]
[1 1 1 1]

The mean topic of each word in each document:

[0.2972973  0.44644645 0.30530531]
[0.17517518 0.16816817 0.1031031 ]
[0.22822823 0.1951952  0.14414414 0.13513514]
[0.85285285 0.71771772]
[0.78878879 0.66066066 0.64664665]
[0.91291291 0.73573574 0.76376376 0.96196196]

The distribution of words for each topic:

Topic  0
    aaa   0.444378329368263
    bbb   0.34191350152389244
    uuu   0.08089344832459579
    vvv   0.13281472078324993

Topic  1
    aaa   0.12901981217318448
    bbb   0.1541212380998981
    uuu   0.40693435843014636
    vvv   0.3099245912967735


Topic distribution for each document:

Document  0
Topic  0   0.5939133417974944
Topic  1   0.4060866582025048

Document  1
Topic  0   0.7241807521855379
Topic  1   0.2758192478144637

Document  2
Topic  0   0.7013438461163137
Topic  1   0.29865615388368416

In [252]:
Z_matrix = []
for i in range(len(Z)):
        Z_matrix.append(list(Z[i].value))
print(Z_matrix)

topic_acc(Z_matrix)

[[0, 1, 0], [1, 1, 0], [0, 0, 0, 0], [1, 1], [1, 1, 1], [1, 1, 1, 1]]

Topic:  0
Topic accuracy  0.6666666666666666
Most frequent topic in each document belonging to the real topic:  [0, 1, 0]

Topic:  1
Topic accuracy  1.0
Most frequent topic in each document belonging to the real topic:  [1, 1, 1]



In [311]:
(theta, phi, Z, W) = lda(dataset, alpha=[0.2] * K, beta=[0.3] * V)



 [-----------------100%-----------------] 100000 of 100000 complete in 290.7 sec
The topic of each word in each document:

[0 0 0]
[0 0 0]
[0 0 0 0]
[1 1]
[1 1 1]
[1 1 1 1]

The mean topic of each word in each document:

[0.       0.       0.001001]
[0.         0.06206206 0.        ]
[0. 0. 0. 0.]
[1. 1.]
[1. 1. 1.]
[1. 1. 1. 1.]

The distribution of words for each topic:

Topic  0
    aaa   0.3976424269316546
    bbb   0.5825710053279558
    uuu   0.01364369361280558
    vvv   0.006142874127584022

Topic  1
    aaa   0.04182444854307025
    bbb   0.003965125213279359
    uuu   0.45598808015235265
    vvv   0.4982223460912992


Topic distribution for each document:

Document  0
Topic  0   0.9479150468977848
Topic  1   0.05208495310221343

Document  1
Topic  0   0.7920700012086869
Topic  1   0.2079299987913135

Document  2
Topic  0   0.9695469976022981
Topic  1   0.030453002397686443

Document  3
Topic  0   0.019716163310200697
Topic  1   0.9802838366897882

Document  4
Topic  0   0.011

In [312]:
Z_matrix = []
for i in range(len(Z)):
        Z_matrix.append(list(Z[i].value))
print(Z_matrix)

topic_acc(Z_matrix)

[[0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [1, 1], [1, 1, 1], [1, 1, 1, 1]]

Topic:  0
Topic accuracy  1.0
Most frequent topic in each document belonging to the real topic:  [0, 0, 0]

Topic:  1
Topic accuracy  1.0
Most frequent topic in each document belonging to the real topic:  [1, 1, 1]



In [276]:
(theta, phi, Z, W) = lda(dataset, alpha=[0.11] * K, beta=[0.2] * V)



 [-----------------100%-----------------] 100000 of 100000 complete in 282.7 sec
The topic of each word in each document:

[1 1 1]
[0 1 0]
[1 0 0 1]
[1 1]
[1 1 1]
[1 1 1 1]

The mean topic of each word in each document:

[1. 1. 1.]
[0.13613614 0.58358358 0.11011011]
[0.19319319 0.         0.003003   0.22522523]
[1. 1.]
[1. 1. 1.]
[1.         0.48748749 1.         1.        ]

The distribution of words for each topic:

Topic  0
    aaa   0.10719313820046873
    bbb   0.8923190397965621
    uuu   1.6779658124231744e-08
    vvv   0.0004878052232935155

Topic  1
    aaa   0.32524827840477616
    bbb   0.10315131781138397
    uuu   6.865887305025884e-05
    vvv   0.57153174491079


Topic distribution for each document:

Document  0
Topic  0   0.03309916349938243
Topic  1   0.9669008365006161

Document  1
Topic  0   0.6886695904472939
Topic  1   0.31133040955270425

Document  2
Topic  0   0.8736883971227626
Topic  1   0.12631160287723267

Document  3
Topic  0   7.068809352801602e-06
Topic  1

In [277]:
Z_matrix = []
for i in range(len(Z)):
        Z_matrix.append(list(Z[i].value))
print(Z_matrix)

topic_acc(Z_matrix)

[[1, 1, 1], [0, 1, 0], [1, 0, 0, 1], [1, 1], [1, 1, 1], [1, 1, 1, 1]]

Topic:  1
Topic accuracy  0.611111111111111
Most frequent topic in each document belonging to the real topic:  [1, 0, 1]

Topic:  1
Topic accuracy  1.0
Most frequent topic in each document belonging to the real topic:  [1, 1, 1]



In [257]:
(theta, phi, Z, W) = lda(dataset, alpha=[0.5] * K, beta=[0.6] * V)



 [-----------------100%-----------------] 100000 of 100000 complete in 271.6 sec
The topic of each word in each document:

[1 1 1]
[1 1 1]
[1 1 1 1]
[0 1]
[0 0 0]
[0 0 0 0]

The mean topic of each word in each document:

[0.97797798 0.97697698 0.99299299]
[0.91291291 0.96696697 0.88388388]
[0.99099099 0.99399399 0.95495495 1.        ]
[0.         0.05705706]
[0.03503504 0.16716717 0.05905906]
[0.         0.05305305 0.02702703 0.        ]

The distribution of words for each topic:

Topic  0
    aaa   0.049579271991671774
    bbb   0.07111838109067514
    uuu   0.5070316119035984
    vvv   0.37227073501405517

Topic  1
    aaa   0.5596940377203785
    bbb   0.3183468933925425
    uuu   0.03463883502896487
    vvv   0.08732023385811269


Topic distribution for each document:

Document  0
Topic  0   0.11693157413021617
Topic  1   0.8830684258697818

Document  1
Topic  0   0.13195564963980375
Topic  1   0.8680443503601752

Document  2
Topic  0   0.12513931508698276
Topic  1   0.874860684913

In [258]:
Z_matrix = []
for i in range(len(Z)):
        Z_matrix.append(list(Z[i].value))
print(Z_matrix)

topic_acc(Z_matrix)

[[1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [0, 1], [0, 0, 0], [0, 0, 0, 0]]

Topic:  1
Topic accuracy  1.0
Most frequent topic in each document belonging to the real topic:  [1, 1, 1]

Topic:  0
Topic accuracy  0.8333333333333334
Most frequent topic in each document belonging to the real topic:  [0, 0, 0]



## Conclusions task 1

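As we can see from the previous runs, small values of alpha and beta work best: alpha = 0.2 with beta = 0.3 gives perfect accuracy for both topics, while the default alpha = beta = 1 reaches only 0.67 on the first group of documents. Smaller is not automatically better, though: with alpha = 0.11 and beta = 0.2 the accuracy on the first group drops to 0.61. As we stated before, alpha controls the document-topic density, so we want a small alpha because we have only 2 topics, and beta controls the topic-word density, so we want a small beta because we have only 4 words in the vocabulary. With alpha = 0.2 and beta = 0.3 the model manages to predict the correct topic for every word.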

# Topic-based similarity measures


For this task I implemented a number of similarity measures proposed for LDA in these papers https://www.cs.memphis.edu/~vrus/publications/2013/CICLing-2013.RusNiraulaBanjade.pdf, https://aclanthology.org/E14-4005.pdf.

I have chosen 3 similarity measures based on the distribution over topics (theta), as each document can be viewed as a distribution over topics. Thus, we can compute the similarity between 2 documents by computing the similarity of their distributions. The 3 measures below return the dissimilarity between 2 distributions, so they tell us how different two distributions are (0 meaning identical).

- Kullback-Leibler (KL) divergence -> takes two distributions p and q and measures how much p diverges from q

KL(p, q) = sum(pi * log(pi / qi)), where i indexes a topic (i in [0, K)); the implementation below uses log base 2

The drawbacks of KL divergence are that it is undefined when some qi is 0 (while pi > 0) and that it is not symmetric: in general KL(p, q) != KL(q, p).
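
Both drawbacks are easy to see numerically. The sketch below uses a base-2 KL divergence on two made-up distributions; the helper name `kl_bits` and the example vectors are assumptions for illustration, not values from the runs in this notebook.

```python
import numpy as np

def kl_bits(p, q):
    """KL divergence in bits for distributions given as sequences."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

p = [0.9, 0.1]
q = [0.5, 0.5]
forward = kl_bits(p, q)   # KL(p, q)
backward = kl_bits(q, p)  # KL(q, p), a different number
print("KL(p, q) =", forward)
print("KL(q, p) =", backward)

# a zero in the second argument makes the divergence blow up
with np.errstate(divide="ignore"):
    blown_up = kl_bits([0.5, 0.5], [1.0, 0.0])
print("KL with a zero in q =", blown_up)
```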

- Jensen-Shannon divergence -> takes two distributions p and q and computes the distance between them, while solving the asymmetry problem of KL by considering the average of pi and qi

JS(p, q) = 1/2 * KL(p, m) + 1/2 * KL(q, m), where m = 1/2 * (p + q)

- Hellinger distance -> like the JS divergence, it is symmetric and stays defined when some qi is 0, so it avoids both drawbacks of KL divergence. An advantage of Hellinger distance is that it lies in [0, 1], so a similarity is easy to obtain by subtracting the distance from 1.

HD(p, q) = 1/sqrt(2) * sqrt(sum( (sqrt(pi) - sqrt(qi))^2 ))
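
The ranges of the two symmetric measures can be checked on extreme cases. The sketch below is illustrative: `kl_masked`, `js_safe` and `hellinger` are hypothetical helper names, and skipping the terms with pi = 0 (by the convention 0 * log 0 = 0) is an added assumption that keeps the result finite for distributions containing zeros.

```python
import numpy as np

def kl_masked(p, q):
    """Base-2 KL divergence that skips terms with p_i = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_safe(p, q):
    """Jensen-Shannon divergence in bits; m > 0 wherever p or q is > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = (p + q) / 2
    return kl_masked(p, m) / 2 + kl_masked(q, m) / 2

def hellinger(p, q):
    """Hellinger distance with the 1/sqrt(2) normalisation."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2))

same = [0.7, 0.3]
js_same, hd_same = js_safe(same, same), hellinger(same, same)
# distributions with no topic in common
js_far, hd_far = js_safe([1.0, 0.0], [0.0, 1.0]), hellinger([1.0, 0.0], [0.0, 1.0])
print("identical:", js_same, hd_same)
print("disjoint: ", js_far, hd_far)
```

With base-2 logarithms both measures run from 0 for identical distributions up to 1 for distributions that share no topic, which is why `1 - HD` can be read as a similarity.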

In [313]:
def kl(p, q):
    return np.sum(p * np.log2(p / q))

In [314]:
def js(p, q):
    m = (p + q) / 2
    return kl(p, m) / 2 + kl(q, m) / 2

In [315]:
def hd(p, q):
    return 1./np.sqrt(2) * np.sqrt(np.sum(np.square(np.sqrt(p) - np.sqrt(q))))

In [316]:
print("Jensen-Shannon divergence \n")
for i in range(len(theta)):
    for j in range(len(theta)):
        print("Documents", i, j, " have the following divergence: ", js(theta[i].value, theta[j].value))

Jensen-Shannon divergence 

Documents 0 0  have the following divergence:  0.0
Documents 0 1  have the following divergence:  0.036231314247596944
Documents 0 2  have the following divergence:  0.03897383527850709
Documents 0 3  have the following divergence:  0.7978568617445874
Documents 0 4  have the following divergence:  0.7845609020531057
Documents 0 5  have the following divergence:  0.7978756640579636
Documents 1 0  have the following divergence:  0.036231314247596944
Documents 1 1  have the following divergence:  0.0
Documents 1 2  have the following divergence:  0.0002382272858704928
Documents 1 3  have the following divergence:  0.9928723741698628
Documents 1 4  have the following divergence:  0.9792658081338039
Documents 1 5  have the following divergence:  0.9928914937872587
Documents 2 0  have the following divergence:  0.03897383527850709
Documents 2 1  have the following divergence:  0.0002382272858704928
Documents 2 2  have the following divergence:  0.0
Documents 2 3  

In [317]:
print("Kullback-Leibler divergence\n")
for i in range(len(theta)):
    for j in range(len(theta)):
        print("Documents", i, j, " have the following divergence: ", kl(theta[i].value, theta[j].value))

Kullback-Leibler divergence

Documents 0 0  have the following divergence:  0.0
Documents 0 1  have the following divergence:  0.367103324859317
Documents 0 2  have the following divergence:  0.5452318372905276
Documents 0 3  have the following divergence:  11.895376768306809
Documents 0 4  have the following divergence:  7.379276262880919
Documents 0 5  have the following divergence:  11.935101080171762
Documents 1 0  have the following divergence:  0.10800739628222725
Documents 1 1  have the following divergence:  0.0
Documents 1 2  have the following divergence:  0.0013214249550807292
Documents 1 3  have the following divergence:  13.297296337964323
Documents 1 4  have the following divergence:  8.40581230609105
Documents 1 5  have the following divergence:  13.340319963069629
Documents 2 0  have the following divergence:  0.11444644911636362
Documents 2 1  have the following divergence:  0.0007832996853520982
Documents 2 2  have the following divergence:  0.0
Documents 2 3  have th

In [318]:
print("Hellinger divergence\n")
for i in range(len(theta)):
    for j in range(len(theta)):
        print("Documents", i, j, " have the following divergence: ", hd(theta[i].value, theta[j].value), "\t and the following similarity: ", 1 - hd(theta[i].value, theta[j].value))

Hellinger divergence

Documents 0 0  have the following divergence:  0.0 	 and the following similarity:  1.0
Documents 0 1  have the following divergence:  0.17543486658062854 	 and the following similarity:  0.8245651334193714
Documents 0 2  have the following divergence:  0.18848741070364308 	 and the following similarity:  0.811512589296357
Documents 0 3  have the following divergence:  0.8436390479949719 	 and the following similarity:  0.1563609520050281
Documents 0 4  have the following divergence:  0.8184464076260862 	 and the following similarity:  0.18155359237391377
Documents 0 5  have the following divergence:  0.8437220486411735 	 and the following similarity:  0.1562779513588265
Documents 1 0  have the following divergence:  0.17543486658062854 	 and the following similarity:  0.8245651334193714
Documents 1 1  have the following divergence:  0.0 	 and the following similarity:  1.0
Documents 1 2  have the following divergence:  0.013161806225742983 	 and the following sim

We can see that documents from the same class have small values for divergence, and documents from different classes have high values for divergence.

I will combine the first and last documents from the dataset to test the similarity measures implemented above. I will rerun the model and then test the dissimilarity using Jensen-Shannon divergence.

In [324]:
print(dataset)

[[0, 1, 0], [1, 0, 1], [0, 1, 1, 0], [2, 3], [2, 3, 3], [2, 3, 3, 2]]


In [325]:
#combine 2 documents to show that the similarity measures work properly
new_doc1 = dataset[0][:2] + dataset[-1][2:]
new_doc2 = dataset[-1][:2] + dataset[0][2:]
print(new_doc1)
print(new_doc2)

[0, 1, 3, 2]
[2, 3, 0]


In [326]:
dataset[0] = new_doc2
dataset[-1] = new_doc1
print(dataset)

[[2, 3, 0], [1, 0, 1], [0, 1, 1, 0], [2, 3], [2, 3, 3], [0, 1, 3, 2]]


In [327]:
(theta, phi, Z, W) = lda(dataset, alpha=[0.11] * K, beta=[0.2] * V)



 [-----------------100%-----------------] 100000 of 100000 complete in 281.5 sec
The topic of each word in each document:

[1 1 0]
[0 0 0]
[0 0 0 0]
[1 1]
[1 1 1]
[0 0 1 1]

The mean topic of each word in each document:

[1.        1.        0.7967968]
[0. 0. 0.]
[0. 0. 0. 0.]
[1. 1.]
[1. 1. 1.]
[0.51251251 0.17317317 1.         1.        ]

The distribution of words for each topic:

Topic  0
    aaa   0.2262073365203095
    bbb   0.7736613421256444
    uuu   6.658652897914324e-09
    vvv   0.00013131469539377297

Topic  1
    aaa   0.09027543570729409
    bbb   0.06542308873346786
    uuu   0.21113491120419534
    vvv   0.6331665643550441


Topic distribution for each document:

Document  0
Topic  0   0.13597369407084337
Topic  1   0.8640263059291626

Document  1
Topic  0   0.999868035785707
Topic  1   0.00013196421429530848

Document  2
Topic  0   0.9778011778307989
Topic  1   0.022198822169194915

Document  3
Topic  0   0.11201119233466023
Topic  1   0.8879888076653435

Document  4


In [328]:
Z_matrix = []
for i in range(len(Z)):
        Z_matrix.append(list(Z[i].value))
print(Z_matrix)

topic_acc(Z_matrix)

[[1, 1, 0], [0, 0, 0], [0, 0, 0, 0], [1, 1], [1, 1, 1], [0, 0, 1, 1]]

Topic:  0
Topic accuracy  0.7777777777777777
Most frequent topic in each document belonging to the real topic:  [1, 0, 0]

Topic:  1
Topic accuracy  0.8333333333333334
Most frequent topic in each document belonging to the real topic:  [1, 1, 0]



In [329]:
print("Jensen-Shannon divergence \n")
for i in range(len(theta)):
    for j in range(len(theta)):
        print("Documents", i, j, " have the following divergence: ", js(theta[i].value, theta[j].value))

Jensen-Shannon divergence 

Documents 0 0  have the following divergence:  0.0
Documents 0 1  have the following divergence:  0.3737586522171188
Documents 0 2  have the following divergence:  0.37529209660911633
Documents 0 3  have the following divergence:  0.2516675867416454
Documents 0 4  have the following divergence:  0.2529473924400953
Documents 0 5  have the following divergence:  0.10693723606795216
Documents 1 0  have the following divergence:  0.3737586522171188
Documents 1 1  have the following divergence:  0.0
Documents 1 2  have the following divergence:  0.00012493649016920885
Documents 1 3  have the following divergence:  0.9967659318252986
Documents 1 4  have the following divergence:  0.9982559040645043
Documents 1 5  have the following divergence:  0.11068049434363715
Documents 2 0  have the following divergence:  0.37529209660911633
Documents 2 1  have the following divergence:  0.00012493649016920885
Documents 2 2  have the following divergence:  0.0
Documents 2 3  

We can observe above that the dissimilarity between documents 0 and 5 (that have been mixed up) is pretty low (about 0.107).

# Assigning a topic to a new document

In this section I will present a mathematical approach to estimate the probability of a topic t given a new document d, noted p(t/d).

t = topic

d = document

di = word i in document

Following Bayes' theorem, we have that:
p(t/d) = p(d/t) * p(t) / p(d)

At the same time, considering that the words are independent given the topic, we can state that p(d/t) = p(d1...dn/t) = prod(p(di/t))

The above p(di/t) is actually the word distribution for each topic (phi).

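p(t) is the prior probability of a topic, which we take to be uniform: 1/K.

p(d) is the probability of a document, which we take to be 1/(M + 1). It does not depend on t, so it does not change which topic gets the highest probability.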

Having all the necessary information, it is now fairly simple to compute the desired probability.

For numerical stability, I have applied log to the probability and added a small value (epsilon) to each value in phi, so that a zero probability never reaches the log.

log(p(t/d)) = log(p(d/t)) + log(p(t)) - log(p(d))

log(p(d/t)) ≈ sum(log(p(di/t) + eps))

In [319]:
for i in range(len(phi)):
    print("Topic", i, ": ", phi[i].value)

Topic 0 :  [[0.40098889 0.57504692 0.00780524 0.01615895]]
Topic 1 :  [[0.00669411 0.00435578 0.4796438  0.50930632]]


In [320]:
new_doc = ["bbb", "bbb", "aaa", "vvv"]  # make a new document which has 3 words from one topic, and one word from the other
new_d = []  # the new document with words translated to numbers
for i, d in enumerate(new_doc):
    if d in vocab:
      new_d.append(vocab[d])

print(new_d)

[1, 1, 0, 3]


In [321]:
# compute log(p(d/t))
eps = 1e-5
def compute_prod_prob(phi, doc, topic):
    prob = 0
    for di in doc:
        prob += log(list(phi[topic].value)[0][di] + eps)
    return prob

In [322]:
# compute log(p(t/d))
def compute_prob(p_d_t, p_d, p_t, topic):
    return p_d_t + log(p_t) - log(p_d)

In [323]:
# get the most probable topic for the new document
def most_prob_topic(new_d):
    p_t = 1 / K
    p_d = 1 / (M+1)
    probs = []
    for topic in range(K):
        p_d_t = compute_prod_prob(phi, new_d, topic)
        p_t_d = compute_prob(p_d_t, p_d, p_t, topic)
        probs.append(exp(p_t_d))
        print("Probability for topic ", topic, "is", exp(p_t_d))

    max_prob = max(probs)
    max_topic = probs.index(max_prob)

    print("The topic with maximum probability is ", max_topic)

most_prob_topic(new_d)

Probability for topic  0 is 0.007504376064421666
Probability for topic  1 is 2.2778204155982447e-07
The topic with maximum probability is  0


As we can see, the document gets assigned to the topic that contains 3 out of the 4 words in the document.
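
The two scores printed above do not sum to 1, because p(d) was set to a fixed constant rather than computed from the model. If proper posterior probabilities are wanted, one option is to normalise over the topics in log space. The sketch below is an illustration under assumptions: the function name `topic_posterior` is hypothetical, the prior over topics is uniform, and the phi values are the ones printed earlier in this section.

```python
import numpy as np

# word distributions per topic, copied from the printed phi values
phi_printed = np.array([
    [0.40098889, 0.57504692, 0.00780524, 0.01615895],
    [0.00669411, 0.00435578, 0.4796438, 0.50930632],
])

def topic_posterior(doc, phi, eps=1e-5):
    """Posterior over topics for a document given as a list of word indices."""
    n_topics = phi.shape[0]
    # log p(d/t): sum of the log word probabilities under each topic
    log_lik = np.log(phi[:, doc] + eps).sum(axis=1)
    # uniform prior p(t) = 1/K
    log_post = log_lik + np.log(1.0 / n_topics)
    # subtract the maximum before exponentiating to avoid underflow
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

posterior = topic_posterior([1, 1, 0, 3], phi_printed)
print(posterior)
```

The normalisation replaces the fixed p(d) by the sum of p(d/t) * p(t) over all topics, and it leaves the most probable topic unchanged.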