<a href="https://colab.research.google.com/github/DivyaRustagi10/contextualized-topic-models-ssl/blob/main/ZeroshotTM_Parent_Paper_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# To contextualize or not to contextualize?

> Can we define a topic model that does not rely on the BoW input but instead uses contextual information?

First, we want to check whether ZeroShotTM maintains comparable performance to other topic models; if it does, we can then explore its performance in a cross-lingual setting. Since this first experiment uses only English text, we use English representations.



In [1]:
%%capture
# Install the contextualized topic model library (a cell magic must be the first line of the cell)
!pip install contextualized-topic-models==2.2.0

In [2]:
%%capture
!pip install pyldavis
!pip install wget
!pip install head

In [3]:
!nvidia-smi

Wed Apr 13 19:23:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

We replace the input BoW in Neural-ProdLDA with pre-trained multilingual representations from SBERT (Reimers and Gurevych, 2019), a recent and effective model for contextualized representations.

Indeed, ZeroShotTM is language-independent: given a contextualized representation of a new language as input, it can predict the topic distribution of the document. The predicted topic descriptors, though, will be from the training language. Let us also notice that our method is agnostic about the choice of the neural topic model architecture (here, Neural-ProdLDA), as long as it extends a Variational Autoencoder.
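
To make the idea concrete, here is a minimal sketch of the inference path, assuming a single softplus hidden layer and randomly initialised weights. All names (`infer_theta`, `W_h`, `W_mu`) and the toy sizes are hypothetical, and this is not the library's implementation. It only illustrates that the encoder sees the contextual embedding alone, so an embedding of a document in any language yields a topic distribution.

```python
import math
import random

def softplus(x):
    # log(1 + exp(x)), written in a numerically stable form
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def infer_theta(embedding, W_h, W_mu):
    """Map a contextual embedding to a topic distribution (no BoW input needed)."""
    hidden = [softplus(h) for h in matvec(W_h, embedding)]
    mu = matvec(W_mu, hidden)  # posterior mean; sampling is omitted in this sketch
    return softmax(mu)

rng = random.Random(0)
contextual_size, hidden_size, n_topics = 8, 4, 3  # toy sizes, not the notebook's
W_h = random_matrix(hidden_size, contextual_size, rng)
W_mu = random_matrix(n_topics, hidden_size, rng)
embedding = [rng.gauss(0.0, 1.0) for _ in range(contextual_size)]
theta = infer_theta(embedding, W_h, W_mu)
print(theta)
```

In the real model the weights are learned by reconstructing the BoW of the training language, which is why the topic descriptors stay in that language.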

### Data

**Building W1**

We use datasets collected from English
Wikipedia abstracts from DBpedia. The first dataset (W1) contains 20,000 randomly sampled abstracts. 


In [4]:
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk
import pickle

**Downloading DBPedia 20K Abstracts**

In [5]:
import wget
wget.download("https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt")

'dbpedia_sample_abstract_20k_unprep.txt'

In [7]:
text_file = "dbpedia_sample_abstract_20k_unprep.txt" # EDIT THIS WITH THE FILE YOU UPLOAD

**Preprocessing**

Why do we use the preprocessed text here? We need text without punctuation to build the bag of words. We may also want to keep only the most frequent words in the BoW, since too many words might not help.
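
The two steps mentioned above (dropping punctuation and keeping only the most frequent words) can be sketched as follows. `simple_preprocess` is a hypothetical helper written for illustration; it is not the library's `WhiteSpacePreprocessing`, which may behave differently in its details.

```python
import string
from collections import Counter

def simple_preprocess(docs, vocabulary_size=2000, stopwords=frozenset()):
    """Lowercase, strip punctuation, drop stopwords, keep the most frequent words."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokenized = [
        [w for w in doc.lower().translate(table).split() if w not in stopwords]
        for doc in docs
    ]
    counts = Counter(w for doc in tokenized for w in doc)
    vocab = {w for w, _ in counts.most_common(vocabulary_size)}
    cleaned = [" ".join(w for w in doc if w in vocab) for doc in tokenized]
    return cleaned, sorted(vocab)

docs = ["The cat sat; the cat slept.", "A dog sat!"]
cleaned, vocab = simple_preprocess(docs, vocabulary_size=2, stopwords={"the", "a"})
print(cleaned, vocab)
```

The original, unprocessed strings are kept separately because the sentence encoder works on natural text.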

In [8]:
nltk.download('stopwords')

# Read one abstract per line and close the file afterwards
with open(text_file, encoding="utf-8") as f:
    documents = [line.strip() for line in f]
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




In [9]:
len(preprocessed_documents) # number of docs

20000

We don't discard the non-preprocessed texts, because we are going to use them as input for obtaining the contextualized document representations.

Let's pass the preprocessed and unpreprocessed documents to our TopicModelDataPreparation object. It creates the bag of words and obtains the contextualized SBERT representations of the documents, which together form our training dataset.

Note: Here we use the English contextualized model "bert-base-nli-mean-tokens", because W1 contains only English text. A multilingual model is needed only for the cross-lingual predictions later.



**Training ZeroshotTM**

In [10]:
# Load English SBERT embeddings
tp = TopicModelDataPreparation("sentence-transformers/bert-base-nli-mean-tokens")

In [11]:
# Building training dataset
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)




### Training Models

**M1. Training Zero-Shot Contextualized Topic Model**

Finally, we can fit our new topic model. We will ask the model to find 50 topics in our collection (the `n_components` parameter of the CTM object).

In [None]:
# Train over 100 epochs
### 50 TOPICS ###
ctm = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=768, batch_size = 200, n_components = 50)         
ctm.fit(training_dataset) # run the model
ctm.save("./") # save model

### 100 TOPICS ###
ctm_100 = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=768, batch_size = 200, n_components = 100)     
ctm_100.fit(training_dataset) # run the model
ctm_100.save("./") # save model

After training, it is time to look at our topics: we can use the `get_topic_lists` function to get them. It also accepts a parameter that selects how many words to show for each topic.

If you look at the topics, you will see that they all make sense and are representative of a collection of documents that comes from Wikipedia (general knowledge). Notice that the topics are in English, because we trained the model on English documents.

In [103]:
# Check topics for 50 topics
ctm.get_topic_lists(5)[:4]

[['released', 'album', 'live', 'second', 'studio'],
 ['school', 'high', 'girls', 'secondary', 'boys'],
 ['hamilton', 'allen', 'fame', 'gene', 'pioneer'],
 ['canada', 'member', 'electoral', 'created', 'parliament']]

In [104]:
# Check topics for 100 topics
ctm_100.get_topic_lists(5)[:4]

[['member', 'riding', 'quebec', 'represented', 'canadian'],
 ['school', 'high', 'aged', 'located', 'elementary'],
 ['may', 'many', 'various', 'often', 'people'],
 ['named', 'land', 'point', 'side', 'ridge']]

In [16]:
# Topic Predictions
### 50 TOPICS ###
topics_predictions = ctm.get_thetas(training_dataset, n_samples=30) # get all the topic predictions

### 100 TOPICS ###
topics_predictions_100 = ctm_100.get_thetas(training_dataset, n_samples=30) # get all the topic predictions

Sampling: [30/30]: : 30it [01:11,  2.40s/it]
Sampling: [30/30]: : 30it [01:15,  2.50s/it]


In [17]:
# Get NPMI Coherence
# Note: get_topic_lists(k) returns the top-k words of every topic; k is not the number of topics
from contextualized_topic_models.evaluation.measures import CoherenceNPMI
texts = [doc.split() for doc in preprocessed_documents] # load text for NPMI

### 50 TOPICS ###
npmi = CoherenceNPMI(texts=texts, topics=ctm.get_topic_lists(50))
print(npmi.score())

### 100 TOPICS ###
npmi_100 = CoherenceNPMI(texts=texts, topics=ctm_100.get_topic_lists(100))
print(npmi_100.score())

# Store NPMI scores
zeroshotNPMI = [npmi.score(), npmi_100.score()]


0.16915373321086055
0.13505185551695653
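
For intuition about what the score measures, here is a minimal sketch of NPMI coherence based on document-level co-occurrence. The function `npmi_coherence` and the toy corpus are hypothetical. Library implementations may count co-occurrences differently (for example with sliding windows or smoothing), so this sketch is not expected to reproduce the numbers above.

```python
import math
from itertools import combinations

def npmi_coherence(topics, texts):
    """Average NPMI over all word pairs of every topic (document-level counts)."""
    doc_sets = [set(doc) for doc in texts]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for topic in topics:
        for w1, w2 in combinations(topic, 2):
            p12 = prob(w1, w2)
            if p12 == 0.0:
                scores.append(-1.0)  # convention: the words never co-occur
            elif p12 == 1.0:
                scores.append(0.0)   # 0/0 case, treated as 0 in this sketch
            else:
                pmi = math.log(p12 / (prob(w1) * prob(w2)))
                scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

toy_texts = [
    ["album", "studio", "released"],
    ["album", "studio"],
    ["school", "high"],
    ["school", "high", "album"],
]
print(npmi_coherence([["album", "studio"], ["school", "high"]], toy_texts))
```

Scores lie in [-1, 1]: words that always appear together score 1, and words that never share a document score -1.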


**M2. Training Neural-ProdLDA**

We use the implementation made available by [Carrow (2018)](https://github.com/estebandito22/PyTorchAVITM/blob/master/README.md).

**Model Training Instructions**

* Epochs = 100
* Adam optimizer with learning rate = 2e-3 and momentum = 0.99.
* The inference network is composed of a single hidden layer of 100 softplus units.
* The priors over the topic and document distributions are **learnable parameters**.
* We apply 20% dropout to the hidden document representation.
* Batch size = 200

In [None]:
# Imports
!git clone https://github.com/estebandito22/PyTorchAVITM # COMMENT AFTER RUNNING ONCE
!python /content/PyTorchAVITM/setup.py build
!python /content/PyTorchAVITM/setup.py install

In [19]:
import os
import json
import sys
import numpy as np
import pandas as pd

sys.path.insert(1, "/content/PyTorchAVITM")
from pytorchavitm import AVITM
from pytorchavitm.datasets import BOWDataset
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Build Dataset
# Rebuild the BoW matrix from the preprocessed documents
cv = CountVectorizer(input='content')
train_bow = cv.fit_transform(preprocessed_documents)
train_bow = train_bow.toarray()

# get_feature_names() was removed in scikit-learn 1.2; use list(cv.get_feature_names_out()) there
idx2token = cv.get_feature_names()
input_size = len(idx2token)

# BOWDataset expects the BoW matrix first, then the index-to-token mapping
train_data = BOWDataset(train_bow, idx2token)

In [24]:
# Train Neural ProdLDA models
### 50 TOPICS ###
avitm = AVITM(input_size=len(tp.vocab), n_components = 50, model_type='prodLDA',hidden_sizes=(100,100)
                ,activation='softplus', dropout=0.2, learn_priors=True, 
              batch_size=200, lr=2e-3, momentum=0.99, solver='adam', num_epochs=100, reduce_on_plateau=False)
avitm.fit(train_data)
avitm.save("./")

### 100 TOPICS ###
avitm_100 = AVITM(input_size=len(tp.vocab), n_components = 100, model_type='prodLDA',hidden_sizes=(100,100)
                ,activation='softplus', dropout=0.2, learn_priors=True, 
              batch_size=200, lr=2e-3, momentum=0.99, solver='adam', num_epochs=100, reduce_on_plateau=False)
avitm_100.fit(train_data)
avitm_100.save("./")

# NPMI Scores for t = 50 and t = 100
npmi_avitm = CoherenceNPMI(topics = list(avitm.get_topics(50).values()), texts = texts).score()
npmi_avitm_100 = CoherenceNPMI(topics = list(avitm_100.get_topics(100).values()), texts = texts).score()

print('\nCoherence Score for ProdLDA with t = 50: ', npmi_avitm)
print('\nCoherence Score for ProdLDA with t = 100: ', npmi_avitm_100)

# Store NPMI
avitmNPMI = [npmi_avitm, npmi_avitm_100]

Settings: 
               N Components: 50
               Topic Prior Mean: 0.0
               Topic Prior Variance: 0.98
               Model Type: prodLDA
               Hidden Sizes: (100, 100)
               Activation: softplus
               Dropout: 0.2
               Learn Priors: True
               Learning Rate: 0.002
               Momentum: 0.99
               Reduce On Plateau: False
               Save Dir: None
Epoch: [1/100]	Samples: [20000/2000000]	Train Loss: 172.54962939453125	Time: 0:00:00.876480
Epoch: [2/100]	Samples: [40000/2000000]	Train Loss: 162.2259447265625	Time: 0:00:00.872367
Epoch: [3/100]	Samples: [60000/2000000]	Train Loss: 155.5113236328125	Time: 0:00:00.882355
Epoch: [4/100]	Samples: [80000/2000000]	Train Loss: 150.65048076171874	Time: 0:00:00.858900
Epoch: [5/100]	Samples: [100000/2000000]	Train Loss: 147.2036779296875	Time: 0:00:01.032986
Epoch: [6/100]	Samples: [120000/2000000]	Train Loss: 144.882995703125	Time: 0:00:01.089635
Epoch: [7/100]	Sampl

**M3. Training LDA**

We use [Gensim’s](https://radimrehurek.com/gensim/models/ldamodel.html) implementation of this model.

**Model Training Instructions**

The hyper-parameters alpha and beta, controlling the document-topic and word-topic distribution respectively, are estimated from the data during training.
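
As a hedged sketch, gensim's `LdaModel` accepts `alpha="auto"` and `eta="auto"` to learn both priors from the corpus; without them it falls back to fixed symmetric priors. The dictionary name `lda_kwargs` is hypothetical, and `train_corpus` and `train_dict` refer to the objects built in the training cell.

```python
# Hypothetical keyword arguments for gensim's LdaModel
lda_kwargs = {
    "num_topics": 50,
    "passes": 30,
    "alpha": "auto",  # learn the document-topic prior from the data
    "eta": "auto",    # learn the topic-word prior from the data
}

# Usage, assuming gensim is installed and the corpus objects exist:
# lda = LdaModel(train_corpus, id2word=train_dict, **lda_kwargs)
print(lda_kwargs)
```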

In [25]:
# Train LDA models
from typing import Dict
from pprint import pprint
from gensim.models import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

# Create a corpus from a list of texts
train_dict = Dictionary([text.split() for text in preprocessed_documents])
train_corpus = [train_dict.doc2bow(text.split()) for text in preprocessed_documents]

# Fit LDA models on the corpus
### 50 TOPICS ###
lda = LdaModel(train_corpus, num_topics=50, passes = 30,
               id2word=train_dict)

### 100 TOPICS ###
lda_100 = LdaModel(train_corpus, num_topics=100, passes = 30,
               id2word=train_dict)

# NPMI Scores for t = 50 and t = 100
npmi_lda_50 = CoherenceModel(model=lda, texts=texts, dictionary=train_dict, coherence='c_npmi')
npmi_lda_100 = CoherenceModel(model=lda_100, texts=texts, dictionary=train_dict, coherence='c_npmi')

print('\nCoherence Score for LDA with t = 50: ', npmi_lda_50.get_coherence())
print('\nCoherence Score for LDA with t = 100: ', npmi_lda_100.get_coherence())

# Store NPMI
npmi_lda = [npmi_lda_50.get_coherence(), npmi_lda_100.get_coherence()]


Coherence Score for LDA with t = 50:  -0.06110367479026294

Coherence Score for LDA with t = 100:  -0.1587741752856547


**M4. Training Combined TM**

CTMs work better when the size of the bag of words is restricted to at most 2,000 terms. The neural model has to reconstruct the input bag of words, and in CombinedTM the contextualized embedding is also projected to the vocabulary space: the bigger the vocabulary, the more parameters there are, which makes training harder and more prone to bad fitting.

**Model Training Instructions**

* Epochs = 100
* ADAM optimizer
* Hyperparameters are the same used for Neural-ProdLDA with the difference that we also use SBERT features in combination with the BoW.
* We take the SBERT embeddings, apply a (learnable) dense layer R^768 → R^|V| (768 is the `contextual_size` of the SBERT model used here), and concatenate the result to the BoW.
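
The last step can be sketched as below, assuming a plain linear projection with random weights. `combined_input` and the toy sizes are hypothetical and only show the shape of the resulting input, not the library's actual layer.

```python
import random

def combined_input(bow, embedding, W):
    """Project the embedding to |V| dimensions and concatenate it to the BoW."""
    projected = [sum(w * x for w, x in zip(row, embedding)) for row in W]
    return list(bow) + projected

rng = random.Random(0)
vocab_size, contextual_size = 5, 8  # toy sizes; the notebook uses len(tp.vocab) and 768
W = [[rng.gauss(0.0, 0.1) for _ in range(contextual_size)] for _ in range(vocab_size)]
bow = [1, 0, 2, 0, 0]
embedding = [rng.gauss(0.0, 1.0) for _ in range(contextual_size)]

x = combined_input(bow, embedding, W)
print(len(x))  # twice the vocabulary size
```

Because the projection matrix has |V| rows, its parameter count grows with the vocabulary, which is the reason for keeping the BoW small.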

In [29]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file
from contextualized_topic_models.evaluation.measures import CoherenceNPMI
texts = [doc.split() for doc in preprocessed_documents] # load text for NPMI

In [30]:
# Train CombinedTM for 50 and 100 topics
### 50 TOPICS ###
comtm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, batch_size=200, n_components=50)
comtm.fit(training_dataset) # run the model
comtm.save("./")

### 100 TOPICS ###
comtm_100 = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, batch_size = 200, n_components=100)
comtm_100.fit(training_dataset) # run the model
comtm_100.save("./")

# NPMI Scores for t = 50 and t = 100
npmi_comtm = CoherenceNPMI(texts=texts, topics=comtm.get_topic_lists(50))
npmi_comtm_100 = CoherenceNPMI(texts=texts, topics=comtm_100.get_topic_lists(100))

print("\nCoherence Score for Combined TM with t = 50: ", npmi_comtm.score())
print("\nCoherence Score for Combined TM with t = 100: ",npmi_comtm_100.score())

# Store NPMI
combinedNPMI = [npmi_comtm.score(), npmi_comtm_100.score()]

Epoch: [100/100]	 Seen Samples: [2000000/2000000]	Train Loss: 133.2114064453125	Time: 0:00:02.471773: : 100it [04:09,  2.49s/it]
Epoch: [100/100]	 Seen Samples: [2000000/2000000]	Train Loss: 150.18139189453126	Time: 0:00:02.475144: : 100it [04:09,  2.49s/it]



Coherence Score for Combined TM with t = 50:  0.179156371294211

Coherence Score for Combined TM with t = 100:  0.1542736819598273


In [106]:
# SHOW RESULTS
NPMI = {"ZeroShotTM" : zeroshotNPMI,
        "CombinedTM" : combinedNPMI,
        "Neural-ProdLDA" : avitmNPMI,
        "LDA" : npmi_lda}

# Results

In [107]:
npmi = pd.DataFrame.from_dict(NPMI, orient='index')
print("NPMI Coherences on W1 dataset")
npmi.set_axis(["t(50)", "t(100)"], axis = 1)

NPMI Coherences on W1 dataset


| Model | t(50) | t(100) |
|---|---|---|
| ZeroShotTM | 0.169154 | 0.135052 |
| CombinedTM | 0.179156 | 0.154274 |
| Neural-ProdLDA | 0.171004 | 0.138252 |
| LDA | -0.061104 | -0.158774 |


# Zero-shot Cross-Lingual Topic Modeling

> Can the contextualized TM tackle zero-shot cross-lingual topic modeling?

The second dataset (W2) contains 100,000 English documents. We use 99,700 documents as training and consider the remaining 300 documents as the test set. We collect the 300 respective instances in Portuguese, Italian, French, and German.

First, we use SBERT to generate multilingual embeddings as the input of the model. Then we evaluate multilingual topic predictions on the multilingual abstracts in W2.

In [31]:
%%capture
# Install the contextualized topic model library (a cell magic must be the first line of the cell)
!pip install contextualized-topic-models==2.2.0

# Imports
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk
import pickle
import numpy as np
from pprint import pprint

In [32]:
# Download W2 files for training and testing (given by authors)
!curl -L "https://drive.google.com/u/0/uc?id=1Mlhi5LUWxo7RqCOUvJuDzKZe4GauinoO&export=download" -o dbpedia_train_unprep.txt
!curl -L "https://drive.google.com/u/0/uc?id=1HY-hi_DmoL4FYNTmlvUYgYL9x-yzroj3&export=download" -o test_set



### Data
**Building Training Dataset (W2)**

In [33]:
# Raw files (given by authors)
train_file = "dbpedia_train_unprep.txt" # 100K english abstracts
test_file = "test_set" # 300 comparable documents in it, fr, pt, de, en

# Get Test File (the with-block closes the file automatically)
with open(test_file, "rb") as filino:
  w2_test = pickle.load(filino)

# Extract multilingual test files (indices given by authors)
italian_documents = [w2_test[i][0] for i in range(len(w2_test))]
french_documents = [w2_test[i][1] for i in range(len(w2_test))]
portugese_documents = [w2_test[i][2] for i in range(len(w2_test))]
german_documents = [w2_test[i][3] for i in range(len(w2_test))]
english_documents = [w2_test[i][4] for i in range(len(w2_test))] 

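# Remove the English test documents from the train file to get the remaining 99,700 abstracts for training
# (strip both sides so that trailing newlines do not prevent a match)
with open(train_file, encoding="utf-8") as f:
  train_lines = {line.strip() for line in f}
w2_train = list(train_lines - {doc.strip() for doc in english_documents})[:99700]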

# Preprocessing train set
nltk.download('stopwords')
documents = [line.strip() for line in w2_train]
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




### Training Models

In [34]:
# Load multilingual embeddings from SBERT
tp = TopicModelDataPreparation("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Fit to build training dataset
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)




In [35]:
# Train zeroshotTM with english abstracts with t = 25
z_ctm_25 = ZeroShotTM(bow_size=len(tp.vocab), n_components = 25, contextual_size=768, num_epochs=100)
z_ctm_25.fit(training_dataset) # run the model
z_ctm_25.save("./") # save model

Epoch: [100/100]	 Seen Samples: [9970000/9970000]	Train Loss: 210.59970790300588	Time: 0:00:14.002226: : 100it [23:27, 14.07s/it]


In [36]:
# Train zeroshotTM with english abstracts with t = 50
z_ctm_50 = ZeroShotTM(bow_size=len(tp.vocab), n_components = 50,contextual_size=768, num_epochs=100)
z_ctm_50.fit(training_dataset) # run the model
z_ctm_50.save("./") # save model

Epoch: [100/100]	 Seen Samples: [9970000/9970000]	Train Loss: 216.61794061873118	Time: 0:00:13.915765: : 100it [23:03, 13.83s/it]


### Predictions and Evaluation
**Unseen Multilingual Corpora Predictions**

In [37]:
# Convert test files into test datasets
it_testset = tp.transform(italian_documents)
fr_testset = tp.transform(french_documents)
de_testset = tp.transform(german_documents)
pt_testset = tp.transform(portugese_documents)
en_testset = tp.transform(english_documents)




In [38]:
### 25 TOPIC PREDICTIONS ### 
it_topics_predictions = z_ctm_25.get_thetas(it_testset, n_samples=100) # get all the topic predictions
fr_topics_predictions = z_ctm_25.get_thetas(fr_testset, n_samples=100) # get all the topic predictions
de_topics_predictions = z_ctm_25.get_thetas(de_testset, n_samples=100) # get all the topic predictions
pt_topics_predictions = z_ctm_25.get_thetas(pt_testset, n_samples=100) # get all the topic predictions
en_topics_predictions = z_ctm_25.get_thetas(en_testset, n_samples=100) # get all the topic predictions

topics_25 = [it_topics_predictions, fr_topics_predictions, 
             pt_topics_predictions, de_topics_predictions,
             en_topics_predictions]

Sampling: [100/100]: : 100it [00:50,  1.99it/s]
Sampling: [100/100]: : 100it [00:50,  1.97it/s]
Sampling: [100/100]: : 100it [00:51,  1.96it/s]
Sampling: [100/100]: : 100it [00:51,  1.94it/s]
Sampling: [100/100]: : 100it [00:52,  1.92it/s]


In [39]:
### 50 TOPIC PREDICTIONS ### 
it_topics_predictions = z_ctm_50.get_thetas(it_testset, n_samples=100) # get all the topic predictions
fr_topics_predictions = z_ctm_50.get_thetas(fr_testset, n_samples=100) # get all the topic predictions
de_topics_predictions = z_ctm_50.get_thetas(de_testset, n_samples=100) # get all the topic predictions
pt_topics_predictions = z_ctm_50.get_thetas(pt_testset, n_samples=100) # get all the topic predictions
en_topics_predictions = z_ctm_50.get_thetas(en_testset, n_samples=100) # get all the topic predictions

topics_50 = [it_topics_predictions, fr_topics_predictions, 
             pt_topics_predictions, de_topics_predictions,
             en_topics_predictions]

Sampling: [100/100]: : 100it [00:52,  1.89it/s]
Sampling: [100/100]: : 100it [00:53,  1.88it/s]
Sampling: [100/100]: : 100it [00:53,  1.86it/s]
Sampling: [100/100]: : 100it [00:53,  1.86it/s]
Sampling: [100/100]: : 100it [00:54,  1.84it/s]


**Quantitative Evaluation**

In [40]:
# Import metrics
from contextualized_topic_models.evaluation.measures import Matches, KLDivergence, CentroidDistance
import warnings
warnings.filterwarnings('ignore')

1. **Matches**

> Matches is the % of times the predicted topic for the non-English test document is the same as for the respective test document in English. The higher the scores, the better.

In [65]:
# Matches for 25 topics
en_it_matches = Matches(topics_25[4], topics_25[0])
en_fr_matches = Matches(topics_25[4], topics_25[1])
en_pt_matches = Matches(topics_25[4], topics_25[2])
en_de_matches = Matches(topics_25[4], topics_25[3])

matches_25 = {'Italian': en_it_matches.score(), 
              'French' : en_fr_matches.score(), 
              'Portugese' : en_pt_matches.score(), 
              'German': en_de_matches.score()}
matches_25

{'French': 0.7766666666666666,
 'German': 0.74,
 'Italian': 0.8066666666666666,
 'Portugese': 0.81}

In [62]:
# Matches for 50 topics
en_it_matches = Matches(topics_50[4], topics_50[0])
en_fr_matches = Matches(topics_50[4], topics_50[1])
en_pt_matches = Matches(topics_50[4], topics_50[2])
en_de_matches = Matches(topics_50[4], topics_50[3])

matches_50 = {'Italian': en_it_matches.score(), 
              'French' : en_fr_matches.score(), 
              'Portugese' : en_pt_matches.score(), 
              'German': en_de_matches.score()}

matches_50

{'French': 0.68, 'German': 0.65, 'Italian': 0.68, 'Portugese': 0.7}
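
The Matches metric described above can be sketched in a few lines. `matches_score` and the toy distributions are hypothetical; this is an illustration of the definition, not the library's `Matches` class.

```python
def matches_score(thetas_en, thetas_other):
    """Fraction of documents whose most probable topic agrees across languages."""
    def argmax(xs):
        return max(range(len(xs)), key=xs.__getitem__)

    hits = sum(argmax(a) == argmax(b) for a, b in zip(thetas_en, thetas_other))
    return hits / len(thetas_en)

# Toy topic distributions for four parallel documents and three topics
en = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
it = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.5, 0.3, 0.2], [0.5, 0.4, 0.1]]
print(matches_score(en, it))
```

Only the top topic counts here, so two predictions that are close but not identical are scored as a miss.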

2. **Distributional Similarity**
> Compute the KL divergence between the predicted topic distribution on the test document and the same test document in English. Lower scores are better, indicating that the distributions do not differ by much.

In [51]:
# KL Divergence for 25 topics
en_it_kl = KLDivergence(topics_25[4], topics_25[0])
en_fr_kl = KLDivergence(topics_25[4], topics_25[1])
en_pt_kl = KLDivergence(topics_25[4], topics_25[2])
en_de_kl = KLDivergence(topics_25[4], topics_25[3])

kl_divergence_25 = {'Italian' : en_it_kl.score(), 
                    'French' : en_fr_kl.score(),
                    'Portugese': en_pt_kl.score(),
                    'German' : en_de_kl.score()}
           
kl_divergence_25

{'French': 0.1373087013758131,
 'German': 0.14173567385572905,
 'Italian': 0.12564331036418686,
 'Portugese': 0.12220423315397236}

In [52]:
# KL Divergence for 50 topics
en_it_kl = KLDivergence(topics_50[4], topics_50[0])
en_fr_kl = KLDivergence(topics_50[4], topics_50[1])
en_pt_kl = KLDivergence(topics_50[4], topics_50[2])
en_de_kl = KLDivergence(topics_50[4], topics_50[3])

kl_divergence_50 = {'Italian' : en_it_kl.score(), 
                    'French' : en_fr_kl.score(),
                    'Portugese': en_pt_kl.score(),
                    'German' : en_de_kl.score()}

kl_divergence_50

{'French': 0.17890374140680648,
 'German': 0.18072323598979217,
 'Italian': 0.17331664244180736,
 'Portugese': 0.15378146138751048}
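
A minimal sketch of the averaged KL divergence follows. `mean_kl` is a hypothetical helper, and the direction KL(p || q) with the English prediction as p is an assumption made for this example; the library's `KLDivergence` may differ in direction and smoothing.

```python
import math

def mean_kl(thetas_p, thetas_q, eps=1e-12):
    """Average KL(p || q) over paired documents; eps guards against log(0)."""
    def kl(p, q):
        return sum(
            pi * math.log((pi + eps) / (qi + eps))
            for pi, qi in zip(p, q)
            if pi > 0
        )

    return sum(kl(p, q) for p, q in zip(thetas_p, thetas_q)) / len(thetas_p)

# Toy topic distributions for two parallel documents
en = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
de = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]
print(mean_kl(en, de))
```

Unlike Matches, this score uses the whole distribution, so it also reflects how probability mass is spread over the non-dominant topics.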

3. **Centroid Embeddings**
> To also account for similar but not exactly equal topic predictions, we compute the centroid embeddings of the 5 words describing the predicted topic for both English and non-English documents. Then we compute the cosine similarity between those two centroids (CD).

In [45]:
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models import KeyedVectors
import gensim.downloader as api
from scipy.spatial.distance import cosine
import abc

class CD(CentroidDistance):
    """Override author's function to upgrade compatibility with Gensim 4.0.0.
    See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4."""

    def get_centroid(self, word_list):
        vector_list = []
        for word in word_list:
            if word in self.wv:   # changed from self.wv.vocab to self.wv as in Gensim 4.0.0
                vector_list.append(self.wv.get_vector(word))
        vec = sum(vector_list)
        return vec / np.linalg.norm(vec)

In [53]:
# Centroid Embeddings for 25 topics
cd_25 = {}
langs = ['Italian', 'French', 'Portugese', 'German']

for i in range(4):
  cd = CD(doc_distribution_original_language = topics_25[4], 
          doc_distribution_unseen_language = topics_25[i], 
          topics = z_ctm_25.get_topic_lists(25),
          topk = 5)
  
  cd_25[langs[i]] = cd.score()

cd_25

{'French': 0.8347816742832462,
 'German': 0.7996540741001567,
 'Italian': 0.8514786190787951,
 'Portugese': 0.8494347466900944}

In [54]:
# Centroid Embeddings for 50 topics
cd_50 = {}
langs = ['Italian', 'French', 'Portugese', 'German']

for i in range(4):
  cd = CD(doc_distribution_original_language = topics_50[4], 
          doc_distribution_unseen_language = topics_50[i], 
          topics = z_ctm_50.get_topic_lists(25),
          topk = 5)
  
  cd_50[langs[i]] = cd.score()

cd_50

{'French': 0.761204570842286,
 'German': 0.7389298263378441,
 'Italian': 0.7641824815670649,
 'Portugese': 0.7800230605651935}
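
The centroid comparison can be sketched with toy word vectors. The two-dimensional vectors and the helpers `centroid` and `cosine_similarity` are hypothetical; the notebook relies on pretrained word embeddings through `CentroidDistance`.

```python
import math

def centroid(words, vectors):
    """Average the vectors of the words that are present in the embedding table."""
    found = [vectors[w] for w in words if w in vectors]
    dims = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dims)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: two "music" words and two "education" words
vectors = {
    "album": [1.0, 0.0],
    "studio": [0.8, 0.2],
    "school": [0.0, 1.0],
    "high": [0.1, 0.9],
}
music = centroid(["album", "studio"], vectors)
education = centroid(["school", "high"], vectors)
same = cosine_similarity(music, music)
diff = cosine_similarity(music, education)
print(same, diff)
```

Two different predicted topics with related descriptors still obtain a high similarity, which is what makes this metric softer than Matches.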

In [108]:
# Show results
metrics = {"Mat25": matches_25,
           "KL25": kl_divergence_25, 
           "CD25": cd_25, 
           "Mat50": matches_50, 
           "KL50": kl_divergence_50,
           "CD50": cd_50}

In [109]:
# Save all results
results = {"NPMI Coherences on W1 dataset" : NPMI,
           "Match, KL, and Centroid Similarity for 25 and 50 topics on various languages on W2" : metrics}

with open("results.txt", 'wb') as F:
  pickle.dump(results, F)

# Results

In [110]:
metrics = pd.DataFrame.from_dict(metrics, orient='columns') 
print("Match, KL, and Centroid Similarity for 25 and 50 topics on various languages on W2")
metrics

Match, KL, and Centroid Similarity for 25 and 50 topics on various languages on W2


| Language | Mat25 | KL25 | CD25 | Mat50 | KL50 | CD50 |
|---|---|---|---|---|---|---|
| Italian | 0.806667 | 0.125643 | 0.851479 | 0.68 | 0.173317 | 0.764182 |
| French | 0.776667 | 0.137309 | 0.834782 | 0.68 | 0.178904 | 0.761205 |
| Portugese | 0.81 | 0.122204 | 0.849435 | 0.7 | 0.153781 | 0.780023 |
| German | 0.74 | 0.141736 | 0.799654 | 0.65 | 0.180723 | 0.73893 |
