<a href="https://colab.research.google.com/github/DivyaRustagi10/contextualized-topic-models-ssl/blob/main/ZeroshotTM_Parent_Paper_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# To contextualize or to not contextualize?

> Can we define a topic model that does not rely on the BoW input but instead uses contextual information?

First, we want to check whether ZeroShotTM maintains performance comparable to other topic models; if it does, we can then explore its performance in a cross-lingual setting. Since we use only English text in this setting, we use English representations.



In [None]:
%%capture
# Install the contextualized topic model library (the cell magic must be the first line of the cell)
!pip install contextualized-topic-models==2.2.0

In [None]:
%%capture
!pip install pyldavis
!pip install wget
!pip install head

In [None]:
!nvidia-smi

Thu Mar 31 20:50:05 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

We replace the input BoW in Neural-ProdLDA with pre-trained multilingual representations from SBERT (Reimers and Gurevych, 2019), a recent and effective model for contextualized representations.

Indeed, ZeroShotTM is language-independent: given a contextualized representation of a new language as input, it can predict the topic distribution of the document. The predicted topic descriptors, though, will be from the training language. Let us also notice that our method is agnostic about the choice of the neural topic model architecture (here, Neural-ProdLDA), as long as it extends a Variational Autoencoder.
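
To make the idea concrete, here is a minimal NumPy sketch of a VAE-style inference network that maps a document embedding to a topic distribution. It is not the library's implementation: the weights are random stand-ins for trained parameters, and the names `infer_topics`, `W_hidden`, `W_mu`, and `W_log_var` are hypothetical. The layer sizes (768-dimensional input, 100 hidden softplus units, 50 topics) are taken from the settings used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x))
    return np.logaddexp(0.0, x)

def softmax(x):
    # Softmax over the last axis
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

contextual_size, hidden_size, n_topics = 768, 100, 50

# Randomly initialised weights stand in for trained parameters (illustration only)
W_hidden = rng.normal(scale=0.01, size=(contextual_size, hidden_size))
W_mu = rng.normal(scale=0.01, size=(hidden_size, n_topics))
W_log_var = rng.normal(scale=0.01, size=(hidden_size, n_topics))

def infer_topics(doc_embedding):
    """Map a contextualized document embedding to a topic distribution."""
    h = softplus(doc_embedding @ W_hidden)
    mu, log_var = h @ W_mu, h @ W_log_var
    # Reparameterisation trick: sample z ~ N(mu, sigma^2)
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    return softmax(z)

# The input is only an embedding, so the language of the document does not matter
# as long as the embedding model places all languages in the same space.
fake_embedding = rng.normal(size=(1, contextual_size))
theta = infer_topics(fake_embedding)
```

Because the BoW is only needed as the reconstruction target during training, prediction on a new document requires nothing but its embedding.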

### Data

**Building W1**

We use datasets collected from English
Wikipedia abstracts from DBpedia. The first dataset (W1) contains 20,000 randomly sampled abstracts. 


**Downloading DBpedia 20K Abstracts**

In [None]:
import wget
wget.download("https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt")

'dbpedia_sample_abstract_20k_unprep (1).txt'

In [None]:
text_file = "dbpedia_sample_abstract_20k_unprep.txt" # EDIT THIS WITH THE FILE YOU UPLOAD

**Preprocessing**

Why do we use the preprocessed text here? We need text without punctuation to build the bag of words. Also, we might want to keep only the most frequent words in the BoW, since too many words might not help.
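
The following is a simplified sketch of that preprocessing idea using only the standard library. It is not the code of `WhiteSpacePreprocessing` (for example, it does not remove stopwords), and the helper name `build_bow_vocabulary` is hypothetical.

```python
import string
from collections import Counter

def build_bow_vocabulary(documents, vocabulary_size=2000):
    """Lowercase, strip punctuation, and keep only the most frequent words."""
    table = str.maketrans("", "", string.punctuation)
    tokenized = [doc.lower().translate(table).split() for doc in documents]
    counts = Counter(token for doc in tokenized for token in doc)
    vocab = {word for word, _ in counts.most_common(vocabulary_size)}
    # Drop out-of-vocabulary tokens from each document
    cleaned = [" ".join(t for t in doc if t in vocab) for doc in tokenized]
    return cleaned, sorted(vocab)

docs = ["The cat sat on the mat.", "The dog sat, too!"]
cleaned, vocab = build_bow_vocabulary(docs, vocabulary_size=2)
print(cleaned)  # ['the sat the', 'the sat']
print(vocab)    # ['sat', 'the']
```

With `vocabulary_size=2`, only the two most frequent words survive, which shows how the cap shrinks the BoW that the model has to reconstruct.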

In [None]:
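import nltk
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

nltk.download('stopwords')

# Read one document per line
with open(text_file, encoding="utf-8") as f:
    documents = [line.strip() for line in f]
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()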



We don't discard the non-preprocessed texts, because we are going to use them as input for obtaining the contextualized document representations.

Let's pass our preprocessed and unpreprocessed data to our TopicModelDataPreparation object. This object takes care of creating the bag of words for you and of obtaining the contextualized BERT representations of documents. This operation allows us to create our training dataset.

Note: Here we use the English SBERT model "bert-base-nli-mean-tokens", because this section only involves English text. A multilingual model is needed for the cross-lingual predictions later.



**Training ZeroshotTM**

In [None]:
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk
import pickle

In [None]:
# Load English SBERT embeddings
tp = TopicModelDataPreparation("sentence-transformers/bert-base-nli-mean-tokens")

In [None]:
# Building training dataset
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)

### Training Models

**M1. Training Zero-Shot Contextualized Topic Model**

Finally, we can fit our new topic model. We will ask the model to find 50 topics in our collection (the `n_components` parameter of the CTM object), and we train a second model with 100 topics for comparison.

In [None]:
# Train over 100 epochs
ctm = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=768, batch_size = 200, n_components = 50)
ctm_100 = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=768, batch_size = 200, n_components = 100)
ctm.fit(training_dataset, save_dir = "./") # run the model
ctm_100.fit(training_dataset, save_dir = "./") # run the model

After training, it is time to look at our topics: we can use the `get_topic_lists` function to get the topics. It also accepts a parameter that allows you to select how many words you want to see for each topic.

If you look at the topics, you will see that they all make sense and are representative of a collection of documents that comes from Wikipedia (general knowledge). Notice that the topics are in English, because we trained the model on English documents.

In [None]:
# Check topics
ctm.get_topic_lists(5)
ctm_100.get_topic_lists(5)

In [None]:
# Topic Predictions
topics_predictions = ctm.get_thetas(training_dataset, n_samples=30) # get all the topic predictions
topics_predictions_100 = ctm_100.get_thetas(training_dataset, n_samples=30) # get all the topic predictions

In [None]:
# Get NPMI Coherence
from contextualized_topic_models.evaluation.measures import CoherenceNPMI
texts = [doc.split() for doc in preprocessed_documents] # load text for NPMI

# The argument of get_topic_lists is the number of words per topic, not the number of topics
npmi = CoherenceNPMI(texts=texts, topics=ctm.get_topic_lists(50))
npmi_100 = CoherenceNPMI(texts=texts, topics=ctm_100.get_topic_lists(100))

# Compute each score once and reuse it
zeroshotNPMI = [npmi.score(), npmi_100.score()]
print(zeroshotNPMI[0])
print(zeroshotNPMI[1])

0.16294250186748438
0.1477697544158918
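
For intuition about the scores above, here is a small pure-Python sketch of NPMI based on document-level co-occurrence. It is a simplification and not the library's `CoherenceNPMI`, whose counting scheme and smoothing may differ; the helpers `npmi_pair` and `topic_npmi` are hypothetical names.

```python
import math
from itertools import combinations

def npmi_pair(w1, w2, docs):
    """NPMI of two words from document-level co-occurrence counts."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0  # the words never co-occur
    if p12 == 1:
        return 1.0   # the words co-occur in every document
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def topic_npmi(topic_words, docs):
    """Average NPMI over all word pairs of one topic."""
    pairs = list(combinations(topic_words, 2))
    return sum(npmi_pair(a, b, docs) for a, b in pairs) / len(pairs)

# Each document is represented as a set of its words
docs = [{"cat", "dog"}, {"cat", "dog"}, {"car", "road"}, {"car", "road"}]
coherent = topic_npmi(["cat", "dog"], docs)
mixed = topic_npmi(["cat", "dog", "car"], docs)
```

NPMI ranges from -1 (words never appear together) to 1 (words always appear together), so a topic whose words tend to co-occur in the same documents scores higher.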


**M2. Training Neural-ProdLDA**

We use the implementation made available by [Carrow (2018)](https://github.com/estebandito22/PyTorchAVITM/blob/master/README.md).

**Model Training Instructions**

* Epochs = 100
* ADAM optimizer with learning rate = 2e-3 and momentum = 0.99
* The inference network is composed of a single hidden layer of 100 softplus units.
* The priors over the topic and document distributions are **learnable parameters**.
* We apply 20% dropout to the hidden document representation.
* Batch size = 200

In [None]:
# Imports
!git clone https://github.com/estebandito22/PyTorchAVITM # COMMENT AFTER RUNNING ONCE
!python /content/PyTorchAVITM/setup.py build
!python /content/PyTorchAVITM/setup.py install

In [None]:
import os
import json
import sys
import numpy as np
import pandas as pd

sys.path.insert(1, "/content/PyTorchAVITM")
from pytorchavitm import AVITM
from pytorchavitm.datasets import BOWDataset
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Build Dataset
# Vectorize the preprocessed documents into a dense document-term matrix
cv = CountVectorizer(input='content')
train_bow = cv.fit_transform(preprocessed_documents)
train_bow = train_bow.toarray()

idx2token = cv.get_feature_names()
input_size = len(idx2token)

# BOWDataset expects the BoW matrix first, then the index-to-token mapping
train_data = BOWDataset(train_bow, idx2token)

In [None]:
# Train Neural ProdLDA models
# input_size must match the number of columns of the BoW matrix in train_data
avitm = AVITM(input_size=input_size, n_components=50, model_type='prodLDA', hidden_sizes=(100, 100),
              activation='softplus', dropout=0.2, learn_priors=True,
              batch_size=200, lr=2e-3, momentum=0.99, solver='adam', num_epochs=100, reduce_on_plateau=False)

avitm_100 = AVITM(input_size=input_size, n_components=100, model_type='prodLDA', hidden_sizes=(100, 100),
                  activation='softplus', dropout=0.2, learn_priors=True,
                  batch_size=200, lr=2e-3, momentum=0.99, solver='adam', num_epochs=100, reduce_on_plateau=False)


avitm.fit(train_data, save_dir="./")
avitm_100.fit(train_data, save_dir="./")

# NPMI Scores for t = 50 and t = 100
npmi_avitm = CoherenceNPMI(topics = list(avitm.get_topics(50).values()), texts = texts).score()
npmi_avitm_100 = CoherenceNPMI(topics = list(avitm_100.get_topics(100).values()), texts = texts).score()
avitmNPMI = [npmi_avitm, npmi_avitm_100]
print(avitmNPMI)

**M3. Training LDA**

We use [Gensim’s](https://radimrehurek.com/gensim/models/ldamodel.html) implementation of this model.

**Model Training Instructions**

The hyper-parameters alpha and beta, controlling the document-topic and word-topic distribution respectively, are estimated from the data during training.

In [None]:
# Train LDA models
from typing import Dict
from pprint import pprint
from gensim.models import CoherenceModel, LdaModel
from gensim.corpora.dictionary import Dictionary

# Create a corpus from a list of texts
train_dict = Dictionary([text.split() for text in preprocessed_documents])
train_corpus = [train_dict.doc2bow(text.split()) for text in preprocessed_documents]

# Fit LDA models on the corpus
lda = LdaModel(train_corpus, num_topics=50, passes = 30,
               id2word=train_dict)

lda_100 = LdaModel(train_corpus, num_topics=100, passes = 30,
               id2word=train_dict)

In [None]:
#pprint(lda.print_topics())
coherence_model_lda = CoherenceModel(model=lda, texts=texts, dictionary=train_dict, coherence='c_npmi')
npmi_lda = coherence_model_lda.get_coherence()

npmi_lda_100 = CoherenceModel(model=lda_100, texts=texts, dictionary=train_dict, coherence='c_npmi')
print('\nCoherence Score for 50: ', npmi_lda)
print('\nCoherence Score for 100: ', npmi_lda_100.get_coherence())


Coherence Score for 50:  -0.07140085349638366

Coherence Score for 100:  -0.1742381019805529


In [None]:
lda_25 = LdaModel(train_corpus, num_topics=50, passes = 30,
               id2word=train_dict)

coherence_model_lda = CoherenceModel(model=lda_25, texts=texts, dictionary=train_dict, coherence='c_npmi')
coherence_model_lda.get_coherence()

-0.06021969745287585

In [None]:
lda_npmi =[npmi_lda, npmi_lda_100.get_coherence()]

**M4. Training Combined TM**

CTMs work better when the size of the bag of words is restricted to at most 2,000 terms. This is because the neural model has to reconstruct the input bag of words. Moreover, CombinedTM projects the contextualized embedding to the vocabulary space: the bigger the vocabulary, the more parameters the model has, which makes training more difficult and prone to bad fitting.

**Model Training Instructions**

* Epochs = 100
* ADAM optimizer
* Hyperparameters are the same as those used for Neural-ProdLDA, with the difference that we also use SBERT features in combination with the BoW.
* We take the SBERT embeddings, apply a (learnable) dense layer R^768 → R^|V| (768 matches `contextual_size` in the code below), and concatenate the result to the BoW.
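
The last step of the list can be sketched in NumPy as follows. This is an illustration of the description above rather than the library's internals: the weights `W` and `b` are random placeholders for learned parameters, the toy batch is made up, and the vocabulary size of 2,000 is just the upper bound mentioned earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, contextual_size, batch = 2000, 768, 4

# Toy inputs: BoW counts and SBERT-like embeddings for a batch of documents
x_bow = rng.integers(0, 3, size=(batch, vocab_size)).astype(float)
x_sbert = rng.normal(size=(batch, contextual_size))

# Learnable dense layer R^768 -> R^|V| (random weights here, illustration only)
W = rng.normal(scale=0.01, size=(contextual_size, vocab_size))
b = np.zeros(vocab_size)

# Project the embeddings to the vocabulary space, then concatenate with the BoW
projected = x_sbert @ W + b
combined_input = np.concatenate([x_bow, projected], axis=1)
print(combined_input.shape)  # (4, 4000)
```

The projection alone has 768 × |V| weights, which is why a larger vocabulary makes the model harder to train.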

In [None]:
# Train CombinedTM for 50 and 100 topics
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file

# Fit CombinedTM models
comtm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, batch_size=200, n_components=50)
comtm_100 = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, batch_size = 200, n_components=100)
#comtm.get_topic_lists(5)
#comtm_100.get_topic_lists(5)

In [None]:
comtm.fit(training_dataset, save_dir = "./") # run the model
comtm_100.fit(training_dataset, save_dir = "./") # run the model

In [None]:
# Get NPMI Coherence
from contextualized_topic_models.evaluation.measures import CoherenceNPMI
texts = [doc.split() for doc in preprocessed_documents] # load text for NPMI

# Score the CombinedTM models (comtm, comtm_100) rather than the ZeroShotTM ones
npmi_comtm = CoherenceNPMI(texts=texts, topics=comtm.get_topic_lists(50))
npmi_comtm_100 = CoherenceNPMI(texts=texts, topics=comtm_100.get_topic_lists(100))

combinedNPMI = [npmi_comtm.score(), npmi_comtm_100.score()]
print(combinedNPMI[0])
print(combinedNPMI[1])


# Zero-shot Cross-Lingual Topic Modeling
> Can the contextualized TM tackle zero-shot cross-lingual topic modeling?

The second dataset (W2) contains 100,000 English documents. We use 99,700 documents for training and hold out the remaining 300 documents as the test set. We collect the 300 corresponding documents in Portuguese, Italian, French, and German.

First, we use SBERT to generate multilingual embeddings as the input of the model. Then we evaluate multilingual topic predictions on the multilingual abstracts in W2.

In [1]:
%%capture
# Install the contextualized topic model library (the cell magic must be the first line of the cell)
!pip install contextualized-topic-models==2.2.0

# Imports
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk
import pickle
import numpy as np
from pprint import pprint

In [2]:
# Download W2 files for training and testing (given by authors)
!curl -L "https://drive.google.com/u/0/uc?id=1Mlhi5LUWxo7RqCOUvJuDzKZe4GauinoO&export=download" -o dbpedia_train_unprep.txt
!curl -L "https://drive.google.com/u/0/uc?id=1HY-hi_DmoL4FYNTmlvUYgYL9x-yzroj3&export=download" -o test_set



### Data
**Building Training Dataset (W2)**

In [3]:
# Raw files (given by authors)
train_file = "dbpedia_train_unprep.txt" # 100K english abstracts
test_file = "test_set" # 300 comparable documents in it, fr, pt, de, en

# Get Test File
with open(test_file, "rb") as filino:
  w2_test = pickle.load(filino)  # the with-block closes the file automatically

# Extract multilingual test files (indices given by authors)
italian_documents = [w2_test[i][0] for i in range(len(w2_test))]
french_documents = [w2_test[i][1] for i in range(len(w2_test))]
portugese_documents = [w2_test[i][2] for i in range(len(w2_test))]
german_documents = [w2_test[i][3] for i in range(len(w2_test))]
english_documents = [w2_test[i][4] for i in range(len(w2_test))]

# Remove English test documents from the train file to get the remaining 99,700 abstracts for training.
# Lines are stripped before comparison so that trailing newlines cannot hide a match.
with open(train_file, encoding="utf-8") as f:
  train_lines = {line.strip() for line in f}
w2_train = list(train_lines - {doc.strip() for doc in english_documents})[:99700]

# Preprocessing train set
nltk.download('stopwords')
documents = [line.strip() for line in w2_train]
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.




### Training Models

In [4]:
# Load multilingual embeddings from SBERT
tp = TopicModelDataPreparation("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Fit to build training dataset
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)




In [5]:
# Train zeroshotTM with english abstracts with t = 25
z_ctm_25 = ZeroShotTM(bow_size=len(tp.vocab), n_components = 25, contextual_size=768, num_epochs=100)
z_ctm_25.fit(training_dataset, save_dir="./") # run the model

Epoch: [100/100]	 Seen Samples: [9970000/9970000]	Train Loss: 210.62604166993168	Time: 0:00:13.496322: : 100it [28:48, 17.28s/it]


In [32]:
# Train zeroshotTM with english abstracts with t = 50
z_ctm_50 = ZeroShotTM(bow_size=len(tp.vocab), n_components = 50,contextual_size=768, num_epochs=100)
z_ctm_50.fit(training_dataset, save_dir="./") # run the model

Epoch: [100/100]	 Seen Samples: [9970000/9970000]	Train Loss: 216.56069684639857	Time: 0:00:14.507778: : 100it [31:17, 18.77s/it]


### Predictions and Evaluation
**Unseen Multilingual  Corpora Predictions**

In [None]:
# # Download 25 and 50 topic models
# !mkdir "contextualized_topic_model_nc_25_tpm_0_tpv_096_hs_prodLDA_ac_do_softplus_lr_02_mo_0002_rp_099"
# !curl -L "https://drive.google.com/u/0/uc?id=1dA4szvg8aIJtXz0CrVpD7HeKEeV-vo6R&export=download" -o contextualized_topic_model_nc_25_tpm_0_tpv_096_hs_prodLDA_ac_do_softplus_lr_02_mo_0002_rp_099/epoch_99.pth
# !cd ..
# # !mkdir "contextualized_topic_model_nc_50_tpm_0.0_tpv_0.98_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99"
# # !curl -L "https://drive.google.com/u/0/uc?id=1HY-hi_DmoL4FYNTmlvUYgYL9x-yzroj3&export=download" -o ccontextualized_topic_model_nc_50_tpm_0.0_tpv_0.98_hs_prodLDA_ac_do_softplus_lr_0.2_mo_0.002_rp_0.99/epoch_99.pth



In [None]:
# # Load model for 25 topics
# z_ctm_25 = ZeroShotTM(bow_size=len(tp.vocab), n_components = 25, contextual_size = 768, num_epochs = 100)
z_ctm_25.load(model_dir = "/content/contextualized_topic_model_nc_25_tpm_0.0_tpv_0.96_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99", epoch = 99)



In [None]:
# # Load model for 50 topics
# z_ctm_50 = ZeroShotTM(bow_size=len(tp.vocab), n_components = 50, contextual_size = 768, num_epochs = 100)
# z_ctm_50.load(model_dir = "/content/contextualized_topic_model_nc_50_tpm_0.0_tpv_0.98_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99", 
#                     epoch = 99)

In [6]:
# Convert test files into test datasets
it_testset = tp.transform(italian_documents)
fr_testset = tp.transform(french_documents)
de_testset = tp.transform(german_documents)
pt_testset = tp.transform(portugese_documents)
en_testset = tp.transform(english_documents)




In [26]:
### 25 TOPIC PREDICTIONS ### 
it_topics_predictions = z_ctm_25.get_thetas(it_testset, n_samples=100) # get all the topic predictions
fr_topics_predictions = z_ctm_25.get_thetas(fr_testset, n_samples=100) # get all the topic predictions
de_topics_predictions = z_ctm_25.get_thetas(de_testset, n_samples=100) # get all the topic predictions
pt_topics_predictions = z_ctm_25.get_thetas(pt_testset, n_samples=100) # get all the topic predictions
en_topics_predictions = z_ctm_25.get_thetas(en_testset, n_samples=100) # get all the topic predictions

topics_25 = [it_topics_predictions, fr_topics_predictions, 
             pt_topics_predictions, de_topics_predictions,
             en_topics_predictions]

Sampling: [100/100]: : 100it [00:56,  1.77it/s]
Sampling: [100/100]: : 100it [00:54,  1.82it/s]
Sampling: [100/100]: : 100it [00:55,  1.79it/s]
Sampling: [100/100]: : 100it [00:55,  1.81it/s]
Sampling: [100/100]: : 100it [00:55,  1.79it/s]


In [27]:
topics_25[4]
topics_25[1]

array([[0.03950265, 0.1565703 , 0.02988122, ..., 0.03332205, 0.02045516,
        0.02141155],
       [0.03132896, 0.01871674, 0.02781799, ..., 0.03004164, 0.02020359,
        0.01966489],
       [0.03622307, 0.02629677, 0.01945086, ..., 0.02466495, 0.01999504,
        0.01880061],
       ...,
       [0.02127323, 0.01354207, 0.03119404, ..., 0.02215813, 0.02936685,
        0.01039033],
       [0.02338544, 0.02284519, 0.03164743, ..., 0.01572657, 0.03061562,
        0.00745121],
       [0.05410563, 0.024912  , 0.08447234, ..., 0.0271306 , 0.01712496,
        0.01671226]])

In [33]:
### 50 TOPIC PREDICTIONS ### 
it_topics_predictions = z_ctm_50.get_thetas(it_testset, n_samples=100) # get all the topic predictions
fr_topics_predictions = z_ctm_50.get_thetas(fr_testset, n_samples=100) # get all the topic predictions
de_topics_predictions = z_ctm_50.get_thetas(de_testset, n_samples=100) # get all the topic predictions
pt_topics_predictions = z_ctm_50.get_thetas(pt_testset, n_samples=100) # get all the topic predictions
en_topics_predictions = z_ctm_50.get_thetas(en_testset, n_samples=100) # get all the topic predictions

topics_50 = [it_topics_predictions, fr_topics_predictions, 
             pt_topics_predictions, de_topics_predictions,
             en_topics_predictions]

Sampling: [100/100]: : 100it [01:38,  1.01it/s]
Sampling: [100/100]: : 100it [01:38,  1.01it/s]
Sampling: [100/100]: : 100it [01:40,  1.00s/it]
Sampling: [100/100]: : 100it [01:39,  1.00it/s]
Sampling: [100/100]: : 100it [01:40,  1.01s/it]


**Quantitative Evaluation**

In [8]:
# Import metrics
from contextualized_topic_models.evaluation.measures import Matches, KLDivergence, CentroidDistance
import warnings
warnings.filterwarnings('ignore')

scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
  _deprecated()


1. **Matches**

> Matches is the % of times the predicted topic for the non-English test document is the same as for the respective test document in English. The higher the scores, the better.
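
As a rough illustration of this definition (not the library's `Matches` class), the metric can be sketched in NumPy. The function name `matches_score` and the toy distributions are made up for the example.

```python
import numpy as np

def matches_score(theta_english, theta_other):
    """Fraction of documents whose most probable topic agrees across languages."""
    top_en = np.argmax(theta_english, axis=1)
    top_other = np.argmax(theta_other, axis=1)
    return float(np.mean(top_en == top_other))

# Three documents, three topics; rows are topic distributions
theta_en = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])
theta_it = np.array([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])
score = matches_score(theta_en, theta_it)  # documents 1 and 3 agree, document 2 does not
```

Only the top topic matters here, so two distributions that are close but have different maxima count as a miss; the next two metrics relax this.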

In [9]:
# Matches for 25 topics
en_it_matches = Matches(topics_25[4], topics_25[0])
en_fr_matches = Matches(topics_25[4], topics_25[1])
en_pt_matches = Matches(topics_25[4], topics_25[2])
en_de_matches = Matches(topics_25[4], topics_25[3])

matches_25 = [en_it_matches.score(), en_fr_matches.score(), 
           en_pt_matches.score(), en_de_matches.score()]
matches_25

[0.77, 0.7833333333333333, 0.75, 0.7533333333333333]

In [34]:
# Matches for 50 topics
en_it_matches = Matches(topics_50[4], topics_50[0])
en_fr_matches = Matches(topics_50[4], topics_50[1])
en_pt_matches = Matches(topics_50[4], topics_50[2])
en_de_matches = Matches(topics_50[4], topics_50[3])

matches_50 = [en_it_matches.score(), en_fr_matches.score(), 
           en_pt_matches.score(), en_de_matches.score()]

2. **Distributional Similarity**
> Compute the KL divergence between the predicted topic distribution on the test document and the same test document in English. Lower scores are better, indicating that the distributions do not differ by much.
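
A minimal NumPy sketch of an averaged KL divergence between paired topic distributions is shown below. The direction of the divergence (which language plays the role of `p`) and the clipping constant `eps` are assumptions of this sketch; the library's `KLDivergence` may make different choices.

```python
import numpy as np

def mean_kl_divergence(theta_p, theta_q, eps=1e-12):
    """Average KL(p || q) over paired rows of two topic-distribution matrices."""
    # Clip to avoid log(0) and division by zero
    p = np.clip(np.asarray(theta_p, dtype=float), eps, None)
    q = np.clip(np.asarray(theta_q, dtype=float), eps, None)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))

same = mean_kl_divergence([[0.5, 0.5]], [[0.5, 0.5]])
different = mean_kl_divergence([[0.5, 0.5]], [[0.9, 0.1]])
```

Identical distributions give 0, and the value grows as the two predictions diverge, which is why lower is better.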

In [10]:
# KL Divergence for 25 topics
en_it_kl = KLDivergence(topics_25[4], topics_25[0])
en_fr_kl = KLDivergence(topics_25[4], topics_25[1])
en_pt_kl = KLDivergence(topics_25[4], topics_25[2])
en_de_kl = KLDivergence(topics_25[4], topics_25[3])

# Keep the same language order as matches_25: it, fr, pt, de
kl_divergence_25 = [en_it_kl.score(), en_fr_kl.score(), 
           en_pt_kl.score(), en_de_kl.score()]
           
kl_divergence_25

[0.12937350917783688,
 0.13802749811641576,
 0.11945237810821654,
 0.1472539906039054]

In [None]:
# KL Divergence for 50 topics
en_it_kl = KLDivergence(topics_50[4], topics_50[0])
en_fr_kl = KLDivergence(topics_50[4], topics_50[1])
en_pt_kl = KLDivergence(topics_50[4], topics_50[2])
en_de_kl = KLDivergence(topics_50[4], topics_50[3])

# Keep the same language order as matches_50: it, fr, pt, de
kl_divergence_50 = [en_it_kl.score(), en_fr_kl.score(), 
           en_pt_kl.score(), en_de_kl.score()]

kl_divergence_50

3. **Centroid Embeddings**
> To also account for similar but not exactly equal topic predictions, we compute the centroid embeddings of the 5 words describing the predicted topic for both English and non-English documents. Then we compute the cosine similarity between those two centroids (CD).
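
The computation can be sketched with a toy embedding table in NumPy. The two-dimensional vectors and the helper names `centroid` and `centroid_similarity` are invented for illustration, and the sketch assumes at least one word of each topic is present in the table; the actual metric uses pre-trained word vectors.

```python
import numpy as np

def centroid(words, embeddings):
    """Normalised sum of the vectors of the words present in `embeddings`."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    total = np.sum(vectors, axis=0)
    return total / np.linalg.norm(total)

def centroid_similarity(words_a, words_b, embeddings):
    """Cosine similarity between the centroids of two topic descriptions."""
    return float(centroid(words_a, embeddings) @ centroid(words_b, embeddings))

# Toy 2-d word vectors (made up for the example)
toy = {"film": np.array([1.0, 0.0]),
       "movie": np.array([0.9, 0.1]),
       "river": np.array([0.0, 1.0])}

close = centroid_similarity(["film"], ["movie"], toy)
far = centroid_similarity(["film"], ["river"], toy)
```

Two different topics described by related words (here "film" and "movie") still score close to 1, whereas Matches would count them as a plain mismatch.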

In [11]:
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models import KeyedVectors
import gensim.downloader as api
from scipy.spatial.distance import cosine
import abc

class CD(CentroidDistance):
    """Override author's function to upgrade compatibility with Gensim 4.0.0.
    See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4."""

    def get_centroid(self, word_list):
        vector_list = []
        for word in word_list:
            if word in self.wv:   # changed from self.wv.vocab to self.wv as in Gensim 4.0.0
                vector_list.append(self.wv.get_vector(word))
        vec = sum(vector_list)
        return vec / np.linalg.norm(vec)

In [None]:
# Centroid Embeddings for 25 topics
cd_25 = []

for i in range(4):
  cd = CD(doc_distribution_original_language = topics_25[4], 
          doc_distribution_unseen_language = topics_25[i], 
          topics = z_ctm_25.get_topic_lists(25),
          topk = 5)
  
  cd_25.append(cd.score())

cd_25

In [31]:
metrics_25 = [matches_25, kl_divergence_25, cd_25]
with open("metrics.txt", 'wb') as F:
  pickle.dump(metrics_25, F)

In [36]:
# Centroid Embeddings for 50 topics
cd_50 = []

for i in range(4):
  cd = CD(doc_distribution_original_language = topics_50[4], 
          doc_distribution_unseen_language = topics_50[i], 
          topics = z_ctm_50.get_topic_lists(25),
          topk = 5)
  
  cd_50.append(cd.score())
  cd = 0

cd_50

[0.7623366142809391,
 0.7303878939151764,
 0.7646546208610138,
 0.7610922541220982]

In [37]:
metrics = {"Mat25": matches_25,
           "KL25": kl_divergence_25, 
           "CD25": cd_25, 
           "Mat50": matches_50, 
           "KL50": kl_divergence_50,
           "CD50": cd_50}
with open("metrics.txt", 'wb') as F:
  pickle.dump(metrics, F)