<a href="https://colab.research.google.com/github/DivyaRustagi10/contextualized-topic-models-ssl/blob/main/ZeroshotTM_For_Same_Script_Languages.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# To contextualize or to not contextualize?

> Can we define a topic model that does not rely on the BoW input but instead uses contextual information?

First, we want to check whether ZeroShotTM maintains performance comparable to other topic models; if it does, we can then explore its performance in a cross-lingual setting.

**Hindi**
Since we use only Hindi text, in this setting we use Hindi representations.



In [2]:
%%capture
# Install the contextualized topic model library (%%capture must be the first line of the cell)
!pip install -U contextualized_topic_models

In [3]:
%%capture
!pip install pyldavis
!pip install wget
!pip install head

In [147]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [None]:
# # Download IndicBERT
# !git clone https://github.com/ai4bharat/indic-bert
# %cd indic-bert
# !pip3 install -r requirements.txt
# %cd ..
# !mkdir indic-glue outputs
# # !pip3 list --outdated --format=freeze | grep -v '^\-e' | cut -d = -f 1 | xargs -n1 pip3 install -U 

**Overview**

The PMIndia corpus is built from the website of the Indian Prime Minister (www.pmindia.gov.in). Below we download its monolingual Hindi portion.

In [None]:
import wget
link = "https://data.statmt.org/pmindia/v1/monolingual/pmindia.v1.hi.tgz"
wget.download(link)
!tar -xvzf pmindia.v1.hi.tgz

In [52]:
import glob
files = glob.glob(r'/content/split/*.txt')

print(files)

['/content/split/mou-between-reserve-bank-of-india-and-central-bank-of-united-arab-emirates-on-co-operation-concerning-currency-swap-agreement.txt', '/content/split/text-of-pms-address-at-the-joint-inauguration-of-the-indo-german-business-summit-in-hannover.txt', '/content/split/pm-inaugurates-india-food-park-in-tumkur.txt', '/content/split/chairman-and-ceo-bank-of-america-calls-on-pm.txt', '/content/split/pm-to-interact-with-school-children-tomorrow.txt', '/content/split/cabinet-approves-cadre-review-of-indian-information-service.txt', '/content/split/pms-speech-after-dedication-of-sardar-sarovar-dam-to-nation-in-gujarat.txt', '/content/split/pm-pays-tributes-to-dr-zakir-hussain-on-his-birth-anniversary.txt', '/content/split/cabinet-approves-signing-of-air-services-agreement-between-india-and-greece.txt', '/content/split/cabinet-approves-agreement-on-audio-visual-co-production-between-india-and-bangladesh.txt', '/content/split/pm-addresses-young-entrepreneurs-at-the-champions-for-chan

In [54]:
# Read each speech file into one string and collect the speeches in a list
combine = []
for f in files:
  with open(f, 'r', encoding='utf-8') as fold:
    speech = " ".join(line for line in fold)
    combine.append(speech)

In [64]:
type(combine[2])
len(combine)

4806

In [63]:
# Selecting train speeches
import pandas as pd
hindi_unprep = pd.DataFrame(combine)  # one speech per row

# Flag speeches with fewer than 500 space-separated tokens
# (only the first 5 speeches are checked here; nothing is dropped from `combine`)
TOKENS_LIMIT = 500
remove = []
for speech in hindi_unprep[:5].itertuples():
  if len(speech[1].split(" ")) < TOKENS_LIMIT:
    remove.append(speech[0])
    print(speech[0], " removed!")

file_name = 'pmindia_hindi_unprep.txt'

hindi_unprep = hindi_unprep[3:]

# Write the first three speeches to a file, each followed by a tab
with open(file_name, 'w', encoding = "utf-8") as f:
  f.writelines(str(line) + "\t" for line in combine[:3])

x = open(file_name, "r", encoding = "utf-8")


0  removed!
2  removed!
3  removed!
4  removed!


104
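
The cell above checks only the first five speeches and does not remove anything from `combine`. A minimal sketch of applying the token limit to every speech could look like the following. It assumes whitespace tokenization via `str.split()`, and `filter_by_token_count` and the toy data are hypothetical names used for illustration, not part of the notebook.

```python
# Hypothetical helper (not part of the original notebook)
def filter_by_token_count(speeches, min_tokens=500):
    """Keep only speeches with at least `min_tokens` whitespace-separated tokens."""
    return [s for s in speeches if len(s.split()) >= min_tokens]

# Toy data standing in for the `combine` list of speeches
toy_speeches = ["word " * 600, "too short", "token " * 500]
kept = filter_by_token_count(toy_speeches, min_tokens=500)
print(len(kept), "of", len(toy_speeches), "speeches kept")
```

On the real data this would be called as `filter_by_token_count(combine, TOKENS_LIMIT)`.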

In [57]:
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import  WhiteSpacePreprocessingStopwords
import pickle

### Data

**Building W1**

We use datasets collected from Hindi PMIndia speeches. The first dataset (W1) contains X randomly sampled abstracts. 


**Downloading PMIndia Speeches**

In [58]:
text_file = "pmindia_hindi_unprep.txt" # EDIT THIS WITH THE FILE YOU UPLOAD

**Preprocessing**

Why do we use the preprocessed text here? We need text without punctuation to build the bag of words (BoW). We may also want to keep only the most frequent words in the BoW, since a very large vocabulary might not help.
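
To make the idea concrete, here is a minimal sketch of this kind of preprocessing: strip punctuation, drop stopwords, and keep only the most frequent words. It is not the library's implementation; `simple_preprocess`, its parameters, and the English toy sentences are assumptions made for illustration. Note that `string.punctuation` covers ASCII punctuation only, so Hindi marks such as the danda would need to be added separately.

```python
import string
from collections import Counter

def simple_preprocess(docs, stopwords, vocab_size=2000):
    """Lowercase, strip ASCII punctuation, drop stopwords, keep the most frequent words."""
    table = str.maketrans("", "", string.punctuation)
    tokenized = [
        [w for w in doc.lower().translate(table).split() if w not in stopwords]
        for doc in docs
    ]
    # Count word frequencies over the whole corpus and keep the top `vocab_size`
    counts = Counter(w for doc in tokenized for w in doc)
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    cleaned = [" ".join(w for w in doc if w in vocab) for doc in tokenized]
    return cleaned, sorted(vocab)

docs = ["The cabinet approves the agreement.", "The cabinet meets today!"]
cleaned, vocab = simple_preprocess(docs, stopwords={"the"}, vocab_size=3)
print(cleaned)
print(vocab)
```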

In [59]:
# Download Hindi Stopwords
!pip install stopwordsiso
import stopwordsiso as stopwords



In [77]:
documents = [line[:501].strip() for line in combine]

sp = WhiteSpacePreprocessingStopwords(documents, stopwords_list = stopwords.stopwords("hi"))
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()



In [78]:
len(preprocessed_documents)
#len(hindi_unprep)

len(unpreprocessed_corpus)

4802

We don't discard the unpreprocessed Hindi texts, because we use them as input for obtaining the **contextualized** document representations.

Let's pass our preprocessed and unpreprocessed data to the TopicModelDataPreparation object. This object creates the bag of words and obtains the contextualized BERT representations of the documents. Together these form our training dataset.

Note: Here we use the contextualized model "ai4bharat/indic-bert" (IndicBERT), because we need a multilingual model for Indic languages to perform cross-lingual predictions later.
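
Conceptually, each training example pairs a bag-of-words count vector (from the preprocessed text) with a contextual embedding (from the unpreprocessed text). The sketch below illustrates that pairing with plain Python lists. It is not how the library builds its dataset internally: `bag_of_words` is a hypothetical helper, and the zero vectors merely stand in for real 768-dimensional embeddings.

```python
def bag_of_words(preprocessed_docs, vocab):
    """Return one count vector per document, with one entry per vocabulary word."""
    index = {w: i for i, w in enumerate(vocab)}
    rows = []
    for doc in preprocessed_docs:
        row = [0] * len(vocab)
        for w in doc.split():
            if w in index:
                row[index[w]] += 1
        rows.append(row)
    return rows

toy_vocab = ["agreement", "cabinet", "india"]
bow = bag_of_words(["cabinet agreement cabinet", "india"], toy_vocab)

# Placeholder embeddings: real ones would come from the contextual model
fake_embeddings = [[0.0] * 768 for _ in bow]

# One (BoW vector, embedding) pair per document
training_pairs = list(zip(bow, fake_embeddings))
print(bow)
```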



**Training ZeroshotTM**

In [79]:
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk
import pickle

In [86]:
# Load Indic Multilingual embeddings 
tp = TopicModelDataPreparation("ai4bharat/indic-bert")

In [87]:
# Building training dataset
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)

Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/ai4bharat_indic-bert were not used when initializing AlbertModel: ['sop_classifier.classifier.weight', 'predictions.LayerNorm.bias', 'predictions.dense.weight', 'predictions.LayerNorm.weight', 'sop_classifier.classifier.bias', 'predictions.decoder.weight', 'predictions.decoder.bias', 'predictions.dense.bias', 'predictions.bias']
- This IS expected if you are initializing AlbertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).





In [88]:
# Hindi - 1
# English - 2
# Train ZeroShotTM on the Hindi speeches with 20 topics (the variable is still named z_ctm_25)
z_ctm_25 = ZeroShotTM(bow_size=len(tp.vocab), n_components = 20, contextual_size=768, num_epochs=30)
z_ctm_25.fit(training_dataset) # run the model

Epoch: [30/30]	 Seen Samples: [144060/144060]	Train Loss: 86.96804498801973	Time: 0:00:01.033218: : 30it [00:30,  1.03s/it]
Sampling: [20/20]: : 20it [00:18,  1.08it/s]


In [91]:
z_ctm_25.get_topic_lists(5)

[['करन', 'बठक', 'गई', 'तहत', 'रपय'],
 ['एव', 'कषतर', 'एमओय', 'जल', 'परबधन'],
 ['नरनदर', 'करत', 'अवसर', 'हए', 'अपन'],
 ['हए', 'आज', 'करत', 'नरनदर', 'नरदर'],
 ['परबधन', 'कषतर', 'एव', 'तहत', 'एमओय'],
 ['आज', 'हम', 'हए', 'अवसर', 'दश'],
 ['तहत', 'कषतर', 'एमओय', 'एव', 'करन'],
 ['मर', 'बहत', 'मझ', 'दश', 'रह'],
 ['इवट', 'यन', 'पम', 'लट', 'चत'],
 ['बहत', 'दश', 'मर', 'मझ', 'आज'],
 ['बठक', 'गई', 'करन', 'रपय', 'नए'],
 ['दश', 'बहत', 'मझ', 'मर', 'आज'],
 ['the', 'in', 'of', 'minister', 'excellency'],
 ['सदश', 'अवसर', 'नमन', 'उनह', 'उनक'],
 ['इवट', 'लट', 'यन', 'पम', 'चत'],
 ['हए', 'आज', 'करत', 'नई', 'अवसर'],
 ['मझ', 'बहत', 'मर', 'हम', 'दश'],
 ['यन', 'इवट', 'करग', 'मध', 'भट'],
 ['बठक', 'रप', 'सममलन', 'करन', 'नई'],
 ['नशनल', 'ऑफ', 'तल', 'गस', 'iii']]

# Zero-shot Cross-Lingual Topic Modeling
> Can the contextualized TM tackle zero-shot cross-lingual topic modeling?

W1 contains 5K Hindi documents. We use 4700 documents for training and keep the remaining 300 documents as the test set. We collect the 300 corresponding instances in Gujarati, Marathi, and Sindhi.

Tamil, Telugu, Kannada, Malayalam - Dravidian languages

First, we use IndicBERT to generate multilingual embeddings as the input of the model. Then we evaluate multilingual topic predictions on the multilingual abstracts in W1.
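
The 4700/300 split described above can be sketched as follows. The text does not say how the 300 test documents are chosen, so the seeded random split here is an assumption, and `split_train_test` is a hypothetical helper. In the cross-lingual setting the test documents must also be the ones for which versions in the other languages exist.

```python
import random

def split_train_test(documents, n_test=300, seed=0):
    """Hold out `n_test` randomly chosen documents and return (train, test)."""
    indices = list(range(len(documents)))
    random.Random(seed).shuffle(indices)
    test_idx = set(indices[:n_test])
    train = [d for i, d in enumerate(documents) if i not in test_idx]
    test = [d for i, d in enumerate(documents) if i in test_idx]
    return train, test

# Toy corpus standing in for the 5K Hindi documents
toy_docs = [f"doc {i}" for i in range(5000)]
train_docs, test_docs = split_train_test(toy_docs)
print(len(train_docs), len(test_docs))
```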

In [None]:
%%capture
# Install the contextualized topic model library (%%capture must be the first line of the cell)
!pip install contextualized-topic-models==2.2.0

# Imports
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk
import pickle
import numpy as np
from pprint import pprint

In [None]:
# Download W2 files for training and testing (given by authors)
!curl -L "https://drive.google.com/u/0/uc?id=1HY-hi_DmoL4FYNTmlvUYgYL9x-yzroj3&export=download" -o test_set

### Data
**Building Training Dataset (W2)**

In [None]:
# Raw files (given by authors)
train_file = "dbpedia_train_unprep.txt" # 100K english abstracts
test_file = "test_set" # 300 comparable documents in it, fr, pt, de, en

# Get test file (the with-block closes the file automatically)
with open(test_file, "rb") as filino:
  w2_test = pickle.load(filino)

# Extract multilingual test files (indices given by authors)
italian_documents = [w2_test[i][0] for i in range(len(w2_test))]
french_documents = [w2_test[i][1] for i in range(len(w2_test))]
portugese_documents = [w2_test[i][2] for i in range(len(w2_test))]
german_documents = [w2_test[i][3] for i in range(len(w2_test))]
english_documents = [w2_test[i][4] for i in range(len(w2_test))] 

# Remove english documents from train file to get remaining 99,700 abstracts for training
w2_train = list (set(open(train_file, encoding="utf-8").readlines()) - set (english_documents))[:99700]

# Preprocessing train set
nltk.download('stopwords')
documents = [line.strip() for line in w2_train]
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()

### Training Models

In [None]:
# Load multilingual embeddings from SBERT
tp = TopicModelDataPreparation("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Fit to build training dataset
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)

In [None]:
# Train ZeroShotTM on English abstracts with t = 25
z_ctm_25 = ZeroShotTM(bow_size=len(tp.vocab), n_components = 25, contextual_size=768, num_epochs=100)
z_ctm_25.fit(training_dataset, save_dir="./") # run the model

In [None]:
# Train ZeroShotTM on English abstracts with t = 50
z_ctm_50 = ZeroShotTM(bow_size=len(tp.vocab), n_components = 50,contextual_size=768, num_epochs=100)
z_ctm_50.fit(training_dataset, save_dir="./") # run the model

### Predictions and Evaluation
**Unseen Multilingual  Corpora Predictions**



In [None]:
# # Load model for 25 topics
# z_ctm_25 = ZeroShotTM(bow_size=len(tp.vocab), n_components = 25, contextual_size = 768, num_epochs = 100)
z_ctm_25.load(model_dir = "/content/contextualized_topic_model_nc_25_tpm_0.0_tpv_0.96_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99", epoch = 99)



In [None]:
# # Load model for 50 topics
# z_ctm_50 = ZeroShotTM(bow_size=len(tp.vocab), n_components = 50, contextual_size = 768, num_epochs = 100)
# z_ctm_50.load(model_dir = "/content/contextualized_topic_model_nc_50_tpm_0.0_tpv_0.98_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99", 
#                     epoch = 99)

In [None]:
# Convert test files into test datasets
it_testset = tp.transform(italian_documents)
fr_testset = tp.transform(french_documents)
de_testset = tp.transform(german_documents)
pt_testset = tp.transform(portugese_documents)
en_testset = tp.transform(english_documents)

In [None]:
### 25 TOPIC PREDICTIONS ### 
it_topics_predictions = z_ctm_25.get_thetas(it_testset, n_samples=100) # get all the topic predictions
fr_topics_predictions = z_ctm_25.get_thetas(fr_testset, n_samples=100) # get all the topic predictions
de_topics_predictions = z_ctm_25.get_thetas(de_testset, n_samples=100) # get all the topic predictions
pt_topics_predictions = z_ctm_25.get_thetas(pt_testset, n_samples=100) # get all the topic predictions
en_topics_predictions = z_ctm_25.get_thetas(en_testset, n_samples=100) # get all the topic predictions

topics_25 = [it_topics_predictions, fr_topics_predictions, 
             pt_topics_predictions, de_topics_predictions,
             en_topics_predictions]

In [None]:
### 50 TOPIC PREDICTIONS ### 
it_topics_predictions = z_ctm_50.get_thetas(it_testset, n_samples=100) # get all the topic predictions
fr_topics_predictions = z_ctm_50.get_thetas(fr_testset, n_samples=100) # get all the topic predictions
de_topics_predictions = z_ctm_50.get_thetas(de_testset, n_samples=100) # get all the topic predictions
pt_topics_predictions = z_ctm_50.get_thetas(pt_testset, n_samples=100) # get all the topic predictions
en_topics_predictions = z_ctm_50.get_thetas(en_testset, n_samples=100) # get all the topic predictions

topics_50 = [it_topics_predictions, fr_topics_predictions, 
             pt_topics_predictions, de_topics_predictions,
             en_topics_predictions]

**Quantitative Evaluation**

In [None]:
# Import metrics
from contextualized_topic_models.evaluation.measures import Matches, KLDivergence, CentroidDistance
import warnings
warnings.filterwarnings('ignore')

1. **Matches**

> Matches is the % of times the predicted topic for the non-English test document is the same as for the respective test document in English. The higher the scores, the better.
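
As a rough illustration of what this metric measures, the sketch below takes the most probable topic for each document in both languages and reports the fraction of documents where they agree. It is a simplified stand-in, not the library's `Matches` class; `matches_score` and the toy distributions are made up for the example.

```python
def matches_score(thetas_en, thetas_other):
    """Fraction of documents whose most probable topic is the same in both languages."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)
    same = sum(argmax(a) == argmax(b) for a, b in zip(thetas_en, thetas_other))
    return same / len(thetas_en)

# Toy topic distributions for three documents and three topics
toy_en = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
toy_it = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]

# Documents 1 and 3 agree on the top topic, document 2 does not
score = matches_score(toy_en, toy_it)
print(score)
```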

In [None]:
# Matches for 25 topics
en_it_matches = Matches(topics_25[4], topics_25[0])
en_fr_matches = Matches(topics_25[4], topics_25[1])
en_pt_matches = Matches(topics_25[4], topics_25[2])
en_de_matches = Matches(topics_25[4], topics_25[3])

matches_25 = [en_it_matches.score(), en_fr_matches.score(), 
           en_pt_matches.score(), en_de_matches.score()]
matches_25

In [None]:
# Matches for 50 topics
en_it_matches = Matches(topics_50[4], topics_50[0])
en_fr_matches = Matches(topics_50[4], topics_50[1])
en_pt_matches = Matches(topics_50[4], topics_50[2])
en_de_matches = Matches(topics_50[4], topics_50[3])

matches_50 = [en_it_matches.score(), en_fr_matches.score(), 
           en_pt_matches.score(), en_de_matches.score()]

2. **Distributional Similarity**
> Compute the KL divergence between the predicted topic distribution on the test document and the same test document in English. Lower scores are better, indicating that the distributions do not differ by much.
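
A minimal sketch of this computation, averaged over documents, is shown below. The direction of the divergence (English distribution first) and the small `eps` added for numerical safety are assumptions for illustration; the library's `KLDivergence` class may differ in both respects. `mean_kl_divergence` is a hypothetical name.

```python
import math

def mean_kl_divergence(thetas_p, thetas_q, eps=1e-12):
    """Average KL(p || q) over paired rows of two topic-distribution matrices."""
    total = 0.0
    for p, q in zip(thetas_p, thetas_q):
        total += sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    return total / len(thetas_p)

# Toy distributions: one document, two topics
toy_en = [[0.5, 0.5]]
toy_other = [[0.9, 0.1]]

kl_same = mean_kl_divergence(toy_en, toy_en)     # identical distributions
kl_diff = mean_kl_divergence(toy_en, toy_other)  # differing distributions
print(kl_same, kl_diff)
```

Identical distributions give a divergence of zero, and the value grows as the two predictions drift apart.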

In [None]:
# KL Divergence for 25 topics
en_it_kl = KLDivergence(topics_25[4], topics_25[0])
en_fr_kl = KLDivergence(topics_25[4], topics_25[1])
en_pt_kl = KLDivergence(topics_25[4], topics_25[2])
en_de_kl = KLDivergence(topics_25[4], topics_25[3])

# Same language order as matches_25: it, fr, pt, de
kl_divergence_25 = [en_it_kl.score(), en_fr_kl.score(), 
           en_pt_kl.score(), en_de_kl.score()]
           
kl_divergence_25

In [None]:
# KL Divergence for 50 topics
en_it_kl = KLDivergence(topics_50[4], topics_50[0])
en_fr_kl = KLDivergence(topics_50[4], topics_50[1])
en_pt_kl = KLDivergence(topics_50[4], topics_50[2])
en_de_kl = KLDivergence(topics_50[4], topics_50[3])

# Same language order as matches_50: it, fr, pt, de
kl_divergence_50 = [en_it_kl.score(), en_fr_kl.score(), 
           en_pt_kl.score(), en_de_kl.score()]

kl_divergence_50

3. **Centroid Embeddings**
> To also account for similar but not exactly equal topic predictions, we compute the centroid embeddings of the 5 words describing the predicted topic for both English and non-English documents. Then we compute the cosine similarity between those two centroids (CD).
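
The idea can be sketched with toy word vectors: sum the vectors of a topic's top words, normalize the result to get a centroid, and compare two centroids with cosine similarity. The two-dimensional vectors and the helper names below are invented for illustration; the actual metric uses pretrained word embeddings.

```python
import math

def centroid(words, vectors):
    """Normalized sum of the vectors of the words that have an embedding."""
    found = [vectors[w] for w in words if w in vectors]
    summed = [sum(dim) for dim in zip(*found)]
    norm = math.sqrt(sum(x * x for x in summed))
    return [x / norm for x in summed]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 2-d embeddings
toy_vectors = {
    "river": [1.0, 0.0], "water": [0.8, 0.2],
    "bank": [0.0, 1.0], "money": [0.1, 0.9],
}
topic_a = ["river", "water"]
topic_b = ["bank", "money"]

same = cosine_similarity(centroid(topic_a, toy_vectors), centroid(topic_a, toy_vectors))
diff = cosine_similarity(centroid(topic_a, toy_vectors), centroid(topic_b, toy_vectors))
print(same, diff)
```

Two documents assigned different but related topics would score higher here than under Matches, which only rewards exact agreement.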

In [None]:
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models import KeyedVectors
import gensim.downloader as api
from scipy.spatial.distance import cosine
import abc

class CD(CentroidDistance):
    """Override author's function to upgrade compatibility with Gensim 4.0.0.
    See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4."""

    def get_centroid(self, word_list):
        vector_list = []
        for word in word_list:
            if word in self.wv:   # changed from self.wv.vocab to self.wv as in Gensim 4.0.0
                vector_list.append(self.wv.get_vector(word))
        vec = sum(vector_list)
        return vec / np.linalg.norm(vec)

In [None]:
# Centroid Embeddings for 25 topics
cd_25 = []

for i in range(4):
  # Index into topics_25 without overwriting it, so topics_25[4] stays the English predictions
  cd = CD(doc_distribution_original_language = topics_25[4], 
          doc_distribution_unseen_language = topics_25[i], 
          topics = z_ctm_25.get_topic_lists(25),
          topk = 5)
  
  cd_25.append(cd.score())

cd_25

In [None]:
# Centroid Embeddings for 50 topics
cd_50 = []

for i in range(4):
  cd = CD(doc_distribution_original_language = topics_50[4], 
          doc_distribution_unseen_language = topics_50[i], 
          topics = z_ctm_50.get_topic_lists(25),
          topk = 5)
  
  cd_50.append(cd.score())
  cd = None

cd_50

In [None]:
metrics = {"Mat25": matches_25,
           "KL25": kl_divergence_25, 
           "CD25": cd_25, 
           "Mat50": matches_50, 
           "KL50": kl_divergence_50,
           "CD50": cd_50}
with open("metrics.txt", 'wb') as F:
  pickle.dump(metrics, F)
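
Because the metrics are written with `pickle`, the file is binary even though it ends in `.txt`, and it has to be read back in `"rb"` mode. A minimal round-trip sketch follows; the file name, temporary directory, and metric values are made up for illustration.

```python
import os
import pickle
import tempfile

# Made-up values standing in for the real metrics dictionary
toy_metrics = {"Mat25": [0.5, 0.4], "KL25": [0.1, 0.2]}

# Hypothetical path in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "metrics.pkl")

with open(path, "wb") as f:
    pickle.dump(toy_metrics, f)

# Read the dictionary back in binary mode
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded)
```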