<a href="https://colab.research.google.com/github/Ankur3107/Machine-Learning-Notes/blob/master/topic_modeling/Tutorial_(v2_2_0)_Zero_shot_Cross_Lingual_%2B_Visualizations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Zero-shot Cross-lingual Topic Modeling with ZeroShotTM

(last updated 10-05-2021)

In this tutorial, we are going to use our **Zero-Shot Topic Model** to get the topics out of a collection of articles you will upload here. Then, we are going to predict the topics of unseen documents in an unseen language, exploiting the multilingual capabilities of Multilingual BERT.

## Topic Models 

Topic models allow you to discover latent topics in your documents in a completely unsupervised way. Just use your documents and get topics out.

## Contextualized Topic Models

![](https://raw.githubusercontent.com/MilaNLProc/contextualized-topic-models/master/img/logo.png)

What are Contextualized Topic Models? **CTMs** are a family of topic models that combine the expressive power of BERT embeddings with the unsupervised capabilities of topic models to get topics out of documents. In this tutorial, we are going to use the **Zero-Shot Topic Model** version of the Contextualized Topic Models because we want to tackle the problem of cross-lingual topic prediction. 

## Python Package

You can find our package [here](https://github.com/MilaNLProc/contextualized-topic-models).

![https://travis-ci.com/MilaNLProc/contextualized-topic-models](https://travis-ci.com/MilaNLProc/contextualized-topic-models.svg) ![https://pypi.python.org/pypi/contextualized_topic_models](https://img.shields.io/pypi/v/contextualized_topic_models.svg) ![https://pepy.tech/badge/contextualized-topic-models](https://pepy.tech/badge/contextualized-topic-models)




# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)
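
After changing the setting, a quick check like the following can confirm that the GPU is visible. This is a minimal sketch that assumes PyTorch is the backend; the helper name `gpu_available` is ours, and it simply returns `False` when PyTorch is not installed.

```python
def gpu_available() -> bool:
    """Return True if PyTorch can see a CUDA device, False otherwise."""
    try:
        import torch  # preinstalled on Colab; may be missing elsewhere
    except ImportError:
        return False
    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False` on Colab, the hardware accelerator setting has probably not been applied to the current runtime.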

# Installing Contextualized Topic Models

First, we install the contextualized topic models library, along with pyLDAvis for the visualizations.

In [None]:
%%capture
!pip install contextualized-topic-models==2.2.0

In [None]:
%%capture
!pip install pyldavis

## Restart the Notebook

For the changes to take effect, we now need to restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Data

We are going to need some data. You should upload a file with one document per line. We assume you haven't run any preprocessing script.

However, if you want to test the model before uploading your own data, you can simply use the sample file we provide here:

In [None]:
!wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt

--2021-08-03 07:59:50--  https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6208417 (5.9M) [text/plain]
Saving to: ‘dbpedia_sample_abstract_20k_unprep.txt’


2021-08-03 07:59:50 (73.7 MB/s) - ‘dbpedia_sample_abstract_20k_unprep.txt’ saved [6208417/6208417]



In [None]:
!head -n 1 dbpedia_sample_abstract_20k_unprep.txt

The Mid-Peninsula Highway is a proposed freeway across the Niagara Peninsula in the Canadian province of Ontario. Although plans for a highway connecting Hamilton to Fort Erie south of the Niagara Escarpment have surfaced for decades,it was not until The Niagara Frontier International Gateway Study was published by the Ministry


In [None]:
text_file = "dbpedia_sample_abstract_20k_unprep.txt" # EDIT THIS WITH THE FILE YOU UPLOAD
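
If you upload your own file, it may help to check that it really contains one document per line. The sketch below is one way to do that; `load_documents` is a hypothetical helper, and skipping blank lines is our own choice rather than something the tutorial requires.

```python
import os
import tempfile


def load_documents(path):
    """Read one document per line, skipping lines that are empty after stripping."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Tiny stand-in file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("First document.\n\nSecond document.\n")
    sample_path = tmp.name

sample_docs = load_documents(sample_path)
os.remove(sample_path)
print(len(sample_docs), "documents loaded")
```

With your own data, you would call `load_documents(text_file)` and inspect the first few entries before moving on.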

# Importing what we need

In [None]:
from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk

## Preprocessing

Why do we use the **preprocessed text** here? We need text without punctuation to build the bag of words. We may also want to keep only the most frequent words in the BoW, since a very large vocabulary might not help.
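
To make the idea concrete, here is a simplified sketch of this kind of preprocessing: lowercase, strip punctuation, drop stopwords, and keep only the most frequent words. It is an illustration under our own assumptions (the function name, the `vocabulary_size` argument, and the toy stopword set are hypothetical), not the implementation behind `WhiteSpacePreprocessing`.

```python
import string
from collections import Counter


def simple_preprocess(docs, stopwords, vocabulary_size=2000):
    """Lowercase, strip punctuation, drop stopwords, keep the most frequent words."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokenized = [
        [w for w in doc.lower().translate(table).split() if w not in stopwords]
        for doc in docs
    ]
    counts = Counter(w for tokens in tokenized for w in tokens)
    vocab = {w for w, _ in counts.most_common(vocabulary_size)}
    cleaned = [" ".join(w for w in tokens if w in vocab) for tokens in tokenized]
    return cleaned, sorted(vocab)


docs = ["The highway, a proposed freeway!", "A freeway across the peninsula."]
cleaned, vocab = simple_preprocess(docs, stopwords={"the", "a"}, vocabulary_size=10)
print(cleaned)
print(vocab)
```

Lowering `vocabulary_size` trims rare words from both the cleaned documents and the vocabulary, which is the trade-off described above.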

In [None]:
nltk.download('stopwords')

# Read one document per line; the context manager closes the file for us.
with open(text_file, encoding="utf-8") as f:
    documents = [line.strip() for line in f]
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
preprocessed_documents[:2]

['mid peninsula highway proposed across peninsula canadian province ontario although highway connecting hamilton fort south international study published ministry',
 'died march american photographer specialized photography operated studio silver spring maryland later lived florida magazine photographer year']

We don't discard the non-preprocessed texts, because we are going to use them as input for obtaining the contextualized document representations. 

Let's pass our preprocessed and unpreprocessed documents to our `TopicModelDataPreparation` object. This object takes care of creating the bag of words for you and of obtaining the contextualized BERT representations of the documents. This operation allows us to create our training dataset.

Note: Here we use the contextualized model "paraphrase-multilingual-mpnet-base-v2", because we need a multilingual model for performing cross-lingual predictions later.

In [None]:
tp = TopicModelDataPreparation("paraphrase-multilingual-mpnet-base-v2")

training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)

  0%|          | 0.00/1.02G [00:00<?, ?B/s]

You try to use a model that was created with version 1.2.0, however, your version is 1.1.1. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





Batches:   0%|          | 0/100 [00:00<?, ?it/s]

Let's check the first ten words of the vocabulary 

In [None]:
tp.vocab[:10]

['abbreviated',
 'academic',
 'academy',
 'access',
 'according',
 'achieved',
 'acquired',
 'acre',
 'acres',
 'across']
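
As a rough picture of what a bag of words over this vocabulary looks like, the sketch below counts vocabulary words per document with NumPy. The helper `bag_of_words` and the toy data are our own illustration; the library builds its BoW internally and may do so differently.

```python
import numpy as np


def bag_of_words(docs, vocab):
    """Return a (n_docs, len(vocab)) count matrix for whitespace-tokenized docs."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(docs), len(vocab)), dtype=int)
    for row, doc in enumerate(docs):
        for word in doc.split():
            if word in index:  # out-of-vocabulary words are ignored
                matrix[row, index[word]] += 1
    return matrix


toy_vocab = ["across", "highway", "peninsula"]
toy_docs = ["mid peninsula highway proposed across peninsula", "highway"]
bow = bag_of_words(toy_docs, toy_vocab)
print(bow)
```

Each row is one document and each column one vocabulary word, so the number of columns is what `bow_size` refers to when the model is created.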

## Training our Zero-Shot Contextualized Topic Model

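Finally, we can fit our new topic model. We will ask the model to find 50 topics in our collection (the `n_components` parameter of the `ZeroShotTM` object). The `contextual_size` must match the dimensionality of the contextualized embeddings (768 for the model used above).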

In [None]:
ctm = ZeroShotTM(bow_size=len(tp.vocab), contextual_size=768, n_components=50, num_epochs=20)
ctm.fit(training_dataset) # run the model

Epoch: [4/20]	 Seen Samples: [80000/400000]	Train Loss: 142.49389458007812	Time: 0:00:05.564329: : 4it [00:21,  5.48s/it]
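
Because `contextual_size` has to agree with the embedding model, a small lookup can guard against mismatches when you switch models. The sizes below reflect our understanding of two sentence-transformers models and should be verified against the model cards; `contextual_size_for` is a hypothetical helper, not part of the library.

```python
# Embedding widths as we understand them; verify before relying on these values.
EMBEDDING_SIZES = {
    "paraphrase-multilingual-mpnet-base-v2": 768,
    "distiluse-base-multilingual-cased": 512,
}


def contextual_size_for(model_name):
    """Look up the embedding width for a known model name (hypothetical helper)."""
    try:
        return EMBEDDING_SIZES[model_name]
    except KeyError:
        raise ValueError(f"Unknown embedding size for {model_name!r}") from None


print(contextual_size_for("paraphrase-multilingual-mpnet-base-v2"))
```

The returned value is what you would pass as `contextual_size` when constructing the model.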

# Topics

After training, it is time to look at our topics: we can use the `get_topic_lists` method to get them. It also accepts a parameter that allows you to select how many words you want to see for each topic.

If you look at the topics, you will see that most of them are coherent and representative of a collection of documents that comes from Wikipedia (general knowledge). Notice that the topics are in English, because we trained the model on English documents.

In [None]:
ctm.get_topic_lists(5)

[['house', 'built', 'located', 'national', 'historic'],
 ['family', 'found', 'species', 'mm', 'moth'],
 ['district', 'village', 'km', 'county', 'west'],
 ['station', 'line', 'railway', 'river', 'near'],
 ['member', 'politician', 'party', 'general', 'political'],
 ['series', 'game', 'film', 'directed', 'written'],
 ['county', 'located', 'united', 'states', 'city'],
 ['university', 'american', 'professor', 'born', 'law'],
 ['war', 'army', 'french', 'professor', 'british'],
 ['century', 'ii', 'war', 'king', 'greek'],
 ['company', 'software', 'based', 'information', 'developed'],
 ['built', 'house', 'building', 'church', 'story'],
 ['province', 'population', 'district', 'municipality', 'region'],
 ['team', 'football', 'league', 'season', 'games'],
 ['school', 'high', 'university', 'college', 'state'],
 ['television', 'film', 'directed', 'published', 'series'],
 ['county', 'states', 'city', 'united', 'located'],
 ['family', 'genus', 'found', 'plant', 'species'],
 ['american', 'football', 'p
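
With 50 topics, some of them can end up nearly identical; in the list above, two topics share the same five words (`county`, `located`, `united`, `states`, `city`). The sketch below flags such pairs using Jaccard overlap of their top words. The helper name and the `0.6` threshold are our own assumptions, chosen only for illustration.

```python
from itertools import combinations


def overlapping_topics(topics, threshold=0.6):
    """Return (i, j, jaccard) for topic pairs whose word sets overlap heavily."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(topics), 2):
        sa, sb = set(a), set(b)
        jaccard = len(sa & sb) / len(sa | sb)
        if jaccard >= threshold:
            pairs.append((i, j, jaccard))
    return pairs


# A few topics copied from the output above.
sample_topics = [
    ["house", "built", "located", "national", "historic"],
    ["county", "located", "united", "states", "city"],
    ["county", "states", "city", "united", "located"],
    ["built", "house", "building", "church", "story"],
]
print(overlapping_topics(sample_topics))
```

On a real run you would pass `ctm.get_topic_lists(5)` instead of `sample_topics`; many flagged pairs may suggest trying a smaller number of topics.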

# Let's Draw!

We can use pyLDAvis to plot our topics in a nice and friendly manner :)

In [None]:
lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset)

Sampling: [20/20]: : 20it [01:38,  4.95s/it]


In [None]:
import pyLDAvis as vis

# Convert the CTM output into pyLDAvis' prepared format and render it inline.
ctm_pd = vis.prepare(**lda_vis_data)
vis.display(ctm_pd)



# Topic Predictions

Now we can take a document and see which topic has been assigned to it. Results will change depending on the documents you are using. For example, let's predict the topic of the first preprocessed document, which talks about a highway across a peninsula.

In [None]:
topics_predictions = ctm.get_thetas(training_dataset, n_samples=5) # get all the topic predictions

Sampling: [5/5]: : 5it [00:23,  4.72s/it]


In [None]:
preprocessed_documents[0] # see the text of our preprocessed document

'mid peninsula highway proposed across peninsula canadian province ontario although highway connecting hamilton fort south international study published ministry'

In [None]:
import numpy as np
topic_number = np.argmax(topics_predictions[0]) # get the topic id of the first document

In [None]:
ctm.get_topic_lists(5)[topic_number]  # the topic should be about location-related things

['station', 'railway', 'line', 'located', 'street']
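
`np.argmax` returns only the single most likely topic. To see the runners-up as well, one option is to sort the document's topic distribution, as in this sketch. The helper `top_topics` and the toy distribution are ours; on real data you would pass `topics_predictions[0]`.

```python
import numpy as np


def top_topics(theta, k=3):
    """Return the k most probable (topic_id, probability) pairs for one document."""
    theta = np.asarray(theta)
    order = np.argsort(theta)[::-1][:k]  # indices sorted by descending probability
    return [(int(i), float(theta[i])) for i in order]


# Toy distribution over 5 topics standing in for topics_predictions[0].
toy_theta = [0.05, 0.60, 0.10, 0.20, 0.05]
print(top_topics(toy_theta, k=2))
```

Looking at the top few topics is useful when a document sits between two themes and the first choice alone looks surprising.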

### Let's predict the topics of the documents in unseen languages 

It's time to take advantage of the power of multilingual BERT. Let's predict the topics of some Italian documents. First, we download the data as before.

In [None]:
!wget https://raw.githubusercontent.com/vinid/data/master/italian_documents.txt

--2021-01-11 21:53:28--  https://raw.githubusercontent.com/vinid/data/master/italian_documents.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7122 (7.0K) [text/plain]
Saving to: ‘italian_documents.txt’


2021-01-11 21:53:28 (33.1 MB/s) - ‘italian_documents.txt’ saved [7122/7122]



In [None]:
# Read the file as UTF-8 so accented Italian characters are decoded correctly.
with open("italian_documents.txt", encoding="utf-8") as f:
    italian_documents = [line.strip() for line in f]
italian_documents

['Fu l\'ispiratore e uno dei fondatori della rivista "Dau al Set" e dell\'omonimo gruppo artistico ed intellettuale catalano (1948). È ritenuto il massimo esponente della poesia visiva non solo della letteratura catalana, ma il pioniere di questo genere in Spagna e uno dei grandi riferimenti internazionali. La totalità della sua opera creativa lo segnala come uno degli autori più prolifici della cultura occidentale contemporanea. Sebbene schierato nell\'avanguardia della poesia della prima metà del XX secolo (esplorò validamente l\'ipnagogia, il surrealismo e il dadaismo), si esercitò alla scrittura di centinaia di sonetti, odi saffiche e sestine liriche dalla totale perfezione formale alla sperimentazione più estrema, nonché migliaia di poemi in forma libera. In vita pubblicò un\'ottantina di libri lasciandone parecchi d\'inediti. Il suo lavoro letterario comprendeva tutti i generi: poesia, prosa, teatro (la sua cosiddetta "poesia scenica", oltre 350 opere), cinema, musica, cabaret, p

There's no need to preprocess the documents if you want to do zero-shot topic modeling! (The vocabulary obtained from the Italian documents wouldn't match our English vocabulary anyway.) Let's just pass the Italian documents as they are (without preprocessing) to our `TopicModelDataPreparation` object and create the testing dataset.

In [None]:
testing_dataset = tp.transform(italian_documents) # create dataset for the testset


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Now we are ready to compute the topic predictions for each document. 


In [None]:
# n_samples: how many times to sample the distribution (see the documentation)
italian_topics_predictions = ctm.get_thetas(testing_dataset, n_samples=5) # get all the topic predictions


Sampling: [5/5]: : 5it [00:01,  4.90it/s]
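
Our understanding is that `n_samples` controls how many stochastic draws of the document-topic distribution are averaged, so larger values give more stable predictions at the cost of time. The sketch below illustrates that idea with a toy softmax-normalized Gaussian; the means, standard deviations, and function names are hypothetical, and this is not the library's code.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def averaged_theta(mu, sigma, n_samples, rng):
    """Average n_samples softmax-normalized Gaussian draws (illustrative only)."""
    draws = [softmax(rng.normal(mu, sigma)) for _ in range(n_samples)]
    return np.mean(draws, axis=0)


rng = np.random.default_rng(0)
mu = np.array([2.0, 0.0, -1.0])    # hypothetical posterior mean for 3 topics
sigma = np.array([1.0, 1.0, 1.0])  # hypothetical posterior standard deviation
print(averaged_theta(mu, sigma, n_samples=5, rng=rng))
print(averaged_theta(mu, sigma, n_samples=500, rng=rng))
```

Running the two calls several times with different seeds should show the 500-sample estimate varying much less than the 5-sample one.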


Let's consider the first one, which talks about an artist.

In [None]:
italian_documents[0]

'Fu l\'ispiratore e uno dei fondatori della rivista "Dau al Set" e dell\'omonimo gruppo artistico ed intellettuale catalano (1948). È ritenuto il massimo esponente della poesia visiva non solo della letteratura catalana, ma il pioniere di questo genere in Spagna e uno dei grandi riferimenti internazionali. La totalità della sua opera creativa lo segnala come uno degli autori più prolifici della cultura occidentale contemporanea. Sebbene schierato nell\'avanguardia della poesia della prima metà del XX secolo (esplorò validamente l\'ipnagogia, il surrealismo e il dadaismo), si esercitò alla scrittura di centinaia di sonetti, odi saffiche e sestine liriche dalla totale perfezione formale alla sperimentazione più estrema, nonché migliaia di poemi in forma libera. In vita pubblicò un\'ottantina di libri lasciandone parecchi d\'inediti. Il suo lavoro letterario comprendeva tutti i generi: poesia, prosa, teatro (la sua cosiddetta "poesia scenica", oltre 350 opere), cinema, musica, cabaret, pe

As we did before, let's get the index of the most likely topic of the first document and then show the topic words to see if the topic prediction is accurate.

In [None]:
topic_number = np.argmax(italian_topics_predictions[0]) # get the topic id of the first document
ctm.get_topic_lists(10)[topic_number] 

['son',
 'war',
 'de',
 'century',
 'french',
 'ii',
 'daughter',
 'king',
 'father',
 'cross']
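
The same lookup can be applied to every test document at once. The sketch below pairs each document with its most likely topic; `describe_predictions` and the toy arrays are our own stand-ins for `italian_topics_predictions` and `ctm.get_topic_lists(...)`.

```python
import numpy as np


def describe_predictions(thetas, topic_lists):
    """Pair each document with its most likely topic id, probability, and words."""
    thetas = np.asarray(thetas)
    best = thetas.argmax(axis=1)
    return [
        (int(t), float(thetas[d, t]), topic_lists[t])
        for d, t in enumerate(best)
    ]


# Toy stand-ins: 2 documents, 3 topics.
toy_thetas = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
toy_topics = [
    ["war", "king", "century"],
    ["team", "football", "league"],
    ["film", "series", "directed"],
]
for doc_id, (topic_id, prob, words) in enumerate(describe_predictions(toy_thetas, toy_topics)):
    print(doc_id, topic_id, round(prob, 2), words)
```

Printing the probability next to the topic words makes it easier to spot documents where the model is not very confident in its top choice.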