# Text Analysis with Contextualized Topic Models

## Installing Contextualized Topic Models

In [7]:
%%capture
!pip install contextualized_topic_models


## Import data

In [2]:
import pandas as pd
from google.colab import drive 
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
result = pd.read_parquet("/content/drive/My Drive/NLP/twitter_100k_sample.parquet.gzip")
result.head()

Unnamed: 0,id,created_at,text,clean_text,tokenized_text,no_stopwords_text,lemmatized_text,polarity,subjectivity,sentiment
130916,1550244740687446016,2022-07-21 22:22:02+00:00,will smith/chris rock slap. depp vs heard. roe...,smith/chris rock slap depp vs heard roe vs wad...,"[smith, chris, rock, slap, depp, vs, heard, ro...","[smith, chris, rock, slap, depp, vs, heard, ro...","[vs, hear, roe_vs_wade, overturned, endless, c...",-0.116667,0.6,Negative
744753,1547229607019126786,2022-07-13 14:40:59+00:00,Corona virus came from China 🇨🇳 \nJoe Biden al...,corona virus came china joe biden comes china ...,"[corona, virus, came, china, joe, biden, comes...","[corona, virus, came, china, joe, biden, comes...","[come, come]",0.0,0.0,Neutral
191485,1550131614914473984,2022-07-21 14:52:31+00:00,US President Joe Biden tests positive for Covi...,president joe biden tests positive covid presi...,"[president, joe, biden, tests, positive, covid...","[president, joe, biden, tests, positive, covid...","[test, positive, covid, president, test, posit...",0.151515,0.363636,Positive
780698,1547030557804068864,2022-07-13 01:30:02+00:00,https://t.co/9vYNKIXEtS\nThis report speaks th...,report speaks truth believed ecigarette pneumo...,"[report, speaks, truth_believed_ecigarette, pn...","[report, speaks, trying_hide, worldwide]","[speak, worldwide]",0.0,0.1,Neutral
1076480,1551463572823212035,2022-07-25 07:05:15+00:00,The truth is in this report.#COVID #COVID19 #O...,truth report,"[truth, report]","[truth, report]","[truth, report]",0.0,0.0,Neutral


In [4]:
result.shape

(100000, 10)

Let's drop the duplicates.

In [5]:
result = result.drop_duplicates(subset=['clean_text'])
result

Unnamed: 0,id,created_at,text,clean_text,tokenized_text,no_stopwords_text,lemmatized_text,polarity,subjectivity,sentiment
130916,1550244740687446016,2022-07-21 22:22:02+00:00,will smith/chris rock slap. depp vs heard. roe...,smith/chris rock slap depp vs heard roe vs wad...,"[smith, chris, rock, slap, depp, vs, heard, ro...","[smith, chris, rock, slap, depp, vs, heard, ro...","[vs, hear, roe_vs_wade, overturned, endless, c...",-0.116667,0.600000,Negative
744753,1547229607019126786,2022-07-13 14:40:59+00:00,Corona virus came from China 🇨🇳 \nJoe Biden al...,corona virus came china joe biden comes china ...,"[corona, virus, came, china, joe, biden, comes...","[corona, virus, came, china, joe, biden, comes...","[come, come]",0.000000,0.000000,Neutral
191485,1550131614914473984,2022-07-21 14:52:31+00:00,US President Joe Biden tests positive for Covi...,president joe biden tests positive covid presi...,"[president, joe, biden, tests, positive, covid...","[president, joe, biden, tests, positive, covid...","[test, positive, covid, president, test, posit...",0.151515,0.363636,Positive
780698,1547030557804068864,2022-07-13 01:30:02+00:00,https://t.co/9vYNKIXEtS\nThis report speaks th...,report speaks truth believed ecigarette pneumo...,"[report, speaks, truth_believed_ecigarette, pn...","[report, speaks, trying_hide, worldwide]","[speak, worldwide]",0.000000,0.100000,Neutral
1076480,1551463572823212035,2022-07-25 07:05:15+00:00,The truth is in this report.#COVID #COVID19 #O...,truth report,"[truth, report]","[truth, report]","[truth, report]",0.000000,0.000000,Neutral
...,...,...,...,...,...,...,...,...,...,...
763640,1547142069541412865,2022-07-13 08:53:08+00:00,‘Centaurus’: Virologists have voiced concerns ...,‘centaurus’ virologists voiced concerns emerge...,"[centaurus_virologists, voiced_concerns, emerg...","[voiced_concerns, omicron, variant, rapidly, g...","[voiced_concern, rapidly, gaining_ground, arrive]",0.000000,0.000000,Neutral
393875,1549117517414731777,2022-07-18 19:42:52+00:00,"This bro, we lived on the same flat during cor...",bro lived flat corona fun guy drinks hapa na p...,"[bro, lived, flat, corona, fun, guy, drinks, h...","[bro, lived, flat, corona, fun, guy, drinks, h...","[live, flat, corona, fun, guy, drink, fastforw...",0.003750,0.226250,Positive
1073097,1551483551333351425,2022-07-25 08:24:38+00:00,"We knew a very nice, friendly guy who thought ...",knew nice friendly guy thought covid big joke ...,"[knew, nice, friendly, guy, thought, covid, bi...","[knew, nice, friendly, guy, thought, covid, bi...","[know, nice, friendly, guy, think, covid, big,...",0.335000,0.460000,Positive
929160,1546239960872796165,2022-07-10 21:08:29+00:00,Oh yes. Top of everyone’s agenda in a country ...,oh yes everyone’s agenda country beset food am...,"[oh, yes, everyone, agenda, country, beset, fo...","[oh, yes, everyone, agenda, country, beset, fo...","[agenda, country, beset, food, amp, fuel, pove...",0.000000,0.000000,Neutral


## Importing what we need

In [None]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessingStopwords
import nltk
import torch
import random
import numpy as np

We are going to create a function that fixes the random seeds so that we can replicate the results. We will use this function later.

In [9]:
def fix_seeds():
  torch.manual_seed(10)
  torch.cuda.manual_seed(10)
  np.random.seed(10)
  random.seed(10)
  torch.backends.cudnn.enabled = False
  torch.backends.cudnn.deterministic = True

## Preprocessing

Why do we use the **preprocessed text** here? We need text without punctuation to build the bag of words (BoW). We also remove stop words, which usually do not convey thematic information.
In some cases, we might also want to keep only the most frequent words in the BoW, since too many words might not help.
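
To make the idea concrete, here is a minimal sketch of building a BoW by hand. It is only an illustration and not how the library does it internally: the toy documents, the stop-word set, and the `vocabulary_size` value are all made up for this example.

```python
from collections import Counter

# Toy corpus and stop-word list; both are invented for illustration.
docs = [
    "the vaccine is safe and the vaccine works",
    "masks and the vaccine protect people",
    "the new variant is here so wear masks",
    "masks slow the variant and the vaccine helps",
]
stop_words = {"the", "is", "and"}

# Tokenize on whitespace and drop stop words.
tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]

# Keep only the `vocabulary_size` most frequent words across the corpus.
vocabulary_size = 3
counts = Counter(w for doc in tokenized for w in doc)
vocab = sorted(w for w, _ in counts.most_common(vocabulary_size))

# One row per document, one column per vocabulary word.
bow = [[doc.count(w) for w in vocab] for doc in tokenized]

print(vocab)
print(bow)
```

Each row of `bow` counts how often every vocabulary word occurs in one document. Words outside the vocabulary simply disappear from the representation.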

 


![](https://raw.githubusercontent.com/silviatti/Contextualized-Topic-Models-Tutorial/main/images/combined_ctm.PNG)


In [10]:
from nltk.corpus import stopwords as stop_words
nltk.download('stopwords')
stopwords = list(set(stop_words.words('english')))

documents = result["clean_text"].tolist()
sp = WhiteSpacePreprocessingStopwords(documents, stopwords_list=stopwords)
preprocessed_documents, unpreprocessed_corpus, vocab, x = sp.preprocess()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Other parameters of the `WhiteSpacePreprocessingStopwords` object:
* *vocabulary_size*: the number of most frequent words to keep. Less frequent words are discarded from the preprocessed documents.
* *max_df*: when building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold (corpus-specific stop words). A float in the range [0.0, 1.0] is interpreted as a proportion of documents; an integer is an absolute count. Default: 1
* *min_words*: documents with fewer words than this value are removed. Default: 1
* *remove_numbers*: if true, numbers are removed from the documents. Default: True
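
The effect of *max_df* and *min_words* can be sketched in plain Python. This is a simplified illustration of the behaviour described above, not the library's implementation: the helper `filter_corpus` is hypothetical and only handles a float *max_df*.

```python
def filter_corpus(tokenized_docs, max_df=1.0, min_words=1):
    """Hypothetical helper: apply max_df (as a proportion) and min_words."""
    n_docs = len(tokenized_docs)

    # Document frequency: number of documents that contain each term.
    df = {}
    for doc in tokenized_docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1

    # Drop terms whose document proportion is strictly above max_df.
    kept_terms = {t for t, c in df.items() if c / n_docs <= max_df}
    filtered = [[t for t in doc if t in kept_terms] for doc in tokenized_docs]

    # Drop documents left with fewer than min_words terms.
    return [doc for doc in filtered if len(doc) >= min_words]


toy_docs = [
    ["covid", "vaccine", "booster"],
    ["covid", "mask"],
    ["covid"],
]

# "covid" occurs in every document, so max_df=0.9 removes it,
# and the third document becomes empty and is dropped as well.
filtered_docs = filter_corpus(toy_docs, max_df=0.9, min_words=1)
print(filtered_docs)
```

With the defaults nothing is filtered, because no term can appear in more than 100% of the documents.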

Let's check the first ten words of the vocabulary 

In [None]:
vocab[:10]

In [None]:
preprocessed_documents[:5]

In [None]:
unpreprocessed_corpus[:5]

Let's pass our preprocessed and unpreprocessed documents to our `TopicModelDataPreparation` object. This object takes care of creating the bag of words for you and of obtaining the contextualized representations of the documents. This operation gives us our training dataset.

Note: You can use whichever contextualized representation you like. In our experiments, we noticed that a "better" language model usually leads to more coherent results. For this reason, we are going to use "paraphrase-distilroberta-base-v2". For other models: https://www.sbert.net/docs/pretrained_models.html

In [11]:
tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v2")

training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)


### How many topics? 
Due to time constraints, we are not going to search for the best number of topics here, but we can experiment with different numbers of topics later. There are other techniques as well; for example, you can use a black-box optimization strategy to find the best number of topics with respect to an arbitrary metric. See OCTIS: https://github.com/mind-Lab/octis



### Code to find the best number of topics

When you run this, do not fix the random seeds; otherwise, every run with the same number of topics will produce the same results.

In [None]:
from contextualized_topic_models.evaluation.measures import CoherenceNPMI, InvertedRBO
corpus = [d.split() for d in preprocessed_documents]

num_topics = [5, 10, 15, 20]
num_runs = 5

best_topic_coherence = -999
best_num_topics = 0
for n_components in num_topics:
  for i in range(num_runs):
    print("num topics:", n_components, "/ num run:", i)
    ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, 
                     n_components=n_components, num_epochs=50)
    ctm.fit(training_dataset) # run the model
    coh = CoherenceNPMI(ctm.get_topic_lists(10), corpus)
    coh_score = coh.score()
    print("coherence score:", coh_score)
    if best_topic_coherence < coh_score:
      best_topic_coherence = coh_score
      best_num_topics = n_components
    print("current best coherence", best_topic_coherence, "/ best num topics", best_num_topics)

num topics: 5 / num run: 0


0it [00:00, ?it/s]

KeyboardInterrupt: ignored

## Training our Combined Contextualized Topic Model
Let us run the topic model with 10 topics (parameter *n_components*).

Recall that CTM is a neural model, so we need to define for **how many epochs** the model will run. We can also use an early stopping criterion to let the model stop automatically. In this case, we should provide a validation dataset to the `fit` function (parameter `validation_dataset`).
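
As a rough intuition for early stopping, here is a generic patience-based rule written in plain Python. It is a sketch under our own assumptions: the function `early_stopping_epoch`, the `patience` value, and the validation losses are invented, and the criterion the library applies internally may differ.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # The loss kept improving often enough: all epochs were used.
    return len(val_losses)


# Made-up validation losses, one value per epoch.
losses = [70.0, 66.0, 65.0, 65.5, 65.2, 65.1, 64.0]
stop_epoch = early_stopping_epoch(losses, patience=3)
print("training would stop after epoch", stop_epoch)
```

A small `patience` stops quickly but can miss a later improvement (here, the last epoch), which is why the validation set and the patience should be chosen with some care.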

We also need to set the dimension of the BoW and the dimension of the contextualized representation. 


In [13]:
fix_seeds()  # comment this line out if you don't want to fix the random seeds

num_topics = 10
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=num_topics, num_epochs=20)
ctm.fit(training_dataset) # run the model

Epoch: [20/20]	 Seen Samples: [1793680/1793680]	Train Loss: 64.18811623566626	Time: 0:00:22.501764: : 20it [07:37, 22.86s/it]
Sampling: [20/20]: : 20it [06:08, 18.45s/it]


## Topics

After training, it is time to look at our topics: we can use the `get_topic_lists` function to get them. It also accepts a parameter that selects how many words you want to see for each topic.

Looking at the topics, you can check whether they all make sense.

In [14]:
ctm.get_topic_lists(5)

[['people', 'mask', 'masks', 'dont', 'like'],
 ['vaccine', 'vaccines', 'children', 'booster', 'vaccination'],
 ['obviously', 'rip', 'unvaxxed', 'well', 'aware'],
 ['biden', 'president', 'positive', 'joe', 'tests'],
 ['ba', 'omicron', 'subvariant', 'variant', 'new'],
 ['pandemic', 'amp', 'health', 'world', 'government'],
 ['test', 'tested', 'day', 'negative', 'positive'],
 ['corona', 'read', 'market', 'article', 'travel'],
 ['cases', 'deaths', 'new', 'latest', 'death'],
 ['got', 'like', 'im', 'time', 'finally']]

However, we also want to quantify how much better the contextualized models are than previous work. For example, how much better does CTM perform than LDA?

Let's compare the models.

## Latent Dirichlet Allocation (LDA) 
We are going to use the gensim library to train LDA and then assess the quality of the topics with a topic coherence measure. NPMI (normalized pointwise mutual information) is a common choice; the code in this notebook computes gensim's `u_mass` coherence.
 

In [15]:
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel 
from gensim.models.coherencemodel import CoherenceModel

split_preprocessed_documents = [d.split() for d in preprocessed_documents]
dictionary = Dictionary(split_preprocessed_documents)
corpus = [dictionary.doc2bow(text) for text in split_preprocessed_documents]

lda = LdaModel(corpus, num_topics=num_topics, iterations=500, random_state=42)



Let's see the topics discovered by LDA

In [16]:
def get_topics_lda(topk=10):
  topic_terms = []
  for i in range(num_topics):
      topic_words_list = []
      for word_tuple in lda.get_topic_terms(i, topk):
          topic_words_list.append(dictionary[word_tuple[0]])
      topic_terms.append(topic_words_list)
  return topic_terms

get_topics_lda(5)

[['covid', 'like', 'people', 'got', 'time'],
 ['covid', 'amp', 'china', 'change', 'dr'],
 ['covid', 'house', 'white', 'new', 'rate'],
 ['covid', 'corona', 'free', 'update', 'th'],
 ['covid', 'vaccine', 'vaccines', 'vaccination', 'social'],
 ['covid', 'positive', 'biden', 'test', 'tested'],
 ['covid', 'amp', 'people', 'mask', 'health'],
 ['covid', 'cases', 'new', 'deaths', 'latest'],
 ['covid', 'children', 'health', 'pandemic', 'death'],
 ['covid', 'got', 'ba', 'im', 'time']]

### Topic Coherence
We usually use topic coherence as the main indicator of topic quality. NPMI coherence is the most widely used variant; it is computed from the co-occurrences of words in the original corpus or in an external one. The intuition is that if two words often co-occur, they are more likely to be related to each other. Note that the cell below passes `coherence='u_mass'`, so the reported score is UMass coherence rather than NPMI.
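
To see what NPMI measures, here is a small hand-written sketch. It is a simplification and not what gensim does internally: we count co-occurrences at the document level, the function `npmi_coherence` and the toy documents are our own, and the handling of the two edge cases is a convention chosen for this example.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs, using document-level co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the words never co-occur
        elif p12 == 1:
            scores.append(1.0)   # both words are in every document (avoids 0/0)
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


toy_corpus = [
    ["vaccine", "booster", "dose"],
    ["vaccine", "booster"],
    ["mask", "mandate"],
    ["mask", "people"],
]

coherent = npmi_coherence(["vaccine", "booster"], toy_corpus)
incoherent = npmi_coherence(["vaccine", "mask"], toy_corpus)
print(coherent, incoherent)
```

NPMI ranges from -1 (the words never appear together) to 1 (the words always appear together), so a topic whose top words co-occur often gets a higher score.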



In [17]:
cm = CoherenceModel(model=lda, dictionary=dictionary, 
                    texts=split_preprocessed_documents, coherence='u_mass')
lda_coherence = cm.get_coherence()  # get coherence value
print("coherence score LDA:", lda_coherence)

coherence score LDA: -4.344112024379868


### Coherence on CTM
The CTM library already integrates gensim's computation of coherence. We just provide the list of topics and the corpus as input to a coherence class (here `CoherenceUMASS`, to match the measure used for LDA above) and compute the score with the `.score()` function.

In [22]:
from contextualized_topic_models.evaluation.measures import CoherenceUMASS, InvertedRBO
corpus = [d.split() for d in preprocessed_documents]
coh = CoherenceUMASS(ctm.get_topic_lists(10), corpus)
print("coherence score CTM:", coh.score())

coherence score CTM: -4.821414852489249


### Diversity of the topics 

We can also compute how diverse the topics are from each other. Ideally, we expect topics that represent separate concepts or ideas. In this case, we use the IRBO (inverted rank-biased overlap) measure. Topics that share words at different rankings are penalized less than topics that share the same words at the highest ranks.
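
The ranking effect can be illustrated with a simplified rank-biased overlap. This is a sketch under our own assumptions and will not reproduce the library's scores: `rbo_truncated` is a truncated version rescaled to [0, 1], the weight `p=0.9` is an arbitrary choice for this example, and the toy topics are made up.

```python
from itertools import combinations


def rbo_truncated(list_a, list_b, p=0.9):
    """Simplified rank-biased overlap over the shared depth, rescaled to [0, 1]."""
    depth = min(len(list_a), len(list_b))
    score, norm = 0.0, 0.0
    for d in range(1, depth + 1):
        # Agreement at depth d: share of common words among the top d.
        overlap = len(set(list_a[:d]) & set(list_b[:d]))
        weight = p ** (d - 1)  # deeper ranks weigh less
        score += weight * overlap / d
        norm += weight
    return score / norm


def inverted_rbo(topics, p=0.9):
    """1 minus the average pairwise overlap: higher means more diverse."""
    pairs = list(combinations(topics, 2))
    return 1 - sum(rbo_truncated(a, b, p) for a, b in pairs) / len(pairs)


distinct = [["mask", "people"], ["vaccine", "booster"], ["cases", "deaths"]]
shared_top = [["covid", "mask"], ["covid", "vaccine"]]
shared_bottom = [["mask", "covid"], ["vaccine", "covid"]]

print(inverted_rbo(distinct))
print(inverted_rbo(shared_top))
print(inverted_rbo(shared_bottom))
```

Topics with no words in common get the maximum diversity. Sharing a word at the top rank lowers the score more than sharing the same word at a lower rank.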

In [23]:
irbo_lda = InvertedRBO(get_topics_lda(10))
print("diversity score LDA:", irbo_lda.score())

irbo_ctm = InvertedRBO(ctm.get_topic_lists(10))
print("diversity score CTM:", irbo_ctm.score())

diversity score LDA: 0.7038255617085714
diversity score CTM: 0.9911126912242857
