# Tutorial: Combined Topic Modeling

(last updated 10-07-2022)

In this tutorial, we are going to use our **Combined Topic Model** to get the topics out of a collection of articles.

## Topic Models 

Topic models allow you to discover latent topics in your documents in a completely unsupervised way. Just use your documents and get topics out.

## Contextualized Topic Models

![](https://raw.githubusercontent.com/MilaNLProc/contextualized-topic-models/master/img/logo.png)

What are Contextualized Topic Models? **CTMs** are a family of topic models that combine the expressive power of BERT embeddings with the unsupervised capabilities of topic models to get topics out of documents. 

## Python Package

You can find our package [here](https://github.com/MilaNLProc/contextualized-topic-models).

![https://github.com/MilaNLProc/contextualized-topic-models/actions](https://github.com/MilaNLProc/contextualized-topic-models/workflows/Python%20package/badge.svg) ![https://pypi.python.org/pypi/contextualized_topic_models](https://img.shields.io/pypi/v/contextualized_topic_models.svg) ![https://pepy.tech/badge/contextualized-topic-models](https://pepy.tech/badge/contextualized-topic-models)

# **Before you start...**

If any of the following applies to you, follow the corresponding link:

- you need to work with languages other than English: [click here!](https://contextualized-topic-models.readthedocs.io/en/latest/language.html#language-specific)
- you can't get good results with topic models: [click here!](https://contextualized-topic-models.readthedocs.io/en/latest/faq.html#i-am-getting-very-poor-results-what-can-i-do)
- you want to load your own embeddings: [click here!](https://contextualized-topic-models.readthedocs.io/en/latest/faq.html#can-i-load-my-own-embeddings)


# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# Installing Contextualized Topic Models

First, we install the Contextualized Topic Models library and pyLDAvis.

In [None]:
%%capture
!pip install contextualized-topic-models==2.3.0

In [None]:
%%capture
!pip install pyldavis

## Restart the Notebook

For the changes to take effect, we now need to restart the notebook.

From the Menu:

Runtime → Restart Runtime

# Data

We are going to need some data. You should upload a file with one document per line. We assume you haven't run any preprocessing script.

However, if you want to first test the model without uploading your data, you can simply use the test file I'm putting here
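A minimal sketch of reading such a file, assuming a UTF-8 text file with one document per line. The helper `load_documents` is hypothetical and not part of the library.

```python
from pathlib import Path


def load_documents(path):
    """Read one document per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Strip surrounding whitespace and drop lines that end up empty.
    return [line.strip() for line in lines if line.strip()]
```

With a file of your own, `documents = load_documents("my_documents.txt")` (the file name is a placeholder) would give a list of strings that can be passed to the preprocessing step.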

# Importing what we need

In [None]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import PorterStemmer
import numpy as np
np.random.seed(2018)

In [None]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessingStopwords
import nltk

In [None]:
import pandas as pd

## Preprocessing

Why do we use the **preprocessed text** here? We need text without punctuation to build the bag of words (BoW). We may also want to keep only the most frequent words in the BoW, since a very large vocabulary might not help.
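To make the idea concrete, here is a rough sketch of this kind of preprocessing in plain Python. It is not the library's implementation: the function name `simple_bow_preprocess`, the regular expression, and the default vocabulary size are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def simple_bow_preprocess(documents, stopwords, vocabulary_size=2000):
    """Lowercase, drop punctuation and stopwords, keep the most frequent words."""
    stopwords = set(stopwords)
    tokenized = []
    for doc in documents:
        # Keep alphabetic tokens only, which removes punctuation and digits.
        tokens = re.findall(r"[a-z]+", doc.lower())
        tokenized.append([t for t in tokens if t not in stopwords])

    # Restrict the vocabulary to the most frequent words in the collection.
    counts = Counter(t for tokens in tokenized for t in tokens)
    vocab = {word for word, _ in counts.most_common(vocabulary_size)}
    return [" ".join(t for t in tokens if t in vocab) for tokens in tokenized]


docs = ["Good morning, and thank you for joining us!", "Revenue grew; revenue is up."]
print(simple_bow_preprocess(docs, stopwords=["and", "you", "for", "is"], vocabulary_size=100))
```

A small `vocabulary_size` can empty out short documents entirely, which is one reason a preprocessing step may need to report which documents it retained.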

In [None]:
!unzip "/content/data.zip" -d "/content/data"
text_file = "/content/data/data/full_company_data" 


Archive:  /content/data.zip
  inflating: /content/data/data/apple_names.txt  
  inflating: /content/data/data/apple_transcript.txt  
  inflating: /content/data/data/cisco_names.txt  
  inflating: /content/data/data/cisco_transcript.txt  
  inflating: /content/data/data/full_company_data  
  inflating: /content/data/data/intuit_names.txt  
  inflating: /content/data/data/intuit_transcript.txt  
  inflating: /content/data/data/names.txt  
  inflating: /content/data/data/Thermo_fischer_names.txt  
  inflating: /content/data/data/Thermo_Fischer_transcript.txt  
  inflating: /content/data/data/UHG_names.txt  
  inflating: /content/data/data/UHG_transcripts.txt  
  inflating: /content/data/data/wells fargo doc.txt  


In [None]:
import pickle

# Load the pickled list of documents extracted from the archive
with open("/content/data/data/full_company_data", "rb") as fp:
    full_data = pickle.load(fp)

print("full data : ", len(full_data))

full data :  886


In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

In [None]:
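def lemmatization(texts, allowed_postags=['NOUN', 'ADJ']):  # other useful tags: 'VERB', 'ADV'
    # `texts` must be an iterable of token lists (e.g. the output of sent_to_words);
    # passing raw strings makes " ".join split each string into single characters.
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        # keep lemmas of tokens with an allowed part of speech; blank out spaCy v2's '-PRON-' placeholder
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

def sent_to_words(sentences):
    # tokenize each sentence into a list of lowercase words
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks; simple_preprocess itself drops punctuation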


In [None]:
from nltk.corpus import stopwords as stop_words

nltk.download('stopwords')

documents = full_data

stopwords = list(stop_words.words("english"))

sp = WhiteSpacePreprocessingStopwords(documents, stopwords_list=stopwords)
preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()
data_words = list(sent_to_words(preprocessed_documents))
data_lemmatized = lemmatization(preprocessed_documents, allowed_postags=["NOUN","ADJ"])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
preprocessed_documents[:2]

['good morning thank joining us',
 'call today marc president chief executive officer stephen williamson senior vice president chief financial officer please note call webcast live investors section website com heading events august copy press release second quarter earnings available investors section website heading financials']

In [None]:
data_lemmatized[:2]

['g r n g',
 'o r c p r d h e u t e p s o e r v p r d h n p e e t e b c s t l i e s t o r e c o s t e o u g t c p p r r e e o d q u r t e r e r v i l e e s t o r e c o s t e n']
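The single-letter output above is consistent with how `lemmatization` builds its input: `" ".join(sent)` expects a list of tokens, and when `sent` is a plain string it places a space between every character. The sketch below shows the difference on the first preprocessed document. Passing the token lists in `data_words` instead of `preprocessed_documents` would presumably give word-level lemmas, although that variant was not run in this notebook.

```python
sentence = "good morning thank joining us"
tokens = sentence.split()

# Joining a string iterates over its characters.
joined_from_string = " ".join(sentence)
# Joining a list of tokens rebuilds the sentence.
joined_from_tokens = " ".join(tokens)

print(joined_from_string[:7])  # g o o d
print(joined_from_tokens)      # good morning thank joining us
```

Note that `data_lemmatized` is not used in the rest of this tutorial, so the topic model below is unaffected.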

We don't discard the non-preprocessed texts, because we are going to use them as input for obtaining the contextualized document representations. 

Let's pass our preprocessed and unpreprocessed documents to our `TopicModelDataPreparation` object. This object takes care of creating the bag of words for you and of obtaining the contextualized representations of the documents. This operation allows us to create our training dataset.

Note: Here we use the contextualized model "all-mpnet-base-v2".
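As a rough illustration of what "creating the bag of words" means, the sketch below builds a sorted vocabulary and a count vector per document in plain Python. The helper `build_bow` is hypothetical, and the library's own implementation may differ.

```python
from collections import Counter


def build_bow(documents):
    """Return a sorted vocabulary and one count vector per document."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocab)
        # Count how often each vocabulary word occurs in this document.
        for word, count in Counter(doc.split()).items():
            row[index[word]] = count
        matrix.append(row)
    return vocab, matrix


vocab, bow = build_bow(["revenue grew revenue", "customers grew"])
print(vocab)  # ['customers', 'grew', 'revenue']
print(bow)    # [[0, 1, 2], [1, 1, 0]]
```

Each row has one entry per vocabulary word, which is why the model later receives `bow_size=len(tp.vocab)`.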


In [None]:
tp = TopicModelDataPreparation("all-mpnet-base-v2")
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)



Batches:   0%|          | 0/5 [00:00<?, ?it/s]



Let's check the first ten words of the vocabulary 

In [None]:
tp.vocab[:10]

['ability',
 'able',
 'absolutely',
 'acacia',
 'accelerate',
 'accelerated',
 'accelerates',
 'accelerating',
 'acceleration',
 'accepted']

## Training our Combined TM

Finally, we can fit our new topic model. We will ask the model to find 5 topics in our collection.

In [None]:
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=5, num_epochs=20)
ctm.fit(training_dataset) # run the model

Epoch: [20/20]	 Seen Samples: [17720/17720]	Train Loss: 157.20151466386852	Time: 0:00:00.432140: : 20it [00:08,  2.33it/s]
Sampling: [20/20]: : 20it [00:07,  2.61it/s]
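The `Seen Samples` counter in the training log appears to be the number of documents multiplied by the number of epochs. A quick check with the numbers printed in this notebook is sketched below. The batch size of 200 used for the `Batches: 0/5` progress bar is an assumption, not something stated in this tutorial.

```python
import math

num_documents = 886   # "full data :  886" printed earlier
num_epochs = 20       # num_epochs passed to CombinedTM

seen_samples = num_documents * num_epochs
print(seen_samples)  # 17720

# Assumption: embeddings are computed in batches of 200 documents.
assumed_batch_size = 200
print(math.ceil(num_documents / assumed_batch_size))  # 5
```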


In [None]:
for x in ctm.get_topic_lists(10):
  print(x)

['excuse', 'stick', 'contracts', 'room', 'stepped', 'combining', 'wellness', 'copy', 'reallocate', 'containment']
['revenue', 'quarter', 'higher', 'billion', 'million', 'year', 'income', 'expect', 'total', 'increase']
['customers', 'platform', 'able', 'mid', 'pharma', 'really', 'large', 'biotech', 'small', 'mailchimp']
['allocation', 'function', 'spoken', 'break', 'finished', 'anomalies', 'reasonable', 'measure', 'harbor', 'engage']
['think', 'back', 'look', 'would', 'first', 'going', 'see', 'pretty', 'get', 'time']


# Visualizing 

We can use PyLDAvis to plot our topics in a nice and friendly manner :)

In [None]:
lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

Sampling: [10/10]: : 10it [00:04,  2.00it/s]


In [None]:
import pyLDAvis as vis

lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

ctm_pd = vis.prepare(**lda_vis_data)
vis.display(ctm_pd)

Sampling: [10/10]: : 10it [00:03,  2.62it/s]


# Topic Predictions

Now we can take a document and see which topic has been assigned to it. Results will change depending on the documents you are using. For example, let's predict the topic of the first preprocessed document.

In [None]:
topics_predictions = ctm.get_thetas(training_dataset, n_samples=5) # get all the topic predictions

Sampling: [5/5]: : 5it [00:02,  2.21it/s]


In [None]:
preprocessed_documents[0] # see the text of our preprocessed document

'good morning thank joining us'

In [None]:
import numpy as np
topic_number = np.argmax(topics_predictions[0]) # get the topic id of the first document

In [None]:
topic_number

2

In [None]:
ctm.get_topic_lists(0)[1]

[]

In [None]:
ctm.get_topic_lists(5)[topic_number] # top 5 words of the topic predicted for the first document

['explosion', 'environments', 'window', 'exited', 'predictive']
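The prediction step above takes the arg max of one document's topic distribution. The sketch below does the same in plain Python and also returns runner-up topics, which can help when the top two probabilities are close. The helper `top_topics` and the example distribution are hypothetical.

```python
def top_topics(theta, k=1):
    """Return the indices of the k most probable topics, best first."""
    ranked = sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)
    return ranked[:k]


# Hypothetical topic distribution for one document over 5 topics.
theta = [0.05, 0.10, 0.60, 0.05, 0.20]
print(top_topics(theta))       # [2]
print(top_topics(theta, k=2))  # [2, 4]
```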

# Fin-BERT Model

In [None]:
tp = TopicModelDataPreparation("ProsusAI/finbert")
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)

No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/ProsusAI_finbert. Creating a new one with MEAN pooling.
Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/ProsusAI_finbert were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Batches:   0%|          | 0/5 [00:00<?, ?it/s]



In [None]:
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=5, num_epochs=20)
ctm.fit(training_dataset) # run the model

Epoch: [20/20]	 Seen Samples: [17720/17720]	Train Loss: 157.22595297509875	Time: 0:00:00.413493: : 20it [00:10,  1.96it/s]
Sampling: [20/20]: : 20it [00:07,  2.68it/s]


In [None]:
for x in ctm.get_topic_lists(10):
  print(x)

['really', 'things', 'going', 'see', 'half', 'right', 'think', 'look', 'little', 'one']
['recap', 'stability', 'generating', 'still', 'issues', 'activities', 'materials', 'contributions', 'foresee', 'though']
['quarter', 'year', 'revenue', 'grew', 'adjusted', 'higher', 'points', 'increased', 'lower', 'ago']
['stick', 'incredible', 'recognized', 'contributions', 'generating', 'sub', 'progressed', 'stepping', 'proven', 'concerns']
['fortune', 'used', 'deadline', 'struggle', 'exception', 'sterile', 'speeding', 'confirmed', 'stable', 'contend']


In [None]:
lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

Sampling: [10/10]: : 10it [00:03,  2.69it/s]


In [None]:

lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

ctm_pd = vis.prepare(**lda_vis_data)
vis.display(ctm_pd)

Sampling: [10/10]: : 10it [00:03,  2.64it/s]


# Save Our Model for Later Use

In [None]:
ctm.save(models_dir="./")



In [None]:
# let's remove the trained model
del ctm

In [None]:
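# The constructor arguments and the directory name should match the model that was saved.
# The values below describe a 50-topic model, not the 5-topic model trained above,
# so adjust n_components and the path to the directory that ctm.save created.
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, num_epochs=100, n_components=50)

ctm.load("/content/contextualized_topic_model_nc_50_tpm_0.0_tpv_0.98_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99",
         epoch=19)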
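Because the saved directory name encodes the hyperparameters, it can be easier to look it up than to type it by hand. The sketch below lists candidate directories. It assumes that `ctm.save` writes a sub-directory whose name starts with `contextualized_topic_model`, as the path in the `load` call suggests. The helper `find_saved_models` is hypothetical.

```python
from pathlib import Path


def find_saved_models(models_dir, prefix="contextualized_topic_model"):
    """List sub-directories of models_dir whose name starts with prefix."""
    root = Path(models_dir)
    # Assumption: each saved model lives in its own directory under models_dir.
    return sorted(p for p in root.iterdir() if p.is_dir() and p.name.startswith(prefix))
```

Calling `find_saved_models("./")` after saving should show the directory to pass to `ctm.load`, if the naming assumption holds.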



In [None]:
ctm.get_topic_lists(5)

[['mi', 'east', 'lies', 'south', 'village'],
 ['world', 'women', 'team', 'cup', 'international'],
 ['founded', 'school', 'education', 'university', 'established'],
 ['studied', 'painter', 'paris', 'french', 'german'],
 ['played', 'born', 'english', 'made', 'right'],
 ['film', 'directed', 'written', 'produced', 'stars'],
 ['league', 'football', 'played', 'american', 'team'],
 ['album', 'released', 'island', 'river', 'band'],
 ['de', 'greek', 'king', 'french', 'son'],
 ['member', 'party', 'politician', 'general', 'elected'],
 ['university', 'american', 'professor', 'received', 'college'],
 ['church', 'roman', 'catholic', 'century', 'diocese'],
 ['municipality', 'region', 'area', 'kilometres', 'mi'],
 ['mi', 'west', 'km', 'county', 'south'],
 ['population', 'area', 'county', 'town', 'census'],
 ['built', 'house', 'building', 'story', 'style'],
 ['game', 'developed', 'video', 'games', 'playstation'],
 ['used', 'often', 'term', 'form', 'usually'],
 ['published', 'book', 'written', 'books', 