# <font color='#2B4865'>**Neural Topic Models**</font>

---
### Natural Language Processing
Date: Jan 11, 2023

Author: Lorena Calvo-Bartolomé (lcalvo@pa.uc3m.es)

Version 1.0

---
This notebook is based on both CTM's and BERTopic documentation and tutorials released by the authors:

* [CTM's GitHub](https://github.com/MilaNLProc/contextualized-topic-models)
* [CTM's Docs](https://contextualized-topic-models.readthedocs.io/en/latest/)
* [BERTopic's GitHub](https://github.com/MaartenGr/BERTopic)
* [BERTopic's Docs](https://maartengr.github.io/BERTopic/index.html)

Our goal here is to present a basic overview of the CTM and BERTopic libraries and how to use them for the construction of topic models.

---

<font color='#E0144C'>**We strongly encourage you to run this notebook in Google Colaboratory. A GPU is not necessary for the inference part, but it will speed up execution considerably. To enable it, follow these steps:**</font>

<font color='#E0144C'>**1. Connect to hosted runtime**</font>

<font color='#E0144C'>**2. Enable GPU setting by clicking Edit -> Notebook Settings -> Select GPU in Hardware Acceleration Tab -> Save**</font>

### LAB 4.5 - NATURAL LANGUAGE PROCESSING - MASTER'S IN APPLIED ARTIFICIAL INTELLIGENCE

### JOSÉ LORENTE LÓPEZ - DNI: 48842308Z

## <font color='#2B4865'>Installing necessary packages, imports and auxiliary functions</font>

In [1]:
# Import the libraries needed for this lab

# Common imports 
import pandas as pd
import zipfile as zp
import seaborn as sns
import torch
import random
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina' 
import pathlib
import os
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel
import spacy
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from gensim.models.phrases import Phrases

#For fancy table Display
%load_ext google.colab.data_table



In [2]:
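# Import all the libraries needed to preprocess the dataset

import re
import nltk

def check_nltk_packages():
  # Map each NLTK package to the resource folder where it is stored
  packages = {'punkt': 'tokenizers', 'stopwords': 'corpora',
              'omw-1.4': 'corpora', 'wordnet': 'corpora'}

  for package, folder in packages.items():
    try:
      # Look the package up in its own folder, not always under 'tokenizers/'
      nltk.data.find(folder + '/' + package)
    except LookupError:
      nltk.download(package)
check_nltk_packages()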

try:
  import lxml
except ModuleNotFoundError:
  %pip install lxml

try:
  import contractions
except ModuleNotFoundError:
  %pip install contractions
  import contractions

from bs4 import BeautifulSoup
import re

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-2.0.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.whl (104 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.1 contractions-0.1.73 pyahocorasick-2.0.0 textsearch-0.0.24


In [3]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
path_to_folder = '/content/drive/MyDrive/Master - CIII/1ºCuatrimestre/2ºSemicuatrimestre/Códigos - Python/Procesamiento del Lenguaje Natural/Lab4/ATopicsDataset'  # UPDATE THIS ACCORDING TO WHERE YOU WANT TO SAVE THE FILES!!!!

# Change to assignment directory
os.chdir(path_to_folder) 

Let us first install the libraries (Contextualized Topic Models and BERTopic) that we will be using.

In [None]:
%%capture
!pip install contextualized_topic_models

In [None]:
%%capture
!pip install bertopic

<font color='#E0144C'>**After installing BERTopic, some packages that were already loaded were updated, and to correctly use them, we should now restart the notebook. Note that after restarting the notebook, you do not need to re-execute the former two cells.**</font>

## <font color='#2B4865'>**1. Data loading and preprocessing**
---
</font>

We are going to be using 2018's subset of documents of the **NSF dataset**, whose files you have available at Aula Global. To obtain good results, we will need a preprocessed and lemmatized corpus, but we will require the original raw data as well.

###### **Exercise 1**

Carry out the following actions:


1.   Preprocess the dataset. Here, you can make use of the pipeline you implemented in the spaCy tutorial or the Text Vectorization I notebook. As a text field, use the concatenation of the Title and the Abstracts. 
2.   N-gram detection

For simplicity, save the results in a single dataframe (name it ``df``), which must include, at least, the columns ``raw_text`` and ``lemmas_with_grams``, to store the concatenation of the Title and the Abstracts, and the lemmas after N-gram detection, respectively.

Alternatively, since you have already preprocessed the NSF dataset in another notebook, you can generate this dataframe directly from that notebook, provided that it meets the above requirements and contains the 2018 subset of documents.

In [5]:
dataset_crudo = pd.read_csv('NSF_2018.csv')
dataset_modificado = pd.read_csv('NSF_2018.csv')

In [6]:
texto_final = []
for i in range(len(dataset_modificado)):
  texto_final.append(dataset_modificado['title'][i] + ": " + dataset_modificado['abstract'][i])
dataset_modificado['texto'] = texto_final

In [7]:
from nltk.tokenize import wordpunct_tokenize
from nltk.tokenize import sent_tokenize


def tokenize(texto):
  tokenizado = []
  for i in range(len(texto['texto'])):
    strr = texto['texto'][i]
    review_tokens = wordpunct_tokenize(strr)
    tokenizado.append(review_tokens)
  return tokenizado

datos_tokenizados = tokenize(dataset_modificado)

def Homogenization(texto):
    
    ayuda = []
    lower = []
    for i in range(len(texto)):
      for j in range(len(texto[i])):
        ayuda.append(texto[i][j].lower())
      lower.append(ayuda)
      ayuda = []

    extern = []
    extern_2 = []
    review_tokens_filtered = []

    for i in range (len(lower)):
      for j in range(len(lower[i])):
        extern.append(lower[i][j].isalnum())
      extern_2.append(extern)
      extern = []

    help = []
    for i in range (len(extern_2)):
      for j in range (len(extern_2[i])):
        if(extern_2[i][j] == True):
            help.append(lower[i][j])
      review_tokens_filtered.append(help)
      help = []

    return review_tokens_filtered

datos_homo = Homogenization(datos_tokenizados)

wnl = WordNetLemmatizer()

lemmatized_review = []

for i in range(len(datos_homo)):
  extn = [wnl.lemmatize(el) for el in datos_homo[i]]
  lemmatized_review.append(extn)

def cleaning(texto):

    stopwords_en = stopwords.words('english')
    filtered_sentence = []
    help = []
    for i in texto:
      for j in i:
        if j not in stopwords_en:
            help.append(j)
      filtered_sentence.append(help)
      help = []
      

    clean_review = filtered_sentence
    return clean_review

clean_text = cleaning(lemmatized_review)

corpus = []

for i in range (len(clean_text)):
    corpus.append(clean_text[i])
    
phrase_model = Phrases(corpus, min_count=2, threshold=40)
corpus = [el for el in phrase_model[corpus]] 



In [8]:
df = pd.DataFrame()
df['raw_text'] = dataset_modificado['texto']
df['clean_text'] = corpus
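
The tokenization, lowercasing, alphanumeric filtering, and stopword removal carried out above can also be written as one compact function. The following is only a minimal sketch under stated assumptions: the regular expression approximates `wordpunct_tokenize`, `TOY_STOPWORDS` is a hypothetical miniature list standing in for NLTK's English stopwords, and lemmatization and n-gram detection are left out.

```python
import re

# Hypothetical, tiny stopword list for illustration only;
# the notebook itself relies on NLTK's English stopword list.
TOY_STOPWORDS = {"the", "of", "and", "a", "to", "in", "for"}

def simple_preprocess(text, stopwords=TOY_STOPWORDS):
    # Split into runs of word characters or runs of punctuation
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    # Homogenize: lowercase and keep alphanumeric tokens only
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t.isalnum()]
    # Cleaning: drop stopwords
    return [t for t in tokens if t not in stopwords]

example = simple_preprocess("The Physics of Dark-Matter: a study.")
print(example)
```

Working on one document at a time makes it easy to apply the function with ``df['raw_text'].apply(...)`` instead of nested index loops.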

## <font color='#2B4865'>**2. Contextualized Topic Models**
---
</font>

In [10]:
!pip install contextualized_topic_models

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting contextualized_topic_models
  Downloading contextualized_topic_models-2.4.2-py2.py3-none-any.whl (35 kB)
Collecting ipywidgets==7.5.1
  Downloading ipywidgets-7.5.1-py2.py3-none-any.whl (121 kB)
Collecting gensim>=4.0.0
  Downloading gensim-4.3.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (24.1 MB)
Collecting sentence-transformers>=2.1.1
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting ipython==7.16.3
  Downloading ipython-7.16.3-p

In [40]:
!pip install contextualized-topic-models[metrics]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [9]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

ModuleNotFoundError: ignored

### <font color='#2B4865'>*2.1. Preprocessing and data preparation*</font>

The first step when making use of any of the topic modeling algorithms belonging to the Contextualized Topic Models package is the preprocessing of the documents. Remember that preprocessing with CTMs is in fact key, since they work better when the size of the BoW has been restricted to a number of terms that does not go over $2000$ elements. 

Yet, since we have carried out a previous preprocessing, we will skip this part. If you are interested, you can check the preprocessing function available at the Contextualized Topic Models library (``contextualized_topic_models.utils.preprocessing``).

In any case, we **will not discard the non-preprocessed texts**, since we are going to use them as input for obtaining the contextualized document representations.
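
To make the vocabulary restriction mentioned above concrete, here is a small sketch that keeps only the most frequent terms of a tokenized corpus. The function name `restrict_vocabulary` and its `max_terms` argument are our own illustrative choices, not part of the CTM library, whose own preprocessing utilities may behave differently.

```python
from collections import Counter

def restrict_vocabulary(tokenized_docs, max_terms=2000):
    # Count how often each term appears in the whole corpus
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # Keep the max_terms most frequent terms
    kept = {term for term, _ in counts.most_common(max_terms)}
    # Remove every token that falls outside the restricted vocabulary
    return [[tok for tok in doc if tok in kept] for doc in tokenized_docs]

toy_docs = [["a", "b", "a"], ["a", "c", "b"]]
print(restrict_vocabulary(toy_docs, max_terms=2))
```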

  <br><center><img src="https://drive.google.com/uc?id=1RuDtcadr0-BUXdgAFT9kSRLQqjXSdwLT" width="20%"></center><br>

The CTM library provides us with a class named ``TopicModelDataPreparation`` that carries out the preparation of our data into the format required by CTMs, i.e., it takes care of creating the BoW and obtaining the contextualized representations of the documents with which we will create our training dataset.

To create an object of the ``TopicModelDataPreparation`` class we need to provide as argument the name of the language model that we are going to use for the generation of the contextualized embeddings. You can check all the available models [here](https://www.sbert.net/docs/pretrained_models.html).

Then, in order to fit the model and generate a CTM's training dataset, we need two lists:
* ``text_for_contextual``, with the original documents 
* ``text_for_bow``, with the lemmatized documents from which the bag of words representation that is going to be used to generate the topic words will be calculated

```
text_for_contextual = [
    "hello, this is unpreprocessed text you can give to the model",
    "have fun with our topic model",
]

text_for_bow = [
    "hello unpreprocessed give model",
    "fun topic model",
]
```

If we were to generate a validation or test dataset, the ``TopicModelDataPreparation.transform()`` method takes care of it by generating the corresponding BoW considering only the words that the model has seen in training. This method receives the same parameters as those from ``TopicModelDataPreparation.fit()``.
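
The idea of building a BoW that only considers words seen in training can be illustrated with plain Python. This is a conceptual sketch, not the library's implementation: `fit_vocabulary` and `to_bow` are hypothetical helper names, and whitespace splitting is assumed as the tokenizer.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    # The vocabulary is fixed by the training documents
    return sorted({tok for doc in train_docs for tok in doc.split()})

def to_bow(doc, vocab):
    # Count tokens; words outside the training vocabulary are ignored
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

vocab = fit_vocabulary(["hello unpreprocessed give model", "fun topic model"])
bow_unseen = to_bow("fun new model model", vocab)
print(vocab)
print(bow_unseen)
```

The word "new" never appeared in training, so it has no column in the resulting vector.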

###### **Exercise 2**

Carry out the following actions:

1. Save the raw text in a variable named ``unpreprocessed_corpus`` and the preprocessed text after n-grams detection in a variable named ``preprocessed_corpus``. 
2. Create an object of the class ``TopicModelDataPreparation`` and name it ``tp``. For doing so, use the model ``"paraphrase-distilroberta-base-v2"``.
3. Use the ``fit()`` method of the ``TopicModelDataPreparation`` object with ``unpreprocessed_corpus`` and ``preprocessed_corpus``, and save the result in a variable named ``training_dataset``.

In [None]:
unpreprocessed_corpus = []

for i in range(len(df['raw_text'])):
  unpreprocessed_corpus.append(df['raw_text'][i])

preprocessed_corpus = []

for i in range(len(df['clean_text'])):
  str_join = " ".join(df['clean_text'][i])
  preprocessed_corpus.append(str_join)

In [21]:
tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v2")
training_dataset = tp.fit(text_for_contextual = unpreprocessed_corpus, text_for_bow = preprocessed_corpus)




### <font color='#2B4865'>*2.2. Training a Combined Contextualized Topic Model*</font>

As with any other topic model, we need to specify the **number of topics** with which we want to train the model. The authors refer to this parameter as the **number of components** (``n_components``). We also need to set the **dimension of the BoW** and the **dimension of the contextualized representation**.

Since CTM is a neural model, we need to define for **how many epochs** the model will run. We can also use an early-stopping criterion to let the model stop automatically. In this case, we should provide a validation dataset to the `fit` function (parameter `validation_dataset`).
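
As an illustration of what an early-stopping criterion does, the sketch below stops once the validation loss has not improved for a given number of consecutive epochs. It is a generic, patience-based rule written for this notebook; the names `patience` and `min_delta` are assumptions, and the CTM library's internal criterion may differ in its details.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the (1-based) epoch at which training would stop."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement: remember it and reset the counter
            best = loss
            waited = 0
        else:
            # No improvement: stop after `patience` such epochs in a row
            waited += 1
            if waited >= patience:
                return epoch
    # The criterion never triggered; all epochs were run
    return len(val_losses)

print(early_stopping_epoch([10, 9, 8, 8.5, 8.2, 8.1, 7], patience=3))
```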

There are **other parameters** that you may want to play with:
* ``lr``: the learning rate, i.e. the step size at each iteration while moving towards a minimum of a loss function. If it's too small, the network will require too much time to reach a minimum, if it's too high then training may not converge.
* ``batch_size``: the batch size, i.e. the number of samples propagated through the network in each step. If it's too high (e.g., batch size == total number of instances), the samples may not fit in your machine's memory. If it's too small, the estimate of the gradient will be less accurate.
* ``hidden_sizes``: the number of hidden layers and neurons. Default: (100, 100) --> two layers of 100 neurons each.
* ``dropout``: probability of dropping out the units in the latent representation layer as regularization.

You can see the full list of parameters [here](https://github.com/MilaNLProc/contextualized-topic-models/blob/6c6d6a996ceae1d203ab34a08c72f8214f98ab65/contextualized_topic_models/models/ctm.py#L19).

In the cell below you can see how to train a `CombinedTM` model with default parameters:

In [22]:
num_topics = 5
num_epochs = 50
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=training_dataset.X_contextual.shape[1], n_components=num_topics, num_epochs=num_epochs)
ctm.fit(training_dataset) # run the model

Epoch: [50/50]	 Seen Samples: [604800/605000]	Train Loss: 2606.590319113757	Time: 0:00:09.416645: : 50it [07:59,  9.59s/it]
Sampling: [20/20]: : 20it [02:07,  6.37s/it]


### <font color='#2B4865'>*2.3. Topics*</font>

After training, it is time to look at our topics. The ``get_topic_lists`` function returns them, and its parameter selects how many words are shown for each topic.

We can now inspect our topics through the 10 most characteristic words of each:

In [23]:
ctm.get_topic_lists(10)

[['guest_diffusion',
  'carbon_nanobelts',
  'panthani',
  'c_qws',
  'diamond_nanocrystal',
  'motional_quantum',
  'superconducting_topological',
  'lattice_matched',
  'bgaas',
  'gslc'],
 ['quantum',
  'reaction',
  'material',
  'molecular',
  'molecule',
  'cell',
  'polymer',
  'property',
  'protein',
  'chemical'],
 ['cloud',
  'information',
  'distributed',
  'statistical',
  'machine_learning',
  'algorithm',
  'patient',
  'real_time',
  'sbir_phase',
  'iot'],
 ['education',
  'stem',
  'student',
  'program',
  'experience',
  'science',
  'practice',
  'teaching',
  'skill',
  'workforce'],
 ['soil',
  'region',
  'ecological',
  'sediment',
  'ocean',
  'record',
  'nutrient',
  'river',
  'earth',
  'specie']]

### <font color='#2B4865'>*2.4. Additional information that can be extracted from the model*</font>

CTM's library also provides the following functions that can become handy depending on what we are using the topic model for:

| **Function** | **Description** |
|---|---|
| get_thetas(dataset) | To get the document-topic distribution for a dataset of documents. |
| get_most_likely_topic(doc_topic_distribution) | To get the most likely topic for each document. |
| get_topic_word_distribution() | To get the topic-word distribution. |
| get_word_distribution_by_topic_id(topic) | To get the word probability distribution of a topic sorted by probability. |

For example, let's see how to get the document-topic and topic-word distribution for the model we just trained:

In [24]:
thetas = ctm.get_doc_topic_distribution(training_dataset)
thetas

Sampling: [20/20]: : 20it [02:15,  6.79s/it]


array([[0.13217695, 0.43963246, 0.20261294, 0.10659673, 0.11898092],
       [0.12697233, 0.5759234 , 0.11916088, 0.09570155, 0.08224185],
       [0.16088906, 0.39561852, 0.06086525, 0.14686307, 0.2357641 ],
       ...,
       [0.1779519 , 0.5262253 , 0.07385611, 0.13828101, 0.08368569],
       [0.11418444, 0.6503645 , 0.05499229, 0.08412546, 0.09633332],
       [0.09288363, 0.03530098, 0.06312991, 0.10273503, 0.70595047]])
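
Each row of this matrix is a probability distribution over the topics, so the most likely topic of a document is simply the position of the largest value in its row. The sketch below shows that argmax step in plain Python on two of the rows printed above; `most_likely_topics` is a hypothetical helper for illustration, not a function of the CTM library.

```python
def most_likely_topics(doc_topic_rows):
    # For each document, return the index of its highest-probability topic
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic_rows]

rows = [
    [0.13217695, 0.43963246, 0.20261294, 0.10659673, 0.11898092],
    [0.09288363, 0.03530098, 0.06312991, 0.10273503, 0.70595047],
]
dominant = most_likely_topics(rows)
print(dominant)
```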

In [27]:
betas = ctm.get_topic_word_distribution()
betas

array([[9.01028670e-06, 1.55873058e-05, 1.57154645e-05, ...,
        1.29162936e-05, 1.29854207e-05, 1.37638335e-05],
       [9.13502663e-06, 1.44613414e-05, 1.38949808e-05, ...,
        1.45087506e-05, 1.47619885e-05, 1.45138301e-05],
       [1.45130825e-05, 1.41693863e-05, 1.44125897e-05, ...,
        1.42508607e-05, 1.40228440e-05, 1.42571398e-05],
       [1.85829740e-05, 1.37822226e-05, 1.37997140e-05, ...,
        1.46130651e-05, 1.49984180e-05, 1.45817285e-05],
       [1.57181312e-05, 1.38233700e-05, 1.37403576e-05, ...,
        1.42557274e-05, 1.43662210e-05, 1.42400868e-05]], dtype=float32)

### <font color='#2B4865'>*2.5. Visualizations*</font>

The function ``get_ldavis_data_format()`` returns the data that PyLDAvis needs to plot the topics; we can pass it directly to the ``pyLDAvis.prepare()`` method to obtain the PyLDAvis graph:

In [28]:
%%capture
!pip install pyLDAvis==2.1.2

In [29]:
lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

Sampling: [10/10]: : 10it [01:21,  8.14s/it]


In [30]:
import pyLDAvis as vis

lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

ctm_pd = vis.prepare(**lda_vis_data)
vis.display(ctm_pd)

Sampling: [10/10]: : 10it [01:08,  6.80s/it]


### <font color='#2B4865'>*2.6. Evaluation*</font>

We usually use topic coherence as the main indicator of the quality of the topics. **NPMI topic coherence** is the most widely used variant, and it is computed from the co-occurrences of the words in the original corpus or in an external one. The intuition is that if two words co-occur often, then they are more likely to be related to each other.
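
To make the co-occurrence intuition concrete, the following sketch computes NPMI for pairs of words and averages it over the word pairs of one topic. It assumes document-level co-occurrence (two words co-occur if they appear in the same document), which is a simplification: Gensim's implementation estimates the probabilities differently, so the values will not match the library's scores.

```python
import math
from itertools import combinations

def npmi_pair(w1, w2, doc_sets):
    n = len(doc_sets)
    p1 = sum(w1 in d for d in doc_sets) / n
    p2 = sum(w2 in d for d in doc_sets) / n
    p12 = sum((w1 in d) and (w2 in d) for d in doc_sets) / n
    if p12 == 0:
        return -1.0  # the words never co-occur
    if p12 == 1:
        return 1.0   # both words are in every document (avoids 0/0)
    # PMI normalized by -log p(w1, w2), so the result lies in [-1, 1]
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def topic_npmi(topic_words, tokenized_docs):
    # Average NPMI over all word pairs of one topic (needs >= 2 words)
    doc_sets = [set(doc) for doc in tokenized_docs]
    pairs = list(combinations(topic_words, 2))
    return sum(npmi_pair(a, b, doc_sets) for a, b in pairs) / len(pairs)

toy_corpus = [["soil", "river"], ["soil", "river"], ["quantum"], ["quantum", "spin"]]
print(topic_npmi(["soil", "river"], toy_corpus))
print(topic_npmi(["soil", "quantum"], toy_corpus))
```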

The CTM library already integrates Gensim's computation of coherence. We just provide the list of topics (a list of lists, where each inner list contains the words representing a topic) and the corpus (a list of lists, where each inner list contains the words of a document) as input to the class `CoherenceNPMI` and compute the score with the `.score()` method:

In [68]:
from contextualized_topic_models.evaluation.measures import CoherenceNPMI, InvertedRBO

In [32]:
corpus = [d.split() for d in preprocessed_corpus]
coh = CoherenceNPMI(ctm.get_topic_lists(10), corpus)
print("coherence score CTM:", coh.score())

coherence score CTM: 0.03428101458445272


Ideally, we expect topics to represent separate concepts or ideas. Accordingly, we can measure how diverse the topics are from each other. The **IRBO (Inverted Rank-Biased Overlap)** measure does so by comparing the top-10 words of each pair of topics with a rank-weighted overlap, i.e., topics sharing words at lower ranks are penalized less than topics sharing the same words at the highest ranks. IRBO is $0$ for identical topics and $1$ for completely different ones.
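
The rank-weighting idea can be sketched with a truncated, normalized version of rank-biased overlap. This is a simplification written for illustration: the library uses its own RBO computation, so scores from this sketch are not expected to match `InvertedRBO`, and the weight `p=0.9` is an assumed value.

```python
from itertools import combinations

def rbo_truncated(list_a, list_b, p=0.9):
    # Weighted average of the prefix overlaps; deeper ranks get weight p**(d-1)
    depth = min(len(list_a), len(list_b))
    score, norm = 0.0, 0.0
    for d in range(1, depth + 1):
        weight = p ** (d - 1)
        overlap = len(set(list_a[:d]) & set(list_b[:d]))
        score += weight * (overlap / d)
        norm += weight
    # Normalizing makes identical lists score exactly 1
    return score / norm

def inverted_rbo(topics, p=0.9):
    # 1 minus the mean pairwise overlap: higher means more diverse topics
    pairs = list(combinations(topics, 2))
    mean_rbo = sum(rbo_truncated(a, b, p) for a, b in pairs) / len(pairs)
    return 1.0 - mean_rbo

print(inverted_rbo([["soil", "river"], ["quantum", "spin"]]))
print(inverted_rbo([["soil", "river"], ["soil", "river"]]))
```

Sharing a word at rank 1 lowers the diversity score more than sharing the same word only at the last rank, because the shared word then counts in every prefix.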

In [34]:
irbo_ctm = InvertedRBO(ctm.get_topic_lists(10))
print("InvertedRBO score CTM:", irbo_ctm.score())

InvertedRBO score CTM: 1.0


### <font color='#2B4865'>*2.7. Choosing the number of topics*</font>


There are different techniques to select the best number of topics. In this case, we are going to approach it in the same way we did for LDA: **running our topic model with different numbers of topics and selecting the one that produces the topics with the highest coherence**.

###### **Exercise 3**

Evaluate the evolution of topic coherence as a function of the number of topics. Use ``num_topics = [5, 10, 15, 20, 25, 50]`` and the NPMI coherence metric.  If the coherence does not achieve a local maximum, increase the number of topics to observe.

In [35]:
#<SOL>
topic_counts = [5, 10, 15, 20, 25, 50]

# Tokenize the corpus once; it is the same for every model
corpus = [d.split() for d in preprocessed_corpus]

coherence_scores = {}
for n_topics in topic_counts:  # do not reuse the list's name as loop variable
    # Train a CombinedTM model with the current number of topics
    ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=training_dataset.X_contextual.shape[1], n_components=n_topics, num_epochs=num_epochs)
    ctm.fit(training_dataset) # run the model
    # Score the top-10 words of each topic with NPMI coherence
    coh = CoherenceNPMI(ctm.get_topic_lists(10), corpus)
    coherence_scores[n_topics] = coh.score()

print(coherence_scores)
#</SOL>

Epoch: [50/50]	 Seen Samples: [604800/605000]	Train Loss: 2606.6887982080852	Time: 0:00:09.372397: : 50it [08:03,  9.68s/it]
Sampling: [20/20]: : 20it [02:12,  6.65s/it]
Epoch: [50/50]	 Seen Samples: [604800/605000]	Train Loss: 2575.454685433201	Time: 0:00:09.467727: : 50it [08:02,  9.64s/it]
Sampling: [20/20]: : 20it [02:15,  6.77s/it]
Epoch: [50/50]	 Seen Samples: [604800/605000]	Train Loss: 2554.698576750579	Time: 0:00:09.572576: : 50it [08:11,  9.84s/it]
Sampling: [20/20]: : 20it [02:10,  6.54s/it]
Epoch: [50/50]	 Seen Samples: [604800/605000]	Train Loss: 2544.263064752811	Time: 0:00:09.633353: : 50it [08:08,  9.77s/it]
Sampling: [20/20]: : 20it [02:09,  6.48s/it]
Epoch: [50/50]	 Seen Samples: [604800/605000]	Train Loss: 2537.4463161892363	Time: 0:00:09.622999: : 50it [08:06,  9.74s/it]
Sampling: [20/20]: : 20it [02:18,  6.92s/it]
Epoch: [50/50]	 Seen Samples: [604800/605000]	Train Loss: 2520.3512821903937	Time: 0:00:09.783751: : 50it [08:19,  9.99s/it]
Sampling: [20/20]: : 20it [0

{5: 0.03603826600073202, 10: 0.07992628953870425, 15: 0.09413143730087646, 20: 0.10383663441213845, 25: 0.1175784227159666, 50: 0.10366705917016178}
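
Whether the coherence reaches a local maximum can be checked programmatically. The sketch below uses the scores printed above and a hypothetical helper, `interior_local_maxima`, which reports every number of topics whose score is higher than that of both of its neighbors in the tested grid.

```python
def interior_local_maxima(scores):
    # Compare each tested value with its left and right neighbors
    ks = sorted(scores)
    return [
        k for prev, k, nxt in zip(ks, ks[1:], ks[2:])
        if scores[k] > scores[prev] and scores[k] > scores[nxt]
    ]

coherence_by_k = {
    5: 0.03603826600073202,
    10: 0.07992628953870425,
    15: 0.09413143730087646,
    20: 0.10383663441213845,
    25: 0.1175784227159666,
    50: 0.10366705917016178,
}
peaks = interior_local_maxima(coherence_by_k)
print(peaks)
```

An empty list would mean that the coherence is still monotonic over the tested range, in which case more values should be explored.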


###### **Exercise 4**

Train and evaluate (NPMI coherence and IRBO) a final `CombinedTM` model on the NSF corpus using the number of topics obtained in Exercise 3. Try to optimize the quality of the topic model by fine-tuning some of the `CombinedTM` parameters.

I tried different parameter settings locally (it ran better there), and the best results were obtained with 50 neurons in each hidden layer and a dropout of 0.3.

In [42]:
best_num_topics = max(coherence_scores, key=coherence_scores.get)
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=training_dataset.X_contextual.shape[1], n_components=best_num_topics, num_epochs=num_epochs, hidden_sizes=(50,50),dropout=0.3)
ctm.fit(training_dataset)

Epoch: [50/50]	 Seen Samples: [604800/605000]	Train Loss: 2558.950619006283	Time: 0:00:10.108929: : 50it [08:03,  9.68s/it]
Sampling: [20/20]: : 20it [02:21,  7.10s/it]


In [44]:
topic_lists = ctm.get_topic_lists(10)
print(topic_lists)

[['galaxy', 'star', 'universe', 'gravitational_wave', 'search', 'mass', 'detector', 'astronomy', 'telescope', 'dark_matter'], ['career', 'education', 'student', 'program', 'stem', 'workforce', 'college', 'academic', 'course', 'success'], ['geometry', 'connection', 'geometric', 'equation', 'theory', 'manifold', 'conjecture', 'algebraic', 'invariant', 'space'], ['quantum', 'spin', 'optical', 'state', 'light', 'material', 'magnetic', 'electron', 'device', 'semiconductor'], ['catalyst', 'reaction', 'chemical', 'synthesis', 'chemistry', 'professor', 'metal', 'organic', 'catalytic', 'catalysis'], ['power', 'sensor', 'proposed', 'efficiency', 'device', 'sensing', 'high', 'circuit', 'wireless', 'performance'], ['water', 'food', 'urban', 'stakeholder', 'infrastructure', 'impact', 'challenge', 'city', 'planning_grant', 'building'], ['change', 'ecological', 'disturbance', 'ecosystem', 'forest', 'environmental', 'landscape', 'hurricane', 'rapid', 'reef'], ['model', 'simulation', 'modeling', 'compu

In [45]:
#<SOL>
corpus = [d.split() for d in preprocessed_corpus]
coh = CoherenceNPMI(topic_lists, corpus)
print("NPMI coherence score:", coh.score())

irbo_ctm = InvertedRBO(topic_lists)
print("IRBO score:", irbo_ctm.score())
#</SOL>

NPMI coherence score: 0.10929090415270991
IRBO score: 0.9961838355093572


## <font color='#2B4865'>**3. BERTopic**
---
</font>

In [12]:
!pip install bertopic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bertopic
  Downloading bertopic-0.13.0-py2.py3-none-any.whl (103 kB)
Collecting sentence-transformers>=0.4.1
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting hdbscan>=0.8.29
  Downloading hdbscan-0.8.29.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0
  Downloading umap-learn-0

### <font color='#2B4865'>*3.1. Training*</font>

We are going to start by training a simple model; later we will see how we can customize the different blocks on which the algorithm relies.

We start by instantiating a BERTopic model. We set the language to ``english`` since our NSF documents are in English, but we could also generate a multilingual model by setting ``language="multilingual"``.

Other parameters that we can configure are:

| **Parameter** | **Description** |
|---|---|
| top_n_words | Number of words per topic to extract. It is recommended to keep it between 10 and 20. |
| n_gram_range | The n-gram range passed to the CountVectorizer (its `ngram_range` parameter) used when creating the topic representation. |
| min_topic_size | Minimum size a topic can have. The lower this value, the more topics are created. It is advisable to experiment with this value depending on the dataset's size. The default is $10$. |
| nr_topics | Number of topics to reduce to after the topic model has been trained. |
| low_memory | Sets UMAP's `low_memory` to True to make sure that less memory is used in computation. |
| calculate_probabilities | When set to True the probabilities of each topic to each document are calculated. It is turned off by default. |

From the former, we will be setting ``calculate_probabilities=True``. This means that the second value returned after fitting the topic model will contain the probabilities of all topics for every document, instead of only the probability of the assigned topic. Note, though, that this slows down computation and may increase memory usage, so it is advisable to disable it when working with large corpora.

Once we have instantiated the model, we need to fit it. We can do this with the usual scikit-learn-style methods ``fit``, ``transform``, and ``fit_transform``. As the documents to fit, we will use the **raw corpus** as a **list of documents, each represented by a string.**

In [13]:
from bertopic import BERTopic

docs = df['raw_text'].values.tolist()

topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)


2023-01-18 21:05:51,159 - BERTopic - Transformed documents to Embeddings
2023-01-18 21:06:27,926 - BERTopic - Reduced dimensionality
2023-01-18 21:07:05,039 - BERTopic - Clustered reduced embeddings


### <font color='#2B4865'>*3.2. Topics*</font>

After training, we can look at our topics with the ``get_topic_info`` function, which returns the topics found, the number of documents in each of them, and a name describing each one. We display the first 10 rows:

In [14]:
freq = topic_model.get_topic_info()
freq.head(10)

| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 3823 | -1_and_of_the_to |
| 1 | 0 | 269 | 0_mantle_seismic_subduction_plate |
| 2 | 1 | 265 | 1_statistical_data_algorithms_problems |
| 3 | 2 | 156 | 2_catalysts_reactions_chemical_catalysis |
| 4 | 3 | 156 | 3_memory_storage_computing_performance |
| 5 | 4 | 132 | 4_plant_plants_crop_genetic |
| 6 | 5 | 123 | 5_archaeological_social_political_maya |
| 7 | 6 | 108 | 6_plasma_solar_magnetic_space |
| 8 | 7 | 107 | 7_resilience_disaster_disasters_infrastructure |
| 9 | 8 | 97 | 8_physics_experiment_nuclear_lhc |


Topics labeled with a negative number (here, $-1$) collect the outlier documents that do not belong to any specific topic, and they should typically be ignored.
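
When analyzing the results, it is often useful to know how large the outlier group is and to set those documents aside. The sketch below assumes a list of per-document topic assignments like the one returned by ``fit_transform``; `outlier_share` and `drop_outliers` are hypothetical helper names, and the toy assignments are invented for illustration.

```python
def outlier_share(topics):
    # Fraction of documents assigned to the outlier topic (-1)
    return sum(1 for t in topics if t == -1) / len(topics)

def drop_outliers(docs, topics):
    # Keep only the (document, topic) pairs with a regular topic
    return [(d, t) for d, t in zip(docs, topics) if t != -1]

toy_docs = ["doc a", "doc b", "doc c", "doc d"]
toy_topics = [-1, 0, 1, -1]
print(outlier_share(toy_topics))
print(drop_outliers(toy_docs, toy_topics))
```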

Based on the topics found, we can inspect, for example, the most frequent topic and check the words that characterize it, together with their c-TF-IDF scores. The same can be done for any other topic:

In [15]:
topic_model.get_topic(0)  # Select the most frequent topic

[('mantle', 0.013834382949214004),
 ('seismic', 0.013675539085769338),
 ('subduction', 0.01116959966290799),
 ('plate', 0.010884951124297689),
 ('fault', 0.009560153965499642),
 ('earthquakes', 0.009417106453140672),
 ('earthquake', 0.009038692204845206),
 ('crust', 0.008338599653801942),
 ('volcanic', 0.008112204836334694),
 ('rocks', 0.007266175660183819)]

##### <font color='#2B4865'>**Update topics**</font>

Once we have trained a model, we may not be satisfied with the obtained topics and their descriptions. In these cases, BERTopic allows us to update the topics via the ``update_topics`` function with new parameters for c-TF-IDF, which is handy when we find additional stopwords that we want to remove or when we want to consider a different ``n_gram_range``.

In [16]:
topic_model.update_topics(docs, n_gram_range=(1, 2))

In [17]:
topic_model.get_topic(0)   # The same topic that we viewed before

[('seismic', 0.008223791306872832),
 ('mantle', 0.008203635755204787),
 ('subduction', 0.006568062790024946),
 ('plate', 0.006412158036940786),
 ('fault', 0.005662889457672865),
 ('earthquakes', 0.005532933261949335),
 ('earthquake', 0.005329641225584192),
 ('crust', 0.004857280339519739),
 ('the', 0.004814386928084122),
 ('volcanic', 0.004769827322279452)]

As we saw with BTMs (e.g., LDA) and CTMs, it is difficult to know in advance the number of topics that best fits a corpus. BERTopic lets the clustering algorithm determine how many topics are created, and once we know that number we can reduce it afterward:

In [None]:
topic_model.reduce_topics(docs, nr_topics=60)

### <font color='#2B4865'>*3.3. Additional information that can be extracted from the model*</font>

After training a BERTopic model, we can access the following attributes:


| Attribute | Description |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_               | The topics that are generated for each document after training or updating the topic model. |
| probabilities_ | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic                                                                      |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                             |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                                                               |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                                              |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used.                                                |

For example, to access the predicted topics for the first 15 documents, we simply run the following:

In [20]:
topic_model.topics_[:15]

[4, 4, -1, 10, 55, -1, 10, -1, 4, 47, 0, 10, 4, 57, -1]

In addition, we can use the function ``find_topics`` to search for topics that are similar to an input. For example, we can look for topics that are similar to the term "airport":

In [21]:
similar_topics, similarity = topic_model.find_topics("airport", top_n=5)
similar_topics

[180, 10, 31, 135, 33]
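
Conceptually, this kind of search ranks topics by how similar an embedding of the query is to an embedding of each topic. The sketch below illustrates only the ranking step, using cosine similarity on made-up 3-dimensional vectors; the vectors, the topic ids, and the `rank_topics` helper are illustrative assumptions and not BERTopic internals.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_topics(query_vec, topic_vecs, top_n=5):
    """Return the top_n (topic_id, similarity) pairs, most similar first."""
    scored = [(topic, cosine(query_vec, vec)) for topic, vec in topic_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy topic embeddings and a toy query embedding (all values are made up)
toy_topic_vecs = {0: [1.0, 0.0, 0.0], 1: [0.7, 0.7, 0.0], 2: [0.0, 0.0, 1.0]}
toy_query = [0.9, 0.1, 0.0]

ranking = rank_topics(toy_query, toy_topic_vecs, top_n=2)
print(ranking)
```

Since the ranking is relative, it is worth checking the similarity values as well as the returned ids: the best match may still be only weakly related to the query.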

In [22]:
# Inspect a topic by its id (similar_topics[0] gives the closest match to the query)
topic_model.get_topic(198)

[('gallium', 0.015326227254658539),
 ('defects', 0.013132130596030606),
 ('nitride', 0.012513244349020965),
 ('semiconductor', 0.01221985885194407),
 ('semiconductors', 0.01066998132000119),
 ('oxide', 0.010417851598496373),
 ('ga2o3', 0.009883427898579234),
 ('devices', 0.009681256384432702),
 ('material', 0.009124572292028732),
 ('power', 0.008825969137436404)]

### <font color='#2B4865'>*3.4. Visualizations*</font>

BERTopic also provides a series of visualization functions that help interpret the results. The most important ones are included below, but you can check [the complete list in the documentation](https://github.com/MaartenGr/BERTopic/tree/master/bertopic/plotting), as well as some extra parameters that can be configured beyond the ones shown here:

##### <font color='#2B4865'>**Topics**</font>

We can visualize the topics found by the model by means of a pyLDAvis-like visualization, which shows the topics, the words that represent them, and the number of documents in the corpus assigned to each:

In [23]:
topic_model.visualize_topics()

##### <font color='#2B4865'>**Topic Probabilities**</font>

We can also visualize the distribution of topic probabilities in each document. For example, the figure below represents the topics in document 0 whose probability is higher than $0.015$.

In [24]:
topic_model.visualize_distribution(probs[0], min_probability=0.015)

##### <font color='#2B4865'>**Topic Hierarchy**</font>

The found topics can be hierarchically reduced. To grasp their likely hierarchical structure, we can cluster the topics and examine how they relate to one another by means of a dendrogram. Based on this representation, we can determine whether we should decrease the number of topics produced.

In [25]:
topic_model.visualize_hierarchy(top_n_topics=50)

##### <font color='#2B4865'>**Terms**</font>

We can also visualize the selected terms for a few topics, together with the relative c-TF-IDF scores between and within topics:

In [26]:
topic_model.visualize_barchart(top_n_topics=12)

### <font color='#2B4865'>*3.5. Custom submodels*</font>

Although BERTopic works quite well out of the box, sometimes we may also want to carry out hyperparameter optimization in sub-models such as HDBSCAN and UMAP, since the default parameters we used in 3.1. may not fit all training data. To solve this, BERTopic allows us to pass in any custom underlying block with the parameters that best suit our use case.

##### <font color='#2B4865'>**Embedding Models**</font>

The parameter `embedding_model` takes in a string pointing to a sentence-transformers model, a SentenceTransformer, or a Flair DocumentEmbedding model. We can select any model from sentence-transformers and pass it through BERTopic, or alternatively, select a SentenceTransformer with our own parameters.

In [27]:
# OPTION 1
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

In [28]:
# OPTION 2
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cuda")
topic_model_emb = BERTopic(embedding_model=sentence_model, verbose=True)

##### <font color='#2B4865'>**Dimensionality reduction**</font>

While UMAP is the dimensionality reduction algorithm typically used in BERTopic, since it captures both the local and the global structure of the high-dimensional space in lower dimensions, BERTopic accepts instances of other algorithms, such as PCA, through the ``umap_model`` parameter. The only condition is that the class has ``fit()`` and ``transform()`` methods, that is, it should have the following structure:

```python
class DimensionalityReduction:
    def fit(self, X):
        return self

    def transform(self, X):
        return X
```

Focusing on UMAP, there are a number of parameters that we can fine-tune to improve the performance of our topic model. BERTopic does not expose all of them, so to fine-tune this algorithm we instantiate a UMAP model and then pass it to BERTopic. The most important parameters to configure are:

* ``n_neighbors``: It controls how UMAP balances local vs global structure in the data. Low values will force UMAP to concentrate on very local structures, while large values will push UMAP to look at larger neighborhoods of each point, losing fine detail structure. The default value is $15$.
* ``n_components``: Dimensionality of the reduced space the data will be embedded into. Since UMAP scales well in the embedding dimension, we can use it for more than just visualizations in 2 or 3 dimensions. The default is $2$.
* ``min_dist``: It controls how tightly UMAP is allowed to pack points together by providing the minimum distance apart that points are allowed to be in the low-dimensional representation. Low values result in clumpier embeddings (good for clustering or finer topological structure), while larger values will prevent UMAP from packing points together and will focus on the preservation of the broad topological structure instead. The default value is $0.1$.
* ``metric``: It controls how distance is computed in the ambient space of the input data. UMAP supports a wide variety of metrics. The default is ``euclidean``.

When fine-tuning your UMAP model, you may find the following documentation useful:
* [How to use UMAP](https://umap-learn.readthedocs.io/en/latest/basic_usage.html)
* [Basic UMAP parameters](https://umap-learn.readthedocs.io/en/latest/parameters.html#n-components)

In [29]:
from umap import UMAP

umap_model = UMAP(n_neighbors=15, n_components=10, min_dist=0.0, metric='cosine')
topic_model_umap = BERTopic(umap_model=umap_model).fit(docs)

2023-01-18 21:53:32,030 - BERTopic - Transformed documents to Embeddings
2023-01-18 21:53:45,321 - BERTopic - Reduced dimensionality
2023-01-18 21:53:46,327 - BERTopic - Clustered reduced embeddings


In [46]:
topic_model_umap.get_topic_info().head(5)

Unnamed: 0,Topic,Count,Name
0,-1,3799,-1_and_of_the_to
1,0,296,0_seismic_mantle_subduction_plate
2,1,164,1_social_criminal_labor_justice
3,2,154,2_memory_storage_computing_performance
4,3,152,3_catalysts_reactions_chemical_catalysis


##### <font color='#2B4865'>**Clustering**</font>

As with the dimensionality reduction algorithm, we can use clustering algorithms other than HDBSCAN as long as the class used for it has the ``fit()`` and ``predict()`` methods, to fit the model and to predict cluster labels for new input, respectively, and the attribute ``labels_`` to get the labels after fitting the model; that is, it should have the following structure:

```python
class ClusterModel:
    def fit(self, X):
        self.labels_ = None
        return self

    def predict(self, X):
        return X
```

Focusing on HDBSCAN, the most important parameters to fine-tune in order to improve the quality of the clusters, and hence of the topic model, are:

* ``min_cluster_size``: It is the smallest size grouping that we wish to consider a cluster. The bigger this parameter is, the fewer clusters the model finds. The default value is $5$.
* ``min_samples``: It is the number of samples in a neighborhood for a point to be considered a core point. It should be optimized at the same time as ``min_cluster_size`` since one influences the other: the larger this value is, the more points will be declared as noise, and clusters will be limited to progressively denser areas; otherwise, sparser core points will be allowed.

When fine-tuning your HDBSCAN model, you may find the following documentation useful:
* [Basic Usage of HDBSCAN for Clustering](https://hdbscan.readthedocs.io/en/latest/basic_hdbscan.html)
* [Parameter Selection for HDBSCAN](https://hdbscan.readthedocs.io/en/latest/parameter_selection.html)

In [47]:
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
topic_model_hdbscan = BERTopic(hdbscan_model=hdbscan_model).fit(docs)

2023-01-18 22:46:46,101 - BERTopic - Transformed documents to Embeddings
2023-01-18 22:46:57,708 - BERTopic - Reduced dimensionality
2023-01-18 22:46:58,366 - BERTopic - Clustered reduced embeddings


In [48]:
topic_model_hdbscan.get_topic_info().head(5)

Unnamed: 0,Topic,Count,Name
0,-1,3679,-1_the_and_of_to
1,0,295,0_data_statistical_algorithms_problems
2,1,190,1_galaxies_stars_galaxy_star
3,2,156,2_catalysts_reactions_chemical_catalytic
4,3,149,3_memory_storage_computing_performance


##### <font color='#2B4865'>**CountVectorizer**</font>

In order to improve the topic representation, we can also customize the underlying ``CountVectorizer`` and pass it to the model via the ``vectorizer_model`` parameter:

In [49]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(2, 2), stop_words="english")
topic_model_cv = BERTopic(vectorizer_model=vectorizer_model).fit(docs)

2023-01-18 23:14:00,031 - BERTopic - Transformed documents to Embeddings
2023-01-18 23:14:11,662 - BERTopic - Reduced dimensionality
2023-01-18 23:14:12,330 - BERTopic - Clustered reduced embeddings


In [50]:
topic_model_cv.get_topic_info().head(5)

Unnamed: 0,Topic,Count,Name
0,-1,3728,-1_broader impacts_intellectual merit_evaluati...
1,0,300,0_subduction zones_subduction zone_plate tecto...
2,1,193,1_gene expression_membrane proteins_dna repair...
3,2,162,2_carbon dioxide_chemistry division_program ch...
4,3,140,3_student travel_international conference_trav...


##### <font color='#2B4865'>**c-TF-IDF**</font>

The c-TF-IDF representation is enabled by default in BERTopic. However, we can explicitly pass it to BERTopic through the ``ctfidf_model`` parameter, which allows for parameter tuning and the customization of the topic extraction technique. As we saw in class, the following parameters can be customized:

* **bm25_weighting**: If set to True, a class-based BM-25 weighting measure is used instead of the default method.
* **reduce_frequent_words**: If set to True, the square root of the term frequency is taken after normalizing the frequency matrix, instead of using the default term frequency.
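
For intuition about what these options modify, recall that the default c-TF-IDF weight of term $t$ in class $c$ is $W_{t,c} = tf_{t,c} \cdot \log\left(1 + \frac{A}{f_t}\right)$, where $tf_{t,c}$ is the normalized frequency of $t$ in class $c$, $f_t$ is the frequency of $t$ across all classes, and $A$ is the average number of words per class. The following is a minimal pure-Python sketch of that formula on a toy example; `class_tfidf` and the toy data are hypothetical, and the code is not the `ClassTfidfTransformer` implementation.

```python
import math
from collections import Counter


def class_tfidf(class_tokens, reduce_frequent_words=False):
    """Toy c-TF-IDF: class_tokens maps a class (topic) id to the tokens of all its documents."""
    term_counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}

    # Frequency of every term across all classes, and average class length
    corpus_counts = Counter()
    for counts in term_counts.values():
        corpus_counts.update(counts)
    avg_words_per_class = sum(corpus_counts.values()) / len(class_tokens)

    scores = {}
    for c, counts in term_counts.items():
        class_size = sum(counts.values())
        scores[c] = {}
        for term, count in counts.items():
            tf = count / class_size  # L1-normalized term frequency
            if reduce_frequent_words:
                tf = math.sqrt(tf)  # dampens very frequent terms
            idf = math.log(1 + avg_words_per_class / corpus_counts[term])
            scores[c][term] = tf * idf
    return scores


# Two toy classes that share the stopword "the"
toy_classes = {
    0: ["mantle", "seismic", "the", "the"],
    1: ["memory", "storage", "the", "the"],
}
toy_scores = class_tfidf(toy_classes)
print(toy_scores[0])
```

In this toy example "the" is the most frequent token of each class, yet it scores below "mantle" because it appears in both classes.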

In [51]:
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer()
topic_model_ctfidf = BERTopic(ctfidf_model=ctfidf_model).fit(docs)

2023-01-18 23:41:33,670 - BERTopic - Transformed documents to Embeddings
2023-01-18 23:41:48,020 - BERTopic - Reduced dimensionality
2023-01-18 23:41:48,690 - BERTopic - Clustered reduced embeddings


In [None]:
topic_model_ctfidf.get_topic_info().head(5)

In [65]:
!pip install contextualized_topic_models

###### **Exercise 5**

Evaluate the topic model from 3.1 based on the NPMI coherence and the IRBO. You can calculate these metrics in the same way as in Section 2.6.

For the calculation of the NPMI coherence, you must take into account that the corpus we gave to BERTopic was in raw format, since the algorithm carries out its tokenization and basic preprocessing internally. Yet, for the calculation of the NPMI coherence we need the corpus tokenized with the same tokenizer as used in BERTopic and preprocessed in the same way. The code below obtains such a corpus.

In [52]:
# Preprocess Documents
cleaned_docs = topic_model._preprocess_text(df['raw_text'].values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Extract features for Topic Coherence evaluation
corpus = [analyzer(doc) for doc in cleaned_docs]

In [57]:
vectorizer.fit(cleaned_docs)
# get_feature_names() is deprecated in recent scikit-learn versions
vocab = vectorizer.get_feature_names_out()

In [60]:
n_topics = probs.shape[1]
topic_words = []
for topic_idx in range(n_topics):
    topic_probs = probs[:, topic_idx]
    top_word_indices = topic_probs.argsort()[:-11:-1]
    topic_words.append([vocab[i] for i in top_word_indices])

In [69]:
#<SOL>
# Calculate NPMI coherence
coh = CoherenceNPMI(topic_words, corpus)
print("NPMI coherence score:", coh.score())
#</SOL>

NPMI coherence score: -0.16683246681178593
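
For reference, the NPMI of a word pair is $\mathrm{NPMI}(w_i, w_j) = \frac{\log \frac{P(w_i, w_j)}{P(w_i) P(w_j)}}{-\log P(w_i, w_j)}$, and the coherence of a topic averages it over the pairs of its top words. The sketch below estimates the probabilities from document-level co-occurrence on a toy corpus, which is a simplifying assumption: library implementations may count co-occurrences differently, for example within a sliding window. `npmi_coherence` is a hypothetical helper, not the `CoherenceNPMI` class.

```python
import math
from itertools import combinations


def npmi_coherence(topic, tokenized_docs):
    """Average NPMI over all word pairs of one topic (document-level co-occurrence)."""
    doc_sets = [set(doc) for doc in tokenized_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the words never co-occur
        elif p12 == 1:
            scores.append(0.0)  # degenerate 0/0 case, treated as 0 here
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


toy_docs = [
    ["mantle", "seismic", "plate"],
    ["mantle", "seismic"],
    ["memory", "storage"],
    ["memory", "plate"],
]
print(npmi_coherence(["mantle", "seismic"], toy_docs))  # words that always co-occur
print(npmi_coherence(["mantle", "memory"], toy_docs))   # words that never co-occur
```

NPMI ranges from $-1$ to $1$, so a negative average means that the top words of the topics co-occur less often than they would by chance.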


In [71]:
#<SOL>
# Calculate IRBO
irbo = InvertedRBO(topic_words)
irbo_score = irbo.score()
print("IRBO score:",irbo_score)
#</SOL>

IRBO score: 1.0
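
IRBO measures topic diversity as one minus the average rank-biased overlap (RBO) between the top-word lists of every pair of topics. The sketch below uses a simplified, truncated RBO written for illustration; it is not the implementation behind `InvertedRBO`, and the function names and the weight `p=0.9` are assumptions.

```python
from itertools import combinations


def rbo(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap of two ranked lists (no extrapolation)."""
    depth = min(len(list_a), len(list_b))
    score = 0.0
    for d in range(1, depth + 1):
        # Size of the overlap between the two prefixes of length d
        overlap = len(set(list_a[:d]) & set(list_b[:d]))
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score


def inverted_rbo(topics, p=0.9):
    """One minus the average pairwise RBO: higher means more diverse topics."""
    pairs = list(combinations(topics, 2))
    return 1 - sum(rbo(a, b, p) for a, b in pairs) / len(pairs)


disjoint = [["mantle", "seismic"], ["memory", "storage"]]
identical = [["a", "b", "c"], ["a", "b", "c"]]
print(inverted_rbo(disjoint))
print(inverted_rbo(identical))
```

With this definition, a score of exactly 1.0 means that no two topics share any of their top words.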


###### **Exercise 6**

Your task now is to fine-tune the BERTopic parameters that best suit the NSF dataset. As the fine-tuning objective, you must improve the results obtained in Exercise 5. For doing so, carry out the **optimization of each BERTopic block separately**. For UMAP and HDBSCAN, test with several parameters in each and visualize the results to support your selection. In HDBSCAN you may find it useful to visualize the hierarchy of clusters via the ``condensed_tree_`` attribute of the clusterer object.

Once you have your fine-tuned model, use visualizations to check whether it is necessary to reduce some topics, and evaluate it based on the NPMI coherence and the IRBO.

In [None]:
#<SOL>

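from umap import UMAP
from hdbscan import HDBSCAN

n_components = [5, 10, 15, 20, 25, 50]
n_neighbors = [5, 10, 15, 20]
min_cluster_size = [5, 10, 15, 20]

coherence_scores = []
irbo_scores = []

for n in n_components:
    for nn in n_neighbors:
        for mcs in min_cluster_size:
            # These hyperparameters belong to the UMAP and HDBSCAN sub-models,
            # so they are set there and the sub-models are passed to BERTopic
            umap_model = UMAP(n_neighbors=nn, n_components=n, min_dist=0.0, metric='cosine')
            hdbscan_model = HDBSCAN(min_cluster_size=mcs, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
            topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
            topics, probs = topic_model.fit_transform(docs)
            # Top 10 words of each topic, skipping the outlier topic -1
            topic_words = [[word for word, _ in words[:10]]
                           for topic, words in topic_model.get_topics().items() if topic != -1]
            coh = CoherenceNPMI(topic_words, corpus)
            coherence_scores.append(coh.score())
            irbo = InvertedRBO(topic_words)
            irbo_scores.append(irbo.score())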

print(coherence_scores)
print(irbo_scores)

#</SOL>
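
When running a search like this, it helps to store each score together with the parameter combination that produced it, so that the best configuration can be recovered afterwards. The following is a minimal sketch of that bookkeeping with `itertools.product`; `grid_search` is a hypothetical helper and `toy_evaluate` is a made-up stand-in for "fit a model and return its NPMI", not a real evaluation.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every parameter combination; return (best_params, best_score, results)."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, evaluate(params)))
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results


def toy_evaluate(params):
    """Made-up score that peaks at n_neighbors=15 and min_cluster_size=10."""
    return -abs(params["n_neighbors"] - 15) - abs(params["min_cluster_size"] - 10)


toy_grid = {"n_neighbors": [5, 15, 20], "min_cluster_size": [5, 10]}
best_params, best_score, results = grid_search(toy_grid, toy_evaluate)
print(best_params, best_score)
```

Because the number of combinations is the product of the list lengths, shortening any one list reduces the total run time proportionally.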

The cell above takes too long to run, even with a GPU. I will therefore build a model with parameters of my own choosing, but I leave the procedure for fine-tuning the hyperparameters indicated above. The rest is done in the same way as above:

In [None]:
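# Top 10 words of each topic, skipping the outlier topic -1
topic_words = [[word for word, _ in words[:10]]
               for topic, words in topic_model.get_topics().items() if topic != -1]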
coh = CoherenceNPMI(topic_words, corpus)
coherence_scores.append(coh.score())
irbo = InvertedRBO(topic_words)
irbo_scores.append(irbo.score())

###### **Exercise 7**

Create an LDA-Mallet model as you did in the Topic Modeling notebook in Block II and evaluate it based on the NPMI coherence and IRBO.

Based on the obtained results, which of the three algorithms (ContextualizedCTM, BERTopic, or LDA-Mallet) provides a more suitable model for the NSF data? You can complement your justification using visualizations (e.g., pyLDAvis-like graphs).

In [76]:
D = Dictionary(corpus)

In [80]:
reviews_bow = [D.doc2bow(doc) for doc in corpus]

In [82]:
os.environ['MALLET_HOME'] = 'mallet-2.0.8'
mallet_path = 'mallet-2.0.8/bin/mallet' # you should NOT need to change this 

In [85]:
import os       #importing os to set environment variable
def install_java():
    !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null      #install openjdk
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"     #set environment variable
    !java -version       #check java version
install_java()

openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu220.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu220.04, mixed mode, sharing)


In [86]:
!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
!unzip mallet-2.0.8.zip

In [89]:
!pip install pyLDAvis

In [87]:
# Requires gensim < 4.0: the wrappers module was removed in gensim 4
from gensim.models.wrappers import LdaMallet
ldamallet = LdaMallet(mallet_path, corpus=reviews_bow, num_topics=20, id2word=D, alpha=5, iterations=100)

In [103]:
from gensim.models.wrappers.ldamallet import malletmodel2ldamodel
ldagensim = malletmodel2ldamodel(ldamallet)

In [117]:
topic_words = ldamallet.show_topics(num_topics=num_topics, num_words=10)
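
The coherence and diversity measures used earlier expect each topic as a plain list of words, whereas `show_topics` with its default formatting returns `(topic_id, string)` pairs. The sketch below extracts the words from such strings, assuming they follow gensim's usual `0.030*"word" + ...` layout; `words_from_formatted_topics` and the toy input are hypothetical.

```python
import re


def words_from_formatted_topics(formatted_topics):
    """Turn [(topic_id, '0.03*"w1" + 0.02*"w2"'), ...] into [['w1', 'w2'], ...]."""
    return [re.findall(r'"([^"]+)"', description) for _, description in formatted_topics]


# Toy input that mimics the assumed output format
toy_topics = [
    (0, '0.030*"mantle" + 0.020*"seismic"'),
    (1, '0.025*"memory" + 0.015*"storage"'),
]
toy_words = words_from_formatted_topics(toy_topics)
print(toy_words)
```

The words keep the order in which they appear in each string, which is the ranking that rank-based measures such as IRBO rely on.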

In [120]:
# Note: 'c_v' is a different measure from NPMI; pass coherence='c_npmi' to obtain the NPMI coherence
coherencemodel = CoherenceModel(ldagensim, texts=corpus, dictionary=D, coherence='c_v')
print('c_v coherence: ' + str(coherencemodel.get_coherence()))

c_v coherence: 0.6487392739971356
