# <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2; text-align: center;">Topic Modeling</div>

An important task that comes after data preprocessing and analysis (the second step in virtually any NLP project) is topic modeling: a process that separates the existing data into multiple clusters, each representing a different topic. This is a crucial step in understanding the data and extracting valuable insights from it. For this task, the team decided to use the BERTopic library, a topic modeling technique that leverages transformer models to create dense representations of the documents and then clusters them using HDBSCAN.

#### Used Embeddings

The embeddings used for topic modeling come from the digitalepidemiologylab project on GitHub, which generated embeddings from COVID-19 tweets. These embeddings are based on the BERT model, and a description of them can be found in the [official repository of the digitalepidemiologylab project](https://github.com/digitalepidemiologylab/covid-twitter-bert).

## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Dependencies Imports</div>

In [1]:
!pip install umap-learn
!pip install hdbscan
!pip install sentence-transformers
!pip install bertopic
!pip install nltk
!pip install datamapplot
!pip install "dask[dataframe]"

Collecting sentence-transformers
  Using cached sentence_transformers-3.3.1-py3-none-any.whl.metadata (10 kB)
Collecting transformers<5.0.0,>=4.41.0 (from sentence-transformers)
  Using cached transformers-4.47.1-py3-none-any.whl.metadata (44 kB)
Using cached sentence_transformers-3.3.1-py3-none-any.whl (268 kB)
Using cached transformers-4.47.1-py3-none-any.whl (10.1 MB)
Installing collected packages: transformers, sentence-transformers
Successfully installed sentence-transformers-3.3.1 transformers-4.47.1

Collecting bertopic
  Using cached bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Using cached bertopic-0.16.4-py3-none-any.whl (143 kB)
Installing collected packages: bertopic
Successfully installed bertopic-0.16.4

Collecting datamapplot
  Using cached datamapplot-0.4.2-py3-none-any.whl.metadata (6.1 kB)
Collecting colorcet (from datamapplot)
  Using cached colorcet-3.1.0-py3-none-any.whl.metadata (6.3 kB)
Collecting colorspacious>=1.1 (from datamapplot)
  Using cached colorspacious-1.1.2-py2.py3-none-any.whl.metadata (3.6 kB)
Collecting datashader>=0.16 (from datamapplot)
  Using cached datashader-0.16.3-py2.py3-none-any.whl.metadata (12 kB)
Collecting importlib-resources (from datamapplot)
  Using cached importlib_resources-6.5.2-py3-none-any.whl.metadata (3.9 kB)
Collecting pylabeladjust (from datamapplot)
  Using cached pylabeladjust-0.1.13-py3-none-any.whl.metadata (8.2 kB)
Collecting rcssmin>=1.1.2 (from datamapplot)
  Using cached rcssmin-1.2.0.tar.gz (583 kB)

Collecting dask-expr<1.2,>=1.1 (from dask[dataframe])
  Using cached dask_expr-1.1.21-py3-none-any.whl.metadata (2.6 kB)
Using cached dask_expr-1.1.21-py3-none-any.whl (244 kB)
Installing collected packages: dask-expr
Successfully installed dask-expr-1.1.21

In [2]:
import pandas as pd
import html
from umap import UMAP
from sklearn.decomposition import PCA
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer


## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Preprocessing</div>

Before transforming the data, some preprocessing steps are needed to clean it and keep the results consistent with the previous notebooks. The following steps were taken:

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">1. Data Loading</div>

In [3]:
DATA = 'data/'
# The files are not valid UTF-8: byte 0x92 is the curly apostrophe in Windows-1252,
# so they are read as cp1252 here. Adjust if the files were exported differently.
ENCODING = 'cp1252'
test = pd.read_csv(DATA + 'Constraint_English_Test.csv', delimiter=';', encoding=ENCODING)
train = pd.read_csv(DATA + 'Constraint_English_Train.csv', delimiter=';', encoding=ENCODING)
val = pd.read_csv(DATA + 'Constraint_English_Val.csv', delimiter=';', encoding=ENCODING)

# Merge the three splits, drop the id column, and decode HTML entities such as &amp;
tweets = pd.concat([train, val, test], ignore_index=True)
tweets.drop(columns=['id'], inplace=True)
tweets['tweet'] = tweets['tweet'].apply(html.unescape)
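CSV files exported from Windows tools are often encoded as Windows-1252 rather than UTF-8, in which case a byte such as 0x92 (the curly apostrophe) cannot be decoded as UTF-8. A minimal sketch of probing a file before handing it to `pd.read_csv` is shown below. The helper name `pick_encoding` and the candidate list are assumptions made for illustration.

```python
from pathlib import Path


def pick_encoding(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes the whole file (hypothetical helper)."""
    raw = Path(path).read_bytes()
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the file, try the next one
        return encoding
    raise ValueError(f"none of {candidates} can decode {path}")
```

The result can be passed on directly, for example `pd.read_csv(path, delimiter=';', encoding=pick_encoding(path))`. The order of the candidates matters: cp1252 accepts almost any byte sequence, so stricter encodings such as UTF-8 should be tried first.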


### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">2. Data Cleaning</div>

In [None]:
processed_tweets = tweets.copy()
processed_tweets = processed_tweets.drop_duplicates(subset='tweet', keep='first')

## <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4484c2;">Data Transformation (Embeddings)</div>

In order to extract the different topics, a pre-trained embedding model is used to transform each tweet into a dense vector that the topic modeling algorithm can work with. How well the embeddings fit the domain of the data is paramount to the performance of the topic modeling algorithm. In this case, the embeddings used will be `covid-twitter-bert`, a model provided by digitalepidemiologylab.

### <div style="border: 3px solid #FFFFF; padding: 10px; border-radius: 5px; background-color: #4a44c2;">1. Load Embeddings</div>

In [None]:
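# Load the sentence-embedding model from the Hugging Face Hub.
embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Pass a plain list: after drop_duplicates the Series index has gaps, which can break positional lookups inside encode().
embeddings = embedding_model.encode(processed_tweets["tweet"].tolist(), show_progress_bar=True)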