# BERTopic Model Initial Evaluation

This notebook evaluates baseline BERTopic models across two input granularities (full documents and individual sentences) and several sentence-transformer embedding models, in order to select the configuration for the final model.

In [None]:
from bertopic import BERTopic
from models.bertopic.utils.data_loader import DataLoader
from sklearn.feature_extraction.text import CountVectorizer

from octis.evaluation_metrics.diversity_metrics import TopicDiversity, KLDivergence
from octis.evaluation_metrics.similarity_metrics import RBO, PairwiseJaccardSimilarity
from models.bertopic.utils.bertopic_evaluator import BERTopicModelEvaluator

from models.bertopic.config.embeddings import st_models
from models.bertopic.config.model import STOPWORDS, NUM_TOPICS, TOP_K

from umap import UMAP
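The training log later in this notebook repeats a `huggingface/tokenizers` fork warning, and the warning text itself suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch of that fix, assuming it runs in the same process before the embedding models are loaded and used:

```python
import os

# Set the variable named in the warning before any tokenizer is used.
# "false" disables tokenizer-level parallelism, which avoids the fork warning
# at the cost of somewhat slower tokenization.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Placing this at the top of the first cell is the safest position, since the value has to be in the environment before the process forks.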

## Data Loading

In [None]:
loader = DataLoader('data/data_speeches.csv', 'data/data_statements.csv')
loader.process()

train_docs, train_sentences = loader.get_train_data()
test_docs, test_sentences = loader.get_test_data()
val_docs, val_sentences = loader.get_val_data()

## Vectorizer initialization

In [None]:
vectorizer_model = CountVectorizer(stop_words=STOPWORDS,
                                   ngram_range=(2, 2))
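With `ngram_range=(2, 2)` the vectorizer emits only bigrams, and scikit-learn removes stop words before the n-grams are formed, so a bigram can join two words that were separated by a stop word in the original text. The pure-Python sketch below imitates that behaviour for illustration only; the `bigrams` helper, the English sentence, and the tiny stop-word set are hypothetical and are not the project's `STOPWORDS`.

```python
import re


def bigrams(text, stop_words):
    """Approximate CountVectorizer(stop_words=..., ngram_range=(2, 2)) tokenisation."""
    # Default scikit-learn token pattern: lowercase, tokens of 2+ word characters.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    # Stop words are dropped first, then adjacent survivors are paired.
    tokens = [t for t in tokens if t not in stop_words]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


example = bigrams("the rate of inflation is rising", {"the", "of", "is"})
print(example)  # ['rate inflation', 'inflation rising']
```

Because every topic word is then a two-word phrase, the topic representations and all metrics computed on them operate on bigrams rather than single words.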

## BERTopic model initialization

In [None]:
# One BERTopic model per (embedding model, input granularity) pair.
# All six are defined because the `models` dict below references every one.
# The bigram range lives on vectorizer_model; BERTopic has no `ngram_range`
# keyword (its own `n_gram_range` is unused when a vectorizer is supplied).
model_gr_docs = BERTopic(embedding_model=st_models['gr_stsb'],
                         vectorizer_model=vectorizer_model,
                         nr_topics=NUM_TOPICS)

model_gr_media_docs = BERTopic(embedding_model=st_models['gr_media'],
                               vectorizer_model=vectorizer_model,
                               nr_topics=NUM_TOPICS)

model_gr_sentences = BERTopic(embedding_model=st_models['gr_stsb'],
                              vectorizer_model=vectorizer_model,
                              nr_topics=NUM_TOPICS)

model_gr_media_sentences = BERTopic(embedding_model=st_models['gr_media'],
                                    vectorizer_model=vectorizer_model,
                                    nr_topics=NUM_TOPICS)

model_multilingual_docs = BERTopic(embedding_model=st_models['multilingual'],
                                   vectorizer_model=vectorizer_model,
                                   nr_topics=NUM_TOPICS)

model_multilingual_sentences = BERTopic(embedding_model=st_models['multilingual'],
                                        vectorizer_model=vectorizer_model,
                                        nr_topics=NUM_TOPICS)

In [None]:
metrics = {
    'coherence_c_npmi': None,
    'coherence_c_v': None,
    'coherence_u_mass': None,
    'coherence_c_uci': None,
    'diversity_topic': TopicDiversity(topk=TOP_K),
    'similarity_rbo': RBO(topk=TOP_K),
    'similarity_pjs': PairwiseJaccardSimilarity(),
}
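To make the diversity and similarity entries easier to interpret, here is a minimal pure-Python sketch of the usual definitions: topic diversity as the share of unique words among the top-k words of all topics, and pairwise Jaccard similarity as the mean Jaccard index over all topic pairs. It is not the OCTIS implementation, and the function names and toy topics are hypothetical.

```python
from itertools import combinations


def topic_diversity(topics, topk):
    """Fraction of unique words among the top-k words of every topic."""
    top_words = [word for topic in topics for word in topic[:topk]]
    return len(set(top_words)) / len(top_words)


def pairwise_jaccard(topics):
    """Mean Jaccard index (intersection over union) across all topic pairs."""
    pairs = list(combinations(topics, 2))
    scores = [len(set(a) & set(b)) / len(set(a) | set(b)) for a, b in pairs]
    return sum(scores) / len(scores)


# Toy topics made of bigrams; "monetary policy" appears in both.
toy_topics = [
    ["interest rates", "monetary policy", "price stability"],
    ["monetary policy", "euro area", "banking union"],
]
diversity = topic_diversity(toy_topics, topk=3)
jaccard = pairwise_jaccard(toy_topics)
print(round(diversity, 4), round(jaccard, 4))  # 0.8333 0.2
```

Under these definitions, fully disjoint topics score a diversity of 1.0 and a Jaccard similarity of 0.0, so higher diversity and lower similarity both indicate more distinct topics.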

In [None]:
models = {
    'gr_docs': model_gr_docs,
    'gr_media_docs': model_gr_media_docs,
    'gr_sentences': model_gr_sentences,
    'gr_media_sentences': model_gr_media_sentences,
    'multilingual_docs': model_multilingual_docs,
    'multilingual_sentences': model_multilingual_sentences
}

In [None]:
datasets = {
    'docs': train_docs,
    'sentences': train_sentences
}
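Each model name ends in `_docs` or `_sentences`, and the results table carries a matching `dataset` column. How `BERTopicModelEvaluator` pairs a model with its dataset is not shown in this notebook; the sketch below illustrates one plausible convention, matching on the name suffix, and the `dataset_for` helper is purely hypothetical.

```python
def dataset_for(model_name, datasets):
    """Pick the dataset whose key matches the model name's last suffix (assumed convention)."""
    suffix = model_name.rsplit("_", 1)[-1]
    if suffix not in datasets:
        raise KeyError(f"no dataset named {suffix!r} for model {model_name!r}")
    return suffix, datasets[suffix]


# Placeholder data standing in for train_docs and train_sentences.
toy_datasets = {"docs": ["doc one", "doc two"], "sentences": ["s1", "s2", "s3"]}
name, data = dataset_for("gr_media_docs", toy_datasets)
print(name, len(data))  # docs 2
```

If the evaluator uses a different rule, the dictionary keys and model names above would need to follow that rule instead.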

In [18]:
evaluator = BERTopicModelEvaluator(models=models, 
                                   metrics=metrics, 
                                   datasets=datasets,
                                   topics=NUM_TOPICS)

In [19]:
evaluator.evaluate()

Training model:  gr_docs
Model trained
Training model:  gr_media_docs
Model trained
Training model:  gr_sentences


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Model trained
Training model:  gr_media_sentences



Model trained
Training model:  multilingual_docs
Model trained
Training model:  multilingual_sentences



Model trained
Evaluating model:  gr_docs
Evaluating metric coherence_c_npmi for model gr_docs
Evaluating metric coherence_c_v for model gr_docs
Evaluating metric coherence_u_mass for model gr_docs
Evaluating metric coherence_c_uci for model gr_docs
Evaluating metric diversity_topic for model gr_docs
Evaluating metric similarity_rbo for model gr_docs
Evaluating metric similarity_pjs for model gr_docs
Model gr_docs evaluated
Evaluating model:  gr_media_docs
Evaluating metric coherence_c_npmi for model gr_media_docs
Evaluating metric coherence_c_v for model gr_media_docs
Evaluating metric coherence_u_mass for model gr_media_docs
Evaluating metric coherence_c_uci for model gr_media_docs
Evaluating metric diversity_topic for model gr_media_docs
Evaluating metric similarity_rbo for model gr_media_docs
Evaluating metric similarity_pjs for model gr_media_docs
Model gr_media_docs evaluated
Evaluating model:  gr_sentences
Evaluating metric coherence_c_npmi for model gr_sentences
Evaluating metri

| model | coherence_c_npmi | coherence_c_v | coherence_u_mass | coherence_c_uci | diversity_topic | similarity_rbo | similarity_pjs | dataset |
|---|---|---|---|---|---|---|---|---|
| gr_docs | 0.082016 | 0.682607 | -0.275356 | -0.798458 | 0.731034 | 0.055541 | 0.053181 | docs |
| gr_media_docs | 0.087017 | 0.686488 | -0.374455 | -0.581325 | 0.724138 | 0.077489 | 0.047451 | docs |
| gr_sentences | 0.103696 | 0.501931 | -0.437374 | -2.220314 | 0.97931 | 0.001059 | 0.001051 | sentences |
| gr_media_sentences | 0.173116 | 0.604086 | -0.314721 | -0.501189 | 1.0 | 0.0 | 0.00013 | sentences |
| multilingual_docs | 0.094314 | 0.702401 | -0.199095 | -0.653081 | 0.608696 | 0.10438 | 0.082085 | docs |
| multilingual_sentences | 0.14239 | 0.582602 | -0.300966 | -1.331063 | 0.972414 | 0.000803 | 0.001051 | sentences |
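Since the goal is to pick a model, it can help to rank the results programmatically. The sketch below copies three columns from the results above into an inline CSV string and finds the best model per dataset for a chosen metric. The `best_by` helper is hypothetical, and comparing only within each dataset is a cautious assumption, because the document and sentence runs are scored against different corpora.

```python
import csv
import io

# Subset of the results above, copied verbatim.
RESULTS = """model,coherence_c_npmi,coherence_c_v,dataset
gr_docs,0.082016,0.682607,docs
gr_media_docs,0.087017,0.686488,docs
gr_sentences,0.103696,0.501931,sentences
gr_media_sentences,0.173116,0.604086,sentences
multilingual_docs,0.094314,0.702401,docs
multilingual_sentences,0.14239,0.582602,sentences
"""

rows = list(csv.DictReader(io.StringIO(RESULTS)))


def best_by(rows, metric):
    """Return {dataset: model} with the highest value of `metric` per dataset."""
    best = {}
    for row in rows:
        current = best.get(row["dataset"])
        if current is None or float(row[metric]) > float(current[metric]):
            best[row["dataset"]] = row
    return {dataset: row["model"] for dataset, row in best.items()}


best_cv = best_by(rows, "coherence_c_v")
best_npmi = best_by(rows, "coherence_c_npmi")
print(best_cv)
print(best_npmi)
```

On these numbers both coherence metrics agree: `multilingual_docs` leads among the document models and `gr_media_sentences` among the sentence models.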


In [20]:
evaluator.topics

30