Demo ipynb for BERTopic

Testing the pipeline for a single game

Ref

BERTopic tutorial

https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing#scrollTo=ScBUgXn06IK6


BERTopic Best Practices

https://colab.research.google.com/drive/1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing#scrollTo=m3aN-f9B4rmU


BERTopic Big data (for improving the speed of the training pipeline, on GPU)

https://colab.research.google.com/drive/1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing#scrollTo=Ls2Q-iccGs7O


BERTopic Topic Modelling with Llama2

https://colab.research.google.com/drive/1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M?usp=sharing#scrollTo=4Uj8MYhCafmX

In [9]:
import pandas as pd
import numpy as np

from pathlib import Path

import gensim

import nltk

import pyLDAvis

In [10]:
dataset_path = Path('../../dataset/topic_modelling/top_10_games/00_Terraria.pkl')

dataset = pd.read_pickle(dataset_path)

dataset.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
Index: 75499 entries, 57735 to 133233
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   index         75499 non-null  int64 
 1   app_id        75499 non-null  int64 
 2   app_name      75499 non-null  object
 3   review_text   75499 non-null  object
 4   review_score  75499 non-null  int64 
 5   review_votes  75499 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 4.0+ MB


In [11]:
%load_ext autoreload

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [12]:
# data preprocessing

import re

import sys
sys.path.append('../../sa')

%autoreload 2
import str_cleaning_functions


def cleaning(df, review):
    df[review] = df[review].apply(lambda x: str_cleaning_functions.remove_links(x))
    df[review] = df[review].apply(lambda x: str_cleaning_functions.remove_links2(x))
    df[review] = df[review].apply(lambda x: str_cleaning_functions.clean(x))
    df[review] = df[review].apply(lambda x: str_cleaning_functions.deEmojify(x))
    df[review] = df[review].apply(lambda x: str_cleaning_functions.unify_whitespaces(x))

def cleaning_strlist(str_list):
    str_list = list(map(lambda x: str_cleaning_functions.remove_links(x), str_list))
    str_list = list(map(lambda x: str_cleaning_functions.remove_links2(x), str_list))
    str_list = list(map(lambda x: str_cleaning_functions.clean(x), str_list))
    str_list = list(map(lambda x: str_cleaning_functions.deEmojify(x), str_list))
    str_list = list(map(lambda x: str_cleaning_functions.unify_whitespaces(x), str_list))
    return str_list

In [13]:
cleaning(dataset, 'review_text')

In [14]:
X = dataset['review_text'].values

Training

For a small corpus, simply call BERTopic's built-in `fit_transform`; it handles embedding, dimensionality reduction and clustering in one step.

For a large corpus, it is better to pre-compute the embeddings and prepare the vocabulary before training to reduce memory usage.
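The "compute once, reuse afterwards" idea for embeddings can be sketched as below. This is only an illustration: `load_or_compute` and `fake_encode` are hypothetical names that do not appear in this notebook, and `fake_encode` merely stands in for a real `model.encode(X)` call.

```python
from pathlib import Path
import tempfile

import numpy as np


def load_or_compute(path, compute_fn):
    """Return the array cached at `path`, computing and saving it on a cache miss."""
    path = Path(path)
    if path.exists():
        return np.load(path)
    array = np.asarray(compute_fn())
    np.save(path, array)
    return array


def fake_encode():
    # Hypothetical stand-in for model.encode(X); any function returning an array works.
    return np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "embeddings.npy"
    first = load_or_compute(cache, fake_encode)    # computed and written to disk
    second = load_or_compute(cache, fake_encode)   # read back from disk
```

Using a `.npy` suffix keeps the file name in line with what `np.save` actually writes.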

In [15]:
# small documents

# from bertopic import BERTopic

# TOP_N_WORDS = 10                # number of words per topic
# N_GRAM_RANGE = (1, 2)           # n-gram

# topic_model = BERTopic(language="english", top_n_words=TOP_N_WORDS, calculate_probabilities=True, verbose=True)
# topics, probs = topic_model.fit_transform(X)

In [16]:
# large documents

# pre-calculate embeddings

from sentence_transformers import SentenceTransformer
import torch

# Create embeddings

SENTENCE_TRANSFORMERS_NAME = 'sentence-transformers/all-MiniLM-L6-v2'

model = SentenceTransformer(SENTENCE_TRANSFORMERS_NAME, device='cuda' if torch.cuda.is_available() else 'cpu')
embeddings = model.encode(X, show_progress_bar=True)

Batches: 100%|██████████| 2360/2360 [00:16<00:00, 144.81it/s]


In [17]:
# save the embeddings

embedding_path = Path('00_Terraria_embeddings.pkl')

if not embedding_path.exists():
    with open(embedding_path, 'wb') as f:
        np.save(f, embeddings)

# load the embeddings
if embedding_path.exists():
    with open(embedding_path, 'rb') as f:
        embeddings = np.load(f)

In [18]:
# prepare vocabulary before training such that tokenizer does not need to do the calculations itself

import collections
from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer

# Extract vocab to be used in BERTopic
vocab = collections.Counter()
tokenizer = CountVectorizer().build_tokenizer()
for doc in tqdm(X):
  vocab.update(tokenizer(doc))
vocab = [word for word, frequency in vocab.items() if frequency >= 15]; len(vocab)    # set the minimum frequency to reduce the vocabulary size

  0%|          | 0/75499 [00:00<?, ?it/s]

100%|██████████| 75499/75499 [00:00<00:00, 161107.23it/s]


6694
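One detail that may be worth checking: `build_tokenizer()` only splits text and does not lowercase it, while `CountVectorizer` lowercases documents by default when it analyses them later. Mixed-case vocabulary entries would then never be matched. The sketch below uses only the standard library and the default `CountVectorizer` token pattern to show the effect; `build_vocab` is a hypothetical helper, not part of this notebook.

```python
import collections
import re

# Default token pattern of scikit-learn's CountVectorizer: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocab(docs, min_freq=1, lowercase=True):
    """Count tokens over all documents and keep those seen at least `min_freq` times."""
    counts = collections.Counter()
    for doc in docs:
        text = doc.lower() if lowercase else doc
        counts.update(TOKEN_PATTERN.findall(text))
    return [word for word, freq in counts.items() if freq >= min_freq]


docs = ["Terraria is great", "terraria is fun", "I love Terraria"]
cased = build_vocab(docs, min_freq=2, lowercase=False)    # keeps 'Terraria', drops 'terraria'
lowered = build_vocab(docs, min_freq=2, lowercase=True)   # merges both spellings
```

With lowercasing, the two spellings are counted together, so the frequency threshold is applied to the same tokens the vectorizer will see.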

In [19]:
# GPU acceleration is not used here: its dependencies are difficult to set up,
# and the model is deployed on a CPU-only server

from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN

# parameter optimization
# UMAP
UMAP_N_COMPONENTS = 5
UMAP_N_NEIGHBORS = 50

# HDBSCAN
HDBSCAN_MIN_CLUSTER_SIZE = 150
HDBSCAN_MIN_SAMPLES = 20

# BERTopic
N_TOPICS = 20

# check: https://maartengr.github.io/BERTopic/faq.html#which-embedding-model-should-i-choose 
# for more parameter optimization on the UMAP and HDBSCAN models

# Prepare sub-models
# UMAP and HDBSCAN are the CPU implementations here, not the (Nvidia) GPU-accelerated versions
embedding_model = SentenceTransformer(SENTENCE_TRANSFORMERS_NAME)       # use the model as the embedding model
umap_model = UMAP(n_components=UMAP_N_COMPONENTS, n_neighbors=UMAP_N_NEIGHBORS, min_dist=0.0, random_state=42, metric="cosine", verbose=True)       # set random_state for reproducibility
hdbscan_model = HDBSCAN(min_cluster_size=HDBSCAN_MIN_CLUSTER_SIZE, min_samples=HDBSCAN_MIN_SAMPLES, gen_min_span_tree=True, prediction_data=True)
vectorizer_model = CountVectorizer(vocabulary=vocab, stop_words="english")

# Build the BERTopic model from the sub-models; dimensionality reduction and clustering run in fit_transform()
topic_model = BERTopic(
        nr_topics=N_TOPICS + 1,                 # add 1 as the topic with id = '-1' represents outliers, and should be typically ignored
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        calculate_probabilities=True,
        
        verbose=True
)

topics, probs = topic_model.fit_transform(X, embeddings=embeddings)

2024-01-24 13:01:40,514 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


UMAP(angular_rp_forest=True, metric='cosine', min_dist=0.0, n_components=5, n_neighbors=50, random_state=42, verbose=True)
Wed Jan 24 13:01:40 2024 Construct fuzzy simplicial set
Wed Jan 24 13:01:40 2024 Finding Nearest Neighbors
Wed Jan 24 13:01:40 2024 Building RP forest with 19 trees
Wed Jan 24 13:01:43 2024 NN descent for 16 iterations
	 1  /  16
	 2  /  16
	 3  /  16
	 4  /  16
	 5  /  16
	Stopping threshold met -- exiting after 5 iterations
Wed Jan 24 13:01:58 2024 Finished Nearest Neighbor Search
Wed Jan 24 13:02:00 2024 Construct embedding


Epochs completed:   2%| ▎          5/200 [00:00]

	completed  0  /  200 epochs


Epochs completed:  11%| █          22/200 [00:03]

	completed  20  /  200 epochs


Epochs completed:  21%| ██         42/200 [00:06]

	completed  40  /  200 epochs


Epochs completed:  32%| ███▏       63/200 [00:10]

	completed  60  /  200 epochs


Epochs completed:  42%| ████▏      83/200 [00:14]

	completed  80  /  200 epochs


Epochs completed:  51%| █████      102/200 [00:18]

	completed  100  /  200 epochs


Epochs completed:  62%| ██████▏    123/200 [00:21]

	completed  120  /  200 epochs


Epochs completed:  72%| ███████▏   143/200 [00:25]

	completed  140  /  200 epochs


Epochs completed:  81%| ████████   162/200 [00:29]

	completed  160  /  200 epochs


Epochs completed:  91%| █████████  182/200 [00:32]

	completed  180  /  200 epochs


Epochs completed: 100%| ██████████ 200/200 [00:36]


Wed Jan 24 13:02:41 2024 Finished embedding


2024-01-24 13:02:42,215 - BERTopic - Dimensionality - Completed ✓
2024-01-24 13:02:42,217 - BERTopic - Cluster - Start clustering the reduced embeddings
  self._all_finite = is_finite(X)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [20]:
# get the frequency table for the top 5 topics plus the outlier topic (-1)
freq = topic_model.get_topic_freq()
print(freq.head(6))
print('Num of topics:', len(freq))

    Topic  Count
0       0  25831
1      -1  23367
2       1  17085
15      2   1499
12      3   1382
4       4   1180
Num of topics: 21


In [21]:
# get topic frequency table
freq = topic_model.get_topic_freq()
print(freq)
print('Num of topics:', len(freq))

    Topic  Count
0       0  25831
1      -1  23367
2       1  17085
15      2   1499
12      3   1382
4       4   1180
8       5    830
9       6    556
3       7    529
13      8    520
5       9    442
18     10    428
20     11    273
7      12    259
6      13    255
11     14    205
14     15    187
19     16    177
10     17    171
17     18    166
16     19    157
Num of topics: 21


In [None]:
# reduce outliers: https://maartengr.github.io/BERTopic/getting_started/outlier_reduction/outlier_reduction.html

# https://medium.com/@n83072/topic-modeling-bertopic-ca1b73a035f2

# NOTE: every call below starts from the original `topics` and overwrites `new_topics`,
# so only the result of the last uncommented strategy is kept.

# Strategy 1: `probabilities`
# Uses the soft clustering performed by HDBSCAN to find the best-matching topic for each outlier document.
# Requires the probabilities, so instantiate BERTopic with calculate_probabilities=True.
new_topics = topic_model.reduce_outliers(X, topics, probabilities=probs, strategy="probabilities")


# Strategy 2: `distributions`
# Uses the topic distributions, as calculated with .approximate_distribution,
# to find the most frequent topic in each outlier document.
# The distributions_params argument can be used to tweak the parameters of .approximate_distribution.
new_topics = topic_model.reduce_outliers(X, topics, strategy="distributions")


# Strategy 3: `c-tf-idf`
# Calculates the c-TF-IDF representation of each outlier document
# and finds the best-matching c-TF-IDF topic representation using cosine similarity.
new_topics = topic_model.reduce_outliers(X, topics, strategy="c-tf-idf")

# Strategy 4: `embeddings`
# In this run it caused a large drop in the NPMI score,
# so a less aggressive strategy is probably preferable.
# new_topics = topic_model.reduce_outliers(X, topics, strategy="embeddings")
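To make the `probabilities` strategy more concrete, here is a rough NumPy sketch of the idea: each outlier document takes its most probable topic, provided that probability reaches a threshold. `reassign_outliers` is a hypothetical helper written for illustration and is not BERTopic's implementation.

```python
import numpy as np


def reassign_outliers(topics, probs, threshold=0.0):
    """Give each outlier (-1) its most probable topic if that probability reaches `threshold`."""
    topics = np.asarray(topics)
    probs = np.asarray(probs)
    best_topic = probs.argmax(axis=1)
    best_prob = probs.max(axis=1)
    reassign = (topics == -1) & (best_prob >= threshold)
    return np.where(reassign, best_topic, topics)


toy_topics = [0, -1, 1, -1]
toy_probs = np.array([
    [0.90, 0.10],   # already assigned to topic 0
    [0.20, 0.70],   # outlier, clearly closest to topic 1
    [0.30, 0.60],   # already assigned to topic 1
    [0.05, 0.04],   # outlier with no convincing topic
])
strict = reassign_outliers(toy_topics, toy_probs, threshold=0.1)
lenient = reassign_outliers(toy_topics, toy_probs, threshold=0.0)
```

A higher threshold leaves weakly matched documents as outliers, while a threshold of 0 reassigns every outlier.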

In [None]:
from collections import Counter
new_topic_dict = dict(Counter(new_topics))


new_topic_dict_df = pd.DataFrame(list(new_topic_dict.items()), columns=['topic_id', 'count'])
new_topic_dict_df = new_topic_dict_df.sort_values(by=['count'], ascending=False)

new_topic_dict_df

Unnamed: 0,topic_id,count
3,0,29399
0,1,14477
5,2,13139
8,3,8073
7,4,2529
12,6,1726
6,7,1547
13,5,1500
17,12,1409
14,9,1353


In [None]:
# topic ids are integers, so compare with -1 rather than the string '-1'
new_topic_dict_df[new_topic_dict_df['topic_id'] == -1]

Unnamed: 0,topic_id,count


In [None]:
# try to apply the topic reduction to the BERTopic model

topic_model.update_topics(X, topics=new_topics)



In [None]:
embeddings.shape

(75499, 384)

In [48]:
# save the model (different from the func for small documents)
from datetime import datetime

topic_model_name = f'my_model_{datetime.now().strftime("%Y%m%d_%H%M%S")}'

# save as safetensors
# topic_model.save(
#     path=Path(topic_model_name),
#     serialization="safetensors",
#     save_ctfidf=True,
#     save_embedding_model=SENTENCE_TRANSFORMERS_NAME
# )

# save as pickle
topic_model.save(
    path=Path(topic_model_name + '_pickle.pkl'),
    serialization="pickle",
    save_ctfidf=True,
    save_embedding_model=True
)



In [None]:
# reload the trained topic model for faster inference
# del topic_model

In [51]:
# load the model

from bertopic import BERTopic
import joblib

# when loading the model from safetensors, the attributes and umap & hdbscan models are not loaded
# topic_model = BERTopic.load('my_model_20240124_142250')

topic_model = joblib.load('my_model_20240124_150329_pickle.pkl')

In [53]:
# load the embeddings
# embedding_path = Path('00_Terraria_embeddings.pkl')
# embeddings = np.load(embedding_path)

# run inference to get the topics and probabilities for evaluation
# (the probabilities are needed to build the topic-document matrix)

import hdbscan

# Method 1: just call the transform() method 
topics, probs = topic_model.transform(X, embeddings=embeddings)

# Method 2 (BERTopic 0.9.2 <=)
# since the hdbscan_model and umap_model are not saved to disk,
# we need to re-fit the models after loading

# topics = topic_model._map_probabilities(topic_model.hdbscan_model.labels_)

# probs = hdbscan.all_points_membership_vectors(topic_model.hdbscan_model)
# probs = topic_model._map_probabilities(probs, original_topics=True)

2024-01-24 15:19:19,522 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


In [46]:
topic_model.topic_embeddings_.shape

(21, 384)

In [54]:
probs.shape

(75499, 21)

---

Get the documents with the highest probability for each topic after calling transform() on a (new) set of documents

In [23]:
# use probs to find the top-N most representative documents for each topic
# note: np.argpartition returns the top-N indices in arbitrary order, not sorted by probability
top_N = 10

idx = np.argpartition(-probs, top_N, axis=0)[:top_N]

In [24]:
# row = one of the top-N slots (unordered), col = topic; each entry is a document index
idx.shape

(10, 21)

In [25]:
idx[:, -1]

array([32475, 57095, 37127, 62366, 63134, 39127,  5619, 41368, 72119,
       33916])

In [26]:
probs[idx[:, -1], -1]

array([0.8626926 , 0.88693094, 0.891624  , 0.8622919 , 0.88246596,
       0.8726312 , 0.87021685, 0.8593762 , 0.85431933, 0.83784544],
      dtype=float32)
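The probabilities printed above are not in descending order because `np.argpartition` only guarantees that the top-N entries come first, not that they are sorted. A possible way to get a ranked list is sketched below; `top_n_docs_per_topic` is a hypothetical helper, and it partitions at `top_n - 1` so that it also works when `top_n` equals the number of documents.

```python
import numpy as np


def top_n_docs_per_topic(probs, top_n):
    """Return (top_n, n_topics) document indices, sorted by descending probability per topic."""
    probs = np.asarray(probs)
    # Partition each column so the top_n highest-probability rows come first (unordered).
    part = np.argpartition(-probs, top_n - 1, axis=0)[:top_n]
    # Sort only those top_n candidates within each column.
    order = np.argsort(-np.take_along_axis(probs, part, axis=0), axis=0)
    return np.take_along_axis(part, order, axis=0)


toy_probs = np.array([
    [0.10, 0.80],
    [0.70, 0.05],
    [0.90, 0.30],
    [0.20, 0.60],
])
ranked_idx = top_n_docs_per_topic(toy_probs, top_n=2)
```

Each column of `ranked_idx` then lists the document indices for one topic, best match first.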

In [27]:
for i in idx[:, -1]:
    print(X[i])

its a great game 10/10 IGN rating 
I LOVE THIS GAME ign 10/10
this game is amazing 10/10 IGN
This is one of the best games ever. It got 9/10 IGN
Such a great Game 10/10 -Ign
I LOVE THIS GAME 10/10 BEST GAVE EVER IGN
this Game is amazing 10/1o ign
Great game 10/10 IGN :)
Great game, IGN 11/10
Awsome IGN 10/10 best game ever


In [28]:
scores = probs[idx[:, 0]]

In [29]:
scores

array([[0.8439059 , 0.6639326 , 0.8470823 , 0.41522664, 0.69603264,
        0.3260334 , 0.48170853, 0.69639784, 0.18815815, 0.4254477 ,
        0.63974464, 0.46924725, 0.39073318, 0.5540215 , 0.53564095,
        0.51882154, 0.48313844, 0.20669758, 0.16494085, 0.5581672 ,
        0.40123302],
       [0.8420009 , 0.68566704, 0.86814034, 0.4555121 , 0.7356359 ,
        0.36514777, 0.46463612, 0.6834041 , 0.21792297, 0.42076814,
        0.5866291 , 0.46851212, 0.42243493, 0.5106257 , 0.54692024,
        0.5256633 , 0.4516474 , 0.25883296, 0.108062  , 0.56317604,
        0.4025724 ],
       [0.8493773 , 0.6620978 , 0.8550476 , 0.44255364, 0.66760737,
        0.31126752, 0.44108182, 0.71707666, 0.18157372, 0.5122689 ,
        0.65644974, 0.49404135, 0.41204625, 0.50009173, 0.5272465 ,
        0.4609333 , 0.43928248, 0.1960501 , 0.23035052, 0.5707918 ,
        0.35814822],
       [0.84614635, 0.6865132 , 0.8404861 , 0.49374938, 0.67502224,
        0.41698363, 0.48019862, 0.7438107 , 0.2299784

In [30]:
scores.shape

(10, 21)

In [None]:
# # load the embeddings
# embedding_path = Path('00_Terraria_embeddings.pkl')
# embeddings = np.load(embedding_path)

# # inference to get the topics and prob for evaluation
# # hence, we need the probs to get topic-doc-matrix
# topics, probs = topic_model.transform(X, embeddings=embeddings)

---

Extracting Topics

In [22]:
# look at the most frequent topics 

freq = topic_model.get_topic_info(); freq.head(5)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,23367,-1_game_10_fun_play,"[game, 10, fun, play, hours, good, great, time...","[game is really fun and cool, fun for playing ..."
1,0,25831,0_terraria_minecraft_game_like,"[terraria, minecraft, game, like, 2d, bosses, ...",[A lot of people will say this game is just li...
2,1,17085,1_game_fun_friends_great,"[game, fun, friends, great, play, best, hours,...","[Great Game. So Fun, Great game, best played w..."
3,2,1499,2_good_awesome_pretty_cool,"[good, awesome, pretty, cool, awsome, love, gr...","[ pretty good, it's pretty good , Its pretty g..."
4,3,1382,3_addictive_addicting_fun_addicted,"[addictive, addicting, fun, addicted, game, ad...","[This game is fun and addictive., This game is..."


In [23]:
topic_model.get_topic(0)  # Select the most frequent topic

[('terraria', 0.03855658943984582),
 ('minecraft', 0.03820221474136693),
 ('game', 0.03085978775389637),
 ('like', 0.0247233652891928),
 ('2d', 0.019962992645576803),
 ('bosses', 0.017343638269238355),
 ('just', 0.016695266402496654),
 ('fun', 0.016510117840983295),
 ('sandbox', 0.01625523163924172),
 ('games', 0.015819221297120722)]

(Copied from the BERTopic Colab notebook)

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute              | Description |
|------------------------|-------------|
| topics_                | The topics that are generated for each document after training or updating the topic model. |
| probabilities_         | The probabilities that are generated for each document if HDBSCAN is used. |
| topic_sizes_           | The size of each topic. |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced. |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values. |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF. |
| topic_labels_          | The default labels for each topic. |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`. |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used. |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used. |

Save and load BERTopic models and components

Visualization

In [33]:
# visualize topics

topic_model.visualize_topics()

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.


In [34]:
# visualize topic probabilities
# to understand how confident BERTopic is that certain topics are present in the documents

topic_model.visualize_distribution(probs[200], min_probability=0.001)

In [35]:
# visualize how topics are hierarchically reduced

topic_model.visualize_hierarchy(top_n_topics=50)


scipy.array is deprecated and will be removed in SciPy 2.0.0, use numpy.array instead





In [36]:
# visualize selected terms for a few topics
# by creating bar charts from the c-TF-IDF scores of each topic representation

topic_model.visualize_barchart(top_n_topics=5)

In [37]:
# visualize topic similarity
# Having generated topic embeddings, through both c-TF-IDF and embeddings,
# we can create a similarity matrix by simply applying cosine similarities through those topic embeddings.
# The result will be a matrix indicating how similar certain topics are to each other.

topic_model.visualize_heatmap(n_clusters=10, width=1000, height=1000)
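The similarity matrix behind the heatmap can be illustrated with plain NumPy. This is a minimal sketch on toy vectors; `cosine_similarity_matrix` and `toy_topic_embeddings` are hypothetical names, and the real plot is built by BERTopic itself.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)   # guard against zero vectors
    return unit @ unit.T


# Toy "topic embeddings": topics 0 and 1 point in similar directions, topic 2 does not.
toy_topic_embeddings = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, -2.0],
])
sim = cosine_similarity_matrix(toy_topic_embeddings)
```

Entry `sim[i, j]` is the similarity between topics `i` and `j`; the diagonal is 1 because every topic is identical to itself.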

Evaluation

Calculate metrics with octis

Reference

https://www.theanalyticslab.nl/topic-modeling-with-bertopic/

In [24]:
result_bertopic = {}

top_words = 10     # the functions will only return that number of top words
def _get_topics(topic_model):
    topic_list = []
    empty_topic_l_idx = []

    for idx, topics in topic_model.get_topics().items():
        if idx < 0:
            continue

        topics_sorted = sorted(topics, key=lambda x: x[1], reverse=True)
        topic_l = [t[0] for t in topics_sorted if t[0].strip() != '']

        # the filtered word list can end up empty;
        # a topic with only one word also fails when calculating NPMI
        if len(topic_l) <= 1:
            empty_topic_l_idx.append(idx)
            continue

        topic_list.append(topic_l)
        # print(len(topic_l))

    return topic_list, empty_topic_l_idx

def _get_topic_word_matrix(topic_model, empty_topic_idxs):

    # use a softmax over the c-TF-IDF values as a proxy for the topic-word distribution
    # (this is not a true probability of a word given a topic; there may be a better way)

    c_tfidf_all = topic_model.c_tf_idf_.todense()

    topic_word_matrix = np.exp(c_tfidf_all) / np.exp(c_tfidf_all).sum(axis=1)

    # rows of c_tf_idf_ follow the sorted topic ids, so when an outlier topic (-1) exists
    # row 0 belongs to it; map each topic id to its row before deleting
    topic_ids = sorted(topic_model.get_topics().keys())

    # remove empty topics, starting from the largest index
    for idx in empty_topic_idxs[::-1]:
        topic_word_matrix = np.delete(topic_word_matrix, topic_ids.index(idx), axis=0)

    # not a better way: https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-probablities-or-distribution
    # as that focuses on how the tokens in a document are weighted w.r.t. every available topic
    
    return topic_word_matrix

def _get_topic_document_matrix(probabilities, empty_topic_idxs):

    # probabilities has shape (n_documents, n_topics), so topics are columns (axis=1)
    topic_document_matrix = probabilities

    # remove empty topics, starting from the largest index
    for idx in empty_topic_idxs[::-1]:
        topic_document_matrix = np.delete(topic_document_matrix, idx, axis=1)

    # transpose to (n_topics, n_documents)
    return topic_document_matrix.T

result_bertopic['topics'], empty_topic_idxs = _get_topics(topic_model)
result_bertopic['topic-word-matrix'] = _get_topic_word_matrix(topic_model, empty_topic_idxs)
result_bertopic['topic-document-matrix'] = _get_topic_document_matrix(probs, empty_topic_idxs)
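A note on the softmax used for the topic-word matrix: c-TF-IDF values are small (the top words of topic 0 score about 0.04), so exponentiating them gives rows that are close to uniform, and any divergence measured between such rows would be very small. The sketch below shows this on toy scores and compares it with row-wise L1 normalization. The L1 variant is only an assumption offered for comparison, not the method used in this notebook, and both function names are hypothetical.

```python
import numpy as np


def softmax_rows(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def l1_normalize_rows(scores):
    """Row-wise L1 normalization for non-negative scores."""
    scores = np.asarray(scores, dtype=float)
    totals = scores.sum(axis=1, keepdims=True)
    return scores / np.clip(totals, 1e-12, None)


# Toy c-TF-IDF-like scores: small, non-negative, mostly zero.
toy_scores = np.array([
    [0.04, 0.03, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.05, 0.02, 0.00],
])
soft = softmax_rows(toy_scores)      # every entry ends up close to 1/5
l1 = l1_normalize_rows(toy_scores)   # mass stays on the non-zero words
```

Both results are valid row distributions; they differ in how much of the original contrast between words survives.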

In [32]:
probs.shape

(75499, 20)

In [31]:
result_bertopic['topic-document-matrix'][:,0]

array([0.64445549, 0.17693751, 0.00541367, 0.01706351, 0.02325755,
       0.01331316, 0.02066849, 0.01380336, 0.01160087, 0.00583711,
       0.00678572, 0.00591017, 0.00750849, 0.0084701 , 0.00714232,
       0.00600271, 0.00757775, 0.00528419, 0.00577912, 0.00718871])

In [25]:
result_bertopic['topics'], result_bertopic['topic-word-matrix'], result_bertopic['topic-document-matrix']

([['terraria',
   'minecraft',
   'game',
   'like',
   '2d',
   'bosses',
   'just',
   'fun',
   'sandbox',
   'games'],
  ['game',
   'fun',
   'friends',
   'great',
   'play',
   'best',
   'hours',
   'played',
   'good',
   '10'],
  ['good',
   'awesome',
   'pretty',
   'cool',
   'awsome',
   'love',
   'great',
   'amazing',
   'just',
   'yes'],
  ['addictive',
   'addicting',
   'fun',
   'addicted',
   'game',
   'addiction',
   'really',
   'highly',
   'awesome',
   'extremely'],
  ['10',
   'unicorn',
   'killed',
   'bunny',
   'bunnies',
   'fish',
   'giant',
   'unicorns',
   'flying',
   'kill'],
  ['fix',
   'help',
   'crashes',
   'wont',
   'deleted',
   'play',
   'game',
   'world',
   'work',
   'launch'],
  ['expert',
   'mode',
   'mods',
   'hardmode',
   'game',
   'workshop',
   'fun',
   'mod',
   'hard',
   'new'],
  ['10', '11', 'play', 'bang', 'good', 'buy', '12', 'rate', 'life', 'pretty'],
  ['buy',
   'worth',
   'money',
   'just',
   'spent',
  

In [40]:
topic_freq = topic_model.get_topic_freq()
topic_freq[topic_freq['Topic'] != -1]

Unnamed: 0,Topic,Count
0,0,25831
2,1,17085
15,2,1499
12,3,1382
4,4,1180
8,5,830
9,6,556
3,7,529
13,8,520
5,9,442


Evaluation with gensim

(gensim's CoherenceModel gives more control over how the coherence is calculated)

In [41]:
from gensim import corpora
from gensim.models.coherencemodel import CoherenceModel

# https://stackoverflow.com/questions/70548316/gensim-coherencemodel-gives-valueerror-unable-to-interpret-topic-as-either-a-l

# filter topics that contain only one word from the corpus for calculating npmi
# https://github.com/piskvorky/gensim/issues/3328


topic_words, empty_topic_idxs = _get_topics(topic_model)

documents = pd.DataFrame({"Document": X,
                          "ID": range(len(X)),
                          "Topic": topics})

# remove documents whose topic contains <= 1 word
documents = documents[~documents['Topic'].isin(empty_topic_idxs)]

documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

bertopic_vectorizer = topic_model.vectorizer_model
bertopic_analyzer = bertopic_vectorizer.build_analyzer()

words = bertopic_vectorizer.get_feature_names_out()
tokens = [bertopic_analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]

In [42]:
# ~3 min on an i7-14700 with a CountVectorizer vocabulary of ~6000 words

# we first compute the C_v coherence (NPMI is computed in the next cell)

coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                corpus=corpus,
                                dictionary=dictionary,
                                topn=10,
                                coherence='c_v')

# npmi = Coherence(texts=tokens,topk=10, measure='c_npmi')
# nmpi_score = npmi.score(result_bertopic)

cv_score = coherence_model.get_coherence()
cv_score


0.5192437948379023

In [43]:
coherence_model_npmi = CoherenceModel(topics=topic_words,
                                    texts=tokens,
                                    corpus=corpus,
                                    dictionary=dictionary,
                                    topn=10,
                                    coherence='c_npmi')

npmi_score = coherence_model_npmi.get_coherence()
npmi_score

0.09737737383636216
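For intuition about the scale of this score: NPMI for a word pair lies between -1 (never co-occur) and 1 (always co-occur), and the topic score is an average over word pairs. The sketch below computes it for a single pair using document-level co-occurrence, which is a simplification. gensim's own probability estimate may differ in detail, and `npmi` here is a hypothetical helper.

```python
import math


def npmi(word_a, word_b, tokenized_docs):
    """NPMI of two words, using document-level co-occurrence as the probability estimate."""
    n_docs = len(tokenized_docs)
    doc_sets = [set(doc) for doc in tokenized_docs]
    p_a = sum(word_a in doc for doc in doc_sets) / n_docs
    p_b = sum(word_b in doc for doc in doc_sets) / n_docs
    p_ab = sum(word_a in doc and word_b in doc for doc in doc_sets) / n_docs
    if p_ab == 0:
        return -1.0   # the words never co-occur
    if p_ab == 1:
        return 1.0    # both words appear in every document
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


toy_docs = [
    ["terraria", "minecraft", "2d"],
    ["terraria", "minecraft", "sandbox"],
    ["great", "game"],
    ["great", "fun"],
]
together = npmi("terraria", "minecraft", toy_docs)   # always appear together
apart = npmi("terraria", "great", toy_docs)          # never appear together
partial = npmi("great", "game", toy_docs)            # "game" appears only with "great"
```

A value slightly above 0, like the one above, would mean the top words co-occur a little more often than chance.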

In [44]:
def get_topic_diversity(topics, topk=10):
    ''' Topic Diversity as the percentage of unique words in the top M words of all topics
    Modified from octis implementation
    
    Parameters
    ----------
    topics : list of list of str
        List of topics, where each topic is a list of words.
    topk : int, optional
    '''
    if topics is None:
        return 0
    # if topk > len(topics[0]):
    #     raise Exception('Words in topics are less than ' + str(self.topk))

    unique_words = set()
    for topic in topics:
        unique_words = unique_words.union(set(topic[:topk]))
    td = len(unique_words) / (topk * len(topics))
    return td

get_topic_diversity(topic_words)

0.71

In [45]:
import itertools

import sys
sys.path.append('../')

from rbo import rbo

def get_word2index(list1, list2):
    words = set(list1)
    words = words.union(set(list2))
    word2index = {w: i for i, w in enumerate(words)}
    return word2index

def get_inverted_RBO(topics, topk=10, weight=0.9):
    ''' Inverted Rank-Biased Overlap (iRBO)
    to measure the diversity of the topics
    Modified from octis implementation

    Parameters
    ----------
    topics : list of list of str
        List of topics, where each topic is a list of words.
    topk : int, optional
    weight : float, optional
    '''

    if topics is None:
        return 0
    if topk > len(topics[0]):
        raise Exception('Words in topics are less than topk')
    else:
        collect = []
        for list1, list2 in itertools.combinations(topics, 2):
            word2index = get_word2index(list1, list2)
            indexed_list1 = [word2index[word] for word in list1]
            indexed_list2 = [word2index[word] for word in list2]
            rbo_val = rbo(indexed_list1[:topk], indexed_list2[:topk], p=weight)[2]
            collect.append(rbo_val)
        return 1 - np.mean(collect)
    
get_inverted_RBO(topic_words)

0.9446674577437219
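Since `rbo` comes from a local module, here is a self-contained sketch of what the score measures. It assumes that index `[2]` of the helper's return value is the extrapolated RBO estimate, as in the OCTIS helper this function is modified from. It only handles equal-length rankings without duplicates, and `rbo_ext` and `inverted_rbo` are hypothetical names.

```python
from itertools import combinations


def rbo_ext(list1, list2, p=0.9):
    """Extrapolated rank-biased overlap of two equal-length rankings without duplicates."""
    if len(list1) != len(list2):
        raise ValueError("this sketch only handles rankings of equal length")
    k = len(list1)
    if k == 0:
        return 1.0
    seen1, seen2 = set(), set()
    weighted_sum = 0.0
    overlap = 0
    for depth in range(1, k + 1):
        seen1.add(list1[depth - 1])
        seen2.add(list2[depth - 1])
        overlap = len(seen1 & seen2)              # shared items within the top `depth`
        weighted_sum += (overlap / depth) * p ** depth
    return (overlap / k) * p ** k + (1 - p) / p * weighted_sum


def inverted_rbo(topics, topk=10, p=0.9):
    """1 - mean pairwise RBO over the top `topk` words of every topic pair."""
    scores = [rbo_ext(a[:topk], b[:topk], p) for a, b in combinations(topics, 2)]
    return 1 - sum(scores) / len(scores)


identical = rbo_ext(["game", "fun", "play"], ["game", "fun", "play"])
disjoint = rbo_ext(["game", "fun", "play"], ["fix", "help", "crash"])
diverse = inverted_rbo([["game", "fun"], ["fix", "help"]], topk=2)
```

Identical rankings score 1 and disjoint rankings score 0, so an inverted value close to 1 indicates that the topics share few highly ranked words.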

In [46]:
def _KL(P, Q):
    """
    Perform Kullback-Leibler divergence

    Parameters
    ----------
    P : distribution P
    Q : distribution Q

    Returns
    -------
    divergence : divergence from Q to P
    """
    # add epsilon to grant absolute continuity
    epsilon = 0.00001
    P = P+epsilon
    Q = Q+epsilon

    divergence = np.sum(np.multiply(P, np.log(P/Q)))        # changed the operator from * to np.multiply to do element-wise multiplication
    return divergence

def get_kl_divergence(topic_word_matrix):
    """Compute KL divergence between topic-word distributions
    to measure document coverage
    Modified from octis implementation
    https://github.com/MIND-Lab/OCTIS/blob/master/octis/evaluation_metrics/diversity_metrics.py#L209

    Parameters
    ----------
    topic_word_matrix : topic-word distribution matrix
    """
    beta = topic_word_matrix
    kl_div = 0
    count = 0
    for i, j in itertools.combinations(range(len(beta)), 2):
        kl_div += _KL(beta[i], beta[j])
        count += 1
    return kl_div / count

get_kl_divergence(result_bertopic['topic-word-matrix'])

9.734483225145115e-05
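To get a feel for the scale of this number, the same mean pairwise KL can be run on two toy matrices, one with clearly distinct rows and one with almost identical rows. This is a sketch on plain ndarrays with hypothetical helper names; it keeps the same epsilon smoothing as the cell above.

```python
import itertools

import numpy as np


def kl_divergence(p, q, epsilon=1e-5):
    """KL(P || Q) for two discrete distributions, smoothed by `epsilon` to avoid log(0)."""
    p = np.asarray(p, dtype=float) + epsilon
    q = np.asarray(q, dtype=float) + epsilon
    return float(np.sum(p * np.log(p / q)))


def mean_pairwise_kl(rows):
    """Average KL(P_i || P_j) over all unordered row pairs i < j."""
    pairs = list(itertools.combinations(range(len(rows)), 2))
    return sum(kl_divergence(rows[i], rows[j]) for i, j in pairs) / len(pairs)


distinct = np.array([[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]])
near_uniform = np.array([[0.34, 0.33, 0.33], [0.33, 0.33, 0.34]])
kl_distinct = mean_pairwise_kl(distinct)
kl_near_uniform = mean_pairwise_kl(near_uniform)
```

Distinct rows give a divergence well above 1, while nearly identical rows give a value close to 0, so a score on the order of 1e-4 suggests the rows being compared are almost the same distribution.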

In [47]:
result_bertopic['topic-word-matrix'].shape

(21, 6694)

---

Inference Test

In [None]:
inference_test = ["well its been fun guys, but that's it, no more updates, that one was the last one, there is no longer going to be anymore content for this game anymore, there is no way to replay it as there won't be any updates, nope, that was it, the last update, nothing more, this game has no new ways to experience it as there is no more content updates, nothing new to freshen up the experience, its such a shame that this game has no replay-ability, once you beat the game there is like no point to playing again, as they said guys 1.2 will be they final update. nothing more after 1.2, there is no chance they will make another final update right? several years and final updates later: alright, thats it, no more updates we wont be getting anymore, thats it, nothing more, no more updates, for real this time... oh god, redigit made another tweet.",
                  "keeps forcing me to play it",
'''I will leave the cat here, so that everybody who passes by can pet it and give it a thumbs up and awards
　　　 　　／＞　　フ
　　　 　　| 　_　 _ l
　 　　 　／` ミ＿xノ
　　 　 /　　　 　 |
　　　 /　 ヽ　　 ﾉ
　 　 │　　|　|　|
　／￣|　　 |　|　|
　| (￣ヽ＿_ヽ_)__)
　＼二つ''']