BERTopic Training script

References

BERTopic tutorial

https://colab.research.google.com/drive/1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing#scrollTo=ScBUgXn06IK6


BERTopic Best Practices

https://colab.research.google.com/drive/1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing#scrollTo=m3aN-f9B4rmU


BERTopic Big data (for improving the speed of the training pipeline, on GPU)

https://colab.research.google.com/drive/1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing#scrollTo=Ls2Q-iccGs7O


BERTopic Topic Modelling with Llama2

https://colab.research.google.com/drive/1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M?usp=sharing#scrollTo=4Uj8MYhCafmX

In [1]:
import pandas as pd
import numpy as np

from pathlib import Path
import json
from datetime import datetime

import gensim

import nltk

import pyLDAvis

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"          # disable huggingface warning

In [2]:
%load_ext autoreload

In [3]:
import sys

sys.path.append('../')

In [4]:
# load the dataset

%autoreload 2
from dataset_loader import GENRES, load_dataset

genre = GENRES.INDIE
unique_list = ['review_text']

dataset_folder = Path(f'../../dataset/topic_modelling/top_11_genres_unique_[{",".join(unique_list)}]')
dataset, dataset_path = load_dataset(genre, dataset_folder)

dataset.info(verbose=True)

Load dataset from: /root/FYP/NLP/dev-workspace/dataset/topic_modelling/top_11_genres_unique_[review_text]/01_indie.pkl





<class 'pandas.core.frame.DataFrame'>
Int64Index: 725737 entries, 25636 to 4179608
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   index         725737 non-null  int64 
 1   app_id        725737 non-null  int64 
 2   app_name      725737 non-null  object
 3   review_text   725737 non-null  object
 4   review_score  725737 non-null  int64 
 5   review_votes  725737 non-null  int64 
 6   genre_id      725737 non-null  object
 7   category_id   725737 non-null  object
dtypes: int64(4), object(4)
memory usage: 49.8+ MB


In [5]:
# The path of the dataset to be stored in the config file
str(dataset_path.relative_to(dataset_path.parent.parent.parent.parent))

'dataset/topic_modelling/top_11_genres_unique_[review_text]/01_indie.pkl'

In [6]:
# data preprocessing

sys.path.append('../../sa')

%autoreload 2
import str_cleaning_functions


def cleaning(df, review):
    df[review] = df[review].apply(lambda x: str_cleaning_functions.remove_links(x))
    df[review] = df[review].apply(lambda x: str_cleaning_functions.remove_links2(x))
    df[review] = df[review].apply(lambda x: str_cleaning_functions.clean(x))
    df[review] = df[review].apply(lambda x: str_cleaning_functions.deEmojify(x))
    df[review] = df[review].apply(lambda x: str_cleaning_functions.unify_whitespaces(x))

def cleaning_strlist(str_list):
    str_list = list(map(lambda x: str_cleaning_functions.remove_links(x), str_list))
    str_list = list(map(lambda x: str_cleaning_functions.remove_links2(x), str_list))
    str_list = list(map(lambda x: str_cleaning_functions.clean(x), str_list))
    str_list = list(map(lambda x: str_cleaning_functions.deEmojify(x), str_list))
    str_list = list(map(lambda x: str_cleaning_functions.unify_whitespaces(x), str_list))
    return str_list

In [7]:
cleaning(dataset, 'review_text')

In [8]:
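# measure how much of each review is non-alphabetic (punctuation, digits, whitespace)
# note: the column is named 'alphabet_ratio' but it stores the NON-alphabetic ratio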

def calculate_nonalphabet_ratio(review: str) -> float:
    count = 0
    for char in review:
        if not char.isalpha():
            count += 1
    return count / (len(review) + 1e-5)

dataset['alphabet_ratio'] = dataset['review_text'].apply(calculate_nonalphabet_ratio)

dataset['alphabet_ratio'].describe([0.25, 0.5, 0.75, 0.9, 0.95, 0.99])

count    725737.000000
mean          0.222376
std           0.059417
min           0.000000
25%           0.200000
50%           0.215385
75%           0.234257
90%           0.263889
95%           0.294964
99%           0.406250
max           1.000000
Name: alphabet_ratio, dtype: float64

In [9]:
# remove reviews with too many non-alphabetic characters
# the 0.40 threshold is roughly the 99th percentile of the ratio

# this removes a further ~7.4K reviews

dataset = dataset[dataset['alphabet_ratio'] < 0.40]
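
To make the filter above concrete, here is a minimal sketch of the same ratio applied to a few made-up strings. The sample reviews and the names `nonalpha_ratio`, `samples`, and `kept` are illustrative assumptions, not part of the notebook; only the 0.40 threshold and the epsilon come from the cells above.

```python
def nonalpha_ratio(text: str) -> float:
    """Fraction of characters that are not alphabetic (spaces, digits, punctuation)."""
    non_alpha = sum(1 for ch in text if not ch.isalpha())
    return non_alpha / (len(text) + 1e-5)   # small epsilon avoids division by zero


# Hypothetical sample reviews, for illustration only
samples = [
    "Great game, would play again",   # ordinary prose: 5 of 28 characters are non-alphabetic
    "!!!! 10/10 ????",                # punctuation, digits and spaces only
    "",                               # empty string
]

THRESHOLD = 0.40   # same cut-off as the cell above

kept = [s for s in samples if nonalpha_ratio(s) < THRESHOLD]
print(kept)
```

Note that an empty string has a ratio of 0 and therefore passes this filter, which is consistent with the separate empty-string removal a few cells later.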

In [10]:
X = dataset['review_text'].values

In [11]:
# remove empty strings

X = list(filter(lambda x: len(x) > 0, X))

In [12]:
# print the length so it can be checked when loading the data in an evaluation script

print(len(X))
print(X[0])

718311
Take one part Faerie Solitaire and two parts Puzzle Quest and mix in a little Poker or Yahtzee for good measure and you will get something like Runespell: Overture. You're a changeling of some sort and you fight monsters and take quests in exchange for coin and buffs (which come in the form of power-up cards). There's a story but it's not the strongest element in the game. Like the Puzzle Quest games, your battles are determined by playing a mini-game. Instead of match-3 though, the game is a card game similar to poker in which making certain combinations of cards (pairs, 5 of a kind, full house, flush, straight) will do a certain amount of damage to your opponent, who is trying to do the same to you. The ability to steal some cards from your opponent, plus the limited number of moves you get per turn to move cards or play power-ups adds just enough strategy to the game to keep it interesting. Admittedly, the game can get a bit repetitive after a while and I found the dialogue o

Training

Pre-calculate embeddings and prepare the vocabulary before training to reduce memory usage.
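
The vocabulary step can be sketched roughly as follows. This is a standard-library stand-in, not the notebook's code: it assumes the regular expression `(?u)\b\w\w+\b`, which is the default token pattern of scikit-learn's `CountVectorizer`, and the names `build_vocab`, `docs`, and `min_frequency` are hypothetical.

```python
import re
from collections import Counter

# Tokens of two or more word characters (CountVectorizer's default pattern)
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocab(docs, min_frequency=1):
    """Count tokens over all documents and keep those seen at least min_frequency times."""
    counts = Counter()
    for doc in docs:
        counts.update(TOKEN_PATTERN.findall(doc))
    return [word for word, count in counts.items() if count >= min_frequency]


# Hypothetical mini-corpus, for illustration only
docs = ["fun puzzle game", "puzzle game with cards", "a card game"]
vocab = build_vocab(docs, min_frequency=2)
print(vocab)
```

Raising `min_frequency` shrinks the vocabulary, which in turn shrinks the count matrix built for the topic representations.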

In [13]:
import platform
import torch

if platform.system() == 'Linux' or platform.system() == 'Windows':
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
else:
    device = torch.device('mps')        # m-series machine

print(device)

cuda


Set up the hyperparameter search (typically over the number of topics, though UMAP and HDBSCAN also have hyperparameters), using either grid search or random search.
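
As a rough illustration of a grid search over a nested search space, the sketch below flattens `{group: {param: values}}` into `group__param` keys and enumerates every combination. It uses `itertools.product` as a standard-library stand-in for scikit-learn's `ParameterGrid`; the parameter values and the helper name `iter_grid` are assumptions chosen for the example.

```python
from itertools import product

# Hypothetical search space, for illustration only
search_space = {
    "umap_params": {"n_neighbors": [10, 15]},
    "hdbscan_params": {"min_cluster_size": [50, 100]},
}

# Flatten {group: {param: values}} into {"group__param": values}
flat = {
    f"{group}__{param}": values
    for group, params in search_space.items()
    for param, values in params.items()
}


def iter_grid(flat_space):
    """Yield one dict per combination of candidate values (exhaustive grid)."""
    keys = sorted(flat_space)
    for combo in product(*(flat_space[k] for k in keys)):
        yield dict(zip(keys, combo))


grid = list(iter_grid(flat))
print(len(grid))
```

A random search would instead draw a fixed number of points from the same flattened space, trading exhaustiveness for a bounded number of training runs.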

In [14]:
from gensim.models import CoherenceModel
from copy import deepcopy

from sklearn.model_selection import ParameterGrid, ParameterSampler

sys.path.append('../')

from eval_metrics import compute_inverted_rbo, compute_topic_diversity, compute_pairwise_jaccard_similarity, \
                        METRICS, SEARCH_BEHAVIOUR, COHERENCE_MODEL_METRICS

In [15]:
def _print_message(message):
    '''Print message with a timestamp in front of it

    Timestamp format: YYYY-MM-DD HH:MM:SS,mmm
    '''
    print(f'{datetime.now().strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]} - {message}')
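
For reference, a small sketch of how the timestamp format above behaves on a fixed moment. The helper name `format_timestamp` and the chosen date are illustrative assumptions.

```python
from datetime import datetime


def format_timestamp(moment: datetime) -> str:
    # %f yields six-digit microseconds; dropping the last three digits leaves milliseconds
    return moment.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]


fixed = datetime(2024, 1, 2, 3, 4, 5, 678901)   # arbitrary fixed moment
stamp = format_timestamp(fixed)
print(f"{stamp} - example message")
```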

In [16]:
from typing import Iterable, List, Tuple, Union


def _init_sentence_transformers_params(model_name_or_path: str = None):
    
    params_dict = {}
    params_dict['model_name_or_path'] = model_name_or_path

    return params_dict

def _init_vocab_tokenizer_params(n_frequency:int = 0, ngram_range:Tuple[int, int] = (1, 1)):

    params_dict = {}
    params_dict['n_frequency'] = n_frequency
    params_dict['ngram_range'] = ngram_range

    return params_dict

def _init_umap_params(n_neighbors:int = 15,     # the number of neighbors to consider when approximating the local metric
                      n_components:int = 5,     # the target embedding dimension, its effect is largest on the performance of HDBSCAN. Increasing this value too much and HDBSCAN will have a hard time clustering the high-dimensional embeddings
                      metric:str = 'cosine',
                      min_dist:float = 0.1,     # the desired separation between close points in the embedding space
                      n_epochs:int = None,      # the number of training epochs to use when optimizing the low dimensional representation
                      low_memory:bool = False,
                      random_state:int = None):
    
    '''
    Suggested parameter tuning: https://maartengr.github.io/BERTopic/getting_started/parameter%20tuning/parametertuning.html
    '''

    params_dict = {}
    params_dict['n_neighbors'] = n_neighbors
    params_dict['n_components'] = n_components
    params_dict['metric'] = metric
    params_dict['min_dist'] = min_dist
    params_dict['n_epochs'] = n_epochs
    params_dict['low_memory'] = low_memory
    params_dict['random_state'] = random_state

    return params_dict

def _init_hdbscan_params(min_cluster_size:int = 5,
                         min_samples:int = None,
                         metric:str = 'euclidean',
                         prediction_data:bool = True):
    '''
    Suggested parameter tuning: https://maartengr.github.io/BERTopic/getting_started/parameter%20tuning/parametertuning.html
    '''

    params_dict = {}
    params_dict['min_cluster_size'] = min_cluster_size      # equivalent to min_topic_size in BERTopic params
    params_dict['min_samples'] = min_samples
    params_dict['metric'] = metric              # options are those in sklearn.metrics.pairwise_distances. ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']
    params_dict['prediction_data'] = prediction_data

    return params_dict

def _init_bertopic_params(language: str = "english",        # used to simplify the selection of sentence-transformers models, but since we are passing our own sbert model, this can be ignored.
                 top_n_words: int = 10,
                #  n_gram_range: Tuple[int, int] = (1, 1),
                #  min_topic_size: int = 10,
                 nr_topics: Union[int, str] = None,
                #  low_memory: bool = False,
                 calculate_probabilities: bool = False,
                #  seed_topic_list: List[List[str]] = None,
                #  zeroshot_topic_list: List[str] = None,
                #  zeroshot_min_similarity: float = .7
                 ):
    
    '''
    Parameter tuning: https://maartengr.github.io/BERTopic/getting_started/parameter%20tuning/parametertuning.html
    '''

    if top_n_words < 10:
        raise ValueError('top_n_words must be greater than or equal to 10. This is for generating CoherenceModel metrics.')
     
    params_dict = {}
    params_dict['language'] = language                     # put in the params_dict for completeness (as it appears in Parameter tuning in BERTopic page)
    params_dict['top_n_words'] = top_n_words                # this also affects the number of top N words used to calculate metrics
    # params_dict['n_gram_range'] = n_gram_range            # this is controlled by the vocab_tokenizer_params
    # params_dict['min_topic_size'] = min_topic_size        # this is controlled by the hdbscan_params
    params_dict['nr_topics'] = nr_topics
    # params_dict['low_memory'] = low_memory                # this is controlled by the umap_params
    params_dict['calculate_probabilities'] = calculate_probabilities
    # params_dict['seed_topic_list'] = seed_topic_list
    # params_dict['zeroshot_topic_list'] = zeroshot_topic_list
    # params_dict['zeroshot_min_similarity'] = zeroshot_min_similarity

    return params_dict
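
The helpers above each return a plain dict of fixed values. One way such a dict could be combined with a single point from the search space is sketched below, assuming search keys of the form `group__param`. The function name `params_for_trial` and all values are hypothetical.

```python
from copy import deepcopy

# Hypothetical fixed UMAP parameters and one sampled search point
base_umap_params = {"n_components": 5, "metric": "cosine", "min_dist": 0.1}
trial_point = {"umap_params__n_neighbors": 15, "hdbscan_params__min_cluster_size": 100}


def params_for_trial(base, trial, group):
    """Copy the fixed parameters and overlay the searched values that belong to `group`."""
    merged = deepcopy(base)             # leave the base dict untouched across trials
    prefix = group + "__"
    for key, value in trial.items():
        if key.startswith(prefix):
            merged[key[len(prefix):]] = value
    return merged


umap_params = params_for_trial(base_umap_params, trial_point, "umap_params")
print(umap_params)
```

Copying before updating keeps every trial independent, so a value set in one trial cannot leak into the next.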


In [17]:
def _init_config_dict(config_path:Path, model_name:str, dataset_path:Path, hyperparameters:dict, search_space_dict:dict, 
                      metrics:list[METRICS], monitor:METRICS,
                      search_behaviour:SEARCH_BEHAVIOUR, search_rs:int, search_n_iter:int):
    
    if not config_path.exists():
        config = {}

        sbert_params = _init_sentence_transformers_params(**hyperparameters['sbert_params'])
        vocab_tokenizer_params = _init_vocab_tokenizer_params(**hyperparameters['vocab_tokenizer_params'])
        umap_params = _init_umap_params(**hyperparameters['umap_params'])
        hdbscan_params = _init_hdbscan_params(**hyperparameters['hdbscan_params'])
        bertopic_params = _init_bertopic_params(**hyperparameters['bertopic_params'])

        config['model'] = model_name
        config['dataset_path'] = str(dataset_path)
        config['sbert_params'] = sbert_params
        config['vocab_tokenizer_params'] = vocab_tokenizer_params
        config['umap_params'] = umap_params
        config['hdbscan_params'] = hdbscan_params
        config['bertopic_params'] = bertopic_params

        # remove hyperparameters that are in the search_space_dict
        if 'sbert_params' in search_space_dict:
            for k in search_space_dict['sbert_params'].keys():
                sbert_params.pop(k, '')
        if 'vocab_tokenizer_params' in search_space_dict:
            for k in search_space_dict['vocab_tokenizer_params'].keys():
                vocab_tokenizer_params.pop(k, '')
        if 'umap_params' in search_space_dict:
            for k in search_space_dict['umap_params'].keys():
                umap_params.pop(k, '')
        if 'hdbscan_params' in search_space_dict:
            for k in search_space_dict['hdbscan_params'].keys():
                hdbscan_params.pop(k, '')
        if 'bertopic_params' in search_space_dict:
            for k in search_space_dict['bertopic_params'].keys():
                bertopic_params.pop(k, '')

        config['search_space'] = search_space_dict

        config['metrics'] = list(map(lambda x: x.value, metrics))

        config['monitor'] = monitor.value

        config['search_behaviour'] = search_behaviour.value
        if search_behaviour == SEARCH_BEHAVIOUR.RANDOM_SEARCH:
            config['search_rs'] = search_rs
            config['search_n_iter'] = search_n_iter

        with open(config_path, 'w') as f:
            json.dump(config, f, indent=2)

        _print_message('Created config file at {}'.format(config_path))
        # print('Created config file at {}'.format(config_path))

    else:
        with open(config_path, 'r') as f:
            config = json.load(f)

        # check whether the config file is consistent with the input parameters
        assert config['model'] == model_name, 'input model_name is not consistent with config["model"]'
        assert config['dataset_path'] == str(dataset_path), 'input dataset_path is not consistent with config["dataset_path"]'
        assert config['metrics'] == list(map(lambda x: x.value, metrics)), 'input metrics is not consistent with config["metrics"]'
        assert config['monitor'] == monitor.value, 'input monitor is not consistent with config["monitor"]'
        assert config['search_behaviour'] == search_behaviour.value, 'input search_behaviour is not consistent with config["search_behaviour"]'
        if search_behaviour == SEARCH_BEHAVIOUR.RANDOM_SEARCH:
            assert config['search_rs'] == search_rs, 'input search_rs is not consistent with config["search_rs"]'
            assert config['search_n_iter'] == search_n_iter, 'input search_n_iter is not consistent with config["search_n_iter"]'
        
        # check whether the config file contains all the hyperparameters
        sbert_params = _init_sentence_transformers_params(**hyperparameters['sbert_params'])
        vocab_tokenizer_params = _init_vocab_tokenizer_params(**hyperparameters['vocab_tokenizer_params'])
        umap_params = _init_umap_params(**hyperparameters['umap_params'])
        hdbscan_params = _init_hdbscan_params(**hyperparameters['hdbscan_params'])
        bertopic_params = _init_bertopic_params(**hyperparameters['bertopic_params'])

        assert config['sbert_params'].keys() <= sbert_params.keys(), 'existing config["sbert_params"] contains additional hyperparameters'
        assert config['vocab_tokenizer_params'].keys() <= vocab_tokenizer_params.keys(), 'existing config["vocab_tokenizer_params"] contains additional hyperparameters'
        assert config['umap_params'].keys() <= umap_params.keys(), 'existing config["umap_params"] contains additional hyperparameters'
        assert config['hdbscan_params'].keys() <= hdbscan_params.keys(), 'existing config["hdbscan_params"] contains additional hyperparameters'
        assert config['bertopic_params'].keys() <= bertopic_params.keys(), 'existing config["bertopic_params"] contains additional hyperparameters'

        for key in sbert_params.keys() & config['sbert_params'].keys():
            assert sbert_params[key] == config['sbert_params'][key], 'existing config["sbert_params"] contains different hyperparameters'
        for key in vocab_tokenizer_params.keys() & config['vocab_tokenizer_params'].keys():
            assert vocab_tokenizer_params[key] == config['vocab_tokenizer_params'][key], 'existing config["vocab_tokenizer_params"] contains different hyperparameters'
        for key in umap_params.keys() & config['umap_params'].keys():
            assert umap_params[key] == config['umap_params'][key], 'existing config["umap_params"] contains different hyperparameters'
        for key in hdbscan_params.keys() & config['hdbscan_params'].keys():
            assert hdbscan_params[key] == config['hdbscan_params'][key], 'existing config["hdbscan_params"] contains different hyperparameters'
        for key in bertopic_params.keys() & config['bertopic_params'].keys():
            assert bertopic_params[key] == config['bertopic_params'][key], 'existing config["bertopic_params"] contains different hyperparameters'
        
        # check whether the search space in the config file matches the input search space
        for group in ('sbert_params', 'vocab_tokenizer_params', 'umap_params', 'hdbscan_params', 'bertopic_params'):
            if group in config['search_space']:
                # fail with a clear message instead of a KeyError when the group is missing from the input
                assert group in search_space_dict, f'input search_space_dict does not contain "{group}", but existing config["search_space"] does'
                assert config['search_space'][group].keys() == search_space_dict[group].keys(), f'input search_space_dict["{group}"] contains different hyperparameter keys than existing config["search_space"]["{group}"]'
                for key in search_space_dict[group].keys():
                    assert search_space_dict[group][key] == config['search_space'][group][key], f'input search_space_dict["{group}"]["{key}"] contains a different value than existing config["search_space"]["{group}"]["{key}"]'


        _print_message('Loaded existing config file from {}'.format(config_path))
        _print_message('Hyperparameters and search space are consistent with the input parameters')
        # print('Loaded existing config file from {}'.format(config_path))
        # print('Hyperparameters and search space are consistent with the input parameters')

    return config
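
The consistency checks above rely on set-like comparisons of dict keys. A minimal sketch of that idea follows; `is_consistent` and the sample dicts are illustrative assumptions. It also shows one behaviour worth keeping in mind when comparing against a config reloaded from JSON: tuples are written as JSON arrays and come back as lists.

```python
import json

# Hypothetical stored config (searched keys removed) and freshly built parameters
stored = {"n_components": 5, "metric": "cosine"}
current = {"n_neighbors": 15, "n_components": 5, "metric": "cosine"}


def is_consistent(stored_params, current_params):
    """True if the stored keys are a subset of the current keys and shared values agree."""
    if not stored_params.keys() <= current_params.keys():
        return False                     # the stored config has keys the input lacks
    shared = stored_params.keys() & current_params.keys()
    return all(stored_params[k] == current_params[k] for k in shared)


print(is_consistent(stored, current))

# JSON has no tuple type, so a tuple value returns as a list after a round trip
round_tripped = json.loads(json.dumps({"ngram_range": (1, 1)}))
print(round_tripped["ngram_range"] == (1, 1))
```

Because a list never compares equal to a tuple, a tuple-valued setting such as `ngram_range` may need converting before it is compared with a reloaded config.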

In [18]:
def _init_result_dict(result_path: Path, monitor_type:str):
        
    if not result_path.exists():
        result = {}

        result['best_metric'] = -float('inf')
        result['best_model_checkpoint'] = ""
        result['best_hyperparameters'] = dict()
        result["monitor_type"] = monitor_type
        result["log_history"] = list()
        
    else:
        with open(result_path, 'r') as f:
            result = json.load(f)

        assert result['monitor_type'] == monitor_type

        _print_message('Loaded existing result file from {}'.format(result_path))
        # print('Loaded existing result file from {}'.format(result_path))
    
    return result

In [19]:
from bertopic_utils import _get_topics, _get_topic_word_matrix, _get_topic_document_matrix

# their implementation is moved to utils script as it may be used in eval script.



In [20]:
from bertopic_utils import _load_bertopic_model

# the _load_bertopic_model is moved to utils as it is also used in eval script.

In [21]:
class Dimensionality:
  """ Use this for pre-calculated reduced embeddings """
  def __init__(self, reduced_embeddings):
    self.reduced_embeddings = reduced_embeddings

  def fit(self, X):
    return self

  def transform(self, X):
    return self.reduced_embeddings
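
The pass-through idea behind the class above can be illustrated without any clustering library. In this sketch, `PrecomputedReducer` mirrors the class above, while `run_pipeline` is a hypothetical stand-in for a pipeline that calls `fit` and then `transform` on its reduction step.

```python
class PrecomputedReducer:
    """Stand-in reducer that ignores its input and returns stored results."""

    def __init__(self, reduced):
        self.reduced = reduced

    def fit(self, X):
        return self              # nothing to learn

    def transform(self, X):
        return self.reduced      # X is ignored entirely


def run_pipeline(reducer, embeddings):
    """Hypothetical pipeline step: fit the reducer, then transform the same data."""
    reducer.fit(embeddings)
    return reducer.transform(embeddings)


embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # made-up high-dimensional vectors
precomputed = [[1.0], [2.0]]                      # made-up reduced vectors
result = run_pipeline(PrecomputedReducer(precomputed), embeddings)
print(result)
```

Since `transform` ignores its argument, the stored result is only meaningful when the documents passed to the pipeline are the same ones, in the same order, that produced the precomputed reduction.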

In [22]:
from gensim import corpora

from sentence_transformers import SentenceTransformer

import collections
from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

from bertopic import BERTopic
from bertopic.cluster import BaseCluster
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP               # for CPU only
from hdbscan import HDBSCAN         # for CPU only
import cuml                         # for GPU


def model_search(X, hyperparameters:dict, search_space:dict, save_folder:Path, dataset_path:Path,
                 additional_stopwords:list[str]=None, cuml_accl:bool=False,
                metrics:list[METRICS]=[METRICS.C_NPMI], monitor:METRICS=METRICS.C_NPMI, 
                save_each_models=True, run_from_checkpoints=False,
                search_behaviour=SEARCH_BEHAVIOUR.GRID_SEARCH, search_rs=42, search_n_iter=10):
    
    config_json_path = save_folder.joinpath('config.json')
    result_json_path = save_folder.joinpath('result.json')

    if monitor not in metrics:
        raise Exception('monitor is not in metrics. Please modify the metrics passed in.')

    if run_from_checkpoints:
        if not save_folder.exists():
            _print_message('Save folder:' + str(save_folder.resolve()) + ' does not exist. Function terminates.')
            # print('Save folder:' + str(save_folder.resolve()) + ' does not exist. Function terminates.')
            raise Exception('No checkpoints found. Function terminates.')
        
        # check for existing configs
        if not config_json_path.exists():
            raise Exception('No config.json found. Function terminates.')
        
        # check for existing results
        if not result_json_path.exists():
            _print_message('No result.json is found. Assuming no existing checkpoints.')
            # print('no result.json is found. Assuming no existing checkpoints.')
    else:
        if save_folder.exists():
            raise Exception('Checkpoints found. Please delete the checkpoints or set run_from_checkpoints=True. Function terminates.')

    if not save_folder.exists():
        save_folder.mkdir()

    config = _init_config_dict(config_json_path, 'bertopic', dataset_path, hyperparameters, search_space, 
                               metrics, monitor, search_behaviour, search_rs, search_n_iter)
    
    result = _init_result_dict(result_json_path, monitor.value)

    _print_message('Search folder: {}'.format(save_folder))

    # init
    best_model_path = result['best_model_checkpoint']
    best_metric_score = result['best_metric']
    best_model = _load_bertopic_model(Path(best_model_path)) if best_model_path != "" else None
    best_hyperparameters = result['best_hyperparameters']

    _print_message('Best model checkpoint: {}'.format(best_model_path))
    _print_message('Best metric score: {}'.format(best_metric_score))
    _print_message('Best model: {}'.format(best_model))

    # print(f'Best model checkpoint: {best_model_path}')
    # print(f'Best metric score: {best_metric_score}')
    # print(f'Best model: {best_model}')

    # search
    # create a temporary dict for initiating the search space by sklearn parameter grid / parameter sampler
    temp_search_space = {}
    for k, v in search_space.items():
        for kk, vv in v.items():
            temp_search_space[k + '__' + kk] = vv

    if search_behaviour == SEARCH_BEHAVIOUR.GRID_SEARCH:
        search_iterator = ParameterGrid(temp_search_space)
    elif search_behaviour == SEARCH_BEHAVIOUR.RANDOM_SEARCH:
        search_iterator = ParameterSampler(temp_search_space, n_iter=search_n_iter, random_state=search_rs)

    print('\n')

    for search_space_dict in search_iterator:
        # unwrap the search space dict

        model_name = ''

        _sbert_params = {}
        _vocab_tokenizer_params = {}
        _umap_params = {}
        _hdbscan_params = {}
        _bertopic_params = {}

        for k, v in search_space_dict.items():
            if k.startswith('sbert_params'):
                _sbert_params[k.split('__')[1]] = v
                model_name += 'sb_' + k.split('__')[1] + '_' + str(v) + '_'
            elif k.startswith('vocab_tokenizer_params'):
                _vocab_tokenizer_params[k.split('__')[1]] = v
                model_name += 'vt_' + k.split('__')[1] + '_' + str(v) + '_'
            elif k.startswith('umap_params'):
                _umap_params[k.split('__')[1]] = v
                model_name += 'um_' + k.split('__')[1] + '_' + str(v) + '_'
            elif k.startswith('hdbscan_params'):
                _hdbscan_params[k.split('__')[1]] = v
                model_name += 'hs_' + k.split('__')[1] + '_' + str(v) + '_'
            elif k.startswith('bertopic_params'):
                _bertopic_params[k.split('__')[1]] = v
                model_name += 'bt_' + k.split('__')[1] + '_' + str(v) + '_'
            else:
                raise Exception('Unknown key: {}'.format(k))
            
        model_name = model_name[:-1]       # remove the last '_'

        # create the model path to save the model
        model_path = save_folder.joinpath(config['model'] + '_' + model_name)

        # check whether the model exists
        if model_path.exists():
            _print_message(f'Skipping current search space: {search_space_dict}')
            # print(f'Skipping current search space: {search_space_dict}')
            continue

        ##########
        # Training starts
        ##########

        _print_message(f'Current search space: {search_space_dict}')
        # print(f'Current search space: {search_space_dict}')

        sbert_params = deepcopy(config['sbert_params'])     # deepcopy just for data safety (not messing up with the original config dict)
        vocab_tokenizer_params = deepcopy(config['vocab_tokenizer_params'])
        umap_params = deepcopy(config['umap_params'])
        hdbscan_params = deepcopy(config['hdbscan_params'])
        bertopic_params = deepcopy(config['bertopic_params'])

        sbert_params.update(_sbert_params)
        vocab_tokenizer_params.update(_vocab_tokenizer_params)
        umap_params.update(_umap_params)
        hdbscan_params.update(_hdbscan_params)
        bertopic_params.update(_bertopic_params)

        # create embeddings
        if platform.system() == 'Linux' or platform.system() == 'Windows':
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        else:
            device = torch.device('mps')        # m-series machine
        
        sent_transformers = SentenceTransformer(**sbert_params,
                                                device=device)
        
        # load existing embeddings in the search folder to reuse the embeddings
        embeddings_path = save_folder.joinpath(f'embeddings_{sbert_params["model_name_or_path"]}.pkl')
        if embeddings_path.exists():
            with open(embeddings_path, 'rb') as f:
                embeddings = np.load(f)

            _print_message(f'Found existing sbert embeddings at {embeddings_path}. Reusing them.')
            # print(f'Found existing sbert embeddings at {embeddings_path}. Reusing them.')
        else:
            embeddings = sent_transformers.encode(X, show_progress_bar=True, batch_size=64)
            with open(embeddings_path, 'wb') as f:
                np.save(f, embeddings)
            
            _print_message(f'Saved sbert embeddings at: {embeddings_path}')
            # print('Saved sbert embeddings at:', embeddings_path)


        # Structure of the BERTopic
        # (Clustering: Topic Creation)
        # 1. SBERT to create document embeddings
        # 2. UMAP to reduce dimensionality
        # 3. HDBSCAN to cluster embeddings
        # (Representation: Label topics)
        # 4. CountVectorizer to tokenize words
        # 5. c-TF-IDF to weight the words and select the most important words

        # prepare the vocabulary (for c-TF-IDF) before training
        vocab = collections.Counter()
        tokenizer = CountVectorizer().build_tokenizer()
        for doc in tqdm(X):
            vocab.update(tokenizer(doc))
        vocab = [word for word, frequency in vocab.items() if frequency >= vocab_tokenizer_params['n_frequency']]       # set the minimum frequency to reduce the vocabulary size
        _print_message('Number of vocabulary: {}'.format(len(vocab)))

        del vocab_tokenizer_params['n_frequency']       # not used in the vectorizer model for training
        vocab_tokenizer_params['ngram_range'] = tuple(vocab_tokenizer_params['ngram_range'])       # convert list to tuple

        # prepare the sub models of BERTopic
        embedding_model = SentenceTransformer(**sbert_params)       # use the model as the embedding model

        # tokenize the words (for representation)
        # maybe we can do more pre-processing to the CV vocab to eliminate more words
        # like in LLM-take: remove common adj
        vectorizer_model = CountVectorizer(
            vocabulary=vocab, 
            stop_words="english" if additional_stopwords is None else list(ENGLISH_STOP_WORDS.union(additional_stopwords)),
            analyzer='word',
            **vocab_tokenizer_params)              # for computing c-tfidf (first creating a count matrix, then let c-tfidf to calculate the c-tfidf representation)

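        # BERTopic produces an extra topic for outliers, so ask for one more topic;
        # skip this when nr_topics is None or 'auto', where adding 1 would raise a TypeError
        if isinstance(bertopic_params['nr_topics'], int):
            bertopic_params['nr_topics'] += 1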

        # using cuml for faster training
        if cuml_accl:
            umap_model = cuml.manifold.UMAP(**umap_params, verbose=True)
            # TODO: save the reduced embeddings for reuse
            reduced_embeddings = umap_model.fit_transform(embeddings)

            hdbscan_model = cuml.cluster.hdbscan.HDBSCAN(**hdbscan_params, gen_min_span_tree=True)
            # clusters = hdbscan_model.fit_predict(reduced_embeddings).labels_
            clusters = hdbscan_model.fit_predict(reduced_embeddings)

            # Fit BERTopic without actually performing any clustering:
            # the pre-computed reduced embeddings and cluster labels (y=clusters) are passed in instead
            topic_model = BERTopic(
                    **bertopic_params,
                    embedding_model=embedding_model,
                    umap_model=Dimensionality(reduced_embeddings),
                    hdbscan_model=BaseCluster(),
                    vectorizer_model=vectorizer_model,
                    verbose=True
            ).fit(X, embeddings=embeddings, y=clusters)

            topics, probs = topic_model.transform(X, embeddings=embeddings)

            # print(topics)
            # print(topics.shape)
            # print(probs)
            # print(probs.shape)

        else:
            umap_model = UMAP(**umap_params, verbose=True)       # set random_state for reproducibility
            hdbscan_model = HDBSCAN(**hdbscan_params, gen_min_span_tree=True)
            

            topic_model = BERTopic(**bertopic_params,
                embedding_model=embedding_model, 
                vectorizer_model=vectorizer_model,
                umap_model=umap_model, 
                hdbscan_model=hdbscan_model,
                # calculate_probabilities=True,     # already in bertopic_params
                verbose=True)
        
            topics, probs = topic_model.fit_transform(X, embeddings=embeddings)

        ##########
        # Training ends
        ##########

        ##########
        # Evaluation starts
        ##########

        # init data for gensim coherence model
        topic_words, empty_topic_idxs = _get_topics(topic_model)

        documents = pd.DataFrame({"Document": X,
                                "ID": range(len(X)),
                                "Topic": topics})

        # remove documents assigned to topics flagged as empty by _get_topics (empty_topic_idxs)
        documents = documents[~documents['Topic'].isin(empty_topic_idxs)]

        documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
        cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

        bertopic_vectorizer = topic_model.vectorizer_model
        bertopic_analyzer = bertopic_vectorizer.build_analyzer()

        words = bertopic_vectorizer.get_feature_names_out()
        tokens = [bertopic_analyzer(doc) for doc in cleaned_docs]
        dictionary = corpora.Dictionary(tokens)
        corpus = [dictionary.doc2bow(token) for token in tokens]

        _print_message('Computing evaluation metrics')
        # print('Computing evaluation metrics')

        
        # topn = bertopic_params['top_n_words']
    
        # init octis format result for convenience
        result_octis = {}
        result_octis['topics'] = topic_words
        result_octis['topic-word-matrix'] = _get_topic_word_matrix(topic_model, empty_topic_idxs)
        result_octis['topic-document-matrix'] = _get_topic_document_matrix(probs, empty_topic_idxs)


        metrics_score = dict()

        for metric in metrics:
            if metric in COHERENCE_MODEL_METRICS:
                # compute the coherence
                coherencemodel = CoherenceModel(
                    topics=topic_words, 
                    texts=tokens, 
                    corpus=corpus, 
                    dictionary=dictionary, 
                    topn=10, 
                    coherence=metric.value, 
                    processes=3
                )
                score = coherencemodel.get_coherence()              

            elif metric == METRICS.TOPIC_DIVERSITY:
                # compute the topic diversity
                score = compute_topic_diversity(result_octis, topk=10)

            elif metric == METRICS.INVERTED_RBO:
                # compute the inverted rank-biased overlap (RBO)
                score = compute_inverted_rbo(result_octis, topk=10)

            elif metric == METRICS.PAIRWISE_JACCARD_SIMILARITY:
                # compute the pairwise Jaccard similarity
                score = compute_pairwise_jaccard_similarity(result_octis, topk=10)

            else:
                raise ValueError(f'Unknown metric: {metric.value}')
            
            metrics_score[metric.value] = score

            _print_message(f'Evaluation metric ({metric.value}): {score}')
            # print(f'Evaluation metric ({metric.value}): {score}')
            
        # get the monitor score
        monitor_score = metrics_score[monitor.value]

        ##########
        # Evaluation ends
        ##########
            
        ##########
        # Save models
        ##########
            
        model_path.mkdir(parents=True, exist_ok=True)       # no-op if the folder already exists

        if save_each_models:
            topic_model.save(
                path = model_path,
                serialization="safetensors",
                save_ctfidf=True,
                save_embedding_model=sbert_params['model_name_or_path']
            )

            _print_message('Model saved at: {}'.format(model_path))
            # print('Model saved at:', model_path)

        ##########
        # Save models ends
        ##########

        ###########
        # Update result dict and json file
        ###########
        
        # rebuild the model_hyperparameters dict
        model_hyperparameters = {
            'sbert_params': sbert_params,
            'vocab_tokenizer_params': vocab_tokenizer_params,
            'umap_params': umap_params,
            'hdbscan_params': hdbscan_params,
            'bertopic_params': bertopic_params
        }

        if monitor_score > best_metric_score:
            best_metric_score = monitor_score
            best_model = topic_model
            best_model_path = model_path
            best_hyperparameters = model_hyperparameters

        model_log_history = dict()
        model_log_history.update(metrics_score)         # add the metrics score values to the log history
        model_log_history['model_name'] = model_name
        model_log_history['hyperparameters'] = model_hyperparameters

        result['best_metric'] = best_metric_score
        result['best_model_checkpoint'] = str(best_model_path)      # relative path
        result['best_hyperparameters'] = best_hyperparameters
        result["log_history"].append(model_log_history)

        # print(result)

        # save result
        with open(result_json_path, 'w') as f:
            json.dump(result, f, indent=2)

        _print_message('Saved result.json at: {}'.format(result_json_path))
        # print("Saved result.json at:", result_json_path)
        print('\n\n')
    
    _print_message('Search ends')
    # print('Search ends')
    return best_model, best_model_path, best_hyperparameters
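
Two of the diversity-style metrics logged by the evaluation loop above can be illustrated with a minimal sketch. It assumes the usual definitions: topic diversity is the share of distinct words among the top-k words of all topics, and pairwise Jaccard similarity is the mean Jaccard index over all topic pairs. The notebook's own `compute_topic_diversity` and `compute_pairwise_jaccard_similarity` helpers are defined elsewhere and may differ in detail; the function names and toy topics below are hypothetical.

```python
from itertools import combinations


def topic_diversity(topics, topk=10):
    """Share of distinct words among the top-k words of all topics."""
    top_words = [word for topic in topics for word in topic[:topk]]
    return len(set(top_words)) / len(top_words)


def pairwise_jaccard(topics, topk=10):
    """Mean Jaccard index over all unordered pairs of topics (needs >= 2 topics)."""
    sets = [set(topic[:topk]) for topic in topics]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Hypothetical toy topics, not taken from the trained model.
toy_topics = [
    ["story", "characters", "ending", "plot"],
    ["story", "music", "soundtrack", "art"],
    ["bugs", "crash", "fps", "lag"],
]

diversity = topic_diversity(toy_topics, topk=4)   # 11 distinct words out of 12
jaccard = pairwise_jaccard(toy_topics, topk=4)    # only the first two topics overlap
print(diversity, jaccard)
```

Under these definitions a higher diversity and a lower Jaccard similarity both indicate less word overlap between topics.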


In [23]:
# load the custom stopword lists stored as txt files in the dataset folder
from pathlib import Path

custom_stopwords_path = Path('../../dataset/topic_modelling/stopwords.txt')
custom_stopwords_games_path = Path('../../dataset/topic_modelling/stopwords_games.txt')
game_name_list_path = Path('../../dataset/topic_modelling/game_name_list.txt')

with open(custom_stopwords_path, 'r', encoding='utf-8') as f:
    custom_stopwords = f.read().splitlines()

with open(custom_stopwords_games_path, 'r', encoding='utf-8') as f:
    custom_stopwords_games = f.read().splitlines()

with open(game_name_list_path, 'r', encoding='utf-8') as f:
    game_name_list = f.read().splitlines()

# also include the stopword list from nltk
from nltk.corpus import stopwords
nltk_stopwords = stopwords.words('english')

custom_stopwords = custom_stopwords + custom_stopwords_games + game_name_list + nltk_stopwords
custom_stopwords = list(filter(lambda x: len(x) > 0, custom_stopwords))     # remove empty strings

custom_stopwords = set(custom_stopwords)

print(custom_stopwords)
print(len(custom_stopwords))


155930
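
One thing worth checking on the merged list: with its default `lowercase=True`, scikit-learn's `CountVectorizer` lowercases the text before stop words are filtered, so an entry with capitals or stray whitespace never matches a token. The sketch below shows one possible normalisation step, assuming the text files may contain such entries. The function name and sample entries are hypothetical, and the step is not part of the original cell.

```python
def normalise_stopwords(*word_lists):
    """Merge stopword lists, lowercasing entries and dropping blank ones."""
    merged = set()
    for words in word_lists:
        for word in words:
            cleaned = word.strip().lower()
            if cleaned:
                merged.add(cleaned)
    return merged


# Hypothetical entries standing in for the contents of the txt files.
generic = ["game", "Play", "", "  "]
game_names = ["Terraria", "terraria ", "Hollow Knight"]

normalised = normalise_stopwords(generic, game_names)
print(sorted(normalised))
```

Multi-word entries such as the hypothetical "hollow knight" still would not match single-word tokens when `ngram_range` is (1, 1), so game names with several words may need separate handling.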


In [24]:
# grid search / random search

min_cluster_size = max(len(X) // 1000, 100)       # 0.1% of the dataset size, but at least 100
# clamp len(X) // 10000 to the range [20, 100]: a larger n_neighbors gives a more global view of the
# embedding structure, so the upper limit avoids a too global view (np.median returns a float)
n_neighbors = np.median([20, len(X) // 1000 // 10, 100])

print(f'min_cluster_size: {min_cluster_size}')
print(f'n_neighbors: {n_neighbors}')
print('\n\n')

# the ratio between min_cluster_size and n_neighbors is important for the performance of HDBSCAN
# it affects how much noise is filtered out during clustering

# hyperparameters
sbert_params = _init_sentence_transformers_params(model_name_or_path='all-MiniLM-L6-v2')
vocab_tokenizer_params = _init_vocab_tokenizer_params(n_frequency=70, ngram_range=[1, 1])       # pass ngram_range as list for type-value check against config.json
umap_params = _init_umap_params(n_neighbors=n_neighbors, n_components=5, metric='cosine', min_dist=0.1, n_epochs=None, low_memory=False)
hdbscan_params = _init_hdbscan_params(
    min_cluster_size=min_cluster_size, 
    min_samples=5,                      # how conservative the clustering is: the larger min_samples, the more points are declared noise and the more clusters are restricted to progressively denser areas
    metric='euclidean', 
    prediction_data=True
)
bertopic_params = _init_bertopic_params(
    nr_topics=20, 
    top_n_words=10,         # keep at 10: the evaluation code hard-codes topn=10 / topk=10
    calculate_probabilities=True)

# check the demo script for how to define the search space with more parameters

search_space_dict = {
    'bertopic_params':{
        'nr_topics': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],     # number of topics
    }
}

dataset_path_config = dataset_path.relative_to(dataset_path.parent.parent.parent.parent)

search_behaviour = SEARCH_BEHAVIOUR.GRID_SEARCH

training_datetime = datetime.now()
# training_datetime = datetime(2024, 1, 27, 0, 25, 24)
training_folder = Path(f'bertopic_genre_{str(genre)}_{search_behaviour.value}_{training_datetime.strftime("%Y%m%d_%H%M%S")}')

best_model, best_model_path, best_hyperparameters = model_search(
    X,
    hyperparameters={
        'sbert_params': sbert_params,
        'vocab_tokenizer_params': vocab_tokenizer_params,
        'umap_params': umap_params,
        'hdbscan_params': hdbscan_params,
        'bertopic_params': bertopic_params
    },
    search_space=search_space_dict,
    save_folder=training_folder,
    dataset_path=dataset_path_config,
    additional_stopwords=custom_stopwords, cuml_accl=True,
    metrics=[METRICS.C_NPMI, METRICS.C_V, METRICS.UMASS, METRICS.C_UCI, METRICS.TOPIC_DIVERSITY, METRICS.INVERTED_RBO, METRICS.PAIRWISE_JACCARD_SIMILARITY],
    monitor=METRICS.C_NPMI,
    save_each_models=True,
    run_from_checkpoints=False,
    search_behaviour=search_behaviour,
    # search_rs=42,
    # search_n_iter=80
)

min_cluster_size: 718
n_neighbors: 71.0
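
The two printed values can be reproduced by hand. The sketch below takes the document count of 718,311 from the vocabulary progress bar in this output and swaps in an explicit `clamp` helper (hypothetical, not in the notebook) for the median trick. For a valid range the two are equivalent, but the clamp keeps an integer, whereas `np.median` returns a float, which is why `71.0` is printed.

```python
from statistics import median


def clamp(value, low, high):
    """Restrict value to the closed range [low, high]."""
    return min(max(value, low), high)


n_docs = 718_311  # document count shown by the progress bars in the log

min_cluster_size = max(n_docs // 1000, 100)        # 0.1% of the corpus, at least 100
candidate = n_docs // 1000 // 10                   # 0.01% of the corpus
n_neighbors_median = median([20, candidate, 100])  # median of three values acts as a clamp
n_neighbors_clamped = clamp(candidate, 20, 100)

print(min_cluster_size, n_neighbors_median, n_neighbors_clamped)
```

For a much smaller corpus the candidate would fall below 20 and be raised to 20; for a much larger one it would be capped at 100.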



2024-02-07 18:47:39,133 - Created config file at bertopic_genre_indie_grid_search_20240207_184739/config.json
2024-02-07 18:47:39,133 - Search folder: bertopic_genre_indie_grid_search_20240207_184739
2024-02-07 18:47:39,133 - Best model checkpoint: 
2024-02-07 18:47:39,133 - Best metric score: -inf
2024-02-07 18:47:39,133 - Best model: None


2024-02-07 18:47:39,133 - Current search space: {'bertopic_params__nr_topics': 10}
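
The `Current search space` lines in this log use flattened `group__param` keys. As a rough sketch of how a nested search space could be expanded into such grid points, the code below uses `itertools.product`. This is an assumption about the general technique, not the actual `model_search` implementation, which is not shown in this section; `expand_grid` and the second search axis are hypothetical.

```python
from itertools import product


def expand_grid(search_space):
    """Yield one flat {'group__param': value} dict per grid point."""
    keys, value_lists = [], []
    for group, params in search_space.items():
        for name, values in params.items():
            keys.append(f"{group}__{name}")
            value_lists.append(values)
    for combo in product(*value_lists):
        yield dict(zip(keys, combo))


example_space = {
    "bertopic_params": {"nr_topics": [10, 20, 30]},
    "hdbscan_params": {"min_samples": [5, 10]},  # hypothetical second axis
}

grid = list(expand_grid(example_space))
for point in grid:
    print(point)
```

With a single axis, as in the cell above, the grid reduces to one point per `nr_topics` value.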


Batches: 100%|██████████| 11224/11224 [03:22<00:00, 55.53it/s] 


2024-02-07 18:51:10,752 - Saved sbert embeddings at: bertopic_genre_indie_grid_search_20240207_184739/embeddings_all-MiniLM-L6-v2.pkl


100%|██████████| 718311/718311 [00:06<00:00, 103580.51it/s]


2024-02-07 18:51:17,708 - Number of vocabulary: 16141
[I] [18:51:17.820307] Unused keyword parameter: low_memory during cuML estimator initialization
[D] [18:51:18.760860] /__w/cuml/cuml/cpp/src/umap/runner.cuh:108 n_neighbors=71
[D] [18:51:18.764122] /__w/cuml/cuml/cpp/src/umap/runner.cuh:130 Calling knn graph run
[D] [18:51:36.915575] /__w/cuml/cuml/cpp/src/umap/runner.cuh:136 Done. Calling fuzzy simplicial set
[D] [18:51:36.952385] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:317 Smooth kNN Distances
[D] [18:51:36.962998] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:319 sigmas = [ 0.0232351, 0.126921, 0.157377, 0.0236554, 0.0432843, 0.0607038, 0.0157364, 0.0694603, 0.122134, 0.0316501, 0.173924, 0.0179468, 0.0101839, 0.0269362, 0.13557, 0.0210735, 0.0190689, 0.0185764, 0.0566198, 0.114179, 0.0285895, 0.131609, 0.140324, 0.120669, 0.0138571 ]

[D] [18:51:36.963067] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:321 rhos = [ 0.259338, 1.19209e-07, 1.78814e-07,

2024-02-07 18:52:56,354 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-07 18:52:56,355 - BERTopic - Dimensionality - Completed ✓
2024-02-07 18:52:56,364 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-07 18:52:56,430 - BERTopic - Cluster - Completed ✓
2024-02-07 18:52:56,430 - BERTopic - Representation - Extracting topics from clusters using representation models.
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 18:53:17,830 - BERTopic - Representation - Completed ✓
2024-02-07 18:53:17,834 - BERTopic - Topic reduction - Reducing number of topics
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 18:53:38,183 - BERTopic - Topic reduction - Reduced number of topics from 127 to 11
2024-02-07 18:53:38,442 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


2024-02-07 18:53:53,466 - Computing evaluation metrics
2024-02-07 18:54:45,618 - Evaluation metric (c_npmi): 0.11232390851738454
2024-02-07 18:56:17,034 - Evaluation metric (c_v): 0.5611621136478552
2024-02-07 18:56:17,085 - Evaluation metric (u_mass): -0.07911309442700079
2024-02-07 18:57:09,099 - Evaluation metric (c_uci): 1.082934792714111
2024-02-07 18:57:09,099 - Evaluation metric (topic_diversity): 0.76
2024-02-07 18:57:09,101 - Evaluation metric (inverted_rbo): 0.9227554142914286
2024-02-07 18:57:09,101 - Evaluation metric (pairwise_jaccard_similarity): 0.06356004390060122
2024-02-07 18:57:09,304 - Model saved at: bertopic_genre_indie_grid_search_20240207_184739/bertopic_bt_nr_topics_10
2024-02-07 18:57:09,305 - Saved result.json at: bertopic_genre_indie_grid_search_20240207_184739/result.json



2024-02-07 18:57:09,305 - Current search space: {'bertopic_params__nr_topics': 20}
2024-02-07 18:57:09,738 - Found existing sbert embeddings at bertopic_genre_indie_grid_search_20240207

100%|██████████| 718311/718311 [00:07<00:00, 96383.89it/s] 


2024-02-07 18:57:17,221 - Number of vocabulary: 16141
[I] [18:57:17.335648] Unused keyword parameter: low_memory during cuML estimator initialization
[D] [18:57:17.444752] /__w/cuml/cuml/cpp/src/umap/runner.cuh:108 n_neighbors=71
[D] [18:57:17.447925] /__w/cuml/cuml/cpp/src/umap/runner.cuh:130 Calling knn graph run
[D] [18:57:35.652079] /__w/cuml/cuml/cpp/src/umap/runner.cuh:136 Done. Calling fuzzy simplicial set
[D] [18:57:35.675522] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:317 Smooth kNN Distances
[D] [18:57:35.684543] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:319 sigmas = [ 0.0232351, 0.126921, 0.157377, 0.0236554, 0.0432843, 0.0607038, 0.0157364, 0.0694603, 0.122134, 0.0316501, 0.173924, 0.0179468, 0.0101839, 0.0269362, 0.13557, 0.0210735, 0.0190689, 0.0185764, 0.0566198, 0.114179, 0.0285895, 0.131609, 0.140324, 0.120669, 0.0138571 ]

[D] [18:57:35.684593] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:321 rhos = [ 0.259338, 1.19209e-07, 1.78814e-07,

2024-02-07 18:58:48,224 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-07 18:58:48,225 - BERTopic - Dimensionality - Completed ✓
2024-02-07 18:58:48,237 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-07 18:58:48,305 - BERTopic - Cluster - Completed ✓
2024-02-07 18:58:48,306 - BERTopic - Representation - Extracting topics from clusters using representation models.
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 18:59:10,262 - BERTopic - Representation - Completed ✓
2024-02-07 18:59:10,266 - BERTopic - Topic reduction - Reducing number of topics
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 18:59:31,807 - BERTopic - Topic reduction - Reduced number of topics from 128 to 21
2024-02-07 18:59:32,206 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


2024-02-07 18:59:50,189 - Computing evaluation metrics
2024-02-07 19:00:53,774 - Evaluation metric (c_npmi): 0.10881086636145143
2024-02-07 19:03:06,921 - Evaluation metric (c_v): 0.5560084326291146
2024-02-07 19:03:06,959 - Evaluation metric (u_mass): -0.07526362330197989
2024-02-07 19:04:07,072 - Evaluation metric (c_uci): 0.9415684930362863
2024-02-07 19:04:07,072 - Evaluation metric (topic_diversity): 0.75
2024-02-07 19:04:07,079 - Evaluation metric (inverted_rbo): 0.9485195574297368
2024-02-07 19:04:07,079 - Evaluation metric (pairwise_jaccard_similarity): 0.046375355312946884
2024-02-07 19:04:07,281 - Model saved at: bertopic_genre_indie_grid_search_20240207_184739/bertopic_bt_nr_topics_20
2024-02-07 19:04:07,281 - Saved result.json at: bertopic_genre_indie_grid_search_20240207_184739/result.json



2024-02-07 19:04:07,281 - Current search space: {'bertopic_params__nr_topics': 30}
2024-02-07 19:04:07,757 - Found existing sbert embeddings at bertopic_genre_indie_grid_search_202402

100%|██████████| 718311/718311 [00:07<00:00, 96681.06it/s] 


2024-02-07 19:04:15,217 - Number of vocabulary: 16141
[I] [19:04:15.346747] Unused keyword parameter: low_memory during cuML estimator initialization
[D] [19:04:15.427254] /__w/cuml/cuml/cpp/src/umap/runner.cuh:108 n_neighbors=71
[D] [19:04:15.430745] /__w/cuml/cuml/cpp/src/umap/runner.cuh:130 Calling knn graph run
[D] [19:04:32.375520] /__w/cuml/cuml/cpp/src/umap/runner.cuh:136 Done. Calling fuzzy simplicial set
[D] [19:04:32.399427] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:317 Smooth kNN Distances
[D] [19:04:32.408780] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:319 sigmas = [ 0.0232351, 0.126921, 0.157377, 0.0236554, 0.0432843, 0.0607038, 0.0157364, 0.0694603, 0.122134, 0.0316501, 0.173924, 0.0179468, 0.0101839, 0.0269362, 0.13557, 0.0210735, 0.0190689, 0.0185764, 0.0566198, 0.114179, 0.0285895, 0.131609, 0.140324, 0.120669, 0.0138571 ]

[D] [19:04:32.408823] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:321 rhos = [ 0.259338, 1.19209e-07, 1.78814e-07,

2024-02-07 19:05:45,834 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-07 19:05:45,836 - BERTopic - Dimensionality - Completed ✓
2024-02-07 19:05:45,848 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-07 19:05:45,915 - BERTopic - Cluster - Completed ✓
2024-02-07 19:05:45,916 - BERTopic - Representation - Extracting topics from clusters using representation models.
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 19:06:08,205 - BERTopic - Representation - Completed ✓
2024-02-07 19:06:08,209 - BERTopic - Topic reduction - Reducing number of topics
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 19:06:30,656 - BERTopic - Topic reduction - Reduced number of topics from 128 to 31
2024-02-07 19:06:31,225 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


2024-02-07 19:06:51,055 - Computing evaluation metrics
2024-02-07 19:07:56,365 - Evaluation metric (c_npmi): 0.1108721510851664
2024-02-07 19:10:30,974 - Evaluation metric (c_v): 0.5756843543763464
2024-02-07 19:10:31,068 - Evaluation metric (u_mass): -0.08372100561002495
2024-02-07 19:11:35,906 - Evaluation metric (c_uci): 0.8903396189183339
2024-02-07 19:11:35,907 - Evaluation metric (topic_diversity): 0.7366666666666667
2024-02-07 19:11:35,921 - Evaluation metric (inverted_rbo): 0.9664917161838423
2024-02-07 19:11:35,922 - Evaluation metric (pairwise_jaccard_similarity): 0.03132945051382109
2024-02-07 19:11:36,118 - Model saved at: bertopic_genre_indie_grid_search_20240207_184739/bertopic_bt_nr_topics_30
2024-02-07 19:11:36,119 - Saved result.json at: bertopic_genre_indie_grid_search_20240207_184739/result.json



2024-02-07 19:11:36,119 - Current search space: {'bertopic_params__nr_topics': 40}
2024-02-07 19:11:36,546 - Found existing sbert embeddings at bertopic_genre_indie_grid_s

100%|██████████| 718311/718311 [00:07<00:00, 100914.81it/s]


2024-02-07 19:11:43,690 - Number of vocabulary: 16141
[I] [19:11:43.816183] Unused keyword parameter: low_memory during cuML estimator initialization
[D] [19:11:43.895039] /__w/cuml/cuml/cpp/src/umap/runner.cuh:108 n_neighbors=71
[D] [19:11:43.897653] /__w/cuml/cuml/cpp/src/umap/runner.cuh:130 Calling knn graph run
[D] [19:12:00.862660] /__w/cuml/cuml/cpp/src/umap/runner.cuh:136 Done. Calling fuzzy simplicial set
[D] [19:12:00.886002] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:317 Smooth kNN Distances
[D] [19:12:00.895706] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:319 sigmas = [ 0.0232351, 0.126921, 0.157377, 0.0236554, 0.0432843, 0.0607038, 0.0157364, 0.0694603, 0.122134, 0.0316501, 0.173924, 0.0179468, 0.0101839, 0.0269362, 0.13557, 0.0210735, 0.0190689, 0.0185764, 0.0566198, 0.114179, 0.0285895, 0.131609, 0.140324, 0.120669, 0.0138571 ]

[D] [19:12:00.895805] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:321 rhos = [ 0.259338, 1.19209e-07, 1.78814e-07,

2024-02-07 19:13:13,214 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-07 19:13:13,216 - BERTopic - Dimensionality - Completed ✓
2024-02-07 19:13:13,227 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-07 19:13:13,292 - BERTopic - Cluster - Completed ✓
2024-02-07 19:13:13,293 - BERTopic - Representation - Extracting topics from clusters using representation models.
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 19:13:36,747 - BERTopic - Representation - Completed ✓
2024-02-07 19:13:36,752 - BERTopic - Topic reduction - Reducing number of topics
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 19:14:00,434 - BERTopic - Topic reduction - Reduced number of topics from 128 to 41
2024-02-07 19:14:01,188 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


2024-02-07 19:14:23,330 - Computing evaluation metrics
2024-02-07 19:15:32,815 - Evaluation metric (c_npmi): 0.1285706522200903
2024-02-07 19:18:41,427 - Evaluation metric (c_v): 0.6091610155817333
2024-02-07 19:18:41,511 - Evaluation metric (u_mass): -0.10484836929595787
2024-02-07 19:19:50,035 - Evaluation metric (c_uci): 1.0855417360245134
2024-02-07 19:19:50,035 - Evaluation metric (topic_diversity): 0.75
2024-02-07 19:19:50,060 - Evaluation metric (inverted_rbo): 0.9783572087979671
2024-02-07 19:19:50,060 - Evaluation metric (pairwise_jaccard_similarity): 0.020857810060596393
2024-02-07 19:19:50,279 - Model saved at: bertopic_genre_indie_grid_search_20240207_184739/bertopic_bt_nr_topics_40
2024-02-07 19:19:50,296 - Saved result.json at: bertopic_genre_indie_grid_search_20240207_184739/result.json



2024-02-07 19:19:50,296 - Current search space: {'bertopic_params__nr_topics': 50}
2024-02-07 19:19:50,791 - Found existing sbert embeddings at bertopic_genre_indie_grid_search_2024020

100%|██████████| 718311/718311 [00:07<00:00, 98177.50it/s] 


2024-02-07 19:19:58,138 - Number of vocabulary: 16141
[I] [19:19:58.252973] Unused keyword parameter: low_memory during cuML estimator initialization
[D] [19:19:58.335679] /__w/cuml/cuml/cpp/src/umap/runner.cuh:108 n_neighbors=71
[D] [19:19:58.338903] /__w/cuml/cuml/cpp/src/umap/runner.cuh:130 Calling knn graph run
[D] [19:20:15.185106] /__w/cuml/cuml/cpp/src/umap/runner.cuh:136 Done. Calling fuzzy simplicial set
[D] [19:20:15.210216] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:317 Smooth kNN Distances
[D] [19:20:15.219306] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:319 sigmas = [ 0.0232351, 0.126921, 0.157377, 0.0236554, 0.0432843, 0.0607038, 0.0157364, 0.0694603, 0.122134, 0.0316501, 0.173924, 0.0179468, 0.0101839, 0.0269362, 0.13557, 0.0210735, 0.0190689, 0.0185764, 0.0566198, 0.114179, 0.0285895, 0.131609, 0.140324, 0.120669, 0.0138571 ]

[D] [19:20:15.219416] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:321 rhos = [ 0.259338, 1.19209e-07, 1.78814e-07,

2024-02-07 19:21:29,983 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-07 19:21:29,984 - BERTopic - Dimensionality - Completed ✓
2024-02-07 19:21:29,995 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-07 19:21:30,061 - BERTopic - Cluster - Completed ✓
2024-02-07 19:21:30,062 - BERTopic - Representation - Extracting topics from clusters using representation models.
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 19:21:54,811 - BERTopic - Representation - Completed ✓
2024-02-07 19:21:54,813 - BERTopic - Topic reduction - Reducing number of topics
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 19:22:18,876 - BERTopic - Topic reduction - Reduced number of topics from 138 to 51
2024-02-07 19:22:19,696 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


2024-02-07 19:22:42,918 - Computing evaluation metrics
2024-02-07 19:23:55,626 - Evaluation metric (c_npmi): 0.13030024319422387
2024-02-07 19:27:55,988 - Evaluation metric (c_v): 0.6080040577986954
2024-02-07 19:27:56,068 - Evaluation metric (u_mass): -0.1010992031115662
2024-02-07 19:29:08,160 - Evaluation metric (c_uci): 1.1389737311724657
2024-02-07 19:29:08,160 - Evaluation metric (topic_diversity): 0.724
2024-02-07 19:29:08,200 - Evaluation metric (inverted_rbo): 0.9812312173445248
2024-02-07 19:29:08,201 - Evaluation metric (pairwise_jaccard_similarity): 0.01794025881029107
2024-02-07 19:29:08,404 - Model saved at: bertopic_genre_indie_grid_search_20240207_184739/bertopic_bt_nr_topics_50
2024-02-07 19:29:08,415 - Saved result.json at: bertopic_genre_indie_grid_search_20240207_184739/result.json



2024-02-07 19:29:08,415 - Current search space: {'bertopic_params__nr_topics': 60}
2024-02-07 19:29:08,851 - Found existing sbert embeddings at bertopic_genre_indie_grid_search_2024020

100%|██████████| 718311/718311 [00:07<00:00, 98373.08it/s] 


2024-02-07 19:29:16,193 - Number of vocabulary: 16141
[I] [19:29:16.330541] Unused keyword parameter: low_memory during cuML estimator initialization
[D] [19:29:16.410137] /__w/cuml/cuml/cpp/src/umap/runner.cuh:108 n_neighbors=71
[D] [19:29:16.413428] /__w/cuml/cuml/cpp/src/umap/runner.cuh:130 Calling knn graph run
[D] [19:29:34.727175] /__w/cuml/cuml/cpp/src/umap/runner.cuh:136 Done. Calling fuzzy simplicial set
[D] [19:29:34.749670] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:317 Smooth kNN Distances
[D] [19:29:34.760115] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:319 sigmas = [ 0.0232351, 0.126921, 0.157377, 0.0236554, 0.0432843, 0.0607038, 0.0157364, 0.0694603, 0.122134, 0.0316501, 0.173924, 0.0179468, 0.0101839, 0.0269362, 0.13557, 0.0210735, 0.0190689, 0.0185764, 0.0566198, 0.114179, 0.0285895, 0.131609, 0.140324, 0.120669, 0.0138571 ]

[D] [19:29:34.760156] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:321 rhos = [ 0.259338, 1.19209e-07, 1.78814e-07,

2024-02-07 19:30:47,136 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-07 19:30:47,138 - BERTopic - Dimensionality - Completed ✓
2024-02-07 19:30:47,149 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-07 19:30:47,217 - BERTopic - Cluster - Completed ✓
2024-02-07 19:30:47,217 - BERTopic - Representation - Extracting topics from clusters using representation models.
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 19:31:12,512 - BERTopic - Representation - Completed ✓
2024-02-07 19:31:12,514 - BERTopic - Topic reduction - Reducing number of topics
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 19:31:37,559 - BERTopic - Topic reduction - Reduced number of topics from 129 to 61
2024-02-07 19:31:38,594 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


2024-02-07 19:32:02,472 - Computing evaluation metrics
2024-02-07 19:33:18,900 - Evaluation metric (c_npmi): 0.1363765875778513
2024-02-07 19:38:06,915 - Evaluation metric (c_v): 0.6138765058461945
2024-02-07 19:38:06,997 - Evaluation metric (u_mass): -0.11740084431682321
2024-02-07 19:39:21,895 - Evaluation metric (c_uci): 1.208716236777324
2024-02-07 19:39:21,896 - Evaluation metric (topic_diversity): 0.7166666666666667
2024-02-07 19:39:21,954 - Evaluation metric (inverted_rbo): 0.9835360860724859
2024-02-07 19:39:21,955 - Evaluation metric (pairwise_jaccard_similarity): 0.01559363957695286
2024-02-07 19:39:22,155 - Model saved at: bertopic_genre_indie_grid_search_20240207_184739/bertopic_bt_nr_topics_60
2024-02-07 19:39:22,167 - Saved result.json at: bertopic_genre_indie_grid_search_20240207_184739/result.json



2024-02-07 19:39:22,168 - Current search space: {'bertopic_params__nr_topics': 70}
2024-02-07 19:39:22,597 - Found existing sbert embeddings at bertopic_genre_indie_grid_se

100%|██████████| 718311/718311 [00:07<00:00, 100108.47it/s]


2024-02-07 19:39:29,803 - Number of vocabulary: 16141
[I] [19:39:29.916126] Unused keyword parameter: low_memory during cuML estimator initialization
[D] [19:39:29.995530] /__w/cuml/cuml/cpp/src/umap/runner.cuh:108 n_neighbors=71
[D] [19:39:29.998362] /__w/cuml/cuml/cpp/src/umap/runner.cuh:130 Calling knn graph run
[D] [19:39:47.088028] /__w/cuml/cuml/cpp/src/umap/runner.cuh:136 Done. Calling fuzzy simplicial set
[D] [19:39:47.112027] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:317 Smooth kNN Distances
[D] [19:39:47.121700] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:319 sigmas = [ 0.0232351, 0.126921, 0.157377, 0.0236554, 0.0432843, 0.0607038, 0.0157364, 0.0694603, 0.122134, 0.0316501, 0.173924, 0.0179468, 0.0101839, 0.0269362, 0.13557, 0.0210735, 0.0190689, 0.0185764, 0.0566198, 0.114179, 0.0285895, 0.131609, 0.140324, 0.120669, 0.0138571 ]

[D] [19:39:47.121848] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:321 rhos = [ 0.259338, 1.19209e-07, 1.78814e-07,

2024-02-07 19:41:04,622 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-07 19:41:04,623 - BERTopic - Dimensionality - Completed ✓
2024-02-07 19:41:04,634 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-07 19:41:04,702 - BERTopic - Cluster - Completed ✓
2024-02-07 19:41:04,703 - BERTopic - Representation - Extracting topics from clusters using representation models.
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 19:41:29,007 - BERTopic - Representation - Completed ✓
2024-02-07 19:41:29,009 - BERTopic - Topic reduction - Reducing number of topics
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 19:41:52,680 - BERTopic - Topic reduction - Reduced number of topics from 133 to 71
2024-02-07 19:41:53,889 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


2024-02-07 19:42:16,490 - Computing evaluation metrics
2024-02-07 19:43:33,072 - Evaluation metric (c_npmi): 0.14697244249360084
2024-02-07 19:48:22,759 - Evaluation metric (c_v): 0.6384672798200367
2024-02-07 19:48:22,854 - Evaluation metric (u_mass): -0.14977486983680555
2024-02-07 19:49:39,348 - Evaluation metric (c_uci): 1.3272503148626627
2024-02-07 19:49:39,348 - Evaluation metric (topic_diversity): 0.7371428571428571
2024-02-07 19:49:39,425 - Evaluation metric (inverted_rbo): 0.988023241934123
2024-02-07 19:49:39,427 - Evaluation metric (pairwise_jaccard_similarity): 0.011623327184929462
2024-02-07 19:49:39,627 - Model saved at: bertopic_genre_indie_grid_search_20240207_184739/bertopic_bt_nr_topics_70
2024-02-07 19:49:39,641 - Saved result.json at: bertopic_genre_indie_grid_search_20240207_184739/result.json



2024-02-07 19:49:39,641 - Current search space: {'bertopic_params__nr_topics': 80}
2024-02-07 19:49:40,042 - Found existing sbert embeddings at bertopic_genre_indie_grid_

100%|██████████| 718311/718311 [00:07<00:00, 92285.83it/s] 


2024-02-07 19:49:47,869 - Number of vocabulary: 16141
[I] [19:49:47.994649] Unused keyword parameter: low_memory during cuML estimator initialization
[D] [19:49:48.072546] /__w/cuml/cuml/cpp/src/umap/runner.cuh:108 n_neighbors=71
[D] [19:49:48.075243] /__w/cuml/cuml/cpp/src/umap/runner.cuh:130 Calling knn graph run
[D] [19:50:04.934334] /__w/cuml/cuml/cpp/src/umap/runner.cuh:136 Done. Calling fuzzy simplicial set
[D] [19:50:04.958361] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:317 Smooth kNN Distances
[D] [19:50:04.968479] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:319 sigmas = [ 0.0232351, 0.126921, 0.157377, 0.0236554, 0.0432843, 0.0607038, 0.0157364, 0.0694603, 0.122134, 0.0316501, 0.173924, 0.0179468, 0.0101839, 0.0269362, 0.13557, 0.0210735, 0.0190689, 0.0185764, 0.0566198, 0.114179, 0.0285895, 0.131609, 0.140324, 0.120669, 0.0138571 ]

[D] [19:50:04.968521] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:321 rhos = [ 0.259338, 1.19209e-07, 1.78814e-07,

2024-02-07 19:51:19,036 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-07 19:51:19,037 - BERTopic - Dimensionality - Completed ✓
2024-02-07 19:51:19,049 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-07 19:51:19,117 - BERTopic - Cluster - Completed ✓
2024-02-07 19:51:19,118 - BERTopic - Representation - Extracting topics from clusters using representation models.
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 19:51:44,571 - BERTopic - Representation - Completed ✓
2024-02-07 19:51:44,573 - BERTopic - Topic reduction - Reducing number of topics
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 19:52:09,662 - BERTopic - Topic reduction - Reduced number of topics from 132 to 81
2024-02-07 19:52:10,969 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


2024-02-07 19:52:34,684 - Computing evaluation metrics
2024-02-07 19:53:53,365 - Evaluation metric (c_npmi): 0.1434207210491433
2024-02-07 19:59:23,828 - Evaluation metric (c_v): 0.633988347901092
2024-02-07 19:59:23,941 - Evaluation metric (u_mass): -0.15280514036013823
2024-02-07 20:00:41,500 - Evaluation metric (c_uci): 1.2789297179573793
2024-02-07 20:00:41,501 - Evaluation metric (topic_diversity): 0.725
2024-02-07 20:00:41,601 - Evaluation metric (inverted_rbo): 0.9880379951876853
2024-02-07 20:00:41,603 - Evaluation metric (pairwise_jaccard_similarity): 0.011525808968935212
2024-02-07 20:00:41,798 - Model saved at: bertopic_genre_indie_grid_search_20240207_184739/bertopic_bt_nr_topics_80
2024-02-07 20:00:41,799 - Saved result.json at: bertopic_genre_indie_grid_search_20240207_184739/result.json



2024-02-07 20:00:41,799 - Current search space: {'bertopic_params__nr_topics': 90}
2024-02-07 20:00:42,209 - Found existing sbert embeddings at bertopic_genre_indie_grid_search_2024020

100%|██████████| 718311/718311 [00:07<00:00, 101032.85it/s]


2024-02-07 20:00:49,349 - Number of vocabulary: 16141
[I] [20:00:49.467167] Unused keyword parameter: low_memory during cuML estimator initialization
[D] [20:00:49.544344] /__w/cuml/cuml/cpp/src/umap/runner.cuh:108 n_neighbors=71
[D] [20:00:49.547343] /__w/cuml/cuml/cpp/src/umap/runner.cuh:130 Calling knn graph run
[D] [20:01:06.504090] /__w/cuml/cuml/cpp/src/umap/runner.cuh:136 Done. Calling fuzzy simplicial set
[D] [20:01:06.527707] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:317 Smooth kNN Distances
[D] [20:01:06.537032] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:319 sigmas = [ 0.0232351, 0.126921, 0.157377, 0.0236554, 0.0432843, 0.0607038, 0.0157364, 0.0694603, 0.122134, 0.0316501, 0.173924, 0.0179468, 0.0101839, 0.0269362, 0.13557, 0.0210735, 0.0190689, 0.0185764, 0.0566198, 0.114179, 0.0285895, 0.131609, 0.140324, 0.120669, 0.0138571 ]

[D] [20:01:06.537159] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:321 rhos = [ 0.259338, 1.19209e-07, 1.78814e-07,

2024-02-07 20:02:19,308 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-07 20:02:19,309 - BERTopic - Dimensionality - Completed ✓
2024-02-07 20:02:19,320 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-07 20:02:19,386 - BERTopic - Cluster - Completed ✓
2024-02-07 20:02:19,386 - BERTopic - Representation - Extracting topics from clusters using representation models.
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 20:02:43,548 - BERTopic - Representation - Completed ✓
2024-02-07 20:02:43,550 - BERTopic - Topic reduction - Reducing number of topics
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 20:03:07,362 - BERTopic - Topic reduction - Reduced number of topics from 129 to 91
2024-02-07 20:03:08,799 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


2024-02-07 20:03:32,592 - Computing evaluation metrics
2024-02-07 20:04:51,133 - Evaluation metric (c_npmi): 0.14848953492516934
2024-02-07 20:11:11,938 - Evaluation metric (c_v): 0.641952231464387
2024-02-07 20:11:12,079 - Evaluation metric (u_mass): -0.16063248783699444
2024-02-07 20:12:30,448 - Evaluation metric (c_uci): 1.366043852065957
2024-02-07 20:12:30,448 - Evaluation metric (topic_diversity): 0.6977777777777778
2024-02-07 20:12:30,577 - Evaluation metric (inverted_rbo): 0.9874856872487373
2024-02-07 20:12:30,580 - Evaluation metric (pairwise_jaccard_similarity): 0.01201962279871874
2024-02-07 20:12:30,790 - Model saved at: bertopic_genre_indie_grid_search_20240207_184739/bertopic_bt_nr_topics_90
2024-02-07 20:12:30,800 - Saved result.json at: bertopic_genre_indie_grid_search_20240207_184739/result.json



2024-02-07 20:12:30,801 - Current search space: {'bertopic_params__nr_topics': 100}
2024-02-07 20:12:31,229 - Found existing sbert embeddings at bertopic_genre_indie_grid_s

100%|██████████| 718311/718311 [00:07<00:00, 97991.50it/s] 


2024-02-07 20:12:38,602 - Number of vocabulary: 16141
[I] [20:12:38.728871] Unused keyword parameter: low_memory during cuML estimator initialization
[D] [20:12:38.806124] /__w/cuml/cuml/cpp/src/umap/runner.cuh:108 n_neighbors=71
[D] [20:12:38.808778] /__w/cuml/cuml/cpp/src/umap/runner.cuh:130 Calling knn graph run
[D] [20:12:56.852019] /__w/cuml/cuml/cpp/src/umap/runner.cuh:136 Done. Calling fuzzy simplicial set
[D] [20:12:56.874361] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:317 Smooth kNN Distances
[D] [20:12:56.884677] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:319 sigmas = [ 0.0232351, 0.126921, 0.157377, 0.0236554, 0.0432843, 0.0607038, 0.0157364, 0.0694603, 0.122134, 0.0316501, 0.173924, 0.0179468, 0.0101839, 0.0269362, 0.13557, 0.0210735, 0.0190689, 0.0185764, 0.0566198, 0.114179, 0.0285895, 0.131609, 0.140324, 0.120669, 0.0138571 ]

[D] [20:12:56.884724] /__w/cuml/cuml/cpp/src/umap/fuzzy_simpl_set/naive.cuh:321 rhos = [ 0.259338, 1.19209e-07, 1.78814e-07,

2024-02-07 20:14:08,467 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-02-07 20:14:08,469 - BERTopic - Dimensionality - Completed ✓
2024-02-07 20:14:08,480 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-02-07 20:14:08,549 - BERTopic - Cluster - Completed ✓
2024-02-07 20:14:08,550 - BERTopic - Representation - Extracting topics from clusters using representation models.
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 20:14:34,314 - BERTopic - Representation - Completed ✓
2024-02-07 20:14:34,316 - BERTopic - Topic reduction - Reducing number of topics
  idf = np.log((avg_nr_samples / df)+1)
2024-02-07 20:14:59,681 - BERTopic - Topic reduction - Reduced number of topics from 129 to 101
2024-02-07 20:15:01,299 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


2024-02-07 20:15:25,381 - Computing evaluation metrics
2024-02-07 20:16:41,862 - Evaluation metric (c_npmi): 0.15130261952258928
2024-02-07 20:23:28,076 - Evaluation metric (c_v): 0.6453009792046391
2024-02-07 20:23:28,196 - Evaluation metric (u_mass): -0.17463812112494762
2024-02-07 20:24:44,953 - Evaluation metric (c_uci): 1.382030645174845
2024-02-07 20:24:44,954 - Evaluation metric (topic_diversity): 0.696
2024-02-07 20:24:45,112 - Evaluation metric (inverted_rbo): 0.988977582415762
2024-02-07 20:24:45,115 - Evaluation metric (pairwise_jaccard_similarity): 0.010127936696564284
2024-02-07 20:24:45,317 - Model saved at: bertopic_genre_indie_grid_search_20240207_184739/bertopic_bt_nr_topics_100
2024-02-07 20:24:45,331 - Saved result.json at: bertopic_genre_indie_grid_search_20240207_184739/result.json



2024-02-07 20:24:45,331 - Search ends


In [27]:
# Test whether the results are the same when the model is loaded from disk

# load the best model and the embedding from the config folder

search_behaviour = SEARCH_BEHAVIOUR.GRID_SEARCH
training_datetime = datetime(2024, 1, 29, 21, 8, 12)
training_folder = Path(f'bertopic_{search_behaviour.value}_{training_datetime.strftime("%Y%m%d_%H%M%S")}')


training_result_json_path = training_folder.joinpath('result.json')
with open(training_result_json_path, 'r') as f:
    training_result = json.load(f)

# embeddings
embeddings_path = training_folder.joinpath(
    f'embeddings_{training_result["best_hyperparameters"]["sbert_params"]["model_name_or_path"]}.pkl'
)
if embeddings_path.exists():
    with open(embeddings_path, 'rb') as f:
        embeddings = np.load(f)
else:
    raise Exception('No embeddings found. Function terminates.')


# model
best_model_checkpoint_path = training_result['best_model_checkpoint']

best_model_loaded = _load_bertopic_model(best_model_checkpoint_path)


topics, probs = best_model.transform(X, embeddings=embeddings)
topics2, probs2 = best_model_loaded.transform(X, embeddings=embeddings)

assert topics.shape == topics2.shape
assert probs.shape == probs2.shape

np.testing.assert_allclose(topics, topics2, rtol=1e-5, atol=1e-5)
np.testing.assert_allclose(probs, probs2, rtol=1e-5, atol=1e-5)

2024-01-29 21:43:10,260 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.
2024-01-29 21:43:12,895 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


In [28]:
# what if we compute the embeddings on the fly instead?

best_model_loaded2 = _load_bertopic_model(best_model_checkpoint_path)

sent_transformers2 = SentenceTransformer(
    **sbert_params,
    device=device
)

embeddings3 = sent_transformers2.encode(X, show_progress_bar=True, batch_size=64)

topics3, probs3 = best_model_loaded2.transform(X, embeddings=embeddings3)
assert topics.shape == topics3.shape
assert probs.shape == probs3.shape

np.testing.assert_allclose(topics, topics3, rtol=1e-5, atol=1e-5)
np.testing.assert_allclose(probs, probs3, rtol=1e-5, atol=1e-5)

Batches: 100%|██████████| 11591/11591 [03:29<00:00, 55.34it/s] 
2024-01-29 22:20:50,829 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


The above confirms that reproducing the same pre-processing steps is enough to deploy a trained model, because fast inference relies only on the trained topic (vector) embeddings. Full inference additionally requires the trained UMAP and HDBSCAN models.
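To make the fast-inference path concrete, here is a minimal NumPy sketch of assigning documents to topics by cosine similarity between document embeddings and topic embeddings. The function name `assign_topics_by_cosine` and the toy arrays are hypothetical and only illustrate the idea behind the log line "Predicting topic assignments through cosine similarity of topic and document embeddings"; BERTopic's actual implementation may differ in details such as outlier handling.

```python
import numpy as np

def assign_topics_by_cosine(doc_embeddings, topic_embeddings):
    """Assign each document to the topic with the most similar embedding (hypothetical helper)."""
    # L2-normalise the rows so that a dot product equals cosine similarity
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    topics = topic_embeddings / np.linalg.norm(topic_embeddings, axis=1, keepdims=True)
    sims = docs @ topics.T                      # shape: (n_docs, n_topics)
    return sims.argmax(axis=1), sims

# toy data: 2 topics and 3 documents in a 2-dimensional embedding space
toy_topic_embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_doc_embeddings = np.array([[0.9, 0.1], [0.2, 0.8], [2.0, 0.5]])

toy_assignments, toy_sims = assign_topics_by_cosine(toy_doc_embeddings, toy_topic_embeddings)
print(toy_assignments)
```

Only the document embeddings and the stored topic embeddings are needed here, which is consistent with the observation that the same pre-processing plus the same sentence-embedding model is sufficient at inference time.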


---

In [25]:
# load the best model and the embedding from the config folder

search_behaviour = SEARCH_BEHAVIOUR.GRID_SEARCH
training_datetime = datetime(2024, 2, 7, 0, 34, 34)
training_folder = Path(f'bertopic_genre_{str(genre)}_{search_behaviour.value}_{training_datetime.strftime("%Y%m%d_%H%M%S")}')


training_result_json_path = training_folder.joinpath('result.json')
with open(training_result_json_path, 'r') as f:
    training_result = json.load(f)

# embeddings
embeddings_path = training_folder.joinpath(
    f'embeddings_{training_result["best_hyperparameters"]["sbert_params"]["model_name_or_path"]}.pkl'
)
if embeddings_path.exists():
    with open(embeddings_path, 'rb') as f:
        embeddings = np.load(f)
else:
    raise Exception('No embeddings found. Function terminates.')


# model
best_model_checkpoint_path = training_result['best_model_checkpoint']

best_model = _load_bertopic_model(best_model_checkpoint_path)

topic_model = best_model
topics, probs = topic_model.transform(X, embeddings=embeddings)

2024-02-07 12:03:11,810 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


In [26]:
best_model_checkpoint_path

'bertopic_genre_indie_grid_search_20240207_003434/bertopic_bt_nr_topics_100'

In [27]:
# get topic frequency table
freq = topic_model.get_topic_freq()
print(freq)
print('Num of topics:', len(freq))
print('\n\n')

# sum the 'Count'
print('Total number of docs:', freq['Count'].sum())
print('Number of inliers:', freq['Count'].sum() - freq[freq['Topic'] == -1]['Count'].sum())
print('Ratio of inliers:', (freq['Count'].sum() - freq[freq['Topic'] == -1]['Count'].sum()) / float(freq['Count'].sum()))

    Topic   Count
0      -1  434109
6       0   19212
76      1   13808
7       2   12910
41      3   12132
..    ...     ...
72     95     840
97     96     831
32     97     826
43     98     788
85     99     770

[101 rows x 2 columns]
Num of topics: 101



Total number of docs: 740083
Number of inliers: 305974
Ratio of inliers: 0.41343200695057175


---

Get the documents with the highest probability for each topic when transforming a new set of documents

In [62]:
# use the probs matrix to find the top-N representative docs for each topic
top_N = 10

# indices of the top_N highest-probability docs per topic (column), in no particular order
idx = np.argpartition(-probs, top_N, axis=0)[:top_N]
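`np.argpartition` only guarantees that the selected rows are the top N; it does not order them, which is why the probabilities printed for a topic are not monotonically decreasing. The sketch below adds a sort step so that the best document comes first. The helper name `top_n_docs_per_topic` and the toy matrix are hypothetical.

```python
import numpy as np

def top_n_docs_per_topic(probs, top_n):
    """Return a (top_n, n_topics) array of document indices, best first, for each topic column."""
    # the first top_n rows after partitioning hold the largest probabilities (unordered)
    part = np.argpartition(-probs, top_n - 1, axis=0)[:top_n]
    # look up the probabilities of those candidates and sort them in descending order
    part_scores = np.take_along_axis(probs, part, axis=0)
    order = np.argsort(-part_scores, axis=0)
    return np.take_along_axis(part, order, axis=0)

# toy matrix: 4 documents (rows) x 2 topics (columns)
toy_probs = np.array([
    [0.1, 0.9],
    [0.8, 0.2],
    [0.5, 0.7],
    [0.3, 0.4],
])

toy_top = top_n_docs_per_topic(toy_probs, top_n=2)
print(toy_top)
```

Partitioning first and sorting only the selected rows keeps the cost close to linear in the number of documents, which matters when `probs` has hundreds of thousands of rows.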

In [63]:
# rows = top-N slots (each value is a document index), cols = topics
idx.shape

(10, 21)

In [65]:
idx[:, -1]

array([66922, 60612, 39721, 41823, 34887, 66124,  5826, 44161, 76701,
       76489])

In [66]:
probs[idx[:, -1], -1]

array([0.8847593 , 0.88933086, 0.89252526, 0.87341017, 0.86458516,
       0.86464214, 0.87115467, 0.86157316, 0.8588035 , 0.8588035 ],
      dtype=float32)

In [75]:
for i in idx[:, -1]:
    print(X[i])

Such a great Game 10/10 -Ign
I LOVE THIS GAME ign 10/10
this game is amazing 10/10 IGN
I LOVE THIS GAME 10/10 BEST GAVE EVER IGN
its a great game 10/10 IGN rating 
This is one of the best games ever. It got 9/10 IGN
this Game is amazing 10/1o ign
Great game 10/10 IGN :)
Great game, IGN 11/10
Great game, IGN 11/10


In [77]:
scores = probs[idx[:, 0]]

In [78]:
scores

array([[0.83925736, 0.74457836, 0.80703235, 0.6795908 , 0.4112299 ,
        0.60110843, 0.32747   , 0.4742515 , 0.57148874, 0.11809592,
        0.37909943, 0.57450265, 0.4534629 , 0.4331435 , 0.5107883 ,
        0.52181256, 0.52877164, 0.5952083 , 0.51749295, 0.2519888 ,
        0.3827619 ],
       [0.83819544, 0.6792901 , 0.82225263, 0.59778   , 0.4467274 ,
        0.71488297, 0.4026624 , 0.4975381 , 0.5864483 , 0.17567718,
        0.37321538, 0.6265934 , 0.49665412, 0.45622283, 0.5865581 ,
        0.57748616, 0.50651133, 0.5680938 , 0.5247112 , 0.29756355,
        0.419262  ],
       [0.862828  , 0.71632946, 0.8440254 , 0.6514621 , 0.38810313,
        0.73899895, 0.35407072, 0.51394963, 0.6051907 , 0.10856348,
        0.3591198 , 0.58158875, 0.474846  , 0.46928063, 0.5575608 ,
        0.5806221 , 0.5548263 , 0.5549511 , 0.5808619 , 0.23594311,
        0.4032487 ],
       [0.8423841 , 0.7527422 , 0.79166555, 0.6698292 , 0.42994094,
        0.6526007 , 0.42818356, 0.46310222, 0.551491 

In [79]:
scores.shape

(10, 21)

In [16]:
# # load the embeddings
# embedding_path = Path('00_Terraria_embeddings.pkl')
# embeddings = np.load(embedding_path)

# # inference to get the topics and prob for evaluation
# # hence, we need the probs to get topic-doc-matrix
# topics, probs = topic_model.transform(X, embeddings=embeddings)

In [17]:
probs.shape

(81776, 20)

Extracting Topics

In [28]:
# look at the most frequent topics 

freq = topic_model.get_topic_info()
freq.head(5)

|   | Topic | Count | Name | Representation | Representative_Docs |
|---|-------|-------|------|----------------|---------------------|
| 0 | -1 | 434109 | -1_like_play_fun_really | [like, play, fun, really, good, great, 10, gam... | |
| 1 | 0 | 19212 | 0_10_11_killed_pig | [10, 11, killed, pig, kill, died, play, horses... | |
| 2 | 1 | 13808 | 1_minecraft_2d_items_building | [minecraft, 2d, items, building, world, build,... | |
| 3 | 2 | 12910 | 2_buy_worth_cents_sale | [buy, worth, cents, sale, bought, money, dolla... | |
| 4 | 3 | 12132 | 3_horror_scary_scares_scared | [horror, scary, scares, scared, scare, penumbr... | |


In [29]:
topic_model.get_topic(0)  # Select the most frequent topic

[['10', 0.11614546698929003],
 ['11', 0.03908306907095163],
 ['killed', 0.02557373978016356],
 ['pig', 0.016723660421671952],
 ['kill', 0.012993703992721482],
 ['died', 0.012853067494433384],
 ['play', 0.009919414965680464],
 ['horses', 0.009775912787600883],
 ['poop', 0.009502663826975285],
 ['wolves', 0.0094947719418585]]

(Copied from the BERTopic Colab notebook)

There are a number of attributes that you can access after having trained your BERTopic model:


| Attribute              | Description                                                                                 |
|------------------------|---------------------------------------------------------------------------------------------|
| topics_                | The topics that are generated for each document after training or updating the topic model. |
| probabilities_         | The probabilities that are generated for each document if HDBSCAN is used.                  |
| topic_sizes_           | The size of each topic.                                                                     |
| topic_mapper_          | A class for tracking topics and their mappings anytime they are merged/reduced.             |
| topic_representations_ | The top *n* terms per topic and their respective c-TF-IDF values.                           |
| c_tf_idf_              | The topic-term matrix as calculated through c-TF-IDF.                                       |
| topic_labels_          | The default labels for each topic.                                                          |
| custom_labels_         | Custom labels for each topic as generated through `.set_topic_labels`.                      |
| topic_embeddings_      | The embeddings for each topic if `embedding_model` was used.                                |
| representative_docs_   | The representative documents for each topic if HDBSCAN is used. Note: this affects evaluation when calling `get_topic_info()`; after transforming the provided data, use the returned topics and probabilities to recalculate the representative documents. |

Save and load BERTopic models and components

Visualization

In [30]:
# visualize topics

topic_model.visualize_topics()

In [31]:
# visualize topic probabilities
# to understand how confident BERTopic is that certain topics are present in the documents

topic_model.visualize_distribution(probs[100], min_probability=0.001)

In [32]:
# visualize how topics are hierarchically reduced

topic_model.visualize_hierarchy(top_n_topics=50)


scipy.array is deprecated and will be removed in SciPy 2.0.0, use numpy.array instead





In [33]:
# visualize selected terms for a few topics
# by creating bar charts from the c-TF-IDF scores of each topic representation

topic_model.visualize_barchart(top_n_topics=10)

In [35]:
# visualize topic similarity
# Having generated topic embeddings, through both c-TF-IDF and embeddings,
# we can create a similarity matrix by simply applying cosine similarities through those topic embeddings.
# The result will be a matrix indicating how similar certain topics are to each other.

topic_model.visualize_heatmap(top_n_topics=100, width=1000, height=1000)

Evaluation

Calculate metrics with octis

Reference

https://www.theanalyticslab.nl/topic-modeling-with-bertopic/

In [52]:
result_bertopic = {}

top_words = 10     # the functions will only return that number of top words
def _get_topics(topic_model):
    topic_list = []
    empty_topic_l_idx = []

    for idx, topics in topic_model.get_topics().items():
        if idx < 0:
            continue

        topics_sorted = sorted(topics, key=lambda x: x[1], reverse=True)
        topic_l = [t[0] for t in topics_sorted if t[0].strip() != '']

        # the filtered word list may end up empty
        # also, a topic with only one word breaks the NPMI calculation
        if len(topic_l) <= 1:
            empty_topic_l_idx.append(idx)
            continue

        topic_list.append(topic_l)
        # print(len(topic_l))

    return topic_list, empty_topic_l_idx

def _get_topic_word_matrix(topic_model, empty_topic_idxs):

    # use ctfidf value to calculate the probability of a word assigned to a topic
    # but this is not the probability of a word in a topic
    # maybe there's a better way

    c_tfidf_all = topic_model.c_tf_idf_.todense()

    topic_word_matrix = np.exp(c_tfidf_all) / np.exp(c_tfidf_all).sum(axis=1)

    # remove empty topics starting from the largest index, so earlier deletions do not shift later ones
    for idx in empty_topic_idxs[::-1]:
        topic_word_matrix = np.delete(topic_word_matrix, idx, axis=0)

    # a better way: https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-probablities-or-distribution
    

    return topic_word_matrix

def _get_topic_document_matrix(probabilities, empty_topic_idxs):

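    # probabilities has shape (n_docs, n_topics); transpose to (n_topics, n_docs)
    topic_document_matrix = probabilities.T

    # after the transpose, topics lie along axis 0, so drop the rows of the empty topics
    for idx in empty_topic_idxs[::-1]:
        topic_document_matrix = np.delete(topic_document_matrix, idx, axis=0)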

    return topic_document_matrix

result_bertopic['topics'], empty_topic_idxs = _get_topics(topic_model)
result_bertopic['topic-word-matrix'] = _get_topic_word_matrix(topic_model, empty_topic_idxs)
result_bertopic['topic-document-matrix'] = _get_topic_document_matrix(probs, empty_topic_idxs)
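The topic-word matrix above is a row-wise softmax over the c-TF-IDF scores. As a hedged aside, the usual numerically stable form subtracts each row's maximum before exponentiating; the result is mathematically identical. The helper `rowwise_softmax` and the toy scores below are hypothetical and use a plain dense array rather than the model's `c_tf_idf_` matrix.

```python
import numpy as np

def rowwise_softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp()."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# toy "c-TF-IDF" scores: 2 topics (rows) x 3 vocabulary terms (columns)
toy_ctfidf = np.array([[0.0, 0.5, 1.0],
                       [2.0, 0.0, 0.0]])

toy_topic_word = rowwise_softmax(toy_ctfidf)
print(toy_topic_word.sum(axis=1))   # each row is a probability distribution
```

As the comments in the cell already point out, this turns scores into something that sums to one per topic, but it is not a true word-given-topic probability.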

In [53]:
result_bertopic['topics'], result_bertopic['topic-word-matrix'], result_bertopic['topic-document-matrix']

([['game', 'this', 'it', 'and', 'the', 'to', 'of', 'you', 'is', 'fun'],
  ['terraria', 'the', 'and', 'to', 'you', 'is', 'of', 'it', 'game', 'that'],
  ['minecraft', 'and', 'game', 'this', 'it', 'is', 'of', 'you', 'the', 'to'],
  ['game', 'this', 'best', 'great', 'ever', 'love', 'is', 'good', 'one', 'it'],
  ['10',
   'again',
   'killed',
   'would',
   'the',
   'my',
   'you',
   'and',
   'to',
   'unicorn'],
  ['my', 'it', 'but', 'fix', 'the', 'and', 'game', 'to', 'me', 'this'],
  ['addictive',
   'addicting',
   'fun',
   'very',
   'addicted',
   'game',
   'hours',
   'and',
   'this',
   'it'],
  ['10', 'would', 'again', '11', 'ign', 'play', 'life', 'tunk', 'my', 'good'],
  ['good',
   'ok',
   'its',
   'pretty',
   'alright',
   'it',
   'guess',
   'cool',
   'yeah',
   'okay'],
  ['bye',
   'cool',
   'slit',
   'dink',
   'so',
   'tickle',
   'pickle',
   'zone',
   'it',
   'let'],
  ['review',
   'reviews',
   'badgei',
   'le',
   'this',
   'the',
   'game',
   'badge

In [57]:
topic_freq = topic_model.get_topic_freq()
topic_freq[topic_freq['Topic'] != -1]

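|    | Topic | Count |
|----|-------|-------|
| 3  | 0     | 29399 |
| 0  | 1     | 14477 |
| 5  | 2     | 13139 |
| 8  | 3     | 8073  |
| 7  | 4     | 2529  |
| 12 | 6     | 1726  |
| 6  | 7     | 1547  |
| 13 | 5     | 1500  |
| 17 | 12    | 1409  |
| 14 | 9     | 1353  |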


Evaluation with gensim

(because it gives more control over gensim's CoherenceModel)

In [59]:
from gensim import corpora
from gensim.models.coherencemodel import CoherenceModel

# https://stackoverflow.com/questions/70548316/gensim-coherencemodel-gives-valueerror-unable-to-interpret-topic-as-either-a-l

# filter topics that contain only one word from the corpus for calculating npmi
# https://github.com/piskvorky/gensim/issues/3328


topic_words, empty_topic_idxs = _get_topics(topic_model)

documents = pd.DataFrame({"Document": X,
                          "ID": range(len(X)),
                          "Topic": topics})

# remove documents whose topic contains at most one word
documents = documents[~documents['Topic'].isin(empty_topic_idxs)]

documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

bertopic_vectorizer = topic_model.vectorizer_model
bertopic_analyzer = bertopic_vectorizer.build_analyzer()

words = bertopic_vectorizer.get_feature_names_out()
tokens = [bertopic_analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]

In [60]:
# ~3 min on an i7-14700 with a CountVectorizer vocabulary of ~6000 words

# we first compute the c_v coherence

coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 topn=10,
                                 coherence='c_v')

# npmi = Coherence(texts=tokens,topk=10, measure='c_npmi')
# nmpi_score = npmi.score(result_bertopic)

cv_score = coherence_model.get_coherence()
cv_score


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

0.3994560925733617

In [61]:
coherence_model_npmi = CoherenceModel(topics=topic_words,
                                    texts=tokens,
                                    corpus=corpus,
                                    dictionary=dictionary,
                                    topn=10,
                                    coherence='c_npmi')

npmi_score = coherence_model_npmi.get_coherence()
npmi_score

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

-0.0029256775418027474
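For intuition about the `c_npmi` score, here is a simplified sketch of normalised pointwise mutual information for a single word pair, with probabilities estimated from whole-document co-occurrence. This is an assumption made for illustration: gensim's `CoherenceModel` estimates the probabilities differently and aggregates over all word pairs of every topic, so the numbers will not match. The function name `npmi_from_documents` and the toy corpus are hypothetical.

```python
import math

def npmi_from_documents(word_a, word_b, tokenized_docs):
    """NPMI of two words, using the fraction of documents containing them as probabilities."""
    n_docs = len(tokenized_docs)
    doc_sets = [set(doc) for doc in tokenized_docs]
    p_a = sum(word_a in d for d in doc_sets) / n_docs
    p_b = sum(word_b in d for d in doc_sets) / n_docs
    p_ab = sum((word_a in d) and (word_b in d) for d in doc_sets) / n_docs
    if p_ab == 0:
        return -1.0          # the words never co-occur: lower bound by convention
    if p_ab == 1:
        return 1.0           # both words occur in every document; avoids dividing by -log(1) = 0
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)

toy_docs = [
    ["great", "game"],
    ["great", "game", "fun"],
    ["bad", "port"],
    ["fun", "port"],
]

print(npmi_from_documents("great", "game", toy_docs))   # always together
print(npmi_from_documents("great", "port", toy_docs))   # never together
print(npmi_from_documents("fun", "port", toy_docs))     # statistically independent
```

NPMI ranges from -1 (never co-occur) through 0 (independent) to 1 (always co-occur), so a topic-level average close to zero, as in the output above, suggests that the top words of a topic co-occur about as often as chance would predict.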

In [31]:
def get_topic_diversity(topics, topk=10):
    ''' Topic diversity: the fraction of unique words among the top-k words of all topics
    Modified from the octis implementation
    
    Parameters
    ----------
    topics : list of list of str
        List of topics, where each topic is a list of words.
    topk : int, optional
        Number of top words per topic to consider.
    '''
    if topics is None:
        return 0
    # if topk > len(topics[0]):
    #     raise Exception('Words in topics are less than ' + str(self.topk))

    unique_words = set()
    for topic in topics:
        unique_words = unique_words.union(set(topic[:topk]))
    td = len(unique_words) / (topk * len(topics))
    return td

get_topic_diversity(topic_words)

0.655

In [33]:
import itertools

import sys
sys.path.append('../')

from rbo import rbo

def get_word2index(list1, list2):
    words = set(list1)
    words = words.union(set(list2))
    word2index = {w: i for i, w in enumerate(words)}
    return word2index

def get_inverted_RBO(topics, topk=10, weight=0.9):
    ''' Inverted Rank-Biased Overlap (iRBO)
    to measure the diversity of the topics
    Modified from octis implementation

    Parameters
    ----------
    topics : list of list of str
        List of topics, where each topic is a list of words.
    topk : int, optional
    weight : float, optional
    '''

    if topics is None:
        return 0
    if topk > len(topics[0]):
        raise Exception('Words in topics are less than topk')
    else:
        collect = []
        for list1, list2 in itertools.combinations(topics, 2):
            word2index = get_word2index(list1, list2)
            indexed_list1 = [word2index[word] for word in list1]
            indexed_list2 = [word2index[word] for word in list2]
            rbo_val = rbo(indexed_list1[:topk], indexed_list2[:topk], p=weight)[2]
            collect.append(rbo_val)
        return 1 - np.mean(collect)
    
get_inverted_RBO(topic_words)

0.9363353717539098
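The training logs earlier in this notebook also report a `pairwise_jaccard_similarity` metric, whose implementation is not shown in this section. A minimal sketch follows, assuming the metric is the mean Jaccard similarity between the top-k word sets of every pair of topics (lower means more diverse topics). The function body and the toy topics are illustrative, not the code that produced the logged values.

```python
from itertools import combinations

def pairwise_jaccard_similarity(topics, topk=10):
    """Mean Jaccard similarity of the top-k word sets over all pairs of topics."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for words_a, words_b in pairs:
        set_a, set_b = set(words_a[:topk]), set(words_b[:topk])
        union = set_a | set_b
        if union:                                  # skip the degenerate case of two empty topics
            total += len(set_a & set_b) / len(union)
    return total / len(pairs)

toy_topics = [
    ["game", "fun", "great"],
    ["game", "horror", "scary"],
    ["buy", "sale", "worth"],
]

toy_jaccard = pairwise_jaccard_similarity(toy_topics, topk=3)
print(toy_jaccard)
```

Unlike rank-biased overlap, Jaccard similarity ignores word order, so the two metrics can disagree when topics share words at very different ranks.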

In [34]:
def _KL(P, Q):
    """
    Compute the Kullback-Leibler divergence KL(P || Q)

    Parameters
    ----------
    P : distribution P
    Q : distribution Q

    Returns
    -------
    divergence : KL divergence of P from Q
    """
    # add epsilon to guarantee absolute continuity (no zero entries)
    epsilon = 0.00001
    P = P+epsilon
    Q = Q+epsilon

    divergence = np.sum(np.multiply(P, np.log(P/Q)))        # changed the operator from * to np.multiply to do element-wise multiplication
    return divergence

def get_kl_divergence(topic_word_matrix):
    """Compute KL divergence between topic-word distributions
    to measure document coverage
    Modified from the octis implementation
    https://github.com/MIND-Lab/OCTIS/blob/master/octis/evaluation_metrics/diversity_metrics.py#L209

    Parameters
    ----------
    topic_word_matrix : topic-word distribution matrix
    """
    beta = topic_word_matrix
    kl_div = 0
    count = 0
    for i, j in itertools.combinations(range(len(beta)), 2):
        kl_div += _KL(beta[i], beta[j])
        count += 1
    return kl_div / count

get_kl_divergence(result_bertopic['topic-word-matrix'])

0.00022574783055084367
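A small worked example may help when reading this number. The sketch below mirrors `_KL` on plain 1-D arrays (the name `kl_divergence` and the toy distributions are hypothetical) and shows two properties: identical distributions give zero, and the divergence is asymmetric. Since `itertools.combinations` only yields pairs with i < j, the average above uses one direction per pair.

```python
import numpy as np

def kl_divergence(p, q, epsilon=1e-5):
    """KL(P || Q) with the same epsilon smoothing as _KL above, for 1-D arrays."""
    p = np.asarray(p, dtype=float) + epsilon
    q = np.asarray(q, dtype=float) + epsilon
    return float(np.sum(p * np.log(p / q)))

uniform = np.array([0.25, 0.25, 0.25, 0.25])
peaked = np.array([0.7, 0.1, 0.1, 0.1])

kl_same = kl_divergence(uniform, uniform)
kl_forward = kl_divergence(uniform, peaked)
kl_backward = kl_divergence(peaked, uniform)
print(kl_same, kl_forward, kl_backward)
```

A value very close to zero, as in the output above, indicates that the softmax-normalised topic-word rows are nearly indistinguishable from one another.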

In [35]:
result_bertopic['topic-word-matrix'].shape

(21, 6968)

Inference Test

In [None]:
inference_test = ["well its been fun guys, but that's it, no more updates, that one was the last one, there is no longer going to be anymore content for this game anymore, there is no way to replay it as there won't be any updates, nope, that was it, the last update, nothing more, this game has no new ways to experience it as there is no more content updates, nothing new to freshen up the experience, its such a shame that this game has no replay-ability, once you beat the game there is like no point to playing again, as they said guys 1.2 will be they final update. nothing more after 1.2, there is no chance they will make another final update right? several years and final updates later: alright, thats it, no more updates we wont be getting anymore, thats it, nothing more, no more updates, for real this time... oh god, redigit made another tweet.",
                  "keeps forcing me to play it",
'''I will leave the cat here, so that everybody who passes by can pet it and give it a thumbs up and awards
　　　 　　／＞　　フ
　　　 　　| 　_　 _ l
　 　　 　／` ミ＿xノ
　　 　 /　　　 　 |
　　　 /　 ヽ　　 ﾉ
　 　 │　　|　|　|
　／￣|　　 |　|　|
　| (￣ヽ＿_ヽ_)__)
　＼二つ''']

In [1]:
from bertopic import BERTopic