### Objective

- Evaluate the safetensors, pytorch, and pickle (dill) serialization options for saving and reloading a BERTopic model.

#### Comparing topic probabilities for model.transform()

Observations:

1. The topic-level probabilities differ, including in scale, between safetensors/pytorch and pickle/dill: a model reloaded from safetensors or pytorch scores a document only by its similarity to the topic embeddings, while the dill model retains the full UMAP and HDBSCAN pipeline.
2. With dill/pickle the probabilities sum to at most 1; with safetensors/pytorch the sum is far greater than 1, because each value is a cosine similarity.
3. The top topic is largely the same in both cases.
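The difference in scale can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the embeddings are random, `cosine_scores` is a hypothetical helper, and the softmax is only a stand-in for a distribution that sums to 1 (it is not what HDBSCAN computes).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 5 topic embeddings and 1 document embedding.
# 384 matches the embedding size shown in this notebook; the values are random.
topic_embeddings = rng.normal(size=(5, 384)) + 1.0
doc_embedding = rng.normal(size=(1, 384)) + 1.0


def cosine_scores(doc, topics):
    """Cosine similarity between each document row and each topic row."""
    doc_n = doc / np.linalg.norm(doc, axis=1, keepdims=True)
    topics_n = topics / np.linalg.norm(topics, axis=1, keepdims=True)
    return doc_n @ topics_n.T


# One independent similarity per topic: nothing forces these to sum to 1.
scores = cosine_scores(doc_embedding, topic_embeddings)

# A normalized distribution built with a softmax, purely for contrast.
# This is NOT the HDBSCAN membership computation.
membership = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print("sum of cosine scores:", scores.sum())
print("sum of normalized scores:", membership.sum())
print("same top topic:", scores.argmax() == membership.argmax())
```

The two score vectors live on different scales, yet a monotonic rescaling leaves the top-ranked topic unchanged. The real dill model runs a different pipeline, so agreement on the top topic is not guaranteed there.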

In [48]:
import dill
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

In [19]:
from hdbscan import HDBSCAN

In [6]:
with open("ect_bge_small_search.dill", "rb") as f:
    ect = dill.load(f)

In [26]:
sentence = 'the irrelevence of something that is not constant is baffeling'

In [27]:
embedding = ect.embeddings_model.encode(sentence)

In [32]:
embedding.reshape(1,-1).shape

(1, 384)

In [36]:
t, p = ect.bertopic_model.transform([sentence],embedding.reshape(1,-1))

In [47]:
p

array([[6.51333993e-04, 4.24771923e-04, 4.95920672e-04, 7.30991593e-04,
        4.39449397e-04, 3.45828921e-04, 2.71733824e-04, 2.85405838e-04,
        9.11909534e-03, 6.43817476e-04, 2.83510035e-04, 5.18217350e-04,
        2.96346221e-04, 1.40384365e-03, 5.42995934e-04, 4.67405229e-04,
        6.18711099e-04, 2.66288267e-04, 3.18282431e-04, 5.87551478e-04,
        4.65564181e-04, 1.53191598e-04, 6.88476371e-04, 3.50211132e-04,
        2.82841013e-04, 4.80837821e-04, 5.39263528e-04, 1.15367763e-03,
        2.79983305e-04, 7.46336172e-03, 3.09530060e-04, 4.59356129e-04,
        5.55924206e-04, 5.87342333e-04, 2.25065258e-04, 2.18191300e-04,
        1.09768531e-02, 1.00874818e-03, 4.05619312e-04, 3.44003979e-04,
        5.53216816e-04, 7.71383874e-04, 5.64938839e-04, 3.89230362e-04,
        5.63033956e-04, 4.31504034e-04, 7.97671047e-04, 3.45716889e-04,
        9.91215179e-04, 4.71584586e-04, 3.02363086e-04, 6.25854499e-04,
        2.75548479e-04, 2.59822023e-04, 3.21087783e-04, 3.233095

In [2]:
import os

In [21]:
ect.bertopic_model.save("bertopic_bge_small_24_sep", serialization="pytorch", save_ctfidf=True, save_embedding_model='BAAI/bge-small-en')

In [22]:
model2 = BERTopic.load("bertopic_bge_small_24_sep/")

In [25]:
model2.umap_model

<bertopic.dimensionality._base.BaseDimensionalityReduction at 0x16a6a2d10>

In [23]:
t1, p1 = model2.transform(['the irrelevence of something that is not constant is baffeling'])

In [46]:
p1

array([[0.85562265, 0.80319047, 0.83268714, 0.79912955, 0.8424952 ,
        0.803552  , 0.8423945 , 0.77155375, 0.8172012 , 0.8445801 ,
        0.83438563, 0.81655914, 0.80966204, 0.8068774 , 0.81157243,
        0.79545116, 0.81363994, 0.8263074 , 0.7895795 , 0.817497  ,
        0.80741835, 0.82939893, 0.82317543, 0.80960727, 0.78959835,
        0.800855  , 0.8420632 , 0.82647157, 0.81329817, 0.79075843,
        0.8241154 , 0.81410015, 0.80261874, 0.7858674 , 0.7987571 ,
        0.79375666, 0.8098306 , 0.8200163 , 0.8081104 , 0.8126357 ,
        0.8116094 , 0.8079581 , 0.7887103 , 0.78922915, 0.7951264 ,
        0.8027884 , 0.8170266 , 0.81540287, 0.81275845, 0.80681324,
        0.8516815 , 0.7955349 , 0.83217186, 0.80580235, 0.79024434,
        0.8129781 , 0.7944391 , 0.7949419 , 0.80692995, 0.82142687,
        0.8150956 , 0.77906984, 0.82226896, 0.8212482 , 0.79215217,
        0.81794286, 0.8116904 , 0.8089064 , 0.81302613, 0.7977701 ,
        0.8090272 , 0.8177878 , 0.8203667 , 0.81

In [40]:
t1

array([260])

In [45]:
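p1[0][261]  # score for topic 260: columns appear offset by one, presumably because column 0 holds the outlier topic -1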

0.88561285

In [53]:
cd = ect.hierarchical_topics[ect.hierarchical_topics.Parent_ID == '1924']

In [68]:
from copy import deepcopy

In [71]:
cd = deepcopy(ect.hierarchical_topics)

In [73]:
cd.loc[len(cd)] = ['589','something', None, None, None, None, None, None]

In [78]:
any(cd[cd.Parent_ID == '1924'].Child_Left_ID)

True

In [1]:
all(cd['Child_Left_ID'])

NameError: name 'cd' is not defined

In [79]:
cd.tail(3)

| | Parent_ID | Parent_Name | Topics | Child_Left_ID | Child_Left_Name | Child_Right_ID | Child_Right_Name | Distance |
|---|---|---|---|---|---|---|---|---|
| 1 | 964 | codes_redeem_updating_microsite_diamonds | [589, 611, 673] | 673.0 | codes_redeem_updating_microsite_diamonds | 963.0 | codes_redeem_updating_microsite_diamonds | 0.00589 |
| 0 | 963 | codes_redeem_updating_microsite_diamonds | [589, 611] | 589.0 | codes_redeem_updating_microsite_diamonds | 611.0 | codes_redeem_updating_microsite_diamonds | 0.00337 |
| 962 | 589 | something | | | | | | |


In [81]:
cd[cd.Parent_ID == '611'].empty

True

In [64]:
for _,k in cd.iterrows():
    if k['Child_Left_ID'] is not None:
        print(k)

Parent_ID                                                        1924
Parent_Name                               today_stock_price_at_closed
Topics              [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...
Child_Left_ID                                                    1918
Child_Left_Name                 closed_yesterday_stock_reacts_monitor
Child_Right_ID                                                   1923
Child_Right_Name                                   nifty_on_in_to_for
Distance                                                     4.633254
Name: 961, dtype: object


#### Comparing find_topics() results

Observations:

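1. Both models return the same topics.
2. The similarity scores agree to about seven significant digits; the dill model reports more digits (the outputs look like float64 versus float32).
3. This agreement is expected, since both models compare the sentence embedding against the topic embeddings.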
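As a rough check on the precision observation, the sketch below compares the two sets of scores printed in this notebook. It assumes the reloaded model returns float32 values, which is inferred from the number of printed digits rather than confirmed.

```python
import numpy as np

# Scores copied from the find_topics() outputs in this notebook.
dill_scores = np.array([
    0.8856128888652373,
    0.8836596652121516,
    0.8686879737850925,
    0.8648303818026876,
    0.862603969705139,
])
# Assumption: the reloaded model's scores are float32.
reloaded_scores = np.array(
    [0.88561285, 0.8836597, 0.868688, 0.86483026, 0.8626039],
    dtype=np.float32,
)

# Largest absolute gap between the two sets of scores.
max_gap = np.max(np.abs(dill_scores - reloaded_scores.astype(np.float64)))

# Machine epsilon for float32, as a yardstick for single-precision rounding.
float32_eps = np.finfo(np.float32).eps

print("largest gap:", max_gap)
print("float32 epsilon:", float32_eps)
print("equal within 1e-6:", np.allclose(dill_scores, reloaded_scores, atol=1e-6))
```

If the largest gap is on the order of the float32 epsilon, the two models are computing the same similarities and differ only in numeric precision.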

In [49]:
model2.find_topics(sentence)

([260, 940, 502, 125, 535],
 [0.88561285, 0.8836597, 0.868688, 0.86483026, 0.8626039])

In [50]:
ect.bertopic_model.find_topics(sentence)

([260, 940, 502, 125, 535],
 [0.8856128888652373,
  0.8836596652121516,
  0.8686879737850925,
  0.8648303818026876,
  0.862603969705139])