# BERTopic exploration

This notebook explores the BERTopic package and the options available for each of its sub-models.

In [40]:
# read input data

import pandas as pd

pd.set_option('display.max_columns', None)

fpath = '../example/data/data_processed_1prod_full.json'
df = pd.read_json(fpath, lines=True)
docs = df['reviewText'].astype(str)
docs

0      I always get a half size up in my tennis shoes...
1      Put them on and walked 3 hours with no problem...
2                                              excelente
3      The shoes fit well in the arch area. They are ...
4      Tried them on in a store before buying online ...
                             ...                        
366    Favorite Nike shoe ever! The flex sole is exce...
367         I wear these everyday to work, the gym, etc.
368      Love these shoes! Great fit, very light weight.
369    Super comfortable and fit my small feet perfec...
370                                    Love these shoes!
Name: reviewText, Length: 371, dtype: object

## BERTopic simplest usage

This is the simplest case, with every sub-model left at its default. Topic -1 collects the outliers, and the number of topics is determined automatically by the clustering algorithm.

The following figure shows the whole model in steps:

![Default model](https://maartengr.github.io/BERTopic/algorithm/default.svg)
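
To make the topic-representation stage in the figure more concrete, here is a minimal pure-Python sketch of class-based TF-IDF (c-TF-IDF). It assumes the weighting described in the BERTopic documentation: the term frequency within a topic multiplied by `log(1 + A / f_x)`, where `A` is the average number of words per topic and `f_x` is the frequency of word `x` across all topics. The function name `c_tf_idf` and the toy reviews are hypothetical, and the real `ClassTfidfTransformer` works on sparse count matrices and may differ in its normalization details.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score words per topic (hypothetical helper, not part of BERTopic).

    docs_per_topic maps a topic id to the list of documents assigned to it.
    Returns a dict mapping each topic id to a {word: score} dict.
    """
    # Treat all documents of a topic as one bag of words.
    counts = {
        topic: Counter(" ".join(documents).lower().split())
        for topic, documents in docs_per_topic.items()
    }

    # f_x: how often each word occurs across all topics.
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)

    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in topic_counts.items()
        }
    return scores


# Toy data (assumed for illustration only).
toy = {
    0: ["great fit", "perfect fit great shoes"],
    1: ["love these shoes", "my daughter loves them"],
}
scores = c_tf_idf(toy)
top_words = {
    topic: sorted(word_scores, key=word_scores.get, reverse=True)[:2]
    for topic, word_scores in scores.items()
}
print(top_words)
```

In this toy case "shoes" appears in both topics, so it is down-weighted relative to words that are specific to one topic. That is the same effect that keeps generic words from dominating the topic names.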


In [42]:
from bertopic import BERTopic

topic_model_simplest = BERTopic()
topic, probs = topic_model_simplest.fit_transform(docs)
topic_model_simplest.get_topic_info()

|   | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 26 | -1_for_and_lightweight_very |
| 1 | 0 | 242 | 0_the_and_shoes_these |
| 2 | 1 | 30 | 1_comfortable_very_and_great |
| 3 | 2 | 30 | 2_fit_perfect_great_comfortable |
| 4 | 3 | 24 | 3_love_them_loves_daughter |
| 5 | 4 | 19 | 4_excellent_excelente_nice_perfect |


In [52]:
# the dominant topic of a document and the probability that the document belongs to that topic
i = 5
topic[i], probs[i]

(3, 0.46867069085810076)

In [49]:
# approximate topic distribution for each document (rows: documents, columns: topics)
topic_model_simplest.approximate_distribution(docs)[0]

array([[0.9108282 , 0.        , 0.0891718 , 0.        , 0.        ],
       [0.28962824, 0.12910433, 0.10972196, 0.47154548, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 1.        ],
       ...,
       [0.22558421, 0.32700033, 0.38587123, 0.06154423, 0.        ],
       [0.52060168, 0.26923082, 0.2101675 , 0.        , 0.        ],
       [0.33411211, 0.14272114, 0.1400075 , 0.38315925, 0.        ]])
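
As a reading aid for the matrix above, the following minimal sketch takes the first three rows printed in the output and picks the highest-weighted topic for each document. It assumes that column `j` corresponds to topic `j` (the outlier topic -1 has no column, which matches the five columns for topics 0 to 4) and that each row is normalized to sum to 1. The variable names are hypothetical.

```python
# First three rows copied from the output above.
rows = [
    [0.9108282, 0.0, 0.0891718, 0.0, 0.0],
    [0.28962824, 0.12910433, 0.10972196, 0.47154548, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]

# Index of the largest weight in each row, i.e. the dominant topic
# under the assumption that column j corresponds to topic j.
dominant = [row.index(max(row)) for row in rows]

# Each row should sum to (approximately) 1 if it is a distribution.
row_sums = [sum(row) for row in rows]

print(dominant)
print(row_sums)
```

With the full NumPy array returned by `approximate_distribution`, the same lookup can be written as `distributions.argmax(axis=1)`, where `distributions` stands for the first element of the returned tuple.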

## A more complicated example, borrowed from [the BERTopic documentation](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#visual-overview)

In [53]:
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer


# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with 
# a `bertopic.representation` model
representation_model = KeyBERTInspired()

# All steps together
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)

topic, probs = topic_model.fit_transform(docs)

In [55]:
topic_model.get_topic_info()

|   | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 2 | -1_color_red_orange_wore |
| 1 | 0 | 245 | 0_shoes_shoe_sneakers_nikes |
| 2 | 1 | 100 | 1_comfortable_fit_fits_comfy |
| 3 | 2 | 24 | 2_love_loved_loves_cute |
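
One quick way to compare the two runs is the share of documents that end up in the outlier topic. The sketch below copies the `Count` values from the two `get_topic_info()` outputs in this notebook. The helper `outlier_share` and the dictionary names are hypothetical, and because UMAP is stochastic the exact counts will likely change between runs.

```python
# Topic -> document count, copied from the two get_topic_info() outputs.
default_counts = {-1: 26, 0: 242, 1: 30, 2: 30, 3: 24, 4: 19}
custom_counts = {-1: 2, 0: 245, 1: 100, 2: 24}


def outlier_share(counts):
    """Fraction of documents assigned to the outlier topic (-1)."""
    return counts.get(-1, 0) / sum(counts.values())


for name, counts in [("default", default_counts), ("custom", custom_counts)]:
    print(f"{name}: {len(counts) - 1} topics, "
          f"{outlier_share(counts):.1%} outliers")
```

Both runs account for all 371 reviews. The customized pipeline leaves far fewer outliers but also merges the reviews into fewer, larger topics.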
