# BERTopic exploration

This notebook explores the BERTopic package and the options available for each of its sub-models.

In [77]:
# read input data

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

fpath = '../example/data/data_processed_1prod_full.json'
df = pd.read_json(fpath, lines=True)
docs = df['reviewText'].astype(str)
docs

0      I always get a half size up in my tennis shoes...
1      Put them on and walked 3 hours with no problem...
2                                              excelente
3      The shoes fit well in the arch area. They are ...
4      Tried them on in a store before buying online ...
5                                      I recommend that!
6      My son likes these, and this is the 2nd pair h...
7                                            Comfortable
8                Fit fine...did not like color in person
9      The shoe is too large. When you do lunges it h...
10     Really great for walking I'm very glad I got t...
11     Love these shoes. My feet feel so much better....
12                                        ok but too big
13           Love these shoes.. they are so comfortable.
14     In really like these. I wear between a 9-9.5 w...
15     Love these shoes!\nSo stylish and comfortable....
16     This shoe is JUST OK. Its not as comfortable a...
17     Best tennis shoes I've h

## BERTopic simplest usage

This is the simplest case, with every sub-model left at its default. Topic -1, when present, collects the outliers. The number of topics is not set in advance; it is determined by the clustering algorithm.

The following figure shows the whole model in steps:

![Default model](https://maartengr.github.io/BERTopic/algorithm/default.svg)
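
As a small illustration of how the assignments can be read, the sketch below counts the topics found and the share of outliers in a list of topic ids. The helper name `summarize_assignments` and the example list are made up for illustration; only the convention that -1 marks outliers comes from BERTopic.

```python
from collections import Counter


def summarize_assignments(topics):
    """Return (number of non-outlier topics, share of documents labelled -1)."""
    counts = Counter(topics)
    n_outliers = counts.get(-1, 0)
    n_topics = len([t for t in counts if t != -1])
    outlier_share = n_outliers / len(topics) if topics else 0.0
    return n_topics, outlier_share


# Made-up assignments, standing in for the first value returned by fit_transform
example_topics = [0, 0, 1, -1, 2, 0, 1, -1]
n_topics, outlier_share = summarize_assignments(example_topics)
print(n_topics, outlier_share)
```

A high outlier share is usually a hint that the clustering parameters deserve a second look.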


In [2]:
from bertopic import BERTopic

topic_model_simplest = BERTopic()
topic, probs = topic_model_simplest.fit_transform(docs)
topic_model_simplest.get_topic_info()

|   | Topic | Count | Name |
|---|---|---|---|
| 0 | 0 | 247 | 0_the_and_shoes_these |
| 1 | 1 | 100 | 1_comfortable_fit_and_very |
| 2 | 2 | 24 | 2_them_love_loves_daughter |


In [3]:
# the dominant topic of a document and the probability that the document belongs to that topic
i = 5
topic[i], probs[i]

(2, 0.6815190841859505)

In [4]:
# approximate topic distributions over documents
topic_model_simplest.approximate_distribution(docs)[0]

array([[1.        , 0.        , 0.        ],
       [0.30670947, 0.22013244, 0.47315809],
       [0.        , 0.        , 0.        ],
       ...,
       [0.34194034, 0.56437014, 0.09368952],
       [0.63941958, 0.36058042, 0.        ],
       [0.37563527, 0.19979705, 0.42456768]])
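
One way to read this matrix is to take the strongest topic per row and to treat all-zero rows as "no topic found". The sketch below does that on the first three rows shown above, using plain lists (an array can be converted with `.tolist()`). It assumes that column `j` corresponds to topic `j`, with no column for the outlier topic; the helper name `dominant_topics` is hypothetical.

```python
def dominant_topics(distributions):
    """Return the index of the strongest topic per row, or None for all-zero rows."""
    result = []
    for row in distributions:
        best = max(range(len(row)), key=row.__getitem__)
        result.append(best if row[best] > 0 else None)
    return result


# First three rows of the output above
rows = [
    [1.0, 0.0, 0.0],
    [0.30670947, 0.22013244, 0.47315809],
    [0.0, 0.0, 0.0],
]
print(dominant_topics(rows))  # [0, 2, None]
```

The third row belongs to the review `excelente`, which apparently does not resemble any topic closely enough to receive weight.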

## A more complicated example borrowed from [the BERTopic documentation](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#visual-overview)

In [30]:
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer


# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with 
# a `bertopic.representation` model
representation_model = KeyBERTInspired()

# All steps together
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)

topic, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()

|   | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5 | -1_arch_lightweight_shoe_support |
| 1 | 0 | 242 | 0_shoes_shoe_sneakers_nike |
| 2 | 1 | 100 | 1_comfortable_fit_fits_comfy |
| 3 | 2 | 24 | 2_love_loves_loved_cute |


Some topics contain near-duplicate keywords such as `shoes_shoe` and `love_loves`.

This is because BERTopic receives plain text without any preprocessing.

Raw text is good for the embedding and clustering steps, but not for the tokenization and representation steps, where variants of the same word compete for the keyword slots.

We want to normalize words before c-TF-IDF, and we can do this at the tokenization step.
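
To see why normalizing at the tokenization step helps, here is a toy sketch in plain Python. The function `toy_stem` is a deliberately crude stand-in that only strips a trailing "s"; it is not the Snowball stemmer and is meant only to show how merging variants pools their counts.

```python
from collections import Counter


def toy_stem(word):
    """Crude illustrative stemmer: drop one trailing 's' (not Snowball)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


tokens = ["shoes", "shoe", "love", "loves", "shoes"]

raw = Counter(tokens)                          # variants counted separately
stemmed = Counter(toy_stem(t) for t in tokens)  # variants merged

print(raw)
print(stemmed)
```

After merging, `shoe` and `love` each occupy a single keyword slot, leaving room for other terms in the topic representation.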

In [44]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

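class StemmedCountVectorizer(CountVectorizer):
    """CountVectorizer that stems every token produced by the default analyzer."""
    def build_analyzer(self):
        # the default analyzer tokenizes and removes stop words; stemming runs afterwards
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]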

vectorizer_model = StemmedCountVectorizer(
    ngram_range=(1, 1), 
    stop_words='english',
    analyzer='word'
)
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
topic_model.get_topic_info()

|   | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5 | -1_arch_support_great_lightweight |
| 1 | 0 | 242 | 0_shoe_comfort_love_size |
| 2 | 1 | 100 | 1_fit_comfort_great_perfect |
| 3 | 2 | 24 | 2_love_daughter_ship_fast |


In [51]:
# fine-tune the c-TF-IDF model; this seems to make no significant difference here
ctfidf_model = ClassTfidfTransformer(
    reduce_frequent_words=True
)
topic_model.update_topics(docs, ctfidf_model=ctfidf_model, vectorizer_model=vectorizer_model)
topic_model.get_topic_info()

|   | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5 | -1_arch_support_sprint_ball |
| 1 | 0 | 242 | 0_shoe_size_wear_feet |
| 2 | 1 | 100 | 1_fit_comfort_super_great |
| 3 | 2 | 24 | 2_love_daughter_ship_fast |
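
The weighting behind these keywords can be sketched in plain Python. This is a simplified version under stated assumptions: term frequency is normalized within each class and multiplied by `log(1 + A / f_t)`, where `A` is the average number of tokens per class and `f_t` the frequency of term `t` across all classes, as described in the BERTopic documentation. Taking the square root of the term frequency for `reduce_frequent_words=True` is our understanding of that option; the library works on sparse matrices and may differ in detail. The function names and the toy tokens are made up.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens, reduce_frequent_words=False):
    """Simplified c-TF-IDF: {class: {term: weight}} from {class: [tokens]}."""
    counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}

    # f_t: frequency of each term across all classes
    total = Counter()
    for counter in counts.values():
        total.update(counter)

    # A: average number of tokens per class
    avg_words = sum(total.values()) / len(counts)

    weights = {}
    for c, counter in counts.items():
        n_tokens = sum(counter.values())
        weights[c] = {}
        for term, freq in counter.items():
            tf = freq / n_tokens
            if reduce_frequent_words:
                tf = math.sqrt(tf)  # dampens very frequent terms
            weights[c][term] = tf * math.log(1 + avg_words / total[term])
    return weights


def top_term(weights, c):
    """Return the highest-weighted term of class c."""
    return max(weights[c], key=weights[c].get)


# Toy classes: each list holds all tokens of the documents in one topic
toy = {
    0: ["shoe", "shoe", "size", "great"],
    1: ["fit", "fit", "great", "great"],
}
plain = c_tf_idf(toy)
reduced = c_tf_idf(toy, reduce_frequent_words=True)
print(top_term(plain, 0), top_term(reduced, 0))
```

In this toy case the square root lets the rarer term `size` overtake the frequent term `shoe` in topic 0, which is the kind of reordering the option is meant to produce.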


In [74]:
from bertopic.representation import MaximalMarginalRelevance

# MMR reduces redundancy among similar terms, so stemming is not applied here
representation_model = MaximalMarginalRelevance(diversity=1) 

vectorizer_model = CountVectorizer(
    ngram_range=(1, 1), 
    stop_words='english'
)

topic_model.update_topics(
    docs=docs, 
    vectorizer_model=vectorizer_model, 
    representation_model=representation_model
)
topic_model.get_topic_info()

|   | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5 | -1_arch_lightweight_aerobic_ball |
| 1 | 0 | 242 | 0_shoes_great_like_really |
| 2 | 1 | 100 | 1_fit_expected_cute_lightweight |
| 3 | 2 | 24 | 2_daughter_happier_services_picky |
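
To make the role of `diversity` concrete, here is a minimal sketch of the usual MMR selection rule: start with the word most similar to the topic, then repeatedly add the word that maximizes `(1 - diversity) * relevance - diversity * redundancy`. The two-dimensional vectors are invented for illustration, and the function is not BERTopic's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(topic_vec, word_vecs, top_n, diversity):
    """Select top_n words, trading relevance to the topic against redundancy."""
    relevance = {w: cosine(v, topic_vec) for w, v in word_vecs.items()}
    selected = [max(relevance, key=relevance.get)]  # most relevant word first
    while len(selected) < min(top_n, len(word_vecs)):
        best_word, best_score = None, -math.inf
        for word, vec in word_vecs.items():
            if word in selected:
                continue
            # redundancy: similarity to the closest already-selected word
            redundancy = max(cosine(vec, word_vecs[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected


# Invented 2-D "embeddings"; real embeddings have hundreds of dimensions
topic_vec = (1.0, 0.0)
word_vecs = {
    "shoes": (1.0, 0.1),
    "shoe": (1.0, 0.12),
    "comfortable": (0.7, 0.7),
    "shipping": (0.1, 1.0),
}
print(mmr(topic_vec, word_vecs, top_n=2, diversity=0.0))
print(mmr(topic_vec, word_vecs, top_n=2, diversity=1.0))
```

With `diversity=0` the near-duplicate `shoe` is picked second; with `diversity=1` the relevance term vanishes after the first pick, so the word least similar to `shoes` wins. This may explain why rather unusual keywords appear in the table above.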


In [104]:
# Transformer model for generating label text
from bertopic.representation import TextGeneration

representation_model = TextGeneration('gpt2', pipeline_kwargs={'max_new_tokens': 60})

topic_model.update_topics(
    docs=docs, 
    vectorizer_model=vectorizer_model, 
    ctfidf_model=ctfidf_model, 
    representation_model=representation_model
)
topic_model.get_topic_info()['Name']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0    -1_"What are the most common types of aerobic ...
1    0_Shoes: what, a name of clothing and footwear...
2    1_Anarchy\nWhen we see an organization we look...
3    2_Love and Shipping Love. When people talk abo...
Name: Name, dtype: object

In [90]:
doc_info = topic_model.get_document_info(docs)
doc_info = doc_info[doc_info['Representative_document']].sort_values(by='Topic')
doc_info

|   | Document | Topic | Name | Top_n_words | Probability | Representative_document |
|---|---|---|---|---|---|---|
| 26 | Great arch support and comfortable. | -1 | -1_arch_lightweight_aerobic_ball | arch - lightweight - aerobic - ball - comforta... | 0.0 | True |
| 54 | Lightweight, decent arch support and comfortab... | -1 | -1_arch_lightweight_aerobic_ball | arch - lightweight - aerobic - ball - comforta... | 0.0 | True |
| 280 | Not a lot of arch support but really lightweight. | -1 | -1_arch_lightweight_aerobic_ball | arch - lightweight - aerobic - ball - comforta... | 0.0 | True |
| 13 | Love these shoes.. they are so comfortable. | 0 | 0_shoes_great_like_really | shoes - great - like - really - ve - day - fee... | 1.0 | True |
| 321 | Size 5. Very comfortable shoes. Love! | 0 | 0_shoes_great_like_really | shoes - great - like - really - ve - day - fee... | 1.0 | True |
| 361 | I love this shoes they are so comfortable | 0 | 0_shoes_great_like_really | shoes - great - like - really - ve - day - fee... | 1.0 | True |
| 34 | I love them! They fit perfect and are very com... | 1 | 1_fit_expected_cute_lightweight | fit - expected - cute - lightweight - daughter... | 1.0 | True |
| 202 | Great fit so comfortable love them!!!!! | 1 | 1_fit_expected_cute_lightweight | fit - expected - cute - lightweight - daughter... | 1.0 | True |
| 342 | Perfect fit, very comfortable, and a great col... | 1 | 1_fit_expected_cute_lightweight | fit - expected - cute - lightweight - daughter... | 1.0 | True |
| 95 | Love love love | 2 | 2_daughter_happier_services_picky | daughter - happier - services - picky - couldn... | 1.0 | True |


In [94]:
doc_info.loc[13]['Document']

'Love these shoes.. they are so comfortable.'