# BERTopic exploration

This notebook explores the BERTopic package and the different options available for its sub-models.

In [1]:
# read input data

import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

fpath = '../example/data/data_processed_1prod_full.json'
df = pd.read_json(fpath, lines=True)
docs = df['reviewText'].astype(str)
docs

0      I always get a half size up in my tennis shoes...
1      Put them on and walked 3 hours with no problem...
2                                              excelente
3      The shoes fit well in the arch area. They are ...
4      Tried them on in a store before buying online ...
5                                      I recommend that!
6      My son likes these, and this is the 2nd pair h...
7                                            Comfortable
8                Fit fine...did not like color in person
9      The shoe is too large. When you do lunges it h...
10     Really great for walking I'm very glad I got t...
11     Love these shoes. My feet feel so much better....
12                                        ok but too big
13           Love these shoes.. they are so comfortable.
14     In really like these. I wear between a 9-9.5 w...
15     Love these shoes!\nSo stylish and comfortable....
16     This shoe is JUST OK. Its not as comfortable a...
17     Best tennis shoes I've h

In [4]:
# split reviews into sentences
import spacy

nlp = spacy.load('en_core_web_md')

temp = []
for doc in docs:
    doc = nlp(doc)
    for sent in doc.sents:
        temp.append(str(sent))
docs = temp
docs

['I always get a half size up in my tennis shoes.',
 'For some reason these feel to big in the heel area and wide.',
 'Put them on and walked 3 hours with no problem!',
 'Love them!',
 'So light feeling',
 'excelente',
 'The shoes fit well in the arch area.',
 'They are a little wider in the toe area of the shoe, you feel like you have a lot of room.',
 'This does not make the shoe uncomfortable, just had to get used to it.',
 'Love the shoe.',
 "Tried them on in a store before buying online so I knew they'd fit good.",
 'Overall I was looking for a durable cross training shoe that would hold up to my rigorous training and these have been great so far.',
 'They are really light and comfortable.',
 "Most importantly for me they have grips on the bottoms so my feet don't slide out from under me while doing planks, push-ups, etc.",
 'Highly satisfied with this purchase.',
 'I recommend that!',
 "My son likes these, and this is the 2nd pair he's worn.",
 'Comfortable',
 'Fit fine...did not

## BERTopic simplest usage

This is the simplest case, with every setting left at its default. Topic -1 collects the outliers, and the number of topics is determined automatically by the clustering algorithm.

The following figure shows the whole model in steps:

![Default model](https://maartengr.github.io/BERTopic/algorithm/default.svg)


In [5]:
from bertopic import BERTopic

topic_model_simplest = BERTopic()
topic, probs = topic_model_simplest.fit_transform(docs)
topic_model_simplest.get_topic_info()

|   | Topic | Count | Name |
|---|-------|-------|------|
| 0 | -1 | 82 | `-1_they_for_to_expected` |
| 1 | 0 | 370 | `0_shoes_the_these_and` |
| 2 | 1 | 73 | `1_comfortable_light_very_super` |
| 3 | 2 | 53 | `2_fit_perfect_great_they` |
| 4 | 3 | 51 | `3_size_ordered_half_wear` |
| 5 | 4 | 43 | `4_love_them_really_these` |
| 6 | 5 | 41 | `5_they_are_comfortable_feel` |
| 7 | 6 | 32 | `6_excellent_lol_excelente_perfect` |
| 8 | 7 | 30 | `7_nike_nikes_flex_quality` |
| 9 | 8 | 29 | `8_pair_second_this_my` |

In [6]:
# the most likely topic of a document and the probability that the document belongs to that topic
i = 5
topic[i], probs[i]

(7, 0.3738170092615666)

In [7]:
# approximate topic distribution for each document
topic_model_simplest.approximate_distribution(docs)[0]

array([[0.1394323 , 0.        , 0.        , ..., 0.19671724, 0.        ,
        0.        ],
       [0.28143717, 0.        , 0.        , ..., 0.20275315, 0.        ,
        0.        ],
       [0.3403342 , 0.        , 0.        , ..., 0.15067118, 0.        ,
        0.        ],
       ...,
       [0.22699865, 0.09132069, 0.10094738, ..., 0.1396909 , 0.        ,
        0.        ],
       [0.        , 0.        , 0.06038233, ..., 0.        , 0.08479222,
        0.        ],
       [0.35144841, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
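
Each row of this matrix is one document's approximate topic distribution, so the strongest topics for a document can be read off by ranking the row. The sketch below is a plain-Python illustration; `top_topics` is a hypothetical helper, and it assumes that column `j` corresponds to topic `j` (with the outlier topic excluded), which should be checked against the fitted model.

```python
def top_topics(distribution_row, k=3, min_weight=0.0):
    """Return up to k (topic_id, weight) pairs with weight > min_weight, largest first."""
    # Pair every weight with its column index, then rank by weight.
    ranked = sorted(enumerate(distribution_row), key=lambda pair: pair[1], reverse=True)
    return [(topic_id, float(weight)) for topic_id, weight in ranked[:k] if weight > min_weight]


# Hypothetical row standing in for one row of the matrix above.
example_row = [0.14, 0.0, 0.0, 0.20, 0.0]
print(top_topics(example_row))
```

The function accepts any sequence of numbers, so a row of the NumPy array returned by `approximate_distribution` can be passed in directly.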

## A more complicated example borrowed from [the BERTopic documentation](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#visual-overview)

In [6]:
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer

from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.dimensionality import BaseDimensionalityReduction


# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-mpnet-base-v1")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
# umap_model = BaseDimensionalityReduction() # empty model

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with 
# a `bertopic.representation` model
representation_model = KeyBERTInspired()

# All steps together
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)

topic, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()

|   | Topic | Count | Name |
|---|-------|-------|------|
| 0 | -1 | 812 | `-1_fit_nikes_fits_nike` |
| 1 | 0 | 110 | `0_comfortable_comfy_shoes_sneakers` |
| 2 | 1 | 16 | `1_love___` |


Some topics contain near-duplicate keywords such as `shoes_shoe` and `love_loves`.

This happens because BERTopic receives plain text without any preprocessing.

Raw text is good for the embedding and clustering steps, but not for the tokenization and representation steps.

Preprocessing such as stemming should therefore happen before c-TF-IDF, and the tokenization step (the vectorizer) is the place to do it.

In [21]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # wrap the default analyzer so that every token it yields is stemmed
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

vectorizer_model = StemmedCountVectorizer(
    ngram_range=(1, 1), 
    stop_words='english',
    analyzer='word'
)
topic_model.update_topics(docs, vectorizer_model=vectorizer_model)
topic_model.get_topic_info()

|   | Topic | Count | Name |
|---|-------|-------|------|
| 0 | -1 | 190 | `-1_support_arch_look_shoe` |
| 1 | 0 | 287 | `0_shoe_run_comfort_feet` |
| 2 | 1 | 73 | `1_comfort_light_super_lightweight` |
| 3 | 2 | 61 | `2_love_daughter_absolut_realli` |
| 4 | 3 | 61 | `3_size_order_half_nike` |
| 5 | 4 | 53 | `4_fit_perfect_great_expect` |
| 6 | 5 | 41 | `5_comfort_light_feel_realli` |
| 7 | 6 | 40 | `6_purchas_product_satisfi_price` |
| 8 | 7 | 31 | `7_excel_lol_excelent_perfect` |
| 9 | 8 | 30 | `8_nike_flex_qualiti_black` |

In [18]:
# fine-tune the c-TF-IDF model; this seems to make no significant difference
ctfidf_model = ClassTfidfTransformer(
    reduce_frequent_words=True
)
topic_model.update_topics(docs, ctfidf_model=ctfidf_model, vectorizer_model=vectorizer_model)
topic_model.get_topic_info()

|   | Topic | Count | Name |
|---|-------|-------|------|
| 0 | -1 | 190 | `-1_arch_support_bright_store` |
| 1 | 0 | 287 | `0_shoes_running_gym_training` |
| 2 | 1 | 73 | `1_super_comfortable_light_lightweight` |
| 3 | 2 | 61 | `2_loves_daughter_love_absolute` |
| 4 | 3 | 61 | `3_size_ordered_half_large` |
| 5 | 4 | 53 | `4_fit_perfect_stretch_fits` |
| 6 | 5 | 41 | `5_feel_offer_help_wish` |
| 7 | 6 | 40 | `6_purchase_satisfied_product_shipping` |
| 8 | 7 | 31 | `7_excellent_lol_excelente_update` |
| 9 | 8 | 30 | `8_nike_nikes_abit_quality` |

In [16]:
from bertopic.representation import MaximalMarginalRelevance

# MMR reduces redundancy among similar terms, so stemming is not applied here
representation_model = MaximalMarginalRelevance(diversity=1) 

vectorizer_model = CountVectorizer(
    ngram_range=(1, 1), 
    stop_words='english'
)

topic_model.update_topics(
    docs=docs, 
    vectorizer_model=vectorizer_model, 
    representation_model=representation_model
)
topic_model.get_topic_info()

|   | Topic | Count | Name |
|---|-------|-------|------|
| 0 | -1 | 190 | `-1_support_little_shoe_bright` |
| 1 | 0 | 287 | `0_shoe_great_gym_day` |
| 2 | 1 | 73 | `1_lightweight_start_padding_cloud` |
| 3 | 2 | 61 | `2_loves_absolutely_cute_em` |
| 4 | 3 | 61 | `3_half_wear_order_big` |
| 5 | 4 | 53 | `4_fit_nice_love_true` |
| 6 | 5 | 41 | `5_comfortable_light_offer_little` |
| 7 | 6 | 40 | `6_happy_fast_purchasing_recommend` |
| 8 | 7 | 31 | `7_lol_update_gracias_pay` |
| 9 | 8 | 30 | `8_nikes_years_abit_amazing` |

In [12]:
# Transformer for generating label text
from bertopic.representation import TextGeneration

representation_model = TextGeneration('gpt2', pipeline_kwargs={'max_new_tokens': 60})

topic_model.update_topics(
    docs=docs, 
    vectorizer_model=vectorizer_model, 
    ctfidf_model=ctfidf_model, 
    representation_model=representation_model
)
topic_model.get_topic_info()['Name']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


0     -1_"What are the most common keywords used by ...
1     0_Shoes: what, a name of clothing and footwear...
2     1_Someday, when I was very lazy, one of our ed...
3     2_I have a way. I want. I say yes. It's like m...
4     3_the new thing that is going to fix things.\n...
5     4_Fit, perfect, stretch, fits, easy, color, ex...
6     5_Feel\nThe keyword that is used for the feeli...
7     6_"I have a topic about our brand, product, se...
8                                       7_Greatness.___
9                                          8_"Nike "___
10    9_"The Relationship Between Latching and Latch...
11    10_'friction', the idea that when you lift ano...
12    11_grommets\nI know this, but here comes the k...
Name: Name, dtype: object

In [22]:
doc_info = topic_model.get_document_info(docs)
doc_info = doc_info[doc_info['Representative_document']].sort_values(by='Topic')
doc_info

|   | Document | Topic | Name | Top_n_words | Probability | Representative_document |
|---|----------|-------|------|-------------|-------------|-------------------------|
| 63 | Good arch support - I have a high arch!! | -1 | `-1_support_arch_look_shoe` | support - arch - look - shoe - color - great -... | 0.0 | True |
| 65 | Great arch support and comfortable. | -1 | `-1_support_arch_look_shoe` | support - arch - look - shoe - color - great -... | 0.0 | True |
| 229 | No arch support but l love the colors! | -1 | `-1_support_arch_look_shoe` | support - arch - look - shoe - color - great -... | 0.0 | True |
| 912 | I love this shoes they are so comfortable | 0 | `0_shoe_run_comfort_feet` | shoe - run - comfort - feet - wear - train - d... | 1.0 | True |
| 757 | SHOES. | 0 | `0_shoe_run_comfort_feet` | shoe - run - comfort - feet - wear - train - d... | 1.0 | True |
| 158 | GET THESE SHOES! | 0 | `0_shoe_run_comfort_feet` | shoe - run - comfort - feet - wear - train - d... | 1.0 | True |
| 648 | Super light weight comfortable! | 1 | `1_comfort_light_super_lightweight` | comfort - light - super - lightweight - comfi ... | 1.0 | True |
| 504 | Super light and comfortable! | 1 | `1_comfort_light_super_lightweight` | comfort - light - super - lightweight - comfi ... | 1.0 | True |
| 407 | Super light and very comfortable. | 1 | `1_comfort_light_super_lightweight` | comfort - light - super - lightweight - comfi ... | 1.0 | True |
| 542 | Love them. | 2 | `2_love_daughter_absolut_realli` | love - daughter - absolut - realli - gift - wi... | 1.0 | True |


In [23]:
hierarchical_topics = topic_model.hierarchical_topics(docs)
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics)

100%|██████████| 11/11 [00:00<00:00, 482.96it/s]
