# Setup and Disclaimer

Based on the Documentation and Guide of BERTopic: https://maartengr.github.io/BERTopic/index.html

Run the following cell to install the necessary dependencies.
On Windows, hdbscan requires Microsoft Visual C++ 14.0 or higher.

In [1]:
! pip install bertopic



# BERTopic QuickStart

First we load a dataset for exploration.

The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html

Then we fit the default BERTopic model on the documents. fit_transform returns the top topic and its probability for each document.

In [2]:
%%time
# the default model will take up to 10 minutes to run for 5000 documents
# BERTopic does have options for GPU acceleration

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))['data']
subset = docs[0:5000]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(subset)


URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)>

Next we take a look at the most frequent topics. Topic -1 collects all documents that the HDBSCAN algorithm labeled as outliers.

In [None]:
topic_model.get_topic_info().head(10)
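
To see how many documents ended up in the outlier topic, the topic assignments can be tallied directly. The sketch below uses a small hand-written `example_topics` list, a hypothetical stand-in for the `topics` returned by `fit_transform`, so the numbers are illustrative only.

```python
from collections import Counter

# Hypothetical topic assignments; in the notebook this would be `topics`
example_topics = [0, -1, 2, 0, 1, -1, 0, 2, -1, 0]

counts = Counter(example_topics)

# Topic -1 is the outlier bucket
n_outliers = counts[-1]
outlier_share = n_outliers / len(example_topics)
print(f"{n_outliers} of {len(example_topics)} documents are outliers ({outlier_share:.0%})")

# Most frequent topics, with the outlier topic filtered out
top_topics = [(topic, n) for topic, n in counts.most_common() if topic != -1]
print(top_topics)
```

A large outlier share is a common reason to revisit the clustering settings before interpreting the topics.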

We can inspect an individual topic and its topic representation.

In [None]:
topic_model.get_topic(0)

We can also extract information at document level.

In [None]:
topic_model.get_document_info(subset)

Let's take a look into the default configuration.

In [None]:
topic_model.get_params()

# BERTopic Step by Step

We load the individual libraries so we can define the modules.

In [3]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

Next, we define the individual modules of the BERTopic pipeline.

In [None]:
# Step 1 - Extract embeddings https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embedding_model

In [None]:
# Step 2 - Reduce dimensionality https://maartengr.github.io/BERTopic/getting_started/dim_reduction/dim_reduction.html
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
umap_model

In [None]:
# Step 3 - Cluster reduced embeddings https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
hdbscan_model

In [None]:
# Step 4 - Tokenize topics https://maartengr.github.io/BERTopic/getting_started/vectorizers/vectorizers.html
vectorizer_model = CountVectorizer()

In [None]:
# Step 5 - Create topic representation https://maartengr.github.io/BERTopic/getting_started/ctfidf/ctfidf.html
ctfidf_model = ClassTfidfTransformer()
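
To make Step 5 more concrete, here is a simplified pure-Python sketch of the class-based weighting described in the BERTopic documentation: the term frequency of a word within a class, multiplied by log(1 + A / f), where A is the average number of words per class and f is the frequency of the word across all classes. The toy `class_tokens` data and the helper `c_tf_idf` are made up for illustration; this is not the library's implementation, which works on sparse matrices and offers further options.

```python
import math
from collections import Counter

# Hypothetical toy data: all documents of one topic joined into one token list
class_tokens = {
    0: ["game", "game", "the", "the", "team"],
    1: ["space", "space", "the", "the", "orbit"],
}

def c_tf_idf(class_tokens):
    """Return {class: {word: weight}} with weight = tf * log(1 + A / f)."""
    class_counts = {c: Counter(tokens) for c, tokens in class_tokens.items()}

    # f: frequency of each word across all classes
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per class
    avg_words = sum(len(t) for t in class_tokens.values()) / len(class_tokens)

    scores = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            word: (count / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, count in counts.items()
        }
    return scores

scores = c_tf_idf(class_tokens)
for c, word_scores in scores.items():
    ranked = sorted(word_scores, key=word_scores.get, reverse=True)
    print(c, ranked)
```

In this toy example, "the" is as frequent as "game" within class 0, but it also appears in class 1, so it receives a lower weight and drops to the bottom of the ranking.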

In [None]:
# Step 6 - (Optional) Fine-tune topic representations with another model https://maartengr.github.io/BERTopic/getting_started/representation/representation.html
representation_model = KeyBERTInspired()


In [None]:
# All steps put together
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Vectorize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)

### Integration with Hugging Face

We can use the transformers feature-extraction pipeline to load any model from the Hugging Face Hub.

In [None]:
from transformers.pipelines import pipeline

embedding_model = pipeline("feature-extraction", model="Supabase/gte-small")
topic_model = BERTopic(embedding_model=embedding_model)


## Exercises Block 1: Explore Topics with Visualizations

Starting Point: https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_terms.html

1) Topic word scores 
2) Topic similarity 
3) Topic hierarchy 


## Exercises Block 2: Modify the pipeline

Starting Point: https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html

1) Set the pipeline to generate 20 topics.
2) Remove stopwords for the topic representation. 
3) Control the randomness in UMAP.
4) Take a look at alternative topic representations, e.g. MMR.
5) Try anything you think might improve the topics!

# Questions

### 1. What are use cases for topic modeling?
Automatically tag customer support tickets
Route conversations to the right teams based on topic

### 2. Why is dimensionality reduction an important step in the BERTopic pipeline?
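To avoid the curse of dimensionality: in high-dimensional spaces, data points tend to become nearly equidistant from one another, which makes it hard for a density-based clustering algorithm to separate dense regions from sparse ones.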
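
The tendency toward equal distances can be illustrated with a small experiment. The sketch below draws uniformly random points (not real embeddings) and measures, for one query point, how much larger the farthest distance is than the nearest one. The helper `relative_contrast` and its parameters are made up for this illustration.

```python
import math
import random

def relative_contrast(n_dims, n_points=200, seed=0):
    """(max - min) / min distance from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)

contrast_low = relative_contrast(2)     # e.g. after strong reduction
contrast_high = relative_contrast(500)  # a high-dimensional space

print(f"relative contrast in   2 dimensions: {contrast_low:.2f}")
print(f"relative contrast in 500 dimensions: {contrast_high:.2f}")
```

The contrast shrinks sharply as the number of dimensions grows, meaning that "near" and "far" neighbors become hard to tell apart.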

### 3. Suppose you want more coarse or fine-grained topics; how could you adapt the pipeline to change the output accordingly?
For finer-grained topics, increase the number of clusters; in HDBSCAN this means reducing the minimum number of elements per cluster (`min_cluster_size`). For coarser topics, do the opposite.


### 4. What is the intuition behind c-TF-IDF (class-based term frequency-inverse document frequency)?




Notes: 
1. 
2. 
3. 