# Tutorials on Lonestar 6: BERTopic - Basic Model Construction

BERTopic is a powerful package that uses transformers and c-TF-IDF in a sequential, customizable pipeline to cluster text and extract interpretable topics expressed as groups of words. When an LLM is used in the optional final step, it takes these word groupings and the underlying documents they were derived from and produces readable English-language topic descriptions.

If you are new to BERTopic, we strongly recommend that you at least read the description of the [underlying algorithm](https://maartengr.github.io/BERTopic/algorithm/algorithm.html) in the [documentation](https://maartengr.github.io/BERTopic/index.html) to understand the steps in the BERTopic pipeline and the options available. The rest of the package documentation discusses each component and the available model options in more detail.

## Load necessary Libraries

### General Operating Libraries and UI/UX parts

In [7]:
import os
import logging
import sys

from IPython.display import Markdown, display

# Provides UI/UX widgets within Notebook
import ipywidgets as widgets
from ipywidgets import Layout, Button, Box

### Set notebook configuration options 
These parameters do not impact the BERTopic model, but rather how elements are displayed in this notebook.

In [8]:
# Layout Options for widgets
items_layout = Layout( width='auto')

items_style={'description_width': 'initial'}

box_layout = Layout(display='flex',
                    flex_flow='column',
                    align_items='stretch',
                    border='solid',
                    width='50%')

## BERTopic Pipeline Overview
The 'Detailed Overview' on ['The Algorithm'](https://maartengr.github.io/BERTopic/algorithm/algorithm.html) page linked above describes each step below in detail. Each step is highly customizable, and the user can swap in a different model at any step. The following list gives a quick overview of the default options for each step.

1) **Extract mathematical embeddings from documents**
    * Default model for English: [SentenceTransformers](https://www.sbert.net/) "all-MiniLM-L6-v2"
    * Default multilingual model: "paraphrase-multilingual-MiniLM-L12-v2"  
2) **Reduce Dimensionality**  
   * Default: [UMAP](https://github.com/lmcinnes/umap)  
   * Alternatives: [Guide](https://maartengr.github.io/BERTopic/getting_started/dim_reduction/dim_reduction.html) for choosing alternate options
3) **Cluster reduced embeddings**
    * Default: density-based clustering with [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html)    
    * [Guide](https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html) for selecting alternative options
4) **Vectorize topics**  
    * English Default: CountVectorizer
5) **Create Topic representations**  
    * Default: c-TF-IDF.  Creates topic representation from bag-of-words matrix
6) **Fine-tune the topic representations (optional)**
    * No default. This step can use a number of different models or LLMs to convert the topic word list into English-language sentences.
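
To make step 5 more concrete, here is a minimal, dependency-free sketch of the c-TF-IDF weighting. It assumes the form given in the BERTopic documentation, where the weight of a word in a topic is its term frequency within that topic multiplied by `log(1 + A / f)`, with `A` the average number of words per topic and `f` the frequency of the word across all topics. This is not BERTopic's own implementation, and the toy word lists and the function name `c_tf_idf` are invented for illustration.

```python
import math
from collections import Counter

# Toy "topics": all documents in a cluster are joined into one bag of words.
# These word lists are made up for illustration only.
topic_words = {
    0: "game team game season players".split(),
    1: "key chip encryption key keys".split(),
}

def c_tf_idf(topic_words):
    """Return {topic: {word: weight}} using tf * log(1 + A / f)."""
    counts = {t: Counter(words) for t, words in topic_words.items()}

    # f: frequency of each word across all topics
    total = Counter()
    for c in counts.values():
        total.update(c)

    # A: average number of words per topic
    avg_words = sum(len(w) for w in topic_words.values()) / len(topic_words)

    weights = {}
    for t, c in counts.items():
        n = sum(c.values())  # words in this topic, used to normalize tf
        weights[t] = {
            w: (k / n) * math.log(1 + avg_words / total[w])
            for w, k in c.items()
        }
    return weights

weights = c_tf_idf(topic_words)

# The highest-weighted words form the topic representation.
top_words = {t: sorted(w, key=w.get, reverse=True)[:2] for t, w in weights.items()}
print(top_words)
```

In BERTopic itself the bag-of-words counts come from the vectorizer in step 4, so changing the vectorizer (for example, removing stop words) changes which words can appear in a topic representation.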








## BERTopic Quick Start
This section will follow the steps in the ['Quick Start'](https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html) section of the BERTopic documentation.

In this section we will do the following:
1) Load the example documents (here a dataset from scikit-learn)
2) Construct the BERTopic model using all default models and fit it to the selected documents
3) Examine & Visualize Results
4) Fine-tune topic representations

### 1. Load Demo Data

For this example workbook, we will use the same sklearn '20 newsgroups' dataset used in the BERTopic documentation to explore the various options for running the algorithm. The documents are loaded with headers, footers, and quotes removed, producing a list of text strings.

**A note on the definition of 'document'.**  
In BERTopic, a 'document' is generally a short chunk of text rather than the entire text of a longer piece (article, book chapter, etc.) that you might ordinarily think of as a document. For longer text, you ingest the full text and break it into smaller chunks at a meaningful breakpoint (sentence, paragraph, etc.). Within BERTopic, each smaller chunk is treated as a 'document', not the full-length text it derives from.
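
As a rough illustration of the chunking described above, the sketch below splits a longer text into paragraph-level 'documents' on blank lines. The helper name `split_into_documents` and the `min_chars` threshold are assumptions made for this example, not part of BERTopic; sentence-level splitting would need a different rule.

```python
import re

def split_into_documents(text, min_chars=20):
    """Split long text on blank lines; drop fragments shorter than min_chars."""
    paragraphs = re.split(r"\n\s*\n", text)
    return [p.strip() for p in paragraphs if len(p.strip()) >= min_chars]

# A made-up three-paragraph "article"; the last fragment is too short to keep.
article = (
    "First paragraph about hockey playoffs.\n\n"
    "Second paragraph about encryption chips.\n\n"
    "Ok."
)
chunks = split_into_documents(article)
print(len(chunks))
```

The resulting list of strings has the same shape as the `docs` list used in the rest of this notebook.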

In [9]:
%%time 
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

CPU times: total: 1.72 s
Wall time: 1.93 s


In [10]:
# Example of the first item in the documents list
docs[0]

"\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

In [11]:
## add option to pull in own list of documents from $SCRATCH
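
One possible way to fill in this option is sketched below. It assumes your own documents are stored as plain-text files, one document per file, in a folder under `$SCRATCH`. The folder name `my_docs`, the `*.txt` pattern, and the helper `load_docs_from_dir` are all hypothetical and should be adjusted to match your data.

```python
import os
from pathlib import Path

def load_docs_from_dir(directory, pattern="*.txt"):
    """Read every matching file into one string per file, skipping empty files."""
    docs = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="ignore").strip()
        if text:
            docs.append(text)
    return docs

# Hypothetical location: a "my_docs" folder under $SCRATCH. Falls back to the
# current directory when $SCRATCH is not set.
docs_dir = Path(os.environ.get("SCRATCH", ".")) / "my_docs"
my_docs = load_docs_from_dir(docs_dir) if docs_dir.is_dir() else []
print(f"Loaded {len(my_docs)} documents from {docs_dir}")
```

If the files are long, they would still need to be broken into smaller chunks before being passed to BERTopic.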

### 2. Construct and Fit BERTopic Model

Configure the BERTopic model as desired. For this first example we use all the default options, so this is as simple as calling the BERTopic constructor and assigning the result to a variable. We set 'verbose=True' so that progress messages are logged while the model is fitting, which can take some time.

In [13]:
%%time
from bertopic import BERTopic

CPU times: total: 0 ns
Wall time: 0 ns


In [14]:
base_topic_model = BERTopic(verbose=True)
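
For reference, the default construction above corresponds roughly to passing each pipeline component explicitly. The sketch below is an approximation: the parameter values are the defaults described in the BERTopic documentation and may differ between versions, and the names `DEFAULT_STEPS` and `build_explicit_default_model` are invented for this example.

```python
# Step name -> default component, as listed in the pipeline overview.
DEFAULT_STEPS = {
    "embedding": "all-MiniLM-L6-v2",
    "dimensionality_reduction": "UMAP",
    "clustering": "HDBSCAN",
    "vectorizer": "CountVectorizer",
    "topic_representation": "c-TF-IDF",
}

def build_explicit_default_model(verbose=True):
    """Build a BERTopic model with each default component passed explicitly."""
    # Lazy imports: these packages are only needed when the function is called.
    from bertopic import BERTopic
    from bertopic.vectorizers import ClassTfidfTransformer
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    return BERTopic(
        embedding_model=SentenceTransformer(DEFAULT_STEPS["embedding"]),
        # Assumed default values; check them against your installed version.
        umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
        hdbscan_model=HDBSCAN(min_cluster_size=10, metric="euclidean",
                              cluster_selection_method="eom", prediction_data=True),
        vectorizer_model=CountVectorizer(),
        ctfidf_model=ClassTfidfTransformer(),
        verbose=verbose,
    )
```

Writing the pipeline out this way makes it easier to swap a single component later while leaving the others at their defaults.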

Fit the documents using the model generated. 

**Please be patient as this may take a long time.**

In [15]:
%%time
base_topics, base_probs = base_topic_model.fit_transform(docs)

Batches:   0%|          | 0/589 [00:00<?, ?it/s]

2024-01-10 15:05:28,946 - BERTopic - Transformed documents to Embeddings
2024-01-10 15:06:59,696 - BERTopic - Reduced dimensionality
2024-01-10 15:07:06,163 - BERTopic - Clustered reduced embeddings


CPU times: total: 1h 3min 30s
Wall time: 1h 4min 13s


We can now look at information on the most frequent topics identified. Topic -1 collects all the outliers and is generally ignored.

In [16]:
base_topic_model.get_topic_info().head()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 6494 | -1_to_the_is_of | `[to, the, is, of, and, you, it, for, in, that]` | `[(This is a continuation of an earlier post)\n...` |
| 1 | 0 | 1838 | 0_game_team_games_he | `[game, team, games, he, players, season, hocke...` | `[\nNo. Patrick Roy is the reason the game was...` |
| 2 | 1 | 590 | 1_key_clipper_chip_encryption | `[key, clipper, chip, encryption, keys, escrow,...` | `[\nI am not an expert in the cryptography scie...` |
| 3 | 2 | 530 | 2_ites_hello_cheek_hi | `[ites, hello, cheek, hi, yep, huh, ken, ignore...` | `[Hi,, Hello,, ites:]` |
| 4 | 3 | 454 | 3_drive_scsi_drives_ide | `[drive, scsi, drives, ide, disk, controller, h...` | `[\n[ First of all, please edit your postings. ...` |
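
Because topic -1 holds the outliers, it can be useful to check what share of the documents ended up there. The sketch below does this with a plain list of topic assignments. The helper `outlier_share` and the toy list are invented for illustration; with the fitted model you could pass `base_topics` instead.

```python
from collections import Counter

def outlier_share(topics):
    """Fraction of documents assigned to the outlier topic (-1)."""
    counts = Counter(topics)
    return counts[-1] / len(topics) if topics else 0.0

# Toy assignments standing in for `base_topics`.
toy_topics = [-1, 0, 0, 1, -1, 2, 0, 3]
print(outlier_share(toy_topics))  # 2 of 8 documents are outliers -> 0.25
```

A large outlier share is common with the default HDBSCAN clustering and is one reason to explore the alternative clustering options linked in the pipeline overview.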


In [12]:
## Save Model

In [17]:
# Method 1 - safetensors
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
# Save to the same directory that the load cell reads from
base_topic_model.save("models/base_topic_model", serialization="safetensors", save_ctfidf=True, save_embedding_model=embedding_model)


In [5]:
# Load model
base_saved = BERTopic.load("models/base_topic_model")

In [6]:
base_saved.get_topic_info().head()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 6505 | -1_to_the_and_of | `[to, the, and, of, is, in, for, you, it, that]` | |
| 1 | 0 | 1828 | 0_game_team_games_he | `[game, team, games, he, players, season, hocke...` | |
| 2 | 1 | 575 | 1_key_clipper_chip_encryption | `[key, clipper, chip, encryption, keys, escrow,...` | |
| 3 | 2 | 527 | 2_ites_cheek_yep_huh | `[ites, cheek, yep, huh, ken, forget, why, lets...` | |
| 4 | 3 | 467 | 3_israel_israeli_jews_arab | `[israel, israeli, jews, arab, jewish, arabs, p...` | |


### Modifying for consistency
Running the model again with exactly the same default inputs reveals an issue: the results are not guaranteed to be consistent between runs. The cause is the UMAP dimensionality reduction step, which is stochastic unless its random state is fixed.

In [None]:
base_topic_model_2 = BERTopic()
base_topics2, base_probs2 = base_topic_model_2.fit_transform(docs)

In [None]:
base_topic_model_2.get_topic_info().head()

In [114]:
base_topic_model.get_topic_info().head()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 6636 | -1_to_the_and_of | `[to, the, and, of, is, you, for, it, in, that]` | `[\nI'd like to field this one, if I may. Alth...` |
| 1 | 0 | 1825 | 0_game_team_games_he | `[game, team, games, he, players, season, hocke...` | `[Scoring stats for the Swedish NHL players, Ap...` |
| 2 | 1 | 615 | 1_key_clipper_chip_encryption | `[key, clipper, chip, encryption, keys, escrow,...` | `[April 16, 1993\n\nINITIAL EFF ANALYSIS OF CLI...` |
| 3 | 2 | 531 | 2_idjits_ites_hello_cheek | `[idjits, ites, hello, cheek, dancing, hi, yep,...` | `[Hello,, Hello,, \nDancing With Idjits.\n\n\n]` |
| 4 | 3 | 487 | 3_israel_israeli_jews_arab | `[israel, israeli, jews, arab, jewish, arabs, p...` | `[From: Center for Policy Research <cpr>\nSubje...` |
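
One common way to make runs repeatable is to pass BERTopic a UMAP model with a fixed `random_state`. The sketch below assumes the other UMAP settings should mirror the defaults described in the BERTopic documentation. The seed value 42 is arbitrary, and the names `UMAP_PARAMS` and `build_reproducible_model` are invented for this example.

```python
# UMAP settings assumed to mirror BERTopic's documented defaults, plus a fixed seed.
UMAP_PARAMS = dict(
    n_neighbors=15,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=42,  # arbitrary seed; any fixed integer works
)

def build_reproducible_model(verbose=True):
    """Return a BERTopic model whose UMAP step uses a fixed random_state."""
    # Lazy imports: the packages are only needed when the function is called.
    from bertopic import BERTopic
    from umap import UMAP

    return BERTopic(umap_model=UMAP(**UMAP_PARAMS), verbose=verbose)

# Example usage (commented out because fitting can take a long time):
# seeded_model = build_reproducible_model()
# seeded_topics, seeded_probs = seeded_model.fit_transform(docs)
```

Fixing the seed may slow down the UMAP step, since UMAP limits parallelism when a random state is set, so this trades some speed for repeatability.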


In [88]:
embedding_dd = widgets.Dropdown(options=embedding_options, description='Embedding Model:', layout=items_layout, style=items_style)
dimensionality_dd = widgets.Dropdown(options=dimensionality_options, description='Dimensionality Reduction:', layout=items_layout, style=items_style)
# Options below reflect the defaults listed in the pipeline overview
cluster_dd = widgets.Dropdown(options=['HDBSCAN'], description='Clustering:', layout=items_layout, style=items_style)
tokenizer_dd = widgets.Dropdown(options=['CountVectorizer'], description='Tokenizer:', layout=items_layout, style=items_style)
topic_representation = widgets.Dropdown(options=['c-TF-IDF'], description='Topic Representation:', layout=items_layout, style=items_style)

In [89]:
items = [embedding_dd, dimensionality_dd, cluster_dd, tokenizer_dd, topic_representation]

In [90]:


Box(children=items, layout=box_layout)

Box(children=(Dropdown(description='Embedding Model:', layout=Layout(width='auto'), options=(('English: all-Mi…

In [85]:
# Dimensionality
dimensionality_options = [
    "UMAP"
]
dimensionality_dd = widgets.Dropdown(options=dimensionality_options, description='Dimensionality Reduction:', style={'description_width': 'initial'})

In [53]:
dimensionality_dd

Dropdown(description='Dimensionality Reduction:', options=('UMAP',), style=DescriptionStyle(description_width…

In [10]:
# Placeholder classes for each pipeline step (not yet implemented)
class EmbeddingModel:
    def __init__(self):
        pass

class Dimensionality:
    def __init__(self):
        pass

class Clustering:
    def __init__(self):
        pass

class Tokenizer:
    def __init__(self):
        pass

class Weighting:
    def __init__(self):
        pass

class BERTopicModel:
    def __init__(self):
        self.button = widgets.Button(
            description='Build model algorithm',
            disabled=False,
            button_style='',
            tooltip='Create',
            icon='check'
        )

### BERTopic libraries

In [6]:
from bertopic import BERTopic

In [11]:
BERTopicModel().embedding

(Dropdown(options=(), value=None),)

In [2]:
# Bertopic Model Parameter widgets
sex_widget = widgets.Dropdown(options=['MALE', 'FEMALE'], description='Sex:')

In [3]:
sex_widget

Dropdown(description='Sex:', options=('MALE', 'FEMALE'), value='MALE')