# Tutorials on Lonestar 6: BERTopic - Basic Model Construction

BERTopic is a powerful package that uses transformers and c-TF-IDF in a sequential, customizable pipeline to cluster text and extract interpretable topics, each represented by a group of words. In an optional final step, an LLM can take those word groups and the documents they were derived from and produce readable English-language topic descriptions.

If you are new to BERTopic, we strongly recommend that you at least read the description of the [underlying algorithm](https://maartengr.github.io/BERTopic/algorithm/algorithm.html) in the [documentation](https://maartengr.github.io/BERTopic/index.html) to understand the steps in the BERTopic pipeline and the options available. The rest of the package documentation discusses each component and the available model options in more detail.

## Load necessary Libraries

### General Operating Libraries and UI/UX parts

In [1]:
import os
import logging
import sys
import json
import numpy as np
import pandas as pd

from IPython.display import Markdown, display

# Provides UI/UX widgets within Notebook
import ipywidgets as widgets
from ipywidgets import Layout, Button, Box

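# Disable Hugging Face tokenizers parallelism to avoid fork-related warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Render Plotly figures in an iframe so they display inside the notebook
import plotly.io as pio
pio.renderers.default = 'iframe'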

### Set notebook configuration options 
These parameters do not impact the BERTopic model, but rather how elements are displayed in this notebook.

In [2]:
# Layout Options for widgets
items_layout = Layout(width='auto')

items_style = {'description_width': 'initial'}

box_layout = Layout(display='flex',
                    flex_flow='column',
                    align_items='stretch',
                    border='solid',
                    width='50%')

## BERTopic Pipeline Overview
The 'Detailed Overview' on ['The Algorithm'](https://maartengr.github.io/BERTopic/algorithm/algorithm.html) page linked above provides a detailed description of each step below. Each step is highly customizable, and you can swap in a different model at any step. The following list gives a quick overview of the default options for each step.

1) **Extract mathematical embeddings from documents**
    * Default model for English: [SentenceTransformers](https://www.sbert.net/) "all-MiniLM-L6-v2"
    * Default multilingual model: "paraphrase-multilingual-MiniLM-L12-v2"  
2) **Reduce Dimensionality**  
   * Default: [UMAP](https://github.com/lmcinnes/umap)  
   * Alternatives: [Guide](https://maartengr.github.io/BERTopic/getting_started/dim_reduction/dim_reduction.html) for choosing alternate options
3) **Cluster reduced embeddings**
    * Default: density-based clustering with [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html)    
    * [Guide](https://maartengr.github.io/BERTopic/getting_started/clustering/clustering.html) for selecting alternative options
4) **Vectorize topics**  
    * English Default: CountVectorizer
5) **Create Topic representations**  
    * Default: c-TF-IDF.  Creates topic representation from bag-of-words matrix
6) **Fine-tune the topic representations (optional)**
    * No default. This step can use a number of different models or LLMs to convert the topic word list into English-language sentences.
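
Steps 4 and 5 above can be illustrated without any of the BERTopic machinery. The sketch below is a simplified, pure-Python version of the c-TF-IDF weighting described in the BERTopic documentation, W(x, c) = tf(x, c) * log(1 + A / f(x)), where tf(x, c) is the frequency of word x in topic c, f(x) is its frequency across all topics, and A is the average number of words per topic. The function names and the toy clusters are hypothetical, the tokenizer is a bare whitespace split rather than CountVectorizer, and the library's own implementation differs in detail.

```python
import math
from collections import Counter


def class_tfidf(docs_per_topic):
    """Return {topic: {word: score}} using a simplified c-TF-IDF weighting."""
    # Step 4 (vectorize): join each topic's documents and count its words.
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # f(x): frequency of each word across all topics.
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)

    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(counts)

    # Step 5 (weight): normalized term frequency times the class-based IDF.
    scores = {}
    for topic, topic_counts in counts.items():
        n_words = sum(topic_counts.values())
        scores[topic] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in topic_counts.items()
        }
    return scores


def top_words(word_scores, n=3):
    """Return the n highest-scoring words for one topic."""
    return sorted(word_scores, key=word_scores.get, reverse=True)[:n]


# Hypothetical toy clusters standing in for the output of steps 1-3.
toy_topics = {
    0: ["the game was great", "the team won the game"],
    1: ["the key is secret", "the encryption key"],
}
toy_scores = class_tfidf(toy_topics)
for topic, word_scores in toy_scores.items():
    print(topic, top_words(word_scores))
```

In the first toy cluster the word "the" occurs more often than "game", yet it scores lower because it also appears in the other cluster. This is the effect that lets c-TF-IDF surface words that distinguish one topic from the rest.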








## BERTopic Quick Start
This section will follow the steps in the ['Quick Start'](https://maartengr.github.io/BERTopic/getting_started/quickstart/quickstart.html) section of the BERTopic documentation.

In this section we will do the following:
1) Load the example documents (here a dataset from scikit-learn)
2) Construct the BERTopic model using all default models and fit the docs selected
3) Save our results for future exploration
4) Fine-tune topic representations
5) Examine & Visualize Results (Touched on - Additional exploration in other notebooks.)

### 1. Load Demo Data

For this example notebook, we use the same scikit-learn '20 newsgroups' dataset used in the BERTopic documentation to look at the various options for running the algorithm. The documents are loaded with headers, footers, and quotes removed, producing a list of text strings.

**A note on the definition of 'document'.**  
In BERTopic, a 'document' is generally a short chunk of text rather than the entire text of a longer piece (an article, a book chapter, etc.) that you might ordinarily think of as a document. For longer texts, you ingest the full text and break it into smaller chunks at meaningful breakpoints (sentences, paragraphs, etc.). Within BERTopic, each of these smaller chunks is a 'document', not the full-length text it derives from.
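
As a minimal sketch of the chunking described above, the following splits one long text into paragraph-sized 'documents' at blank lines. The function name and the sample text are hypothetical, and splitting on paragraphs is only one possible choice of breakpoint; sentence-level splitting would need a proper sentence tokenizer.

```python
import re


def split_into_documents(text):
    """Split one long text into paragraph-sized 'documents'."""
    # Treat one or more blank lines as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse internal line breaks and repeated spaces.
    cleaned = [" ".join(paragraph.split()) for paragraph in paragraphs]
    # Drop empty fragments left by leading or trailing blank lines.
    return [paragraph for paragraph in cleaned if paragraph]


# Hypothetical long text with two paragraphs.
long_text = (
    "\n\nThe first paragraph talks about hockey\nand the playoffs.\n\n"
    "The second paragraph is about encryption keys.\n\n"
)
chunks = split_into_documents(long_text)
print(len(chunks), chunks)
```

The resulting list of strings has the same shape as the `docs` list used in the rest of this notebook.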

In [3]:
%%time 
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

CPU times: user 1.4 s, sys: 180 ms, total: 1.58 s
Wall time: 1.79 s
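
Stripping headers, footers, and quotes leaves some posts empty or only a word or two long; the topic tables later in this notebook include a topic built from replies such as "Huh?" and "Yep.". The sketch below shows one possible pre-filter that drops such fragments before fitting. It is not part of the original workflow, the results shown in this notebook were produced without it, and both the function name and the `min_words` threshold are assumptions to adjust for your own data.

```python
def drop_short_documents(documents, min_words=3):
    """Keep only documents with at least min_words whitespace-separated tokens."""
    return [doc for doc in documents if len(doc.split()) >= min_words]


# Hypothetical examples in the style of the stripped newsgroup posts.
toy_docs = ["\nHuh?", "", "\nYep.\n", "The Pens are going to beat the Devils."]
kept = drop_short_documents(toy_docs)
print(f"kept {len(kept)} of {len(toy_docs)} documents")
```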


In [4]:
# Example of the first item in the documents list
docs[0]

"\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

### 2. Construct and Fit BERTopic Model

Configure the BERTopic model as desired. For this first example we simply use all the default options, so this is as straightforward as calling the `BERTopic()` constructor and assigning the result to a variable.

In [5]:
%%time
from bertopic import BERTopic

CPU times: user 5.54 s, sys: 1.92 s, total: 7.47 s
Wall time: 8.76 s


In [6]:
newsgroups_default_model = BERTopic()

Fit the documents using the model generated. 

**Please be patient as this may take a long time.**

In [7]:
%%time
newsgroups_default_topics, newsgroups_default_probs = newsgroups_default_model.fit_transform(docs)

CPU times: user 11min 39s, sys: 1min 27s, total: 13min 6s
Wall time: 52.4 s


We can now look at information on the most frequent topics identified. Topic -1 collects all the outliers (documents not assigned to any cluster) and is generally ignored.

In [8]:
newsgroups_default_model.get_topic_info().head()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 6722 | `-1_to_the_is_and` | `[to, the, is, and, of, you, for, in, it, that]` | `[Here is a press release from the White House....` |
| 1 | 0 | 1839 | `0_game_team_games_he` | `[game, team, games, he, players, season, hocke...` | `[NHL RESULTS FOR GAMES PLAYED 4/14/93.\n\n----...` |
| 2 | 1 | 594 | `1_key_clipper_chip_encryption` | `[key, clipper, chip, encryption, keys, escrow,...` | `[\nI am not an expert in the cryptography scie...` |
| 3 | 2 | 524 | `2_ites_cheek_yep_huh` | `[ites, cheek, yep, huh, ken, why, each, of, , ]` | `[\nHuh?, \nYep.\n, \n \n ...` |
| 4 | 3 | 470 | `3_israel_israeli_jews_arab` | `[israel, israeli, jews, arab, jewish, arabs, p...` | `[From: Center for Policy Research <cpr>\nSubje...` |
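
The `Count` column is the number of documents assigned to each topic. As a rough sketch of how to reason about the outlier topic, the following works on a hypothetical list of topic ids in the same form `fit_transform` returns (one integer per document, with -1 marking outliers). It computes the share of outliers and ranks the remaining topics by size. The variable names and toy values are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical topic assignments, one integer per document.
toy_assignments = [0, -1, 1, 0, -1, 2, 0, 1, -1, 0]

counts = Counter(toy_assignments)
outlier_share = counts[-1] / len(toy_assignments)

# Rank the real topics by size, leaving out the outlier topic -1.
ranked = [(topic, n) for topic, n in counts.most_common() if topic != -1]

print(f"outlier share: {outlier_share:.0%}")
print(ranked)
```

Running the same calculation on `newsgroups_default_topics` gives a quick check of how much of the corpus the default clustering left unassigned.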


### 3. Fine Tune the topic representation
BERTopic offers a number of topic representations that can improve on the bag-of-words representation. In addition, there is the option to use a Large Language Model (LLM) to take the representative documents and these keyword representations and produce a more human-interpretable version.

This section demonstrates these processes. You can either pass a single representation model as a parameter to the BERTopic model, or use the [Multiple Representations](https://maartengr.github.io/BERTopic/getting_started/multiaspect/multiaspect.html) technique as shown here. A subsequent notebook will work through using LLMs to turn the extracted keywords and representative documents into English-language topic representations.

For a more detailed overview of fine-tuning topics, review the documentation [here](https://maartengr.github.io/BERTopic/getting_started/representation/representation.html).

#### Multiple Representation

In [9]:
# Import BERTopic representation models
from bertopic.representation import KeyBERTInspired
from bertopic.representation import PartOfSpeech
from bertopic.representation import MaximalMarginalRelevance

In [11]:
# Define Representation models

# The main representation of a topic
main_representation = KeyBERTInspired()

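# Additional ways of representing a topic
# PartOfSpeech requires the spaCy model "en_core_web_sm" to be installed
aspect_model1 = PartOfSpeech("en_core_web_sm")
# A list chains models: KeyBERTInspired runs first, then MMR diversifies its output
aspect_model2 = [KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diversity=0.5)]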
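
To give a feel for what the `diversity` parameter does, here is a minimal pure-Python sketch of greedy maximal marginal relevance in its textbook form: each candidate word is scored as (1 - diversity) * relevance to the topic minus diversity * similarity to the closest word already selected. This is an illustration under stated assumptions, not the library's code; the function names and the 2-D vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr_select(topic_vector, word_vectors, top_n=2, diversity=0.5):
    """Greedily pick words that are relevant to the topic yet unlike each other."""
    relevance = {w: cosine(vec, topic_vector) for w, vec in word_vectors.items()}
    # Seed the selection with the single most relevant word.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(word_vectors)):
        best_word, best_score = None, -math.inf
        for word, vec in word_vectors.items():
            if word in selected:
                continue
            # Redundancy: similarity to the closest word already chosen.
            redundancy = max(cosine(vec, word_vectors[s]) for s in selected)
            score = (1 - diversity) * relevance[word] - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected


# Made-up 2-D "embeddings"; real embeddings have hundreds of dimensions.
toy_topic = (1.0, 0.0)
toy_words = {
    "encryption": (1.0, 0.10),
    "encrypted": (1.0, 0.15),
    "nsa": (0.7, -0.7),
}
print(mmr_select(toy_topic, toy_words, diversity=0.0))
print(mmr_select(toy_topic, toy_words, diversity=0.5))
```

With `diversity=0.0` the near-duplicate "encrypted" is picked second; with `diversity=0.5`, the value used above, the less redundant "nsa" is picked instead.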

In [12]:
# Add all models together to be run in a single `fit`
representation_model = {
    "Main": main_representation,
    "Aspect1":  aspect_model1,
    "Aspect2":  aspect_model2
}

In [13]:
multi_rep_topic_model = BERTopic(representation_model=representation_model).fit(docs)

In [14]:
multi_rep_topic_model.get_topic_info().head()

| | Topic | Count | Name | Representation | Aspect1 | Aspect2 | Representative_Docs |
|---|---|---|---|---|---|---|---|
| 0 | -1 | 6898 | `-1_god_what_our_if` | `[god, what, our, if, we, information, about, o...` | `[will, one, use, other, more, people, god, onl...` | `[god, if, we, information, be, use, system, sh...` | `[(Well, I'll email also, but this may apply to...` |
| 1 | 0 | 1817 | `0_nhl_playoffs_flyers_rangers` | `[nhl, playoffs, flyers, rangers, puck, leafs, ...` | `[game, team, games, players, season, hockey, p...` | `[nhl, playoffs, flyers, rangers, puck, leafs, ...` | `[The problem with your nihilistic approach, Ro...` |
| 2 | 1 | 582 | `1_encryption_clipper_crypto_encrypted` | `[encryption, clipper, crypto, encrypted, nsa, ...` | `[key, clipper, chip, encryption, keys, escrow,...` | `[encryption, clipper, nsa, enforcement, agenci...` | `[Here is a revised version of my summary which...` |
| 3 | 2 | 525 | `2_huh___` | `[huh, , , , , , , , , ]` | `[, , , , , , , , , ]` | `[huh, , , , , , , , , , , , , , , , , , , , , ...` | `[\nYep.\n, \n \n ...` |
| 4 | 3 | 476 | `3_palestinians_gaza_palestinian_israeli` | `[palestinians, gaza, palestinian, israeli, isr...` | `[israeli, arab, arabs, palestinian, peace, sta...` | `[palestinians, israels, zionism, jerusalem, le...` | `[From: Center for Policy Research <cpr>\nSubje...` |


### 4. Visualizing our results

In [15]:
# visualize_topics() needs the fitted model from step 2 in memory;
# after a kernel restart, re-run the cells above before this one.
newsgroups_default_model.visualize_topics()