# BERTopic

BERTopic uses BERT embeddings and clustering algorithms to discover topics. Topics are characterised by dense clusters of semantically similar embeddings, identified through dimensionality reduction and clustering. 

Rather than a model, BERTopic is a framework that contains a handful of sub-models, each providing a necessary step in topic representation. These are:
* **Embeddings.** This stage represents our text data as numeric vectors that capture semantic meaning and context. This is a core advantage of BERTopic compared to traditional methods such as LDA.
* **Dimensionality reduction.** We then take the above embedding vectors and compress them into fewer dimensions to aid computational performance.
* **Clustering.** We then cluster our reduced-dimension embeddings via unsupervised methods. Each resulting cluster is treated as a topic.
* **TF-IDF.** 'Term Frequency - Inverse Document Frequency' is the approach taken to extract the key words and phrases that represent each topic. TF-IDF favours terms that are frequent within a topic but rare across our wider text corpus.
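
To make the last step more concrete, below is a minimal sketch of the class-based weighting (c-TF-IDF) described in the BERTopic documentation: all documents in a cluster are joined into one 'class', and each term is scored as its frequency in that class multiplied by log(1 + A / f_t), where A is the average number of words per class and f_t is the term's frequency across all classes. The toy clusters and the whitespace tokeniser are assumptions for illustration only; the library does this with `CountVectorizer` and its own transformer, and details such as normalisation differ.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster with tf * log(1 + A / f_t)."""
    # Join every document in a cluster into one bag of words
    class_counts = {
        label: Counter(word for doc in docs for word in doc.lower().split())
        for label, docs in clusters.items()
    }
    # f_t: frequency of each term across all clusters
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per cluster
    avg_words = sum(total_counts.values()) / len(class_counts)
    return {
        label: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }


# Hypothetical toy clusters, not taken from the report data
toy_clusters = {
    0: ["prison staff observed the prisoner", "prison staff training"],
    1: ["medication prescribed by staff", "medication review"],
}
scores = c_tf_idf(toy_clusters)
for label, term_scores in scores.items():
    top_terms = sorted(term_scores, key=term_scores.get, reverse=True)[:2]
    print(label, top_terms)
```

Note how 'staff', which appears in both toy clusters, is down-weighted relative to terms that are specific to a single cluster.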

In BERTopic's modular design, each 'module' is independent, meaning that the specific algorithmic approach can be changed for any component, and the remaining steps will be compatible. 

Although not originally supported, BERTopic v0.13 (January 2023) and later also allow us to approximate a probabilistic topic distribution for each report via '.approximate_distribution'.
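
A minimal sketch of how this could be used is shown below. It assumes `topic_model` is an already fitted BERTopic model, and that column `i` of the returned matrix corresponds to topic `i` (with the outlier topic `-1` excluded). The helper name `dominant_topics` is hypothetical.

```python
def dominant_topics(topic_model, docs, top_n=3):
    """Return up to top_n (topic_index, weight) pairs for each document."""
    # approximate_distribution returns a pair:
    # (topic distributions, token-level distributions)
    topic_distr, _ = topic_model.approximate_distribution(docs)
    results = []
    for row in topic_distr:
        weights = [float(w) for w in row]
        # Rank topic indices from highest to lowest weight
        ranked = sorted(range(len(weights)), key=lambda i: weights[i],
                        reverse=True)
        # Keep only topics with a non-zero weight
        results.append([(i, weights[i]) for i in ranked[:top_n]
                        if weights[i] > 0])
    return results
```

A document whose row is all zeros (no topic was similar enough) comes back as an empty list rather than being forced into a topic.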

First, we need to read in our cleaned data.

<br>


In [1]:
import pandas as pd
import numpy as np

# Read report data
data = pd.read_csv('../Data/cleaned.csv')

# Extract CleanContent column
reports = data['CleanContent']

## 1. Embeddings

### Sentence splitter
Before embedding our text, it's useful to first split our reports into sentences. BERTopic generally performs poorly on larger documents, as this tends to result in noisy topics. 

Splitting our reports into sentences means that BERTopic will not assign a topic to each individual report out-of-the-box, but we can do this manually (for example, by aggregating the sentence-level topics within each report).
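
One way to do this aggregation, sketched under the assumption that we record which report each sentence came from at the point of flattening, is to count sentence-level topics per report. The helper names and the toy topic assignments below are hypothetical.

```python
from collections import Counter


def flatten_with_ids(sentences_per_report):
    """Flatten nested sentences, recording which report each came from."""
    flat, report_ids = [], []
    for report_id, sents in enumerate(sentences_per_report):
        flat.extend(sents)
        report_ids.extend([report_id] * len(sents))
    return flat, report_ids


def topics_per_report(report_ids, sentence_topics, outlier=-1):
    """Count sentence-level topics within each report, ignoring outliers."""
    counts = {}
    for report_id, topic in zip(report_ids, sentence_topics):
        counts.setdefault(report_id, Counter())
        if topic != outlier:
            counts[report_id][topic] += 1
    return counts


# Hypothetical example: two reports with made-up sentence topics
nested = [["s1", "s2", "s3"], ["s4", "s5"]]
flat, report_ids = flatten_with_ids(nested)
sentence_topics = [2, 2, -1, 0, 3]
report_topics = topics_per_report(report_ids, sentence_topics)
print({rid: c.most_common(1) for rid, c in report_topics.items()})
```

Counting is only one possible choice; weighting each sentence by its topic probability would be another.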

In [2]:
import re
from nltk.tokenize import sent_tokenize

sentences = [sent_tokenize(report) for report in reports]
sentences = [sentence for doc in sentences for sentence in doc]

# Remove numbers from each sentence
def remove_numbers(sentence):
    return re.sub(r'\d+', '', sentence)

sentences_without_numbers = [remove_numbers(sentence) for sentence in sentences]

In [None]:
sentences

### Pre-calculate embeddings

We'll likely be tweaking hyperparameters for our eventual BERTopic model. Ordinarily, BERTopic would have to recalculate the embeddings every time we run the model, which is computationally demanding. By pre-calculating the embeddings just once, we can re-run our eventual model much faster.
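
A minimal sketch of this caching idea is shown below. The helper `load_or_compute_embeddings` and the cache file name are hypothetical, and `encode` stands for any callable that turns a list of strings into an array of embeddings (for example, the `encode` method of a sentence-transformers model).

```python
import os
import pickle


def load_or_compute_embeddings(docs, encode, cache_path):
    """Return cached embeddings if present, otherwise compute and cache them."""
    if os.path.exists(cache_path):
        # Re-use the embeddings saved by an earlier run
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    embeddings = encode(docs)
    with open(cache_path, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings
```

The cached array can then be passed to BERTopic alongside the documents, e.g. `topic_model.fit_transform(sentences, embeddings)`, so that only the later stages are re-run when hyperparameters change. This sketch does not detect changes to `docs`, so the cache file should be deleted whenever the input text changes.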

Below, we download the two best performing models for clustering tasks based on the Hugging Face MTEB leaderboard. Both models have over 7 billion parameters, and are roughly 10 GB each. You can read more about these models here: [gte-Qwen1.5-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen1.5-7B-instruct) and [Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral).

In [4]:
import os
from dotenv import load_dotenv
from openai import OpenAI
from bertopic import BERTopic
from bertopic.backend import OpenAIBackend

# Activate OpenAI API Key
load_dotenv('api.env')
openai_api_key = os.getenv('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)

# Set up the OpenAI embedding backend and pass it to BERTopic
embedding_model = OpenAIBackend(client, "text-embedding-3-large")
topic_model = BERTopic(embedding_model=embedding_model)

## Vectoriser

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(1,3), stop_words="english")

In [18]:
from bertopic import BERTopic
topic_model = BERTopic(embedding_model=embedding_model,
                       vectorizer_model=vectorizer_model,
                       min_topic_size=5)

# Fit the model to data
topics, probabilities = topic_model.fit_transform(reports)

# Find unique topics
unique_topics = set(topics)
num_unique_topics = len(unique_topics)

print(f"Number of unique topics identified: {num_unique_topics}")
print("")

# Get topic information
topic_info = topic_model.get_topic_info()
print("Topic Info:\n", topic_info)

Number of unique topics identified: 14

Topic Info:
     Topic  Count                                           Name  \
0      -1     89                 -1_risk_health_evidence_mental   
1       0     60               0_coroner_response_action_report   
2       1     60                        1_risk_staff_care_trust   
3       2     42                     2_prison_hmp_acct_prisoner   
4       3     40         3_health_mental_mental health_services   
5       4     20          4_medication_prescribed_patient_moore   
6       5     19            5_officers_training_police_evidence   
7       6     17                       6_content_website_dan_uk   
8       7     17      7_health_mental_appointment_mental health   
9       8      9            8_students_student_university_staff   
10      9      8  9_railway_thameslink_thameslink railway_govia   
11     10      7     10_care_care coordinator_coordinator_trust   
12     11      6                    11_elft_police_findlay_risk   
13     12