# BERTopic

BERTopic uses BERT embeddings and clustering algorithms to discover topics. Each topic is a dense cluster of semantically similar embeddings, identified through dimensionality reduction followed by clustering.

Rather than a single model, BERTopic is a framework made up of a handful of sub-models, each providing a necessary step towards the final topic representations. These are:
* **Embeddings.** This stage represents our text data as numeric vectors that capture semantic meaning and context. This is a core advantage of BERTopic over traditional methods such as LDA.
* **Dimensionality reduction.** We then compress the embedding vectors into fewer dimensions to aid computational performance.
* **Clustering.** We then cluster the reduced embeddings using unsupervised methods. Each resulting cluster is treated as a topic.
* **TF-IDF.** 'Term Frequency - Inverse Document Frequency' is used to extract the key words and phrases that represent each topic. BERTopic applies a class-based variant (c-TF-IDF) that treats all documents in a topic as a single document, favouring terms that are frequent within a topic but rare across the wider text corpus.
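
To make the last step more concrete, here is a minimal sketch of class-based TF-IDF in plain Python. It is a simplified illustration and not BERTopic's own implementation: it scores each term as its count within a topic multiplied by log(1 + A / f), where A is the average number of words per topic and f is the term's count across all topics. The function name 'c_tf_idf' and the toy sentences are assumptions made purely for this example.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Toy class-based TF-IDF: score = tf(term, topic) * log(1 + A / f(term))."""
    # Term counts per topic: all documents in a topic are treated as one document
    counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # f(term): how often each term occurs across all topics
    total = Counter()
    for topic_counts in counts.values():
        total.update(topic_counts)
    # A: average number of words per topic
    avg_words = sum(total.values()) / len(counts)
    return {
        topic: {
            term: tf * math.log(1 + avg_words / total[term])
            for term, tf in topic_counts.items()
        }
        for topic, topic_counts in counts.items()
    }


# Hypothetical toy data: two topics with two short sentences each
toy_topics = {
    0: ["ward staff missed the risk assessment", "risk assessment was not updated"],
    1: ["prison staff did not search the cell", "cell search was delayed"],
}

scores = c_tf_idf(toy_topics)
for topic, terms in scores.items():
    top_terms = sorted(terms, key=terms.get, reverse=True)[:3]
    print(topic, top_terms)
```

Terms that appear in both toy topics (such as 'staff') receive a lower score than terms concentrated in one topic (such as 'risk' or 'cell'), which is the behaviour the TF-IDF step relies on.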

In BERTopic's modular design, each 'module' is independent, meaning that the specific algorithmic approach can be changed for any component, and the remaining steps will be compatible. 

Although BERTopic did not originally support this, since v0.13 (January 2023) we can also approximate a probabilistic topic distribution for each report via '.approximate_distribution'.

First, we need to read in our cleaned data.

<br>


In [1]:
import pandas as pd
import numpy as np

# Read report data
data = pd.read_csv('../Data/cleaned.csv')

# Extract CleanContent column
reports = data['CleanContent']

## 1. Embeddings

### Sentence splitter
Before embedding our text, it's useful to split our reports into sentences. BERTopic generally performs poorly on larger documents, which tend to produce noisy topics.

Splitting our reports into sentences means that BERTopic will not assign a topic to each individual report out-of-the-box, but we can do this manually (for example, by aggregating sentence-level topics within each report).
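
As a rough illustration of that manual aggregation, the sketch below turns sentence-level topic labels into per-report topic shares. It assumes we have kept a report identifier alongside each sentence, which the splitting code in this notebook does not do as written. The '(report_id, topic)' pairs and the function name 'topic_shares' are hypothetical. Topic -1 is the label BERTopic gives to outliers.

```python
from collections import Counter, defaultdict

# Hypothetical input: one (report_id, topic) pair per sentence
sentence_topics = [(0, 3), (0, 3), (0, -1), (0, 7), (1, 7), (1, 7), (1, 2)]


def topic_shares(pairs, ignore_outliers=True):
    """Return, for each report, the share of its sentences assigned to each topic."""
    counts = defaultdict(Counter)
    for report_id, topic in pairs:
        if ignore_outliers and topic == -1:
            continue  # skip sentences that were not assigned to any topic
        counts[report_id][topic] += 1
    return {
        report_id: {t: n / sum(c.values()) for t, n in c.items()}
        for report_id, c in counts.items()
    }


shares = topic_shares(sentence_topics)

# One simple report-level label: the topic covering the most sentences
dominant = {report_id: max(s, key=s.get) for report_id, s in shares.items()}
print(shares)
print(dominant)
```

Keeping the full share per topic, rather than only the dominant topic, preserves the fact that a single report can touch on several topics.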

In [2]:
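from nltk.tokenize import sent_tokenize

# Split each report into a list of sentences
# (sent_tokenize requires NLTK's Punkt tokenizer data to be downloaded beforehand)
sentences = [sent_tokenize(report) for report in reports]
# Flatten the per-report lists into a single list of sentences
sentences = [sentence for doc in sentences for sentence in doc]
# Store the sentences in a one-column data frame
sentences = pd.DataFrame(sentences, columns=['sentences'])
sentences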

Unnamed: 0,sentences
0,Pre-amble Mr Larsen was a 52 year old male wi...
1,Mr Larsen reported going through a very diffic...
2,Mr Larsen advised the GP that he had placed a ...
3,Mr Larsen’s GP referred him to the CRISIS Home...
4,Mr Larsen was seen regularly by the team and c...
...,...
5137,It makes no mention of s.136 detentions which ...
5138,SODEXO - ITEMS USED TO FACILITATE SUICIDE 12.
5139,Some prisoners at HMP Peterborough are allowed...
5140,13.


In [7]:
import os
from dotenv import load_dotenv
from openai import OpenAI

# Activate OpenAI API Key
load_dotenv('api.env')
openai_api_key = os.getenv('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)

# Define the prompt
prompt = """You will be provided with a sentence. You must return the sentence - and nothing else whatsoever- with the following modifications:
1. Correct any spelling errors.
2. Remove any numbers at the start or end of sentences (e.g. "There were 2 people" would be fine).
3. Ensure that the sentence is grammatically correct.
4. Make the sentence more concise, but do not remove any important information or change the underlying meaning of the sentence.
5. Preserve acronyms and abbreviations as they are.
6. *Never* respond in your own words; always return the original sentence with the requested modifications only.
7. If you cannot find a sentence, simply return what I've given you with no modifications.
8. Remove any reference to dates.
9. Remove reference to names (e.g. "Sam went to the store" should be changed to "They went to the store").

Your turn! Here is your sentence:
{sentence}
"""

from typing import List, Dict

# Construct prompts for each given report sentence
def build_prompt(sentence: str) -> List[Dict[str, str]]:
    # OpenAI 'messages' take a list of dictionaries, each with a 'role' and 'content' key. 
    # Role can be 'system', 'user', or 'assistant' (LLM replies as assistant); content is the text the LLM sees.
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt.format(sentence = sentence)},
    ]

In [8]:
import random
import time

# Define empty lists for the cleaned and original sentences
new_sentences = []
original_sentences = []

# Sample 20 sentences from the dataframe
random.seed(54321)
sample_sentences = random.sample(range(len(sentences)), 20)

# Start the clock
start_time = time.time()

# Process each sentence with GPT-3.5 Turbo
for count, idx in enumerate(sample_sentences, start=1):
    sentence = sentences['sentences'].iloc[idx]
    success = False
    while not success:
        try:
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=build_prompt(sentence),
                temperature=0,
                seed=18062024
            ).choices[0].message.content
            
            new_sentences.append(response)
            original_sentences.append(sentence)
            
            # Print progress and results
            print(f"Processing sentence {idx}")
            print(f"Original: {sentence}")
            print(f"New: {response}\n")
            print("")
            
            success = True

        except Exception as e:
            # Report the error and move on to the next sentence (no retry is attempted)
            print(f"Error processing sentence {idx}: {e}")
            break

# End the timer
end_time = time.time()

# Calculate & print time taken
total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')

Processing sentence 4003
Original: (3) Not all relevant information was shared between the Child & Adult Mental Health Team about the circumstances disclosed of events on the night of 29th September of Ellis’s failed attempt at hanging as part of a risk assessment.
New: Not all relevant information was shared between the Child & Adult Mental Health Team about the circumstances disclosed of events on the night of Ellis’s failed attempt at hanging as part of a risk assessment.


Processing sentence 568
Original: 4.
New: 4.


Processing sentence 2003
Original: The Trust’s own investigation into events leading to Mr Howe’s death did not consider the full extent of his contacts with mental health services, lacked any meaningful degree of critical analysis of events, and omitted to seek to explore fundamental issues such as access to services from the patient’s perspective.
New: The Trust’s own investigation into events leading to Mr Howe’s death did not consider the full extent of their con
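
The 'while not success' loop above breaks out on the first exception, so a failed request is never actually retried. If retries are wanted, one common pattern is a bounded retry with exponential backoff. The sketch below is a generic illustration under that assumption: 'call_with_retries' and 'flaky_request' are hypothetical names, and the flaky function stands in for the API call.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure, wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as error:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)


# Hypothetical stand-in for the API call: fails twice, then succeeds
attempts = {"n": 0}


def flaky_request():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("temporary failure")
    return "cleaned sentence"


# Record the delays instead of actually sleeping, to keep the example fast
delays = []
result = call_with_retries(flaky_request, sleep=delays.append)
print(result, delays)
```

In the notebook, 'fn' could be a small function wrapping the 'client.chat.completions.create' call for a single sentence.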

In [None]:
import random
import time

# Define empty lists for the cleaned and original sentences
new_sentences = []
original_sentences = []

# Sample 20 sentence indices
random.seed(18062024)
sampled_indices = random.sample(range(len(sentences)), 20)

# Start the timer
start_time = time.time()

# Process each sampled sentence with GPT-3.5 Turbo
for count, idx in enumerate(sampled_indices, start=1):
    # Look up the sentence text for the sampled index
    sentence = sentences['sentences'].iloc[idx]
    success = False
    while not success:
        try:
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=build_prompt(sentence=sentence),
                temperature=0,
                seed=18062024
            ).choices[0].message.content
            
            new_sentences.append(response)
            original_sentences.append(sentence)
            
            # Print progress and results
            print(f"Processing sentence {count}/{len(sampled_indices)}")
            print(f"Original: {sentence}")
            print(f"New: {response}\n")
            success = True

        except Exception as e:
            # Report the error and move on to the next sentence (no retry is attempted)
            print(f"Error processing sentence {count}: {e}")
            break

# End the timer
end_time = time.time()

# Calculate & print time taken
total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')

# Create data frame to store the original and cleaned sentences
result_df = pd.DataFrame({'Original Text': original_sentences, 'Cleaned Text': new_sentences})

result_df


### Pre-calculate embeddings

We'll likely be tweaking hyperparameters for our eventual BERTopic model. Ordinarily, BERTopic would then have to calculate the embeddings each time we run the model, which is computationally demanding. By pre-calculating the embeddings just once, we can re-run our eventual model much faster.
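
The 'compute once, reuse many times' idea can be sketched as a small on-disk cache. This is an illustrative sketch only: 'cached_embeddings' and 'fake_embed' are hypothetical names, the JSON cache file is an assumption, and 'fake_embed' is a stand-in for a real embedding model.

```python
import hashlib
import json
import os
import tempfile


def cached_embeddings(texts, embed_fn, cache_path):
    """Return one embedding per text, calling embed_fn only for texts not yet cached."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
    # Key each text by a hash of its content
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing = [(k, t) for k, t in dict(zip(keys, texts)).items() if k not in cache]
    if missing:
        new_vectors = embed_fn([t for _, t in missing])
        for (key, _), vector in zip(missing, new_vectors):
            cache[key] = list(vector)
        with open(cache_path, "w") as f:
            json.dump(cache, f)
    return [cache[k] for k in keys]


# Hypothetical stand-in for a real embedding model; records how often it is called
calls = []


def fake_embed(batch):
    calls.append(len(batch))
    return [[float(len(t)), float(t.count(" "))] for t in batch]


cache_file = os.path.join(tempfile.mkdtemp(), "embeddings.json")
docs = ["Risk assessment was not updated.", "Cell search was delayed."]

first = cached_embeddings(docs, fake_embed, cache_file)
second = cached_embeddings(docs, fake_embed, cache_file)  # served from the cache
print(calls)
```

The cached vectors could then be converted to a NumPy array and passed to BERTopic through the 'embeddings' argument of 'fit_transform', so that the embedding step is skipped on each re-run.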

At the time of writing, the two best-performing models for clustering tasks on the Hugging Face MTEB leaderboard each have over 7 billion parameters and are roughly 10GB in size. You can read more about these models [here](https://huggingface.co/Alibaba-NLP/gte-Qwen1.5-7B-instruct) and [here](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral). Note that the cell below uses OpenAI's 'text-embedding-3-large' model via BERTopic's 'OpenAIBackend' rather than either of these.

In [None]:

from bertopic import BERTopic
from bertopic.backend import OpenAIBackend

# Use OpenAI's text-embedding-3-large model as the embedding backend
embedding_model = OpenAIBackend(client, "text-embedding-3-large")
topic_model = BERTopic(embedding_model=embedding_model)

## Vectoriser

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(ngram_range=(1,3), stop_words="english")

In [None]:
from bertopic import BERTopic
topic_model = BERTopic(embedding_model=embedding_model,
                       vectorizer_model=vectorizer_model,
                       min_topic_size=4)

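# Fit the model to the data (BERTopic expects a list of strings)
topics, probabilities = topic_model.fit_transform(reports.tolist())

# Find unique topics (topic -1, if present, holds outliers rather than a real topic)
unique_topics = set(topics)
num_unique_topics = len(unique_topics)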

print(f"Number of unique topics identified: {num_unique_topics}")
print("")

# Get topic information
topic_info = topic_model.get_topic_info()
print("Topic Info:\n", topic_info)