# Implementing BERTopic for PFD reports

BERTopic uses BERT embeddings and clustering algorithms to discover topics. Topics are characterised by dense clusters of semantically similar embeddings, identified through dimensionality reduction and clustering. 

Rather than a model, BERTopic is a framework that contains a handful of sub-models, each providing a necessary step in topic representation. These are:
* **Embeddings.** This stage represents each piece of text as a numeric vector that captures semantic meaning and context. This is a core advantage of BERTopic compared to traditional methods such as LDA.
* **Dimensionality reduction.** We then take the above embedding vectors and compress their size to aid computational performance.
* **Clustering.** We then cluster our reduced-dimension embeddings via unsupervised methods. This essentially extracts our topics.
* **TF-IDF.** 'Term Frequency - Inverse Document Frequency' is the approach used to extract the key words and phrases that represent each topic. BERTopic uses a class-based variant (c-TF-IDF), which favours terms that are frequent within a topic but comparatively rare across the wider text corpus.

In BERTopic's modular design, each 'module' is independent, meaning that the specific algorithmic approach can be changed for any component, and the remaining steps will be compatible. 

<br>

First, we need to read in our cleaned data...




In [2]:
import pandas as pd
import numpy as np

# Read report data
data = pd.read_csv('../Data/cleaned.csv')

# Extract CleanContent column
reports = data['CleanContent']

## 1. Preprocessing

### Sentence splitter
Before embedding our text, it's useful to first split our reports into sentences. BERTopic generally performs poorly on larger documents, as this tends to result in noisy topics. 

Splitting our reports into sentences means that BERTopic will not assign a topic to each individual report out-of-the-box, but we can do this manually (for example, by aggregating the sentence-level topics within each report).
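
As a rough illustration of that aggregation step, the sketch below assigns each report its most frequent sentence-level topic. The `report_ids` and `sentence_topics` lists are made-up stand-ins: in practice, a report identifier would need to be recorded for each sentence during the split, which the cell below does not do. Taking the most frequent topic is only one possible aggregation rule.

```python
from collections import Counter, defaultdict

# Hypothetical inputs: one report id and one topic label per sentence.
report_ids = [0, 0, 0, 1, 1, 1, 2]
sentence_topics = [3, 3, -1, 7, 7, 3, -1]


def topics_per_report(report_ids, sentence_topics, outlier=-1):
    """Count how often each topic occurs in each report, ignoring outliers."""
    counts = defaultdict(Counter)
    for report_id, topic in zip(report_ids, sentence_topics):
        if topic != outlier:
            counts[report_id][topic] += 1
    return counts


def dominant_topic(counter, outlier=-1):
    """Return the most frequent topic, or the outlier label if none was assigned."""
    return counter.most_common(1)[0][0] if counter else outlier


counts = topics_per_report(report_ids, sentence_topics)
report_topics = {rid: dominant_topic(counts[rid]) for rid in sorted(set(report_ids))}
print(report_topics)
```

A report whose sentences are all outliers keeps the outlier label here; a different rule (such as keeping every topic above a minimum count) may suit reports that raise several concerns.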

In [None]:
from nltk.tokenize import sent_tokenize

# Split each report into a list of sentences
sentences = [sent_tokenize(report) for report in reports]
# Flatten the nested lists into a single list of sentences
sentences = [sentence for doc in sentences for sentence in doc]
sentences = pd.DataFrame(sentences, columns=['sentences'])
sentences

### Processing with GPT

Now with the reports in sentence format, we can use the OpenAI API to...
* Correct spelling errors and grammatical mistakes - these create noise in our topic representations
* Remove reference to dates, names and addresses - this preserves privacy and increases the relevancy of our data
* In some circumstances, trim down sentences to reduce filler words

First, we'll do this with a sample of 30 sentences to make sure everything is in order.

In [None]:
import os
from dotenv import load_dotenv
from openai import OpenAI

# Activate OpenAI API Key
load_dotenv('api.env')
openai_api_key = os.getenv('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)

# Define the prompt
prompt = """You will be provided with a sentence. You must return the sentence - and nothing else whatsoever - with the following modifications:
* Correct spelling and grammatical errors.
* Remove *all* references to dates, including years.
* Remove *all* references to addresses.
* Remove *all* references to names or titles of individuals. For example, "Sam went to the shop" or "Mr Andrews went to the shop" would both be changed to "They went to the shop".
* Keep the first-person "I" pronoun if it is used
* Do *not* change acronyms or organisational names.
* If I haven't provided you with a full sentence simply return what I've given you and nothing else. This might be a single number.

Here is your sentence:
{sentence}
"""

from typing import List, Dict

# Construct prompts for each given report sentence
def build_prompt(sentence: str) -> List[Dict[str, str]]:
    # OpenAI 'messages' take a list of dictionaries, each with a 'role' and 'content' key. 
    # Role can be 'system', 'user', or 'assistant' (LLM replies as assistant); content is the text the LLM sees.
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt.format(sentence = sentence)},
    ]

In [None]:
import random
import time

# Define empty lists for the processed and original sentences
new_sentences = []
original_sentences = []

# Sample 30 sentences from the dataframe
random.seed(12345)
sample_sentences = random.sample(range(len(sentences)), 30)

# Start the clock
start_time = time.time()

# Process each sentence with GPT-4 Turbo
for count, idx in enumerate(sample_sentences, start=1):
    sentence = sentences['sentences'].iloc[idx]
    try:
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=build_prompt(sentence),
            temperature=0,
            seed=18062024
        ).choices[0].message.content
    except Exception as e:
        # Skip this sentence if the API call fails
        print(f"Error processing sentence {idx}: {e}")
        continue

    # Keep the original and processed sentences side by side
    original_sentences.append(sentence)
    new_sentences.append(response)

    # Print progress and results
    print(f"Processing sentence {count} of {len(sample_sentences)} (index {idx})")
    print(f"Original: {sentence}")
    print(f"New: {response}\n")

# End the timer
end_time = time.time()

# Calculate & print time taken
total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')

This seems to have worked nicely. Names, dates and addresses have been consistently removed. No sentence has had its contents erroneously changed.

It's clear that some sentences aren't right in the original data. A number of "sentences" consist of nothing but a number (e.g. "4."). This is because our earlier sentence splitter identified sentence boundaries by the presence of full stops. Unfortunately, despite prompt engineering, it doesn't seem to be possible to force GPT to provide a blank string as its response, so we've had to return these erroneous sentences as-is. 

We'll leave these as they are for the time being, so let's extend the above prompt to our full data and hope that these erroneous sentences don't significantly affect downstream tasks.
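
One option that is not used in this notebook would be to drop number-only fragments before they reach the API. The sketch below is a minimal example of such a filter. The regular expression is an assumption about what the stray fragments look like (e.g. "4." or "(2)"), and `is_fragment` is a hypothetical helper name.

```python
import re

# Matches fragments made up only of digits and punctuation, e.g. "4.", "(2)", "12".
# The pattern is an assumption about the shape of the stray fragments.
FRAGMENT_PATTERN = re.compile(r"^\W*\d+\W*$")


def is_fragment(sentence: str) -> bool:
    """Return True for empty strings or number-only fragments."""
    text = sentence.strip()
    return not text or bool(FRAGMENT_PATTERN.match(text))


examples = [
    "4.",
    "(2)",
    "The patient was discharged without a care plan.",
    "  ",
    "Ward 4 was understaffed.",
]
kept = [s for s in examples if not is_fragment(s)]
print(kept)
```

Sentences that merely contain a number are kept; only fragments with no letters at all are removed.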

In [None]:
# Define empty list for the processed sentences
new_sentences = []

# Start the clock
start_time = time.time()

# Process each sentence with GPT-4 Turbo
for idx in range(len(sentences)):
    sentence = sentences['sentences'].iloc[idx]
    try:
        response = client.chat.completions.create(
            model="gpt-4-turbo",
            messages=build_prompt(sentence),
            temperature=0,
            seed=18062024
        ).choices[0].message.content
    except Exception as e:
        # Fall back to the original sentence so the output stays aligned with the input
        print(f"Error processing sentence {idx}: {e}")
        response = sentence

    new_sentences.append(response)

# End the timer
end_time = time.time()

# Calculate & print time taken
total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')

Since the above code takes a long time to run, we'll save its output to a .csv file so that it can be loaded later without re-running the API calls.

In [None]:
processed_sentences = new_sentences
processed_sentences = pd.DataFrame(processed_sentences, columns=['sentences'])
processed_sentences.to_csv('../Data/processed_sentences.csv', index=False)
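
The processing loops above make a single attempt per sentence. For a long-running job, a bounded retry with exponential backoff is a common way to ride out temporary API errors. The sketch below is a generic illustration rather than part of the original pipeline: `call_with_retries` and `flaky_call` are hypothetical names, and the stand-in function simulates an API call that fails twice before succeeding.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts.
    Returns the result, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as error:  # broad on purpose: API errors vary
            print(f"Attempt {attempt + 1} failed: {error}")
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Stand-in for an API call that fails twice before succeeding.
attempts = {"n": 0}


def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("temporary failure")
    return "cleaned sentence"


result = call_with_retries(flaky_call, base_delay=0.0)
print(result)
```

In the loops above, the API request could be wrapped in a `lambda` and passed to this helper, with the original sentence used as a fallback whenever `None` is returned so that the output stays the same length as the input.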

## 2. Embeddings

We first need to embed our data, representing our text as numeric vectors that capture semantic meaning. This is a major advantage over methods like LDA, as we can draw on recent advances in transformer-based language model embeddings.

BERTopic's default embedding model doesn't really compete with more modern approaches. Luckily, we can swap it for another model, such as a transformers model from Hugging Face or an embedding model from OpenAI. 

For ease of use, we'll use OpenAI's `text-embedding-3-small` embeddings model.
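
To make "captures semantic meaning" a little more concrete, the sketch below compares toy vectors using cosine similarity, a common way to measure how close two embeddings are. The three-dimensional vectors are invented for illustration; real embeddings have many more dimensions and come from the embedding model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors standing in for sentence embeddings.
toy_embeddings = {
    "There was no care plan in place.": [0.9, 0.1, 0.0],
    "A care plan was never written.": [0.8, 0.2, 0.1],
    "The road junction is unsafe.": [0.0, 0.2, 0.9],
}

first, second, third = toy_embeddings.values()
similar_pair = cosine_similarity(first, second)
different_pair = cosine_similarity(first, third)
print(f"care plan vs care plan: {similar_pair:.2f}")
print(f"care plan vs road junction: {different_pair:.2f}")
```

Sentences about the same concern should score closer to 1 than sentences about unrelated concerns, which is what allows the later clustering step to group them.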

In [23]:
import pandas as pd
from bertopic import BERTopic
import os
from dotenv import load_dotenv
from openai import OpenAI
from bertopic.backend import OpenAIBackend

# Activate OpenAI API Key
load_dotenv('api.env')
openai_api_key = os.getenv('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)


# Import processed sentences csv
processed_sentences = pd.read_csv('../Data/processed_sentences.csv')

# Convert the sentences column to a list of strings
processed_sentences = processed_sentences['sentences'].tolist()

# Define the embedding model
embedding_model = OpenAIBackend(client, "text-embedding-3-small")

# Generate embeddings directly (currently disabled)
#sentence_embeddings = embedding_model.embed(processed_sentences, verbose=True)

## 3. Dimensionality reduction

Our embeddings have high dimensionality, which poses a problem for downstream clustering tasks.

**UMAP** is a dimensionality reduction technique that balances the preservation of local and global structures by constructing a high-dimensional graph of the data and optimising its low-dimensional representation. UMAP focuses on maintaining the local structure by ensuring that points that are close together in high-dimensional space remain close in the low-dimensional space. This is achieved through a neighborhood graph that captures local relationships. The relationships captured are, as far as possible, preserved in the lower-dimensional representation. 

Another option would be **PCA**. PCA is strictly linear and would effectively capture major themes in the PFD reports, such as the distinction between different types of issues (e.g., hospital vs. workplace safety) based on overall variance. However, smaller clusters of reports with very specific concerns might not be well-preserved, as PCA could mix them if their variance is not as significant compared to the global patterns.

Any other dimensionality reduction model can also be imported from scikit-learn, so long as it has both a `.fit()` and `.transform()` method.
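
As a minimal sketch of such a drop-in replacement, the example below reduces randomly generated stand-in embeddings with scikit-learn's `PCA`. The array shape (200 vectors of 50 dimensions) is arbitrary and chosen only to keep the example fast.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for sentence embeddings: 200 vectors with 50 dimensions.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 50))

# PCA exposes the .fit() and .transform() methods mentioned above.
pca_model = PCA(n_components=5, random_state=0)
pca_model.fit(fake_embeddings)
reduced = pca_model.transform(fake_embeddings)

print(reduced.shape)  # (200, 5)
```

A model like `pca_model` could then be passed to BERTopic through the `umap_model` argument in place of UMAP.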

Here's a quick comparison between UMAP and PCA...

| Aspect               | PCA                                                          | UMAP                                                                 |
|----------------------|--------------------------------------------------------------|----------------------------------------------------------------------|
| Type                 | Linear                                                       | Non-linear                                                           |
| Local Structure      | Not specifically preserved                                   | Well preserved                                                       |
| Global Structure     | Well preserved                                               | Reasonably well preserved, depending on `n_neighbors`                |
| Computation          | Generally faster and less complex                            | More complex and computationally intensive                           |
| Application Suitability | Best for data with linear relationships and when global patterns are of primary interest | Best for data with non-linear relationships and when both local and global patterns are important |


<br>

Since UMAP excels at maintaining local structures, it will effectively capture the relationships between our PFD report sentences that are similar. This is crucial when working at the sentence level, as we need to identify and group similar sentences together accurately.

### Parameters for UMAP
* `n_neighbors` - controls the local neighborhood size used for manifold approximation. It balances the focus between local versus global structure. Smaller values (e.g., 5-15) will capture very local structures and can lead to more detailed clustering. Larger values (e.g., 50-100) will incorporate more global structure and may provide a broader overview of the data.

* `min_dist` - controls the minimum distance between points in the low-dimensional space. It affects the tightness of clusters. Smaller values (e.g., 0.001-0.1) will result in more compact clusters. Larger values (e.g., 0.1-0.5) will spread out clusters, potentially making broader patterns more apparent.

* `n_components` - determines the number of dimensions for the reduced space. Usually set to 2 for visualisation purposes, but for more complex downstream tasks, 3 or more can be useful.

<br>

We'll experiment with different hyperparameters for UMAP, assessing the visualisation of the global projection of sentence embeddings. We can also look at the silhouette score, but this metric is not very informative for clusters of irregular shapes and different sizes.
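
For reference, the sketch below shows how a silhouette score can be computed with scikit-learn. It uses synthetic blobs as a stand-in for UMAP-reduced embeddings and k-means as a simple clustering step, so the numbers say nothing about the PFD data. The blob centres and candidate cluster counts are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for reduced embeddings: four compact, well-separated groups in 5D.
centers = [
    [0, 0, 0, 0, 0],
    [10, 0, 0, 0, 0],
    [0, 10, 0, 0, 0],
    [0, 0, 10, 0, 0],
]
points, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

# Score a few candidate cluster counts; values closer to 1 mean tighter, better-separated clusters.
scores = {}
for k in (2, 4, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.2f}")
print(f"Best k: {best_k}")
```

Because the score rewards compact, roughly spherical clusters, it suits this toy data far better than the irregular clusters we expect from real sentence embeddings.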

In [59]:
from umap import UMAP
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# Create a UMAP model
umap_model = UMAP(n_neighbors=15, 
                  n_components=5,
                  min_dist=0,
                  random_state=230624)

## 4. Clustering

Once we've reduced the dimensionality of our input embeddings, the next step is to cluster them into groups of similar embeddings to identify our topics. Clustering is arguably the most important step, as the effectiveness of our clustering method directly impacts the coherence of our topic representations.

HDBSCAN is a very effective approach to clustering, as it can happily depict irregular shapes (e.g. not forcing clusters to be convex). Importantly, HDBSCAN does not force data into a cluster. If it cannot find a natural cluster for a data point, then it assigns it to a special 'outlier' topic (represented as "-1" in BERTopic). This makes our identified topics much tighter and more coherent. 
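
Because outliers are not forced into a topic, it's worth checking how much of the data ends up in topic -1. The sketch below computes that share from a list of topic labels. The labels are invented, and `outlier_share` is a hypothetical helper rather than a BERTopic function.

```python
import numpy as np


def outlier_share(topics, outlier=-1):
    """Return the proportion of documents assigned to the outlier topic."""
    topics = np.asarray(topics)
    return float(np.mean(topics == outlier))


# Made-up topic labels for eight sentences; -1 marks an outlier.
example_topics = [-1, 0, 0, 1, -1, 2, 0, -1]
share = outlier_share(example_topics)
print(f"{share:.1%} of sentences are outliers")
```

A very high share suggests that the clustering settings may be too strict for the data, while a very low share may mean that loosely related sentences are being pulled into topics.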

HDBSCAN has the following main hyperparameters...

* `min_cluster_size` - the minimum size of clusters. Smaller values can lead to more fine-grained clusters, while larger values lead to more general clusters.

* `metric` - the distance metric used. Common choices are 'euclidean', 'manhattan', 'cosine', etc. This choice should be based on data characteristics.

* `cluster_selection_method` - the method to select clusters. 'eom' (excess of mass) is a common choice, but 'leaf' can also be used for a different clustering approach.


In [51]:
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(
    min_cluster_size=20, # minimum number of sentences needed to form a cluster
    #metric='cosine', # can choose euclidean, manhattan, cosine, etc.
    cluster_selection_method='leaf', # can choose eom or leaf
    prediction_data=True)

## 5. Vectoriser

After identifying clusters (topics), the vectoriser is used to convert the original text data into a document-term matrix. This matrix represents the frequency of terms in each document, and the subsequent c-TF-IDF weighting gives more weight to important terms (i.e., terms that are unique to a topic relative to the entire corpus).
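
The weighting step can be sketched with NumPy. The example below follows the c-TF-IDF idea of multiplying each term's frequency within a topic by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the term's frequency across all topics. The counts are invented, and this is a simplified illustration rather than BERTopic's own implementation.

```python
import numpy as np

# Rows are topics (all sentences in a topic joined together), columns are terms.
terms = ["care", "plan", "risk", "road"]
counts = np.array([
    [8, 6, 1, 0],  # topic 0
    [7, 0, 9, 0],  # topic 1
    [6, 0, 0, 5],  # topic 2
], dtype=float)


def c_tf_idf(counts):
    """Weight term counts so that topic-specific terms score highest."""
    tf = counts / counts.sum(axis=1, keepdims=True)  # term frequency within each topic
    avg_words = counts.sum() / counts.shape[0]       # A: average number of words per topic
    term_totals = counts.sum(axis=0)                 # f: frequency of each term across topics
    idf = np.log(1 + avg_words / term_totals)
    return tf * idf


weights = c_tf_idf(counts)
top_terms = [terms[i] for i in weights.argmax(axis=1)]
print(top_terms)
```

Although "care" is the most frequent term in two of the three toy topics, it appears in every topic and so is down-weighted in favour of the terms that distinguish each topic.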

The vectoriser (`CountVectorizer`) has the following hyperparameters:

* `ngram_range` - specifies the range of n-gram lengths allowed within a topic representation. For example, an `ngram_range` of (1,3) allows us to have 1, 2 and 3-word terms. This is important for phrases like "mental health", which could only be represented as "mental" and "health", separately, if we had an ngram range of (1,1).
* `stop_words` - allows us to specify that we want stop words to be removed. We've already embedded our text, so removing stop words now will not harm the embedding process and helps to identify meaningful topics.
* `min_df` - the minimum document frequency a term needs in order to be kept in the vocabulary; terms below this threshold are ignored. The c-TF-IDF will almost certainly down-weight very rare terms anyway, so we can afford to be quite liberal with this parameter.
* `max_df` - the maximum document frequency a term may have; terms above this threshold are ignored. Stipulating this could force some topics to be more precise, but with the disadvantage of exclusion. In many cases, it might be best to leave it at its default.

In [55]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(
    min_df=20,
    ngram_range=(1,3),
    stop_words="english")

In [60]:
from bertopic import BERTopic
topic_model = BERTopic(#embedding_model=embedding_model, # ...custom embeddings
                       umap_model=umap_model, # ...dimensionality reduction
                       hdbscan_model=hdbscan_model, # ...clustering
                       vectorizer_model=vectorizer_model, # ...vectoriser
                       )

# Fit the model to data
topics, probabilities = topic_model.fit_transform(processed_sentences)

# Find unique topics
unique_topics = set(topics)
num_unique_topics = len(unique_topics)

print(f"Number of unique topics identified: {num_unique_topics}")
print("")

# Get topic information
topic_model.get_topic_info()
#print("Topic Info:\n", topic_info)

Number of unique topics identified: 58



|   | Topic | Count | Name | Representation | Representative_Docs |
|---|-------|-------|------|----------------|---------------------|
| 0 | -1 | 2022 | `-1_did_team_death_staff` | [did, team, death, staff, evidence, inquest, c... | [The evidence heard at the Inquest was that th... |
| 1 | 0 | 195 | `0_mental health_mental_health_team` | [mental health, mental, health, team, patients... | [At HMP Haverigg there was inadequate mental h... |
| 2 | 1 | 175 | `1_staff_guidance_issues_national` | [staff, guidance, issues, national, informatio... | [There is a lack of national guidance for both... |
| 3 | 2 | 164 | `2_risk_information_assessment_plan` | [risk, information, assessment, plan, care, pa... | [This gap in information can have an impact on... |
| 4 | 3 | 136 | `3_patients_review_heard_evidence` | [patients, review, heard, evidence, heard evid... | [I also heard evidence to suggest that prescri... |
| 5 | 4 | 130 | `4_concerns_plan_information_care` | [concerns, plan, information, care, place, did... | [They did not receive a care plan that was ade... |
| 6 | 5 | 123 | `5_family___` | [family, , , , , , , , , ] | [2., They's family 3., They's family 3.] |
| 7 | 6 | 100 | `6____` | [, , , , , , , , , ] | [1., 1., 1.] |
| 8 | 7 | 81 | `7_review_provided_care_staff` | [review, provided, care, staff, report, policy... | [My concerns relating to the inadequacy of the... |
| 9 | 8 | 80 | `8_including_risk_assessment_does` | [including, risk, assessment, does, guidance, ... | [There is the potential for the risk of harm t... |
