# Implementing BERTopic for PFD reports

BERTopic uses BERT embeddings and clustering algorithms to discover topics. Topics are characterised by dense clusters of semantically similar embeddings, identified through dimensionality reduction and clustering. 

Rather than a single model, BERTopic is a framework made up of a handful of sub-models, each providing a necessary step in topic representation. These are:
* **Embeddings.** This stage represents our text data as numeric vectors that capture semantic meaning and context. This is a core advantage of BERTopic compared to traditional methods such as LDA.
* **Dimensionality reduction.** We then take the above embedding vectors and compress them to fewer dimensions to aid clustering and computational performance.
* **Clustering.** We then cluster our reduced-dimension embeddings via unsupervised methods. This essentially extracts our topics.
* **TF-IDF.** 'Term Frequency - Inverse Document Frequency' is the approach taken to extract the key words and phrases that represent each topic. TF-IDF favours terms that are frequent within a topic but relatively rare across our wider text corpus.

In BERTopic's modular design, each of the above sub-models is independent, meaning that we can chop and change each of these models, and the downstream tasks will be compatible. 
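To make the final TF-IDF stage more concrete, here is a minimal sketch of class-based TF-IDF weighting on a toy term-count matrix. The vocabulary and counts are made up for illustration, and the formula (term frequency within a topic multiplied by `log(1 + A / f_t)`) is a simplified version of the c-TF-IDF weighting described in the BERTopic documentation, not the library's own code.

```python
import numpy as np

# Hypothetical term counts per topic: rows are topics, columns are terms
vocab = ["risk", "medication", "staff"]
counts = np.array([
    [4, 0, 2],   # topic 0
    [0, 5, 2],   # topic 1
], dtype=float)

# Term frequency: each topic's counts, normalised to sum to 1
tf = counts / counts.sum(axis=1, keepdims=True)

# Inverse frequency: log(1 + A / f_t), where A is the average number of
# words per topic and f_t is the total count of term t across all topics
avg_words_per_topic = counts.sum(axis=1).mean()
term_totals = counts.sum(axis=0)
idf = np.log(1 + avg_words_per_topic / term_totals)

# Final weight of each term in each topic
weights = tf * idf

# The highest-weighted term per topic
top_terms = [vocab[i] for i in weights.argmax(axis=1)]
print(top_terms)  # ['risk', 'medication']
```

"staff" appears in both topics, so it ends up with a lower weight than the term that is specific to each topic.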

<br>

First, we need to read in our cleaned data...




In [6]:
import pandas as pd
import numpy as np

# Read report data
data = pd.read_csv('../Data/cleaned.csv')

# Extract CleanContent column
reports = data['CleanContent']
data

Unnamed: 0,URL,ID,Date,Receiver,CleanContent
0,https://www.judiciary.uk/prevention-of-future-...,2024-0318,Date of report: 13/06/2024,"TO: The Chief Executive, Leicestershire Partne...",Pre-amble Mr Larsen was a 52 year old male wi...
1,https://www.judiciary.uk/prevention-of-future-...,2024-0311,Date of report: 07/06/2024,TO: 1. NATIONAL AMBULANCE RESILIENCE UNIT (NAR...,- (1) The process for triaging and prioritisi...
2,https://www.judiciary.uk/prevention-of-future-...,2024-0298,Date of report: 05/05/2024,"TO: 1. CEO of Quora, 2. The Rt Hon Lucy Fraser...",(1) There are questions and answers on Quora’s...
3,https://www.judiciary.uk/prevention-of-future-...,2024-0297,Date of report: 29/05/2024,"TO: Secretary of State for Justice, 1",(1) The prison service instruction (PSI) 64/20...
4,https://www.judiciary.uk/prevention-of-future-...,2024-0296,Date of report: 03/06/2024,"TO: (1) , Chief Executive, Birmingham and Soli...",My principal concern is that when a high-risk ...
...,...,...,...,...,...
393,https://www.judiciary.uk/prevention-of-future-...,2016-0065,Date of report: 19 February 2016,TO: 1. Medical Director East London NHS Founda...,1. Brenda Morris was allowed weekend leave on ...
394,https://www.judiciary.uk/prevention-of-future-...,2016-0037,Date of report: 5 February 2016,TO: 1. Dean for Education Barts and the London...,Barts and the London 1. Whilst it was clear to...
395,https://www.judiciary.uk/prevention-of-future-...,2015-0465,Date of report: 24 November 2015,"TO: Chief Executive, Lancashire Care NHS Found...",1. Piotr Kucharz was a Polish gentleman who co...
396,https://www.judiciary.uk/prevention-of-future-...,2015-0173,Date of report: 29 April 2015,TO: 1. Ms Wendy Wallace Chief Executive Camden...,Camden and Islington Trust 1. It seemed from t...


## 1. Preprocessing

### Sentence splitter
Before embedding our text, it's useful to first split our reports into sentences. BERTopic generally performs poorly on larger documents, as this tends to result in noisy topics. 

Splitting our reports into sentences means that BERTopic will not assign a topic to each individual report out-of-the-box, but we can do this manually (for example, by aggregating sentence-level topics within each report).
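As a rough illustration of that manual aggregation, the sketch below takes the most frequent non-outlier topic among a report's sentences as the report's "dominant" topic. The `sentence_topics` data frame and its values are hypothetical; in practice the `Topic` column would come from the fitted model, and other aggregation rules (for example, keeping every topic above some share) may suit the data better.

```python
import pandas as pd

# Hypothetical sentence-level output: one row per sentence, with the report ID
# and the topic assigned to that sentence (-1 is the outlier topic)
sentence_topics = pd.DataFrame({
    "ID": ["2024-0318", "2024-0318", "2024-0318", "2024-0311", "2024-0311"],
    "Topic": [0, 0, 3, -1, 7],
})

# Ignore outliers, then count how often each topic occurs within each report
counts = (
    sentence_topics[sentence_topics["Topic"] != -1]
    .groupby(["ID", "Topic"])
    .size()
    .rename("n_sentences")
    .reset_index()
)

# Keep the most frequent topic per report as its dominant topic
dominant = (
    counts.sort_values(["ID", "n_sentences"], ascending=[True, False])
    .drop_duplicates("ID")
    .set_index("ID")["Topic"]
)
print(dominant)
```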

In [79]:
import pandas as pd
import numpy as np
import re
from nltk.tokenize import sent_tokenize

# Read report data
data = pd.read_csv('../Data/cleaned.csv')

# Extract CleanContent and ID columns
reports = data['CleanContent']
ids = data['ID']

# Ensure all IDs are strings
ids = ids.astype(str)

# Check if all IDs are of the same length
id_lengths = ids.apply(len)
unique_lengths = id_lengths.unique()

ids_df = pd.DataFrame(ids)

# Save data frame
ids_df.to_csv('../Data/ids.csv', index=False)





In [81]:
import nltk
import re
from nltk.tokenize import sent_tokenize

#nltk.download('punkt')

# Define paragraph pattern to split reports
paragraph_pattern = r'\d+\.\s|\d+\)\s|\(\d+\)\s|\(\d+\)\.\s|\d+-\s|\d+\s-\s'

def split_sentences(df):
    # Initialise empty lists to store report sentences and corresponding IDs
    report_sentences = []
    report_ids = []
    
    # Loop through each row in the data frame, recording text and ID
    for index, row in df.iterrows():
        # Extract report and ID
        text = row['CleanContent']
        report_id = row['ID']
        
        # Split into sentences (local name avoids shadowing this function)
        sentences = sent_tokenize(text)
        
        # Append each sentence and corresponding ID to the lists
        report_sentences.extend(sentences)
        report_ids.extend([report_id] * len(sentences))
        
    # Create a new data frame with the sentences and IDs
    result_df = pd.DataFrame({'ID': report_ids, 'Sentence': report_sentences})
    return result_df

def split_paragraphs(df):
    # Initialise empty lists to store report paragraphs and corresponding IDs
    report_paragraphs = []
    report_ids = []
    
    # Loop through each row in the data frame, recording text and ID
    for index, row in df.iterrows():
        # Extract report and ID
        text = row['CleanContent']
        report_id = row['ID']
        
        # Split into paragraphs at numbered-list markers such as "1. ", "2) " or "(3) "
        paragraphs = re.split(paragraph_pattern, text)
        
        # Append each paragraph and corresponding ID to the lists
        report_paragraphs.extend(paragraphs)
        report_ids.extend([report_id] * len(paragraphs))
        
    # Create a new data frame with the paragraphs and IDs
    result_df = pd.DataFrame({'ID': report_ids, 'Paragraph': report_paragraphs})
    return result_df

# Assuming you have your DataFrame 'data' already loaded
# Apply the functions to your DataFrame
split_reports_sent = split_sentences(data)
split_reports_para = split_paragraphs(data)

# Display the resulting DataFrames
print("Sentences DataFrame:")
print(split_reports_sent)

print("\nParagraphs DataFrame:")
print(split_reports_para)

Sentences DataFrame:
             ID                                           Sentence
0     2024-0318   Pre-amble Mr Larsen was a 52 year old male wi...
1     2024-0318  Mr Larsen reported going through a very diffic...
2     2024-0318  Mr Larsen advised the GP that he had placed a ...
3     2024-0318  Mr Larsen’s GP referred him to the CRISIS Home...
4     2024-0318  Mr Larsen was seen regularly by the team and c...
...         ...                                                ...
5113  2015-0116  It makes no mention of s.136 detentions which ...
5114  2015-0116      SODEXO - ITEMS USED TO FACILITATE SUICIDE 12.
5115  2015-0116  Some prisoners at HMP Peterborough are allowed...
5116  2015-0116                                                13.
5117  2015-0116  In addition, the deceased referred to experime...

[5118 rows x 2 columns]

Paragraphs DataFrame:
             ID                                          Paragraph
0     2024-0318   Pre-amble Mr Larsen was a 52 year old male
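To see how `paragraph_pattern` behaves, here is a small sketch on made-up text. It shows two things worth knowing: `re.split` returns an empty string when the text starts with a marker, and the pattern also fires on a year followed by a closing bracket. The example sentences and `strict_pattern` are hypothetical; the stricter variant is only one possible tightening, not the pattern used in this notebook.

```python
import re

# Pattern used above to split reports at numbered-list markers
paragraph_pattern = r'\d+\.\s|\d+\)\s|\(\d+\)\s|\(\d+\)\.\s|\d+-\s|\d+\s-\s'

# Hypothetical report text with three numbered concerns
text = ("(1) The ward was short-staffed. "
        "(2) No risk assessment was completed. "
        "3. Records were incomplete.")

pieces = re.split(paragraph_pattern, text)
print(pieces)  # the first element is '' because the text starts with a marker

# Drop empty pieces and surrounding whitespace
paragraphs = [p.strip() for p in pieces if p.strip()]
print(paragraphs)

# A year followed by ')' also matches the second alternative, cutting the text in two
dated = "The response (September 2019) was late."
print(re.split(paragraph_pattern, dated))

# Hypothetical stricter variant: markers of one or two digits, not preceded by a digit
strict_pattern = r'(?<!\d)(?:\(\d{1,2}\)\.?|\d{1,2}[.)]|\d{1,2}\s?-)\s'
print(re.split(strict_pattern, dated))  # left in one piece
```

Filtering empty pieces before sending text to the API would also avoid spending requests on inputs that can only come back as errors.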

### Processing with GPT

Now with the reports in sentence & paragraph formats, we can use the OpenAI API to...
* Correct spelling errors and grammatical mistakes - these create noise in our topic representations
* Remove reference to dates, names and addresses - this preserves privacy and increases the relevancy of our data
* In some circumstances, trim down sentences to reduce filler words

First, we'll do this with a sample of 30 sentences & paragraphs to make sure everything works as expected.

In [10]:
import os
from dotenv import load_dotenv
from openai import OpenAI

# Activate OpenAI API Key
load_dotenv('api.env')
openai_api_key = os.getenv('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)

# Define the prompt
prompt_sent = """You will be provided with a sentence. You must return the sentence - and nothing else whatsoever - with the following modifications:
* Correct spelling and grammatical errors.
* Remove *all* references to dates, including years.
* Remove *all* references to addresses.
* Remove *all* references to names or titles of individuals. For example, "Sam went to the shop" or "Mr Andrews went to the shop" would both be changed to "They went to the shop".
* Keep the first-person "I" pronoun if it is used
* Do *not* remove or change acronyms or organisational names.
* If I haven't provided you with a sentence simply return "ERROR: Incomplete".
Here is your sentence:
{sentence}
"""

from typing import List, Dict

# Construct prompts for each given report sentence
def build_prompt_sent(sentence: str) -> List[Dict[str, str]]:
    # OpenAI 'messages' take a list of dictionaries, each with a 'role' and 'content' key. 
    # Role can be 'system', 'user', or 'assistant' (LLM replies as assistant); content is the text the LLM sees.
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt_sent.format(sentence = sentence)},
    ]

In [83]:
import random
import time

# Define empty array for new texts
new_sentences = []
original_sentences = []

# Sample 30 sentences from the dataframe
sample_sentences = split_reports_sent['Sentence'].sample(n=30, random_state=1234).tolist()

# Start the clock
start_time = time.time()

# Process each sentence with GPT-4o Mini
for count, sentence in enumerate(sample_sentences, start=1):
    success = False
    while not success:
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=build_prompt_sent(sentence),
                temperature=0,
                seed=18062024,
                n=1
            ).choices[0].message.content
            
            new_sentences.append(response)
            original_sentences.append(sentence)
            
            # Print progress and results
            print(f"Processing sentence {count}")
            print(f"Original: {sentence}")
            print(f"New: {response}\n")
            print("")
            
            success = True

        except Exception as e:
            # Report which sentence failed and why, then move on to the next one
            print(f"Error processing sentence {count}: {e}")
            break

# End the timer
end_time = time.time()

# Calculate & print time taken
total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')

Processing sentence 1
Original: He may send a copy of this report to any person who, he believes, may find it useful or of interest.
New: They may send a copy of this report to any person who they believe may find it useful or of interest.


Processing sentence 2
Original: The Chief Coroner may publish either or both in a complete or redacted or summary form.
New: The Chief Coroner may publish either or both in a complete, redacted, or summary form.


Processing sentence 3
Original: Context: police witnesses agreed that BSMHFT clinicians were the experts on mental health diagnosis, including identifying those conditions that carry an increased risk of suicide, and assessing the risk of suicide generally.
New: Context: police witnesses agreed that BSMHFT clinicians were the experts on mental health diagnosis, including identifying those conditions that carry an increased risk of suicide and assessing the risk of suicide generally.


Processing sentence 4
Original: (2) I heard evidence t

In [84]:
# Define the prompt
prompt_paragraph = """You will be provided with a paragraph from a report. You must return the paragraph - and nothing else whatsoever - with the following modifications:
* Correct spelling and grammatical errors.
* Remove *all* references to dates, including years.
* Remove *all* references to addresses.
* Remove *all* references to names or titles of individuals. For example, "Sam went to the shop" or "Mr Andrews went to the shop" would both be changed to "They went to the shop".
* Keep the first-person "I" pronoun if it is used
* Do *not* remove or change acronyms or organisational names.
* If I haven't provided you with a paragraph simply return "ERROR: Incomplete".
Here is your paragraph:
{paragraph}
"""

# Construct prompts for each given report paragraph
def build_prompt_para(paragraph: str) -> List[Dict[str, str]]:
    # OpenAI 'messages' take a list of dictionaries, each with a 'role' and 'content' key. 
    # Role can be 'system', 'user', or 'assistant' (LLM replies as assistant); content is the text the LLM sees.
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt_paragraph.format(paragraph = paragraph)},
    ]

# Define empty array for new texts
new_paragraphs = []
original_paragraphs = []

# Sample 30 paragraphs from the dataframe
sample_paragraphs = split_reports_para['Paragraph'].sample(n=30, random_state=1111).tolist()

# Start the clock
start_time = time.time()

# Process each paragraph with GPT-4o Mini
for count, paragraph in enumerate(sample_paragraphs, start=1):
    success = False
    while not success:
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=build_prompt_para(paragraph),
                temperature=0,
                seed=18062024
            ).choices[0].message.content
            
            new_paragraphs.append(response)
            original_paragraphs.append(paragraph)
            
            # Print progress and results
            print(f"Processing paragraph {count}")
            print(f"Original: {paragraph}")
            print(f"New: {response}\n")
            print("")
            
            success = True

        except Exception as e:
            print(f"Error processing paragraph {count}: {e}")
            break

# End the timer
end_time = time.time()

# Calculate & print time taken
total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')

Processing paragraph 1
Original: 
New: ERROR: Incomplete


Processing paragraph 2
Original: BSMHFT’s response (September 
New: ERROR: Incomplete


Processing paragraph 3
Original: Although formal revocation of leave was never finalised, Sasha’s mother was given to believe that it would be, and again reluctantly agreed to bring Sasha back to the ward. Whilst with her mother, Sasha was able to run off and take the substantial Propranolol overdose which proved to be fatal.
New: Although formal revocation of leave was never finalized, they were given to believe that it would be, and again reluctantly agreed to bring them back to the ward. Whilst with their mother, they were able to run off and take the substantial Propranolol overdose which proved to be fatal.


Processing paragraph 4
Original: The investigation failed to explore appropriately the functionality of the MDT meetings;  
New: The investigation failed to appropriately explore the functionality of the MDT meetings;


Processing 

This seems to have worked nicely. Names, dates and addresses have been consistently removed. No sentence or paragraph has had its contents erroneously changed.

We can also see that a number of sentences and paragraphs have been replaced with error messages, which we can revisit after applying the prompt to the full corpus of text.
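One thing to keep in mind before running the full corpus: the loops in this notebook use `while not success` but `break` on the first exception, so each text gets a single attempt. A bounded retry with exponential backoff could be sketched as below. The helper names `call_with_retries` and `process_all`, the attempt count and the delays are all illustrative assumptions; `fn` stands in for whatever function wraps the API call. Failed items are returned as `None` so the output list keeps the same length as the input.

```python
import time
from typing import Callable, List, Optional, Sequence

def call_with_retries(fn: Callable[[str], str], text: str,
                      max_attempts: int = 3,
                      base_delay: float = 1.0) -> Optional[str]:
    """Call fn(text), retrying on any exception; return None if all attempts fail."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(text)
        except Exception as e:
            print(f"Attempt {attempt} failed: {e}")
            if attempt < max_attempts:
                # Wait 1x, 2x, 4x ... the base delay before trying again
                time.sleep(base_delay * 2 ** (attempt - 1))
    return None

def process_all(texts: Sequence[str], fn: Callable[[str], str],
                **kwargs) -> List[Optional[str]]:
    """Process every text, keeping one result (or None) per input."""
    return [call_with_retries(fn, text, **kwargs) for text in texts]

# Demo with a stand-in function that fails once, then succeeds
calls = {"n": 0}

def flaky_upper(text: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("temporary failure")
    return text.upper()

results = process_all(["first", "second"], flaky_upper, base_delay=0)
print(results)  # ['FIRST', 'SECOND']
```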

In [85]:
# Define empty array for new texts
new_sentences = []

# Start the clock
start_time = time.time()

# Process each sentence with GPT-4o Mini
for count, sentence in enumerate(split_reports_sent['Sentence']):
    success = False
    while not success:
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=build_prompt_sent(sentence),
                temperature=0,
                seed=18062024
            ).choices[0].message.content
            
            new_sentences.append(response)
            success = True

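        except Exception as e:
            # Store a placeholder so new_sentences stays aligned with the data frame rows
            print(f"Error processing sentence {count}: {e}")
            new_sentences.append(None)
            break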


In [100]:
split_reports_sent['ProcessedSentence'] = new_sentences

# Drop rows where ProcessedSentence is 'ERROR: Incomplete'
split_reports_sent = split_reports_sent[split_reports_sent['ProcessedSentence'] != 'ERROR: Incomplete']

# Save the processed sentences 
split_reports_sent.to_csv('../Data/processed_sentences.csv', index=False)

In [88]:
new_paragraphs = []

# Process each paragraph with GPT-4o Mini
for count, paragraph in enumerate(split_reports_para['Paragraph']):
    success = False
    while not success:
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=build_prompt_para(paragraph),
                temperature=0,
                seed=18062024
            ).choices[0].message.content
            
            new_paragraphs.append(response)
            success = True

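        except Exception as e:
            # Store a placeholder so new_paragraphs stays aligned with the data frame rows
            print(f"Error processing paragraph {count}: {e}")
            new_paragraphs.append(None)
            break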

# End the timer
end_time = time.time()

# Calculate & print time taken
total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')


Time taken: 161 minutes and 33.77 seconds


In [101]:
split_reports_para['ProcessedParagraph'] = new_paragraphs

# Drop rows where ProcessedParagraph is 'ERROR: Incomplete'
split_reports_para = split_reports_para[split_reports_para['ProcessedParagraph'] != 'ERROR: Incomplete']

# Save the processed paragraphs
split_reports_para.to_csv('../Data/processed_paragraphs.csv', index=False)

In [89]:
processed_paragraphs = new_paragraphs
processed_paragraphs = pd.DataFrame(processed_paragraphs, columns=['sentences'])
processed_paragraphs.to_csv('../Data/processed_paragraphs.csv', index=False)

## 2. Embeddings

We first need to embed our data, representing each text as a numeric vector that captures its semantic meaning. This is a major advantage over bag-of-words methods like LDA, as we can draw on modern transformer-based embedding models.

BERTopic's default embedding model (`all-MiniLM-L6-v2` from sentence-transformers) is small and fast, but larger, more recent models tend to produce better representations. Luckily, we can swap in another sentence-transformers or Hugging Face model, or an API-based model such as OpenAI's.

For ease of use, we'll use OpenAI's `text-embedding-3-large` embeddings model.

In [2]:
import pandas as pd
from bertopic import BERTopic
import os
from dotenv import load_dotenv
from openai import OpenAI
from bertopic.backend import OpenAIBackend

# Activate OpenAI API Key
load_dotenv('api.env')
openai_api_key = os.getenv('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)


# Import processed sentences csv
processed_sentences = pd.read_csv('../Data/processed_sentences.csv')

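# Convert the processed sentences (column saved above) to a list of strings, dropping missing values
processed_sentences = processed_sentences['ProcessedSentence'].dropna().astype(str).tolist()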

# Get embeddings
embedding_model = OpenAIBackend(client, "text-embedding-3-large")

# Generate embeddings
#sentence_embeddings = embedding_model.embed(processed_sentences)

BadRequestError: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
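The 400 response above does not say which part of the input was rejected. One thing worth ruling out (an assumption, not a confirmed diagnosis) is that the list contains missing values or empty strings, for example `NaN` entries introduced when reading the CSV. The sketch below drops anything that is not a non-empty string and splits the rest into smaller batches. The helper names and the batch size are hypothetical choices.

```python
from typing import Iterable, Iterator, List

def clean_inputs(texts: Iterable[object]) -> List[str]:
    """Keep only non-empty strings, stripped of surrounding whitespace."""
    return [t.strip() for t in texts if isinstance(t, str) and t.strip()]

def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical raw input containing a NaN, an empty string and a blank string
raw = ["They were seen regularly.", float("nan"), "", "   ", "They were discharged."]
cleaned = clean_inputs(raw)
print(cleaned)

# Each batch could then be sent to the embedding model separately
for batch in batched(cleaned, batch_size=1):
    print(len(batch), batch)
```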

Exclude sentences where the number of characters is 20 or less (this reduces noise).

In [3]:
processed_sentences = [sentence for sentence in processed_sentences if len(sentence) > 20]
processed_sentences

['They were a male with no history of mental health problems until they presented to their GP to discuss a decline in their mental health.',
 'They reported going through a very difficult time, their business was being liquidated and they felt they had let their family and business associates down.',
 'They advised the GP that they had placed a bag over their head a few days before the consultation but stopped short of seeing their attempt to end their life through because they thought about their children who were a protective factor for them.',
 'They referred them to the CRISIS Home Resolution Treatment Team, part of the Leicestershire Partnership NHS Trust, who accepted them for a period of treatment.',
 'They were seen regularly by the team and commenced on medication.',
 'They were discharged from the CRISIS Home Resolution Treatment Team.',
 'They contacted the CRISIS Home Resolution Treatment Team and reported a return of suicidal thoughts, saying that they felt they needed med

## 3. Dimensionality reduction

Our embeddings have high dimensionality, which poses a problem for downstream clustering tasks.

**UMAP** is a dimensionality reduction technique that balances the preservation of local and global structures by constructing a high-dimensional graph of the data and optimising its low-dimensional representation. UMAP focuses on maintaining the local structure by ensuring that points that are close together in high-dimensional space remain close in the low-dimensional space. This is achieved through a neighborhood graph that captures local relationships. The relationships captured are, as far as possible, preserved in the lower-dimensional representation. 

Another option would be **PCA**. PCA is strictly linear and would effectively capture major themes in the PFD reports, such as the distinction between different types of issues (e.g., hospital vs. workplace safety) based on overall variance. However, smaller clusters of reports with very specific concerns might not be well-preserved, as PCA could mix them if their variance is not as significant compared to the global patterns.

Any other dimensionality reduction model can also be imported from scikit-learn, so long as it has both a `.fit()` and `.transform()` method.
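As a minimal sketch of such a swap, scikit-learn's `PCA` exposes both methods and can reduce a matrix of embeddings to five dimensions. The random matrix below is only a stand-in for real sentence embeddings, and its shape is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for real sentence embeddings: 200 vectors with 64 dimensions
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 64))

# PCA has the .fit() and .transform() methods a dimensionality reduction step needs
pca_model = PCA(n_components=5, random_state=0)
reduced = pca_model.fit(fake_embeddings).transform(fake_embeddings)
print(reduced.shape)  # (200, 5)

# The model could then be passed to BERTopic in place of UMAP:
# topic_model = BERTopic(umap_model=pca_model)
```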

Here's a quick comparison between UMAP and PCA...

| Aspect               | PCA                                                          | UMAP                                                                 |
|----------------------|--------------------------------------------------------------|----------------------------------------------------------------------|
| Type                 | Linear                                                       | Non-linear                                                           |
| Local Structure      | Not specifically preserved                                   | Well preserved                                                       |
| Global Structure     | Well preserved                                               | Well preserved                                                       |
| Computation          | Generally faster and less complex                            | More complex and computationally intensive                           |
| Application Suitability | Best for data with linear relationships and when global patterns are of primary interest | Best for data with non-linear relationships and when both local and global patterns are important |


<br>

Since UMAP excels at maintaining local structures, it will effectively capture the relationships between our PFD report sentences that are similar. This is crucial when working at the sentence level, as we need to identify and group similar sentences together accurately.

### Parameters for UMAP
* `n_neighbors` - controls the local neighborhood size used for manifold approximation. It balances the focus between local versus global structure. Smaller values (e.g., 5-15) will capture very local structures and can lead to more detailed clustering. Larger values (e.g., 50-100) will incorporate more global structure and may provide a broader overview of the data.

* `min_dist` - controls the minimum distance between points in the low-dimensional space. It affects the tightness of clusters. Smaller values (e.g., 0.001-0.1) will result in more compact clusters. Larger values (e.g., 0.1-0.5) will spread out clusters, potentially making broader patterns more apparent.

* `n_components` - determines the number of dimensions for the reduced space. Usually set to 2 for visualisation purposes, but for more complex downstream tasks, 3 or more can be useful.

<br>

We'll experiment with different hyperparameters for UMAP, assessing the visualisation of the global projection of sentence embeddings. We can also look at the silhouette score, but this metric is not super informative for clusters of irregular shapes and different sizes.

In [4]:
from umap import UMAP
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

# Create a UMAP model
umap_model = UMAP(n_neighbors=15, 
                  n_components=5,
                  min_dist=0,
                  random_state=230624)
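For reference, the silhouette score mentioned above can be computed with scikit-learn as in the sketch below. It uses synthetic, well-separated blobs and KMeans labels purely for illustration, which is the easy case for this metric; on irregular clusters of different sizes, such as those we expect from report sentences, the score is much harder to interpret.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D data: three compact, well-separated groups
points, _ = make_blobs(
    n_samples=300,
    centers=[(-10, -10), (0, 0), (10, 10)],
    cluster_std=0.5,
    random_state=0,
)

# Cluster the points and score the result (values range from -1 to 1, higher is better)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)
score = silhouette_score(points, labels)
print(f"Silhouette score: {score:.2f}")
```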

## 4. Clustering

Once we've reduced the dimensionality of our input embeddings, the next step is to cluster them into groups of similar embeddings to identify our topics. Clustering is arguably the most important step, as the effectiveness of our clustering method directly impacts the coherence of our topic representations.

HDBSCAN is a very effective approach to clustering, as it can happily depict irregular shapes (e.g. not forcing clusters to be convex). Importantly, HDBSCAN does not force data into a cluster. If it cannot find a natural cluster for a data point, then it assigns it to a special 'outlier' topic (represented as "-1" in BERTopic). This makes our identified topics much tighter and more coherent. 

HDBSCAN has the following main hyperparameters...

* `min_cluster_size` - the minimum size of clusters. Smaller values can lead to more fine-grained clusters, while larger values lead to more general clusters.

* `metric` - the distance metric used. Common choices are 'euclidean', 'manhattan', 'cosine', etc. This choice should be based on the characteristics of the data.

* `cluster_selection_method` - the method to select clusters. 'eom' (excess of mass) is a common choice, but 'leaf' can also be used for a different clustering approach.


In [5]:
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(
    min_cluster_size=20, # minimum number of points needed to form a cluster
    #metric='cosine', # can choose euclidean, manhattan, cosine, etc.
    cluster_selection_method='leaf', # can choose eom or leaf
    prediction_data=True)

## 5. Vectoriser

After identifying clusters (topics), the vectoriser converts the original text into a document-term matrix of term counts. BERTopic then applies c-TF-IDF weighting to this matrix, giving more weight to important terms (i.e., terms that are frequent within a topic but relatively rare across the entire corpus).

It has the following hyperparameters:

* `ngram_range` - sets the range of n-gram lengths allowed within a topic representation. For example, an `ngram_range` of (1,3) allows 1-, 2- and 3-word entities. This is important for phrases like "mental health", which could only be represented as "mental" and "health", separately, if we had an n-gram range of (1,1).
* `stop_words` - allows us to specify that we want stop words to be removed. We've already embedded our text, so removing stop words at this stage does not affect the embeddings and helps to identify meaningful topics.
* `min_df` - sets the minimum document frequency a term needs in order to be kept in the vocabulary; terms that appear in fewer documents are dropped. c-TF-IDF will rarely rank such rare terms highly anyway, so we can afford to be quite liberal with this parameter.
* `max_df` - sets the maximum document frequency a term may have; terms that appear in more documents than this are dropped as too common. Setting it could make some topics more precise, but at the cost of excluding terms. In many cases, it might be best to leave it at its default.
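The effect of `ngram_range` and `min_df` can be seen on a toy corpus. The three sentences below are made up, and the settings are chosen only to keep the vocabulary small enough to read.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus
docs = [
    "The mental health team reviewed the risk assessment.",
    "No mental health assessment was completed by the team.",
    "The risk assessment was not shared with the team.",
]

# Unigrams only: "mental health" can only appear as two separate words
unigrams = CountVectorizer(ngram_range=(1, 1), stop_words="english").fit(docs)

# Up to 3-word phrases, keeping only terms found in at least 2 documents
ngrams = CountVectorizer(ngram_range=(1, 3), stop_words="english", min_df=2).fit(docs)

print(sorted(unigrams.vocabulary_))
print(sorted(ngrams.vocabulary_))
```

With `min_df=2`, phrases such as "mental health" and "risk assessment" survive because they occur in two documents, while words that occur in only one document (for example "reviewed") are dropped.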

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(
    min_df=20,
    ngram_range=(1,3),
    stop_words="english")

In [9]:
from bertopic import BERTopic
topic_model = BERTopic(#embedding_model=embedding_model, # ...custom embeddings
                       umap_model=umap_model, # ...dimensionality reduction
                       hdbscan_model=hdbscan_model, # ...clustering
                       vectorizer_model=vectorizer_model, # ...vectoriser
                       calculate_probabilities=True, # ...calculate probabilities
                       )

# Fit the model to data
topics, probabilities = topic_model.fit_transform(processed_sentences)

# Find unique topics
unique_topics = set(topics)
num_unique_topics = len(unique_topics)

print(f"Number of unique topics identified: {num_unique_topics}")
print("")

# Get topic information
topic_model.get_topic_info()




Number of unique topics identified: 47



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,2040,-1_team_staff_health_mental,"[team, staff, health, mental, mental health, d...",[Whilst they had some interactions with mental...
1,0,217,0_risk_information_assessment_patient,"[risk, information, assessment, patient, plan,...",[Although it was submitted that patient risk w...
2,1,145,1_patients_evidence_review_heard,"[patients, evidence, review, heard, patient, a...",[I also heard evidence to suggest that prescri...
3,2,115,2_plan_information_concerns_place,"[plan, information, concerns, place, assessmen...",[The Plan did not show that any meaningful tho...
4,3,114,3_mental health_mental_health_patients,"[mental health, mental, health, patients, team...",[The option for applicants to disclose a menta...
5,4,101,4_risk_assessment_including_does,"[risk, assessment, including, does, place, did...",[They were not able to identify in evidence th...
6,5,96,5_review_provided_care_staff,"[review, provided, care, staff, report, patien...",[The care and treatment provided by the Trust ...
7,6,70,6_staff_provided_received_available,"[staff, provided, received, available, heard, ...",[I am therefore surprised to learn that no sta...
8,7,68,7_guidance_national_lack_issues,"[guidance, national, lack, issues, contact, aw...",[There is a lack of national guidance for both...
9,8,66,8_inquest_evidence_concern_case,"[inquest, evidence, concern, case, heard, conc...",[During the inquest evidence was heard that: i...
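
The count printed above includes BERTopic's outlier label `-1`, so the number of substantive topics is one lower. A minimal sketch of separating the two, using a small made-up `toy_topics` list in place of the real `topics` output:

```python
from collections import Counter

# Hypothetical topic assignments; in the notebook this would be `topics`
toy_topics = [-1, 0, 0, 1, -1, 2, 1, -1]

counts = Counter(toy_topics)
n_real_topics = len(set(toy_topics) - {-1})   # exclude the outlier label
outlier_share = counts[-1] / len(toy_topics)  # fraction of documents left unassigned

print(f"Topics excluding outliers: {n_real_topics}")
print(f"Share of documents labelled as outliers: {outlier_share:.1%}")
```

Tracking the outlier share alongside the topic count helps when comparing clustering settings, since a model can report many topics while leaving most documents unassigned.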


In [None]:
representations = topic_model.get_topic_info()

# Keep only columns that get_topic_info() actually returns
# (the previous selection named columns that do not exist, which produces empty columns)
representations = representations[['Topic', 'Count', 'Name', 'Representation']]

# Save data frame
representations.to_csv('../Data/topic_info.csv', index=False)

#print("Topic Info:\n", representations)

In [None]:
topic_model.representative_docs_

In [None]:
processed_sentences

### BERTopic modelling with paragraphs

In [2]:
import pandas as pd

# Extract our processed paragraphs as a list
processed_paragraphs = pd.read_csv('../Data/processed_paragraphs.csv')
processed_paragraphs = processed_paragraphs['ProcessedParagraph'].tolist()

In [20]:
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
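import openai
import tiktoken
from bertopic.representation import OpenAI
import os
from dotenv import load_dotenv

# Activate OpenAI API Key
# The client is created via the `openai` module because bertopic.representation.OpenAI
# would otherwise shadow the openai.OpenAI class of the same name
load_dotenv('api.env')
openai_api_key = os.getenv('OPENAI_API_KEY')
client = openai.OpenAI(api_key=openai_api_key)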

# Set up embeddings
#from sentence_transformers import SentenceTransformer
#sentence_model = SentenceTransformer("all-MiniLM-L6-v2")

#from transformers.pipelines import pipeline

#embedding_model = pipeline("feature-extraction", model="nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Create a UMAP model
umap_model_para = UMAP(n_neighbors=10,
                       n_components=5,
                       min_dist=0.2,
                       random_state=230624)

hdbscan_model_para = HDBSCAN(
    min_cluster_size=10,
    cluster_selection_method='leaf',
    prediction_data=True)

vectorizer_model_para = CountVectorizer(
    min_df=10,
    ngram_range=(1,3),
    stop_words="english")

# Tokenizer used to truncate documents to doc_length tokens before they are sent to the model
tokenizer = tiktoken.encoding_for_model("gpt-4o-mini")

# Create representation model for topic labelling
representation_model = OpenAI(
    client,
    model="gpt-4o-mini", 
    delay_in_seconds=1, 
    chat=True,
    nr_docs=4,
    doc_length=1000,
    tokenizer=tokenizer
)

topic_model_para = BERTopic(embedding_model=embedding_model, # ...custom embeddings
                       umap_model=umap_model_para,
                       hdbscan_model=hdbscan_model_para,
                       vectorizer_model=vectorizer_model_para,
                       calculate_probabilities=True,
                       representation_model=representation_model,
                       nr_topics='auto'
                       )

# Fit the model to data
topics, probabilities = topic_model_para.fit_transform(processed_paragraphs)

# Find unique topics (the set includes the outlier label -1)
unique_topics = set(topics)
num_unique_topics = len(unique_topics)

print(f"Number of unique topics identified: {num_unique_topics}")
print("")

# Get topic information
topic_model_para.get_topic_info()

Number of unique topics identified: 20



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,770,-1_NHS Mental Health Safety and Risk Assessmen...,[NHS Mental Health Safety and Risk Assessment ...,"[I received evidence, in particular from the i..."
1,0,81,0_Prison Healthcare and Mental Health Manageme...,[Prison Healthcare and Mental Health Managemen...,"[The next case review was scheduled; however, ..."
2,1,33,1_Patient Safety and Medication Management Con...,[Patient Safety and Medication Management Conc...,[The pharmacy also confirmed that another pati...
3,2,31,2_Inadequate Risk Assessment and Discharge Pla...,[Inadequate Risk Assessment and Discharge Plan...,[Their risk assessment was not up-to-date and ...
4,3,30,3_Mental Health Crisis and Inpatient Care Acce...,[Mental Health Crisis and Inpatient Care Acces...,[Patients who attend a Hospital Accident and E...
5,4,29,4_Coroner's Report Response and Publication Gu...,[Coroner's Report Response and Publication Gui...,[I may extend the period. Your response must c...
6,5,27,5_Police Training and Mental Health Response C...,[Police Training and Mental Health Response Co...,[I heard evidence from a source that both atte...
7,6,26,6_Railway Safety and Police Response in Death ...,[Railway Safety and Police Response in Death I...,"[I am concerned that your investigation into, ..."
8,7,26,7_Communication Failures in Mental Health Serv...,[Communication Failures in Mental Health Servi...,"[When they were taken to the hospital, they ex..."
9,8,23,8_Regulation and Impact of Online Suicide-Rela...,[Regulation and Impact of Online Suicide-Relat...,[Continuing accessibility of the forum was usi...


In [50]:
# Merge topics 4 and 10 into the outlier topic (-1)
topics_to_merge = [[-1, 4, 10]]
topic_model_para.merge_topics(processed_paragraphs, topics_to_merge)

# Get topic information
topic_model_para.get_topic_info()

Number of unique topics identified: 25



Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,761,-1_Mental Health Services and Care System Fail...,[Mental Health Services and Care System Failures],"[During the course of the inquest, the evidenc..."
1,0,58,0_Risks and Safety Concerns in Patient Medicat...,[Risks and Safety Concerns in Patient Medicati...,"[During the inquest, I was referred by the tre..."
2,1,45,1_Mental Health Care and Treatment in Prisons,[Mental Health Care and Treatment in Prisons],[Evidence was taken at the inquest that: (i) H...
3,2,34,2_Serious Incident Investigation and Patient S...,[Serious Incident Investigation and Patient Sa...,[I remain gravely concerned about the inadequa...
4,3,33,3_Concerns about Police Training and Mental He...,[Concerns about Police Training and Mental Hea...,[I heard evidence from a source that both atte...
5,4,32,4_Mental Health and Safety Incident Reporting ...,[Mental Health and Safety Incident Reporting i...,[The BTP investigation identified a clear prob...
6,5,26,5_Mental Health Care and Assessment Issues,[Mental Health Care and Assessment Issues],[The Care and Treatment Plan was not updated w...
7,6,26,6_Patient Leave Policies and Risk Management i...,[Patient Leave Policies and Risk Management in...,"[The deceased was allowed to leave the unit, w..."
8,7,25,7_Police Coordination and Response Inefficiencies,[Police Coordination and Response Inefficiencies],[The evidence highlighted a concern that in ta...
9,8,22,8_Challenges in Care Coordination within Menta...,[Challenges in Care Coordination within Mental...,"[From the evidence I heard, the Care Coordinat..."


In [22]:
themes = topic_model_para.get_topic_info()

# Save themes as csv
themes = pd.DataFrame(themes, columns=['Topic', 'Count', 'Name'])
themes.to_csv('../Data/themes.csv', index=False)



In [21]:
topic_model_para.get_representative_docs()

  'In the words of the Jury: “Initial and all subsequent assessments seriously fail to recognize that the prolonged choice not to eat or drink was in fact an indication of ‘action’ to end their own life and therefore they should have been considered as a suicide risk.” Action is needed to prevent future failure to recognize (a) when the prolonged choice of a patient detained under the Mental Health Act not to eat or drink should be regarded as an action to end their own life; and (b) when such a patient’s prolonged choice not to eat or drink should be recognized as elevating that patient’s suicide risk (including suicide by means other than malnourishment). At the conclusion of the Inquest, after the Jury had returned the completed Record of Inquest, I asked the Norfolk & Suffolk NHS Foundation Trust (‘NSFT’) to assist me with written information to inform me of what action is being taken to prevent future deaths related to the “serious failures” in risk assessment as to suicide risk i

In [25]:
topic_model_para.get_topic(1)

[('Prison Mental Health Care and Treatment Issues', 1)]
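
With an LLM-based representation model, the output above suggests that `get_topic` returns a single `(label, score)` pair rather than a ranked keyword list. A minimal sketch of pulling the label out of that structure follows; `label_from_topic` is a hypothetical helper, and the input is copied from the output above rather than fetched from a fitted model.

```python
def label_from_topic(topic_words):
    """Return the first non-empty term from a list of (term, score) pairs, or None."""
    for term, _score in topic_words:
        if term:
            return term
    return None

# Copied from the notebook output above
example = [('Prison Mental Health Care and Treatment Issues', 1)]
label = label_from_topic(example)
print(label)
```

Skipping empty terms is a defensive assumption in case a representation pads its output with blank entries.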

### Visualise topic models

In [38]:
topic_model_para.visualize_heatmap()