## Comparing topic modelling techniques

There are different topic modelling approaches, each with a different set of advantages and disadvantages. The 'best' modelling technique is far from absolute, and largely depends on the nuances of the text data being analysed. To our knowledge, PFD data has not yet been analysed via NLP or topic modelling techniques, meaning that there exists no literature on the optimal approach(es).

This notebook will compare the suitability of 5 topic modelling techniques for PFD 'concerns' data.<br><br><br>


1. **Latent Dirichlet Allocation (LDA)**

LDA is perhaps the most popular topic modelling technique. It is a probabilistic method that assumes each document is a mixture of various topics (likely suitable for PFD reports which frequently contain multiple concerns). It characterises topics as a 'mixture of words'; the model generates a topic distribution for each document and a word distribution for each topic.

LDA *does* require that we pre-define our number of topics.

LDA places Dirichlet priors on the distribution of topics within documents and of words within topics, which gives the model an explicit probabilistic (generative) interpretation.<br><br><br>


2. **Correlated Topic Model (CTM)**

CTM is an extension of LDA that allows for correlations between topics. It inherits LDA's main disadvantage, namely keyword lists that can be hard to interpret, but adds a covariance structure that models how topics co-occur. This is particularly interesting for our PFD data, where many reports raise multiple concerns and therefore span multiple topics.

CTM *does* require us to pre-define our number of topics.<br><br><br>


3. **Non-negative Matrix Factorisation (NMF)**

NMF is a matrix factorisation technique that decomposes the document-term matrix into two lower-dimensional matrices. Topics are characterised by non-negative components in the factorised matrices, representing the importance of words in topics and topics in documents. Similarly to LDA, it assumes that documents contain multiple topics.

NMF *does* require that we pre-define our number of topics.

NMF enforces non-negativity constraints. Many report that resulting topic keywords are therefore more interpretable than LDA, with less 'noise' in the keyword lists.<br><br><br>


4. **Top2Vec**

Topics in Top2Vec are characterised by dense clusters of document and word embeddings. These clusters are identified in a joint embedding space, where both documents and words are represented. It does allow for multiple topics per document; this is achieved through the proximity of document embeddings to multiple topic vectors in the semantic space.

Top2Vec does *not* require us to pre-define our number of topics.

Top2Vec uses deep learning-based embeddings (e.g., Doc2Vec, Universal Sentence Encoder) to capture the semantic relationships in the text. This method ensures that topics are discovered based on the natural clustering of similar documents and words, leading to a more intuitive and data-driven identification of topics.<br><br><br>


5. **BERTopic**

BERTopic uses transformer-based (BERT-style) embeddings and clustering algorithms to discover topics. Topics are characterised by dense clusters of semantically similar embeddings, identified through dimensionality reduction and clustering. Per-document topic distributions were not originally supported, but v0.13 (January 2023) lets us approximate one for each report via `.approximate_distribution`.

BERTopic does *not* require us to pre-define our number of topics.<br><br>
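The "mixture" idea behind LDA-style models can be made concrete with a small calculation. The sketch below is purely illustrative: the topic names, words and probabilities are invented, and nothing is fitted to PFD data. It shows how a per-document topic distribution and per-topic word distributions combine into the probability of each word appearing in that document.

```python
# Toy illustration of the "mixture" view used by LDA-style models.
# All names and numbers are invented for illustration; nothing is fitted to PFD data.

# Word distribution for each hypothetical topic (each inner dict sums to 1)
topic_word = {
    "medication": {"dose": 0.5, "prescription": 0.4, "ward": 0.1},
    "staffing": {"dose": 0.1, "prescription": 0.1, "ward": 0.8},
}

# Topic distribution for one hypothetical report (sums to 1)
doc_topic = {"medication": 0.7, "staffing": 0.3}


def word_probabilities(doc_topic, topic_word):
    """Return p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    probs = {}
    for topic, topic_weight in doc_topic.items():
        for word, word_weight in topic_word[topic].items():
            probs[word] = probs.get(word, 0.0) + topic_weight * word_weight
    return probs


probs = word_probabilities(doc_topic, topic_word)
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.2f}")
```

A fitted model estimates both sets of distributions from the corpus; here they are simply written down by hand to show how they interact.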


### Before we get started

There are a few processing steps we need to complete before running this comparison. We will remove stop words, punctuation and numbers from our report content, lemmatize the text, and generate an embedding for each report.<br><br>

## 1. Processing the data

In [1]:
import pandas as pd
import numpy as np
from dotenv import load_dotenv
import os

# Import cleaned data
data = pd.read_csv('../Data/cleaned.csv', index_col='ID')

# Just keep "CleanContent" field
data = data[['CleanContent']]
data

ID,CleanContent
Ref: 2024-0318,Pre-amble Mr Larsen was a 52 year old male wi...
Ref: 2024-0311,(1) The process for triaging and prioritising ...
Ref: 2024-0298,(1) There are questions and answers on Quora’s...
Ref: 2024-0297,(1) The prison service instruction (PSI) 64/20...
Ref: 2024-0296,My principal concern is that when a high-risk ...
...,...
Ref: 2016-0037,Barts and the London 1. Whilst it was clear to...
Ref: 2015-0465,1. Piotr Kucharz was a Polish gentleman who co...
Ref: 2015-0173,Camden and Islington Trust 1. It seemed from t...
Ref: 2015-0116,NOMS/SODEXO - ANTI-LIGATURE STRIPS ON CELL DOO...


### Remove stop words and punctuation 

This stage is vital for removing words like "the" and "my" from the reports. These words provide unnecessary 'noise' in topic modelling which can result in much less coherent topics.

Below, we tokenise the report content and remove stop words, numbers and punctuation.

In [2]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet


# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')


# Define stop words
stop_words = set(stopwords.words('english'))

# Tokenize the report content
data['ProcessedContent'] = data['CleanContent'].apply(word_tokenize)

# Remove punctuation, special characters, and numbers, and convert to lowercase
data['ProcessedContent'] = data['ProcessedContent'].apply(lambda x: [word.lower() for word in x if word.isalpha()])

# Remove stopwords
data['ProcessedContent'] = data['ProcessedContent'].apply(lambda x: [word for word in x if word not in stop_words])

data


[nltk_data] Downloading package punkt to /home/sam/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/sam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/sam/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/sam/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


ID,CleanContent,ProcessedContent
Ref: 2024-0318,Pre-amble Mr Larsen was a 52 year old male wi...,"[mr, larsen, year, old, male, history, mental,..."
Ref: 2024-0311,(1) The process for triaging and prioritising ...,"[process, triaging, prioritising, ambulance, a..."
Ref: 2024-0298,(1) There are questions and answers on Quora’s...,"[questions, answers, quora, website, provide, ..."
Ref: 2024-0297,(1) The prison service instruction (PSI) 64/20...,"[prison, service, instruction, psi, sets, proc..."
Ref: 2024-0296,My principal concern is that when a high-risk ...,"[principal, concern, mental, health, patient, ..."
...,...,...
Ref: 2016-0037,Barts and the London 1. Whilst it was clear to...,"[barts, london, whilst, clear, evidence, heard..."
Ref: 2015-0465,1. Piotr Kucharz was a Polish gentleman who co...,"[piotr, kucharz, polish, gentleman, commenced,..."
Ref: 2015-0173,Camden and Islington Trust 1. It seemed from t...,"[camden, islington, trust, seemed, evidence, h..."
Ref: 2015-0116,NOMS/SODEXO - ANTI-LIGATURE STRIPS ON CELL DOO...,"[strips, cell, doors, deceased, design, engine..."
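The last row of the output above hints at a side effect of the `isalpha()` filter: tokens that contain any non-letter character, such as a hyphen or slash, are dropped entirely rather than cleaned. The sketch below illustrates this with a hand-written token list and a tiny stand-in stop word set (both are assumptions for illustration; the notebook itself uses `word_tokenize` and NLTK's English stop word list).

```python
# Minimal sketch of how isalpha() filtering and stop word removal treat tokens.
# The token list and toy_stop_words are hand-written for illustration only.
tokens = ["NOMS/SODEXO", "-", "ANTI-LIGATURE", "STRIPS", "ON",
          "CELL", "DOORS", "64/2011", "Quora"]
toy_stop_words = {"on", "the", "my"}

# Keep purely alphabetic tokens only, converted to lowercase
alphabetic = [token.lower() for token in tokens if token.isalpha()]

# Drop stop words
kept = [token for token in alphabetic if token not in toy_stop_words]

# Tokens discarded because they contain at least one non-letter character
dropped = [token for token in tokens if not token.isalpha()]

print(kept)     # ['strips', 'cell', 'doors', 'quora']
print(dropped)  # ['NOMS/SODEXO', '-', 'ANTI-LIGATURE', '64/2011']
```

If hyphenated or slash-separated terms turn out to matter for the topics, one option would be to split on those characters before filtering; whether that helps for PFD reports is untested here.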


### Lemmatize the data

Lemmatization is the process of reducing words to their dictionary base form (lemma). For example, "running", "runs" and "ran" are all reduced to "run".

Lemmatization is generally preferable to stemming because it returns a real dictionary word. For example, an aggressive stemmer may truncate "better" to "bet", whereas a lemmatizer that knows the word is an adjective returns "good".

We can also enhance this process via 'part-of-speech' (POS) tagging. POS tagging enhances lemmatization by identifying the grammatical role of words. For example, without POS tagging, the word "lead" in the sentences "The lead levels were high" and "He will lead the investigation" could be incorrectly lemmatized. 
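To make the role of the POS tag concrete, here is a toy sketch in which the same surface form maps to different lemmas depending on its grammatical role. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and the lookup table is a hand-written stand-in for a real lexical database such as WordNet.

```python
# Toy sketch: why the part of speech matters when lemmatizing.
# TOY_LEMMAS is a hand-written stand-in for a lexical database such as WordNet.
TOY_LEMMAS = {
    ("leaves", "n"): "leaf",      # "the leaves fell"
    ("leaves", "v"): "leave",     # "she leaves the ward"
    ("meeting", "n"): "meeting",  # "the meeting was delayed"
    ("meeting", "v"): "meet",     # "they are meeting the family"
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to returning the word unchanged."""
    return TOY_LEMMAS.get((word, pos), word)


for word in ("leaves", "meeting"):
    print(word, "as noun ->", toy_lemmatize(word, "n"))
    print(word, "as verb ->", toy_lemmatize(word, "v"))
```

Without a tag, a lemmatizer has to assume one part of speech for every token (nouns, in the case of the fallback used in this notebook), so verb forms such as "meeting" would be left as they are.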

In [3]:
# Map POS tags for lemmatization
# ...J = Adjective, R = Adverb, V = Verb, N = Noun
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

In [4]:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

# Initialise the lemmatizer
lemmatizer = WordNetLemmatizer()

# Define function to process tokens
def process_content(tokens):
    try:
        # POS tagging
        pos_tags = pos_tag(tokens)
        
        # Lemmatize with POS tags
        lemmatized_tokens = []
        for token, tag in pos_tags:
            wordnet_pos = get_wordnet_pos(tag) or wordnet.NOUN
            lemmatized_token = lemmatizer.lemmatize(token, wordnet_pos)
            lemmatized_tokens.append(lemmatized_token)
        
        return lemmatized_tokens
    except Exception as e:
        print(f"Error processing content: {e}")
        return []

# Apply the process_content function
data['LemmatizedContent'] = data['ProcessedContent'].apply(process_content)
data

ID,CleanContent,ProcessedContent,LemmatizedContent
Ref: 2024-0318,Pre-amble Mr Larsen was a 52 year old male wi...,"[mr, larsen, year, old, male, history, mental,...","[mr, larsen, year, old, male, history, mental,..."
Ref: 2024-0311,(1) The process for triaging and prioritising ...,"[process, triaging, prioritising, ambulance, a...","[process, triaging, prioritise, ambulance, att..."
Ref: 2024-0298,(1) There are questions and answers on Quora’s...,"[questions, answers, quora, website, provide, ...","[question, answer, quora, website, provide, in..."
Ref: 2024-0297,(1) The prison service instruction (PSI) 64/20...,"[prison, service, instruction, psi, sets, proc...","[prison, service, instruction, psi, set, proce..."
Ref: 2024-0296,My principal concern is that when a high-risk ...,"[principal, concern, mental, health, patient, ...","[principal, concern, mental, health, patient, ..."
...,...,...,...
Ref: 2016-0037,Barts and the London 1. Whilst it was clear to...,"[barts, london, whilst, clear, evidence, heard...","[bart, london, whilst, clear, evidence, heard,..."
Ref: 2015-0465,1. Piotr Kucharz was a Polish gentleman who co...,"[piotr, kucharz, polish, gentleman, commenced,...","[piotr, kucharz, polish, gentleman, commence, ..."
Ref: 2015-0173,Camden and Islington Trust 1. It seemed from t...,"[camden, islington, trust, seemed, evidence, h...","[camden, islington, trust, seem, evidence, hea..."
Ref: 2015-0116,NOMS/SODEXO - ANTI-LIGATURE STRIPS ON CELL DOO...,"[strips, cell, doors, deceased, design, engine...","[strip, cell, door, decease, design, engineer,..."


### Word embeddings

It's useful to use word embeddings prior to topic modelling in order to capture semantic similarity between certain words. For example, words like 'medicine', 'drugs' and 'prescription' would all be treated independently if we did not use embeddings, despite them potentially having similar meanings.

By using word embeddings, we therefore increase the chance of our topic modelling approaches identifying coherent topics within the reports.
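Semantic similarity between embeddings is commonly measured with cosine similarity. The sketch below uses three hand-made, three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and these values are invented) to show how related terms end up with a higher score than unrelated ones.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors; real embeddings come from a trained model
toy_embeddings = {
    "medicine": [0.9, 0.1, 0.0],
    "prescription": [0.8, 0.2, 0.1],
    "ligature": [0.0, 0.2, 0.9],
}

related = cosine_similarity(toy_embeddings["medicine"], toy_embeddings["prescription"])
unrelated = cosine_similarity(toy_embeddings["medicine"], toy_embeddings["ligature"])

print(f"medicine vs prescription: {related:.3f}")
print(f"medicine vs ligature:     {unrelated:.3f}")
```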

We can either train our own embedding model (for example with Word2Vec), which builds a bespoke embedding space from our report content, or use a pre-trained model from OpenAI, which is easier to implement but *potentially* less suited to highly domain-specific texts such as PFD reports.

For now, we'll use OpenAI's `text-embedding-3-large` model, which was released in January 2024. Note that it embeds whole passages of text rather than individual words, so each report receives a single vector.

For this to work, we first need to untokenize our data in the `LemmatizedContent` column as OpenAI expects non-tokenized strings.

In [5]:
import tiktoken

# First, we need to make sure that no report exceeds OpenAI's input limit (8,191 tokens for text-embedding-3-large).
# We check against a conservative threshold of 8,000 tokens to prevent server-side errors.

def num_tokens_from_text(text: str, encoding_name="cl100k_base"):
    """
    Returns the number of OpenAI tokens.
    """
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(text))
    return num_tokens

# Function to un-tokenize text
def untokenize(tokens):
    """
    Reverses the tokenization process.
    """
    return ' '.join(tokens)

# Apply `untokenize` function
data['LemmatizedContent'] = data['LemmatizedContent'].apply(untokenize)

# Calculate the token count
data['TokenCount_LemmatizedContent'] = data['LemmatizedContent'].apply(num_tokens_from_text)

# Count the number of reports that exceed the maximum number of tokens
max_tokens = 8000
exceeding_reports_count = (data['TokenCount_LemmatizedContent'] > max_tokens).sum()

print(f"Number of reports exceeding the maximum number of tokens: {exceeding_reports_count}")


Number of reports exceeding the maximum number of tokens: 0


In [6]:
data

ID,CleanContent,ProcessedContent,LemmatizedContent,TokenCount_LemmatizedContent
Ref: 2024-0318,Pre-amble Mr Larsen was a 52 year old male wi...,"[mr, larsen, year, old, male, history, mental,...",mr larsen year old male history mental health ...,706
Ref: 2024-0311,(1) The process for triaging and prioritising ...,"[process, triaging, prioritising, ambulance, a...",process triaging prioritise ambulance attendan...,72
Ref: 2024-0298,(1) There are questions and answers on Quora’s...,"[questions, answers, quora, website, provide, ...",question answer quora website provide informat...,98
Ref: 2024-0297,(1) The prison service instruction (PSI) 64/20...,"[prison, service, instruction, psi, sets, proc...",prison service instruction psi set procedure m...,85
Ref: 2024-0296,My principal concern is that when a high-risk ...,"[principal, concern, mental, health, patient, ...",principal concern mental health patient miss r...,501
...,...,...,...,...
Ref: 2016-0037,Barts and the London 1. Whilst it was clear to...,"[barts, london, whilst, clear, evidence, heard...",bart london whilst clear evidence heard inques...,274
Ref: 2015-0465,1. Piotr Kucharz was a Polish gentleman who co...,"[piotr, kucharz, polish, gentleman, commenced,...",piotr kucharz polish gentleman commence living...,258
Ref: 2015-0173,Camden and Islington Trust 1. It seemed from t...,"[camden, islington, trust, seemed, evidence, h...",camden islington trust seem evidence heard cam...,220
Ref: 2015-0116,NOMS/SODEXO - ANTI-LIGATURE STRIPS ON CELL DOO...,"[strips, cell, doors, deceased, design, engine...",strip cell door decease design engineer sketch...,521


In [8]:
import time
from openai import OpenAI
import os
from dotenv import load_dotenv

# Set up OpenAI API
load_dotenv('api.env')
openai_api_key = os.getenv('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)

# Create function that returns the embedding for a piece of text
def get_embedding(text_to_embed, model_ID):
    # Replace newlines with spaces, then send the cleaned text to the API
    text = text_to_embed.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model_ID).data[0].embedding

# Define empty list to store embeddings
embeddings = []

# Reset index to avoid bug
data = data.reset_index(drop=False)

# Start the timer
start_time = time.time()

# Loop over each element of "LemmatizedContent" field and get embeddings
for idx, text in enumerate(data['LemmatizedContent']):
    print('Processing row {i} of {n}'.format(i=idx, n=len(data)))
    try:
        embedding = get_embedding(text, "text-embedding-3-large")
        embeddings.append(embedding)
    except Exception as e:
        embeddings.append(None)
        print(f'Error on row {idx}: {e}')

# End the timer
end_time = time.time()

# Calculate & print time taken
total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')

# Add embeddings to the dataframe
data['text-embedding-3-large'] = embeddings

data


Processing row 0 of 415
Processing row 1 of 415
Processing row 2 of 415
Processing row 3 of 415
Processing row 4 of 415
...

Unnamed: 0,index,ID,CleanContent,ProcessedContent,LemmatizedContent,TokenCount_LemmatizedContent,text-embedding-3-large
0,0,Ref: 2024-0318,Pre-amble Mr Larsen was a 52 year old male wi...,"[mr, larsen, year, old, male, history, mental,...",mr larsen year old male history mental health ...,706,"[0.03439606726169586, 0.0004933655727654696, -..."
1,1,Ref: 2024-0311,(1) The process for triaging and prioritising ...,"[process, triaging, prioritising, ambulance, a...",process triaging prioritise ambulance attendan...,72,"[0.015928048640489578, 0.015428761951625347, 0..."
2,2,Ref: 2024-0298,(1) There are questions and answers on Quora’s...,"[questions, answers, quora, website, provide, ...",question answer quora website provide informat...,98,"[-0.026135534048080444, -0.027413271367549896,..."
3,3,Ref: 2024-0297,(1) The prison service instruction (PSI) 64/20...,"[prison, service, instruction, psi, sets, proc...",prison service instruction psi set procedure m...,85,"[0.030314801260828972, 0.003587251529097557, -..."
4,4,Ref: 2024-0296,My principal concern is that when a high-risk ...,"[principal, concern, mental, health, patient, ...",principal concern mental health patient miss r...,501,"[0.04845774546265602, 5.705011062673293e-05, -..."
...,...,...,...,...,...,...,...
410,410,Ref: 2016-0037,Barts and the London 1. Whilst it was clear to...,"[barts, london, whilst, clear, evidence, heard...",bart london whilst clear evidence heard inques...,274,"[0.030965100973844528, 0.026973573490977287, -..."
411,411,Ref: 2015-0465,1. Piotr Kucharz was a Polish gentleman who co...,"[piotr, kucharz, polish, gentleman, commenced,...",piotr kucharz polish gentleman commence living...,258,"[-0.0017374138114973903, 0.011945893988013268,..."
412,412,Ref: 2015-0173,Camden and Islington Trust 1. It seemed from t...,"[camden, islington, trust, seemed, evidence, h...",camden islington trust seem evidence heard cam...,220,"[0.019967302680015564, 0.007394141983240843, -..."
413,413,Ref: 2015-0116,NOMS/SODEXO - ANTI-LIGATURE STRIPS ON CELL DOO...,"[strips, cell, doors, deceased, design, engine...",strip cell door decease design engineer sketch...,521,"[0.02099669724702835, 0.00478058448061347, -0...."


The embedding call broadly worked, but the console printed errors for some rows. We can investigate these reports individually.

In [9]:
# Print row 174 in "data"
print(data.loc[174])

index                                                                         174
ID                              Date of report: 12/01/2023Ref: 2023-0015Deceas...
CleanContent                                                                   ""
ProcessedContent                                                               []
LemmatizedContent                                                                
TokenCount_LemmatizedContent                                                    0
text-embedding-3-large                                                       None
Name: 174, dtype: object


It looks like these errors occur because the preprocessing in `preprocess.ipynb` could not remove the intro text from these reports and returned an empty string instead. This is fine for now, as the number of errors is relatively small.
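One way to avoid sending empty reports to the API at all is to filter them before the call. The following is a minimal sketch under stated assumptions: `embed_non_empty` and `fake_embed` are hypothetical helpers, and `fake_embed` merely stands in for the real embedding request so that the example runs offline.

```python
def embed_non_empty(texts, embed_fn):
    """Return one embedding per text, with None for empty or whitespace-only texts.

    embed_fn is a hypothetical callable (text -> list of floats); in the notebook
    it would wrap the embedding API call.
    """
    embeddings = []
    skipped = []
    for position, text in enumerate(texts):
        if not isinstance(text, str) or not text.strip():
            # Nothing to embed: record the position and keep the list aligned
            embeddings.append(None)
            skipped.append(position)
            continue
        embeddings.append(embed_fn(text))
    return embeddings, skipped


def fake_embed(text):
    """Stand-in for a real embedding call: returns two simple text statistics."""
    return [float(len(text)), float(text.count(" "))]


texts = ["mr larsen year old male", "", "   ", "prison service instruction"]
embeddings, skipped = embed_non_empty(texts, fake_embed)

print(f"Skipped positions: {skipped}")
print(f"Embedded {len(texts) - len(skipped)} of {len(texts)} texts")
```

Keeping a `None` placeholder for skipped rows means the result can still be assigned straight back to the dataframe column, as the loop above does.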

In [13]:
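# Retokenise the data
# word_tokenize only accepts strings, so rows that are already token lists
# (for example, if this cell is re-run) are left unchanged
data['LemmatizedContent'] = data['LemmatizedContent'].apply(
    lambda x: word_tokenize(x) if isinstance(x, str) else x
)
data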

In [14]:
print(len(data['LemmatizedContent'][0]), len(data['text-embedding-3-large'][0]))  

589 3072


### Tokenize and process data

Before topic modelling, we need to: (1) tokenize the data; (2) remove punctuation, special characters and numbers; (3) remove stop words; (4) lemmatize tokens to their dictionary base form.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag


# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Tokenise the report content
data['TokenisedContent'] = data['CleanContent'].apply(word_tokenize)

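# Remove punctuation, special characters and numbers, and convert to lowercase
# (NLTK's stop word list is lowercase, so tokens must be lowercased before the stop word filter)
data['TokenisedContent'] = data['TokenisedContent'].apply(lambda x: [word.lower() for word in x if word.isalpha()])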

# Remove stopwords
stop_words = set(stopwords.words('english'))
data['TokenisedContent'] = data['TokenisedContent'].apply(lambda x: [word for word in x if word not in stop_words])

data

In [None]:
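# Remove the report with ID "Ref: 2015-0072" due to erroneous output
# This is a temporary fix. Some prompt engineering is needed in the OpenAI API call to prevent erroneous outputs
# "ID" may be a column (after reset_index) or still the index, so handle both cases
if 'ID' in data.columns:
    data = data[data['ID'] != 'Ref: 2015-0072']
else:
    data = data.drop('Ref: 2015-0072', errors='ignore')
data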