## Comparing topic modelling techniques

There are several topic modelling approaches, each with its own advantages and disadvantages. No single technique is 'best' in absolute terms; the right choice depends largely on the characteristics of the text being analysed. To our knowledge, PFD data has not yet been analysed using NLP or topic modelling techniques, so there is no literature on the optimal approach(es) for this data.

This notebook compares the suitability of five topic modelling techniques for PFD 'concerns' data.<br><br><br>


1. **Latent Dirichlet Allocation (LDA)**

LDA is perhaps the most popular topic modelling technique. It is a probabilistic method that assumes each document is a mixture of topics, which is likely suitable for PFD reports, as these frequently contain multiple concerns. It characterises each topic as a 'mixture of words': the model estimates a topic distribution for each document and a word distribution for each topic.

LDA *does* require that we pre-define our number of topics.

It places Dirichlet priors on the distribution of topics in documents and of words in topics, which gives the model a fully probabilistic (generative) footing.<br><br><br>
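To make the two distributions concrete, the following is a minimal sketch of LDA's generative assumption (not of model fitting), using NumPy. The vocabulary, the number of topics and the prior values `alpha` and `beta` are illustrative assumptions, not values used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and settings (assumptions for illustration only)
vocab = ["ambulance", "delay", "prison", "staff", "medication", "risk"]
n_topics = 2
alpha = 0.5  # Dirichlet prior on the per-document topic mixture
beta = 0.5   # Dirichlet prior on the per-topic word distribution

# One word distribution per topic: shape (n_topics, len(vocab))
topic_word = rng.dirichlet([beta] * len(vocab), size=n_topics)

def generate_document(n_words):
    """Sample one document under LDA's generative assumptions."""
    # Topic mixture for this document
    doc_topic = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        # Pick a topic from the document's mixture, then a word from that topic
        topic = rng.choice(n_topics, p=doc_topic)
        word_id = rng.choice(len(vocab), p=topic_word[topic])
        words.append(vocab[word_id])
    return doc_topic, words

doc_topic, words = generate_document(8)
print("Topic mixture:", doc_topic.round(2))
print("Words:", words)
```

Fitting LDA runs this process in reverse: given only the observed words, it estimates the topic mixtures and word distributions most likely to have produced them.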


2. **Correlated Topic Modelling (CTM)**

CTM is an extension of LDA that allows for correlations between topics. It carries over LDA's core disadvantage of less interpretable keyword lists for each topic, but adds a covariance structure that models how topics co-occur. This is particularly interesting for our PFD data, where many reports raise multiple concerns and therefore span multiple topics.

CTM *does* require us to pre-define our number of topics.<br><br><br>
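The covariance structure can be illustrated with a short NumPy sketch. CTM replaces LDA's Dirichlet prior on topic proportions with a logistic-normal: a vector is drawn from a multivariate normal and passed through a softmax. The mean `mu` and covariance `sigma` below are made-up values chosen so that topics 0 and 1 tend to co-occur; they are assumptions for illustration, not fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for 3 topics (illustrative assumptions)
mu = np.zeros(3)
sigma = np.array([
    [1.0, 0.8, 0.0],  # topics 0 and 1 are positively correlated
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],  # topic 2 varies independently
])

def sample_topic_proportions(n_docs):
    """Draw per-document topic proportions from a logistic-normal prior."""
    eta = rng.multivariate_normal(mu, sigma, size=n_docs)
    # Numerically stable softmax maps each row onto the probability simplex
    exp_eta = np.exp(eta - eta.max(axis=1, keepdims=True))
    return eta, exp_eta / exp_eta.sum(axis=1, keepdims=True)

eta, theta = sample_topic_proportions(5000)

# Correlation of the underlying topic weights across documents
print(np.corrcoef(eta, rowvar=False).round(2))
```

A Dirichlet prior cannot express this kind of pairwise dependence between topics, which is the gap CTM is designed to fill.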


3. **Non-negative Matrix Factorisation (NMF)**

NMF is a matrix factorisation technique that decomposes the document-term matrix into two lower-dimensional matrices. Topics are characterised by non-negative components in the factorised matrices, representing the importance of words in topics and topics in documents. Similarly to LDA, it assumes that documents contain multiple topics.

NMF *does* require that we pre-define our number of topics.

Because NMF enforces non-negativity constraints, its topics are built additively from words. Many practitioners report that the resulting topic keywords are more interpretable than LDA's, with less 'noise' in the keyword lists.<br><br><br>


4. **Top2Vec**

Topics in Top2Vec are characterised by dense clusters of document and word embeddings. These clusters are identified in a joint embedding space, where both documents and words are represented. It does allow for multiple topics per document; this is achieved through the proximity of document embeddings to multiple topic vectors in the semantic space.

Top2Vec does *not* require us to pre-define our number of topics.

Top2Vec uses learned embeddings (e.g., Doc2Vec, Universal Sentence Encoder) to capture semantic relationships in the text. Topics are then discovered from the natural clustering of similar documents and words, which makes topic identification data-driven rather than dependent on a pre-set number of topics.<br><br><br>
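The idea of relating one document to several topics by proximity can be sketched with cosine similarity. The vectors below are made up and two-dimensional purely for readability; real embeddings would come from a trained model, and this is not the Top2Vec library's API.

```python
import numpy as np

# Hypothetical topic vectors and one document embedding (illustrative values)
topic_vectors = np.array([
    [1.0, 0.0],   # topic 0
    [0.0, 1.0],   # topic 1
    [-1.0, 0.0],  # topic 2
])
doc_vector = np.array([0.8, 0.6])

def cosine_similarity(vec, matrix):
    """Cosine similarity between one vector and each row of a matrix."""
    return (matrix @ vec) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec))

similarities = cosine_similarity(doc_vector, topic_vectors)

# Rank topics from nearest to furthest
ranked_topics = np.argsort(similarities)[::-1]
print(ranked_topics, similarities[ranked_topics].round(2))
```

Here the document sits close to topics 0 and 1 and far from topic 2, so it could reasonably be associated with the first two.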


5. **BERTopic**

BERTopic uses transformer-based (BERT-style) document embeddings and clustering algorithms to discover topics. Topics are characterised by dense clusters of semantically similar embeddings, identified through dimensionality reduction followed by clustering. Although a per-report topic distribution was not originally supported, v0.13 (January 2023) allows us to approximate one via '.approximate_distribution'.

BERTopic does *not* require us to pre-define our number of topics.<br><br><br>

In [14]:
import pandas as pd
import numpy as np

# Import cleaned data
data = pd.read_csv('../Data/cleaned.csv', index_col='ID')

# Just keep "CleanContent" field
data = data[['CleanContent']]
data


ID,CleanContent
Ref: 2024-0318,Pre-amble Mr Larsen was a 52 year old male wi...
Ref: 2024-0311,(1) The process for triaging and prioritising ...
Ref: 2024-0298,(1) There are questions and answers on Quora’s...
Ref: 2024-0297,(1) The prison service instruction (PSI) 64/20...
Ref: 2024-0296,My principal concern is that when a high-risk ...
...,...
Ref: 2016-0037,Barts and the London 1. Whilst it was clear to...
Ref: 2015-0465,1. Piotr Kucharz was a Polish gentleman who co...
Ref: 2015-0173,Camden and Islington Trust 1. It seemed from t...
Ref: 2015-0116,1. The deceased was a design engineer and his ...


### Tokenize and process data

Before topic modelling, we need to: (1) tokenize the data; (2) remove punctuation, special characters and numbers; (3) remove stop words; (4) lemmatize tokens to their dictionary base form.

In [17]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag


# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Tokenize the report content
data['TokenisedContent'] = data['CleanContent'].apply(word_tokenize)

# Remove punctuation, special characters and numbers
data['TokenisedContent'] = data['TokenisedContent'].apply(lambda x: [word for word in x if word.isalpha()])

# Remove stopwords (case-sensitive: NLTK's list is lowercase, so capitalised tokens such as 'The' are kept)
stop_words = set(stopwords.words('english'))
data['TokenisedContent'] = data['TokenisedContent'].apply(lambda x: [word for word in x if word not in stop_words])

data

[nltk_data] Downloading package punkt to /home/sam/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/sam/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/sam/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /home/sam/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


ID,CleanContent,TokenisedContent
Ref: 2024-0318,Pre-amble Mr Larsen was a 52 year old male wi...,"[Mr, Larsen, year, old, male, history, mental,..."
Ref: 2024-0311,(1) The process for triaging and prioritising ...,"[The, process, triaging, prioritising, ambulan..."
Ref: 2024-0298,(1) There are questions and answers on Quora’s...,"[There, questions, answers, Quora, website, pr..."
Ref: 2024-0297,(1) The prison service instruction (PSI) 64/20...,"[The, prison, service, instruction, PSI, sets,..."
Ref: 2024-0296,My principal concern is that when a high-risk ...,"[My, principal, concern, mental, health, patie..."
...,...,...
Ref: 2016-0037,Barts and the London 1. Whilst it was clear to...,"[Barts, London, Whilst, clear, evidence, I, he..."
Ref: 2015-0465,1. Piotr Kucharz was a Polish gentleman who co...,"[Piotr, Kucharz, Polish, gentleman, commenced,..."
Ref: 2015-0173,Camden and Islington Trust 1. It seemed from t...,"[Camden, Islington, Trust, It, seemed, evidenc..."
Ref: 2015-0116,1. The deceased was a design engineer and his ...,"[The, deceased, design, engineer, sketches, fo..."


In [18]:
# Map POS tags for lemmatization
# ...J = Adjective, R = Adverb, V = Verb, N = Noun
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None



In [20]:
# Initialise the lemmatizer and build the stop word set once, outside the function
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Define function to process tokens
def process_content(content):
    # Tokenize
    tokens = word_tokenize(content)

    # Remove punctuation, special characters and numbers
    tokens = [word for word in tokens if word.isalpha()]

    # Remove stopwords (case-sensitive: NLTK's list is lowercase, so 'The' is kept)
    tokens = [word for word in tokens if word not in stop_words]

    # POS tagging
    pos_tags = pos_tag(tokens)

    # Lemmatize with POS tags; tokens without a WordNet POS mapping are kept as-is
    lemmatized_tokens = []
    for token, tag in pos_tags:
        wn_pos = get_wordnet_pos(tag)
        lemmatized_tokens.append(lemmatizer.lemmatize(token, wn_pos) if wn_pos else token)

    return lemmatized_tokens

data['TokenisedContent'] = data['CleanContent'].apply(process_content)
data

ID,CleanContent,TokenisedContent
Ref: 2024-0318,Pre-amble Mr Larsen was a 52 year old male wi...,"[Mr, Larsen, year, old, male, history, mental,..."
Ref: 2024-0311,(1) The process for triaging and prioritising ...,"[The, process, triaging, prioritise, ambulance..."
Ref: 2024-0298,(1) There are questions and answers on Quora’s...,"[There, question, answer, Quora, website, prov..."
Ref: 2024-0297,(1) The prison service instruction (PSI) 64/20...,"[The, prison, service, instruction, PSI, set, ..."
Ref: 2024-0296,My principal concern is that when a high-risk ...,"[My, principal, concern, mental, health, patie..."
...,...,...
Ref: 2016-0037,Barts and the London 1. Whilst it was clear to...,"[Barts, London, Whilst, clear, evidence, I, he..."
Ref: 2015-0465,1. Piotr Kucharz was a Polish gentleman who co...,"[Piotr, Kucharz, Polish, gentleman, commence, ..."
Ref: 2015-0173,Camden and Islington Trust 1. It seemed from t...,"[Camden, Islington, Trust, It, seem, evidence,..."
Ref: 2015-0116,1. The deceased was a design engineer and his ...,"[The, deceased, design, engineer, sketch, find..."
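
The output above still contains capitalised function words such as 'The', 'There', 'My' and 'It'. This is because the stop word comparison is case-sensitive and NLTK's English stop word list is lowercase. One possible adjustment, sketched below, compares the lowercased token against the list while keeping the token's original casing. The small `stop_words` set is a stand-in so that the example runs without NLTK data, and `remove_stopwords` is a hypothetical helper, not a function defined elsewhere in this notebook.

```python
# Stand-in for set(stopwords.words('english')); an assumption for illustration
stop_words = {"the", "there", "my", "it", "is", "on", "and", "are"}

def remove_stopwords(tokens, stop_words):
    """Drop stop words regardless of capitalisation, keeping original casing."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["The", "process", "triaging", "is", "There", "Quora", "It", "Trust"]
filtered = remove_stopwords(tokens, stop_words)
print(filtered)
```

Whether to also lowercase the retained tokens is a separate choice, since doing so would merge proper nouns such as 'Trust' with their common-noun forms.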
