## Comparing topic modelling techniques

Several topic modelling approaches exist. This notebook compares how well each of them suits our scraped PFD data.<br><br><br>



1. **Latent Dirichlet Allocation (LDA)**

LDA is perhaps the most popular topic modelling technique. It is a probabilistic method that assumes each document is a mixture of various topics (likely suitable for PFD reports which frequently contain multiple concerns). It characterises topics as a 'mixture of words'. The model generates a topic distribution for each document and a word distribution for each topic.

LDA *does* require that we pre-define our number of topics.

It places Dirichlet priors on both the topic distribution of each document and the word distribution of each topic, which gives the model an explicit probabilistic (Bayesian) foundation.<br><br><br>
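
To make the "mixture of topics" idea concrete, here is a minimal, library-free sketch of LDA's generative story. It is not how a model is fitted in practice. The vocabulary, the two topic-word distributions and the function names are all invented for illustration. For brevity the topic-word distributions are fixed by hand rather than drawn from their own Dirichlet prior.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one toy document following LDA's generative process."""
    # Per-document topic mixture, drawn from the Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Choose a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
vocab = ["ambulance", "delay", "triage", "prison", "ligature", "cell"]  # hypothetical vocabulary
# Hypothetical topic-word distributions (each row sums to 1).
topic_word = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],
]
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], n_words=12, rng=rng)
print([round(t, 2) for t in theta], words)
```

Fitting LDA runs this story in reverse: given only the words, it infers a plausible `theta` for each document and a word distribution for each topic.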



2. **Non-negative Matrix Factorisation (NMF)**

NMF is a matrix factorisation technique that approximates the document-term matrix as the product of two smaller, non-negative matrices: one holds the weight of each topic in each document, the other the weight of each word in each topic. Like LDA, it assumes that documents can contain multiple topics.

NMF *does* require that we pre-define our number of topics.

Because NMF enforces non-negativity, topics can only add to a document and never cancel one another out. Many practitioners report that the resulting topic keywords are more interpretable than LDA's, with less 'noise' in the keyword lists.<br><br><br>
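
The factorisation can be sketched in plain Python using the classic multiplicative update rules (Lee and Seung). This is a toy illustration under stated assumptions: the matrix `V` is invented, the helper names are hypothetical, and a real analysis would use an optimised library implementation instead.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorise V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    # Strictly positive random initialisation.
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h

# Hypothetical document-term counts: rows are documents, columns are terms.
V = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
    [1, 1, 1, 1],
]
W, H = nmf(V, n_topics=2)
print("reconstruction error:", round(frobenius_error(V, W, H), 3))
```

Each row of `H` is a topic expressed as non-negative word weights, and each row of `W` gives the non-negative topic weights of one document. Because every update multiplies by a non-negative ratio, the factors can never become negative.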


3. **BERTopic**

BERTopic uses transformer-based (BERT-style) document embeddings and clustering algorithms to discover topics. Topics are characterised by dense clusters of semantically similar embeddings, identified through dimensionality reduction followed by clustering. Although this was not originally supported, v0.13 (January 2023) added `.approximate_distribution`, which lets us approximate a topic distribution for each report.

BERTopic does *not* require us to pre-define our number of topics.<br><br><br>


4. **Correlated Topic Modelling (CTM)**

CTM is an extension of LDA that allows for correlations between topics. It carries over LDA's core disadvantage of less interpretable keyword lists for each topic, but its distinctive contribution is a covariance structure that models how topics co-occur. This is particularly interesting for our PFD data, where many reports raise multiple concerns and so span multiple topics.

CTM *does* require us to pre-define our number of topics.<br><br><br>
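
The covariance structure can be illustrated with a small sketch. In the correlated topic model, the Dirichlet prior on a document's topic proportions is replaced by a logistic-normal prior: draw a Gaussian vector with a full covariance matrix, then apply a softmax. The mean vector, the covariance values and the function names below are invented for illustration.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a small symmetric positive-definite matrix."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l

def sample_topic_proportions(mu, chol, rng):
    """Draw eta ~ N(mu, Sigma), then map it onto the simplex with a softmax."""
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    eta = [m + sum(chol[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mu)]
    peak = max(eta)  # subtract the maximum for numerical stability
    exps = [math.exp(e - peak) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
mu = [0.0, 0.0, 0.0]
# Hypothetical covariance: topics 0 and 1 tend to rise together; topic 2 is independent.
sigma = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
chol = cholesky(sigma)
samples = [sample_topic_proportions(mu, chol, rng) for _ in range(2000)]
print([round(p, 2) for p in samples[0]])
```

The off-diagonal entry `0.8` is the part a Dirichlet prior has no parameter for: it encodes that two topics tend to be emphasised in the same reports.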



5. **Top2Vec**

Topics in Top2Vec are characterised by dense clusters of document and word embeddings. These clusters are identified in a joint embedding space, where both documents and words are represented. It does allow for multiple topics per document; this is achieved through the proximity of document embeddings to multiple topic vectors in the semantic space.

Top2Vec does *not* require us to pre-define our number of topics.

Top2Vec uses deep learning-based embeddings (e.g., Doc2Vec, Universal Sentence Encoder) to capture the semantic relationships in the text. Topics are therefore discovered from the natural clustering of similar documents and words rather than from word counts alone, which tends to give a more intuitive, data-driven set of topics.<br><br><br>
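
The "proximity to multiple topic vectors" idea can be sketched without any embedding model. The three-dimensional vectors and topic names below are invented stand-ins for real embeddings, and the helper functions are hypothetical rather than part of the Top2Vec API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_topics(doc_vector, topic_vectors):
    """Return (topic_name, similarity) pairs, most similar topic first."""
    scores = [(name, cosine_similarity(doc_vector, vec))
              for name, vec in topic_vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical topic vectors in a toy 3-dimensional embedding space.
topic_vectors = {
    "ambulance_delays": [0.9, 0.1, 0.0],
    "prison_safety": [0.0, 0.2, 0.9],
    "mental_health": [0.1, 0.9, 0.2],
}
doc_vector = [0.6, 0.7, 0.1]  # hypothetical report embedding

ranking = rank_topics(doc_vector, topic_vectors)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Here the toy report sits close to two topic vectors at once, which is how a single document can be associated with more than one topic.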



In [1]:
import pandas as pd
import numpy as np

# Import cleaned data
data = pd.read_csv('../Data/cleaned.csv')
data

Unnamed: 0,URL,ID,Date,Receiver,Content,CleanContent
0,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0318,Date of report: 13/06/2024,"TO: The Chief Executive, Leicestershire Partne...",During the course of the investigation my inqu...,Regulation 28 – After Inquest Document Templat...
1,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0311,Date of report: 07/06/2024,TO: 1. NATIONAL AMBULANCE RESILIENCE UNIT (NAR...,Regulation 28 – After Inquest Document Templat...,(1) The process for triaging and prioritising ...
2,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0298,Date of report: 05/05/2024,"TO: 1. CEO of Quora, 2. The Rt Hon Lucy Fraser...",During the course of the inquest the evidence ...,(1) There are questions and answers on Quora’s...
3,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0297,Date of report: 29/05/2024,"TO: Secretary of State for Justice, 1",During the course of the inquest the evidence ...,(1) The prison service instruction (PSI) 64/20...
4,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0296,Date of report: 03/06/2024,"TO: (1) , Chief Executive, Birmingham and Soli...","During the inquest, the evidence revealed matt...",My principal concern is that when a high-risk ...
...,...,...,...,...,...,...
410,https://www.judiciary.uk/prevention-of-future-...,Ref: 2016-0037,Date of report: 5 February 2016,TO: 1. Dean for Education Barts and the London...,"During the course of the inquest, the evidence...",Barts and the London 1. Whilst it was clear to...
411,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0465,Date of report: 24 November 2015,"TO: Chief Executive, Lancashire Care NHS Found...",During the course of the inquest the evidence ...,1. Piotr Kucharz was a Polish gentleman who co...
412,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0173,Date of report: 29 April 2015,TO: 1. Ms Wendy Wallace Chief Executive Camden...,"During the course of the inquest, the evidence...",Camden and Islington Trust 1. It seemed from t...
413,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0116,Date of report: 24 March 2015,"TO: National Offender Management Service, Cliv...",During the course of the inquest the evidence ...,NOMS/Sodexo - Anti-Ligature Strips on Cell Doo...


In [None]:
import nltk
from nltk.tokenize import word_tokenize

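# Download the Punkt tokenizer models that word_tokenize relies on
nltk.download('punkt')
# Recent NLTK releases (3.9+) look for the 'punkt_tab' resource instead
nltk.download('punkt_tab')

# Split each report's 'Content' text into a list of word tokens
data['Tokenized_Content'] = data['Content'].apply(word_tokenize)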
data
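
NMF, as described earlier, starts from a document-term matrix, and token lists like those in `Tokenized_Content` are the usual raw material for one. The following is a minimal sketch of that step using only the standard library. The two toy documents and the function name are invented, and a real pipeline would more likely use a library vectoriser with stop-word handling.

```python
from collections import Counter

def build_document_term_matrix(tokenized_docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    vocabulary = sorted({token for doc in tokenized_docs for token in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical token lists standing in for two tokenised reports.
toy_docs = [
    ["ambulance", "delay", "ambulance"],
    ["prison", "cell", "delay"],
]
vocabulary, matrix = build_document_term_matrix(toy_docs)
print(vocabulary)
print(matrix)
```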