# Preprocessing scraped PFD data


The **first** stage of this notebook cleans the scraped PFD data by removing each report's template text and correcting spelling mistakes, mainly via the OpenAI API. The result is saved to `cleaned.csv` in the `../Data/` directory.

The **second** stage performs NLP-specific preprocessing tasks, including tokenisation, stop word removal, lemmatization, and Word2Vec embeddings. The result is saved to `tokenised.json` in the `../Data/` directory.

**Outputted fields**

This notebook generates several new fields. These fields were created to maximise flexibility of our data for downstream NLP tasks, which may have different input requirements. For example, some topic modelling techniques require that words are tokenised prior to modelling, while others will tokenise for us. The generated fields are as follows:

* **CleanContent**:  Removal of introduction/template text, correction of spelling mistakes, and removal of names from the report.
* **ProcessedContent**:  As with CleanContent, except content is lowercased, stop words, numbers and special characters are removed, and words are lemmatized.
* **ProcessedWords**:  As with ProcessedContent, except content is word-tokenised.
* **ProcessedSentences**:  As with ProcessedContent, except content is sentence-tokenised.
* **WordEmbeddings**:  A vectorised representation of tokens in the ProcessedWords column.

In [1]:
import pandas as pd
import numpy as np
from dotenv import load_dotenv
import os
from openai import OpenAI

## 1. Initial cleaning

Below, we can see the result of our web scraping script. It produces 5 fields:
 * The **URL** of the PFD report.
 * The **ID**, scraped from the judiciary.uk website.
 * The **Date** of the report. Unhelpfully, this is in a mixture of date formats. We could use the OpenAI API to help with this, but I've left it for now.
 * The **Receiver**, or who the report was addressed to.
 * The **Content** of the report's *concerns* section. This is the main field of interest for this notebook.
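
As an aside on the mixed date formats mentioned above, here is a minimal sketch of how the `Date` field might be normalised without an LLM. It assumes only the two layouts visible in the sample rows (`13/06/2024` and `20 March 2015`); the helper name `parse_report_date` and the `DATE_FORMATS` tuple are hypothetical and not part of this notebook.

```python
from datetime import date, datetime
from typing import Optional

# Formats seen in the sample rows; extend this tuple if others turn up (assumption)
DATE_FORMATS = ("%d/%m/%Y", "%d %B %Y")

def parse_report_date(raw: str) -> Optional[date]:
    """Return the report date, or None if no known format matches."""
    text = raw.replace("Date of report:", "").strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # ...try the next format
    return None

print(parse_report_date("Date of report: 13/06/2024"))
print(parse_report_date("Date of report: 20 March 2015"))
```

If this approach suited the data, it could be applied with something like `data['Date'].apply(parse_report_date)`, and any rows returning `None` inspected by hand.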

In [2]:
data = pd.read_csv('../Data/raw.csv')
data

Unnamed: 0,URL,ID,Date,Receiver,Content
0,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0318,Date of report: 13/06/2024,"TO: The Chief Executive, Leicestershire Partne...",During the course of the investigation my inqu...
1,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0311,Date of report: 07/06/2024,TO: 1. NATIONAL AMBULANCE RESILIENCE UNIT (NAR...,Regulation 28 – After Inquest Document Templat...
2,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0298,Date of report: 05/05/2024,"TO: 1. CEO of Quora, 2. The Rt Hon Lucy Fraser...",During the course of the inquest the evidence ...
3,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0297,Date of report: 29/05/2024,"TO: Secretary of State for Justice, 1",During the course of the inquest the evidence ...
4,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0296,Date of report: 03/06/2024,"TO: (1) , Chief Executive, Birmingham and Soli...","During the inquest, the evidence revealed matt..."
...,...,...,...,...,...
565,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0112,Date of report: 20 March 2015,,
566,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0106,Date of report: 17 March 2015,,
567,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0087,Date of report: 9 March 2015,,
568,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0077,Date of report: 4 March 2015,,


Not all reports were scraped successfully, as shown by the `NaN` values in the `Receiver` and `Content` fields. Inspecting the URLs of failed reports shows that these reports are actually images, which have seemingly been scanned in and uploaded.

There are Python packages that can read text from images, but we'll leave this for now due to time constraints.

In [3]:
# Count number of rows
rows = data.shape[0]

# Count number of rows that have N/A values in "Content" column
na_rows = data['Content'].isna().sum()

print(f"There are {na_rows} out of {rows} reports which were unsuccessfully scraped. "
      f"This reflects {round(na_rows/rows*100, 2)} % of all reports. "
      f"\nThis leaves us with {rows - na_rows} reports left to analyse.")

# Drop rows with missing values in "Content" column
data = data.dropna(subset = ['Content'])

There are 155 out of 570 reports which were unsuccessfully scraped. This reflects 27.19 % of all reports. 
This leaves us with 415 reports left to analyse.


### Removing intro text & correcting spelling mistakes

Notice in the `Content` field of the scraped reports that all text begins with some intro text. This text differs ever so slightly between reports, but it is usually along the lines of:

> *During the course of the investigation my inquiries revealed matters giving rise to concern. In my opinion there is a risk that future deaths could occur unless action is taken. In the circumstances it is my statutory duty to report to you. The MATTERS OF CONCERN are as follows:*

Because of slight grammatical differences between these intro texts across documents, we can't easily use regular expressions to trim them. Instead, we can call the OpenAI API (GPT-3.5 Turbo) to remove this text for us. While we're here, we can also use the model to correct spelling mistakes and to remove individuals' names from the report (which would likely add undesirable 'noise' to our downstream topic models).

Note that for the code below to run, you need a `.env` file that sets the `OPENAI_API_KEY` environment variable (I've called mine `api.env` and placed it in the `/Scripts` directory). For security reasons, I've hidden my own API key.

In [4]:
# Load the .env file containing the OpenAI API Key
load_dotenv('api.env')
openai_api_key = os.getenv('OPENAI_API_KEY')

client = OpenAI(api_key=openai_api_key)

Below, we define our prompt. We tell GPT to remove introduction text from the report.

Our placeholder (`{report_text}`) allows us to dynamically insert each report content into the prompt. We're essentially repeating the prompt by iterating over each report.

In [5]:
prompt = """You are an expert in removing leading introductions to reports. You must return the provided report content with the leading introduction taken out and spelling issues corrected. You must abide by the below instructions:
1. *Never* change the main content of the report; only ever remove the leading introductory statement.
2. Always remove things like 'matters of concern are as follows' and 'my findings are as follows', etc; I am only interested in the core content of the report.
3. If you cannot remove the leading introduction (or indeed you cannot find a leading introduction), you must delete the entire report and provide "NaN" as your response - nothing else whatsoever.
4. You must not, under any circumstances, respond in your own 'voice' (for example, you must not declare that you cannot find a leading introduction, or similar). In these cases, you must simply remove the entire report and provide "NaN" - nothing else whatsoever.

Here is an example of a leading introductory statement which should be removed from your response:
"During the course of the investigation my inquiries revealed matters giving rise to concern. In my opinion there is a risk that future deaths could occur unless action is taken. In the circumstances it is my statutory duty to report to you. The MATTERS OF CONCERN are as follows:"

Your response must not contain the above statement, or anything which resembles it.

Your turn!
Report text:
{report_text}"""

Next, we define a function called `build_prompt` which takes a piece of report text and dynamically constructs a prompt for the LLM.

In [6]:
from typing import List, Dict # (...for type hints)

# Construct prompts for each given report text
# ... "->" is a type hint; it tells you what type of object the function should return
def build_prompt(report_text: str) -> List[Dict[str, str]]:
    # OpenAI 'messages' take a list of dictionaries, each with a 'role' and 'content' key. 
    # Role can be 'system', 'user', or 'assistant' (LLM replies as assistant); content is the text the LLM sees.
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt.format(report_text = report_text)}, # ...applies prompt dynamically with given report content
    ]
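
To make the shape of the returned messages concrete, here is a small sketch that needs no API call. It uses a shortened, hypothetical template (`demo_prompt`) and a hypothetical function name (`build_demo_messages`) in place of the full `prompt` and `build_prompt` above, and the sample report text is invented.

```python
from typing import Dict, List

# Hypothetical, shortened stand-in for the full `prompt` string defined earlier
demo_prompt = "Remove the leading introduction.\nReport text:\n{report_text}"

def build_demo_messages(report_text: str) -> List[Dict[str, str]]:
    # Same structure as build_prompt: one system message, one user message
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": demo_prompt.format(report_text=report_text)},
    ]

messages = build_demo_messages("The MATTERS OF CONCERN are as follows: (1) No beds were available.")
for message in messages:
    print(message["role"], "->", message["content"][:40])
```

One detail worth knowing: `str.format` only interprets curly braces in the template itself, so braces that happen to appear inside a report's text are inserted untouched. Literal braces in the template, however, would need to be doubled (`{{` and `}}`).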

Next, we apply this function to each report by looping over each element of the `Content` field.

We also send only the first 1000 characters of each report to the LLM, since it doesn't need to see the entire report to trim the intro text. This reduces the number of tokens processed, making the step cheaper and faster to run. The remainder of each report is re-attached to the model's response afterwards.

*Note the below code chunk will cost approximately $0.11 (£0.09) to run, assuming 415 reports.*
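
The split-and-rejoin idea can be sketched without calling the API at all. In the snippet below, `fake_cleaner` is a hypothetical stand-in for the LLM and `clean_head` is a hypothetical helper; neither is part of this notebook. The guard that returns a bare `"NaN"` is an assumption on my part: it keeps the failure marker from being glued onto the remainder of a long report, so a later equality check against `"NaN"` would still match.

```python
from typing import Callable

def clean_head(text: str, cleaner: Callable[[str], str], head_chars: int = 1000) -> str:
    """Clean only the first `head_chars` characters, then re-attach the rest."""
    head, tail = text[:head_chars], text[head_chars:]
    cleaned = cleaner(head)
    # If the cleaner signals failure, return the bare marker (assumed design choice)
    if cleaned.strip() == "NaN":
        return "NaN"
    return cleaned + tail

def fake_cleaner(head: str) -> str:
    """Hypothetical stand-in for the LLM: drop everything up to the intro's end."""
    marker = "are as follows:"
    if marker not in head:
        return "NaN"
    return head.split(marker, 1)[1].lstrip()

# Invented report: a short intro, one concern, then 2000 filler characters
report = "The MATTERS OF CONCERN are as follows: (1) No beds were available. " + "x" * 2000
cleaned = clean_head(report, fake_cleaner)
print(cleaned[:30])
print(len(report) - len(cleaned), "characters removed")
```

Only the first 1000 characters ever reach the cleaner, yet the full report body survives in the result.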

In [7]:
import time

# Define empty list to hold the cleaned report texts
new_texts = []

# Start the timer
start_time = time.time()

# Loop over each element of "Content" field and apply prompt
for idx, text in enumerate(data['Content']):
    print('Cleaning row {i} of {n}'.format(i=idx, n=len(data)))
    text_left = text[0:1000] # ...shorten text to first 1000 characters
    text_right = text[1000:] # ...hide the rest of the text from the LLM
    try:
        result = client.chat.completions.create(messages=build_prompt(report_text=text_left), 
                                                model="gpt-3.5-turbo", # ...can also use more advanced "gpt-4-turbo" or "gpt-4o"
                                                max_tokens=4096,
                                                temperature=0, # ...minimise randomness in completions
                                                seed=18062024).choices[0].message.content
        new_texts.append(result + text_right) # ...re-attach the part of the report the LLM did not see
    except Exception as e:
        new_texts.append('ERROR')
        print(f'OpenAI error on row {idx}: {e}')

# End the timer
end_time = time.time()

# Calculate & print time taken
total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')

Cleaning row 0 of 415
Cleaning row 1 of 415
Cleaning row 2 of 415
Cleaning row 3 of 415
Cleaning row 4 of 415
...

We can then append our new, cleaned reports to our original data frame.

In [8]:
data['CleanContent'] = new_texts
data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['CleanContent'] = new_texts


Unnamed: 0,URL,ID,Date,Receiver,Content,CleanContent
0,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0318,Date of report: 13/06/2024,"TO: The Chief Executive, Leicestershire Partne...",During the course of the investigation my inqu...,Regulation 28 – After Inquest Document Templat...
1,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0311,Date of report: 07/06/2024,TO: 1. NATIONAL AMBULANCE RESILIENCE UNIT (NAR...,Regulation 28 – After Inquest Document Templat...,(1) The process for triaging and prioritising ...
2,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0298,Date of report: 05/05/2024,"TO: 1. CEO of Quora, 2. The Rt Hon Lucy Fraser...",During the course of the inquest the evidence ...,(1) There are questions and answers on Quora’s...
3,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0297,Date of report: 29/05/2024,"TO: Secretary of State for Justice, 1",During the course of the inquest the evidence ...,(1) The prison service instruction (PSI) 64/20...
4,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0296,Date of report: 03/06/2024,"TO: (1) , Chief Executive, Birmingham and Soli...","During the inquest, the evidence revealed matt...",My principal concern is that when a high-risk ...
...,...,...,...,...,...,...
555,https://www.judiciary.uk/prevention-of-future-...,Ref: 2016-0037,Date of report: 5 February 2016,TO: 1. Dean for Education Barts and the London...,"During the course of the inquest, the evidence...",Barts and the London 1. Whilst it was clear to...
559,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0465,Date of report: 24 November 2015,"TO: Chief Executive, Lancashire Care NHS Found...",During the course of the inquest the evidence ...,1. Piotr Kucharz was a Polish gentleman who co...
562,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0173,Date of report: 29 April 2015,TO: 1. Ms Wendy Wallace Chief Executive Camden...,"During the course of the inquest, the evidence...",Camden and Islington Trust 1. It seemed from t...
564,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0116,Date of report: 24 March 2015,"TO: National Offender Management Service, Cliv...",During the course of the inquest the evidence ...,NOMS/SODEXO - ANTI-LIGATURE STRIPS ON CELL DOO...


We can compare the original content with the cleaned content to see the changes made by the model.

In [9]:
# Compare one example of original and cleaned content
print('ORIGINAL DATA: ', data['Content'][9])
print('CLEANED DATA: ', data['CleanContent'][9])

ORIGINAL DATA:  During the course of the investigation my inquiries revealed matters giving rise to concern. In my opinion there is a risk that future deaths could occur unless action is taken. In the circumstances it is my statutory duty to report to you. The MATTERS OF CONCERN are as follows: (brief summary of matters of concern) The gatekeeping assessment included a mental health state examination, where it was the clinical opinion of the mental health practitioner from the Crisis Resolution Home Treatment Team, that Ms Morris required an inpatient hospital admission to a mental health ward as there was an immediate risk to her safety as she was found to be a high risk of walking in front of a car. Whilst Ms Morris agreed to an informal admission, this was not possible at the time of assessment as there were no beds available nationally within the NHS or privately. As an inpatient admission was not possible, the option was to attend the Accident and Regulation 28 – After Inquest Doc

Although we've now removed the intro text, we can note that there is also a different kind of template text that has been accidentally captured by our web scraping tool. This takes the following format (with changing dates):

> Regulation 28 – After Inquest Document Template Updated 30/07/2021

We can remove this using regular expressions via the `re` package.

In [10]:
import re

# Provide pattern using regex
pattern = re.compile(r'Regulation 28 – After Inquest Document Template Updated \d{2}/\d{2}/\d{4}')

# Remove pattern from CleanContent field
data['CleanContent'] = data['CleanContent'].apply(lambda x: pattern.sub('', x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['CleanContent'] = data['CleanContent'].apply(lambda x: pattern.sub('', x))
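
To see what the substitution does, here is a small self-contained check on an invented sample string (the continuation after the template text is made up for illustration). It also collapses the double space that removal leaves behind, which is an extra tidy-up step not performed in the cell above.

```python
import re

# Same pattern as in the cell above
pattern = re.compile(r'Regulation 28 – After Inquest Document Template Updated \d{2}/\d{2}/\d{4}')

# Invented sample: template text embedded mid-sentence, as the scraper captures it
sample = ("the option was to attend the Accident and "
          "Regulation 28 – After Inquest Document Template Updated 30/07/2021 "
          "Emergency department.")

stripped = pattern.sub('', sample)
# Removing the template leaves two adjacent spaces, so collapse whitespace too
tidied = re.sub(r'\s+', ' ', stripped).strip()
print(tidied)

# The pattern is strict: a plain hyphen instead of the en dash does not match
print(pattern.search("Regulation 28 - After Inquest Document Template Updated 30/07/2021"))
```

Because the pattern expects an en dash and a `dd/mm/yyyy` date, any report that renders the template line differently would keep it. Searching `CleanContent` for the string `Regulation 28` after substitution is a quick way to check.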


In [11]:
# Compare one example of original and cleaned content... again!
# ...This time there shouldn't be any "Regulation 28" text
print('ORIGINAL DATA: ', data['Content'][9])
print('CLEANED DATA: ', data['CleanContent'][9])

ORIGINAL DATA:  During the course of the investigation my inquiries revealed matters giving rise to concern. In my opinion there is a risk that future deaths could occur unless action is taken. In the circumstances it is my statutory duty to report to you. The MATTERS OF CONCERN are as follows: (brief summary of matters of concern) The gatekeeping assessment included a mental health state examination, where it was the clinical opinion of the mental health practitioner from the Crisis Resolution Home Treatment Team, that Ms Morris required an inpatient hospital admission to a mental health ward as there was an immediate risk to her safety as she was found to be a high risk of walking in front of a car. Whilst Ms Morris agreed to an informal admission, this was not possible at the time of assessment as there were no beds available nationally within the NHS or privately. As an inpatient admission was not possible, the option was to attend the Accident and Regulation 28 – After Inquest Doc

Finally, we need to remove any reports where our LLM was unable to identify any introduction text. This issue appears to occur for a small minority of reports where the PDF format is atypical. This results in our web scraping tool scraping the template text but not the actual report contents.

In [12]:
nan_count = (data['CleanContent'] == 'NaN').sum()
print(f'Number of reports our LLM was unable to clean: {nan_count}', '- Removing...')

# Remove rows with "NaN" strings in the "CleanContent" column
data = data[data['CleanContent'] != 'NaN']

Number of reports our LLM was unable to clean: 18 - Removing...


In [13]:
# Save the cleaned data to a new CSV file
data = data.drop(columns = ['Content'])
data.to_csv('../Data/cleaned.csv', index = False)

## 2. NLP pre-processing

In [17]:
# Load the cleaned data and keep only the fields we need
clean_data = pd.read_csv('../Data/cleaned.csv')
clean_data = clean_data[['URL', 'CleanContent']]
clean_data

Unnamed: 0,URL,CleanContent
0,https://www.judiciary.uk/prevention-of-future-...,Pre-amble Mr Larsen was a 52 year old male wi...
1,https://www.judiciary.uk/prevention-of-future-...,(1) The process for triaging and prioritising ...
2,https://www.judiciary.uk/prevention-of-future-...,(1) There are questions and answers on Quora’s...
3,https://www.judiciary.uk/prevention-of-future-...,(1) The prison service instruction (PSI) 64/20...
4,https://www.judiciary.uk/prevention-of-future-...,My principal concern is that when a high-risk ...
...,...,...
392,https://www.judiciary.uk/prevention-of-future-...,1. Brenda Morris was allowed weekend leave on ...
393,https://www.judiciary.uk/prevention-of-future-...,Barts and the London 1. Whilst it was clear to...
394,https://www.judiciary.uk/prevention-of-future-...,1. Piotr Kucharz was a Polish gentleman who co...
395,https://www.judiciary.uk/prevention-of-future-...,Camden and Islington Trust 1. It seemed from t...


### Tokenise and remove unnecessary words

Here we remove stop words (e.g. "the", "my"), punctuation, numbers and special characters.

We then word and sentence-tokenise our report contents (topic modelling primarily uses word-tokenisation, though we also need to sentence tokenise our reports for our word embeddings model later on).
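
To illustrate how the three processed fields relate, here is a dependency-free sketch on an invented sentence pair. The tiny `STOP_WORDS` set and the regex-based `split_sentences` are simplified stand-ins for NLTK's stop word list and sentence tokeniser, and all function names here are hypothetical.

```python
import re
from typing import List

# Tiny illustrative stop word list; NLTK's English list is far longer
STOP_WORDS = {"the", "a", "was", "to", "no", "were", "and", "of", "on"}

def split_sentences(text: str) -> List[str]:
    """Naive stand-in for a sentence tokeniser: split after '.', '!' or '?'."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def to_tokens(text: str) -> List[str]:
    """Lowercase, keep alphabetic runs only, and drop stop words."""
    words = re.findall(r'[a-z]+', text.lower())
    return [w for w in words if w not in STOP_WORDS]

text = "The patient was admitted on 3 May. No beds were available!"

processed_words = to_tokens(text)                                     # flat list of tokens
processed_content = ' '.join(processed_words)                         # same tokens as one string
processed_sentences = [to_tokens(s) for s in split_sentences(text)]  # one token list per sentence

print(processed_content)
print(processed_words)
print(processed_sentences)

# Sentence boundaries depend on punctuation, so splitting must happen first
print(len(split_sentences(processed_content)), "sentence found after punctuation is stripped")
```

The last line shows why the order of operations matters: once punctuation has been removed, a sentence splitter has nothing left to split on.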


In [18]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import pandas as pd

# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Define stop words
stop_words = set(stopwords.words('english'))

# Define a function for pre-processing text
def preprocess_text(text):
    # Convert to lowercase and replace special characters and numbers with spaces
    return ''.join(char.lower() if char.isalpha() or char.isspace() else ' ' for char in text)

# Define a function for pre-processing and tokenizing text
def preprocess_and_tokenize(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Remove punctuation, special characters, and numbers, and convert to lowercase
    words = [word.lower() for word in words if word.isalpha()]
    # Remove stopwords
    return [word for word in words if word not in stop_words]

# Define a function to remove stop words from a string
def remove_stopwords(text):
    words = text.split()
    return ' '.join(word for word in words if word not in stop_words)

# Apply text preprocessing to the content
clean_data['ProcessedContent'] = clean_data['CleanContent'].apply(preprocess_text)

# Remove stop words from the preprocessed content
clean_data['ProcessedContent'] = clean_data['ProcessedContent'].apply(remove_stopwords)

# Apply word tokenization and pre-processing to the content
clean_data['ProcessedWords'] = clean_data['ProcessedContent'].apply(preprocess_and_tokenize)

# Sentence-tokenize the cleaned text (sentence boundaries rely on punctuation, which ProcessedContent no longer has),
# then apply the same pre-processing and word tokenization to each sentence
clean_data['ProcessedSentences'] = clean_data['CleanContent'].apply(lambda x: [preprocess_and_tokenize(preprocess_text(sent)) for sent in sent_tokenize(x)])

clean_data


Unnamed: 0,URL,CleanContent,ProcessedContent,ProcessedWords,ProcessedSentences
0,https://www.judiciary.uk/prevention-of-future-...,Pre-amble Mr Larsen was a 52 year old male wi...,pre amble mr larsen year old male history ment...,"[pre, amble, mr, larsen, year, old, male, hist...","[[pre, amble, mr, larsen, year, old, male, his..."
1,https://www.judiciary.uk/prevention-of-future-...,(1) The process for triaging and prioritising ...,process triaging prioritising ambulance attend...,"[process, triaging, prioritising, ambulance, a...","[[process, triaging, prioritising, ambulance, ..."
2,https://www.judiciary.uk/prevention-of-future-...,(1) There are questions and answers on Quora’s...,questions answers quora website provide inform...,"[questions, answers, quora, website, provide, ...","[[questions, answers, quora, website, provide,..."
3,https://www.judiciary.uk/prevention-of-future-...,(1) The prison service instruction (PSI) 64/20...,prison service instruction psi sets procedures...,"[prison, service, instruction, psi, sets, proc...","[[prison, service, instruction, psi, sets, pro..."
4,https://www.judiciary.uk/prevention-of-future-...,My principal concern is that when a high-risk ...,principal concern high risk mental health pati...,"[principal, concern, high, risk, mental, healt...","[[principal, concern, high, risk, mental, heal..."
...,...,...,...,...,...
392,https://www.judiciary.uk/prevention-of-future-...,1. Brenda Morris was allowed weekend leave on ...,brenda morris allowed weekend leave basis part...,"[brenda, morris, allowed, weekend, leave, basi...","[[brenda, morris, allowed, weekend, leave, bas..."
393,https://www.judiciary.uk/prevention-of-future-...,Barts and the London 1. Whilst it was clear to...,barts london whilst clear evidence heard inque...,"[barts, london, whilst, clear, evidence, heard...","[[barts, london, whilst, clear, evidence, hear..."
394,https://www.judiciary.uk/prevention-of-future-...,1. Piotr Kucharz was a Polish gentleman who co...,piotr kucharz polish gentleman commenced livin...,"[piotr, kucharz, polish, gentleman, commenced,...","[[piotr, kucharz, polish, gentleman, commenced..."
395,https://www.judiciary.uk/prevention-of-future-...,Camden and Islington Trust 1. It seemed from t...,camden islington trust seemed evidence heard c...,"[camden, islington, trust, seemed, evidence, h...","[[camden, islington, trust, seemed, evidence, ..."


### Lemmatize the data

Lemmatization is the process of reducing words to their base or root form. For example, the words "running", "runs" and "ran" all need to be returned to their base form of "run".

Lemmatization is generally preferable to 'stemming' because the former returns a semantically meaningful output. For example, a stemmer simply strips suffixes and so cannot relate "better" to "good", whereas a lemmatizer (given the right part-of-speech tag) can.

We can also enhance this process via 'part-of-speech' (POS) tagging. POS tagging tells the lemmatizer each word's class (verb, adjective, etc.), so that a word is reduced to the correct base form for the way it is used in the sentence.
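
Why the POS tag matters can be shown with a toy example that needs no NLTK downloads. The lookup table `TOY_LEMMAS` is a hypothetical miniature stand-in for WordNet, and the Penn Treebank tags are written by hand rather than produced by a tagger. The one-letter codes (`'a'`, `'v'`, `'n'`, `'r'`) are the values behind `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`.

```python
from typing import Dict, List, Tuple

def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to WordNet's one-letter POS code (noun by default)."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, also used as the fallback

# Hypothetical miniature lemma table, keyed by (word, POS code)
TOY_LEMMAS: Dict[Tuple[str, str], str] = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("better", "a"): "good",
}

def toy_lemmatize(tagged: List[Tuple[str, str]]) -> List[str]:
    # Words missing from the toy table are returned unchanged
    return [TOY_LEMMAS.get((word, penn_to_wordnet(tag)), word) for word, tag in tagged]

print(toy_lemmatize([("leaves", "NNS"), ("fall", "VBP")]))  # "leaves" as a noun
print(toy_lemmatize([("she", "PRP"), ("leaves", "VBZ")]))   # "leaves" as a verb
```

The same surface form, "leaves", maps to two different lemmas depending on its tag, which is exactly the ambiguity POS tagging resolves.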

In [19]:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, download
from nltk.corpus import wordnet

# Download NLTK data
download('averaged_perceptron_tagger')
download('wordnet')

# Map POS tags for lemmatization
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

# Initialise the lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to lemmatize tokens
def lemmatize_tokens(tokens):
    try:
        # POS tagging
        pos_tags = pos_tag(tokens)
        # Lemmatize with POS tags
        return [lemmatizer.lemmatize(token, get_wordnet_pos(tag) or wordnet.NOUN) for token, tag in pos_tags]
    except Exception as e:
        print(f"Error processing content: {e}")
        return []

# Function to lemmatize a text string
def lemmatize_text(text):
    try:
        # Tokenize the text into words
        words = word_tokenize(text)
        # POS tagging
        pos_tags = pos_tag(words)
        # Lemmatize with POS tags
        lemmatized_words = [lemmatizer.lemmatize(token, get_wordnet_pos(tag) or wordnet.NOUN) for token, tag in pos_tags]
        # Reconstruct the text
        return ' '.join(lemmatized_words)
    except Exception as e:
        print(f"Error processing content: {e}")
        return text

# Apply lemmatization to the preprocessed content
clean_data['ProcessedContent'] = clean_data['ProcessedContent'].apply(lemmatize_text)

# Apply word tokenization, pre-processing, and lemmatization to the content
clean_data['ProcessedWords'] = clean_data['ProcessedWords'].apply(lemmatize_tokens)

# Sentence-tokenize the content and apply word tokenization, pre-processing, and lemmatization to each sentence
clean_data['ProcessedSentences'] = clean_data['ProcessedSentences'].apply(lambda x: [lemmatize_tokens(sent) for sent in x])

clean_data


Unnamed: 0,URL,CleanContent,ProcessedContent,ProcessedWords,ProcessedSentences
0,https://www.judiciary.uk/prevention-of-future-...,Pre-amble Mr Larsen was a 52 year old male wi...,pre amble mr larsen year old male history ment...,"[pre, amble, mr, larsen, year, old, male, hist...","[[pre, amble, mr, larsen, year, old, male, his..."
1,https://www.judiciary.uk/prevention-of-future-...,(1) The process for triaging and prioritising ...,process triaging prioritise ambulance attendan...,"[process, triaging, prioritise, ambulance, att...","[[process, triaging, prioritise, ambulance, at..."
2,https://www.judiciary.uk/prevention-of-future-...,(1) There are questions and answers on Quora’s...,question answer quora website provide informat...,"[question, answer, quora, website, provide, in...","[[question, answer, quora, website, provide, i..."
3,https://www.judiciary.uk/prevention-of-future-...,(1) The prison service instruction (PSI) 64/20...,prison service instruction psi set procedure m...,"[prison, service, instruction, psi, set, proce...","[[prison, service, instruction, psi, set, proc..."
4,https://www.judiciary.uk/prevention-of-future-...,My principal concern is that when a high-risk ...,principal concern high risk mental health pati...,"[principal, concern, high, risk, mental, healt...","[[principal, concern, high, risk, mental, heal..."
...,...,...,...,...,...
392,https://www.judiciary.uk/prevention-of-future-...,1. Brenda Morris was allowed weekend leave on ...,brenda morris allow weekend leave basis partne...,"[brenda, morris, allow, weekend, leave, basis,...","[[brenda, morris, allow, weekend, leave, basis..."
393,https://www.judiciary.uk/prevention-of-future-...,Barts and the London 1. Whilst it was clear to...,bart london whilst clear evidence heard inques...,"[bart, london, whilst, clear, evidence, heard,...","[[bart, london, whilst, clear, evidence, heard..."
394,https://www.judiciary.uk/prevention-of-future-...,1. Piotr Kucharz was a Polish gentleman who co...,piotr kucharz polish gentleman commence living...,"[piotr, kucharz, polish, gentleman, commence, ...","[[piotr, kucharz, polish, gentleman, commence,..."
395,https://www.judiciary.uk/prevention-of-future-...,Camden and Islington Trust 1. It seemed from t...,camden islington trust seem evidence heard cam...,"[camden, islington, trust, seem, evidence, hea...","[[camden, islington, trust, seem, evidence, he..."


### Word embeddings

It's useful to use word embeddings prior to topic modelling in order to capture semantic similarity between certain words. For example, words like 'medicine', 'drugs' and 'prescription' would all be treated independently if we did not use embeddings, despite them having similar meanings. Note that we would not have this issue for models such as BERTopic which use their own embeddings.

By using word embeddings, we numerically link words with similar usage contexts and therefore increase the chances of our topic models presenting more coherent topics.

Below we use a pre-trained Word2Vec model from Gensim.

Additionally, we scan for **out-of-vocabulary (OOV)** words. These are words in our PFD data that are *not* in the pre-trained model's vocabulary, mostly because of spelling mistakes, specialised terminology, or acronyms. Each report's list of embedding vectors must contain exactly one vector per word token. As a crude solution for OOV words, which our Word2Vec model cannot represent numerically, we assign each of them the average of the in-vocabulary word vectors from the same report.
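
The OOV fallback described above can be illustrated on a toy vocabulary using plain Python lists. `TOY_VECTORS` is a hypothetical two-dimensional "model" (the real one has 300 dimensions), and `toy_embed` is an illustrative name rather than a function from this notebook.

```python
from typing import Dict, List, Set

# Hypothetical two-dimensional "model" with only two known words
TOY_VECTORS: Dict[str, List[float]] = {
    "drug": [1.0, 0.0],
    "dose": [3.0, 2.0],
}

def toy_embed(tokens: List[str], vectors: Dict[str, List[float]], oov: Set[str]) -> List[List[float]]:
    known = [vectors[t] for t in tokens if t in vectors]
    oov.update(t for t in tokens if t not in vectors)
    dim = len(next(iter(vectors.values())))
    if known:
        # Element-wise mean of the in-vocabulary vectors in this report
        avg = [sum(column) / len(known) for column in zip(*known)]
    else:
        # No known words at all: fall back to a zero vector
        avg = [0.0] * dim
    # One vector per token, so the output always matches the token list in length
    return [vectors.get(t, avg) for t in tokens]

oov_seen: Set[str] = set()
embeddings = toy_embed(["drug", "tewv", "dose"], TOY_VECTORS, oov_seen)
print(embeddings)
print(oov_seen)
```

The unknown token `tewv` receives the mean of the two known vectors, so the output keeps one vector per token.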

In [20]:
import gensim
import gensim.downloader as api
from gensim.models import Word2Vec

# Load the pre-trained Word2Vec model
# ...We use the popular Google News data source
model = api.load("word2vec-google-news-300")

# Function to get the word vectors for tokens, replacing OOV with average vector
def embed(tokens, model, oov_words):
    valid_tokens = [token for token in tokens if token in model.key_to_index]
    oov_tokens = [token for token in tokens if token not in model.key_to_index]
    oov_words.update(oov_tokens)
    
    if valid_tokens:
        word_vectors = [model[token] for token in valid_tokens]
        avg_vector = np.mean(word_vectors, axis=0)
    else:
        avg_vector = np.zeros(model.vector_size)
    
    # Replace OOV tokens with the average vector
    embeddings = [model[token] if token in model.key_to_index else avg_vector for token in tokens]
    
    return embeddings

# Initialize the WordEmbeddings column
clean_data['WordEmbeddings'] = None

# Initialize a set to store OOV words
oov_words = set()
oov_count = 0
mismatch_rows = []

# Loop through each row in the DataFrame
for i, row in clean_data.iterrows():
    embeddings = embed(row['ProcessedWords'], model, oov_words)
    clean_data.at[i, 'WordEmbeddings'] = embeddings
    
    # Check for OOV words count and mismatched dimensions
    oov_count += len([token for token in row['ProcessedWords'] if token not in model.key_to_index])
    if len(embeddings) != len(row['ProcessedWords']):
        mismatch_rows.append(i)

# Print the total count of OOV words
print(f'Total number of unique OOV words: {len(oov_words)}')
print(f'Total number of times an OOV word is used: {oov_count}')

# Check that all embeddings vectors are identical in dimension to the ProcessedWords column
if mismatch_rows:
    print(f'Rows with dimension mismatch: {mismatch_rows}')
else:
    print('All rows have matching dimensions between ProcessedWords and WordEmbeddings.')

print(f'The OOV words are as follows: {oov_words}')

clean_data

Total number of unique OOV words: 663
Total number of times an OOV word is used: 2152
All rows have matching dimensions between ProcessedWords and WordEmbeddings.
The OOV words are as follows: {'rcrp', 'neighbourhood', 'launceston', 'nationallockdown', 'langley', 'nsft', 'amph', 'tewv', 'ihbtt', 'clinicans', 'ctmuhb', 'beenen', 'ravenswood', 'larsen', 'iie', 'dickinson', 'hcps', 'bede', 'tyneside', 'whitelaw', 'admisssion', 'pimlott', 'usk', 'imb', 'ellson', 'spoe', 'rutland', 'azra', 'pinderfields', 'ssab', 'waverley', 'walczak', 'quora', 'cramlington', 'fulfil', 'mcloughlin', 'tostevin', 'dvpo', 'vauxhall', 'healthwith', 'oliv', 'marnie', 'zarins', 'favour', 'heathcotes', 'undertakerelevant', 'crht', 'warwickshire', 'whitchurch', 'hwhct', 'outsid', 'turbutt', 'templeton', 'kirkham', 'manon', 'martineau', 'hbbt', 'scc', 'wyndham', 'humberside', 'siwan', 'westleigh', 'ofcom', 'rowley', 'mercia', 'ucas', 'neas', 'elmley', 'roner', 'cjldt', 'noproforma', 'avallable', 'epjs', 'irene', 'ki

Unnamed: 0,URL,CleanContent,ProcessedContent,ProcessedWords,ProcessedSentences,WordEmbeddings
0,https://www.judiciary.uk/prevention-of-future-...,Pre-amble Mr Larsen was a 52 year old male wi...,pre amble mr larsen year old male history ment...,"[pre, amble, mr, larsen, year, old, male, hist...","[[pre, amble, mr, larsen, year, old, male, his...","[[0.0107421875, 0.07910156, 0.04345703, -0.058..."
1,https://www.judiciary.uk/prevention-of-future-...,(1) The process for triaging and prioritising ...,process triaging prioritise ambulance attendan...,"[process, triaging, prioritise, ambulance, att...","[[process, triaging, prioritise, ambulance, at...","[[0.11035156, 0.25585938, 0.034179688, -0.0212..."
2,https://www.judiciary.uk/prevention-of-future-...,(1) There are questions and answers on Quora’s...,question answer quora website provide informat...,"[question, answer, quora, website, provide, in...","[[question, answer, quora, website, provide, i...","[[0.10107422, 0.099121094, -0.037597656, 0.265..."
3,https://www.judiciary.uk/prevention-of-future-...,(1) The prison service instruction (PSI) 64/20...,prison service instruction psi set procedure m...,"[prison, service, instruction, psi, set, proce...","[[prison, service, instruction, psi, set, proc...","[[-0.03564453, -0.14257812, 0.27734375, -0.105..."
4,https://www.judiciary.uk/prevention-of-future-...,My principal concern is that when a high-risk ...,principal concern high risk mental health pati...,"[principal, concern, high, risk, mental, healt...","[[principal, concern, high, risk, mental, heal...","[[0.046875, -0.23046875, 0.328125, -0.16308594..."
...,...,...,...,...,...,...
392,https://www.judiciary.uk/prevention-of-future-...,1. Brenda Morris was allowed weekend leave on ...,brenda morris allow weekend leave basis partne...,"[brenda, morris, allow, weekend, leave, basis,...","[[brenda, morris, allow, weekend, leave, basis...","[[0.13769531, -0.05053711, 0.15429688, 0.09423..."
393,https://www.judiciary.uk/prevention-of-future-...,Barts and the London 1. Whilst it was clear to...,bart london whilst clear evidence heard inques...,"[bart, london, whilst, clear, evidence, heard,...","[[bart, london, whilst, clear, evidence, heard...","[[-0.034179688, -0.09716797, -0.03564453, 0.03..."
394,https://www.judiciary.uk/prevention-of-future-...,1. Piotr Kucharz was a Polish gentleman who co...,piotr kucharz polish gentleman commence living...,"[piotr, kucharz, polish, gentleman, commence, ...","[[piotr, kucharz, polish, gentleman, commence,...","[[-0.034623392, 0.02119609, 0.016141066, 0.057..."
395,https://www.judiciary.uk/prevention-of-future-...,Camden and Islington Trust 1. It seemed from t...,camden islington trust seem evidence heard cam...,"[camden, islington, trust, seem, evidence, hea...","[[camden, islington, trust, seem, evidence, he...","[[-0.052734375, -0.0030517578, -0.00793457, 0...."


The reports collectively contain a wide variety of OOV words. These are mostly names, spelling mistakes, and acronyms, along with some British spellings (e.g. 'neighbourhood', 'favour', 'fulfil') that are missing from the Google News vocabulary. This will almost certainly affect the performance of our subsequent topic modelling techniques, since none of these OOV words has a meaningful pre-trained embedding of its own.
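To decide which OOV words are worth fixing first, it may help to rank them by how often they occur. The snippet below is a minimal sketch of that idea, not part of the original pipeline: `count_oov`, `toy_reports`, and `toy_vocab` are hypothetical names, with the toy data standing in for `clean_data['ProcessedWords']` and `model.key_to_index`.

```python
from collections import Counter


def count_oov(token_lists, vocab):
    """Count how often each out-of-vocabulary token occurs across all reports."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(token for token in tokens if token not in vocab)
    return counts


# Toy stand-ins (hypothetical): in the notebook these would be
# clean_data['ProcessedWords'] and model.key_to_index respectively.
toy_reports = [
    ["patient", "nsft", "neighbourhood"],
    ["nsft", "ambulance", "favour", "nsft"],
]
toy_vocab = {"patient", "ambulance"}

oov_counts = count_oov(toy_reports, toy_vocab)

# Most frequent OOV tokens first
print(oov_counts.most_common(2))
```

Frequent acronyms or misspellings surfaced this way could then be corrected or expanded in the cleaning stage, while rare names can probably be left to the averaging fallback.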

Because several columns hold nested lists (tokens, sentences, and embedding vectors), we save the processed data in JSON format rather than CSV, which would flatten these lists into strings.

In [21]:
clean_data.to_json('../Data/tokenised.json', orient='split')
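For downstream notebooks, the file can be read back with `pd.read_json`, passing the same `orient` that was used when writing. The sketch below round-trips a small toy DataFrame through a temporary file to show that the nested lists come back as Python lists; `toy` and the temporary path are assumptions for illustration, and in practice the path would be `'../Data/tokenised.json'`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in (hypothetical) for clean_data, with the same kind of nested list columns.
toy = pd.DataFrame({
    "ProcessedWords": [["prison", "service"], ["ambulance"]],
    "WordEmbeddings": [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenised.json")
    toy.to_json(path, orient="split")
    # orient must match the value used when the file was written
    restored = pd.read_json(path, orient="split")

first_words = restored.loc[0, "ProcessedWords"]
print(type(first_words), first_words)
```

Note that the embeddings are restored as nested Python lists rather than NumPy arrays, so they may need converting with `np.array` before any numerical work.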