# Preprocessing scraped PFD data


The **first** stage of this notebook cleans the scraped PFD data by removing the report's template text and correcting spelling mistakes, mainly via the OpenAI API. This is saved to `cleaned.csv` in the `../Data/` directory.

The **second** stage performs NLP-specific preprocessing tasks, including tokenisation, stop word removal, lemmatization, and Word2Vec embeddings. This is saved to `tokenised.json` in the `../Data/` directory.

**Outputted fields**

This notebook generates several new fields. These fields were created to maximise flexibility of our data for downstream NLP tasks, which may have different input requirements. For example, some topic modelling techniques require that words are tokenised prior to modelling, while others will tokenise for us. The generated fields are as follows:

* **CleanContent**:  Removal of introduction/template text, correction of spelling mistakes, and removal of names from the report.
* **ProcessedContent**:  As with CleanContent, except content is lowercased, stop words are removed, numbers / special characters are removed, words are lemmatized.
* **ProcessedWords**:  As with ProcessedContent, except content is word-tokenised.
* **ProcessedSentences**:  As with ProcessedContent, except content is sentence-tokenised.
* **WordEmbeddings**:  A vectorised representation of tokens in the ProcessedWords column.

In [2]:
import pandas as pd
import numpy as np
from dotenv import load_dotenv
import os
from openai import OpenAI

## 1. Initial cleaning

Below, we can see the result of our web scraping script. It produces 5 fields:
 * The **URL** of the PFD report.
 * The **ID**, scraped from the judiciary.uk website.
 * The **Date** of the report. Unhelpfully, this is in a mixture of date formats. We could use the OpenAI API to help with this, but I've left it for now.
 * The **Receiver**, or who the report was addressed to.
 * The **Content** of the report's *concerns* section. This is the main field of interest for this notebook.
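
The mixed date formats noted above could also be handled without an LLM. Below is a minimal sketch, assuming only the two formats visible in the table output further down (`13/06/2024` and `20 March 2015`); `parse_report_date` and `DATE_FORMATS` are hypothetical names and are not part of the notebook's pipeline.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumption: only these two formats occur; extend the tuple if others turn up
DATE_FORMATS = ("%d/%m/%Y", "%d %B %Y")

def parse_report_date(raw: str) -> Optional[date]:
    """Strip the 'Date of report:' prefix and try each known format in turn."""
    text = re.sub(r"^Date of report:\s*", "", str(raw)).strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # ...try the next format
    return None  # ...unrecognised format, left for manual review

print(parse_report_date("Date of report: 13/06/2024"))
print(parse_report_date("Date of report: 9 March 2015"))
```

Returning `None` for unrecognised values keeps the failures visible, so any further formats can be added to `DATE_FORMATS` as they are discovered.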

In [2]:
data = pd.read_csv('../Data/raw.csv')
data

Unnamed: 0,URL,ID,Date,Receiver,Content
0,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0318,Date of report: 13/06/2024,"TO: The Chief Executive, Leicestershire Partne...",During the course of the investigation my inqu...
1,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0311,Date of report: 07/06/2024,TO: 1. NATIONAL AMBULANCE RESILIENCE UNIT (NAR...,Regulation 28 – After Inquest Document Templat...
2,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0298,Date of report: 05/05/2024,"TO: 1. CEO of Quora, 2. The Rt Hon Lucy Fraser...",During the course of the inquest the evidence ...
3,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0297,Date of report: 29/05/2024,"TO: Secretary of State for Justice, 1",During the course of the inquest the evidence ...
4,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0296,Date of report: 03/06/2024,"TO: (1) , Chief Executive, Birmingham and Soli...","During the inquest, the evidence revealed matt..."
...,...,...,...,...,...
565,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0112,Date of report: 20 March 2015,,
566,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0106,Date of report: 17 March 2015,,
567,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0087,Date of report: 9 March 2015,,
568,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0077,Date of report: 4 March 2015,,


Not all reports were scraped successfully, which we can see by `NaN` values in the `Receiver` and `Content` fields. Inspecting the URLs of failed reports shows that this is because these reports are actually images, which have seemingly been scanned in and uploaded.

There are alternative Python packages that allow for the reading of text in images, but we'll leave this for now due to time constraints.

In [3]:
# Count number of rows
rows = data.shape[0]

# Count number of rows that have N/A values in "Content" column
na_rows = data['Content'].isna().sum()

print("There are", na_rows, "out of", rows, "reports which were unsuccessfully scraped.",
      "This reflects", round(na_rows/rows*100, 2), "% of all reports.",
      "\nThis leaves us with", rows - na_rows, "reports left to analyse.")

# Drop rows with missing values in "Content" column
data = data.dropna(subset = ['Content'])

There are 155 out of 570 reports which were unsuccessfully scraped. This reflects 27.19 % of all reports. 
This leaves us with 415 reports left to analyse.


## Using the OpenAI API
### Removing template text

Notice in the `Content` field of the scraped reports that all text begins with some intro text. This text differs ever so slightly between reports, but it is usually along the lines of:

> *During the course of the investigation my inquiries revealed matters giving rise to concern. In my opinion there is a risk that future deaths could occur unless action is taken. In the circumstances it is my statutory duty to report to you. The MATTERS OF CONCERN are as follows:*

Because of slight grammatical differences between these intro texts across documents, we can't easily use regular expressions to trim them. Instead, we can call the OpenAI API (GPT 3.5 Turbo) to remove this text for us. GPT 3.5 Turbo is less advanced than other models, but it handled this simple task better in our testing: GPT 4 Turbo removed template text much less reliably.

Note that the code below requires an environment file containing your OpenAI API key, stored under the variable `OPENAI_API_KEY` (I've called the file `api.env` and placed it in the `/Scripts` directory). For security reasons, I've hidden my own API key.

In [3]:
# Load the .env file containing the OpenAI API Key
load_dotenv('api.env')
openai_api_key = os.getenv('OPENAI_API_KEY')

client = OpenAI(api_key=openai_api_key)

Below, we define our prompt. We tell GPT to remove introduction text from the report.

Our placeholder (`{report_text}`) allows us to dynamically insert each report content into the prompt. We're essentially repeating the prompt by iterating over each report.
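
To illustrate the placeholder mechanism on its own, here is a minimal sketch using a shortened, made-up template and two made-up report strings (neither is the real prompt or real data). It shows `str.format` producing one prompt per report.

```python
# Hypothetical shortened template; the real prompt is defined in the next cell
template = "Remove the introduction from this report.\nReport text:\n{report_text}"

# Made-up report bodies, for illustration only
reports = ["First report body.", "Second report body with {curly braces}."]

# One prompt per report: the placeholder is swapped for each report's text
prompts = [template.format(report_text=r) for r in reports]

for p in prompts:
    print(p)
    print("---")
```

Only braces in the template itself are interpreted by `str.format`, so curly braces that happen to appear inside a report's text pass through unchanged.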

In [5]:
prompt = """You are an expert in removing leading introductions to reports. You must return the provided report content with the leading introduction taken out and spelling issues corrected. You must abide by the below instructions:
1. *Never* change the main content of the report; only ever remove the leading introductory statement.
2. Always remove things like 'matters of concern are as follows' and 'my findings are as follows', etc; I am only interested in the core content of the report.
3. If you cannot remove the leading introduction (or indeed you cannot find a leading introduction), you must delete the entire report and provide "NaN" as your response - nothing else whatsoever.
4. You must not, under any circumstances, respond in your own 'voice' (for example, you must not declare that you cannot find a leading introduction, or similar). In these cases, you must simply remove the entire report and provide "NaN" - nothing else whatsoever.

Here is an example of a leading introductory statement which should be removed from your response:
"During the course of the investigation my inquiries revealed matters giving rise to concern. In my opinion there is a risk that future deaths could occur unless action is taken. In the circumstances it is my statutory duty to report to you. The MATTERS OF CONCERN are as follows:"

Some reports may also contain the phrase "(brief summary of matters of concern)" which should also be removed.

Your turn!
Report text:
{report_text}"""

Next, we define a function called `build_prompt` which takes a piece of report text and dynamically constructs a prompt for the LLM.

In [6]:
from typing import List, Dict # (...for type hints)

# Construct prompts for each given report text
# ... "->" is a type hint; it tells you what type of object the function should return
def build_prompt(report_text: str) -> List[Dict[str, str]]:
    # OpenAI 'messages' take a list of dictionaries, each with a 'role' and 'content' key. 
    # Role can be 'system', 'user', or 'assistant' (LLM replies as assistant); content is the text the LLM sees.
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt.format(report_text = report_text)}, # ...applies prompt dynamically with given report content
    ]

Next, we apply this function to each report by looping over each element of the `Content` field.

To manage the fact that some reports are very large, we only provide the GPT model with the first 1000 characters - just enough for it to get a sense of what is template text, and what is report text. We then join this processed 'left side' of the data with the unprocessed 'right side'. This approach drastically improves performance and saves on computation time/cost.
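
The left/right split described above can be sketched in isolation. In the example below, `clean_head` and `fake_cleaner` are hypothetical helpers: `fake_cleaner` is a stand-in for the API call so that the sketch runs offline, and it simply cuts everything up to a fixed marker phrase.

```python
from typing import Callable

def clean_head(text: str, clean_fn: Callable[[str], str], head_chars: int = 1000) -> str:
    """Clean only the first `head_chars` characters, then rejoin the untouched remainder."""
    head, tail = text[:head_chars], text[head_chars:]
    cleaned = clean_fn(head)
    # Design choice in this sketch: if the cleaner signals failure with "NaN",
    # flag the whole report rather than returning "NaN" glued to the remainder
    if cleaned.strip() == "NaN":
        return "NaN"
    return cleaned + tail

def fake_cleaner(head: str) -> str:
    """Stand-in for the LLM call: drop everything up to a marker phrase."""
    marker = "are as follows:"
    pos = head.find(marker)
    return "NaN" if pos == -1 else head[pos + len(marker):].lstrip()

report = "Intro text. The MATTERS OF CONCERN are as follows: 1. Real content."
print(clean_head(report, fake_cleaner))
print(clean_head(report, fake_cleaner, head_chars=20))
```

With the small `head_chars` value the marker falls outside the visible window, so the stand-in cleaner gives up and the report is flagged. The same trade-off applies to the 1000-character cut-off: it needs to be long enough to contain the whole introduction.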

First, we apply it to a random sample of 10 reports to check whether it works as expected. This is because applying the code to the entire dataset can take quite a bit of time.

In [7]:
import random
import time

# Define empty lists for the cleaned and original texts
new_texts = []
original_texts = []

# Sample 10 rows from the data frame
random.seed(18062024)
sampled_indices = random.sample(range(len(data)), 10)

# Start the timer
start_time = time.time()

# Loop over each element of the sampled "Content" field and apply prompt
for count, idx in enumerate(sampled_indices, start=1):
    text = data['Content'].iloc[idx]
    original_texts.append(text)
    print('\n' * 2) 
    print('## Cleaning row {i} of 10'.format(i=count))
    text_left = text[0:1000]  # ...shorten text to first 1000 characters
    text_right = text[1000:]  # ...hide the rest of the text from the LLM
    try:
        result = client.chat.completions.create(
            messages=build_prompt(report_text=text_left),
            model="gpt-3.5-turbo",  # ...can use "gpt-3.5-turbo", "gpt-4-turbo" or "gpt-4"
            max_tokens=4096,
            temperature=0,  # ...minimise randomness in completions
            seed=18062024).choices[0].message.content
        
        cleaned_text = result + text_right
        new_texts.append(cleaned_text)
        print(f'ORIGINAL TEXT: {text}')
        print(f'CLEANED TEXT: {cleaned_text}')
    except Exception as e:  # ...catch API errors without swallowing KeyboardInterrupt
        new_texts.append('ERROR')
        print(f'OpenAI error on row {count}: {e}')
        print(f'ORIGINAL TEXT: {text}')
        print(f'CLEANED TEXT: ERROR')

# End the timer
end_time = time.time()

# Calculate & print time taken
total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')

# Create data frame to store "new_texts" and "original_texts"
result_df = pd.DataFrame({'Original Text': original_texts, 'Cleaned Text': new_texts})

result_df





## Cleaning row 1 of 10
ORIGINAL TEXT: During the course of the inquest, evidence revealed matters giving rise to concern. In my opinion there is a risk that future deaths could occur unless action is taken. In the circumstances it is my statutory duty to report to you. Although it remains my duty to report this matter to you (since the circumstances of concern continue to exist), I note the steps that Durham County Council has taken, and proposes to take, to address these concerns (see below and in your attached letter dated 30th November 2022). The MATTERS OF CONCERN are as follows. – All concerns relate to the bridge , which carries a road and two footpaths up to around 30m (100ft) above the reiver Weir (1) the bridge’s parapet and railing is accessible to pedestrians on the bridge; (3) there is absence of monitored CCTV and lighting or other means of detecting those at immediate risk; and 2 (4) there is a risk of death to persons falling AND to those near the foot of the bridge 

Unnamed: 0,Original Text,Cleaned Text
0,"During the course of the inquest, evidence rev...",Although it remains my duty to report this mat...
1,1. I heard evidence that the Department of Edu...,1. I heard evidence that the Department of Edu...
2,6 7 8 During the course of the inquest the evi...,­ (1) No assessment of the risk of suicide or ...
3,During the course of the inquest the evidence ...,1) (Addressed to Nightingales Care Limited and...
4,During the course of the inquest the evidence ...,1. No approved Mental Health practitioner was ...
5,During the course of the investigation my inqu...,During the inquest it came to light that John ...
6,During the course of the inquest the evidence ...,(1) The Joint Commissioning Panel for Mental H...
7,During the course of the inquest the evidence ...,"a) During the evidence, there was some confusi..."
8,"During the course of the inquest, the evidence...",I heard at inquest that no member of the team ...
9,The MATTERS OF,


It looks like the GPT 3.5 model was very effective at removing our template text, so we can extend it to the entire dataset.

In our sample, one report was not cleaned correctly, but this appears to be because of an issue with scraping the original text. We can quantify the extent of these errors once we apply it to the wider dataset.

In [8]:
new_texts = []
start_time = time.time()

for idx, text in enumerate(data['Content']):
    print('Cleaning row {i} of {n}'.format(i=idx, n=len(data)))
    text_left = text[0:1000]
    text_right = text[1000:]
    try:
        result = client.chat.completions.create(messages=build_prompt(report_text=text_left), 
                                                model="gpt-3.5-turbo",
                                                max_tokens=4096,
                                                temperature=0,  # ...minimise randomness in completions
                                                seed=18062024).choices[0].message.content
        new_texts.append(result + text_right)  # ...rejoin the cleaned start with the untouched remainder
    except Exception as e:
        new_texts.append('ERROR')
        print(f'OpenAI error on row {idx}: {e}')

end_time = time.time()

total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')

Cleaning row 0 of 415
Cleaning row 1 of 415
Cleaning row 2 of 415
Cleaning row 3 of 415
...

We can then append our new, cleaned reports to our original data frame.

In [9]:
cleaned_data = data.copy()  # ...copy so we don't write to a slice of the original data frame
cleaned_data['CleanContent'] = new_texts
cleaned_data


Unnamed: 0,URL,ID,Date,Receiver,Content,CleanContent
0,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0318,Date of report: 13/06/2024,"TO: The Chief Executive, Leicestershire Partne...",During the course of the investigation my inqu...,Regulation 28 – After Inquest Document Templat...
1,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0311,Date of report: 07/06/2024,TO: 1. NATIONAL AMBULANCE RESILIENCE UNIT (NAR...,Regulation 28 – After Inquest Document Templat...,Regulation 28 – After Inquest Document Templat...
2,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0298,Date of report: 05/05/2024,"TO: 1. CEO of Quora, 2. The Rt Hon Lucy Fraser...",During the course of the inquest the evidence ...,(1) There are questions and answers on Quora’s...
3,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0297,Date of report: 29/05/2024,"TO: Secretary of State for Justice, 1",During the course of the inquest the evidence ...,(1) The prison service instruction (PSI) 64/20...
4,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0296,Date of report: 03/06/2024,"TO: (1) , Chief Executive, Birmingham and Soli...","During the inquest, the evidence revealed matt...",My principal concern is that when a high-risk ...
...,...,...,...,...,...,...
555,https://www.judiciary.uk/prevention-of-future-...,Ref: 2016-0037,Date of report: 5 February 2016,TO: 1. Dean for Education Barts and the London...,"During the course of the inquest, the evidence...",Barts and the London 1. Whilst it was clear to...
559,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0465,Date of report: 24 November 2015,"TO: Chief Executive, Lancashire Care NHS Found...",During the course of the inquest the evidence ...,1. Piotr Kucharz was a Polish gentleman who co...
562,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0173,Date of report: 29 April 2015,TO: 1. Ms Wendy Wallace Chief Executive Camden...,"During the course of the inquest, the evidence...",Camden and Islington Trust 1. It seemed from t...
564,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0116,Date of report: 24 March 2015,"TO: National Offender Management Service, Cliv...",During the course of the inquest the evidence ...,NOMS/SODEXO - ANTI-LIGATURE STRIPS ON CELL DOO...


Although we've now removed the intro text, we can note that there is also a different kind of template text that has been accidentally captured by our web scraping tool. This takes the following format (with changing dates):

> Regulation 28 – After Inquest Document Template Updated 30/07/2021

We can remove this using regular expressions via the `re` package.
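
As a quick check of this idea on a single string, here is a minimal sketch. The sample text is made up, and the pattern is a slightly more tolerant variant (an assumption on our part, not the pattern used in the cell below): it accepts either an en dash or a hyphen, and also consumes trailing whitespace.

```python
import re

# Assumed variant: accept "–" or "-" after "Regulation 28" and swallow trailing spaces
template_pattern = re.compile(
    r"Regulation 28 [–-] After Inquest Document Template Updated \d{2}/\d{2}/\d{4}\s*"
)

# Made-up sample in the same shape as the scraped text
sample = "Regulation 28 – After Inquest Document Template Updated 30/07/2021 (1) The process for triaging calls"

stripped = template_pattern.sub("", sample)
print(stripped)
```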

In [10]:
import re

# Provide pattern using regex
pattern = re.compile(r'Regulation 28 – After Inquest Document Template Updated \d{2}/\d{2}/\d{4}')

# Remove pattern from CleanContent field
# ...assign() returns a new data frame, which avoids pandas' SettingWithCopyWarning
cleaned_data = cleaned_data.assign(
    CleanContent=cleaned_data['CleanContent'].apply(lambda x: pattern.sub('', x))
)


We should also check for any reports where our GPT model was unable to remove our template text. Our prompt instructed the model to delete the content of these reports and return "NaN" in its place.

In [11]:
# Count cleaned_data rows with "NaN" in the "CleanContent" column
rows = len(cleaned_data)
na_rows = (cleaned_data['CleanContent'] == 'NaN').sum()

print("There are", na_rows, "out of", rows, "reports where template text could not be removed.",
      "This reflects", round(na_rows/rows*100, 2), "% of remaining reports.",
      "\nThis leaves us with", rows - na_rows, "reports left to analyse.")

# Drop rows with "NaN" in the "CleanContent" column (copy to avoid writing to a slice later)
cleaned_data_dropped = cleaned_data[cleaned_data['CleanContent'] != 'NaN'].copy()

There are 16 out of 415 reports where template text could not be removed. This reflects 3.86 % of remaining reports. 
This leaves us with 399 reports left to analyse.


#### Validating the changes

We can now validate the data cleaning steps to make sure our efforts have worked as intended. Below, we randomly sample 20 reports and print them. There should be no template text or references to "Regulation 28". We should also make sure we haven't inadvertently cut text from any report's main content.
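
Reading a sample by eye could be complemented by a programmatic check. The sketch below uses made-up strings, and `find_leftover_template` and `LEFTOVER` are hypothetical names. It reports the positions of any texts that still contain one of two template phrases.

```python
import re
from typing import Iterable, List

# Assumption: these two phrases are enough to signal leftover template text
LEFTOVER = re.compile(r"Regulation 28|matters? of concern are as follows", re.IGNORECASE)

def find_leftover_template(texts: Iterable[str]) -> List[int]:
    """Return the positions of texts that still contain template phrases."""
    return [i for i, text in enumerate(texts) if LEFTOVER.search(text)]

# Made-up examples: the second still carries the introduction
demo = [
    "1. No risk assessment was carried out.",
    "The MATTERS OF CONCERN are as follows: staffing levels were low.",
]
print(find_leftover_template(demo))
```

In the notebook, the same function could be given the `CleanContent` column, for example `find_leftover_template(cleaned_data_dropped['CleanContent'])`.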

In [12]:
random_seed = 42
sampled_data = cleaned_data_dropped.sample(n=20, random_state=random_seed)

# Compare the original and cleaned content for the sampled rows
for index, row in sampled_data.iterrows():
    print(f'ORIGINAL DATA: {row["Content"]}')
    print(f'CLEANED DATA: {row["CleanContent"]}\n')

ORIGINAL DATA: 2 During the course of the inquest, the evidence revealed matters giving rise to concern. In my opinion, there is a risk that future deaths will occur unless action is taken. In the circumstances, it is my statutory duty to report to you. The MATTERS OF CONCERN are as follows. Max’s father told me at inquest that in March and April 2022, Max tried to contact his personal advisor at the 18+ Service at Thistley Hill in Dover on several occasions over a number of weeks, but found her phone always to be switched off. There was no redirect and no out of office on her email. Max’s father also tried to call her, with the same result. Just over a week after Max’s death, his family received a letter addressed to him from Kent Social Services, explaining that his social worker was off sick. A crisis number was given and Mr Turbutt called it, but it was simply an answerphone. This arrangement does not seem adequate for a vulnerable person in need.
CLEANED DATA: Max’s father told me

Everything above looks good! As a final failsafe, we'll try to remove any lingering template text via regex.

In [13]:
import re

# Compile all patterns into a single regex
pattern = re.compile(
    r'\bmatters? of concern\b( are as follows)?|' \
    r'in these circumstances it is my statutory duty to report to you|' \
    r'In my opinion there is a risk that future deaths will occur unless action is taken|' \
    r'In my opinion action should be taken to prevent future deaths|' \
    r'I believe you.*the power to take such action',
    re.IGNORECASE
)

# Remove patterns from CleanContent field
# ...assign() returns a new data frame, which avoids pandas' SettingWithCopyWarning
cleaned_data_dropped = cleaned_data_dropped.assign(
    CleanContent=cleaned_data_dropped['CleanContent'].apply(lambda x: pattern.sub('', x))
)


In [14]:
# Drop "Content" column
cleaned_data_dropped = cleaned_data_dropped.drop(columns = ['Content'])

# Make a copy of the data for experimentation
cleaned_reports = cleaned_data_dropped.copy()

# Save data to a new CSV file
cleaned_data_dropped.to_csv('../Data/cleaned.csv', index=False)

### Ensure consistent ID format

Most reports have a correctly formatted ID column, taking the format of 'Ref: 20XX-XXXX'. However, some reports were erroneously scraped by our web scraper tool, leading to spillover with other sections. For example, one report ID is registered as:

> Date of report: 20/05/2024Ref: 2024-0277Deceased name: Miriam StoneCoroner name: Sophie LomasCoroner Area: Derby and DerbyshireCategory: Suicide (from 2015)This report is being sent to: Derbyshire Healthcare NHS Trust

This is obviously wrong, so let's call in GPT again to fix these erroneous IDs.
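
Because the ID always follows "Ref: " in a fixed format, a regular expression is a possible alternative to the GPT call used below. This is a minimal sketch of that swapped-in technique, not the method this notebook uses; `extract_id` and `ID_PATTERN` are hypothetical names.

```python
import re
from typing import Optional

# Four digits, a hyphen, four digits, preceded by "Ref:"
ID_PATTERN = re.compile(r"Ref:\s*(\d{4}-\d{4})")

def extract_id(raw) -> Optional[str]:
    """Return the bare ID (e.g. '2024-0277'), or None if no ID is found."""
    match = ID_PATTERN.search(str(raw))
    return match.group(1) if match else None

messy = ("Date of report: 20/05/2024Ref: 2024-0277Deceased name: Miriam Stone"
         "Coroner name: Sophie LomasCoroner Area: Derby and Derbyshire")

print(extract_id("Ref: 2024-0318"))
print(extract_id(messy))
print(extract_id(float("nan")))  # ...missing IDs yield None
```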

In [14]:
cleaned_reports = pd.read_csv('../Data/cleaned.csv')

URL             https://www.judiciary.uk/prevention-of-future-...
ID                                                            NaN
Date                                   Date of report: 09/11/2020
Receiver        TO: • Domestic Abuse Management Board Surrey P...
CleanContent    - The evidence showed that: 1. was being treat...
Name: 206, dtype: object


In [10]:
import random
import time

# Sample 50 reports to test the prompt before running it on the full dataset
sampled_reports = cleaned_reports.sample(n=50, random_state=42)

# Define the prompt
id_prompt = """You will be provided with some text. Your task is to return the ID provided in the text.
The ID number will take the following format: 20XX-XXXX (where X is a digit).

The ID number will have "Ref: " just before it, but you should only return the ID number itself.

Do *not* return any other information. You should only return the ID number, and nothing else whatsoever.

Here is the text:
{id}
"""

new_ids = []
old_ids = []
start_time = time.time()

for idx, (report_idx, report) in enumerate(sampled_reports.iterrows()):
    raw_id = report['ID']  # ...named raw_id to avoid shadowing Python's built-in id()
    old_ids.append(raw_id)
    print(f'Report {idx + 1} of 50...')
    try:
        result = client.chat.completions.create(messages=[{"role": "system", "content": id_prompt.format(id=raw_id)}], 
                                                model="gpt-3.5-turbo",
                                                max_tokens=4096,
                                                temperature=0,  # ...minimise randomness in completions
                                                seed=18062024).choices[0].message.content.strip()
        new_ids.append(result)
        print(f'Original ID: {raw_id}, New ID: {result}')
    except Exception as e:
        new_ids.append('ERROR')
        print(f'OpenAI error on row {idx + 1}: {e}')

end_time = time.time()

total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')


Report 1 of 50...
Original ID: Ref: 2022-0327, New ID: 2022-0327
Report 2 of 50...
Original ID: Ref: 2020-0027, New ID: 2020-0027
Report 3 of 50...
Original ID: Ref: 2024-0192, New ID: 2024-0192
Report 4 of 50...
Original ID: Ref: 2022-0291, New ID: 2022-0291
Report 5 of 50...
Original ID: Ref: 2023-0439, New ID: 2023-0439
Report 6 of 50...
Original ID: Ref: 2023-0486, New ID: 2023-0486
Report 7 of 50...
Original ID: Ref: 2018-0140, New ID: 2018-0140
Report 8 of 50...
Original ID: Ref: 2023-0429, New ID: 2023-0429
Report 9 of 50...
Original ID: Ref: 2022-0214, New ID: 2022-0214
Report 10 of 50...
Original ID: Ref: 2023-0221, New ID: 2023-0221
Report 11 of 50...
Original ID: Ref: 2024-0282, New ID: 2024-0282
Report 12 of 50...
Original ID: Ref: 2019-0273, New ID: 2019-0273
Report 13 of 50...
Original ID: Ref: 2024-0071, New ID: 2024-0071
Report 14 of 50...
Original ID: Ref: 2023-0547, New ID: 2023-0547
Report 15 of 50...
Original ID: Ref: 2023-0199, New ID: 2023-0199
Report 16 of 50...


In [11]:
import time

# Define the prompt
id_prompt = """You will be provided with some text. Your task is to return the ID provided in the text.
The ID number will take the following format: 20XX-XXXX (where X is a digit).

The ID number will have "Ref: " just before it, but you should only return the ID number itself.

Do *not* return any other information. You should only return the ID number, and nothing else whatsoever.

Here is the text:
{id}
"""

new_ids = []
old_ids = []
start_time = time.time()

total_reports = len(cleaned_reports)

for idx, (report_idx, report) in enumerate(cleaned_reports.iterrows()):
    raw_id = report['ID']  # ...named raw_id to avoid shadowing Python's built-in id()
    old_ids.append(raw_id)
    print(f'Report {idx + 1} of {total_reports}...')
    try:
        result = client.chat.completions.create(messages=[{"role": "system", "content": id_prompt.format(id=raw_id)}], 
                                                model="gpt-3.5-turbo",
                                                max_tokens=4096,
                                                temperature=0,  # ...minimise randomness in completions
                                                seed=18062024).choices[0].message.content.strip()
        new_ids.append(result)
        print(f'Original ID: {raw_id}, New ID: {result}')
    except Exception as e:
        new_ids.append('ERROR')
        print(f'OpenAI error on row {idx + 1}: {e}')

end_time = time.time()

total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')


Report 1 of 399...
Original ID: Ref: 2024-0318, New ID: 2024-0318
Report 2 of 399...
Original ID: Ref: 2024-0311, New ID: 2024-0311
Report 3 of 399...
Original ID: Ref: 2024-0298, New ID: 2024-0298
Report 4 of 399...
Original ID: Ref: 2024-0297, New ID: 2024-0297
Report 5 of 399...
Original ID: Ref: 2024-0296, New ID: 2024-0296
Report 6 of 399...
Original ID: Ref: 2024-0295, New ID: 2024-0295
Report 7 of 399...
Original ID: Ref: 2024-0294, New ID: 2024-0294
Report 8 of 399...
Original ID: Ref: 2019-0397, New ID: 2019-0397
Report 9 of 399...
Original ID: Ref: 2018-0405, New ID: 2018-0405
Report 10 of 399...
Original ID: Ref: 2024-0282, New ID: 2024-0282
Report 11 of 399...
Original ID: Ref: 2024-0278, New ID: 2024-0278
Report 12 of 399...
Original ID: Date of report: 20/05/2024Ref: 2024-0277Deceased name: Miriam StoneCoroner name: Sophie LomasCoroner Area: Derby and DerbyshireCategory: Suicide (from 2015)This report is being sent to: Derbyshire Healthcare NHS Trust, New ID: 2024-0277
...

In [15]:
id_clean_data = cleaned_reports
id_clean_data['ID'] = new_ids
id_clean_data

Unnamed: 0,URL,ID,Date,Receiver,CleanContent
0,https://www.judiciary.uk/prevention-of-future-...,2024-0318,Date of report: 13/06/2024,"TO: The Chief Executive, Leicestershire Partne...",Pre-amble Mr Larsen was a 52 year old male wi...
1,https://www.judiciary.uk/prevention-of-future-...,2024-0311,Date of report: 07/06/2024,TO: 1. NATIONAL AMBULANCE RESILIENCE UNIT (NAR...,- (1) The process for triaging and prioritisi...
2,https://www.judiciary.uk/prevention-of-future-...,2024-0298,Date of report: 05/05/2024,"TO: 1. CEO of Quora, 2. The Rt Hon Lucy Fraser...",(1) There are questions and answers on Quora’s...
3,https://www.judiciary.uk/prevention-of-future-...,2024-0297,Date of report: 29/05/2024,"TO: Secretary of State for Justice, 1",(1) The prison service instruction (PSI) 64/20...
4,https://www.judiciary.uk/prevention-of-future-...,2024-0296,Date of report: 03/06/2024,"TO: (1) , Chief Executive, Birmingham and Soli...",My principal concern is that when a high-risk ...
...,...,...,...,...,...
394,https://www.judiciary.uk/prevention-of-future-...,2016-0065,Date of report: 19 February 2016,TO: 1. Medical Director East London NHS Founda...,1. Brenda Morris was allowed weekend leave on ...
395,https://www.judiciary.uk/prevention-of-future-...,2016-0037,Date of report: 5 February 2016,TO: 1. Dean for Education Barts and the London...,Barts and the London 1. Whilst it was clear to...
396,https://www.judiciary.uk/prevention-of-future-...,2015-0465,Date of report: 24 November 2015,"TO: Chief Executive, Lancashire Care NHS Found...",1. Piotr Kucharz was a Polish gentleman who co...
397,https://www.judiciary.uk/prevention-of-future-...,2015-0173,Date of report: 29 April 2015,TO: 1. Ms Wendy Wallace Chief Executive Camden...,Camden and Islington Trust 1. It seemed from t...


In [18]:
import re

id_clean_data = cleaned_reports
id_clean_data['ID'] = new_ids

# Check to make sure all IDs meet the required format

# Set pattern
pattern = re.compile(r'^\d{4}-\d{4}$')

# Count the number of invalid IDs
invalid_ids_count = id_clean_data[~id_clean_data['ID'].str.match(pattern)].shape[0]
print(f'Number of invalid IDs: {invalid_ids_count}. Filtering out...')

# Filter out invalid IDs
id_clean_data2 = id_clean_data[id_clean_data['ID'].str.match(pattern)]

# Save the cleaned data to a new CSV file
id_clean_data2.to_csv('../Data/cleaned.csv', index=False)

Number of invalid IDs: 1. Filtering out...


Finally, we need to remove any reports where our LLM was unable to identify any introduction text. This issue appears to occur for a small minority of reports where the PDF format is atypical. This results in our web scraping tool scraping the template text but not the actual report contents.

## 2. NLP pre-processing

In [None]:
# Load the cleaned data and keep only the fields needed for NLP pre-processing
clean_data = pd.read_csv('../Data/cleaned.csv')
clean_data = clean_data[['URL', 'CleanContent']]
clean_data

### Tokenise and remove unnecessary words

Here we lowercase the text and remove stop words (e.g. "the", "my"), punctuation, numbers and special characters.

We then word- and sentence-tokenise our report contents. Topic modelling primarily uses word tokenisation, but we also need to sentence-tokenise our reports for our word embeddings model later on.


In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import pandas as pd

# Download NLTK data
# (newer NLTK releases may instead look for the 'punkt_tab' resource)
nltk.download('punkt')
nltk.download('stopwords')

# Define stop words
stop_words = set(stopwords.words('english'))

# Define a function for pre-processing text
def preprocess_text(text):
    # Convert to lowercase and replace special characters and numbers with spaces
    return ''.join(char.lower() if char.isalpha() or char.isspace() else ' ' for char in text)

# Define a function for pre-processing and tokenizing text
def preprocess_and_tokenize(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    # Remove punctuation, special characters, and numbers, and convert to lowercase
    words = [word.lower() for word in words if word.isalpha()]
    # Remove stopwords
    return [word for word in words if word not in stop_words]

# Define a function to remove stop words from a string
def remove_stopwords(text):
    words = text.split()
    return ' '.join(word for word in words if word not in stop_words)

# Apply text preprocessing to the content
clean_data['ProcessedContent'] = clean_data['CleanContent'].apply(preprocess_text)

# Remove stop words from the preprocessed content
clean_data['ProcessedContent'] = clean_data['ProcessedContent'].apply(remove_stopwords)

# Apply word tokenization and pre-processing to the content
clean_data['ProcessedWords'] = clean_data['ProcessedContent'].apply(preprocess_and_tokenize)

# Sentence-tokenize the cleaned content (ProcessedContent no longer contains the punctuation
# that sent_tokenize relies on), then pre-process and word-tokenize each sentence
clean_data['ProcessedSentences'] = clean_data['CleanContent'].apply(
    lambda x: [preprocess_and_tokenize(preprocess_text(sent)) for sent in sent_tokenize(x)]
)

clean_data
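
As a rough illustration of the character filter and stop word removal, the sketch below runs both steps on a made-up sentence. The sample text and the tiny stop word set are assumptions for illustration; the notebook itself uses NLTK's full English list.

```python
# Toy stop word set (hypothetical; the notebook uses NLTK's English stop words)
toy_stop_words = {'the', 'was', 'on', 'a', 'to'}

def preprocess_text(text):
    # Lowercase letters, keep whitespace, and replace everything else with a space
    return ''.join(char.lower() if char.isalpha() or char.isspace() else ' ' for char in text)

def remove_stopwords(text, stop_words):
    # Splitting on whitespace also collapses the extra spaces left by the step above
    return ' '.join(word for word in text.split() if word not in stop_words)

# Made-up example text
sample = 'The patient was discharged on 3 June. A high-risk referral was not made!'

lowered = preprocess_text(sample)
processed = remove_stopwords(lowered, toy_stop_words)
print(processed)
```

Note that hyphenated terms such as "high-risk" are split into two words, and that all punctuation, including full stops, is gone after this step.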


### Lemmatize the data

Lemmatization is the process of reducing words to their base (dictionary) form. For example, the words "running", "runs" and "ran" should all be reduced to "run".

Lemmatization is generally preferable to 'stemming' because it returns a real, semantically meaningful word. A stemmer only strips suffixes, so it can produce non-words (e.g. "studies" becoming "studi") and cannot relate irregular forms, whereas a lemmatizer can map "better" to "good" when it knows the word is an adjective.

We can also enhance this process via 'part-of-speech' (POS) tagging. POS tagging tells the lemmatizer which word class (noun, verb, adjective, adverb) each token belongs to, so that each word is reduced using the rules for its own class. Without it, the WordNet lemmatizer treats every token as a noun.

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, download
from nltk.corpus import wordnet

# Download NLTK data
# (newer NLTK releases may instead look for 'averaged_perceptron_tagger_eng')
download('averaged_perceptron_tagger')
download('wordnet')

# Map POS tags for lemmatization
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

# Initialise the lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to lemmatize tokens
def lemmatize_tokens(tokens):
    try:
        # POS tagging
        pos_tags = pos_tag(tokens)
        # Lemmatize with POS tags
        return [lemmatizer.lemmatize(token, get_wordnet_pos(tag) or wordnet.NOUN) for token, tag in pos_tags]
    except Exception as e:
        print(f"Error processing content: {e}")
        return []

# Function to lemmatize a text string
def lemmatize_text(text):
    try:
        # Tokenize the text into words
        words = word_tokenize(text)
        # POS tagging
        pos_tags = pos_tag(words)
        # Lemmatize with POS tags
        lemmatized_words = [lemmatizer.lemmatize(token, get_wordnet_pos(tag) or wordnet.NOUN) for token, tag in pos_tags]
        # Reconstruct the text
        return ' '.join(lemmatized_words)
    except Exception as e:
        print(f"Error processing content: {e}")
        return text

# Apply lemmatization to the preprocessed content
clean_data['ProcessedContent'] = clean_data['ProcessedContent'].apply(lemmatize_text)

# Apply word tokenization, pre-processing, and lemmatization to the content
clean_data['ProcessedWords'] = clean_data['ProcessedWords'].apply(lemmatize_tokens)

# Sentence-tokenize the content and apply word tokenization, pre-processing, and lemmatization to each sentence
clean_data['ProcessedSentences'] = clean_data['ProcessedSentences'].apply(lambda x: [lemmatize_tokens(sent) for sent in x])

clean_data
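
The tag mapping used above can be checked in isolation. The sketch below is a stand-alone approximation of `get_wordnet_pos`: it writes out WordNet's single-letter POS constants by hand so that it runs without NLTK, and the tagged tokens are hypothetical tagger output rather than results from the PFD data.

```python
# WordNet's POS constants are single letters; they are written out here so the
# sketch runs without NLTK (wordnet.ADJ == 'a', VERB == 'v', NOUN == 'n', ADV == 'r')
ADJ, VERB, NOUN, ADV = 'a', 'v', 'n', 'r'

# First letter of a Penn Treebank tag -> WordNet POS
PREFIX_TO_POS = {'J': ADJ, 'V': VERB, 'N': NOUN, 'R': ADV}

def get_wordnet_pos(tag):
    # Return None for tags with no WordNet equivalent (e.g. prepositions)
    return PREFIX_TO_POS.get(tag[:1])

# Hypothetical (token, tag) pairs of the kind pos_tag returns
tagged = [('ran', 'VBD'), ('better', 'JJR'), ('reports', 'NNS'), ('quickly', 'RB'), ('of', 'IN')]

# Fall back to NOUN when there is no mapping, as the lemmatization functions do
mapped = [(token, get_wordnet_pos(tag) or NOUN) for token, tag in tagged]
print(mapped)
```

The noun fallback matches the lemmatizer's own default, so unmapped tags behave as if no POS had been supplied.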


### Word embeddings

It's useful to use word embeddings prior to topic modelling in order to capture semantic similarity between certain words. For example, words like 'medicine', 'drugs' and 'prescription' would all be treated independently if we did not use embeddings, despite them having similar meanings. Note that we would not have this issue for models such as BERTopic which use their own embeddings.

By using word embeddings, we numerically link words with similar usage contexts and therefore increase the chances of our topic models presenting more coherent topics.

Below we use a pre-trained Word2Vec model from Gensim.

Additionally, we scan for **out-of-vocabulary (OOV)** words. These are words in our PFD data that are *not* in the vocabulary of our pre-trained model, mostly because of spelling mistakes, specialised terminology, or acronyms. Each report's list of embedding vectors must be the same length as its list of word tokens, so OOV words cannot simply be dropped. As a crude solution for OOV words, which our Word2Vec model cannot numerically represent, we assign each of them the average of the in-vocabulary word vectors from the same report (or a zero vector if the report has no in-vocabulary words).
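
The averaging fallback can be illustrated with a toy vocabulary before applying it to the real model. This is a minimal sketch under stated assumptions: `toy_model`, its 2-dimensional vectors and the helper name `embed_with_fallback` are all hypothetical, and plain Python lists stand in for NumPy arrays.

```python
# Toy 2-dimensional "embeddings" (hypothetical values; the real model has 300 dimensions)
toy_model = {
    'medicine': [1.0, 3.0],
    'prescription': [3.0, 5.0],
}

def embed_with_fallback(tokens, model, vector_size=2):
    # Collect the vectors of the tokens the model knows about
    known = [model[token] for token in tokens if token in model]
    if known:
        # Element-wise mean of the in-vocabulary vectors in this report
        avg = [sum(dim) / len(known) for dim in zip(*known)]
    else:
        # No known tokens at all: fall back to a zero vector
        avg = [0.0] * vector_size
    # One vector per token, so the output length always matches the input length
    return [model.get(token, avg) for token in tokens]

# 'prescrption' is a deliberate misspelling and therefore out of vocabulary
tokens = ['medicine', 'prescrption', 'prescription']
vectors = embed_with_fallback(tokens, toy_model)
print(vectors)
```

The misspelt token receives the mean of the two known vectors, which keeps the output aligned with the token list at the cost of giving every OOV word in a report the same representation.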

In [None]:
import gensim
import gensim.downloader as api
from gensim.models import Word2Vec

# Load the pre-trained Word2Vec model
# ...We use the popular Google News data source
model = api.load("word2vec-google-news-300")

# Function to get the word vectors for tokens, replacing OOV with average vector
def embed(tokens, model, oov_words):
    valid_tokens = [token for token in tokens if token in model.key_to_index]
    oov_tokens = [token for token in tokens if token not in model.key_to_index]
    oov_words.update(oov_tokens)
    
    if valid_tokens:
        word_vectors = [model[token] for token in valid_tokens]
        avg_vector = np.mean(word_vectors, axis=0)
    else:
        avg_vector = np.zeros(model.vector_size)
    
    # Replace OOV tokens with the average vector
    embeddings = [model[token] if token in model.key_to_index else avg_vector for token in tokens]
    
    return embeddings

# Initialize the WordEmbeddings column
clean_data['WordEmbeddings'] = None

# Initialize a set to store OOV words
oov_words = set()
oov_count = 0
mismatch_rows = []

# Loop through each row in the DataFrame
for i, row in clean_data.iterrows():
    embeddings = embed(row['ProcessedWords'], model, oov_words)
    clean_data.at[i, 'WordEmbeddings'] = embeddings
    
    # Check for OOV words count and mismatched dimensions
    oov_count += len([token for token in row['ProcessedWords'] if token not in model.key_to_index])
    if len(embeddings) != len(row['ProcessedWords']):
        mismatch_rows.append(i)

# Print the total count of OOV words
print(f'Total number of unique OOV words: {len(oov_words)}')
print(f'Total number of times an OOV word is used: {oov_count}')

# Check that all embeddings vectors are identical in dimension to the ProcessedWords column
if mismatch_rows:
    print(f'Rows with dimension mismatch: {mismatch_rows}')
else:
    print('All rows have matching dimensions between ProcessedWords and WordEmbeddings.')

print(f'The OOV words are as follows: {oov_words}')

clean_data

It looks like the reports collectively contain a large number of OOV words. These are mostly names, spelling mistakes and acronyms. This will almost certainly affect the performance of our subsequent topic modelling techniques, since we won't be able to take advantage of Word2Vec's embeddings for any of these OOV words.

Due to the nested structure of the data, we need to save our processed data in JSON format rather than CSV.

In [None]:
clean_data.to_json('../Data/tokenised.json', orient='split')
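
With `orient='split'`, pandas writes a JSON object with `columns`, `index` and `data` keys, which is why the nested token lists survive where a CSV would flatten them to strings. The sketch below mimics that layout with the standard `json` module only; the URLs and tokens are made up, and in the notebook itself `pd.read_json('../Data/tokenised.json', orient='split')` should read the real file back.

```python
import json

# Hypothetical two-row example mirroring the nested ProcessedWords column
split_payload = {
    'columns': ['URL', 'ProcessedWords'],
    'index': [0, 1],
    'data': [
        ['https://example.org/report-a', ['patient', 'discharge']],
        ['https://example.org/report-b', ['referral', 'delay', 'risk']],
    ],
}

# Round trip through a JSON string, as writing and re-reading the file would
text = json.dumps(split_payload)
restored = json.loads(text)

# Rebuild one record per row by pairing column names with row values
records = [dict(zip(restored['columns'], row)) for row in restored['data']]
print(records[1]['ProcessedWords'])
```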