## Preprocessing scraped PFD data

In [1]:
import pandas as pd
import numpy as np
from dotenv import load_dotenv
import os
from openai import OpenAI

Below, we can see the result of our web scraping script. It produces 5 fields:
 * The "URL" of the PFD report.
 * The "ID", scraped from the judiciary.uk website.
 * The "date" of the report. Unhelpfully, this is in a mixture of formats. We might be able to use the OpenAI API to help with this, but I've left it for now.
 * The "receiver", or who the report was addressed to.
 * The "content" of the report's *concerns* section.
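The mixed date formats mentioned above could also be handled without an LLM. Below is a minimal sketch, assuming that only the two styles visible in the scraped sample occur (`13/06/2024` and `5 February 2016`) and that every value carries the `Date of report:` prefix. The helper name `parse_report_date` and the list of formats are my own assumptions, not part of the scraping script.

```python
from datetime import date, datetime
from typing import Optional

# Assumed formats, based on the two styles visible in the scraped sample
DATE_FORMATS = ("%d/%m/%Y", "%d %B %Y")
PREFIX = "Date of report:"

def parse_report_date(raw: str) -> Optional[date]:
    """Return the parsed date, or None if no assumed format matches."""
    text = raw.strip()
    if text.startswith(PREFIX):
        text = text[len(PREFIX):].strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next format
    return None

print(parse_report_date("Date of report: 13/06/2024"))
print(parse_report_date("Date of report: 5 February 2016"))
```

Values that return `None` would show which further formats need adding to `DATE_FORMATS`.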

In [2]:
data = pd.read_csv('../Data/raw.csv')
data.head(n = 5)

Unnamed: 0,URL,ID,Date,Receiver,Content
0,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0318,Date of report: 13/06/2024,"TO: The Chief Executive, Leicestershire Partne...",During the course of the investigation my inqu...
1,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0311,Date of report: 07/06/2024,TO: 1. NATIONAL AMBULANCE RESILIENCE UNIT (NAR...,Regulation 28 – After Inquest Document Templat...
2,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0298,Date of report: 05/05/2024,"TO: 1. CEO of Quora, 2. The Rt Hon Lucy Fraser...",During the course of the inquest the evidence ...
3,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0297,Date of report: 29/05/2024,"TO: Secretary of State for Justice, 1",During the course of the inquest the evidence ...
4,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0296,Date of report: 03/06/2024,"TO: (1) , Chief Executive, Birmingham and Soli...","During the inquest, the evidence revealed matt..."


Not all reports were scraped successfully, which we can see by *NA* values in the `Receiver` and `Content` fields. Inspecting the URLs of failed reports shows that this is because these reports are actually images, which have seemingly been scanned in and uploaded. 

There are Python packages for optical character recognition (OCR) that can read text from images, but we'll leave this for now due to time constraints.

In [3]:
# Count number of rows
rows = data.shape[0]

# Count number of rows that have N/A values in "Content" column
na_rows = data['Content'].isna().sum()

# Report how many reports failed to scrape and how many remain
print("There are", na_rows, "out of", rows, "reports which were unsuccessfully scraped.",
      "This reflects", round(na_rows/rows*100, 2), "% of all reports.",
      "\nThis leaves us with", rows - na_rows, "reports left to analyse.")

# Drop rows with N/A values in "Content" column
data = data.dropna(subset = ['Content'])

There are 155 out of 570 reports which were unsuccessfully scraped. This reflects 27.19 % of all reports. 
This leaves us with 415 reports left to analyse.


### Removing intro text

Notice that every entry in the `Content` field begins with some intro text. This text differs slightly between reports, but it is usually along the lines of:

> *During the course of the investigation my inquiries revealed matters giving rise to concern. In my opinion there is a risk that future deaths could occur unless action is taken. In the circumstances it is my statutory duty to report to you. The MATTERS OF CONCERN are as follows:*

Because the wording of this intro text varies slightly from report to report, a single regular expression (regex) can't reliably trim it. Instead, we can call the OpenAI API (using GPT-3.5 Turbo) to remove this text for us.

Note that for the code below to run on your machine, you need a `.env` file (which I've called `api.env`) that sets the `OPENAI_API_KEY` environment variable to your OpenAI API key. For security reasons, I've kept my own file out of the repository via `.gitignore`.
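If the key fails to load, `os.getenv` silently returns `None`, and the problem only surfaces later as a failed API call. A minimal sketch of an early check is shown below. The helper name `require_env` is hypothetical, and it assumes `api.env` contains a line of the form `OPENAI_API_KEY=<your key>`.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check that api.env exists and was loaded")
    return value

# Intended usage, after load_dotenv('api.env') has run:
# openai_api_key = require_env("OPENAI_API_KEY")
```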

In [4]:
# Load the .env file containing the OpenAI API Key
load_dotenv('api.env')
openai_api_key = os.getenv('OPENAI_API_KEY')

client = OpenAI(api_key=openai_api_key)

In [5]:
# Below is an example of report content that we need to trim
data['Content'][9]

'During the course of the investigation my inquiries revealed matters giving rise to concern. In my opinion there is a risk that future deaths could occur unless action is taken. In the circumstances it is my statutory duty to report to you. The MATTERS OF CONCERN are as follows: (brief summary of matters of concern) The gatekeeping assessment included a mental health state examination, where it was the clinical opinion of the mental health practitioner from the Crisis Resolution Home Treatment Team, that Ms Morris required an inpatient hospital admission to a mental health ward as there was an immediate risk to her safety as she was found to be a high risk of walking in front of a car. Whilst Ms Morris agreed to an informal admission, this was not possible at the time of assessment as there were no beds available nationally within the NHS or privately. As an inpatient admission was not possible, the option was to attend the Accident and Regulation 28 – After Inquest Document Template 

Below, we define our prompt. We tell GPT to remove introduction text from the report.

Our placeholder (`{report_text}`) allows us to dynamically insert each report content into the prompt. We're essentially repeating the prompt by iterating over each report.
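As a small illustration of the placeholder mechanism, the sketch below uses a shortened, hypothetical template and two made-up report snippets rather than the real prompt and data. It also shows that curly braces inside the inserted text are left untouched by `str.format`.

```python
# Shortened, hypothetical template; the real prompt is defined in the next cell
template = "Remove the introduction.\nReport text:\n{report_text}"

# Made-up snippets standing in for the "Content" field
reports = [
    "The MATTERS OF CONCERN are as follows: (1) No beds were available.",
    "Concerns {sic}: staffing levels.",
]

# One filled-in prompt per report
filled = [template.format(report_text=r) for r in reports]

for prompt_text in filled:
    print(prompt_text)
    print("---")
```

Only braces in the template itself are interpreted, so a literal brace there would need doubling (`{{` or `}}`).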

In [16]:
prompt = """You are an expert in removing leading introductions to reports. You must return the report content with the leading introduction taken out.
- *Never* change the main content of the report; only ever remove the leading introductory statement.
- Always remove things like 'matters of concern are as follows' and 'my findings are as follows', etc; I am only interested in the core content of the report.
- If you cannot remove the leading introduction (or indeed you cannot find a leading introduction), you must delete the entire report and provide an empty string (i.e. "") as your response - nothing else.
- You must not, in any circumstances, respond in your own 'voice' (for example, you must not declare that you cannot find a leading introduction, or similar). In these cases, you must simply remove the report and provide an empty string ("").

Here is an example of a leading introductory statement which should be removed from your response:
"During the course of the investigation my inquiries revealed matters giving rise to concern. In my opinion there is a risk that future deaths could occur unless action is taken. In the circumstances it is my statutory duty to report to you. The MATTERS OF CONCERN are as follows:"


Your turn!
Report text:
{report_text}"""

Next, we define a function called `build_prompt` which takes a piece of report text and dynamically constructs a prompt for the LLM.

In [17]:
from typing import List, Dict # (...for type hints)

# Construct prompts for each given report text
# ... "->" is a type hint; it tells you what type of object the function should return
def build_prompt(report_text: str) -> List[Dict[str, str]]:
    # OpenAI 'messages' take a list of dictionaries, each with a 'role' and 'content' key. 
    # Role can be 'system', 'user', or 'assistant' (LLM replies as assistant); content is the text the LLM sees.
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt.format(report_text = report_text)}, # ...applies prompt dynamically with given report content
    ]

Next, we apply this function to each report by looping over each element of the `Content` field.

We only send the first 600 characters of each report to the LLM, because it doesn't need to see the entire report to trim the intro text. The remainder is re-attached afterwards. This reduces the number of tokens processed, making the run cheaper and faster.
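The split-and-rejoin idea can be sketched without any API call. In the example below, `fake_cleaner` is a hypothetical stand-in for the LLM and the report text is made up. Only the 600-character cut-off comes from this notebook.

```python
from typing import Callable

HEAD_CHARS = 600  # same cut-off as used in this notebook

def trim_intro(text: str, clean_head: Callable[[str], str], head_chars: int = HEAD_CHARS) -> str:
    """Clean only the first `head_chars` characters, then re-attach the rest unchanged."""
    head, tail = text[:head_chars], text[head_chars:]
    return clean_head(head) + tail

def fake_cleaner(head: str) -> str:
    """Hypothetical stand-in for the LLM: drop everything up to 'as follows:'."""
    _, found, rest = head.partition("as follows:")
    return rest.lstrip() if found else head

# Made-up report: a short intro, one concern, then filler beyond the cut-off
report = "The MATTERS OF CONCERN are as follows: (1) No beds were available. " + "x" * 700
cleaned = trim_intro(report, fake_cleaner)
print(cleaned[:40])
```

Because the tail never passes through the cleaner, nothing beyond the cut-off can be altered by it.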

*Note the below code chunk will cost approximately $0.11 (£0.09) to run, assuming 415 reports.*

In [20]:
# Define an empty list to collect the cleaned texts
new_texts = []

import time

# Start the timer
start_time = time.time()

# Loop over each element of "Content" field and apply prompt
for idx, text in enumerate(data['Content']):
    print('Cleaning row {i} of {n}'.format(i=idx, n=len(data)))
    text_left = text[0:600] # Shorten text to first 600 characters
    text_right = text[600:] # Hide the rest of the text from the LLM
    try:
        result = client.chat.completions.create(messages=build_prompt(report_text=text_left), 
                                                model="gpt-3.5-turbo", # ...can also use more advanced "gpt-4-turbo" or "gpt-4o"
                                                max_tokens=4096,
                                                temperature=0, # ...minimise randomness in completions
                                                seed=18062024).choices[0].message.content
        new_texts.append(result + text_right) # Re-attach the text the LLM never saw
    except Exception as e:
        # Keep the list aligned with the data frame by recording a marker for failed rows
        new_texts.append('ERROR')
        print(f'OpenAI error on row {idx}: {e}')

# End the timer
end_time = time.time()

# Calculate & print time taken
total_time = end_time - start_time
minutes = int(total_time // 60)
seconds = total_time % 60

print(f'Time taken: {minutes} minutes and {seconds:.2f} seconds')

Cleaning row 0 of 415
Cleaning row 1 of 415
Cleaning row 2 of 415
Cleaning row 3 of 415
Cleaning row 4 of 415
...
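
The loop above records `'ERROR'` for any row whose API call fails. One way to reduce the number of such rows is to retry failed calls a few times before giving up. This is a minimal sketch, assuming the failures are transient. `with_retries` and its parameters are hypothetical names and not part of the OpenAI client.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(call: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Run `call`, retrying with exponential backoff if it raises."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to record
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")

# Demonstration with a made-up function that fails twice, then succeeds
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = with_retries(flaky, attempts=3, base_delay=0.0)
print(result, "after", calls["n"], "calls")
```

In the loop, the API call could be wrapped in a `lambda` and passed to `with_retries`, with the existing `except` branch still catching the final failure.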

Now, all we need to do is add these trimmed texts to our data frame as a new column.

In [21]:
data['CleanContent'] = new_texts
data

Unnamed: 0,URL,ID,Date,Receiver,Content,CleanContent
0,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0318,Date of report: 13/06/2024,"TO: The Chief Executive, Leicestershire Partne...",During the course of the investigation my inqu...,Regulation 28 – After Inquest Document Templat...
1,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0311,Date of report: 07/06/2024,TO: 1. NATIONAL AMBULANCE RESILIENCE UNIT (NAR...,Regulation 28 – After Inquest Document Templat...,(1) The process for triaging and prioritising ...
2,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0298,Date of report: 05/05/2024,"TO: 1. CEO of Quora, 2. The Rt Hon Lucy Fraser...",During the course of the inquest the evidence ...,(1) There are questions and answers on Quora’s...
3,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0297,Date of report: 29/05/2024,"TO: Secretary of State for Justice, 1",During the course of the inquest the evidence ...,(1) The prison service instruction (PSI) 64/20...
4,https://www.judiciary.uk/prevention-of-future-...,Ref: 2024-0296,Date of report: 03/06/2024,"TO: (1) , Chief Executive, Birmingham and Soli...","During the inquest, the evidence revealed matt...",My principal concern is that when a high-risk ...
...,...,...,...,...,...,...
555,https://www.judiciary.uk/prevention-of-future-...,Ref: 2016-0037,Date of report: 5 February 2016,TO: 1. Dean for Education Barts and the London...,"During the course of the inquest, the evidence...",Barts and the London 1. Whilst it was clear to...
559,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0465,Date of report: 24 November 2015,"TO: Chief Executive, Lancashire Care NHS Found...",During the course of the inquest the evidence ...,1. Piotr Kucharz was a Polish gentleman who co...
562,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0173,Date of report: 29 April 2015,TO: 1. Ms Wendy Wallace Chief Executive Camden...,"During the course of the inquest, the evidence...",Camden and Islington Trust 1. It seemed from t...
564,https://www.judiciary.uk/prevention-of-future-...,Ref: 2015-0116,Date of report: 24 March 2015,"TO: National Offender Management Service, Cliv...",During the course of the inquest the evidence ...,NOMS/SODEXO - ANTI-LIGATURE STRIPS ON CELL DOO...


In [22]:
# Compare one example of original and cleaned content
print(f'ORIGINAL DATA: ', data['Content'][9])
print(f'CLEANED DATA: ', data['CleanContent'][9])

ORIGINAL DATA:  During the course of the investigation my inquiries revealed matters giving rise to concern. In my opinion there is a risk that future deaths could occur unless action is taken. In the circumstances it is my statutory duty to report to you. The MATTERS OF CONCERN are as follows: (brief summary of matters of concern) The gatekeeping assessment included a mental health state examination, where it was the clinical opinion of the mental health practitioner from the Crisis Resolution Home Treatment Team, that Ms Morris required an inpatient hospital admission to a mental health ward as there was an immediate risk to her safety as she was found to be a high risk of walking in front of a car. Whilst Ms Morris agreed to an informal admission, this was not possible at the time of assessment as there were no beds available nationally within the NHS or privately. As an inpatient admission was not possible, the option was to attend the Accident and Regulation 28 – After Inquest Doc

Although we've now removed the intro text, our web scraping tool has also accidentally captured PDF template text. This takes the following format (with varying dates):

> Regulation 28 – After Inquest Document Template Updated 30/07/2021

We can remove this using regular expressions via the `re` package.

In [23]:
import re

# Provide pattern using regex
pattern = re.compile(r'Regulation 28 – After Inquest Document Template Updated \d{2}/\d{2}/\d{4}')

# Remove pattern from CleanContent field
data['CleanContent'] = data['CleanContent'].apply(lambda x: pattern.sub('', x))
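
To see the effect on a single string, here is a slightly more tolerant variant of the pattern above. It is a sketch only: the sample sentence is made up, and accepting a plain hyphen and flexible spacing is an assumption about how the template text might vary. The helper name `strip_template` is hypothetical.

```python
import re

# Tolerant variant (an assumption): accepts an en dash or a hyphen and flexible spacing
template_re = re.compile(
    r"Regulation 28\s*[–-]\s*After Inquest Document Template\s+Updated\s+\d{2}/\d{2}/\d{4}"
)

def strip_template(text: str) -> str:
    """Remove template text, then collapse the extra whitespace left behind."""
    without_template = template_re.sub(" ", text)
    return re.sub(r"\s{2,}", " ", without_template).strip()

# Made-up sample in which the template text interrupts a sentence
sample = (
    "the option was to attend the Accident and "
    "Regulation 28 – After Inquest Document Template Updated 30/07/2021 "
    "Emergency department."
)
print(strip_template(sample))
```

Note that the whitespace clean-up also collapses line breaks, which may not be wanted if paragraph structure matters.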


In [24]:
# Compare one example of original and cleaned content... again!
# ...This time there shouldn't be any "Regulation 28" text
print(f'ORIGINAL DATA: ', data['Content'][9])
print(f'CLEANED DATA: ', data['CleanContent'][9])

ORIGINAL DATA:  During the course of the investigation my inquiries revealed matters giving rise to concern. In my opinion there is a risk that future deaths could occur unless action is taken. In the circumstances it is my statutory duty to report to you. The MATTERS OF CONCERN are as follows: (brief summary of matters of concern) The gatekeeping assessment included a mental health state examination, where it was the clinical opinion of the mental health practitioner from the Crisis Resolution Home Treatment Team, that Ms Morris required an inpatient hospital admission to a mental health ward as there was an immediate risk to her safety as she was found to be a high risk of walking in front of a car. Whilst Ms Morris agreed to an informal admission, this was not possible at the time of assessment as there were no beds available nationally within the NHS or privately. As an inpatient admission was not possible, the option was to attend the Accident and Regulation 28 – After Inquest Doc

In [25]:
# Save the cleaned data to a new CSV file
data.to_csv('../Data/cleaned.csv', index = False)