## Cleaner demo

Changes to the Cleaner module should only be pushed to main if the code below runs without issue.

The Cleaner class is primarily responsible for correcting spelling errors in PFD reports. It also standardises coroner names into _Initial. LastName_ format, which we've used to assist with coroner-level filtering.

In [1]:
from pfd_toolkit import Cleaner, LLM, load_reports
from dotenv import load_dotenv
import os
import pandas as pd

# Read unclean reports from file (these were scraped with the Scraper class)
unclean_reports = load_reports(n_reports=20)

# Get API key
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

# Set up LLM client
llm_client = LLM(api_key=openai_api_key, 
                 max_workers=50)

# Run cleaner
cleaner = Cleaner(
    llm=llm_client,
    reports=unclean_reports)


cleaned_reports = cleaner.clean_reports(anonymise=True)

Processing Fields: 100%|██████████| 6/6 [00:53<00:00,  8.88s/it]


In [2]:
cleaned_reports.head(n=10)

#cleaned_reports.to_csv('../data/testreports_cleaned.csv')

Unnamed: 0,url,id,date,coroner,area,receiver,investigation,circumstances,concerns
0,https://www.judiciary.uk/prevention-of-future-...,2025-0276,2025-06-05,S. Brenchley,Birmingham and Solihull,Secretary of State for Health and Social Care,On 23 September 2024 I commenced an investigat...,On 7 May 2024 at the Queen Elizabeth Hospital ...,The only two on call perfusionists on site wer...
1,https://www.judiciary.uk/prevention-of-future-...,2025-0274,2025-06-04,S. Hayes,Essex,Chief Executive of East Suffolk and North Esse...,On 3 May 2024 I commenced an investigation int...,They died on thirteenth April two thousand and...,The treating doctor was not informed when they...
2,https://www.judiciary.uk/prevention-of-future-...,2025-0273,2025-06-04,E. Ramsay,Swansea and Neath Port Talbot,"Chief Executive Officer, Neath Port Talbot Cou...",On 24 June 2023 the Senior Coroner commenced a...,At 20:05 on nineteenth June two thousand and t...,There are no lifeguards stationed at the break...
3,https://www.judiciary.uk/prevention-of-future-...,2025-0277,2025-06-03,M. Hassell,Inner North London,Chief Executive Officer Islington Council,"On 4th November 2024, one of my assistant coro...",They jumped from the sixth floor balcony outsi...,Islington Council failed to take into account ...
4,https://www.judiciary.uk/prevention-of-future-...,2025-0275,2025-06-03,O. Longstaff,West Yorkshire East,Secretary of State for Health and Social Care;...,On 15 June 2022 I commenced an investigation i...,It had been intended that they be born at Leed...,The provision of maternity services across the...
5,https://www.judiciary.uk/prevention-of-future-...,2025-0272,2025-06-03,J. Richards,County Durham and Darlington,,On thirtieth of December two thousand and twen...,"They, aged 92 years, died at their care home o...",Poor communication and liaison with family gen...
6,https://www.judiciary.uk/prevention-of-future-...,2025-0269,2025-06-03,L. Hunt,Birmingham and Solihull,Secretary of State for Health; University Hosp...,On 7 January 2025 I commenced an investigation...,They attended Good Hope Hospital on eighteenth...,The investigation by the hospital trust identi...
7,https://www.judiciary.uk/prevention-of-future-...,2025-0268,2025-06-02,C. Long,Lancashire and Blackburn with Darwen,"Chief Executive, NHS England; Chief Executive,...",On nineteenth December two thousand and twenty...,They died on 1st June 2024 at Royal Infirmary ...,NHS England national service specifications pr...
8,https://www.judiciary.uk/prevention-of-future-...,2025-0267,2025-06-02,N. Mundy,South Yorkshire East,Chief Executive National Highways,On 20 September 2024 I commenced an investigat...,This relates to the death of an 18 year old ma...,The mound of earth creates a continuing hazard...
9,https://www.judiciary.uk/prevention-of-future-...,2025-0270,2025-06-02,M. Hassell,Inner North London,Medical director University College London Hos...,"On 29 November 2024, one of my assistant coron...",They hanged themselves at home.,It seems from the evidence I heard that an exp...


Above, we can see the output of our cleaning instance.

Let's compare it with the original, unclean reports that we imported earlier. Even though the displayed content is truncated, we can see that the cleaner has correctly standardised each coroner's name into the desired format. There are also a couple of instances in the longer sections where stray spaces have been removed (e.g. "On 19 th September" has been changed to "On 19th September").

In [None]:
unclean_reports.head(n=10)
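
The comparison described above is done by eye, so a small programmatic check can help when re-running this demo. The following is a minimal sketch, not part of `pfd_toolkit`: the helper names `nonconforming_coroners` and `has_split_ordinal` are hypothetical, and both regular expressions are assumptions about what the _Initial. LastName_ format and a split ordinal (such as "19 th") look like. It runs on stand-in data rather than the real reports.

```python
import re

import pandas as pd

# Assumed shape of the 'Initial. LastName' format, e.g. "S. Brenchley".
# Allows apostrophes and hyphens in the surname; adjust if real names differ.
CORONER_RE = re.compile(r"[A-Z]\. [A-Z][A-Za-z'\-]+")

# Assumed shape of a split ordinal, e.g. "19 th" instead of "19th".
SPLIT_ORDINAL_RE = re.compile(r"\b\d{1,2} (?:st|nd|rd|th)\b")


def nonconforming_coroners(reports: pd.DataFrame) -> pd.Series:
    """Return the coroner values that do not match the assumed name format."""
    names = reports["coroner"].dropna().astype(str)
    matches = names.map(lambda name: CORONER_RE.fullmatch(name) is not None)
    return names[~matches.astype(bool)]


def has_split_ordinal(text: str) -> bool:
    """Return True if the text contains a day number separated from its suffix."""
    return SPLIT_ORDINAL_RE.search(text) is not None


# Stand-in data: two names in the target format and one invented unclean name.
sample = pd.DataFrame(
    {"coroner": ["S. Brenchley", "M. Hassell", "Mrs Louise Hunt"]}
)
flagged = nonconforming_coroners(sample)
print(flagged.tolist())
print(has_split_ordinal("On 19 th September"))
print(has_split_ordinal("On 19th September"))
```

In the notebook itself, the same helpers could be pointed at `cleaned_reports` and at the `investigation`, `circumstances` and `concerns` columns. Any flagged rows are worth inspecting by hand, since the patterns are only approximations.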

In [None]:
from pfd_toolkit import load_reports, Cleaner, LLM
from dotenv import load_dotenv
import os

# Get API key
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

# Set up LLM client
llm_client = LLM(api_key=openai_api_key, 
                 max_workers=50)

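# Load a smaller sample of reports
reports_samp = load_reports(n_reports=10)

# Set up the cleaner on the sample
cleaner = Cleaner(
    llm=llm_client,
    reports=reports_samp)

# Generate summaries and display the result
summarised_reports = cleaner.summarise()
summarised_reports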

In [None]:
summarised_reports.to_csv('../data/summarised_reports.csv')
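
The requirement at the top of this notebook, that the code must run without issue before changes are pushed, can be made more concrete with a few assertions. The following is a minimal sketch under stated assumptions: `smoke_check` is a hypothetical helper rather than part of `pfd_toolkit`, the expected column names are taken from the output shown earlier, and the sketch assumes that cleaning should preserve the row count and keep report ids unique, which the source does not confirm. It runs on stand-in data.

```python
import pandas as pd

# Column names as they appear in the cleaned output earlier in this notebook.
EXPECTED_COLUMNS = [
    "url", "id", "date", "coroner", "area", "receiver",
    "investigation", "circumstances", "concerns",
]


def smoke_check(before: pd.DataFrame, after: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in after.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    # Assumption: cleaning edits text in place and never adds or drops reports.
    if len(after) != len(before):
        problems.append(f"row count changed: {len(before)} -> {len(after)}")
    if "id" in after.columns and after["id"].duplicated().any():
        problems.append("duplicate report ids")
    return problems


# Stand-in frames; in the notebook these would be unclean_reports and cleaned_reports.
before = pd.DataFrame(
    [{col: f"{col}-{i}" for col in EXPECTED_COLUMNS} for i in range(2)]
)
after = before.copy()
broken = after.drop(columns=["coroner"]).head(1)

print(smoke_check(before, after))
print(smoke_check(before, broken))
```

Returning a list of problems instead of raising on the first failure makes it easier to see everything that went wrong in a single run.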