## Demoing PFD Toolkit

Below is a demo of an early build of PFD Toolkit (version 0.1.27).

Note that the code below will not run on machines other than my own, as the toolkit is not yet available through `pip`.

Most code cells start with the `%%time` cell magic, which prints the CPU and wall times for the cell so you can get a better sense of its speed.
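
For readers outside a notebook, the difference between CPU time and wall time can be reproduced with the standard library. This is only a minimal sketch of the idea, not how `%%time` is implemented; the `timed` helper is a hypothetical name introduced for illustration.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, cpu_seconds, wall_seconds)."""
    cpu_start = time.process_time()   # CPU time used by this process (user + sys)
    wall_start = time.perf_counter()  # elapsed real-world time
    result = fn(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, cpu, wall

# Sleeping stands in for waiting on a network response:
# wall time passes while the CPU does almost no work.
result, cpu, wall = timed(time.sleep, 0.2)
print(f"CPU: {cpu:.3f} s, wall: {wall:.3f} s")
```

A large gap between the two numbers, as in the LLM cells further down, usually means the code spent most of its time waiting on I/O rather than computing.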

### Load reports

The dataset of PFD reports is bundled with the package, so users can load it in much the same way that an R user loads `iris`. Unlike `iris`, however, the dataset is refreshed weekly with newly published reports, courtesy of GitHub Actions workflows.

To save money on OpenAI credits, the bundled dataset is currently built with HTML scraping only. Eventually, this will be replaced by the better-performing LLM scraping.

In [2]:
%%time

from pfd_toolkit import load_reports

# Load all reports from the month of April 2025
reports = load_reports(start_date="2025-04-01",
                       end_date="2025-04-30")

reports.head(n=10)

CPU times: user 103 ms, sys: 7.95 ms, total: 111 ms
Wall time: 110 ms


Unnamed: 0,URL,ID,Date,CoronerName,Area,Receiver,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern
30,https://www.judiciary.uk/prevention-of-future-...,2025-0208,2025-04-30,Joanne Andrews,"West Sussex, Brighton and Hove",West Sussex County Council 1,On 02 November 2024 I commenced an investigati...,Mrs Turner drove her car into the canal at the...,During the course of the investigation my inqu...
31,https://www.judiciary.uk/prevention-of-future-...,2025-0207,2025-04-30,Alison Mutch,Manchester South,1) Flixton Road Medical Centre 2) Greater Manc...,On 1st October 2024 I commenced an investigati...,Louise Danielle Rosendale was prescribed long ...,During the course of the inquest the evidence ...
32,https://www.judiciary.uk/prevention-of-future-...,2025-0120,2025-04-25,Mary Hassell,Inner North London,1. The President Royal College Obstetricians a...,"On 23 August 2024, one of my assistant coroner...",Jannat was a big baby and her mother had a his...,"During the course of the inquest, the evidence..."
33,https://www.judiciary.uk/prevention-of-future-...,2025-0206,2025-04-25,Jonathan Heath,North Yorkshire and York,Townhead Surgery,On 04 June 2024 I commenced an investigation i...,"On 15 March 2024, Richard James Moss attended ...",During the course of the inquest the evidence ...
34,https://www.judiciary.uk/prevention-of-future-...,2025-0120,2025-04-24,Samantha Marsh,Somerset,Part 1 1. Somerset Foundation Trust of Trust M...,On 6 th December 2022 I commenced an investiga...,Anne first presented to her GP in 2008. During...,4 During the course of the inquest the evidenc...
35,https://www.judiciary.uk/prevention-of-future-...,2025-0199,2025-04-24,Samantha Goward,Norfolk,The Department for Transport 1,On 22 August 2024 I commenced an investigation...,"In summary, on the 17th of August 2024 Mr Mill...",During the course of the investigation my inqu...
36,https://www.judiciary.uk/prevention-of-future-...,2025-0194,2025-04-23,Heidi Connor,Berkshire,"1 , President of the Association of Coloprocto...",The family requested me to refer to the deceas...,This can be summarised by my findings on the R...,During the course of the investigation my inqu...
37,https://www.judiciary.uk/prevention-of-future-...,2025-0120,2025-04-23,Kerrie Burge,South Wales Central,"The Chief Executive, Rhondda Cynon Taf County ...","On 26 June 2023, I commenced an investigation ...",These were recorded as follows: Martin Robert ...,During the course of the inquest the evidence ...
38,https://www.judiciary.uk/prevention-of-future-...,2025-0193,2025-04-23,Heidi Connor,Berkshire,"1 , Chief Executive of Royal Berkshire NHS Fou...",The family requested me to refer to the deceas...,Lorraine Parker's death was the third in three...,During the course of the investigation my inqu...
39,https://www.judiciary.uk/prevention-of-future-...,2025-0198,2025-04-23,Louisa Corcoran,Ceredigion,to:- REGULATION 28 REPORT TO PREVENT FUTURE DE...,"On 8 August 2022, an investigation into the de...",Mr Brazil had suffered from an accident some t...,During the course of the inquest the evidence ...


### Running the scraper

There may be circumstances where users want to run their own scraper rather than load the bundled reports above. For example, we are currently considering whether the reports available through `load_reports` should be cleaned and recategorised, in which case some users may prefer to scrape the raw reports.

Our scraper has three cascading extraction methods:

1. It first attempts to parse structured data from the primary **HTML** page of a report.
2. If HTML parsing is insufficient or disabled, it falls back to extracting text from the linked **PDF** document and parsing sections based on keywords.
3. If both the HTML and PDF methods fail to retrieve the necessary information, it can use a **vision large language model (LLM)** for extraction, provided an LLM client is configured.

Each of these three methods can be disabled. If LLM scraping is used, the user must also import `LLM` from the package and pass a configured client to the scraper.
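
The cascade described above can be sketched in a few lines. This is an illustration of the fallback pattern only, not the toolkit's actual code: `extract_with_fallback` and the three stand-in extractors are hypothetical names, and the real scraper may fall back per field rather than per report.

```python
from typing import Callable, Optional

# An extractor takes a report URL and returns a dict of fields, or None on failure.
Extractor = Callable[[str], Optional[dict]]

def extract_with_fallback(url: str, extractors: list[tuple[str, Extractor]]):
    """Try each extractor in order; return (method_name, fields) for the first success."""
    for name, extractor in extractors:
        fields = extractor(url)
        if fields:  # a non-empty result ends the cascade
            return name, fields
    return None, {}

# Stand-in extractors (hypothetical; the real ones fetch and parse documents).
def from_html(url):
    return None  # pretend the HTML page lacked the expected sections

def from_pdf(url):
    return {"MattersOfConcern": "text recovered from the PDF"}

def from_llm(url):
    return {"MattersOfConcern": "text recovered by the LLM"}

method, fields = extract_with_fallback(
    "https://example.org/report",
    [("html", from_html), ("pdf", from_pdf), ("llm", from_llm)],
)
print(method)  # pdf
```

In this sketch, disabling a method simply means leaving it out of the list.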

In [None]:
%%time

from pfd_toolkit import PFDScraper, LLM
from dotenv import load_dotenv 
import os

# Load OpenAI API key from local environment 
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

# Set up LLM client
llm_client = LLM(api_key=openai_api_key,
                 max_workers=30)

# Set up Scraper 
scraper = PFDScraper(
    category="suicide",
    date_from="2020-01-10", date_to="2020-03-11",
    llm=llm_client,
    llm_fallback=True)

# Scrape!
scraper.scrape_reports()
scraper.reports

INFO:pfd_toolkit.scraper:Total collected report links: 12
Scraping reports: 100%|██████████| 12/12 [00:01<00:00,  9.65it/s]
LLM fallback (parallel processing): 100%|██████████| 12/12 [00:44<00:00,  3.71s/it]

CPU times: user 6.49 s, sys: 577 ms, total: 7.07 s
Wall time: 51.6 s





Unnamed: 0,URL,ID,Date,CoronerName,Area,Receiver,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern
0,https://www.judiciary.uk/prevention-of-future-...,2020-0052,2020-03-03,Nadia Persaud,East London,"Professor Oliver Shanley, Interim Chief Execut...",On the 23rd October 2019 I commenced an invest...,See narrative conclusion in box 3 for detail.,The matter of concern during the course of the...
1,https://www.judiciary.uk/prevention-of-future-...,2020-0050,2020-03-03,Alison Mutch,Manchester South,Department of Health,"On 22nd July 2019, I commenced an investigatio...",On 20th July Shaun Lea Turner was found unresp...,During the course of the inquest the evidence ...
4,https://www.judiciary.uk/prevention-of-future-...,2020-0043,2020-02-25,Veronica Hamilton-Deeley,Brighton and Hove,Sussex Police,On 9th October 2019 I commenced an investigati...,See Record of Inquest,During the course of the inquest the evidence ...
8,https://www.judiciary.uk/prevention-of-future-...,2020-0041,2020-02-21,Simon Burge,Hampshire (Central),Chief Executive - Central and North West Londo...,On 27th January 2020 I commenced an investigat...,Mr Goldstraw was found hanging in cell B3-03 a...,The MATTERS OF CONCERNS are as follows:\n\nA. ...
2,https://www.judiciary.uk/prevention-of-future-...,2020-0027,2020-02-17,"Caroline Beasley-Murray, senior coroner, for t...",Essex,NHS England,On 3 January 2020 I commenced an investigation...,Joseph James Gingell had suffered from mental ...,During the course of the inquest the evidence ...
3,https://www.judiciary.uk/prevention-of-future-...,2020-0026,2020-02-11,Bridget Dolan QC,West Sussex,"The Chief Executive, the Sussex Community NHS ...",On 05 November 2019 I commenced an investigati...,On 9 August 2109 Gemma Azhar had self-referred...,The MATTERS OF CONCERNS are as follows: Those ...
5,https://www.judiciary.uk/prevention-of-future-...,2020-0008,2020-01-20,Graeme Hughes,South Wales Central,The Chief Constable of South Wales Police & Th...,On 04/04/2019 I commenced an investigation int...,SWP officers attended room 161 at the Village ...,During the course of the inquest the evidence ...
6,https://www.judiciary.uk/prevention-of-future-...,2020-0025,2020-01-20,Andrew Bridgman,Manchester (South),"Ms Claire Molloy, Chief Executive, Pennine Car...",On 24.05.18 an investigation was commenced int...,"About 4 years prior, Samantha suffered an epis...",During the course of the inquest the evidence ...
9,https://www.judiciary.uk/prevention-of-future-...,2020-0010,2020-01-14,Andrew Haigh,Staffordshire (South),HMP Dovegate,On 1st October 2018 I commenced an investigati...,Marlon Roy Watson was a serving prisoner at HM...,1. At the inquest there was a concern that mem...
7,https://www.judiciary.uk/prevention-of-future-...,2020-0007,2020-01-10,Oliver Longstaff,West Yorkshire (West),Kirklees Council\nHighways England,On 17th September 2018 I commenced an investig...,"On 12th September 2018, Mr Wajid was observed ...",During the course of the inquest the evidence ...


Between them, `LLM` and `PFDScraper` contain quite a few configurable parameters. Here are all of them with their default values:

In [None]:
from pfd_toolkit import PFDScraper, LLM

llm = LLM(
    api_key=None, # ...API key for LLM client (e.g. OpenAI),
    model="gpt-4.1-mini", # ...name of LLM model
    base_url=None, # ...if not using OpenAI, allow the user to point at alternative API (e.g. Gemini, Ollama)
    max_workers=1 # ...to enable LLM parallel API requests
)

scraper = PFDScraper(
    llm=None, # ...specify LLM client
    category='all', # ...category as given on judiciary.uk
    date_from="2000-01-01", date_to="2030-01-01", # ...date range for scraping
    max_workers=10, # ...parallelisation for CPU-bound tasks
    max_requests=5, # ...parallelisation for HTTP-bound tasks
    delay_range=(1, 2), # ...delay range between 1 and 2 seconds to mimic human browsing
    timeout=60, # ...60 second timeout for any individual scraping activity
    include_url=True, # ...whether to include the report URL in the output
    # (equivalent `include_` params exist for all other sections)
    verbose=False
)
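
To give a feel for what a `max_workers` setting controls, here is a minimal sketch of bounded parallel requests using a thread pool. It assumes a thread-based approach, which the toolkit may or may not use internally; `fake_llm_call` and `run_parallel` are hypothetical names, and the sleep stands in for a real API round trip.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_llm_call(report_id: str) -> str:
    """Stand-in for one LLM API request (hypothetical)."""
    time.sleep(0.05)  # simulate network latency
    return f"extracted:{report_id}"

def run_parallel(report_ids, max_workers=1):
    """Process reports with at most `max_workers` requests in flight at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the results
        return list(pool.map(fake_llm_call, report_ids))

ids = [f"2020-{n:04d}" for n in range(12)]
start = time.perf_counter()
results = run_parallel(ids, max_workers=6)
elapsed = time.perf_counter() - start
print(len(results), f"{elapsed:.2f} s")
```

With `max_workers=1` the calls run one after another, so raising the value mainly helps when each call spends most of its time waiting on the network.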

### Topping up reports

Occasionally, we might run the scraper once and later want to 'top up' our reports as new ones are published. It would be a pain to run the entire scraper again, so we've included a `top_up()` method that scans for any new reports that are not in our original dataframe and adds them.

It uses exactly the same configuration as the `scrape_reports()` method, since both are methods of the `PFDScraper` class.

In [None]:
scraper.top_up(date_to="2020-04-11") # ...extend the end date by a month and look for new reports!

INFO:pfd_toolkit.scraper:Attempting to 'top up' the existing reports with new data.
INFO:pfd_toolkit.scraper:Total collected report links: 22
INFO:pfd_toolkit.scraper:Top-up: 10 new report(s) found; 12 duplicate(s) which won't be added
Topping up reports: 100%|██████████| 10/10 [00:01<00:00,  9.61it/s]


Unnamed: 0,URL,ID,Date,CoronerName,Area,Receiver,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern
12,https://www.judiciary.uk/prevention-of-future-...,2020-0077,2020-03-24,Andrew Walker,London (North),Ministerial Correspondence and Public Enquirie...,On the 21st day of October 2019 I opened an in...,On the Second of October 2019 Simon Anthony De...,During the course of the inquest the evidence ...
13,https://www.judiciary.uk/prevention-of-future-...,2020-0076,2020-03-24,Geoffrey Sullivan,Hertfordshire,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found
14,https://www.judiciary.uk/prevention-of-future-...,2020-0074,2020-03-23,Nicholas Rheinberg,Exeter and Greater Devon,1. Devon Partnership NHS Trust as lead for the...,On 27th April 2017 an investigation into the d...,Lewis Francis whilst acutely psychotic stabbed...,During the course of the inquest the evidence ...
16,https://www.judiciary.uk/prevention-of-future-...,2020-0071,2020-03-16,Penelope Schofield,West Sussex,Ms Samantha Allen Chief Executive Sussex Partn...,On 6th November 2018 I commenced an investigat...,N/A: Not found,During the course of the inquest the evidence ...
18,https://www.judiciary.uk/prevention-of-future-...,2020-0064,2020-03-12,Geraint Williams,South Wales Central,Cardiff & Vale NHS Trust 1.,On 26th October 2017 I commenced an investigat...,These were recorded as:- Ian Weeks was remande...,During the course of the inquest the evidence ...
19,https://www.judiciary.uk/prevention-of-future-...,2020-0058,2020-03-09,Fiona Wilcox,London Inner (West),N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found
20,https://www.judiciary.uk/prevention-of-future-...,2020-0056,2020-03-06,Andre Rebello,Liverpool and the Wirral,HMPPS,On 11/10/2017 I commenced an investigation int...,The Jury found: During admission to 68 Hornby ...,The MATTERS OF CONCERNS are as follows: (brief...
0,https://www.judiciary.uk/prevention-of-future-...,2020-0052,2020-03-03,Nadia Persaud,East London,"Professor Oliver Shanley, Interim Chief Execut...",On the 23rd October 2019 I commenced an invest...,See narrative conclusion in box 3 for detail.,The matter of concern during the course of the...
1,https://www.judiciary.uk/prevention-of-future-...,2020-0050,2020-03-03,Alison Mutch,Manchester South,Department of Health,"On 22nd July 2019, I commenced an investigatio...",On 20th July Shaun Lea Turner was found unresp...,During the course of the inquest the evidence ...
2,https://www.judiciary.uk/prevention-of-future-...,2020-0043,2020-02-25,Veronica Hamilton-Deeley,Brighton and Hove,Sussex Police,On 9th October 2019 I commenced an investigati...,See Record of Inquest,During the course of the inquest the evidence ...


Incidentally, running `top_up()` is also how the dataset behind `load_reports()` gets updated each week, under the hood. All using PFD Toolkit methods!
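
The top-up idea can be sketched as a simple de-duplication step. This is a sketch under stated assumptions rather than the toolkit's implementation: it assumes reports are matched on their URL, and the standalone `top_up` function and `scrape_one` callback are hypothetical names.

```python
def top_up(existing, candidate_urls, scrape_one):
    """Scrape only unseen URLs; return (updated_reports, n_new, n_duplicates)."""
    seen = {report["URL"] for report in existing}
    # dict.fromkeys removes repeated links while keeping their order
    new_urls = [u for u in dict.fromkeys(candidate_urls) if u not in seen]
    n_duplicates = len(candidate_urls) - len(new_urls)
    new_reports = [scrape_one(u) for u in new_urls]
    return existing + new_reports, len(new_urls), n_duplicates

existing = [{"URL": "https://example.org/a"}, {"URL": "https://example.org/b"}]
links = ["https://example.org/a", "https://example.org/b", "https://example.org/c"]

reports, n_new, n_dup = top_up(existing, links, lambda url: {"URL": url})
print(n_new, n_dup)  # 1 2
```

Only the unseen link is passed to the scraping callback, which is what keeps a top-up much cheaper than a full re-scrape.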

### Cleaning reports

PFD reports contain many spelling and grammatical errors. Additionally, the Concerns section tends to include a lot of boilerplate text that differs slightly between reports, so a fixed regex can't reliably strip it.

On top of this, the way coroner names are formatted can differ from report to report, making it difficult to filter for a specific coroner.

We can use an LLM to fix all of these issues...

In [7]:
%%time

from pfd_toolkit import Cleaner, LLM

# Set up LLM client
llm_client = LLM(api_key=openai_api_key, 
                 max_workers=30)

# Run cleaner
cleaner = Cleaner(
    llm=llm_client,
    reports=scraper.reports)

cleaned_reports = cleaner.clean_reports()
cleaned_reports

Processing Fields:   0%|          | 0/6 [00:00<?, ?it/s]

Processing Fields: 100%|██████████| 6/6 [00:42<00:00,  7.04s/it]

CPU times: user 1.92 s, sys: 158 ms, total: 2.08 s
Wall time: 42.2 s





Unnamed: 0,URL,ID,Date,CoronerName,Area,Receiver,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern
12,https://www.judiciary.uk/prevention-of-future-...,2020-0077,2020-03-24,A. Walker,London North,Ministerial Correspondence and Public Enquirie...,On the 21st day of October 2019 I opened an in...,On the second of October two thousand and nine...,There are no arrangements or guidance concerni...
13,https://www.judiciary.uk/prevention-of-future-...,2020-0076,2020-03-24,G. Sullivan,Hertfordshire,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found
14,https://www.judiciary.uk/prevention-of-future-...,2020-0074,2020-03-23,N. Rheinberg,Exeter and Greater Devon,Devon Partnership NHS Trust as lead for the So...,On 27 April 2017 an investigation into the dea...,Lewis Francis whilst acutely psychotic stabbed...,At present there is no mechanism for the ready...
16,https://www.judiciary.uk/prevention-of-future-...,2020-0071,2020-03-16,P. Schofield,West Sussex,Ms Samantha Allen Chief Executive Sussex Partn...,On 6 November 2018 I commenced an investigatio...,N/A: Not found,1. Mr Ashley's care and treatment plan was not...
18,https://www.judiciary.uk/prevention-of-future-...,2020-0064,2020-03-12,G. Williams,South Wales Central,Cardiff & Vale NHS Trust,On 26 October 2017 I commenced an investigatio...,Ian Weeks was remanded into custody at HMP Car...,Although it was recorded on System 1 that Mr W...
19,https://www.judiciary.uk/prevention-of-future-...,2020-0058,2020-03-09,F. Wilcox,London Inner West,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found
20,https://www.judiciary.uk/prevention-of-future-...,2020-0056,2020-03-06,A. Rebello,Liverpool and the Wirral,HMPPS,On 11 October 2017 I commenced an investigatio...,"During admission to 68 Hornby Road, Liverpool ...",During the course of evidence it became appare...
0,https://www.judiciary.uk/prevention-of-future-...,2020-0052,2020-03-03,N. Persaud,East London,"Professor Oliver Shanley, Interim Chief Execut...",On the 23rd October 2019 I commenced an invest...,See narrative conclusion in box 3 for detail.,A GP made a referral to the mental health team...
1,https://www.judiciary.uk/prevention-of-future-...,2020-0050,2020-03-03,A. Mutch,Manchester South,Department of Health,"On 22 July 2019, I commenced an investigation ...",On 20 July Shaun Lea Turner was found unrespon...,"During the course of the inquest, evidence was..."
2,https://www.judiciary.uk/prevention-of-future-...,2020-0043,2020-02-25,V. Hamilton-Deeley,Brighton and Hove,Sussex Police,On 9 October 2019 I commenced an investigation...,See Record of Inquest,Mr Reilly visited Beachy Head on the 1st of Oc...


Notice, for example, how the coroner names have been standardised to a common format: the initial of the first name followed by the last name. Other changes likely aren't visible without inspecting the dataframe in more detail.
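
To make the target name format concrete, here is a small rule-based sketch that produces the same 'initial plus last name' shape. It is not how `Cleaner` works (that step is done by the LLM), and the function name and the list of post-nominals are assumptions made for illustration.

```python
# Hypothetical set of post-nominal letters to drop; not taken from PFD Toolkit.
POST_NOMINALS = {"QC", "KC", "OBE", "MBE"}

def normalise_coroner_name(raw: str) -> str:
    """Reduce a coroner name to 'F. Lastname' (illustrative only)."""
    # Keep only the part before the first comma, dropping trailing job titles
    name = raw.split(",", 1)[0]
    # Split on whitespace and drop post-nominals such as "QC"
    parts = [p for p in name.split() if p.upper().rstrip(".") not in POST_NOMINALS]
    if len(parts) < 2:
        return name.strip()  # not enough information to abbreviate
    return f"{parts[0][0].upper()}. {parts[-1]}"

print(normalise_coroner_name("Veronica Hamilton-Deeley"))  # V. Hamilton-Deeley
print(normalise_coroner_name("Bridget Dolan QC"))          # B. Dolan
```

Fixed rules like these break down quickly on unusual formats, which is part of the motivation for handing the task to an LLM instead.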

### What comes next

Version 0.1.27 of PFD Toolkit is still an early build. For the remaining 0.1.xx versions, the focus is on improving the success rate of LLM scraping. Version 0.2.xx will come with a Categorisation module, allowing users to provide themes and sub-themes for custom data curation.