In [1]:
# Time the entire workflow

import time
start = time.time()

# Getting started

This page talks you through an example workflow using PFD Toolkit: loading a dataset and screening for relevant cases.

It doesn't cover everything: for more, we strongly suggest browsing through the pages in the top panel.

---

## Installation

PFD Toolkit can be installed with pip as `pfd_toolkit`:

```bash
pip install pfd_toolkit
```

## Load your first dataset

First, you'll need to load a PFD dataset. These datasets are updated weekly, meaning you always have access to the latest reports with minimal setup.

In [2]:
from pfd_toolkit import load_reports

# Load all PFD reports from 1 January 2024 to 1 May 2025
reports = load_reports(
    start_date="2024-01-01",
    end_date="2025-05-01")

# Identify number of reports
num_reports = len(reports)

reports.head(n=5)

| | url | id | date | coroner | area | receiver | investigation | circumstances | concerns |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.judiciary.uk/prevention-of-future-... | 2025-0209 | 2025-05-01 | A. Hodson | Birmingham and Solihull | NHS England; The Robert Jones and Agnes Hunt O... | On 9th December 2024 I commenced an investigat... | At 10.45am on 23rd November 2024, Peter sadly ... | To The Robert Jones and Agnes Hunt Orthopaedic... |
| 1 | https://www.judiciary.uk/prevention-of-future-... | 2025-0208 | 2025-04-30 | J. Andrews | West Sussex, Brighton and Hove | West Sussex County Council | On 2 November 2024 I commenced an investigatio... | Mrs Turner drove her car into the canal at the... | The inquest was told that South Bank is a resi... |
| 2 | https://www.judiciary.uk/prevention-of-future-... | 2025-0207 | 2025-04-30 | A. Mutch | Manchester South | Flixton Road Medical Centre; Greater Mancheste... | On 1 October 2024 I commenced an investigation... | Louise Danielle Rosendale was prescribed long ... | The inquest heard evidence that Louise Rosenda... |
| 3 | https://www.judiciary.uk/prevention-of-future-... | 2025-0206 | 2025-04-25 | J. Heath | North Yorkshire and York | Townhead Surgery | On 4th June 2024 I commenced an investigation ... | On 15 March 2024, Richard James Moss attended ... | When a referral document is completed by a med... |
| 4 | https://www.judiciary.uk/prevention-of-future-... | 2025-0120 | 2025-04-25 | M. Hassell | Inner North London | The President Royal College Obstetricians and ... | On 23 August 2024, one of my assistant coroner... | Jannat was a big baby and her mother had a his... | With the benefit of a maternity and newborn sa... |
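Before moving on, it can be worth confirming that the loaded reports actually span the dates you asked for. The sketch below uses a small stand-in DataFrame with made-up dates rather than the real dataset, and assumes the `date` column holds ISO-formatted date strings (as the output above suggests):

```python
import pandas as pd

# Stand-in for `reports`: three made-up rows, not real PFD data.
reports = pd.DataFrame({"date": ["2025-05-01", "2024-01-03", "2024-11-20"]})

# Parse the date strings so they compare as dates rather than text
dates = pd.to_datetime(reports["date"])

print(f"{len(reports)} reports, from {dates.min().date()} to {dates.max().date()}")
```

If the earliest or latest date looks wrong, double-check the `start_date` and `end_date` arguments passed to `load_reports`.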


## Screening for relevant reports

You're likely using PFD Toolkit because you want to answer a specific question. For example: "Do any PFD reports raise concerns related to detention under the Mental Health Act?"

PFD Toolkit lets you query reports in plain English — no need to know precise keywords or categories. Just describe the cases you care about, and the toolkit will return matching reports.

### Set up an LLM client

Screening and other advanced features rely on an LLM, so you first need to set up an LLM client. Head to [platform.openai.com](https://platform.openai.com/docs/overview) and create an API key, then pass it to the `LLM` class.

In [3]:
from pfd_toolkit import LLM
from dotenv import load_dotenv
import os

# Load OpenAI API key
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialise LLM client
llm_client = LLM(api_key=openai_api_key, max_workers=25,
                 temperature=0, seed=123)
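If `api.env` is missing or the variable name is misspelled, `os.getenv` quietly returns `None`, and the problem only shows up later. A minimal sketch of an early check using only the standard library; `require_env` is a hypothetical helper written for this example, not part of PFD Toolkit:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Check that your .env file exists and defines it."
        )
    return value


# Placeholder value so this sketch runs on its own; in the workflow above,
# load_dotenv("api.env") would populate the variable instead.
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")
openai_api_key = require_env("OPENAI_API_KEY")
```

Failing at this point gives a clearer error than discovering a missing key once requests are being sent.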

### Screen reports in plain English

Now, all we need to do is specify our `user_query` (the statement the LLM will use to filter reports), and set up our `Screener` engine.

In [4]:
from pfd_toolkit import Screener

# Create a user query to filter
user_query = "Concerns related to detention under the Mental Health Act **only**"

# Screen reports
screener = Screener(llm=llm_client,
                    reports=reports)  # Reports that you loaded earlier

filtered_reports = screener.screen_reports(user_query=user_query,
                                           produce_spans=True,
                                           drop_spans=True)

Sending requests to the LLM: 100%|██████████| 883/883 [00:32<00:00, 26.77it/s]


In [5]:
# Capture number of screened reports
num_reports_screened = len(filtered_reports)

# Check how many reports we've identified
print(f"From our initial {num_reports} reports, PFD Toolkit identified {num_reports_screened} \
reports discussing concerns around detention under the Mental Health Act.")

From our initial 883 reports, PFD Toolkit identified 98 reports discussing concerns around detention under the Mental Health Act.
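The two counts can also be expressed as a match rate, which is easier to compare across queries or date ranges. A small sketch, with the numbers copied from the output above rather than recomputed:

```python
num_reports = 883            # reports loaded (from the output above)
num_reports_screened = 98    # reports kept by the screener (from the output above)

# Share of loaded reports that matched the query
match_rate = num_reports_screened / num_reports
print(f"{match_rate:.1%} of reports matched the query")
```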


In practice, we'd probably want to extend our start and end dates to cover the entire corpus of reports. We've only kept things short for demo purposes :)

---

## Discover themes in your filtered dataset

With your subset of reports screened for Mental Health Act detention concerns, the next step is to uncover the underlying themes. This lets you see 'at a glance' what issues the coroners keep raising.

We'll use the `Extractor` class to automatically identify themes from the *concerns* section of each report.

In [6]:
from pfd_toolkit import Extractor

extractor = Extractor(
    llm=llm_client,             # The same client you created earlier
    reports=filtered_reports,   # Your screened DataFrame
    
    # Only supply the 'concerns' text
    include_date=False,
    include_coroner=False,
    include_area=False,
    include_receiver=False,
    include_investigation=False,
    include_circumstances=False,
    include_concerns=True   # <--- Only identify themes relating to concerns 
)

Keeping the prompt focused on the coroner's concerns reduces cost and often results in more accurate themes.

---

### Summarise then discover themes

Before discovering themes, we need to summarise each report. 

We do this because the length of PFD reports varies from coroner to coroner. Summarising the reports centres each one on its key messages and keeps the prompt short for the LLM.

In [7]:
# Create short summaries of the concerns
extractor.summarise(trim_intensity="medium")

# Ask the LLM to propose recurring themes
IdentifiedThemes = extractor.discover_themes(
    max_themes=6,  # Limit the list to keep things manageable
)

_Note:_ `Extractor` will warn you if the word count of your summaries is too high. In these cases, you might want to set your `trim_intensity` to `high` or `very high` (though please note that the more we trim, the more detail we lose).

`IdentifiedThemes` is a Pydantic model whose boolean fields represent the themes the LLM found. 

The model itself doesn't print in a readable form, but the same themes are stored as JSON in `extractor.identified_themes`, which we can print. This gives us a record of each proposed theme with an accompanying description.

In [8]:
print(extractor.identified_themes)

```json
{
  "risk_assessment": {
    "type": "bool",
    "description": "Failures or inadequacies in assessing, documenting, and managing patient or prisoner risk, including suicide, self-harm, violence, and absconsion."
  },
  "information_sharing": {
    "type": "bool",
    "description": "Poor communication and data sharing between agencies, healthcare providers, police, families, and within teams, leading to gaps in care and safety."
  },
  "bed_shortage": {
    "type": "bool",
    "description": "Insufficient availability of inpatient mental health beds causing delays in admission, prolonged A&E stays, out-of-area placements, and inadequate care."
  },
  "staff_training": {
    "type": "bool",
    "description": "Inadequate, inconsistent, or insufficient training for healthcare, custodial, police, or support staff impacting knowledge, skills, and patient safety."
  },
  "policy_compliance": {
    "type": "bool",
    "description": "Non-adherence to or absence of clear policies and
```

### Tag the reports

Above, we've only identified the themes: we haven't assigned these themes to the reports.

Once you have the theme model, pass it back into the extractor to assign themes to every report in the dataset:

In [9]:
labelled_reports = extractor.extract_features(
    feature_model=IdentifiedThemes,
    force_assign=True,
    allow_multiple=True  # A single report might touch on multiple themes
)

labelled_reports.head(n=5)

Extracting features: 100%|██████████| 98/98 [00:04<00:00, 21.34it/s]


Unnamed: 0,url,id,date,coroner,area,receiver,investigation,circumstances,concerns,risk_assessment,information_sharing,bed_shortage,staff_training,policy_compliance,environmental_safety
5,https://www.judiciary.uk/prevention-of-future-...,2025-0200,2025-04-24,S. Marsh,Somerset,Somerset Foundation Trust; Royal College of Ob...,On sixth December two thousand and twenty-two ...,Anne first presented to her GP in 2008. During...,Anne was not sent home for her first overnight...,True,True,False,True,False,False
30,https://www.judiciary.uk/prevention-of-future-...,2025-0172,2025-04-07,S. Reeves,South London,South London and Maudsley NHS Foundation Trust,"On 21 March 2023, an inquest was opened, and a...",Christopher McDonald was pronounced dead at 14...,The evidence heard at the inquest demonstrated...,False,False,False,True,True,False
46,https://www.judiciary.uk/prevention-of-future-...,2025-0160,2025-03-25,F. Wilcox,Inner West London,Commissioner of the Police of the Metropolis; ...,From third March to twenty-fourth March two th...,Mr Omishore had been observed by multiple memb...,That there is an inconsistency of approach bet...,True,True,False,True,False,False
48,https://www.judiciary.uk/prevention-of-future-...,2025-0161,2025-03-24,T. Rawden,South Yorkshire West,South West Yorkshire Partnership NHS Foundatio...,On 27 September 2024 I commenced an investigat...,Claire Louise Driver had a past medical histor...,The inquest heard there were only two attempts...,True,True,False,True,False,False
58,https://www.judiciary.uk/prevention-of-future-...,2025-0144,2025-03-17,S. Horstead,Essex,Chief Executive Officer of Essex Partnership U...,On 31 October 2023 I commenced an investigatio...,On the 23rd September 2023 after concerns were...,(a) Failures in care planning specifically a f...,True,True,False,False,True,False


The resulting DataFrame now contains a column for each discovered theme, filled with `True` or `False` depending on whether that theme was present in the coroner's concerns.
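Because the theme columns hold `True`/`False` values, they can be used directly as pandas masks. The sketch below runs on a made-up stand-in for `labelled_reports` (the rows are illustrative, not real reports) and assumes the theme columns have boolean dtype:

```python
import pandas as pd

# Stand-in for `labelled_reports`: three made-up rows with two theme columns.
labelled_reports = pd.DataFrame(
    {
        "id": ["A", "B", "C"],
        "risk_assessment": [True, False, True],
        "bed_shortage": [False, False, True],
    }
)

# Reports tagged with a single theme
risk_reports = labelled_reports[labelled_reports["risk_assessment"]]

# Reports tagged with both themes
both_themes = labelled_reports[
    labelled_reports["risk_assessment"] & labelled_reports["bed_shortage"]
]

print(risk_reports["id"].tolist())
print(both_themes["id"].tolist())
```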

From here you can perform whatever analysis you need: filtering for particular issues, exporting the data to other tools, or counting how often each theme occurs.

For example, we can count how often each theme appears in our collection of reports:

In [10]:
from pfd_toolkit import _tabulate

_tabulate(labelled_reports, columns=[
    "risk_assessment",
    "information_sharing",
    "bed_shortage",
    "staff_training",
    "policy_compliance",
    "environmental_safety"])

| | Category | Count | Percentage |
|---|---|---|---|
| 0 | risk_assessment | 69 | 70.408163 |
| 1 | information_sharing | 50 | 51.020408 |
| 2 | bed_shortage | 18 | 18.367347 |
| 3 | staff_training | 49 | 50.0 |
| 4 | policy_compliance | 58 | 59.183673 |
| 5 | environmental_safety | 17 | 17.346939 |
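The leading underscore in `_tabulate` follows the Python convention for internal helpers, so it may change between releases. A similar table can be approximated with plain pandas. This is a sketch on a made-up stand-in DataFrame, assuming the theme columns are boolean; it is not how `_tabulate` is implemented:

```python
import pandas as pd

# Stand-in for `labelled_reports`: four made-up rows with two theme columns.
labelled = pd.DataFrame(
    {
        "risk_assessment": [True, True, False, True],
        "bed_shortage": [False, True, False, False],
    }
)
theme_cols = ["risk_assessment", "bed_shortage"]

# Summing booleans counts the True values; the mean gives their share.
summary = pd.DataFrame(
    {
        "Count": labelled[theme_cols].sum(),
        "Percentage": labelled[theme_cols].mean() * 100,
    }
)
print(summary)
```

Since reports can carry several themes at once, the percentages are not expected to add up to 100.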


That's it! You've gone from a mass of PFD reports, to a focused set of cases relating to Mental Health Act detention, to a theme‑tagged dataset ready for deeper exploration.

From here we can either save our `labelled_reports` dataset via `pandas` for qualitative analysis, or we can use *even more* analytical features of PFD Toolkit.
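Saving via `pandas` can be sketched as follows. The DataFrame here is a made-up stand-in for `labelled_reports`, and the output path (a temporary directory) is only for illustration; in practice you'd pick your own location:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for `labelled_reports`: two made-up rows.
labelled_reports = pd.DataFrame(
    {
        "id": ["X-0001", "X-0002"],
        "risk_assessment": [True, False],
    }
)

# Illustrative output location
out_path = Path(tempfile.gettempdir()) / "labelled_reports.csv"

# index=False leaves the row index out of the file
labelled_reports.to_csv(out_path, index=False)

# Read the file back to confirm it round-trips
reloaded = pd.read_csv(out_path)
print(reloaded)
```

Passing `index=False` matters: without it, the row index is written as an unnamed first column, which comes back as `Unnamed: 0` when the file is reloaded.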

In [11]:
# Check workflow runtime

end = time.time()

elapsed_seconds = int(end - start)

minutes, seconds = divmod(elapsed_seconds, 60)
print(f"Elapsed time: {minutes}m {seconds}s")

Elapsed time: 1m 5s
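If you time your own workflows, `time.perf_counter()` is a common alternative to `time.time()` for measuring intervals, because it is monotonic and unaffected by system clock adjustments. A small sketch; `format_elapsed` is a hypothetical helper written for this example:

```python
import time


def format_elapsed(seconds: float) -> str:
    """Format a duration in seconds as 'Xm Ys'."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m {secs}s"


start = time.perf_counter()
# ... workflow steps would run here ...
elapsed = time.perf_counter() - start

print(f"Elapsed time: {format_elapsed(elapsed)}")
```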
