# Feature extraction DEMO

#### 1. "Can we extract age & ethnicity from PFD reports?"

**Context:** There are no age or ethnicity fields in PFD reports. However, anecdotally, many coroners seem to note age in the "Investigation & Inquest" or "Circumstances of Death" sections of the report.

This demo considers whether age & ethnicity, as features, can be extracted from these reports using PFD Toolkit's `Extractor` class.

##### The workflow 'at a glance'

Below, we:

 * Load a sample of reports
 * Load our OpenAI API key and create an LLM object
 * Define a Pydantic (`BaseModel`) class, which specifies each key (e.g. age), its type (e.g. int) and a short description
 * Feed this to an `Extractor` instance
 * Inspect the generated prompt to see what will be sent to the LLM
 * Run the Extractor and inspect the resulting dataframe


In [1]:
from pfd_toolkit import load_reports, LLM, Extractor
from dotenv import load_dotenv
import os
from pydantic import BaseModel, Field

# Load reports sample
reports = load_reports(n_reports=150)

# Load OpenAI API key
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialise LLM client
llm_client = LLM(api_key=openai_api_key, max_workers=30)

# Initialise Pydantic model
class DemoFeatures(BaseModel):
    age: int = Field(..., description="Age of the deceased")
    ethnicity: str = Field(..., description="Ethnicity of the deceased")

# Set up our Extractor, passing it everything we've initialised above
extractor = Extractor(
    llm=llm_client,
    feature_model=DemoFeatures,
    reports=reports
)

This is the prompt that the LLM will be fed:

In [2]:
prompt = extractor._generate_prompt(reports.iloc[0])
print(prompt)

You are an expert at extracting structured information from UK Prevention of Future Death reports.

Extract the following features from the report excerpt provided.

If a feature cannot be located, respond with 'N/A: Not found'.

Assign only one category to each report.

Return your answer strictly as a JSON object matching this schema:

{
  "age": {
    "description": "Age of the deceased",
    "title": "Age",
    "type": "integer"
  },
  "ethnicity": {
    "description": "Ethnicity of the deceased",
    "title": "Ethnicity",
    "type": "string"
  }
}

Here is the report excerpt:

InvestigationAndInquest: Mr Dean Bradley died on 15 th October 2021 at [REDACTED], Stockton on Tees. An inquest into Mr Bradley's death was opened on 28 th July 2023 and his inquest was heard before me on 19 th May 2025. The medical cause of Mr Bradley's death was: 1a. Pressure on the neck 1b. Hanging It was discovered in toxicology testing that the level of Venlafaxine in his blood was around twenty times 
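
The schema block in the prompt above has the same shape as the `properties` entry of the JSON schema that Pydantic generates for a model. As a minimal sketch (assuming Pydantic v2, and not necessarily how `Extractor` builds its prompt internally), it can be reproduced like this:

```python
import json

from pydantic import BaseModel, Field


class DemoFeatures(BaseModel):
    age: int = Field(..., description="Age of the deceased")
    ethnicity: str = Field(..., description="Ethnicity of the deceased")


# model_json_schema() is the Pydantic v2 API; "properties" holds one entry per field
properties = DemoFeatures.model_json_schema()["properties"]

# Pretty-print the per-field schema, similar to the block shown in the prompt
schema_text = json.dumps(properties, indent=2)
print(schema_text)
```

Note that the schema declares `age` as an integer while the prompt asks for the string 'N/A: Not found' when a value is missing, so the returned `age` column can contain a mix of numbers and text.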

In [3]:
result_df = extractor.extract_features()
result_df

Extracting features: 100%|██████████| 150/150 [00:13<00:00, 11.35it/s]


Unnamed: 0,URL,ID,Date,CoronerName,Area,Receiver,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern,age,ethnicity
0,https://www.judiciary.uk/prevention-of-future-...,2025-0248,2025-05-28,Clare Bailey,Teesside and Hartlepool,1 Department of Health and Social Care 2 Chief...,Mr Dean Bradley died on 15 th October 2021 at ...,At approximately 0300 on 15 th October 2021 Mr...,During the course of the investigation my inqu...,N/A: Not found,N/A: Not found
1,https://www.judiciary.uk/prevention-of-future-...,2025-0243,2025-05-27,Andrew Cousins,Blackpool & Fylde,BARCHESTER HEALTHCARE LIMITED 1,"On 30 April 2025 and 23 May 2025, at an inques...",I returned the following in box 4 of the Recor...,During the course of the inquest the evidence ...,N/A: Not found,N/A: Not found
2,https://www.judiciary.uk/prevention-of-future-...,2025-0244,2025-05-27,Peter Merchant,West Yorkshire West,"1 , Chief Constable West Yorkshire Police 1",On 15 February 2024 the death of Paul Andrew A...,"As identified above, Paul Alexander had a long...",During the course of the investigation my inqu...,N/A: Not found,N/A: Not found
3,https://www.judiciary.uk/prevention-of-future-...,2025-0245,2025-05-27,Nadia Persaud,East London,", Chief Executive Officer, Barts Health NHS Fo...",On the 13 June 2024 I commenced an investigati...,Abdirahman Afrah began to suffer from chest pa...,During the course of the inquest the evidence ...,17,N/A: Not found
4,https://www.judiciary.uk/prevention-of-future-...,2025-0246,2025-05-27,Rebecca Sutton,Durham and Darlington,"1. Deputy Chief Constable , Durham Constabular...",On 7 January 2025 an investigation into the de...,The Deceased had a long history of mental heal...,During the course of the inquest the evidence ...,24,N/A: Not found
...,...,...,...,...,...,...,...,...,...,...,...
145,https://www.judiciary.uk/prevention-of-future-...,2025-0096,2025-02-19,Susan Ridge,Surrey,Chief Executive Surrey and Sussex Healthcare N...,An inquest into Mrs Rodgers death was opened o...,A narrative conclusion was recorded at Box 4 o...,The MATTERS OF CONCERN are: The court heard th...,N/A: Not found,N/A: Not found
146,https://www.judiciary.uk/prevention-of-future-...,2025-0092,2025-02-18,Caroline Saunders,Gwent,Welsh Parliament,"On 14/3/2024, an investigation was opened touc...","On 19/2/2024, Jeffrey Martin Tyler called the ...",The MATTERS OF CONCERN are as follows: - In ev...,N/A: Not found,N/A: Not found
147,https://www.judiciary.uk/prevention-of-future-...,2025-0099,2025-02-18,Sarah Bourke,Inner North London,"1. , Secretary of State for Justice, Ministry ...","On 4 October 2023, Assistant Coroner Smith com...",Mr Bainborough lived in supported living accom...,During the course of the inquest the evidence ...,52,N/A: Not found
148,https://www.judiciary.uk/prevention-of-future-...,2025-0098,2025-02-18,Sarah Bourke,Inner North London,"1. , Secretary of State for Justice, Ministry ...","On 19 October 2022, I commenced an investigati...",Mrs Mohamed developed mental health problems a...,During the course of the inquest the evidence ...,53,N/A: Not found


In the rows shown, age is extracted for several reports, whereas ethnicity is missing throughout. Let's unpack this a bit more...

In [4]:
# Count reports where an age was extracted (True) vs. not found (False)
print(result_df["age"].ne("N/A: Not found").value_counts())

# Print all unique values counts for ethnicity
print("\n\n")
print(result_df["ethnicity"].value_counts())

age
True     111
False     39
Name: count, dtype: int64



ethnicity
N/A: Not found    149
Black               1
Name: count, dtype: int64


Conclusions:

1. In our sample, age was extracted for 111 of 150 reports (74%); the remaining 39 reports possibly do not record this information.

2. Ethnicity was extracted just once, with the remaining 149 reports possibly not recording this information. In this one instance, the model correctly extracted this information.
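
Because missing values come back as the string 'N/A: Not found', the `age` column mixes numbers and text. Below is a minimal sketch of one way to get a numeric column for further analysis. It uses a small made-up frame (`demo_df`, hypothetical) in place of `result_df`:

```python
import pandas as pd

# Hypothetical stand-in for result_df; values are illustrative only
demo_df = pd.DataFrame(
    {"age": ["N/A: Not found", 17, 24, "N/A: Not found", 52, 53]}
)

# Coerce anything that is not a number (e.g. the sentinel string) to NaN
age_numeric = pd.to_numeric(demo_df["age"], errors="coerce")

# Share of rows where an age was extracted
coverage = age_numeric.notna().mean()

print(f"Age extracted for {coverage:.0%} of rows")
print(age_numeric.describe())
```

The same two lines (`pd.to_numeric(..., errors="coerce")` followed by `.notna().mean()`) should apply unchanged to `result_df["age"]`.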

#### 2. How do suicide methods in PFD reports compare to overall registrations?

This is a little more complicated than our first research question, because we'll be asking the model to assign each report to one of a broader range of categories.

In their annual review of all registered suicide deaths, the [ONS](https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/bulletins/suicidesintheunitedkingdom/2023#suicide-methods) report on the proportion of suicides by the following methods:

| Method                                           | % |
|--------------------------------------------------|-------|
| Drowning                                         | 3.7   |
| Fall and Fracture                                | 3.6   |
| Poisoning                                        | 19.8  |
| Hanging, strangulation and suffocation           | 58.8  |
| Jumping or lying in front of a moving object     | 3.7   |
| Sharp object                                     | 3.5   |
| Other                                            | 6.9   |


Note: the ONS figures above are for suicides registered in 2023. To obtain a large enough sample, we use PFD reports dated either 2023 or 2024.

In [5]:
from pfd_toolkit import Screener, load_reports

# Load all reports published in 2023 and 2024
reports_2023_24 = load_reports(start_date="2023-01-01", end_date="2024-12-31")

# Set up a screener to identify suicide cases from our 2023-24 reports
screener = Screener(llm=llm_client, reports=reports_2023_24,
                    user_query="Where the cause of death was established as a form of suicide",
                    match_leniency=None)

# Run screener
suicide_reports = screener.screen_reports()
print(f"Of the {len(reports_2023_24)} reports published in 2023 and 2024, {len(suicide_reports)} concern death by suicide.")

Sending requests to the LLM (in parallel): 100%|██████████| 1235/1235 [00:33<00:00, 36.82it/s]

Of the 1235 reports published in 2023 and 2024, 319 concern death by suicide.

In [6]:
class MethodBreakdown(BaseModel):
    drowning: bool = Field(
        ...,
        description="Suicide method involved drowning? True or false."
    )
    fall_and_fracture: bool = Field(
        ...,
        description="Suicide method involved fall or fracture? True or false."
    )
    poisoning: bool = Field(
        ...,
        description="Suicide method involved poisoning? True or false."
    )
    hanging: bool = Field(
        ...,
        description="Suicide method involved hanging, strangulation, or suffocation? True or false."
    )
    jumping: bool = Field(
        ...,
        description="Suicide method involved jumping or lying in front of a moving object? True or false."
    )
    sharp_object: bool = Field(
        ...,
        description="Suicide method involved sharp object? True or false."
    )
    other: bool = Field(
        ...,
        description="Suicide method involved other means not listed above? True or false."
    )

extractor = Extractor(
    llm=llm_client,
    feature_model=MethodBreakdown,
    reports=suicide_reports,
    force_assign=True,
    include_investigation=True,
    include_circumstances=True,
    include_concerns=False # We don't need this section
)

methods_df = extractor.extract_features()
methods_df

Extracting features: 100%|██████████| 319/319 [00:18<00:00, 17.35it/s]


Unnamed: 0,URL,ID,Date,CoronerName,Area,Receiver,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern,drowning,fall_and_fracture,poisoning,hanging,jumping,sharp_object,other
4,https://www.judiciary.uk/prevention-of-future-...,2024-0710,2024-12-24,Nathanael Hartley,Nottingham and Nottinghamshire,HM Assistant Coroner Hartley,On 29 th April 2024 an inquest was opened into...,Paul Taylor had been under police investigatio...,; When a suspect is arrested for offences requ...,False,False,True,False,False,False,False
11,https://www.judiciary.uk/prevention-of-future-...,2024-0702,2024-12-20,Caroline Topping,Surrey,"1. , HMP Coldingley 2. , Minister of State for...",An inquest into the death of Mr Haydar Jefferi...,Haydar Jefferies was sentenced to imprisonment...,During the course of the inquest the evidence ...,False,False,False,True,False,False,False
13,https://www.judiciary.uk/prevention-of-future-...,2024-0700,2024-12-20,Adrian Farrow,Manchester South,The SECRETARY OF STATE FOR HEALTH AND SOCIAL C...,On 28 th March 2024 an investigation was comme...,Mr Williamson began to suffer from lower urina...,During the course of the inquest the evidence ...,True,False,False,False,False,False,False
20,https://www.judiciary.uk/prevention-of-future-...,2025-0080,2024-12-17,Laurinda Bower,Nottingham City and Nottinghamshire,HMP Lowdham Grange,"On 7 March 2023, I commenced an investigation ...",Anthony Binfield died as a result of ligature ...,During the course of the investigation my inqu...,False,False,False,True,False,False,False
21,https://www.judiciary.uk/prevention-of-future-...,2024-0690,2024-12-16,Penelope Schofield,"West Sussex, Brighton and Hove", Secretary of State for Health  NHS England 1,On 23 rd November 2022 I commenced an investig...,Matty had struggled with their mental health t...,"During the investigation, my inquiries reveale...",False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1210,https://www.judiciary.uk/prevention-of-future-...,2023-0029,2023-01-26,under a duty to send the Chief Coroner a copy ...,Surrey,"• Chief Executive Officer, NHS England and NHS...",The inquest into the death of Zachary KLEMENT ...,Zachary was found suspended in the bedroom of ...,The MATTERS OF CONCERN are: - Zachary had a hi...,False,False,False,True,False,False,False
1212,https://www.judiciary.uk/prevention-of-future-...,2023-0027,2023-01-25,: Coroner ME Hassell Senior Coroner Inner Nort...,Inner North London,1. Chief Executive Officer East London NHS Fou...,"On 10 February 2022, one of my assistant coron...",Mr Largin asphyxiated himself in the early hou...,"During the course of the inquest, the evidence...",False,False,False,False,False,False,True
1225,https://www.judiciary.uk/prevention-of-future-...,2023-0016,2023-01-16,"Sean CUMMINGS, Assistant Coroner for the coron...",Bedfordshire and Luton Coroner Service 2,1 Bedfordshire Police Chief Constable 2 His Ma...,On 10 June 2021 I commenced an investigation i...,This report touches the death of police sergea...,During the course of the investigation my inqu...,False,False,False,False,False,False,True
1226,https://www.judiciary.uk/prevention-of-future-...,2023-0015,2023-01-12,"Robert Cohen, HM Assistant Coroner for Cumbria 2",Cumbria,(1) The Secretary of State for Culture Media a...,On 6 July 2022 I commenced an investigation in...,Gary Cooper was 41 years old. He lived in Kend...,During the course of the inquest the evidence ...,False,False,True,False,False,False,False


In [7]:
import pandas as pd

# List of cols to aggregate
method_cols = [
    "drowning",
    "fall_and_fracture",
    "poisoning",
    "hanging",
    "jumping",
    "sharp_object",
    "other",
]

percentages = methods_df[method_cols].mean().mul(100)
rounded = percentages.round(1)
formatted = rounded.map(lambda x: f"{x:.1f}")

# Build summary df
summary_df = pd.DataFrame({
    "Method": formatted.index,
    "%": formatted.values
})

print(summary_df)


              Method     %
0           drowning   2.8
1  fall_and_fracture   9.1
2          poisoning  17.9
3            hanging  51.1
4            jumping  12.2
5       sharp_object   3.1
6              other   8.8
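
The percentages above sum to 105.0 rather than 100. Since each method is an independent boolean field, this suggests that some reports were flagged for more than one method. A sketch of how that could be checked, using a toy frame (`toy_df`, hypothetical, with a shortened column list) in place of `methods_df`:

```python
import pandas as pd

method_cols = ["drowning", "poisoning", "hanging"]  # shortened for illustration

# Hypothetical stand-in for methods_df; values are illustrative only
toy_df = pd.DataFrame(
    {
        "drowning": [False, False, True, False],
        "poisoning": [True, False, False, False],
        "hanging": [True, True, False, False],
    }
)

# Number of methods flagged True in each report
flags_per_report = toy_df[method_cols].sum(axis=1)

# Distribution of 0, 1, 2... flags across reports
print(flags_per_report.value_counts().sort_index())

# Rows that may warrant a manual look
multi = toy_df[flags_per_report > 1]   # more than one method flagged
none = toy_df[flags_per_report == 0]   # no method flagged
```

Reports with several flags are counted once per method in the summary table, which is worth bearing in mind when comparing against figures that assign exactly one method per death.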


Let's compare this to ONS's table...

| Method                                           | % |
|--------------------------------------------------|-------|
| Drowning                                         | 3.7   |
| Fall and Fracture                                | 3.6   |
| Poisoning                                        | 19.8  |
| Hanging, strangulation and suffocation           | 58.8  |
| Jumping or lying in front of a moving object     | 3.7   |
| Sharp object                                     | 3.5   |
| Other                                            | 6.9   |

Conclusions:

Among PFD reports from 2023-24 that concern suicide, hanging, strangulation and suffocation is the most common method (51.1%), as it is in ONS registrations (58.8%). The proportions of deaths by 'fall and fracture' (9.1% vs 3.6%) and 'jumping or lying in front of a moving object' (12.2% vs 3.7%) are markedly higher among PFD reports than among suicide registrations at large.
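
To make the comparison explicit, the sketch below takes the percentages from the two tables above and computes, for each method, the percentage-point difference and the ratio of PFD to ONS. It is purely descriptive: it does not test significance, and the PFD percentages can overlap because a report may be flagged for more than one method.

```python
# Percentages copied from the PFD summary and the ONS table above
pfd = {
    "drowning": 2.8,
    "fall_and_fracture": 9.1,
    "poisoning": 17.9,
    "hanging": 51.1,
    "jumping": 12.2,
    "sharp_object": 3.1,
    "other": 8.8,
}
ons = {
    "drowning": 3.7,
    "fall_and_fracture": 3.6,
    "poisoning": 19.8,
    "hanging": 58.8,
    "jumping": 3.7,
    "sharp_object": 3.5,
    "other": 6.9,
}

# Percentage-point difference (PFD minus ONS) and PFD/ONS ratio per method
comparison = {
    method: {
        "diff_pp": round(pfd[method] - ons[method], 1),
        "ratio": round(pfd[method] / ons[method], 2),
    }
    for method in pfd
}

for method, stats in comparison.items():
    print(f"{method:<18} {stats['diff_pp']:+5.1f} pp   x{stats['ratio']:.2f}")
```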

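For reference, this is the prompt template used for the method extraction. Unlike the first prompt, it contains no 'N/A: Not found' instruction, presumably because `force_assign=True` was set:

In [8]: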
model = extractor._build_prompt_template()
model

'You are an expert at extracting structured information from UK Prevention of Future Death reports.\n\nExtract the following features from the report excerpt provided.\n\n\nAssign only one category to each report.\n\nReturn your answer strictly as a JSON object matching this schema:\n\n{schema}\n\nHere is the report excerpt:\n\n{report_excerpt}'
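
The template contains two placeholders, `{schema}` and `{report_excerpt}`. As a rough illustration of how such a template turns into a final prompt, the sketch below fills it with Python's `str.format`. Whether the toolkit substitutes values this way internally is an assumption, and the one-field schema and excerpt text are made up for the example.

```python
import json

# Template text copied from the output above
template = (
    "You are an expert at extracting structured information from UK "
    "Prevention of Future Death reports.\n\n"
    "Extract the following features from the report excerpt provided.\n\n\n"
    "Assign only one category to each report.\n\n"
    "Return your answer strictly as a JSON object matching this schema:\n\n"
    "{schema}\n\n"
    "Here is the report excerpt:\n\n"
    "{report_excerpt}"
)

# Hypothetical one-field schema and placeholder excerpt, for illustration only
schema = json.dumps({"hanging": {"title": "Hanging", "type": "boolean"}}, indent=2)
excerpt = "CircumstancesOfDeath: (report text goes here)"

# Braces inside the substituted values are safe; only the template is parsed
prompt = template.format(schema=schema, report_excerpt=excerpt)
print(prompt)
```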