# Feature extraction DEMO

*"Can we extract age & ethnicity from PFD reports?"*

**Context:** There are no age or ethnicity fields in PFD reports. However, anecdotally, many coroners seem to note age in the "Investigation & Inquest" or "Circumstances of Death" sections of the report.

This demo considers the extent to which age & ethnicity can be extracted from these reports using PFD Toolkit's `Extractor` class.

In [4]:
from pfd_toolkit import load_reports, LLM, Extractor
from dotenv import load_dotenv
import os
from pydantic import BaseModel

# Load reports sample
reports = load_reports(n_reports=150)

# Load OpenAI API key
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialise LLM client
llm_client = LLM(api_key=openai_api_key, max_workers=30)

# Define the Pydantic model describing the features to extract
class DemoFeatures(BaseModel):
    age: int
    ethnicity: str

# Initialise extraction instructions
feature_instructions = {
    "age": "Age of the deceased",
    "ethnicity": "Ethnicity of the deceased",
}

extractor = Extractor(
    llm=llm_client,
    feature_model=DemoFeatures,
    feature_instructions=feature_instructions,
    reports=reports,
    include_investigation=True,
    include_circumstances=True,
)

result_df = extractor.extract_features()
result_df


LLM: Extracting features: 100%|██████████| 150/150 [00:05<00:00, 27.32it/s]


| | URL | ID | Date | CoronerName | Area | Receiver | InvestigationAndInquest | CircumstancesOfDeath | MattersOfConcern | age | ethnicity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.judiciary.uk/prevention-of-future-... | 2025-0248 | 2025-05-28 | Clare Bailey | Teesside and Hartlepool | 1 Department of Health and Social Care 2 Chief... | Mr Dean Bradley died on 15 th October 2021 at ... | At approximately 0300 on 15 th October 2021 Mr... | During the course of the investigation my inqu... | N/A: Not found | N/A: Not found |
| 1 | https://www.judiciary.uk/prevention-of-future-... | 2025-0243 | 2025-05-27 | Andrew Cousins | Blackpool & Fylde | BARCHESTER HEALTHCARE LIMITED 1 | On 30 April 2025 and 23 May 2025, at an inques... | I returned the following in box 4 of the Recor... | During the course of the inquest the evidence ... | N/A: Not found | N/A: Not found |
| 2 | https://www.judiciary.uk/prevention-of-future-... | 2025-0244 | 2025-05-27 | Peter Merchant | West Yorkshire West | 1 , Chief Constable West Yorkshire Police 1 | On 15 February 2024 the death of Paul Andrew A... | As identified above, Paul Alexander had a long... | During the course of the investigation my inqu... | N/A: Not found | N/A: Not found |
| 3 | https://www.judiciary.uk/prevention-of-future-... | 2025-0245 | 2025-05-27 | Nadia Persaud | East London | , Chief Executive Officer, Barts Health NHS Fo... | On the 13 June 2024 I commenced an investigati... | Abdirahman Afrah began to suffer from chest pa... | During the course of the inquest the evidence ... | 17 | N/A: Not found |
| 4 | https://www.judiciary.uk/prevention-of-future-... | 2025-0246 | 2025-05-27 | Rebecca Sutton | Durham and Darlington | 1. Deputy Chief Constable , Durham Constabular... | On 7 January 2025 an investigation into the de... | The Deceased had a long history of mental heal... | During the course of the inquest the evidence ... | 24 | N/A: Not found |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 145 | https://www.judiciary.uk/prevention-of-future-... | 2025-0096 | 2025-02-19 | Susan Ridge | Surrey | Chief Executive Surrey and Sussex Healthcare N... | An inquest into Mrs Rodgers death was opened o... | A narrative conclusion was recorded at Box 4 o... | The MATTERS OF CONCERN are: The court heard th... | N/A: Not found | N/A: Not found |
| 146 | https://www.judiciary.uk/prevention-of-future-... | 2025-0092 | 2025-02-18 | Caroline Saunders | Gwent | Welsh Parliament | On 14/3/2024, an investigation was opened touc... | On 19/2/2024, Jeffrey Martin Tyler called the ... | The MATTERS OF CONCERN are as follows: - In ev... | N/A: Not found | N/A: Not found |
| 147 | https://www.judiciary.uk/prevention-of-future-... | 2025-0099 | 2025-02-18 | Sarah Bourke | Inner North London | 1. , Secretary of State for Justice, Ministry ... | On 4 October 2023, Assistant Coroner Smith com... | Mr Bainborough lived in supported living accom... | During the course of the inquest the evidence ... | 52 | N/A: Not found |
| 148 | https://www.judiciary.uk/prevention-of-future-... | 2025-0098 | 2025-02-18 | Sarah Bourke | Inner North London | 1. , Secretary of State for Justice, Ministry ... | On 19 October 2022, I commenced an investigati... | Mrs Mohamed developed mental health problems a... | During the course of the inquest the evidence ... | 53 | N/A: Not found |


In [2]:
result_df.to_csv('../data/age_extracted.csv')

In the preview above, an age is returned for several reports, while ethnicity comes back as "N/A: Not found" in every row shown. Counting the values across the full sample gives a clearer picture:

In [16]:
# Count reports where an age was extracted (True) versus not found (False)
print(result_df["age"].ne("N/A: Not found").value_counts())

# Count how often each distinct ethnicity value occurs
print("\n\n")
print(result_df["ethnicity"].value_counts())

age
True     110
False     40
Name: count, dtype: int64



ethnicity
N/A: Not found    149
Black               1
Name: count, dtype: int64
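
Because missing values are stored as the string "N/A: Not found", the `age` column cannot be summarised numerically as it stands. The following is a minimal sketch, assuming pandas and a small made-up stand-in for `result_df` (the `demo_df` values are illustrative, not taken from the sample), of coercing the column to numeric and computing the share of rows with an extracted age:

```python
import pandas as pd

# Hypothetical stand-in for result_df: ages mixed with the "not found" sentinel
NOT_FOUND = "N/A: Not found"
demo_df = pd.DataFrame({"age": [17, 24, NOT_FOUND, 52, NOT_FOUND]})

# Coerce the sentinel (and any other non-numeric value) to NaN
age_numeric = pd.to_numeric(demo_df["age"], errors="coerce")

# Share of rows for which an age was extracted
n_extracted = int(age_numeric.notna().sum())
extraction_rate = age_numeric.notna().mean()

print(f"Extracted: {n_extracted} of {len(demo_df)} ({extraction_rate:.0%})")
print(age_numeric.describe())
```

The same two lines should apply to `result_df["age"]` whether the extracted ages are held as integers or as strings, since `pd.to_numeric` parses both.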


**Conclusions:**

1. Age was extracted from 110 of the 150 sampled reports (about 73%). The remaining 40 returned "N/A: Not found", which suggests, but does not confirm, that these reports do not state the age of the deceased.

2. Ethnicity was extracted from just 1 of the 150 sampled reports. The remaining 149 returned "N/A: Not found", which suggests that coroners rarely record ethnicity. In the one report where ethnicity was recorded, the model extracted it correctly.
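
Both conclusions leave open whether "N/A: Not found" means the report omits the detail or the model missed it. For age, one way to probe this is a simple rule-based cross-check over the report text. The sketch below is an assumption-laden illustration, not part of PFD Toolkit: the regular expressions, the `find_age_mention` helper, and the sample sentences are all hypothetical, and the patterns cover only a few common phrasings.

```python
import re

# Hypothetical patterns for common age phrasings; deliberately not exhaustive
AGE_PATTERNS = [
    re.compile(r"\baged?\s+(\d{1,3})\b", re.IGNORECASE),               # "aged 52", "age 52"
    re.compile(r"\b(\d{1,3})[\s-]*years?[\s-]*old\b", re.IGNORECASE),  # "52 years old", "52-year-old"
]


def find_age_mention(text):
    """Return the first plausible age matched by AGE_PATTERNS, or None."""
    for pattern in AGE_PATTERNS:
        match = pattern.search(text)
        if match:
            age = int(match.group(1))
            if 0 <= age <= 120:  # discard implausible values
                return age
    return None


# Invented sentences standing in for report text
sample_texts = {
    "report_a": "Mr Smith, aged 52, was admitted to hospital.",
    "report_b": "The deceased was a 24-year-old man.",
    "report_c": "He died on 15 October 2021 at home.",
}

candidates = {rid: find_age_mention(text) for rid, text in sample_texts.items()}
print(candidates)
```

Applied to the 40 reports without an extracted age, any report where such a check finds a match would be a candidate for manual review. A regex baseline like this will miss ages expressed in other ways (for example, a date of birth), so an empty result is weak evidence at best.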