## Automating ONS Research on Child Suicides

In February 2025, the ONS published [research](https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/mentalhealth/bulletins/preventionoffuturedeathreportsforsuicideinchildreninenglandandwales/january2015tonovember2023) analysing:

> Prevention of Future Death reports for suicide in children in England and Wales: January 2015 to November 2023

This notebook assesses how well the PFD Toolkit can replicate and automate the ONS’s manual approach.

*Note: The dataset loaded via `load_reports` is an early development version based solely on HTML and PDF scraping, without the LLM fallback. I'm reluctant to do a full LLM scrape of all reports until I'm confident that its scraping logic is as good as it can be, due to associated costs. As a result, some reports may be missed from screening due to missing data.*

In [None]:
# Time the entire workflow

import time
start = time.time()

### Identifying the reports

#### Loading all reports

In [None]:
from pfd_toolkit import load_reports

reports = load_reports(category='all',
                    end_date="2023-01-01")

print(f"In total, there were {len(reports)} PFD reports published between July 2013 and November 2023")

#### Create 'Screener' specification to filter reports

**Note:** The ONS analysis defines a "child" as "aged 18 years and under," with included cases ranging from 12 to 18 years old. While the standard UK definition of a child is *under* 18, for consistency we have adopted the ONS inclusion criteria.

First, we need to set up the LLM and Screener modules...


In [None]:
from pfd_toolkit import LLM, Screener
from dotenv import load_dotenv
import os

# Load OpenAI API key from local environment
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialise LLM client
llm_client = LLM(api_key=openai_api_key, 
                 max_workers=60, model="gpt-4.1-mini",
                 seed=12345, temperature=0)

# Set up Screener
user_query = (
"Where the deceased is 18 or younger *AND* the death was due to suicide. Age may be explicitly recorded, or inferred from agencies such as CAMHS, etc." 
)

child_suicide_screener = Screener(llm=llm_client,
                                  reports=reports,
                                  user_query=user_query,
                                  match_leniency=None,
                                  )

This will generate a prompt to our LLM. We can see what this prompt looks like:

In [None]:
print(child_suicide_screener._build_prompt_template(user_query))

Now we can run the Screener and assign the results to `child_suicide_reports`.

In [None]:
child_suicide_reports = child_suicide_screener.screen_reports()

print(
    f"""\nFrom the initial {len(reports)} reports, PFD Toolkit identified {len(child_suicide_reports)} reports on child suicide"""
)

For context, the ONS identified 37 reports relevant reports.

The difference between the number of ONS and Toolkit-identified reports is likely due to one of the following:

 * The ONS filtered reports on judiciary.uk by suicide & child deaths - which are both drop-down categories on the website. We already know that many reports are miscategorised or not categorised at all. By contrast, toolkit didn't use these categories and screened all reports within the same date range.
 * Human error; I'm sure manually screening these reports would have been mind-numbingly boring.

---

In [None]:
# Save & reload reports to keep progress...
child_suicide_reports.to_csv('../data/child_suicide.csv')

In [3]:
import pandas as pd
child_suicide_reports = pd.read_csv('../data/child_suicide.csv')
len(child_suicide_reports)

80

---

### Categorise addressees

We now need to reproduce the 'report by addressees' table produced by ONS.

For this, we'll create a new screener object and run it multiple times. We'll also turn on 'annotate mode,' which doesn't filter reports but insted adds a classification column, which we can name and use to create a tabulation.

In [None]:
from pfd_toolkit import Screener, LLM, Extractor
from pydantic import BaseModel, Field

from dotenv import load_dotenv
import os

# Load OpenAI API key from local environment
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialise LLM client
llm_client = LLM(api_key=openai_api_key, 
                 max_workers=60, model="gpt-4.1-mini",
                 seed=12345, temperature=0)

# Up the model to GPT 4.1 for better performance
llm_client.model = "gpt-4.1"

# Set up a feature model for addressees
class DemoFeatures(BaseModel):
    sent_gov: bool = Field(..., description="Recipient(s) include a government department or minister, but not NHS")
    sent_nhs: bool = Field(..., description="Recipient(s) include NHS Trust, CCG or ICS")
    sent_prof_body: bool = Field(..., description="Recipient(s) include an organisation with statutory responsibility for a profession (GMC, NMC, Royal Colleges, etc.)")
    sent_council: bool = Field(..., description="Recipient(s) include a local council")
    sent_other: bool = Field(..., description="Recipient(s) include some other recipient group not listed")

addressee_extractor = Extractor(reports=child_suicide_reports,
                                llm=llm_client,
                                
                                # Turn 'on' receiver field; turn defaults 'off'
                                include_receiver=True,
                                include_circumstances=False,
                                include_investigation=False,
                                include_concerns=False)


child_suicide_reports = addressee_extractor.extract_features(feature_model=DemoFeatures,
                                                             allow_multiple=True,
                                                             force_assign=True,
                                                             produce_spans=True)

Extracting features: 100%|██████████| 80/80 [00:04<00:00, 16.05it/s]


#### Create summary table

In [5]:
import pandas as pd

categories_config = [
    {"name": "Government department or minister", "col": "sent_gov"},
    {"name": "NHS Trust or CCG", "col": "sent_nhs"},
    {"name": "Professional body", "col": "sent_prof_body"},
    {"name": "Local council", "col": "sent_council"},
    {"name": "Other", "col": "sent_other"},
]

total_reports = len(child_suicide_reports)

summary_data = [
    {
        "Addressee": cat["name"],
        "No of reports": int(child_suicide_reports[cat["col"]].sum()),
        "%": int(
            round((child_suicide_reports[cat["col"]].sum() / total_reports) * 100)
        ),
    }
    for cat in categories_config
]

summary_table_df = pd.DataFrame(summary_data)

print("Summary Table of Report Addressees:")
print(summary_table_df.to_string(index=False))

Summary Table of Report Addressees:
                        Addressee  No of reports  %
Government department or minister             41 51
                 NHS Trust or CCG             40 50
                Professional body              7  9
                    Local council             15 19
                            Other             28 35


We can now compare this with ONS's own table...

| Addressee                         | No of reports | %  |
|----------------------------------|---------------|----|
| Government department or minister| 15            | 41 |
| NHS Trust or CCG                 | 15            | 41 |
| Professional body                | 12            | 32 |
| Local council                    | 8             | 22 |
| Other                            | 10            | 27 |


The big point of difference here is assignment to 'professional body'. In ONS's research, this reflected 32% of reports, but ours is only 10%. Despite us identifying a significantly higher number of reports, our absolute value of reports for this addressee category is lower than that of ONS (6 vs. 12, respectively.)

In ONS's report and accompanying metadata spreadsheet, I was unable to find a definition of 'professional body'. So I used what I think is a reasonable definition:

> "An organisation with statutory responsibility for a profession (e.g. GMC, Nursing and Midwifery Council, Royal Colleges, etc.)"

It's therefore possible that this discrepency is caused by a definitional mismatch.

### Categorise 'themes' from coroner concerns

ONS coded the **coroner's concerns** sections into 6 primary themes: service provision, staffing & resourcing, communication, multiple services involved in care, accessing services, access to harmful content & environment. 

Each of these themes contains a number of sub-themes. 

Because there's quite a lot of themes to assign, we're going to split this up into separate Extractor calls. This is to keep the prompt more manageable for the LLM. This is unlikely to be much of an issue for non-mini OpenAI models though, which are a touch more expensive.

In [6]:
# Just like we did with addressees, create a feature model with all fields and descriptions.

class ThemeFeatures(BaseModel):
    sp_sop_inadequate: bool = Field(
        ..., 
        description="Standard operating procedures (e.g. note taking, monitoring, observations) don't exist, are unclear, or not followed correctly"
    )
    sp_specialist_services: bool = Field(
        ..., 
        description="Specialist services unavailable or insufficient (e.g. issues with crisis teams, urgent inpatient beds, special educational needs, autism support, deprioritised services, etc.)"
    )
    sp_risk_assessment: bool = Field(
        ..., 
        description="Risk assessment documents not completed, assessed inadequately, not updated, not communicated, etc."
    )
    sp_discharge: bool = Field(
        ..., 
        description="Discharge without review or liaison, self-discharge when detention may be required, poor communication of care requirements to community teams, uncoordinated post-discharge care, inadequate care packages, etc."
    )
    sp_diagnostics: bool = Field(
        ..., 
        description="Delayed diagnosis, misdiagnosis, lack of caregiver support for a specific diagnosis, lack of specialist diagnostic training, etc."
    )

    sr_training: bool = Field(
        ..., 
        description="Inadequate staff knowledge of suicide-prevention processes, missing grab bags or anti-ligature tools, staff not following procedures, training gaps, etc."
    )
    sr_inadequate_staffing: bool = Field(
        ..., 
        description="Staff not appropriately qualified, inexperienced case worker assigned, or other inadequate staffing levels"
    )
    sr_funding: bool = Field(
        ..., 
        description="Lack of funding to CAMHS services etc., preventing recruitment or provision of specialist services"
    )
    sr_recruitment_retention: bool = Field(
        ..., 
        description="Unable to recruit specialist staff or retain an adequate number of staff"
    )

    comm_between_services: bool = Field(
        ..., 
        description="Lack of communication between CAMHS and foster or care services or schools, or information sharing between services not possible or not conducted, etc."
    )
    comm_patient_family: bool = Field(
        ..., 
        description="Lack of communication from CAMHS with child and/or parent, including insufficient family involvement, support or signposting, etc."
    )
    comm_confidentiality_risk: bool = Field(
        ..., 
        description="Instances where professionals did not communicate with parents or caregivers, resulting in missed opportunities to intervene"
    )
    comm_within_services: bool = Field(
        ..., 
        description="Inadequate communication of policies to staff, inadequate note keeping or record sharing, unclear responsibility for care coordination within a service"
    )

    msic_integration_care: bool = Field(
        ..., 
        description="Care coordinator not assigned or unclear responsibility for coordinating care needs across multiple services"
    )
    msic_local_authority: bool = Field(
        ..., 
        description="Lack of social services involvement, no social worker, inadequate safeguarding checks, lack of specialist support in schools, missing school safety plan, etc."
    )
    msic_transition_camhs: bool = Field(
        ..., 
        description="Lack of support transitioning from CAMHS to adult services or unclear guidance for 16–18-year-olds"
    )

    as_delays_waiting: bool = Field(
        ..., 
        description="Delay in GP or CAMHS referrals, CAMHS picking up referral, offering appointments, excessive waiting times leading to inappropriate referral, or COVID-19 related delays"
    )
    as_referral_rejected: bool = Field(
        ..., 
        description="Referral rejected due to waiting times, lack of staff, inadequate risk assessment, or complex needs not met by CAMHS"
    )
    as_patient_engagement: bool = Field(
        ..., 
        description="Inadequate contact with child or parent regarding referral, or patient refusal to engage followed by insufficient follow-up"
    )

    ahce_internet: bool = Field(
        ..., 
        description="Lack of internet safeguarding in school or failure of websites or social media to block harmful content"
    )
    ahce_safeguarding_sensitive: bool = Field(
        ..., 
        description="Sensitive questions or material presented to a child without adequate follow-up, adult support, warnings, or consideration of safety"
    )
    ahce_harmful_items: bool = Field(
        ..., 
        description="Access to items that can be used to harm or ligature, or access to alcohol, drugs, or substances where safety concerns are known"
    )
    ahce_trainline: bool = Field(
        ..., 
        description="Ability to access railway environments where access should be prevented, such as inadequate fencing"
    )


# Instantiate Extractor once, covering all of the above features in one go.
theme_extractor = Extractor(
    reports=child_suicide_reports,
    llm=llm_client,

    # We only care about the Concerns section here
    include_receiver=False,
    include_circumstances=False,
    include_investigation=False,
    include_concerns=True
)


child_suicide_reports = theme_extractor.extract_features(feature_model=ThemeFeatures,
                                                         allow_multiple=True,
                                                         force_assign=True,
                                                         produce_spans=True)


Extracting features: 100%|██████████| 80/80 [00:18<00:00,  4.25it/s]


#### Create theme tables

In [7]:
# 1. Service provision
print("\nPrimary theme: Service provision")
service_provision_config = [
    {
        "name": "Standard operating procedures/ processes not followed or adequate",
        "col": "sp_sop_inadequate",
    },
    {
        "name": "Specialist services (crisis, autism, beds)",
        "col": "sp_specialist_services",
    },
    {"name": "Risk assessment", "col": "sp_risk_assessment"},
    {"name": "Discharge from services", "col": "sp_discharge"},
    {"name": "Diagnostics", "col": "sp_diagnostics"},
]
service_provision_data = [
    {
        "Sub-theme": theme["name"],
        "Number of reports": int(child_suicide_reports[theme["col"]].sum()),
    }
    for theme in service_provision_config
]
service_provision_df = pd.DataFrame(service_provision_data)
print(service_provision_df.to_string(index=False))

# 2. Staffing and resourcing 
print("\n---\n\nPrimary theme: Staffing and resourcing\n")
staffing_resourcing_config = [
    {"name": "Training", "col": "sr_training"},
    {"name": "Inadequate staffing", "col": "sr_inadequate_staffing"},
    {"name": "Funding", "col": "sr_funding"},
    {"name": "Recruitment and retention", "col": "sr_recruitment_retention"},
]
staffing_resourcing_data = [
    {
        "Sub-theme": theme["name"],
        "Number of reports": int(child_suicide_reports[theme["col"]].sum()),
    }
    for theme in staffing_resourcing_config
]
staffing_resourcing_df = pd.DataFrame(staffing_resourcing_data)
print(staffing_resourcing_df.to_string(index=False))

# 3. Communication
print("\n---\n\nPrimary theme: Communication\n")
communication_config = [
    {"name": "Between services", "col": "comm_between_services"},
    {"name": "With patient and family", "col": "comm_patient_family"},
    {
        "name": "Confidentiality risk not communicated",
        "col": "comm_confidentiality_risk",
    },
    {"name": "Within services", "col": "comm_within_services"},
]
communication_data = [
    {
        "Sub-theme": theme["name"],
        "Number of reports": int(child_suicide_reports[theme["col"]].sum()),
    }
    for theme in communication_config
]
communication_df = pd.DataFrame(communication_data)
print(communication_df.to_string(index=False))

# 4. Multiple services involved in care
print("\n---\n\nPrimary theme: Multiple services involved in care\n")
multi_services_config = [
    {"name": "Integration of care", "col": "msic_integration_care"},
    {
        "name": "Local Authority (incl child services, schools)",
        "col": "msic_local_authority",
    },
    {"name": "Transition from CAMHS", "col": "msic_transition_camhs"},
]
multi_services_data = [
    {
        "Sub-theme": theme["name"],
        "Number of reports": int(child_suicide_reports[theme["col"]].sum()),
    }
    for theme in multi_services_config
]
multi_services_df = pd.DataFrame(multi_services_data)
print(multi_services_df.to_string(index=False))

# 5. Accessing services 
print("\n---\n\nPrimary theme: Accessing services\n")
accessing_services_config = [
    {"name": "Delays in referrals and waiting times", "col": "as_delays_waiting"},
    {"name": "Referral rejected", "col": "as_referral_rejected"},
    {"name": "Patient engagement", "col": "as_patient_engagement"},
]
accessing_services_data = [
    {
        "Sub-theme": theme["name"],
        "Number of reports": int(child_suicide_reports[theme["col"]].sum()),
    }
    for theme in accessing_services_config
]
accessing_services_df = pd.DataFrame(accessing_services_data)
print(accessing_services_df.to_string(index=False))

# 6. Access to harmful content and environment
print("\n---\n\nPrimary theme: Access to harmful content and environment\n")
harmful_content_config = [
    {"name": "Internet", "col": "ahce_internet"},
    {
        "name": "Safeguarding from sensitive material",
        "col": "ahce_safeguarding_sensitive",
    },
    {"name": "Harmful items/ substances", "col": "ahce_harmful_items"},
    {"name": "Trainline", "col": "ahce_trainline"},
]
harmful_content_data = [
    {
        "Sub-theme": theme["name"],
        "Number of reports": int(child_suicide_reports[theme["col"]].sum()),
    }
    for theme in harmful_content_config
]
harmful_content_df = pd.DataFrame(harmful_content_data)
print(harmful_content_df.to_string(index=False))


Primary theme: Service provision
                                                        Sub-theme  Number of reports
Standard operating procedures/ processes not followed or adequate                 49
                       Specialist services (crisis, autism, beds)                 22
                                                  Risk assessment                 31
                                          Discharge from services                  8
                                                      Diagnostics                 11

---

Primary theme: Staffing and resourcing

                Sub-theme  Number of reports
                 Training                 31
      Inadequate staffing                 16
                  Funding                 14
Recruitment and retention                  4

---

Primary theme: Communication

                            Sub-theme  Number of reports
                     Between services                 33
              With patient and fami

In [8]:
child_suicide_reports.to_csv('../data/child_suicide_tagged.csv')

### Check workflow runtime

In [9]:
end = time.time()

elapsed_seconds = int(end - start)

minutes, seconds = divmod(elapsed_seconds, 60)
print(f"Elapsed time: {minutes}m {seconds}s")

NameError: name 'time' is not defined