## Automating ONS Research on Child Suicides

In February 2025, the ONS published [research](https://www.ons.gov.uk/peoplepopulationandcommunity/healthandsocialcare/mentalhealth/bulletins/preventionoffuturedeathreportsforsuicideinchildreninenglandandwales/january2015tonovember2023) analysing:

> Prevention of Future Death reports for suicide in children in England and Wales: January 2015 to November 2023

Their study involved manually identifying and thematically coding PFD reports relating to child suicides, a process known to be time-consuming and difficult to scale.

This notebook evaluates the performance of PFD Toolkit in replicating and automating the ONS approach. By comparing automated outputs to the original manual research, we assess the toolkit’s accuracy, efficiency, and potential to accelerate large-scale, reproducible analysis of PFD reports.

In [1]:
# Keep track of notebook run time
import time
start_time = time.time()

### Identifying the reports

#### Loading all reports

In [2]:
from pfd_toolkit import load_reports

reports = load_reports(category="all",
                       start_date="2015-01-01",
                       end_date="2023-11-01")

print(f"In total, there were {len(reports)} PFD reports published between January 2015 and November 2023")

In total, there were 3941 PFD reports published between January 2015 and November 2023


#### Create 'Screener' specification to filter reports

**Note:** The ONS analysis defines a "child" as "aged 18 years and under," with included cases ranging from 12 to 18 years old. While the standard UK definition of a child is "under 18," for consistency we have adopted the ONS inclusion criteria.
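
The two definitions differ only at age 18. A minimal sketch of that boundary, using hypothetical helper names that are not part of PFD Toolkit (the Screener itself works on report text, not on a numeric age field):

```python
def is_child_ons(age: int) -> bool:
    """ONS inclusion rule adopted in this notebook: aged 18 years and under."""
    return age <= 18


def is_child_under_18(age: int) -> bool:
    """The 'under 18' definition, shown for comparison only."""
    return age < 18


# Only age 18 is treated differently by the two rules.
for age in (17, 18, 19):
    print(age, is_child_ons(age), is_child_under_18(age))
```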

First, we need to set up the LLM and Screener modules:


In [None]:
from pfd_toolkit import LLM, Screener
from dotenv import load_dotenv
import os

# Load OpenAI API key from local environment
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialise LLM client
llm_client = LLM(api_key=openai_api_key, 
                 max_workers=70)

# Set up Screener

user_query = """
Suicide deaths among children (i.e. 18 and younger **only**)

Referring to the deceased as a "young person" alone should 
**not** qualify. Referring to school alone is not enough,
unless qualified (e.g. in Year 8). Referring to a parent 
alone should *not* qualify. Referring to past age but their
current age being unclear should *not* qualify.

"""

child_suicide_screener = Screener(llm=llm_client,
                                  reports=reports,
                                  user_query=user_query,
                                  match_leniency='strict',
                                  include_date=True,
                                  include_concerns=False
                                  )

This will generate a prompt to our LLM. We can see what this prompt looks like:

In [4]:
child_suicide_screener._build_prompt_template(user_query)

'\n        \nYou are an expert text classification assistant. Your task is to read\nthe following excerpt from a Prevention of Future Death (PFD) report and\ndecide whether it matches the user\'s query.\n\nA query may refer to a theme, a specific detail, or require two or more \nelements to be present (e.g. "hospital-acquired infection deaths" requires \nevidence of **both** an infection and that it was acquired in hospital).\n\n**Instructions:**\n- Only respond \'Yes\' if **all** elements of the user query are clearly present in the report.\n- If any required element is missing or there is not enough information, respond \'No\'.\n- Your response must be a JSON object in which "matches_topic" can be either "Yes" or "No".\n\n**User query:** \'\nSuicide deaths among children (i.e. 18 and younger **only**)\n\nReferring to the deceased as a "young person" alone should \n**not** quality. Referring to school alone is not enough,\nunless qualified (e.g. in Year 8). Referring to a parent \nalo
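
The printed prompt asks the model for a JSON object with a `"matches_topic"` field set to "Yes" or "No". As a rough illustration of that pattern, and not the toolkit's actual implementation, the sketch below embeds a query in a shortened, hypothetical template and parses a reply conservatively. `PROMPT_TEMPLATE`, `build_prompt`, and `parse_match` are illustrative names assumed for this example.

```python
import json

# Hypothetical template, loosely modelled on the prompt printed above;
# the toolkit's real template is longer and may differ.
PROMPT_TEMPLATE = (
    "Decide whether the report excerpt matches the user's query.\n"
    'Respond with a JSON object in which "matches_topic" is "Yes" or "No".\n\n'
    "**User query:** '{query}'\n\n"
    "**Report excerpt:**\n{report_text}"
)


def build_prompt(query: str, report_text: str) -> str:
    """Embed the query and one report's text in the template."""
    return PROMPT_TEMPLATE.format(query=query.strip(),
                                  report_text=report_text.strip())


def parse_match(raw_response: str) -> bool:
    """Return True only for a well-formed JSON reply whose answer is 'Yes'."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return False  # treat malformed output as a non-match
    if not isinstance(payload, dict):
        return False
    return str(payload.get("matches_topic", "")).strip().lower() == "yes"


prompt = build_prompt("Suicide deaths among children", "Example report text.")
print(prompt)
print(parse_match('{"matches_topic": "Yes"}'))  # True
print(parse_match('{"matches_topic": "No"}'))   # False
print(parse_match("not json"))                  # False
```

Treating anything other than a clean "Yes" as a non-match mirrors the strict instruction in the prompt, where missing or insufficient information should yield "No".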

Now we can run the Screener and assign the results to `child_suicide_reports`.

In [5]:
%%time

child_suicide_reports = child_suicide_screener.screen_reports(user_query=user_query)

print(
    f"""From the initial {len(reports)} reports, PFD Toolkit 
      identified {len(child_suicide_reports)} reports on child suicide"""
)

From the initial 3941 reports, PFD Toolkit 
      identified 70 reports on child suicide
CPU times: user 51.1 s, sys: 3.28 s, total: 54.3 s
Wall time: 1min 3s


The toolkit identified 70 reports, almost twice as many as the 37 identified by the ONS.
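
To put these figures in context, the short calculation below uses only numbers reported in this notebook (3941 reports screened, 70 flagged, 37 in the ONS study, and a wall time of 1 min 3 s). The variable names are illustrative.

```python
# Figures copied from the outputs in this notebook.
total_reports = 3941      # PFD reports, January 2015 to November 2023
toolkit_matches = 70      # reports flagged by the Screener
ons_matches = 37          # reports identified manually by the ONS
wall_time_seconds = 63    # "Wall time: 1min 3s"

ratio_to_ons = toolkit_matches / ons_matches
share_of_corpus = toolkit_matches / total_reports
reports_per_second = total_reports / wall_time_seconds

print(f"Toolkit matches relative to ONS: {ratio_to_ons:.2f}x")               # 1.89x
print(f"Share of all reports flagged: {share_of_corpus:.1%}")                # 1.8%
print(f"Screening throughput: {reports_per_second:.1f} reports per second")  # 62.6
```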

In [7]:
child_suicide_reports.to_csv('../data/child_suicide.csv')