## John's Task: Categorisation System for PFD Reports

### Purpose

The goal of this task is to create a `Categoriser` module for PFD Toolkit: a robust way for users to categorise Prevention of Future Death (PFD) reports using our `LLM` class. The system should support both broad themes and granular sub-categories that help organise the content for research, policy, and analysis.

This is probably the most useful and exciting feature of the Toolkit because it will mean researchers no longer have to manually sift through reports to check whether they're relevant to their research goals, which can take months or even years.


### Format

I would like the user to be able to provide a custom schema in the following format:

> ```Theme, Sub-theme, Description/keywords```

This schema will guide the LLM in assigning categories.

### Mock research question

To give this task some direction, here's a mock research question you can work from. This is optional - feel free to choose a different framing if it’s more useful for testing.

Your research question is: 
> **"*Where* are suicide deaths preventable?"**

The aim of the research question is to understand the specific venues or contexts behind a preventable death by suicide. This aim is important, because it allows policymakers to identify key priorities for where we could change public policy or NHS practice, ultimately saving lives. 


### Mock schema 
Below is a mock-up of the schema that you may wish to run. You don’t need to follow this format exactly (e.g. you’ll likely want to use Pydantic), but here’s a structured example of what I’m hoping the schema will support:

```
{
  "suicide": {
    "drug_related": [
      "overdose",
      "toxicity",
      "substance_misuse",
      "medication_error",
      "intoxication",
      "polypharmacy"
    ],
    "institutional_settings": [
      "prison",
      "police_custody",
      "mental_health_inpatient",
      "secure_hospital",
      "approved_premises",
      "detention_centre"
    ],
    "community_care_failures": [
      "missed_referral",
      "inadequate_risk_assessment",
      "lack_of_follow_up",
      "gp_did_not_refer",
      "mental_health_team_failure",
      "discharged_without_support"
    ],
    "communication_errors": [
      "records_not_shared",
      "poor_handover",
      "discharge_summary_not_sent",
      "information_not_passed_on",
      "care_coordination_failure"
    ],
    "access_to_means": [
      "ligature_point",
      "firearm",
      "railway_access",
      "jumped_from_height",
      "chemical_ingestion",
      "access_to_high_risk_medication"
    ],
    "recent_contact_with_services": [
      "seen_by_gp",
      "discharged_from_hospital",
      "recent_ae_attendance",
      "under_mental_health_services",
      "in_care_of_crisis_team"
    ],
    "domestic_or_social_factors": [
      "domestic_abuse",
      "relationship_breakdown",
      "social_isolation",
      "recent_bereavement",
      "housing_insecurity"
    ],
    "youth_or_transitional_risk": [
      "under_25",
      "child_and_adolescent_services",
      "transition_to_adult_services",
      "excluded_from_school",
      "care_leaver"
    ]
  }
}
```

Note that the schema above has only one top-level theme ("suicide"). Everything nested inside it is either a sub-theme (the keys) or a description/keyword (the list items).
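
As a rough illustration of how a schema of this shape could map onto output columns, here is a minimal sketch. The function name `schema_to_columns` and the `sep` parameter are hypothetical; the `theme-sub_theme-description` naming simply follows the column names in the mock output later in this notebook, and `third_tier_as_theme` mirrors the optional parameter in the API mock-up.

```python
def schema_to_columns(schema: dict, sep: str = "-", third_tier_as_theme: bool = True) -> list[str]:
    """Flatten a {theme: {sub_theme: [descriptions]}} schema into column names."""
    columns = []
    for theme, sub_themes in schema.items():
        columns.append(theme)
        for sub_theme, descriptions in sub_themes.items():
            columns.append(f"{theme}{sep}{sub_theme}")
            if third_tier_as_theme:
                # Only give the third tier its own columns when it is treated as a sub-sub-theme
                columns.extend(f"{theme}{sep}{sub_theme}{sep}{d}" for d in descriptions)
    return columns


# A cut-down version of the mock schema, just for illustration
example = {
    "suicide": {
        "drug_related": ["overdose", "toxicity"],
        "institutional_settings": ["prison"],
    }
}
print(schema_to_columns(example))
print(schema_to_columns(example, third_tier_as_theme=False))
```

With `third_tier_as_theme=False`, the descriptions produce no columns of their own and would only be used to help the model decide on the sub-theme.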

### Further instructions

The default LLM model is `gpt-4.1-mini`, which works fantastically well. However, `gpt-4.1-nano` is much cheaper (but often fails to follow instructions). To save on API costs, please use the nano model for development, and the mini model for making sure it actually works.

Please implement the following:

- The categorisation module should return the original reports DataFrame, with **additional columns** for each theme and sub-theme. Values in these columns should be `"Yes"` or `"No"` depending on whether the report meets the criteria.
  
- Add a **boolean parameter** that controls whether a report can be assigned to **multiple themes** or must be assigned to only the single most appropriate one.
  
- Add the **same parameter for sub-themes**, allowing control over single vs. multi-label classification at both levels.

- Add Boolean parameters that determine whether certain sections should be fed to the model. For example, `include_circumstances=False` would mean that this section is never fed to the LLM, saving input tokens. The URL, ID, Date, Coroner name, and Receiver should be `False` by default, with the other sections being `True` by default.

- Please place all **LLM-specific logic** in the `llm.py` module. Everything else should go in a new `categoriser.py` module. In `llm.py`, you’ll notice how the base prompts and config are structured for the `Cleaner` class — I’d love if you could mirror that for this.

- Please ensure that changes made to `llm.py` are non-destructive (i.e. make sure you only 'add' features, not take any functionality away). If you're unsure, please do test on the two `DEMO` notebooks.
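
To make the section toggles and the `"Yes"`/`"No"` output more concrete, here is a minimal sketch that works on plain dictionaries (one dict per report). Only `include_circumstances` and `include_date` are named in this brief; the other flag names, `SECTION_FLAGS`, `build_report_text` and `to_yes_no` are assumptions made for illustration. The single-label branch also assumes the model returns labels ordered from most to least appropriate.

```python
# Hypothetical flag name -> (DataFrame column, default), following the defaults requested above
SECTION_FLAGS = {
    "include_url": ("URL", False),
    "include_id": ("ID", False),
    "include_date": ("Date", False),
    "include_coroner": ("CoronerName", False),
    "include_receiver": ("Receiver", False),
    "include_area": ("Area", True),
    "include_investigation": ("InvestigationAndInquest", True),
    "include_circumstances": ("CircumstancesOfDeath", True),
    "include_concerns": ("MattersOfConcern", True),
}


def build_report_text(report: dict, **flags: bool) -> str:
    """Join only the enabled, non-empty sections of one report into the text sent to the LLM."""
    unknown = set(flags) - set(SECTION_FLAGS)
    if unknown:
        raise TypeError(f"unknown flag(s): {sorted(unknown)}")
    parts = []
    for flag, (column, default) in SECTION_FLAGS.items():
        if flags.get(flag, default) and report.get(column):
            parts.append(f"{column}: {report[column]}")
    return "\n\n".join(parts)


def to_yes_no(assigned: list[str], columns: list[str], multi_assignment: bool = True) -> dict[str, str]:
    """Turn the labels chosen for one report into a {column: "Yes"/"No"} row."""
    if not multi_assignment:
        assigned = assigned[:1]  # assumption: labels arrive ordered, best match first
    chosen = set(assigned)
    return {col: "Yes" if col in chosen else "No" for col in columns}


# Placeholder report, not real data
report = {"URL": "u", "Area": "a", "CircumstancesOfDeath": "c", "MattersOfConcern": "m"}
print(build_report_text(report, include_circumstances=False))
print(to_yes_no(["x", "y"], ["x", "y", "z"], multi_assignment=False))
```

Rows produced by `to_yes_no` could then be collected into a DataFrame and joined onto the original reports.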

---

### Optional Features, in order of preference (if you have time or feel inclined)

- The `LLM` class has a `max_workers` parameter for parallel processing. Your work should ideally use this to speed things up.

- Add a Boolean parameter that controls whether the **third tier (description/keywords)** should be treated as **sub-sub-themes** (i.e. distinct columns), or just descriptive phrases to support assignment to the sub-theme level.

- If implemented, please apply the same `"Yes"/"No"` structure for sub-sub-themes. Please also extend your multi- or single-theme assignment parameter to the sub-sub-theme.

- Add a Boolean parameter to **automatically include an "Other / None of the above"** category at the theme, sub-theme, or sub-sub-theme level as appropriate.
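
For the parallel-processing and "Other" options, one possible shape is sketched below using the standard library's `ThreadPoolExecutor`. How the `LLM` class uses `max_workers` internally is not shown in this brief, so `classify` here is a stand-in for whatever call ends up wrapping the model; `categorise_in_parallel`, `with_other` and `fake_classify` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def categorise_in_parallel(
    texts: list[str],
    classify: Callable[[str], list[str]],
    max_workers: int = 8,
) -> list[list[str]]:
    """Run `classify` over every report text concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify, texts))


def with_other(labels: list[str], other_label: str = "other") -> list[str]:
    """Fall back to an 'Other / None of the above' label when nothing was assigned."""
    return labels or [other_label]


def fake_classify(text: str) -> list[str]:
    # Stand-in for the LLM call: a keyword check so the sketch runs without an API key
    return ["institutional_settings"] if "prison" in text.lower() else []


results = categorise_in_parallel(["Died in prison", "Died at home"], fake_classify, max_workers=2)
print([with_other(labels) for labels in results])
```

Threads suit this workload because each call spends most of its time waiting on the network rather than using the CPU.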

---

### What you **don't** need to do

- You don't need to do any formal evaluation - that's for me to do! It would be good if you could eyeball the outputs though to make sure everything looks right. I can then go from there.


### Getting the data

There are two files you'll need for the mock research question.

1. `../data/suicide_reports.csv` -- 293 scraped, suicide-specific reports spanning from 1st May 2023 to 1st May 2025.
2. `../data/suicide_reports_sample.csv` -- a sample of 50 reports you can use for development to save on token costs.

You do not need to run the blocks below; they are just here to show where the datasets came from.

In [1]:

```python
import pandas as pd

sample_reports = pd.read_csv('../data/suicide_reports_sample.csv')
sample_reports.columns
```

```
Index(['URL', 'ID', 'Date', 'CoronerName', 'Area', 'Receiver',
       'InvestigationAndInquest', 'CircumstancesOfDeath', 'MattersOfConcern'],
      dtype='object')
```

### How I'd like the API to look 

This isn't a requirement, just a suggestion! 

The code below will NOT run because the module hasn't been created yet. It's just a mock-up of what I'm imagining. Notice how it closely resembles the `PFDScraper` and `Cleaner` classes for a nice, harmonised user interface.

In [2]:

```python
json_schema = {
  "suicide": {
    "drug_related": [
      "overdose",
      "toxicity",
      "substance_misuse",
      "medication_error",
      "intoxication",
      "polypharmacy"
    ],
    "institutional_settings": [
      "prison",
      "police_custody",
      "mental_health_inpatient",
      "secure_hospital",
      "approved_premises",
      "detention_centre"
    ],
    "community_care_failures": [
      "missed_referral",
      "inadequate_risk_assessment",
      "lack_of_follow_up",
      "gp_did_not_refer",
      "mental_health_team_failure",
      "discharged_without_support"
    ],
    "communication_errors": [
      "records_not_shared",
      "poor_handover",
      "discharge_summary_not_sent",
      "information_not_passed_on",
      "care_coordination_failure"
    ],
    "access_to_means": [
      "ligature_point",
      "firearm",
      "railway_access",
      "jumped_from_height",
      "chemical_ingestion",
      "access_to_high_risk_medication"
    ],
    "recent_contact_with_services": [
      "seen_by_gp",
      "discharged_from_hospital",
      "recent_ae_attendance",
      "under_mental_health_services",
      "in_care_of_crisis_team"
    ],
    "domestic_or_social_factors": [
      "domestic_abuse",
      "relationship_breakdown",
      "social_isolation",
      "recent_bereavement",
      "housing_insecurity"
    ],
    "youth_or_transitional_risk": [
      "under_25",
      "child_and_adolescent_services",
      "transition_to_adult_services",
      "excluded_from_school",
      "care_leaver"
    ]
  }
}
```

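In [3]:

```python
from typing import Literal

from pydantic import BaseModel


def get_categorisation_model(categories: list[str]) -> type[BaseModel]:
    """Build a Pydantic model whose `category` field must be one of `categories`."""
    class Categorisation(BaseModel):
        # Literal[...] restricts the structured output to the allowed labels
        category: Literal[tuple(categories)]  # type: ignore

    return Categorisation  # the model class itself, not an instance
```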
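
Since the brief also asks for multi-label assignment, here is one way the same idea might be extended. This is a sketch only: it assumes Pydantic is installed, uses `create_model` instead of a nested class, and the field name `categories` for the multi-label case is hypothetical. The `multi_assignment` argument borrows its name from the API mock-up.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError, create_model


def get_categorisation_model(categories: list[str], multi_assignment: bool = False) -> type[BaseModel]:
    """Build a model that accepts one allowed label, or a list of them when multi_assignment=True."""
    allowed = Literal[tuple(categories)]  # type: ignore[valid-type]
    if multi_assignment:
        # Multi-label: a list in which every item must be one of the allowed labels
        return create_model("Categorisation", categories=(list[allowed], ...))
    # Single-label: exactly one allowed label
    return create_model("Categorisation", category=(allowed, ...))


Single = get_categorisation_model(["prison", "police_custody"])
print(Single(category="prison").category)

try:
    Single(category="hospital")  # not in the allowed labels
except ValidationError:
    print("rejected: label is not in the schema")

Multi = get_categorisation_model(["prison", "police_custody"], multi_assignment=True)
print(Multi(categories=["prison", "police_custody"]).categories)
```

Validating the response this way means an out-of-schema label from the cheaper model surfaces as an error rather than as a silently wrong column.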

In [4]:

```python
from pfd_toolkit import Categoriser, LLM
from dotenv import load_dotenv
import os

# Get API key
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

# Set up LLM client
llm_client = LLM(api_key=openai_api_key,
                 max_workers=30)  # Max workers to parallelise Categoriser tasks

categoriser = Categoriser(
    # Most important:
    reports=sample_reports,  # DataFrame containing scraped reports
    llm=llm_client,
    json_schema=json_schema,

    # Other desired functionality as optional params
    third_tier_as_theme=True,  # ...whether to treat the third tier as a sub-sub-theme (True) or just as a description of the sub-theme (False)
    multi_assignment=True,  # ...whether reports can be assigned to multiple themes and sub-themes (True) or not (False)
    include_date=False,  # All existing columns need a toggle for whether this should be fed to the model. `include_date` is just one example
)

# categoriser.reports # Returns DataFrame with additional Yes/No columns for each theme / sub-theme / sub-sub-theme (if using)
# categoriser.tabulation # Returns a DataFrame with the counts of each theme / sub-theme / sub-sub-theme (if using). I can develop this myself :)
```

In [6]:

```python
categoriser.categorise_reports()
```

```
Unnamed: 0,URL,ID,Area,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern,suicide,suicide-drug_related,suicide-drug_related-overdose,suicide-drug_related-toxicity,...,suicide-domestic_or_social_factors-relationship_breakdown,suicide-domestic_or_social_factors-social_isolation,suicide-domestic_or_social_factors-recent_bereavement,suicide-domestic_or_social_factors-housing_insecurity,suicide-youth_or_transitional_risk,suicide-youth_or_transitional_risk-under_25,suicide-youth_or_transitional_risk-child_and_adolescent_services,suicide-youth_or_transitional_risk-transition_to_adult_services,suicide-youth_or_transitional_risk-excluded_from_school,suicide-youth_or_transitional_risk-care_leaver
0,https://www.judiciary.uk/prevention-of-future-...,2024-0328,South Yorkshire West,On 19 October 2023 I commenced an investigatio...,Jacob was in long term foster care. Following ...,During the course of the inquest the evidence ...,,,,,...,,,,,,,,,,
1,https://www.judiciary.uk/prevention-of-future-...,2023-0185,North West Wales,On 21 June 2022 an investigation was commenced...,The circumstances of the death are as follows ...,"During the course of the inquest, the evidence...",,,,,...,,,,,,,,,,
2,https://www.judiciary.uk/prevention-of-future-...,2024-0159,Cheshire,On 14 November 2023 I commenced an investigati...,Mary Jones was found dead at her home address ...,During the course of the investigation my inqu...,,,,,...,,,,,,,,,,
3,https://www.judiciary.uk/prevention-of-future-...,2024-0646,Dorset,"On the 4 th April 2023, an investigation was c...",Emma had a complex mental health history with ...,During the course of the inquest the evidence ...,,,,,...,,,,,,,,,,
4,https://www.judiciary.uk/prevention-of-future-...,2024-0405,East London,On 12 April 2023 I commenced an investigation ...,Danny Anderson suffered from chronic mental he...,During the course of the inquest the evidence ...,,,,,...,,,,,,,,,,
5,https://www.judiciary.uk/prevention-of-future-...,2025-0079,Nottingham City and Nottinghamshire,"On 5 April 2023, I commenced an investigation ...","Serco had operated HMP Lowdham Grange, Notting...",During the course of the inquest the evidence ...,,,,,...,,,,,,,,,,
6,https://www.judiciary.uk/prevention-of-future-...,2024-0514,Surrey,On 30 th May 2023 an inquest was opened into t...,See the details set out in the narrative concl...,During the course of the inquest the evidence ...,,,,,...,,,,,,,,,,
7,https://www.judiciary.uk/prevention-of-future-...,2023-0510,Essex,On 26 October 2022 I commenced an investigatio...,Katharine Fox was being treated at home follow...,During the course of the inquest the evidence ...,,,,,...,,,,,,,,,,
8,https://www.judiciary.uk/prevention-of-future-...,2024-0541,Worcestershire,On 16 January 2023 I commenced an investigatio...,"In answer to the questions “when, where and ho...",During the course of the inquest the evidence ...,,,,,...,,,,,,,,,,
9,https://www.judiciary.uk/prevention-of-future-...,2023-0521,Inner North London,"On 17 May 2023, one of my assistant coroners, ...",Michael Hindes killed himself [REDACTED]. One ...,"During the course of the inquest, the evidence...",,,,,...,,,,,,,,,,
```


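In [None]:

```python
def traverse_category_tree(json_schema: dict | list, final_node: bool = False) -> list[str]:
    """Walk the schema one level at a time, asking the LLM to pick a node at each level.

    Returns the chosen path from the top-level theme down to the final node.
    """
    if not final_node:
        keys = list(json_schema.keys())  # all parent nodes are dicts
    else:
        keys = json_schema  # the last node is just a list

    model = get_categorisation_model(keys)
    # `generate` is a placeholder for the LLM call (not defined here);
    # it is assumed to return the chosen key as a string
    result = generate('prompt', model)

    if final_node:
        # we've traversed to the maximum depth of this tree
        return [result]

    child = json_schema[result]
    if isinstance(child, dict):
        # traverse into the chosen sub-tree (recursing on `json_schema` itself would never terminate)
        return [result] + traverse_category_tree(child)
    assert isinstance(child, list)  # sanity check
    return [result] + traverse_category_tree(child, final_node=True)
```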
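
To try the traversal idea without an API key, the model call can be swapped for any function that picks one option from a list. The sketch below is an iterative variant under that assumption; `traverse` and `choose` are hypothetical names, and it follows a single path, which corresponds to the single-assignment case.

```python
from typing import Callable, Sequence


def traverse(schema: dict, choose: Callable[[Sequence[str]], str]) -> list[str]:
    """Descend the schema, letting `choose` pick one option per level; return the chosen path."""
    path = []
    node = schema
    while True:
        # Dict levels offer their keys; the final list level offers its items
        options = list(node.keys()) if isinstance(node, dict) else list(node)
        picked = choose(options)
        if picked not in options:
            raise ValueError(f"{picked!r} is not one of {options}")
        path.append(picked)
        if isinstance(node, list):
            return path  # leaf level reached
        node = node[picked]


# Cut-down schema and a trivial chooser standing in for the LLM
mini_schema = {
    "suicide": {
        "institutional_settings": ["prison", "police_custody"],
        "drug_related": ["overdose"],
    }
}
print(traverse(mini_schema, choose=lambda options: options[0]))
```

In a real run, `choose` would wrap the prompt and the model built by `get_categorisation_model`, so each level costs one call per report.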