## Topic identification

This notebook demonstrates how to use PFD Toolkit to identify recurring topics or themes in Prevention of Future Deaths (PFD) reports.

### Set up

We first need to load in a sample of reports, and initialise our `LLM` and `Extractor` classes.

In [1]:
from pfd_toolkit import load_reports, Extractor, LLM
from dotenv import load_dotenv
import os

# Load reports
reports = load_reports(n_reports=100)

# Set up LLM client
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

llm_client = LLM(api_key=openai_api_key, max_workers=40)

# Set up Extractor
extractor = Extractor(
    llm=llm_client,
    reports=reports
)

### Summarise reports

Since we'll be feeding all of our reports to the LLM in one large prompt (rather than one row at a time), it's a good idea to first condense each report into a short paragraph covering its main points.

Running the `.summarise()` method produces a new column, which gets added to `self.reports`.

In [2]:
# Summarise reports
extractor.summarise(trim_intensity="medium")

                                                                      

Unnamed: 0,URL,ID,Date,CoronerName,Area,Receiver,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern,summary
0,https://www.judiciary.uk/prevention-of-future-...,2025-0248,2025-05-28,Clare Bailey,Teesside and Hartlepool,1 Department of Health and Social Care 2 Chief...,Mr Dean Bradley died on 15 th October 2021 at ...,At approximately 0300 on 15 th October 2021 Mr...,During the course of the investigation my inqu...,Mr Dean Bradley died by hanging on 15 October ...
1,https://www.judiciary.uk/prevention-of-future-...,2025-0243,2025-05-27,Andrew Cousins,Blackpool & Fylde,BARCHESTER HEALTHCARE LIMITED 1,"On 30 April 2025 and 23 May 2025, at an inques...",I returned the following in box 4 of the Recor...,During the course of the inquest the evidence ...,"Mr Keith Ineson, a resident at Glenroyd Care H..."
2,https://www.judiciary.uk/prevention-of-future-...,2025-0244,2025-05-27,Peter Merchant,West Yorkshire West,"1 , Chief Constable West Yorkshire Police 1",On 15 February 2024 the death of Paul Andrew A...,"As identified above, Paul Alexander had a long...",During the course of the investigation my inqu...,"Paul Andrew Alexander, who had a long history ..."
3,https://www.judiciary.uk/prevention-of-future-...,2025-0245,2025-05-27,Nadia Persaud,East London,", Chief Executive Officer, Barts Health NHS Fo...",On the 13 June 2024 I commenced an investigati...,Abdirahman Afrah began to suffer from chest pa...,During the course of the inquest the evidence ...,"Abdirahman Abdirizaq Afrah, aged 17, died from..."
4,https://www.judiciary.uk/prevention-of-future-...,2025-0246,2025-05-27,Rebecca Sutton,Durham and Darlington,"1. Deputy Chief Constable , Durham Constabular...",On 7 January 2025 an investigation into the de...,The Deceased had a long history of mental heal...,During the course of the inquest the evidence ...,"Sophie Ann Louise Cotton, 24, died by suicide ..."
...,...,...,...,...,...,...,...,...,...,...
95,https://www.judiciary.uk/prevention-of-future-...,2025-0152,2025-03-18,Joanne Andrews,"West Sussex, Brighton and Hove",1. The President of the Royal College of Obste...,On 3 October 2023 I commenced an investigation...,At 28 weeks of gestation it was noted on scans...,During the course of the inquest the evidence ...,"Alonzo Christopher Andrew Wood, born on 23 Sep..."
96,https://www.judiciary.uk/prevention-of-future-...,2025-0149,2025-03-18,Andrew Hetherington,Northumberland,"CHIEF EXECUTIVE, NORTHUMBRIA HEALTHCARE NHS FO...",On 1 May 2024 I commenced an investigation int...,The deceased had considerable underlying natur...,During the course of the inquest the evidence ...,Renate Mark died from a head injury sustained ...
97,https://www.judiciary.uk/prevention-of-future-...,2025-0144,2025-03-17,Sean Horstead,Essex,1. Chief Executive Officer of Essex Partnershi...,On 31 st October 2023 I commenced an investiga...,On the 23 rd September 2023 after concerns wer...,: A significant number of the serious causativ...,"Darren Neil Turner, aged 37, died by suicide o..."
98,https://www.judiciary.uk/prevention-of-future-...,2025-0145,2025-03-17,Rachel Knight,South Wales Central,The Chief Executive Cardiff & Vale University ...,On 24 October 2023 I commenced an investigatio...,Mr Colley was left unsupervised with cot sides...,During the course of the inquest the evidence ...,"Colin Colley, aged 87 with dementia, frailty, ..."


Next, we can estimate the number of tokens in our new column to make sure it's reasonable to feed to the model in one go. This method gets called downstream anyway, but we're calling it here to get an idea.

In [3]:
# Estimate the total number of tokens in the summary column
extractor.estimate_tokens()

18239
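
For intuition about where a figure like this comes from, a rough estimate can be made from character counts alone. The sketch below is not how `estimate_tokens()` works internally (this notebook doesn't say); it assumes the common rule of thumb of roughly four characters per token for English text. `rough_token_estimate` and the sample summaries are hypothetical.

```python
import math

def rough_token_estimate(texts, chars_per_token=4.0):
    """Approximate the token count of a collection of strings.

    Assumes ~4 characters per token, a rule of thumb for English text;
    a real tokeniser will give different numbers.
    """
    # Skip non-string entries (e.g. missing summaries)
    total_chars = sum(len(t) for t in texts if isinstance(t, str))
    return math.ceil(total_chars / chars_per_token)

# Hypothetical stand-ins for the contents of the summary column
summaries = [
    "Mr A died following a delayed ambulance response.",
    "Ms B died after a medication error in a care home.",
]
print(rough_token_estimate(summaries))
```

A heuristic like this is only useful as a sanity check; the toolkit's own estimate should be preferred when deciding whether the corpus fits in one prompt.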

### Discover themes

Next, we use the `self.discover_themes()` method to identify latent topics contained within our summarised corpus of reports. This method assumes that `self.summarise()` has already been run, and will throw an error if not.

We can customise this method in several ways:
1. We can set the `min_themes` and `max_themes` parameters. These are optional, default to `None`, and constrain the number of identified themes.
2. We can optionally provide seed topics to nudge the model towards identifying specific themes. This parameter accepts a single string, a list of strings, or a Pydantic `BaseModel`.
3. Extra instructions can be provided if we want to control the model's behaviour in some other way. For example, we could instruct the model to only identify *systematic* concerns.
4. We can instruct the method to warn us or throw an error if the number of tokens exceeds a given threshold (`warn_exceed` and `error_exceed` respectively). These are set to reasonable default values for OpenAI models. To show this in action, we'll manually set `warn_exceed` to a small value.
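
The warn/error thresholds in point 4 can be pictured with a small sketch. This is an illustration of the idea only, not PFD Toolkit's implementation: `check_token_budget` is a hypothetical helper, and the exception type and return values are assumptions. The token count used is the estimate obtained earlier in this notebook.

```python
import warnings

def check_token_budget(n_tokens, warn_exceed=None, error_exceed=None):
    """Warn or raise depending on how large the prompt is.

    Hypothetical helper: returns True if the prompt is within budget,
    False if a warning was issued, and raises if the hard limit is passed.
    """
    # The hard limit is checked first so it takes priority over the warning
    if error_exceed is not None and n_tokens > error_exceed:
        raise ValueError(
            f"Prompt has {n_tokens} tokens, above error_exceed={error_exceed}."
        )
    if warn_exceed is not None and n_tokens > warn_exceed:
        warnings.warn(
            f"Prompt has {n_tokens} tokens, above warn_exceed={warn_exceed}."
        )
        return False
    return True

# With a small warn_exceed, the estimated 18239 tokens trigger a warning
# but stay well below the error threshold
check_token_budget(18239, warn_exceed=10000, error_exceed=500000)
```

Keeping the warning threshold well below the error threshold gives early notice that a corpus is getting large before a run is blocked outright.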

In [4]:
extractor.discover_themes()

print(extractor.identified_themes)

```json
{
  "mental_health": {
    "type": "bool",
    "description": "Failures in mental health care including risk assessment, crisis response, communication, and service coordination leading to preventable deaths."
  },
  "communication_failures": {
    "type": "bool",
    "description": "Breakdowns in communication between healthcare providers, agencies, families, or within teams that contribute to inadequate care or delayed interventions."
  },
  "care_home_issues": {
    "type": "bool",
    "description": "Inadequate care, supervision, record-keeping, or training in care home settings resulting in harm or death."
  },
  "ambulance_delays": {
    "type": "bool",
    "description": "Delays in ambulance response or handover times causing critical treatment delays and increased mortality risk."
  },
  "medication_management": {
    "type": "bool",
    "description": "Problems with prescribing, monitoring, or communication about medications leading to adverse events or deaths."
  },
```

In [5]:
extractor.discover_themes(max_themes=5, 
                          min_themes=5,
                          seed_topics="Factors implicitly or explicitly about staff shortages; potential issues with staff training",
                          extra_instructions=None,
                          warn_exceed=10000,
                          error_exceed=500000)

print(extractor.identified_themes)



```json
{
  "staffing_shortages": {
    "type": "bool",
    "description": "Concerns related to insufficient staffing levels, including shortages in healthcare, social care, and support services, impacting patient safety and care quality."
  },
  "communication_failures": {
    "type": "bool",
    "description": "Failures or deficiencies in communication between agencies, healthcare providers, patients, families, or within teams that contribute to inadequate care or missed interventions."
  },
  "training_deficiencies": {
    "type": "bool",
    "description": "Issues arising from inadequate staff training, lack of awareness, or poor understanding of protocols, guidelines, or clinical skills leading to preventable harm."
  },
  "systemic_delays": {
    "type": "bool",
    "description": "Delays caused by systemic factors such as ambulance response times, hospital overcrowding, bed shortages, or procedural inefficiencies that adversely affect timely care."
  },
  "information_management
```

We can see that staffing shortages and training deficiencies both now appear among the themes, with the model identifying 3 other themes (remember that we set both `min_themes` and `max_themes` to 5, constraining its output to exactly 5 themes in total).
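
If the themes need to be inspected programmatically, the printed structure can be treated as ordinary JSON. The sketch below assumes the themes are available as a JSON string with the shape shown in the output above; the excerpt is a hypothetical, abridged copy with shortened descriptions, and nothing here relies on PFD Toolkit itself.

```python
import json

# Hypothetical, abridged excerpt with the same shape as the printed output
themes_json = """
{
  "staffing_shortages": {
    "type": "bool",
    "description": "Concerns related to insufficient staffing levels."
  },
  "training_deficiencies": {
    "type": "bool",
    "description": "Issues arising from inadequate staff training."
  }
}
"""

themes = json.loads(themes_json)

# Theme names are the top-level keys, in the order they appear in the JSON
theme_names = list(themes)

for name in theme_names:
    print(f"{name}: {themes[name]['description']}")
```

Listing the names this way makes it easy to check that seeded topics were picked up before moving on to assignment.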

### Assign themes to reports

Finally, we can assign our generated list of themes to our original DataFrame containing the reports. We assign the output to a new DataFrame, which gains one new column per theme.

An optional `force_assign` parameter prevents the model from outputting missing data for any given report field, while `allow_multiple` tells the model not to treat the list of themes as mutually exclusive (i.e. each report can have multiple themes). Both values default to `False`. Here we also set `produce_spans=True`, which, as the output below shows, adds a `spans_` column alongside each theme containing the report text that supports the assignment.

In [6]:
assigned_reports = extractor.extract_features(force_assign=True,
                                              allow_multiple=True,
                                              produce_spans=True)

assigned_reports.head(10)

Extracting features: 100%|██████████| 100/100 [00:19<00:00,  5.08it/s]


Unnamed: 0,URL,ID,Date,CoronerName,Area,Receiver,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern,spans_staffing_shortages,staffing_shortages,spans_communication_failures,communication_failures,spans_training_deficiencies,training_deficiencies,spans_systemic_delays,systemic_delays,spans_information_management,information_management
0,https://www.judiciary.uk/prevention-of-future-...,2025-0248,2025-05-28,Clare Bailey,Teesside and Hartlepool,1 Department of Health and Social Care 2 Chief...,Mr Dean Bradley died on 15 th October 2021 at ...,At approximately 0300 on 15 th October 2021 Mr...,During the course of the investigation my inqu...,,False,The police failed to contact the mental health...,True,,False,Current resources for safeguarding those with ...,True,,False
1,https://www.judiciary.uk/prevention-of-future-...,2025-0243,2025-05-27,Andrew Cousins,Blackpool & Fylde,BARCHESTER HEALTHCARE LIMITED 1,"On 30 April 2025 and 23 May 2025, at an inques...",I returned the following in box 4 of the Recor...,During the course of the inquest the evidence ...,,False,,False,,False,,False,"It was noted in the evidence, that the observa...",True
2,https://www.judiciary.uk/prevention-of-future-...,2025-0244,2025-05-27,Peter Merchant,West Yorkshire West,"1 , Chief Constable West Yorkshire Police 1",On 15 February 2024 the death of Paul Andrew A...,"As identified above, Paul Alexander had a long...",During the course of the investigation my inqu...,,False,"""Under the Right Care Right Person framework, ...",True,,False,"""Under the Right Care Right Person framework, ...",True,"""The call was redirected to the ambulance serv...",True
3,https://www.judiciary.uk/prevention-of-future-...,2025-0245,2025-05-27,Nadia Persaud,East London,", Chief Executive Officer, Barts Health NHS Fo...",On the 13 June 2024 I commenced an investigati...,Abdirahman Afrah began to suffer from chest pa...,During the course of the inquest the evidence ...,The inquest heard that waiting times to be see...,True,When the doctor called Abdirahman the followin...,True,The A&E doctor did not know how to share such ...,True,Waiting times to be seen in Majors A&E at Newh...,True,"Neither the results, nor the discharge summary...",True
4,https://www.judiciary.uk/prevention-of-future-...,2025-0246,2025-05-27,Rebecca Sutton,Durham and Darlington,"1. Deputy Chief Constable , Durham Constabular...",On 7 January 2025 an investigation into the de...,The Deceased had a long history of mental heal...,During the course of the inquest the evidence ...,,False,The refusal of the police to attend the Deceas...,True,The call handler's inappropriate advice to the...,True,"The procedure in place to have a negative ""Rig...",True,"The ""Right Care, Right Person"" assessment proc...",True
5,https://www.judiciary.uk/prevention-of-future-...,2025-0241,2025-05-23,Mary Hassell,Inner North London,1. Commissioner Metropolitan Police Service (M...,"On 12 February 2016, I commenced an investigat...",Lewis Johnson died as a consequence of a road ...,"2 During the course of the inquest, the eviden...",,False,"""there was not a consistent expectation among ...",True,"""the jury concluded there was a failure by MPS...",True,(There were several reasons unconnected with t...,True,,False
6,https://www.judiciary.uk/prevention-of-future-...,2025-0242,2025-05-23,Mary Hassell,Inner North London,1. Director General Independent Office for Pol...,"On 12 February 2016, I commenced an investigat...",Lewis Johnson died as a consequence of a road ...,"2 During the course of the inquest, the eviden...",,False,,False,,False,(There were several reasons unconnected with t...,True,the terms of reference set out for the forensi...,True
7,https://www.judiciary.uk/prevention-of-future-...,2025-0247,2025-05-23,Nadia Persaud,East London,"1. , CEO, North East London Foundation Trust (...",On 27 November 2024 I commenced an investigati...,Mr. Fraser was a 37-year-old gentleman who had...,During the course of the inquest the evidence ...,,False,The team did not make it clear to the family t...,True,,False,,False,There was no clear care plan in place whilst h...,True
8,https://www.judiciary.uk/prevention-of-future-...,2025-0236,2025-05-21,Kate Robertson,North West Wales,Betsi Cadwaladr University Health Board (BCUHB) 1,On 20 May 2024 I commenced an investigation in...,The circumstances of the death are as follows ...,"During the course of the inquest, the evidence...",,False,"""There was no sufficiently full contextual sha...",True,"""The neonatal investigation was not thorough. ...",True,,False,"""The records themselves, identified as part of...",True
9,https://www.judiciary.uk/prevention-of-future-...,2025-0240,2025-05-21,Andrew Morse,South Wales Central,The Chief Executive Cardiff & Vale University ...,On 30 October 2023 I commenced an investigatio...,These were recorded as follows Robert Maxwell ...,During the course of the inquest the evidence ...,,False,Mental heath services did not inform Mr Smith'...,True,The guidance provided to clinicians and nursin...,True,,False,The guidance provided to clinicians and nursin...,True
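
Once each report carries boolean theme columns, a natural next step is to count how often each theme occurs. The sketch below uses plain Python on hypothetical rows that mimic the theme columns above; it assumes the values are plain Python booleans. With a real DataFrame, something like `assigned_reports.to_dict("records")` should yield rows of a similar shape, though that is untested here. `theme_counts` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical rows mimicking the boolean theme columns in the output above
rows = [
    {"ID": "r1", "staffing_shortages": False, "communication_failures": True},
    {"ID": "r2", "staffing_shortages": True, "communication_failures": True},
    {"ID": "r3", "staffing_shortages": False, "communication_failures": False},
]
theme_columns = ["staffing_shortages", "communication_failures"]

def theme_counts(rows, theme_columns):
    """Count how many reports were flagged True for each theme."""
    # Start every theme at zero so themes with no matches still appear
    counts = Counter({theme: 0 for theme in theme_columns})
    for row in rows:
        for theme in theme_columns:
            # Only an explicit True counts; missing values are ignored
            if row.get(theme) is True:
                counts[theme] += 1
    return counts

counts = theme_counts(rows, theme_columns)
for theme, n in counts.most_common():
    print(f"{theme}: {n}/{len(rows)} reports")
```

Because `allow_multiple=True` lets a report carry several themes, these counts can sum to more than the number of reports.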
