## Topic identification

This notebook demonstrates how to use PFD Toolkit to identify recurring topics or themes in PFD reports.

### Set up

We first need to load a sample of reports and initialise our `LLM` and `Extractor` classes.

In [1]:
from pfd_toolkit import load_reports, Extractor, LLM
from dotenv import load_dotenv
import os

# Load reports
reports = load_reports(n_reports=100)

# Set up LLM client
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

llm_client = LLM(api_key=openai_api_key, max_workers=40)

# Set up Extractor
extractor = Extractor(
    llm=llm_client,
    reports=reports
)
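
If the key is missing from `api.env`, `os.getenv` returns `None`, and the problem may only surface later with a less direct error. A minimal sketch of an early check, using a hypothetical helper called `require_env` (it is not part of PFD Toolkit):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable and an empty string
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value
```

With this in place, the lookup above could be written as `openai_api_key = require_env("OPENAI_API_KEY")`.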

### Summarise reports

Since we'll be feeding all of our reports to the LLM in one big prompt (rather than one row at a time), it's a good idea to first summarise each report into a short paragraph covering its main points.

Running the `.summarise()` method produces a new `summary` column, which gets added to `self.reports`.

In [2]:
# Summarise reports
extractor.summarise(trim_intensity="medium")

Summarising reports:   0%|          | 0/100 [00:00<?, ?it/s]INFO:backoff:Backing off _call_llm(...) for 0.1s (openai.APIConnectionError: Connection error.)
INFO:backoff:Backing off _call_llm(...) for 0.2s (openai.APIConnectionError: Connection error.)
INFO:backoff:Backing off _call_llm(...) for 0.8s (openai.APIConnectionError: Connection error.)
INFO:backoff:Backing off _call_llm(...) for 0.5s (openai.APIConnectionError: Connection error.)
INFO:backoff:Backing off _call_llm(...) for 0.3s (openai.APIConnectionError: Connection error.)
INFO:backoff:Backing off _call_llm(...) for 0.3s (openai.APIConnectionError: Connection error.)
INFO:backoff:Backing off _call_llm(...) for 0.9s (openai.APIConnectionError: Connection error.)
INFO:backoff:Backing off _call_llm(...) for 0.4s (openai.APIConnectionError: Connection error.)
INFO:backoff:Backing off _call_llm(...) for 0.2s (openai.APIConnectionError: Connection error.)
INFO:backoff:Backing off _call_llm(...) for 0.7s (openai.APIConnectionError:

Unnamed: 0,URL,ID,Date,CoronerName,Area,Receiver,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern,summary
0,https://www.judiciary.uk/prevention-of-future-...,2025-0248,2025-05-28,Clare Bailey,Teesside and Hartlepool,1 Department of Health and Social Care 2 Chief...,Mr Dean Bradley died on 15 th October 2021 at ...,At approximately 0300 on 15 th October 2021 Mr...,During the course of the investigation my inqu...,Mr Dean Bradley died by hanging on 15 October ...
1,https://www.judiciary.uk/prevention-of-future-...,2025-0243,2025-05-27,Andrew Cousins,Blackpool & Fylde,BARCHESTER HEALTHCARE LIMITED 1,"On 30 April 2025 and 23 May 2025, at an inques...",I returned the following in box 4 of the Recor...,During the course of the inquest the evidence ...,"Mr Keith Ineson, a resident at Glenroyd Care H..."
2,https://www.judiciary.uk/prevention-of-future-...,2025-0244,2025-05-27,Peter Merchant,West Yorkshire West,"1 , Chief Constable West Yorkshire Police 1",On 15 February 2024 the death of Paul Andrew A...,"As identified above, Paul Alexander had a long...",During the course of the investigation my inqu...,"Paul Andrew Alexander, who had a long history ..."
3,https://www.judiciary.uk/prevention-of-future-...,2025-0245,2025-05-27,Nadia Persaud,East London,", Chief Executive Officer, Barts Health NHS Fo...",On the 13 June 2024 I commenced an investigati...,Abdirahman Afrah began to suffer from chest pa...,During the course of the inquest the evidence ...,"Abdirahman Abdirizaq Afrah, aged 17, died from..."
4,https://www.judiciary.uk/prevention-of-future-...,2025-0246,2025-05-27,Rebecca Sutton,Durham and Darlington,"1. Deputy Chief Constable , Durham Constabular...",On 7 January 2025 an investigation into the de...,The Deceased had a long history of mental heal...,During the course of the inquest the evidence ...,"Sophie Ann Louise Cotton, 24, died by suicide ..."
...,...,...,...,...,...,...,...,...,...,...
95,https://www.judiciary.uk/prevention-of-future-...,2025-0152,2025-03-18,Joanne Andrews,"West Sussex, Brighton and Hove",1. The President of the Royal College of Obste...,On 3 October 2023 I commenced an investigation...,At 28 weeks of gestation it was noted on scans...,During the course of the inquest the evidence ...,"Alonzo Christopher Andrew Wood, born on 23 Sep..."
96,https://www.judiciary.uk/prevention-of-future-...,2025-0149,2025-03-18,Andrew Hetherington,Northumberland,"CHIEF EXECUTIVE, NORTHUMBRIA HEALTHCARE NHS FO...",On 1 May 2024 I commenced an investigation int...,The deceased had considerable underlying natur...,During the course of the inquest the evidence ...,"Renate Mark, who had significant underlying he..."
97,https://www.judiciary.uk/prevention-of-future-...,2025-0144,2025-03-17,Sean Horstead,Essex,1. Chief Executive Officer of Essex Partnershi...,On 31 st October 2023 I commenced an investiga...,On the 23 rd September 2023 after concerns wer...,: A significant number of the serious causativ...,"Darren Neil Turner, aged 37, died by suicide o..."
98,https://www.judiciary.uk/prevention-of-future-...,2025-0145,2025-03-17,Rachel Knight,South Wales Central,The Chief Executive Cardiff & Vale University ...,On 24 October 2023 I commenced an investigatio...,Mr Colley was left unsupervised with cot sides...,During the course of the inquest the evidence ...,"Colin Colley, aged 87 with dementia, frailty, ..."


Next, we can estimate the number of tokens in our new column to make sure it's reasonable to feed to the model in one go. This method gets called downstream anyway, but we're calling it here to get an idea.

In [3]:
# Estimate the total number of tokens in the summary column
extractor.estimate_tokens()

18327
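
For intuition about a figure like this, a common rule of thumb for English text with OpenAI-style tokenisers is roughly four characters per token. The sketch below applies that heuristic to a list of strings. It is an approximation for illustration only and is not how `estimate_tokens()` is implemented; the function name, the ratio and the placeholder strings are all assumptions.

```python
import math


def rough_token_estimate(texts: list[str], chars_per_token: float = 4.0) -> int:
    """Approximate the total token count of a collection of texts."""
    total_chars = sum(len(text) for text in texts)
    return math.ceil(total_chars / chars_per_token)


# Placeholder strings standing in for the contents of the summary column
placeholder_summaries = [
    "First placeholder summary.",
    "Second placeholder summary, slightly longer than the first.",
]
print(rough_token_estimate(placeholder_summaries))
```

The estimate of 18,327 tokens across 100 summaries works out at roughly 183 tokens per summary.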

### Discover themes

Next, we use the `self.discover_themes()` method to identify latent topics contained within our summarised corpus of reports. This method assumes that `self.summarise()` has already been run, and will throw an error if not.

We can customise this method in several ways:
1. We can set the `min_themes` and `max_themes` parameters. These are optional, default to `None`, and constrain the number of identified themes.
2. We can optionally provide `seed_topics` to prod the model towards identifying specific themes. This parameter accepts a single string, a list of strings, or a Pydantic BaseModel.
3. We can provide `extra_instructions` if we want to control the model's behaviour in some other way. For example, we could instruct the model to only identify *systemic* concerns.
4. We can instruct the method to warn us (`warn_exceed`) or throw an error (`error_exceed`) if the number of tokens exceeds a given threshold. Both are set to reasonable default values for OpenAI models. To show this in action, we'll manually set `warn_exceed` to a small value.
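
The two token thresholds can be pictured as a simple guard that runs before the prompt is sent. The following is a sketch of that logic only, not the toolkit's own code: the function name `check_token_budget`, the exception type and the strict `>` comparison are all assumptions.

```python
import warnings


def check_token_budget(n_tokens: int, warn_exceed: int, error_exceed: int) -> None:
    """Warn or raise depending on how n_tokens compares with the two limits."""
    if n_tokens > error_exceed:
        raise ValueError(
            f"{n_tokens} tokens exceeds the hard limit of {error_exceed}."
        )
    if n_tokens > warn_exceed:
        warnings.warn(
            f"{n_tokens} tokens exceeds the warning threshold of {warn_exceed}."
        )


# The token estimate from earlier, checked against a deliberately low warning threshold
check_token_budget(18327, warn_exceed=10000, error_exceed=500000)
```

With these values the call emits a warning but does not raise, because 18,327 sits between the two thresholds.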

In [4]:
extractor.discover_themes()

print(extractor.identified_themes)

```json
{
  "mental_health": {
    "type": "bool",
    "description": "Issues related to mental health care, including assessment, treatment, communication, and crisis intervention failures."
  },
  "communication_failures": {
    "type": "bool",
    "description": "Breakdowns or inadequacies in communication between healthcare providers, agencies, patients, families, or within teams."
  },
  "care_planning": {
    "type": "bool",
    "description": "Deficiencies in care plans, risk assessments, monitoring, or follow-up arrangements leading to inadequate patient management."
  },
  "emergency_response": {
    "type": "bool",
    "description": "Delays, failures, or inadequacies in emergency services response, ambulance dispatch, or urgent care provision."
  },
  "documentation_issues": {
    "type": "bool",
    "description": "Incomplete, inaccurate, or missing records and documentation impacting patient care and clinical decision-making."
  },
  "interagency_coordination": {
    "type
```

In [5]:
extractor.discover_themes(max_themes=5, 
                          min_themes=5,
                          seed_topics="Factors implicitly or explicitly about staff shortages; potential issues with staff training",
                          extra_instructions=None,
                          warn_exceed=10000,
                          error_exceed=500000)

print(extractor.identified_themes)



```json
{
  "staffing_shortages": {
    "type": "bool",
    "description": "Issues related to insufficient numbers of staff, including impacts on patient care, supervision, and service delivery."
  },
  "training_deficiencies": {
    "type": "bool",
    "description": "Concerns about inadequate or insufficient staff training, including lack of mandatory training, poor understanding of protocols, and skill gaps."
  },
  "communication_failures": {
    "type": "bool",
    "description": "Failures in communication within and between agencies, including poor information sharing, unclear handovers, and lack of coordination."
  },
  "systemic_process_gaps": {
    "type": "bool",
    "description": "Deficiencies in policies, procedures, protocols, or systemic frameworks that lead to missed or delayed interventions and poor outcomes."
  },
  "resource_and_capacity_constraints": {
    "type": "bool",
    "description": "Limitations in available resources such as beds, equipment, or social care
```

We can see that staff shortages and training deficiencies appear as the first two themes, with the model identifying three further themes (remember that we set both `min_themes` and `max_themes` to 5, so exactly five themes are returned).
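
Because each theme is a named boolean field with a description, the printed schema is easy to inspect programmatically. A small sketch, assuming the schema is available as a JSON string in the shape printed above (only the first two themes are reproduced here, and the variable names are illustrative):

```python
import json

# First two themes, copied from the printed output above
themes_json = """
{
  "staffing_shortages": {
    "type": "bool",
    "description": "Issues related to insufficient numbers of staff, including impacts on patient care, supervision, and service delivery."
  },
  "training_deficiencies": {
    "type": "bool",
    "description": "Concerns about inadequate or insufficient staff training, including lack of mandatory training, poor understanding of protocols, and skill gaps."
  }
}
"""

themes = json.loads(themes_json)
theme_names = list(themes)  # dict keys, in the order they appear in the JSON

for name, spec in themes.items():
    print(f"{name} ({spec['type']}): {spec['description']}")
```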

### Assign themes to reports

Finally, we can assign our discovered themes to the original DataFrame containing the reports. We assign the output to a new DataFrame, which gains one new column per theme.

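An optional `force_assign` parameter prevents the model from outputting missing data for any given report field, while `allow_multiple` tells the model not to treat the list of themes as mutually exclusive (i.e. each report can have multiple themes). Both values default to `False`. We also set `produce_spans=True`; as the output below shows, this adds a `spans_` column alongside each theme, holding a supporting quotation from the report or `N/A: Not found`.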

In [6]:
assigned_reports = extractor.extract_features(force_assign=True,
                                              allow_multiple=True,
                                              produce_spans=True)

assigned_reports.head(10)

Extracting features: 100%|██████████| 100/100 [00:29<00:00,  3.37it/s]


Unnamed: 0,URL,ID,Date,CoronerName,Area,Receiver,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern,spans_staffing_shortages,staffing_shortages,spans_training_deficiencies,training_deficiencies,spans_communication_failures,communication_failures,spans_systemic_process_gaps,systemic_process_gaps,spans_resource_and_capacity_constraints,resource_and_capacity_constraints
0,https://www.judiciary.uk/prevention-of-future-...,2025-0248,2025-05-28,Clare Bailey,Teesside and Hartlepool,1 Department of Health and Social Care 2 Chief...,Mr Dean Bradley died on 15 th October 2021 at ...,At approximately 0300 on 15 th October 2021 Mr...,During the course of the investigation my inqu...,N/A: Not found,False,N/A: Not found,False,"""The police failed to contact the mental healt...",True,"""There appears to be a gap in the services ava...",True,"""Current resources for safeguarding those with...",True
1,https://www.judiciary.uk/prevention-of-future-...,2025-0243,2025-05-27,Andrew Cousins,Blackpool & Fylde,BARCHESTER HEALTHCARE LIMITED 1,"On 30 April 2025 and 23 May 2025, at an inques...",I returned the following in box 4 of the Recor...,During the course of the inquest the evidence ...,N/A: Not found,False,N/A: Not found,False,N/A: Not found,False,"""observation scores taken for Mr Ineson follow...",True,N/A: Not found,False
2,https://www.judiciary.uk/prevention-of-future-...,2025-0244,2025-05-27,Peter Merchant,West Yorkshire West,"1 , Chief Constable West Yorkshire Police 1",On 15 February 2024 the death of Paul Andrew A...,"As identified above, Paul Alexander had a long...",During the course of the investigation my inqu...,N/A: Not found,False,N/A: Not found,False,"""call was redirected to the ambulance service ...",True,"""Under the Right Care Right Person framework, ...",True,N/A: Not found,False
3,https://www.judiciary.uk/prevention-of-future-...,2025-0245,2025-05-27,Nadia Persaud,East London,", Chief Executive Officer, Barts Health NHS Fo...",On the 13 June 2024 I commenced an investigati...,Abdirahman Afrah began to suffer from chest pa...,During the course of the inquest the evidence ...,"""waiting times to be seen in Majors A&E at New...",True,"""The inquest heard that the A&E doctor did not...",True,"""When the doctor called Abdirahman the followi...",True,"""There was no timely triage of Majors patients...",True,"""there were no beds available and that he woul...",True
4,https://www.judiciary.uk/prevention-of-future-...,2025-0246,2025-05-27,Rebecca Sutton,Durham and Darlington,"1. Deputy Chief Constable , Durham Constabular...",On 7 January 2025 an investigation into the de...,The Deceased had a long history of mental heal...,During the course of the inquest the evidence ...,N/A: Not found,False,"""it was not best practice to have asked the ca...",True,"""During the 16:44 call, by following the “Righ...",True,"""During the 16:44 call, by following the “Righ...",True,N/A: Not found,False
5,https://www.judiciary.uk/prevention-of-future-...,2025-0241,2025-05-23,Mary Hassell,Inner North London,1. Commissioner Metropolitan Police Service (M...,"On 12 February 2016, I commenced an investigat...",Lewis Johnson died as a consequence of a road ...,"2 During the course of the inquest, the eviden...",N/A: Not found,False,"""failure by MPS to implement, disseminate and ...",True,"""not a consistent expectation among police off...",True,"""failure by MPS to implement, disseminate and ...",True,N/A: Not found,False
6,https://www.judiciary.uk/prevention-of-future-...,2025-0242,2025-05-23,Mary Hassell,Inner North London,1. Director General Independent Office for Pol...,"On 12 February 2016, I commenced an investigat...",Lewis Johnson died as a consequence of a road ...,"2 During the course of the inquest, the eviden...",N/A: Not found,False,N/A: Not found,False,N/A: Not found,False,"""the terms of reference set out for the forens...",True,N/A: Not found,False
7,https://www.judiciary.uk/prevention-of-future-...,2025-0247,2025-05-23,Nadia Persaud,East London,"1. , CEO, North East London Foundation Trust (...",On 27 November 2024 I commenced an investigati...,Mr. Fraser was a 37-year-old gentleman who had...,During the course of the inquest the evidence ...,N/A: Not found,False,N/A: Not found,False,"""The team did not make it clear to the family ...",True,"""There was no clear care plan in place whilst ...",True,N/A: Not found,False
8,https://www.judiciary.uk/prevention-of-future-...,2025-0236,2025-05-21,Kate Robertson,North West Wales,Betsi Cadwaladr University Health Board (BCUHB) 1,On 20 May 2024 I commenced an investigation in...,The circumstances of the death are as follows ...,"During the course of the inquest, the evidence...",N/A: Not found,False,"""The neonatal investigation was not thorough. ...",True,"""There was no sufficiently full contextual sha...",True,"""There were several opportunities not taken by...",True,N/A: Not found,False
9,https://www.judiciary.uk/prevention-of-future-...,2025-0240,2025-05-21,Andrew Morse,South Wales Central,The Chief Executive Cardiff & Vale University ...,On 30 October 2023 I commenced an investigatio...,These were recorded as follows Robert Maxwell ...,During the course of the inquest the evidence ...,N/A: Not found,False,"""The guidance provided to clinicians and nursi...",True,"""The guidance provided to clinicians and nursi...",True,"""The guidance provided to clinicians and nursi...",True,N/A: Not found,False
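
Once themes are assigned, a natural next step is to count how often each one occurs. The sketch below does this in plain Python for the theme flags of rows 0 to 4, copied from the table above; the variable names are illustrative and nothing here is part of PFD Toolkit.

```python
from collections import Counter

theme_columns = [
    "staffing_shortages",
    "training_deficiencies",
    "communication_failures",
    "systemic_process_gaps",
    "resource_and_capacity_constraints",
]

# Theme flags for report rows 0-4, copied from the table above
rows = [
    dict(zip(theme_columns, flags))
    for flags in [
        (False, False, True, True, True),
        (False, False, False, True, False),
        (False, False, True, True, False),
        (True, True, True, True, True),
        (False, True, True, True, False),
    ]
]

# Count how many reports were assigned each theme
theme_counts = Counter()
for row in rows:
    for theme in theme_columns:
        theme_counts[theme] += int(row[theme])

# Print themes from most to least frequent
for theme, count in theme_counts.most_common():
    print(f"{theme}: {count}/{len(rows)}")
```

On the full DataFrame, `assigned_reports[theme_columns].sum()` should give the equivalent counts across all 100 reports, assuming the theme columns hold booleans as displayed.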


In [7]:
print(extractor.feature_model)

<class 'pfd_toolkit.extractor.DiscoveredThemesWithSpans'>
