# Getting our decoy-boosted sample

This notebook documents the creation of our decoy-boosted dataset.

In the previous notebook, we generated a tagged version of the PFD corpus, adding a new column (`positive_cases`) that flags whether each report was identified by the PFD Toolkit as a child-suicide case (`True`) or not (`False`).

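To construct a balanced dataset, we drew a random sample of negative cases equal in number to the positive cases (73 of each), giving a 1:1 ratio and 146 reports in total. The negatives are sampled at random from all reports flagged `False`, rather than matched to individual positive cases.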

In [1]:
import pandas as pd

# Load in our tagged reports

reports = pd.read_csv("all_tagged_reports.csv")
reports.head()

| Unnamed: 0.1 | Unnamed: 0 | url | id | date | coroner | area | receiver | investigation | circumstances | concerns | spans_positive_cases | positive_cases |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | https://www.judiciary.uk/prevention-of-future-... | 2023-0489 | 2023-11-30 | A. Bhardwaj | Liverpool and the Wirral | NHS England & NHS Improvement (PFDs); Society ... | On 23 October 2023 I commenced an investigatio... | Katherine Sarah Flynn was a 34 year old lady w... | The case is a complex death where the immediat... | "Katherine Sarah FLYNN aged 34"; "Katherine Sa... | False |
| 1 | 1 | https://www.judiciary.uk/prevention-of-future-... | 2023-0490 | 2023-11-30 | J. Goulding | Sefton, Knowsley and St Helens | Manager; Abbey Wood Lodge Care Home | On 28 April 2023 I commenced an investigation ... | Julia Murphy (known as Sheila) sadly died on 0... | Julia had 21 falls, the final fall led to her ... | "Julia Murphy aged 89"; "Julia was 89 years of... | False |
| 2 | 2 | https://www.judiciary.uk/prevention-of-future-... | 2023-0493 | 2023-11-30 | J. Kearsley | Greater Manchester North | Chief Executive Northern Care Alliance; Chief ... | On the 23rd January 2023, I commenced an inves... | The deceased, Donna, had a long standing histo... | 1) There was a lack of understanding between t... |  | False |
| 3 | 3 | https://www.judiciary.uk/prevention-of-future-... | 2023-0484 | 2023-11-28 | J. Andrews | West Sussex, Brighton and Hove | Chief Executive University Hospitals Sussex NH... | On 13 April 2022 I commenced an investigation ... | Ann Dorothy Pearce was taken to the Princess R... | Evidence at the inquest revealed that the Veno... | "Ann Dorothy Pearce aged 61"; "died from a mas... | False |
| 4 | 4 | https://www.judiciary.uk/prevention-of-future-... | 2024-0065 | 2023-11-27 | J. Richards | County Durham and Darlington | Stanley Park Care Centre | On 21/09/2023 11:12 an investigation was comme... | Mrs Austin passed away at Stanley Park Care Ho... | 1. The deceased was known to be at high risk o... | "Margaret Austin, who was 90 years of age"; "d... | False |
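
The `Unnamed: 0.1` and `Unnamed: 0` columns in the output above look like row indices written out by earlier `to_csv` calls; that is an assumption, as the notebook does not say where they come from. If they are not needed, they could be dropped after loading. A minimal sketch on a small in-memory CSV (the `csv_text`, `toy` and `cleaned` names are illustrative, not part of the original notebook):

```python
import io

import pandas as pd

# Tiny stand-in for all_tagged_reports.csv, keeping only a few of its columns
csv_text = (
    "Unnamed: 0.1,Unnamed: 0,id,positive_cases\n"
    "0,0,2023-0489,False\n"
    "1,1,2023-0490,False\n"
)
toy = pd.read_csv(io.StringIO(csv_text))

# Keep every column whose name does not start with "Unnamed"
cleaned = toy.loc[:, ~toy.columns.str.startswith("Unnamed")]

print(list(cleaned.columns))
```

The same column filter could be applied to `reports` directly after `pd.read_csv`.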


In [2]:
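# Randomly sample 73 negative cases, one for each of the 73 positive cases.
# random_state fixes the seed so the same sample is drawn on every run.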

negative_cases = reports[reports["positive_cases"] == False]

sampled_negative_cases = negative_cases.sample(n=73, random_state=123)
len(sampled_negative_cases)

73
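
The sample size of 73 is hard-coded in the cell above. One way to keep the 1:1 ratio if the number of positive cases changes is to derive `n` from the data instead. A minimal sketch on toy data, assuming the flag column is boolean with no missing values; the `toy_*` names and values are illustrative and not taken from the corpus:

```python
import pandas as pd

# Toy stand-in for the tagged corpus (hypothetical ids and flags)
toy_reports = pd.DataFrame({
    "id": [f"toy-{i:04d}" for i in range(10)],
    "positive_cases": [True, False, False, True, False,
                       False, False, True, False, False],
})

# Split on the boolean flag
toy_positives = toy_reports[toy_reports["positive_cases"]]
toy_negatives = toy_reports[~toy_reports["positive_cases"]]

# Draw as many negatives as there are positives (1:1 ratio).
# sample() draws without replacement by default, so no report repeats.
n_positive = len(toy_positives)
toy_sampled_negatives = toy_negatives.sample(n=n_positive, random_state=123)

print(len(toy_positives), len(toy_sampled_negatives))
```

Note that `sample` raises a `ValueError` if `n` exceeds the number of available negatives, which is a useful guard in itself.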

In [4]:
# Combine the positive cases with the equally sized random sample of negative cases (1:1 ratio)

positive_cases = pd.read_csv('child_suicide_cases.csv')

dataset_for_annotation = pd.concat([positive_cases, sampled_negative_cases], 
                                   ignore_index=True)

len(dataset_for_annotation)

146
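
A row count of 146 is consistent with 73 positives and 73 negatives, but it does not by itself confirm the balance or rule out a report appearing twice. A short check along the following lines could do so. This is a sketch only: `check_balance` is a hypothetical helper, and it assumes both input files share `id` and `positive_cases` columns, which the notebook does not confirm for `child_suicide_cases.csv`.

```python
import pandas as pd


def check_balance(df, label_col="positive_cases", id_col="id"):
    """Return (n_positive, n_negative, n_duplicate_ids) for a labelled dataset."""
    labels = df[label_col].astype(bool)
    n_positive = int(labels.sum())
    n_negative = int((~labels).sum())
    # duplicated() marks every repeat of an id after its first occurrence
    n_duplicate_ids = int(df[id_col].duplicated().sum())
    return n_positive, n_negative, n_duplicate_ids


# Toy data standing in for dataset_for_annotation (hypothetical values)
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "positive_cases": [True, True, False, False],
})

n_pos, n_neg, n_dup = check_balance(toy)
assert n_pos == n_neg, "dataset is not balanced 1:1"
assert n_dup == 0, "a report appears more than once"
```

Calling `check_balance(dataset_for_annotation)` would be expected to return 73 positives and 73 negatives if the assumptions above hold.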

In [5]:
# Save dataset; index=False avoids writing the row index as an extra "Unnamed: 0" column
dataset_for_annotation.to_csv('data/dataset_for_annotation.csv', index=False)