# Evaluating the Scraper Module

In this notebook, we evaluate the Scraper module. To do so, we first build a ground truth dataset of 150 Prevention of Future Deaths (PFD) reports.

In [1]:
from pfd_toolkit import PFDScraper, llm
from dotenv import load_dotenv 
import os

# Load OpenAI API key
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")
llm_client = llm.LLM(api_key=openai_api_key)

# Initialise Scraper
llm_scraper = PFDScraper(
    llm=llm_client,
    delay_range=None,
    
    date_to="2025-05-22",
    
    # Only enable HTML scraping
    html_scraping=True,
    pdf_fallback=False,
    llm_fallback=False,
    
    # Only include id and URL
    include_id=True,
    include_url=True,
    
    # ...everything else is ignored:
    include_date=False,
    include_coroner=False,
    include_receiver=False,
    include_area=False,
    include_investigation=False,
    include_circumstances=False,
    include_concerns=False
)

llm_scraper.scrape_reports()
llm_scraper.reports

Consider enabling .pdf or LLM fallback for more complete data extraction.

This will disable delays between requests. This may trigger anti-scraping measures by the host, leading to temporary or permanent IP bans. 
We recommend setting to (1,2).

Fetching pages: 567 page(s) [00:49, 12.33 page(s)/s]
INFO:pfd_toolkit.scraper:Total collected report links: 5661
Scraping reports:  24%|██▍       | 1351/5661 [03:39<15:30,  4.63it/s]
INFO:pfd_toolkit.scraper:File https://www.judiciary.uk/wp-content/uploads/2023/03/John-Ibbotson-Prevention-of-future-deaths-report-2023-0093_Published.docx is not a .pdf (extension .docx)
INFO:pfd_toolkit.scraper:docx_conversion is set to 'None'; skipping conversion!
Scraping reports: 100%|██████████| 5661/5661 [20:49<00:00,  4.53it/s]


Unnamed: 0,URL,ID
0,https://www.judiciary.uk/prevention-of-future-...,2025-0180
1,https://www.judiciary.uk/prevention-of-future-...,2025-0184
2,https://www.judiciary.uk/prevention-of-future-...,2025-0188
3,https://www.judiciary.uk/prevention-of-future-...,2025-0187
4,https://www.judiciary.uk/prevention-of-future-...,2025-0189
...,...,...
5656,https://www.judiciary.uk/prevention-of-future-...,2013-0194
5657,https://www.judiciary.uk/prevention-of-future-...,2013-0175
5658,https://www.judiciary.uk/prevention-of-future-...,2013-0174
5659,https://www.judiciary.uk/prevention-of-future-...,2013-0171


In [2]:
import pandas as pd

reports = llm_scraper.reports
#reports.to_csv('../data/all_report_urls.csv')

reports_sample = reports.sample(n=150, random_state=22052025).reset_index(drop=True)
#reports_sample.to_csv('../data/sample_report_urls.csv')

(Between running the sampling cell above and the cells below, `sample_report_urls.csv` was manually annotated with the 'ground truth' extracted text and renamed `ground_truth.csv`.)
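
The sampling step is reproducible because `random_state` is fixed: drawing twice from the same DataFrame with the same seed returns the same rows. A minimal sketch of that behaviour, using a small stand-in table (the `toy_reports` name and its contents are hypothetical, not the scraped data):

```python
import pandas as pd

# Hypothetical stand-in for the scraped reports table
toy_reports = pd.DataFrame({"ID": [f"2025-{i:04d}" for i in range(1, 21)]})

# Two draws with the same seed select the same rows in the same order
first_draw = toy_reports.sample(n=5, random_state=22052025).reset_index(drop=True)
second_draw = toy_reports.sample(n=5, random_state=22052025).reset_index(drop=True)

print(first_draw.equals(second_draw))
```

The guarantee only holds for an identical input table. If the scrape were rerun later and returned more rows or a different row order, the same seed would likely pick different reports, which is one reason to keep the sampled URLs saved to disk.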

In [None]:
reports_sample.head()

NameError: name 'reports_sample' is not defined

In [4]:
import pandas as pd
labelled_reports = pd.read_csv('../data/labelled_reports.csv')
labelled_reports = labelled_reports.drop(columns=['Unnamed: 0', 'Notes, if applicable?'])
labelled_reports.head()

Unnamed: 0,URL,ID,Date,CoronerName,Area,Receiver,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern
0,https://www.judiciary.uk/prevention-of-future-...,2021-0396,2021-11-25,Chris Morris,Greater Manchester (South),Secretary of State for Social Care,"On 11th March 2020, Alison Mutch OBE, Senior C...","During autumn of 2019, Dr Dixon became unwell ...",The Court heard it was likely that an observat...
1,https://www.judiciary.uk/prevention-of-future-...,2017-0245,2017-8-29,Caroline Saunders,Gloucestershire,1. Tonic Construction Ltd \r\n2. Health and S...,"On the 7th June, the senior coroner commenced ...",Shaun Carter was employed by Tonic Constructio...,During the course of the inquest the evidence ...
2,https://www.judiciary.uk/prevention-of-future-...,2013-0183,2013-10-12,Nicholas Leslie Rheinberg,Cheshire,1. NHS England\r\n2. Castlefields Health Centre,On 10th August 2012 I commenced an investigati...,The deceased had a history of severe urinary t...,During the course of the inquest the evidence ...
3,https://www.judiciary.uk/prevention-of-future-...,2025-0127,2025-7-3,Kirsten Heaven,SWANSEA & NEATH PORT TALBOT,Chief Executive Swansea Bay University Health ...,On 18th February and 5th March 2025 I heard an...,On 18 May 2022 Jean Pike was stating that she ...,During the inquest the evidence revealed matte...
4,https://www.judiciary.uk/prevention-of-future-...,2023-0024,2023-01-11,Michael Walsh,West London,Bendpak Inc\r\nLiftmaster Ltd\r\nLiftmaster Se...,The inquest into the death of Mr Ashley Bullar...,Ashley died due to a vehicle leaving a Bendpak...,During the course of the inquest the evidence ...


In [6]:
# Create a blank copy of the labelled reports that keeps only the identifiers
blank_reports = labelled_reports.copy()

# Columns to leave untouched
exclude_cols = ['ID', 'URL']

# Overwrite every other column with the "N/A: Not found" placeholder
blank_reports.loc[:, ~blank_reports.columns.isin(exclude_cols)] = "N/A: Not found"

blank_reports.head()


Unnamed: 0,URL,ID,Date,CoronerName,Area,Receiver,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern
0,https://www.judiciary.uk/prevention-of-future-...,2021-0396,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found
1,https://www.judiciary.uk/prevention-of-future-...,2017-0245,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found
2,https://www.judiciary.uk/prevention-of-future-...,2013-0183,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found
3,https://www.judiciary.uk/prevention-of-future-...,2025-0127,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found
4,https://www.judiciary.uk/prevention-of-future-...,2023-0024,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found,N/A: Not found
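
The blanking step above amounts to overwriting every column except the identifiers with a placeholder. A small sketch of the same idea on a hypothetical two-row table, written with per-column assignment instead of a single `.loc` call (the `toy` table, `keep_cols` and `PLACEHOLDER` names are illustrative only):

```python
import pandas as pd

PLACEHOLDER = "N/A: Not found"

# Hypothetical miniature of the labelled table
toy = pd.DataFrame({
    "URL": ["https://example.org/a", "https://example.org/b"],
    "ID": ["2021-0001", "2021-0002"],
    "Date": ["2021-11-25", "2017-08-29"],
    "Area": ["Cheshire", "Avon"],
})

keep_cols = ["ID", "URL"]
blank = toy.copy()

# Overwrite every column that is not in keep_cols
for col in blank.columns.difference(keep_cols):
    blank[col] = PLACEHOLDER

print(blank)
```

Assigning whole columns replaces them outright, so the result does not depend on the original column dtypes. Working on a copy leaves the labelled table intact for the later comparison.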


In [9]:
from pfd_toolkit import PFDScraper, llm
from dotenv import load_dotenv 
import os

# Load OpenAI API key
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")

# Set LLM client
llm_client = llm.LLM(api_key=openai_api_key,
                     model="gpt-4o-mini")

# Set scraper object
scraper = PFDScraper(
    llm=llm_client
)

# 'Trick' the scraper into treating our blank reports as the result of a completed scrape
scraper.reports = blank_reports

# Run the LLM scraper
scraper.run_llm_fallback()

# Show reports
scraper.reports

Running LLM Fallback: 100%|██████████| 150/150 [51:20<00:00, 20.54s/it] 


Unnamed: 0,URL,ID,Date,CoronerName,Area,Receiver,InvestigationAndInquest,CircumstancesOfDeath,MattersOfConcern
0,https://www.judiciary.uk/prevention-of-future-...,2021-0396,2021-11-25,Chris Morris,Greater Manchester (South),"Rt. Hon. Sajid Javid, Secretary of State for H...","On 11th March 2020, Alison Mutch OBE, Senior C...","During autumn of 2019, Dr Dixon became unwell ...",The MATTERS OF CONCERN are as follows. -\n\nDu...
1,https://www.judiciary.uk/prevention-of-future-...,2017-0245,2017-08-29,Caroline Saunders,Gloucestershire,"Tonic Construction Ltd, Health and Safety exec...","On the 7th June, the senior coroner commenced ...",Shaun Carter was employed by Tonic Constructio...,During the course of the inquest the evidence ...
2,https://www.judiciary.uk/prevention-of-future-...,2013-0183,2013-10-12,Nicholas Leslie Rheinberg,Cheshire,"NHS England, Castlefields Health Centre",On 10th August 2012 I commenced an investigati...,The deceased had a history of severe urinary t...,During the course of the inquest the evidence ...
3,https://www.judiciary.uk/prevention-of-future-...,2025-0127,2025-03-07,Kirsten Heaven,SWANSEA & NEATH PORT TALBOT,Chief Executive Swansea Bay University Health ...,On 18th February and 5th March 2025 I heard an...,On 18 May 2022 Jean Pike was stating that she ...,During the inquest the evidence revealed matte...
4,https://www.judiciary.uk/prevention-of-future-...,2023-0024,2023-01-11,Michael Walsh,West London,"Ashley’s family, Liftmaster Ltd and Liftmaster...",The inquest into the death of Mr Ashley Bullar...,Ashley died due to a vehicle leaving a Bendpak...,During the course of the inquest the evidence ...
...,...,...,...,...,...,...,...,...,...
145,https://www.judiciary.uk/prevention-of-future-...,2014-0577,2014-12-18,Mr D. M. Salter,Oxfordshire,"Mr Graham Dalton, Chief Executive, Highways Ag...",On 26 June 2014 I opened an Inquest into the d...,The circumstances are briefly set out above bu...,During the course of the Inquest the evidence ...
146,https://www.judiciary.uk/prevention-of-future-...,2022-0243,2022-08-04,Alison Mutch,Greater Manchester South,[REDACTED],On 10th March 2022 I commenced an investigatio...,Margaret Ena Warwick lived independently. She ...,During the course of the inquest the evidence ...
147,https://www.judiciary.uk/prevention-of-future-...,2019-0374,2019-11-06,David Urpeth,South Yorkshire West,"1. Practice Manager, Upwell Street Surgery\n2....","On 25.4.19, an investigation into the death of...",On the 18th April 2019 Royal Hallamshire Hospi...,During the course of the inquest the evidence ...
148,https://www.judiciary.uk/prevention-of-future-...,2014-0455,2014-10-21,M. E. Voisin,Avon,Royal College of Obstetricians and Gynaecologists,On 1st May 2014 I commenced an investigation i...,Elsie Plumb was born full term. Her mother dur...,During the course of the inquest the evidence ...


In [10]:
# Save
scraped_reports = scraper.reports
scraped_reports.to_csv('../data/scraped_reports.csv')
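
Note that `to_csv` writes the DataFrame index by default, which is the likely origin of the `Unnamed: 0` column dropped when the labelled reports were loaded. A short sketch of the difference, using an in-memory buffer and a hypothetical two-row table instead of the real files:

```python
import io
import pandas as pd

toy = pd.DataFrame({"ID": ["2021-0001", "2021-0002"], "Area": ["Cheshire", "Avon"]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'ID', 'Area']
print(cols_without_index)  # ['ID', 'Area']
```

Passing `index=False` when saving would keep the scraped and labelled files on the same set of columns, assuming the index carries no information that is needed later.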

## Evaluating the output

