# Evaluating the Scraper Module

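This notebook evaluates the Scraper module of `pfd_toolkit`. The first step is to build a ground truth dataset of 150 PFD (Prevention of Future Deaths) reports: we collect the URL and ID of every published report, then draw a random sample of 150 for manual annotation.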

In [1]:
from pfd_toolkit import PFDScraper, llm
from dotenv import load_dotenv 
import os

# Load OpenAI API key
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")
llm_client = llm.LLM(api_key=openai_api_key)

# Initialise Scraper
llm_scraper = PFDScraper(
    llm=llm_client,
    delay_range=None,
    
    # Only enable HTML scraping
    html_scraping=True,
    pdf_fallback=False,
    llm_fallback=False,
    
    # Only include id and URL
    include_id=True,
    include_url=True,
    
    # ...everything else is ignored:
    include_date=False,
    include_coroner=False,
    include_receiver=False,
    include_area=False,
    include_investigation=False,
    include_circumstances=False,
    include_concerns=False
)

llm_scraper.scrape_reports()
llm_scraper.reports

Consider enabling .pdf or LLM fallback for more complete data extraction.

This will disable delays between requests. This may trigger anti-scraping measures by the host, leading to temporary or permanent IP bans. 
We recommend setting to (1,2).

Fetching pages: 569 page(s) [03:18, 18.74 page(s)/s]
INFO:pfd_toolkit.scraper:Total collected report links: 5661
Scraping reports:  24%|██▍       | 1350/5661 [03:17<17:34,  4.09it/s]
INFO:pfd_toolkit.scraper:File https://www.judiciary.uk/wp-content/uploads/2023/03/John-Ibbotson-Prevention-of-future-deaths-report-2023-0093_Published.docx is not a .pdf (extension .docx)
INFO:pfd_toolkit.scraper:docx_conversion is set to 'None'; skipping conversion!
Scraping reports: 100%|██████████| 5661/5661 [16:27<00:00,  5.73it/s]


|      | URL | ID |
|------|-----|----|
| 0 | https://www.judiciary.uk/prevention-of-future-... | 2025-0189 |
| 1 | https://www.judiciary.uk/prevention-of-future-... | 2025-0185 |
| 2 | https://www.judiciary.uk/prevention-of-future-... | 2025-0182 |
| 3 | https://www.judiciary.uk/prevention-of-future-... | 2025-0188 |
| 4 | https://www.judiciary.uk/prevention-of-future-... | 2025-0178 |
| ... | ... | ... |
| 5656 | https://www.judiciary.uk/prevention-of-future-... | 2013-0175 |
| 5657 | https://www.judiciary.uk/prevention-of-future-... | 2013-0194 |
| 5658 | https://www.judiciary.uk/prevention-of-future-... | 2013-0176 |
| 5659 | https://www.judiciary.uk/prevention-of-future-... | 2013-0178 |
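
The warning in the output above recommends a `delay_range` of `(1,2)` rather than `None`. As a rough illustration of what a randomised delay between requests amounts to, here is a minimal, generic sketch. It is not `pfd_toolkit`'s implementation: the helper `pause_between_requests` and its `sleep` parameter are hypothetical names introduced only for this example.

```python
import random
import time
from typing import Callable, Optional, Tuple


def pause_between_requests(
    delay_range: Optional[Tuple[float, float]],
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep for a random duration drawn from delay_range and return it.

    Hypothetical helper: a delay_range of None means no pause at all.
    """
    if delay_range is None:
        return 0.0
    low, high = delay_range
    if low < 0 or high < low:
        raise ValueError("delay_range must satisfy 0 <= low <= high")
    duration = random.uniform(low, high)
    sleep(duration)
    return duration


# Demonstration with a stand-in sleep function so the example runs instantly:
# the requested duration is recorded instead of actually being slept.
recorded = []
waited = pause_between_requests((1, 2), sleep=recorded.append)
no_wait = pause_between_requests(None)
```

With `(1, 2)`, each request would be preceded by a pause of between one and two seconds, which trades a longer total run time for a lower risk of triggering anti-scraping measures.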


In [13]:
import pandas as pd

reports = llm_scraper.reports
reports.to_csv('../data/all_report_urls.csv')

reports_sample = reports.sample(n=150, random_state=22052025).reset_index(drop=True)
reports_sample.to_csv('../data/sample_report_urls.csv')
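
Two details of the sampling cell can be illustrated on toy data: a fixed `random_state` makes the draw repeatable, and `to_csv` with its default settings also writes the index, which pandas reads back as a column named `Unnamed: 0`. The sketch below uses made-up URLs and IDs rather than the real report list.

```python
import io

import pandas as pd

# Toy stand-in for the scraped reports (URLs and IDs are made up).
toy = pd.DataFrame({
    "URL": [f"https://example.org/report-{i}" for i in range(20)],
    "ID": [f"2025-{i:04d}" for i in range(20)],
})

# Same seed, same rows: the draw is repeatable across runs.
first = toy.sample(n=5, random_state=22052025).reset_index(drop=True)
second = toy.sample(n=5, random_state=22052025).reset_index(drop=True)
same_draw = first.equals(second)

# The default to_csv call writes the index as an unnamed first column.
with_index = pd.read_csv(io.StringIO(first.to_csv()))

# Passing index=False keeps only the real columns.
without_index = pd.read_csv(io.StringIO(first.to_csv(index=False)))
```

Whether to pass `index=False` when saving is a judgment call; keeping the index is harmless as long as later cells expect the extra column.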

(Between running the cell above and the cells below, `sample_report_urls.csv` was manually annotated with the 'ground truth' text for each report and renamed `ground_truth.csv`.)
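
Because the ground truth file is edited by hand, it may be worth confirming that it still covers exactly the sampled reports before using it. The following is a minimal sketch on toy data, assuming both tables keep the `ID` and `URL` columns shown earlier. The function `check_ground_truth` and the `concerns` column are hypothetical and stand in for whatever text fields were actually annotated.

```python
import pandas as pd


def check_ground_truth(sample: pd.DataFrame, truth: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the two tables agree."""
    problems = []
    if truth["ID"].duplicated().any():
        problems.append("duplicate IDs in ground truth")
    missing = sorted(set(sample["ID"]) - set(truth["ID"]))
    extra = sorted(set(truth["ID"]) - set(sample["ID"]))
    if missing:
        problems.append(f"IDs missing from ground truth: {missing}")
    if extra:
        problems.append(f"unexpected IDs in ground truth: {extra}")
    # Compare URLs only for IDs present in both tables.
    merged = sample.merge(truth, on="ID", suffixes=("_sample", "_truth"))
    changed = merged.loc[merged["URL_sample"] != merged["URL_truth"], "ID"].tolist()
    if changed:
        problems.append(f"URL changed for IDs: {changed}")
    return problems


# Toy data standing in for sample_report_urls.csv and ground_truth.csv.
sample = pd.DataFrame({
    "URL": [f"https://example.org/report-{i}" for i in range(1, 4)],
    "ID": ["2025-0001", "2025-0002", "2025-0003"],
})
truth_ok = sample.copy()
truth_ok["concerns"] = ["text a", "text b", "text c"]  # hypothetical annotation column
truth_bad = truth_ok.drop(index=2)  # simulate a row deleted by accident

ok_problems = check_ground_truth(sample, truth_ok)
bad_problems = check_ground_truth(sample, truth_bad)
```

In practice the two DataFrames would come from `pd.read_csv` on the two files; an empty result indicates that manual editing neither dropped, duplicated, nor altered any sampled report.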