## Scraper demo

Changes to the PFDScraper module should only be pushed to `main` if the code below runs without issue.

Users can run the scraper through several different workflows. Each of them must remain fully functional after any change to the `scraper.py` **or** `llm.py` modules.

### Workflow 1

Here the user wants to run the scraper without calling the LLM at all. Importing the LLM module should therefore not be required, and no errors or warnings should be raised for any LLM-specific methods or attributes.

In PFDScraper, HTML and .pdf scraping are enabled by default. LLM scraping is disabled by default.
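One way to check the "no warnings" requirement is to record warnings while the workflow runs. The sketch below is a generic helper, not part of `pfd_toolkit`; it assumes any LLM-related warnings are issued through Python's `warnings` module (messages that are printed or logged would not be caught). `quiet_run` and `noisy_run` are hypothetical stand-ins for a real scraper call.

```python
import warnings


def run_without_warnings(func, *args, **kwargs):
    """Call func and raise AssertionError if it emitted any warnings."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        result = func(*args, **kwargs)
    if caught:
        messages = [str(w.message) for w in caught]
        raise AssertionError(f"Unexpected warnings: {messages}")
    return result


# Hypothetical stand-ins for a scraper run; replace with the real call.
def quiet_run():
    return "ok"


def noisy_run():
    warnings.warn("LLM client not configured")
    return "ok"


print(run_without_warnings(quiet_run))
```

In practice, the callable passed in would wrap the Workflow 1 calls, for example a small function that builds the scraper and calls `scrape_reports()`.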

```python
from pfd_toolkit import PFDScraper  # No LLM is imported
# Not needed for Workflow 1:
# from dotenv import load_dotenv
# import os

# Workflow 1
workflow1_scraper = PFDScraper(
    date_from="2024-01-01",
    date_to="2024-01-09"
)

workflow1_scraper.scrape_reports()
workflow1_scraper.reports
```

```text
INFO:pfd_toolkit.scraper:Total collected report links: 10
Scraping reports: 100%|██████████| 10/10 [00:01<00:00,  7.11it/s]
```

|  | URL | ID | Date | CoronerName | Area | Receiver | InvestigationAndInquest | CircumstancesOfDeath | MattersOfConcern |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.judiciary.uk/prevention-of-future-... | 2024-0007 | 2024-01-04 | Ian Potter | Inner North London | Chief Executive Officer Office for Product Saf... | On 31 July 2023, an investigation was commence... | Mr Lee died at home on 6 July 2023 from the ef... | During the course of the inquest the evidence ... |
| 1 | https://www.judiciary.uk/prevention-of-future-... | 2024-0005 | 2023-12-20 | Darren Stewart | Hampshire, Portsmouth and Southampton | 1. CEO, Frimley Health NHS Foundation Trust (F... | On 4th April 2018 I commenced an investigation... | The circumstances of the death are recorded in... | During the course of the inquest the evidence ... |
| 2 | https://www.judiciary.uk/prevention-of-future-... | 2024-0003 | 2024-01-02 | Rebecca Ollivere | Birmingham and Solihull | 1. Birmingham City Council, 2. Connaught House... | On 24 April 2023, I commenced an investigation... | On 11th March 2023, the deceased fell at The O... | During the course of the inquest the evidence ... |
| 3 | https://www.judiciary.uk/prevention-of-future-... | 2024-0002 | 2024-01-02 | Sean Cummings | Bedfordshire and Luton | 1 2 Kirby Road Surgery 1 | On 14 June 2023 I commenced an investigation i... | Mrs Ebanks lived alone, she had been self negl... | During the course of the investigation my inqu... |
| 4 | https://www.judiciary.uk/prevention-of-future-... | 2024-0006 | 2024-01-04 | Lauren Costello | Manchester South | 1. Secretary of State for Health and Social Ca... | On 24th May 2023 an investigation was commence... | Mrs Roberts was severely frail and bedbound wi... | During the course of the inquest, the evidence... |
| 5 | https://www.judiciary.uk/prevention-of-future-... | 2023-0549 | 2023-12-29 | Deborah Lakin | Coventry and Warwickshire | 1. , Chief Medical Officer, South Warwickshire... | On 6 July 2023 I commenced an investigation in... | 1. Mr Guillaume was admitted to Warwick Hospit... | During the course of the inquest the evidence ... |
| 6 | https://www.judiciary.uk/prevention-of-future-... | 2024-0004 | 2024-01-03 | Lorraine Harris | East Riding and Hull | 1. The Minister for Health – 1 | N/A: Not found | On 30th October 2023 Mr HOLGATE age 89 years w... | The MATTERS OF CONCERN are as follows. – This ... |
| 7 | https://www.judiciary.uk/prevention-of-future-... | 2023-0550 | 2023-12-29 | Michael Pemberton | Black Country | Sandwell and Birmingham NHS Trust 1 | On 26 September 2022 an investigation was comm... | On 23rd September 2022, Karmchand was taken fr... | During the course of the inquest the evidence ... |
| 8 | https://www.judiciary.uk/prevention-of-future-... | 2023-0548 | 2023-12-21 | Hannah Hinton | West London | The Executive Director of Oxleas NHS Foundatio... | On 23 February 2023 I commenced an investigati... | On 19th February 2023, the deceased jumped on ... | British Transport Police made a referral to th... |
| 9 | https://www.judiciary.uk/prevention-of-future-... | 2024-0001 | 2023-12-12 | Marianne Johnson | North Lincolnshire and Grimsby | 1 Navigo 1 | On 07 March 2022 I commenced an investigation ... | Reece Nelson was found hanging [REDACTED] on t... | During the course of the investigation my inqu... |


The output above should be largely complete. The one expected gap is the `InvestigationAndInquest` field for report `2024-0004`, which should show `N/A: Not found`.
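The expectation above can be turned into a small regression check. The sketch below is a hypothetical helper, not part of `pfd_toolkit`: it works on a list of row dictionaries and compares the gaps it finds against the gaps we expect. It assumes `reports` is a pandas DataFrame, so real rows could be obtained with `workflow1_scraper.reports.to_dict("records")`.

```python
MISSING = "N/A: Not found"  # sentinel string shown in the output above


def unexpected_gaps(records, expected, id_column="ID"):
    """Compare the (ID, column) gaps found in records with the expected set.

    Returns (unexpected, resolved): gaps that appeared but were not expected,
    and expected gaps that are no longer present.
    """
    found = {
        (row[id_column], column)
        for row in records
        for column, value in row.items()
        if value == MISSING
    }
    return found - expected, expected - found


# Two rows abbreviated from the Workflow 1 output; text fields are truncated.
sample = [
    {"ID": "2024-0007", "InvestigationAndInquest": "On 31 July 2023, an investigation was commence..."},
    {"ID": "2024-0004", "InvestigationAndInquest": "N/A: Not found"},
]
expected = {("2024-0004", "InvestigationAndInquest")}

unexpected, resolved = unexpected_gaps(sample, expected)
print(unexpected, resolved)  # both sets are empty when the output matches
```

A non-empty first set points to a new gap introduced by a change; a non-empty second set means a previously missing field is now being extracted, and the expectation should be updated.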

### Workflow 2

This is where the user wishes to run *only* the LLM fallback for PFDScraper. This means that HTML and .pdf scraping will be disabled.

```python
from pfd_toolkit import PFDScraper, LLM  # LLM now imported
from dotenv import load_dotenv
import os

# Load OpenAI API key
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")
llm_client = LLM(api_key=openai_api_key, max_workers=50, model='gpt-4.1-mini')

# Workflow 2
workflow2_scraper = PFDScraper(
    llm=llm_client,
    category="suicide",
    date_from="2020-01-10",
    date_to="2020-03-11",
    html_scraping=False,
    pdf_fallback=False,
    llm_fallback=True,
    delay_range=None
)
workflow2_scraper.scrape_reports()
workflow2_scraper.reports
```

```text
While this is a high-performance option, large API costs may be incurred, especially for large requests.
Consider enabling HTML scraping or .pdf fallback for more cost-effective data extraction.

This will disable delays between requests. This may trigger anti-scraping measures by the host, leading to temporary or permanent IP bans.
We recommend setting to (1,2).

INFO:pfd_toolkit.scraper:Total collected report links: 12
Scraping reports: 100%|██████████| 12/12 [00:00<00:00, 13.00it/s]
LLM fallback (parallel): 100%|██████████| 12/12 [00:34<00:00,  2.88s/it]
```

|  | URL | ID | Date | CoronerName | Area | Receiver | InvestigationAndInquest | CircumstancesOfDeath | MattersOfConcern |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.judiciary.uk/prevention-of-future-... | N/A: Not found | 2020-03-03 | Nadia Persaud | East London | Professor Oliver Shanley, Interim Chief Execut... | On the 23rd October 2019 I commenced an invest... | See narrative conclusion in box 3 for detail. | The matter of concern during the course of the... |
| 1 | https://www.judiciary.uk/prevention-of-future-... | N/A: Not found | 2020-02-17 | Caroline Beasley-Murray | Essex | Professor Stephen Powis National Medical Direc... | On 3 January 2020 I commenced an investigation... | Joseph James Gingell had suffered from mental ... | During the course of the inquest the evidence ... |
| 2 | https://www.judiciary.uk/prevention-of-future-... | N/A: Not found | 2020-02-25 | Veronica Hamilton-Deeley | Brighton and Hove | Chief Constable Giles York - Sussex Police\n[R... | On 9th October 2019 I commenced an investigati... | See Record of Inquest | During the course of the inquest the evidence ... |
| 3 | https://www.judiciary.uk/prevention-of-future-... | N/A: Not found | 2020-02-21 | Simon N Burge | Central Hampshire | 1 Chief Executive - Central and North West Lon... | On 27th January 2020 I commenced an investigat... | Mr Goldstraw was found hanging in cell B3-03 a... | The MATTERS OF CONCERNS are as follows:\n\nA. ... |
| 4 | https://www.judiciary.uk/prevention-of-future-... | N/A: Not found | 2020-03-03 | Alison Mutch | Greater Manchester South | Secretary of State for Health | On 22nd July 2019, I commenced an investigatio... | On 20th July Shaun Lea Turner was found unresp... | During the course of the inquest the evidence ... |
| 5 | https://www.judiciary.uk/prevention-of-future-... | N/A: Not found | 2020-01-20 | Graeme Hughes | South Wales Central | The Chief Constable of South Wales Police & Th... | On 04/04/2019 I commenced an investigation int... | SWP officers attended room 161 at the Village ... | During the course of the inquest the evidence ... |
| 6 | https://www.judiciary.uk/prevention-of-future-... | N/A: Not found | 2020-01-20 | Andrew Bridgman | South Manchester | Ms Claire Molloy, Chief Executive, Pennine Car... | On 24.05.18 an investigation was commenced int... | About 4 years prior, Samantha suffered an epis... | During the course of the inquest the evidence ... |
| 7 | https://www.judiciary.uk/prevention-of-future-... | N/A: Not found | 2020-01-10 | Oliver Robert Longstaff | West Yorkshire (Western District) | 1. Kirklees Council 2. Highways England | On 17th September 2018 I commenced an investig... | On 12th September 2018, Mr Wajid was observed ... | During the course of the inquest the evidence ... |
| 8 | https://www.judiciary.uk/prevention-of-future-... | N/A: Not found | 2020-01-14 | Andrew A Haigh | Staffordshire (South) | Head of Healthcare HMP Dovegate Marchington Ut... | I make this report under paragraph 7 Schedule ... | N/A: Not found | During the course of the inquest the evidence ... |
| 9 | https://www.judiciary.uk/prevention-of-future-... | N/A: Not found | 2020-02-11 | Bridget Dolan QC | West Sussex | The Chief Executive, the Sussex Community NHS ... | On 05 November 2019 I commenced an investigati... | On 9 August 2109 Gemma Azhar had self-referred... | The MATTERS OF CONCERNS are as follows:\n\nTho... |


Every field above should be populated except the ID column. The ID appears only in the HTML metadata, not in the reports themselves, so the LLM scraper alone can never extract it. In the run shown, row 8 also has `N/A: Not found` under `CircumstancesOfDeath`.

### Workflow 3

In this workflow, the user runs the HTML and .pdf components of the scraper, inspects the resulting dataframe, and then runs the LLM fallback separately.

We account for this workflow because API costs are difficult to know ahead of time. In Workflow 3, the user can run the local, zero-cost extraction methods first, then check how much data is still missing and what the API costs will be.

```python
from pfd_toolkit import PFDScraper, llm
from dotenv import load_dotenv
import os

# Load OpenAI API key
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")
llm_client = llm.LLM(api_key=openai_api_key, max_workers=20, model='gpt-4.1-mini')

# Workflow 3
workflow3_scraper = PFDScraper(
    llm=llm_client,
    category="care_home",
    date_from="2020-01-01",
    date_to="2020-06-30",
    html_scraping=True,
    pdf_fallback=True,
    llm_fallback=False
)
workflow3_scraper.scrape_reports()
workflow3_scraper.reports
```

```text
INFO:pfd_toolkit.scraper:Total collected report links: 13
Scraping reports: 100%|██████████| 13/13 [00:01<00:00, 11.04it/s]
```

|  | URL | ID | Date | CoronerName | Area | Receiver | InvestigationAndInquest | CircumstancesOfDeath | MattersOfConcern |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.judiciary.uk/prevention-of-future-... | 2020-0109 | 2020-02-12 | Paul Cooper | Lincolnshire | 1 Glenholme Holdingham Grange Care Home 1 | On 05/03/2019 I commenced an investigation int... | The deceased was cared for at Glenholme Holdin... | The MATTERS OF CONCERNS are as follows: (brief... |
| 1 | https://www.judiciary.uk/prevention-of-future-... | 2020-0098 | 2020-04-22 | Chris Morris | Manchester South | Lynmere Nursing home | On 28th August 2019, Alison Mutch OBE, Senior ... | N/A: Not found | During the course of the inquest the evidence ... |
| 2 | https://www.judiciary.uk/prevention-of-future-... | 2020-0039 | 2020-02-24 | Yvonne Blake | Norfolk | Select Healthcare | N/A: Not found | N/A: Not found | N/A: Not found |
| 3 | https://www.judiciary.uk/prevention-of-future-... | 2020-0105 | 2020-04-24 | Alison Mutch | Greater Manchester South | Chief Executive of the Care Quality Commission... | On 22nd January 2019 I commenced an investigat... | Mary Brady moved to reside at Balmoral Care Ho... | During the course of the inquest the evidence ... |
| 4 | https://www.judiciary.uk/prevention-of-future-... | 2020-0088 | 2020-02-27 | Emma Serrano | Derby and Derbyshire | N/A: Not found | N/A: Not found | N/A: Not found | N/A: Not found |
| 5 | https://www.judiciary.uk/prevention-of-future-... | 2020-0001 | 2020-01-03 | Chris Morris | Manchester (South) | N/A: Not found | N/A: Not found | N/A: Not found | N/A: Not found |
| 6 | https://www.judiciary.uk/prevention-of-future-... | 2020-0086 | 2020-04-03 | Joanne Lees | Black Country | 1. Oak Court House, Oaks Crescent, Wolverhampt... | On 7/1/20 I commenced an investigation into th... | i) On 2nd October 2019 the deceased was admitt... | During the inquest the evidence revealed matte... |
| 7 | https://www.judiciary.uk/prevention-of-future-... | 2019-0499 | 2019-09-06 | Zafar Siddique | Black Country | 1. The Secretary Of State, Department of Healt... | On the 21 January 2019, I commenced an investi... | i) Ms Shannon Quinn (SQ) was a 24 year old wom... | During the course of the inquest the evidence ... |
| 8 | https://www.judiciary.uk/prevention-of-future-... | 2019-0454 | 2019-12-24 | Alison Mutch | Manchester (South) | N/A: Not found | N/A: Not found | N/A: Not found | N/A: Not found |
| 9 | https://www.judiciary.uk/prevention-of-future-... | 2019-0452 | 2019-12-24 | Andrew Haigh | Staffordshire (South) | N/A: Not found | N/A: Not found | N/A: Not found | N/A: Not found |


Running the above gives a moderately well-populated dataframe with many gaps, especially in the three long text sections (`InvestigationAndInquest`, `CircumstancesOfDeath`, and `MattersOfConcern`).
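To put numbers on those gaps, the missing values can be counted per column. This is a minimal sketch using only the standard library; `gaps_per_column` is a hypothetical helper, and it assumes `reports` is a pandas DataFrame so that real rows could come from `workflow3_scraper.reports.to_dict("records")`.

```python
from collections import Counter

MISSING = "N/A: Not found"  # sentinel string shown in the output above


def gaps_per_column(records):
    """Count how many rows are missing a value in each column."""
    counts = Counter()
    for row in records:
        for column, value in row.items():
            if value == MISSING:
                counts[column] += 1
    return counts


# Three rows abbreviated from the Workflow 3 output; text fields are truncated.
sample = [
    {"ID": "2020-0098", "Receiver": "Lynmere Nursing home",
     "InvestigationAndInquest": "On 28th August 2019, Alison Mutch OBE, Senior ...",
     "CircumstancesOfDeath": MISSING,
     "MattersOfConcern": "During the course of the inquest the evidence ..."},
    {"ID": "2020-0039", "Receiver": "Select Healthcare",
     "InvestigationAndInquest": MISSING,
     "CircumstancesOfDeath": MISSING,
     "MattersOfConcern": MISSING},
    {"ID": "2020-0088", "Receiver": MISSING,
     "InvestigationAndInquest": MISSING,
     "CircumstancesOfDeath": MISSING,
     "MattersOfConcern": MISSING},
]

for column, count in gaps_per_column(sample).most_common():
    print(f"{column}: {count} missing")
```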

We can now estimate the API cost before running the LLM fallback. The estimate for this run is $0.21, so we go ahead and run the fallback.
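These notes do not show how the estimate is produced, so the following is only an illustration of the kind of arithmetic involved, not the method used by `pfd_toolkit`. The function name, the token counts, and the prices are all placeholders chosen for the example; only the report count comes from the run above.

```python
def estimate_llm_cost(n_reports, input_tokens, output_tokens,
                      usd_per_million_input, usd_per_million_output):
    """Rough cost in USD: per-report token counts times per-token prices."""
    per_report = (input_tokens * usd_per_million_input
                  + output_tokens * usd_per_million_output) / 1_000_000
    return n_reports * per_report


# Everything except the report count is a placeholder, not real pricing.
cost = estimate_llm_cost(
    n_reports=13,                  # report links collected in the run above
    input_tokens=6_000,            # assumed average prompt size per report
    output_tokens=800,             # assumed average response size per report
    usd_per_million_input=0.50,    # placeholder price
    usd_per_million_output=2.00,   # placeholder price
)
print(f"${cost:.2f}")
```

With real figures, the token counts would come from the report texts and the prices from the provider's current price list for the chosen model.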

```python
workflow3_scraper.run_llm_fallback()
workflow3_scraper.reports
```

```text
LLM fallback (parallel): 100%|██████████| 13/13 [00:38<00:00,  2.94s/it]
```

|  | URL | ID | Date | CoronerName | Area | Receiver | InvestigationAndInquest | CircumstancesOfDeath | MattersOfConcern |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.judiciary.uk/prevention-of-future-... | 2020-0109 | 2020-02-12 | Paul Cooper | Lincolnshire | 1 Glenholme Holdingham Grange Care Home 1 | On 05/03/2019 I commenced an investigation int... | The deceased was cared for at Glenholme Holdin... | The MATTERS OF CONCERNS are as follows: (brief... |
| 1 | https://www.judiciary.uk/prevention-of-future-... | 2020-0098 | 2020-04-22 | Chris Morris | Manchester South | Lynmere Nursing home | On 28th August 2019, Alison Mutch OBE, Senior ... | Mr Baxter had been slowing down and showing si... | During the course of the inquest the evidence ... |
| 2 | https://www.judiciary.uk/prevention-of-future-... | 2020-0039 | 2020-02-24 | Yvonne Blake | Norfolk | Select Healthcare | On 5 July 2019, I commenced an investigation i... | Mr Lee suffered an infarct of his spinal cord ... | During the course of the inquest the evidence ... |
| 3 | https://www.judiciary.uk/prevention-of-future-... | 2020-0105 | 2020-04-24 | Alison Mutch | Greater Manchester South | Chief Executive of the Care Quality Commission... | On 22nd January 2019 I commenced an investigat... | Mary Brady moved to reside at Balmoral Care Ho... | During the course of the inquest the evidence ... |
| 4 | https://www.judiciary.uk/prevention-of-future-... | 2020-0088 | 2020-02-27 | Emma Serrano | Derby and Derbyshire | 1. Normanton Village View Nursing Home, Derby;... | On the 3rd August 2017, I commenced an investi... | i) Mr Clarke was a 74 year old gentleman who w... | During the course of the inquest the evidence ... |
| 5 | https://www.judiciary.uk/prevention-of-future-... | 2020-0001 | 2020-01-03 | Chris Morris | Manchester (South) | 1) Sir Andrew Dillon, Chief Executive, Nationa... | On 12th February 2018, I opened an inquest int... | Mr Wheeler had a history of cerebral palsy and... | During the course of the inquest the evidence ... |
| 6 | https://www.judiciary.uk/prevention-of-future-... | 2020-0086 | 2020-04-03 | Joanne Lees | Black Country | 1. Oak Court House, Oaks Crescent, Wolverhampt... | On 7/1/20 I commenced an investigation into th... | i) On 2nd October 2019 the deceased was admitt... | During the inquest the evidence revealed matte... |
| 7 | https://www.judiciary.uk/prevention-of-future-... | 2019-0499 | 2019-09-06 | Zafar Siddique | Black Country | 1. The Secretary Of State, Department of Healt... | On the 21 January 2019, I commenced an investi... | i) Ms Shannon Quinn (SQ) was a 24 year old wom... | During the course of the inquest the evidence ... |
| 8 | https://www.judiciary.uk/prevention-of-future-... | 2019-0454 | 2019-12-24 | Alison Mutch | Manchester (South) | Secretary of State for Health and the Greater ... | On 25th September 2018, I commenced an investi... | Julie Helen Taylor had Downs Syndrome and cons... | During the course of the inquest the evidence ... |
| 9 | https://www.judiciary.uk/prevention-of-future-... | 2019-0452 | 2019-12-24 | Andrew Haigh | Staffordshire (South) | Hunters Lodge Care Home Hollybush Lane Codsall... | On 23 October 2019 I commenced an investigatio... | Keith had an unwitnessed fall in his room at H... | During the course of the inquest the evidence ... |


Now every field should be populated with no gaps.
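A before/after comparison makes this check explicit: it lists which cells the fallback filled and flags any previously populated cell whose value changed. The sketch below is a hypothetical helper, not part of `pfd_toolkit`. It assumes rows can be matched by report ID and that a copy of the records was saved before calling `run_llm_fallback()`. Whether the fallback is meant to leave existing values untouched is also an assumption, based on the two tables above.

```python
MISSING = "N/A: Not found"  # sentinel string shown in the output above


def compare_runs(before, after, id_column="ID"):
    """Return (filled, changed) cells between two runs, matched by report ID.

    filled:  cells that were missing before and have a value now.
    changed: cells that already had a value but differ afterwards.
    """
    after_by_id = {row[id_column]: row for row in after}
    filled, changed = [], []
    for old in before:
        new = after_by_id[old[id_column]]
        for column, old_value in old.items():
            new_value = new[column]
            if old_value == MISSING and new_value != MISSING:
                filled.append((old[id_column], column))
            elif old_value != MISSING and new_value != old_value:
                changed.append((old[id_column], column))
    return filled, changed


# One row abbreviated from the Workflow 3 outputs, before and after the fallback.
before = [{"ID": "2020-0098", "Receiver": "Lynmere Nursing home",
           "CircumstancesOfDeath": MISSING}]
after = [{"ID": "2020-0098", "Receiver": "Lynmere Nursing home",
          "CircumstancesOfDeath": "Mr Baxter had been slowing down and showing si..."}]

filled, changed = compare_runs(before, after)
print(filled)   # [('2020-0098', 'CircumstancesOfDeath')]
print(changed)  # []
```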

### Workflow 4

This final workflow is the one we expect to be most common: the user runs the HTML, .pdf and LLM scraping methods in one go.

```python
from pfd_toolkit import PFDScraper, llm
from dotenv import load_dotenv
import os

# Load OpenAI API key
load_dotenv("api.env")
openai_api_key = os.getenv("OPENAI_API_KEY")
llm_client = llm.LLM(api_key=openai_api_key, max_workers=30, model='gpt-4.1-mini')

# Workflow 4
workflow4_scraper = PFDScraper(
    llm=llm_client,
    category="emergency_services",
    date_from="2024-01-01",
    date_to="2024-06-30",
    html_scraping=True,
    pdf_fallback=True,
    llm_fallback=True,
    delay_range=None
)
workflow4_scraper.scrape_reports()
workflow4_scraper.reports
```

```text
This will disable delays between requests. This may trigger anti-scraping measures by the host, leading to temporary or permanent IP bans.
We recommend setting to (1,2).

INFO:pfd_toolkit.scraper:Total collected report links: 25
Scraping reports: 100%|██████████| 25/25 [00:03<00:00,  8.23it/s]
LLM fallback (parallel): 100%|██████████| 25/25 [00:11<00:00,  2.20it/s]
```

|  | URL | ID | Date | CoronerName | Area | Receiver | InvestigationAndInquest | CircumstancesOfDeath | MattersOfConcern |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.judiciary.uk/prevention-of-future-... | 2024-0275 | 2024-05-20 | Caroline Saunders | Gwent | The Chief Executive of Aneurin Bevan Universit... | On 18/09/2023, an investigation was opened tou... | On 05/09/2023, Sylvia Eileen Evans sustained a... | The MATTERS OF CONCERN are as follows: - Sylvi... |
| 1 | https://www.judiciary.uk/prevention-of-future-... | 2024-0238 | 2023-05-22 | Peter Taheri | Suffolk | 1) The Right Honourable Steve Barclay MP Secre... | On 15th October 2021 an investigation was comm... | The Jury's answer to how, when, where and in w... | During the course of the inquest the evidence ... |
| 2 | https://www.judiciary.uk/prevention-of-future-... | 2024-0319 | 2024-06-17 | Edward Ramsay | Swansea Neath and Port Talbot | 1. , CHIEF EXECUTIVE, WELSH AMBULANCE SERVICES... | On 7 JULY 2020 the Senior Coroner commenced an... | At the time of his death STEFAN was a detained... | During the course of the inquest the evidence ... |
| 3 | https://www.judiciary.uk/prevention-of-future-... | 2024-0304 | 2024-06-05 | Alison Mutch | Manchester South | 1) NHS England 1 | On 25th October 2023 I commenced an investigat... | On 13th October 2023 at about 20:23 Bernard Co... | During the course of the inquest the evidence ... |
| 4 | https://www.judiciary.uk/prevention-of-future-... | 2024-0246 | 2024-02-16 | David Reid | Worcestershire | 1) , Chief Executive, West Midlands Ambulance ... | On 17 November 2021 I commenced an investigati... | See above. | During the course of the inquest the evidence ... |
| 5 | https://www.judiciary.uk/prevention-of-future-... | 2024-0307 | 2024-06-06 | James Bennett | Birmingham and Solihull | (1) , Chief Executive, West Midlands Ambulance... | On 24/04/23 I commenced an investigation into ... | On 04/04/22 at a dialysis session Mr Fray was ... | During the course of the inquest the evidence ... |
| 6 | https://www.judiciary.uk/prevention-of-future-... | 2024-0250 | 2024-05-08 | Lauren Costello | Manchester South | 1. Secretary of State for Health and Social Ca... | On 24th May 2023 an investigation was commence... | Mrs Mulonge had multiple co-morbidities includ... | During the course of the inquest, the evidence... |
| 7 | https://www.judiciary.uk/prevention-of-future-... | 2024-0337 | 2024-06-24 | Jonathan Dixey | Northamptonshire | (1) EAST MIDLANDS AMBULANCE SERVICE NHS TRUST ... | On 5th April 2023 an investigation was commenc... | At around 23.23 on 1st April 2023 Liam Paul Mc... | During the course of the inquest the evidence ... |
| 8 | https://www.judiciary.uk/prevention-of-future-... | 2024-0158 | 2024-03-20 | Hannah Berry | South Yorkshire West | 1. The Department of Health and Social Care, 3... | On 19 December 2023 I commenced an investigati... | On 4 November 2022 Mrs Walker called her daugh... | During the course of the inquest the evidence ... |
| 9 | https://www.judiciary.uk/prevention-of-future-... | 2024-0231 | 2024-04-29 | Hannah Berry | South Yorkshire West | The Department of Health and Social Care, 39 V... | On 18 September 2023 I commenced an investigat... | Sophie had complex needs and required full tim... | During the course of the inquest the evidence ... |


And, as before, all fields should be populated with no gaps.
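Workflows 2 and 4 pass `delay_range=None`, and the printed warning recommends `(1, 2)` instead. For readers unfamiliar with the idea, the sketch below shows what a randomised pause between requests typically looks like. It is a generic illustration, not PFDScraper's internal implementation; `polite_pause` is a hypothetical name, and the units are assumed to be seconds.

```python
import random
import time


def polite_pause(delay_range, sleep=time.sleep):
    """Sleep for a random duration within delay_range; skip when it is None."""
    if delay_range is None:
        return 0.0
    low, high = delay_range
    seconds = random.uniform(low, high)
    sleep(seconds)
    return seconds


# The sleep function is swapped out here so the demo returns immediately.
waited = polite_pause((1, 2), sleep=lambda s: None)
print(1 <= waited <= 2)
```

Randomising the pause, rather than using a fixed interval, makes the request pattern look less mechanical to the host.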