# What

Initial safety issue extraction was completed some time ago (https://github.com/1jamesthompson1/TAIC-report-summary/pull/176).

I will now expand this to include safety issue extraction for ATSB and TSB.

This currently works by having an LLM read the important text, then parsing the response into a workable format. The important text has been added for ATSB and TSB in #266.
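
To make the "parse the response into a workable format" step concrete, here is a minimal sketch. It assumes, purely for illustration, that the model is asked to reply with a JSON array of objects carrying `safety_issue` and `quality` keys; the actual prompt and response format used in `ReportExtracting` may well differ, and `parse_safety_issues` is a hypothetical helper, not part of the engine.

```python
import json


def parse_safety_issues(response_text: str) -> list[dict]:
    """Parse a model reply into a list of safety issue records.

    Hypothetical format: a JSON array of objects with "safety_issue" and
    "quality" keys. Anything that does not match is dropped.
    """
    try:
        parsed = json.loads(response_text)
    except json.JSONDecodeError:
        # Unparseable reply: return nothing so that the caller can retry.
        return []
    if not isinstance(parsed, list):
        return []
    return [
        {"safety_issue": item["safety_issue"], "quality": item["quality"]}
        for item in parsed
        if isinstance(item, dict) and {"safety_issue", "quality"} <= item.keys()
    ]


# Made-up reply, only to show the shape of the output.
example_response = '[{"safety_issue": "No procedure existed for X", "quality": "exact"}]'
parsed_issues = parse_safety_issues(example_response)
```

Returning an empty list on a malformed reply keeps the failure visible to the caller without raising in the middle of a long extraction run.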

Note that ATSB actually has a safety issue dataset that goes back to 2010. So for all of those safety issues the exact text is already available and the reports do not need to be read.

In [None]:
import importlib
import engine.extract.ReportExtracting as ReportExtracting 
import engine.gather.WebsiteScraping as WebsiteScraping
import pandas as pd
import tiktoken
import shutil
import re
import os
importlib.reload(ReportExtracting)
importlib.reload(WebsiteScraping)

# Website scraping


As mentioned above, ATSB has an online dataset of the safety issues that they have identified.

This means that I need to set up a scraper for it. I also need to make sure that the pre-2010 reports still work.


To keep with everything else I will add the safety issue dataset scraping to the `WebsiteScraping.py` module.

In [None]:
importlib.reload(WebsiteScraping)
scraper = WebsiteScraping.ATSBSafetyIssueScraper('atsb_safety_issues.pkl', refresh=True)
scraper.extract_safety_issues_from_website()

In [None]:
atsb_webscraped_safety_issues = pd.read_pickle('atsb_safety_issues.pkl')
atsb_webscraped_safety_issues

In [None]:
pd.concat(atsb_webscraped_safety_issues['safety_issues'].tolist())

This scraping works. However, it does not return the same number of rows each time: the count varies by about ±10 rows between runs, which is quite odd.

I will move on for now and might come back to it at another point: https://github.com/1jamesthompson1/TAIC-report-summary/issues/277
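
When coming back to this, one way to narrow it down might be to diff two scrape runs and look at which rows appear in only one of them. The sketch below uses toy data; the `safety_issue_id` column name and the ID values are made up for illustration and are not taken from the real ATSB dataset.

```python
import pandas as pd


def diff_scrapes(first: pd.DataFrame, second: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return the key values that appear in only one of the two scrapes."""
    merged = first[[key]].drop_duplicates().merge(
        second[[key]].drop_duplicates(), on=key, how="outer", indicator=True
    )
    # "_merge" is "left_only", "right_only" or "both"; keep the unstable rows.
    return merged[merged["_merge"] != "both"].reset_index(drop=True)


# Toy stand-ins for two runs of the scraper (hypothetical IDs).
run_a = pd.DataFrame({"safety_issue_id": ["SI-001", "SI-002"]})
run_b = pd.DataFrame({"safety_issue_id": ["SI-001", "SI-003"]})
unstable = diff_scrapes(run_a, run_b, key="safety_issue_id")
```

If the unstable rows cluster around particular pages or dates, that would point at pagination or timeouts rather than the parsing itself.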

# Report extraction

For the pre-2010 ATSB reports and all of the TSB reports I will need to extract the safety issues by reading the important text.

Because it is going to be quite expensive, I will take a sample.

Then I can start building some tests.

In [None]:
important_text_df = pd.read_pickle('../../output/important_text.pkl')
important_text_df['year'] = important_text_df['report_id'].map(lambda x: x.split("_")[2])
important_text_df['agency'] = important_text_df['report_id'].map(lambda x: x.split("_")[0])
important_text_df

In [None]:
encoder = tiktoken.encoding_for_model('gpt-4o')

tokens = important_text_df['important_text'].map(lambda x: len(encoder.encode(x)))
display(tokens.describe())

print(f"There are a total of {tokens.sum():.0f} tokens. At the cost of 2.5/1m tokens each, the total cost is ${tokens.sum() * 2.5 / 1_000_000:.2f}")

It will cost $54.10 USD, which is about 90 NZD. Therefore we will use a sample set that is small enough that a full extraction over it costs only a small amount.
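
As a rough check on what a 1% sample might cost, the same arithmetic can be wrapped in a small function. This is only a sketch: `estimate_cost` is a hypothetical helper, the total token count is back-calculated from the $54.10 figure rather than measured, and the 1% estimate assumes the sample is representative of the full set.

```python
def estimate_cost(token_count: float, usd_per_million: float = 2.5) -> float:
    """Cost in USD of reading `token_count` tokens at a flat per-million rate."""
    return token_count * usd_per_million / 1_000_000


# Token total implied by the $54.10 estimate at $2.50 per 1M tokens.
full_tokens = round(54.10 / 2.5 * 1_000_000)

full_cost = estimate_cost(full_tokens)
sample_cost = estimate_cost(full_tokens * 0.01)  # 1% sample, if representative
print(f"Full: ${full_cost:.2f}, 1% sample: ${sample_cost:.2f}")
```

This counts only the tokens of the important text that is sent to the model, so the tokens in the responses would add to the total.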

In [None]:
sample_important_text = important_text_df.query("agency != 'ATSB' | (year < '2009')").sample(frac=0.01, random_state=45, ignore_index=True)
sample_df_path = 'sample_important_text.pkl'
sample_important_text.to_pickle(sample_df_path)
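os.makedirs('sample', exist_ok=True)  # make sure the output folder exists
for _, report_id, sample_text in sample_important_text[['report_id', 'important_text']].itertuples():
    shutil.copy(f'../../output/report_pdfs/{report_id}.pdf', f'sample/{report_id}.pdf')
    # Open with 'w' so that re-running the cell does not append duplicate text
    with open(f'sample/{report_id}_important.txt', 'w') as f:
        f.write(sample_text)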
sample_important_text

In [None]:
report_text = pd.read_pickle('../../output/parsed_reports.pkl')
report_text[report_text['report_id'].isin(sample_important_text['report_id'])].to_pickle('sample_parsed_reports.pkl')

In [None]:
importlib.reload(ReportExtracting)
processor = ReportExtracting.ReportExtractingProcessor('sample_parsed_reports.pkl', refresh=True)

processor.extract_safety_issues_from_reports(sample_df_path, 'sample_safety_issues.pkl')

In [None]:
safety_issues_df = pd.read_pickle('sample_safety_issues.pkl')
safety_issues_df

In [None]:
# Display each report's safety issues in turn
for issues in safety_issues_df['safety_issues']:
    display(issues)

## ATSB

ATSB has a concept of investigation level: some investigations are full investigations and some are short investigations.

I am not sure if these short investigations actually have safety issues.

In [None]:
report_titles = pd.read_pickle('../../output/report_titles.pkl')
report_titles['agency'] = report_titles['report_id'].map(lambda x: x.split('_')[0])
report_titles['year'] = report_titles['report_id'].map(lambda x: int(x.split('_')[2]))
report_titles['misc'] = report_titles['misc'].map(lambda x: x[0] if isinstance(x, list) else x)
atsb_reports = report_titles.query('agency == "ATSB"').copy()  # copy so that adding a column does not warn
atsb_reports['level'] = atsb_reports['misc'].map(lambda x: x['investigation_level'])

atsb_reports


In [None]:
atsb_reports['level'].value_counts()
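
To answer whether short investigations actually have safety issues, the levels could be joined against the scraped safety issue dataset. The sketch below uses toy frames; the `report_id`, `level` and `safety_issue` column names and all values are assumptions for illustration, not the real schema.

```python
import pandas as pd

# Toy stand-ins for the report levels and the scraped safety issues.
levels = pd.DataFrame(
    {"report_id": ["r1", "r2", "r3"], "level": ["Short", "Defined", "Short"]}
)
issues = pd.DataFrame(
    {"report_id": ["r2", "r2", "r3"], "safety_issue": ["a", "b", "c"]}
)

# Count issues per report, then attach the counts to every report.
issue_counts = issues.groupby("report_id").size().reset_index(name="n_issues")
merged = levels.merge(issue_counts, on="report_id", how="left")
merged["n_issues"] = merged["n_issues"].fillna(0).astype(int)

# Share of reports at each level that have at least one safety issue.
share_with_issues = (merged["n_issues"] > 0).groupby(merged["level"]).mean()
```

A share close to zero for the short level would suggest those reports can be skipped during extraction.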

## TSB

As there are only 7 years (2000-2007) that are not already included in the safety issues dataset for ATSB it seems more important to do the safety issue extraction for TSB.

TSB has the concept of a class of investigation. It goes from 6 to 1, with 1 being the most important.
More information can be found lower down on this page: https://www.tsb.gc.ca/eng/lois-acts/evenements-occurrences.html

Class 6 is for external investigations and class 1 is for thematic investigations.

I will start scraping the occurrence class from the web pages of the reports by adding the metadata requests to the TSB scraper.

In [None]:
import hrequests
from bs4 import BeautifulSoup
import importlib

import engine.utils.Modes as Modes
import engine.gather.WebsiteScraping as WebsiteScraping

importlib.reload(WebsiteScraping)

pdf_page = hrequests.get(
   "https://www.tsb.gc.ca/eng/rapports-reports/aviation/2015/a15h0002/a15h0002.html" 
)
soup = BeautifulSoup(pdf_page.text, 'html.parser')

report_id = "TSB_a_2015_H0002"

scraper = WebsiteScraping.TSBReportScraper(WebsiteScraping.ReportScraperSettings(
    "../../output/report_pdfs/", "../../output/report_titles.pkl", "{{report_id}}.pdf", 2010, 2020, 1000, Modes.all_modes, [], False
))

scraper.get_report_metadata(report_id, soup)



In [None]:
tsb_reports = report_titles.query('agency == "TSB"')
tsb_reports

In [None]:
tsb_sample = important_text_df.query("agency == 'TSB'").sample(frac=0.01, random_state=42)
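os.makedirs('tsb_sample', exist_ok=True)  # make sure the output folder exists
for _, report_id, sample_text in tsb_sample[['report_id', 'important_text']].itertuples():
    # Open with 'w' so that re-running the cell does not append duplicate text
    with open(f'tsb_sample/{report_id}_important.txt', 'w') as f:
        f.write(sample_text)
    shutil.copy(f'../../output/report_pdfs_stash/{report_id}.pdf', f'tsb_sample/{report_id}.pdf')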
tsb_sample

## Quickly adding in investigation levels

In [None]:
reports = pd.read_pickle('../../output/report_titles.pkl')
reports['agency'] = reports['report_id'].map(lambda x: x.split('_')[0])
reports

In [None]:
atsb = reports.query('agency == "ATSB"').copy()  # copy so that adding a column does not warn
atsb['investigation_type'] = atsb['misc'].map(lambda x: "full" if x['investigation_level'] in ["Defined", "Systemic"] else "short" if x['investigation_level'] == "Short" else "unknown")
atsb

In [None]:
taic = reports.query('agency == "TAIC"').copy()  # copy so that adding a column does not warn
taic['investigation_type'] = "full"
taic

In [None]:
tsb = reports.query('agency == "TSB"').copy()  # copy so that adding a column does not warn
tsb['investigation_type'] = tsb['misc'].map(lambda x: "unknown" if x['investigation_class'] is None else "full" if int(x['investigation_class']) < 4 else "short")
tsb

In [None]:
# Include tsb as well so that the TSB rows are not dropped from the saved file
reports_out = pd.concat([atsb, taic, tsb])[['report_id', 'title', 'event_type', 'investigation_type', 'misc']].reset_index(drop=True)
reports_out['url'] = None
reports_out.to_pickle('../../output/report_titles.pkl')
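
The per-agency rules used in the cells above could also be gathered into one helper, which makes them easier to test. This is a sketch only: `investigation_type` is a hypothetical function, and it simply restates the mappings above (ATSB "Defined"/"Systemic" as full and "Short" as short, TSB classes below 4 as full, TAIC always full).

```python
from typing import Optional


def investigation_type(agency: str, level: Optional[str] = None) -> str:
    """Map an agency-specific level or class to "full", "short" or "unknown"."""
    if agency == "TAIC":
        return "full"
    if agency == "ATSB":
        if level in ("Defined", "Systemic"):
            return "full"
        return "short" if level == "Short" else "unknown"
    if agency == "TSB":
        if level is None:
            return "unknown"
        # TSB classes run from 6 to 1; classes 1 to 3 are treated as full.
        return "full" if int(level) < 4 else "short"
    return "unknown"


examples = [
    investigation_type("ATSB", "Systemic"),
    investigation_type("TSB", "5"),
    investigation_type("TSB", None),
    investigation_type("TAIC"),
]
```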