# What

Initial safety issue extraction was completed some time ago (https://github.com/1jamesthompson1/TAIC-report-summary/pull/176).

I will expand this to include safety issue extraction for ATSB and TSB.

This currently works by having an LLM read the important text, then parsing the response into a workable format. The important text has been added for ATSB and TSB in #266.
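As a rough illustration of the "parse the response into a workable format" step, here is a minimal sketch. It assumes the LLM is asked to reply with a JSON list of objects that have `safety_issue` and `quality` keys. That response format and the function name `parse_safety_issue_response` are assumptions for illustration only; the real parsing lives in `ReportExtracting` and may look quite different.

```python
import json


def parse_safety_issue_response(response_text):
    """Parse a (hypothetical) JSON reply into a list of safety issue dicts.

    Returns an empty list when the reply is not valid JSON or not a list,
    so one bad response does not break a whole extraction run.
    """
    try:
        items = json.loads(response_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    parsed_issues = []
    for item in items:
        # Skip entries that do not carry a safety issue string.
        if not isinstance(item, dict) or not isinstance(item.get("safety_issue"), str):
            continue
        parsed_issues.append(
            {
                "safety_issue": item["safety_issue"].strip(),
                "quality": item.get("quality", "unknown"),
            }
        )
    return parsed_issues


# Made-up reply: one usable entry and one entry that should be ignored.
example_response = (
    '[{"safety_issue": " Fatigue management was not monitored. ", "quality": "exact"},'
    ' {"note": "ignored"}]'
)
parsed = parse_safety_issue_response(example_response)
```

Returning an empty list on malformed input keeps the caller simple: a report with no parsable response just ends up with no safety issues, and can be retried later.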

Note that ATSB actually has a safety issue dataset that goes back to 2010. So for all of those safety issues there are exact extracts, and the reports do not need to be read.

In [None]:
import importlib
import engine.extract.ReportExtracting as ReportExtracting 
import engine.gather.WebsiteScraping as WebsiteScraping
import pandas as pd
import tiktoken
import shutil
import re
import os
importlib.reload(ReportExtracting)
importlib.reload(WebsiteScraping)

In [None]:
output_path = '../../output/'

# Website scraping


As mentioned above, ATSB has an online dataset of the safety issues that it has identified.

This means that I need to set up a scraper for it. I also need to make sure that the pre-2010 reports still work properly.


To stay consistent with everything else, I will add the safety issue dataset scraping to the `WebsiteScraping.py` module.

In [None]:
importlib.reload(WebsiteScraping)
atsb_safety_issues_path = os.path.join(output_path, 'atsb_safety_issues.pkl')
scraper = WebsiteScraping.ATSBSafetyIssueScraper(atsb_safety_issues_path, refresh=True)
scraper.extract_safety_issues_from_website()

In [None]:
atsb_webscraped_safety_issues = pd.read_pickle(atsb_safety_issues_path)
atsb_webscraped_safety_issues

In [None]:
pd.concat(atsb_webscraped_safety_issues['safety_issues'].tolist())

This scraping works. However, it does not scrape the same number of rows each time. The count varies by about ±10 rows between runs, which is quite odd.

I will move on for now and might come back to it at another point: https://github.com/1jamesthompson1/TAIC-report-summary/issues/277
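One way to investigate the varying row count later would be to diff the identifiers from two scrape runs. The sketch below assumes each scraped row has a unique key, here called `safety_issue_id`. That column name and the example IDs are hypothetical and not taken from the actual dataset.

```python
def diff_scrape_runs(run_a, run_b, key="safety_issue_id"):
    """Report which keys appear in only one of two scrape runs.

    Each run is a list of dicts (e.g. from DataFrame.to_dict("records")).
    """
    ids_a = {row[key] for row in run_a}
    ids_b = {row[key] for row in run_b}
    return {
        "only_in_a": sorted(ids_a - ids_b),
        "only_in_b": sorted(ids_b - ids_a),
    }


# Made-up rows standing in for two consecutive scrapes.
first_run = [{"safety_issue_id": "SI-1"}, {"safety_issue_id": "SI-2"}]
second_run = [{"safety_issue_id": "SI-2"}, {"safety_issue_id": "SI-3"}]
difference = diff_scrape_runs(first_run, second_run)
```

If the same few IDs keep dropping in and out, that would point to something like pagination or timeouts rather than the dataset itself changing.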

# Report extraction

For the pre-2010 ATSB reports and all of the TSB reports, I will need to extract the safety issues by reading the important text.

Because this is going to be quite expensive, I will take a sample first.

Then I can start building some tests.

## Getting datasets

In [None]:
important_text_df_path = os.path.join(output_path, 'important_text.pkl')
important_text_df = pd.read_pickle(important_text_df_path)
important_text_df['year'] = important_text_df['report_id'].map(lambda x: x.split("_")[2])
important_text_df['agency'] = important_text_df['report_id'].map(lambda x: x.split("_")[0])
important_text_df

In [None]:
report_titles_path = os.path.join(output_path, 'report_titles.pkl')
report_titles = pd.read_pickle(report_titles_path)
report_titles

In [None]:
parsed_reports_path = os.path.join(output_path, 'parsed_reports.pkl')
parsed_reports = pd.read_pickle(parsed_reports_path)
parsed_reports

In [None]:
merged_df = pd.merge(important_text_df, report_titles, how='outer', on='report_id')

merged_df

In [None]:
filtered_df = merged_df.dropna(subset=['important_text'])
filtered_df = filtered_df[filtered_df['investigation_type'] != 'short']
filtered_df

## How much will it cost

In [None]:
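encoder = tiktoken.encoding_for_model('gpt-4o')
def cost_to_read(df):
    # Number of tokens in each report's important text
    tokens = df['important_text'].map(lambda x: len(encoder.encode(x)))

    # Rate used here: 2.5 USD per 1,000,000 tokens read
    return tokens.sum() * 2.5 / 1_000_000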

In [None]:
cost_to_read(filtered_df)

It will cost USD $54.10, which is about NZD $90. Therefore, we will use a sample that is small enough that a full extraction of it costs only a small amount.
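The arithmetic behind picking a sample size can be sketched as follows. It reuses the 2.5 USD per million tokens rate and the 54.10 USD figure from above. The helper names and the budget value are made up for illustration, and the estimate assumes the sampled reports are about as long as the average report.

```python
PRICE_PER_MILLION_TOKENS_USD = 2.5  # same rate as cost_to_read above


def cost_usd(token_count, price_per_million=PRICE_PER_MILLION_TOKENS_USD):
    """Cost in USD of reading `token_count` tokens at a flat per-token rate."""
    return token_count * price_per_million / 1_000_000


def max_sample_fraction(full_cost_usd, budget_usd):
    """Largest fraction of the dataset that fits in the budget (capped at 1)."""
    return min(1.0, budget_usd / full_cost_usd)


full_cost = 54.10  # USD, the full-dataset estimate from above

# Token count implied by that estimate at the flat rate (about 21.6 million).
implied_tokens = full_cost / PRICE_PER_MILLION_TOKENS_USD * 1_000_000

# Expected cost of a 1% sample, assuming average-length reports.
expected_sample_cost = full_cost * 0.01

# Hypothetical budget of 5 USD: what fraction could be read?
affordable_fraction = max_sample_fraction(full_cost, 5.0)
```

Since report lengths vary a lot, the actual cost of a given sample is better checked by calling `cost_to_read` on the sampled frame itself.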

## Running a sample

In [None]:
sample_important_text = filtered_df.sample(frac=0.01, random_state=45, ignore_index=True)
sample_df_path = 'sample_important_text.pkl'
sample_important_text.to_pickle(sample_df_path)
os.makedirs('sample', exist_ok=True)  # make sure the output folder exists
for _, id,sample_text in sample_important_text[['report_id', 'important_text']].itertuples():
    shutil.copy(f'../../output/report_pdfs/{id}.pdf', f'sample/{id}.pdf')
    # 'w' rather than 'a' so re-running the cell does not duplicate the text
    with open(f'sample/{id}_important.txt', 'w') as f:
        f.write(sample_text)
sample_important_text

In [None]:
def extract_safety_from_df(df):
    temp_df_path = 'temp_df.pkl'
    df.to_pickle(temp_df_path)
    parsed_reports[parsed_reports['report_id'].isin(df['report_id'])].to_pickle('sample_parsed_reports.pkl')
    report_titles[report_titles['report_id'].isin(df['report_id'])].to_pickle('sample_report_titles.pkl')

    importlib.reload(ReportExtracting)
    processor = ReportExtracting.ReportExtractingProcessor('sample_parsed_reports.pkl', refresh=True)
    
    processor.extract_safety_issues_from_reports(temp_df_path, 'sample_report_titles.pkl', atsb_safety_issues_path, 'sample_safety_issues.pkl')
    safety_issues_df = pd.read_pickle('sample_safety_issues.pkl')
    safety_issues_df = safety_issues_df[~safety_issues_df['report_id'].isin(atsb_webscraped_safety_issues['report_id'])]
    os.remove(temp_df_path)
    return safety_issues_df

In [None]:
safety_issues_df = extract_safety_from_df(sample_important_text)

In [None]:
safety_issues_df

## ATSB

ATSB has a concept of investigation level. That is, some investigations are full investigations and some are short investigations.

I am not sure if these short investigations actually have safety issues.
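One way to check this would be to count, per investigation level, how many reports have at least one safety issue in the scraped dataset. The sketch below works on plain dicts. The report IDs, level names, and counts are made up for illustration and do not come from the real data.

```python
from collections import defaultdict


def issue_counts_by_level(report_levels, issue_counts):
    """Summarise how many reports at each level have any safety issues.

    report_levels: {report_id: investigation level}
    issue_counts:  {report_id: number of safety issues found}
    """
    summary = defaultdict(lambda: {"reports": 0, "with_issues": 0})
    for report_id, level in report_levels.items():
        summary[level]["reports"] += 1
        if issue_counts.get(report_id, 0) > 0:
            summary[level]["with_issues"] += 1
    return dict(summary)


# Illustrative inputs only.
example_levels = {"r1": "Short", "r2": "Short", "r3": "Systemic"}
example_issue_counts = {"r3": 2}
level_summary = issue_counts_by_level(example_levels, example_issue_counts)
```

If the "with_issues" count for short investigations is zero or close to it in the real data, excluding them from the LLM extraction would be easy to justify.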

In [None]:
atsb_reports = merged_df.query('agency == "ATSB"').copy()  # copy so the column added below is not set on a slice
atsb_reports

In [None]:
atsb_reports['important_text_len'] = atsb_reports['important_text'].map(len)
atsb_reports

In [None]:
atsb_reports['level'].value_counts()

## TSB

As only 7 years (2000-2007) are not already included in the ATSB safety issues dataset, it seems more important to do the safety issue extraction for TSB.

TSB has the concept of an investigation class. It ranges from 6 to 1, with 1 being the most important.
More information can be found lower down on this page: https://www.tsb.gc.ca/eng/lois-acts/evenements-occurrences.html

Class 6 is for external investigations and class 1 is for thematic investigations.

I will start scraping the occurrence class from the webpages of the reports by adding the metadata requests to the TSB scraper.
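As a sketch of what pulling the class out of a report page could look like, the snippet below searches the page text for a phrase like "class 3" and maps the number to an investigation type. The wording it searches for is an assumption about the report pages, not something confirmed here, and the function names are hypothetical. The "below 4 means full" threshold matches the mapping used later in this notebook.

```python
import re

# Assumed wording: the page mentions "class N" with N between 1 and 6.
CLASS_PATTERN = re.compile(r"class\s+([1-6])\b", re.IGNORECASE)


def extract_investigation_class(page_text):
    """Return the first investigation class mentioned in the text, or None."""
    match = CLASS_PATTERN.search(page_text)
    return int(match.group(1)) if match else None


def investigation_type(investigation_class):
    """Map a TSB class to 'full', 'short', or 'unknown' when it is missing."""
    if investigation_class is None:
        return "unknown"
    return "full" if investigation_class < 4 else "short"


# Made-up sentence standing in for the text of a report page.
example_text = "This is a class 3 investigation."
example_class = extract_investigation_class(example_text)
example_type = investigation_type(example_class)
```

Returning `None` when no class is found keeps the "unknown" case explicit, so those reports can be handled separately.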

In [None]:
import hrequests
from bs4 import BeautifulSoup
import importlib

import engine.utils.Modes as Modes
import engine.gather.WebsiteScraping as WebsiteScraping

importlib.reload(WebsiteScraping)

pdf_page = hrequests.get(
   "https://www.tsb.gc.ca/eng/rapports-reports/aviation/2012/a12w0004/a12w0004.html" 
)
soup = BeautifulSoup(pdf_page.text, 'html.parser')

report_id = "TSB_a_2012_w0004"

scraper = WebsiteScraping.TSBReportScraper(WebsiteScraping.ReportScraperSettings(
    "../../output/report_pdfs/", "../../output/report_titles.pkl", "{{report_id}}.pdf", 2010, 2020, 1000, Modes.all_modes, [], False
))

scraper.get_report_metadata(report_id, soup)



In [None]:
tsb_reports = merged_df.query('agency == "TSB"').copy()  # copy so later column assignments are not set on a slice
tsb_reports

In [None]:
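tsb_reports['text_length'] = tsb_reports['important_text'].str.len()

def infer_investigation_type(row):
    # Keep the scraped type whenever it is known.
    if row['investigation_type'] != 'unknown':
        return row['investigation_type']
    # Reports with a list in pages_read are treated as full investigations.
    if isinstance(row['pages_read'], list):
        return 'full'
    # Otherwise fall back to text length: under 40,000 characters counts as short.
    return 'short' if row['text_length'] < 40_000 else 'full'

tsb_reports['investigation_type'] = tsb_reports.apply(infer_investigation_type, axis=1)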

### Doing TSB sample

In [None]:
tsb_sample = tsb_reports.sample(frac=0.02, random_state=42)
tsb_sample

In [None]:
tsb_sample_safety_issues = extract_safety_from_df(tsb_sample)
tsb_sample_safety_issues.to_pickle('tsb_sample_si.pkl')
tsb_sample_safety_issues.set_index('report_id', inplace=True)
tsb_sample_safety_issues

In [None]:
# Start from an empty folder; ignore_errors avoids a failure when it does not exist yet
shutil.rmtree('tsb_sample', ignore_errors=True)
os.makedirs('tsb_sample', exist_ok=True)
for _, id,sample_text, url in tsb_sample[['report_id', 'important_text', 'url']].itertuples():
    with open(f'tsb_sample/{id}_important.txt', 'a') as f:
        f.write(sample_text)
    try:
        shutil.copy(f'../../output/report_pdfs/{id}.pdf', f'tsb_sample/{id}.pdf')
    except FileNotFoundError as e:
        print(e)
    try:
        with open(f'tsb_sample/{id}_si.txt', 'a') as f:
            f.write("\n\n".join(tsb_sample_safety_issues.loc[id]['safety_issues']['safety_issue']))
    except KeyError:
        print(f"{id} has no safety issues")

In [None]:
tsb_sample_safety_issues

## Quickly adding investigation levels

In [None]:
reports = pd.read_pickle('../../output/report_titles.pkl')
reports['agency'] = reports['report_id'].map(lambda x: x.split('_')[0])
reports

In [None]:
atsb = reports.query('agency == "ATSB"').copy()  # copy so the new column is not set on a slice
atsb['investigation_type'] = atsb['misc'].map(lambda x: "full" if x['investigation_level'] in ["Defined", "Systemic"] else "short" if x['investigation_level'] == "Short" else "unknown")
atsb

In [None]:
taic = reports.query('agency == "TAIC"')
taic.loc[:,'investigation_type'] = "full"
taic

In [None]:
tsb = reports.query('agency == "TSB"').copy()  # copy so the new column is not set on a slice
tsb['investigation_type'] = tsb['misc'].map(lambda x: "unknown" if x['investigation_class'] is None else "full" if int(x['investigation_class']) < 4 else "short")
tsb

In [None]:
# Note: only the atsb and taic frames are written back here; the tsb frame above is not included.
reports_out = pd.concat([atsb, taic])[['report_id', 'title', 'event_type', 'investigation_type', 'misc']].reset_index(drop=True)
reports_out['url'] = None
reports_out.to_pickle('../../output/report_titles.pkl')