# What

Now that #265 is complete, I need to make sure that the important text can be extracted from each of the reports.

This will be done using the content section, if present. The content section is read and the page numbers are extracted from it. If the content section is missing or not useful, the entire report, up to 30_000 tokens, is used instead.
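
The 30_000-token fallback can be sketched as below. This is only an illustration, not the code in `ReportExtracting`: the function name `truncate_to_token_budget` and the `TokenCodec` interface are hypothetical, and the sketch assumes the tokenizer exposes `encode` and `decode` the way a `tiktoken` encoding does.

```python
from typing import Protocol


class TokenCodec(Protocol):
    """Hypothetical minimal interface; a tiktoken encoding offers both methods."""

    def encode(self, text: str) -> list[int]: ...

    def decode(self, tokens: list[int]) -> str: ...


def truncate_to_token_budget(text: str, codec: TokenCodec, max_tokens: int = 30_000) -> str:
    """Return `text` unchanged if it fits, otherwise only its first `max_tokens` tokens."""
    tokens = codec.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Decode the kept prefix back into a string for the LLM prompt.
    return codec.decode(tokens[:max_tokens])
```

In this notebook the codec would be something like `tiktoken.encoding_for_model('gpt-4o')`, which is the encoder used further down for the token counts.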

Note that running this notebook incurs some cost, because it makes API calls to an LLM to read the content section.
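
As a rough guide, the cost can be estimated from a token count and a per-million-token price. The helper below is a hypothetical sketch (`estimate_cost_usd` is not part of the project), and the price has to be supplied by whoever runs it, since rates depend on the model and change over time.

```python
def estimate_cost_usd(token_count: int, usd_per_million_tokens: float) -> float:
    """Estimate the input cost in USD for reading `token_count` tokens."""
    if token_count < 0 or usd_per_million_tokens < 0:
        raise ValueError("token_count and usd_per_million_tokens must be non-negative")
    return token_count * usd_per_million_tokens / 1_000_000
```

The cells in this notebook apply the same arithmetic inline, with rates of 0.15 and 2.5 USD per million tokens.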

In [None]:
import re
import importlib

import pandas as pd
import tiktoken
from tqdm import tqdm

import engine.extract.ReportExtracting as ReportExtracting

tqdm.pandas()

importlib.reload(ReportExtracting)

# Content sections

Currently `reportExtractor.get_important_text()` returns the important text from a report, provided that a content section or PDF headers are present.

In [None]:
## How many reports have a content section

parsed_reports = pd.read_pickle('../../output/parsed_reports.pkl')

parsed_reports['agency'] = parsed_reports['report_id'].map(lambda x: x.split('_')[0])

parsed_reports['content_section'] = parsed_reports.apply(lambda x: ReportExtracting.ReportExtractor(x['text'], x['report_id'], x['headers']).extract_contents_section(), axis=1)

In [None]:
print("Reports that have a content section")
display(parsed_reports['content_section'].notna().value_counts())

print("What are the lengths of these content sections (both characters and tokens)")
display(parsed_reports['content_section'].dropna().map(len).describe())
encoder = tiktoken.encoding_for_model('gpt-4o')
encoded_content_sections = parsed_reports['content_section'].dropna().map(lambda x: len(encoder.encode(x)))
display(encoded_content_sections.describe())
print(f"Total cost to read ({encoded_content_sections.sum()} tokens): USD ${encoded_content_sections.sum() * 0.15 / 1_000_000}\n")

print("Which content sections have come from PDF headers")
display(parsed_reports['content_section'].dropna().map(lambda x: bool(re.search(r'^\s+Title  Level', x))).value_counts())

In [None]:
print(parsed_reports['content_section'].dropna().loc[17])

About half of the content sections come from PDF headers and the other half from the report text itself.

Furthermore, I expect that quite a few of the PDF headers are not actually useful (like the one above). It will be up to the LLM to decide whether a given content section is a relevant table of contents or not.
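
The split between the two sources relies on a regex check: content sections built from PDF headers are recognised by a leading `Title  Level` header row. A minimal sketch of that check as a reusable function follows; the name `is_from_pdf_headers` is hypothetical, and the sample strings in the usage lines are made up for illustration.

```python
import re

# Same pattern the notebook uses: leading whitespace, then the
# "Title  Level" column headers (two spaces between the words).
PDF_HEADER_PATTERN = re.compile(r'^\s+Title  Level')


def is_from_pdf_headers(content_section: str) -> bool:
    """Return True if the content section looks like a dump of PDF headers."""
    return bool(PDF_HEADER_PATTERN.search(content_section))


# Made-up examples of the two kinds of content section.
print(is_from_pdf_headers("    Title  Level  Page\n  Summary      1     3"))
print(is_from_pdf_headers("Contents\n1. Summary ..... 3"))
```

This only identifies where a content section came from. Whether it is a usable table of contents is still left to the LLM.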

# Important text

In [None]:
importlib.reload(ReportExtracting)
parsed_reports['important_text'] = parsed_reports.progress_apply(lambda x: ReportExtracting.ReportExtractor(x['text'], x['report_id'], x['headers']).extract_important_text(), axis=1)
parsed_reports.to_pickle('important_text.pkl')
parsed_reports

In [None]:
parsed_reports['important_text'].map(lambda x: isinstance(x[0], str)).value_counts()

Initial results with only content-section extraction: 1865 reports with important text found and 1757 without.
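
The checks in this notebook treat each `important_text` value as a pair whose first element is the extracted text and whose second element is the list of pages. Assuming that shape (inferred from the notebook cells, not from the `ReportExtracting` source), the three outcomes can be named explicitly. The function name and labels are hypothetical.

```python
from collections import Counter


def classify_extraction(result) -> str:
    """Label one (text, pages) result with the outcome it represents."""
    text, pages = result
    if isinstance(text, str):
        return "text_extracted"
    if isinstance(pages, list):
        # Pages were identified but no text came back for them.
        return "pages_found_no_text"
    return "failed"


def tally_outcomes(results) -> Counter:
    """Count how many results fall into each outcome."""
    return Counter(classify_extraction(r) for r in results)
```

On the dataframe this would be applied as `parsed_reports['important_text'].map(classify_extraction).value_counts()`.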

In [None]:
failed_to_extract_pages = parsed_reports[parsed_reports['important_text'].map(lambda x: not isinstance(x[0], str) and isinstance(x[1], list))]

failed_completely = parsed_reports[~parsed_reports['important_text'].map(lambda x: isinstance(x[0], str) or isinstance(x[1], list))]

failed_completely_with_content_section = parsed_reports[parsed_reports['content_section'].notna() & ~parsed_reports['important_text'].map(lambda x: isinstance(x[0], str) or isinstance(x[1], list))]

print(f"Failed to extract pages: {len(failed_to_extract_pages)}\nFailed completely: {len(failed_completely)}\nFailed completely with content section: {len(failed_completely_with_content_section)}")

print("How successful are the PDF headers at being a content section")
parsed_reports.dropna(subset=['content_section'])[parsed_reports['content_section'].dropna().map(lambda x: bool(re.search(r'^\s+Title  Level', x)))]['important_text'].map(lambda x: isinstance(x[0], str)).value_counts()

In [None]:
encoded = parsed_reports['important_text'].map(lambda x: len(encoder.encode(x[0])) if isinstance(x[0], str) else 0)
print(f"Total cost to read ({encoded.sum()} tokens): USD ${encoded.sum() * 2.5 / 1_000_000:.2f}")

In [None]:
print("Important text lengths followed by full text lengths")

display(encoded.describe())
display(parsed_reports['text'].dropna().map(lambda x: len(encoder.encode(x))).describe())

The reports that fail despite having a content section do so either because the PDF headers are not good enough (i.e. they come from a short report) or because the report is a short investigation summary.

Overall, 940 TSB and 796 ATSB reports fail, while only 11 TAIC reports are missing. Restricting to reports that have a content section changes the picture: 651 ATSB reports fail compared with only 311 TSB reports.
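
A per-agency breakdown like this one can be produced by taking the agency prefix of each `report_id` (the part before the first underscore, as done earlier in this notebook) and counting the failures. The sketch below uses only the standard library. The function names are hypothetical and the report IDs in the usage lines are invented.

```python
from collections import Counter


def agency_of(report_id: str) -> str:
    """Agency is the prefix of the report ID before the first underscore."""
    return report_id.split("_")[0]


def failures_by_agency(report_ids, succeeded) -> Counter:
    """Count failed extractions per agency, given parallel IDs and success flags."""
    return Counter(
        agency_of(report_id)
        for report_id, ok in zip(report_ids, succeeded)
        if not ok
    )


# Invented IDs, for illustration only.
ids = ["TSB_a_2001_1", "TSB_m_2002_2", "ATSB_a_2003_3", "TAIC_r_2004_4"]
flags = [False, False, False, True]
print(failures_by_agency(ids, flags))
```

With the dataframe, the same breakdown is available through `parsed_reports.groupby('agency')` once a boolean success column has been added.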

# Checking the generated important text is fair

In [None]:
parsed_reports = pd.read_pickle("important_text.pkl")

In [None]:
parsed_reports[['important_text', 'pages_read']] = parsed_reports['important_text'].apply(pd.Series)

In [None]:
parsed_reports['full_text'] = parsed_reports.progress_apply(lambda x: x['text'] == x['important_text'], axis=1)

parsed_reports

In [None]:
test_sample = parsed_reports.query('full_text == False').query('agency != "TAIC"').sample(frac=0.1, random_state=45)
test_sample

In [None]:
importlib.reload(ReportExtracting)
index = 7
print(test_sample.iloc[index]['report_id'])
# headers = test_sample.iloc[index]['headers']
# print(
# headers.assign(Page=headers['Page'].replace('', 0)).to_string(index=False)
# )
print(
    test_sample.iloc[index]['content_section']
)

In [None]:
importlib.reload(ReportExtracting)
test_sample[['important_text_new', 'pages_read_new']] = test_sample.progress_apply(lambda x: ReportExtracting.ReportExtractor(x['text'], x['report_id'], x['headers']).extract_important_text(), axis=1).apply(pd.Series)

test_sample

In [None]:
test_sample[['important_text', 'important_text_new']].map(lambda x: len(x) if isinstance(x, str) else 0).describe()