# What

As per [#286](https://github.com/1jamesthompson1/TAIC-report-summary/issues/286), there is a problem where I don't currently know where the reports are going.

I want to start by looking at the various steps and figuring out which reports are lost at which step.

In [None]:
import pandas as pd

import os
import plotly.express as px


def output_file(path):
    return os.path.join('../../output', path)

## Getting datasets

The data flows through the engine as pandas dataframes.

In theory, just by looking at `report_titles.pkl` and `extracted_reports.pkl`, we should be able to tell which reports were dropped in the gather and extract phases. The last step is the analyze phase, which is mostly embedding, as the other two datasets are used as much.
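As a minimal sketch of the idea above, a left merge with `indicator=True` lists which report IDs are present at one stage but missing from the next. The two small dataframes and their IDs are made up for illustration; only the `report_id` column name comes from this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for report_titles and extracted_reports;
# the IDs below are invented and only the `report_id` column matters.
titles = pd.DataFrame(
    {"report_id": ["TAIC_a_2019_001", "TAIC_r_2020_002", "TAIC_m_2021_003"]}
)
extracted = pd.DataFrame({"report_id": ["TAIC_a_2019_001", "TAIC_m_2021_003"]})

# indicator=True adds a `_merge` column that is "both" when the ID exists in
# both frames and "left_only" when it only exists in the left frame.
merged = titles.merge(extracted, on="report_id", how="left", indicator=True)

# IDs that were scraped but never made it into the extracted dataset.
dropped_ids = merged.loc[merged["_merge"] == "left_only", "report_id"].tolist()
print(dropped_ids)
```

The same pattern should work between any two consecutive stages, provided both frames share a `report_id` column.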

In [None]:
report_titles = pd.read_pickle(output_file('report_titles.pkl'))
extracted_reports = pd.read_pickle(output_file('extracted_reports.pkl'))
report_pdfs = pd.DataFrame(map(lambda x: x[:-4], os.listdir(output_file('report_pdfs'))), columns = ['report_id'])

In [None]:
report_titles

In [None]:
extracted_reports

In [None]:
report_pdfs['pdf_download'] = True
report_pdfs

In [None]:
all_info = report_titles.merge(report_pdfs, on='report_id', how='left').merge(extracted_reports[["report_id", "text", "toc", "recommendations", "safety_issues", "sections"]], on='report_id', how='left')
all_info

In [None]:
all_info["mode"] = all_info["report_id"].apply(lambda x: x.split("_")[1])
all_info["year"] = all_info["report_id"].apply(lambda x: x.split("_")[2])
all_info["agency"] = all_info["report_id"].apply(lambda x: x.split("_")[0])
all_info

## Creating outcome dataset

The report titles are the list of all reports that were web scraped. It should line up with

In [None]:
# I want to instead do it wider so that each column is for its own stage

outcome = all_info.copy()

outcome["found_on_website"] = True

outcome['pdf_download'] = outcome['pdf_download'].fillna(False)

outcome["text_extracted"] = outcome['text'].map(lambda x: isinstance(x, str))

outcome["toc_extracted"] = outcome['toc'].map(lambda x: isinstance(x, str))

outcome["safety_issues_extracted"] = outcome['safety_issues'].map(lambda x: isinstance(x, pd.DataFrame) and len(x) > 0)

outcome["recommendations_extracted"] = outcome['recommendations'].map(lambda x: isinstance(x, pd.DataFrame) and len(x) > 0)

outcome["safety_issues and/or recommendations extracted"] = outcome["safety_issues_extracted"] | outcome["recommendations_extracted"]


counts = outcome[["found_on_website", "pdf_download", "text_extracted", "safety_issues and/or recommendations extracted"]].sum()

counts
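To read the counts as a funnel, it may help to look at how many reports are lost at each stage and what fraction survives from the previous one. The sketch below uses made-up numbers in a hypothetical `stage_counts` series, not the real values from `counts`.

```python
import pandas as pd

# Made-up stage counts in pipeline order (not the real values).
stage_counts = pd.Series(
    {"found_on_website": 100, "pdf_download": 90, "text_extracted": 81}
)

# Reports lost on entering each stage: previous count minus current count.
lost = -stage_counts.diff().fillna(0).astype(int)

# Fraction of the previous stage's reports that survive into this stage.
retention = (stage_counts / stage_counts.shift(1)).fillna(1.0)

summary = pd.DataFrame({"count": stage_counts, "lost": lost, "retention": retention})
print(summary)
```

This assumes the stages are strictly sequential, so each count is a subset of the one before it.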

## Where did the reports go?

In [None]:
import plotly.graph_objects as go

nodes = ["found_on_website", "pdf_download", "text_extracted", "safety_issues and/or recommendations extracted", "nothing_extracted",  "could_not_get_pdf", "no_text_extraction"]

links = [
    {"source": 0, "target": 1, "value": counts["pdf_download"]},
    {"source": 0, "target": 5, "value": counts["found_on_website"] - counts["pdf_download"]},
    {"source": 1, "target": 2, "value": counts["text_extracted"]},
    {"source": 1, "target": 6, "value": counts["pdf_download"] - counts["text_extracted"]},
    {"source": 2, "target": 3, "value": counts["safety_issues and/or recommendations extracted"]},
    {"source": 2, "target": 4, "value": counts["text_extracted"] - counts["safety_issues and/or recommendations extracted"]},
]

fig = go.Figure(data=[go.Sankey(
    node = dict(
        pad = 15,
        thickness = 20,
        line = dict(color = "black", width = 0.5),
        label = nodes,
        color = "blue",
        align="left"
    ),
    link = dict(
        source = [link["source"] for link in links],
        target = [link["target"] for link in links],
        value = [link["value"] for link in links]
    )
)])

fig.update_layout(title_text="Report Extraction Pipeline", font_size=10)
fig.show()
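The Sankey links are built from differences between independent column sums, which is only meaningful if the stages are nested (for example, no report has text extracted without a downloaded PDF). A rough way to check that assumption is sketched below; `count_nesting_violations` is a hypothetical helper and the flag data is invented.

```python
import pandas as pd


def count_nesting_violations(flags: pd.DataFrame, stages: list) -> dict:
    """Count rows where a later stage flag is True but the earlier one is False."""
    return {
        f"{later} without {earlier}": int((flags[later] & ~flags[earlier]).sum())
        for earlier, later in zip(stages, stages[1:])
    }


# Invented boolean flags, one row per report.
flags = pd.DataFrame(
    {
        "found_on_website": [True, True, True, True],
        "pdf_download": [True, True, False, True],
        "text_extracted": [True, False, False, True],
    }
)
stages = ["found_on_website", "pdf_download", "text_extracted"]

violations = count_nesting_violations(flags, stages)
print(violations)

# A frame that breaks the assumption: row 2 has text but no PDF.
bad = flags.copy()
bad.loc[2, "text_extracted"] = True
bad_violations = count_nesting_violations(bad, stages)
print(bad_violations)
```

If this were applied to `outcome`, the flag columns would likely need `.astype(bool)` first, since `~` only behaves as logical negation on boolean columns.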

## Distributions of different outcomes

In [None]:

fig = px.histogram(outcome, x="year", color="agency", facet_col="mode", 
                   barmode='group', title="Count of Records by Year, Agency, and Mode")
fig.update_xaxes(tickangle=45)
fig.show()

### Could not get PDF

In [None]:
could_not_get_pdf = outcome[outcome['pdf_download'] == False]

fig = px.histogram(could_not_get_pdf, x="year", color="agency", facet_col="mode",
                   barmode='group', title="Reports Without a Downloaded PDF by Year, Agency, and Mode")
fig.update_xaxes(tickangle=45)
fig.show()


### Nothing extracted

In [None]:
nothing_extracted = outcome[(outcome['text_extracted'] == True) & (outcome["safety_issues and/or recommendations extracted"] == False)]


fig = px.histogram(nothing_extracted, x="year", color="agency", facet_col="mode",
                   barmode='group', title="Reports With Text but Nothing Else Extracted by Year, Agency, and Mode")
fig.update_xaxes(tickangle=45)
fig.show()