# What

As discussed in https://github.com/1jamesthompson1/TAIC-report-summary/issues/130, there is a requirement for recommendations to be extracted and linked. Because a dataset of recommendations has since been discovered, the extraction is no longer needed and the focus can be on linking.

## How to do it

I have two datasets to work with.

The first is the recommendation dataset from TAIC. This dataset can be considered trustworthy and complete.
The second is the set of safety issues extracted from the reports. The problem is that, because they come from the engine, they cannot be fully trusted. Instead, I will start with just the ones extracted using regexes so that I know they are true.

## All of the modules needed

To keep things as transparent as possible I will add all of the dependencies at the top.

In [None]:
# From the engine
import engine.Extract_Analyze.ReportExtracting as ReportExtracting
from engine.OpenAICaller import openAICaller


# Third party
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import yaml

# Built in
import os
import re
import importlib
import textwrap

pd.options.mode.copy_on_write = True

# Getting datasets

My goal here is to have a single dataset that has 3 columns.

These are report_id, safety_issue and recommendation. *Note that there will be more columns, but these are the three things conveyed by each row.*

This means that there will be a row for every safety issue in each report, and for each safety issue there will be a row for each recommendation from the same report. Therefore I will be making a very long dataset.

In the next section I will work on adding the fourth column of whether they are linked, and from there I can easily filter out the nonexistent connections.

In [None]:
# There are a few reports that have problems for various reasons that are just not worth the effort to include. These will be excluded from this work

reports_to_exclude = [
    '2022_203', # stevedore joint investigation that doesn't follow usual format
    '2022_202', # ""
    '2023_010', # This is not a report and instead just a protection order
    
]

## TAIC dataset

This dataset just comes from an xlsx file but needs to be cleaned.

In [None]:
original_TAIC_recommendations_df = pd.read_excel('TAIC_recommendations_04_04_2024.xlsx')

cleaned_TAIC_recommendations_df = original_TAIC_recommendations_df.copy()

cleaned_TAIC_recommendations_df['recommendation_id'] = cleaned_TAIC_recommendations_df['Number']

cleaned_TAIC_recommendations_df['recommendation_text'] = cleaned_TAIC_recommendations_df['Recommendation']

cleaned_TAIC_recommendations_df.dropna(subset=['recommendation_text', 'Inquiry'], inplace = True)

rows_deleted = len(original_TAIC_recommendations_df) - len(cleaned_TAIC_recommendations_df)
print(f"Deleting {rows_deleted} rows with NAs in either recommendation_text or Inquiry")


# Keep only the rows whose 'Inquiry' column matches the expected inquiry number format
inquiry_regex = r'^(((AO)|(MO)|(RO))-[12][09][987012]\d-[012]\d{2})$'
cleaned_TAIC_recommendations_df = cleaned_TAIC_recommendations_df[cleaned_TAIC_recommendations_df['Inquiry'].str.match(inquiry_regex)]

# printout how many rows were dropped
print(f"Dropped {len(original_TAIC_recommendations_df) - len(cleaned_TAIC_recommendations_df) - rows_deleted} rows that didn't match regex")

# Show the rows that were dropped
# display(original_TAIC_recommendations_df[~original_TAIC_recommendations_df.index.isin(cleaned_TAIC_recommendations_df.index)])

cleaned_TAIC_recommendations_df['report_id'] = cleaned_TAIC_recommendations_df['Inquiry'].apply(lambda x: "_".join(x.split('-')[1:3]))

# Some of the recommendations have more than just the recommendation itself. I will use regex to remove the extra context and store it in another column
cleaned_TAIC_recommendations_df['has_extra_context'] = cleaned_TAIC_recommendations_df['recommendation_text'].apply(lambda x: re.search(r'[\s\S]{35,}the commission recommend[\s\S]{100,}', x, re.IGNORECASE) is not None)
cleaned_TAIC_recommendations_df['extra_recommendation_context'] = cleaned_TAIC_recommendations_df.apply(lambda x: re.search(r'([\s\S]*)the commission recommend', x['recommendation_text'], re.IGNORECASE).group(1) if x['has_extra_context'] else None, axis=1)
cleaned_TAIC_recommendations_df['recommendation'] = cleaned_TAIC_recommendations_df.apply(lambda x: x['recommendation_text'].replace(x['extra_recommendation_context'], '') if x['has_extra_context'] else x['recommendation_text'], axis=1)


# To make it simpler later on all other columns are being removed.

cleaned_TAIC_recommendations_df = cleaned_TAIC_recommendations_df[['report_id', 'recommendation_id', 'recommendation']]
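The inquiry number to report ID conversion above can be sketched in isolation. This is a minimal illustration that reuses the notebook's regex; the helper name `inquiry_to_report_id` and the example inquiry numbers are hypothetical and are not part of the engine.

```python
import re

# Same pattern as the notebook: a two-letter prefix, a four-digit year and a three-digit number.
INQUIRY_REGEX = re.compile(r'^(((AO)|(MO)|(RO))-[12][09][987012]\d-[012]\d{2})$')

def inquiry_to_report_id(inquiry):
    """Return 'YYYY_NNN' for a well-formed inquiry number, otherwise None (hypothetical helper)."""
    if not isinstance(inquiry, str) or not INQUIRY_REGEX.match(inquiry):
        return None
    _, year, number = inquiry.split('-')
    return f"{year}_{number}"

# Illustrative values only, not taken from the spreadsheet
print(inquiry_to_report_id("AO-2019-006"))
print(inquiry_to_report_id("2019-006"))
```

Returning `None` for malformed values keeps the validation and the conversion in one place, which may be easier to audit than filtering and converting in separate steps.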

## Safety extraction dataset

I will need to extract all of the direct safety issues from the reports

In [None]:

reports_path = 'output'

reports = [name for name in os.listdir(reports_path) if os.path.isdir(os.path.join(reports_path, name))]

reports = [report for report in reports if report not in reports_to_exclude]

### Old method

It was previously done by reading them with regex.
However, I now have a dataset that was produced with the LLM. It is not completely reliable, but it is larger and better than before.

In [None]:
# ## Clean the outputs folder

# for report in reports:
#     files = os.listdir(os.path.join(reports_path, report))

#     for file in files:
#         if not re.match(fr"{report}((.pdf)|(.txt))", file):
#             print(os.path.join(reports_path, report, file))
#             os.remove(os.path.join(reports_path, report, file))

In [None]:
# Function copied from 'safety_issue_extraction.ipynb'
def read_or_create_important_text_file(path, report_id, text):
    important_text_path = os.path.join(path, report_id, f'{report_id}_important_text.txt')

    if not os.path.isfile(important_text_path):
        important_text, pages = ReportExtracting.ReportExtractor(text, report_id).extract_important_text()

        # Store missing text as an empty string so the file can still be written
        important_text = "" if important_text is None else important_text

        with open(important_text_path, 'w') as stream:
            stream.write(important_text)

        return important_text
    
    with open(important_text_path, 'r') as stream:
        important_text = stream.read()
        
    important_text = None if important_text == "" else important_text

    return important_text

In [None]:
# Get safety issues extracted from all reports
importlib.reload(ReportExtracting)

safety_issues = []

for report in reports:
    
    with open(os.path.join(reports_path, report, f'{report}.txt'), 'r') as stream:
        text = stream.read()

    important_text = read_or_create_important_text_file(reports_path, report, text)

    if important_text is None:
        print(" No safety issues from " + report + " as important text could not be found")
        continue

    extracted_safety_issues = ReportExtracting.SafetyIssueExtractor(text, report)._extract_safety_issues_with_regex(important_text)

    safety_issues.append({
        'report_id': report,
        'safety_issues': extracted_safety_issues
    })

safety_issues_df = pd.DataFrame(safety_issues)


In [None]:
# Lengthen it so that each safety issue has its own row
safety_issues_df = safety_issues_df.explode('safety_issues')

# Remove reports without explicit safety issues
safety_issues_df.dropna(subset=['safety_issues'], inplace = True)

In [None]:
# Get all the safety issues that have non standard characters

standard_characters_regex = r'''^[\w,.\s()'"/\-%]+$'''

non_standard_safety_issues = safety_issues_df[~safety_issues_df['safety_issues'].str.match(standard_characters_regex)]

In [None]:
safety_issues_df = safety_issues_df[safety_issues_df['safety_issues'].str.match(standard_characters_regex)]
print(f"  Removed {len(non_standard_safety_issues)} non standard safety issues")

### New method with LLM

I now have the updated dataset from https://github.com/1jamesthompson1/TAIC-report-summary/pull/141

Therefore I can simply read that dataset.

In [None]:
safety_issues = []

for report in reports:
    
    SI_file_path = os.path.join(reports_path, report, f"{report}_safety_issues.yaml")

    if os.path.exists(SI_file_path):
        with open(SI_file_path, "r") as f:
            safety_issues.append({
                "report_id": report,
                "safety_issue": yaml.safe_load(f)
                })
    else:
        print(f"Could not find safety issues for {SI_file_path}")

safety_issues_df = pd.DataFrame(safety_issues)
safety_issues_df = safety_issues_df.explode("safety_issue")

safety_issues_df['safety_issue'] = safety_issues_df['safety_issue'].apply(lambda x: x['safety_issue'])

display(safety_issues_df)


## Combine the two datasets

Now that I have the two data sets `safety_issues_df` and `cleaned_TAIC_recommendations_df` I can combine them together to be compared

In [None]:
# List the safety issues and the recommendations for each report

combined_df = []

for report in safety_issues_df['report_id'].unique():

    report_safety_issues = safety_issues_df[safety_issues_df['report_id'] == report]['safety_issue']

    report_recommendations = cleaned_TAIC_recommendations_df[cleaned_TAIC_recommendations_df['report_id'] == report]['recommendation']

    for safety_issue in report_safety_issues:
        for recommendation in report_recommendations:
            combined_df.append({
                'report_id': report,
                'safety_issue': safety_issue,
                'recommendation': recommendation
            })

combined_df = pd.DataFrame(combined_df)

print(f"  There are {len(combined_df)} safety issues and recommendations.")
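The nested loop above pairs every safety issue with every recommendation from the same report. As a sketch of an alternative, an inner `merge` on `report_id` should give the same pairs. The two toy DataFrames below are made up for illustration and only mirror the column names used in this notebook.

```python
import pandas as pd

# Toy stand-ins for safety_issues_df and cleaned_TAIC_recommendations_df
safety_issues_toy = pd.DataFrame({
    'report_id': ['r1', 'r1', 'r2'],
    'safety_issue': ['SI A', 'SI B', 'SI C'],
})
recommendations_toy = pd.DataFrame({
    'report_id': ['r1', 'r1', 'r3'],
    'recommendation': ['Rec 1', 'Rec 2', 'Rec 3'],
})

# An inner merge on report_id yields every (safety issue, recommendation) pair per report.
# Reports present in only one of the two tables produce no rows, as in the loop version.
pairs = safety_issues_toy.merge(recommendations_toy, on='report_id', how='inner')

print(pairs)
```

Here only `r1` appears in both tables, so the result holds its 2 x 2 pairs. Row order may differ from the loop version, which does not matter for the linking step.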

# Performing link analysis

Now that I have a big dataset I need to compare all the possible connections and see which ones stick

## How many tokens am I dealing with

As I have a large dataset of roughly 500 rows, I need to know approximately how many tokens it contains, because this gives me a rough cost guide.

In [None]:
# Count up the tokens in the safety issues and recommendations

def how_many_tokens(df):

    df['safety_issue_tokens'] = df['safety_issue'].apply(lambda x: sum(openAICaller.get_tokens(x)))
    df['recommendation_tokens'] = df['recommendation'].apply(lambda x: sum(openAICaller.get_tokens(x)))

    total_number_of_tokens = sum(df['safety_issue_tokens']) + sum(df['recommendation_tokens'])

    return total_number_of_tokens

def rough_cost(df):

    tokens = how_many_tokens(df)

    return round(tokens/1000 * 0.01, 2)

print(f"I am dealing with {len(combined_df)} possible combinations. The cost to read with gpt 4 turbo is ${rough_cost(combined_df)}.")
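The estimate above counts only the tokens in the safety issues and recommendations. Each pair is sent as its own request, so the system prompt is paid for once per pair as well. The sketch below shows the arithmetic with that overhead included. The price of $0.01 per 1,000 tokens comes from `rough_cost` above; the function name, the per-call prompt size and the token counts are hypothetical.

```python
def rough_cost_usd(content_tokens, n_calls=0, prompt_tokens_per_call=0, usd_per_1k_tokens=0.01):
    """Estimate input cost: content tokens plus a fixed prompt overhead for every call."""
    total_tokens = content_tokens + n_calls * prompt_tokens_per_call
    return round(total_tokens / 1000 * usd_per_1k_tokens, 2)

# Made-up numbers: 250,000 content tokens over 500 pairs, with an assumed 150-token prompt
content_only = rough_cost_usd(250_000)
with_prompt = rough_cost_usd(250_000, n_calls=500, prompt_tokens_per_call=150)

print(content_only, with_prompt)
```

Output tokens are ignored here because each response is a single word.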

## Trying out linking with sample

I am going to look at a small random sample of reports (the reports behind 10 randomly sampled rows) and see how to do some comparisons.

### Get the sample dataset

In [None]:
# Get the play dataset

playset_reports = combined_df['report_id'].sample(10, random_state=42)

combined_df_playset = combined_df[combined_df['report_id'].isin(playset_reports)]

print(f"  There are {len(combined_df_playset)} safety issues and recommendations in the playset.\nIt will cost roughly ${rough_cost(combined_df_playset)} to run through.")
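Sampling from the `report_id` column picks rows, not reports, so reports with many pairs are more likely to be chosen and the same report can be drawn more than once. If exactly N distinct reports are wanted, one option is to drop duplicates before sampling. This is a sketch on made-up data and is not what the cell above does.

```python
import pandas as pd

# Toy data: report 'a' has many more pairs than the others
toy = pd.DataFrame({
    'report_id': ['a'] * 6 + ['b'] * 2 + ['c', 'd'],
    'pair': range(10),
})

# Row-level sampling: may return the same report several times
rows_sample = toy['report_id'].sample(3, random_state=42)

# Report-level sampling: each report is equally likely and appears at most once
reports_sample = toy['report_id'].drop_duplicates().sample(3, random_state=42)

print(rows_sample.nunique(), reports_sample.nunique())
```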

In [None]:
def evaluate_SI_recommendation_link(SI, recommendation):
    response = openAICaller.query(
        system = f"""
You are going to help me find links between recommendations and safety issues identified in transport accident investigation reports.

Each transport accident investigation report will identify safety issues. These reports will then issue recommendations that address one or more of the safety issues identified in the report.

For each pair given you need to respond with one of three answers.

- None (The recommendation is not directly related to the safety issue)
- Possible (The recommendation is reasonably likely to directly address the safety issue)
- Confirmed (The recommendation explicitly mentions the safety issue that it is trying to address)
""",
        user = f"""
Here is the recommendation:

{recommendation}

Here is the safety issue:

{SI}
\
""",
        model = "gpt-4",
        temp = 0)
    
    if response in ['None', 'Possible', 'Confirmed']:
        return response
    else:
        print(f"Model response is incorrect and is {response}")
        return 'undetermined'
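The exact-match check above rejects answers that differ only in case, whitespace or trailing punctuation, such as `Confirmed.`. A small normalisation step could recover those before falling back to `'undetermined'`. The helper below is a hypothetical sketch and is not part of `openAICaller`.

```python
VALID_LINKS = ('None', 'Possible', 'Confirmed')

def normalise_link_response(response):
    """Map a raw model answer onto one of the three labels, or 'undetermined'."""
    if not isinstance(response, str):
        return 'undetermined'
    # Remove surrounding whitespace, quotes, markdown emphasis and full stops
    cleaned = response.strip().strip('."\'`*').strip().lower()
    for label in VALID_LINKS:
        if cleaned == label.lower():
            return label
    return 'undetermined'

for raw in ['Confirmed.', ' possible\n', 'none', 'Probably confirmed']:
    print(repr(raw), '->', normalise_link_response(raw))
```

Only whole-answer matches are accepted, so a longer sentence that merely contains one of the labels is still treated as undetermined.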


In [None]:
# Look for pkl file and if not found remake it
redo = False
if os.path.exists('links.pkl'):

    print(" Found pkl file")
    found_df = pd.read_pickle('links.pkl')

    if found_df.drop(columns='link').equals(combined_df_playset.drop(columns='link', errors='ignore')):
        print("  Pkl file is up to date")
        combined_df_playset = found_df
    else:
        redo = True
else:
    redo = True

if redo:
    print("Remaking the file")
    combined_df_playset['link'] = combined_df_playset.apply(lambda x: evaluate_SI_recommendation_link(x['safety_issue'], x['recommendation']), axis=1)
    combined_df_playset.to_pickle('links.pkl')


### Look at the linking results

#### Basic information

In [None]:
combined_df_playset['link'].value_counts()

#### Visualize the results

In [None]:
def make_link_visualization(df, report_id):
    '''
    Create a picture that shows both the recommendations and the safety issues for a report. There will be arrows showing the links between the two.
    '''
    # Create a directed graph
    print("Creating visualization")
    G = nx.DiGraph()

    NODE_WIDTH = 50

    # Preemptively wrap the text
    df['recommendation'] = df['recommendation'].apply(lambda x: "\n".join(textwrap.wrap(x, width=NODE_WIDTH)))
    df['safety_issue'] = df['safety_issue'].apply(lambda x: "\n".join(textwrap.wrap(x, width=NODE_WIDTH)))

    # Add a node for each recommendation (left column) and each safety issue (right column)
    for i, text in enumerate(df['recommendation'].unique()):
        G.add_node(text, pos=(0, i))

    for i, issue in enumerate(df['safety_issue'].unique()):
        G.add_node(issue, pos=(3, i))


    # Add edges between the matched indices
    for _, row in df.iterrows():
        # Add solid arrow
        if row['link'] == 'Confirmed':
            G.add_edge(row['recommendation'], row['safety_issue'], color='red', style='solid', alpha =1)

        # Add dotted arrow
        elif row['link'] == 'Possible':
            G.add_edge(row['recommendation'], row['safety_issue'], color='blue', style='dashed', alpha = 0.5)


    max_num_of_nodes_column = max(
        len(df['recommendation'].unique()),
        len(df['safety_issue'].unique())
        )

    # Draw the graph
    pos = nx.get_node_attributes(G, 'pos')

    plt.figure(figsize=(10, max_num_of_nodes_column * 5))  
    nx.draw_networkx_nodes(G, pos, node_color='lightblue', node_size=[len(node) * NODE_WIDTH for node in G.nodes()])
    nx.draw_networkx_labels(G, pos, font_size=7)
    nx.draw_networkx_edges(G, pos, arrows=True, edge_color=nx.get_edge_attributes(G, 'color').values(), style=list(nx.get_edge_attributes(G, 'style').values()), alpha = list(nx.get_edge_attributes(G, 'alpha').values()))

    plt.xlim(-1, 4.5)  # Add buffer to the outside side edges
    plt.ylim(-1, max_num_of_nodes_column*1)  # Add buffer to the outside top and bottom edges

    # Add report ID and headers
    plt.title(f'Report ID: {report_id}')
    plt.text(0, max_num_of_nodes_column-0.2, 'Recommendations', fontsize=12, ha='center')
    plt.text(3, max_num_of_nodes_column-0.2, 'Safety issue', fontsize=12, ha='center')
    plt.text(1.5, max_num_of_nodes_column-0.5, 'Arrow indicates that the recommendation is indicated to solve the safety issue by the LLM', fontsize=6, ha='center')

    if not os.path.exists('visualization_of_links'):
        os.mkdir('visualization_of_links')

    plt.savefig(os.path.join('visualization_of_links', f'{report_id}_comparison_visualization.png'))
    plt.close()

In [None]:
# Split the playset into one DataFrame per report_id, then send each one to the visualization function

for report_id in combined_df_playset['report_id'].unique():
    d = combined_df_playset[combined_df_playset['report_id'] == report_id]

    make_link_visualization(d, report_id)


In [None]:
# Run the linking for just 2019_006

df_2019_006 = combined_df[combined_df['report_id'] == '2019_006']

df_2019_006['link'] = df_2019_006.apply(lambda row: evaluate_SI_recommendation_link(row['safety_issue'], row['recommendation']), axis = 1)

make_link_visualization(df_2019_006, '2019_006')

I am comparing this to what was done by Ingrid. It found all of the correct links and added only one extra link. This is good, as GPT-4 clearly does a much better job than GPT-3.5.

# Testing out analysis


The two identified problems are:
- There are too many links; we only need one link type
- There are recommendations without any links

## Recommendations without links

In [None]:
# Find recommendations from combined_df_playset that don't have a linked safety issue

def find_unlinked_recommendations(df):
    all_recommendations = df[['report_id', 'recommendation']].drop_duplicates()

    linked_recommendations = df[df['link'] == "Confirmed"]

    unlinked_recommendations = all_recommendations[~all_recommendations['recommendation'].isin(linked_recommendations['recommendation'])]

    return unlinked_recommendations

find_unlinked_recommendations(combined_df_playset)
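The function above matches on recommendation text alone, so a recommendation whose wording is confirmed in one report would also count as linked in any other report that uses the same wording. A variant that groups on both `report_id` and `recommendation` avoids this. The sketch below uses made-up rows, and the `linked_labels` parameter is a hypothetical addition for trying out a looser definition of "linked".

```python
import pandas as pd

def find_unlinked(df, linked_labels=('Confirmed',)):
    """Return (report_id, recommendation) pairs with no link of an accepted type."""
    is_linked = df['link'].isin(linked_labels)
    # True for a group if any of its safety issue pairs carries an accepted label
    any_link = is_linked.groupby([df['report_id'], df['recommendation']]).any()
    return any_link[~any_link].reset_index()[['report_id', 'recommendation']]

# Toy data: 'Rec X' is confirmed in r1 but only possible in r2
toy_links = pd.DataFrame({
    'report_id': ['r1', 'r1', 'r2', 'r2'],
    'recommendation': ['Rec X', 'Rec X', 'Rec X', 'Rec Y'],
    'safety_issue': ['SI 1', 'SI 2', 'SI 3', 'SI 3'],
    'link': ['Confirmed', 'None', 'Possible', 'Confirmed'],
})

strict = find_unlinked(toy_links)
relaxed = find_unlinked(toy_links, linked_labels=('Confirmed', 'Possible'))

print(strict)
print(relaxed)
```

Passing both labels treats blue (Possible) links as sufficient, which is one way to look at the reports where only a blue link exists.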

Here I will look through the 5 reports that have unlinked recommendations.

2010_009
This report has really short safety issues. This can make it hard for it to find anything to link to.

2016_206
One recommendation is missing a red link but does have a blue link.
This link could be upgraded to red if needed.

2013_005
Vague wording that doesn't make it clear that the recommendation and safety issues address each other.
The recommendation is specific, whereas both of the safety issues are more general.

2014_005
The first missing recommendation about seat belts does not seem to have an associated safety issue.
The recommendation regarding the vortex effect has a blue link that seems reasonable.

2011_204
There are not many direct links here but the two blue links will work.
