# What

As discussed in https://github.com/1jamesthompson1/TAIC-report-summary/issues/130, recommendations need to be extracted and linked to safety issues. Because a dataset of recommendations has since been found, extraction is no longer needed and the work can focus on linking.

## How to do it

I have two datasets to work with.

The first is the recommendation dataset from TAIC, which can be considered trustworthy and complete.
The second is the set of safety issues extracted from the reports. Because these come from the engine, they cannot be fully trusted. I will therefore start with only the ones extracted using regexes, so that I know they are genuine.

## All of the modules needed

To keep things as transparent as possible I will add all of the dependencies at the top.

In [62]:
# From the engine
import engine.Extract_Analyze.ReportExtracting as ReportExtracting
from engine.OpenAICaller import openAICaller


# Third party
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx

# Built in
import os
import re
import importlib
import textwrap

# Getting datasets

My goal here is to have a single dataset that has 3 columns.

These are report_id, safety_issue and recommendation. *Note that there will be more columns but this is the idea of the three things conveyed by each row*

This means there will be a row for every pairing of a safety issue with a recommendation from the same report, so a report with several of each contributes many rows. The result is a very long dataset.

In the next section I will add a fourth column recording whether each pair is linked; from there I can easily filter out the nonexistent connections.
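
The long format described above is a per-report cross product. A minimal plain-Python sketch of the idea, using made-up report IDs and texts (every value below is hypothetical, and the notebook itself builds the real table with pandas):

```python
from itertools import product

# Hypothetical toy data: report_id -> list of texts
safety_issues = {
    "2019_106": ["issue A", "issue B"],
    "2013_107": ["issue C"],
}
recommendations = {
    "2019_106": ["rec 1", "rec 2", "rec 3"],
    "2013_107": [],  # a report with no recommendations contributes no rows
}

# One row per (safety issue, recommendation) pair within the same report
rows = [
    {"report_id": report_id, "safety_issue": issue, "recommendation": rec}
    for report_id, issues in safety_issues.items()
    for issue, rec in product(issues, recommendations.get(report_id, []))
]

print(len(rows))  # 2 issues x 3 recommendations = 6 rows
```

Note that a report with safety issues but no recommendations drops out entirely under this scheme.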

In [3]:
# A few reports have problems that are not worth the effort to work around. They are excluded from this work.

reports_to_exclude = [
    '2022_203', # stevedore joint investigation that doesn't follow the usual format
    '2022_202', # same as above
    '2023_010', # This is not a report and instead just a protection order
    
]

## TAIC dataset

This dataset comes from a single xlsx file but needs to be cleaned.

In [4]:
original_TAIC_recommendations_df = pd.read_excel('TAIC_recommendations_04_04_2024.xlsx')

cleaned_TAIC_recommendations_df = original_TAIC_recommendations_df.copy()

cleaned_TAIC_recommendations_df['recommendation_id'] = cleaned_TAIC_recommendations_df['Number']

cleaned_TAIC_recommendations_df['recommendation_text'] = cleaned_TAIC_recommendations_df['Recommendation']

cleaned_TAIC_recommendations_df.dropna(subset=['recommendation_text', 'Inquiry'], inplace = True)

rows_deleted = len(original_TAIC_recommendations_df) - len(cleaned_TAIC_recommendations_df)
print(f"Deleting {rows_deleted} rows with NAs in either recommendation_text or Inquiry")


# Keep only the rows whose 'Inquiry' value matches the expected format
inquiry_regex = r'^(((AO)|(MO)|(RO))-[12][09][987012]\d-[012]\d{2})$'
cleaned_TAIC_recommendations_df = cleaned_TAIC_recommendations_df[cleaned_TAIC_recommendations_df['Inquiry'].str.match(inquiry_regex)]

# Print how many rows were dropped
print(f"Dropped {len(original_TAIC_recommendations_df) - len(cleaned_TAIC_recommendations_df) - rows_deleted} rows that didn't match regex")

# Show the rows that were dropped
# display(original_TAIC_recommendations_df[~original_TAIC_recommendations_df.index.isin(cleaned_TAIC_recommendations_df.index)])

cleaned_TAIC_recommendations_df['report_id'] = cleaned_TAIC_recommendations_df['Inquiry'].apply(lambda x: "_".join(x.split('-')[1:3]))

cleaned_TAIC_recommendations_df = cleaned_TAIC_recommendations_df[['report_id', 'recommendation_id', 'recommendation_text']]

Deleting 165 rows with NAs in either recommendation_text or Inquiry
Dropped 37 rows that didn't match regex
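
To make the last cleaning step concrete, here is a small standalone sketch of how an inquiry number maps to a `report_id`. The helper name `inquiry_to_report_id` and the sample inputs are assumptions for illustration; the regex and the split logic are the ones used in the cell above.

```python
import re

# Same pattern as the cleaning cell: mode prefix, four-digit year, three-digit number
INQUIRY_REGEX = re.compile(r'^(((AO)|(MO)|(RO))-[12][09][987012]\d-[012]\d{2})$')

def inquiry_to_report_id(inquiry):
    """Return e.g. '2019_106' for 'AO-2019-106', or None if the format is unexpected."""
    if not INQUIRY_REGEX.match(inquiry):
        return None
    # Drop the mode prefix and join year and number with an underscore
    return "_".join(inquiry.split("-")[1:3])

print(inquiry_to_report_id("AO-2019-106"))  # 2019_106
print(inquiry_to_report_id("2019/106"))     # None
```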


## Safety extraction dataset

I will need to extract all of the direct safety issues from the reports

In [5]:

reports_path = 'output'

reports = [name for name in os.listdir(reports_path) if os.path.isdir(os.path.join(reports_path, name))]

reports = [report for report in reports if report not in reports_to_exclude]

In [6]:
## Clean the outputs folder

for report in reports:
    files = os.listdir(os.path.join(reports_path, report))

    for file in files:
        # Keep only '<report>.pdf' and '<report>.txt'; escape the dot so it matches literally
        if not re.fullmatch(fr"{re.escape(report)}\.(pdf|txt)", file):
            print(os.path.join(reports_path, report, file))
            os.remove(os.path.join(reports_path, report, file))

output/2019_106/2019_106_important_text.txt
output/2013_107/2013_107_important_text.txt
output/2020_102/2020_102_important_text.txt
output/2011_003/2011_003_important_text.txt
output/2012_105/2012_105_important_text.txt
output/2019_202/2019_202_important_text.txt
output/2022_102/2022_102_important_text.txt
output/2010_010/2010_010_important_text.txt
output/2013_011/2013_011_important_text.txt
output/2015_103/2015_103_important_text.txt
output/2019_201/2019_201_important_text.txt
output/2013_006/2013_006_important_text.txt
output/2010_009/2010_009_important_text.txt
output/2019_204/2019_204_important_text.txt
output/2010_007/2010_007_important_text.txt
output/2015_202/2015_202_important_text.txt
output/2010_101/2010_101_important_text.txt
output/2016_206/2016_206_important_text.txt
output/2021_101/2021_101_important_text.txt
output/2010_202/2010_202_important_text.txt
output/2021_204/2021_204_important_text.txt
output/2011_001/2011_001_important_text.txt
output/2012_001/2012_001_importa

In [7]:
# Function copied from 'safety_issue_extraction.ipynb'
def read_or_create_important_text_file(path, report_id, text):
    important_text_path = os.path.join(path, report_id, f'{report_id}_important_text.txt')

    if not os.path.isfile(important_text_path):
        important_text, pages = ReportExtracting.ReportExtractor(text, report_id).extract_important_text()

        # Cache an empty file when nothing was found so the extraction is not repeated
        with open(important_text_path, 'w') as stream:
            stream.write("" if important_text is None else important_text)

        # Return None for "nothing found", the same as when reading from the cache
        return important_text if important_text else None

    with open(important_text_path, 'r') as stream:
        important_text = stream.read()

    # An empty cache file means no important text was found
    return important_text if important_text else None
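
The function above is a read-or-create file cache. A stripped-down sketch of the same pattern, with a hypothetical `fake_extract` standing in for the engine's extractor and a temporary directory instead of the `output` folder:

```python
import os
import tempfile

def read_or_create(path, compute):
    """Return the cached text at `path`, computing and writing it on first use."""
    if os.path.isfile(path):
        with open(path, "r") as stream:
            return stream.read() or None  # an empty file means "nothing found"
    result = compute()
    with open(path, "w") as stream:
        stream.write(result or "")
    return result or None

calls = []

def fake_extract():
    # Hypothetical stand-in for the expensive extraction step
    calls.append(1)
    return "important section text"

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "demo_important_text.txt")
    first = read_or_create(cache_path, fake_extract)
    second = read_or_create(cache_path, fake_extract)  # served from the file

print(first == second, len(calls))  # True 1
```

Both branches return `None` for an empty result, so callers see the same value whether or not the cache file already existed.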

In [22]:
# Get safety issues extracted from all reports
importlib.reload(ReportExtracting)

safety_issues = []

for report in reports:
    
    with open(os.path.join(reports_path, report, f'{report}.txt'), 'r') as stream:
        text = stream.read()

    important_text = read_or_create_important_text_file(reports_path, report, text)

    if important_text is None:
        print(" No safety issues from " + report + " as important text could not be found")
        continue

    extracted_safety_issues = ReportExtracting.SafetyIssueExtractor(text, report)._extract_safety_issues_with_regex(important_text)

    safety_issues.append({
        'report_id': report,
        'safety_issues': extracted_safety_issues
    })

safety_issues_df = pd.DataFrame(safety_issues)


 No safety issues from 2010_011 as important text could not be found
 No safety issues from 2010_207 as important text could not be found
 No safety issues from 2010_205 as important text could not be found
 No safety issues from 2022_206 as important text could not be found
 No safety issues from 2016_002 as important text could not be found
 No safety issues from 2023_201 as important text could not be found
 No safety issues from 2023_206 as important text could not be found


In [24]:
# Remove reports without explicit safety issues
safety_issues_df.dropna(subset=['safety_issues'], inplace = True)

# Lengthen it so that each safety issue has its own row
safety_issues_df = safety_issues_df.explode('safety_issues')

In [28]:
# Get all the safety issues that have non standard characters

standard_characters_regex = r'''^[\w,.\s()'"/\-%]+$'''

non_standard_safety_issues = safety_issues_df[~safety_issues_df['safety_issues'].str.match(standard_characters_regex)]

In [29]:
safety_issues_df = safety_issues_df[safety_issues_df['safety_issues'].str.match(standard_characters_regex)]
print(f"  Removed {len(non_standard_safety_issues)} non standard safety issues")

  Removed 19 non standard safety issues
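
For a feel of what the character filter accepts and rejects, here is a short sketch that applies the same regex to three made-up safety issue strings (the sample texts are hypothetical):

```python
import re

# Same character whitelist as in the cell above
STANDARD_CHARACTERS_REGEX = re.compile(r'''^[\w,.\s()'"/\-%]+$''')

samples = [
    "The crew were not wearing lifejackets (see section 4.2).",  # plain punctuation: kept
    "Fatigue; inadequate rest periods",                          # semicolon: rejected
    "The operator\u2019s manual was out of date",                # curly apostrophe: rejected
]

kept = [s for s in samples if STANDARD_CHARACTERS_REGEX.match(s)]
print(len(kept))  # 1
```

The filter is strict: ordinary text containing a semicolon, colon or typographic quote is dropped along with genuinely garbled extractions.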


## Combine the two datasets

Now that I have the two datasets, `safety_issues_df` and `cleaned_TAIC_recommendations_df`, I can combine them so that they can be compared.

In [54]:
# List the safety issues and the recommendations for each report

combined_df = []

for report in safety_issues_df['report_id'].unique():

    report_safety_issues = safety_issues_df[safety_issues_df['report_id'] == report]['safety_issues']

    report_recommendations = cleaned_TAIC_recommendations_df[cleaned_TAIC_recommendations_df['report_id'] == report]['recommendation_text']

    for safety_issue in report_safety_issues:
        for recommendation in report_recommendations:
            combined_df.append({
                'report_id': report,
                'safety_issue': safety_issue,
                'recommendation': recommendation
            })

combined_df = pd.DataFrame(combined_df)

print(f"  There are {len(combined_df)} safety issues and recommendations.")

  There are 522 safety issues and recommendations.


# Performing link analysis

Now that I have a big dataset, I need to compare all the possible connections and see which ones stick.

## How many tokens am I dealing with

As I have a large dataset with ~500 rows, I need to know roughly how many tokens it contains, because this will give me a rough cost guide.

In [53]:
# Count up the tokens in the safety issues and recommendations

combined_df['safety_issue_tokens'] = combined_df['safety_issue'].apply(lambda x: sum(openAICaller.get_tokens(x)))
combined_df['recommendation_tokens'] = combined_df['recommendation'].apply(lambda x: sum(openAICaller.get_tokens(x)))


total_number_of_tokens = sum(combined_df['safety_issue_tokens']) + sum(combined_df['recommendation_tokens'])

print(f" There are about {total_number_of_tokens} tokens in the safety issues and recommendations. \nCost for reading all tokens is ${round(total_number_of_tokens/1000 * 0.01, 2)}")

 There are about 355989 tokens in the safety issues and recommendations. 
Cost for reading all tokens is $3.56
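
The cost figure is just tokens divided by 1000 times the price per 1000 tokens. A small sketch of that arithmetic follows; the function name and the two overhead parameters are assumptions added for illustration, since the estimate above does not count the system prompt that is sent with every pair.

```python
def estimate_cost(total_tokens, usd_per_1k_tokens=0.01, n_calls=0, overhead_tokens_per_call=0):
    """Rough input-token cost in USD; optional per-call overhead for a repeated prompt."""
    all_tokens = total_tokens + n_calls * overhead_tokens_per_call
    return round(all_tokens / 1000 * usd_per_1k_tokens, 2)

# Reproduces the figure above: 355989 tokens at $0.01 per 1000 tokens
print(estimate_cost(355989))  # 3.56

# Hypothetical: 522 calls that each carry a 150-token system prompt
print(estimate_cost(355989, n_calls=522, overhead_tokens_per_call=150))
```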


## Trying out linking with sample

I am going to sample 10 rows at random and take the reports they belong to (so at most 10 reports), then see how to do some comparisons.

In [57]:
# Get the play dataset

playset_reports = combined_df['report_id'].sample(10, random_state=42)

combined_df_playset = combined_df[combined_df['report_id'].isin(playset_reports)]

print(f"  There are {len(combined_df_playset)} safety issues and recommendations in the playset.")

  There are 162 safety issues and recommendations in the playset.
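
Sampling rows from the `report_id` column favours reports with many rows and can return the same report more than once. If an exact number of distinct reports is wanted, one option is to sample from the unique IDs instead. A plain-Python sketch, with a hypothetical helper name and made-up IDs:

```python
import random

def sample_report_ids(report_ids, k, seed=42):
    """Pick k distinct report IDs, regardless of how many rows each report has."""
    unique_ids = sorted(set(report_ids))  # sort so the result is reproducible for a given seed
    return random.Random(seed).sample(unique_ids, k)

# Hypothetical column: one entry per (safety issue, recommendation) row
report_id_column = ["2019_106"] * 6 + ["2013_107"] * 2 + ["2020_102"] * 12 + ["2011_003"]

picked = sample_report_ids(report_id_column, k=3)
print(len(set(picked)))  # 3
```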


In [59]:
def evaluate_SI_recommendation_link(SI, recommendation):
    response = openAICaller.query(
        system = f"""
You are going to help me find links between recommendations and safety issues identified in transport accident investigation reports.

Each transport accident investigation report will identify safety issues. These reports will then issue recommendations that address one or more of the safety issues identified in the report.

For each pair given you need to respond with one of three answers.

- None (The recommendation has nothing to do with the safety issue)
- Possible (The recommendation might be linked to the safety issue, i.e. the link has to be inferred.)
- Confirmed (The safety issue and/or recommendation explicitly mention each other as being connected)
""",
        user = f"""
Here is the recommendation:

{recommendation}

Here is the safety issue:

{SI}
""",
        model = "gpt-3.5",
        temp = 0)
    
    if response in ['None', 'Possible', 'Confirmed']:
        return response
    else:
        print(f"Model response is incorrect and is {response}")
        return 'undetermined'
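
The exact-match check above treats replies such as `confirmed.` or ` None` as incorrect. A possible way to make it more forgiving is to normalise the reply before comparing. This is only a sketch, and `normalise_link_response` is a hypothetical helper that is not part of the engine:

```python
VALID_LINKS = ("None", "Possible", "Confirmed")

def normalise_link_response(response):
    """Map a raw model reply onto one of the three labels, else 'undetermined'."""
    # Trim whitespace, then surrounding full stops and quotes, then fix the casing
    cleaned = response.strip().strip(".\"'").strip().capitalize()
    return cleaned if cleaned in VALID_LINKS else "undetermined"

print(normalise_link_response(" confirmed.\n"))        # Confirmed
print(normalise_link_response("It might be linked"))   # undetermined
```

Anything that is not one of the three labels after cleaning still falls through to `undetermined`, so longer free-text replies are not silently accepted.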


In [61]:
# Work on an explicit copy so that adding a column does not trigger a SettingWithCopyWarning
combined_df_playset = combined_df_playset.copy()
combined_df_playset['link'] = combined_df_playset.apply(lambda x: evaluate_SI_recommendation_link(x['safety_issue'], x['recommendation']), axis=1)


In [128]:
def make_link_visualization(df, report_id):
    '''
    Create a picture that shows both the recommendations and the safety issues for a report. Arrows show the links between the two.
    '''
    # Create a directed graph
    print("Creating visualization")
    G = nx.DiGraph()

    NODE_WIDTH = 50

    # Preemptively wrap the text
    df['recommendation'] = df['recommendation'].apply(lambda x: "\n".join(textwrap.wrap(x, width=NODE_WIDTH)))
    df['safety_issue'] = df['safety_issue'].apply(lambda x: "\n".join(textwrap.wrap(x, width=NODE_WIDTH)))

    # Add one node per recommendation (left column) and one per safety issue (right column)
    for i, text in enumerate(df['recommendation'].unique()):
        G.add_node(text, pos=(0, i))

    for i, issue in enumerate(df['safety_issue'].unique()):
        G.add_node(issue, pos=(3, i))

    # Add edges between the matched indices
    for _, row in df.iterrows():
        # Add solid arrow
        if row['link'] == 'Confirmed':
            G.add_edge(row['recommendation'], row['safety_issue'], color='red', style='solid', alpha =1)

        # Add dotted arrow
        elif row['link'] == 'Possible':
            G.add_edge(row['recommendation'], row['safety_issue'], color='blue', style='dashed', alpha = 0.5)


    max_num_of_nodes_column = max(
        len(df['recommendation'].unique()),
        len(df['safety_issue'].unique())
        )

    # Draw the graph
    pos = nx.get_node_attributes(G, 'pos')

    plt.figure(figsize=(8, max_num_of_nodes_column * 2.5))  
    nx.draw_networkx_nodes(G, pos, node_color='lightblue', node_size=[len(node) * NODE_WIDTH for node in G.nodes()])
    nx.draw_networkx_labels(G, pos, font_size=7)
    nx.draw_networkx_edges(G, pos, arrows=True, edge_color=nx.get_edge_attributes(G, 'color').values(), style=list(nx.get_edge_attributes(G, 'style').values()), alpha = list(nx.get_edge_attributes(G, 'alpha').values()))

    plt.xlim(-1, 4.5)  # Add buffer to the outside side edges
    plt.ylim(-1, max_num_of_nodes_column*1)  # Add buffer to the outside top and bottom edges

    # Add report ID and headers
    plt.title(f'Report ID: {report_id}')
    plt.text(0, max_num_of_nodes_column-0.2, 'Recommendations', fontsize=12, ha='center')
    plt.text(3, max_num_of_nodes_column-0.2, 'Safety issues', fontsize=12, ha='center')
    plt.text(1.5, max_num_of_nodes_column-0.5, 'Arrow indicates that the LLM linked the recommendation to the safety issue', fontsize=6, ha='center')

    os.makedirs('visualization_of_links', exist_ok=True)

    plt.savefig(os.path.join('visualization_of_links', f'{report_id}_comparison_visualization.png'))
    plt.close()

In [129]:
# Split the playset by report_id and send each report's rows to the visualization function

for report_id in combined_df_playset['report_id'].unique():
    # Copy the slice because make_link_visualization wraps the text columns in place
    d = combined_df_playset[combined_df_playset['report_id'] == report_id].copy()

    make_link_visualization(d, report_id)
    

Creating visualization
Creating visualization
Creating visualization
Creating visualization
Creating visualization
Creating visualization
Creating visualization
Creating visualization
Creating visualization
