# What

As discussed in https://github.com/1jamesthompson1/TAIC-report-summary/issues/130, recommendations need to be extracted and linked. Because an existing dataset of recommendations was discovered, the extraction is no longer needed and the work can focus on linking.

## How to do it

I have two datasets to work with.

The first is the recommendation dataset from TAIC. This dataset can be considered trustworthy and complete.
The second is the set of safety issues extracted from the reports. Because it comes from the engine, it cannot be fully trusted. Instead, I will start with just the ones extracted using regexes, so that I know they are accurate.

## All of the modules needed

To keep things as transparent as possible I will add all of the dependencies at the top.

In [None]:
# From the engine
import engine.Extract_Analyze.ReportExtracting as ReportExtracting
from engine.OpenAICaller import openAICaller


# Third party
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import yaml

# Built in
import os
import re
import importlib
import textwrap
from typing import Literal

pd.options.mode.copy_on_write = True

# Getting datasets

My goal here is to have a single dataset that has 3 columns.

These are report_id, safety_issue and recommendation. *Note that there will be more columns but this is the idea of the three things conveyed by each row*

This means that there will be a row for each safety issue in each report, and for each safety issue there will be a row for each recommendation from the same report. Therefore I will be making a very long dataset.

In the next section I will work on adding the fourth column of whether they are linked, and from there I can easily filter out the nonexistent connections.
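
As a minimal sketch of the long format described above, the snippet below builds one row per (safety issue, recommendation) pair within each report. It uses plain Python and made-up toy data; `build_candidate_rows` and the sample texts are hypothetical names for illustration, not part of the engine.

```python
from itertools import product

# Hypothetical toy data: report_id -> list of texts
safety_issues = {
    "2011_104": ["SI A", "SI B"],
    "2013_005": ["SI C"],
}
recommendations = {
    "2011_104": ["Rec 1", "Rec 2", "Rec 3"],
    "2013_005": ["Rec 4"],
}


def build_candidate_rows(safety_issues, recommendations):
    """Return one row per (safety issue, recommendation) pair within each report."""
    rows = []
    for report_id, issues in safety_issues.items():
        # A report with no recommendations contributes no rows
        recs = recommendations.get(report_id, [])
        for issue, rec in product(issues, recs):
            rows.append({
                "report_id": report_id,
                "safety_issue": issue,
                "recommendation": rec,
            })
    return rows


rows = build_candidate_rows(safety_issues, recommendations)
print(len(rows))  # 2*3 pairs for the first report plus 1*1 for the second
```

The row count grows with the product of safety issues and recommendations per report, which is why the dataset ends up so long.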

In [None]:
# There are a few reports that have problems for various reasons that are just not worth the effort to include. These will be excluded from this work

reports_to_exclude = [
    '2022_203', # stevedore joint investigation that doesn't follow usual format
    '2022_202', # ""
    '2023_010', # This is not a report and instead just a protection order
    
]

## TAIC dataset

This dataset comes from a single xlsx file but needs to be cleaned.
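
One of the cleaning steps is validating the `Inquiry` number and turning it into a `report_id`. The sketch below shows that step in isolation, reusing the same pattern as the cleaning cell. The helper name `inquiry_to_report_id` is hypothetical, and the sample inquiry numbers are only illustrative.

```python
import re

# Same pattern as inquiry_regex in the cleaning cell
INQUIRY_REGEX = re.compile(r'^(((AO)|(MO)|(RO))-[12][09][987012]\d-[012]\d{2})$')


def inquiry_to_report_id(inquiry):
    """Return 'YYYY_NNN' for a well-formed inquiry number, otherwise None."""
    if not isinstance(inquiry, str) or not INQUIRY_REGEX.match(inquiry):
        return None
    # 'AO-2011-104' -> ['AO', '2011', '104']; the mode prefix is dropped
    _mode, year, number = inquiry.split("-")
    return f"{year}_{number}"


print(inquiry_to_report_id("AO-2011-104"))
print(inquiry_to_report_id("not an inquiry"))
```

Returning `None` for malformed values makes it easy to count and inspect the rows that would be dropped before actually dropping them.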

In [None]:
original_TAIC_recommendations_df = pd.read_excel('TAIC_recommendations_04_04_2024.xlsx')

cleaned_TAIC_recommendations_df = original_TAIC_recommendations_df.copy()

cleaned_TAIC_recommendations_df['recommendation_id'] = cleaned_TAIC_recommendations_df['Number']

cleaned_TAIC_recommendations_df['recommendation_text'] = cleaned_TAIC_recommendations_df['Recommendation']

cleaned_TAIC_recommendations_df.dropna(subset=['recommendation_text', 'Inquiry'], inplace = True)

rows_deleted = len(original_TAIC_recommendations_df) - len(cleaned_TAIC_recommendations_df)
print(f"Deleting {rows_deleted} rows with NAs in either recommendation_text or Inquiry")


# find all rows that dont match regex on column 'Inquiry'
inquiry_regex = r'^(((AO)|(MO)|(RO))-[12][09][987012]\d-[012]\d{2})$'
cleaned_TAIC_recommendations_df = cleaned_TAIC_recommendations_df[cleaned_TAIC_recommendations_df['Inquiry'].str.match(inquiry_regex)]

# printout how many rows were dropped
print(f"Dropped {len(original_TAIC_recommendations_df) - len(cleaned_TAIC_recommendations_df) - rows_deleted} rows that didn't match regex")

# Show the rows that were dropped
# display(original_TAIC_recommendations_df[~original_TAIC_recommendations_df.index.isin(cleaned_TAIC_recommendations_df.index)])

cleaned_TAIC_recommendations_df['report_id'] = cleaned_TAIC_recommendations_df['Inquiry'].apply(lambda x: "_".join(x.split('-')[1:3]))

# Some of the recommendations contain more than just the recommendation itself. I will use regex to get rid of the extra text and store it as extra context in another column
cleaned_TAIC_recommendations_df['has_extra_context'] = cleaned_TAIC_recommendations_df['recommendation_text'].apply(lambda x: re.search(r'[\s\S]{35,}the commission recommend[\s\S]{100,}', x, re.IGNORECASE) is not None)
cleaned_TAIC_recommendations_df['extra_recommendation_context'] = cleaned_TAIC_recommendations_df.apply(lambda x: re.search(r'([\s\S]*)the commission recommend', x['recommendation_text'], re.IGNORECASE).group(1) if x['has_extra_context'] else None, axis=1) # type: ignore
cleaned_TAIC_recommendations_df['recommendation'] = cleaned_TAIC_recommendations_df.apply(lambda x: x['recommendation_text'].replace(x['extra_recommendation_context'], '') if x['has_extra_context'] else x['recommendation_text'], axis=1)  

# There are recommendations in 2011_104 that are duplicated once all of the context is removed.
# This should be dropped now to not cause any more confusion
cleaned_TAIC_recommendations_df.drop_duplicates(subset=['report_id', 'recommendation'], inplace=True)


# To make it simpler later on all other columns are being removed.

cleaned_TAIC_recommendations_df = cleaned_TAIC_recommendations_df[['report_id', 'recommendation_id', 'recommendation', 'extra_recommendation_context']]

In [None]:
all_data = pd.read_csv('search_results.csv')

print(f"Total number of recommendations: {cleaned_TAIC_recommendations_df.shape[0]}")
print(f"Average number of recommendations per report: {all_data['Recommendations'].mean():.2f}")
print(f"Number of reports with no recommendations: {all_data[all_data['Recommendations'] == 0].shape[0]}")

## Safety extraction dataset

I will need to extract all of the direct safety issues from the reports

In [None]:
reports_path = 'output'

reports = [name for name in os.listdir(reports_path) if os.path.isdir(os.path.join(reports_path, name))]

reports = [report for report in reports if report not in reports_to_exclude]

### Old method

It was previously done by reading them with regex.
However, I now have a dataset that was produced with the LLM. This is not completely reliable, but it is larger and better than before.

In [None]:
# ## Clean the outputs folder

# for report in reports:
#     files = os.listdir(os.path.join(reports_path, report))

#     for file in files:
#         if not re.match(fr"{report}((.pdf)|(.txt))", file):
#             print(os.path.join(reports_path, report, file))
#             os.remove(os.path.join(reports_path, report, file))

In [None]:
# Function copied from 'safety_issue_extraction.ipynb'
def read_or_create_important_text_file(path, report_id, text):
    important_text_path = os.path.join(path, report_id, f'{report_id}_important_text.txt')

    if not os.path.isfile(important_text_path):
        important_text, pages = ReportExtracting.ReportExtractor(text, report_id).extract_important_text()

        important_text = "" if important_text is None else important_text

        with open(important_text_path, 'w') as stream:
            stream.write(important_text)

        return important_text
    
    with open(important_text_path, 'r') as stream:
        important_text = stream.read()
        
    important_text = None if important_text == "" else important_text

    return important_text

In [None]:
# Get safety issues extracted from all reports


# importlib.reload(ReportExtracting)

# safety_issues = []

# for report in reports:
    
#     with open(os.path.join(reports_path, report, f'{report}.txt'), 'r') as stream:
#         text = stream.read()

#     important_text = read_or_create_important_text_file(reports_path, report, text)

#     if important_text == None:
#         print(" No safety issues from " + report + " as important text could not be found")
#         continue

#     extracted_safety_issues = ReportExtracting.SafetyIssueExtractor(text, report)._extract_safety_issues_with_regex(important_text)

#     safety_issues.append({
#         'report_id': report,
#         'safety_issues': extracted_safety_issues
#     })

# safety_issues_df = pd.DataFrame(safety_issues)


In [None]:
# Lengthen it so that each safety issue has its own row
safety_issues_df = safety_issues_df.explode('safety_issues')

# Remove reports without explicit safety issues
safety_issues_df.dropna(subset=['safety_issues'], inplace = True)

In [None]:
# Get all the safety issues that have non standard characters

standard_characters_regex = r'''^[\w,.\s()'"/\-%]+$'''

non_standard_safety_issues = safety_issues_df[~safety_issues_df['safety_issues'].str.match(standard_characters_regex)]

In [None]:
safety_issues_df = safety_issues_df[safety_issues_df['safety_issues'].str.match(standard_characters_regex)]
print(f"  Removed {len(non_standard_safety_issues)} non standard safety issues")

### New method with LLM

I have the updated dataset from https://github.com/1jamesthompson1/TAIC-report-summary/pull/141

Therefore I can simply read that dataset.

In [None]:
safety_issues = []

for report in reports:
    
    SI_file_path = os.path.join(reports_path, report, f"{report}_safety_issues.yaml")

    if os.path.exists(SI_file_path):
        with open(SI_file_path, "r") as f:
            safety_issues.append({
                "report_id": report,
                "safety_issue": yaml.safe_load(f)
                })
    else:
        print(f"Could not find safety issues for {SI_file_path}")

safety_issues_df = pd.DataFrame(safety_issues)
safety_issues_df = safety_issues_df.explode("safety_issue")

safety_issues_df['safety_issue'] = safety_issues_df['safety_issue'].apply(lambda x: x['safety_issue'])

display(safety_issues_df)


## Combine the two datasets

Now that I have the two data sets `safety_issues_df` and `cleaned_TAIC_recommendations_df` I can combine them together to be compared

In [None]:
# List the safety issues and the recommendations for each report

combined_df = []

for report in safety_issues_df['report_id'].unique():

    report_safety_issues = safety_issues_df[safety_issues_df['report_id'] == report]['safety_issue']

    report_recommendations = cleaned_TAIC_recommendations_df[cleaned_TAIC_recommendations_df['report_id'] == report]['recommendation']

    for safety_issue in report_safety_issues:
        for recommendation in report_recommendations:
            combined_df.append({
                'report_id': report,
                'safety_issue': safety_issue,
                'recommendation': recommendation
            })

combined_df = pd.DataFrame(combined_df)

print(f"  There are {len(combined_df)} safety issue and recommendation pairs.")

test = combined_df.loc[combined_df.duplicated()]

# Link analysis functions

Now that I have a big dataset I need to compare all the possible connections and see which ones stick

## How many tokens am I dealing with

As I have a large dataset with ~500 rows, I need to know roughly how many tokens it contains, because this will give me a rough cost guide.
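
The cost arithmetic itself is simple. As a sketch, assuming the same flat price of $0.01 per 1,000 tokens that `rough_cost` uses below, and counting only the safety issue and recommendation text (not the system prompt or the model's reply), the estimate looks like this. The function name `estimate_cost_usd` and the token figures are hypothetical.

```python
def estimate_cost_usd(total_tokens, usd_per_1k_tokens=0.01):
    """Rough input cost: tokens / 1000 * price per 1k tokens, rounded to cents."""
    return round(total_tokens / 1000 * usd_per_1k_tokens, 2)


# Hypothetical example: 500 pairs averaging 150 tokens each
pairs = 500
avg_tokens_per_pair = 150
print(estimate_cost_usd(pairs * avg_tokens_per_pair))
```

Because each pair is sent with the same system prompt, the real bill will be somewhat higher than this figure; it is only meant as a lower-bound guide.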

In [None]:
# Count up the tokens in the safety issues and recommendations

def how_many_tokens(df):

    df['safety_issue_tokens'] = df['safety_issue'].apply(lambda x: sum(openAICaller.get_tokens(x)))
    df['recommendation_tokens'] = df['recommendation'].apply(lambda x: sum(openAICaller.get_tokens(x)))

    total_number_of_tokens = sum(df['safety_issue_tokens']) + sum(df['recommendation_tokens'])

    return total_number_of_tokens

def rough_cost(df):

    tokens = how_many_tokens(df)

    return round(tokens/1000 * 0.01, 2)

print(f"I am dealing with {len(combined_df)} possible combinations. The cost to read with gpt 4 turbo is ${rough_cost(combined_df)}.")

## Evaluating link functions
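
The "all" method below asks the model for a YAML list with one entry per recommendation, each holding the indices of the linked safety issues and a quality label for each. As a sketch of how such a response maps back to text, the snippet flattens an already-parsed response (the kind of Python objects `yaml.safe_load` returns) into one tuple per link. `flatten_links` and the sample data are hypothetical and only illustrate the expected shape.

```python
def flatten_links(parsed_response, safety_issues, recommendations):
    """Turn the parsed model response into (recommendation, safety_issue, quality) tuples."""
    rows = []
    for entry in parsed_response:
        rec = recommendations[entry["recommendation_index"]]
        # 'SI' and 'SI_quality' are parallel lists: one quality label per linked issue
        for si_index, quality in zip(entry["SI"], entry["SI_quality"]):
            rows.append((rec, safety_issues[si_index], quality))
    return rows


# Made-up example with indices starting at 0, matching the numbered lists sent to the model
safety_issues = ["SI zero", "SI one"]
recommendations = ["Rec zero", "Rec one"]
parsed = [
    {"recommendation_index": 0, "SI": [1], "SI_quality": ["Confirmed"]},
    {"recommendation_index": 1, "SI": [0, 1], "SI_quality": ["Possible", "Confirmed"]},
]

for row in flatten_links(parsed, safety_issues, recommendations):
    print(row)
```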

In [None]:
# This collection of functions evaluates the recommendations using the 'all' method: the LLM is given all the SI and recommendations for a report at the same time

def create_numbered_list_string(items):
    return "\n".join([f"{i}. {item}" for i, item in enumerate(items)])

def issues_recommendations_to_string(SI, recs):
    return """

Here are the safety issues:
{SI}

Here are the recommendations:
{recs}

""".format(SI=create_numbered_list_string(SI), recs=create_numbered_list_string(recs))

def evaluate_all_SI_recommendations(SI, recommendations):
  response = openAICaller.query(
        system = f"""
You are going to help me find links between recommendations and safety issues identified in transport accident investigation reports.

Each transport accident investigation report will identify safety issues. These reports will then issue recommendations that will address one or more of the safety issues identified in the report.

You will be given the list of safety issues and recommendations from the report. You will need to identify which safety issues the recommendation was meant to solve.

There are two types of links that you can use:
- Possible (The recommendation is reasonably likely to directly address the safety issue)
- Confirmed (The recommendation explicitly mention that safety issue that it is trying address)

Note that a recommendation will always be issued to address at least one safety issue but can solve multiple safety issues. However, not all safety issues will have a recommendation. Therefore there should be a linked safety issue for each recommendation.

Your response should be a yaml list of all links

- recommendation_index: 0
  SI: [1, ... ]
  SI_quality: ['Confirmed', ... ]
...

Your yaml response never needs opening or closing code blocks.
        """,
        user = issues_recommendations_to_string(SI, recommendations),
        model = "gpt-4",
        temp = 0
    )

  try:
    response_yaml = yaml.safe_load(response)
  except yaml.YAMLError as exc:
    print(f"Response was incorrect \n{response} error is \n{exc}")
    # Re-raise so that response_yaml is never used while undefined
    raise

  # A single link may be parsed as a dict rather than a list, so normalise before counting
  if not isinstance(response_yaml, list):
    response_yaml = [response_yaml]

  # Check to make sure there is the right amount of recommendations
  if len(response_yaml) != len(recommendations):
    print("Model response is incorrect: there is not the right amount of recommendations")

  return response_yaml

def find_recommendation_links_all(df):
  """
  Take an expanded df (that is, a row for each potential safety issue and recommendation match) and find all recommendation matches.
  """

  # Convert to collapsed df with one row per report
  collapsed_df = df.groupby('report_id').agg({'safety_issue': list, 'recommendation': list}).reset_index()

  collapsed_df['safety_issue'] = collapsed_df['safety_issue'].apply(lambda x: list(set(x)))
  collapsed_df['recommendation'] = collapsed_df['recommendation'].apply(lambda x: list(set(x)))

  # Perform evaluation
  collapsed_df['links'] = collapsed_df.apply(lambda row: evaluate_all_SI_recommendations(row['safety_issue'], row['recommendation']), axis=1)

  # Expand back into all possible links

  expanded_df = collapsed_df.explode('links')


  expanded_df['recommendation'] = expanded_df.apply(lambda x: x['recommendation'][x['links']['recommendation_index']], axis=1)
  expanded_df['targeted_SI'] = expanded_df['links'].apply(lambda x: x['SI'])

  expanded_df = expanded_df.explode('targeted_SI')

  expanded_df.dropna(subset=['targeted_SI'], inplace=True)

  expanded_df['safety_issue'] = expanded_df.apply(lambda x: x['safety_issue'][x['targeted_SI']], axis=1)
  expanded_df['link'] = expanded_df.apply(lambda x: x['links']['SI_quality'][x['links']['SI'].index(x['targeted_SI'])], axis=1)

  # Join the expanded links as well as original to make sure that the none links are present.
  combined_df = df.drop(labels = 'link', axis = 1, errors = 'ignore').merge(expanded_df, how = 'outer')[['report_id', 'safety_issue', 'recommendation', 'link']]

  combined_df.fillna("None", axis =1, inplace = True)

  return combined_df


In [None]:
# Function is used to evaluate a given recommendation and return either None, Possible, or Confirmed


def evaluate_SI_recommendation_link(SI, recommendation):
    response =openAICaller.query(
        system = f"""
You are going to help me find links between recommendations and safety issues identified in transport accident investigation reports.

Each transport accident investigation report will identify safety issues. These reports will then issue recommendations that will address one or more of the safety issues identified in the report.

For each pair given you need to respond with one of three answers.

- None (The recommendation is not directly related to the safety issue)
- Possible (The recommendation is reasonably likely to directly address the safety issue)
- Confirmed (The recommendation explicitly mention that safety issue that it is trying address)
""",
        user = f"""
Here is the safety issue:

{SI}


Here is the recommendation:

{recommendation}

Now can you please respond with one of three options

- None (The recommendation is not directly related to the safety issue)
- Possible (The recommendation is reasonably likely to directly address the safety issue)
- Confirmed (The recommendation explicitly mention that safety issue that it is trying address)
""",
        model = "gpt-4",
        temp = 0)
    
    if response in ['None', 'Possible', 'Confirmed']:
        return response
    else:
        print(f"Model response is incorrect and is {response}")
        return 'undetermined'


In [None]:
# Look for pkl file and if not found remake it
_EVALUATION_TYPES = Literal['single', 'all']

def perform_link_evaluation(df, _file_name = 'links', link_evaluation_method:_EVALUATION_TYPES = 'single'):
    """
    This function will do the link evaluation with the LLM and save the result to a file at the end. The next time it is run it will just retrieve the file if it thinks it is up to date.
    """

    file_name = f"{_file_name}.pkl"

    redo = False
    if os.path.exists(file_name):

        print(f" Found pkl file {file_name}")
        found_df = pd.read_pickle(file_name)

        if found_df.drop(columns='link').equals(df.drop(columns='link', errors='ignore')):
            print("  Pkl file is up to date")
            df = found_df
        else:
            redo =True
    else:
        redo = True

    if not redo:
        return df
    
    print(f"Remaking the file")

    match link_evaluation_method:
        case 'single':
            df['link'] = df.apply(lambda x: evaluate_SI_recommendation_link(x['safety_issue'], x['recommendation']), axis=1)

        case 'all':
            df = find_recommendation_links_all(df)

    print(f" saving file: {file_name}")
    df.to_pickle(file_name)

    return df


## Visualize the results functions

In [None]:
def make_link_visualization(df, report_id, folder = 'visualization_of_links'):
    '''
    Create a picture that shows both the recommendations and the safety issues for a report. There will be arrows showing the link between the two.
    '''
    # Create a directed graph
    print("Creating visualization")
    G = nx.DiGraph()

    NODE_WIDTH = 50

    # Preemptively wrap the text
    df['recommendation'] = df['recommendation'].apply(lambda x: "\n".join(textwrap.wrap(x, width=NODE_WIDTH)))
    df['safety_issue'] = df['safety_issue'].apply(lambda x: "\n".join(textwrap.wrap(x, width=NODE_WIDTH)))

    # Add a node for each recommendation (left column) and each safety issue (right column)
    for i, text in enumerate(df['recommendation'].unique()):
        G.add_node(text, pos=(0, i))

    for i, issue in enumerate(df['safety_issue'].unique()):
        G.add_node(issue, pos=(3, i))


    # Add edges between the matched indices
    for _, row in df.iterrows():
        # Add solid arrow
        if row['link'] == 'Confirmed':
            G.add_edge(row['recommendation'], row['safety_issue'], color='red', style='solid', alpha =1)

        # Add dotted arrow
        elif row['link'] == 'Possible':
            G.add_edge(row['recommendation'], row['safety_issue'], color='blue', style='dashed', alpha = 0.5)


    max_num_of_nodes_column = max(
        len(df['recommendation'].unique()),
        len(df['safety_issue'].unique())
        )

    # Draw the graph
    pos = nx.get_node_attributes(G, 'pos')

    plt.figure(figsize=(10, max_num_of_nodes_column * 5))  
    nx.draw_networkx_nodes(G, pos, node_color='lightblue', node_size=[len(node) * NODE_WIDTH for node in G.nodes()])
    nx.draw_networkx_labels(G, pos, font_size=7)
    nx.draw_networkx_edges(
        G, pos, arrows=True,
        edge_color=list(nx.get_edge_attributes(G, 'color').values()),
        style=list(nx.get_edge_attributes(G, 'style').values()),
        alpha=list(nx.get_edge_attributes(G, 'alpha').values()),
    )

    plt.xlim(-1, 4.5)  # Add buffer to the outside side edges
    plt.ylim(-1, max_num_of_nodes_column*1)  # Add buffer to the outside top and bottom edges

    # Add report ID and headers
    plt.title(f'Report ID: {report_id}')
    plt.text(0, max_num_of_nodes_column-0.2, 'Recommendations', fontsize=12, ha='center')
    plt.text(3, max_num_of_nodes_column-0.2, 'Safety issue', fontsize=12, ha='center')
    plt.text(1.5, max_num_of_nodes_column-0.5, 'Arrow indicates that the recommendation is indicated to solve the safety issue by the LLM', fontsize=6, ha='center')

    os.makedirs(folder, exist_ok=True)

    plt.savefig(os.path.join(folder, f'{report_id}_comparison_visualization.png'))
    plt.close()

In [None]:
# Split the df into one df per report_id, then send each one to the make visualization function
def make_link_visualization_for_df(df, output_folder = 'visualization_of_links'):
    for report_id in df['report_id'].unique():
        d = df[df['report_id'] == report_id]

        make_link_visualization(d, report_id, output_folder)


# Performing link analysis

## Trying out linking with sample

I am going to simply look at 10 random reports and see how to do some comparisons
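
Because the combined dataset repeats each `report_id` once per pair, a sample of reports should be drawn from the distinct IDs rather than from the rows. The sketch below shows the idea in plain Python; `sample_report_ids` is a hypothetical helper and the IDs are made up.

```python
import random


def sample_report_ids(report_ids, k=10, seed=42):
    """Pick up to k distinct report IDs, reproducibly for a given seed."""
    # Deduplicate and sort first so the result does not depend on row order or repeats
    unique_ids = sorted(set(report_ids))
    k = min(k, len(unique_ids))
    return random.Random(seed).sample(unique_ids, k)


# Made-up column where reports with more pairs appear more often
column = ["2011_104"] * 6 + ["2013_005"] * 1 + ["2014_005"] * 3 + ["2016_206"] * 2
print(sample_report_ids(column, k=3))
```

Sampling rows directly would favour reports with many pairs and could return the same report more than once, giving fewer than the intended number of reports.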

### Get the sample dataset

In [None]:
# Get the play dataset

# Sample from the distinct report IDs so that 10 different reports are selected
playset_reports = combined_df['report_id'].drop_duplicates().sample(10, random_state=42)

combined_df_playset = combined_df[combined_df['report_id'].isin(playset_reports)]

print(f"  There are {len(combined_df_playset)} safety issues and recommendations in the playset.\nIt will cost roughly ${rough_cost(combined_df_playset)} to run through.")

### Perform analysis on playset

In [None]:
df_playset_links = perform_link_evaluation(combined_df_playset, _file_name = 'playset_links', link_evaluation_method = 'single')

## Perform link analysis

In [None]:
cleaned_TAIC_recommendations_df.to_csv('cleaned_TAIC_recommendations_df.csv')

In [None]:
links_df_SI_first = perform_link_evaluation(combined_df, _file_name = 'all_links_SI_first', link_evaluation_method = 'single')
links_df_SI_second = pd.read_pickle('all_links_SI_second.pkl')


## Look at the linking results

### Visualization

In [None]:
make_link_visualization_for_df(links_df_SI_first, 'visualization_of_links_SI_first')
make_link_visualization_for_df(links_df_SI_second, 'visualization_of_links_SI_second')

### Basic stats

In [None]:
def get_summary_stats_links(df):
    """
    This function prints a summary of the links with all useful statistics.
    """

    # Check to make sure that there are the 4 required columns
    if not {'report_id', 'safety_issue', 'recommendation', 'link'}.issubset(df.columns):
        raise ValueError("df must have columns 'report_id', 'safety_issue', 'recommendation', and 'link'")
    
    number_of_reports = df['report_id'].unique().shape[0]
    
    return_string = f"""
== Here are some summary stats for the df ==

There are {df.shape[0]} links compared.
    """
    
    ### Stats about recommendations

    number_of_recommendations = df['recommendation'].unique().shape[0]

    avg_recs_per_report = number_of_recommendations/number_of_reports

    return_string += f"""
There are {number_of_recommendations} unique recommendations and {avg_recs_per_report:.2f} recommendations per report.
    """

    ### Stats about safety issues

    number_of_safety_issues = df['safety_issue'].unique().shape[0]

    avg_safety_issues_per_report = number_of_safety_issues/number_of_reports

    return_string += f"""
There are {number_of_safety_issues} unique safety issues and {avg_safety_issues_per_report:.2f} safety issues per report.
    """

    ### Stats about links

    number_of_links = df.shape[0]

    avg_links_per_report = number_of_links/number_of_reports

    avg_confirmed_links_per_report = df[df['link'] == 'Confirmed'].shape[0]/number_of_reports

    avg_total_links_per_report = df[df['link'].isin(['Confirmed', 'Possible'])].shape[0]/number_of_reports

    link_type_count = df['link'].value_counts()

    return_string += f"""
There are {number_of_links} unique links and {avg_confirmed_links_per_report:.2f} confirmed links per report.
The link type breakdown is as follows:
None: {link_type_count['None']} ({link_type_count['None']/number_of_links*100:.2f}% of all links)
Possible: {link_type_count['Possible']} ({link_type_count['Possible']/number_of_links*100:.2f}% of all links)
Confirmed: {link_type_count['Confirmed']} ({link_type_count['Confirmed']/number_of_links*100:.2f}% of all links)
    """

    ### Stats about links per recommendations

    possible_or_greater_links_per_rec = avg_total_links_per_report/avg_recs_per_report

    confirmed_links_per_rec = avg_confirmed_links_per_report/avg_recs_per_report

    potential_links_per_rec = avg_links_per_report/avg_recs_per_report

    return_string += f"""
There are about {possible_or_greater_links_per_rec:.2f} links per recommendation.
These are {confirmed_links_per_rec:.2f} confirmed per recommendation.
Note that there were on average {potential_links_per_rec:.2f} potential links per recommendation. So only about {confirmed_links_per_rec/potential_links_per_rec*100:.2f}% of links were confirmed.
"""
    
    return_string += f"""
== End of summary ==
    """

    print(return_string)

In [None]:
get_summary_stats_links(links_df_SI_first)

In [None]:
get_summary_stats_links(links_df_SI_second)

### Comparison of two prompts

I now have two link datasets: one where the SI is given first in the prompt and one where the SI is given second.

The SI-first prompt solves problems that SI-second had. However, I need to confirm that it doesn't change too much.

The summary stats show that they are quite similar, but I can also compare row by row to find the differences.
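
As a sketch of the row-by-row comparison, the snippet below counts how often each label changes between two runs, keyed by the (report, safety issue, recommendation) triple. It uses plain dictionaries and made-up labels; `count_link_changes` is a hypothetical name.

```python
from collections import Counter


def count_link_changes(links_before, links_after):
    """Count 'old -> new' transitions for pairs present in both runs.

    Each argument maps (report_id, safety_issue, recommendation) to a link label.
    """
    changes = Counter()
    for key in links_before.keys() & links_after.keys():
        if links_before[key] != links_after[key]:
            changes[f"{links_before[key]} -> {links_after[key]}"] += 1
    return changes


# Made-up labels for three pairs
si_second = {
    ("r1", "s1", "x"): "Confirmed",
    ("r1", "s2", "x"): "Possible",
    ("r2", "s1", "y"): "None",
}
si_first = {
    ("r1", "s1", "x"): "None",
    ("r1", "s2", "x"): "Possible",
    ("r2", "s1", "y"): "Possible",
}

print(count_link_changes(si_second, si_first))
```

The downgrades (for example `Confirmed -> None`) are the transitions most worth reading by hand, since they are where a previously found link disappears.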

In [None]:
comparison_df = pd.merge(links_df_SI_second, links_df_SI_first, on = ['report_id', 'safety_issue', 'recommendation'], suffixes=('_SI_second', '_SI_first'))

comparison_df['link_changed'] = comparison_df['link_SI_first'] != comparison_df['link_SI_second']

changed_df = comparison_df[comparison_df['link_changed']]

changed_df['change'] = changed_df['link_SI_second'] + ' -> ' + changed_df['link_SI_first']

display(changed_df['change'].value_counts())

# What are the rows where the link was downgraded

print("Links going from Confirmed -> None")
display(changed_df[changed_df['change'] == 'Confirmed -> None'])

print("Links going from Confirmed -> Possible")
display(changed_df[changed_df['change'] == 'Confirmed -> Possible'])

*2011_102*
The recommendation is a summary of all of the other recommendations. This means that it has no relevant context

*2018_001*
Even though a connection could be made, it was not really necessary.

The rest of these are maybe not perfect but it is not that different.

# Testing out analysis


The two identified problems are:
- There are too many links, and we only need one link type. This will be fixed by upgrading some of the Possible links, although most of them will be deleted.
- There are recommendations without any links.

Therefore the main problem to focus on here is the **recommendations without any links**.

### Which ones don't have links

I need to focus on the problem that some recommendations don't have links.
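
The idea can be sketched without pandas: a recommendation counts as unlinked when none of its rows has a `Confirmed` link. The helper name `unlinked_recommendations` and the rows are hypothetical. Unlike the notebook function below, this sketch keys on the (report_id, recommendation) pair rather than on the recommendation text alone, which is an assumption made here to keep identical texts in different reports apart.

```python
def unlinked_recommendations(rows):
    """Return sorted (report_id, recommendation) pairs that have no Confirmed link.

    rows is a list of dicts with 'report_id', 'recommendation' and 'link' keys.
    """
    all_recs = {(r["report_id"], r["recommendation"]) for r in rows}
    confirmed = {
        (r["report_id"], r["recommendation"])
        for r in rows
        if r["link"] == "Confirmed"
    }
    return sorted(all_recs - confirmed)


# Made-up rows: Rec 2 only has a Possible link, Rec 3 has no link at all
rows = [
    {"report_id": "r1", "recommendation": "Rec 1", "safety_issue": "SI A", "link": "Confirmed"},
    {"report_id": "r1", "recommendation": "Rec 1", "safety_issue": "SI B", "link": "None"},
    {"report_id": "r1", "recommendation": "Rec 2", "safety_issue": "SI A", "link": "Possible"},
    {"report_id": "r1", "recommendation": "Rec 3", "safety_issue": "SI A", "link": "None"},
]

print(unlinked_recommendations(rows))
```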

In [None]:
# Find recommendations from combined_df_playset that dont have a linked safety issue

def find_unlinked_recommendations(df):
    all_recommendations = df.drop_duplicates(['report_id', 'recommendation', 'safety_issue'])

    linked_recommendations = df[df['link'] == "Confirmed"]

    unlinked_recommendations = all_recommendations[~all_recommendations['recommendation'].isin(linked_recommendations['recommendation'])]

    return unlinked_recommendations

#### SI second

In [None]:
find_unlinked_recommendations(links_df_SI_second)

Here I will look through the 5 reports that have unlinked recommendations.

**2010_009**
This report has really short safety issues. This can make it hard for it to find anything to link to.

<em>Hmm, yes, I think you’re right for the most part. Some of it is to do with the way we are writing our reports. Re the unlinked rec on drug and alcohol: it is mentioned in 4.1.3 as an ‘issue’ (although not a ‘safety issue’) and then turns up in a ‘findings’ box. Also, the text talks about cannabis and performance-impairing substances but then the rec refers to ‘drug and alcohol’ – but maybe that wouldn’t flummox your model!

For the other unlinked recs, they are about monitoring, checking, ensuring standards, which I would all link to the safety issue ‘ CAA’s oversight of the parachuting industry’. So I think the report has identified several safety issues and broadly lumped them under one ‘theme’; and then written some recs on that topic. So, yes, I agree that this is about safety rec extraction. But one of them – the one about monitoring the outcome of a FAA/US Parachute Assoc report -- refers to something in the report that isn’t actually a safety issue. IMHO. I don’t think it really should be a recommendation (assuming that a rec is about system-level safety issues).  

Some very vague wording and passive sentence construction doesn’t help.</em>

**2016_206**
One is missing a red link but does have a blue link. 
This link could be upgraded to red if needed.

<em> Ditto last sentence above. The rec refers to ‘people’ – but which people? If it’s recreational boaties, then the link is to the third rec,  I think the safety issue links to the third rec only; otherwise both blue links should be red. On reading the report, I think it’s the latter.</em>

**2013_005**
Vague wordings that don't make it clear that they are addressing each other.
The recommendation is specific where both of the safety issues are more general.

<em>One of the safety issues here is identified as such in the report (the only one identified) and the other is a ‘finding’ (see 1.4). But the unlinked rec also relates to a finding listed in 1.4 – so not sure why the model would have picked up one of these as a safety issue, but not the other….?</em>

**2014_005**
The first missing recommendation about seat belts does not seem to have an associated safety issue.
Recommendations regarding the vortex effect has a blue link that seems reasonable.

<em>This seems to me another example of making recs on the basis of things other than safety issues – in this case key lessons. I think all three safety issues are linked to the middle rec on ‘safety culture’, which is about the CAA doing something in its oversight role – reviewing and analysing safety culture. The other two recs are about its educational role and using the key lessons from the inquiry to educate pilots about stuff – so (in my opinion anyway) not really a rec aimed at making change at the system level to mitigate a risk.</em>

**2011_204**
There are not many direct links here but the two blue links will work.

<em>This is a bit like 2010-009 – a potential safety issue is identified at 4.1.19 and discussed at 4.8. There was not enough evidence to say that it was, so the rec was about collecting data to see whether action was justified.</em>

_I am awaiting a response from Ingrid_

#### SI first

In [None]:
find_unlinked_recommendations(links_df_SI_first)

### Upgrading possible to confirmed
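
The upgrade rule can be stated compactly: if a recommendation has no `Confirmed` link, its `Possible` links are promoted to `Confirmed`; recommendations that already have a confirmed link are left alone. The sketch below shows that rule on plain dictionaries. `upgrade_possible_links` and the rows are hypothetical, and this is an illustration of the rule rather than the pandas implementation used in the notebook.

```python
def upgrade_possible_links(rows):
    """Promote Possible links to Confirmed for recommendations with no Confirmed link."""
    already_confirmed = {
        (r["report_id"], r["recommendation"])
        for r in rows
        if r["link"] == "Confirmed"
    }
    upgraded = []
    for r in rows:
        key = (r["report_id"], r["recommendation"])
        if r["link"] == "Possible" and key not in already_confirmed:
            # Copy the row so the input list is left untouched
            upgraded.append({**r, "link": "Confirmed"})
        else:
            upgraded.append(dict(r))
    return upgraded


# Made-up rows: Rec 1 already has a confirmed link, Rec 2 does not
rows = [
    {"report_id": "r1", "recommendation": "Rec 1", "safety_issue": "SI A", "link": "Confirmed"},
    {"report_id": "r1", "recommendation": "Rec 1", "safety_issue": "SI B", "link": "Possible"},
    {"report_id": "r1", "recommendation": "Rec 2", "safety_issue": "SI A", "link": "Possible"},
    {"report_id": "r1", "recommendation": "Rec 2", "safety_issue": "SI B", "link": "None"},
]

for row in upgrade_possible_links(rows):
    print(row["recommendation"], row["safety_issue"], row["link"])
```

Recommendations whose rows are all `None` stay unlinked after this step, which is why a few remain to be reviewed by hand.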

#### Functions

In [None]:
def upgrade_unlinked_recommendations(df):

    # Find unlinked recommendations
    unlinked_recommendations = find_unlinked_recommendations(df)

    print(f"There are {unlinked_recommendations.drop_duplicates(['report_id', 'recommendation']).shape[0]} unlinked recommendations. This is {unlinked_recommendations.drop_duplicates(['report_id', 'recommendation']).shape[0]/df.drop_duplicates(['report_id', 'recommendation']).shape[0]*100:.2f}% of all recommendations.")

    # Find which links should be upgraded
    link_to_upgrade = unlinked_recommendations[unlinked_recommendations['link'] == "Possible"]
    link_to_upgrade['link'] = "Confirmed"

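    # Upgrade links in original df.
    # The outer merge adds the upgraded rows and sorts on the shared columns, so a
    # "Confirmed" row is expected to sort ahead of its original "Possible" copy;
    # keep='first' then retains the upgraded version.
    upgraded_df = df.merge(link_to_upgrade, how = 'outer')

    upgraded_df.drop_duplicates(subset=['report_id', 'safety_issue', 'recommendation'], keep = 'first', inplace = True)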

    num_upgraded_links = upgraded_df.value_counts('link')['Confirmed'] - df.value_counts('link')['Confirmed']

    print(f"{num_upgraded_links} links were upgraded.\n This represents {num_upgraded_links/df.shape[0]*100:.2f}% of all links and {num_upgraded_links/df[df['link'] == 'Possible'].shape[0]*100:.2f}% of possible links.")

    still_unlinked_recommendations = find_unlinked_recommendations(upgraded_df).drop_duplicates(['report_id', 'recommendation'])[["report_id", "recommendation"]]
    print(f"After performing the upgrading there are still {still_unlinked_recommendations.shape[0]} unlinked recommendations which is {still_unlinked_recommendations.shape[0]/df.drop_duplicates(['report_id', 'recommendation']).shape[0]*100:.2f}% of all recommendations.")
    if 0 < still_unlinked_recommendations.shape[0] < 10:
        display(still_unlinked_recommendations)

    return upgraded_df
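
The upgrade step above can also be sketched on a toy frame using a boolean mask instead of an outer merge. This is only an illustration under assumptions: `find_unlinked` is a hypothetical stand-in for `find_unlinked_recommendations`, assumed here to return the rows of every recommendation that has no "Confirmed" link, and `upgrade_with_mask` and the toy data are made up for the example.

```python
import pandas as pd


def find_unlinked(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: rows of recommendations with no 'Confirmed' link."""
    has_confirmed = (
        df["link"].eq("Confirmed")
        .groupby([df["report_id"], df["recommendation"]])
        .transform("any")
    )
    return df[~has_confirmed]


def upgrade_with_mask(df: pd.DataFrame) -> pd.DataFrame:
    """Upgrade 'Possible' to 'Confirmed', but only for unlinked recommendations."""
    unlinked = find_unlinked(df)
    to_upgrade = unlinked.index[unlinked["link"] == "Possible"]
    upgraded = df.copy()
    upgraded.loc[to_upgrade, "link"] = "Confirmed"
    return upgraded


# Toy data: r1 already has a confirmed link, r2 only has a possible one.
toy = pd.DataFrame({
    "report_id": ["A", "A", "A", "A"],
    "safety_issue": ["si1", "si2", "si1", "si2"],
    "recommendation": ["r1", "r1", "r2", "r2"],
    "link": ["Confirmed", "Possible", "Possible", "None"],
})

upgraded_toy = upgrade_with_mask(toy)
print(upgraded_toy)
```

In this sketch the "Possible" link for `r1` is left alone because `r1` already has a confirmed link, while the "Possible" link for `r2` is upgraded. Writing through the index avoids relying on row order after a merge.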

#### Performing upgrading

In [None]:
# Looking at second

links_df_SI_second_upgraded = upgrade_unlinked_recommendations(links_df_SI_second)

In [None]:
# Looking at SI first in prompt

links_df_SI_first_upgraded = upgrade_unlinked_recommendations(links_df_SI_first)

In [None]:
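def make_visuals_for_upgraded(df):
    # Treat any remaining "Possible" links as "None". Only the link column is changed;
    # a row-wise df.where(...) would overwrite every column of those rows.
    remove_possible_df = df.assign(link=df['link'].where(df['link'] != 'Possible', "None"))
    make_link_visualization_for_df(remove_possible_df)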

#### Remaining unlinked

There are 5 recommendations that still aren't linked.

I will go through each of them here.

_After discussion with TAIC these were all deemed unimportant, and it would be reasonable to leave them unlinked. The only caveat was 2013_002, as it was a simple error._

##### 2010_009

A drug- and alcohol-related recommendation where the safety issues are very short and nondescript. This means no link is found, even though it could be argued to be linked to the last safety issue, which concerns the operator's and owner's use of the aircraft.


##### 2011_003

Flight tracking devices are not relevant at all to any of the safety issues identified.

##### 2013_002

Here there is a clear link and the model is simply missing it.

I am not sure why, so I will run some tests.

In [None]:
missing_clear_link = combined_df[combined_df['report_id'] == "2013_002"]

missing_clear_link['link'] = missing_clear_link.apply(lambda x: evaluate_SI_recommendation_link(x['safety_issue'], x['recommendation']), axis=1)

I have had a look and the problem is that it doesn't think a helicopter is a type of aircraft.

I could resolve this by adding some information to the system message about helicopters being a type of aircraft.

However, I noticed that just changing the order around fixes it. I have worked on it above and now have two link datasets. I will have to compare these and see how different they really are.
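
One way the two link datasets could be compared is sketched below. It assumes both frames share the `report_id`, `safety_issue`, `recommendation` and `link` columns; `compare_link_sets` and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd

KEYS = ["report_id", "safety_issue", "recommendation"]


def compare_link_sets(first: pd.DataFrame, second: pd.DataFrame) -> pd.DataFrame:
    """Line up the two link columns for each (report, safety issue, recommendation)."""
    merged = first[KEYS + ["link"]].merge(
        second[KEYS + ["link"]],
        on=KEYS,
        how="inner",
        suffixes=("_si_first", "_si_second"),
    )
    merged["agree"] = merged["link_si_first"] == merged["link_si_second"]
    return merged


# Toy stand-ins for the SI-first and SI-second results.
pairs = {
    "report_id": ["A", "A", "B", "B"],
    "safety_issue": ["si1", "si2", "si1", "si2"],
    "recommendation": ["r1", "r1", "r1", "r1"],
}
toy_first = pd.DataFrame({**pairs, "link": ["Confirmed", "None", "Possible", "None"]})
toy_second = pd.DataFrame({**pairs, "link": ["Confirmed", "Confirmed", "Possible", "None"]})

comparison = compare_link_sets(toy_first, toy_second)
agreement_rate = comparison["agree"].mean()

# Cross-tabulate the labels to see where the two prompt orders disagree.
confusion = pd.crosstab(comparison["link_si_first"], comparison["link_si_second"])
print(f"Agreement: {agreement_rate:.2%}")
print(confusion)
```

The cross-tabulation shows which label changes account for the disagreement, for example "None" with SI first becoming "Confirmed" with SI second.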

##### 2013_005

This is the problem where the recommendation relates to a finding but not a safety issue.

##### 2014_005

Missing a direct link

#### Conclusion

I have run this on a complete run-through with two types of prompt: either SI first or SI second.

This was to resolve the 2013_005 problem of missing a direct link.

There is no unlinked recommendation problem anymore.

I will have to confirm that flipping the prompt is not too problematic.

### Giving all issues and recommendations at the same time and asking for a link

In [None]:
# Perform evaluation
combined_df_playset_alltogather = find_recommendation_links_all(combined_df_playset)

In [None]:
for report_id in combined_df_playset_alltogather['report_id'].unique():
    d = combined_df_playset_alltogather[combined_df_playset_alltogather['report_id'] == report_id]

    make_link_visualization(d, report_id, 'visualization_of_links_alltogather')

In [None]:
# Having a look at 2019_006
single_df = find_recommendation_links_all(combined_df[combined_df['report_id'] == '2019_006'])

make_link_visualization(single_df, '2019_006', 'visualization_of_links_alltogather')