# Extracting safety issues from the data

### Problem introduction
Safety issues are being extracted from the reports.

There are two types of safety issues being extracted.
Firstly are exact extraction. These are safety issues that are explicitly mentioned in the report. They will follow something like "safety issue: ..." or "safety issue - ...". Exact safety issue extraction is not too much of a worry and just needs to be tested and could be done so with unit tests.

Secondly however are inferred safety issues. These safety issues are supposedly present but are not explicitly stated. This requires a little bit more validation then a simple unit test. Fortunately I can get a good test set.

All the reports that have the safety issues explicitly mentioned will show me what the correct answer is. So all I need to do is remove the explicitly mention of the safety issue from the report then I can run it as an inferred and see if it comes up with the same safety issue.

### Solution story

I have peformed some testing on fine-tuning gpt 3.5 turbo.

It was about 56% accurate on the training set (about 15 reports)
However on a testing set it was only 17% accurate. This was becuase about 70% of the items had mismatched number of safety issues extracted.

This means that I wanted to try with more data. There are about 70 reports that I have meaning I can split it between a training, validation and test set. This is what the code does below.

In [1]:
import os
import yaml
import pandas as pd
import re
import importlib
from engine.Extract_Analyze import ReportExtracting
importlib.reload(ReportExtracting)

<module 'engine.Extract_Analyze.ReportExtracting' from '/home/james/code/TAIC-report-summary/engine/Extract_Analyze/ReportExtracting.py'>

## Get datasets

### Functions needed for dataset creation

In [4]:
def  read_or_create_important_text_file(path, report_id, text):
    important_text_path = os.path.join(path, report_id, f'{report_id}_important_text.txt')

    if not os.path.isfile(important_text_path):
        important_text, pages = ReportExtracting.ReportExtractor(text, report_id).extract_important_text()

        important_text = "" if important_text == None else important_text

        with open(important_text_path, 'w') as stream:
            stream.write(important_text)

        return important_text
    
    with open(important_text_path, 'r') as stream:
        important_text = stream.read()
        
    important_text = None if important_text == "" else important_text

    return important_text

def clean_text_and_get_safety_issues(path, reports_ids):
    """
    This will go through a folder and extract all of the safety issues that it can from the reports using regex. Then it will also remove those extracted safety issues from the report text and report a DataFrame that has the report_id, the safety issues extracted, and the cleaned report text.
    """
    safety_issues_raw = []

    for report in reports_ids:
        with open(os.path.join(path, report, f'{report}.txt'), 'r') as stream:
            text = stream.read()

        # Check to see if the important text exists

        important_text = read_or_create_important_text_file(path, report, text)

        if important_text == None:
            print(f'Could not extract important text from {report}')
            continue
        
        safety_regex = lambda x: fr's ?a ?f ?e ?t ?y ? ?i ?s ?s ?u ?e ?s? ?{x} ?'
        end_regex = r'([\s\S]+?)(?=(?:\d+\.(?:\d+\.)?(?:\d+)?)|(?:^ [A-Z])|(?:s ?a ?f ?e ?t ?y ? ?i ?s ?s ?u ?e ?s?))'

        uncompiled_regexes = ["(" + safety_regex(sep) + end_regex + ")" for sep in ["-", ":"]]

        safety_issue_regexes = [re.compile(regex , re.MULTILINE | re.IGNORECASE) for regex in uncompiled_regexes]

        safety_issues_for_report = []

        cleaned_text = text

        matches = [regex.findall(important_text) for regex in safety_issue_regexes]

        # Choose one of the matches that has the most matches
        matches = max(matches, key=lambda x: len(x))

        for full_match, safety_issue_match in matches:
            safety_issues_for_report.append(safety_issue_match)
            cleaned_text = re.sub(
                full_match, '',
                cleaned_text,
                flags=re.IGNORECASE | re.MULTILINE)
    
        safety_issues_raw.append({
            "file": report,
            "safety_issues": safety_issues_for_report,
            'cleaned_report': cleaned_text})

    safety_issues_df = pd.DataFrame(safety_issues_raw)

    safety_issues_df['number_extracted'] = safety_issues_df['safety_issues'].apply(lambda x: len(x))

    return safety_issues_df

### Creating datasets

In [5]:
reports_path = "output"

reports = os.listdir(reports_path)

reports_to_ignore = [
    "2020_102", # This report has a funny safety issue that is not formatted correctly so cant be picked up by the regex
]
reports = [report for report in reports if report not in reports_to_ignore]

all_reports_df = clean_text_and_get_safety_issues(reports_path, reports)

# Remove all reports that have no found safety issues
all_reports_df = all_reports_df[all_reports_df['number_extracted'] > 0]

Could not extract important text from 2010_207
Could not extract important text from 2010_205
Could not extract important text from 2016_002
Could not extract important text from 2019_003


In [6]:
# Get split of 60% train, 20% validation, 20% test
seed = 42

train_df = all_reports_df.sample(frac=0.6, random_state=seed)

validation_df = all_reports_df.drop(train_df.index).sample(frac=0.5, random_state=seed)

test_df = all_reports_df.drop(train_df.index).drop(validation_df.index)

Because this data is in three distinct categories it would be prudent to make sure that the resultant data has similar distribution

In [7]:
# Combine the three dataframes into one and add a column to indicate the split

train_df['split'] = 'train'
validation_df['split'] = 'validation'
test_df['split'] = 'test'

all_datasets_df = pd.concat([train_df, validation_df, test_df])

all_datasets_df['mode'] = all_datasets_df['file'].apply(lambda x: x.split('_')[1][0])

# Group by split and get distribution of mode in each split
all_datasets_df.groupby(['split', 'mode']).count().reset_index()[['split', 'mode', 'file']].rename(columns={'file': 'count'})

Unnamed: 0,split,mode,count
0,test,0,4
1,test,1,5
2,test,2,6
3,train,0,14
4,train,1,17
5,train,2,15
6,validation,0,5
7,validation,1,7
8,validation,2,3


This is not perfect but we can see that it is pretty darn close. The validation set is low on reports. However this is not something that can be completely fixed and aught not to affect the final outcome.

## Creating fine tuned model

Because of the dismal success of just asking it I am going to give it a couple of examples and then see if that helps.

In [28]:
# Required functions for formatting the data
def format_safety_issue_to_yaml(safety_issue):
    safety_issues_dict = []
    for issue in safety_issue:
        safety_issues_dict.append({"safety_issue": issue, "quality": "inferred"})

    return yaml.dump(safety_issues_dict, sort_keys=False)

def format_data(data):
  formatted_data = []
  for index, row in data.iterrows():
    formatted_data.append({"messages": [
        {
            "role": "system",
            "content": f"""
You are going help me read a transport accident investigation report.

 I want you to please read the report and respond with the safety issues identified in the report.

Please only respond with safety issues that are quite clearly stated and/or implied.

It should be noted that the number of safety issues in a report has a minimum of 1 an 0.25 quantile of 1, median of 2, 0.75 quantile of 3 and a maximum of 13. You should try make your answers match this distribution and on average a report will have only 2 safety issues.

Remember the definitions give

Safety factor - Any (non-trivial) events or conditions, which increases safety risk. If they occurred in the future, these would
increase the likelihood of an occurrence, and/or the
severity of any adverse consequences associated with the
occurrence.

Safety issue - A safety factor that:
• can reasonably be regarded as having the
potential to adversely affect the safety of future
operations, and
• is characteristic of an organisation, a system, or an
operational environment at a specific point in time.
Safety Issues are derived from safety factors classified
either as Risk Controls or Organisational Influences.

Safety theme - Indication of recurring circumstances or causes, either across transport modes or over time. A safety theme may
cover a single safety issue, or two or more related safety
issues.
"""
        },
        {
            "role": "user",
            "content": f"""
{ReportExtracting.ReportExtractor(row['cleaned_report'], row['file']).extract_important_text()[0]}
        
=Instructions=

I want to know the safety issues which this investigation has found.

I also ned to know what is the quality of this safety issue. Some reports will have safety issues explicitly stated with something like "safety issue - ..." or "safety issue: ...", these are "exact" safety issues. Other reports however wont have these yet safety issues may still be present in this case they are "inferred" safety issues.

Can your response please be in yaml format as shown below.

- safety_issue: "bla bla talking about this and that bla bla bla"
  quality: exact
- safety_issue: "bla bla talking about this and that bla bla bla"
  quality: exact


There is no need to enclose the yaml in any tags.

=Here are some definitions=

Safety factor - Any (non-trivial) events or conditions, which increases safety risk. If they occurred in the future, these would
increase the likelihood of an occurrence, and/or the
severity of any adverse consequences associated with the
occurrence.

Safety issue - A safety factor that:
• can reasonably be regarded as having the
potential to adversely affect the safety of future
operations, and
• is characteristic of an organisation, a system, or an
operational environment at a specific point in time.
Safety Issues are derived from safety factors classified
either as Risk Controls or Organisational Influences.

Safety theme - Indication of recurring circumstances or causes, either across transport modes or over time. A safety theme may
cover a single safety issue, or two or more related safety
issues.
"""

        },
        {
            "role": "assistant",
            "content": f"""{format_safety_issue_to_yaml(row['safety_issues'])}"""
        }
    ]})

  return formatted_data

In [29]:
# Format and save dataset

import json

training_data = format_data(train_df)

validation_data = format_data(validation_df)



with open('training_data.jsonl', 'w') as outfile:
    for entry in training_data:
        json.dump(entry, outfile)
        outfile.write('\n')

with open('validation_data.jsonl', 'w') as outfile:
    for entry in validation_data:
        json.dump(entry, outfile)
        outfile.write('\n')
        

  I am going to be reading these pages: [7, 8, 9, 10, 11, 12]
  I am going to be reading these pages: [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
  I am going to be reading these pages: [10, 11, 12, 13, 14, 15, 16, 17]
  I am going to be reading these pages: [6, 7, 8, 9, 10, 11, 12]
  I am going to be reading these pages: [9, 10, 11, 12, 13, 14, 15, 16, 17]
  I am going to be reading these pages: [12, 13, 14, 15, 16, 17, 18, 19]
  I am going to be reading these pages: [10, 11, 12, 13, 14, 15]
  Could not find text between pages 15 and 16
  Could not extract text from page 15
  I am going to be reading these pages: [13, 14, 15, 16, 17, 18, 19]
  I am going to be reading these pages: [12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
  I am going to be reading these pages: [14, 15, 16, 17, 18, 19]
  I am going to be reading these pages: [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
  I am going to be reading these pages: [14, 15, 16, 17, 18, 19, 20, 21]
  I am going to be reading these pages

In [30]:
# Upload data to OpenAI

from openai import OpenAI
client = OpenAI()

openai_file = client.files.create(
  file=open("training_data.jsonl", "rb"),
  purpose="fine-tune"
)
openai_file_val = client.files.create(
  file=open("validation_data.jsonl", "rb"),
  purpose="fine-tune"
)

In [31]:
# Create the fine-tune job

client.fine_tuning.jobs.create(
  training_file=openai_file.id, 
  validation_file=openai_file_val.id,
  model="gpt-3.5-turbo"
)

FineTuningJob(id='ftjob-RXt3Jit0q6UnJda8QRaV3LrZ', created_at=1712098644, error=Error(code=None, message=None, param=None, error=None), fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0125', object='fine_tuning.job', organization_id='org-imErdBW1EpSlnDOl8RluwTmE', result_files=[], status='validating_files', trained_tokens=None, training_file='file-BifhlnCFv2RiWnT4pm6AVTL3', validation_file='file-7LJJlx8HUoHGmBpLcucpK16I', user_provided_suffix=None, seed=141041653)

After completing this I am working on fine tuning a model so that I can accurately extract the safety issues from the report.

I am getting to the point of thinking that it might not be possible to extract them 100% accurately.

Instead maybe I should just focus on making a good distinction between exact and inferred.

False news!

False news!

I had made a mistake and moved over the wrong query request to the new ft model.

After checking it on the training data it is about 50% accurate which is pretty good.
I will now check it on some new data that the training hasn't seen yet.

# Testing fine-tuned model

I will now test the fine-tuned model and see how it fares.

### Functions needed for testing

In [38]:
import engine.OpenAICaller as OpenAICaller
from engine.OpenAICaller import openAICaller

def compare_safety_issues(safety_issues, inferred_safety_issues):
    """
    This will receive two lists of safety issues and then ask the LLM to compare each safety issue and see if they are the same or different.
    """

    pairs = [(i, j) for i, _ in enumerate(safety_issues) for j, _ in enumerate(inferred_safety_issues)]

    pairs_comparison_results = []

    for pair in pairs:
        response = openAICaller.query(
            """
You are going to help me compare if safety issues are the same.

You will be given two safety issues that have been retrieved from the same transport investigation report in different ways.

I will need to know if they are either the same or different. Same will mean that they are the same safety issue and maybe just worded differently. Whereas different will mean that they are fundamentally different safety issues.

Your response should just be "yes" or "no".
        """,
        f"""
Here is the first safety issues:
{safety_issues}

Here is the second safety issues:
{inferred_safety_issues}
    """,
            temp =  0,
            model = "gpt-4"
        ).lower()

        if response != 'yes' and response != 'no':
            print(f"Error response in the wrong format {response}")
            response = 'undetermined'
    
        pairs_comparison_results.append({
            'pair': pair,
            'result': response
        })
    return pairs_comparison_results

Here is the prompt used to compare two lists of safety themes.
```
You are going to help me compare if these listed safety issues are the same.

You will be given two lists of safety issues that have been retrieved from the same transport investigation report in different ways.

I will need to know if they are either the same or different. Same will mean that for each safety issue in the set it will match a safety issue found in the other set albeit if worded slightly differently.

Your response should just be "yes" or "no".
        """,
        f"""
Here is the first set of safety issues:
{safety_issues}

Here is the second set of safety issues:
{inferred_safety_issues}
```

In [39]:
def compare_safety_issues_df(df):
    # Extract the safety issues from the data structured returned from the inferences extraction function.
    df['inferred_safety_issues'] = df['inferred_safety_issues'].apply(lambda x: [issue['safety_issue'] for issue in x] if (x is not None and isinstance(x[0], dict)) else None if x is None else x)

    # # Rule out matches that dont have the same number of safety issues
    # df['same'] = df.apply(lambda x: "no" if x['number_extracted'] != x['number_inferred'] else 'yes', axis=1)

    # # Rule out matches that dont have safety issues as all reports have safety issues
    # df['same'] = df.apply(lambda x: "no" if x['inferred_safety_issues'] is None else x['same'], axis=1)

    # Compare safety issues with the LLM
    df['same_comparison_results'] = df.apply(
        lambda x: compare_safety_issues(x['safety_issues'],
                                        x['inferred_safety_issues']),
        axis=1)

    return df

In [32]:
def extract_safety_issues_from_cleaned_reports(df):
    """
    This will call the inference function and extract the safety issues from the cleaned reports.
    """
    df['inferred_safety_issues'] = df.apply(lambda x: ReportExtracting.SafetyIssueExtractor(x['cleaned_report'], x['file'])._extract_safety_issues_with_inference(), axis=1)

    df['number_inferred'] = df['inferred_safety_issues'].apply(lambda x: len(x) if x is not None else 0)

    return df

### Test the testing dataset

In [40]:
importlib.reload(ReportExtracting)
importlib.reload(OpenAICaller)

validation_df_with_SI_Inference = extract_safety_issues_from_cleaned_reports(validation_df)


  I am going to be reading these pages: [7, 8, 9, 10, 11, 12, 13, 14, 15]
  I am going to be reading these pages: [6, 7, 8, 9, 10, 11]
  I am going to be reading these pages: [11, 12, 13, 14, 15, 16, 17, 18, 19]
  I am going to be reading these pages: [9, 10, 11, 12, 13, 14, 15, 16]
  I am going to be reading these pages: [13, 14, 15, 16, 17, 18]
  I am going to be reading these pages: [6, 7, 8, 9, 10]
  I am going to be reading these pages: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
  I am going to be reading these pages: [17, 18, 19, 20, 21, 22]
  I am going to be reading these pages: [10, 11, 12, 13, 14, 15, 16, 17]
  I am going to be reading these pages: [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
  Found multiple matches for text between pages 27 and 28
  Could not extract text from page 27
  I am going to be reading these pages: [8, 9, 10, 11, 12, 13, 14]
  I am going to be reading these pages: [5, 6, 7, 8, 9]
  I am going to be reading thes

In [41]:
validation_df_with_SI_Inference = compare_safety_issues_df(validation_df_with_SI_Inference)

In [44]:
# Understanding comparison results

# Loop throufgh each reuslt

for index, row in validation_df_with_SI_Inference.iterrows():
    result = row['same_comparison_results']

    # Interpret the results
    # Make sure that each exact safety issues is matched to only one inferred safety issue

    inferred_safety_issues_matched = [False] * len(inferred_safety_issues)
    safety_issues_matched = [False] * len(safety_issues)

    for pair_comparison in pairs_comparison_results:
        if pair_comparison['result'] == 'yes':
            inferred_safety_issues_matched[pair_comparison['pair'][1]] = True
            safety_issues_matched[pair_comparison['pair'][0]] = True

    print(result)

[{'pair': (0, 0), 'result': 'no'}, {'pair': (0, 1), 'result': 'yes'}, {'pair': (0, 2), 'result': 'yes'}, {'pair': (1, 0), 'result': 'yes'}, {'pair': (1, 1), 'result': 'yes'}, {'pair': (1, 2), 'result': 'yes'}]
[{'pair': (0, 0), 'result': 'no'}, {'pair': (1, 0), 'result': 'no'}]
[{'pair': (0, 0), 'result': 'no'}, {'pair': (0, 1), 'result': 'no'}, {'pair': (1, 0), 'result': 'yes'}, {'pair': (1, 1), 'result': 'yes'}, {'pair': (2, 0), 'result': 'no'}, {'pair': (2, 1), 'result': 'no'}, {'pair': (3, 0), 'result': 'yes'}, {'pair': (3, 1), 'result': 'yes'}, {'pair': (4, 0), 'result': 'yes'}, {'pair': (4, 1), 'result': 'yes'}]
[{'pair': (0, 0), 'result': 'no'}, {'pair': (1, 0), 'result': 'no'}]
[{'pair': (0, 0), 'result': 'yes'}]
[{'pair': (0, 0), 'result': 'yes'}, {'pair': (0, 1), 'result': 'yes'}, {'pair': (1, 0), 'result': 'yes'}, {'pair': (1, 1), 'result': 'yes'}]
[{'pair': (0, 0), 'result': 'no'}, {'pair': (1, 0), 'result': 'no'}]
[{'pair': (0, 0), 'result': 'yes'}, {'pair': (1, 0), 'resul

In [64]:
# Get percentage of rows that 'number_extracted' == 'number_inferred'
percent_matching_SI_count = validation_df_with_SI_Inference[validation_df_with_SI_Inference['same'] == 'yes'].shape[0] / validation_df_with_SI_Inference.shape[0]

# percentage of same
percent_same = validation_df_with_SI_Inference.value_counts('same', normalize=True)['yes']

pd.DataFrame([
    ['same number of extracted and inferred safety issues', round(percent_matching_SI_count, 2)],
    ['same extracted and inferred safety issues', round(percent_same, 2)],
    ['number of same safety issues with different inferred safety issues', round(percent_matching_SI_count - percent_same, 2)]
], columns=["description", "statistic"]).style \
    .format(precision=2) \
    .hide(axis="index")

description,statistic
same number of extracted and inferred safety issues,0.33
same extracted and inferred safety issues,0.33
number of same safety issues with different inferred safety issues,0.0


Now we can see good enough performance on the validation set of about 33%.

I need to still do 2 things before the completion of the safety issue extraction upgrade:

- [ ] Make sure that the comparison of the extracted and inferred safety issues are reasonable.
- [ ] Check whether it does a good job getting the difference between inferred and exact safety issues

## Final checks of fine-tuned model

### Comparison of safety issues reasonable

I will start by just looking at the validation_df and the 15 examples there and see if I agree with them

In [19]:
# Firstly will go through and make manual comparisons of the extracted and inferred safety issues

with open('manual_comparison.txt', 'w') as file:
    for index, row in validation_df_with_SI_Inference.iterrows():
        file.write(f"""
=============================================================================
                   Report: {row['file']}
=============================================================================

""")

        file.write(f"""
- - - - - - - - - - - - - - - - - - -
        Extracted Safety Issues:
- - - - - - - - - - - - - - - - - - -

""")

        for issue in enumerate(row['safety_issues']):
            file.write(f"{issue[0] + 1}. {issue[1]}\n")

        file.write(f"""
- - - - - - - - - - - - - - - - - - -
        Inferred Safety Issues:
- - - - - - - - - - - - - - - - - - -

""")

        for issue in enumerate(row['inferred_safety_issues']):
            file.write(f"{issue[0] + 1}. {issue[1]}\n")

In [22]:

manual_comparison = [
    {
        "file": "2013_106",
        "same": "no",
        "reason": "count extra"
    },
    {
        "file": "2014_102",
        "same": "no",
        "reason": "count less"
    },
    {
        "file": "2016_204",
        "same": "no",
        "reason": "count less",
        "notes": "Here the extracted safety issues have also caputred a summarty of hte safety issues at the top of the analysis. This means that there are 5 where there should only be 4 and the first one is curently a list of all of them."
    },
    {
        "file": "2011_007",
        "same": "no",
        "reason": "count less",
    },
    {
        "file": "2013_005",
        "same": "yes",
        "notes": "The wording is different but it is similar enough. However the inferred one does not 'aviation industry practice'"
    },
    {
        "file": "2019_108",
        "same": "yes",
        "notes": "The first ones match however the inferred one is much longer and even gives a reference to a document."
    },
    {
        "file": "2011_006",
        "same": "no",
        "reason": "count less",
        'notes': "Completely different safety issues"
    },
    {
        "file": "2017_203",
        "same": "no",
        "reason": "count less",
        "notes": "The one inferred one is an exact match and I wonder if it was not removed prior to the inference"
    },
    {
        "file": "2014_101",
        "same": "no",
        "reason": "different",
        "notes": "They are similar but not the same"
    },
    {
        "file": "2011_003",
        "same": "no",
        "reason": "count more",
        "notes": "The inferred safety issues are actually correct as there is a summary which it has read from."
    },
    {
        "file": "2013_202",
        "same": "no",
        "reason": "count less"
    },
    {
        "file": "2019_007",
        "same": "no",
        "reason": "different",
        "notes": "The inferred is more specific about what equipment but is more vague when it comes to defining what the problem to be mitigated was."
    },
    {
        "file": "2011_106",
        "same": "no",
        "reason": "count more",
        "notes": "The inferred are actaully exact ones lifted from the report"
    },
    {
        "file": "2012_103",
        "same": "yes",
    },
    {
        "file": "2016_102",
        "same": "no",
        "reason": "count less",
        "notes": "Two of the safety issues are exact however it should of exracted 4 exact issues."
    }
]

This has made me noted that actaully I should have a look at which ones are be extracted exactly or inferred. But first I need to compare this with what my comparision function says

In [23]:
validation_df_with_SI_Inference['manual_same'] = validation_df_with_SI_Inference['file'].apply(lambda x: next((item['same'] for item in manual_comparison if item["file"] == x), None))

# Show cases where the manual comparison is different to the LLM comparison
validation_df_with_SI_Inference[validation_df_with_SI_Inference['same'] != validation_df_with_SI_Inference['manual_same']]

Unnamed: 0,file,safety_issues,cleaned_report,number_extracted,split,inferred_safety_issues,number_inferred,same,manual_same
60,2014_101,[The sighting distance available to drivers of...,﻿ \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n...,2,validation,[There was an insufficient sighting distance ...,2,yes,no
