# Extracting safety issues from the data

### Problem introduction
Safety issues are being extracted from the reports.

This is linked with issue https://github.com/1jamesthompson1/TAIC-report-summary/issues/101

There are two types of safety issues being extracted.
The first type is exact extraction. These are safety issues that are explicitly mentioned in the report. They will follow something like "safety issue: ..." or "safety issue - ...". Exact safety issue extraction is not too much of a worry; it just needs to be tested, which could be done with unit tests.

The second type is inferred safety issues. These safety issues are supposedly present but are not explicitly stated. This requires a little more validation than a simple unit test. Fortunately I can get a good test set.

All the reports that have the safety issues explicitly mentioned show me what the correct answer is. So all I need to do is remove the explicit mention of the safety issue from the report, run the report through inferred extraction, and see if it comes up with the same safety issue.
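
The idea of building a test set by redaction can be sketched in a few lines. This is a deliberately simplified illustration: the pattern `EXPLICIT_ISSUE` and the helper `extract_and_redact` are hypothetical, and the regex used later in this notebook is more tolerant (stray spaces inside words, multi-line statements).

```python
import re

# Simplified, hypothetical pattern: matches "safety issue: ..." or
# "safety issue - ..." up to the end of the line.
EXPLICIT_ISSUE = re.compile(r"safety issue\s*[-:]\s*(.+)", re.IGNORECASE)

def extract_and_redact(text):
    """Return (explicit safety issues, text with those statements removed)."""
    issues = [m.group(1).strip() for m in EXPLICIT_ISSUE.finditer(text)]
    redacted = EXPLICIT_ISSUE.sub("", text)
    return issues, redacted

# Made-up report fragment for illustration only.
sample = (
    "The crew had not been trained on the new procedure.\n"
    "Safety issue: The operator had no system for tracking crew training.\n"
    "The Commission made a recommendation.\n"
)
issues, redacted = extract_and_redact(sample)
print(issues)
print(redacted)
```

The extracted list becomes the expected answer, and the redacted text becomes the input for inferred extraction.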

### Solution story

I have performed some testing on fine-tuning gpt-3.5-turbo.

It was about 56% accurate on the training set (about 15 reports).
However, on a testing set it was only 17% accurate. This was because about 70% of the items had a mismatched number of safety issues extracted.

I therefore wanted to try with more data. I have about 70 reports, which is enough to split into training, validation and test sets. This is what the code below does.
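
To make the sizes concrete, here is a dependency-free sketch of a 60/20/20 split over 70 items. It is not the pandas code used further down; `split_ids` and the report ids are hypothetical stand-ins.

```python
import random

def split_ids(ids, seed=42, train_frac=0.6):
    """Shuffle ids reproducibly and cut them into train/validation/test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    # The remainder is halved between validation and test.
    n_val = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test

# Hypothetical report ids standing in for the roughly 70 real ones.
report_ids = [f"report_{i:03d}" for i in range(70)]
train, validation, test = split_ids(report_ids)
print(len(train), len(validation), len(test))
```

With 70 reports this leaves only 14 reports each for validation and testing, so any percentage computed on those sets moves in large steps.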

## Required modules

Some cells reload modules.
New modules may be imported in later cells, but I like to have the main imports at the top, so here they are.

In [None]:
import os
import yaml
import pandas as pd
import re
import importlib
from engine.Extract_Analyze import ReportExtracting
import engine.OpenAICaller as OpenAICaller
importlib.reload(ReportExtracting)

## Get datasets for working with

To get the dataset I just pulled it from the viewer/output folder.

This has all of the reports downloaded and extracted. From there everything else is done in the notebook.

Here is a link to the exact copy of the output folder I used: https://github.com/1jamesthompson1/TAIC-report-summary/tree/48e3b09380364731bfb343f2105851cd3babf90a/viewer/output
I have not included it in the repo to save space.

### Functions needed for dataset creation

In [None]:
def read_or_create_important_text_file(path, report_id, text):
    important_text_path = os.path.join(path, report_id, f'{report_id}_important_text.txt')

    if not os.path.isfile(important_text_path):
        important_text, pages = ReportExtracting.ReportExtractor(text, report_id).extract_important_text()

        important_text = "" if important_text is None else important_text

        with open(important_text_path, 'w') as stream:
            stream.write(important_text)

        return important_text
    
    with open(important_text_path, 'r') as stream:
        important_text = stream.read()
        
    important_text = None if important_text == "" else important_text

    return important_text

In [None]:

def clean_text_and_get_safety_issues(path, reports_ids):
    """
    This will go through a folder and extract all of the safety issues that it can from the reports using regex. It will then remove those extracted safety issues from the report text and return a DataFrame that has the report_id, the safety issues extracted, and the cleaned report text.
    """
    safety_issues_raw = []

    for report in reports_ids:
        with open(os.path.join(path, report, f'{report}.txt'), 'r') as stream:
            text = stream.read()

        # Check to see if the important text exists

        important_text = read_or_create_important_text_file(path, report, text)

        if important_text is None:
            print(f'Could not extract important text from {report}')
            continue
        
        safety_regex = lambda x: fr's ?a ?f ?e ?t ?y ? ?i ?s ?s ?u ?e ?s? {{0,3}}{x} {{0,3}}'
        end_regex = r'([\s\S]+?)(?=(?:\d+\.(?:\d+\.)?(?:\d+)?)|(?:^ [A-Z])|(?:s ?a ?f ?e ?t ?y ? ?i ?s ?s ?u ?e ?s?))'

        uncompiled_regexes = ["(" + safety_regex(sep) + end_regex + ")" for sep in ["-", ":"]]

        safety_issue_regexes = [re.compile(regex , re.MULTILINE | re.IGNORECASE) for regex in uncompiled_regexes]

        safety_issues_for_report = []

        cleaned_text = text

        matches = [regex.findall(important_text) for regex in safety_issue_regexes]

        # Choose one of the matches that has the most matches
        matches = max(matches, key=lambda x: len(x))

        for full_match, safety_issue_match in matches:
            safety_issues_for_report.append(safety_issue_match)
            cleaned_text = re.sub(
                re.escape(full_match), '',
                cleaned_text,
                flags=re.IGNORECASE | re.MULTILINE)
            
        # Remove extracted safety issues that are just summaries in the analysis section introduction
        # This prevents the double matching of the safety issues.
        # Iterate over a copy because the list is modified inside the loop.
        for safety_issue in list(safety_issues_for_report):
            if re.search(r'[]|(\uf0b7)', safety_issue, re.MULTILINE | re.IGNORECASE):
                print(f"Remove safety issue summary from {report}")
                safety_issues_for_report.remove(safety_issue)
    
        safety_issues_raw.append({
            "file": report,
            "safety_issues": safety_issues_for_report,
            'cleaned_report': cleaned_text})

    safety_issues_df = pd.DataFrame(safety_issues_raw)

    safety_issues_df['number_extracted'] = safety_issues_df['safety_issues'].apply(lambda x: len(x))

    return safety_issues_df

### Creating datasets

In [None]:
importlib.reload(ReportExtracting)
reports_path = "output"

reports = os.listdir(reports_path)

reports_to_ignore = [
    "2020_102", # This report has an oddly formatted safety issue that can't be picked up by the regex
]
reports = [report for report in reports if report not in reports_to_ignore]

all_reports_df = clean_text_and_get_safety_issues(reports_path, reports)

# Remove all reports that have no found safety issues
removed_reports = all_reports_df[all_reports_df['number_extracted'] == 0]
all_reports_df = all_reports_df[all_reports_df['number_extracted'] != 0]

In [None]:
# Get split of 60% train, 20% validation, 20% test
seed = 42

train_df = all_reports_df.sample(frac=0.6, random_state=seed)

validation_df = all_reports_df.drop(train_df.index).sample(frac=0.5, random_state=seed)

test_df = all_reports_df.drop(train_df.index).drop(validation_df.index)

As was discovered when testing the model, responses from the fine-tuned model only ever have the "inferred" quality, even when they are in fact "exact" safety issues.

To fix this I will take a small fraction of the rows from the train_df (and the other splits) and add copies of them that use the uncleaned report, so that the expected response contains "exact" safety issues.

In [None]:
def add_random_exact_examples(df, seed):
    df_exact = df.sample(frac=0.2, random_state=seed)

    df_exact['uncleaned_report'] = df_exact['file'].apply(lambda x: open(os.path.join(reports_path, x, f'{x}.txt'), 'r').read())

    df_exact['expected_response_quality'] = 'exact'

    df['expected_response_quality'] = 'inferred'

    df_final = pd.concat([df, df_exact]).sort_index()

    return df_final

train_df = add_random_exact_examples(train_df, seed)

validation_df = add_random_exact_examples(validation_df, seed)

test_df = add_random_exact_examples(test_df, seed)

Because the reports fall into three distinct categories (the `mode` computed from the report id in the next cell), it would be prudent to make sure that the resulting splits have a similar distribution of them.
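
Raw counts are hard to compare when the splits differ in size, so it can help to look at shares instead. The sketch below uses only the standard library. `mode_of` mirrors the way the next cell derives the mode digit from the report id, while `mode_shares` and the split contents are hypothetical.

```python
from collections import Counter

def mode_of(report_id):
    """First digit of the report number, e.g. '2013_106' gives '1'."""
    return report_id.split("_")[1][0]

def mode_shares(report_ids):
    """Fraction of reports per mode digit within one split."""
    counts = Counter(mode_of(r) for r in report_ids)
    total = sum(counts.values())
    return {mode: count / total for mode, count in sorted(counts.items())}

# Hypothetical split contents, reusing report ids that appear in this notebook.
splits = {
    "train": ["2013_106", "2014_102", "2016_204", "2011_007", "2013_005", "2019_108"],
    "validation": ["2011_006", "2017_203", "2014_101"],
}
for name, ids in splits.items():
    print(name, mode_shares(ids))
```

If the shares differ a lot between splits, a stratified sample per mode would be the usual remedy.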

In [None]:
# Combine the three dataframes into one and add a column to indicate the split

train_df['split'] = 'train'

validation_df['split'] = 'validation'
test_df['split'] = 'test'

all_datasets_df = pd.concat([train_df, validation_df, test_df])

all_datasets_df['mode'] = all_datasets_df['file'].apply(lambda x: x.split('_')[1][0])

# Group by split and get distribution of mode in each split
all_datasets_df.groupby(['split', 'mode']).count().reset_index()[['split', 'mode', 'file']].rename(columns={'file': 'count'})

This is not perfect but we can see that it is pretty close. The validation set is low on reports. However, this is not something that can be completely fixed and it ought not to affect the final outcome.

## Creating fine tuned model

Because simply asking the model had dismal success, I am going to give it some examples and see if that helps.
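
Each example handed to the fine-tuning job in this notebook is one chat-style record with a system, a user and an assistant message, written as a single line of a JSONL file. A minimal sanity check of that shape is sketched here; `check_jsonl_line` and the shortened record are hypothetical and only illustrate the structure.

```python
import json

EXPECTED_ROLES = ["system", "user", "assistant"]

def check_jsonl_line(line):
    """Parse one JSONL line and confirm it holds the three expected roles in order."""
    record = json.loads(line)
    roles = [message["role"] for message in record["messages"]]
    return roles == EXPECTED_ROLES

# Hypothetical, heavily shortened training record.
example = {
    "messages": [
        {"role": "system", "content": "You are going to help me read a report."},
        {"role": "user", "content": "<important text of the report>"},
        {"role": "assistant", "content": "- safety_issue: ...\n  quality: inferred\n"},
    ]
}
# json.dumps escapes the newlines, so the record stays on a single line.
line = json.dumps(example)
print(check_jsonl_line(line))
```

Running a check like this over every line before uploading is cheaper than discovering a malformed record after the upload.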

In [None]:
# Required functions for formatting the data
def format_safety_issue_to_yaml(safety_issue, quality):
    safety_issues_dict = []
    for issue in safety_issue:
        safety_issues_dict.append({"safety_issue": issue, "quality": quality})

    return yaml.dump(safety_issues_dict, sort_keys=False)

def format_data(data):
  formatted_data = []
  for index, row in data.iterrows():
    formatted_data.append({"messages": [
        {
            "role": "system",
            "content": f"""
You are going to help me read a transport accident investigation report.

I want you to please read the report and respond with the safety issues identified in the report.

Please only respond with safety issues that are quite clearly stated ("exact" safety issues) or implied ("inferred" safety issues) in the report. Each report will only contain one type of safety issue.

Remember the definitions given

Safety factor - Any (non-trivial) events or conditions, which increases safety risk. If they occurred in the future, these would
increase the likelihood of an occurrence, and/or the
severity of any adverse consequences associated with the
occurrence.

Safety issue - A safety factor that:
• can reasonably be regarded as having the
potential to adversely affect the safety of future
operations, and
• is characteristic of an organisation, a system, or an
operational environment at a specific point in time.
Safety Issues are derived from safety factors classified
either as Risk Controls or Organisational Influences.

Safety theme - Indication of recurring circumstances or causes, either across transport modes or over time. A safety theme may
cover a single safety issue, or two or more related safety
issues.
"""
        },
        {
            "role": "user",
            "content": f"""
{ReportExtracting.ReportExtractor(row['cleaned_report'] if row['expected_response_quality'] == "inferred" else row['uncleaned_report'], row['file']).extract_important_text()[0]}
        
        
=Instructions=

I want to know the safety issues which this investigation has found.

For each safety issue you find I need to know what is the quality of this safety issue.
Some reports will have safety issues explicitly stated with something like "safety issue - ..." or "safety issue: ...", these are "exact" safety issues. Note that the text may have extra spaces or characters in it.

However if no safety issues are stated explicitly, then you need to infer them. These inferred safety issues are "inferred" safety issues.


Can your response please be in yaml format as shown below.

- safety_issue: |
    bla bla talking about this and that bla bla bla
  quality: exact
- safety_issue: |
    bla bla talking about this and that bla bla bla
  quality: exact


There is no need to enclose the yaml in any tags.

=Here are some definitions=

Safety factor - Any (non-trivial) events or conditions, which increases safety risk. If they occurred in the future, these would
increase the likelihood of an occurrence, and/or the
severity of any adverse consequences associated with the
occurrence.

Safety issue - A safety factor that:
• can reasonably be regarded as having the
potential to adversely affect the safety of future
operations, and
• is characteristic of an organisation, a system, or an
operational environment at a specific point in time.
Safety Issues are derived from safety factors classified
either as Risk Controls or Organisational Influences.

Safety theme - Indication of recurring circumstances or causes, either across transport modes or over time. A safety theme may
cover a single safety issue, or two or more related safety
issues.
"""

        },
        {
            "role": "assistant",
            "content": f"""{format_safety_issue_to_yaml(row['safety_issues'], row['expected_response_quality'])}"""
        }
    ]})

  return formatted_data

In [None]:
# Format and save dataset

import json

training_data = format_data(train_df)

validation_data = format_data(validation_df)

In [None]:
from openai import OpenAI
client = OpenAI()


with open('training_data.jsonl', 'w') as outfile:
    for entry in training_data:
        json.dump(entry, outfile)
        outfile.write('\n')

with open('validation_data.jsonl', 'w') as outfile:
    for entry in validation_data:
        json.dump(entry, outfile)
        outfile.write('\n')

In [None]:
# Upload data to OpenAI


openai_file = client.files.create(
  file=open("training_data.jsonl", "rb"),
  purpose="fine-tune"
)
openai_file_val = client.files.create(
  file=open("validation_data.jsonl", "rb"),
  purpose="fine-tune"
)

In [None]:
# Create the fine-tune job

client.fine_tuning.jobs.create(
  training_file=openai_file.id, 
  validation_file=openai_file_val.id,
  model="gpt-3.5-turbo"
)

As I am creating multiple fine-tuned models I should have some reference text to explain what is going on.

| name | description |
| ---- | ---- |
| ft:gpt-3.5-turbo-0125:personal::99hWfq4f | Second fine-tuned model. This is the first model that uses the full set of roughly 70 explicit reports with redacted safety issues. After fine-tuning I tweaked the prompts and got it to a 73% match between exact and inferred, with just over half of the validation set having all of its safety issues matched by the inferred ones. Exact vs exact was at 100%. |
| gpt-4-turbo (untrained) | Recorded here just to store the metrics somewhere. I tested it on gpt-4-turbo and got about 83% average match with over 77% perfect match. |
| ft:gpt-3.5-turbo-0125:personal::9AU7vXs3 | This is my third fine-tuned model. It is trained on the same data as the previous one, but it differs in two ways: the prompts in the training data are different, and some examples of exact safety issues are included. Out of the box I have it at a 61% match and just over half are fully matched. However, I am starting to doubt that the comparer is being reasonable, i.e. I think it might be too strict. |


After completing this I am working on fine tuning a model so that I can accurately extract the safety issues from the report.

I am getting to the point of thinking that it might not be possible to extract them 100% accurately.

Instead maybe I should just focus on making a good distinction between exact and inferred.

False news!


I had made a mistake and moved over the wrong query request to the new ft model.

After checking it on the training data it is about 50% accurate which is pretty good.
I will now check it on some new data that the training hasn't seen yet.

## Testing fine-tuned model

I will now test the fine-tuned model and see how it fares.

### Functions needed for testing

#### Creating and comparing datasets inferred and exact

In [None]:
import engine.OpenAICaller as OpenAICaller
from engine.OpenAICaller import openAICaller

def compare_safety_issues(safety_issues, inferred_safety_issues):
    """
    This will receive two lists of safety issues and then ask the LLM to compare each safety issue and see if they are the same or different.
    """

    pairs = [(i, j) for i, _ in enumerate(safety_issues) for j, _ in enumerate(inferred_safety_issues)]

    pairs_comparison_results = []

    for extracted, inferred in pairs:
        response = openAICaller.query(
            """
You are going to help me compare if safety issues are the same.

You will be given two safety issues that have been retrieved from the same transport investigation report in different ways.

I will need to know if they are either the same or different. Same will mean that they are the same safety issue and maybe just worded differently. Whereas different will mean that they are fundamentally different safety issues.

Your response should just be "yes" or "no".
        """,
        f"""
    Here is the first safety issue:
    {safety_issues[extracted]}

    Here is the second safety issue:
    {inferred_safety_issues[inferred]}
    """,
            temp =  0,
            model = "gpt-4"
        ).lower()

        if response != 'yes' and response != 'no':
            print(f"Error response in the wrong format {response}")
            response = 'undetermined'
    
        pairs_comparison_results.append({
            'pair': (extracted, inferred),
            'result': response
        })
    return pairs_comparison_results

Here is the prompt that was used to compare two whole lists of safety issues at once.
```
You are going to help me compare if these listed safety issues are the same.

You will be given two lists of safety issues that have been retrieved from the same transport investigation report in different ways.

I will need to know if they are either the same or different. Same will mean that for each safety issue in the set it will match a safety issue found in the other set albeit if worded slightly differently.

Your response should just be "yes" or "no".
        """,
        f"""
Here is the first set of safety issues:
{safety_issues}

Here is the second set of safety issues:
{inferred_safety_issues}
```
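
Both prompts ask for a bare "yes" or "no", and `compare_safety_issues` treats anything else as undetermined, so a reply such as "Yes." would be discarded. A small sketch of a more forgiving normaliser follows; `normalise_verdict` is hypothetical and not part of the notebook or of `openAICaller`.

```python
import string

def normalise_verdict(response):
    """Map a raw model reply to 'yes', 'no' or 'undetermined'."""
    # Drop surrounding whitespace, then surrounding punctuation such as '.' or quotes.
    cleaned = response.strip().strip(string.punctuation).strip().lower()
    return cleaned if cleaned in ("yes", "no") else "undetermined"

for raw in ["yes", "No.", "  YES\n", "They look the same"]:
    print(repr(raw), "->", normalise_verdict(raw))
```

Longer free-text replies still come back as undetermined, which keeps the behaviour conservative.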

In [None]:
def compare_safety_issues_df(df):
    # Compare safety issues with the LLM
    df['same_comparison_results'] = df.apply(
        lambda x: compare_safety_issues(x['safety_issues'],
                                        x['inferred_safety_issues']),
        axis=1)
    
    df['inferred_safety_issue_quality'] = df['inferred_safety_issues_struct'].apply(lambda issues: next(iter(set([x['quality'] for x in issues]))))

    df['same_quality'] = df['inferred_safety_issue_quality'] == df['expected_response_quality']

    return df

In [None]:
def extract_safety_issues_from_cleaned_reports(df):
    """
    This will call the inference function and extract the safety issues from the cleaned reports.
    """
    df['inferred_safety_issues_struct'] = df.apply(lambda x: ReportExtracting.SafetyIssueExtractor(x['cleaned_report'], x['file'])._extract_safety_issues_with_inference(), axis=1)

    # Extract the safety issues from the data structured returned from the inferences extraction function.
    df['inferred_safety_issues'] = df['inferred_safety_issues_struct'].apply(lambda x: [issue['safety_issue'] for issue in x] if (x is not None and isinstance(x[0], dict)) else None if x is None else x)

    df['number_inferred'] = df['inferred_safety_issues'].apply(lambda x: len(x) if x is not None else 0)

    return df

#### Interpreting the comparison results

In [None]:
import matplotlib.pyplot as plt
import networkx as nx
import textwrap

def make_match_visualization(df, safety_issue_text_dict, report_id):
    '''
    Create a picture that shows both the extracted and inferred safety issues and the matches between them as lines/arrows.
    '''
    # Create a directed graph
    G = nx.DiGraph()

    # Add nodes for the 'extracted' and 'inferred' indices
    extracted_indices = df['extracted'].unique()
    inferred_indices = df['inferred'].unique()

    NODE_WIDTH = 50

    for i, index in enumerate(extracted_indices):
        node_label = '\n'.join(textwrap.wrap(safety_issue_text_dict["extracted"][index], width=NODE_WIDTH))
        G.add_node(node_label, pos=(0, i))

    for i, index in enumerate(inferred_indices):
        node_label = '\n'.join(textwrap.wrap(safety_issue_text_dict["inferred"][index], width=NODE_WIDTH))
        G.add_node(node_label, pos=(3, i))

    # Add edges between the matched indices
    for _, row in df.iterrows():
        if row['result'] == 'yes':
            G.add_edge('\n'.join(textwrap.wrap(safety_issue_text_dict["extracted"][row["extracted"]], width=NODE_WIDTH)), 
                        '\n'.join(textwrap.wrap(safety_issue_text_dict["inferred"][row["inferred"]], width=NODE_WIDTH)))

    max_num_of_nodes_column = max(
        len(safety_issue_text_dict['extracted']),
        len(safety_issue_text_dict['inferred'])
        )

    # Draw the graph
    pos = nx.get_node_attributes(G, 'pos')

    plt.figure(figsize=(8, max_num_of_nodes_column * 2.5))  
    nx.draw(G, pos, with_labels=False, node_color='lightblue', node_size=[len(node) * NODE_WIDTH for node in G.nodes()])
    nx.draw_networkx_labels(G, pos, font_size=7)
    
    plt.xlim(-1, 4.5)  # Add buffer to the outside side edges
    plt.ylim(-1, max_num_of_nodes_column*1)  # Add buffer to the outside top and bottom edges

    # Add report ID and headers
    plt.title(f'Report ID: {report_id}')
    plt.text(0, max_num_of_nodes_column-0.2, 'Real', fontsize=12, ha='center')
    plt.text(3, max_num_of_nodes_column-0.2, 'Inferred', fontsize=12, ha='center')
    plt.text(1.5, max_num_of_nodes_column-0.5, 'Arrow indicates that real has been "matched" by LLM with inferred', fontsize=6, ha='center')

    plt.savefig(os.path.join('visualizations_of_matches', f'{report_id}_comparison_visualization.png'))
    plt.close()

In [None]:
def interpret_comparison_of_reports_SI(row, with_visual_generation = True):
    '''
    Take a row of the DataFrame with the comparison results and interpret them by giving you a match percentage. This percentage is how 'similar' the extracted safety issues are to the inferred safety issues. However similar is hard to define so currently it is simply how many of the extracted safety issues have a match.
    '''
    result = row['same_comparison_results']

    # Make it a DataFrame
    result_df = pd.DataFrame(result)
    # Split tuple into two columns
    result_df['extracted'] = result_df['pair'].apply(lambda x: x[0])
    result_df['inferred'] = result_df['pair'].apply(lambda x: x[1])

    num_extracted_safety_issues = result_df['extracted'].unique().shape[0]
    num_inferred_safety_issues = result_df['inferred'].unique().shape[0]

    # Interpret the results
    # Record which extracted and inferred safety issues have at least one match

    inferred_safety_issues_matched = [False] * num_inferred_safety_issues
    safety_issues_matched = [False] * num_extracted_safety_issues

    for pair_comparison in result:
        if pair_comparison['result'] == 'yes':
            inferred_safety_issues_matched[pair_comparison['pair'][1]] = True
            safety_issues_matched[pair_comparison['pair'][0]] = True

    match_percent = sum(safety_issues_matched) / num_extracted_safety_issues

    if with_visual_generation:
        safety_issue_text ={
            "extracted": [row['safety_issues'][index] for index in result_df['extracted'].unique()],
            "inferred": [row['inferred_safety_issues'][index] for index in result_df['inferred'].unique()]
        }

        make_match_visualization(result_df, safety_issue_text, row['file'])
        

    return match_percent

In [None]:
def interpret_dataset_SI_comparison(dataset):
    """
    The dataset is going to have data on the safety issues extracted and inferred as well as the comparison results of these safety issues. This function will look at the whole set of results and give you a summary of what is happening.
    """

    dataset['SI_match_percent'] = dataset.apply(lambda row: interpret_comparison_of_reports_SI(row, True), axis = 1)
    dataset['SI_match'] = dataset['SI_match_percent'] == 1

    dataset['SI_count_match'] = dataset['number_extracted'] == dataset['number_inferred']
    dataset['SI_count_difference'] = dataset['number_extracted'] - dataset['number_inferred']

    print(f"""
==============                                   ==============
                Comparing Safety Issues results
==============                                   ==============
          
There are {dataset.shape[0]} reports in the dataset.
    
Average match percentage: {round(dataset['SI_match_percent'].mean(), 2)}
    This is the average fraction of extracted safety issues that have been matched to inferred safety issues.
Fraction of reports with all safety issues matched: {dataset['SI_match'].sum()/dataset.shape[0]}
Information on the number of safety issues extracted and inferred:
          """)
    
    display(
        pd.DataFrame([
            ['same number of extracted and inferred safety issues', dataset['SI_count_match'].mean()],
            ['Average match of same count of SI', dataset[dataset['SI_count_match']]['SI_match_percent'].mean()],
            ['Average difference in count of SI', dataset['SI_count_difference'].mean()],
            ['Average number of extracted SI', dataset['number_extracted'].mean()],
            ['Average number of inferred SI', dataset['number_inferred'].mean()]
        ], columns=["description", "statistic"]).style \
            .format(precision=2) \
            .hide(axis="index")
    )

### Test a dataset
I will compare the model's responses with the exact extracted ones.

In [None]:
importlib.reload(ReportExtracting)
importlib.reload(OpenAICaller)

# Infer safety issues
validation_df_with_SI_Inference = extract_safety_issues_from_cleaned_reports(validation_df)

In [None]:
# Compare safety issues
validation_df_with_SI_Inference = compare_safety_issues_df(validation_df_with_SI_Inference)


In [None]:
# Understand how well they compared
interpret_dataset_SI_comparison(validation_df_with_SI_Inference)

Now we can see that there is some interesting performance going on.

About 54% of the exact safety issues have a match (an average match of 0.54). However, about half an extra safety issue is inferred for every safety issue extracted. Only about a third of the time does it generate the same number of safety issues, but when it does there is a 60% chance that they will all match up.
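
How the per-report match figure comes out of the pairwise verdicts can be shown with a small worked example. It follows the same logic as `interpret_comparison_of_reports_SI` but without pandas or plotting; `match_summary` and the verdicts are hypothetical.

```python
def match_summary(pair_results, num_extracted):
    """Return (fraction of extracted issues matched, whether all were matched)."""
    matched = [False] * num_extracted
    for item in pair_results:
        extracted_index, _inferred_index = item["pair"]
        if item["result"] == "yes":
            matched[extracted_index] = True
    fraction = sum(matched) / num_extracted
    return fraction, fraction == 1

# Hypothetical report: 2 extracted issues compared against 3 inferred ones.
pair_results = [
    {"pair": (0, 0), "result": "yes"},
    {"pair": (0, 1), "result": "no"},
    {"pair": (0, 2), "result": "no"},
    {"pair": (1, 0), "result": "no"},
    {"pair": (1, 1), "result": "no"},
    {"pair": (1, 2), "result": "no"},
]
fraction, fully_matched = match_summary(pair_results, num_extracted=2)
print(fraction, fully_matched)
```

Note that extra inferred safety issues do not lower this score; only unmatched extracted issues do. That is why the count difference is reported separately.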

I need to still do 2 things before the completion of the safety issue extraction upgrade:

- [x] Make sure that the comparison of the extracted and inferred safety issues is reasonable.
- [x] Check whether it does a good job of telling the difference between inferred and exact safety issues.
To be completed by re-fine-tuning the model with some new examples, including ones where the safety issues are actually exact because I haven't redacted them.

## Final checks of fine-tuned model

### Comparison of safety issues reasonable

I will start by looking at the 15 examples in validation_df and seeing whether I agree with the comparisons.

In [None]:
# Firstly will go through and make manual comparisons of the extracted and inferred safety issues

with open('manual_comparison.txt', 'w') as file:
    for index, row in validation_df_with_SI_Inference.iterrows():
        file.write(f"""
=============================================================================
                   Report: {row['file']}
=============================================================================

""")

        file.write(f"""
- - - - - - - - - - - - - - - - - - -
        Extracted Safety Issues:
- - - - - - - - - - - - - - - - - - -

""")

        for issue in enumerate(row['safety_issues']):
            file.write(f"{issue[0] + 1}. {issue[1]}\n")

        file.write(f"""
- - - - - - - - - - - - - - - - - - -
        Inferred Safety Issues:
- - - - - - - - - - - - - - - - - - -

""")

        for issue in enumerate(row['inferred_safety_issues']):
            file.write(f"{issue[0] + 1}. {issue[1]}\n")

In [None]:

manual_comparison = [
    {
        "file": "2013_106",
        "same": "no",
        "reason": "count extra"
    },
    {
        "file": "2014_102",
        "same": "no",
        "reason": "count less"
    },
    {
        "file": "2016_204",
        "same": "no",
        "reason": "count less",
        "notes": "Here the extracted safety issues have also captured a summary of the safety issues at the top of the analysis. This means that there are 5 where there should only be 4 and the first one is currently a list of all of them."
    },
    {
        "file": "2011_007",
        "same": "no",
        "reason": "count less",
    },
    {
        "file": "2013_005",
        "same": "yes",
        "notes": "The wording is different but it is similar enough. However the inferred one does not mention 'aviation industry practice'"
    },
    {
        "file": "2019_108",
        "same": "yes",
        "notes": "The first ones match however the inferred one is much longer and even gives a reference to a document."
    },
    {
        "file": "2011_006",
        "same": "no",
        "reason": "count less",
        'notes': "Completely different safety issues"
    },
    {
        "file": "2017_203",
        "same": "no",
        "reason": "count less",
        "notes": "The one inferred one is an exact match and I wonder if it was not removed prior to the inference"
    },
    {
        "file": "2014_101",
        "same": "no",
        "reason": "different",
        "notes": "They are similar but not the same"
    },
    {
        "file": "2011_003",
        "same": "no",
        "reason": "count more",
        "notes": "The inferred safety issues are actually correct as there is a summary which it has read from."
    },
    {
        "file": "2013_202",
        "same": "no",
        "reason": "count less"
    },
    {
        "file": "2019_007",
        "same": "no",
        "reason": "different",
        "notes": "The inferred is more specific about what equipment but is more vague when it comes to defining what the problem to be mitigated was."
    },
    {
        "file": "2011_106",
        "same": "no",
        "reason": "count more",
        "notes": "The inferred are actually exact ones lifted from the report"
    },
    {
        "file": "2012_103",
        "same": "yes",
    },
    {
        "file": "2016_102",
        "same": "no",
        "reason": "count less",
        "notes": "Two of the safety issues are exact however it should have extracted 4 exact issues."
    }
]

This has made me realise that I should actually have a look at which ones are being extracted exactly and which are inferred. But first I need to compare this with what my comparison function says.
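
Comparing manual verdicts with the automated ones boils down to an agreement rate plus a list of the reports that disagree. A minimal sketch, assuming both sets of verdicts are available as dictionaries keyed by report id; `agreement_rate` is hypothetical and the automated verdicts are made up for illustration.

```python
def agreement_rate(manual, automated):
    """Fraction of shared report ids whose two verdicts agree, plus the disagreements."""
    shared = sorted(set(manual) & set(automated))
    disagreements = [r for r in shared if manual[r] != automated[r]]
    rate = (len(shared) - len(disagreements)) / len(shared)
    return rate, disagreements

# Manual verdicts copied from the list above; the automated ones are invented.
manual = {"2013_005": "yes", "2019_108": "yes", "2014_101": "no", "2012_103": "yes"}
automated = {"2013_005": "yes", "2019_108": "no", "2014_101": "no", "2012_103": "yes"}
rate, disagreements = agreement_rate(manual, automated)
print(rate, disagreements)
```

The function assumes the two dictionaries share at least one report id. The disagreement list is the useful part, since those are the reports worth reading by hand.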

In [None]:
validation_df_with_SI_Inference['manual_same'] = validation_df_with_SI_Inference['file'].apply(lambda x: next((item['same'] for item in manual_comparison if item["file"] == x), None))

# Show cases where the manual comparison is different to the LLM comparison
validation_df_with_SI_Inference[validation_df_with_SI_Inference['same'] != validation_df_with_SI_Inference['manual_same']]

After having a look at this I have changed the comparison to be per safety issue.

Looking at the charts that are produced it seems to do a good job.

### Difference between exact and inferred

The model is going to be asked to read the important part of a report and extract the safety issues. There will be two sorts of safety issues extracted.

The first sort is exact safety issues, where the report explicitly states the safety issues. The second sort is inferred safety issues: where no exact safety issues can be found, the model will "invent" some.

This needs to be checked.
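
Since the prompt tells the model that each report contains only one type of safety issue, a response that mixes "exact" and "inferred" labels is itself worth flagging. A sketch of such a check on the list-of-dicts structure used in this notebook; `report_quality` is a hypothetical helper.

```python
def report_quality(issues):
    """Summarise the quality labels of one report's safety issues."""
    if not issues:
        return None
    qualities = {issue["quality"] for issue in issues}
    # A single distinct label is returned as-is; anything else is flagged.
    return qualities.pop() if len(qualities) == 1 else "mixed"

# Hypothetical model outputs in the shape used by this notebook.
consistent = [{"safety_issue": "A", "quality": "inferred"},
              {"safety_issue": "B", "quality": "inferred"}]
mixed = [{"safety_issue": "A", "quality": "exact"},
         {"safety_issue": "B", "quality": "inferred"}]
print(report_quality(consistent), report_quality(mixed), report_quality([]))
```

Taking an arbitrary element of the label set, as `compare_safety_issues_df` does, hides the mixed case; returning an explicit "mixed" value makes it visible.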

In [None]:
# Check whether any safety issues returned for the redacted reports are labelled as exact

for _, row in validation_df_with_SI_Inference.iterrows():
    
    for issue in row['inferred_safety_issues_struct']:
        if issue['quality'] == 'exact':
            print(f"Report {row.file} has an exact safety issue that is inferred")

There are currently no exact safety issues from the validation set, which makes sense as all of the validation reports have had the safety issues removed.

I could test this by seeing what happens if I give it a report that should have exact safety issues in it.


In [None]:
importlib.reload(ReportExtracting)

# Getting some examples of exact safety issues that are found using the model
reports_path = "output"

comparing_with_exact = validation_df.copy()

comparing_with_exact['cleaned_report'] = comparing_with_exact['file'].apply(lambda x: open(os.path.join(reports_path, x, f'{x}.txt'), 'r').read())

comparing_with_exact = extract_safety_issues_from_cleaned_reports(comparing_with_exact)

comparing_with_exact = compare_safety_issues_df(comparing_with_exact)

interpret_dataset_SI_comparison(comparing_with_exact)

This shows that it does a good job of extracting safety issues even when they are mentioned exactly. However, there are two problems. First, some explicit safety issues are missing. Second, the safety issue quality is not being correctly identified as exact.

The problem of missed explicit safety issues is solved by commit 735d98b1e122b66c229b93021c266424af5cd9d4.

Whereas the incorrect labelling will be fixed by fine-tuning the model with new data that has some "exact" safety issue examples.
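
As a rough idea of what an "exact" training example could look like, here is a minimal sketch. It assumes the OpenAI chat fine-tuning format (one JSON object per line, each with a `messages` list). The system prompt, the report text, the JSON serialisation of the response and the helper name `build_finetune_example` are placeholders and not what `ReportExtracting` actually uses; only the `safety_issue` and `quality` keys are taken from the structure used elsewhere in this notebook.

```python
import json


def build_finetune_example(system_prompt, report_text, safety_issues):
    """Build one chat-format training record (a dict ready to be written as a JSONL line)."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": report_text},
            # Assumption: the target response is serialised as JSON
            {"role": "assistant", "content": json.dumps(safety_issues)},
        ]
    }


# Placeholder content, for illustration only
example = build_finetune_example(
    system_prompt="Extract the safety issues from the report section.",
    report_text="... safety issue: The operator had no procedure for ...",
    safety_issues=[
        {"safety_issue": "The operator had no procedure for ...", "quality": "exact"},
    ],
)

# Each training record becomes one line of the JSONL file
jsonl_line = json.dumps(example)
print(jsonl_line)
```

Mixing records like this (uncleaned report, `"quality": "exact"`) in with the existing inferred examples is what should teach the model to tell the two labels apart.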

### Comparing to plain gpt-4-turbo

As the fine-tuned model is expensive to train, I was interested in trying gpt-4-turbo now with all the changes in the prompt.

It is doing quite a good job, so I might try to use it.

I am getting about 77% completely matched reports (that is, all the safety issues in a report are matched with at least one of the inferred safety issues).

This was quite a surprise. However, it is now listing all of the safety issue qualities as exact. Therefore I will update the prompt and see what happens. After trying again and again with the prompts I have realized that I just can't get it to do what I want, and instead the changes are making it worse.

I will fine-tune the model on data with the updated prompts and see what happens.

In [None]:
importlib.reload(ReportExtracting)
importlib.reload(OpenAICaller)

# For this one-off test I will change the df so that it has the uncleaned reports where the cleaned reports should be. This is just to test that it can do both.

validation_df_gpt4 = validation_df.copy()

# Only swap in the uncleaned report where the expected quality is "exact"
validation_df_gpt4['cleaned_report'] = validation_df_gpt4.apply(lambda row: row['cleaned_report'] if row['expected_response_quality'] != "exact" else row['uncleaned_report'], axis=1)

In [None]:

# Infer safety issues
validation_df_gpt4_with_inference = extract_safety_issues_from_cleaned_reports(validation_df_gpt4)

# Compare safety issues
validation_df_gpt4_with_inference = compare_safety_issues_df(validation_df_gpt4_with_inference)

# Understand how well they compared
interpret_dataset_SI_comparison(validation_df_gpt4_with_inference)

In [None]:
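# Compare the expected safety issue quality with the quality in the results

def find_mismatched_quality_issues(df):
    def first_quality(issues):
        # Reports with no inferred issues get None rather than raising StopIteration.
        # If a report mixes "exact" and "inferred", which label is picked is arbitrary.
        return next(iter({x['quality'] for x in issues}), None) if isinstance(issues, list) else None

    df['inferred_safety_issue_quality'] = df['inferred_safety_issues_struct'].apply(first_quality)

    return df[df['inferred_safety_issue_quality'] != df['expected_response_quality']]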

mismatched_reports = find_mismatched_quality_issues(validation_df_gpt4_with_inference)['file'].tolist()
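
The check above collapses each report to a single quality label, so a report whose issues carry both labels would be hidden. A slightly more explicit summary is sketched here. The helper name `summarise_quality`, the `"mixed"` and `"none"` labels and the toy rows are assumptions for illustration and are not part of `ReportExtracting`.

```python
def summarise_quality(issues):
    """Collapse a report's issue list to 'exact', 'inferred', 'mixed' or 'none'."""
    if not isinstance(issues, list) or len(issues) == 0:
        return "none"
    qualities = {issue["quality"] for issue in issues}
    return qualities.pop() if len(qualities) == 1 else "mixed"


# Toy reports using the same keys as the notebook's issue structure (made-up rows)
toy_reports = [
    {"file": "A", "expected": "exact",
     "issues": [{"safety_issue": "x", "quality": "exact"}]},
    {"file": "B", "expected": "inferred",
     "issues": [{"safety_issue": "y", "quality": "exact"},
                {"safety_issue": "z", "quality": "inferred"}]},
    {"file": "C", "expected": "inferred", "issues": None},
]

summaries = {r["file"]: summarise_quality(r["issues"]) for r in toy_reports}
mismatched_files = [r["file"] for r in toy_reports if summaries[r["file"]] != r["expected"]]
print(summaries)
print(mismatched_files)
```

With the notebook's dataframe the same helper could be applied with `df['inferred_safety_issues_struct'].apply(summarise_quality)`.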

In [None]:
looking = validation_df_gpt4_with_inference[validation_df_gpt4_with_inference['file'].isin(mismatched_reports)]
looking = looking[looking['expected_response_quality'] == "inferred"]

interpret_dataset_SI_comparison(looking)

There is a small problem: the data extraction of the safety issues is not as good as it should be.

That means that the "cleaned reports", the number of issues extracted, etc. can't actually be fully trusted.
Therefore I think the results are better than first thought.

I am going to go through the validation set mismatches and list whether the extracted or the inferred ones are more correct. This is quite a manual process but needs to be done.

| report_id | which one was correct | notes |
| ------ | ----- | ---- |
| 2011_003 | inferred | There was a summary in the analysis section that was not removed. |
| 2012_105 | exact | This one is confusing as in the analysis section it does state the safety issues arising are. This is where the first safety issue comes from; the others are from the final "findings" section. Thus they are findings and not safety issues. |
| 2020_002 | exact | It seemingly is grabbing random paragraphs and stating them as safety issues. |
| 2015_102 | inferred | Getting its information from the analysis introduction. |
| 2012_103 | exact | The inferred is saying more or less the same thing. However, it has just copied the text from the findings section. |
| 2019_005 | exact | There is a summary in the analysis section but this misses out on two non-causal safety issues. |
| 2017_202 | exact | There is a summary in the analysis section; however, it does not extract from there. Instead it pulls most of them from the findings section. Also the text extraction did not work perfectly. |
| 2011_204 | exact | The inferred is just grabbing from the findings section. |
| 2013_106 | exact | There is a summary that outlines what the safety issues are. But it has just taken the findings. |
| 2017_101 | inferred | Summary in the analysis section. |

In [None]:
# After changing the prompts I will rerun it on just the reports that are mismatched

importlib.reload(ReportExtracting)

# Only get mismatched reports
mismatched_reports_df = validation_df_gpt4[validation_df_gpt4['file'].isin(mismatched_reports)]
mismatched_reports_df = mismatched_reports_df[mismatched_reports_df['expected_response_quality'] == "inferred"]

mismatched_reports_df_with_inference = extract_safety_issues_from_cleaned_reports(mismatched_reports_df)
print("Safety issues extracted")
# Compare safety issues
mismatched_reports_df_with_inference = compare_safety_issues_df(mismatched_reports_df_with_inference)
print("Comparisons made")
# Understand how well they compared
interpret_dataset_SI_comparison(mismatched_reports_df_with_inference)

### Sending off dataset to Chris and Ingrid

In [None]:
importlib.reload(ReportExtracting)
reports_path = "output"

reports = os.listdir(reports_path)

reports_to_ignore = [
    "2020_102", # This report has an unusual safety issue that is not formatted correctly, so it can't be picked up by the regex
]
reports = [report for report in reports if report not in reports_to_ignore]

all_reports_df = clean_text_and_get_safety_issues(reports_path, reports)

reports_to_send = all_reports_df.sample(50, random_state=42)

reports_to_send['cleaned_report'] = reports_to_send['file'].apply(lambda x: open(os.path.join(reports_path, x, f'{x}.txt'), 'r').read())



In [None]:

reports_to_send_with_safety_issues = extract_safety_issues_from_cleaned_reports(reports_to_send)


In [None]:
minimum_df = reports_to_send_with_safety_issues[['file', 'inferred_safety_issues_struct']]

# Make minimum_df longer by having only one safety issue per row, as currently the safety issues are stored in a list

minimum_df = minimum_df.explode('inferred_safety_issues_struct')

# Remove the rows that have no safety issues
minimum_df = minimum_df[minimum_df['inferred_safety_issues_struct'].notna()]

minimum_df['quality'] = minimum_df['inferred_safety_issues_struct'].apply(lambda x: x['quality'])
minimum_df['safety_issue'] = minimum_df['inferred_safety_issues_struct'].apply(lambda x: x['safety_issue'])

minimum_df[['file', 'safety_issue', 'quality']].to_excel('inferred_safety_issues.xlsx')



After having a meeting it was decided that it is good enough at about 80% accuracy. Before I merge I just want to check a few things:

- That it still does a good job with the exact issue extraction
- What the testing dataset has to say about accuracy

## Make sure that exact safety issue extraction is reasonable



In [None]:
# I will just use the validation set and add in the exact ones and see what it does

validation_exact_check_df = validation_df.copy()

validation_exact_check_df['cleaned_report'] = validation_exact_check_df['file'].apply(lambda x: open(os.path.join(reports_path, x, f'{x}.txt'), 'r').read())

validation_exact_check_df['expected_response_quality'] = 'exact'

validation_exact_check_df.drop(labels='uncleaned_report', axis=1, inplace=True)
validation_exact_check_df.drop_duplicates(subset=['file'], inplace=True)


In [None]:
validation_exact_check_df_with_inference = extract_safety_issues_from_cleaned_reports(validation_exact_check_df)

In [None]:
validation_exact_check_df_with_inference = compare_safety_issues_df(validation_exact_check_df_with_inference)


interpret_dataset_SI_comparison(validation_exact_check_df_with_inference)

After running this I have gotten very good results.

Here are some of the stats:

| description | statistic |
| ------ | ----- |
| same number of extracted and inferred safety issues | 0.71 |
| Average match of same count of SI | 1.00 |
| Average difference in count of SI | -0.43 |
| Average number of extracted SI | 3.29 |
| Average number of inferred SI | 3.71 |

But importantly there is 100% coverage. I can call this good enough.
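
For reference, the count-based figures above are simple arithmetic over per-report counts. The sketch below is not `interpret_dataset_SI_comparison`. It assumes the difference is taken as extracted minus inferred (which matches the negative sign above), and the counts are made up so that the arithmetic lands on the same rounded figures; they are not the real per-report counts.

```python
# Made-up per-report counts, NOT the real validation data
extracted_counts = [3, 4, 2, 5, 3, 3, 3]
inferred_counts = [3, 4, 2, 5, 3, 5, 4]

n_reports = len(extracted_counts)
pairs = list(zip(extracted_counts, inferred_counts))

# Share of reports where both counts agree
same_count_share = sum(e == i for e, i in pairs) / n_reports
# Negative when the model infers more issues than were extracted
avg_difference = sum(e - i for e, i in pairs) / n_reports
avg_extracted = sum(extracted_counts) / n_reports
avg_inferred = sum(inferred_counts) / n_reports

print(f"same count share:   {same_count_share:.2f}")
print(f"average difference: {avg_difference:.2f}")
print(f"average extracted:  {avg_extracted:.2f}")
print(f"average inferred:   {avg_inferred:.2f}")
```

Read this way, a negative average difference just means the model tends to list more safety issues than the report states, which is less of a concern than missing some.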

## Check the test set to get a final expected accuracy

Now that I have reached the end I will look at the test set and see what kind of numbers I get.

In [None]:
test_df['cleaned_report'] = test_df.apply(lambda row: row['cleaned_report'] if row['expected_response_quality'] == "inferred" else row['uncleaned_report'], axis=1)

test_df_with_inference = extract_safety_issues_from_cleaned_reports(test_df)

test_df_with_inference = compare_safety_issues_df(test_df_with_inference)

interpret_dataset_SI_comparison(test_df_with_inference)

I ran the test set and got the results of 86% coverage with 83% of reports completely matched.

That is pretty good so I will merge it in for now.

| description | statistic |
| ------ | ----- |
| same number of extracted and inferred safety issues | 0.56 |
| Average match of same count of SI | 0.90 |
| Average difference in count of SI | -1.28 |
| Average number of extracted SI | 2.44 |
| Average number of inferred SI | 3.72 |
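
The two headline numbers measure different things. In the sketch below, coverage is taken to be the share of extracted safety issues that were matched by some inferred issue, while "completely matched" is the share of reports where every extracted issue was matched. That reading of coverage is an assumption, since it is not defined in this section, and the function name `coverage_and_complete_match` and the flags are made up for illustration.

```python
def coverage_and_complete_match(per_report_flags):
    """per_report_flags: one list per report, holding one bool per extracted issue."""
    all_flags = [flag for flags in per_report_flags for flag in flags]
    # Share of individual extracted issues that were matched
    coverage = sum(all_flags) / len(all_flags)
    # Share of reports in which every extracted issue was matched
    complete = sum(all(flags) for flags in per_report_flags) / len(per_report_flags)
    return coverage, complete


# Toy match flags (made up), one inner list per report
toy_flags = [
    [True, True],
    [True, False, True],
    [True],
    [True, True, True, True],
]

coverage, complete = coverage_and_complete_match(toy_flags)
print(f"coverage={coverage:.2f}, completely matched={complete:.2f}")
```

A single missed issue lowers coverage only slightly but removes the whole report from the completely matched share, which is why the two figures can differ.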