# What

As discussed in https://github.com/1jamesthompson1/TAIC-report-summary/issues/138, each recommendation that TAIC issues also receives a response from its recipient. These responses fall into a few categories, and this notebook assigns each response to one of them.

## How to do it

I have a dataset file containing all of the recommendations from TAIC.

I also have 10 random examples of categorized recommendations.

I will start by simply asking the model and seeing how it does against these examples. Then I can use the examples as few-shot examples within the prompt.
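
As a rough sketch of the second step, labelled examples could be formatted into a block of text that is appended to the prompt. The helper name `build_few_shot_block` and the two replies below are made up for illustration; they are not part of the engine and are not real TAIC responses.

```python
# Hypothetical helper: turn labelled examples into a few-shot block for the prompt.
def build_few_shot_block(examples):
    """Format (reply_text, category) pairs as text to append to a system prompt."""
    parts = []
    for i, (reply_text, category) in enumerate(examples, start=1):
        parts.append(f'Example {i}:\nResponse: "{reply_text}"\nCategory: {category}')
    return "\n\n".join(parts)


# Made-up replies for illustration only; these are not real TAIC responses.
examples = [
    ("We accept the recommendation and will update our procedures.", "Accepted"),
    ("We have received the recommendation and will consider it.", "Under consideration"),
]
few_shot_block = build_few_shot_block(examples)
print(few_shot_block)
```

Any examples used this way should be left out of the set used to measure accuracy, otherwise the comparison is no longer a fair test.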

## All of the modules needed

To keep things as transparent as possible I will add all of the dependencies at the top.

In [38]:
# From the engine
from engine.OpenAICaller import openAICaller

# Third party
import pandas as pd

# Built in
import os
import re
import importlib

In [66]:
response_categories = [
    {"category": "Accepted", "definition": "The recommendation was accepted (wholly or in part) and is being, has been, or will be implemented"},
    {"category": "Under consideration", "definition": "The recipient has acknowledged that the recommendation is received and will consider it."},
    {"category": "Rejected", "definition": "The recommendation will not be implemented"}
]

In [26]:
"\n".join([f"{element['category']} - {element['definition']}" for element in response_categories])

'Accepted - The recommendation was accepted (wholly or in part) and is being, has been, or will be implemented\nUnder consideration - The recipient has acknowledged that the recommendation is received and will consider it.\nRejected - The recommendation will not be implemented'

# Getting example dataset

I will ask Ingrid from TAIC to categorize some examples for me.

In [17]:
# Read TAIC_recommendations_04_04_2024.xlsx into a dataframe

taic_recommendations = pd.read_excel('TAIC_recommendations_04_04_2024.xlsx')

In [20]:
# Keep rows where the recommendation and reply are not empty
def clean_TAIC_recommendations(df):
    # Work on a new frame so the caller's dataframe is not modified in place
    df = df.dropna(subset=["Recommendation", "Inquiry", "Number", "Reply Text"])
    df = df[(df['Made'] >= '2010-01-01') & (df['Made'] <= '2024-12-31')]

    # Structure the inquiry number so it can be matched with the rest of the project
    inquiry_regex = r'^(((AO)|(MO)|(RO))-[12][09][987012]\d-[012]\d{2})$'
    # .copy() avoids a SettingWithCopyWarning when report_id is added below
    df = df[df['Inquiry'].str.match(inquiry_regex)].copy()
    df['report_id'] = df['Inquiry'].apply(lambda x: "_".join(x.split('-')[1:3]))
    return df

sample_df = clean_TAIC_recommendations(taic_recommendations)

sample_df = sample_df.sample(10, random_state=42)

sample_df['response_category'] = "To be filled out"

sample_df

Unnamed: 0,Number,Recipient,Safety Issue,Inquiry,Status,Made,Closed,Recommendation,Reply Text,report_id,response_category
12,032/23,KiwiRail,Awareness and education,RO-2022-103,Open,2023-09-27,NaT,"On 27 September 2023, the Commission recommend...",032/23:\tThe Commission recommended that KiwiR...,2022_103,To be filled out
315,040/10,NZTA,,RO-2009-103,Closed acceptable,2010-09-23,2012-06-28,KiwiRail provided a draft document that demons...,We intend to work closely with KiwiRail to ove...,2009_103,To be filled out
181,020/15,CAA,,AO-2013-006,Closed acceptable,2016-02-25,2017-10-25,On 25 February 2016 the Commission recommended...,The recommendation to check aerodrome lighting...,2013_006,To be filled out
256,036/12,CAA,,AO-2011-002,Closed acceptable,2012-12-14,2018-04-23,On 14 December 2012 the Commission recommended...,On 16 Januray 2013 the Director of Civil Aviat...,2011_002,To be filled out
274,008/12,CAA,,AO-2010-009,Open,2012-03-22,NaT,The wearing of appropriate seat restraints can...,Accepted. The Director will monitor the outco...,2010_009,To be filled out
237,016/13,KiwiRail,,RO-2011-102,Closed acceptable,2013-09-26,2015-03-26,The Commission recommends that the Chief Execu...,"Recommendations 013/13, 014/13, 015/13 and 016...",2011_102,To be filled out
129,035/17,MNZ,,MO-2016-201,Closed acceptable,2017-12-16,2020-04-06,The CO2 fixed fire-fighting system installed i...,I accept this recommendation. Maritime New Zea...,2016_201,To be filled out
105,029/18,Navim Group,,MO-2017-203,Open,2018-11-22,NaT,The nitrogen cylinders that formed part of the...,The accident highlighted the need to pay extra...,2017_203,To be filled out
141,025/17,KiwiRail,,RO-2016-102,Closed acceptable,2017-08-24,2018-03-26,"On 24 August 2017, the Commission recommended ...",KiwiRail will review its policy and procedures...,2016_102,To be filled out
60,003/22,Oceanic Fishing Limited,Ensuring compliance,MO-2021-203,Open,2022-03-23,NaT,The Commission recommended that Oceanic Fishin...,"Oceanic Fishing Limited replied, in part:_x000...",2021_203,To be filled out
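
The inquiry-to-`report_id` conversion used in `clean_TAIC_recommendations` can be checked in isolation. The sketch below reuses the same regular expression; the helper name `to_report_id` and the two malformed inputs are assumptions added for illustration.

```python
import re

# Same pattern as in clean_TAIC_recommendations.
inquiry_regex = r'^(((AO)|(MO)|(RO))-[12][09][987012]\d-[012]\d{2})$'


def to_report_id(inquiry):
    """Return 'YYYY_NNN' for a well-formed inquiry number, else None."""
    if re.match(inquiry_regex, inquiry) is None:
        return None
    return "_".join(inquiry.split('-')[1:3])


# The first three inquiry numbers appear in the sample above; the last two are malformed on purpose.
for inquiry in ["RO-2022-103", "AO-2013-006", "MO-2016-201", "RO-2022-1", "XX-2013-006"]:
    print(inquiry, "->", to_report_id(inquiry))
```

Rows whose inquiry number does not match the pattern are dropped by the cleaning step, so it is worth confirming that the pattern accepts every format that should be kept.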


In [30]:
# Get definitions ready

df_definitions = pd.DataFrame(response_categories)

In [31]:
with pd.ExcelWriter('example_recommendation_categories_ingrid.xlsx', engine='openpyxl') as writer:
    sample_df.to_excel(writer, sheet_name='Recommendations')
    df_definitions.to_excel(writer, sheet_name='Definitions')

# Performing categorization



## Get datasets 



In [44]:
# Get examples from Ingrid

ingrid_categories = pd.read_excel('example_recommendation_categories_ingrid_responses.xlsx')

# The response_category column is used as an integer index into response_categories
ingrid_categories['response_category_example'] = ingrid_categories['response_category'].apply(lambda x: response_categories[x]['category'])

ingrid_categories.drop(columns=['response_category'], inplace=True)

# Update 011/17 as N/A
ingrid_categories.loc[ingrid_categories['Number'] == '011/17', 'response_category_example'] = "N/A"

In [45]:
# Get all recommendations from TAIC

taic_recommendations_cleaned = clean_TAIC_recommendations(taic_recommendations)

In [47]:
# Combine datasets together

recommendations_df = taic_recommendations_cleaned.merge(ingrid_categories, how='outer')

recommendations_df = recommendations_df[['report_id', 'Number', 'Recommendation', 'Reply Text' ,'Made', 'Recipient', 'response_category_example']]

# Change all column names to lower case and replace spaces with _
recommendations_df.columns = recommendations_df.columns.str.lower().str.replace(' ', '_')
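
Because the merge above does not name its key columns, pandas joins on every column the two frames share. One way to sanity-check such a join is to pass an explicit key together with `indicator=True`. This is a sketch on toy data; the frames and the choice of `Number` as the key are assumptions, not the notebook's actual data.

```python
import pandas as pd

# Toy frames standing in for the cleaned recommendations and the labelled examples.
all_recs = pd.DataFrame({
    "Number": ["001/20", "002/20", "003/20"],
    "Reply Text": ["reply a", "reply b", "reply c"],
})
labelled = pd.DataFrame({
    "Number": ["002/20", "999/99"],
    "response_category_example": ["Accepted", "Rejected"],
})

# An explicit key plus indicator=True adds a `_merge` column showing each row's origin.
merged = all_recs.merge(labelled, on="Number", how="outer", indicator=True)
print(merged["_merge"].value_counts())

# Rows that exist only in the labelled file point to a key mismatch.
unmatched = merged.loc[merged["_merge"] == "right_only", "Number"].tolist()
print(unmatched)
```

If any labelled rows come back as `right_only`, they did not line up with a cleaned recommendation and would carry no reply text into the categorization step.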

## Assign Categories



In [59]:
def assign_response_category(response, recommendation_num):
    categories = '\n'.join([f"{element['category']} - {element['definition']}" for element in response_categories])

    system_prompt = f"""
You are helping me put responses into categories.

These responses are to recommendations that were made in a transport accident investigation report. These recommendations are issued directly to a particular party.

There are three categories:

{categories}

However if there are responses that don't fit into any of the categories then you can put them as N/A. These may be responses that request further information or want recommendation to be sent elsewhere.

Your response should just be the name of the category with nothing else.
"""
    user_prompt = f"""
Which category is this response in?

"
{response}
"

in regards to recommendation number {recommendation_num}
"""

    # print(f"system prompt is:\n{system_prompt} and user prompt is:\n{user_prompt}")

    openai_response = openAICaller.query(
        system_prompt,
        user_prompt,
        model = "gpt-4",
        temp = 0
    )

    if openai_response in [category['category'] for category in response_categories] + ['N/A']:
        return openai_response
    else:
        print(f"Did not match any of the categories - {openai_response}")
        return None
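
The exact-match check at the end of `assign_response_category` rejects answers that differ only in case, whitespace or trailing punctuation. A possible normalisation step is sketched below; `normalise_category` and `VALID_CATEGORIES` are hypothetical names, and the function is not part of the engine.

```python
# Hypothetical list mirroring the three categories plus the N/A fallback.
VALID_CATEGORIES = ["Accepted", "Under consideration", "Rejected", "N/A"]


def normalise_category(raw_answer, valid_categories=VALID_CATEGORIES):
    """Map a raw model answer to a canonical category name, or None if no match."""
    if raw_answer is None:
        return None
    # Drop surrounding whitespace, quotes and a trailing full stop.
    cleaned = raw_answer.strip().strip('"\'').rstrip('.').strip().lower()
    for category in valid_categories:
        if cleaned == category.lower():
            return category
    return None


print(normalise_category(" accepted.\n"))
print(normalise_category('"Under consideration"'))
print(normalise_category("Probably accepted"))
```

Answers that still fail to match after normalisation are better left as `None` and reviewed by hand than forced into a category.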

In [67]:
sample_recommendations_df = recommendations_df.sample(25, random_state=42)

sample_recommendations_df['response_category'] = sample_recommendations_df.apply(lambda x: assign_response_category(x['reply_text'], x['number']), axis=1)


In [64]:

# .copy() gives an independent frame, so adding a column does not raise a SettingWithCopyWarning
examples_recommendations_df = recommendations_df[~recommendations_df['response_category_example'].isna()].copy()
examples_recommendations_df['response_category'] = examples_recommendations_df.apply(lambda x: assign_response_category(x['reply_text'], x['number']), axis=1)


In [65]:
# Compare the results

comparison = examples_recommendations_df['response_category'] == examples_recommendations_df['response_category_example']

sum(comparison)/len(comparison)



0.8
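
A single agreement figure does not show which categories are being confused. A cross-tabulation of human labels against model labels is one way to see that. The sketch below uses made-up labels, not this notebook's results, with the same column names as above.

```python
import pandas as pd

# Illustrative labels only; these are not the notebook's actual results.
results = pd.DataFrame({
    "response_category_example": ["Accepted", "Accepted", "Rejected", "N/A", "Under consideration"],
    "response_category": ["Accepted", "Under consideration", "Rejected", "N/A", "Under consideration"],
})

# Rows are the human labels, columns are the model's labels.
confusion = pd.crosstab(results["response_category_example"], results["response_category"])
print(confusion)

# List only the disagreements for manual review.
mismatches = results[results["response_category_example"] != results["response_category"]]
print(mismatches)
```

With only ten labelled examples, each disagreement shifts the score by 0.1, so reading the mismatched replies directly is likely to be more informative than the score itself.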