This notebook is for developing the code for extracting metadata from the Prisma incident report (PIR) data. The notebook:

* Loads the full set of PIRs, both labeled and unlabeled
* Separates the labeled from the unlabeled PIRs
* Defines a function that takes as input a prompt template, a set of labeled PIRs, and an unlabeled PIR to create a few-shot learning (FSL) prompt
* Defines a function that takes as input a prompt and an LLM, and returns the model's output
* Defines a function that parses the model output to produce structured metadata
* Saves the resulting structured metadata.

This notebook is for developing that code; once it is developed, it should be put into one or several .py script(s) so that it can be run on the Cluster without having a Jupyter session open.
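
The parsing step listed above is not implemented further down in this notebook yet. As a minimal sketch of what it could look like, assuming the model answers with a comma-separated list of labels (optionally prefixed with `Labels:`) and uses `None` when there are no actions or strategies, as the prompt requests. The function name `parse_labels` and the assumed output format are illustrative, not taken from real model output.

```python
from typing import List


def parse_labels(model_output: str) -> List[str]:
    """Turn raw model output into a list of label strings.

    Assumption (not checked against real model output): the answer is a
    comma-separated list on the first non-empty line.
    """
    # Keep only the first non-empty line; models often keep generating after it.
    lines = [line.strip() for line in model_output.splitlines() if line.strip()]
    if not lines:
        return []
    answer = lines[0]
    # Strip an optional 'Labels:' prefix, matching the example template.
    if answer.lower().startswith("labels:"):
        answer = answer[len("labels:"):]
    labels = [part.strip() for part in answer.split(",")]
    # Drop empty fragments and the explicit 'None' marker used in the prompt.
    return [label for label in labels if label and label.lower() != "none"]
```

Real model output will likely be messier than this, so the parser should be revisited once actual generations are available.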

In [1]:
# Load libraries
from langchain import PromptTemplate
from langchain import FewShotPromptTemplate
import pandas as pd

In [2]:
# Load the full set of PIRs (labeled and unlabeled)
"""
In its final form, the full PIR set should be a pandas DataFrame
where one column ('Text') holds the PIR, and 'Near_Miss' and 'Any_Harm' each have a column.
"""

# Read the full PIRs

PIRs_full = pd.read_excel("PIRS_FULL.xlsx", usecols=['Near_Miss', 'Any_Harm', 'Text'])

In [3]:
# Load labeled PIRs
"""
In its final form, the labeled PIRs should be a pandas DataFrame with four columns,
where one column ('Text') holds the PIR, and the other three columns are 'Risk_Challenge',
'Actions_Strategies', and 'Facilitators'.
"""

# Read the labeled PIRs

PIRs_labeled = pd.read_excel("PIRS_LABELED.xlsx", usecols=['Text', 'Risk_Challenge', 'Actions_Strategies', 'Facilitators'])


In [4]:
# Separate labeled from unlabeled PIRs
"""
PIRs appearing in the labeled data should be dropped from the unlabeled data set.
"""

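# Outer merge on 'Text'; indicator=True adds a '_merge' column showing where each row came from
df = pd.merge(PIRs_full, PIRs_labeled, on=["Text"], how="outer", indicator=True)

# Keep only rows found in the full set but not in the labeled set
df2 = df.loc[df["_merge"] == "left_only"].drop("_merge", axis=1)

# Keep only the 'Text' column for the unlabeled PIRs
PIRs_unlabeled = df2.drop(columns=['Risk_Challenge', 'Actions_Strategies', 'Facilitators', 'Near_Miss', 'Any_Harm'])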
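
To illustrate how the merge above separates the two sets, here is a small self-contained sketch on toy data. The rows are made up for illustration and are not real PIRs; the frames `full` and `labeled` stand in for `PIRs_full` and `PIRs_labeled`.

```python
import pandas as pd

# Toy stand-ins for PIRs_full and PIRs_labeled (made-up rows, not real PIRs).
full = pd.DataFrame({"Text": ["report A", "report B", "report C"],
                     "Near_Miss": [1, 0, 0]})
labeled = pd.DataFrame({"Text": ["report B"],
                        "Actions_Strategies": ["notified physician"]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'.
merged = pd.merge(full, labeled, on="Text", how="outer", indicator=True)

# Rows marked 'left_only' exist in the full set but not in the labeled set.
unlabeled = merged.loc[merged["_merge"] == "left_only", ["Text"]]
unlabeled_texts = sorted(unlabeled["Text"])
print(unlabeled_texts)
```

Note that the match is on exact text, so a labeled PIR whose text differs from the full set by whitespace or encoding would not be dropped.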


In [5]:
## Get the actions/strategies labeled PIRs

labeled_pirs = PIRs_labeled[['Text','Actions_Strategies']]

In [6]:
## Get five examples

#Example 1
ex1 = labeled_pirs.sample(n=1)
ex1_text = ex1['Text']
ex1_text = "".join(ex1_text)
ex1_as = ex1['Actions_Strategies']
ex1_as = ex1_as.to_csv(header = None, index=False)
ex1_as = ex1_as.split('\n')

ex1_as = ','.join(ex1_as)
ex1_as = ex1_as.replace(',,',',')
ex1_as = ex1_as[:-1]

In [7]:
#Example 2
ex2 = labeled_pirs.sample(n=1)
ex2_text = ex2['Text']
ex2_text = "".join(ex2_text)
ex2_as = ex2['Actions_Strategies']
ex2_as = ex2_as.to_csv(header = None, index=False)
ex2_as = ex2_as.split('\n')

ex2_as = ','.join(ex2_as)
ex2_as = ex2_as.replace(',,',',')
ex2_as = ex2_as[:-1]

In [8]:
#Example 3
ex3 = labeled_pirs.sample(n=1)
ex3_text = ex3['Text']
ex3_text = "".join(ex3_text)
ex3_as = ex3['Actions_Strategies']
ex3_as = ex3_as.to_csv(header = None, index=False)
ex3_as = ex3_as.split('\n')

ex3_as = ','.join(ex3_as)
ex3_as = ex3_as.replace(',,',',')
ex3_as = ex3_as[:-1]

In [9]:
#Example 4
ex4 = labeled_pirs.sample(n=1)
ex4_text = ex4['Text']
ex4_text = "".join(ex4_text)
ex4_as = ex4['Actions_Strategies']
ex4_as = ex4_as.to_csv(header = None, index=False)
ex4_as = ex4_as.split('\n')

ex4_as = ','.join(ex4_as)
ex4_as = ex4_as.replace(',,',',')
ex4_as = ex4_as[:-1]

In [10]:
# Example 5
ex5 = labeled_pirs.sample(n=1)
ex5_text = ex5['Text']
ex5_text = "".join(ex5_text)
ex5_as = ex5['Actions_Strategies']
ex5_as = ex5_as.to_csv(header = None, index=False)
ex5_as = ex5_as.split('\n')

ex5_as = ','.join(ex5_as)
ex5_as = ex5_as.replace(',,',',')
ex5_as = ex5_as[:-1]
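
The five cells above differ only in their variable names, and because each one calls `sample(n=1)` independently, the same PIR could in principle be drawn more than once. A possible consolidation is sketched below. The helper name `build_examples` is hypothetical, and the label flattening is a simplified approximation of the `to_csv`/`split`/`join` steps above (it joins the lines of a multi-line label cell with commas), not a byte-for-byte equivalent.

```python
import pandas as pd


def build_examples(labeled: pd.DataFrame, k: int = 5, seed=None) -> list:
    """Sample k distinct labeled PIRs and format them for a few-shot prompt."""
    # sample(n=k) draws without replacement, so no row is picked twice.
    rows = labeled.sample(n=k, random_state=seed)
    examples = []
    for _, row in rows.iterrows():
        value = row["Actions_Strategies"]
        # Assumption: a missing label cell becomes an empty string.
        raw = "" if pd.isna(value) else str(value)
        # Collapse multi-line label cells into one comma-separated string.
        parts = [part.strip() for part in raw.splitlines()]
        labels = ",".join(part for part in parts if part)
        examples.append({"text": str(row["Text"]), "Actions_Strategies": labels})
    return examples
```

With this helper, the list built in the next cell could be produced with something like `examples = build_examples(labeled_pirs, k=5)`.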

In [11]:
## Prompt template

# Create our examples

examples = [
    {"text": ex1_text, "Actions_Strategies": ex1_as},
    {"text": ex2_text, "Actions_Strategies": ex2_as},
    {"text": ex3_text, "Actions_Strategies": ex3_as},
    {"text": ex4_text, "Actions_Strategies": ex4_as},
    {"text": ex5_text, "Actions_Strategies": ex5_as},
]


In [12]:
# Create an example template

example_template = """

Text: {text}
Labels: {Actions_Strategies}
"""

# Create a few-shot prompt example from above template

example_prompt = PromptTemplate(
    
    input_variables = ["text","Actions_Strategies"],
    template = example_template
    
    )

In [13]:
# now break our previous prompt into a prefix and suffix
# the prefix is our instructions

prefix = """ The following is an instruction for labeling Hospital Incident Reports (HIR) Text into categories of actions and strategies. 

A hospital incident report text is a document used to record unexpected events or accidents in a hospital, with the aim of improving patient safety and healthcare quality. 

The definition of actions and strategies in a HIR Text is refer to the procedures and interventions carried out in response to an incident in order to address the incident, ensure patient safety, reduce risks, and enhance overall care. With the definition of actions and strategies, please categorize the HIR Text into actions and strategies through the next steps: 

1. Start by thoroughly reading each Text to gain a comprehensive understanding of the actions or strategies detailed within it.

2. Identify the specific actions or strategies mentioned in the Text that were either implemented or proposed to address the incident.

3. In the below, you will be provided with five examples. Each example contains a Text and Labels, the corresponding categories representing actions or strategies in the Text. If there were no actions or strategies in the Text, then the Labels would be None. 


Here are the five examples:

### BEGIN EXAMPLES 

"""

In [14]:
# and the suffix is our user input and output indicator

suffix = """

### END EXAMPLES

Based on the five examples given above, please label the following Text as actions and strategies. Your answer is not just to sort the text into the existing categories we talked about. You also need to come up with new categories that fit the actions and strategies mentioned in the following text.

Text: {text}

"""

In [15]:
# now create the few-shot prompt template

few_shot_prompt_template = FewShotPromptTemplate(
    
    examples = examples,
    example_prompt = example_prompt,
    prefix = prefix,
    suffix = suffix,
    input_variables =["text"],
    example_separator ="\n\n"
    
    )
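
As a rough illustration of what the few-shot template produces, the sketch below assembles the same kind of prompt in plain Python: the prefix, each formatted example, and the suffix are joined with the separator. This is a simplified stand-in written for this notebook, not langchain's implementation; the function name `assemble_few_shot_prompt` is hypothetical. It fills in the suffix before joining so that curly braces inside example text are not treated as placeholders; langchain's behavior in that edge case has not been checked here.

```python
def assemble_few_shot_prompt(prefix, examples, example_template, suffix,
                             separator="\n\n", **inputs):
    """Join prefix, formatted examples, and the filled-in suffix."""
    # Format each example dict with the per-example template.
    formatted = [example_template.format(**example) for example in examples]
    # Fill the user input into the suffix only.
    filled_suffix = suffix.format(**inputs)
    pieces = [prefix, *formatted, filled_suffix]
    # Skip empty pieces so no stray separators appear.
    return separator.join(piece for piece in pieces if piece)


# Made-up miniature inputs, only to show the resulting layout.
demo = assemble_few_shot_prompt(
    prefix="Label the text.",
    examples=[{"text": "Patient fell.", "Actions_Strategies": "assessed patient"}],
    example_template="Text: {text}\nLabels: {Actions_Strategies}",
    suffix="Text: {text}\nLabels:",
    text="Wrong dose given.",
)
print(demo)
```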


In [16]:
## Get the unlabeled text 

unlabeled_pir = PIRs_unlabeled['Text']

unlabeled_pir_examples = unlabeled_pir.iloc[0]


In [17]:
## Print the prompt

text = unlabeled_pir_examples

prompt = few_shot_prompt_template.format(text=text)

print(prompt)

 The following is an instruction for labeling Hospital Incident Reports (HIR) Text into categories of actions and strategies. 

A hospital incident report text is a document used to record unexpected events or accidents in a hospital, with the aim of improving patient safety and healthcare quality. 

The definition of actions and strategies in a HIR Text is refer to the procedures and interventions carried out in response to an incident in order to address the incident, ensure patient safety, reduce risks, and enhance overall care. With the definition of actions and strategies, please categorize the HIR Text into actions and strategies through the next steps: 

1. Start by thoroughly reading each Text to gain a comprehensive understanding of the actions or strategies detailed within it.

2. Identify the specific actions or strategies mentioned in the Text that were either implemented or proposed to address the incident.

3. In the below, you will be provided with five examples. Each ex

The section below calls the LLMs to label the metadata using the prompts.

In [18]:
# Set cache directory and load Huggingface api key

import os

username = os.getenv('USER')
directory_path = os.path.join('/scratch',username)

# Set Huggingface cache directory to be on scratch drive
if os.path.exists(directory_path):
    hf_cache_dir = os.path.join(directory_path,'hf_cache')
    if not os.path.exists(hf_cache_dir):
        os.mkdir(hf_cache_dir)
    print(f"Okay, using {hf_cache_dir} for huggingface cache. Models will be stored there.")
    assert os.path.exists(hf_cache_dir)
    os.environ['TRANSFORMERS_CACHE'] = f'/scratch/{username}/hf_cache/'
else:
    error_message = f"Are you sure you entered your username correctly? I couldn't find a directory {directory_path}."
    raise FileNotFoundError(error_message)

# Load Huggingface api key
api_key_loc = os.path.join('/home', username, '.apikeys', 'huggingface_api_key.txt')

if os.path.exists(api_key_loc):
    # Read the key from disk so it is actually loaded, not just checked for
    with open(api_key_loc) as f:
        hf_api_key = f.read().strip()
    print('Huggingface API key loaded.')
else:
    error_message = f'Huggingface API key not found. You need to get an HF API key from the HF website and store it at {api_key_loc}.\n' \
                    'The API key will let you download models from Huggingface.'
    raise FileNotFoundError(error_message)


Okay, using /scratch/dixizil/hf_cache for huggingface cache. Models will be stored there.
Huggingface API key loaded.


In [19]:
# Import libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline

Alpaca 30B

* VicUnlocked-alpaca-30B

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Aeala/VicUnlocked-alpaca-30b")

model = AutoModelForCausalLM.from_pretrained("Aeala/VicUnlocked-alpaca-30b")

print('Instantiating pipeline')
pipe = pipeline(
    "text-generation",
    model=model,
    do_sample=True,
    tokenizer=tokenizer,
    max_length=2048,
    temperature=1,
    top_p=0.95,
    repetition_penalty=1.2)

print('Instantiating HuggingFacePipeline')
local_llm = HuggingFacePipeline(pipeline=pipe)

In [None]:
# Get the model's output
print(prompt + local_llm(prompt))
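
The cell above labels a single PIR. The remaining steps from the outline (running every unlabeled PIR through the model and saving the structured metadata) could be sketched as follows. Everything here is an assumption for illustration: the names `label_pirs` and `write_records` are hypothetical, the model is replaced by a stand-in function so the sketch runs without a GPU, and the output goes to an in-memory buffer rather than a real file.

```python
import csv
import io


def label_pirs(texts, build_prompt, llm, parse):
    """Run each unlabeled PIR through the prompt -> model -> parser chain."""
    records = []
    for text in texts:
        raw_output = llm(build_prompt(text))
        labels = parse(raw_output)
        records.append({"Text": text, "Actions_Strategies": "; ".join(labels)})
    return records


def write_records(records, handle):
    """Write the labeled records as CSV to an open file-like object."""
    writer = csv.DictWriter(handle, fieldnames=["Text", "Actions_Strategies"])
    writer.writeheader()
    writer.writerows(records)


# Stand-ins so the sketch runs without loading a model.
def fake_llm(prompt):
    return "notified physician, monitored patient"


records = label_pirs(
    ["Wrong dose given."],
    build_prompt=lambda text: f"Text: {text}\nLabels:",
    llm=fake_llm,
    parse=lambda raw: [part.strip() for part in raw.split(",") if part.strip()],
)
buffer = io.StringIO()
write_records(records, buffer)
print(buffer.getvalue())
```

In the notebook, `build_prompt` would presumably be `lambda text: few_shot_prompt_template.format(text=text)`, `llm` would be `local_llm`, and `handle` an opened output file on the cluster.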