# EPIDEMIOLOGICAL PARAMETER EXTRACTION USING PRE-TRAINED LLMS

The objective of this notebook is to test different prompting strategies (e.g. zero-shot, few-shot, chain-of-thought (CoT)) for extracting epidemiological parameters from full-text medical articles.

This notebook guides you through the extraction process.
An OpenAI API key is required to call the GPT API; if you don't have one, an example of the LLM output is provided.

#### 1. SETTING UP

In [1]:
import os, json, re
import pandas as pd
import datetime
from io import StringIO
from openai import OpenAI
from dotenv import load_dotenv
import openai

In [3]:
# Setting the working directory - os.getcwd() keeps the current one, so launch the notebook from the project root or set project_directory to its path
project_directory = os.getcwd()
os.chdir(project_directory) 

openai.api_key = 'YOUR KEY HERE'

client = OpenAI(
    api_key='YOUR KEY HERE'
)
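
Hard-coding the key in a notebook makes it easy to share by accident. A minimal sketch of an alternative, assuming the key is stored in an environment variable: the variable name `OPENAI_API_KEY` and the helper `read_api_key` are illustrative assumptions, not part of the original notebook.

```python
import os


def read_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in `env_var` (hypothetical helper)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of at the first API call
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The client could then be created with `OpenAI(api_key=read_api_key())`. With `python-dotenv` installed, calling `load_dotenv()` beforehand populates the environment from a local `.env` file.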

#### 2. PROMPT TEMPLATE BUILDING 

This section demonstrates how the prompts are built and how the LLM is prompted via the API.

The demonstration uses a list of filepaths to a subset of articles containing delay parameters, but you can modify it to include other text files in the `data\text_files` directory.

The function `prompt_gpt` prompts the GPT-4-Turbo model via the API.

The function `get_full_prompt` builds the full prompt (comprising the instructions, the article text and the parameters to extract). A demonstration of the prompt and the LLM output is provided.
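
As a rough illustration of how the template is filled, the sketch below applies `str.format` to a stand-in template. The placeholder names `MEDICAL_ARTICLE`, `DISEASE` and `PARAMETERS` are the ones used by `get_full_prompt`; the template wording itself is hypothetical, since the real text lives in the files of the `prompts` directory.

```python
# Hypothetical template; the real prompt files live in the `prompts` directory
template = (
    "Here is the full text of a medical article:\n"
    "<medical_article>\n{MEDICAL_ARTICLE}\n</medical_article>\n"
    "Extract the following parameters for {DISEASE}:\n{PARAMETERS}"
)

# Each named placeholder is replaced by the matching keyword argument
filled = template.format(
    MEDICAL_ARTICLE="Outbreak of Marburg virus disease in Johannesburg ...",
    DISEASE="Marburg virus disease",
    PARAMETERS="Incubation period,\nGeneration time",
)
print(filled)
```

Because `str.format` treats braces as placeholders, any literal `{` or `}` in a template file has to be doubled (`{{`, `}}`). Braces inside the article text are not affected, as the text is passed in as a value.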

In [4]:
# Filepaths to subset of articles containing delay parameters

filepath_list = ['data\\text_files\\Gear_1975.txt', 
                'data\\text_files\\Martini_1973.txt',
                'data\\text_files\\Knust_2015.txt', 
                'data\\text_files\\Ajelli_2012.txt',
                'data\\text_files\\Bausch_2006.txt']

delay_params = """
Incubation period, 
Generation time, 
Time symptom to outcome (death),
Time symptom to outcome (other),
Time in care,
Time symptom to careseeking
"""
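
The filepaths above use Windows separators (`\\`), so they only resolve on Windows. A minimal sketch of a portable alternative using `pathlib`, assuming the same `data/text_files` layout; the names `text_dir`, `article_files` and `portable_filepaths` are illustrative.

```python
from pathlib import Path

# Assumed layout: <project root>/data/text_files/<article>.txt
text_dir = Path("data") / "text_files"

article_files = ["Gear_1975.txt", "Martini_1973.txt", "Knust_2015.txt",
                 "Ajelli_2012.txt", "Bausch_2006.txt"]

# Path joins with the separator of the current operating system
portable_filepaths = [str(text_dir / name) for name in article_files]

# Path.stem gives the article label without the file extension
labels = [Path(p).stem for p in portable_filepaths]
print(labels)
```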

In [5]:
def prompt_gpt(prompt, temperature):
    chat_completion = client.chat.completions.create(
        messages=[
            {"role": "user", "content": prompt}
        ],
        model="gpt-4-turbo",  # Use the gpt-4-turbo model
        temperature=temperature,
        max_tokens=1024
    )
    api_call_timestamp = datetime.datetime.now().isoformat()  # Records timestamp immediately after API call
    return chat_completion, api_call_timestamp

In [6]:
def get_full_prompt(prompt_filepath, text_filepath,
                    disease_of_interest, parameters_to_extract):
    """Fill the prompt template with the article text, disease and parameters."""
    with open(prompt_filepath, 'r', encoding='utf-8') as file:
        prompt_template = file.read()  # Reading the prompt template file

    with open(text_filepath, 'r', encoding='utf-8') as file:
        input_text = file.read()  # Reading the text file

    # Building the full prompt
    extraction_prompt = prompt_template.format(MEDICAL_ARTICLE=input_text,
                                               DISEASE=disease_of_interest,
                                               PARAMETERS=parameters_to_extract)
    return extraction_prompt

# Testing the prompt builder - provide any path to any text file to test
print(get_full_prompt(prompt_filepath='prompts\\p0_zero_shot.txt',
                                  text_filepath='data\\text_files\\Gear_1975.txt', 
                                  disease_of_interest='Marburg virus disease', 
                                  parameters_to_extract=delay_params))

Here is the full text of a medical article:
<medical_article> 
BRITISH MEDICAL JOURNAL
29 NOVEMBER 1975
PAPERS AND ORIGINALS
Outbreak of Marburg virus disease in Johannesburg
J S S GEAR,
G A CASSEL,
T H BOTHWELL,
R SHER,
M ISAACSON,
J H S GEAR
A J GEAR,
B TRAPPLER,
L CLAUSEN,
A M MEYERS, M C KEW,
G B MILLER,
J SCHNEIDER,
H J KOORNHOF,
E D GOMPERTS,
British Medical Journal, 1975, 4, 489-493
Summary
The first recognised outbreak of Marburg virus disease
in Africa, and the first since the original epidemic in
West Germany and Yugoslavia in 1967, occurred in South
Africa in February 1975. The primary case was in a young
Australian man, who was admitted to the Johannesburg
Hospital after having toured Rhodesia. Two secondary
cases occurred, one being in the first patient's travelling
companion, and the other in a nurse. Features of the
illness
included
high
fever,
myalgia,
vomiting and
diarrhoea,
hepatitis,
a
characteristic
maculopapular
rash,
leucopenia,
thrombocytopenia,
and
a bleeding
te

In [7]:
test_prompt = get_full_prompt(prompt_filepath='prompts\\p0_zero_shot.txt',
                                  text_filepath='data\\text_files\\Gear_1975.txt', 
                                  disease_of_interest='Marburg virus disease', 
                                  parameters_to_extract=delay_params)
test, time = prompt_gpt(test_prompt, temperature=0.0)

In [12]:
# Printing out the test extraction
print(test)
print(test.choices[0].message.content)

ChatCompletion(id='chatcmpl-AGKKF8YZTgDP0MXBzZK9MV5VG2VBC', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='```xml\n<extracted_parameters>\nIncubation period, 7-8, days, estimated, , , \nTime symptom to outcome (death), 7, days, exact, , , \n</extracted_parameters>\n```', role='assistant', function_call=None, tool_calls=None, refusal=None))], created=1728454743, model='gpt-4-turbo-2024-04-09', object='chat.completion', system_fingerprint='fp_68a5bb159e', usage=CompletionUsage(completion_tokens=48, prompt_tokens=9844, total_tokens=9892, prompt_tokens_details={'cached_tokens': 0}, completion_tokens_details={'reasoning_tokens': 0}))
```xml
<extracted_parameters>
Incubation period, 7-8, days, estimated, , , 
Time symptom to outcome (death), 7, days, exact, , , 
</extracted_parameters>
```


In [13]:
# If you don't have access to an API key, you can open the example extraction file
with open('examples\\example_extraction_Gear_1975.txt', 'r') as file:
    example = file.read()
print(example)

```xml
<extracted_parameters>
Incubation period, 7-8, days, estimated, , , 
Time symptom to outcome (death), 7, days, exact, , , 
</extracted_parameters>
```
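
The extraction comes back as comma-separated rows wrapped in `<extracted_parameters>` tags. One possible way to turn it into structured records is sketched below. The column names in `FIELDS` are assumptions inferred from the example rows (the prompt files define the real schema), and `parse_extraction` is a hypothetical helper.

```python
import csv
import re

# Assumed column names, inferred from the example rows
FIELDS = ["parameter", "value", "unit", "value_type",
          "lower", "upper", "uncertainty_type"]


def parse_extraction(text):
    """Turn an <extracted_parameters> block into a list of dicts."""
    match = re.search(r"<extracted_parameters>(.*?)</extracted_parameters>",
                      text, re.DOTALL)
    if match is None:
        return []  # No tagged block found in the LLM output
    lines = [line for line in match.group(1).splitlines() if line.strip()]
    rows = []
    for fields in csv.reader(lines, skipinitialspace=True):
        fields = [field.strip() for field in fields]
        # Pad short rows so every dict has the same keys
        fields += [""] * (len(FIELDS) - len(fields))
        rows.append(dict(zip(FIELDS, fields)))
    return rows


example_text = """<extracted_parameters>
Incubation period, 7-8, days, estimated, , ,
Time symptom to outcome (death), 7, days, exact, , ,
</extracted_parameters>"""

for row in parse_extraction(example_text):
    print(row["parameter"], "|", row["value"], row["unit"], "|", row["value_type"])
```

Anything outside the tags, such as a surrounding code fence, is ignored by the regular expression. A row whose fields contain unquoted commas would be split incorrectly, so the result is worth checking against the source article.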


#### 3. PROMPTING THE LLM

The function `get_llm_output` takes in several arguments: a filepath to the input file, a disease, parameters to extract, the filepath to the prompt and a temperature setting for the LLM.

It uses these arguments to build a full prompt with the `get_full_prompt` function, prompts the LLM with `prompt_gpt` and returns a `result` dictionary. The content of this dictionary is later written to a JSON file.

In [14]:
# Function for getting LLM output
# Inputs: filepath to the input text file, disease, parameters to extract, filepath to the prompt and temperature

def get_llm_output(filepath, 
                   disease, 
                   parameters_to_extract,
                   prompt_filepath,
                   temperature):
    article_label = os.path.basename(filepath)[:-4] # Getting the filename for labelling the output

    # Building the extraction prompt
    sentence_extract_prompt = get_full_prompt(prompt_filepath=prompt_filepath,
                                                          text_filepath=filepath, 
                                                          disease_of_interest=disease, 
                                                          parameters_to_extract=parameters_to_extract) 
    
    # Prompting GPT with the extraction prompt
    message_1, api_call_timestamp_1 = prompt_gpt(prompt=sentence_extract_prompt, 
                                                    temperature=temperature) 

    # Building final result dictionary - contains the raw API response and its text content
    result = {
        "article_label": article_label,
        "prompt_code": os.path.basename(prompt_filepath),
        "api_call_timestamp_1": api_call_timestamp_1,
        "message_1": message_1,
        "content": message_1.choices[0].message.content
    }
    return result

#### 4. EXTRACTION OF PARAMETERS

This demonstration extracts delay parameters from the articles in `filepath_list` using the prompts in the `prompts` directory. A dictionary of `prompt_filepaths` is used to loop through each prompt, and each output is saved in the `outputs` dictionary. The `outputs` dictionary holds the full result dictionaries, including the raw response objects returned by the API, so it is reduced to an `outputs_parsed` dictionary that contains only the content strings. This is then saved as the JSON file `results\outputs.json`. If you don't have an API key, an example `outputs.json` can be found in the `examples` directory.
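
The sketch below illustrates why the reduction to strings is needed before saving. It uses a `SimpleNamespace` stand-in for the API response object so that it runs without an API call; the stand-in and its content are assumptions for illustration only.

```python
import json
from types import SimpleNamespace

# Stand-in for the raw API response, so the sketch runs without an API call
fake_message = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(
        content="<extracted_parameters>\n</extracted_parameters>"))]
)

outputs = {
    "p0": {
        "Gear_1975.txt": {
            "message_1": fake_message,
            "content": fake_message.choices[0].message.content,
        }
    }
}

# The raw object is not JSON serialisable, so dumping `outputs` directly fails
try:
    json.dumps(outputs)
except TypeError as error:
    print(f"Cannot serialise raw outputs: {error}")

# Keep only the content strings: prompt code -> article file -> extracted text
outputs_parsed = {
    prompt: {article: result["content"] for article, result in articles.items()}
    for prompt, articles in outputs.items()
}
print(json.dumps(outputs_parsed, indent=4))
```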

In [15]:
# Filepaths to prompts
prompt_filepaths = {
    'p0': 'prompts\\p0_zero_shot.txt',
    'p1': 'prompts\\p1_zero_shot_results.txt',
    'p2': 'prompts\\p2_zero_shot_results_discussion.txt',
    'p3': 'prompts\\p3_zero_shot_results_discussion_methods.txt',
    'p4': 'prompts\\p4_zero_shot_cot.txt',
    'p5': 'prompts\\p5_zero_shot_results_cot.txt',
    'p6': 'prompts\\p6_zero_shot_results_discussion_cot.txt',
    'p7': 'prompts\\p7_zero_shot_results_discussion_methods_cot.txt',
    'p8': 'prompts\\p8_zero_shot_summarize.txt',
    'p9': 'prompts\\p9_zero_shot_results_summarize.txt',
    'p10': 'prompts\\p10_zero_shot_results_discussion_summarize.txt',
    'p11': 'prompts\\p11_zero_shot_results_discussion_methods_summarize.txt',
    'p12': 'prompts\\p12_zero_shot_summarize_cot.txt',
    'p13': 'prompts\\p13_zero_shot_results_summarize_cot.txt',
    'p14': 'prompts\\p14_zero_shot_results_discussion_summarize_cot.txt',
    'p15': 'prompts\\p15_zero_shot_results_discussion_methods_summarize_cot.txt'
}

list_of_prompts = [f'p{i}' for i in range(16)]

delay_params = """
Incubation period, 
Generation time, 
Time symptom to outcome (death),
Time symptom to outcome (other),
Time in care,
Time symptom to careseeking
"""

# Dictionary for storing output
outputs = {prompt: {} for prompt in list_of_prompts}
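
The sixteen prompt filenames follow a regular pattern: four section suffixes combined with four strategy suffixes. Reading the suffixes as the article sections and prompting strategies each prompt targets is an assumption based on the names alone. Under that assumption, the filenames could be generated rather than typed out, as in this sketch (the names `sections`, `strategies` and `generated` are illustrative):

```python
from itertools import product

# Suffixes as they appear in the filenames of the `prompts` directory
sections = ["", "_results", "_results_discussion", "_results_discussion_methods"]
strategies = ["", "_cot", "_summarize", "_summarize_cot"]

generated = {}
# Strategies vary slowest, so p0-p3 are plain, p4-p7 use CoT, and so on
for i, (strategy, section) in enumerate(product(strategies, sections)):
    generated[f"p{i}"] = f"p{i}_zero_shot{section}{strategy}.txt"

for code, filename in generated.items():
    print(code, filename)
```

The explicit dictionary above remains the easier option when individual prompts need to be added, renamed or skipped.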

In [16]:
for key in prompt_filepaths:
    print(f"Using {prompt_filepaths[key]} to prompt LLM...")
    prompt_output = {}
    for x in range(0, len(filepath_list)):
        answer = get_llm_output(prompt_filepath=prompt_filepaths[key],
                                filepath=filepath_list[x], 
                                disease='Marburg Virus Disease', 
                                parameters_to_extract=delay_params,
                                temperature=0.0)
        label = os.path.basename(filepath_list[x])
        print(f"Extracted parameters from {label}")
        prompt_output[label] = answer
    outputs[key] = prompt_output

Using prompts\p0_zero_shot.txt to prompt LLM...
Extracted parameters from Gear_1975.txt
Extracted parameters from Martini_1973.txt
Extracted parameters from Knust_2015.txt
Extracted parameters from Ajelli_2012.txt
Extracted parameters from Bausch_2006.txt
Using prompts\p1_zero_shot_results.txt to prompt LLM...
Extracted parameters from Gear_1975.txt
Extracted parameters from Martini_1973.txt
Extracted parameters from Knust_2015.txt
Extracted parameters from Ajelli_2012.txt
Extracted parameters from Bausch_2006.txt
Using prompts\p2_zero_shot_results_discussion.txt to prompt LLM...
Extracted parameters from Gear_1975.txt
Extracted parameters from Martini_1973.txt
Extracted parameters from Knust_2015.txt
Extracted parameters from Ajelli_2012.txt
Extracted parameters from Bausch_2006.txt
Using prompts\p3_zero_shot_results_discussion_methods.txt to prompt LLM...
Extracted parameters from Gear_1975.txt
Extracted parameters from Martini_1973.txt
Extracted parameters from Knust_2015.txt
Extrac

In [17]:
outputs_parsed = {}

for key in outputs:
    print(f"\nOutput from prompt {key}:\n")
    prompt_output = {}
    #print(outputs[key].keys())
    for k in outputs[key]:
        print(f"Extraction from {k}")
        print(outputs[key][k]['content'])
        prompt_output[k] = outputs[key][k]['content']
    outputs_parsed[key] = prompt_output


Output from prompt p0:

Extraction from Gear_1975.txt
<extracted_parameters>
Incubation period, 7-8, days, estimated, , , 
Time symptom to outcome (death), 7, days, exact, , , 
</extracted_parameters>
Extraction from Martini_1973.txt
<extracted_parameters>
Incubation period, 3-9, days, range, 3, 9, exact
Time symptom to outcome (death), 15-20, days, range, 15, 20, exact
</extracted_parameters>
Extraction from Knust_2015.txt
<extracted_parameters>
Time symptom to outcome (death), 9, days, mean, , , 
Time symptom to outcome (death), 6.5, days, mean, , , 
Time in care, 14.3, days, mean, 4, 22, range
Time symptom to careseeking, 4, days, mean, , , 
</extracted_parameters>
Extraction from Ajelli_2012.txt
<extracted_parameters>
Generation time, 9, days, mean, 8.2, 10, 95%CI
Generation time, 5.4, days, std dev, 3.9, 8.6, 95%CI
Time symptom to outcome (death), 7, days, median, 5, 9, range
</extracted_parameters>
Extraction from Bausch_2006.txt
<extracted_parameters>
Time symptom to outcome (d

In [18]:
with open('results\\outputs.json', 'w') as json_file:
    json.dump(outputs_parsed, json_file, indent=4)

In [19]:
# If you don't have the API key but would like to see the outputs file
with open('examples\\outputs.json', 'r') as file:
    example_outputs = json.load(file)

for key in example_outputs:
    print(f"Extraction using prompt {key}:")
    for k in example_outputs[key]:
        print(f"\nExtraction from article {k}")
        print(example_outputs[key][k])
    print("\n\n\n")

Extraction using prompt p0:

Extraction from article Gear_1975.txt
<extracted_parameters>
Incubation period, 7-8, days, estimated, , , 
Time symptom to outcome (death), 7, days, exact, , , 
</extracted_parameters>

Extraction from article Martini_1973.txt
<extracted_parameters>
Incubation period, 3-9, days, range, 3, 9, exact
Time symptom to outcome (death), 15-20, days, range, 15, 20, exact
</extracted_parameters>

Extraction from article Knust_2015.txt
<extracted_parameters>
Time symptom to outcome (death), 9, days, mean, , , 
Time symptom to outcome (death), 6.5, days, mean, , , 
Time in care, 14.3, days, mean, 4, 22, range
Time symptom to careseeking, 4, days, mean, , , 
</extracted_parameters>

Extraction from article Ajelli_2012.txt
<extracted_parameters>
Generation time, 9, days, mean, 8.2, 10, 95%CI
Generation time, 5.4, days, std dev, 3.9, 8.6, 95%CI
Time symptom to outcome (death), 7, days, median, 5, 9, range
</extracted_parameters>

Extraction from article Bausch_2006.txt
<