# Code: Extract MRI Parameters Via LLM (Task 2)

This notebook runs the MRI Parameters prompt and the papers through GPT-4o. The code is mostly derived from previous experiments. Each paper is combined with our fixed prompt and submitted to the LLM, and each response is saved as an individual JSON file.

This is **step 1** for obtaining and analyzing the LLM-produced annotations:

1. The notebook `run_T1_MRI_Parameters_GPT-4o.ipynb` (this notebook) accesses the OpenAI API and saves the JSON responses from the LLM.
2. `convert_MRI_JSON_parameters_to_CSV.ipynb` converts the JSON files from the LLM to a single CSV.

## Setup

### Imports, Kill-Switch, ENV Stuff, Model Settings

In [None]:
from datetime import datetime
from dotenv import load_dotenv, find_dotenv
import json
from openai import OpenAI
import os
import pandas as pd

In [None]:
LLM_run = False    # When True, the LLM is called; leave False to test the notebook without hitting the API

if LLM_run:
    print("LLM set to Run.")
else:
    print("LLM not to be called.")

We use `dotenv`, but this code should also find API keys in the environment if they were set by other means.

In [None]:
dotenvfile = find_dotenv()
load_dotenv(dotenvfile)      # Safe even when no dotenv file was found (empty path)

if dotenvfile == '':
    print("No dotenv file. If OpenAI key set in environment this should still run.")

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
ROOT_DIR       = os.getenv('ROOT_DIR')
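print("ROOT_DIR set to: " + str(ROOT_DIR))  # str() avoids a TypeError when ROOT_DIR is unset
if not ROOT_DIR:
    print("ROOT_DIR is not set. The folder paths built later in this notebook will fail.")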

if OPENAI_API_KEY == '' or OPENAI_API_KEY is None:
    print("No OpenAI API key found in environment or dotenv file. See https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key.")
else:
    print("Using OpenAI API key from environment or dotenv.")
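
The check above only prints a warning when a variable is missing. A minimal sketch of a stricter alternative, assuming a hypothetical helper named `require_env` (not part of this notebook) and a made-up variable name for the demo, would stop early instead:

```python
import os

def require_env(name):
    """Hypothetical helper: return the variable's value, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError("Environment variable " + name + " is not set.")
    return value

# Demo with a made-up variable name so real settings are left untouched
os.environ["EXAMPLE_MRI_VAR"] = "/tmp/example"
print(require_env("EXAMPLE_MRI_VAR"))
```

Raising here would surface a missing `ROOT_DIR` or `OPENAI_API_KEY` before any paper is read, rather than partway through the run.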

In [None]:
# Check directory

cwd = os.getcwd()  
print(cwd)

In [None]:
model = "gpt-4o-2024-08-06"  # NB: at the time of this run, 2024-08-06 was the cheapest and most recent GPT-4o snapshot

print("Using this model: " + model)

### Prompt Setup

In [None]:
with open("./MRI_parameter_prompt.txt") as f:
    prompt_start = f.read()

with open("./prototype.json") as f:
    prototype_JSON = f.read()

base_prompt = prompt_start + prototype_JSON + "\n###\n\n"

print(base_prompt + "Paper Text Goes Here")

In [None]:
system_prompt_text = """
You are a helpful assistant who is expert in MRI (magnetic resonance imaging). You understand
the parameters associated with structural MRI scans. You are careful, thorough, and brief in
your responses. You carefully read the text presented and correctly extract from it the
information you need to respond to the user. You format your responses exclusively as JSON using
the guide provided in the prompt.
"""

# The prompt is split across lines above for readability; the next lines collapse it to a single line for the LLM

system_prompt_text = system_prompt_text.replace("\n", " ")
system_prompt_text = system_prompt_text.strip()

print(system_prompt_text)
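
The `replace` and `strip` calls above are sufficient for this prompt. If the prompt text ever contained indentation or runs of spaces, one possible alternative (an assumption, not what this notebook uses) is to split on any whitespace and rejoin with single spaces. The helper name `collapse_whitespace` and the example text are illustrative only:

```python
def collapse_whitespace(text):
    """Hypothetical helper: reduce every run of whitespace to one space and trim the ends."""
    return " ".join(text.split())

example_text = """
You are a helpful
    assistant.
"""

collapsed = collapse_whitespace(example_text)
print(collapsed)
```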

### Message List

Messages to the API must be formatted as a message list. The next function combines the system prompt with the (eventual) user prompt, which is the `base_prompt` followed by the text of the paper. This function **starts** a message list, but does not update it!

In [None]:
def start_message_list(user_prompt, system_prompt = system_prompt_text):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
    return messages

# Test

start_message_list("This is the user message placeholder.")  
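
This notebook sends a single turn per paper, so the list is never extended. If a follow-up turn were ever needed, a sketch of a hypothetical `append_message` helper (an assumption, not part of this notebook) could look like the following. It returns a new list rather than mutating the one passed in:

```python
VALID_ROLES = {"system", "user", "assistant"}

def append_message(messages, role, content):
    """Hypothetical helper: return a copy of the message list with one message added."""
    if role not in VALID_ROLES:
        raise ValueError("Unknown role: " + str(role))
    return messages + [{"role": role, "content": content}]

# Placeholder conversation, built the same way as in start_message_list
initial = [
    {"role": "system", "content": "System prompt placeholder."},
    {"role": "user", "content": "User prompt placeholder."},
]
extended = append_message(initial, "assistant", "Reply placeholder.")
print(len(initial), len(extended))
```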

### Load the Ground Truth data to get the PMCIDs of the papers

In [None]:
gt_df = pd.read_csv("ground_truth_structural_mri_parameters_final.csv")
gt_df.head()

In [None]:
print(gt_df.shape) # Should be 44 rows based on current CSV

In [None]:
paper_list = gt_df['pmcid'].tolist()
paper_list = [x + "_partial.txt" for x in paper_list]
print(len(paper_list))
print(paper_list)

In [None]:
# Deal with directories

input_articles_folder = ROOT_DIR + "/Library/Articles_Title_Thru_Methods/"
print(input_articles_folder)
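
Since the runner loop opens each file in turn, a missing article would stop the run partway through. A minimal sketch of a pre-flight check, assuming a hypothetical helper named `find_missing_files` and using a temporary folder with made-up file names in place of the real article folder:

```python
import tempfile
from pathlib import Path

def find_missing_files(folder, filenames):
    """Hypothetical helper: return the names in `filenames` that are not files in `folder`."""
    folder = Path(folder)
    return [name for name in filenames if not (folder / name).is_file()]

# Demo: the IDs below are placeholders, not papers from the ground truth CSV
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "PMC0000001_partial.txt").write_text("placeholder text")
    demo_missing = find_missing_files(
        tmp, ["PMC0000001_partial.txt", "PMC0000002_partial.txt"]
    )

print(demo_missing)
```

In this notebook the equivalent call would take `input_articles_folder` and `paper_list`, and an empty result would mean every paper is available.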

## Actual Runner Code
Here is the code that actually runs everything through the API.

In [None]:
print(paper_list)

In [None]:
# Parameters for API calls
max_reply_tokens = 5000
temp = 0
run_date = str(datetime.now().strftime("%Y-%m-%d_%H-%M-%S")) # Start time of run
print("Using LLM model: " + model)

# Results Folders
raw_results_dir  = ROOT_DIR + "/LLM_Experiments/Structural_MRI_Parameters/Results/Raw_Results/OpenAI/GPT-4o/"
JSON_results_dir = ROOT_DIR + "/LLM_Experiments/Structural_MRI_Parameters/Results/Inner_JSON/OpenAI/GPT-4o/"

for paper in paper_list:
    # Setup
    fn = input_articles_folder + paper
    with open(fn, "r") as f:
        text = f.read()
    submit_prompt = base_prompt + text
    ml = start_message_list(submit_prompt)
    print("Starting: ", paper)
    # LLM Call
    if LLM_run:
        client = OpenAI()
        completion = client.chat.completions.create(
            model = model,
            response_format = { "type": "json_object" },
            max_tokens = max_reply_tokens,
            temperature = temp,
            messages = ml
        )
    else:
        print("LLM deactivated.")  # Exit the loop early if the LLM is deactivated
        break
    # Save Results
    response = completion.choices[0].message.content  # JSON response from LLM
    # Filename stuff:
    paperID = paper.split("_")[0]  # Grab pmcid from filename
    out_name = paperID + "_" + run_date + ".json"
    # Save:
    cleanJSON = json.loads(response)  # Parse first so a bad reply does not leave an empty file behind
    with open(os.path.join(JSON_results_dir, out_name), "w") as f:
        json.dump(cleanJSON, f, indent=2)
    with open(os.path.join(raw_results_dir, out_name), "w") as f:
        f.write(completion.model_dump_json(indent=2))
    print("### Finished: ", paper)

if LLM_run:
    print("Full run completed.")
else:
    print("Nothing run.")
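
Because `max_tokens` caps the reply length, a long reply can be cut off, which typically leaves invalid JSON. The Chat Completions response records this as a `finish_reason` of `length` on the affected choice. A minimal sketch for checking a saved raw result, assuming a hypothetical helper named `truncated_choices` and hand-written example dictionaries in place of real responses:

```python
def truncated_choices(raw_response):
    """Hypothetical helper: return the indices of choices that stopped at the token limit."""
    return [
        choice.get("index")
        for choice in raw_response.get("choices", [])
        if choice.get("finish_reason") == "length"
    ]

# Hand-written examples shaped like the raw results; not real model output
example_cut_off = {"choices": [{"index": 0, "finish_reason": "length"}]}
example_complete = {"choices": [{"index": 0, "finish_reason": "stop"}]}

print(truncated_choices(example_cut_off))
print(truncated_choices(example_complete))
```

For a real check, each file in the raw results folder could be loaded with `json.load` and passed to the helper.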
