# Task 3: Experimental Group Information

This notebook submits the prompt and each paper to the GPT-4o model. Please note that the code is not optimized, and some of its details are not essential to running the papers through GPT-4o.

## General Setup 

In the setup section, we do the imports, load the API key, set up an LLM kill switch (because OpenAI costs money), load the prompt, and load the ground truth data.

### Imports & Utilities

In [None]:
from datetime import datetime
from pydantic import BaseModel
from dotenv import load_dotenv, find_dotenv
import json
from openai import OpenAI
import os
import pandas as pd
import pickle
import tiktoken

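The following function was originally used for token counting, but OpenAI's models have changed tokenization (circa 5/2024), so it should be updated if you intend to use it. With the default `model_name="gpt-4"` it counts tokens under the older `cl100k_base` encoding, whereas GPT-4o uses `o200k_base`, so the counts reported in this notebook are approximate. (The function is included here for completeness.)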

In [None]:
# Token counting function (for checking LLM token usage):

def num_tokens_from_string(string: str, model_name: str="gpt-4") -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens
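
The note above says this function should be updated for newer models. As a hedged sketch of one possible update: GPT-4o models use the `o200k_base` encoding in `tiktoken`. The function name `num_tokens_o200k` and the fallback heuristic (roughly four characters per token) are assumptions for illustration and are not part of the original notebook.

```python
def num_tokens_o200k(string: str) -> int:
    """Approximate token count for GPT-4o-family models (hypothetical helper)."""
    try:
        import tiktoken  # needs a tiktoken release that ships "o200k_base"
        encoding = tiktoken.get_encoding("o200k_base")
        return len(encoding.encode(string))
    except Exception:
        # Assumption: fall back to a rough ~4 characters/token rule of thumb
        # when tiktoken (or its encoding file) is unavailable.
        return (len(string) + 3) // 4

print(num_tokens_o200k("This is a short test sentence."))
```

The fallback keeps the cell usable offline, at the price of a much cruder estimate; remove the `except` branch if you would rather see the error.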

### Kill-Switch

The `LLM_run` variable is checked before any API call is made ($$$). It defaults to `False` to prevent charges from the API.

In [None]:
LLM_run = False    # When True, the LLM will be used

if LLM_run:
    print("LLM set to Run.")
else:
    print("LLM not to be called.")

### Load API Keys

All of this was built with DotEnv files, but it should work with API keys saved in the environment with the right names (see below). We also use `ROOT_DIR`, an environment variable set to the folder containing this repo.

In [None]:
dotenvfile = find_dotenv()
load_dotenv(dotenvfile)      # Apparently no issues if null

if dotenvfile == '':
    print("No dotenv file. If OpenAI key set in environment this should still run.")

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
ROOT_DIR       = os.getenv('ROOT_DIR')
print("ROOT_DIR set to: " + str(ROOT_DIR))   # str() avoids a TypeError if ROOT_DIR is unset

if OPENAI_API_KEY == '' or OPENAI_API_KEY is None:
    print("No OpenAI API key found in environment or dotenv file. See https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key.")
else:
    print("Using OpenAI API key from environment or dotenv.")
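
If you would rather have the notebook stop early than print a warning, a fail-fast check is one option. The following is a minimal sketch, not part of the original notebook; the helper name `require_env` is hypothetical and it uses only `os.environ` from the standard library.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:  # covers both unset (None) and empty string
        raise RuntimeError(f"Environment variable {name!r} is not set.")
    return value

# Example (commented out so that this cell does not fail when the variables are absent):
# ROOT_DIR = require_env("ROOT_DIR")
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```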

In [None]:
cwd = os.getcwd()  # Check to make sure it is local
print(cwd)

### Model Specifics

The models used in this research are listed in the next cell and can be selected for use as needed.

In [None]:
model_list = {"gpt4p":"gpt-4-0125-preview", 
              "gpt4":"gpt-4-0613", 
              "gpt35t":"gpt-3.5-turbo-0125",
              "gpt4o":"gpt-4o-2024-08-06"}

model = model_list["gpt4o"]

print("Using this model: " + model)

### Load Prompts

#### Load the basic task prompt

In [None]:
with open("zero-shot-prompt-disease-labels.txt", "r") as f:
    base_prompt = f.read()

print(base_prompt)

In [None]:
num_tokens_from_string(base_prompt)   # Under the old tokenization model

#### Make a System Prompt

This should describe the persona of the model.

In [None]:
system_prompt_text = """
You are a helpful assistant who is expert in reading and using standard psychiatric
terminology and in translating between different systems of terminology. You understand
science and experimental design. You are careful, thorough, and brief in your responses.
You carefully read the text presented and correctly extract from it the information you
need to respond to the user. You format your responses as JSON using the guide provided
in the prompt.
"""

# The prompt is wrapped above for readability; the next lines collapse it to a single line for the LLM

system_prompt_text = system_prompt_text.replace("\n", " ")
system_prompt_text = system_prompt_text.strip()
num_tokens_from_string(system_prompt_text)

In [None]:
print(system_prompt_text)
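
A slightly more thorough alternative, shown here only as a sketch, collapses every run of whitespace (including doubled spaces left where a line ended in a space) into a single space. The helper name `collapse_whitespace` and the example text are hypothetical.

```python
def collapse_whitespace(text: str) -> str:
    """Collapse all runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())

example = """
You are a helpful assistant.
You format your responses as JSON.
"""
print(collapse_whitespace(example))
# You are a helpful assistant. You format your responses as JSON.
```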

#### Message List

Messages to the API must be formatted as a message list. The next function combines the system prompt with the (eventual) user prompt, which is the `base_prompt` followed by the text of the paper. This function **starts** a message list but does not update it!

In [None]:
def start_message_list(user_prompt, system_prompt = system_prompt_text):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]
    return messages

# Test

start_message_list("This is the user message placeholder.")  
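
Because `start_message_list` only creates the initial two messages, any follow-up turn has to be appended separately. The sketch below shows one way to do that. `append_turn` is a hypothetical helper that is not used elsewhere in this notebook, and restricting roles to `system`, `user`, and `assistant` is an assumption made for this sketch.

```python
def append_turn(messages, role, content):
    """Return a new message list with one more turn (the input list is not modified)."""
    if role not in {"system", "user", "assistant"}:
        raise ValueError(f"Unexpected role: {role}")
    return messages + [{"role": role, "content": content}]

# Usage: start a conversation, then record a reply and a follow-up question.
messages = [
    {"role": "system", "content": "System prompt placeholder."},
    {"role": "user", "content": "First user message."},
]
messages = append_turn(messages, "assistant", "Placeholder model reply.")
messages = append_turn(messages, "user", "Follow-up question.")
print(len(messages))  # 4
```

Returning a new list rather than mutating the input keeps the starting message list reusable across papers.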

The following code loads the ground truth data for this test set; however, we use only the list of `pmid` values here, as these are the publications we are analyzing.

In [None]:
gt_df = pd.read_csv("test-cases-with-ground-truth.csv")
print(gt_df.shape)
gt_df.head()

In [None]:
paper_list = gt_df['pmid'].tolist()
paper_list = [str(x) + "_partial.txt" for x in paper_list]   # str() in case pandas parsed the pmids as integers
print(paper_list)
print(len(paper_list))

In [None]:
input_articles_folder = ROOT_DIR + "/Library/Articles_Title_Thru_Methods"
print(input_articles_folder)
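
Before spending tokens, it can help to confirm that every expected file is present in the input folder. The following is a minimal sketch, assuming the `<pmid>_partial.txt` naming used above; `find_missing_papers` is a hypothetical helper and the demo filenames are placeholders.

```python
import tempfile
from pathlib import Path

def find_missing_papers(folder, filenames):
    """Return the filenames that do not exist as files inside `folder`."""
    folder = Path(folder)
    return [name for name in filenames if not (folder / name).is_file()]

# Demo in a temporary folder containing only one of two expected files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "111_partial.txt").write_text("placeholder text")
    missing = find_missing_papers(tmp, ["111_partial.txt", "222_partial.txt"])
print(missing)  # ['222_partial.txt']
```

With the notebook's own variables, the call would be `find_missing_papers(input_articles_folder, paper_list)`.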

### Make Initial Pass through the Papers to See the Lengths

A note to other researchers: when this research started, LLM context lengths were much shorter (often 4096 or 8192 tokens), so we did a lot to shorten the inputs. As models have developed, this has become much less of a concern (except for per-token costs).

We expect that in the future entire papers will be uploaded without issues.

The following code loads each paper and shows the token length under the old tokenization model.

In [None]:
paper_lengths = dict()
for paper in paper_list:
    with open(input_articles_folder + "/" + paper, "r") as f:
        textblock = f.read()
        submit = base_prompt + textblock
        num_tokens = num_tokens_from_string(submit)
        paper_lengths[paper] = num_tokens

dict(sorted(paper_lengths.items(), key=lambda item: item[1])) # Min 2523, Max 6504

In [None]:
# Total

sum(paper_lengths.values())  # Approximately 125k tokens for prompts and papers

In [None]:
# Range of values for number of tokens

max(paper_lengths.values()), sum(paper_lengths.values())/len(paper_lengths), min(paper_lengths.values())
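
Since per-token cost is the remaining concern, a rough input-cost estimate follows directly from the token total. This is a sketch only: `estimate_input_cost` is a hypothetical helper, and the price in the example is a placeholder rather than a quoted OpenAI rate, so check the current pricing before relying on it.

```python
def estimate_input_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Estimate the input cost in USD for a given number of prompt tokens."""
    return total_tokens / 1_000_000 * usd_per_million_tokens

# About 125k prompt tokens (the approximate total noted above) at a
# placeholder price of 1.00 USD per million tokens.
placeholder_price = 1.00  # hypothetical value for illustration only
print(round(estimate_input_cost(125_000, placeholder_price), 3))  # 0.125
```

Output tokens are billed separately and are not covered by this estimate.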

## Run All Cases

The following code goes through all of the papers in `paper_list`, submits each with the prompt to the LLM, then saves the raw API return (as JSON) and the LLM response (also as JSON).

We will use the LLM responses for analysis; the raw API return is logged for reference (model ID, fingerprint), but will likely not be analyzed.

In [None]:
len(paper_list)

In [None]:
# Parameters for API calls
max_reply_tokens = 5000
temp = 0.2
run_date = str(datetime.now().strftime("%Y-%m-%d_%H-%M-%S")) # Start time of run
print("Using LLM model: " + model)

# Results Folders
raw_results_dir  = ROOT_DIR + "/LLM_Experiments/Diagnosis_And_Group_Size_Easy/Results/Raw_Results/OpenAI/GPT-4o/"
JSON_results_dir = ROOT_DIR + "/LLM_Experiments/Diagnosis_And_Group_Size_Easy/Results/Inner_JSON/OpenAI/GPT-4o/"

for paper in paper_list:
    # Setup
    fn = input_articles_folder + "/" + paper
    with open(fn, "r") as f:
        text = f.read()
    submit_prompt = base_prompt + text
    ml = start_message_list(submit_prompt)
    print("Starting: ", paper, num_tokens_from_string(str(ml)), " approx tokens.")
    # LLM Call
    if LLM_run:
        client = OpenAI()
        completion = client.chat.completions.create(
            model = model,
            response_format = { "type": "json_object" },
            max_tokens = max_reply_tokens,
            temperature = temp,
            messages = ml
        )
    else:
        print("LLM deactivated.")  # Exit the loop if the LLM is deactivated
        break
    # Save Results
    response   = completion.choices[0].message.content # JSON response from LLM
    # Filename stuff:
    # Note: the papers end in "_partial.txt", so this split is a no-op and paperID keeps the
    # full filename. This was an error, but the resulting filenames are fine to use!
    paperID = paper.split("_full-text.txt")[0]
    # Save:
    with open(JSON_results_dir + "/" + paperID + "_" + run_date + ".json", "w") as f:
        f.write(response)
    with open(raw_results_dir + "/" + paperID + "_" + run_date + ".json", "w") as f:
        f.write(completion.model_dump_json(indent=2))
    print("### Finished: ", paper)

print("Full run completed.")
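
The loop above assumes that the results folders already exist and that the model returned well-formed JSON. The sketch below shows one way to guard both assumptions before writing. `save_json_response` is a hypothetical helper, the demo values are placeholders, and the file naming mirrors the `<paperID>_<run_date>.json` pattern used above.

```python
import json
import os
import tempfile

def save_json_response(results_dir, paper_id, run_date, response_text):
    """Write the LLM response to disk and report whether it parsed as JSON."""
    os.makedirs(results_dir, exist_ok=True)  # create the folder if needed
    try:
        json.loads(response_text)
        is_valid = True
    except json.JSONDecodeError:
        is_valid = False  # keep the text anyway so that nothing is lost
    out_path = os.path.join(results_dir, f"{paper_id}_{run_date}.json")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(response_text)
    return out_path, is_valid

# Demo in a temporary folder with a placeholder response.
with tempfile.TemporaryDirectory() as tmp:
    path, ok = save_json_response(
        os.path.join(tmp, "Inner_JSON"), "12345", "2024-01-01_00-00-00",
        '{"placeholder": true}',
    )
    print(os.path.basename(path), ok)
```

Saving the text even when it fails to parse means a malformed reply can still be inspected later instead of being silently dropped.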
