# Kirsten Mayland - Final Project

---

Kirsten Mayland (kirsten.r.mayland.25@dartmouth.edu) <br>
Dartmouth College, CS72, Winter 2025

Purpose: Evaluate how well our LLMs generate relevant follow-up questions for medical posts.

In [91]:
# 1 - connect to ChatGPT
# 2 - have ChatGPT generate follow-up questions for the posts as a baseline
# 3 - use chain-of-thought prompting to have ChatGPT label the 5 generated sets on a five-point Likert scale for each of our 4 axes
# 4 - compute means, standard deviations, and overall scores

## Connect to Dartmouth ChatGPT API

---



In [92]:
%%capture
!pip install langchain_dartmouth

In [None]:
import os
# os.environ['DARTMOUTH_CHAT_API_KEY'] = ''
# os.environ['DARTMOUTH_API_KEY'] = ''

In [94]:
from langchain_dartmouth.llms import DartmouthLLM
from langchain_dartmouth.llms import ChatDartmouth
from langchain_dartmouth.llms import ChatDartmouthCloud
from pprint import pprint
import re
import pandas as pd
from google.colab import files

In [95]:
gpt_4o_mini = ChatDartmouthCloud(model_name="openai.gpt-4o-mini-2024-07-18")

## ChatGPT Baseline Forecast

---



In [96]:
def generate_follow_up(title, post):
    input_text = f"Please ask follow up questions to these posts. Your goal is to prompt them to provide more relevant medical information that they might have forgotten to add. Post Title: {title}. Post: {post}"

    # Generate follow-up questions; invoke() returns a message object, so keep only its text
    response = gpt_4o_mini.invoke(input_text).content

    print("Generated:", response)
    print("-" * 40)
    return response

In [97]:
# Load CSV
# eval_df = pd.read_csv("/content/CS72_FinalProject_EvalDataset.csv", on_bad_lines='skip')
# eval_df = eval_df.dropna().reset_index(drop=True)
# eval_df['Pre-Training Results'] = eval_df.apply(lambda row: generate_follow_up(row['Title'], row['Post']), axis=1)

# eval_df.to_csv("/content/CS72_FinalProject_gpt_4o_mini_Results.csv", index=False, encoding="utf-8-sig")
# files.download("CS72_FinalProject_gpt_4o_mini_Results.csv")

## Evaluate Questions

---



In [98]:
def create_utility_prompt(title, post, follow_up):
  prompt = f'''

  ### Instruction ###
  - Read the title, post, and follow-up questions carefully and determine on a scale from 1-5 with 1 being the least useful and 5 being the most useful, how much utility the follow-up questions have. Utility is defined as how useful the follow-up questions would be to a healthcare provider in responding to the patient message. First, explain your reasoning step by step, then provide the final numeric label.
  - Output the label in brackets. E.g. the questions have a utility score of [2]

  Now, classify the following post and follow-up questions:
  Post Title: {title}
  Post: {post}
  Follow-up questions: {follow_up}
  Reasoning:'''

  return prompt


In [99]:
def create_necessity_prompt(title, post, follow_up):
  prompt = f'''

  ### Instruction ###
  - Read the title, post, and follow-up questions carefully and determine on a scale from 1-5 with 1 being the least necessary and 5 being the most necessary, how much necessity the follow-up questions have. Necessity is defined as how necessary all of the follow-up questions are for a healthcare provider in addressing the patient’s concern. First, explain your reasoning step by step, then provide the final numeric label.
  - Output the label in brackets. E.g. the questions have a necessity score of [2]

  Now, classify the following post and follow-up questions:
  Post Title: {title}
  Post: {post}
  Follow-up questions: {follow_up}
  Reasoning:'''

  return prompt


In [100]:
def create_completeness_prompt(title, post, follow_up):
  prompt = f'''

  ### Instruction ###
  - Read the title, post, and follow-up questions carefully and determine on a scale from 1-5 with 1 being the least complete and 5 being the most complete, how much completeness the follow-up questions have. Completeness is defined as how many follow-up questions are not missing important information necessary for a healthcare provider in addressing the patient’s concern. First, explain your reasoning step by step, then provide the final numeric label.
  - Output the label in brackets. E.g. the questions have a completeness score of [2]

  Now, classify the following post and follow-up questions:
  Post Title: {title}
  Post: {post}
  Follow-up questions: {follow_up}
  Reasoning:'''

  return prompt

In [101]:
def create_clarity_prompt(title, post, follow_up):
  prompt = f'''

  ### Instruction ###
  - Read the title, post, and follow-up questions carefully and determine on a scale from 1-5 with 1 being the least clear and 5 being the most clear, how much clarity the follow-up questions have. Clarity is defined as how easy the follow-up questions are to understand and answer by patients. First, explain your reasoning step by step, then provide the final numeric label.
  - Output the label in brackets. E.g. the questions have a clarity score of [2]

  Now, classify the following post and follow-up questions:
  Post Title: {title}
  Post: {post}
  Follow-up questions: {follow_up}
  Reasoning:'''

  return prompt
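
The four prompt builders above differ only in the axis name, its adjective, and its definition, so a copy-paste slip in one of them is easy to miss. A minimal sketch of a single table-driven builder follows. The names `AXES` and `create_axis_prompt` are hypothetical and not part of the original notebook; the definitions are copied from the prompts above.

```python
# Hypothetical consolidation of the four prompt builders; the wording is taken from the prompts above.
AXES = {
    "utility": ("useful", "how useful the follow-up questions would be to a healthcare provider in responding to the patient message"),
    "necessity": ("necessary", "how necessary all of the follow-up questions are for a healthcare provider in addressing the patient's concern"),
    "completeness": ("complete", "how many follow-up questions are not missing important information necessary for a healthcare provider in addressing the patient's concern"),
    "clarity": ("clear", "how easy the follow-up questions are to understand and answer by patients"),
}


def create_axis_prompt(axis, title, post, follow_up):
    """Build the chain-of-thought rating prompt for one evaluation axis."""
    adjective, definition = AXES[axis]
    return f'''

  ### Instruction ###
  - Read the title, post, and follow-up questions carefully and determine on a scale from 1-5 with 1 being the least {adjective} and 5 being the most {adjective}, how much {axis} the follow-up questions have. {axis.capitalize()} is defined as {definition}. First, explain your reasoning step by step, then provide the final numeric label.
  - Output the label in brackets. E.g. the questions have a {axis} score of [2]

  Now, classify the following post and follow-up questions:
  Post Title: {title}
  Post: {post}
  Follow-up questions: {follow_up}
  Reasoning:'''
```

With this shape, `create_axis_prompt("clarity", title, post, follow_up)` would stand in for `create_clarity_prompt(title, post, follow_up)`, and each definition lives in exactly one place.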

In [102]:
# Chain-of-Thought prompting w/ Special Output Parsing

# Regular expression pattern
# pattern = r"\[(.*?)\]"
# pprint(response.content)
# pprint(re.findall(pattern, response.content))
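
Because the prompts ask for reasoning first and the bracketed label last, the model's text can contain other bracketed spans before the final score. The sketch below shows one way to handle that, under the assumption that the last bracketed digit from 1 to 5 is the intended label. `SCORE_PATTERN` and `parse_score` are hypothetical names, not part of the original notebook.

```python
import re

# Match only a single digit 1-5 inside square brackets, e.g. "[3]".
SCORE_PATTERN = re.compile(r"\[([1-5])\]")


def parse_score(text, default=0):
    """Return the last bracketed 1-5 label in text, or default if none is found."""
    matches = SCORE_PATTERN.findall(text)
    return int(matches[-1]) if matches else default
```

Restricting the pattern to `[1-5]` means bracketed text such as `[note]` or an out-of-range `[7]` is ignored rather than passed on to the statistics step.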

In [103]:
final = pd.DataFrame()

In [104]:
def evaluate_follow_up(title, post, follow_up):
  utility_prompt = create_utility_prompt(title, post, follow_up)
  necessity_prompt = create_necessity_prompt(title, post, follow_up)
  completeness_prompt = create_completeness_prompt(title, post, follow_up)
  clarity_prompt = create_clarity_prompt(title, post, follow_up)

  utility_response = gpt_4o_mini.invoke(utility_prompt)
  necessity_response = gpt_4o_mini.invoke(necessity_prompt)
  completeness_response = gpt_4o_mini.invoke(completeness_prompt)
  clarity_response = gpt_4o_mini.invoke(clarity_prompt)

  pattern = r"\[(.*?)\]"

  # re.findall returns an empty list (never None) when nothing matches, so fall back to ['0'] with `or`
  responses = [utility_response, necessity_response, completeness_response, clarity_response]
  response_filtered = [re.findall(pattern, r.content) or ['0'] for r in responses]

  pprint(response_filtered)
  return response_filtered

In [105]:
flan_base_df = pd.read_csv("/content/CS72_FinalProject_flan_t5_base_Results.csv", on_bad_lines='skip')
flan_small_df = pd.read_csv("/content/CS72_FinalProject_flan_t5_small_Results.csv", on_bad_lines='skip')
gpt4_df = pd.read_csv("/content/CS72_FinalProject_gpt_4o_mini_Results.csv", on_bad_lines='skip')

In [106]:
final['Flan-t5 Base, Pre-training Eval'] = flan_base_df.apply(lambda row: evaluate_follow_up(row['Title'], row['Post'], row['Pre-training Results']), axis=1)


[['1'], ['1'], ['1'], ['1']]
[['1'], ['2'], ['1'], ['2']]
[['1'], ['1'], ['1'], ['2']]
[['1'], ['2'], ['1'], ['2']]
[['1'], ['2'], ['1'], ['2']]
[['1'], ['1'], ['1'], ['2']]
[['2'], ['3'], ['1'], ['2']]
[['1'], ['4'], ['1'], ['2']]


ValueError: {'detail': 'You\'ve exceeded the daily usage limit (1000 credits) for the paid AI models.\n                    IMPORTANT: Click the "New Chat" button and select one of the free models (ex. Llama 3.1) to start a new chat session.\n                    '}
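
The run above stopped partway through the column once the daily credit limit was reached, and `DataFrame.apply` discards the rows that had already been scored. A minimal sketch of a resumable loop is shown below. It assumes the limit surfaces as a `ValueError`, as in the output above; `evaluate_with_checkpoint` is a hypothetical helper, not part of the original notebook.

```python
def evaluate_with_checkpoint(rows, evaluate_fn, done=None):
    """Evaluate rows in order, stopping cleanly at the first ValueError.

    rows: list of (title, post, follow_up) tuples
    evaluate_fn: a callable with the same signature as evaluate_follow_up
    done: results from an earlier partial run; that many leading rows are skipped
    Returns (results, finished).
    """
    results = list(done or [])
    for title, post, follow_up in rows[len(results):]:
        try:
            results.append(evaluate_fn(title, post, follow_up))
        except ValueError as err:
            # Assumption: the usage-limit error is raised as ValueError, as in the output above.
            print(f"Stopped after {len(results)} rows: {err}")
            return results, False
    return results, True
```

The rows could be built with something like `list(zip(flan_base_df['Title'], flan_base_df['Post'], flan_base_df['Pre-training Results']))`, and the partial `results` passed back in as `done` once credits are available again.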

In [None]:
final['Flan-t5 Base, Post-training Eval'] = flan_base_df.apply(lambda row: evaluate_follow_up(row['Title'], row['Post'], row['Post Training Results']), axis=1)

In [None]:
final['Flan-t5 Small, Pre-training Eval'] = flan_small_df.apply(lambda row: evaluate_follow_up(row['Title'], row['Post'], row['Pre-training Results']), axis=1)

In [None]:
final['Flan-t5 Small, Post-training Eval'] = flan_small_df.apply(lambda row: evaluate_follow_up(row['Title'], row['Post'], row['Post Training Results']), axis=1)

In [None]:
final['GPT-4, Pre-training Eval'] = gpt4_df.apply(lambda row: evaluate_follow_up(row['Title'], row['Post'], row['Pre-Training Results']), axis=1)

In [None]:
final.to_csv("/content/CS72_FinalProject_LLM_Evaluations.csv", index=False, encoding="utf-8-sig")
files.download("CS72_FinalProject_LLM_Evaluations.csv")

## Run the Numbers

---



In [113]:
import pandas as pd
import numpy as np
import ast

final = pd.read_csv("/content/CS72_FinalProject_LLM_Evaluations.csv", on_bad_lines='skip')


# Process each column
for column in final.columns:
  print(f"Column: {column}")

  utility_scores = []
  necessity_scores = []
  completeness_scores = []
  clarity_scores = []

  for row in final[column]:
    try:
      # Convert string representation of lists into actual lists (if needed)
      row = ast.literal_eval(row) if isinstance(row, str) else row

      # Ensure row is a list and has at least 4 elements
      if isinstance(row, list) and len(row) >= 4:
        utility_scores.extend(map(int, row[0]))  # Convert to int and extend list
        necessity_scores.extend(map(int, row[1]))
        completeness_scores.extend(map(int, row[2]))
        clarity_scores.extend(map(int, row[3]))

    except Exception as e:
      print(f"Skipping malformed row: {row} due to error {e}")
      continue

  # Convert to numpy arrays for easier statistical calculations
  utility_scores = np.array(utility_scores) if utility_scores else np.array([0])
  necessity_scores = np.array(necessity_scores) if necessity_scores else np.array([0])
  completeness_scores = np.array(completeness_scores) if completeness_scores else np.array([0])
  clarity_scores = np.array(clarity_scores) if clarity_scores else np.array([0])

  # Combine all scores for overall statistics
  all_scores = np.concatenate([utility_scores, necessity_scores, completeness_scores, clarity_scores])

  # Compute means and standard deviations
  print(f"Utility Mean: {np.mean(utility_scores):.2f}, Std Dev: {np.std(utility_scores, ddof=1):.2f}")
  print(f"Necessity Mean: {np.mean(necessity_scores):.2f}, Std Dev: {np.std(necessity_scores, ddof=1):.2f}")
  print(f"Completeness Mean: {np.mean(completeness_scores):.2f}, Std Dev: {np.std(completeness_scores, ddof=1):.2f}")
  print(f"Clarity Mean: {np.mean(clarity_scores):.2f}, Std Dev: {np.std(clarity_scores, ddof=1):.2f}")

  # Compute overall mean and standard deviation
  print(f"Overall Mean: {np.mean(all_scores):.2f}, Overall Std Dev: {np.std(all_scores, ddof=1):.2f}")
  print("-" * 40)

Column: Flan-t5 Base, Pre-training Eval
Utility Mean: 1.58, Std Dev: 0.97
Necessity Mean: 1.71, Std Dev: 1.00
Completeness Mean: 1.17, Std Dev: 0.48
Clarity Mean: 1.96, Std Dev: 0.55
Overall Mean: 1.60, Overall Std Dev: 0.83
----------------------------------------
Column: Flan-t5 Base, Post-training Eval
Utility Mean: 1.83, Std Dev: 1.31
Necessity Mean: 2.08, Std Dev: 1.35
Completeness Mean: 1.38, Std Dev: 0.49
Clarity Mean: 2.50, Std Dev: 1.44
Overall Mean: 1.95, Overall Std Dev: 1.26
----------------------------------------
Column: Flan-t5 Small, Pre-training Eval
Utility Mean: 1.17, Std Dev: 0.38
Necessity Mean: 1.21, Std Dev: 0.88
Completeness Mean: 1.08, Std Dev: 0.28
Clarity Mean: 1.42, Std Dev: 0.50
Overall Mean: 1.22, Overall Std Dev: 0.57
----------------------------------------
Column: Flan-t5 Small, Post-training Eval
Utility Mean: 1.21, Std Dev: 0.41
Necessity Mean: 1.46, Std Dev: 0.66
Completeness Mean: 1.08, Std Dev: 0.41
Clarity Mean: 1.58, Std Dev: 0.97
Overall Mean: 1