GROUP MEMBERS:
- Felipe BAGNI;
- Erfan AMIDI;
- Rishabh TIWARI;
- Federica VINCIGUERRA;
- Dan LIONIS.

# Testing the Fine-Tuned GPT-3.5 Turbo Model on the Original Dataset

The goal of this notebook is to test the fine-tuned GPT model (gpt-3.5-turbo-0125) on the Medical Flashcards dataset.

## Install requirements

In [1]:
!pip uninstall -y openai
!pip install openai==0.28
!pip install datasets
!pip install scikit-learn sentence-transformers
!pip install nltk

Found existing installation: openai 0.28.0
Uninstalling openai-0.28.0:
  Successfully uninstalled openai-0.28.0
Collecting openai==0.28
  Using cached openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
Successfully installed openai-0.28.0


In [2]:
import json
import openai
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import random
from google.colab import drive
import os

## Connect to GDrive

In [3]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
path = '_NLP/Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()

'/content/drive/MyDrive/_NLP/Project'

## Add API key

In [5]:
api_key = "sk-proj-###"  # Add your API key here
openai.api_key = api_key
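
Hard-coding the key in a shared notebook makes it easy to leak. As a minimal sketch of one alternative, the key could be read from an environment variable. Both the variable name `OPENAI_API_KEY` and the helper `load_api_key` are assumptions made for illustration; nothing in this notebook sets them up.

```python
import os

def load_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    `var_name` is an assumed variable name; raise a clear error
    when the variable is unset or empty instead of failing later.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key
```

With the variable exported, the cell above would become `openai.api_key = load_api_key()`.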

## Split dataset

In [6]:
dataset = load_dataset('arrow', data_files='data-00000-of-00001.arrow')

In [7]:
df = dataset['train'].to_pandas()

In [8]:
df.head()

Unnamed: 0,input,output,instruction
0,What is the relationship between very low Mg2+...,Very low Mg2+ levels correspond to low PTH lev...,Answer this question truthfully
1,What leads to genitourinary syndrome of menopa...,Low estradiol production leads to genitourinar...,Answer this question truthfully
2,What does low REM sleep latency and experienci...,Low REM sleep latency and experiencing halluci...,Answer this question truthfully
3,What are some possible causes of low PTH and h...,"PTH-independent hypercalcemia, which can be ca...",Answer this question truthfully
4,How does the level of anti-müllerian hormone r...,The level of anti-müllerian hormone is directl...,Answer this question truthfully


In [9]:
df = dataset['train'].to_pandas()
# Drop the last column (instruction); only input and output are needed
df = df.iloc[:, :-1]

In [10]:
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

In [11]:
test_data.head()

Unnamed: 0,input,output
27911,What are some physical signs that may indicate...,What are some physical signs that may indicate...
7251,What is the name of the amino acid that serves...,Arginine is the amino acid that acts as the pr...
32050,Do high or low potency typical antipsychotics ...,High potency typical antipsychotics are more l...
7969,Which type of heart valves are commonly affect...,Viridans streptococci infection is typically s...
6904,"Among all bugs, which one is the most frequent...",Staphylococcus aureus is the bug that is the m...


## Generate answers from the model

In [32]:
DEFAULT_SYSTEM_PROMPT = 'Answer this question truthfully.'

def generate_answer(question):
  response = openai.ChatCompletion.create(
              model="ft:gpt-3.5-turbo-0125:personal::9So7gDaT",
              messages=[{"role": "system", "content": DEFAULT_SYSTEM_PROMPT},
                        {"role": "user", "content": question}],
              max_tokens=100,
              top_p=0.9,
              frequency_penalty=2,
              presence_penalty=1,
              stop=["\n"]
              )
  return response["choices"][0]["message"]["content"]

In [34]:
total_questions = len(test_data)
predictions = []
count = 0
for index, row in test_data.iterrows():
    if count >= 100:
        break
    print(f"Processed question {count + 1} out of {total_questions}")
    question = row["input"]
    predicted_answer = generate_answer(question)
    predictions.append(predicted_answer)
    count = count + 1

Processed question 1 out of 6791
Processed question 2 out of 6791
Processed question 3 out of 6791
Processed question 4 out of 6791
Processed question 5 out of 6791
Processed question 6 out of 6791
Processed question 7 out of 6791
Processed question 8 out of 6791
Processed question 9 out of 6791
Processed question 10 out of 6791
Processed question 11 out of 6791
Processed question 12 out of 6791
Processed question 13 out of 6791
Processed question 14 out of 6791
Processed question 15 out of 6791
Processed question 16 out of 6791
Processed question 17 out of 6791
Processed question 18 out of 6791
Processed question 19 out of 6791
Processed question 20 out of 6791
Processed question 21 out of 6791
Processed question 22 out of 6791
Processed question 23 out of 6791
Processed question 24 out of 6791
Processed question 25 out of 6791
Processed question 26 out of 6791
Processed question 27 out of 6791
Processed question 28 out of 6791
Processed question 29 out of 6791
Processed question 30 out of 6791
...

## Compute some metrics

In [37]:
# Get the reference answers from the dataset for the same first 100 test rows
references = test_data["output"].head(100).tolist()

In [38]:
# Load a pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Compute embeddings
reference_embeddings = model.encode(references)
prediction_embeddings = model.encode(predictions)



In [39]:
# Log the shape of the embeddings for debugging
print(f"Reference embeddings shape: {np.array(reference_embeddings).shape}")
print(f"Prediction embeddings shape: {np.array(prediction_embeddings).shape}")

Reference embeddings shape: (100, 384)
Prediction embeddings shape: (100, 384)


### Cosine Similarity

In [40]:
# Compute cosine similarity for each pair
cosine_similarities = []
for ref_emb, pred_emb in zip(reference_embeddings, prediction_embeddings):
    cos_sim = cosine_similarity([ref_emb], [pred_emb])[0][0]
    cosine_similarities.append(cos_sim)

In [41]:
# Calculate average cosine similarity
average_cosine_similarity = np.mean(cosine_similarities)
print(f'Average Cosine Similarity: {average_cosine_similarity:.2f}')

Average Cosine Similarity: 0.81
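
For reference, the quantity averaged above is the cosine of the angle between the two embedding vectors: their dot product divided by the product of their norms. The sketch below computes it with the standard library only, on made-up 3-dimensional vectors rather than the real 384-dimensional embeddings; `cosine` is a hypothetical helper, not part of scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, made up for illustration only.
print(cosine([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction -> 1.0
print(cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal -> 0.0
```

A value near 1 means the two answers point in nearly the same direction in embedding space, regardless of their length.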


### BLEU Score

In [42]:
# Function to calculate BLEU score
def calculate_bleu(reference, prediction):
    reference_tokens = [nltk.word_tokenize(reference)]
    prediction_tokens = nltk.word_tokenize(prediction)
    # Using smoothing function to avoid zero scores for short sequences
    smoothing_function = SmoothingFunction().method1
    bleu_score = sentence_bleu(reference_tokens, prediction_tokens, smoothing_function=smoothing_function)
    return bleu_score

In [43]:
# Ensure NLTK resources are downloaded
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [44]:
# Calculate BLEU scores for all predictions
bleu_scores = [calculate_bleu(ref, pred) for ref, pred in zip(references, predictions)]

In [45]:
# Calculate average BLEU score
average_bleu_score = sum(bleu_scores) / len(bleu_scores)
print(f'Average BLEU Score: {average_bleu_score:.2f}')

Average BLEU Score: 0.20
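
BLEU is built from clipped n-gram precisions: the share of n-grams in the prediction that also occur in the reference, with each n-gram counted at most as often as it appears in the reference. The sketch below shows only that building block, not the full score computed by `sentence_bleu` (which also combines several n-gram orders and applies a brevity penalty). The helper names and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference_tokens, prediction_tokens, n):
    """Fraction of prediction n-grams found in the reference, with clipped counts."""
    pred_counts = Counter(ngrams(prediction_tokens, n))
    ref_counts = Counter(ngrams(reference_tokens, n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return overlap / total

# Made-up sentences, split on whitespace for simplicity.
reference = "aldosterone levels are normal".split()
prediction = "aldosterone levels are usually normal".split()
print(clipped_precision(reference, prediction, 1))  # 4 of 5 unigrams match -> 0.8
print(clipped_precision(reference, prediction, 2))  # 2 of 4 bigrams match -> 0.5
```

Every extra word in a prediction adds n-grams that are absent from the reference, so a correct but wordier answer scores lower on this measure.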


### Exact Match

In [46]:
def exact_match(prediction, ground_truth):
    return prediction == ground_truth

# Assuming predictions and references are lists of answers
em_score = sum(exact_match(pred, ref) for pred, ref in zip(predictions, references)) / len(predictions)
print(f'Exact Match Score: {em_score:.2f}')

Exact Match Score: 0.02
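
The comparison above is strict string equality, so a difference in case, punctuation, or spacing counts as a miss. A more lenient variant, sketched below, normalizes both strings first. The normalization steps (lowercasing, removing punctuation and English articles, collapsing whitespace) are a common convention for question-answering evaluation and an assumption here, not what this notebook computed.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def normalized_exact_match(prediction, ground_truth):
    """True when the two answers are identical after normalization."""
    return normalize(prediction) == normalize(ground_truth)

# Made-up pair that differs only in case, an article, and punctuation.
print(normalized_exact_match("The aldosterone levels are normal.",
                             "aldosterone levels are normal"))  # True
```

Answers that differ in wording still fail this check, so the score remains a lower bound on correctness.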


## Manual check with a random question from the test dataset

In [48]:
# Pick a random row from the test_data DataFrame
random_index = random.randint(0, len(test_data) - 1)
random_row = test_data.iloc[random_index]
prompt = random_row["input"]
actual_answer = random_row["output"]

bot_answer = generate_answer(prompt)

print('*************************************')
print('Question: ', prompt)
print('Actual Answer:', actual_answer)
print('Bot Answer: ', bot_answer)


*************************************
Question:  Do patients with secondary adrenocortical insufficiency exhibit symptoms such as hyperkalemia, metabolic acidosis, or hypotension?
Actual Answer: No, patients with secondary adrenocortical insufficiency do not exhibit those symptoms because their aldosterone levels are normal.
Bot Answer:  No, patients with secondary adrenocortical insufficiency do not typically exhibit symptoms such as hyperkalemia, metabolic acidosis or hypotension. Adrenal insufficiency is a condition in which the adrenal glands do not produce enough of certain hormones, including cortisol and aldosterone. In primary adrenal insufficiency (Addison's disease), the adrenal glands themselves are damaged or destroyed. In contrast, secondary adrenal insufficiency occurs when there is a problem with the pituitary


---