# Testing the GPT Fine-Tuned Model

The goal of this notebook is to test the fine-tuned GPT model (davinci-002) on a different dataset. Instead of the Medical Flashcards dataset, the model is tested here on [GokulWork/QuestionAnswer_MCQ](https://huggingface.co/datasets/GokulWork/QuestionAnswer_MCQ).

## Install requirements

In [1]:
!pip uninstall -y openai
!pip install openai==0.28
!pip install datasets
!pip install scikit-learn sentence-transformers
!pip install nltk

Found existing installation: openai 0.28.0
Uninstalling openai-0.28.0:
  Successfully uninstalled openai-0.28.0
Collecting openai==0.28
  Using cached openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
Successfully installed openai-0.28.0


In [2]:
import json
import openai
import pandas as pd
import numpy as np
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import os
from google.colab import drive
import random

## Connect to GDrive

In [3]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
path = '_NLP/Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()

'/content/drive/MyDrive/_NLP/Project'

## Add API key

In [5]:
api_key = "sk-proj-###"  # ADD YOUR API KEY HERE
openai.api_key = api_key
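
Hard-coding the key in a notebook makes it easy to leak when the file is shared. As one alternative, the sketch below reads the key from an environment variable. The variable name `OPENAI_API_KEY` and the helper `load_api_key` are assumptions made for illustration and are not part of the original notebook.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in env_var, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

# Usage (assumes the variable was set before the notebook started):
# openai.api_key = load_api_key()
```

Failing early with a clear message is a design choice here; it avoids a less obvious authentication error on the first API call.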

## Load and Split dataset

In [6]:
dataset = load_dataset("GokulWork/QuestionAnswer_MCQ")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [7]:
df = dataset['train'].to_pandas()
df.head()

| | Unnamed: 0 | question | answer | text |
|---|---|---|---|---|
| 0 | 0 | What is a force? | Correct Answer- A force is a push or pull that... | ###Human:\ngenerate a correct answer, a ration... |
| 1 | 1 | What is Newton's First Law of Motion? | Correct Answer- Newton's First Law of Motion, ... | ###Human:\ngenerate a correct answer, a ration... |
| 2 | 2 | What is the difference between speed and veloc... | Correct Answer- Speed is a scalar quantity tha... | ###Human:\ngenerate a correct answer, a ration... |
| 3 | 3 | Explain that when the kinetic energy of an obj... | Correct Answer- The change in kinetic energy c... | ###Human:\ngenerate a correct answer, a ration... |
| 4 | 4 | What is the SI unit of electric current? | Correct Answer- Ampere\n\nRationale- Ampere is... | ###Human:\ngenerate a correct answer, a ration... |


In [8]:
df = df.drop(['Unnamed: 0', 'text'], axis=1)
df.head()

| | question | answer |
|---|---|---|
| 0 | What is a force? | Correct Answer- A force is a push or pull that... |
| 1 | What is Newton's First Law of Motion? | Correct Answer- Newton's First Law of Motion, ... |
| 2 | What is the difference between speed and veloc... | Correct Answer- Speed is a scalar quantity tha... |
| 3 | Explain that when the kinetic energy of an obj... | Correct Answer- The change in kinetic energy c... |
| 4 | What is the SI unit of electric current? | Correct Answer- Ampere\n\nRationale- Ampere is... |


In [9]:
train_data, test_data = train_test_split(df, test_size=100, random_state=42) # limit to 100

In [10]:
len(test_data)

100

In [11]:
test_data.head()

| | question | answer |
|---|---|---|
| 15 | What is the unit of measure for electric poten... | Correct Answer- Volt.\n\nRationale- The volt i... |
| 9 | What property of a wave determines its loudness? | Correct Answer- Amplitude.\n\nRationale- Ampli... |
| 100 | What is a planet? | Correct Answer- A large, spherical body that o... |
| 132 | Which planet mentioned in the context is an ex... | Correct Answer- Jupiter.\n\nRationale- Jupiter... |
| 68 | What type of rock is formed from the compactio... | Correct Answer- Sedimentary rock.\n\nRationale... |


## Generate answers from the model

In [12]:
def generate_answer(question):
    """Query the fine-tuned model and return the raw completion text."""
    # Every prompt ends with the " ->" separator
    prompt = question + " ->"
    response = openai.Completion.create(
        model='ft:davinci-002:personal::9KLi6nKN',
        prompt=prompt,
        max_tokens=100,
        top_p=0.9,
        frequency_penalty=2,
        presence_penalty=1,
        stop=["\n"]  # keep only the first line of the completion
    )
    return response.choices[0].text

total_questions = len(test_data)
predictions = []
# Iterate over the question column directly; enumerate tracks progress
for count, question in enumerate(test_data["question"], start=1):
    print(f"Processed question {count} out of {total_questions}")
    predictions.append(generate_answer(question))

Processed question 1 out of 100
Processed question 2 out of 100
Processed question 3 out of 100
Processed question 4 out of 100
Processed question 5 out of 100
Processed question 6 out of 100
Processed question 7 out of 100
Processed question 8 out of 100
Processed question 9 out of 100
Processed question 10 out of 100
Processed question 11 out of 100
Processed question 12 out of 100
Processed question 13 out of 100
Processed question 14 out of 100
Processed question 15 out of 100
Processed question 16 out of 100
Processed question 17 out of 100
Processed question 18 out of 100
Processed question 19 out of 100
Processed question 20 out of 100
Processed question 21 out of 100
Processed question 22 out of 100
Processed question 23 out of 100
Processed question 24 out of 100
Processed question 25 out of 100
Processed question 26 out of 100
Processed question 27 out of 100
Processed question 28 out of 100
Processed question 29 out of 100
Processed question 30 out of 100
Processed question 
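
A loop of 100 API calls can be interrupted by a transient error such as a rate limit or a timeout, which loses all progress. A minimal sketch of a retry wrapper with exponential backoff follows. `with_retries` and its parameters are hypothetical names, and catching bare `Exception` is a simplification that should be narrowed to the client's own error types.

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**attempt seconds after each failed attempt.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Usage with the function defined in the cell above:
# predicted_answer = with_retries(lambda: generate_answer(question))
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting.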

## Get the reference answers

In [13]:
# Get answers from the dataset, in the same row order as the predictions
references = test_data["answer"].tolist()
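
Each reference in this dataset bundles the correct answer with a rationale and distractors (the `Correct Answer-`, `Rationale-`, and `Distractor N-` labels visible in the displayed rows), whereas the model output is cut at the first newline. If one wanted to compare predictions against the correct answer alone, a sketch like the following could split the fields. The helper name `parse_reference` is hypothetical, and the label format is inferred only from the rows displayed in this notebook.

```python
import re

# Labels as they appear in the rows shown in this notebook (assumed format).
_LABEL = re.compile(r"^(Correct Answer|Rationale|Distractor \d+)-\s*", re.MULTILINE)

def parse_reference(answer):
    """Split a reference string into a {label: text} dictionary."""
    parts = _LABEL.split(answer)
    # With one capture group, re.split yields [prefix, label1, text1, label2, text2, ...]
    return {label: text.strip() for label, text in zip(parts[1::2], parts[2::2])}

# Usage (hypothetical): keep only the correct-answer field of each reference
# short_references = [parse_reference(r).get("Correct Answer", r) for r in references]
```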

## Compute some metrics

In [14]:
# Load a pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Compute embeddings
reference_embeddings = model.encode(references)
prediction_embeddings = model.encode(predictions)



In [15]:
# Log the shape of the embeddings for debugging
print(f"Reference embeddings shape: {np.array(reference_embeddings).shape}")
print(f"Prediction embeddings shape: {np.array(prediction_embeddings).shape}")

Reference embeddings shape: (100, 384)
Prediction embeddings shape: (100, 384)


### Cosine Similarity

In [16]:
# Compute cosine similarity for each pair
cosine_similarities = []
for ref_emb, pred_emb in zip(reference_embeddings, prediction_embeddings):
    cos_sim = cosine_similarity([ref_emb], [pred_emb])[0][0]
    cosine_similarities.append(cos_sim)

In [17]:
# Calculate average cosine similarity
average_cosine_similarity = np.mean(cosine_similarities)
print(f'Average Cosine Similarity: {average_cosine_similarity:.2f}')

Average Cosine Similarity: 0.54
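
For reference, the cosine similarity of two embedding vectors is their dot product divided by the product of their norms. The pure-Python sketch below shows that computation for a single pair. `cosine` is a hypothetical helper name, and the toy vectors are not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Values range from -1 to 1, with higher values meaning the two texts were embedded in more similar directions.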


### BLEU Score

In [18]:
# Function to calculate BLEU score
def calculate_bleu(reference, prediction):
    reference_tokens = [nltk.word_tokenize(reference)]
    prediction_tokens = nltk.word_tokenize(prediction)
    # Using smoothing function to avoid zero scores for short sequences
    smoothing_function = SmoothingFunction().method1
    bleu_score = sentence_bleu(reference_tokens, prediction_tokens, smoothing_function=smoothing_function)
    return bleu_score

In [19]:
# Ensure NLTK resources are downloaded
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [20]:
# Calculate BLEU scores for all predictions
bleu_scores = [calculate_bleu(ref, pred) for ref, pred in zip(references, predictions)]

In [21]:
# Calculate average BLEU score
average_bleu_score = sum(bleu_scores) / len(bleu_scores)
print(f'Average BLEU Score: {average_bleu_score:.2f}')

Average BLEU Score: 0.03
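
BLEU is built on modified n-gram precision: the fraction of n-grams in the prediction that also occur in the reference, with each n-gram's count clipped to its count in the reference. The sketch below computes only that building block (no brevity penalty, no geometric mean over n-gram orders, no smoothing), so it is not a replacement for `sentence_bleu`. `ngrams` and `clipped_precision` are hypothetical helpers written for illustration, and the sentences are toy examples.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference_tokens, prediction_tokens, n=1):
    """Share of prediction n-grams found in the reference, with clipped counts."""
    pred_counts = Counter(ngrams(prediction_tokens, n))
    ref_counts = Counter(ngrams(reference_tokens, n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    # A repeated n-gram is credited at most as often as it appears in the reference
    matched = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matched / total

reference = "the cat is on the mat".split()
prediction = "the the the cat".split()
print(clipped_precision(reference, prediction, n=1))  # 3 of 4 unigrams credited: 0.75
```

Because every reference here also contains a rationale and distractors, a one-line prediction shares few higher-order n-grams with it.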


## Manual check with a random question from the test dataset

In [24]:
# Pick a random row from the test_data DataFrame created above
random_index = random.randint(0, len(test_data) - 1)
random_row = test_data.iloc[random_index]
prompt = random_row["question"] + " ->"
actual_answer = random_row["answer"]

response = openai.Completion.create(
    model='ft:davinci-002:personal::9KLi6nKN',
    prompt=prompt,
    max_tokens=100,
    top_p=0.9,
    frequency_penalty=2,
    presence_penalty=1,
    stop=["\n"]
)

print('*************************************')
print('Question: ', prompt)
print('Actual Answer:', actual_answer)
print('Bot Answer: ', response.choices[0].text)


*************************************
Question:  What is the term for the transfer of heat through electromagnetic waves? ->
Actual Answer: Correct Answer- Radiation.

Rationale- Radiation involves the emission of energy in the form of electromagnetic waves.

Distractor 1- Conduction.
Distractor 2- Convection.
Distractor 3- Reflection.
Bot Answer:   Conduction is a term for heat transfer through molecular contact. However, convection and radiation are other methods of heat transfer that do not involve direct physical contact..
