# Question-Answer (QA) Dual LLM Comparison

##1: Task:
    
- Process a set of question-answer pairs through two different LLMs
- Use simple NLP tools, namely BERT embeddings and cosine similarity, to score each model's answers against the reference answer.
- Decide whether you agree with the cosine similarity scores as a measure of answer quality.

In [1]:
!pip install sentence_transformers==2.2.2 --quiet
!pip install torch==2.1.2 --quiet
!pip install datasets --quiet
!pip install openai==1.8.0 --quiet


In [2]:
from sklearn.metrics.pairwise import cosine_similarity
from openai import OpenAI

##2: Load the ARC Dataset

We'll use "allenai/ai2_arc" from HuggingFace, a common dataset for testing QA models. It contains 7,787 grade-school multiple-choice science questions, each with one correct answer. The cell below loads only the `ARC-Challenge` subset. You can learn more about the dataset on the HuggingFace site [here](https://huggingface.co/datasets/allenai/ai2_arc).

### Two Other Question Answer Datasets

If you'd like to explore other question-answering datasets, here are two you can try:

- [HellaSwag](https://rowanzellers.com/hellaswag/)
- [MMLU (Massive Multitask Language Understanding)](https://paperswithcode.com/dataset/mmlu)


In [3]:
from datasets import load_dataset

dataset_name = "allenai/ai2_arc"

dataset = load_dataset(dataset_name, 'ARC-Challenge')

Generating train split: 1119 examples
Generating test split: 1172 examples
Generating validation split: 299 examples

##3: Create Question-Answer Pairs


Use the loaded dataset to create question-answer pairs in the following format:

```
[{"question": "...", "answer": "..."}, {"question": "...", "answer": "..."}, ...]
```

There are multiple answer choices for each question. Use the correct one. (Hint: Take a look at the `answerKey` field and what information it provides.)

**NOTE** You can set the number of QA pairs to be processed by setting `size_of_dataset` below!

In [4]:
prepared_dataset = []
size_of_dataset = 20

for record in dataset['train']:
    choices = record['choices']
    # Map each choice label (e.g. 'A') to its answer text
    possible_answers = dict(zip(choices['label'], choices['text']))
    prepared_dataset.append({
        'question': record['question'],
        'answer': possible_answers.get(record['answerKey']),
    })
    # Stop once exactly size_of_dataset pairs have been collected
    if len(prepared_dataset) >= size_of_dataset:
        break
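
To see how the `label` and `text` lists combine with `answerKey`, here is a minimal sketch on a single hand-written record, with no dataset download involved. The record mimics the fields used above (`question`, `choices`, `answerKey`); its answer choices other than `dry palms` are made up for illustration, and `build_qa_pair` is a hypothetical helper name, not part of the notebook.

```python
def build_qa_pair(record):
    """Return {'question': ..., 'answer': ...} using the choice named by answerKey."""
    choices = record["choices"]
    # Pair each label with its text, e.g. {'A': 'dry palms', 'B': ...}
    answers_by_label = dict(zip(choices["label"], choices["text"]))
    return {
        "question": record["question"],
        "answer": answers_by_label.get(record["answerKey"]),
    }

# Illustrative record: the distractor choices are invented for this sketch
example_record = {
    "question": "George wants to warm his hands quickly by rubbing them. "
                "Which skin surface will produce the most heat?",
    "choices": {
        "label": ["A", "B", "C"],
        "text": ["dry palms", "wet palms", "palms covered with lotion"],
    },
    "answerKey": "A",
}

qa_pair = build_qa_pair(example_record)
print(qa_pair["answer"])
```

If `answerKey` names a label that is not in `choices`, `dict.get` returns `None` rather than raising, so it may be worth checking for `None` answers before scoring.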

Look at a sample of questions and their corresponding correct answers.

In [5]:
import random

num_to_sample = 5
qalist = random.sample(prepared_dataset, num_to_sample)
for pair in qalist:
    print(f"Question: {pair['question']}")
    print(f"Answer: {pair['answer']}\n")

Question: One evening as it is getting dark, Alex sits on the front porch and watches the sun slowly disappear behind the neighbor's house across the street. Which explains this observation?
Answer: The sun appears to move due to Earth's rotation.

Question: Which of the following statements best explains why magnets usually stick to a refrigerator door?
Answer: The refrigerator door contains iron.

Question: Which of the following areas is most likely to form metamorphic rocks such as gneiss and schist?
Answer: a site deep underground

Question: What do cells break down to produce energy?
Answer: food

Question: Which land form is the result of the constructive force of a glacier?
Answer: piles of rocks deposited by a melting glacier



##4: Define OpenAI models for question answering (QA)

Define two different models. We'll get the answers each model provides to the questions above and compare them to the actual answer.

**Note**: Not all OpenAI models can be used for Chat.


In [6]:
MODEL1 = "gpt-3.5-turbo"
MODEL2 = "gpt-4"

Set your OpenAI API key

In [7]:
openai_key = "YOUR KEY HERE"

openai_qa_model = OpenAI(
  api_key = openai_key
)

Create a function that returns an answer given a model and a question. It calls `openai_qa_model.chat.completions.create` with a system prompt that asks for a very short answer and `temperature=0` to reduce run-to-run variation.

In [8]:
def get_answer_from_openai(model, question):
    """Return the model's short answer to a single question."""
    response = openai_qa_model.chat.completions.create(
        messages=[
            {'role': 'system', 'content': 'Give me very short answer without explanation or additional information.'},
            {'role': 'user', 'content': question},
        ],
        model=model,
        temperature=0,
    )

    return response.choices[0].message.content
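
The function above needs a valid API key and network access. As a rough way to check the request and response handling without either, the sketch below passes the client in as an argument and substitutes a stand-in object. `FakeChatClient` and `get_answer` are hypothetical names invented here; the fake only imitates the `chat.completions.create(...)` call path and the `choices[0].message.content` attribute used above, not the real OpenAI client.

```python
from types import SimpleNamespace

class FakeChatClient:
    """Hypothetical stand-in for the OpenAI client that returns a canned answer."""

    def __init__(self, canned_answer):
        self._canned_answer = canned_answer
        self.calls = []  # keyword arguments of every create() call, for inspection
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, **kwargs):
        self.calls.append(kwargs)
        message = SimpleNamespace(content=self._canned_answer)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

def get_answer(client, model, question):
    """Same request shape as the notebook function, with the client passed in."""
    response = client.chat.completions.create(
        messages=[
            {'role': 'system', 'content': 'Give me very short answer without explanation or additional information.'},
            {'role': 'user', 'content': question},
        ],
        model=model,
        temperature=0,
    )
    return response.choices[0].message.content

fake_client = FakeChatClient("Palms")
answer = get_answer(fake_client, "gpt-3.5-turbo", "Which skin surface will produce the most heat?")
print(answer)
```

Passing the client as a parameter is a design choice for this sketch only; it makes the function easy to exercise in isolation, while the notebook keeps using the module-level `openai_qa_model`.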

##5: Run the Question-Answer Pairs Through `get_answer_from_openai()`

We'll store the GPT-3.5 and GPT-4 answers in each pair's dictionary, alongside the original question and correct answer.

In [9]:
for pair in prepared_dataset:
    pair['openai_answers_{}'.format(MODEL1)] = get_answer_from_openai(MODEL1, pair["question"])
    pair['openai_answers_{}'.format(MODEL2)] = get_answer_from_openai(MODEL2, pair["question"])

Take a look at the first entry to see the answers the two LLMs produced:

In [10]:
prepared_dataset[0]

{'question': 'George wants to warm his hands quickly by rubbing them. Which skin surface will produce the most heat?',
 'answer': 'dry palms',
 'openai_answers_gpt-3.5-turbo': 'Palms',
 'openai_answers_gpt-4': 'The palms.'}

##6: Evaluate Questions and Answers with BERT Embeddings

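Score the OpenAI answers against the correct answers by embedding each text with BERT and computing the cosine similarity between the embeddings.

Load the BERT model with the `sentence-transformers` library. Use the `bert-base-uncased` model. Because `bert-base-uncased` is not a dedicated sentence-embedding model, the library wraps it with mean pooling over the token embeddings to produce one vector per text.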

In [11]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

BERT_MODEL = 'bert-base-uncased'
bert_model = SentenceTransformer(BERT_MODEL)




Create a function that calculates the cosine similarity between the embeddings of two texts, for example a correct answer and an OpenAI answer.

In [12]:
def calculate_cosine_similarity(question, answer):

    question_embedding = bert_model.encode(question)
    answer_embedding = bert_model.encode(answer)
    cosine_similarity_value = cosine_similarity([question_embedding], [answer_embedding])
    return cosine_similarity_value[0][0]
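
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction and ignores magnitude. The sketch below spells that formula out in plain Python on tiny hand-picked vectors; `cosine_similarity_plain` is a hypothetical helper written for illustration, not the scikit-learn function used above, and returning 0.0 for a zero vector is an assumption of this sketch.

```python
import math

def cosine_similarity_plain(u, v):
    """Cosine similarity of two equal-length vectors: dot(u, v) / (|u| * |v|)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption for this sketch: a zero vector has no direction, so report 0.0
        return 0.0
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity_plain([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity_plain([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

The score ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). A high score only says two embeddings point the same way, which is not the same as the answers being equally correct.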

For each question-answer pair, embed the texts with the BERT model and print the cosine similarity for:
- The question and the correct answer
- The correct answer and GPT-3.5's answer
- The correct answer and GPT-4's answer
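
Per-question scores are easier to compare once they are averaged per model. The sketch below shows one possible way to do that. The four numbers are the rounded scores for the first two questions in this notebook's printed output; `scores_by_model` and `mean_scores` are hypothetical names, and a full run would collect one score per question instead of two.

```python
from statistics import mean

# Rounded answer-vs-reference scores for the first two questions in this notebook
scores_by_model = {
    "gpt-3.5-turbo": [0.8214, 0.5587],
    "gpt-4": [0.7201, 0.9005],
}

def mean_scores(scores):
    """Return the average similarity score for each model."""
    return {model: mean(values) for model, values in scores.items()}

averages = mean_scores(scores_by_model)
for model, value in averages.items():
    print(f"{model}: {value:.4f}")
```

Two questions are far too few to rank the models. Note also that on the first question the bare answer "Palms" scores higher than "The palms.", which suggests how sensitive the embedding score can be to wording and punctuation.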

In [13]:
for i, pair in enumerate(prepared_dataset):
    print(f"Question: {pair['question']}")
    print(f"Correct Answer: {pair['answer']}")
    print(f"Correct Answer - Question Cosine Similarity: {calculate_cosine_similarity(pair['question'], pair['answer'])}\n")

    print(f"OpenAI {MODEL1} Answer: {pair['openai_answers_{}'.format(MODEL1)]}")
    print(f"Correct Answer - OpenAI {MODEL1} Answer Cosine Similarity: {calculate_cosine_similarity(pair['answer'], pair['openai_answers_{}'.format(MODEL1)])}\n")

    print(f"OpenAI {MODEL2} Answer: {pair['openai_answers_{}'.format(MODEL2)]}")
    print(f"Correct Answer - OpenAI {MODEL2} Answer Cosine Similarity: {calculate_cosine_similarity(pair['answer'], pair['openai_answers_{}'.format(MODEL2)])}\n")

    print(70*"-")

Question: George wants to warm his hands quickly by rubbing them. Which skin surface will produce the most heat?
Correct Answer: dry palms
Correct Answer - Question Cosine Similarity: 0.4845195412635803

OpenAI gpt-3.5-turbo Answer: Palms
Correct Answer - OpenAI gpt-3.5-turbo Answer Cosine Similarity: 0.8214078545570374

OpenAI gpt-4 Answer: The palms.
Correct Answer - OpenAI gpt-4 Answer Cosine Similarity: 0.7200816869735718

----------------------------------------------------------------------
Question: Which of the following statements best explains why magnets usually stick to a refrigerator door?
Correct Answer: The refrigerator door contains iron.
Correct Answer - Question Cosine Similarity: 0.5436828136444092

OpenAI gpt-3.5-turbo Answer: Magnetic attraction.
Correct Answer - OpenAI gpt-3.5-turbo Answer Cosine Similarity: 0.558740496635437

OpenAI gpt-4 Answer: The refrigerator door is made of metal.
Correct Answer - OpenAI gpt-4 Answer Cosine Similarity: 0.9005322456359863

--