# Comparison of ChatGPT 4 Latest Models Behavior

The goal of this small experiment is to use the same set of instructions and compare the answers provided by different ChatGPT 4.0 models.

The answers are limited to the content of a set of documents; therefore, a retrieval-augmented generation (RAG) approach is used.

Created by: Jhonnatan Torres <br>
May 7th, 2024
___

I stored my OpenAI key in a `.env` file. To reproduce this notebook, install the package with `pip install python-dotenv`, create a `.env` file in the same directory as the notebook, and add your key to the `OPENAI_API_KEY` entry in that file.

In [1]:
from dotenv import load_dotenv
load_dotenv()

True
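
Before calling the API, it can help to confirm that the key was actually loaded into the environment. The following is a minimal sketch: `require_env` is a hypothetical helper (it is not part of `python-dotenv` or the OpenAI SDK), and it only assumes that `load_dotenv()` has already run.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Hypothetical error text; adjust to your own setup.
        raise RuntimeError(f"{name} is not set; add it to the .env file and re-run load_dotenv()")
    return value


# Example usage (run after load_dotenv()):
# api_key = require_env("OPENAI_API_KEY")
```

Failing early here avoids a less obvious authentication error later, when the first request is sent.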

The models used in this small experiment are referenced in the official OpenAI documentation, https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4. They are part of the "4.0" family and are considered the "flagship" models.

In [2]:
models = ["gpt-4-turbo", "gpt-4-turbo-2024-04-09"]

This is the set of instructions: a simple RAG application with some follow-up capabilities.

In [3]:
USER_FIRST_MESSAGE = '''You will be asked a series of questions or issues which can be found between 
the <QUESTION> XML Tags.
A collection of documents that must be used to provide an answer to 'QUESTION' can be found 
between the <SOURCES> XML Tags.

Use your experience as an AI Assistant, troubleshoot and respond with an answer to 
each 'QUESTION' following these guidelines:

- Limit your knowledge to 'SOURCES' and determine if its content can provide a full and honest answer to 'QUESTION', 
  if it does, then respond with a honest answer. 
- Limit your answer to the content of the 'SOURCES'.  
- Don't include more information or your chain of thought in the answer.
- If you are unsure about the answer, or 'QUESTION' is not clear, or it is an  open question, then feel free 
  to respond with a follow-up question for the student to get a better understanding of the 'QUESTION' and 
  keep the troubleshooting.
- If the content in the 'SOURCES' cannot provide a complete and honest answer to 'QUESTION' 
  then respond with "IDK".'''

These are the sources that should be used to provide an answer (*a mix of real and made-up documents*).

In [4]:
SOURCES='''
DOC1234: In python a lambda function can be created following this structure ```lambda x: x**2```.
DOC1235: Collection from the numbutils can be really handy to get the value counts of items in a list in python.
DOC1236: You can elevate a number to an `x` power by using the `**``character in python, 
for example ```x ** 2** is equal to `x to the power of 2`.
'''
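
Every user turn in the use cases that follow wraps the question and the sources in the same tag template. As a rough sketch, that template could be kept in one place with a small helper. `build_user_turn` is a hypothetical name introduced here for illustration; it is not used in the cells of this notebook.

```python
def build_user_turn(question: str, sources: str) -> str:
    """Wrap a question and the source documents in the tags the instructions expect."""
    return (
        f"<QUESTION>{question}</QUESTION>\n"
        f"<SOURCES>\n{sources}\n</SOURCES>\n"
        "Output:"
    )


# Example usage with a placeholder source document.
print(build_user_turn("How can I do this?", "DOC0001: example document text."))
```

Keeping the template in a single function makes it harder for the different turns to drift apart in their formatting.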

___
## First Use Case:
The question entered by the student is not clear; therefore, the expected answer from the chatbot is a follow-up question for the student.

In [5]:
from openai import OpenAI
client = OpenAI()
for r in range(3):
  print(f"### Round: {r} ###")
  for m in models:
    response = client.chat.completions.create(
      model = m,
      messages = [
        {
          "role": "system",
          "content": "You are a friendly AI Assistant helping to students who have a basic knowledge \
            about programming languages"
        },
        {
          "role": "user",
          "content": USER_FIRST_MESSAGE
        },
        {
          "role": "assistant",
          "content": "OK. Instructions Are Clear."
        },
        {
          "role": "user",
          "content": f"<QUESTION>How can I create a lambda function in Python?</QUESTION>\n<SOURCES>\n{SOURCES}\n</SOURCES>\nOutput:"
        },
        {
          "role": "assistant",
          "content": "To create a lambda function in Python, you can use the following structure: `lambda x: x**2.`"
        },
        {
          "role": "user",
          "content": f"<QUESTION>How can I do this?</QUESTION>\n<SOURCES>\n{SOURCES}\n</SOURCES>\nOutput:"
        }
      ],
      temperature=0.1,
      max_tokens=50,
      top_p=0.1,
    )
    print(f"Model[{m}]:")
    print(response.choices[0].message.content)

### Round: 0 ###
Model[gpt-4-turbo]:
Could you please clarify what specific task you are trying to accomplish?
Model[gpt-4-turbo-2024-04-09]:
Could you please clarify what you are trying to do?
### Round: 1 ###
Model[gpt-4-turbo]:
Could you please clarify what you are trying to do?
Model[gpt-4-turbo-2024-04-09]:
Could you please clarify what specific task you are trying to accomplish?
### Round: 2 ###
Model[gpt-4-turbo]:
Could you please clarify what you are trying to do?
Model[gpt-4-turbo-2024-04-09]:
Can you please clarify what you are trying to do?


According to the official OpenAI documentation (https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4), `gpt-4-turbo` and `gpt-4-turbo-2024-04-09` point to the same endpoint:
> gpt-4-turbo <br>
> "GPT-4 Turbo with Vision
> The latest GPT-4 Turbo model with vision capabilities. Vision requests can now use JSON mode and function calling. Currently points to gpt-4-turbo-2024-04-09."

> gpt-4-turbo-2024-04-09<br>
> "GPT-4 Turbo with Vision model. Vision requests can now use JSON mode and function calling. gpt-4-turbo currently points to this version."

However, in this small test the two models gave nearly identical answers in only 1 round out of 3 (Round 2). In the other 2 rounds (Round 0 and Round 1) the wording of their answers differed. Based on the documentation, one would assume the models point to the same endpoint and, hence, that the answers should be the same or very similar.
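
To make "similar" less subjective, the answers of each round can be scored with a string-similarity measure. The sketch below uses `difflib.SequenceMatcher` from the standard library. This is only one possible choice, it was not part of the original experiment, and the `similarity` helper is a hypothetical name.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


# Answers copied from the rounds above: (gpt-4-turbo, gpt-4-turbo-2024-04-09).
rounds = [
    ("Could you please clarify what specific task you are trying to accomplish?",
     "Could you please clarify what you are trying to do?"),
    ("Could you please clarify what you are trying to do?",
     "Could you please clarify what specific task you are trying to accomplish?"),
    ("Could you please clarify what you are trying to do?",
     "Can you please clarify what you are trying to do?"),
]

scores = [similarity(a, b) for a, b in rounds]
for r, score in enumerate(scores):
    print(f"Round {r}: similarity = {score:.2f}")
```

A character-level ratio only measures wording, not meaning. All six answers are follow-up questions, so a semantic comparison would likely rate them closer than this score suggests.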
___

## Second Use Case:
Providing an answer to the question entered by the student based on the sources only. (RAG Approach)

In [6]:
from openai import OpenAI
client = OpenAI()
for r in range(3):
  print(f"### Round: {r} ###")
  for m in models:
    response = client.chat.completions.create(
      model = m,
      messages = [
        {
          "role": "system",
          "content": "You are a friendly AI Assistant helping to students who have a basic knowledge \
            about programming languages"
        },
        {
          "role": "user",
          "content": USER_FIRST_MESSAGE
        },
        {
          "role": "assistant",
          "content": "OK. Instructions Are Clear."
        },
        {
          "role": "user",
          "content": f"<QUESTION>How can I create a lambda function in Python?</QUESTION>\n<SOURCES>\n{SOURCES}\n</SOURCES>\nOutput:"
        },
        {
          "role": "assistant",
          "content": "To create a lambda function in Python, you can use the following structure: `lambda x: x**2.`"
        },
        {
          "role": "user",
          "content": f"<QUESTION>How can I get the value counts of items in a list in python?</QUESTION>\n<SOURCES>\n{SOURCES}\n</SOURCES>\nOutput:"
        }
      ],
      temperature=0.1,
      max_tokens=50,
      top_p=0.1,
    )
    print(f"Model[{m}]:")
    print(response.choices[0].message.content)

### Round: 0 ###
Model[gpt-4-turbo]:
IDK
Model[gpt-4-turbo-2024-04-09]:
Using the collection from the numbutils can be really handy to get the value counts of items in a list in Python.
### Round: 1 ###
Model[gpt-4-turbo]:
You can use the collection from the numbutils to get the value counts of items in a list in Python.
Model[gpt-4-turbo-2024-04-09]:
Using the collection from the numbutils can be really handy to get the value counts of items in a list in Python.
### Round: 2 ###
Model[gpt-4-turbo]:
You can use the collection from the numbutils to get the value counts of items in a list in Python.
Model[gpt-4-turbo-2024-04-09]:
IDK.


A behaviour similar to the first use case was observed: in just 1 round out of 3 (Round 1) the answers provided by both models were *similar*. In the other 2 rounds, one of the models (`gpt-4-turbo` in Round 0, `gpt-4-turbo-2024-04-09` in Round 2) replied with "IDK", the "I don't know" message. Again, based on the documentation, one would assume both models should provide similar answers if they point to the same endpoint.
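
The outcomes above can also be tallied per model instead of being compared by eye. The following sketch buckets each answer as "IDK", a follow-up question, or a regular answer. The `classify` helper and its rules (a trailing `?` means follow-up) are assumptions made for illustration, not something the models or the API provide.

```python
from collections import Counter


def classify(answer: str) -> str:
    """Bucket an answer as 'idk', 'follow_up' (a clarifying question) or 'answer'."""
    text = answer.strip()
    if text.rstrip(".!").upper() == "IDK":
        return "idk"
    if text.endswith("?"):
        return "follow_up"
    return "answer"


# Second use case outputs per model, copied from the rounds above.
outputs = {
    "gpt-4-turbo": [
        "IDK",
        "You can use the collection from the numbutils to get the value counts of items in a list in Python.",
        "You can use the collection from the numbutils to get the value counts of items in a list in Python.",
    ],
    "gpt-4-turbo-2024-04-09": [
        "Using the collection from the numbutils can be really handy to get the value counts of items in a list in Python.",
        "Using the collection from the numbutils can be really handy to get the value counts of items in a list in Python.",
        "IDK.",
    ],
}

tally = {model: Counter(classify(a) for a in answers) for model, answers in outputs.items()}
for model, counts in tally.items():
    print(model, dict(counts))
```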
___

## Closing Remarks

- I know it is not possible to draw conclusions from these simple experiments. However, my recommendation is to keep these differences in the answers and behaviour of the models in mind when designing your RAG applications.

    - It may be a good idea to test with a small number of questions and compare the results provided by different models. (This can be like a "Stage 0" in the prompt engineering phase.)

- Note: I included the instructions in the first user message because I wanted to reduce the number of tokens used in each turn. This way, each later user turn carries only the question and the sources, not the instructions. *I know this is not the common RAG approach*.
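
As a rough sketch of the "Stage 0" check suggested above, the loop used in both use cases can be generalised to several questions. The `compare_models` function and the `ask` callable are hypothetical names. In practice `ask(model, question)` would wrap the `client.chat.completions.create` call shown earlier and return the text of the answer, but it is stubbed here so the example runs without an API key.

```python
from typing import Callable, Dict, List


def compare_models(
    ask: Callable[[str, str], str],
    models: List[str],
    questions: List[str],
    rounds: int = 3,
) -> Dict[str, float]:
    """For each question, return the fraction of rounds in which all models agreed."""
    agreement = {}
    for question in questions:
        matches = 0
        for _ in range(rounds):
            # Normalise lightly so that "IDK" and "IDK." count as the same answer.
            answers = {ask(m, question).strip().rstrip(".").lower() for m in models}
            matches += len(answers) == 1
        agreement[question] = matches / rounds
    return agreement


# Stub standing in for a real API call (assumption: returns the answer text).
def fake_ask(model: str, question: str) -> str:
    return "IDK." if model == "gpt-4-turbo" else "IDK"


print(compare_models(fake_ask, ["gpt-4-turbo", "gpt-4-turbo-2024-04-09"], ["How can I do this?"]))
```

Exact-match agreement is deliberately strict. A looser criterion, such as a similarity threshold, may fit better when the models only rephrase the same answer.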