# 🤖 Comparison of Outputs Between Different Generative AI Models

## 🧠 Introduction

Generative AI models like GPT-4, Claude, Gemini, and others have demonstrated impressive capabilities in generating human-like text across various domains. As their use spreads, evaluating and comparing the quality of their outputs in a structured manner becomes essential.

This notebook provides a framework for comparing outputs from multiple Generative AI models on the same set of prompts. It is designed to help users understand how different models perform across tasks like summarization, question answering, translation, and creative writing.

## ❓ Problem Statement

With the proliferation of Generative AI models, users and organizations often face the challenge of selecting the right model for their specific needs. However, this process is complicated by several factors:

- ⚖️ **Lack of standardized evaluation**: There is no uniform method to benchmark model outputs on the same tasks.
- 🤔 **Subjectivity in response quality**: Human judgments about quality can vary, making comparisons inconsistent.
- 📊 **Need for data-driven insights**: A systematic approach is required to assess and compare model performance using both qualitative and quantitative analysis.

This project addresses these challenges by:

- Running a fixed set of prompts across various Generative AI models.
- Structuring and displaying their responses side-by-side.
- Enabling both visual and manual inspection to highlight differences in tone, accuracy, fluency, and relevance.
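
The approach listed above can be sketched as a small, model-agnostic harness. This is only an illustration: `compare_models`, `to_markdown_table`, and the `stub_models` dictionary are hypothetical names that do not appear elsewhere in this notebook, and the stub lambdas stand in for real API calls so the sketch runs offline.

```python
from typing import Callable, Dict, List


def compare_models(
    models: Dict[str, Callable[[str], str]],
    prompts: List[str],
) -> List[Dict[str, str]]:
    """Run every prompt through every model and collect one row per prompt."""
    rows = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for name, generate in models.items():
            row[name] = generate(prompt)
        rows.append(row)
    return rows


def to_markdown_table(rows: List[Dict[str, str]]) -> str:
    """Render the rows as a Markdown table so responses sit side by side."""
    if not rows:
        return ""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        # Escape pipes and flatten newlines so each response stays in one cell.
        cells = [row[h].replace("|", "\\|").replace("\n", "<br>") for h in headers]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Hypothetical stand-ins for real API calls (assumption, not part of the notebook).
stub_models = {
    "model_a": lambda p: f"A says: {p.upper()}",
    "model_b": lambda p: f"B says: {p.lower()}",
}

table = to_markdown_table(compare_models(stub_models, ["Tell a joke", "Say hi"]))
print(table)
```

In a notebook, the resulting string could be passed to `display(Markdown(table))` to render the comparison inline.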


In [1]:
import os
from dotenv import load_dotenv
from anthropic import Anthropic
from openai import OpenAI
from IPython.display import Markdown, display, update_display

In [2]:
# Load variables from a local .env file, overriding any already set in the environment
load_dotenv(override=True)
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
if anthropic_api_key:
    print(f"Anthropic API Key exists and begins {anthropic_api_key[:7]}")
else:
    print("Anthropic API Key not set")

Anthropic API Key exists and begins sk-ant-


In [3]:
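# Ollama serves an OpenAI-compatible API locally; the api_key is required by the client but ignored by Ollama
openai = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')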

In [4]:
# With no arguments, the client reads ANTHROPIC_API_KEY from the environment
claude = Anthropic()

### Asking LLMs to tell a joke

In [5]:
system_message = "You are an assistant that is great at telling jokes"
user_prompt = "Tell a light-hearted joke for an audience of Data Scientists"

In [6]:
prompts = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_prompt}
  ]

In [7]:
llama = openai.chat.completions.create(
    model="llama3.2", 
    messages=prompts
)

print(llama.choices[0].message.content)

Here's one:

Why did the data scientist break up with his girlfriend?

(pause for dramatic effect)

Because she was always trying to fit him into a linear regression model, and he just couldn't predict their future together! (ba-dum-tss!)

I hope that one wasn't too " statistically" inaccurate!

(Sorry, I couldn't resist adding a few data-filled puns!)


In [8]:
# Anthropic takes the system prompt as a separate argument rather than as a message
sonnet = claude.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=20000,
    system=system_message,
    messages=[{"role": "user", "content": user_prompt}]
)

In [9]:
print(sonnet.content[0].text)

Why don't data scientists like to play hide and seek?

Because good luck trying to hide when they've already predicted where you'll be with 99.7% confidence!
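
The two calls above differ in where the system prompt goes: the OpenAI-style client receives it inside `messages`, while the Anthropic client receives it through the separate `system` argument. A minimal sketch of a converter, assuming the messages contain only plain-string content, might look like this. `to_anthropic_kwargs` is a hypothetical helper, not part of either SDK.

```python
from typing import Dict, List


def to_anthropic_kwargs(messages: List[Dict[str, str]]) -> Dict[str, object]:
    """Split OpenAI-style messages into Anthropic's `system` and `messages` arguments."""
    # Collect every system message; Anthropic expects them outside the message list.
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    kwargs: Dict[str, object] = {"messages": chat}
    if system_parts:
        kwargs["system"] = "\n\n".join(system_parts)
    return kwargs


example = [
    {"role": "system", "content": "You are an assistant that is great at telling jokes"},
    {"role": "user", "content": "Tell a light-hearted joke for an audience of Data Scientists"},
]
kwargs = to_anthropic_kwargs(example)
print(kwargs["system"])
print(kwargs["messages"])
```

The result could then be unpacked into the call, for example `claude.messages.create(model=..., max_tokens=..., **to_anthropic_kwargs(prompts))`, so the same prompt list drives both clients.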


In [10]:
# Repeat both calls with temperature set explicitly to 1
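
As a rough intuition for what the temperature parameter does, the sketch below applies temperature scaling to a softmax over toy logits. It is a conceptual illustration only: the logits are made up, `softmax_with_temperature` is a hypothetical helper, and nothing here claims to reproduce how either provider implements sampling internally.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens (assumption for illustration).
toy_logits = [2.0, 1.0, 0.0]
for t in (0.2, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top candidate, while higher temperatures flatten the distribution, which is why repeated runs tend to vary more.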

In [11]:
llama = openai.chat.completions.create(
    model="llama3.2", 
    messages=prompts,
    temperature = 1
)

print(llama.choices[0].message.content)

Here's one:

Why did the regression analyst break up with his girlfriend?

*pauses for dramatic effect*

Because he was trying to optimize their relationship, but it was just not adding value!

(gets some chuckles from the data-sciency crowd)

But seriously, folks, who needs emotions when you've got R-squared and RMSE?


In [12]:
sonnet = claude.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=20000,
    system=system_message,
    messages=[{"role": "user", "content": user_prompt}],
    temperature=1
)

In [13]:
print(sonnet.content[0].text)

Why don't data scientists like to go outside during winter?

Because they're afraid of getting caught in a random forest!

*ba-dum-tss* 🥁


In [14]:
sonnet = claude.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=8000,
    system=system_message,
    messages=[{"role": "user", "content": user_prompt}],
    temperature=1
)

In [15]:
print(sonnet.content[0].text)

Here's one for the data scientists:

Why did the data scientist bring a ladder to work?

Because they heard the data was skewed and needed to be normalized at a higher level! 

Alternative joke:

What's a data scientist's favorite kind of music?
Algorithm and blues! 

*ba dum tss* 😄

These are pretty nerdy but should get a chuckle from your data science colleagues!


In [16]:
# Stream the reply and print each text fragment as it arrives
with claude.messages.stream(
    model="claude-3-5-sonnet-latest",
    max_tokens=8000,
    system=system_message,
    messages=[{"role": "user", "content": user_prompt}],
    temperature=1,
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Here's one for the data scientists:

Why did the data scientist bring a ladder to work?

Because they heard the data was skewed and needed to be normalized!

Alternative data science jokes:

Why do data scientists make great partners?
Because they know the importance of a good correlation!

What's a data scientist's favorite snack?
Chocolate chips and cookies... but mostly just the data chips!

What did the data scientist say when they got locked out of their house?
"Time to use my k-nearest neighbors algorithm!"

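
The streaming cell above prints fragments as they arrive. When the full text is also needed afterwards, the fragments have to be accumulated. A minimal sketch of that pattern follows, using a plain list as a stand-in for a real stream. `running_text` and `fake_stream` are hypothetical names introduced only for this example.

```python
from typing import Iterable, Iterator, Optional


def running_text(chunks: Iterable[Optional[str]]) -> Iterator[str]:
    """Yield the accumulated text after each incoming chunk."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk or ""  # tolerate empty or None chunks
        yield buffer


# Made-up chunks standing in for a real text stream (assumption).
fake_stream = ["Why did ", "the data ", None, "scientist...?"]
snapshots = list(running_text(fake_stream))
print(snapshots[-1])
```

Each yielded snapshot could be passed to a display update, which is the same idea as re-rendering the growing reply on every chunk.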
In [17]:
deepseek_api_key = os.getenv('DEEPSEEK_API_KEY')
if deepseek_api_key:
    print(f"DeepSeek API Key exists and begins {deepseek_api_key[:3]}")
else:
    print("DeepSeek API Key not set - please skip to the next section if you don't wish to try the DeepSeek API")

DeepSeek API Key exists and begins sk-
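
The key check is repeated for each provider, so it could be factored into one helper. This is a sketch under assumptions: `describe_key` is a hypothetical function, and the environment variable names below are placeholders set only for the demonstration.

```python
import os


def describe_key(env_var: str, visible_chars: int = 3) -> str:
    """Report whether an API key is set, revealing only its first few characters."""
    key = os.getenv(env_var)
    if not key:
        return f"{env_var} not set"
    return f"{env_var} exists and begins {key[:visible_chars]}"


# Placeholder values for demonstration only; these are not real keys.
os.environ["EXAMPLE_API_KEY"] = "sk-demo-not-a-real-key"
os.environ.pop("MISSING_EXAMPLE_API_KEY", None)

print(describe_key("EXAMPLE_API_KEY"))
print(describe_key("MISSING_EXAMPLE_API_KEY"))
```

Showing only a short prefix keeps the check useful for spotting a wrong key type without exposing the secret in notebook output.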


In [18]:
# DeepSeek's API is OpenAI-compatible, so the same client class is reused with a different base URL
deepseek = OpenAI(base_url="https://api.deepseek.com", api_key=deepseek_api_key)

In [21]:
res = deepseek.chat.completions.create(
    model = "deepseek-chat",
    messages = prompts,
)

print(res.choices[0].message.content)

Sure! Here's a light-hearted joke for data scientists:

**Why did the data scientist bring a ladder to the bar?**  

Because they heard the drinks were *high-dimensional*!  

(And they wanted to reduce the dimensions before overfitting their liver!)  

Hope that gives you a chuckle! 😄


In [22]:
challenge = [{"role": "system", "content": "You are a helpful assistant"},
             {"role": "user", "content": "How many words are there in your answer to this prompt"}]

In [30]:
res = deepseek.chat.completions.create(
    model = "deepseek-chat",
    messages = challenge,
    stream = True
)

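reply = ""
display_handle = display(Markdown(""), display_id=True)
for chunk in res:
    # Each chunk carries a text delta; content can be None, so fall back to ''
    reply += chunk.choices[0].delta.content or ''
    # Remove code-fence markers and the literal word "markdown" before rendering
    reply = reply.replace("```", "").replace("markdown", "")
    update_display(Markdown(reply), display_id=display_handle.display_id)

# split(" ") breaks on single spaces only, so words separated by newlines count as one
print("Number of words:", len(reply.split(" ")))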

Alright, let's tackle this interesting question: "How many words are there in your answer to this prompt." At first glance, it seems straightforward, but when I think deeper, it's a bit of a paradox or a self-referential problem. Here's how I'm going to approach it:

### Understanding the Question

The question is asking for the word count of the very answer that I'm about to provide. This creates a loop because the length of my answer depends on the number it's trying to report. 

For example, if I say, "This answer contains 5 words," but that sentence itself is 5 words, then it's accurate. But if I say, "This answer contains 10 words," and that statement is only 6 words, then it's incorrect. 

This seems similar to the "This statement is false" paradox, where the statement can't consistently be labeled as true or false without causing a contradiction.

### Attempting a Simple Answer

Let me try to construct a simple answer and see if it works.

**Attempt 1:**
"This answer contains 5 words."

Now, let's count the words in that sentence:
1. This
2. answer
3. contains
4. 5
5. words.

Yes, that's 5 words. So, this seems correct.

But is this the only possible correct answer? Let me try another number.

**Attempt 2:**
"This answer contains 7 words."

Counting:
1. This
2. answer
3. contains
4. 7
5. words.

That's 5 words, not 7. So, this is incorrect.

**Attempt 3:**
"The number of words in this answer is five."

Counting:
1. The
2. number
3. of
4. words
5. in
6. this
7. answer
8. is
9. five.

That's 9 words, but I claimed it's five, which is wrong.

From these attempts, it seems that only specific phrasings where the declared word count matches the actual word count will be correct.

### Exploring Possible Correct Answers

Is "This answer contains 5 words." the only correct answer? Let me see if there are others.

**Alternative Phrasing:**
"Here are five words in this answer."

Counting:
1. Here
2. are
3. five
4. words
5. in
6. this
7. answer.

That's 7 words, but it says five, so incorrect.

**Another Try:**
"Five words are here now."

Counting:
1. Five
2. words
3. are
4. here
5. now.

That's 5 words, and it says five, so correct.

So, "Five words are here now." is also a correct answer with 5 words.

Are there other numbers that could work? Let's try with 4 words.

**Attempt for 4 words:**
"This has four words."

Counting:
1. This
2. has
3. four
4. words.

That's 4 words, correct.

Similarly, "Four words are here." is also 4 words.

What about 3 words?

"Three words here."

Counting:
1. Three
2. words
3. here.

Yes, 3 words.

"Here's three words." is also 3 words.

2 words?

"Two words."

Counting:
1. Two
2. words.

Yes, 2 words.

"Only two." is also 2 words.

1 word?

"One."

Counting: "One" is 1 word.

"Word." is also 1 word if we're counting that as the answer.

0 words? That would be an empty answer, which doesn't make sense in this context because the answer must convey the count.

So, possible correct answers are those where the number declared matches the actual word count of the declaration. This seems to work for any positive integer, but the phrasing becomes more constrained as the number decreases.

### Verifying with Longer Answers

Can we have a correct answer with more words? Let's try 10 words.

**Attempt for 10 words:**
"This particular answer contains exactly ten words in total here now."

Counting:
1. This
2. particular
3. answer
4. contains
5. exactly
6. ten
7. words
8. in
9. total
10. here
11. now.

Oops, that's 11 words. Let me adjust.

"This answer right here contains exactly ten words in total now."

Counting:
1. This
2. answer
3. right
4. here
5. contains
6. exactly
7. ten
8. words
9. in
10. total
11. now.

Still 11. Maybe:

"The total number of words in this answer is exactly ten words."

Counting:
1. The
2. total
3. number
4. of
5. words
6. in
7. this
8. answer
9. is
10. exactly
11. ten
12. words.

12 words. It's tricky to phrase a 10-word answer that correctly states it has 10 words without going over or under.

Perhaps: "This answer contains a total of ten words within this sentence here."

Counting:
1. This
2. answer
3. contains
4. a
5. total
6. of
7. ten
8. words
9. within
10. this
11. sentence
12. here.

Still 12. It seems harder to construct longer correct answers because the statement about the word count itself takes up words, making it easy to overshoot.

### Conclusion on Possible Correct Answers

From these trials, it appears that shorter answers where the word count declaration is concise are easier to construct correctly. Longer answers require more words to describe the count, often leading to a mismatch between the declared and actual word counts.

Therefore, the correct answers are those self-referential sentences where the number of words stated equals the actual number of words in the sentence. Examples include:

- "One." (1 word)
- "Two words." (2 words)
- "This has three words." (3 words)
- "Four words are here now." (4 words)
- "This answer contains five words." (5 words)

And so on, with the feasibility decreasing as the number increases due to the increasing word count needed to state the number.

### Final Answer

After exploring various possibilities, the correct answers are those where the statement about the word count accurately matches the actual number of words in that statement. Here's one such correct answer:

"This answer contains five words."

Indeed, counting the words:
1. This
2. answer
3. contains
4. five
5. words.

So, the number of words in this answer is **five**.

Number of words: 800
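
The count printed above comes from splitting on single spaces, which treats words separated only by newlines as one token and counts the empty strings between consecutive spaces. A minimal sketch contrasting that with whitespace-based splitting is shown below. `count_words` and `count_words_naive` are hypothetical helpers, and what counts as a "word" (for example, list markers such as `1.`) remains a judgment call in both.

```python
def count_words(text: str) -> int:
    """Count tokens separated by any run of whitespace (spaces, tabs, newlines)."""
    return len(text.split())


def count_words_naive(text: str) -> int:
    """Count tokens the way the cell above does: split on single spaces only."""
    return len(text.split(" "))


# A short made-up sample with newlines and a double space (assumption).
sample = "Counting:\n1. This\n2. answer\n\nYes,  that's 5 words."
print(count_words(sample), count_words_naive(sample))
```

On multi-line Markdown replies like the one above, the two methods can disagree noticeably, so the reported figure is best read as an approximation.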


In [32]:
response = deepseek.chat.completions.create(
    model="deepseek-reasoner",
    messages=challenge
)

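# deepseek-reasoner returns its reasoning trace separately from the final answer
reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print(reasoning_content)
print(content)
# Only the final answer is counted here, not the reasoning trace
print("Number of words:", len(content.split(" ")))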

Okay, the user is asking how many words are in my response to their prompt. Let me start by understanding exactly what they need. They want the word count of the answer I'm about to give.

First, I need to figure out how to approach this. I should provide the answer, then count the words in that answer. But wait, the answer includes both the explanation and the word count itself. That could complicate things because the word count part might affect the total.

Let me structure my response. I'll start by addressing the question, explain the process, give the count, and then maybe confirm it. But I have to make sure that when I count the words, I include every single word in my response, including any meta-commentary about the counting.

Alternatively, maybe I should write the answer first, then count the words, and then present the number. But the user is asking for the word count of the answer to this prompt, which includes everything I write here. So I need to ensure that my entire re