# Evaluating Vision Models
## **GPT-4V** vs **Gemini Pro** vs **Claude 3**

In [None]:
from openai import OpenAI
from anthropic import Anthropic
import base64
import dotenv
import pandas as pd

In [None]:
dotenv.load_dotenv()

## Setting up GPT-4

In [22]:
from timeit import default_timer as timer

oai_client = OpenAI()

def encode_image(image_path: str) -> str:
    """
    Encodes an image to a base64 string.
    """
    with open(image_path, "rb") as img_file:
        return base64.b64encode(img_file.read()).decode("utf-8")

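def call_gpt4_vision(image_path, question):
    """
    Sends an image and a question to GPT-4 Vision.
    Returns the answer text, the latency in seconds, and the total tokens used.
    """
    base64_image = encode_image(image_path)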
    start = timer()
    response = oai_client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    },
                ],
            }
        ],
    )
    latency = timer() - start
    tokens = response.usage.total_tokens
    content = response.choices[0].message.content
    return content, latency, tokens
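
`call_gpt4_vision` labels every image as `image/jpeg`, although several of the test images are PNG files. Below is a minimal sketch of a helper that derives the media type from the file extension instead. The name `build_image_data_url` is hypothetical and not part of the original notebook; it assumes the file extension reflects the real image format.

```python
import base64
import mimetypes


def build_image_data_url(image_path: str) -> str:
    """Return a base64 data URL whose media type matches the file extension."""
    # guess_type returns (type, encoding); type is None for unknown extensions.
    media_type, _ = mimetypes.guess_type(image_path)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image media type for {image_path!r}")
    with open(image_path, "rb") as img_file:
        encoded = base64.b64encode(img_file.read()).decode("utf-8")
    return f"data:{media_type};base64,{encoded}"
```

The returned string could be passed as the `url` value of the `image_url` content part.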

In [23]:
img_path = "data/financial_projections.png"
question = "What is our projected revenue for 2018, and are we profitable?"

answer = call_gpt4_vision(img_path, question)[0] 
print(answer)

The projected revenue for 2018, according to the provided image, is $4,800,000. The image indicates that the break-even point is reached in the third year, which would be 2017. Given this information, we can infer that the company is projected to be profitable in 2018, as it is expected to surpass the break-even point by achieving the $4,800,000 revenue mark.


## Setting up Google Gemini

In [34]:
from vertexai.preview.generative_models import GenerativeModel, Part, Image
import vertexai

vertexai.init(project='your-project-name', location='europe-west1')
gemini_vision = GenerativeModel("gemini-1.0-pro-vision")

def call_gemini_vision(image_path, question):
    image_part = Part.from_image(Image.load_from_file(image_path))
    question_part = Part.from_text(question)
    start = timer()
    response = gemini_vision.generate_content([question_part, image_part])
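    latency = timer() - start
    try:
        content = response.candidates[0].content.parts[0].text
        tokens = response.to_dict()['usage_metadata']['total_token_count']
    except Exception as e:
        # Return the error in the same (content, latency, tokens) shape so
        # callers that unpack or index the result keep working.
        return f"Error: {e}", latency, 0
    return content, latency, tokens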
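
The evaluation loop later in the notebook unpacks every result as `(response, latency, tokens)`, so a single failed request (for instance a network error) would abort the whole run. One possible guard, sketched here under the assumption that a failed question should be recorded rather than stop the evaluation, is a small wrapper around any of the `call_*` functions. `safe_vision_call` is a hypothetical name, not part of the original notebook.

```python
from timeit import default_timer as timer
from typing import Callable, Tuple

VisionResult = Tuple[str, float, int]


def safe_vision_call(call: Callable[[str, str], VisionResult],
                     image_path: str, question: str) -> VisionResult:
    """Run a vision call and always return (content, latency, tokens)."""
    start = timer()
    try:
        return call(image_path, question)
    except Exception as exc:
        # Record the failure as the answer text; report zero tokens
        # because no usage information is available.
        return f"ERROR: {exc}", timer() - start, 0
```

For example, `safe_vision_call(call_gemini_vision, img_path, question)` would return a triple whether or not the request succeeds.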


In [35]:
question = "What concept is explained in this diagram?"
img_path = "data/lda_model.png"

answer = call_gemini_vision(img_path, question)[0] 
print(answer)

 Latent Dirichlet allocation (LDA)


## Setting up Claude 3 

In [44]:
from anthropic import Anthropic
from time import sleep
anthropic_client = Anthropic()


def call_claude3_vision(image_path, question, model="claude-3-opus-20240229"):
    base64_image = encode_image(image_path)
    image_type = image_path.split('.')[-1].replace('jpg', 'jpeg')
    start = timer()
    response = anthropic_client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": f"image/{image_type}",
                            "data": base64_image
                        },
                    },
                    {
                        "type": "text",
                        "text": question
                    }
                ],
            }
        ],
    )
    latency = timer() - start 
    message = response.content[0].text
    tokens = response.usage.input_tokens + response.usage.output_tokens
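    # Fixed pause after each call, presumably to stay under the API rate limit.
    # The pause happens after the latency measurement, so it does not affect it.
    sleep(8)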
    return message, latency, tokens
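
The `sleep(8)` in `call_claude3_vision` pauses after every call, whether or not the API is actually throttling. A common alternative is to retry only when a call fails, waiting longer after each failure. The sketch below is a generic illustration of that idea; `retry_with_backoff` and its parameters are hypothetical, and in practice the `except` clause would likely be narrowed to the SDK's rate-limit exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(fn: Callable[[], T], max_attempts: int = 4,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(), retrying on failure with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Give up after the last attempt and surface the original error.
            if attempt == max_attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

A call could then be wrapped as `retry_with_backoff(lambda: call_claude3_vision(img_path, question))`.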

In [30]:
question = "What concept is explained in this diagram?"
img_path = "data/lda_model.png"

# Other Claude 3 models: "claude-3-haiku-20240307", "claude-3-sonnet-20240229"

answer = call_claude3_vision(img_path, question, model="claude-3-haiku-20240307")[0] 
print(answer)

This diagram appears to be explaining the process of topic modeling, which is a technique used to discover the abstract "topics" that occur in a collection of text documents. The diagram shows the main steps involved, including the collection of text documents, the calculation of a document-topic distribution, the assignment of words to topics, and the final creation and analysis of the topic frequencies across the documents.


## Evaluating the results with RAGAS

In [42]:
from ragas.metrics import answer_correctness
from ragas import evaluate
from datasets import Dataset


class ImageQA:
    def __init__(self):
        self.data = {
            "images": [
                "data/financial_projections.png",
                "data/soda_nutrition.jpeg",
                "data/actors_graph.png",
                "data/lex_yann.jpg",
                "data/lda_model.png",
                "data/amazon_items.png",
                "data/license_plate.jpeg",
                "data/street_night.jpg",
                "data/contrastive_learning_2.png"
            ],
            "questions": [
                "What is our projected revenue for 2018, and are we profitable?",
                "Give me the complete nutritional facts, grams and % daily value",
                "Which two actors had the most movies in common?",
                "Who are the two famous people in this image?",
                "What concept is explained in this diagram?",
                "What is the cheapest item, and the most expensive item in this picture?",
                "What's the license of the main car in the image?",
                "What does the text say in the longest billboard in the image?",
                "What concept is explained in this diagram?"
            ],
            "ground_truths": [
                "The projected revenue for 2018 is $4,800,000. The image includes a break-even point at the 3rd year, indicating that the company is projected to be profitable by 2018.",
                "Calories 25\nTotal Fat 0g 0%\nSodium 0mg 0%\nTotal Carbohydrate 7g 3%\nDietary Fiber 2g 7%\nTotal Sugars 5g 8%\nProtein 0g 0%",
                "The two actors who had the most movies in common were Keanu Reeves and Laurence Fishburne. They both starred in four movies together: The Matrix, The Matrix Reloaded and The Matrix Revolutions",
                "Lex Fridman, a podcaster, and Yann Lecun, an AI scientist",
                "Latent Dirichlet Allocation (LDA)",
                "The cheapest item in the picture is the 'Renova Rollo de Cocina, 3 Unidades (Paquete de 1)' for 3,45 €. The most expensive item in the picture is the 'Proteína Sin Lactosa de HSN | Sin Sabor 500 g' for 22,90 €.",
                "The license is 'CL10760'.",
                "The billboard reads 'Madame Tussauds'.",
                "The diagram describes 'triplet loss', a technique used to train a neural network to group similar images together and separate dissimilar ones. It does this by minimizing the distance between images of the same identity and maximizing the distance between images of different identities."
            ],
        }

        self.metrics = [answer_correctness]

    def image_qa(self, image_path, question, model):
        print(f"Running visual QA for: {image_path}")
        if model == 'gpt4':
            return call_gpt4_vision(image_path, question)
        elif model == 'gemini':
            return call_gemini_vision(image_path, question)
        elif model == 'claude-opus':
            return call_claude3_vision(image_path, question)
        elif model == 'claude-haiku':
            return call_claude3_vision(image_path, question, model="claude-3-haiku-20240307")
        else:
            raise ValueError(f"Invalid model name: {model}")


    def run_evaluation(self, model):
        answers = []
        latencies = []
        token_usage = []

        for index, question in enumerate(self.data['questions']):
            image_path = self.data['images'][index]
            response, latency, tokens = self.image_qa(image_path, question, model)

            answers.append(response)
            latencies.append(latency)
            token_usage.append(tokens)


        ragas_testset = Dataset.from_dict(
            {
                "question": self.data['questions'],
                "ground_truth": self.data['ground_truths'],
                "answer": answers,
                "latencies": latencies,
                "token_usage": token_usage,
            }
        )

        eval_results = evaluate(ragas_testset, metrics=self.metrics)
        df = eval_results.to_pandas()
        df.to_csv(f"eval_results/{model}_evals.csv", index=False)
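
To help interpret the `answer_correctness` scores, here is a rough sketch of how such a score can be composed. It assumes, based on my reading of the Ragas documentation, that the metric blends a factual F1 score (from statements classified as true positive, false positive, or false negative against the ground truth) with a semantic similarity score, using default weights of 0.75 and 0.25. Treat both the formula and the weights as assumptions to verify against the installed Ragas version. The function names are hypothetical and are not Ragas APIs.

```python
def f1_from_statements(tp: int, fp: int, fn: int) -> float:
    """F1 over answer statements: tp / (tp + 0.5 * (fp + fn))."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom > 0 else 0.0


def combined_correctness(tp: int, fp: int, fn: int,
                         semantic_similarity: float,
                         weights: tuple = (0.75, 0.25)) -> float:
    """Weighted average of factual F1 and semantic similarity.

    The default weights are an assumption, not a confirmed library default.
    """
    w_fact, w_sem = weights
    f1 = f1_from_statements(tp, fp, fn)
    return (w_fact * f1 + w_sem * semantic_similarity) / (w_fact + w_sem)
```

Under this scheme, an answer that is semantically close to the ground truth but adds unsupported statements would still lose points through the F1 term.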


In [45]:
image_qa = ImageQA()

image_qa.run_evaluation(model="gpt4")
image_qa.run_evaluation(model="gemini")
image_qa.run_evaluation(model="claude-opus")
image_qa.run_evaluation(model="claude-haiku")

Running visual QA for: data/financial_projections.png
Running visual QA for: data/soda_nutrition.jpeg
Running visual QA for: data/actors_graph.png
Running visual QA for: data/lex_yann.jpg
Running visual QA for: data/lda_model.png
Running visual QA for: data/amazon_items.png
Running visual QA for: data/license_plate.jpeg
Running visual QA for: data/street_night.jpg
Running visual QA for: data/contrastive_learning_2.png


Evaluating:   0%|          | 0/9 [00:00<?, ?it/s]

Task exception was never retrieved
future: <Task finished name='Task-22' coro=<AsyncClient.aclose() done, defined at c:\Users\JohannesJolkkonen\Documents\Code\gettingdata-samples\visual-model-comparison\.venv\Lib\site-packages\httpx\_client.py:1996> exception=RuntimeError('Event loop is closed')>
Traceback (most recent call last):
  File "c:\Users\JohannesJolkkonen\Documents\Code\gettingdata-samples\visual-model-comparison\.venv\Lib\site-packages\httpx\_client.py", line 2003, in aclose
    await self._transport.aclose()
  File "c:\Users\JohannesJolkkonen\Documents\Code\gettingdata-samples\visual-model-comparison\.venv\Lib\site-packages\httpx\_transports\default.py", line 383, in aclose
    await self._pool.aclose()
  File "c:\Users\JohannesJolkkonen\Documents\Code\gettingdata-samples\visual-model-comparison\.venv\Lib\site-packages\httpcore\_async\connection_pool.py", line 313, in aclose
    await self._close_connections(closing_connections)
  File "c:\Users\JohannesJolkkonen\Documents\

In [47]:
df = pd.read_csv("eval_results/gemini_evals.csv")
df

| Unnamed: 0 | question | ground_truth | answer | latencies | token_usage | answer_correctness |
|---|---|---|---|---|---|---|
| 0 | What is our projected revenue for 2018, and ar... | The projected revenue for 2018 is $4,800,000. ... | The projected revenue for 2018 is $4,800,000.... | 2.932736 | 317 | 0.613849 |
| 1 | Give me the complete nutritional facts, grams ... | Calories 25\nTotal Fat 0g 0%\nSodium 0mg 0%\nT... | **Nutrition Facts**\n\nServing Size: 1 Can (1... | 3.911964 | 393 | 0.91971 |
| 2 | Which two actors had the most movies in common? | The two actors who had the most movies in comm... | The two actors who had the most movies in com... | 3.0918 | 313 | 0.744702 |
| 3 | Who are the two famous people in this image? | Lex Fridman, a podcaster, and Yann Lecun, an A... | The two famous people in the image are:\n\n- ... | 2.854356 | 298 | 0.518296 |
| 4 | What concept is explained in this diagram? | Latent Dirichlet Allocation (LDA) | Latent Dirichlet Allocation (LDA) | 2.866267 | 272 | 0.988251 |
| 5 | What is the cheapest item, and the most expens... | The cheapest item in the picture is the 'Renov... | The cheapest item is the "Renova Rollo de Coc... | 3.878212 | 336 | 0.993904 |
| 6 | What's the license of the main car in the image? | The license is 'CL10760'. | The license of the main car in the image is C... | 3.328562 | 288 | 0.717002 |
| 7 | What does the text say in the longest billboar... | The billboard reads 'Madame Tussauds'. | Madame Tussauds | 2.775198 | 276 | 0.974335 |
| 8 | What concept is explained in this diagram? | The diagram describes 'triplet loss', a techni... | The diagram explains the concept of contrasti... | 5.207717 | 582 | 0.688275 |
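
To compare models at a glance, each per-model CSV written by `run_evaluation` can be reduced to a few averages. The sketch below uses only the standard library and assumes the column names shown in the table above (`latencies`, `token_usage`, `answer_correctness`). `summarize_eval_csv` is a hypothetical helper, not part of the original notebook.

```python
import csv
from statistics import mean


def summarize_eval_csv(csv_path: str) -> dict:
    """Average correctness, latency, and token usage for one results file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return {
        "n_questions": len(rows),
        "mean_correctness": mean(float(r["answer_correctness"]) for r in rows),
        "mean_latency_s": mean(float(r["latencies"]) for r in rows),
        "mean_tokens": mean(float(r["token_usage"]) for r in rows),
    }
```

Calling it once per model, for example on `eval_results/gemini_evals.csv`, gives one summary row per model that can be placed side by side.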


In [48]:
!streamlit run streamlit_dashboard.py