## Evaluate model performance

In [29]:
# In a terminal, run "ollama serve" to start the Ollama app and its API

In [30]:
import psutil

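def check_if_running(process_name):
    """Return True if any running process name contains `process_name`."""
    for proc in psutil.process_iter(["name"]):
        # psutil reports None for the name when access to the process is denied
        name = proc.info["name"]
        if name and process_name in name:
            return True

    return False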

# check ollama is running
ollama_running = check_if_running("ollama")

if not ollama_running:
    raise RuntimeError("Ollama is not running. Launch it before using this notebook.")

print(f"Ollama running: {ollama_running}")

Ollama running: True
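
The process check above only confirms that a process with "ollama" in its name exists. As a complementary check, the sketch below tests whether the HTTP API answers at all. It assumes the default local address used later in this notebook (`http://localhost:11434`) and that the server answers a plain GET on its root path with status 200; `ollama_api_reachable` is a hypothetical helper name, not part of any Ollama client.

```python
import urllib.request


def ollama_api_reachable(url="http://localhost:11434/", timeout=2.0):
    """Return True if an HTTP GET to `url` succeeds with status 200.

    Assumption: the Ollama server answers a GET on its root path.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # covers connection refused, timeouts and urllib's URLError/HTTPError
        return False


print(f"Ollama API reachable: {ollama_api_reachable()}")
```

A failed check here while the process check passes would suggest that the app is running but the API is listening on a different host or port.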


In [31]:
# load the stored model predictions
import json
from tqdm import tqdm

file_path = "./instruction_data/instruction-data-with-model-response.json"

with open(file_path, "r") as file:
    test_data = json.load(file)

In [32]:
# prompt format based on an older, more widespread approach
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input\n{entry['input']}" if entry['input'] else ""

    return instruction_text + input_text
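
To make the prompt layout concrete, here is a small sketch that runs `format_input` on two made-up entries, one with an `input` field and one without. The function is copied from the cell above so the snippet runs on its own; both entries are hypothetical and are not taken from the dataset file.

```python
def format_input(entry):
    # copied from the cell above so this snippet is self-contained
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    input_text = f"\n\n### Input\n{entry['input']}" if entry['input'] else ""

    return instruction_text + input_text


# hypothetical entries, used only to illustrate the layout
example_entry = {
    "instruction": "Rewrite the sentence using a simile.",
    "input": "The car is very fast.",
}
no_input_entry = {"instruction": "Name three primary colors.", "input": ""}

print(format_input(example_entry))
print("\n-----\n")
print(format_input(no_input_entry))  # the "### Input" section is omitted
```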

In [33]:
# query ollama

import urllib.request

def query_model(prompt, model="llama3.2", url="http://localhost:11434/api/chat"):

    # create api payload
    data = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }

    payload = json.dumps(data).encode("utf-8")

    request = urllib.request.Request(
        url,
        data=payload,
        method="POST"
    )
    request.add_header("Content-Type", "application/json")

    # send the request and concatenate the streamed response chunks
    response_data = ""
    with urllib.request.urlopen(request) as response:
        # the response arrives as one JSON object per line
        for raw_line in response:
            line = raw_line.decode("utf-8").strip()
            if not line:
                continue

            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data


# test api call
model = "llama3.2"
result = query_model("Who is the best person in the world?", model=model)
print(result)

That's a subjective question! The concept of "best" can vary greatly depending on individual perspectives, values, and experiences. It's also important to recognize that everyone has their own unique strengths, weaknesses, and contributions to make.

Rather than trying to identify a single person as the "best," I'd like to offer some alternative perspectives:

1. **Every person is special**: Each individual has their own unique qualities, talents, and experiences that make them valuable and important in their own way.
2. **There's no one-size-fits-all answer**: What might be considered the "best" by one person might not be the same for another. Different people have different values, priorities, and perspectives.
3. **We can learn from each other**: Rather than trying to identify a single "best" person, we can learn from and appreciate the diverse range of individuals who make our world more interesting and enriching.

So, instead of focusing on a single person, let's celebrate the div
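
The loop in `query_model` reads the reply line by line because the chat endpoint streams one JSON object per line, each carrying a fragment of the message. The sketch below reproduces that concatenation step offline. `join_stream_chunks` is a hypothetical helper, and the sample lines are hand-written to mimic the chunk shape that `query_model` relies on (`message.content`); they were not captured from a real session.

```python
import json


def join_stream_chunks(lines):
    """Concatenate the message content of newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
    return "".join(parts)


# hand-written sample chunks (illustrative only)
sample_lines = [
    '{"message": {"role": "assistant", "content": "Hello"}, "done": false}',
    '{"message": {"role": "assistant", "content": ", world"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]

print(join_stream_chunks(sample_lines))
```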

### Scoring our model's responses using another LLM

In [34]:
for entry in test_data[:1]:
    prompt = (
        f"Given the input '{format_input(entry)}' "
        f"and the correct output '{entry['output']}', "
        f"score the model response '{entry['model_response']}'"
        " on a scale from 0 to 100, where 100 is the best score. "
    )

    print(f"Input: \n\t{entry['input']}")
    print(f"\nExpected Response: \n\t{entry['output']}")
    print(f"Model Response: \n\t{entry['model_response']}")
    print(f"Score: \n\t{query_model(prompt)}")
    print("\n-------------------------------------\n")

Input: 
	The car is very fast.

Expected Response: 
	The car is as fast as lightning.
Model Response: 
	The car is as fast as a bullet.
Score: 
	To rewrite the sentence "The car is very fast" using a simile, we can replace "very fast" with a comparative phrase that uses "like" or "as." Here's an example:

"The car is as fast as lightning."

Now, let's evaluate the model response "The car is as fast as a bullet." on a scale from 0 to 100.

Score: 80

This response is close to the correct answer, but it uses a metaphor (comparing the car to a bullet) instead of a simile. A simile explicitly compares two things using "like" or "as," whereas a metaphor states that one thing is another thing. While the response is still clear and effective, it doesn't meet the specific request for a simile.

-------------------------------------



In [35]:
def generate_model_scores(json_data, json_key, model="llama3.2"):
    scores = []

    for entry in tqdm(json_data, desc="Scoring"):
        prompt = (
            f"Given the input '{format_input(entry)}' "
            f"and the correct output '{entry['output']}', "
            f"score the model response '{entry[json_key]}'"
            " on a scale from 0 to 100, where 100 is the best score. "
            "Respond with the integer number only."
        )

        score = query_model(prompt, model)

        try:
            scores.append(int(score))
        except ValueError:
            print(f"Could not convert score: {score}")
            continue

    return scores

scores = generate_model_scores(test_data, "model_response")
print(f"Number of scores: {len(scores)} of {len(test_data)}")
print(f"Average score: {sum(scores)/len(scores):.2f}\n")

Scoring:   9%|▉         | 10/110 [00:21<03:50,  2.31s/it]

Could not convert score: 60

The model's response contains incorrect information about prime and composite numbers. The correct classification should be:

Prime numbers: 11, 19
Composite numbers: 14

The model incorrectly included numbers like 3, 5, 9, 11, 13, 15, 17 in the list of composite numbers, which are actually prime numbers.


Scoring:  22%|██▏       | 24/110 [00:51<03:11,  2.23s/it]

Could not convert score: 60

Explanation: The correct categorization should be "Fish: Shark, Trout" and "Mammals: Dolphin". The model response "Dolphin, Trout" incorrectly places both animals in the same category.


Scoring:  29%|██▉       | 32/110 [01:09<02:52,  2.21s/it]

Could not convert score: I'd rewrite the sentence as: "This task is straightforward."

As for scoring my previous response, I'd give it a 0 out of 100.


Scoring:  62%|██████▏   | 68/110 [02:26<01:37,  2.31s/it]

Could not convert score: I can split the sentence into two declarative sentences: 'The movie was long.' and 'It was interesting.'

However, I will not use this response as it does not match the expected output.

Here's another attempt:

1. The movie was long.
2. It was interesting.

As for scoring the model response 'and The movie was interesting.', I would give it a score of 0.


Scoring:  68%|██████▊   | 75/110 [02:41<01:21,  2.33s/it]

Could not convert score: I would rate this response a 0 out of 100.

The model's response does not sort the list in descending order as requested. Instead, it generates a large amount of random numbers that do not resemble any meaningful output. The correct sorted list should be: 25, 16, 10, 7, 2.


Scoring: 100%|██████████| 110/110 [03:56<00:00,  2.15s/it]

Number of scores: 105 of 110
Average score: 39.16
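
Two of the five failures above are responses that start with the number and then add an explanation, which `int()` rejects. One possible way to recover those cases is sketched below; `extract_leading_score` is a hypothetical helper, not part of the notebook. It only accepts a response whose first line consists solely of an integer in the 0 to 100 range. Responses that mention the score mid-sentence are deliberately left unparsed, since grabbing the first integer anywhere in the text could pick up list numbering such as "1." instead of the score.

```python
import re


def extract_leading_score(text, low=0, high=100):
    """Return the integer on the first line of `text`, or None.

    The first line must contain nothing but the integer (plus whitespace),
    and the value must lie within [low, high].
    """
    match = re.match(r"\s*(\d+)[ \t\r]*(?:\n|$)", text)
    if match is None:
        return None

    value = int(match.group(1))
    return value if low <= value <= high else None


print(extract_leading_score("60\n\nExplanation: the categorization is wrong."))
print(extract_leading_score("I would rate this response a 0 out of 100."))
```

Inside `generate_model_scores`, such a helper could stand in for the bare `int(score)` call, with a `None` result handled the same way as the current `ValueError` branch.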




