# 7.8 Evaluating the fine-tuned LLM

In [9]:
# Check whether the Ollama process is running
import psutil

def check_if_running(process_name):
    running = False
    for proc in psutil.process_iter(["name"]):
        # proc.info["name"] can be None when the name is unavailable (e.g. access denied)
        name = proc.info["name"] or ""
        if process_name in name:
            running = True
            break
    return running

ollama_running = check_if_running("ollama")
if not ollama_running:
    raise RuntimeError("Ollama not running. Launch ollama before proceeding.")
print("Ollama running:", ollama_running)

Ollama running: True
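
The process check above only confirms that a process with "ollama" in its name exists. As a complementary sketch, one could also probe the HTTP server directly. The helper name `is_ollama_reachable` and the 2-second timeout are my own choices, and the base URL is assumed to be the same host and port used by `query_model` later in this section.

```python
import urllib.request


def is_ollama_reachable(url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at `url` with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, refused connections, and timeouts all derive from OSError
        return False


print("Ollama reachable:", is_ollama_reachable())
```

A `False` result here while the process check passes would suggest that the server is listening on a different address or has not finished starting.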


In [6]:
# Load the response data
import json
from tqdm import tqdm

file_path = "instruction-data-with-response.json"
with open(file_path, "r") as file:
    test_data = json.load(file)

In [7]:
# Format a dataset entry as an instruction prompt
def format_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text
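
To see what `format_input` produces, here is a small sketch that runs it on two entries, one with an `input` field and one without. The function is repeated so the snippet runs on its own, and the two entries are rebuilt by hand from examples that appear in the output later in this section rather than loaded from the JSON file.

```python
def format_input(entry):
    # Same logic as the cell above, repeated so this snippet is self-contained
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text


# Hand-built entries mirroring the dataset layout (instruction / input / output)
with_input = {
    "instruction": "Identify the type of sentence.",
    "input": "Did you finish the report?",
    "output": "The type of sentence is interrogative.",
}
without_input = {
    "instruction": "What is the capital of Denmark?",
    "input": "",
    "output": "The capital of Denmark is Copenhagen.",
}

prompt_with_input = format_input(with_input)
prompt_without_input = format_input(without_input)
print(prompt_with_input)
print("-----")
print(prompt_without_input)
```

The `### Input:` section is only appended when the entry's `input` string is non-empty, so the second prompt ends right after the instruction.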

In [16]:
# Querying a local Ollama model
import urllib.request
def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    data = {"model": model,
            "messages": [{"role": "user", "content": prompt}],
            "options": {"seed": 123, "temperature": 0, "num_ctx": 2048}}
    payload = json.dumps(data).encode("utf-8")
    request = urllib.request.Request(url, data=payload, method="POST")
    request.add_header("Content-Type", "application/json")
    
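    # The reply is read as a stream with one JSON object per line;
    # the content of each chunk is appended to build the full answer
    response_data = ""
    with urllib.request.urlopen(request) as response:
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]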
    return response_data
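
The read loop in `query_model` can be tried without a running server. The sketch below feeds a fake byte stream through the same line-by-line logic. The helper name `collect_streamed_content` and the three chunks are made up for illustration; only the `message` / `content` nesting is taken from the function above.

```python
import io
import json


def collect_streamed_content(stream):
    """Concatenate message content from newline-delimited JSON chunks."""
    pieces = []
    for raw_line in stream:
        line = raw_line.decode("utf-8").strip()
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        pieces.append(chunk["message"]["content"])
    return "".join(pieces)


# Hypothetical chunks shaped like the ones query_model reads
fake_stream = io.BytesIO(
    b'{"message": {"role": "assistant", "content": "Llamas "}, "done": false}\n'
    b'{"message": {"role": "assistant", "content": "eat grass."}, "done": false}\n'
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n'
)
text = collect_streamed_content(fake_stream)
print(text)
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, which matters little here but scales better for long replies.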

In [17]:
# Test the querying function
model = "llama3"
result = query_model("What do Llamas eat?", model)
print(result)

Llamas are herbivores, which means they primarily feed on plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.
2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in a llama's diet. They enjoy the sweet taste and texture of fresh hay.
3. Grains: Llamas may receive grains like oats, barley, or corn as part of their daily ration. However, it's essential to provide these grains in moderation, as they can be high in calories.
4. Fruits and vegetables: Llamas enjoy a variety of fruits and veggies, such as apples, carrots, sweet potatoes, and leafy greens like kale or spinach.
5. Minerals: Llamas require access to mineral supplements, which help maintain their overall health and digestive system.

In the wild, llamas might also eat:

1. Leaves: They'll munch on leaves from trees and shrubs, including plants like willow, alder, and birch.
2. Bark: In some cas

In [18]:
# Evaluate the first three examples from the test set
for entry in test_data[:3]:
    prompt = (
        f"Given the input `{format_input(entry)}` "
        f"and correct output `{entry['output']}`, "
        f"score the model response `{entry['model_response']}`"
        f" on a scale from 0 to 100, where 100 is the best score. "
    )
    print("\nDataset response:")
    print(">>", entry['output'])
    print("\nModel response:")
    print(">>", entry["model_response"])
    print("\nScore:")
    print(">>", query_model(prompt))
    print("\n-------------------------")


Dataset response:
>> The car is as fast as lightning.

Model response:
>> The car is as fast as a horse.

Score:
>> I'd rate the model response "The car is as fast as a horse." an 80 out of 100.

Here's why:

* The response uses a simile correctly, comparing the speed of the car to that of a horse.
* The comparison is reasonable and understandable, as horses are known for their speed.
* However, the model could have chosen a more vivid or unexpected comparison to make the sentence more engaging. For example, "The car is as fast as a cheetah" might be an even better simile.

Overall, the response is good but not exceptional, which is why I'd give it an 80 out of 100.

-------------------------

Dataset response:
>> The type of cloud typically associated with thunderstorms is cumulonimbus.

Model response:
>> The type of cloud typically associated with thunderstorms is a tropical rainforest.

Score:
>> I'd rate this model response a 0 out of 100.

Here's why:

* The instruction asks abo

In [19]:
# Score the instruction fine-tuned LLM's responses with an evaluator model
def generate_model_scores(json_data, json_key, model="llama3"):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        score = query_model(prompt, model)
        try:
            scores.append(int(score))
        except ValueError:
            print(f"Could not convert score: {score}")
            continue
    return scores
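
`int(score)` only succeeds when the reply is a bare integer, so a reply such as "Score: 80." is skipped. A more forgiving parser could look roughly like the sketch below. `parse_score` is a hypothetical helper, not part of the original notebook, and it naively takes the first integer it finds, so a reply like "Out of 100, I give 80" would be misread as 100.

```python
import re


def parse_score(text, low=0, high=100):
    """Return the first integer in `text` if it lies in [low, high], else None."""
    match = re.search(r"-?\d+", text)
    if match is None:
        return None
    value = int(match.group())
    return value if low <= value <= high else None


# Hypothetical evaluator replies
for reply in ["95", "Score: 80.", "N/A", "250"]:
    print(repr(reply), "->", parse_score(reply))
```

Returning `None` instead of raising lets the caller decide whether to skip the entry or record the failure.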

In [20]:
# Apply the above function to a subset of the test data
# (scoring the entire test set takes a long time on my laptop, so only entries 10-19 are used)
scores = generate_model_scores(test_data[10:20], "model_response")
print(f"Number of scores: {len(scores)} of {len(test_data)}")
print(f"Average score: {sum(scores)/len(scores):.2f}\n")

Scoring entries: 100%|██████████| 10/10 [07:10<00:00, 43.08s/it]

Number of scores: 10 of 110
Average score: 49.90
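
Two details of the cell above are worth guarding against: `sum(scores)/len(scores)` raises `ZeroDivisionError` if no reply could be converted, and because `generate_model_scores` skips failed conversions, the list positions can drift away from the entries they belong to. The sketch below assumes a variant that records `None` for a failed conversion instead of skipping it. `summarize_scores` and the example values are hypothetical.

```python
def summarize_scores(scores):
    """Summarize a list of scores where None marks an unparsed reply."""
    valid = [s for s in scores if s is not None]
    average = sum(valid) / len(valid) if valid else None
    return {"parsed": len(valid), "total": len(scores), "average": average}


# Hypothetical scores; None stands for a reply that could not be converted
example_scores = [95, 100, None, 40]
summary = summarize_scores(example_scores)
print(summary)
```

Keeping one slot per entry means `scores[i]` always refers to entry `i`, which matters when the scores are printed next to their prompts.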






In [26]:
# Print each prompt from test_data[10:20] together with its score
# (assumes every reply was converted, so scores[index] lines up with the entry)
for index, entry in enumerate(test_data[10:20]):
    prompt = (f"Given the input `{format_input(entry)}` "
              f"and correct output `{entry['output']}`, "
              f"score the model response `{entry['model_response']}`")
    print(prompt)
    print("\n>>>The score is:", scores[index])
    print("-------------------------")

Given the input `Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is the capital of Denmark?` and correct output `The capital of Denmark is Copenhagen.`, score the model response `The capital of Denmark is Copenhagen.`

>>>The score is: 95
-------------------------
Given the input `Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is the opposite of 'wet'?` and correct output `The opposite of 'wet' is 'dry'.`, score the model response `The opposite of 'wet' is 'dry'.`

>>>The score is: 100
-------------------------
Given the input `Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the type of sentence.

### Input:
Did you finish the report?` and correct output `The type of sentence is interrogative.`, score the model response `The type of sentence 

## Exercise 7.4 - Page 247
Training for this exercise takes too long on my laptop, so I did not complete it. According to the exercise solutions, this adjustment performs on par with the current approach. Please refer to the appendix-E-learn folder for my learning notes on LoRA.