<a href="https://colab.research.google.com/github/PanoEvJ/Building_with_LLMs/blob/main/Evaluation_of_LLMs_(Completed_Notebook).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluation of LLMs

Okay, so we've made our sweet new LLM - but how can we confirm that it's working as intended?

In this notebook, we'll walk through a few popular methods of evaluating LLMs on various tasks:

- Metric evaluation, like [Perplexity](https://thegradient.pub/understanding-evaluation-metrics-for-language-models/)
- Human or AI Evaluation
- Eleuther AI's [Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) - [Notebook Here](https://colab.research.google.com/drive/1CsaPpqsB21QgQxhJpV22SgwryFFapDBP?usp=sharing)
- Stanford's [HELM](https://github.com/stanford-crfm/helm) - [Notebook here]()

There's nothing left to do but get started - and we'll start with the most familiar method: Metrics!

If you run into CUDA memory issues, restart the notebook and start again from the next section.

### Base Model

For this exercise, we'll be using OpenLM Research's `OpenLLaMA` as our base model.

In [None]:
model_id = "openlm-research/open_llama_7b_700bt_preview"

### Perplexity

First things first, perplexity is limited to autoregressive (CausalLM) models. That does restrict its usefulness, but not tremendously!

Secondly, Perplexity has a number of pros and cons associated with it:

Pros:
- Time-efficient: perplexity can be calculated in a single forward pass over the text, so it's quick to obtain
- Can be used as a signal for over/under-fitting: if perplexity keeps dropping on the training data while rising on held-out data, your model is likely overfitting

Cons:
- Doesn't directly indicate the model's performance on the final task
- Because the perplexity score depends heavily on what text was used to train the model, the scores are not comparable between models or datasets

That last con is a big one, and is one of the reasons that - while perplexity is useful to calculate - it isn't a great signal of how well your model will perform on its desired task.
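To make the metric concrete, here is a minimal sketch of how perplexity is defined: the exponential of the average negative log-likelihood the model assigns to each token. The function name `perplexity` and the probabilities below are hypothetical, chosen only for illustration. They are not outputs of OpenLLaMA.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical log-probabilities a model assigned to each token of a 4-token text.
confident = [math.log(0.9), math.log(0.8), math.log(0.95), math.log(0.85)]

# A model that guesses uniformly over a 4-word vocabulary at every step.
uniform = [math.log(0.25)] * 4

print(perplexity(confident))  # close to 1: the model is rarely "surprised"
print(perplexity(uniform))    # equals the number of equally likely choices
```

A useful intuition: a perplexity of `k` means the model is, on average, about as uncertain as if it were choosing uniformly among `k` tokens at each step.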

Let's get started by getting the `evaluate` library and some other dependencies we'll use.

In [None]:
!pip install -q evaluate datasets transformers torch

Now, let's get a small test set of strings we wish to use!

In [None]:
from datasets import load_dataset

input_data = load_dataset("wikitext", "wikitext-2-raw-v1")




In [None]:
input_data

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})

We'll use some of the data in the `test` split to reduce the chance that we're using something the model was trained on.

In [None]:
test_text = input_data["test"][:50]["text"]

test_text = [text for text in test_text if text != ""]

In [None]:
from evaluate import load

perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=test_text, model_id=model_id)

Using pad_token, but it is not set yet.

In [None]:
results["mean_perplexity"]

88.62594145315664

Perplexity is measured as a score between 1 and `inf`, and a lower score is better.

In this case, the results are absolutely fine - though unimpressive. 

OpenLLaMA was not trained on Wikitext - and so it does an admirable job given that!
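A single mean can hide a lot, since a few very hard strings can dominate it. The sketch below assumes the dictionary returned by `perplexity.compute` also carries a per-text `perplexities` list next to `mean_perplexity`. The numbers in `results_example` are hypothetical stand-ins, not values from the run above.

```python
import statistics

# Hypothetical stand-in for the dictionary returned by perplexity.compute(...).
results_example = {
    "perplexities": [35.2, 61.8, 410.5, 48.9],
    "mean_perplexity": 139.1,
}

per_text = results_example["perplexities"]

# The mean is pulled upward by the single outlier; the median is more robust.
mean_ppl = statistics.mean(per_text)
median_ppl = statistics.median(per_text)

# Find which input string the model struggled with most.
worst_index = max(range(len(per_text)), key=per_text.__getitem__)

print(f"mean={mean_ppl:.2f} median={median_ppl:.2f} worst_index={worst_index}")
```

Looking at the worst-scoring strings is often more informative than the aggregate, because it shows what kind of text the model finds surprising.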

### Human or AI Evaluation

Now, let's look at how we can evaluate the model's actual output - with human or AI supervision!

The idea here is that we ask the model to perform a task - and then have a human being (or another, stronger model) judge the result.

This method similarly comes with some pros and cons:

Pros:
- Should provide excellent feedback on whether or not your model is performing as expected

Cons: 
- Extremely expensive

Since we're going to be leveraging AI in this example, you will need an OpenAI API key!

Also, we're going to use an instruct-tuned version of the OpenLLaMA base model to gauge how well it follows instructions!

In [None]:
# openai.ChatCompletion (used below) was removed in openai 1.0, so pin an older release
!pip install -q "openai<1" accelerate

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "VMware/open-llama-0.3T-7B-open-instruct-v1.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype= torch.float16, device_map = 'auto')

prompt_template = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"


In [None]:
import os

os.environ["OPENAI_API_KEY"] = ""

We're going to be using a short list of instructions to score the models!

In [None]:
list_of_instructions = [
    "Give three tips for staying healthy.",
    "What are the three primary colors?",
    "Describe a time when you had to make a difficult decision.",
]

In [None]:
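def get_model_response(text):
  # Fill the instruction into the prompt template
  prompt = prompt_template.format(instruction=text)
  # Send the inputs to whichever device the model was loaded on
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

  # max_length counts the prompt tokens plus the generated tokens
  generated = model.generate(input_ids, max_length=512)
  # Drop the prompt tokens so only the newly generated text is decoded
  input_length = input_ids.shape[1]
  new_tokens = generated[0, input_length:]
  return tokenizer.decode(new_tokens, skip_special_tokens=True)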

In [None]:
import openai

for prompt in list_of_instructions:
  gpt_35_turbo_prompt = [
      {"role" : "system", 
      "content" : f"Is the following a good response to this instruction: {prompt}"}
  ]
  output_to_test = get_model_response(prompt)
  gpt_35_turbo_prompt.append(
      {"role" : "user",
       "content" : f"{output_to_test}"}
  )

  print(prompt)

  print(openai.ChatCompletion.create(
      model="gpt-3.5-turbo",
      messages=gpt_35_turbo_prompt
  )["choices"][0]["message"]["content"])

  print("-----------------")

Give three tips for staying healthy.
Yes, this is a good response to the instruction. It provides three clear and relevant tips for staying healthy, and each tip is explained in a concise and easy-to-understand manner. It also covers different aspects of overall health, including physical activity, nutrition, and mental health, which is important for a well-rounded approach to staying healthy.
-----------------
What are the three primary colors?
This response is generally good, but it contains one factual error. The three primary colors are actually red, blue, and yellow, not red, yellow, and blue in order of brightness. It would be more accurate to say that these three colors are the building blocks of all other colors in the visible spectrum. Otherwise, the response has good detail and explanation for why these colors are important.
-----------------
Describe a time when you had to make a difficult decision.
As an AI language model, I cannot judge whether a response is good or bad as

While we did get feedback, you can see that it is vague - and potentially unhelpful. 

In the next section, you'll improve on this process to get better and more granular feedback!
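One possible way to get more granular feedback is to give the judge an explicit rubric and ask for scores in a fixed format that code can parse. The sketch below is only a starting point under stated assumptions: the rubric wording, the three criteria, and the helper names `build_judge_messages` and `parse_scores` are all hypothetical. It makes no API call. It only builds the messages and parses a reply.

```python
import re

# Hypothetical rubric: the criteria and the 1-5 scale are illustrative choices.
RUBRIC = (
    "You are grading a response to an instruction.\n"
    "Rate the response from 1 (poor) to 5 (excellent) on each criterion: "
    "helpfulness, accuracy, clarity.\n"
    "Reply with one line per criterion in the form 'criterion: score'."
)

def build_judge_messages(instruction, response):
    """Build chat messages that show the judge both the instruction and the response."""
    return [
        {"role": "system", "content": RUBRIC},
        {"role": "user",
         "content": f"Instruction:\n{instruction}\n\nResponse:\n{response}"},
    ]

def parse_scores(judge_reply):
    """Extract 'criterion: score' lines; criteria the judge skipped are simply absent."""
    pattern = r"(?im)^\s*(helpfulness|accuracy|clarity)\s*:\s*([1-5])\b"
    return {name.lower(): int(value) for name, value in re.findall(pattern, judge_reply)}

# Example with a made-up judge reply.
example_reply = "Helpfulness: 4\naccuracy: 5\nClarity: 3"
print(parse_scores(example_reply))
```

Numeric scores per criterion can be averaged across instructions, and an empty result from `parse_scores` flags cases where the judge refused or ignored the format.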

### Assignment Part 1:

Try it out yourself!

- Create a list of 5 instructions
- Create a better evaluation prompt
- Report on the performance of your selected model

In [None]:
### YOUR CODE HERE