<a href="https://colab.research.google.com/github/AnanthSankaralingam/LLaMA-Attention-Scores/blob/main/KV_Cache_Compressed_Model_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Grading outputs from compressed models vs LLaMA 3 base model
With a large test dataset, we want a uniform way to measure our compressed model's performance against the base model. What better judge of an LLM than another LLM? We consider four strategies:

1. [LLM-as-a-Judge](https://arxiv.org/pdf/2306.05685): Grade each response on a 1-10 scale independently, without comparing it to another answer.
2. [AlpacaEval](https://arxiv.org/pdf/2404.04475): Give the judge an instruction and both answers, and ask it to pick the better response.
3. [G-Eval](https://arxiv.org/abs/2303.16634): Two-stage prompting. First explain the grading task and ask the judge to determine the best approach, then apply that approach to the given answers.
4. [LLM-as-a-Judge *](https://arxiv.org/pdf/2306.05685): Same as 1, but also ask the judge for its reasoning.




We'll draw test prompts from the yahma/alpaca-cleaned dataset, which is normally used for fine-tuning, to get a variety of data.


# Installs and imports

In [None]:
%%capture
!pip install transformers accelerate bitsandbytes datasets openai pandas tqdm python-dotenv tabulate matplotlib seaborn -q

In [None]:
import os
import torch
import time
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from dotenv import load_dotenv
from transformers import AutoModelForCausalLM, AutoTokenizer
from typing import List, Tuple, Dict, Any
from tabulate import tabulate

from openai import OpenAI
from tqdm import tqdm
import pandas as pd

In [None]:
# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set up OpenAI API
client = OpenAI(api_key="...")


prompt = "How do I start a company?"

# Initialize tokenizer and model
model_name = "meta-llama/Meta-Llama-3-8B"

# read the HF access token for the gated model from the environment
# rather than hard-coding a secret in the notebook
access_token = os.environ.get("HF_TOKEN")

# load the model with 4-bit quantization enabled
# automatically map to the available devices
# (the bare `load_in_4bit` and `use_auth_token` arguments are deprecated)
from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    token=access_token
)

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=True,
    token=access_token
)

# Tokenize the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

# Get token embeddings
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)
    token_embeddings = outputs.hidden_states[-1]

# Generate model output
start_time = time.time()
with torch.no_grad():
    output_sequences = model.generate(
        input_ids=input_ids,
        max_length=100,
        num_return_sequences=1,
        do_sample=True,
        temperature=0.7,
    )
generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
end_time = time.time()

# test model
print(f"Prompt: {prompt}")
print(f"Generated text: {generated_text}")
print(f"Time taken: {end_time - start_time:.2f} seconds")

Using device: cuda


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.



Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Prompt: How do I start a company?
Generated text: How do I start a company? What's the best way to register my business? What are the legal requirements for running a business? These are just some of the questions many aspiring entrepreneurs ask when they first start out.
The good news is that there are many great resources available to help you get started. From government agencies to private organizations, there are plenty of places to turn for assistance and guidance.
Whether you're just getting your feet wet or you're ready to dive into the deep end,
Time taken: 14.67 seconds


In [None]:
# Add a padding token to the tokenizer
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    # Resize the model's token embeddings
    model.resize_token_embeddings(len(tokenizer))

# 1) LLM as a Judge
This strategy involves grading responses 1-10 independently, without comparison.

Pros:


*   Simple, easy to analyze output
*   Fast, easy generation from judge
*   Scalable

Cons:


*   Position bias: the judge may favor an answer because of where it appears in the prompt. When grading a single answer purely on quality, with no reference context, it may likewise prefer some responses over others
*   Verbosity bias: the judge prefers longer answers. This is unlikely in our use case since we can limit tokens, but it is something to watch out for

*   Self-enhancement bias: GPT judges prefer GPT-generated answers
*   Limited capability in grading math and reasoning questions
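
One way to watch for verbosity bias is to check whether grades track answer length. The sketch below is a minimal, hypothetical helper (`length_grade_correlation` is not part of this notebook) that computes the Pearson correlation between answer word counts and judge grades. It assumes the grades have already been collected; a value close to 1 on a real batch would suggest the judge is rewarding length.

```python
import math
from typing import List


def length_grade_correlation(answers: List[str], grades: List[int]) -> float:
    """Pearson correlation between answer length (in words) and judge grade.

    Hypothetical helper for illustration only.
    Returns 0.0 when either series is constant.
    """
    if len(answers) != len(grades) or len(answers) < 2:
        raise ValueError("need at least two (answer, grade) pairs of equal length")

    lengths = [len(answer.split()) for answer in answers]
    mean_len = sum(lengths) / len(lengths)
    mean_grade = sum(grades) / len(grades)

    # Covariance and variances, left unnormalised because the factors cancel.
    cov = sum((l - mean_len) * (g - mean_grade) for l, g in zip(lengths, grades))
    var_len = sum((l - mean_len) ** 2 for l in lengths)
    var_grade = sum((g - mean_grade) ** 2 for g in grades)

    if var_len == 0 or var_grade == 0:
        return 0.0
    return cov / math.sqrt(var_len * var_grade)
```

Correlation alone does not prove bias, since longer answers can be genuinely better, so this is best read as a flag for manual review.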







In [None]:
def llama_output(prompt: str) -> str:
    # Tokenize with padding
    inputs = tokenizer(prompt, return_tensors="pt", padding=True)
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)

    with torch.no_grad():
        output_sequences = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_length=100,
            num_return_sequences=1,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
    return generated_text

def llm_as_judge_1(question: str, answer: str) -> int:
    prompt = f"""Grade the following answer to the question on a scale of 1-10 based on relevance:\n
    Question: {question}
    Answer: {answer}
    Provide only the numerical grade as your response.
    """

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an AI assistant that grades answers based on relevance to the question."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=1,
        n=1,
        temperature=0.2,
    )

    return int(completion.choices[0].message.content.strip())

# Testing
llama_response = llama_output(prompt)

print("Llama response:")
print(llama_response)

print("\nGrading response...")
grade = llm_as_judge_1(prompt, llama_response)

print(f"\nGrade: {grade}/10")

Llama response:
How do I start a company? What are the first steps? What does it take to start a company?
How do I start a company? What are the first steps? What does it take to start a company?

Grading response...

Grade: 1/10
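
The `int(...)` call in `llm_as_judge_1` raises a `ValueError` whenever the judge replies with anything other than a bare number. A more forgiving parser could look like the sketch below; `parse_grade` is a hypothetical helper, not part of the notebook, and it assumes the grade is the first integer in the reply.

```python
import re
from typing import Optional


def parse_grade(raw: str, low: int = 1, high: int = 10) -> Optional[int]:
    """Return the first integer in the judge's reply if it lies in [low, high].

    Hypothetical helper: returns None instead of raising when the reply
    contains no number or the number is out of range.
    """
    match = re.search(r"\d+", raw)
    if match is None:
        return None
    value = int(match.group())
    return value if low <= value <= high else None
```

Rows with a `None` grade can then be dropped or re-queried rather than aborting a long evaluation run.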


# 2) AlpacaEval
Pros:
*   Direct comparison between answers
*   Binary output
*   Focuses on relative quality

Cons:

*   May not capture nuanced differences
*   Shares the cons listed under 1)



In [None]:
def alpaca_eval_2(question: str, base_answer: str, compressed_answer: str) -> str:
    prompt = f"""Compare the following two answers to the given question and choose the better response:

Question: {question}

Base Answer: {base_answer}

Compressed Answer: {compressed_answer}

Which answer is better? Respond with either "Base" or "Compressed".
"""

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an AI assistant that evaluates and compares answers to questions."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=10,
        n=1,
        temperature=0.2,
    )

    return completion.choices[0].message.content.strip()
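
Because a pairwise judge sees the two answers in a fixed order, a common mitigation is to ask twice with the slots swapped and only count a win when both calls agree. The sketch below assumes a judge callable with the same argument order and return labels as `alpaca_eval_2`; `order_swapped_verdict` and the `"Tie"` label are illustrative names, not part of the notebook.

```python
from typing import Callable

# A judge takes (question, base_answer, compressed_answer)
# and returns "Base" or "Compressed".
Judge = Callable[[str, str, str], str]


def _normalize(verdict: str) -> str:
    """Lower-case the verdict and drop surrounding spaces, quotes, and periods."""
    return verdict.strip().strip("\"'.").lower()


def order_swapped_verdict(judge: Judge, question: str,
                          base_answer: str, compressed_answer: str) -> str:
    """Query the judge twice with the answer slots swapped; keep only consistent wins."""
    first = _normalize(judge(question, base_answer, compressed_answer))

    # In the swapped call the compressed answer sits in the "Base" slot,
    # so the returned label has to be mapped back.
    swapped = _normalize(judge(question, compressed_answer, base_answer))
    second = {"base": "compressed", "compressed": "base"}.get(swapped)

    if first in ("base", "compressed") and first == second:
        return first.capitalize()
    return "Tie"
```

Calling `order_swapped_verdict(alpaca_eval_2, question, base_answer, compressed_answer)` doubles the number of API calls per question.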

# 3) G-Eval
Pros:
*   Adaptable evaluation criteria for each question
*   Two-stage process allows for more thoughtful evaluation
*   Can handle diverse types of questions

Cons:

* More complex and time-consuming
* Potential for inconsistency in criteria selection
* May introduce additional biases in criteria determination stage

In [None]:
def g_eval_2(question: str, base_answer: str, compressed_answer: str) -> str:
    # Stage 1: Determine the best approach for grading
    approach_prompt = f"""We need to evaluate two answers to the following question:

Question: {question}

What would be the best criteria to judge the quality of the answers? Provide a brief explanation of the approach.
"""

    approach_completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an AI assistant that determines the best approach for evaluating answers to questions."},
            {"role": "user", "content": approach_prompt}
        ],
        max_tokens=150,
        n=1,
        temperature=0.5,
    )

    approach = approach_completion.choices[0].message.content.strip()

    # Stage 2: Use the determined approach to evaluate the answers
    evaluation_prompt = f"""Based on the following approach:

{approach}

Evaluate these two answers to the question and choose the better response:

Question: {question}

Base Answer: {base_answer}

Compressed Answer: {compressed_answer}

Which answer is better? Respond with either "Base" or "Compressed".
"""

    evaluation_completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an AI assistant that evaluates answers based on given criteria."},
            {"role": "user", "content": evaluation_prompt}
        ],
        max_tokens=10,
        n=1,
        temperature=0.2,
    )

    return evaluation_completion.choices[0].message.content.strip()
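
Since the criteria generated in stage 1 can vary from run to run, one simple way to dampen that inconsistency is to repeat the evaluation and take a majority vote. This is a sketch under that assumption, not something the G-Eval paper prescribes; `majority_verdict` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def majority_verdict(verdicts: Iterable[str]) -> str:
    """Return the most common verdict, or "Tie" when the top two counts are equal."""
    # Normalise replies such as 'compressed.' or '"Base"' before counting.
    counts = Counter(v.strip().strip("\"'.").capitalize() for v in verdicts)
    if not counts:
        raise ValueError("no verdicts supplied")

    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return "Tie"
    return ranked[0][0]
```

For example, `majority_verdict(g_eval_2(q, base, comp) for _ in range(3))` would run the full two-stage evaluation three times, at three times the cost.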

# 4) LLM as a Judge*

Pros:

* Explains reasoning, easy human intervention if there's a glaringly obvious mistake
* Can reveal nuanced differences between answers

Cons:

* Longer output doesn't really scale well
* Reasoning may introduce additional biases
* More tokens = more $

In [None]:
def llm_as_a_judge_4(question: str, base_answer: str, compressed_answer: str) -> str:
    prompt = f"""Compare the following two answers to the given question, choose the better response, and provide reasoning for your choice:

Question: {question}

Base Answer: {base_answer}

Compressed Answer: {compressed_answer}

Which answer is better? Respond with either "Base" or "Compressed", followed by a brief explanation of your reasoning.
"""

    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an AI assistant that evaluates and compares answers to questions, providing reasoning for your choices."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=150,
        n=1,
        temperature=0.3,
    )

    return completion.choices[0].message.content.strip()
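
The reply from `llm_as_a_judge_4` mixes the verdict and the explanation in one string, which is awkward to tabulate. The sketch below separates them, assuming the judge follows the prompt and starts its reply with "Base" or "Compressed"; `split_verdict` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import Optional, Tuple


def split_verdict(reply: str) -> Tuple[Optional[str], str]:
    """Split a judge reply into (verdict, reasoning).

    The verdict is "Base" or "Compressed" when the reply starts with one of
    them (ignoring case and an optional quote). Otherwise the verdict is None
    and the whole reply is returned as the reasoning.
    """
    match = re.match(
        r"""\s*["']?(base|compressed)\b["']?[\s.:,\-]*""",
        reply,
        flags=re.IGNORECASE,
    )
    if match is None:
        return None, reply.strip()
    return match.group(1).capitalize(), reply[match.end():].strip()
```

Replies with a `None` verdict are the ones worth a human look, which fits the "easy human intervention" advantage listed above.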

# Testing compressed vs base model

Use the [yahma/alpaca-cleaned](https://huggingface.co/datasets/yahma/alpaca-cleaned) dataset from Hugging Face, which is normally used for fine-tuning, to build a data frame with columns for the question, the base model's answer, the compressed model's answer, and GPT's grade.

In [None]:
from datasets import load_dataset

ds = load_dataset("yahma/alpaca-cleaned")

subset_size = 5
results = []

# TODO: create compressed_model_response method and append to results
for item in tqdm(ds['train']):
    if subset_size == 0:
        break
    subset_size -= 1
    question = item['instruction']
    llama_response = llama_output(question)
    grade = llm_as_judge_1(question, llama_response)
    results.append({
        'question': question,
        'llama_response': llama_response,
        'grade': grade
    })

# use df to display results
df = pd.DataFrame(results)
print(df.to_string(index=False))

# Optionally, save to CSV; the visuals can be improved later
df.to_csv('llama_responses_graded.csv', index=False)

  0%|          | 5/51760 [00:39<114:37:19,  7.97s/it]

                            question  llama_response  grade
Give three tips for staying healthy.
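
Once a compressed-model generator exists, the comparison table described above could be assembled along the lines of the sketch below. All names here are hypothetical: `compressed_fn` stands in for the not-yet-written compressed model method, and the generators and judge are passed in as plain callables so the logic can be checked without a GPU or API key.

```python
from typing import Callable, Dict, Iterable, List


def build_comparison_rows(
    questions: Iterable[str],
    base_fn: Callable[[str], str],
    compressed_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], str],
) -> List[Dict[str, str]]:
    """Collect one row per question: both answers plus the judge's verdict."""
    rows = []
    for question in questions:
        base_answer = base_fn(question)
        compressed_answer = compressed_fn(question)
        rows.append({
            "question": question,
            "base_answer": base_answer,
            "compressed_answer": compressed_answer,
            "verdict": judge_fn(question, base_answer, compressed_answer),
        })
    return rows


def win_rate(rows: List[Dict[str, str]], label: str = "Compressed") -> float:
    """Fraction of rows whose verdict equals `label`; 0.0 for an empty list."""
    if not rows:
        return 0.0
    return sum(row["verdict"] == label for row in rows) / len(rows)
```

In this notebook, `base_fn` could be `llama_output` and `judge_fn` could be `alpaca_eval_2`; the resulting list can be passed straight to `pd.DataFrame(rows)`.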




In [None]:
# # Display token information
# tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
# token_info = []
# for i, (token, embedding) in enumerate(zip(tokens, token_embeddings[0])):
#     token_info.append([i, token, embedding.norm().item()])

# print("\nToken Information:")
# print(tabulate(token_info, headers=["Index", "Token", "Embedding Norm"], tablefmt="grid"))

# # Visualize token embeddings
# plt.figure(figsize=(12, 6))
# sns.heatmap(token_embeddings[0].cpu().numpy(), cmap="viridis")
# plt.title("Token Embeddings Heatmap")
# plt.xlabel("Embedding Dimension")
# plt.ylabel("Token Position")
# plt.show()