# Automatic evaluation script for VLM

Steps:
1. Load Prometheus
2. Load Score Rubric (no need of separate file, as it is simple, hardcode it in the prompt)
3. Load Evaluation Data (VLM Output and Ground Truth)
4. Write Evaluation script

### Load the required Libraries

In [2]:
!pip install accelerate
!pip install transformers>=4.36
!pip install optimum
!pip install bitsandbytes
!pip install aqlm[gpu,cpu]
!pip install auto-gptq
!pip install autoawq

Looking in indexes: https://nexus.iisys.de/repository/ki-awz-pypi-group/simple, https://pypi.org/simple
Collecting accelerate
  Using cached accelerate-0.31.0-py3-none-any.whl.metadata (19 kB)
Collecting torch>=1.10.0 (from accelerate)
  Using cached torch-2.3.1-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting huggingface-hub (from accelerate)
  Using cached huggingface_hub-0.23.3-py3-none-any.whl.metadata (12 kB)
Collecting safetensors>=0.3.1 (from accelerate)
  Using cached safetensors-0.4.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.8 kB)
Collecting filelock (from torch>=1.10.0->accelerate)
  Using cached filelock-3.14.0-py3-none-any.whl.metadata (2.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->accelerate)
  Using cached nvidia_cuda_runtime_cu12-

In [20]:
pip install -U transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Looking in indexes: https://nexus.iisys.de/repository/ki-awz-pypi-group/simple, https://pypi.org/simple
Note: you may need to restart the kernel to use updated packages.


### Prometheus Prompt Structure:

### Script to load prometheus

In [None]:
def load_prometheus(version = 1):
    prometheus_v1 = 'kaist-ai/Prometheus-13b-v1.0'
    prometheus_v2 = 'prometheus-eval/prometheus-7b-v2.0'
    if version == 1:
        tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token='')
        model = LlamaForCausalLM.from_pretrained(prometheus_v1, device_map="auto", load_in_8bit=True)
    if version == 2:
        tokenizer = AutoTokenizer.from_pretrained(prometheus_v2)
        model = LlamaForCausalLM.from_pretrained(prometheus_v2, device_map="auto", load_in_8bit=True)
        
    return tokenizer, model

## Script to load other models

In [None]:
def load_model(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name, token='')
    model = LlamaForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
    return tokenizer, model

In [12]:
from transformers import pipeline
def load_model_via_pipeline(model_name):
    pipe = pipeline("text-generation", model=model_name, model_kwargs={"torch_dtype": torch.bfloat16}, device_map="auto")
    return pipe
'''model_kwargs={
            "torch_dtype": torch.float16,
            "quantization_config": {"load_in_4bit": True},
            "low_cpu_mem_usage": True,
        },'''

'model_kwargs={\n            "torch_dtype": torch.float16,\n            "quantization_config": {"load_in_4bit": True},\n            "low_cpu_mem_usage": True,\n        },'

## Main Evalaution class

In [17]:
import torch
from transformers import pipeline
from tqdm.auto import tqdm
from transformers import AutoTokenizer, LlamaForCausalLM
import time
import re
from torch.nn.attention import SDPBackend, sdpa_kernel

class Evaluate:
        
    def __init__(self, tokenizer, model, pipeline):
        self.tokenizer = tokenizer
        self.model = model
        self.pipeline = pipeline
        #self.model.to_bettertransformer()

    def run(self, prompt):
        if self.pipeline is None:
            pipe = pipeline("text-generation", model=self.model, tokenizer=self.tokenizer)
        else:
            pipe = self.pipeline
        start_time = time.time()
        
        with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
            #outputs = pipe(prompt, max_new_tokens=256, temperature=1.0, repetition_penalty=1.03, top_p=0.9, do_sample=True, pad_token_id=2)[0]["generated_text"][len(prompt):]
            #output = pipe(prompt, max_new_tokens=256, return_full_text=False, temperature=1.0, repetition_penalty=1.03, top_p=0.9, do_sample=True, c=2)[0]["generated_text"]
            #terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
            output = pipe(
                prompt,
                return_full_text=False,
                max_new_tokens=256,
                #eos_token_id=terminators,
                do_sample=True,
                temperature=1.0,
                top_p=0.9,
                pad_token_id=2
            )[0]["generated_text"]
        print(output)
        feedback, result = self.extract_feedback(output)
        #print(feedback)
        end_time = time.time()
        duration = end_time - start_time
        print('duration: ', duration, ' seconds')

        return feedback, result

    def define_score_rubric(self):
        score_rubric = '''Does the reponse correclty identifies class, family or species of the animal. The correct class, family and species of the animal is mentioned in the reference answer. Score only if the response name matches exactly to the reference answer.
        Score 0: The reponse failes to identify the Class, Family or Species of animal.
        Score 1: The reponse correctly identifies the Class of the animal.
        Score 2: The response correctly identifies the Family of the animal. 
        Score 3: The response correctly identifies the Species of the animal.
        '''
        return score_rubric

    def define_prompt(self, instruction, response, referenceAnswer, scoreRubric):
       
        prompt = f'''Task Description:
        An instruction, a response to evaluate, a reference answer that gets a score of 3, and a score rubric representing a evaluation criteria are given.
        1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
        2. After writing a feedback, write a score that is an integer between 0 and 3. You should refer to the score rubric.
        3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between 0 and 3)\"
        4. Please do not generate any other opening, closing, and explanations.
        
        The instruction to evaluate:
        {instruction}
        
        Response to evaluate:
        {response}
        
        Reference Answer (Score 3):
        {referenceAnswer}
        
        Score Rubrics:
        {scoreRubric}
        
        Feedback: 
        '''

        prompt_template=f'''[INST] <<SYS>>
        You are a fair LLM evaluator. Use the Task Description to evaluate the model response. Stick to the description strictly. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
        <</SYS>>
        {prompt}[/INST]
        
        '''

        return prompt_template

    def extract_feedback(self, eval_output):
        # Extract Feedback and Result
        feedback = ''
        result = ''
        feedback_match = re.search(r"(.+?)Feedback (\d+)", eval_output, re.DOTALL)
        if feedback_match:
            feedback = feedback_match.group(1).strip()
            result = int(feedback_match.group(2))
        return feedback, result

    def main(self, inputFile):
        # funciton variables
        resultList = []
        
        # load evaluation dataset
        dataloader = Data()
        dataset = dataloader.read_jsonl_file(inputFile)
        
        # generate prompt
        instruction = "What kind of animal is that? Be as specific as possible! Fish and bird for example are too coarse."
        scoreRubric = self.define_score_rubric()
        for data in tqdm(dataset):
            resultDict = {}
            prompt = self.define_prompt(instruction, data['response'], data['ground_truth'], scoreRubric)
            # run evalaution
            feedback , result = self.run(prompt)
            # store result
            resultDict['id'] = data['id']
            resultDict['input'] = data['response']
            resultDict['ground_truth'] = data['ground_truth']
            resultDict['result'] = result
            resultDict['feedback'] = feedback
            resultList.append(resultDict)

        # save the results in a json file
        dataloader.save_result(resultList)
        

### Class to process input output data

In [7]:
import json

class Data:
    def read_jsonl_file(self, file_path):
        data = []
        with open(file_path, 'r') as file:
            for line in file:
                data.append(json.loads(line))
        return data

    def save_result(self, resultList):
        with open(f'result.json', 'w') as output_json_file:
            json.dump({"results": resultList}, output_json_file)

        print(f"\Evaluations saved to result.json")

In [3]:
dataloader = Data()
dataset = dataloader.read_jsonl_file('input.jsonl')
dataset[0]

{'id': 'aenb1',
 'response': 'a peacock, which is a large, colorful bird known for its distinctive plumage and elaborate tail feathers.',
 'ground_truth': 'Great Bustard',
 'type': 'bird'}

### Run evaluation

In [8]:
## Load prometheus
tokenizer, model = load_prometheus(version = 2)

tokenizer_config.json:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/551 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/653 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
You are using a model of type mistral to instantiate a model of type llama. This is not supported for all configurations of models and can yield errors.


model.safetensors.index.json:   0%|          | 0.00/22.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/8 [00:00<?, ?it/s]

model-00001-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00002-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00003-of-00008.safetensors:   0%|          | 0.00/1.97G [00:00<?, ?B/s]

model-00004-of-00008.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00005-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00006-of-00008.safetensors:   0%|          | 0.00/1.92G [00:00<?, ?B/s]

model-00007-of-00008.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00008-of-00008.safetensors:   0%|          | 0.00/789M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

In [8]:
## Load other models
llama_3_8b_instruct = 'meta-llama/Meta-Llama-3-8B-Instruct'
llama_3_instruct_aqlm = 'ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16'
mistral_gptq = 'neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit'
llama_3_8b = 'meta-llama/Meta-Llama-3-8B'
mistral_gptq = 'neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit'
llama_2_13b_chat_gptq = 'TheBloke/Llama-2-13B-chat-GPTQ'

In [9]:
#tokenizer, model = load_model(llama_3_8b)

In [21]:
pipeline = load_model_via_pipeline(llama_2_13b_chat_gptq)

Some weights of the model checkpoint at TheBloke/Llama-2-13B-chat-GPTQ were not used when initializing LlamaForCausalLM: ['model.layers.0.mlp.down_proj.bias', 'model.layers.0.mlp.gate_proj.bias', 'model.layers.0.mlp.up_proj.bias', 'model.layers.0.self_attn.k_proj.bias', 'model.layers.0.self_attn.o_proj.bias', 'model.layers.0.self_attn.q_proj.bias', 'model.layers.0.self_attn.v_proj.bias', 'model.layers.1.mlp.down_proj.bias', 'model.layers.1.mlp.gate_proj.bias', 'model.layers.1.mlp.up_proj.bias', 'model.layers.1.self_attn.k_proj.bias', 'model.layers.1.self_attn.o_proj.bias', 'model.layers.1.self_attn.q_proj.bias', 'model.layers.1.self_attn.v_proj.bias', 'model.layers.10.mlp.down_proj.bias', 'model.layers.10.mlp.gate_proj.bias', 'model.layers.10.mlp.up_proj.bias', 'model.layers.10.self_attn.k_proj.bias', 'model.layers.10.self_attn.o_proj.bias', 'model.layers.10.self_attn.q_proj.bias', 'model.layers.10.self_attn.v_proj.bias', 'model.layers.11.mlp.down_proj.bias', 'model.layers.11.mlp.gate_

In [22]:
eval = Evaluate(tokenizer = None, model = None, pipeline = pipeline)

In [23]:
eval.main('input.jsonl')

  0%|          | 0/5 [00:00<?, ?it/s]

Feedback: The response correctly identifies the species of the animal, which is Great Bustard. The description of the animal is also specific and detailed, mentioning the distinctive plumage and elaborate tail feathers, which are key characteristics of the Great Bustard. However, the response does not provide the correct class or family of the animal. [RESULT: 2]
duration:  26.474257230758667  seconds

        Feedback: The response correctly identifies the species of the animal, which is the bald eagle. [RESULT: 3]




The response accurately identifies the species of the animal, which is a bald eagle. The reference answer also mentions the class, family, and species of the animal, which is [Class]Bird [Family]Eagle [Species]Bald Eagle.




































































































































































duration:  83.25265717506409  seconds

        Feedback: The response correctly identifi