<a href="https://colab.research.google.com/github/DJCordhose/practical-llm/blob/main/Eval4pptx.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Eval - small LLM as a judge


# Hands-On

Prompting for smaller LLMs is even harder than for the powerful ones. These prompts need to generalize beyond a single example.

Tasks:
* Add an additional criteria rule for one of the criteria named above and add it to the test suite.
* Alternatively try to improve on one of the existing criteria.
* Do you think this approach is reasonable? If not, what would you do differently?
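
One way to check that a criterion generalizes beyond a single example is to keep a small list of cases with an expected score range each. The following is only a sketch: `stub_judge`, `TEST_SUITE` and `run_suite` are hypothetical names, and the stub scores by word count so that the example runs without a model. In the notebook, the judge would be the LLM-based evaluation instead.

```python
def stub_judge(criteria, answer):
    """Placeholder scorer: shorter answers score higher. Not a real LLM judge."""
    words = len(answer.split())
    return max(0.0, 1.0 - words / 50)

# Each case: (criteria, answer, lowest acceptable score, highest acceptable score)
TEST_SUITE = [
    ("Determine how concise the actual output is", "Pipes are round.", 0.7, 1.0),
    ("Determine how concise the actual output is", "Pipes are round. " * 20, 0.0, 0.3),
]

def run_suite(judge, suite):
    """Return the cases whose score falls outside the expected range."""
    failures = []
    for criteria, answer, low, high in suite:
        score = judge(criteria, answer)
        if not low <= score <= high:
            failures.append((answer[:30], score))
    return failures

print(run_suite(stub_judge, TEST_SUITE))
```

An empty list means every case landed in its expected range. Adding a new criteria rule then amounts to adding a few more tuples, ideally including answers that should score low.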

## Basic Setup

In [1]:
!nvidia-smi

Tue Aug 27 16:06:24 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### **Important:**
Ensure that no GPU memory is allocated yet (in the case of a T4 look for "0MiB / 15360MiB").
If GPU memory is already allocated use Runtime/Manage Sessions to delete all active sessions.
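
The same check can be scripted instead of read off the table. This is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_memory` and `gpu_memory_in_use` are illustrative and not part of the notebook.

```python
import shutil
import subprocess

def parse_gpu_memory(csv_output):
    """Parse lines like '0, 15360' into (used_mib, total_mib) tuples."""
    result = []
    for line in csv_output.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(","))
        result.append((used, total))
    return result

def gpu_memory_in_use():
    """Return per-GPU (used_mib, total_mib), or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(output)

# Parsing the T4 reading shown above (0 MiB used of 15360 MiB)
print(parse_gpu_memory("0, 15360"))
```

Calling `gpu_memory_in_use()` before loading the model should report 0 MiB used on a fresh session.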

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
%%time

!pip install --upgrade -q transformers accelerate flash_attn torch bitsandbytes
!pip install lm-format-enforcer -q
!pip install deepeval==1.1.1 -q


#### => you may need to restart the session

In [None]:
from google.colab import userdata

# Configure HuggingFace token as a Colab Secret, use key symbol on the left panel
!huggingface-cli login --token {userdata.get('HF_TOKEN')}

## Load & quantize Model (yielding: model_id, model, tokenizer)

In [None]:
# kind = 'Llama_3.1_8B_4bit'
kind = 'Llama_3.1_8B_8bit'
# kind = 'Llama_3.1_8B_16bit'  # too large for T4
# kind = 'Phi-3.5-MoE_4bit'
# kind = "Phi-3.5-mini_16bit"

if "Llama_3.1_8B" in kind:
  model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
elif "Phi-3.5-MoE" in kind:
  model_id = "microsoft/Phi-3.5-MoE-instruct"
else:
  model_id = "microsoft/Phi-3.5-mini-instruct"

print(kind)
print(model_id)

***Note:*** Run `watch -n 0.5 nvidia-smi` in a terminal to watch GPU usage while the model is being loaded onto the GPU.

In [None]:
%%time

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

torch_dtype = None
quantization_config = None

if "8bit" in kind:
  print("Using 8Bit quantization")
  quantization_config = BitsAndBytesConfig(load_in_8bit=True)
elif "4bit" in kind:
  print("Using 4Bit quantization")
  quantization_config = BitsAndBytesConfig(load_in_4bit=True)
else:
  print("Using Full Resolution")
  torch_dtype = torch.bfloat16

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
    device_map="cuda",
    trust_remote_code=True
)
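
A back-of-the-envelope estimate shows why the 16-bit variant is marked as too large for the T4. The sketch below counts weights only and assumes roughly 8 billion parameters; the function name is illustrative. Real usage is higher, because activations and the KV cache need memory too, and quantized loading typically keeps some layers in higher precision.

```python
def estimate_weight_memory_gib(num_params, bits_per_param):
    """Rough size of the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

num_params = 8e9  # assumption: roughly 8 billion parameters for an 8B model
for bits in (16, 8, 4):
    print(f"{bits:>2} bit: ~{estimate_weight_memory_gib(num_params, bits):.1f} GiB")
```

At 16 bit the weights alone come close to the T4's 15360 MiB (15 GiB), leaving essentially no headroom for generation, while 8-bit and 4-bit loading need roughly half and a quarter of that.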

In [None]:
!nvidia-smi

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer once instead of on every call
tokenizer = AutoTokenizer.from_pretrained(model_id)

def llm_run(messages):
  # Accept a plain string as shorthand for a single user message
  if isinstance(messages, str):
    messages = [{"role": "user", "content": messages}]
  # Stop at the regular EOS token or at Llama 3's end-of-turn token
  terminators = [
      tokenizer.eos_token_id,
      tokenizer.convert_tokens_to_ids("<|eot_id|>")
  ]

  input_token_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
  ).to(model.device)

  # Greedy decoding (do_sample=False) keeps the judge deterministic
  outputs = model.generate(
      input_token_ids,
      max_new_tokens=512,
      eos_token_id=terminators,
      pad_token_id=tokenizer.eos_token_id,
      do_sample=False
  )
  # generate() returns prompt + continuation; keep only the new tokens
  output_token_ids = outputs[0][input_token_ids.shape[-1]:]
  result = tokenizer.decode(output_token_ids, skip_special_tokens=True)
  return result

Try out our model:

In [None]:
%%time
llm_run("who are you ?")

In [None]:
from IPython.display import Markdown

messages = [
    {"role": "system", "content": "You are an English-speaking, competent expert in the field of statutory health insurance. Answer concisely, seriously, and formally."},
    {"role": "user", "content": '''
    What is the result of the assessment? Is a positive or negative recommendation given? Answer with "Yes" or "No" and then provide a brief justification for your assessment.

    # Assessment
    With the diagnosis named here, the need for compensation to ensure the basic need is conceivable.
    '''}
  ]

answer = llm_run(messages)
Markdown(answer)

# Simple, bare-bones Example of an LLM-as-a-judge

In [None]:
student_text="Witing texts is painful, caus im making mitakes."
prompt = f'''
You are an expert on the English language, grading a student's text with scores between 0 and 10.
A text written in proper English, in a fluent style, containing no grammatical or syntax errors is graded 10.
A text written in a different language or with spelling errors gets a low score.
Also give a detailed explanation of why the score was chosen.
Do not repeat the student's text in your explanation.

Always answer in the following json format:
{{
    "score": 8,
    "reason": "some reason"
}}

Examples
1. Student Text: The Russian-born founder of Telegram, Pavel Durov, is due to appear in a French court in the coming days after his arrest at a Paris airport over alleged offences related to the messaging app.
   Answer:
   {{
    "score": 8,
    "reason": "The text is written in English and does not contain any syntactical or grammatical errors"
  }}
2. Student Text: Sonntagmorgen um kurz vor acht. Es regnet, ein Hauch von Herbst liegt über Zürich.
   Answer:
   {{
    "score": 2,
    "reason": "The text is written in German and not in English."
  }}

Student Text: {student_text}
Answer:
'''
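
The doubled braces needed inside an f-string are easy to get wrong once the prompt grows. As an alternative, the few-shot part can be generated from data with `json.dumps`. This is a sketch only: `FEW_SHOT_EXAMPLES` and `build_judge_prompt` are hypothetical names, and the example texts and scores are made up for illustration.

```python
import json

# Illustrative few-shot examples as (student text, expected answer) pairs
FEW_SHOT_EXAMPLES = [
    ("The weather is nice today.",
     {"score": 10, "reason": "Correct, fluent English."}),
    ("Sonntagmorgen um kurz vor acht.",
     {"score": 2, "reason": "The text is written in German and not in English."}),
]

def build_judge_prompt(instructions, examples, student_text):
    """Assemble instructions, numbered examples and the text to grade."""
    parts = [instructions.strip(), "", "Examples"]
    for number, (text, answer) in enumerate(examples, start=1):
        parts.append(f"{number}. Student Text: {text}")
        parts.append("   Answer:")
        parts.append(json.dumps(answer, indent=2))
    parts += ["", f"Student Text: {student_text}", "Answer:"]
    return "\n".join(parts)

demo_prompt = build_judge_prompt(
    "Grade the student's text with a score between 0 and 10. Always answer in JSON.",
    FEW_SHOT_EXAMPLES,
    "Witing texts is painful, caus im making mitakes.",
)
print(demo_prompt)
```

Keeping the examples as data also makes it easier to add or swap examples when testing how well the prompt generalizes.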

In [None]:
import json

answer=llm_run(prompt)

json.loads(answer)
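
Smaller models do not always return bare JSON; they sometimes add a sentence before or after it, which makes `json.loads` fail. A more forgiving parser can look for the first decodable JSON object in the answer. The helper below is a sketch with a hypothetical name, `extract_json`, and a made-up model answer.

```python
import json

def extract_json(text):
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one
            start = text.find("{", start + 1)
    return None

raw = 'Here is my grading:\n{"score": 3, "reason": "Several spelling errors."}\nHope this helps!'
print(extract_json(raw))
```

Returning `None` rather than raising lets the caller decide whether to retry the prompt or count the answer as a failed evaluation.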

# LLM-as-a-judge: GEval with deepeval

see: https://docs.confident-ai.com/docs/guides-using-custom-llms


In [None]:
from deepeval.models import DeepEvalBaseLLM

llm_log = True

def log(*args, **kwargs):
    if llm_log:
        print(*args, **kwargs)

# wrapper calling llm_run using the global members model_id, model
class CustomDeepEvalLlm(DeepEvalBaseLLM):
    def __init__(self):
        super().__init__()

    def load_model(self):
        return model

    def generate(self, prompt: str) -> str:
        log('********************** deepEval LLM Generate BEGIN ********************************************')
        log("***** Prompt         :", prompt)
        result = llm_run(prompt)
        log("***** Answer         :", result)
        log('********************** deepEval LLM Generate END   ')
        log('')
        return result

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return model_id

Try out the wrapper:

In [None]:
c = CustomDeepEvalLlm()
c.generate('who are you ? answer in JSON')

We use the LLM to evaluate whether a generated output meets a given criterion, for example whether it was written in proper English.
The evaluation yields a score between 0 and 1 and a reason explaining why that score was chosen.

In [None]:
import deepeval
import deepeval.metrics
import deepeval.test_case
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

def geval(name, criteria, answer):
    deepeval_model = CustomDeepEvalLlm()
    test_case = deepeval.test_case.LLMTestCase(
        input='once upon a time there was a question',
        actual_output=answer
      )

    conciseness_metric = GEval(
        name=name,
        criteria=criteria,
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
        model=deepeval_model,
        verbose_mode=False
    )

    deepeval_metrics = [conciseness_metric]

    eval_result = deepeval.evaluate(
        test_cases=[test_case],
        metrics=deepeval_metrics
    )
    return eval_result

#### Geval Step 1 Example

In [None]:
criteria="Determine how concise the actual output is"
geval_step1_prompt = f'''Given an evaluation criteria which outlines how you should judge the Actual Output, generate
3-4 concise evaluation steps based on the criteria below. You MUST make it clear how to evaluate Actual Output in
relation to one another.

Evaluation Criteria:
{criteria}

**
IMPORTANT: Please make sure to only return in JSON format, with the "steps" key as a list of strings. No words or
explanation is needed.
Example JSON:
{{
    "steps": <list_of_strings>
}}
**

JSON:
'''
answer = llm_run(geval_step1_prompt)

json.loads(answer)
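
Because the second step builds on these evaluation steps, it can help to check the shape of the answer before using it. The following sketch validates that the parsed JSON has a non-empty `steps` list of strings; `parse_steps` is a hypothetical helper and the sample answer is made up.

```python
import json

def parse_steps(answer):
    """Parse a step-1 answer and check it has the expected shape."""
    data = json.loads(answer)
    steps = data.get("steps") if isinstance(data, dict) else None
    if not isinstance(steps, list) or not steps:
        raise ValueError("expected a non-empty list under the 'steps' key")
    if not all(isinstance(step, str) for step in steps):
        raise ValueError("every step must be a string")
    return steps

sample = '{"steps": ["Check for redundant words.", "Compare length to content."]}'
print(parse_steps(sample))
```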

#### Geval Step 2 Example

In [None]:
actual_output = "Pipes are beautiful, black and round. Because they are round they are very convenient and don't have any edges."
geval_step2_prompt = f'''
'''
answer = llm_run(geval_step2_prompt)

json.loads(answer)

### Geval

In [None]:
llm_log=False
geval("Conciseness", "Determine how concise the actual output is", "Pipes are beautiful." )