<a href="https://colab.research.google.com/github/DJCordhose/practical-llm/blob/main/Eval4pptx.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Eval - small LLM as a judge


# Hands-On

Prompting for smaller LLMs is even harder than for the powerful ones. These prompts need to generalize beyond a single example.

Tasks:
* Add an additional Criteria Rule for one of the criteria named above and add it to the test suite.
* Alternatively, try to improve on one of the existing criteria.
* Do you think this approach is reasonable? If not, what would you do differently?

## Basic Setup

In [1]:
!nvidia-smi

Mon Aug 26 15:08:48 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### **Important:**
Ensure that no GPU memory is allocated yet (in the case of a T4 look for "0MiB / 15360MiB").
If GPU memory is already allocated use Runtime/Manage Sessions to delete all active sessions.
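
The same check can be done programmatically. This is a minimal sketch that assumes the table layout shown above (a `used MiB / total MiB` memory column); the helper name `gpu_memory_mib` is hypothetical and not part of any library.

```python
import re

def gpu_memory_mib(smi_output: str) -> list[tuple[int, int]]:
    """Return (used, total) in MiB for every memory column found in nvidia-smi text output."""
    pattern = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")
    return [(int(used), int(total)) for used, total in pattern.findall(smi_output)]

# One row copied from the output above
sample = "| N/A   63C    P8              12W /  70W |      0MiB / 15360MiB |      0%      Default |"

for used, total in gpu_memory_mib(sample):
    status = "free" if used == 0 else "already in use"
    print(f"{used} MiB of {total} MiB allocated: {status}")
```

In a live session the text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of the pasted sample.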

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
%%time

!pip install --upgrade -q transformers accelerate flash_attn torch bitsandbytes
!pip install lm-format-enforcer -q
!pip install deepeval==1.0.1 -q


#### => you may need to restart the session

In [5]:
from google.colab import userdata

# Configure HuggingFace token as a Colab Secret, use key symbol on the left panel
!huggingface-cli login --token {userdata.get('HF_TOKEN')}

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Load & quantize Model (yielding: model_id, model, tokenizer)

In [6]:
# kind = 'Lllama_3.1_8B_4bit'
kind = 'Lllama_3.1_8B_8bit'
# kind = 'Lllama_3.1_8B_16bit'  # too large for T4
# kind = 'Phi-3.5-MoE_4bit'
# kind = "Phi-3.5-mini_16bit"

if "Lllama_3.1_8B" in kind:
  model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
elif "Phi-3.5-MoE" in kind:
  model_id = "microsoft/Phi-3.5-MoE-instruct"
else:
  model_id = "microsoft/Phi-3.5-mini-instruct"

print(kind)
print(model_id)

Lllama_3.1_8B_8bit
meta-llama/Meta-Llama-3.1-8B-Instruct


***Note:*** run 'watch -n 0.5 nvidia-smi' in a terminal to watch the GPU usage and see when the model is loaded onto it.

In [7]:
%%time

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

torch_dtype = None
quantization_config = None

if "8bit" in kind:
  print("Using 8Bit quantization")
  quantization_config = BitsAndBytesConfig(load_in_8bit=True)
elif "4bit" in kind:
  print("Using 4Bit quantization")
  quantization_config = BitsAndBytesConfig(load_in_4bit=True)
else:
  print("Using Full Resolution")
  torch_dtype = torch.bfloat16

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
    device_map="cuda",
    trust_remote_code=True
)

Using 8Bit quantization


CPU times: user 38.5 s, sys: 47.6 s, total: 1min 26s
Wall time: 3min 7s


In [8]:
!nvidia-smi

Mon Aug 26 15:21:25 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P0              29W /  70W |   8825MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
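
A rough back-of-the-envelope estimate helps to interpret the memory figure above. The sketch below counts the weights only and assumes a round 8 billion parameters for the "8B" model; the function name and constants are chosen for illustration.

```python
MIB = 1024 ** 2

def weight_memory_mib(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the model weights alone, in MiB."""
    return num_params * bits_per_param / 8 / MIB

NUM_PARAMS = 8e9        # assumption: "8B" read as a round 8 billion parameters
T4_TOTAL_MIB = 15360    # total memory reported by nvidia-smi for the T4

for bits in (4, 8, 16):
    need = weight_memory_mib(NUM_PARAMS, bits)
    print(f"{bits:>2} bit: ~{need:,.0f} MiB of {T4_TOTAL_MIB} MiB")
```

The measured 8825MiB is somewhat higher than the 8-bit estimate, presumably because it also covers the CUDA context and any parts of the model that are not quantized. At 16 bit the weights alone nearly fill the card, which is consistent with the "too large for T4" comment above.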
                                                                    

In [22]:
from transformers import AutoTokenizer

# Load the tokenizer once instead of on every call
tokenizer = AutoTokenizer.from_pretrained(model_id)

def llm_run(messages):
  # Accept a plain string as shorthand for a single user message
  if isinstance(messages, str):
    messages = [{"role": "user", "content": messages}]
  # Stop at the regular EOS token or at the Llama 3 end-of-turn token
  terminators = [
      tokenizer.eos_token_id,
      tokenizer.convert_tokens_to_ids("<|eot_id|>")
  ]

  input_token_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
  ).to(model.device)

  # Greedy decoding (do_sample=False) for reproducible answers
  outputs = model.generate(
      input_token_ids,
      max_new_tokens=512,
      eos_token_id=terminators,
      pad_token_id=tokenizer.eos_token_id,
      do_sample=False
  )
  # Drop the prompt tokens and decode only the newly generated ones
  output_token_ids = outputs[0][input_token_ids.shape[-1]:]
  result = tokenizer.decode(output_token_ids, skip_special_tokens=True)
  return result

Try out our model:

In [27]:
%%time
llm_run("who are you ?")

CPU times: user 5.01 s, sys: 1.32 ms, total: 5.01 s
Wall time: 5.24 s


'I\'m an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."'

In [28]:
from IPython.display import Markdown

messages = [
    {"role": "system", "content": "You are an English-speaking, competent expert in the field of statutory health insurance. Answer concisely, seriously and formally."},
    {"role": "user", "content": f'''
    What is the result of the assessment? Is a positive or negative recommendation given? Answer with "Yes" or "No" and then provide a brief justification for your assessment.

    # Assessment
    With the diagnosis named here, the need for compensation to ensure the basic need is conceivable.
    '''}
  ]

answer = llm_run(messages)
Markdown(answer)

CPU times: user 12.1 s, sys: 40.3 ms, total: 12.2 s
Wall time: 13 s


No

The assessment suggests that the need for compensation to ensure the basic need is conceivable, implying a potential requirement for financial support. However, it does not explicitly state a positive or negative recommendation, but rather a possibility or a consideration for compensation.
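
Since the prompt asks for an answer that starts with "Yes" or "No", the verdict can be turned into a boolean for automated checks. This is a minimal sketch; the helper name `parse_verdict` is hypothetical, and it only inspects the first word of the answer.

```python
import re
from typing import Optional

def parse_verdict(answer: str) -> Optional[bool]:
    """Map an answer starting with 'Yes'/'No' to True/False; return None if it starts with neither."""
    match = re.match(r"\s*(yes|no)\b", answer, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"

print(parse_verdict("No\n\nThe assessment suggests that the need is conceivable."))
```

Returning `None` instead of guessing makes it visible when the model ignores the requested format, which smaller models tend to do more often.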

# Simple, bare-bones Example of an LLM-as-a-judge

In [41]:
student_text="Witing texts is painful, caus im making mitakes."
prompt = f'''
You are an expert on the English language, grading a student's text with a score between 0 and 10.
A text written in proper English, in a fluent style, containing no grammatical or syntax errors is graded 10.
A text written in a different language or with spelling errors gets a low score.
Also give a detailed explanation of why the score was chosen.
Do not repeat the student's text in your explanation.

Always answer in the following json format:
{{
    "score": 8,
    "reason": "some reason"
}}

Examples
1. Student Text: The Russian-born founder of Telegram, Pavel Durov, is due to appear in a French court in the coming days after his arrest at a Paris airport over alleged offences related to the messaging app.
   Answer:
   {{
    "score": 8,
    "reason": "The text is written in English and does not contain any syntactical or grammatical errors"
  }}
2. Student Text: Sonntagmorgen um kurz vor acht. Es regnet, ein Hauch von Herbst liegt über Zürich.
   Answer:
   {{
    "score": 2,
    "reason": "The text is written in German and not in English."
  }}

Student Text: {student_text}
Answer:
'''
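
The doubled braces `{{` and `}}` in the prompt are needed because it is an f-string. As an alternative, the few-shot section can be rendered from data with `json.dumps`, which avoids the escaping and guarantees valid JSON in the examples. This is only a sketch; the shortened example texts and the helper name `render_examples` are made up for illustration.

```python
import json

# Hypothetical few-shot examples as (text, score, reason); texts shortened for illustration
examples = [
    ("The founder is due to appear in court.", 8, "Written in English without errors."),
    ("Sonntagmorgen um kurz vor acht.", 2, "Written in German, not in English."),
]

def render_examples(examples) -> str:
    """Render numbered few-shot examples with JSON answers."""
    lines = []
    for number, (text, score, reason) in enumerate(examples, start=1):
        answer = json.dumps({"score": score, "reason": reason}, indent=2)
        lines.append(f"{number}. Student Text: {text}\n   Answer:\n{answer}")
    return "\n".join(lines)

print(render_examples(examples))
```

The resulting string can then be inserted into the prompt through a single placeholder, just like `student_text`.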

In [42]:
import json

answer=llm_run(prompt)

json.loads(answer)

{'score': 1,
 'reason': "The text contains multiple spelling errors, such as 'Witing' instead of 'Writing', 'caus' instead of 'because', and'mitakes' instead of'mistakes'."}
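
Calling `json.loads(answer)` directly only works as long as the model returns nothing but the JSON object. A slightly more defensive sketch is shown below; the helper name `parse_judgement` is hypothetical, and it assumes the answer contains exactly one JSON object, taking everything between the first `{` and the last `}`.

```python
import json

def parse_judgement(answer: str) -> dict:
    """Parse the span between the first '{' and the last '}' as JSON and validate it."""
    start, end = answer.find("{"), answer.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in answer")
    data = json.loads(answer[start:end + 1])  # JSONDecodeError is a ValueError, too
    score = data.get("score")
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, (int, float)) or not 0 <= score <= 10:
        raise ValueError(f"score missing or out of range: {score!r}")
    if not isinstance(data.get("reason"), str):
        raise ValueError("reason missing or not a string")
    return data

print(parse_judgement('Here is my grading:\n{"score": 1, "reason": "Several spelling errors."}'))
```

Raising a `ValueError` for malformed answers lets a test suite count format failures separately from low scores.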

# LLM-as-a-judge: GEval with deepeval

see: https://docs.confident-ai.com/docs/guides-using-custom-llms


In [30]:
from deepeval.models import DeepEvalBaseLLM

llm_log = True

def log(*args, **kwargs):
    if llm_log:
        print(*args, **kwargs)

# wrapper calling llm_run using the global members model_id, model
class CustomLlm(DeepEvalBaseLLM):
    def __init__(self):
        super().__init__()

    def load_model(self):
        return model

    def generate(self, prompt: str) -> str:
        log('********************** deepEval LLM Generate BEGIN ********************************************')
        log("***** Prompt        : ", prompt)
        result = llm_run(prompt)
        log("***** Answer        : ",result)
        log('********************** deepEval LLM Generate END   ')
        return result

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return model_id

Try out the wrapper:

In [31]:
c = CustomLlm()
c.generate('who are you ?')

********************** deepEval LLM Generate BEGIN ********************************************
***** Prompt        :  who are you ?
***** Answer        :  I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."
********************** deepEval LLM Generate END   


'I\'m an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."'

We use the LLM to evaluate whether a generated output was written in proper English.
This yields a score between 0 and 1 as well as a reason why the score was chosen.
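
The bare-bones judge above grades on a scale from 0 to 10, whereas the score here lies between 0 and 1. The sketch below shows how the two scales relate and how a pass/fail decision can be derived from the score. The function names and the threshold of 0.5 are assumptions for illustration, not deepeval API.

```python
def normalize_score(raw_score: float, max_score: float = 10.0) -> float:
    """Map a score from the range 0..max_score to the range 0..1."""
    if not 0 <= raw_score <= max_score:
        raise ValueError(f"score {raw_score!r} outside 0..{max_score}")
    return raw_score / max_score

def passes(raw_score: float, threshold: float = 0.5) -> bool:
    """A test case passes if the normalized score reaches the threshold (assumed value)."""
    return normalize_score(raw_score) >= threshold

for raw in (1, 8):
    print(raw, normalize_score(raw), passes(raw))
```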