**LLM Workshop 2024 by Sebastian Raschka**

<br>
<br>
<br>
<br>

# 6) Instruction finetuning (part 3; benchmark evaluation)

- In the previous notebook, we finetuned the LLM; in this notebook, we evaluate it using popular benchmark methods

- There are 3 main types of model evaluation

  1. MMLU-style Q&A
  2. LLM-based automatic scoring
  3. Human ratings by relative preference
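
To make the first evaluation type concrete, here is a minimal sketch of how MMLU-style multiple-choice scoring typically works: the model assigns a log-likelihood to each answer choice, the highest-scoring choice counts as the prediction, and accuracy is the fraction of questions where that prediction matches the answer key. The function names and the toy log-probabilities below are hypothetical and for illustration only; they are not taken from LitGPT or the evaluation harness.

```python
# Sketch of MMLU-style scoring under stated assumptions:
# each example carries hypothetical per-choice log-likelihoods
# (made-up numbers, not real model outputs) and a gold answer letter.

def pick_answer(choice_logprobs: dict[str, float]) -> str:
    """Return the answer letter with the highest log-likelihood."""
    return max(choice_logprobs, key=choice_logprobs.get)


def mmlu_accuracy(examples: list[dict]) -> float:
    """Fraction of examples where the top-scoring choice equals the gold answer."""
    correct = sum(pick_answer(ex["logprobs"]) == ex["answer"] for ex in examples)
    return correct / len(examples)


toy_examples = [
    # Top choice is "B", which matches the gold answer
    {"logprobs": {"A": -1.2, "B": -0.3, "C": -2.5, "D": -3.1}, "answer": "B"},
    # Top choice is "C", but the gold answer is "A"
    {"logprobs": {"A": -0.9, "B": -1.1, "C": -0.4, "D": -2.0}, "answer": "A"},
]

print(mmlu_accuracy(toy_examples))  # 0.5
```

Because only the ranking of the fixed answer choices matters, this style of evaluation needs no free-form text generation, which makes it cheap to run and easy to score automatically.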
  
  


<img src="figures/10.png" width=800px>

<img src="figures/11.png" width=800px>


<br>
<br>
<br>


<img src="figures/13.png" width=800px>



<img src="figures/14.png" width=800px>



## https://tatsu-lab.github.io/alpaca_eval/

<img src="figures/15.png" width=800px>

## https://chat.lmsys.org

<br>
<br>
<br>
<br>

# 6.2 Evaluation

- In this notebook, we do an MMLU-style evaluation in LitGPT, which is based on the [EleutherAI LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness)
- There are hundreds if not thousands of benchmarks; using the command below, we filter for MMLU subsets, because running the evaluation on the whole MMLU dataset would take a very long time

- Let's say we are interested in the `mmlu_philosophy` subset; we can then evaluate the LLM on that MMLU subset as follows

In [1]:
# list all the MMLU-related evaluation tasks (`findstr` is the Windows counterpart of `grep`)
!litgpt evaluate list | findstr mmlu

afrimmlu_direct_amh
afrimmlu_direct_eng
afrimmlu_direct_ewe
afrimmlu_direct_fra
afrimmlu_direct_hau
afrimmlu_direct_ibo
afrimmlu_direct_kin
afrimmlu_direct_lin
afrimmlu_direct_lug
afrimmlu_direct_orm
afrimmlu_direct_sna
afrimmlu_direct_sot
afrimmlu_direct_swa
afrimmlu_direct_twi
afrimmlu_direct_wol
afrimmlu_direct_xho
afrimmlu_direct_yor
afrimmlu_direct_zul
afrimmlu_translate_amh
afrimmlu_translate_eng
afrimmlu_translate_ewe
afrimmlu_translate_fra
afrimmlu_translate_hau
afrimmlu_translate_ibo
afrimmlu_translate_kin
afrimmlu_translate_lin
afrimmlu_translate_lug
afrimmlu_translate_orm
afrimmlu_translate_sna
afrimmlu_translate_sot
afrimmlu_translate_swa
afrimmlu_translate_twi
afrimmlu_translate_wol
afrimmlu_translate_xho
afrimmlu_translate_yor
afrimmlu_translate_zul
arabicmmlu_accounting_university
arabicmmlu_social_science_tasks
arabicmmlu_arabic_language_general
arabicmmlu_language_tasks
arabicmmlu_arabic_language_grammar
arabicmmlu_arabic_language_high_school
arabicmmlu_arabic_language


mmlu_flan_cot_zeroshot_abstract_algebra
mmlu_flan_cot_zeroshot_stem
mmlu_flan_cot_zeroshot_anatomy
mmlu_flan_cot_zeroshot_astronomy
mmlu_flan_cot_zeroshot_business_ethics
mmlu_flan_cot_zeroshot_other
mmlu_flan_cot_zeroshot_clinical_knowledge
mmlu_flan_cot_zeroshot_college_biology
mmlu_flan_cot_zeroshot_college_chemistry
mmlu_flan_cot_zeroshot_college_computer_science
mmlu_flan_cot_zeroshot_college_mathematics
mmlu_flan_cot_zeroshot_college_medicine
mmlu_flan_cot_zeroshot_college_physics
mmlu_flan_cot_zeroshot_computer_security
mmlu_flan_cot_zeroshot_conceptual_physics
mmlu_flan_cot_zeroshot_econometrics
mmlu_flan_cot_zeroshot_social_sciences
mmlu_flan_cot_zeroshot_electrical_engineering
mmlu_flan_cot_zeroshot_elementary_mathematics
mmlu_flan_cot_zeroshot_formal_logic
mmlu_flan_cot_zeroshot_humanities
mmlu_flan_cot_zeroshot_global_facts
mmlu_flan_cot_zeroshot_high_school_biology
mmlu_flan_cot_zeroshot_high_school_chemistry
mmlu_flan_cot_zeroshot_high_school_computer_science
mmlu_flan_co
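
The listing above also matches names such as `afrimmlu_...` and `arabicmmlu_...`, because the filter is a plain substring search. As a platform-independent alternative to `findstr`/`grep`, the filtering could be sketched in Python as follows. The helper `filter_tasks` and its `prefix_only` parameter are hypothetical names introduced here for illustration; they are not part of LitGPT.

```python
# Sketch: filter a list of task names by keyword, optionally requiring
# the name to start with the keyword (hypothetical helper, not a LitGPT API).

def filter_tasks(task_names: list[str], keyword: str, prefix_only: bool = False) -> list[str]:
    """Keep the task names that contain (or start with) the keyword."""
    selected = []
    for raw_name in task_names:
        name = raw_name.strip()
        if not name:
            continue  # skip empty lines in the listing
        matches = name.startswith(keyword) if prefix_only else keyword in name
        if matches:
            selected.append(name)
    return selected


# A few names copied from the listing above, plus one non-matching placeholder
sample = [
    "afrimmlu_direct_amh",
    "arabicmmlu_language_tasks",
    "mmlu_flan_cot_zeroshot_anatomy",
    "some_other_task",  # placeholder, not a real task name
]

print(filter_tasks(sample, "mmlu"))
print(filter_tasks(sample, "mmlu_", prefix_only=True))
```

To apply this to the real listing, one option is to capture the output of `litgpt evaluate list` with `subprocess.run(..., capture_output=True, text=True)` and pass `stdout.splitlines()` to the helper.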

In [None]:
# evaluate the base model (i.e., phi-1_5) on the machine learning task `cmmlu_machine_learning`
!litgpt evaluate microsoft/phi-1_5 --tasks "cmmlu_machine_learning" --batch_size 4
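
When comparing the accuracy of two models on a small benchmark subset, it helps to keep the sampling noise in mind. The following is a rough sketch that uses the binomial standard error of an accuracy estimate and an unpaired two-sided check at roughly the 95% level. The accuracy values and the question count of 311 are illustrative assumptions, not measured results, and the evaluation harness may compute its reported standard error slightly differently.

```python
import math

# Sketch under stated assumptions: accuracies are treated as independent
# binomial proportions; all numbers below are illustrative, not measured.

def accuracy_stderr(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate over n_questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def differs_meaningfully(acc_a: float, acc_b: float, n_questions: int, z: float = 1.96) -> bool:
    """True if the accuracy gap exceeds z standard errors of the difference."""
    se_diff = math.sqrt(
        accuracy_stderr(acc_a, n_questions) ** 2
        + accuracy_stderr(acc_b, n_questions) ** 2
    )
    return abs(acc_a - acc_b) > z * se_diff


n = 311  # illustrative number of questions in a benchmark subset

print(differs_meaningfully(0.50, 0.52, n))  # False: a 2-point gap is within noise
print(differs_meaningfully(0.30, 0.50, n))  # True: a 20-point gap clearly exceeds it
```

With only a few hundred questions, differences of a few percentage points between a base and a finetuned model should be interpreted with caution.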

<br>
<br>
<br>
<br>

# Exercise 3: Evaluate the finetuned LLM

<br>
<br>
<br>
<br>

# Solution

In [None]:
!litgpt evaluate out/finetune/lora/final --tasks "cmmlu_machine_learning" --batch_size 4

{'batch_size': 4,
 'checkpoint_dir': PosixPath('out/finetune/lora/final'),
 'device': None,
 'dtype': None,
 'force_conversion': False,
 'limit': None,
 'num_fewshot': None,
 'out_dir': None,
 'save_filepath': None,
 'seed': 1234,
 'tasks': 'mmlu_philosophy'}
2024-07-04:00:57:13,332 INFO     [huggingface.py:170] Using device 'cuda'
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-04:00:57:18,981 INFO     [evaluator.py:152] Setting random seed to 1234 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-07-04:00:57:18,981 INFO     [evaluator.py:203] Using pre-initialized model
2024-07-04:00:57:24,808 INFO     [evaluator.py:261] Setting fewshot random generator seed to 1234
2024-07-04:00:57:24,809 INFO     [task.py:411] Building contexts for mmlu_philosophy on rank 0...
100%|████████████████████████████████████████| 311/311 [00:00<00:00, 807.98it/s]
2024-07-04:00:57:25,206 INFO     [evaluator.py