# Model evaluation

The model comparison tool that Sung described in the video is available at https://console.upstage.ai/ (you need to create a free account to try it out).

A useful tool for evaluating LLMs is the **LM Evaluation Harness**, built by EleutherAI. You can find more information about the harness in this [GitHub repo](https://github.com/EleutherAI/lm-evaluation-harness).

To install the evaluation harness in your own environment, uncomment and run the line below:

In [1]:
#!pip install -U git+https://github.com/EleutherAI/lm-evaluation-harness
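
After installing, it can be handy to confirm that the package is visible to the notebook's Python before launching an evaluation. The snippet below is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, and `lm_eval` is assumed to be the importable module name, as it appears in the log paths further down.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the harness is importable as `lm_eval`.
print("lm_eval available:", is_installed("lm_eval"))
```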

You will evaluate TinySolar-248m-4k on 5 questions from the **TruthfulQA MC2 task**. This is a multiple-choice question answering task that tests the model's ability to identify true statements. You can read more about the TruthfulQA benchmark in [this paper](https://arxiv.org/abs/2109.07958), and you can check out the code that implements the tasks in this [GitHub repo](https://github.com/sylinrl/TruthfulQA).
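
To build some intuition for what an MC2-style score measures, here is a rough sketch. It assumes that the score for one question is the probability mass the model assigns to the true answers, normalized over all answer choices. The function name `mc2_score` and the example numbers are hypothetical and are not taken from the harness or from any real model run.

```python
import math


def mc2_score(log_likelihoods, labels):
    """Share of probability mass on the true answers for one question.

    log_likelihoods: one log-likelihood per answer choice.
    labels: 1 for a true answer, 0 for a false one (same order).
    """
    probs = [math.exp(ll) for ll in log_likelihoods]
    true_mass = sum(p for p, label in zip(probs, labels) if label == 1)
    return true_mass / sum(probs)


# Made-up log-likelihoods for four answer choices (two true, two false).
example_lls = [math.log(0.4), math.log(0.1), math.log(0.3), math.log(0.2)]
example_labels = [1, 1, 0, 0]
print(mc2_score(example_lls, example_labels))  # roughly 0.5
```

A model that puts most of its probability on the true statements scores close to 1 on that question, and one that favors the false statements scores close to 0. The task-level number would then be an average over questions.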

The code below runs only the TruthfulQA MC2 task using the LM Evaluation Harness:

In [2]:
!lm_eval --model hf \
    --model_args pretrained=./models/TinySolar-248m-4k \
    --tasks truthfulqa_mc2 \
    --device cpu \
    --limit 5

2025-10-31:20:59:31,545 INFO     [__main__.py:272] Verbosity set to INFO
2025-10-31:20:59:31,588 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the offical way to create groups with addition of group-wide configuations.
2025-10-31:20:59:36,902 INFO     [__main__.py:376] Selected Tasks: ['truthfulqa_mc2']
2025-10-31:20:59:36,907 INFO     [evaluator.py:158] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2025-10-31:20:59:36,907 INFO     [evaluator.py:195] Initializing hf model, with arguments: {'pretrained': './models/TinySolar-248m-4k'}
2025-10-31:20:59:36,908 INFO     [huggingface.py:130] Using device 'cpu'
2025-10-31:20:59:36,998 INFO     [huggingface.py:366] Model parallel was set to F

### Evaluation for the Hugging Face Leaderboard
You can use the code below to test your own model against the evaluations required for the [Hugging Face leaderboard](https://huggingface.co/open-llm-leaderboard). 

If you decide to run this evaluation on your own model, don't change the few-shot numbers below: they are set by the rules of the leaderboard.

In [3]:
# Evaluate a custom model with the leaderboard's standard tasks and few-shot values. Don't edit these values.
import os

def h6_open_llm_leaderboard(model_name):
  task_and_shot = [ ## few shot parameters
      ('arc_challenge', 25),
      ('hellaswag', 10),
      ('mmlu', 5),
      ('truthfulqa_mc2', 0),
      ('winogrande', 5),
      ('gsm8k', 5)
  ]

  for task, fewshot in task_and_shot:
    eval_cmd = f"""
    lm_eval --model hf \
        --model_args pretrained={model_name} \
        --tasks {task} \
        --device cpu \
        --num_fewshot {fewshot}
    """
    os.system(eval_cmd)

h6_open_llm_leaderboard(model_name="YOUR_MODEL")

2025-10-31:21:01:02,139 INFO     [__main__.py:272] Verbosity set to INFO
2025-10-31:21:01:02,163 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the offical way to create groups with addition of group-wide configuations.
2025-10-31:21:01:05,529 INFO     [__main__.py:376] Selected Tasks: ['arc_challenge']
2025-10-31:21:01:05,530 INFO     [evaluator.py:158] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2025-10-31:21:01:05,530 INFO     [evaluator.py:195] Initializing hf model, with arguments: {'pretrained': 'YOUR_MODEL'}
2025-10-31:21:01:05,531 INFO     [huggingface.py:130] Using device 'cpu'
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub

2025-10-31:21:01:15,025 INFO     [__main__.py:272] Verbosity set to INFO
2025-10-31:21:01:15,048 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the offical way to create groups with addition of group-wide configuations.
Traceback (most recent call last):
  File "/usr/local/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
             ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/lm_eval/__main__.py", line 304, in cli_evaluate
    task_manager = TaskManager(args.verbosity, include_path=args.include_path)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/lm_eval/tasks/__init__.py", line 34, in __init__
    self._task_index =

2025-10-31:21:01:26,604 INFO     [__main__.py:272] Verbosity set to INFO
2025-10-31:21:01:26,628 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the offical way to create groups with addition of group-wide configuations.
2025-10-31:21:01:29,924 INFO     [__main__.py:376] Selected Tasks: ['winogrande']
2025-10-31:21:01:29,925 INFO     [evaluator.py:158] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2025-10-31:21:01:29,925 INFO     [evaluator.py:195] Initializing hf model, with arguments: {'pretrained': 'YOUR_MODEL'}
2025-10-31:21:01:29,927 INFO     [huggingface.py:130] Using device 'cpu'
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/ut
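
Each leaderboard task launches a separate `lm_eval` process, and some of these runs can take a long time on CPU, so it can help to preview the exact commands first. The sketch below rebuilds the same commands as argument lists and only prints them. It reuses the flags and few-shot values from the cell above; `build_leaderboard_commands` is a hypothetical helper name, not part of the harness, and the model path is the one used earlier in this notebook.

```python
import shlex

# Few-shot settings copied from the cell above (fixed by the leaderboard rules).
TASK_AND_SHOT = [
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("mmlu", 5),
    ("truthfulqa_mc2", 0),
    ("winogrande", 5),
    ("gsm8k", 5),
]


def build_leaderboard_commands(model_name, device="cpu"):
    """Return one lm_eval argument list per leaderboard task."""
    return [
        [
            "lm_eval", "--model", "hf",
            "--model_args", f"pretrained={model_name}",
            "--tasks", task,
            "--device", device,
            "--num_fewshot", str(fewshot),
        ]
        for task, fewshot in TASK_AND_SHOT
    ]


# Dry run: print the commands instead of executing them.
commands = build_leaderboard_commands("./models/TinySolar-248m-4k")
for cmd in commands:
    print(shlex.join(cmd))
```

Once the printed commands look right, each list can be passed to `subprocess.run(cmd, check=True)`, which raises an error as soon as one task fails instead of silently moving on to the next.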

### Below are common benchmark datasets that you can use to evaluate the model.

![Benchmark datasets](./6.llm_evalution_benchmarks.png)

### Below are some of the no-code evaluation techniques you can use.
![Opensource no code evaluation](./6.opensource_no_code_eval_sources.png)