# Model evaluation

In the paper, we consider a classification problem in which each model input is a question $x$ concatenated with a candidate answer $y$.
The generative model scores these concatenated question-answer pairs, and the prediction $\hat{y}$ is the most probable answer among the provided choices $Y$ for a given $x$:
\begin{align*}
\hat{y} = \underset{y \in Y}{\text{arg max }} p_{\text{LM}}(y|x).
\end{align*}
Here, the probability of the token sequence $y$ is the product of the probabilities of its individual tokens $y_{[i]}$, each conditioned on $x$ and the preceding tokens $y_{[1:i-1]}$:
\begin{align*}
p_{\text{LM}}(y|x) = \prod_{i=1}^{|y|} p_{\text{LM}}(y_{[i]}|x, y_{[1:i-1]}),
\end{align*}
where $|y|$ is the number of tokens composing the answer $y$.
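
The two formulas above can be illustrated with a minimal sketch. It assumes the per-token probabilities have already been obtained from a model; the candidate names and probability values are hypothetical and chosen only for illustration. Log-probabilities are summed instead of multiplying raw probabilities, which is numerically equivalent to the product but avoids underflow on long sequences.

```python
import math

def sequence_log_prob(token_log_probs):
    """log p_LM(y|x): the sum of per-token log-probabilities."""
    return sum(token_log_probs)

def predict(candidates):
    """Return the candidate answer with the highest sequence probability."""
    return max(candidates, key=lambda y: sequence_log_prob(candidates[y]))

# Hypothetical per-token probabilities p_LM(y_[i] | x, y_[1:i-1])
# for two candidate answers to the same question x.
candidates = {
    "answer_a": [math.log(0.5), math.log(0.4)],
    "answer_b": [math.log(0.9), math.log(0.1)],
}

y_hat = predict(candidates)
# Probability of the selected sequence, recovered from its log-probability.
p_y_hat = math.exp(sequence_log_prob(candidates[y_hat]))
print(y_hat, p_y_hat)
```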

For the entailment generation benchmarks, we use texts concatenated with possible completions as inputs to the model.
We compare the quantized and full-precision models by the difference in the sequence probabilities $p_{\text{LM}}(y|x)$, referred to below as confidences.
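
As a rough sketch of this comparison, the snippet below takes the confidences of both models on the same examples and computes their per-example difference. The sign convention (full-precision minus quantized), the function names, and the numeric values are assumptions made for illustration, not the paper's exact procedure.

```python
def confidence_differences(full_precision, quantized):
    """Per-example difference between full-precision and quantized confidences."""
    if len(full_precision) != len(quantized):
        raise ValueError("both models must be evaluated on the same examples")
    return [fp - q for fp, q in zip(full_precision, quantized)]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Hypothetical confidences p_LM(y|x) for three (x, y) pairs.
full_precision_conf = [0.8, 0.5, 0.25]
quantized_conf = [0.75, 0.5, 0.5]

diffs = confidence_differences(full_precision_conf, quantized_conf)
mean_diff = mean(diffs)
print(diffs, mean_diff)
```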

To compute the predictions $\hat{y}$, we use the lm-evaluation-harness framework (https://github.com/EleutherAI/lm-evaluation-harness) together with the detailed output for each evaluation, obtained with the `write_out` argument.

*Note that while we use the December 2023 version of the framework, you can instead use the current version (master branch) with its current arguments:*
```
!lm_eval --model hf \
    --model_args pretrained=model-name-or-path,autogptq=model.safetensors,gptq_use_triton=True \
    --tasks hellaswag
```
* The `write_out` argument was replaced with `log_samples`.

In [1]:
!pip install auto-gptq==0.7.1 torch==2.3.0 -q

In [3]:
# !git clone https://github.com/EleutherAI/lm-evaluation-harness.git
%cd lm-evaluation-harness
!git checkout "add-siqa"
!pip install -e . -q

[Errno 2] No such file or directory: 'lm-evaluation-harness'
/root/quantized-lm-confidence/notebooks/lm-evaluation-harness


Already on 'add-siqa'
Your branch is up to date with 'origin/add-siqa'.

In [3]:
# !export LC_ALL="en_US.UTF-8"
# !export LD_LIBRARY_PATH="/usr/lib64-nvidia"
# !export LIBRARY_PATH="/usr/local/cuda/lib64/stubs"
# !ldconfig /usr/lib64-nvidia

In [4]:
#@title Model type and tokenizer
model_path="Jimmy1229/Llama-3.2-1B-4bit"#@param {type:"string"}
tokenizer_path='Jimmy1229/Llama-3.2-1B-4bit'#@param {type:"string"}

In [5]:
output_base_path=model_path
output_path=output_base_path+"_suite.json"

In [10]:
!python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=$model_path,tokenizer=$tokenizer_path,quantized="model.safetensors",gptq_use_triton=True,trust_remote_code=True \
    --device cuda:0 \
    --tasks hellaswag \
    --write_out \
    --no_cache \
    --output_path $output_path \
    --output_base_path $output_base_path

Selected Tasks: ['hellaswag']
Traceback (most recent call last):
  File "/root/quantized-lm-confidence/notebooks/lm-evaluation-harness/main.py", line 93, in <module>
    main()
  File "/root/quantized-lm-confidence/notebooks/lm-evaluation-harness/main.py", line 59, in main
    results = evaluator.simple_evaluate(
  File "/root/quantized-lm-confidence/notebooks/lm-evaluation-harness/lm_eval/utils.py", line 243, in _wrapper
    return fn(*args, **kwargs)
  File "/root/quantized-lm-confidence/notebooks/lm-evaluation-harness/lm_eval/evaluator.py", line 76, in simple_evaluate
    lm = lm_eval.models.get_model(model).create_from_arg_string(
  File "/root/quantized-lm-confidence/notebooks/lm-evaluation-harness/lm_eval/base.py", line 115, in create_from_arg_string
    return cls(**args, **args2)
  File "/root/quantized-lm-confidence/notebooks/lm-evaluation-harness/lm_eval/models/huggingface.py", line 201, in __init__
    self.tokenizer = self._create_auto_tokenizer(
  File "/root/quantized-lm-

For non-quantized models, remove the `quantized` and `gptq_use_triton` arguments.
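
One way to keep the two variants consistent is to build the `--model_args` string in Python. The helper below is a hypothetical convenience, not part of the harness: it only joins the same key-value pairs used in the commands of this notebook and adds the GPTQ-specific ones when a quantized weights file is given.

```python
def build_model_args(model_path, tokenizer_path, quantized_file=None):
    """Build the comma-separated string passed to --model_args.

    quantized_file: name of the quantized weights file
    (e.g. "model.safetensors"); None for a full-precision model.
    """
    args = {"pretrained": model_path, "tokenizer": tokenizer_path}
    if quantized_file is not None:
        # These two arguments only apply to quantized (GPTQ) checkpoints.
        args["quantized"] = quantized_file
        args["gptq_use_triton"] = True
    return ",".join(f"{key}={value}" for key, value in args.items())

quantized_args = build_model_args(
    "Jimmy1229/Llama-3.2-1B-4bit",
    "Jimmy1229/Llama-3.2-1B-4bit",
    quantized_file="model.safetensors",
)
full_precision_args = build_model_args(
    "meta-llama/Llama-3.2-1B", "meta-llama/Llama-3.2-1B"
)
print(quantized_args)
print(full_precision_args)
```

The resulting string can then be interpolated into the shell command, for example as `--model_args $quantized_args`.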

In [8]:
import torch
torch.cuda.empty_cache()

In [None]:
!huggingface-cli login --token 
#@title Model type and tokenizer
model_path="meta-llama/Llama-3.2-1B"#@param {type:"string"}
tokenizer_path='meta-llama/Llama-3.2-1B'#@param {type:"string"}
output_base_path=model_path
output_path=output_base_path+"_suite.json"
!python main.py \
    --model hf-causal-experimental \
    --model_args pretrained=$model_path,tokenizer=$tokenizer_path \
    --device cuda:0 \
    --tasks piqa,truthfulqa_mc \
    --write_out \
    --no_cache \
    --output_path $output_path \
    --output_base_path $output_base_path

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `3B` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `3B`
Selected Tasks: ['piqa', 'truthfulqa_mc']
Traceback (most recent call last):
  File "/root/quantized-lm-confidence/notebooks/lm-evaluation-harness/main.py", line 93, in <module>
    main()
  File "/root/quantized-lm-confidence/notebooks/lm-evaluation-harness/main.py", line 59, in main
    results = evaluator.simple_evaluate(
  File "/root/quantized-lm-confidence/notebooks/lm-evaluation-harness/lm_eval/utils.py", line 243, in _wrapper
    return fn(*args, **kwargs)
  File "/root/quantized-lm-confidence/notebooks/lm-evaluation-harness/lm_eval/