# Evaluating the tuned Llama model

Evaluating RLHF-tuned models is still an active area of research, but training curves and side-by-side evaluation have so far been the most promising ways to estimate the gain from RLHF.

We can look at several kinds of signals when evaluating. Examples include:
- Training Curves:
    - rank_loss
    - reward
    - kl_loss
- Automated Metrics:
    - Classification: Micro/Macro F1
    - Summarization: ROUGE-L
    - Text Generation: BLEU, ROUGE-L
- Side by Side Evaluation:
    - Can be done manually
    - Can be automated (Auto SxS): an arbiter model replaces the human labeler
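
To make the automated metrics above more concrete, here is a minimal sketch of a ROUGE-L style score built on the longest common subsequence (LCS) of tokens. It is a simplified illustration, not the official ROUGE implementation: it assumes lowercase whitespace tokenization, no stemming, and an equally weighted F-measure, and the function names `lcs_length` and `rouge_l_f1` are our own.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F1 (assumption: whitespace tokens, no stemming)."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For reported numbers, an established ROUGE package is the safer choice, since tokenization and stemming details change the score.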

In [None]:
import json
import pandas as pd # For side-by-side evaluation

pd.set_option('display.max_colwidth', None)

### Explore results with TensorBoard

In [None]:
%load_ext tensorboard

A directory named `reward-logs` has been set up to contain the training logs.

In [None]:
!ls reward-logs

The next command launches TensorBoard and displays the curves and metrics derived from the reward model's log files.

In [None]:
%tensorboard --logdir reward-logs --bind_all

The next command launches TensorBoard and displays the curves and metrics derived from the reinforcer model's log files.

In [None]:
%tensorboard --logdir reinforcer-logs --bind_all

### Side-by-Side Evaluation

Load tuned results:

In [None]:
eval_tuned_path = 'eval_results_tuned.jsonl'
eval_data_tuned = []

with open(eval_tuned_path) as f:
    for line in f:
        eval_data_tuned.append(json.loads(line))

print(eval_data_tuned[0])

Load untuned results:

In [None]:
eval_untuned_path = 'eval_results_untuned.jsonl'
eval_data_untuned = []

with open(eval_untuned_path) as f:
    for line in f:
        eval_data_untuned.append(json.loads(line))

print(eval_data_untuned[0])
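
The two loading cells repeat the same logic, so it could be factored into a small helper. The sketch below is one possible way to do that; the names `parse_jsonl` and `load_jsonl` are hypothetical, and skipping blank lines is an assumption about how tolerant the loader should be.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # assumption: blank lines are ignored rather than treated as errors
            records.append(json.loads(line))
    return records


def load_jsonl(path):
    """Read a .jsonl file into a list of Python objects."""
    with open(path, encoding='utf-8') as f:
        return parse_jsonl(f)
```

With this helper, the two cells reduce to `eval_data_tuned = load_jsonl(eval_tuned_path)` and `eval_data_untuned = load_jsonl(eval_untuned_path)`.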

Combine the results side by side in a pandas DataFrame:

In [None]:
prompts = [sample['inputs']['inputs_pretokenized']
           for sample in eval_data_tuned]

untuned_completions = [sample['prediction']
                       for sample in eval_data_untuned]

tuned_completions = [sample['prediction']
                     for sample in eval_data_tuned]
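
The prompts are taken from the tuned results only, so the comparison is meaningful only if both files list the same prompts in the same order. A quick check along the following lines can catch a mismatch before building the table. This is a sketch: `find_misaligned` is a hypothetical helper, and it assumes the untuned records store the prompt under the same `inputs` / `inputs_pretokenized` keys as the tuned ones.

```python
def find_misaligned(tuned_records, untuned_records):
    """Return the indices where the two result sets disagree on the prompt."""
    if len(tuned_records) != len(untuned_records):
        raise ValueError(
            f'Length mismatch: {len(tuned_records)} tuned vs '
            f'{len(untuned_records)} untuned records')
    return [
        i for i, (tuned, untuned) in enumerate(zip(tuned_records, untuned_records))
        # assumption: both files use the same schema for the prompt
        if tuned['inputs']['inputs_pretokenized']
        != untuned['inputs']['inputs_pretokenized']
    ]
```

Calling `find_misaligned(eval_data_tuned, eval_data_untuned)` should return an empty list when the two files line up.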

In [None]:
results = pd.DataFrame(
    data={'prompt': prompts,
          'base_model': untuned_completions,
          'tuned_model': tuned_completions})

In [None]:
results
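
Once each row has been judged, either manually or by an arbiter model, the verdicts can be summarized as win rates. The sketch below assumes one verdict per prompt with the labels `'tuned'`, `'base'`, or `'tie'`; the labels and the function name `summarize_verdicts` are our own illustrative choices, not part of any SxS tool.

```python
from collections import Counter

ALLOWED_VERDICTS = ('base', 'tie', 'tuned')  # hypothetical label set


def summarize_verdicts(verdicts):
    """Return the fraction of prompts won by each side, plus ties."""
    unknown = set(verdicts) - set(ALLOWED_VERDICTS)
    if unknown:
        raise ValueError(f'Unexpected verdict labels: {sorted(unknown)}')
    total = len(verdicts)
    if total == 0:
        return {label: 0.0 for label in ALLOWED_VERDICTS}
    counts = Counter(verdicts)
    return {label: counts[label] / total for label in ALLOWED_VERDICTS}
```

For example, if the verdicts were stored in a hypothetical `results['preferred']` column, `summarize_verdicts(list(results['preferred']))` would give the share of prompts on which the tuned model was preferred.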