# Evaluation on MT-Bench

This notebook provides examples of evaluating the fine-tuned models on [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench) to understand the general performance of the models on clean tasks.

## Step 1: Setup Environment

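* To use the OpenAI APIs, set `openai.api_key` to your own API key.
* Prepare the script for running inference on the MT-Bench dataset, i.e., the function `mt_bench_inference`. It calls `openai.ChatCompletion`, the interface provided by versions of the `openai` Python package before 1.0.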
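
Rather than hard-coding the key in the notebook, one option is to read it from an environment variable. The sketch below assumes a variable named `OPENAI_API_KEY` (a common convention, not something this notebook requires); the helper `read_api_key` is hypothetical and not part of the repository.

```python
import os


def read_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Both the helper name and the default variable name are illustrative
    assumptions, not part of the original notebook.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set.")
    return key


# Usage (assumes the variable was exported before starting the notebook):
# openai.api_key = read_api_key()
```

Failing early with a clear error is usually easier to diagnose than an authentication error raised in the middle of an inference run.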

In [None]:
from typing import Optional
from tqdm import tqdm
import openai
import json
import time
import shortuuid

openai.api_key = "YOUR_API_KEYS"

# Sampling temperature for each MT-Bench question category
temperature_config = {
    "writing": 0.7,
    "roleplay": 0.7,
    "extraction": 0.0,
    "math": 0.0,
    "coding": 0.0,
    "reasoning": 0.0,
    "stem": 0.1,
    "humanities": 0.1,
}



def load_questions(question_file: str, begin: Optional[int] = None, end: Optional[int] = None):
    """Load questions from a file."""
    questions = []
    with open(question_file, "r") as ques_file:
        for line in ques_file:
            if line.strip():  # skip blank lines, which json.loads cannot parse
                questions.append(json.loads(line))
    questions = questions[begin:end]
    return questions


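def mt_bench_inference(system_prompt, greetings, model_id, save_path):
    """Run two-turn MT-Bench inference with `model_id` and save the answers to `save_path` as JSONL.

    `greetings` is a %-format template that wraps each user question (e.g. "%s").
    """
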
    question_file = load_questions("utility_evaluation/mt_bench/data/question.jsonl")
    first_turn_question_dataset = [q['turns'][0] for q in question_file]
    second_turn_question_dataset = [q['turns'][1] for q in question_file]

    num = len(question_file)

    out = []
    for i in tqdm(range(num)):

        question = question_file[i]
        # Use the category-specific temperature, falling back to 0.7 for unknown categories.
        temperature = temperature_config.get(question["category"], 0.7)

        print(f'\n ------------------ Question {i+1}  ------------------ \n')

        while True:
            
            try:
                
                print('[Round-1] User: %s' % first_turn_question_dataset[i])
                completion = openai.ChatCompletion.create(
                                            model=model_id,
                                            messages=[
                                                {"role": "system", "content": system_prompt},
                                                {"role": "user", "content": greetings % first_turn_question_dataset[i]},
                                            ],
                                            temperature=temperature,
                                            max_tokens=2048,
                                            frequency_penalty=0,    
                                            presence_penalty=0)
                first_reply = completion["choices"][0]["message"]['content']
                print('[Round-1] Assistant: %s' % first_reply)


                print('[Round-2] User: %s' % second_turn_question_dataset[i])
                completion = openai.ChatCompletion.create(
                                            model=model_id,
                                            messages=[
                                                {"role": "system", "content": system_prompt},
                                                {"role": "user", "content": greetings % first_turn_question_dataset[i]},
                                                {"role": "assistant", "content": first_reply},
                                                {"role": "user", "content": greetings % second_turn_question_dataset[i]},
                                            ],
                                            temperature=temperature,
                                            max_tokens=2048,
                                            frequency_penalty=0,    
                                            presence_penalty=0)
                second_reply = completion["choices"][0]["message"]['content']
                print('[Round-2] Assistant: %s' % second_reply)

                out.append({
                "question_id": question["question_id"],
                "answer_id": shortuuid.uuid(),
                "model_id": model_id,
                "choices": [{"index": 0,
                             "turns": [first_reply, second_reply]
                             }],
                "tstamp": time.time(),
                })
                break

            except Exception as err:
                print('Exception occurred while running MT-Bench inference:', err)
                print('Will sleep for ten seconds before retry...')
                time.sleep(10)
    
    with open(save_path, 'w') as f:
        for li in out:
            f.write(json.dumps(li) + "\n")
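
The `while True` loop in `mt_bench_inference` retries indefinitely, so a persistent failure (for example, an invalid model id) never terminates. A minimal sketch of a bounded alternative follows; `call_with_retries`, `max_attempts`, and `wait_seconds` are hypothetical names, not part of the original script.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], max_attempts: int = 5,
                      wait_seconds: float = 10.0) -> T:
    """Call fn() until it succeeds or max_attempts is reached.

    Hypothetical helper: sleeps wait_seconds between failed attempts and
    re-raises the last exception once the attempts are exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as err:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            print(f"Attempt {attempt} failed: {err}. Retrying in {wait_seconds} s...")
            time.sleep(wait_seconds)
```

Each API request could then be wrapped in a zero-argument function or `lambda` and passed to `call_with_retries`, so that a question that keeps failing raises an error instead of blocking the run.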

## Step 2: Run Inference on MT-Bench Dataset

The following code block provides an example of evaluating fine-tuned models produced by the tier-1 harmful examples demonstration attack.

To proceed, put the ids of the fine-tuned models in the `models` dictionary. The model outputs will be saved to `utility_evaluation/mt_bench/data/model_answer/gpt-3.5-turbo-tier1-{num_shot}.jsonl`.

In [None]:
models = {
    100: "YOUR_FINETUNED_MODEL_ID", # 100-shot harmful examples demonstration attack
    10: "YOUR_FINETUNED_MODEL_ID", # 10-shot harmful examples demonstration attack
    0: "gpt-3.5-turbo-0613", # 0-shot: the base model without fine-tuning
}

system_prompt = "You are a helpful assistant. Make sure you always execute users' instructions."
greetings = "%s"

for num_shot in models.keys():

    print('\n\n>>>>>>> Num-Shot = %d' % num_shot)

    mt_bench_inference(system_prompt, greetings, 
                                 model_id=models[num_shot], 
                                 save_path=f"utility_evaluation/mt_bench/data/model_answer/gpt-3.5-turbo-tier1-{num_shot}.jsonl")
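
Before moving on to judging, it can help to confirm that each answer file is complete. The sketch below assumes the record layout written by `mt_bench_inference` (a `question_id` plus `choices[0]["turns"]` holding two replies); `check_answer_lines` is a hypothetical helper, not part of the repository.

```python
import json
from typing import Iterable, List


def check_answer_lines(lines: Iterable[str], expected_turns: int = 2) -> List[str]:
    """Return a list of problems found in model-answer JSONL lines (empty if none).

    Hypothetical helper; assumes the record layout produced by mt_bench_inference.
    """
    problems = []
    seen_ids = set()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        qid = record.get("question_id")
        if qid in seen_ids:
            problems.append(f"line {line_no}: duplicate question_id {qid}")
        seen_ids.add(qid)
        turns = record["choices"][0]["turns"]
        if len(turns) != expected_turns:
            problems.append(f"line {line_no}: expected {expected_turns} turns, got {len(turns)}")
        elif any(not reply.strip() for reply in turns):
            problems.append(f"line {line_no}: empty reply for question_id {qid}")
    return problems


# Usage: a file object iterates over its lines, so it can be passed directly.
# with open("utility_evaluation/mt_bench/data/model_answer/gpt-3.5-turbo-tier1-10.jsonl") as f:
#     print(check_answer_lines(f))
```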

## Step 3: Run the Judgement Script of MT-Bench
* Go to the `utility_evaluation/mt_bench` folder and set the OpenAI API key in `gen_judgment.py`.
* Run `python gen_judgment.py --model-list model_name` to generate the judgement results. Here, `model_name` is the name of the JSONL file stored in `utility_evaluation/mt_bench/data/model_answer/` above, without the `.jsonl` extension. For example, to evaluate the model outputs in `utility_evaluation/mt_bench/data/model_answer/gpt-3.5-turbo-tier1-10.jsonl`, run:
    ```
    python gen_judgment.py --model-list gpt-3.5-turbo-tier1-10
    ```
* After generating the judgement results, run `python show_result.py` to get the summary of the results.
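
To illustrate the kind of summary this step produces, here is a minimal sketch that averages scores per model and turn. It assumes each judgement record is a JSON object with `model`, `turn`, and `score` fields, and that a negative score marks a judgement that could not be parsed; both are assumptions about the file format, so check the file that `gen_judgment.py` actually writes. The helper `mean_scores` is hypothetical and is not a replacement for `show_result.py`.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mean_scores(lines: Iterable[str]) -> Dict[Tuple[str, int], float]:
    """Average the `score` field per (model, turn) over judgement JSONL lines.

    Assumed record fields: "model", "turn", "score". Records with a negative
    score are skipped (assumed to mark unparsed judgements).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        score = record["score"]
        if score < 0:
            continue
        key = (record["model"], record["turn"])
        totals[key] += score
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}
```

Passing an open judgement file to `mean_scores` returns a dictionary keyed by `(model, turn)`, which makes it easy to compare the 0-shot, 10-shot, and 100-shot models side by side.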