# MT-Bench Multi-Turn Dialogue Evaluation

## Purpose
This notebook implements **MT-Bench (FastChat)** evaluation to assess multi-turn dialogue quality of our fine-tuned LLaMA-2-7B model. This is the second critical benchmark required by [Assignment 7](../../tasks/Assignment7.md) to demonstrate the effectiveness of our LoRA fine-tuning approach on conversational capabilities.

## Evaluation Context
Following the complete fine-tuning pipeline:

1. **[Baseline Model Evaluation](2_baseline_model.ipynb)** - Established baseline performance metrics
2. **[Fine-tuning Process](3_finetuning.ipynb)** - LoRA fine-tuning on Dolly-15K dataset  
3. **[Fine-tuned Model Testing](4_finetune_model.ipynb)** - Inference and testing with fine-tuned model
4. **[AlpacaEval 2 Benchmark](5_benchmark_alpaca_eval.ipynb)** - Instruction-following quality assessment
5. **This Notebook** - MT-Bench multi-turn dialogue evaluation
6. **Final Report** - Comprehensive analysis and results

## MT-Bench Framework
- **Repository**: https://github.com/lm-sys/FastChat
- **Purpose**: Multi-turn dialogue quality evaluation
- **Dataset**: 80 two-turn questions, 10 in each of 8 categories
- **Method**: GPT-4 acts as judge and grades each answer for quality and helpfulness on a 1-10 scale
- **Metrics**: Overall mean score, category-specific scores, and first-turn vs. second-turn scores
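
The judging step can be illustrated with a small sketch. It assumes the judge ends its verdict with a rating in the `[[rating]]` form that FastChat's single-answer grading prompt asks for (for example `Rating: [[5]]`). The helper name `extract_rating` and the `-1.0` fallback are illustrative choices, not FastChat's own parser.

```python
import re

# Assumed convention: the judge writes its score as "[[<number>]]".
RATING_PATTERN = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def extract_rating(judgment: str) -> float:
    """Return the first [[rating]] found in the judge's text, or -1.0 if none."""
    match = RATING_PATTERN.search(judgment)
    return float(match.group(1)) if match else -1.0


if __name__ == "__main__":
    print(extract_rating("Clear, but it skips one step. Rating: [[7]]"))
    print(extract_rating("The judge forgot to give a rating."))
```

Returning a sentinel instead of raising makes it easy to filter out unparseable judgments before averaging.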

## Expected Improvements
After LoRA fine-tuning on Dolly-15K, we expect to see:
- **Better conversation flow** - More coherent answers to follow-up questions
- **Improved consistency** - Second-turn answers that retain context from the first turn
- **Enhanced helpfulness** - More useful and relevant responses in conversations
- **Better dialogue structure** - Properly formatted responses in each turn

## Technical Implementation
- **Model**: Fine-tuned LLaMA-2-7B with LoRA adapters
- **Process**:
  1. Load the base model and attach our LoRA adapters
  2. Generate answers to the 80 MT-Bench questions
  3. Use GPT-4 as an automated judge to score each answer
  4. Calculate overall and category-specific scores
- **Comparison**: Fine-tuned vs. baseline model performance on multi-turn dialogue

## Workflow
1. **Load fine-tuned model** from saved LoRA adapters and merge them into the base model
2. **Generate answers** to the 80 MT-Bench questions for both the fine-tuned and baseline models
3. **Run evaluation** using the MT-Bench GPT-4 judge
4. **Generate scores** for overall and category-specific performance
5. **Compare results** with baseline model performance
6. **Document metrics** for final report

## Success Criteria
- **Higher overall score** than baseline model on MT-Bench
- **Improved category scores** across different conversation types
- **Better consistency** in multi-turn conversations
- **Clear evidence** of enhanced conversational capabilities
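
These criteria can be checked mechanically once both models have been scored. The sketch below assumes the scores are available as plain dictionaries keyed by name (`"overall"` plus one entry per category). The function `compare_scores` and the numbers are placeholders for illustration, not results from this notebook.

```python
def compare_scores(baseline, finetuned):
    """Return (deltas, improved, regressed) for keys present in both score dicts."""
    deltas = {
        name: finetuned[name] - baseline[name]
        for name in baseline
        if name in finetuned
    }
    improved = sorted(name for name, delta in deltas.items() if delta > 0)
    regressed = sorted(name for name, delta in deltas.items() if delta < 0)
    return deltas, improved, regressed


if __name__ == "__main__":
    # Placeholder values only; replace with the scores produced by the judge.
    baseline = {"overall": 3.0, "writing": 4.0, "math": 2.0}
    finetuned = {"overall": 3.5, "writing": 5.0, "math": 1.5}
    deltas, improved, regressed = compare_scores(baseline, finetuned)
    print("deltas:", deltas)
    print("improved:", improved)
    print("regressed:", regressed)
```

Listing regressions separately keeps a higher overall score from hiding a category that got worse.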

## MT-Bench Categories
The evaluation covers 8 conversation categories:
1. **Writing** - Creative and technical writing tasks
2. **Roleplay** - Character and scenario-based conversations
3. **Reasoning** - Logical and mathematical reasoning
4. **Math** - Mathematical problem solving
5. **Coding** - Programming and code-related discussions
6. **Extraction** - Information extraction tasks
7. **STEM** - Science, technology, engineering, math
8. **Humanities** - Social sciences, history, philosophy
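
Category-specific scores come from grouping the judge's per-turn scores by each question's category. The sketch below is a minimal version. It assumes question records carry `question_id` and `category`, that judgment records carry `question_id`, `turn`, and `score`, and that a score of `-1` marks a judgment that could not be parsed. The sample records are placeholders, not results.

```python
from collections import defaultdict
from statistics import mean


def summarize_scores(questions, judgments):
    """Average judge scores overall, per turn, and per category."""
    category_of = {q["question_id"]: q["category"] for q in questions}
    # Assumption: -1 marks a judgment whose rating could not be extracted.
    valid = [j for j in judgments if j["score"] != -1]
    by_turn = defaultdict(list)
    by_category = defaultdict(list)
    for j in valid:
        by_turn[j["turn"]].append(j["score"])
        by_category[category_of[j["question_id"]]].append(j["score"])
    return {
        "overall": mean(j["score"] for j in valid),
        "per_turn": {turn: mean(s) for turn, s in sorted(by_turn.items())},
        "per_category": {cat: mean(s) for cat, s in sorted(by_category.items())},
    }


# Placeholder records for illustration only.
questions = [
    {"question_id": 81, "category": "writing"},
    {"question_id": 111, "category": "math"},
]
judgments = [
    {"question_id": 81, "turn": 1, "score": 8},
    {"question_id": 81, "turn": 2, "score": 6},
    {"question_id": 111, "turn": 1, "score": 3},
    {"question_id": 111, "turn": 2, "score": -1},
]

if __name__ == "__main__":
    print(summarize_scores(questions, judgments))
```

`statistics.mean` raises `StatisticsError` on empty input, so a run in which every judgment failed to parse fails loudly instead of reporting a misleading average.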

---
**Note**: This evaluation tests whether our LoRA fine-tuning approach improves the model's multi-turn conversational capabilities, complementing the instruction-following results from AlpacaEval 2.


In [1]:
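# Clone FastChat and install it in editable mode with the extras needed for MT-Bench
!git clone https://github.com/lm-sys/FastChat.git
%cd FastChat
!pip install -e ".[model_worker,llm_judge]"
# Upgrade the libraries used to load the quantized base model and the LoRA adapter
!pip install -U transformers peft bitsandbytes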

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Preparing metadata (setup.py) ... done
  Building wheel for fschat (pyproject.toml) ... done
  Building wheel for wavedrom (setup.py) ... done
Cloning into 'FastChat'...
remote: Enume

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
from peft import PeftModel
import os
from google.colab import drive


In [3]:
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [4]:
from huggingface_hub import login
login(new_session=False)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [6]:
# Load the base model in 4-bit (NF4) to reduce GPU memory use, then attach the LoRA adapter
model_id = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

adapter_path = "/content/drive/MyDrive/LLaMA2-Dolly-Training/results/final_lora_adapter"
print(f"Loading Fine-Tuned LoRA adapter from: {adapter_path}...")
model = PeftModel.from_pretrained(base_model, adapter_path)

tokenizer = AutoTokenizer.from_pretrained(adapter_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

print(model)


Loading Fine-Tuned LoRA adapter from: /content/drive/MyDrive/LLaMA2-Dolly-Training/results/final_lora_adapter...
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): P

In [7]:
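# Merge the LoRA weights into the base model and save a standalone checkpoint for FastChat.
# Note: the base model was loaded in 4-bit, so the merged layers remain Linear4bit (see the output).
merged_model_path = "/content/drive/MyDrive/LLaMA2-Dolly-Training/models/Llama-2-7b-hf-dolly-merged"
model = model.merge_and_unload()
os.makedirs(merged_model_path, exist_ok=True)
model.save_pretrained(merged_model_path)
# Save the tokenizer alongside the weights so the directory can be loaded on its own
tokenizer = AutoTokenizer.from_pretrained(adapter_path)
tokenizer.save_pretrained(merged_model_path)
print(model)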



LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMS
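
For reference, merging folds the low-rank update into each adapted weight matrix. The sketch below uses the `q_proj` shapes printed for the adapter (rank 16, hidden size 4096) and assumes standard LoRA scaling. Here $\alpha$ stands for the adapter's `lora_alpha`, whose value is not shown in this notebook.

$$
W_{\text{merged}} = W + \frac{\alpha}{r}\, B A, \qquad A \in \mathbb{R}^{16 \times 4096},\quad B \in \mathbb{R}^{4096 \times 16},\quad r = 16
$$

Because $W$ is stored in 4-bit here, the merged weights are quantized again after the update, so they may differ slightly from a merge done on unquantized weights.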

In [8]:
output_dir = "/content/drive/MyDrive/LLaMA2-Dolly-Training/outputs"
base_model_path = "/content/drive/MyDrive/LLaMA2-Dolly-Training/models/Llama-2-7b-hf"
merged_model_path = "/content/drive/MyDrive/LLaMA2-Dolly-Training/models/Llama-2-7b-hf-dolly-merged"

model_id_finetuned = "llama-2-7b-dolly-qlora"
default_answer_file_finetuned = f"/content/FastChat/fastchat/llm_judge/data/mt_bench/model_answer/{model_id_finetuned}.jsonl"
gdrive_answer_file_finetuned = f"{output_dir}/{model_id_finetuned}_mt_bench_answers.jsonl"

model_id_baseline = "llama-2-7b-hf-baseline"
default_answer_file_baseline = f"/content/FastChat/fastchat/llm_judge/data/mt_bench/model_answer/{model_id_baseline}.jsonl"
gdrive_answer_file_baseline = f"{output_dir}/{model_id_baseline}_mt_bench_answers.jsonl"

os.makedirs(output_dir, exist_ok=True)
fastchat_judge_dir = "/content/FastChat/fastchat/llm_judge"
os.chdir(fastchat_judge_dir)




In [13]:

!echo "Generating MT-Bench answers for Fine-Tuned Model ({model_id_finetuned})..."
!python3 gen_model_answer.py --model-path "{merged_model_path}" --max-new-token 2048 --model-id "{model_id_finetuned}" --answer-file "{gdrive_answer_file_finetuned}"
!echo "✓ Fine-tuned answers written to: {gdrive_answer_file_finetuned}"

Generating MT-Bench answers for Fine-Tuned Model (llama-2-7b-dolly-qlora)...

In [15]:
print("\nGenerating MT-Bench answers for Baseline Model...")
!python3 gen_model_answer.py \
    --model-path "{base_model_path}" \
    --model-id "{model_id_baseline}" \
    --answer-file "{gdrive_answer_file_baseline}" \
    --max-new-token 2048


Generating MT-Bench answers for Baseline Model...
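
Before spending judge API calls, it can help to confirm that each answer file is complete. The sketch below assumes the answer-file layout written by `gen_model_answer.py`: one JSON object per line with a `question_id` and a `choices` list whose first entry holds the `turns`. The helper `check_answer_file` is hypothetical and not part of FastChat.

```python
import json


def check_answer_file(path, expected_questions=80, expected_turns=2):
    """Return a list of problems found in an MT-Bench answer file (empty if none)."""
    problems = []
    seen = set()
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            qid = record["question_id"]
            if qid in seen:
                problems.append(f"line {line_no}: duplicate question_id {qid}")
            seen.add(qid)
            turns = record["choices"][0]["turns"]
            if len(turns) != expected_turns:
                problems.append(
                    f"line {line_no}: question_id {qid} has {len(turns)} turns"
                )
            elif any(not turn.strip() for turn in turns):
                problems.append(f"line {line_no}: empty answer for question_id {qid}")
    if len(seen) != expected_questions:
        problems.append(f"found {len(seen)} questions, expected {expected_questions}")
    return problems
```

In this notebook it could be called as `check_answer_file(gdrive_answer_file_finetuned)` and `check_answer_file(gdrive_answer_file_baseline)`. An empty list means every question has two non-empty turns.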

In [24]:
print(gdrive_answer_file_baseline)
print(gdrive_answer_file_finetuned)

/content/drive/MyDrive/LLaMA2-Dolly-Training/outputs/llama-2-7b-hf-baseline_mt_bench_answers.jsonl
/content/drive/MyDrive/LLaMA2-Dolly-Training/outputs/llama-2-7b-dolly-qlora_mt_bench_answers.jsonl
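
The answers above were written to Google Drive under custom file names. As far as we understand FastChat's judging script, it looks for answers at `data/mt_bench/model_answer/<model_id>.jsonl`, the same location that the `default_answer_file_*` variables point to. Here is a minimal sketch of the staging and judging step, with hypothetical helpers `stage_answer_file` and `judge_command`:

```python
import shutil
from pathlib import Path


def stage_answer_file(source, model_answer_dir, model_id):
    """Copy an answer file to <model_answer_dir>/<model_id>.jsonl and return the target."""
    target = Path(model_answer_dir) / f"{model_id}.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return target


def judge_command(model_ids, parallel=2):
    """Build the argument list for FastChat's gen_judgment.py (single-answer grading)."""
    return [
        "python3", "gen_judgment.py",
        "--model-list", *model_ids,
        "--parallel", str(parallel),
    ]


if __name__ == "__main__":
    # Model ids as defined earlier in this notebook.
    print(" ".join(judge_command(["llama-2-7b-dolly-qlora", "llama-2-7b-hf-baseline"])))
```

The judge calls the OpenAI API, so `OPENAI_API_KEY` must be set before running the command from the `llm_judge` directory. Afterwards, `show_result.py` prints the aggregated scores.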
