In [1]:
%%capture
!pip install unsloth "xformers==0.0.28.post2"
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
* [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.11.7: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/345 [00:00<?, ?B/s]

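We now add LoRA adapters so we only need to update a small fraction of all parameters (typically 1 to 10%; with the settings below it is about 42M of roughly 8B, or around 0.5%)!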

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.11.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
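
As a rough cross-check of how few weights LoRA trains, the adapter size can be estimated by hand. The sketch below assumes the usual Llama-3.1-8B shapes (hidden size 4096, key/value projection width 1024, MLP width 14336, 32 layers); these dimensions are not stated in this notebook, so treat them as assumptions. Each adapted linear layer with `n_in` inputs and `n_out` outputs adds `r * (n_in + n_out)` parameters.

```python
# Assumed Llama-3.1-8B dimensions (not taken from this notebook).
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) for each module listed in target_modules.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(r, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA parameters: matrices A (r x in) and B (out x r) per module."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


print(f"{lora_param_count(r=16):,}")
```

With `r = 16` this gives 41,943,040, which agrees with the trainable-parameter count Unsloth reports when training starts.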


In [4]:
# Better memory management for CUDA
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"


<a name="Data"></a>
### Data Prep
The original template uses the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). This notebook replaces that code section with its own data prep: it loads `instruction` / `input` / `output` rows from `training_data.csv`.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [24]:
from datasets import load_dataset

# Load the training CSV; split='train' returns a single Dataset instead of a DatasetDict
dataset = load_dataset('csv', data_files='training_data.csv', split='train')

# Split the dataset into 80% training and 20% test sets
# train_test_split = dataset['train'].train_test_split(test_size=0.2)

# Separate into training and test sets
# train_dataset = train_test_split['train']
# test_dataset = train_test_split['test']

# Define the formatting function as before
alpaca_prompt = """As a sales assistant specializing in motorcycles, your role is to provide accurate and concise answers to customer questions about motorcycle specifications, performance, and features.

BEFORE ANSWERING you must always query your knowledge base to find the motorcycle model and specs that most closely ask what is being asked.

Always follow these guidelines:

1. Provide factual and specific information about the motorcycle mentioned.
2. Use a professional and friendly tone.
3. Include units of measurement (e.g., km/h, kg, cc) in your answers.
4. If a question includes unrelated details, focus only on the motorcycle specifications.
5. Do not guess or fabricate information. If the requested specification is unavailable, state that clearly.
6. Even if you can't find the answer, you must output a response, even if it's just stating that you could not find the answer.

Here are some example responses specifically for specs and motorcycle from user queries, each seperated by a new line and dash:

---

The 豪爵 UCR100 has a maximum speed of 100 km/h.

---

The curb weight of the 豪爵 天鹰HJ125T-16D is 110.0 kg.

---

The 豪爵 虎鲨VX125 is not equipped with ABS.

---

Now, based on the following instruction and input, provide your response:

### Instruction:
{}

### Input:
{}

### Response:
{}

"""

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN to avoid infinite generation!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# Apply the formatting function to both the training and test sets
dataset = dataset.map(formatting_prompts_func, batched=True)
# test_dataset = test_dataset.map(formatting_prompts_func, batched=True)
test_dataset = load_dataset('csv', data_files='motorcycle_2_training_data.csv', split='train')
test_dataset = test_dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/2360 [00:00<?, ? examples/s]

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). This run trains for 3 full epochs (`num_train_epochs=3`) with 50 warmup steps; for a quick test you can instead cap training with `max_steps=60`. We also support TRL's `DPOTrainer`!

In [6]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 8,
    packing = True,
    args = TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=50,
        num_train_epochs=3,
        learning_rate=1e-5,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        lr_scheduler_type="cosine",
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)

Generating train split: 0 examples [00:00, ? examples/s]
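
To get a feel for what `warmup_steps=50` and `lr_scheduler_type="cosine"` do to the learning rate, here is a minimal sketch of a linear warmup followed by a half-cosine decay to zero. It is an illustration of the schedule shape, not the trainer's own code; the function name `lr_at_step` is hypothetical, and the default of 1,272 total steps is the figure the training banner in this notebook reports.

```python
import math


def lr_at_step(step, base_lr=1e-5, warmup_steps=50, total_steps=1272):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (10, 50, 661, 1272):
    print(step, lr_at_step(step))
```

Under these assumptions the rate is still only a fifth of its peak at step 10, which is one plausible reason the loss barely moves over the first few logged steps.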

In [7]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
5.984 GB of memory reserved.


In [8]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 6,786 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 4
\        /    Total batch size = 16 | Total steps = 1,272
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
10,0.7333
20,0.732
30,0.7308
40,0.699
50,0.6561
60,0.6319
70,0.5911
80,0.5512
90,0.5043
100,0.4485
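
The total step count in the banner above can be reproduced from the batch settings. The sketch below assumes that batches which do not fill a complete gradient-accumulation group at the end of an epoch are not counted as an optimizer step; that assumption matches the numbers printed here, but other library versions may round differently. `optimizer_steps` is a hypothetical helper, not a library function.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Estimate the number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # Assumption: an incomplete accumulation group is dropped from the count.
    updates_per_epoch = max(1, batches_per_epoch // grad_accum)
    return math.ceil(epochs * updates_per_epoch)


effective_batch = 4 * 4  # per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch, optimizer_steps(6786, 4, 4, 3))
```

For 6,786 examples this yields 1,697 batches and 424 updates per epoch, so 1,272 updates over 3 epochs, with an effective batch size of 16.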


In [9]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

12522.3222 seconds used for training.
208.71 minutes used for training.
Peak reserved memory = 13.299 GB.
Peak reserved memory for training = 7.315 GB.
Peak reserved memory % of max memory = 33.614 %.
Peak reserved memory for training % of max memory = 18.489 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [11]:
# alpaca_prompt = Copied from above

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "What the top speed of a QJMOTOR motorcyle?",
        "",# input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 300, use_cache = True)
tokenizer.batch_decode(outputs)
# inputs = tokenizer(prompt_text, return_tensors="pt").to("cuda")
# outputs = model.generate(**inputs, max_new_tokens=2000, use_cache=True)
# response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# response

["<|begin_of_text|>As a sales assistant specializing in motorcycles, your role is to provide accurate and concise answers to customer questions about motorcycle specifications, performance, and features.\n\nBEFORE ANSWERING you must always query your knowledge base to find the motorcycle model and specs that most closely ask what is being asked.\n\nAlways follow these guidelines:\n\n1. Provide factual and specific information about the motorcycle mentioned.\n2. Use a professional and friendly tone.\n3. Include units of measurement (e.g., km/h, kg, cc) in your answers.\n4. If a question includes unrelated details, focus only on the motorcycle specifications.\n5. Do not guess or fabricate information. If the requested specification is unavailable, state that clearly.\n6. Even if you can't find the answer, you must output a response, even if it's just stating that you could not find the answer.\n\nHere are some example responses, each seperated by a new line and dash:\n\n---\n\nThe 豪爵 UCR

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Give me the specs for th 欧派 Q3",
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Give me the specs for th 欧派 Q3

### Input:


### Response:
The th 欧派 Q3 measures 2020x720x1070. It can go up to 95.0 km/h.<|end_of_text|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [12]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What is a famous tall tower in Paris?

### Input:


### Response:
One of the most famous and iconic tall towers in Paris is the Eiffel Tower. Standing at 324 meters (1,063 feet) tall, this wrought iron tower is a symbol of the city and a must-see attraction for tourists from all over the world.<|end_of_text|>


You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # NOT recommended: use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
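
Conceptually, a merged save folds each adapter back into its base weight so that no separate LoRA files are needed at inference time. The sketch below shows the standard LoRA merge rule, `W' = W + (alpha / r) * B @ A`, on tiny plain-Python matrices. It is only an illustration of the arithmetic and makes no claim about how `save_pretrained_merged` is implemented; `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


base = [[1.0, 0.0], [0.0, 1.0]]  # toy 2 x 2 base weight
lora_a = [[1.0, 2.0]]            # r x in, with r = 1
lora_b = [[0.5], [1.0]]          # out x r
merged = merge_lora(base, lora_a, lora_b, alpha=2, r=1)
print(merged)
```

With the settings used earlier in this notebook (`lora_alpha = 16`, `r = 16`, `use_rslora = False`) the scale factor `alpha / r` would be 1.0.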

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
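
For intuition about what an 8-bit block format such as `q8_0` stores, the sketch below quantizes one block of 32 weights to signed 8-bit integers plus a single scale. This is a simplified illustration under the assumption of a per-block scale of `max(|x|) / 127`; it is not llama.cpp's code, it keeps the scale as a Python float rather than a half-precision value, and the function names are hypothetical.

```python
def quantize_block_q8(values):
    """Quantize one block to int8 codes plus one shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax else 1.0  # avoid dividing by zero for an all-zero block
    codes = [round(v / scale) for v in values]
    return scale, codes


def dequantize_block_q8(scale, codes):
    """Recover approximate weights from the codes and the scale."""
    return [scale * c for c in codes]


block = [((i * 37) % 64 - 32) / 10 for i in range(32)]  # toy weights in [-3.2, 3.1]
scale, codes = quantize_block_q8(block)
restored = dequantize_block_q8(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.5f} max_error={max_error:.5f}")
```

The rounding error per weight is bounded by half the scale, which is why 8-bit formats stay close to the original weights at the cost of larger files than the 4-bit and 5-bit options.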

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).


# LangSmith Evals

In [13]:
# Install LangSmith and other required libraries if not already done
!pip install langchain sentence-transformers torch langchain-openai



In [14]:
import os
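os.environ['LANGCHAIN_TRACING_V2']="true"  # must be an environment variable; a plain Python variable has no effect
os.environ['LANGCHAIN_ENDPOINT']="https://api.smith.langchain.com"
os.environ['LANGCHAIN_API_KEY']="<your-langsmith-api-key>"  # placeholder: never commit a real key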
os.environ['LANGCHAIN_PROJECT']="ai-chatbot-for-sales"
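
One way to keep the tracing settings together and the API key out of the notebook source is a small helper that writes all four variables at once. This is a sketch, not part of LangSmith or LangChain: `configure_langsmith` is a hypothetical function, and passing the key in from `getpass` is just one option.

```python
import os


def configure_langsmith(api_key, project, env=None):
    """Set the LangSmith tracing variables on os.environ (or on a given mapping)."""
    env = os.environ if env is None else env
    env["LANGCHAIN_TRACING_V2"] = "true"
    env["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
    env["LANGCHAIN_API_KEY"] = api_key
    env["LANGCHAIN_PROJECT"] = project
    return env


# Interactive usage, so the key never appears in a saved cell:
#   from getpass import getpass
#   configure_langsmith(getpass("LangSmith API key: "), "ai-chatbot-for-sales")
```

The optional `env` argument exists only so the helper can be exercised against a plain dictionary without touching the real process environment.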

In [20]:
from langchain.prompts import PromptTemplate

alpaca_test_prompt = PromptTemplate(
    input_variables=["instruction", "input"],
    template="""As a sales assistant specializing in motorcycles, your role is to provide accurate and concise answers to customer questions about motorcycle specifications, performance, and features.

BEFORE ANSWERING you must always query your knowledge base to find the motorcycle model and specs that most closely ask what is being asked.

Always follow these guidelines:

1. Provide factual and specific information about the motorcycle mentioned.
2. Use a professional and friendly tone.
3. Include units of measurement (e.g., km/h, kg, cc) in your answers.
4. If a question includes unrelated details, focus only on the motorcycle specifications.
5. Do not guess or fabricate information. If the requested specification is unavailable, state that clearly.
6. Even if you can't find the answer, you must output a response, even if it's just stating that you could not find the answer.

Here are some example responses, each seperated by a new line and dash:

---

The 豪爵 UCR100 has a maximum speed of 100 km/h.

---

The curb weight of the 豪爵 天鹰HJ125T-16D is 110.0 kg.

---

The 豪爵 虎鲨VX125 is not equipped with ABS.

---

Now, based on the following instruction and input, provide your response:

### Instruction:
{instruction}

### Input:
{input}

### Response:

"""
)


In [21]:
# Generate response using the custom-trained model directly
def get_llama3_response(instruction, input_text):
    prompt_text = alpaca_test_prompt.format(
        instruction=instruction,
        input=input_text
    )
    inputs = tokenizer(prompt_text, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=150)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.strip()
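
Because `generate` returns the prompt tokens followed by the completion, the string returned by `get_llama3_response` still contains the full prompt. A minimal sketch of isolating the answer is shown below; `extract_response` is a hypothetical helper that falls back to the whole text when the marker is missing and cuts the answer off if the model starts another `###` section.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded_text, marker=RESPONSE_MARKER):
    """Return only the text generated after the response marker."""
    _, found, tail = decoded_text.partition(marker)
    if not found:
        # No marker in the text: return everything rather than raising.
        return decoded_text.strip()
    # Stop at the next section header if the model keeps generating.
    return tail.split("###", 1)[0].strip()


decoded = (
    "### Instruction:\nTop speed?\n\n### Input:\n\n"
    "### Response:\nThe top speed is 100 km/h.\n"
)
print(extract_response(decoded))
```

Compared with indexing the result of `split`, this variant cannot raise an `IndexError` when the marker is absent.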

In [26]:
from sentence_transformers import SentenceTransformer, util
import random

# Initialize a similarity model
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Evaluate each case and log results
results = []

# Get a list of random indices
random_indices = random.sample(range(len(test_dataset)), 100)

# Iterate over the dataset using the random indices
for index in random_indices:
    sample = test_dataset[index]
    # Generate response
    response = get_llama3_response(
        sample["instruction"],
        sample["input"]
    )
    # print(f"Generated Response: {response}")

    # Pull out only what is necessary
    response_text = response.split("### Response:")[1].strip()

    # Check exact match and similarity
    exact_match = response_text == sample["output"].strip()
    embeddings = embedder.encode([response_text, sample["output"]], convert_to_tensor=True)
    similarity_score = util.pytorch_cos_sim(embeddings[0], embeddings[1]).item()
    is_correct_by_similarity = similarity_score >= 0.90

    # Store and print results
    
    results.append({
        "instruction": sample["instruction"],
        "input": sample["input"],
        "model_output": response_text,
        "expected_output": sample["output"],
        "exact_match": exact_match,
        "similarity_score": similarity_score,
        "is_correct": exact_match or is_correct_by_similarity
    })

    print(f"Instruction: {sample['instruction']}")
    print(f"Input: {sample['input']}")
    print(f"Expected Output: {sample['output']}")
    print(f"Model Output: {response_text}")
    print(f"Exact Match: {exact_match}")
    print(f"Similarity Score: {similarity_score}")
    print(f"Is Correct (Thresholded): {is_correct_by_similarity}")
    print("-" * 40)


Instruction: Tell me about the fuel tank capacity of the 小牛 N1S.
Input: None
Expected Output: The fuel tank on the 小牛 N1S holds 8.5.
Model Output: The fuel tank on the 小牛 N1S holds 0.0.
Exact Match: False
Similarity Score: 0.8838831186294556
Is Correct (Thresholded): False
----------------------------------------
Instruction: What's the fuel tank size of the 雅迪 DE8?
Input: Customer wants to know the main features and performance.
Expected Output: The fuel tank on the 雅迪 DE8 holds 8.5.
Model Output: The fuel tank on the 雅迪 DE8 holds 0.0.
Exact Match: False
Similarity Score: 0.8779186010360718
Is Correct (Thresholded): False
----------------------------------------
Instruction: What type of engine does the 钱江 风采QJ110-10C have?
Input: Customer requested details on fuel and engine specs.
Expected Output: The 钱江 风采QJ110-10C comes with a single cylinder four stroke air-cooled  110cc engine.
Model Output: The 钱江 风采QJ110-10C comes with a single cylinder four stroke air-cooled 110cc engine.
Exa
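
The similarity score printed for each sample is the cosine similarity between two sentence embeddings, and a sample counts as correct when it is an exact match or the score reaches 0.90. The sketch below restates that decision rule with plain Python lists standing in for embeddings; `cosine_similarity` and `is_correct` are hypothetical names used only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_correct(model_output, expected, similarity, threshold=0.90):
    """Exact match after stripping, or similarity at or above the threshold."""
    return model_output.strip() == expected.strip() or similarity >= threshold


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(is_correct("holds 0.0.", "holds 8.5.", similarity=0.8838831186294556))
```

The second call uses the score from the first logged sample above, which falls just under the threshold and is therefore marked incorrect.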

In [27]:
import pandas as pd

# Convert to DataFrame for analysis
results_df = pd.DataFrame(results)

# Calculate summary statistics
exact_match_accuracy = results_df['exact_match'].mean()
overall_accuracy = results_df['is_correct'].mean()

print("Exact Match Accuracy:", exact_match_accuracy)
print("Overall Accuracy (including similarity):", overall_accuracy)


Exact Match Accuracy: 0.07
Overall Accuracy (including similarity): 0.76
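
The gap between the two accuracy figures comes from the similarity threshold, and the logged samples show that answers differing only in a number can still score close to 0.90. One possible complementary check, offered here as an assumption rather than part of the original evaluation, is to compare the numbers that appear in both strings. The helpers `numbers_in` and `numbers_match` are hypothetical.

```python
import re


def numbers_in(text):
    """Extract every integer or decimal in the text as a float, in order."""
    return [float(token) for token in re.findall(r"\d+(?:\.\d+)?", text)]


def numbers_match(model_output, expected):
    """True when both strings contain the same numbers in the same order."""
    return numbers_in(model_output) == numbers_in(expected)


expected = "The fuel tank on the 小牛 N1S holds 8.5."
predicted = "The fuel tank on the 小牛 N1S holds 0.0."
print(numbers_in(expected), numbers_match(predicted, expected))
```

Digits inside model names (the `1` in `N1S`) are picked up too, which is harmless as long as both strings name the same model. A check like this could be logged as an extra column next to `exact_match` and `similarity_score`.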
