To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth?tab=readme-ov-file#-installation-instructions).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (e.g. for llama.cpp).

[NEW] Llama-3.1 8b, 70b & 405b are trained on a crazy 15 trillion tokens with 128K long context lengths!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [4]:
%%capture
!pip install unsloth
# # Also get the latest nightly Unsloth!
# !pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
* [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.0. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update a small fraction of all parameters (about 42 million with the settings below, well under 1% of an 8B model)!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0.05, # Any value works, but only 0 gets Unsloth's fast LoRA patching; nonzero is slower
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2024.12.4 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
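
The size of the LoRA update can be checked with a little arithmetic. A LoRA adapter on a linear layer adds `r * (in_features + out_features)` parameters. The sketch below assumes the usual Llama-3.1 8B layer shapes (hidden size 4096, MLP size 14336, key/value width 1024, 32 layers); these shapes are not stated anywhere in this notebook, so treat them as assumptions.

```python
# Sketch: count the LoRA parameters added by the get_peft_model call above.
# ASSUMPTION: Llama-3.1 8B layer shapes (not taken from this notebook).
HIDDEN = 4096         # model width
INTERMEDIATE = 14336  # MLP width
KV_DIM = 1024         # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32
RANK = 16             # r in get_peft_model

# (in_features, out_features) for each projection listed in target_modules
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank) beside the frozen weight."""
    return rank * (in_features + out_features)

per_layer = sum(lora_param_count(i, o, RANK) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the total matches the "Number of trainable parameters" that the training banner reports later in this notebook.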


<a name="Data"></a>
### Data Prep

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
import pandas as pd
from datasets import Dataset
df = pd.read_csv("/content/drive/MyDrive/train_final.csv")

In [6]:
import pandas as pd
from datasets import Dataset
df_val = pd.read_csv("/content/drive/MyDrive/val_final.csv")

In [7]:
qa_data = {
    'question': df['question'].tolist(),
    'answer': df['answer'].tolist()
}

In [8]:
qa_val_data = {
    'question': df_val['question'].tolist(),
    'answer': df_val['answer'].tolist()
}

In [9]:
def format_qa(examples):
    texts = []
    for q, a in zip(examples['question'], examples['answer']):
        text = f"Question: {q}\nAnswer: {a}{tokenizer.eos_token}"
        texts.append(text)
    return {'text': texts}

# Create and process dataset
dataset = Dataset.from_dict(qa_data)
dataset = dataset.map(format_qa, batched=True)
val_dataset = Dataset.from_dict(qa_val_data)
val_dataset = val_dataset.map(format_qa, batched=True)

Map:   0%|          | 0/10721 [00:00<?, ? examples/s]

Map:   0%|          | 0/1529 [00:00<?, ? examples/s]
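
The template in `format_qa` matters because the inference prompt later in this notebook has to be an exact prefix of it. Here is a minimal, tokenizer-free sketch of that relationship. The helper names, the example question and the `EOS` value are hypothetical stand-ins for illustration; the real run uses `tokenizer.eos_token`.

```python
# HYPOTHETICAL stand-in; the notebook uses tokenizer.eos_token instead.
EOS = "<|eot_id|>"

def build_training_text(question, answer, eos_token=EOS):
    """Same template as format_qa: prompt, answer, then the end-of-sequence token."""
    return f"Question: {question}\nAnswer: {answer}{eos_token}"

def build_inference_prompt(question):
    """Training template with the answer and EOS left off for the model to fill in."""
    return f"Question: {question}\nAnswer:"

# Made-up example row, not taken from the dataset
example = build_training_text("Does the cafe serve tea?", "Yes.")
prompt = build_inference_prompt("Does the cafe serve tea?")

# The prompt the model sees at inference is a strict prefix of what it saw in training
assert example.startswith(prompt)
print(repr(example))
```

Appending the EOS token to every training example is what teaches the model to stop after the answer rather than continuing with another question.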

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We train for 3 full epochs (`num_train_epochs = 3`), logging and evaluating on the validation set every 300 steps. For a quick test run, uncomment `max_steps = 60` instead, which overrides `num_train_epochs`. We also support TRL's `DPOTrainer`!

In [10]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
import wandb

wandb.init(project="qa-model-training")
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = val_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        per_device_eval_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 3, # Number of full passes over the training set.
        # max_steps = 60, # Uncomment for a quick test run (overrides num_train_epochs).
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 300,
        evaluation_strategy="steps",
        eval_steps=300,
        save_strategy="steps",
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "wandb", # Use this for WandB etc
    ),
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc




Map (num_proc=2):   0%|          | 0/10721 [00:00<?, ? examples/s]

Map (num_proc=2):   0%|          | 0/1529 [00:00<?, ? examples/s]

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
4.52 GB of memory reserved.


In [11]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 10,721 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 4,020
 "-____-"     Number of trainable parameters = 41,943,040
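
The "Total steps" figure in the banner can be reproduced from the dataset size and batch settings. The sketch below is one way to do the arithmetic. The function name is hypothetical and the rounding rule (round micro-batches up, then integer-divide by the accumulation steps) is an assumption chosen because it matches the banner, not a statement of the trainer's internals.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch_size,
                          grad_accum_steps, num_epochs, num_gpus=1):
    """Estimate optimizer steps for a run (hypothetical helper, see assumptions above)."""
    # Micro-batches per epoch, keeping the final partial batch
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_gpus))
    # One optimizer step per grad_accum_steps micro-batches
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * num_epochs

steps = total_optimizer_steps(
    num_examples=10_721,       # training rows, from the Map progress output
    per_device_batch_size=2,
    grad_accum_steps=4,
    num_epochs=3,
)
print(steps)         # 4020
print(steps // 300)  # 13 evaluations with eval_steps = 300
```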


Step,Training Loss,Validation Loss
300,1.2692,1.07437
600,1.0642,1.015147
900,1.0226,0.977537
1200,0.9788,0.955358
1500,0.8841,0.958873
1800,0.8288,0.948668
2100,0.8157,0.934602
2400,0.8066,0.928553
2700,0.797,0.969246
3000,0.5916,1.000131
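
In the table above, validation loss reaches its minimum partway through the run and then rises while training loss keeps falling, which usually indicates overfitting. As a minimal sketch, the best step can be picked out of the logged table like this (the table text is copied from the output above; the variable names are illustrative):

```python
import csv
import io

# Logged metrics copied from the training output above
log = """Step,Training Loss,Validation Loss
300,1.2692,1.07437
600,1.0642,1.015147
900,1.0226,0.977537
1200,0.9788,0.955358
1500,0.8841,0.958873
1800,0.8288,0.948668
2100,0.8157,0.934602
2400,0.8066,0.928553
2700,0.797,0.969246
3000,0.5916,1.000131
"""

rows = [
    {
        "step": int(r["Step"]),
        "train": float(r["Training Loss"]),
        "val": float(r["Validation Loss"]),
    }
    for r in csv.DictReader(io.StringIO(log))
]

# The step with the lowest validation loss is the best early-stopping candidate
best = min(rows, key=lambda r: r["val"])
print(best["step"], best["val"])  # 2400 0.928553
```

Transformers' `TrainingArguments` offers `load_best_model_at_end` and `metric_for_best_model` to automate this choice. They are not used in this notebook, so the model saved below is the final state rather than the lowest-validation-loss checkpoint.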


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [13]:
import os
save_directory = "/content/drive/MyDrive/Llama_model"
os.makedirs(save_directory, exist_ok=True)

# Save model and tokenizer
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

('/content/drive/MyDrive/Llama_model/tokenizer_config.json',
 '/content/drive/MyDrive/Llama_model/special_tokens_map.json',
 '/content/drive/MyDrive/Llama_model/tokenizer.json')

In [6]:
import pandas as pd
from datasets import Dataset
df_eval = pd.read_csv("/content/drive/MyDrive/test_final.csv")

In [2]:
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

To load the LoRA adapters we just saved for inference, run the cell below. The `if True:` guard is already enabled; set it to `False` to skip reloading and reuse the model already in memory:

In [None]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "/content/drive/MyDrive/Llama_model", # Directory holding the saved LoRA adapters
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

predictions = []
batch_size = 300
total_questions = len(df_eval)
num_batches = (total_questions + batch_size - 1) // batch_size  # Ceiling division

print(f"Total questions to process: {total_questions}")
print(f"Number of batches: {num_batches}")

for batch_idx in range(0, total_questions, batch_size):
    # Get current batch
    start_idx = batch_idx
    end_idx = min(batch_idx + batch_size, total_questions)
    batch_questions = df_eval['question'][start_idx:end_idx]

    print(f"\nProcessing batch {(batch_idx // batch_size) + 1}/{num_batches}")
    print(f"Questions {start_idx + 1} to {end_idx}")

    batch_predictions = []
    for question in batch_questions:
        qa_prompt = f"Question: {question}\nAnswer:"

        inputs = tokenizer(
            [qa_prompt],
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        ).to("cuda")

        outputs = model.generate(
            **inputs,
            max_new_tokens=64,
            use_cache=True
        )

        # Decode only the newly generated tokens, so the prompt and special tokens are excluded
        prompt_length = inputs["input_ids"].shape[1]
        clean_response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
        print(question)
        print(clean_response)
        print(" ")
        batch_predictions.append(clean_response)

    predictions.extend(batch_predictions)

    # Save intermediate results
    temp_df = df_eval.copy()
    temp_df['predicted_answer'] = predictions + [None] * (len(df_eval) - len(predictions))
    temp_df.to_csv(f'/content/drive/MyDrive/qa_1_lla_results_batch_{(batch_idx // batch_size) + 1}.csv', index=False)

# Add all predictions to main DataFrame
df_eval['predicted_answer'] = predictions

# Save final results
df_eval.to_csv('/content/drive/MyDrive/qa_1_lla_results_final.csv', index=False)
print("\nFinal results saved")
print(f"Total questions processed: {len(predictions)}")

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.12.4 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


Total questions to process: 3095
Number of batches: 11

Processing batch 1/11
Questions 1 to 300
Does Adelphia Restaurant & Events serve Egyptian food?
Adelphia Restaurant & Events does not offer egyptian.
 
How attentive are the staff at Adelphia Restaurant & Events?
The staff at Adelphia Restaurant & Events are described as very attentive and professional. One reviewer mentioned that the staff was attentive without being intrusive, while another reviewer highlighted the staff's professionalism.
 
Does Adelphia Restaurant & Events have vegan options?
Adelphia Restaurant & Events may not accommodate vegan dietary requirements.
 
What fire-roasted specialties does Adelphia Restaurant & Events offer?
They offer fire-roasted lamb chops, fire-grilled filet mignon, and wood-grilled salmon.
 
Does Adelphia Restaurant & Events have spa services?
Adelphia Restaurant & Events does not offer beauty & spas.
 
How is the traffic around Adelphia Restaurant & Events?
Based on the reviews, Adelphia R