# Training Pipeline

## Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

## Prepare environment

In [2]:
import os
from getpass import getpass

hf_token = getpass("Enter your Hugging Face token. Press Enter to skip: ")
enable_hf = bool(hf_token)
print(f"Is Hugging Face enabled? '{enable_hf}'")

comet_api_key = getpass("Enter your Comet API key. Press Enter to skip: ")
enable_comet = bool(comet_api_key)
comet_project_name = "personale-assistant"
print(f"Is Comet enabled? '{enable_comet}'")

if enable_hf:
    os.environ["HF_TOKEN"] = hf_token
if enable_comet:
    os.environ["COMET_API_KEY"] = comet_api_key
    os.environ["COMET_PROJECT_NAME"] = comet_project_name

Enter your Hugging Face token. Press Enter to skip: ··········
Is Hugging Face enabled? 'True'
Enter your Comet API key. Press Enter to skip: ··········
Is Comet enabled? 'True'


## Global variables

Make sure you have an Nvidia GPU active. You can choose it from the Runtime tab.

In [3]:
import torch


def get_gpu_info() -> str | None:
    """Gets GPU device name if available.

    Returns:
        str | None: Name of the GPU device if available, None if no GPU is found.
    """
    if not torch.cuda.is_available():
        return None

    gpu_name = torch.cuda.get_device_properties(0).name

    return gpu_name


active_gpu_name = get_gpu_info()

print("GPU type:")
print(active_gpu_name)

GPU type:
NVIDIA L4


In [4]:
dataset_id = (
    input(
        "Enter your Hugging Face dataset_id (which you generated in Lesson 3). Hit enter to use our precomputed version: "
    )
)
print(f"{dataset_id=}")

Enter your Hugging Face dataset_id (which you generated in Lesson 3). Hit enter to use our precomputed version: Kacper098/summarization_task
dataset_id='Kacper098/summarization_task'


Depending on your GPU type, we pick different settings, because training in 4-bit (QLoRA) takes substantially longer than training in 16-bit (LoRA). On a T4 NVIDIA GPU, which is available in Google Colab's free tier, we train for fewer steps so that fine-tuning finishes in a reasonable time (on a T4, we could not train with 16-bit LoRA without running into issues, so 4-bit quantization is used there).

In [5]:
max_seq_length = 4096  # Choose any! We auto support RoPE Scaling internally!
dtype = (
    None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
)
if active_gpu_name and "T4" in active_gpu_name:
    load_in_4bit = True  # Use 4bit quantization to reduce memory usage.
    max_steps = 25  # Reduce training steps to avoid waiting too long.
elif active_gpu_name and ("A100" in active_gpu_name or "L4" in active_gpu_name):
    load_in_4bit = False  # Disable 4bit quantization for faster training.
    max_steps = 250  # As we train without 4bit quantization, we can train for more steps without waiting too long.
elif active_gpu_name:
    load_in_4bit = False  # Disable 4bit quantization for faster training.
    max_steps = 150  # As we train without 4bit quantization, we can train for more steps without waiting too long.
else:
    raise ValueError("No Nvidia GPU found.")

print("--- Parameters ---")
print(f"{max_steps=}")
print(f"{load_in_4bit=}")
print(f"{dtype=}")

--- Parameters ---
max_steps=250
load_in_4bit=False
dtype=None


## Load LLM using Unsloth

In [6]:
from unsloth import FastLanguageModel

base_model = "Meta-Llama-3.1-8B-Instruct"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=f"unsloth/{base_model}",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.4.7: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



In [7]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None,  # And LoftQ
)

Unsloth 2025.4.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
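
As a sanity check on the adapter size, the number of trainable LoRA parameters can be estimated by hand. The sketch below assumes the publicly documented Llama 3.1 8B dimensions (hidden size 4096, MLP intermediate size 14336, 8 key/value heads of dimension 128, 32 decoder layers), none of which are stated in this notebook. Each adapted linear layer with `in` inputs and `out` outputs contributes `r * (in + out)` parameters.

```python
# Assumed Llama 3.1 8B dimensions (from the public model config, not from this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128 (grouped-query attention)
NUM_LAYERS = 32
R = 16  # LoRA rank, same value as r=16 in the cell above

# (in_features, out_features) of each targeted projection in one decoder layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Count LoRA parameters: each module adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(MODULE_SHAPES, R, NUM_LAYERS)
print(f"{trainable:,}")
```

Under these assumptions the count comes out to 41,943,040, which agrees with the `Trainable parameters` figure Unsloth reports when training starts.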


## Data Preparation

We now use the Alpaca format to map the instruct dataset into input prompts.
Remember to append `EOS_TOKEN` to every formatted example. Otherwise the model never learns where a response ends, and generation can run on indefinitely.


In [8]:
from datasets import load_dataset

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN


def formatting_prompts_func(examples):
    inputs = examples["instruction"]
    outputs = examples["answer"]
    texts = []
    for instruction, answer in zip(inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, answer) + EOS_TOKEN

        texts.append(text)
    return {
        "text": texts,
    }
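
To see what a single formatted training example looks like, the sketch below applies the same pattern to a toy batch. The shortened template, the `"</s>"` end-of-sequence string, and the sample texts are placeholders chosen for illustration; in the notebook the template is `alpaca_prompt` and the token comes from `tokenizer.eos_token`.

```python
# Hypothetical stand-ins: a shortened template and a placeholder EOS string.
TOY_TEMPLATE = "### Input:\n{}\n\n### Response:\n{}"
TOY_EOS = "</s>"


def format_batch(examples, template, eos_token):
    """Build one training string per (instruction, answer) pair and append EOS."""
    texts = [
        template.format(instruction, answer) + eos_token
        for instruction, answer in zip(examples["instruction"], examples["answer"])
    ]
    return {"text": texts}


toy_batch = {
    "instruction": ["Long document A ...", "Long document B ..."],
    "answer": ["Summary of A.", "Summary of B."],
}
formatted = format_batch(toy_batch, TOY_TEMPLATE, TOY_EOS)
print(formatted["text"][0])
```

Returning a dict of lists is what lets the function be used with `dataset.map(..., batched=True)`: each key becomes a new column with one value per input row.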

In [9]:
dataset = load_dataset(dataset_id)
dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
)

## Train the model

Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer).

In [10]:
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,  # Can make training 5x faster for short sequences.
    args=TrainingArguments(
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        # num_train_epochs=1,  # Set this for 1 full training run, while commenting out 'max_steps'.
        max_steps=max_steps,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="comet_ml" if enable_comet else "none",
    ),
)

Unsloth: Hugging Face's packing is currently buggy - we're disabling it for now!


## Show current memory stats

In [11]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA L4. Max memory = 22.161 GB.
15.152 GB of memory reserved.


In [12]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 748 | Num Epochs = 3 | Total steps = 250
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,072,204,288 (0.52% trained)
COMET INFO: Experiment is live on comet.com https://www.comet.com/kacperjanowski98/personale-assistant/e4f2d113180f4e4e8be64d83e30c4860

COMET INFO: Couldn't find a Git repository in '/content' nor in any parent directory. Set `COMET_GIT_DIRECTORY` if your Git Repository is elsewhere.


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.2444
2,1.512
3,1.6198
4,1.4854
5,1.5189
6,1.3604
7,1.5027
8,1.1734
9,1.216
10,1.2837
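
The figures in the training banner can be reproduced with a little arithmetic. The sketch below is an approximation that ignores how the trainer rounds partial batches: it derives the effective batch size from the per-device batch size and gradient accumulation, then estimates how many passes over the 748 training examples the 250 optimizer steps amount to.

```python
import math

per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 250
num_examples = 748  # reported in the training banner

# Examples consumed per optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Total examples seen over the run, and the equivalent number of passes over the data.
examples_seen = max_steps * effective_batch_size
passes_over_data = examples_seen / num_examples

print(f"{effective_batch_size=}")
print(f"{examples_seen=}")
print(f"passes over data = {passes_over_data:.2f} (rounded up: {math.ceil(passes_over_data)})")
```

The run sees 2,000 examples, or roughly 2.67 passes over the data, which is consistent with the `Num Epochs = 3` shown in the banner since the final epoch is only partially completed.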



### Show final memory and time stats

In [13]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

8818.2325 seconds used for training.
146.97 minutes used for training.
Peak reserved memory = 19.826 GB.
Peak reserved memory for training = 4.674 GB.
Peak reserved memory % of max memory = 89.463 %.
Peak reserved memory for training % of max memory = 21.091 %.


## Inference

Let's run the model!

In [15]:
from transformers import TextStreamer

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
text_streamer = TextStreamer(tokenizer)

def generate_text(
    instruction, streaming: bool = True, trim_input_message: bool = False
):
    message = alpaca_prompt.format(
        instruction,
        "",  # output - leave this blank for generation!
    )
    inputs = tokenizer([message], return_tensors="pt").to("cuda")

    if streaming:
        return model.generate(
            **inputs, streamer=text_streamer, max_new_tokens=256, use_cache=True
        )
    else:
        output_tokens = model.generate(**inputs, max_new_tokens=256, use_cache=True)
        output = tokenizer.batch_decode(output_tokens, skip_special_tokens=True)[0]

        if trim_input_message:
            return output[len(message) :]
        else:
            return output

In [16]:
generate_text(dataset["validation"][0]["instruction"], streaming=True)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
You are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights

### Input:
[![dot](https://redis.io/wp-content/uploads/2022/12/Ellipse-47.svg) Stop testing, start deploying your AI apps. See how with MIT Technology Review’s latest research. Download now ](/resources/mit-report-genai/)

[![White Redis Logo](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120) ](https://redis.io/) [Back](javascript:void\(0\);)

  * Products

    * Products
      * [ Redis CloudFully managed and integrated with Google Cloud, Azure, and AWS.](/cloud/)
      * [ Redis for AIBuild the fastest, most rel

Unsloth: Input IDs of length 10563 > the model's max sequence length of 4096.
We shall truncate it ourselves. It's imperative if you correct this issue first.


-angular-ef-construction-128

![](https://redis.io/wp-content/uploads/2024/06/Vector-Database-Blog-Thumbnails-4.png?auto=webp&quality=85,75&width=800)

# We tested the top 7 vector database providers. They failed.

We tested 7 providers of vector databases and none of them passed our tests. In this post, we explain what we did and why they failed. We’re sharing the details of our testing methodology so that you can do the same and pick the best vector database for your use case.

## Background

In the past year, there has been a lot of talk about vector databases. They are being used to build the next generation of recommendation and retrieval applications. But are they reliable? We decided to find out by testing the top providers.

## Methodology

Our goal was to test the vector databases under a realistic workload. We wanted to understand how they would perform when processing a large number of similar queries, which is a common pattern in recommendation and retrieval workloads.

To 

tensor([[128000,  39314,    374,  ...,    264,  27685,  10488]],
       device='cuda:0')
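
The warning in the output above shows that this validation prompt tokenizes to 10,563 IDs, well beyond `max_seq_length = 4096`, so the library truncates it, and the generated text appears to continue the document rather than summarize it. One possible mitigation, sketched below under assumptions, is to shorten only the document part so that the prompt template and the generation budget still fit. The helper name `truncate_document_ids` and the `template_overhead` value are hypothetical; in practice the IDs would come from something like `tokenizer(document)["input_ids"]` and be turned back into text with `tokenizer.decode` before formatting the prompt.

```python
def truncate_document_ids(document_ids, max_seq_length, max_new_tokens, template_overhead):
    """Keep only as many document tokens as fit next to the template and the reply.

    document_ids: token IDs of the raw document (a plain list of ints here).
    template_overhead: tokens used by the prompt template itself (an assumed value).
    """
    budget = max_seq_length - max_new_tokens - template_overhead
    if budget <= 0:
        raise ValueError("No room left for the document; lower max_new_tokens.")
    return document_ids[:budget]


# Toy stand-in for a tokenized input of the length reported in the warning.
fake_ids = list(range(10563))
kept = truncate_document_ids(
    fake_ids, max_seq_length=4096, max_new_tokens=256, template_overhead=100
)
print(len(kept))
```

This version keeps the start of the document and drops the rest. Whether the head, the tail, or a mix of both is the better part to keep depends on the data, so treat the slicing rule as a starting point.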

In [17]:
generate_text(dataset["validation"][0]["instruction"], streaming=False)

'Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nYou are a helpful assistant specialized in summarizing documents. Generate a concise TL;DR summary in markdown format having a maximum of 512 characters of the key findings from the provided documents, highlighting the most significant insights\n\n### Input:\n[![dot](https://redis.io/wp-content/uploads/2022/12/Ellipse-47.svg) Stop testing, start deploying your AI apps. See how with MIT Technology Review’s latest research. Download now ](/resources/mit-report-genai/)\n\n[![White Redis Logo](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120) ](https://redis.io/) [Back](javascript:void\\(0\\);)\n\n  * Products\n\n    * Products\n      * [ Redis CloudFully managed and integrated with Google Cloud, Azure, and AWS.](/cloud/)\n      * [ Redis for AIBuild the fastest, most rel

## Saving Fine-tuned LLM

The last step is to save the fine-tuned LLM locally and, if a Hugging Face token is available, to push it to Hugging Face.

In [18]:
from huggingface_hub import HfApi

model_name = f"{base_model}-Assistant-Summarization"
print(f"Model name: {model_name}")
model.save_pretrained_merged(
    model_name,
    tokenizer,
    save_method="merged_16bit",
)  # Local saving

if enable_hf:
    api = HfApi()
    user_info = api.whoami(token=hf_token)
    huggingface_user = user_info["name"]
    print(f"Current Hugging Face user: {huggingface_user}")

    model.push_to_hub_merged(
        f"{huggingface_user}/{model_name}",
        tokenizer=tokenizer,
        save_method="merged_16bit",
        token=hf_token,
    )  # Online saving to Hugging Face

Model name: Meta-Llama-3.1-8B-Instruct-Assistant-Summarization


Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 16.1G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 30.64 out of 52.96 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


  3%|▎         | 1/32 [00:00<00:04,  6.87it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [00:43<00:00,  1.35s/it]


Unsloth: Saving tokenizer... Done.
Done.
Current Hugging Face user: Kacper098


Unsloth: You are pushing to hub, but you passed your HF username = Kacper098.
We shall truncate Kacper098/Meta-Llama-3.1-8B-Instruct-Assistant-Summarization to Meta-Llama-3.1-8B-Instruct-Assistant-Summarization


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 30.44 out of 52.96 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [01:31<00:00,  2.86s/it]


Unsloth: Saving tokenizer...

 Done.


Done.
Saved merged model to https://huggingface.co/Kacper098/Meta-Llama-3.1-8B-Instruct-Assistant-Summarization
