# Fine_Tune_granite-3.3-2b-instruct

This notebook provides a guide to fine-tuning the Granite 3.3 2B Instruct model using [Unsloth](https://github.com/unslothai/unsloth?tab=readme-ov-file), an optimized open-source framework designed for efficient LLM fine-tuning and reinforcement learning.

Fine-tuning refers to the process of further training a pre-trained model on a task-specific dataset to improve its performance in specialized contexts. Here, we focus on Low-Rank Adaptation (LoRA), a method within the broader category of Parameter-Efficient Fine-Tuning (PEFT), in which the pre-trained weights stay frozen and only a small set of added low-rank adapter parameters is trained. Because PEFT methods leave most of the pre-trained knowledge intact, they reduce the risk of catastrophic forgetting.
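
To make the low-rank idea concrete, here is a minimal pure-Python sketch of the LoRA update. It is not Unsloth's implementation. It assumes the standard LoRA formulation, in which a frozen weight matrix `W` is combined with the product of two small trainable matrices `B` and `A`, scaled by `alpha / r`. The toy dimensions and the helper names (`matmul`, `lora_effective_weight`) are illustrative only.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A). W itself is never modified."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 2  # toy sizes, chosen only for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, starts at zero

# With B initialised to zero, the adapted layer starts out identical to the base layer.
W_eff = lora_effective_weight(W, A, B, alpha, r)
print(W_eff == W)  # True

# Only A and B are trained, so the trainable count is r * (d_in + d_out).
full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(full_params, lora_params)  # 48 28
```

The saving grows with layer size: for a 2048 x 8192 projection at rank 16, the adapter trains 163,840 values instead of roughly 16.8 million.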

Fine-tuning is particularly valuable when prompting or retrieval-based techniques fall short. While prompt engineering enables zero-shot or few-shot learning, it often results in inconsistent outputs, especially for complex tasks or domain-specific requirements. Similarly, Retrieval-Augmented Generation (RAG) enhances factual grounding by incorporating external context but does not alter the model's underlying reasoning or stylistic behavior. Fine-tuning addresses these limitations by embedding the desired patterns, tone, and logic directly into the model, resulting in more robust and reliable outputs.

There are several distinct types of fine-tuning, each suited to different use cases. Instruction tuning aligns the model with general task-following behavior, conversation tuning optimizes it for dialogue and multi-turn interactions, and domain-specific tuning adapts the model to specialized fields.

This recipe explores domain-specific training and fine-tunes the model to perform better on math reasoning tasks.


**THIS NOTEBOOK WORKS IN LINUX/WINDOWS ENVIRONMENT AND REQUIRES A CUDA-ENABLED GPU (NVIDIA GPU).**

Please refer to the Unsloth system requirements [here](https://docs.unsloth.ai/get-started/beginner-start-here/unsloth-requirements).

This notebook has been optimized to run efficiently on a single NVIDIA RTX 2070 Super GPU with 8GB VRAM. The code configurations, batch sizes, and memory management have been specifically tuned for this hardware to ensure smooth fine-tuning without memory overflow issues.

If you want to fine-tune using larger datasets or models, you may need a machine with a more powerful GPU. Your local computer can't run this notebook without a CUDA-enabled GPU.

**Hardware Requirements:**
- **GPU**: NVIDIA RTX 2070 Super (8GB VRAM) or equivalent
- **CUDA**: Compatible CUDA drivers installed
- **Memory**: Sufficient system RAM to support GPU operations

**Troubleshooting for Local RTX 2070 Super Setup:**
- **Verify CUDA installation**: Run `nvidia-smi` in terminal to confirm GPU is detected and CUDA drivers are properly installed
- **Monitor GPU memory**: Use `nvidia-smi` during training to ensure VRAM usage stays within the 8GB limit
- **Adjust batch size if needed**: If you encounter out-of-memory errors, reduce the batch size in the training configuration
- **Check GPU utilization**: Ensure the GPU is being utilized efficiently during training processes
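
The `nvidia-smi` checks above can also be scripted. The sketch below is one possible way to do it with the standard library only: it calls `nvidia-smi` with its CSV query options and parses the result. The function names are hypothetical, and the sample row in the usage lines is made-up illustrative data, not a measurement from this notebook.

```python
import shutil
import subprocess

def parse_gpu_memory(csv_text):
    """Parse 'name, memory.used, memory.total' rows (MiB) into dictionaries."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, used, total = [field.strip() for field in line.rsplit(",", 2)]
        rows.append({"name": name, "used_mib": int(used), "total_mib": int(total)})
    return rows

def query_gpu_memory():
    """Return one dictionary per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    try:
        return parse_gpu_memory(result.stdout)
    except ValueError:
        # Some devices report "[N/A]" instead of a number; treat that as "unknown".
        return []

# Illustrative input only: the numbers below are invented for the example.
sample = "NVIDIA GeForce RTX 2070 SUPER, 1234, 8192\n"
for gpu in parse_gpu_memory(sample):
    free_mib = gpu["total_mib"] - gpu["used_mib"]
    print(f'{gpu["name"]}: {free_mib} MiB free of {gpu["total_mib"]} MiB')
```

Calling `query_gpu_memory()` before and during training gives a quick view of how close the run is to the 8GB limit.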

## Install Dependencies

In [1]:
%pip install git+https://github.com/ibm-granite-community/utils \
  sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer

%pip install --no-deps bitsandbytes \
  accelerate \
  xformers==0.0.29.post3 \
  peft \
  trl \
  tqdm \
  triton \
  cut_cross_entropy \
  unsloth_zoo \
  unsloth

Collecting git+https://github.com/ibm-granite-community/utils
  Cloning https://github.com/ibm-granite-community/utils to /tmp/pip-req-build-p2yqryd6
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/utils /tmp/pip-req-build-p2yqryd6
  Resolved https://github.com/ibm-granite-community/utils to commit 60ecc5d292b5c33271586d5eb2bb53cf996f15ba
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


## Fine Tuning Granite Model

The Granite 3.3 2B model, while generally proficient in natural language understanding and generation, may struggle with specialized reasoning challenges such as high-accuracy mathematical problem solving. These limitations are expected given the model's relatively small parameter count. To address this, fine-tuning the Granite 3.3 2B model on a domain-specific mathematical dataset is a promising approach to improving its accuracy and reliability in quantitative tasks.

### Loading the base model

In this section, we load the Granite 3.3 2B Instruct base model, preparing it for fine-tuning.

In [2]:
from unsloth import FastLanguageModel

base_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="ibm-granite/granite-3.3-2b-instruct",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.10: Fast Granite patching. Transformers: 4.56.0.
   \\   /|    NVIDIA GeForce RTX 2070 SUPER. Num GPUs = 1. Max memory: 8.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


ibm-granite/granite-3.3-2b-instruct does not have a padding token! Will use pad_token = <|end_of_text|>.
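
As a rough sanity check on why this model fits on an 8GB card, the sketch below estimates the memory taken by the weights alone. It assumes 2 bytes per parameter, since the load banner reports `Bfloat16 = FALSE` and 16-bit LoRA, and it uses the total parameter count that the training banner prints later in this notebook. Activations, gradients, optimizer state, and CUDA overhead come on top and are not modelled here. The 4-bit figure is a naive lower bound that ignores quantization metadata.

```python
TOTAL_PARAMS = 2_561_720_320  # total parameter count reported in the training banner

def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights only, in GiB."""
    return num_params * bytes_per_param / 2**30

fp16_gib = weight_memory_gib(TOTAL_PARAMS, 2)    # 16-bit weights, as loaded here
int4_gib = weight_memory_gib(TOTAL_PARAMS, 0.5)  # naive estimate for load_in_4bit=True

print(f"16-bit weights: {fp16_gib:.2f} GiB")  # 16-bit weights: 4.77 GiB
print(f"4-bit weights:  {int4_gib:.2f} GiB")  # 4-bit weights:  1.19 GiB
```

If out-of-memory errors persist after reducing the batch size, setting `load_in_4bit = True` in `from_pretrained` is the next lever this estimate points to.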


### Prepare the Math Dataset

Here, the code formats a math dataset into a chat-style prompt-response structure using the tokenizer's chat template.

In [3]:
EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    messages = []
    for i in range(len(examples["problem"])):
        messages.append([
            {"role": "user", "content": examples["problem"][i]},
            {"role": "assistant", "content": examples["solution"][i]}
        ])
    texts = [tokenizer.apply_chat_template(message, tokenize = False, add_generation_prompt = False) + EOS_TOKEN for message in messages]
    return { "text" : texts, }

In [4]:
from datasets import load_dataset

dataset = load_dataset("xDAN2099/lighteval-MATH", split="train[:500]", trust_remote_code=True)
dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
)


This is what a final fine-tuning dataset sample looks like:

In [5]:
print(dataset["text"][0])

<|start_of_role|>system<|end_of_role|>Knowledge Cutoff Date: April 2024.
Today's Date: September 02, 2025.
You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Let \[f(x) = \left\{
\begin{array}{cl} ax+3, &\text{ if }x>2, \\
x-5 &\text{ if } -2 \le x \le 2, \\
2x-b &\text{ if } x <-2.
\end{array}
\right.\]Find $a+b$ if the piecewise function is continuous (which means that its graph can be drawn without lifting your pencil from the paper).<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>For the piecewise function to be continuous, the cases must "meet" at $2$ and $-2$. For example, $ax+3$ and $x-5$ must be equal when $x=2$. This implies $a(2)+3=2-5$, which we solve to get $2a=-6 \Rightarrow a=-3$. Similarly, $x-5$ and $2x-b$ must be equal when $x=-2$. Substituting, we get $-2-5=2(-2)-b$, which implies $b=3$. So $a+b=-3+3=\boxed{0}$.<|end_of_text|>
<|end_of_text|>


### LoRA fine tuning

We now add LoRA adapters for parameter-efficient fine-tuning.

In [6]:
target_modules =  ["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"]

model = FastLanguageModel.get_peft_model(
    base_model,
    r = 16, # Rank of the LoRA matrices
    target_modules = target_modules,  # Modules of the LLM that receive LoRA adapters
    lora_alpha = 16, # Scaling factor for the adapter weights
    lora_dropout = 0, # Unsloth recommends 0 for its optimized fast patching
    bias = "none",    # "none" is the optimized setting
    use_gradient_checkpointing = "unsloth", # "unsloth" reduces VRAM use and supports very long contexts
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth: Making `model.base_model.model.model` require gradients
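
The number of trainable parameters follows directly from the rank and the shapes of the targeted layers. The sketch below assumes the standard LoRA layout, where each adapted `Linear(in, out)` layer gains `r * (in + out)` parameters, and takes the layer shapes and the 40-layer depth from the `GraniteForCausalLM` summary printed near the end of this notebook. The variable names are illustrative.

```python
R = 16           # LoRA rank used in get_peft_model
NUM_LAYERS = 40  # decoder layers, per the model summary

# (in_features, out_features) of each targeted projection in one decoder layer
SHAPES = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 512),
    "v_proj": (2048, 512),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 8192),
    "up_proj": (2048, 8192),
    "down_proj": (8192, 2048),
}

# Each adapter adds an r x in matrix (A) and an out x r matrix (B).
per_layer = sum(R * (n_in + n_out) for n_in, n_out in SHAPES.values())
trainable = per_layer * NUM_LAYERS

total = 2_561_720_320  # total parameter count from the training banner
print(trainable)                          # 28180480
print(f"{100 * trainable / total:.2f}%")  # 1.10%
```

This reproduces the `Trainable parameters = 28,180,480` figure that Unsloth reports when training starts. Doubling `r` doubles the adapter size.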


Using Hugging Face TRL's SFTTrainer, we configure the training environment for the base model. Feel free to experiment with the training arguments to observe their impact on model performance.

In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 2,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 25,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none"
    ),
)

Unsloth: We found double BOS tokens - we shall remove one automatically.


Unsloth: Tokenizing ["text"] (num_proc=20):   0%|          | 0/500 [00:00<?, ? examples/s]

We now start fine-tuning the model. With the current configuration, about 1.1% of the parameters are trainable, and training takes roughly 6 minutes on the RTX 2070 Super with the specified training arguments.

In [8]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 500 | Num Epochs = 2 | Total steps = 126
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 28,180,480 of 2,561,720,320 (1.10% trained)


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss | entropy |
|------|---------------|---------|
| 25   | 1.0191        | 0       |
| 50   | 0.5489        | No Log  |
| 75   | 0.4317        | No Log  |
| 100  | 0.3581        | No Log  |
| 125  | 0.3692        | No Log  |


Here are some training statistics for your reference:

In [9]:
trainer_stats

TrainOutput(global_step=126, training_loss=0.5447963480911557, metrics={'train_runtime': 355.8373, 'train_samples_per_second': 2.81, 'train_steps_per_second': 0.354, 'total_flos': 4910164085219328.0, 'train_loss': 0.5447963480911557, 'epoch': 2.0})
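
These statistics can be cross-checked against the training arguments. The sketch below recomputes the step count and throughput from the dataset size, batch settings, and `train_runtime`. Rounding the steps per epoch up is an assumption that happens to reproduce the 126 steps reported here. Other Trainer versions may handle the final partial accumulation step differently.

```python
import math

num_examples = 500
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
epochs = 2
train_runtime_s = 355.8373  # from TrainOutput.metrics["train_runtime"]

effective_batch = per_device_batch * grad_accum_steps * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)  # assumed rounding
total_steps = steps_per_epoch * epochs

samples_per_second = num_examples * epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s

print(effective_batch, total_steps)     # 8 126
print(round(samples_per_second, 2))     # 2.81
print(round(steps_per_second, 3))       # 0.354
print(round(train_runtime_s / 60, 2))   # 5.93
```

Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps the effective batch size at 8, which is a common way to lower VRAM use without changing the optimization schedule.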

## Inference with the fine-tuned model

The fine-tuned model is ready for inferencing!

In [12]:
from ibm_granite_community.notebook_utils import wrap_text
import torch

# Ensure model is in inference mode
model.eval()

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Raw string: in a normal string literal, "\f" in "\frac" would be read as a form-feed character
query = r"If $x = 2$ and $y = 5$, then what is the value of $\frac{x^4+2y^2}{6}$ ?"
messages = [
    {"role": "user", "content": query}
]

# Create input encoding
encoding = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True
).to("cuda")

# Generate with minimal parameters to avoid cache issues
with torch.no_grad():
    try:
        output_ids = model.generate(
            input_ids=encoding["input_ids"],
            attention_mask=encoding["attention_mask"],
            pad_token_id=tokenizer.eos_token_id,
            max_new_tokens=512,  # Reduced for stability
            use_cache=False,
            do_sample=False,
            num_beams=1,
            early_stopping=True
        )
    except Exception as e:
        print(f"Error with model.generate: {e}")
        # Fallback: try with base model
        base_model = model.get_base_model()
        output_ids = base_model.generate(
            input_ids=encoding["input_ids"],
            attention_mask=encoding["attention_mask"],
            pad_token_id=tokenizer.eos_token_id,
            max_new_tokens=512,
            use_cache=False,
            do_sample=False
        )

# Decode response
response = tokenizer.decode(
    output_ids[0][encoding["input_ids"].shape[-1]:],
    skip_special_tokens=True
)
print(wrap_text(response))

We have  \[\frac{x^4+2y^2}{6} = \frac{2^4+2\cdot5^2}{6} = \frac{16+2\cdot25}{6}
= \frac{16+50}{6} = \frac{66}{6} = \boxed{11}.\]


The expected response should be along the lines of:


> We have  \[\frac{x^4 + 2y^2}{6} = \frac{2^4 + 2(5^2)}{6} = \frac{16+2(25)}{6} = \frac{16+50}{6} = \frac{66}{6} = \boxed{11}.\]
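
Comparing a generated answer with the expected one by eye does not scale, so a small helper can pull out the final `\boxed{...}` value from each response. The sketch below is one hypothetical way to do this with a regular expression. It only handles boxed content without nested braces, which is enough for this example, and it also recomputes the answer directly. The response strings are shortened versions of the outputs shown above.

```python
import re
from fractions import Fraction

def extract_boxed(text):
    """Return the contents of the last boxed expression in text, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

# Shortened copies of the generated and expected responses (raw strings keep the backslashes).
model_response = r"We have \[\frac{x^4+2y^2}{6} = \frac{66}{6} = \boxed{11}.\]"
expected_response = r"We have \[\frac{x^4 + 2y^2}{6} = \frac{66}{6} = \boxed{11}.\]"

# Independent check of the arithmetic: (2^4 + 2 * 5^2) / 6
x, y = 2, 5
ground_truth = Fraction(x**4 + 2 * y**2, 6)

print(extract_boxed(model_response))                                      # 11
print(extract_boxed(model_response) == extract_boxed(expected_response))  # True
print(Fraction(extract_boxed(model_response)) == ground_truth)            # True
```

Applied over a held-out slice of the dataset, the same comparison would give a simple exact-match accuracy for the fine-tuned model.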



You can also use a [TextStreamer](https://huggingface.co/docs/transformers.js/en/api/generation/streamers#module_generation/streamers.TextStreamer) for real-time inference, allowing you to view the model’s output token by token as it’s generated, rather than waiting for the full response. This is demonstrated in the next section.

## Saving the fine-tuned model

The fine-tuned model can be saved either locally or online on the Hugging Face Hub. The saved model can then be loaded with `FastLanguageModel` and prepared for inference.

In [13]:
model.save_pretrained("Finetuned_Granite")  # Local saving
tokenizer.save_pretrained("Finetuned_Granite") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('Finetuned_Granite/tokenizer_config.json',
 'Finetuned_Granite/special_tokens_map.json',
 'Finetuned_Granite/chat_template.jinja',
 'Finetuned_Granite/vocab.json',
 'Finetuned_Granite/merges.txt',
 'Finetuned_Granite/added_tokens.json',
 'Finetuned_Granite/tokenizer.json')

In [14]:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Finetuned_Granite", # Locally saved model
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
)
FastLanguageModel.for_inference(model)

==((====))==  Unsloth 2025.8.10: Fast Granite patching. Transformers: 4.56.0.
   \\   /|    NVIDIA GeForce RTX 2070 SUPER. Num GPUs = 1. Max memory: 8.0 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


ibm-granite/granite-3.3-2b-instruct does not have a padding token! Will use pad_token = <|end_of_text|>.


GraniteForCausalLM(
  (model): GraniteModel(
    (embed_tokens): Embedding(49159, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-39): 40 x GraniteDecoderLayer(
        (self_attn): GraniteAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): GraniteMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): GraniteRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): GraniteRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): GraniteRMSNorm((2048,)

In [15]:
query = r"Simplify $\sqrt[3]{1+8} \cdot \sqrt[3]{1+\sqrt[3]{8}}$."  # raw string keeps the LaTeX backslashes literal
messages = [
    {"role": "user", "content": query},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

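# Create attention mask. The prompt is a single unpadded sequence, so every token is attended to.
# Comparing against pad_token_id would also mask the <|end_of_text|> turn separators,
# because the pad token reuses that token.
import torch
attention_mask = torch.ones_like(inputs)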

# Generate output with attention mask
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(
    input_ids=inputs,
    attention_mask=attention_mask,
    streamer=text_streamer,
    max_new_tokens=256,
    use_cache=True,
    temperature=1.5,
    min_p=0.1
)


First, we simplify the expressions inside the cube roots:
$\sqrt[3]{1+8} = \sqrt[3]{9}$
$\sqrt[3]{1+\sqrt[3]{8}} = \sqrt[3]{1+2} = \sqrt[3]{3}$
Now, we can simplify the expression:
$\sqrt[3]{9} \cdot \sqrt[3]{3} = \sqrt[3]{9 \cdot 3} = \sqrt[3]{27} = \boxed{3}$The answer is: 3<|end_of_text|>


Expected response:


>The first cube root becomes $\sqrt[3]{9}$. $\sqrt[3]{8}=2$, so the second cube root becomes $\sqrt[3]{3}$. Multiplying these gives $\sqrt[3]{27} = \boxed{3}$.

