# Fine-Tune Qwen2.5-3B-Instruct

Special thanks to **Daniel Han** (creator of Unsloth) for providing the code.
See his [notebook](https://colab.research.google.com/drive/1mvwsIQWDs2EdZxZQF9pRGnnOvE86MVvR?usp=sharing#scrollTo=MKX_XKs_BNZR) for more details about inference, loading models, etc.


# Install Packages

In [3]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

# 1. Load and Prepare Training model

In [4]:
from unsloth import FastLanguageModel
import torch
from google.colab import userdata

max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.



🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.5.1+cu124)
    Python  3.11.11 (you have 3.11.11)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!
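
To make the effect of `load_in_4bit` more concrete, here is a rough back-of-the-envelope sketch of the memory taken by the weights alone at different precisions. The helper `estimate_weight_gb` and the round 3-billion parameter count are illustrative assumptions; real usage will be higher because of quantization metadata, activations, and the CUDA context.

```python
def estimate_weight_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 3e9  # assumed round figure for a 3B-parameter model
fp16_gb = estimate_weight_gb(n_params, 16)
int4_gb = estimate_weight_gb(n_params, 4)
print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

On a GPU with limited memory, that difference is roughly what leaves room for activations, LoRA adapters, and optimizer state.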


In [5]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = userdata.get("HF_TOKEN"), # Read from Colab secrets (the secret name "HF_TOKEN" is an assumption); only needed for gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.3.9: Fast Qwen2 patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
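
Access tokens are best kept out of the notebook itself. Outside Colab, one option is to read the token from an environment variable. The sketch below is a minimal illustration; the helper name `get_hf_token` and the variable name `HF_TOKEN` are assumptions, not part of the original notebook.

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN"):
    """Return the token stored in an environment variable, or None if unset or empty."""
    token = os.environ.get(env_var, "").strip()
    return token or None

hf_token = get_hf_token()
print("token found" if hf_token else "no token set; fine for public models")
```

The result can then be passed as the `token` argument of `FastLanguageModel.from_pretrained`; `None` is enough when the model is not gated.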



In [7]:
from transformers import TextStreamer

# Alpaca-style prompt with positional slots: instruction, input, response
alpaca_prompt = """### Instruction:
{0}

### Input:
{1}

### Response:
{2}"""

# Reuse the model and tokenizer loaded above instead of loading them again
FastLanguageModel.for_training(model) # Workaround while the 2x faster inference path (for_inference) is being fixed

inputs = tokenizer(
    [
        alpaca_prompt.format(
            "You are a helpful assistant in Q@A about AI.", # Instruction
            "What is EPLB?", # Input
            "", # Response: left empty so the model generates it
        )
    ],
    return_tensors = "pt",
).to("cuda")

# Stream tokens to stdout as they are generated
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)

### Instruction:
You are a helpful assistant in Q@A about AI.

### Input:
What is EPLB?

### Response:
EPLB (Electronic Post-Label Book) is not an official term related to AI or any known technology. This might be a proprietary term used by some companies or individuals, but I could not find any information about it being an official term. EPLB may refer to a specific product, service, or project
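
The baseline prompt works by leaving the response slot empty, so the text handed to the model ends right after the `### Response:` marker and generation continues from there. A minimal sketch of just the string-formatting step, with no model involved:

```python
alpaca_prompt = """### Instruction:
{0}

### Input:
{1}

### Response:
{2}"""

prompt = alpaca_prompt.format(
    "You are a helpful assistant in Q@A about AI.",  # instruction
    "What is EPLB?",                                 # input
    "",                                              # response left empty
)
print(prompt)
```

Because the streamer above is created without `skip_prompt`, this filled-in prompt is echoed before the generated answer, which is why the output repeats the instruction and input.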


# 2. Prepare Dataset

In [8]:
import pandas as pd

csv_file = "/content/AI.csv"
df = pd.read_csv(csv_file)
df

| Unnamed: 0 | Question | Answer |
| --- | --- | --- |
| 0 | Who did the first work generally recognized as... | Warren McCulloch and Walter Pitts (1943).\n |
| 1 | What sources was drawn on the formation of the... | knowledge of the basic physiology and function... |
| 2 | Who created the Hebbian learning rule? | Donald Hebb (1949).\n |
| 3 | When the first neural network is built? | 1950.\n |
| 4 | What is the first neural network called? | The SNARC.\n |
| ... | ... | ... |
| 498 | How long uniform-cost search will take? | of nodes with path cost ≤ cost of optimal solu... |
| 499 | How much space does uniform-cost search take t... | of nodes with path cost ≤ cost of optimal solu... |
| 500 | How much space does iterative deepening search... | O(bd)\n |
| 501 | How much space does depth-first search take to... | O(bm)\n |
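
Several answers in the table end with a literal newline. Whether to strip it before training is a judgment call; the sketch below shows one way to do it with the standard library, using two rows copied from the table. The helper `load_clean_rows` is a hypothetical name introduced for illustration.

```python
import csv
import io

# Two rows copied from the table above; the trailing "\n" is part of each answer
raw_csv = (
    'Question,Answer\n'
    'Who created the Hebbian learning rule?,"Donald Hebb (1949).\n"\n'
    'What is the first neural network called?,"The SNARC.\n"\n'
)

def load_clean_rows(text):
    """Parse CSV text and strip surrounding whitespace from every field."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: value.strip() for key, value in row.items()} for row in reader]

rows = load_clean_rows(raw_csv)
print(rows[0])
```

With the DataFrame already loaded, the pandas equivalent would be along the lines of `df["Answer"] = df["Answer"].str.strip()`.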


In [13]:
from datasets import Dataset

# Training prompt template; the leading newline becomes part of every training example
alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN

def formatting_prompts_func(examples):
    # Batched map: `examples` maps each column name to a list of values
    questions = examples["Question"]
    answers   = examples["Answer"]
    # Replicate the same instruction for every example in the batch
    instructions = ["You are a helpful assistant in Q@A about AI."] * len(questions)
    texts = []
    for instruction, question, answer in zip(instructions, questions, answers):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, question, answer) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

# Convert the DataFrame loaded above into a Hugging Face Dataset
dataset = Dataset.from_pandas(df)
dataset = dataset.map(formatting_prompts_func, batched=True)

print(dataset[0])

Map:   0%|          | 0/503 [00:00<?, ? examples/s]

{'Question': 'Who did the first work generally recognized as AI?', 'Answer': 'Warren McCulloch and Walter Pitts (1943).\n', 'text': '\n### Instruction:\nYou are a helpful assistant in Q@A about AI.\n\n### Input:\nWho did the first work generally recognized as AI?\n\n### Response:\nWarren McCulloch and Walter Pitts (1943).\n<|im_end|>'}
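
With `batched=True`, the mapping function receives a dictionary of column lists rather than a single row. The following sketch reproduces that contract in plain Python so the formatting can be checked without a tokenizer. The EOS string is hardcoded here as an assumption (it matches the `<|im_end|>` visible in the output above); in the notebook it comes from `tokenizer.eos_token`.

```python
ALPACA_PROMPT = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<|im_end|>"  # assumed here; in the notebook it comes from tokenizer.eos_token
INSTRUCTION = "You are a helpful assistant in Q@A about AI."

def format_batch(examples):
    """Turn a dict of column lists into a dict with one 'text' list."""
    texts = [
        ALPACA_PROMPT.format(INSTRUCTION, question, answer) + EOS_TOKEN
        for question, answer in zip(examples["Question"], examples["Answer"])
    ]
    return {"text": texts}

batch = {
    "Question": ["Who created the Hebbian learning rule?"],
    "Answer": ["Donald Hebb (1949).\n"],
}
out = format_batch(batch)
print(repr(out["text"][0]))
```

Note that the column names are case-sensitive: the keys must match the DataFrame headers `Question` and `Answer` exactly.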


# 3. Train & Save

In [14]:
# We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)


Unsloth 2025.3.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
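
The number of trainable parameters follows directly from the LoRA rank and the shapes of the adapted layers: each adapted linear layer gains two small matrices with `r * (in_features + out_features)` entries in total. The sketch below works this out for `r = 16`. The layer dimensions are assumptions taken from the published Qwen2.5-3B configuration, not from this notebook; with them, the result agrees with the "Trainable parameters" figure the trainer reports in the training log.

```python
# Assumed projection shapes (in_features, out_features) for one Qwen2.5-3B decoder layer
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 2048, 256, 11008, 36
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_params(rank, shapes, num_layers):
    """LoRA adds A (rank x in) and B (out x rank) per adapted linear layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

total = lora_params(16, shapes, NUM_LAYERS)
print(f"{total:,}")  # 29,933,568
```

Doubling `r` doubles this count, which is a quick way to gauge the memory cost of a higher rank before training.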


In [21]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4, # Effective batch size = 2 x 4 = 8

        num_train_epochs = 0.1, # Ignored here: a positive max_steps overrides it
        warmup_ratio = 0.1, # Warm up over the first 10% of training steps
        max_steps = 60, # Total optimizer steps (just under one epoch of 503 examples)

        learning_rate = 5e-5, # Conservative LR for stability
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),

        logging_steps = 1, # Log the training loss at every step
        optim = "adamw_torch", # Standard PyTorch AdamW
        weight_decay = 0.01,
        lr_scheduler_type = "cosine", # Cosine decay after warmup
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/503 [00:00<?, ? examples/s]
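
The training arguments interact in ways that are easy to misread, so here is a small arithmetic sketch of what they imply: the effective batch size, the number of steps in one epoch, and the learning rate over time. The schedule function mirrors the usual linear-warmup plus cosine-decay shape and is an illustrative assumption; the values the trainer actually uses may differ slightly.

```python
import math

num_examples = 503          # from the Map progress line above
per_device_batch = 2
grad_accum = 4
max_steps = 60
warmup_ratio = 0.1
peak_lr = 5e-5

effective_batch = per_device_batch * grad_accum              # 8
steps_per_epoch = math.ceil(num_examples / effective_batch)  # 63
warmup_steps = math.ceil(max_steps * warmup_ratio)           # 6

def lr_at(step):
    """Linear warmup to peak_lr, then cosine decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, steps_per_epoch, warmup_steps)
for step in (0, 3, 6, 33, 60):
    print(step, f"{lr_at(step):.2e}")
```

Since 60 steps is slightly fewer than the 63 needed for a full pass, the run stops just short of one epoch.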

In [22]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 503 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 29,933,568/1,830,055,936 (1.64% trained)


| Step | Training Loss |
| --- | --- |
| 1 | 0.4687 |
| 2 | 0.6489 |
| 3 | 0.5079 |
| 4 | 0.667 |
| 5 | 0.6016 |
| 6 | 0.6141 |
| 7 | 0.5741 |
| 8 | 0.5428 |
| 9 | 0.4775 |
| 10 | 0.5971 |


## Saving the Model

In [23]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

In [19]:
model.save_pretrained_gguf("lora_model_submission", tokenizer)

KeyboardInterrupt: 

# 4. Inference

In [26]:

from transformers import TextStreamer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # The LoRA adapter directory saved above
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Who did the first work generally recognized as AI?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 2048,
                   use_cache = True, temperature = 1.5, min_p = 0.1)



The first work generally recognized as an artificial intelligence (AI) was the Logic Theorem Theorem, achieved by Alan Turing in 1950. Alan Turing is widely acknowledged as the father of modern AI. Turing's work was based on formal logic and computation.

Alan Turing's Logic Theorem Theorem is considered one of the earliest examples of an AI system.

Alan Turing wrote a machine that he is considered one of the early examples of AI.

This is one of the earliest examples of an AI system.

The first work generally recognized as an artificial intelligence (AI) was the Logic Theorem Theorem, achieved by Alan Turing in 1950. Alan Turing is widely acknowledged as the father of modern AI. Turing's work was based on 

KeyboardInterrupt: 
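
The training examples use the Alpaca-style template, while the inference cell above queries the model through the tokenizer's chat template. One thing that may be worth trying (an assumption, not something verified in this notebook) is prompting in the same format used for training and then cutting the answer out of the decoded text. The helpers `build_prompt` and `extract_response` are hypothetical names, and the decoded string is simulated from the dataset's reference answer rather than produced by the model.

```python
TRAIN_PROMPT = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

def build_prompt(question, instruction="You are a helpful assistant in Q@A about AI."):
    """Fill the training template, leaving the response slot empty."""
    return TRAIN_PROMPT.format(instruction, question, "")

def extract_response(decoded, eos_token="<|im_end|>"):
    """Return the text after the last '### Response:' marker, without the EOS token."""
    response = decoded.rsplit("### Response:", 1)[-1]
    return response.replace(eos_token, "").strip()

prompt = build_prompt("Who did the first work generally recognized as AI?")
# Simulated decoded output, using the reference answer from the dataset
decoded = prompt + "Warren McCulloch and Walter Pitts (1943).\n<|im_end|>"
print(extract_response(decoded))
```

In the notebook, `prompt` would be tokenized and passed to `model.generate` in place of the chat-template input, ideally with a much smaller `max_new_tokens`, since the reference answers are short.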