In [1]:
# ! pip3 install unsloth

Create a new virtual environment, unsloth_env, before installing Unsloth.

In [1]:
import json
from datasets import Dataset

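# Load the raw examples; each record has a 'prompt' and a 'response' field
with open("people_data.json", 'r') as f:
    data = json.load(f)

# Wrap each example in the chat markers used as the training text
tuning_examples = []

for example in data:
    tuning_examples.append(f"<|user|>\n{example['prompt']}\n<|assistant|>\n{json.dumps(example['response'])}<|endoftext|>")

dataset = Dataset.from_dict({'text':tuning_examples})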

  from .autonotebook import tqdm as notebook_tqdm
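
As a minimal sketch of the formatting step in the cell above, the snippet below applies the same template to a single record. The record contents and the helper name `format_example` are hypothetical; people_data.json is only assumed to follow the 'prompt'/'response' layout used in the loop.

```python
import json

def format_example(example):
    """Wrap one record in the same chat markers used for the training text."""
    return (
        f"<|user|>\n{example['prompt']}\n"
        f"<|assistant|>\n{json.dumps(example['response'])}<|endoftext|>"
    )

# Hypothetical record; the actual contents of people_data.json are not shown here
sample = {
    "prompt": "Mike is a 30 year old programmer. He loves hiking.",
    "response": {"name": "Mike", "age": 30},
}

print(format_example(sample))
```

The response is serialized with json.dumps, so the model is trained to emit a JSON string after the <|assistant|> marker.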


The installed torch==2.8.0 is not compatible with:
* torchaudio==2.3.1 (requires torch==2.3.1)
* torchvision==0.18.1 (also requires torch==2.3.1)

This leaves two conflicting requirements:
* torch==2.3.1 (required by torchaudio and torchvision)
* torch==2.8.0 (required by unsloth-zoo and xformers)

If torch==2.8.0 is needed (e.g., for Unsloth), either:
* avoid installing torchaudio and torchvision unless versions compatible with torch==2.8.0 are available, or
* wait until those libraries support torch==2.8.0.

PyTorch 2.8.0 is not available from the cu121 channel; the environment used below runs the CUDA 12.6 build (Torch 2.8.0+cu126 in the Unsloth banner).
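
One way to see which of the conflicting packages are actually present is to query the installed distributions. This is a small sketch using only the standard library; the helper name `installed_versions` and the package list are illustrative choices, not part of the original setup.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Packages named in the conflict above; extend the list as needed
print(installed_versions(["torch", "torchaudio", "torchvision", "xformers"]))
```

A value of None for torchaudio or torchvision means that package is not installed and so cannot pin torch to 2.3.1.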

In [None]:
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3-mini-4k-instruct-bnb-4bit",
    max_seq_length = 2048,
    dtype = None, # None lets Unsloth pick the dtype automatically
    load_in_4bit = True # load the weights with 4-bit quantization to reduce memory use
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


W0830 05:21:18.010000 30768 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.


🦥 Unsloth Zoo will now patch everything to make training faster!


  GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"{DEVICE_TYPE}:{i}") for i in range(n_gpus)])


==((====))==  Unsloth 2025.8.10: Fast Mistral patching. Transformers: 4.56.0.
   \\   /|    NVIDIA GeForce RTX 4070 SUPER. Num GPUs = 1. Max memory: 11.994 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


This cell (the import and the model load) took 6m 16.4s to run.

unsloth/Phi-3-mini-4k-instruct-bnb-4bit is a 4-bit (bitsandbytes) quantized copy of Microsoft's instruction-tuned Phi-3 Mini 4k model, published by Unsloth for fine-tuning and deployment on memory-constrained hardware.

In [3]:
model = FastLanguageModel.get_peft_model(
    model, 
    r = 64, # LoRA rank: a smaller rank means fewer trainable parameters, less memory, and faster training
    target_modules = [
        'q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'
    ], # the modules that receive LoRA adapters: all attention and MLP projections.
     # q_proj, k_proj, v_proj compute the query, key, and value projections of the input.
     # o_proj projects the attention output back to the hidden size.
     # gate_proj and up_proj are the gating and up-projection of the MLP block.
     # down_proj projects the MLP activations back down to the hidden size.

    lora_alpha = 64 * 2, # LoRA scaling factor, set here to 2 * r
    lora_dropout = 0, # dropout applied to the LoRA layers; 0 disables it
    bias = 'none', # do not train any bias terms
    use_gradient_checkpointing = 'unsloth' # use Unsloth's gradient checkpointing to reduce VRAM use
    
)

Unsloth 2025.8.10 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In large language models (LLMs), up-projection and down-projection are the linear layers of the feed-forward (MLP) block: the up-projection maps each hidden state to a larger intermediate dimension, and the down-projection maps it back to the hidden size. In a gated MLP such as the one patched here, gate_proj produces a gate that multiplies the up-projected activations before the down-projection.
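
The shapes of these projections also determine how many parameters LoRA adds. The sketch below recomputes the "Trainable parameters = 119,537,664" figure reported by the training run. The hidden size (3072) and intermediate size (8192) are assumed Phi-3 Mini dimensions that this notebook does not state; the rank and layer count come from the cell and output above.

```python
def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank) per adapted linear layer."""
    return rank * (in_features + out_features)

# Assumed Phi-3 Mini dimensions (not stated in this notebook)
HIDDEN = 3072
INTERMEDIATE = 8192
NUM_LAYERS = 32  # "patched 32 layers" in the output above
RANK = 64        # r = 64 in get_peft_model

# (in_features, out_features) of each targeted module in one transformer layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 119,537,664
```

Under these assumptions each layer contributes 3,735,552 adapter parameters, and halving r would halve the total.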

The number of processes cannot be set high for multiprocessing during tokenization with Hugging Face's datasets library, so num_proc is kept at 1 below.

In [None]:
def tokenize_function(example):
    return tokenizer(
        example["text"], # the text field in the dataset is "text"
        truncation=True,
        padding="max_length", 
        max_length=2048,
    )

In [None]:
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=1  # or just remove this line, safer on Windows
)

Map: 100%|██████████| 300/300 [00:00<00:00, 1736.81 examples/s]


The trl module is Hugging Face's TRL (Transformer Reinforcement Learning) library. It provides trainers for fine-tuning language models with techniques including Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO).

In [8]:
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_dataset,  # the tokenized dataset from the previous cell
    tokenizer=tokenizer,
    dataset_text_field='text',
    max_seq_length=2048,
    args=SFTConfig(
        per_device_train_batch_size = 2, # examples per GPU per forward/backward pass
        gradient_accumulation_steps = 4, # accumulate 4 batches per optimizer step (effective batch size 2 x 4 = 8)
        warmup_steps= 10, # learning-rate warmup steps
        max_steps = 60, # stop after 60 optimizer steps; when set, this overrides num_train_epochs
        num_train_epochs = 3, # not reached here because max_steps is set
        logging_steps= 1, # log the training loss at every step
        output_dir = 'outputs', # directory where checkpoints and outputs are written
        optim = 'adamw_8bit' # 8-bit AdamW, which reduces optimizer memory
    )
)

In [15]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 300 | Num Epochs = 2 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 119,537,664 of 3,940,617,216 (3.03% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,14.0391
2,13.7707
3,13.3632
4,10.5703
5,6.0329
6,1.1683
7,0.2558
8,0.1738
9,0.1384
10,0.1083


TrainOutput(global_step=60, training_loss=1.050305641628802, metrics={'train_runtime': 484.9866, 'train_samples_per_second': 0.99, 'train_steps_per_second': 0.124, 'total_flos': 2.2472878146453504e+16, 'train_loss': 1.050305641628802})
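
The figures in the training banner and in TrainOutput can be cross-checked with a little arithmetic. This is a rough sketch built from the values printed above; the rounding used for the epoch count is an assumption and may differ from what the trainer does internally.

```python
import math

num_examples = 300          # "Num examples = 300"
per_device_batch = 2        # per_device_train_batch_size
grad_accum = 4              # gradient_accumulation_steps
num_gpus = 1                # "Data Parallel GPUs = 1"
max_steps = 60              # max_steps
train_runtime_s = 484.9866  # 'train_runtime' from TrainOutput

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_gpus

# Optimizer steps needed for one pass over the data, and epochs touched by 60 steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
epochs_touched = math.ceil(max_steps / steps_per_epoch)

# Throughput implied by the runtime
samples_per_second = max_steps * effective_batch / train_runtime_s
steps_per_second = max_steps / train_runtime_s

print(effective_batch, epochs_touched)                              # 8 2
print(round(samples_per_second, 2), round(steps_per_second, 3))     # 0.99 0.124
```

The 60-step cap ends training during the second epoch, which is why the banner reports "Num Epochs = 2" even though num_train_epochs was set to 3.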

In [26]:
trainer.save_model("outputs")

Not using Unsloth since there are 

In [1]:
# === Step 2: Now safe to import ===
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

print("Torch version:", torch.__version__)

# === 1. 4-bit config ===
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# === 2. Load base model ===
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True
)

# === 3. Load and merge LoRA adapter ===
model = PeftModel.from_pretrained(model, "outputs")
model = model.merge_and_unload()

# === 4. Tokenizer setup ===
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

model.config.pad_token_id = tokenizer.pad_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id

# === 5. Prepare input ===
messages = [
    {"role": "user", "content": "Mike is a 30 year old programmer. He loves hiking."}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print("Prompt:\n", prompt)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# === 6. Generate ===
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=True,
    )

# === 7. Decode ===
full_response = tokenizer.decode(outputs[0], skip_special_tokens=False)
if "<|assistant|>" in full_response:
    reply = full_response.split("<|assistant|>")[1].split("<|end|>")[0].strip()
else:
    reply = full_response.strip()

print("Assistant Response:\n", reply)

  from .autonotebook import tqdm as notebook_tqdm
W0830 08:51:18.226000 34100 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.


Torch version: 2.8.0+cu126



python -m bitsandbytes


  warn(msg)
  warn(msg)


bin c:\Users\ch939\anaconda3\envs\unsloth_env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda126.dll
False

The following directories listed in your path were found to be non-existent: {WindowsPath('C:/Users/ch939/anaconda3/envs/unsloth_env/bin')}
CUDA SETUP: PyTorch settings found: CUDA_VERSION=126, Highest Compute Capability: 8.9.
CUDA SETUP: To manually override the PyTorch CUDA version please see:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
CUDA SETUP: Required library version not found: libbitsandbytes_cuda126.dll. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.dll...

CUDA SETUP: CUDA detection failed! Possible reasons:
1. You need to manually override the PyTorch CUDA version. Please see: "https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
2. CUDA driver not installed
3. CUDA not installed
4. You have multiple conflicting CUDA libraries
5. Required library not pre

RuntimeError: Failed to import transformers.integrations.bitsandbytes because of the following error (look up to see its traceback):

        CUDA Setup failed despite GPU being available. Please run the following command to get more information:

        python -m bitsandbytes

        Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
        to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
        and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

In [2]:
# === PATCH: Prevent bitsandbytes integration crash (optional) ===
import sys
if 'transformers.integrations.bitsandbytes' not in sys.modules:
    sys.modules['transformers.integrations.bitsandbytes'] = type(sys)('bitsandbytes')

# === Now import ===
import torch  # needed for torch.no_grad() below
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# === Load base model in 8-bit ===
print("Loading model in 8-bit...")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True
)

# === Load and merge LoRA ===
print("Loading LoRA adapter...")
model = PeftModel.from_pretrained(model, "outputs")
model = model.merge_and_unload()
print("✅ LoRA merged")

# === Tokenizer setup ===
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = "left"

model.config.pad_token_id = tokenizer.pad_token_id
model.config.bos_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id

# === Generate ===
messages = [
    {"role": "user", "content": "Mike is a 30 year old programmer. He loves hiking."}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print("Prompt:\n", prompt)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        max_new_tokens=512,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=True,
    )

# === Decode ===
full_response = tokenizer.decode(outputs[0], skip_special_tokens=False)
if "<|assistant|>" in full_response:
    reply = full_response.split("<|assistant|>")[1].split("<|end|>")[0].strip()
else:
    reply = full_response.strip()

print("Assistant Response:\n", reply)

Loading model in 8-bit...


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


ImportError: cannot import name 'validate_bnb_backend_availability' from 'transformers.integrations' (c:\Users\ch939\anaconda3\envs\unsloth_env\lib\site-packages\transformers\integrations\__init__.py)

In [1]:
import transformers
import accelerate
import peft
import bitsandbytes

print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
print("peft:", peft.__version__)
print("bitsandbytes:", bitsandbytes.__version__)

  from .autonotebook import tqdm as notebook_tqdm
W0830 09:16:30.430000 32048 site-packages\torch\distributed\elastic\multiprocessing\redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.


transformers: 4.56.0
accelerate: 1.0.1
peft: 0.14.0
bitsandbytes: 0.47.0


In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

print("Loading model in 8-bit...")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True
)

# === Continue with LoRA merge, generate, etc. ===

Loading model in 8-bit...


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


ImportError: cannot import name 'validate_bnb_backend_availability' from 'transformers.integrations' (c:\Users\ch939\anaconda3\envs\unsloth_env\lib\site-packages\transformers\integrations\__init__.py)
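
The deprecation warning above asks for a `BitsAndBytesConfig` object passed through `quantization_config`. A minimal sketch of that form follows; the helper name `load_base_model_8bit` is hypothetical. It addresses only the deprecation warning: the ImportError above has a separate cause that this sketch does not resolve.

```python
def load_base_model_8bit(model_id="microsoft/Phi-3-mini-4k-instruct"):
    """Load a causal LM in 8-bit via quantization_config instead of load_in_8bit."""
    # Imported lazily so that defining this helper does not itself import transformers
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,  # replaces the deprecated load_in_8bit=True argument
        device_map="auto",
        trust_remote_code=True,
    )
```

Calling `model = load_base_model_8bit()` would replace the `from_pretrained(..., load_in_8bit=True)` call in the cell above, assuming a working bitsandbytes installation.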