In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3.5-mini-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.11.10: Fast Llama patching. Transformers:4.46.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.26G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/140 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.37k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
# Apply PEFT (Parameter Efficient Fine-Tuning) to the loaded model
model = FastLanguageModel.get_peft_model(
    model,
    r=8,  # Reduced LoRA rank for lower VRAM usage
    target_modules=[
        "q_proj", "v_proj", "gate_proj",
    ],  # Minimal modules for task-specific fine-tuning
    lora_alpha=16,  # Scaling factor for LoRA; unchanged
    lora_dropout=0.05,  # Small dropout for better generalization
    bias="none",  # No additional bias to reduce memory
    use_gradient_checkpointing="unsloth",  # Optimized gradient checkpointing
    random_state=3407,  # Ensure reproducibility
    use_rslora=False,  # Disabling Rank Stabilized LoRA (default)
    loftq_config=None,  # Disabling LoftQ (default)
)


Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2024.11.10 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


### Data Preparation
For sentiment analysis, we use the IMDB dataset, which contains movie reviews labeled as either positive (1) or negative (0).

Example structure:
{
  "text": "This movie was amazing! The story was gripping and the characters were well developed.",
  "label": 1  # 1 for positive sentiment
}

Preprocessing involves:
1. Extracting the `text` field for inputs.
2. Using the `label` field as the target.
3. Optional reduction in size for memory efficiency with `select()`.

References:
- [IMDB Dataset on Hugging Face](https://huggingface.co/datasets/imdb)
- [Hugging Face Datasets Documentation](https://huggingface.co/docs/datasets)

In [6]:
from datasets import load_dataset

# Load the IMDB sentiment dataset
dataset = load_dataset("imdb", split="train")

# Preprocessing function for sentiment analysis
def preprocessing_func(examples):
    texts = examples["text"]  # Extract the text field
    labels = examples["label"]  # Labels are already binary (0: negative, 1: positive)
    return {"text": texts, "label": labels}

# Apply preprocessing to the dataset
dataset = dataset.map(preprocessing_func, batched=True)

# Optional: Reduce dataset size for constrained GPU
dataset = dataset.shuffle(seed=42).select(range(10000))  # Select a subset if needed

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Let's see how the `Phi-3` format works by printing the 5th element

In [8]:
print(dataset[5]["text"])

While this movie's style isn't as understated and realistic as a sound version probably would have been, this is still a very good film. In fact, it was seen as an excellent film in its day, as it was nominated for the first Best Picture Oscar (losing to WINGS). I still consider WINGS to be a superior film, but this one is excellent despite a little bit of overacting by the lead, Emil Jannings.<br /><br />Jannings is a general from Czarist Russia who is living out his final days making a few bucks in the 1920s by being a Hollywood extra. His luck appears to have changed as he gets a casting call--to play an Imperial Russian general fighting against the Communists during the revolution. Naturally this isn't much of a stretch acting-wise, but it also gets the old man to thinking about the old days and the revolution.<br /><br />Exactly what happens next I'll leave to you, but it's a pretty good film--particularly at the end. By the way, look for William Powell as the Russian director. De

If you're looking to make your own chat template, that also is possible! You must use the Jinja templating regime. We provide our own stripped down version of the `Unsloth template` which we find to be more efficient, and leverages ChatML, Zephyr and Alpaca styles.

More info on chat templates on [our wiki page!](https://github.com/unslothai/unsloth/wiki#chat-templates)

In [9]:
# Unsloth template for chat-style conversations (not used in sentiment analysis tasks)
unsloth_template = (
    "{{ bos_token }}"
    "{{ 'You are a helpful assistant to the user\n' }}"
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}"
    "{{ '>>> User: ' + message['content'] + '\n' }}"
    "{% elif message['role'] == 'assistant' %}"
    "{{ '>>> Assistant: ' + message['content'] + eos_token + '\n' }}"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}"
    "{{ '>>> Assistant: ' }}"
    "{% endif %}"
)
unsloth_eos_token = "eos_token"

if False:  # Set to True only if using chat fine-tuning
    tokenizer = get_chat_template(
        tokenizer,
        chat_template=(unsloth_template, unsloth_eos_token),  # Template and EOS token
        mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},  # ShareGPT mapping
        map_eos_token=True,  # Maps <|im_end|> to </s> instead
    )

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`. We also support TRL's `DPOTrainer`!

In [10]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Define the trainer for sentiment analysis
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,  # Use 2 processors for dataset preprocessing
    packing=False,  # Packing disabled; useful for variable-length sequences
    args=TrainingArguments(
        per_device_train_batch_size=1,  # Lower batch size to fit within 10GB
        gradient_accumulation_steps=8,  # Maintain effective batch size
        warmup_steps=5,
        max_steps=50,  # Reduced steps for faster completion
        learning_rate=2e-4,  # Learning rate; can be adjusted if needed
        fp16=not is_bfloat16_supported(),  # Enable FP16 if bfloat16 not supported
        bf16=is_bfloat16_supported(),  # Enable bfloat16 if supported
        logging_steps=5,  # Log every 5 steps
        optim="adamw_8bit",  # Optimizer for memory efficiency
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,  # For reproducibility
        output_dir="outputs_sentiment",  # Directory for model checkpoints
        report_to="none",  # Disable external reporting (e.g., WandB)
    ),
)


Map (num_proc=2):   0%|          | 0/10000 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [11]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
2.195 GB of memory reserved.


In [12]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 10,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 8
\        /    Total batch size = 8 | Total steps = 50
 "-____-"     Number of trainable parameters = 6,029,312


Step,Training Loss
5,2.6538
10,2.7635
15,2.647
20,2.6562
25,2.6227
30,2.6258
35,2.6671
40,2.6722
45,2.5283
50,2.6748


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

500.2925 seconds used for training.
8.34 minutes used for training.
Peak reserved memory = 3.342 GB.
Peak reserved memory for training = 1.057 GB.
Peak reserved memory % of max memory = 22.661 %.
Peak reserved memory for training % of max memory = 7.167 %.


<a name="Inference"></a>
### Inference
Let's run the model! Since we're using `Phi-3`, use `apply_chat_template` with `add_generation_prompt` set to `True` for inference.

In [20]:
# Adjusted input with guiding prompts
texts = [
    "The movie was fantastic! I loved the characters and the plot. Sentiment: Positive or Negative?",
    "I didn't like the movie. It was too slow and boring. Sentiment: Positive or Negative?",
]

# Tokenize and generate predictions
inputs = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs["input_ids"],
    max_new_tokens=5,  # Limit the output to reduce continuation
    use_cache=True,
)

# Decode outputs and postprocess
predictions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(predictions)
# Postprocess to extract the sentiment from the generated text
for text, prediction in zip(texts, predictions):
    # Find the last occurrence of "Positive" or "Negative"
    if "Positive" in prediction and "Negative" in prediction:
        # Use the last occurrence to determine the sentiment
        positive_index = prediction.rfind("Positive")
        negative_index = prediction.rfind("Negative")
        sentiment = "Positive" if positive_index > negative_index else "Negative"
    elif "Positive" in prediction:
        sentiment = "Positive"
    elif "Negative" in prediction:
        sentiment = "Negative"
    else:
        sentiment = "Unknown"  # Handle cases where neither is present

    print(f"Input: {text}")
    print(f"Sentiment Prediction: {sentiment}")


['The movie was fantastic! I loved the characters and the plot. Sentiment: Positive or Negative? Sentiment: Positive', "I didn't like the movie. It was too slow and boring. Sentiment: Positive or Negative? Negative. I didn"]
Input: The movie was fantastic! I loved the characters and the plot. Sentiment: Positive or Negative?
Sentiment Prediction: Positive
Input: I didn't like the movie. It was too slow and boring. Sentiment: Positive or Negative?
Sentiment Prediction: Negative


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters use Huggingface's `push_to_hub`

In [21]:
model.push_to_hub("Samarth1305/phi_3_SFT_sentiment", token = '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') # Online saving
tokenizer.push_to_hub("Samarth1305/phi_3_SFT_sentiment", token = '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') # Online saving

README.md:   0%|          | 0.00/600 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/24.1M [00:00<?, ?B/s]

Saved model to https://huggingface.co/Samarth1305/phi_3_SFT_sentiment


  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]