To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a> Join Discord if you need help + support us if you can!
</div>

To install Unsloth on your own computer, follow the installation instructions on our GitHub page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save) (e.g. for Llama.cpp).

In [2]:
!pip install -q "unsloth[kaggle-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q "xformers<0.0.26"
!pip install -q --no-deps trl peft accelerate bitsandbytes evaluate huggingface_hub wandb triton
!pip install -q bert_score rouge_score sacrebleu

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes, etc.
* We also support Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, and all Llama- and Mistral-derived architectures.
* We support 16bit LoRA or 4bit QLoRA. Both are 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Gemma (trained on 6 trillion tokens) **2.5x faster**! See our [Gemma notebook](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)
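
The RoPE scaling mentioned in the list above can be illustrated with a small sketch. It assumes the linear position-interpolation variant, where positions are divided by a scale factor before the rotary angles are computed. The function `rope_angles` and its parameters are hypothetical and are not part of Unsloth's API.

```python
def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotary angles for one token position (hypothetical helper).

    scale > 1 compresses positions, so a longer sequence reuses the
    angle range the model saw during pretraining.
    """
    scaled_position = position / scale
    # One angle per pair of hidden dimensions: theta_i = pos * base^(-2i/dim)
    return [scaled_position * base ** (-2 * i / dim) for i in range(dim // 2)]


# With scale=2, position 8 is mapped onto the angles of unscaled position 4.
print(rope_angles(8, dim=4, scale=2.0) == rope_angles(4, dim=4))
```

Under this assumption, doubling `max_seq_length` corresponds to `scale=2.0`, so every position index stays inside the range the pretrained model already handles.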

In [4]:
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [7]:
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HF_WRITE_TOKEN")
wandb_api = user_secrets.get_secret("WANDB_API_KEY")

In [8]:
import huggingface_hub
import wandb

huggingface_hub.login(hf_token)
wandb.login(key=wandb_api)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [8]:
models = [
    {
        "model": "unsloth/llama-3-8b-Instruct-bnb-4bit",
        "max_new_tokens": 1024,
        "max_seq_length": 8192,
        "gradient_accumulation_steps": 4,
        "per_device_train_batch_size": 1,
        "instruction_template": "<|start_header_id|>system<|end_header_id|>",
        "response_template": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
        "user_template": "<|eot_id|><|start_header_id|>user<|end_header_id|>",
        "end_template": "<|eot_id|><|end_of_text|>",
        # instruction template https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/#meta-llama-3-instruct
    },
    {
        "model": "unsloth/Phi-3-mini-4k-instruct",
        "max_new_tokens": 1024,
        "max_seq_length": 4096,
        "gradient_accumulation_steps": 1,
        "per_device_train_batch_size": 4,
        "instruction_template": "<|user|>",
        "response_template": "<|assistant|>",
        "user_template": "",
        "end_template": "<|end|>",
        # instruction template https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf#chat-format
    },
]

In [None]:
selected_model = models[1]

In [36]:
instruction_template = selected_model["instruction_template"]
response_template = selected_model["response_template"]
user_template = selected_model["user_template"]
end_template = selected_model["end_template"]

PROMPT = (
    instruction_template
    + """
You are a helpful code assistant that generates a text description of a pull request based on the DIFF of the pull request.
Your task is to provide a concise summary of the changes. This summary will be used as the description of the pull request.
You should output only the description of this DIFF (description of pull request).
You should not include any other text. You think deeply about the changes and carefully analyze them.
Example:
### DIFF:
diff a/main.py b/main.py
@@ -1,4 +1,4 @@
a = 1
b = 2
- c = 3
+ c = 4
print(c)

# Answer:
Change value of c from 3 to 4"""
    + user_template
    + """
### DIFF:
{}
"""
    + response_template
    + "\n\n{}"
    + end_template
)

In [9]:
params = {
    "model": selected_model["model"],
    "max_new_tokens": selected_model["max_new_tokens"],
    "max_seq_length": selected_model["max_seq_length"],
    "random_seed": 42,
    "lora_alpha": 16,
    "gradient_accumulation_steps": selected_model["gradient_accumulation_steps"],
    "per_device_train_batch_size": selected_model["per_device_train_batch_size"],
    "eval_steps": 250,
    "lr_scheduler_type": "cosine",
    "prompt": PROMPT,
    "instruction_template": instruction_template,
    "response_template": response_template,
    "user_template": user_template,
    "end_template": end_template,
}
model_name = params["model"].split("/")[-1]

In [10]:
wandb.init(
    project="PRGen",
    name=f"Tune {params['model']}",
    config=params,
)

[34m[1mwandb[0m: Currently logged in as: [33msamoed-roman[0m. Use [1m`wandb login --relogin`[0m to force relogin


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
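
To make the parameter saving concrete, here is a rough arithmetic sketch. It assumes the standard LoRA factorization, where a frozen `d_out x d_in` weight gets two small trainable matrices of rank `r`. The layer size is an illustrative assumption, and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Illustrative 4096 x 4096 projection with rank 16 (sizes are assumptions).
full = 4096 * 4096
added = lora_trainable_params(4096, 4096, r=16)
print(added, f"{added / full:.2%} of the frozen weight")
```

The overall trainable share depends on the rank and on how many modules receive adapters, so the exact percentage varies by model.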

In [11]:
import torch
from transformers import set_seed


def fix_seed():
    torch.manual_seed(params["random_seed"])
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(params["random_seed"])


fix_seed()
set_seed(params["random_seed"])
torch.cuda.set_device(0)

2024-05-15 07:59:53.129122: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-15 07:59:53.129219: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-15 07:59:53.242503: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


<a name="Data"></a>
### Data Prep
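The original Unsloth template uses the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of the 52K-example original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). This notebook instead loads pull-request data from parquet files: each row's `diff` becomes the prompt, and its `title` and `body` become the completion. You can replace this code section with your own data prep.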

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).
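
Training on completions only means the loss skips the prompt tokens. The sketch below shows the idea with made-up token IDs, assuming labels equal to `-100` are ignored by the loss (the default ignore index of PyTorch's cross-entropy). `mask_prompt_tokens` is a hypothetical helper, not TRL's implementation.

```python
IGNORE_INDEX = -100  # label value the loss skips


def mask_prompt_tokens(input_ids, response_template_ids):
    """Return labels where everything up to and including the response
    template is replaced by IGNORE_INDEX."""
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            end = start + n
            return [IGNORE_INDEX] * end + input_ids[end:]
    # Template not found: ignore the whole example.
    return [IGNORE_INDEX] * len(input_ids)


# Made-up IDs: 7, 8 stand for the response template tokens.
labels = mask_prompt_tokens([1, 2, 3, 7, 8, 4, 5], [7, 8])
print(labels)
```

Only the tokens after the response template keep their IDs as labels, so only they contribute to the training loss.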

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [12]:
import pandas as pd
from datasets import Dataset

In [13]:
train = pd.read_parquet("/kaggle/input/prgenselecteddata/sampled.parquet")
test = pd.read_parquet("/kaggle/input/prgenselecteddata/test.parquet")

In [14]:
def to_example(row):
    # The prompt is the raw diff; the completion is the PR title as a heading, then the body.
    body = row.get("body", "")
    body = body if isinstance(body, str) else ""  # guard against a missing (None/NaN) body
    return {"prompt": row["diff"], "completion": "# " + row["title"] + "\n" + body}


train_prompts = train.apply(to_example, axis=1)
test_prompts = test.apply(to_example, axis=1)

In [15]:
train_prompts = pd.DataFrame(train_prompts.to_list())
test_prompts = pd.DataFrame(test_prompts.to_list())

In [16]:
train_dataset = Dataset.from_pandas(train_prompts)
test_dataset = Dataset.from_pandas(test_prompts)

The `SFTTrainer` supports popular dataset formats, which lets you pass the dataset to the trainer directly without any pre-processing. The dataset built above follows the instruction format:

```json
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
```

See [dataset format support](https://huggingface.co/docs/trl/sft_trainer#dataset-format-support) in the TRL docs.
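
As an illustration of how a `prompt`/`completion` record becomes a single training string in this notebook, the sketch below fills a shortened stand-in for `PROMPT` using the Phi-3 markers from the `models` list. The system text is abbreviated and the record is invented for the example.

```python
# Markers copied from the Phi-3 entry of `models`; the system text is shortened.
instruction_template = "<|user|>"
response_template = "<|assistant|>"
end_template = "<|end|>"

template = (
    instruction_template
    + "\nSummarize the changes in this pull request.\n### DIFF:\n{}\n"
    + response_template
    + "\n\n{}"
    + end_template
)

# Invented record in the prompt/completion format.
record = {"prompt": "- c = 3\n+ c = 4", "completion": "# Change c from 3 to 4\n"}
text = template.format(record["prompt"], record["completion"])
print(text)
```

Because the diff is passed as an argument to `str.format` rather than being part of the template, braces inside a diff are not interpreted as placeholders.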

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). The original template runs only 60 steps to speed things up; this notebook sets `num_train_epochs=1` for a full pass over the data and evaluates every `eval_steps` steps. To shorten a run, set `max_steps` instead. We also support TRL's `DPOTrainer`!
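
With `num_train_epochs=1`, the number of optimizer steps follows from the dataset size and the effective batch size. The sketch below is an approximation of that arithmetic. The dataset size of 1000 is a made-up figure, and `steps_per_epoch` is a hypothetical helper, not a Trainer function.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Approximate optimizer steps in one epoch (a partial last batch counts as a step)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_examples / effective_batch)


# Both configurations in `models` give an effective batch size of 4.
print(steps_per_epoch(1000, per_device_batch_size=1, grad_accum_steps=4))  # Llama 3 settings
print(steps_per_epoch(1000, per_device_batch_size=4, grad_accum_steps=1))  # Phi-3 settings
```

Comparing this number with `eval_steps=250` gives a rough idea of how many evaluation rounds a run will trigger.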

In [17]:
import evaluate
import numpy as np

bertscore = evaluate.load("bertscore")
rouge = evaluate.load("rouge")
chrf = evaluate.load("chrf")

Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/9.01k [00:00<?, ?B/s]

In [18]:
from collections import defaultdict


# https://github.com/huggingface/trl/issues/862#issuecomment-1896074498
def preprocess_logits_for_metrics(logits, labels):
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)


step = 250


def compute_metrics(eval_pred):
    # eval_pred unpacks to (predicted token IDs, label token IDs), both NumPy arrays.
    global step

    predictions, labels = eval_pred

    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    rouge_score = rouge.compute(predictions=predictions, references=labels)
    bert_score = bertscore.compute(predictions=predictions, references=[[l] for l in labels], lang="en")

    chrf_score = chrf.compute(predictions=predictions, references=labels, word_order=2)

    # Average the per-example BERTScore lists (precision, recall, f1); skip the "hashcode" string.
    bert_score_result = {
        "bert_" + key.split("/")[-1]: np.mean(value)
        for key, value in bert_score.items()
        if key != "hashcode"
    }

    table = [[pred, label] for pred, label in zip(predictions, labels, strict=False)]

    table = wandb.Table(data=table, columns=["pred", "label"])
    test_predictions = wandb.Artifact(f"step_{step}", type="predictions")
    test_predictions.add(table, f"step_{step}")

    wandb.run.log_artifact(test_predictions)
    step += 250

    return rouge_score | bert_score_result | chrf_score

In [19]:
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=params["model"],
    max_seq_length=params["max_seq_length"],
    dtype=None,
    load_in_4bit=True,
)

Unsloth: You passed in `unsloth/Phi-3-mini-4k-instruct` and `load_in_4bit = True`.
We shall load `unsloth/Phi-3-mini-4k-instruct-bnb-4bit` for 4x faster loading.


config.json:   0%|          | 0.00/1.16k [00:00<?, ?B/s]

==((====))==  Unsloth: Fast Mistral patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.2+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


model.safetensors:   0%|          | 0.00/2.26G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/140 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.17k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [20]:
tokenizer.pad_token = tokenizer.eos_token

In [21]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=params["lora_alpha"],
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=params["random_seed"],
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None,  # And LoftQ
)

Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [22]:
def formatting_prompts_func(examples):
    output_texts = []
    if isinstance(examples, dict):
        return PROMPT.format(examples["prompt"], examples["completion"])
    for prompt, completion in zip(examples["prompt"], examples["completion"], strict=False):
        text = PROMPT.format(prompt, completion)
        output_texts.append(text)
    return output_texts

In [37]:
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

data_collator = DataCollatorForCompletionOnlyLM(
    tokenizer=tokenizer, response_template=response_template, instruction_template=instruction_template
)



In [38]:
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    max_seq_length=params["max_seq_length"],
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    formatting_func=formatting_prompts_func,
    packing=False,  # For data collator
    data_collator=data_collator,
    args=TrainingArguments(
        per_device_train_batch_size=params["per_device_train_batch_size"],
        per_device_eval_batch_size=1,
        #         auto_find_batch_size=True,
        gradient_accumulation_steps=params["gradient_accumulation_steps"],
        warmup_steps=0,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type=params["lr_scheduler_type"],
        seed=params["random_seed"],
        output_dir="outputs",
        report_to="wandb",
        evaluation_strategy="steps",
        eval_steps=params["eval_steps"],
    ),
)

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]
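
The `cosine` scheduler selected in `params` decays the learning rate along half a cosine wave. Here is a minimal sketch under the settings used above (`learning_rate=2e-4`, no warmup). `cosine_lr` is a hypothetical helper that follows the standard formula and is not a call into `transformers`; the total of 100 steps is an arbitrary example value.

```python
import math


def cosine_lr(step, total_steps, base_lr=2e-4):
    """Learning rate after `step` optimizer steps, with no warmup."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps=100))
```

The rate starts at the base value, reaches about half of it at the midpoint, and approaches zero at the final step.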

In [39]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.029 GB of memory reserved.


In [40]:
trainer_stats = trainer.train()



In [None]:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

In [None]:
model.push_to_hub(f"PRGen-{model_name}-4bit-LoRA")