<a href="https://colab.research.google.com/github/rafaelpivetta/tech-challenge-fase3/blob/main/fine-tuning/Tech3_Finetuning_com_LoRA_e_unsloth.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
new_model_name = "rafaelpivetta/tinyllama-chat-bnb-4bit-g19"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/tinyllama-chat-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.9: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



### Data preparation and first inference

In [None]:
alpaca_prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    contents    = examples["content"]
    texts = []
    for instruction, content in zip(instructions, contents):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, content) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("LuizfvFonseca/trn_limpo_parte_1_de_70", split = "train")

# Build an instruction from the product title and store it in a new 'instruction' column
def add_instruction_column(example):
    example["instruction"] =  f"Create a description for the following product: {example['title']}"
    return example

# Apply the function to the dataset
dataset = dataset.map(add_instruction_column)

dataset = dataset.map(formatting_prompts_func, batched = True,)

alpaca_prompt_text = dataset['text'][3]
print(alpaca_prompt_text)

Dataset({
    features: ['uid', 'title', 'content', 'instruction'],
    num_rows: 20000
})
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Create a description for the following product: The Prophet

### Response:
In a distant, timeless place, a mysterious prophet walks the sands. At the moment of his departure, he wishes to offer the people gifts but possesses nothing. The people gather round, each asks a question of the heart, and the man's wisdom is his gift. It is Gibran's gift to us, as well, for Gibran's prophet is rivaled in his wisdom only by the founders of the world's great religions. On the most basic topics--marriage, children, friendship, work, pleasure--his words have a power and lucidity that in another era would surely have provoked the description "divinely inspired." Free of dogma, free of power structures and metaphysics, consider these poetic, moving aphorisms a 20th-century supplement to al
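
To see what the two `map` calls produce without downloading the dataset, here is a minimal sketch on an in-memory batch. The sample rows and the hard-coded `EOS_TOKEN = "</s>"` are assumptions for illustration; in the notebook the EOS token comes from `tokenizer.eos_token`.

```python
# Same Alpaca-style template used in the notebook.
ALPACA_PROMPT = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{}

### Response:
{}"""

EOS_TOKEN = "</s>"  # assumption: stands in for tokenizer.eos_token


def format_batch(batch):
    """Turn parallel 'instruction' and 'content' lists into training texts."""
    texts = [
        ALPACA_PROMPT.format(instruction, content) + EOS_TOKEN
        for instruction, content in zip(batch["instruction"], batch["content"])
    ]
    return {"text": texts}


# Hypothetical rows shaped like the dataset's 'title' and 'content' columns.
batch = {
    "title": ["The Prophet", "Girls Ballet Tutu Neon Pink"],
    "content": ["A book of poetic essays.", "A bright pink tutu for dance class."],
}
batch["instruction"] = [
    f"Create a description for the following product: {title}"
    for title in batch["title"]
]

out = format_batch(batch)
print(out["text"][0])
```

Each text ends with the EOS token, which is what tells the fine-tuned model where a description stops.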

### First inference

In [None]:
prompt= """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Create a description for the following product: The Prophet

### Response:"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda") # Move the inputs to the same device as the model

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
generation_output = model.generate(
    input_ids=input_ids,
    max_new_tokens=128
)

print(tokenizer.decode(generation_output[0]))

# Run garbage collection so Colab does not run out of resources during training
import gc
gc.collect()
torch.cuda.empty_cache() # Release cached GPU memory

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Create a description for the following product: The Prophet

### Response:
I am not capable of creating a description for the product "The Prophet". However, I can provide you with the following information:

- The Prophet is a high-end smartphone that is designed for professionals and businesses. It features a 6.5-inch AMOLED display, a Qualcomm Snapdragon 855 processor, 6GB of RAM, and 128GB of storage. The phone also has a 48MP dual-camera system on the rear and a 16MP front-facing camera. The Prophet is


### LoRA parameter configuration

Configure the LoRA adapters so that only 1 to 10% of all parameters are trained!

In [None]:
# model = FastLanguageModel.get_peft_model(
#     model,
#     r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
#     target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
#                       "gate_proj", "up_proj", "down_proj",],
#     lora_alpha = 16,
#     lora_dropout = 0, # Supports any, but = 0 is optimized
#     bias = "none",    # Supports any, but = "none" is optimized
#     # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
#     use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
#     random_state = 3407,
#     use_rslora = False,  # We support rank stabilized LoRA
#     loftq_config = None, # And LoftQ
# )

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    # target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
    #                   "gate_proj", "up_proj", "down_proj",],
    target_modules = ['q_proj','k_proj','v_proj','o_proj','gate_proj','down_proj','up_proj','lm_head'], # Test with all modules, including lm_head
    lora_alpha = 16,
    lora_dropout = 0.05, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    task_type="CAUSAL_LM",
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.9 patched 22 layers with 22 QKV layers, 22 O layers and 22 MLP layers.
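
As a rough check on the adapter size, the sketch below counts LoRA parameters for the seven projection modules in each layer. A rank-`r` adapter on a `d_in x d_out` linear layer adds `r * (d_in + d_out)` weights. The layer shapes are assumptions based on the published TinyLlama 1.1B configuration (they are not stated in this notebook), and the base parameter count is approximate. `lm_head` is left out of the calculation.

```python
R = 16               # LoRA rank, as configured above
HIDDEN = 2048        # assumed TinyLlama hidden size
INTERMEDIATE = 5632  # assumed TinyLlama MLP size
KV_DIM = 256         # assumed: 4 key/value heads x 64 dims each
NUM_LAYERS = 22      # matches the "patched 22 layers" log line

# (input features, output features) of each adapted linear layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(d_in, d_out, r):
    """A (r x d_in) plus B (d_out x r) low-rank matrices."""
    return r * (d_in + d_out)


per_layer = sum(lora_param_count(d_in, d_out, R) for d_in, d_out in SHAPES.values())
total = per_layer * NUM_LAYERS

BASE_PARAMS = 1.1e9  # approximate size of the base model
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"Share of base model:       {total / BASE_PARAMS:.2%}")
```

Under these assumptions the total comes to 12,615,680, the same figure the trainer later reports as the number of trainable parameters, which is a little over 1% of the base model.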


<a name="Data"></a>

### Inference before training

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Create a description for the following product: The Prophet", # instruction
        "", # response - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs,
                         max_new_tokens = 128,
                         use_cache = True,
                        #  temperature=0.7,
                        #  top_p=0.9,
                        #  repetition_penalty=1.1
                         )
tokenizer.batch_decode(outputs)

['<s> Below is an instruction to describe the product. Provide a short and concise response.\n\n### Instruction:\nDescribe the book The Prophet\n\n### Response:\nThe Prophet is a novel by Salman Rushdie, published in 1995. It is a fictional account of the life of the Prophet Muhammad, written in the form of a novel. The Prophet is a powerful and engaging work of fiction that explores the complexities of Islamic history and culture.\n\nThe novel follows the life of the Prophet Muhammad from his early years as a shepherd in Mecca to his eventual rise to power as the leader of the Muslim community. The Prophet is a deeply human and relatable character, who struggles']

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We run 100 steps (`max_steps = 100`) to speed things up; for a full run, set `num_train_epochs=1` and remove `max_steps`. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 16,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 100,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        max_grad_norm = 0.3, # Gradient clipping: caps the gradient norm to help stabilize training.
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        group_by_length = True, # Batch sequences of similar length together to reduce padding.
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/20000 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
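
To make the trade-off between `max_steps` and a full epoch concrete, this small sketch works out how much of the dataset the 100-step run actually covers. It only uses values from the training arguments above and the dataset size of 20,000 rows, and assumes a single GPU.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 16
num_gpus = 1          # assumption: single Colab GPU
max_steps = 100
num_examples = 20_000

# One optimizer step consumes this many examples.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
examples_seen = effective_batch_size * max_steps
fraction_of_epoch = examples_seen / num_examples

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen:        {examples_seen}")
print(f"Fraction of an epoch: {fraction_of_epoch:.0%}")
```

With these settings the run sees 3,200 examples, or 16% of one epoch.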


### Simple inference

In [None]:
# # alpaca_prompt = Copied from above
# FastLanguageModel.for_inference(model) # Enable native 2x faster inference
# inputs = tokenizer(
# [
#     alpaca_prompt.format(
#         "Describe the product Girls Ballet Tutu Neon Pink", # instruction
#         "", # output - leave this blank for generation!
#     )
# ], return_tensors = "pt").to("cuda")

# outputs = model.generate(**inputs,
#                          max_new_tokens = 128,
#                          use_cache = True,
#                          temperature=0.7,
#                          top_p=0.9,
#                          repetition_penalty=1.1
#                          )
# tokenizer.batch_decode(outputs)

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.984 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 20,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 16
\        /    Total batch size = 32 | Total steps = 100
 "-____-"     Number of trainable parameters = 12,615,680


Step,Training Loss
1,5.1609
2,6.0413
3,5.2408
4,5.3033
5,5.3406
6,4.8528
7,4.1717
8,5.1408
9,5.2181
10,3.7671
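
The first logged steps fall inside the warmup window (`warmup_steps = 5`), where the learning rate is still ramping up. The function below is a sketch of a warmup-then-linear-decay schedule, assuming it follows the usual shape for `lr_scheduler_type = "linear"`; the function name is hypothetical and the exact step indexing inside the trainer may differ by one.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=100):
    """Learning rate after `step` optimizer steps (illustrative sketch)."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        factor = step / max(1, warmup_steps)
    else:
        # Linear decay from base_lr down to 0 at total_steps.
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor


for step in (0, 1, 5, 50, 100):
    print(f"step {step:>3}: lr = {linear_lr(step):.2e}")
```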


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

2366.5619 seconds used for training.
39.44 minutes used for training.
Peak reserved memory = 10.049 GB.
Peak reserved memory for training = 4.065 GB.
Peak reserved memory % of max memory = 68.138 %.
Peak reserved memory for training % of max memory = 27.563 %.
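
From the runtime above, a back-of-the-envelope estimate for a full epoch can be sketched as follows. It assumes the time per step stays constant, which is only an approximation, since sequence lengths vary and `group_by_length` changes the batch composition.

```python
import math

train_runtime_s = 2366.5619   # from the training stats above
steps_run = 100
effective_batch_size = 32     # 2 per device x 16 accumulation steps
num_examples = 20_000

seconds_per_step = train_runtime_s / steps_run
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
estimated_epoch_hours = seconds_per_step * steps_per_epoch / 3600

print(f"Seconds per step:     {seconds_per_step:.2f}")
print(f"Steps per epoch:      {steps_per_epoch}")
print(f"Estimated epoch time: {estimated_epoch_hours:.1f} hours")
```

On the same T4 this points to roughly four hours for `num_train_epochs=1`, which is worth knowing before swapping out `max_steps`.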


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction; leave the response blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Describe the product The Prophet", # instruction
        "", # response - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs,
                         max_new_tokens = 128,
                         use_cache = True,
                        #  temperature=0.7,
                        #  top_p=0.9,
                        #  repetition_penalty=1.1
                         )
tokenizer.batch_decode(outputs)

['<s> Below is an instruction to describe the product. Provide a short and concise response.\n\n### Instruction:\nDescribe the product The Prophet\n\n### Response:\n&#8220;The Prophet is a masterpiece of the novel form.&#8221;--James Joyce, in The New York Times Book Review</s>']

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Describe the product The Prophet", # instruction
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<s> Below is an instruction to describe the product. Provide a short and concise response.

### Instruction:
Describe the product The Prophet

### Response:
&#8220;The Prophet is a masterpiece of the novel form.&#8221;--James Joyce, in The New York Times Book Review</s>
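
The decoded output contains the whole prompt, the special tokens, and HTML entities the model picked up from the training data. A minimal post-processing sketch, assuming the `### Response:` marker from the prompt template and `</s>` as the EOS token; `extract_response` is a hypothetical helper, not part of Unsloth or Transformers:

```python
import html

RESPONSE_MARKER = "### Response:"


def extract_response(decoded, eos_token="</s>"):
    """Return only the generated answer from a decoded generation."""
    # Keep the text after the last response marker (whole text if absent).
    _, _, response = decoded.rpartition(RESPONSE_MARKER)
    # Drop the end-of-sequence token and decode entities such as &#8220;
    response = response.replace(eos_token, "")
    return html.unescape(response).strip()


decoded = (
    "<s> Below is an instruction to describe the product. "
    "Provide a short and concise response.\n\n"
    "### Instruction:\nDescribe the product The Prophet\n\n"
    "### Response:\n&#8220;The Prophet is a masterpiece of the novel form.&#8221;"
    "--James Joyce, in The New York Times Book Review</s>"
)

answer = extract_response(decoded)
print(answer)
```

Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is another way to drop `<s>` and `</s>`, but it does not remove the prompt or the HTML entities.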


<a name="Save"></a>
### Saving, loading finetuned models

In [None]:
from google.colab import userdata

# import gc # garbage collection
# gc.collect()
# torch.cuda.empty_cache() #clean cache

#trainer.model.save_pretrained(new_model_name)
#trainer.tokenizer.save_pretrained(new_model_name)

#model.save_pretrained("lora_model") # Local saving
#tokenizer.save_pretrained("lora_model")
model.push_to_hub(new_model_name, token = userdata.get('HF_TOKEN')) # Online saving
#tokenizer.push_to_hub(new_model_name, token = userdata.get('HF_TOKEN')) # Online saving

IsADirectoryError: [Errno 21] Is a directory: 'rafaelpivetta/tinyllama-chat-bnb-4bit-g19'