
# NLP II: Fine-tuning Llama

This notebook serves as the main pipeline for processing data, creating a model, training it, and evaluating its performance on a text classification task.

---
## Objectives
1. **Data Processing**: Load and preprocess text data.
2. **Model Creation**: Define a machine learning or deep learning model for the task.
3. **Training**: Train the model on the dataset.
4. **Evaluation**: Assess the model's performance on a test set.



---
#### Libraries and Dependencies

In [1]:
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments
)
from trl import SFTTrainer
from time import time

from train import train
from evaluate import evaluate_model
from utils import get_dataset
from keys_file import TOKEN
import optuna


---
#### Data acquisition and processing

In [2]:
# DATASET = load_dataset("GAIR/lima", data_dir = "./data")
test_size = 50

DATASET = get_dataset("FOLDER_DATA")
DATASET['test'] = DATASET['test'].shuffle(seed=42).select(range(min(len(DATASET['test']), test_size)))
print("Train size: ", len(DATASET["train"]))
print("Test size: ", len(DATASET["test"]))
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("Device: ", DEVICE)

Train size:  100176
Test size:  50
Device:  cuda


---
#### **Configurations**

##### 1. Quantization

Loading the model through a BitsAndBytesConfig quantizes its weights, which mainly reduces memory usage. The parameters are:

* load_in_4bit: Loads the model weights in 4-bit precision to save memory. (Boolean)

* bnb_4bit_quant_type: Sets the 4-bit data type ("nf4", NormalFloat4, designed for normally distributed weights, or "fp4", a plain 4-bit float).

* bnb_4bit_compute_dtype: Defines the data type used for computation after dequantization (float16, bfloat16, float32).

* bnb_4bit_use_double_quant: Enables double quantization, which saves additional memory.

**Double quantization**

Double quantization reduces the memory footprint further by applying two rounds of quantization.

    - The first round quantizes the main weights block by block, storing one scaling constant per block.
    - The second round quantizes those scaling constants themselves, saving a little more memory at a slight cost in speed.
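Double quantization is usually described as a memory optimization, and a back-of-the-envelope calculation makes the size of the effect concrete. The sketch below estimates how many extra bits per weight the quantization constants cost, with and without the second round. The block sizes (64 weights per block, 256 constants per second-level block) and the constant widths (32-bit and 8-bit) are assumptions taken from the QLoRA description, not values read from this notebook.

```python
# Rough estimate of the storage overhead of quantization constants, in bits per weight.
# All numbers below are assumptions based on the QLoRA description.
WEIGHT_BLOCK = 64      # weights that share one scaling constant (assumed)
CONSTANT_BLOCK = 256   # first-level constants that share one second-level constant (assumed)


def overhead_bits(double_quant: bool) -> float:
    """Return the extra bits stored per weight for the scaling constants."""
    if not double_quant:
        # One 32-bit constant per block of weights
        return 32 / WEIGHT_BLOCK
    # 8-bit first-level constants, plus one 32-bit constant per block of constants
    return 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)


single = overhead_bits(double_quant=False)
double = overhead_bits(double_quant=True)
print(f"without double quantization: {single:.3f} bits/weight")
print(f"with double quantization:    {double:.3f} bits/weight")
print(f"saving:                      {single - double:.3f} bits/weight")
```

Under these assumptions the saving is a fraction of a bit per weight, which adds up over roughly 8 billion parameters.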

In [3]:

compute_dtype = torch.bfloat16  # Computation data type (requires a GPU with bfloat16 support)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # Enable loading the model in 4-bit precision
    bnb_4bit_quant_type="nf4",            # Specify quantization type as NormalFloat4 ("fp4" is the alternative)
    bnb_4bit_compute_dtype=compute_dtype, # Set computation data type
    bnb_4bit_use_double_quant=True,       # Also quantize the quantization constants to save memory
)

#### 2. Model and Tokenizer

In [4]:
MODEL_NAME = "meta-llama/Llama-3.1-8B"
OUTPUT_DIR = "./" + MODEL_NAME + "_results"
LEARNING_RATE = 1e-4


# Load the model with the quantization config; device_map places the layers on the available devices
model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        quantization_config=bnb_config,  # Apply quantization configuration
        device_map="auto",                # Automatically map layers to devices
        token=TOKEN                       # Hugging Face access token (replaces the deprecated use_auth_token)
    )



Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [5]:
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    add_eos_token=True,      # Add end-of-sequence token to the tokenizer
    use_fast=True,           # Use the fast tokenizer implementation
    padding_side='left',     # Pad sequences on the left side
    token=TOKEN)             # Hugging Face access token (replaces the deprecated use_auth_token)

tokenizer.pad_token = tokenizer.eos_token



In [6]:
# MODEL INSTANTIATION
model = prepare_model_for_kbit_training(model)  # Required for the quantized model so that it can be trained
model.config.pad_token_id = tokenizer.pad_token_id  # Set the model's padding token ID (check the model config to confirm the attribute name)
model  # Already placed by device_map="auto"; calling model.to(DEVICE) here only triggers an accelerate warning


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps

In [None]:
# Test the model before training
def generate_text(model, tokenizer, prompt, device="cuda"):
    # Tokenize the prompt and move input_ids and attention_mask to the target device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,                             # Pass the attention mask together with input_ids
        max_length=500,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,  # Set the pad token explicitly for generation
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# test_prompt = "Instruction: Give me a formal email intro\nContext: I am a law student applying to a New York like the Suits TV show\nResponse: "
# Spanish prompt: "Give me a recipe for Andalusian gazpacho" / context: "I am useless"
test_prompt = "Instruction: Dame una receta de gazpacho andaluz\nContext: Soy un inutil\nResponse:"

try:
    # Get model output before training
    print("Before training:")
    output_before = generate_text(model, tokenizer, test_prompt, DEVICE)
    print(output_before)
except Exception as exc:
    # Report the failure instead of silently swallowing it
    print(f"Generation failed: {exc}")



Before training:


Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


#### 4. Finetuning Parameters

##### 4.1 - LoRA

LoRA applies low-rank updates to a pretrained model, enabling efficient fine-tuning by learning only small, additional matrices instead of updating all model weights. Here’s what each parameter does:

* lora_alpha: Scaling factor for updates (the update is scaled by lora_alpha / r); higher values (16, 32) increase the impact of the update, which improves adaptation but may risk overfitting.

* lora_dropout: Dropout rate for LoRA layers; typical values (0.0, 0.05) help prevent overfitting with minimal regularization.

* r: Rank of the LoRA matrices; lower values (4, 8) reduce parameters and memory, while higher values (16) offer more flexibility.

* bias: Selects which bias terms are trained ("none", "all", "lora_only").

* target_modules: Specifies the layers LoRA is applied to (for example ['k_proj', 'v_proj']); selecting fewer layers reduces compute cost but may limit effectiveness.
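To make the trade-off between `r` and `target_modules` concrete, the sketch below counts how many parameters LoRA adds for a given rank and set of target modules. The layer shapes and the number of decoder layers are copied from the printed `LlamaForCausalLM` architecture earlier in this notebook. The helper name `lora_param_count` is hypothetical, and the count assumes each adapted layer gets one `r x in_features` matrix and one `out_features x r` matrix with no bias terms.

```python
# Count the parameters added by LoRA (illustrative helper, not part of PEFT).
# Shapes are (in_features, out_features), copied from the printed model architecture.
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # decoder layers in the printed architecture


def lora_param_count(rank, target_modules):
    """Each adapted layer adds A (rank x in_features) and B (out_features x rank)."""
    per_decoder_layer = sum(
        rank * (LAYER_SHAPES[name][0] + LAYER_SHAPES[name][1])
        for name in target_modules
    )
    return per_decoder_layer * NUM_LAYERS


all_modules = list(LAYER_SHAPES)
print(lora_param_count(8, all_modules))           # 20971520
print(lora_param_count(16, all_modules))          # 41943040
print(lora_param_count(8, ["k_proj", "v_proj"]))  # 2621440
```

Doubling the rank doubles the adapter size, while restricting LoRA to `k_proj` and `v_proj` cuts it by a factor of eight in this setup.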
        



In [8]:
lora_config = LoraConfig(
    lora_alpha=16,             # Scaling factor for LoRA updates
    lora_dropout=0.15,         # Dropout rate applied to LoRA layers
    r=8,                       # Rank of the LoRA decomposition
    bias="none",               # Do not train any bias terms
    task_type="CAUSAL_LM",     # Specify the task as causal language modeling
    target_modules=[           # Modules to apply LoRA to
        'k_proj', 'q_proj', 'v_proj', 'o_proj',
        'gate_proj', 'down_proj', 'up_proj'
    ]
)


if lora_config:
    # Apply LoRA to the model
    model = get_peft_model(model, lora_config)

"""
Notes on how to improve:
After fine-tuning, check the validation loss. If it's high, try making the following adjustments one at a time:
Increase lora_alpha: If the model is underfitting, try increasing lora_alpha to 32 or 64.
Increase lora_dropout: If you observe overfitting, increase lora_dropout to 0.2 or 0.3.
Decrease r: If the model is too large or overfitting, reduce r to 8 or 4.
Reduce the number of target modules: If the model is overfitting, try applying LoRA to fewer modules, such as ['q_proj', 'v_proj'] or just ['k_proj', 'o_proj'].



Trial 1: Val loss: 2.88 - 2.74 - 2.68 - 2.65
lora_alpha=16,            
lora_dropout=0.05,          
r=16,                      
bias="none",               
task_type="CAUSAL_LM",     
target_modules=['k_proj', 'q_proj', 'v_proj', 'o_proj','gate_proj', 'down_proj', 'up_proj']

Trial 2: Val loss: 2.88 - 2.75 - no more
lora_alpha=16,            
lora_dropout=0.2,          
r=16,                      
bias="none",               
task_type="CAUSAL_LM",     
target_modules=['k_proj', 'q_proj', 'v_proj', 'o_proj','gate_proj', 'down_proj', 'up_proj']



"""

# Uncomment to avoid LoRA
# lora_config = None


In [None]:
# AdaLoRA config
from peft import AdaLoraConfig

torch.cuda.empty_cache()

adalora_config = AdaLoraConfig(
    peft_type="ADALORA",
    task_type="CAUSAL_LM",
    init_r=12,
    lora_alpha=32,
    target_modules=['k_proj', 'q_proj', 'v_proj', 'o_proj'],
    lora_dropout=0.01,
)

if adalora_config:
    # Apply AdaLoRA to the model
    model = get_peft_model(model, adalora_config)


##### 4.2 - Llama Adapter



In [10]:
from peft import AdaptionPromptConfig

llama_adapter = AdaptionPromptConfig(
    adapter_len=20,
    adapter_layers=16,
    task_type="CAUSAL_LM",
)

if llama_adapter:
    model = get_peft_model(model, llama_adapter)
    # model.add_adapter("llama-adapter", llama_adapter)

- Llama-Adapter (10 len, 30 layers, no LoRA)
    The model seems worse than LoRA. It achieves a training loss of 1.8 and a validation loss of 1.61.

- Llama-Adapter (16 len, 16 layers, with LoRA)
    The model is worse than LoRA alone. It achieves a training loss of 1.72 and a validation loss of 1.76.
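For comparison with LoRA, the size of a Llama-Adapter configuration can be estimated by hand. The sketch below assumes that each adapted attention layer learns one prompt of shape `adapter_len x hidden_size` plus a single scalar gate. That is an assumption about how the adaption prompt is parameterized, and the helper name is hypothetical. The hidden size comes from the printed model architecture.

```python
HIDDEN_SIZE = 4096  # from the printed model architecture


def adaption_prompt_param_count(adapter_len, adapter_layers, hidden_size=HIDDEN_SIZE):
    """Estimate trainable parameters for a Llama-Adapter style configuration."""
    prompt_params = adapter_len * hidden_size  # learnable prompt per adapted layer
    gate_params = 1                            # scalar gate per adapted layer (assumed)
    return adapter_layers * (prompt_params + gate_params)


# Values used in the AdaptionPromptConfig cell above
print(adaption_prompt_param_count(adapter_len=20, adapter_layers=16))  # 1310736
```

With `adapter_len=20` and `adapter_layers=16` the estimate coincides with the "Number of trainable parameters" line in the training log further down, which suggests that run trained only the adapter.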

##### 4.3 - Supervised Fine-Tuning

In [10]:
sft_config = ""

# Set to None to skip SFT
sft_config = None

#### 5. Training Parameters

* output_dir: Directory to save checkpoints and logs.
* eval_strategy: When to run evaluation ("steps" or "epoch").
* do_eval: Enable/disable evaluation during training.
* optim: Optimizer type ("paged_adamw_8bit" for memory-efficient AdamW).
* per_device_train_batch_size: Batch size per device for training.
* gradient_accumulation_steps: Accumulate gradients over steps for larger effective batch size.
* per_device_eval_batch_size: Batch size per device for evaluation.
* log_level: Logging verbosity level ("debug" for detailed logs).
* logging_steps: Log metrics every N steps.
* learning_rate: Initial learning rate for optimization.
* eval_steps: Run evaluation every N steps.
* max_steps: Total number of training steps.
* save_steps: Save model checkpoints every N steps.
* warmup_steps: Steps to gradually increase learning rate.
* lr_scheduler_type: Type of learning rate scheduler ("linear" for steady decay).
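Two of these settings interact in ways that are easy to misjudge: the effective batch size and the shape of the learning rate schedule. The sketch below computes both in plain Python. The functions are illustrative helpers, not the Trainer's own code. The single-device default is an assumption, and the schedule mirrors the usual linear warmup followed by linear decay to zero.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Samples contributing to one optimizer step (num_devices=1 is an assumption)."""
    return per_device_batch * grad_accum_steps * num_devices


def linear_lr(step, base_lr, warmup_steps, max_steps):
    """Linear warmup to base_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


# Batch size 2 with 2 accumulation steps, as used in this notebook
print(effective_batch_size(2, 2))  # 4

# warmup_steps=25 and max_steps=200 as in the final TrainingArguments cell
for step in (0, 10, 25, 100, 200):
    print(step, linear_lr(step, base_lr=1e-4, warmup_steps=25, max_steps=200))
```

The learning rate peaks exactly at the end of warmup, so with 25 warmup steps out of 200 the model spends most of the run on the decaying part of the schedule.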

In [11]:
# Unified helper to build TrainingArguments
def create_training_args(output_dir, learning_rate, batch_size, num_epochs=3, additional_args=None):
    additional_args = additional_args or {}
    return TrainingArguments(
        output_dir=output_dir,
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        num_train_epochs=num_epochs,  # Ignored while max_steps is set below
        eval_strategy="steps",        # Evaluate every eval_steps steps
        save_strategy="steps",
        optim="paged_adamw_8bit",
        gradient_accumulation_steps=2,
        per_device_eval_batch_size=2,
        logging_steps=10,
        eval_steps=25,
        max_steps=100,
        save_steps=25,
        warmup_steps=25,
        lr_scheduler_type="linear",
        **additional_args,  # Allows passing extra arguments when needed
    )

def objective(trial):
    # Define the hyperparameter search space
    learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    batch_size = trial.suggest_categorical("per_device_train_batch_size", [2, 4, 8, 16])

    # Build the training arguments with the suggested hyperparameters
    training_args = TrainingArguments(
        output_dir="./results",
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        num_train_epochs=3,  # Adjust as needed
        eval_strategy="epoch",
        optim="paged_adamw_8bit",
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=DATASET["train"],
        eval_dataset=DATASET["test"],
        dataset_text_field="prompt",
        tokenizer=tokenizer
    )

    # Train the model and get the evaluation metric
    trainer.train()
    eval_results = trainer.evaluate()

    return eval_results["eval_loss"]

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=1)


# Store the best hyperparameters for the final training run
best_params = study.best_trial.params
LEARNING_RATE = best_params["learning_rate"]
PER_DEVICE_TRAIN_BATCH_SIZE = best_params["per_device_train_batch_size"]

[I 2024-11-21 14:44:10,158] A new study created in memory with name: no-name-b249ae1d-534b-4027-862d-76f3883a2883

Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 1.3914 | 1.422058 |
| 2 | 1.0735 | 1.48701 |
| 3 | 0.715 | 1.641343 |




[I 2024-11-21 16:01:01,818] Trial 0 finished with value: 1.641343355178833 and parameters: {'learning_rate': 0.00018335806063256405, 'per_device_train_batch_size': 2}. Best is trial 0 with value: 1.641343355178833.
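The per-epoch losses reported for this trial show the training loss falling while the validation loss rises, and the value returned to Optuna is the loss after the last epoch. A minimal sketch of how the best epoch could be picked from those numbers follows. The `history` list is copied by hand from the trial output, and the variable names are illustrative only.

```python
# Per-epoch losses copied from the trial output above
history = [
    {"epoch": 1, "train_loss": 1.3914, "val_loss": 1.422058},
    {"epoch": 2, "train_loss": 1.0735, "val_loss": 1.48701},
    {"epoch": 3, "train_loss": 0.715, "val_loss": 1.641343},
]

# Epoch with the lowest validation loss, and the last epoch for comparison
best = min(history, key=lambda row: row["val_loss"])
last = history[-1]

print(f"best epoch: {best['epoch']} (val_loss={best['val_loss']})")
print(f"last epoch: {last['epoch']} (val_loss={last['val_loss']})")
print(f"gap between last and best val_loss: {last['val_loss'] - best['val_loss']:.4f}")
```

One option, stated here as a suggestion rather than something the notebook does, is to return the minimum validation loss from the objective or to train for fewer epochs, since the lowest validation loss occurs after the first epoch.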


In [12]:
# Print the best result
print("Best trial:")
print(f"  Loss: {study.best_trial.value}")
print("  Hyperparameters:")
for key, value in study.best_trial.params.items():
    print(f"    {key}: {value}")

# Train the model with the best hyperparameters
best_params = study.best_trial.params
training_arguments = create_training_args(
    output_dir="./results_best",
    learning_rate=best_params["learning_rate"],
    batch_size=best_params["per_device_train_batch_size"],
    num_epochs=5,  # More epochs for the final model (only takes effect if max_steps is not set)
)

Best trial:
  Loss: 1.641343355178833
  Hyperparameters:
    learning_rate: 0.00018335806063256405
    per_device_train_batch_size: 2




In [11]:
LEARNING_RATE = 0.00018335806063256405
BATCH_SIZE = 2

In [12]:
training_arguments = TrainingArguments(
    output_dir=OUTPUT_DIR,  # Directory for saving model checkpoints and logs
    eval_strategy="steps",                # Evaluation strategy: evaluate every few steps
    do_eval=True,                         # Enable evaluation during training
    optim="paged_adamw_8bit",             # Use 8-bit AdamW optimizer for memory efficiency
    per_device_train_batch_size=BATCH_SIZE,        # Batch size per device during training
    gradient_accumulation_steps=2,        # Accumulate gradients over multiple steps
    per_device_eval_batch_size=BATCH_SIZE,         # Batch size per device during evaluation
    log_level="debug",                    # Set logging level to debug for detailed logs
    logging_steps=10,                     # Log metrics every 10 steps
    learning_rate=LEARNING_RATE,          # Initial learning rate
    eval_steps=50,                        # Evaluate the model every 50 steps
    max_steps=200,                        # Total number of training steps
    save_steps=25,                        # Save checkpoints every 25 steps
    warmup_steps=25,                      # Number of warmup steps for learning rate scheduler
    lr_scheduler_type="linear",           # Use a linear learning rate scheduler
)

---
#### Training Process

In [13]:
# Train the model with the specified training arguments
model = train(
    model=model,
    tokenizer=tokenizer,
    training_arguments=training_arguments,

    tokenized_dataset=DATASET,
    device=DEVICE,
    output_dir=OUTPUT_DIR,
)



Deprecated positional argument(s) used in SFTTrainer, please use the SFTConfig to set these arguments instead.


Map:   0%|          | 0/100176 [00:00<?, ? examples/s]

Map:   0%|          | 0/50 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Currently training with a batch size of: 2
***** Running training *****
  Num examples = 100,176
  Num Epochs = 1
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 2
  Total optimization steps = 200
  Number of trainable parameters = 1,310,736
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)


RuntimeError: mat1 and mat2 shapes cannot be multiplied (862x4096 and 1x8388608)
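One observation about the shapes in this error, offered as a hypothesis rather than a diagnosis: the second operand has 8388608 elements, which is exactly the number of bytes needed to store a 4096 x 4096 weight matrix at 4 bits per weight. That would be consistent with some code path multiplying against the packed storage of a `Linear4bit` layer instead of its dequantized weights. The arithmetic can be checked directly:

```python
# Shape of q_proj / o_proj, from the printed model architecture
rows, cols = 4096, 4096
bits_per_weight = 4  # load_in_4bit=True

# Bytes needed to store the matrix when two 4-bit weights are packed into each byte
packed_bytes = rows * cols * bits_per_weight // 8
print(packed_bytes)  # 8388608
```

If this reading is right, a reasonable next step would be to check whether the adapter method in use supports 4-bit quantized base layers.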

In [None]:

# Checking trained model performance
print("After training:")
output_after = generate_text(model, tokenizer, test_prompt, DEVICE)
print(output_after)


# Make sure the trained model is returned or saved after training.

---
#### Evaluate model

In [16]:
# Evaluate the model on the test set

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(OUTPUT_DIR)
model.to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
# Note: the tokenizer holds no tensors, so there is nothing to move to DEVICE

# Evaluate
# evaluate_model(model, tokenizer, DATASET)

loading configuration file config.json from cache at /home/usuario/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B/snapshots/d04e592bb4f6aa9cfee91e2e20afa771667e1d4b/config.json
Model config LlamaConfig {
  "_name_or_path": "meta-llama/Llama-3.1-8B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

All model checkpoint weights were used when initializing LlamaForCausalLM.

All the weights of LlamaForCausalLM were initialized from the model checkpoint at meta-llama/Llama-3.1-8B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /home/usuario/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B/snapshots/d04e592bb4f6aa9cfee91e2e20afa771667e1d4b/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": 128001,
  "temperature": 0.6,
  "top_p": 0.9
}

loading file tokenizer.json
loading file tokenizer.model
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and t

After training:
Instruction: Give me a formal email intro
Context: I am a law student applying to a New York like suits
Response:  Dear [Name of the recipient],

I am writing to express my interest in the [Position] position that is currently open at [Company]. I am a recent graduate of [University] with a degree in [Major] and I am currently pursuing a law degree. I believe that my education, skills, and experience make me an excellent candidate for this position.


