
# NLP II: Fine-tuning Llama

This notebook serves as the main pipeline for processing data, creating a model, training it, and evaluating its performance on a text classification task.

---
## Objectives
1. **Data Processing**: Load and preprocess text data.
2. **Model Creation**: Define a machine learning or deep learning model for the task.
3. **Training**: Train the model on the dataset.
4. **Evaluation**: Assess the model's performance on a test set.



---
#### Libraries and Dependencies

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer
from time import time

from train import train
from evaluate import evaluate_model
from utils import get_dataset
from keys_file import TOKEN
import optuna

---
#### Data acquisition and processing

In [None]:
# DATASET = load_dataset("GAIR/lima", data_dir = "./data")
test_size = 50

DATASET = get_dataset("FOLDER_DATA")
DATASET['test'] = DATASET['test'].shuffle(seed=42).select(range(min(len(DATASET['test']), test_size)))
print("Train size: ", len(DATASET["train"]))
print("Test size: ", len(DATASET["test"]))
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("Device: ", DEVICE)

---
#### **Configurations**

##### 1. Quantization

Loading the model through `BitsAndBytesConfig` quantizes its weights, which greatly reduces memory usage. The parameters are:

* load_in_4bit: Loads the model weights in 4-bit precision to save memory. (Boolean)

* bnb_4bit_quant_type: Sets the 4-bit data type ("nf4", a NormalFloat type designed for normally distributed weights, or "fp4", a plain 4-bit float).

* bnb_4bit_compute_dtype: Defines the data type used for computation after dequantization (float16, bfloat16, float32).

* bnb_4bit_use_double_quant: Enables double quantization, which saves additional memory.

**Double quantization**

Double quantization reduces the memory overhead of quantization by applying two rounds of quantization.

    - The first round quantizes the main weights in blocks, each with its own scaling constant.
    - The second round quantizes those scaling constants, saving roughly 0.4 bits per parameter, possibly at a slight cost to speed.
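
As a rough illustration of what double quantization changes in memory terms, the sketch below estimates the average storage cost per weight. The block sizes (64 weights per block, 256 constants per second-level block) and the constant widths (32-bit, then 8-bit) are assumptions taken from the defaults described in the QLoRA paper, the 8-billion parameter count is an approximation, and the helper names are hypothetical.

```python
def bits_per_param(double_quant=False, block_size=64, const_bits=32,
                   dq_block_size=256, dq_const_bits=8):
    """Average storage cost per weight for 4-bit block-wise quantization."""
    weight_bits = 4.0
    if not double_quant:
        # One full-precision scaling constant is stored per block of weights.
        return weight_bits + const_bits / block_size
    # The scaling constants are themselves quantized to dq_const_bits, with one
    # full-precision constant per dq_block_size first-level constants.
    return (weight_bits
            + dq_const_bits / block_size
            + const_bits / (block_size * dq_block_size))


def model_size_gib(n_params, bits):
    """Convert a per-parameter bit cost into a total size in GiB."""
    return n_params * bits / 8 / 2**30


n_params = 8e9  # assumed approximate parameter count
plain = bits_per_param(double_quant=False)
double = bits_per_param(double_quant=True)

print(f"bits/param without double quantization: {plain:.3f}")
print(f"bits/param with double quantization:    {double:.3f}")
print(f"estimated size: {model_size_gib(n_params, plain):.2f} GiB vs "
      f"{model_size_gib(n_params, double):.2f} GiB")
```

Under these assumptions the saving is about 0.37 bits per parameter, which amounts to a few hundred megabytes for a model of this size. The estimate covers only the quantized weights, not activations, optimizer state, or layers kept in higher precision.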

In [None]:

compute_dtype = torch.bfloat16  # Computation data type (TODO: check that the GPU supports bfloat16)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # Enable loading the model in 4-bit precision
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat quantization (TODO: maybe try INT8 or float16)
    bnb_4bit_compute_dtype=compute_dtype, # Set computation data type
    bnb_4bit_use_double_quant=True,       # Also quantize the quantization constants to save memory
)

#### 2. Model and Tokenizer

In [None]:
MODEL_NAME = "meta-llama/Llama-3.1-8B"
OUTPUT_DIR = "../models/" + MODEL_NAME + "_testing"
LEARNING_RATE = 1e-4


# This loads the model with the quantization config; device_map="auto" decides where the layers are placed
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,  # Apply quantization configuration
    device_map="auto",               # Automatically map layers to devices
    token=TOKEN,                     # Hugging Face access token (replaces the deprecated use_auth_token)
)

In [None]:
print(model.__dict__)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    add_eos_token=True,      # Add end-of-sequence token to the tokenizer
    use_fast=True,           # Use the fast tokenizer implementation
    padding_side="left",     # Pad sequences on the left side
    token=TOKEN,             # Hugging Face access token (replaces the deprecated use_auth_token)
)

tokenizer.pad_token = tokenizer.eos_token
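
The effect of `padding_side='left'` can be sketched without loading a tokenizer. A decoder-only model continues generating from the last position of each row, so batched generation usually keeps the real tokens at the right edge and marks the padding with an attention mask. The `pad_batch` helper and the token IDs below are hypothetical and only mimic what a tokenizer returns.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-ID lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        pads = [pad_id] * n_pad
        if side == "left":
            # Real tokens end at the last position, where generation continues.
            input_ids.append(pads + list(seq))
            attention_mask.append([0] * n_pad + [1] * len(seq))
        else:
            input_ids.append(list(seq) + pads)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Two prompts of different length; 0 stands in for the pad (here also EOS) token ID.
ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=0, side="left")
print(ids)
print(mask)
```

Because the pad token is the same as the EOS token in this notebook, the mask is the only reliable way to tell padding from a real end-of-sequence token, which is why passing `attention_mask` to `generate` matters.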

In [None]:
# MODEL INSTANTIATION
model = prepare_model_for_kbit_training(model)  # Required for the quantized model so that it can be trained
model.config.pad_token_id = tokenizer.pad_token_id  # Set the model's padding token ID (check the model config to confirm the attribute name)
# No model.to(DEVICE) here: device_map="auto" has already placed the quantized model on its devices

In [None]:
# Test the model before training
def generate_text(model, tokenizer, prompt, device="cuda"):
    # Tokenize the prompt and move the input tensors to the target device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Pass the attention mask together with the input IDs for reliable results
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=500,
        num_return_sequences=1,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# test_prompt = "Instruction: Give me a formal email intro\nContext: I am a law student applying to a New York like the Suits TV show\nResponse: "
test_prompt = "Instruction: Give me the intro for a formal email to apply to a Data Scientist position in NVIDIA\nContext: My name is Patricia and I just finished college\nResponse:"

try:
    # Get model output before training
    print("Before training:")
    output_before = generate_text(model, tokenizer, test_prompt, DEVICE)
    print(output_before)
except Exception as exc:
    # Report the failure instead of silently ignoring it
    print("Generation before training failed:", exc)


---

#### 3. Finetuning Parameters

Decide which of the techniques we want to implement in each run

##### 3.1 - LoRA

LoRA applies low-rank updates to a pretrained model, enabling efficient fine-tuning by learning only small, additional matrices instead of updating all model weights. Here is what each parameter does:

* lora_alpha: Scaling factor for updates (the update is scaled by lora_alpha / r); higher values (16, 32) increase update impact, which improves adaptation but may risk overfitting.

* lora_dropout: Dropout rate for LoRA layers; typical values (0.0, 0.05) help prevent overfitting with minimal regularization.

* r: Rank of LoRA matrices; lower values (4, 8) reduce parameters and memory, while higher values (16) offer more flexibility.

* bias: Selects which bias terms are trained ("none", "all", "lora_only").

* target_modules: Specifies the layers LoRA is applied to (e.g. ['k_proj', 'v_proj']); selecting fewer layers reduces compute cost but may limit effectiveness.
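
To make the trade-off behind `r` and `lora_alpha` concrete, the sketch below counts the trainable parameters that one LoRA pair adds to a single linear layer and computes the scaling applied to its update. The 4096 x 4096 layer shape is an assumed example rather than a value read from the model, and the helper names are hypothetical.

```python
def lora_extra_params(d_in, d_out, r):
    """Trainable parameters of one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A (standard LoRA scaling)."""
    return lora_alpha / r


d_in = d_out = 4096            # assumed shape of one projection matrix
full_params = d_in * d_out     # parameters updated by full fine-tuning of this layer

for r in (4, 8, 16):
    extra = lora_extra_params(d_in, d_out, r)
    print(f"r={r:>2}: {extra:>7} trainable params "
          f"({100 * extra / full_params:.2f}% of the full matrix), "
          f"scaling with lora_alpha=16: {lora_scaling(16, r)}")
```

Doubling `r` doubles the number of trainable parameters, and with `lora_alpha` fixed it also halves the scaling factor, so the two values are usually tuned together.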
        



In [None]:
lora_config = LoraConfig(
    lora_alpha=16,             # Scaling factor for LoRA updates
    lora_dropout=0.15,         # Dropout rate applied to LoRA layers
    r=8,                       # Rank of the LoRA decomposition
    bias="none",               # No bias is added to the LoRA layers
    task_type="CAUSAL_LM",     # Specify the task as causal language modeling
    target_modules=[           # Modules to apply LoRA to
        'k_proj', 'q_proj', 'v_proj', 'o_proj',
        'gate_proj', 'down_proj', 'up_proj'
    ]
)


"""
Notes on how to improve:
After fine-tuning, check the validation loss. If it's high, try making the following adjustments one at a time:
Increase lora_alpha: If the model is underfitting, try increasing lora_alpha to 32 or 64.
Increase lora_dropout: If you observe overfitting, increase lora_dropout to 0.2 or 0.3.
Decrease r: If the model is too large or overfitting, reduce r to 8 or 4.
Reduce the number of target modules: If the model is overfitting, try applying LoRA to fewer modules, such as ['q_proj', 'v_proj'] or just ['k_proj', 'o_proj'].



Trial 1: Val loss: 2.88 - 2.74 - 2.68 - 2.65
lora_alpha=16,            
lora_dropout=0.05,          
r=16,                      
bias="none",               
task_type="CAUSAL_LM",     
target_modules=['k_proj', 'q_proj', 'v_proj', 'o_proj','gate_proj', 'down_proj', 'up_proj']

Trial 2: Val loss: 2.88 - 2.75 - No more
lora_alpha=16,            
lora_dropout=0.2,          
r=16,                      
bias="none",               
task_type="CAUSAL_LM",     
target_modules=['k_proj', 'q_proj', 'v_proj', 'o_proj','gate_proj', 'down_proj', 'up_proj']



"""

##### 3.2 - AdaLoRA

In [None]:
# AdaLoRA config
from peft import AdaLoraConfig

adalora_config = AdaLoraConfig(
    peft_type="ADALORA",
    task_type="CAUSAL_LM",
    init_r=8,                  # Initial rank of each incremental matrix
    lora_alpha=16,             # Scaling factor for the updates
    target_modules=[           # Modules to apply AdaLoRA to
        'k_proj', 'q_proj', 'v_proj', 'o_proj',
        'gate_proj', 'down_proj', 'up_proj'
    ],
    lora_dropout=0.15,         # Dropout rate applied to the adapter layers
)


##### 3.3 - VB-LoRA

In [None]:
from peft import VBLoRAConfig

vb_config = VBLoRAConfig(
    num_vectors=2048,          # Number of vectors in the shared vector bank
    vector_length=256,         # Length of each vector in the bank
    r=4,                       # Rank of the LoRA decomposition
    topk=2,                    # Number of bank vectors combined for each sub-vector
    bias="none",               # No bias is added to the VB-LoRA layers
    target_modules=[           # Modules to apply VB-LoRA to
        'k_proj', 'q_proj', 'v_proj', 'o_proj',
    ]
)


##### 3.4 - Llama Adapter



In [None]:
from peft import AdaptionPromptConfig

llama_adapter = AdaptionPromptConfig(
    adapter_len=20,
    adapter_layers=16,
    task_type="CAUSAL_LM",
)

- Llama-Adapter (10 len, 30 layers, no LoRA)
    The model seems worse than LoRA. It achieves a training loss of 1.8 and a validation loss of 1.61.

- Llama-Adapter (16 len, 16 layers, with LoRA)
    The model is worse than LoRA alone. It achieves a training loss of 1.72 and a validation loss of 1.76.

---

#### 5. Training Parameters

* output_dir: Directory to save checkpoints and logs.
* eval_strategy: When to run evaluation ("steps" or "epoch").
* do_eval: Enable/disable evaluation during training.
* optim: Optimizer type ("paged_adamw_8bit" for memory-efficient AdamW).
* per_device_train_batch_size: Batch size per device for training.
* gradient_accumulation_steps: Accumulate gradients over steps for larger effective batch size.
* per_device_eval_batch_size: Batch size per device for evaluation.
* log_level: Logging verbosity level ("debug" for detailed logs).
* logging_steps: Log metrics every N steps.
* learning_rate: Initial learning rate for optimization.
* eval_steps: Run evaluation every N steps.
* max_steps: Total number of training steps.
* save_steps: Save model checkpoints every N steps.
* warmup_steps: Steps to gradually increase learning rate.
* lr_scheduler_type: Type of learning rate scheduler ("linear" for steady decay).
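
Two of these settings interact in ways that are easy to misread: the effective batch size and the shape of the "linear" schedule with warmup. The sketch below reproduces both with plain arithmetic. It assumes the usual linear schedule (a ramp from 0 to the base learning rate during warmup, then a linear decay to 0 at `max_steps`); the helper names and the example values are illustrative.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, n_devices=1):
    """Number of examples contributing to each optimizer update."""
    return per_device_batch * grad_accum_steps * n_devices


def linear_lr(step, base_lr, warmup_steps, max_steps):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp up proportionally to the step count.
        return base_lr * step / max(1, warmup_steps)
    # Decay: shrink linearly so that the rate reaches 0 at max_steps.
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


print("effective batch size:", effective_batch_size(2, 2))

# Example values only: 25 warmup steps out of 100 total steps.
for step in (0, 10, 25, 50, 100):
    print(f"step {step:>3}: lr = {linear_lr(step, 1e-4, 25, 100):.2e}")
```

With a per-device batch size of 2 and 2 accumulation steps on one device, each optimizer update sees 4 examples, so a learning rate tuned for one batch size may need adjusting if either value changes.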

In [None]:
# Unified helper to configure TrainingArguments
def create_training_args(output_dir, learning_rate, batch_size, num_epochs=3, additional_args=None):
    additional_args = additional_args or {}
    return TrainingArguments(
        output_dir=output_dir,
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        num_train_epochs=num_epochs,  # Ignored while max_steps below is positive
        eval_strategy="steps",        # Step-based evaluation so that eval_steps takes effect
        save_strategy="steps",
        optim="paged_adamw_8bit",
        gradient_accumulation_steps=2,
        per_device_eval_batch_size=2,
        logging_steps=10,
        eval_steps=25,
        max_steps=100,
        save_steps=25,
        warmup_steps=25,
        lr_scheduler_type="linear",
        **additional_args,  # Allows passing extra arguments as needed
    )

def objective(trial):
    # Define the hyperparameter search space
    learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    batch_size = trial.suggest_categorical("per_device_train_batch_size", [2, 4, 8, 16])

    # Build the training configuration with the suggested hyperparameters
    training_args = TrainingArguments(
        output_dir="./results",
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        num_train_epochs=3,  # Adjust as needed
        eval_strategy="epoch",
        optim="paged_adamw_8bit",
    )

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=DATASET["train"],
        eval_dataset=DATASET["test"],
        dataset_text_field="prompt",
        tokenizer=tokenizer
    )

    # Train the model and return the evaluation metric
    trainer.train()
    eval_results = trainer.evaluate()

    return eval_results["eval_loss"]

# study = optuna.create_study(direction="minimize")
# study.optimize(objective, n_trials=1)


# # Update the settings with the best hyperparameters
# best_params = study.best_trial.params
# LEARNING_RATE = best_params["learning_rate"]
# PER_DEVICE_TRAIN_BATCH_SIZE = best_params["per_device_train_batch_size"]
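
The `log=True` flag in `suggest_float` means the learning rate is searched on a logarithmic scale, so each order of magnitude between 1e-6 and 1e-3 gets similar attention. The sketch below imitates only that search space with plain random sampling; Optuna's default sampler is adaptive and does not draw trials independently like this, and the helper name is hypothetical.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed in [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample_log_uniform(1e-6, 1e-3, rng) for _ in range(10_000)]

# Count how many samples fall in each decade of the search range.
decades = [(1e-6, 1e-5), (1e-5, 1e-4), (1e-4, 1e-3)]
fractions = [
    sum(lo <= s < hi for s in samples) / len(samples)
    for lo, hi in decades
]
for (lo, hi), frac in zip(decades, fractions):
    print(f"[{lo:.0e}, {hi:.0e}): {frac:.1%} of samples")
```

A uniform search over the same interval would place about 90% of its trials between 1e-4 and 1e-3, which is why the logarithmic scale is the usual choice for learning rates.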

In [None]:
# # Print the best result
# print("Best trial:")
# print(f"  Loss: {study.best_trial.value}")
# print("  Hyperparameters:")
# for key, value in study.best_trial.params.items():
#     print(f"    {key}: {value}")

# # Train the model with the best hyperparameters
# best_params = study.best_trial.params
# training_arguments = create_training_args(
#     output_dir="./results_best",
#     learning_rate=best_params["learning_rate"],
#     batch_size=best_params["per_device_train_batch_size"],
#     num_epochs=5,  # More epochs for the final model
# )

In [None]:
LEARNING_RATE = 0.00018335806063256405
BATCH_SIZE = 2

In [None]:
training_arguments = TrainingArguments(
    output_dir=OUTPUT_DIR,                    # Directory for saving model checkpoints and logs
    eval_strategy="steps",                    # Evaluation strategy: evaluate every few steps
    do_eval=True,                             # Enable evaluation during training
    optim="paged_adamw_8bit",                 # Use 8-bit AdamW optimizer for memory efficiency
    per_device_train_batch_size=BATCH_SIZE,   # Batch size per device during training
    gradient_accumulation_steps=2,            # Accumulate gradients over multiple steps
    per_device_eval_batch_size=BATCH_SIZE,    # Batch size per device during evaluation
    log_level="debug",                        # Set logging level to debug for detailed logs
    logging_steps=10,                         # Log metrics every 10 steps
    learning_rate=LEARNING_RATE,              # Initial learning rate
    eval_steps=200,                           # Evaluate the model every 200 steps
    max_steps=50000,                          # Total number of training steps
    save_steps=250,                           # Save checkpoints every 250 steps
    warmup_steps=250,                         # Number of warmup steps for learning rate scheduler
    lr_scheduler_type="linear",               # Use a linear learning rate scheduler
)

---
#### Training Process

In [None]:
# Implement the different fine-tuning configurations
# Booleans to select which fine-tuning techniques are applied

implement_lora = True
implement_adalora = False
implement_vb = False
implement_llama_adapter = False


if implement_lora:
    model = get_peft_model(model, lora_config)

if implement_adalora:
    model = get_peft_model(model, adalora_config)

if implement_vb:
    model = get_peft_model(model, vb_config)

if implement_llama_adapter:
    model = get_peft_model(model, llama_adapter)
    # model.add_adapter("llama-adapter", llama_adapter)



In [None]:
# Train the model with the specified training arguments
model = train(
    model=model,
    tokenizer=tokenizer,
    training_arguments=training_arguments,
    tokenized_dataset=DATASET,
    device=DEVICE,
    output_dir=OUTPUT_DIR,
)


---
#### Evaluate model

In [None]:
OUTPUT_DIR

In [None]:
import os
print(os.path.exists(OUTPUT_DIR))  # Should return True if the path is valid


In [1]:

# Load the model and tokenizer
import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

print(os.listdir("../"))
OUTPUT_DIR = "../models/meta-llama/Llama-3.1-8B_testing/checkpoint-14000"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

compute_dtype = torch.bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=OUTPUT_DIR,
    quantization_config=bnb_config,
    device_map="auto",
)
# No model.to(DEVICE) here: device_map="auto" has already dispatched the model
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)

# Evaluate
# evaluate_model(model, tokenizer, DATASET)

['src', 'starter_kit', 'NLP2 - Final_Project.pdf', '.gitignore', 'README.md', 'IfEval', '.git', 'models']




In [2]:
def generate_text(model, tokenizer, prompt, device="cuda"):
    # Tokenize the prompt and move the input tensors to the target device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Pass the attention mask together with the input IDs for reliable results
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=500,
        num_return_sequences=1,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# test_prompt = "Instruction: Give me a formal email intro\nContext: I am a law student applying to a New York like the Suits TV show\nResponse: "
test_prompt = "Instruction: Give me the intro for a formal email to apply to a Data Scientist position in NVIDIA\nContext: My name is Patricia and I just finished college\nResponse:"

# Check the trained model's performance
print("After training:")
output_after = generate_text(model, tokenizer, test_prompt, DEVICE)
print(output_after)

# Make sure the model is retrieved or saved after training!



After training:
Instruction: Give me the intro for a formal email to apply to a Data Scientist position in NVIDIA
Context: My name is Patricia and I just finished college
Response: Dear Hiring Manager,

My name is Patricia and I am excited to apply for the Data Scientist position at NVIDIA. As a recent graduate, I am eager to put my skills and knowledge to work in a dynamic and innovative company like NVIDIA.

I have a strong background in mathematics, statistics, and machine learning, which has equipped me with the necessary skills to excel in this role. I am also proficient in programming languages such as Python and R, and have experience working with large datasets using tools such as SQL and Apache Spark.

In addition to my technical skills, I am a strong communicator and collaborator. I am able to effectively communicate complex technical concepts to non-technical audiences, and have experience working in cross-functional teams to achieve common goals.

Thank you for considering 