# Fine-Tuning with Llama 2, Bits and Bytes, and QLoRA

Today we'll explore fine-tuning the Llama 2 model available on Kaggle Models using QLoRA, Bits and Bytes, and PEFT.

- QLoRA: [Quantized Low Rank Adapters](https://arxiv.org/pdf/2305.14314.pdf) - a method for fine-tuning LLMs that keeps the base model frozen in quantized form and trains only a small number of additional parameters (low-rank adapters), which limits the cost of training. These small sets of parameters can also be added efficiently to the model itself, which means you can fine-tune on many datasets, potentially, and swap these "adapters" into your model as needed.
- [Bits and Bytes](https://github.com/TimDettmers/bitsandbytes): An excellent package by Tim Dettmers et al. that provides a lightweight wrapper around custom CUDA functions that make LLMs go faster: optimizers, matrix multiplications, and quantization. In this notebook we'll use the library to load our model as efficiently as possible.
- [PEFT](https://github.com/huggingface/peft): An excellent Hugging Face library that enables a number of Parameter-Efficient Fine-Tuning (PEFT) methods, which again make it less expensive to fine-tune LLMs, especially on lighter-weight hardware like that available in Kaggle notebooks.
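
To make the adapter idea concrete, here is a minimal pure-Python sketch of a LoRA-style linear layer. It is an illustration under stated assumptions, not the PEFT implementation: the helper names (`matvec`, `lora_forward`), the tiny matrices, and the `adapters` dictionary are all hypothetical. The frozen weight `W` is shared, and each adapter only contributes a scaled low-rank correction.

```python
# Minimal pure-Python sketch of a LoRA-style linear layer (illustrative only).
# All names here (matvec, lora_forward, the adapters dict) are hypothetical.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(base_weight, adapter, x):
    """Compute y = W x + (alpha / r) * B (A x) without modifying W."""
    base_out = matvec(base_weight, x)
    low_rank = matvec(adapter["B"], matvec(adapter["A"], x))
    scale = adapter["alpha"] / adapter["r"]
    return [b + scale * d for b, d in zip(base_out, low_rank)]


# Frozen 2x3 base weight shared by every adapter.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]

# Two rank-1 adapters: A has shape (r, d_in), B has shape (d_out, r).
adapters = {
    # B starts at zero, so a fresh adapter leaves the base output unchanged.
    "fresh": {"A": [[0.5, 0.5, 0.5]], "B": [[0.0], [0.0]], "alpha": 2, "r": 1},
    "tuned": {"A": [[1.0, 0.0, 0.0]], "B": [[0.5], [-0.5]], "alpha": 2, "r": 1},
}

x = [1.0, 2.0, 3.0]
base_only = matvec(W, x)
fresh_out = lora_forward(W, adapters["fresh"], x)
tuned_out = lora_forward(W, adapters["tuned"], x)
print(base_only, fresh_out, tuned_out)
```

Swapping adapters here is just a dictionary lookup; the base weight is never copied or changed.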


This notebook is based on [an excellent example from LangChain](https://github.com/asokraju/LangChainDatasetForge/blob/main/Finetuning_Falcon_7b.ipynb).

In [1]:
!pip install -qqq bitsandbytes==0.39.0
!pip install -qqq torch==2.0.1
!pip install -qqq -U git+https://github.com/huggingface/transformers.git@e03a9cc
!pip install -qqq -U git+https://github.com/huggingface/peft.git@42a184f
!pip install -qqq -U git+https://github.com/huggingface/accelerate.git@c9fbb71
!pip install -qqq datasets==2.12.0
!pip install -qqq loralib==0.1.1
!pip install -qqq einops==0.6.1

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchdata 0.6.0 requires torch==2.0.0, but you have torch 2.0.1 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.6 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 11.0.0 which is incompatible.
pathos 0.3.1 requires dill>=0.3.7, but you have dill 0.3.6 which is incompatible.
pathos 0.3.1 requires multiprocess>=0.70.15, but you have multiprocess 0.70.14 which is incompatible.
pymc3 3.11.5 requires numpy<1.22.2,>=1.15.0, but you have numpy 1.23.5 which is incompatible.
pymc3 3.11.5 requires scipy<1.8.0,>=1.7.3, but you hav

In [2]:
import pandas as pd
import json
import os
from pprint import pprint
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset, Dataset
from huggingface_hub import notebook_login

from peft import LoraConfig, PeftConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /opt/conda/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)


# Loading and preparing our model

We'll use the Llama 2 13B chat model for our test. We'll load it in 4-bit format with Bits and Bytes, which should reduce memory consumption considerably, at the cost of some accuracy.
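
As a rough sense of scale, the weight storage can be estimated with simple arithmetic. The sketch below assumes a round figure of 13 billion parameters (the size suggested by the `13b-chat-hf` checkpoint path) and ignores quantization constants, activations, and optimizer state, so treat the numbers as estimates rather than measurements.

```python
# Back-of-the-envelope weight memory for a model at different precisions.
# The parameter count is an assumed round figure, not a measured value.

def weight_memory_gb(num_params, bits_per_param):
    """Return the weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 13e9  # assumption: roughly 13 billion parameters

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```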

Note the parameters in `BitsAndBytesConfig`: this is a fairly standard 4-bit quantization configuration. It loads the weights in 4-bit format using the `normal float 4` (NF4) data type, with double quantization, which also quantizes the quantization constants to save additional memory. The 4-bit weights are converted back to `bfloat16` for computation, and the extra precision is then discarded.
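
The idea behind blockwise 4-bit storage can be sketched in a few lines of plain Python. This is a simplified illustration, not the bitsandbytes implementation: it assumes 16 evenly spaced levels in place of the real NF4 codebook, and the function names and sample weights are made up.

```python
# Simplified blockwise 4-bit quantization (illustrative; not the bitsandbytes code).
# Assumption: 16 evenly spaced levels in [-1, 1] stand in for the NF4 codebook.

LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantize_block(block):
    """Return (absmax, codes): one scale per block plus a 4-bit code per value."""
    absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
    codes = [
        min(range(16), key=lambda i: abs(LEVELS[i] - v / absmax))
        for v in block
    ]
    return absmax, codes


def dequantize_block(absmax, codes):
    """Map the 4-bit codes back to approximate floating-point values."""
    return [LEVELS[c] * absmax for c in codes]


weights = [0.12, -0.40, 0.05, 0.31, -0.08, 0.27, -0.19, 0.02]
absmax, codes = quantize_block(weights)
restored = dequantize_block(absmax, codes)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(max_error, 4))
```

Each block stores one floating-point scale (`absmax`) next to its 4-bit codes; double quantization goes one step further and compresses those per-block scales as well.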

In [3]:
MODEL_NAME = "/kaggle/input/llama-2/pytorch/13b-chat-hf/1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Next, we'll use a handy PEFT wrapper to set up our model for training / fine-tuning. Specifically, this function configures the output embedding layer to allow gradient updates and casts several components to the appropriate types, so that the model is ready to be updated.

In [4]:
model = prepare_model_for_kbit_training(model)

Next, we define a few helper functions. Their purpose is to identify the layers we want to update so that we can... update them!

In [5]:
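import re


def get_num_layers(model):
    """Return the largest integer found in any parameter name.

    For names such as `model.layers.39...` this is the index of the last
    transformer block, not the number of blocks.
    """
    numbers = set()
    for name, _ in model.named_parameters():
        for number in re.findall(r'\d+', name):
            numbers.add(int(number))
    return max(numbers)


def get_last_layer_linears(model):
    """Return the names of the Linear modules in the last transformer block."""
    names = []
    last_layer = str(get_num_layers(model))
    for name, module in model.named_modules():
        # Substring match on the layer index; skip any encoder modules.
        if last_layer in name and "encoder" not in name:
            if isinstance(module, torch.nn.Linear):
                names.append(name)
    return names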
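
The selection logic can be tried out on plain strings without loading a model. The sketch below uses hypothetical module names (assumed, in the style of a Llama checkpoint) and a stricter regular-expression match on whole path segments. For the last layer both approaches agree, but a substring test would over-match if it were reused for a smaller layer index.

```python
import re

# Hypothetical module names in the style of a Llama checkpoint (assumed, for illustration).
module_names = [
    "model.layers.3.self_attn.q_proj",
    "model.layers.13.self_attn.q_proj",
    "model.layers.39.self_attn.q_proj",
    "model.layers.39.mlp.down_proj",
    "lm_head",
]


def last_layer_index(names):
    """Largest integer appearing as a dot-separated segment in any name."""
    indices = [int(m) for name in names for m in re.findall(r"\.(\d+)\.", name)]
    return max(indices)


def names_in_layer(names, layer_index):
    """Keep names whose path contains exactly `.<layer_index>.` as a segment."""
    pattern = re.compile(rf"\.{layer_index}\.")
    return [name for name in names if pattern.search(name)]


last = last_layer_index(module_names)
last_layer_names = names_in_layer(module_names, last)
substring_matches = [n for n in module_names if "3" in n]  # naive substring test
segment_matches = names_in_layer(module_names, 3)          # whole-segment test
print(last, last_layer_names)
print(len(substring_matches), len(segment_matches))
```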

## LoRA config

Some key elements of this configuration:
1. `r` is the rank (width) of the small update layer. In theory, it should be set large enough to capture the complexity of the problem you're trying to fine-tune for. Simpler problems can get away with a smaller `r`. In our case we go very small, largely for the sake of speed.
2. `target_modules` is set using our helper functions: every layer identified by that function will be included in the PEFT update.
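
To get a feel for how small the update is, the number of trainable adapter weights can be estimated by hand. The sketch below is an approximation under assumptions: the layer shapes (hidden size 5120, intermediate size 13824, seven linear layers per block) are assumed values typical of a 13B Llama-style block, and the names are illustrative. The `alpha / r` ratio is the factor by which the low-rank update is scaled.

```python
# Rough count of trainable LoRA parameters for one transformer block.
# The layer shapes below are assumptions (typical of a 13B Llama-style block).

def lora_params(d_in, d_out, r):
    """A is (r x d_in) and B is (d_out x r), so the adapter adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)


HIDDEN, INTERMEDIATE, RANK, ALPHA = 5120, 13824, 2, 32

# (input features, output features) for each linear layer in the block.
linear_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

adapter_total = sum(lora_params(i, o, RANK) for i, o in linear_shapes.values())
frozen_total = sum(i * o for i, o in linear_shapes.values())
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(adapter_total, frozen_total, scaling)
print(f"adapter share of the block: {100 * adapter_total / frozen_total:.3f}%")
```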

In [6]:
config = LoraConfig(
    r=2,
    lora_alpha=32,
    target_modules=get_last_layer_linears(model),
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

## Load some data

Here we load a dataset of 180 questions derived from a NASA document (the Software Safety Guidebook). In the interest of time we don't use more, but we'll fine-tune our model on the question-and-answer pairs. Note that we're training the model to use its existing knowledge (plus whatever little it learns from our questions and answers) to answer questions in the format we want.

In [7]:
df = pd.read_csv("/kaggle/input/nasa-finetuning/NASA-Software-Safety-Guidebook.csv", nrows=180)

df.columns = [str(q).strip() for q in df.columns]

data = Dataset.from_pandas(df)

In [8]:
df["Questions"].values[0:5]

array(['What does the design of a program set represent?',
       'How might projects developing large amounts of software approach the design development process?',
       'How does the design phase differ for projects with relatively small software packages?',
       'How are the various phases broken up over time in most lifecycles other than the waterfall?',
       'How might the initial design in some lifecycles be equivalent to the architectural design in the waterfall?'],
      dtype=object)

In [9]:
# Parenthesized so that .strip() applies to the whole prompt, not just the suffix literal
prompt = (df["Questions"].values[0] + ". Answer concisely but detailed and understandable.: ").strip()
prompt

'What does the design of a program set represent?. Answer concisely but detailed and understandable.:'

## Let's generate!

Next, we configure generation for our model:

- Top P: samples from the smallest set of most probable tokens whose cumulative probability reaches `top_p`, rather than greedily taking the single most likely token.
- Temperature: rescales the logits before the softmax; values below 1 sharpen the output distribution and values above 1 flatten it.
- We limit the return sequences to 1 (only one answer allowed!) and cap the output at 65 new tokens, while the prompt deliberately asks for an answer that is concise but detailed and understandable.
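
The two sampling knobs can be illustrated on a toy vocabulary. This is a minimal sketch of the general technique, not the `transformers` implementation: the logits are made up, and the function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before the softmax; lower values sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized over the kept tokens


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary

sharp = softmax_with_temperature(logits, temperature=0.7)
plain = softmax_with_temperature(logits, temperature=1.0)
nucleus_sharp = top_p_filter(sharp, top_p=0.7)
nucleus_plain = top_p_filter(plain, top_p=0.7)

print([round(p, 3) for p in sharp], sorted(nucleus_sharp))
print([round(p, 3) for p in plain], sorted(nucleus_plain))
```

With these toy logits, lowering the temperature concentrates enough probability on the top token that the nucleus shrinks to a single candidate, while at temperature 1.0 two tokens remain eligible.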

In [10]:
generation_config = model.generation_config
generation_config.max_new_tokens = 65
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

Now we'll generate an answer to our first question, just to see how the model does.


In [11]:
%%time
device = "cuda"

encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(
        input_ids = encoding.input_ids,
        attention_mask = encoding.attention_mask,
        generation_config = generation_config
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What does the design of a program set represent?. Answer concisely but detailed and understandable.:

What does the design of a program set represent?

The design of a program set represents the structure and organization of the software system being developed. It includes the following aspects:

1. Modules: The program set is divided into smaller, independent modules that perform specific functions. Each module is designed to
CPU times: user 46.7 s, sys: 789 ms, total: 47.5 s
Wall time: 1min 11s


## Format our fine-tuning data

We'll match the prompt format we used above.

In [12]:
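def generate_prompt(data_point):
    # Single-line template so the training text matches the inference prompt
    # (no embedded newline or indentation between question and instruction).
    return (
        f'{data_point["Questions"]}. '
        f'Answer concisely but detailed and understandable: {data_point["Answers"]}'
    ).strip()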


def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenizer(full_prompt, padding=True, truncation=True)
    return tokenized_full_prompt

data = data.shuffle().map(generate_and_tokenize_prompt)

Map:   0%|          | 0/180 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


## Train!

Now we'll use our data to update our model. Using the Hugging Face `transformers` library, let's set up our training loop and then run it. Note that we make five passes (epochs) over the data, as set by `num_train_epochs=5`.

In [13]:
training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    learning_rate=1e-4,
    fp16=True,
    output_dir="finetune_nasa",
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,
    report_to="none"
)

trainer = transformers.Trainer(
    model=model,
    train_dataset=data,
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



TrainOutput(global_step=225, training_loss=1.833027615017361, metrics={'train_runtime': 2454.704, 'train_samples_per_second': 0.367, 'train_steps_per_second': 0.092, 'total_flos': 2582767702671360.0, 'train_loss': 1.833027615017361, 'epoch': 5.0})
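
The step count and throughput reported above can be reproduced with simple arithmetic. The sketch below is an approximation of how the number of optimizer steps is derived, assuming a single data-parallel process; the function name is hypothetical, and because 180 divides evenly by 4 the rounding details do not matter here.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate number of optimizer updates for a full training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)


steps = optimizer_steps(num_examples=180, per_device_batch=1, grad_accum=4, epochs=5)

# Cross-check against the runtime reported in the training output.
train_runtime = 2454.704  # seconds, copied from the TrainOutput above
samples_per_second = round(180 * 5 / train_runtime, 3)
steps_per_second = round(steps / train_runtime, 3)
print(steps, samples_per_second, steps_per_second)
```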

## Loading and using the model later

Now we'll save the PEFT fine-tuned model, then load it back and use it to generate more answers.

In [14]:
model.save_pretrained("trained-model")

PEFT_MODEL = "/kaggle/working/trained-model"

config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer=AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

model = PeftModel.from_pretrained(model, PEFT_MODEL)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [15]:
generation_config = model.generation_config
generation_config.max_new_tokens = 80
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

In [16]:
import numpy as np

In [17]:
%%time

prompt = "What is Fault/Failure Detection, Isolation and Recovery (FDIR)?. Answer concisely but detailed and understandable: ".strip()

device = "cuda"
encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
  outputs = model.generate(
      input_ids = encoding.input_ids,
      attention_mask = encoding.attention_mask,
      generation_config = generation_config
  )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What is Fault/Failure Detection, Isolation and Recovery (FDIR)?. Answer concisely but detailed and understandable: Fault/Failure Detection, Isolation and Recovery (FDIR) is a mechanism that identifies and isolates hardware or software failures in a system, allowing the remaining components to continue operating. FDIR is a crucial component of many safety-critical systems, including aircraft, spacecraft, and industrial control systems. FDIR can be implemented using various techniques, including redund
CPU times: user 57.2 s, sys: 115 ms, total: 57.3 s
Wall time: 57.8 s


## Convert to zip

Now we'll save the PEFT fine-tuned model to a zip file so that we don't have to rerun the whole process.

In [18]:
!zip -r "/kaggle/working/archivos.zip" "/kaggle/working"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  adding: kaggle/working/ (stored 0%)
  adding: kaggle/working/trained-model/ (stored 0%)
  adding: kaggle/working/trained-model/adapter_model.bin (deflated 7%)
  adding: kaggle/working/trained-model/adapter_config.json (deflated 56%)
  adding: kaggle/working/finetune_nasa/ (stored 0%)
  adding: kaggle/working/__notebook__.ipynb (deflated 78%)
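
As the log shows, zipping the whole working directory also picks up the notebook itself. A possible alternative, sketched here with Python's standard library, is to archive only the adapter folder. The temporary directory and placeholder files below are stand-ins (assumptions) so the example runs anywhere; on Kaggle the root would be the working directory instead.

```python
import os
import shutil
import tempfile
import zipfile

# Stand-in for /kaggle/working so the sketch runs anywhere (hypothetical layout).
working_dir = tempfile.mkdtemp()
adapter_dir = os.path.join(working_dir, "trained-model")
os.makedirs(adapter_dir)
for filename in ("adapter_config.json", "adapter_model.bin"):
    with open(os.path.join(adapter_dir, filename), "w") as handle:
        handle.write("placeholder")

# Archive only the adapter folder, keeping "trained-model/" as the top-level entry.
archive_path = shutil.make_archive(
    base_name=os.path.join(working_dir, "trained-model"),
    format="zip",
    root_dir=working_dir,
    base_dir="trained-model",
)

with zipfile.ZipFile(archive_path) as archive:
    archived_files = sorted(n for n in archive.namelist() if not n.endswith("/"))
print(archived_files)
```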


## Loading and using the model that was saved in a dataset

Now we'll load the model saved in the dataset and use it to generate answers.

In [19]:
# Path of the model saved in the Kaggle dataset
PEFT_MODEL = "/kaggle/input/fine-tuning-model/modelos/trained-model_finetuning"

# Load the adapter configuration
config = PeftConfig.from_pretrained(PEFT_MODEL)

# Load the quantized base model
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token

# Attach the PEFT adapter to the base model
model = PeftModel.from_pretrained(model, PEFT_MODEL)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [20]:
# Generation settings
generation_config = model.generation_config
generation_config.max_new_tokens = 80
generation_config.temperature = 0.7
generation_config.top_p = 0.7
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

In [21]:
import numpy as np

Next, we print the answer to the question.
- Note that the length of the answer depends on the configured token limit (`max_new_tokens`).

In [22]:
prompt = "What is Fault/Failure Detection, Isolation and Recovery (FDIR)?. Answer concisely but detailed and understandable: ".strip()

device = "cuda"
encoding = tokenizer(prompt, return_tensors="pt").to(device)
with torch.inference_mode():
  outputs = model.generate(
      input_ids = encoding.input_ids,
      attention_mask = encoding.attention_mask,
      generation_config = generation_config
  )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What is Fault/Failure Detection, Isolation and Recovery (FDIR)?. Answer concisely but detailed and understandable: Fault/Failure Detection, Isolation, and Recovery (FDIR) is a mechanism used to identify, isolate, and recover from faults or failures in safety-critical systems. It involves monitoring the system for anomalies, detecting and diagnosing faults, isolating the affected components or functions, and recovering the system to a safe state. The


In [23]:
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What is Fault/Failure Detection, Isolation and Recovery (FDIR)?. Answer concisely but detailed and understandable: Fault/Failure Detection, Isolation, and Recovery (FDIR) is a mechanism used to identify, isolate, and recover from faults or failures in safety-critical systems. It involves monitoring the system for anomalies, detecting and diagnosing faults, isolating the affected components or functions, and recovering the system to a safe state. The


In [24]:
# Decode the model output
full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Remove the prompt from the full text and strip leading and trailing whitespace
answer = full_text.replace(prompt, '').strip()

print(answer)

Fault/Failure Detection, Isolation, and Recovery (FDIR) is a mechanism used to identify, isolate, and recover from faults or failures in safety-critical systems. It involves monitoring the system for anomalies, detecting and diagnosing faults, isolating the affected components or functions, and recovering the system to a safe state. The
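
One caveat with `str.replace` is that it removes every occurrence of the prompt, including any the model happens to repeat inside its answer. A slightly more defensive variant, sketched below, strips the prompt only when it appears at the very start. The function name and the sample strings are made up for illustration.

```python
def strip_prompt(full_text, prompt):
    """Drop the prompt only when it appears at the very start of the decoded text."""
    if full_text.startswith(prompt):
        return full_text[len(prompt):].strip()
    return full_text.strip()


# Made-up example in which the generated text repeats the question.
demo_prompt = "What is FDIR?"
demo_text = "What is FDIR? People often ask: What is FDIR? It is a safety mechanism."

with_replace = demo_text.replace(demo_prompt, "").strip()
with_prefix = strip_prompt(demo_text, demo_prompt)
print(with_replace)
print(with_prefix)
```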


## Converting the response to speech

Finally, we'll convert the generated answer to speech with gTTS and play the resulting audio in the notebook.

In [25]:
!pip install gtts

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Collecting gtts
  Downloading gTTS-2.4.0-py3-none-any.whl (29 kB)
Installing collected packages: gtts
Successfully installed gtts-2.4.0


In [26]:
from gtts import gTTS
import IPython.display as ipd

In [27]:

# Decode the model output
full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Remove the prompt from the full text and strip leading and trailing whitespace
answer = full_text.replace(prompt, '').strip()

# Convert the answer to speech with gTTS, save it as an MP3, and play it in the notebook
tts = gTTS(answer, lang='en')
tts.save('output.mp3')
ipd.Audio('output.mp3')
