# Fine-tune Falcon LLM for Ecommerce Sentiment Analysis

Falcon-7B is a 7-billion-parameter causal decoder-only model trained on 1,500B tokens of RefinedWeb enhanced with curated corpora. It was developed by the Technology Innovation Institute (TII) in the UAE. The model is open source and available on the Hugging Face Hub.

We'll run this notebook on the free-tier T4 GPU on Google Colab. That is not enough compute for full fine-tuning of an LLM, so we'll use a [sharded version of Falcon](https://huggingface.co/ybelkada/falcon-7b-sharded-bf16/tree/main) contributed on Hugging Face by [Younes Belkada](https://huggingface.co/ybelkada). The sharded checkpoint is split into smaller files, which are easier to load when memory is limited.
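To make the memory constraint concrete, here is a rough back-of-the-envelope estimate of how much memory the weights alone need at different precisions. The round figure of 7 billion parameters and the roughly 16 GB of memory on a T4 are assumptions for illustration; activations, gradients and optimizer states come on top of this.

```python
# Rough weight-memory estimate for a 7-billion-parameter model.
# NUM_PARAMS is a round-number assumption, not an exact parameter count.
NUM_PARAMS = 7_000_000_000

BITS_PER_PARAM = {"float32": 32, "bfloat16": 16, "4-bit": 4}


def weight_memory_gb(num_params: int, bits: int) -> float:
    """Memory needed just to store the weights, in decimal gigabytes."""
    return num_params * bits / 8 / 1e9


for name, bits in BITS_PER_PARAM.items():
    print(f"{name:>8}: {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

Under these assumptions, only the 4-bit weights leave comfortable headroom on a single T4, which is why the rest of the notebook relies on quantization.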


# PEFT / LoRA / QLoRA

Large language models (LLMs) typically have hundreds of millions to billions of parameters, because of the huge datasets they are trained on and the many layers in their transformer architecture. This makes fine-tuning all the layers of the model computationally prohibitive. Several methods have been proposed to fine-tune LLMs without the high computation costs, collectively known as Parameter-Efficient Fine-Tuning (PEFT). PEFT comprises techniques such as prompt tuning, prefix tuning, p-tuning, low-rank adaptation (LoRA), and quantized low-rank adaptation (QLoRA).

LoRA (Low-Rank Adaptation) is a reparameterization method that decomposes the weight-update matrix of an LLM into two low-rank matrices. These low-rank matrices are typically inserted in the attention blocks of the model. The original weight matrix of the pre-trained model is frozen and only the inserted smaller matrices are updated during training. This reduces the number of trainable parameters, which lowers memory usage and training time, both of which can be very expensive for large models.
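The decomposition can be illustrated with a toy example in plain Python. This is only a sketch of the idea, not how PEFT implements it: the 4x4 weight, the rank of 1 and the values in `A` and `B` are all made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Frozen pre-trained weight (toy 4x4 identity matrix).
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Trainable low-rank factors with rank r = 1 (illustrative values).
B = [[0.5], [0.0], [0.0], [0.0]]   # shape: 4 x r
A = [[0.0, 2.0, 0.0, 0.0]]         # shape: r x 4

# The weight update is the product of the two small matrices.
delta_W = matmul(B, A)             # shape: 4 x 4, but rank 1

# The adapted weight used in the forward pass; W itself never changes.
W_adapted = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta_W)]

full_entries = len(W) * len(W[0])                          # 16 values in W
lora_entries = len(B) * len(B[0]) + len(A) * len(A[0])     # 4 + 4 = 8 trainable values
print(W_adapted[0], full_entries, lora_entries)
```

Even in this tiny case the adapter trains half as many values as the full matrix. The gap widens quickly as the matrix grows while the rank stays small.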

In this notebook, we'll fine-tune Falcon-7B with QLoRA. QLoRA adds quantization to LoRA: the base model is loaded in 4-bit precision (here the 4-bit NormalFloat, NF4, data type) and LoRA adapters are trained on top of it. This greatly reduces the memory needed for fine-tuning with little loss in performance. Quantization reduces the precision of the parameters, for example from 32 bits down to as few as 4 bits, while preserving most of the information.
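As a minimal sketch of what quantization does, the snippet below applies simple uniform absmax quantization to a small block of weights. This is a deliberate simplification: NF4 uses non-uniform levels and bitsandbytes performs the real quantization on the GPU. The function names and the sample weights are hypothetical.

```python
def quantize_absmax(block, bits=4):
    """Map floats to signed integer codes using one shared scale per block."""
    levels = 2 ** (bits - 1) - 1            # 7 usable levels on each side for 4-bit
    absmax = max(abs(x) for x in block)
    scale = absmax / levels if absmax else 1.0
    codes = [round(x / scale) for x in block]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.07, 0.91, -0.34, 0.0, 0.66, -0.88]  # illustrative values
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.3f}, max error={max_error:.3f}")
```

Each weight is stored as a small integer plus one shared scale per block, and the rounding error stays below half a quantization step.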



# Libraries

We need to install a number of libraries to perform QLoRA.

1. Transformer Reinforcement Learning (trl). A library for training language models with supervised fine-tuning and reinforcement learning. We use its SFTTrainer class. It's integrated with Hugging Face Transformers and supports decoder models such as GPT-2, BLOOM and Falcon.

2. PEFT. This library efficiently adapts pre-trained language models by fine-tuning only a small number of model parameters, significantly reducing computational and storage costs.

3. Sharded Falcon LLM. Available on Hugging Face. The sharded Falcon model loads faster on low compute than the original model.
 https://huggingface.co/ybelkada/falcon-7b-sharded-bf16/tree/main

4. Accelerate handles device placement and mixed precision so that the same PyTorch training code runs on different hardware setups.

5. Bitsandbytes is a lightweight wrapper around CUDA custom functions, particularly 8-bit optimizers and quantization functions. It's used to handle the quantization process in QLoRA. It was created by Tim Dettmers and is integrated with Hugging Face Transformers.

6. Einops simplifies tensor operations.

7. Datasets makes it easy to load datasets from the Hugging Face Hub.

8. Transformers is the standard Hugging Face library for accessing pre-trained models from Python.

In [None]:
# !pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
!pip install -q trl transformers accelerate peft datasets bitsandbytes einops

## Import the libraries

In [None]:
import os

import pandas as pd
import torch
import transformers
from datasets import Dataset, DatasetDict, load_dataset
from peft import (
    LoraConfig,
    PeftConfig,
    PeftModel,
    TaskType,
    get_peft_model,
    prepare_model_for_kbit_training,
)
from sklearn.model_selection import train_test_split
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    GenerationConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from trl import SFTTrainer



## Dataset

We use an ecommerce customer review dataset labeled for sentiment: [arize-ai/ecommerce_reviews_with_language_drift](https://huggingface.co/datasets/arize-ai/ecommerce_reviews_with_language_drift?row=13).

Load the dataset and convert it into a pandas DataFrame so that we can work on it with pandas and scikit-learn.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
data = load_dataset('arize-ai/ecommerce_reviews_with_language_drift', split='validation')


In [None]:
data = pd.DataFrame(data)

data.head(10)

| | prediction_ts | reviewer_age | reviewer_gender | product_category | language | text | label |
|---|---|---|---|---|---|---|---|
| 0 | 1650092000.0 | 32 | male | home | english | Very pretty - looks exactly like the picture, ... | 2 |
| 1 | 1650094000.0 | 22 | male | home | english | “I ordered it for my bridal shower and it went... | 2 |
| 2 | 1650095000.0 | 37 | male | jewelry | english | This necklace is too small for anyone except t... | 0 |
| 3 | 1650096000.0 | 23 | female | other | english | Love this case. The outer rubber is softer tha... | 2 |
| 4 | 1650097000.0 | 24 | male | home | english | This was terrible. Had it in the shower for a ... | 0 |
| 5 | 1650098000.0 | 58 | female | kitchen | english | Disappointed in the sizes. I needed larger siz... | 1 |
| 6 | 1650100000.0 | 25 | female | furniture | english | This is the perfect solution for a dining tabl... | 2 |
| 7 | 1650101000.0 | 42 | female | electronics | english | If I could give it a zero, i would. The damn t... | 0 |
| 8 | 1650102000.0 | 38 | female | sports | english | It is perfect, beautiful color. I am 100% sati... | 2 |
| 9 | 1650103000.0 | 33 | other | toy | english | This set made a beautiful party. I was pleasan... | 2 |


Extract the two columns we need from the DataFrame. Taking an explicit copy avoids pandas' SettingWithCopyWarning when we modify the columns later.

In [None]:
data = data[['text', 'label']].copy()
data.head()

| | text | label |
|---|---|---|
| 0 | Very pretty - looks exactly like the picture, ... | 2 |
| 1 | “I ordered it for my bridal shower and it went... | 2 |
| 2 | This necklace is too small for anyone except t... | 0 |
| 3 | Love this case. The outer rubber is softer tha... | 2 |
| 4 | This was terrible. Had it in the shower for a ... | 0 |


Convert the label from number to text.

In [None]:
data['label'] = data['label'].replace([0, 1, 2], ['negative', 'neutral', 'positive'])



We'll format the data so that each example contains both the input text and the expected output, then train the LLM on this combined string. The model learns to continue a review with its sentiment label.

In [None]:
data['formatted_data'] = data.apply(lambda row: str(row['text']) + " ->: " + row['label'], axis = 1)
data.head()



| | text | label | formatted_data |
|---|---|---|---|
| 0 | Very pretty - looks exactly like the picture, ... | positive | Very pretty - looks exactly like the picture, ... |
| 1 | “I ordered it for my bridal shower and it went... | positive | “I ordered it for my bridal shower and it went... |
| 2 | This necklace is too small for anyone except t... | negative | This necklace is too small for anyone except t... |
| 3 | Love this case. The outer rubber is softer tha... | positive | Love this case. The outer rubber is softer tha... |
| 4 | This was terrible. Had it in the shower for a ... | negative | This was terrible. Had it in the shower for a ... |


In [None]:
data['formatted_data'][0]

'Very pretty - looks exactly like the picture, except it ended up being a bit larger than I expected. Works great as an accent in our guest room! ->: positive'

We use only the validation split of the dataset because we have limited compute resources. Let's split the data into train and test sets.

In [None]:

train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

In [None]:
train_df.head()

| | text | label | formatted_data |
|---|---|---|---|
| 29 | Haven’t noticed a difference even using it in ... | neutral | Haven’t noticed a difference even using it in ... |
| 535 | Works great and is perfect size to let my baby... | positive | Works great and is perfect size to let my baby... |
| 695 | Very fresh and great price. Can’t beat the pri... | positive | Very fresh and great price. Can’t beat the pri... |
| 557 | The first time I ordered this Egyptian Musk oi... | neutral | The first time I ordered this Egyptian Musk oi... |
| 836 | So I purchased these for my vehicle my husband... | positive | So I purchased these for my vehicle my husband... |


In [None]:
test_df.head()

| | text | label | formatted_data |
|---|---|---|---|
| 521 | tried this dip set out and was very disappoint... | negative | tried this dip set out and was very disappoint... |
| 737 | It's useless for brassiness - did absolutely n... | neutral | It's useless for brassiness - did absolutely n... |
| 740 | Disappointing. Nothing written to explain how ... | negative | Disappointing. Nothing written to explain how ... |
| 660 | Fantastic sheets. REALLY deep pockets that fit... | positive | Fantastic sheets. REALLY deep pockets that fit... |
| 411 | I have to give 5 stars. The decorations are be... | positive | I have to give 5 stars. The decorations are be... |


SFTTrainer expects the training data as a Hugging Face Dataset, so we convert the pandas DataFrame and wrap it in a DatasetDict.

In [None]:
train_dict = DatasetDict({
    'train': Dataset.from_pandas(train_df)
})

In [None]:
train_dict

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'formatted_data', '__index_level_0__'],
        num_rows: 800
    })
})

## Load the base model with quantization.

In the code snippet below, we'll use the bitsandbytes library to quantize the model. We'll set the bitsandbytes configuration to load the model in 4-bit.

Next, we'll load the model with the Hugging Face class AutoModelForCausalLM (for next-token text generation) and pass the quantization configuration to it. We'll set trust_remote_code to True, which allows Transformers to run custom modeling code shipped in the model repository.

We'll set model.config.use_cache to False because the KV cache is not useful during fine-tuning; it only speeds up generation at inference time. The prepare_model_for_kbit_training function from PEFT prepares the quantized model for training.

Double quantization quantizes the quantization constants that are produced by the first 4-bit NF4 quantization step. According to the QLoRA paper, this saves about 0.37 bits per parameter on average.
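The per-parameter overhead of the quantization constants can be worked out with simple arithmetic. The block sizes below (64 weights per constant, 256 constants per second-level constant, 8-bit re-quantized constants) follow the setup described in the QLoRA paper and should be read as assumptions of this sketch rather than values taken from this notebook.

```python
BLOCK_SIZE = 64         # weights that share one quantization constant
CONSTANT_BITS = 32      # a constant stored as float32 (no double quantization)
DQ_CONSTANT_BITS = 8    # a constant after being re-quantized to 8 bits
DQ_BLOCK_SIZE = 256     # constants that share one second-level float32 constant

# Overhead in bits per model parameter without double quantization.
overhead_plain = CONSTANT_BITS / BLOCK_SIZE

# Overhead with double quantization: 8-bit constants plus the
# second-level float32 constants spread over BLOCK_SIZE * DQ_BLOCK_SIZE weights.
overhead_double = DQ_CONSTANT_BITS / BLOCK_SIZE + CONSTANT_BITS / (BLOCK_SIZE * DQ_BLOCK_SIZE)

savings = overhead_plain - overhead_double
print(f"plain: {overhead_plain:.3f}, double: {overhead_double:.3f}, saved: {savings:.3f} bits/param")
```

For a model with roughly 7 billion parameters, a saving of about 0.37 bits per parameter corresponds to a few hundred megabytes.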

We'll enable gradient checkpointing to save memory. However, this leads to slower backward pass.

In [None]:
model_name = "ybelkada/falcon-7b-sharded-bf16"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map={"":0}
)


model.gradient_checkpointing_enable()
# Prepares the model for kbit training
model = prepare_model_for_kbit_training(model)


Get the model tokenizer and set the padding token to be the same as the end-of-sequence token.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

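To see why a padding token is needed at all, here is a minimal pure-Python sketch of how a batch of token sequences of different lengths is padded and masked. The token ids, the `EOS_ID` value and the right-padding choice are hypothetical; the real tokenizer does this internally.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token id lists to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


EOS_ID = 11  # hypothetical id of the end-of-sequence token, reused for padding
batch_ids, batch_mask = pad_batch([[5, 8, 2], [7], [4, 9]], pad_id=EOS_ID)
print(batch_ids)
print(batch_mask)
```

Because padded positions are masked out, reusing the end-of-sequence token as the padding token does not change what the model attends to.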

Print number of trainable and total parameters in the model

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
        )

## Configure LoRA Adapter.

LoRA uses the adapter technique to add a small number of new trainable parameters to the model. In the code snippet below, we'll configure the LoRA adapter. The adapter works by reparameterizing the weight update of a layer, usually a linear layer. For the best performance, we'll include all linear layers in target_modules.

lora_alpha: This is the scaling factor for the LoRA update matrices. The update is scaled by lora_alpha / r, so a higher lora_alpha gives the adapter a stronger effect on the weights.

lora_dropout: This is the dropout probability for the LoRA layers.

r: This is the rank of the update matrices. A lower rank results in smaller update matrices with fewer trainable parameters. A higher rank results in larger matrices, which can hold more information about the weights of the layer but also require more parameters.

bias: This determines whether the bias parameters should be updated during training. The bias parameters are the terms that are added to the output of a layer. The value can be 'none', 'all' or 'lora_only'. We'll choose 'none', so no bias parameters are trained.

task_type: This specifies the task type for which the LoRA adapter is being used. We use "CAUSAL_LM" for text generation. Other values include "SEQ_CLS", "SEQ_2_SEQ_LM" and "TOKEN_CLS".

target_modules: These are the modules to which the LoRA update matrices will be applied. The names must match module names in the base model. For Falcon, a lighter alternative is to target only ["query_key_value"]; here we use ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"].

Next, we'll create an object of LoraConfig class and pass in the selected parameters.

In [None]:
lora_alpha = 32
lora_dropout = 0.05
lora_r = 8

lora_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ])
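The interaction between lora_alpha and r can be sketched in a few lines. With the default LoRA settings the update is multiplied by lora_alpha / r; the helper names and the toy update matrix below are hypothetical and only illustrate the arithmetic.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Scaling applied to the low-rank update with default LoRA settings."""
    return lora_alpha / r


def scaled_update(delta, lora_alpha, r):
    """Apply the LoRA scaling to an update matrix given as a list of rows."""
    s = lora_scaling(lora_alpha, r)
    return [[s * value for value in row] for row in delta]


# Values used in this notebook: lora_alpha = 32, r = 8.
scale = lora_scaling(32, 8)
toy_delta = [[0.01, -0.02], [0.0, 0.03]]   # illustrative update values
print(scale, scaled_update(toy_delta, 32, 8))
```

Doubling r while keeping lora_alpha fixed halves this factor, so the two are usually tuned together.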

Combine the quantized base model with LoRA adapter using get_peft_model and pass the lora_config along with the pretrained falcon base model.

In [None]:
# Now you get a model ready for QLoRA training
lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()

trainable params: 16,318,464 || all params: 6,938,039,168 || trainable%: 0.2352028232308763
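The trainable parameter count reported above can be reproduced by hand. Each adapted linear layer with input size `in` and output size `out` adds `r * (in + out)` parameters. The layer shapes below are assumptions about the Falcon-7B architecture (hidden size 4544, head dimension 64 with multi-query attention, 32 decoder layers), not values read from this notebook.

```python
HIDDEN = 4544      # assumed hidden size of Falcon-7B
HEAD_DIM = 64      # assumed attention head dimension
NUM_LAYERS = 32    # assumed number of decoder layers
RANK = 8           # lora_r from the configuration above

# (in_features, out_features) of each targeted linear layer in one decoder block.
module_shapes = {
    "query_key_value": (HIDDEN, HIDDEN + 2 * HEAD_DIM),  # multi-query: one shared K and V head
    "dense": (HIDDEN, HIDDEN),
    "dense_h_to_4h": (HIDDEN, 4 * HIDDEN),
    "dense_4h_to_h": (4 * HIDDEN, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in the two LoRA matrices A (r x in) and B (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in module_shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")
```

Under these assumptions the total matches the 16,318,464 trainable parameters printed by PEFT.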


We'll fine-tune the LoRA model using the SFTTrainer from TRL, which builds on the Hugging Face Trainer API.
First, we'll set the training arguments.

Training Arguments

output_dir defines the directory where the training results will be written.

per_device_train_batch_size defines the batch size for each GPU. This could be increased or decreased, depending on the available GPU memory.

gradient_accumulation_steps defines the number of forward/backward passes whose gradients are accumulated before the optimizer performs an update step. This increases the effective batch size without increasing GPU memory usage.

optim defines the optimizer used for training. paged_adamw_32bit is a bitsandbytes variant of AdamW that keeps its optimizer states in 32-bit precision and stores them in paged memory, so they can be moved to CPU RAM when the GPU runs short of memory instead of causing an out-of-memory error.

save_steps defines the number of steps after which the model checkpoint will be saved.

fp16 defines whether to use 16-bit floating point (mixed) precision during training. This can significantly reduce memory usage, but it may also reduce the numerical accuracy of training.

logging_steps defines the number of update steps between two logs.

learning_rate defines the initial learning rate for the optimizer.

max_grad_norm defines the maximum norm of the gradients. This is used to prevent the gradients from becoming too large, which can lead to instability.

max_steps defines the maximum number of steps to train for.

warmup_ratio defines the fraction of the total training steps used for warmup. During warmup the learning rate is increased gradually from zero to its initial value, which helps keep the early updates stable. Note that the plain constant scheduler does not apply warmup; constant_with_warmup does.

lr_scheduler_type defines the type of learning rate scheduler. The constant scheduler keeps the learning rate fixed for the entire training process. Other values include cosine and linear.

You may need to experiment with different values to determine the best values of the parameters with respect to the specific language model you choose, the downstream task, and the available resources.
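To make warmup_ratio concrete, the sketch below computes the number of warmup steps for the values used in this notebook and shows a linear warmup followed by a constant learning rate. It is an illustration of the idea under these assumptions, not the Trainer's internal code, and the function names are hypothetical.

```python
import math


def warmup_steps(max_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps implied by a warmup ratio."""
    return math.ceil(max_steps * warmup_ratio)


def lr_with_linear_warmup(step: int, base_lr: float, n_warmup: int) -> float:
    """Learning rate that ramps up linearly, then stays constant."""
    if n_warmup > 0 and step < n_warmup:
        return base_lr * step / n_warmup
    return base_lr


n_warmup = warmup_steps(max_steps=150, warmup_ratio=0.03)
schedule = [lr_with_linear_warmup(step, 2e-4, n_warmup) for step in range(8)]
print(n_warmup, schedule)
```

With 150 training steps, a ratio of 0.03 covers only the first handful of updates, so its effect on a run this short is small.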

In [None]:
training_arguments = TrainingArguments(
    output_dir='./training_output',
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    save_steps=10,
    logging_steps=10,
    learning_rate=2e-4,
    fp16=True,
    max_grad_norm=0.3,
    max_steps=150,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type='constant'
)
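The batch settings above determine how many passes over the data the run makes. The short calculation below assumes a single GPU and the 800 training rows shown earlier.

```python
import math

TRAIN_ROWS = 800                   # rows in the training split
PER_DEVICE_BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 4
NUM_DEVICES = 1                    # assumption: a single T4 GPU
MAX_STEPS = 150

# Examples that contribute to one optimizer update.
effective_batch_size = PER_DEVICE_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_DEVICES

# Optimizer updates needed to see every training example once.
steps_per_epoch = math.ceil(TRAIN_ROWS / effective_batch_size)

epochs = MAX_STEPS / steps_per_epoch
print(effective_batch_size, steps_per_epoch, epochs)
```

This agrees with the `'epoch': 3.0` value reported in the training output.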




In the following code snippet, we'll initialize an object of the SFTTrainer class and pass it the dataset and the column we want to train on. We'll also pass the PEFT configuration (the LoRA configuration we set earlier), the tokenizer, the maximum sequence length, the model and the training arguments.

SFTTrainer is specifically designed for supervised fine-tuning (SFT). It inherits from the Trainer class in the Transformers library, and we imported it from the trl library.

In [None]:
max_seq_length = 512

trainer = SFTTrainer(
    model=lora_model,
    train_dataset=train_dict['train'],
    peft_config=lora_config,
    dataset_text_field="formatted_data",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)


The original weights of the model are already frozen, so only the LoRA matrices are trainable. Before training, we cast the layer norm modules to float32. Normalization is sensitive to rounding error, so keeping these small layers in higher precision makes training more stable at a negligible memory cost. After this, we will proceed with training the model.

In [None]:
# Loop through the named modules of the trainer's model
for name, module in trainer.model.named_modules():
    # Match modules whose name contains "norm" (the layer norms)
    if "norm" in name:
        # nn.Module.to() converts the module's parameters in place
        module.to(torch.float32)
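The limited precision of 16-bit floats can be seen without a GPU. The sketch below round-trips Python floats through IEEE 754 half precision using the standard library; the helper name `to_fp16` is hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", x))[0]


# Half precision has an 11-bit significand, so not every integer above 2048 is representable.
print(to_fp16(2049.0))   # prints 2048.0

# Small values also pick up a rounding error.
print(abs(to_fp16(0.1) - 0.1))
```

Errors of this size are harmless for most weights, but they can accumulate in the mean and variance computed by normalization layers, which is one reason to keep those layers in float32.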

## Train the model.

In [None]:
# Disabling cache usage in the model configuration
lora_model.config.use_cache = False

trainer.train()



| Step | Training Loss |
|---|---|
| 10 | 2.9393 |
| 20 | 2.736 |
| 30 | 2.6085 |
| 40 | 2.6383 |
| 50 | 2.5989 |
| 60 | 2.3912 |
| 70 | 2.3754 |
| 80 | 2.3559 |
| 90 | 2.3423 |
| 100 | 2.3157 |




TrainOutput(global_step=150, training_loss=2.2812517801920573, metrics={'train_runtime': 1096.4063, 'train_samples_per_second': 2.189, 'train_steps_per_second': 0.137, 'total_flos': 4314588420268032.0, 'train_loss': 2.2812517801920573, 'epoch': 3.0})

## Save the model.

This will only save the LoRA adapter, not the base model. For inference, we need to load both the saved adapter and the base Falcon-7B model. save_pretrained writes the adapter weights and config files to the tuned_model/ directory. You can optionally share the adapter on the Hugging Face Hub.

In [None]:
lora_model.save_pretrained("tuned_model/")

## Inference

In [None]:
peft_model = './tuned_model'
config = PeftConfig.from_pretrained(peft_model)

peft_base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map='auto'
    )

# Load the Lora model
trained_model = PeftModel.from_pretrained(peft_base_model, peft_model)
trained_model_tokenizer = AutoTokenizer.from_pretrained(
    config.base_model_name_or_path,
    trust_remote_code=True
)

trained_model_tokenizer.pad_token = trained_model_tokenizer.eos_token



## Create generation config for prediction.

We take a sample text from the test dataset for inference.

In [None]:
sample = test_df.iloc[5, :]
sample_text = sample['text']

In [None]:
sample_text

'This order contained one set of hangers as opposed to three. I reordered. The same thing happened with the second order. I returned them both on the same day at the same time. Amazon received one but not the other. It took weeks to get the refund. The product itself appears to be very good.'

Tokenize the sample text.

In [None]:
# Tokenize with the tokenizer loaded for inference and move the tensors to the GPU
batch = trained_model_tokenizer(sample_text, return_tensors='pt').to("cuda")

We need to create a generation configuration for the inference.

In [None]:
gen_config = GenerationConfig(
    # attention_mask is an input to generate(), not a generation setting
    max_new_tokens=5,
    pad_token_id=trained_model_tokenizer.pad_token_id,
    eos_token_id=trained_model_tokenizer.eos_token_id,
    repetition_penalty=2.0,
    num_return_sequences=1,
)

Run generation inside PyTorch's inference mode. torch.inference_mode is a stricter, faster alternative to torch.no_grad that disables gradient tracking.

In [None]:
with torch.inference_mode():
    result = trained_model.generate(
        input_ids=batch.input_ids,
        attention_mask=batch.attention_mask,
        generation_config=gen_config,
    )

final_output = trained_model_tokenizer.decode(result[0], skip_special_tokens=True)
final_output


'This order contained one set of hangers as opposed to three. I reordered. The same thing happened with the second order. I returned them both on the same day at the same time. Amazon received one but not the other. It took weeks to get the refund. The product itself appears to be very good. ->: negative '

The generated output is the review followed by ` ->: negative`, which is the model's predicted sentiment.
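To use the prediction programmatically, the label has to be pulled out of the generated string. Here is a minimal sketch that assumes the same " ->: " separator used when formatting the training data; the `extract_label` helper is hypothetical.

```python
LABELS = ("negative", "neutral", "positive")
SEPARATOR = "->:"


def extract_label(generated_text: str):
    """Return the sentiment label that follows the last separator, or None."""
    _, separator, tail = generated_text.rpartition(SEPARATOR)
    if not separator:
        return None                      # the model never produced the separator
    words = tail.strip().split()
    if not words:
        return None                      # nothing was generated after the separator
    candidate = words[0].lower().strip(".,!?")
    return candidate if candidate in LABELS else None


example = "The product itself appears to be very good. ->: negative "
print(extract_label(example))
```

Returning None for unexpected output makes it easy to count how often the model fails to produce a valid label when evaluating on the test set.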

I hope this article explained QLoRA fine-tuning in a simple way. To go further, experiment with the configurations of the libraries used here.

## References.

https://huggingface.co/docs/transformers/v4.20.1/en/perf_train_gpu_one

https://huggingface.co/docs/transformers/generation_strategies

https://huggingface.co/blog/4bit-transformers-bitsandbytes

https://huggingface.co/docs/peft/en/package_reference/lora

https://huggingface.co/tiiuae/falcon-7b

https://huggingface.co/blog/peft