#**Fine-Tuning LLaMA-2 7B (4-Bit Quantized) Using QLoRA**

Note: This notebook contains a graph that may not be visible if you are using VS Code. To see it, open the notebook in Colab: https://colab.research.google.com/drive/1k51NmYcjx2I4yR__HPw_CysxNNIMeORZ#scrollTo=OSHlAbqzDFDq


#LLaMA-2 7B Chat:

After the launch of Meta's LLaMA, the open-source community produced a surge of improved Large Language Models (LLMs), with many models competing for attention. However, challenges remained, including restrictive licenses, limited fine-tuning options, and high deployment costs. LLaMA 2 addressed the first of these by shipping with a license that permits commercial use. Combined with techniques such as QLoRA, it can also be fine-tuned on consumer GPUs with limited memory, which lowers the cost of adapting and deploying the model.

LLaMA-2's significance extends beyond licensing. It works well with Parameter-Efficient Fine-Tuning (PEFT), a family of techniques that streamline fine-tuning by reducing the number of model parameters that need to be updated. This lowers memory use and computational cost and can shorten training time, making LLaMA-2 a resource-efficient option for researchers and developers. Its baseline performance is also strong across diverse benchmarks: Meta reported that it outperformed other open-source chat models on most of the benchmarks tested. This suggests that fine-tuned LLaMA-2 models hold promise across various applications.

Although stronger models may exist, LLaMA-2 has particular strengths that make it a practical choice. Many advanced models are not open source, so their weights are inaccessible. Conversely, some open-source models lack support for important features such as PEFT and QLoRA. LLaMA-2 offers strong baseline performance, support for these features, and openly available weights. In a field where trade-offs are common, it strikes a balance that makes it a versatile and accessible LLM for varied applications.





#**Installing Required Packages**

* Accelerate lets us run the code in a distributed configuration, for example to shard data across devices for parallel training.

* PEFT and bitsandbytes are discussed below.

* Transformers provides APIs to quickly download pretrained models, use them on a given text, and fine-tune them on custom datasets.


* Transformer Reinforcement Learning (trl) is a library for training LLMs with reinforcement learning. We import SFTTrainer from it, the class used here for supervised fine-tuning. Details are in the next section.
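The install cell below pins an exact version for each of these packages. As a minimal sketch, the environment can be compared against those pins with the standard library's `importlib.metadata`. The `PINNED` dictionary and the `check_pins` helper are hypothetical names introduced here for illustration; they are not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned by the install cell in this notebook.
PINNED = {
    "accelerate": "0.21.0",
    "peft": "0.4.0",
    "bitsandbytes": "0.40.2",
    "transformers": "4.31.0",
    "trl": "0.4.7",
}


def check_pins(pins):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (installed, installed == wanted)
    return report


for name, (installed, ok) in check_pins(PINNED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name:<14} pinned={PINNED[name]:<8} installed={installed} [{status}]")
```

Running this after the install cell should report `ok` for every package. A `MISMATCH` usually means another version was already present in the runtime, in which case restarting the runtime after installing may help.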




In [None]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

#**Importing Required Libraries**



*  **Parameter-efficient fine-tuning** (PEFT) is a library for fine-tuning an LLM without touching all of its parameters. In conventional deep learning models such as ResNet, we can freeze the initial layers and fine-tune only the final fully connected layers, whereas full fine-tuning of an LLM updates all of the model's parameters. The PEFT library lets you fine-tune the model for tasks such as text summarization by updating only a small subset of parameters, which often gives results comparable to full fine-tuning at a fraction of the cost.

*   Because of the large number of parameters and the cost of a single backward pass through the LLM, a very large amount of GPU VRAM is required. To overcome this problem we use the bitsandbytes library, which stores the weights in 4 bits instead of 32-bit floating point through a technique called quantization. The library's 8-bit scheme is described in the paper https://arxiv.org/abs/2208.07339; the 4-bit data types used here were introduced with QLoRA.

* **Auto classes** of the Transformers library are used to load a model and its weights. **AutoModelForCausalLM** is a specific Auto class used to load causal models such as GPT and LLaMA. Its from_pretrained() method loads the model and its weights. (Note that there are two types of language models, causal and masked. Causal language models, such as GPT-3 and LLaMA, predict the next token in a sequence of tokens and so generate text that continues the input.)

* **AutoTokenizer** is also an Auto class; it automatically selects the right tokenizer for a model based on the model name.

* **BitsAndBytesConfig** configures quantization. It supports NF4, FP4, and Int8, and we pass it as an argument to AutoModelForCausalLM.from_pretrained() so that the quantized model is loaded.

* **TrainingArguments** is a data class that stores all the settings related to training. It is later passed to SFTTrainer. HfArgumentParser is an argument parser that can fill data classes such as TrainingArguments from command-line arguments.


* **pipeline()** is a high-level inference helper that wraps many kinds of tasks behind one interface. Here it is used to generate text responses from the fine-tuned model.

* **logging** from Transformers controls how verbose the library's messages are during training. A verbosity of CRITICAL displays only critical messages and suppresses warnings.

* **PeftModel**: PeftModel.from_pretrained() loads the saved weights of the trained parameters (the result of fine-tuning with PEFT and QLoRA), and model.merge_and_unload() merges them into the weights of the base model.


* **SFTTrainer** is a class from the TRL library used for supervised fine-tuning of the model. It supports parameter-efficient fine-tuning, so we use it for supervised fine-tuning with QLoRA.
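The memory and parameter savings described in the PEFT and bitsandbytes items above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: it counts weight storage alone (gradients, optimizer state, and activations need additional memory), uses a nominal 7 billion parameters, and assumes a single 4096 x 4096 projection matrix with a LoRA rank of 64. The helper names `weight_memory_gib` and `lora_trainable_params` are hypothetical.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed just to store the weights, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


def lora_trainable_params(d_in, d_out, rank):
    """LoRA adds two small matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


N_PARAMS = 7_000_000_000  # nominal "7B"; the exact count differs slightly

for label, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(N_PARAMS, bits):6.2f} GiB")

# Assumed shapes: one 4096 x 4096 projection matrix, adapter rank 64.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=64)
print(f"full matrix: {full:,} params; LoRA adapter: {lora:,} params "
      f"({100 * lora / full:.2f}% of the full matrix)")
```

Going from 32 bits to 4 bits per weight shrinks weight storage by a factor of eight, and the adapter for this one matrix trains only a few percent of its parameters. Real 4-bit formats also store per-block scaling constants, so actual usage is somewhat higher than this estimate.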

















In [None]:
import os

import numpy as np
import pandas as pd
import torch
from datasets import Dataset, load_dataset
from google.colab import drive  # only available inside Google Colab
from peft import LoraConfig, PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from trl import SFTTrainer

# Mount Google Drive so the notebook can read and write files under /content/drive.
drive.mount('/content/drive')

Mounted at /content/drive
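Before loading the model, it may help to see what quantizing weights to 4 bits actually does. The sketch below is a deliberately simplified stand-in: it uses plain absmax rounding to signed integers in [-7, 7] with one shared scale. It is not the NF4 or FP4 data type that bitsandbytes implements, and it does not call bitsandbytes at all. The function names and the sample weights are made up for illustration.

```python
def quantize_absmax_4bit(weights):
    """Map floats to signed 4-bit integers in [-7, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the 4-bit range.
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.33, 0.10, 0.02, -0.61]  # made-up values for illustration
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"original={w:+.3f}  code={c:+d}  restored={r:+.3f}  error={abs(w - r):.3f}")
```

Each weight is stored as a small integer plus one shared floating-point scale, and the rounding error per weight is at most half the scale. Small weights such as 0.02 collapse to zero under this uniform scheme, which is one motivation for data types whose levels are placed non-uniformly.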
