<a href="https://colab.research.google.com/github/Chiranjeevi2001/Mistral-7b-finetuning/blob/main/Mistral_7b_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This notebook uses the QLoRA technique to fine-tune the Mistral-7B-Instruct-v0.2 model efficiently on question-answer data about a code base called Enlighten. I am doing this exercise to learn the fundamentals of fine-tuning an LLM, because I intend to use this technique in a project I am working on (automatic business rule extraction from COBOL files).

## 1. Define relevant variables

In [1]:
base_model = "mistralai/Mistral-7B-Instruct-v0.2"
new_model = "Enlighten_Instruct"

test_path = "/content/Enlighten-Instruct/Dataset/TestData.csv"
train_path = "/content/Enlighten-Instruct/Dataset/TrainData.csv"

## 2. Import required libraries and clone the Enlighten GitHub repo

In [2]:
%%capture
!git clone 'https://github.com/ali7919/Enlighten-Instruct.git'
!pip install -U bitsandbytes
!pip install transformers==4.36.2
!pip install -U peft
!pip install -U accelerate
!pip install -U trl
!pip install datasets==2.16.0
!pip install sentencepiece

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os, torch
from datasets import load_dataset
from trl import SFTTrainer
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import re
from datasets import Dataset

### Log in to Hugging Face

In [5]:
from google.colab import userdata
secret_hf = userdata.get('HUGGINGDACE_TOKEN') # token is added in the secrets section (yes, I messed up the spelling)
!huggingface-cli login --token $secret_hf

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Mistral Instruct expects training data in a particular format. Here's an example:
`<s>[INST] What is your favourite condiment? [/INST]
Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s>`

* `<s>` : BOS, the beginning-of-sequence token
* `[INST]` and `[/INST]` : the user instruction goes between these two markers
* `</s>` : EOS, the end-of-sequence token
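
The template above can be captured in a small helper. This is only a minimal sketch: `format_example` is a hypothetical name, not part of any library, and the single spaces around `[INST]` and `[/INST]` follow the example shown above rather than the exact strings this notebook builds in the dataset section.

```python
def format_example(instruction: str, answer: str) -> str:
    """Wrap one instruction/answer pair in the Mistral Instruct template."""
    # <s> opens the sequence, [INST]...[/INST] delimits the user turn,
    # and </s> closes the sequence after the model's answer.
    return f"<s>[INST] {instruction.strip()} [/INST] {answer.strip()}</s>"


sample = format_example(
    "What is your favourite condiment?",
    "Well, I'm quite partial to a good squeeze of fresh lemon juice.",
)
print(sample)
```

Keeping the template in one function makes it easier to use the same string layout for both training rows and inference prompts.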

## 3. Building the dataset

In [9]:
df = pd.read_csv(train_path)
df['text'] = '<s>[INST]@Enlighten. ' + df['Q'] + '[/INST]' + df['A'] + '</s>'
df = df.drop(['Q','A', 'class'], axis = 1)
df

Unnamed: 0,text
0,<s>[INST]@Enlighten. What is the purpose of th...
1,<s>[INST]@Enlighten. What is the purpose of th...
2,<s>[INST]@Enlighten. What component does the '...
3,<s>[INST]@Enlighten. What does the OnTriggerEn...
4,<s>[INST]@Enlighten. What is the _collider fie...
...,...
1905,<s>[INST]@Enlighten. write about the Puzzles i...
1906,<s>[INST]@Enlighten. write about the Beast in ...
1907,<s>[INST]@Enlighten. write about the Main game...
1908,<s>[INST]@Enlighten. write about the NPC in En...


In [11]:
# Convert the DataFrame to a Hugging Face Dataset object
dataset = Dataset.from_pandas(df)

## 4. Load the base model

In [12]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type =  "nf4",
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_use_double_quant = False,
)

# 4-bit loading is already requested by bnb_config, so load_in_4bit is not passed separately
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config = bnb_config,
    torch_dtype = torch.bfloat16,
    device_map = "auto",
    trust_remote_code = True,
)

model.config.use_cache = False
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

# Load tokenizer
tokenizer  = AutoTokenizer.from_pretrained(base_model, trust_remote_code = True)
tokenizer.padding_side = "right"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.bos_token, tokenizer.eos_token

Downloaded config.json, model.safetensors.index.json, generation_config.json, the tokenizer files, and three safetensors shards (4.94G, 5.00G, 4.54G).

('<s>', '</s>')
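
To see why 4-bit loading matters on a Colab GPU, here is a rough back-of-the-envelope estimate. It uses the three shard sizes reported in the download output of this run and assumes the shards store 16-bit weights (2 bytes per parameter). It ignores quantization constants and any layers kept in higher precision, so the real footprint will be somewhat larger.

```python
# Shard sizes in GB, as reported in the download output of this run.
shard_sizes_gb = [4.94, 5.00, 4.54]

BYTES_PER_PARAM_16BIT = 2  # assumption: the checkpoint stores 16-bit weights
BITS_PER_PARAM_NF4 = 4     # weights after 4-bit NF4 quantization

checkpoint_gb = sum(shard_sizes_gb)
approx_params_billion = checkpoint_gb / BYTES_PER_PARAM_16BIT
approx_4bit_gb = approx_params_billion * BITS_PER_PARAM_NF4 / 8

print(f"16-bit checkpoint:      {checkpoint_gb:.2f} GB")
print(f"approx. parameters:     {approx_params_billion:.2f} B")
print(f"approx. 4-bit weights:  {approx_4bit_gb:.2f} GB")
```

Under these assumptions the weights shrink to roughly a quarter of their 16-bit size, which leaves room for activations, LoRA weights, and optimizer state.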

## 5. Prepare the model for PEFT training

In [13]:
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
)
model = get_peft_model(model, peft_config)
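
As a sanity check on this configuration, the number of trainable LoRA parameters can be estimated by hand. The sketch below assumes the layer shapes of the published Mistral-7B configuration; only the `q_proj` shape and the layer count appear in this notebook's own model printout, so the other shapes are labeled as assumptions in the code.

```python
# Values taken from the LoraConfig above.
r = 64
lora_alpha = 16

# (in_features, out_features) of each targeted projection.
# q_proj matches the model printout in this notebook; the remaining shapes
# are assumptions based on the published Mistral-7B configuration.
target_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
}
num_layers = 32  # "(0-31): 32 x MistralDecoderLayer" in the model printout

# LoRA adds two matrices per adapted layer: A (r x in) and B (out x r).
params_per_layer = sum(r * (n_in + n_out) for n_in, n_out in target_shapes.values())
trainable_params = params_per_layer * num_layers

adapter_mb_fp32 = trainable_params * 4 / 1e6  # assumption: adapter saved in 32-bit floats
scaling = lora_alpha / r                      # factor applied to the LoRA update

print(f"trainable LoRA parameters: {trainable_params:,}")
print(f"adapter size in fp32:      {adapter_mb_fp32:.0f} MB")
print(f"LoRA scaling (alpha / r):  {scaling}")
```

The estimate comes to 92,274,688 parameters, or about 369 MB in 32-bit floats, which is consistent with the 369M `adapter_model.safetensors` upload reported later in this notebook.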

In [14]:
# Set hyperparameters for training the model
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=1,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,  # note: ignored by the "constant" scheduler; "constant_with_warmup" would apply it
    group_by_length=True,
    lr_scheduler_type="constant",
)

In [15]:
# Configure the supervised fine-tuning (SFT) trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)



Map:   0%|          | 0/1910 [00:00<?, ? examples/s]

In [16]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,1.5427
2,1.718
3,2.0731
4,2.7621
5,2.5814
6,2.4145
7,2.1453
8,1.9942
9,1.9748
10,1.8127




TrainOutput(global_step=478, training_loss=1.0050431003630411, metrics={'train_runtime': 3418.2472, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.14, 'total_flos': 8899283549110272.0, 'train_loss': 1.0050431003630411, 'epoch': 1.0})
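
The `global_step=478` in the output above follows directly from the dataset size and the batch settings. The arithmetic below is a small sketch: the example count comes from the `Map` progress line, the runtime from `TrainOutput`, and a single GPU is assumed.

```python
import math

num_examples = 1910               # from the "Map: ... 0/1910" progress line
per_device_train_batch_size = 4   # from TrainingArguments
gradient_accumulation_steps = 1   # from TrainingArguments
num_train_epochs = 1              # from TrainingArguments
num_devices = 1                   # assumption: a single Colab GPU

# One optimizer step consumes this many examples.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# The last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

train_runtime_s = 3418.2472       # from TrainOutput
seconds_per_step = train_runtime_s / total_steps

print(f"total optimizer steps: {total_steps}")
print(f"seconds per step:      {seconds_per_step:.2f}")
```

The result is 478 steps, matching `global_step`, and the implied rate of about 0.14 steps per second matches `train_steps_per_second`.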

## 6. Saving the adapter

In [18]:
trainer.model.save_pretrained(new_model)
model.config.use_cache = True
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Line

In [19]:
trainer.model.push_to_hub(new_model)

adapter_model.safetensors:   0%|          | 0.00/369M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Chiranjeevi2001/Enlighten_Instruct/commit/db851ebb12c3806803117a470472bcca05058653', commit_message='Upload model', commit_description='', oid='db851ebb12c3806803117a470472bcca05058653', pr_url=None, pr_revision=None, pr_num=None)
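
Once the adapter is on the Hub, it can be attached to a freshly loaded base model in a later session. The following is a sketch rather than a tested recipe: `load_finetuned` is a hypothetical helper name, the adapter repo ID is read off the commit URL above, and the quantization settings simply mirror the ones used earlier in this notebook. Running it needs a GPU and the same libraries installed above.

```python
def load_finetuned(
    base_id: str = "mistralai/Mistral-7B-Instruct-v0.2",
    adapter_id: str = "Chiranjeevi2001/Enlighten_Instruct",
):
    """Load the 4-bit base model and attach the LoRA adapter from the Hub."""
    # Imported lazily so that defining this helper does not load the heavy libraries.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    # Wrap the base model with the trained LoRA weights.
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    return model, tokenizer
```

Calling `model, tokenizer = load_finetuned()` would then give a model that can be passed to a `text-generation` pipeline in the same way as in the inference section.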

## 7. Running inference with the model

In [20]:
logging.set_verbosity(logging.CRITICAL)
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)

In [21]:
def build_prompt(question):
    # Wrap the question in the Mistral Instruct template with the @Enlighten. prefix
    return f"<s>[INST]@Enlighten. {question} [/INST]"

In [22]:
question = input("Enter your question:")
prompt = build_prompt(question)
result = pipe(prompt)

print(result[0]['generated_text'])

Enter your question:What is a lightyear
<s>[INST]@Enlighten. What is a lightyear [/INST]A lightyear is a unit of distance, equal to the distance that light travels in one year. It is approximately 9.46 trillion kilometers or 5.88 trillion miles.
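
The generated text echoes the prompt before the answer, as the output above shows. A small post-processing step can strip it. `extract_answer` below is a hypothetical helper written for illustration; it assumes the answer is everything after the last `[/INST]` marker.

```python
def extract_answer(generated_text: str, marker: str = "[/INST]") -> str:
    """Return the text after the last instruction marker, or the whole text if the marker is absent."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


raw = "<s>[INST]@Enlighten. What is a lightyear [/INST]A lightyear is a unit of distance."
print(extract_answer(raw))
```

Alternatively, the `text-generation` pipeline accepts `return_full_text=False`, which returns only the newly generated text.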
