<a href="https://colab.research.google.com/github/Chiranjeevi2001/cobrex-mistral/blob/main/Mistral_7b_finetuning_cobrex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction:
This notebook uses the QLoRA technique to fine-tune the Mistral-7B model (here, `mistralai/Mistral-7B-Instruct-v0.2`) efficiently on a code base called Enlighten. I am doing this exercise to learn the fundamentals of fine-tuning an LLM, because I intend to use this technology in a project I am working on: automatic business rule extraction from COBOL files.

## 1. Define relevant variables

In [1]:
base_model = "mistralai/Mistral-7B-Instruct-v0.2"
new_model = "cobrex-mistral"

# test_path = "/content/Enlighten-Instruct/Dataset/TestData.csv"
train_path = "/content/dataset_cbl.csv"

## 2. Import required libraries and clone the Enlighten GitHub repo

In [2]:
%%capture
# !git clone 'https://github.com/ali7919/Enlighten-Instruct.git'
!pip install -U bitsandbytes
!pip install transformers==4.36.2
!pip install -U peft
!pip install -U accelerate
!pip install -U trl
!pip install datasets==2.16.0
!pip install sentencepiece

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os, torch
from datasets import load_dataset
from trl import SFTTrainer
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import re
from datasets import Dataset

### Log into HuggingFace

In [4]:
from google.colab import userdata
secret_hf = userdata.get('HUGGINGDACE_TOKEN') # token is added in the secrets section (yes, I messed up the spelling)
!huggingface-cli login --token $secret_hf

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Mistral instruct takes training data in a particular format. Here's an example:
`<s>[INST] What is your favourite condiment? [/INST]
Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s>`

* `<s>` : BOS, the beginning-of-sequence token
* `[INST]` and `[/INST]` : the user instruction goes between these two tags
* `</s>` : EOS, the end-of-sequence token
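To make the template concrete, here is a minimal sketch of a helper that wraps one instruction/response pair in this format. The function name `format_mistral_example` and its arguments are hypothetical and only for illustration; they are not part of any library.

```python
BOS, EOS = "<s>", "</s>"

def format_mistral_example(instruction: str, response: str) -> str:
    """Wrap one instruction/response pair in the Mistral-instruct template."""
    # The instruction sits between [INST] and [/INST]; the answer follows and ends with EOS.
    return f"{BOS}[INST] {instruction.strip()} [/INST] {response.strip()}{EOS}"

example = format_mistral_example(
    "What is your favourite condiment?",
    "Well, I'm quite partial to a good squeeze of fresh lemon juice.",
)
print(example)
```

Whatever spacing and tags are chosen here, it is generally a good idea to use exactly the same template for training and for inference.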

## 3. Building the dataset

In [7]:
df = pd.read_csv(train_path)
df['text'] = '<s>[INST]@Cobrex. ' + df['Cobol_code'] + '[/INST]' + df['BR_json'] + '</s>'
df = df.drop(['Cobol_code','BR_json'], axis = 1)
df

Unnamed: 0,text
0,"<s>[INST]@Cobrex. $set sourceformat""free..."
1,"<s>[INST]@Cobrex. $set sourceformat""free..."
2,"<s>[INST]@Cobrex. $set sourceformat""free..."
3,"<s>[INST]@Cobrex. $set sourceformat""free..."
4,"<s>[INST]@Cobrex. $set sourceformat""free..."
...,...
388,<s>[INST]@Cobrex. IDENTIFICATION DIVISI...
389,<s>[INST]@Cobrex. IDENTIFICATION DIVISI...
390,<s>[INST]@Cobrex. IDENTIFICATION DIVISI...
391,<s>[INST]@Cobrex. IDENTIFICATION DIVISI...


In [8]:
# Convert the DataFrame to a Hugging Face Dataset object
# (the earlier pyarrow `ds.dataset(...)` call was immediately overwritten, so it is dropped)
dataset = Dataset.from_pandas(df)

## 4. Load the base model

In [9]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type =  "nf4",
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_use_double_quant = False,
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config = bnb_config,  # 4-bit loading is already requested via bnb_config
    torch_dtype = torch.bfloat16,
    device_map = "auto",
    trust_remote_code = True,
)

model.config.use_cache = False
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()

# Load tokenizer
tokenizer  = AutoTokenizer.from_pretrained(base_model, trust_remote_code = True)
tokenizer.padding_side = "right"
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.bos_token, tokenizer.eos_token

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


('<s>', '</s>')

## 5. Prepare the model for PEFT training

In [10]:
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
)
model = get_peft_model(model, peft_config)
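As a rough sanity check on the LoRA configuration, the number of trainable adapter parameters can be estimated by hand. The sketch below assumes the Mistral-7B projection shapes listed in `PROJ_SHAPES`. The `q_proj` shape and the 32 decoder layers appear in the model printout later in this notebook, while the `k_proj`, `v_proj` and `gate_proj` output sizes are taken from the published model configuration and are assumptions here.

```python
# Assumed (in_features, out_features) for each targeted projection in one decoder layer.
PROJ_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),      # assumption: grouped-query attention with 8 KV heads
    "v_proj": (4096, 1024),      # assumption: same as k_proj
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),  # assumption: MLP intermediate size of 14336
}
NUM_LAYERS = 32
R, ALPHA = 64, 16  # values from LoraConfig above

def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) next to each frozen weight."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in PROJ_SHAPES.values())
total = per_layer * NUM_LAYERS

print(total)                      # 92274688 trainable adapter parameters
print(ALPHA / R)                  # 0.25, the scaling applied to the LoRA update
print(round(total * 4 / 1e6, 1))  # 369.1 (MB if the adapter is stored in 32-bit floats)
```

Under these assumptions the estimate is roughly consistent with the size of `adapter_model.safetensors` reported when the adapter is pushed to the Hub.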

In [11]:
# Set hyperparameters for training the model
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=50,
    logging_steps=1,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
)
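One detail worth checking in these arguments is the combination of `warmup_ratio=0.03` and `lr_scheduler_type="constant"`. As far as I understand, the plain constant schedule keeps the learning rate fixed from the first step, so the warmup setting only matters with `"constant_with_warmup"`. The sketch below is a pure-Python illustration of the two multipliers, not the library's code; `lr_multiplier` is a hypothetical helper, and the step count of 99 is the `global_step` reported by the training run in this notebook.

```python
import math

def lr_multiplier(step: int, warmup_steps: int, use_warmup: bool) -> float:
    """Factor applied to the base learning rate at a given optimizer step."""
    if use_warmup and step < warmup_steps:
        # Linear ramp from 0 up to 1 over the warmup steps
        return step / max(1, warmup_steps)
    return 1.0

total_steps = 99       # global_step reported by trainer.train() in this notebook
warmup_ratio = 0.03
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(warmup_steps)                                                 # 3
print([lr_multiplier(s, warmup_steps, False) for s in range(5)])    # constant
print([round(lr_multiplier(s, warmup_steps, True), 2) for s in range(5)])  # with warmup
```

With only about three warmup steps out of 99, the practical difference for this run is probably small, but it is useful to know which setting is actually in effect.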

In [12]:
# Setting sft parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length= None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)



Map:   0%|          | 0/393 [00:00<?, ? examples/s]

In [13]:
trainer.train()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
1,1.0343
2,1.1352
3,1.1679
4,0.9503
5,0.7564
6,1.0311
7,0.9079
8,0.8278
9,0.6777
10,0.6966




TrainOutput(global_step=99, training_loss=0.67858631562705, metrics={'train_runtime': 4268.8434, 'train_samples_per_second': 0.092, 'train_steps_per_second': 0.023, 'total_flos': 1.390029557551104e+16, 'train_loss': 0.67858631562705, 'epoch': 1.0})
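The numbers in this `TrainOutput` can be cross-checked against the training arguments with a little arithmetic. This is a small sketch that assumes a single GPU and takes the dataset size of 393 from the `Map` progress line printed when the trainer was created.

```python
import math

num_examples = 393          # from the "Map: ... 0/393" line above
per_device_batch = 4        # per_device_train_batch_size
grad_accum = 1              # gradient_accumulation_steps
epochs = 1                  # num_train_epochs
train_runtime_s = 4268.8434 # train_runtime from TrainOutput

# One optimizer step consumes per_device_batch * grad_accum examples (single device assumed).
steps = math.ceil(num_examples / (per_device_batch * grad_accum)) * epochs

print(steps)                                     # 99, matches global_step
print(round(train_runtime_s / steps, 1))         # 43.1 seconds per step
print(round(num_examples / train_runtime_s, 3))  # 0.092, matches train_samples_per_second
```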

## 6. Saving the adapter

In [14]:
trainer.model.save_pretrained(new_model)
model.config.use_cache = True
model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): MistralForCausalLM(
      (model): MistralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MistralDecoderLayer(
            (self_attn): MistralAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Line

In [15]:
trainer.model.push_to_hub(new_model)

adapter_model.safetensors:   0%|          | 0.00/369M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Chiranjeevi2001/cobrex-mistral/commit/3be66482e8296bcfc86d03ff748a6f2d8e46bcd9', commit_message='Upload model', commit_description='', oid='3be66482e8296bcfc86d03ff748a6f2d8e46bcd9', pr_url=None, pr_revision=None, pr_num=None)

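## 7. Running inference with the model

In [29]:
logging.set_verbosity(logging.CRITICAL)
# Note: "text-generation" is the usual pipeline task for a causal LM such as Mistral.
# This notebook uses a translation task name, so results come back under the 'translation_text' key.
pipe = pipeline(task="translation_XX_to_YY", model=model, tokenizer=tokenizer, max_length=2000)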

In [25]:
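def build_prompt(question):
    """Wrap the user's COBOL code in the instruction template."""
    # Note: the training data uses the tag '@Cobrex. ' (capital C) and has no space before [/INST];
    # matching the training template exactly at inference time is generally advisable.
    prompt = f"<s>[INST]@cobrex. {question} [/INST]"
    return prompt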

In [31]:
question = input("Enter your cobol code:")
prompt = build_prompt(question)
result = pipe(prompt)

Enter your cobol code:       IDENTIFICATION DIVISION.        PROGRAM-ID. PROD2V1.                              ENVIRONMENT DIVISION.                              CONFIGURATION SECTION.                                SPECIAL-NAMES.                              INPUT-OUTPUT SECTION.                                FILE-CONTROL.            SELECT RENTAL ASSIGN TO 'RENTACAR-IN.txt'                ORGANISATION IS LINE SEQUENTIAL.            SELECT RENTAL-OUT ASSIGN TO 'RENTACAR-OUT.txt'.                                   DATA DIVISION.                                    FILE SECTION.            FD RENTAL.            01 RENTAL-FILE.                02 CLIENT_NAME PIC A(20).                02 RENTAL-TYPE.                    03 NAME_INITIAL PIC A(1).                    03 CAR_TYPE PIC 9(1).                    03 KILOMETERS PIC 9(5).                    03 NUM_DAYS PIC 9(3).            FD RENTAL-OUT.            01 RENTAL-FILE-OUT.                02 CLIENT_NAME_OUT PIC A(20).                02 FILL

In [32]:
len(result[0])

1

In [35]:
result[0]

{'translation_text': '[INST]@cobrex.        IDENTIFICATION DIVISION.        PROGRAM-ID. PROD2V1.                              ENVIRONMENT DIVISION.                              CONFIGURATION SECTION.                                SPECIAL-NAMES.                              INPUT-OUTPUT SECTION.                                FILE-CONTROL.            SELECT RENTAL ASSIGN TO \'RENTACAR-IN.txt\'                ORGANISATION IS LINE SEQUENTIAL.            SELECT RENTAL-OUT ASSIGN TO \'RENTACAR-OUT.txt\'.                                   DATA DIVISION.                                    FILE SECTION.            FD RENTAL.            01 RENTAL-FILE.                02 CLIENT_NAME PIC A(20).                02 RENTAL-TYPE.                    03 NAME_INITIAL PIC A(1).                    03 CAR_TYPE PIC 9(1).                    03 KILOMETERS PIC 9(5).                    03 NUM_DAYS PIC 9(3).            FD RENTAL-OUT.            01 RENTAL-FILE-OUT.                02 CLIENT_NAME_OUT PIC A(20).    

In [49]:
import re

text = result[0]['translation_text']

# Extract all JSON parts using regular expression
json_parts = re.findall(r'{\s*"id":.*?}', text, re.DOTALL)

for json_part in json_parts:
    print(json_part)

{

"id": "BR-001",
"description": "The total cost of a rental is calculated by multiplying the number of kilometers traveled by the cost per kilometer for the car type and adding the number of days rented multiplied by the daily rental rate for the car type.",
"condition": "KILOMETERS * CAR_TYPE",
"output": {
  "total_cost": "KILOMETERS * CAR_TYPE"
}
{
  "id": "BR-002",
  "description": "If the number of kilometers traveled is greater than or equal to 75, then the number of kilometers traveled is reduced by 75.",
  "condition": "KILOMETERS >= 75",
  "output": {
    "reduced_kilometers": "KILOMETERS - 75"
  }
{
  "id": "BR-003",
  "description": "The cost per kilometer for a car type is determined by the car type.",
  "condition": "CAR_TYPE",
  "output": {
    "cost_per_kilometer": "CAR_TYPE"
  }
{
  "id": "BR-004",
  "description": "The daily rental rate for a car type is determined by the car type.",
  "condition": "CAR_TYPE",
  "output": {
    "daily_rental_rate": "CAR_TYPE"
  }
{
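The non-greedy pattern used above stops at the first `}` it meets, which is why each printed rule ends after the nested `"output"` object and is missing its outer closing brace. One possible alternative is to let Python's JSON decoder find the end of each object. The following is a minimal sketch under that assumption; `extract_json_objects` and the `sample` string are hypothetical and for illustration only, and objects that the model itself leaves unclosed would be skipped rather than repaired.

```python
import json

def extract_json_objects(text: str) -> list:
    """Return every JSON object that can be parsed starting at a '{' in `text`."""
    decoder = json.JSONDecoder()
    objects, pos = [], 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value at `start` and reports where it ends
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON from here, try the next brace
            continue
        objects.append(obj)
        pos = end
    return objects

sample = (
    'noise {"id": "BR-001", "output": {"total": "A * B"}} more '
    '{"id": "BR-002", "output": {}} {broken'
)
rules = [o for o in extract_json_objects(sample) if "id" in o]
for rule in rules:
    print(json.dumps(rule, indent=2))
```

Because each extracted rule is already a Python dictionary, it can be validated or written to a file without any further string handling.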
  