# Install

Model finetuning, a form of transfer learning, is a machine learning technique that improves the performance of a pretrained model on a specific task by training it further on new data related to that task.
Finetuning is commonly used when a pretrained model has learned useful representations of a general domain (in our case, natural language) and is then adapted to perform well on a narrower, more specific task.

This notebook is dedicated to finetuning a Falcon model with QLoRA.

Learn more about LLMs and decoding strategies in the [LangChain 101 course](https://github.com/IvanReznikov/DataVerse/tree/main/Courses/LangChain/Lecture2.%20Models).

In [1]:
# https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
import locale
import pprint


def getpreferredencoding(do_setlocale=True):
    return "UTF-8"


locale.getpreferredencoding = getpreferredencoding

In [2]:
!pip install -q -U bitsandbytes einops safetensors torch xformers datasets
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git


In [None]:
"""
einops==0.6.1 bitsandbytes==0.41.1 transformers==4.32.0 
safetensors==0.3.2 xformers==0.0.21 huggingface-hub==0.16.4
"""

# Import

In [3]:
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline,
)
import transformers
import torch
from torch.utils.data import DataLoader, Dataset

# Load Quantized Model

In [4]:
"""
For the Arabic language, one may use the following option instead:
model = BloomForCausalLM.from_pretrained('Naseej/noon-7b')
tokenizer = BloomTokenizerFast.from_pretrained('Naseej/noon-7b')

data = load_dataset("ashhadulislam/arabic_medical_test", split="train")
"""

# Define the model ID for the sharded FALCON model by vilsonrodrigues
model_id = "vilsonrodrigues/falcon-7b-instruct-sharded"

# Configure BitsAndBytesConfig for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Initialize the tokenizer using the model ID and set the pad token to be the same as the end of sentence token
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Initialize the pre-trained model using AutoModelForCausalLM
pretrained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0}, 
    trust_remote_code=True, 
)
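To get a feel for why the model is loaded in 4-bit, the weight memory can be estimated with simple arithmetic. This is a rough sketch under stated assumptions: a round 7e9 parameters for a "7B" model, and the block sizes described in the QLoRA paper (64 weights per quantization constant, 256 constants per second-level constant). What `bitsandbytes` actually allocates may differ, and activations, LoRA weights, and optimizer state are not counted.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store n_params weights at the given bit width, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumption: a round 7e9 parameters for a "7B" model; the exact count differs.
N_PARAMS = 7e9

# Assumption: block sizes as described in the QLoRA paper.
WEIGHT_BLOCK = 64     # weights that share one 32-bit quantization constant
CONSTANT_BLOCK = 256  # constants that share one second-level 32-bit constant

# Extra bits per weight spent on storing quantization constants.
overhead_plain = 32 / WEIGHT_BLOCK
overhead_double = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

for label, bits in [
    ("16-bit weights", 16),
    ("4-bit weights", 4 + overhead_plain),
    ("4-bit weights + double quantization", 4 + overhead_double),
]:
    print(f"{label}: {bits:.3f} bits/param, ~{weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Under these assumptions, `bnb_4bit_use_double_quant=True` trims the per-weight overhead of the quantization constants, which is why it is enabled alongside `load_in_4bit=True`.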


In [5]:
from peft import prepare_model_for_kbit_training

pretrained_model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(pretrained_model)

In [6]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [7]:
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

trainable params: 4718592 || all params: 3613463424 || trainable%: 0.13058363808693696
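The trainable parameter count can be reproduced by hand. Each adapted `query_key_value` layer gets two low-rank matrices, `lora_A` (in_features x r) and `lora_B` (r x out_features), with no bias. The sketch below uses the layer shapes shown in the model printout later in this notebook (32 decoder layers, 4544 input features, 4672 output features); the helper name `lora_param_count` is hypothetical.

```python
def lora_param_count(in_features, out_features, rank, n_layers):
    """Parameters added by LoRA: lora_A (in x r) plus lora_B (r x out) per adapted layer."""
    return n_layers * rank * (in_features + out_features)


# Shapes taken from the model printout in this notebook; r comes from LoraConfig(r=16).
trainable = lora_param_count(in_features=4544, out_features=4672, rank=16, n_layers=32)

# Total reported by print_trainable_parameters above.
reported_total = 3_613_463_424

print(f"trainable params: {trainable}")
print(f"trainable%: {100 * trainable / reported_total}")
```

The reported total of about 3.6 billion is lower than the nominal 7 billion, most likely because the 4-bit weights are stored packed (two weights per byte), so `numel()` undercounts them. The true trainable fraction is therefore probably even smaller than the printed percentage.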


# Prepare Data

In [8]:
from datasets import load_dataset

data = load_dataset("celikmus/mayo_clinic_symptoms_and_diseases_v1", split="train")


In [9]:
data

Dataset({
    features: ['text', 'label'],
    num_rows: 1058
})

In [10]:
tokenizer.pad_token = tokenizer.eos_token

train_dataset = data.map(
    lambda x: {
        "input_text": f"symptoms: {x['text']}; most likely explanation: {x['label']}"
    }
)

# Tokenize the datasets
train_encodings = tokenizer(
    train_dataset["input_text"],
    truncation=True,
    padding=True,
    max_length=256,
    return_tensors="pt",
)


In [11]:
class TextDataset(Dataset):
    def __init__(self, encodings):
        """
        Initialize a custom dataset for text inputs and encodings.

        Parameters:
            encodings (dict): A dictionary containing the encoded inputs.
        """
        self.encodings = encodings

    def __getitem__(self, idx):
        """
        Get an item from the dataset by index.

        Parameters:
            idx (int): The index of the item to retrieve.

        Returns:
            dict: A dictionary containing the encoded input and labels.
        """
        # Index each encoding tensor; the tokenizer already returned PyTorch tensors
        # (return_tensors="pt"), so copy with clone().detach() instead of torch.tensor()
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        
        # Copy input_ids to labels for tasks like language modeling
        item["labels"] = item["input_ids"].clone()
        
        return item

    def __len__(self):
        """
        Get the length of the dataset.

        Returns:
            int: The number of items in the dataset.
        """
        return len(self.encodings["input_ids"])

In [12]:
# Convert the encodings to PyTorch datasets
train_dataset = TextDataset(train_encodings)
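One detail worth knowing about the `labels` built in `TextDataset`: the `DataCollatorForLanguageModeling(tokenizer, mlm=False)` used in the Training section rebuilds the labels from `input_ids` and replaces padding positions with `-100`, the index ignored by the loss. Because this notebook sets the pad token equal to the EOS token, EOS positions end up ignored as well. The following is a plain-Python sketch of that masking step, not the library code itself; the token ids are made up, except for 11, which is the `eos_token_id` reported in the generation logs.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def causal_lm_labels(input_ids, pad_token_id):
    """Copy input_ids to labels, masking padding so it does not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


EOS = 11  # eos_token_id reported in this notebook's logs; also used as pad token
example_ids = [204, 3121, 799, EOS, EOS, EOS]  # made-up ids, padded with EOS

print(causal_lm_labels(example_ids, pad_token_id=EOS))
# → [204, 3121, 799, -100, -100, -100]
```

A consequence is that the model receives no training signal for emitting EOS, which may make the finetuned model less inclined to stop on its own.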

# Example Before Fine Tuning

In [13]:
request_text = """
In crowded places, I feel cold in the tips of my fingers, I sweat with dizziness. 
What is happening?
"""
# hint: agoraphobia

In [14]:
encoding = tokenizer(request_text, return_tensors="pt").to("cuda:0")
pretrained_model_output = pretrained_model.generate(
    input_ids=encoding.input_ids,
    attention_mask=encoding.attention_mask,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.25,
    eos_token_id=tokenizer.eos_token_id,
)


In [15]:
pprint.pprint(tokenizer.decode(pretrained_model_output[0], skip_special_tokens=True))

('In crowded places, I feel cold in the tips of my fingers, I sweat with '
 'dizziness. What is happening?\n'
 'It is possible that you are experiencing symptoms of hypothermia, which is a '
 "condition where the body's temperature drops below normal. This can be "
 'caused by exposure to cold temperatures, dehydration, or certain '
 'medications. It is important to seek medical attention if you are '
 'experiencing symptoms of hypothermia, as it can be dangerous if left '
 'untreated.')


In [16]:
request_text = "I started to feel swelling  and itching around the mouth and throat after a salad with peanuts, cherry tomatoes and cheese. What may be the reason?"

In [17]:
encoding = tokenizer(request_text, return_tensors="pt").to("cuda:0")
pretrained_model_output = pretrained_model.generate(
    input_ids=encoding.input_ids,
    attention_mask=encoding.attention_mask,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.25,
    eos_token_id=tokenizer.eos_token_id,
)

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


In [18]:
pprint.pprint(tokenizer.decode(pretrained_model_output[0], skip_special_tokens=True))

('I started to feel swelling  and itching around the mouth and throat after a '
 'salad with peanuts, cherry tomatoes and cheese. What may be the reason?\n'
 'It is possible that the peanuts, cherry tomatoes, and cheese may have caused '
 'an allergic reaction. It is recommended to avoid consuming peanuts if you '
 'have a peanut allergy. It is also possible that the combination of the other '
 'ingredients may have caused an allergic reaction. It is best to seek medical '
 'attention if the symptoms persist or worsen.')
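Both generations above use `do_sample=True` with `temperature=0.25`. Temperature divides the logits before the softmax, so values below 1 sharpen the distribution and make sampling behave closer to greedy decoding. Here is a minimal standard-library sketch of that effect; the logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for temperature in (1.0, 0.25):
    probs = softmax_with_temperature(logits, temperature)
    print(f"temperature={temperature}: {[round(p, 3) for p in probs]}")
```

At `temperature=0.25` nearly all probability mass lands on the highest-scoring token, which is why repeated runs of the cells above tend to give similar answers despite sampling.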


# Training

The `Trainer` object, which orchestrates the training process, requires the `model` to be fine-tuned, the `train_dataset`, and the training configuration specified through the `TrainingArguments` class. This configuration includes the number of training epochs, batch size, gradient accumulation steps, learning rate, optimizer (`optim`), and more. The `data_collator` specifies how individual samples are collated into batches during training.

In [19]:
trainer = transformers.Trainer(
    model=model,
    train_dataset=train_dataset,
    # eval_dataset=val_dataset,
    args=transformers.TrainingArguments(
        num_train_epochs=10,  # ignored here: a positive max_steps takes precedence
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        warmup_ratio=0.05,
        max_steps=40,  # caps training at 40 optimizer steps
        learning_rate=2.5e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit",
        lr_scheduler_type="cosine",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
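With `lr_scheduler_type="cosine"`, `warmup_ratio=0.05`, `max_steps=40`, and `learning_rate=2.5e-4`, the learning rate ramps up linearly over the warmup steps and then decays along half a cosine wave. The sketch below mirrors that shape in plain Python. It is an approximation for illustration rather than the library's scheduler, and it assumes the number of warmup steps is the ratio times the total steps, rounded up.

```python
import math


def cosine_lr_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then half-cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


base_lr = 2.5e-4
total_steps = 40
# Assumption: warmup steps = ceil(total_steps * warmup_ratio).
warmup_steps = math.ceil(total_steps * 0.05)

for step in (0, 1, 2, 10, 20, 30, 40):
    lr = cosine_lr_with_warmup(step, base_lr, warmup_steps, total_steps)
    print(f"step {step:2d}: lr = {lr:.2e}")
```

With only 40 steps, the warmup covers just the first two updates, so almost the entire run happens on the decaying part of the curve.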

In [20]:
trainer.train()



| Step | Training Loss |
|------|---------------|
| 1 | 2.1768 |
| 2 | 2.1649 |
| 3 | 2.0895 |
| 4 | 2.1112 |
| 5 | 2.0125 |
| 6 | 2.1091 |
| 7 | 2.0356 |
| 8 | 2.0268 |
| 9 | 1.9647 |
| 10 | 2.0462 |


TrainOutput(global_step=40, training_loss=1.849859318137169, metrics={'train_runtime': 941.1089, 'train_samples_per_second': 1.36, 'train_steps_per_second': 0.043, 'total_flos': 6492863730941952.0, 'train_loss': 1.849859318137169, 'epoch': 1.2})
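The `'epoch': 1.2` in the output above follows from the batch settings: each optimizer step consumes `per_device_train_batch_size * gradient_accumulation_steps` samples, and training stops after `max_steps` steps rather than after 10 epochs. A short worked calculation, assuming a single GPU (consistent with `device_map={"": 0}`):

```python
import math

per_device_train_batch_size = 8
gradient_accumulation_steps = 4
max_steps = 40
num_rows = 1058  # size of the training split loaded above
n_devices = 1    # assumption: single GPU

# Samples consumed by one optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * n_devices

# Samples seen over the whole run, and the fraction of the dataset that represents.
samples_seen = effective_batch * max_steps
epochs_by_samples = samples_seen / num_rows

# Optimizer steps per epoch: batches per epoch divided by the accumulation factor.
batches_per_epoch = math.ceil(num_rows / per_device_train_batch_size)
steps_per_epoch = batches_per_epoch // gradient_accumulation_steps

print(f"effective batch size: {effective_batch}")
print(f"samples seen: {samples_seen}")
print(f"epochs (by samples): {epochs_by_samples:.2f}")
print(f"epochs (by steps): {max_steps / steps_per_epoch:.2f}")
```

The 1280 samples seen also agree with the reported throughput: 1.36 samples per second over roughly 941 seconds.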

# Example After Fine Tuning

In [21]:
# Save model:
trained_model = (
    trainer.model.module if hasattr(trainer.model, "module") else trainer.model
)  # Take care of distributed/parallel training
trained_model.save_pretrained("outputs")

In [22]:
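# Now we can run inference with our model.
# PeftModel.from_pretrained loads the trained adapter weights saved in "outputs";
# get_peft_model with only the saved LoraConfig would attach a freshly
# initialised adapter without those weights.
from peft import PeftModel

loaded_model = PeftModel.from_pretrained(pretrained_model, "outputs")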

In [23]:
loaded_model.config.use_cache = True
loaded_model.eval()

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): FalconForCausalLM(
      (transformer): FalconModel(
        (word_embeddings): Embedding(65024, 4544)
        (h): ModuleList(
          (0-31): 32 x FalconDecoderLayer(
            (self_attention): FalconAttention(
              (maybe_rotary): FalconRotaryEmbedding()
              (query_key_value): Linear4bit(
                in_features=4544, out_features=4672, bias=False
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4544, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4672, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (dense): Linear4bit(in_

In [24]:
# Empty VRAM
del model
del trained_model
del trainer
import gc

gc.collect()

63

In [25]:
model_id = (
    "vilsonrodrigues/falcon-7b-instruct-sharded"  # sharded model by vilsonrodrigues
)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

pretrained_model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map={"": 0}, trust_remote_code=True
)

Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

In [26]:
def generate_answer(query):
    """
    Generate responses from both the original model and PEFT model and compare their answers.

    Parameters:
        query (str): The user query for which responses are generated.

    Returns:
        None
    """
    # System and user prompts
    system_prompt = """Answer the following question truthfully.
          If you don't know the answer or the question is too complex, 
          respond 'Kindly, consult a doctor for further queries.'."""
    user_prompt = f"""<HUMAN>: {query}
      <ASSISTANT>: """
    final_prompt = system_prompt + "\n" + user_prompt

    # Device and dashline
    device = "cuda:0"
    dashline = "-" * 50

    # Encode prompt and generate response from the original model
    encoding = tokenizer(final_prompt, return_tensors="pt").to(device)
    output = pretrained_model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.25,
        repetition_penalty=1.3,
        eos_token_id=tokenizer.eos_token_id,
    )
    text_output = tokenizer.decode(output[0], skip_special_tokens=True)

    # Print original model response
    pprint.pprint(dashline)
    pprint.pprint(f"ORIGINAL MODEL RESPONSE:\n{text_output}")
    pprint.pprint(dashline)

    # Encode prompt and generate response from the PEFT model
    peft_encoding = tokenizer(final_prompt, return_tensors="pt").to(device)
    peft_output = loaded_model.generate(
        input_ids=peft_encoding.input_ids,
        attention_mask=peft_encoding.attention_mask,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.25,
        repetition_penalty=1.3,
        eos_token_id=tokenizer.eos_token_id,
    )
    peft_text_output = tokenizer.decode(peft_output[0], skip_special_tokens=True)

    # Print PEFT model response
    pprint.pprint(f"PEFT MODEL RESPONSE:\n{peft_text_output}")
    pprint.pprint(dashline)
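Compared with the earlier generation cells, `generate_answer` adds `repetition_penalty=1.3`. The idea is to make tokens that already appear in the sequence less likely: positive scores are divided by the penalty and negative scores are multiplied by it. The sketch below illustrates this rule on a made-up four-token vocabulary; the function name `apply_repetition_penalty` is hypothetical, and this is a simplified stand-in rather than the library's implementation.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty):
    """Lower the scores of tokens that were already generated or present in the prompt."""
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        score = adjusted[token_id]
        # Dividing a positive score or multiplying a negative one both reduce it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [3.0, -1.0, 0.5, 2.0]  # made-up scores; list index = token id
penalised = apply_repetition_penalty(logits, previous_token_ids=[0, 1], penalty=1.3)

for token_id, (before, after) in enumerate(zip(logits, penalised)):
    print(f"token {token_id}: {before:.3f} -> {after:.3f}")
```

Tokens 2 and 3 have not been seen, so their scores stay unchanged, while tokens 0 and 1 both move down.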

In [29]:
query = """In crowded places, I feel cold in the tips of my fingers, I sweat with dizziness. What may be happening?"""
generate_answer(query)

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


'--------------------------------------------------'
('ORIGINAL MODEL RESPONSE:\n'
 'Answer the following question truthfully.\n'
 "  If you don't know the answer, respond 'Sorry, I don't know the answer to "
 "this question.'.\n"
 "  If the question is too complex, respond 'Kindly, consult a doctor for "
 "further queries.'.\n"
 '<HUMAN>: In crowded places, I feel cold in the tips of my fingers, I sweat '
 'with dizziness. What may be happening?\n'
 '  <ASSISTANT>: <HUMAN> may be experiencing symptoms related to hypothermia. '
 'It is recommended to seek medical attention immediately.')
'--------------------------------------------------'
('PEFT MODEL RESPONSE:\n'
 'Answer the following question truthfully.\n'
 "  If you don't know the answer, respond 'Sorry, I don't know the answer to "
 "this question.'.\n"
 "  If the question is too complex, respond 'Kindly, consult a doctor for "
 "further queries.'.\n"
 '<HUMAN>: In crowded places, I feel cold in the tips of my fingers, I sweat '

In [30]:
query = """I started to feel swelling  and itching around the mouth and throat after a salad with peanuts, cherry tomatoes and cheese. What may be the reason?"""
generate_answer(query)

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


'--------------------------------------------------'
('ORIGINAL MODEL RESPONSE:\n'
 'Answer the following question truthfully.\n'
 "  If you don't know the answer, respond 'Sorry, I don't know the answer to "
 "this question.'.\n"
 "  If the question is too complex, respond 'Kindly, consult a doctor for "
 "further queries.'.\n"
 '<HUMAN>: I started to feel swelling  and itching around the mouth and throat '
 'after a salad with peanuts, cherry tomatoes and cheese. What may be the '
 'reason?\n'
 '  <ASSISTANT>: <HUMAN> may have an allergic reaction to peanuts or other '
 'ingredients in the salad. It is recommended to seek medical attention for '
 'proper diagnosis and treatment.')
'--------------------------------------------------'
('PEFT MODEL RESPONSE:\n'
 'Answer the following question truthfully.\n'
 "  If you don't know the answer, respond 'Sorry, I don't know the answer to "
 "this question.'.\n"
 "  If the question is too complex, respond 'Kindly, consult a doctor for "
 "fur