<a href="https://colab.research.google.com/github/AbdelRahmanRifai87/medBot/blob/main/MedBot_on_Custom_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build your MedBot
© 2023, Zaka AI, Inc. All Rights Reserved.

---
The goal of this Colab notebook is to make you more familiar with LLM fine-tuning by building a simple question-answering LLM that can answer medical questions. By the end, you will be able to adapt this LLM to any dataset.

**Just to give you a heads up:** We won't end up with a model that performs like ChatGPT or Bard, but we will get an idea of how to create our own smaller versions of such powerful LLMs.

## Importing and Installing Libraries/Packages
We will start by installing our necessary packages.

**bitsandbytes**: This package lets us load our model with 4-bit quantization

**transformers**: This Hugging Face package lets us load state-of-the-art models easily into our notebook

**peft**: This package lets us add parameter-efficient fine-tuning (PEFT) techniques, such as LoRA, to our model

**accelerate**: This package handles device placement and mixed precision for us, so the same PyTorch code runs on the available hardware with only a few extra lines

**datasets**: This package lets us import datasets from the Hugging Face platform and use them directly

In [None]:
!pip install bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install datasets


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import torch
import transformers
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from transformers import AutoTokenizer, BitsAndBytesConfig, AutoModelForCausalLM

## Loading our model

Let's start by loading our model. We will use the GPT Neox 20b Model by EleutherAI!

In [None]:
hf_model = "EleutherAI/gpt-neox-20b"

We will also set the bitsandbytes configuration needed for our model to run on a single Colab GPU. The parameters we need are 4-bit loading, double quantization, the quantization type, and the compute dtype, which needs to be set to bfloat16.

In [None]:
bitsbytes_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Enable 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # Set computation type to bfloat16
    bnb_4bit_quant_type="nf4",  # Use NormalFloat 4-bit quantization
    bnb_4bit_use_double_quant=True  # Enable double quantization for better efficiency
)
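
To see why 4-bit loading matters on a single GPU, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. The parameter count of 20 billion is an approximation taken from the model's name, and the helper `weight_memory_gib` is a hypothetical name introduced for illustration. The estimate ignores quantization constants, activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

# Roughly 20 billion parameters, taken from the model's name (an approximation)
NUM_PARAMS = 20_000_000_000

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the 4-bit weights need a little over 9 GiB, compared with about 37 GiB in bfloat16.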


We will then set our tokenizer and our model using the AutoTokenizer and AutoModelForCausalLM classes.

In [None]:
#Test Your Zaka
tokenizer = AutoTokenizer.from_pretrained(hf_model)

# Load the model with the BitsAndBytes configuration
model = AutoModelForCausalLM.from_pretrained(
    hf_model,
    quantization_config=bitsbytes_config,  # Apply the 4-bit quantization config
    device_map="auto"  # Automatically map the model to the available GPU
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Model Preprocessing

We now have to apply some preprocessing to our model to prepare it for training. First, we further reduce our memory consumption by calling the gradient_checkpointing_enable() function on our model. We then use the prepare_model_for_kbit_training function so that we can train on top of the 4-bit quantized weights.

In [None]:
#Test Your Zaka


# Enable gradient checkpointing to reduce memory usage
model.gradient_checkpointing_enable()

# Prepare the model for 4-bit quantization training
model = prepare_model_for_kbit_training(model)

print("Model preprocessing complete. Ready for training!")

Model preprocessing complete. Ready for training!


Explain in your own words how 4-bit quantization affects accuracy.

**Test your Zaka** 4-bit quantization stores each weight with far fewer bits, which makes the model much smaller in memory, but the rounding makes it slightly less accurate. Fine-tuning helps recover some of the lost accuracy. It is a trade-off between efficiency and precision.
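
The rounding error behind this trade-off can be illustrated with a small sketch. Note the assumptions: this uses a plain uniform grid of 16 levels with absmax scaling, which is a simplification and not the NF4 codebook that the configuration above selects, and the weight values are made up for the example. The helper names are hypothetical.

```python
def quantize_block(values, levels=16):
    """Map floats to integer codes in [0, levels - 1] using absmax scaling."""
    scale = max(abs(v) for v in values) or 1.0
    half = (levels - 1) / 2  # 7.5 for 16 levels
    codes = [round((v / scale) * half + half) for v in values]
    return codes, scale

def dequantize_block(codes, scale, levels=16):
    """Turn integer codes back into approximate floats."""
    half = (levels - 1) / 2
    return [((c - half) / half) * scale for c in codes]

# Made-up weights for illustration only
weights = [0.02, -0.51, 0.33, 0.0, 0.74, -0.12, 0.05, -0.98]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"largest round-trip error: {max_error:.4f}")
```

Each weight is stored as one of only 16 codes plus a shared scale per block, so the restored values differ slightly from the originals. In this uniform scheme the error per weight is bounded by the scale divided by 15.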

We will also set a function that will print the number of trainable parameters our model has.

In [None]:
def print_trainable_parameters(model):
    trainable_parameters = 0
    all_parameters = 0
    for _, param in model.named_parameters():
        all_parameters += param.numel()
        if param.requires_grad:
            trainable_parameters += param.numel()
    print(
        f"Trainable: {trainable_parameters} || All: {all_parameters} || Trainable %: {100 * trainable_parameters / all_parameters}"
    )

Finally, we will set the configuration for LoRA. The parameters needed are the rank of the update matrices, the LoRA alpha value, the target modules (which need to be set to query_key_value), the LoRA dropout rate, the bias (which should be set to none), and the task type that matches the model we are using.

In [None]:
config = LoraConfig(
    #Test Your Zaka
    r=16,  # Rank updates (adjust based on your needs)
    lora_alpha=32,  # LoRA alpha (scaling factor)
    target_modules=["query_key_value"],  # Target modules to apply LoRA
    lora_dropout=0.1,  # Dropout rate
    bias="none",  # No bias tuning
    task_type="CAUSAL_LM"  # Task type for GPT-NeoX (causal language modeling)
)

# Insert the configs above to the model using the get_peft_model function

#Test Your Zaka
model = get_peft_model(model, config)

# Print the trainable parameters of the model
print_trainable_parameters(model)

Trainable: 17301504 || All: 10606202880 || Trainable %: 0.16312627804457008
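
The trainable count printed above can be reproduced by hand. LoRA adds two small matrices per targeted layer: A with shape (r × in_features) and B with shape (out_features × r). The sketch below assumes the GPT-NeoX-20B shapes (hidden size 6144, 44 layers, and a fused query_key_value projection with 3 × 6144 outputs); these values are not stated in this notebook, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(in_features, out_features, rank, num_layers):
    """Parameters added by LoRA: A is (rank x in), B is (out x rank), per layer."""
    per_layer = rank * in_features + out_features * rank
    return per_layer * num_layers

# Assumed GPT-NeoX-20B shapes: hidden size 6144, 44 layers, fused QKV projection
HIDDEN = 6144
trainable = lora_param_count(HIDDEN, 3 * HIDDEN, rank=16, num_layers=44)
print(trainable)  # 17301504
```

The "All" figure is roughly half of 20 billion, most likely because the 4-bit weights are packed two per byte, so numel() counts the packed storage rather than the original weights. The true trainable percentage is therefore smaller than the one printed.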


## Dataset Loading

Let's load our medical dataset from Hugging Face. We will use the `medalpaca/medical_meadow_wikidoc_patient_information` dataset. You can access it [here](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc_patient_information).

In [None]:
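#Test Your Zaka
from datasets import load_dataset
data = load_dataset("medalpaca/medical_meadow_wikidoc_patient_information")

# Padding needs a pad token; reuse EOS, as the training cell below also does
tokenizer.pad_token = tokenizer.eos_token

def tokenize(samples):
    # A causal LM trains on one sequence, so join each question with its answer
    texts = [f"{q}\n{a}{tokenizer.eos_token}" for q, a in zip(samples["input"], samples["output"])]
    # max_length=256 is an arbitrary cap; a fixed length lets examples stack into batches
    tokens = tokenizer(texts, padding="max_length", truncation=True, max_length=256)
    tokens["labels"] = [list(ids) for ids in tokens["input_ids"]]
    return tokens

data = data.map(tokenize, batched=True, remove_columns=data["train"].column_names)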

Map:   0%|          | 0/5942 [00:00<?, ? examples/s]
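
Examples can only be stacked into a batch tensor when every field has the same length. The sketch below shows that idea on plain lists of token ids: each sequence is truncated or padded to a fixed length, and padded label positions are set to -100, the value that the cross-entropy loss ignores by default. The token ids, `pad_id=0`, and the helper `pad_and_label` are made up for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def pad_and_label(token_ids, max_length, pad_id):
    """Truncate or pad one sequence and build a matching attention mask and labels."""
    ids = token_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
        "labels": ids + [IGNORE_INDEX] * n_pad,
    }

# One short and one overlong sequence of made-up token ids
batch = [pad_and_label(seq, max_length=6, pad_id=0) for seq in ([5, 8, 13], [7, 7, 2, 9, 4, 1, 3])]
for row in batch:
    print(row)
```

Both rows come out with exactly six entries per field, so they can be converted to rectangular tensors.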

## Model Training and Testing

Now we train the model using the transformers library. Before doing so, we set the tokenizer's padding token to the end-of-sequence token, since padding requires one. Your goal here is to tune the parameters until you get a model that runs on a single Colab GPU.

In [None]:
import numpy as np

# Define the compute_metrics function (token-level accuracy for a causal LM)
def compute_metrics(p):
    logits, labels = p
    # Logits have shape (batch, seq_len, vocab); position i predicts token i + 1
    preds = np.argmax(logits, axis=-1)[:, :-1]
    labels = labels[:, 1:]
    # Skip positions marked with -100, which the loss also ignores
    mask = labels != -100
    acc = (preds[mask] == labels[mask]).mean()
    return {"accuracy": float(acc)}

In [None]:

from transformers import Trainer, TrainingArguments

# Set the tokenizer's padding token to be the EOS token
tokenizer.pad_token = tokenizer.eos_token

# Define training arguments
training_args = TrainingArguments(
    output_dir="./train",                          # Directory to save model and logs
    adam_epsilon=1e-8,                             # Optimizer epsilon
    learning_rate=5e-5,                            # Learning rate
    fp16=True,                                     # Use mixed precision training
    per_device_train_batch_size=16,                # Batch size for training
    per_device_eval_batch_size=16,                 # Batch size for evaluation
    gradient_accumulation_steps=2,                 # Accumulate gradients over steps
    num_train_epochs=5,                            # Number of training epochs
    evaluation_strategy="epoch",                   # Evaluate at the end of each epoch
    save_strategy="epoch",                         # Save the model at the end of each epoch
    load_best_model_at_end=True,                   # Load the best model at the end
    metric_for_best_model="eval_accuracy",         # Must match a key returned by compute_metrics
    greater_is_better=True,                        # Whether higher values of the metric are better
)

# Set up the Trainer
trainer = Trainer(
    model=model,                                  # The model to train
    args=training_args,                           # The training arguments
    train_dataset=data["train"],                  # The training dataset
    eval_dataset=data["train"],                   # Evaluation reuses the training split (no held-out split is created here)
    tokenizer=tokenizer,                          # The tokenizer for tokenizing input data
    compute_metrics=compute_metrics,              # The function to compute metrics during evaluation
)

# Start training
trainer.train()


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

Explain four of the training arguments you used in your Trainer: how they are used and what they represent.

**Test your Zaka** output_dir: Specifies where the model checkpoints and outputs will be saved during training.

learning_rate: Controls how much the model's weights are adjusted during training.

per_device_train_batch_size: Sets the number of samples processed simultaneously on each device during training.

num_train_epochs: Defines how many complete passes the model will make through the entire training dataset.
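
To make these arguments concrete, the sketch below works out how they combine with gradient_accumulation_steps. It uses the 5942 examples reported in the Map output and the values from the training arguments, and assumes a single GPU. The step counts are an approximation of what the Trainer reports, and `training_schedule` is a hypothetical helper.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Return (effective batch size, optimizer steps per epoch, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

print(training_schedule(5942, per_device_batch=16, grad_accum=2, epochs=5))
# (32, 186, 930)
```

Lowering per_device_train_batch_size and raising gradient_accumulation_steps by the same factor keeps the effective batch size at 32 while reducing peak memory, which is a common way to fit training on a single GPU.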

We now save our model with save_pretrained() so that its LoRA configuration can be loaded back later. The model is saved to a separate folder in the next block.

In [None]:
#Test Your Zaka

saved_model = model if hasattr(model, "save_pretrained") else model.module
saved_model.save_pretrained("outputs")

Before testing our model, we have to get the LoRA configs from our pre-trained model and set them to our new model using the get_peft_model() function.

In [None]:
#Test Your Zaka

# Load the LoRA configuration that was saved in the "outputs" folder
lora_configs = LoraConfig.from_pretrained("outputs")

# Apply the loaded configuration to the model
model = get_peft_model(model, lora_configs)

We need to set our prompt as a variable, and also our device currently in use.

In [None]:
#Test Your Zaka

prompt = "Please provide a medical explanation for the following question:"
device = "cuda:0"

Finally, we will make our LLM generate text based on the data. First, we use the tokenizer() function on our prompt.

In [None]:
#Test Your Zaka

inputs = tokenizer(prompt, return_tensors="pt").to(device)

Let's now use the generate() function on our model, and print the decoded version of our output.

In [None]:
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
  return fn(*args, **kwargs)


Please provide a medical explanation for the following question:

What is the difference between a "medical" and a "non-medical" reason for a patient to seek treatment?

A:

The difference between a "medical" and
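
The output above stops mid-sentence because generation ends as soon as max_new_tokens is reached. The sketch below shows that stopping rule with a simple greedy loop, assuming greedy decoding (picking the highest-scoring token at each step). `toy_scores` is a hypothetical stand-in for a real model and `greedy_generate` is an illustrative helper, not the library's implementation.

```python
def greedy_generate(next_token_scores, prompt_ids, max_new_tokens, eos_id):
    """Append the highest-scoring token until EOS or the token budget is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

# Hypothetical stand-in for a model: always prefers (last token + 1) mod 5
def toy_scores(ids):
    return [1.0 if tok == (ids[-1] + 1) % 5 else 0.0 for tok in range(5)]

print(greedy_generate(toy_scores, [1], max_new_tokens=40, eos_id=4))
# [1, 2, 3, 4]
```

With a budget of only two new tokens, the same call returns `[1, 2, 3]` and never reaches the end-of-sequence token, which mirrors the truncated answer above.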
