<a href="https://colab.research.google.com/github/Arwa-Abboud/ML-Women-in-AI/blob/main/Arwa_Final__Build_your_MedBot_on_Custom_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build your MedBot
© 2023, Zaka AI, Inc. All Rights Reserved.

---
The goal of this Colab notebook is to make you more familiar with LLM fine-tuning by creating a simple question-answering (QA) LLM that can answer medical questions. By the end, you will be able to customize this LLM with any dataset.

**Just to give you a heads-up:** We won't end up with a model that performs like ChatGPT or Bard, but we will get an idea of how to build our own smaller versions of such powerful LLMs.

## Importing and Installing Libraries/Packages
We will start by installing the necessary packages.

**bitsandbytes**: This package allows us to load our model with 4-bit quantization

**transformers**: This Hugging Face package allows us to easily load state-of-the-art models into our notebook

**peft**: This package makes it easy to add PEFT (parameter-efficient fine-tuning) techniques, such as LoRA, to our model

**accelerate**: Accelerate takes care of the boilerplate of placing models on devices and running them on different hardware setups, so we only need a few lines of code

**datasets**: This package allows us to easily import datasets from the Hugging Face platform and use them directly

In [None]:
!pip install bitsandbytes
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/accelerate.git
!pip install datasets


Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-g7xb0qun
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-g7xb0qun
  Resolved https://github.com/huggingface/transformers.git to commit 8f38f58f3de5a35f9b8505e9b48985dce5470985
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting git+https://github.com/huggingface/peft.git
  Cloning https://github.com/huggingface/peft.git to /tmp/pip-req-build-4y2qfy53
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/peft.git /tmp/pip-req-build-4y2qfy53
  Resolved https://github.com/huggingface/peft.git to commit 6d458b300fc2ed82e19f796b53af4c97d03ea604
  Installing build dependencies ... done
  Getting requirements

In [None]:

import torch
import transformers
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
from transformers import AutoTokenizer, BitsAndBytesConfig, AutoModelForCausalLM

## Loading our model

Let's start by loading our model. We will use the GPT-NeoX-20B model by EleutherAI!

In [None]:
hf_model = "EleutherAI/gpt-neox-20b"

We will also set the bitsandbytes configuration needed for our model to fit on a single Colab GPU. The needed parameters are double quantization (`bnb_4bit_use_double_quant`), the quantization type (`bnb_4bit_quant_type`, here NF4), and the compute dtype (`bnb_4bit_compute_dtype`), which needs to be set to bfloat16.

In [None]:
#Test Your Zaka
bitsbytes_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
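As a rough back-of-the-envelope check of why 4-bit loading matters, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes about 20 billion parameters for GPT-NeoX-20B and ignores activations, optimizer state, and other overhead. It also estimates the overhead of the per-block quantization constants with and without double quantization, assuming the block sizes described in the QLoRA paper (64 weights per first-level constant, 256 constants per second-level constant).

```python
# Rough weight-only memory estimate; ~20e9 parameters is an assumption, not read from the model
NUM_PARAMS = 20_000_000_000
GIB = 1024 ** 3

bits_per_param = {"float32": 32, "bfloat16": 16, "int8": 8, "nf4": 4}
weight_gib = {
    dtype: NUM_PARAMS * bits / 8 / GIB for dtype, bits in bits_per_param.items()
}
for dtype, gib in weight_gib.items():
    print(f"{dtype:>8}: {gib:6.2f} GiB")

# Extra bits per weight spent on quantization constants.
# Assumed block sizes: 64 weights share one constant; 256 constants share a second-level one.
single_quant_overhead = 32 / 64                   # one fp32 constant per 64 weights
double_quant_overhead = 8 / 64 + 32 / (64 * 256)  # 8-bit constants plus fp32 constants-of-constants
print(f"constants without double quant: {single_quant_overhead:.3f} bits/param")
print(f"constants with double quant:    {double_quant_overhead:.3f} bits/param")
```

Under these assumptions the 4-bit weights need roughly 9.3 GiB versus about 37 GiB in bfloat16, which is why only the quantized version is a plausible fit for a single Colab GPU.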



We will then load our tokenizer and our model using the AutoTokenizer and AutoModelForCausalLM classes.

In [None]:
#Test Your Zaka


tokenizer = AutoTokenizer.from_pretrained(hf_model)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:

model = AutoModelForCausalLM.from_pretrained(hf_model, quantization_config=bitsbytes_config, device_map={"":0})


The `GPTNeoXSdpaAttention` class is deprecated in favor of simply modifying the `config._attn_implementation`attribute of the `GPTNeoXAttention` class! It will be removed in v4.48



## Model Preprocessing

We now have to apply some preprocessing to our model to prepare it for training. First, we further reduce memory consumption by calling the gradient_checkpointing_enable() function on our model, which stores fewer intermediate activations and recomputes them during the backward pass. We then use the prepare_model_for_kbit_training() function so that the 4-bit quantized model can be trained with adapters (among other things, it freezes the base model's weights).

In [None]:
#Test Your Zaka

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)



Explain in your own words how 4-bit quantization affects accuracy

**When we use 4-bit quantization, we reduce the precision of the model's stored weights. Instead of larger, more precise formats (like 32-bit or 16-bit floats), each weight is stored as a 4-bit code (here NF4) together with a scaling constant shared by a block of weights, while the actual computation still runs in the bfloat16 compute dtype we set above.**

**This saves a lot of memory (it is what lets a 20B-parameter model fit on a single Colab GPU), but it can slightly reduce accuracy: rounding every weight to one of only 16 possible values introduces small errors, which can cost the model a bit of performance. The main benefit is memory rather than speed, since the weights are dequantized on the fly for each computation.**

**For large models, though, this drop in accuracy is usually small because they are fairly robust to minor changes in their weights. The impact can be reduced further with methods such as quantization-aware training, or, as in this notebook, by training higher-precision LoRA adapters on top of the frozen quantized weights.**
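To make this trade-off concrete, here is a simplified pure-Python sketch of absmax quantization of a single block of weights to 4 bits. It is an illustration under stated assumptions, not what bitsandbytes does: it uses evenly spaced integer levels from -7 to 7 rather than the NF4 levels, and the Gaussian "weights" are synthetic. It shows that each weight is rounded to one of at most 15 levels, so the round-trip error is bounded by half a quantization step.

```python
import random

def quantize_absmax_4bit(weights):
    """Map floats to signed integer codes in [-7, 7] using one absmax scale."""
    scale = max(abs(w) for w in weights) / 7
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1000)]  # synthetic "weights"
codes, scale = quantize_absmax_4bit(weights)
restored = dequantize(codes, scale)

errors = [abs(w - q) for w, q in zip(weights, restored)]
max_err = max(errors)
mean_err = sum(errors) / len(errors)
print(f"distinct levels used: {len(set(codes))}")
print(f"max abs error:  {max_err:.6f} (at most half a step: {scale / 2:.6f})")
print(f"mean abs error: {mean_err:.6f}")
```

NF4 places its 16 levels according to a normal distribution instead of evenly, which suits roughly normally distributed weights better, but the basic idea of storing a small code plus a shared scale is the same.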

We will also define a helper function that prints the number of trainable parameters in our model.

In [None]:
def print_trainable_parameters(model):
    """Print the number of trainable parameters versus all parameters in the model."""
    trainable_parameters = 0
    all_parameters = 0
    for _, param in model.named_parameters():
        all_parameters += param.numel()
        if param.requires_grad:
            trainable_parameters += param.numel()
    print(
        f"Trainable: {trainable_parameters} || All: {all_parameters} || Trainable %: {100 * trainable_parameters / all_parameters}"
    )

Finally, we will set the configuration for LoRA. The parameters needed are the rank of the update matrices (`r`), the LoRA scaling factor (`lora_alpha`), the target modules, which need to be set to `query_key_value`, the LoRA dropout rate (`lora_dropout`), `bias`, which should be set to `"none"`, and the task type matching the model we are using (`CAUSAL_LM`).
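The role of these parameters can be sketched in pure Python: LoRA keeps the original weight matrix W frozen and learns a low-rank update B·A of rank `r`, scaled by `lora_alpha / r`. The toy sizes below are assumptions for illustration (GPT-NeoX's real layers are far larger), LoRA dropout is omitted, and B starts at zero as in the LoRA paper, so the adapted layer initially behaves exactly like the frozen one.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # A: r x d_in, B: d_out x r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 8  # toy sizes (assumption), far smaller than GPT-NeoX
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # B starts at zero
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B = 0 the adapted layer reproduces the frozen layer exactly
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))  # True
```

Only A and B are trained, so each adapted layer adds `r * (d_in + d_out)` parameters; a small rank keeps this tiny compared with the `d_in * d_out` entries of W.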

In [None]:

#Test Your Zaka
config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)


# Wrap the model with the LoRA configuration above using get_peft_model
#Test Your Zaka
model = get_peft_model(model, config)
# Print the trainable parameters of the model
print_trainable_parameters(model)

Trainable: 8650752 || All: 10597552128 || Trainable %: 0.08162971878329976
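As a sanity check on the printed count, the number of trainable LoRA parameters can be worked out by hand. This assumes GPT-NeoX-20B's published configuration (hidden size 6144, 44 layers, and a fused `query_key_value` projection from 6144 inputs to 3 × 6144 outputs), which is not read from `model` here. Each targeted layer gets an A matrix of shape r × 6144 and a B matrix of shape 18432 × r. The "All" figure printed above is roughly half of the model's ~20B parameters, most likely because bitsandbytes packs two 4-bit weights into each stored byte, so `numel()` undercounts the quantized layers.

```python
# GPT-NeoX-20B shapes, taken from its published config (assumption: not read from `model`)
hidden_size = 6144
num_layers = 44
qkv_in, qkv_out = hidden_size, 3 * hidden_size  # query, key and value share one Linear layer

r = 8  # matches LoraConfig(r=8) above; bias="none" adds no extra parameters

# Each targeted layer gets A (r x qkv_in) and B (qkv_out x r)
lora_params_per_layer = r * qkv_in + qkv_out * r
trainable = lora_params_per_layer * num_layers
print(trainable)  # 8650752

# Percentage relative to the "All" count printed above
all_reported = 10597552128
pct = 100 * trainable / all_reported
print(f"{pct:.4f}%")  # 0.0816%
```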


## Dataset Loading

Let's load our medical dataset from Hugging Face. We will use the `medalpaca/medical_meadow_wikidoc_patient_information` dataset. You can access it [here](https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc_patient_information).

In [None]:
# Load the dataset
dataset = load_dataset("medalpaca/medical_meadow_wikidoc_patient_information")

# Tokenize the answer text ("output" column). No explicit "labels" field is needed:
# DataCollatorForLanguageModeling(mlm=False) creates the causal-LM labels at training time.
data = dataset.map(lambda samples: tokenizer(samples["output"]), batched=True)

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['input', 'output', 'instruction', 'input_ids', 'attention_mask'],
        num_rows: 5942
    })
})

## Model Training and Testing

Now we train the model using the transformers library. Before doing so, we set the tokenizer's padding token to the end-of-sequence (EOS) token, since the GPT-NeoX tokenizer does not define a pad token and the data collator needs one to pad batches. Your goal here is to tune the parameters until training runs on a single Colab GPU.

In [None]:

# Setting the tokenizer padding to be 'eos' tokens
tokenizer.pad_token = tokenizer.eos_token
#tokenizer.padding_side = "right"
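To see why a pad token is needed and where the labels come from, here is a hypothetical pure-Python mimic of what `DataCollatorForLanguageModeling(tokenizer, mlm=False)`, used in the Trainer below, does with a batch. It is a sketch, not the library code: the real collator pads through the tokenizer and returns PyTorch tensors, and right padding is assumed here. Sequences are padded to the longest one, the labels copy `input_ids` with pad positions set to -100 so the loss ignores them, and the shift by one position for next-token prediction happens later inside the model. Because the pad token is the EOS token, any genuine EOS id in a sequence would also be masked to -100.

```python
def causal_lm_collate(batch_ids, pad_token_id):
    """Hypothetical mimic of DataCollatorForLanguageModeling(mlm=False), using lists."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        padded = ids + [pad_token_id] * n_pad  # right padding (assumption)
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # Labels copy input_ids; pad positions become -100 so the loss ignores them
        labels.append([-100 if tok == pad_token_id else tok for tok in padded])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

EOS_ID = 0  # eos_token_id reported in this notebook's generation logs
batch = causal_lm_collate([[5, 6, 7], [8, 9]], pad_token_id=EOS_ID)
print(batch["labels"])  # [[5, 6, 7], [8, 9, -100]]
```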

In [None]:

trainer = transformers.Trainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        warmup_steps=1,
        max_steps=20,
        learning_rate=2e-8,
        fp16=True,
        logging_steps=2,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

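# Disable the key/value cache during training: it is not used with gradient checkpointing,
# and turning it off silences the related warnings
model.config.use_cache = False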



In [None]:
# Train the model!
#Test Your Zaka
#torch.cuda.empty_cache()  # Optionally release cached GPU memory before training (frees memory; does not speed up training)
trainer.train()


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33marwaynad[0m ([33marwaynad-emirates-national-schools[0m). Use [1m`wandb login --relogin`[0m to force relogin


  return fn(*args, **kwargs)


| Step | Training Loss |
|------|---------------|
| 2 | 2.3215 |
| 4 | 3.2895 |
| 6 | 2.3237 |
| 8 | 2.3594 |
| 10 | 2.8005 |
| 12 | 2.4485 |
| 14 | 2.5676 |
| 16 | 1.8958 |
| 18 | 3.678 |
| 20 | 2.083 |


TrainOutput(global_step=20, training_loss=2.576758396625519, metrics={'train_runtime': 103.0406, 'train_samples_per_second': 0.194, 'train_steps_per_second': 0.194, 'total_flos': 253855667183616.0, 'train_loss': 2.576758396625519, 'epoch': 0.003365870077415012})

Explain 4 of the training arguments you used in your Trainer, how they are used, and what they represent

**max_steps=20: Sets the maximum number of training steps to 20. Training stops after this many steps regardless of the number of epochs, so here the model sees only a tiny fraction of one epoch.**

**learning_rate=2e-8: Controls how much the model's weights are adjusted with each update. Lower values are safer but slower to converge; 2e-8 is very small, so over 20 steps the LoRA weights change very little (values around 2e-4 are more typical for LoRA fine-tuning).**

**logging_steps=2: Logs metrics (such as the training loss) every 2 steps to monitor progress, which is why the loss table above has one row every 2 steps.**

**optim="paged_adamw_8bit": Uses the paged AdamW optimizer with 8-bit optimizer states to reduce memory usage while keeping optimization efficient. "Paged" means optimizer states can be moved to CPU memory when GPU memory runs short.**

**fp16=True: Enables mixed-precision training with 16-bit floating-point numbers instead of 32-bit. This accelerates training and reduces memory usage, which is particularly helpful for large models like GPT-NeoX.**

**per_device_train_batch_size=1 and gradient_accumulation_steps=1: The batch size controls how many samples each device processes per step; a smaller batch reduces memory consumption but can make training slower and noisier. Gradient accumulation sums gradients over several batches before each weight update. With both set to 1, every step updates the weights from a single example, which conserves GPU memory.**
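The batch-related arguments can be checked with a little arithmetic. The sketch below copies the values from the `TrainingArguments` and the `DatasetDict` printed earlier and assumes a single GPU; it computes the effective batch size per weight update and the fraction of the training set seen in 20 steps, which agrees with the `epoch` value of about 0.00337 reported in `TrainOutput`.

```python
# Values copied from the TrainingArguments and DatasetDict above; a single GPU is assumed
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
num_devices = 1
max_steps = 20
num_train_rows = 5942

# Examples contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
examples_seen = effective_batch_size * max_steps
epoch_fraction = examples_seen / num_train_rows

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen: {examples_seen} of {num_train_rows}")
print(f"fraction of an epoch: {epoch_fraction:.6f}")
```

Raising `gradient_accumulation_steps` to, for example, 4 would give an effective batch of 4 without increasing per-step memory, at the cost of four forward/backward passes per update.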

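We now save our fine-tuned model with save_pretrained(). Because `model` is a PEFT model, this writes only the LoRA adapter weights and their configuration (adapter_config.json) to the `outputs` folder, not the full 20B base model, so that we can reload the LoRA configuration in the next block.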

In [None]:
#Test Your Zaka

# For a PEFT model, save_pretrained writes only the adapter weights and adapter_config.json
model.save_pretrained("outputs")

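Before testing our model, we load the LoRA configuration saved in `outputs` and apply it to the model using the get_peft_model() function. Note that get_peft_model() only uses the configuration: it builds adapters from it but does not load the trained adapter weights from disk. The usual way to load saved LoRA weights onto a freshly loaded base model is `PeftModel.from_pretrained(base_model, "outputs")`.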

In [None]:
#Test Your Zaka

# To reload the saved adapter weights onto a fresh base model instead, the PEFT API is:
#   from peft import PeftModel
#   base_model = AutoModelForCausalLM.from_pretrained(hf_model, quantization_config=bitsbytes_config, device_map={"": 0})
#   model = PeftModel.from_pretrained(base_model, "outputs")

# Load the LoRA configuration saved in "outputs" (adapter_config.json only, not the weights)
config = LoraConfig.from_pretrained("outputs")

# Apply the configuration with get_peft_model (adapters are built from the config, not loaded from disk)
model = get_peft_model(model, config)




Next, we store our prompt in a variable and set the device we are using.

In [None]:

#Test Your Zaka
# Set the prompt as a variable
prompt = "As a medical expert, please list the symptoms of an allergy."
# Set the device (the first GPU, where the model was loaded with device_map={"": 0})
device = "cuda:0"


Finally, we will make our LLM generate text from a prompt. First, we use the tokenizer() function to turn our prompt into PyTorch tensors on our device.

In [None]:
#Test Your Zaka

inputs = tokenizer(prompt, return_tensors="pt").to(device)

Let's now call the generate() function on our model and print the decoded output.

In [None]:
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


As a medical expert, please list the symptoms of an allergy.

A.

B.

C.

D.

E.

F.

G.

H.

I.

J.

K.

L.




In [None]:


# try 2
prompt = "As a medical expert, please list the common symptoms of an allergy in a detailed, natural format. Start your answer with: 'Allergy symptoms include:'"

# Tokenize the input prompt
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate the output
# Note: temperature and top_p only take effect when sampling is enabled (do_sample=True)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, top_p=0.9)

# Decode and print the generated response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


As a medical expert, please list the common symptoms of an allergy in a detailed, natural format. Start your answer with: 'Allergy symptoms include:'

A.

Itching

B.

Rash

C.

Sneezing

D.

Coughing

E.

Vomiting

F.

Diarrhea

G.

Headache

H.

Fatigue

I.

Nausea

J.

Dizziness

K.

Fever

L.

Anxiety

M.

Sore throat

N.

Chest pain

O.

Rash

P.

Itching

Q.

Sneezing

R.

Coughing

S.

Vomiting

T.

Diarrhea

U.

Headache

V.

Fatigue

W.

Nausea

X.
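The `temperature` and `top_p` arguments used above control random sampling, and in transformers they only apply when `generate()` is called with `do_sample=True`. The following pure-Python sketch (a simplification, not the library's implementation) shows what they do to a list of next-token logits: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps only the smallest set of most likely tokens whose cumulative probability reaches `top_p`, then samples from that set. The logits in the example are made up.

```python
import math
import random

def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample a token index with temperature scaling and nucleus (top-p) filtering."""
    # Temperature: divide logits before the softmax (<1 sharpens, >1 flattens)
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the smallest set of most likely tokens whose mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample from the kept tokens in proportion to their renormalised probabilities
    threshold = rng.random() * sum(probs[i] for i in kept)
    for i in kept:
        threshold -= probs[i]
        if threshold <= 0:
            return i
    return kept[-1]  # guard against floating-point leftovers

random.seed(0)
print([sample_top_p([2.0, 1.0, 0.5, -1.0]) for _ in range(10)])
```

With `do_sample=False`, generation instead always picks the single most likely token (greedy decoding), which is deterministic and more prone to repetitive output.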



In [None]:

# try 3
prompt = "As a medical expert, tell me What causes Allergy?"


inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Note: temperature and top_p only take effect when sampling is enabled (do_sample=True)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, top_p=0.9)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


As a medical expert, tell me What causes Allergy?

A:

Allergy is a hypersensitivity reaction to a substance that is normally harmless.
The immune system is a complex system that is designed to protect the body from foreign invaders.  When it is activated, it can cause a number of different reactions.  One of these is an allergic reaction.
The immune system is designed to recognize and attack invaders.  When it does so, it can cause a number of different reactions.  One of these is an allergic reaction.
The immune system is designed to recognize and attack invaders.  When it does so, it can cause a number of different reactions.  One of these is an allergic reaction.
The immune system is designed to recognize and attack invaders.  When it does so, it can cause a number of different reactions.  One of these is an allergic reaction.
The immune system is designed to recognize and attack invaders.  When it does so, it can cause a number of
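The answer above falls into a loop, repeating the same sentence. Hugging Face `generate()` offers options against this, such as `no_repeat_ngram_size` and `repetition_penalty`. The sketch below illustrates the rule behind `no_repeat_ngram_size` in simplified pure Python over word tokens (an assumption; the real implementation works on token ids and logits): any token that would recreate an n-gram already present in the sequence is banned at the next step.

```python
def banned_next_tokens(tokens, n):
    """Return tokens that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n - 1:
        return set()
    prefix = tuple(tokens[len(tokens) - (n - 1):])  # last n-1 tokens generated so far
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned

text = "the immune system is designed to recognize invaders . the immune system"
tokens = text.split()
print(banned_next_tokens(tokens, n=3))  # {'is'}
```

In the notebook, the equivalent would be passing, for example, `no_repeat_ngram_size=3` to `model.generate(**inputs, max_new_tokens=200, no_repeat_ngram_size=3)`; the value 3 is an illustrative choice, not a tuned one.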
