<a href="https://colab.research.google.com/github/Alao001/LLMs/blob/main/Llama_2_Medical_bot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ABOUT THE PROJECT

This project demonstrates how to fine-tune a Llama 2-7B model on a T4 GPU with limited VRAM (16GB) using the QLoRA technique. We leverage the ruslanmv/ai-medical-chatbot dataset, comprising 250k patient-doctor dialogues, to train a medical chatbot. By quantizing the model to 4-bit precision, we significantly reduce memory requirements, enabling efficient training on this constrained hardware.
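As a rough illustration of why 4-bit quantization matters on this hardware, the sketch below estimates the memory taken by the weights alone at two precisions. The figure of 7 billion parameters is a rounded assumption taken from the model name, and the estimate ignores activations, optimizer state, and quantization metadata.

```python
# Back-of-the-envelope estimate of weight memory at different precisions.
# ASSUMPTION: ~7e9 parameters (rounded from the "7B" in the model name).
NUM_PARAMS = 7e9


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions, the 16-bit weights alone would take up most of a 16 GB card, while the 4-bit weights leave headroom for activations and adapter training.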

Installing all the necessary Python packages.

In [18]:
!pip install -q -U transformers datasets accelerate peft trl bitsandbytes wandb


Retrieving Hugging Face API Token in Google Colab

This step authenticates the notebook with the Hugging Face Hub so that it can access Hub resources such as pre-trained models and datasets.

In [16]:
from google.colab import userdata
from huggingface_hub import login
# Defined in the secrets tab in Google Colab
hf_token = userdata.get('AM')


login(token = hf_token)



The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Import the Python packages needed to load the dataset, model, and tokenizer and to run fine-tuning.




In [19]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)
import os, torch, wandb
from datasets import load_dataset
from trl import SFTTrainer, setup_chat_format

Set the base model, dataset, and new model variables.

In [20]:
# Model
base_model = "NousResearch/Llama-2-7b-hf"
new_model = "llama-2-7b-chat-doctor"
dataset_name = "ruslanmv/ai-medical-chatbot"



In [21]:
#  Configuring Torch Tensor Type and Attention Implementation
torch_dtype = torch.float16
attn_implementation = "eager"

 QLoRA Configuration and Model Loading

This cell configures QLoRA (Quantized Low-Rank Adaptation) for memory-efficient fine-tuning and loads the base model with that quantization configuration.

In [22]:
# QLoRA config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch_dtype,
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation=attn_implementation
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
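To give a feel for what bnb_4bit_use_double_quant buys, here is a small arithmetic sketch of the per-parameter overhead of the quantization constants. The block sizes are assumptions taken from the figures reported in the QLoRA paper, not values read from this notebook or from the library at runtime.

```python
# Per-parameter storage overhead of the quantization constants.
# ASSUMPTIONS (figures from the QLoRA paper, not read from this notebook):
#   - weights are quantized in blocks of 64 values
#   - each block stores one 32-bit scaling constant
#   - double quantization stores those constants in 8 bits, plus one extra
#     32-bit constant per group of 256 blocks
BLOCK_SIZE = 64
NESTED_GROUP = 256

overhead_plain = 32 / BLOCK_SIZE
overhead_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * NESTED_GROUP)

print(f"overhead without double quantization: {overhead_plain:.3f} bits/param")
print(f"overhead with double quantization:    {overhead_double:.3f} bits/param")
```

The saving per parameter is small, but across billions of parameters it adds up to a few hundred megabytes.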

Load the tokenizer, then set up the model and tokenizer for conversational use. By default, setup_chat_format applies the ChatML template from OpenAI, which converts the input text into a chat-like format.

This prepares the model and tokenizer for a chat-based application, so that they can handle conversational inputs and generate responses in the same format.


In [23]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model)
model, tokenizer = setup_chat_format(model, tokenizer)
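The ChatML layout can be sketched in a few lines of plain Python. This is a hand-written approximation for illustration only: to_chatml is a hypothetical helper, the assistant message is placeholder text, and the real formatting in this notebook is done by tokenizer.apply_chat_template.

```python
# Hand-written approximation of the ChatML layout, for illustration only.
# The notebook itself relies on tokenizer.apply_chat_template instead.
def to_chatml(messages, add_generation_prompt=False):
    """Wrap each message in <|im_start|>role ... <|im_end|> markers."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


# Placeholder conversation (hypothetical content).
example = [
    {"role": "user", "content": "Hello doctor, I have back pain."},
    {"role": "assistant", "content": "Please describe the pain."},
]
print(to_chatml(example))
```

During training each row carries both turns, while at inference time only the user turn is supplied and the generation prompt opens the assistant turn.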

Adding the adapter to the layer

Fine-tuning the full model would take a long time, so we attach LoRA adapter layers with a small number of trainable parameters, which makes training faster and more memory-efficient.

In [24]:
# LoRA config
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)
model = get_peft_model(model, peft_config)
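To make "a few parameters" concrete, the sketch below counts the weights that LoRA would add for this configuration. The layer dimensions are assumptions about the Llama-2-7B architecture (they are not stated in this notebook), and the count covers only the adapter matrices.

```python
# Count the trainable parameters LoRA adds for r=16 on the seven target modules.
# ASSUMPTIONS (Llama-2-7B dimensions, not stated in this notebook):
# hidden size 4096, MLP intermediate size 11008, 32 decoder layers.
HIDDEN, INTERMEDIATE, LAYERS, RANK = 4096, 11008, 32, 16

# (in_features, out_features) of each targeted projection in one layer.
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # LoRA adds two matrices: A (rank x in_features) and B (out_features x rank).
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in target_shapes.values())
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")
```

In the notebook itself, calling model.print_trainable_parameters() on the PEFT-wrapped model reports the actual count, which is the number to trust if it differs from this estimate.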

 Data Preparation for Chat-Based Fine-Tuning

In [30]:
#Importing the dataset
dataset = load_dataset(dataset_name, split="all")
dataset = dataset.shuffle(seed=65).select(range(1000)) # Only use 1000 samples for quick demo

def format_chat_template(row):
    row_json = [{"role": "user", "content": row["Patient"]},
               {"role": "assistant", "content": row["Doctor"]}]
    row["text"] = tokenizer.apply_chat_template(row_json, tokenize=False)
    return row

dataset = dataset.map(
    format_chat_template,
    num_proc=4,
)

dataset['text'][3]

Map (num_proc=4):   0%|          | 0/1000 [00:00<?, ? examples/s]

'<|im_start|>user\nFell on sidewalk face first about 8 hrs ago. Swollen, cut lip bruised and cut knee, and hurt pride initially. Now have muscle and shoulder pain, stiff jaw(think this is from the really swollen lip),pain in wrist, and headache. I assume this is all normal but are there specific things I should look for or will I just be in pain for a while given the hard fall?<|im_end|>\n<|im_start|>assistant\nHello and welcome to HCM,The injuries caused on various body parts have to be managed.The cut and swollen lip has to be managed by sterile dressing.The body pains, pain on injured site and jaw pain should be managed by pain killer and muscle relaxant.I suggest you to consult your primary healthcare provider for clinical assessment.In case there is evidence of infection in any of the injured sites, a course of antibiotics may have to be started to control the infection.Thanks and take careDr Shailja P Wahal<|im_end|>\n'

 Split the dataset into a training and validation set.

In [31]:
dataset = dataset.train_test_split(test_size=0.1)

Training Arguments for Fine-Tuning

We fine-tune the model for one epoch and log the metrics with Weights & Biases.

In [32]:
training_arguments = TrainingArguments(
    output_dir=new_model,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    logging_steps=1,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    group_by_length=True,
    report_to="wandb"
)
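The step counts implied by these arguments can be worked out by hand. The sketch below assumes a single GPU and assumes that a fractional eval_steps is interpreted as a share of the total training steps, rounded up; both are assumptions rather than something this notebook states.

```python
import math

# Values taken from the notebook: 1000 samples with 10% held out for evaluation.
train_samples = 900
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
num_train_epochs = 1
eval_steps_fraction = 0.2

# ASSUMPTION: one GPU, so the effective batch is batch size x accumulation steps.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
total_steps = math.ceil(train_samples / effective_batch) * num_train_epochs

# ASSUMPTION: a fractional eval_steps is a share of the total steps, rounded up.
eval_every = math.ceil(total_steps * eval_steps_fraction)
eval_points = list(range(eval_every, total_steps + 1, eval_every))

print(f"total optimizer steps: {total_steps}")
print(f"evaluate every {eval_every} steps, at: {eval_points}")
```

Under these assumptions the evaluation points line up with the step numbers that appear in the training log.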



Set up a supervised fine-tuning (SFT) trainer and provide the training and evaluation datasets, LoRA configuration, training arguments, tokenizer, and model. We keep max_seq_length at 550 to avoid exceeding GPU memory during training.

In [33]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    peft_config=peft_config,
    max_seq_length=550,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)



Map:   0%|          | 0/900 [00:00<?, ? examples/s]

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [34]:
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33malaomuideenabiola[0m ([33malaomuideenabiola-ladoke-akintola-university-of-technology[0m). Use [1m`wandb login --relogin`[0m to force relogin


Step  Training Loss  Validation Loss
90    2.1865         2.176756
180   2.6135         2.146895
270   2.3878         2.129692
360   2.1257         2.114358
450   1.9814         2.106095




TrainOutput(global_step=450, training_loss=2.184313192367554, metrics={'train_runtime': 1079.6088, 'train_samples_per_second': 0.834, 'train_steps_per_second': 0.417, 'total_flos': 9340507313012736.0, 'train_loss': 2.184313192367554, 'epoch': 1.0})

Model evaluation

In [36]:
wandb.finish()
model.config.use_cache = True


Run history:
eval/loss,█▅▃▂▁
eval/runtime,▁▅▄▇█
eval/samples_per_second,█▄▅▂▁
eval/steps_per_second,█▄▅▂▁
train/epoch,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇██
train/global_step,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇█
train/grad_norm,▁▅▃▄▄▁▄▄█▃▃▂▃▂▂▄▃▄▄▄▂▄▃▄▃▄▅▃▃▃▃▆▅▃▃▄▂▂▂▄
train/learning_rate,▅███▇▇▇▇▇▇▆▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▂▂▂▂▁▁▁
train/loss,▇▄▅▅█▄▃▅▅▅▆▅▇▆▆▄▆▃▅▅▇▆▆▄▅▇▁▆▄▅▃▆▄▄▆▄▅▅▄▇

Run summary:
eval/loss,2.10609
eval/runtime,45.8208
eval/samples_per_second,2.182
eval/steps_per_second,2.182
total_flos,9340507313012736.0
train/epoch,1.0
train/global_step,450.0
train/grad_norm,1.13905
train/learning_rate,0.0
train/loss,1.9814
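One way to read the final validation loss is to convert it to perplexity. This is a minimal sketch that assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not confirmed by this notebook.

```python
import math

# Final validation loss reported in the run summary.
eval_loss = 2.10609

# ASSUMPTION: the loss is the mean per-token cross-entropy in nats,
# so perplexity is simply its exponential.
perplexity = math.exp(eval_loss)
print(f"validation perplexity: {perplexity:.2f}")
```

Lower is better, and the value is only comparable across runs that use the same tokenizer and evaluation split.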


To generate a response, we convert the messages into chat format, tokenize the prompt, pass the result to the model, and decode the generated tokens to display the text.




In [39]:
messages = [
    {
        "role": "user",
        "content": "Hello doctor, I have back pain. How do I get rid of it?"
    }
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors='pt', padding=True,
                   truncation=True).to("cuda")

outputs = model.generate(**inputs, max_length=150,
                         num_return_sequences=1)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(text.split("assistant")[1])


Hi. Welcome to iclinq.com. Back pain is a very common problem. It can be due to muscle spasm, muscle strain, disc prolapse, etc. So, first of all, you need to get evaluated. Then, you need to start with painkillers like diclofenac. Avoid lifting heavy weights. Do not strain yourself. Apply heat pad. Do physiotherapy. Hope I have answered your query. Let me know if I can assist you further. Regards, Dr. Indu Bhushan Babu, General & Family
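Splitting the decoded text on the word "assistant" works for this prompt, but it would cut in the wrong place if the question or the reply contained that word. A common alternative is to drop the prompt tokens before decoding. The sketch below shows the idea with plain lists standing in for token IDs; the IDs and the helper name new_tokens_only are hypothetical.

```python
# Stand-in token IDs so the sketch runs without a model or GPU.
# ASSUMPTION: generate() returns the prompt tokens followed by the new tokens,
# which is the usual behaviour for decoder-only models.
prompt_ids = [101, 7592, 3460, 102]           # hypothetical prompt tokens
output_ids = prompt_ids + [2023, 2003, 1037]  # hypothetical prompt + reply


def new_tokens_only(output_ids, prompt_length):
    """Drop the echoed prompt and keep only the generated tokens."""
    return output_ids[prompt_length:]


reply_ids = new_tokens_only(output_ids, len(prompt_ids))
print(reply_ids)
```

With the real tensors from the cell above, the equivalent slice would be outputs[0][inputs["input_ids"].shape[1]:], passed to tokenizer.decode with skip_special_tokens=True.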


Saving the model file

In [40]:
# Save trained model
trainer.model.save_pretrained(new_model)


