In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install unsloth
# Get latest Unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit", # New Google 6 trillion tokens model 2.5x faster!
    "unsloth/gemma-2b-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-7b-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.0: Fast Gemma patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.57G [00:00<?, ?B/s]

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


generation_config.json:   0%|          | 0.00/154 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/40.0k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.10.0 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
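As a sanity check on how small the adapter is, the sketch below counts the LoRA parameters added by the configuration above. Each adapted linear layer gets two low-rank matrices, so it contributes `r * (in_features + out_features)` trainable weights. The layer dimensions are assumptions taken from the published Gemma 7B configuration (hidden size 3072, 16 heads of dimension 256, MLP width 24576, 28 layers); they are not read from the loaded model.

```python
# Assumed Gemma 7B dimensions (from its published config, not read from the model).
HIDDEN = 3072
ATTN = 16 * 256   # num_attention_heads * head_dim
MLP = 24576
LAYERS = 28
RANK = 16         # r passed to get_peft_model

# (in_features, out_features) of every LoRA target module in one decoder layer.
TARGETS = {
    "q_proj": (HIDDEN, ATTN),
    "k_proj": (HIDDEN, ATTN),
    "v_proj": (HIDDEN, ATTN),
    "o_proj": (ATTN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Weights in the A (rank x in) and B (out x rank) matrices of one adapter."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * LAYERS
print(f"{total:,}")  # 50,003,968
```

Under these assumptions the total matches the "Number of trainable parameters" that the training banner reports later in the notebook.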


<a name="Data"></a>
### Data Prep
We now use the [MeDAL dataset](https://hf.co/datasets/McGill-NLP/medal), a medical abbreviation disambiguation dataset, and format it with an Alpaca-style prompt. The train split contains 3,000,000 examples. You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).
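To make the EOS note concrete, here is a minimal sketch of a batched formatting function together with a check that every formatted row really ends with the EOS token. The `"<eos>"` string, the shortened template, and the sample row are hypothetical stand-ins; the notebook itself reads the token from `tokenizer.eos_token`.

```python
EOS_TOKEN = "<eos>"  # hypothetical stand-in for tokenizer.eos_token

# Shortened, hypothetical template used only for this illustration.
TEMPLATE = """### Text:
{text}

### Label:
{label}"""

def format_batch(examples: dict, eos_token: str = EOS_TOKEN) -> dict:
    """Format a batch (dict of column lists) and append EOS to every row."""
    rows = [
        TEMPLATE.format(text=text, label=label) + eos_token
        for text, label in zip(examples["text"], examples["label"])
    ]
    return {"text": rows}

def all_rows_end_with_eos(formatted: dict, eos_token: str = EOS_TOKEN) -> bool:
    """Return True only if no row is missing its trailing EOS token."""
    return all(row.endswith(eos_token) for row in formatted["text"])

# Hypothetical sample row, not taken from the dataset.
batch = {"text": ["The patient was given ASA daily."], "label": ["aspirin"]}
formatted = format_batch(batch)
print(all_rows_end_with_eos(formatted))
```

Running a check like this on a small sample before training is a cheap way to catch a missing EOS token.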

In [None]:
prompt_template = """
You are a multilingual Medical Assistant designed to help users by providing accurate, up-to-date medical information from the MeDAL dataset. Your role is to assist with medical queries, provide clarification on medical abbreviations, and offer relevant guidance.

You are equipped to:
- Answer questions about medical terminology, symptoms, diagnoses, treatments, and general health-related inquiries.
- Disambiguate medical abbreviations from the MeDAL dataset based on user input.
- Provide health-related recommendations if relevant to the conversation, but clarify that users should consult a healthcare professional for serious conditions.

Instructions:

You have three types of Clients you should respond to based on their behavior:
1. **Medical Professionals**: Respond formally with detailed medical terminology, and provide explanations when requested. Ensure accuracy and cite the MeDAL dataset as a source for medical disambiguation.
   - Example response: "The abbreviation 'COPD' stands for 'Chronic Obstructive Pulmonary Disease.'"

2. **Patients or General Public**: Respond in a clear and friendly manner. Simplify medical jargon and offer advice in layman’s terms where appropriate. Recommend that they consult a doctor if their question relates to a serious medical condition.
   - Example response: "COPD is a lung condition that makes it hard to breathe. It's best to see a doctor for more information."

3. **Researchers or Data Analysts**: Respond formally and provide access to detailed medical information. Offer references to research studies or detailed explanations based on the query.
   - Example response: "You can find more detailed information on COPD in recent clinical studies, which explore both the etiology and therapeutic approaches."

Additional Rules:
- Respond in the same language as the user's query, and be capable of understanding and answering in multiple languages, including but not limited to English, Spanish, French, German, Arabic, and Chinese.
- Never suggest or diagnose any medical treatment directly. Always remind users that the chatbot is an assistant, not a doctor, and recommend seeing a healthcare provider.
- Disambiguate medical abbreviations using the MeDAL dataset and return the full form with an appropriate explanation.
- If users ask questions that are beyond your scope, politely direct them to other healthcare resources or suggest that they seek professional advice.
- You don’t have access to images or multimedia, only text-based medical data.
- Be polite, clear, and ensure the information provided is always relevant and up-to-date.

### Example responses:

User Query in English:
"What's the full form of COPD?"
"COPD stands for Chronic Obstructive Pulmonary Disease."

User Query in Spanish:
"¿Qué significa TDAH?"
"TDAH significa Trastorno por Déficit de Atención e Hiperactividad."

User Query in Arabic:
"ما هو مرض السكري؟"
"السكري هو مرض يؤثر على كيفية استخدام الجسم للسكر الموجود في الدم."

### Context:
{context}

### Chat History:
{chat_history}

### Input:
{question}
"""

alpaca_prompt = """Below is a medical abstract text. The task is to disambiguate the abbreviation at the given location.

### Text:
{text}

### Location:
{location}

### Label:
{label}"""

# Ensure EOS_TOKEN is defined
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    # Retrieve the text, location, and label from the dataset examples
    texts = examples.get("text", [])
    locations = examples.get("location", [])
    labels = examples.get("label", [])

    # Initialize the formatted texts list
    formatted_texts = []

    # Process each text, location, and label
    for text, location, label in zip(texts, locations, labels):
        # Format the text using the alpaca_prompt template
        formatted_text = alpaca_prompt.format(text=text, location=location, label=label) + EOS_TOKEN
        formatted_texts.append(formatted_text)

    return {"text": formatted_texts}

from datasets import load_dataset

# Load the MeDAL dataset
dataset = load_dataset("McGill-NLP/medal", split="train")

# Map the formatting function to the dataset in batched mode
dataset = dataset.map(formatting_prompts_func, batched=True)


medal.py:   0%|          | 0.00/5.93k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.9k [00:00<?, ?B/s]

The repository for McGill-NLP/medal contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/McGill-NLP/medal.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


pretrain_subset.zip:   0%|          | 0.00/2.07G [00:00<?, ?B/s]

full_data.csv.zip:   0%|          | 0.00/5.23G [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3000000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1000000 [00:00<?, ? examples/s]

Generating full split:   0%|          | 0/14393619 [00:00<?, ? examples/s]

Map:   0%|          | 0/3000000 [00:00<?, ? examples/s]

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can set `num_train_epochs=1` and drop the `max_steps` argument. We also support TRL's `DPOTrainer`!
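To see how far 60 steps is from a full epoch, the sketch below works through the arithmetic. It assumes the batch settings used in the training cell (2 examples per device, 4 gradient accumulation steps, one GPU) and the 3,000,000-example train split reported when the dataset was generated.

```python
import math

NUM_EXAMPLES = 3_000_000   # size of the train split
PER_DEVICE_BATCH = 2       # per_device_train_batch_size
GRAD_ACCUM = 4             # gradient_accumulation_steps
NUM_GPUS = 1               # assumption: single-GPU Colab runtime
MAX_STEPS = 60

# Examples consumed by one optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
# Examples seen in the short 60-step run.
examples_seen = effective_batch * MAX_STEPS
# Optimizer steps needed to cover the whole split once.
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)

print(effective_batch)   # 8
print(examples_seen)     # 480
print(steps_per_epoch)   # 375000
```

So the 60-step run touches 480 of the 3,000,000 examples, about 0.016% of one epoch.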

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/3000000 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
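The combination of `warmup_steps = 5`, `lr_scheduler_type = "linear"` and `max_steps = 60` means the learning rate ramps up linearly to `2e-4` over the first 5 steps and then decays linearly to zero at step 60. The function below is a standalone sketch of that shape, not the trainer's own code; `linear_schedule_lr` is a hypothetical helper name.

```python
def linear_schedule_lr(step: int,
                       base_lr: float = 2e-4,
                       warmup_steps: int = 5,
                       total_steps: int = 60) -> float:
    """Learning rate after `step` completed optimizer steps.

    Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 1, 5, 30, 60):
    print(step, f"{linear_schedule_lr(step):.2e}")
```

With only 60 steps in total, the model spends almost the entire run on the decaying part of the schedule.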


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.83 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 3,000,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 50,003,968
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


Step,Training Loss
1,3.4526
2,3.3352
3,3.2117
4,3.1063
5,2.6592
6,3.0383
7,2.5064
8,2.7493
9,2.6388
10,2.5305




In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

613.028 seconds used for training.
10.22 minutes used for training.
Peak reserved memory = 8.221 GB.
Peak reserved memory for training = 2.391 GB.
Peak reserved memory % of max memory = 55.743 %.
Peak reserved memory for training % of max memory = 16.212 %.
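The runtime above also gives a rough feel for what a full epoch would cost on the same hardware. The sketch below extrapolates linearly from the 60-step run; it assumes the per-step time stays constant and that an epoch takes 375,000 steps (3,000,000 examples at an effective batch size of 8).

```python
TRAIN_RUNTIME_S = 613.028   # train_runtime reported for the 60-step run
STEPS_RUN = 60
STEPS_PER_EPOCH = 375_000   # assumption: 3,000,000 examples / batch of 8

seconds_per_step = TRAIN_RUNTIME_S / STEPS_RUN
epoch_seconds = seconds_per_step * STEPS_PER_EPOCH
epoch_days = epoch_seconds / 86_400  # seconds in a day

print(f"{seconds_per_step:.2f} s/step")
print(f"{epoch_days:.1f} days for one epoch")
```

At roughly 10 seconds per step, one pass over the full split would take on the order of 44 days on a single T4, which is why the notebook trains on a short run (or would need a subsample) rather than a full epoch.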


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [None]:
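# Redefine alpaca_prompt as a generic instruction template
# (note: this differs from the Text/Location/Label template used for training)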
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Ensure the model is in eval mode for inference
model.eval()

# Tokenize the input with formatted instruction and input, leaving output blank
inputs = tokenizer(
    [
        alpaca_prompt.format(
            instruction="Can you explain what causes type 2 diabetes and how it can be managed?",  # instruction
            input="",  # input - empty in this case
            output="",  # output - leave blank for generation
        )
    ], return_tensors="pt", padding=True, truncation=True
).to("cuda")  # Ensure tensors are on the correct device

# Disable gradient calculation for inference
with torch.no_grad():
    # Generate output
    outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)

# Decode the output to readable text
decoded_outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Print the generated response
print(decoded_outputs)


["Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat are the recommended dietary adjustments for managing high cholesterol?\n\n### Input:\n\n\n### Response:\nthe recommended dietary adjustments for managing high cholesterol include eating a diet low in saturated fat and cholesterol and high in fiber and omega fatty acids these dietary changes can help lower cholesterol levels and reduce the risk of heart disease\n\n### Hint:\n[('cholesterol', 'cholesterol'), ('high cholesterol', 'high cholesterol'), ('"]
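The model was finetuned on the Text / Location / Label template, while the cell above prompts it with a generic instruction template. A prompt that matches the training format can be built by reusing the training template and leaving the label empty, as in the sketch below. The sample text and location are hypothetical, `build_inference_prompt` is an illustrative helper, and treating the location as a word index is an assumption.

```python
# Same wording as the training template defined in the data prep cell.
TRAIN_TEMPLATE = """Below is a medical abstract text. The task is to disambiguate the abbreviation at the given location.

### Text:
{text}

### Location:
{location}

### Label:
{label}"""

def build_inference_prompt(text: str, location: int) -> str:
    """Fill in text and location, leaving the label blank for generation."""
    return TRAIN_TEMPLATE.format(text=text, location=location, label="")

# Hypothetical example; location is assumed to be the word index of "COPD".
prompt = build_inference_prompt(
    "the patient has a long history of COPD and presented with dyspnea", 7
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the generic prompt, so the model sees the same layout at inference time that it saw during training.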


You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
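import os
from huggingface_hub import login

# Login to Hugging Face. Read the token from the HF_TOKEN environment variable
# instead of hardcoding it, so the secret never ends up in the notebook.
login(token=os.environ["HF_TOKEN"])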

# Push the model to the Hugging Face Hub
model.push_to_hub("mariam-essam/MediModelv2")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


README.md:   0%|          | 0.00/575 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/200M [00:00<?, ?B/s]

Saved model to https://huggingface.co/mariam-essam/MediModelv2


### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is supported natively. Unsloth clones `llama.cpp` and saves to `q8_0` by default; other methods such as `q4_k_m` are also available. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# Check if the necessary quantization files exist before attempting to save
import os

quantize_path = "llama.cpp/llama-quantize"
if not os.path.exists(quantize_path):
    print(f"Warning: Required file '{quantize_path}' does not exist. Please ensure llama.cpp is correctly set up.")

# Save to 8bit Q8_0
# model.save_pretrained_gguf("SakayRahma/loraModelv2", tokenizer=tokenizer)
# model.push_to_hub_gguf("SakayRahma/ggufModelv2", tokenizer=tokenizer, token="hf_...")

# Save the LoRA adapters locally, then push them to the Hub.
# The token is read from the HF_TOKEN environment variable rather than hardcoded.
model.save_pretrained("mariam-essam/MediModelv2", tokenizer=tokenizer)
model.push_to_hub("mariam-essam/PytorchModelv2", tokenizer=tokenizer, token=os.environ["HF_TOKEN"])

# Save to q4_k_m GGUF
# Uncomment if you want to save with q4_k_m quantization
# model.save_pretrained_gguf("model", tokenizer=tokenizer, quantization_method="q4_k_m")
# model.push_to_hub_gguf("hf/model", tokenizer=tokenizer, quantization_method="q4_k_m", token="")




README.md:   0%|          | 0.00/575 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/200M [00:00<?, ?B/s]

Saved model to https://huggingface.co/mariam-essam/PytorchModelv2
