<a href="https://colab.research.google.com/github/Aashi779/Efficient-multitasking-with-SLMs/blob/main/Task1_ReasoningProblemSolving.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install Packages


In [None]:
%%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers==0.0.27" trl peft accelerate bitsandbytes

In [None]:
!pip install triton



In [None]:
!pip uninstall -y xformers
!rm -rf /usr/local/lib/python3.10/dist-packages/xformers

Found existing installation: xformers 0.0.27
Uninstalling xformers-0.0.27:
  Successfully uninstalled xformers-0.0.27


In [None]:
!pip install xformers==0.0.27

Collecting xformers==0.0.27
  Using cached xformers-0.0.27-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.0 kB)
Collecting torch==2.3.1 (from xformers==0.0.27)
  Downloading torch-2.3.1-cp310-cp310-manylinux1_x86_64.whl.metadata (26 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.3.1->xformers==0.0.27)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch==2.3.1->xformers==0.0.27)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch==2.3.1->xformers==0.0.27)
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch==2.3.1->xformers==0.0.27)
  Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch==2.3.1-

In [None]:
!python -m xformers.info

xFormers 0.0.27
memory_efficient_attention.ckF:                    unavailable
memory_efficient_attention.ckB:                    unavailable
memory_efficient_attention.ck_decoderF:            unavailable
memory_efficient_attention.ck_splitKF:             unavailable
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.decoderF:               available
memory_efficient_attention.flshattF@v2.5.7:        available
memory_efficient_attention.flshattB@v2.5.7:        available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.triton_splitKF:         unavailable
indexing.scaled_index_addF:                        unavailable
indexing.scaled_index_addB:                        unavailable
indexing.index_select:                             unavailable
sequence_parallel_fused.write_values:              av

In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None
load_in_4bit = True


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "microsoft/Phi-3.5-mini-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.7: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.27. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.26G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/140 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.37k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/293 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers, TRL and unsloth via:
`pip install --upgrade --no-cache-dir unsloth git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/trl.git`


# Load Model

With the 4-bit base model loaded above, we now add LoRA adapters so that only a small share of the parameters (typically 1 to 10%) needs to be updated.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.10.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
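
As a rough check on the adapter size, the LoRA parameter count can be worked out by hand. The sketch below assumes Phi-3.5-mini's published dimensions (hidden size 3072, MLP intermediate size 8192) and that each targeted projection is a separate linear layer, as the patch message above suggests. Those dimensions do not appear in this notebook, so treat them as assumptions.

```python
# LoRA adds two matrices per targeted layer: A (r x in) and B (out x r),
# so each layer contributes r * (in + out) trainable parameters.
R = 16                # LoRA rank used in get_peft_model above
HIDDEN = 3072         # assumed hidden size of Phi-3.5-mini
INTERMEDIATE = 8192   # assumed MLP intermediate size
NUM_LAYERS = 32       # matches "patched 32 layers" in the output above

# (in_features, out_features) per target module; shapes are assumptions
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r=R):
    """Trainable parameters added by one LoRA adapter of rank r."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o) for i, o in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 29,884,416
```

Under these assumptions the result agrees with the 29,884,416 trainable parameters that the trainer reports in the training log of this notebook. Against a base model of roughly 3.8 billion parameters (an approximate figure, not measured here), that is under 1% of the weights.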


<a name="Data"></a>
# Data Prep
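We now use the `Phi-3` format for conversation-style finetunes. The training data is a local CSV file (`output_with_reasoning_v2.csv`) with `input`, `answers`, and `reasoning` columns, which we convert to ShareGPT-style turns (`from` / `value`), the same layout used by datasets such as [Open Assistant conversations](https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style). Phi-3 renders multi-turn conversations like below: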

```
<|user|>
Hi!<|end|>
<|assistant|>
Hello! How are you?<|end|>
<|user|>
I'm doing great! And you?<|end|>

```
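
To make the format above concrete, here is a minimal, hand-written sketch that renders ShareGPT-style turns into that layout. The function `render_phi3` is hypothetical and only approximates the template. In the notebook the real rendering is done by `tokenizer.apply_chat_template`, which may differ in details such as special tokens or whitespace.

```python
def render_phi3(messages):
    """Render ShareGPT-style turns into Phi-3 style text (illustrative only)."""
    role_tags = {"human": "<|user|>", "gpt": "<|assistant|>"}
    parts = []
    for turn in messages:
        tag = role_tags[turn["from"]]
        # Each turn is: role tag, newline, content, end marker, newline.
        parts.append(f"{tag}\n{turn['value']}<|end|>\n")
    return "".join(parts)


example = [
    {"from": "human", "value": "Hi!"},
    {"from": "gpt", "value": "Hello! How are you?"},
    {"from": "human", "value": "I'm doing great! And you?"},
]
rendered = render_phi3(example)
print(rendered)
```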

In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template="phi-3",
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)

def formatting_prompts_func(examples):
    inputs = examples["input"]
    answers = examples["answers"]
    reasonings = examples["reasoning"]

    convos = []
    for i in range(len(inputs)):
        convo = [
            {"from": "human", "value": inputs[i]},
            {"from": "gpt", "value": answers[i]},
        ]
        if reasonings[i]:
            convo.append({"from": "gpt", "value": reasonings[i]})

        convos.append(convo)

    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]

    return {"text": texts}


from datasets import load_dataset
dataset = load_dataset("csv", data_files="/content/output_with_reasoning_v2.csv", split="train")

dataset = dataset.map(formatting_prompts_func, batched=True)

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/10924 [00:00<?, ? examples/s]

In [None]:
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset

tokenizer = get_chat_template(
    tokenizer,
    chat_template="phi-3",
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)

def formatting_prompts_func(examples):
    inputs = examples["input"]
    reasonings = examples["reasoning"]

    convos = []
    for i in range(len(inputs)):
        convo = [
            {"from": "human", "value": inputs[i]},
        ]
        if reasonings[i]:
            convo.append({"from": "gpt", "value": reasonings[i]})

        convos.append(convo)
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]

    return {"conversations": convos,"text" : texts}

dataset = load_dataset("csv", data_files="/content/output_with_reasoning_v2.csv", split="train")

dataset = dataset.map(formatting_prompts_func, batched=True)



Map:   0%|          | 0/10924 [00:00<?, ? examples/s]

In [None]:
dataset

Dataset({
    features: ['answers', 'input', 'reasoning', 'conversations', 'text'],
    num_rows: 10924
})
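
With `batched=True`, the mapping function receives a dict of equal-length lists rather than one row at a time. The sketch below mimics that in plain Python with two made-up rows (hypothetical data, not taken from the CSV) and shows one consequence of the `if reasonings[i]:` check: a row with no reasoning yields a conversation with no assistant turn. The helper names `build_conversations` and `has_assistant_turn` are illustrative, not part of any library.

```python
def build_conversations(batch):
    """Mimic the batched mapping: `batch` is a dict of equal-length lists."""
    convos = []
    for question, reasoning in zip(batch["input"], batch["reasoning"]):
        convo = [{"from": "human", "value": question}]
        if reasoning:  # empty or missing reasoning adds no assistant turn
            convo.append({"from": "gpt", "value": reasoning})
        convos.append(convo)
    return convos


def has_assistant_turn(convo):
    """True if the conversation contains at least one assistant reply."""
    return any(turn["from"] == "gpt" for turn in convo)


# Hypothetical two-row batch; the second row has no reasoning.
batch = {
    "input": ["Which person had the virus?", "How can a person save money?"],
    "reasoning": ["The selected answer 'James' is chosen because ...", None],
}
convos = build_conversations(batch)
trainable = [c for c in convos if has_assistant_turn(c)]
print(len(convos), len(trainable))  # 2 1
```

If such rows exist in the real data, filtering them out before training may be worth considering, since a user-only conversation gives the model no target response to learn from.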

In [None]:
dataset = dataset.shuffle(seed=42).select(range(3000))

In [None]:
dataset[5]['conversations']

[{'from': 'human',
  'value': 'HIV , or human immunodeficiency virus, causes AIDS. AIDS stands for "acquired immune deficiency syndrome." It is a condition that causes death and does not have a known cure. AIDS usually develops 10 to 15 years after a person is first infected with HIV. The development of AIDS can be delayed with proper medicines. The delay can be well over 20 years with the right medicines. Today, individuals who acquire HIV after 50 years of age can expect to reach an average human life span. The police arrested two homeless people. James was 35 and had been HIV positive for 20 years, and Bill, who was 54 and was not HIV positive. They booked them and let them go. Which person had the immunodeficiency virus?'},
 {'from': 'gpt',
  'value': "The selected answer 'James' is chosen as it aligns with the information provided in the input. This choice logically fits the scenario where HIV , or human immunodeficiency virus, causes AIDS..."}]

In [None]:
print(dataset[5]["text"])

<|user|>
HIV , or human immunodeficiency virus, causes AIDS. AIDS stands for "acquired immune deficiency syndrome." It is a condition that causes death and does not have a known cure. AIDS usually develops 10 to 15 years after a person is first infected with HIV. The development of AIDS can be delayed with proper medicines. The delay can be well over 20 years with the right medicines. Today, individuals who acquire HIV after 50 years of age can expect to reach an average human life span. The police arrested two homeless people. James was 35 and had been HIV positive for 20 years, and Bill, who was 54 and was not HIV positive. They booked them and let them go. Which person had the immunodeficiency virus?<|end|>
<|assistant|>
The selected answer 'James' is chosen as it aligns with the information provided in the input. This choice logically fits the scenario where HIV , or human immunodeficiency virus, causes AIDS...<|end|>



In [None]:
unsloth_template = \
    "{{ bos_token }}"\
    "{{ 'You are a helpful assistant to the user\n' }}"\
    "{% for message in messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ '>>> User: ' + message['content'] + '\n' }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ '>>> Assistant: ' + message['content'] + eos_token + '\n' }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}"\
        "{{ '>>> Assistant: ' }}"\
    "{% endif %}"
unsloth_eos_token = "eos_token"


if False:
    tokenizer = get_chat_template(
        tokenizer,
        chat_template = (unsloth_template, unsloth_eos_token,),
        mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
        map_eos_token = True,
    )

<a name="Train"></a>
# Train the model


In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,

    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/3000 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
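
The training arguments above imply some simple arithmetic that is worth spelling out. The sketch below assumes a single GPU and assumes that the `linear` scheduler follows the usual shape of linear warmup followed by linear decay to zero. `linear_lr` is a hypothetical helper, not a library function.

```python
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
MAX_STEPS = 60
WARMUP_STEPS = 5
PEAK_LR = 2e-4
NUM_EXAMPLES = 3000   # size of the subsampled dataset

# One optimizer step consumes batch_size * accumulation_steps examples
# (single GPU assumed).
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
examples_seen = effective_batch * MAX_STEPS
fraction_of_epoch = examples_seen / NUM_EXAMPLES


def linear_lr(step):
    """Linear warmup to PEAK_LR, then linear decay to zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0.0, (MAX_STEPS - step) / (MAX_STEPS - WARMUP_STEPS))


print(effective_batch, examples_seen, round(fraction_of_epoch, 2))
print(linear_lr(0), linear_lr(WARMUP_STEPS), linear_lr(MAX_STEPS))
```

So 60 steps at an effective batch size of 8 cover 480 of the 3,000 examples, about 16% of one pass over the data. Raising `max_steps` or setting `num_train_epochs` instead would be the way to train on more of it.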


In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
4.727 GB of memory reserved.


In [None]:
!pip install wandb
import wandb
wandb.init(project="my-sft-project", name="my-training-run")



wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
trainer_stats = trainer.train()

**** Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers, TRL and Unsloth!
`pip install --upgrade --no-cache-dir unsloth git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/trl.git`


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 3,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 29,884,416


Step,Training Loss
1,2.3992
2,2.0228
3,2.2772
4,2.241
5,2.3065
6,2.2763
7,2.1382
8,2.0767
9,2.1108
10,1.8801


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

310.3227 seconds used for training.
5.17 minutes used for training.
Peak reserved memory = 2.971 GB.
Peak reserved memory for training = 0.004 GB.
Peak reserved memory % of max memory = 20.145 %.
Peak reserved memory for training % of max memory = 0.027 %.
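
The percentages above follow directly from the reserved-memory figures. The sketch below reproduces them with a hypothetical helper, `memory_report`. The starting value of 2.967 GB is inferred from the printed numbers (2.971 GB peak minus 0.004 GB for training) and is an assumption. It differs from the 4.727 GB printed by the earlier memory cell, which suggests the two outputs may come from different sessions.

```python
def memory_report(peak_gb, start_gb, total_gb):
    """Recompute the memory statistics printed by the cell above."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }


# start_gb is inferred, not printed anywhere in the notebook output.
report = memory_report(peak_gb=2.971, start_gb=2.967, total_gb=14.748)
print(report)
```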


<a name="Inference"></a>
# Inference
Let's run the model! Edit the `value` of the user message in `messages` to try your own prompt.



In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3",
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"},
)

FastLanguageModel.for_inference(model)

messages = [
    {"from": "human", "value": "A person wants to start saving money so that they can afford a nice vacation at the end of the year.One can make more phone calls, quit eating lunch out or buy more with monopoly. How can a person save money?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<|user|> A person wants to start saving money so that they can afford a nice vacation at the end of the year.One can make more phone calls, quit eating lunch out or buy more with monopoly. How can a person save money?<|end|><|assistant|> The most effective way for a person to save money for a vacation would be to quit eating lunch out. Eating out is generally more expensive than preparing meals at home. By cooking at home, the person can significantly reduce their daily expenses. This saved money can then be allocated towards their']
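
The decoded output above repeats the prompt and stops mid-sentence because generation is capped at `max_new_tokens = 64`. To keep only the newly generated text, the prompt tokens can be sliced off before decoding. The sketch below shows the idea on plain Python lists with made-up token ids. `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(output_ids, prompt_length):
    """Drop the first `prompt_length` token ids from each generated sequence."""
    return [sequence[prompt_length:] for sequence in output_ids]


# Hypothetical token ids: a 3-token prompt followed by 2 generated tokens.
prompt = [101, 102, 103]
generated = [prompt + [7, 8]]
new_tokens = strip_prompt(generated, len(prompt))
print(new_tokens)  # [[7, 8]]

# With the tensors from the cell above, the equivalent would be roughly:
#   new_ids = outputs[:, inputs.shape[1]:]
#   tokenizer.batch_decode(new_ids, skip_special_tokens=True)
```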

You can also use a `TextStreamer` to stream the output, so you can see the generation token by token instead of waiting for the whole response.

In [None]:
FastLanguageModel.for_inference(model)


messages = [
    {"from": "human", "value": "A person wants to start saving money so that they can afford a nice vacation at the end of the year.One can make more phone calls, quit eating lunch out or buy more with monopoly. How can a person save money?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 64, use_cache = True)

The most effective way for a person to save money for a vacation would be to quit eating lunch out. Eating out is generally more expensive than preparing meals at home. By cooking at home, the person can significantly reduce their daily expenses. This saved money can then be allocated towards their


### User Input

In [None]:
user_input=input("You? ")
messages= [
    {"from": "human", "value": user_input},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True,
    return_tensors = "pt",
).to("cuda")

print("SLM:")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 64, use_cache = True)

You? A person wants to start saving money so that they can afford a nice vacation at the end of the year.One can make more phone calls, quit eating lunch out or buy more with monopoly. How can a person save money?
SLM:
The most effective way for a person to save money for a vacation would be to quit eating lunch out. Eating out is generally more expensive than preparing meals at home. By cooking at home, the person can significantly reduce their daily expenses. This saved money can then be allocated towards their


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.model',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')
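
To load the saved adapters back for inference, the same `FastLanguageModel.from_pretrained` call used at the top of this notebook can be pointed at the local folder. The sketch below wraps that in a hypothetical helper, `load_saved_adapters`. It assumes a GPU runtime with Unsloth installed and that `"lora_model"` is the directory written above.

```python
def load_saved_adapters(path="lora_model", max_seq_length=2048, load_in_4bit=True):
    """Reload the adapters saved above; requires a GPU runtime with Unsloth."""
    # Imported lazily because importing unsloth needs a CUDA environment.
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=path,              # local folder with the LoRA adapters
        max_seq_length=max_seq_length,
        dtype=None,                   # let Unsloth pick the dtype, as above
        load_in_4bit=load_in_4bit,
    )
    FastLanguageModel.for_inference(model)  # switch to inference mode
    return model, tokenizer


# Usage (on a GPU runtime):
#   model, tokenizer = load_saved_adapters("lora_model")
```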

### Saving to float16 for vLLM

Merged weights can also be saved directly. Select `merged_16bit` for float16 or `merged_4bit` for int4; `lora` saves only the adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account. Personal access tokens are available at https://huggingface.co/settings/tokens.

In [None]:
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is supported natively. Unsloth clones `llama.cpp` and saves to `q8_0` by default; other quantization methods such as `q4_k_m` are also available. Use `save_pretrained_gguf` to save locally and `push_to_hub_gguf` to upload to the Hugging Face Hub.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)

if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options
if False:
    model.push_to_hub_gguf(
        "hf/model",
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

In [None]:
!zip -r /content/lora_model.zip /content/lora_model

  adding: content/lora_model/ (stored 0%)
  adding: content/lora_model/adapter_config.json (deflated 54%)
  adding: content/lora_model/special_tokens_map.json (deflated 76%)
  adding: content/lora_model/tokenizer.json (deflated 85%)
  adding: content/lora_model/adapter_model.safetensors (deflated 8%)
  adding: content/lora_model/README.md (deflated 66%)
  adding: content/lora_model/tokenizer.model (deflated 55%)
  adding: content/lora_model/added_tokens.json (deflated 62%)
  adding: content/lora_model/tokenizer_config.json (deflated 84%)


In [None]:
!zip -r /content/outputs.zip /content/outputs

  adding: content/outputs/ (stored 0%)
  adding: content/outputs/runs/ (stored 0%)
  adding: content/outputs/runs/Sep26_07-40-11_389fc0e0af20/ (stored 0%)
  adding: content/outputs/runs/Sep26_07-40-11_389fc0e0af20/events.out.tfevents.1727336422.389fc0e0af20.4245.0 (deflated 66%)
  adding: content/outputs/runs/Sep26_07-40-11_389fc0e0af20/events.out.tfevents.1727336888.389fc0e0af20.4245.1 (deflated 66%)
  adding: content/outputs/checkpoint-60/ (stored 0%)
  adding: content/outputs/checkpoint-60/adapter_config.json (deflated 54%)
  adding: content/outputs/checkpoint-60/special_tokens_map.json (deflated 76%)
  adding: content/outputs/checkpoint-60/tokenizer.json (deflated 85%)
  adding: content/outputs/checkpoint-60/optimizer.pt (deflated 10%)
  adding: content/outputs/checkpoint-60/adapter_model.safetensors (deflated 8%)
  adding: content/outputs/checkpoint-60/scheduler.pt (deflated 56%)
  adding: content/outputs/checkpoint-60/README.md (deflated 66%)
  adding: content/outputs/checkpoint-

In [None]:
!zip -r /content/huggingface_tokenizers_cache.zip /content/huggingface_tokenizers_cache

  adding: content/huggingface_tokenizers_cache/ (stored 0%)
  adding: content/huggingface_tokenizers_cache/.locks/ (stored 0%)
  adding: content/huggingface_tokenizers_cache/.locks/models--unsloth--phi-3.5-mini-instruct-bnb-4bit/ (stored 0%)
  adding: content/huggingface_tokenizers_cache/.locks/models--unsloth--phi-3.5-mini-instruct-bnb-4bit/72dafda7008a52e087bec2c5f534eda3cfd33b27.lock (stored 0%)
  adding: content/huggingface_tokenizers_cache/.locks/models--unsloth--phi-3.5-mini-instruct-bnb-4bit/efb10d031a5e4c01c1c882d65a13073848d1d2df.lock (stored 0%)
  adding: content/huggingface_tokenizers_cache/.locks/models--unsloth--phi-3.5-mini-instruct-bnb-4bit/2f4cf1e18cb543d31aedc307a6b5968a201569bc.lock (stored 0%)
  adding: content/huggingface_tokenizers_cache/.locks/models--unsloth--phi-3.5-mini-instruct-bnb-4bit/c9d3d3a1b74d87e381e471f7b33784015d2dc0ea.lock (stored 0%)
  adding: content/huggingface_tokenizers_cache/.locks/models--unsloth--phi-3.5-mini-instruct-bnb-4bit/9e556afd44213b6b