<a href="https://colab.research.google.com/github/Jayesh-2003/Academic-Bot---PEFT/blob/main/TuningPretrainedModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

import torch
major_version, minor_version = torch.cuda.get_device_capability()
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
if major_version >= 8:
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    !pip install --no-deps xformers trl peft accelerate bitsandbytes

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-809jqyqu/unsloth_d936d26a1b1a443299d819a1f33c164f
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-809jqyqu/unsloth_d936d26a1b1a443299d819a1f33c164f
  Resolved https://github.com/unslothai/unsloth.git to commit 1f52468fa31bf0b641ec96217ef0f5916a07fce5
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting unsloth-zoo (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2024.10.4-py3-none-any.whl.metadata (48 kB)

In [None]:
!pip install triton




Next, we load a 4-bit quantized language model to keep memory usage low. Unsloth publishes several pre-quantized checkpoints, including Llama 3, which was trained on 15 trillion tokens.


In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! Llama 3 is up to 8k
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit",
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit"
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # Llama-3 70b also works (just change the model name)
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = "hf_...", # placeholder: supply your own token for gated models like meta-llama/Llama-2-7b-hf, and never commit a real one
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.3: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/198 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

Unsloth: We fixed a gradient accumulation bug, but it seems like you don't have the latest transformers version!
Please update transformers via:
`pip uninstall transformers -y && pip install --upgrade --no-cache-dir "git+https://github.com/huggingface/transformers.git"`




---



Next, we integrate LoRA adapters into our model, which allows us to efficiently update just a fraction of the model's parameters, enhancing training speed and reducing computational load.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.10.3 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
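The trainable-parameter count that the training log reports later (41,943,040) can be reproduced by hand. The sketch below assumes the standard Llama-3-8B layer dimensions (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers); these values are not stated in this notebook. Each LoRA adapter adds two low-rank matrices, so a projection with `in_features` inputs and `out_features` outputs contributes `r * (in_features + out_features)` trainable weights.

```python
# Assumed Llama-3-8B dimensions (not stated in this notebook).
HIDDEN = 4096         # model hidden size
INTERMEDIATE = 14336  # MLP intermediate size
KV_DIM = 1024         # k_proj / v_proj output width (grouped-query attention)
NUM_LAYERS = 32
RANK = 16             # matches r = 16 in get_peft_model above

# (in_features, out_features) for each targeted projection in one layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Weights in the A (rank x in) and B (out x rank) matrices of one adapter."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS
print(f"{total_trainable:,} trainable parameters")
```

Under these assumptions, doubling `r` doubles the number of trainable weights, which is one way to reason about the suggested values 8, 16, 32, 64, and 128.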


<a name="Data"></a>
### Data Prep
The data follows the format of the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). The code below loads instruction/input/output records in that format from a local JSON file (`/content/finalyr.json`); you can replace this code section with your own data prep.

Then, we define a system prompt that formats tasks into instructions, inputs, and responses, and apply it to a dataset to prepare our inputs and outputs for the model, with an EOS token to signal completion.
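To make the formatting concrete, here is a small standalone sketch of what one training example looks like after the template is applied. The record and the literal EOS string are hypothetical stand-ins; in the notebook the EOS token comes from `tokenizer.eos_token` and the records come from the JSON file.

```python
# Same template as in the notebook cell that follows.
alpaca_prompt = """You are a Chatbot for engineering students answering their questions. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}

### Input:
{}

### Response:
{}
"""

# Hypothetical stand-in; the notebook uses tokenizer.eos_token instead.
EOS_TOKEN = "<|end_of_text|>"

# Hypothetical record with the three keys the formatting function expects.
record = {
    "instruction": "Define a deadlock.",
    "input": "",
    "output": "A deadlock is a state in which processes wait on each other indefinitely.",
}

text = alpaca_prompt.format(
    record["instruction"], record["input"], record["output"]
) + EOS_TOKEN
print(text)
```

The EOS token goes at the very end of each example so that the fine-tuned model learns where a response stops.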


In [None]:
import json
from datasets import Dataset

# Load the instruction/input/output records from the JSON file
with open('/content/finalyr.json', 'r') as file:
    data = json.load(file)

# Reuse the EOS token of the tokenizer loaded above; it is appended to every
# example so the model learns where a response ends
EOS_TOKEN = tokenizer.eos_token


# Define the formatting template
alpaca_prompt = """You are a Chatbot for engineering students answering their questions. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}

### Input:
{}

### Response:
{}
"""

# Define the formatting function
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

# Convert data into the format required by the function
formatted_data = {
    'instruction': [item['instruction'] for item in data],
    'input': [item['input'] for item in data],
    'output': [item['output'] for item in data]
}

# Apply the formatting function to the data
formatted_data = formatting_prompts_func(formatted_data)

# Create a Hugging Face Dataset from the formatted data
dataset = Dataset.from_dict(formatted_data)

# Print the dataset information
print(dataset)

# Optionally, save the dataset for later use
dataset.save_to_disk("formatted_dataset")


Dataset({
    features: ['text'],
    num_rows: 914
})


Saving the dataset (0/1 shards):   0%|          | 0/914 [00:00<?, ? examples/s]

<a name="Train"></a>
### Train the model
- This run trains for 4 full epochs (`num_train_epochs=4`). For a quick test run, set `max_steps=60` instead, which overrides `num_train_epochs`.
- The training arguments below set the batch size, gradient accumulation, learning rate, and optimizer used to fine-tune the model on the dataset we prepared.

In [None]:
import torch
from transformers import TrainingArguments
from trl import SFTTrainer

# Define training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    num_train_epochs=4,  # Use num_train_epochs for epoch-based training
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=10,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="outputs",
    # Remove max_steps if you are using num_train_epochs
)

# Initialize the trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # Can make training 5x faster for short sequences
    args=training_args
)

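# Start training and keep the result; the final stats cell reads trainer_stats.metrics
trainer_stats = trainer.train()
trainer_stats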


Map (num_proc=2):   0%|          | 0/914 [00:00<?, ? examples/s]

**** Unsloth: Please use our fixed gradient_accumulation_steps by updating transformers and Unsloth!


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 914 | Num Epochs = 4
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 456
 "-____-"     Number of trainable parameters = 41,943,040
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step,Training Loss
10,1.5628
20,0.9184
30,0.8827
40,0.8778
50,0.861
60,0.8033
70,0.8291
80,0.7998
90,0.7848
100,0.7974


TrainOutput(global_step=456, training_loss=0.5351870222049847, metrics={'train_runtime': 3080.921, 'train_samples_per_second': 1.187, 'train_steps_per_second': 0.148, 'total_flos': 3.86030434085929e+16, 'train_loss': 0.5351870222049847, 'epoch': 3.991247264770241})
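The step count and the fractional final epoch in the output above follow from the dataset size and the batch settings. The sketch below reproduces them, assuming a single GPU, a data loader that keeps the last partial batch, and that one optimizer step is taken per full group of `gradient_accumulation_steps` batches. This appears to match how the trainer counted steps in this run, but it is a reading of the log, not a statement about the library's internals.

```python
import math

# Values taken from the training configuration and log above.
num_examples = 914
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 4

# Batches the data loader yields per epoch (last partial batch kept).
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# Optimizer steps per epoch: only full accumulation groups are counted here.
steps_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)

total_steps = steps_per_epoch * num_train_epochs

# Fraction of the data seen when the last optimizer step finishes.
final_epoch = total_steps * gradient_accumulation_steps / batches_per_epoch

print(f"batches/epoch = {batches_per_epoch}, steps/epoch = {steps_per_epoch}")
print(f"total steps = {total_steps}, final epoch = {final_epoch:.6f}")
```

Because 457 batches do not divide evenly by 4, one batch per epoch is left over, which is why the reported epoch stops just short of 4.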

In [None]:
# model.save_pretrained("lora_model") # Local saving
model.push_to_hub("cryotron/chatbot_academic_4th_Year_Llama", token = "hf_...") # placeholder token: supply your own, never commit a real one

README.md:   0%|          | 0.00/575 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/cryotron/chatbot_academic_4th_Year_Llama


In [None]:
model.push_to_hub_merged("chatbot_academic_4th_Year_MRG_LORA_Llama", tokenizer, save_method = "lora", token = "hf_...") # placeholder token

Unsloth: Saving LoRA adapters. Please wait...


README.md:   0%|          | 0.00/575 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved lora model to https://huggingface.co/chatbot_academic_4th_Year_MRG_LORA_Llama


In [None]:
model.push_to_hub_gguf("chatbot_academic_4th_Year_GUFF_Llama", tokenizer, token = "hf_...") # placeholder token

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.73 out of 12.67 RAM for saving.


 50%|█████     | 16/32 [00:01<00:01, 11.86it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [02:04<00:00,  3.90s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving chatbot_academic_4th_Year_GUFF_Llama/pytorch_model-00001-of-00004.bin...
Unsloth: Saving chatbot_academic_4th_Year_GUFF_Llama/pytorch_model-00002-of-00004.bin...
Unsloth: Saving chatbot_academic_4th_Year_GUFF_Llama/pytorch_model-00003-of-00004.bin...
Unsloth: Saving chatbot_academic_4th_Year_GUFF_Llama/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at chatbot_academic_4th_Year_GUFF_Llama into q8_0 GGUF format.
The output location will be /content/chatbot_academic_4th_Year_GUFF_Llama/unsloth.Q8_0.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: chatbot_academic_4th_Year_GUFF_Llama
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00004.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.f

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q8_0.gguf:   0%|          | 0.00/8.54G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/cryotron/chatbot_academic_4th_Year_GUFF_Llama


In [None]:
model.push_to_hub_gguf("cryotron/chatbot_academic_4th_Year_GUFF_Quantized4Bit_Llama", tokenizer, quantization_method = "q4_k_m", token = "hf_...") # placeholder token

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.43 out of 12.67 RAM for saving.


100%|██████████| 32/32 [01:50<00:00,  3.45s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving cryotron/chatbot_academic_4th_Year_GUFF_Quantized4Bit_Llama/pytorch_model-00001-of-00004.bin...
Unsloth: Saving cryotron/chatbot_academic_4th_Year_GUFF_Quantized4Bit_Llama/pytorch_model-00002-of-00004.bin...
Unsloth: Saving cryotron/chatbot_academic_4th_Year_GUFF_Quantized4Bit_Llama/pytorch_model-00003-of-00004.bin...
Unsloth: Saving cryotron/chatbot_academic_4th_Year_GUFF_Quantized4Bit_Llama/pytorch_model-00004-of-00004.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q4_k_m'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at cryot

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q4_K_M.gguf:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/cryotron/chatbot_academic_4th_Year_GUFF_Quantized4Bit_Llama


In [None]:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Explain the characteristics and use cases of multiprocessor operating systems.", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 1289, use_cache = True)
tokenizer.batch_decode(outputs)

['<bos>You are a Chatbot for engineering students answering their questions. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n### Instruction:\nExplain the characteristics and use cases of multiprocessor operating systems.\n\n### Input:\n\n\n### Response:\n\nMultiprocessor operating systems (MPOS) are designed to manage multiple processors and provide a platform for developing applications that require multiple processors. MPOS provides a mechanism for coordinating multiple processors and managing their resources, and it is used in applications such as: 1) High-Performance Computing: MPOS is used in high-performance computing applications, such as scientific simulations and data analytics. 2) Real-Time Systems: MPOS is used in real-time systems, such as control systems and medical devices. 3) Embedded Systems: MPOS is used in embedded systems, such as automotive and medical dev

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
# Define save directory
save_directory = "path/to/save/model"

model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

('path/to/save/model/tokenizer_config.json',
 'path/to/save/model/special_tokens_map.json',
 'path/to/save/model/tokenizer.json')

In [None]:
import shutil

# Compress the saved model directory into a zip file
shutil.make_archive("os", 'zip', "path/to/save/model")


'/content/os.zip'

In [None]:
from google.colab import files

# Download the zip file to your local machine
files.download("os.zip")



In [None]:
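#@title Show final memory and time stats
# Requires `start_gpu_memory` and `max_memory` from the "Show current memory stats" cell,
# and `trainer_stats`, the value returned by `trainer.train()`.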
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

NameError: name 'start_gpu_memory' is not defined

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input; leave the output blank.

You can also use a `TextStreamer` to stream the output, so you see the generation token by token instead of waiting for the whole response.

In [None]:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Give me 10 questions on deadlock each for 4 marks", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<bos>You are a Chatbot for engineering students answering their questions. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Give me 10 questions on deadlock each for 4 marks

### Input:


### Response:

1. What is a deadlock?
2. What are the conditions for deadlock?
3. What are the different types of deadlocks?
4. What is the banker's algorithm?
5. What is the resource allocation graph?
6. What is the critical region graph?
7. What is the circular wait graph?
8. What is the resource allocation graph?
9. What is the solution to the resource allocation problem?
10. What are the different deadlock avoidance strategies?
<eos>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, use either Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
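The size of the adapter file gives a quick sanity check that only the LoRA weights are being saved. The estimate below takes the trainable-parameter count from the training log and assumes the adapters are stored as 32-bit floats; that storage type is an assumption, not something the notebook states.

```python
# From the training log: "Number of trainable parameters = 41,943,040".
trainable_params = 41_943_040

# Assumption: adapters are written as float32 (4 bytes per weight).
bytes_per_param = 4

adapter_bytes = trainable_params * bytes_per_param
adapter_mb = adapter_bytes / 1e6  # decimal megabytes, as progress bars report

print(f"Estimated adapter size: {adapter_mb:.1f} MB")
```

Under that assumption the estimate rounds to the 168M reported for `adapter_model.safetensors` in the upload logs, far smaller than the multi-gigabyte base model.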

In [None]:
# model.save_pretrained("lora_model") # Local saving
model.push_to_hub("cryotron/chatbot_academic_llama_Full_LORA", token = "hf_...") # Online saving (placeholder token)

README.md:   0%|          | 0.00/575 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved model to https://huggingface.co/cryotron/chatbot_academic_llama_Full_LORA


In [None]:
model.push_to_hub_merged("cryotron/chatbot_academic_llama_Full_MRG_LORA", tokenizer, save_method = "lora", token = "hf_...") # placeholder token

Unsloth: Saving LoRA adapters. Please wait...


README.md:   0%|          | 0.00/575 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/168M [00:00<?, ?B/s]

Saved lora model to https://huggingface.co/cryotron/chatbot_academic_llama_Full_MRG_LORA


In [None]:
model.push_to_hub_gguf("cryotron/chatbot_academic_llama_Full_GUFF", tokenizer, token = "hf_...") # placeholder token

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.7G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 5.72 out of 12.67 RAM for saving.


 47%|████▋     | 15/32 [00:02<00:01, 10.36it/s]We will save to Disk and not RAM now.
100%|██████████| 32/32 [01:39<00:00,  3.11s/it]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving cryotron/chatbot_academic_llama_Full_GUFF/pytorch_model-00001-of-00004.bin...
Unsloth: Saving cryotron/chatbot_academic_llama_Full_GUFF/pytorch_model-00002-of-00004.bin...
Unsloth: Saving cryotron/chatbot_academic_llama_Full_GUFF/pytorch_model-00003-of-00004.bin...
Unsloth: Saving cryotron/chatbot_academic_llama_Full_GUFF/pytorch_model-00004-of-00004.bin...
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at cryotron/chatbot_academic_llama_Full_GUFF into q8_0 GGUF format.
The output location will be /content/cryotron/chatbot_academic_llama_Full_GUFF/unsloth.Q8_0.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: chatbot_academic_llama_Full_GUFF
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00004.bin'
INFO:hf-to-gguf:token_embd.weight,           t

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q8_0.gguf:   0%|          | 0.00/8.54G [00:00<?, ?B/s]

Saved GGUF to https://huggingface.co/cryotron/chatbot_academic_llama_Full_GUFF


In [None]:
# model.save_pretrained("lora_model") # Local saving
model.push_to_hub("cryotron/chatbot_academic_llama", token = "hf_...") # Online saving (placeholder token)

README.md:   0%|          | 0.00/577 [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/78.5M [00:00<?, ?B/s]

Saved model to https://huggingface.co/cryotron/chatbot_academic_gemma


To load the LoRA adapters we just saved for inference, change `False` to `True` in the cell below:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model)

# alpaca_prompt = You MUST run cells from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<|begin_of_text|>You are a Chatbot for engineering students answering their questions. Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n### Instruction:\nWhat is a famous tall tower in Paris?\n\n### Input:\n\n\n### Response:\n\nThe Eiffel Tower is a famous landmark in Paris, standing at a height of 324 meters. It was completed in 1889 and is known for its distinctive wrought-iron lattice design.\n<|end_of_text|>']

You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since downloading `4bit` models is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Not recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

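The next cell lists the options for saving the fine-tuned model, either locally (`save_pretrained_merged`) or to the Hugging Face Hub (`push_to_hub_merged`): merged with the base model as 16-bit or 4-bit weights, or as LoRA adapters only. Every line is disabled with `if False:`; change it to `if True:` to run it.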

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

The next cell exports the model to GGUF with a choice of quantization methods (`q8_0` by default, `f16`, or `q4_k_m`) and can upload the result to the Hugging Face Hub for sharing. Lower-bit formats produce smaller files.
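The GGUF file sizes in the upload logs earlier in this notebook (8.54G for `q8_0`, 4.92G for `q4_k_m`) can be related to the parameter count. The sketch below assumes the standard Llama-3-8B dimensions, which the notebook does not state, and the `q8_0` block layout of 32 one-byte weights plus one 2-byte scale. The `q4_k_m` figure is derived from the logged file size, not from the format definition, because that method mixes several block types.

```python
# Assumed Llama-3-8B dimensions (not stated in this notebook).
VOCAB = 128_256
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024
LAYERS = 32

# Weights in one transformer layer: q/o, k/v, three MLP matrices, two norms.
per_layer = (
    2 * HIDDEN * HIDDEN
    + 2 * HIDDEN * KV_DIM
    + 3 * HIDDEN * INTERMEDIATE
    + 2 * HIDDEN
)
# Input embeddings, untied output head, all layers, final norm.
total_params = 2 * VOCAB * HIDDEN + LAYERS * per_layer + HIDDEN

# q8_0: each block of 32 weights takes 32 int8 values plus a 2-byte scale.
Q8_0_BYTES_PER_WEIGHT = 34 / 32
q8_0_gb = total_params * Q8_0_BYTES_PER_WEIGHT / 1e9
f16_gb = total_params * 2 / 1e9

# Effective bits per weight implied by the logged q4_k_m file size (4.92G).
q4_k_m_bits = 4.92e9 * 8 / total_params

print(f"parameters: {total_params:,}")
print(f"f16 estimate:  {f16_gb:.2f} GB")
print(f"q8_0 estimate: {q8_0_gb:.2f} GB")
print(f"q4_k_m effective bits/weight: {q4_k_m_bits:.2f}")
```

The `q8_0` estimate lands slightly below the logged size, which is expected since the estimate ignores file metadata and any tensors kept at higher precision.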

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("Q8_0_academic", tokenizer,)
if False: model.push_to_hub_gguf("cryotron/Q8_0_academic", tokenizer, token = "hf_...") # placeholder token

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("16bit_academic", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("cryotron/16bit_academic", tokenizer, quantization_method = "f16", token = "hf_...") # placeholder token

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("q4_k_m_academic", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("cryotron/q4_k_m_academic", tokenizer, quantization_method = "q4_k_m", token = "hf_...") # placeholder token

In [None]:
model.push_to_hub_gguf("cryotron/Q8_0_academic_trial", tokenizer, token = "hf_...") # placeholder token


Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.
Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 2.1G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.54 out of 12.67 RAM for saving.


100%|██████████| 18/18 [00:01<00:00, 11.25it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving cryotron/Q8_0_academic_trial/pytorch_model-00001-of-00002.bin...
Unsloth: Saving cryotron/Q8_0_academic_trial/pytorch_model-00002-of-00002.bin...
Done.


Unsloth: Converting gemma model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] will take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: [0] Installing llama.cpp. This will take 3 minutes...
Unsloth: [1] Converting model at cryotron/Q8_0_academic_trial into q8_0 GGUF format.
The output location will be /content/cryotron/Q8_0_academic_trial/unsloth.Q8_0.gguf
This will take 3 minutes...
INFO:hf-to-gguf:Loading model: Q8_0_academic_trial
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'pytorch_model.bin.index.json'
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model-00001-of-00002.bin'
INFO:hf-to-gguf:token_embd.weight,         torch.float16 --> Q8_0, shape = {2048, 256

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q8_0.gguf:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### We removed it in GGUF's chat template for you.


Saved GGUF to https://huggingface.co/cryotron/Q8_0_academic_trial


Now, use the exported GGUF file (named `unsloth.Q8_0.gguf` or `unsloth.Q4_K_M.gguf` in the logs above) in `llama.cpp` or a UI-based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).