<a href="https://colab.research.google.com/github/AdithyaSean/Singlish-llama/blob/main/Continued_pretraining_Singlish_%2B_Unsloth%20%2B%20Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Hugging Face Hub Login

In this cell, we log in to the Hugging Face Hub with a personal access token so that we can access and manage our models and datasets on the platform. The token is read from Colab's secrets manager with `userdata.get` and passed to the `login` function from the `huggingface_hub` module. Replace the secret name `adithyasean` with the name under which your own Hugging Face token is stored.

In [None]:
from huggingface_hub import login
from google.colab import userdata
token = userdata.get('adithyasean')
login(token, add_to_git_credential = True)
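
Outside Colab, `google.colab.userdata` is not available. As a minimal sketch, the helper below (a hypothetical `resolve_hf_token`, not part of `huggingface_hub`) reads the token from Colab secrets when it can and otherwise falls back to an environment variable. The default secret name mirrors the one used in the cell above, and `HF_TOKEN` is assumed as the variable name.

```python
import os


def resolve_hf_token(secret_name="adithyasean", env_var="HF_TOKEN"):
    """Return a Hugging Face token from Colab secrets or an environment variable.

    Hypothetical helper for illustration; adjust both names to your setup.
    """
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        # Not on Colab: fall back to the environment (None if unset).
        return os.environ.get(env_var)
    return userdata.get(secret_name)
```

The returned value can be passed to `login(...)` in place of the direct `userdata.get` call.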

### Why We Choose to Use Unsloth

Unsloth is a library that speeds up training and reduces the memory required to fine-tune and run large language models (LLMs). With Unsloth, we can manage memory and compute efficiently enough to work with these models on less powerful hardware. This is particularly beneficial for fine-tuning and inference, where resource constraints can be a significant bottleneck.

### Explanation

In the following cell, we install Unsloth. The training stack also relies on Xformers (memory-efficient attention), TRL, PEFT, Accelerate, BitsAndBytes, and Triton, which matter for the performance of our LLM. Note that the cell runs only `pip install unsloth` and does not check the Torch version itself, so resolving compatible versions of these packages is left to pip.

In [None]:
!pip install unsloth

In [None]:
# Import unsloth first so that its patches are applied before transformers loads
from unsloth import is_bfloat16_supported, UnslothTrainer, UnslothTrainingArguments, FastLanguageModel
from transformers import TrainingArguments, TextStreamer
from datasets import load_dataset

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Llama-3.2-3B",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
)

### Data Preparation

We collected two large Sinhala text datasets from Hugging Face, which together contain close to 90 million examples, and the Sinhala subset of Wikipedia, which contains close to 25k title-and-article examples. We also use an English-to-Sinhala translation dataset, a Sinhala translation of the Alpaca dataset, and a Sinhala instruction fine-tuning dataset, each published by other Hugging Face users.

All datasets were transliterated and processed with custom Python scripts, available in the [GitHub Repository](https://github.com/AdithyaSean/Singlish-llama).

Datasets, in training order:

**Original Datasets:**
- [Sinhala 30M](https://huggingface.co/datasets/9wimu9/sinhala_30m)
- [Sinhala Dataset 59M](https://huggingface.co/datasets/9wimu9/sinhala_dataset_59m)
- [Wikipedia (Sinhala subset)](https://huggingface.co/datasets/wikimedia/wikipedia)
- [English-Sinhala Translated](https://huggingface.co/datasets/Udith-Sandaruwan/english-sinhala-translated)
- [Alpaca-Sinhala](https://huggingface.co/datasets/sahanruwantha/alpaca-sinhala)
- [sinhala-instruction-finetune-large](https://huggingface.co/datasets/ihalage/sinhala-instruction-finetune-large)

**Transliterated and Processed Datasets:**
- [Singlish 80M](https://huggingface.co/datasets/adithyasean/singlish_30m)
- [Singlish Wikipedia](https://huggingface.co/datasets/adithyasean/singlish-wikipedia)
- [English-Singlish](https://huggingface.co/datasets/adithyasean/english-singlish)
- [Alpaca-Singlish](https://huggingface.co/datasets/adithyasean/alpaca-singlish)
- [singlish-instruction-finetune](https://huggingface.co/datasets/adithyasean/singlish-instruction-finetune)

These datasets are used to teach the model Singlish as a new language skill while aiming to preserve its existing abilities.

<a name="Train"></a>
### Continued Pretraining
Continued pretraining uses Unsloth's `UnslothTrainer`. For the underlying trainer options, see the [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer).

### First LoRA Adapter

In this section, we perform the first stage of continued pretraining on the collected datasets. The goal is to teach the model a new language skill without degrading its existing abilities. We use Parameter-Efficient Fine-Tuning (PEFT), specifically LoRA: by adding LoRA adapters, we only need to update a small percentage of all parameters, which keeps training efficient.

### Target Modules for LoRA Adapters

In the context of adding LoRA adapters to our model, the target modules are specific parts of the model's architecture where the low-rank adaptation will be applied. These modules are chosen based on their significance in the model's computation and their potential impact on performance when fine-tuned. Here are the target modules we included and the reasons for their inclusion:

- **q_proj, k_proj, v_proj, o_proj**: These are the query, key, value, and output projection layers in the attention mechanism. Fine-tuning these layers helps in adapting the attention mechanism to new tasks or domains.
- **gate_proj, up_proj, down_proj**: These layers are part of the feed-forward neural network within the transformer architecture. Fine-tuning these layers allows the model to better capture and adapt to new patterns in the data.
- **embed_tokens**: This module is responsible for converting input tokens into embeddings. Fine-tuning this layer helps the model to learn new token representations, which is crucial for handling out-of-distribution data.
- **lm_head**: The language modeling head is responsible for generating the final output tokens. Fine-tuning this layer ensures that the model can produce accurate and relevant outputs for the new tasks.

By including all these target modules, we ensure that the model can efficiently adapt to new tasks and data distributions while maintaining its overall performance. This comprehensive approach allows us to update only a small subset of the model's parameters, making the fine-tuning process more efficient and effective.
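
To make the "small percentage of all parameters" concrete: a LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights, and its update is scaled by `alpha / r`, or by `alpha / sqrt(r)` under rank-stabilized LoRA (`use_rslora = True`). The sketch below does this arithmetic in plain Python. The 3072-wide square projection is an assumed, illustrative layer shape rather than a value read from the model config.

```python
import math


def lora_param_count(d_in, d_out, r):
    """Weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(alpha, r, use_rslora):
    """Multiplier applied to the low-rank update B @ A."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


# Assumed layer shape for illustration: a square 3072 x 3072 projection.
d_in, d_out, r, alpha = 3072, 3072, 128, 32

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full layer weights: {full:,}")
print(f"LoRA weights added: {added:,} ({added / full:.1%} of the layer)")
print(f"scaling with rsLoRA:    {lora_scaling(alpha, r, True):.3f}")
print(f"scaling without rsLoRA: {lora_scaling(alpha, r, False):.3f}")
```

With `r = 128` and `lora_alpha = 32`, rank-stabilized scaling weights the adapter update by a factor of `sqrt(128)`, roughly 11 times, more than standard scaling would.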

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,
    loftq_config = None,
)

### Text datasets 100 000 examples

In [None]:
# Loads only the first 8,000 rows; widen the slice to train on more of the data
dataset = load_dataset("adithyasean/singlish_80m", split='train[:8000]')

In [None]:
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_text_function(examples):
    texts = examples["text"]
    outputs = []
    for text in texts:
        # Must add EOS_TOKEN, otherwise the generation will go on forever!
        outputs.append(text + EOS_TOKEN)
    # Return the EOS-terminated strings, not the unmodified input column
    return { "text" : outputs, }

dataset = dataset.map(formatting_text_function, batched = True,)

In [None]:
print(dataset[0])
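
With `batched = True`, `dataset.map` passes the function a dict of column lists and expects a dict of equally long lists back. A minimal, library-free sketch of that contract follows. The `<eos>` string and the sample rows are placeholders; the notebook itself uses `tokenizer.eos_token` and the real dataset.

```python
EOS_PLACEHOLDER = "<eos>"  # stand-in for tokenizer.eos_token


def append_eos(batch, eos=EOS_PLACEHOLDER):
    """Batched map function: takes {"text": [...]} and returns {"text": [...]}."""
    return {"text": [text + eos for text in batch["text"]]}


# A batch as dataset.map(..., batched=True) would pass it: one list per column.
sample_batch = {"text": ["first example", "second example"]}
formatted = append_eos(sample_batch)

# Every returned row should end with the EOS marker.
assert all(row.endswith(EOS_PLACEHOLDER) for row in formatted["text"])
print(formatted["text"][0])
```

Printing one mapped row, as the cell above does, is a quick way to confirm that the marker is present before training starts.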

In [None]:
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        # max_steps = 120,
        # warmup_steps = 10,
        warmup_ratio = 0.1,
        num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

In [None]:
trainer_stats = trainer.train()
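
The batch settings above determine how many optimizer updates one epoch performs. As a rough sketch, assuming a single GPU and one sequence per example (no packing), the arithmetic is:

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs=1, num_devices=1):
    """Optimizer updates for an epoch-based run, assuming no sequence packing."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return math.ceil(num_examples / effective_batch) * epochs


steps = optimizer_steps(num_examples=8000, per_device_batch=2, grad_accum=8)
warmup = math.ceil(steps * 0.1)  # warmup_ratio = 0.1
print(f"effective batch size: {2 * 8}")
print(f"optimizer steps: {steps}, of which warmup: {warmup}")
```

For the 8,000-row slice this gives 500 updates with an effective batch size of 16, so `logging_steps = 1` prints one loss value per update.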

### Wikipedia Sinhala subset 23 000 examples

We continue training the same LoRA (Low-Rank Adaptation) adapter on the title-text dataset because it offers several advantages:

1. **Concise Summaries**: Each title acts as a concise summary of its article, which helps the model identify the main points of a text and generate more relevant summaries.
2. **Title-Content Relationship**: The model learns to relate a title to its corresponding content, which is essential for generating coherent and informative summaries.
3. **Skill Development**: It helps the model learn to generate concise summaries, a valuable skill for many natural language processing tasks.

These characteristics make the title-text dataset an excellent choice for training a LoRA adapter aimed at improving summarization capabilities.

In [None]:
# Loads only the first 8,000 rows; widen the slice to train on more of the data
dataset = load_dataset("adithyasean/singlish-wikipedia", split="train[:8000]")

In [None]:
wikipedia_prompt = """Wikipedia Article
### Title: {}

### Article:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    titles = examples["title"]
    texts  = examples["text"]
    outputs = []
    for title, text in zip(titles, texts):
        # Must add EOS_TOKEN, otherwise the generation will go on forever!
        text = wikipedia_prompt.format(title, text) + EOS_TOKEN
        outputs.append(text)
    return { "text" : outputs, }
pass

dataset = dataset.map(formatting_prompts_func, batched = True,)

In [None]:
print(dataset[0])

In [None]:
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        # max_steps = 120,
        # warmup_steps = 10,
        warmup_ratio = 0.1,
        num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

In [None]:
trainer_stats = trainer.train()
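
`lr_scheduler_type = "linear"` combined with a warmup phase ramps the learning rate up linearly from zero and then decays it linearly back to zero. A small sketch of that shape follows, with the 500 total steps and 50 warmup steps taken as illustrative assumptions rather than values read from the trainer:

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Illustrative run length; the real values depend on dataset size and batch settings.
for step in (0, 25, 50, 275, 500):
    lr = linear_schedule_lr(step, base_lr=5e-5, warmup_steps=50, total_steps=500)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

The peak is reached at the end of warmup, which is why a short warmup matters more for step-capped runs than for long ones.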

### Second LoRA Adapter
Multiple LoRA (Low-Rank Adaptation) adapters are commonly used in natural language processing (NLP) to specialize a model for different tasks or domains.

**Why Use a Second LoRA Adapter?**

Using multiple LoRA adapters allows for task-specific specialization, improved performance, and better handling of diverse linguistic and cultural nuances, so that the model remains versatile and robust across domains and tasks.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "embed_tokens", "lm_head",], # Add for continual pretraining
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,
    loftq_config = None,
)

### English to Singlish dataset 10 000 examples
Training on English-Singlish data is crucial for:

1. **Cross-Language Sharing**: Bridges English and Singlish, enhancing translation.
2. **Knowledge Expansion**: Enriches the model with diverse information.
3. **Language Understanding**: Improves grasp of linguistic nuances.
4. **Model Performance**: Enhances generalization on unseen data.
5. **Cultural Sensitivity**: Promotes inclusivity and caters to diverse needs.

This results in a robust, versatile, and culturally aware language model.

In [None]:
dataset = load_dataset("adithyasean/english-singlish", split="train")

In [None]:
translation = """Translation
### English:
{}

### Singlish:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    english_texts  = examples["English"]
    singlish_texts = examples["Singlish"]
    outputs = []
    # Distinct loop names avoid shadowing the column lists
    for english, singlish in zip(english_texts, singlish_texts):
        # Must add EOS_TOKEN, otherwise the generation will go on forever!
        text = translation.format(english, singlish) + EOS_TOKEN
        outputs.append(text)
    return { "text" : outputs, }

dataset = dataset.map(formatting_prompts_func, batched = True,)

In [None]:
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        max_steps = 120,
        warmup_steps = 10,
        # warmup_ratio = 0.1,
        # num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

In [None]:
trainer_stats = trainer.train()
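
Because this run is capped with `max_steps`, it stops after a fixed number of updates regardless of dataset size. A quick sketch of how many examples that covers, assuming a single GPU and no packing, and taking the 10,000-example figure from the section heading:

```python
def examples_seen(max_steps, per_device_batch, grad_accum, num_devices=1):
    """Training examples consumed by a step-capped run (no sequence packing)."""
    return max_steps * per_device_batch * grad_accum * num_devices


seen = examples_seen(max_steps=120, per_device_batch=2, grad_accum=8)
dataset_size = 10_000  # figure quoted in the section heading
print(f"examples seen: {seen} ({seen / dataset_size:.0%} of the dataset)")
```

With these settings the run sees 1,920 examples. Switching back to `num_train_epochs = 1` would cover the full dataset.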

### Instruction Finetuning

Instruction fine-tuning is a crucial step in adapting our language model to follow specific instructions and generate desired outputs. By fine-tuning the model on a dataset that includes various instructions and corresponding responses, we can enhance the model's ability to understand and execute complex tasks.

In this notebook, we utilize the Singlish transliteration of the Sinhala Alpaca dataset for instruction fine-tuning. This dataset contains a diverse set of instructions and responses, which helps in training the model to handle a wide range of queries effectively.

The fine-tuning process involves the following steps:

1. **Data Preparation**: We format the dataset to include instructions and responses, ensuring that the model can learn the relationship between them.
2. **Model Training**: Using the `UnslothTrainer`, we fine-tune the model on the prepared dataset. This involves setting training parameters such as batch size, learning rate, and the number of training steps.
3. **Evaluation**: After training, we inspect the model's generations in the Inference section to check that it follows instructions.

By completing this fine-tuning process, we aim to create a robust and versatile language model capable of understanding and executing a wide range of instructions in Singlish.

### Third LoRA Adapter

In this section, we introduce the third LoRA (Low-Rank Adaptation) adapter, specifically designed for instruction fine-tuning. The use of multiple LoRA adapters allows us to specialize the model for different tasks or domains, enhancing its versatility and performance.

**Why Use a Third LoRA Adapter for Instruction Fine-Tuning?**

1. **Task Specialization**: By adding a third LoRA adapter, we can fine-tune the model to better understand and execute specific instructions, improving its ability to handle complex tasks.
2. **Enhanced Performance**: The additional adapter helps in refining the model's responses, making them more accurate and contextually relevant.
3. **Diverse Instruction Handling**: With multiple adapters, the model can better manage a wide range of instructions, ensuring robust performance across different scenarios.
4. **Efficient Fine-Tuning**: LoRA adapters allow us to update only a small subset of the model's parameters, making the fine-tuning process more efficient and less resource-intensive.

By leveraging the third LoRA adapter, we aim to create a highly specialized and efficient model capable of understanding and executing a diverse set of instructions with high accuracy.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,
    loftq_config = None,
)

### Alpaca Dataset Sinhala

The Sinhala Alpaca dataset is a valuable resource for instruction fine-tuning; here we train on its Singlish transliteration. The dataset contains a diverse set of instructions and corresponding responses, which are essential for training language models to understand and execute complex tasks. By fine-tuning our model on it, we aim to improve its ability to follow specific instructions and generate accurate, relevant outputs.

**Why Use the Alpaca Dataset for Instruction Fine-Tuning?**

1. **Diverse Instructions**: The dataset includes a wide range of instructions, helping the model to generalize across different types of queries.
2. **Cultural Relevance**: Being in Sinhala, it ensures that the model can handle instructions and generate responses in a culturally and linguistically appropriate manner.
3. **Improved Performance**: Fine-tuning on this dataset helps in improving the model's performance on instruction-based tasks, making it more versatile and effective.
4. **Enhanced Understanding**: The dataset aids in developing the model's understanding of the relationship between instructions and responses, which is crucial for generating coherent and contextually accurate outputs.

By leveraging the Alpaca dataset in Sinhala, we can create a robust and culturally aware language model capable of handling a wide range of instruction-based tasks.

In [None]:
dataset = load_dataset("adithyasean/alpaca-singlish", split="train")

In [None]:
alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(conversations):
    texts = []
    instructions = conversations["instruction"]
    inputs = conversations["prompt"]
    outputs = conversations["response"]
    for instruction, prompt, response in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, prompt, response) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

dataset = dataset.map(formatting_prompts_func, batched = True)

In [None]:
print(dataset[0])

In [None]:
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        max_steps = 120,
        warmup_steps = 10,
        # warmup_ratio = 0.1,
        # num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

In [None]:
trainer_stats = trainer.train()

### Instruction Finetuning with Input-Output Only Dataset

In this section, we will use another instruction fine-tuning dataset that contains only input and output pairs, unlike the Alpaca dataset which includes instructions as well. This dataset will help in further refining the model's ability to generate accurate and contextually relevant responses based solely on input prompts.

In [None]:
dataset = load_dataset("adithyasean/singlish-instruction-finetune", split="train")

In [None]:
prompt = """
### input:
{}

### output:
{}"""

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN

def formatting_prompts_func(conversations):
    texts = []
    inputs = conversations["question_prompt"]
    outputs = conversations["response_prompt"]
    for input_prompt, response_prompt in zip(inputs, outputs):
        # Format the prompt and response using the template
        text = prompt.format(input_prompt, response_prompt) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

In [None]:
print(dataset[0])

In [None]:
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,

    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,

        # Use warmup_ratio and num_train_epochs for longer runs!
        max_steps = 120,
        warmup_steps = 10,
        # warmup_ratio = 0.1,
        # num_train_epochs = 1,

        # Select a 2 to 10x smaller learning rate for the embedding matrices!
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-5,

        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.00,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

In [None]:
trainer_stats = trainer.train()

### View Resources Usage

In [None]:
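#@title Show current memory stats
import torch  # not imported by the earlier cells

# Baseline for the final-stats cell below; run this before trainer.train()
# if the baseline should exclude memory reserved during training.
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")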

In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
# trainer_stats holds the metrics of the most recent trainer.train() call only
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
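
The cells above divide byte counts by 1024 three times, so the reported "GB" values are binary gigabytes (GiB). The same arithmetic as a standalone sketch, using made-up byte counts in place of the `torch.cuda` readings:

```python
def to_gib(num_bytes):
    """Convert bytes to GiB, rounded as in the cells above."""
    return round(num_bytes / 1024 / 1024 / 1024, 3)


def memory_report(total_bytes, baseline_bytes, peak_bytes):
    """Peak and training-only memory use, in GiB and as a share of total memory."""
    total, baseline, peak = map(to_gib, (total_bytes, baseline_bytes, peak_bytes))
    training = round(peak - baseline, 3)
    return {
        "peak_gib": peak,
        "training_gib": training,
        "peak_pct": round(peak / total * 100, 3),
        "training_pct": round(training / total * 100, 3),
    }


# Hypothetical readings: 16 GiB card, 6 GiB reserved before training, 10 GiB at peak.
GIB = 1024 ** 3
print(memory_report(16 * GIB, 6 * GIB, 10 * GIB))
```

The training-only figure is meaningful only when the baseline is recorded before training begins; otherwise baseline and peak coincide.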

<a name="Inference"></a>
### Inference
Let's run the model!

In [None]:
# load the last saved model if it is currently not loaded

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "username/model_name",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
)

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
prompt = """
### input:
{}

### output:
{}"""

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    prompt.format(
        "pahadili karanna, bankuwak yanu kumakda?",
        ""
    )
], return_tensors = "pt").to("cuda")

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
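
For generation to continue where a training example's answer would begin, the inference prompt has to match the training template exactly, with the output slot left empty. A small check of that property, using a made-up answer string:

```python
prompt = """
### input:
{}

### output:
{}"""

question = "pahadili karanna, bankuwak yanu kumakda?"
training_text = prompt.format(question, "sample answer")  # made-up answer
inference_text = prompt.format(question, "")               # output slot left empty

# The inference prompt is exactly the training text up to where the answer starts.
assert training_text.startswith(inference_text)
print(repr(inference_text))
```

If the template used at inference drifts from the one used in training, even by a heading or a newline, this prefix property no longer holds.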

<a name="Save"></a>
### Saving, loading finetuned models

In [None]:
model.push_to_hub("adithyasean/Llama-3.2-3B-Singlish-1.0-4bit", private = True) # Online saving
tokenizer.push_to_hub("adithyasean/Llama-3.2-3B-Singlish-1.0-4bit", private = True) # Online saving

### Saving to float16 for vLLM

In [None]:
# Merge to 16bit
model.push_to_hub_merged("adithyasean/Llama-Singlish-1.0-8B-16bit", tokenizer, save_method = "merged_16bit", token = True, private=True)

In [None]:
# Merge to 4bit
model.push_to_hub_merged("adithyasean/Llama-Singlish-1.0-8B-4bit", tokenizer, save_method = "merged_4bit", token = True, private=True)

### GGUF / llama.cpp Conversion
Unsloth clones `llama.cpp` for the conversion and saves to `q8_0` by default; other methods such as `q4_k_m` are also supported. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to Hugging Face.

Some supported quant methods (full list on the Unsloth [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
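
File size scales with the bits stored per weight. The sketch below estimates sizes for the two formats whose layouts are simple to state: `f16` stores 16 bits per weight, and `q8_0` stores blocks of 32 8-bit weights plus one 16-bit scale, i.e. 8.5 bits per weight. The 3.2 billion parameter count is an assumed round figure for a 3B-class model, and metadata overhead is ignored.

```python
BITS_PER_WEIGHT = {
    "f16": 16.0,
    # 32 int8 weights + one fp16 scale per block = 34 bytes per 32 weights
    "q8_0": (32 * 8 + 16) / 32,
}


def estimated_size_gb(num_params, bits_per_weight):
    """Approximate file size in decimal GB, ignoring metadata."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 3.2e9  # assumed parameter count for illustration
for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimated_size_gb(NUM_PARAMS, bpw):.1f} GB")
```

The mixed-precision K-quants (`q4_k_m`, `q5_k_m`) land below `q8_0`; their effective bits per weight depend on the tensor mix, so pass a measured value to `estimated_size_gb` rather than a nominal one.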

In [None]:
# Save to 8bit Q8_0
model.push_to_hub_gguf("adithyasean/Llama-Singlish-1.0-8B-Q8-0", tokenizer, token = True, private=True)

In [None]:
# Save to 16bit GGUF
model.push_to_hub_gguf("adithyasean/Llama-Singlish-1.0-8B-f16", tokenizer, quantization_method = "f16", token = True, private=True)

In [None]:
# Save to q4_k_m GGUF
model.push_to_hub_gguf("adithyasean/Llama-Singlish-1.0-8B-q4-k-m", tokenizer, quantization_method = "q4_k_m", token = True, private=True)

In [None]:
# Save to q5_k_m GGUF
model.push_to_hub_gguf("adithyasean/Llama-Singlish-1.0-8B-q5-k-m", tokenizer, quantization_method = "q5_k_m", token = True, private=True)

### Disconnect from the Runtime

In [None]:
from google.colab import runtime
runtime.unassign()