# **Install Library**

In [None]:
%%bash
pip install unsloth wandb

**Explanation:**

This command installs two libraries: **Unsloth**, a toolkit for **efficient LLM fine-tuning** (for models such as Gemma, LLaMA, or Mistral), and **wandb**, the Weights & Biases client used later for experiment tracking.

Once installed, Unsloth is available in your notebook for model loading, LoRA fine-tuning, and quantization.
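
To confirm that the install worked without importing the heavy packages, you can ask Python whether it can locate them. This is a minimal sketch; the helper name `check_installed` is hypothetical and not part of either library.

```python
from importlib.util import find_spec

def check_installed(packages):
    """Return a dict mapping each package name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in packages}

# Report the status of the two packages installed above.
status = check_installed(["unsloth", "wandb"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If either package shows as missing, re-run the install cell and restart the notebook kernel.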


# **Login to Hugging Face and Weights & Biases (WandB)**

In [None]:
import wandb
import json
from huggingface_hub import login, whoami

# Access the API key
HF_API_Key = " "
WANDB_API_KEY = " "


login(token=HF_API_Key)    # Log into Hugging Face
wandb.login(key=WANDB_API_KEY)  # Log into Wandb


print("Login setup complete!")

# Get the current user info
user_info = whoami()

# Print normal text line by line
for key, value in user_info.items():
    print(f"{key}: {value}")


**Simple Explanation:**

This code logs you into **Hugging Face** and **WandB** using your API keys.

* `HF_API_Key` — your **Hugging Face API token**.
* `WANDB_API_KEY` — your **Weights & Biases API key**.
* `login(token=HF_API_Key)` — connects your notebook to Hugging Face Hub.
* `wandb.login(key=WANDB_API_KEY)` — connects your notebook to WandB for experiment tracking.
* `whoami()` — checks and displays your current Hugging Face account info.

**In short:**
It authenticates your session so you can upload models to Hugging Face and track training progress with WandB.
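
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. One possible alternative, assuming you have exported `HF_TOKEN` and `WANDB_API_KEY` as environment variables beforehand, is to read them at runtime. The helper `read_secret` is a hypothetical name used only for this sketch.

```python
import os

def read_secret(name, default=None):
    """Return the environment variable `name`, stripped, or `default` if unset or blank."""
    value = os.environ.get(name, "").strip()
    return value if value else default

# Assumption: both variables were set outside the notebook.
HF_API_Key = read_secret("HF_TOKEN")
WANDB_API_KEY = read_secret("WANDB_API_KEY")

if HF_API_Key is None or WANDB_API_KEY is None:
    print("Set HF_TOKEN and WANDB_API_KEY before calling login().")
```

The resulting variables can then be passed to `login(token=...)` and `wandb.login(key=...)` exactly as in the cell above.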


## **Loading a Pretrained Model using Unsloth**

In [None]:
from unsloth import FastLanguageModel

model_name = "tiiuae/Falcon-H1-1.5B-Instruct"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name,
    load_in_4bit=True,
    max_seq_length=256,
    dtype=None,
)

**Explanation:**

* `FastLanguageModel` is imported from **Unsloth**.
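* The model named in `model_name` (here **Falcon-H1-1.5B-Instruct**; other supported Hugging Face models such as **LLaMA** or **Gemma** load the same way) is downloaded and loaded together with its tokenizer.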
* `load_in_4bit=True` enables **4-bit quantization**, which significantly reduces VRAM usage while maintaining good performance.
* `max_seq_length=256` sets the maximum number of tokens the model can process in a single input sequence.
* `dtype=None` allows Unsloth to automatically choose the best precision type (like `torch.bfloat16` or `torch.float16`) based on your GPU.

**Note:** You can replace the model name (here `"tiiuae/Falcon-H1-1.5B-Instruct"`) with another model that Unsloth supports, for example `"meta-llama/Meta-Llama-3-8B-Instruct"`, `"google/gemma-2b"`, or `"mistralai/Mistral-7B-Instruct-v0.2"`. Check Unsloth's list of supported models first, since not every architecture is covered.
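
To get a feel for why 4-bit loading matters, the weight memory can be estimated as parameters × bits ÷ 8. The sketch below assumes a nominal 1.5 billion parameters and counts the weights only; quantization constants, activations, and the KV cache add to the real footprint.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 1.5e9  # assumed: nominal size of a "1.5B" model
for label, bits in [("float16/bfloat16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(params, bits):.2f} GB")
```

Under these assumptions the weights drop from about 3 GB to about 0.75 GB.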


## **We now add LoRA adapters so we only need to update 1 to 10% of all parameters!**

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 8,
    lora_dropout = 0.1,
    use_gradient_checkpointing = False,
    random_state = 1407,
)

**Explanation:**

* `FastLanguageModel.get_peft_model()` — wraps the base model with **LoRA (Low-Rank Adaptation)** layers for parameter-efficient fine-tuning.
* `r = 8` — sets the **LoRA rank**, which determines how many additional trainable parameters are added. Common values include **8, 16, 32, 64, 128**.
* `target_modules`: specifies which model layers will be adapted using LoRA. Here it covers the attention projections (`"q_proj"`, `"k_proj"`, `"v_proj"`, `"o_proj"`) and the MLP projections (`"gate_proj"`, `"up_proj"`, `"down_proj"`).
* `lora_alpha = 8`: a scaling factor for the LoRA update. In standard LoRA the update is multiplied by `lora_alpha / r`, so with `r = 8` the factor here is 1.
* `lora_dropout = 0.1` — applies dropout regularization within LoRA layers to reduce overfitting.
* `use_gradient_checkpointing = False` — can be set to `True` to save GPU memory during training (slightly slower due to recomputation).
* `random_state = 1407` — ensures reproducibility by setting a consistent random seed.

**Note:**
This configuration works not just for **Falcon**, but also for other transformer models like **LLaMA**, **Gemma**, **Mistral**, or **Yi**.
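You only need to adjust the `target_modules` list if your chosen model uses different layer names (e.g., a fused `"query_key_value"` projection in older Falcon checkpoints or `"c_attn"` in GPT-2). LLaMA, Gemma, and Mistral all use the `q_proj`/`k_proj`/`v_proj`/`o_proj` naming shown above.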
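
The effect of `r` and `lora_alpha` can be illustrated with a little arithmetic. In standard LoRA, a linear layer of shape `d_out × d_in` gains two small matrices, `A` (`r × d_in`) and `B` (`d_out × r`). The layer width of 2048 below is a hypothetical value chosen for illustration, not the actual width of the Falcon model.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r

hidden = 2048  # hypothetical layer width, for illustration only
print(lora_params(hidden, hidden, r=8))   # 32768 extra trainable parameters
print(hidden * hidden)                    # 4194304 parameters in the frozen layer
print(lora_scaling(lora_alpha=8, r=8))    # 1.0
```

For this toy layer the adapter is under 1% of the frozen weight matrix, which is why only a small fraction of parameters needs gradients.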


# **Data Prep**

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

**Explanation:**

* `alpaca_prompt` — defines the **prompt template** used for instruction-tuning. It organizes the data into three sections: **Instruction**, **Input**, and **Response**, following the Alpaca dataset format.
* `EOS_TOKEN`: the tokenizer's **end-of-sequence** token. Appending it to every training example teaches the model when to stop generating text.
* `formatting_prompts_func()` — formats raw dataset examples into the Alpaca-style structure by inserting each instruction, input, and output into the template, and then appends the `EOS_TOKEN`.
* `load_dataset("yahma/alpaca-cleaned")` — loads a **cleaned and high-quality version** of the Alpaca dataset, containing structured instruction–response pairs for supervised fine-tuning.
* `dataset.map(..., batched=True)` — efficiently applies the formatting function to all dataset entries in batches for faster preprocessing.

**In essence:**
This code **formats the training data** into a clean, structured **instruction-following format**, ready for fine-tuning large language models such as **LLaMA**, **Gemma**, **Mistral**, or **any Hugging Face-compatible model**. It only builds text strings; tokenization happens later, inside the trainer.
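
To see what the formatting step produces without downloading the dataset, the same logic can be run on a toy batch. The rows and the `"</s>"` end-of-sequence string below are made up for illustration; in the notebook the real value comes from `tokenizer.eos_token`.

```python
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "</s>"  # placeholder; the real value comes from tokenizer.eos_token

def format_batch(examples):
    """Turn a batch of column lists into a list of prompt strings ending in EOS."""
    rows = zip(examples["instruction"], examples["input"], examples["output"])
    return {"text": [alpaca_prompt.format(*row) + EOS_TOKEN for row in rows]}

toy_batch = {  # made-up rows, not taken from yahma/alpaca-cleaned
    "instruction": ["Add the numbers.", "Translate to French."],
    "input": ["2 and 3", "Hello"],
    "output": ["5", "Bonjour"],
}
formatted = format_batch(toy_batch)
print(formatted["text"][0])
```

With `batched=True`, `dataset.map` passes each column as a list, which is why the function zips the three columns together.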


# **Train the model**

In [None]:
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=256,
    dataset_num_proc=2,
    packing=False,  # Can make training 5x faster for short sequences.
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        warmup_steps=5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps=10,
        learning_rate=2e-4,
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.001,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="wandb",  # Use TrackIO/WandB etc
    ),
)

**Explanation:**

* `from trl import SFTConfig, SFTTrainer` — imports classes from **TRL (Transformer Reinforcement Learning)**, designed for **Supervised Fine-Tuning (SFT)** of language models.
* `trainer = SFTTrainer(...)` — initializes the **fine-tuning trainer**, which manages model training, dataset loading, and optimization automatically.

---

**Parameters Explained:**

* `model` and `tokenizer` — refer to the preloaded **LLaMA**, **Gemma**, or **Falcon** models and their associated tokenizer.
* `train_dataset=dataset` — specifies the formatted Alpaca dataset for training.
* `dataset_text_field="text"` — tells the trainer which dataset column contains the text input for fine-tuning.
* `max_seq_length=256` — limits the maximum number of tokens per training example.
* `dataset_num_proc=2`: uses 2 worker processes to speed up data preprocessing.
* `packing=False` — keeps each sample separate (when `True`, multiple shorter samples can be packed together to improve efficiency).

---

**`SFTConfig` arguments:**

* `per_device_train_batch_size=2` — sets batch size per GPU or device.
* `gradient_accumulation_steps=8`: accumulates gradients over 8 batches before each optimizer step, giving an effective batch size of 2 × 8 = 16 per device without the memory cost of a larger batch.
* `warmup_steps=5` — gradually warms up the learning rate for more stable early training.
* `num_train_epochs=1`: commented out in the code above. Uncomment it (and remove `max_steps`) to run one complete pass through the dataset.
* `max_steps=10` — limits training to 10 steps, useful for debugging or quick tests.
* `learning_rate=2e-4` — the learning rate for the optimizer.
* `optim="adamw_8bit"`: uses **8-bit AdamW** from bitsandbytes to save GPU memory.
* `weight_decay=0.001` — adds a small penalty to reduce overfitting.
* `lr_scheduler_type="linear"` — linearly decays the learning rate over time.
* `seed=3407` — sets a fixed random seed for reproducibility.
* `output_dir="outputs"`: specifies where to save model checkpoints and logs.
* `report_to="wandb"`: sends training logs to **WandB**. Set it to `"none"` to disable logging to external tools.
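
A few of these settings interact, and a short calculation shows how. The sketch below assumes a single GPU and reproduces, as I understand it, the usual linear warmup followed by linear decay; it is an illustration of the schedule's shape, not the trainer's actual code.

```python
def effective_batch_size(per_device, accumulation_steps, num_devices=1):
    """Examples that contribute to one optimizer step."""
    return per_device * accumulation_steps * num_devices

def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

batch = effective_batch_size(per_device=2, accumulation_steps=8)
print(batch)       # 16 examples per optimizer step
print(batch * 10)  # 160 examples seen with max_steps=10

# Values taken from the SFTConfig above: learning_rate=2e-4, warmup_steps=5, max_steps=10.
for step in range(11):
    print(step, f"{linear_lr(step, 2e-4, warmup_steps=5, total_steps=10):.1e}")
```

With `max_steps=10`, only 160 of the dataset's examples are used, which is why this configuration suits smoke tests rather than a real training run.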

---

**In summary:**

This code configures and initializes the **training process** for fine-tuning **LLaMA**, **Gemma**, or **Falcon** models using **Unsloth** and **TRL’s `SFTTrainer`**.
It enables **efficient, low-memory supervised fine-tuning** on structured instruction datasets such as **Alpaca** or custom instruction-response data.


## **Train**

In [None]:
trainer_stats = trainer.train()

**Explanation:**

* `trainer_stats = trainer.train()` — begins the **fine-tuning process** using the previously defined configuration and dataset with `SFTTrainer`.

---

**Details:**

* This command **starts the main training loop**, running forward and backward passes over the dataset.
* Only the **LoRA (Low-Rank Adaptation)** layers are updated, while the **base model** weights (e.g., LLaMA, Gemma, or Falcon) stay frozen, which saves memory and training time.
* The `trainer` automatically manages:

  * Loading and batching data from the dataset
  * Calculating the loss function
  * Performing backpropagation
  * Accumulating gradients for efficient training
  * Saving model checkpoints and logs (if enabled)

---

**Output — `trainer_stats`:**

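The `trainer_stats` object is the `TrainOutput` returned by `trainer.train()`. It contains:

* `global_step`: the **number of completed optimizer steps**
* `training_loss`: the **average training loss** over the run
* `metrics`: a dictionary of summary values such as `train_runtime` (**total training duration** in seconds) and `train_samples_per_second`

Per-step loss and **learning rate** values are not stored here. They are recorded in `trainer.state.log_history` and, with `report_to="wandb"`, in your WandB run.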

---

**In summary:**
This cell executes the **core training loop** for fine-tuning **LLaMA**, **Gemma**, or **any other supported model** using **LoRA adapters** and the **Alpaca instruction dataset** — achieving efficient, parameter-light supervised fine-tuning.


# **Inference**

Let's run the model! You can change the instruction and input - leave the output blank!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the Fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

**Explanation:**

* `FastLanguageModel.for_inference(model)` — switches the fine-tuned model into **inference mode**, disabling gradient calculations and enabling optimized settings for **faster text generation (up to 2× faster)**.
* `inputs = tokenizer([...], return_tensors="pt").to("cuda")` — tokenizes the input text into **PyTorch tensors** and moves them to the GPU for efficient inference.

---

**Prompt Structure:**

* The `alpaca_prompt.format()` template is reused to maintain the same structured input format:

  * **Instruction:** `"Continue the Fibonacci sequence."`
  * **Input:** `"1, 1, 2, 3, 5, 8"`
  * **Output:** (left empty) — this signals the model to **generate** the continuation.

---

**Generation Step:**

* `outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)`

  * `max_new_tokens=64` — defines the maximum number of tokens the model can generate beyond the input prompt.
  * `use_cache=True` — enables caching of past key values to accelerate token generation.

---

**Decoding:**

* `tokenizer.batch_decode(outputs)` — converts the generated token IDs back into human-readable text.

---

**In summary:**
This code performs **text generation (inference)** using your fine-tuned **LLaMA**, **Gemma**, or any other compatible model.
It demonstrates how to use the trained model to produce **coherent, context-aware outputs** from structured instruction–input prompts (like continuing a Fibonacci sequence or answering natural language questions).
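
Because `batch_decode` returns the whole prompt plus the completion, you usually want only the part after `### Response:`. Below is a minimal sketch of one way to extract it. The helper `extract_response` is hypothetical, and the sample string is hand-written for illustration, not real model output.

```python
def extract_response(decoded_text, eos_token=None):
    """Return the text after the last '### Response:' marker, without the EOS token."""
    marker = "### Response:"
    _, found, tail = decoded_text.rpartition(marker)
    if not found:
        # Marker missing: fall back to the full text.
        return decoded_text.strip()
    if eos_token:
        tail = tail.split(eos_token, 1)[0]
    return tail.strip()

# Hand-written stand-in for a decoded generation, not real model output.
sample = (
    "### Instruction:\nContinue the Fibonacci sequence.\n\n"
    "### Input:\n1, 1, 2, 3, 5, 8\n\n"
    "### Response:\n13, 21, 34</s>"
)
print(extract_response(sample, eos_token="</s>"))
```

In the notebook you would pass `tokenizer.batch_decode(outputs)[0]` and `tokenizer.eos_token` instead of the sample values.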


You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the full output!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the Fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

**Explanation:**

* `FastLanguageModel.for_inference(model)` — puts the model into **inference mode**, disabling training layers and optimizing it for **fast text generation** (around **2× faster performance**).

* `inputs = tokenizer([...], return_tensors="pt").to("cuda")` — tokenizes the given instruction and input text, converting them into **PyTorch tensors** and moving them to the GPU for generation.

  * Uses the `alpaca_prompt.format()` template with:

    * **Instruction:** `"Continue the Fibonacci sequence."`
    * **Input:** `"1, 1, 2, 3, 5, 8"`
    * **Output:** left blank — letting the model generate the continuation automatically.

* `from transformers import TextStreamer` — imports the **TextStreamer** class from the Hugging Face Transformers library.

* `text_streamer = TextStreamer(tokenizer)` — sets up a **real-time text streamer** that prints each generated token live, just like ChatGPT’s typing effect.

* `model.generate(**inputs, streamer=text_streamer, max_new_tokens=128)` — generates up to **128 new tokens**, streaming them to the output in real time instead of waiting until the model finishes generating the entire response.

---

**In short:**
This code enables **real-time text generation (streaming inference)** using your fine-tuned **LLaMA**, **Gemma**, or **Mistral** models. It prints the model’s output **live, token by token**, making it interactive and similar to ChatGPT’s response style.


# **Saving, loading finetuned models**

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")

**Explanation:**

* `model.save_pretrained("lora_model")`: saves the **LoRA adapter weights** (not the full base model) to a local directory named `"lora_model"`.

  * This folder contains the files needed to reload the adapter, such as `adapter_config.json` and the adapter weights (`adapter_model.safetensors`, or `adapter_model.bin` with older PEFT versions).
  * The comment `# Local saving` indicates that this step saves the model locally — you can later upload it to **Hugging Face Hub** or reuse it in **Colab** or your **local environment**.

* `tokenizer.save_pretrained("lora_model")` — saves the **tokenizer configuration**, including the vocabulary, merges, and special tokens, to the same `"lora_model"` directory.

  * This ensures that the model uses **the same tokenizer setup** when you reload it, maintaining consistent text encoding and decoding behavior.

---

**In short:**
This code **exports your fine-tuned model and tokenizer** (such as **Gemma**, **LLaMA**, or **Mistral**) with their **LoRA adapters** into a reusable folder.
You can easily reload them later for inference or further fine-tuning using:

```python
from unsloth import FastLanguageModel

# Reload the base model and attach the LoRA adapter saved in "lora_model".
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",
    max_seq_length=256,
    load_in_4bit=True,
)
```

This allows you to run or share your **custom fine-tuned model** seamlessly across different environments.


### **Saving to float16 for vLLM**

In [None]:
model.save_pretrained_merged("merged_model", tokenizer, save_method = "merged_16bit",)

**Explanation:**

* `model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")`: saves a **merged version** of the fine-tuned model together with its tokenizer.

What happens here:

* During LoRA fine-tuning, only **adapter layers** (LoRA weights) are trained.
* This function **merges** those LoRA adapters with the **base model weights**, creating a **single unified model**.
* The model is saved in **16-bit precision (FP16)** — reducing file size while keeping high accuracy.

Directory structure:

After execution, a new folder named **`merged_model/`** is created containing:

* The full merged model weights (ready for inference or deployment).
* The tokenizer files for text encoding/decoding.
* The configuration (`config.json`) describing model parameters.

**In short:**
This step produces a **final deployable model** (e.g., based on **Gemma**, **Llama**, or any other compatible model) by merging LoRA fine-tuned weights with the original base model into a single efficient checkpoint — ideal for sharing or inference without LoRA dependencies.
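
Conceptually, merging folds the low-rank update into each adapted weight matrix: `W' = W + (lora_alpha / r) · B·A`. The toy example below shows this for a single 2×2 layer using plain Python lists. All numbers are made up, and this sketches the standard LoRA merge formula rather than Unsloth's internal implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A), the merged weight for one layer."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy 2x2 layer with rank r=1; all numbers are made up for illustration.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (r, d_in)
B = [[0.5], [0.25]]   # shape (d_out, r)
print(merge_lora(W, A, B, lora_alpha=2, r=1))  # [[2.0, 2.0], [0.5, 2.0]]
```

After merging, the adapter matrices are no longer needed at inference time, which is what lets tools without LoRA support load the result.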


# **GGUF & llama.cpp Conversion**

In [None]:
%%bash

# STEP 1: Clone llama.cpp repository
git clone https://github.com/ggerganov/llama.cpp.git

# STEP 2: Build llama.cpp (for quantization binary)
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
cd ..

### Recommended for Colab

In [None]:
%%bash

fallocate -l 8G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j2
cd ..


**Explanation:**

This code prepares **`llama.cpp`**, a high-performance C++ framework for **quantizing and running large language models (LLMs)** such as **LLaMA**, **Gemma**, or **Mistral** — enabling efficient local inference on **CPUs or low-VRAM GPUs**.

---

**Step 1: Clone llama.cpp repository**

```bash
git clone https://github.com/ggerganov/llama.cpp.git
```

* Downloads the official **llama.cpp** project from GitHub.
* This repository provides utilities for:

  * **Quantization** (reducing model size and memory usage)
  * **Model conversion** from Hugging Face to GGUF format
  * **Local inference** optimized in C/C++ (no need for PyTorch).

---

**Step 2: Build llama.cpp (compile the binaries)**

```bash
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
cd ..
```

* `cd llama.cpp`: changes the working directory to the cloned repo.
* `cmake -B build`: generates build configuration files in a `build/` folder.
* `cmake --build build --config Release -j`: compiles the source code using multiple parallel jobs (`-j` speeds up compilation).
* `cd ..`: returns to the parent directory once the build finishes.
* The Colab variant first creates and enables an 8 GB swap file (`fallocate`, `mkswap`, `swapon`) and builds with `-j2`, which limits the build to two parallel jobs so it is less likely to run out of memory.

---

**Purpose:**

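After building, you’ll get executable binaries in `llama.cpp/build/bin`, including:

* **`llama-quantize`**: re-quantizes an existing **GGUF** file into a smaller format such as `Q4_K_M` or `Q3_K`. The initial conversion from Hugging Face format to GGUF is done by the Python script `convert_hf_to_gguf.py`, not by this binary.
* **`llama-cli`**: runs **inference locally** on CPUs or GPUs without requiring Python or PyTorch.

Older llama.cpp builds named these binaries `quantize` and `main`.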

---

**In short:**
This step installs and compiles the **llama.cpp toolchain**, giving you the ability to **quantize and run your fine-tuned models locally** — perfect for deploying **Gemma**, **LLaMA**, or similar models on lightweight hardware with minimal dependencies.


## **Then, convert the model to GGUF (BF16 and Q8_0):**

In [None]:
%%bash

# For BF16:
python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-BF16.gguf --outtype bf16 \
    --split-max-size 50G

# For Q8_0:
python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-Q8_0.gguf --outtype q8_0 \
    --split-max-size 50G

**Explanation:**

This code converts a **Hugging Face model** (for example, **Gemma**, **LLaMA**, or **Mistral**) into the **GGUF format** — the format used by `llama.cpp` for optimized inference.
It supports both **full-precision** and **quantized** outputs for maximum efficiency across different hardware.

The conversion is done using the script `convert_hf_to_gguf.py` provided in the **llama.cpp** repository.

---

**Step-by-Step Breakdown**

**1️⃣ Convert to BF16 (Full Precision)**

```bash
python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-BF16.gguf --outtype bf16 \
    --split-max-size 50G
```

* **`merged_model`** — the directory containing your fine-tuned Hugging Face model (e.g., **Gemma**, **LLaMA**, or **Mistral**).
* **`--outfile model-BF16.gguf`** — the name of the output GGUF file.
* **`--outtype bf16`** — converts model weights to **bfloat16 precision**, preserving near-original quality.
* **`--split-max-size 50G`** — automatically splits large models into 50 GB chunks to handle filesystem limits.

**Use this mode** if you have enough VRAM or system memory and want **maximum accuracy** (ideal for high-end GPUs or research use).

---

**2️⃣ Convert to Q8_0 (Quantized Model)**

```bash
python llama.cpp/convert_hf_to_gguf.py merged_model \
    --outfile model-Q8_0.gguf --outtype q8_0 \
    --split-max-size 50G
```

* **`--outtype q8_0`**: converts the weights to **8-bit quantization**, roughly halving the file size relative to BF16 with very little quality loss.
* **`--outfile model-Q8_0.gguf`**: sets the output file name for the quantized version.
* The conversion script only writes a few output types (such as `f32`, `f16`, `bf16`, and `q8_0`). Lower-bit formats like `Q4_0`, `Q4_K_M`, or `Q3_K` are produced afterwards with the `llama-quantize` tool.

**Use this mode** for **lighter and faster inference** in tools like **llama.cpp**, **Ollama**, or **LM Studio**, even on modest hardware.

---

**Summary**

| Mode     | Output Type | Precision    | Use Case                                |
| -------- | ----------- | ------------ | --------------------------------------- |
| **BF16** | `bf16`      | 16-bit float | High-quality, full-precision inference  |
| **Q8_0** | `q8_0`      | 8-bit        | Lightweight, memory-efficient inference |

---

**In short:**
This process converts your **Gemma**, **LLaMA**, or other Hugging Face-compatible model into a `.gguf` file — ready for **ultra-efficient local inference** using **llama.cpp**, **Ollama**, or similar runtimes.
You can choose between **BF16** for accuracy or **Q8_0** for lower memory usage. Smaller formats such as Q4 or Q3 are created in a separate step with `llama-quantize`.
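
The idea behind an 8-bit block format like Q8_0 can be sketched in a few lines: each block of weights shares one scale, and every weight is stored as a signed 8-bit integer. The example below is a simplified illustration with a made-up block of four values. Real Q8_0 blocks hold 32 values and store the scale as a 16-bit float.

```python
def quantize_block_q8(values):
    """Quantize one block of floats to int8 codes plus a single shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate floats from the codes and the shared scale."""
    return [scale * c for c in codes]

block = [0.0, 0.5, -1.0, 0.25]  # made-up weights; real Q8_0 blocks hold 32 values
scale, codes = quantize_block_q8(block)
restored = dequantize_block(scale, codes)
print(codes)
print(max(abs(a - b) for a, b in zip(block, restored)))  # worst-case rounding error
```

The rounding error per weight is at most half the scale, which is why 8-bit quantization usually costs very little quality.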


### **Quantizing the Model to Q3_K, Q4_K_M, and Other Formats Using llama.cpp**

In [None]:
%%bash

cd /kaggle/working/llama.cpp/build/bin
./llama-quantize /kaggle/working/Phi-3-mini-4k-instruct-BF16.gguf \
                 /kaggle/working/Phi-3-mini-4k-instruct-q3_K.gguf \
                 q3_K

**Explanation**


This Bash script performs **model quantization** using the **llama.cpp** tool, converting a full-precision **BF16 model** into a **Q3_K quantized model** — significantly reducing size and memory requirements while maintaining reasonable accuracy.

---

**Step-by-Step Breakdown**

**1️⃣ Navigate to the llama.cpp build directory**

```bash
cd /kaggle/working/llama.cpp/build/bin
```

* Changes the working directory to where the **compiled llama.cpp binaries** are located.
* This folder contains the executable tools, such as `llama-quantize`, used for model compression.

---

**2️⃣ Run the Quantization Command**

```bash
./llama-quantize /kaggle/working/Phi-3-mini-4k-instruct-BF16.gguf \
/kaggle/working/Phi-3-mini-4k-instruct-q3_K.gguf \
q3_K
```

* **`./llama-quantize`** — executes the quantization tool built from `llama.cpp`.
* **Input file:**
  `/kaggle/working/Phi-3-mini-4k-instruct-BF16.gguf` — this is your **BF16 (full-precision)** GGUF model file.
* **Output file:**
  `/kaggle/working/Phi-3-mini-4k-instruct-q3_K.gguf` — the resulting **Q3_K quantized** model file.
* **Quantization type:**
  `q3_K` — specifies the **3-bit quantization method**, providing a balance between speed, size, and accuracy.

---

**About Q3_K Quantization**

* **Q3_K** stores the model weights at roughly **3 bits each**, shrinking the file by about 70 to 80% compared with BF16.
* It’s ideal for **CPU inference** or **low-memory GPU systems**.
* The trade-off is lower output quality than 4-bit or 8-bit formats, so compare its results against a higher-precision file before relying on it.

---

**In short:**

This script converts your **full-precision model (BF16)** into a **Q3_K quantized version**, ready for deployment using `llama.cpp`, **Ollama**, or **LM Studio**.
It’s especially useful for running models like **LLaMA**, **Gemma**, or **Mistral** on limited hardware.
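
A rough file-size estimate follows from bits per weight. The sketch below derives bits per weight from the block layouts as I understand them (Q8_0: 34 bytes per 32 weights; Q3_K: 110 bytes per 256 weights) and assumes a nominal 3.8 billion parameters for Phi-3-mini. Real files come out somewhat larger, because quantization presets keep some tensors at higher precision and the file also carries metadata.

```python
def gguf_size_gb(num_params, bits_per_weight):
    """Rough file size in GB (1e9 bytes) if every weight used the same block format."""
    return num_params * bits_per_weight / 8 / 1e9

BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "Q8_0": 34 * 8 / 32,    # 32 int8 codes + one 2-byte scale per block -> 8.5
    "Q3_K": 110 * 8 / 256,  # assumed 110-byte super-block of 256 weights -> ~3.44
}

params = 3.8e9  # assumed nominal parameter count for Phi-3-mini
for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{gguf_size_gb(params, bpw):.1f} GB")
```

Under these assumptions the Q3_K estimate is a bit under a quarter of the BF16 size, consistent with the 70 to 80% reduction mentioned above.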


# **Push to Hugging Face**

In [None]:
%%bash

hf upload ss-lab/Phi-3-mini-4k-instruct-GGUF \
    /kaggle/working/Phi-3-mini-4k-instruct-q3_K.gguf


**Explanation**

This command uploads your **quantized GGUF model** (for example, a **Gemma** or **LLaMA** model) to the **Hugging Face Hub**, making it available for public or private use.

---

**Breakdown**

```bash
hf upload ss-lab/Phi-3-mini-4k-instruct-GGUF \
    /kaggle/working/Phi-3-mini-4k-instruct-q3_K.gguf
```

**1️⃣ `hf upload`**

* Uses the **Hugging Face CLI** to upload a file or model to your Hugging Face account or organization repository.
* You can upload model weights, tokenizers, or even full project folders.

---

**2️⃣ `ss-lab/Phi-3-mini-4k-instruct-GGUF`**

* This is the **destination repository** on Hugging Face.
* It follows the format:

  ```
  username-or-org/repository-name
  ```
* Example alternatives:

  * `your-username/Gemma-2B-GGUF`
  * `your-username/Llama-3-8B-GGUF`
* The repository will be created automatically if it doesn’t exist.

---

**3️⃣ `/kaggle/working/Phi-3-mini-4k-instruct-q3_K.gguf`**

* This is the **local path** of the GGUF file you want to upload.
* Change it depending on your environment:

  * **Kaggle:** `/kaggle/working/your-model.gguf`
  * **Colab:** `/content/your-model.gguf`
  * **Local System:** `~/models/your-model.gguf`

---

**What it Does**

* Uploads the `.gguf` model file (quantized or full-precision) to the Hugging Face Hub.
* Once uploaded, the model becomes accessible via:

  ```
  https://huggingface.co/ss-lab/Phi-3-mini-4k-instruct-GGUF
  ```
* You (or others) can then download it with:

  ```python
  from huggingface_hub import hf_hub_download
  hf_hub_download(repo_id="ss-lab/Phi-3-mini-4k-instruct-GGUF", filename="Phi-3-mini-4k-instruct-q3_K.gguf")
  ```

---

**In short:**

This command **pushes your locally quantized model (.gguf)** to the **Hugging Face Hub** for storage, sharing, or deployment — supporting any model like **Gemma**, **LLaMA**, or **Mistral**, not just Phi.
It’s the final step to make your fine-tuned or quantized model available for easy download and inference anywhere.
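
Since the local path differs between Kaggle, Colab, and a local machine, a small helper can pick the right directory automatically. This is a hypothetical convenience function based on the three locations listed above; `default_model_dir` is not part of any library.

```python
from pathlib import Path

def default_model_dir():
    """Guess where model files live: Kaggle first, then Colab, then ~/models."""
    for candidate in (Path("/kaggle/working"), Path("/content")):
        if candidate.is_dir():
            return candidate
    return Path.home() / "models"

# Build the path of the file to upload; the file name is a placeholder.
gguf_path = default_model_dir() / "your-model.gguf"
print(gguf_path)
```

The printed path can then be used as the local file argument of the upload command.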


# **Remove the GGUF File from Local Storage**

In [None]:
%%bash
rm /kaggle/working/Phi-3-mini-4k-instruct-q3_K.gguf

**Explanation:**


It permanently deletes `Phi-3-mini-4k-instruct-q3_K.gguf` from the `/kaggle/working/` folder.
