## **Guide to fine-tuning and quantizing Large Language Models (LLMs) on Google Colab**

---

Hello everyone! Today, I'll walk you through how to fine-tune and quantize large language models the easy way!

### What is Fine-Tuning a Model?

*Fine-tuning* is the process of taking a pre-trained model and training it further on a dataset specific to a particular domain.

Most large language models today have excellent general performance but struggle with specialized tasks. Fine-tuning offers significant benefits such as reduced computational costs and the ability to leverage advanced models without having to build them from scratch.

***Example:*** Google's [*Gemma 2B*](https://huggingface.co/google/gemma-2b-it) is one of Google's lightest and most notable models, with about 2 billion parameters, pre-trained for tasks like question answering, summarization, and reasoning. If you ask it *"What does NASA stand for?"*, the model might not correctly respond with *"NASA stands for National Aeronautics and Space Administration"* if this data wasn't part of its pre-training. It could give a wrong answer or generate unrelated content (a phenomenon known as *hallucination*). After fine-tuning on NASA questions, the model is much more likely to answer this question correctly.

### How to Fine-Tune a Model?

There are many methods, but today I'll introduce the most popular one (though not the most efficient, it's very easy to implement): the [*LoRA*](https://arxiv.org/abs/2106.09685) (Low-Rank Adaptation) technique. I won't go into the heavy details of LoRA; simply put, LoRA involves learning pairs of low-rank matrices while keeping the original model weights fixed.
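
To make the low-rank idea a bit more concrete, here is a minimal pure-Python sketch. It is not the code this notebook runs. It assumes the update form described in the LoRA paper, where the frozen weight `W` is combined with a scaled product of two small matrices, `W + (alpha / r) * B·A`. The helper names (`matmul`, `lora_merge`, `lora_param_counts`) and the 2048×2048 layer size are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(W, B, A, alpha, r):
    """Return W + (alpha / r) * (B @ A); W itself is never modified."""
    delta = matmul(B, A)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


def lora_param_counts(d_out, d_in, r):
    """Trainable parameters: full fine-tuning vs. a rank-r adapter."""
    full = d_out * d_in
    lora = r * (d_out + d_in)
    return full, lora


# Frozen 2x2 weight plus a rank-1 update (B is 2x1, A is 1x2).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
merged = lora_merge(W, B, A, alpha=1, r=1)
print(merged)

# Hypothetical 2048x2048 layer with rank 8.
full, lora = lora_param_counts(d_out=2048, d_in=2048, r=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Only `B` and `A` are trained, which is why the adapter is a small fraction of the size of the layer it modifies.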

I've prepared the script below for you to run. Just follow the instructions, and you'll be good to go!

***Note:*** Gemma 2B has far fewer parameters than models like GPT-4 or Gemini Pro, so it's not as "smart". Fine-tuning alone can't guarantee precisely accurate output, because large language models generate text from statistical patterns in their data. To supplement specific knowledge and increase accuracy, we need to use additional techniques like *RAG* (Retrieval-Augmented Generation).

### Why Quantize a Model?

*Quantizing* a model means storing its weights at a lower numerical precision (for example, 8-bit integers instead of 16-bit floats). This significantly reduces the model size and usually speeds up inference, at the cost of a small loss in accuracy, allowing advanced models to be deployed on devices with limited memory and computational power. The process of quantizing a model is quite complex, so I won't go into details. I've also prepared the code for you below.

After quantizing, you can deploy LLMs right on your personal computer.
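
As a rough illustration of what "lower precision" means, the sketch below quantizes a handful of floats to 8-bit integers with a single shared scale and then restores them. This is a simplified scheme chosen for clarity, not the exact method used later in the notebook; the function names are hypothetical, and real formats typically work on small blocks of weights, each with its own scale.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_values, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in q_values]


weights = [0.5, -1.27, 0.0, 1.27]
q_values, scale = quantize_int8(weights)
restored = dequantize(q_values, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_values, scale, max_error)
```

Each weight now needs one byte plus a share of the scale, instead of two or four bytes, which is where the size reduction comes from.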

---

> References

- [An Introductory Guide to Fine-Tuning LLMs](https://www.datacamp.com/tutorial/fine-tuning-large-language-models)

- [How to Quantize Any LLM?](https://hackernoon.com/quantizing-large-language-models-with-llamacpp-a-clean-guide-for-2024)

> If you encounter any bugs 🐞 while running the code below, stay calm and confident. Review the guide, and if you can't resolve it yourself, reach out to your mentors for help.

## **Notebook summary**

Welcome to Google Colab, a cloud platform that lets you run Python code in your browser, so you don't need a local setup such as VSCode. The code is organized into cells, each with a `▶ Play` button at the top. The order in which you run these cells is crucial to avoid bugs, so please follow the arranged sequence. I really don't like bugs 😭.

Alright, let's get started!!

1.   Go to `File` (top left corner) and select `Save a copy in Drive`.

2.   In the top right corner of the notebook, there is a `✨ Gemini` icon. Click on it and select `Change runtime type`, choose `T4 GPU`, and then click `Save` **(this notebook must be run with a GPU)**.

> If you encounter the error `You cannot currently connect to a GPU due to usage limits in Colab` later on, it means you’ve exhausted the free GPU quota for your account 😆. Just switch to a different account, but remember to go to `Share`, select `Anyone with the link`, set to `Editor`, and click `Done` before switching accounts.

> In some cases, you might need to reset the runtime to start over :))). Go to `Runtime` and select `Disconnect and delete runtime`. For those who don't know, Google Colab allows limited use of GPUs and provides runtime storage (which is only saved while you are connected to the notebook) up to 100 GB.

> 📌 I have prepared the code for you to fine-tune and quantize *Gemma 2B Instruct*, using sample data from *Frequently Asked Questions (FAQs) about NASA*. This is done in the simplest way possible, requiring only a few basic parameter adjustments to suit your needs.

## **Installing the necessary libraries**

Simply run the cells below in order (up to and including the `hf()` section). The expected output is provided for reference; if your output matches, you're good to go.

> In the Colab interface, navigate to the **Files** section. Click on `Upload` and select the provided `requirements.txt` file.

> Every time you take a break or go for a bubble tea, the notebook might reset after a while. Just reconnect the GPU and run this installation section again.

In [1]:
!pip install -r requirements.txt -q

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87.2/87.2 kB 5.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.1/44.1 kB 3.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 122.4/122.4 MB 7.3 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 542.1/542.1 kB 34.5 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94.7/94.7 kB 9.7 MB/s eta 0:00:00
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.7/56.7 MB 11.8 MB/s eta 0:00:00

In [None]:
# @title
import os
import io
import json
import torch
import sys
import logging
import subprocess
import contextlib
import pandas as pd
from unsloth import FastLanguageModel
from datasets import load_dataset, Dataset
from IPython.display import display
from huggingface_hub import login, HfApi
from pathlib import Path
from jinja2 import Template

MODEL_NAME = None
AUTHOR = None
HF_TOKEN = None


def set(name, author):
    global MODEL_NAME, AUTHOR
    MODEL_NAME = name
    AUTHOR = author


def hf(token):
    global HF_TOKEN
    HF_TOKEN = token
    login(HF_TOKEN, add_to_git_credential=True)


def preprocess_dataset(dataset: pd.DataFrame, dataset_name: str, num_to_train=None):
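    # Work on a copy so the caller's DataFrame is left untouched
    # and pandas does not warn about writing to a slice.
    dataset = dataset.copy() if num_to_train is None else dataset.head(num_to_train).copy()
    dataset["input"] = dataset["input"].fillna("")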
    file_path = f"/content/LLaMA-Factory/data/{dataset_name}.json"
    dataset.to_json(file_path, orient="records", force_ascii=False, indent=4)
    return file_path


def dataset_info(**datasets):
    info = {}
    for dataset_name, dataset in datasets.items():
        info[dataset_name] = {"file_name": f"{dataset_name}.json"}
    file_path = "/content/LLaMA-Factory/data/dataset_info.json"
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(info, f, ensure_ascii=False, indent=2)
    return file_path

def train(datasets, num_train_epochs, continue_training=True):
    dataset_names = ",".join(datasets)

    if not continue_training:
        os.system("rm -rf /content/LLaMA-Factory/gemma_lora")

    args = dict(
        stage="sft",
        do_train=True,
        model_name_or_path="google/gemma-2b-it",
        dataset=dataset_names,
        template="gemma",
        finetuning_type="lora",
        lora_target="all",
        output_dir="gemma_lora",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        lr_scheduler_type="cosine",
        logging_steps=10,
        warmup_ratio=0.1,
        save_steps=1000,
        learning_rate=5e-5,
        num_train_epochs=num_train_epochs,
        max_samples=500,
        max_grad_norm=1.0,
        quantization_bit=4,
        loraplus_lr_ratio=16.0,
        fp16=True,
    )

    file_path = "/content/LLaMA-Factory/train_gemma.json"

    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(args, f, ensure_ascii=False, indent=4)

    os.chdir("/content/LLaMA-Factory")
    subprocess.run(["pip", "install", "-e", ".[torch,bitsandbytes]"], check=True)
    process = subprocess.Popen(
        ["llamafactory-cli", "train", "train_gemma.json"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )

    start_printing_all = False

    for line in iter(process.stdout.readline, b""):
        decoded_line = line.decode()
        if "train metrics" in decoded_line.lower():
            start_printing_all = True
        if "loss" in decoded_line.lower() or start_printing_all:
            print(decoded_line, end="")

    process.stdout.close()
    process.wait()


class SuppressLogging:
    def __enter__(self):
        logging.disable(logging.CRITICAL)

    def __exit__(self, exc_type, exc_val, exc_tb):
        logging.disable(logging.NOTSET)


def test():
    os.chdir("/content/LLaMA-Factory/src")
    from llamafactory.chat import ChatModel
    from llamafactory.extras.misc import torch_gc

    os.chdir("/content/LLaMA-Factory")

    args = dict(
        model_name_or_path="google/gemma-2b-it",
        adapter_name_or_path="gemma_lora",
        template="gemma",
        finetuning_type="lora",
        quantization_bit=4,
    )

    with SuppressLogging():
        chat_model = ChatModel(args)

    print("***** Type 'clear' to clear the chat history, type 'exit' to quit! *****")
    messages = []
    while True:
        query = input("\nUser: ")
        if query.strip().lower() == "exit":
            break
        if query.strip().lower() == "clear":
            messages = []
            torch_gc()
            print("The chat history has just been cleared.")
            continue

        messages.append({"role": "user", "content": query})
        print(f"Assistant: ", end="", flush=True)

        response = ""
        for new_text in chat_model.stream_chat(messages):
            print(new_text, end="", flush=True)
            response += new_text
        print()
        messages.append({"role": "assistant", "content": response})
    torch_gc()


def merge_and_push(repo_id):
    os.chdir("/content/LLaMA-Factory/")

    args = dict(
        model_name_or_path="google/gemma-2b-it",
        adapter_name_or_path="gemma_lora",
        template="gemma",
        finetuning_type="lora",
        export_dir="gemma_lora_merged",
        export_size=2,
        export_device="cpu",
    )

    with open("gemma_lora_merged.json", "w", encoding="utf-8") as f:
        json.dump(args, f, ensure_ascii=False, indent=2)

    with SuppressLogging(), open(os.devnull, "w") as devnull:
        subprocess.run(
            ["llamafactory-cli", "export", "gemma_lora_merged.json"],
            stdout=devnull,
            stderr=devnull,
            check=True,
        )

    print("***** Model has been successfully merged and uploaded to Huggingface! *****")

    model_dir = "/content/LLaMA-Factory/gemma_lora_merged"
    tokenizer_dir = "/content/LLaMA-Factory/gemma_lora"

    tokenizer_config_path = Path(tokenizer_dir) / "tokenizer_config.json"
    with open(tokenizer_config_path, "r", encoding="utf-8") as f:
        tokenizer_config = json.load(f)
    tokenizer_config.pop("chat_template", None)
    with open(tokenizer_config_path, "w", encoding="utf-8") as f:
        json.dump(tokenizer_config, f, ensure_ascii=False, indent=4)

    tokenizer_files = [
        "tokenizer.json",
        "tokenizer.model",
        "tokenizer_config.json",
        "special_tokens_map.json",
    ]

    api = HfApi()
    global HF_TOKEN

    for file in os.listdir(model_dir):
        file_path = Path(model_dir) / file
        api.upload_file(
            path_or_fileobj=file_path,
            path_in_repo=file,
            repo_id=repo_id,
            repo_type="model",
            token=HF_TOKEN,
        )

    for file_name in tokenizer_files:
        file_path = Path(tokenizer_dir) / file_name
        api.upload_file(
            path_or_fileobj=file_path,
            path_in_repo=file_name,
            repo_id=repo_id,
            repo_type="model",
            token=HF_TOKEN,
        )


model = None
tokenizer = None
messages = []


def inference(model_name, max_seq_length=2048, dtype=None, load_in_4bit=True):
    os.chdir("/content")
    logging.getLogger().setLevel(logging.ERROR)
    global model, tokenizer

    try:
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_name,
            max_seq_length=max_seq_length,
            dtype=dtype,
            load_in_4bit=load_in_4bit,
        )
        FastLanguageModel.for_inference(model)
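    except Exception as e:
        # Show the real error: a failed load is not always a harmless re-run.
        print(f"Could not load the model: {e}")
        print("If the model is already loaded, you don't need to run inference() again.")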


def chat(max_new_tokens=128, history=True):
    global model, tokenizer, messages

    chat_template = """{{ '<bos>' }}{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}{% if system_message is defined %}{{ system_message }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<start_of_turn>user\n' + content + '<end_of_turn>\n<start_of_turn>model\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<end_of_turn>\n' }}{% endif %}{% endfor %}"""

    messages = []

    while True:
        query = input("\nUser: ")
        if query.strip().lower() == "exit":
            break
        if query.strip().lower() == "clear":
            messages = []
            print("The chat history has just been cleared.")
            continue

        if history:
            messages.append({"role": "user", "content": query})
        else:
            messages = [{"role": "user", "content": query}]

        template = Template(chat_template)
        input_text = template.render(messages=messages)

        print(f"Assistant: ", end="", flush=True)

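        # The template already emits <bos>, so skip the tokenizer's own special tokens,
        # and keep the inputs on the same device as the model.
        inputs = tokenizer(
            input_text, return_tensors="pt", add_special_tokens=False
        ).to(model.device)

        outputs = model.generate(
            **inputs, max_new_tokens=max_new_tokens, use_cache=True
        )

        # Decode only the newly generated tokens, so a reply that contains
        # the word "model" is not cut off by string splitting.
        prompt_length = inputs["input_ids"].shape[1]
        response = tokenizer.decode(
            outputs[0][prompt_length:], skip_special_tokens=True
        ).strip()
        print(response)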

        if history:
            messages.append({"role": "assistant", "content": response})


def quantize_and_push(repo_id):
    logging.getLogger("unsloth").setLevel(logging.CRITICAL)
    original_stdout = sys.stdout
    original_stderr = sys.stderr
    temp_stdout = io.StringIO()
    os.chdir("/content")

    global model, tokenizer, HF_TOKEN
    try:
        with contextlib.redirect_stdout(temp_stdout), contextlib.redirect_stderr(
            temp_stdout
        ):
            model.push_to_hub_gguf(
                repo_id, tokenizer, token=HF_TOKEN
            )
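    except Exception as e:
        # The redirect context managers have already restored stdout/stderr,
        # so report the failure instead of returning silently.
        print(f"Quantization or upload failed: {e}")
        return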
    finally:
        temp_stdout.seek(0)
        output_lines = temp_stdout.readlines()

    sys.stdout = original_stdout
    sys.stderr = original_stderr

    start_printing = False
    for line in output_lines:
        if "main: quantize time" in line.lower():
            start_printing = True
        if start_printing:
            print(line, end="")


def thank_you_and_good_luck():
    art = [
        "⠀⠀⠀⠀⠀⠀⢀⣰⣀⠀⠀⠀⠀⠀⠀⠀⠀",
        "⢀⣀⠀⠀⠀⢀⣄⠘⠀⠀⣶⡿⣷⣦⣾⣿⣧",
        "⢺⣾⣶⣦⣰⡟⣿⡇⠀⠀⠻⣧⠀⠛⠀⡘⠏",
        "⠈⢿⡆⠉⠛⠁⡷⠁⠀⠀⠀⠉⠳⣦⣮⠁⠀",
        "⠀⠀⠛⢷⣄⣼⠃⠀⠀⠀⠀⠀⠀⠉⠀⠠⡧",
        "⠀⠀⠀⠀⠉⠋⠀⠀⠀⠠⡥⠄⠀⠀⠀⠀⠀",
        "",
        "Wishing you all a wonderful and memorable experience at our Competition!",
    ]

    for line in art:
        print(line)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [None]:
!git clone https://github.com/hiyouga/LLaMA-Factory.git -q

### **Explanation of Hugging Face**

Hugging Face is currently the most popular platform for hosting AI datasets and models; you'll also store your fine-tuned models and datasets here.

1. Create an account on [Hugging Face](https://huggingface.co/), then click on your Avatar in the top right corner, select `New Model`, and enter the name of the model you're going to fine-tune in `Model name`. Leave `License` unselected, set it to `Public`, and click `Save`.

2. We need to fine-tune and quantize the Gemma model, so we need to create 2 repositories (one for the fine-tuned model and one for the quantized model). I'll create `Gemma-NASA` and `Gemma-NASA-quantized`, but feel free to name yours whatever you like (`my-model` and `my-model-quantized`, for instance).

3. Access [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it), select `Agree and access repository`.

4. Then go to `Profile` → `Settings` → `Access Tokens` → `Create new token` → `Fine-grained`, select all options under `User permissions`, name the token as you wish, then select `Create token` → `Copy` → `Done`.

5. Paste the token into the `hf(TOKEN)` cell, for example, `hf("abcdefghijklmnopqrstuvwxyz")`.

6. **A general note for the entire notebook: any value shown inside `" "` is a string and must keep its quotation marks. If you drop them, it will cause bugs! 😤**

In [None]:
hf("HF_TOKEN")

Token is valid (permission: fineGrained).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## **Loading the Prepared Dataset**

You have been provided with a dataset called `nasa_dataset`. Upload it in a similar way as you did with `requirements.txt`.

- `nasa_dataset`: This dataset contains 1500 general questions about NASA. You can prepare a similar dataset based on your chosen topic (refer to the dataset preparation guide for instructions).

> In this example, we will fine-tune a 2B model with knowledge about NASA.

> Replace the variable name `nasa_dataset` with your own dataset name, for example, `my_custom_dataset`, or any name you prefer!

> If prompted with `Notebook titled "Finetune & Quantize LLMs.ipynb" does not have access to secret named "HF_TOKEN". Grant access?`, select `Grant access`.

In [None]:
# Run this code to load and view the entire nasa_dataset
nasa_dataset = pd.read_csv("nasa_faq.csv")

nasa_dataset

# Run this code to view the first 1000 rows of the nasa_dataset
# nasa_dataset.head(1000)

| Unnamed: 0 | instruction | input | output |
|---|---|---|---|
| 0 | What is NASA? | | NASA stands for National Aeronautics and Space... |
| 1 | When was NASA founded? | | NASA was founded on July 29, 1958. |
| 2 | What does NASA do? | | NASA leads U.S. space exploration, conducts sc... |
| 3 | Who is the current administrator of NASA? | | As of 2024, Bill Nelson is the Administrator o... |
| 4 | What are NASA's main objectives? | | NASA's main objectives include human space exp... |
| ... | ... | ... | ... |
| 1495 | What is the Artemis program? | | The Artemis program is NASA's initiative to re... |
| 1496 | What was the Apollo program? | | The Apollo program was NASA's mission that suc... |
| 1497 | What is the International Space Station (ISS)? | | The International Space Station (ISS) is a mul... |
| 1498 | What is the Hubble Space Telescope? | | The Hubble Space Telescope is NASA's space tel... |


In [None]:
preprocess_dataset(nasa_dataset, "nasa_dataset")
# Run this line to preprocess the full dataset that you want to fine-tune the Gemma model on.
# Remember to change the dataset name, seriously, this is like the n-th reminder.

# preprocess_dataset(nasa_dataset, "nasa_dataset", 1000)
# Run this line instead to keep only the first 1000 rows of the dataset, just for testing :))
# If you're confident, run the first line to preprocess the entire dataset

'/content/LLaMA-Factory/data/nasa_dataset.json'

In [None]:
dataset_info(nasa_dataset=nasa_dataset)
# Run this line, remember to update the nasa_dataset name, this is the n-th + 1 reminder. For example, my_custom_dataset=my_custom_dataset

'/content/LLaMA-Factory/data/dataset_info.json'
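
For reference, the two files written by the cells above look roughly like the sketch below. It mirrors what `preprocess_dataset` and `dataset_info` do (a list of `instruction` / `input` / `output` records, plus an index file that maps the dataset name to its file), but it writes to a temporary folder instead of `/content/LLaMA-Factory/data` so it can run anywhere. The two sample records are taken from the text and table in this notebook.

```python
import json
import tempfile
from pathlib import Path

# One record per question: instruction, optional input, expected output.
records = [
    {
        "instruction": "What does NASA stand for?",
        "input": "",
        "output": "NASA stands for National Aeronautics and Space Administration",
    },
    {
        "instruction": "When was NASA founded?",
        "input": "",
        "output": "NASA was founded on July 29, 1958.",
    },
]

# Temporary stand-in for /content/LLaMA-Factory/data (an assumption for this sketch).
data_dir = Path(tempfile.mkdtemp())

dataset_path = data_dir / "nasa_dataset.json"
dataset_path.write_text(
    json.dumps(records, ensure_ascii=False, indent=4), encoding="utf-8"
)

# dataset_info.json maps each dataset name to the file that holds it.
info = {"nasa_dataset": {"file_name": "nasa_dataset.json"}}
info_path = data_dir / "dataset_info.json"
info_path.write_text(json.dumps(info, ensure_ascii=False, indent=2), encoding="utf-8")

loaded = json.loads(dataset_path.read_text(encoding="utf-8"))
loaded_info = json.loads(info_path.read_text(encoding="utf-8"))
print(loaded[0]["instruction"], "->", loaded[0]["output"])
print(loaded_info)
```

If you rename your dataset, the name has to match in three places: the JSON file name, the key in `dataset_info.json`, and the list you pass to `train()`.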

## **Fine-tuning the model**

This function requires three parameters to be adjusted:
- The names of the datasets to fine-tune (train) the model.

- `num_train_epochs`: the number of passes the model makes over the records in the dataset. The idea is like this: you give it a question and an answer you want it to replicate. If `num_train_epochs = 1.0`, the model learns through the entire dataset once. But how can it remember everything in just one go, right? So, we set `num_train_epochs = 2.0` to make the model learn through the dataset twice. Fractional values work too: `0.2` covers roughly a fifth of the data, which is handy for a quick test run.

- `continue_training`: If set to `False`, every time you run this cell, the model will start learning from scratch. If set to `True`, it will continue learning from the previous training session.

> Your task is to use some strategy to train until the loss (which will be displayed after running) is in an appropriate range (you'll need to test to find out what range works best, hehe), keep an eye on the final loss value.

> Note for this cell and all other cells: the training process will take a long time (up to over 30 minutes) if your data is large, so get a bubble tea and relax during the training. Don't worry, if you wait more than 30 minutes and the system reports `usage limits in Colab`, congratulations, you'll need to switch to another account and reduce your dataset size because you've used up the free GPU quota 😭.
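
To get a feel for how long a run will take, you can estimate the number of optimizer steps from the settings inside `train()` in the setup cell: `max_samples=500`, `per_device_train_batch_size=2`, and `gradient_accumulation_steps=4`. The helper below is a rough sketch with a hypothetical name; it assumes a single GPU, and the trainer's own rounding may differ slightly.

```python
import math


def estimate_training(num_samples, per_device_batch_size, grad_accum_steps, num_train_epochs):
    """Rough estimate of the effective batch size and optimizer steps (single GPU)."""
    effective_batch = per_device_batch_size * grad_accum_steps
    steps_per_epoch = num_samples // effective_batch
    total_steps = math.ceil(num_train_epochs * steps_per_epoch)
    return effective_batch, total_steps


effective_batch, total_steps = estimate_training(
    num_samples=500,
    per_device_batch_size=2,
    grad_accum_steps=4,
    num_train_epochs=0.2,
)
print(effective_batch, total_steps)
```

Because `max_samples=500` caps how many rows are used, the step count stops growing once your dataset is larger than 500 rows, unless you raise that value inside `train()`.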

In [None]:
num_train_epochs = 0.2
continue_training = False

train(["nasa_dataset"], num_train_epochs, continue_training)

{'loss': 0.6382, 'grad_norm': 0.27795010805130005, 'learning_rate': 8.628481651367876e-06, 'epoch': 0.16}
{'train_runtime': 38.1071, 'train_samples_per_second': 2.624, 'train_steps_per_second': 0.341, 'train_loss': 0.49325847955277335, 'epoch': 0.21}
***** train metrics *****
  epoch                    =      0.208
  total_flos               =    54845GF
  train_loss               =     0.4933
  train_runtime            = 0:00:38.10
  train_samples_per_second =      2.624
  train_steps_per_second   =      0.341
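
If you retrain several times, it can help to collect the loss values instead of reading them by eye. The sketch below parses dict-style log lines like the ones shown above using only the standard library. The function name is hypothetical, and it assumes each progress line is a Python-style dict with a `loss` key, as in this output.

```python
import ast


def extract_losses(log_lines):
    """Collect (epoch, loss) pairs from dict-style log lines; skip everything else."""
    points = []
    for line in log_lines:
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            entry = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            continue
        # Summary lines carry 'train_loss' rather than 'loss', so they are skipped.
        if isinstance(entry, dict) and "loss" in entry:
            points.append((entry.get("epoch"), entry["loss"]))
    return points


log_lines = [
    "{'loss': 0.6382, 'grad_norm': 0.27795010805130005, 'learning_rate': 8.628481651367876e-06, 'epoch': 0.16}",
    "{'train_runtime': 38.1071, 'train_samples_per_second': 2.624, 'train_steps_per_second': 0.341, 'train_loss': 0.49325847955277335, 'epoch': 0.21}",
    "***** train metrics *****",
]
losses = extract_losses(log_lines)
print(losses)
```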


### **Checking the results after fine-tuning**

After training, check the results. If you're not satisfied, retrain the model. Note that the default behavior of `test()` is to save the history of questions, meaning the result of the second question will be influenced by your first question. This can be both beneficial and detrimental!

In [None]:
test()

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

***** Type 'clear' to clear the chat history, type 'exit' to quit! *****

User: What is NASA
Assistant: NASA stands for National Aeronautics and Space Administration, a United States government agency responsible for space exploration, aeronautics, and aerospace research.

User: exit


### **Finalizing and saving the model on Hugging Face**

You need to save the model after fine-tuning before closing this notebook, so you don't have to retrain it from scratch next time you need it!

> Remember the repository you created on Hugging Face during the `hf()` step? Fill it in and run this cell!

> If everything is green after running the cell, congratulations! 😍 You've successfully fine-tuned an LLM!

In [None]:
merge_and_push("ANONYMOUS/Gemma-NASA")
# Replace "ANONYMOUS/Gemma-NASA" with your own repository name

***** Model has been successfully merged and uploaded to Huggingface! *****


  0%|          | 0/1 [00:00<?, ?it/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/1.08G [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.
No files have been modified since last commit. Skipping to prevent empty commit.


## **Loading and chatting with the model**

> Note: This is a separate step for you to load and test your model. After uploading the fine-tuned model to Hugging Face in the previous step, you can exit the notebook! Or, if you have just finished fine-tuning the model and want to check if it has been successfully saved, proceed with this step.

Alright, after having your own model, I'll guide you on how to load and use it.

- Re-run the setup up to and including `hf()` (if you just opened the notebook).

- `inference(repo_id)`: enter the name of the repository where your model is stored and run this function.

- `chat(max_new_tokens, history)`: run this function to chat with the model. There are two parameters to adjust: `max_new_tokens`, the maximum number of tokens the model may generate in its response (in other words, an upper limit on the length of the response), and `history`, which keeps the previous questions and answers in the conversation when set to `True` (as mentioned earlier).
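
To see what the `history` flag changes, the sketch below builds the prompt text in plain Python, following the turn format of the Jinja template inside `chat()` (`<start_of_turn>user ... <end_of_turn>` followed by `<start_of_turn>model`). The function name is hypothetical, and the optional system-message branch of the template is left out to keep it short.

```python
def render_gemma_prompt(messages):
    """Plain-Python rendering of the Gemma turn format used by chat()."""
    prompt = "<bos>"
    for message in messages:
        if message["role"] == "user":
            prompt += (
                "<start_of_turn>user\n"
                + message["content"]
                + "<end_of_turn>\n<start_of_turn>model\n"
            )
        elif message["role"] == "assistant":
            prompt += message["content"] + "<end_of_turn>\n"
    return prompt


conversation = [
    {"role": "user", "content": "What is NASA"},
    {"role": "assistant", "content": "NASA stands for National Aeronautics and Space Administration."},
    {"role": "user", "content": "What are NASA's tasks"},
]

# history=True: every earlier turn is sent again with the new question.
with_history = render_gemma_prompt(conversation)
# history=False: only the latest question is sent.
without_history = render_gemma_prompt(conversation[-1:])

print(with_history)
print(without_history)
```

With history enabled the prompt grows with every turn, so long conversations use more memory and take longer to answer.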

In [None]:
inference("ANONYMOUS/Gemma-NASA")

==((====))==  Unsloth 2024.9: Fast Gemma patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json:   0%|          | 0.00/13.5k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/1.95G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/1.98G [00:00<?, ?B/s]

model-00003-of-00003.safetensors:   0%|          | 0.00/1.08G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/49.6k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]

In [None]:
max_new_tokens = 128
history = False

chat(max_new_tokens, history)


User: Hello
Assistant: Hello! 👋  It's nice to hear from you. What can I do for you today?

User: What is NASA
Assistant: NASA stands for National Aeronautics and Space Administration, a United States government agency responsible for space exploration, aeronautics, and aerospace research.

User: What are NASA's tasks
Assistant: NASA leads U.S. space exploration, conducts scientific research, and develops technology related to aeronautics and space.

User: exit


## **Quantizing the model**

> This is a separate section from the rest! If you are continuing from the previous sections, to minimize errors, *it's recommended to delete the runtime before running this cell*.

1. Run the setup up to and including `hf()`.

2. Run the `inference(repo_id)` function from above, filling in `repo_id` with the exact name of the repository containing the fine-tuned model.

3. Run `quantize_and_push(repo_id_2)`. Remember the second repository I asked you to create at the beginning? It's time to use it!

> If the output is all green, then you've succeeded. You've exported the model in `.gguf` format and stored it on Hugging Face. The next step is to follow the instructions on how to use Ollama to deploy the LLM on your personal computer!
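
As a sanity check on the file size you should expect, the arithmetic below estimates the size of an 8-bit GGUF file from the size of the merged checkpoint. It rests on two assumptions that are not stated in this notebook: that the merged checkpoint stores 16-bit weights, and that the `Q8_0` format stores each block of 32 weights as 32 one-byte integers plus one 16-bit scale. The function name is hypothetical, and metadata or tensors kept at higher precision are ignored.

```python
def estimate_q8_0_size_gb(fp16_checkpoint_gb, block_size=32, scale_bytes=2):
    """Estimate a Q8_0 file size from the size of a 16-bit checkpoint."""
    num_params = fp16_checkpoint_gb * 1e9 / 2        # 2 bytes per 16-bit weight
    bytes_per_block = block_size * 1 + scale_bytes   # int8 weights + one 16-bit scale
    return num_params / block_size * bytes_per_block / 1e9


# Shard sizes of the merged model, as shown in the upload log earlier.
shards_gb = [1.95, 1.98, 1.08]
estimate = estimate_q8_0_size_gb(sum(shards_gb))
print(f"{estimate:.2f} GB")
```

The estimate lands close to the `unsloth.Q8_0.gguf` size shown in the output of the cell below, roughly half the size of the merged checkpoint.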

In [None]:
quantize_and_push("ANONYMOUS/Gemma-NASA-quantized")

Unsloth: You have 1 CPUs. Using `safe_serialization` is 10x slower.
We shall switch to Pytorch saving, which will take 3 minutes and not 30 minutes.
To force `safe_serialization`, set it to `None` instead.
Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 5.0G
Unsloth: Converting gemma model. Can use fast conversion = False.


  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q8_0.gguf:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

In [None]:
thank_you_and_good_luck()

⠀⠀⠀⠀⠀⠀⢀⣰⣀⠀⠀⠀⠀⠀⠀⠀⠀
⢀⣀⠀⠀⠀⢀⣄⠘⠀⠀⣶⡿⣷⣦⣾⣿⣧
⢺⣾⣶⣦⣰⡟⣿⡇⠀⠀⠻⣧⠀⠛⠀⡘⠏
⠈⢿⡆⠉⠛⠁⡷⠁⠀⠀⠀⠉⠳⣦⣮⠁⠀
⠀⠀⠛⢷⣄⣼⠃⠀⠀⠀⠀⠀⠀⠉⠀⠠⡧
⠀⠀⠀⠀⠉⠋⠀⠀⠀⠠⡥⠄⠀⠀⠀⠀⠀

Wishing you all a wonderful and memorable experience at our Competition!
