This notebook combines three notebooks: fine-tuning a pretrained model, inference with the fine-tuned model, and inference with the original model.
Please run each section separately.

To run each section, add your Hugging Face API token to the Colab secret named HF_TOKEN.
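One way to read that secret from a cell is sketched below. It assumes the notebook runs in Colab, where `google.colab.userdata.get` returns a stored secret. The helper name `get_hf_token` and the fallback to an `HF_TOKEN` environment variable are illustrative assumptions, not part of the original notebook.

```python
import os

def get_hf_token(name="HF_TOKEN"):
    """Return the Hugging Face token from Colab secrets, else from the environment."""
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
        return userdata.get(name)
    except Exception:
        # Outside Colab, or the secret is missing or not shared with this notebook.
        return os.environ.get(name)

token = get_hf_token()
print("Token found" if token else "No token configured")
```

The returned value could then be passed wherever a `token=` argument is expected.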

Set up the connection to Google Drive storage.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os

path = 'nlp/Project'

os.chdir(f'/content/drive/MyDrive/{path}')
os.getcwd()

'/content/drive/MyDrive/nlp/Project'

#Fine-tuning pretrained models
In this part, I fine-tuned two pretrained models (gemma-2-2b and Llama-3.2-3B-Instruct) on the Retrieval-Augmented Generation (RAG) Dataset 12000.

Dataset: https://huggingface.co/datasets/neural-bridge/rag-dataset-12000

Fine-tuned models:

gemma-2-2b: https://huggingface.co/Shodai1122/gemma-2-2b-it

Llama-3.2-3B-Instruct: https://huggingface.co/Shodai1122/Llama-3.2-3B-Instruct-it


##Unsloth

https://github.com/unslothai/unsloth

In this notebook, I used Unsloth for fine-tuning.
Unsloth is a library that accelerates the fine-tuning of large language models (LLMs). Compared with a standard fine-tuning setup, it is reported to train roughly twice as fast while using less memory. It combines 4-bit quantization of the base model weights with LoRA adapters, so the frozen weights take less memory and only a small set of low-rank matrices is trained.
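To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how much space the weights alone take at different precisions. The parameter count is an assumed round figure, and the estimate ignores quantization constants, activations, optimizer state, and the LoRA adapters themselves.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

n_params = 3e9  # assumed round figure for a ~3B-parameter model
for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(n_params, bits):.2f} GiB")
```

Going from 16 bits to 4 bits per weight cuts the weight storage by about a factor of four, which is what lets models of this size fit on a single Colab GPU.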

In [None]:
!pip install unsloth

In [None]:
!pip install --upgrade torch
!pip install --upgrade xformers

In [None]:
!pip install --upgrade torchaudio torchvision fastai

In [None]:
!pip install ipywidgets --upgrade

# Install Flash Attention 2 for softcapping support
import torch
if torch.cuda.get_device_capability()[0] >= 8:
    !pip install --no-deps packaging ninja einops "flash-attn>=2.6.3"

In [None]:
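# Import unsloth before other model libraries so its patches are applied
from unsloth import FastLanguageModel
import torch
max_seq_length = 512  # maximum number of tokens per training example
dtype = None  # None lets Unsloth choose float16 or bfloat16 automatically
load_in_4bit = True  # load the base model weights in 4-bit precision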

model_id = "meta-llama/Llama-3.2-3B-Instruct"
#model_id = "google/gemma-2-2b"
new_model_id = "Llama-3.2-3B-Instruct-it"
#new_model_id = "gemma-2-2b-it"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_id,
    max_seq_length=max_seq_length,  # keep consistent with the trainer setting
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    trust_remote_code=True,
)


model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0.05,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
    max_seq_length = max_seq_length,
)
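The effect of `r`, `lora_alpha`, and `use_rslora` can be illustrated with a little arithmetic. The sketch below counts the parameters LoRA adds to a single linear layer and computes the scaling applied to the low-rank update. The hidden size used in the example is an assumed value for illustration, and the function names are hypothetical.

```python
import math

def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer."""
    # The update is B @ A, with A of shape (r, d_in) and B of shape (d_out, r).
    return r * (d_in + d_out)

def lora_scale(lora_alpha, r, use_rslora=False):
    """Multiplier applied to the low-rank update B @ A."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

d = 3072  # assumed hidden size, for illustration only
frozen = d * d
added = lora_param_count(d, d, r=32)
print(f"LoRA adds {added:,} parameters next to {frozen:,} frozen ones ({added / frozen:.2%})")
print("scale with r=32, lora_alpha=32:", lora_scale(lora_alpha=32, r=32))
```

With `lora_alpha` equal to `r`, as configured above, the standard scaling factor is 1, so the adapter output is added to the frozen layer's output unscaled.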

In [None]:
from datasets import load_dataset
dataset = load_dataset('neural-bridge/rag-dataset-12000')
print(f"Train dataset size: {len(dataset['train'])}")

In [None]:
print(dataset["train"]["formatted_text"][0])
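The cell that creates the `formatted_text` column is not shown in this notebook; the fields used elsewhere are `context`, `question`, and `answer`. If the column is not already present, it has to be built before training. A minimal sketch, assuming the training text reuses the prompt template from the inference sections and appends the reference answer (the name `format_example` and the `eos_token` argument are illustrative):

```python
PROMPT = (
    "Given the following passage, answer the related question.\n"
    "### Passage\n{context}\n### Question\n{question}\n### Answer\n{answer}"
)

def format_example(example, eos_token=""):
    """Build the training text for one record and return it as a new column."""
    text = PROMPT.format(
        context=example["context"],
        question=example["question"],
        answer=example["answer"],
    )
    # Appending an end-of-sequence token teaches the model where to stop.
    return {"formatted_text": text + eos_token}

sample = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answer": "Paris.",
}
print(format_example(sample, eos_token="</s>")["formatted_text"])
```

With the `datasets` library, this could be applied to every split with something like `dataset = dataset.map(lambda ex: format_example(ex, tokenizer.eos_token))`.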

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset=dataset["train"],
    max_seq_length = max_seq_length,
    dataset_text_field="formatted_text",
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        logging_steps = 10,
        warmup_steps = 10,
        save_steps=100,
        save_total_limit=2,
        max_steps=-1,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        group_by_length=True,
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)
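The batch settings above interact: gradients from several small batches are accumulated before each optimizer update. The sketch below works out the effective batch size and an approximate number of updates per epoch. The training-set size is an assumed figure (check `len(dataset["train"])` for the real one), a single GPU is assumed, and the helper names are hypothetical.

```python
import math

def effective_batch_size(per_device_batch, grad_accum, n_devices=1):
    """Number of examples contributing to each optimizer update."""
    return per_device_batch * grad_accum * n_devices

def updates_per_epoch(n_examples, per_device_batch, grad_accum, n_devices=1):
    """Approximate optimizer updates in one pass over the data."""
    return math.ceil(n_examples / effective_batch_size(per_device_batch, grad_accum, n_devices))

n_train = 9600  # assumed training-set size, for illustration only
print("effective batch size:", effective_batch_size(2, 4))
print("updates per epoch (approx.):", updates_per_epoch(n_train, 2, 4))
```

This also puts `save_steps=100` and `warmup_steps=10` in context: both are counted in optimizer updates, not in individual batches.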

In [None]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()
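The cell before training records the reserved GPU memory, but the notebook does not show a matching readout after training. A sketch of how the two readings could be compared is below. The helper `memory_report` and the example numbers are hypothetical; in the notebook, the peak value would come from `torch.cuda.max_memory_reserved()` after `trainer.train()` returns.

```python
def memory_report(start_gb, peak_gb, total_gb):
    """Summarize how much GPU memory training itself consumed."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "peak_gb": peak_gb,
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 1),
        "training_pct": round(training_gb / total_gb * 100, 1),
    }

# Made-up readings, for illustration only
report = memory_report(start_gb=2.5, peak_gb=6.0, total_gb=15.0)
for key, value in report.items():
    print(f"{key}: {value}")
```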

In [None]:
model.push_to_hub_merged(
    new_model_id,
    tokenizer=tokenizer,
    save_method="lora",
    token="",  # write your Hugging Face token here
    private=True
)

#Inference with fine-tuned model

In [None]:
!pip install unsloth

In [None]:
from unsloth import FastLanguageModel
import torch
import json

In [None]:
model_name = "Shodai1122/Llama-3.2-3B-Instruct-it"
#model_name = "Shodai1122/gemma-2-2b-it"

In [None]:
max_seq_length = 2048
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit
)
FastLanguageModel.for_inference(model)

In [None]:
from datasets import load_dataset
datasets = load_dataset('neural-bridge/rag-dataset-12000', split='test')
print(f"Test dataset size: {len(datasets)}")

In [None]:
datasets = datasets.select(range(100))
print(len(datasets))

In [None]:
print(datasets[0])

In [None]:
from tqdm import tqdm

# Inference
results = []
for dt in tqdm(datasets):
  context=dt["context"]
  question = dt["question"]
  answer = dt["answer"]

  prompt = f"""Given the following passage, answer the related question.\n### Passage\n{context}\n### Question\n{question}\n### Answer\n"""

  inputs = tokenizer([prompt], return_tensors = "pt").to(model.device)

  outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, do_sample=False, repetition_penalty=1.2)
  prediction = tokenizer.decode(outputs[0], skip_special_tokens=True).split('\n### Answer')[-1]

  results.append({"question": question, "output": prediction, "answer": answer})
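Because the decoded string contains the prompt as well as the generated text, the loop keeps only what follows the answer marker. The sketch below isolates that post-processing step in a small helper and also strips surrounding whitespace, which the loop above does not do. The helper name `extract_answer` is illustrative.

```python
ANSWER_MARKER = "\n### Answer"

def extract_answer(decoded):
    """Return the text after the last answer marker, without surrounding whitespace."""
    return decoded.split(ANSWER_MARKER)[-1].strip()

decoded = (
    "Given the following passage, answer the related question.\n"
    "### Passage\nSome passage.\n### Question\nWho?\n### Answer\nFabien Gabel.\n"
)
print(repr(extract_answer(decoded)))
```

Splitting on the last marker means that if the model itself emits another `### Answer` heading, only the text after that final heading is kept.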

In [None]:
# Use "gemma-2-2b-it_output.jsonl" instead when running the gemma model
with open("Llama-3.2-3B-Instruct-it_output.jsonl", 'w', encoding='utf-8') as f:
    for result in results:
        json.dump(result, f, ensure_ascii=False)
        f.write('\n')
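Since the output file name has to change with the model, one option is to derive it from the model identifier. The sketch below does that and also shows a write/read round trip in the JSON Lines format used here. The helper names are hypothetical, and a temporary directory stands in for the Drive folder.

```python
import json
import os
import tempfile

def output_path(model_name, directory="."):
    """Derive '<model>_output.jsonl' from a hub id such as 'user/model'."""
    return os.path.join(directory, model_name.split("/")[-1] + "_output.jsonl")

def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a JSON Lines file, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    path = output_path("Shodai1122/Llama-3.2-3B-Instruct-it", tmp)
    write_jsonl(path, [{"question": "q", "output": "o", "answer": "a"}])
    loaded = read_jsonl(path)

print(os.path.basename(path), loaded)
```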

#Inference with original model

In [None]:
!pip install bitsandbytes

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

In [None]:
#model_id = "google/gemma-2-2b"
model_id = "meta-llama/Llama-3.2-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
)

model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="auto",
            quantization_config=bnb_config,
            torch_dtype=torch.bfloat16,
        )

In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset
datasets = load_dataset('neural-bridge/rag-dataset-12000', split='test')
print(f"Test dataset size: {len(datasets)}")
datasets = datasets.select(range(100))
print(len(datasets))

In [None]:
from tqdm import tqdm
import json

# Inference
results = []
for dt in tqdm(datasets):
  context=dt["context"]
  question = dt["question"]
  answer = dt["answer"]

  prompt = f"""Given the following passage, answer the related question.\n### Passage\n{context}\n### Question\n{question}\n### Answer\n"""

  inputs = tokenizer([prompt], return_tensors = "pt").to(model.device)

  outputs = model.generate(**inputs, max_new_tokens = 512, use_cache = True, do_sample=False, repetition_penalty=1.2)
  prediction = tokenizer.decode(outputs[0], skip_special_tokens=True).split('\n### Answer')[-1]

  results.append({"question": question, "output": prediction, "answer": answer})

In [None]:
# Use "gemma-2-2b_output.jsonl" instead when running the gemma model
with open("Llama-3.2-3B-Instruct_output.jsonl", 'w', encoding='utf-8') as f:
    for result in results:
        json.dump(result, f, ensure_ascii=False)
        f.write('\n')

#Compare output of original model and fine-tuned model
For inference, I used the first 100 examples from the test split of the Retrieval-Augmented Generation (RAG) Dataset 12000.

Dataset: https://huggingface.co/datasets/neural-bridge/rag-dataset-12000


##gemma-2-2b

In [25]:
import json

file_path = 'gemma-2-2b_output.jsonl'

gemma_output = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            gemma_output.append(json.loads(line))

In [24]:
file_path = 'gemma-2-2b-it_output.jsonl'

gemma_it_output = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            gemma_it_output.append(json.loads(line))

In [20]:
print("Question:",gemma_output[0]['question'])
print("\n")
print("Output of original gemma model:",gemma_output[0]['output'])
print("\n")
print("Output of fine-tuned gemma model:",gemma_it_output[0]['output'])
print("Answer (GPT-4):",gemma_output[0]['answer'])

Question: Who is the music director of the Quebec Symphony Orchestra?


Output of original gemma model: 
a.) Jean-François Rivet b.) Gilles Apap c.) Laurent Piveron d.) Fabien Gabel e.) None of these


Output of fine-tuned gemma model: 
Fabien Gabel is the music director of the Quebec Symphony Orchestra.

Answer (GPT-4): The music director of the Quebec Symphony Orchestra is Fabien Gabel.


##Llama-3.2-3B

In [22]:
file_path = 'Llama-3.2-3B-Instruct_output.jsonl'

llama_output = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            llama_output.append(json.loads(line))

In [21]:
file_path = 'Llama-3.2-3B-Instruct-it_output.jsonl'

llama_it_output = []
with open(file_path, 'r', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            llama_it_output.append(json.loads(line))

In [26]:
print("Question:",llama_output[0]['question'])
print("\n")
print("Output of original llama model:",llama_output[0]['output'])
print("\n")
print("Output of fine-tuned llama model:",llama_it_output[0]['output'])
print("Answer (GPT-4):",llama_output[0]['answer'])

Question: Who is the music director of the Quebec Symphony Orchestra?


Output of original llama model: 
According to the passage, Fabien Gabel is the music director of the Quebec Symphony Orchestra.


Output of fine-tuned llama model: 
Fabien Gabel is the music director of the Quebec Symphony Orchestra.

Answer (GPT-4): The music director of the Quebec Symphony Orchestra is Fabien Gabel.


##Compare all models

In [29]:
question_num = 2 #Choose question number from 0 to 99

print("Question:",gemma_output[question_num]['question'])
print("\n")
print("Output of original gemma model:",gemma_output[question_num]['output'])
print("\n")
print("Output of fine-tuned gemma model:",gemma_it_output[question_num]['output'])
print("\n")
print("Output of original llama model:",llama_output[question_num]['output'])
print("\n")
print("Output of fine-tuned llama model:",llama_it_output[question_num]['output'])
print("Answer (GPT-4):",gemma_output[question_num]['answer'])

Question: What did Paul Wall offer to all U.S. Olympic Medalists?


Output of original gemma model: 
A gold grill


Output of fine-tuned gemma model: 
He offered free gold grills to all U.S. Olympic Medalists.



Output of original llama model: 
According to the passage, Paul Wall offered free gold grills to any team USA member who wins gold.


Output of fine-tuned llama model: 
Paul Wall promised his team he would give free gold grills to any U.S. Olympic medalist if they won the AAC title.

Answer (GPT-4): Paul Wall wants to give free gold grills to all U.S. Olympic Medalists.
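The comparison above is done by reading the outputs side by side. As a rough complement, a word-overlap score against the reference answer could be computed for each model. The sketch below uses a token-level F1, which is a swapped-in illustration and not a metric used in this notebook; the function names are hypothetical.

```python
from collections import Counter
import re

def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def token_f1(prediction, reference):
    """Harmonic mean of word-level precision and recall against the reference."""
    pred, ref = tokens(prediction), tokens(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "The music director of the Quebec Symphony Orchestra is Fabien Gabel."
prediction = "Fabien Gabel is the music director of the Quebec Symphony Orchestra."
print(token_f1(prediction, reference))
```

Averaging such a score over the 100 records in each output file would give one number per model, though word overlap ignores meaning and word order and should be read alongside the manual inspection rather than in place of it.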
