# Fine-Tuning LLMs with Hugging Face

## Step 1: Installing and importing the libraries

In [None]:
!pip install accelerate peft bitsandbytes transformers trl

Collecting bitsandbytes
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting trl
  Downloading trl-0.15.2-py3-none-any.whl.metadata (11 kB)
Collecting datasets>=2.21.0 (from trl)
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets>=2.21.0->trl)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets>=2.21.0->trl)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets>=2.21.0->trl)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=2.0.0->accelerate)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-man

In [None]:
!pip install huggingface_hub



In [None]:
!nvidia-smi

Thu Mar 20 15:37:56 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   40C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


In [None]:
!echo $LD_LIBRARY_PATH

/usr/lib64-nvidia


In [None]:
import torch
from trl import SFTTrainer
from peft import LoraConfig
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, pipeline)

## Step 2: Loading the model

In [None]:
llama_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path = "aboonaji/llama2finetune-v2",
                                                   quantization_config = BitsAndBytesConfig(load_in_4bit = True,
                                                                                            bnb_4bit_compute_dtype = getattr(torch, "float16"),
                                                                                            bnb_4bit_quant_type = "nf4"))
llama_model.config.use_cache = False
llama_model.config.pretraining_tp = 1

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/632 [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now default to True since model is quantized.


model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

adapter_model.bin:   0%|          | 0.00/33.6M [00:00<?, ?B/s]
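
The shard sizes in the download log give a rough idea of why 4-bit loading matters on a 15 GB T4. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes the two shards store fp16 weights (2 bytes per parameter) and ignores layers that bitsandbytes leaves unquantized, quantization constants, activations, and optimizer state. The helper names are hypothetical.

```python
# Rough weight-memory estimate derived from the fp16 shard sizes in the log.
# Assumption: the shards hold 2 bytes per parameter and nothing else of note.

FP16_SHARD_GB = [9.98, 3.50]  # sizes reported for the two safetensors shards


def estimate_params(shard_sizes_gb, bytes_per_param=2):
    """Infer an approximate parameter count from the shard sizes."""
    total_bytes = sum(shard_sizes_gb) * 1e9
    return total_bytes / bytes_per_param


def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = estimate_params(FP16_SHARD_GB)
fp16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, 4)

print(f"parameters   ~ {n_params / 1e9:.2f}B")
print(f"fp16 weights ~ {fp16_gb:.2f} GB")
print(f"4-bit weights ~ {nf4_gb:.2f} GB (lower bound, see assumptions)")
```

Under these assumptions the fp16 weights alone would nearly fill the GPU, while the 4-bit weights leave room for activations and the LoRA adapter.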

## Step 3: Loading the tokenizer

In [None]:
llama_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path = "aboonaji/llama2finetune-v2", trust_remote_code = True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"

tokenizer_config.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]
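
Reusing the EOS token as the pad token and padding on the right means shorter sequences in a batch are filled at the end and masked out. The toy function below is not the tokenizer's implementation; it only illustrates the resulting layout. `pad_right` is a hypothetical helper, the token ids are made up, and the EOS id of 2 is an assumption.

```python
def pad_right(batch, pad_id):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask


EOS_ID = 2  # assumption: id of the EOS token, reused here as the pad token
batch = [[1, 319, 350], [1, 315]]  # made-up token ids of different lengths

input_ids, attention_mask = pad_right(batch, pad_id=EOS_ID)
print(input_ids)       # [[1, 319, 350], [1, 315, 2]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 0]]
```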

## Step 4: Setting the training arguments

In [None]:
llama_model.gradient_checkpointing_enable()
training_arguments = TrainingArguments(output_dir = "./results", per_device_train_batch_size = 1, gradient_accumulation_steps = 2, max_steps = 100)
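
With a per-device batch size of 1 and 2 gradient accumulation steps, each optimizer step sees 2 examples. The short calculation below shows how much of the dataset 100 steps cover. It assumes a single GPU and uses the 6861-example train split reported when the trainer is built.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
max_steps = 100
num_devices = 1      # assumption: a single GPU
dataset_size = 6861  # examples in the train split

# Examples contributing to each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Total examples consumed over the whole run, and the share of one epoch.
examples_seen = effective_batch * max_steps
epoch_fraction = examples_seen / dataset_size

print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen}")
print(f"fraction of one epoch: {epoch_fraction:.1%}")
```

The run therefore touches only a small slice of the data, which is worth keeping in mind when reading the final loss.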

In [None]:
from accelerate import dispatch_model, infer_auto_device_map

# Fill the GPU first (up to 14 GiB) and spill any remaining layers to CPU (up to 4 GiB).
device_map = infer_auto_device_map(llama_model, max_memory={0: "14GiB", "cpu": "4GiB"})
llama_model = dispatch_model(llama_model, device_map=device_map)

## Step 5: Creating the Supervised Fine-Tuning trainer

In [None]:
import torch

torch.cuda.empty_cache()

In [None]:
%env PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

env: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
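
The `%env` line runs after the model has already been placed on the GPU. PyTorch's caching allocator generally reads `PYTORCH_CUDA_ALLOC_CONF` when CUDA memory is first used, so the setting may not take effect until the runtime is restarted. A minimal sketch, assuming it runs in the first cell before `torch` touches the GPU; `configure_cuda_allocator` is a hypothetical helper.

```python
import os


def configure_cuda_allocator(env=os.environ):
    """Set the allocator option unless a value was already chosen."""
    env.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    return env["PYTORCH_CUDA_ALLOC_CONF"]


allocator_conf = configure_cuda_allocator()
print(allocator_conf)

# Assumption: `import torch` and model loading happen only after this point.
```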


In [None]:
# Load the training split once so the trainer call stays readable.
train_dataset = load_dataset(path = "aboonaji/wiki_medical_terms_llam2_format", split = "train")

# Recent trl releases (including the 0.15.2 installed above) deprecate `tokenizer` in favor of `processing_class`.
llama_sft_trainer = SFTTrainer(model = llama_model,
                               args = training_arguments,
                               train_dataset = train_dataset,
                               processing_class = llama_tokenizer,
                               peft_config = LoraConfig(task_type = "CAUSAL_LM", r = 4, lora_alpha = 16, lora_dropout = 0.1))


Applying chat template to train dataset:   0%|          | 0/6861 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/6861 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/6861 [00:00<?, ? examples/s]
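
The `LoraConfig` above trains only small low-rank matrices. The estimate below counts those parameters for `r = 4`. It rests on several assumptions that the notebook does not state: a hidden size of 4096, 32 decoder layers, and PEFT's default target modules for Llama (`q_proj` and `v_proj`). `lora_params` is a hypothetical helper.

```python
hidden_size = 4096                     # assumption: Llama-2-7B hidden dimension
num_layers = 32                        # assumption: number of decoder layers
target_modules = ("q_proj", "v_proj")  # assumption: PEFT default targets for Llama
r = 4
lora_alpha = 16


def lora_params(in_features, out_features, rank):
    """Parameters in one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


per_module = lora_params(hidden_size, hidden_size, r)
trainable = per_module * len(target_modules) * num_layers
scaling = lora_alpha / r              # factor applied to the low-rank update
size_mb_fp32 = trainable * 4 / 1e6    # 4 bytes per parameter in fp32

print(f"LoRA parameters per module: {per_module}")
print(f"total trainable parameters: {trainable}")
print(f"scaling factor (alpha / r): {scaling}")
print(f"adapter size in fp32: {size_mb_fp32:.2f} MB")
```

The result, about 8.4 MB, is in line with the 8.41 MB `adapter_model.safetensors` file uploaded later in the notebook, which suggests the assumptions are reasonable.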

## Step 6: Training the model

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
llama_sft_trainer.train()


TrainOutput(global_step=100, training_loss=1.680132598876953, metrics={'train_runtime': 1897.4657, 'train_samples_per_second': 0.105, 'train_steps_per_second': 0.053, 'total_flos': 5418485636014080.0, 'train_loss': 1.680132598876953})
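
The reported throughput can be reconciled with the training arguments, and extrapolated to a full pass over the data. The sketch below uses only numbers from the `TrainOutput` above plus the effective batch size of 2 and the 6861-example dataset. The full-epoch figure assumes the per-step time stays constant, which is only an approximation.

```python
import math

train_runtime_s = 1897.4657
global_step = 100
effective_batch = 2   # batch size 1 x 2 gradient accumulation steps
dataset_size = 6861

seconds_per_step = train_runtime_s / global_step
steps_per_second = global_step / train_runtime_s
samples_per_second = global_step * effective_batch / train_runtime_s

# Extrapolate to one full epoch at the same speed.
steps_per_epoch = math.ceil(dataset_size / effective_batch)
epoch_hours = steps_per_epoch * seconds_per_step / 3600

print(f"seconds per step:   {seconds_per_step:.2f}")
print(f"steps per second:   {steps_per_second:.3f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"estimated hours for one epoch: {epoch_hours:.1f}")
```

The computed rates match the logged `train_steps_per_second` and `train_samples_per_second`, and the extrapolation shows why the run is capped with `max_steps` on a T4.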

## Step 7: Chatting with the model

In [None]:
# from transformers import pipeline
# user_prompt = "Please tell me about Ascariasis"
# text_generation_pipeline = pipeline(
#     "text-generation",
#     model=llama_model,
#     tokenizer=llama_tokenizer,
#     max_length=300,
#     device=0  # Forces GPU usage
# )
# model_answer = text_generation_pipeline(f"<s>[INST] {user_prompt} [/INST]")
# print(model_answer[0]['generated_text'])

## Step 8: Saving the fine-tuned model

In [None]:
# The trainer wraps the model with PEFT, so save_pretrained writes the LoRA adapter weights, not the full base model.
llama_sft_trainer.model.save_pretrained("C:/Users/aizel/Downloads/LUMS OFFICIAL DATA/LLM/LLM Project/fine_tuned_llama2/fine_tuned_llama2")
llama_tokenizer.save_pretrained("C:/Users/aizel/Downloads/LUMS OFFICIAL DATA/LLM/LLM Project/fine_tuned_llama2/fine_tuned_llama2")

# Save a second copy of the adapter, explicitly in safetensors format.
llama_sft_trainer.model.save_pretrained("C:/Users/aizel/Downloads/LUMS OFFICIAL DATA/LLM/LLM Project/fine_tuned_llama2/fine_tuned_llama2_adapter", safe_serialization=True)

print("Model saved to C:/Users/aizel/Downloads/LUMS OFFICIAL DATA/LLM/LLM Project/fine_tuned_llama2/fine_tuned_llama2")

Model saved to C:/Users/aizel/Downloads/LUMS OFFICIAL DATA/LLM/LLM Project/fine_tuned_llama2/fine_tuned_llama2


In [None]:
# Save the trained adapter and tokenizer to a local folder for upload
save_path = "fine_tuned_model"
llama_sft_trainer.model.save_pretrained(save_path)
llama_tokenizer.save_pretrained(save_path)

('fine_tuned_model/tokenizer_config.json',
 'fine_tuned_model/special_tokens_map.json',
 'fine_tuned_model/tokenizer.model',
 'fine_tuned_model/added_tokens.json',
 'fine_tuned_model/tokenizer.json')

In [None]:
!pip install huggingface_hub
!huggingface-cli login



    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
The token `finetune_llm_assignment_write_token` has been saved to /root/.cache/huggingface/stored_tokens
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You

In [None]:
from huggingface_hub import HfApi

repo_name = "Aizelsheikh/llama2-finetuned"  # Change this to your desired name

# Create a repo on Hugging Face
api = HfApi()
api.create_repo(repo_id=repo_name, repo_type="model", exist_ok=True)

# Upload the model to Hugging Face
api.upload_folder(
    folder_path="fine_tuned_model",
    repo_id=repo_name,
    repo_type="model"
)

print(f"Model uploaded to: https://huggingface.co/{repo_name}")


Upload 2 LFS files:   0%|          | 0/2 [00:00<?, ?it/s]

adapter_model.safetensors:   0%|          | 0.00/8.41M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Model uploaded to: https://huggingface.co/Aizelsheikh/llama2-finetuned


In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
import os

model_name = "Aizelsheikh/llama2-finetuned"  # Your uploaded model

os.makedirs("offload_folder", exist_ok=True)

llama_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    offload_folder="offload_folder",
    offload_state_dict = True
)

# Load the tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create text generation pipeline
text_generation_pipeline = pipeline(
    "text-generation",
    model=llama_model,
    tokenizer=llama_tokenizer,
    max_length=200,  # Limit output length for faster response
)

# Test the model with a prompt
user_prompt = "Please tell me about Ascariasis"
model_answer = text_generation_pipeline(f"<s>[INST] {user_prompt} [/INST]")

# Print generated response
print(model_answer[0]['generated_text'])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


adapter_config.json:   0%|          | 0.00/726 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/8.41M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.62M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


<s>[INST] Please tell me about Ascariasis [/INST]  Ascariasis is a parasitic infection caused by the Ascaris lumbricoides, a type of roundworm. It is one of the most common parasitic infections worldwide, affecting an estimated 1.5 billion people globally, particularly in developing countries with poor sanitation and hygiene.

Causes and Transmission:
Ascariasis is caused by the ingestion of infective larvae of the Ascaris lumbricoides worm. The larvae are found in contaminated soil, water, or food, and can enter the body through the mouth or nose. Once inside the body, the larvae migrate to the small intestine, where they mature and start to reproduce.

Symptoms:
The symptoms of ascariasis can
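
The generated text above repeats the prompt before the answer. The helpers below are a small sketch for building the `[INST]` prompt used in this notebook and for stripping it from the output. `build_prompt` and `extract_answer` are hypothetical names, not part of `transformers`. The text-generation pipeline also accepts `return_full_text=False`, which should achieve the same result without post-processing.

```python
def build_prompt(user_prompt: str) -> str:
    """Wrap a user message in the instruction format used in this notebook."""
    return f"<s>[INST] {user_prompt.strip()} [/INST]"


def extract_answer(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt from the start of the generated text, if present."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


prompt = build_prompt("Please tell me about Ascariasis")
# Stand-in for model output; a real call would come from the pipeline.
fake_output = prompt + "  Ascariasis is a parasitic infection."
print(extract_answer(fake_output, prompt))  # Ascariasis is a parasitic infection.
```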
