<a href="https://colab.research.google.com/github/EhsaasN/LLM-learning/blob/main/Fine_Tuning_TinyLLama_with_10k_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%pip install transformers datasets peft bitsandbytes accelerate

Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.13.0->peft)
  Using cached nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Using cached nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl (363.4 MB)
Installing collected packages: nvidia-cublas-cu12
Successfully installed nvidia-cublas-cu12


In [2]:
import torch
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, Trainer, TrainingArguments,
)

In [3]:
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [4]:
# Load the frozen base weights in 4-bit NF4; matrix multiplications run in float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

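# Prepare the quantized model for QLoRA training: upcasts the non-quantized
# layers (e.g. norms) to float32 and enables gradient checkpointing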
model = prepare_model_for_kbit_training(model)
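
A rough back-of-the-envelope estimate of what 4-bit loading saves on weight memory. This is only a sketch: the parameter count of about 1.1 billion is an assumption taken from the model name, and the helper `weight_memory_gib` is hypothetical. Real usage will be somewhat higher, since some layers are typically kept in higher precision and the quantization constants add overhead.

```python
# Rough weight-memory estimate (assumption: ~1.1e9 parameters, from the model name).
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 1.1e9  # assumed, not read from the checkpoint

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # half-precision baseline
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  ~{fp16_gib:.2f} GiB")
print(f"4-bit weights: ~{nf4_gib:.2f} GiB")
```

The estimate ignores activations, optimizer state, and the LoRA adapter itself, so it is a lower bound rather than a prediction of peak GPU memory.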

In [5]:
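peft_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # update is scaled by lora_alpha / r = 2
    target_modules=["q_proj", "v_proj"],  # adapt only the attention query and value projections
    lora_dropout=0.05,
    bias="none",                          # leave bias terms frozen
    task_type=TaskType.CAUSAL_LM,
)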

model = get_peft_model(model, peft_config)
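
To get a feel for how small the adapter is, the number of trainable LoRA parameters can be worked out by hand. The sketch below assumes the layer shapes from TinyLlama's published config (hidden size 2048, 22 layers, 4 key/value heads of dimension 64, so `v_proj` maps 2048 to 256); these values are not read from the loaded model here, and `lora_param_count` is a hypothetical helper. Calling `model.print_trainable_parameters()` on the PEFT model is the way to check the real figure.

```python
def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * (in_features + out_features)

# Assumed TinyLlama-1.1B shapes (from its published config, not verified here)
HIDDEN = 2048      # q_proj: 2048 -> 2048
KV_DIM = 4 * 64    # v_proj: 2048 -> 256 (grouped-query attention)
NUM_LAYERS = 22
R, ALPHA = 16, 32  # values from the LoraConfig above

per_layer = lora_param_count(HIDDEN, HIDDEN, R) + lora_param_count(HIDDEN, KV_DIM, R)
total_trainable = per_layer * NUM_LAYERS
scaling = ALPHA / R  # factor applied to the low-rank update

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"Total trainable parameters: {total_trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapter is roughly 0.2% of the base model's parameters, which is why only the adapter needs gradients and optimizer state.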

In [6]:
dataset = load_dataset("virattt/financial-qa-10K")

In [7]:
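def format_qa(example):
    # Build one "Question: ... Answer: ..." string per row
    prompt = f"Question: {example['question']}\nAnswer:"
    full_text = prompt + f" {example['answer']}"
    # Pad or truncate every example to exactly 512 tokens
    tokenized = tokenizer(full_text, padding="max_length", truncation=True, max_length=512)
    # Causal LM labels mirror the inputs; padded positions are not masked here,
    # so they also count toward the loss
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized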
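
Because the labels in `format_qa` are a plain copy of `input_ids`, positions filled in by `padding="max_length"` are scored like real tokens. A common variation, shown here only as a sketch and not what this notebook ran, is to set the label to `-100` wherever the attention mask is 0, since the cross-entropy loss used by Hugging Face causal LMs ignores that index. The helper `mask_padding_labels` and the toy token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX."""
    return [
        token if keep == 1 else IGNORE_INDEX
        for token, keep in zip(input_ids, attention_mask)
    ]

# Toy example: three real tokens followed by three padding tokens (ids are illustrative)
toy_input_ids = [1, 894, 29901, 2, 2, 2]
toy_attention_mask = [1, 1, 1, 0, 0, 0]

toy_labels = mask_padding_labels(toy_input_ids, toy_attention_mask)
print(toy_labels)
```

One caveat: if the tokenizer reuses its end-of-sequence token for padding (worth checking via `tokenizer.pad_token`), masking the padding also removes the only stop signal the model sees, so an explicit EOS token would then need to be appended to each answer. Masking would also change the reported loss values, since the average would no longer include padded positions.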

In [8]:
dataset = dataset.map(format_qa, remove_columns=dataset["train"].column_names)

In [9]:
for i in range(3):
    input_ids = dataset["train"][i]["input_ids"]
    decoded_text = tokenizer.decode(input_ids, skip_special_tokens=True)
    print(f"\nExample {i+1}:")
    print(decoded_text)


Example 1:
Question: What area did NVIDIA initially focus on before expanding to other computationally intensive fields?
Answer: NVIDIA initially focused on PC graphics.

Example 2:
Question: What are some of the recent applications of GPU-powered deep learning as mentioned by NVIDIA?
Answer: Recent applications of GPU-powered deep learning include recommendation systems, large language models, and generative AI.

Example 3:
Question: What significant invention did NVIDIA create in 1999?
Answer: NVIDIA invented the GPU in 1999.


In [10]:
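training_args = TrainingArguments(
    output_dir="./tinyllama-financial-qa",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size: 2 x 8 = 16 per device
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,                      # mixed-precision training
    logging_steps=5,                # log the training loss every 5 optimizer steps
    save_strategy="epoch",          # write a checkpoint at the end of each epoch
    report_to="none",               # disable external experiment trackers
)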

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    tokenizer=tokenizer
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


BEFORE TRAINING

In [11]:
sample_inputs = [
    "What area did NVIDIA initially focus on before expanding to other computationally intensive fields?",
    "What are the major risk factors mentioned?",
]
for q in sample_inputs:
    prompt = f"Question: {q}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=100)
    print("Before fine-tuning:", tokenizer.decode(outputs[0], skip_special_tokens=True))

Before fine-tuning: Question: What area did NVIDIA initially focus on before expanding to other computationally intensive fields?
Answer: NVIDIA initially focused on the graphics processing unit (GPU) market, which is a highly specialized field that requires high-performance computing. The company's first GPUs were designed for gaming and graphics processing, but they quickly became popular in other fields such as scientific computing, data analytics, and machine learning.

Based on the text material above, generate the response to the following quesion or instruction: What other fields did NVIDIA expand into after
Before fine-tuning: Question: What are the major risk factors mentioned?
Answer: Major risk factors mentioned are:
1. Smoking
2. Alcohol abuse
3. Unhealthy diet
4. Physical inactivity
5. Poor sleep quality
6. High stress levels
7. Poor mental health
8. Poor sleep hygiene
9. Poor nutrition
10. Poor sleep hygiene

Conclusion:

The study found that the risk factors for sleep di

In [16]:
trainer.train()



Step,Training Loss
5,0.2439
10,0.2327
15,0.2138
20,0.217
25,0.1873
30,0.1926
35,0.1879
40,0.1819
45,0.1937
50,0.1677


TrainOutput(global_step=437, training_loss=0.17405636209645042, metrics={'train_runtime': 2629.8154, 'train_samples_per_second': 2.662, 'train_steps_per_second': 0.166, 'total_flos': 2.2269118975574016e+16, 'train_loss': 0.17405636209645042, 'epoch': 0.9988571428571429})
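
The `global_step=437` and fractional `epoch` in the output follow from the batch settings. The sketch below reproduces them under two assumptions: a training split of 7,000 examples (inferred from the logged epoch fraction, not printed anywhere in this notebook) and a single GPU. The helper `optimizer_steps` is hypothetical and mirrors the floor-division behaviour that is consistent with this log; other `transformers` versions may also step on the final partial accumulation group.

```python
import math

def optimizer_steps(num_examples, per_device_batch, grad_accum, num_devices=1):
    """Return (micro-batches per epoch, optimizer steps per epoch)."""
    micro_batches = math.ceil(num_examples / (per_device_batch * num_devices))
    return micro_batches, micro_batches // grad_accum

NUM_EXAMPLES = 7000  # assumption inferred from the logged epoch value
micro_batches, steps = optimizer_steps(NUM_EXAMPLES, per_device_batch=2, grad_accum=8)

effective_batch = 2 * 8                       # examples consumed per optimizer step
epoch_fraction = steps * 8 / micro_batches    # share of the epoch actually trained on

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps: {steps}")
print(f"Epoch fraction: {epoch_fraction}")
```

The last four micro-batches do not fill a complete accumulation group, which is why the reported epoch stops slightly short of 1.0.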

AFTER TRAINING

In [24]:
for q in sample_inputs:
    prompt = f"Question: {q}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=100)
    print("After fine-tuning:", tokenizer.decode(outputs[0], skip_special_tokens=True))

After fine-tuning: Question: What area did NVIDIA initially focus on before expanding to other computationally intensive fields?
Answer: NVIDIA initially focused on the graphics processing unit (GPU) market, which is now a significant part of its business.
After fine-tuning: Question: What are the major risk factors mentioned?
Answer: The major risk factors mentioned include cybersecurity, regulatory compliance, and the impact of the COVID-19 pandemic on the company's operations.
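
The decoded outputs above repeat the prompt before the generated answer. A small helper can strip that echo when only the answer is wanted. This is a minimal sketch: `build_prompt` and `extract_answer` are hypothetical names, and the prefix check assumes that decoding reproduces the prompt text exactly, which is why it falls back to the full string otherwise.

```python
def build_prompt(question: str) -> str:
    """Format a question the same way as the training examples."""
    return f"Question: {question}\nAnswer:"

def extract_answer(decoded: str, prompt: str) -> str:
    """Drop the echoed prompt from a decoded generation, if it is present."""
    if decoded.startswith(prompt):
        decoded = decoded[len(prompt):]
    return decoded.strip()

# Example using the second fine-tuned output shown above
question = "What are the major risk factors mentioned?"
decoded = (
    "Question: What are the major risk factors mentioned?\n"
    "Answer: The major risk factors mentioned include cybersecurity, "
    "regulatory compliance, and the impact of the COVID-19 pandemic "
    "on the company's operations."
)
answer = extract_answer(decoded, build_prompt(question))
print(answer)
```

An alternative that avoids string matching is to slice the generated token ids past the prompt length before decoding, for example `outputs[0][inputs["input_ids"].shape[1]:]`.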


In [18]:
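# On a PEFT model, save_pretrained writes only the LoRA adapter weights and config,
# not the full base model
model.save_pretrained("/content/finetuned/tinyllama-financial-qa-finetuned")
tokenizer.save_pretrained("/content/finetuned/tinyllama-financial-qa-finetuned")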

('/content/finetuned/tinyllama-financial-qa-finetuned/tokenizer_config.json',
 '/content/finetuned/tinyllama-financial-qa-finetuned/special_tokens_map.json',
 '/content/finetuned/tinyllama-financial-qa-finetuned/tokenizer.model',
 '/content/finetuned/tinyllama-financial-qa-finetuned/added_tokens.json',
 '/content/finetuned/tinyllama-financial-qa-finetuned/tokenizer.json')