<a href="https://colab.research.google.com/github/Aryant01/LLaMA2-fine-tunning/blob/main/Untitled4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install torch transformers peft bitsandbytes datasets accelerate trl wandb

This command installs the Python libraries needed for fine-tuning:

*   torch - PyTorch, the deep learning framework.
*   transformers - Hugging Face’s library for loading and training large language models.
*   peft - Parameter-efficient fine-tuning methods such as LoRA, the adapter technique used in QLoRA.
*   bitsandbytes - 4-bit and 8-bit quantization to reduce memory usage.
*   datasets - Loading and processing of datasets from the Hugging Face Hub.
*   accelerate - Device placement and training across one or more GPUs.
*   trl - Transformer Reinforcement Learning, which also provides supervised fine-tuning utilities.
*   wandb - Weights & Biases logging and experiment tracking.
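To confirm that the install succeeded before moving on, the installed versions can be listed. This is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages installed by the pip command above.
REQUIRED = ["torch", "transformers", "peft", "bitsandbytes",
            "datasets", "accelerate", "trl", "wandb"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so missing installs are easy to spot.
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` needs the install cell to be re-run.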

In [None]:
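# Optional: reinstall a CPU-only build of PyTorch. Skip this cell on a GPU runtime,
# because the 4-bit loading and generation cells below require CUDA.
!pip uninstall -y torch
!pip install torch --index-url https://download.pytorch.org/whl/cpu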

In [None]:
from huggingface_hub import login

# Log in with your Hugging Face token.
# Request access on Hugging Face’s LLaMA 2 model card first; otherwise the download below fails with an access error.
login()

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 7B LLaMA 2 model
model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

*   Logs into Hugging Face to access LLaMA 2.
*   Loads the 7-billion-parameter LLaMA 2 model from Hugging Face.
*   With device_map="auto", places the model on the available GPUs automatically and spills to CPU memory if it does not fit.

**If access is denied:**

*   Apply for access on Hugging Face’s LLaMA 2 model card.

In [None]:
from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca")
print(dataset)

*  Loads the Alpaca dataset, a widely used instruction-following dataset.

In [None]:
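from datasets import Dataset, DatasetDict
# `data` must be a list of dicts; here it is taken from the Alpaca train split loaded above
data = dataset["train"].to_list()
dataset = DatasetDict({"train": Dataset.from_list(data)})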

*   Ensures the dataset follows the correct structure for fine-tuning.
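Each Alpaca record has `instruction`, `input`, and `output` fields. The sketch below shows how one record could be turned into a single training prompt. The template follows the commonly used Alpaca prompt layout as I understand it, and both the `format_alpaca` helper and the sample record are made up for illustration.

```python
def format_alpaca(example):
    """Build one training prompt from an Alpaca-style record."""
    if example.get("input"):
        header = ("Below is an instruction that describes a task, paired with an input "
                  "that provides further context. Write a response that appropriately "
                  "completes the request.\n\n")
        body = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n")
    else:
        header = ("Below is an instruction that describes a task. Write a response "
                  "that appropriately completes the request.\n\n")
        body = f"### Instruction:\n{example['instruction']}\n\n"
    return header + body + f"### Response:\n{example['output']}"


# Hypothetical record, made up for illustration (not taken from the dataset).
sample = {"instruction": "Add the two numbers.", "input": "2 and 3", "output": "5"}
print(format_alpaca(sample))
```

Records with an empty `input` field use the shorter header and omit the `### Input:` section.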

In [None]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

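*   Loads LLaMA 2 with 4-bit weights to save VRAM.
*   Stores the weights in the NF4 data type with double quantization and runs computation in float16.
*   Reduces weight memory for the 7B model from roughly 14 GB in float16 to about 4 GB; training needs additional memory for activations and the adapter's optimizer state.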
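A back-of-the-envelope estimate of the memory taken by the weights alone at different precisions. This sketch assumes a round 7 billion parameters and ignores quantization constants, activations, and optimizer state, so real usage is higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: round figure for a "7B" model

# Compare full precision, half precision, and 4-bit storage.
for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:8s} {weight_memory_gb(NUM_PARAMS, bits):5.1f} GB")
```

The 4-bit figure is a quarter of the float16 figure, which is what makes a 7B model fit on a single consumer or Colab GPU.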

In [None]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

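*   LoRA trains small low-rank adapter matrices instead of fine-tuning the full model; the base weights stay frozen.
*   The q_proj and v_proj modules (the attention query and value projections) are targeted, with rank r=16 and scaling lora_alpha=32.
*   Reduces trainable parameters from billions to millions.
*   Saves memory and compute because gradients and optimizer state exist only for the adapters, usually with little loss in quality.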
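The drop from billions to millions of trainable parameters can be checked by hand. The sketch below assumes LLaMA 2 7B has a hidden size of 4096 and 32 decoder layers, with q_proj and v_proj each being a 4096 x 4096 linear layer; the helper name `lora_param_count` is made up for illustration.

```python
def lora_param_count(in_features, out_features, r):
    """A LoRA adapter adds A (r x in) and B (out x r) next to a frozen weight."""
    return r * in_features + out_features * r


# Assumptions: LLaMA 2 7B uses hidden size 4096 and 32 decoder layers,
# and q_proj / v_proj are both 4096 x 4096 linear layers.
HIDDEN, LAYERS, R = 4096, 32, 16
TARGETS_PER_LAYER = 2  # q_proj and v_proj

per_module = lora_param_count(HIDDEN, HIDDEN, R)
trainable = per_module * TARGETS_PER_LAYER * LAYERS
print(f"{trainable:,} trainable LoRA parameters")  # 8,388,608
```

Under these assumptions the result should match the trainable count reported by `model.print_trainable_parameters()`.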

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="adamw_torch",
    save_total_limit=2,
    save_steps=500,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_steps=10,
    output_dir="./llama2-finetuned",
    report_to="wandb",
)

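*   Batch size = 2 per device to avoid running out of GPU memory.
*   Gradient accumulation = 4, so weights are updated once every 2 x 4 = 8 samples per device.
*   AdamW optimizer (PyTorch implementation) with a learning rate of 2e-4 for 3 epochs.
*   Saves a checkpoint every 500 steps and keeps only the 2 most recent.
*   Logs metrics every 10 steps to Weights & Biases (WandB).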
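The settings above determine how many optimizer steps and checkpoints a run produces. This is a rough sketch assuming a single GPU and about 52,000 training samples (an illustrative round number for Alpaca); the Trainer's exact step count can differ slightly because of how it rounds the last partial batch. The helper `training_schedule` is hypothetical.

```python
import math


def training_schedule(num_samples, per_device_batch, grad_accum, epochs,
                      save_steps, num_devices=1):
    """Estimate effective batch size, optimizer steps, and checkpoint count."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "checkpoints_written": total_steps // save_steps,
    }


# Assumption: ~52,000 samples, one GPU, and the values from TrainingArguments above.
plan = training_schedule(52_000, per_device_batch=2, grad_accum=4,
                         epochs=3, save_steps=500)
print(plan)
```

With `save_total_limit=2`, only the two most recent of those checkpoints remain on disk at any time.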

In [None]:
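from transformers import Trainer, DataCollatorForLanguageModeling

# LLaMA 2 has no pad token; reuse EOS so batches can be padded
tokenizer.pad_token = tokenizer.eos_token

# Tokenize Alpaca's prompt-formatted "text" column and drop the raw string columns
# (max_length=512 is an assumed cap; adjust it to your GPU memory)
train_dataset = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset["train"].column_names,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()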

*   Handles training loops, logging, and checkpointing automatically.

In [None]:
input_text = "Translate 'Hello, how are you?' to French."
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

*   Encodes input text, passes it through the model, and generates output.
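The generation loop can be illustrated with a toy example. The sketch below implements greedy decoding, the simplest strategy: pick the highest-scoring next token until an end-of-sequence token or the token limit is reached. The score table is a made-up stand-in for a language model and conditions only on the last token, whereas a real model conditions on the whole sequence; `model.generate` may also sample rather than decode greedily, depending on the model's generation config.

```python
# Hypothetical stand-in for a language model: maps the last token to
# scores for each possible next token.
TOY_SCORES = {
    "<s>": {"Bonjour": 2.0, "Salut": 1.0},
    "Bonjour": {",": 1.5, "</s>": 0.5},
    ",": {"ça": 1.2, "</s>": 0.3},
    "ça": {"va": 2.0, "</s>": 0.1},
    "va": {"?": 1.0, "</s>": 0.9},
    "?": {"</s>": 1.0},
}


def greedy_generate(prompt_tokens, max_new_tokens, eos="</s>"):
    """Append the highest-scoring next token until EOS or the token limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = TOY_SCORES.get(tokens[-1], {eos: 0.0})
        next_token = max(scores, key=scores.get)
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens


print(greedy_generate(["<s>"], max_new_tokens=50))
```

Lowering `max_new_tokens` cuts the output short, in the same way the `max_new_tokens=50` argument bounds the real generation call.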

In [None]:
model.save_pretrained("./llama2-qlora-finetuned")
tokenizer.save_pretrained("./llama2-qlora-finetuned")

*   Saves the LoRA adapter weights and the tokenizer, so the fine-tuned model can be reloaded on top of the base model without retraining.

In [None]:
from transformers import pipeline

pipe = pipeline("text-generation", model="./llama2-qlora-finetuned", tokenizer=tokenizer)
response = pipe("Summarize: LLaMA 2 is an advanced AI model by Meta.", max_new_tokens=50)
print(response[0]['generated_text'])

*  Creates a text-generation pipeline.
*  Accepts user input and generates responses.