<a href="https://colab.research.google.com/github/LuluW8071/Llama2-LLM-Text-Generation/blob/main/LLAMA_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Llama 2](https://llama.meta.com/llama2) and Model Fine-Tuning

**Llama 2** is a collection of second-generation, openly available Large Language Models (LLMs) from Meta, designed to handle a wide range of natural language processing tasks. The models range in scale from 7 billion to 70 billion parameters.

**Llama-2-Chat**, the variant optimized for dialogue, is reported by Meta to perform comparably to popular closed-source models such as ChatGPT and PaLM in human evaluations.

**Fine-tuning** in machine learning means continuing to train a pre-trained model on a new, task-specific dataset, updating the model's weights so that it performs better on that task.

<img src = "https://images.datacamp.com/image/upload/v1697724450/Fine_Tune_L_La_MA_2_cc6aa0e4ad.png">

In [None]:
!pip install -U datasets



In [None]:
# %pip install accelerate peft bitsandbytes transformers trl
!pip install -U datasets trl accelerate peft bitsandbytes transformers huggingface_hub



## Importing Necessary Libraries

In [None]:
import os
import pandas as pd
import torch

from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
from huggingface_hub import login

print(torch.__version__)

# Set up device-agnostic code
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)

2.1.0+cu121
cuda


## Model Configuration
We use NousResearch’s `Llama-2-7b-chat-hf` as the base model. It is the same as Meta’s official Llama 2 7B chat model on Hugging Face, but it is easier to access because it can be downloaded without requesting access.

### [See Guanaco Dataset](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k)

In [None]:
# Model from Hugging Face hub with 7 billion parameters
base_model = "NousResearch/Llama-2-7b-chat-hf"

# New instruction dataset
guanaco_dataset = "mlabonne/guanaco-llama2-1k"

# Fine-tuned model
new_model = "llama-2-7b-chat-guanaco"

## Loading dataset, model, and tokenizer

In [None]:
dataset = load_dataset(guanaco_dataset, split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## 4-bit Quantization Configuration
4-bit quantization via QLoRA makes it feasible to fine-tune large LLMs on consumer hardware while retaining most of their performance. This makes these models far more accessible for real-world applications.

QLoRA quantizes a pre-trained language model to 4 bits and freezes its parameters. A small number of trainable Low-Rank Adapter (LoRA) layers are then added to the model.

During fine-tuning, gradients are backpropagated through the frozen 4-bit quantized model into the Low-Rank Adapter layers only. The pretrained weights stay fixed at 4 bits while only the adapters are updated. The QLoRA paper reports that this setup matches the performance of 16-bit fine-tuning on the benchmarks its authors tested.
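
To make the memory saving concrete, here is a rough back-of-the-envelope sketch. It counts only the storage for the weights and assumes a nominal 7 billion parameters. Quantization constants, activations, adapter weights, and optimizer state are ignored, so real usage will be higher. The helper name `weight_memory_gib` is made up for this illustration.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: nominal "7B" parameter count; the real checkpoint differs slightly.
NUM_PARAMS = 7e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # 16-bit weights
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # 4-bit weights

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights : {nf4_gib:.1f} GiB")
print(f"Reduction     : {fp16_gib / nf4_gib:.0f}x")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint, which is what lets a 7B model fit on a single Colab GPU.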

<img src = "https://images.datacamp.com/image/upload/v1697713094/image7_3e12912d0d.png">

### [Paper on QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)

In [None]:
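# Dtype used for computation; the weights themselves stay stored in 4 bits
compute_dtype = torch.float16

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # Load the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat data type from the QLoRA paper
    bnb_4bit_compute_dtype=compute_dtype,  # Dequantize to float16 for computation
    bnb_4bit_use_double_quant=False,       # Do not quantize the quantization constants
)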

## Loading `Llama 2 model`

In [None]:
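model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0}               # Place the entire model on GPU 0
)
model.config.use_cache = False       # Disable the KV cache during training (it conflicts with gradient checkpointing)
model.config.pretraining_tp = 1      # Use the standard, non-tensor-parallel linear layer computation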

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



## Load the Tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 defines no pad token, so reuse EOS
tokenizer.padding_side = "right"           # Pad on the right during training

## PEFT Parameters
Traditional fine-tuning of pre-trained language models (PLMs) updates all of the model's parameters, which is computationally expensive and memory-intensive.

Parameter-Efficient Fine-Tuning (PEFT) updates only a small subset of the model's parameters, making it much more efficient. To learn what each LoRA parameter does, read the [PEFT official documentation](https://huggingface.co/docs/peft/conceptual_guides/lora).
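
As a rough illustration of how small that subset is, the sketch below counts the parameters LoRA would add for the rank used in this notebook (`r = 64`). The model shapes (hidden size 4096, 32 layers) and the choice of adapting only the `q_proj` and `v_proj` projections are assumptions about Llama-2-7B and PEFT's defaults, not something stated in this notebook, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumed Llama-2-7B shapes (not stated in this notebook)
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
ADAPTED_PER_LAYER = 2    # assumption: adapters on q_proj and v_proj only

r, lora_alpha = 64, 16   # values from the LoraConfig in this notebook

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, r)
trainable = per_module * ADAPTED_PER_LAYER * NUM_LAYERS
full_module = HIDDEN_SIZE * HIDDEN_SIZE

print(f"LoRA params per projection : {per_module:,}")
print(f"Total trainable LoRA params: {trainable:,}")
print(f"Share of one projection    : {per_module / full_module:.1%}")
print(f"LoRA scaling (lora_alpha/r): {lora_alpha / r}")
```

Under these assumptions the adapters hold tens of millions of parameters, a small fraction of the roughly 7 billion frozen weights.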

In [None]:
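peft_params = LoraConfig(
    lora_alpha = 16,          # Scaling factor applied to the adapter update
    lora_dropout = 0.1,       # Dropout probability on the adapter layers
    r = 64,                   # Rank of the low-rank update matrices
    bias = "none",            # Do not train bias terms
    task_type = "CAUSAL_LM",
)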

## Training Parameters

| Hyperparameter               | Description                                                                          |
|------------------------------|--------------------------------------------------------------------------------------|
| output_dir                   | Output directory for storing model predictions and checkpoints.                      |
| num_train_epochs             | Number of training epochs.                                                           |
| fp16/bf16                    | Enable or disable fp16/bf16 mixed-precision training.                                |
| per_device_train_batch_size  | Batch size per GPU for training.                                                     |
| per_device_eval_batch_size   | Batch size per GPU for evaluation.                                                   |
| gradient_accumulation_steps  | Number of batches whose gradients are accumulated before each optimizer update.      |
| gradient_checkpointing       | Enable gradient checkpointing (recompute activations to save memory).                |
| max_grad_norm                | Maximum gradient norm used for gradient clipping.                                    |
| learning_rate                | Initial learning rate.                                                               |
| weight_decay                 | Weight decay applied to all layers except bias/LayerNorm weights.                    |
| optim                        | Optimizer to use (here, AdamW).                                                      |
| lr_scheduler_type            | Learning rate schedule.                                                              |
| max_steps                    | Total number of training steps; overrides num_train_epochs when positive.            |
| warmup_ratio                 | Fraction of total training steps used for linear warmup.                             |
| group_by_length              | Group samples of similar length into batches to reduce padding and speed up training. |
| save_steps                   | Number of update steps between checkpoint saves.                                     |
| logging_steps                | Number of update steps between log entries.                                          |


In [None]:
training_params = TrainingArguments(
    output_dir = "./results",
    num_train_epochs = 1,              # One pass over the dataset; increase if the model underfits
    per_device_train_batch_size = 2,   # Small batch size to fit in limited GPU memory
    gradient_accumulation_steps = 8,   # Accumulate gradients to compensate for the small batch size
    optim = "adamw_torch",             # PyTorch implementation of AdamW
    save_steps = 1000,                 # Save a checkpoint every 1000 update steps
    logging_steps = 1000,              # Log every 1000 update steps
    learning_rate = 5e-6,              # Very low learning rate to mitigate instability
    weight_decay = 0.01,               # Regularization to reduce overfitting
    fp16 = True,                       # Enable fp16 mixed precision for memory savings
    bf16 = False,                      # T4 doesn't support bfloat16
    max_grad_norm = 0.5,               # Clip gradients to this maximum norm
    max_steps = -1,                    # -1 means num_train_epochs determines the training length
    warmup_ratio = 0.1,                # Ignored by the "constant" schedule; applies to schedules with warmup
    group_by_length = True,            # Batch samples of similar length to reduce padding
    lr_scheduler_type = "constant",    # Constant learning rate with no warmup
    report_to = "tensorboard",         # Track training progress with TensorBoard

    # Additional memory-specific optimizations:
    gradient_checkpointing = True,     # Recompute intermediate activations for memory savings
    fp16_full_eval = True,             # Run evaluation fully in fp16
    dataloader_pin_memory = False,     # Disable pinned host memory in the data loader
    local_rank = -1,                   # -1 means no distributed training (single GPU)
)
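
A quick sanity check on these settings helps when choosing `save_steps` and `logging_steps`. The sketch below estimates the effective batch size and the number of optimizer steps. The dataset size of 1,000 examples is an assumption based on the `-1k` suffix in the dataset name, and a single GPU is assumed. The exact step count reported by the trainer may differ by one depending on how it rounds.

```python
import math

num_examples = 1000                 # assumption: size implied by the "-1k" dataset name
per_device_train_batch_size = 2     # from the training arguments
gradient_accumulation_steps = 8     # from the training arguments
num_gpus = 1                        # assumption: single-GPU Colab runtime
num_train_epochs = 1
logging_steps = save_steps = 1000

# Examples consumed per optimizer update
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Approximate number of optimizer updates for the whole run
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

print(f"Effective batch size   : {effective_batch}")
print(f"Approx. optimizer steps: {total_steps}")
print(f"Reaches logging_steps? : {total_steps >= logging_steps}")
```

Under these assumptions the run finishes in well under 1,000 update steps, so no intermediate checkpoint or log entry would be produced with the step intervals above. Lowering `logging_steps` is one way to see the loss during training.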

## Model fine-tuning
Supervised fine-tuning (SFT) is a key step in reinforcement learning from human feedback (RLHF). The TRL library from Hugging Face provides an easy-to-use API for creating SFT models and training them on your dataset with just a few lines of code. It includes tools for training language models with reinforcement learning, starting with supervised fine-tuning, then reward modeling, and finally proximal policy optimization (PPO).

Provide the SFT Trainer with the model, dataset, LoRA configuration, tokenizer, and training parameters.

In [None]:
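# Note: these keyword arguments follow the TRL API this notebook was written against;
# newer TRL releases may expect some of them in SFTConfig instead
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    peft_config = peft_params,
    dataset_text_field = "text",
    max_seq_length = None,
    tokenizer = tokenizer,
    args = training_params,
    packing = False,
)

# Run fine-tuning; without this call the adapters saved below are untrained
trainer.train()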



In [None]:
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

('llama-2-7b-chat-guanaco/tokenizer_config.json',
 'llama-2-7b-chat-guanaco/special_tokens_map.json',
 'llama-2-7b-chat-guanaco/tokenizer.model',
 'llama-2-7b-chat-guanaco/added_tokens.json',
 'llama-2-7b-chat-guanaco/tokenizer.json')

In [None]:
# from tensorboard import notebook
# log_dir = "results/runs"
# notebook.start("--logdir {} --port 4000".format(log_dir))

## Testing Text Generation

In [None]:
# Keyword arguments for the text-generation pipeline
config = {
    "task": "text-generation",
    "model": model,
    "tokenizer": tokenizer,
    "max_length": 250,   # Maximum total length (prompt + generated tokens)
}

In [None]:
logging.set_verbosity(logging.CRITICAL)

prompt = "Who is Leonardo Da Vinci?"
pipe = pipeline(**config)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])



<s>[INST] Who is Leonardo Da Vinci? [/INST]  Leonardo da Vinci (1452-1519) was a true Renaissance man, a polymath who excelled in various fields, including art, science, engineering, mathematics, and anatomy. everybody knows him as the most famous artist of the Italian Renaissance, but he was also a prolific inventor, engineer, and scientist. Here are some key facts about Leonardo da Vinci:

1. Early Life: Leonardo was born in Vinci, Italy, on April 15, 1452. His father, Messer Piero Fruosini, was a notary, and his mother, Caterina Buti, was a peasant.
2. Artistic Career: Leonardo began his artistic career as a young man in Florence, where he was apprenticed to the artist Andrea del Verrocchio. He became one of the most renowned painters of his time, creating masterpieces such as the Mona Lisa, The Last Supper, and Virgin of the Rocks.
3. Inventions and Engineering: Leon
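
The prompt above wraps the question in `<s>[INST] ... [/INST]`, the markers the Llama 2 chat models expect. A prompt sent without them may be treated as plain text to continue rather than as an instruction. The helper below is a minimal sketch for building such prompts consistently. The function name `build_llama2_prompt` is hypothetical, and the optional `<<SYS>>` system block follows Meta's published chat format rather than anything used in this notebook.

```python
from typing import Optional


def build_llama2_prompt(user_message: str, system_prompt: Optional[str] = None) -> str:
    """Wrap a single user message in Llama 2 chat markers."""
    content = user_message.strip()
    if system_prompt:
        # Assumption: system block layout from Meta's published chat format
        content = f"<<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{content}"
    return f"<s>[INST] {content} [/INST]"


print(build_llama2_prompt("Who is Leonardo Da Vinci?"))
```

The returned string can be passed to the pipeline in place of a hand-written f-string, for example `pipe(build_llama2_prompt(prompt))`.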


In [None]:
prompt = "What is github?"

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=500)
result = pipe(f"{prompt}")
print(result[0]['generated_text'])

What is github?
 nobody knows.

But seriously, GitHub is a web-based platform that allows developers to store, manage, and collaborate on code projects. It was founded in 2008 by Chris Wanstrath, Scott Chacon, and Tom Prestley, and has since become one of the most popular platforms for software development and version control.

Here are some key features of GitHub:

1. Version control: GitHub allows developers to store and manage different versions of their code, making it easier to track changes and collaborate with others.
2. Collaboration: GitHub enables developers to invite others to collaborate on a project, allowing multiple people to work on the same codebase simultaneously.
3. Code reviews: GitHub provides a feature called "code reviews" that allows developers to review each other's code changes, ensuring that the code is high-quality and meets the project's requirements.
4. Issue tracking: GitHub allows developers to track issues and bugs in the code, making it easier to ident

In [None]:
prompt = "What is youtube?"

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=100)
result = pipe(f"{prompt}")
print(result[0]['generated_text'])

What is youtube?
 nobody knows.

Answer: YouTube is a video-sharing platform where users can upload, share, and view videos. It was founded in 2005 by Steve Chen, Chad Hurley, and Jawed Karim and was later acquired by Google in 2006. YouTube has become one of the most popular websites on the internet, with billions of users and millions of hours of content available to watch. Users can


In [None]:
prompt = "Are you dumb?"

pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=500)
result = pipe(f"{prompt}")
print(result[0]['generated_text'])

Are you dumb?
 Unterscheidung between the two is not always clear-cut, and different people may have different opinions on the matter. However, here are some general differences between the two:

1. **Definition:** **Dumb** generally refers to something that is stupid or foolish, while **dumb** can refer to a person who is lacking in intelligence or mental ability.

2. **Usage:** **Dumb** is often used in a derogatory manner to insult or belittle someone, while **dumb** is sometimes used in a more neutral or even affectionate way to describe someone who is not very intelligent or capable.

3. **Pronunciation:** **Dumb** is pronounced with a long "u" sound (like "pool"), while **dumb** is pronounced with a short "u" sound (like "putt").

4. **Etymology:** **Dumb** comes from the Old English word "dumbe," which means "dull or stupid," while **dumb** comes from the Latin word "dumbus," which means "mute."

5. **Examples:** **Dumb** might be used in a sentence like "That was a really dumb 