Copyright (c) Meta Platforms, Inc. and affiliates.
This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

## Quick Start Notebook

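This notebook shows how to fine-tune a Llama 2 model on a single GPU (e.g. A10 with 24GB) using LoRA, with optional int8 quantization (disabled in the configuration below). The model is trained as a binary hate speech classifier.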

### Step 0: Install prerequisites and convert checkpoint

The example uses the Hugging Face trainer and model, so the checkpoint has to be converted from its original format into the Hugging Face format.
The conversion is done by the `convert_llama_weights_to_hf.py` script that ships with the `transformers` package.
Assuming the original checkpoint resides under `models/7B`, we can install all requirements and convert the checkpoint with the commands below (they are commented out in the cell; uncomment them to run):

In [1]:
# %%bash
# pip install transformers datasets accelerate sentencepiece protobuf==3.20 py7zr scipy peft bitsandbytes fire torch_tb_profiler ipywidgets
# TRANSFORM=`python -c "import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')"`
# python ${TRANSFORM} --input_dir models --model_size 7B --output_dir models_hf/7B

### Step 1: Load the model

Point `model_id` to the folder that contains the converted model weights.

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"


In [3]:
import torch
from transformers import LlamaTokenizer, LlamaForSequenceClassification, LlamaConfig



Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/NETID/xiruod/anaconda3/envs/llama2/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/home/NETID/xiruod/anaconda3/envs/llama2/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CUDA SETUP: CUDA runtime path found: /home/NETID/xiruod/anaconda3/envs/llama2/lib/libcudart.so
CUDA SETUP: Loading binary /home/NETID/xiruod/anaconda3/envs/llama2/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...


  warn("The installed version of bitsandbytes was compiled without GPU support. "
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
  warn(msg)


In [4]:
class train_config:
    def __init__(self):
        self.quantization: bool = False

    

In [5]:
globalconfig = train_config()

In [6]:
globalconfig.quantization = False

In [7]:
globalconfig.device = "cuda:0"

In [8]:

globalconfig.model_id="/bime-munin/llama2_hf/llama-2-7b_hf/"


# model = LlamaForCausalLM.from_pretrained(model_id, 
#                                          load_in_8bit=False, 
#                                          device_map="cuda:0", 
#                                          torch_dtype=torch.float16
#                                         )


tokenizer = LlamaTokenizer.from_pretrained(globalconfig.model_id)


tokenizer.add_special_tokens({"pad_token":"<pad>"}) 



1

In [9]:
len(tokenizer)

32001

In [10]:
tokenizer.pad_token_id

32000

In [11]:
tokenizer

LlamaTokenizer(name_or_path='/bime-munin/llama2_hf/llama-2-7b_hf/', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False), 'eos_token': AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False), 'unk_token': AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False), 'pad_token': '<pad>'}, clean_up_tokenization_spaces=False)

In [12]:
# test_prompt = ["Hello this is just a test and no conversations involved.", "this is just a test"]
# tokenizer(test_prompt, return_tensors='pt', max_length=20, padding='max_length', truncation=True)



# token_inputs = tokenizer(test_prompt, return_tensors='pt', max_length=20, padding='max_length', truncation=True).to("cuda:0")

# tmp = model(**token_inputs, labels=torch.tensor([[2],[4]]).to("cuda:0"))

# tmp.logits

### Step 2: Load the preprocessed dataset

We load and preprocess two hate speech corpora, tagged `dynGen` (the Dynamically Generated Hate Speech Dataset) and `wsf` in the cells below, and merge them into a single binary classification dataset:

#### My Hate Speech Dataset

In [13]:
import pathlib
import pandas as pd

In [14]:
df_dynGen = pd.read_csv("/bime-munin/xiruod/data/hateSpeech_Bulla2023/Dynamically-Generated-Hate-Speech-Dataset/Dynamically Generated Hate Dataset v0.2.3.csv",)
# df_dynGen['label_binary'] = df_dynGen['label'].map({"hate":1, "nothate":0})

df_dynGen['label'] = df_dynGen['label'].map({"hate":"hate", "nothate":"nothate"})

df_dynGen["dfSource"] = "dynGen"

In [15]:
ls_allFiles = pathlib.Path("/bime-munin/xiruod/data/hateSpeech_Bulla2023/hate-speech-dataset/all_files/").glob("*.txt")

ls_id = []
ls_text = []

for ifile in ls_allFiles:
    ls_id.append(ifile.name.split(".txt")[0])
    with open(ifile, "r") as f:
        ls_text.append(f.read())

df_wsf_raw = pd.DataFrame({"file_id":ls_id, "text":ls_text})

df_wsf_annotation = pd.read_csv("/bime-munin/xiruod/data/hateSpeech_Bulla2023/hate-speech-dataset/annotations_metadata.csv")

df_wsf = df_wsf_raw.merge(df_wsf_annotation, on="file_id", how="inner")

df_wsf = df_wsf[df_wsf['label'].isin(['hate','noHate'])].reset_index(drop=True)

# df_wsf['label_binary'] = df_wsf['label'].map({"hate":1,"noHate":0})
df_wsf['label'] = df_wsf['label'].map({"hate":"hate","noHate":"nothate"})

df_wsf["dfSource"] = "wsf"

In [16]:
import datasets

In [17]:
df = pd.concat([df_dynGen, df_wsf]).reset_index(drop=True)

In [18]:
df['label_binary'] = df['label'].map({"hate":1,"nothate":0})

In [19]:
dataset_df = datasets.Dataset.from_pandas(df[['text','dfSource','label_binary']].rename(columns={"label_binary":"label"}))
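
The label handling in the cells above can be summarised without pandas. The following is a minimal sketch, assuming made-up rows and a hypothetical helper `harmonise` (neither is part of the notebook); it shows how the two label spellings (`nothate` and `noHate`) collapse onto one integer scheme and how rows carrying any other label are dropped, mirroring the `isin` filter.

```python
# Hypothetical rows standing in for the two sources (not real data).
dyn_gen_rows = [
    {"text": "example a", "label": "hate"},
    {"text": "example b", "label": "nothate"},
]
wsf_rows = [
    {"text": "example c", "label": "noHate"},
    {"text": "example d", "label": "other"},  # dropped: neither hate nor noHate
    {"text": "example e", "label": "hate"},
]

# Every accepted spelling maps to the shared integer label.
LABEL_TO_ID = {"hate": 1, "nothate": 0, "noHate": 0}


def harmonise(rows, source):
    """Keep rows with a known label; attach the source tag and integer label."""
    return [
        {"text": row["text"], "dfSource": source, "label": LABEL_TO_ID[row["label"]]}
        for row in rows
        if row["label"] in LABEL_TO_ID
    ]


combined = harmonise(dyn_gen_rows, "dynGen") + harmonise(wsf_rows, "wsf")
for row in combined:
    print(row)
```

Filtering before mapping matters in the pandas version too: `Series.map` with a dict yields `NaN` for any value missing from the dict, which would otherwise end up in the `label` column.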

In [20]:
dataset_df[0]

{'text': 'dalits are lowlives', 'dfSource': 'dynGen', 'label': 1}

In [21]:
id2label = {0:"nothate", 1:"hate"}
label2id = {"nothate":0, "hate":1}

In [22]:
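def preprocess_function(examples):
    # Return plain Python lists: datasets.map stores the result in Arrow on the CPU
    # and the Trainer moves each batch to the GPU, so device tensors are not needed here.
    return tokenizer(examples['text'], max_length=1024, padding='max_length', truncation=True)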


In [23]:
tokenized_df = dataset_df.map(preprocess_function, batched=True)

Map:   0%|          | 0/51847 [00:00<?, ? examples/s]
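
What `padding='max_length'` and `truncation=True` do to each example can be shown on plain lists. This is a minimal sketch with hypothetical token ids and hypothetical helpers (`pad_or_truncate`, `last_real_token_index`), not the tokenizer's implementation. It assumes right-side padding, as reported by `padding_side='right'` above. The second helper illustrates why the pad id matters later: as far as the Hugging Face documentation describes it, the sequence classification head reads the hidden state of the last non-padding token when `pad_token_id` is set in the model config.

```python
PAD_ID = 32000  # id of the added "<pad>" token, as shown by tokenizer.pad_token_id


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Right-pad or truncate a list of token ids to exactly max_length."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


def last_real_token_index(input_ids, pad_id=PAD_ID):
    """Index of the last non-padding token in a right-padded sequence."""
    for i in range(len(input_ids) - 1, -1, -1):
        if input_ids[i] != pad_id:
            return i
    return -1  # the sequence is all padding


# Hypothetical token ids; 1 plays the role of the BOS token.
short_seq = pad_or_truncate([1, 15043, 3186], max_length=6)
long_seq = pad_or_truncate([1, 5, 6, 7, 8, 9, 10, 11], max_length=6)
print(short_seq)
print(long_seq)
print(last_real_token_index(short_seq["input_ids"]))
```

With `max_length=1024`, every one of the 51,847 examples is stored as 1,024 ids regardless of its real length, so a smaller limit or dynamic padding may be worth considering for short texts.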

In [24]:
# train_encodings = tokenizer(list(df['text']), return_tensors='pt', max_length=1024, padding='max_length', truncation=True).to(globalconfig.device)

# train_labels = df['label']


# class classificationDataset(torch.utils.data.Dataset):
#     def __init__(self, encodings, labels):
#         self.encodings = encodings
#         self.labels = labels

#     def __getitem__(self, idx):
#         item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
#         item['labels'] = torch.tensor(self.labels[idx])
#         return item

#     def __len__(self):
#         return len(self.labels)

# train_dataset = classificationDataset(train_encodings, train_labels)

### Initialize the model

In [25]:


model = LlamaForSequenceClassification.from_pretrained(
    globalconfig.model_id,
    load_in_8bit=globalconfig.quantization,
    device_map="cuda:0",
    torch_dtype=torch.float16,
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)


model.config.pad_token_id = tokenizer.pad_token_id

model.resize_token_embeddings(len(tokenizer))


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at /bime-munin/llama2_hf/llama-2-7b_hf/ and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32001. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


Embedding(32001, 4096)
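
The warning above is about the embedding size of 32001 not being a convenient multiple for Tensor Cores. A small arithmetic sketch of what rounding up would give; the helper `round_up_to_multiple` and the candidate multiples 8 and 64 are assumptions chosen for illustration, not values taken from the notebook.

```python
def round_up_to_multiple(n, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return ((n + multiple - 1) // multiple) * multiple


vocab_size = 32001  # len(tokenizer) after adding "<pad>"
padded_sizes = {}
for multiple in (8, 64):
    padded = round_up_to_multiple(vocab_size, multiple)
    padded_sizes[multiple] = padded
    # Print the multiple, the padded size, and how many unused rows it adds.
    print(multiple, padded, padded - vocab_size)
```

The warning names a `pad_to_multiple_of` parameter on `resize_token_embeddings`; passing it (for example `pad_to_multiple_of=64`) should produce the rounded-up size, at the cost of a few embedding rows that no token id ever uses.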

### Step 3: Check base model

Run the base model on an example input:

### Step 4: Prepare model for PEFT

Let's prepare the model for Parameter Efficient Fine Tuning (PEFT):

In [26]:
model.train()

def create_peft_config(model):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        prepare_model_for_int8_training,
    )

    peft_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules = ["q_proj", "v_proj"]
    )

    # prepare int-8 model for training
    if globalconfig.quantization:
        model = prepare_model_for_int8_training(model)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    return model, peft_config

# create peft config
model, lora_config = create_peft_config(model)



trainable params: 4,210,688 || all params: 6,611,558,400 || trainable%: 0.06368677012669206
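
The trainable-parameter figure can be checked by hand. The sketch below assumes the shapes of the 7B model used here (4096 hidden features, 32 decoder layers) and recomputes the size of the LoRA matrices from `r=8` and the two target modules; the variable names are illustrative only.

```python
hidden_size = 4096   # in/out features of q_proj and v_proj in the 7B model
num_layers = 32      # decoder layers
rank = 8             # r in LoraConfig
target_modules = ["q_proj", "v_proj"]

# Each adapted Linear gains A (rank x in_features) and B (out_features x rank).
params_per_module = rank * hidden_size + hidden_size * rank
lora_params = params_per_module * len(target_modules) * num_layers

reported_trainable = 4_210_688      # from print_trainable_parameters()
reported_total = 6_611_558_400      # from print_trainable_parameters()

remainder = reported_trainable - lora_params
trainable_percent = 100 * reported_trainable / reported_total

print(lora_params)   # parameters in the LoRA matrices only
print(remainder)     # trainable parameters outside the LoRA matrices
print(trainable_percent)
```

The LoRA matrices account for 4,194,304 of the reported parameters. The remaining 16,384 equals twice the size of a 2 x 4096 classification head, which is consistent with the `score` layer being kept trainable for `TaskType.SEQ_CLS`; exactly how this PEFT version counts the head is an assumption here, not something the output states.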


### Step 5: Define an optional profiler

In [36]:
from transformers import TrainerCallback
from contextlib import nullcontext
enable_profiler = False
output_dir = "/bime-munin/xiruod/tmp/llama-output"

config = {
    'lora_config': lora_config,
    'learning_rate': 1e-4,
    'num_train_epochs': 1,
    'gradient_accumulation_steps': 2,
    'per_device_train_batch_size': 2,
    'gradient_checkpointing': False,
}

# Set up profiler
if enable_profiler:
    # wait, warmup, active, repeat = 1, 1, 2, 1
    wait, warmup, active, repeat = 10, 10, 100, 1
    total_steps = (wait + warmup + active) * (1 + repeat)
    schedule =  torch.profiler.schedule(wait=wait, warmup=warmup, active=active, repeat=repeat)
    profiler = torch.profiler.profile(
        schedule=schedule,
        on_trace_ready=torch.profiler.tensorboard_trace_handler(f"{output_dir}/logs/tensorboard"),
        record_shapes=True,
        profile_memory=True,
        with_stack=True)
    
    class ProfilerCallback(TrainerCallback):
        def __init__(self, profiler):
            self.profiler = profiler
            
        def on_step_end(self, *args, **kwargs):
            self.profiler.step()

    profiler_callback = ProfilerCallback(profiler)
else:
    profiler = nullcontext()
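
The `wait`, `warmup`, `active`, and `repeat` values above decide which training steps are recorded. A minimal pure-Python sketch of that cycle follows, assuming the semantics described in the PyTorch profiler documentation: idle for `wait` steps, trace but discard for `warmup` steps, record for `active` steps, and stop after `repeat` cycles. `profiler_phase` is a hypothetical helper written for this illustration, not part of `torch.profiler`.

```python
def profiler_phase(step, wait, warmup, active, repeat):
    """Return the phase a step falls in for a wait/warmup/active schedule."""
    cycle = wait + warmup + active
    if repeat > 0 and step // cycle >= repeat:
        return "none"  # all requested cycles are finished
    position = step % cycle
    if position < wait:
        return "none"
    if position < wait + warmup:
        return "warmup"
    return "record"


wait, warmup, active, repeat = 10, 10, 100, 1
phases = [profiler_phase(s, wait, warmup, active, repeat) for s in range(240)]
recorded_steps = phases.count("record")
print(recorded_steps)
print(phases[0], phases[10], phases[20], phases[120])
```

Under these assumptions a single cycle spans 120 steps, so the `total_steps` value of 240 computed in the cell above runs past the end of the one requested cycle; the extra steps would simply not be recorded.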

### Step 6: Fine tune the model

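Here, we fine-tune the model. The configuration requests a single epoch, which the original text estimates at a bit more than an hour on an A100, but `max_steps=500` in the training arguments below takes precedence over `num_train_epochs`, so this run stops after 500 optimizer steps.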

In [37]:
output_dir

'/bime-munin/xiruod/tmp/llama-output'

In [None]:
from transformers import default_data_collator, Trainer, TrainingArguments



# Define training args
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    bf16=False,  # BF16 mixed precision is off; set to True on GPUs that support it
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="no",
    optim="adamw_torch_fused",
    # max_steps=total_steps if enable_profiler else -1,
    max_steps=500,

    **{k:v for k,v in config.items() if k != 'lora_config'}
)

with profiler:
    # Create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_df,
        data_collator=default_data_collator,
        tokenizer=tokenizer,
        callbacks=[profiler_callback] if enable_profiler else [],
    )

    # Start training
    trainer.train()

ret_code = 1

Step,Training Loss
10,1518.5206
20,2021.5984
30,994.0688
40,920.4756
50,884.2555
60,506.15
70,578.9
80,626.0719
90,354.2375
100,317.052
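
The step numbers in the log are optimizer steps, each of which consumes several examples. The arithmetic below is a sketch using the values from the training configuration and the dataset size shown by the `Map` progress bar; the single-device count is an assumption based on `CUDA_VISIBLE_DEVICES="0"`.

```python
import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 2
max_steps = 500
num_examples = 51_847   # dataset size from the Map progress bar
num_devices = 1         # assumption: one visible GPU

# Examples consumed per optimizer step and over the whole capped run.
samples_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
samples_seen = samples_per_step * max_steps
fraction_of_epoch = samples_seen / num_examples

# Approximate number of optimizer steps a full pass over the data would need.
steps_for_full_epoch = math.ceil(num_examples / samples_per_step)

print(samples_per_step, samples_seen)
print(fraction_of_epoch)
print(steps_for_full_epoch)
```

With `max_steps=500` the run sees 2,000 examples, a little under 4% of the data, so the loss values above reflect only the very start of an epoch.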


In [42]:
ret_code

1

### Step 7: Save model checkpoint

In [43]:
model.save_pretrained(output_dir)

### Step 8: Try Fine-tuned Model
Try the fine-tuned model on the same example again to see the learning progress:

In [None]:
# model.eval()
# with torch.no_grad():
#     print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))
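
The commented-out `generate` call comes from a causal-LM workflow. With a classification head, a prediction is read from the output `logits` instead. The sketch below uses hypothetical logit values and loads no model; `softmax` and `predict_label` are illustrative helpers, and `id2label` repeats the mapping defined earlier in the notebook.

```python
import math

id2label = {0: "nothate", 1: "hate"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return (label, probability) for one row of classification logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits for two inputs, in the order [nothate, hate].
for row in [[2.0, -1.0], [-0.5, 1.5]]:
    print(predict_label(row))
```

In the real notebook the rows would come from something like `model(**inputs).logits` evaluated under `torch.no_grad()`, converted to Python floats first.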


### Load LoRA

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"


In [2]:
import torch
from peft import PeftModel, PeftConfig
from transformers import LlamaTokenizer, LlamaForSequenceClassification, LlamaConfig

# from transformers import AutoModelForCausalLM, AutoTokenizer


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/NETID/xiruod/anaconda3/envs/llama2/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so
/home/NETID/xiruod/anaconda3/envs/llama2/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
CUDA SETUP: CUDA runtime path found: /home/NETID/xiruod/anaconda3/envs/llama2/lib/libcudart.so
CUDA SETUP: Loading binary /home/NETID/xiruod/anaconda3/envs/llama2/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...


  warn("The installed version of bitsandbytes was compiled without GPU support. "
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
  warn(msg)


In [3]:
peft_model_id = "/bime-munin/xiruod/tmp/llama-output/"
config = PeftConfig.from_pretrained(peft_model_id)

In [21]:
config

LoraConfig(peft_type='LORA', auto_mapping=None, base_model_name_or_path='/bime-munin/llama2_hf/llama-2-7b_hf/', revision=None, task_type='SEQ_CLS', inference_mode=True, r=8, target_modules=['q_proj', 'v_proj'], lora_alpha=32, lora_dropout=0.05, fan_in_fan_out=False, bias='none', modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None)
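
The `r` and `lora_alpha` values printed above determine how strongly the low-rank update is applied. The following toy sketch uses plain lists and 3 features instead of 4096, and omits dropout. It computes the standard LoRA form, where the output is `W x + (lora_alpha / r) * B (A x)`; the matrices and the helper names are made up for illustration and are not PEFT's implementation.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Base projection plus the scaled low-rank update."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


# Toy sizes: 3 features and 2 rows in A (the notebook uses 4096 features, rank 8).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.1, 0.2, 0.3], [0.0, 0.1, 0.0]]            # rank x in_features
B_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]     # out_features x rank, zero at init
B_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # hypothetical values after training
x = [1.0, 2.0, 3.0]

out_at_init = lora_forward(W, A, B_zero, x, r=8, lora_alpha=32)
out_trained = lora_forward(W, A, B_trained, x, r=8, lora_alpha=32)
print(out_at_init)
print(out_trained)
```

With `lora_alpha=32` and `r=8` the update is scaled by 4. Because `B` starts at zero, the adapted layer initially reproduces the base layer exactly, and only the training of `A` and `B` moves it away from the pretrained behaviour.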

In [4]:
model = LlamaForSequenceClassification.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=False, device_map='cuda:0')

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some weights of LlamaForSequenceClassification were not initialized from the model checkpoint at /bime-munin/llama2_hf/llama-2-7b_hf/ and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
model

LlamaForSequenceClassification(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (nor

In [6]:
tokenizer = LlamaTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.add_special_tokens({"pad_token":"<pad>"}) 

model.config.pad_token_id = tokenizer.pad_token_id

model.resize_token_embeddings(len(tokenizer))


# Load the saved LoRA adapter on top of the resized base model
model = PeftModel.from_pretrained(model, peft_model_id)

You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32001. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


In [7]:
len(tokenizer)

32001

In [8]:
tokenizer.pad_token

'<pad>'

In [9]:
model

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): LlamaForSequenceClassification(
      (model): LlamaModel(
        (embed_tokens): Embedding(32001, 4096)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear(
                in_features=4096, out_features=4096, bias=False
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
           

In [10]:
model.base_model

LoraModel(
  (model): LlamaForSequenceClassification(
    (model): LlamaModel(
      (embed_tokens): Embedding(32001, 4096)
      (layers): ModuleList(
        (0-31): 32 x LlamaDecoderLayer(
          (self_attn): LlamaAttention(
            (q_proj): Linear(
              in_features=4096, out_features=4096, bias=False
              (lora_dropout): ModuleDict(
                (default): Dropout(p=0.05, inplace=False)
              )
              (lora_A): ModuleDict(
                (default): Linear(in_features=4096, out_features=8, bias=False)
              )
              (lora_B): ModuleDict(
                (default): Linear(in_features=8, out_features=4096, bias=False)
              )
              (lora_embedding_A): ParameterDict()
              (lora_embedding_B): ParameterDict()
            )
            (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
            (v_proj): Linear(
              in_features=4096, out_features=4096, bias=False
             