
Copyright (c) Meta Platforms, Inc. and affiliates.
This software may be used and distributed according to the terms of the Llama 2 Community License Agreement.

## Quick Start Notebook

This notebook shows how to train a Llama 2 model on a single GPU (e.g. A10 with 24GB) using int8 quantization and LoRA.

### Step 0: Install pre-requirements and convert checkpoint

The example uses the Hugging Face trainer and model, which means the checkpoint has to be converted from its original format into the Hugging Face format.
The conversion is done by running the `convert_llama_weights_to_hf.py` script provided with the `transformers` package.
Given that the original checkpoint resides under `models/7B`, we can install all requirements and convert the checkpoint with:

In [None]:
# %%bash
# pip install transformers datasets accelerate sentencepiece protobuf==3.20 py7zr scipy peft bitsandbytes fire torch_tb_profiler ipywidgets
# TRANSFORM=`python -c "import transformers;print('/'.join(transformers.__file__.split('/')[:-1])+'/models/llama/convert_llama_weights_to_hf.py')"`
# python ${TRANSFORM} --input_dir models --model_size 7B --output_dir models_hf/7B

### Step 1: Load the model

Point `model_id` to the folder that contains the converted model weights:

In [8]:
import torch
import datasets
from transformers import LlamaForCausalLM, LlamaTokenizer

model_id = '/root/Model/llama-2-7b'

tokenizer = LlamaTokenizer.from_pretrained(model_id)

# Load the weights in 8-bit and let accelerate place them on the available device(s)
model = LlamaForCausalLM.from_pretrained(model_id, load_in_8bit=True, device_map='auto', torch_dtype=torch.float16)
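
To see why 8-bit loading matters on a 24GB card, here is a rough back-of-the-envelope estimate of the weight memory alone. It is a sketch, not a measurement: the parameter count is taken from the `print_trainable_parameters()` output in Step 4 (total minus LoRA parameters), and activations, optimizer state and CUDA overhead are ignored. Some tensors typically stay in 16-bit when loading with `load_in_8bit=True`, so the int8 figure should be read as a lower bound.

```python
# Rough weight-memory estimate; ignores activations, optimizer state and CUDA overhead.
# Assumption: base parameter count = total reported in Step 4 (6,742,609,920)
# minus the LoRA parameters (4,194,304).
BASE_PARAMS = 6_742_609_920 - 4_194_304


def weight_gib(num_params: int, bytes_per_param: int) -> float:
    """Return the size of the weights in GiB for a given storage width."""
    return num_params * bytes_per_param / 2**30


fp16_gib = weight_gib(BASE_PARAMS, 2)  # float16 stores 2 bytes per value
int8_gib = weight_gib(BASE_PARAMS, 1)  # int8 stores 1 byte per value

print(f"fp16 weights: {fp16_gib:.2f} GiB")
print(f"int8 weights: {int8_gib:.2f} GiB")
```

Halving the weight footprint leaves more of the card free for activations and the LoRA optimizer state.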

### Step 2: Load the preprocessed dataset

We load and preprocess two datasets, `science_qa_dataset` and `llm_science_dataset`, and then concatenate them into a single training set:

In [4]:
from pathlib import Path
import os
import sys
from utils.dataset_utils import get_preprocessed_dataset
from configs.datasets import science_qa_dataset,llm_science_dataset

sciqa_train_dataset = get_preprocessed_dataset(tokenizer, science_qa_dataset, 'train')
# test_dataset = get_preprocessed_dataset(tokenizer, science_qa_dataset, 'test')

In [5]:
llmsci_train_dataset = get_preprocessed_dataset(tokenizer, llm_science_dataset, 'train')



In [9]:
train_dataset = datasets.concatenate_datasets([sciqa_train_dataset, llmsci_train_dataset])

### Step 3: Check base model

Run the base model on an example input:

In [10]:
eval_prompt = """
Context: In humans, posture can provide a significant amount of important information through nonverbal communication.  Psychological studies have also demonstrated the effects of body posture on emotions.  This research can be traced back to Charles Darwin's studies of emotion and movement in humans and animals.  Currently, many studies have shown that certain patterns of body movements are indicative of specific emotions.   Researchers studied sign language and found that even non-sign language users can determine emotions from only hand movements. Another example is the fact that anger is characterized by forward whole body movement.   The theories that guide research in this field are the self-validation or perception theory and the embodied emotion theory.
Self-validation theory is when a participant's posture has a significant effect on their self-evaluation of their emotions.
Question: According to the research, what type of body movement is anger characterized by?
Options: (A) Sideways body movement
(B) Backward body movement
(C) Upward body movement
(D) Downward body movement
(E) Forward body movement
---
Answer:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

# model.eval()
# with torch.no_grad():
#     print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))

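The generation call above is commented out, so no output is shown for this run. When it is enabled, expect the base model to mostly continue or repeat the prompt rather than give a clean answer.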

### Step 4: Prepare model for PEFT

Let's prepare the model for Parameter Efficient Fine Tuning (PEFT):

In [5]:
model.train()

def create_peft_config(model):
    from peft import (
        get_peft_model,
        LoraConfig,
        TaskType,
        # prepare_model_for_kbit_training,
        prepare_model_for_int8_training,
    )

    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=8,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules = ["q_proj", "v_proj"]
    )

    # prepare int-8 model for training
    # (later peft releases replace this helper with prepare_model_for_kbit_training)
    model = prepare_model_for_int8_training(model)
    # model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    return model, peft_config

# create peft config
model, lora_config = create_peft_config(model)





trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
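
The trainable parameter count above can be reproduced by hand. The sketch below assumes the usual Llama 2 7B shape (32 decoder layers, hidden size 4096, so `q_proj` and `v_proj` are both 4096 x 4096); these dimensions are not stated in this notebook. Each adapted linear layer gets two low-rank matrices, `A` of shape `r x in_features` and `B` of shape `out_features x r`.

```python
# Assumptions (not stated in the notebook): Llama 2 7B has 32 decoder layers
# and a hidden size of 4096, so q_proj and v_proj are both 4096 x 4096.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
RANK = 8  # r in LoraConfig
TARGET_MODULES = ["q_proj", "v_proj"]
ALL_PARAMS = 6_742_609_920  # "all params" from the output above


def lora_params_per_module(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A (rank x in) plus B (out x rank)."""
    return rank * in_features + out_features * rank


per_module = lora_params_per_module(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
trainable = per_module * len(TARGET_MODULES) * NUM_LAYERS
trainable_pct = 100 * trainable / ALL_PARAMS

print(f"trainable params: {trainable:,}")
print(f"trainable%: {trainable_pct}")
```

Under these assumptions the result matches the 4,194,304 trainable parameters reported by `print_trainable_parameters()`.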


### Step 5: Define an optional profiler

In [6]:
from transformers import TrainerCallback
from contextlib import nullcontext
enable_profiler = False
output_dir = "/root/Model/llama-output"

config = {
    'lora_config': lora_config,
    'learning_rate': 1e-4,
    'num_train_epochs': 3,
    'gradient_accumulation_steps': 4,
    'per_device_train_batch_size': 1,
    'gradient_checkpointing': True,
}

# Set up profiler
if enable_profiler:
    wait, warmup, active, repeat = 1, 1, 2, 1
    total_steps = (wait + warmup + active) * (1 + repeat)
    schedule =  torch.profiler.schedule(wait=wait, warmup=warmup, active=active, repeat=repeat)
    profiler = torch.profiler.profile(
        schedule=schedule,
        on_trace_ready=torch.profiler.tensorboard_trace_handler(f"{output_dir}/logs/tensorboard"),
        record_shapes=True,
        profile_memory=True,
        with_stack=True)
    
    class ProfilerCallback(TrainerCallback):
        def __init__(self, profiler):
            self.profiler = profiler
            
        def on_step_end(self, *args, **kwargs):
            self.profiler.step()

    profiler_callback = ProfilerCallback(profiler)
else:
    profiler = nullcontext()
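
To make the schedule values concrete, the following pure-Python sketch models which phase each training step falls into for `wait, warmup, active, repeat = 1, 1, 2, 1`. It is a simplified stand-in for the documented behaviour of `torch.profiler.schedule`, not the library code itself, and the phase names (`idle`, `warmup`, `record`) are labels chosen for this illustration.

```python
def phase(step: int, wait: int, warmup: int, active: int, repeat: int) -> str:
    """Simplified model of the per-step action chosen by a profiler schedule."""
    cycle = wait + warmup + active
    # With repeat > 0, profiling stops after `repeat` full cycles.
    if repeat > 0 and step // cycle >= repeat:
        return "idle"
    pos = step % cycle
    if pos < wait:
        return "idle"
    if pos < wait + warmup:
        return "warmup"
    return "record"


wait, warmup, active, repeat = 1, 1, 2, 1
total_steps = (wait + warmup + active) * (1 + repeat)
phases = [phase(s, wait, warmup, active, repeat) for s in range(total_steps)]

print(total_steps)
print(phases)
```

With these values only two of the eight steps are recorded; the remaining steps after the first cycle run without profiling.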

### Step 6: Fine tune the model

Here, we fine-tune the model with the `config` defined above, which sets `num_train_epochs` to 3. For reference, a single epoch takes a bit more than an hour on an A100.

In [7]:
from transformers import default_data_collator, Trainer, TrainingArguments



# Define training args
training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    bf16=False,  # Set to True to train in BF16 if the GPU supports it
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=10,
    save_strategy="epoch",
    # save_strategy="steps",
    # save_steps=50,
    # eval_steps=50,
    # save_total_limit=2,
    optim="adamw_torch_fused",
    max_steps=total_steps if enable_profiler else -1,
    **{k:v for k,v in config.items() if k != 'lora_config'}
)

with profiler:
    # Create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=default_data_collator,
        callbacks=[profiler_callback] if enable_profiler else [],
    )

    # Start training
    trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


Step,Training Loss
10,0.9211
20,0.4874
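
To relate the logged steps to the amount of data seen, here is a small sketch of the effective batch size implied by the settings above. It assumes a single GPU (`num_devices = 1`), in line with the single-GPU setup described at the top of the notebook, and that each logged step corresponds to one optimizer update.

```python
# Values copied from `config` and `TrainingArguments` above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
logging_steps = 10
num_devices = 1  # assumption: single-GPU run

# Examples contributing to one optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Examples processed between two consecutive log lines.
examples_per_log = effective_batch * logging_steps

print(f"effective batch size: {effective_batch}")
print(f"examples per logging interval: {examples_per_log}")
```

Under these assumptions the loss values at steps 10 and 20 are reported after roughly 40 and 80 training examples.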


### Step 7: Save model checkpoint

In [8]:
model.save_pretrained(output_dir)

### Step 8: Evaluate the fine-tuned model

Try the fine-tuned model on the same example again to see the learning progress:

In [16]:
eval_prompt = """
Context: In humans, posture can provide a significant amount of important information through nonverbal communication.  Psychological studies have also demonstrated the effects of body posture on emotions.  This research can be traced back to Charles Darwin's studies of emotion and movement in humans and animals.  Currently, many studies have shown that certain patterns of body movements are indicative of specific emotions.   Researchers studied sign language and found that even non-sign language users can determine emotions from only hand movements. Another example is the fact that anger is characterized by forward whole body movement.   The theories that guide research in this field are the self-validation or perception theory and the embodied emotion theory.
Self-validation theory is when a participant's posture has a significant effect on their self-evaluation of their emotions.
Question: According to the research, what type of body movement is anger characterized by?
Options: (A) Sideways body movement
(B) Backward body movement
(C) Upward body movement
(D) Downward body movement
(E) Forward body movement
---
Answer:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=3)[0], skip_special_tokens=True))



Context: In humans, posture can provide a significant amount of important information through nonverbal communication.  Psychological studies have also demonstrated the effects of body posture on emotions.  This research can be traced back to Charles Darwin's studies of emotion and movement in humans and animals.  Currently, many studies have shown that certain patterns of body movements are indicative of specific emotions.   Researchers studied sign language and found that even non-sign language users can determine emotions from only hand movements. Another example is the fact that anger is characterized by forward whole body movement.   The theories that guide research in this field are the self-validation or perception theory and the embodied emotion theory.
Self-validation theory is when a participant's posture has a significant effect on their self-evaluation of their emotions.
Question: According to the research, what type of body movement is anger characterized by?
Options: (A)
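
The printed text starts with the prompt because, for a decoder-only model, the generated sequence contains the prompt tokens followed by the new tokens. A minimal sketch of keeping only the new tokens is shown below, using plain lists with made-up token ids as a stand-in for the real tensors. With the actual tensors the same idea would be a slice such as `output[0][model_input["input_ids"].shape[1]:]` before decoding.

```python
# Stand-in token ids for illustration; real ids come from the tokenizer and generate().
prompt_ids = [1, 15043, 29901]
generated_ids = prompt_ids + [313, 29923, 29897]  # prompt echoed, then new tokens

# Drop the echoed prompt so only the newly generated tokens remain.
new_ids = generated_ids[len(prompt_ids):]

print(new_ids)
```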

In [15]:
eval_prompt = """
Context: The term self-organized criticality was first introduced in Bak, Tang and Wiesenfeld's 1987 paper, which clearly linked together those factors: a simple cellular automaton was shown to produce several characteristic features observed in natural complexity (fractal geometry, pink (1/f) noise and power laws) in a way that could be linked to critical-point phenomena. Crucially, however, the paper emphasized that the complexity observed emerged in a robust manner that did not depend on finely tuned details of the system: variable parameters in the model could be changed widely without affecting the emergence of critical behavior: hence, self-organized criticality. Thus, the key result of BTW's paper was its discovery of a mechanism by which the emergence of complexity from simple local interactions could be spontaneous—and therefore plausible as a source of natural complexity—rather than something that was only possible in artificial situations in which control parameters are tuned to
Question: Who proposed the principle of "complexity from noise" and when was it first introduced?
Options: (A): Ilya Prigogine in 1979
(B): Henri Atlan in 1972
(C): Democritus and Lucretius in ancient times
(D): None of the above.
(E): René Descartes in 1637
---
Answer:
"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=3)[0], skip_special_tokens=True))



Context: The term self-organized criticality was first introduced in Bak, Tang and Wiesenfeld's 1987 paper, which clearly linked together those factors: a simple cellular automaton was shown to produce several characteristic features observed in natural complexity (fractal geometry, pink (1/f) noise and power laws) in a way that could be linked to critical-point phenomena. Crucially, however, the paper emphasized that the complexity observed emerged in a robust manner that did not depend on finely tuned details of the system: variable parameters in the model could be changed widely without affecting the emergence of critical behavior: hence, self-organized criticality. Thus, the key result of BTW's paper was its discovery of a mechanism by which the emergence of complexity from simple local interactions could be spontaneous—and therefore plausible as a source of natural complexity—rather than something that was only possible in artificial situations in which control parameters are tun

In [None]:
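# Assumption: options are labelled A-E, matching the prompts used above.
choice_prefixes = ["A", "B", "C", "D", "E"]

# Single braces are str.format placeholders; doubled braces would be emitted literally.
prompt = '''Context: {hint}\nQuestion: {question}\nOptions: {options}\n---\nAnswer:{answer}{eos_token}'''

def format_options(options):
    # Render the choices as "(A) first (B) second ..."
    return ' '.join([f'({c}) {o}' for c, o in zip(choice_prefixes, options)])

def apply_prompt_template(r):
    # r['answer'] is used as an index into choice_prefixes
    options = format_options(r['choices'])
    return {
        "text": prompt.format(
            hint=r["hint"],
            question=r["question"],
            options=options,
            answer=choice_prefixes[r['answer']],
            eos_token=tokenizer.eos_token,
        )
    }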
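
A note on brace handling in `str.format`, which matters for templates like the one in this cell: doubled braces are emitted as literal braces, while single braces are replaced by the keyword arguments. The sketch below uses a hypothetical record and hypothetical option labels (`example_prefixes`, `record`) purely for illustration.

```python
# Hypothetical option labels and record, for illustration only.
example_prefixes = ["A", "B", "C"]
record = {
    "hint": "Ice is less dense than liquid water.",
    "choices": ["sinks", "floats", "dissolves"],
}

# Doubled braces are escapes: the placeholder text survives unchanged.
escaped = "Context: {{hint}}".format(hint=record["hint"])
# Single braces are substituted with the keyword argument.
substituted = "Context: {hint}".format(hint=record["hint"])

# Options rendered in the "(A) ... (B) ..." style.
options = " ".join(f"({c}) {o}" for c, o in zip(example_prefixes, record["choices"]))

print(escaped)
print(substituted)
print(options)
```

Only the single-brace form inserts the record's fields into the training text.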