This notebook is taken directly from https://github.com/tcapelle/llm_recipes/tree/main

# From Llama to Alpaca: Finetuning an LLM with Weights & Biases
In this notebook you will learn how to finetune a pretrained Llama model on an instruction dataset. We will use an updated version of the Alpaca dataset that uses GPT-4 generations instead of davinci-003 (GPT-3) to get an even better instruction dataset. More details are on the [official repo page](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#how-good-is-the-data).

> This notebook requires an A100/A10 GPU with at least 24GB of memory. You could tweak the params down and run on a T4, but it would take a very long time.

This notebook has a companion project and [report](https://wandb.me/alpaca).

In [1]:
!pip install wandb transformers trl datasets "protobuf==3.20.3" evaluate



## With Huggingface TRL

Let's grab the Alpaca (GPT-4 curated instructions and outputs) dataset:

In [2]:
# !wget https://raw.githubusercontent.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/main/data/alpaca_gpt4_data.json

In [3]:
import json

dataset_file = "alpaca_gpt4_data.json"

with open(dataset_file, "r") as f:
    alpaca = json.load(f)

In [4]:
import wandb
wandb.init(project="alpaca_ft", # the project I am working on
           tags=["hf_sft"]) # tags used to label and filter this run
artifact = wandb.use_artifact('capecape/alpaca_ft/alpaca_gpt4_splitted:latest', type='dataset')
artifact_dir = artifact.download()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: nelectric (neelectric). Use `wandb login --relogin` to force relogin


wandb:   2 of 2 files downloaded.


In [5]:
print(artifact_dir)
from datasets import load_dataset
alpaca_ds = load_dataset("json", data_dir=artifact_dir)



/home/service/BioLlama/utilities/finetuning/artifacts/alpaca_gpt4_splitted:v8


In [6]:
alpaca_ds

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 51002
    })
    test: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 1000
    })
})

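Each row has `instruction`, `input`, and `output` columns. The helper functions below format a row into the Alpaca prompt template, with a separate template for rows whose `input` is empty.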

In [7]:
def prompt_no_input(row):
    return ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Response:\n{output}").format_map(row)

In [8]:
def prompt_input(row):
    return ("Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}").format_map(row)

In [9]:
def create_prompt(row):
    return prompt_no_input(row) if row["input"] == "" else prompt_input(row)
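To see what the two templates produce, here is a minimal, self-contained sketch that applies the same selection logic to two made-up rows. The sample rows and the names `NO_INPUT`, `WITH_INPUT`, and `format_row` are illustrative assumptions; they are not part of the dataset or of this notebook.

```python
# Self-contained sketch of the Alpaca prompt templates defined above.
# NO_INPUT, WITH_INPUT, format_row and the sample rows are hypothetical.
NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)
WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def format_row(row):
    # Rows with an empty "input" field use the shorter template.
    template = NO_INPUT if row["input"] == "" else WITH_INPUT
    return template.format_map(row)

rows = [
    {"instruction": "Name a primary color.", "input": "", "output": "Red."},
    {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
]
prompts = [format_row(r) for r in rows]
for p in prompts:
    print(p)
    print("-" * 40)
```

Only the second row gets an `### Input:` section, because its `input` field is non-empty.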

In [10]:
train_dataset = alpaca_ds["train"]
eval_dataset = alpaca_ds["test"]

In [11]:
import torch
# from cti.transformers.transformers.src.transformers.models.auto import AutoModelForCausalLM, AutoTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer

In [39]:
model_id = 'TheBloke/Llama-2-7B-Chat-GPTQ'

In [40]:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
)

config.json: 100%|██████████| 789/789 [00:00<00:00, 5.88MB/s]
model.safetensors: 100%|██████████| 3.90G/3.90G [06:57<00:00, 9.34MB/s]
generation_config.json: 100%|██████████| 137/137 [00:00<00:00, 1.18MB/s]


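Training the full model is expensive, but if you have a GPU that can fit full fine-tuning, you can skip this part. Otherwise, freeze most of the network and train only part of it. Llama2-7B has 32 decoder layers; the cell below unfreezes the LM head and the single layer at index `n_freeze`. To train the last 8 layers instead, set `n_freeze = 24` and slice with `model.model.layers[n_freeze:]`.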

In [14]:
# Count the parameter tensors in the model
temp_list = list(model.parameters())
print(len(temp_list))

291


In [15]:
n_freeze = 15

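# freeze everything (disable gradients), then re-enable the LM head and one decoder layer
for param in model.parameters(): param.requires_grad = False
for param in model.lm_head.parameters(): param.requires_grad = True
# note: indexing selects only layer n_freeze; layers[n_freeze:] would unfreeze every layer from n_freeze onward
for param in model.model.layers[n_freeze].parameters(): param.requires_grad = True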

In [16]:
# Just freeze embeddings for small memory decrease
model.model.embed_tokens.weight.requires_grad_(False);

In [17]:
def param_count(m):
    params = sum([p.numel() for p in m.parameters()])/1_000_000
    trainable_params = sum([p.numel() for p in m.parameters() if p.requires_grad])/1_000_000
    print(f"Total params: {params:.2f}M, Trainable: {trainable_params:.2f}M")
    return params, trainable_params

params, trainable_params = param_count(model)

Total params: 6738.42M, Trainable: 333.46M
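The reported counts can be cross-checked by hand. The sketch below recomputes them from layer shapes, assuming the dimensions of the public Llama-2-7B configuration (hidden size 4096, MLP size 11008, vocabulary 32000, 32 layers, untied output head, no biases). These values are assumptions about the checkpoint, not something read from the loaded model.

```python
# Back-of-the-envelope parameter count for a Llama-2-7B-shaped model.
# The dimensions below are assumed from the public Llama-2-7B config.
hidden, intermediate, vocab, n_layers = 4096, 11008, 32000, 32

attention = 4 * hidden * hidden      # q, k, v, o projections (no biases)
mlp = 3 * hidden * intermediate      # gate, up, down projections
norms = 2 * hidden                   # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

embed = vocab * hidden               # input embedding table
lm_head = vocab * hidden             # output projection (untied)
final_norm = hidden
total = embed + n_layers * per_layer + final_norm + lm_head

def trainable(n_unfrozen_layers):
    # LM head plus the given number of unfrozen decoder layers.
    return lm_head + n_unfrozen_layers * per_layer

print(f"Total params: {total / 1e6:.2f}M")
print(f"Head + 1 layer: {trainable(1) / 1e6:.2f}M")
print(f"Head + last 8 layers: {trainable(8) / 1e6:.2f}M")
```

Under these assumptions, the 333.46M trainable parameters reported above correspond to the LM head plus exactly one decoder layer, which is what indexing `model.model.layers[n_freeze]` selects.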


In [18]:
import torch
print(torch.__version__)

2.1.2+cu121


In [19]:
# # !pip uninstall transformers -y
# !pip install transformers
# !pip install -i https://pip.repos.neuron.amazonaws.com transformers-neuronx

In [20]:
# from transformers import TrainingArguments
from trl import SFTTrainer

In [21]:
batch_size = 32

total_num_steps = 11_210 // batch_size
print(total_num_steps)
print("changing total number of steps down to 50 to save time")
total_num_steps = 50

350
changing total number of steps down to 50 to save time


In [22]:
# the original cell imported from a local checkout (cti.transformers...); the installed package exposes the same class
from transformers import TrainingArguments

In [23]:
# !pip uninstall torch_xla -y
# !pip install -i https://pip.repos.neuron.amazonaws.com torch-xla

In [24]:
output_dir = "/home/service/BioLlama/utilities/finetuning/finetuned_models/"
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size//4,
    bf16=True,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=total_num_steps // 10,
    # num_train_epochs=1,
    max_steps=total_num_steps,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    evaluation_strategy="steps",
    eval_steps=total_num_steps // 3,
    # logging strategies
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=1,
    save_strategy="epoch",  # save a checkpoint at the end of each epoch
)
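The schedule configured above (linear warmup for `total_num_steps // 10` steps, then cosine decay) and the evaluation cadence (`total_num_steps // 3`) can be sketched in plain Python. This assumes the commonly used form of a warmup-plus-cosine schedule; `lr_at` is a hypothetical helper, not a `transformers` function, and the library's per-step values may differ by small offsets.

```python
import math

# Sketch of a linear-warmup + cosine-decay schedule, using the values above:
# peak LR 2e-4, 50 total steps, warmup of 50 // 10 = 5 steps.
# lr_at is a hypothetical helper, not a transformers function.
def lr_at(step, peak_lr=2e-4, warmup_steps=5, total_steps=50):
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_num_steps = 50
eval_every = total_num_steps // 3
# Steps at which an evaluation pass would be triggered.
eval_points = list(range(eval_every, total_num_steps + 1, eval_every))

for step in (0, 5, 25, 50):
    print(step, f"{lr_at(step):.2e}")
print("evaluation at steps:", eval_points)
```

With 50 steps, the integer division gives an evaluation every 16 steps, so the last evaluation falls at step 48 rather than at the final step.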

In [25]:
# from utils import LLMSampleCB, token_accuracy

trainer = SFTTrainer(
    model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    packing=True,
    max_seq_length=1024,
    args=training_args,
    formatting_func=create_prompt,
    # compute_metrics=token_accuracy,
)
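`packing=True` asks the trainer to fill each training sequence with several short examples instead of padding every example to `max_seq_length`. The sketch below shows the general idea on toy token ids. `pack_sequences`, `EOS_ID`, and the ids are hypothetical, and TRL's own implementation differs in its details (for example, in how it buffers and tokenizes text).

```python
# Sketch of sequence packing: concatenate tokenized examples, separated by
# an end-of-sequence id, then cut the stream into fixed-length blocks.
# pack_sequences, the token ids and EOS_ID are hypothetical illustrations.
EOS_ID = 2

def pack_sequences(tokenized_examples, block_size, eos_id=EOS_ID):
    stream = []
    for tokens in tokenized_examples:
        stream.extend(tokens)
        stream.append(eos_id)  # mark the boundary between examples
    # Keep only full blocks; the incomplete tail is dropped in this sketch.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]]
blocks = pack_sequences(examples, block_size=4)
for b in blocks:
    print(b)
```

Every block has exactly `block_size` tokens, so no compute is spent on padding; the trade-off is that one block can contain the end of one example and the start of the next.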



In [26]:
# remove answers so the prompt ends right after "### Response:"
def create_prompt_no_answer(row):
    row["output"] = ""
    return {"text": create_prompt(row)}

test_dataset = eval_dataset.map(create_prompt_no_answer)

In [27]:
# wandb_callback = LLMSampleCB(trainer, test_dataset, num_samples=10, max_new_tokens=256)

In [28]:
# trainer.add_callback(wandb_callback)

In [29]:
trainer.train()
wandb.finish()

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...




| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 16 | 1.0126 | 1.011878 |
| 32 | 1.0121 | 0.979005 |
| 48 | 1.0268 | 0.973609 |




Run history:
eval/loss,█▂▁
eval/runtime,▁█▄
eval/samples_per_second,█▁▅
eval/steps_per_second,█▁█
train/epoch,▁▁▁▁▁▁▃▃▃▃▃▃▃▃▃▃▃▃▆▆▆▆▆▆▆▆▆▆▆▆▆█████████
train/global_step,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
train/learning_rate,▂▄▅▇██████▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁
train/loss,██▇▆▅▄▄▂▃▂▃▁▂▃▂▂▂▃▂▂▂▂▂▂▂▂▁▂▁▂▂▁▁▁▂▁▁▂▂▂
train/total_flos,▁
train/train_loss,▁

Run summary:
eval/loss,0.97361
eval/runtime,25.4135
eval/samples_per_second,39.349
eval/steps_per_second,4.919
train/epoch,0.03
train/global_step,50.0
train/learning_rate,0.0
train/loss,0.9835
train/total_flos,6.49528306827264e+16
train/train_loss,1.02997


In [30]:
import os
print(os.path.abspath(output_dir))

/home/service/BioLlama/utilities/finetuning/finetuned_models


In [31]:
trainer.save_model(output_dir)
#print contents of output_dir
!ls -l $output_dir
#print full path of output_dir
# !pwd $output_dir

total 13163372
drwxrwxr-x 2 service service       4096 Jan 17 00:52 checkpoint-350
drwxrwxr-x 2 service service       4096 Jan 22 11:09 checkpoint-50
-rw-rw-r-- 1 service service        685 Jan 22 11:09 config.json
-rw-rw-r-- 1 service service        183 Jan 22 11:09 generation_config.json
drwxrwxr-x 2 service service       4096 Jan 22 11:04 logs
-rw-rw-r-- 1 service service 4938985352 Jan 22 11:09 model-00001-of-00003.safetensors
-rw-rw-r-- 1 service service 4947390880 Jan 22 11:09 model-00002-of-00003.safetensors
-rw-rw-r-- 1 service service 3590488816 Jan 22 11:09 model-00003-of-00003.safetensors
-rw-rw-r-- 1 service service      23950 Jan 22 11:09 model.safetensors.index.json
-rw-rw-r-- 1 service service        437 Jan 22 11:09 special_tokens_map.json
-rw-rw-r-- 1 service service        920 Jan 22 11:09 tokenizer_config.json
-rw-rw-r-- 1 service service    1842767 Jan 22 11:09 tokenizer.json
-rw-rw-r-- 1 service service     499723 Jan 22 11:09 tokenizer.model
-rw-rw-r-- 1 service s

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [38]:
# there is a finetune of Llama 2 7B (HF) in the folder finetuned_models
# load this local model here and use it to generate some text

print(output_dir)

from transformers import AutoModelForCausalLM, AutoTokenizer
new_tokenizer = AutoTokenizer.from_pretrained(output_dir)
new_model = AutoModelForCausalLM.from_pretrained(output_dir)

prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n"

input_ids = new_tokenizer.encode(prompt, return_tensors="pt")

print(input_ids)
print(input_ids.shape)

output = new_model.generate(input_ids, max_length=35, do_sample=True, top_p=0.95, top_k=60)
print(new_tokenizer.decode(output[0], skip_special_tokens=True))

/home/service/BioLlama/utilities/finetuning/finetuned_models/


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Loading checkpoint shards: 100%|██████████| 3/3 [00:02<00:00,  1.49it/s]


tensor([[    1, 13866,   338,   385, 15278,   393, 16612,   263,  3414, 29889,
         14350,   263,  2933,   393,  7128,  2486,  1614,  2167,   278,  2009,
         29889,    13,    13,  2277, 29937,  2799,  4080, 29901,    13]])
torch.Size([1, 29])
<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Write a response that appropriately
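Two details explain the short output above. The prompt stops right after `### Instruction:\n`, so the model has to invent an instruction itself, and `max_length=35` includes the 29 prompt tokens, which leaves room for only 6 new tokens. The sketch below builds a complete inference prompt that ends at `### Response:\n` and works out the token budget. `build_inference_prompt` and the sample instruction are hypothetical.

```python
# Sketch: build an inference prompt that ends at "### Response:\n" so the
# model continues with an answer. build_inference_prompt and the sample
# instruction are hypothetical.
def build_inference_prompt(instruction, input_text=""):
    if input_text == "":
        header = ("Below is an instruction that describes a task. "
                  "Write a response that appropriately completes the request.\n\n")
        body = f"### Instruction:\n{instruction}\n\n"
    else:
        header = ("Below is an instruction that describes a task, paired with "
                  "an input that provides further context. "
                  "Write a response that appropriately completes the request.\n\n")
        body = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n"
    return header + body + "### Response:\n"

prompt = build_inference_prompt("List three uses of a paperclip.")
print(prompt)

# max_length in generate() includes the prompt tokens, so the budget for
# new tokens is the difference (values taken from the cell above).
prompt_tokens, max_length = 29, 35
new_token_budget = max_length - prompt_tokens
print("new tokens allowed:", new_token_budget)
```

Passing `max_new_tokens` to `generate` instead of `max_length` sets the budget for generated tokens directly, independent of the prompt length.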
