# Fine-tuning Llama 2
Below I use Hugging Face (HF) libraries to fine-tune Llama 2 on SQuAD. I use a locally saved base model and pull my data from [lmqg/qg_squad](https://huggingface.co/datasets/lmqg/qg_squad/viewer/qg_squad/train?row=0) on HF.

Setup: Connect to SoC GPU servers. Install dependencies into a conda environment from `aqg_hf_cuda.yml`.
Thanks to [brev.dev](https://github.com/brevdev/notebooks/tree/main) for some examples on how to do this.

In [2]:
# import libraries

# try to run this without using huggingface_hub
# from huggingface_hub import notebook_login
# notebook_login()

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments 
from peft import LoraConfig
from trl import SFTTrainer



In [3]:
# globals
models_dir = "/home/ac/code/aqg/models"

## Data
Pull the data with `datasets`. To train on your own data instead, all you need is a `.jsonl` file.

In [4]:
# https://huggingface.co/datasets/lmqg/qg_squad
SQuAD = load_dataset("lmqg/qg_squad")
print(SQuAD)
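
For readers unfamiliar with the `.jsonl` format mentioned above: it is plain text with one JSON object per line. Below is a minimal sketch using only the standard library. The file name `toy_squad.jsonl` and the two rows are made up for illustration; only the field names `paragraph_sentence` and `answer` come from the formatting code used later in this notebook.

```python
import json
import os
import tempfile

# Two made-up rows that use the two fields this notebook formats later.
rows = [
    {"paragraph_sentence": "<hl> The sky is blue. <hl> Grass is green.", "answer": "blue"},
    {"paragraph_sentence": "Water boils at 100 C. <hl> Ice melts at 0 C. <hl>", "answer": "0 C"},
]

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_squad.jsonl")

# Write one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read it back line by line, skipping blank lines.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]["answer"])
```

A file like this can then be loaded with `load_dataset("json", data_files=path)`.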

In [5]:
train_dataset = SQuAD['train']
eval_dataset = SQuAD['validation']

In [6]:
# view our data
print(train_dataset[0])

In [7]:
# formatting functions that build the untokenized training text for the various training phases

def contextAnswer(examples, i):
  # examples is a batch (a dict of lists); i selects one row from it
  return f"Select answer: {examples['paragraph_sentence'][i]}\n Answer: {examples['answer'][i]}"

def answer(example):
  # example is a single row; this function is not used by the trainer below
  return f"Answer: {example['answer']}"

def processData(examples):
  # takes a batch and returns one formatted string per row
  output_texts = []
  for i in range(len(examples['answer'])):
    text = contextAnswer(examples, i)
    output_texts.append(text)
  return output_texts
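
To make the batch layout concrete, here is a small sketch that runs the formatting on a made-up batch. The two functions repeat the logic of the cell above (with `processData` condensed into a list comprehension); the toy row is invented and is not taken from the dataset.

```python
def contextAnswer(examples, i):
    # Same template as in the notebook cell above.
    return f"Select answer: {examples['paragraph_sentence'][i]}\n Answer: {examples['answer'][i]}"

def processData(examples):
    # One formatted string per row of the batch.
    return [contextAnswer(examples, i) for i in range(len(examples['answer']))]

# Made-up batch in the dict-of-lists layout that the functions index into.
toy_batch = {
    "paragraph_sentence": ["<hl> The sky is blue. <hl> Grass is green."],
    "answer": ["blue"],
}

texts = processData(toy_batch)
print(texts[0])
```

Each returned string is a complete training example: the prompt and the target answer in one piece of text.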


## Model

In [8]:
base_model_location = f"{models_dir}/llama-hf/7b"

bnb_config = BitsAndBytesConfig(
  load_in_4bit=True,
  # we leave the model quantized in 4 bits
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch.float16
)

# load our model
base_model = AutoModelForCausalLM.from_pretrained(
  base_model_location,
  quantization_config=bnb_config,
  device_map="auto",
  # not needed here: trust_remote_code only matters for models that ship custom code,
  # and use_auth_token only for downloading gated models from the Hub
  # trust_remote_code=True,
  # use_auth_token=True
)
base_model.config.use_cache = False

# more info: https://github.com/huggingface/transformers/pull/24906
base_model.config.pretraining_tp = 1 

# load our tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_location)
tokenizer.pad_token = tokenizer.eos_token
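
As a rough sanity check on why 4-bit loading helps, the sketch below estimates the storage needed for the weights alone. It assumes a nominal 7 billion parameters (the real count differs slightly) and ignores quantization constants, activations, the LoRA weights, and optimizer state, so treat the numbers as a lower bound rather than a measurement.

```python
# Assumption: a nominal 7 billion parameters for the "7b" model.
n_params = 7_000_000_000

def weight_gib(n_params, bits_per_param):
    """Storage for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

fp16_gib = weight_gib(n_params, 16)  # half-precision weights
nf4_gib = weight_gib(n_params, 4)    # 4-bit quantized weights

print(f"fp16: {fp16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")
```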


In [None]:
# add custom tokens here
new_tokens = ["<hl>"]
vocabulary = tokenizer.get_vocab().keys()
for token in new_tokens:
    # check to see if new token is in the vocabulary or not
    if token not in vocabulary:
        tokenizer.add_tokens(token)

base_model.resize_token_embeddings(len(tokenizer))

Set up training so that we log, save, and evaluate every 50 steps:

In [9]:
output_dir = f"{models_dir}/output/7bTrainedWithHF"

# set up training: log, save a checkpoint, and evaluate every 50 steps
# report_to is not set, so the Trainer reports to every installed integration
# (here Weights & Biases)
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=50, # how often to log the training loss
    max_steps=100, # how much training we want to do
    logging_dir=f"{output_dir}/logs", # directory for storing logs
    save_strategy="steps",
    save_steps=50, # how often to save checkpoints
    evaluation_strategy="steps",
    eval_steps=50, # how often to pause for eval
    do_eval=True # run evaluation on the validation set
)
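
A quick worked example of what these settings imply, using the batch size, accumulation steps, `max_steps`, and `save_steps` from the cell above. The single-GPU value `n_devices = 1` is an assumption; adjust it if you train data-parallel on several devices.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 100
save_steps = 50
n_devices = 1  # assumption: one GPU doing the training

# Examples contributing to each optimizer step.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * n_devices

# Total training examples consumed over the whole run.
examples_seen = effective_batch * max_steps

# Steps at which a checkpoint is written.
checkpoint_steps = list(range(save_steps, max_steps + 1, save_steps))

print(effective_batch, examples_seen, checkpoint_steps)
```

So this run is a short smoke test that touches only a small slice of the training set.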


We set the config for the LoRA adapter:

In [10]:
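# configure the LoRA adapter: r is the rank of the low-rank update matrices,
# lora_alpha scales the update, and lora_dropout is applied to the adapter input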
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

(I experimented with higher alpha and r and found poorer results.)
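
To get a feel for what `r` and `lora_alpha` control, the sketch below counts the trainable adapter parameters and computes the update scaling. The model dimensions and the choice of adapted modules are assumptions that this notebook does not state: a hidden size of 4096, 32 decoder layers, and adapters on `q_proj` and `v_proj` only.

```python
r = 64
lora_alpha = 16

# Assumptions about Llama 2 7B and the adapted modules (not stated in this notebook).
hidden_size = 4096      # input and output width of the attention projections
n_layers = 32           # number of decoder layers
adapted_per_layer = 2   # q_proj and v_proj

# Each adapted (d_out x d_in) weight gets two small matrices:
# A with shape (r x d_in) and B with shape (d_out x r).
params_per_module = r * (hidden_size + hidden_size)
trainable = params_per_module * adapted_per_layer * n_layers

# The low-rank update B @ A is multiplied by alpha / r before being added to the frozen weight.
scaling = lora_alpha / r

print(f"trainable LoRA parameters: {trainable:,}")
print(f"update scaling alpha / r: {scaling}")
```

Under these assumptions the adapter is well under 1% of the base model's size, and raising `r` grows it linearly.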

In [11]:
# use the SFTTrainer from HuggingFace's trl
max_seq_length = 512
trainer = SFTTrainer(
    model=base_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    formatting_func=processData,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args,
)



In [12]:
# pass in resume_from_checkpoint=True to resume from a checkpoint
# when we train, we can see our progress and system info on wandb.ai
trainer.train()


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 1.826         | 1.689577        |
| 100  | 1.6641        | 1.671155        |
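
One way to read these numbers is as perplexity. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, in which case `exp(loss)` is the perplexity; the values are copied from the results above.

```python
import math

# (step, training loss, validation loss) copied from the results above
log = [(50, 1.826, 1.689577), (100, 1.6641, 1.671155)]

for step, train_loss, val_loss in log:
    print(f"step {step}: validation perplexity {math.exp(val_loss):.2f}")

# Validation perplexity at each logged step, in order.
val_ppl = [math.exp(val_loss) for _, _, val_loss in log]
```

The drop between the two steps is small, which fits a run of only 100 steps.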


## Running inference on a trained model
By default, the PEFT library saves only the QLoRA adapters, so we need to load the base Llama 2 model from the Hugging Face Hub:

In [None]:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
from peft import PeftModel

In [None]:
base_model_name="meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    use_auth_token=True
)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Then load the QLoRA adapter from a checkpoint directory:

In [None]:
model = PeftModel.from_pretrained(base_model, "/root/llama2sfft-testing/Llama-2-7b-hf-qlora-full-dataset/checkpoint-900")
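
The Trainer writes checkpoints as `checkpoint-<step>` folders inside `output_dir`, so the adapter path can be found programmatically instead of being hard-coded. The helper `latest_checkpoint` below is a hypothetical stdlib-only sketch, demonstrated on a throwaway directory; `transformers.trainer_utils` also offers `get_last_checkpoint` for the same purpose.

```python
import os
import re
import tempfile

def latest_checkpoint(output_dir):
    """Return the checkpoint-N subdirectory with the largest N, or None if there is none."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Demo on a temporary directory that mimics a Trainer output_dir.
demo_dir = tempfile.mkdtemp()
for step in (50, 100):
    os.mkdir(os.path.join(demo_dir, f"checkpoint-{step}"))

print(latest_checkpoint(demo_dir))
```

The step is compared as an integer because a plain string sort would rank `checkpoint-50` above `checkpoint-100`.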

Then run some inference:

In [None]:
eval_prompt = """A note has the following\nTitle: \nLabels: \nContent: i love"""

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(tokenizer.decode(model.generate(**model_input, max_new_tokens=100)[0], skip_special_tokens=True))
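
The adapter trained in this notebook saw text of the form `Select answer: ...\n Answer: ...`, so a prompt in that shape is probably a better probe than a free-form one. The two helpers below, `build_prompt` and `extract_answer`, are hypothetical and use only string operations; the generated text is faked here so the sketch runs without a model.

```python
def build_prompt(paragraph_sentence):
    # Same prefix as the training text, stopping right where the answer should start.
    return f"Select answer: {paragraph_sentence}\n Answer:"

def extract_answer(generated_text, prompt):
    # A decoder-only model returns the prompt followed by the completion.
    # Strip the prompt if it is present, then keep the first non-empty line.
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    stripped = completion.strip()
    return stripped.splitlines()[0].strip() if stripped else ""

prompt = build_prompt("<hl> The sky is blue. <hl> Grass is green.")

# Stand-in for tokenizer.decode(model.generate(...)[0], skip_special_tokens=True).
fake_generation = prompt + " blue\nSelect answer: ..."

print(repr(extract_answer(fake_generation, prompt)))
```

In practice the decoded text may not reproduce the prompt character for character, which is why the helper falls back to the full text when the prefix does not match.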