# My first LLM finetune
### A blog post by Andrew Ferruolo

## My project

If you're like me, you do most of your writing in C++ or Python, not in English. While I'm grateful that I get to code every day, it does lead to me making regular grammar and spelling mistakes. I often look at my writing and wish I could figure out how to word it better, or more naturally. I initially applied ChatGPT and Claude to this problem, but found that they would often change my style, and sometimes even the meaning of my sentences. This is because ChatGPT and Claude are trained and managed behind closed doors, by companies who train the models to enforce their opinions on users in the name of AI safety. While I understand that these companies need to provide some form of control and bias to prevent bad outcomes from usage of their products, I dislike having my writing altered to agree with other people's opinions. And so, I decided that what I needed was to finetune an open source LLM for my purposes, and find a way to run it on my local computer. Here is a documented account of my journey.

## Requirements and Specifications

From the above paragraph, we can derive the following requirements for my project:

1. Simple - I was a Michigan CS student at the time of the project. I didn't exactly have days to throw at it.
2. Open Source - The goal is to get a strong understanding of LLMs, tools, frameworks, etc.
3. Fast - If it takes 30 minutes for the LLM to run, I might as well just do the work myself.
4. Memory Efficient - LLMs are large by nature, but my Mac only has so much RAM.

Given these requirements, I landed on the following implementation specifications:
    
1. Use Huggingface to train, and a basic adaptation of Llama.cpp to deploy.
2. Use Llama2 7B and finetune - Easily works with Llama.cpp, and I'm a huge fan of Yann LeCun and FAIR.
3. Weights must be quantized. We might also want to prune for a smaller, more performant network.
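
To put the memory requirement in numbers, here is a rough back-of-the-envelope sketch. It assumes a round 7 billion parameters and counts only the weights; activations, the KV cache, and quantization metadata are ignored, so treat the results as lower bounds rather than measurements.

```python
# Rough weight-memory estimate for a 7B-parameter model at different precisions.
# Assumption: exactly 7e9 parameters; only the weights are counted.
N_PARAMS = 7_000_000_000

def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Return the weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Going from 16-bit to 4-bit weights cuts the footprint by a factor of four, which is the difference between not fitting and fitting on a typical laptop.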

## We have our specs. Now, let's start building it

To finetune a large language model, we need some powerful servers and GPUs. Training 7B params, even quantized, on my local computer would likely lead to it bursting into flames. So, I had to use the cloud.

### My first instinct: use AWS (an expensive and unfruitful journey)

This, obviously, was a mistake. I don't know if you have used SageMaker before, but within a day my "Cost and usage" tab looked a little something like this:

![AWS Cost](aws-csot.gif "AWS COST")

Well. I guess it's worth it. I'm paying for an easy-to-use, intuitive, flexible interface, right? WRONG:




<div>
<img src="SageMaker-Confusing.jpg" width="500"/>
</div>


What even is this? I just want a basic instance I can SSH into. Even deleting profiles, instances, and the rest from AWS is a huge pain. Although I'm sure that at the enterprise level there is a good reason for all of this, I personally don't want to deal with it. So, I'm just not going to. Bye Jeff!

### A Better Solution: Brev.dev

I then remembered a company called Brev.dev, which I had seen on Twitter a couple of months earlier. After checking them out, I discovered they agreed with me that AWS is nearly unusable, and had created a solution for AI hackers like me. Their simple interface allowed me to complete the rest of my project in just a few hours, and for minimal cost.

### Spinning up is simple


<div>
<img src="Brev-Spinup.jpg" width="500"/>
</div>

### Managing is just as simple

<div>
<img src="brev-start.jpg" width="500"/>
</div>

### SSH is easier than ever before

<div>
<img src="brev-ssh.jpg" width="500"/>
</div>


### Now that's what we like to see
Brev provides a simple, clean interface to get new GPUs. Also, they're insanely cheap. Look at those prices! Finetuning might cost me less than a burger if I do it right! Now that we have our instance all set up, let's start actually doing work.

## Getting my dataset/model ready for huggingface:

### Steps
1. Create a directory called "grammar_dataset", with two subdirectories called "train" and "validation"
2. Download the data using "download_grammar_dataset.py"
3. Move "gtrain_10k.csv" to train, and "grammar_validation.csv" to validation
4. Add a README.md to grammar_dataset with the following contents (copy directly from the cell below):

---
configs:
- config_name: default
  data_files:
  - split: train
    path: "train/gtrain_10k.csv"
  - split: validation
    path: "validation/grammar_validation.csv"
---
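
Steps 1, 3, and 4 can also be scripted. The sketch below is one way to do it, not part of the original workflow: `prepare_dataset_dir` and its `downloads` argument are hypothetical names, and it assumes the two CSV files from step 2 are sitting in the `downloads` directory (files that are missing are skipped).

```python
import shutil
from pathlib import Path

# YAML front matter copied from the cell above; it tells Hugging Face
# `datasets` which CSV file belongs to which split.
README_TEXT = """---
configs:
- config_name: default
  data_files:
  - split: train
    path: "train/gtrain_10k.csv"
  - split: validation
    path: "validation/grammar_validation.csv"
---
"""

def prepare_dataset_dir(root: str, downloads: str = ".") -> Path:
    """Create the train/validation layout and move the CSVs if they exist."""
    root_dir = Path(root)
    moves = {
        "gtrain_10k.csv": root_dir / "train",
        "grammar_validation.csv": root_dir / "validation",
    }
    for filename, target_dir in moves.items():
        target_dir.mkdir(parents=True, exist_ok=True)
        source = Path(downloads) / filename
        if source.exists():  # skip quietly if the download step has not run yet
            shutil.move(str(source), str(target_dir / filename))
    (root_dir / "README.md").write_text(README_TEXT)
    return root_dir
```

For example, `prepare_dataset_dir("./llama_datasets/grammar_dataset")` would produce the layout that the training cells further down load from.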

Now we need to get our model.

5. Visit this link (https://llama.meta.com/llama-downloads/), and follow the instructions to download your desired model to your instance
6. Use the "convert_llama_weights_to_hf.py" script in this repo to convert your weights to huggingface format
7. If your filetree looks like the one below, you're ready to go!

In [None]:
# TODO: Filetree here

# Finetuning my model, using huggingface.

Warning: This part might get a little dense. You'll have to forgive me if I breeze over some explanations. For a deeper explanation, take a look at Harper Carroll's blog here: (https://brev.dev/blog/how-qlora-works)

In [9]:
# First, we import
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from datasets import load_dataset

In [17]:
# Constants
llama_og_path = "./models/llama-7b-huggingface"
llama_token_path = "./models/llama-7b-huggingface"
dataset = "./llama_datasets/grammar_dataset/" #gtrain_10k.csv"


In [None]:
# `dataset` is the dataset directory defined in the constants cell above
train_dataset = load_dataset(dataset, split='train')
eval_dataset  = load_dataset(dataset, split='validation')

In [11]:
tokenizer = AutoTokenizer.from_pretrained(
    llama_token_path,
    model_max_length=256,
    padding_side="left",
    add_eos_token=True)

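# Llama has no pad token by default, so reuse the BOS token for padding
tokenizer.pad_token = tokenizer.bos_token

def tokenize(prompt):
    # Pad or truncate every prompt to a fixed 512 tokens
    # (this max_length takes precedence over model_max_length=256 above)
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding="max_length",
    )
    # Causal LM training: the labels are a copy of the input ids
    result["labels"] = result["input_ids"].copy()
    return result

bos = tokenizer.bos_token
eos = tokenizer.eos_token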
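
For intuition about what `padding="max_length"`, `padding_side="left"`, and `truncation=True` do together, here is a toy sketch in plain Python. It is a stand-in that operates on made-up token ids, not the Hugging Face implementation, and `pad_or_truncate` is a hypothetical helper.

```python
def pad_or_truncate(input_ids, max_length, pad_id):
    """Toy version of left padding plus truncation to a fixed length."""
    ids = list(input_ids)[:max_length]                   # truncation: drop tokens past max_length
    n_pad = max_length - len(ids)
    return {
        "input_ids": [pad_id] * n_pad + ids,             # left padding: pads go in front
        "attention_mask": [0] * n_pad + [1] * len(ids),  # 0 marks padding positions
    }

example = pad_or_truncate([5, 6, 7], max_length=5, pad_id=1)
example["labels"] = example["input_ids"].copy()          # same copy step as in tokenize()
print(example)
```

Every training example ends up with the same length, which keeps batching simple at the cost of spending compute on padding.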


In [None]:
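def generate_and_tokenize_prompt(data_point):
    incorrect = data_point['input']   # sentence containing mistakes
    correct = data_point['target']    # its corrected version

    # Adjacent string literals are concatenated, so no stray indentation
    # ends up in the prompt and it matches the prompt used at evaluation time
    full_prompt = (
        "You will see two sentences. The first is marked INCORRECT and has a plethora of spelling and grammatical issues, "
        "the second is marked CORRECT and shows the fixed version of the prior sentence. "
        f"INCORRECT: {incorrect} CORRECT: {correct}"
    )
    return tokenize(full_prompt)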
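
To see what the model is actually trained on, it helps to print one formatted prompt. The sketch below rebuilds the same template without the tokenizer; `build_prompt` and the sample row are invented for illustration, and it assumes each row is a dict with an `input` and a `target` field, as in the function above.

```python
# Assumption: each row is a dict with 'input' (sentence with mistakes) and
# 'target' (corrected sentence); the row below is invented for illustration.
TEMPLATE = (
    "You will see two sentences. The first is marked INCORRECT and has a plethora "
    "of spelling and grammatical issues, the second is marked CORRECT and shows "
    "the fixed version of the prior sentence. INCORRECT: {incorrect} CORRECT: {correct}"
)

def build_prompt(row: dict, include_answer: bool = True) -> str:
    """Training prompts include the answer; inference prompts leave it blank."""
    correct = row["target"] if include_answer else ""
    return TEMPLATE.format(incorrect=row["input"], correct=correct)

row = {"input": "she dont like apple", "target": "She doesn't like apples."}
print(build_prompt(row))                        # what the model sees during training
print(build_prompt(row, include_answer=False))  # what we feed it at inference time
```

The inference prompt is a strict prefix of the training prompt, so at test time the model only has to continue the text after the final `CORRECT: `.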

### Prep Model For Training

In [None]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the 4-bit quantized base model; the LoRA adapter is attached to it further down
model = AutoModelForCausalLM.from_pretrained(llama_og_path, quantization_config=bnb_config)

In [None]:
# Re-init the tokenizer so it doesn't add padding or eos token
eval_prompt = "It's great to be  "
eval_tokenizer = AutoTokenizer.from_pretrained(
    llama_token_path,
    padding_side="left",
    model_max_length=256,
)

model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(model.generate(**model_input, max_new_tokens=256)[0], skip_special_tokens=False))


In [None]:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )



In [None]:
config = LoraConfig(
    r=6,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

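# Create the Accelerator used to wrap the model
# (the FSDP settings here are an assumed, typical configuration)
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
model = accelerator.prepare_model(model)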
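
As a sanity check on the LoRA config, the number of trainable parameters can be estimated by hand. The sketch below is only an estimate under stated assumptions: the layer dimensions are the commonly published Llama 2 7B values, which this post does not state, and `lora_params` is a hypothetical helper.

```python
# Estimate how many trainable parameters the LoRA config above adds.
# Assumed Llama 2 7B dimensions (not stated in this post):
HIDDEN, INTERMEDIATE, N_LAYERS, VOCAB = 4096, 11008, 32, 32000
R = 6  # LoRA rank from the config

# (in_features, out_features) of each targeted linear layer inside one decoder block
PER_LAYER = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in) and B (out x r), i.e. r * (in + out) weights."""
    return r * (in_features + out_features)

per_block = sum(lora_params(i, o, R) for i, o in PER_LAYER.values())
total = per_block * N_LAYERS + lora_params(HIDDEN, VOCAB, R)  # plus lm_head
print(f"LoRA trainable parameters: {total:,}")
```

Under these assumptions the adapter has roughly 15 million trainable parameters, a small fraction of the 7B base model. The percentage reported by `print_trainable_parameters` may look different, since 4-bit weights are stored in packed form.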


In [None]:
if torch.cuda.device_count() > 1: # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True

In [None]:
import transformers
from datetime import datetime

project = "grammar"
base_model_name = "llama2"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=5,
        per_device_train_batch_size=2,
        gradient_checkpointing=True,
        gradient_accumulation_steps=4,
        max_steps=1000,
        learning_rate=2.5e-5,
        logging_steps=50,
        bf16=True,
        optim="paged_adamw_8bit",
        logging_dir="./logs",   
        save_strategy="steps",
        save_steps=50,                # Save a checkpoint every 50 steps
        evaluation_strategy="steps",  # Evaluate on a step schedule
        eval_steps=50,                # Evaluate every 50 steps
        do_eval=True,                 # Enable evaluation during training
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()
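
It is worth working out what these arguments mean in terms of data seen. The arithmetic below is a sketch under two assumptions that do not come from the arguments themselves: training runs on a single GPU, and the training file holds 10,000 rows (suggested by the name `gtrain_10k.csv`).

```python
# Values copied from the TrainingArguments above
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 1000
save_steps = 50

# Assumptions (not from the arguments): one GPU, and a 10,000-row training set
n_gpus = 1
train_rows = 10_000

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * n_gpus
sequences_seen = effective_batch * max_steps
epochs = sequences_seen / train_rows
n_checkpoints = max_steps // save_steps

print(f"effective batch size: {effective_batch}")
print(f"sequences seen:       {sequences_seen}")
print(f"approximate epochs:   {epochs:.1f}")
print(f"checkpoints written:  {n_checkpoints}")
```

With these numbers the run covers a bit less than one pass over the data and writes a checkpoint every 50 steps, so keep an eye on disk space.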


# We Trained, now what?

## First, let's test the model
1. Hit Esc, then 0 twice (this restarts the kernel), then evaluate by hand using the cells below

In [13]:
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from datasets import load_dataset
from peft import PeftModel
from tqdm import tqdm_notebook as tqdm
!jupyter nbextension enable --py widgetsnbextension



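In [None]:
# The kernel was restarted, so redefine the paths from the training section
llama_og_path = "./models/llama-7b-huggingface"
llama_token_path = "./models/llama-7b-huggingface"
dataset = "./llama_datasets/grammar_dataset/"
checkpoint_path = "" # TODO: Put Checkpoint path here

In [None]:
eval_dataset = load_dataset(dataset, split='validation')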

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    llama_token_path)

tokenizer.pad_token = tokenizer.eos_token
bos = tokenizer.bos_token
eos = tokenizer.eos_token

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    llama_og_path,  # Llama 2 7B, same as before
    quantization_config=bnb_config,  # Same quantization config as before
    device_map="auto",
    trust_remote_code=True,
)


In [14]:
# Change i to change the input prompt. Evaluate the
# outputs using your own judgement. Adjust hyperparameters above for training 
# or prompt to improve your response

i = 50


# No explicit BOS token here: the tokenizer already prepends one
eval_prompt = "You will see two sentences. The first is marked INCORRECT and has a plethora of spelling and grammatical issues," + \
        f" the second is marked CORRECT and shows the fixed version of the prior sentence. INCORRECT: {eval_dataset[i]['input']} CORRECT: "


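# Wrap the quantized base model with the finetuned LoRA adapter (run once per kernel session)
ft_model = PeftModel.from_pretrained(base_model, checkpoint_path)

model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

ft_model.eval()
with torch.no_grad():
    output = tokenizer.decode(ft_model.generate(**model_input, max_new_tokens=150, repetition_penalty=1.15)[0], skip_special_tokens=True)
print(output)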
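
The decoded output contains the prompt followed by the model's continuation, and the model may keep going and invent further INCORRECT/CORRECT pairs. A small helper can pull out just the first correction. This is a sketch: `extract_correction` is a hypothetical name, and it assumes the prompt format used in this post.

```python
def extract_correction(prompt: str, generated: str) -> str:
    """Return only the model's first corrected sentence."""
    if generated.startswith(prompt):
        # Usual case: the decoded text echoes the prompt, so slice it off
        completion = generated[len(prompt):]
    else:
        # Fallback: "CORRECT: " first appears inside "INCORRECT: ", so the
        # answer starts after its second occurrence
        parts = generated.split("CORRECT: ", 2)
        completion = parts[2] if len(parts) == 3 else ""
    # Keep only the text before any further pair the model starts writing
    return completion.split("INCORRECT:")[0].strip()

demo_prompt = "INCORRECT: she dont like apple CORRECT: "
demo_output = demo_prompt + "She doesn't like apples. INCORRECT: he go CORRECT: He goes."
print(extract_correction(demo_prompt, demo_output))
```

With the variables from the cell above, the call would be `extract_correction(eval_prompt, output)`.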



# Now we have a good model. 
We have to merge our LoRA adapter into the base model and dequantize it before we can export to Llama.cpp

In [16]:
# TODO: Finish me
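
Until that cell is filled in, here is a minimal sketch of one common way to do the merge with `peft`. It is not necessarily what I ran. Instead of dequantizing the 4-bit weights, it reloads the base model in half precision and folds the adapter into it, which can differ slightly from the quantized model the adapter was trained against. `merge_lora_into_base` is a hypothetical helper name, and all paths are placeholders.

```python
def merge_lora_into_base(base_path: str, adapter_path: str, output_dir: str) -> None:
    """Merge a LoRA adapter into half-precision base weights and save the result."""
    # Imported lazily so the function can be defined without the GPU stack installed
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model unquantized (fp16) rather than dequantizing 4-bit weights
    base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
    # Attach the trained adapter, then fold its weights into the base layers
    merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
    # Save model and tokenizer side by side in Hugging Face format
    merged.save_pretrained(output_dir)
    AutoTokenizer.from_pretrained(base_path).save_pretrained(output_dir)

# Example call (placeholder paths):
# merge_lora_into_base("./models/llama-7b-huggingface", checkpoint_path, "./models/model-final")
```

The output directory is a plain Hugging Face checkpoint, which is the format the conversion script in the next section expects.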

# Now that we have merged, we can export

We download our weights by tarring them up and using SCP. On the instance:
```console
tar -czvf final-model.tar.gz models/model-final
```

Now, on your local machine, work in a clone of Llama.cpp (use my modifications here, under "examples/llamacheck": https://github.com/Ferruolo/llama.cpp). Build, copy the archive over, unpack it, and convert:
```
make llamacheck
scp myBrevInstanceName:myFirstLLM/final-model.tar.gz ./models
tar -xzvf ./models/final-model.tar.gz
python convert-hf-to-gguf.py ./models/model-final --outfile llamacheck.gguf
# TODO: Quantize llamacheck.gguf
```

# We're done. A brief reflection

Being honest, my experience was not as simple as this blog post. Even though I adapted most of my code directly from the notebooks I actually used, I took a lot of twists and turns, misinterpreted docs several times, and struggled to export my weights. I hope that by compiling everything I did into one big notebook, I have made your life significantly easier! I was surprised by how easy finetuning was once I had figured everything out, but was dismayed at how difficult it was to do practical things with huggingface, the bits and bytes library (QLoRA), and llama.cpp. I found myself in configuration hell. There's no reason that exporting to another format should be as complicated as it was.


I would like to thank the Brev team for helping me learn a lot of the things I talk about here, and giving me the opportunity to write this blog for them.