# My First LLM Finetune
### A blog post by Andrew Ferruolo

## Some Background Knowledge
I understand not everyone reading this is going to know all the terms I throw around here. So, here is a basic dictionary to help you out.

Large Language Model (LLM) - A large AI model trained to work with text. ChatGPT and Claude are examples of LLMs. \
HuggingFace - The Huggingface transformers library is a package that lets us work with LLMs while hiding a lot of the complexity. \
Package - Prewritten code downloaded off the internet \
Train/Finetune - To train an AI model means running the model over a dataset and having it make predictions, and then correcting the model parameters by assessing the correctness of its predictions. Finetuning is just continued training. \
Llama - An LLM developed by Facebook AI Research (FAIR) \
Llama.cpp - A deployment framework for Llama written in C++
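
To make the Train/Finetune entry a bit more concrete, here is a toy sketch in plain Python. It is not how an LLM is actually trained; it fits a single made-up parameter `w` so that `w * x` matches `y`, and the function name, learning rate, and data are all assumptions for illustration. The point is only the loop: predict, measure the error, nudge the parameter. Finetuning is the same loop started from already-trained parameters.

```python
def train(xs, ys, w_init=0.0, lr=0.05, epochs=200):
    """Toy training loop: learn w so that w * x approximates y."""
    w = w_init
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            prediction = w * x          # the model makes a prediction
            error = prediction - y      # assess how wrong it was
            w -= lr * error * x         # correct the parameter (gradient step)
    return w


# "Pretraining": learn y = 2x from scratch.
pretrained_w = train([1, 2, 3, 4], [2, 4, 6, 8])

# "Finetuning": continue training from the pretrained value on new data (y = 3x).
finetuned_w = train([1, 2], [3, 6], w_init=pretrained_w)

print(pretrained_w, finetuned_w)  # close to 2.0 and 3.0
```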

## My project

If you're like me, you have to do most of your writing in C++ or Python, not in English. While I'm grateful that I get to code every day, this does lead to me making regular grammar and spelling mistakes. I often look at my writing and wish I could figure out how to word it better, or more naturally. I initially applied ChatGPT and Claude to this problem, but found that they would often change my style, and sometimes even the meaning of my sentences. This is because ChatGPT and Claude are trained and managed behind closed doors, by companies who train the models to enforce their opinions on users in the name of AI safety. While I understand that these companies need to provide some form of control and bias to prevent bad outcomes from usage of their products, I dislike having my writing altered to agree with other people's opinions. And so, I decided what I needed was to finetune an open source Large Language Model (LLM) for my purposes, and find a way to run it on my local computer. Here is a documented account of my journey.

## Requirements and Specifications

From the above paragraph, we can elicit the following requirements for my project:

1. Simple - I was a Michigan CS student at the time of the project. I didn't exactly have days to throw at this project
2. Open Source - The goal is to get a strong understanding of LLMs, tools, frameworks, etc. 
3. Fast - If it takes 30 minutes for the LLM to run, I might as well just do the work myself.
4. Memory Efficient - LLMs are large by nature, but my Mac only has so much RAM. 

Given these requirements, I landed on the following implementation specifications:
    
1. Use Huggingface to train, use basic adaptation of Llama.cpp to deploy
2. Use Llama2 7B and finetune - Easily works with Llama.cpp, and I'm a huge fan of Yann LeCun and FAIR.
3. Weights must be quantized. Possibly, we might want to prune for a more performant and smaller network.
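
To see why quantization matters for the memory requirement, here is a back-of-the-envelope sketch. It assumes roughly 7 billion parameters and counts only the weights themselves; activations, optimizer state, and the small amount of metadata that quantization adds are ignored, so treat the numbers as a lower bound rather than a measurement. The function name is my own.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed just to hold the weights, in gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: "7B" taken at face value

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights need about 28 GB in float32, 14 GB in bfloat16, and 3.5 GB at 4 bits, which is the difference between not fitting on a laptop and fitting comfortably.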

## We have our specs. Now, let's start building it

To finetune a large language model, we need some powerful servers and GPUs. Training 7B params, even quantized, on my local computer would likely lead to it bursting into flames. So, I had to use the cloud.

### My first instinct: use AWS (an expensive and unfruitful journey)

This, obviously, was a mistake. I don't know if you have used SageMaker before, but within a day my "Cost and usage" tab looked a little something like this:

![AWS Cost](aws-csot.gif "AWS COST")

Well. I guess it's worth it. I'm paying for an easy-to-use, intuitive, flexible interface, right? WRONG:




<div>
<img src="SageMaker-Confusing.jpg" width="500"/>
</div>


What even is this! I just want a basic interface. Even deleting profiles, instances, and the rest from AWS is a huge pain. Although I'm sure that at the enterprise level there is a good reason for all of this, I personally don't want to deal with it. So, I'm just not going to. Bye Jeff!

### A Better Solution: Brev.dev

I then remembered a company called Brev.dev, which I had seen on Twitter a couple of months earlier. After checking them out, I discovered they agreed with me that AWS is nearly unusable, and had created a solution for AI hackers like me. Their simple interface allowed me to complete the rest of my project in just a few hours, and for minimal cost.

### Spinning up is simple


<div>
<img src="Brev-Spinup.jpg" width="500"/>
</div>

### Managing is just as simple

<div>
<img src="brev-start.jpg" width="500"/>
</div>

### SSH is easier than ever before

<div>
<img src="brev-ssh.jpg" width="500"/>
</div>


### Now that's what we like to see
Brev provides a simple, clean interface to get new GPUs. Also, they're insanely cheap. Look at those prices! Finetuning might cost me less than a burger if I do it right! Now that we have our instance all set up, let's start actually doing work.

## Getting my dataset/model ready for huggingface:

### Steps
1. Create a directory called "grammar_dataset", with two subdirectories called "train" and "validation"
2. Download data using "download_grammar_dataset.py"
3. Move "gtrain_10k.csv" to train, "grammar_validation.csv" to validation
4. Add a README.md to grammar_dataset with the following contents (copy directly from the cell below, including the two `---` lines):

```console
---
configs:
- config_name: default
  data_files:
  - split: train
    path: "train/gtrain_10k.csv"
  - split: validation
    path: "validation/grammar_validation.csv"
---
```
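
If you would rather script steps 1, 3, and 4 than do them by hand, something like the sketch below should work. The function name `build_dataset_dir` is my own invention, and it assumes the two CSV files have already been downloaded (step 2) and that you pass their current locations in.

```python
import shutil
from pathlib import Path

# Same YAML front matter as the README shown above.
README_TEXT = """---
configs:
- config_name: default
  data_files:
  - split: train
    path: "train/gtrain_10k.csv"
  - split: validation
    path: "validation/grammar_validation.csv"
---
"""


def build_dataset_dir(root, train_csv, validation_csv):
    """Create root/train and root/validation, move the CSVs in, write README.md."""
    root = Path(root)
    for split, csv_path in (("train", train_csv), ("validation", validation_csv)):
        split_dir = root / split
        split_dir.mkdir(parents=True, exist_ok=True)
        # Keep the original file name so it matches the paths in the README.
        shutil.move(str(csv_path), str(split_dir / Path(csv_path).name))
    (root / "README.md").write_text(README_TEXT)
    return root
```

For the layout used in this post, you would call it as `build_dataset_dir("datasets/grammar_dataset", "gtrain_10k.csv", "grammar_validation.csv")`.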

Now we need to get our model

5. Visit this link (https://llama.meta.com/llama-downloads/), and follow the instructions to download your desired model to your instance
6. Use the "convert_llama_weights_to_hf.py" script in this repo to convert your weights to huggingface format
7. If your filetree looks like the one below, you're ready to go!

```console
.
├── MyFirstLLM.ipynb
├── datasets
│   └── grammar_dataset
│       ├── README.md
│       ├── train
│       │   └── gtrain_10k.csv
│       └── validation
│           └── grammar_validation.csv
├── models
│   ├── convert_llama_weights_to_hf.py
│   ├── llama-7B-huggingface
│   │   ├── config.json
│   │   ├── generation_config.json
│   │   ├── pytorch_model-00001-of-00003.bin
│   │   ├── pytorch_model-00002-of-00003.bin
│   │   ├── pytorch_model-00003-of-00003.bin
│   │   ├── pytorch_model.bin.index.json
│   │   ├── special_tokens_map.json
│   │   ├── tokenizer.json
│   │   ├── tokenizer.model
│   │   └── tokenizer_config.json
│   └── llama-7B-pytorch
│       ├── checklist.chk
│       ├── consolidated.00.pth
│       ├── params.json
│       └── tokenizer.model
└── readme.md
```

# Finetuning my model using Huggingface

Warning: This part might get a little dense. You'll have to forgive me if I breeze over some explanations. For a deeper explanation, take a look at Harper Carroll's blog here: (https://brev.dev/blog/how-qlora-works). The majority of this section is adapted from that blog, with adjustments added in as I saw fit.

In [None]:
# Install everything
!pip install torch
!pip install transformers
!pip install bitsandbytes
!pip install peft
!pip install datasets
!pip install accelerate

In [2]:
# Import Libraries
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from datasets import load_dataset

In [2]:
# Constants
llama_og_path = "./models/llama-7B-huggingface"
llama_token_path = "./models/llama-7B-huggingface"
dataset = "./datasets/grammar_dataset/"


In [3]:
train_dataset = load_dataset(dataset, split='train')
eval_dataset  = load_dataset(dataset, split='validation')

In [4]:
tokenizer = AutoTokenizer.from_pretrained(
    llama_token_path,
    model_max_length=256,
    padding_side="left",
    add_eos_token=True)

tokenizer.pad_token = tokenizer.bos_token
def tokenize(prompt):
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=512,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result
bos = tokenizer.bos_token
eos = tokenizer.eos_token

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers


In [5]:
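def generate_and_tokenize_prompt(data_point):
    incorrect = data_point['input']   # sentence with errors
    corrected = data_point['target']  # fixed version of the same sentence

    # Build the prompt from adjacent string literals so no stray indentation
    # ends up inside it; this matches the prompt used for evaluation later on.
    full_prompt = (
        "You will see two sentences. The first is marked INCORRECT and has a plethora of spelling and grammatical issues,"
        " the second is marked CORRECT and shows the fixed version of the prior sentence."
        f" INCORRECT: {incorrect} CORRECT: {corrected}"
    )
    return tokenize(full_prompt)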

### Prep Model For Training

In [6]:
tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

In [7]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(llama_og_path, quantization_config=bnb_config, device_map="auto")

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [9]:
# Re-init the tokenizer so it doesn't add padding or eos token
eval_prompt = "The University of Michgian " # GO BLUE!
eval_tokenizer = AutoTokenizer.from_pretrained(
    llama_token_path,
    padding_side="left",
    model_max_length=20,
)

model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(model.generate(**model_input, max_new_tokens=20)[0], skip_special_tokens=False))


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> The University of Michgian 2018-19 Men's Basketball News
 nobody can stop the Wolverines


In [10]:
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

In [11]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )



In [14]:
config = LoraConfig(
    r=6,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

# model = accelerator.prepare_model(model)


trainable params: 15207936 || all params: 3515620864 || trainable%: 0.4325817995828119
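
The numbers in that printout can be reproduced by hand, which is a nice sanity check on the LoRA config. The sketch below assumes the standard Llama-2-7B shapes (hidden size 4096, MLP size 11008, vocabulary 32000, 32 layers) and the fact that a rank-`r` adapter on a `d_in x d_out` linear layer adds `r * (d_in + d_out)` parameters. The reading of "all params" (4-bit weights packed two per stored element, with the embeddings, `lm_head`, and norms left unpacked) is my interpretation; it happens to reproduce the printed value exactly.

```python
# Assumed Llama-2-7B shapes
HIDDEN, INTERMEDIATE, VOCAB, NUM_LAYERS = 4096, 11008, 32000, 32
RANK = 6  # r from the LoraConfig above


def lora_params(d_in, d_out, r=RANK):
    """A rank-r adapter adds an (r x d_in) and a (d_out x r) matrix."""
    return r * (d_in + d_out)


# Adapters on q/k/v/o, gate/up/down in every layer, plus one on lm_head.
attention = 4 * lora_params(HIDDEN, HIDDEN)
mlp = 2 * lora_params(HIDDEN, INTERMEDIATE) + lora_params(INTERMEDIATE, HIDDEN)
trainable = NUM_LAYERS * (attention + mlp) + lora_params(HIDDEN, VOCAB)

# Base weights as counted by numel(): quantized linears are packed 2 per element.
linear_weights = NUM_LAYERS * (4 * HIDDEN * HIDDEN + 3 * HIDDEN * INTERMEDIATE)
packed = linear_weights // 2
unpacked = 2 * VOCAB * HIDDEN + (2 * NUM_LAYERS + 1) * HIDDEN  # embed, lm_head, norms
all_params = packed + unpacked + trainable

print(trainable)   # 15207936
print(all_params)  # 3515620864
print(100 * trainable / all_params)
```

So the "all params" figure is about half of the true 7B count only because of how packed 4-bit tensors report their size; the adapters are a fraction of a percent of the model either way.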


In [15]:
if torch.cuda.device_count() > 1: # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True

In [None]:
import transformers
from datetime import datetime

project = "grammar"
base_model_name = "llama2"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=5,
        per_device_train_batch_size=2,
        gradient_checkpointing=True,
        gradient_accumulation_steps=4,
        max_steps=1000,
        learning_rate=2.5e-5,
        logging_steps=50,
        bf16=True,
        optim="paged_adamw_8bit",
        logging_dir="./logs",   
        save_strategy="steps",
        save_steps=50,                # Save checkpoints every 50 steps
        evaluation_strategy="steps", # Evaluate on a step schedule (see eval_steps)
        eval_steps=50,               # Run evaluation every 50 steps
        do_eval=True,                # Enable evaluation on eval_dataset
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()


You'll get an output that looks like this: (training screenshot).



Run until your validation loss is no longer decreasing.
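
"No longer decreasing" is a judgement call, so here is one way to make it mechanical. This is a minimal sketch, not something I used in the original run: the function name, the `patience` rule, and the loss values are all hypothetical. It says "stop" once the last few evaluations have failed to beat the best loss seen before them.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """Return True if the last `patience` evals did not improve on the earlier best."""
    if len(val_losses) <= patience:
        return False  # not enough history to decide yet
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent > best_before - min_delta


# Hypothetical validation losses, one per evaluation (every 50 steps here).
still_improving = [2.0, 1.5, 1.2, 1.1, 1.0, 0.9]
plateaued = [2.0, 1.5, 1.2, 1.21, 1.25, 1.3]

print(should_stop(still_improving))  # False
print(should_stop(plateaued))        # True
```

The transformers library also ships an `EarlyStoppingCallback` that applies a similar rule inside the `Trainer`; I have not wired it into the notebook above, so check its documentation for the training arguments it needs before relying on it.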


# We Trained, Now What?

## First let's test the model
1. Hit ESC-00 (reset the kernel), then evaluate by hand using the script below

In [1]:
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model
from datasets import load_dataset
from peft import PeftModel
from tqdm import tqdm_notebook as tqdm
import re
# !jupyter nbextension enable --py widgetsnbextension


In [2]:
# Constants
llama_og_path = "./models/llama-7B-huggingface"
llama_token_path = "./models/llama-7B-huggingface"
dataset = "./datasets/grammar_dataset/"

In [3]:
checkpoint_path = "llama2-grammar/checkpoint-250" # TODO: Put Checkpoint path here

In [4]:
test_dataset  = load_dataset(dataset, split='validation')

In [5]:
tokenizer = AutoTokenizer.from_pretrained(
    llama_token_path)

tokenizer.pad_token = tokenizer.eos_token
bos = tokenizer.bos_token
eos = tokenizer.eos_token

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers


In [6]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    llama_og_path, 
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [7]:
model = PeftModel.from_pretrained(model, checkpoint_path)

In [8]:
model = model.merge_and_unload()



In [9]:
# Change i to change the input prompt. Evaluate the
# outputs using your own judgement. Adjust hyperparameters above for training 
# or prompt to improve your response, and change max_new_tokens as appropriate

i = 25
max_new_tokens=75

eval_prompt = f"{bos}You will see two sentences. The first is marked INCORRECT and has a plethora of spelling and grammatical issues," + \
        f" the second is marked CORRECT and shows the fixed version of the prior sentence. INCORRECT: {test_dataset[i]['input']} CORRECT: " 


model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")  # inputs must be on the same device as the model

model.eval()
with torch.no_grad():
    output = tokenizer.decode(model.generate(**model_input, max_new_tokens=max_new_tokens, repetition_penalty=1.15)[0], skip_special_tokens=True)
print(output)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


You will see two sentences. The first is marked INCORRECT and has a plethora of spelling and grammatical issues, the second is marked CORRECT and shows the fixed version of the prior sentence. INCORRECT: Forexample, My cousin is 12years old. CORRECT: 4. For example, my cousin is twelve years old.
The word “forexample” should be replaced with the phrase “for instance.” This is because it’s not a real word; it’s just an abbreviation for “for example,” which means that you can use either one in place of the other.
What does forexample mean?


# Exporting The Model 
We SHOULD be happy with the performance of the previous model, and want to move it to our local computer (for me, this is a MacBook). The bitsandbytes library doesn't provide a convenient way to export our weights back to float32, so we will have to write it ourselves.
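
Before the real export code, it may help to see what "dequantizing" means in miniature. The sketch below is a deliberately simplified stand-in, not the bitsandbytes implementation: it uses a uniform 4-bit grid and a tiny block size, whereas real NF4 uses a non-uniform 16-entry lookup table and larger blocks. What it shares with the real thing is the idea of storing small integer codes plus one absmax scale per block, and multiplying back out to recover approximate floats. All names here are made up for the example.

```python
def quantize_blockwise(values, block_size=4, levels=16):
    """Turn floats into small signed integer codes plus one absmax scale per block."""
    half = levels // 2 - 1  # 7, so codes fall in -7..7
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
        scales.append(absmax)
        codes.extend(round(v / absmax * half) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4, levels=16):
    """Recover approximate floats from the codes and per-block scales."""
    half = levels // 2 - 1
    return [code / half * scales[i // block_size] for i, code in enumerate(codes)]


weights = [-0.0071, -0.0153, -0.0035, 0.0047, 0.0115, 0.0, -0.0108, 0.0061]
codes, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(codes, scales)

for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

The restored values are close to the originals but not identical, which is the price of storing each weight in 4 bits.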

In [10]:
# load original weights to make sure we get the right interface. This time we don't quantize

target_model = AutoModelForCausalLM.from_pretrained(
    llama_og_path, 
    device_map="cpu" # Can't fit it all on the GPU. I'm using 32 gigs of ram
)


target_model = target_model.to(torch.bfloat16).to('cuda:0') # 32 bits is just too big, doesn't leave room for a buffer 


target_state_dict = target_model.state_dict()

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [11]:
from bitsandbytes import functional as F

In [12]:
state_dict = model.state_dict()
list(state_dict.keys())[:60]  # Show only the first 60 keys to keep the output short

['model.embed_tokens.weight',
 'model.layers.0.self_attn.q_proj.weight',
 'model.layers.0.self_attn.q_proj.weight.absmax',
 'model.layers.0.self_attn.q_proj.weight.quant_map',
 'model.layers.0.self_attn.q_proj.weight.nested_absmax',
 'model.layers.0.self_attn.q_proj.weight.nested_quant_map',
 'model.layers.0.self_attn.q_proj.weight.quant_state.bitsandbytes__nf4',
 'model.layers.0.self_attn.k_proj.weight',
 'model.layers.0.self_attn.k_proj.weight.absmax',
 'model.layers.0.self_attn.k_proj.weight.quant_map',
 'model.layers.0.self_attn.k_proj.weight.nested_absmax',
 'model.layers.0.self_attn.k_proj.weight.nested_quant_map',
 'model.layers.0.self_attn.k_proj.weight.quant_state.bitsandbytes__nf4',
 'model.layers.0.self_attn.v_proj.weight',
 'model.layers.0.self_attn.v_proj.weight.absmax',
 'model.layers.0.self_attn.v_proj.weight.quant_map',
 'model.layers.0.self_attn.v_proj.weight.nested_absmax',
 'model.layers.0.self_attn.v_proj.weight.nested_quant_map',
 'model.layers.0.self_attn.v_proj.w

In [13]:
def is_quant(name, sd) -> bool:
    return name + '.nested_absmax' in sd.keys()


def is_lora(name, sd) -> bool:
    return name[:-18] + '.lora_A.default.weight' in sd.keys()

In [14]:
num_layers = 32

In [15]:

for layer in range(num_layers):
    weights = [
        # Can probably be done way cleaner, feel free to put in a pull request
        (model.model.layers[layer].self_attn.q_proj.weight.quant_state, 
         f'model.layers.{layer}.self_attn.q_proj.weight',
         f'model.layers.{layer}.self_attn.q_proj.weight'
        ),
        (model.model.layers[layer].self_attn.k_proj.weight.quant_state, 
         f'model.layers.{layer}.self_attn.k_proj.weight',
         f'model.layers.{layer}.self_attn.k_proj.weight',
        ),
        (model.model.layers[layer].self_attn.v_proj.weight.quant_state, 
         f'model.layers.{layer}.self_attn.v_proj.weight',
         f'model.layers.{layer}.self_attn.v_proj.weight',
        ),
        (model.model.layers[layer].self_attn.o_proj.weight.quant_state, 
         f'model.layers.{layer}.self_attn.o_proj.weight',
         f'model.layers.{layer}.self_attn.o_proj.weight',
        ),
        (model.model.layers[layer].mlp.gate_proj.weight.quant_state, 
         f'model.layers.{layer}.mlp.gate_proj.weight',
         f'model.layers.{layer}.mlp.gate_proj.weight'
        ),
        (model.model.layers[layer].mlp.up_proj.weight.quant_state,   
         f'model.layers.{layer}.mlp.up_proj.weight',
         f'model.layers.{layer}.mlp.up_proj.weight'
        ),
        (model.model.layers[layer].mlp.down_proj.weight.quant_state, 
         f'model.layers.{layer}.mlp.down_proj.weight',
         f'model.layers.{layer}.mlp.down_proj.weight'
        ),
        (None,
         f'model.layers.{layer}.input_layernorm.weight',
         f'model.layers.{layer}.input_layernorm.weight'
        ),
        (None, 
         f'model.layers.{layer}.post_attention_layernorm.weight',
         f'model.layers.{layer}.post_attention_layernorm.weight'
        ),
    ]
    for q_state, key, target_path in weights:
        if is_quant(key, state_dict):
            F.dequantize_nf4(state_dict[key], q_state, out=target_state_dict[target_path])
            torch.cuda.synchronize()
            print(f"Merged {target_path}")
        else:
            target_state_dict[target_path] = state_dict[key].clone()
            print(f"Copied {target_path}")
            

Merged model.layers.0.self_attn.q_proj.weight
Merged model.layers.0.self_attn.k_proj.weight
Merged model.layers.0.self_attn.v_proj.weight
Merged model.layers.0.self_attn.o_proj.weight
Merged model.layers.0.mlp.gate_proj.weight
Merged model.layers.0.mlp.up_proj.weight
Merged model.layers.0.mlp.down_proj.weight
Copied model.layers.0.input_layernorm.weight
Copied model.layers.0.post_attention_layernorm.weight
Merged model.layers.1.self_attn.q_proj.weight
Merged model.layers.1.self_attn.k_proj.weight
Merged model.layers.1.self_attn.v_proj.weight
Merged model.layers.1.self_attn.o_proj.weight
Merged model.layers.1.mlp.gate_proj.weight
Merged model.layers.1.mlp.up_proj.weight
Merged model.layers.1.mlp.down_proj.weight
Copied model.layers.1.input_layernorm.weight
Copied model.layers.1.post_attention_layernorm.weight
Merged model.layers.2.self_attn.q_proj.weight
Merged model.layers.2.self_attn.k_proj.weight
Merged model.layers.2.self_attn.v_proj.weight
Merged model.layers.2.self_attn.o_proj.wei

In [16]:
list(target_state_dict.keys())[-4:]

['model.layers.31.input_layernorm.weight',
 'model.layers.31.post_attention_layernorm.weight',
 'model.norm.weight',
 'lm_head.weight']

In [17]:
F.dequantize_4bit(state_dict['model.layers.0.self_attn.q_proj.weight'], model.model.layers[0].self_attn.q_proj.weight.quant_state)

tensor([[-0.0071, -0.0153, -0.0035,  ...,  0.0047,  0.0000, -0.0054],
        [ 0.0115,  0.0000,  0.0000,  ..., -0.0108, -0.0108,  0.0061],
        [-0.0228,  0.0199,  0.0000,  ...,  0.0095,  0.0193,  0.0000],
        ...,
        [ 0.0000,  0.0142,  0.0000,  ...,  0.0126, -0.0309,  0.0126],
        [ 0.0250,  0.0091,  0.0045,  ..., -0.0316, -0.0171, -0.0111],
        [-0.0153, -0.0071,  0.0031,  ...,  0.0188,  0.0144, -0.0079]],
       device='cuda:0')

In [18]:
target_state_dict['model.layers.0.self_attn.q_proj.weight'] 

tensor([[-0.0071, -0.0153, -0.0035,  ...,  0.0047,  0.0000, -0.0054],
        [ 0.0115,  0.0000,  0.0000,  ..., -0.0108, -0.0108,  0.0061],
        [-0.0227,  0.0199,  0.0000,  ...,  0.0095,  0.0193,  0.0000],
        ...,
        [ 0.0000,  0.0142,  0.0000,  ...,  0.0126, -0.0309,  0.0126],
        [ 0.0250,  0.0091,  0.0045,  ..., -0.0317, -0.0171, -0.0111],
        [-0.0153, -0.0071,  0.0031,  ...,  0.0188,  0.0145, -0.0079]],
       device='cuda:0', dtype=torch.bfloat16)
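
The two printouts above differ in a few last digits (for example -0.0228 vs -0.0227). That is consistent with the target tensor being stored as bfloat16, which keeps only the top 16 bits of a float32. The sketch below emulates that rounding in plain Python so you can see the size of the effect; it is an illustration of the number format under round-to-nearest-even, not the code path PyTorch actually uses, and it ignores special cases such as NaN and overflow.

```python
import struct


def to_bfloat16(x):
    """Round a float to the nearest bfloat16 value (ties to even), returned as a float."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # float32 bit pattern
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)          # round half to even
    bits = (bits + rounding_bias) & 0xFFFF0000           # drop the low 16 bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


for value in [0.0228, -0.0316, 0.0144]:
    rounded = to_bfloat16(value)
    print(f"{value:+.6f} -> {rounded:+.6f} (relative error {abs(rounded - value) / abs(value):.5f})")
```

The relative error stays below roughly 0.4%, so seeing the fourth decimal place move by one is expected and does not mean the copy went wrong.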

In [24]:
# Now we have to merge our special cases
target_state_dict['model.embed_tokens.weight'] = state_dict['model.embed_tokens.weight']
target_state_dict['model.norm.weight'] = state_dict['model.norm.weight'].clone()
target_state_dict['lm_head.weight'] = state_dict['lm_head.weight'].clone()

In [25]:
# del model
# torch.cuda.empty_cache()

In [21]:
target_model.load_state_dict(target_state_dict)

<All keys matched successfully>

In [22]:
# Change i to change the input prompt. Evaluate the
# outputs using your own judgement. Adjust hyperparameters above for training 
# or prompt to improve your response, and change max_new_tokens as appropriate

i = 50
max_new_tokens=75

eval_prompt = f"{bos}You will see two sentences. The first is marked INCORRECT and has a plethora of spelling and grammatical issues," + \
        f" the second is marked CORRECT and shows the fixed version of the prior sentence. INCORRECT: {test_dataset[i]['input']} CORRECT: " 


model_input = tokenizer(eval_prompt, return_tensors="pt").to('cuda:0')

target_model.eval()
with torch.no_grad():
    output = tokenizer.decode(target_model.generate(**model_input, max_new_tokens=max_new_tokens, repetition_penalty=1.15)[0], skip_special_tokens=True)
print(output)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


You will see two sentences. The first is marked INCORRECT and has a plethora of spelling and grammatical issues, the second is marked CORRECT and shows the fixed version of the prior sentence. INCORRECT: So, if i have alot of information about this subject, i will taulk too much with knowledge but if i have general information for this subject, i will talk about this subjec with my limited knowlege and this case may be make me shame like when my brother asked me about some thing but i have not alot of information about this thing. CORRECT: 1) If I have a lot of information about this subject, I will talk too much with knowledge; however, if I only have general information on this topic, I will speak about it with my limited knowledge and this might cause me to feel ashamed as when my brother asks me about something that I do not know very well.
I'm sorry, but I


In [23]:
target_model.save_pretrained("models/model-final")

# Now that we have merged, we can export


We tar up and compress our weights for a faster download (11gb vs 22gb)
```console
cd models
cp llama-7B-huggingface/special_tokens_map.json model-final
cp llama-7B-huggingface/tokenizer* model-final
cd ..
tar -czvf final-model.tar.gz models/model-final
```

Now, on your local machine, work in a clone of Llama.cpp (use my modifications here, under "examples/llamacheck": https://github.com/Ferruolo/llama.cpp).
```
# In base of llama.cpp directory
scp myBrevInstanceName:MyFirstLLM-BlogPost/final-model.tar.gz ./models/
# The archive contains models/model-final, so it unpacks into ./models/model-final
tar -xvf ./models/final-model.tar.gz
python convert-hf-to-gguf.py ./models/model-final --outfile models/llamacheck-dequant.gguf
make llamacheck
# TODO: Quantize llamacheck.gguf
```

# We're done. A brief reflection

Being honest, my experience was not as simple as this blog post. Even though I adapted most of my code directly from the notebooks I actually used, I took a lot of twists and turns, misinterpreted docs several times, and struggled to export my weights. I hope that by compiling everything I did into one big notebook, I have made your life significantly easier! I was surprised by how easy finetuning was once I had figured everything out, but was dismayed at how difficult it was to do practical things with Huggingface, the bitsandbytes library (QLoRA), and llama.cpp. I found myself in configuration hell. There's no reason that exporting to another format should be as complicated as it was.


I would like to thank the Brev team for helping me learn a lot of the things I talk about here, and giving me the opportunity to write this blog for them.

If you have any followup questions, or would like to reach out to me for another reason, please email me at andrew.ferruolo@gmail.com