
<center>
<h1 style="font-size: 36px;">Fine-Tuning a Large Language Model with Custom Data: A Comprehensive Guide</h1>
</center>


<p>Hello everyone!</p>
<p>In this notebook, we will learn how to fine-tune a Large Language Model (LLM) on our own dataset using the LoRA method.</p>

### 0) Prerequisites

<p>Make sure you are running Linux: the <code>bitsandbytes</code> library, which we rely on for quantization, only works on Linux as of now.</p>




In [None]:
# You only need to run this once per machine
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U datasets scipy 

### I) Create your own dataset
<body>
    <p>Before running this notebook, create your dataset with the <code>prepare_dataset_LLM.py</code> script. You will need to point its API endpoint at your own LLM (I used LM Studio) and download the source text from <a href="https://www.congress.gov/bill/118th-congress/house-bill/4365/text?format=txt">https://www.congress.gov/bill/118th-congress/house-bill/4365/text?format=txt</a>.</p>
    <p>The script first sends a segment of the text to the LLM and asks it to generate questions. An agent then checks that the reply is formatted as a Python list. If it is not, a correction agent rewrites it as one.</p>
    <p>Next, we send these questions back to the LLM to get answers. Once we have our inputs and outputs, we write them to a .jsonl file that makes up our dataset.</p>
    <p>Each record in the dataset has this structure: <code>{"input": "Enlighten me about the legislation...", "output": "The legislation encompasses..."}</code></p>
</body>
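
<p>To make the format check and the final file more concrete, here is a minimal sketch. It is not the code from the dataset script: the helper names <code>parse_question_list</code> and <code>to_jsonl</code> and the sample reply are made up for illustration, and the LLM calls are left out entirely.</p>

```python
import ast
import json


def parse_question_list(raw):
    """Return the questions if `raw` is a Python list of strings, else None."""
    try:
        parsed = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError):
        return None  # not a Python literal at all -> needs correction
    if isinstance(parsed, list) and all(isinstance(q, str) for q in parsed):
        return parsed
    return None  # a literal, but not a list of strings


def to_jsonl(pairs):
    """Serialize (question, answer) pairs as one JSON object per line."""
    return "\n".join(json.dumps({"input": q, "output": a}) for q, a in pairs)


# Hypothetical LLM reply; a real one would come from your API endpoint.
raw_reply = "['What does Section 1 cover?', 'Who administers the funds?']"
questions = parse_question_list(raw_reply)
answers = ["(answer returned by the LLM)"] * len(questions)  # placeholders
print(to_jsonl(zip(questions, answers)))
```

<p>When <code>parse_question_list</code> returns <code>None</code>, that is the point where a correction step would take over.</p>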

In [None]:
!python prepare_dataset_LLM.py

In [1]:
from datasets import load_dataset

train_dataset = load_dataset('json', data_files='train.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='validation.jsonl', split='train')

### II) Loading Mistral-7B-Instruct-v0.2 

 Here we load our LLM. We use the bitsandbytes library to quantize the model, which shrinks its memory footprint at the cost of a small loss in precision. That means we can run it even without a monster computer. We pick a specific model from mistralai, and the <code>BitsAndBytesConfig</code> below stores the weights in 4-bit NF4 format instead of the usual 16-bit, while the computations run in bfloat16. Pretty neat for keeping things light!
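
 To get a feel for the savings, here is a back-of-the-envelope sketch. The parameter count is an assumption (a rough figure for a "7B" model), and the estimate covers the weights only, not activations, optimizer state, or quantization overhead.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7.2e9  # assumption: rough parameter count for a "7B" model

for label, bits in [("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

 Going from 16-bit to 4-bit weights cuts this estimate by a factor of four.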

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config, device_map="auto")

### III) Tokenization

  <p>Alright, here's the scoop: we're setting up a tokenizer, which breaks the text we feed the model into tokens it can work with. We pad on the left side, a common choice for decoder-only models because the real tokens then sit at the end of the sequence, right where generation continues. This <a href="https://ai.stackexchange.com/questions/41485/while-fine-tuning-a-decoder-only-llm-like-llama-on-chat-dataset-what-kind-of-pa">link</a> goes into the details.</p>

  <p>Now, about <code>max_length</code>: it's kinda like knowing your limits. We first run the tokenizer with no truncation or padding to see how long our examples usually are, then pick a cap that covers most of them.</p>
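
  <p>A minimal sketch of that length check is shown here. The helper <code>length_stats</code> is a made-up name, and <code>str.split</code> stands in for the real tokenizer so the snippet runs on its own. In the notebook you would pass something like <code>lambda text: tokenizer(text)["input_ids"]</code> instead.</p>

```python
import math


def length_stats(texts, tokenize, percentile=95):
    """Summarize token counts for `texts` with no truncation or padding."""
    lengths = sorted(len(tokenize(text)) for text in texts)
    # Nearest-rank percentile: smallest length covering `percentile`% of texts.
    rank = max(1, math.ceil(percentile / 100 * len(lengths)))
    return {
        "min": lengths[0],
        "max": lengths[-1],
        "mean": sum(lengths) / len(lengths),
        f"p{percentile}": lengths[rank - 1],
    }


# Toy prompts; whitespace splitting is only a stand-in for the tokenizer.
samples = [
    "[INST]What is the bill about?[/INST] It funds defense programs.",
    "[INST]Who signs it?[/INST] The President.",
]
print(length_stats(samples, str.split))
```

  <p>A <code>max_length</code> a little above the high percentile keeps most examples intact without padding everything out to the single longest one.</p>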

 <p>We've formatted our data in the Mistral instruction format using the <code>prompt_mistral</code> function.</p>


  <p>Then we load the tokenizer from the pretrained model, tell it to pad on the left, and have it add the beginning-of-sequence and end-of-sequence tokens for us. We also set the pad token to be the same as the end-of-sequence token, so one token both marks "we're done here" and fills up empty space.</p>

  <p>The function <code>generate_and_tokenize_prompt</code> takes one example, runs it through our formatting function, and has the tokenizer truncate it to at most 400 tokens, or pad it up to 400 if it's shorter. It then copies the token IDs into <code>labels</code> for training. It's all about getting things ready for the big show.</p>
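
  <p>To picture what comes out of that function, here is a simplified stand-in for the tokenizer's truncation and left padding. The helper <code>pad_left</code> and the token IDs are invented for illustration, and the real tokenizer does much more than this.</p>

```python
def pad_left(ids, max_length, pad_id):
    """Truncate on the right, then pad on the left up to `max_length`."""
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    input_ids = [pad_id] * n_pad + ids
    return {
        "input_ids": input_ids,
        # 0 marks padding positions, 1 marks real tokens.
        "attention_mask": [0] * n_pad + [1] * len(ids),
        # Labels start out as a plain copy of the inputs.
        "labels": input_ids.copy(),
    }


# Made-up token IDs; 2 plays the role of the EOS token reused as padding.
example = pad_left([1, 733, 16289, 2], max_length=6, pad_id=2)
print(example)
```

  <p>The padding lands at the front, so the last position always holds a real token.</p>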


In [None]:
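def prompt_mistral(example):
    # The tokenizer below adds <s> and </s> itself (add_bos_token / add_eos_token),
    # so the template leaves them out to avoid doubled special tokens.
    return f"[INST]{example['input']}[/INST] {example['output']}"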


tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side="left",
    add_eos_token=True,
    add_bos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token



max_length = 400

def generate_and_tokenize_prompt(prompt):
    result = tokenizer(
        prompt_mistral(prompt),
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_val_dataset = eval_dataset.map(generate_and_tokenize_prompt)

### IV) Let's test the model before fine-tuning it

In [7]:
import warnings
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # Suppress TensorFlow logs except errors
warnings.filterwarnings('ignore')  # Suppress Python warnings, including TensorFlow's

eval_prompt = """[INST] What total amount of funds was allocated for counter-narcotics support in the Defense Appropriations Act of 2024 (H.R. 4365)[/INST]?
"""
eval_tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    add_bos_token=True,
)

model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(model.generate(**model_input, max_new_tokens=256, repetition_penalty=1.15)[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] What total amount of funds was allocated for counter-narcotics support in the Defense Appropriations Act of 2024 (H.R. 4365)[/INST]?

I apologize for any confusion, but I cannot directly provide you with the specific amount allocated for counter-narcotics support in the Defense Appropriations Act of 2024 (H.R. 4365) as I do not have access to that information in real time. You may want to check the text of the bill itself or contact the relevant congressional committees or the Department of Defense for the most accurate and up-to-date information on this matter. The bill text can be found on the official website of the Library of Congress at https://www.govinfo.gov/. Additionally, you may find it helpful to consult news articles or other reliable sources for more context and analysis on this issue.


### V) Let's set up LoRA

<p>So, we're getting our model ready for some serious gym time. We use <code>prepare_model_for_kbit_training</code> from PEFT to prepare the quantized model for training.</p>

<p>Now, on to the LoRA setup. <code>r</code> is the rank of the small matrices that LoRA trains alongside the frozen weights. A higher <code>r</code> means more trainable parameters, so our model can learn more stuff, but training also takes more memory and compute.</p>

<p>Then there's <code>lora_alpha</code>, a scaling factor: the LoRA update is multiplied by <code>lora_alpha / r</code> before it is added to the frozen weights. Pumping up <code>lora_alpha</code> is like saying, "Hey, pay more attention to these new tricks!"</p>
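
<p>Some quick arithmetic shows what these two knobs do. This is a sketch under assumptions: the 4096 x 4096 layer size is an illustrative choice, and the <code>lora_alpha / r</code> scaling is the standard LoRA formulation.</p>

```python
def lora_trainable_params(d_in, d_out, r):
    """Weights in the two low-rank factors: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(alpha, r):
    """Factor applied to the low-rank update before it is added to the weights."""
    return alpha / r


d = 4096  # assumption: a square projection layer of this width
r, alpha = 32, 64  # the values used in this notebook

full = d * d
extra = lora_trainable_params(d, d, r)
print(f"full matrix:  {full:,} weights")
print(f"LoRA factors: {extra:,} weights ({100 * extra / full:.2f}% of the full matrix)")
print(f"update scale: {lora_scaling(alpha, r)}")
```

<p>Doubling <code>r</code> doubles the number of trainable weights per layer, and if <code>lora_alpha</code> stays fixed it also halves the scale of the update.</p>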



In [8]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)

### VI) Training session

<p>First up, we import the necessary stuff and check whether we've got more than one GPU. If we do, we set two flags so the <code>Trainer</code> treats the model as already split across the GPUs (which <code>device_map="auto"</code> takes care of at load time) instead of trying to parallelize it again.</p>

<p>We also set <code>os.environ['WANDB_DISABLED'] = 'true'</code> to turn off Weights &amp; Biases logging, since we don't need to connect to the WANDB API.</p>

<p>Next, we name our training run and make sure our tokenizer knows what to use as a padding token, which is the end-of-sentence token here.</p>

<p>Then we get the <code>Trainer</code> ready with all its gear: the model, the datasets for training and validation, and a bunch of settings like how big our training batches should be, how often to save our progress, and how often to check how well the model is doing. We also pick <code>paged_adamw_8bit</code>, an optimizer from bitsandbytes that keeps its state in 8 bits to save memory, which is great for big models.</p>

<p>Last but not least, we pass a <code>DataCollatorForLanguageModeling</code> with <code>mlm=False</code>, which batches our examples for causal language modeling. And with a final <code>trainer.train()</code>, we're off to the races!</p>
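
<p>Before launching the run, it helps to work out what the settings add up to. The helper below is a made-up illustration, and it assumes a single data-parallel replica (<code>n_devices=1</code>), which is what you get when one copy of the model is spread across the GPUs.</p>

```python
def training_schedule(per_device_batch, grad_accum, max_steps, save_steps, n_devices=1):
    """Derive a few totals from the main training settings."""
    effective_batch = per_device_batch * grad_accum * n_devices
    return {
        # Sequences consumed per optimizer update.
        "effective_batch": effective_batch,
        # Sequences processed over the whole run (repeats included).
        "sequences_seen": effective_batch * max_steps,
        # Checkpoints written when saving every `save_steps` steps.
        "checkpoints": max_steps // save_steps,
    }


# Values taken from the TrainingArguments used in this notebook.
print(training_schedule(per_device_batch=4, grad_accum=4, max_steps=1000, save_steps=50))
```

<p>If your dataset is much smaller than the number of sequences seen, the run makes several passes over it, so keep an eye on the validation loss.</p>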


In [None]:
import transformers
from datetime import datetime
import os
if torch.cuda.device_count() > 1: # If more than 1 GPU
    model.is_parallelizable = True
    model.model_parallel = True
    
os.environ['WANDB_DISABLED'] = 'true'
project = "uslaw-finetune"
base_model_name = "mistral"
run_name = base_model_name + "-" + project
output_dir = "./" + run_name
tokenizer.pad_token = tokenizer.eos_token



trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    args=transformers.TrainingArguments(
        output_dir=output_dir,
        warmup_steps=5,
        per_device_train_batch_size=4,
        gradient_checkpointing=True,
        gradient_accumulation_steps=4,
        max_steps=1000,
        learning_rate=2.5e-5, # Want about 10x smaller than the Mistral learning rate
        logging_steps=50,
        bf16=True,
        optim="paged_adamw_8bit",
        logging_dir="./logs",        # Directory for storing logs
        save_strategy="steps",       # Save checkpoints on a step schedule
        save_steps=50,               # Save a checkpoint every 50 steps
        evaluation_strategy="steps", # Evaluate on a step schedule (newer transformers releases call this eval_strategy)
        eval_steps=50,               # Evaluate every 50 steps
        do_eval=True,                # Run evaluation on eval_dataset during training
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

<img src="training_validation_loss.png" width="700">


### VII) Let's test the fine-tuned model

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,  # Mistral, same as before
    quantization_config=bnb_config,  # Same quantization config as before
    device_map="auto",
    trust_remote_code=True,
)

eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True)


ft_model = PeftModel.from_pretrained(base_model, "mistral-uslaw-finetune/checkpoint-750")

In [None]:
eval_prompt = """[INST] What types of expenses are covered under the 'Other Procurement, Defense-Wide' category in H.R. 4365? [/INST]"""
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

ft_model.eval()
with torch.no_grad():
    print(eval_tokenizer.decode(ft_model.generate(**model_input, max_new_tokens=300, repetition_penalty=1.15)[0], skip_special_tokens=True))