In [None]:
%%capture

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes
!pip install sec_api
!pip install -U langchain
!pip install -U langchain-community
!pip install -U sentence-transformers
!pip install -U faiss-gpu

In [None]:
# Fine Tuning Related Packages
import os
from dotenv import load_dotenv

from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Pipeline & RAG Related Packages
from sec_api import ExtractorApi, QueryApi
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.7.0+cu126 with CUDA 1206 (you have 2.6.0+cu124)
    Python  3.11.12 (you have 3.11.12)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!


---
## **Part 1: Fine Tuning LLaMa 3 with Unsloth**

We will use Colab's built-in GPU for all of our fine-tuning, via the [Unsloth library](https://github.com/unslothai/unsloth).

Much of the code below is adapted from the [Unsloth documentation](https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing#scrollTo=AEEcJ4qfC7Lp).


### **Initializing Pre Trained Model and Tokenizer**

For this example we will be using Meta's [LLaMa 3 8b Instruct Model](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).  
 **NOTE**: This is a gated model; you must request access on Hugging Face and pass your HF token in the step below.
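
`os.getenv` returns `None` when a variable is unset, so a missing token tends to surface only later as an authorization error from the Hub. The following is a minimal sketch of a fail-fast lookup. The helper name `require_env` is hypothetical (it is not part of any library), and the variable name `hf_token` simply matches the one used in the loading cell.

```python
import os


def require_env(name, env=None):
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    `require_env` is a hypothetical helper written for this sketch.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "it is needed to download the gated model."
        )
    return value


# Demonstration with an explicit mapping, so the sketch runs without real secrets.
demo_env = {"hf_token": "hf_example_value"}
print(require_env("hf_token", demo_env))
```

In the notebook itself, the call would be `require_env("hf_token")`, which reads from the real process environment.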

In [None]:


# Load the model and tokenizer from the pre-trained FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    # Specify the pre-trained model to use
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
    # Specifies the maximum number of tokens (words or subwords) that the model can process in a single forward pass
    max_seq_length = 2048,
    # Data type for the model. None means auto-detection based on hardware, Float16 for specific hardware like Tesla T4
    dtype = None,
    # Enable 4-bit quantization, By quantizing the weights of the model to 4 bits instead of the usual 16 or 32 bits, the memory required to store these weights is significantly reduced. This allows larger models to be run on hardware with limited memory resources.
    load_in_4bit = True,
    # Access token for gated models, required for authentication to use models like Meta-Llama-3-8B-Instruct
    token = os.getenv("hf_token")
)


==((====))==  Unsloth 2025.5.10: Fast Llama patching. Transformers: 4.52.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
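
The banner above reports a Tesla T4 with a maximum of 14.741 GB of memory, which is why the model is loaded with `load_in_4bit = True`. The arithmetic below is a rough sketch of the weight memory at different precisions. It assumes a round figure of 8 billion parameters and counts the weights only; real 4-bit checkpoints also store quantization constants and keep some layers in higher precision, so actual usage is somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8_000_000_000   # assumed round parameter count for an 8B model
T4_MEMORY_GB = 14.741        # maximum memory reported in the Unsloth banner

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    size = weight_memory_gb(NUM_PARAMS, bits)
    fits = "fits" if size < T4_MEMORY_GB else "does not fit"
    print(f"{label:>9}: {size:5.1f} GB of weights -> {fits} on the T4")
```

Under these assumptions the 16-bit weights alone (16 GB) exceed the T4's memory, while the 4-bit weights (about 4 GB) leave room for activations, LoRA adapters, and optimizer state.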



**Adding in LoRA adapters for parameter efficient fine tuning**

LoRA, or Low-Rank Adaptation, is a technique for fine-tuning large models more efficiently. Instead of updating all of the model's weights, it freezes them and trains a small set of additional low-rank matrices attached to selected layers. Because only these adapter weights receive gradients, fine-tuning is faster and needs far less memory, which makes it practical to tailor a pre-trained model to a specific task or dataset without extensive computational power.
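
To make the idea concrete, here is a toy sketch of a LoRA-adapted linear layer in plain Python. It is an illustration of the general technique, not Unsloth's or PEFT's implementation: the frozen weight `W` is left untouched, and the output is adjusted by the product of two small matrices `B` and `A`, scaled by `alpha / r`. All names and values are made up for the example.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x), where r is the adapter rank.

    W (d_out x d_in) is frozen; only A (r x d_in) and B (d_out x r) would be trained.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Frozen 4x4 weight (identity, to keep the numbers easy to follow)
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Rank-2 adapter: A is r x d_in, B is d_out x r
A = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
B_zero = [[0.0, 0.0] for _ in range(4)]  # B starts at zero in standard LoRA
x = [1.0, 2.0, 3.0, 4.0]

# With B = 0 the adapter contributes nothing, so training starts from the base model.
y_init = lora_forward(W, A, B_zero, x, alpha=2)

# After training, B is non-zero and shifts the output (alpha == r, so the scale is 1).
B_trained = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
y_trained = lora_forward(W, A, B_trained, x, alpha=2)

print(y_init)
print(y_trained)
```

Because `B` is initialized to zero, the adapted layer initially behaves exactly like the frozen base layer; only as `A` and `B` are trained does the output move away from it.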

In [None]:
# Apply LoRA (Low-Rank Adaptation) adapters to the model for parameter-efficient fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    # Rank of the adaptation matrix. Higher values can capture more complex patterns. Suggested values: 8, 16, 32, 64, 128
    r = 16,
    # Specify the model layers to which LoRA adapters should be applied
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    # Scaling factor for LoRA. Controls the weight of the adaptation. Typically a small positive integer
    lora_alpha = 16,
    # Dropout rate for LoRA. A value of 0 means no dropout, which is optimized for performance
    lora_dropout = 0,
    # Bias handling in LoRA. Setting to "none" is optimized for performance, but other options can be used
    bias = "none",
    # Enables gradient checkpointing to save memory during training. "unsloth" is optimized for very long contexts
    use_gradient_checkpointing = "unsloth",
    # Seed for random number generation to ensure reproducibility of results
    random_state = 3407,
)

Unsloth 2025.5.10 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.



1. **r**: The rank of the low-rank adaptation matrix. This determines the capacity of the adapter to capture additional information. Higher ranks allow capturing more complex patterns but also increase computational overhead.

2. **target_modules**: List of module names within the model to which the LoRA adapters should be applied. These modules typically include the projections within transformer layers (e.g., query, key, value projections) and other key transformation points.
  - **q_proj**: Projects input features to query vectors for attention mechanisms.
  - **k_proj**: Projects input features to key vectors for attention mechanisms.
  - **v_proj**: Projects input features to value vectors for attention mechanisms.
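  - **o_proj**: Projects the combined outputs of the attention heads back to the model's hidden dimension.
  - **gate_proj**: Produces the gate in the feed-forward block; its activated output is multiplied element-wise with the output of `up_proj` to modulate the flow of information.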
  - **up_proj**: Projects features to a higher dimensional space, often used in feed-forward networks.
  - **down_proj**: Projects features to a lower dimensional space, often used in feed-forward networks.

These layers are typically involved in transformer-based models, facilitating various projections and transformations necessary for the attention mechanism and feed-forward processes.

3. **lora_alpha**: A scaling factor for the LoRA adapter. The adapter's update is multiplied by `lora_alpha / r`, so this value controls how strongly the adapter affects the model's outputs. Here it equals `r`, giving a scale of 1.

4. **lora_dropout**: Dropout rate applied to the LoRA adapters. Dropout helps regularize the adapters; setting it to 0 disables it, which is the configuration Unsloth optimizes for speed.

5. **bias**: This specifies which bias terms, if any, are trained alongside the LoRA adapters. Setting it to "none" means no bias terms are trained, which is the optimized setting, although other options are available depending on the use case.

6. **use_gradient_checkpointing**: Enables gradient checkpointing, which helps to save memory during training by not storing all intermediate activations. "unsloth" is a setting optimized for very long contexts, but it can also be set to True.

7. **random_state**: A seed for the random number generator to ensure reproducibility. This makes sure that the results are consistent across different runs of the code.
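
The settings above determine how many parameters are actually trained. As a sanity check, the sketch below counts the LoRA parameters: each adapted linear layer with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` weights. The layer dimensions are assumptions taken from the published Llama 3 8B configuration (hidden size 4096, feed-forward size 14336, 32 layers, 8 key/value heads of dimension 128) rather than read from the loaded model.

```python
def lora_param_count(d_in, d_out, r):
    """A LoRA adapter adds A (r x d_in) and B (d_out x r): r * (d_in + d_out) weights."""
    return r * (d_in + d_out)


# Assumed Llama 3 8B dimensions (not read from the loaded model)
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128      # 8 key/value heads of dimension 128 (grouped-query attention)
NUM_LAYERS = 32
R = 16                # matches r = 16 in get_peft_model

# (d_in, d_out) for each module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_param_count(d_in, d_out, R) for d_in, d_out in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the count agrees with the `Trainable parameters = 41,943,040` figure that the Unsloth training banner reports later in this notebook, roughly 0.5% of the full model.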

### **Preparing the Fine Tuning Dataset**

We will be using a Hugging Face dataset of financial Q&A over Form 10-K filings, provided by [Virat Singh](https://github.com/virattt): https://huggingface.co/datasets/virattt/llama-3-8b-financialQA

The code below formats each entry into the prompt defined at the top of the cell, taking care to add the special tokens. In this case our end-of-sequence (EOS) token is `<|eot_id|>`. More LLaMa 3 special tokens are listed [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/).

In [None]:
# Defining the expected prompt
ft_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Below is a user question, paired with retrieved context. Write a response that appropriately answers the question,
include specific details in your response. <|eot_id|>

<|start_header_id|>user<|end_header_id|>

### Question:
{}

### Context:
{}

<|eot_id|>

### Response: <|start_header_id|>assistant<|end_header_id|>
{}"""

# Grab the end-of-sequence (EOS) special token; it must be appended to every training example
EOS_TOKEN = tokenizer.eos_token

# Format the prompt above with fields from the Financial QA dataset
def formatting_prompts_func(examples):
    questions = examples["question"]
    contexts = examples["context"]
    responses = examples["answer"]
    texts = []
    for question, context, response in zip(questions, contexts, responses):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = ft_prompt.format(question, context, response) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

dataset = load_dataset("virattt/llama-3-8b-financialQA", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)


In [None]:
dataset[0]

{'question': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?',
 'answer': 'NVIDIA initially focused on PC graphics.',
 'context': 'Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.',
 'ticker': 'NVDA',
 'filing': '2023_10K',
 'text': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nBelow is a user question, paired with retrieved context. Write a response that appropriately answers the question,\ninclude specific details in your response. <|eot_id|>\n\n<|start_header_id|>user<|end_header_id|>\n\n### Question:\nWhat area did NVIDIA initially focus on before expanding to other computationally intensive fields?\n\n### Context:\nSince our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.\n\n<|eot_id|>\n\n### Response: <|start_header_id|>assistant<|end_header_id|>\nNVIDIA initially
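
The formatted `text` field above is what the model sees during training. A common pattern, sketched here under assumptions, is to reuse the same template at inference time with the response slot left empty and no EOS token appended, so that the model generates the answer itself. The template below is a shortened, hypothetical stand-in for `ft_prompt` with the same three slots, and `EOS` is an assumed value for `tokenizer.eos_token`.

```python
# Shortened stand-in for the notebook's ft_prompt (hypothetical; same three slots)
TEMPLATE = "### Question:\n{}\n\n### Context:\n{}\n\n### Response:\n{}"
EOS = "<|eot_id|>"  # assumed value of tokenizer.eos_token


def build_training_text(question, context, answer):
    """Training example: all three slots filled, followed by the EOS token."""
    return TEMPLATE.format(question, context, answer) + EOS


def build_inference_prompt(question, context):
    """Inference prompt: response slot left empty so the model completes it."""
    return TEMPLATE.format(question, context, "")


question = "What area did NVIDIA initially focus on?"
context = "Since our original focus on PC graphics, we have expanded to other fields."

train_text = build_training_text(question, context, "NVIDIA initially focused on PC graphics.")
prompt = build_inference_prompt(question, context)

print(train_text)
print("---")
print(prompt)
```

Keeping one template for both cases means the prompt the model sees at inference matches the prefix it saw during fine-tuning.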

### **Defining the Trainer Arguments**

We will be setting up and using the [Supervised Fine-Tuning Trainer](https://huggingface.co/docs/trl/sft_trainer) from Hugging Face's TRL (Transformer Reinforcement Learning) library.

**Supervised fine-tuning** is a process in machine learning where a pre-trained model is further trained on a specific dataset with labeled examples. During this process, the model learns to make predictions or classifications based on the labeled data, improving its performance on the specific task at hand. This technique leverages the general knowledge the model has already acquired during its initial training phase and adapts it to perform well on a more targeted set of examples. Supervised fine-tuning is commonly used to customize models for specific applications, such as sentiment analysis, object recognition, or language translation, by using task-specific annotated data.

In [None]:
trainer = SFTTrainer(
    # The model to be fine-tuned
    model = model,
    # The tokenizer associated with the model
    tokenizer = tokenizer,
    # The dataset used for training
    train_dataset = dataset,
    # The field in the dataset containing the text data
    dataset_text_field = "text",
    # Maximum sequence length for the training data
    max_seq_length = 2048,
    # Number of processes to use for data loading
    dataset_num_proc = 2,
    # Whether to use sequence packing, which can speed up training for short sequences
    packing = False,
    args = TrainingArguments(
        # Batch size per device during training
        per_device_train_batch_size = 2,
        # Number of gradient accumulation steps to perform before updating the model parameters
        gradient_accumulation_steps = 4,
        # Number of warmup steps for learning rate scheduler
        warmup_steps = 5,
        # Total number of training steps
        max_steps = 540,
        # Number of training epochs; can be used instead of max_steps. For this notebook one epoch is ~875 steps (7,000 examples / effective batch size of 8)
        # num_train_epochs = 1,
        # Learning rate for the optimizer
        learning_rate = 2e-4,
        # Use 16-bit floating point precision for training if bfloat16 is not supported
        fp16 = not is_bfloat16_supported(),
        # Use bfloat16 precision for training if supported
        bf16 = is_bfloat16_supported(),
        # Number of steps between logging events
        logging_steps = 1,
        # Optimizer to use (in this case, AdamW with 8-bit precision)
        optim = "adamw_8bit",
        # Weight decay to apply to the model parameters
        weight_decay = 0.01,
        # Type of learning rate scheduler to use
        lr_scheduler_type = "linear",
        # Seed for random number generation to ensure reproducibility
        seed = 3407,
        # Directory to save the output models and logs
        output_dir = "outputs",
    ),
)


1. **model**: The model to be fine-tuned. This is the pre-trained model that will be adapted to the specific training data.

2. **tokenizer**: The tokenizer associated with the model. It converts text data into tokens that the model can process.

3. **train_dataset**: The dataset used for training. This is the collection of labeled examples that the model will learn from during the fine-tuning process.

4. **dataset_text_field**: The field in the dataset containing the text data. This specifies which part of the dataset contains the text that the model will be trained on.

5. **max_seq_length**: The maximum sequence length for the training data. This limits the number of tokens per input sequence to ensure they fit within the model's processing capacity.

6. **dataset_num_proc**: The number of processes to use for data loading. This can speed up data loading by parallelizing it across multiple processes.

7. **packing**: A boolean indicating whether to use sequence packing. Sequence packing concatenates multiple short examples into a single sequence of up to `max_seq_length` tokens, which reduces padding and can speed up training.

8. **args**: A set of training arguments that configure the training process. These include various hyperparameters and settings:

    - **per_device_train_batch_size**: The batch size per device during training. This determines how many examples are processed together in one forward/backward pass.
    
    - **gradient_accumulation_steps**: The number of gradient accumulation steps to perform before updating the model parameters. This allows for effectively larger batch sizes without requiring more memory.
    
    - **warmup_steps**: The number of warmup steps for the learning rate scheduler. During these steps, the learning rate increases gradually from 0 to the configured `learning_rate`.
    
    - **max_steps**: The total number of training (optimizer) steps. Each step processes `per_device_train_batch_size × gradient_accumulation_steps` examples per device. When set to a positive value, it overrides `num_train_epochs`.
    
    - **num_train_epochs**: The number of training epochs (commented out in the example). This defines how many times the entire training dataset will be passed through the model.
    
    - **learning_rate**: The learning rate for the optimizer. This controls how much to adjust the model's weights with respect to the gradient during training.
    
    - **fp16**: A boolean indicating whether to train in 16-bit floating point (fp16) mixed precision. Here it is enabled only when bfloat16 is not supported, as on the Tesla T4. This can speed up training and reduce memory usage.
    
    - **bf16**: A boolean indicating whether to train in bfloat16 precision, enabled here only if the GPU supports it. This also speeds up training and reduces memory usage. bfloat16 has a wider dynamic range than fp16, which makes it less prone to overflow, at the cost of fewer mantissa bits and therefore lower precision.
    
    - **logging_steps**: The number of steps between logging events. This controls how frequently training progress and metrics are logged.
    
    - **optim**: The optimizer to use. In this case, AdamW with its optimizer states stored in 8-bit precision, which reduces the optimizer's memory footprint.
    
    - **weight_decay**: The weight decay to apply to the model parameters. This is a regularization technique to prevent overfitting by penalizing large weights.
    
    - **lr_scheduler_type**: The type of learning rate scheduler to use. This controls how the learning rate changes over time during training; with "linear", it decays linearly to 0 after warmup.
    
    - **seed**: The seed for random number generation. This ensures reproducibility of results by controlling the randomness in training.
    
    - **output_dir**: The directory to save the output models and logs. This specifies where the trained model and training logs will be stored.
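
To see how `warmup_steps`, `max_steps`, `learning_rate`, and `lr_scheduler_type = "linear"` interact, the sketch below reimplements the usual linear-with-warmup rule in plain Python. It is meant to mirror the behavior of the scheduler, not to reproduce the library's code; the function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=540):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        # Warmup: ramp up from 0 towards base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr (at the end of warmup) to 0 (at total_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in [0, 1, 5, 270, 540]:
    print(f"step {step:>3}: lr = {linear_schedule_lr(step):.6f}")
```

With the notebook's values, the learning rate reaches its peak of `2e-4` after 5 steps and then declines steadily until it hits 0 at step 540.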

# **Now We're Ready to Train!** 🎉

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,000 | Num Epochs = 1 | Total steps = 540
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


| Step | Training Loss |
|------|---------------|
| 1 | 0.0074 |
| 2 | 0.0069 |
| 3 | 0.0112 |
| 4 | 0.0099 |
| 5 | 0.0078 |
| 6 | 0.0047 |
| 7 | 0.0096 |
| 8 | 0.0052 |
| 9 | 0.0055 |
| 10 | 0.0019 |
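
The training banner above reports 7,000 examples, a total batch size of 8, and 540 total steps. The short calculation below, a sketch using only those reported numbers, shows how they relate and how much of the dataset a 540-step run covers.

```python
import math


def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_devices


num_examples = 7000
max_steps = 540

batch = effective_batch_size(per_device=2, grad_accum=4, num_devices=1)
steps_per_epoch = math.ceil(num_examples / batch)
examples_seen = max_steps * batch
fraction_of_epoch = examples_seen / num_examples

print(batch)                        # 8
print(steps_per_epoch)              # 875
print(examples_seen)                # 4320
print(f"{fraction_of_epoch:.2f}")   # 0.62
```

So with `max_steps = 540` the run stops after about 62% of one pass over the data; a full epoch would take 875 steps.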
