# Llama 3
1. Fine-tune a local Llama 3 model on Form 10-K contextual Q&A data using supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA)
2. Use a preprocessed HTML file of the 10-K
3. Use local embeddings and an in-memory vector store to create a retrieval function
4. Combine everything above into a financial RAG agent



In [1]:
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes
!pip install -U langchain
!pip install -U langchain-community
!pip install -U sentence-transformers
!pip install -U faiss-gpu


In [2]:
# Hugging Face token, required for accessing gated models (like Llama 3 8B Instruct)
hf_token = ""


In [3]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


### **Initializing the Pre-Trained Model and Tokenizer**
For this example we will use Meta's [Llama 3 8B Instruct model](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).  
 **NOTE**: This is a gated model. You must request access on Hugging Face and pass your HF token in the step below.

In [4]:
# Loading the model and tokenizer from the pre-trained FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
    token = hf_token,
)


==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



### **Adding LoRA Adapters for Parameter-Efficient Fine-Tuning**

In [5]:
# Apply LoRA (Low-Rank Adaptation) adapters to the model for parameter-efficient fine-tuning
model = FastLanguageModel.get_peft_model(
    model,
    # Rank of the adaptation matrix. Higher values can capture more complex patterns.
    r = 16,
    # Attention and MLP projection layers that receive LoRA adapters
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

Unsloth 2024.12.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
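
As a sanity check on the adapter configuration above, the number of trainable LoRA parameters can be estimated by hand. Each adapted linear layer with `in_features` inputs and `out_features` outputs gains two low-rank matrices holding `r * (in_features + out_features)` parameters in total. The sketch below assumes the publicly documented Llama 3 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, 8 key/value heads of dimension 128); those values are not printed anywhere in this notebook, and `lora_params` is a hypothetical helper.

```python
# Rough count of LoRA trainable parameters for r = 16 on all seven projections.
# ASSUMPTION: layer shapes come from the public Llama 3 8B config,
# not from anything printed in this notebook.
r = 16
hidden = 4096          # model (embedding) dimension
intermediate = 14336   # MLP inner dimension
kv_dim = 8 * 128       # grouped-query attention: 8 KV heads of size 128
num_layers = 32

# (in_features, out_features) for each targeted module
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

def lora_params(in_features, out_features, rank):
    # A is (rank x in_features) and B is (out_features x rank)
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
print(f"{per_layer:,} per layer, {total:,} total")
```

Under these assumptions the total comes to 41,943,040, which agrees with the `Number of trainable parameters` line that Unsloth prints when training starts.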


### **Preparing the Fine Tuning Dataset**

Hugging Face dataset of financial Q&A pairs over Form 10-K filings:

 https://huggingface.co/datasets/virattt/llama-3-8b-financialQA



In [6]:
# Defining the expected prompt
ft_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Below is a user question, paired with retrieved context. Write a response that appropriately answers the question,
include specific details in your response. <|eot_id|>

<|start_header_id|>user<|end_header_id|>

### Question:
{}

### Context:
{}

<|eot_id|>

### Response: <|start_header_id|>assistant<|end_header_id|>
{}"""


EOS_TOKEN = tokenizer.eos_token

# Function for formatting above prompt with information from Financial QA dataset
def formatting_prompts_func(examples):
    questions = examples["question"]
    contexts = examples["context"]
    responses = examples["answer"]
    texts = []
    for question, context, response in zip(questions, contexts, responses):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = ft_prompt.format(question, context, response) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

dataset = load_dataset("virattt/llama-3-8b-financialQA", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
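
To make the batched `map` contract concrete: with `batched = True`, the formatting function receives a dict of column lists and must return a dict of equally long lists. The sketch below mimics that on a toy batch without loading the tokenizer or the dataset. `toy_prompt`, `TOY_EOS`, `format_batch`, and the sample rows are hypothetical stand-ins, not the real template or data.

```python
# Toy illustration of a batched formatting function.
# ASSUMPTION: TOY_EOS and toy_prompt are placeholders for the real
# tokenizer.eos_token and ft_prompt used in this notebook.
TOY_EOS = "<EOS>"
toy_prompt = "### Question:\n{}\n\n### Context:\n{}\n\n### Response:\n{}"

def format_batch(examples):
    # `examples` maps each column name to a list with one entry per row
    texts = [
        toy_prompt.format(q, c, a) + TOY_EOS
        for q, c, a in zip(examples["question"], examples["context"], examples["answer"])
    ]
    # Return one new column with the same number of rows
    return {"text": texts}

toy_batch = {
    "question": ["What is X?", "What is Y?"],
    "context": ["X is 1.", "Y is 2."],
    "answer": ["1", "2"],
}
formatted = format_batch(toy_batch)
print(len(formatted["text"]))
print(formatted["text"][0])
```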


### **Defining the Trainer Arguments**

In [8]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        # Batch size per device during training
        per_device_train_batch_size = 2,
        # Number of gradient accumulation steps to perform before updating the model parameters
        gradient_accumulation_steps = 4,
        # Number of warmup steps for learning rate scheduler
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        # Optimizer to use (in this case, AdamW with 8-bit precision)
        optim = "adamw_8bit",
        # Weight decay to apply to the model parameters
        weight_decay = 0.01,
        # Type of learning rate scheduler to use
        lr_scheduler_type = "linear",
        # Seed for random number generation to ensure reproducibility
        seed = 3407,
        # Directory to save the output models and logs
        output_dir = "outputs",
    ),
)



max_steps is given, it will override any value given in num_train_epochs
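
The arguments above imply a short run: the effective batch size is the per-device batch size times the gradient accumulation steps, and `max_steps = 60` caps how much of the 7,000-example dataset is seen. The sketch below works through that arithmetic and the shape of a linear schedule with warmup. `linear_lr` is a hypothetical helper written for illustration, and the step indexing may differ by one from what the trainer actually logs.

```python
# Back-of-the-envelope view of the training configuration.
per_device_batch = 2
grad_accum = 4
num_gpus = 1            # single GPU, as reported in the training log
max_steps = 60
warmup_steps = 5
peak_lr = 2e-4
dataset_size = 7000

effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / dataset_size

def linear_lr(step):
    # Linear ramp from 0 to peak_lr during warmup ...
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # ... then linear decay from peak_lr down to 0 at max_steps
    return peak_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

print(effective_batch, examples_seen, round(fraction_of_epoch, 3))
for step in (0, 5, 30, 60):
    print(step, linear_lr(step))
```

With these settings the model sees 480 examples, roughly 7% of one epoch, so the run is best read as a quick demonstration rather than a full fine-tune.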


In [9]:
dataset[0]

{'question': 'What area did NVIDIA initially focus on before expanding to other computationally intensive fields?',
 'answer': 'NVIDIA initially focused on PC graphics.',
 'context': 'Since our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.',
 'ticker': 'NVDA',
 'filing': '2023_10K',
 'text': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nBelow is a user question, paired with retrieved context. Write a response that appropriately answers the question,\ninclude specific details in your response. <|eot_id|>\n\n<|start_header_id|>user<|end_header_id|>\n\n### Question:\nWhat area did NVIDIA initially focus on before expanding to other computationally intensive fields?\n\n### Context:\nSince our original focus on PC graphics, we have expanded to several other large and important computationally intensive fields.\n\n<|eot_id|>\n\n### Response: <|start_header_id|>assistant<|end_header_id|>\nNVIDIA initially

In [10]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040

| Step | Training Loss |
|------|---------------|
| 1 | 4.5595 |
| 2 | 3.9985 |
| 3 | 4.051 |
| 4 | 3.8006 |
| 5 | 2.7219 |
| 6 | 2.5022 |
| 7 | 2.0224 |
| 8 | 2.0126 |
| 9 | 1.8398 |
| 10 | 1.4468 |


In [5]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
# Save the fine-tuned model (LoRA adapters) and tokenizer to Google Drive
model.save_pretrained("/content/drive/MyDrive/l3_finagent/l3_finagent_step60")
tokenizer.save_pretrained("/content/drive/MyDrive/l3_finagent/l3_finagent_step60")

('/content/drive/MyDrive/l3_finagent/l3_finagent_step60/tokenizer_config.json',
 '/content/drive/MyDrive/l3_finagent/l3_finagent_step60/special_tokens_map.json',
 '/content/drive/MyDrive/l3_finagent/l3_finagent_step60/tokenizer.json')

In [6]:
# Redefining prompt if importing without training
ft_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Below is a user question, paired with retrieved context. Write a response that appropriately answers the question,
include specific details in your response. <|eot_id|>

<|start_header_id|>user<|end_header_id|>

### Question:
{}

### Context:
{}

<|eot_id|>

### Response: <|start_header_id|>assistant<|end_header_id|>
{}"""

# Reload the fine-tuned model and tokenizer from the saved checkpoint
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/content/drive/MyDrive/l3_finagent/l3_finagent_step60",
    max_seq_length = 2048, # Existing arguments from when we loaded earlier
    dtype = None,
    load_in_4bit = True,
)
# Switch the model to Unsloth's inference mode
FastLanguageModel.for_inference(model)

==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Unsloth 2024.12.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [7]:
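# Enable Unsloth's inference mode. This repeats the call made when the
# checkpoint was reloaded, so the cell also works right after training.
from unsloth import FastLanguageModel
FastLanguageModel.for_inference(model)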


In [8]:

# Main Inference Function, handles generating and decoding tokens
def inference(question, context):
    inputs = tokenizer(
        [
            ft_prompt.format(
                question,
                context,
                "",  # output - leave this blank for generation!
            )
        ],
        return_tensors="pt"
    ).to("cuda")

    # Generating tokens for the input prompt using the model
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        use_cache=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    response = tokenizer.batch_decode(outputs)
    return response

In [9]:
# Function for extracting just the language model generation from the full response
def extract_response(text):
    text = text[0]
    start_token = "### Response: <|start_header_id|>assistant<|end_header_id|>"
    end_token = "<|eot_id|>"

    # Check for the marker before adding its length; otherwise a failed
    # find() (-1) is hidden by the offset and never detected
    marker_index = text.find(start_token)
    if marker_index == -1:
        return None
    start_index = marker_index + len(start_token)

    end_index = text.find(end_token, start_index)
    if end_index == -1:
        return None

    return text[start_index:end_index].strip()

In [None]:

context = "The increase in research and development expense for fiscal year 2023 was primarily driven by increased compensation, employee growth, engineering development costs, and data center infrastructure."
question = "What were the primary drivers of the notable increase in research and development expenses for fiscal year 2023?"

resp = inference(question, context)
parsed_response = extract_response(resp)
print(parsed_response)

The notable increase in research and development expenses in fiscal year 2023 was primarily driven by increased compensation, employee growth, engineering development costs, and data center infrastructure.


### **Setting Up Embeddings Locally**

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
import pandas as pd

# Load the preprocessed 10-K (CSV file)
file_path = 'structured_10k.csv'
df = pd.read_csv(file_path)


# Combine all column contents into one string
# (cast to str so that non-text columns do not break the .str accessor)
combined_text = " ".join(df[col].dropna().astype(str).str.cat(sep=" ") for col in df.columns)

# Initialize a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,         # Maximum size of each chunk
    chunk_overlap=500,       # Number of characters to overlap between chunks
    length_function=len,     # Function to determine the length of the chunks
    is_separator_regex=False # Whether the separator is a regex pattern
)

# Split the combined text into smaller chunks
split_data = text_splitter.create_documents([combined_text])

# Load a pre-trained embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Build an in-memory FAISS vector store from the chunks
db = FAISS.from_documents(split_data, embedding_model)

# Create a retriever object to search within the vector database
retriever = db.as_retriever()

# Query the retriever
query = "What are the risk factors mentioned in the report?"
results = retriever.invoke(query)

# Display results
print("-----")
print("Top Results:")
for i, result in enumerate(results):
    print(f"Result {i+1}:")
    print(f"Content: {result.page_content[:500]}")  # Print first 500 characters of each result
    print("-" * 80)




-----
Top Results:
Result 1:
Content: Companys business, reputation, results of operations, financial condition and stock price can be affected by a number of factors, whether currently known or unknown, including those described below. When any one or more of these risks materialize from time to time, the Companys business, reputation, results of operations, financial condition and stock price can be materially and adversely affected.Because of the following factors, as well as other factors affecting the Companys results of operat
--------------------------------------------------------------------------------
Result 2:
Content: performing procedures to assess the risks of material misstatement of the financial statements, whether due to error or fraud, and performing procedures that respond to those risks. Such procedures included examining, on a test basis, evidence regarding the amounts and disclosures in the financial statements. Our audits also included evaluating the accounting
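
The effect of `chunk_size=1000` with `chunk_overlap=500` is easier to see on a simplified splitter. The sketch below is a plain fixed-window chunker, not the actual algorithm of `RecursiveCharacterTextSplitter` (which prefers to break on separators such as paragraph breaks and spaces); `sliding_chunks` is a hypothetical helper and the sample text is synthetic.

```python
# Simplified fixed-window chunking with overlap.
# ASSUMPTION: this ignores separators entirely; it only illustrates how
# chunk size and overlap interact.
def sliding_chunks(text, chunk_size, chunk_overlap):
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    stride = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))  # 2,500 synthetic characters
chunks = sliding_chunks(sample, chunk_size=1000, chunk_overlap=500)
print(len(chunks), [len(c) for c in chunks])

# The second half of one chunk is the first half of the next
print(chunks[0][500:] == chunks[1][:500])
```

With a 50% overlap, most characters land in two chunks, so the index holds roughly twice the source text. That is the price paid for lowering the chance that an answer is split across a chunk boundary.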



In [13]:
# Retrieval Function
def retrieve_context(query):
    # Uses the retriever defined earlier (reading it needs no `global` declaration)
    retrieved_docs = retriever.invoke(query)  # Retrieve relevant documents
    context = "\n\n".join(doc.page_content for doc in retrieved_docs)  # Combine retrieved content
    return context

# Main Interactive Loop
while True:
    question = input("What would you like to know about the form 10-K? (Type 'x' to exit): ")
    if question.strip().lower() == "x":
        print("Exiting interactive assistant. Goodbye!")
        break

    # Context Retrieval
    context = retrieve_context(question)
    if not context.strip():
        print("L3 Agent: No relevant context found. Please try another query.")
        print("-----\n")
        continue

    # Run Inference
    resp = inference(question, context)  # Generate with the fine-tuned model
    parsed_response = extract_response(resp)  # Keep only the assistant's answer

    # Display Response
    print(f"L3 Agent: {parsed_response if parsed_response else 'No relevant response generated.'}")
    print("-----\n")


What would you like to know about the form 10-K? (Type 'x' to exit): How does the company motivate and develop its employees?
L3 Agent: The company motivates and develops its employees through a range of programs, including open communication, diverse representation, inclusive culture, equitable pay and access to opportunity, health and safety, career development, leadership and personal development, and benefits.
-----

What would you like to know about the form 10-K? (Type 'x' to exit): What is Apple's commitment to inclusion and diversity?
L3 Agent: Apple is committed to inclusion and diversity, and it aims to build a more inclusive workforce that is representative of the communities it serves.
-----

What would you like to know about the form 10-K? (Type 'x' to exit): What is the website where the Company periodically provides information for investors?
L3 Agent: The Company periodically provides certain information for investors on its corporate website, www.apple.com, and its inv
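
The retrieve-then-generate loop above rests on nearest-neighbour search over embedding vectors. The sketch below shows that idea with a tiny in-memory store and hand-written 3-dimensional vectors. `ToyVectorStore`, the vectors, and the texts are all hypothetical; a real setup would embed text with the sentence-transformers model and search with FAISS, and the use of L2 distance here is an assumption made for illustration.

```python
import math

def l2_distance(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class ToyVectorStore:
    """Minimal in-memory store: brute-force search over (text, vector) pairs."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, vector))

    def search(self, query_vector, k=4):
        # Rank every stored vector by distance to the query, closest first
        ranked = sorted(self._items, key=lambda item: l2_distance(item[1], query_vector))
        return [text for text, _ in ranked[:k]]

store = ToyVectorStore()
store.add("Chunk about risk factors", [0.9, 0.1, 0.0])
store.add("Chunk about employee development", [0.1, 0.9, 0.0])
store.add("Chunk about the investor website", [0.0, 0.1, 0.9])

# A query vector that sits closest to the "risk factors" chunk
top = store.search([0.8, 0.2, 0.1], k=2)
context = "\n\n".join(top)  # same joining step as retrieve_context
print(context)
```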