In [13]:
pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [14]:
import pandas as pd

# Preprocessing
- Load dataset
- Remove duplicates
- Strip leading and trailing whitespace
- Save dataset

In [15]:
data = pd.read_csv('data/train_data.csv')
data = data.drop_duplicates()

data['question'] = data['question'].str.strip()
data['answer'] = data['answer'].str.strip()

data.to_csv("data/cleaned_data.csv", index=False)
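
The order of the two cleaning steps matters: if duplicates are dropped before whitespace is stripped, rows that differ only by padding both survive and become identical afterwards. The sketch below shows this in plain Python, with made-up rows standing in for the dataset and hypothetical helper functions standing in for the pandas calls. In pandas terms, the safer order would be to apply `.str.strip()` first and `drop_duplicates()` second.

```python
# Hypothetical rows standing in for (question, answer) pairs in train_data.csv.
rows = [
    ("What is APR?", "Annual percentage rate."),
    (" What is APR? ", "Annual percentage rate. "),  # same text, extra whitespace
    ("What is APR?", "Annual percentage rate."),      # exact duplicate of row 0
]

def drop_duplicates(pairs):
    """Keep the first occurrence of each exact pair, preserving order."""
    return list(dict.fromkeys(pairs))

def strip_pairs(pairs):
    """Strip leading and trailing whitespace from both fields."""
    return [(q.strip(), a.strip()) for q, a in pairs]

dedup_then_strip = strip_pairs(drop_duplicates(rows))
strip_then_dedup = drop_duplicates(strip_pairs(rows))

print(len(dedup_then_strip))  # 2: the padded row survives as a duplicate
print(len(strip_then_dedup))  # 1
```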

In [16]:
data = pd.read_csv('data/cleaned_data.csv')

We define a function, apply_chat_template(), that takes one row of the dataset at a time and turns it into a structured conversation.
Each conversation has two turns:

- The user's question (role: user).
- FinBot’s response (role: assistant).

In [17]:
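def apply_chat_template(row):
    # Build a single-turn conversation from one dataset row.
    messages = [
        {"role": "user", "content": row["question"]},
        {"role": "assistant", "content": row["answer"]}
    ]
    # tokenize=False returns the rendered string instead of token ids.
    # Note: add_generation_prompt=True appends an empty assistant header after
    # the answer; for training text, add_generation_prompt=False is usually preferred.
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)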

We're working with Llama-3.2-1B-Instruct, Meta’s instruction-tuned version of Llama-3.2.
We pass an access token so that Hugging Face grants access to the gated model without a manual login.
The tokenizer converts text into token ids and, through its chat template, structures conversations before they are fed to the model.

In [18]:
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
import torch

In [19]:
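import os

model_id = "meta-llama/Llama-3.2-1B-Instruct"
# Read the access token from the environment instead of hardcoding it in the notebook
# (HF_TOKEN is the variable name assumed here).
token = os.environ["HF_TOKEN"]
# `token=` replaces the deprecated `use_auth_token=` argument.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32, device_map="auto", token=token)
tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
tokenizer.pad_token = tokenizer.eos_token  # Reuse the end-of-sequence token for padding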

Some parameters are on the meta device because they were offloaded to the disk.


Applying the chat template to every row puts the data in the format used for fine-tuning Llama-3.2.
Llama-3.2 expects each conversation to follow a structured pattern (user prompt => assistant response).

In [20]:
data["formatted_prompt"] = data.apply(apply_chat_template, axis=1)

Because Llama-3.2-Instruct was itself fine-tuned on this format, structuring conversations the same way helps the model tell where each turn starts and ends.
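
To make the turn boundaries concrete, here is a hand-written approximation of the layout, using the special tokens visible in the printed sample. The function `render_llama3_chat` is hypothetical and only mimics the structure. It omits the system block, and the real rendering is done by `tokenizer.apply_chat_template`.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Hand-written approximation of the Llama 3 chat layout (no system block)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, and an end-of-turn marker.
        text += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an empty assistant turn for the model to complete.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

# Made-up example conversation.
messages = [
    {"role": "user", "content": "What is an index fund?"},
    {"role": "assistant", "content": "A fund that tracks a market index."},
]

training_text = render_llama3_chat(messages)
inference_text = render_llama3_chat(messages[:1], add_generation_prompt=True)
print(training_text)
```

In this sketch the training text ends with `<|eot_id|>` after the assistant's answer, while the generation prompt is only added at inference time, when the model is expected to write the next assistant turn.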

In [21]:
print(data['formatted_prompt'][1])

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 22 Apr 2025

<|eot_id|><|start_header_id|>user<|end_header_id|>

Cheapest way to wire or withdraw money from US account while living in Europe<|eot_id|><|start_header_id|>assistant<|end_header_id|>

There is a number of cheaper online options that you could use. TranferWise was already mentioned here. Other options i know are Paysera or TransferGo. They state that international transfers are processed on the next day and they are substantially cheaper than those of banks. Currency exchange rate is usually not bad.<|eot_id|><|start_header_id|>assistant<|end_header_id|>




In [22]:
data[["formatted_prompt"]].to_csv("data/finbot_data.csv", index=False)


# Teach the model to specialize in financial literacy conversations.

In [23]:
dataset = load_dataset("csv", data_files="data/finbot_data.csv", split="train")


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/anaconda3/lib/python3.12/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/opt/anaconda3/lib/python3.12/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
  File "/opt/anaconda3/lib/python3.12/site-packages/ipykernel/kernelapp.py", line 701, in start
    self.io_loop.start()
  File "/opt/anaconda3/lib/python3.12/site-

ImportError: numpy.core.multiarray failed to import

In [None]:
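def tokenize_function(example):
    # Pad or truncate every example to a fixed length of 128 tokens.
    tokens = tokenizer(example['formatted_prompt'], padding="max_length", truncation=True, max_length=128)
    # Exclude padding positions from the loss (-100). Use the attention mask rather than the pad id:
    # pad_token is set to eos_token, so matching on the id would also mask genuine end-of-sequence tokens.
    tokens['labels'] = [tok if mask == 1 else -100 for tok, mask in zip(tokens['input_ids'], tokens['attention_mask'])]
    return tokens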
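
The label masking is worth a small illustration. A label of -100 is skipped by the loss, so it is used to hide padding. When the pad token shares its id with an end-of-sequence token, masking by token id and masking by attention mask give different labels. The ids below are made up for the example and do not come from the Llama tokenizer.

```python
PAD_ID = 9     # made-up id; here the pad token doubles as the end-of-sequence token
IGNORE = -100  # label value ignored by PyTorch's cross-entropy loss by default

# Seven real tokens (two of them genuine end markers) followed by three pads.
input_ids =      [1, 5, 6, 9, 7, 8, 9, 9, 9, 9]
attention_mask = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

# Masking by id also hides the genuine end markers at positions 3 and 6.
by_id = [IGNORE if t == PAD_ID else t for t in input_ids]

# Masking by attention mask hides only the padding positions.
by_mask = [t if m == 1 else IGNORE for t, m in zip(input_ids, attention_mask)]

print(by_id)    # [1, 5, 6, -100, 7, 8, -100, -100, -100, -100]
print(by_mask)  # [1, 5, 6, 9, 7, 8, 9, -100, -100, -100]
```

With the id-based variant the model gets no training signal for producing the end marker, which can make it less likely to stop generating at the end of an answer.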

In [None]:
tokenized_dataset = dataset.map(tokenize_function)

NameError: name 'dataset' is not defined

In [None]:
tokenized_dataset = tokenized_dataset.train_test_split(test_size=0.05)

In [None]:
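training_args = TrainingArguments(
    output_dir="./finbot_model",
    evaluation_strategy="steps",  # Renamed to eval_strategy in newer transformers releases
    eval_steps=50,  # Evaluate on the test split every 50 optimizer steps
    logging_steps=50,
    save_steps=200,  # Write a checkpoint every 200 optimizer steps
    per_device_train_batch_size=2,  # Adjust to fit available memory
    per_device_eval_batch_size=2,
    num_train_epochs=3,  # Full passes over the training split
    learning_rate=2e-5,
    max_grad_norm=1.0,  # Gradient clipping to stabilize training
    fp16=False,  # fp16 mixed precision needs a CUDA GPU; keep False on MacBooks
)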
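
To see what these settings mean in practice, the following back-of-the-envelope sketch estimates the number of optimizer steps, evaluations, and checkpoints. It assumes a single device and no gradient accumulation. The training-set size of 1900 is a made-up figure, and `training_schedule` is a hypothetical helper that is not part of transformers.

```python
import math

def training_schedule(n_train, batch_size, epochs, eval_steps, save_steps):
    """Rough step counts for a single device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "evaluations": total_steps // eval_steps,
        "checkpoints": total_steps // save_steps,
    }

# n_train=1900 is an assumed size; the other values mirror the arguments above.
schedule = training_schedule(n_train=1900, batch_size=2, epochs=3, eval_steps=50, save_steps=200)
print(schedule)
# {'steps_per_epoch': 950, 'total_steps': 2850, 'evaluations': 57, 'checkpoints': 14}
```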

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer
)

In [None]:
trainer.train()

In [None]:
trainer.save_model("./finbot_model")
tokenizer.save_pretrained("./finbot_model")