In [19]:
import pandas as pd

In [20]:
df = pd.read_csv('train_data.csv')
df

Unnamed: 0,question,answer
0,Would I ever need credit card if my debit card...,Skimmers are most likely at gas station pumps....
1,Cheapest way to wire or withdraw money from US...,There is a number of cheaper online options th...
2,How do I go about finding an honest ethical f...,Large and wellknown companies are typically a ...
3,Why invest in becoming a landlord?,why does it make sense financially to buy prop...
4,What could be the cause of a extreme highlow p...,Often these types of trades fall into two diff...
...,...,...
12042,What percent of my salary should I save?,I disagree with the selected answer. Theres no...
12043,Why do people invest in mutual fund rather tha...,How on earth can you possibly know what is goi...
12044,What would happen if the Euro currency went bust?,Each country would have to go back to its own ...
12045,Are credit cards not viewed as credit until yo...,Theres a difference between missing a payment ...


# Preprocessing
Clean the text by trimming surrounding whitespace and removing special characters and unwanted symbols.


In [21]:
import re

Here I define a function called data_cleaning.
The function does the following:
1. Removes punctuation but keeps periods, digits, currency symbols ($, €, £), and percent signs. The finBOT may make use of these, so we keep them.

2. Converts all text to lower case and strips whitespace from the start and the end of the string for consistency.


Example: " Tokenization sucks " => "tokenization sucks"

In [22]:
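def data_cleaning(text):
    # Drop everything except word characters (letters, digits, underscore),
    # whitespace, periods, and the symbols $ % € £
    text = re.sub(r'[^\w\s.$%€£]', '', text)
    # Lowercase and trim leading/trailing whitespace
    text = text.lower().strip()
    return text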

In [23]:
df['question'] = df['question'].apply(data_cleaning)
df['answer'] = df['answer'].apply(data_cleaning)
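
To see what the cleaning step keeps and drops, here is a small self-contained check. The sample sentence is made up for illustration and is not taken from train_data.csv; the function repeats the pattern used in the cell above.

```python
import re

def data_cleaning(text):
    # Same pattern as in the notebook: keep word characters, whitespace,
    # periods, and the symbols $ % € £; drop everything else
    text = re.sub(r'[^\w\s.$%€£0-9]', '', text)
    return text.lower().strip()

# Hypothetical sample, not taken from the dataset
sample = "  What's the APR on a $1,000 loan at 5.5%?  "
cleaned = data_cleaning(sample)
print(cleaned)  # whats the apr on a $1000 loan at 5.5%
```

Note that apostrophes and thousands separators are removed as well, so "$1,000" becomes "$1000".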

# Tokenization

1. Word level tokenization
2. Lemmatization
3. Subword Tokenization
4. Sentence Tokenization

# 1. Word level Tokenization

Here we are just splitting text into individual words.

Example: "interest rate increases" => ["interest", "rate", "increases"]

In [24]:
from nltk.tokenize import word_tokenize
import nltk

nltk.download("punkt_tab")

df["word_token_question"] = df["question"].apply(word_tokenize)
df["word_token_answer"] = df["answer"].apply(word_tokenize)

print(df[["question", "word_token_question"]].head())
print(df[["answer", "word_token_answer"]].head())


[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/tshmacm1172/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


                                            question  \
0  would i ever need credit card if my debit card...   
1  cheapest way to wire or withdraw money from us...   
2  how do i go about finding an honest  ethical f...   
3                  why invest in becoming a landlord   
4  what could be the cause of a extreme highlow p...   

                                 word_token_question  
0  [would, i, ever, need, credit, card, if, my, d...  
1  [cheapest, way, to, wire, or, withdraw, money,...  
2  [how, do, i, go, about, finding, an, honest, e...  
3           [why, invest, in, becoming, a, landlord]  
4  [what, could, be, the, cause, of, a, extreme, ...  
                                              answer  \
0  skimmers are most likely at gas station pumps....   
1  there is a number of cheaper online options th...   
2  large and wellknown companies are typically a ...   
3  why does it make sense financially to buy prop...   
4  often these types of trades fall into two diff... 

# 2. Lemmatization
Reduces each word to its dictionary base form (its lemma).

Example: "becoming" => "become"

Note: run this line of code first: pip install spacy

In [25]:
import spacy.cli
spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")

def lemmatize_text(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc]

df["lemmatized_token_question"] = df["question"].apply(lemmatize_text)
df["lemmatized_token_answer"] = df["answer"].apply(lemmatize_text)
print(df[["question", "lemmatized_token_question"]].head())
print(df[["answer", "lemmatized_token_answer"]].head())


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
                                            question  \
0  would i ever need credit card if my debit card...   
1  cheapest way to wire or withdraw money from us...   
2  how do i go about finding an honest  ethical f...   
3                  why invest in becoming a landlord   
4  what could be the cause of a extre
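
Conceptually, lemmatization maps inflected forms to base forms. The lookup table below is a toy stand-in for illustration only and is not how spaCy works internally (its lemmatizer also relies on part-of-speech information). The word pairs are copied from the lemmatized columns shown in this notebook; the function name is hypothetical.

```python
# Toy lookup table; the pairs are copied from this notebook's output
TOY_LEMMAS = {
    "skimmers": "skimmer",
    "are": "be",
    "cheapest": "cheap",
    "finding": "find",
    "becoming": "become",
    "companies": "company",
}

def toy_lemmatize(tokens):
    # Words missing from the table are returned unchanged
    return [TOY_LEMMAS.get(token, token) for token in tokens]

tokens = ["why", "invest", "in", "becoming", "a", "landlord"]
lemmas = toy_lemmatize(tokens)
print(lemmas)  # ['why', 'invest', 'in', 'become', 'a', 'landlord']
```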

# 3. Subword Tokenization
It splits rare, complex, or out-of-vocabulary words into smaller known pieces so they can still be processed.

Example (illustrative): finbotization => "finbot", "ization"

Note: run this line of code first: pip install transformers

Note: "Ġ" appears at the start of tokens that were preceded by a space in the original text, so the spacing can be restored when reconstructing the sentence.
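
The idea can be sketched without any library. The splitter below is a toy greedy longest-match routine, not GPT-2's byte-level BPE, and its vocabulary is a hypothetical set chosen so the example word splits as described. The second part shows how "Ġ" markers turn back into spaces; the token list there is also made up for illustration rather than real tokenizer output.

```python
def greedy_subword_split(word, vocab):
    # Repeatedly take the longest prefix of the remaining text found in vocab
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical subword vocabulary
toy_vocab = {"fin", "bot", "finbot", "ization"}
pieces = greedy_subword_split("finbotization", toy_vocab)
print(pieces)  # ['finbot', 'ization']

# "Ġ" marks a token that was preceded by a space in the original text
marked_tokens = ["interest", "Ġrate", "Ġincreases"]
restored = "".join(marked_tokens).replace("Ġ", " ")
print(restored)  # interest rate increases
```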

In [18]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

df["subword_token_question"] = df["question"].apply(lambda x: tokenizer.tokenize(x))
df["subword_token_answer"] = df["answer"].apply(lambda x: tokenizer.tokenize(x))
print(df[["question", "subword_token_question"]].head())
print(df[["answer", "subword_token_answer"]].head())

RuntimeError: Failed to import transformers.models.auto.tokenization_auto because of the following error (look up to see its traceback):
Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
module 'numpy._core' has no attribute 'multiarray'

# 4. Sentence Tokenization

This splits the text into sentences, so each sentence is kept as a single unit.
Example: "Interest rates will rise. Investors are adjusting portfolios." => ["Interest rates will rise.", "Investors are adjusting portfolios."]


In [None]:
def sentence_tokenize(text):
    doc = nlp(text)
    return [sent.text for sent in doc.sents]  

df["sentence_tokenized_questions"] = df["question"].apply(sentence_tokenize)
df["sentence_tokenized_answers"] = df["answer"].apply(sentence_tokenize)


print(df[["question", "sentence_tokenized_questions"]].head())
print(df[["answer", "sentence_tokenized_answers"]].head())


                                            question  \
0  would i ever need credit card if my debit card...   
1  cheapest way to wire or withdraw money from us...   
2  how do i go about finding an honest  ethical f...   
3                  why invest in becoming a landlord   
4  what could be the cause of a extreme highlow p...   

                        sentence_tokenized_questions  
0  [would i ever need credit card if my debit car...  
1  [cheapest way to wire or withdraw money from u...  
2  [how do i go about finding an honest  ethical ...  
3                [why invest in becoming a landlord]  
4  [what could be the cause of a extreme highlow ...  
                                              answer  \
0  skimmers are most likely at gas station pumps....   
1  there is a number of cheaper online options th...   
2  large and wellknown companies are typically a ...   
3  why does it make sense financially to buy prop...   
4  often these types of trades fall into two diff... 
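
For intuition, sentence splitting can be approximated with a regular expression. This is a naive punctuation-based sketch, not spaCy's segmenter used above, and the function name is hypothetical. It assumes sentences end with ".", "!" or "?" followed by whitespace, and it does not handle abbreviations.

```python
import re

def naive_sentence_split(text):
    # Split at whitespace that directly follows ., ! or ?
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

text = "Interest rates will rise. Investors are adjusting portfolios."
sentences = naive_sentence_split(text)
print(sentences)
# ['Interest rates will rise.', 'Investors are adjusting portfolios.']
```

Because the cleaning step keeps periods but removes "?" and "!", a punctuation-based splitter like this one can only rely on periods in the cleaned text.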

In [26]:
df

Unnamed: 0,question,answer,word_token_question,word_token_answer,lemmatized_token_question,lemmatized_token_answer
0,would i ever need credit card if my debit card...,skimmers are most likely at gas station pumps....,"[would, i, ever, need, credit, card, if, my, d...","[skimmers, are, most, likely, at, gas, station...","[would, I, ever, need, credit, card, if, my, d...","[skimmer, be, most, likely, at, gas, station, ..."
1,cheapest way to wire or withdraw money from us...,there is a number of cheaper online options th...,"[cheapest, way, to, wire, or, withdraw, money,...","[there, is, a, number, of, cheaper, online, op...","[cheap, way, to, wire, or, withdraw, money, fr...","[there, be, a, number, of, cheap, online, opti..."
2,how do i go about finding an honest ethical f...,large and wellknown companies are typically a ...,"[how, do, i, go, about, finding, an, honest, e...","[large, and, wellknown, companies, are, typica...","[how, do, I, go, about, find, an, honest, , e...","[large, and, wellknown, company, be, typically..."
3,why invest in becoming a landlord,why does it make sense financially to buy prop...,"[why, invest, in, becoming, a, landlord]","[why, does, it, make, sense, financially, to, ...","[why, invest, in, become, a, landlord]","[why, do, it, make, sense, financially, to, bu..."
4,what could be the cause of a extreme highlow p...,often these types of trades fall into two diff...,"[what, could, be, the, cause, of, a, extreme, ...","[often, these, types, of, trades, fall, into, ...","[what, could, be, the, cause, of, a, extreme, ...","[often, these, type, of, trade, fall, into, tw..."
...,...,...,...,...,...,...
12042,what percent of my salary should i save,i disagree with the selected answer. theres no...,"[what, percent, of, my, salary, should, i, save]","[i, disagree, with, the, selected, answer, ., ...","[what, percent, of, my, salary, should, I, save]","[I, disagree, with, the, select, answer, ., th..."
12043,why do people invest in mutual fund rather tha...,how on earth can you possibly know what is goi...,"[why, do, people, invest, in, mutual, fund, ra...","[how, on, earth, can, you, possibly, know, wha...","[why, do, people, invest, in, mutual, fund, ra...","[how, on, earth, can, you, possibly, know, wha..."
12044,what would happen if the euro currency went bust,each country would have to go back to its own ...,"[what, would, happen, if, the, euro, currency,...","[each, country, would, have, to, go, back, to,...","[what, would, happen, if, the, euro, currency,...","[each, country, would, have, to, go, back, to,..."
12045,are credit cards not viewed as credit until yo...,theres a difference between missing a payment ...,"[are, credit, cards, not, viewed, as, credit, ...","[theres, a, difference, between, missing, a, p...","[be, credit, card, not, view, as, credit, unti...","[there, s, a, difference, between, miss, a, pa..."


# Vocabulary and Vectorization
- **Vocabulary**: A vocabulary is a mapping of unique words (tokens) to numerical indices, allowing text to be represented as numbers for machine learning models. It’s essential for transforming words into a format that models can process.

- **Vectorization**: Vectorization is the process of converting text into numerical representations (vectors). This is done using the vocabulary, where each word is replaced by its corresponding index.

**Example**

1. Sentence: ['what', 'be', 'interest', 'rate']

2. Vocabulary: {'<PAD>': 0, '<UNK>': 1, 'what': 2, 'be': 3, 'interest': 4, 'rate': 5}

3. Vector: [2, 3, 4, 5]
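
The example above can be reproduced in a few lines. This is a minimal sketch using the example's vocabulary; the helper name to_indices and the unseen word "inflation" are assumptions added for illustration.

```python
# Vocabulary from the example above
vocab = {"<PAD>": 0, "<UNK>": 1, "what": 2, "be": 3, "interest": 4, "rate": 5}

def to_indices(tokens, vocab):
    # Tokens missing from the vocabulary map to the <UNK> index
    return [vocab.get(token, vocab["<UNK>"]) for token in tokens]

known = to_indices(["what", "be", "interest", "rate"], vocab)
unseen = to_indices(["what", "be", "inflation"], vocab)
print(known)   # [2, 3, 4, 5]
print(unseen)  # [2, 3, 1]
```

In the notebook's own code, `<UNK>` is registered before `<PAD>`, so there the two indices are 0 and 1 respectively; the indices here follow the illustrative example.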


In [30]:
df.columns

Index(['question', 'answer', 'word_token_question', 'word_token_answer',
       'lemmatized_token_question', 'lemmatized_token_answer'],
      dtype='object')

In [47]:
from collections import defaultdict

# Each new key is assigned the next free index on its first lookup
vocab = defaultdict(lambda: len(vocab))
UNK = vocab["<UNK>"]  # index 0
PAD = vocab["<PAD>"]  # index 1

# Build vocab from both question and answer
for col in ['lemmatized_token_question', 'lemmatized_token_answer']:
    for tokens in df[col]:
        for token in tokens:
            _ = vocab[token]

word2idx = dict(vocab)
idx2word = {idx: word for word, idx in word2idx.items()}


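def vectorize(tokens, word2idx, max_len=10):
    # Map tokens to indices; unseen tokens fall back to <UNK>
    vec = [word2idx.get(token, word2idx["<UNK>"]) for token in tokens]
    # Truncate to max_len, then right-pad with <PAD> up to max_len
    vec = vec[:max_len] + [word2idx["<PAD>"]] * (max_len - len(vec))
    return vec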

df['question_vector'] = df['lemmatized_token_question'].apply(lambda x: vectorize(x, word2idx))
df['answer_vector'] = df['lemmatized_token_answer'].apply(lambda x: vectorize(x, word2idx))
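
The effect of max_len is easiest to see on plain index lists. The sketch below isolates the truncate-and-pad step of vectorize; the helper name and the sample ids are hypothetical.

```python
def pad_or_truncate(ids, pad_id, max_len=10):
    # Keep at most max_len ids, then right-pad with pad_id
    return ids[:max_len] + [pad_id] * (max_len - len(ids))

PAD_ID = 1  # in the notebook, <PAD> is the second key added to the vocab

short = pad_or_truncate([7, 8, 9], PAD_ID)
long = pad_or_truncate(list(range(100, 115)), PAD_ID)
print(short)  # [7, 8, 9, 1, 1, 1, 1, 1, 1, 1]
print(long)   # [100, 101, 102, 103, 104, 105, 106, 107, 108, 109]
```

With max_len=10, any answer longer than ten tokens keeps only its first ten, so the model is trained on just the opening words of each answer.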


In [32]:
df.columns

Index(['question', 'answer', 'word_token_question', 'word_token_answer',
       'lemmatized_token_question', 'lemmatized_token_answer',
       'question_vector', 'answer_vector'],
      dtype='object')

In [55]:
import torch
from torch.utils.data import Dataset, DataLoader

class FinQADataset(Dataset):
    def __init__(self, df):
        self.questions = df['question_vector'].tolist()
        self.answers = df['answer_vector'].tolist()

    def __len__(self):
        return len(self.questions)

    def __getitem__(self, idx):
        return (
            torch.tensor(self.questions[idx], dtype=torch.long),
            torch.tensor(self.answers[idx], dtype=torch.long),
        )

train_dataset = FinQADataset(df)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)


In [60]:
import torch.nn as nn

class FinQANet(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256):
        super(FinQANet, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=word2idx["<PAD>"])
        
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        
        # Project bidirectional encoder output to match decoder input size
        self.encoder2decoder = nn.Linear(hidden_dim * 2, hidden_dim)
        
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, input_ids, target_ids=None):
        embedded = self.embedding(input_ids)  # [B, T, E]
        enc_out, _ = self.encoder(embedded)   # [B, T, 2H]
        
        # Project encoder output to decoder input dimension
        dec_input = self.encoder2decoder(enc_out)  # [B, T, H]
        
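        if target_ids is not None:
            # Teacher forcing: the decoder is fed the target embeddings directly.
            # Note: dec_input (the encoder projection) is unused in this branch,
            # and target_ids is not shifted, so output position t sees target token t.
            # This also requires embed_dim == hidden_dim.
            target_emb = self.embedding(target_ids)
            dec_out, _ = self.decoder(target_emb)
        else:
            # Inference: decode from the projected encoder outputs instead
            dec_out, _ = self.decoder(dec_input)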
        
        output = self.fc_out(dec_out)  # [B, T, V]
        return output
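
In a typical sequence-to-sequence setup with teacher forcing, the decoder input is the target shifted right by one position, so the model predicts token t from the tokens before it rather than from token t itself. The sketch below shows only that shift on plain Python lists. The `<BOS>` start token and its index are assumptions, since the notebook's vocabulary defines only `<UNK>` and `<PAD>`.

```python
BOS_ID = 2  # hypothetical start-of-sequence index, not in the notebook's vocab

def shift_right(target_ids, bos_id):
    # Decoder input: start token followed by all target tokens except the last
    return [bos_id] + target_ids[:-1]

target_ids = [10, 11, 12, 13]  # labels the decoder should predict
decoder_input = shift_right(target_ids, BOS_ID)

for step, (seen, label) in enumerate(zip(decoder_input, target_ids)):
    print(f"step {step}: input {seen} -> predict {label}")
```

In the forward pass above, the decoder receives the unshifted targets, so each position can simply reproduce its own input. That may explain a training loss close to zero alongside a poor prediction at inference time, where the decoder is fed projected encoder outputs instead.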


In [68]:
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move model to device
model = FinQANet(vocab_size=len(word2idx)).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=word2idx["<PAD>"])

num_epochs = 30

for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for input_ids, target_ids in train_loader:
        # Move data to device
        input_ids = input_ids.to(device)
        target_ids = target_ids.to(device)

        # Forward pass
        outputs = model(input_ids, target_ids)  # [batch, seq_len, vocab_size]

        # Compute loss
        loss = criterion(outputs.view(-1, outputs.size(-1)), target_ids.view(-1))

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    # total_loss is the sum of per-batch losses over the epoch, not an average
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss:.4f}")


Epoch 1/30, Loss: 1544.8959
Epoch 2/30, Loss: 333.4679
Epoch 3/30, Loss: 136.9874
Epoch 4/30, Loss: 39.6945
Epoch 5/30, Loss: 4.8085
Epoch 6/30, Loss: 1.5984
Epoch 7/30, Loss: 0.9437
Epoch 8/30, Loss: 0.5999
Epoch 9/30, Loss: 0.3916
Epoch 10/30, Loss: 0.2601
Epoch 11/30, Loss: 0.1752
Epoch 12/30, Loss: 0.1195
Epoch 13/30, Loss: 0.0819
Epoch 14/30, Loss: 0.0563
Epoch 15/30, Loss: 0.0386
Epoch 16/30, Loss: 0.0264
Epoch 17/30, Loss: 0.0180
Epoch 18/30, Loss: 0.0123
Epoch 19/30, Loss: 0.0085
Epoch 20/30, Loss: 0.0059
Epoch 21/30, Loss: 0.0041
Epoch 22/30, Loss: 0.0029
Epoch 23/30, Loss: 0.0021
Epoch 24/30, Loss: 0.0015
Epoch 25/30, Loss: 0.0010
Epoch 26/30, Loss: 0.0008
Epoch 27/30, Loss: 0.0006
Epoch 28/30, Loss: 0.0004
Epoch 29/30, Loss: 0.0003
Epoch 30/30, Loss: 0.0002
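
The printed value is the sum of per-batch losses, so it scales with the number of batches. A per-batch average is easier to compare across runs. The arithmetic below is a rough sketch that assumes the DataFrame has 12,046 rows (one per index 0 to 12045); inside the training loop the same figure would be `total_loss / len(train_loader)`.

```python
import math

num_rows = 12046   # assumption: one row per index 0..12045
batch_size = 16
num_batches = math.ceil(num_rows / batch_size)

epoch_1_total = 1544.8959   # summed loss from the log above
epoch_30_total = 0.0002

avg_epoch_1 = epoch_1_total / num_batches
avg_epoch_30 = epoch_30_total / num_batches

print(num_batches)            # 753
print(round(avg_epoch_1, 4))  # 2.0517
print(avg_epoch_30)
```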


In [73]:
model.eval()

with torch.no_grad():
    input_sample = torch.tensor(df['question_vector'].iloc[0]).unsqueeze(0).to(device)  # send to device
    output = model(input_sample)

    pred_ids = torch.argmax(output, dim=-1).squeeze(0)  # [seq_len]
    predicted_tokens = [
        idx2word[idx.item()]
        for idx in pred_ids
        if idx.item() not in {word2idx["<PAD>"], word2idx["<UNK>"]}
    ]

    print("🧠 Predicted Answer:", " ".join(predicted_tokens))
    print("✅ Ground Truth:", " ".join(df['lemmatized_token_answer'].iloc[0]))


🧠 Predicted Answer: mostly french40 falsify 10qs 20142015 20142015 20142015 consequence remain remain
✅ Ground Truth: skimmer be most likely at gas station pump . if your debit card be compromise you be get money take out of your checking account which could cause a cascade of nsf fee . never use debit card at pump . clark howard call debit card piece of trash fake visamc that be because of all the point mention above but the most important fact be back in the 60 when congress be protect its constituent they make sure that the bank be responsible for fraud and maxe your liability at 50 . debit card be introduce much later when congress be interested in protect bank . so you have no protection on your debit card and if they find you negligent with your card they may not replace the steal fund . I get rid of my debit card and only have an atm card . so it can not be use in store which mean you have to know the pin and then you can only get 200 a day .
