## Data Preprocessing

This notebook preprocesses two raw datasets (`bank_transaction.csv` and `user_profile.csv`) so that they can later serve as training inputs to the model.

### 1. Data Cleaning and Merging

#### Loading Data & Inspecting for Missing Values

In [None]:
import pandas as pd
import random
import numpy as np
import gensim
import warnings
warnings.filterwarnings("ignore")

SEED = 42

# Set all random seeds
random.seed(SEED)
np.random.seed(SEED)
gensim.utils.random.seed(SEED)

# Load datasets
bank_transaction = pd.read_csv("../dataset/bank_transaction.csv")
user_profile = pd.read_csv("../dataset/user_profile.csv")

# Display first few rows
display(bank_transaction.head())
display(user_profile.head())

Unnamed: 0,client_id,bank_id,account_id,txn_id,txn_date,description,amount,category
0,1,1,1,4,2023-09-29 00:00:00,Earnin PAYMENT Donatas Danyal,20.0,Loans
1,1,1,1,3,2023-08-14 00:00:00,ONLINE TRANSFER FROM NDonatas DanyalDA O CARSON BUSINESS CHECKING 1216 1216,25.0,Transfer Credit
2,1,1,1,5,2023-09-25 00:00:00,MONEY TRANSFER AUTHORIZED ON 09/25 FROM Earnin CDAEJ_B CA S583269001208168 111,20.0,Loans
3,1,1,2,1,2023-06-02 00:00:00,ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKING 1216 1216,16.0,Transfer Credit
4,1,1,2,2,2023-06-01 00:00:00,ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKING 1216 1216,4.0,Transfer Credit


Unnamed: 0,CLIENT_ID,IS_INTERESTED_INVESTMENT,IS_INTERESTED_BUILD_CREDIT,IS_INTERESTED_INCREASE_INCOME,IS_INTERESTED_PAY_OFF_DEBT,IS_INTERESTED_MANAGE_SPENDING,IS_INTERESTED_GROW_SAVINGS
0,1,False,False,False,False,False,False
1,2,False,False,False,False,False,False
2,3,False,False,False,False,False,False
3,4,False,True,True,True,True,True
4,5,True,False,True,True,True,False


In [128]:
print("Missing values in bank_transaction dataset:")
print(bank_transaction.isnull().sum())

print("\nMissing values in user_profile dataset:")
print(user_profile.isnull().sum())

Missing values in bank_transaction dataset:
client_id        0
bank_id          0
account_id       0
txn_id           0
txn_date         0
description      0
amount           0
category       257
dtype: int64

Missing values in user_profile dataset:
CLIENT_ID                        0
IS_INTERESTED_INVESTMENT         0
IS_INTERESTED_BUILD_CREDIT       0
IS_INTERESTED_INCREASE_INCOME    0
IS_INTERESTED_PAY_OFF_DEBT       0
IS_INTERESTED_MANAGE_SPENDING    0
IS_INTERESTED_GROW_SAVINGS       0
dtype: int64


In [129]:
# print total number of rows in both bank_transaction and user_profile dataset
print("\nTotal number of rows in bank_transaction dataset: ", bank_transaction.shape[0])
print("Total number of rows in user_profile dataset: ", user_profile.shape[0])


Total number of rows in bank_transaction dataset:  258779
Total number of rows in user_profile dataset:  1000


#### Merging `bank_transaction.csv` with `user_profile.csv`

Incorporating the `user_profile.csv` dataset is useful because:
- Financial behaviour may vary by user interest. For example, users interested in "Grow Savings" may have more deposit transactions, so including this data gives the model personalised behavioural context.
- The model may pick up user-specific spending patterns.

In [130]:
# Convert all column names to lowercase
bank_transaction.columns = bank_transaction.columns.str.lower()
user_profile.columns = user_profile.columns.str.lower()

# merge both datasets on client_id
df = pd.merge(bank_transaction, user_profile, on='client_id', how='inner')

# Display first few rows of merged dataset
display(df.head())

Unnamed: 0,client_id,bank_id,account_id,txn_id,txn_date,description,amount,category,is_interested_investment,is_interested_build_credit,is_interested_increase_income,is_interested_pay_off_debt,is_interested_manage_spending,is_interested_grow_savings
0,1,1,1,4,2023-09-29 00:00:00,Earnin PAYMENT Donatas Danyal,20.0,Loans,False,False,False,False,False,False
1,1,1,1,3,2023-08-14 00:00:00,ONLINE TRANSFER FROM NDonatas DanyalDA O CARSON BUSINESS CHECKING 1216 1216,25.0,Transfer Credit,False,False,False,False,False,False
2,1,1,1,5,2023-09-25 00:00:00,MONEY TRANSFER AUTHORIZED ON 09/25 FROM Earnin CDAEJ_B CA S583269001208168 111,20.0,Loans,False,False,False,False,False,False
3,1,1,2,1,2023-06-02 00:00:00,ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKING 1216 1216,16.0,Transfer Credit,False,False,False,False,False,False
4,1,1,2,2,2023-06-01 00:00:00,ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKING 1216 1216,4.0,Transfer Credit,False,False,False,False,False,False
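
An inner join silently drops transactions whose `client_id` has no profile, and it duplicates rows if a profile key repeats. A minimal sketch of how both could be checked, using toy frames (the values below are invented for illustration); `validate` and `indicator` are standard `pd.merge` parameters:

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are made up for illustration)
transactions = pd.DataFrame({
    "client_id": [1, 1, 2, 4],
    "amount": [20.0, 25.0, 16.0, 4.0],
})
profiles = pd.DataFrame({
    "client_id": [1, 2, 3],
    "is_interested_grow_savings": [False, True, False],
})

# validate="many_to_one" raises MergeError if client_id is duplicated in profiles
checked = pd.merge(
    transactions, profiles,
    on="client_id", how="left",
    validate="many_to_one", indicator=True,
)

# Rows marked "left_only" have no matching profile and would vanish in an inner join
unmatched = checked[checked["_merge"] == "left_only"]
print(f"Transactions without a profile: {len(unmatched)}")

# Keeping only matched rows reproduces the inner join
merged = checked[checked["_merge"] == "both"].drop(columns="_merge")
```

If `unmatched` is empty on the real data, the inner join loses nothing.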


#### Dropping rows with missing labels

Only 257 of 258,779 rows (~0.1%) have missing category labels. With so few rows affected, the gaps are likely random rather than part of a structured evaluation set, and keeping unlabelled rows could introduce noise, so they are dropped.

In [131]:
df = df.dropna(subset=['category'])

# Verify the dataset after removal
print(f"Remaining rows after removing missing categories: {df.shape[0]}")

Remaining rows after removing missing categories: 258522
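
The ~0.1% figure follows directly from the counts reported earlier (257 / 258,779). The sketch below repeats that arithmetic and then, on a toy frame, shows one way to probe the "likely random" assumption: if a handful of clients accounted for most of the gaps, the missingness would not look random. The toy data and the `missing_by_client` name are illustrative only.

```python
import pandas as pd

# Share of missing labels, from the counts reported in the notebook
print(f"{257 / 258_779:.4%}")  # prints 0.0993%

toy = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 3],
    "category": ["Loans", None, "Payroll", "Loans", None],
})

# Overall fraction of rows with a missing label
missing_fraction = toy["category"].isna().mean()

# Missing-label rate per client; a few clients with very high rates would
# suggest the gaps are concentrated rather than random
missing_by_client = toy["category"].isna().groupby(toy["client_id"]).mean()

cleaned = toy.dropna(subset=["category"])
```

This check would have to run before `client_id` is dropped in the next step.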


#### Removing unnecessary features

Some features may not be relevant for predicting transaction categories. `client_id`, `bank_id`, `account_id`, and `txn_id` are raw identifiers that might not generalize to new, unseen users, banks, accounts, or transactions.

Dropping those columns ensures that the model focuses on real transaction-specific features like `description`, `amount`, and `txn_date`, which are more universally useful.

*However, they could still carry useful behavioural signal if used as grouping keys for engineered aggregates (e.g. per-client mean, min, max, and sum of `amount`).*
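
The aggregate idea mentioned above (mean, min, max, sum) could look roughly like this. It is a sketch on toy data, not something this notebook actually does; the `client_amount_*` column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 30.0, 5.0, 10.0, 15.0],
})

# Per-client statistics broadcast back onto every transaction row
grouped = toy.groupby("client_id")["amount"]
for stat in ["mean", "min", "max", "sum"]:
    toy[f"client_amount_{stat}"] = grouped.transform(stat)

# The raw identifier can be dropped once the aggregates are attached
features = toy.drop(columns=["client_id"])
```

If this were adopted, the statistics should probably be computed on the training split only, so that validation rows do not leak into the features.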

In [132]:
# Dropping client_id, bank_id, account_id, and txn_id columns
df = df.drop(columns=['client_id', 'bank_id', 'account_id', 'txn_id'])

# Move the target column to the last
df = df[[col for col in df if col != 'category'] + ['category']]

# Display first few rows of the dataset
display(df.head())

Unnamed: 0,txn_date,description,amount,is_interested_investment,is_interested_build_credit,is_interested_increase_income,is_interested_pay_off_debt,is_interested_manage_spending,is_interested_grow_savings,category
0,2023-09-29 00:00:00,Earnin PAYMENT Donatas Danyal,20.0,False,False,False,False,False,False,Loans
1,2023-08-14 00:00:00,ONLINE TRANSFER FROM NDonatas DanyalDA O CARSON BUSINESS CHECKING 1216 1216,25.0,False,False,False,False,False,False,Transfer Credit
2,2023-09-25 00:00:00,MONEY TRANSFER AUTHORIZED ON 09/25 FROM Earnin CDAEJ_B CA S583269001208168 111,20.0,False,False,False,False,False,False,Loans
3,2023-06-02 00:00:00,ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKING 1216 1216,16.0,False,False,False,False,False,False,Transfer Credit
4,2023-06-01 00:00:00,ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKING 1216 1216,4.0,False,False,False,False,False,False,Transfer Credit


### 2. Data Encoding and Normalisation

#### One-Hot Encoding for Categorical Data

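Two groups of columns are encoded:

1. User interest columns from `user_profile.csv`: these are already boolean, so they are simply cast to 0/1.
2. Transaction category (`category`) from `bank_transaction.csv`: one-hot encoded into `category_*` columns.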

In [133]:
# Cast boolean user interest columns to binary 0/1
user_interest_cols = [
    'is_interested_investment', 'is_interested_build_credit',
    'is_interested_increase_income', 'is_interested_pay_off_debt',
    'is_interested_manage_spending', 'is_interested_grow_savings'
]
df[user_interest_cols] = df[user_interest_cols].astype(int)

# One-hot encode the target variable (category)
df = pd.get_dummies(df, columns=['category'], prefix='category')

category_cols = [col for col in df.columns if col.startswith("category_")]
df[category_cols] = df[category_cols].astype(int)

# Display updated dataset
display(df.head())

Unnamed: 0,txn_date,description,amount,is_interested_investment,is_interested_build_credit,is_interested_increase_income,is_interested_pay_off_debt,is_interested_manage_spending,is_interested_grow_savings,category_ATM,category_Arts and Entertainment,category_Bank Fee,category_Bank Fees,category_Check Deposit,category_Clothing and Accessories,category_Convenience Stores,category_Department Stores,category_Digital Entertainment,category_Food and Beverage Services,category_Gas Stations,category_Gyms and Fitness Centers,category_Healthcare,category_Insurance,category_Interest,category_Internal Account Transfer,category_Loans,category_Payment,category_Payroll,category_Restaurants,category_Service,category_Shops,category_Supermarkets and Groceries,category_Tax Refund,category_Telecommunication Services,category_Third Party,category_Transfer,category_Transfer Credit,category_Transfer Debit,category_Transfer Deposit,category_Travel,category_Uncategorized,category_Utilities
0,2023-09-29 00:00:00,Earnin PAYMENT Donatas Danyal,20.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2023-08-14 00:00:00,ONLINE TRANSFER FROM NDonatas DanyalDA O CARSON BUSINESS CHECKING 1216 1216,25.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
2,2023-09-25 00:00:00,MONEY TRANSFER AUTHORIZED ON 09/25 FROM Earnin CDAEJ_B CA S583269001208168 111,20.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,2023-06-02 00:00:00,ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKING 1216 1216,16.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
4,2023-06-01 00:00:00,ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKING 1216 1216,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
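
The encoded columns above include both `category_Bank Fee` and `category_Bank Fees`, which look like two spellings of the same label. Whether they should be merged is an assumption to confirm against the data. If so, a sketch like the following could normalise the labels before calling `pd.get_dummies` (the `LABEL_FIXES` mapping is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"category": ["Bank Fee", "Bank Fees", "Loans", "Bank Fees"]})

# Hypothetical mapping from variant spellings to one canonical label
LABEL_FIXES = {"Bank Fee": "Bank Fees"}
toy["category"] = toy["category"].replace(LABEL_FIXES)

# One-hot encode after the labels are unified
encoded = pd.get_dummies(toy, columns=["category"], prefix="category").astype(int)
print(encoded.columns.tolist())
```

Merging the two would reduce the number of target classes by one, so anything downstream that depends on the class count would need to be recomputed.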


#### Encoding `txn_date`

In [134]:
# Convert txn_date to datetime format (if not already in datetime)
df['txn_date'] = pd.to_datetime(df['txn_date'], errors='coerce')

# Extract time-based features
df['day_of_week'] = df['txn_date'].dt.dayofweek  # Monday=0, Sunday=6
df['day_of_month'] = df['txn_date'].dt.day  # 1-31
df['hour'] = df['txn_date'].dt.hour  # Extract hour from transaction time (0-23)
df['is_weekend'] = df['day_of_week'].apply(lambda x: 1 if x >= 5 else 0)  # 1=Weekend, 0=Weekday

# Drop original txn_date column
df = df.drop(columns=['txn_date'])

# Display updated dataset
display(df.head())

Unnamed: 0,description,amount,is_interested_investment,is_interested_build_credit,is_interested_increase_income,is_interested_pay_off_debt,is_interested_manage_spending,is_interested_grow_savings,category_ATM,category_Arts and Entertainment,category_Bank Fee,category_Bank Fees,category_Check Deposit,category_Clothing and Accessories,category_Convenience Stores,category_Department Stores,category_Digital Entertainment,category_Food and Beverage Services,category_Gas Stations,category_Gyms and Fitness Centers,category_Healthcare,category_Insurance,category_Interest,category_Internal Account Transfer,category_Loans,category_Payment,category_Payroll,category_Restaurants,category_Service,category_Shops,category_Supermarkets and Groceries,category_Tax Refund,category_Telecommunication Services,category_Third Party,category_Transfer,category_Transfer Credit,category_Transfer Debit,category_Transfer Deposit,category_Travel,category_Uncategorized,category_Utilities,day_of_week,day_of_month,hour,is_weekend
0,Earnin PAYMENT Donatas Danyal,20.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,29,0,0
1,ONLINE TRANSFER FROM NDonatas DanyalDA O CARSON BUSINESS CHECKING 1216 1216,25.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,14,0,0
2,MONEY TRANSFER AUTHORIZED ON 09/25 FROM Earnin CDAEJ_B CA S583269001208168 111,20.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25,0,0
3,ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKING 1216 1216,16.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,4,2,0,0
4,ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKING 1216 1216,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,3,1,0,0
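
Because `errors='coerce'` turns unparseable dates into `NaT` instead of raising, it may be worth counting how many rows were affected before the derived features are trusted. A small sketch on toy values (the strings are invented), which also shows a vectorised way to build the weekend flag:

```python
import pandas as pd

raw = pd.Series(["2023-09-29 00:00:00", "not a date", "2023-06-03 00:00:00"])
parsed = pd.to_datetime(raw, errors="coerce")

# Unparseable values become NaT instead of raising
n_bad = parsed.isna().sum()

# Vectorised weekend flag: Saturday=5, Sunday=6
is_weekend = (parsed.dt.dayofweek >= 5).astype(int)
```

Note that a `NaT` row silently gets `is_weekend = 0` here, which is another reason to check `n_bad` first. The five sample rows shown above all have `hour` equal to 0, so it may also be worth checking whether `hour` varies enough to be informative.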


#### Normalizing `amount` and Date Features Using Z-Score

In [None]:
from sklearn.preprocessing import StandardScaler
import pickle
import os

# Initialize Standard Scaler
scaler = StandardScaler()

# Select numerical features to scale
numerical_features = ['amount', 'day_of_week', 'day_of_month', 'hour']

# Fit the scaler on the full dataset and transform it in place
df[numerical_features] = scaler.fit_transform(df[numerical_features])

# Save the trained scaler to a file for use in training & inference
SCALER_PATH = "../models/scaler/scaler.pkl"
os.makedirs(os.path.dirname(SCALER_PATH), exist_ok=True)
with open(SCALER_PATH, "wb") as f:
    pickle.dump(scaler, f)

# Check the scaled values
print(df[numerical_features].describe())

             amount   day_of_week  day_of_month          hour
count  2.585220e+05  2.585220e+05  2.585220e+05  2.585220e+05
mean   1.231319e-17  2.176797e-17  1.594119e-16 -2.094342e-17
std    1.000002e+00  1.000002e+00  1.000002e+00  1.000002e+00
min   -1.129187e+02 -1.240304e+00 -1.703148e+00 -3.006417e-01
25%   -1.053363e-01 -1.240304e+00 -9.137470e-01 -3.006417e-01
50%   -5.457531e-02 -1.206484e-01 -1.157457e-02 -3.006417e-01
75%   -6.771250e-03  9.990077e-01  8.905979e-01 -3.006417e-01
max    1.157558e+02  2.118664e+00  1.679999e+00  4.479723e+00
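
For reference, the z-score applied by `StandardScaler` is `(x - mean) / std`, using the population standard deviation. A minimal NumPy sketch on toy amounts (the numbers are illustrative, not taken from the dataset):

```python
import numpy as np

amounts = np.array([4.0, 16.0, 20.0, 20.0, 25.0])

# Z-score: subtract the mean, divide by the population standard deviation
# (ddof=0), which is the convention StandardScaler uses
mean = amounts.mean()
std = amounts.std(ddof=0)
z = (amounts - mean) / std

# A new value at inference time must reuse the stored mean/std, not its own
new_amount = 30.0
z_new = (new_amount - mean) / std
```

This also explains the `std` of 1.000002 in the summary above: `describe()` reports the sample standard deviation (ddof=1), which is slightly larger than the population value that the scaler normalised to exactly 1.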


#### Check for class imbalance

In [136]:
# Count the number of instances for each category
category_counts = df.filter(like="category_").sum().sort_values(ascending=False)

# Compute percentage distribution
total_samples = category_counts.sum()
category_percentages = (category_counts / total_samples) * 100

# Print class distribution in percentage
print("\nCategory Percentage Distribution:")
print(category_percentages)

# Get all one-hot encoded category columns
category_cols = [col for col in df.columns if col.startswith("category_")]

# Count the number of unique transaction categories
num_classes = len(category_cols)

print(f"Total number of unique transaction categories: {num_classes}")


Category Percentage Distribution:
category_Uncategorized                 11.369245
category_Third Party                   11.106985
category_Restaurants                   10.199132
category_Transfer Credit                8.340103
category_Loans                          7.583494
category_Convenience Stores             7.206350
category_Supermarkets and Groceries     6.479139
category_Transfer Debit                 5.846311
category_Gas Stations                   4.997254
category_Internal Account Transfer      4.635195
category_Payroll                        3.133196
category_Shops                          2.869388
category_Bank Fees                      2.487989
category_Transfer                       2.427260
category_ATM                            2.194011
category_Transfer Deposit               1.924788
category_Digital Entertainment          1.750335
category_Utilities                      1.592901
category_Clothing and Accessories       1.233938
category_Department Stores        

Based on the category percentage distribution, the dataset is imbalanced: a few categories each account for over 10% of transactions, while others fall below 2%.

To deal with class imbalance, use class weighting to penalize mistakes on minority classes more than mistakes on majority classes. Without class weights, every sample contributes equally to the loss, so the majority classes dominate training and the model becomes biased toward them. This is a problem when the dataset is highly imbalanced.

In [137]:
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

category_cols = [col for col in df.columns if col.startswith("category_")]

# Compute class weights (inverse of class frequency)
class_weights = compute_class_weight(
    class_weight="balanced",  # Assigns higher weights to minority classes
    classes=np.arange(len(category_cols)),  # Class indices
    y=df[category_cols].values.argmax(axis=1)  # Convert one-hot to class index
)

# Convert to dictionary format
class_weights_dict = {i: weight for i, weight in enumerate(class_weights)}

# Save the class weights to a file for use in training
CLASS_WEIGHTS_PATH = "../models/class_weights/class_weights.pkl"
os.makedirs(os.path.dirname(CLASS_WEIGHTS_PATH), exist_ok=True)
with open(CLASS_WEIGHTS_PATH, "wb") as f:
    pickle.dump(class_weights_dict, f)
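
For intuition, scikit-learn's `"balanced"` mode computes each weight as `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that formula with NumPy on toy labels (the label array is invented for illustration):

```python
import numpy as np

# Toy class indices: class 0 is the majority, class 2 the minority
y = np.array([0, 0, 0, 0, 1, 1, 2])

counts = np.bincount(y)  # samples per class: 4, 2, 1
n_samples, n_classes = len(y), len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_per_class)
weights = n_samples / (n_classes * counts)
class_weights = dict(enumerate(weights))
```

The rarest class gets the largest weight, and a class that makes up exactly `1 / n_classes` of the data would get a weight of 1.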

### 3. Text Preprocessing (Transaction Description)

#### Text Cleaning, Normalization & NER

The following are the text cleaning rules that I can think of:
- Lowercasing: Convert all text to lowercase.
- Removing punctuation & special characters: Strip unnecessary punctuation (except when part of a business name).
- Removing extra spaces: Ensure clean, space-separated tokens.
- Lemmatization: Convert words to their base form (e.g., "running" → "run").
- Removing stopwords: Remove common words like "the," "is," "on" that don't contribute meaning.
- Removing numbers (unless part of a business name): Numbers alone are removed, but business names with numbers are retained.
- Removing natural person names (PERSON from NER): Personal names carry no information about the transaction category.
- Keeping business names (ORG from NER): Business names are valuable in transaction classification and are preserved.
- Removing locations & dates (GPE, DATE from NER): Location and date details are unnecessary and are removed.
- Removing identifiers or serial numbers: Example: "84DD466B7C25425" and "8445052993ca" are removed.

*The cleaning cell below takes a while to run (about 21 minutes for 258,522 rows in the recorded run) because it scans and cleans every row. I commented out the call since there's already a copy of `preprocessed_bank_transaction.csv` under `/dataset`.*

>IMPORTANT: Make sure you have `en_core_web_lg` installed on your machine first. If not, run `python -m spacy download en_core_web_lg`.
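
Several of the rules listed above (lowercasing, stripping dates, punctuation, standalone numbers, and extra spaces) need only regular expressions. The sketch below covers just those rules, without spaCy, NER, lemmatization, or stopword removal; `basic_clean` and its patterns are illustrative and simpler than the full cell that follows.

```python
import re

DATE_PATTERN = re.compile(r"\b\d{1,2}[-/]\d{1,2}(?:[-/]\d{2,4})?\b")
NON_ALNUM = re.compile(r"[^a-z0-9\s-]")
# Digits not attached to letters or hyphens, so "7-eleven" keeps its "7"
STANDALONE_DIGITS = re.compile(r"(?<![\w-])\d+(?![\w-])")
REPEATED_SPACES = re.compile(r"\s+")

def basic_clean(text: str) -> str:
    """Apply only the regex-based rules: lowercase, drop dates, punctuation, bare numbers, extra spaces."""
    text = text.lower()
    text = DATE_PATTERN.sub(" ", text)
    text = NON_ALNUM.sub(" ", text)
    text = STANDALONE_DIGITS.sub(" ", text)
    return REPEATED_SPACES.sub(" ", text).strip()

print(basic_clean("7-ELEVEN 08/21 #3168 PURCHASE 7-ELEVEN BALTIMORE MD"))
# prints: 7-eleven purchase 7-eleven baltimore md
```

Location tokens such as "baltimore md" survive here, which is the gap the NER and removed-terms steps are meant to close.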


In [None]:
import spacy
import pandas as pd
import re
import subprocess
import sys
from tqdm import tqdm

# Check if the SpaCy model is installed
if not spacy.util.is_package("en_core_web_lg"):
    print("Downloading 'en_core_web_lg' model...")
    subprocess.run([sys.executable, "-m", "spacy", "download", "en_core_web_lg"])
else:
    print("SpaCy model 'en_core_web_lg' is already installed.")
    
# Load SpaCy model
nlp = spacy.load("en_core_web_lg")

# Enable tqdm with Pandas for progress bars
tqdm.pandas()

# Dictionary to expand common abbreviations in the text
ABBR_DICT = {
    'ckg': 'checking', 'chk': 'check', 'dep': 'deposit', 'trns': 'transfer',
    'adv': 'advance', 'w/d': 'withdrawal', 'wd': 'withdrawal', 'xfer': 'transfer',
    'pmt': 'payment', 'txn': 'transaction', 'int': 'interest', 'intl': 'international',
    'intr': 'interest', 'chg': 'charge', 'pos': 'point of sale',
    'purch': 'purchase', 'atm': 'cash machine', 'atw': 'cash machine',
    'cd': 'certificate of deposit', 'cc': 'credit card', 'dc': 'debit card',
    'bal': 'balance', 'adj': 'adjustment', 'adjmt': 'adjustment', 'apmt': 'automatic payment',
    'av': 'available', 'bk': 'bank', 'bkcard': 'bank card',
    'bkchg': 'bank charge', 'bkfee': 'bank fee', 'bkln': 'bank loan',
    'bkstmt': 'bank statement', 'bktrns': 'bank transfer', 'bkwd': 'bank withdrawal',
    'blnc': 'balance', 'bnk': 'bank', 'bnkchg': 'bank charge', 'n': "and", 'tx': 'transaction', 
    'cb': 'chase bank', 'trsf': 'transfer', 'ref': 'reference', 'pymt': 'payment', 'pymnt': 'payment', 
    'pmnt': 'payment', 'pw': '', 'ml': '', 'rcvd': 'received', 'dbt': 'debit', 'crd': 'card',
    'mar': 'mart', 'stor': 'store', 'sup': 'supermarket'
}

# Set of terms to remove from the text
REMOVED_TERMS = {
    'ak', 'al', 'ar', 'az', 'ca', 'co', 'ct', 'dc', 'de', 'fl', 'ga', 'hi', 'ia', 
    'id', 'il', 'in', 'ks', 'ky', 'la', 'ma', 'md', 'me', 'mi', 'mn', 'mo', 'ms', 
    'mt', 'nc', 'nd', 'ne', 'nh', 'nj', 'nm', 'nv', 'ny', 'oh', 'ok', 'or', 'pa', 
    'ri', 'sc', 'sd', 'tn', 'tx', 'ut', 'va', 'vt', 'wa', 'wi', 'wv', 'wy', 'rd',
    'date', 'card'
}

# Set of terms to keep in the text (e.g., specific company names)
KEPT_TERMS = {
    '7-eleven', '7eleven', '7 eleven', 'walmart', 'circle k', 'target', 'costco', 'sams club'
}

# Regex patterns for identifying dates, digits, colons/slashes, special characters, and repeated spaces
DATE_PATTERN = re.compile(r'\b(?:\d{1,2}[-/]\d{1,2}(?:[-/]\d{2,4})?|\d{4}[-/]\d{1,2}[-/]\d{1,2})\b')
DIGITS_PATTERN = re.compile(r'\d+')
COLON_SLASH_PATTERN = re.compile(r'[:/]')
REPEATED_SPACES = re.compile(r'\s+')

def is_interleaved_alphanumeric(text):
    """Check if text has interleaved letters and numbers"""
    is_digit_prev = text[0].isdigit()
    transitions = 0
    for char in text[1:]:
        is_digit_curr = char.isdigit()
        if is_digit_curr != is_digit_prev:
            transitions += 1
        is_digit_prev = is_digit_curr
    return transitions > 2

def extract_potential_entity(text):
    """Extract letters from alphanumeric text if clearly separated"""
    if is_interleaved_alphanumeric(text):
        return None
    return DIGITS_PATTERN.sub('', text).strip()

def clean_normalize_text(text):
    """Clean and normalize text by expanding abbreviations, removing unwanted terms, and processing with SpaCy."""
    text = text.lower()
    
    # Check for kept terms before any processing
    for kept_term in KEPT_TERMS:
        if kept_term in text:
            return kept_term
        
    words = text.split()
    expanded_words = [ABBR_DICT.get(word.lower(), word) for word in words]
    text = ' '.join(expanded_words)

    # Remove special characters
    text = re.sub(r'[^a-zA-Z0-9\s-]', ' ', text)
    
    # Process text with SpaCy to tokenize and analyze entities
    doc = nlp(text)
    
    cleaned_tokens = []
    for token in doc:
        word = token.text.lower()
        
        # Remove entities: person names, locations, dates
        if token.ent_type_ in ['PERSON', 'GPE', 'DATE']:
            continue
        
        # Remove date patterns
        if DATE_PATTERN.search(word):
            continue

        # Remove words containing ":" or "/"
        if COLON_SLASH_PATTERN.search(word):
            continue
        
        # Skip removed terms (state abbreviations and a few filler words)
        if word in REMOVED_TERMS:
            continue
        
        # Keep organization names
        if token.ent_type_ == 'ORG':
            cleaned_tokens.append(token.text)
            continue
        
        # Handle alphanumeric words
        if any(c.isdigit() for c in word) and any(c.isalpha() for c in word):
            entity_name = extract_potential_entity(word)
            if entity_name:
                cleaned_tokens.append(entity_name.lower())
            continue
            
        # Skip punctuation, stopwords, numbers, and short words
        if (not token.is_punct and 
            not token.is_stop and 
            not token.like_num and 
            len(word) > 1):
            cleaned_tokens.append(token.lemma_)
            
    # Join tokens and clean up spaces
    result = ' '.join(cleaned_tokens)
    result = REPEATED_SPACES.sub(' ', result).strip()
    
    return result

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# df['processed_description'] = df['description'].progress_apply(clean_normalize_text)

# # Display the first few rows
# display(df[['description', 'processed_description']])


100%|██████████| 258522/258522 [21:00<00:00, 205.15it/s]


Unnamed: 0,description,processed_description
0,Earnin PAYMENT Donatas Danyal,earnin payment donatas danyal
1,ONLINE TRANSFER FROM NDonatas DanyalDA O CARSON BUSINESS CHECKING 1216 1216,online transfer ndonatas danyalda o carson business check
2,MONEY TRANSFER AUTHORIZED ON 09/25 FROM Earnin CDAEJ_B CA S583269001208168 111,money transfer authorize earnin cdaej s
3,ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKING 1216 1216,online transfer everyday check
4,ONLINE TRANSFER FROM CARSON N EVERYDAY CHECKING 1216 1216,online transfer everyday check
...,...,...
258774,CHECK111,check
258775,CHECK111,check
258776,7-ELEVEN 08/21 #3168 PURCHASE 7-ELEVEN BALTIMORE MD,7-eleven
258777,7-ELEVEN 07/10 3168 PURCHASE 7-ELEVEN BALTIMORE MD,7-eleven


However, because of how spaCy tokenises text and the limits of its NER model on short, noisy transaction strings, the cleaning results are imperfect. For example, it confuses business names with personal names (in row 0 above, the personal name "Donatas Danyal" survives cleaning) and misinterprets certain words.

#### Text Feature Extraction & Embeddings

Training a custom FastText model learns embeddings directly from the actual transaction descriptions:
- It learns transaction-specific vocabulary, including relationships between vendor names, abbreviations, and industry-specific terms.
- It handles noisy transaction text better, since FastText builds word vectors from character n-grams.
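
The robustness to noisy text comes from FastText representing each word through its character n-grams, wrapped in `<` and `>` boundary markers. The pure-Python sketch below approximates that decomposition with gensim's default n-gram lengths (3 to 6); `char_ngrams` is an illustrative helper, not a gensim function.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return the character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

# Two abbreviations of "payment" share several n-grams, so their vectors
# end up related even if one of them is rare in the corpus
shared = set(char_ngrams("pymt")) & set(char_ngrams("pymnt"))
print(sorted(shared))
```

This sharing is what lets variants of the same vendor name or abbreviation end up near each other in the embedding space.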

In [None]:
# pip install gensim
SEED = 42
import gensim
from gensim.models import FastText
gensim.utils.random.seed(SEED)

# Tokenize descriptions
df['tokenized_description'] = df['processed_description'].apply(lambda x: str(x).split() if pd.notna(x) else [])

# Train FastText model on your transaction descriptions
fasttext_model = FastText(
    sentences=df['tokenized_description'].tolist(), # Use preprocessed tokenised text as input for training
    vector_size=100, # Each word will be converted into a vector of size 100
    window=5, # FastText looks at 5 words before and after the target word
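    min_count=1,  # Keep every token, including ones that appear only once
    workers=1,  # Single worker thread, needed for reproducible training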
    sg=1  # Use Skip-gram (better for capturing rare words)
)

# Save model for future use (gensim's native format; reload with FastText.load)
os.makedirs("../models/fasttext", exist_ok=True)
fasttext_model.save("../models/fasttext/fasttext_model.bin")

Convert each transaction description into a numerical vector using the custom FastText model:

In [123]:
def get_embedding(text):
    # Ensure text is a string and handle missing values
    if pd.isna(text) or not isinstance(text, str) or text.strip() == "":
        return np.zeros(100)  # Return zero vector for empty/missing descriptions

    words = text.split()
    word_vectors = [fasttext_model.wv[word] for word in words if word in fasttext_model.wv]

    if len(word_vectors) == 0:
        return np.zeros(100)  # Return zero vector if no words exist in FastText vocabulary

    return np.mean(word_vectors, axis=0)  # Take mean to get sentence-level embedding

df['fasttext_embedding'] = df['processed_description'].apply(get_embedding)
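
The averaging step can be seen in isolation with a few hand-made vectors. The sketch below uses a plain dictionary as a stand-in for the FastText lookup; the 3-dimensional vectors and the `mean_pool` name are hypothetical.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for the FastText lookups
toy_vectors = {
    "online": np.array([1.0, 0.0, 2.0]),
    "transfer": np.array([3.0, 2.0, 0.0]),
}

def mean_pool(text: str, vectors: dict, dim: int = 3) -> np.ndarray:
    """Average the vectors of known words; fall back to zeros when none are known."""
    found = [vectors[w] for w in text.split() if w in vectors]
    return np.mean(found, axis=0) if found else np.zeros(dim)

sentence_vec = mean_pool("online transfer", toy_vectors)
empty_vec = mean_pool("", toy_vectors)
```

Mean pooling ignores word order, so "online transfer" and "transfer online" map to the same vector. That is usually acceptable for short transaction strings, but it is a design trade-off to keep in mind.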

Split each 100-dimensional embedding vector into 100 separate feature columns:

In [124]:
# Convert FastText embedding list into separate columns
embedding_size = 100  # Vector size from FastText
embedding_cols = [f'fasttext_{i}' for i in range(embedding_size)]

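# Convert each vector into multiple columns, reusing df's index so rows stay aligned
# (df's index has gaps after dropna, so a fresh RangeIndex would misalign in concat)
embedding_df = pd.DataFrame(df['fasttext_embedding'].to_list(), columns=embedding_cols, index=df.index)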

# Merge back with the main DataFrame
df = pd.concat([df, embedding_df], axis=1)

# Drop the original embedding column since it's now expanded
df = df.drop(columns=['fasttext_embedding'])
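
One thing to watch with a column-wise `pd.concat` is that it aligns rows by index label, not by position. After `dropna` without `reset_index`, a frame's index has gaps, so a new frame built from a plain list gets a fresh `RangeIndex` that will not line up. A toy demonstration (all values invented):

```python
import numpy as np
import pandas as pd

# Frame whose index has a gap, as happens after dropna() without reset_index()
toy = pd.DataFrame({"amount": [1.0, 2.0, 3.0]}, index=[0, 2, 3])
vectors = [np.array([0.1, 0.2]), np.array([0.3, 0.4]), np.array([0.5, 0.6])]

# Default RangeIndex (0, 1, 2) does not line up with toy's index (0, 2, 3)
misaligned = pd.concat([toy, pd.DataFrame(vectors, columns=["e0", "e1"])], axis=1)

# Reusing toy's index keeps each vector on its own row
aligned = pd.concat(
    [toy, pd.DataFrame(vectors, columns=["e0", "e1"], index=toy.index)], axis=1
)
```

A quick sanity check after the real concat is to confirm that the row count is unchanged and that the embedding columns contain no `NaN`.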


Final column rearrangement and cleaning

In [125]:
# Drop the original and processed description columns
df = df.drop(columns=['description', 'processed_description', 'tokenized_description'])

# Move the columns prefixed with category_ to the end
category_cols = [col for col in df.columns if col.startswith("category_")]
df = df[[col for col in df if col not in category_cols] + category_cols]

Saving the final preprocessed dataset

In [None]:
# Commented this out as there's already a copy in `/dataset`
# df.to_csv("../dataset/preprocessed_bank_transaction.csv", index=False)