# Machine Learning Tutorial: Simple Bank/Credit Card Transaction Classification Tool

This is my first toy project after spending a few COVID months getting a baseline understanding of machine learning using Python, scikit-learn, and pandas.

It is a very simple credit card and banking transaction classification tool.

I'm sharing it here:
1. To encourage others to dive in to machine learning, because libraries like scikit-learn, pandas, and others make ML more accessible (and fun) than ever before, especially to scrubs like me, and
1. To gather feedback from SMEs and continously improve my approach, sharing those learnings with others

### TL;DR
My wife and I keep monthly budget sheets which include transactions to/from our bank account and various credit cards, and we use these to track spending across several categories so that we can see where our funds are going each month, and where we should cut/continue/invest.

As you can imagine, it is tedious to label each individual transaction, so I gathered ~a year's worth of labeled data and used this notebook to train an ML model which could auto-classify transactions for me, regardless of source account (bank or credit cards).

Now I can compute, analyze, and weep over our budget sheets in a matter of seconds every month (Great Success).

### Details
We use monthly budget sheets to track our finances by category of spending (e.g. groceries, gas, mortgage, dining out, etc.). 

Previously, we did this by sitting down at the start of the month and manually labeling transactions from the last month of spending. 

This was tedious, and while some credit card/banking services offer this kind of "tagging" already, they all use different categories, and not always the same ones that would be most valuable to us.

We wanted something more consistent, so we went through the tedium for a while, collecting all of the labeled data for about 13 months.

The below Jupyter notebook - split into respective "Pipelines" of work - will walk through the process of:

1. Training a classifier from the year's worth of manually-labeled data
1. Training a classifer from a continously-updated "master" set of data for future relearning
1. Optimizing classifier parameters
1. Classifying new, unlabeled data using the classifier
1. Saving newly-labeled data to the master set, where it can be manually tweaked if inaccurate so that the classifier can be retrained

The implementation details will be discussed throughout the notebook.

### Important Notes and Sharing Feedback
* You will see certain data marked as "REDACTED" throughout the notebook - this is just data which contains location or PII about me or my personal accounts that I chose to omit from the project; It does not impact operations, but there are certain places where - if you choose to play with this notebook - you will have to tailor the notebook to your specific bank/credit card account data. I will highlight these.
* I'm still an ML noob, so please share constructive feedback with me @jeFF0Falltrades on Twitter, or feel free to leave an Issue/PR on this GitHub.

Thanks!

---

## Imports

In [1]:
from enum import Enum
from glob import glob
from joblib import dump, load
from numpy import logspace
from pandas import CategoricalDtype, concat, DataFrame, get_dummies, read_csv
from pprint import pprint
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from re import compile


# Used for auto-formatting in Jupyter Lab
# Comment the line below to use lab_black in a Jupyter Notebook outside of Jupyter Lab
%load_ext lab_black

# Uncomment the line below for use in a Jupyter Notebook outside of Jupyter Lab
# %load_ext nb_black

## Persistent Constants and Functions

Below are constants and functions which will be referenced throughout the rest of the notebook.

**Note** that you will have to change these values to match your account information and file structures.

In [2]:
# Path to labeled and unlabeled CSVs
LABELED_CSV_PATH = "E:\\Budget\\labeled_data\\*.csv"
UNLABELED_CSV_PATH = "E:\\Budget\\unlabeled_data\\*.csv"

# Path to the master data set CSV
MASTER_CSV_PATH = "master_data.csv"

# Path to the saved optimal classifier
MODEL_FILE_NAME = "selected_model.joblib"

# Enums used to track the different type of accounts where transactions will be sourced from
# You can add/modify these as necessary
class ACCOUNTS(Enum):
    BANK = "REDACTED_BANK_ACCT"
    CC_1 = "REDACTED_CC_ACCT"
    CC_2 = "REDACTED_CC_ACCT_2"


# Increments of charges to include as encoded features
# Feel free to change the increments; These worked well for me
CHARGE_INCREMENTS = [20, 50, 100, 500, 1000]
CHARGE_UNDER = "charge_under_{}"
CHARGE_OVER = "charge_over_{}"
CHARGE_COLS = []
for incr in CHARGE_INCREMENTS:
    CHARGE_COLS.append(CHARGE_UNDER.format(incr))
CHARGE_COLS.append(CHARGE_OVER.format(CHARGE_INCREMENTS[-1]))

# All expected feature columns;
# We will use this to preserve ordering as our saved model will require the
# feature columns to be in the same order between runs
FEAT_COLS = ["description"] + CHARGE_COLS + ["is_credit"]

# Characters to ignore (exclude) from transaction descriptions and charge values, respectively
IGNORED_TXN_CHARS = compile(r"[^a-z0-9 ]")
IGNORED_CHARGE_CHARS = compile(r"[-() ]")

# Terms to ignore from all transactins (e.g. payments from one account
# to another, which aren't really "transactions" to be tracked)
IGNORED_TXN_TERMS = [
    "REDACTED_CC_ACCT EPAYMENT",
    "REDACTED_CC_ACCT_2 E-PAYMENT",
    "MOBILE PAYMENT - THANK YOU",
    "INTERNET PAYMENT - THANK YOU",
    "TRANSFER",
]
IGNORED_TXN_PATT = compile("|".join(IGNORED_TXN_TERMS))

# Delimiter to use with input CSVs and output CSVs;
# I tend to use ';' b/c some transaction descriptions contain a comma
# and charges will too, depending on currency format
CSV_DELIM = ";"

**Note** that you may have to tweak these functions to successfully parse your own institutions' CSV formats.

In [17]:
# Checks if the given charge is a credit or debit from the account
def check_if_credit(account, charge):
    # Cast to string as a safety for the substitution op below
    charge = str(charge)
    # My credit card accounts use the format "-$12.00" for credits
    if account.lower() != ACCOUNTS.BANK.value and "-" in charge:
        return 1
    # My bank account uses the format "($12.00)" for debits and "$12.00" for credits
    if account.lower() == ACCOUNTS.BANK.value and "(" not in charge:
        return 1
    return 0


# Performs label encoding for charges based on the increments provided as constants above
def encode_charges(charge):
    # Cast to string as a safety for the substitution op below
    charge = str(charge)
    # Remove ignored chars and replace any commas with periods to cast to float
    charge = float(IGNORED_CHARGE_CHARS.sub("", charge).replace(",", "."))
    for incr in CHARGE_INCREMENTS:
        if charge <= incr:
            return CHARGE_UNDER.format(incr)  # e.g. "charge_under_20"
    return CHARGE_OVER.format(CHARGE_INCREMENTS[-1])  # e.g. "charge_over_1000"


# Preprocesses description texts by removing ignored characters
def clean_text(text):
    return IGNORED_TXN_CHARS.sub("", text.lower())


# Combines data from each transaction CSV into a consolidated DataFrame
# You may have to modify this to account for differences among your accounts
def dataframe_from_csvs(path, delim=",", labeled=False):
    frames_to_combine = []
    for csv_file in glob(path):
        df = read_csv(csv_file, delimiter=delim)
        # If the data is not labeled, we need to describe how to parse the CSV
        if not labeled:
            # My credit card accounts both already have these present
            if "Description" in df and "Amount" in df:
                # The presence of the "Card Member" column differentiates the 2 CC accounts
                if "Card Member" in df:
                    df["Account"] = ACCOUNTS.CC_1.value
                else:
                    df["Account"] = ACCOUNTS.CC_2.value
                frames_to_combine.append(df[["Account", "Description", "Amount"]])
            # My bank account, on the other hand, has a unique "*Beginning Balance*" column
            elif "*Beginning Balance*" in df:
                new_df = DataFrame()
                new_df["Description"] = df["*Beginning Balance*"][:-1]
                new_df["Amount"] = df.iloc[:, 3][:-1]
                new_df["Account"] = ACCOUNTS.BANK.value
                frames_to_combine.append(new_df)
        # If the data is already labeled, we only need to combine it to the other frames
        else:
            frames_to_combine.append(df)
    df = concat(frames_to_combine, axis=0, ignore_index=True)
    # Remove ignored transactions using regex from constants above
    df = df[~df["Description"].str.contains(IGNORED_TXN_PATT)]
    # Reset the index (row identifier) as if we preserve the random indices of
    # the separate data sets, it can cause problems during preprocessing
    return df.reset_index(drop=True)


# Runs preprocessing to derive input features (X) from transaction data
def derive_X(df, acct_col_label, desc_col_label, charge_col_label):
    X_data = DataFrame()
    # Normalize description texts
    X_data["description"] = raw_df.apply(
        lambda row: clean_text(row[desc_col_label]), axis=1
    )
    # Label-encode charges using increments above
    X_data["charges"] = raw_df.apply(
        lambda row: encode_charges(row[charge_col_label]), axis=1
    )
    # If our data is missing any of the charge categories, this op will add
    # the missing column and set it to 0 for all rows, ensuring consistency
    X_data["charges"] = X_data["charges"].astype(
        CategoricalDtype(categories=CHARGE_COLS)
    )
    # One-hot encode the label-encoded charges to split them out into
    # 1 column / 1 charge increment.
    # get_dummies() will do this for us easily:
    # https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
    one_hot_charges = get_dummies(
        X_data[["charges"]], prefix="", prefix_sep="", columns=["charges"]
    )
    # Drop the label-encoded charges, and join the one-hot encoded charges
    X_data = X_data.drop("charges", axis=1).join(one_hot_charges)
    # Capture if transaction is a credit or debit
    X_data["is_credit"] = raw_df.apply(
        lambda row: check_if_credit(row[acct_col_label], row[charge_col_label]), axis=1
    )
    # This line ensures ordering is preserved as our saved model will expect
    # the same column order each run
    X_data = X_data[FEAT_COLS]
    return X_data


# Add data from a frame to the master data set for later retraining
def add_data_to_master_set(df):
    # Only write headers if master CSV already exists
    write_header = not glob(MASTER_CSV_PATH)
    df.to_csv(
        MASTER_CSV_PATH, mode="a", index=False, sep=CSV_DELIM, header=write_header
    )

---

## Pipeline 1: Retrieve Training Data From Manually-Labeled Transactions

This pipeline is used to train our classifier using *manually-labeled* data, e.g. the year's worth of manually labeled transaction CSVs I compiled.

Those CSVs were in the below format with a `';'` delimiter, but you can adjust the code to fit whichever format your account CSVs use:

|   Date  | Account |                  Description                 | Charge | Direction |    Category   |
|:-------:|:-------:|:--------------------------------------------:|:------:|:---------:|:-------------:|
| January |   REDACTED_CC_ACCT  |                  McDonald's                  |  5,06  |   Debit   |   Dining Out  |
| January |   REDACTED_CC_ACCT  |                   Michaels                   |  11,26 |   Debit   |   Household   |
| January |   REDACTED_CC_ACCT  |                   T.J. Maxx                  |  10,69 |   Debit   |   Household   |
| January |   REDACTED_CC_ACCT  |                    USPS PO                   |  23,93 |   Debit   |    Shipping   |
| January |   REDACTED_CC_ACCT  |                  Which Wich                  |  21,01 |   Debit   |   Dining Out  |
| January |   REDACTED_CC_ACCT  | APPLE.COM/BILL      INTERNET   CHARGE     CA |  0,99  |   Debit   | Entertainment |
|  March  |   REDACTED_CC_ACCT  | CHEWY.COM             (800)672-4399       FL |  31,96 |   Debit   |      Pets     |
|  March  |  REDACTED_BANK_ACCT |           MOBILE DEPOSIT AUTO-POST           | 100.00 |   Credit  |     Gifts     |
|   ...   |   ...   |                      ...                     |   ...  |    ...    |      ...      |

In [19]:
raw_df = dataframe_from_csvs(LABELED_CSV_PATH, delim=CSV_DELIM, labeled=True)
# FYI: df.shape shows (rows, columns) counts for the data frame
print(raw_df.shape)
raw_df.head()

(1955, 6)


Unnamed: 0,Date,Account,Description,Charge,Direction,Category
0,August,REDACTED_CC_ACCT_2,REDACTED_WATER_UTILITY,75.43,Debit,Utilities
1,August,REDACTED_CC_ACCT_2,REDACTED_WATER_UTILITY,2.99,Debit,Utilities
2,August,REDACTED_CC_ACCT_2,RAISING CANES REDACTED_STATE,15.91,Debit,Dining Out
3,August,REDACTED_CC_ACCT_2,BURGER KING REDACTED_STATE,16.57,Debit,Dining Out
4,August,REDACTED_CC_ACCT_2,MEIJER REDACTED_STATE,84.93,Debit,Groceries


Now let's take the raw labeled data, and run it through our `derive_X()` function to encode the features we want to train our classifier on:

In [20]:
X_data = derive_X(raw_df, "Account", "Description", "Charge")
print(X_data.shape)
X_data.head()

(1955, 8)


Unnamed: 0,description,charge_under_20,charge_under_50,charge_under_100,charge_under_500,charge_under_1000,charge_over_1000,is_credit
0,REDACTED_WATER_UTILITY,0,0,1,0,0,0,0
1,REDACTED_WATER_UTILITY,1,0,0,0,0,0,0
2,raising canes REDACTED_STATE,1,0,0,0,0,0,0
3,burger king REDACTED_STATE,1,0,0,0,0,0,0
4,meijer REDACTED_STATE,0,0,1,0,0,0,0


Next, for training our classifier, we need to tell it the actual labels we (humans) used for the training transactions, so it can start learning how the features (X) relate to the target labels (y).

In [21]:
y_data = raw_df["Category"].str.lower()
print(y_data.shape)
y_data.head()

(1955,)


0     utilities
1     utilities
2    dining out
3    dining out
4     groceries
Name: Category, dtype: object

Before moving on to training the classifier, now that we have already transformed our raw, manually-labeled data into the format our model will expect, we can save this transformed data to our master data set.

This way, we can go back and tweak the data later, then retrain the model using **Pipeline 2** so we can skip some of the preprocessing steps.

In [8]:
# Write the preprocessed labeled data to the master data set for inclusion in future retraining
# This includes our preprocessed features (X) and target values (y)
combined_df = X_data.join(y_data)
combined_df.columns = combined_df.columns.str.lower()
print(combined_df.shape)
combined_df.head()

(1955, 9)


Unnamed: 0,description,charge_under_20,charge_under_50,charge_under_100,charge_under_500,charge_under_1000,charge_over_1000,is_credit,category
0,REDACTED_WATER_UTILITY,0,0,1,0,0,0,0,utilities
1,REDACTED_WATER_UTILITY,1,0,0,0,0,0,0,utilities
2,raising canes REDACTED_STATE,1,0,0,0,0,0,0,dining out
3,burger king REDACTED_STATE,1,0,0,0,0,0,0,dining out
4,meijer REDACTED_STATE,0,0,1,0,0,0,0,groceries


In [None]:
add_data_to_master_set(combined_df)

### Skip to Pipeline 3, unless training with Master Data Set as well

---

## Pipeline 2: Retrieve Training Data From Master Data Set (Manually + Automatically Labeled Data)

As opposed to **Pipeline 1**, which relies on a different format of manually-labeled data, once we start building a "master" data set of our desired features (X) and assigned categories (y), we can train or retrain our classifier using this data without any preprocessing needed.

In [9]:
# Load labeled master data set for training/retraining
raw_df = read_csv(MASTER_CSV_PATH, delimiter=CSV_DELIM)
print(raw_df.shape)
raw_df.head()

(2315, 9)


Unnamed: 0,description,charge_under_20,charge_under_50,charge_under_100,charge_under_500,charge_under_1000,charge_over_1000,is_credit,category
0,REDACTED_WATER_UTILITY,0,0,1,0,0,0,0,utilities
1,REDACTED_WATER_UTILITY,1,0,0,0,0,0,0,utilities
2,raising canes REDACTED_STATE,1,0,0,0,0,0,0,dining out
3,burger king REDACTED_STATE,1,0,0,0,0,0,0,dining out
4,meijer REDACTED_STATE,0,0,1,0,0,0,0,groceries


In [10]:
# With the already-transformed data, we just have to drop the "category" column to get X!
X_data = raw_df.loc[:, raw_df.columns != "category"]
print(X_data.shape)
X_data.head()

(2315, 8)


Unnamed: 0,description,charge_under_20,charge_under_50,charge_under_100,charge_under_500,charge_under_1000,charge_over_1000,is_credit
0,REDACTED_WATER_UTILITY,0,0,1,0,0,0,0
1,REDACTED_WATER_UTILITY,1,0,0,0,0,0,0
2,raising canes REDACTED_STATE,1,0,0,0,0,0,0
3,burger king REDACTED_STATE,1,0,0,0,0,0,0
4,meijer REDACTED_STATE,0,0,1,0,0,0,0


In [11]:
# Boom, instant labels as well
y_data = raw_df["category"].str.lower()
print(y_data.shape)
y_data.head()

(2315,)


0     utilities
1     utilities
2    dining out
3    dining out
4     groceries
Name: category, dtype: object

---

## Pipeline 3 (Follows PL1 or PL2): Train Classifier

Now, we get to the most important and extensive pipeline of our notebook: Training our classifier.

Using the data from **Pipeline 1** or **Pipeline 2** above (or both), we split the data into a ["training" and "testing" set](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets).

scikit-learn contains a great function - [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) - for just this purpose.

Notice that we're not using a validation set here, to keep things simple, but as a personal exercise, you could also build a validation set to run tests prior to selecting a final, fully-tuned model (for an example of this, see [here](https://datascience.stackexchange.com/questions/15135/train-test-validation-set-splitting-in-sklearn)).

<br>

We split using an 80/20 train/test ratio.

Why split the data like this at all?

So our classifier doesn't adapt so well to the training data set that it doesn't generalize to future unlabeled data sets:

The training set will be used to "train" (obviously) our classifier, which is why it typically is made up of the lion's share of the total available data, and we will analyze it to tune our model.

The test set, on the other hand, is meant to serve as a separate, untouched and unseen (by us) data set to test our final tweaked and trained model on in order to ensure the model didn't overfit ("memorize") the training data, and will generalize enough to new data.

In [12]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, train_size=0.8, test_size=0.2
)

To classify our transactions, we are going to use 3 high-level features:

1. The description of the transaction (provided by the bank/credit card CSV)
1. The amount of the transaction (encoded using our intervals above)
1. Whether the transaction is considered a debit or credit

Let's look at these in more detail:

### Description of the Transaction

We will preprocess the description text to remove superfluous characters, then remove unwanted transactions (see `IGNORED_TXN_TERMS` above), and then use [term frequency–inverse document frequency or TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to encode the words in the description into a [sparse matrix](https://machinelearningmastery.com/sparse-matrices-for-machine-learning/#:~:text=A%20sparse%20matrix%20is%20a,of%20its%20coefficients%20are%20zero) of features.

Once again, scikit-learn has an awesome prebuilt function for this - [`TfidfVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

Put simply, TF-IDF is used to weigh the importance of words across several documents or - in our case - transaction descriptions.

It will analyze each "word" (which may not equate to an actual English word, so "term" or "token" can also be used) of the preprocessed transaction description and rank words that appear often in a particular description as more important in classifying that particular category of transaction, **unless** the term(s) also appear across other descriptions, in which case their importance is weighted less (terms like `"USA"` or `"Charge"`).

Instead of actual English "words", however, `TfidfVectorizer()` can be modified to use a range of n-grams, which essentially include any string of non-space characters (though this can be modified by tweaking various parameters of the function to define what a "token" should be considered, which "stop words" (common terms) should be filtered, and more).

For example, given argument `ngram_range=(1,2)` (unigrams and bigrams) a description text like `"marketplace credit charge us"` will be vectorized to: `['marketplace', 'credit', 'charge' ,'us', 'marketplace credit', 'credit charge', 'charge us']`.

This vectorization will produce a sparse matrix which will be passed through to our classifier, so that it can learn which tokens/words in the transaction descriptions are associated with a particular category.

For example, the tokens `"Burger"` and `"Burger King"` are very likely going to be strongly associated with the `"Dining Out"` category :-).

We will also make use of scikit-learn's built-in [`ColumnTransformer()`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) function, which makes it easy for us to run certain transformers (like `TfidfVectorizer`) on certain columns, and run different transformers - or just pass through "as-is", which we will do here - on remaining columns.

### Amount of the transaction
Why go through the trouble of tracking the amount of the transaction?

I chose to do this in order to increase the accuracy of the model, specifically finding it useful for the case when faced with transactions from the same shop/org, but in different increments.

For example, sometimes, when I get fuel at a gas station when traveling, I'll also stop in to grab a drink or snack.

Both of these transaction descriptions will probably include the name of the gas station, but one will be smaller than the other (unless I'm **really** loading up on snacks for a long trip).

Our model might try to classify both of these as `"Gas"` based on the presence of the gas station name in the charge. 

But by including information about the charge amount, smaller charges will be recognized as `"Groceries"` (really, "snacks") and larger charges will be classified as `"Gas"`.

### Transaction is Credit or Debit
Once again, without including information about whether a transaction is a credit or debit, the model might learn to categorize both something I bought from Home Depot as `"Household"` as well as a refund on the same item if I later return it.

Obviously, I want these tracked as two different transactions: The debit as a `"Household"` transaction, and the refund as a `"Refund"` transaction.

Adding in a feature to track which charges are debits or credits helps train our classifier to differentiate positive/negative charges appropriately.

<br>

That was a lot of background info for just a couple of lines of code, but hopefully it helps understand the approach!

In [13]:
# Build TF-IDF pipeline for description texts, passing other columns "as-is"
tfidf_vec = TfidfVectorizer()
ct = ColumnTransformer([("tfidf", tfidf_vec, "description")], remainder="passthrough")

### Training and Tuning
For classification, I'm going to use Linear Support Vector Classification, built into scikit-learn as [`LinearSVC()`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html).

<br>

Some resources on Support Vector Machines and Classification:

By definition, [Support Vector Machines](https://medium.com/axum-labs/logistic-regression-vs-support-vector-machines-svm-c335610a3d16) (SVMs) are supervised learning models which try to find a maximum *margin* between the line (in 2D space) or hyperplane (in 3D+ space) separating classes of items. By maximizing that margin, we can find the optimal separation of those classes of items (think of the classes as our "categories" of charges).

The best ELI5 I've seen for SVMs is /u/copperking's [here](https://www.reddit.com/r/MachineLearning/comments/15zrpp/please_explain_support_vector_machines_svm_like_i/), complete with helpful images.

There are also two great blogs from [Georgios Drakos](https://gdcoder.com/support-vector-machine-vs-logistic-regression/) and [Lilly Chen](https://towardsdatascience.com/support-vector-machine-simply-explained-fee28eba5496) that offer a primer on SVMs and how they compare to logistic regression algorithms.

<br>

Prior to settling on using `LinearSVC`, I ran tests with the following classifier algorithms built into scikit-learn:

* Regular SVC - `SVC()`
* Logistic Regression  - `LogisticRegression()`
* Random Forest - `RandomForestClassifier()`

Each of these can be wrapped in [`OneVsRestClassifier()` or `OneVsOneClassifier()`](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/) from scikit-learn for multi-class classification problems like the one we are solving here, though we would have to modify our y target values to be one-hot encoded.

See the cell at the bottom of this notebook for just such an example, with some tests of these classifiers.

<br>

I ultimately chose `LinearSVC()` because it trains quickly, scales well, and has a decent F1 score compared to the other classifiers I tested.

I chose to evaluate scoring via the `f1_macro` metric due to the imbalance of classes/categories in this data set.

Please see [here](https://datascience.stackexchange.com/questions/40900/whats-the-difference-between-sklearn-f1-score-micro-and-weighted-for-a-mult) for a great explanation of various F1 metrics in scikit-learn, and [here](https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/) for an explanation of F1 Scoring in general, as well as other important scoring metrics like Accuracy, Precision, and Recall.

<br>

For my data set, I received some warnings along the lines of the below.

Both of these are due to some categories only being used once in the training dataset (for example `"home down payment"` is not really an oft-occurring category of transaction), which can impact how well the classifier can learn the features that define that category:

* `UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5`
* `UserWarning: Label not (N) is present in all training examples.`

This is okay, however - the classifier will still attempt to make a best prediction, and as time goes on and we gather more data, relabel misclassified data, and retrain our model, it will become more accurate.

<br/>

We will "tune" the hyperparameters of our TF-IDF Vectorizer and Linear SVC classifier using [`GridSearchCV()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), another awesome built-in of scikit-learn which we can provide a range of values to test as parameters to our pipeline, and which will evaluate these parameters and output the best-performing model based on our scoring metric (F1 Macro).

The current F1 score falls between the .65-.75 range - not fantastic, but I don't consider it too bad because I was less structured in how I categorized transactions early on, and so there are many categories that only appear once or a couple of times.

When running the model on 3 months of new, unlabeled data, I actually only had to relabel <2% of transactions which were classified incorrectly. 

In [15]:
# Initialize classifier
cf = LinearSVC()

# Prepare parameters to optimize using Cross Validation
param_grid = {
    "coltrans__tfidf__ngram_range": [(1, 1), (1, 2), (1, 3), (1, 4)],
    "coltrans__tfidf__norm": ["l1", "l2"],
    "clf__C": logspace(-3, 3, 13),
}

# Build pipeline for TF-IDF + OVR/SVC, using accuracy as scoring metric
pipe = Pipeline([("coltrans", ct), ("clf", cf)])
clf = GridSearchCV(pipe, param_grid=param_grid, scoring="f1_macro", n_jobs=-1)

# Find best performing parameters
best_clf = clf.fit(X_train, y_train)

In [16]:
# Print best parameter/estimator data and model accuracy
print("Model F1 Score is {}\n\n".format(best_clf.score(X_test, y_test)))
pprint(best_clf.best_estimator_.get_params())

Model F1 Score is 0.7163354685138132


{'clf': LinearSVC(C=3.1622776601683795),
 'clf__C': 3.1622776601683795,
 'clf__class_weight': None,
 'clf__dual': True,
 'clf__fit_intercept': True,
 'clf__intercept_scaling': 1,
 'clf__loss': 'squared_hinge',
 'clf__max_iter': 1000,
 'clf__multi_class': 'ovr',
 'clf__penalty': 'l2',
 'clf__random_state': None,
 'clf__tol': 0.0001,
 'clf__verbose': 0,
 'coltrans': ColumnTransformer(remainder='passthrough',
                  transformers=[('tfidf', TfidfVectorizer(ngram_range=(1, 3)),
                                 'description')]),
 'coltrans__n_jobs': None,
 'coltrans__remainder': 'passthrough',
 'coltrans__sparse_threshold': 0.3,
 'coltrans__tfidf': TfidfVectorizer(ngram_range=(1, 3)),
 'coltrans__tfidf__analyzer': 'word',
 'coltrans__tfidf__binary': False,
 'coltrans__tfidf__decode_error': 'strict',
 'coltrans__tfidf__dtype': <class 'numpy.float64'>,
 'coltrans__tfidf__encoding': 'utf-8',
 'coltrans__tfidf__input': 'content',
 'coltrans__tfid

In [63]:
# Now we can save our optimized model to disk for later reloading
dump(best_clf, open(MODEL_FILE_NAME, "wb"))

To read and experiment with the test data predictions, run the following cell and skip to **Pipeline 5**

In [None]:
X = X_test

---

## Pipeline 4: Classify New Unlabeled Data Using the Optimal Classifier

Okay, now with all of that pre-work out of the way, we get to the easy and fun part: Classifying new, unlabeled data with our trained model!

Instead of manual CSVs that I put together and labeled, this unlabeled data is derived from a set of CSVs directly downloaded from my banking/credit card sites and unmodified.

Most every banking/card service I know of will allow you to download your monthly transactions as a CSV - usually this option is listed on your "Statement" or "Search Transactions" pages as a "Download" option, at which point it will ask you which format to download to.

The function `dataframe_from_csvs()` was written specifically to parse my bank/credit card transaction CSVs, so you will need to tweak it to your services accordingly.

The essential piece is to derive the account type, transaction description, and transaction amount. That's all our pipeline will need to classify the new data.

In [22]:
raw_df = dataframe_from_csvs(UNLABELED_CSV_PATH)
print(raw_df.shape)
raw_df.head()

(147, 3)


Unnamed: 0,Account,Description,Amount
0,REDACTED_CC_ACCT,AplPay MEIJER REDACTED_STATE,42.97
1,REDACTED_CC_ACCT,REDACTED_CC_OFFER,-43.12
2,REDACTED_CC_ACCT,AplPay TARGET REDACTED_STATE,9.08
3,REDACTED_CC_ACCT,NOODLES & CO REDACTED_STATE,31.42
4,REDACTED_CC_ACCT,AplPay IKEA REDACTED_STATE,42.58


In [23]:
X = derive_X(raw_df, "Account", "Description", "Amount")
print(X.shape)
X.head()

(147, 8)


Unnamed: 0,description,charge_under_20,charge_under_50,charge_under_100,charge_under_500,charge_under_1000,charge_over_1000,is_credit
0,aplpay meijer REDACTED_STATE,0,1,0,0,0,0,0
1,REDACTED_CC_OFFER,0,1,0,0,0,0,1
2,aplpay target REDACTED_STATE,1,0,0,0,0,0,0
3,noodles co REDACTED_STATE,0,1,0,0,0,0,0
4,aplpay ikea REDACTED_STATE,0,1,0,0,0,0,0


---

## Pipeline 5: View Classified Data and (Optionally) Add To Master Data Set

Whether coming from **Pipeline 4** or viewing our test data from **Pipeline 3**, we can now load our saved model from disk, call its `predict()` function on our unlabeled data, and then join these predictions to our input data.

Then we can choose to save this data to disk into our master data set, as well as a more human-readable CSV for a quick read-through and/or relabeling of data.

In [24]:
clf = load(MODEL_FILE_NAME)

In [25]:
y_predicted = clf.predict(X)

In [26]:
# Capture predictions in CSV format
csv_df = X.copy().reset_index(drop=True)
csv_df["category"] = y_predicted
print(csv_df.shape)
csv_df.head()

(147, 9)


Unnamed: 0,description,charge_under_20,charge_under_50,charge_under_100,charge_under_500,charge_under_1000,charge_over_1000,is_credit,category
0,aplpay meijer REDACTED_STATE,0,1,0,0,0,0,0,groceries
1,REDACTED_CC_OFFER,0,1,0,0,0,0,1,REDACTED_CC_ACCT credit
2,aplpay target REDACTED_STATE,1,0,0,0,0,0,0,household
3,noodles co REDACTED_STATE,0,1,0,0,0,0,0,dining out
4,aplpay ikea REDACTED_STATE,0,1,0,0,0,0,0,household


In [74]:
# Save the human-readable CSV to disk with full charge data
raw_df["Credit"] = X["is_credit"].copy()
raw_df["Category"] = csv_df["category"].copy()
print(raw_df.shape)
raw_df.head()

(147, 5)


Unnamed: 0,Account,Description,Amount,Credit,Category
0,REDACTED_CC_ACCT,AplPay MEIJER REDACTED_STATE,42.97,0,groceries
1,REDACTED_CC_ACCT,REDACTED_CC_OFFER,-43.12,1,REDACTED_CC_ACCT credit
2,REDACTED_CC_ACCT,AplPay TARGET REDACTED_STATE,9.08,0,household
3,REDACTED_CC_ACCT,NOODLES & CO REDACTED_STATE,31.42,0,dining out
4,REDACTED_CC_ACCT,AplPay IKEA REDACTED_STATE,42.58,0,household


In [42]:
raw_df.to_csv("MAY21.csv", index=False, sep=CSV_DELIM)

Note that when writing to the master data set, we are *appending* the data (unless there is no existing master set yet) so that we can collect this data over time and then retrain the model later on the full dataset - perhaps after going through and manually relabeling some entries to improve the model's learning - using **Pipeline 2**.

In [43]:
add_data_to_master_set(csv_df)

And just like that, no more hours spent classifying transactions!

Thanks for reading through, and I hope this was a helpful toy project to you.

---

## Model Testing Sandbox

I used this cell to do testing of different classifier variations using the One vs. Rest (OVR) strategy.

**Warning**: Grid searching can take quite a long time for these different classifiers and parameters, so if you run this cell, be prepared to step away for several minutes to an hour.

Please share any advice on the approaches discussed above and in the cell below - I'm a noob and there are probably better classifiers to test or ways to evaluate them than this.

In [34]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# One-hot encode y target values for use with OvR and resplit data
y_test = get_dummies(
    raw_df["category"].str.lower(), prefix="", prefix_sep="", columns=["category"]
)
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, train_size=0.8, test_size=0.2
)


# Build TF-IDF pipeline for description texts, passing other columns "as-is"
tfidf_vec = TfidfVectorizer()
ct = ColumnTransformer([("tfidf", tfidf_vec, "description")], remainder="passthrough")

clfs = [
    {
        "name": "rfc",
        "clf": OneVsRestClassifier(RandomForestClassifier()),
        "params": {
            "coltrans__tfidf__ngram_range": [(1, 1), (1, 2), (1, 3), (1, 4)],
            "coltrans__tfidf__norm": ["l1", "l2"],
            "clf__estimator__max_features": ["auto", "sqrt"],
            "clf__estimator__n_estimators": [200, 1000],
        },
    },
    {
        "name": "lrc",
        "clf": OneVsRestClassifier(LogisticRegression()),
        "params": {
            "coltrans__tfidf__ngram_range": [(1, 1), (1, 2), (1, 3), (1, 4)],
            "coltrans__tfidf__norm": ["l1", "l2"],
            "clf__estimator__C": logspace(-2, 2, 5),
            "clf__estimator__penalty": ["l1", "l2"],
        },
    },
    {
        "name": "svc",
        "clf": OneVsRestClassifier(SVC()),
        "params": {
            "coltrans__tfidf__ngram_range": [(1, 1), (1, 2), (1, 3), (1, 4)],
            "coltrans__tfidf__norm": ["l1", "l2"],
            "clf__estimator__C": logspace(-2, 2, 5),
            "clf__estimator__gamma": logspace(-2, 2, 5),
            "clf__estimator__kernel": ["rbf"],
        },
    },
]


for clf in clfs:
    pipe = Pipeline([("coltrans", ct), ("clf", clf["clf"])])
    best_clf = GridSearchCV(
        pipe, param_grid=clf["params"], scoring="f1_macro", n_jobs=-1
    ).fit(X_train, y_train)
    print("{} F1 Score is {}\n\n".format(clf["name"], best_clf.score(X_test, y_test)))

rfc F1 Score is 0.3473845761208306


lrc F1 Score is 0.3609711455509903


svc F1 Score is 0.3751328276510206


