## 1. Data preparation

In [1]:
import numpy as np
import pandas as pd
from dataCategory import CategorizeData
import warnings

seed = 7  # to use for all random generators

In [2]:
transactions_df = CategorizeData('data.csv').df  # add the categories we want to predict to the dataset

In [3]:
transactions_df.head(n=10)

Unnamed: 0,transaction_id,transaction_date,transaction_type,sort_code,account_number,transaction_description,debit_amount,credit_amount,balance,number,type,day_of_week,category_spend,sub_category
6566,39393356e4f2434493b3e0d3c3b505a2,27/07/2015,FPO,30-95-46,17899960,alan_holland,1000.0,0.0,7517.06,6567,debit,Monday,transfer,others
6565,4279fbcd0b04433288934135ec52de84,27/07/2015,DEB,30-95-46,17899960,js_online_grocery,316.51,0.0,7200.55,6566,debit,Monday,Shopping,online_shopping_debit
6564,cb98fbe81b9842dba19be4e5d4e3761c,27/07/2015,DEB,30-95-46,17899960,amazon_uk_marketpl,6.39,0.0,7194.16,6565,debit,Monday,Shopping,online_shopping_debit
6563,3e93cd15c26846cca75d90b7dd645062,28/07/2015,BP,30-95-46,17899960,save_the_change,1.1,0.0,7193.06,6564,debit,Tuesday,bill_payments,savings
6562,9ce0dd2ba2bd45afa2b73185f12d62ad,28/07/2015,DEB,30-95-46,17899960,amazon_svcs_europe,5.49,0.0,7187.57,6563,debit,Tuesday,Shopping,online_shopping_debit
6561,8f039615e49249db96abe96935025f1d,29/07/2015,BP,30-95-46,17899960,save_the_change,0.51,0.0,7187.06,6562,debit,Wednesday,bill_payments,savings
6560,9adba576fdcd45cebd87711688729f7e,29/07/2015,CPT,30-95-46,17899960,loyd_swansea_unive,70.0,0.0,7117.06,6561,debit,Wednesday,cash_point,others
6559,42495900ded544dbb532da899ed31b4b,29/07/2015,DEB,30-95-46,17899960,amazon_svcs_europe,4.54,0.0,7112.52,6560,debit,Wednesday,Shopping,online_shopping_debit
6558,46c7c6aac33e4462ab05804930e51ba3,29/07/2015,DEB,30-95-46,17899960,amazon_uk_marketpl,17.94,0.0,7094.58,6559,debit,Wednesday,Shopping,online_shopping_debit
6557,b2c5d07397bc4b74be3bb26fc653e0b9,30/07/2015,BP,30-95-46,17899960,save_the_change,0.52,0.0,7094.06,6558,debit,Thursday,bill_payments,savings


Our task is to predict the transaction category from the text description alone, so we can drop every column except 'transaction_description', 'category_spend' and 'sub_category'

In [4]:
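# .copy() makes an independent frame, so later column assignments don't trigger SettingWithCopyWarning
transactions_df = transactions_df[['transaction_description', 'category_spend', 'sub_category']].copy()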

Let's look at what kind of text we have in the description column:

In [5]:
transactions_df['transaction_description'].values[:20]

array(['alan_holland', 'js_online_grocery', 'amazon_uk_marketpl',
       'save_the_change', 'amazon_svcs_europe', 'save_the_change',
       'loyd_swansea_unive', 'amazon_svcs_europe', 'amazon_uk_marketpl',
       'save_the_change', 'amazon_uk_marketpl', 'swansea_university',
       'swansea_university', 'save_the_change', 'univ_&_col_union',
       'arriva_trains_wale', 'stfc_ap', 'save_the_change',
       'esavings_account', 'tv_licence_mbp'], dtype=object)

Checking the missing values:

In [6]:
transactions_df['transaction_description'].isna().sum()

np.int64(0)

There can also be 'hidden' missing values, such as empty strings or nonsense text, but judging by the sample above, the vast majority of the data looks fine
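
A quick way to look for such hidden missing values is to flag descriptions that contain no letters at all. The snippet below is only a sketch: the helper name `is_hidden_missing` and the sample values (apart from the two real descriptions) are made up for illustration.

```python
import re

def is_hidden_missing(description):
    """Return True for values that are present but carry no usable text."""
    # Non-strings, empty strings, whitespace-only strings and strings
    # without a single letter are all treated as "hidden" missing values.
    if not isinstance(description, str):
        return True
    return re.search(r"[a-zA-Z]", description) is None

# Hypothetical sample: two real descriptions plus three suspicious values
samples = ["save_the_change", "", "   ", "12345", "tv_licence_mbp"]
hidden_missing = [s for s in samples if is_hidden_missing(s)]
print(hidden_missing)
```

On the real data the same predicate could be passed to `transactions_df['transaction_description'].apply(...)` and summed.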

As we can see, the descriptions are underscore-separated words, so we tokenise them using plain alphabetic words as tokens. Digits and other symbols don't seem to add much meaning

In [7]:
import re

def tokenize_words(text):
    return re.findall(r'[a-zA-Z]+', text.lower())

transactions_df.loc[:, 'transaction_description'] = transactions_df['transaction_description'].apply(tokenize_words)
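
To see what this tokenizer keeps and drops, here is a small standalone check. The first two inputs come from the data above; `Tesco_Store_1234` is a made-up value added to show lowercasing and digit removal.

```python
import re

def tokenize_words(text):
    # Same rule as in the cell above: keep runs of ASCII letters only
    return re.findall(r"[a-zA-Z]+", text.lower())

examples = ["univ_&_col_union", "js_online_grocery", "Tesco_Store_1234"]
tokens = [tokenize_words(example) for example in examples]

for example, token_list in zip(examples, tokens):
    print(example, "->", token_list)
```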

In [8]:
transactions_df

Unnamed: 0,transaction_description,category_spend,sub_category
6566,"[alan, holland]",transfer,others
6565,"[js, online, grocery]",Shopping,online_shopping_debit
6564,"[amazon, uk, marketpl]",Shopping,online_shopping_debit
6563,"[save, the, change]",bill_payments,savings
6562,"[amazon, svcs, europe]",Shopping,online_shopping_debit
...,...,...,...
4,"[travelium, llc]",Shopping,travel
3,"[non, gbp, trans, fee]",Shopping,bank_fee_debit
2,"[non, gbp, purch, fee]",Shopping,bank_fee_debit
1,"[lidl, gb, nottingha]",Shopping,instore_purchase_debit


In [9]:
from sklearn.model_selection import train_test_split

X = transactions_df['transaction_description']
Y_category = transactions_df['category_spend']
Y_sub_category = transactions_df['sub_category']

X_category_train, X_category_test, Y_category_train, Y_category_test = train_test_split(X, Y_category, test_size=0.2, random_state=seed, stratify=Y_category)
X_sub_category_train, X_sub_category_test, Y_sub_category_train, Y_sub_category_test = train_test_split(X, Y_sub_category, test_size=0.2, random_state=seed, stratify=Y_sub_category)

In [10]:
vocabulary = set()
for text in X:
    for word in text:
        vocabulary.add(word)
print(f"Vocabulary size: {len(vocabulary)} words")

Vocabulary size: 1202 words
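
Note that this vocabulary is built from all rows, including those that end up in the test split. An encoder fitted on the training split alone will not have a column for tokens that occur only in the test rows. A minimal sketch of how such unseen tokens could be listed, using toy token lists rather than the real split:

```python
# Toy stand-ins for the tokenised train and test descriptions
train_descriptions = [["amazon", "uk", "marketpl"], ["save", "the", "change"]]
test_descriptions = [["amazon", "svcs", "europe"], ["save", "the", "change"]]

train_vocabulary = {word for tokens in train_descriptions for word in tokens}
test_vocabulary = {word for tokens in test_descriptions for word in tokens}

# Tokens the model will never have seen during training
unseen_tokens = sorted(test_vocabulary - train_vocabulary)
print(unseen_tokens)
```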


Check the label values and their counts

In [11]:
Y_category_train.value_counts()

category_spend
Shopping           2614
bill_payments      1675
cash_point          358
transfer            281
income              241
deposits             44
account_fees         27
cheque_payments      13
Name: count, dtype: int64

In [12]:
Y_sub_category_train.value_counts()

sub_category
online_shopping_debit     1099
savings                   1009
others                     994
bill_payments              626
instore_purchase_debit     417
bank_fee_debit             260
Food                       245
travel                     148
income                     138
investment                 101
money_transfer_credit       97
interest                    49
bank_fee_credit             40
deposits_debit               9
deposits_credit              8
money_transfer_debit         7
online_shoping_refund        4
travel_refund                2
Name: count, dtype: int64

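To convert the text descriptions into a format suitable for training a machine learning model, we should consider the tools available. We could use TF-IDF or pretrained word embeddings, such as Word2Vec. However, the descriptions are very short, typically only 2-3 tokens, so a token rarely appears more than once per description and TF-IDF adds little over a plain binary encoding. The common pretrained embeddings are quite large, and many tokens here are truncated names (for example 'marketpl' or 'unive') that are unlikely to be in their vocabularies. So we try a simple one-hot (binary bag-of-words) encoding, which is practical because the vocabulary is small (about 1,200 words)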
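
The encoding idea can be sketched in plain Python as follows. This is an illustration only, not the encoder used later in the notebook; the function names `fit_vocabulary` and `encode` are made up.

```python
def fit_vocabulary(token_lists):
    """Map every distinct token to a column index (alphabetical order)."""
    distinct = sorted({word for tokens in token_lists for word in tokens})
    return {word: idx for idx, word in enumerate(distinct)}

def encode(tokens, index):
    """Return a binary vector with a 1 for every known token present."""
    vector = [0] * len(index)
    for word in tokens:
        if word in index:  # tokens outside the vocabulary are dropped
            vector[index[word]] = 1
    return vector

train = [["amazon", "uk", "marketpl"], ["save", "the", "change"]]
index = fit_vocabulary(train)
vector = encode(["amazon", "svcs", "europe"], index)
print(index)
print(vector)
```

Each description becomes a fixed-length vector of zeros and ones, one position per vocabulary word, regardless of how many tokens the description has.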

## 2. Model training

Now let's train several classical machine learning models, starting with logistic regression and then moving to tree-based ensembles: random forest (RF) and gradient boosting. Tree models can help a lot if the classes are not linearly separable. Each model is wrapped in a pipeline together with the feature encoder.

In [13]:
from sklearn.preprocessing import MultiLabelBinarizer, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline

In [14]:
category_label_encoder = LabelEncoder()
Y_category_train = category_label_encoder.fit_transform(Y_category_train)
Y_category_test = category_label_encoder.transform(Y_category_test)

sub_category_label_encoder = LabelEncoder()
Y_sub_category_train = sub_category_label_encoder.fit_transform(Y_sub_category_train)
Y_sub_category_test = sub_category_label_encoder.transform(Y_sub_category_test)
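
`LabelEncoder` assigns integer ids in sorted order of the class names, and uppercase letters sort before lowercase ones. That is why 'Shopping' ends up as class 0 in the results tables further down. The plain-Python sketch below mimics that mapping on a handful of labels; it is an illustration, not sklearn's implementation.

```python
labels = ["transfer", "Shopping", "bill_payments", "Shopping", "cash_point"]

# Sorted distinct labels play the role of LabelEncoder.classes_
classes = sorted(set(labels))
label_to_id = {label: idx for idx, label in enumerate(classes)}

encoded = [label_to_id[label] for label in labels]
print(classes)
print(encoded)
```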

In [16]:
import joblib
joblib.dump(category_label_encoder, 'labels_encoder.pkl')

['labels_encoder.pkl']

Import a wrapper class around the one-hot feature encoder, so that the encoder can be used as a step in sklearn's pipeline

In [17]:
from preprocessors import MultiLabelBinarizerWrapper 
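
The `preprocessors` module is not shown in this notebook. A wrapper is needed because `MultiLabelBinarizer.fit_transform` accepts a single argument, while a pipeline calls `fit_transform(X, y)` on each step. The class below is a hypothetical sketch of what such a wrapper might look like; the name `TokenListBinarizer` and its details are assumptions, not the actual `MultiLabelBinarizerWrapper`.

```python
import warnings

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer


class TokenListBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for MultiLabelBinarizerWrapper."""

    def fit(self, X, y=None):
        # y is accepted (and ignored) so the class fits the Pipeline API
        self.binarizer_ = MultiLabelBinarizer()
        self.binarizer_.fit(X)
        return self

    def transform(self, X):
        # Unknown tokens are ignored by MultiLabelBinarizer, which emits
        # a UserWarning; silence it because this is expected on test data
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", UserWarning)
            return self.binarizer_.transform(X)


train = [["amazon", "uk"], ["save", "the", "change"]]
encoder = TokenListBinarizer().fit(train)
encoded = encoder.transform([["amazon", "unknown"]])
print(encoder.binarizer_.classes_)
print(encoded)
```

Inheriting from `BaseEstimator` provides `get_params` and `set_params`, which lets the step be cloned during cross-validation.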

After experimenting with regularization for logistic regression, the best performance was achieved with no regularization at all

In [18]:
max_iterations = 1000

log_reg_pipeline = Pipeline([
    ('one-hot', MultiLabelBinarizerWrapper()),
    ("classifier", LogisticRegression(penalty=None, max_iter=max_iterations, random_state=seed))
])

In [19]:
num_estimators = 100
random_forest_pipeline = Pipeline([
    ('one-hot', MultiLabelBinarizerWrapper()),
    ("classifier", RandomForestClassifier(n_estimators=num_estimators, random_state=seed))
])

In [20]:
num_estimators = 100

boosting_pipeline = Pipeline([
    ('one-hot', MultiLabelBinarizerWrapper()),
    ("classifier", XGBClassifier(objective='multi:softmax', eval_metric='mlogloss', 
                                 n_estimators=num_estimators, random_state=seed))
])

## 3. Evaluation

The typical metrics for classification are accuracy, precision/recall and F1-score. But we should note that there is significant class imbalance, so instead of plain accuracy we use balanced accuracy (the mean of the per-class recalls). We should also take into account precision and recall for each class.
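
To illustrate why the imbalance matters, the sketch below computes both metrics by hand on a made-up example in which a model always predicts the majority class. The helper functions are written out for illustration; the notebook itself uses sklearn's `balanced_accuracy_score`.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[label] / support[label] for label in support]
    return sum(recalls) / len(recalls)

# Made-up data: 9 majority-class samples, 1 rare-class sample
y_true = ["Shopping"] * 9 + ["cheque_payments"]
y_pred = ["Shopping"] * 10  # the rare class is never predicted

plain = accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)
```

Plain accuracy rewards ignoring the rare class, while balanced accuracy drops to the average of a perfect and a zero recall.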

In [21]:
from sklearn.metrics import precision_score, recall_score, balanced_accuracy_score, f1_score

In [22]:
models = {
    'Logistic Regression': log_reg_pipeline,
    'Random Forest': random_forest_pipeline,
    'Gradient Boosting': boosting_pipeline
}

In [23]:
def fit_models_and_get_predictions(X_train, y_train, X_test):
    model_predictions = {}
    for model_name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        model_predictions[model_name] = predictions
    return model_predictions

In [25]:
def get_models_accuracies(true_labels, predictions):
    models_accuracy = []
    for model_name, model in models.items():
        model_predictions = predictions[model_name]
        score = balanced_accuracy_score(true_labels, model_predictions)
        models_accuracy.append({'Model': model_name, 'Balanced Accuracy': score})
    # Build the table once, after all models have been scored
    accuracy_df = pd.DataFrame(models_accuracy)
    return accuracy_df

In [26]:
def get_models_precisions_and_recalls(true_labels, predictions, class_names):
    models_results = {}
    for model_name, model in models.items():
        model_predictions = predictions[model_name]
        precision = precision_score(true_labels, model_predictions, average=None)
        recall = recall_score(true_labels, model_predictions, average=None)
        model_results = []
        for idx, class_name in enumerate(class_names):
            model_results.append({
                'Class': class_name,
                'Precision': precision[idx],
                'Recall': recall[idx]
            })
        models_results[model_name] = pd.DataFrame(model_results)
    return models_results

In [None]:
models_predictions = fit_models_and_get_predictions(X_category_train, Y_category_train, X_category_test)
class_names = category_label_encoder.classes_
accuracy_df = get_models_accuracies(Y_category_test, models_predictions)
precisions_and_recalls = get_models_precisions_and_recalls(Y_category_test, models_predictions, class_names=class_names)

In [28]:
accuracy_df

Unnamed: 0,Model,Balanced Accuracy
0,Logistic Regression,0.951216
1,Random Forest,0.949305
2,Gradient Boosting,0.85724


Logistic regression shows the top result and random forest is very close behind, even though the test descriptions contain some tokens that the training data doesn't include. Gradient boosting is clearly weaker (0.857 versus about 0.95). Since random forest doesn't outperform the linear model, the classes appear to be close to linearly separable in the one-hot feature space. Tuning the boosting model might give better results, but given how well the linear model already performs, it is unlikely to overtake it.

In [29]:
precisions_and_recalls['Logistic Regression']

Unnamed: 0,Class,Precision,Recall
0,Shopping,0.9952,0.95107
1,account_fees,1.0,1.0
2,bill_payments,0.988208,1.0
3,cash_point,1.0,1.0
4,cheque_payments,0.12,1.0
5,deposits,1.0,0.818182
6,income,0.981481,0.883333
7,transfer,0.8375,0.957143


In [30]:
precisions_and_recalls['Random Forest']

Unnamed: 0,Class,Precision,Recall
0,Shopping,0.996743,0.93578
1,account_fees,1.0,1.0
2,bill_payments,0.988208,1.0
3,cash_point,1.0,1.0
4,cheque_payments,0.078947,1.0
5,deposits,1.0,0.818182
6,income,0.981481,0.883333
7,transfer,0.858974,0.957143


In [31]:
precisions_and_recalls['Gradient Boosting']

Unnamed: 0,Class,Precision,Recall
0,Shopping,0.974085,0.977064
1,account_fees,1.0,1.0
2,bill_payments,0.988095,0.990453
3,cash_point,1.0,0.988889
4,cheque_payments,0.5,0.333333
5,deposits,1.0,0.818182
6,income,0.962264,0.85
7,transfer,0.807692,0.9


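Almost all categories are predicted very well. The exception is 'cheque_payments', where all models show poor results: logistic regression and random forest reach full recall but very low precision, while gradient boosting has low recall. That's expected, because it is the rarest class, with only 13 samples in the training set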
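
The low precision figures can be turned into rough counts. The arithmetic below assumes that the test set holds 3 'cheque_payments' samples, which is implied by the 0.333333 recall reported for gradient boosting but is not stated explicitly anywhere in the notebook.

```python
# Assumption: 3 cheque_payments samples in the test set, all of them
# found by both models (recall 1.0), so true positives = 3
true_positives = 3
reported_precision = {"Logistic Regression": 0.12, "Random Forest": 0.078947}

false_positives = {}
for model_name, precision in reported_precision.items():
    # precision = TP / (TP + FP)  =>  TP + FP = TP / precision
    predicted_positive = round(true_positives / precision)
    false_positives[model_name] = predicted_positive - true_positives

print(false_positives)
```

Under that assumption, a couple of dozen transactions from other classes are being labelled as cheque payments, so the problem is over-prediction of the rare class rather than missing it.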

Let's check how the models perform on the subcategory classification task

In [None]:
models_predictions = fit_models_and_get_predictions(X_sub_category_train, Y_sub_category_train, X_sub_category_test)
accuracy_df = get_models_accuracies(Y_sub_category_test, models_predictions)

In [33]:
accuracy_df

Unnamed: 0,Model,Balanced Accuracy
0,Logistic Regression,0.891402
1,Random Forest,0.884012
2,Gradient Boosting,0.81565


The number of labels has more than doubled (from 8 to 18), but our models still show solid results

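Dump the logreg model. Note that if the cells are run top to bottom, the pipelines were last fitted on the subcategory labels, so refit the model on the category data first if it should be used together with the category encoder saved in 'labels_encoder.pkl':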

In [42]:
joblib.dump(models['Logistic Regression'], 'logreg.pkl')

['logreg.pkl']

## Bonus: Cross-validation

Let's choose the number of estimators for the random forest using cross-validation, with balanced accuracy as the metric

In [35]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    'classifier__n_estimators': [10, 50, 100, 200, 300]
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

random_forest_pipeline = Pipeline([
    ('one-hot', MultiLabelBinarizerWrapper()),
    ("classifier", RandomForestClassifier(random_state=seed))
])

grid_search = GridSearchCV(
    estimator=random_forest_pipeline,
    param_grid=param_grid,
    scoring='balanced_accuracy',
    cv=cv
)

In [None]:
grid_search.fit(X_category_train, Y_category_train)

In [37]:
results_df = pd.DataFrame(grid_search.cv_results_)[['param_classifier__n_estimators', 'mean_test_score', 'std_test_score']]

In [38]:
results_df

Unnamed: 0,param_classifier__n_estimators,mean_test_score,std_test_score
0,10,0.897956,0.04607
1,50,0.919039,0.043366
2,100,0.902372,0.047025
3,200,0.914852,0.051246
4,300,0.919539,0.042166
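
When reading this table, it helps to compare the differences between settings with the fold-to-fold spread. The sketch below copies the values from the table and applies a simple heuristic: take the smallest forest whose mean score is within one standard deviation of the best one. That heuristic is an assumption added here for illustration, not something the grid search does itself.

```python
# (n_estimators, mean_test_score, std_test_score), copied from the table
cv_results = [
    (10, 0.897956, 0.04607),
    (50, 0.919039, 0.043366),
    (100, 0.902372, 0.047025),
    (200, 0.914852, 0.051246),
    (300, 0.919539, 0.042166),
]

best_n, best_mean, best_std = max(cv_results, key=lambda row: row[1])

# Every setting scoring above this threshold is statistically hard
# to tell apart from the best one with only 5 folds
threshold = best_mean - best_std
smallest_comparable_n = min(n for n, mean, _ in cv_results if mean >= threshold)

print(best_n, round(threshold, 6), smallest_comparable_n)
```

The gaps between the mean scores (about 0.02 at most) are smaller than the standard deviations (about 0.04 to 0.05), so the ranking of these settings should be treated with caution.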


In [54]:
models['Logistic Regression'].predict([['shop', 'shop']])

array([0])