# Model Exploration

This notebook documents the model-building phase. We compare the performance of ~15 different machine learning classifiers using 10-fold cross-validation. Based on a set of criteria, three models were shortlisted. These models were tuned using randomized search and combined into a soft-voting ensemble, and all four were then evaluated on the holdout set to select the best-performing model.
For the data analysis, please refer to the *Data Analysis* notebook in the repository.

Since the objective is to rank a set of transactions by their likelihood of matching a receipt image, we are restricted to binary classifiers that predict class probabilities rather than only class labels. That is, only models that have the *predict_proba()* method can be used.
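To make the ranking objective concrete, here is a minimal sketch of how predicted probabilities could be turned into a per-receipt ranking. The rows in `scored` and the helper `rank_candidates` are hypothetical and only illustrate the idea; in the notebook, the probabilities would come from a model's *predict_proba()* output.

```python
from collections import defaultdict

# Hypothetical scored rows: (receipt_id, feature_transaction_id, P(is_match = 1)).
scored = [
    ("10001", "10596", 0.12),
    ("10001", "10605", 0.91),
    ("10001", "10597", 0.40),
    ("10002", "10700", 0.30),
    ("10002", "10701", 0.75),
]


def rank_candidates(rows):
    """Group candidates by receipt and sort each group by descending probability."""
    by_receipt = defaultdict(list)
    for receipt_id, transaction_id, prob in rows:
        by_receipt[receipt_id].append((transaction_id, prob))
    return {
        receipt_id: sorted(cands, key=lambda c: c[1], reverse=True)
        for receipt_id, cands in by_receipt.items()
    }


ranking = rank_candidates(scored)
for receipt_id, cands in ranking.items():
    # Print transaction ids from most to least likely match.
    print(receipt_id, [transaction_id for transaction_id, _ in cands])
```

A classifier that only returns hard labels would give every candidate a 0 or a 1, which leaves no way to order candidates that share a label. This is why probability outputs are required.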

To handle the complete model-building process (training various models and evaluating them with CV, tuning hyperparameters, testing performance, etc.), we use a library called [pycaret](https://pycaret.org/). The same process could have been done manually, but I chose to use the library instead. Please refer to the library's documentation for more details!

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
df_cleaned = pd.read_csv("df_cleaned.csv")
df_cleaned.head()

Unnamed: 0,receipt_id,company_id,matched_transaction_id,feature_transaction_id,DateMappingMatch,AmountMappingMatch,DescriptionMatch,DifferentPredictedTime,TimeMappingMatch,PredictedNameMatch,ShortNameMatch,DifferentPredictedDate,PredictedAmountMatch,PredictedTimeCloseMatch,is_match
0,10001,10000,10605,10596,0.0,0.4,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0
1,10001,10000,10605,10597,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0
2,10001,10000,10605,10598,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0
3,10001,10000,10605,10599,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0
4,10001,10000,10605,10600,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0


In [23]:
#model_features_and_response_cols = [ 'DateMappingMatch', 'AmountMappingMatch',
 #      'DescriptionMatch', 'DifferentPredictedTime', 'TimeMappingMatch',
  #     'PredictedNameMatch', 'ShortNameMatch', 'DifferentPredictedDate',
   #    'PredictedAmountMatch', 'PredictedTimeCloseMatch', 'is_match']

In [22]:
#from sklearn.model_selection import train_test_split
#training_with_ids, holdout_with_ids = train_test_split(df_cleaned,train_size=0.8,random_state=26)

In [21]:
#training = training_with_ids[model_features_and_response_cols]
#holdout = holdout_with_ids[model_features_and_response_cols]

In [20]:
df_modeling = df_cleaned.set_index(['receipt_id', 'company_id', 'matched_transaction_id',
       'feature_transaction_id'])
df_modeling.head()

receipt_id,company_id,matched_transaction_id,feature_transaction_id,DateMappingMatch,AmountMappingMatch,DescriptionMatch,DifferentPredictedTime,TimeMappingMatch,PredictedNameMatch,ShortNameMatch,DifferentPredictedDate,PredictedAmountMatch,PredictedTimeCloseMatch,is_match
10001,10000,10605,10596,0.0,0.4,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0
10001,10000,10605,10597,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0
10001,10000,10605,10598,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0
10001,10000,10605,10599,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0
10001,10000,10605,10600,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0


In [24]:
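# Import only the pycaret functions used in this notebook instead of a wildcard import
from pycaret.classification import (
    setup, compare_models, create_model, tune_model, predict_model, blend_models
)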

In [25]:
exp_clf101 = setup(data = df_modeling, target = 'is_match', session_id=26)

Unnamed: 0,Description,Value
0,session_id,26
1,Target,is_match
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(6468, 11)"
5,Missing Values,False
6,Numeric Features,5
7,Categorical Features,5
8,Ordinal Features,False
9,High Cardinality Features,False


In [26]:
compare_models()
# Results are calculated using 10-fold cross-validation

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.9313,0.9265,0.5949,0.8471,0.6967,0.6595,0.6738,0.14
xgboost,Extreme Gradient Boosting,0.9311,0.9329,0.6015,0.8389,0.6984,0.6609,0.6737,0.323
dt,Decision Tree Classifier,0.9309,0.922,0.5916,0.8463,0.6942,0.6568,0.6713,0.009
rf,Random Forest Classifier,0.9309,0.9307,0.5983,0.8401,0.6965,0.6589,0.6723,0.167
gbc,Gradient Boosting Classifier,0.9287,0.9324,0.6098,0.8125,0.6946,0.6553,0.6651,0.099
lightgbm,Light Gradient Boosting Machine,0.9282,0.934,0.59,0.8245,0.6853,0.6462,0.6591,0.144
knn,K Neighbors Classifier,0.9269,0.8754,0.6048,0.8028,0.6871,0.6469,0.6567,0.061
ada,Ada Boost Classifier,0.9234,0.932,0.5767,0.7988,0.6664,0.6246,0.6371,0.087
nb,Naive Bayes,0.9205,0.928,0.5716,0.7767,0.6559,0.6123,0.623,0.009
svm,SVM - Linear Kernel,0.9185,0.0,0.5202,0.8021,0.6281,0.585,0.6038,0.013


ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
                     oob_score=False, random_state=26, verbose=0,
                     warm_start=False)
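The 10-fold cross-validation behind the comparison table above can be pictured with a small, library-free sketch. The helper `k_fold_indices` is hypothetical and uses a plain shuffled split. As far as I know, pycaret stratifies the folds by the target for classification by default, which this sketch does not attempt to reproduce.

```python
import random


def k_fold_indices(n_samples, n_splits=10, seed=26):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples) into n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Toy example: 25 samples split into 10 folds.
folds = list(k_fold_indices(n_samples=25, n_splits=10))
for fold_number, (train_idx, val_idx) in enumerate(folds):
    print(fold_number, len(train_idx), len(val_idx))
```

Each sample lands in the validation set exactly once, so every metric in the table is an average over ten models that were each scored on data they did not see during fitting.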

We select the top 3 models from here to improve further via hyperparameter tuning. The conditions kept in mind while selecting models are:
- It should support predicting probabilities (it should have the *predict_proba()* method)
- It should perform well in cross-validation (the results above)
- Finally, since we also plan to create a soft-voting ensemble of models, the algorithms should work as differently from one another as possible. This helps us create a diverse ensemble.

Keeping all these considerations in mind, the following models have been shortlisted:
- XGBoost
- Random Forest
- Naive Bayes

## XGBoost

In [27]:
xgboost = create_model('xgboost')
# Hyperparameter optimization using randomized search
tuned_xgboost = tune_model(xgboost)
print(tuned_xgboost)

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9338,0.942,0.6393,0.8298,0.7222,0.6853,0.6929
1,0.936,0.9372,0.7213,0.7857,0.7521,0.7155,0.7163
2,0.9338,0.92,0.6393,0.8298,0.7222,0.6853,0.6929
3,0.9316,0.922,0.6066,0.8409,0.7048,0.6672,0.6786
4,0.9205,0.9477,0.6066,0.7551,0.6727,0.6281,0.633
5,0.9139,0.9366,0.6167,0.6981,0.6549,0.6059,0.6074
6,0.9139,0.9108,0.5167,0.7561,0.6139,0.5673,0.5804
7,0.9314,0.9416,0.6667,0.7843,0.7207,0.6819,0.6848
8,0.9403,0.9337,0.65,0.8667,0.7429,0.7098,0.7192
9,0.9204,0.9161,0.5333,0.8,0.64,0.5972,0.6128


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.9, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.0005, max_delta_step=0, max_depth=9,
              min_child_weight=2, missing=nan, monotone_constraints='()',
              n_estimators=20, n_jobs=-1, num_parallel_tree=1,
              objective='binary:logistic', random_state=26, reg_alpha=0.0005,
              reg_lambda=1e-07, scale_pos_weight=2.1, subsample=0.7,
              tree_method='auto', use_label_encoder=True, validate_parameters=1,
              verbosity=0)


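Averaging the fold scores above gives an F1 of roughly 0.69 and a recall of roughly 0.62, against 0.6984 and 0.6015 for the un-tuned XGBoost in the `compare_models()` table. Tuning therefore leaves the cross-validated F1 essentially unchanged and mainly trades some precision for recall, rather than giving a large out-of-sample gain.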
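One way to check such a comparison is to average the per-fold scores so that they are on the same footing as the single-row summaries from `compare_models()`. The sketch below does this for the F1 column of the XGBoost fold table above; the variable names are my own.

```python
from statistics import mean

# F1 per fold, copied from the XGBoost cross-validation table above.
fold_f1 = [0.7222, 0.7521, 0.7222, 0.7048, 0.6727,
           0.6549, 0.6139, 0.7207, 0.7429, 0.64]

mean_f1 = mean(fold_f1)
spread = max(fold_f1) - min(fold_f1)  # gap between the best and worst fold

print(f"mean F1 = {mean_f1:.4f}, range across folds = {spread:.4f}")
```

The range across folds is much larger than the differences between models in the comparison table, so small gaps in mean scores should be read with some caution.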

In [28]:
# Test performance on the holdout sample that was kept out of the training and validation phases
predict_model(tuned_xgboost);

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extreme Gradient Boosting,0.9299,0.9339,0.6151,0.799,0.6951,0.6562,0.6634


## Random Forest

In [29]:
rf = create_model('rf')
# Hyperparameter optimization using randomized search
tuned_rf = tune_model(rf)
print(tuned_rf)

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9338,0.9452,0.6393,0.8298,0.7222,0.6853,0.6929
1,0.9426,0.938,0.7213,0.8302,0.7719,0.7393,0.7417
2,0.9316,0.9317,0.6557,0.8,0.7207,0.6822,0.6865
3,0.9338,0.917,0.6066,0.8605,0.7115,0.6754,0.6886
4,0.9316,0.9474,0.623,0.8261,0.7103,0.6723,0.681
5,0.9117,0.9324,0.6,0.6923,0.6429,0.5928,0.5947
6,0.9161,0.9173,0.5167,0.775,0.62,0.575,0.5899
7,0.9381,0.9408,0.6667,0.8333,0.7407,0.7061,0.7117
8,0.9403,0.932,0.65,0.8667,0.7429,0.7098,0.7192
9,0.9181,0.9226,0.4667,0.8485,0.6022,0.5608,0.592


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight={},
                       criterion='gini', max_depth=7, max_features=1.0,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.001, min_impurity_split=None,
                       min_samples_leaf=3, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=240,
                       n_jobs=-1, oob_score=False, random_state=26, verbose=0,
                       warm_start=False)


In [30]:
predict_model(tuned_rf);

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.9294,0.9335,0.5992,0.8075,0.6879,0.6491,0.6583


## Naive Bayes

In [31]:
nb = create_model('nb')
tuned_nb = tune_model(nb)
print(tuned_nb)

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9316,0.9347,0.6393,0.8125,0.7156,0.6773,0.6836
1,0.9338,0.9336,0.7049,0.7818,0.7414,0.7035,0.7048
2,0.9227,0.9388,0.6557,0.7407,0.6957,0.6516,0.6532
3,0.9338,0.9233,0.6066,0.8605,0.7115,0.6754,0.6886
4,0.9272,0.9406,0.5902,0.8182,0.6857,0.6457,0.6568
5,0.9007,0.921,0.55,0.6471,0.5946,0.5384,0.5407
6,0.9073,0.9124,0.4833,0.725,0.58,0.5302,0.544
7,0.9292,0.9325,0.6333,0.7917,0.7037,0.6641,0.6694
8,0.9314,0.9292,0.6167,0.8222,0.7048,0.6669,0.6757
9,0.9226,0.9119,0.5167,0.8378,0.6392,0.5985,0.6205


GaussianNB(priors=None, var_smoothing=0.006)


In [32]:
predict_model(tuned_nb);

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Naive Bayes,0.9238,0.9278,0.5873,0.7708,0.6667,0.6245,0.6319


## Soft Voting Ensemble Classifier
This model combines the probability predictions from the 3 tuned models and outputs the class with the highest average probability.
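The averaging step can be sketched in a few lines of plain Python. The function `soft_vote` and the probabilities below are hypothetical and only illustrate the rule; they are not taken from the fitted models.

```python
def soft_vote(prob_rows):
    """Average per-class probabilities across models; return (label, averaged probabilities)."""
    n_models = len(prob_rows)
    n_classes = len(prob_rows[0])
    avg = [sum(row[c] for row in prob_rows) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: avg[c])
    return label, avg


# Hypothetical [P(class 0), P(class 1)] outputs of three models for one transaction.
probs = [
    [0.45, 0.55],  # model A leans weakly towards class 1
    [0.45, 0.55],  # model B leans weakly towards class 1
    [0.95, 0.05],  # model C is very confident in class 0
]

label, avg = soft_vote(probs)
print(label, [round(p, 4) for p in avg])
```

In this example a majority (hard) vote would pick class 1, because two of the three models favour it, while the soft vote picks class 0 because the third model's confidence outweighs the two weak votes. The averaged probability is also what can be used to rank transactions.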

In [33]:
blend_soft = blend_models(estimator_list = [tuned_xgboost, tuned_rf, tuned_nb], method = 'soft')

Unnamed: 0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9316,0.9394,0.6393,0.8125,0.7156,0.6773,0.6836
1,0.936,0.9439,0.6885,0.8077,0.7434,0.7071,0.71
2,0.9272,0.9282,0.6721,0.7593,0.713,0.6715,0.6731
3,0.936,0.9286,0.6066,0.881,0.7184,0.6837,0.6989
4,0.9249,0.9443,0.6066,0.7872,0.6852,0.6434,0.6504
5,0.9051,0.9288,0.5833,0.6604,0.6195,0.5655,0.5669
6,0.9073,0.9111,0.4833,0.725,0.58,0.5302,0.544
7,0.9314,0.9354,0.6333,0.8085,0.7103,0.672,0.6785
8,0.9314,0.9363,0.6167,0.8222,0.7048,0.6669,0.6757
9,0.9248,0.9152,0.5167,0.8611,0.6458,0.6067,0.6315


In [34]:
predict_model(blend_soft);

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Voting Classifier,0.9263,0.9317,0.5913,0.7884,0.6757,0.6351,0.6435


XGBoost delivers the best AUC on the holdout set, but we are more interested in the model with the best precision-recall balance (F1-score) because this is an imbalanced classification problem. Luckily, XGBoost performs best in this regard as well!
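For reference, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the holdout precision and recall reported above; the helper name `f1_from_precision_recall` is my own. Small differences from the reported F1 values are expected because the inputs are rounded to four decimals.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) on the holdout set, copied from the predict_model outputs above.
holdout = {
    "Extreme Gradient Boosting": (0.799, 0.6151),
    "Random Forest Classifier": (0.8075, 0.5992),
    "Naive Bayes": (0.7708, 0.5873),
    "Voting Classifier": (0.7884, 0.5913),
}

f1_scores = {name: f1_from_precision_recall(p, r) for name, (p, r) in holdout.items()}
best_model = max(f1_scores, key=f1_scores.get)

for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.4f}")
print("best:", best_model)
```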

The tuned XGBoost has the highest holdout F1 score (0.6951) of the 4 classifiers, narrowly ahead of Random Forest (0.6879), so we move forward with XGBoost as our final model.