# Classification Problem with Hyperparameter Tuning

## Find out if a particular client of the supermarket will buy a selected product

The goal is to predict whether a particular supermarket client will buy a selected product, based on the client type and what is already in the basket. The available data is a CSV file with a few client attributes and a basket record of products, from which we can predict whether a product will be in the basket or not. To solve this problem I'll use a **Decision Tree Classifier** and a method called ***grid search*** to tune the model's hyperparameters and find the combination that gives the best fit.

In [1]:
# Grid search on Decision Trees
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline

In [2]:
dat = pd.read_csv("cs_buy_data.csv")
dat

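| | type_01 | type_02 | $_ratio | classic_out | UHIYL | 9YZKX | 15U8X | 7DUSJ | C6EH6 | 9VBFU | ... | 7D1EV | 5ZOSV | TGZDY | 2B1M7 | V1G4A | QPIAJ | MSA6G | 4UELQ | Y56GT | D3XJV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 125 | 125 | 5.431005 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 57 | 468 | 44.151799 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 33 | 230 | 36.946190 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 60 | 468 | 41.689841 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 60 | 468 | 42.219841 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3274 | 170 | 94 | 2.955397 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3275 | 101 | 140 | 8.133862 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3276 | 23 | 120 | 27.944762 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3277 | -1 | -1 | -1.000000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |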


In the CSV file we have 3279 client records describing buying habits, with 1559 features: a few client parameters plus all the products, where 1 or 0 tells us whether the product is in the client's basket at checkout or not:
 - **type_01**: a private categorical parameter assigned to a client
 - **type_02**: another private categorical parameter assigned to a client
 - **\$_ratio**: buying average ratio, a parameter for expense measure of the client
 - **classic_out**: whether or not the client uses a standard checkout procedure
 - **UHIYL ...**: the rest of the features are product IDs, where 1 = bought together with the label product
 - **D3XJV**: the ID of the label product; we want to predict whether it will be in the basket or not

In [3]:
# print summary statistics for some features
stats = dat['D3XJV'].value_counts()
print("\n summary Label product stats: \n\n", stats)
stats = dat.iloc[:,2]
print("\n summary $_ratio stats: \n\n", stats.describe())
stats = dat.iloc[:,:2].astype('object')
print("\n summary category type: \n")
stats.describe()


 summary Label product stats: 

 0    2820
1     459
Name: D3XJV, dtype: int64

 summary $_ratio stats: 

 count    3279.000000
mean       15.046839
std        28.945317
min        -1.000000
25%        -1.000000
50%         6.800423
75%        21.329021
max       318.430318
Name: $_ratio, dtype: float64

 summary category type: 



| | type_01 | type_02 |
|---|---|---|
| count | 3279 | 3279 |
| unique | 221 | 278 |
| top | -1 | -1 |
| freq | 903 | 901 |


The label product was added to the basket by 459 clients, while 2820 did not add it. The spending ratio has a mean of 15.05\$ and a maximum of 318.43\$. The value -1 stands for NA (value not available for that client), and because it is counted as a real number here, it pulls the mean of \$_ratio down. The same placeholder appears in type_01 and type_02, which have 221 and 278 unique values respectively. There are a lot of NAs: about 900 of the 3279 samples are missing in each of these columns.
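
To illustrate how the -1 placeholder distorts a summary statistic, here is a minimal sketch on a made-up column (the toy values are assumptions, not data from the file). It replaces -1 with `NaN` so that pandas skips those entries when computing the mean.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the '$_ratio' column; the values are made up for illustration.
toy = pd.DataFrame({"$_ratio": [-1.0, 5.0, -1.0, 15.0, 40.0]})

# Mean with the -1 placeholders counted as real values.
mean_with_placeholder = toy["$_ratio"].mean()

# Treat -1 as missing: pandas skips NaN when computing the mean.
ratio_clean = toy["$_ratio"].replace(-1, np.nan)
mean_without_placeholder = ratio_clean.mean()

print(mean_with_placeholder)     # 11.6
print(mean_without_placeholder)  # 20.0
print(ratio_clean.isna().sum())  # 2
```

The same `replace(-1, np.nan)` call could be applied to the real column before calling `describe()`, if excluding the placeholders from the statistics is what is wanted.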

In [4]:
# checking missing values on $_ratio
stats = dat.iloc[:,2].astype('object')
print("\n summary $_ratio stats: \n\n", stats.describe())


 summary $_ratio stats: 

 count     3279.0
unique    1955.0
top         -1.0
freq       910.0
Name: $_ratio, dtype: float64


Okay, we leave the NAs as they are for the moment. Just to be sure, I'll check whether the binary product basket features contain any values other than 0 and 1.

In [5]:
X = dat.drop(['type_01', 'type_02', '$_ratio'], axis=1).to_numpy()
X = X.tolist()

In [6]:
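# function to count missing or non-binary values
def missing(array):
    # convert to a NumPy array so the comparison runs over every cell
    values = np.asarray(array)
    # a cell is invalid when it is neither 0 nor 1 (NaN is counted too)
    return int(np.sum((values != 0) & (values != 1)))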

In [7]:
# check for missing or other values not 0 and 1
print(missing(X), " values are missing or are not binary")

0  values are missing or are not binary
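
The same check can also be done directly on a DataFrame, which additionally shows which columns contain the offending values. This is a sketch on a toy frame with hypothetical column names, not on the supermarket data.

```python
import pandas as pd

# Toy basket frame; column names and values are hypothetical.
toy = pd.DataFrame({
    "prod_a": [0, 1, 0, 1],
    "prod_b": [0, 2, 1, 0],      # 2 is not a valid binary flag
    "prod_c": [1, 0, None, 0],   # None becomes NaN, which is also flagged
})

# True wherever a cell is neither 0 nor 1 (NaN is never "in" the list).
not_binary = ~toy.isin([0, 1])

per_column = not_binary.sum()   # count of offending cells per column
total = int(per_column.sum())   # overall count

print(per_column)
print(total)  # 2
```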


I now drop the label column and prepare the independent variables X and the dependent variable y to predict.

In [8]:
X = dat.drop('D3XJV', axis=1).to_numpy()
y = dat['D3XJV'].to_numpy()

Scikit-learn does not have a function to divide the data into three parts, so I'm going to call `train_test_split` twice: once to split the data into two sets, _a._ training and _b._ validation and test combined, and once more to split that second set into distinct validation and test sets. I reserve 0.4 (or 40%) of the data for validation and test, which leaves 1 - 0.4 = 0.6 or 60% of the data for training. Of the 40% held out for validation and test, 0.5 or 50% is used for validation (which makes it 20% of the total amount of data) and the other 50% for test.

In [9]:
# shuffle and split into training, validation and test sets
(X_train, X_vt, y_train, y_vt) = train_test_split(X, y, test_size=0.4, random_state=42)
(X_validation, X_test, y_validation, y_test) = train_test_split(X_vt, y_vt, test_size=0.5, random_state=42)
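
Since the label is imbalanced (459 positives against 2820 negatives), one possible variation is to pass `stratify` to `train_test_split`, so that each part keeps roughly the same class ratio. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are made up) and is not what the notebook itself runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows with 14% positives, roughly the
# 459 / 3279 ratio of the label in this notebook.
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 2, size=(1000, 5))
y_demo = np.array([1] * 140 + [0] * 860)

# First split: 60% train, 40% held out; stratify keeps the class ratio.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42, stratify=y_demo)

# Second split: halve the held-out part into validation and test.
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42, stratify=y_hold)

print(len(y_tr), len(y_val), len(y_te))  # 600 200 200
# Share of positives in each part; all three stay close to 0.14.
print(y_tr.mean(), y_val.mean(), y_te.mean())
```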

Grid search is implemented next using a decision tree classifier. The tuning parameters are the maximum depth of the tree (`max_depth`), the minimum number of observations in a terminal node (`min_samples_leaf`), and the minimum number of observations required to split a node (`min_samples_split`).

In [10]:
# Pipeline to create combinations of variables for the grid search:
pipeline = Pipeline([
    ('clf', DecisionTreeClassifier(criterion='entropy'))
])

# Combos to explore given parameters in Python dict. format:
parameters = {
    'clf__max_depth': (50,100,150),
    'clf__min_samples_split': (2, 3),
    'clf__min_samples_leaf': (1, 2, 3)
}
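
As a quick sanity check on the size of the search, the number of combinations in this grid can be counted with scikit-learn's `ParameterGrid`. The grid is repeated here so the snippet runs on its own; the fold count of 3 is taken from the fit log printed when the search runs.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above, repeated so the snippet is self-contained.
parameters = {
    'clf__max_depth': (50, 100, 150),
    'clf__min_samples_split': (2, 3),
    'clf__min_samples_leaf': (1, 2, 3),
}

grid = ParameterGrid(parameters)
n_candidates = len(grid)        # 3 * 2 * 3 combinations
n_folds = 3                     # fold count reported in the fit log
print(n_candidates)             # 18
print(n_candidates * n_folds)   # 54
```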

Next, the `n_jobs` argument selects the number of CPU cores to use; -1 means it uses all the cores available. The scoring metric chosen is accuracy. Since `cv` is not set, the default number of folds is used (3 here, as the fit log shows).

In [11]:
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')
grid_search.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 18 candidates, totalling 54 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   15.3s
[Parallel(n_jobs=-1)]: Done  54 out of  54 | elapsed:   19.6s finished


GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('clf',
                                        DecisionTreeClassifier(class_weight=None,
                                                               criterion='entropy',
                                                               max_depth=None,
                                                               max_features=None,
                                                               max_leaf_nodes=None,
                                                               min_impurity_decrease=0.0,
                                                               min_impurity_split=None,
                                                               min_samples_leaf=1,
                                                               min_samples_split=2,
                                                               min_weight_fraction_leaf=0.0,
  

Now it's time to predict the label y using the best parameters of grid search on the validation data:

In [12]:
y_pred = grid_search.predict(X_validation)

Print the results:

In [13]:
print ('\n Grid Search Best score: \n', grid_search.best_score_)
print ('\n Best parameters set: \n')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print ('\t%s: %r' % (param_name, best_parameters[param_name]))
print ("\n Confusion Matrix on validation data \n",confusion_matrix(y_validation,y_pred))
print ("\n Validation Accuracy \n",accuracy_score(y_validation,y_pred))
print ("\nPrecision Recall f1 table \n",classification_report(y_validation, y_pred))


 Grid Search Best score: 
 0.9679715302491103

 Best parameters set: 

	clf__max_depth: 100
	clf__min_samples_leaf: 1
	clf__min_samples_split: 3

 Confusion Matrix on validation data 
 [[543  12]
 [ 16  85]]

 Validation Accuracy 
 0.9573170731707317

Precision Recall f1 table 
               precision    recall  f1-score   support

           0       0.97      0.98      0.97       555
           1       0.88      0.84      0.86       101

    accuracy                           0.96       656
   macro avg       0.92      0.91      0.92       656
weighted avg       0.96      0.96      0.96       656



In the **confusion matrix** on validation data, rows are the actual classes and columns are the predicted classes. So **543** are the ***True Negatives*** (predicted as not added to the basket and actually not added by the client), **12** are the ***False Positives*** (predicted as added to the basket but actually not added by the client), **16** are the ***False Negatives*** (predicted as not added but actually added to the basket by the client) and **85** are the ***True Positives*** (predicted as added to the basket and actually added by the client). The total **Validation Accuracy** is **95.7%**.
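
To make the reading of the matrix concrete, the sketch below unpacks the four cells of the validation matrix printed above and recomputes accuracy, precision and recall for class 1 by hand. It relies on the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Validation confusion matrix copied from the output above.
# scikit-learn convention: rows are true classes, columns are predicted classes.
cm = np.array([[543, 12],
               [16, 85]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted buyers, how many really bought
recall = tp / (tp + fn)      # of the real buyers, how many were found

print(round(accuracy, 4))    # 0.9573
print(round(precision, 2))   # 0.88
print(round(recall, 2))      # 0.84
```

These values agree with the accuracy and the class 1 row of the classification report.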

In [14]:
y_pred = grid_search.predict(X_test)
print ("\n Confusion Matrix on test data \n",confusion_matrix(y_test,y_pred))
print ("\n Test Accuracy \n",accuracy_score(y_test,y_pred))
print ("\nPrecision Recall f1 table \n",classification_report(y_test, y_pred))


 Confusion Matrix on test data 
 [[557  10]
 [ 10  79]]

 Test Accuracy 
 0.9695121951219512

Precision Recall f1 table 
               precision    recall  f1-score   support

           0       0.98      0.98      0.98       567
           1       0.89      0.89      0.89        89

    accuracy                           0.97       656
   macro avg       0.94      0.94      0.94       656
weighted avg       0.97      0.97      0.97       656



Finally, on the test data the **Test Accuracy** is **97.0%**. This is close to the grid search cross-validation score (96.8%) and to the validation accuracy (95.7%), so the model does not appear to be overfitted and looks robust. Handling the missing data might give better results, but the model could go into production and be tuned later on. For more confidence, a ***cross-validation*** analysis with more folds could be run to check the robustness of the model.
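
A minimal sketch of such a cross-validation check is shown below. It assumes synthetic, imbalanced data from `make_classification` in place of the supermarket file, reuses the best hyperparameters reported by the grid search, and uses 5 stratified folds, which is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 14% positives); not the supermarket file.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, weights=[0.86, 0.14], random_state=42)

# Hyperparameters taken from the best grid search combination reported above.
clf = DecisionTreeClassifier(
    criterion='entropy', max_depth=100,
    min_samples_split=3, min_samples_leaf=1, random_state=42)

# 5 stratified folds: each fold keeps roughly the same class ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring='accuracy')

print(scores)                        # one accuracy value per fold
print(scores.mean(), scores.std())   # average and spread across folds
```

A small spread across folds would support the claim that the accuracy does not depend much on one particular split.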