# Classification



Using the ‘satisfied’ variable in the data, explore the use of tabular data with
different classifiers to automatically distinguish between “satisfied” and
“unsatisfied” customer responses.

Classifiers to use:

- [X] k-nearest neighbour 
- [X] decision tree
- [X] neural network
- [X] support vector machine
- [X] naive bayes

For each classifier, use 6-fold cross validation to estimate the accuracy of the classifier. \
For each classifier we will plot the confusion matrix and calculate the median of the accuracy scores.

We will also explore how best to deal with the missing values in the data.

In [13]:
import pickle as pkl
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, train_test_split, StratifiedKFold
from sklearn.metrics import roc_curve, auc, confusion_matrix, ConfusionMatrixDisplay
import sklearn.svm as svm
import plotly.graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, plot_tree
import math
import random
import time
import tensorflow
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


In [8]:
with open('./bank-data/cleaned_customers.pkl', 'rb') as f:
    tabular_data = pkl.load(f)
tabular_data.drop(['customer_id', 'date', 'customer_gender', 'customer_age', 'customer_location', 'customer_type', 'has_cc', 'has_mortgage', 'customer_age_norm'], axis=1, inplace=True)

# Managing missing values

In this segment we are going to generate five different dataframes from the original one: the four described below and a deliberately poor random-fill baseline. \
These will then be used to compare the results obtained by the different classifiers.

The first dataframe will be the original one, where we will drop all the rows with missing values. \
This means we lose 1172 rows out of 3000, which constitutes about 39% of the original data. \
In return, no estimated values are added to the data, although dropping rows can itself bias the data if the values are not missing at random.

The second dataframe will be the one where we impute the missing values with the mean of the column. \
This will allow us to keep all the rows, but it will introduce a bias in the data. \
More specifically it will introduce a bias towards the mean of the column so that peaks and valleys will be smoothed out.

The third dataframe will be the one where we impute the missing values with the median of the column. \
This works in the same way as the mean, but it is more robust to outliers. \
It also tends to produce values that actually occur in the data (typically integers here), which is more appropriate for the data we are dealing with.
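
To illustrate the difference between the two strategies, here is a minimal sketch on a made-up column (the `ratings` frame and its values are illustrative only, not taken from the bank data). A single outlier pulls the mean far away from the typical value, while the median stays put.

```python
import numpy as np
import pandas as pd

# Made-up column with one outlier (100.0) and one missing value.
ratings = pd.DataFrame({"score": [1.0, 2.0, 2.0, 3.0, 100.0, np.nan]})

# Same calls as used for the real data: fill NaNs with the column mean / median.
mean_filled = ratings.fillna(ratings.mean())
median_filled = ratings.fillna(ratings.median())

print(mean_filled["score"].iloc[-1])    # value imputed with the mean
print(median_filled["score"].iloc[-1])  # value imputed with the median
```

Here the mean-imputed value is dragged towards the outlier, whereas the median-imputed value is one that already occurs in the column.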


In the fourth dataframe we will try to estimate the missing values by using a linear regression model. \
This will allow us to keep all the rows, but it will also introduce a bias in the data. \
This time the bias comes from the model, which estimates the missing values based on the other columns.
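
One way regression-based imputation could look is sketched below. This is only an illustration under stated assumptions: the helper name `impute_with_regression`, the choice to temporarily mean-fill the predictor columns, and the decision to leave the label column out of the predictors are all assumptions made for this sketch, and the demo frame is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def impute_with_regression(df, target_col=None):
    """Fill NaNs in each feature column with linear-regression predictions.

    Hypothetical helper: the predictors are the other feature columns, with
    their own gaps temporarily filled by the column mean. The label column
    (target_col) is left out so that it does not leak into the features.
    """
    filled = df.copy()
    features = [c for c in df.columns if c != target_col]
    # Mean-filled copy that is only used as model input.
    predictors = df[features].fillna(df[features].mean())

    for col in features:
        missing = df[col].isna()
        if not missing.any() or missing.all():
            continue  # nothing to fill, or nothing to learn from
        others = [c for c in features if c != col]
        reg = LinearRegression()
        reg.fit(predictors.loc[~missing, others], df.loc[~missing, col])
        filled.loc[missing, col] = reg.predict(predictors.loc[missing, others])
    return filled


# Synthetic example (not the bank data): column "b" is exactly 2 * "a".
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, np.nan, 8.0, 10.0],
    "satisfied": [0, 0, 1, 1, 1],
})
demo_filled = impute_with_regression(demo, target_col="satisfied")
print(demo_filled)
```

Because the predictions are real numbers, they would probably need rounding (and possibly clipping to the valid answer range) before being used on survey-style data.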

In [9]:

# drop rows with missing values
tabular_data_no_na = tabular_data.dropna()
tabular_data_no_na.attrs['name'] = 'No missing values'

# fill missing values with mean
tabular_data_mean = tabular_data.fillna(tabular_data.mean())
tabular_data_mean.attrs['name'] = 'Mean'

# fill missing values with median
tabular_data_median = tabular_data.fillna(tabular_data.median())
tabular_data_median.attrs['name'] = 'Median'

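# fill missing values with linear regression
# NOTE: not implemented yet. As a placeholder, every missing cell is filled with
# one random integer, so this frame is built exactly like the 'Random' one below.
tabular_data_linear = tabular_data.fillna(math.floor(1 + (5-1)*random.random()))
tabular_data_linear.attrs['name'] = 'Linear regression'

# To have some fun and to see how a model would perform with a terrible dataset, I decided to fill the missing values with a random number.
# Note: a single integer is drawn (1 to 4, since random.random() < 1 means the floor never reaches 5) and reused for every missing cell.
tabular_data_random = tabular_data.fillna(math.floor(1 + (5-1)*random.random()))
tabular_data_random.attrs['name'] = 'Random'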


dataframes = [tabular_data_no_na, tabular_data_mean, tabular_data_median, tabular_data_linear, tabular_data_random]

## Scoring and Comparing the results

Before we can start using the different classifiers we first need to define a function that will allow us to compare the results fairly. \
Using the function below we will be able to get a score for each classifier and each dataframe.

To get a somewhat reliable score we will use 6-fold cross validation. \
This means that we will split the data into 6 different sets, using 5 of them to train the model and the remaining one to test it. \
We will then repeat this process 6 times, so that each set is used as the test set exactly once. \
The scores obtained in each iteration will then be averaged to get the final score.

We also record the median and the standard deviation of the scores obtained across the iterations, \
as well as the confusion matrix averaged over the folds (which gives the false positives and false negatives) and how long each classifier took to run.
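
The splitting scheme described above can be checked on a toy example. The sketch below uses made-up arrays (`X_demo`, `y_demo`) and scikit-learn's `StratifiedKFold` to confirm that every sample lands in the test set exactly once and that each round trains on the remaining five folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 12 samples, 2 features, perfectly balanced binary labels.
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0, 1] * 6)

cv = StratifiedKFold(n_splits=6)
test_counts = np.zeros(len(X_demo), dtype=int)  # how often each sample is tested
fold_sizes = []                                 # (train size, test size) per round

for train_idx, test_idx in cv.split(X_demo, y_demo):
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))

print(test_counts)
print(fold_sizes)
```

Stratification additionally keeps the class proportions in every fold close to those of the full dataset, which matters when the classes are unbalanced.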

In [10]:
comparison_dict = {}


def eval_and_graph(model, model_name, X, y, df):
    """Score `model` on (X, y) with 6-fold cross validation and store the results."""
    # Accuracy of each of the 6 folds
    scores = cross_val_score(model, X, y, cv=6, scoring='accuracy')

    # Calculate mean, median and standard deviation of the fold accuracies
    mean_accuracy = scores.mean()
    std_dev = scores.std()
    median_accuracy = np.median(scores)

    # Refit on stratified folds to accumulate the confusion matrices
    cv = StratifiedKFold(n_splits=6)
    summed_confusion_matrix = np.zeros((2, 2))

    for train_index, test_index in cv.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        summed_confusion_matrix += confusion_matrix(y_test, y_pred)

    # Store the results under classifier name -> dataframe name
    comparison_dict[model_name][df.attrs['name']] = {
        'mean_accuracy': mean_accuracy,
        'median_accuracy': median_accuracy,
        'std_dev': std_dev,
        'confusion_matrix': summed_confusion_matrix / cv.get_n_splits()
    }

    print(f'{model_name} {df.attrs["name"]} mean accuracy: {mean_accuracy}')
        

In [11]:

# Initialize summary table
comparison_dict["KNN"] = {}

begin_time = time.time()
for df in dataframes:
    X = df.drop(['satisfied'], axis=1)
    y = df['satisfied']

    # 5 nearest neighbours, default (Euclidean) distance
    knn = KNeighborsClassifier(n_neighbors=5)
    eval_and_graph(knn, "KNN", X, y, df)

comparison_dict["KNN"]["time"] = time.time() - begin_time


KNN No missing values mean accuracy: 0.8036399913718723
KNN Mean mean accuracy: 0.7860000000000001
KNN Median mean accuracy: 0.7863333333333333
KNN Linear regression mean accuracy: 0.7786666666666667
KNN Random mean accuracy: 0.7786666666666667


## K nearest neighbour

We will start by using the K nearest neighbour classifier. \
This classifier is based on the idea that the data points that are close to each other are more likely to belong to the same class.

Using the scikit-learn library we can easily implement this classifier. \
In this case we will use the 5 nearest neighbours and the distance metric will be the Euclidean distance.

We also use a cross validation with 6 folds to estimate the accuracy of the classifier. \
This will allow us to have a more robust estimate of the accuracy of the classifier, by splitting the data in 6 different ways and using each time a different part of the data as test set.

To evaluate the performance of the classifier we will use the confusion matrix together with the mean, median and standard deviation of the accuracy scores. \
Based on those values we will select the best dataframe to use for the classifier.

Based on these results we can see that the best dataframe to use is the one where we drop the rows with missing values. \
It has the highest mean accuracy (about 0.804, against 0.779 to 0.786 for the imputed dataframes).

We can also see that the accuracy of the classifier is not the best, with barely more than 80% of the predictions being correct. \
But being limited to a small dataset of only 3000 rows and 23 columns, we can't expect to have a very high accuracy.
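
To make the idea behind the classifier concrete, the sketch below re-implements the prediction step by hand (Euclidean distance, then a majority vote among the 5 nearest points) and compares it with scikit-learn on made-up points. The helper `knn_predict` and the toy arrays are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def knn_predict(X_train, y_train, x_new, k=5):
    """Hypothetical helper: majority vote among the k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    votes = np.bincount(y_train[nearest])                # count labels among them
    return int(np.argmax(votes))


# Two made-up clusters: class 0 near (1.5, 1.5), class 1 near (8.5, 8.5).
X_train = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
                    [8, 8], [8, 9], [9, 8], [9, 9]], dtype=float)
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x_new = np.array([2.5, 2.5])

manual_class = knn_predict(X_train, y_train, x_new, k=5)
sk_class = int(
    KNeighborsClassifier(n_neighbors=5)
    .fit(X_train, y_train)
    .predict(x_new.reshape(1, -1))[0]
)
print(manual_class, sk_class)
```

Four of the five nearest neighbours of the new point belong to the first cluster, so both versions assign it to class 0.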


In [12]:
import matplotlib.pyplot as plt  # needed for the tree plots below

comparison_dict["DecisionTree"] = {}

begin = time.time()

for df in dataframes:
    X = df.drop(['satisfied'], axis=1)
    y = df['satisfied']

    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X, y)

    # Visualize the decision tree fitted on the full dataframe
    plt.figure(figsize=(20, 10))
    plot_tree(clf, filled=True, feature_names=list(X.columns), class_names=['Not Satisfied', 'Satisfied'], rounded=True)
    plt.title(f"Decision Tree (Dataframe: {df.attrs['name']})")
    plt.show()

    eval_and_graph(clf, "DecisionTree", X, y, df)

comparison_dict["DecisionTree"]["time"] = time.time() - begin

In [None]:
# Neural Network



comparison_dict["Neural Network"] = {}
start = time.time()



for df in dataframes:
    X = df.drop(['satisfied'], axis=1)
    y = df['satisfied']
    
    fold_accuracies = []
    kf = StratifiedKFold(n_splits=6)

    # StratifiedKFold needs the labels as well to keep the class balance in each fold
    for train_index, test_index in kf.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # Create and train the model
        model = Sequential()
        model.add(Dense(64, input_dim=X_train.shape[1], activation='relu'))
        model.add(Dense(32, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        model.fit(X_train, y_train, epochs=80, verbose=0)

        # Evaluate the model: scores is [loss, accuracy]
        scores = model.evaluate(X_test, y_test, verbose=0)
        fold_accuracies.append(scores[1])

    mean_accuracy = np.mean(fold_accuracies)
    std_dev = np.std(fold_accuracies)
    median_accuracy = np.median(fold_accuracies)

    comparison_dict["Neural Network"][df.attrs['name']] = {
        'mean_accuracy': mean_accuracy,
        'median_accuracy': median_accuracy,
        'std_dev': std_dev,
    }

comparison_dict["Neural Network"]["time"] = time.time() - start

In [None]:
# Support Vector Machine

comparison_dict["SVM"] = {}

begin = time.time()

for df in dataframes:
    X = df.drop(['satisfied'], axis=1)
    y = df['satisfied']

    clf = svm.SVC()
    clf.fit(X, y)

    eval_and_graph(clf,"SVM", X, y, df)

comparison_dict["SVM"]["time"] = time.time() - begin

SVM No missing values mean accuracy: 0.8265908110440036
SVM Mean mean accuracy: 0.8166666666666668
SVM Median mean accuracy: 0.8163333333333332
SVM Linear regression mean accuracy: 0.8156666666666665
SVM Random mean accuracy: 0.8156666666666665


## Support vector machine

A support vector machine is a classifier that tries to find the best hyperplane that separates the data into two classes. \
The best hyperplane is the one that maximizes the margin, that is, the distance to the closest points of the two classes. \
The points that are closest to the hyperplane are called support vectors.

This is similar to the K nearest neighbour classifier in that it is based on distances between data points, but training only produces this hyperplane. \
So when we want to predict the class of a new data point, we only need to check on which side of the hyperplane it is.
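
The "which side of the hyperplane" idea can be sketched on made-up points. Note the assumptions: the sketch uses a linear kernel so that the hyperplane lives directly in the input space, whereas `svm.SVC()` as called in this notebook uses its default RBF kernel; the toy arrays are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up, linearly separable groups of points.
X_toy = np.array([[0, 0], [0, 1], [1, 0],
                  [3, 3], [3, 4], [4, 3]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Linear kernel (an assumption for this sketch); a large C approximates a hard margin.
clf_toy = SVC(kernel="linear", C=1000.0).fit(X_toy, y_toy)

new_points = np.array([[0.5, 0.5], [3.5, 3.5]])
side = clf_toy.decision_function(new_points)  # signed score: the sign gives the side
pred = clf_toy.predict(new_points)
n_support = len(clf_toy.support_vectors_)     # only these points define the boundary

print(side, pred, n_support)
```

A negative score places a point on the side of the first class and a positive score on the side of the second, so prediction does not need to look at the training points again.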


In [None]:
# Naive Bayes

comparison_dict["NaiveBayes"] = {}

begin = time.time()
for df in dataframes:
    X = df.drop(['satisfied'], axis=1)
    y = df['satisfied']

    clf = GaussianNB()
    clf.fit(X, y)

    eval_and_graph(clf,"NaiveBayes", X, y, df)
    
comparison_dict["NaiveBayes"]["time"] = time.time() - begin

NaiveBayes No missing values mean accuracy: 0.8003361374748347
NaiveBayes Mean mean accuracy: 0.7876666666666666
NaiveBayes Median mean accuracy: 0.789
NaiveBayes Linear regression mean accuracy: 0.7896666666666667
NaiveBayes Random mean accuracy: 0.7896666666666667


## Naive Bayes

Naive Bayes works by calculating the probability of a data point belonging to each class. \
It assumes that the features are independent of each other given the class, so by taking the product of the per-feature probabilities (together with the prior probability of the class) we obtain a score proportional to the probability of the data point belonging to that class. \
The class with the highest probability is the one that the data point is most likely to belong to.

Instead of mapping the data points in an n-dimensional space, it generates probabilities for each dimension separately. \
This means it has few parameters to estimate, so it can work with a smaller dataset and tends to be less prone to overfitting.
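
The product of per-feature probabilities can be written out by hand for the Gaussian variant. The sketch below does so on made-up points and compares the resulting class with `GaussianNB`; the helper `gaussian_pdf` and the toy arrays are hypothetical, and scikit-learn additionally applies a small variance smoothing, so only the predicted class is compared, not the exact probabilities.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution, evaluated per feature."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)


# Made-up data: class 0 around (1.5, 4.5), class 1 around (4.5, 1.5).
X_nb = np.array([[1.0, 5.0], [2.0, 4.0], [1.5, 4.5],
                 [4.0, 1.0], [5.0, 2.0], [4.5, 1.5]])
y_nb = np.array([0, 0, 0, 1, 1, 1])
x_new = np.array([1.8, 4.2])

scores = {}
for label in np.unique(y_nb):
    rows = X_nb[y_nb == label]
    prior = len(rows) / len(X_nb)
    # Naive assumption: multiply the per-feature likelihoods.
    likelihood = np.prod(gaussian_pdf(x_new, rows.mean(axis=0), rows.var(axis=0)))
    scores[int(label)] = prior * likelihood

manual_class = max(scores, key=scores.get)
sk_class = int(GaussianNB().fit(X_nb, y_nb).predict(x_new.reshape(1, -1))[0])
print(manual_class, sk_class)
```

Each class only needs a mean and a variance per feature, which is why the model has so few parameters to estimate.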
