# i. BUSINESS JOB DESCRIPTION

*   Data Money is a company that provides data analysis consultancy services to partner companies worldwide.
*   Its main differentiator is the methodology used to train and tune machine learning models. The company's Data Scientists carry out this work and explain the behaviour of each algorithm with verified results.

# ii. THE CHALLENGE

*   To keep improving the performance of its data science team, Data Money needs to run a larger number of model trials in order to better understand which scenarios suit each algorithm. These choices are expected to be backed by verified metrics, which serve as a basis for comparing machine learning models and identifying when each metric is at its best (maximized or minimized, depending on the metric).
*   As a Data Scientist recently hired by the company, I was tasked with running trials for classification, regression and clustering machine learning models and reporting the results to the other teams.

# iii. BUSINESS CHALLENGE SPECIFICATIONS

*   Performance is to be verified on 3 different datasets:
    1. Training dataset (used both to fit the model and to generate predictions)
    2. Validation dataset
    3. Test dataset


*   The results will then be reported as one table per dataset, comparing the metrics of a selection of machine learning models.
*   The only exception is the clustering analysis: since clustering is an unsupervised learning task, its results will be reported in a single table.
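
The protocol above (fit once on the training data, then score every dataset with the same metrics) can be sketched as a small helper. This is only an illustration: `evaluate_on_splits`, `MajorityClassModel` and the toy data are hypothetical names and values introduced here, not part of the notebook, and the helper assumes any model object that exposes `fit` and `predict`.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Split = Tuple[Sequence, Sequence]               # (features, labels)
Metric = Callable[[Sequence, Sequence], float]  # metric(y_true, y_pred)


def evaluate_on_splits(model, splits: Dict[str, Split],
                       metrics: Dict[str, Metric]) -> List[dict]:
    """Fit on the 'training' split, then score every split with every metric."""
    X_train, y_train = splits["training"]
    model.fit(X_train, y_train)

    rows = []
    for split_name, (X, y) in splits.items():
        y_hat = model.predict(X)
        row = {"dataset": split_name}
        for metric_name, metric_fn in metrics.items():
            row[metric_name] = metric_fn(y, y_hat)
        rows.append(row)
    return rows


class MajorityClassModel:
    """Hypothetical stand-in model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data, for illustration only
splits = {
    "training":   ([[0], [1], [2], [3]], [1, 1, 1, 0]),
    "validation": ([[4], [5]],           [1, 0]),
    "test":       ([[6], [7]],           [0, 0]),
}
results = evaluate_on_splits(MajorityClassModel(), splits, {"accuracy": accuracy})
for row in results:
    print(row)
```

With scikit-learn, the same helper could presumably receive an estimator such as `KNeighborsClassifier` and a metrics dictionary such as `{'f1_score': mt.f1_score}`, since both follow the `fit`/`predict` and `metric(y_true, y_pred)` conventions assumed here.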

# 0.0 IMPORTS AND HELPER FUNCTIONS

In [1]:
import numpy as np
import pandas as pd
import warnings

from matplotlib import pyplot as plt

from sklearn.neighbors      import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree           import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble       import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model   import LogisticRegression, LinearRegression, Lasso, Ridge, ElasticNet

from sklearn                import preprocessing as pp
from sklearn                import cluster as c
from sklearn                import metrics as mt
from sklearn                import model_selection as ms 
from sklearn                import datasets as dt


In [2]:
warnings.filterwarnings('ignore')

# 1.0 CLASSIFICATION MODELS

*   The main metric for evaluating overall model performance will be the f1_score, since it balances precision and recall.
*   Accuracy, precision and recall will also be listed, together with the hyperparameter set tuned in each iteration, to provide insight into how each model works.
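
As a quick illustration of why the f1_score balances the two metrics, it can be computed as the harmonic mean of precision and recall. The sketch below uses made-up confusion counts (`tp`, `fp`, `fn` are illustrative values, not taken from the datasets).

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall and F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_from_precision_recall(precision, recall)


# Illustrative counts only
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f'precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}')
```

Because the harmonic mean is pulled towards the smaller of the two values, a model cannot reach a high f1_score by maximizing only precision or only recall.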

In [3]:
# Forward slashes avoid invalid escape sequences in the path strings and work on every OS
X_training_classification = pd.read_csv('datasets/classification/X_training.csv')
y_training_classification = pd.read_csv('datasets/classification/y_training.csv')

X_validation_classification = pd.read_csv('datasets/classification/X_validation.csv')
y_validation_classification = pd.read_csv('datasets/classification/y_validation.csv')

X_test_classification = pd.read_csv('datasets/classification/X_test.csv')
y_test_classification = pd.read_csv('datasets/classification/y_test.csv')

In [4]:
print(f'Training dataset: rows = {X_training_classification.shape[0]}, columns = {X_training_classification.shape[1]}')
print(f'Validation dataset: rows = {X_validation_classification.shape[0]}, columns = {X_validation_classification.shape[1]}')
print(f'Test dataset: rows = {X_test_classification.shape[0]}, columns = {X_test_classification.shape[1]}')

print(f'Number of classes:{len(np.unique(y_training_classification))}, Distinct Classes:{np.unique(y_training_classification)}')

Training dataset: rows = 72515, columns = 25
Validation dataset: rows = 31079, columns = 25
Test dataset: rows = 25893, columns = 25
Number of classes:2, Distinct Classes:[0 1]


*   Every label array ('y' variables) matches its feature counterpart ('X' variables) in number of rows.
*   The datasets are binary, with '0' and '1' as the class labels.
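
Since accuracy and f1_score are read differently depending on how balanced the two classes are, a quick share-per-class check can be useful. The sketch below is a minimal example on toy labels; `class_shares` is a hypothetical helper name, and the labels are illustrative rather than the notebook's data.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples belonging to each class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Toy labels, for illustration only
toy_labels = [0, 0, 0, 1, 1]
print(class_shares(toy_labels))  # {0: 0.6, 1: 0.4}
```

In the notebook, the same function could be applied to `np.ravel(y_training_classification)`.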

## 1.1 KNN

In [5]:
neighbours_list = list()

acc_list_training = list()
precision_list_training = list()
recall_list_training = list()
f1_list_training = list()

acc_list_validation = list()
precision_list_validation = list()
recall_list_validation = list()
f1_list_validation = list()

acc_list_test = list()
precision_list_test = list()
recall_list_test = list()
f1_list_test = list()

for i in range(3, 15, 2):
    neighbours_list.append(i)
    #Model Definition
    knn_classifier = KNeighborsClassifier(n_neighbors = i)
    
    #Model Train
    knn_classifier.fit(X_training_classification, np.ravel(y_training_classification))
    
    #Model Predict
    yhat_knn_training   = knn_classifier.predict(X_training_classification)
    yhat_knn_validation = knn_classifier.predict(X_validation_classification)
    yhat_knn_test       = knn_classifier.predict(X_test_classification)
    
    #training scores
    acc_score_training       = mt.accuracy_score(y_training_classification, yhat_knn_training)
    precision_score_training = mt.precision_score(y_training_classification, yhat_knn_training)
    recall_score_training    = mt.recall_score(y_training_classification, yhat_knn_training)
    f1_score_training        = mt.f1_score(y_training_classification, yhat_knn_training)
    
    acc_list_training.append(acc_score_training)
    precision_list_training.append(precision_score_training)
    recall_list_training.append(recall_score_training)
    f1_list_training.append(f1_score_training)
    
    #validation scores
    acc_score_validation       = mt.accuracy_score(y_validation_classification, yhat_knn_validation)
    precision_score_validation = mt.precision_score(y_validation_classification, yhat_knn_validation)
    recall_score_validation    = mt.recall_score(y_validation_classification, yhat_knn_validation)
    f1_score_validation        = mt.f1_score(y_validation_classification, yhat_knn_validation)
    
    acc_list_validation.append(acc_score_validation)
    precision_list_validation.append(precision_score_validation)
    recall_list_validation.append(recall_score_validation)
    f1_list_validation.append(f1_score_validation)
    
    #test scores
    acc_score_test       = mt.accuracy_score(y_test_classification, yhat_knn_test)
    precision_score_test = mt.precision_score(y_test_classification, yhat_knn_test)
    recall_score_test    = mt.recall_score(y_test_classification, yhat_knn_test)
    f1_score_test        = mt.f1_score(y_test_classification, yhat_knn_test)
    
    acc_list_test.append(acc_score_test)
    precision_list_test.append(precision_score_test)
    recall_list_test.append(recall_score_test)
    f1_list_test.append(f1_score_test)
    

In [6]:
knn_training_df = pd.DataFrame({'neighbors': neighbours_list, 'accuracy_score':acc_list_training, 
                                'precision_score': precision_list_training, 'recall_score':recall_list_training, 
                                'f1_score':f1_list_training})

knn_training_max = knn_training_df.query('f1_score == f1_score.max()')
knn_training_max

|    | neighbors | accuracy_score | precision_score | recall_score | f1_score |
|----|-----------|----------------|-----------------|--------------|----------|
| 0  | 3         | 0.832186       | 0.812008        | 0.79741      | 0.804643 |


In [7]:
knn_validation_df = pd.DataFrame({'neighbors': neighbours_list, 'accuracy_score':acc_list_validation, 
                                'precision_score': precision_list_validation, 'recall_score':recall_list_validation, 
                                'f1_score':f1_list_validation})

knn_validation_max = knn_validation_df.query('f1_score == f1_score.max()')
knn_validation_max

|    | neighbors | accuracy_score | precision_score | recall_score | f1_score |
|----|-----------|----------------|-----------------|--------------|----------|
| 0  | 3         | 0.676277       | 0.627851        | 0.621278     | 0.624548 |


In [8]:
knn_test_df = pd.DataFrame({'neighbors': neighbours_list, 'accuracy_score':acc_list_test, 
                            'precision_score': precision_list_test, 'recall_score':recall_list_test, 
                            'f1_score':f1_list_test})

knn_test_max = knn_test_df.query('f1_score == f1_score.max()')
knn_test_max

|    | neighbors | accuracy_score | precision_score | recall_score | f1_score |
|----|-----------|----------------|-----------------|--------------|----------|
| 0  | 3         | 0.672228       | 0.630462        | 0.611879     | 0.621031 |


*   On every dataset, the best f1_score was obtained with the smallest number of neighbors tested, which justifies choosing n_neighbors=3.
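
One possible reading of the gap between the training and validation scores is that a small number of neighbors lets KNN follow the training points very closely. The toy sketch below (pure Python, hypothetical data and function names, not the scikit-learn implementation) shows the majority-vote idea: with k=1 every training point is its own nearest neighbor, so training accuracy is perfect even when a label is noisy.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def training_accuracy(train_X, train_y, k):
    """Accuracy when the model predicts the same points it was fitted on."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(train_X, train_y))
    return hits / len(train_y)


# Toy data: the last point carries a "noisy" label inside the class-0 region
train_X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (0.2, 0.1)]
train_y = [0, 0, 1, 1, 1]

print(training_accuracy(train_X, train_y, k=1))  # 1.0
print(training_accuracy(train_X, train_y, k=3))  # 0.8
```

With k=3 the noisy point is outvoted by its two class-0 neighbors, so the training score drops while the decision rule becomes smoother.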

## 1.2 Decision Tree

In [9]:
max_depth_list = list()

acc_list_training = list()
precision_list_training = list()
recall_list_training = list()
f1_list_training = list()

acc_list_validation = list()
precision_list_validation = list()
recall_list_validation = list()
f1_list_validation = list()

acc_list_test = list()
precision_list_test = list()
recall_list_test = list()
f1_list_test = list()

for i in range(1, 20, 2):
    max_depth_list.append(i)
    #Model Define
    model_tree = DecisionTreeClassifier(max_depth=i, random_state=42)
    
    #Model Training
    model_tree.fit(X_training_classification, np.ravel(y_training_classification))
    
    #Model Predict
    yhat_tree_training   = model_tree.predict(X_training_classification)
    yhat_tree_validation = model_tree.predict(X_validation_classification)
    yhat_tree_test       = model_tree.predict(X_test_classification)
    
    #training scores
    acc_score_training       = mt.accuracy_score(y_training_classification, yhat_tree_training)
    precision_score_training = mt.precision_score(y_training_classification, yhat_tree_training)
    recall_score_training    = mt.recall_score(y_training_classification, yhat_tree_training)
    f1_score_training        = mt.f1_score(y_training_classification, yhat_tree_training)
    
    acc_list_training.append(acc_score_training)
    precision_list_training.append(precision_score_training)
    recall_list_training.append(recall_score_training)
    f1_list_training.append(f1_score_training)
    
    #validation scores
    acc_score_validation       = mt.accuracy_score(y_validation_classification, yhat_tree_validation)
    precision_score_validation = mt.precision_score(y_validation_classification, yhat_tree_validation)
    recall_score_validation    = mt.recall_score(y_validation_classification, yhat_tree_validation)
    f1_score_validation        = mt.f1_score(y_validation_classification, yhat_tree_validation)
    
    acc_list_validation.append(acc_score_validation)
    precision_list_validation.append(precision_score_validation)
    recall_list_validation.append(recall_score_validation)
    f1_list_validation.append(f1_score_validation)
    
    #test scores
    acc_score_test       = mt.accuracy_score(y_test_classification, yhat_tree_test)
    precision_score_test = mt.precision_score(y_test_classification, yhat_tree_test)
    recall_score_test    = mt.recall_score(y_test_classification, yhat_tree_test)
    f1_score_test        = mt.f1_score(y_test_classification, yhat_tree_test)
    
    acc_list_test.append(acc_score_test)
    precision_list_test.append(precision_score_test)
    recall_list_test.append(recall_score_test)
    f1_list_test.append(f1_score_test)
    
    

In [10]:
tree_training_df = pd.DataFrame({'max_depth': max_depth_list, 'accuracy_score':acc_list_training, 
                                'precision_score': precision_list_training, 'recall_score':recall_list_training, 
                                'f1_score':f1_list_training})

tree_training_max = tree_training_df.query('f1_score == f1_score.max()')
tree_training_max

|    | max_depth | accuracy_score | precision_score | recall_score | f1_score |
|----|-----------|----------------|-----------------|--------------|----------|
| 9  | 19        | 0.989712       | 0.993089        | 0.983104     | 0.988072 |


In [11]:
tree_validation_df = pd.DataFrame({'max_depth': max_depth_list, 'accuracy_score':acc_list_validation, 
                                   'precision_score': precision_list_validation, 'recall_score':recall_list_validation, 
                                   'f1_score':f1_list_validation})

tree_validation_max = tree_validation_df.query('f1_score == f1_score.max()')
tree_validation_max

|    | max_depth | accuracy_score | precision_score | recall_score | f1_score |
|----|-----------|----------------|-----------------|--------------|----------|
| 6  | 13        | 0.952315       | 0.9563          | 0.932586     | 0.944294 |


In [12]:
tree_test_df = pd.DataFrame({'max_depth': max_depth_list, 'accuracy_score':acc_list_test, 
                                'precision_score': precision_list_test, 'recall_score':recall_list_test, 
                                'f1_score':f1_list_test})

tree_test_max = tree_test_df.query('f1_score == f1_score.max()')
tree_test_max

|    | max_depth | accuracy_score | precision_score | recall_score | f1_score |
|----|-----------|----------------|-----------------|--------------|----------|
| 6  | 13        | 0.951802       | 0.953881        | 0.935416     | 0.944558 |


*   When evaluating on the training dataset alone, all metrics (including f1_score) keep rising towards 100% as the tree depth increases, which indicates overfitting.
*   For the validation and test datasets, however, performance peaks at max_depth=13, indicating that deeper trees bring no further gain on unseen data.
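
The pattern described above (training scores that keep rising while validation scores peak and then decline) suggests choosing the depth from the validation curve and reading the training score only as an overfitting indicator. A minimal sketch, assuming the per-depth scores are available as plain lists; `summarize_depth_search` is a hypothetical helper and the two curves are illustrative placeholders, not the notebook's measured scores.

```python
def summarize_depth_search(depths, train_f1, val_f1):
    """Pick the depth with the best validation F1 and report the train/validation gap."""
    best_idx = max(range(len(depths)), key=lambda i: val_f1[i])
    return {
        "best_depth": depths[best_idx],
        "val_f1": val_f1[best_idx],
        "train_minus_val": round(train_f1[best_idx] - val_f1[best_idx], 6),
    }


depths = list(range(1, 20, 2))  # same grid as the loop above: 1, 3, ..., 19

# Illustrative placeholder curves, NOT the notebook's measured scores
train_f1 = [0.60, 0.70, 0.78, 0.84, 0.89, 0.93, 0.96, 0.98, 0.99, 1.00]
val_f1   = [0.59, 0.69, 0.76, 0.81, 0.85, 0.88, 0.90, 0.89, 0.88, 0.87]

summary = summarize_depth_search(depths, train_f1, val_f1)
print(summary)
```

In the notebook, the lists `f1_list_training` and `f1_list_validation` filled by the loop could be passed in place of the placeholder curves.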

## 1.3 Random Forest Classifier

In [13]:
params = {'n_estimators': [1, 10, 50, 100, 150, 200],
          'max_depth': [1, 5, 10, 15, 20]}

n_estimators_list = list()
max_depth_list = list()

acc_list_training = list()
precision_list_training = list()
recall_list_training = list()
f1_list_training = list()

acc_list_validation = list()
precision_list_validation = list()
recall_list_validation = list()
f1_list_validation = list()

acc_list_test = list()
precision_list_test = list()
recall_list_test = list()
f1_list_test = list()


for i in params.get('n_estimators'):
    for j in params.get('max_depth'):
        n_estimators_list.append(i)
        max_depth_list.append(j)
        #Model Definition
        rf_model = RandomForestClassifier(n_estimators=i, max_depth=j, random_state=42)

        #Model Training
        rf_model.fit(X_training_classification, np.ravel(y_training_classification))

        #Model Predict
        yhat_rf_training    = rf_model.predict(X_training_classification)
        yhat_rf_validation  = rf_model.predict(X_validation_classification)
        yhat_rf_test        = rf_model.predict(X_test_classification)

        #Model Scores
        #training scores
        acc_score_training       = mt.accuracy_score(y_training_classification, yhat_rf_training)
        precision_score_training = mt.precision_score(y_training_classification, yhat_rf_training)
        recall_score_training    = mt.recall_score(y_training_classification, yhat_rf_training)
        f1_score_training        = mt.f1_score(y_training_classification, yhat_rf_training)
    
        acc_list_training.append(acc_score_training)
        precision_list_training.append(precision_score_training)
        recall_list_training.append(recall_score_training)
        f1_list_training.append(f1_score_training)
    
        #validation scores
        acc_score_validation       = mt.accuracy_score(y_validation_classification, yhat_rf_validation)
        precision_score_validation = mt.precision_score(y_validation_classification, yhat_rf_validation)
        recall_score_validation    = mt.recall_score(y_validation_classification, yhat_rf_validation)
        f1_score_validation        = mt.f1_score(y_validation_classification, yhat_rf_validation)
    
        acc_list_validation.append(acc_score_validation)
        precision_list_validation.append(precision_score_validation)
        recall_list_validation.append(recall_score_validation)
        f1_list_validation.append(f1_score_validation)

        #test scores
        acc_score_test       = mt.accuracy_score(y_test_classification, yhat_rf_test)
        precision_score_test = mt.precision_score(y_test_classification, yhat_rf_test)
        recall_score_test    = mt.recall_score(y_test_classification, yhat_rf_test)
        f1_score_test        = mt.f1_score(y_test_classification, yhat_rf_test)
    
        acc_list_test.append(acc_score_test)
        precision_list_test.append(precision_score_test)
        recall_list_test.append(recall_score_test)
        f1_list_test.append(f1_score_test)

In [14]:
rf_training_df = pd.DataFrame({'max_depth': max_depth_list, 'n_estimators':n_estimators_list,
                               'accuracy_score':acc_list_training, 'precision_score': precision_list_training,
                               'recall_score':recall_list_training, 'f1_score':f1_list_training})


rf_training_max = rf_training_df.query('f1_score == f1_score.max()')
rf_training_max

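|    | max_depth | n_estimators | accuracy_score | precision_score | recall_score | f1_score |
|----|-----------|--------------|----------------|-----------------|--------------|----------|
| 24 | 20        | 150          | 0.996525       | 0.998242        | 0.993732     | 0.995982 |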


In [15]:
rf_validation_df = pd.DataFrame({'max_depth': max_depth_list, 'n_estimators':n_estimators_list,
                               'accuracy_score':acc_list_validation, 'precision_score': precision_list_validation,
                               'recall_score':recall_list_validation, 'f1_score':f1_list_validation})

rf_validation_max = rf_validation_df.query('f1_score == f1_score.max()')
rf_validation_max

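|    | max_depth | n_estimators | accuracy_score | precision_score | recall_score | f1_score |
|----|-----------|--------------|----------------|-----------------|--------------|----------|
| 29 | 20        | 200          | 0.965185       | 0.974271        | 0.944614     | 0.959213 |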


In [16]:
rf_test_df = pd.DataFrame({'max_depth': max_depth_list, 'n_estimators':n_estimators_list,
                               'accuracy_score':acc_list_test, 'precision_score': precision_list_test,
                               'recall_score':recall_list_test, 'f1_score':f1_list_test})

rf_test_max = rf_test_df.query('f1_score == f1_score.max()')
rf_test_max

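|    | max_depth | n_estimators | accuracy_score | precision_score | recall_score | f1_score |
|----|-----------|--------------|----------------|-----------------|--------------|----------|
| 24 | 20        | 150          | 0.963928       | 0.971777        | 0.945271     | 0.958341 |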


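*   As with the Decision Tree, a tree-based model such as the Random Forest tends to overfit the training data as the number of estimators and the maximum tree depth increase: the training f1_score reaches about 0.996, while the validation and test f1_scores stay near 0.96.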
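
The nested loops over `params` visit the grid in a fixed order, which is what the row indices in the result tables refer to. As a sketch, the same grid can be enumerated with `itertools.product` to map an index back to its hyperparameters (the variable `grid` is introduced here for illustration).

```python
from itertools import product

params = {'n_estimators': [1, 10, 50, 100, 150, 200],
          'max_depth': [1, 5, 10, 15, 20]}

# Same order as the nested loops: n_estimators is the outer loop, max_depth the inner one
grid = list(product(params['n_estimators'], params['max_depth']))

print(len(grid))  # 30
print(grid[24])   # (150, 20)
print(grid[29])   # (200, 20)
```

Rows 24 and 29 of the tables above therefore correspond to 150 and 200 estimators, both at max_depth=20.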

## 1.4 Logistic Regression Classifier

In [17]:
params = {'C':[0.1, 0.5, 1.0, 2.0],
          'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'],
          'max_iter': [50, 100, 150, 200]}

c_estimator_list = list()
solver_estimator_list = list()
max_iter_estimator_list = list()

acc_list_training = list()
precision_list_training = list()
recall_list_training = list()
f1_list_training = list()

acc_list_validation = list()
precision_list_validation = list()
recall_list_validation = list()
f1_list_validation = list()

acc_list_test = list()
precision_list_test = list()
recall_list_test = list()
f1_list_test = list()

for i in params.get('C'):
    for j in params.get('solver'):
        for p in params.get('max_iter'):
            c_estimator_list.append(i)
            solver_estimator_list.append(j)
            max_iter_estimator_list.append(p)
            
            #Model Define
            lr_model = LogisticRegression(C=i, solver=j, max_iter=p, random_state=42)
            
            #Model Training
            lr_model.fit(X_training_classification, np.ravel(y_training_classification))
            
            #Model Predict
            yhat_lr_training    = lr_model.predict(X_training_classification)
            yhat_lr_validation  = lr_model.predict(X_validation_classification)
            yhat_lr_test        = lr_model.predict(X_test_classification)
            
            #Model Scores
            #Training Scores
            acc_score_training       = mt.accuracy_score(y_training_classification, yhat_lr_training)
            precision_score_training = mt.precision_score(y_training_classification, yhat_lr_training)
            recall_score_training    = mt.recall_score(y_training_classification, yhat_lr_training)
            f1_score_training        = mt.f1_score(y_training_classification, yhat_lr_training)
    
            acc_list_training.append(acc_score_training)
            precision_list_training.append(precision_score_training)
            recall_list_training.append(recall_score_training)
            f1_list_training.append(f1_score_training)
            
            #Validation Scores
            acc_score_validation       = mt.accuracy_score(y_validation_classification, yhat_lr_validation)
            precision_score_validation = mt.precision_score(y_validation_classification, yhat_lr_validation)
            recall_score_validation    = mt.recall_score(y_validation_classification, yhat_lr_validation)
            f1_score_validation        = mt.f1_score(y_validation_classification, yhat_lr_validation)
    
            acc_list_validation.append(acc_score_validation)
            precision_list_validation.append(precision_score_validation)
            recall_list_validation.append(recall_score_validation)
            f1_list_validation.append(f1_score_validation)
            
            #Test Scores
            acc_score_test       = mt.accuracy_score(y_test_classification, yhat_lr_test)
            precision_score_test = mt.precision_score(y_test_classification, yhat_lr_test)
            recall_score_test    = mt.recall_score(y_test_classification, yhat_lr_test)
            f1_score_test        = mt.f1_score(y_test_classification, yhat_lr_test)
    
            acc_list_test.append(acc_score_test)
            precision_list_test.append(precision_score_test)
            recall_list_test.append(recall_score_test)
            f1_list_test.append(f1_score_test)

In [18]:
lr_training_df = pd.DataFrame({'C': c_estimator_list, 'solver':solver_estimator_list, 'max_iter':max_iter_estimator_list,
                               'accuracy_score':acc_list_training, 'precision_score': precision_list_training,
                               'recall_score':recall_list_training, 'f1_score':f1_list_training})

lr_training_max = lr_training_df.query('f1_score == f1_score.max()')
lr_training_max

|    | C   | solver          | max_iter | accuracy_score | precision_score | recall_score | f1_score |
|----|-----|-----------------|----------|----------------|-----------------|--------------|----------|
| 9  | 0.1 | newton-cg       | 100      | 0.876288       | 0.871866        | 0.837661     | 0.854421 |
| 10 | 0.1 | newton-cg       | 150      | 0.876288       | 0.871866        | 0.837661     | 0.854421 |
| 11 | 0.1 | newton-cg       | 200      | 0.876288       | 0.871866        | 0.837661     | 0.854421 |
| 12 | 0.1 | newton-cholesky | 50       | 0.876288       | 0.871866        | 0.837661     | 0.854421 |
| 13 | 0.1 | newton-cholesky | 100      | 0.876288       | 0.871866        | 0.837661     | 0.854421 |
| 14 | 0.1 | newton-cholesky | 150      | 0.876288       | 0.871866        | 0.837661     | 0.854421 |
| 15 | 0.1 | newton-cholesky | 200      | 0.876288       | 0.871866        | 0.837661     | 0.854421 |


In [19]:
lr_validation_df = pd.DataFrame({'C': c_estimator_list, 'solver':solver_estimator_list, 'max_iter':max_iter_estimator_list,
                               'accuracy_score':acc_list_validation, 'precision_score': precision_list_validation,
                               'recall_score':recall_list_validation, 'f1_score':f1_list_validation})

lr_validation_max = lr_validation_df.query('f1_score == f1_score.max()')
lr_validation_max

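|    | C   | solver    | max_iter | accuracy_score | precision_score | recall_score | f1_score |
|----|-----|-----------|----------|----------------|-----------------|--------------|----------|
| 81 | 2.0 | newton-cg | 100      | 0.874481       | 0.869421        | 0.83592      | 0.852341 |
| 82 | 2.0 | newton-cg | 150      | 0.874481       | 0.869421        | 0.83592      | 0.852341 |
| 83 | 2.0 | newton-cg | 200      | 0.874481       | 0.869421        | 0.83592      | 0.852341 |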


In [20]:
lr_test_df = pd.DataFrame({'C': c_estimator_list, 'solver':solver_estimator_list, 'max_iter':max_iter_estimator_list,
                               'accuracy_score':acc_list_test, 'precision_score': precision_list_test,
                               'recall_score':recall_list_test, 'f1_score':f1_list_test})

lr_test_max = lr_test_df.query('f1_score == f1_score.max()')
lr_test_max

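|    | C   | solver    | max_iter | accuracy_score | precision_score | recall_score | f1_score |
|----|-----|-----------|----------|----------------|-----------------|--------------|----------|
| 32 | 0.5 | newton-cg | 50       | 0.871857       | 0.868014        | 0.83502      | 0.851197 |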


*   For these datasets, the solvers with the best f1_score were newton-cg and newton-cholesky, both second-order (Newton-type) methods.
*   Since the penalty argument was left at its default, every solver here optimizes the same L2-penalized cost function. The lower scores of the other solvers (lbfgs, liblinear, sag, saga) therefore most likely reflect incomplete convergence within the tested max_iter values rather than a different penalty. The sag and saga solvers in particular are only guaranteed to converge quickly when the features are on a similar scale.
*   Both C and max_iter proved secondary once the cost function converged, showing comparatively little influence on the final metrics.
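
For reference, one common way to write the L2-penalized logistic regression objective, assuming labels recoded as $y_i \in \{-1, +1\}$, weights $w$ and intercept $b$ (symbols introduced here for illustration), is:

```latex
\min_{w,\, b} \;\; \frac{1}{2}\, w^{\top} w
\;+\; C \sum_{i=1}^{n} \log\!\left( 1 + \exp\!\left( -y_i \left( x_i^{\top} w + b \right) \right) \right)
```

In this form C multiplies the data-fit term, so it acts as an inverse regularization strength: a larger C puts relatively less weight on the penalty $\frac{1}{2} w^{\top} w$. All solvers in the grid target this same objective and differ only in how they search for its minimum.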

## 1.5 Final Results for Classification Models

### 1.5.1 Training Dataset 

In [31]:
#knn_training_max = knn_training_max.drop('neighbors', axis=1)
knn_training_max['model'] = 'KNN'
knn_training_max = knn_training_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

#tree_training_max = tree_training_max.drop('max_depth', axis=1)
tree_training_max['model'] = 'Decision Tree'
tree_training_max = tree_training_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

#rf_training_max = rf_training_max.drop(['n_estimators', 'max_depth'], axis=1)
rf_training_max['model'] = 'Random Forest'
rf_training_max = rf_training_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

#lr_training_max = lr_training_max.drop(['C', 'solver', 'max_iter'], axis=1)
lr_training_max['model'] = 'Logistic Regression'
lr_training_max = lr_training_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

In [62]:
training_classification_results = pd.concat([knn_training_max, tree_training_max, rf_training_max, lr_training_max.iloc[[0]]])
training_classification_results

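|    | model               | accuracy_score | precision_score | recall_score | f1_score |
|----|---------------------|----------------|-----------------|--------------|----------|
| 0  | KNN                 | 0.832186       | 0.812008        | 0.79741      | 0.804643 |
| 9  | Decision Tree       | 0.989712       | 0.993089        | 0.983104     | 0.988072 |
| 24 | Random Forest       | 0.996525       | 0.998242        | 0.993732     | 0.995982 |
| 0  | Logistic Regression | 0.876288       | 0.871866        | 0.837661     | 0.854421 |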


### 1.5.2 Validation Dataset

In [60]:
#knn_validation_max = knn_validation_max.drop('neighbors', axis=1)
knn_validation_max['model'] = 'KNN'
knn_validation_max = knn_validation_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

#tree_validation_max = tree_validation_max.drop('max_depth', axis=1)
tree_validation_max['model'] = 'Decision Tree'
tree_validation_max = tree_validation_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

#rf_validation_max = rf_validation_max.drop(['n_estimators', 'max_depth'], axis=1)
rf_validation_max['model'] = 'Random Forest'
rf_validation_max = rf_validation_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

#lr_validation_max = lr_validation_max.drop(['C', 'solver', 'max_iter'], axis=1)
lr_validation_max['model'] = 'Logistic Regression'
lr_validation_max = lr_validation_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

In [63]:
validation_classification_results = pd.concat([knn_validation_max, tree_validation_max, rf_validation_max, lr_validation_max.iloc[[0]]])
validation_classification_results

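|    | model               | accuracy_score | precision_score | recall_score | f1_score |
|----|---------------------|----------------|-----------------|--------------|----------|
| 0  | KNN                 | 0.676277       | 0.627851        | 0.621278     | 0.624548 |
| 6  | Decision Tree       | 0.952315       | 0.9563          | 0.932586     | 0.944294 |
| 29 | Random Forest       | 0.965185       | 0.974271        | 0.944614     | 0.959213 |
| 81 | Logistic Regression | 0.874481       | 0.869421        | 0.83592      | 0.852341 |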


### 1.5.3 Test Dataset

In [64]:
#knn_test_max = knn_test_max.drop('neighbors', axis=1)
knn_test_max['model'] = 'KNN'
knn_test_max = knn_test_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

#tree_test_max = tree_test_max.drop('max_depth', axis=1)
tree_test_max['model'] = 'Decision Tree'
tree_test_max = tree_test_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

#rf_test_max = rf_test_max.drop(['n_estimators', 'max_depth'], axis=1)
rf_test_max['model'] = 'Random Forest'
rf_test_max = rf_test_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

#lr_test_max = lr_test_max.drop(['C', 'solver', 'max_iter'], axis=1)
lr_test_max['model'] = 'Logistic Regression'
lr_test_max = lr_test_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

In [65]:
test_classification_results = pd.concat([knn_test_max, tree_test_max, rf_test_max, lr_test_max.iloc[[0]]])
test_classification_results

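|    | model               | accuracy_score | precision_score | recall_score | f1_score |
|----|---------------------|----------------|-----------------|--------------|----------|
| 0  | KNN                 | 0.672228       | 0.630462        | 0.611879     | 0.621031 |
| 6  | Decision Tree       | 0.951802       | 0.953881        | 0.935416     | 0.944558 |
| 24 | Random Forest       | 0.963928       | 0.971777        | 0.945271     | 0.958341 |
| 32 | Logistic Regression | 0.871857       | 0.868014        | 0.83502      | 0.851197 |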
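
To make the comparison across the tables in 1.5.1 and 1.5.3 easier to read, the sketch below ranks the models by their reported test f1_score and computes the difference between the reported training and test f1_scores. The numbers are copied from the tables above; the dictionary names are introduced here for illustration. Note that each table lists the best row per dataset, so the training and test rows of a model do not necessarily share the same hyperparameters.

```python
# f1_score values copied from the training (1.5.1) and test (1.5.3) tables
train_f1 = {'KNN': 0.804643, 'Decision Tree': 0.988072,
            'Random Forest': 0.995982, 'Logistic Regression': 0.854421}
test_f1 = {'KNN': 0.621031, 'Decision Tree': 0.944558,
           'Random Forest': 0.958341, 'Logistic Regression': 0.851197}

# Models ordered from best to worst test f1_score
ranking = sorted(test_f1, key=test_f1.get, reverse=True)

# Difference between training and test f1_score, a rough overfitting indicator
gaps = {model: round(train_f1[model] - test_f1[model], 6) for model in test_f1}

print(ranking)
for model in ranking:
    print(f'{model}: test f1 = {test_f1[model]:.6f}, train - test = {gaps[model]:.6f}')
```

On these figures the Random Forest ranks first on the test dataset, while Logistic Regression shows the smallest difference between training and test scores and KNN the largest.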
