# i. BUSINESS JOB DESCRIPTION

*   Data Money is a company that offers data analysis consultancy services to partner companies worldwide.
*   The main differentiator of Data Money's services is the methodology used for training and tuning machine learning models. Its team of Data Scientists carries out this work and explains the behaviour of each algorithm with demonstrated results.

# ii. THE CHALLENGE

*   To keep improving the performance of its data science team, Data Money required a larger number of model trials, with the goal of better understanding the scenarios in which each algorithm is appropriate. These choices are expected to be based on verified metrics, which serve as a means of comparing machine learning models and identifying when their performance is best (scores maximized or errors minimized).
*   As a Data Scientist recently hired by the company, I was tasked with running trials for classification, regression and clustering machine learning models and reporting their results to the other teams.

# iii. BUSINESS CHALLENGE SPECIFICATIONS

*   Performance results are to be verified on 3 different datasets:
    1. Training dataset (used both to fit the model and to generate predictions)
    2. Validation dataset
    3. Test dataset


*   The results will then be reported as one table per dataset, comparing the metrics for a selection of machine learning models.
*   The only exception is the clustering analysis: since clustering is an unsupervised learning task, its results will be reported in a single table.

# 0.0 IMPORTS AND HELPER FUNCTIONS

In [1]:
import numpy as np
import pandas as pd
import warnings

from matplotlib import pyplot as plt

from sklearn.neighbors      import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree           import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble       import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model   import LogisticRegression, LinearRegression, Lasso, Ridge, ElasticNet

from sklearn                import preprocessing as pp
from sklearn                import cluster as c
from sklearn                import metrics as mt
from sklearn                import model_selection as ms 
from sklearn                import datasets as dt


In [2]:
warnings.filterwarnings('ignore')

# 1.0 CLASSIFICATION MODELS

*   The main metric used to evaluate overall model performance will be the f1_score, since it balances the precision and recall metrics.
*   Accuracy, precision and recall scores will also be listed, as well as the hyperparameter set tuned for each iteration, in order to gain insight into how each model works.
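
As a small worked example of why the f1_score balances precision and recall, the sketch below computes it by hand as their harmonic mean and compares the result with scikit-learn. The labels are toy values invented for illustration; they are not taken from the project datasets.

```python
import numpy as np
from sklearn import metrics as mt

# Toy labels, invented for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

precision = mt.precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = mt.recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4

# The f1_score is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = mt.f1_score(y_true, y_pred)

print(precision, recall, f1_manual, f1_sklearn)
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high f1_score by maximizing only precision or only recall.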

In [3]:
# Backslashes are escaped so the paths do not rely on invalid escape sequences such as '\c'
X_training_classification = pd.read_csv('datasets\\classification\\X_training.csv')
y_training_classification = pd.read_csv('datasets\\classification\\y_training.csv')

X_validation_classification = pd.read_csv('datasets\\classification\\X_validation.csv')
y_validation_classification = pd.read_csv('datasets\\classification\\y_validation.csv')

X_test_classification = pd.read_csv('datasets\\classification\\X_test.csv')
y_test_classification = pd.read_csv('datasets\\classification\\y_test.csv')

In [4]:
print(f'Training dataset: rows = {X_training_classification.shape[0]}, columns = {X_training_classification.shape[1]}')
print(f'Validation dataset: rows = {X_validation_classification.shape[0]}, columns = {X_validation_classification.shape[1]}')
print(f'Test dataset: rows = {X_test_classification.shape[0]}, columns = {X_test_classification.shape[1]}')

print(f'Number of classes:{len(np.unique(y_training_classification))}, Distinct Classes:{np.unique(y_training_classification)}')

Training dataset: rows = 72515, columns = 25
Validation dataset: rows = 31079, columns = 25
Test dataset: rows = 25893, columns = 25
Number of classes:2, Distinct Classes:[0 1]


*   Every class variable array ('y' variables) matches the number of rows of its features counterpart ('X' variables).
*   The datasets are binary, with '0' and '1' as the class labels.
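
The printed output above lists shapes for the feature sets only, so a check along the following lines could back the statement about the 'y' arrays and also show the class balance. This is a minimal sketch: `describe_split` is a hypothetical helper, and the toy DataFrames stand in for `X_training_classification` and `y_training_classification`.

```python
import numpy as np
import pandas as pd

def describe_split(name, X, y):
    """Check that X and y have the same number of rows and return the class counts."""
    y_flat = np.ravel(y)
    if X.shape[0] != y_flat.shape[0]:
        raise ValueError(f'{name}: X has {X.shape[0]} rows but y has {y_flat.shape[0]}')
    classes, counts = np.unique(y_flat, return_counts=True)
    return dict(zip(classes.tolist(), counts.tolist()))

# Toy stand-ins for the real feature and target DataFrames (illustrative values only)
X_toy = pd.DataFrame({'feature_a': [0.1, 0.4, 0.3, 0.9], 'feature_b': [1, 0, 1, 1]})
y_toy = pd.DataFrame({'target': [0, 1, 1, 0]})

print(describe_split('toy', X_toy, y_toy))
```

Knowing the class counts also helps when reading the accuracy scores, since accuracy can look high on imbalanced data even when recall is poor.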

## 1.1 KNN

In [5]:
neighbours_list = list()

acc_list_training = list()
precision_list_training = list()
recall_list_training = list()
f1_list_training = list()

acc_list_validation = list()
precision_list_validation = list()
recall_list_validation = list()
f1_list_validation = list()

acc_list_test = list()
precision_list_test = list()
recall_list_test = list()
f1_list_test = list()

for i in range(3, 15, 2):
    neighbours_list.append(i)
    #Model Definition
    knn_classifier = KNeighborsClassifier(n_neighbors = i)
    
    #Model Train
    knn_classifier.fit(X_training_classification, np.ravel(y_training_classification))
    
    #Model Predict
    yhat_knn_training   = knn_classifier.predict(X_training_classification)
    yhat_knn_validation = knn_classifier.predict(X_validation_classification)
    yhat_knn_test       = knn_classifier.predict(X_test_classification)
    
    #training scores
    acc_score_training       = mt.accuracy_score(y_training_classification, yhat_knn_training)
    precision_score_training = mt.precision_score(y_training_classification, yhat_knn_training)
    recall_score_training    = mt.recall_score(y_training_classification, yhat_knn_training)
    f1_score_training        = mt.f1_score(y_training_classification, yhat_knn_training)
    
    acc_list_training.append(acc_score_training)
    precision_list_training.append(precision_score_training)
    recall_list_training.append(recall_score_training)
    f1_list_training.append(f1_score_training)
    
    #validation scores
    acc_score_validation       = mt.accuracy_score(y_validation_classification, yhat_knn_validation)
    precision_score_validation = mt.precision_score(y_validation_classification, yhat_knn_validation)
    recall_score_validation    = mt.recall_score(y_validation_classification, yhat_knn_validation)
    f1_score_validation        = mt.f1_score(y_validation_classification, yhat_knn_validation)
    
    acc_list_validation.append(acc_score_validation)
    precision_list_validation.append(precision_score_validation)
    recall_list_validation.append(recall_score_validation)
    f1_list_validation.append(f1_score_validation)
    
    #test scores
    acc_score_test       = mt.accuracy_score(y_test_classification, yhat_knn_test)
    precision_score_test = mt.precision_score(y_test_classification, yhat_knn_test)
    recall_score_test    = mt.recall_score(y_test_classification, yhat_knn_test)
    f1_score_test        = mt.f1_score(y_test_classification, yhat_knn_test)
    
    acc_list_test.append(acc_score_test)
    precision_list_test.append(precision_score_test)
    recall_list_test.append(recall_score_test)
    f1_list_test.append(f1_score_test)
    

In [6]:
knn_training_df = pd.DataFrame({'neighbors': neighbours_list, 'accuracy_score':acc_list_training, 
                                'precision_score': precision_list_training, 'recall_score':recall_list_training, 
                                'f1_score':f1_list_training})

knn_training_max = knn_training_df.query('f1_score == f1_score.max()')
knn_training_max

| | neighbors | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|
| 0 | 3 | 0.832186 | 0.812008 | 0.79741 | 0.804643 |


In [7]:
knn_validation_df = pd.DataFrame({'neighbors': neighbours_list, 'accuracy_score':acc_list_validation, 
                                'precision_score': precision_list_validation, 'recall_score':recall_list_validation, 
                                'f1_score':f1_list_validation})

knn_validation_max = knn_validation_df.query('f1_score == f1_score.max()')
knn_validation_max

| | neighbors | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|
| 0 | 3 | 0.676277 | 0.627851 | 0.621278 | 0.624548 |


In [8]:
knn_test_df = pd.DataFrame({'neighbors': neighbours_list, 'accuracy_score':acc_list_test, 
                            'precision_score': precision_list_test, 'recall_score':recall_list_test, 
                            'f1_score':f1_list_test})

knn_test_max = knn_test_df.query('f1_score == f1_score.max()')
knn_test_max

| | neighbors | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|
| 0 | 3 | 0.672228 | 0.630462 | 0.611879 | 0.621031 |


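*   All three datasets showed their best f1_score at the lowest number of neighbors tried, which justifies choosing n_neighbors=3. The drop from about 0.80 on training to about 0.62 on validation and test shows that KNN generalizes poorly on this data.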
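
The sweep above repeats the same scoring code for each dataset. A more compact version could look like the sketch below. It is an illustration under stated assumptions, not the code that produced the tables: `classification_scores` and `sweep_knn` are hypothetical helpers, and the data is synthetic (generated with `make_classification`) so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn import datasets as dt
from sklearn import metrics as mt
from sklearn import model_selection as ms
from sklearn.neighbors import KNeighborsClassifier

def classification_scores(y_true, y_pred):
    """Return the four metrics reported in this notebook as a dict."""
    y_true = np.ravel(y_true)
    return {'accuracy_score': mt.accuracy_score(y_true, y_pred),
            'precision_score': mt.precision_score(y_true, y_pred),
            'recall_score': mt.recall_score(y_true, y_pred),
            'f1_score': mt.f1_score(y_true, y_pred)}

def sweep_knn(splits, neighbors_values):
    """Fit one KNN per n_neighbors on the training split and score every split."""
    X_train, y_train = splits['training']
    rows = {name: [] for name in splits}
    for k in neighbors_values:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, np.ravel(y_train))
        for name, (X, y) in splits.items():
            rows[name].append({'neighbors': k, **classification_scores(y, model.predict(X))})
    return {name: pd.DataFrame(split_rows) for name, split_rows in rows.items()}

# Synthetic stand-in data (assumption: not the project datasets)
X, y = dt.make_classification(n_samples=600, n_features=8, random_state=42)
X_train, X_rest, y_train, y_rest = ms.train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = ms.train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

splits = {'training': (X_train, y_train), 'validation': (X_val, y_val), 'test': (X_test, y_test)}
results = sweep_knn(splits, range(3, 15, 2))
print(results['validation'])
```

Each value in `results` has the same columns as the per-dataset DataFrames built in the cells above, so the `query('f1_score == f1_score.max()')` step would apply unchanged.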

## 1.2 Decision Tree

In [9]:
max_depth_list = list()

acc_list_training = list()
precision_list_training = list()
recall_list_training = list()
f1_list_training = list()

acc_list_validation = list()
precision_list_validation = list()
recall_list_validation = list()
f1_list_validation = list()

acc_list_test = list()
precision_list_test = list()
recall_list_test = list()
f1_list_test = list()

for i in range(1, 20, 2):
    max_depth_list.append(i)
    #Model Define
    model_tree = DecisionTreeClassifier(max_depth=i, random_state=42)
    
    #Model Training
    model_tree.fit(X_training_classification, np.ravel(y_training_classification))
    
    #Model Predict
    yhat_tree_training   = model_tree.predict(X_training_classification)
    yhat_tree_validation = model_tree.predict(X_validation_classification)
    yhat_tree_test       = model_tree.predict(X_test_classification)
    
    #training scores
    acc_score_training       = mt.accuracy_score(y_training_classification, yhat_tree_training)
    precision_score_training = mt.precision_score(y_training_classification, yhat_tree_training)
    recall_score_training    = mt.recall_score(y_training_classification, yhat_tree_training)
    f1_score_training        = mt.f1_score(y_training_classification, yhat_tree_training)
    
    acc_list_training.append(acc_score_training)
    precision_list_training.append(precision_score_training)
    recall_list_training.append(recall_score_training)
    f1_list_training.append(f1_score_training)
    
    #validation scores
    acc_score_validation       = mt.accuracy_score(y_validation_classification, yhat_tree_validation)
    precision_score_validation = mt.precision_score(y_validation_classification, yhat_tree_validation)
    recall_score_validation    = mt.recall_score(y_validation_classification, yhat_tree_validation)
    f1_score_validation        = mt.f1_score(y_validation_classification, yhat_tree_validation)
    
    acc_list_validation.append(acc_score_validation)
    precision_list_validation.append(precision_score_validation)
    recall_list_validation.append(recall_score_validation)
    f1_list_validation.append(f1_score_validation)
    
    #test scores
    acc_score_test       = mt.accuracy_score(y_test_classification, yhat_tree_test)
    precision_score_test = mt.precision_score(y_test_classification, yhat_tree_test)
    recall_score_test    = mt.recall_score(y_test_classification, yhat_tree_test)
    f1_score_test        = mt.f1_score(y_test_classification, yhat_tree_test)
    
    acc_list_test.append(acc_score_test)
    precision_list_test.append(precision_score_test)
    recall_list_test.append(recall_score_test)
    f1_list_test.append(f1_score_test)
    
    

In [10]:
tree_training_df = pd.DataFrame({'max_depth': max_depth_list, 'accuracy_score':acc_list_training, 
                                'precision_score': precision_list_training, 'recall_score':recall_list_training, 
                                'f1_score':f1_list_training})

tree_training_max = tree_training_df.query('f1_score == f1_score.max()')
tree_training_max

| | max_depth | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|
| 9 | 19 | 0.989712 | 0.993089 | 0.983104 | 0.988072 |


In [11]:
tree_validation_df = pd.DataFrame({'max_depth': max_depth_list, 'accuracy_score':acc_list_validation, 
                                   'precision_score': precision_list_validation, 'recall_score':recall_list_validation, 
                                   'f1_score':f1_list_validation})

tree_validation_max = tree_validation_df.query('f1_score == f1_score.max()')
tree_validation_max

| | max_depth | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|
| 6 | 13 | 0.952315 | 0.9563 | 0.932586 | 0.944294 |


In [12]:
tree_test_df = pd.DataFrame({'max_depth': max_depth_list, 'accuracy_score':acc_list_test, 
                                'precision_score': precision_list_test, 'recall_score':recall_list_test, 
                                'f1_score':f1_list_test})

tree_test_max = tree_test_df.query('f1_score == f1_score.max()')
tree_test_max

| | max_depth | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|
| 6 | 13 | 0.951802 | 0.953881 | 0.935416 | 0.944558 |


*   When evaluating on the training dataset only, all metrics (including f1_score) keep increasing toward 100% as the depth of the tree grows (the best training result sits at max_depth=19, the deepest tree tried), which indicates overfitting.
*   For the validation and test datasets, however, model performance peaks at max_depth=13, indicating that deeper trees bring no further gain on unseen data.
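
The overfitting pattern described above can be made visible by tabulating the gap between training and validation f1_score per depth. The following is a minimal sketch on synthetic data (an assumption, generated with `make_classification` and some label noise), so the numbers it prints are not comparable with the tables in this section.

```python
import pandas as pd
from sklearn import datasets as dt
from sklearn import metrics as mt
from sklearn import model_selection as ms
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, used only to illustrate the pattern
X, y = dt.make_classification(n_samples=2000, n_features=20, n_informative=5,
                              flip_y=0.1, random_state=42)
X_train, X_val, y_train, y_val = ms.train_test_split(X, y, test_size=0.3, random_state=42)

rows = []
for depth in range(1, 20, 2):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    f1_train = mt.f1_score(y_train, tree.predict(X_train))
    f1_val = mt.f1_score(y_val, tree.predict(X_val))
    # A growing gap means the extra depth is memorizing the training data
    rows.append({'max_depth': depth, 'f1_training': f1_train,
                 'f1_validation': f1_val, 'gap': f1_train - f1_val})

depth_df = pd.DataFrame(rows)
print(depth_df)
```

Reading the `gap` column alongside `f1_validation` gives a simple rule of thumb: prefer the depth where the validation score stops improving, even if the training score is still rising.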

## 1.3 Random Forest Classifier

In [13]:
params = {'n_estimators': [1, 10, 50, 100, 150, 200],
          'max_depth': [1, 5, 10, 15, 20]}

n_estimators_list = list()
max_depth_list = list()

acc_list_training = list()
precision_list_training = list()
recall_list_training = list()
f1_list_training = list()

acc_list_validation = list()
precision_list_validation = list()
recall_list_validation = list()
f1_list_validation = list()

acc_list_test = list()
precision_list_test = list()
recall_list_test = list()
f1_list_test = list()


for i in params.get('n_estimators'):
    for j in params.get('max_depth'):
        n_estimators_list.append(i)
        max_depth_list.append(j)
        #Model Definition
        rf_model = RandomForestClassifier(n_estimators=i, max_depth=j, random_state=42)

        #Model Training
        rf_model.fit(X_training_classification, np.ravel(y_training_classification))

        #Model Predict
        yhat_rf_training    = rf_model.predict(X_training_classification)
        yhat_rf_validation  = rf_model.predict(X_validation_classification)
        yhat_rf_test        = rf_model.predict(X_test_classification)

        #Model Scores
        #training scores
        acc_score_training       = mt.accuracy_score(y_training_classification, yhat_rf_training)
        precision_score_training = mt.precision_score(y_training_classification, yhat_rf_training)
        recall_score_training    = mt.recall_score(y_training_classification, yhat_rf_training)
        f1_score_training        = mt.f1_score(y_training_classification, yhat_rf_training)
    
        acc_list_training.append(acc_score_training)
        precision_list_training.append(precision_score_training)
        recall_list_training.append(recall_score_training)
        f1_list_training.append(f1_score_training)
    
        #validation scores
        acc_score_validation       = mt.accuracy_score(y_validation_classification, yhat_rf_validation)
        precision_score_validation = mt.precision_score(y_validation_classification, yhat_rf_validation)
        recall_score_validation    = mt.recall_score(y_validation_classification, yhat_rf_validation)
        f1_score_validation        = mt.f1_score(y_validation_classification, yhat_rf_validation)
    
        acc_list_validation.append(acc_score_validation)
        precision_list_validation.append(precision_score_validation)
        recall_list_validation.append(recall_score_validation)
        f1_list_validation.append(f1_score_validation)

        #test scores
        acc_score_test       = mt.accuracy_score(y_test_classification, yhat_rf_test)
        precision_score_test = mt.precision_score(y_test_classification, yhat_rf_test)
        recall_score_test    = mt.recall_score(y_test_classification, yhat_rf_test)
        f1_score_test        = mt.f1_score(y_test_classification, yhat_rf_test)
    
        acc_list_test.append(acc_score_test)
        precision_list_test.append(precision_score_test)
        recall_list_test.append(recall_score_test)
        f1_list_test.append(f1_score_test)

In [14]:
rf_training_df = pd.DataFrame({'max_depth': max_depth_list, 'n_estimators':n_estimators_list,
                               'accuracy_score':acc_list_training, 'precision_score': precision_list_training,
                               'recall_score':recall_list_training, 'f1_score':f1_list_training})


rf_training_max = rf_training_df.query('f1_score == f1_score.max()')
rf_training_max

| | max_depth | n_estimators | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|---|
| 24 | 20 | 150 | 0.996525 | 0.998242 | 0.993732 | 0.995982 |


In [15]:
rf_validation_df = pd.DataFrame({'max_depth': max_depth_list, 'n_estimators':n_estimators_list,
                               'accuracy_score':acc_list_validation, 'precision_score': precision_list_validation,
                               'recall_score':recall_list_validation, 'f1_score':f1_list_validation})

rf_validation_max = rf_validation_df.query('f1_score == f1_score.max()')
rf_validation_max

| | max_depth | n_estimators | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|---|
| 29 | 20 | 200 | 0.965185 | 0.974271 | 0.944614 | 0.959213 |


In [16]:
rf_test_df = pd.DataFrame({'max_depth': max_depth_list, 'n_estimators':n_estimators_list,
                               'accuracy_score':acc_list_test, 'precision_score': precision_list_test,
                               'recall_score':recall_list_test, 'f1_score':f1_list_test})

rf_test_max = rf_test_df.query('f1_score == f1_score.max()')
rf_test_max

| | max_depth | n_estimators | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|---|
| 24 | 20 | 150 | 0.963928 | 0.971777 | 0.945271 | 0.958341 |


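*   As with the Decision Tree, the Random Forest fits the training data more and more closely as the number of estimators and the maximum tree depth increase. The best training f1_score (about 0.996) is higher than the best validation and test scores (about 0.959), which suggests some overfitting, although the gap is smaller than for the single tree's training result.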
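
The nested loops over `n_estimators` and `max_depth` could also be written with scikit-learn's `ParameterGrid`, which yields one dict per combination. The sketch below is illustrative only: it counts the combinations of the grid used above, then runs a much smaller, hypothetical grid (`small_params`) on synthetic data so that it executes quickly. `ParameterGrid` iterates in its own order, so row indices would not match the ones shown in the tables above.

```python
import pandas as pd
from sklearn import datasets as dt
from sklearn import metrics as mt
from sklearn import model_selection as ms
from sklearn.ensemble import RandomForestClassifier

# The grid used in this section: every pair of values is one trial
params = {'n_estimators': [1, 10, 50, 100, 150, 200],
          'max_depth': [1, 5, 10, 15, 20]}
full_grid = list(ms.ParameterGrid(params))
print(len(full_grid))  # 6 * 5 combinations

# Reduced grid and synthetic data (assumptions, chosen to keep the sketch fast)
small_params = {'n_estimators': [5, 20], 'max_depth': [2, 8]}
X, y = dt.make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_val, y_train, y_val = ms.train_test_split(X, y, test_size=0.3, random_state=42)

rows = []
for combo in ms.ParameterGrid(small_params):
    # Each combo is a dict such as {'max_depth': 2, 'n_estimators': 5}
    rf_model = RandomForestClassifier(random_state=42, **combo).fit(X_train, y_train)
    rows.append({**combo,
                 'f1_training': mt.f1_score(y_train, rf_model.predict(X_train)),
                 'f1_validation': mt.f1_score(y_val, rf_model.predict(X_val))})

rf_sketch_df = pd.DataFrame(rows)
print(rf_sketch_df)
```

Keeping the hyperparameters as a dict per row also removes the need for separate `n_estimators_list` and `max_depth_list` bookkeeping.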

## 1.4 Logistic Regression Classifier

In [17]:
params = {'C':[0.1, 0.5, 1.0, 2.0],
          'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'],
          'max_iter': [50, 100, 150, 200]}

c_estimator_list = list()
solver_estimator_list = list()
max_iter_estimator_list = list()

acc_list_training = list()
precision_list_training = list()
recall_list_training = list()
f1_list_training = list()

acc_list_validation = list()
precision_list_validation = list()
recall_list_validation = list()
f1_list_validation = list()

acc_list_test = list()
precision_list_test = list()
recall_list_test = list()
f1_list_test = list()

for i in params.get('C'):
    for j in params.get('solver'):
        for p in params.get('max_iter'):
            c_estimator_list.append(i)
            solver_estimator_list.append(j)
            max_iter_estimator_list.append(p)
            
            #Model Define
            lr_model = LogisticRegression(C=i, solver=j, max_iter=p, random_state=42)
            
            #Model Training
            lr_model.fit(X_training_classification, np.ravel(y_training_classification))
            
            #Model Predict
            yhat_lr_training    = lr_model.predict(X_training_classification)
            yhat_lr_validation  = lr_model.predict(X_validation_classification)
            yhat_lr_test        = lr_model.predict(X_test_classification)
            
            #Model Scores
            #Training Scores
            acc_score_training       = mt.accuracy_score(y_training_classification, yhat_lr_training)
            precision_score_training = mt.precision_score(y_training_classification, yhat_lr_training)
            recall_score_training    = mt.recall_score(y_training_classification, yhat_lr_training)
            f1_score_training        = mt.f1_score(y_training_classification, yhat_lr_training)
    
            acc_list_training.append(acc_score_training)
            precision_list_training.append(precision_score_training)
            recall_list_training.append(recall_score_training)
            f1_list_training.append(f1_score_training)
            
            #Validation Scores
            acc_score_validation       = mt.accuracy_score(y_validation_classification, yhat_lr_validation)
            precision_score_validation = mt.precision_score(y_validation_classification, yhat_lr_validation)
            recall_score_validation    = mt.recall_score(y_validation_classification, yhat_lr_validation)
            f1_score_validation        = mt.f1_score(y_validation_classification, yhat_lr_validation)
    
            acc_list_validation.append(acc_score_validation)
            precision_list_validation.append(precision_score_validation)
            recall_list_validation.append(recall_score_validation)
            f1_list_validation.append(f1_score_validation)
            
            #Test Scores
            acc_score_test       = mt.accuracy_score(y_test_classification, yhat_lr_test)
            precision_score_test = mt.precision_score(y_test_classification, yhat_lr_test)
            recall_score_test    = mt.recall_score(y_test_classification, yhat_lr_test)
            f1_score_test        = mt.f1_score(y_test_classification, yhat_lr_test)
    
            acc_list_test.append(acc_score_test)
            precision_list_test.append(precision_score_test)
            recall_list_test.append(recall_score_test)
            f1_list_test.append(f1_score_test)

In [18]:
lr_training_df = pd.DataFrame({'C': c_estimator_list, 'solver':solver_estimator_list, 'max_iter':max_iter_estimator_list,
                               'accuracy_score':acc_list_training, 'precision_score': precision_list_training,
                               'recall_score':recall_list_training, 'f1_score':f1_list_training})

lr_training_max = lr_training_df.query('f1_score == f1_score.max()')
lr_training_max

| | C | solver | max_iter | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|---|---|
| 9 | 0.1 | newton-cg | 100 | 0.876288 | 0.871866 | 0.837661 | 0.854421 |
| 10 | 0.1 | newton-cg | 150 | 0.876288 | 0.871866 | 0.837661 | 0.854421 |
| 11 | 0.1 | newton-cg | 200 | 0.876288 | 0.871866 | 0.837661 | 0.854421 |
| 12 | 0.1 | newton-cholesky | 50 | 0.876288 | 0.871866 | 0.837661 | 0.854421 |
| 13 | 0.1 | newton-cholesky | 100 | 0.876288 | 0.871866 | 0.837661 | 0.854421 |
| 14 | 0.1 | newton-cholesky | 150 | 0.876288 | 0.871866 | 0.837661 | 0.854421 |
| 15 | 0.1 | newton-cholesky | 200 | 0.876288 | 0.871866 | 0.837661 | 0.854421 |


In [19]:
lr_validation_df = pd.DataFrame({'C': c_estimator_list, 'solver':solver_estimator_list, 'max_iter':max_iter_estimator_list,
                               'accuracy_score':acc_list_validation, 'precision_score': precision_list_validation,
                               'recall_score':recall_list_validation, 'f1_score':f1_list_validation})

lr_validation_max = lr_validation_df.query('f1_score == f1_score.max()')
lr_validation_max

| | C | solver | max_iter | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|---|---|
| 81 | 2.0 | newton-cg | 100 | 0.874481 | 0.869421 | 0.83592 | 0.852341 |
| 82 | 2.0 | newton-cg | 150 | 0.874481 | 0.869421 | 0.83592 | 0.852341 |
| 83 | 2.0 | newton-cg | 200 | 0.874481 | 0.869421 | 0.83592 | 0.852341 |


In [20]:
lr_test_df = pd.DataFrame({'C': c_estimator_list, 'solver':solver_estimator_list, 'max_iter':max_iter_estimator_list,
                               'accuracy_score':acc_list_test, 'precision_score': precision_list_test,
                               'recall_score':recall_list_test, 'f1_score':f1_list_test})

lr_test_max = lr_test_df.query('f1_score == f1_score.max()')
lr_test_max

| | C | solver | max_iter | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|---|---|
| 32 | 0.5 | newton-cg | 50 | 0.871857 | 0.868014 | 0.83502 | 0.851197 |


*   For these datasets, the solvers that showed the best performance were the Newton-type methods, newton-cg and newton-cholesky.
*   The remaining solvers (lbfgs, liblinear, sag, saga) did not reach the same f1_score. Every run used the default L2 penalty, which all of these solvers support, so the penalty type does not explain the difference. A more likely cause is that these solvers did not converge within the tested max_iter values, which is common when features are not scaled (sag and saga in particular converge quickly only on features with similar scales).
*   Both C and max_iter proved secondary once the solver converged, showing comparatively little influence on the final metrics.
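
Since this notebook silences warnings globally, convergence problems would not be visible in the output. One way to check for them is to record warnings around `fit`, as in the sketch below. It is a hedged illustration: `fit_and_check` is a hypothetical helper, the data is synthetic with deliberately mismatched feature scales, and the comparison with a `StandardScaler` pipeline is an assumption about what might help, not something tested on the project data.

```python
import warnings
import numpy as np
from sklearn import datasets as dt
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_and_check(model, X, y):
    """Fit the model and return True if no ConvergenceWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        # 'always' overrides a global filterwarnings('ignore') inside this block
        warnings.simplefilter('always')
        model.fit(X, y)
    return not any(issubclass(w.category, ConvergenceWarning) for w in caught)

# Synthetic data whose columns are rescaled to very different magnitudes
X, y = dt.make_classification(n_samples=1000, n_features=10, random_state=42)
scales = np.array([1, 10, 100, 1000, 1, 10, 100, 1000, 1, 10])
X = X * scales

for solver in ['lbfgs', 'newton-cg', 'sag', 'saga']:
    raw = LogisticRegression(solver=solver, max_iter=50, random_state=42)
    scaled = make_pipeline(StandardScaler(),
                           LogisticRegression(solver=solver, max_iter=50, random_state=42))
    print(solver,
          '| raw converged:', fit_and_check(raw, X, y),
          '| scaled converged:', fit_and_check(scaled, X, y))
```

Adding a `converged` column built this way to the results table would separate "this solver is worse" from "this solver did not finish", which the f1_score alone cannot do.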

## 1.5 Final Results for Classification Models

### 1.5.1 Training Dataset 

In [21]:
knn_training_max = knn_training_max.drop('neighbors', axis=1)
knn_training_max['model'] = 'KNN'
knn_training_max = knn_training_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

tree_training_max = tree_training_max.drop('max_depth', axis=1)
tree_training_max['model'] = 'Decision Tree'
tree_training_max = tree_training_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

rf_training_max = rf_training_max.drop(['n_estimators', 'max_depth'], axis=1)
rf_training_max['model'] = 'Random Forest'
rf_training_max = rf_training_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

lr_training_max = lr_training_max.drop(['C', 'solver', 'max_iter'], axis=1)
lr_training_max['model'] = 'Logistic Regression'
lr_training_max = lr_training_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

In [22]:
training_classification_results = pd.concat([knn_training_max, tree_training_max, rf_training_max, lr_training_max.iloc[[0]]])
training_classification_results

| | model | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|
| 0 | KNN | 0.832186 | 0.812008 | 0.79741 | 0.804643 |
| 9 | Decision Tree | 0.989712 | 0.993089 | 0.983104 | 0.988072 |
| 24 | Random Forest | 0.996525 | 0.998242 | 0.993732 | 0.995982 |
| 9 | Logistic Regression | 0.876288 | 0.871866 | 0.837661 | 0.854421 |


### 1.5.2 Validation Dataset

In [23]:
knn_validation_max = knn_validation_max.drop('neighbors', axis=1)
knn_validation_max['model'] = 'KNN'
knn_validation_max = knn_validation_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

tree_validation_max = tree_validation_max.drop('max_depth', axis=1)
tree_validation_max['model'] = 'Decision Tree'
tree_validation_max = tree_validation_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

rf_validation_max = rf_validation_max.drop(['n_estimators', 'max_depth'], axis=1)
rf_validation_max['model'] = 'Random Forest'
rf_validation_max = rf_validation_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

lr_validation_max = lr_validation_max.drop(['C', 'solver', 'max_iter'], axis=1)
lr_validation_max['model'] = 'Logistic Regression'
lr_validation_max = lr_validation_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

In [24]:
validation_classification_results = pd.concat([knn_validation_max, tree_validation_max, rf_validation_max, lr_validation_max.iloc[[0]]])
validation_classification_results

| | model | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|
| 0 | KNN | 0.676277 | 0.627851 | 0.621278 | 0.624548 |
| 6 | Decision Tree | 0.952315 | 0.9563 | 0.932586 | 0.944294 |
| 29 | Random Forest | 0.965185 | 0.974271 | 0.944614 | 0.959213 |
| 81 | Logistic Regression | 0.874481 | 0.869421 | 0.83592 | 0.852341 |


### 1.5.3 Test Dataset

In [25]:
knn_test_max = knn_test_max.drop('neighbors', axis=1)
knn_test_max['model'] = 'KNN'
knn_test_max = knn_test_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

tree_test_max = tree_test_max.drop('max_depth', axis=1)
tree_test_max['model'] = 'Decision Tree'
tree_test_max = tree_test_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

rf_test_max = rf_test_max.drop(['n_estimators', 'max_depth'], axis=1)
rf_test_max['model'] = 'Random Forest'
rf_test_max = rf_test_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

lr_test_max = lr_test_max.drop(['C', 'solver', 'max_iter'], axis=1)
lr_test_max['model'] = 'Logistic Regression'
lr_test_max = lr_test_max[['model', 'accuracy_score', 'precision_score', 'recall_score', 'f1_score']]

In [26]:
test_classification_results = pd.concat([knn_test_max, tree_test_max, rf_test_max, lr_test_max.iloc[[0]]])
test_classification_results

| | model | accuracy_score | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|
| 0 | KNN | 0.672228 | 0.630462 | 0.611879 | 0.621031 |
| 6 | Decision Tree | 0.951802 | 0.953881 | 0.935416 | 0.944558 |
| 24 | Random Forest | 0.963928 | 0.971777 | 0.945271 | 0.958341 |
| 32 | Logistic Regression | 0.871857 | 0.868014 | 0.83502 | 0.851197 |
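
The selection and labelling steps repeated in cells 21 to 26 could be condensed into one helper. The sketch below is an illustration, not the code that produced the tables above: `best_row` and `METRIC_COLUMNS` are hypothetical names, and the two DataFrames hold toy values. It uses `idxmax`, which returns the first row on ties, so the extra `.iloc[[0]]` used for Logistic Regression would not be needed.

```python
import pandas as pd

METRIC_COLUMNS = ['accuracy_score', 'precision_score', 'recall_score', 'f1_score']

def best_row(results_df, model_name, metric='f1_score'):
    """Return the single row with the highest metric, labelled with the model name."""
    best = results_df.loc[[results_df[metric].idxmax()], METRIC_COLUMNS].copy()
    best.insert(0, 'model', model_name)
    return best

# Toy results (illustrative values only)
knn_toy = pd.DataFrame({'neighbors': [3, 5],
                        'accuracy_score': [0.70, 0.68], 'precision_score': [0.66, 0.64],
                        'recall_score': [0.62, 0.60], 'f1_score': [0.64, 0.62]})
lr_toy = pd.DataFrame({'C': [0.1, 0.5],
                       'accuracy_score': [0.87, 0.87], 'precision_score': [0.86, 0.86],
                       'recall_score': [0.84, 0.84], 'f1_score': [0.85, 0.85]})

summary = pd.concat([best_row(knn_toy, 'KNN'), best_row(lr_toy, 'Logistic Regression')],
                    ignore_index=True)
print(summary)
```

Passing `ignore_index=True` to `pd.concat` also avoids the repeated index labels that appear in the first column of the summary tables.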


# 2.0 REGRESSION MODELS

*   The main metric used to evaluate overall model performance will be the MAPE (Mean Absolute Percentage Error), since it expresses the error as a proportion of the actual value, which makes the error of each model easier to interpret than its absolute values.
*   Other error metrics will also be listed (Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R2 score), as well as the hyperparameter set tuned for each iteration, in order to gain insight into how each model works.
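
To make the MAPE definition concrete, the sketch below computes it by hand on toy values (invented for illustration) and compares it with scikit-learn's `mean_absolute_percentage_error`. Note that scikit-learn returns the MAPE as a fraction rather than as a percentage.

```python
import numpy as np
from sklearn import metrics as mt

# Toy targets and predictions, invented for illustration only
y_true = np.array([100.0, 50.0, 20.0, 2.0])
y_pred = np.array([110.0, 45.0, 25.0, 4.0])

# Relative errors per row: 0.10, 0.10, 0.25, 1.00
mape_manual = np.mean(np.abs((y_true - y_pred) / y_true))
mape_sklearn = mt.mean_absolute_percentage_error(y_true, y_pred)

# The MAE, in contrast, is expressed in the units of the target
mae = mt.mean_absolute_error(y_true, y_pred)

print(mape_manual, mape_sklearn, mae)
```

The last row shows how a small absolute error (2 units) becomes a 100% relative error when the true value is small. MAPE values above 1 should therefore be read as average errors larger than the targets themselves, which tends to happen when some targets are close to zero.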

In [27]:
X_training_regression = pd.read_csv('datasets\\regression\\X_training.csv')
y_training_regression = pd.read_csv('datasets\\regression\\y_training.csv')

X_validation_regression = pd.read_csv('datasets\\regression\\X_validation.csv')
y_validation_regression = pd.read_csv('datasets\\regression\\y_validation.csv')

X_test_regression = pd.read_csv('datasets\\regression\\X_test.csv')
y_test_regression = pd.read_csv('datasets\\regression\\y_test.csv')

In [28]:
print(f'Training dataset: rows = {X_training_regression.shape[0]}, columns = {X_training_regression.shape[1]}')
print(f'Validation dataset: rows = {X_validation_regression.shape[0]}, columns = {X_validation_regression.shape[1]}')
print(f'Test dataset: rows = {X_test_regression.shape[0]}, columns = {X_test_regression.shape[1]}')

print(f'Number of classes: {len(np.unique(y_training_regression))}')

Training dataset: rows = 10547, columns = 13
Validation dataset: rows = 4521, columns = 13
Test dataset: rows = 3767, columns = 13
Number of classes: 101


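*   Every target array ('y' variables) matches the number of rows of its features counterpart ('X' variables). The "Number of classes" printed above is the count of distinct target values (101), since the regression target is numeric rather than categorical.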

## 2.1 Linear Regression

In [29]:
#Model Definition
linear_regression_model = LinearRegression()
    
#Model Train
linear_regression_model.fit(X_training_regression, y_training_regression)
    
#Model Predict
yhat_linear_regression_training   = linear_regression_model.predict(X_training_regression)
yhat_linear_regression_validation = linear_regression_model.predict(X_validation_regression)
yhat_linear_regression_test       = linear_regression_model.predict(X_test_regression)
    
#training scores
r2_score_training       = mt.r2_score(y_training_regression, yhat_linear_regression_training)
mse_score_training      = mt.mean_squared_error(y_training_regression, yhat_linear_regression_training)
rmse_score_training     = np.sqrt(mt.mean_squared_error(y_training_regression, yhat_linear_regression_training))
mae_score_training      = mt.mean_absolute_error(y_training_regression, yhat_linear_regression_training)
mape_score_training     = mt.mean_absolute_percentage_error(y_training_regression, yhat_linear_regression_training)
    
#validation scores
r2_score_validation       = mt.r2_score(y_validation_regression, yhat_linear_regression_validation)
mse_score_validation      = mt.mean_squared_error(y_validation_regression, yhat_linear_regression_validation)
rmse_score_validation     = np.sqrt(mt.mean_squared_error(y_validation_regression, yhat_linear_regression_validation))
mae_score_validation      = mt.mean_absolute_error(y_validation_regression, yhat_linear_regression_validation)
mape_score_validation     = mt.mean_absolute_percentage_error(y_validation_regression, yhat_linear_regression_validation)
    
#test scores
r2_score_test       = mt.r2_score(y_test_regression, yhat_linear_regression_test)
mse_score_test      = mt.mean_squared_error(y_test_regression, yhat_linear_regression_test)
rmse_score_test     = np.sqrt(mt.mean_squared_error(y_test_regression, yhat_linear_regression_test))
mae_score_test      = mt.mean_absolute_error(y_test_regression, yhat_linear_regression_test)
mape_score_test     = mt.mean_absolute_percentage_error(y_test_regression, yhat_linear_regression_test)

In [30]:
linear_regression_training_min = pd.DataFrame({'r2_score': r2_score_training, 'mse_score': mse_score_training,
                                               'rmse_score': rmse_score_training, 'mae_score': mae_score_training,
                                               'mape_score': mape_score_training}, index= [0])
linear_regression_training_min

Unnamed: 0,r2_score,mse_score,rmse_score,mae_score,mape_score
0,0.046058,455.996112,21.354065,16.998249,8.653186


In [31]:
linear_regression_validation_min = pd.DataFrame({'r2_score': r2_score_validation, 'mse_score': mse_score_validation,
                                               'rmse_score': rmse_score_validation, 'mae_score': mae_score_validation,
                                               'mape_score': mape_score_validation}, index= [0])
linear_regression_validation_min

Unnamed: 0,r2_score,mse_score,rmse_score,mae_score,mape_score
0,0.039925,458.447042,21.411376,17.039754,8.682542


In [32]:
linear_regression_test_min = pd.DataFrame({'r2_score': r2_score_test, 'mse_score': mse_score_test,
                                               'rmse_score': rmse_score_test, 'mae_score': mae_score_test,
                                               'mape_score': mape_score_test}, index= [0])
linear_regression_test_min

Unnamed: 0,r2_score,mse_score,rmse_score,mae_score,mape_score
0,0.052317,461.427719,21.480869,17.129965,8.521859


*   The Linear Regression model serves as a baseline for further comparison, as it is conceptually the simplest of the regression models.
*   Given the simple nature of a linear model, there are no hyperparameters to iterate over for its construction.
*   The r2_score obtained is close to 0 (about 0.05 on training, 0.04 on validation and 0.05 on test), which means that a linear model explains very little of the target's variance and does not fit the behaviour of this dataset well.
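
The five scores are computed the same way for every model and every dataset split, so they can be gathered in one helper. The sketch below is only an illustration: `regression_metrics` is a hypothetical name that does not exist in the notebook, `mt` is assumed to be `sklearn.metrics` (the import is not visible in this part of the notebook), and the toy values are made up.

```python
import numpy as np
from sklearn import metrics as mt  # assumed meaning of the notebook's `mt` alias


def regression_metrics(y_true, y_pred):
    """Return the five scores reported in this study as a dict (hypothetical helper)."""
    y_true = np.ravel(y_true)
    y_pred = np.ravel(y_pred)
    mse = mt.mean_squared_error(y_true, y_pred)
    return {
        "r2_score": mt.r2_score(y_true, y_pred),
        "mse_score": mse,
        "rmse_score": np.sqrt(mse),  # RMSE reuses the MSE instead of recomputing it
        "mae_score": mt.mean_absolute_error(y_true, y_pred),
        "mape_score": mt.mean_absolute_percentage_error(y_true, y_pred),
    }


# Toy values, not taken from the notebook's dataset.
example_scores = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(example_scores)
```

A dict like this can be passed straight to `pd.DataFrame(..., index=[0])`, or appended to a list inside the hyperparameter loops used in the following sections.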

## 2.2 Linear Regression Lasso (L1)

In [33]:
params = {'alpha': [1, 50, 100, 200],
          'max_iter': [1, 50, 100, 200]}

alpha_estimator_list = list()
max_iter_estimator_list = list()

r2_list_training = list()
mse_list_training = list()
rmse_list_training = list()
mae_list_training = list()
mape_list_training = list()

r2_list_validation = list()
mse_list_validation = list()
rmse_list_validation = list()
mae_list_validation = list()
mape_list_validation = list()

r2_list_test = list()
mse_list_test = list()
rmse_list_test = list()
mae_list_test = list()
mape_list_test = list()

for i in params.get('alpha'):
    for j in params.get('max_iter'):
        alpha_estimator_list.append(i)
        max_iter_estimator_list.append(j)
        #Model Define
        lasso_regression = Lasso(alpha=i, max_iter=j, random_state=42)
    
        #Model Training
        lasso_regression.fit(X_training_regression, np.ravel(y_training_regression))
    
        #Model Predict
        yhat_lasso_regression_training   = lasso_regression.predict(X_training_regression)
        yhat_lasso_regression_validation = lasso_regression.predict(X_validation_regression)
        yhat_lasso_regression_test       = lasso_regression.predict(X_test_regression)
    
        #training scores
        r2_score_training       = mt.r2_score(y_training_regression, yhat_lasso_regression_training)
        mse_score_training      = mt.mean_squared_error(y_training_regression, yhat_lasso_regression_training)
        rmse_score_training     = np.sqrt(mt.mean_squared_error(y_training_regression, yhat_lasso_regression_training))
        mae_score_training      = mt.mean_absolute_error(y_training_regression, yhat_lasso_regression_training)
        mape_score_training     = mt.mean_absolute_percentage_error(y_training_regression, yhat_lasso_regression_training)
    
        r2_list_training.append(r2_score_training)
        mse_list_training.append(mse_score_training)
        rmse_list_training.append(rmse_score_training)
        mae_list_training.append(mae_score_training)
        mape_list_training.append(mape_score_training)
    
        #validation scores
        r2_score_validation       = mt.r2_score(y_validation_regression, yhat_lasso_regression_validation)
        mse_score_validation      = mt.mean_squared_error(y_validation_regression, yhat_lasso_regression_validation)
        rmse_score_validation     = np.sqrt(mt.mean_squared_error(y_validation_regression, yhat_lasso_regression_validation))
        mae_score_validation      = mt.mean_absolute_error(y_validation_regression, yhat_lasso_regression_validation)
        mape_score_validation     = mt.mean_absolute_percentage_error(y_validation_regression, yhat_lasso_regression_validation)
    
        r2_list_validation.append(r2_score_validation)
        mse_list_validation.append(mse_score_validation)
        rmse_list_validation.append(rmse_score_validation)
        mae_list_validation.append(mae_score_validation)
        mape_list_validation.append(mape_score_validation)
        
        #test scores
        r2_score_test       = mt.r2_score(y_test_regression, yhat_lasso_regression_test)
        mse_score_test      = mt.mean_squared_error(y_test_regression, yhat_lasso_regression_test)
        rmse_score_test     = np.sqrt(mt.mean_squared_error(y_test_regression, yhat_lasso_regression_test))
        mae_score_test      = mt.mean_absolute_error(y_test_regression, yhat_lasso_regression_test)
        mape_score_test     = mt.mean_absolute_percentage_error(y_test_regression, yhat_lasso_regression_test)
    
        r2_list_test.append(r2_score_test)
        mse_list_test.append(mse_score_test)
        rmse_list_test.append(rmse_score_test)
        mae_list_test.append(mae_score_test)
        mape_list_test.append(mape_score_test)

In [34]:
lasso_regression_training_df = pd.DataFrame({ 'alpha': alpha_estimator_list, 'max_iter': max_iter_estimator_list,
                                              'r2_score': r2_list_training, 'mse_score': mse_list_training,
                                              'rmse_score': rmse_list_training, 'mae_score': mae_list_training,
                                              'mape_score': mape_list_training})


lasso_regression_training_min = lasso_regression_training_df.query('mape_score == mape_score.min()')
lasso_regression_training_min

Unnamed: 0,alpha,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
0,1,1,0.007401,474.474834,21.782443,17.305484,8.736697
1,1,50,0.007401,474.474834,21.782443,17.305484,8.736697
2,1,100,0.007401,474.474834,21.782443,17.305484,8.736697
3,1,200,0.007401,474.474834,21.782443,17.305484,8.736697


In [35]:
lasso_regression_validation_df = pd.DataFrame({ 'alpha': alpha_estimator_list, 'max_iter': max_iter_estimator_list,
                                              'r2_score': r2_list_validation, 'mse_score': mse_list_validation,
                                              'rmse_score': rmse_list_validation, 'mae_score': mae_list_validation,
                                              'mape_score': mape_list_validation})

lasso_regression_validation_min = lasso_regression_validation_df.query('mape_score == mape_score.min()')
lasso_regression_validation_min

Unnamed: 0,alpha,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
4,50,1,-7.197077e-07,477.511956,21.852047,17.352836,8.678722
5,50,50,-7.197077e-07,477.511956,21.852047,17.352836,8.678722
6,50,100,-7.197077e-07,477.511956,21.852047,17.352836,8.678722
7,50,200,-7.197077e-07,477.511956,21.852047,17.352836,8.678722
8,100,1,-7.197077e-07,477.511956,21.852047,17.352836,8.678722
9,100,50,-7.197077e-07,477.511956,21.852047,17.352836,8.678722
10,100,100,-7.197077e-07,477.511956,21.852047,17.352836,8.678722
11,100,200,-7.197077e-07,477.511956,21.852047,17.352836,8.678722
12,200,1,-7.197077e-07,477.511956,21.852047,17.352836,8.678722
13,200,50,-7.197077e-07,477.511956,21.852047,17.352836,8.678722


In [36]:
lasso_regression_test_df = pd.DataFrame({ 'alpha': alpha_estimator_list, 'max_iter': max_iter_estimator_list,
                                              'r2_score': r2_list_test, 'mse_score': mse_list_test,
                                              'rmse_score': rmse_list_test, 'mae_score': mae_list_test,
                                              'mape_score': mape_list_test})


lasso_regression_test_min = lasso_regression_test_df.query('mape_score == mape_score.min()')
lasso_regression_test_min

Unnamed: 0,alpha,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
4,50,1,-0.000124,486.961469,22.067203,17.551492,8.71455
5,50,50,-0.000124,486.961469,22.067203,17.551492,8.71455
6,50,100,-0.000124,486.961469,22.067203,17.551492,8.71455
7,50,200,-0.000124,486.961469,22.067203,17.551492,8.71455
8,100,1,-0.000124,486.961469,22.067203,17.551492,8.71455
9,100,50,-0.000124,486.961469,22.067203,17.551492,8.71455
10,100,100,-0.000124,486.961469,22.067203,17.551492,8.71455
11,100,200,-0.000124,486.961469,22.067203,17.551492,8.71455
12,200,1,-0.000124,486.961469,22.067203,17.551492,8.71455
13,200,50,-0.000124,486.961469,22.067203,17.551492,8.71455


*   Higher alpha values tend to bring r2_score results closer to 0, as expected: alpha is the parameter that multiplies the L1 penalty term in the regression cost function, so larger values shrink the coefficients more strongly.
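
The effect of alpha on r2_score can be checked in isolation. This is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` and the alpha values are assumptions made for illustration, not the notebook's dataset or grid): once alpha is large enough, Lasso sets every coefficient to zero, the model predicts the training mean, and the training r2 becomes 0.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

# Synthetic data (hypothetical, not the notebook's dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

lasso_r2 = {}
lasso_nonzero = {}
for alpha in [0.01, 1, 50]:
    model = Lasso(alpha=alpha, random_state=42).fit(X_demo, y_demo)
    # Count how many coefficients survive the L1 penalty.
    lasso_nonzero[alpha] = int(np.count_nonzero(model.coef_))
    lasso_r2[alpha] = r2_score(y_demo, model.predict(X_demo))
    print(f"alpha={alpha}: non-zero coefficients={lasso_nonzero[alpha]}, "
          f"training r2={lasso_r2[alpha]:.4f}")
```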

## 2.3 Linear Regression Ridge (L2)

In [37]:
params = {'alpha': [1, 50, 100, 200],
          'max_iter': [1, 50, 100, 200]}

alpha_estimator_list = list()
max_iter_estimator_list = list()

r2_list_training = list()
mse_list_training = list()
rmse_list_training = list()
mae_list_training = list()
mape_list_training = list()

r2_list_validation = list()
mse_list_validation = list()
rmse_list_validation = list()
mae_list_validation = list()
mape_list_validation = list()

r2_list_test = list()
mse_list_test = list()
rmse_list_test = list()
mae_list_test = list()
mape_list_test = list()

for i in params.get('alpha'):
    for j in params.get('max_iter'):
        alpha_estimator_list.append(i)
        max_iter_estimator_list.append(j)
        #Model Define
        ridge_regression = Ridge(alpha=i, max_iter=j, random_state=42)
    
        #Model Training
        ridge_regression.fit(X_training_regression, np.ravel(y_training_regression))
    
        #Model Predict
        yhat_ridge_regression_training   = ridge_regression.predict(X_training_regression)
        yhat_ridge_regression_validation = ridge_regression.predict(X_validation_regression)
        yhat_ridge_regression_test       = ridge_regression.predict(X_test_regression)
    
        #training scores
        r2_score_training       = mt.r2_score(y_training_regression, yhat_ridge_regression_training)
        mse_score_training      = mt.mean_squared_error(y_training_regression, yhat_ridge_regression_training)
        rmse_score_training     = np.sqrt(mt.mean_squared_error(y_training_regression, yhat_ridge_regression_training))
        mae_score_training      = mt.mean_absolute_error(y_training_regression, yhat_ridge_regression_training)
        mape_score_training     = mt.mean_absolute_percentage_error(y_training_regression, yhat_ridge_regression_training)
    
        r2_list_training.append(r2_score_training)
        mse_list_training.append(mse_score_training)
        rmse_list_training.append(rmse_score_training)
        mae_list_training.append(mae_score_training)
        mape_list_training.append(mape_score_training)
    
        #validation scores
        r2_score_validation       = mt.r2_score(y_validation_regression, yhat_ridge_regression_validation)
        mse_score_validation      = mt.mean_squared_error(y_validation_regression, yhat_ridge_regression_validation)
        rmse_score_validation     = np.sqrt(mt.mean_squared_error(y_validation_regression, yhat_ridge_regression_validation))
        mae_score_validation      = mt.mean_absolute_error(y_validation_regression, yhat_ridge_regression_validation)
        mape_score_validation     = mt.mean_absolute_percentage_error(y_validation_regression, yhat_ridge_regression_validation)
    
        r2_list_validation.append(r2_score_validation)
        mse_list_validation.append(mse_score_validation)
        rmse_list_validation.append(rmse_score_validation)
        mae_list_validation.append(mae_score_validation)
        mape_list_validation.append(mape_score_validation)
        
        #test scores
        r2_score_test       = mt.r2_score(y_test_regression, yhat_ridge_regression_test)
        mse_score_test      = mt.mean_squared_error(y_test_regression, yhat_ridge_regression_test)
        rmse_score_test     = np.sqrt(mt.mean_squared_error(y_test_regression, yhat_ridge_regression_test))
        mae_score_test      = mt.mean_absolute_error(y_test_regression, yhat_ridge_regression_test)
        mape_score_test     = mt.mean_absolute_percentage_error(y_test_regression, yhat_ridge_regression_test)
    
        r2_list_test.append(r2_score_test)
        mse_list_test.append(mse_score_test)
        rmse_list_test.append(rmse_score_test)
        mae_list_test.append(mae_score_test)
        mape_list_test.append(mape_score_test)

In [38]:
ridge_regression_training_df = pd.DataFrame({ 'alpha': alpha_estimator_list, 'max_iter': max_iter_estimator_list,
                                              'r2_score': r2_list_training, 'mse_score': mse_list_training,
                                              'rmse_score': rmse_list_training, 'mae_score': mae_list_training,
                                              'mape_score': mape_list_training})

ridge_regression_training_min = ridge_regression_training_df.query('mape_score == mape_score.min()')
ridge_regression_training_min

Unnamed: 0,alpha,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
0,1,1,0.046058,455.996401,21.354072,16.998308,8.653415
1,1,50,0.046058,455.996401,21.354072,16.998308,8.653415
2,1,100,0.046058,455.996401,21.354072,16.998308,8.653415
3,1,200,0.046058,455.996401,21.354072,16.998308,8.653415


In [39]:
ridge_regression_validation_df = pd.DataFrame({ 'alpha': alpha_estimator_list, 'max_iter': max_iter_estimator_list,
                                              'r2_score': r2_list_validation, 'mse_score': mse_list_validation,
                                              'rmse_score': rmse_list_validation, 'mae_score': mae_list_validation,
                                              'mape_score': mape_list_validation})

ridge_regression_validation_min = ridge_regression_validation_df.query('mape_score == mape_score.min()')
ridge_regression_validation_min

Unnamed: 0,alpha,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
12,200,1,0.03681,459.934208,21.446077,17.043389,8.672408
13,200,50,0.03681,459.934208,21.446077,17.043389,8.672408
14,200,100,0.03681,459.934208,21.446077,17.043389,8.672408
15,200,200,0.03681,459.934208,21.446077,17.043389,8.672408


In [40]:
ridge_regression_test_df = pd.DataFrame({ 'alpha': alpha_estimator_list, 'max_iter': max_iter_estimator_list,
                                              'r2_score': r2_list_test, 'mse_score': mse_list_test,
                                              'rmse_score': rmse_list_test, 'mae_score': mae_list_test,
                                              'mape_score': mape_list_test})

ridge_regression_test_min = ridge_regression_test_df.query('mape_score == mape_score.min()')
ridge_regression_test_min

Unnamed: 0,alpha,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
0,1,1,0.05231,461.431102,21.480947,17.129678,8.522815
1,1,50,0.05231,461.431102,21.480947,17.129678,8.522815
2,1,100,0.05231,461.431102,21.480947,17.129678,8.522815
3,1,200,0.05231,461.431102,21.480947,17.129678,8.522815


*   As alpha here multiplies the L2 penalty term of Ridge regression, its influence on r2_score is more subtle: the values no longer drop abruptly to 0 as they do for Lasso regression.
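
The gentler behaviour of the L2 penalty can be sketched with the same alpha grid on synthetic data. The arrays below are assumptions made for illustration only: the coefficients shrink as alpha grows, but they are not set to zero, so the training r2 declines gradually.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Synthetic data (hypothetical, not the notebook's dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

ridge_alphas = [1, 50, 100, 200]  # same alpha grid as the notebook
ridge_r2 = []
ridge_largest_coef = []
for alpha in ridge_alphas:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    ridge_r2.append(r2_score(y_demo, model.predict(X_demo)))
    # Largest absolute coefficient: it shrinks with alpha but stays away from 0.
    ridge_largest_coef.append(float(np.max(np.abs(model.coef_))))
    print(f"alpha={alpha}: largest |coef|={ridge_largest_coef[-1]:.3f}, "
          f"training r2={ridge_r2[-1]:.4f}")
```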

## 2.4 Linear Regression Elastic Net (L1 & L2)

In [41]:
params = {'alpha': [1, 50, 100, 200],
          'l1_ratio': [0, 0.2, 0.5, 0.7, 1.0],
          'max_iter': [1, 50, 100, 200]}

alpha_estimator_list = list()
l1_ratio_estimator_list = list()
max_iter_estimator_list = list()

r2_list_training = list()
mse_list_training = list()
rmse_list_training = list()
mae_list_training = list()
mape_list_training = list()

r2_list_validation = list()
mse_list_validation = list()
rmse_list_validation = list()
mae_list_validation = list()
mape_list_validation = list()

r2_list_test = list()
mse_list_test = list()
rmse_list_test = list()
mae_list_test = list()
mape_list_test = list()

for i in params.get('alpha'):
    for j in params.get('l1_ratio'):
        for p in params.get('max_iter'):    
            alpha_estimator_list.append(i)
            l1_ratio_estimator_list.append(j)
            max_iter_estimator_list.append(p)
            #Model Define
            elastic_net = ElasticNet(alpha=i, l1_ratio=j, max_iter=p, random_state=42)
    
            #Model Training
            elastic_net.fit(X_training_regression, np.ravel(y_training_regression))
    
            #Model Predict
            yhat_elastic_net_training   = elastic_net.predict(X_training_regression)
            yhat_elastic_net_validation = elastic_net.predict(X_validation_regression)
            yhat_elastic_net_test       = elastic_net.predict(X_test_regression)
    
            #training scores
            r2_score_training       = mt.r2_score(y_training_regression, yhat_elastic_net_training)
            mse_score_training      = mt.mean_squared_error(y_training_regression, yhat_elastic_net_training)
            rmse_score_training     = np.sqrt(mt.mean_squared_error(y_training_regression, yhat_elastic_net_training))
            mae_score_training      = mt.mean_absolute_error(y_training_regression, yhat_elastic_net_training)
            mape_score_training     = mt.mean_absolute_percentage_error(y_training_regression, yhat_elastic_net_training)
    
            r2_list_training.append(r2_score_training)
            mse_list_training.append(mse_score_training)
            rmse_list_training.append(rmse_score_training)
            mae_list_training.append(mae_score_training)
            mape_list_training.append(mape_score_training)
    
            #validation scores
            r2_score_validation       = mt.r2_score(y_validation_regression, yhat_elastic_net_validation)
            mse_score_validation      = mt.mean_squared_error(y_validation_regression, yhat_elastic_net_validation)
            rmse_score_validation     = np.sqrt(mt.mean_squared_error(y_validation_regression, yhat_elastic_net_validation))
            mae_score_validation      = mt.mean_absolute_error(y_validation_regression, yhat_elastic_net_validation)
            mape_score_validation     = mt.mean_absolute_percentage_error(y_validation_regression, yhat_elastic_net_validation)
    
            r2_list_validation.append(r2_score_validation)
            mse_list_validation.append(mse_score_validation)
            rmse_list_validation.append(rmse_score_validation)
            mae_list_validation.append(mae_score_validation)
            mape_list_validation.append(mape_score_validation)
        
            #test scores
            r2_score_test       = mt.r2_score(y_test_regression, yhat_elastic_net_test)
            mse_score_test      = mt.mean_squared_error(y_test_regression, yhat_elastic_net_test)
            rmse_score_test     = np.sqrt(mt.mean_squared_error(y_test_regression, yhat_elastic_net_test))
            mae_score_test      = mt.mean_absolute_error(y_test_regression, yhat_elastic_net_test)
            mape_score_test     = mt.mean_absolute_percentage_error(y_test_regression, yhat_elastic_net_test)
    
            r2_list_test.append(r2_score_test)
            mse_list_test.append(mse_score_test)
            rmse_list_test.append(rmse_score_test)
            mae_list_test.append(mae_score_test)
            mape_list_test.append(mape_score_test)

In [42]:
elastic_net_training_df = pd.DataFrame({ 'alpha': alpha_estimator_list, 'l1_ratio': l1_ratio_estimator_list,
                                         'max_iter': max_iter_estimator_list, 'r2_score': r2_list_training, 
                                         'mse_score': mse_list_training, 'rmse_score': rmse_list_training, 
                                         'mae_score': mae_list_training, 'mape_score': mape_list_training})

elastic_net_training_min = elastic_net_training_df.query('mape_score == mape_score.min()')
elastic_net_training_min

Unnamed: 0,alpha,l1_ratio,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
1,1,0.0,50,0.010715,472.890766,21.746052,17.277098,8.71988
2,1,0.0,100,0.010715,472.890766,21.746052,17.277098,8.71988
3,1,0.0,200,0.010715,472.890766,21.746052,17.277098,8.71988


In [43]:
elastic_net_validation_df = pd.DataFrame({ 'alpha': alpha_estimator_list, 'l1_ratio': l1_ratio_estimator_list,
                                         'max_iter': max_iter_estimator_list, 'r2_score': r2_list_validation, 
                                         'mse_score': mse_list_validation, 'rmse_score': rmse_list_validation, 
                                         'mae_score': mae_list_validation, 'mape_score': mape_list_validation})

elastic_net_validation_min = elastic_net_validation_df.query('mape_score == mape_score.min()')
elastic_net_validation_min.head()

Unnamed: 0,alpha,l1_ratio,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
24,50,0.2,1,-7.197077e-07,477.511956,21.852047,17.352836,8.678722
25,50,0.2,50,-7.197077e-07,477.511956,21.852047,17.352836,8.678722
26,50,0.2,100,-7.197077e-07,477.511956,21.852047,17.352836,8.678722
27,50,0.2,200,-7.197077e-07,477.511956,21.852047,17.352836,8.678722
28,50,0.5,1,-7.197077e-07,477.511956,21.852047,17.352836,8.678722


In [44]:
elastic_net_test_df = pd.DataFrame({ 'alpha': alpha_estimator_list, 'l1_ratio': l1_ratio_estimator_list,
                                         'max_iter': max_iter_estimator_list, 'r2_score': r2_list_test, 
                                         'mse_score': mse_list_test, 'rmse_score': rmse_list_test, 
                                         'mae_score': mae_list_test, 'mape_score': mape_list_test})

elastic_net_test_min = elastic_net_test_df.query('mape_score == mape_score.min()')
elastic_net_test_min.head()

Unnamed: 0,alpha,l1_ratio,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
24,50,0.2,1,-0.000124,486.961469,22.067203,17.551492,8.71455
25,50,0.2,50,-0.000124,486.961469,22.067203,17.551492,8.71455
26,50,0.2,100,-0.000124,486.961469,22.067203,17.551492,8.71455
27,50,0.2,200,-0.000124,486.961469,22.067203,17.551492,8.71455
28,50,0.5,1,-0.000124,486.961469,22.067203,17.551492,8.71455


*   Elastic Net models combine both the L1 and L2 penalty terms in the regression cost function.
    The 'l1_ratio' parameter is the mixing parameter: for 'l1_ratio=0' the penalty is exclusively L2; for 'l1_ratio=1' the penalty is exclusively L1.
*   On the training dataset, the lowest MAPE was obtained with l1_ratio=0 (together with alpha=1), which is expected since the L1 penalty is a more severe regularization term than L2. On the validation and test datasets, the lowest MAPE instead came from strongly regularized settings (alpha=50 and above) whose r2_score is approximately 0.
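
The role of `l1_ratio` as a mixing parameter can be written out directly. The sketch below follows the penalty term given in the scikit-learn documentation for `ElasticNet`, `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`; the function name `elastic_net_penalty` and the synthetic data are hypothetical and serve only as an illustration. It also checks that `l1_ratio=1` reproduces the Lasso coefficients.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso


def elastic_net_penalty(coef, alpha, l1_ratio):
    """Penalty term of scikit-learn's ElasticNet objective (hypothetical helper)."""
    coef = np.asarray(coef, dtype=float)
    l1_term = alpha * l1_ratio * np.sum(np.abs(coef))
    l2_term = 0.5 * alpha * (1.0 - l1_ratio) * np.sum(coef ** 2)
    return l1_term + l2_term


# Penalty for the same coefficients under pure L2, a 50/50 mix, and pure L1.
penalties = {r: elastic_net_penalty([1.0, -2.0], alpha=2.0, l1_ratio=r)
             for r in (0.0, 0.5, 1.0)}
print(penalties)

# Synthetic data (hypothetical, not the notebook's dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=200)

# With l1_ratio=1 the Elastic Net penalty reduces to the Lasso penalty.
enet_l1 = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
same_as_lasso = bool(np.allclose(enet_l1.coef_, lasso.coef_))
print("l1_ratio=1 matches Lasso:", same_as_lasso)
```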

## 2.5 Polynomial Regression

In [45]:
params = {'degree': np.arange(1, 6).tolist()}

degree_estimator_list = list()

r2_list_training = list()
mse_list_training = list()
rmse_list_training = list()
mae_list_training = list()
mape_list_training = list()

r2_list_validation = list()
mse_list_validation = list()
rmse_list_validation = list()
mae_list_validation = list()
mape_list_validation = list()

r2_list_test = list()
mse_list_test = list()
rmse_list_test = list()
mae_list_test = list()
mape_list_test = list()

for i in params.get('degree'):
    degree_estimator_list.append(i)
    #Model Define
    polynomial = pp.PolynomialFeatures(degree=i)
    X_poly_training = polynomial.fit_transform(X_training_regression)
    # Reuse the expansion fitted on the training data for the other splits
    X_poly_validation = polynomial.transform(X_validation_regression)
    X_poly_test = polynomial.transform(X_test_regression)
    
    polynomial_linear_regression = LinearRegression()
    
    #Model Training
    polynomial_linear_regression.fit(X_poly_training, np.ravel(y_training_regression))
    
    #Model Predict
    yhat_polynomial_lr_training   = polynomial_linear_regression.predict(X_poly_training)
    yhat_polynomial_lr_validation = polynomial_linear_regression.predict(X_poly_validation)
    yhat_polynomial_lr_test       = polynomial_linear_regression.predict(X_poly_test)
    
    #training scores
    r2_score_training       = mt.r2_score(y_training_regression, yhat_polynomial_lr_training)
    mse_score_training      = mt.mean_squared_error(y_training_regression, yhat_polynomial_lr_training)
    rmse_score_training     = np.sqrt(mt.mean_squared_error(y_training_regression, yhat_polynomial_lr_training))
    mae_score_training      = mt.mean_absolute_error(y_training_regression, yhat_polynomial_lr_training)
    mape_score_training     = mt.mean_absolute_percentage_error(y_training_regression, yhat_polynomial_lr_training)
    
    r2_list_training.append(r2_score_training)
    mse_list_training.append(mse_score_training)
    rmse_list_training.append(rmse_score_training)
    mae_list_training.append(mae_score_training)
    mape_list_training.append(mape_score_training)
    
    #validation scores
    r2_score_validation       = mt.r2_score(y_validation_regression, yhat_polynomial_lr_validation)
    mse_score_validation      = mt.mean_squared_error(y_validation_regression, yhat_polynomial_lr_validation)
    rmse_score_validation     = np.sqrt(mt.mean_squared_error(y_validation_regression, yhat_polynomial_lr_validation))
    mae_score_validation      = mt.mean_absolute_error(y_validation_regression, yhat_polynomial_lr_validation)
    mape_score_validation     = mt.mean_absolute_percentage_error(y_validation_regression, yhat_polynomial_lr_validation)
    
    r2_list_validation.append(r2_score_validation)
    mse_list_validation.append(mse_score_validation)
    rmse_list_validation.append(rmse_score_validation)
    mae_list_validation.append(mae_score_validation)
    mape_list_validation.append(mape_score_validation)
    
    #test scores
    r2_score_test       = mt.r2_score(y_test_regression, yhat_polynomial_lr_test)
    mse_score_test      = mt.mean_squared_error(y_test_regression, yhat_polynomial_lr_test)
    rmse_score_test     = np.sqrt(mt.mean_squared_error(y_test_regression, yhat_polynomial_lr_test))
    mae_score_test      = mt.mean_absolute_error(y_test_regression, yhat_polynomial_lr_test)
    mape_score_test     = mt.mean_absolute_percentage_error(y_test_regression, yhat_polynomial_lr_test)
    
    r2_list_test.append(r2_score_test)
    mse_list_test.append(mse_score_test)
    rmse_list_test.append(rmse_score_test)
    mae_list_test.append(mae_score_test)
    mape_list_test.append(mape_score_test)

In [46]:
polynomial_regression_training_df = pd.DataFrame({ 'degree': degree_estimator_list, 'r2_score': r2_list_training, 
                                                   'mse_score': mse_list_training, 'rmse_score': rmse_list_training, 
                                                   'mae_score': mae_list_training, 'mape_score': mape_list_training})
polynomial_regression_training_df

Unnamed: 0,degree,r2_score,mse_score,rmse_score,mae_score,mape_score
0,1,0.046058,455.996112,21.354065,16.998249,8.653186
1,2,0.094195,432.98621,20.808321,16.458032,8.35054
2,3,0.154418,404.19895,20.1047,15.883592,7.800181
3,4,0.333957,318.377086,17.843124,13.614247,5.913391
4,5,0.7253,131.310015,11.459058,7.266166,2.215335


In [47]:
polynomial_regression_training_min = polynomial_regression_training_df.query('mape_score == mape_score.min()')
polynomial_regression_training_min

Unnamed: 0,degree,r2_score,mse_score,rmse_score,mae_score,mape_score
4,5,0.7253,131.310015,11.459058,7.266166,2.215335


In [48]:
polynomial_regression_validation_df = pd.DataFrame({ 'degree': degree_estimator_list, 'r2_score': r2_list_validation, 
                                                   'mse_score': mse_list_validation, 'rmse_score': rmse_list_validation, 
                                                   'mae_score': mae_list_validation, 'mape_score': mape_list_validation})
polynomial_regression_validation_df

Unnamed: 0,degree,r2_score,mse_score,rmse_score,mae_score,mape_score
0,1,0.039925,458.447,21.411376,17.039754,8.682542
1,2,0.066477,445.7682,21.113224,16.749939,8.547931
2,3,-0.047778,500.3263,22.367974,17.087201,8.678283
3,4,-102.923632,49624.74,222.766112,36.10422,10.184802
4,5,-224879.156426,107382900.0,10362.571399,1354.10913,346.300824


In [49]:
polynomial_regression_validation_min = polynomial_regression_validation_df.query('mape_score == mape_score.min()')
polynomial_regression_validation_min

Unnamed: 0,degree,r2_score,mse_score,rmse_score,mae_score,mape_score
1,2,0.066477,445.768223,21.113224,16.749939,8.547931


In [50]:
polynomial_regression_test_df = pd.DataFrame({ 'degree': degree_estimator_list, 'r2_score': r2_list_test, 
                                                   'mse_score': mse_list_test, 'rmse_score': rmse_list_test, 
                                                   'mae_score': mae_list_test, 'mape_score': mape_list_test})
polynomial_regression_test_df

Unnamed: 0,degree,r2_score,mse_score,rmse_score,mae_score,mape_score
0,1,0.05231712,461.4277,21.480869,17.129965,8.521859
1,2,0.09007934,443.0413,21.048545,16.720535,8.242464
2,3,-0.2617516,614.3481,24.786046,17.178214,7.956229
3,4,-563.3125,274764.3,524.179648,40.303351,19.956479
4,5,-1454237.0,708070100.0,26609.586203,1524.866134,465.11239


In [51]:
polynomial_regression_test_min = polynomial_regression_test_df.query('mape_score == mape_score.min()')
polynomial_regression_test_min

Unnamed: 0,degree,r2_score,mse_score,rmse_score,mae_score,mape_score
2,3,-0.261752,614.348098,24.786046,17.178214,7.956229


*   The polynomial basis expansion of X becomes computationally expensive as more degrees are added to the model, which is why this study stops at degree=5.
*   On the training dataset, higher polynomial degrees describe the data with increasing accuracy, as seen in the r2 and MAPE scores (r2 rises from 0.05 at degree=1 to 0.73 at degree=5).
*   On the validation and test datasets the opposite happens: r2 turns negative at degree=3 and collapses at degrees 4 and 5. This pattern indicates that the higher-degree models overfit the training data, so the lower degrees generalize better here.
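
The gap between training and held-out scores can be explored with a small sketch. Everything below is an assumption made for illustration (synthetic one-dimensional data, a scikit-learn `Pipeline`, and the variable names), not the notebook's dataset or its exact procedure. The pipeline fits the polynomial expansion on the training split only and reuses it on the validation split. With ordinary least squares the training r2 cannot decrease as the degree grows, whereas the validation r2 is free to drop.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (hypothetical, not the notebook's dataset).
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(30, 1))
y_train = np.sin(3 * X_train[:, 0]) + rng.normal(scale=0.3, size=30)
X_valid = rng.uniform(-1, 1, size=(30, 1))
y_valid = np.sin(3 * X_valid[:, 0]) + rng.normal(scale=0.3, size=30)

train_r2, valid_r2 = [], []
for degree in range(1, 6):
    # The pipeline fits PolynomialFeatures on the training data only.
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    train_r2.append(r2_score(y_train, model.predict(X_train)))
    valid_r2.append(r2_score(y_valid, model.predict(X_valid)))
    print(f"degree={degree}: training r2={train_r2[-1]:.4f}, "
          f"validation r2={valid_r2[-1]:.4f}")
```

Selecting the degree by the validation score rather than the training score is the usual guard against this effect.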

## 2.6 Polynomial Regression Lasso (L1)

In [52]:
params = {'degree': np.arange(1, 6).tolist(),
          'alpha': [1, 50, 100, 200],
          'max_iter': [1, 50, 100, 200]}

degree_estimator_list = list()
alpha_estimator_list = list()
max_iter_estimator_list = list()

r2_list_training = list()
mse_list_training = list()
rmse_list_training = list()
mae_list_training = list()
mape_list_training = list()

r2_list_validation = list()
mse_list_validation = list()
rmse_list_validation = list()
mae_list_validation = list()
mape_list_validation = list()

r2_list_test = list()
mse_list_test = list()
rmse_list_test = list()
mae_list_test = list()
mape_list_test = list()

for i in params.get('degree'):
    for j in params.get('alpha'):
        for p in params.get('max_iter'):
            degree_estimator_list.append(i)
            alpha_estimator_list.append(j)
            max_iter_estimator_list.append(p)
            #Model Define
            polynomial = pp.PolynomialFeatures(degree=i)
            X_poly_training = polynomial.fit_transform(X_training_regression)
            # Reuse the expansion fitted on the training data for the other splits
            X_poly_validation = polynomial.transform(X_validation_regression)
            X_poly_test = polynomial.transform(X_test_regression)
    
            polynomial_lasso_regression = Lasso(alpha=j, max_iter=p, random_state=42)
    
            #Model Training
            polynomial_lasso_regression.fit(X_poly_training, np.ravel(y_training_regression))
    
            #Model Predict
            yhat_polynomial_lasso_training   = polynomial_lasso_regression.predict(X_poly_training)
            yhat_polynomial_lasso_validation = polynomial_lasso_regression.predict(X_poly_validation)
            yhat_polynomial_lasso_test       = polynomial_lasso_regression.predict(X_poly_test)
    
            #training scores
            r2_score_training       = mt.r2_score(y_training_regression, yhat_polynomial_lasso_training)
            mse_score_training      = mt.mean_squared_error(y_training_regression, yhat_polynomial_lasso_training)
            rmse_score_training     = np.sqrt(mt.mean_squared_error(y_training_regression, yhat_polynomial_lasso_training))
            mae_score_training      = mt.mean_absolute_error(y_training_regression, yhat_polynomial_lasso_training)
            mape_score_training     = mt.mean_absolute_percentage_error(y_training_regression, yhat_polynomial_lasso_training)
    
            r2_list_training.append(r2_score_training)
            mse_list_training.append(mse_score_training)
            rmse_list_training.append(rmse_score_training)
            mae_list_training.append(mae_score_training)
            mape_list_training.append(mape_score_training)
            
            #validation scores
            r2_score_validation       = mt.r2_score(y_validation_regression, yhat_polynomial_lasso_validation)
            mse_score_validation      = mt.mean_squared_error(y_validation_regression, yhat_polynomial_lasso_validation)
            rmse_score_validation     = np.sqrt(mt.mean_squared_error(y_validation_regression, yhat_polynomial_lasso_validation))
            mae_score_validation      = mt.mean_absolute_error(y_validation_regression, yhat_polynomial_lasso_validation)
            mape_score_validation     = mt.mean_absolute_percentage_error(y_validation_regression, yhat_polynomial_lasso_validation)
    
            r2_list_validation.append(r2_score_validation)
            mse_list_validation.append(mse_score_validation)
            rmse_list_validation.append(rmse_score_validation)
            mae_list_validation.append(mae_score_validation)
            mape_list_validation.append(mape_score_validation)
            
            #test scores
            r2_score_test       = mt.r2_score(y_test_regression, yhat_polynomial_lasso_test)
            mse_score_test      = mt.mean_squared_error(y_test_regression, yhat_polynomial_lasso_test)
            rmse_score_test     = np.sqrt(mt.mean_squared_error(y_test_regression, yhat_polynomial_lasso_test))
            mae_score_test      = mt.mean_absolute_error(y_test_regression, yhat_polynomial_lasso_test)
            mape_score_test     = mt.mean_absolute_percentage_error(y_test_regression, yhat_polynomial_lasso_test)
    
            r2_list_test.append(r2_score_test)
            mse_list_test.append(mse_score_test)
            rmse_list_test.append(rmse_score_test)
            mae_list_test.append(mae_score_test)
            mape_list_test.append(mape_score_test)
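
The three near-identical metric blocks in the loop above could be folded into a single helper. The following is a minimal sketch, assuming a hypothetical function name `regression_scores` (it is not part of the original notebook); it relies only on the `sklearn.metrics` functions the notebook already calls.

```python
import numpy as np
from sklearn import metrics as mt


def regression_scores(y_true, y_pred):
    """Return the five regression metrics used in this notebook as a dict.

    Hypothetical helper, not part of the original notebook.
    """
    mse = mt.mean_squared_error(y_true, y_pred)
    return {
        'r2_score': mt.r2_score(y_true, y_pred),
        'mse_score': mse,
        'rmse_score': np.sqrt(mse),  # reuse the MSE instead of computing it twice
        'mae_score': mt.mean_absolute_error(y_true, y_pred),
        'mape_score': mt.mean_absolute_percentage_error(y_true, y_pred),
    }


# Toy values for illustration only.
scores = regression_scores(np.array([1.0, 2.0, 4.0]), np.array([1.0, 2.0, 2.0]))
print(scores)
```

Each pass of the loop could then append one such dict per dataset to a list, and `pd.DataFrame(list_of_dicts)` would build the same results table.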

In [53]:
polynomial_lasso_regression_training_df = pd.DataFrame({ 'degree': degree_estimator_list, 'alpha': alpha_estimator_list,
                                                         'max_iter': max_iter_estimator_list, 'r2_score': r2_list_training, 
                                                         'mse_score': mse_list_training, 'rmse_score': rmse_list_training, 
                                                         'mae_score': mae_list_training, 'mape_score': mape_list_training})

polynomial_lasso_regression_training_min = polynomial_lasso_regression_training_df.query('mape_score == mape_score.min()')
polynomial_lasso_regression_training_min

Unnamed: 0,degree,alpha,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
67,5,1,200,0.0195,468.691307,21.64928,17.158121,8.574792


In [54]:
polynomial_lasso_regression_validation_df = pd.DataFrame({ 'degree': degree_estimator_list, 'alpha': alpha_estimator_list,
                                                         'max_iter': max_iter_estimator_list, 'r2_score': r2_list_validation, 
                                                         'mse_score': mse_list_validation, 'rmse_score': rmse_list_validation, 
                                                         'mae_score': mae_list_validation, 'mape_score': mape_list_validation})

polynomial_lasso_regression_validation_min = polynomial_lasso_regression_validation_df.query('mape_score == mape_score.min()')
polynomial_lasso_regression_validation_min

Unnamed: 0,degree,alpha,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
34,3,1,100,0.014148,470.755769,21.696907,17.180595,8.655828
35,3,1,200,0.014148,470.755769,21.696907,17.180595,8.655828


In [55]:
polynomial_lasso_regression_test_df = pd.DataFrame({ 'degree': degree_estimator_list, 'alpha': alpha_estimator_list,
                                                         'max_iter': max_iter_estimator_list, 'r2_score': r2_list_test, 
                                                         'mse_score': mse_list_test, 'rmse_score': rmse_list_test, 
                                                         'mae_score': mae_list_test, 'mape_score': mape_list_test})

polynomial_lasso_regression_test_min = polynomial_lasso_regression_test_df.query('mape_score == mape_score.min()')
polynomial_lasso_regression_test_min

Unnamed: 0,degree,alpha,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
76,5,200,1,4e-06,486.899198,22.065792,17.550363,8.714456


## 2.7 Polynomial Regression Ridge (L2)

In [56]:
params = {'degree': np.arange(1, 6).tolist(),
          'alpha': [1, 50, 100, 200],
          'max_iter': [1, 50, 100, 200]}

degree_estimator_list = list()
alpha_estimator_list = list()
max_iter_estimator_list = list()

r2_list_training = list()
mse_list_training = list()
rmse_list_training = list()
mae_list_training = list()
mape_list_training = list()

r2_list_validation = list()
mse_list_validation = list()
rmse_list_validation = list()
mae_list_validation = list()
mape_list_validation = list()

r2_list_test = list()
mse_list_test = list()
rmse_list_test = list()
mae_list_test = list()
mape_list_test = list()

for i in params.get('degree'):
    for j in params.get('alpha'):
        for p in params.get('max_iter'):
            degree_estimator_list.append(i)
            alpha_estimator_list.append(j)
            max_iter_estimator_list.append(p)
            #Model Define
            polynomial = pp.PolynomialFeatures(degree=i)
            X_poly_training = polynomial.fit_transform(X_training_regression)
            #Reuse the expansion fitted on the training data instead of refitting it
            X_poly_validation = polynomial.transform(X_validation_regression)
            X_poly_test = polynomial.transform(X_test_regression)
    
            polynomial_ridge_regression = Ridge(alpha=j, max_iter=p, random_state=42)
    
            #Model Training
            polynomial_ridge_regression.fit(X_poly_training, np.ravel(y_training_regression))
    
            #Model Predict
            yhat_polynomial_ridge_training   = polynomial_ridge_regression.predict(X_poly_training)
            yhat_polynomial_ridge_validation = polynomial_ridge_regression.predict(X_poly_validation)
            yhat_polynomial_ridge_test       = polynomial_ridge_regression.predict(X_poly_test)
    
            #training scores
            r2_score_training       = mt.r2_score(y_training_regression, yhat_polynomial_ridge_training)
            mse_score_training      = mt.mean_squared_error(y_training_regression, yhat_polynomial_ridge_training)
            rmse_score_training     = np.sqrt(mt.mean_squared_error(y_training_regression, yhat_polynomial_ridge_training))
            mae_score_training      = mt.mean_absolute_error(y_training_regression, yhat_polynomial_ridge_training)
            mape_score_training     = mt.mean_absolute_percentage_error(y_training_regression, yhat_polynomial_ridge_training)
    
            r2_list_training.append(r2_score_training)
            mse_list_training.append(mse_score_training)
            rmse_list_training.append(rmse_score_training)
            mae_list_training.append(mae_score_training)
            mape_list_training.append(mape_score_training)
            
            #validation scores
            r2_score_validation       = mt.r2_score(y_validation_regression, yhat_polynomial_ridge_validation)
            mse_score_validation      = mt.mean_squared_error(y_validation_regression, yhat_polynomial_ridge_validation)
            rmse_score_validation     = np.sqrt(mt.mean_squared_error(y_validation_regression, yhat_polynomial_ridge_validation))
            mae_score_validation      = mt.mean_absolute_error(y_validation_regression, yhat_polynomial_ridge_validation)
            mape_score_validation     = mt.mean_absolute_percentage_error(y_validation_regression, yhat_polynomial_ridge_validation)
    
            r2_list_validation.append(r2_score_validation)
            mse_list_validation.append(mse_score_validation)
            rmse_list_validation.append(rmse_score_validation)
            mae_list_validation.append(mae_score_validation)
            mape_list_validation.append(mape_score_validation)
            
            #test scores
            r2_score_test       = mt.r2_score(y_test_regression, yhat_polynomial_ridge_test)
            mse_score_test      = mt.mean_squared_error(y_test_regression, yhat_polynomial_ridge_test)
            rmse_score_test     = np.sqrt(mt.mean_squared_error(y_test_regression, yhat_polynomial_ridge_test))
            mae_score_test      = mt.mean_absolute_error(y_test_regression, yhat_polynomial_ridge_test)
            mape_score_test     = mt.mean_absolute_percentage_error(y_test_regression, yhat_polynomial_ridge_test)
    
            r2_list_test.append(r2_score_test)
            mse_list_test.append(mse_score_test)
            rmse_list_test.append(rmse_score_test)
            mae_list_test.append(mae_score_test)
            mape_list_test.append(mape_score_test)

In [57]:
polynomial_ridge_regression_training_df = pd.DataFrame({ 'degree': degree_estimator_list, 'alpha': alpha_estimator_list,
                                                         'max_iter': max_iter_estimator_list, 'r2_score': r2_list_training, 
                                                         'mse_score': mse_list_training, 'rmse_score': rmse_list_training, 
                                                         'mae_score': mae_list_training, 'mape_score': mape_list_training})

polynomial_ridge_regression_training_min = polynomial_ridge_regression_training_df.query('mape_score == mape_score.min()')
polynomial_ridge_regression_training_min

Unnamed: 0,degree,alpha,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
64,5,1,1,0.312725,328.525923,18.125284,14.002097,6.477741
65,5,1,50,0.312725,328.525923,18.125284,14.002097,6.477741
66,5,1,100,0.312725,328.525923,18.125284,14.002097,6.477741
67,5,1,200,0.312725,328.525923,18.125284,14.002097,6.477741


In [58]:
polynomial_ridge_regression_validation_df = pd.DataFrame({ 'degree': degree_estimator_list, 'alpha': alpha_estimator_list,
                                                         'max_iter': max_iter_estimator_list, 'r2_score': r2_list_validation, 
                                                         'mse_score': mse_list_validation, 'rmse_score': rmse_list_validation, 
                                                         'mae_score': mae_list_validation, 'mape_score': mape_list_validation})

polynomial_ridge_regression_validation_min = polynomial_ridge_regression_validation_df.query('mape_score == mape_score.min()')
polynomial_ridge_regression_validation_min

Unnamed: 0,degree,alpha,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
68,5,50,1,-9.685894,5102.638358,71.432754,18.826258,8.348152
69,5,50,50,-9.685894,5102.638358,71.432754,18.826258,8.348152
70,5,50,100,-9.685894,5102.638358,71.432754,18.826258,8.348152
71,5,50,200,-9.685894,5102.638358,71.432754,18.826258,8.348152


In [59]:
polynomial_ridge_regression_test_df = pd.DataFrame({ 'degree': degree_estimator_list, 'alpha': alpha_estimator_list,
                                                         'max_iter': max_iter_estimator_list, 'r2_score': r2_list_test, 
                                                         'mse_score': mse_list_test, 'rmse_score': rmse_list_test, 
                                                         'mae_score': mae_list_test, 'mape_score': mape_list_test})

polynomial_ridge_regression_test_min = polynomial_ridge_regression_test_df.query('mape_score == mape_score.min()')
polynomial_ridge_regression_test_min

Unnamed: 0,degree,alpha,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
48,4,1,1,-159.704238,78247.050076,279.726742,21.433575,8.113779
49,4,1,50,-159.704238,78247.050076,279.726742,21.433575,8.113779
50,4,1,100,-159.704238,78247.050076,279.726742,21.433575,8.113779
51,4,1,200,-159.704238,78247.050076,279.726742,21.433575,8.113779


*   Both Lasso and Ridge regularization, as in the simple Linear Regression models, tend to raise the bias of the polynomial regression. This shows in how the r2 and MAPE scores of the regularized polynomial regressions vary less than those of the plain polynomial regression.
*   With a lower variance error, the regularized polynomial regression models are able to generalize the data behaviour better.
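
One way to see the added bias is to look at how regularization shrinks the fitted coefficients. The sketch below is an illustration on synthetic data, not on the notebook's dataset; the `alpha` values, the `coef_norms` name and the cubic signal are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic, hypothetical data: a cubic signal plus noise.
X = rng.uniform(-1, 1, size=(200, 1))
y = X[:, 0] ** 3 - X[:, 0] + rng.normal(0, 0.1, size=200)

# include_bias=False: the estimators below fit their own intercept.
X_poly = PolynomialFeatures(degree=5, include_bias=False).fit_transform(X)

models = {
    'plain': LinearRegression(),
    'ridge': Ridge(alpha=1.0),
    'lasso': Lasso(alpha=0.01, max_iter=10_000),
}

coef_norms = {}
for name, model in models.items():
    model.fit(X_poly, y)
    coef_norms[name] = {
        'l1_norm': np.abs(model.coef_).sum(),          # quantity penalized by Lasso
        'l2_norm': np.sqrt((model.coef_ ** 2).sum()),  # quantity penalized by Ridge
        'zero_coefs': int((model.coef_ == 0).sum()),   # Lasso can set terms to exactly zero
    }

for name, norms in coef_norms.items():
    print(name, norms)
```

Smaller coefficients mean a smoother, less flexible curve: the model follows the training data less closely (more bias) but reacts less to the particular sample it was trained on (less variance).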

## 2.8 Polynomial Regression Elastic Net (L1 & L2)

In [60]:
params = {'degree': np.arange(1, 6).tolist(),
          'alpha': [1, 50, 100, 200],
          'l1_ratio': [0, 0.2, 0.5, 0.7, 1.0],
          'max_iter': [1, 50, 100, 200]}

degree_estimator_list = list()
alpha_estimator_list = list()
l1_ratio_estimator_list = list()
max_iter_estimator_list = list()

r2_list_training = list()
mse_list_training = list()
rmse_list_training = list()
mae_list_training = list()
mape_list_training = list()

r2_list_validation = list()
mse_list_validation = list()
rmse_list_validation = list()
mae_list_validation = list()
mape_list_validation = list()

r2_list_test = list()
mse_list_test = list()
rmse_list_test = list()
mae_list_test = list()
mape_list_test = list()

for i in params.get('degree'):
    for j in params.get('alpha'):
        for p in params.get('l1_ratio'):
            for q in params.get('max_iter'):
                degree_estimator_list.append(i)
                alpha_estimator_list.append(j)
                l1_ratio_estimator_list.append(p)
                max_iter_estimator_list.append(q)
                #Model Define
                polynomial = pp.PolynomialFeatures(degree=i)
                X_poly_training = polynomial.fit_transform(X_training_regression)
                #Reuse the expansion fitted on the training data instead of refitting it
                X_poly_validation = polynomial.transform(X_validation_regression)
                X_poly_test = polynomial.transform(X_test_regression)
    
                polynomial_elastic_net_regression = ElasticNet(alpha=j, l1_ratio=p, max_iter=q, random_state=42)
    
                #Model Training
                polynomial_elastic_net_regression.fit(X_poly_training, np.ravel(y_training_regression))
    
                #Model Predict
                yhat_polynomial_elastic_net_training   = polynomial_elastic_net_regression.predict(X_poly_training)
                yhat_polynomial_elastic_net_validation = polynomial_elastic_net_regression.predict(X_poly_validation)
                yhat_polynomial_elastic_net_test       = polynomial_elastic_net_regression.predict(X_poly_test)
                
                #training scores
                r2_score_training       = mt.r2_score(y_training_regression, yhat_polynomial_elastic_net_training)
                mse_score_training      = mt.mean_squared_error(y_training_regression, yhat_polynomial_elastic_net_training)
                rmse_score_training     = np.sqrt(mt.mean_squared_error(y_training_regression, yhat_polynomial_elastic_net_training))
                mae_score_training      = mt.mean_absolute_error(y_training_regression, yhat_polynomial_elastic_net_training)
                mape_score_training     = mt.mean_absolute_percentage_error(y_training_regression, yhat_polynomial_elastic_net_training)
    
                r2_list_training.append(r2_score_training)
                mse_list_training.append(mse_score_training)
                rmse_list_training.append(rmse_score_training)
                mae_list_training.append(mae_score_training)
                mape_list_training.append(mape_score_training)
                
                #validation scores
                r2_score_validation       = mt.r2_score(y_validation_regression, yhat_polynomial_elastic_net_validation)
                mse_score_validation      = mt.mean_squared_error(y_validation_regression, yhat_polynomial_elastic_net_validation)
                rmse_score_validation     = np.sqrt(mt.mean_squared_error(y_validation_regression, yhat_polynomial_elastic_net_validation))
                mae_score_validation      = mt.mean_absolute_error(y_validation_regression, yhat_polynomial_elastic_net_validation)
                mape_score_validation     = mt.mean_absolute_percentage_error(y_validation_regression, yhat_polynomial_elastic_net_validation)
    
                r2_list_validation.append(r2_score_validation)
                mse_list_validation.append(mse_score_validation)
                rmse_list_validation.append(rmse_score_validation)
                mae_list_validation.append(mae_score_validation)
                mape_list_validation.append(mape_score_validation)  
                
                #test scores
                r2_score_test       = mt.r2_score(y_test_regression, yhat_polynomial_elastic_net_test)
                mse_score_test      = mt.mean_squared_error(y_test_regression, yhat_polynomial_elastic_net_test)
                rmse_score_test     = np.sqrt(mt.mean_squared_error(y_test_regression, yhat_polynomial_elastic_net_test))
                mae_score_test      = mt.mean_absolute_error(y_test_regression, yhat_polynomial_elastic_net_test)
                mape_score_test     = mt.mean_absolute_percentage_error(y_test_regression, yhat_polynomial_elastic_net_test)
    
                r2_list_test.append(r2_score_test)
                mse_list_test.append(mse_score_test)
                rmse_list_test.append(rmse_score_test)
                mae_list_test.append(mae_score_test)
                mape_list_test.append(mape_score_test)    

In [61]:
polynomial_elastic_net_regression_training_df = pd.DataFrame({ 'degree': degree_estimator_list, 'alpha': alpha_estimator_list,
                                                               'l1_ratio': l1_ratio_estimator_list, 'max_iter': max_iter_estimator_list, 
                                                               'r2_score': r2_list_training, 'mse_score': mse_list_training, 
                                                               'rmse_score': rmse_list_training, 'mae_score': mae_list_training, 
                                                               'mape_score': mape_list_training})

polynomial_elastic_net_regression_training_min = polynomial_elastic_net_regression_training_df.query('mape_score == mape_score.min()')
polynomial_elastic_net_regression_training_min

Unnamed: 0,degree,alpha,l1_ratio,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
323,5,1,0.0,200,0.085699,437.047167,20.905673,16.523671,8.278886


In [62]:
polynomial_elastic_net_regression_validation_df = pd.DataFrame({ 'degree': degree_estimator_list, 'alpha': alpha_estimator_list,
                                                                 'l1_ratio': l1_ratio_estimator_list, 'max_iter': max_iter_estimator_list, 
                                                                 'r2_score': r2_list_validation, 'mse_score': mse_list_validation, 
                                                                 'rmse_score': rmse_list_validation, 'mae_score': mae_list_validation, 
                                                                 'mape_score': mape_list_validation})

polynomial_elastic_net_regression_validation_min = polynomial_elastic_net_regression_validation_df.query('mape_score == mape_score.min()')
polynomial_elastic_net_regression_validation_min

Unnamed: 0,degree,alpha,l1_ratio,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
320,5,1,0.0,1,0.054574,451.452047,21.247401,16.808199,8.630816


In [63]:
polynomial_elastic_net_regression_test_df = pd.DataFrame({ 'degree': degree_estimator_list, 'alpha': alpha_estimator_list,
                                                           'l1_ratio': l1_ratio_estimator_list, 'max_iter': max_iter_estimator_list, 
                                                           'r2_score': r2_list_test, 'mse_score': mse_list_test, 
                                                           'rmse_score': rmse_list_test, 'mae_score': mae_list_test, 
                                                           'mape_score': mape_list_test})

polynomial_elastic_net_regression_test_min = polynomial_elastic_net_regression_test_df.query('mape_score == mape_score.min()')
polynomial_elastic_net_regression_test_min

Unnamed: 0,degree,alpha,l1_ratio,max_iter,r2_score,mse_score,rmse_score,mae_score,mape_score
320,5,1,0.0,1,-0.169653,569.505007,23.864304,16.998383,8.488322


## 2.9 Decision Tree Regressor

In [64]:
params = {'max_depth': [1, 5, 10, 15, 20]}

max_depth_list = list()

r2_list_training = list()
mse_list_training = list()
rmse_list_training = list()
mae_list_training = list()
mape_list_training = list()

r2_list_validation = list()
mse_list_validation = list()
rmse_list_validation = list()
mae_list_validation = list()
mape_list_validation = list()

r2_list_test = list()
mse_list_test = list()
rmse_list_test = list()
mae_list_test = list()
mape_list_test = list()

for i in params.get('max_depth'):
    max_depth_list.append(i)
    #Model Define
    tree_regression = DecisionTreeRegressor(max_depth=i, random_state=42)
    
    #Model Training
    tree_regression.fit(X_training_regression, np.ravel(y_training_regression))
    
    #Model Predict
    yhat_tree_regression_training   = tree_regression.predict(X_training_regression)
    yhat_tree_regression_validation = tree_regression.predict(X_validation_regression)
    yhat_tree_regression_test       = tree_regression.predict(X_test_regression)
    
    #training scores
    r2_score_training       = mt.r2_score(y_training_regression, yhat_tree_regression_training)
    mse_score_training      = mt.mean_squared_error(y_training_regression, yhat_tree_regression_training)
    rmse_score_training     = np.sqrt(mt.mean_squared_error(y_training_regression, yhat_tree_regression_training))
    mae_score_training      = mt.mean_absolute_error(y_training_regression, yhat_tree_regression_training)
    mape_score_training     = mt.mean_absolute_percentage_error(y_training_regression, yhat_tree_regression_training)
    
    r2_list_training.append(r2_score_training)
    mse_list_training.append(mse_score_training)
    rmse_list_training.append(rmse_score_training)
    mae_list_training.append(mae_score_training)
    mape_list_training.append(mape_score_training)
    
    #validation scores
    r2_score_validation       = mt.r2_score(y_validation_regression, yhat_tree_regression_validation)
    mse_score_validation      = mt.mean_squared_error(y_validation_regression, yhat_tree_regression_validation)
    rmse_score_validation     = np.sqrt(mt.mean_squared_error(y_validation_regression, yhat_tree_regression_validation))
    mae_score_validation      = mt.mean_absolute_error(y_validation_regression, yhat_tree_regression_validation)
    mape_score_validation     = mt.mean_absolute_percentage_error(y_validation_regression, yhat_tree_regression_validation)
    
    r2_list_validation.append(r2_score_validation)
    mse_list_validation.append(mse_score_validation)
    rmse_list_validation.append(rmse_score_validation)
    mae_list_validation.append(mae_score_validation)
    mape_list_validation.append(mape_score_validation)
    
    #test scores
    r2_score_test       = mt.r2_score(y_test_regression, yhat_tree_regression_test)
    mse_score_test      = mt.mean_squared_error(y_test_regression, yhat_tree_regression_test)
    rmse_score_test     = np.sqrt(mt.mean_squared_error(y_test_regression, yhat_tree_regression_test))
    mae_score_test      = mt.mean_absolute_error(y_test_regression, yhat_tree_regression_test)
    mape_score_test     = mt.mean_absolute_percentage_error(y_test_regression, yhat_tree_regression_test)
    
    r2_list_test.append(r2_score_test)
    mse_list_test.append(mse_score_test)
    rmse_list_test.append(rmse_score_test)
    mae_list_test.append(mae_score_test)
    mape_list_test.append(mape_score_test)

In [65]:
tree_regression_training_df = pd.DataFrame({ 'max_depth': max_depth_list, 'r2_score': r2_list_training, 
                                             'mse_score': mse_list_training, 'rmse_score': rmse_list_training, 
                                             'mae_score': mae_list_training, 'mape_score': mape_list_training})
tree_regression_training_df

Unnamed: 0,max_depth,r2_score,mse_score,rmse_score,mae_score,mape_score
0,1,0.025572,465.78873,21.582139,17.158819,8.615678
1,5,0.113523,423.747268,20.585122,16.368766,7.869536
2,10,0.384624,294.157341,17.151016,12.925051,4.871411
3,15,0.738302,125.095124,11.184593,6.603326,2.049551
4,20,0.927211,34.794149,5.898657,2.144293,0.356446


In [66]:
tree_regression_training_min = tree_regression_training_df.query('mape_score == mape_score.min()')
tree_regression_training_min

Unnamed: 0,max_depth,r2_score,mse_score,rmse_score,mae_score,mape_score
4,20,0.927211,34.794149,5.898657,2.144293,0.356446


In [67]:
tree_regression_validation_df = pd.DataFrame({ 'max_depth': max_depth_list, 'r2_score': r2_list_validation, 
                                             'mse_score': mse_list_validation, 'rmse_score': rmse_list_validation, 
                                             'mae_score': mae_list_validation, 'mape_score': mape_list_validation})
tree_regression_validation_df

Unnamed: 0,max_depth,r2_score,mse_score,rmse_score,mae_score,mape_score
0,1,0.025733,465.223863,21.569049,17.12234,8.549934
1,5,0.063559,447.161319,21.146189,16.843452,8.395778
2,10,-0.002935,478.912947,21.88408,16.861326,7.879333
3,15,-0.133504,541.261389,23.265025,16.880678,7.678007
4,20,-0.25193,597.811099,24.450176,17.13224,7.05484


In [68]:
tree_regression_validation_min = tree_regression_validation_df.query('mape_score == mape_score.min()')
tree_regression_validation_min

Unnamed: 0,max_depth,r2_score,mse_score,rmse_score,mae_score,mape_score
4,20,-0.25193,597.811099,24.450176,17.13224,7.05484


In [69]:
tree_regression_test_df = pd.DataFrame({ 'max_depth': max_depth_list, 'r2_score': r2_list_test, 
                                             'mse_score': mse_list_test, 'rmse_score': rmse_list_test, 
                                             'mae_score': mae_list_test, 'mape_score': mape_list_test})
tree_regression_test_df

Unnamed: 0,max_depth,r2_score,mse_score,rmse_score,mae_score,mape_score
0,1,0.024996,474.730587,21.788313,17.381588,8.519214
1,5,0.072181,451.755789,21.254547,17.010757,7.833952
2,10,0.045693,464.652786,21.555806,16.743015,7.062497
3,15,-0.078512,525.128765,22.915688,16.871659,6.460198
4,20,-0.192656,580.705338,24.097828,16.961002,6.358549


In [70]:
tree_regression_test_min = tree_regression_test_df.query('mape_score == mape_score.min()')
tree_regression_test_min

Unnamed: 0,max_depth,r2_score,mse_score,rmse_score,mae_score,mape_score
4,20,-0.192656,580.705338,24.097828,16.961002,6.358549


*   As with the Decision Tree Classifier, the deeper the tree, the more it tends to overfit: at max_depth=20 the training MAPE drops to 0.36 while the validation and test MAPE stay at 7.05 and 6.36, and the r2 score turns negative on both.
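
The same pattern can be reproduced on a small synthetic example. This is a sketch under stated assumptions: the data generator `make_data`, the noise level and the `depth_scores` name are hypothetical and only mimic the depth grid used above.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)


def make_data(n_samples):
    """Hypothetical noisy regression data, for illustration only."""
    X = rng.uniform(-1, 1, size=(n_samples, 3))
    y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.5, size=n_samples)
    return X, y


X_train, y_train = make_data(500)
X_val, y_val = make_data(200)

depth_scores = []
for max_depth in [1, 5, 10, 15, 20]:
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
    tree.fit(X_train, y_train)
    depth_scores.append({
        'max_depth': max_depth,
        'r2_training': r2_score(y_train, tree.predict(X_train)),
        'r2_validation': r2_score(y_val, tree.predict(X_val)),
    })

for row in depth_scores:
    print(row)
```

A growing distance between `r2_training` and `r2_validation` as `max_depth` increases is the signature of overfitting: the deeper tree memorizes the noise of the training sample.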

## 2.10 Random Forest Regressor

In [71]:
params = {'n_estimators':[1, 10, 50, 100, 150, 200],
          'max_depth': [1, 5, 10, 15, 20]}

n_estimators_list = list()
max_depth_list = list()

r2_list_training = list()
mse_list_training = list()
rmse_list_training = list()
mae_list_training = list()
mape_list_training = list()

r2_list_validation = list()
mse_list_validation = list()
rmse_list_validation = list()
mae_list_validation = list()
mape_list_validation = list()

r2_list_test = list()
mse_list_test = list()
rmse_list_test = list()
mae_list_test = list()
mape_list_test = list()

for i in params.get('n_estimators'):
    for j in params.get('max_depth'):
        n_estimators_list.append(i)
        max_depth_list.append(j)
        #Model Define
        rf_regression = RandomForestRegressor(n_estimators=i ,max_depth=j, random_state=42)
    
        #Model Training
        rf_regression.fit(X_training_regression, np.ravel(y_training_regression))
    
        #Model Predict
        yhat_rf_regression_training   = rf_regression.predict(X_training_regression)
        yhat_rf_regression_validation = rf_regression.predict(X_validation_regression)
        yhat_rf_regression_test       = rf_regression.predict(X_test_regression)
    
        #training scores
        r2_score_training       = mt.r2_score(y_training_regression, yhat_rf_regression_training)
        mse_score_training      = mt.mean_squared_error(y_training_regression, yhat_rf_regression_training)
        rmse_score_training     = np.sqrt(mt.mean_squared_error(y_training_regression, yhat_rf_regression_training))
        mae_score_training      = mt.mean_absolute_error(y_training_regression, yhat_rf_regression_training)
        mape_score_training     = mt.mean_absolute_percentage_error(y_training_regression, yhat_rf_regression_training)
    
        r2_list_training.append(r2_score_training)
        mse_list_training.append(mse_score_training)
        rmse_list_training.append(rmse_score_training)
        mae_list_training.append(mae_score_training)
        mape_list_training.append(mape_score_training)
        
        #validation scores
        r2_score_validation       = mt.r2_score(y_validation_regression, yhat_rf_regression_validation)
        mse_score_validation      = mt.mean_squared_error(y_validation_regression, yhat_rf_regression_validation)
        rmse_score_validation     = np.sqrt(mt.mean_squared_error(y_validation_regression, yhat_rf_regression_validation))
        mae_score_validation      = mt.mean_absolute_error(y_validation_regression, yhat_rf_regression_validation)
        mape_score_validation     = mt.mean_absolute_percentage_error(y_validation_regression, yhat_rf_regression_validation)
    
        r2_list_validation.append(r2_score_validation)
        mse_list_validation.append(mse_score_validation)
        rmse_list_validation.append(rmse_score_validation)
        mae_list_validation.append(mae_score_validation)
        mape_list_validation.append(mape_score_validation)
        
        #test scores
        r2_score_test       = mt.r2_score(y_test_regression, yhat_rf_regression_test)
        mse_score_test      = mt.mean_squared_error(y_test_regression, yhat_rf_regression_test)
        rmse_score_test     = np.sqrt(mt.mean_squared_error(y_test_regression, yhat_rf_regression_test))
        mae_score_test      = mt.mean_absolute_error(y_test_regression, yhat_rf_regression_test)
        mape_score_test     = mt.mean_absolute_percentage_error(y_test_regression, yhat_rf_regression_test)
    
        r2_list_test.append(r2_score_test)
        mse_list_test.append(mse_score_test)
        rmse_list_test.append(rmse_score_test)
        mae_list_test.append(mae_score_test)
        mape_list_test.append(mape_score_test)

In [72]:
rf_regression_training_df = pd.DataFrame({ 'n_estimators': n_estimators_list, 'max_depth': max_depth_list, 
                                           'r2_score': r2_list_training, 'mse_score': mse_list_training, 
                                           'rmse_score': rmse_list_training, 'mae_score': mae_list_training, 
                                           'mape_score': mape_list_training})

rf_regression_training_min = rf_regression_training_df.query('mape_score == mape_score.min()')
rf_regression_training_min

Unnamed: 0,n_estimators,max_depth,r2_score,mse_score,rmse_score,mae_score,mape_score
19,100,20,0.883119,55.87046,7.474654,5.517068,2.682385


In [73]:
rf_regression_validation_df = pd.DataFrame({ 'n_estimators': n_estimators_list, 'max_depth': max_depth_list, 
                                             'r2_score': r2_list_validation, 'mse_score': mse_list_validation, 
                                             'rmse_score': rmse_list_validation, 'mae_score': mae_list_validation, 
                                             'mape_score': mape_list_validation})

rf_regression_validation_min = rf_regression_validation_df.query('mape_score == mape_score.min()')
rf_regression_validation_min

Unnamed: 0,n_estimators,max_depth,r2_score,mse_score,rmse_score,mae_score,mape_score
14,50,20,0.321033,324.214408,18.005955,13.29633,7.066447


In [74]:
rf_regression_test_df = pd.DataFrame({ 'n_estimators': n_estimators_list, 'max_depth': max_depth_list, 
                                       'r2_score': r2_list_test, 'mse_score': mse_list_test, 
                                       'rmse_score': rmse_list_test, 'mae_score': mae_list_test, 
                                       'mape_score': mape_list_test})

rf_regression_test_min = rf_regression_test_df.query('mape_score == mape_score.min()')
rf_regression_test_min

Unnamed: 0,n_estimators,max_depth,r2_score,mse_score,rmse_score,mae_score,mape_score
9,10,20,0.28078,350.188801,18.713332,13.855007,6.514113


*   The Random Forest, like the single Decision Tree model, tends to overfit as the number of estimators and the tree depth increase: its best training R² is 0.88, against 0.32 on validation and 0.28 on test.
*   Compared with the single Decision Tree model, the Random Forest varies less between datasets, which indicates better generalization.
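
The generalization remark above can be made concrete by measuring how much R² each tree-based model loses outside the training data. The sketch below is only an illustration: the R² values are copied from the best-MAPE rows reported in this notebook, and the column names `gap_validation` and `gap_test` are hypothetical. Note that the best row of each dataset may come from a different hyperparameter combination, so the gaps are indicative rather than a like-for-like comparison.

```python
import pandas as pd

# R² values copied from the best-MAPE rows reported in this notebook.
scores = pd.DataFrame(
    {
        "model": ["Decision Tree Regression", "Random Forest Regression"],
        "r2_training": [0.927211, 0.883119],
        "r2_validation": [-0.251930, 0.321033],
        "r2_test": [-0.192656, 0.280780],
    }
).set_index("model")

# Generalization gap: R² lost when moving away from the training data.
# A smaller gap suggests the model generalizes better.
scores["gap_validation"] = scores["r2_training"] - scores["r2_validation"]
scores["gap_test"] = scores["r2_training"] - scores["r2_test"]

print(scores.round(3))
```

With these numbers the Random Forest gap is roughly half the Decision Tree gap on both validation and test data, which is consistent with the observation above.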

## 2.11 Final Results for Regression Models

### 2.11.1 Training Dataset

In [75]:
linear_regression_training_min['model'] = 'Linear Regression'
linear_regression_training_min = linear_regression_training_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

lasso_regression_training_min = lasso_regression_training_min.drop(['alpha', 'max_iter'], axis=1).iloc[[0]]
lasso_regression_training_min['model'] = 'Linear Regression Lasso (L1)'
lasso_regression_training_min = lasso_regression_training_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

ridge_regression_training_min = ridge_regression_training_min.drop(['alpha', 'max_iter'], axis=1).iloc[[0]]
ridge_regression_training_min['model'] = 'Linear Regression Ridge (L2)'
ridge_regression_training_min = ridge_regression_training_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

elastic_net_training_min = elastic_net_training_min.drop(['alpha', 'l1_ratio','max_iter'], axis=1).iloc[[0]]
elastic_net_training_min['model'] = 'Linear Regression Elastic Net (L1 & L2)'
elastic_net_training_min = elastic_net_training_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

polynomial_regression_training_min = polynomial_regression_training_min.drop('degree', axis=1).iloc[[0]]
polynomial_regression_training_min['model'] = 'Polynomial Regression'
polynomial_regression_training_min = polynomial_regression_training_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

polynomial_lasso_regression_training_min = polynomial_lasso_regression_training_min.drop(['degree','alpha', 'max_iter'], axis=1).iloc[[0]]
polynomial_lasso_regression_training_min['model'] = 'Polynomial Regression Lasso (L1)'
polynomial_lasso_regression_training_min = polynomial_lasso_regression_training_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

polynomial_ridge_regression_training_min = polynomial_ridge_regression_training_min.drop(['degree','alpha', 'max_iter'], axis=1).iloc[[0]]
polynomial_ridge_regression_training_min['model'] = 'Polynomial Regression Ridge (L2)'
polynomial_ridge_regression_training_min = polynomial_ridge_regression_training_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

polynomial_elastic_net_regression_training_min = polynomial_elastic_net_regression_training_min.drop(['degree', 'alpha', 'l1_ratio', 'max_iter'], axis=1).iloc[[0]]
polynomial_elastic_net_regression_training_min['model'] = 'Polynomial Regression Elastic Net (L1 & L2)'
polynomial_elastic_net_regression_training_min = polynomial_elastic_net_regression_training_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

tree_regression_training_min = tree_regression_training_min.drop('max_depth', axis=1).iloc[[0]]
tree_regression_training_min['model'] = 'Decision Tree Regression'
tree_regression_training_min = tree_regression_training_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

rf_regression_training_min = rf_regression_training_min.drop(['n_estimators', 'max_depth'], axis=1).iloc[[0]]
rf_regression_training_min['model'] = 'Random Forest Regression'
rf_regression_training_min = rf_regression_training_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]


In [76]:
training_regression_results = pd.concat([linear_regression_training_min, lasso_regression_training_min, ridge_regression_training_min,
                                         elastic_net_training_min, polynomial_regression_training_min, polynomial_lasso_regression_training_min,
                                         polynomial_ridge_regression_training_min, polynomial_elastic_net_regression_training_min,
                                         tree_regression_training_min, rf_regression_training_min])
training_regression_results = training_regression_results.reset_index(drop=True)
training_regression_results

Unnamed: 0,model,r2_score,mse_score,rmse_score,mae_score,mape_score
0,Linear Regression,0.046058,455.996112,21.354065,16.998249,8.653186
1,Linear Regression Lasso (L1),0.007401,474.474834,21.782443,17.305484,8.736697
2,Linear Regression Ridge (L2),0.046058,455.996401,21.354072,16.998308,8.653415
3,Linear Regression Elastic Net (L1 & L2),0.010715,472.890766,21.746052,17.277098,8.71988
4,Polynomial Regression,0.7253,131.310015,11.459058,7.266166,2.215335
5,Polynomial Regression Lasso (L1),0.0195,468.691307,21.64928,17.158121,8.574792
6,Polynomial Regression Ridge (L2),0.312725,328.525923,18.125284,14.002097,6.477741
7,Polynomial Regression Elastic Net (L1 & L2),0.085699,437.047167,20.905673,16.523671,8.278886
8,Decision Tree Regression,0.927211,34.794149,5.898657,2.144293,0.356446
9,Random Forest Regression,0.883119,55.87046,7.474654,5.517068,2.682385
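
The per-model blocks above all repeat the same three steps: keep the first best row, label it with the model name, and order the metric columns. A minimal sketch of how this could be factored into a helper is shown below. The names `METRIC_COLUMNS`, `to_summary_row` and `build_results_table` are hypothetical and not part of the original notebook; the two sample frames reuse the Linear Regression and Random Forest training rows reported above.

```python
import pandas as pd

METRIC_COLUMNS = ["r2_score", "mse_score", "rmse_score", "mae_score", "mape_score"]


def to_summary_row(best_df, model_name):
    """Return a one-row frame: the model label followed by the metric columns."""
    # Selecting only the metric columns drops hyperparameter columns
    # (alpha, max_depth, ...) without having to list them per model.
    row = best_df.iloc[[0]][METRIC_COLUMNS].copy()
    row.insert(0, "model", model_name)
    return row


def build_results_table(best_by_model):
    """Stack one summary row per model into a single comparison table."""
    rows = [to_summary_row(df, name) for name, df in best_by_model.items()]
    return pd.concat(rows, ignore_index=True)


# Stand-ins for the *_training_min frames, using values reported above.
linear_best = pd.DataFrame(
    {"r2_score": [0.046058], "mse_score": [455.996112], "rmse_score": [21.354065],
     "mae_score": [16.998249], "mape_score": [8.653186]}
)
rf_best = pd.DataFrame(
    {"n_estimators": [100], "max_depth": [20], "r2_score": [0.883119],
     "mse_score": [55.87046], "rmse_score": [7.474654],
     "mae_score": [5.517068], "mape_score": [2.682385]},
    index=[19],
)

results = build_results_table(
    {"Linear Regression": linear_best, "Random Forest Regression": rf_best}
)
print(results)
```

Working on a copy of the selected row also avoids assigning the `model` column to a slice of the original results frame.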


### 2.11.2 Validation Dataset

In [77]:
linear_regression_validation_min['model'] = 'Linear Regression'
linear_regression_validation_min = linear_regression_validation_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

lasso_regression_validation_min = lasso_regression_validation_min.drop(['alpha', 'max_iter'], axis=1).iloc[[0]]
lasso_regression_validation_min['model'] = 'Linear Regression Lasso (L1)'
lasso_regression_validation_min = lasso_regression_validation_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

ridge_regression_validation_min = ridge_regression_validation_min.drop(['alpha', 'max_iter'], axis=1).iloc[[0]]
ridge_regression_validation_min['model'] = 'Linear Regression Ridge (L2)'
ridge_regression_validation_min = ridge_regression_validation_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

elastic_net_validation_min = elastic_net_validation_min.drop(['alpha', 'l1_ratio','max_iter'], axis=1).iloc[[0]]
elastic_net_validation_min['model'] = 'Linear Regression Elastic Net (L1 & L2)'
elastic_net_validation_min = elastic_net_validation_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

polynomial_regression_validation_min = polynomial_regression_validation_min.drop('degree', axis=1).iloc[[0]]
polynomial_regression_validation_min['model'] = 'Polynomial Regression'
polynomial_regression_validation_min = polynomial_regression_validation_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

polynomial_lasso_regression_validation_min = polynomial_lasso_regression_validation_min.drop(['degree','alpha', 'max_iter'], axis=1).iloc[[0]]
polynomial_lasso_regression_validation_min['model'] = 'Polynomial Regression Lasso (L1)'
polynomial_lasso_regression_validation_min = polynomial_lasso_regression_validation_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

polynomial_ridge_regression_validation_min = polynomial_ridge_regression_validation_min.drop(['degree','alpha', 'max_iter'], axis=1).iloc[[0]]
polynomial_ridge_regression_validation_min['model'] = 'Polynomial Regression Ridge (L2)'
polynomial_ridge_regression_validation_min = polynomial_ridge_regression_validation_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

polynomial_elastic_net_regression_validation_min = polynomial_elastic_net_regression_validation_min.drop(['degree', 'alpha', 'l1_ratio', 'max_iter'], axis=1).iloc[[0]]
polynomial_elastic_net_regression_validation_min['model'] = 'Polynomial Regression Elastic Net (L1 & L2)'
polynomial_elastic_net_regression_validation_min = polynomial_elastic_net_regression_validation_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

tree_regression_validation_min = tree_regression_validation_min.drop('max_depth', axis=1).iloc[[0]]
tree_regression_validation_min['model'] = 'Decision Tree Regression'
tree_regression_validation_min = tree_regression_validation_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

rf_regression_validation_min = rf_regression_validation_min.drop(['n_estimators', 'max_depth'], axis=1).iloc[[0]]
rf_regression_validation_min['model'] = 'Random Forest Regression'
rf_regression_validation_min = rf_regression_validation_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]


In [78]:
validation_regression_results = pd.concat([linear_regression_validation_min, lasso_regression_validation_min, ridge_regression_validation_min,
                                           elastic_net_validation_min, polynomial_regression_validation_min, polynomial_lasso_regression_validation_min,
                                           polynomial_ridge_regression_validation_min, polynomial_elastic_net_regression_validation_min,
                                           tree_regression_validation_min, rf_regression_validation_min])
validation_regression_results = validation_regression_results.reset_index(drop=True)
validation_regression_results

Unnamed: 0,model,r2_score,mse_score,rmse_score,mae_score,mape_score
0,Linear Regression,0.03992483,458.447042,21.411376,17.039754,8.682542
1,Linear Regression Lasso (L1),-7.197077e-07,477.511956,21.852047,17.352836,8.678722
2,Linear Regression Ridge (L2),0.03681042,459.934208,21.446077,17.043389,8.672408
3,Linear Regression Elastic Net (L1 & L2),-7.197077e-07,477.511956,21.852047,17.352836,8.678722
4,Polynomial Regression,0.06647668,445.768223,21.113224,16.749939,8.547931
5,Polynomial Regression Lasso (L1),0.01414802,470.755769,21.696907,17.180595,8.655828
6,Polynomial Regression Ridge (L2),-9.685894,5102.638358,71.432754,18.826258,8.348152
7,Polynomial Regression Elastic Net (L1 & L2),0.05457368,451.452047,21.247401,16.808199,8.630816
8,Decision Tree Regression,-0.25193,597.811099,24.450176,17.13224,7.05484
9,Random Forest Regression,0.3210335,324.214408,18.005955,13.29633,7.066447


### 2.11.3 Test Dataset

In [79]:
linear_regression_test_min['model'] = 'Linear Regression'
linear_regression_test_min = linear_regression_test_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

lasso_regression_test_min = lasso_regression_test_min.drop(['alpha', 'max_iter'], axis=1).iloc[[0]]
lasso_regression_test_min['model'] = 'Linear Regression Lasso (L1)'
lasso_regression_test_min = lasso_regression_test_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

ridge_regression_test_min = ridge_regression_test_min.drop(['alpha', 'max_iter'], axis=1).iloc[[0]]
ridge_regression_test_min['model'] = 'Linear Regression Ridge (L2)'
ridge_regression_test_min = ridge_regression_test_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

elastic_net_test_min = elastic_net_test_min.drop(['alpha', 'l1_ratio','max_iter'], axis=1).iloc[[0]]
elastic_net_test_min['model'] = 'Linear Regression Elastic Net (L1 & L2)'
elastic_net_test_min = elastic_net_test_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

polynomial_regression_test_min = polynomial_regression_test_min.drop('degree', axis=1).iloc[[0]]
polynomial_regression_test_min['model'] = 'Polynomial Regression'
polynomial_regression_test_min = polynomial_regression_test_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

polynomial_lasso_regression_test_min = polynomial_lasso_regression_test_min.drop(['degree','alpha', 'max_iter'], axis=1).iloc[[0]]
polynomial_lasso_regression_test_min['model'] = 'Polynomial Regression Lasso (L1)'
polynomial_lasso_regression_test_min = polynomial_lasso_regression_test_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

polynomial_ridge_regression_test_min = polynomial_ridge_regression_test_min.drop(['degree','alpha', 'max_iter'], axis=1).iloc[[0]]
polynomial_ridge_regression_test_min['model'] = 'Polynomial Regression Ridge (L2)'
polynomial_ridge_regression_test_min = polynomial_ridge_regression_test_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

polynomial_elastic_net_regression_test_min = polynomial_elastic_net_regression_test_min.drop(['degree', 'alpha', 'l1_ratio', 'max_iter'], axis=1).iloc[[0]]
polynomial_elastic_net_regression_test_min['model'] = 'Polynomial Regression Elastic Net (L1 & L2)'
polynomial_elastic_net_regression_test_min = polynomial_elastic_net_regression_test_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

tree_regression_test_min = tree_regression_test_min.drop('max_depth', axis=1).iloc[[0]]
tree_regression_test_min['model'] = 'Decision Tree Regression'
tree_regression_test_min = tree_regression_test_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]

rf_regression_test_min = rf_regression_test_min.drop(['n_estimators', 'max_depth'], axis=1).iloc[[0]]
rf_regression_test_min['model'] = 'Random Forest Regression'
rf_regression_test_min = rf_regression_test_min[['model', 'r2_score', 'mse_score', 'rmse_score', 'mae_score', 'mape_score']]


In [80]:
test_regression_results = pd.concat([linear_regression_test_min, lasso_regression_test_min, ridge_regression_test_min,
                                     elastic_net_test_min, polynomial_regression_test_min, polynomial_lasso_regression_test_min,
                                     polynomial_ridge_regression_test_min, polynomial_elastic_net_regression_test_min,
                                     tree_regression_test_min, rf_regression_test_min])
test_regression_results = test_regression_results.reset_index(drop=True)
test_regression_results

Unnamed: 0,model,r2_score,mse_score,rmse_score,mae_score,mape_score
0,Linear Regression,0.052317,461.427719,21.480869,17.129965,8.521859
1,Linear Regression Lasso (L1),-0.000124,486.961469,22.067203,17.551492,8.71455
2,Linear Regression Ridge (L2),0.05231,461.431102,21.480947,17.129678,8.522815
3,Linear Regression Elastic Net (L1 & L2),-0.000124,486.961469,22.067203,17.551492,8.71455
4,Polynomial Regression,-0.261752,614.348098,24.786046,17.178214,7.956229
5,Polynomial Regression Lasso (L1),4e-06,486.899198,22.065792,17.550363,8.714456
6,Polynomial Regression Ridge (L2),-159.704238,78247.050076,279.726742,21.433575,8.113779
7,Polynomial Regression Elastic Net (L1 & L2),-0.169653,569.505007,23.864304,16.998383,8.488322
8,Decision Tree Regression,-0.192656,580.705338,24.097828,16.961002,6.358549
9,Random Forest Regression,0.28078,350.188801,18.713332,13.855007,6.514113
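
Since the three tables share the `model` column, they can be placed side by side to compare one metric across datasets. The sketch below is an illustrative assumption rather than part of the original analysis: it uses only three models and their MAPE values as reported in the tables above, and the column names `mape_training`, `mape_validation` and `mape_test` are hypothetical.

```python
import pandas as pd

models = ["Polynomial Regression", "Decision Tree Regression", "Random Forest Regression"]

# MAPE values copied from the three result tables above.
training = pd.DataFrame({"model": models, "mape_score": [2.215335, 0.356446, 2.682385]})
validation = pd.DataFrame({"model": models, "mape_score": [8.547931, 7.054840, 7.066447]})
test = pd.DataFrame({"model": models, "mape_score": [7.956229, 6.358549, 6.514113]})

# Rename the metric per dataset, join on the model name, and rank by test MAPE.
comparison = (
    training.rename(columns={"mape_score": "mape_training"})
    .merge(validation.rename(columns={"mape_score": "mape_validation"}), on="model")
    .merge(test.rename(columns={"mape_score": "mape_test"}), on="model")
    .sort_values("mape_test")
    .reset_index(drop=True)
)
print(comparison)
```

The same pattern extends to all ten models by merging `training_regression_results`, `validation_regression_results` and `test_regression_results` directly.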


# 3.0 CLUSTERING MODELS