# Introduction

In this notebook I run a few classification models that try to predict the direction of the next day's natural gas price change. The setup is simple: a two-class problem (price going up: 1, price going down: 0). In reality there is a third class as well (no price change), but it was merged into one of the two main classes. In practice the task is far from trivial, as financial time series change their behaviour over time and are hard to predict.
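
The notebook loads a ready-made `y.pkl`, so the labelling code itself is not shown. As a minimal sketch of how such a binary target could be built, assuming a hypothetical `close` column with made-up prices, and assuming that "no change" is folded into class 0 (the text above does not say which class it was merged into):

```python
import pandas as pd

# Hypothetical closing prices, oldest first (values are made up for illustration).
prices = pd.DataFrame(
    {"close": [2.50, 2.55, 2.55, 2.48, 2.60]},
    index=pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-08",
                          "2020-01-09", "2020-01-10"]),
)

# Next day's price change relative to today's close.
next_day_change = prices["close"].shift(-1) - prices["close"]

# 1 = price goes up tomorrow, 0 = price goes down.
# Assumption: "no change" is merged into class 0.
target = (next_day_change > 0).astype(float)

# The last row has no "next day", so its label is undefined and is dropped.
target = target[next_day_change.notna()]
print(target)
```

The key detail is `shift(-1)`: each row's label describes the following day, so the features of a given day never contain the outcome they are supposed to predict.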

# Splitting into train, test and validation sets

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from IPython.display import display
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import PowerTransformer
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from statistics import mode

In [2]:
X = pd.read_pickle('Data/X.pkl')
y = pd.read_pickle('Data/y.pkl')

In [3]:
X_test, X_train, y_test, y_train = train_test_split(X, y, test_size=0.70, shuffle=False)
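# this order is not a mistake: the rows are sorted newest first and shuffle=False
# keeps that order, so the last 70% of rows (the oldest) become the train set,
# which is then located earlier in time than the test set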

In [4]:
# dividing the test set into validation and test sets:
# each of them contains 15% of all rows (train set: 70%)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, shuffle=False)
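
The swapped unpacking in the first split works because `train_test_split` with `shuffle=False` simply cuts the rows in their stored order. A minimal sketch on a toy frame (made-up dates, a hypothetical column `x`, and variable names chosen so they do not clash with the notebook's) shows the resulting chronology:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data sorted newest first, like the notebook's X (dates are made up).
dates = pd.date_range("2020-01-01", periods=20, freq="D")[::-1]
X_toy = pd.DataFrame({"x": range(20)}, index=dates)
y_toy = pd.Series([i % 2 for i in range(20)], index=dates)

# shuffle=False keeps row order: the first 30% of rows (newest) come back first,
# the remaining 70% (oldest) come back second, hence the swapped names.
X_rest, X_tr, y_rest, y_tr = train_test_split(X_toy, y_toy, test_size=0.70, shuffle=False)

# Split the newest 30% in half: validation is the most recent block.
X_v, X_te, y_v, y_te = train_test_split(X_rest, y_rest, test_size=0.5, shuffle=False)

# Every training date precedes every test date, which precedes every validation date.
assert X_tr.index.max() < X_te.index.min()
assert X_te.index.max() < X_v.index.min()
print(len(X_tr), len(X_te), len(X_v))
```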

In [5]:
display(X_val.iloc[[0,-1]]) # just to check datasets' start- and end-dates

display(X_test.iloc[[0,-1]])

display(X_train.iloc[[0,-1]])

Date,gas_daily_change,gas_volatility,gas_daily_gap,rate_2y_daily_change,SP500_daily_change,WTI_daily_change,EurUsd,TTF_daily_change,Storage,GDP_quarterly_change,US_temp,Friday,Monday,Thursday,Tuesday,Wednesday,filling,gas_daily_change_lag22
2020-03-20,-0.03023,0.078554,-0.004232,-0.159091,0.0,-0.110626,1.0707,-0.029377,2043.0,0.005197,75.360065,1,0,0,0,0,True,-0.017903
2018-09-17,0.016986,0.020611,0.00253,0.0,-0.00557,-0.00116,1.1671,0.042631,2636.0,0.005423,-85.799,0,1,0,0,0,False,-0.010884


Date,gas_daily_change,gas_volatility,gas_daily_gap,rate_2y_daily_change,SP500_daily_change,WTI_daily_change,EurUsd,TTF_daily_change,Storage,GDP_quarterly_change,US_temp,Friday,Monday,Thursday,Tuesday,Wednesday,filling,gas_daily_change_lag22
2018-09-14,-0.017749,0.018793,-0.002485,0.007246,0.000275,0.005832,1.1689,0.004514,2636.0,0.005423,-85.799,1,0,0,0,0,False,-0.006421
2017-03-10,0.011432,0.025598,0.004371,-0.007299,0.003269,-0.016031,1.0606,0.001799,2295.0,0.005423,75.360065,1,0,0,0,0,True,-0.001278


Date,gas_daily_change,gas_volatility,gas_daily_gap,rate_2y_daily_change,SP500_daily_change,WTI_daily_change,EurUsd,TTF_daily_change,Storage,GDP_quarterly_change,US_temp,Friday,Monday,Thursday,Tuesday,Wednesday,filling,gas_daily_change_lag22
2017-03-09,0.025164,0.033961,0.004137,0.007353,0.0008,-0.019889,1.0551,-0.007633,2295.0,0.005423,75.360065,0,0,1,0,0,True,0.02623
2010-02-04,-0.000554,0.050406,-0.002399,-0.090909,-0.031141,-0.049883,1.3847,0.0,2406.0,0.010984,126.992186,0,0,1,0,0,True,0.055994


In [6]:
pd.concat([X_train, y_train], axis=1, sort=False) 
# just to see the whole train set alongside the labels

Date,gas_daily_change,gas_volatility,gas_daily_gap,rate_2y_daily_change,SP500_daily_change,WTI_daily_change,EurUsd,TTF_daily_change,Storage,GDP_quarterly_change,US_temp,Friday,Monday,Thursday,Tuesday,Wednesday,filling,gas_daily_change_lag22,gas_target
2017-03-09,0.025164,0.033961,0.004137,0.007353,0.000800,-0.019889,1.0551,-0.007633,2295.0,0.005423,75.360065,0,0,1,0,0,True,0.026230,1.0
2017-03-08,0.027266,0.039986,0.003895,0.030303,-0.002284,-0.053820,1.0556,-0.015634,2363.0,0.005423,75.360065,0,0,0,0,1,True,-0.004244,1.0
2017-03-07,-0.026543,0.026558,-0.009997,0.007634,-0.002913,-0.001128,1.0576,-0.006681,2363.0,0.005423,75.360065,0,0,0,1,0,True,-0.038908,1.0
2017-03-06,0.026176,0.032747,0.029713,-0.007576,-0.003277,-0.002438,1.0592,-0.009066,2363.0,0.005423,75.360065,0,1,0,0,0,True,0.005997,0.0
2017-03-03,0.008203,0.022993,0.004280,0.000000,0.000504,0.013686,1.0565,0.000358,2363.0,0.005423,75.360065,1,0,0,0,0,True,0.016362,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2010-02-10,0.000378,0.026455,0.010964,0.083333,-0.002233,0.010441,1.3740,0.000000,2406.0,0.010984,126.992186,0,0,0,0,1,True,-0.051313,1.0
2010-02-09,-0.020552,0.038563,0.003518,0.063291,0.013040,0.025873,1.3760,0.000000,2406.0,0.010984,126.992186,0,0,0,1,0,True,-0.009817,1.0
2010-02-08,-0.020671,0.052583,0.015775,0.025974,-0.008863,0.009833,1.3675,0.000000,2406.0,0.010984,126.992186,0,1,0,0,0,True,-0.033783,0.0
2010-02-05,0.018279,0.038985,0.012740,-0.037500,0.002897,-0.026661,1.3691,0.000000,2406.0,0.010984,126.992186,1,0,0,0,0,True,0.021244,0.0


# Balance of the classes

In [7]:
y_train.value_counts()

0.0    901
1.0    880
Name: gas_target, dtype: int64

The classes are almost balanced, so there is no need to resample or reweight the set.
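
To put a number on "almost", the counts can be turned into shares. A small sketch that rebuilds a label series from the two counts printed above (the series `y_counts` is a stand-in, not the notebook's `y_train`):

```python
import pandas as pd

# Rebuild a label series with the class counts printed above (901 zeros, 880 ones).
y_counts = pd.Series([0.0] * 901 + [1.0] * 880, name="gas_target")

# normalize=True returns class shares instead of raw counts.
shares = y_counts.value_counts(normalize=True)
print(shares)
```

The share of the majority class is also the accuracy that a model always predicting that class would reach on the train set.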

# Modelling
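
Every grid search in this section uses `TimeSeriesSplit(n_splits=5)` for cross-validation. As a minimal sketch of what that splitter does, on a toy array rather than the notebook's data: it builds folds purely by row position, and the training rows of a fold always come before its validation rows, so the row order of the input decides which period counts as "earlier".

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy feature matrix with 12 rows; the values are irrelevant, only positions matter.
X_toy = np.arange(12).reshape(-1, 1)

tscv_toy = TimeSeriesSplit(n_splits=5)
folds = list(tscv_toy.split(X_toy))

for train_idx, val_idx in folds:
    # Training rows always sit before the validation rows of the same fold.
    print("train:", train_idx, "validate:", val_idx)
```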

## Decision Tree Classifier

In [8]:
steps = [('MinMaxScaler', MinMaxScaler(feature_range=(0.1, 1.1))),
         ('Polynomial', PolynomialFeatures(include_bias=False)),
         ('Yeo-Johnson', PowerTransformer()),
         ('RFE', RFE(DecisionTreeClassifier(max_depth=10, min_samples_leaf=25), step=5)),
         ('PCA', PCA()),
         ('Decision_Tree', DecisionTreeClassifier(min_samples_leaf=25))]

pipeline = Pipeline(steps)

params = {'Polynomial__degree' : [1, 2],
          'RFE__n_features_to_select' : [150, 100, 50],
          'PCA__n_components' : [10, 50, 100],
          'Decision_Tree__max_depth' : [5, 20, 35]}

tscv = TimeSeriesSplit(n_splits=5)

CV = GridSearchCV(pipeline, params, n_jobs=-1, verbose=1, error_score=np.nan, cv=tscv)

CV.fit(X_train, y_train)

Fitting 5 folds for each of 54 candidates, totalling 270 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   10.8s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   37.3s
[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed:   55.4s finished


GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=5),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('MinMaxScaler',
                                        MinMaxScaler(copy=True,
                                                     feature_range=(0.1, 1.1))),
                                       ('Polynomial',
                                        PolynomialFeatures(degree=2,
                                                           include_bias=False,
                                                           interaction_only=False,
                                                           order='C')),
                                       ('Yeo-Johnson',
                                        PowerTransformer(copy=True,
                                                         method='yeo-johnson',
                                                         standard...
                              

In [9]:
CV.best_params_

{'Decision_Tree__max_depth': 20,
 'PCA__n_components': 50,
 'Polynomial__degree': 2,
 'RFE__n_features_to_select': 50}

In [10]:
CV.best_estimator_

Pipeline(memory=None,
         steps=[('MinMaxScaler',
                 MinMaxScaler(copy=True, feature_range=(0.1, 1.1))),
                ('Polynomial',
                 PolynomialFeatures(degree=2, include_bias=False,
                                    interaction_only=False, order='C')),
                ('Yeo-Johnson',
                 PowerTransformer(copy=True, method='yeo-johnson',
                                  standardize=True)),
                ('RFE',
                 RFE(estimator=DecisionTreeClassifier(class_weight=None,
                                                      criterion='gini',...
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('Decision_Tree',
                 DecisionTreeClassifier(class_weight=None, criterion='gini',
                                        max_depth=20, max_features=None,
                                        max_leaf_nodes=None,
                               

In [11]:
y_pred_DecTree_train = CV.best_estimator_.predict(X_train)
y_pred_DecTree_test = CV.best_estimator_.predict(X_test)
y_pred_DecTree_val = CV.best_estimator_.predict(X_val) # saving it for future validation

In [12]:
acc_DecTree_train, acc_DecTree_test = \
accuracy_score(y_train, y_pred_DecTree_train), accuracy_score(y_test, y_pred_DecTree_test)
acc_DecTree_train, acc_DecTree_test

(0.7288040426726559, 0.5471204188481675)

## Logistic Regression

In [13]:
steps = [('MinMaxScaler', MinMaxScaler(feature_range=(0.1, 1.1))),
         ('Polynomial', PolynomialFeatures(include_bias=False)),
         ('Yeo-Johnson', PowerTransformer()),
         ('RFE', RFE(LogisticRegression(solver='lbfgs', max_iter=300, n_jobs=-1), step=5)),
         ('PCA', PCA()),
         ('Log_Regr', LogisticRegression())]

pipeline = Pipeline(steps)

params = {'Polynomial__degree' : [1, 2],
          'RFE__n_features_to_select' : [150, 100, 50],
          'PCA__n_components' : [10, 50, 100],
          'Log_Regr__penalty': ['none', 'l1'],
          'Log_Regr__C' : [0.01, 0.1, 1]}

tscv = TimeSeriesSplit(n_splits=5)

CV = GridSearchCV(pipeline, params, n_jobs=-1, verbose=1, error_score=np.nan, cv=tscv)

CV.fit(X_train, y_train)

Fitting 5 folds for each of 108 candidates, totalling 540 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    9.5s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed:  4.2min finished


GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=5),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('MinMaxScaler',
                                        MinMaxScaler(copy=True,
                                                     feature_range=(0.1, 1.1))),
                                       ('Polynomial',
                                        PolynomialFeatures(degree=2,
                                                           include_bias=False,
                                                           interaction_only=False,
                                                           order='C')),
                                       ('Yeo-Johnson',
                                        PowerTransformer(copy=True,
                                                         method='yeo-johnson',
                                                         standard...
                              

In [14]:
CV.best_params_

{'Log_Regr__C': 0.1,
 'Log_Regr__penalty': 'l1',
 'PCA__n_components': 10,
 'Polynomial__degree': 1,
 'RFE__n_features_to_select': 150}

In [15]:
CV.best_estimator_

Pipeline(memory=None,
         steps=[('MinMaxScaler',
                 MinMaxScaler(copy=True, feature_range=(0.1, 1.1))),
                ('Polynomial',
                 PolynomialFeatures(degree=1, include_bias=False,
                                    interaction_only=False, order='C')),
                ('Yeo-Johnson',
                 PowerTransformer(copy=True, method='yeo-johnson',
                                  standardize=True)),
                ('RFE',
                 RFE(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                  dual=False,
                                                  fit_...
                 PCA(copy=True, iterated_power='auto', n_components=10,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('Log_Regr',
                 LogisticRegression(C=0.1, class_weight=None, dual=False,
                                    fit_intercept=Tru

In [16]:
y_pred_LogReg_train = CV.best_estimator_.predict(X_train)
y_pred_LogReg_test = CV.best_estimator_.predict(X_test)
y_pred_LogReg_val = CV.best_estimator_.predict(X_val) # saving it for future validation

In [17]:
acc_LogReg_train, acc_LogReg_test = \
accuracy_score(y_train, y_pred_LogReg_train), accuracy_score(y_test, y_pred_LogReg_test)
acc_LogReg_train, acc_LogReg_test

(0.5463222908478383, 0.5261780104712042)

## K-Nearest Neighbors

In [18]:
steps = [('MinMaxScaler', MinMaxScaler(feature_range=(0.1, 1.1))),
         ('Polynomial', PolynomialFeatures(include_bias=False)),
         ('Yeo-Johnson', PowerTransformer()),
# no RFE step for KNN: RFE needs an estimator that exposes coef_ or
# feature_importances_, and KNeighborsClassifier provides neither
         ('PCA', PCA()),
         ('KNN', KNeighborsClassifier())]

pipeline = Pipeline(steps)

params = {'Polynomial__degree' : [1, 2],
          'PCA__n_components' : [10, 50, 100],
          'KNN__n_neighbors' : [5, 25, 50, 100, 200]}

tscv = TimeSeriesSplit(n_splits=5)

CV = GridSearchCV(pipeline, params, n_jobs=-1, verbose=1, error_score=np.nan, cv=tscv)

CV.fit(X_train, y_train)

Fitting 5 folds for each of 30 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    4.4s
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:   18.3s finished


GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=5),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('MinMaxScaler',
                                        MinMaxScaler(copy=True,
                                                     feature_range=(0.1, 1.1))),
                                       ('Polynomial',
                                        PolynomialFeatures(degree=2,
                                                           include_bias=False,
                                                           interaction_only=False,
                                                           order='C')),
                                       ('Yeo-Johnson',
                                        PowerTransformer(copy=True,
                                                         method='yeo-johnson',
                                                         standard...
                              

In [19]:
CV.best_params_

{'KNN__n_neighbors': 100, 'PCA__n_components': 10, 'Polynomial__degree': 1}

In [20]:
CV.best_estimator_

Pipeline(memory=None,
         steps=[('MinMaxScaler',
                 MinMaxScaler(copy=True, feature_range=(0.1, 1.1))),
                ('Polynomial',
                 PolynomialFeatures(degree=1, include_bias=False,
                                    interaction_only=False, order='C')),
                ('Yeo-Johnson',
                 PowerTransformer(copy=True, method='yeo-johnson',
                                  standardize=True)),
                ('PCA',
                 PCA(copy=True, iterated_power='auto', n_components=10,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('KNN',
                 KNeighborsClassifier(algorithm='auto', leaf_size=30,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=None, n_neighbors=100, p=2,
                                      weights='uniform'))],
         verbose=False)

In [21]:
y_pred_KNN_train = CV.best_estimator_.predict(X_train)
y_pred_KNN_test = CV.best_estimator_.predict(X_test)
y_pred_KNN_val = CV.best_estimator_.predict(X_val) # saving it for future validation

In [22]:
acc_KNN_train, acc_KNN_test = \
accuracy_score(y_train, y_pred_KNN_train), accuracy_score(y_test, y_pred_KNN_test)
acc_KNN_train, acc_KNN_test

(0.5446378439079169, 0.5026178010471204)

## Naive Bayes

In [23]:
steps = [('MinMaxScaler', MinMaxScaler(feature_range=(0.1, 1.1))),
         ('Polynomial', PolynomialFeatures(include_bias=False)),
         ('Yeo-Johnson', PowerTransformer()),
         ('PCA', PCA()),
         ('Naive_Bayes', GaussianNB())]

pipeline = Pipeline(steps)

params = {'Polynomial__degree' : [1, 2],
          'PCA__n_components' : [10, 50, 100]}

tscv = TimeSeriesSplit(n_splits=5)

CV = GridSearchCV(pipeline, params, n_jobs=-1, verbose=1, error_score=np.nan, cv=tscv)

CV.fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:    3.4s finished


GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=5),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('MinMaxScaler',
                                        MinMaxScaler(copy=True,
                                                     feature_range=(0.1, 1.1))),
                                       ('Polynomial',
                                        PolynomialFeatures(degree=2,
                                                           include_bias=False,
                                                           interaction_only=False,
                                                           order='C')),
                                       ('Yeo-Johnson',
                                        PowerTransformer(copy=True,
                                                         method='yeo-johnson',
                                                         standard...
                              

In [24]:
CV.best_params_

{'PCA__n_components': 10, 'Polynomial__degree': 1}

In [25]:
CV.best_estimator_

Pipeline(memory=None,
         steps=[('MinMaxScaler',
                 MinMaxScaler(copy=True, feature_range=(0.1, 1.1))),
                ('Polynomial',
                 PolynomialFeatures(degree=1, include_bias=False,
                                    interaction_only=False, order='C')),
                ('Yeo-Johnson',
                 PowerTransformer(copy=True, method='yeo-johnson',
                                  standardize=True)),
                ('PCA',
                 PCA(copy=True, iterated_power='auto', n_components=10,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('Naive_Bayes', GaussianNB(priors=None, var_smoothing=1e-09))],
         verbose=False)

In [26]:
y_pred_NaiveBayes_train = CV.best_estimator_.predict(X_train)
y_pred_NaiveBayes_test = CV.best_estimator_.predict(X_test)
y_pred_NaiveBayes_val = CV.best_estimator_.predict(X_val) # saving it for future validation

In [27]:
acc_NaiveBayes_train, acc_NaiveBayes_test = \
accuracy_score(y_train, y_pred_NaiveBayes_train), accuracy_score(y_test, y_pred_NaiveBayes_test)
acc_NaiveBayes_train, acc_NaiveBayes_test

(0.5306007860752386, 0.518324607329843)

## Random Forest

In [28]:
steps = [('MinMaxScaler', MinMaxScaler(feature_range=(0.1, 1.1))),
         ('Polynomial', PolynomialFeatures(include_bias=False)),
         ('Yeo-Johnson', PowerTransformer()),
         ('RFE', RFE(RandomForestClassifier(max_depth=10, min_samples_leaf=25), step=5)),
         ('PCA', PCA()),
         ('Random_Forest', RandomForestClassifier(n_jobs=-1))]

pipeline = Pipeline(steps)

params = {'Polynomial__degree' : [1, 2],
          'RFE__n_features_to_select' : [150, 100, 50],
          'PCA__n_components' : [10, 50, 100],
          'Random_Forest__max_depth': [5, 20, 35],
          'Random_Forest__min_samples_leaf' : [10, 30, 100],
          'Random_Forest__bootstrap' : [True, False]}

tscv = TimeSeriesSplit(n_splits=5)

CV = GridSearchCV(pipeline, params, n_jobs=-1, verbose=1, error_score=np.nan, cv=tscv)

CV.fit(X_train, y_train)

Fitting 5 folds for each of 324 candidates, totalling 1620 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.5s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   10.6s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:  4.7min
[Parallel(n_jobs=-1)]: Done 1620 out of 1620 | elapsed:  6.9min finished


GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=5),
             error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('MinMaxScaler',
                                        MinMaxScaler(copy=True,
                                                     feature_range=(0.1, 1.1))),
                                       ('Polynomial',
                                        PolynomialFeatures(degree=2,
                                                           include_bias=False,
                                                           interaction_only=False,
                                                           order='C')),
                                       ('Yeo-Johnson',
                                        PowerTransformer(copy=True,
                                                         method='yeo-johnson',
                                                         standard...
                              

In [29]:
CV.best_params_

{'PCA__n_components': 10,
 'Polynomial__degree': 1,
 'RFE__n_features_to_select': 50,
 'Random_Forest__bootstrap': False,
 'Random_Forest__max_depth': 35,
 'Random_Forest__min_samples_leaf': 30}

In [30]:
CV.best_estimator_

Pipeline(memory=None,
         steps=[('MinMaxScaler',
                 MinMaxScaler(copy=True, feature_range=(0.1, 1.1))),
                ('Polynomial',
                 PolynomialFeatures(degree=1, include_bias=False,
                                    interaction_only=False, order='C')),
                ('Yeo-Johnson',
                 PowerTransformer(copy=True, method='yeo-johnson',
                                  standardize=True)),
                ('RFE',
                 RFE(estimator=RandomForestClassifier(bootstrap=True,
                                                      class_weight=None,
                                                      cr...
                 RandomForestClassifier(bootstrap=False, class_weight=None,
                                        criterion='gini', max_depth=35,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                                        min_impurity_dec

In [31]:
y_pred_RandForest_train = CV.best_estimator_.predict(X_train)
y_pred_RandForest_test = CV.best_estimator_.predict(X_test)
y_pred_RandForest_val = CV.best_estimator_.predict(X_val) # saving it for future validation

In [32]:
acc_RandForest_train, acc_RandForest_test = \
accuracy_score(y_train, y_pred_RandForest_train), accuracy_score(y_test, y_pred_RandForest_test)
acc_RandForest_train, acc_RandForest_test

(0.7658618753509264, 0.49476439790575916)

# Hard voting

In the previous section 5 models were tested. Next I'm going to create a prediction that takes all of them into account. I don't want to use models that are worse than a 'dummy' model that always predicts the most common class, though, so I'll prepare such a model first and compare my models to it.

## 'Dummy' model

In [33]:
# finding the most common class in the train set
try:
    mode_classes = mode(y_train)
except Exception:
    # fallback if no single most common class can be determined
    mode_classes = 1
    
mode_classes

0.0

In [34]:
# creating dummy model predictions that can be compared with real labels
# using accuracy_score metrics
dummy_prediction_train = [mode_classes for el in y_train] 
dummy_prediction_test = [mode_classes for el in y_test]

In [35]:
accuracy_score(y_train, dummy_prediction_train)
# with 2 possible labels this is at least 50% (exactly 50% if the classes are equal)

0.5058955642897248

In [36]:
dummy_accuracy = accuracy_score(y_test, dummy_prediction_test)
dummy_accuracy
# this score, on the other hand, is not necessarily greater than 50%,
# as it compares y_test with the most common class of the y_train set;
# in the hard voting prediction I'll use only the models that beat this number:
# if they don't, they predict no better than always guessing the majority class.

0.5157068062827225
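
scikit-learn also ships a `DummyClassifier` that provides the same most-frequent-class baseline. A minimal sketch on made-up labels (not the notebook's data), chosen so that the majority class of the train part is the minority of the test part:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: class 0 is the majority in "train" but the minority in "test".
y_tr = np.array([0, 0, 0, 1, 1])
y_te = np.array([1, 1, 0, 1])

# DummyClassifier ignores the features, so a placeholder column is enough.
X_tr = np.zeros((len(y_tr), 1))
X_te = np.zeros((len(y_te), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

acc_train = accuracy_score(y_tr, baseline.predict(X_tr))
acc_test = accuracy_score(y_te, baseline.predict(X_te))
print(acc_train, acc_test)
```

This toy case illustrates the point made in the comment above: the baseline is guaranteed to reach at least 50% only on the data it was fitted on.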

## Creating ensemble prediction

I'll pick the models that are better than the 'dummy' model (i.e. have higher accuracy on the test set) and create an ensemble prediction (hard voting).

In [37]:
models_accuracy = [acc_DecTree_test, acc_LogReg_test,
          acc_KNN_test, acc_NaiveBayes_test, acc_RandForest_test]

models_pred = [y_pred_DecTree_test, y_pred_LogReg_test,
              y_pred_KNN_test, y_pred_NaiveBayes_test, y_pred_RandForest_test]

models_name = ['Decision Tree', 'Logistic Regression',
              'K Nearest Neighbors', 'Naive Bayes', 'Random Forest']

n = 0

HardVotePrediction_test = [0 for el in y_test]

for model_accuracy, model_pred, model_name in zip(models_accuracy, models_pred, models_name):
    
    if model_accuracy > dummy_accuracy:
        HardVotePrediction_test = HardVotePrediction_test+model_pred
        n+=1
        print(model_name)
        
if n==0:
    print('None of the models can beat random prediction.')
    
else:
    HardVotePrediction_test = (HardVotePrediction_test/n).round()
    print(HardVotePrediction_test)

Decision Tree
Logistic Regression
Naive Bayes
[0. 0. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1.
 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1.
 1. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1.
 1. 1. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1.
 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 1. 1. 0. 0. 1.
 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0.
 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0.
 0. 0. 1. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 1. 1.
 1. 1. 0. 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1.
 0. 1. 0. 0. 1. 0. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 1. 1. 1. 0. 0. 1. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 1. 0.
 1. 0

In [38]:
accuracy_score(y_test, HardVotePrediction_test)

0.5235602094240838
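
The selection-then-vote logic can also be written with dictionaries and an explicit majority rule. This is a sketch with three hypothetical models and made-up predictions and accuracies, not a replacement for the cell above:

```python
import numpy as np

# Made-up 0/1 predictions from three hypothetical models on five samples.
preds = {
    "model_a": np.array([1, 0, 1, 1, 0]),
    "model_b": np.array([1, 1, 0, 1, 0]),
    "model_c": np.array([0, 0, 1, 1, 1]),
}
# Made-up test accuracies and baseline, used only to filter the voters.
accs = {"model_a": 0.55, "model_b": 0.53, "model_c": 0.50}
baseline_acc = 0.52

voters = [name for name, acc in accs.items() if acc > baseline_acc]

if voters:
    votes = np.sum([preds[name] for name in voters], axis=0)
    # Strict majority; an exact tie (possible with an even number of voters)
    # falls to class 0 here, which is also what NumPy's round() gives
    # for an average of exactly 0.5.
    hard_vote = (votes * 2 > len(voters)).astype(float)
else:
    hard_vote = None
print(voters, hard_vote)
```

With an even number of voters, ties are resolved towards class 0 in both variants, so the ensemble is slightly biased towards predicting "down" in that situation.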

## Ensemble prediction using all 5 models

Even though some of the models might have no predictive power on their own, it's possible that they make different types of prediction errors and thus actually improve the ensemble prediction.

In [39]:
# ...and just to see if using all 5 models for voting would give better results
HardVotePredictionAll_test = ((y_pred_DecTree_test+y_pred_LogReg_test+y_pred_KNN_test+
                       y_pred_NaiveBayes_test+y_pred_RandForest_test)/5).round()
HardVotePredictionAll_test

array([0., 0., 0., 0., 1., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 0., 0.,
       0., 1., 0., 0., 1., 0., 1., 0., 0., 1., 1., 1., 0., 0., 0., 1., 1.,
       0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 1.,
       1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 1., 0.,
       0., 0., 0., 1., 1., 1., 0., 0., 1., 0., 0., 1., 1., 0., 1., 0., 1.,
       1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 0.,
       0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1.,
       1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 1., 0.,
       1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1.,
       1., 0., 0., 1., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0.,
       0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 1., 1., 0., 0.,
       1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0.,
       0., 1., 1., 1., 0.

In [40]:
accuracy_score(y_test, HardVotePredictionAll_test)

0.5314136125654451
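
The argument for using all 5 models can be made concrete in an idealized setting. The sketch below assumes that the voters err fully independently and all share the same individual accuracy `p`; both are strong assumptions that the models in this notebook most likely do not satisfy, so the numbers are an upper-bound intuition rather than an estimate:

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is right,
    when each voter is right with probability p (n should be odd)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

for p in (0.50, 0.52, 0.55, 0.60):
    print(p, round(majority_accuracy(p, 5), 4))
```

Under these assumptions a majority vote amplifies any edge above 50%, but it equally amplifies a deficit: voters that are individually below 50% pull the ensemble down.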

# Scores evaluation and picking the best method

## Scores based on test set

To sum it up:

In [41]:
# ensemble model - the best models used
accuracy_score(y_test, HardVotePrediction_test)

0.5235602094240838

In [42]:
# ensemble model - all 5 models used
accuracy_score(y_test, HardVotePredictionAll_test)

0.5314136125654451

In [43]:
# comparison with 5 single models
acc_DecTree_test, acc_LogReg_test, acc_KNN_test, acc_NaiveBayes_test, acc_RandForest_test

(0.5471204188481675,
 0.5261780104712042,
 0.5026178010471204,
 0.518324607329843,
 0.49476439790575916)

It seems that the prepared models and the ensemble models have little or no predictive power.

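At the time of writing, Random Forest had the best accuracy score, so I pick it as the recommended choice and check it on the validation set. In the run shown above, the Decision Tree scores highest on the test set (54.7%) and Random Forest lowest (49.5%).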

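You might see different scores, though, as some steps in the process involve randomness (e.g. building the decision tree and the random forest without a fixed random_state). The train_test_split calls are deterministic here because shuffle=False.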
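
One way to make reruns comparable is to pass a fixed `random_state` to the estimators that involve randomness. A minimal sketch on synthetic data from `make_classification`; the seed value 42, the helper `fit_predict` and the small forest size are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to show the effect of fixing the seed.
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)

def fit_predict(seed):
    # The same seed gives the same forest, hence the same predictions.
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    return model.fit(X_syn, y_syn).predict(X_syn)

first_run = fit_predict(42)
second_run = fit_predict(42)
print((first_run == second_run).all())
```

In the pipelines above, the same argument would have to be set on both the estimator inside `RFE` and the final classifier for a rerun to reproduce the reported scores.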

## Validation set

In [44]:
accuracy_score(y_val, y_pred_RandForest_val)

0.5275590551181102

In [45]:
confusion_matrix(y_val, y_pred_RandForest_val)

array([[104,  96],
       [ 84,  97]], dtype=int64)

At the time of writing, the accuracy score on the validation set was 55.12%; the run shown above gives 52.76%.

The errors are distributed fairly evenly between the two classes.
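
The balance of the errors can be read directly from the confusion matrix. A short sketch that recomputes the accuracy and the per-class hit rates from the four numbers printed above (in scikit-learn's layout, rows are true classes and columns are predicted classes):

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([[104, 96],
               [84, 97]])

accuracy = np.trace(cm) / cm.sum()                 # share of correct predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)    # correct share within each true class
predicted_share = cm.sum(axis=0) / cm.sum()        # how often each class is predicted

print(round(accuracy, 4), recall_per_class.round(3), predicted_share.round(3))
```

Both classes are recognized slightly more than half of the time, and the model predicts each class about equally often, which is what a balanced error distribution looks like.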

When I started the project, my goal was to achieve a score as close to 60% as possible. It's not there yet, but it doesn't seem hopeless either.

My biggest hope lies in preparing the features differently and/or using entirely new features, such as the current temperature or temperature forecasts. More sophisticated models might be worth checking as well.