# Modelling notebook 3: Support Vector Machines

In this notebook, we build SVM models for predicting fog. The data is split into a train/validation set to make modelling decisions, and an unseen test set for checking the generalisation error of the models.

<br>

**Train-Valid:** 2011-2019  
**Test:** 2020 and 2021

Contents:
- Feature selection using cross-validated SVM gain importance values as a measure of feature importance.
Feature importances are calculated using time-series split cross-validation. Based on this, we pick the feature list.

- We then do some testing, adding lagged features and checking the impact on performance. 

- Finally, we tune the hyperparameters of the SVM model and export it.

## 1. Import Packages & Data

In [1]:
!pip uninstall scikit-learn -y
!pip install scikit-learn==1.2.1

Found existing installation: scikit-learn 1.0.2
Uninstalling scikit-learn-1.0.2:
  Successfully uninstalled scikit-learn-1.0.2
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-learn==1.2.1
  Downloading scikit_learn-1.2.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.8 MB)
Installing collected packages: scikit-learn
Successfully installed scikit-learn-1.2.1


In [2]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 1.2.1.


In [3]:
! pip install -U neptune-client
!pip install -U neptune-sklearn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting neptune-client
  Downloading neptune_client-0.16.17-py3-none-any.whl (446 kB)
Collecting PyJWT
  Downloading PyJWT-2.6.0-py3-none-any.whl (20 kB)
Collecting boto3>=1.16.0
  Downloading boto3-1.26.69-py3-none-any.whl (132 kB)
Collecting bravado<12.0.0,>=11.0.0
  Downloading bravado-11.0.3-py2.py3-none-any.whl (38 kB)
Collecting swagger-spec-validator>=2.7.4
  Downloading swagger_spec_validator-3.0.3-py2.py3-none-any.whl (27 kB)
Collecting future>=0.17.1
  Downloading future-0.18.3.tar.gz (840 kB)
  Preparing metadata

In [25]:
import os
import neptune.new as neptune

# Initialise the Neptune model; the API token is read from the environment
# rather than hard-coded in the notebook
modl = neptune.init_model(
    name="SVM",
    key="SVM1",
    project="swiatej2/fyp",
    api_token=os.getenv("NEPTUNE_API_TOKEN"),  # your credentials
)

https://app.neptune.ai/swiatej2/fyp/m/FYP-SVM1
Remember to stop your model once you’ve finished logging your metadata (https://docs.neptune.ai/api/model#stop). It will be stopped automatically only when the notebook kernel/interactive console is terminated.


In [5]:
# data processing
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
import missingno
from scipy import stats

# modelling
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, \
RandomizedSearchCV, train_test_split, TimeSeriesSplit
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from scipy.stats import uniform, randint

# visualisations
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# other
from tqdm import tqdm
import pickle
import os
import sys
seed=42

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
# importing data and helper functions from directories dependent on which is available

joseph_path = '/content/drive/My Drive/DS_Modules/CA4021 (Final Year Project)/' # Joseph
julita_path = '/content/drive/My Drive/CA4021 (Final Year Project)/' # Julita

if os.path.exists(joseph_path):
  print("Importing from DS_Modules/CA4021")
  sys.path.append(os.path.join(joseph_path, 'scripts'))
  path = joseph_path

elif os.path.exists(julita_path):
  print("Importing directly from CA4021 folder")
  sys.path.append(os.path.join(julita_path, 'scripts'))
  path = julita_path

Importing directly from CA4021 folder


In [8]:
# import helper functions from aux files (avoids too many function definitions in the notebook)
from aux_functions import missing_percentages, plot_dist_discrete, plot_dist_continuous, \
plot_vis_discrete, plot_vis_continuous, month_vplot
from aux_functions_ml import preprocess, manual_cross_validate, get_feat_importance_df, \
performance_report, calc_mean_importance, plot_importance, heidke_skill_score, score_model

In [9]:
# import train/valid and test sets
df_train = pd.read_csv(os.path.join(path, 'data/train_data.csv'))
df_train.index=pd.to_datetime(df_train.date_time)
df_train.date_time = df_train.index

df_test = pd.read_csv(os.path.join(path, 'data/test_data.csv'))
df_test.index=pd.to_datetime(df_test.date_time)
df_test.date_time = df_test.index

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

print("Train/valid:", df_train.shape)
print("Test:", df_test.shape)

Train/valid: (78888, 57)
Test: (17544, 57)


In [None]:
df_train.head()

date_time (index),date_time,year,month,day,hour,date,dir,speed,vis,ww,w,pchar,ptend,cbl,msl,drybulb,wetbulb,dewpt,vp,rh,clow,cmedium,chigh,nlc,ntot,hlc,nsig1,tsig1,hsig1,nsig2,tsig2,hsig2,nsig3,tsig3,hsig3,nsig4,tsig4,hsig4,ceiling,dos,weather,duration,rainfall,sunshine,tabdir,tabspeed,pweather,dni,vis_hr1,target_hr1,fog_state,season,temp_dew_dist,rainfall12hma,fog_formation,fog_dissipation,transition
2011-01-01 00:00:00,2011-01-01 00:00:00,2011,1,1,0,01-Jan-2011 00:00:00,27,7,9000,10,22,5,0.1,1017.1,1027.8,5.5,4.6,3.3,7.8,86,5.0,0.0,0.0,7,7,22,7,6,22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22,0.0,0,0.0,0.0,0.0,26,6,0,0,9000.0,0,no fog,winter,2.2,0.0,0,0,0
2011-01-01 01:00:00,2011-01-01 01:00:00,2011,1,1,1,01-Jan-2011 01:00:00,28,6,9000,10,22,5,0.0,1017.1,1027.8,5.1,4.4,3.4,7.8,89,5.0,0.0,0.0,7,7,22,7,6,22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22,0.0,0,0.0,0.0,0.0,28,6,0,0,8000.0,0,no fog,winter,1.7,0.0,0,0,0
2011-01-01 02:00:00,2011-01-01 02:00:00,2011,1,1,2,01-Jan-2011 02:00:00,27,6,8000,10,22,8,0.2,1016.8,1027.5,5.3,4.0,2.1,7.1,80,5.0,0.0,0.0,7,7,22,7,6,22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22,0.0,0,0.0,0.0,0.0,27,7,0,0,8000.0,0,no fog,winter,3.2,0.0,0,0,0
2011-01-01 03:00:00,2011-01-01 03:00:00,2011,1,1,3,01-Jan-2011 03:00:00,25,7,8000,10,22,7,0.5,1016.6,1027.3,5.2,4.6,3.7,8.0,90,5.0,0.0,0.0,7,7,23,7,6,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23,0.0,0,0.0,0.0,0.0,27,7,0,0,8000.0,0,no fog,winter,1.5,0.0,0,0,0
2011-01-01 04:00:00,2011-01-01 04:00:00,2011,1,1,4,01-Jan-2011 04:00:00,28,7,8000,10,22,6,0.5,1016.6,1027.3,5.1,4.7,4.1,8.2,94,5.0,0.0,0.0,7,7,24,7,6,24,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,24,0.0,0,0.0,0.0,0.0,27,7,0,0,9000.0,0,no fog,winter,1.0,0.0,0,0,0


In [10]:
# LEAVING OUT w, ww, pweather and weather because of OH encoding sparsity

metadata = ['date', 'date_time', 'year', 'month', 'day', 'hour', 'season']
indicator = [col for col in df_train.columns if col[0] == 'i']
constant = [var for var in df_train.columns if len(df_train[var].value_counts()) == 1]
codes = ['sp1', 'sp2', 'sp3', 'sp4', 'wwa', 'wa', 'w' ,'ww', 'pweather', 'weather']
excluded = indicator + constant + codes + ['rgauge', 'sog', 'tabspeed', 'msl']
vis_vars=['target_hr1', 'vis_hr1', 'fog_formation', 'fog_dissipation', 'transition']
target = 'target_hr1'

categorical=['fog_state', 'season', 'tsig1', 'tsig2', 'tsig3', 'pchar'] #'w', 'ww', 'pweather',
             #'weather']
discrete = [var for var in df_train.columns if len(df_train[var].unique()) < 15 and 
             var not in excluded + categorical + metadata + codes + indicator + vis_vars]

continuous = [var for var in df_train.columns if var not in discrete + excluded + categorical + metadata + codes + indicator + vis_vars]
numerical = discrete+continuous
# conservative list of variables known to have an impact on fog formation.
# the other lists are too big for certain visualisations
fog_vars = ['rainfall', 'drybulb', 'cbl', 'ntot', 'dni', 'dewpt', 'speed', 'dir', 'rh']

## 2. Feature Selection 

In this section, we check the gain importance of each variable. The importances are averaged across the cross-validation folds, and the performance of the model is assessed.

A final set of features is selected using the importance scores.
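An RBF-kernel `SVC` does not expose a gain importance the way tree ensembles do, so one way to obtain a comparable ranking is permutation importance averaged over time-series folds. The sketch below is only an illustration of that idea and is not what `manual_cross_validate` does. It runs on synthetic data, and `X_demo`, `y_demo`, `fold_importances` and `mean_importance` are names made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import TimeSeriesSplit
from sklearn.svm import SVC

# synthetic stand-in for the fog features (assumption: 6 numeric columns)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=3, random_state=42
)

fold_importances = []
for train_idx, valid_idx in TimeSeriesSplit(n_splits=3).split(X_demo):
    # fit on the earlier block, measure importance on the later block
    clf = SVC(random_state=42).fit(X_demo[train_idx], y_demo[train_idx])
    result = permutation_importance(
        clf, X_demo[valid_idx], y_demo[valid_idx],
        scoring="f1", n_repeats=5, random_state=42,
    )
    fold_importances.append(result.importances_mean)

# average the per-fold importances and rank features from most to least important
mean_importance = np.mean(fold_importances, axis=0)
ranking = np.argsort(mean_importance)[::-1]
print(ranking)
```

Each importance is the drop in validation F1 when one column is shuffled, so a value near zero suggests the model barely relies on that feature.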

In [11]:
dates = df_train.date_time
X = df_train[numerical + categorical + vis_vars].reset_index(drop=True)
y = X.pop(target)

In [12]:
# debugging
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=seed)
print("-" * 100)
X_train_p, X_valid = preprocess(X_train, X_valid, cat_vars=categorical, num_vars=numerical, cat_encoder='oh')
X_train_p.head()

----------------------------------------------------------------------------------------------------


Unnamed: 0,clow,cmedium,chigh,nlc,ntot,nsig1,nsig2,nsig3,nsig4,tsig4,duration,sunshine,dir,speed,vis,ptend,cbl,drybulb,wetbulb,dewpt,vp,rh,hlc,hsig1,hsig2,hsig3,hsig4,ceiling,dos,rainfall,tabdir,dni,temp_dew_dist,rainfall12hma,fog_state_fog,fog_state_no fog,season_autumn,season_spring,season_summer,season_winter,tsig1_0,tsig1_1,tsig1_2,tsig1_3,tsig1_4,tsig1_6,tsig1_7,tsig1_8,tsig1_9,tsig2_0.0,tsig2_1.0,tsig2_2.0,tsig2_3.0,tsig2_4.0,tsig2_5.0,tsig2_6.0,tsig2_7.0,tsig2_8.0,tsig2_9.0,tsig3_0.0,tsig3_1.0,tsig3_2.0,tsig3_3.0,tsig3_4.0,tsig3_5.0,tsig3_6.0,tsig3_7.0,tsig3_8.0,tsig3_9.0,pchar_0,pchar_1,pchar_2,pchar_3,pchar_4,pchar_5,pchar_6,pchar_7,pchar_8
23939,0.157474,-0.477146,-0.628931,0.907394,0.592403,0.31353,1.031267,-0.765498,-0.043149,-0.039436,-0.416261,-0.209465,-0.928211,-0.469562,-0.958545,-0.945974,0.177002,1.265908,1.454086,1.585501,1.763963,0.221055,-0.415021,-0.415119,-0.601713,-0.516163,-0.034592,-0.60933,-0.03421,-0.20967,-0.941575,1.340872,-0.276816,-0.338362,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
11407,1.163045,1.275959,1.487995,-0.55257,-0.284485,-0.506373,-0.283505,-0.765498,-0.043149,-0.039436,-0.416261,0.718635,0.769422,0.252956,2.182355,-0.853002,-1.733127,-1.000529,-0.922103,-0.712943,-0.766941,0.809214,-0.002964,-0.003052,-0.004084,-0.516163,-0.034592,0.701554,-0.03421,-0.20967,0.782022,1.220059,-0.78352,-0.198496,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
27333,-1.350882,-0.477146,-0.628931,-0.55257,-1.161373,-0.506373,-1.16002,-0.765498,-0.043149,-0.039436,-0.416261,-0.518832,0.405643,3.143029,-0.260567,4.353464,-3.064429,-1.301384,-1.538998,-1.850786,-1.555794,-0.955261,-0.297291,-0.297386,-0.482187,-0.516163,-0.034592,-0.754984,-0.03421,-0.20967,0.41268,-0.852335,0.694366,1.619758,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
59969,1.163045,-0.477146,1.487995,-0.91756,-0.722929,-0.506373,-0.283505,-0.765498,-0.043149,-0.039436,-0.416261,-0.518832,-0.079395,-1.19208,0.43741,0.448615,-0.385287,-0.418877,-0.465143,-0.485374,-0.602597,-0.199058,-0.002964,-0.003052,2.984065,-0.516163,-0.034592,2.88636,-0.03421,-0.20967,0.043337,-0.852335,0.018761,-0.373328,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2805,-2.859238,-0.477146,0.429532,-1.647542,-2.038261,-0.506373,-1.598277,-0.765498,-0.043149,-0.039436,-0.416261,-0.518832,-1.655768,-1.19208,0.088421,-1.038947,1.119662,-0.559276,-0.465143,-0.303319,-0.405383,0.641169,8.179305,8.179424,-0.751121,-0.516163,-0.034592,-0.754984,-0.03421,-0.20967,-1.557146,-0.852335,-0.614619,-0.373328,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [13]:
model = SVC(random_state=seed)

In [14]:
df_train.loc[(df_train.transition==1), ['transition','fog_state']]

df_train.loc[df_train.index >= '2011-01-20 18:00', ['transition', 'fog_state']]

date_time,transition,fog_state
2011-01-20 18:00:00,0,no fog
2011-01-20 19:00:00,0,no fog
2011-01-20 20:00:00,1,fog
2011-01-20 21:00:00,0,fog
2011-01-20 22:00:00,0,fog
...,...,...
2019-12-31 19:00:00,0,no fog
2019-12-31 20:00:00,0,no fog
2019-12-31 21:00:00,0,no fog
2019-12-31 22:00:00,0,no fog


In [15]:
vars_sel = ['vis', 'temp_dew_dist', 'rh', 'ceiling', 'duration', 'hsig2', 'dni', 
                  'dewpt', 'drybulb', 'cbl', 'hlc', 'ntot', 'speed', 'vp', 'pchar','dir']
num_vars_sel = [var for var in vars_sel if var in discrete+continuous]
cat_vars_sel = [var for var in vars_sel if var in categorical]

In [16]:
full_model_scores, _, _ = manual_cross_validate(model, X, y,
                                                cat_vars=categorical,
                                                num_vars=numerical,
                                                folds=5,
                                                cat_encoder='oh',
                                                calc_feature_importance=False)


Fold : 1
training size: (13148, 73)
test size: (13148, 75)
[[13008    28]
 [   54    58]]
****************************************************************************************************
Fold : 2
training size: (26296, 76)
test size: (13148, 78)
[[12940    47]
 [   56   105]]
****************************************************************************************************
Fold : 3
training size: (39444, 77)
test size: (13148, 79)
[[13026    27]
 [   31    64]]
****************************************************************************************************
Fold : 4
training size: (52592, 77)
test size: (13148, 79)
[[13032    27]
 [   50    39]]
****************************************************************************************************
Fold : 5
training size: (65740, 78)
test size: (13148, 80)
[[13075    18]
 [   28    27]]
****************************************************************************************************
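The training and test sizes printed above are what an expanding-window time-series split with five folds gives on the 78,888 rows of the train/valid set. The internals of `manual_cross_validate` are not shown here, so the sketch below only checks that scikit-learn's `TimeSeriesSplit(n_splits=5)` produces the same sizes. `n_rows` and `fold_sizes` are names introduced for the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 78888  # number of rows in the train/valid set, from df_train.shape
splitter = TimeSeriesSplit(n_splits=5)

fold_sizes = []
for fold, (train_idx, valid_idx) in enumerate(splitter.split(np.arange(n_rows)), start=1):
    # every validation index comes strictly after every training index
    assert train_idx.max() < valid_idx.min()
    fold_sizes.append((len(train_idx), len(valid_idx)))
    print(f"Fold {fold}: train={len(train_idx)}, valid={len(valid_idx)}")
```

Each fold trains on everything before its validation block, so the model is never evaluated on hours that precede its training data.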


## 3. Feature Engineering Tests

In this section, we try several feature engineering ideas and check their impact on model performance. In the previous section, we already saw that the `temp_dew_dist` variable was a good addition. We also considered adding cloud presence indicators, but these and many other cloud volume properties are already encoded by `ntot`, `nsigX`, `nlc`, and `clow`/`cmedium`/`chigh`.

This section will mainly be used to test lagged features.
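As a minimal sketch of what a lagged feature could look like, the helper below adds `<col>_lag<k>` columns holding the value from `k` rows earlier. `add_lagged_features` is a hypothetical helper and not part of `aux_functions_ml`. It assumes the rows are hourly, sorted by time, and have no gaps, so that shifting by `k` rows means looking back `k` hours.

```python
import pandas as pd

def add_lagged_features(df, columns, lags):
    """Return a copy of df with `<col>_lag<k>` columns (value k rows earlier)."""
    out = df.copy()
    for col in columns:
        for k in lags:
            out[f"{col}_lag{k}"] = out[col].shift(k)
    return out

# toy hourly frame using the first four vis/rh values shown in df_train.head()
toy = pd.DataFrame(
    {"vis": [9000, 9000, 8000, 8000], "rh": [86, 89, 80, 90]},
    index=pd.to_datetime([
        "2011-01-01 00:00", "2011-01-01 01:00",
        "2011-01-01 02:00", "2011-01-01 03:00",
    ]),
)
lagged = add_lagged_features(toy, columns=["vis", "rh"], lags=[1, 2])
print(lagged)
```

The first `k` rows of each lagged column are `NaN`, so they need to be dropped or imputed before fitting the SVM.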


## 4. Hyperparameter Tuning

In this section, we tune the hyperparameters of our SVM classifier using a grid search, cross-validated with time-series splits. This will probably be replaced by a different method later.

In [17]:
# creating training sets using only the selected features
X_train, X_test = preprocess(df_train, df_test, cat_vars=cat_vars_sel, num_vars=num_vars_sel, cat_encoder='oh')
# create training data using all the variables df_train (for comparison)
X_train_all, X_test_all = preprocess(df_train, df_test, 
                                     cat_vars=categorical, num_vars=continuous+discrete, cat_encoder='oh')
y_train = y.copy()
y_test = df_test[target]

In [18]:
# for compatibility with TimeSeriesSplit
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

In [None]:
X_train.head()

Unnamed: 0,vis,temp_dew_dist,rh,ceiling,duration,hsig2,dni,dewpt,drybulb,cbl,hlc,ntot,speed,vp,dir,pchar_0,pchar_1,pchar_2,pchar_3,pchar_4,pchar_5,pchar_6,pchar_7,pchar_8
0,-1.376079,-0.362191,0.30568,-0.432978,-0.416469,-0.749357,-0.855241,-0.786384,-0.865935,1.170655,-0.002102,0.592468,-0.652624,-0.804416,0.770863,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,-1.376079,-0.573671,0.557907,-0.432978,-0.416469,-0.749357,-0.855241,-0.763574,-0.946414,1.170655,-0.002102,0.592468,-0.83304,-0.804416,0.892206,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,-1.445887,0.060768,-0.198772,-0.432978,-0.416469,-0.749357,-0.855241,-1.060103,-0.906174,1.145849,-0.002102,0.592468,-0.83304,-1.035018,0.770863,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,-1.445887,-0.658263,0.641982,-0.418409,-0.416469,-0.749357,-0.855241,-0.695144,-0.926294,1.129312,0.027427,0.592468,-0.652624,-0.738529,0.528175,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,-1.445887,-0.869742,0.978284,-0.403841,-0.416469,-0.749357,-0.855241,-0.603904,-0.946414,1.129312,0.056956,0.592468,-0.652624,-0.672643,0.892206,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [None]:
X_test.head()

Unnamed: 0,vis,temp_dew_dist,rh,ceiling,duration,hsig2,dni,dewpt,drybulb,cbl,hlc,ntot,speed,vp,dir,pchar_0,pchar_1,pchar_2,pchar_3,pchar_4,pchar_5,pchar_6,pchar_7,pchar_8
0,-0.259153,-0.108416,-0.030621,-0.432978,-0.416469,-0.749357,-0.855241,-0.603904,-0.584257,1.377368,-0.002102,0.592468,-1.193872,-0.672643,0.164144,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,-0.608192,0.060768,-0.198772,-0.31643,-0.416469,-0.301451,-0.855241,-0.695144,-0.584257,1.336026,-0.06116,0.592468,-1.013456,-0.771472,0.164144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,-0.608192,0.356839,-0.535074,-0.345567,-0.416469,-0.331311,-0.855241,-0.991673,-0.704976,1.319489,-0.06116,0.592468,-0.83304,-0.969132,0.285487,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,-0.957232,-0.150712,-0.030621,-0.753485,-0.416469,-0.376102,-0.855241,-1.082913,-1.026894,1.261609,-0.06116,-1.160488,-1.374288,-1.067962,-0.563919,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,-0.957232,-0.742854,0.641982,-0.753485,-0.416469,-0.749357,-0.855241,-1.698781,-1.851809,1.187192,-0.208805,-2.036967,-1.554704,-1.463281,0.406831,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [None]:
# We will finetune the hyperparameters
# DO NOT RERUN - RESULTS SAVED
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search
param_grid = {
              'C': [0.5,1,2,5,10], 
              'gamma': ['scale'],
              'kernel': ['rbf']
              }
# Create a base model
svm_model = SVC(random_state=seed)
time_split = TimeSeriesSplit(n_splits = 5)
# Instantiate the grid search model
grid_search = GridSearchCV(svm_model,param_grid,
                           cv=time_split,n_jobs=-1, verbose=3,return_train_score=True)

grid_search.fit(X_train, y_train)




Fitting 5 folds for each of 5 candidates, totalling 25 fits
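When no `scoring` argument is passed, `GridSearchCV` ranks candidates by the estimator's default `score`, which for `SVC` is accuracy. Fog hours are rare in this data (112 of 13,148 rows in the first validation fold), so accuracy may not separate candidates well. A possible variation, sketched here on synthetic data and not the search that was actually run, is to rank by F1 instead. `X_demo`, `y_demo` and `search` are names made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVC

# synthetic, imbalanced stand-in for the fog data (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)

param_grid = {"C": [0.5, 1, 2, 5, 10], "gamma": ["scale"], "kernel": ["rbf"]}
search = GridSearchCV(
    SVC(random_state=42),
    param_grid,
    scoring="f1",                    # rank candidates by F1 instead of accuracy
    cv=TimeSeriesSplit(n_splits=5),  # same expanding-window folds as above
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The Heidke Skill Score could be used in the same way by wrapping it with `sklearn.metrics.make_scorer`, provided the helper takes `(y_true, y_pred)`.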


In [None]:
# Save the object to a file
with open(os.path.join(path, 'results/rsvm_hyperparam_search3.pickle'), "wb") as file:
    pickle.dump(grid_search, file)

In [19]:
# load hyperparameter search
with open(os.path.join(path, 'results/rsvm_hyperparam_search3.pickle'), 'rb') as file:
  grid_search = pickle.load(file)

In [None]:
grid_search.best_params_

{'C': 5, 'gamma': 'scale', 'kernel': 'rbf'}

In [None]:
grid_search.cv_results_

{'mean_fit_time': array([  5.88457656,   4.05424361, 445.12570353,  10.07946219,
          3.15259695,   7.99558978,   5.68485637,   3.13331237,
          5.54107409,   9.03886514, 928.51842742,  14.98017612,
          8.76728845, 670.64966602,   7.31122723]),
 'std_fit_time': array([   3.93578593,    2.91566716,  411.6736612 ,   10.37721247,
           2.4036641 ,    7.46476762,    4.54818269,    2.74260098,
           5.17330119,    8.59219824, 1485.74596507,   14.88967412,
           8.27987505,  550.30979391,    6.63668403]),
 'mean_score_time': array([ 2.97340593,  0.98933983, 80.15232038,  1.11940165,  0.67969246,
         1.0267458 ,  1.19609356,  0.69366026,  3.09673128,  0.98818254,
         0.57449522,  0.90140738,  1.15364671, 83.53357005,  1.12964973]),
 'std_score_time': array([ 2.13079491,  0.47330421, 30.34233933,  0.51295972,  0.2708416 ,
         0.38972178,  0.81178397,  0.35429502,  1.69587827,  0.45678259,
         0.2353637 ,  0.57290243,  0.57699228, 36.99928087, 

## 5. Final Evaluation

First, we check the performance of the final model (features selected and hyperparameters tuned) and compare it to the initial default SVM model (all features) using cross-validation on the train/valid set. We call the former the validation model.

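**Result:** On the train/validation set, the validation model scores lower than the initial model on F1 (mean 38.27 vs 59.76), Heidke Skill Score (0.380 vs 0.595) and transition F1 (19.47 vs 26.19). Only the transition Heidke Skill Score improves (-0.014 vs -0.242).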

After that, we train the final model on all the train/valid data and test it on the unseen test set.


In [None]:
validation_model = SVC(**grid_search.best_params_, random_state=seed)
validation_model_scores, _, _ = manual_cross_validate(model=validation_model,
                                                      X=X, y=y,
                                                      num_vars=num_vars_sel,
                                                      cat_vars=cat_vars_sel,
                                                      folds=5,
                                                      calc_feature_importance=False,
                                                      cat_encoder='oh')

Fold : 1
training size: (13148, 24)
test size: (13148, 26)
[[13021    15]
 [   88    24]]
****************************************************************************************************
Fold : 2
training size: (26296, 24)
test size: (13148, 26)
[[12954    33]
 [  116    45]]
****************************************************************************************************
Fold : 3
training size: (39444, 24)
test size: (13148, 26)
[[13032    21]
 [   50    45]]
****************************************************************************************************
Fold : 4
training size: (52592, 24)
test size: (13148, 26)
[[13041    18]
 [   73    16]]
****************************************************************************************************
Fold : 5
training size: (65740, 24)
test size: (13148, 26)
[[13088     5]
 [   40    15]]
****************************************************************************************************


In [None]:
# performance of model using all variables and no hyperparameter tuning
performance_report(full_model_scores)

Validation Scores
------------------------------
f1_score
Scores: [58.59, 67.09, 68.82, 50.32, 54.0]
Mean: 59.764

heidke_skill_score
Scores: [0.5828, 0.667, 0.686, 0.5003, 0.5383]
Mean: 0.595

transition_f1_score
Scores: [27.12, 29.95, 28.0, 24.53, 21.33]
Mean: 26.186

transition_hss_score
Scores: [-0.1858, -0.2726, -0.2255, -0.2366, -0.2908]
Mean: -0.242



In [None]:
# after feature selection and hyperparameter tuning
performance_report(validation_model_scores)

Validation Scores
------------------------------
f1_score
Scores: [31.79, 37.66, 55.9, 26.02, 40.0]
Mean: 38.274

heidke_skill_score
Scores: [0.3149, 0.3715, 0.5564, 0.2574, 0.3987]
Mean: 0.380

transition_f1_score
Scores: [16.0, 22.4, 27.78, 18.67, 12.5]
Mean: 19.470

transition_hss_score
Scores: [0.0416, -0.0064, -0.0048, -0.0544, -0.0478]
Mean: -0.014
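To make the reported metrics easier to interpret, the sketch below recomputes F1 and the Heidke Skill Score from a single confusion matrix. The implementation of `heidke_skill_score` in `aux_functions_ml` is not shown, so this uses the standard 2x2 formula as an assumption. It reproduces the fold 1 values of the all-features model (F1 58.59, HSS 0.5828). `f1_and_hss` is a hypothetical helper name.

```python
import numpy as np

def f1_and_hss(cm):
    """Compute F1 and the Heidke Skill Score from a 2x2 confusion matrix.

    cm follows the sklearn layout: rows are true classes, columns are
    predicted classes, with the negative class (no fog) first.
    """
    (tn, fp), (fn, tp) = np.asarray(cm, dtype=float)
    f1 = 2 * tp / (2 * tp + fp + fn)
    hss = 2 * (tp * tn - fp * fn) / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))
    return f1, hss

# fold 1 of the all-features model, copied from the cross-validation output
f1, hss = f1_and_hss([[13008, 28], [54, 58]])
print(round(100 * f1, 2), round(hss, 4))  # → 58.59 0.5828
```

An HSS of 0 corresponds to a forecast no better than chance and 1 to a perfect forecast, so the negative transition scores indicate performance below chance on the transition hours.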



In [28]:
# creating training sets using only the selected features
X_train, X_test = preprocess(df_train, df_test, cat_vars=cat_vars_sel, num_vars=num_vars_sel, cat_encoder='oh')

y_train = y.copy()
y_test = df_test[target]

In [29]:
# Finally, on the unseen test set
final_model = SVC(**grid_search.best_params_, random_state=seed)
final_model.fit(X_train, y_train)

In [30]:
final_scores = score_model(final_model, X_train, X_test, y_train, y_test, df_test)

print("Final model performance:")
print("-"*100)
print("F1 score: {}\nHeidke Skill Score:{}".format(final_scores['f1'],
                                                   final_scores['hss']))
print()
print("Transition F1 score: {}\nTransition Heidke Skill Score:{}".format(final_scores['transition_f1'], 
                                                                         final_scores['transition_hss']))

# logging all the parameters

modl["params"] = grid_search.best_params_
modl['f1'] = final_scores['f1']
modl['hss'] = final_scores['hss']
modl['transition_f1'] = final_scores['transition_f1']
modl['transition_hss'] = final_scores['transition_hss']

modl.stop()

Final model performance:
----------------------------------------------------------------------------------------------------
F1 score: 38.14
Heidke Skill Score:0.3781

Transition F1 score: 20.51
Transition Heidke Skill Score:-0.0294
Shutting down background jobs, please wait a moment...
Done!
Waiting for the remaining 7 operations to synchronize with Neptune. Do not kill this process.
All 7 operations synced, thanks for waiting!
Explore the metadata in the Neptune app:
https://app.neptune.ai/swiatej2/fyp/m/FYP-SVM1/metadata
