# Portfolio project on optimising a model for real-life data

## Objective

Predict peaks and troughs in financial asset price time series using only technical indicators.

## Data

In [1]:
import pandas as pd

### Import features

We do not preprocess the data here; we work directly with the precomputed features and responses.

In [15]:
data = pd.read_csv('equity_data.csv', index_col='date', parse_dates=['date'])

In [16]:
data.head()

date,px_last,px_high,px_low,px_open,px_volume,asset,rsi_7,rsi_14,rsi_28,roc_1m,roc_3m,bb_1m,px_dev_21d,px_dev_50d,ma_3d_stoch_14d,ma_15d_stoch_70d,psycho_12d,psycho_8d,resp
1990-05-31,361.23,361.84,360.23,360.86,119856200.0,SPX Index,73.837822,72.00966,66.788579,8.722348,8.562241,0.792335,2.883125,5.109125,100.0,98.290838,66.666667,75.0,2
1990-06-01,363.16,363.52,361.21,361.26,141159800.0,SPX Index,77.751141,74.077085,68.059651,8.574504,8.231507,0.826827,3.032047,5.526882,100.0,98.588718,75.0,75.0,0
1990-06-04,367.4,367.85,362.43,363.16,133370100.0,SPX Index,83.917001,77.933232,70.621292,9.485353,10.085695,0.92741,3.788661,6.56256,100.0,98.588718,83.333333,75.0,0
1990-06-05,366.64,368.78,365.49,367.4,158007000.0,SPX Index,79.320115,75.757878,69.583932,8.348355,8.495842,0.878514,3.18185,6.160949,98.109453,98.457897,75.0,62.5,0
1990-06-06,364.96,366.64,364.42,366.64,126588400.0,SPX Index,69.501495,71.037515,67.317151,7.174111,8.312806,0.804155,2.373892,5.507518,91.819172,98.135466,66.666667,62.5,0


## Parameters

The parameters below specify:
- The subset of equity indices we consider.
- Whether we want to predict peaks and/or troughs.
- The list of base classifiers and the hyperparameter ranges we want to optimise over.
- The list of meta classifiers and the hyperparameter ranges we want to optimise over.

The hyperparameters are tuned with the Bayesian optimisation algorithm provided by Optuna. Each classifier can have its own data processing. For instance, distance-based algorithms (e.g. K-NN) need the features on the same scale unless we specify feature preferences, so we can attach the appropriate transformation to that classifier and adapt the data to the algorithm. To reduce the dimension, we can also select features systematically with SelectPercentile or PCA.
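
As a minimal sketch of the per-classifier processing described above, the snippet below chains a scaler, SelectPercentile and a K-NN classifier in a scikit-learn pipeline. The synthetic data, the feature scales and `n_neighbors=5` are assumptions made for illustration; only the step names and `percentile=75` mirror the parameters of this notebook.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration): 8 features on very
# different scales, and a 3-class response driven by the first feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 8)) * np.array([1, 10, 100, 1, 10, 100, 1, 10])
y = np.digitize(X[:, 0], bins=[-0.5, 0.5])  # classes 0, 1, 2

# Scaling comes first so that no single feature dominates the distances
# computed by K-NN; feature selection then keeps the top 75% of features.
pipe = Pipeline([
    ('standardisation', StandardScaler()),
    ('features_selection', SelectPercentile(score_func=f_classif, percentile=75)),
    ('classifier', KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X, y)

n_kept = int(pipe.named_steps['features_selection'].get_support().sum())
y_pred = pipe.predict(X)
```

Tree-based models such as XGBoost are insensitive to monotonic rescaling of the features, which is consistent with the `xgb` entry leaving `standardisation` and `normalisation` set to `None`.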

In [4]:
params = \
{
'is_on':
    {
    'classifiers':
        {
        'train': True,
        'fit': True,
        'eval': True,
        },
    },
'assets':
    {
    'equity': [
               #'DAX Index',
               #'HSI Index',
               #'MXASJ Index',
               #'MXEF Index',
               #'MXWD Index',
               #'NDX Index',
               #'NKY Index',
               #'RTY Index',
               #'SGX Index',
               #'SHSZ300 Index',
               'SPX Index',
               #'SVX Index',
               #'SXXE Index',
               #'SXXP Index',
               #'UKX Index'
               ],
    },
'training':
    {
    'train_date': '11/2018',
    'test_date': '01/2019',
    'classifiers':
        {
        'algos':
            {
            'xgb':
               {
               'standardisation': None, # 'std', None
               'normalisation': None, # 'norm', 'minmax', None
               'features_selection': None, # 'perc', 'pca', None
               'classifier': 'xgb',
               },
          #'lgb':
             #{
             #'standardisation': None, # 'std', None
             #'normalisation': None, # 'norm', 'minmax', None
             #'features_selection': None, # 'perc', 'pca', None
             #'classifier': 'lgb',
             #},
            },
        'seed': 42,
        'evaluation_metric': 'precision', # 'f1', 'mcc', 'log_loss', 'precision'
        'split':
            {
            'n_splits': 5,
            'group_gap': 21,
            'max_train_group_size': 252 * 12,
            'max_test_group_size': 252 * 4,
            },
        'features_selection':
            {
            'perc':
                {
                'fix_params':
                    {
                    'score_func': 'f_classif',
                    'percentile': 75,
                    },
                'trials_range':
                    {
                    #'percentile':  {'low': 10, 'high': 100, 'step': 10},
                    },
                },
            'pca':
                {
                'fix_params':
                    {
                    'svd_solver': 'full',
                    },
                'trials_range':
                    {
                    'n_components':  {'low': .55, 'high': .95, 'step': .1},
                    },
                },
            },
        'classifier':
            {
            'xgb':
                {
                'optimisation': {'n_trials': 25, 'timeout': None},
                'fix_params':
                    {
                    'objective': 'multi:softprob',
                    'random_state': 42,
                    'importance_type': 'weight',
                    #'max_depth': 3,
                    #'min_child_weight': 3,
                    #'learning_rate': .01,
                    #'gamma': 20,
                    #'colsample_bytree': .5,
                    },
                'trials_range':
                    {
                    'n_estimators':      {'low': 50, 'high': 350, 'step': 50},
                    'max_depth':         {'low': 3, 'high': 6, 'step': 1},
                    'min_child_weight':  {'low': 6, 'high': 10, 'step': 1},
                    'learning_rate':     {'low': .01, 'high': .31, 'step': .05},
                    'gamma':             {'low': 8, 'high': 12, 'step': 1},
                    'colsample_bytree':  {'low': .4, 'high': .8, 'step': .1},
                    'subsample':         {'low': .4, 'high': 1., 'step': .1},
                    'colsample_bylevel': {'low': .2, 'high': 1., 'step': .1},
                    'colsample_bynode':  {'low': .2, 'high': 1., 'step': .1},
                    #'reg_lambda':        {'low': 0., 'high': 1., 'step': .1},
                    #'reg_alpha':         {'low': 0., 'high': 1., 'step': .1},
                    },
                },
            'lgb':
                {
                'optimisation': {'n_trials': 1, 'timeout': None},
                'fix_params':
                    {
                    'objective': 'multiclass',
                    'random_state': 42,
                    'class_weight': 'balanced',
                    #'learning_rate': .01,
                    #'max_depth': 2,
                    #'num_leaves': 1_000,
                    },
                'trials_range':
                    {
                    #'max_depth':        {'low': 2, 'high': 6, 'step':2},
                    #'num_leaves':       {'low': 2, 'high': 2**10},
                    #'learning_rate':    {'low': .001, 'high': .6, 'log':True},
                    #'colsample_bytree': {'low': .4, 'high': 1., 'step': .1},
                    #'subsample':        {'low': .4, 'high': 1., 'step': .1},
                    },
                },
            'svm':
                {
                'optimisation':  {'n_trials': 1, 'timeout': None},
                'fix_params':
                    {
                    'random_state': 42,
                    'class_weight': 'balanced',
                    'kernel': 'sigmoid',
                    #'degree': 17,
                    #'max_iter': 500,
                    'C': .01,
                    },
                'trials_range':
                    {
                    #'C': {'low': 2**-2, 'high': 2**2, 'log': True}, # default value (i.e., 1) works best
                    #'kernel': {'choices': ['linear', 'poly', 'rbf', 'sigmoid', 'precomputed']},
                    #'kernel': {'choices': ['rbf', 'poly']},
                    #'degree': {'low': 1, 'high': 5, 'step': 2},
                    #'gamma': {'low': 2**-2, 'high': 2**2, 'log': True},
                    #'shrinking': {'choices': [True, False]},
                    #'coef0': {'low': -.1, 'high': .1, 'step': .01}
                    },
                },
            'rf':
                {
                'optimisation':  {'n_trials': 1, 'timeout': None},
                'fix_params':
                    {
                    'random_state': 42,
                    #'max_depth': 3,
                    #'max_features': .3,
                    #'n_estimators': 1000,
                    #'min_samples_split': 20,
                    #'max_leaf_nodes': 40,
                    #'min_samples_leaf': 20,
                    #'max_samples': .5,
                    #'max_features': 'log2', #sqrt (default), log2, None
                    },
                'trials_range':
                    {
                    #'n_estimators': {'low': 50, 'high': 1000, 'step': 50}, # 650
                    #'max_depth': {'low': 3, 'high': 9, 'step': 2}, # 3
                    #'max_features': {'low': .1, 'high': .9, 'step': .2}, # .9
                    #'min_samples_split': {'low': 2, 'high': 100, 'step': 2}, # .82
                    #'max_leaf_nodes': {'low': 2, 'high': 50, 'step': 2},
                    #'min_samples_leaf': {'low': 1, 'high': 391, 'step':10},
                    #'max_samples': {'low': .1, 'high': 1.},
                    },
                },
            },
        },
    },
}


### Split data

The function below splits the data into train and test sets. When we optimise the hyperparameters of the classifiers, we will further split the train set into training and validation subsets. We need to make sure there is no leakage from the test set into the train set, otherwise the evaluation might be compromised.

In [5]:
def split_data(data, params):
    """
    Split the data in:
    - X_train: features with dates up to train_date.
    - y_train: responses with dates up to train_date.
    - X_test: features with dates from test_date.
    - y_test: responses with dates from test_date.
    """
    def split(train_test):
        date = params['training'][f'{train_test}_date']
        if train_test == 'train':
            df = data.loc[:date]
        else:
            df = data.loc[date:]
        X_ = df.loc[:,'asset':].drop(['resp'], axis=1)
        y_ = df[['asset', 'resp']]
        return X_, y_

    data = data.loc[data.asset.isin(params['assets']['equity'])]
    X_train, y_train = split('train')
    X_test, y_test = split('test')
    return X_train, y_train, X_test, y_test

In [6]:
X_train, y_train, X_test, y_test = split_data(data, params)
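
The split relies on slicing a DatetimeIndex with month-level date strings. The sketch below shows the behaviour on a toy daily series (an assumption for illustration) using ISO strings such as `'2018-11'`, which pandas resolves as whole months; the `'11/2018'` format used in the parameters is assumed to resolve the same way.

```python
import pandas as pd

# Toy daily series (an assumption for illustration).
idx = pd.date_range('2018-10-01', '2019-02-28', freq='D', name='date')
toy = pd.DataFrame({'resp': 0}, index=idx)

# A month-level string on the right of a slice includes the whole month.
train = toy.loc[:'2018-11']
# A month-level string on the left of a slice starts on the first of the month.
test = toy.loc['2019-01':]

# Calendar days separating the last train row from the first test row.
gap_days = (test.index.min() - train.index.max()).days
```

With a train date in November 2018 and a test date in January 2019, December 2018 belongs to neither set, so it acts as a buffer between the two.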

### Machine Learning Utils

This cell gathers a collection of machine learning functions. For instance, "build_pipeline" provides a convenient way to chain data transformation steps before the classifier, and "apply_frac_diff" deals with the non-stationarity of time series by using fractional differencing.

In [7]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import MinMaxScaler
from statsmodels.tsa.stattools import adfuller
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectPercentile
from sklearn.utils.class_weight import compute_sample_weight

def build_pipeline(params):
    """
    Build the sklearn pipeline with the following steps:
    - Standardisation: standardise features by removing the mean
      and scaling to unit variance.
    - Normalisation: rescale each sample to unit norm ('norm') or
      each feature to the [0, 1] range ('minmax').
    - Feature selection: reduce the number of features with either
      a PCA or SelectPercentile, which selects features according to
      a percentile of the highest scores.
    - Classifier: predict the response based on the transformed
      features.
    """
    ML_library = \
        {
        'standardisation':
            {
            'std': StandardScaler,
            },
        'normalisation':
            {
            'norm': Normalizer,
            'minmax': MinMaxScaler,
            },
        'features_selection':
            {
            'perc': SelectPercentile,
            'pca': PCA,
            },
        'classifier':
            {
            'xgb': xgb.XGBClassifier,
            'lgb': LGBMClassifier,
            'rf': RandomForestClassifier,
            'svm': SVC,
            'knn': KNeighborsClassifier,
            'lr': LogisticRegression,
            },
        }
    l_steps = []
    for k, v in params.items():
        if v['name'] is not None:
            l_steps.append((k, ML_library[k][v['name']](**v['params'])))
    pipe = Pipeline(l_steps)
    return pipe

def get_params_fitting(cls_algo, y_wght):
    """
    Complete the fit arguments of the classifier to take into account
    the weights of the responses.
    """
    d_params = \
    {
    'xgb':
        {
        'classifier__sample_weight': y_wght,
        'classifier__verbose': False,
        },
    'rf':
        {
        'classifier__sample_weight': y_wght,
        },
    }

    if cls_algo in d_params:
        return d_params[cls_algo]
    else:
        return {}

def get_sample_weight(y):
    """
    Estimate sample weights by class for unbalanced datasets.
    Responses with few occurrences will get a higher weight.
    """ 
    l_y_wght = compute_sample_weight('balanced', y)
    return pd.Series(l_y_wght, index=y.index)

def apply_frac_diff(ts, **kwargs):
    """
    Apply fractional differencing to make the data stationary
    without losing too much information.
    """
    def get_wght(d, thres):
        w, k=[1.], 1
        while True:
            w_= -w[-1] * (d - k + 1) / k
            if abs(w_) < thres:break
            w.append(w_)
            k += 1
        return np.array(w[::-1]).reshape(-1,1)

    def apply_wght(ts, d, thres):
        w = get_wght(d, thres)
        width = len(w) - 1
        d_res = {}
        if width >= ts.shape[0]:raise Exception("width is oversize")
        for i in range(width, ts.shape[0]):
            i_0_index, i_1_index = ts.index[i-width], ts.index[i]
            data = np.dot(w.T, ts.loc[i_0_index:i_1_index])[0]
            d_res[i_1_index] = data
        return pd.Series(d_res)

    def get_adf_corr(ts):
        out = pd.DataFrame(columns=['adfStat','pVal','lags',\
                                 'nb_obs','95% conf','corr'])
        for d in np.linspace(0, 1 , 11):
            ts_trans = apply_wght(ts, d, kwargs['thres'])
            corr = np.corrcoef(ts.loc[ts_trans.index], ts_trans)[0,1]
            adf = adfuller(ts_trans, maxlag=1, regression='c',autolag=None)
            out.loc[d] = list(adf[:4])+[adf[4]['5%']]+[corr]
        return out

    if 'thres' not in kwargs: kwargs['thres'] = 1e-4
    out = get_adf_corr(ts)
    min_d = min(out.loc[out.pVal <= .05].index)
    res = apply_wght(ts, min_d, kwargs['thres'])
    res.index.name = 'date'
    return res
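
To illustrate the weights computed by `get_wght` above, here is a standalone sketch of the same recursion, w_0 = 1 and w_k = -w_{k-1} (d - k + 1) / k, truncated once a weight falls below the threshold. The function name `frac_diff_weights` is hypothetical, and unlike `get_wght` it returns the weights in natural order (most recent observation first) as a flat array.

```python
import numpy as np

def frac_diff_weights(d, thres=1e-4):
    """Weights of the fractional difference operator (1 - B)^d."""
    w, k = [1.0], 1
    while True:
        w_k = -w[-1] * (d - k + 1) / k
        if abs(w_k) < thres:  # drop negligible weights to bound the window
            break
        w.append(w_k)
        k += 1
    return np.array(w)

w_zero = frac_diff_weights(0.0)  # no differencing: the series is unchanged
w_one = frac_diff_weights(1.0)   # ordinary first difference
w_half = frac_diff_weights(0.5)  # slowly decaying memory of past values
```

For d = 0 the only weight is 1, and for d = 1 the weights are 1 and -1, i.e. the usual first difference. Intermediate values of d keep a long but decaying window of past observations, which is why the smallest d that passes the ADF test is preferred.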


PurgedGroupTimeSeriesSplit ensures there is no leakage between the training and validation sets.

In [8]:
from sklearn.model_selection._split import _BaseKFold, indexable, _num_samples
from sklearn.utils.validation import _deprecate_positional_args
import numpy as np



class PurgedGroupTimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator variant with non-overlapping groups.
    Allows for a gap in groups to avoid potentially leaking info from
    train into test if the model has windowed or lag features.
    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals according to a
    third party provided group.
    In each split, test indices must be higher than before, and thus shuffling
    in cross validator is inappropriate.
    This cross-validation object is a variation of :class:`KFold`.
    In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set.
    The same group will not appear in two different folds (the number of
    distinct groups has to be at least equal to the number of folds).
    Note that unlike standard cross-validation methods, successive
    training sets are supersets of those that come before them.
    Read more in the :ref:`User Guide <cross_validation>`.
    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.
    max_train_group_size : int, default=Inf
        Maximum group size for a single training set.
    group_gap : int, default=None
        Gap between train and test
    max_test_group_size : int, default=Inf
        Maximum group size for a single test set.
    """

    @_deprecate_positional_args
    def __init__(self,
                 n_splits=5,
                 *,
                 max_train_group_size=np.inf,
                 max_test_group_size=np.inf,
                 group_gap=None,
                 verbose=False
                 ):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_group_size = max_train_group_size
        self.group_gap = group_gap
        self.max_test_group_size = max_test_group_size
        self.verbose = verbose

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.
        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.
        groups : array-like of shape (n_samples,)
            Group labels for the samples used while splitting the dataset into
            train/test set.
        Yields
        ------
        train : ndarray
            The training set indices for that split.
        test : ndarray
            The testing set indices for that split.
        """
        if groups is None:
            raise ValueError(
                "The 'groups' parameter should not be None")
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        group_gap = self.group_gap
        max_test_group_size = self.max_test_group_size
        max_train_group_size = self.max_train_group_size
        n_folds = n_splits + 1
        group_dict = {}
        u, ind = np.unique(groups, return_index=True)
        unique_groups = u[np.argsort(ind)]
        n_samples = _num_samples(X)
        n_groups = _num_samples(unique_groups)
        for idx in np.arange(n_samples):
            if groups[idx] in group_dict:
                group_dict[groups[idx]].append(idx)
            else:
                group_dict[groups[idx]] = [idx]
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds={0} greater than"
                 " the number of groups={1}").format(n_folds,
                                                     n_groups))

        group_test_size = min(n_groups // n_folds, max_test_group_size)
        group_test_starts = range(n_groups - n_splits * group_test_size,
                                  n_groups, group_test_size)
        for group_test_start in group_test_starts:
            train_array = []
            test_array = []

            group_st = max(0, group_test_start - group_gap - max_train_group_size)
            for train_group_idx in unique_groups[group_st:(group_test_start - group_gap)]:
                train_array_tmp = group_dict[train_group_idx]

                train_array = np.sort(np.unique(
                    np.concatenate((train_array,
                                    train_array_tmp)),
                    axis=None), axis=None)

            train_end = train_array.size

            for test_group_idx in unique_groups[group_test_start:group_test_start + group_test_size]:
                test_array_tmp = group_dict[test_group_idx]
                test_array = np.sort(np.unique(
                    np.concatenate((test_array,
                                    test_array_tmp)),
                    axis=None), axis=None)

            test_array = test_array[group_gap:]

            if self.verbose > 0:
                pass

            yield [int(i) for i in train_array], [int(i) for i in test_array]
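
The effect of the gap can be seen with a simplified sketch of the fold arithmetic used above. It assumes one row per group and ignores the maximum train and test sizes as well as the trimming of the test fold, so it is not a replacement for the class; the function name `purged_walk_forward` is hypothetical.

```python
import numpy as np

def purged_walk_forward(n_samples, n_splits, gap):
    """Yield (train, test) index arrays separated by `gap` dropped rows."""
    fold = n_samples // (n_splits + 1)
    for start in range(n_samples - n_splits * fold, n_samples, fold):
        train = np.arange(0, max(0, start - gap))  # stop `gap` rows early
        test = np.arange(start, start + fold)
        yield train, test

splits = list(purged_walk_forward(n_samples=60, n_splits=3, gap=5))
for train, test in splits:
    print(f'train 0..{train.max()}  test {test.min()}..{test.max()}')
```

Each training window ends `gap` rows before its test window starts, so features built on rolling windows shorter than the gap cannot carry information from the test fold into training.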


ClassifierHPO uses Optuna's Bayesian optimisation algorithm to find the best set of hyperparameter values.

In [9]:
import numpy as np
import pandas as pd
import optuna
from optuna.samplers import TPESampler
from sklearn.metrics import f1_score, log_loss, matthews_corrcoef, precision_score
from sklearn.feature_selection import f_classif


class ClassifierHPO():
    """
    Classifier Hyper Parameters Optimisation: find the set of
    hyperparameter values that gives the best score on the
    validation sets.
    """
    def __init__(self, X, y, y_wght, algo, params):
        self.X = X
        self.y = y
        self.y_wght = y_wght
        self.algo = algo
        self.params = params
        self.best_params = None
        self.main()

    def get_params_trial(self, trial):
        """
        Provides the set of hyper parameters values tested by Optuna's algorithm. 
        """
        def get_params_tr(key, value):
            def get_params_fs_perc(trial):
                d_suggest = \
                    {
                    'percentile': trial.suggest_float,
                    }
                return {x: d_suggest[x](x, **params_tr[x]) for x in params_tr}

            def get_params_fs_pca(trial):
                d_suggest = \
                    {
                    'n_components': trial.suggest_float,
                    }
                return {x: d_suggest[x](x, **params_tr[x]) for x in params_tr}

            def get_params_xgb(trial):
                d_suggest = \
                    {
                    'n_estimators': trial.suggest_int,
                    'max_depth': trial.suggest_int,
                    'min_child_weight': trial.suggest_int,
                    'learning_rate': trial.suggest_float,
                    'reg_lambda': trial.suggest_float,  # L2 penalty is a float
                    'reg_alpha': trial.suggest_float,   # L1 penalty is a float
                    'subsample': trial.suggest_float,
                    'colsample_bytree': trial.suggest_float,
                    'colsample_bylevel': trial.suggest_float,
                    'colsample_bynode': trial.suggest_float,
                    'gamma': trial.suggest_int,
                    'grow_policy': trial.suggest_categorical,
                    }
                return {x: d_suggest[x](x, **params_tr[x]) for x in params_tr}

            def get_params_lgb(trial):
                d_suggest = \
                    {
                    'max_depth': trial.suggest_int,
                    'num_leaves': trial.suggest_int,
                    'learning_rate': trial.suggest_float,
                    'colsample_bytree': trial.suggest_float,
                    'subsample': trial.suggest_float,
                    }
                return {x: d_suggest[x](x, **params_tr[x]) for x in params_tr}

            def get_params_rf(trial):
                d_suggest = \
                    {
                    'max_leaf_nodes': trial.suggest_int,
                    'n_estimators': trial.suggest_int,
                    'max_depth': trial.suggest_int,
                    'min_samples_split': trial.suggest_int,
                    'min_samples_leaf': trial.suggest_int,
                    'max_samples': trial.suggest_float,
                    'min_weight_fraction_leaf': trial.suggest_float,
                    'max_features': trial.suggest_float,
                    }
                return {x: d_suggest[x](x, **params_tr[x]) for x in params_tr}

            def get_params_svm(trial):
                d_suggest = \
                    {
                    'C': trial.suggest_float,
                    'kernel': trial.suggest_categorical,
                    'degree': trial.suggest_int,
                    'gamma': trial.suggest_float,
                    'shrinking': trial.suggest_categorical,
                    'coef0': trial.suggest_float,
                    }
                return {x: d_suggest[x](x, **params_tr[x]) for x in params_tr}

            def get_params_knn(trial):
                d_suggest = \
                    {
                    'n_neighbors': trial.suggest_int,
                    'weights': trial.suggest_categorical,
                    }
                return {x: d_suggest[x](x, **params_tr[x]) for x in params_tr}

            def get_params_lr(trial):
                d_suggest = \
                    {
                    'penalty': trial.suggest_categorical,
                    'C': trial.suggest_float,
                    }
                return {x: d_suggest[x](x, **params_tr[x]) for x in params_tr}

            if self.params_opt[key] is not None:
                params_tr = self.params_opt[key]['trials_range']
                d_params_tr = {'perc': get_params_fs_perc,
                               'pca': get_params_fs_pca,
                               'xgb': get_params_xgb,
                               'lgb': get_params_lgb,
                               'rf': get_params_rf,
                               'svm': get_params_svm,
                               'knn': get_params_knn,
                               'lr': get_params_lr,}
                res = d_params_tr[d_k[key]](trial)
                res = self.params_opt[key]['fix_params'] | res

                d_fs_scr = {'f_classif': f_classif}
                if key == 'features_selection' and value == 'perc':
                    res['score_func'] = d_fs_scr[res['score_func']]
                return res

        params_trial = {}

        d_k = {'features_selection': self.algo['features_selection'],
               'classifier': self.algo['classifier']}

        for k, v in self.algo.items():
            if v is not None:
                if k in ['features_selection', 'classifier']:
                    params_trial[k] = {'name': v,
                                       'params': get_params_tr(k, v)}
                else:
                    params_trial[k] = {'name': v,
                                       'params': {}}
        return params_trial

    def get_perf(self, pipe,
                       X_training,
                       y_training,
                       X_validation,
                       y_validation,
                       y_wght_training,
                       y_wght_validation):
        d_mtx = \
        {
        'mcc': matthews_corrcoef,
        'f1': f1_score,
        'log_loss': log_loss,
        'precision': precision_score,
        }

        pipe.fit(X_training.values,
                 y_training.squeeze(),
                 **get_params_fitting(self.algo['classifier'], y_wght_training))

        y_pred = pipe.predict(X_validation.values)
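        metric = self.params['evaluation_metric']
        # Only the averaged metrics accept `average`; passing it to
        # matthews_corrcoef or log_loss would raise a TypeError.
        kwargs = {'sample_weight': y_wght_validation}
        if metric in ('f1', 'precision'):
            kwargs['average'] = 'weighted'
        return d_mtx[metric](y_validation, y_pred, **kwargs)
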
    def get_params_opt(self):
        params_opt = {}
        for k, v in self.algo.items():
            if v is not None and k in ['features_selection', 'classifier']:
                params_opt[k] = self.params[k][v]
        return params_opt

    def main(self):
        def objective(trial):
            params_trial = self.get_params_trial(trial)
            self.params_trial_record[trial.number] = params_trial
            pipe = build_pipeline(params_trial)
            l_perf = []
            split = self.cv.split(self.X, groups=self.X.index.values)
            for i, (training_idx, validation_idx) in enumerate(split):
                X_training = self.X.iloc[training_idx]
                y_training = self.y.iloc[training_idx]
                X_validation = self.X.iloc[validation_idx]
                y_validation = self.y.iloc[validation_idx]
                y_wght_training = self.y_wght.iloc[training_idx]
                y_wght_validation = self.y_wght.iloc[validation_idx]

                validation_start_date = max(X_training.index).strftime('%Y-%m-%d')
                X_validation = X_validation.sort_index().loc[validation_start_date:]
                y_validation = y_validation.sort_index().loc[validation_start_date:]
                y_wght_validation = y_wght_validation.sort_index().loc[validation_start_date:]

                perf = self.get_perf(pipe,
                                     X_training,
                                     y_training,
                                     X_validation,
                                     y_validation,
                                     y_wght_training,
                                     y_wght_validation)
                l_perf.append(perf)
            return round(np.median(l_perf), 4)

        self.cv = PurgedGroupTimeSeriesSplit(**self.params['split'])
        self.params_opt = self.get_params_opt()

        sampler = TPESampler(seed=self.params['seed'])
        study = optuna.create_study(direction='maximize', sampler=sampler)
        self.params_trial_record = {}
        study.optimize(objective, **self.params_opt['classifier']['optimisation'])
        self.best_params = self.params_trial_record[study.best_trial.number]
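
To make the scoring in `get_perf` concrete, the sketch below compares weighted precision with and without balanced sample weights on toy labels (an assumption for illustration). `zero_division=0` is added only to silence the warning for the class that is never predicted.

```python
import numpy as np
from sklearn.metrics import precision_score
from sklearn.utils.class_weight import compute_sample_weight

# Toy imbalanced labels (an assumption for illustration): class 0 dominates.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 2])  # the single class-1 sample is missed

# Rare classes receive larger weights, as in get_sample_weight.
w = compute_sample_weight('balanced', y_true)

plain = precision_score(y_true, y_pred, average='weighted', zero_division=0)
balanced = precision_score(y_true, y_pred, average='weighted',
                           sample_weight=w, zero_division=0)
```

Here the unweighted score is about 0.768 while the balanced score is 0.5: once every class carries the same total weight, missing the only class-1 sample costs as much as missing all of class 0, which is the behaviour wanted when peaks and troughs are rare.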


In [10]:
from sklearn.metrics import classification_report, confusion_matrix
import plotext as plt
import pandas as pd

class EvalClassif():
    """
    Evaluate the classifier on the test set by plotting a lift chart,
    a confusion matrix, and the main error statistics.
    """
    def __init__(self, y_true, y_pred, y_wght, target_names=None):
        self.y_true = y_true
        self.y_pred = y_pred
        self.y_wght = y_wght
        self.target_names = target_names

    def plot_lift_chart(self):
        df = pd.concat([self.y_true.astype('int'), self.y_pred],
                       axis=1)
        df.columns = ['true', 'pred']
        df = df.sort_values('pred')
        df['model'] = df['true'].cumsum()
        ratio = max(df['model']) / len(df.index)
        df['random'] = [x * ratio for x in range(1, len(df.index) + 1)]
        df['oracle'] = df['true'].sort_values(ascending=False).cumsum().to_list()
        df = df[['model', 'random', 'oracle']]
        df = df.reset_index(drop=True)
        plt.plot(df['model'])
        plt.plot(df['random'])
        plt.plot(df['oracle'])
        plt.plot_size(40, 10)
        plt.show()
        plt.clear_data()
        plt.clear_figure()

    def plot_conf_matrix(self):
        plt.confusion_matrix(self.y_true.squeeze(), self.y_pred, labels = self.target_names, color='dark')
        plt.plot_size(60, 10)
        plt.show()
        plt.clear_figure()

    def main(self):
        print(classification_report(self.y_true,
                                    self.y_pred,
                                    target_names=self.target_names))
        self.plot_conf_matrix()
        #self.plot_lift_chart()

    def get_pt_score(self):
        conf = confusion_matrix(self.y_true, self.y_pred)
        return conf[0][0] + conf[2][2]**2/conf[1][2]


In [11]:
class Classifier():
    """
    Gathers the main methods to optimise the classifier's hyperparameters, fit the model, and evaluate it.
    """
    def __init__(self, algo_id, params):
        self.algo_id = algo_id
        self.params = params

    def optimise_hyperparams(self, X, y, y_wght, algo_d):
        y_wght = y_wght.drop(columns='asset')
        return ClassifierHPO(X, y, y_wght, algo_d, self.params).best_params

    def train(self, X, y):
        y_wght = get_sample_weight(y)
        algo_pipe = self.params['algos'][self.algo_id]
        self.mdl_params = self.optimise_hyperparams(X, y, y_wght, algo_pipe)

    def fit(self, X, y):
        y_wght = get_sample_weight(y)
        pipe = build_pipeline(self.mdl_params)
        pipe.fit(X.values,
                 y.squeeze(),
                 **get_params_fitting(self.mdl_params['classifier']['name'], y_wght))
        self.mdl = pipe
        return pipe

    def predict(self, X):
        return pd.Series(self.mdl.predict(X.values), index=X.index, name='y_pred')

    def evaluate(self, y_true, y_pred, target_names):
        y_wght = get_sample_weight(y_true)
        EvalClassif(y_true, y_pred, y_wght, target_names).main()

In [None]:
import pickle
import os
import pandas as pd

class Storage():
    """
    Gathers the main methods to store the model's parameters, the fitted model, and the results.
    """
    @staticmethod
    def get_storage_dir(dir):
        # Build <cwd>/storage/[<dir>/] and create it if it does not exist yet.
        storage_dir = os.getcwd() + '/storage/'
        if dir is not None:
            storage_dir += f'{dir}/'
        os.makedirs(storage_dir, exist_ok=True)
        return storage_dir

    @staticmethod
    def to_pickle(obj, name, dir=None):
        storage_dir = Storage.get_storage_dir(dir)
        with open(f'{storage_dir}{name}.pickle', 'wb') as handle:
            pickle.dump(obj, handle)


    @staticmethod
    def from_pickle(name, dir=None):
        storage_dir = Storage.get_storage_dir(dir)
        with open(f'{storage_dir}{name}.pickle', 'rb') as handle:
            return pickle.load(handle)

    @staticmethod
    def to_csv(df, name, dir=None):
        storage_dir = Storage.get_storage_dir(dir)
        # storage_dir already ends with '/', so no extra separator is needed.
        df.to_csv(f'{storage_dir}{name}.csv')

    @staticmethod
    def from_csv(name, dir=None):
        storage_dir = Storage.get_storage_dir(dir)
        return pd.read_csv(f'{storage_dir}{name}.csv',
                           index_col='date',
                           parse_dates=['date'])


In [13]:

cls_params = params['training']['classifiers']
is_on = params['is_on']
asset_class = 'equity'
for algo_id in cls_params['algos']:
    print('*'*33, f'Classification Algo: {algo_id}', '*'*33)
    dir = f'{asset_class}/classifiers/{algo_id}'
    cls = Classifier(algo_id, cls_params)
    if is_on['classifiers']['train']:
        print('~'*10, 'Training', '~'*10)
        cls.train(X_train.drop(columns='asset'),
                  y_train.drop(columns='asset'))
        Storage.to_pickle(cls.mdl_params, f'{algo_id}_params', dir)
    if is_on['classifiers']['fit']:
        print('~'*10, 'Fitting', '~'*10)
        cls.mdl_params = Storage.from_pickle(f'{algo_id}_params', dir)
        mdl = cls.fit(X_train.drop(columns='asset'),
                      y_train.drop(columns='asset'))
        Storage.to_pickle(mdl, f'{algo_id}_mdl', dir)
    if is_on['classifiers']['eval']:
        cls.mdl = Storage.from_pickle(f'{algo_id}_mdl', dir)
        for asset in params['assets'][asset_class]:
            print('.'*5, f'Evaluation on {asset}', '.'*5)
            target_names = ['nts', 'peak', 'trough']
            X_test_asset = X_test.loc[X_test.asset==asset]
            X_test_asset = X_test_asset.drop(columns='asset')
            y_true = y_test.loc[y_test.asset==asset]
            y_true = y_true.drop(columns='asset')
            y_pred = cls.predict(X_test_asset)
            cls.evaluate(y_true, y_pred, target_names)

[I 2023-06-08 16:27:14,968] A new study created in memory with name: no-name-26035659-6d9d-4a3c-a38e-9b0cfd4d4e25


********************************* Classification Algo: xgb *********************************
~~~~~~~~~~ Training ~~~~~~~~~~


[I 2023-06-08 16:27:16,806] Trial 0 finished with value: 0.7956 and parameters: {'n_estimators': 150, 'max_depth': 6, 'min_child_weight': 9, 'learning_rate': 0.21000000000000002, 'gamma': 8, 'colsample_bytree': 0.4, 'subsample': 0.4, 'colsample_bylevel': 0.9000000000000001, 'colsample_bynode': 0.7}. Best is trial 0 with value: 0.7956.
[I 2023-06-08 16:27:18,795] Trial 1 finished with value: 0.788 and parameters: {'n_estimators': 250, 'max_depth': 3, 'min_child_weight': 10, 'learning_rate': 0.26, 'gamma': 9, 'colsample_bytree': 0.4, 'subsample': 0.5, 'colsample_bylevel': 0.4, 'colsample_bynode': 0.6000000000000001}. Best is trial 0 with value: 0.7956.
[I 2023-06-08 16:27:20,881] Trial 2 finished with value: 0.7697 and parameters: {'n_estimators': 200, 'max_depth': 4, 'min_child_weight': 9, 'learning_rate': 0.01, 'gamma': 9, 'colsample_bytree': 0.5, 'subsample': 0.7000000000000001, 'colsample_bylevel': 0.9000000000000001, 'colsample_bynode': 0.300000000

[I 2023-06-08 16:28:22,142] Trial 24 finished with value: 0.7979 and parameters: {'n_estimators': 200, 'max_depth': 4, 'min_child_weight': 7, 'learning_rate': 0.26, 'gamma': 8, 'colsample_bytree': 0.4, 'subsample': 0.6000000000000001, 'colsample_bylevel': 0.5, 'colsample_bynode': 0.9000000000000001}. Best is trial 3 with value: 0.8056.


~~~~~~~~~~ Fitting ~~~~~~~~~~
..... Evaluation on SPX Index .....
              precision    recall  f1-score   support

         nts       0.97      0.88      0.92       975
        peak       0.02      0.10      0.04        20
      trough       0.22      0.45      0.30        20

    accuracy                           0.86      1015
   macro avg       0.40      0.48      0.42      1015
weighted avg       0.93      0.86      0.89      1015

[Confusion Matrix: terminal plot, truncated in the source. Recoverable cells from the true nts row: 858 (84.53%) predicted as nts, 85 (8.37%) predicted as peak.]

### How to read the results:

- Number of true negatives: number of times there was no peak and the algo correctly predicted none.
- Number of false positives: number of times the algo predicted a peak that did not occur.
- Number of false negatives: number of times the algo missed a peak.
- Number of true positives: number of times the algo correctly predicted a peak.

The objective is to have the highest number of true positives and the lowest number of false positives and false negatives.
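
With three classes (nts, peak, trough), the four counts above are obtained by treating one class as "positive" and the other two as "negative". The sketch below shows one way to extract them from a multi-class confusion matrix whose rows are true classes and columns are predicted classes, which is the layout returned by `sklearn.metrics.confusion_matrix`. The helper name `one_vs_rest_counts` and the toy matrix are made up for illustration; they are not part of the notebook's code or results.

```python
import numpy as np

def one_vs_rest_counts(conf, positive):
    """Return (tn, fp, fn, tp) for class index `positive`.

    `conf` is a square confusion matrix: rows = true class, columns = predicted class.
    """
    conf = np.asarray(conf)
    tp = conf[positive, positive]
    fn = conf[positive, :].sum() - tp   # true positives-class samples predicted as another class
    fp = conf[:, positive].sum() - tp   # other classes predicted as the positive class
    tn = conf.sum() - tp - fn - fp      # everything that involves the positive class nowhere
    return int(tn), int(fp), int(fn), int(tp)

# Toy confusion matrix (hypothetical numbers), class order: nts, peak, trough
toy_conf = [[2, 1, 1],
            [1, 1, 0],
            [0, 0, 2]]

tn, fp, fn, tp = one_vs_rest_counts(toy_conf, positive=1)  # 1 = peak
print(f'peak: tn={tn}, fp={fp}, fn={fn}, tp={tp}')
```

Calling the helper with `positive=2` gives the same four counts for troughs.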

If you apply the training, fitting, and evaluation to the SPX Index only, you observe that peaks are much harder to predict than troughs: the error is much higher, and the model misses most of the true peaks. One way to mitigate this is to include the other indices in the fitting. When doing so, the model picks up most of the peaks, which is good, but the signal becomes much more sensitive and the number of false positives goes through the roof. One way to mitigate that would be to stack several classifiers together as weak learners and use their outputs as inputs for a strong classifier, which would learn from the outputs of the weak ones.
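
The stacking idea can be sketched with scikit-learn's `StackingClassifier`. This is only an illustration under assumptions: the data is synthetic (an imbalanced three-class problem standing in for the features and `resp`), and the choice of weak learners and of a logistic regression as the strong classifier is arbitrary, not the setup used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced 3-class data (assumption: stands in for the real features/resp).
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, weights=[0.8, 0.1, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Weak learners: deliberately shallow models (hypothetical choice).
weak_learners = [
    ('tree', DecisionTreeClassifier(max_depth=3, random_state=0)),
    ('forest', RandomForestClassifier(n_estimators=50, max_depth=3,
                                      random_state=0)),
]

# The final estimator is trained on the out-of-fold class probabilities
# produced by the weak learners, not on the raw features.
stack = StackingClassifier(
    estimators=weak_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method='predict_proba',
    cv=5,
)
stack.fit(X_train, y_train)
y_pred = stack.predict(X_test)
```

Note that the integer `cv` used here builds folds without regard to time ordering, so applying this to the actual price series would need a splitting scheme that avoids look-ahead.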