<div style="display:flex; border-bottom:4px solid gray; background-color: white; padding: 10px;">
    <div>
        <h2 style="margin:10px 0px 0px 0px;">Master Thesis - Spring 2023</h2>
        <h4 style="margin:10px 10px 10px 0px;"><i>Artificial Intelligence - Data Science</i></h4>
    </div>
    <img src="https://raw.githubusercontent.com/JABE22/Image/main/Logos/logo_ural-federal-university.png" style="width:300px; height:150px; margin-right: 25px;" align='right' />
</div>
<h4 style="margin-top:10px; text-align:right; font-size: 20px; margin-right: 25px;"> Jarno Matarmaa - 03.2023 - Version draft</h4>

# Sport Activity Classification using Standard CML Models and Time Series Analysis
### Part (2/3), Time Series Classification

---

**TASKS**
- [25.1.2023] Check whether optimal intervals can be extracted from single sequences using threshold values (e.g., for speed).

**CHANGE LOG**

**QUESTIONS**
- [25.2.2023] How to include signal length analysis without a massive workload (e.g., many runs of time-consuming algorithms)?

---

<a id="0"></a> <br>
## I - Table of Contents

#### [1 - Data import and preview](#1)
* [1A - Libraries](#11)
* [1B - Data download](#12)

#### [2 - Data setup](#2)
* [2A - Train-Test data splitting *(stratified by y)*](#21)
* [2B - Data Standardization *(for visualization)*](#22)

#### [3 - Univariate Time Series Classification (UTSC)](#3)
* [3A - Libraries and functions](#31)
* [3B - Data setup](#32)
* [3C - Univariate TSC model classification](#33)

#### [4 - Multivariate Time Series Classification (MTSC)](#4)
* [4A - Libraries and functions](#41)
* [4B - Data setup](#42)
* [4C - Multivariate TSC model classification](#43)
* [4D - Ensemble classification](#44)

#### [5 - Test Section](#5)
* [5A - Pipeline](#51)

---

<a id="1"></a> <br>
## [▲](#0) 1 - Data import and preview

<a id="11"></a> <br>
### [▲](#1) 1A - Libraries

In [None]:
# System tools
import os
import sys
# File structure
from directory_structure import Tree
# Data manipulation tools
import pandas as pd
import numpy as np
import math
# Datetime
import datetime
# Data visualization tools
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
%matplotlib inline
# Seaborn setup
import seaborn as sns

In [None]:
plt.style.use('./styles/plotstyles.mplstyle')
cmap = sns.color_palette("muted", 10)
THEMA_COLOR = cmap[9]
#plt.style.use('default')
#sns.set_style('white', { 'axes.spines.right': False, 'axes.spines.top': False})

In [None]:
run_start = datetime.datetime.now()

<a id="12"></a> <br>
### [▲](#1) 1B - Data download

File path setup

In [None]:
DATA_PATH = "C:/Users/jarno/OneDrive - УрФУ/STUDIES/MASTER/DATA/"
DATA_PATH_ORG = "C:/Users/jarno/OneDrive - УрФУ/STUDIES/DesignWorkshop/DesignWorkshopProject/DATA/CSVDATA/SET1/"

In [None]:
results_filepath = DATA_PATH + 'results/'
preds_filepath = DATA_PATH + 'predictions/'

In [None]:
path = Tree(DATA_PATH, absolute=False)
print(path)

We use the `pickle` API to load the segmented sequences, which are stored as pickled NumPy data (.pkl)

In [None]:
import pickle as cPickle

In [None]:
# Open the file in a context manager so that the handle is closed after loading
with open("DATA/data_arrays/seq_segmented.pkl", "rb") as f:
    SEQ_SEGMENTED = cPickle.load(f)
INDEX_DATA = pd.read_csv("DATA/data_arrays/index_data.csv")
SEQ_SEGMENTED_LABELS = pd.read_csv("DATA/data_arrays/seq_segmented_labels.csv")

<a id="2"></a> <br>
## [▲](#0) 2 - Data setup

<a id="21"></a> <br>
### [▲](#2) 2A - Train-Test data splitting *(stratified by y)*

- Dataset generation from the sequences
- Train and Test splits
- Functions for data variable initialization
- Global variable `SEQ_LEN` will be initialized

In [None]:
# Data manipulation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sktime.datatypes._panel._convert import from_2d_array_to_nested
from sktime.transformations.panel.reduce import Tabularizer

In [None]:
# These variables need to be defined here to avoid a "not defined" warning in the data setup functions
DATA, LABELS = (None, None)

**Data setup functions**

Function to create dataset

In [None]:
'''
Creates interleaved univariate signals from the 3-dimensional data given as parameter (sequences).

The function can cut each flattened signal to a smaller part using the seq_start and seq_end parameters
    - If the sequences already have the desired size, use seq_start=0 and seq_end=sequences.shape[1] * sequences.shape[2], 
      for example, if sequences.shape = (1160, 69, 3) => 69 * 3 = 207
    - Default values are set to [0, 100]

The function generates equal-length segments as output under one precondition:
    - The interval [seq_start, seq_end] must fit inside the shortest flattened sequence, i.e. seq_end must not exceed
      the length of the shortest sequence multiplied by sequences.shape[2]
    - For example, if sequences.shape = (1160, [20-69], 3), seq_end can be 20 * 3 = 60 at the maximum.

Example:

SEGS.shape = (1160, 69, 3)
LBLS = DataFrame with 1160 rows and a label column

Proper function call example to use all the input data:

x_data, y_data = create_dataset(SEGS, LBLS, 0, 207, True, True)

x_data.shape = (1160, 207)  (has equal length segments 207)
y_data.shape = (1160,)

If you want to cut the signals after interleaving, select the desired interval using seq_start and seq_end

'''

def create_dataset(sequences, targets, seq_start=0, seq_end=100, std=False, info=True):

    if info: print("\nSequence/Targets length validity check: ", len(sequences), len(targets))

    target = targets.label.astype('category').cat.codes

    seq_len = seq_end - seq_start

    x_data = np.zeros((len(sequences), seq_len))
    y_data = np.zeros(len(sequences))

    if info: print(x_data.shape, y_data.shape)

    for i, s in enumerate(sequences):
        # s is a single segment of shape (time steps, channels), e.g. (69, 3)
        if info: print(i, s.shape)
        # Row-major flattening interleaves the channels: (69, 3) => (207,)
        signal = np.asarray(s).reshape(-1)
        # Select the sub-segment, e.g. signal[0:100] => (100,)
        signal = signal[seq_start:seq_end]
        x_data[i] = signal
        y_data[i] = target[i]

    if std:
        if info: print("Standardization")
        x_data = StandardScaler().fit_transform(x_data)

    print('X:', x_data.shape, ' y:', y_data.shape)

    return x_data, y_data
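
As a small illustration of what the flattening step does, the sketch below uses a hypothetical toy segment of 4 time steps and 3 channels (the values are made up; the channel order heart rate, speed, altitude is assumed from the multivariate data setup later in this notebook). It shows that the flattened signal is interleaved, so a cut with `seq_start` and `seq_end` shortens all channels at once.

```python
import numpy as np

# Hypothetical toy segment: 4 time steps x 3 channels (heart rate, speed, altitude)
segment = np.array([
    [100, 5.0, 200],
    [101, 5.5, 201],
    [102, 6.0, 202],
    [103, 6.5, 203],
])

# Row-major flattening interleaves the channels: hr0, spd0, alt0, hr1, spd1, alt1, ...
signal = segment.reshape(-1)

# Every third value belongs to the same channel, so a stepped slice recovers it
hr = signal[0::3]
spd = signal[1::3]
alt = signal[2::3]

# A cut such as signal[0:6] keeps only the first 2 time steps of each channel
first_two_steps = signal[0:6].reshape(-1, 3)
```

Because of the interleaving, an interval of length `n` in the flattened signal covers roughly `n / 3` time steps of the original segment.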

Nested data variable reset function

In [None]:
# Run this if data variable reset needed
def initdata_xynested(seq_start, seq_end):
    global x_data, y_data
    global x_train, x_test, y_train, y_test
    global X_train_nest, X_test_nest
    # The whole data
    x_data, y_data = create_dataset(DATA, LABELS, seq_start=seq_start, seq_end=seq_end, std=False, info=False)
    # Train-Test data splits
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=24, stratify=y_data, shuffle=True)
    print('Train:', x_train.shape, y_train.shape, 'Test:', x_test.shape, y_test.shape)
    # Multivariate
    X_train_nest = from_2d_array_to_nested(x_train)
    X_test_nest  = from_2d_array_to_nested(x_test)
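
To see what `stratify=y_data` does in the split above, here is a minimal sketch with hypothetical, imbalanced labels (60/30/10 instances) and dummy features. The split parameters are the same as in the function; the data are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 / 30 / 10 instances of classes 0, 1, 2
y = np.repeat([0, 1, 2], [60, 30, 10])
X = np.arange(len(y) * 4, dtype=float).reshape(len(y), 4)  # dummy features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=24, stratify=y, shuffle=True
)

# With stratification the class shares are the same in both splits
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share, test_share)
```

Without `stratify`, the smallest class could end up with very few (or no) test instances, which would make the per-class scores unreliable.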

Nested data variable reset function (Standard)

In [None]:
# Run this if standard data variable reset needed
def initdata_xynested_std(seq_start, seq_end):
    global x_data_std, y_data
    global x_train_std, x_test_std, y_train, y_test
    global X_train_nest_std, X_test_nest_std
    # The whole data (standard)
    x_data_std, y_data = create_dataset(DATA, LABELS, seq_start=seq_start, seq_end=seq_end, std=True, info=False)
    # Train-Test data splits (standard)
    x_train_std, x_test_std, y_train, y_test = train_test_split(x_data_std, y_data, test_size=0.2, random_state=24, stratify=y_data, shuffle=True)
    print('Train:', x_train_std.shape, y_train.shape, 'Test:', x_test_std.shape, y_test.shape)
    # Multivariate
    X_train_nest_std = from_2d_array_to_nested(x_train_std)
    X_test_nest_std  = from_2d_array_to_nested(x_test_std)

Tabular data reset function

- This function tabularizes the global nested variables `X_train_nest` and `X_test_nest` (and, with `std=True`, also `X_train_nest_std` and `X_test_nest_std`). It should therefore only be used together with the nested data variable reset functions
- The global variables `X_train_tab` and `X_test_tab` (and, with `std=True`, `X_train_tab_std` and `X_test_tab_std`) will be initialized

In [None]:
def initdata_xytabular(std=False):
    global X_train_tab, X_test_tab
    tabu = Tabularizer()
    X_train_tab = tabu.fit_transform(X_train_nest)
    X_test_tab = tabu.transform(X_test_nest)  # reuse the transformer fitted on the training data

    if std:
        global X_train_tab_std, X_test_tab_std
        tabu = Tabularizer()
        X_train_tab_std = tabu.fit_transform(X_train_nest_std)
        X_test_tab_std = tabu.transform(X_test_nest_std)

Data reset function

`init_data(seq_start, seq_end, nest=True, tab=False, std=False)`

In [None]:
def init_data(seq_start, seq_end, nest=True, tab=False, std=False):
    # Initializes nested data variables for sktime classification models
    if nest:
        initdata_xynested(seq_start, seq_end)
        # Standard
        if std:
            initdata_xynested_std(seq_start, seq_end)

    # Initializes tabular data for sklearn classifiers (requires the nested variables above)
    if tab:
        # A single call covers both cases: the plain tables are always created,
        # the standardized ones only when std=True
        initdata_xytabular(std=std)

**Initialize data**
- Initializes global data variables according to the given parameters
- Creates the data splits (and feedback prints according to the *create_dataset()* function)

***Important note!***

In the variable `SEQ_SEGMENTED`, the data are already split into equal-length segments taken from the interval `[0,1000]` of the original sequence: 5 segments of length 200 (5x200). Because the `create_dataset()` function flattens the three-dimensional data (3 channels per time step) into one signal, the `init_data()` function call can use the range `[0,600]` (3x200=600).

**Select data**

Note: Only selected data is processed. Selection is global.

In [None]:
# Setup for full length sequences
#DATA, LABELS = (SEQ_FILTERED, INDEX_DATA)
#SEQ_START = 100
#SEQ_END = int(seq_analysis_df['Filtered']['min'])

# Setup for segmented equal length sequences
DATA, LABELS = (SEQ_SEGMENTED, SEQ_SEGMENTED_LABELS)
SEQ_START = 0
SEQ_END = 69*3 # E.g., if SEG_LEN=69, we use 3*69=207 "multivariate" signal

In [None]:
SEQ_START, SEQ_END

In [None]:
init_data(SEQ_START, SEQ_END, nest=True, tab=True, std=True)

In [None]:
X_train_nest_std.head(5)

**Random time series plots (for analysis)**

Before visualization, we map the class codes 0-2 to sport names

In [None]:
SPORT_CODES = {0: "Biking", 1:"Running", 2: "Other"}
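
The class codes come from `targets.label.astype('category').cat.codes` in `create_dataset()`, so the code-to-name mapping depends on the order of the categories. The sketch below uses a hypothetical label column with sport names as strings (the real values in `SEQ_SEGMENTED_LABELS.label` may differ) and derives the mapping from the categories instead of typing it by hand.

```python
import pandas as pd

# Hypothetical label column; the real values live in SEQ_SEGMENTED_LABELS.label
labels = pd.Series(["Running", "Biking", "Other", "Biking"], name="label")

categorical = labels.astype("category")
codes = categorical.cat.codes

# Codes index into the sorted category list, so this mapping always matches the codes
code_to_name = dict(enumerate(categorical.cat.categories))
print(code_to_name)
```

For string labels pandas sorts the categories alphabetically, so it may be worth comparing a mapping derived this way with the manually defined `SPORT_CODES`.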

In [None]:
fig, ax = plt.subplots(2,1,figsize=(10,6))
sns.lineplot(x_train[16], ax=ax[0])
ax[0].set_xlabel('Signal length (s)')
ax[0].set_ylabel('Value (raw)')  # x_train is not standardized
ax[0].set_title("Activity type = %s" % SPORT_CODES[y_train[16]])
ax[0].grid()

sns.lineplot(x_test[7], ax=ax[1])
ax[1].set_xlabel('Signal length (s)')
ax[1].set_ylabel('Value (raw)')  # x_test is not standardized
ax[1].set_title("Activity type = %s" % SPORT_CODES[y_test[7]])
ax[1].grid()

plt.tight_layout()

<a id="22"></a> <br>
### [▲](#2) 2B - Data Standardization *(for visualization)*

- Data can also be standardized with the *std=True* parameter of the **create_dataset()** function
- This section requires that **x_train** and **x_test** are two-dimensional arrays

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_train_std = scaler.fit_transform(x_train)
# Reuse the training statistics; fitting again on the test data would leak test information
x_test_std = scaler.transform(x_test)
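
A note on what `StandardScaler` does with two-dimensional signal data: it works column-wise, so each time point is scaled across all signals, which is different from z-normalising each signal on its own. The following sketch, with hypothetical random data, shows both variants and how the test split can reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: 50 training and 10 test signals with 8 time points each
train = rng.normal(loc=10.0, scale=2.0, size=(50, 8))
test = rng.normal(loc=10.0, scale=2.0, size=(10, 8))

scaler = StandardScaler().fit(train)   # statistics come from the training split only
train_std = scaler.transform(train)
test_std = scaler.transform(test)      # the test split reuses the training statistics

# Column-wise scaling: each time point has mean 0 and std 1 across the training signals
col_means = train_std.mean(axis=0)
col_stds = train_std.std(axis=0)

# Per-signal (row-wise) z-normalisation is a different operation
row_std = (train - train.mean(axis=1, keepdims=True)) / train.std(axis=1, keepdims=True)
```

Which variant suits the classifiers better is a modelling choice; the sketch only makes the difference visible.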

In [None]:
fig, ax = plt.subplots(2,1,figsize=(10,6))
sns.lineplot(x_train_std[16], ax=ax[0])
ax[0].set_xlabel('Signal length (s)')
ax[0].set_ylabel('Value (std)')
ax[0].set_title("Activity type = %s" % SPORT_CODES[y_train[16]])
ax[0].grid()

sns.lineplot(x_test_std[7], ax=ax[1])
ax[1].set_xlabel('Signal length (s)')
ax[1].set_ylabel('Value (std)')
ax[1].set_title("Activity type = %s" % SPORT_CODES[y_test[7]])
ax[1].grid()

plt.tight_layout()

Let's visualize the effect of standardization

In [None]:
fig, ax = plt.subplots(2,1,figsize=(10,6))
sns.lineplot(x_train[16], ax=ax[0])
ax[0].set_xlabel('Signal length (s)')
ax[0].set_ylabel('Value (raw)')  # upper plot shows the signal before standardization
ax[0].set_title("Activity type = %s" % SPORT_CODES[y_train[16]])
ax[0].grid()

sns.lineplot(x_train_std[16], ax=ax[1])
ax[1].set_xlabel('Signal length (s)')
ax[1].set_ylabel('Value (std)')
ax[1].set_title("Activity type = %s" % SPORT_CODES[y_train[16]])
ax[1].grid()

plt.tight_layout()

In [None]:
label1 = x_train_std[np.where(y_train==0)]
label2 = x_train_std[np.where(y_train==1)]
label3 = x_train_std[np.where(y_train==2)]

print(len(label1), len(label2), len(label3))

In [None]:
# Or without indexes
pd.DataFrame(y_train.astype(int)).value_counts()

**Random activities (for each category)**

In [None]:
#@title { run: "auto" }
rand_index = 32 #@param {type:"slider", min:0, max:42, step:1}

fig, ax = plt.subplots(3,1, figsize=(10,8), sharey=True)
fig.suptitle('Distinct plots for each category')
ax[0].plot(label1[rand_index], label='Biking')
ax[0].legend(loc='upper left', bbox_to_anchor=(0, 1))
ax[0].grid()
ax[1].plot(label2[rand_index], label='Running')
ax[1].legend(loc='upper left', bbox_to_anchor=(0, 1))
ax[1].grid()
ax[2].plot(label3[rand_index], label='Other')
ax[2].legend(loc='upper left', bbox_to_anchor=(0, 1))
ax[2].grid()
# Set common labels
plt.setp(ax, ylabel='Standard value')
plt.setp(ax, xlabel='Signal length (s)')
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(1,1,figsize=(12,3))
fig.suptitle('Combined plot', y=1.1)
ax.plot(label1[rand_index], label='Biking')
ax.plot(label2[rand_index], label='Running', c='grey', alpha=0.5)
ax.plot(label3[rand_index], label='Other', c='navy', alpha=0.5)
ax.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower left',
                      ncol=3, mode="expand", borderaxespad=0.)
plt.xlabel('Signal length (s)')
plt.ylabel('Value (std)')
plt.grid()
plt.show()

**Random activities (for same category)**

In [None]:
#@title { run: "auto" }
rand_index1 = 10 #@param {type:"slider", min:0, max:42, step:1}
rand_index2 = 13 #@param {type:"slider", min:0, max:42, step:1}
rand_index3 = 15 #@param {type:"slider", min:0, max:42, step:1}
label = label2 #@param ["label1", "label2", "label3"] {type: "raw"}


plt.figure(figsize=(12,3))
plt.suptitle("Random sequences of the same category: Running", y=1.1)
plt.plot(label[rand_index1], lw=1, label='Index ' + str(rand_index1))
plt.plot(label[rand_index2], lw=1, label='Index ' + str(rand_index2), c='grey', alpha=0.5)
plt.plot(label[rand_index3], lw=1, label='Index ' + str(rand_index3), c='navy', alpha=0.5)
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower left',
                      ncol=3, mode="expand", borderaxespad=0.)
plt.xlabel('Signal length (s)')
plt.ylabel('Value (std)')
plt.grid()
plt.show()

---

<a id="3"></a> <br>
## [▲](#0) 3 - Univariate Time Series Classification (UTSC)

SKTIME
- [X] Time Series Forest Classifier
- [X] Supervised Time Series Forest
- [X] Random Interval Spectral Ensemble (RISE)
- [X] Random Interval Classifier
- [X] Shapelet Transform Classifier
- [X] KNeighbors Time Series Classifier
- [X] Composable Time Series Forest Classifier

<a id="31"></a> <br>
### [▲](#3) 3A - Libraries and functions

In [None]:
# Sktime univariate classifiers
from sktime.classification.interval_based import TimeSeriesForestClassifier
from sktime.classification.interval_based import SupervisedTimeSeriesForest
from sktime.classification.interval_based import RandomIntervalSpectralEnsemble # RISE
from sktime.classification.feature_based import RandomIntervalClassifier        # Rotation Forest
from sktime.classification.compose import ComposableTimeSeriesForestClassifier
from sktime.classification.shapelet_based import ShapeletTransformClassifier
from sktime.classification.distance_based import KNeighborsTimeSeriesClassifier
from sktime.classification.hybrid import HIVECOTEV1
from sktime.classification.dictionary_based import WEASEL

# Sktime - Multivariate
from sktime.classification.dictionary_based import MUSE

# Metrics
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, mean_squared_error

# Time and Progress bar solution (tqdm)
from time import time
from tqdm import tqdm

# Extra tools
from matplotlib import gridspec

**Helper functions and variables**

In [None]:
def plot_confmatrix(ax, yval, ypred):
    cm = pd.DataFrame(confusion_matrix(yval, ypred))
    cm_norm = cm.apply(lambda x: x/x.sum(), axis = 1)
    sns.set(font_scale=1.1) # for label size
    sns.heatmap(cm_norm, annot=True, xticklabels=('Biking', 'Running','Other'), 
                                     yticklabels=('Biking', 'Running','Other'),
                                     fmt='.1%',
                                     cmap='Blues',
                                     ax=ax,
                                     annot_kws={"size": 16}) # font size
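
The row normalisation in `plot_confmatrix()` divides every row by the number of true instances of that class, so the diagonal shows the per-class recall. A minimal sketch with hypothetical labels, compared against the built-in `normalize="true"` option of scikit-learn:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted class codes
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 2, 0, 2, 2])

cm = confusion_matrix(y_true, y_pred)

# Manual row normalisation, as done in plot_confmatrix()
cm_manual = cm / cm.sum(axis=1, keepdims=True)

# Built-in equivalent: normalise over the true labels (rows)
cm_builtin = confusion_matrix(y_true, y_pred, normalize="true")
```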

In [None]:
# Classification report properties for matplotlib.table
col_colors = ['lightgray','lightgray','lightgray','skyblue','skyblue','skyblue']
row_colors = ['lightgray','lightgray','lightgray','lightgray']
cell_colors = [['white','white','white','lightgray','white','white'],
               ['white','white','white','lightgray','white','white'],
               ['white','white','white','lightgray','white','white'],
               ['white','white','white','lightgray','white','white']]

In [None]:
'''
Creates a list of column names according to the parameters
- @general: Prefix of the column name
- @num: Number of columns; the names are numbered from 1 to num
Output for function call create_col_names(general='Score_', num=3) 
=> ['Score_1', 'Score_2', 'Score_3']
'''
def create_col_names(general, num, info=False):
    col_names = [general + str(i) for i in range(1, num + 1)]
    
    if info: print("Columns created: ", col_names)
    
    return col_names

In [None]:
'''
Saves data to the latest file in a given directory. 
Inserts a row into a classification result table or a column into a prediction result table.
Returns the saved data (the data parameter passed in the function call).

Expected parameter values:
- type = 'results' | 'predictions' - defines which file (path and filename) the data is inserted into
- classifier = The name (abbreviation) of the classifier as it appears in the column name of the data given as parameter
- data = pandas DataFrame object. For type='results' assumed to be one row, and for type='predictions' one column.

NOTE: Function uses global file paths defined in *1B - Data download* -section
'''
def save_to_file(type:str, classifier:str, data:pd.DataFrame):
    
    if type == 'results':
        # os.listdir() returns names in arbitrary order; sort so that the newest time stamp is last
        results_csv_filename = sorted(os.listdir(results_filepath))[-1]
        results_temp = pd.read_csv(results_filepath + results_csv_filename)
        results_temp = pd.concat((results_temp, data), axis=0, ignore_index=True)
        results_temp.to_csv(results_filepath + results_csv_filename, index=False)
        print('Results saved into file: ' + results_csv_filename)
        
    elif type == 'predictions':
        preds_csv_filename = sorted(os.listdir(preds_filepath))[-1]
        preds_temp = pd.read_csv(preds_filepath + preds_csv_filename)
        preds_temp[classifier] = data[classifier]
        preds_temp.to_csv(preds_filepath + preds_csv_filename, index=False)
        print('Predictions saved into file: ' + preds_csv_filename)
    
    return data
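
Picking "the latest file" by name relies on two things: the `D%Y%m%d_T%H%M` time stamp used in the file names of this notebook sorts chronologically as text, and the directory listing is sorted before the last entry is taken. A small sketch in a temporary folder, with hypothetical file names:

```python
import os
import tempfile

# Hypothetical result files named with the D%Y%m%d_T%H%M time stamp
names = [
    "results_datasetup_b_D20230424_T1718.csv",
    "results_datasetup_b_D20230301_T0905.csv",
    "results_datasetup_b_D20230412_T2359.csv",
]

with tempfile.TemporaryDirectory() as folder:
    for name in names:
        open(os.path.join(folder, name), "w").close()
    # os.listdir gives no ordering guarantee, so sort before taking the last entry
    latest = sorted(os.listdir(folder))[-1]

print(latest)
```

This only works as long as all files in the folder share the same name prefix; with mixed prefixes the prefix would dominate the sort order.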

**Classification functions**

In [None]:
'''
This classification function takes models and data as parameters and prints out the results.
Note that the results are only printed when a classification is completed; nothing is stored.
Therefore, if the execution fails, all the data will be lost. Be careful!
'''
def classify_report(models, x_train, y_train, x_test, y_test):
    for name, sktime_clf in models.items():
        # Classify
        clf = sktime_clf 
        clf.fit(x_train, y_train)
        y_pred = clf.predict(x_test)
        print(name, accuracy_score(y_test, y_pred))
        # Create figure and grid for different subplots
        fig = plt.figure(figsize=(18,3))
        spec = gridspec.GridSpec(ncols=2, nrows=1, width_ratios=[1, 2], wspace=0.3)
        # Plot confusion matrix
        ax1 = fig.add_subplot(spec[0])
        plot_confmatrix(ax1, y_test, y_pred)
        ax1.set_xlabel('Predicted')
        ax1.set_ylabel('Actual')
        # Plot classification report
        report = pd.DataFrame(classification_report(y_test, y_pred, digits=3, output_dict=True))
        ax2 = fig.add_subplot(spec[1])
        font_size=16
        bbox=[0, 0, 1, 1]
        ax2.axis('off')
        mpl_table = ax2.table(cellText=np.round(report.values,4), 
                            rowLabels=report.index, bbox=bbox, 
                            colLabels=report.columns,
                            colColours=col_colors,
                            rowColours=row_colors,
                            cellColours=cell_colors,
                            edges='closed')
        mpl_table.auto_set_font_size(False)
        mpl_table.set_fontsize(font_size)
        plt.show()

In [None]:
'''
Classification function which creates result tables. However, it does not create or print out any kind of analysis, 
such as confusion matrices etc.
The data given as parameters must meet the classifier's requirements. 
The function can be used flexibly with different types of classifiers. 
Remember to use the correct classifier type name from the list
clf_type = sktime | sklearn | sklearn-tree
'''
def classify(classifiers, clf_type, X_train, X_test, y_train, y_test, results, preds, iters):
    # Pandas dataframe for results
    if results is None:
        results = pd.DataFrame(columns=['Classifier','Type','Train(t)','Test(t)'])
        preds = pd.DataFrame()
    
    # Create column names for scores according to the number of iterations
    score_col_names = create_col_names('Score_', iters)
    
    # Create progress bar with non-default styles
    progress_bar = tqdm(classifiers.items(), ncols=100, colour='#87ceeb', file=sys.stdout)
    
    for name, clf in progress_bar:
        progress_bar.set_description("Processing \033[1m ➥%s \033[0m" % str(clf)) # Includes bold text printing
        score_row = pd.DataFrame(data={'Classifier': name,'Type': clf_type,}, index=[0])
        
        # Insert score columns for all iterations (a single assignment creates every column)
        score_row[score_col_names] = 0
        
        best_score = 0

        for iter in range(1,iters+1):
            start = time() # Start timing the model
            clf.fit(X_train, y_train)
            train_time = time() - start # Stop train timer
            start = time()              # Start test timer
            # Predictions
            y_pred = clf.predict(X_test)
            # Model scores
            score = accuracy_score(y_test, y_pred)
            score_time = time()-start
            # Set values (note: time data will be overwritten in each iteration)
            score_row['Train(t)'] = train_time
            score_row['Test(t)'] = score_time
            score_row[score_col_names[iter-1]] = score
            # Among iterations, we could take only one mse, f1 and roc. We store them for the best accuracy.
            if score > best_score:
                # STORE PREDICTIONS to a data frame
                preds[name] = y_pred
                # More metrics: MSE, F1 and ROC-AUC scores
                score_row['mse'] = mean_squared_error(y_test, y_pred)
                score_row['f1'] = f1_score(y_test, y_pred, average='micro')
                if clf_type not in ['sklearn']:
                    y_pred_proba = clf.predict_proba(X_test)
                    score_row['roc-auc'] = roc_auc_score(y_test, y_pred_proba, average="weighted", multi_class="ovr")
                else:
                    score_row['roc-auc'] = np.nan
                best_score = score
            
        results = pd.concat((results, score_row), axis=0, ignore_index=True)
        
    print('Classification done for ' + clf_type + '\n')
    
    return results, preds
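
One detail of the metrics stored by `classify()`: for single-label multiclass predictions, the micro-averaged F1 score is equal to the accuracy, so the `f1` column repeats the best score. The sketch below, with hypothetical labels, shows this next to the macro average, which weights all classes equally and is offered here only as a possible alternative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted class codes
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 2, 0, 2, 2])

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")  # same value as accuracy here
f1_macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1

print(acc, f1_micro, f1_macro)
```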

<a id="32"></a> <br>
### [▲](#3) 3B - Data setup

Here we initialize global variables 
`x_data_std, y_data, x_train_std, x_test_std, y_train, y_test` using function `init_data(...)`

In [None]:
init_data(SEQ_START, SEQ_END, nest=True, tab=True, std=True)

In [None]:
# Create data to count category distribution in train and test data
y_train_labels = pd.DataFrame(y_train.astype(int), columns=['cat'])
y_train_labels['cat'] = y_train_labels['cat'].map(SPORT_CODES)
y_train_labels['split'] = 'train'
y_test_labels = pd.DataFrame(y_test.astype(int), columns=['cat'])
y_test_labels['cat'] = y_test_labels['cat'].map(SPORT_CODES)
y_test_labels['split'] = 'test'
cat_distribut_tbl = pd.concat([y_train_labels, y_test_labels])
# Count the instances per category and split; the bar plot needs numeric data
cat_counts = cat_distribut_tbl.groupby(['cat', 'split']).size().unstack('split')

# Plot dataframe
fig, ax = plt.subplots(1,1,figsize=(6,2.5))
cat_counts.plot(kind='barh',
                stacked=True,
                width=0.5,
                title='Category distribution',
                #color=[THEMA_COLOR, 'skyblue'],
                grid=True,
                ax=ax).legend(loc='lower right')
plt.xlabel('Number of instances')
plt.ylabel('Category')
plt.grid(axis='y')

In [None]:
cat_distribut_tbl

In [None]:
databar = pd.DataFrame(data=cat_distribut_tbl['cat'].value_counts())
databar

In [None]:
fig, ax = plt.subplots(1,1,figsize=(8,3))
databar.plot(kind='barh', color=THEMA_COLOR, zorder=3,
                  width=0.5,
                  title='Category distribution',
                  grid=True, legend=False, ax=ax)
ax.set_xlabel('Number of instances')
ax.set_ylabel('Category')

#### Signal visualization in test data

In [None]:
labels, counts = np.unique(y_test, return_counts=True)
print(labels, counts)

In [None]:
X_test_nest_std.rename(columns={0:'dim_0'}).head() # For report we change the column name temporarily

In [None]:
for label in labels:
    fig, ax = plt.subplots(1, figsize=plt.figaspect(0.25))
    for instance in X_test_nest_std.loc[y_test==label, 0]:
        ax.plot(instance)  # no per-line label: all lines belong to the same class
    ax.set(title=f"Instances of {label}")

In [None]:
fig, ax = plt.subplots(1, figsize=plt.figaspect(0.25))
for label in labels:
    X_test_nest_std.loc[y_test==label, 0].iloc[0].plot(ax=ax, label=f"class {label}")
plt.legend(loc='upper left', bbox_to_anchor=(1,1))
ax.set(title="Example time series", xlabel="Time");

<a id="33"></a> <br>
### [▲](#3) 3C - Univariate TSC model classification

#### Sktime models

In [None]:
sktime_clfs = {
    'TSF': TimeSeriesForestClassifier(),
    'STSF': SupervisedTimeSeriesForest(),
    'RISE': RandomIntervalSpectralEnsemble(),
    'RIC': RandomIntervalClassifier(),
    'STC': ShapeletTransformClassifier(),
    'kNN-TS': KNeighborsTimeSeriesClassifier(),
    'CTSF': ComposableTimeSeriesForestClassifier(), # Time-consuming (~30-40 min)
    'WEASEL': WEASEL(),
    #'HIVE-COTEv1.0': HIVECOTEV1(),                                                 # Extremely time-consuming (one successful run took 900 min = 15 h)
    #'CanonicalIntervalForest': CanonicalIntervalForest(),                          # Extremely time-consuming
}

#### Classification with modelwise reports (Disabled)

In [None]:
#classify_report(sktime_clfs, X_train_nest_std, y_train, X_test_nest_std, y_test)
# This is HIVE-COTE result

#### Single models test (Disabled)

**TEST** TSF EXTRA

In [None]:
#clf_tsf = TimeSeriesForestClassifier()
#classify_report({'TSF':clf_tsf}, X_train_nest_std, y_train, X_test_nest_std, y_test)

**TEST** HIVECOTE EXTRA

In [None]:
# This run took about 900 min
#sktime_clfs_e1 = {'HIVE-COTE': HIVECOTEV1()}
#classify_report(sktime_clfs_e1, X_train_nest_std, y_train, X_test_nest_std, y_test)

![hive_cote_report](./img/classification_report_hivecote1.png)

**TEST** WEASEL EXTRA

In [None]:
#sktime_clfs_e2 = {'WEASEL': WEASEL()}
#classify_report(sktime_clfs_e2, X_train_nest_std, y_train, X_test_nest_std, y_test)

#### Iterative Classification

In [None]:
init_data(SEQ_START, SEQ_END, nest=True, tab=True, std=True)

<div style="display: flex; padding: 15px; background-color: skyblue; height: 60px; border-radius: 5px; width: 95vw;">
    <h3 style="font-size: 20px;"><b>Classify</b> - Data setup B</h3><br>
</div>

In [None]:
ITERS = 3
# Note: this may take about 20 min per iteration with ~500 features
results, preds = classify(sktime_clfs, 'sktime', X_train_nest_std, X_test_nest_std, y_train, y_test, results=None, preds=None, iters=ITERS)

**Check the data before saving**

In [None]:
results.sort_values('Score_1') 

In [None]:
preds

**Save results, scores and predictions to the file**

In [None]:
time_stamp = datetime.datetime.now().strftime("D%Y%m%d_T%H%M")
results_csv_filename = 'results_datasetup_b_' + time_stamp + '.csv'
preds_csv_filename = 'preds_datasetup_b_' + time_stamp + '.csv'

In [None]:
results.to_csv(results_filepath + results_csv_filename, index=False)
preds.to_csv(preds_filepath + preds_csv_filename, index=False)

**Save the case data** 

* This is for later analysis (the test data may differ between runs)
* This must be done before the next `init_data()` function call in order to keep the same case data that was used in the classification

In [None]:
case_data_train = pd.DataFrame(np.column_stack((x_train_std, y_train)))
case_data_train.to_csv('DATA/case_data/TRAIN-DATA_CASE-' + time_stamp, index=False)

In [None]:
case_data_test = pd.DataFrame(np.column_stack((x_test_std, y_test)))
case_data_test.to_csv('DATA/case_data/TEST-DATA_CASE-' + time_stamp, index=False)

#### Manual TSC classification (Disabled)

* Here you can try classification with a single model (or smaller subset of classifiers) using the same functions
* You can insert results to existing saved tables using *read-modify-save* method

In [None]:
#init_data(SEQ_START, SEQ_END, nest=True, tab=True, std=True)

Current model: cBOSS

In [None]:
#from sktime.classification.dictionary_based import ContractableBOSS

In [None]:
'''
ITERS = 1
sktime_clfs_e2 = {'cBOSS': ContractableBOSS()}
results_cboss, preds_cboss = classify(sktime_clfs_e2, 'sktime', X_train_nest_std, X_test_nest_std, y_train, y_test, results=None, preds=None, iters=ITERS)
'''

In [None]:
#results_cboss

Reads the latest predictions from the existing file, inserts a new prediction column into the table, and saves the file again.

In [None]:
'''
preds_csv_filename = os.listdir(preds_filepath)[-1]
preds_csv_filename
'''

In [None]:
'''
preds_temp = pd.read_csv(preds_filepath + preds_csv_filename)
preds_temp['cBOSS'] = preds.cBOSS
preds_temp['Correct'] = y_test
preds_temp.to_csv(preds_filepath + preds_csv_filename, index=False)
'''

#### ROC-AUC -curve analysis

For practical reasons, this analysis is done here for now

In [None]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import RocCurveDisplay, roc_auc_score

In [None]:
train_data = pd.read_csv('DATA/case_data/TRAIN-DATA_CASE-D20230424_T1718')
test_data = pd.read_csv('DATA/case_data/TEST-DATA_CASE-D20230424_T1718')

In [None]:
x_train = train_data.iloc[:,:-1].values
y_train = train_data['207']
x_test = test_data.iloc[:,:-1].values
y_test = test_data['207']
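
The column name `'207'` above follows from how the case data were saved: the labels were stacked as the last column behind the 207 features, and the integer column names become strings when the CSV is read back. A minimal round-trip sketch with hypothetical data (3 instances, 4 features), which also shows a way to pick the label column by position instead of by name:

```python
import io
import numpy as np
import pandas as pd

x = np.arange(12, dtype=float).reshape(3, 4)   # hypothetical: 3 instances, 4 features
y = np.array([0.0, 2.0, 1.0])

# Save features and labels side by side, as done for the case data
buffer = io.StringIO()
pd.DataFrame(np.column_stack((x, y))).to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

label_col = restored.columns[-1]          # the label column is named after the feature count
x_back = restored.iloc[:, :-1].to_numpy()
y_back = restored.iloc[:, -1].to_numpy()  # position-based selection works for any signal length
```

Selecting by position avoids having to change the column name whenever `SEQ_END` changes.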

In [None]:
sktime_clfs

In [None]:
sktime_clfs.get('TSF').fit(x_train, y_train)

In [None]:
#sktime_clfs.get('TSF').predict(x_test)
y_score = sktime_clfs.get('TSF').predict_proba(x_test)

In this section we use a LabelBinarizer to binarize the target by one-hot encoding in an OvR fashion. This means that the target of shape (n_samples,) is mapped to a target of shape (n_samples, n_classes).

In [None]:
label_binarizer = LabelBinarizer().fit(y_train)
y_onehot_test = label_binarizer.transform(y_test)
y_onehot_test.shape  # (n_samples, n_classes)

We can as well easily check the encoding of a specific class:

In [None]:
label_binarizer.transform([1])
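As a small self-contained illustration of the OvR encoding described above, the sketch below binarizes a toy three-class target. The arrays `y_toy_train` and `y_toy_test` are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy labels with three classes (made up for illustration)
y_toy_train = np.array([0, 1, 2, 1, 0, 2])
y_toy_test = np.array([2, 0, 1, 0])

binarizer = LabelBinarizer().fit(y_toy_train)
onehot = binarizer.transform(y_toy_test)

# Equivalent manual encoding: compare every label with every known class
manual = (y_toy_test[:, None] == binarizer.classes_[None, :]).astype(int)

print(onehot.shape)  # (n_samples, n_classes)
print(onehot)
```

Each column of `onehot` is the binary "this class vs the rest" target that a per-class ROC curve is drawn from.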

ROC curve for a specific class

- The plots below treat each class in turn as the positive class. For class 0, for example, the sports are regarded as either “biking” or “non-biking” (classes 1 and 2).

In [None]:
class_of_interest = 0
class_id = np.flatnonzero(label_binarizer.classes_ == class_of_interest)[0]
class_id

In [None]:
# Print roc-auc using sklearn function
print(roc_auc_score(y_test, y_score, average="weighted", multi_class="ovr"))

fig, ax = plt.subplots(1,3,figsize=(20,8))

for f in [0,1,2]:

    class_of_interest = f
    class_id = np.flatnonzero(label_binarizer.classes_ == class_of_interest)[0]

    RocCurveDisplay.from_predictions(
        y_onehot_test[:, class_id],
        y_score[:, class_id],
        name=f"{class_of_interest} vs the rest",
        color="darkorange",
        ax=ax[f]
    )
    ax[f].plot([0, 1], [0, 1], "k--", label="chance level (AUC = 0.5)")
    ax[f].axis("square")
    ax[f].set_xlabel("False Positive Rate")
    ax[f].set_ylabel("True Positive Rate")
    ax[f].set_title(f"One-vs-Rest ROC curve:\nclass {class_of_interest} vs the rest")  # title follows the plotted class
    ax[f].legend()

fig.suptitle('Time Series Forest (TSF) ROC_AUC curves', fontsize=20)
plt.show()
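The weighted one-vs-rest AUC printed above can be reproduced by hand: compute one binary AUC per class and average them, weighting each class by its number of true samples. The sketch below checks this on synthetic data; all names (`y_true_toy`, `y_prob_toy`) and values are made up for the example.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n_samples, n_classes = 200, 3

# Synthetic labels and class probabilities (softmax of noisy logits)
y_true_toy = rng.integers(0, n_classes, size=n_samples)
logits = rng.normal(size=(n_samples, n_classes))
logits[np.arange(n_samples), y_true_toy] += 1.0  # make the scores informative
y_prob_toy = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# One binary "class k vs the rest" AUC per class
y_onehot_toy = label_binarize(y_true_toy, classes=np.arange(n_classes))
per_class_auc = np.array(
    [roc_auc_score(y_onehot_toy[:, k], y_prob_toy[:, k]) for k in range(n_classes)]
)

# Weight each class by its support (number of true samples)
support = y_onehot_toy.sum(axis=0)
manual_weighted = np.average(per_class_auc, weights=support)

sklearn_weighted = roc_auc_score(
    y_true_toy, y_prob_toy, average="weighted", multi_class="ovr"
)
print(per_class_auc, manual_weighted, sklearn_weighted)
```

Looking at `per_class_auc` alongside the single weighted number shows which class drives the overall score, which the averaged value alone hides.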

<a id="4"></a> <br>
## [▲](#0) 4 - Multivariate TSC (MTSC)

<a id="41"></a> <br>
### [▲](#4) 4A - Libraries and functions

In [None]:
from sktime.classification.dictionary_based import MUSE # WEASEL+MUSE (multivariate version of WEASEL)
from sktime.classification.compose import ColumnEnsembleClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from sktime.datatypes._panel._convert import from_2d_array_to_nested

<a id="42"></a> <br>
### [▲](#4) 4B - Data setup

In [None]:
hr_list = np.array([seq.T[0] for seq in SEQ_SEGMENTED])
spd_list = np.array([seq.T[1] for seq in SEQ_SEGMENTED])
alt_list = np.array([seq.T[2] for seq in SEQ_SEGMENTED])

hr_std = (hr_list - hr_list.mean())/(hr_list.std())
spd_std = (spd_list - spd_list.mean())/(spd_list.std())
alt_std = (alt_list - alt_list.mean())/(alt_list.std())

In [None]:
hr_std.shape

**Save data to file in order to access it later**

In [None]:
pd.DataFrame(hr_std).to_csv('DATA/HR-DATA_std_1160x69', index=False)
pd.DataFrame(spd_std).to_csv('DATA/SPD-DATA_std_1160x69', index=False)
pd.DataFrame(alt_std).to_csv('DATA/ALT-DATA_std_1160x69', index=False)
pd.DataFrame(SEQ_SEGMENTED_LABELS.label).to_csv('DATA/TARGET-DATA_1160x69', index=False)

**Segment visualization**

In [None]:
fig, ax = plt.subplots(4,2,figsize=(20,12))

def plot_sample(ax, data, title, legend=['1','2','3','4','5'], loc='upper right', color=None):
    ax.set_title(title)
    ax.plot(data, color=color)
    ax.legend(legend, loc=loc)

# Original data plots
plot_sample(ax[0,0], hr_list[0:5].T, 'Heart Rate')
plot_sample(ax[1,0], spd_list[0:5].T, 'Speed')
plot_sample(ax[2,0], alt_list[0:5].T, 'Altitude')

# Scaled data plots
plot_sample(ax[0,1], hr_std[0:5].T, 'Heart Rate')
plot_sample(ax[1,1], spd_std[0:5].T, 'Speed')
plot_sample(ax[2,1], alt_std[0:5].T, 'Altitude')

# Plot activity features
plot_sample(ax[3,0], hr_list[0], title='', color='red')
plot_sample(ax[3,0], spd_list[0], title='', color='blue')
plot_sample(ax[3,0], alt_list[0], title='Activity 1', legend=['hr','spd','alt'], color='green')

plot_sample(ax[3,1], hr_list[3], title='', color='red')
plot_sample(ax[3,1], spd_list[3], title='', color='blue')
plot_sample(ax[3,1], alt_list[3], title='Activity 4', legend=['hr','spd','alt'], color='green')

plt.tight_layout()
plt.show()

Effect of standardization

In [None]:
fig, ax = plt.subplots(2,2,figsize=(20,6))

# Plot activity features
ax[0,0].set_title('Activity A')
ax[0,0].plot(hr_list[0], color='red')
ax[0,0].plot(spd_list[0], color='blue')
ax[0,0].plot(alt_list[0], color='green')
ax[0,0].legend(['hr','spd','alt'], loc='upper right')

ax[0,1].set_title('Activity A (std)')
ax[0,1].plot(hr_std[0], color='red')
ax[0,1].plot(spd_std[0], color='blue')
ax[0,1].plot(alt_std[0], color='green')
ax[0,1].legend(['hr','spd','alt'], loc='upper right')

ax[1,0].set_title('Activity B')
ax[1,0].plot(hr_list[3], color='red')
ax[1,0].plot(spd_list[3], color='blue')
ax[1,0].plot(alt_list[3], color='green')
ax[1,0].legend(['hr','spd','alt'], loc='upper right')

ax[1,1].set_title('Activity B (std)')
ax[1,1].plot(hr_std[3], color='red')
ax[1,1].plot(spd_std[3], color='blue')
ax[1,1].plot(alt_std[3], color='green')
ax[1,1].legend(['hr','spd','alt'], loc='upper right')

plt.tight_layout()
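The standardization in this section uses one global mean and one global standard deviation per feature, so differences in level between segments are preserved. The sketch below contrasts this with a per-segment z-score on made-up data; the per-segment variant is shown only for comparison and is not used in the notebook.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for one feature: 4 segments x 6 time steps (values are made up)
segments = rng.normal(loc=140.0, scale=15.0, size=(4, 6))

# Global z-score: one mean and one std for the whole feature
global_z = (segments - segments.mean()) / segments.std()

# Per-segment z-score: every segment is centred and scaled on its own
per_segment_z = (
    segments - segments.mean(axis=1, keepdims=True)
) / segments.std(axis=1, keepdims=True)

print(global_z.mean(axis=1))       # segment means differ after global scaling
print(per_segment_z.mean(axis=1))  # segment means are all approximately 0
```

With global scaling, a segment with a generally high heart rate still looks high after scaling, which may itself carry information about the activity type.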

Transform features separately to a nested structure

In [None]:
df_nest_hr = from_2d_array_to_nested(np.array(hr_std))
df_nest_spd = from_2d_array_to_nested(np.array(spd_std))
df_nest_alt = from_2d_array_to_nested(np.array(alt_std))

In [None]:
df_nest_hr
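To make the nested structure concrete: each row of the 2D array becomes a single table cell that holds a whole `pd.Series`. The sketch below mimics that layout with plain pandas; `to_nested_sketch` is a hypothetical helper written for illustration and is not the sktime implementation.

```python
import numpy as np
import pandas as pd


def to_nested_sketch(array_2d, column_name="var_0"):
    """Pack each row of a 2D array into one pd.Series stored in a single cell."""
    cells = np.empty(len(array_2d), dtype=object)
    for i, row in enumerate(array_2d):
        cells[i] = pd.Series(row)
    return pd.DataFrame({column_name: cells})


# Toy data: 3 segments with 4 time steps each
toy_array = np.arange(12, dtype=float).reshape(3, 4)
toy_nested = to_nested_sketch(toy_array, column_name="hr")

print(toy_nested.shape)       # one row per segment, one column per feature
print(toy_nested.iloc[1, 0])  # the full time series of the second segment
```

This layout is why several features can later be placed side by side as columns of one table, with every cell carrying its own series.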

In [None]:
df_multi = pd.DataFrame()
df_multi['hr'] = df_nest_hr
df_multi['spd'] = df_nest_spd
df_multi['alt'] = df_nest_alt
df_multi['target'] = SEQ_SEGMENTED_LABELS.label.astype('category')
df_multi

Create train-test splits for actual classification (we use the same splits for MUSE and ensemble)

<a id="43"></a> <br>
### [▲](#4) 4C - Multivariate TSC model classification

> Create train/test splits

In [None]:
x_tr, x_te, y_tr, y_te = train_test_split(df_multi.iloc[:,0:3], 
                                          df_multi.iloc[:,3].cat.codes, 
                                          test_size=0.2, 
                                          random_state=24, 
                                          stratify=df_multi.iloc[:,3], 
                                          shuffle=True)
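The `stratify` argument keeps the class proportions approximately equal in the train and test parts. A minimal check on an imbalanced toy target (all names and values below are made up for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 60 / 30 / 10 samples per class
y_toy = np.array([0] * 60 + [1] * 30 + [2] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_toy_train, X_toy_test, y_toy_tr, y_toy_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=24, stratify=y_toy, shuffle=True
)

# Share of each class in both parts
train_share = np.bincount(y_toy_tr) / len(y_toy_tr)
test_share = np.bincount(y_toy_te) / len(y_toy_te)
print(train_share, test_share)
```

Without stratification, a small class could end up with very few test samples, which would make per-class scores unstable.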

`class MUSE(anova=True, 
            variance=False, 
            bigrams=True, 
            window_inc=2, 
            alphabet_size=4, 
            use_first_order_differences=True, 
            feature_selection='chi2', 
            p_threshold=0.05, 
            support_probabilities=False, 
            n_jobs=1,  
            random_state=None)`

> Classification report

In [None]:
sktime_clfs_e3 = {'MUSE': MUSE(window_inc=4)}
classify_report(sktime_clfs_e3, x_tr, y_tr, x_te, y_te)

> Iterative classification

TODO: Modify the `classify()` function so that probabilities are also calculated for the `sktime-multivar` and `sktime-ensemble` types

In [None]:
results_muse, preds_muse = classify(sktime_clfs_e3, 'sktime-multivar', x_tr, x_te, y_tr, y_te, results=None, preds=None, iters=3)

In [None]:
results_muse

In [None]:
preds_muse

> Save results to file

In [None]:
save_to_file('results','MUSE',results_muse)

In [None]:
save_to_file('predictions','MUSE',preds_muse)

<a id="44"></a> <br>
### [▲](#4) 4D - Column Ensemble classification

#### Univariate separate feature classification
> To find the best model for each feature

In [None]:
ITERS = 1

**0 - Heart Rate**

In [None]:
x_tr, x_te, y_tr, y_te = train_test_split(df_nest_hr, 
                                          df_multi.target.cat.codes, 
                                          test_size=0.2, 
                                          random_state=24, 
                                          stratify=df_multi.target, 
                                          shuffle=True)

In [None]:
results_hr, preds_hr = classify(sktime_clfs, 'sktime', x_tr, x_te, y_tr, y_te, results=None, preds=None, iters=ITERS)

In [None]:
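# Total execution time per model (training + testing)
results_hr['Exec_Time(s)'] = results_hr['Train(t)']+results_hr['Test(t)']
# Min-max scaled execution time (0 = fastest model, 1 = slowest); despite its name this column is not a train/test ratio
results_hr['train/test(s)'] = (results_hr['Exec_Time(s)']-results_hr['Exec_Time(s)'].min())/(results_hr['Exec_Time(s)'].max()-results_hr['Exec_Time(s)'].min())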
results_hr.sort_values('Score_1', ascending=False)

**1 - Speed**

In [None]:
x_tr, x_te, y_tr, y_te = train_test_split(df_nest_spd, 
                                          df_multi.target.cat.codes, 
                                          test_size=0.2, 
                                          random_state=24, 
                                          stratify=df_multi.target, 
                                          shuffle=True)

In [None]:
results_spd, preds_spd = classify(sktime_clfs, 'sktime', x_tr, x_te, y_tr, y_te, results=None, preds=None, iters=ITERS)

In [None]:
# Total execution time per model (training + testing)
results_spd['Exec_Time(s)'] = results_spd['Train(t)']+results_spd['Test(t)']
# Min-max scaled execution time (0 = fastest model, 1 = slowest)
results_spd['train/test(s)'] = (results_spd['Exec_Time(s)']-results_spd['Exec_Time(s)'].min())/(results_spd['Exec_Time(s)'].max()-results_spd['Exec_Time(s)'].min())
results_spd.sort_values('Score_1', ascending=False)

**2 - Altitude**

In [None]:
x_tr, x_te, y_tr, y_te = train_test_split(df_nest_alt, 
                                          df_multi.target.cat.codes, 
                                          test_size=0.2, 
                                          random_state=24, 
                                          stratify=df_multi.target, 
                                          shuffle=True)

In [None]:
results_alt, preds_alt = classify(sktime_clfs, 'sktime', x_tr, x_te, y_tr, y_te, results=None, preds=None, iters=ITERS)

In [None]:
# Total execution time per model (training + testing)
results_alt['Exec_Time(s)'] = results_alt['Train(t)']+results_alt['Test(t)']
# Min-max scaled execution time (0 = fastest model, 1 = slowest)
results_alt['train/test(s)'] = (results_alt['Exec_Time(s)']-results_alt['Exec_Time(s)'].min())/(results_alt['Exec_Time(s)'].max()-results_alt['Exec_Time(s)'].min())
results_alt.sort_values('Score_1', ascending=False)
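The same timing post-processing is repeated for heart rate, speed and altitude. One possible way to avoid the repetition is a small helper like the sketch below. The function name `add_exec_time_columns` and the output column `Exec_Time(norm)` are hypothetical, and the toy table holds made-up numbers rather than thesis results; only the `Train(t)`, `Test(t)` and `Exec_Time(s)` column names are taken from the cells above.

```python
import pandas as pd


def add_exec_time_columns(results, train_col="Train(t)", test_col="Test(t)"):
    """Return a copy with total and min-max scaled execution time added."""
    out = results.copy()
    total = out[train_col] + out[test_col]
    span = total.max() - total.min()
    out["Exec_Time(s)"] = total
    # 0 = fastest model, 1 = slowest; fall back to 0 if all times are equal
    out["Exec_Time(norm)"] = (total - total.min()) / span if span > 0 else 0.0
    return out


# Toy results table (made-up values)
toy_results = pd.DataFrame(
    {"Score_1": [0.6, 0.9, 0.8], "Train(t)": [1.0, 4.0, 2.0], "Test(t)": [0.5, 1.0, 0.5]},
    index=["model_a", "model_b", "model_c"],
)
toy_timed = add_exec_time_columns(toy_results)
print(toy_timed.sort_values("Score_1", ascending=False))
```

Returning a copy leaves the original results table untouched, so the helper can be re-run safely in a notebook.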

#### Column Ensemble

Based on the single-feature classification results, we select the following models for the ensemble:

| Feature | Classifier | Result |
| --- | --- | --- |
| Heart rate | Supervised Time Series Forest (STSF) | 0.58 |
| Speed | Supervised Time Series Forest (STSF) | 0.92 |
| Altitude | Random Interval Classifier (RIC) | 0.79 |


The data variables need to be re-initialized, because the univariate splits above overwrote `x_tr`, `x_te`, `y_tr` and `y_te`.

In [None]:
x_tr, x_te, y_tr, y_te = train_test_split(df_multi.iloc[:,0:3], 
                                          df_multi.iloc[:,3].cat.codes, 
                                          test_size=0.2, 
                                          random_state=24, 
                                          stratify=df_multi.iloc[:,3], 
                                          shuffle=True)

In [None]:
clf_emb = ColumnEnsembleClassifier( estimators=[
                                    ("STSF1", SupervisedTimeSeriesForest(), [0]),  # column 0: heart rate
                                    ("STSF2", SupervisedTimeSeriesForest(), [1]),  # column 1: speed
                                    ("RIC", RandomIntervalClassifier(), [2]),      # column 2: altitude
                                ])
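To illustrate the idea behind a column ensemble: each estimator sees only its own column(s), and the per-estimator class probabilities are then combined. The sketch below assumes the combination is a plain average followed by an argmax, which is my understanding of how sktime's `ColumnEnsembleClassifier` works and should be verified against the installed version. The probability matrices are made up.

```python
import numpy as np

# Made-up class probabilities from three per-column models
# (4 samples x 3 classes each); not outputs of the thesis models
proba_hr = np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.4, 0.4, 0.2], [0.1, 0.2, 0.7]])
proba_spd = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]])
proba_alt = np.array([[0.6, 0.2, 0.2], [0.3, 0.3, 0.4], [0.3, 0.5, 0.2], [0.3, 0.3, 0.4]])

# Stack to shape (n_models, n_samples, n_classes) and average over the models
stacked = np.stack([proba_hr, proba_spd, proba_alt])
avg_proba = stacked.mean(axis=0)

# The ensemble prediction is the class with the highest averaged probability
ensemble_pred = avg_proba.argmax(axis=1)
print(avg_proba)
print(ensemble_pred)
```

In the third sample the heart-rate model is undecided between two classes, and the averaged probabilities let the more confident speed model settle the prediction.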

In [None]:
sktime_clfs_e4 = {'ENSEMBLE': clf_emb}
classify_report(sktime_clfs_e4, x_tr, y_tr, x_te, y_te)

In [None]:
results_emb, preds_emb = classify(sktime_clfs_e4, 'sktime-ensemble', x_tr, x_te, y_tr, y_te, results=None, preds=None, iters=3)

In [None]:
results_emb

In [None]:
preds_emb

> Save results to a file

In [None]:
save_to_file('results','ENSEMBLE',results_emb)

In [None]:
save_to_file('predictions','ENSEMBLE',preds_emb)

<div style="display: block; padding: 15px; background-color: lightgreen; height: auto; border-radius: 5px; width: 95vw;">
    <h3 style="font-size: 26px;"><b>Execution Information</b></h3>
    <p>Works only when the whole notebook has been executed from the start up to this point.</p>
</div>

In [None]:
run_end = datetime.datetime.now()
run_time = run_end - run_start
print('File execution info:')
print('Start\t', run_start)
print('End\t', run_end)
print('Runtime\t', str(run_time))

---

<a id="7"></a> <br>
## [▲](#0) 7 - Test Section

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
# Some extra time series classifiers from sktime
from sktime.classification.compose import ColumnEnsembleClassifier
from sktime.classification.feature_based import SummaryClassifier
from sktime.classification.interval_based import TimeSeriesForestClassifier
# Data transformation tools from sktime
from sktime.transformations.panel.compose import ColumnConcatenator
from sktime.transformations.panel.segment import RandomIntervalSegmenter
from sktime.transformations.panel.shapelet_transform import ShapeletTransform
# Model tuning tools
from sklearn.model_selection import GridSearchCV, StratifiedKFold
# Dataset for testing purposes
from sktime.datasets import load_basic_motions
# Pipeline
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
# Metrics
from sklearn.metrics import roc_auc_score
from sktime.datasets import load_unit_test

#### 7A - Column Ensemble in "Basic Motions" dataset

**Functions**

Confusion matrix plot function (for the 4 classes in the test data)

In [None]:
def plot_confmatrix_4(ax, yval, ypred):
    cm = pd.DataFrame(confusion_matrix(yval, ypred))
    cm_norm = cm.apply(lambda x: x/x.sum(), axis = 1)
    sns.set(font_scale=1.1) # for label size
    sns.heatmap(cm_norm, annot=True, xticklabels=('Pred A1', 'Pred A2','Pred A3','Pred A4'), 
                                     yticklabels=('Act A1', 'Act A2','Act A3','Act A4'),
                                     fmt='.1%',
                                     cmap='Blues',
                                     ax=ax,
                                     annot_kws={"size": 16}) # font size
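The plot function above normalizes the confusion matrix by row, so each row shows how the samples of one actual class are distributed over the predicted classes. A small numeric sketch of that normalization, using made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for four classes
y_true_toy = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
y_pred_toy = np.array([0, 0, 1, 1, 1, 2, 3, 2, 3, 3])

cm_counts = confusion_matrix(y_true_toy, y_pred_toy)

# Row normalization: divide every row by the number of actual samples in that class
cm_row_norm = cm_counts / cm_counts.sum(axis=1, keepdims=True)

# scikit-learn can do the same directly with normalize="true"
cm_sklearn = confusion_matrix(y_true_toy, y_pred_toy, normalize="true")

print(cm_counts)
print(cm_row_norm.round(3))
```

With row normalization the diagonal holds the per-class recall, which makes classes of different sizes directly comparable.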

**Data**

In [None]:
X_train, y_train = load_basic_motions(split="train")
X_test, y_test = load_basic_motions(split="test")

In [None]:
X_train.head(5)

**Estimators (Models)**

In [None]:
estimators = [("STSF", SupervisedTimeSeriesForest(), [0]),     # first dimension
              ("RIC", RandomIntervalClassifier(), [1, 2])]     # second and third dimensions

**Ensemble**

In [None]:
col_ens = ColumnEnsembleClassifier(estimators=estimators)
col_ens.fit(X_train, y_train)

y_pred = col_ens.predict(X_test)
print('model', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))

In [None]:
fig = plt.figure()
spec = gridspec.GridSpec(ncols=1, nrows=1, width_ratios=[1], wspace=0.3)
ax = fig.add_subplot(spec[0])
plot_confmatrix_4(ax, y_test, y_pred)

#### Case study data

In [None]:
clf = ColumnEnsembleClassifier( estimators=[
                                    ("STSF", SupervisedTimeSeriesForest(), [0]),                     # column 1
                                    ("TSFC", TimeSeriesForestClassifier(n_estimators=200), [1]),     # column ...
                                    ("RISE", RandomIntervalSpectralEnsemble(n_estimators=200), [2]), # column n
                                ])
clf.fit(X_train_nest, y_train)

y_pred = clf.predict(X_test_nest)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, digits=3))
plot_confmatrix(y_test, y_pred)

### Hyperparameter list of classifiers

In [None]:
clf.get_param_names()

In [None]:
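# Collect the hyperparameters of every classifier into one table
model_hp = {}
for name, model in sktime_clfs.items():
    model_hp[name] = model.get_params()

# MUSE is kept in its own dictionary (sktime_clfs_e3)
model_hp['MUSE'] = sktime_clfs_e3['MUSE'].get_params()

pd.DataFrame(model_hp).replace(np.nan, '', regex=True)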

### CML model test

In [None]:
# ML models
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
sklearn_clfs = {
    'kNN':  KNeighborsClassifier(),
    'G-NB': GaussianNB(),
    'QDA':  QuadraticDiscriminantAnalysis(),
    'LR':   LogisticRegression(),
    'SVM':  SVC(),
    'MLP':  MLPClassifier(),
    'LDA':  LinearDiscriminantAnalysis(),
    'GB':   GradientBoostingClassifier()
}
sklearn_clfs_tree = {
    'DT':   DecisionTreeClassifier(),
    'RF':   RandomForestClassifier(),
}

In [None]:
init_data(SEQ_START, SEQ_END, nest=False, tab=True)
X_train_tab.head()

In [None]:
X_train_tab_std.head()

In [None]:
results_sklearn, preds_sklearn = classify(sklearn_clfs, 'sklearn', X_train_tab_std, X_test_tab_std, y_train, y_test, results=None, preds=None, iters=3)

In [None]:
classify_report(sklearn_clfs_tree, X_train_tab, y_train, X_test_tab, y_test)

In [None]:
classify_report(sklearn_clfs, X_train_tab_std, y_train, X_test_tab_std, y_test)