### Preliminaries

In [None]:
import pandas as pd
import numpy as np
from numpy import save

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

import keras
from keras.utils import to_categorical
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Conv1D, MaxPooling1D
# BatchNormalization and LeakyReLU are imported from keras.layers directly;
# the old submodule paths are not available in recent Keras versions
from keras.layers import BatchNormalization, LeakyReLU

In [None]:
## load processed data
training_data = np.load("training_data.npy", allow_pickle=True)
labels = np.load("labels.npy", allow_pickle=True)
training_data.shape, labels.shape

### Decision forest

Finally, we can start applying some models. We first create a validation set for testing the decision forest, using the function [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). With `test_size=0.2`, 20% of the samples are held out for validation and the remaining 80% are used for training.

In [None]:
## SPLITTING
X_train, X_test, y_train, y_test = train_test_split(training_data, labels, test_size=0.2, random_state=42)

In [None]:
# check shape
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
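
As a quick sanity check of what the split produces, here is a minimal sketch on a synthetic array. The shape `(10, 3, 20)` and the names `toy_data` and `toy_labels` are assumptions made up for illustration; they are not the shape or content of the real `training_data`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 samples, 3 channels, 20 time steps
rng = np.random.default_rng(0)
toy_data = rng.normal(size=(10, 3, 20))
toy_labels = np.arange(10) % 2  # two dummy classes

# Same call as above: 20% of the samples go to the validation set
Xa, Xb, ya, yb = train_test_split(toy_data, toy_labels, test_size=0.2, random_state=42)

print(Xa.shape, ya.shape)  # (8, 3, 20) (8,)
print(Xb.shape, yb.shape)  # (2, 3, 20) (2,)
```

Only the first axis (the samples) is split; the remaining axes are left untouched.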

#### Feature engineering

In order to improve the accuracy of our decision forest, it is wise to do some feature engineering first. We define two functions. The first simply computes the mean of our signal along its last axis, while the second does something a little more complicated: it applies several summary statistics (mean, minimum, maximum and standard deviation) to our signal and concatenates the results into a single feature matrix.

In [None]:
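def features_mean(signal):
    # average each signal along its last axis (axis=2)
    return np.mean(signal, axis=2)

def features(signal, functions):
    # apply every statistic along axis=2 and stack the results side by side
    summaries = []
    for fn in functions:
        summaries.append(fn(signal, axis=2))
    return np.concatenate(summaries, axis=1)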

summaries = [np.mean, np.min, np.max, np.std]

X_train_summaries = features(X_train, summaries)
X_test_summaries = features(X_test, summaries)

X_train_mean = features_mean(X_train)
X_test_mean = features_mean(X_test)
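
To make the effect of these two functions concrete, here is a minimal sketch on a synthetic array. The shape `(n_samples, n_channels, n_timesteps) = (5, 3, 100)` is an assumption for illustration only; the functions above rely only on the signal running along axis 2.

```python
import numpy as np

def features(signal, functions):
    # Apply each statistic along axis 2, then concatenate along the feature axis
    return np.concatenate([fn(signal, axis=2) for fn in functions], axis=1)

# Hypothetical data: 5 samples, 3 channels, 100 time steps
rng = np.random.default_rng(0)
toy_signal = rng.normal(size=(5, 3, 100))

toy_mean = np.mean(toy_signal, axis=2)
toy_summaries = features(toy_signal, [np.mean, np.min, np.max, np.std])

print(toy_mean.shape)       # (5, 3): one mean per channel
print(toy_summaries.shape)  # (5, 12): 4 statistics x 3 channels
```

Both outputs are 2-D, which is what `RandomForestClassifier` expects: one row per sample, one column per feature.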

#### Running the model

Now for a little bit of fun. Here I create a function to run the decision forest with different numbers of trees. The number of trees in a forest is an important hyperparameter: averaging over more trees makes the predictions more stable, at the cost of longer training time. Let's have a look at all the arguments:

- **n_estimators** : this is a list; each value in it is a number of trees to try
- **X_train and X_test** : because we used 2 different kinds of feature engineering, here we can choose which one to use
- **y_train and y_test** : regardless of which kind of features we use, the labels stay the same, and for this reason these arguments already have default values
- **random_state** : simply a way of controlling the randomness in the bootstrap samples and in the features considered when looking for the best split at each node. This is useful in order to be able to reproduce the accuracy.

I leave you [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) the official documentation for the random forest classifier.

In conclusion, this function not only lets us choose the training data and the number of trees, it also saves the accuracy obtained for each number of trees.

In [None]:
## this function takes a training and a validation data set and returns the accuracy
# for each number of trees, which is encoded here as n_estimators

def run_forest(n_estimators, X_train, X_test, y_train=y_train, y_test=y_test, random_state=333):

    acc = []  # list of (n_trees, accuracy) pairs, one per value in n_estimators

    for n_trees in n_estimators:  # hyperparameter to vary; try higher numbers
        forest = RandomForestClassifier(n_estimators=n_trees,
                                        bootstrap=True,
                                        max_features='sqrt',
                                        random_state=random_state)
        fitted_model = forest.fit(X_train, y_train)
        prediction = fitted_model.predict(X_test)
        accuracy = accuracy_score(y_test, prediction)
        acc.append((n_trees, accuracy))

    return acc
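
If you want to try the same loop without loading the real data, here is a minimal self-contained sketch. Everything about the data is an assumption: `make_classification` generates a synthetic problem, and the sizes (300 samples, 12 features, 3 classes) were chosen arbitrarily. The helper `run_forest_toy` is a hypothetical variant of the function above that takes the labels explicitly instead of through default arguments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features (not the notebook's data)
X_toy, y_toy = make_classification(n_samples=300, n_features=12, n_informative=6,
                                   n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

def run_forest_toy(n_estimators, X_train, X_test, y_train, y_test, random_state=333):
    results = []  # (number of trees, validation accuracy) pairs
    for n_trees in n_estimators:
        forest = RandomForestClassifier(n_estimators=n_trees,
                                        max_features="sqrt",
                                        random_state=random_state)
        forest.fit(X_train, y_train)
        results.append((n_trees, accuracy_score(y_test, forest.predict(X_test))))
    return results

toy_results = run_forest_toy([1, 10, 50], X_tr, X_te, y_tr, y_te)
for n_trees, acc in toy_results:
    print(f"{n_trees:>3} trees -> accuracy {acc:.3f}")
```

Passing the labels explicitly avoids relying on default arguments, which are bound to whatever `y_train` and `y_test` contain at the moment the function is defined.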

Let's run the model. Here we define the lists to pass as the **n_estimators** argument. My machine cannot handle all of this data, so for the sake of the example I use just 1 tree for each type of feature engineering. You can try with [50,100,200] and you will see that the accuracy increases, although it does not go above 20%. Not too much, eh? For this we have DL ;).

In [None]:
# set up the numbers of trees to try
n_estimators_mean = [1]
n_estimators_summaries = [1]

In [None]:
# accuracy for the mean
accuracy_mean = run_forest(n_estimators_mean,X_train_mean,X_test_mean)
print(accuracy_mean)

In [None]:
# accuracy for the summaries
accuracy_summaries = run_forest(n_estimators_summaries,X_train_summaries,X_test_summaries)
print(accuracy_summaries)