# Kata 2: AutoML Loop & HyperparameterSpace

Let's now take the solution from Kata 1 and add AutoML on top of it.

The notebook already ends with a call to the `AutoML` class. It picks the best model automatically, using accuracy on the validation dataset, then retrains that model on the train and validation datasets together so it can be evaluated on the test dataset. Validation uses a simple train/validation split.
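To make that loop concrete, here is a minimal sketch in plain Python. It does not use Neuraxle: the toy data, the `fit` and `accuracy` helpers, and the `shift` hyperparameter are all hypothetical and exist only to show the order of operations (sample hyperparameters, fit on train, score on validation, refit the winner on train plus validation, score once on test).

```python
import random


def fit(samples, shift):
    """Fit a 1-D threshold classifier: midpoint of the class means, plus `shift`."""
    zeros = [x for x, label in samples if label == 0]
    ones = [x for x, label in samples if label == 1]
    midpoint = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return midpoint + shift  # `shift` plays the role of a hyperparameter


def accuracy(threshold, samples):
    """Fraction of samples for which `x > threshold` matches the label."""
    hits = sum(int(x > threshold) == label for x, label in samples)
    return hits / len(samples)


def make_data(n, rng):
    """Toy dataset: the label is 1 when x is above 0.5."""
    return [(x, int(x > 0.5)) for x in (rng.random() for _ in range(n))]


rng = random.Random(0)
train_val = make_data(200, rng)
test = make_data(100, rng)

# Simple train/validation split: first 80% for training, last 20% for validation.
cut = int(len(train_val) * 0.8)
train, val = train_val[:cut], train_val[cut:]

best_shift, best_val_score = None, -1.0
for trial in range(7):
    shift = rng.uniform(-0.2, 0.2)      # random search: sample a hyperparameter
    model = fit(train, shift)           # fit on the train split only
    val_score = accuracy(model, val)    # select on the validation split
    if val_score > best_val_score:
        best_shift, best_val_score = shift, val_score

# Refit the best configuration on train + validation, then score once on test.
final_model = fit(train_val, best_shift)
test_score = accuracy(final_model, test)
print("best shift:", best_shift, "val:", best_val_score, "test:", test_score)
```

The test set is touched only once, after selection is finished, which is what keeps the final score an honest estimate.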

## The task

Your goal is to add more choices to the pipeline (and more hyperparameters per choice) so that the search can find a really good model.

## Loading the Dataset

In [1]:
import urllib.request  # `import urllib` alone does not guarantee that `urllib.request` is loaded
import os

def download_import(filename):
    with open(filename, "wb") as f:
        # Downloading like that is needed because of Colab operating from a Google Drive folder that is only "shared with you".
        url = 'https://raw.githubusercontent.com/Neuraxio/Kata-Clean-Machine-Learning-From-Dirty-Code/master/{}'.format(filename)
        f.write(urllib.request.urlopen(url).read())

try:
    import google.colab
    download_import("data_loading.py")
    !mkdir data;
    download_import("data/download_dataset.py")
    print("Downloaded .py files: dataset loaders.")
except ImportError:
    # Only a missing `google.colab` module means we are not in a Colab; other errors should surface.
    print("No dynamic .py file download needed: not in a Colab.")

DATA_PATH = "data/"
!pwd && ls
os.chdir(DATA_PATH)
!pwd && ls
!python download_dataset.py
!pwd && ls
os.chdir("..")
!pwd && ls
DATASET_PATH = DATA_PATH + "UCI HAR Dataset/"
print("\n" + "Dataset is now located at: " + DATASET_PATH)

No dynamic .py file download needed: not in a Colab.
/home/alexandre/Documents/Kata-Clean-Machine-Learning-From-Dirty-Code
 cache
 data
 data_loading.py
 __pycache__
 README.md
 requirements.txt
'SOLUTION - Kata 1 - Refactor Dirty ML Code into Pipeline.ipynb'
'SOLUTION - Kata 2 - AutoML Loop and HyperparameterSpace.ipynb'
 time-series-data.jpg
 time-series-data.xcf
'TODO - Kata 1 - Refactor Dirty ML Code into Pipeline.ipynb'
'TODO - Kata 2 - AutoML Loop and HyperparameterSpace.ipynb'
 venv
/home/alexandre/Documents/Kata-Clean-Machine-Learning-From-Dirty-Code/data
 download_dataset.py   source.txt	 'UCI HAR Dataset.zip'
 __MACOSX	      'UCI HAR Dataset'

Downloading...
Dataset already downloaded. Did not download twice.

Extracting...
Dataset already extracted. Did not extract twice.

/home/alexandre/Documents/Kata-Clean-Machine-Learning-From-Dirty-Code/data
 download_dataset.py   source.txt	 'UCI HAR Dataset.zip'
 __MACOSX	      'UCI HAR Dataset'
/home/alexandre/Documents/Kata-Clean-Ma

In [2]:
# install neuraxle if needed:
try:
    import neuraxle
    assert neuraxle.__version__ == '0.3.4'
except (ImportError, AssertionError):
    !pip install neuraxle==0.3.4

In [3]:
# Finally load dataset!
from data_loading import load_all_data
X_train, y_train, X_test, y_test = load_all_data()
print("Dataset loaded!")

Some useful info to get an insight on dataset's shape and normalisation:
(X shape, y shape, every X's mean, every X's standard deviation)
(2947, 128, 9) (2947, 1) 0.09913992 0.39567086
Dataset loaded!


## Let's reuse our pipeline steps as we should have created them in Kata 1:

In [4]:
import numpy as np
from neuraxle.base import BaseStep, NonFittableMixin


class NumpyFFT(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Featurize time series data with FFT.

        :param data_inputs: time series data of 3D shape: [batch_size, time_steps, sensors_readings]
        :return: complex FFT coefficients of 3D shape: [batch_size, fft_bins, sensors_readings]
        """
        transformed_data = np.fft.rfft(data_inputs, axis=-2)
        return transformed_data


class FFTPeakBinWithValue(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will compute peak fft bins (int), and their magnitudes' value (float), to concatenate them.

        :param data_inputs: real magnitudes of an fft. It could be of shape [batch_size, bins, features].
        :return: Two arrays without bins concatenated on feature axis. Shape: [batch_size, 2 * features]
        """
        time_bins_axis = -2
        peak_bin = np.argmax(data_inputs, axis=time_bins_axis)
        peak_bin_val = np.max(data_inputs, axis=time_bins_axis)

        # Notice that here another FeatureUnion could have been used with a joiner:
        transformed = np.concatenate([peak_bin, peak_bin_val], axis=-1)

        return transformed


class NumpyAbs(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will take the element-wise absolute value (the magnitude, for complex FFT output).

        :param data_inputs: array of any shape, e.g. FFT output of shape [batch_size, fft_bins, sensors]
        :return: array of magnitudes with the same shape as the input
        """
        return np.abs(data_inputs)


class NumpyMean(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will featurize data with a mean.

        :param data_inputs: 3D time series of shape [batch_size, time_steps, sensors]
        :return: featurized time series of shape [batch_size, features]
        """
        return np.mean(data_inputs, axis=-2)


class NumpyRavel(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        if data_inputs is not None:
            data_inputs = data_inputs if isinstance(data_inputs, np.ndarray) else np.array(data_inputs)
            return data_inputs.ravel()
        return data_inputs


class NumpyMedian(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will featurize data with a median.

        :param data_inputs: 3D time series of shape [batch_size, time_steps, sensors]
        :return: featurized time series of shape [batch_size, features]
        """
        return np.median(data_inputs, axis=-2)


class NumpyMin(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will featurize data with a min.

        :param data_inputs: 3D time series of shape [batch_size, time_steps, sensors]
        :return: featurized time series of shape [batch_size, features]
        """
        return np.min(data_inputs, axis=-2)


class NumpyMax(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will featurize data with a max.

        :param data_inputs: 3D time series of shape [batch_size, time_steps, sensors]
        :return: featurized time series of shape [batch_size, features]
        """
        return np.max(data_inputs, axis=-2)


## Define some classifiers here and their hyperparam space

You'll want to combine a few classifiers here in the pipeline below.
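As a rough mental model of what a hyperparameter space is, the sketch below samples one configuration from a small made-up space using only the standard library. The lambdas are hypothetical stand-ins for the kinds of distributions used in this notebook (`Choice`, `RandInt`, `LogUniform`, `Boolean`); this is not how Neuraxle implements them.

```python
import math
import random

rng = random.Random(42)

# Each entry maps a hyperparameter name to a zero-argument sampler.
# The names and ranges mirror the kinds used in this kata, for illustration only.
space = {
    "criterion": lambda: rng.choice(["gini", "entropy"]),   # like Choice
    "min_samples_leaf": lambda: rng.randint(2, 5),          # like RandInt, bounds inclusive
    "C": lambda: math.exp(rng.uniform(math.log(0.01), math.log(10.0))),  # like LogUniform
    "fit_intercept": lambda: rng.random() < 0.5,            # like Boolean
}


def sample(space):
    """Draw one value per hyperparameter to build a single trial configuration."""
    return {name: draw() for name, draw in space.items()}


trial_params = sample(space)
print(trial_params)
```

Sampling `C` uniformly in log space spends as many trials between 0.01 and 0.1 as between 1 and 10, which is usually what you want for regularization strengths.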

In [5]:
from neuraxle.hyperparams.distributions import Choice, Boolean
from neuraxle.hyperparams.distributions import RandInt, LogUniform
from neuraxle.hyperparams.space import HyperparameterSpace
from neuraxle.pipeline import Pipeline
from neuraxle.steps.output_handlers import OutputTransformerWrapper
from neuraxle.steps.sklearn import SKLearnWrapper
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier, LogisticRegression
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier


decision_tree_classifier = SKLearnWrapper(
    DecisionTreeClassifier(), HyperparameterSpace({
        'criterion': Choice(['gini', 'entropy']), 'splitter': Choice(['best', 'random']),
        'min_samples_leaf': RandInt(2, 5), 'min_samples_split': RandInt(2, 3), }))  # min_samples_split must be >= 2

extra_tree_classifier = SKLearnWrapper(
    ExtraTreeClassifier(), HyperparameterSpace({
        'criterion': Choice(['gini', 'entropy']), 'splitter': Choice(['best', 'random']),
        'min_samples_leaf': RandInt(2, 5), 'min_samples_split': RandInt(2, 3), }))  # min_samples_split must be >= 2

ridge_classifier = Pipeline([
    OutputTransformerWrapper(NumpyRavel()),
    SKLearnWrapper(RidgeClassifier(), HyperparameterSpace({
        'alpha': Choice([(0.0, 1.0, 10.0), (0.0, 10.0, 100.0)]), 'fit_intercept': Boolean(), 'normalize': Boolean()
    }))
]).set_name('RidgeClassifier')

logistic_regression = Pipeline([
    OutputTransformerWrapper(NumpyRavel()),
    SKLearnWrapper(LogisticRegression(), HyperparameterSpace({
        'C': LogUniform(0.01, 10.0), 'fit_intercept': Boolean(), 'dual': Boolean(),
        'penalty': Choice(['l1', 'l2']), 'max_iter': RandInt(20, 200)
    }))
]).set_name('LogisticRegression')

random_forest_classifier = Pipeline([
    OutputTransformerWrapper(NumpyRavel()),
    SKLearnWrapper(RandomForestClassifier(), HyperparameterSpace({
        'n_estimators': RandInt(50, 600), 'criterion': Choice(['gini', 'entropy']),
        'min_samples_leaf': RandInt(2, 5), 'min_samples_split': RandInt(2, 3),  # min_samples_split must be >= 2
        'bootstrap': Boolean()
    }))
]).set_name('RandomForestClassifier')

## Add your classifiers to the pipeline

In [6]:
from neuraxle.base import Identity
from neuraxle.steps.flow import TrainOnlyWrapper, ChooseOneStepOf
from neuraxle.steps.numpy import NumpyConcatenateInnerFeatures, NumpyShapePrinter, NumpyFlattenDatum
from neuraxle.union import FeatureUnion


pipeline = Pipeline([
    TrainOnlyWrapper(NumpyShapePrinter(custom_message="Input shape before feature union")),
    FeatureUnion([
        Pipeline([
            NumpyFFT(),
            NumpyAbs(),
            FeatureUnion([
                NumpyFlattenDatum(),  # Reshape from 3D to flat 2D: flattening data except on batch size
                FFTPeakBinWithValue()  # Extract 2D features from the 3D FFT bins
            ], joiner=NumpyConcatenateInnerFeatures())
        ]),
        NumpyMean(),
        NumpyMedian(),
        NumpyMin(),
        NumpyMax()
    ], joiner=NumpyConcatenateInnerFeatures()),
    # TODO in kata 2, optional: Add some feature selection right here for the motivated ones:
    #      https://scikit-learn.org/stable/modules/feature_selection.html
    # TODO in kata 2, optional: Add normalization right here (if using other classifiers)
    #      https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
    TrainOnlyWrapper(NumpyShapePrinter(custom_message="Shape after feature union, before classification")),
    # Shape: [batch_size, remade_features]
    ChooseOneStepOf([
        decision_tree_classifier,
        extra_tree_classifier,
        ridge_classifier,
        logistic_regression,
        random_forest_classifier
    ]),
    TrainOnlyWrapper(NumpyShapePrinter(custom_message="Shape at output after classification")),
    # Shape: [batch_size]
    Identity()
])

  "Will rename '{}' because it already exists.".format(class_name))
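A quick way to check the feature union is to count the features each branch should produce. The arithmetic below assumes the per-sample input shape of 128 time steps by 9 sensors printed when the dataset was loaded, and that each branch behaves as its docstring describes.

```python
time_steps, sensors = 128, 9           # per-sample input shape from the dataset printout

fft_bins = time_steps // 2 + 1         # length of a real FFT (rfft) over 128 time steps
flattened_fft = fft_bins * sensors     # NumpyFlattenDatum on [fft_bins, sensors]
peak_features = 2 * sensors            # FFTPeakBinWithValue: peak bin + peak value per sensor
stat_features = 4 * sensors            # NumpyMean, NumpyMedian, NumpyMin, NumpyMax

n_features = flattened_fft + peak_features + stat_features
print(fft_bins, flattened_fft, peak_features, stat_features, n_features)
# 65 585 18 36 639
```

The total of 639 should match the second dimension that `NumpyShapePrinter` reports just before classification.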


## Finally do AutoML! Launch the main AutoML optimization loop.

In [7]:
import shutil 

# Clear the cache if we've already run the AutoML, to start fresh:
cache_folder = 'cache'
if os.path.exists(cache_folder):
    shutil.rmtree(cache_folder)
os.makedirs(cache_folder, exist_ok=True)

In [8]:
from neuraxle.metaopt.auto_ml import AutoML, InMemoryHyperparamsRepository, validation_splitter, \
    RandomSearchHyperparameterSelectionStrategy
from neuraxle.metaopt.callbacks import ScoringCallback
from sklearn.metrics import accuracy_score


auto_ml = AutoML(
    pipeline=pipeline,
    hyperparams_optimizer=RandomSearchHyperparameterSelectionStrategy(),
    validation_split_function=validation_splitter(test_size=0.20),
    scoring_callback=ScoringCallback(accuracy_score, higher_score_is_better=True),
    n_trials=7,
    epochs=1,
    hyperparams_repository=InMemoryHyperparamsRepository(cache_folder=cache_folder),
    refit_trial=True,
)
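To get a feel for what `test_size=0.20` means on this dataset, the sketch below computes the split sizes for the 7352 training rows. It assumes the splitter simply cuts at the floor of 80% of the row count without shuffling; `split_index` is a hypothetical helper, not a Neuraxle function.

```python
import math


def split_index(n_samples, test_size):
    """Index separating train rows (before it) from validation rows (from it on)."""
    return math.floor(n_samples * (1 - test_size))


n_samples = 7352
cut = split_index(n_samples, test_size=0.20)
n_train, n_val = cut, n_samples - cut
print(n_train, n_val)  # 5881 1471
```

Under that assumption, each trial fits on 5881 rows and is scored on 1471, and the final refit uses all 7352 rows.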

Run AutoML: trials are selected on the validation data, then the best model is refitted on all of the train and validation data:

In [9]:
auto_ml = auto_ml.fit(X_train, y_train)
best_pipeline = auto_ml.get_best_model()


trial 1/7
new trial:
{
    "ChooseOneStepOf": {
        "LogisticRegression": {
            "Optional(LogisticRegression)": {
                "SKLearnWrapper_LogisticRegression": {
                    "C": 4.611497229679814,
                    "dual": false,
                    "fit_intercept": false,
                    "max_iter": 105,
                    "penalty": "l1"
                }
            }
        },
        "RandomForestClassifier": {
            "Optional(RandomForestClassifier)": {
                "SKLearnWrapper_RandomForestClassifier": {
                    "bootstrap": true,
                    "criterion": "gini",
                    "min_samples_leaf": 4,
                    "min_samples_split": 2,
                    "n_estimators": 158
                }
            }
        },
        "RidgeClassifier": {
            "Optional(RidgeClassifier)": {
                "SKLearnWrapper_RidgeClassifier": {
                    "alpha": [
                        0.0,


NumpyShapePrinter: (5881, 639) Shape after feature union, before classification
NumpyShapePrinter: (5881,) Shape at output after classification
main train: 1.0
main validation: 0.938137321549966
trial 2/7 score: 0.938137321549966
Trial.from_json({'hyperparams': {'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__enabled': False, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__ccp_alpha': 0.0, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__class_weight': None, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__criterion': 'entropy', 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__max_depth': None, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__max_features': None, 'ChooseOneStepOf__SKLearnWrapper_

NumpyShapePrinter: (5881, 639) Shape after feature union, before classification
NumpyShapePrinter: (5881,) Shape at output after classification
main train: 1.0
main validation: 0.7552685248130524
trial 3/7 score: 0.7552685248130524
Trial.from_json({'hyperparams': {'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__enabled': False, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__ccp_alpha': 0.0, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__class_weight': None, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__criterion': 'gini', 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__max_depth': None, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__max_features': None, 'ChooseOneStepOf__SKLearnWrapper_D

NumpyShapePrinter: (5881, 639) Shape after feature union, before classification
NumpyShapePrinter: (5881,) Shape at output after classification
main train: 1.0
main validation: 0.9422161794697484
trial 4/7 score: 0.9422161794697484
Trial.from_json({'hyperparams': {'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__enabled': False, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__ccp_alpha': 0.0, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__class_weight': None, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__criterion': 'entropy', 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__max_depth': None, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__max_features': None, 'ChooseOneStepOf__SKLearnWrappe

NumpyShapePrinter: (5881, 639) Shape after feature union, before classification
NumpyShapePrinter: (5881,) Shape at output after classification
main train: 1.0
main validation: 0.8891910265125765
trial 5/7 score: 0.8891910265125765
Trial.from_json({'hyperparams': {'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__enabled': True, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__ccp_alpha': 0.0, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__class_weight': None, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__criterion': 'entropy', 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__max_depth': None, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__max_features': None, 'ChooseOneStepOf__SKLearnWrapper

NumpyShapePrinter: (5881, 639) Shape after feature union, before classification
NumpyShapePrinter: (5881,) Shape at output after classification
main train: 1.0
main validation: 0.9374575118966689
trial 6/7 score: 0.9374575118966689
Trial.from_json({'hyperparams': {'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__enabled': False, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__ccp_alpha': 0.0, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__class_weight': None, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__criterion': 'gini', 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__max_depth': None, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__max_features': None, 'ChooseOneStepOf__SKLearnWrapper_D

NumpyShapePrinter: (5881, 639) Shape after feature union, before classification
NumpyShapePrinter: (5881,) Shape at output after classification
main train: 1.0
main validation: 0.8891910265125765
trial 7/7 score: 0.8891910265125765
Trial.from_json({'hyperparams': {'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__enabled': True, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__ccp_alpha': 0.0, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__class_weight': None, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__criterion': 'gini', 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__max_depth': None, 'ChooseOneStepOf__SKLearnWrapper_DecisionTreeClassifier__Optional(SKLearnWrapper_DecisionTreeClassifier)__max_features': None, 'ChooseOneStepOf__SKLearnWrapper_De

NumpyShapePrinter: (7352, 639) Shape after feature union, before classification
NumpyShapePrinter: (7352,) Shape at output after classification
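Selecting the best trial amounts to taking the maximum validation score. The sketch below does this by hand for the scores visible in the log above; the score of trial 1 is not visible in the truncated output, so it is left out.

```python
# Validation accuracies copied from the "trial N/7 score" lines in the log.
validation_scores = {
    2: 0.938137321549966,
    3: 0.7552685248130524,
    4: 0.9422161794697484,
    5: 0.8891910265125765,
    6: 0.9374575118966689,
    7: 0.8891910265125765,
}

# higher_score_is_better=True, so the best trial is the one with the highest score.
best_trial = max(validation_scores, key=validation_scores.get)
print(best_trial, validation_scores[best_trial])  # 4 0.9422161794697484
```

Note how close trials 2, 4 and 6 are: with only 7 trials and a single validation split, the ranking among them is noisy.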


Predict on test data and score:

In [10]:
y_pred = best_pipeline.predict(X_test)

accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
print("Test accuracy score:", accuracy)
assert accuracy > 0.85, "Try again harder!"
# It's getting good on this dataset if you're over 92%. The current code is able to do this.
# Getting to 94% is a very hard task on this dataset. 

Test accuracy score: 0.9104173736002714
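For reference, accuracy is just the fraction of predictions that equal the true label. Here is a minimal pure-Python sketch on made-up labels; the `accuracy` helper and the tiny arrays are hypothetical, and the column of labels is flattened first because the dataset's `y` comes as a column of shape `(n, 1)`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Labels stored as a column, flattened to line up with 1-D predictions.
y_true_column = [[3], [1], [2], [3]]
y_true_flat = [row[0] for row in y_true_column]
y_pred_example = [3, 1, 1, 3]

example_accuracy = accuracy(y_true_flat, y_pred_example)
print(example_accuracy)  # 0.75
```

By the same definition, the test score logged above corresponds to about 2683 correct predictions out of the 2947 test samples.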


## This is the end!

Congratulations. You won.

## Recommended additional readings and learning resources: 

- For more info on clean machine learning, you may want to read [How to Code Neat Machine Learning Pipelines](https://www.neuraxio.com/en/blog/neuraxle/2019/10/26/neat-machine-learning-pipelines.html).
- To reach higher performance, you could use an [LSTM Recurrent Neural Network](https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition) and refactor it into a neat pipeline like the one you've created here, this time by [using TensorFlow in your ML pipeline](https://github.com/Neuraxio/Neuraxle-TensorFlow).
- You may also want to request [more training and coaching for your ML or time series processing projects](https://www.neuraxio.com/en/time-series-solution) from us if you need it.
