# Kata 1: Refactor Dirty ML Code into a Pipeline

Let's convert dirty machine learning code into clean code using a [Pipeline](https://stackoverflow.com/a/60303302/2476920) - which is the [Pipe and Filter Design Pattern](https://docs.microsoft.com/en-us/azure/architecture/patterns/pipes-and-filters) for Machine Learning. 

At first, you may wonder *why* using this design pattern is worthwhile. The benefit becomes clear in the 2nd [Clean Machine Learning Kata](https://github.com/Neuraxio/Kata-Clean-Machine-Learning-From-Dirty-Code), where you'll do AutoML. Pipelines let you manage the hyperparameters and the hyperparameter space on a per-step basis. They also give you a sound code structure for training, saving, reloading, and deploying with any library you want, without hitting a wall when it comes to serializing your whole trained pipeline for production.


## The Dataset

It'll be downloaded automatically for you in the code below. 

We're using a Human Activity Recognition (HAR) dataset captured using smartphones. The [dataset](https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones) can be found on the UCI Machine Learning Repository. 

### The task

Classify the type of movement amongst six categories from the phones' sensor data:
- WALKING,
- WALKING_UPSTAIRS,
- WALKING_DOWNSTAIRS,
- SITTING,
- STANDING,
- LAYING.

### Video dataset overview

Follow this link to see a video of the 6 activities recorded in the experiment with one of the participants:

<p align="center">
  <a href="http://www.youtube.com/watch?feature=player_embedded&v=XOEN9W05_4A
" target="_blank"><img src="http://img.youtube.com/vi/XOEN9W05_4A/0.jpg" 
alt="Video of the experiment" width="400" height="300" border="10" /></a>
  <a href="https://youtu.be/XOEN9W05_4A"><center>[Watch video]</center></a>
</p>

### Details about the input data

The dataset's description goes like this:

> The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. 

Reference: 
> Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013.

That said, I will use the almost raw data: the only preprocessing is that the gravity effect has been filtered out of the accelerometer signal, which provides another 3D feature as an input to help learning. If you ever want to extract the gravity yourself, you could use the following [Butterworth Low-Pass Filter (LPF)](https://github.com/guillaume-chevalier/filtering-stft-and-laplace-transform) and edit it to have a cutoff frequency of 0.3 Hz, which is a good frequency for activity recognition from body sensors.
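
The gravity separation described above can be sketched with SciPy. This is a minimal illustration and not the dataset authors' exact procedure: the filter order of 3, the function name `split_gravity_and_body`, and the synthetic signal are assumptions. The 50 Hz sampling rate follows from 128 readings per 2.56 s window, and the 0.3 Hz cutoff comes from the dataset description.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

SAMPLING_RATE_HZ = 50.0  # 128 readings / 2.56 s
CUTOFF_HZ = 0.3          # cutoff quoted in the dataset description


def split_gravity_and_body(total_acc, order=3):
    """Split acceleration of shape [time_steps, 3] into (gravity, body).

    The filter order is an assumption made for this sketch.
    """
    # Low-pass Butterworth filter, expressed as second-order sections for stability.
    sos = butter(order, CUTOFF_HZ, btype="low", fs=SAMPLING_RATE_HZ, output="sos")
    # Zero-phase filtering along the time axis keeps the slow (gravity) component.
    gravity = sosfiltfilt(sos, total_acc, axis=0)
    body = total_acc - gravity
    return gravity, body


# Synthetic data (hypothetical): 1 g of gravity on the z axis plus a 5 Hz body motion.
t = np.arange(3000) / SAMPLING_RATE_HZ
total_acc = np.zeros((t.size, 3))
total_acc[:, 2] = 1.0 + np.sin(2 * np.pi * 5.0 * t)

gravity, body = split_gravity_and_body(total_acc)
print(gravity[1500], body[1500])
```

Away from the edges of the recording, the recovered gravity stays close to the constant 1 g, while the 5 Hz oscillation ends up in the body component.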

Here is what the 3D data cube looks like. We'll have a train and a test data cube, and might create validation data cubes as well:

![](time-series-data.jpg)

So we have 3D data of shape `[batch_size, time_steps, features]`. If this and the above is still unclear to you, you may want to [learn more on the 3D shape of time series data](https://www.quora.com/What-do-samples-features-time-steps-mean-in-LSTM/answer/Guillaume-Chevalier-2).
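
To illustrate how a continuous recording can become such a cube, here is a minimal NumPy sketch of fixed-width sliding windows with 50% overlap (128 readings per window, as in the dataset description). The function name `sliding_windows` and the random recording are hypothetical; the dataset used here already ships pre-windowed.

```python
import numpy as np


def sliding_windows(signal, window_size=128, overlap=0.5):
    """Cut a [total_steps, features] recording into [n_windows, window_size, features]."""
    step = int(window_size * (1 - overlap))  # 64 readings for a 50% overlap
    starts = range(0, signal.shape[0] - window_size + 1, step)
    return np.stack([signal[s:s + window_size] for s in starts])


# Hypothetical recording: 1000 time steps of 9 sensor channels.
recording = np.random.RandomState(0).randn(1000, 9)
cube = sliding_windows(recording)
print(cube.shape)  # (14, 128, 9)
```

With a 50% overlap, the second half of each window is the first half of the next one, so consecutive rows of the cube share 64 readings.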

## Loading the Dataset

In [1]:
import os

DATA_PATH = "data/"
# Note: Linux bash commands start with a "!" inside those "ipython notebook" cells
!pwd && ls
os.chdir(DATA_PATH)
!pwd && ls
!python download_dataset.py
!pwd && ls
os.chdir("..")
!pwd && ls
DATASET_PATH = DATA_PATH + "UCI HAR Dataset/"
print("\n" + "Dataset is now located at: " + DATASET_PATH)

/home/gui/Documents/GIT/Kata-Clean-Machine-Learning-From-Dirty-Code
 data						        README.md
 data_loading.py				        requirements.txt
'Kata 1 - Refactor Dirty ML Code into Pipeline.ipynb'   time-series-data.jpg
 other.py					        time-series-data.xcf
 __pycache__					        venv
/home/gui/Documents/GIT/Kata-Clean-Machine-Learning-From-Dirty-Code/data
 download_dataset.py   source.txt	 'UCI HAR Dataset.zip'
 __MACOSX	      'UCI HAR Dataset'

Downloading...
Dataset already downloaded. Did not download twice.

Extracting...
Dataset already extracted. Did not extract twice.

/home/gui/Documents/GIT/Kata-Clean-Machine-Learning-From-Dirty-Code/data
 download_dataset.py   source.txt	 'UCI HAR Dataset.zip'
 __MACOSX	      'UCI HAR Dataset'
/home/gui/Documents/GIT/Kata-Clean-Machine-Learning-From-Dirty-Code
 data						        README.md
 data_loading.py				        requirements.txt
'Kata 1 - Refactor Dirty ML Code into Pipeline.ipynb'   time-series-data.jpg
 other.py					        time-

In [2]:
from data_loading import load_all_data
X_train, y_train, X_test, y_test = load_all_data()

Some useful info to get an insight on dataset's shape and normalisation:
(X shape, y shape, every X's mean, every X's standard deviation)
(2947, 128, 9) (2947, 1) 0.09913992 0.39567086


## Cleaning Up: Define Pipeline Steps and a Pipeline

The kata is to fill in the classes below and then use them properly in the pipeline that follows.

In [7]:
import numpy as np
from neuraxle.base import BaseStep, NonFittableMixin
from neuraxle.steps.numpy import NumpyConcatenateInnerFeatures, NumpyShapePrinter, NumpyFlattenDatum

class NumpyFFT(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Featurize time series data with FFT.

        :param data_inputs: time series data of 3D shape: [batch_size, time_steps, sensors_readings]
        :return: complex FFT coefficients of 3D shape: [batch_size, n_bins, sensors_readings], where n_bins = time_steps // 2 + 1
        """
        transformed_data = np.fft.rfft(data_inputs, axis=-2)
        return transformed_data


class FFTPeakBinWithValue(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will compute peak fft bins (int), and their magnitudes' value (float), to concatenate them.

        :param data_inputs: real magnitudes of an fft. It could be of shape [batch_size, bins, features].
        :return: Two arrays without bins concatenated on feature axis. Shape: [batch_size, 2 * features]
        """
        time_bins_axis = -2
        peak_bin = np.argmax(data_inputs, axis=time_bins_axis)
        peak_bin_val = np.max(data_inputs, axis=time_bins_axis)
        
        # Notice that here another FeatureUnion could have been used with a joiner:
        transformed = np.concatenate([peak_bin, peak_bin_val], axis=-1)
        
        return transformed


class NumpyAbs(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will take the element-wise absolute value, e.g. the magnitude of complex FFT coefficients.

        :param data_inputs: data of any shape, for instance FFT coefficients of shape [batch_size, bins, features]
        :return: absolute values, of the same shape as the input
        """
        return np.abs(data_inputs)


class NumpyMean(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will featurize data with a mean.

        :param data_inputs: 3D time series of shape [batch_size, time_steps, sensors]
        :return: featurized time series of shape [batch_size, features]
        """
        return np.mean(data_inputs, axis=-2)


class NumpyMedian(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will featurize data with a median.

        :param data_inputs: 3D time series of shape [batch_size, time_steps, sensors]
        :return: featurized time series of shape [batch_size, features]
        """
        return np.median(data_inputs, axis=-2)


class NumpyMin(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will featurize data with a min.

        :param data_inputs: 3D time series of shape [batch_size, time_steps, sensors]
        :return: featurized time series of shape [batch_size, features]
        """
        return np.min(data_inputs, axis=-2)


class NumpyMax(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        """
        Will featurize data with a max.

        :param data_inputs: 3D time series of shape [batch_size, time_steps, sensors]
        :return: featurized time series of shape [batch_size, features]
        """
        return np.max(data_inputs, axis=-2)
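
To make the FFT steps above concrete, here is a plain-NumPy sketch, independent of Neuraxle, of what the FFT, absolute value, and peak-bin steps compute. The synthetic sine wave and the variable names are illustrative assumptions, not part of the kata.

```python
import numpy as np

# Hypothetical batch: 4 examples, 128 time steps, 9 channels,
# each channel holding a sine wave with exactly 10 cycles per window.
t = np.arange(128)
wave = np.sin(2 * np.pi * 10 * t / 128)
batch = np.tile(wave[None, :, None], (4, 1, 9))

# FFT over the time axis, then magnitudes: 128 // 2 + 1 = 65 frequency bins.
spectrum = np.abs(np.fft.rfft(batch, axis=-2))  # shape [4, 65, 9]

# Peak bin index and its magnitude for every channel.
peak_bin = np.argmax(spectrum, axis=-2)         # shape [4, 9]
peak_val = np.max(spectrum, axis=-2)            # shape [4, 9]
peaks = np.concatenate([peak_bin, peak_val], axis=-1)

print(spectrum.shape, peaks.shape)  # (4, 65, 9) (4, 18)
```

For this input, every peak lands in bin 10, matching the 10 cycles per window, with a magnitude of 64 (half the window length for a unit-amplitude sine).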


Let's now create the Pipeline with the code:

In [8]:
from neuraxle.base import Identity
from neuraxle.pipeline import Pipeline
from neuraxle.steps.flow import TrainOnlyWrapper
from neuraxle.union import FeatureUnion
from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
    # ToNumpy(),  # Cast type in case it was a list.
    TrainOnlyWrapper(NumpyShapePrinter(
        # For debugging, do this print at train-time only:
        custom_message="Input shape before feature union"
        # Shape: [batch_size, time_steps, sensor_features]
    )),
    FeatureUnion([
        Pipeline([
            NumpyFFT(),
            NumpyAbs(),
            FeatureUnion([
                NumpyFlattenDatum(),  # Reshape from 3D to flat 2D: flattening data except on batch size
                FFTPeakBinWithValue()  # Extract 2D features from the 3D FFT bins
            ], joiner=NumpyConcatenateInnerFeatures())
        ]),
        NumpyMean(),
        NumpyMedian(),
        NumpyMin(),
        NumpyMax()
    ], joiner=NumpyConcatenateInnerFeatures()),  # The joiner will here join like this: np.concatenate([...], axis=-1)
    TrainOnlyWrapper(NumpyShapePrinter(
        custom_message="Shape after feature union, before classification"
        # Shape: [batch_size, remade_features]
    )),
    # TODO in kata 2: Add some feature selection right here for the motivated ones:
    #      https://scikit-learn.org/stable/modules/feature_selection.html
    # TODO in kata 2: Add normalization right here (if using other classifiers)
    #      https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.normalize.html
    DecisionTreeClassifier(),
    # TODO in kata 2: Try other classifiers different than the DecisionTreeClassifier just above:
    #      https://scikit-learn.org/stable/modules/multiclass.html
    TrainOnlyWrapper(NumpyShapePrinter(
        custom_message="Shape at output after classification"
        # Shape: [batch_size]
    )), Identity()
])
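
As a rough mental model of the feature union above, each branch receives the same input and the joiner concatenates the branch outputs on the last axis. The sketch below reproduces this with plain NumPy. The helper names `feature_union` and `fft_branch` are hypothetical, and the flattening is assumed to keep the batch axis and merge the rest, as the comment in the pipeline says.

```python
import numpy as np


def feature_union(data_inputs, transforms):
    """Hypothetical helper: apply every transform to the same input, join on the last axis."""
    return np.concatenate([f(data_inputs) for f in transforms], axis=-1)


def fft_branch(x):
    """FFT magnitudes, flattened, joined with the peak bins and peak values."""
    spectrum = np.abs(np.fft.rfft(x, axis=-2))      # [batch, 65, 9]
    flat = spectrum.reshape(spectrum.shape[0], -1)  # [batch, 585]
    peaks = np.concatenate(
        [np.argmax(spectrum, axis=-2), np.max(spectrum, axis=-2)], axis=-1
    )                                               # [batch, 18]
    return np.concatenate([flat, peaks], axis=-1)   # [batch, 603]


# Hypothetical stand-in for the real data: 5 examples, 128 time steps, 9 channels.
X = np.random.RandomState(0).randn(5, 128, 9)
features = feature_union(X, [
    fft_branch,
    lambda x: np.mean(x, axis=-2),
    lambda x: np.median(x, axis=-2),
    lambda x: np.min(x, axis=-2),
    lambda x: np.max(x, axis=-2),
])
print(features.shape)  # (5, 639)
```

The count adds up as 65 bins × 9 channels = 585 flattened FFT magnitudes, plus 18 peak features, plus 4 × 9 = 36 summary statistics, for 639 features per example.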




## Test Your Code: Make the Tests Pass

The 3rd test is the real deal.

In [9]:
from sklearn.metrics import accuracy_score


def _test_is_pipeline(pipeline):
    assert isinstance(pipeline, Pipeline)


def _test_has_all_data_preprocessors(pipeline):
    assert "DecisionTreeClassifier" in pipeline
    assert "FeatureUnion" in pipeline
    assert "Pipeline" in pipeline["FeatureUnion"]
    assert "NumpyMean" in pipeline["FeatureUnion"]
    assert "NumpyMedian" in pipeline["FeatureUnion"]
    assert "NumpyMin" in pipeline["FeatureUnion"]
    assert "NumpyMax" in pipeline["FeatureUnion"]


def _test_pipeline_words_and_has_ok_score(pipeline):
    pipeline = pipeline.fit(X_train, y_train)
    
    y_pred = pipeline.predict(X_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    print("Test accuracy score:", accuracy)
    assert accuracy > 0.7


if __name__ == '__main__':
    tests = [_test_is_pipeline, _test_has_all_data_preprocessors, _test_pipeline_words_and_has_ok_score]
    for t in tests:
        try:
            t(pipeline)
            print("==> Test '{}(pipeline)' succeed!".format(t.__name__))
        except Exception as e:
            print("==> Test '{}(pipeline)' failed:".format(t.__name__))
            import traceback
            print(traceback.format_exc())


==> Test '_test_is_pipeline(pipeline)' succeed!
==> Test '_test_has_all_data_preprocessors(pipeline)' succeed!
NumpyShapePrinter: (7352, 128, 9) Input shape before feature union
NumpyShapePrinter: (7352, 639) Shape after feature union, before classification
NumpyShapePrinter: (7352,) Shape at output after classification
Test accuracy score: 0.8652867322701052
==> Test '_test_pipeline_words_and_has_ok_score(pipeline)' succeed!
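
The accuracy checked by the third test is the fraction of predictions that equal the true labels. Below is a minimal NumPy sketch of that computation, with a hypothetical `accuracy` helper and toy labels. It flattens both arrays first, because the labels here are loaded as a column of shape `(n, 1)` while the predictions have shape `(n,)`, and comparing those two shapes directly would broadcast to an `(n, n)` matrix.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of matching labels, after flattening both arrays to 1D."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    return float(np.mean(y_true == y_pred))


# Toy labels (hypothetical): a column vector of targets and flat predictions.
y_true = np.array([[0], [1], [2], [2]])
y_pred = np.array([0, 1, 1, 2])

print(accuracy(y_true, y_pred))   # 0.75
print((y_true == y_pred).shape)   # (4, 4): the broadcasting pitfall without ravel()
```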


## Good job!

Your code should now be clean after making the tests pass.

## You're ready for [part 2](https://github.com/Neuraxio/Kata-Clean-Machine-Learning-From-Dirty-Code).

You should now be ready for the 2nd [Clean Machine Learning Kata](https://github.com/Neuraxio/Kata-Clean-Machine-Learning-From-Dirty-Code). Note that the solutions are available in the repository above as well. 

## Recommended additional readings and learning resources: 

- For more info on clean machine learning, you may want to read [How to Code Neat Machine Learning Pipelines](https://www.neuraxio.com/en/blog/neuraxle/2019/10/26/neat-machine-learning-pipelines.html).
- To reach higher performance, you could use an [LSTM Recurrent Neural Network](https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition) and refactor it into a neat pipeline like the one you've created here, this time by [using TensorFlow in your ML pipeline](https://github.com/Neuraxio/Neuraxle-TensorFlow).
- You may also want to request [more training and coaching for your ML or time series processing projects](https://www.neuraxio.com/en/time-series-solution) from us if you need it.
