# Pipelines in Machine Learning

This is a quick notebook on what pipelines are in general and how to use them in machine learning workflows.

* Sections:
    - What are Pipelines
    - Why they can matter
    - Practical example
    - Further

To follow along with the notebook, please check which dependencies it uses (listed in the following block of code).

In [17]:
# Data manipulation
import pandas as pd
from sklearn import datasets

# Data Processing
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

# Pipeline and parameters selection
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Module to save and load our models
# (sklearn.externals.joblib is deprecated since scikit-learn 0.21 and removed in 0.23)
import joblib

# NOT REQUIRED: Some extras, to suppress warnings in the notebook and to check the Python version
import warnings
import sys
warnings.filterwarnings('ignore')

print(f"Using Python {sys.version.split('(')[0]}\n")
print("With the following packages:")
print(30 * "-")
print(f"Pandas: {pd.__version__}")
print(f"Sklearn: {sklearn.__version__}")
print(30 * "-")

Using Python 3.7.3 

With the following packages:
------------------------------
Pandas: 0.25.0
Sklearn: 0.21.2
------------------------------


You can install them via pip or conda. The commands for that are (for the versions used in the notebook, or above; the quotes keep the shell from treating `>` as a redirect):

* For Pandas: 
```pip install "pandas>=0.25.0"``` or ```conda install "pandas>=0.25.0"```

* For Sklearn:
```pip install "scikit-learn>=0.21.2"``` or ```conda install "scikit-learn>=0.21.2"```

## What are Pipelines

Pipelines are essentially what they sound like: a process in which a set of actions is lined up in a fixed order and executed the same way every single time. The concept might seem redundant, since code in general has to run in a particular sequence before it produces the desired result, so it would be fair to say that we have always been building pipelines.

However, in the context of Machine Learning, and more specifically in this tutorial, pipelines are not a general concept but ready-made classes from existing packages. They allow for less coding and faster production deployment of trained models for prediction, classification, forecasting and similar tasks.

## Why they can matter

Theory alone can be quite dry, but it is important to understand what pipelines can do before diving into making them and evaluating their actual benefit.

Let's break a Machine Learning task into steps, in as basic a form as we can:

* Step 1:
    - Gather and clean data (We assume we already have it)
    - Split data into training and testing samples (or even into train-validate-test)
    
* Step 2:
    - Prepare a Machine Learning Model with leading steps to it (Potentially in the following way):
        - Transform/Scale the data
        - Perform dimensionality reduction
        - Train a classifier, potentially evaluating which hyperparameters work best through Grid Search
    - Evaluate the Model
    - Alter the Model if required
    - Save the Model

* Step 3:
    - Deploy the Model into production    
    
Step 1 is a story of its own, but consider that once you have the data, that step is over and you can move to the next one. Step 2, the one that matters to us, is a bit messy. It essentially requires creating a sequential data processing pipeline, in which data moves from one pre-processing step to another. It might be good to have more control over certain processes by programming and debugging them on the spot, but the real problems might start at Step 3, where every single step the model depends on needs to be repeated, in the same order, for it to work in the desired way. This is where the pre-existing pipelines can come in. By creating a pipeline in Step 2, you just need to carry it over into the deployment stage, rather than re-writing the whole data processing sequence around the trained model.

If it still sounds a bit confusing, don't worry, a real application might help clear some things up.

## Practical example

The practical example will not be too complex, but enough to get a general sense of what is going on with pipelines. For it, the very standard Iris dataset will be used. Let's perform the first step and get the data.

In [2]:
# ----- Load the data (iris) -----
iris = datasets.load_iris()

# To X assign data that describes some flower's attributes 
X = iris.data   

# To y assign data that tells which flower is in records
y = iris.target

In [3]:
# Check the loaded features (independent variables)
pd.DataFrame(X, columns = iris.feature_names).head(5)

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [4]:
# Check the loaded target (dependent variable)
pd.DataFrame(y, columns = ["Flower Group"]).head(5)

|   | Flower Group |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


If you are working on a real-world problem, a good approach would be to actually analyze your data and learn something about it first. Something of the following type could be done:

In [5]:
flowers = {}
for value in y:
    if value not in flowers:
        flowers[value] = 1
    else:
        flowers[value] += 1
        
print(f"We have the following amount of flowers in each group, classified as 0, 1, 2: {flowers}")

We have the following amount of flowers in each group, classified as 0, 1, 2: {0: 50, 1: 50, 2: 50}


Graphical analysis, using libraries like seaborn and matplotlib, could go even further. But the Iris dataset is one of the more basic ones, and spending too much time on it feels like a waste. Therefore, this step is only glanced over here, though in real life it can be as important as anything else done in the whole project.

We can consider that we have all the data we need for Step 2 by now. Therefore, we can split it into training and testing sets, dedicating 15% of the data to testing (23 of the 150 samples) and the rest to training. An important note: `train_test_split` also shuffles the data by default. In addition, our classes are perfectly balanced (50 samples each), so we do not need to worry much about one class dominating either set.

In [6]:
# ----- Split the data -----
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1337)
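The split above relies on shuffling alone to keep the classes balanced. As a minimal sketch of an alternative, the `stratify` argument of `train_test_split` asks for the class proportions to be preserved in both sets. The use of `stratify=y` and the `*_s` variable names are additions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Same test size and seed as in the notebook, plus stratify=y (an addition)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.15, random_state=1337, stratify=y
)

# Count how many samples of each class ended up in each set
train_counts = np.bincount(y_train_s)
test_counts = np.bincount(y_test_s)
print("train:", train_counts)
print("test: ", test_counts)
```

With three equally sized classes this matters little, but on imbalanced data a stratified split is usually the safer default.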

Now let's move to Step 2 and create a pipeline. Let's say we want to first scale our data (standardize features by removing the mean and scaling to unit variance), then apply a PCA (essentially reducing the data dimensionality) and then classify our data with a Decision Tree Classifier that we will train. This can be done in multiple steps: creating a separate object for each procedure, sequentially calling the methods of those objects to transform our data, and passing the output of each one as the input to the next, until we reach our goal.
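To make the inconvenience concrete, here is a rough sketch of that manual, step-by-step approach. The variable names and the `random_state=0` on the tree are assumptions made for repeatability. Everything else mirrors the steps described above.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=1337
)

# Training: fit each object and pass its output on to the next one
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

pca = PCA()
X_train_reduced = pca.fit_transform(X_train_scaled)

tree = DecisionTreeClassifier(random_state=0)  # random_state is an assumption
tree.fit(X_train_reduced, y_train)

# Prediction: the same transformations must be repeated in the same order,
# using transform() (not fit_transform()) so nothing is fitted on test data
X_test_reduced = pca.transform(scaler.transform(X_test))
manual_score = tree.score(X_test_reduced, y_test)
print(f"Manual score: {round(100 * manual_score, 2)}%")
```

Three fitted objects have to be kept around and applied in the right order every time new data arrives, which is exactly the bookkeeping that is easy to get wrong at deployment.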

Luckily, there is a way around this inconvenience. Using the `sklearn.pipeline.Pipeline` class we can do that in a much simpler fashion. An object of this class takes a list of steps, which are applied to our data in a sequential manner. Each step is passed as a tuple, where the first element is a name for the step that we can assign as we want, and the second is the object (an instance of a transformer or estimator class) that takes care of that step. Every step except the last one has to be a transformer, meaning it implements `fit` and `transform`; the last step can be any estimator, such as our classifier. Data transforming classes can also be custom, but they have to be made in the same manner as most of the sklearn ones.

In the following example a pipeline is created and fitted to our training data, which means that the training of the model happens inside the pipeline.

In [7]:
# ----- Build a pipeline -----
pipe = Pipeline([('scaler', StandardScaler()),        # Scale the data
                 ('pca', PCA()),                      # Apply a PCA
                 ('dt', DecisionTreeClassifier())],   # Run a Decision Tree Classifier
                 verbose = 0)              
# Apply the data for training
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('pca',
                 PCA(copy=True, iterated_power='auto', n_components=None,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('dt',
                 DecisionTreeClassifier(class_weight=None, criterion='gini',
                                        max_depth=None, max_features=None,
                                        max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        presort=False, random_state=None,
                                        splitter='best'))],
         verbose=0)
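The names given to the steps are not just labels. As a small sketch, a fitted step can be reached through `named_steps` and inspected on its own. The `demo_pipe` name and the choice to look at the PCA variance ratios are illustrative assumptions, not something the original notebook does.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=1337
)

demo_pipe = Pipeline([('scaler', StandardScaler()),
                      ('pca', PCA()),
                      ('dt', DecisionTreeClassifier())])
demo_pipe.fit(X_train, y_train)

# Reach a fitted step through the name it was given in the pipeline
fitted_pca = demo_pipe.named_steps['pca']
print(fitted_pca.explained_variance_ratio_)  # share of variance per component

# The fitted steps can also be applied by hand, one after another
scaled = demo_pipe.named_steps['scaler'].transform(X_train)
reduced = fitted_pca.transform(scaled)
print(reduced.shape)
```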

In [8]:
# ----- See how well our default learner performs -----
print(f"Current score: {round(100*pipe.score(X_test, y_test), 2)}%")

Current score: 95.65%


Our pipeline holds a classifier that can classify data with fairly good accuracy. But there is a problem. We have not set a single hyperparameter and used everything in our pipeline with its default setup. Luckily, we can perform a grid search and see which hyperparameters fit best for each of the objects in the pipeline.

Let's create a dictionary that will allow us to do that.

Each key in the dictionary needs to be the name of the parameter that we are setting. It is built in the following way: "the name we assigned to the step in our pipeline" + "\_\_" + "the hyperparameter name the class uses". Its value is the list of hyperparameter values we want to experiment with. Consider the following example:

In [9]:
# ----- Implement a Grid Search to test more parameters and find best ones -----
param_grid = {
    "pca__n_components": [1, 2, 3, 4],
    "dt__criterion": ["gini", "entropy"],      # default: gini
    "dt__max_depth": [1, 2, 3, 4, 5]           # default: None
}

If you do not know the names of the parameters you can experiment with, either __a)__ look into the documentation of the classes used in the pipeline, or __b)__ call "Class().get_params().keys()", which will display all the parameters you can modify. Though with the second approach you might still miss information on what type of values those parameters take. Consider the following:

In [10]:
# ----- Parameters to modify for the PCA procedure -----
PCA().get_params().keys()

dict_keys(['copy', 'iterated_power', 'n_components', 'random_state', 'svd_solver', 'tol', 'whiten'])

In [11]:
# ----- Parameters to modify for the Decision Tree Classifier -----
DecisionTreeClassifier().get_params().keys()

dict_keys(['class_weight', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_impurity_split', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'presort', 'random_state', 'splitter'])
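The same `get_params()` call also works on a pipeline, where it returns the parameter names already prefixed with the step names. A minimal sketch, assuming a pipeline built like the one in this notebook (the `demo_pipe` and `grid_keys` names are illustrative):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

demo_pipe = Pipeline([('scaler', StandardScaler()),
                      ('pca', PCA()),
                      ('dt', DecisionTreeClassifier())])

# Keep only the keys of the form "<step name>__<hyperparameter>"
grid_keys = sorted(key for key in demo_pipe.get_params() if "__" in key)
for key in grid_keys:
    print(key)
```

The printed names can be used directly as keys of the grid search dictionary.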

After you have prepared your set of parameters to experiment with, they can be passed to the GridSearchCV class together with the pipeline, and the search can be executed and evaluated:

In [12]:
search = GridSearchCV(pipe, param_grid)
# ----- Initiate the Search and Check Results -----
search.fit(X_train, y_train)
print("\n")
print(f"Best parameter (CV score = {round(100*search.best_score_, 2)}%)")
print(f"Best Parameters for the job: {search.best_params_}")



Best parameter (CV score = 96.06%)
Best Parameters for the job: {'dt__criterion': 'gini', 'dt__max_depth': 5, 'pca__n_components': 3}


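With the optimal hyperparameter settings found, we can expect the predictive power of the model in our pipeline to increase once they are set. By default `GridSearchCV` already does that for us: it refits the pipeline on the whole training set with the best parameters and exposes the result as `search.best_estimator_`.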
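The best combination is only part of the picture. As a sketch, every combination that was tried can be inspected through the `cv_results_` attribute of the search. The `demo_*` names, the explicit `cv=5` and the `random_state=0` on the tree are assumptions made here; the notebook cell relies on the library defaults, so the numbers will not necessarily match the output above.

```python
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=1337
)

demo_pipe = Pipeline([('scaler', StandardScaler()),
                      ('pca', PCA()),
                      ('dt', DecisionTreeClassifier(random_state=0))])
demo_grid = {
    "pca__n_components": [1, 2, 3, 4],
    "dt__criterion": ["gini", "entropy"],
    "dt__max_depth": [1, 2, 3, 4, 5],
}

# cv=5 is set explicitly here (an assumption, see the note above)
demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5)
demo_search.fit(X_train, y_train)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)
columns = ["rank_test_score", "mean_test_score", "std_test_score", "params"]
print(results.sort_values("rank_test_score")[columns].head(5))
```

Because the whole pipeline is passed to the search, the scaler and the PCA are re-fitted on the training part of every fold, so the held-out part of a fold never influences the preprocessing.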

Now the whole pipeline can be saved with the optimized settings (unless we want to check for more things to work on) and deployed into production. To do that, we can either pickle the model or use joblib, the approach commonly used with sklearn:

In [13]:
# ----- Save the model with optimal parameters under the name of iris_classifier_pipeline ----- 
joblib.dump(search.best_estimator_, "iris_classifier_pipeline.pkl")

['iris_classifier_pipeline.pkl']

In [14]:
# ----- Load the previously saved model that can be deployed -----
loaded_pipeline = joblib.load("iris_classifier_pipeline.pkl")

In [15]:
# ----- Evaluate if it's the model with optimal parameters -----
loaded_pipeline

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('pca',
                 PCA(copy=True, iterated_power='auto', n_components=3,
                     random_state=None, svd_solver='auto', tol=0.0,
                     whiten=False)),
                ('dt',
                 DecisionTreeClassifier(class_weight=None, criterion='gini',
                                        max_depth=5, max_features=None,
                                        max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        presort=False, random_state=None,
                                        splitter='best'))],
         verbose=0)

In [16]:
# ----- Use the loaded model to predict previously unseen data -----
predictions = loaded_pipeline.predict(X_test)

# ----- Compare those predictions to the true labels -----
answers = (predictions == y_test)

# ----- Evaluate how many answers were correct -----
accuracy = round(100 * (sum(answers) / len(answers)), 2)
print(f"Accuracy of our model is: {accuracy}%")

Accuracy of our model is: 100.0%


Accuracy actually increased compared to the default pipeline (100.0% against 95.65%), but keep in mind that the test set holds only 23 samples, so this difference comes down to a single flower. Also, do not forget that during the grid search only the training data was evaluated. Both models performed almost equally well on the testing set; with a bigger data corpus the tuned model is the one more likely to hold up in the long run, though a test set this small cannot confirm that.
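One way to get a less noisy comparison than a single small test set is cross-validation over the whole dataset. The following is only a sketch under stated assumptions: the `build_pipe` helper, the `demo_grid` name, the 5-fold setting and the nested use of `GridSearchCV` inside `cross_val_score` are choices made here, not part of the original notebook.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = datasets.load_iris(return_X_y=True)


def build_pipe():
    """Hypothetical helper: returns a fresh, unfitted copy of the pipeline."""
    return Pipeline([('scaler', StandardScaler()),
                     ('pca', PCA()),
                     ('dt', DecisionTreeClassifier(random_state=0))])


demo_grid = {
    "pca__n_components": [1, 2, 3, 4],
    "dt__criterion": ["gini", "entropy"],
    "dt__max_depth": [1, 2, 3, 4, 5],
}

# Default pipeline: 5-fold cross-validated accuracy
default_scores = cross_val_score(build_pipe(), X, y, cv=5)

# Tuned pipeline: the grid search runs inside each outer fold (nested CV),
# so the hyperparameters are never chosen using the data they are scored on
tuned_search = GridSearchCV(build_pipe(), demo_grid, cv=5)
tuned_scores = cross_val_score(tuned_search, X, y, cv=5)

print(f"Default: {default_scores.mean():.3f} +/- {default_scores.std():.3f}")
print(f"Tuned:   {tuned_scores.mean():.3f} +/- {tuned_scores.std():.3f}")
```

If the two means differ by less than their spread across folds, the data does not really separate the two models.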

## Further

What this tutorial does not consider are cases when datasets are really huge and a need for data loaders might be present (something that would gradually load data into the pipeline, for both training and predictions), as well as how custom data transforming classes could be developed. This is something that I wanted to keep out, to not make the tutorial confusing and to keep it simple. But those aspects are definitely worth considering in the future.

For now, I wanted to make something to share and for personal learning, as I noticed I had never actually worked with this approach to pipelines, which can in fact be quite beneficial and convenient, at least for small-scale applications. Bigger ones might require a more sophisticated set of tools and skills.