[Pipelines](https://algorithmia.com/blog/ml-pipeline) have been growing in popularity, and now they are everywhere you turn in data science, ranging from simple data pipelines to complex machine learning pipelines. The overarching purpose of a pipeline is to streamline processes in data analytics and machine learning.

<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive"> Why Machine Learning Pipelines? </h2>

[The key benefit](https://www.oreilly.com/library/view/building-machine-learning/9781492053187/ch01.html) of machine learning pipelines lies in the automation of the model life cycle steps. When new training data becomes available, a workflow which includes data validation, preprocessing, model training, analysis, and deployment should be triggered. We have observed too many data science teams manually going through these steps, which is costly and also a source of errors.

Let’s cover some details of the benefits of machine learning pipelines:
1. Ability to focus on new models, not maintaining existing models. Many data scientists spend their days keeping previously developed models up to date. They run scripts manually to preprocess their training data, they write one-off deployment scripts, or they manually tune their models. Automated pipelines allow data scientists to develop new models, the fun part of their job. Ultimately, this will lead to higher job satisfaction and retention in a competitive job market.

2. Prevention of bugs. In manual machine learning workflows, a common source of bugs is a change in the preprocessing step after a model was trained. In this case, we would deploy a model with different preprocessing instructions than those it was trained with. These bugs can be really difficult to debug because the model still returns predictions; they are simply incorrect. With automated workflows, these errors can be prevented.

3. Standardization. Standardized machine learning pipelines improve the experience of a data science team. Due to the standardized setups, data scientists can be onboarded quickly or move across teams and find the same development environments. This improves efficiency and reduces the time spent getting set up on a new project. The time investment of setting up machine learning pipelines can also lead to an improved retention rate.

In this first notebook I will explain how to build a simple machine learning pipeline. In the upcoming notebooks I will start building on it to develop much more complex pipelines step by step.

<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive"> Package imports </h2>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # basic plots
import seaborn as sns # advanced plots

from sklearn.model_selection import train_test_split #split the data
from sklearn.preprocessing import StandardScaler #scale the data
from sklearn.neighbors import KNeighborsClassifier #The KNN model
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score #Evaluation metrics 
from sklearn.pipeline import Pipeline #scikit-learn pipeline
from sklearn.model_selection import GridSearchCV #cross validation

<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive"> Read the data and run quality checks </h2>

In [None]:
# reading the data and displaying the first 5 rows
df = pd.read_csv("../input/heart-disease-uci/heart.csv")
df.head()

In [None]:
# checking if features are in the right type
df.info()

In [None]:
# any duplicates?
df[df.duplicated()]

In [None]:
# drop the duplicated row
df1 = df.drop_duplicates()

In [None]:
# re-check: any duplicates?
df1[df1.duplicated()]

In [None]:
# any nulls? 
nulls = df.isna().sum() #count null values in each column
df_nulls = pd.DataFrame(nulls) # convert the result into a dataframe
df_nulls.transpose() # transpose the dataframe and print the result
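
The duplicate and null checks above can be tried on a tiny frame first. This is a minimal sketch on made-up values (the `toy` frame is invented for illustration and is not part of the heart-disease data):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [63, 37, 37, 41],
    "chol": [233.0, 250.0, 250.0, np.nan],
})

# duplicated() flags every repeat of an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence and preserves index labels
toy_unique = toy.drop_duplicates()

# count of missing values per column
nulls_per_column = toy_unique.isna().sum()

print("duplicates:", n_duplicates)
print(nulls_per_column)
```

Note that `drop_duplicates()` does not reset the index, so row labels in the cleaned frame still refer to the original rows.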

In [None]:
# Any outliers?
int_vars = df1[["age", "trestbps", "chol", "thalach", "target"]]
sns.pairplot(int_vars, hue = "target")
plt.show()

In [None]:
# Any outliers in oldpeak, split by target?
plt.figure(figsize = (8, 4), dpi = 100)
sns.boxplot(data = df1, y = "oldpeak", x = "target")
plt.show()

In [None]:
# box plots of oldpeak, ca, thal and slope, one feature per panel
fig, axes = plt.subplots(2, 2, figsize=(10,5), dpi = 100)

for ax, col in zip(axes.flat, ["oldpeak", "ca", "thal", "slope"]):
    sns.boxplot(ax = ax, data = df1, y = col)
    ax.set_xlabel(None)
    ax.set_ylabel(None)
    ax.set_title(col)

plt.tight_layout()
plt.subplots_adjust(hspace=0.5)

<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive"> Result of the quality checks </h2>

1. The data has no nulls
2. All features are in the right type
3. There was one duplicate row and we dropped it
4. It has some outliers: 
   - chol > 500
   - oldpeak > 4
   - ca > 2
   - thal < 1

In what follows I will detect and remove those outliers.

In [None]:
# detect outlier: chol > 500
df1[df1["chol"] > 500]

In [None]:
# drop outlier: chol > 500
out_index = df1[df1["chol"] > 500].index[0]
df1 = df1.drop(out_index, axis = 0)

In [None]:
# check if it is already dropped: chol > 500
df1[df1["chol"] > 500]

In [None]:
# detect outlier: (oldpeak > 4) & (ca > 2)
df1[(df1["oldpeak"] > 4) & (df1["ca"] > 2)]

In [None]:
# drop outlier: (oldpeak > 4) & (ca > 2)
# 204, 250 and 291 are the row labels flagged by the previous cell
df1 = df1.drop([204, 250, 291], axis = 0)

In [None]:
# check if it is already dropped: (oldpeak > 4) & (ca > 2)
df1[(df1["oldpeak"] > 4) & (df1["ca"] > 2)]

In [None]:
# number of dropped rows
df.shape[0] - df1.shape[0]
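
The hard-coded row labels used above only make sense for this exact CSV. As an alternative, here is a minimal sketch of the same kind of clean-up done with boolean masks. The `toy` frame and its values are invented for illustration; only the two threshold rules are taken from this notebook:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "chol": [230, 564, 250, 199, 310],
    "oldpeak": [1.0, 1.6, 6.2, 0.0, 4.4],
    "ca": [0, 0, 3, 1, 3],
})

# Rows matching either outlier rule used in this notebook
outlier_mask = (toy["chol"] > 500) | ((toy["oldpeak"] > 4) & (toy["ca"] > 2))

# Keep everything that is not flagged; index labels are preserved
toy_clean = toy[~outlier_mask]

print(f"dropped {int(outlier_mask.sum())} rows, kept {len(toy_clean)}")
```

Filtering with a mask avoids copying row labels by hand, so the same cell keeps working if the input file changes.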

>Now our data is clean and ready for building our machine learning pipeline.

<h2 style="background-color:#f15a39; padding: 20px; font-family:cursive"> Building a machine learning pipeline from scratch </h2>

**Follow along very carefully here! We use very specific string codes AND variable names here so that everything matches up correctly. This is not a case where you can easily swap out variable names for whatever you want!**

We'll use a Pipeline object to set up a workflow of operations:

1. Scale Data
2. Create Model on Scaled Data

**How does the Scaler work inside a Pipeline with CV? Is scikit-learn "smart" enough to understand .fit() on train vs .transform() on train and test?**

**Yes! Scikit-learn's Pipeline is well suited for this! [Full Info in Documentation](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling)**

When you use the StandardScaler as a step inside a Pipeline then scikit-learn will internally do the job for you.

What happens can be described as follows:

* Step 0: GridSearchCV splits the data passed to `.fit()` into TRAINING folds and a held-out TEST fold according to the cv parameter that you specified; the steps below are repeated for every split.
* Step 1: the scaler is fitted on the TRAINING data
* Step 2: the scaler transforms TRAINING data
* Step 3: the models are fitted/trained using the transformed TRAINING data
* Step 4: the scaler is used to transform the TEST data
* Step 5: the trained models predict using the transformed TEST data
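
The steps above can be checked by hand for a single split. The sketch below is an illustration, not the notebook's own code: it uses synthetic data from `make_classification` as a stand-in for the training set, and `StratifiedKFold`, which is what GridSearchCV uses by default for a classifier when `cv` is an integer. All names ending in `_demo` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Step 0: take the first of 5 cross-validation splits
splitter = StratifiedKFold(n_splits=5)
train_idx, test_idx = next(iter(splitter.split(X_demo, y_demo)))

# Steps 1-3 by hand: fit the scaler on the training folds only, then fit the model
scaler_demo = StandardScaler().fit(X_demo[train_idx])
knn_demo = KNeighborsClassifier(n_neighbors=5)
knn_demo.fit(scaler_demo.transform(X_demo[train_idx]), y_demo[train_idx])

# Steps 4-5 by hand: reuse the fitted scaler on the held-out fold, then predict
manual_pred = knn_demo.predict(scaler_demo.transform(X_demo[test_idx]))

# The same split handled by a Pipeline
pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe_demo.fit(X_demo[train_idx], y_demo[train_idx])
pipe_pred_demo = pipe_demo.predict(X_demo[test_idx])

print("identical predictions:", np.array_equal(manual_pred, pipe_pred_demo))
```

The key point is that the scaler never sees the held-out fold during fitting, which is what keeps the cross-validation scores free of leakage.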

### Set up the pipeline

In [None]:
# Instantiate the scaler
scaler = StandardScaler()

# Instantiate the model
knn = KNeighborsClassifier()

# train test split
X = df1.drop("target", axis = 1)
y = df1["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# Operations in order
operations = [('scaler', scaler), ('knn', knn)] #Notice that they are written as (name, object) tuples inside a list.

# set up pipeline
pipe = Pipeline(operations) #Notice: it is written with a capital P

Instantiating the scaler and the model is straightforward, but what do `operations` and `pipe` mean? The `operations` list specifies the tasks that will be executed inside the pipeline, one by one. First the pipeline scales the data, then it fits the model; those are our two operations, and they are executed in order.
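
To see the order of operations concretely, a pipeline's steps can be inspected before and after fitting. This is a small sketch on synthetic data (the `_demo` names and the `make_classification` data are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)

demo_pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# The steps are stored in the order they will be executed
step_names = [name for name, _ in demo_pipe.steps]
print(step_names)

# Fitting the pipeline fits each step in turn
demo_pipe.fit(X_demo, y_demo)

# Each fitted step can be looked up by the name given in its tuple
fitted_scaler = demo_pipe.named_steps["scaler"]
print(fitted_scaler.mean_.shape)  # one learned mean per feature
```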

### Hyper parameter tuning

In [None]:
# Here are the parameters that can be modified in the KNN classifier
knn.get_params().keys()  

In [None]:
# we will only modify the 'n_neighbors'
k_values = list(range(1,20))
k_values

In [None]:
# setting the parameter grid
param_grid = {'n_neighbors': k_values}

The grid above is written the standard way, for an estimator used without a pipeline. It is not valid once the estimator sits inside a pipeline: the pipeline has two operations (scale the data and fit the model), so `n_neighbors` alone does not tell GridSearchCV which step the parameter belongs to. Instead, we prefix the parameter with the name of the intended operation, as follows.

In [None]:
# setting the parameter grid
param_grid = {'knn__n_neighbors': k_values} # we can add any other parameters to be tuned

Notice the naming convention: we used the name of the operation, then two underscores, then the name of the parameter.

**In general: if your parameter grid is used with a Pipeline, each parameter name needs to be specified in the following manner:**

* chosen_string_name + **two** underscores + parameter key name
* step_name + __ + parameter name
* knn + __ + n_neighbors
* knn__n_neighbors

[StackOverflow on this](https://stackoverflow.com/questions/41899132/invalid-parameter-for-sklearn-estimator-pipeline)

The reason we have to do this is that it lets scikit-learn know which operation in the pipeline these parameters belong to (otherwise it would have no way to tell whether n_neighbors was a parameter of the scaler or of the model).
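
A quick way to check which names are valid is to ask the pipeline itself. The sketch below uses a throwaway pipeline called `demo_pipe` (a name made up for this example) with the same step names as above:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Every parameter of a step is exposed as <step name>__<parameter name>
valid_names = list(demo_pipe.get_params().keys())
print("knn__n_neighbors" in valid_names)
print("n_neighbors" in valid_names)

# The prefixed name is accepted and reaches the KNN step
demo_pipe.set_params(knn__n_neighbors=3)

# The bare name is rejected because the Pipeline itself has no such parameter
try:
    demo_pipe.set_params(n_neighbors=3)
    bare_name_rejected = False
except ValueError as err:
    bare_name_rejected = True
    print("rejected with", type(err).__name__)
```

GridSearchCV sets the parameters of each candidate through the same mechanism, which is why the keys of the parameter grid must use the prefixed form.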

---

In [None]:
# Putting all of it together 
full_cv_classifier = GridSearchCV(pipe,param_grid,cv=5,scoring='accuracy')

# Fitting the pipeline
full_cv_classifier.fit(X_train,y_train)

In [None]:
# Model best parameters
full_cv_classifier.best_estimator_.get_params()

In [None]:
# printing the accuracy associated with each k
acc = full_cv_classifier.cv_results_['mean_test_score']
k_acc = pd.DataFrame({'k_values': k_values, 'Accuracy': acc})
k_acc = k_acc.set_index("k_values").transpose()
round(k_acc, 2)

The best cross-validated accuracy is obtained with 6 neighbors.
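
Instead of reading the best value off the table, the winning setting can also be pulled directly from the fitted search object. This is a sketch on synthetic data, so the best k it finds says nothing about the heart-disease result; `search`, `demo_pipe` and `demo_grid` are names made up for this example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
demo_grid = {"knn__n_neighbors": list(range(1, 20))}

search = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

# best_params_ holds only the tuned parameters, keyed by their prefixed names
best_k = search.best_params_["knn__n_neighbors"]

# best_score_ is the mean cross-validated score of that setting
print("best k:", best_k)
print("mean CV accuracy:", round(search.best_score_, 2))
```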

### Final Model

We just saw that our grid search recommends K=6. Let's now use the Pipeline again, but this time there is no need for a grid search; instead, we will evaluate on our hold-out test set.

In [None]:
# initiate and set the operations
scaler = StandardScaler()
knn6 = KNeighborsClassifier(n_neighbors=6)
operations = [('scaler',scaler),('knn6',knn6)]

In [None]:
# set up pipeline
pipe = Pipeline(operations)

In [None]:
# fit the pipeline
pipe.fit(X_train,y_train)

In [None]:
# predict on the test set
pipe_pred = pipe.predict(X_test)

In [None]:
# print the classification report 
print(classification_report(y_test,pipe_pred))
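
The `confusion_matrix` and `accuracy_score` functions imported at the top of the notebook give a more compact view of the same evaluation. Here is a minimal sketch on synthetic data (the data and the `_demo` names are assumptions for illustration, so the numbers will differ from the heart-disease results):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn6", KNeighborsClassifier(n_neighbors=6)),
])
demo_pipe.fit(X_tr, y_tr)
demo_pred = demo_pipe.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, demo_pred)

# Accuracy is the share of predictions on the diagonal of the confusion matrix
acc_demo = accuracy_score(y_te, demo_pred)

print(cm)
print("accuracy:", round(acc_demo, 2))
```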

### Congrats, you made it to the end of the notebook. Hope you found it useful!