# Sklearn Pipeline:
This is a basic pipeline implementation. In a real-life data science scenario, the data would 
need to be prepared first, and the pipeline would then handle the rest of the processing. 
Pipelines exist to build machine learning models quickly and efficiently. They are in high 
demand because they lead to cleaner, more extensible code in big data projects, automate the 
applied machine learning workflow, and save the time otherwise spent on redundant preprocessing work.

# Advantages of using Pipeline:

Automates an iterative workflow

Makes bugs easier to find and fix

Production-ready

Encourages clean coding standards

Helps with iterative hyperparameter tuning and cross-validation evaluation
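
The cross-validation advantage can be illustrated with a minimal sketch. It assumes the iris dataset used elsewhere in this notebook; the step names `scaler` and `clf` and the variable names are illustrative choices, not part of the original code.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling and classification travel together as one estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# In each of the 5 folds the scaler is fit on the training part only,
# so no statistics from the held-out part leak into preprocessing.
scores = cross_val_score(pipe, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Because the whole pipeline is passed to `cross_val_score`, there is no separate preprocessing code to keep in sync between folds.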

# Pipeline
A pipeline makes the model reusable by removing redundant steps, which speeds up 
the process. This can be very effective in a production workflow.

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
iris_df=load_iris()
# Split
X_train,X_test,y_train,y_test=train_test_split(iris_df.data,iris_df.target,test_size=0.3,random_state=0)
#make pipeline
pipeline_lr=Pipeline([('scalar1',StandardScaler()),
                     ('pca1',PCA(n_components=2)),('lr_classifier',LogisticRegression(random_state=0))])
model = pipeline_lr.fit(X_train, y_train)
model.score(X_test,y_test)

0.8666666666666667

In [None]:
from sklearn import metrics

# Comparing Multiple Pipelines to Find the Model with the Best Accuracy

In [30]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

pipeline_lr=Pipeline([('scalar1',StandardScaler()),
                     ('pca1',PCA(n_components=2)), 
                     ('lr_classifier',LogisticRegression())])
pipeline_dt=Pipeline([('scalar2',StandardScaler()),
                     ('pca2',PCA(n_components=2)),
                     ('dt_classifier',DecisionTreeClassifier())])
pipeline_svm = Pipeline([('scalar3', StandardScaler()),
                      ('pca3', PCA(n_components=2)),
                      ('clf', svm.SVC())])
pipeline_knn=Pipeline([('scalar4',StandardScaler()),
                     ('pca4',PCA(n_components=2)),
                     ('knn_classifier',KNeighborsClassifier())])

pipeline_rf = Pipeline([('scalar5',StandardScaler()),
                     ('pca5',PCA(n_components=2)),
                     ('rf_classifier',RandomForestClassifier())])

pipeline_nb = Pipeline([('scalar6',StandardScaler()),
                     ('pca6',PCA(n_components=2)),
                     ('nb_classifier',GaussianNB())])


pipelines = [pipeline_lr, pipeline_dt, pipeline_svm, pipeline_knn, pipeline_rf, pipeline_nb]
pipe_dict = {0: 'Logistic Regression', 1: 'Decision Tree', 2: 'Support Vector Machine',3:'K Nearest Neighbor',4:'RandomForestClassifier',5:'GaussianNB()'}

for pipe in pipelines:
    pipe.fit(X_train, y_train)
    
for i,model in enumerate(pipelines):
    print("{} Test Accuracy:{}".format(pipe_dict[i],model.score(X_test,y_test)))
    

Logistic Regression Test Accuracy:0.8666666666666667
Decision Tree Test Accuracy:0.9111111111111111
Support Vector Machine Test Accuracy:0.9111111111111111
K Nearest Neighbor Test Accuracy:0.9111111111111111
RandomForestClassifier Test Accuracy:0.9111111111111111
GaussianNB() Test Accuracy:0.9111111111111111


# Check best accuracy

In [23]:
best_accuracy=0.0
best_classifier=0
best_pipeline=None

for i,model in enumerate(pipelines):
    accuracy=model.score(X_test,y_test)  # score each pipeline only once
    if accuracy>best_accuracy:
        best_accuracy=accuracy
        best_pipeline=model
        best_classifier=i
print('Classifier with best accuracy:{}'.format(pipe_dict[best_classifier]))

Classifier with best accuracy:Decision Tree
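
As a hedged alternative to tracking indices by hand, the same comparison can be sketched with a dictionary keyed by model name. The two pipelines, the step names, and the variable names below are illustrative assumptions rather than the notebook's own objects.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Map a readable name to each pipeline so names and models cannot drift apart.
named_pipelines = {
    "Decision Tree": Pipeline([("scaler", StandardScaler()),
                               ("clf", DecisionTreeClassifier(random_state=0))]),
    "K Nearest Neighbor": Pipeline([("scaler", StandardScaler()),
                                    ("clf", KNeighborsClassifier())]),
}

# Pipeline.fit returns the fitted pipeline, so fit and score can be chained.
scores = {name: pipe.fit(X_train, y_train).score(X_test, y_test)
          for name, pipe in named_pipelines.items()}

best_name = max(scores, key=scores.get)
print("Classifier with best accuracy:", best_name)
```

With tied scores, `max` returns the first name inserted, which matches the strict `>` comparison in the loop above.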


# Hyperparameter Tuning in Pipeline
With pipelines, you can easily perform a grid search over a set of parameters for each step 
of this meta-estimator to find the best-performing parameters. To do this, you first need to
create a parameter grid for your chosen model. One important thing to note is that you need 
to prefix each parameter name with the name of the pipeline step it belongs to, followed by
two underscores. In the code below, make_pipeline names the step automatically after the
lowercased class name, ‘randomforestclassifier’, so I have added randomforestclassifier__ to 
each parameter. Next, I created a grid search object which includes the original pipeline. 
When I then call fit, a cross-validated grid search is performed over the parameter grid, 
and the pipeline's steps are refit on the training portion of each fold.

In [11]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
pipe = make_pipeline((RandomForestClassifier()))
grid_param = [
{"randomforestclassifier": [RandomForestClassifier()],
"randomforestclassifier__n_estimators":[10,100,1000],
 "randomforestclassifier__max_depth":[5,8,15,25,30,None],
 "randomforestclassifier__min_samples_leaf":[1,2,5,10,15,100],
"randomforestclassifier__max_leaf_nodes": [2, 5,10]}]
gridsearch = GridSearchCV(pipe, grid_param, cv=5, verbose=0,n_jobs=-1) 
best_model = gridsearch.fit(X_train,y_train)
best_model.score(X_test,y_test)

0.9555555555555556
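
The prefix rule can also be seen with an explicitly named step. This is a minimal sketch, assuming a hypothetical step name `rf_classifier` and a deliberately small grid so that it runs quickly; it is not the grid used above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The step name chosen here is what prefixes every parameter in the grid.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf_classifier", RandomForestClassifier(random_state=0)),
])

# Format: <step name>__<parameter name>, with two underscores.
param_grid = {
    "rf_classifier__n_estimators": [10, 50],
    "rf_classifier__max_depth": [3, None],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Test accuracy:", search.score(X_test, y_test))
```

Calling `pipe.get_params().keys()` lists every valid parameter name, which is a quick way to check a prefix before launching a long search.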

# Challenges in using Pipeline:
Pipelines still need proper data cleaning, data exploration and analysis, and 
efficient feature engineering.

# Individual models 

In [None]:
# Logistic regression
Logistic regression is an extension of the linear regression algorithm. The details of the
linear regression algorithm are discussed in Learn regression algorithms using Python and
scikit-learn. In logistic regression, instead of predicting the actual continuous value, we 
predict the probability of an outcome. To achieve this, a logistic function, also referred to 
as a sigmoid function, is applied to the output of the linear regression. It outputs a value 
between 0 and 1. Then, we select a threshold (the "line") that depends on the use case. Any 
data point with a probability above the threshold is classified into the class represented by 1, 
and any data point below it is classified into the class represented by 0.
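
The sigmoid and the cut-off step can be sketched in a few lines of plain Python. The function names `sigmoid` and `classify` and the default threshold of 0.5 are assumptions made for illustration; scikit-learn handles this internally.

```python
import math


def sigmoid(z):
    """Map any real-valued linear output z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Return class 1 if the probability is above the threshold, else class 0."""
    return 1 if sigmoid(z) > threshold else 0


# z stands for the output of the linear part (weights times features plus bias).
for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 3), classify(z))
```

Raising the threshold makes the model more reluctant to predict class 1, which is why the right value depends on the use case.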

In [None]:
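from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
model_name = "lrc"
lrc = LogisticRegression(random_state=0,solver = 'lbfgs',max_iter=1000)
# Pipeline steps must be (name, estimator) pairs. StandardScaler is an assumed
# stand-in for the undefined 'preprocessor' step.
lrc_model = Pipeline(steps=[('preprocessor', StandardScaler()), (model_name, lrc)])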

In [None]:
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')

# Task: Apply all classification algorithms by adding a pipeline


In [None]:
from sklearn.tree import ExtraTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import OneClassSVM
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import ClassifierChain
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OutputCodeClassifier
from sklearn.multiclass import OneVsOneClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import RidgeClassifierCV
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
from sklearn.semi_supervised import LabelPropagation
from sklearn.semi_supervised import LabelSpreading
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.naive_bayes import MultinomialNB  
from sklearn.neighbors import NearestCentroid
from sklearn.svm import NuSVC
from sklearn.linear_model import Perceptron
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
# GMM, DPGMM and VBGMM were removed from scikit-learn; GaussianMixture and
# BayesianGaussianMixture are their replacements.
from sklearn.mixture import GaussianMixture
from sklearn.mixture import BayesianGaussianMixture
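
One possible way to approach the task is to loop over classifier instances and wrap each one in the same preprocessing pipeline. This is only a sketch: the five classifiers chosen, the use of StandardScaler, and the variable names are assumptions, and the list can be extended with other estimators from the imports above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

classifiers = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
    SVC(),
    KNeighborsClassifier(),
    GaussianNB(),
]

results = {}
for clf in classifiers:
    # Every classifier gets the same scaling step in front of it.
    pipe = make_pipeline(StandardScaler(), clf)
    pipe.fit(X_train, y_train)
    results[type(clf).__name__] = pipe.score(X_test, y_test)

for name, accuracy in results.items():
    print("{} Test Accuracy:{}".format(name, accuracy))
```

Meta-estimators such as OneVsRestClassifier or BaggingClassifier wrap a base estimator, so they need one passed to their constructor before they can join the list.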


# Baseline models
When developing a machine learning model for a project, it is sensible to create a baseline 
model first. This should be in essence a ‘dummy’ model, such as one that always predicts
the most frequently occurring class. It gives you a benchmark for your ‘intelligent’ model, 
so that you can check that it performs better than, for example, random guessing.

In [39]:
# Load libraries
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
# Load data
iris = load_iris()

# Create target vector and feature matrix
X, y = iris.data, iris.target
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Create dummy classifier ('uniform' guesses each class at random with equal probability)
dummy = DummyClassifier(strategy='uniform', random_state=1)

# "Train" model
dummy.fit(X_train, y_train)
# Get accuracy score
dummy.score(X_test, y_test)  

0.42105263157894735
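
To show the benchmark idea end to end, here is a hedged sketch that scores a most-frequent-class baseline next to a simple pipeline on the same split. The choice of LogisticRegression behind a StandardScaler is an assumption for illustration, as are the variable names.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predicts the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Candidate model evaluated on exactly the same test set.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
model_acc = model.score(X_test, y_test)

print("Baseline accuracy:", baseline_acc)
print("Model accuracy:", model_acc)
```

A model that cannot clearly beat the baseline on the same split is a signal to revisit the features or the data before tuning further.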

sklearn solutions for various problems
https://www.neuraxio.com/blogs/news/whats-wrong-with-scikit-learn-pipelines
https://www.neuraxle.org/stable/scikit-learn_problems_solutions.html#problem-defining-the-search-space-hyperparameter-distributions    