# A simple example of feature scaling and SVM with Pipeline and GridSearchCV

In a machine learning project, the data usually go through multiple transformations before a model is fitted: for example, encoding categorical variables, feature scaling, normalisation, and feature selection. Scikit-learn provides a built-in `Pipeline` class that applies the preprocessing steps and fits the model sequentially, which makes the whole workflow easier to reproduce.


Scikit-learn defines the `Pipeline` class as follows:

- Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must implement `fit` and `transform` methods, and the final estimator only needs to implement `fit`.
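To make the `fit`/`transform` requirement concrete, here is a minimal sketch of a custom intermediate step. The `ColumnCenterer` class and the toy data are hypothetical, invented for this illustration; `LogisticRegression` is used only as a convenient final estimator and is not part of the example that follows in this post.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class ColumnCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtracts the per-column mean learned in fit."""

    def fit(self, X, y=None):
        # Learn the column means from the data passed to fit.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Apply the stored means to whatever data is passed in later.
        return np.asarray(X, dtype=float) - self.mean_


# Toy data, made up for illustration.
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y_toy = np.array([0, 0, 1, 1])

# Intermediate step: implements fit and transform. Final step: only needs fit.
toy_pipeline = Pipeline([("center", ColumnCenterer()), ("clf", LogisticRegression())])
toy_pipeline.fit(X_toy, y_toy)

# The fitted transformer can be reused on its own.
centered = toy_pipeline.named_steps["center"].transform(X_toy)
toy_predictions = toy_pipeline.predict(X_toy)
```

Any object with these two methods can sit in the middle of a pipeline; only the last step is allowed to be a plain estimator.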

As the name suggests, the `Pipeline` class chains multiple processing steps into a single scikit-learn estimator. A `Pipeline` has `fit`, `predict`, and `score` methods just like any other estimator (e.g. `LinearRegression`).
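As a rough sketch of what "a single estimator" means in practice, the snippet below compares scaling and fitting by hand with doing the same through a `Pipeline`. It uses the full iris dataset purely for illustration; the variable names (`X_demo`, `demo_pipeline`, and so on) are assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

# Manual version: scale, then fit the classifier on the scaled data.
scaler = StandardScaler()
svc = SVC()
svc.fit(scaler.fit_transform(X_demo), y_demo)
manual_pred = svc.predict(scaler.transform(X_demo))

# Pipeline version: one object performs the same sequence of operations.
demo_pipeline = Pipeline([("scaler", StandardScaler()), ("SVM", SVC())])
demo_pipeline.fit(X_demo, y_demo)
pipeline_pred = demo_pipeline.predict(X_demo)

# Both approaches should produce identical predictions.
same_predictions = np.array_equal(manual_pred, pipeline_pred)
# score() returns the mean accuracy of the final estimator.
pipeline_accuracy = demo_pipeline.score(X_demo, y_demo)
```

The pipeline version removes the need to remember to call `transform` on new data before `predict`, which is an easy step to forget in the manual version.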

In [6]:
import pandas as pd 
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris

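# Chain feature scaling and the SVM classifier; each step is a (name, estimator) pair.
steps = [('scaler', StandardScaler()), ('SVM', SVC())]
pipeline = Pipeline(steps)  # define the pipeline object
# Grid keys use the '<step name>__<parameter>' convention; note that 10e5 == 1e6.
parameters = {'SVM__C': [0.001, 0.1, 10, 100, 10e5], 'SVM__gamma': [0.1, 0.01]}
grid = GridSearchCV(pipeline, param_grid=parameters, cv=5)  # 5-fold cross-validation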

In [7]:
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
# Keep the first 100 rows (classes 0 and 1 only) and the first two features plus the label.
data = iris_df.iloc[:100, [0, 1, -1]]
X, y = np.array(data)[:, :2], np.array(data)[:, -1]
X.shape, y.shape

((100, 2), (100,))

In [8]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((70, 2), (30, 2), (70,), (30,))

In [9]:
grid.fit(x_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('scaler',
                                        StandardScaler(copy=True,
                                                       with_mean=True,
                                                       with_std=True)),
                                       ('SVM',
                                        SVC(C=1.0, break_ties=False,
                                            cache_size=200, class_weight=None,
                                            coef0=0.0,
                                            decision_function_shape='ovr',
                                            degree=3, gamma='scale',
                                            kernel='rbf', max_iter=-1,
                                            probability=False,
                                            random_state=None, shrinking=True,
                                            tol=0.001

In [14]:
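print("score = %3.2f" %(grid.score(x_train,y_train)))  # accuracy on the training set
print("score = %3.2f" %(grid.score(x_test,y_test)))  # accuracy on the held-out test set
print(grid.best_params_)  # best parameter combination found by the grid search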

score = 1.00
score = 0.97
{'SVM__C': 0.1, 'SVM__gamma': 0.1}
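
The keys in `best_params_` combine the step name and the parameter name with a double underscore. The following sketch, which rebuilds an equivalent pipeline under the assumed name `demo_pipeline`, shows where those names come from and that `set_params` accepts them directly.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

demo_pipeline = Pipeline([("scaler", StandardScaler()), ("SVM", SVC())])

# Every parameter of a step is exposed as "<step name>__<parameter>".
svm_param_names = sorted(
    name for name in demo_pipeline.get_params() if name.startswith("SVM__")
)

# set_params understands the same names, which is what GridSearchCV relies on.
demo_pipeline.set_params(SVM__C=0.1, SVM__gamma=0.1)
svm_step = demo_pipeline.named_steps["SVM"]
```

Listing `svm_param_names` is a quick way to check the exact spelling of a key before adding it to a parameter grid.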


**References**

- [Scikit-learn user guide](https://scikit-learn.org/stable/data_transforms.html)

- [A Simple Example of Pipeline in Machine Learning with Scikit-learn](https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976)

