# Tutorial: Binning process with sklearn Pipeline

This example shows how to use a binning process as a transformation within a scikit-learn ``Pipeline``. A pipeline generally applies one or more transforms in sequence, followed by a final estimator.
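To make the "transforms followed by a final estimator" idea concrete, here is a minimal pure-Python sketch of the pattern. The classes `CenterTransform`, `SimpleLinearRegression` and `TinyPipeline` are hypothetical names invented for this illustration; this is not how scikit-learn implements `Pipeline`, only the general shape of the idea.

```python
# Toy illustration of the pipeline pattern. All class names are hypothetical;
# this is a sketch of the idea, not scikit-learn's implementation.

class CenterTransform:
    """Transform step: subtracts the mean learned during fit."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SimpleLinearRegression:
    """Final estimator: one-feature least squares with an intercept."""

    def fit(self, X, y):
        x_mean = sum(X) / len(X)
        y_mean = sum(y) / len(y)
        sxy = sum((x - x_mean) * (t - y_mean) for x, t in zip(X, y))
        sxx = sum((x - x_mean) ** 2 for x in X)
        self.slope_ = sxy / sxx
        self.intercept_ = y_mean - self.slope_ * x_mean
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * x for x in X]


class TinyPipeline:
    """Chains transforms and a final estimator behind fit/predict."""

    def __init__(self, transforms, estimator):
        self.transforms = transforms
        self.estimator = estimator

    def fit(self, X, y):
        for t in self.transforms:
            X = t.fit(X, y).transform(X)  # fit each transform, pass its output on
        self.estimator.fit(X, y)          # the estimator sees transformed data only
        return self

    def predict(self, X):
        for t in self.transforms:
            X = t.transform(X)            # reuse what was learned during fit
        return self.estimator.predict(X)


pipe = TinyPipeline([CenterTransform()], SimpleLinearRegression())
pipe.fit([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
predictions = pipe.predict([4.0, 5.0])
print(predictions)
```

The point of the pattern is that the transform parameters are learned on the training data only and then reused unchanged at prediction time.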

In [1]:
from optbinning import BinningProcess

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

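To get started, let's load the Boston housing dataset, a well-known dataset from the UCI repository. Note that ``load_boston`` was removed in scikit-learn 1.2, so this example requires an earlier release.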

In [2]:
data = load_boston()

variable_names = data.feature_names
X = data.data
y = data.target

In [3]:
variable_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [4]:
categorical_variables = ['CHAS']

Instantiate a ``BinningProcess`` object with the variable names and the list of numerical variables to be treated as categorical. Then create a pipeline object with two steps: a binning process transformer and a linear regression estimator.

In [5]:
binning_process = BinningProcess(variable_names,
                                 categorical_variables=categorical_variables)

In [6]:
lr = Pipeline(steps=[('binning_process', binning_process),
                     ('regressor', LinearRegression())])
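As a rough intuition for what the binning step hands to the regressor, the sketch below replaces each value of a single feature with the mean target of the bin it falls into. This assumes that, for a continuous target, the transformation maps each bin to its mean target value. The split points and data are hypothetical and hard-coded here, whereas optbinning chooses the bins by solving an optimization problem.

```python
from bisect import bisect_right

# Hypothetical split points and toy data, for illustration only.
splits = [2.0, 5.0]  # bins: (-inf, 2), [2, 5), [5, inf)
x_train = [1.0, 1.5, 3.0, 4.0, 6.0, 7.0]
y_train = [10.0, 12.0, 20.0, 22.0, 30.0, 34.0]


def bin_index(value, splits):
    """Return the index of the bin that contains value."""
    return bisect_right(splits, value)


# "Fit": accumulate the target sum and count per bin, then take the mean.
# Every bin is non-empty in this toy data, so the division is safe.
sums = [0.0] * (len(splits) + 1)
counts = [0] * (len(splits) + 1)
for x, y in zip(x_train, y_train):
    i = bin_index(x, splits)
    sums[i] += y
    counts[i] += 1
bin_means = [s / c for s, c in zip(sums, counts)]


def transform(values):
    """Replace each value with the mean target of its bin."""
    return [bin_means[bin_index(v, splits)] for v in values]


x_new_binned = transform([0.5, 2.0, 9.0])
print(bin_means)
print(x_new_binned)
```

After such a transformation, a feature with a nonlinear relationship to the target becomes a piecewise-constant feature that a linear model can use more easily.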

Split the dataset into training and test sets, then fit the pipeline with the training data.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
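As a quick sanity check on the split sizes, the arithmetic below assumes the Boston housing dataset has 506 samples (a figure not stated in this tutorial) and that a fractional ``test_size`` is rounded up to a whole number of test samples.

```python
import math

n_samples = 506   # assumption: size of the Boston housing dataset
test_size = 0.2   # same fraction as in the train_test_split call above

n_test = math.ceil(test_size * n_samples)  # test set size, rounded up
n_train = n_samples - n_test               # the remainder is used for training

print(n_train, n_test)
```

The resulting training size agrees with the number of records reported by the binning process statistics later in this tutorial.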

In [8]:
lr.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('binning_process',
                 BinningProcess(binning_fit_params=None,
                                binning_transform_params=None,
                                categorical_variables=['CHAS'],
                                max_bin_size=None, max_n_bins=None,
                                max_n_prebins=20, max_pvalue=None,
                                max_pvalue_policy='consecutive',
                                min_bin_size=None, min_n_bins=None,
                                min_prebin_size=0.05, n_jobs=None,
                                selection_criteria=None, special_codes=None,
                                split_digits=None,
                                variable_names=array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7'),
                                verbose=False)),
                ('regressor',
                 LinearRegression(copy_X=Tr

In [9]:
y_test_predict = lr.predict(X_test)

print("MSE:      {:.3f}".format(mean_squared_error(y_test, y_test_predict)))
print("R2 score: {:.3f}".format(r2_score(y_test, y_test_predict)))

MSE:      17.626
R2 score: 0.760
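For reference, both metrics can be written out from their usual definitions. The following is a minimal pure-Python sketch on made-up numbers; the helper names `mse` and `r2` are chosen for this illustration and are not the scikit-learn functions used above.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print("MSE:      {:.3f}".format(mse(y_true, y_pred)))
print("R2 score: {:.3f}".format(r2(y_true, y_pred)))
```

A lower MSE and an R2 score closer to 1 both indicate a better fit on the evaluated data.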


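In this case, the binning process transformation improves predictions. For comparison, a linear regression fitted on the raw features reaches a higher MSE (24.291 versus 17.626) and a lower R2 score (0.669 versus 0.760) on the same test set.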

In [10]:
lr2 = LinearRegression()
lr2.fit(X_train, y_train)

y_test_predict = lr2.predict(X_test)

print("MSE:      {:.3f}".format(mean_squared_error(y_test, y_test_predict)))
print("R2 score: {:.3f}".format(r2_score(y_test, y_test_predict)))

MSE:      24.291
R2 score: 0.669
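The two sets of results can be compared with a little arithmetic. The snippet below only reuses the numbers printed above; the variable names are chosen for this illustration.

```python
# Test-set metrics copied from the outputs above.
mse_binned, r2_binned = 17.626, 0.760  # pipeline with binning process
mse_raw, r2_raw = 24.291, 0.669        # linear regression on raw features

mse_reduction = (mse_raw - mse_binned) / mse_raw  # relative drop in MSE
r2_gain = r2_binned - r2_raw                      # absolute gain in R2

print("Relative MSE reduction: {:.1%}".format(mse_reduction))
print("Absolute R2 gain:       {:.3f}".format(r2_gain))
```

These figures come from a single train/test split with ``random_state=42``, so they describe this particular split rather than a general guarantee.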


#### Binning process statistics

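The binning process of the pipeline can be retrieved to show information about the problem and timing statistics. Because the pipeline was created without caching (``memory=None``), it fits the ``binning_process`` object passed to it in place, so that object can be queried directly; it is also available as ``lr.named_steps['binning_process']``.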

In [11]:
binning_process.information(print_level=1)

optbinning (Version 0.12.0)
Copyright (c) 2019-2021 Guillermo Navas-Palencia, Apache License 2.0

  Statistics
    Number of records                    404
    Number of variables                   13
    Target type                   continuous

    Number of numerical                   12
    Number of categorical                  1
    Number of selected                    13

  Time                                1.5685  sec



The ``summary`` method returns basic statistics for each binned variable.

In [12]:
binning_process.summary()

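|   | name  | dtype       | status  | selected | n_bins |
|---|-------|-------------|---------|----------|--------|
| 0 | CRIM  | numerical   | OPTIMAL | True     | 10     |
| 1 | ZN    | numerical   | OPTIMAL | True     | 3      |
| 2 | INDUS | numerical   | OPTIMAL | True     | 7      |
| 3 | CHAS  | categorical | OPTIMAL | True     | 2      |
| 4 | NOX   | numerical   | OPTIMAL | True     | 9      |
| 5 | RM    | numerical   | OPTIMAL | True     | 10     |
| 6 | AGE   | numerical   | OPTIMAL | True     | 9      |
| 7 | DIS   | numerical   | OPTIMAL | True     | 8      |
| 8 | RAD   | numerical   | OPTIMAL | True     | 4      |
| 9 | TAX   | numerical   | OPTIMAL | True     | 6      |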
