This notebook illustrates how the predictive power score (PPS) differs between untransformed and transformed data, and shows how to use pipelines with PPS.

As stated in the [scikit-learn documentation](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html), unscaled data and the presence of very large outliers can degrade the predictive performance and the speed of machine learning algorithms. The pps module allows data to be transformed using a pipeline approach, which prevents data leakage into cross-validation.
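
The leakage-free pattern can be sketched in plain scikit-learn. This is an illustration only, not the pps internals: the choice of `DecisionTreeClassifier`, `StandardScaler` and `cv=4` is an assumption made for the example. Because the scaler is a step of the pipeline, it is re-fitted on the training rows of each fold and never sees the held-out rows.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The scaler is part of the estimator, so cross_val_score fits it
# on each fold's training split only (no statistics leak from the test split).
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
scores = cross_val_score(model, X, y, cv=4)

print(scores.mean())
```

Scaling the full dataset once before cross-validation would instead let the held-out rows influence the scaling statistics used during training.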

The following modules are required:

In [3]:
import pps
import pandas as pd
import numpy as np

The following data transformers are used to illustrate the difference in predictive power score of raw data and transformed data:

In [4]:
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler
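
To make the effect of each transformer concrete, here is a small sketch on a single toy column with one large outlier. The values are hypothetical and are not taken from the breast cancer dataset.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# One feature with a large outlier (toy values chosen for illustration)
column = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    "minmax": MinMaxScaler().fit_transform(column).ravel(),      # maps to [0, 1]
    "standard": StandardScaler().fit_transform(column).ravel(),  # zero mean, unit variance
    "maxabs": MaxAbsScaler().fit_transform(column).ravel(),      # divides by max |x|
}

for name, values in scaled.items():
    print(name, np.round(values, 3))
```

In this toy case the outlier dominates the range, so the four small values end up within about 0.03 of each other after min-max scaling.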

The breast cancer dataset is loaded into a pandas DataFrame so that it can be passed to the pps.predictors function.

In [6]:
import sklearn.datasets as data_sets

dataset = data_sets.load_breast_cancer()
X_full, y_full = dataset.data, dataset.target

# Stack the target as the last column next to the feature columns
data = np.c_[X_full, y_full]
names = [str(name) for name in dataset['feature_names']]
data = pd.DataFrame(data, columns=names + ['target'])
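
As an alternative sketch, recent scikit-learn versions (0.23 and later, as far as I know) can return the dataset directly as a DataFrame through the `as_frame=True` option. The variable names `bunch` and `frame` are chosen for this example only.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects; .frame holds the features plus the target
bunch = load_breast_cancer(as_frame=True)
frame = bunch.frame

print(frame.shape)
print(frame.columns[-1])
```

One difference from the manual construction above: here the target column keeps its integer dtype, whereas stacking with `np.c_` converts it to float.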

The top 5 predictors of the target by predictive power score on the raw data are:

In [13]:
pps.predictors(data, 'target')[['x','y','ppscore']].head()

| | x | y | ppscore |
|---|---|---|---|
| 0 | worst concave points | target | 0.69881 |
| 1 | worst area | target | 0.662555 |
| 2 | worst perimeter | target | 0.66103 |
| 3 | worst radius | target | 0.657945 |
| 4 | mean concave points | target | 0.64377 |


Applying the MinMax, Standard and MaxAbs scalers to the data before calculating the predictive power score yields a different ranking: the "worst area" and "worst radius" features exchange positions, and several scores shift slightly.

In [14]:
pipeline = [MinMaxScaler(), StandardScaler(), MaxAbsScaler()]
pps.predictors(data, 'target', pipeline=pipeline)[['x','y','ppscore']].head()

| | x | y | ppscore |
|---|---|---|---|
| 0 | worst concave points | target | 0.694085 |
| 1 | worst radius | target | 0.667363 |
| 2 | worst perimeter | target | 0.665755 |
| 3 | worst area | target | 0.662555 |
| 4 | mean concave points | target | 0.64377 |
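
The two rankings can be compared side by side with a few lines of pandas. The sketch below copies the scores from the two outputs above rather than re-running pps; the column names `raw_rank`, `scaled_rank`, `rank_change` and `score_change` are chosen for this example.

```python
import pandas as pd

# Scores copied from the two output tables in this notebook
raw = pd.Series({
    "worst concave points": 0.69881,
    "worst area": 0.662555,
    "worst perimeter": 0.66103,
    "worst radius": 0.657945,
    "mean concave points": 0.64377,
}, name="raw")
scaled = pd.Series({
    "worst concave points": 0.694085,
    "worst radius": 0.667363,
    "worst perimeter": 0.665755,
    "worst area": 0.662555,
    "mean concave points": 0.64377,
}, name="scaled")

comparison = pd.concat([raw, scaled], axis=1)
comparison["raw_rank"] = comparison["raw"].rank(ascending=False).astype(int)
comparison["scaled_rank"] = comparison["scaled"].rank(ascending=False).astype(int)
# Positive rank_change means the feature moved up after scaling
comparison["rank_change"] = comparison["raw_rank"] - comparison["scaled_rank"]
comparison["score_change"] = comparison["scaled"] - comparison["raw"]

print(comparison)
```

In these outputs the score of "worst area" is identical in both runs; it drops in the ranking because the scores of "worst radius" and "worst perimeter" rise above it.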
