# How to Avoid Data Leakage When Performing Data Preparation for Machine Learning

![Data_leakage_water](data_leakge.jpg)

Data preparation is the process of transforming raw data into a form that is appropriate for modeling.

An **incorrect** approach to preparing data applies the transform to all rows of the dataset before fitting and evaluating the model. This results in a problem referred to as **data leakage**, where knowledge of the hold-out test set leaks into the dataset used to train the model. This can result in an incorrect estimate of model performance when making predictions on new data.

A careful application of data preparation techniques is required in order to avoid data leakage, and this varies depending on the model evaluation scheme used, such as train-test splits or k-fold cross-validation.

In this notebook, you will discover how to avoid data leakage during data preparation when evaluating machine learning models.

After going through this notebook, you will know:

- Naive application of data preparation methods to the whole dataset (i.e. all rows of the raw dataset) results in data leakage that causes incorrect estimates of model performance.
- Data preparation must be fit on the training set only in order to avoid data leakage.
- How to implement data preparation without data leakage for train-test splits and k-fold cross-validation in Python.


Let’s get started.

## Notebook Overview

This notebook is divided into three parts; they are:

1. Problem With Incorrect Data Preparation
2. Data Preparation With Train and Test Sets
   - Train-Test Evaluation With Incorrect Data Preparation
   - Train-Test Evaluation With Correct Data Preparation
3. Data Preparation With k-fold Cross-Validation
   - Cross-Validation Evaluation With Incorrect Data Preparation
   - Cross-Validation Evaluation With Correct Data Preparation
    
    
## Problem With Incorrect Data Preparation
The manner in which data preparation techniques are applied to data matters.

A common wrong approach is to first apply one or more transforms to the entire dataset. Then the dataset is split into train and test sets or k-fold cross-validation is used to fit and evaluate a machine learning model.

1. Prepare Dataset
2. Split Data
3. Evaluate Models

This approach is common, but it is dangerously incorrect in most cases.

The problem with applying data preparation techniques before splitting data for training and model evaluation is that it can lead to data leakage and, in turn, will likely result in an incorrect estimate of a model’s performance on the problem.

Data leakage refers to a problem where information about the holdout dataset, such as a test or validation dataset, is made available to the model in the training dataset. This leakage is often small and subtle but can have a marked effect on performance.

>… leakage means that information is revealed to the model that gives it an unrealistic advantage to make better predictions. This could happen when test data is leaked into the training set, or when data from the future is leaked to the past. Any time that a model is given information that it shouldn’t have access to when it is making predictions in real time in production, there is leakage.

> — Page 93, [Feature Engineering for Machine Learning](https://www.amazon.in/Feature-Engineering-Machine-Learning-Principles-ebook/dp/B07BNX4MWC/ref=dp_kinw_strp_1), 2018.

We get data leakage by applying data preparation techniques to the entire dataset (i.e. all the rows of the dataset).

This is not a direct type of data leakage, where we would train the model on the combined train and test data. Instead, it is an indirect type of data leakage, where some knowledge about the test dataset, captured in summary statistics, is available to the model during training. This can make it a harder type of data leakage to spot, especially for beginners.

>One other aspect of resampling is related to the concept of information leakage which is where the test set data are used (directly or indirectly) during the training process. This can lead to overly optimistic results that do not replicate on future data points and can occur in subtle ways.

> — Page 55, [Feature Engineering and Selection](https://www.amazon.in/Feature-Engineering-Selection-Practical-Predictive-ebook/dp/B07VMP371H/ref=dp_kinw_strp_1), 2019.

For example, consider the case where we want to normalize the data, that is, scale the input variables to the range 0-1.

Normalizing the input variables requires that we first calculate the minimum and maximum values for each variable and then use these values to scale the variables. If the dataset is split into train and test datasets only after this step, the examples in the training dataset know something about the data in the test dataset: they have been scaled by the global minimum and maximum values, so they know more about the global distribution of the variable than they should.
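To make this concrete, here is a minimal sketch in plain Python. The toy values and the helper `min_max_scale` are made up for illustration and are not part of the worked examples in this notebook.

```python
# Toy single-feature dataset (hypothetical values); the last two rows play the role of the test set.
values = [2.0, 4.0, 6.0, 8.0, 10.0, 50.0]
train, test = values[:4], values[4:]


def min_max_scale(rows, lo, hi):
    """Scale each value to the range 0-1 relative to the given minimum and maximum."""
    return [(v - lo) / (hi - lo) for v in rows]


# Leaky: the minimum and maximum are computed from ALL rows, including the test rows.
leaky_train = min_max_scale(train, min(values), max(values))

# Correct: the minimum and maximum are computed from the training rows only.
clean_train = min_max_scale(train, min(train), max(train))

print('leaky train:', leaky_train)
print('clean train:', clean_train)
```

In the leaky version the training values are squeezed into a narrow band near zero, because the test-set maximum of 50 influenced how they were scaled.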

We get the same type of leakage with almost all data preparation techniques. For example, standardization estimates the mean and standard deviation from the data in order to scale the variables, and even methods that impute missing values using a model or summary statistics will draw on the full dataset to fill in values in the training dataset.
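The same effect can be sketched for standardization and mean imputation. The numbers and the `standardize` helper below are illustrative assumptions, using only the Python standard library.

```python
from statistics import mean, pstdev

# Hypothetical toy feature: four training rows and two test rows.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 20.0]
all_rows = train + test

# Leaky: statistics estimated from every row, including the test rows.
leaky_mean, leaky_std = mean(all_rows), pstdev(all_rows)

# Correct: statistics estimated from the training rows only.
train_mean, train_std = mean(train), pstdev(train)


def standardize(rows, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return [(v - mu) / sigma for v in rows]


# The training statistics are reused for both the train and the test rows.
clean_train = standardize(train, train_mean, train_std)
clean_test = standardize(test, train_mean, train_std)

# Mean imputation follows the same rule: fill a missing training value with
# the training mean, not the mean of the full dataset.
train_with_gap = [1.0, None, 3.0, 4.0]
imputed = [train_mean if v is None else v for v in train_with_gap]

print('train mean: %.3f, leaky mean: %.3f' % (train_mean, leaky_mean))
```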

**The solution is straightforward and simple**

Data preparation must be fit on the training dataset only. That is, any coefficients or parameters or models prepared for the data preparation process must only use rows of data in the training dataset.

Once fit, the data preparation algorithms or models can then be applied to the training dataset, and to the test dataset.

1. Split Data.
2. Fit Data Preparation on Training Dataset (`fit` method in scikit-learn).
3. Apply Data Preparation to Train and Test Datasets (`transform` method in scikit-learn).
4. Fit and Evaluate Models.
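The split between `fit` and `transform` can be sketched with a small stand-in class. `TinyMinMaxScaler` below is hypothetical and only mimics the shape of a scikit-learn transformer; it is not the library's implementation and does not handle constant columns.

```python
class TinyMinMaxScaler:
    """Hypothetical, illustration-only scaler with a scikit-learn style interface."""

    def fit(self, rows):
        # Learn the per-column minimum and maximum from the rows passed in.
        columns = list(zip(*rows))
        self.lo_ = [min(c) for c in columns]
        self.hi_ = [max(c) for c in columns]
        return self

    def transform(self, rows):
        # Apply the learned minimum and maximum; nothing is re-estimated here.
        return [
            [(v - lo) / (hi - lo) for v, lo, hi in zip(row, self.lo_, self.hi_)]
            for row in rows
        ]


# 1. Split the data (toy values; the last row is the test set).
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 60.0]]
train_rows, test_rows = rows[:4], rows[4:]

# 2. Fit the data preparation on the training rows only.
scaler = TinyMinMaxScaler().fit(train_rows)

# 3. Apply the fitted preparation to both the train and the test rows.
train_scaled = scaler.transform(train_rows)
test_scaled = scaler.transform(test_rows)

print(train_scaled)
print(test_scaled)
```

Note that the scaled test values can fall outside the range 0-1. That is expected: the test rows were never consulted when the minimum and maximum were learned.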

More generally, the entire modeling pipeline must be prepared only on the training dataset to avoid data leakage. This might include data transforms, but also other techniques such as feature selection, dimensionality reduction, feature engineering, and more.
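As a rough sketch of what this can look like, the example below places a feature selection step next to the scaler inside a scikit-learn `Pipeline` (the class is covered in the cross-validation section of this notebook). The choice of `SelectKBest` with `k=10` and of 5-fold cross-validation is an assumption made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# same synthetic dataset as used in the worked examples
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

# every step that learns from data is inside the pipeline,
# so each one is fit on the training folds only
pipeline = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('select', SelectKBest(score_func=f_classif, k=10)),  # illustrative choice of k
    ('model', LogisticRegression()),
])

scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=5)
print('Mean accuracy: %.3f' % (scores.mean() * 100))
```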

>In order for any resampling scheme to produce performance estimates that generalize to new data, it must contain all of the steps in the modeling process that could significantly affect the model’s effectiveness.

> — Pages 54-55, [Feature Engineering and Selection](https://www.amazon.in/Feature-Engineering-Selection-Practical-Predictive-ebook/dp/B07VMP371H/ref=dp_kinw_strp_1), 2019.

Now that we are familiar with how to apply data preparation to avoid data leakage, let’s look at some worked examples.



## Data Preparation With Train and Test Sets
In this section, we will evaluate a logistic regression model using train and test sets on a synthetic binary classification dataset where the input variables have been normalized.

First, let’s define our synthetic dataset.

We will use the [make_classification()](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets) function to create the dataset with 1,000 rows of data and 20 numerical input features. The example below creates the dataset and summarizes the shape of the input and output variable arrays.

In [1]:
# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# summarize the dataset
X.shape, y.shape

((1000, 20), (1000,))

In [2]:
import pandas as pd
X_df = pd.DataFrame(X)

In [3]:
X_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.292995,-4.212231,-1.288332,-2.178498,-0.645277,2.580977,0.284224,-7.182793,-1.912111,2.737295,0.813957,3.969737,-2.669398,3.346923,4.197918,0.99991,-0.302019,-4.431706,-2.826467,0.449168
1,-0.068399,5.518841,11.238977,-5.0397,-2.086784,2.149685,0.559734,15.113777,-3.071834,-2.574584,3.324576,2.067542,-5.249258,-2.1545,4.931091,1.296735,-3.186133,-3.089948,1.190299,1.620256
2,0.731616,-0.684686,-0.981742,-2.552465,-5.270308,-1.561498,-1.169269,-2.104087,-1.131139,4.654775,-2.786596,-2.034761,2.149657,-0.134154,-1.198231,-2.720604,-0.123961,5.654297,-0.646599,-3.15653
3,2.309107,-0.320548,-6.591664,1.070525,-4.418769,1.134274,2.340813,-5.983425,0.675917,-1.007879,-0.761441,6.866297,1.44227,1.768678,5.173661,-1.070164,-2.447064,-1.109038,-2.997035,1.993212
4,-0.488406,-3.213065,1.100805,-1.356223,5.325086,0.729179,-0.25704,-1.035284,0.478013,-0.010764,-0.227408,2.551456,0.951594,-2.91491,-2.186843,-1.089129,1.406454,3.082424,0.925835,-2.326362


In [4]:
y_df = pd.DataFrame(y)
y_df.head()

Unnamed: 0,0
0,1
1,1
2,1
3,0
4,0


Running the example creates the dataset and confirms that the input part of the dataset has 1,000 rows and 20 columns for the 20 input variables (Gaussian distributed), and that the output variable has 1,000 examples to match the 1,000 rows of input data, one value per row. The output is binary, taking the values 0 and 1.

Next, we can fit and evaluate our model on the scaled dataset, starting with the naive or incorrect approach.

### Train-Test Evaluation With Incorrect Data Preparation

The incorrect approach involves first applying the data preparation method, then splitting the data before fitting and evaluating the model.

We can normalize the input variables using the [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) class. The scaler is first defined with the default configuration, which scales the data to the range 0-1. Then the `fit_transform()` method is called to fit the transform on the dataset and apply it to the dataset in a single step. The result is a normalized version of the input variables, where each column in the array is separately normalized (e.g. has its own minimum and maximum calculated).
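To see the per-column behavior on its own, here is a small sketch on a made-up two-column array. The fitted `data_min_` and `data_max_` attributes hold the minimum and maximum that the scaler learned for each column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy data (illustrative values): two columns on very different scales
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)

# per-column statistics learned during fit
print(scaler.data_min_)
print(scaler.data_max_)
# each column is scaled to 0-1 using its own minimum and maximum
print(scaled)
```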

The complete code for this is listed below:

In [5]:
# naive approach to normalizing the data before splitting the data and evaluating the model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# normalize the dataset
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

Accuracy: 84.848


**Note: The example above normalizes the data (the data preparation step) and then splits the data into train and test sets, which is the wrong order.**

Next, let’s explore how we might correctly prepare the data to avoid data leakage.

### Train-Test Evaluation With Correct Data Preparation

The correct approach to performing data preparation with a train-test split evaluation is to fit the data preparation on the training set, then apply the transform to the train and test sets.

This requires that we first split the data into train and test sets and only then perform the data preparation step.

In [6]:
# correct approach for normalizing the data after the data is split before the model is evaluated
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

Accuracy: 85.455


Running the example splits the data into train and test sets, normalizes the data correctly, then fits and evaluates the model.


In this case, we can see that the estimated accuracy for the model is about 85.455 percent, which is slightly higher than the 84.848 percent estimated with data leakage in the previous section.

We expect data leakage to result in an incorrect estimate of model performance. We would expect this to be an optimistic estimate with data leakage, e.g. better performance, although in this case, we can see that data leakage resulted in a slightly lower estimate. This might be because of the difficulty of the prediction task.
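A single train-test split gives only one sample of this difference. One way to probe how much it depends on the particular split is to repeat both procedures over several values of `random_state`. The sketch below does this under the same setup as above; the helper `evaluate` and the choice of five seeds are illustrative assumptions, and no particular outcome is claimed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# same synthetic dataset as above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)


def evaluate(seed, leaky):
    """Return test accuracy for one split, with or without leaky scaling."""
    if leaky:
        # naive: scale all rows first, then split
        X_scaled = MinMaxScaler().fit_transform(X)
        X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.33, random_state=seed)
    else:
        # correct: split first, then fit the scaler on the training rows only
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=seed)
        scaler = MinMaxScaler().fit(X_train)
        X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
    model = LogisticRegression().fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


seeds = range(1, 6)
differences = [evaluate(s, leaky=True) - evaluate(s, leaky=False) for s in seeds]
for seed, diff in zip(seeds, differences):
    print('seed=%d  leaky minus correct: %+.3f' % (seed, diff * 100))
```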

## Data Preparation With k-fold Cross-Validation
In this section, we will evaluate a logistic regression model using [k-fold cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html) on a synthetic binary classification dataset where the input variables have been normalized.

You may recall that k-fold cross-validation involves splitting a dataset into k non-overlapping groups of rows, called folds. The model is then trained on all but one fold, which together form the training dataset, and evaluated on the held-out fold. This process is repeated so that each fold is given a chance to be used as the holdout test set. Finally, the average performance across all evaluations is reported.
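The splitting logic itself can be sketched in a few lines of plain Python. The helper `k_fold_indices` is hypothetical and deliberately simplified (contiguous folds, no shuffling or stratification); in practice we will rely on scikit-learn for this.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for k contiguous, non-overlapping folds."""
    indices = list(range(n_rows))
    # spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_rows=10, k=5))
for train_idx, test_idx in folds:
    print('train:', train_idx, 'test:', test_idx)
```

Each row index appears in exactly one test fold, and the training indices for a given fold never overlap with its test indices.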

The k-fold cross-validation procedure generally gives a more reliable estimate of model performance than a train-test split, although it is more computationally expensive given the repeated fitting and evaluation of models.

Let’s first look at **incorrect data preparation with k-fold cross-validation.**

### Cross-Validation Evaluation With Incorrect Data Preparation
Incorrect data preparation with cross-validation involves applying the data transforms on the complete dataset first, then using the cross-validation procedure.

We will use the synthetic dataset prepared in the previous section and normalize the data directly.

In [12]:
# normalize the dataset
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

The k-fold cross-validation procedure must first be defined. We will use repeated stratified 10-fold cross-validation, which is a best practice for classification. Repeated means that the whole cross-validation procedure is repeated multiple times, three in this case. Stratified means that each group of rows will have the same relative composition of examples from each class as the whole dataset. We will use k=10, i.e. 10-fold cross-validation.
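A quick sketch can show both properties. The imbalanced toy labels below (80 examples of class 0 and 20 of class 1) are made up so that the effect of stratification is easy to see; the feature array is only a placeholder.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# hypothetical imbalanced labels: 80 examples of class 0 and 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 1))  # placeholder features; only the labels matter here

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# fraction of class 1 examples in each held-out fold
class1_fractions = [y_toy[test_idx].mean() for _, test_idx in cv.split(X_toy, y_toy)]

print('number of train/test splits:', len(class1_fractions))
print('class 1 fraction per test fold:', sorted(set(round(f, 3) for f in class1_fractions)))
```

With 10 folds and 3 repeats, the procedure produces 30 train/test splits, and each held-out fold keeps roughly the 20 percent share of class 1 found in the full label array.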

This can be achieved using the [`RepeatedStratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html) class, which can be configured with three repeats and 10 folds, and then using the [`cross_val_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function to perform the procedure, passing in the defined model, cross-validation object, and metric to calculate, in this case, accuracy.

In [13]:
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
scores

array([0.86, 0.91, 0.88, 0.81, 0.83, 0.84, 0.81, 0.84, 0.88, 0.84, 0.84,
       0.86, 0.85, 0.83, 0.89, 0.87, 0.79, 0.97, 0.84, 0.84, 0.81, 0.88,
       0.8 , 0.85, 0.89, 0.88, 0.87, 0.83, 0.83, 0.87])

Let us combine all the code pieces to form the complete example as follows:

In [10]:
# naive data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# normalize the dataset
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# define the model
model = LogisticRegression()
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Average Accuracy: %.3f and Standard deviation: %.3f' % (mean(scores)*100, std(scores)*100))

Average Accuracy: 85.300 and Standard deviation: 3.607


Running the above example normalizes the data first (data preparation first), then evaluates the model using repeated stratified cross-validation.

In this case, we can see that the model achieved an estimated accuracy of about 85.300 percent, which we know is **incorrect given the data leakage** allowed via the data preparation procedure.

Next, let’s look at how we can evaluate the model with cross-validation and avoid data leakage.

### Cross-Validation Evaluation With Correct Data Preparation

Data preparation without data leakage when using cross-validation is slightly more challenging.

It requires that the data preparation method is fit on the training set and applied to the train and test sets within the cross-validation procedure, i.e. separately for each split of the rows into training folds and a held-out fold.
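Written out by hand, that requirement looks roughly like the loop below. This is only a sketch: it uses a single, non-repeated `StratifiedKFold` for brevity, so its result is not directly comparable with the repeated procedure used elsewhere in this notebook.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

# same synthetic dataset as in the earlier examples
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = []
for train_idx, test_idx in cv.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # fit the scaler on the training folds only
    scaler = MinMaxScaler().fit(X_train)
    # fit the model on the scaled training folds
    model = LogisticRegression().fit(scaler.transform(X_train), y_train)
    # scale the held-out fold with the already fitted scaler, then evaluate
    yhat = model.predict(scaler.transform(X_test))
    scores.append(accuracy_score(y_test, yhat))

print('Average Accuracy: %.3f' % (mean(scores) * 100))
```

A new scaler and a new model are created inside the loop for every split, so nothing learned from one held-out fold carries over to another.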

We can achieve this by defining a modeling pipeline that specifies a sequence of data preparation steps to perform, ending in the model to fit and evaluate.

>To provide a solid methodology, we should constrain ourselves to developing the list of preprocessing techniques, estimate them only in the presence of the training data points, and then apply the techniques to future data (including the test set).

>— Page 55, [Feature Engineering and Selection](https://www.amazon.in/Feature-Engineering-Selection-Practical-Predictive-ebook/dp/B07VMP371H/ref=dp_kinw_strp_1), 2019.

The evaluation procedure changes from simply and incorrectly evaluating just the model to correctly evaluating the entire pipeline of data preparation and model together as a single atomic unit.

This can be achieved using the [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class.

This class takes a list of steps that define the pipeline. Each step in the list is a tuple with two elements. The first element is the name of the step (a string) and the second is the configured object of the step, such as a transform or a model. The model is only supported as the final step, although we can have as many transforms as we like in the sequence.
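As a small sketch of that structure, the pipeline below is built from a list of `(name, object)` tuples, and the individual steps are then looked up by name through `named_steps`. The step names `'scaler'` and `'model'` are arbitrary labels.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# each step is a (name, object) tuple; the model comes last
pipeline = Pipeline(steps=[
    ('scaler', MinMaxScaler()),
    ('model', LogisticRegression()),
])

# the names can be used to look up individual steps later
step_names = [name for name, _ in pipeline.steps]
print(step_names)
print(type(pipeline.named_steps['scaler']).__name__)
print(type(pipeline.named_steps['model']).__name__)
```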

We can then pass the configured object to the `cross_val_score()` function for evaluation.

Putting all this together, the complete example of correctly performing data preparation without data leakage when using cross-validation is listed below.

In [11]:
# correct data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the pipeline
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Average Accuracy: %.3f and Standard deviation: %.3f' % (mean(scores)*100, std(scores)*100))

Average Accuracy: 85.433 and Standard deviation: 3.471


The example normalizes the data correctly within the cross-validation folds of the evaluation procedure to avoid data leakage.

In this case, we can see that the model has an estimated accuracy of about 85.433 percent, compared to the approach with data leakage that achieved an accuracy of about 85.300 percent.

As with the train-test example in the previous section, removing data leakage has resulted in a slightly higher performance estimate, when our intuition might suggest a drop given that data leakage often results in an optimistic estimate of model performance. Nevertheless, the examples demonstrate that data leakage does affect the estimate of model performance, and they show how to avoid it by performing data preparation only after the data is split.



## Summary
In this notebook, you discovered how to avoid data leakage during data preparation when evaluating machine learning models.

Specifically, you learned:

- Incorrect application of data preparation methods to the whole dataset results in data leakage that causes incorrect estimates of model performance.
- Data preparation must be fit on the training set only in order to avoid data leakage.
- How to implement data preparation without data leakage for train-test splits and k-fold cross-validation in Python.


**Do you have any questions?**

Ask your questions in the comments below and I will try my best to answer.