# Avoiding Data Leakage

#### Emmanuel Amador Maldonado

**The main objective is to compare model performance when data preparation is applied to the whole dataset before splitting it into train and test sets against model performance when data preparation is applied after the split.**

In [31]:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

## Evaluating the model using Train-Test Evaluation

In this section, we'll evaluate a logistic regression model using train and test sets on a synthetic binary classification dataset where the input variables have been normalized.

Using the **make_classification** function from *sklearn.datasets*, we'll create a dataset with a binary target variable. Its main parameters are:

- n_samples: Total number of rows
- n_features: Total number of independent variables. It is made up of n_informative + n_redundant + n_repeated features, and any remaining features are filled with random noise
    - n_informative: Number of features that carry real information about the target variable
    - n_redundant: Number of features generated as random linear combinations of the informative features, so they add no new information
    - n_repeated: Number of features duplicated from the informative and redundant features
- random_state: Seed for the random number generator, which makes the results reproducible
- n_classes: Number of classes (labels) of the target variable (default: 2)

Returns X, y: the independent variables and the target variable, respectively
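As a quick sanity check on these parameters, the sketch below generates a dataset with the same configuration and inspects its shape, class counts, and column rank. It assumes NumPy is available (the notebook itself does not import it), and the names `X_demo`, `y_demo`, `class_counts`, and `rank` are illustrative only. Since redundant features are linear combinations of informative ones, the rank is expected to be lower than the number of columns.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same configuration as the notebook: 15 informative + 5 redundant features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7
)

print(X_demo.shape, y_demo.shape)   # shapes of the feature matrix and the target
class_counts = np.bincount(y_demo)  # number of rows per class label
print(class_counts)

# Redundant columns are linear combinations of the informative ones,
# so the matrix should not have full column rank.
rank = np.linalg.matrix_rank(X_demo)
print(rank)
```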


In [3]:
X, y = make_classification(n_samples = 1000, n_features = 20, n_informative = 15, n_redundant = 5, random_state = 7)

In [4]:
print(X.shape, y.shape)

(1000, 20) (1000,)


In [5]:
pd.DataFrame(X).head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.292995,-4.212231,-1.288332,-2.178498,-0.645277,2.580977,0.284224,-7.182793,-1.912111,2.737295,0.813957,3.969737,-2.669398,3.346923,4.197918,0.99991,-0.302019,-4.431706,-2.826467,0.449168
1,-0.068399,5.518841,11.238977,-5.0397,-2.086784,2.149685,0.559734,15.113777,-3.071834,-2.574584,3.324576,2.067542,-5.249258,-2.1545,4.931091,1.296735,-3.186133,-3.089948,1.190299,1.620256
2,0.731616,-0.684686,-0.981742,-2.552465,-5.270308,-1.561498,-1.169269,-2.104087,-1.131139,4.654775,-2.786596,-2.034761,2.149657,-0.134154,-1.198231,-2.720604,-0.123961,5.654297,-0.646599,-3.15653
3,2.309107,-0.320548,-6.591664,1.070525,-4.418769,1.134274,2.340813,-5.983425,0.675917,-1.007879,-0.761441,6.866297,1.44227,1.768678,5.173661,-1.070164,-2.447064,-1.109038,-2.997035,1.993212
4,-0.488406,-3.213065,1.100805,-1.356223,5.325086,0.729179,-0.25704,-1.035284,0.478013,-0.010764,-0.227408,2.551456,0.951594,-2.91491,-2.186843,-1.089129,1.406454,3.082424,0.925835,-2.326362
5,-0.156687,-2.491359,-0.319048,-0.180767,-5.161745,0.536021,-1.435684,0.708005,-2.480216,1.017607,0.899322,2.455431,0.87808,0.619633,1.120065,1.784713,-3.079556,4.192865,-4.93151,0.221128
6,3.095598,5.155741,4.84693,1.677064,-5.461116,2.922476,-4.679053,2.916699,-1.501298,0.243174,0.537594,-12.232591,-1.480634,1.653768,1.381391,-3.410147,-0.437894,6.170876,0.53659,-2.258617
7,1.482795,1.159066,0.805299,-3.453292,-9.46446,-3.631272,-0.010151,3.080085,1.621198,1.300982,2.147708,-0.036937,-0.277505,-0.506086,-5.903622,1.555044,0.679583,3.990227,-4.600461,2.687322
8,-4.548827,1.388299,1.85443,1.201001,-5.795249,-0.654437,0.537701,1.920046,-0.550207,-2.581467,-2.397156,7.739894,3.841829,-0.280951,-0.937141,1.817031,-3.414837,5.19609,-3.094033,1.823347
9,-0.641345,-1.285319,0.959277,-1.510647,7.076031,-1.491739,-0.409155,-3.735169,0.113359,2.43839,3.767352,-6.095996,-1.29555,-0.655873,0.92146,-2.270914,2.29691,-2.945498,-0.863316,-2.363958


In [6]:
pd.DataFrame(y).head(10)

Unnamed: 0,0
0,1
1,1
2,1
3,0
4,0
5,0
6,1
7,1
8,0
9,1


### Train-test evaluation using data preparation to the whole dataset and then split it into train and test datasets

1. Data Preparation
2. Split dataset into train and test sets
3. Model evaluation

We can normalize the independent variables using the **MinMaxScaler** class from *sklearn.preprocessing*. Its main parameters are:

- feature_range: (min_val, max_val). The range of values each feature will have after the transformation (default: (0, 1))
- copy: True/False (default: True). If False, the normalization is performed in place when possible instead of on a copy

The transformation is given by:

- X_std = (X - X.min(axis=0))/(X.max(axis=0) - X.min(axis = 0))
- X_scaled = X_std*(max - min) + min

Where min, max = feature_range
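To make the two formulas concrete, here is a minimal sketch that applies them by hand to a tiny made-up matrix and compares the result with MinMaxScaler. The array `X_small` and the names `feature_min` and `feature_max` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny illustrative matrix: 3 rows, 2 features.
X_small = np.array([[1.0, 10.0],
                    [2.0, 30.0],
                    [4.0, 20.0]])

feature_min, feature_max = 0.0, 1.0  # the default feature_range

# Apply the two formulas column by column (axis=0).
X_std = (X_small - X_small.min(axis=0)) / (X_small.max(axis=0) - X_small.min(axis=0))
X_manual = X_std * (feature_max - feature_min) + feature_min

# Same transformation done by scikit-learn.
X_sklearn = MinMaxScaler(feature_range=(feature_min, feature_max)).fit_transform(X_small)

print(X_manual)
print(np.allclose(X_manual, X_sklearn))
```

Each column is scaled independently, so the smallest value of every feature maps to the lower bound of the range and the largest to the upper bound.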

In [7]:
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

pd.DataFrame(X).head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,0.478319,0.186936,0.424031,0.42932,0.585333,0.674249,0.529972,0.314806,0.393743,0.727494,0.517405,0.530177,0.279907,0.631903,0.744175,0.580887,0.511004,0.273137,0.340144,0.543896
1,0.451084,0.734584,0.771313,0.24054,0.54615,0.642636,0.548818,0.891427,0.319694,0.370846,0.671665,0.476338,0.098575,0.283846,0.794893,0.600236,0.329298,0.320941,0.628732,0.614195
2,0.511375,0.38546,0.43253,0.404646,0.459618,0.370612,0.430548,0.446149,0.443609,0.856237,0.296178,0.36023,0.618627,0.411667,0.370886,0.338365,0.522222,0.632484,0.496759,0.327452
3,0.630259,0.405953,0.277012,0.643687,0.482764,0.568208,0.67065,0.345824,0.55899,0.476037,0.420609,0.612159,0.568906,0.532053,0.811674,0.445949,0.375861,0.391518,0.327889,0.636582
4,0.419431,0.243167,0.490262,0.483573,0.747615,0.538515,0.492948,0.473789,0.546354,0.542985,0.453421,0.490035,0.534418,0.235738,0.302497,0.444713,0.618642,0.540853,0.609732,0.377286
5,0.44443,0.283784,0.450901,0.561128,0.462569,0.524357,0.412324,0.518873,0.357469,0.612032,0.52265,0.487317,0.529251,0.459356,0.531259,0.632045,0.336013,0.580416,0.188905,0.530207
6,0.689532,0.714149,0.594112,0.683706,0.454431,0.699281,0.190466,0.575993,0.419974,0.560035,0.500425,0.071597,0.363462,0.524783,0.549336,0.293417,0.502443,0.650889,0.581766,0.381353
7,0.567986,0.489223,0.48207,0.34521,0.345615,0.2189,0.509836,0.580219,0.619347,0.631058,0.599355,0.416775,0.448027,0.388136,0.045382,0.617074,0.572847,0.573196,0.212689,0.678249
8,0.113426,0.502124,0.511154,0.652296,0.445349,0.437098,0.547311,0.550218,0.480702,0.370384,0.320106,0.636885,0.737566,0.40238,0.388947,0.634152,0.314889,0.616159,0.32092,0.626386
9,0.407905,0.351658,0.486339,0.473384,0.795209,0.375725,0.482542,0.403967,0.523071,0.707425,0.69887,0.245283,0.376471,0.378659,0.51752,0.367678,0.674742,0.326088,0.481188,0.375029


Then we split the data into train and test sets using **train_test_split** from *sklearn.model_selection*. Its main parameters are:

- test_size: If a float, it should be between 0 and 1 and represents the fraction of the full dataset used as the test set
- train_size: If a float, it should be between 0 and 1. If it is not set, it is automatically set to the complement of the test size
- random_state: Seed that makes the split reproducible

Returns

X_train, X_test, y_train, y_test arrays

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 1) 

In [9]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(670, 20) (330, 20) (670,) (330,)


Then we create our logistic regression model using **LogisticRegression** from *sklearn.linear_model* and fit it on the training set

In [10]:
model = LogisticRegression()

model.fit(X_train, y_train)

Fitting the model estimates the coefficients of the logistic regression from the training set. We can then make predictions on the test set, compare them to the expected values, and calculate a classification accuracy score using **accuracy_score** from *sklearn.metrics*

- normalize: True/False (default: True). If False, return the number of correctly classified samples. Otherwise, return the fraction of correctly classified samples

In [11]:
y_predicted = model.predict(X_test)
accuracy_data_leakaged = accuracy_score(y_test, y_predicted)
print(f"Accuracy of the model with data leakage: {round(accuracy_data_leakaged*100, 3)} %")

Accuracy of the model with data leakage: 84.848 %


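**Because the scaler was fit on the whole dataset before the split, the minimum and maximum of each feature were computed using rows that later ended up in the test set. This is data leakage, so this accuracy estimate cannot be trusted as a measure of performance on unseen data.**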
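One way to see what leaks is to compare the statistics learned by a scaler fit on all rows with those learned by a scaler fit on the training rows only. The sketch below is an illustration under the same dataset and split settings as the notebook. The names `X_raw`, `scaler_all`, `scaler_train`, and `n_leaky_features` are hypothetical, and it relies on the `data_min_` and `data_max_` attributes that MinMaxScaler exposes after fitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_raw, y_raw = make_classification(
    n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7
)
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.33, random_state=1)

scaler_all = MinMaxScaler().fit(X_raw)   # sees train and test rows (leaky)
scaler_train = MinMaxScaler().fit(X_tr)  # sees training rows only

# Features whose minimum or maximum is set by a row from the test set.
differs = (scaler_all.data_min_ != scaler_train.data_min_) | (
    scaler_all.data_max_ != scaler_train.data_max_
)
n_leaky_features = int(differs.sum())
print(n_leaky_features)

# With the train-only scaler, scaled test values may fall outside [0, 1].
X_te_scaled = scaler_train.transform(X_te)
print(X_te_scaled.min(), X_te_scaled.max())
```

Any feature counted in `n_leaky_features` was scaled using information that a model deployed on new data would not have had at training time.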

### Train-test evaluation splitting the dataset into train and test sets first and then applying data preparation

1. Split dataset into train and test sets
2. Data Preparation: fit on the train set only, then apply to both the train and test sets
3. Model evaluation

In [12]:
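X, y = make_classification(n_samples = 1000, n_features = 20, n_informative = 15, n_redundant = 5, random_state = 7) # regenerate the raw data, since X was overwritten with its scaled version above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 1)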

In [13]:
scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train) # Notice that we fit the scaler on X_train only

X_test = scaler.transform(X_test) # The test set is only transformed, using the min and max learned from X_train

In [14]:
model = LogisticRegression()

model.fit(X_train, y_train)

y_predicted = model.predict(X_test)

accuracy_without_data_leakaged = accuracy_score(y_test, y_predicted)

print(f"Accuracy of the model without data leakage: {round(accuracy_without_data_leakaged * 100,3)} %")

Accuracy of the model without data leakage: 85.455 %


**In this case, the estimate without data leakage is about 85.455%, which is slightly higher than the 84.848% obtained with data leakage. We would normally expect data leakage to give an optimistic estimate (better performance), but here it resulted in slightly worse performance. The gap is small: it corresponds to 2 of the 330 test rows. This might be because of the difficulty of the prediction task.**
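Since a single split gives only one pair of numbers, one way to probe this result is to repeat both procedures over several split seeds and look at the sign and size of the difference. The following is a rough sketch under that assumption. The helper `split_accuracies` and the choice of 10 seeds are hypothetical and not part of the original notebook, and no particular outcome is claimed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X_raw, y_raw = make_classification(
    n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7
)

def split_accuracies(seed):
    """Return (leaky, clean) test accuracy for one train-test split."""
    # Leaky: scale the full dataset, then split.
    X_all = MinMaxScaler().fit_transform(X_raw)
    X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_raw, test_size=0.33, random_state=seed)
    leaky = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

    # Clean: split first, then fit the scaler on the training rows only.
    X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.33, random_state=seed)
    scaler = MinMaxScaler().fit(X_tr)
    clean = LogisticRegression().fit(scaler.transform(X_tr), y_tr).score(scaler.transform(X_te), y_te)
    return leaky, clean

results = np.array([split_accuracies(seed) for seed in range(10)])
differences = results[:, 0] - results[:, 1]  # leaky minus clean, one value per seed
print(differences.round(3))
```

If the differences change sign from seed to seed, the gap seen in a single split is more likely noise than a systematic effect of the leakage on this dataset.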

## Evaluating the model using Cross-Validation Evaluation

We'll use the same dataset as in the previous section

### Cross-Validation Evaluation using data preparation to the whole dataset first

In [17]:
scaler = MinMaxScaler()

X = scaler.fit_transform(X)

The k-fold cross-validation procedure must first be defined. We'll use repeated stratified 10-fold cross-validation, which is commonly recommended for classification.

- Repeated means that the whole cross-validation procedure is repeated multiple times, three in this case.
- Stratified means that each fold keeps approximately the same proportion of examples from each class as the whole dataset.

We will use k = 10, that is, 10-fold cross-validation. This is done with RepeatedStratifiedKFold, configured with 10 folds and three repeats, and with the cross_val_score() function, which runs the procedure given the model, the cross-validation object, and the metric to compute (accuracy in this case)
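To illustrate what "repeated" and "stratified" mean in practice, the sketch below counts the splits produced by the cross-validation object and compares the share of class 1 in each test fold with the share in the whole dataset. The names `cv_demo`, `overall_rate`, and `fold_rates` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7
)
cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

overall_rate = y_demo.mean()  # fraction of class 1 in the full dataset

# Fraction of class 1 in the test fold of every split.
fold_rates = [y_demo[test_idx].mean() for _, test_idx in cv_demo.split(X_demo, y_demo)]

print(len(fold_rates))  # 30 splits = 10 folds x 3 repeats
print(round(overall_rate, 3), round(min(fold_rates), 3), round(max(fold_rates), 3))
```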

The parameters of **RepeatedStratifiedKFold** from *sklearn.model_selection* are:

- n_splits: Number of folds, must be at least 2 (default: 5)
- n_repeats: Number of times the cross-validation procedure is repeated (default: 10)
- random_state: Integer seed that makes the splits reproducible

The main parameters of **cross_val_score** from *sklearn.model_selection* are:

- estimator: Estimator object implementing "fit" (for example LogisticRegression or LinearRegression)
- X: Independent variables
- y: Target variable
- n_jobs: Int (default None). Number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the cross-validation splits. None means 1, and -1 means using all processors
- scoring: A str or a scorer callable object/function with signature scorer(estimator, X, y), which should return a single value
- cv: Int, cross-validation generator, or an iterable. Determines the cross-validation splitting strategy. Possible inputs are:
    - None, to use the default 5-fold cross-validation
    - int, to specify the number of folds in a (Stratified)KFold
    - CV splitter, such as a RepeatedStratifiedKFold object
    - An iterable yielding (train, test) splits as arrays of indices
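The different forms of the cv argument can be compared directly. The sketch below passes None, an int, and a splitter object. For a classifier with a binary target, the first two are expected to resolve to a non-shuffled StratifiedKFold with 5 folds, so all three calls should return the same scores. The names `model_demo`, `scores_default`, `scores_int`, and `scores_splitter` are illustrative, and `max_iter=1000` is an assumption added here only to help convergence on the unscaled data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7
)
model_demo = LogisticRegression(max_iter=1000)  # max_iter raised as a precaution (assumption)

# cv=None: default 5-fold cross-validation.
scores_default = cross_val_score(model_demo, X_demo, y_demo, scoring="accuracy")
# cv=int: number of folds.
scores_int = cross_val_score(model_demo, X_demo, y_demo, scoring="accuracy", cv=5)
# cv=splitter: an explicit cross-validation object.
scores_splitter = cross_val_score(
    model_demo, X_demo, y_demo, scoring="accuracy", cv=StratifiedKFold(n_splits=5)
)

print(scores_default)
print(np.array_equal(scores_int, scores_splitter))
```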


In [23]:
model = LogisticRegression()
cross_val = RepeatedStratifiedKFold(n_splits = 10, n_repeats = 3, random_state = 1)
scores = cross_val_score(model, X, y, scoring="accuracy", cv = cross_val, n_jobs = -1) # returns an array with the accuracy of each cross-validation fold

In [22]:
print(scores)

[0.86 0.91 0.88 0.81 0.83 0.84 0.81 0.84 0.88 0.84 0.84 0.86 0.85 0.83
 0.89 0.87 0.79 0.97 0.84 0.84 0.81 0.88 0.8  0.85 0.89 0.88 0.87 0.83
 0.83 0.87]


In [29]:
print(f"Mean accuracy with data leakage: {round(100*scores.mean(),3)} %, std = {round(100*scores.std(),3)} %")

Mean accuracy with data leakage: 85.3 %, std = 3.607 %


### Cross-Validation Evaluation with Correct Data Preparation

Data preparation without data leakage is slightly more challenging with cross-validation. The data preparation method must be fit on the training folds only and then applied to both the training and test folds, inside each iteration of the cross-validation procedure. We can achieve this by defining a modeling pipeline: a sequence of data preparation steps that ends with the model to fit and evaluate.

The evaluation procedure can be implemented with the **Pipeline** class from *sklearn.pipeline*. This class takes a list of steps that define the pipeline. Each step in the list is a tuple with two elements.

- The first element is the name of the step (a string)
- The second element is the configured object of the step, such as a transform or a model. The model is only supported as the final step, although we can have as many transforms as we like in the sequence
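To show why a pipeline avoids leakage inside cross-validation, the sketch below writes out by hand roughly what cross_val_score does with a pipeline: for every split, a fresh copy of the pipeline is fit on the training rows only and then scored on the held-out rows. It is an illustration, not the library's internal code, and the names `pipe`, `cv_demo`, `manual_scores`, and `auto_scores` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X_raw, y_raw = make_classification(
    n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7
)
pipe = Pipeline(steps=[("scaler", MinMaxScaler()), ("model", LogisticRegression())])
cv_demo = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

manual_scores = []
for train_idx, test_idx in cv_demo.split(X_raw, y_raw):
    fold_pipe = clone(pipe)                            # fresh, unfitted copy for this split
    fold_pipe.fit(X_raw[train_idx], y_raw[train_idx])  # scaler min/max come from training rows only
    y_pred = fold_pipe.predict(X_raw[test_idx])        # test rows are transformed, never fit on
    manual_scores.append(accuracy_score(y_raw[test_idx], y_pred))

auto_scores = cross_val_score(pipe, X_raw, y_raw, scoring="accuracy", cv=cv_demo)
print(np.allclose(manual_scores, auto_scores))
```

Because the scaler lives inside the pipeline, it is refit on every training fold, which is exactly the step that is skipped when the whole dataset is scaled up front.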

In [32]:
steps = [] #define the pipeline
steps.append(("scaler", MinMaxScaler()))
steps.append(("model", LogisticRegression()))

pipeline = Pipeline(steps = steps)



In [33]:
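X, y = make_classification(n_samples = 1000, n_features = 20, n_informative = 15, n_redundant = 5, random_state = 7) # regenerate the raw data, since X was scaled in the previous section
cross_val = RepeatedStratifiedKFold(n_splits = 10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring = "accuracy", cv = cross_val, n_jobs = -1)
print(scores)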

[0.86 0.91 0.87 0.81 0.83 0.84 0.81 0.84 0.88 0.84 0.84 0.86 0.85 0.83
 0.89 0.88 0.8  0.97 0.84 0.84 0.81 0.88 0.81 0.85 0.89 0.88 0.87 0.84
 0.84 0.87]


In [34]:
print(f"Mean accuracy without data leakage: {round(100*scores.mean(),3)} %, std = {round(100*scores.std(),3)} %")

Mean accuracy without data leakage: 85.433 %, std = 3.471 %


**In this case, the model has an estimated accuracy of about 85.433%, compared to about 85.3% for the approach with data leakage. As with the train-test example, removing data leakage resulted in a slight improvement in performance, when our intuition might suggest a drop given that data leakage often results in an optimistic estimate of model performance. The difference of roughly 0.13 percentage points is much smaller than the standard deviation of the fold scores (about 3.5 percentage points), so it should not be over-interpreted. Nevertheless, the examples demonstrate that data leakage may affect the estimate of model performance, and that it can be avoided by performing data preparation after the data is split.**