# `Chapter 04: Data Preparation Without Data Leakage`

### `The problem`

The problem with applying data preparation techniques before splitting data for model evaluation is that it
can lead to data leakage and, in turn, will likely result in an incorrect estimate of a model’s
performance on the problem. Data leakage refers to a problem where information about the
holdout dataset, such as a test or validation dataset, is made available to the model in the
training dataset.

This chapter is divided into three parts:

1. Problem With Naive Data Preparation
2. Data Preparation With Train and Test Sets
3. Data Preparation With k-fold Cross-Validation

Data preparation must be fit on the training dataset only. That is, any
coefficients or models prepared for the data preparation process must only use rows of data in
the training dataset. Once fit, the data preparation algorithms or models can then be applied
to the training dataset, and to the test dataset.

1. Split Data.
2. Fit Data Preparation on Training Dataset.
3. Apply Data Preparation to Train and Test Datasets.
4. Evaluate Models.
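To make "fit on the training dataset only" concrete, here is a minimal sketch using `MinMaxScaler` on a toy single-feature dataset. The values in `train` and `test` are hypothetical and chosen only for illustration. The scaler learns its coefficients (the per-feature minimum and maximum) from the training rows, and those same coefficients are then applied to both datasets.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy single-feature data (hypothetical values, for illustration only)
train = np.array([[2.0], [4.0], [6.0]])
test = np.array([[1.0], [8.0]])

# step 2: fit the data preparation on the training rows only
scaler = MinMaxScaler()
scaler.fit(train)
print(scaler.data_min_, scaler.data_max_)

# step 3: apply the fitted transform to both the train and test rows
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the minimum (2) and maximum (6) come from the training rows, the test values 1 and 8 are mapped to -0.25 and 1.5, outside the [0, 1] range. This is expected: the test set is treated as unseen data, so it has no influence on the learned range.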


In this section, we will evaluate `a logistic regression model` using train and test sets on a synthetic binary classification dataset where the input variables have been normalized. First, let’s define our synthetic dataset. We will use the `make_classification()` function to create the dataset with `1,000 rows` of data and `20 numerical input` features. The example below creates the dataset and summarizes the shape of the input and output variable arrays.


### `01: Data Leakage Scenario `

In [9]:
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score



In [10]:
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# summarize the dataset
print(X.shape, y.shape)

(1000, 20) (1000,)


In [11]:

# normalize the dataset; the scaler is fit on all rows, including future test rows (this is the leak)
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

In [12]:
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

In [13]:
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

In [14]:
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)

In [15]:
print('Accuracy: %.3f' % (accuracy*100))

Accuracy: 84.848


`Note:` This example evaluates a model using a train-test split with data leakage, because the scaler is fit on all rows before the data is split.
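One way to see what leaks is to compare the coefficients a scaler learns from all rows with those it learns from the training rows alone. The sketch below is an illustration, not part of the original example; the names `scaler_all`, `scaler_train`, and `leaked` are introduced here. It counts the features whose minimum or maximum is determined by a row that ends up in the test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# same dataset and split settings as the example above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# naive: coefficients learned from every row, test rows included
scaler_all = MinMaxScaler().fit(X)
# correct: coefficients learned from the training rows only
scaler_train = MinMaxScaler().fit(X_train)

# a feature is affected when its min or max comes from a row outside the training set
min_differs = scaler_all.data_min_ != scaler_train.data_min_
max_differs = scaler_all.data_max_ != scaler_train.data_max_
leaked = int(np.sum(min_differs | max_differs))
print('Features whose range is set by test rows: %d of %d' % (leaked, X.shape[1]))
```

Any feature counted here was scaled using knowledge of the test set, which is exactly the information the model should not have during training.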

### `02: Data Preparation without Data Leakage`

The correct approach to performing data preparation with a train-test split evaluation is to fit
the data preparation on the training set, then apply the transform to the train and test sets.
This requires that we first split the data into train and test sets.

In [17]:
# correct approach: normalize the data after it is split and before the model is evaluated
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

Accuracy: 85.455


`Note:` In this case, the estimated accuracy of the model is about 85.455 percent, which is
higher than the 84.848 percent estimated with data leakage in the previous section. We expect
data leakage to result in an incorrect estimate of model performance, and we would usually
expect that estimate to be optimistic, i.e. better performance. In this case, however, data
leakage resulted in a slightly lower estimate. This might be because of the difficulty of the
prediction task.
### `03: Cross-Validation Evaluation With Naive Data Preparation` 

In [19]:
# naive data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# normalize the dataset; the scaler is fit on all rows, so every test fold leaks into training
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# define the model
model = LogisticRegression()
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print(scores)
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

[0.86 0.91 0.88 0.81 0.83 0.84 0.81 0.84 0.88 0.84 0.84 0.86 0.85 0.83
 0.89 0.87 0.79 0.97 0.84 0.84 0.81 0.88 0.8  0.85 0.89 0.88 0.87 0.83
 0.83 0.87]
Accuracy: 85.300 (3.607)


`Note:` Example of evaluating a model using cross-validation with data leakage.

In [22]:
# correct data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)

# define the pipeline
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)

# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

# report performance
print(scores)
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

[0.86 0.91 0.87 0.81 0.83 0.84 0.81 0.84 0.88 0.84 0.84 0.86 0.85 0.83
 0.89 0.88 0.8  0.97 0.84 0.84 0.81 0.88 0.81 0.85 0.89 0.88 0.87 0.84
 0.84 0.87]
Accuracy: 85.433 (3.471)
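The pipeline avoids leakage because `cross_val_score` refits the whole pipeline, scaler included, on the training rows of each fold. The sketch below writes out a manual loop that is intended to be equivalent to that behavior. It is an illustration under that assumption, not the library's internal implementation, and it uses the same dataset and evaluation settings as the example above.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# same dataset and evaluation procedure as the pipeline example
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

scores = []
for train_ix, test_ix in cv.split(X, y):
    # select the rows for this fold
    X_train, X_test = X[train_ix], X[test_ix]
    y_train, y_test = y[train_ix], y[test_ix]
    # fit a fresh scaler on the training rows of this fold only
    scaler = MinMaxScaler()
    X_train = scaler.fit_transform(X_train)
    # apply the fitted scaler to the held-out rows
    X_test = scaler.transform(X_test)
    # fit and evaluate a fresh model for this fold
    model = LogisticRegression()
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

print('Accuracy: %.3f (%.3f)' % (mean(scores) * 100, std(scores) * 100))
```

In practice the `Pipeline` version is preferable: it is shorter, harder to get wrong, and works with `n_jobs=-1` for parallel evaluation.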
