#### Hi all. 🙋‍♂️

​
#### We continue our **Beginner-Intermediate Friendly Machine Learning series**, which should help anyone who wants to learn or refresh the basics of ML.

#### What we have covered:

#### [Beginner Friendly Detailed Explained EDAs – For anyone at the beginnings of DS/ML journey](https://www.kaggle.com/general/253911#1393015) ✔️

#### [BIAS & VARIANCE TRADEOFF](https://www.kaggle.com/kaanboke/ml-basics-bias-variance-tradeoff) ✔️

#### [LINEAR ALGORITHMS](https://www.kaggle.com/kaanboke/ml-basics-linear-algorithms)  ✔️

#### [NONLINEAR ALGORITHMS](https://www.kaggle.com/kaanboke/nonlinear-algorithms)  ✔️

#### [The Most Used Methods to Deal with MISSING VALUES](https://www.kaggle.com/kaanboke/the-most-used-methods-to-deal-with-missing-values)  ✔️

#### [Beginner Friendly End to End ML Project- Classification with Imbalanced Data](https://www.kaggle.com/kaanboke/beginner-friendly-end-to-end-ml-project-enjoy)  ✔️

#### Today we will cover one of the main problems in data preprocessing: **Data Leakage**

#### **By the way, if you like the topic, you can show your support** 👍

####  **Feel free to leave a comment in the notebook**. 


#### All the best 🤘

![](https://miro.medium.com/max/1400/1*FUZS9K4JPqzfXDcC83BQTw.png)

Image Credit: https://miro.medium.com/

<a id="toc"></a>

<h3 class="list-group-item list-group-item-action active" data-toggle="list" role="tab" aria-controls="home">Table of Contents</h3>
    
* [Data Leakage](#0)
* [Target Leakage](#1)
* [Training- Test Leakage](#2)
* [How to Deal With Data Leakage](#3)
* [Is Cross Validation Enough to Handle Data Leakage?](#4)
* [Where is the Data Leakage?](#5)
* [Using Pipeline](#6)
* [Cross-Validation & Pipeline- Correct Data Preparation](#7)
* [Conclusion](#8)
* [References & Further Reading](#9)

<a id="0"></a>
<font color="lightseagreen" size=+2.5><b>Data Leakage</b></font>


<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

> In statistics and machine learning, leakage (also known as data leakage or target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment.

> Leakage is often subtle and indirect, making it hard to detect and eliminate. Leakage can cause a statistician or modeler to select a suboptimal model, which could be outperformed by a leakage-free model. 

Reference: https://en.wikipedia.org/wiki/Leakage_(machine_learning)

![](https://smartwatermagazine.com/sites/default/files/styles/thumbnail-1180x647/public/water-pipe-2.jpg)

Image credit: https://smartwatermagazine.com/blogs/victoria-edwards/bunker-mentality-will-never-solve-worlds-water-leakage-problem

In [None]:
import pandas as pd 
import numpy as np 

from sklearn.model_selection import train_test_split,RepeatedStratifiedKFold,cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler,PowerTransformer,OneHotEncoder
from sklearn.compose import ColumnTransformer


import warnings
warnings.filterwarnings("ignore")

<a id="1"></a>
<font color="lightseagreen" size=+2.5><b>Target Leakage</b></font>


<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

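- In predictive modeling, we want the best possible prediction of the target variable.
- Target leakage happens when a feature carries information that would not be available at prediction time, typically because it was recorded after the target was known. The model then looks better in validation than it will be in production.
- Let me give an example to clarify the issue.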

#### Credit Card Application:

- Suppose we want to predict whether a person will receive a credit card.
- And we have a feature called 'paid_by_card'.
- We have to check whether these payments were made with the card in question. If they were, the feature is only known after the card was granted, so it leaks the target.
- Don't think this is exaggerated; you will see a lot of similar examples in your data science career.

- For further discussion of target leakage and more examples, please refer to: https://www.researchgate.net/publication/221653692_Leakage_in_Data_Mining_Formulation_Detection_and_Avoidance
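
- A minimal sketch of how a leaky feature can inflate the score. Everything here is synthetic: the `income` feature, the noise levels, and the way `paid_by_card` is generated are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000

# Synthetic target: 1 = card granted, 0 = rejected (made-up data).
y = rng.integers(0, 2, size=n)

# A legitimate feature that is only weakly related to the target.
income = rng.normal(loc=y * 0.5, scale=1.0)

# A leaky feature: only card holders can pay by card, so it is
# recorded after the decision and almost copies the target.
paid_by_card = y * rng.binomial(1, 0.9, size=n)

X_clean = income.reshape(-1, 1)
X_leaky = np.column_stack([income, paid_by_card])

model = LogisticRegression()
auc_clean = cross_val_score(model, X_clean, y, scoring="roc_auc", cv=5).mean()
auc_leaky = cross_val_score(model, X_leaky, y, scoring="roc_auc", cv=5).mean()

print(f"AUC without the leaky feature: {auc_clean:.3f}")
print(f"AUC with the leaky feature:    {auc_leaky:.3f}")
```

- The second score looks much better, but the model would not have 'paid_by_card' for a new applicant, so that score says little about real-world performance.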

<a id="2"></a>
<font color="lightseagreen" size=+2.5><b>Training- Test Leakage</b></font>


<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

### Time Related Issues:

- Imagine we are modeling sale prices.
- We have sale price data from 2015 to 2020.
- If we randomly split these years between the training and test datasets,
- the training data will contain rows from the same period as the test data, and even later ones. The model gets to see the future, and the time series evaluation becomes meaningless.
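
- A small sketch of a chronological split, assuming made-up monthly prices. The column names `date` and `price` and the 2020 cutoff are illustrative choices, not part of any dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up monthly sale prices for 2015-2020 (illustrative data only).
dates = pd.date_range("2015-01-01", "2020-12-01", freq="MS")
rng = np.random.default_rng(0)
sales = pd.DataFrame({
    "date": dates,
    "price": 100 + np.arange(len(dates)) + rng.normal(0, 5, len(dates)),
})

# Chronological split: train on the past, test on the most recent year.
cutoff = pd.Timestamp("2020-01-01")
train = sales[sales["date"] < cutoff]
test = sales[sales["date"] >= cutoff]

# Every training row precedes every test row, so no future information leaks.
print(train["date"].max(), "<", test["date"].min())
```

- For cross validation on time-ordered data, scikit-learn's `TimeSeriesSplit` follows the same idea of always validating on rows that come after the training rows.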

### Data Preprocessing Issues:

- All preprocessing steps must be fitted on the training data only!
- This covers feature engineering, scaling, normalization, missing value imputation, categorical encoding, etc. The fitted steps are then applied, unchanged, to the test data.
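
- The rule above can be sketched with a simple hold-out split. The data is random and only there to make the example runnable; the point is which rows each method call sees.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))      # made-up features
y = rng.integers(0, 2, size=200)   # made-up binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from the training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max, no refit

# The stored statistics come from the training rows, not from the full dataset.
print(scaler.data_min_, scaler.data_max_)
```

- Test values can fall slightly outside [0, 1] after scaling. That is expected, because the test rows did not take part in computing the minimum and maximum.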

<a id="3"></a>
<font color="lightseagreen" size=+2.5><b>How to Deal With Data Leakage</b></font>


<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

- Remove leaky variables
   - As mentioned in the credit card application example,
   - if we want to minimize the contamination,
   - we have to remove the 'fishy' variable 'paid_by_card' from our model.
- Use cross validation
- Use pipelines

- In this study we will focus on the data preparation side of data leakage.

<a id="4"></a>
<font color="lightseagreen" size=+2.5><b>Is Cross Validation Enough to Handle Data Leakage?</b></font>


<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

- With k-fold cross-validation, we split the data into k non-overlapping groups of rows (folds).
- The model is trained on k-1 of the folds
- and evaluated on the held-out fold. This is repeated until every fold has been held out once.
- As far as data leakage is concerned, the training and validation rows are kept separate while the model is fitted.
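
- A quick sketch to check that the folds really do not overlap, using ten made-up rows and plain `KFold` (the number of rows and splits are arbitrary choices for the demo).

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # ten made-up rows

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
validation_rows = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X), start=1):
    # Within a fold, no row is in both the training and the validation part.
    assert set(train_idx).isdisjoint(val_idx)
    validation_rows.extend(val_idx.tolist())
    print(f"fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")

# Across the five folds, every row is used for validation exactly once.
print(sorted(validation_rows))
```

- Keep in mind that this separation only covers what happens inside the loop. Anything computed on the full data before the split is not protected by it.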


- OK, let's see everything in action.

In [None]:
df = pd.read_csv('../input/water-potability/water_potability.csv')
df.head()

### We have already worked on this data. I am not going to make a detailed analysis on this data. Please refer to [The Most Used Methods to Deal with MISSING VALUES](https://www.kaggle.com/kaanboke/the-most-used-methods-to-deal-with-missing-values)

In [None]:
df.isnull().sum()

- OK, we have missing values, and most machine learning algorithms cannot handle missing values,
- so we have to deal with them.
- Let's use median imputation and evaluate our Logistic Regression model with cross validation.

- First we fill the missing values column by column with pandas `fillna`.

In [None]:
df = pd.read_csv('../input/water-potability/water_potability.csv')
df['ph']= df['ph'].fillna(df['ph'].median())
df['Sulfate']= df['Sulfate'].fillna(df['Sulfate'].median())
df['Trihalomethanes']= df['Trihalomethanes'].fillna(df['Trihalomethanes'].median())


X = df.drop('Potability', axis=1)
y = df['Potability']

scaler = MinMaxScaler()
X = scaler.fit_transform(X)

model = LogisticRegression(solver='liblinear')

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

result = cross_val_score(model, X, y,  scoring='roc_auc',cv=cv, n_jobs=-1)

print(f'{round(np.mean(result),6)}')

- Now we fill the missing values with `SimpleImputer` instead.

In [None]:
df = pd.read_csv('../input/water-potability/water_potability.csv')

X = df.drop('Potability', axis=1)
y = df['Potability']

imputer = SimpleImputer(strategy='median')
X = imputer.fit_transform(X)

scaler = MinMaxScaler()
X = scaler.fit_transform(X)

model = LogisticRegression(solver='liblinear')

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

result = cross_val_score(model, X, y,  scoring='roc_auc',cv=cv, n_jobs=-1)

print(f'{round(np.mean(result),6)}')

- OK, what did we do?
- First we filled the missing values with the median of each feature.
- Then we scaled the features to the [0, 1] range.
- Then we used cross validation, which separates the training and validation data for evaluation.

<a id="5"></a>
<font color="lightseagreen" size=+2.5><b>Where is the Data Leakage?</b></font>


<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

- **imputer = SimpleImputer(strategy='median')**

- ![](https://thefinanalyst.com/wp-content/uploads/2021/04/ME.jpg)

   - The imputer uses the whole dataset to compute the median of every feature that has missing values.
   - When cross validation later splits off the training folds,
   - the training data already carries knowledge about the validation data through these medians.
   - The training data knows the global medians, so it has more knowledge of the global distribution of the features than it should.



- **scaler = MinMaxScaler()**
   - As we have seen, when we normalize our data with `MinMaxScaler`,
   - we use the minimum and maximum values of each feature.

![](https://i.stack.imgur.com/EuitP.png)


- When cross validation later splits off the training folds,
- the training data already carries knowledge about the validation data through the minimum and maximum values of the features,
- because the training data was scaled with the global minimum and maximum.
- The training data has more knowledge of the global distribution of the features than it should.

- Both of the results above have a data leakage problem.
- This can lead to an incorrect (usually too optimistic) estimate of model performance.
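
- To make this concrete, here is a tiny NumPy sketch comparing statistics computed on all rows with statistics computed on the training rows only. The feature values are random, and treating the last 20 rows as the validation fold is an arbitrary choice for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.exponential(scale=10.0, size=100)  # made-up skewed feature

# Pretend the last 20 rows form the validation fold.
train_part, val_part = values[:80], values[80:]

# Statistics an imputer or scaler would learn from each set of rows.
global_stats = (np.median(values), values.min(), values.max())
train_stats = (np.median(train_part), train_part.min(), train_part.max())

print("median/min/max on all rows:     ", np.round(global_stats, 3))
print("median/min/max on training rows:", np.round(train_stats, 3))
```

- Whenever the two lines differ, the difference is information that came from the validation rows. Fitting on all rows lets that information reach the training data.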

<a id="6"></a>
<font color="lightseagreen" size=+2.5><b>Using Pipeline</b></font>


<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

- 'One definition of an ML pipeline is a means of automating the machine learning workflow by enabling data to be transformed and correlated into a model that can then be analyzed to achieve outputs. This type of ML pipeline makes the process of inputting data into the ML model fully automated.' (reference: https://algorithmia.com/blog/ml-pipeline)


- When we use a pipeline, the data passes through several automated steps before reaching the final output.

- In our case, the imputer and the scaler are fitted only on the training folds, not on the entire dataset or the validation fold.

- Let's see the code.

In [None]:
df = pd.read_csv('../input/water-potability/water_potability.csv')

X = df.drop('Potability', axis=1)
y = df['Potability']

model =LogisticRegression(solver='liblinear')

pipeline = Pipeline(steps=[('imp', SimpleImputer(strategy='median')),('s',MinMaxScaler()),('m', model)]) 

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

result = cross_val_score(pipeline, X, y,  scoring='roc_auc',cv=cv, n_jobs=-1)

print(f'{round(np.mean(result),6)}')

- OK, we see a small change in model performance.
- Because of the data leakage, one would expect a better score from the first two models.
- But the difference is very small.
- This might be because of the difficulty of the prediction task.
- The important thing is that, without a pipeline, cross validation evaluates only the model, which is simplistic and incorrect.
- With a pipeline, cross validation evaluates the entire pipeline of data preparation and model together.
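
- To show what "evaluating the entire pipeline" means, here is a hand-written loop that does roughly what `cross_val_score` does with a pipeline in each split. It is a simplified sketch on synthetic data (random features with about 10% of the values removed), not the library's actual implementation.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                                   # made-up features
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)  # made-up target
X[rng.random(X.shape) < 0.1] = np.nan                           # remove ~10% of the values

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in splitter.split(X, y):
    imputer = SimpleImputer(strategy="median")
    scaler = MinMaxScaler()
    model = LogisticRegression(solver="liblinear")

    # Fit every preparation step on the training fold only.
    X_train = scaler.fit_transform(imputer.fit_transform(X[train_idx]))
    model.fit(X_train, y[train_idx])

    # The validation fold is only transformed with the training statistics.
    X_val = scaler.transform(imputer.transform(X[val_idx]))
    scores.append(roc_auc_score(y[val_idx], model.predict_proba(X_val)[:, 1]))

print(f"{np.mean(scores):.3f}")
```

- The `Pipeline` object saves us from writing this loop by hand and from accidentally calling `fit` on the wrong rows.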

<a id="7"></a>
<font color="lightseagreen" size=+2.5><b>Cross-Validation & Pipeline- Correct Data Preparation</b></font>


<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

In [None]:
df = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
df.head()

### We have already worked on this data. I am not going to make a detailed analysis on this data. Please refer to [Beginner Friendly end to end ML Project](https://www.kaggle.com/kaanboke/beginner-friendly-end-to-end-ml-project-enjoy)

- Before modeling:
  - Handle the missing data.
  - Convert the categorical variables to numerical form for the ML model
  - Change the scale
  - Normalize the distribution

- We will do everything inside a pipeline, so that cross validation evaluates the entire pipeline of data preparation and model together.

In [None]:
df = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
df = df.drop('id', axis=1)

# Columns handled by the ColumnTransformer. Any column not listed here
# (for example 'gender' in this dataset) is dropped, because remainder='drop' is the default.
categorical = ['hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
numerical = ['avg_glucose_level', 'bmi', 'age']
y = df['stroke']
X = df.drop('stroke', axis=1)

model = LogisticRegression(solver='liblinear')

# Impute the numerical columns and one-hot encode the categorical ones.
transformer = ColumnTransformer(transformers=[('imp', SimpleImputer(strategy='median'), numerical), ('o', OneHotEncoder(), categorical)])

# Preparation steps and model in one estimator, so every CV split refits all of them.
pipeline = Pipeline(steps=[('t', transformer), ('p', PowerTransformer(method='yeo-johnson')), ('m', model)])

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

result = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)

print(f'{round(np.mean(result),3)}')

- Ok let's go step by step.

In [None]:
transformer = ColumnTransformer(transformers=[('imp',SimpleImputer(strategy='median'),numerical),('o',OneHotEncoder(),categorical)])


- First we deal with the missing values in the numerical features with `SimpleImputer` (median strategy).
- Second, we convert the categorical variables to an ML-readable numerical form with `OneHotEncoder`.
- Since each step applies to only part of the columns, we use `ColumnTransformer` for this task.
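
- Here is a toy version of the same transformer, so the output can be inspected by eye. The four-row DataFrame is made up; only the column names `bmi` and `smoking_status` are borrowed from the stroke data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: one numerical column with a gap, one categorical column.
toy = pd.DataFrame({
    "bmi": [20.0, np.nan, 30.0, 40.0],
    "smoking_status": ["smokes", "never smoked", "smokes", "never smoked"],
})

toy_transformer = ColumnTransformer(transformers=[
    ("imp", SimpleImputer(strategy="median"), ["bmi"]),   # numerical part
    ("o", OneHotEncoder(), ["smoking_status"]),           # categorical part
])

toy_out = toy_transformer.fit_transform(toy)
print(toy_out)
```

- The first output column is `bmi` with the gap filled by 30.0, the median of the observed values 20, 30 and 40. The remaining two columns are the one-hot indicators for the two smoking categories.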

In [None]:
pipeline = Pipeline(steps=[('t', transformer),('p',PowerTransformer(method='yeo-johnson')),('m', model)])    


- We defined the steps of the pipeline.
- First comes what we set up in the column transformer.
- Then, since our numerical variables are skewed and on different scales, we use `PowerTransformer` to normalize them for the Logistic Regression model. Placed here, it is applied to every column the transformer outputs, including the one-hot columns.
- Finally we call the model.
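
- A short sketch of what the Yeo-Johnson step does to a skewed feature. The log-normal sample and the small `skewness` helper are assumptions for this demo only; they are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # made-up right-skewed feature

def skewness(a):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    a = np.asarray(a).ravel()
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)

pt = PowerTransformer(method="yeo-johnson")  # standardize=True by default
transformed = pt.fit_transform(skewed)

print(f"skewness before: {skewness(skewed):.2f}")
print(f"skewness after:  {skewness(transformed):.2f}")
print(f"mean/std after:  {transformed.mean():.2f} / {transformed.std():.2f}")
```

- Because `standardize=True` is the default, the transformed feature also ends up with zero mean and unit variance, so no separate scaler is needed in this pipeline.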

In [None]:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

- We have defined our cross validation scheme.
- Since we are dealing with imbalanced data, we selected the stratified version of k-fold, with 10 splits and 3 repeats.
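
- A small check of what "stratified" buys us, assuming a made-up target with 5% positives (the class ratio is an illustrative choice, not the ratio of the stroke data).

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Made-up imbalanced target: 50 positives out of 1000 rows.
y = np.array([1] * 50 + [0] * 950)
X = np.zeros((len(y), 1))  # the feature values do not matter for the split itself

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

# Share of positives in each validation fold.
positive_rates = [y[val_idx].mean() for _, val_idx in cv.split(X, y)]

print(len(positive_rates))                      # 10 splits x 3 repeats
print(min(positive_rates), max(positive_rates))
```

- Every validation fold keeps the same share of positives as the full data, so each fold contains both classes and ROC AUC can be computed on it.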

In [None]:
result = cross_val_score(pipeline, X, y,  scoring='roc_auc',cv=cv, n_jobs=-1)

- First we pass in the pipeline, our features, and the target variable.
- We define our metric (ROC AUC).
- We pass in our cross validation scheme.

<a id="8"></a>
<font color="darkblue" size=+1.5><b>Conclusion</b></font>


<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>

- Applying data preparation methods to the whole dataset results in data leakage, which causes incorrect estimates of model performance.
- To prevent this, **data preparation must be fitted on the training dataset only**.

- Nothing else really. OK, if you want, I will say it one more time:
- **Data preparation must be fitted on the training dataset only**

#### By the way, if you like the topic, you can show your support 👍

####  **Feel free to leave a comment**. 

#### All the best 🤘

- **Enjoy** 🤘

![](https://media.giphy.com/media/l2JJyDYEX1tXFmCd2/giphy.gif)

<a id="9"></a>
<font color="darkblue" size=+1.5><b>References & Further Reading</b></font>


<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Table of Contents</a>


[Machine Learning - Beginner & Intermediate-Friendly BOOKS](https://www.kaggle.com/general/255972)