#### Data leakage:
    Data leakage in machine learning happens when the data used to train an algorithm contains information about the target that the model is trying to predict. The result is unrealistically high training (and possibly even validation) accuracy, followed by unreliable, poor predictions once the model is deployed.

    ------------------------------------------------------------------------------------------------------------------------
    Data leakage is the scenario where the machine learning model has already seen some part of the test data by the time training ends. It is a mistake made by the creator of the model, who accidentally shares information between the test and training data sets.

    The purpose of holding out the test set during training is to estimate how the model performs on totally unseen data, which in turn tells us how well the model generalises to unseen instances. Data leakage defeats this purpose: when care is not taken, it exposes the test set to the model during training.

    When such a model is then used on truly unseen data, typically in production, its performance will be much lower than expected.

    ------------------------------------------------------------------------------------------------------------------------
    There are two main types of leakage: 
        - target leakage
        - train-test contamination.

#### Target Leakage:
     Target leakage occurs when your predictors include data that will not be available at the time you make predictions. It is important to think about target leakage in terms of the timing or chronological order that data becomes available, not merely whether a feature helps make good predictions.
       
    Eg: Imagine you want to predict who will get sick with pneumonia. Below is the raw data:

        -----------------------------------------------------------------------
        got_pneumonia | age | weight | male  | took_antibiotic_medicine
        -----------------------------------------------------------------------
        False         | 65  | 100    | False | False
        False         | 72  | 130    | True  | False
        True          | 58  | 100    | False | True
        -----------------------------------------------------------------------

    People 'take antibiotic medicines' after getting pneumonia in order to recover. The raw data shows a strong relationship between those columns, but took_antibiotic_medicine is frequently changed after the value for got_pneumonia is determined. This is target leakage.
    
    The model would see that anyone who has a value of False for took_antibiotic_medicine didn't have pneumonia. Since validation data comes from the same source as training data, the pattern will repeat itself in validation, and the model will have great validation (or cross-validation) scores. But the model will be very inaccurate when subsequently deployed in the real world, because even patients who will get pneumonia won't have received antibiotics yet when we need to make predictions about their future health.
    
    Possible Solution : To prevent this type of data leakage, any variable updated (or created) after the target value is realized should be excluded.
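As a minimal sketch of that rule, the toy rows from the pneumonia table can be loaded into pandas and the post-target column dropped before training. The list `recorded_after_target` is a hypothetical piece of bookkeeping that you would have to maintain yourself; nothing in the data marks a column as leaky.

```python
import pandas as pd

# Toy rows copied from the pneumonia table above.
patients = pd.DataFrame({
    "got_pneumonia": [False, False, True],
    "age": [65, 72, 58],
    "weight": [100, 130, 100],
    "male": [False, True, False],
    "took_antibiotic_medicine": [False, False, True],
})

# Hypothetical bookkeeping: columns whose values are only recorded
# after the target is already known.
recorded_after_target = ["took_antibiotic_medicine"]

# In this toy sample the leaky column reproduces the target exactly.
leak_matches_target = (
    patients["took_antibiotic_medicine"] == patients["got_pneumonia"]
).all()

# Keep only predictors that exist at prediction time.
y_toy = patients["got_pneumonia"]
X_toy = patients.drop(columns=["got_pneumonia"] + recorded_after_target)

print(leak_matches_target)
print(list(X_toy.columns))
```

The decision about which columns go into `recorded_after_target` comes from knowing how and when the data was collected, not from any statistic computed on it.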

#### Train-Test Contamination:
    A different type of leak occurs when you aren't careful to distinguish training data from validation data. Recall that validation is meant to measure how the model does on data that it hasn't seen before. One can corrupt this process in subtle ways if the validation data affects the preprocessing behavior.
    
    When solving a machine learning problem, we first clean and preprocess the data, which involves steps such as:
    - Estimating the parameters (mean, standard deviation, variance) used to normalize or rescale features
    - Finding the minimum and maximum values of a particular feature
    - Normalizing a particular feature in our dataset
    - Removing outliers
    - Filling in or completely removing missing data
    - Encoding categorical values
    - Feature engineering such as feature extraction and feature selection
    
    Possible solution: All the preprocessing steps should be fitted using only the training set, ideally inside the cross-validation inner loop with the help of a scikit-learn Pipeline or the R caret package. By doing so, the transformations are never fitted on the hold-out validation set during training. The model won't know anything about the hold-out set, and this results in realistic estimates of performance on unseen data.
    
    Applying preprocessing techniques to the entire dataset causes the model to learn from the test set as well as the training set, and hence leaks data. If you perform techniques like feature selection on all of the data and then cross-validate, the test data in each fold of the cross-validation procedure has also been used to choose the features, and this is what biases the performance analysis.
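The feature selection case can be sketched on pure-noise data, where no honest model should beat chance by much. This is an illustration under stated assumptions, not part of the notebook's own analysis: the arrays `X_noise` and `y_noise` are synthetic, and `SelectKBest` with `f_classif` is just one convenient selector. Selecting features before cross-validation tends to report an accuracy well above chance, while putting the selector inside the pipeline tends to stay close to 50%.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(60, 2000))  # pure noise: no real signal
y_noise = np.repeat([0, 1], 30)        # balanced labels unrelated to X_noise

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: features are chosen using all rows, including the rows that
# will later serve as validation data in each fold.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y_noise, cv=cv_demo
)

# Safe: selection is a pipeline step, so it is re-fitted on the
# training part of every fold and never sees the validation rows.
safe_pipe = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
safe_scores = cross_val_score(safe_pipe, X_noise, y_noise, cv=cv_demo)

print(f"selection outside CV: {leaky_scores.mean():.2f}")
print(f"selection inside CV:  {safe_scores.mean():.2f}")
```

The same pattern applies to scalers, imputers and encoders: anything with a `fit` step belongs inside the pipeline that is passed to `cross_val_score`.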
    
    

#### Examples:
    1. Imagine we are working on a problem in which we have to build a model that predicts a certain medical condition. A feature that indicates whether a patient had surgery related to that condition causes data leakage and should never be included in the training data. The indication of surgery is highly predictive of the medical condition and would probably not be available in all cases. If we already know that a patient had surgery related to a medical condition, we may not even need a predictive model to start with.
    
    2. Let’s imagine we are working on a problem in which we have to build a model that predicts whether a user will stay on a website. Including features that expose information about future visits will cause data leakage. So we have to use only features about the current session, because information about future sessions is generally not available once the model is deployed.

#### How to detect Data Leakage?
    1. In general, if the model we build looks too good to be true (i.e. the predicted and actual outputs are almost always the same), we should get suspicious: data leakage cannot be ruled out. The model might be memorizing the relationship between a feature and the target instead of learning something that generalizes to unseen data. It is therefore advisable to write down the results we expect before testing and to weigh the observed results against them.
    
    2. While doing Exploratory Data Analysis (EDA), we may detect features that are very highly correlated with the target variable. Of course, some features are more correlated than others, but a surprisingly high correlation needs to be checked and handled carefully. We should pay close attention to those features.
    
    3. After model training, if some features have very high weights (for tree-based models, check feature_importances_), we should pay close attention. Those features might be leaky.
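Checks 2 and 3 can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the column names `honest_feature` and `leaky_feature` are invented, the leaky column is built as a near copy of the target, and the 0.95 correlation threshold is arbitrary rather than a recommended value.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rows = 500
target = rng.integers(0, 2, size=n_rows)

frame = pd.DataFrame({
    # Weakly related to the target, as a legitimate predictor might be.
    "honest_feature": rng.normal(size=n_rows) + 0.5 * target,
    # Almost a copy of the target: the kind of column to be suspicious of.
    "leaky_feature": 10.0 * target + rng.normal(scale=0.1, size=n_rows),
})

# Check 2: absolute correlation of each feature with the target.
correlations = (
    frame.corrwith(pd.Series(target)).abs().sort_values(ascending=False)
)
suspicious = correlations[correlations > 0.95].index.tolist()

# Check 3: impurity-based importances from a fitted random forest.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(frame, target)
importances = pd.Series(
    forest.feature_importances_, index=frame.columns
).sort_values(ascending=False)

print(suspicious)
print(importances)
```

A high correlation or importance is only a prompt to investigate how the column was produced; a genuinely strong predictor can look the same.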

#### Tips to combat Data Leakage:
    1. Temporal Cutoff
        - Remove all data just prior to the event of interest, focusing on the time you learned about a fact or observation rather than the time the observation occurred. Time-series data needs particular care: if we somehow use data from the future when computing current features or predictions, we are very likely to end up with a leaky model. This generally happens when the data is randomly split into train and test subsets. So, when working with time-series data, put a cutoff on time, which prevents any information from after the time of prediction from reaching the model.

    2. Add Noise
        - Add random noise to the input data to try to smooth out the effects of possibly leaking variables.

    3. Remove Leaky Variables
        - Evaluate simple rule-based models like OneR using variables such as account numbers and IDs to see if these variables are leaky, and if so, remove them. If you suspect a variable is leaky, consider removing it. Always extract an appropriate set of features.

    4. Use Pipelines
        - Make heavy use of pipeline architectures that allow a sequence of data preparation steps to be performed within cross-validation folds, such as the caret package in R and Pipelines in scikit-learn.

    5. Use a Holdout Dataset
        - Hold back an unseen validation dataset as a final sanity check of your model before you use it.
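The temporal cutoff tip can be sketched with pandas. The event log below is made up, and the column names and the cutoff date are hypothetical; the point is only that the split is made on time rather than at random, so every training row precedes every test row.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Made-up event log; column names are hypothetical.
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-03-15",
        "2023-04-20", "2023-05-25", "2023-06-30",
    ]),
    "feature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "label": [0, 1, 0, 1, 0, 1],
}).sort_values("timestamp").reset_index(drop=True)

# Hypothetical prediction time: nothing at or after it may be used for training.
cutoff = pd.Timestamp("2023-05-01")

train = events[events["timestamp"] < cutoff]
test = events[events["timestamp"] >= cutoff]

# For cross-validation on time-ordered rows, TimeSeriesSplit keeps each
# training window strictly before its test window.
for train_idx, test_idx in TimeSeriesSplit(n_splits=2).split(events):
    assert train_idx.max() < test_idx.min()

print(len(train), len(test))
```

A random `KFold` with `shuffle=True` on the same rows would mix later events into the training folds, which is exactly the leak described in the first tip.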

In [72]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.pipeline import make_pipeline
from collections import Counter

In [58]:
#Example to show Target Leakage using credit card eligibility data set

# Raw string so the backslashes in the Windows path are not treated as escape sequences
data = pd.read_csv(r"E:\Learning\ML\Datasets\data\AER_credit_card_data\AER_credit_card_data.csv", true_values=['yes'], false_values=['no'])
data.head()

Unnamed: 0,card,reports,age,income,share,expenditure,owner,selfemp,dependents,months,majorcards,active
0,True,0,37.66667,4.52,0.03327,124.9833,True,False,3,54,1,12
1,True,0,33.25,2.42,0.005217,9.854167,False,False,3,34,1,13
2,True,0,33.66667,4.5,0.004156,15.0,True,False,4,58,1,5
3,True,0,30.5,2.54,0.065214,137.8692,False,False,0,25,1,7
4,True,0,32.16667,9.7867,0.067051,546.5033,True,False,2,64,1,5


#### Info:
    Here is a summary of the data
        card: 1 if credit card application accepted, 0 if not
        reports: Number of major derogatory reports
        age: Age in years plus twelfths of a year
        income: Yearly income (divided by 10,000)
        share: Ratio of monthly credit card expenditure to yearly income
        expenditure: Average monthly credit card expenditure
        owner: 1 if owns home, 0 if rents
        selfemp: 1 if self-employed, 0 if not
        dependents: 1 + number of dependents
        months: Months living at current address
        majorcards: Number of major credit cards held
        active: Number of active credit accounts

In [82]:
X = data.iloc[:, data.columns != 'card']
Y = data['card']

In [83]:
X.shape

(1319, 11)

In [61]:
Y.shape

(1319,)

In [71]:
Y.value_counts()

True     1023
False     296
Name: card, dtype: int64

In [62]:
Y

0        True
1        True
2        True
3        True
4        True
        ...  
1314     True
1315    False
1316     True
1317     True
1318     True
Name: card, Length: 1319, dtype: bool

In [64]:
clf = RandomForestClassifier()
pipe = make_pipeline(clf)
cv = KFold(n_splits=10, random_state=0, shuffle=True)
scores = cross_val_score(pipe, X, Y, cv=cv, scoring='accuracy', error_score='raise')
print(f"Overall average cross-validation score of the model is: {scores.mean()}")

Overall average cross-validation score of the model is: 0.9802972472819803


#### Observation: 
    It's very rare to find models that are accurate 98% of the time. It happens, but it's uncommon enough that we should inspect the data more closely for target leakage. A few variables look suspicious. For example, does expenditure mean expenditure on this card or on cards used before applying? At this point, basic data comparisons can be very helpful.

In [68]:
expenditure_cardholders = X.expenditure[Y]

In [70]:
expenditure_cardholders

0       124.983300
1         9.854167
2        15.000000
3       137.869200
4       546.503300
           ...    
1310      4.583333
1314      7.333333
1316    101.298300
1317     26.996670
1318    344.157500
Name: expenditure, Length: 1023, dtype: float64

In [74]:
len(Counter(expenditure_cardholders))

981

In [75]:
expenditure_noncardholders = X.expenditure[~Y]

In [76]:
len(Counter(expenditure_noncardholders))

1

In [77]:
print("Fraction of those who did not receive a card and had no expenditures: %.2f"%((expenditure_noncardholders==0).mean()))

Fraction of those who did not receive a card and had no expenditures: 1.00


In [78]:
print("Fraction of those who received a card and had no expenditures: %.2f"%((expenditure_cardholders==0).mean()))

Fraction of those who received a card and had no expenditures: 0.02


#### Observation:
    Everyone who did not receive a card had no expenditures, while only 2% of those who received a card had no expenditures (so 98% of cardholders did spend). It's not surprising that our model appeared to have high accuracy. But this also seems to be a case of target leakage, where expenditure probably means expenditure on the card they applied for.
    Since share is partially determined by expenditure, it should be excluded too. The variables active and majorcards are a little less clear, but from the description, they sound concerning. In most situations, it's better to be safe than sorry if you can't track down the people who created the data to find out more.
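One way to put a number on such suspicions is to score each column on its own with a very simple model. The helper below is a hypothetical sketch: it uses a depth-1 decision tree as a stand-in for a one-rule model (it is not OneR itself), and the toy frame only mimics the expenditure pattern seen above with made-up values.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def single_feature_scores(features: pd.DataFrame, target: pd.Series, cv: int = 5) -> pd.Series:
    """Cross-validated accuracy of a depth-1 tree trained on each column alone."""
    scores = {}
    for column in features.columns:
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        scores[column] = cross_val_score(
            stump, features[[column]], target, cv=cv, scoring="accuracy"
        ).mean()
    return pd.Series(scores).sort_values(ascending=False)


# Tiny synthetic stand-in for the credit card data (values are made up):
# non-cardholders always have zero expenditure, cardholders never do,
# and age is distributed identically in both groups.
toy_features = pd.DataFrame({
    "expenditure": [0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 80.0, 5.0, 300.0, 44.0] * 5,
    "age": [25, 40, 33, 51, 29, 25, 40, 33, 51, 29] * 5,
})
toy_target = pd.Series(([False] * 5 + [True] * 5) * 5)

per_feature = single_feature_scores(toy_features, toy_target)
print(per_feature)
```

A single column that nearly matches the accuracy of the full model on its own, as `expenditure` does in this toy frame, deserves the kind of scrutiny applied above.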

In [85]:
# Dropping leaky predictors from dataset
potential_leakage = ['expenditure', 'share', 'active', 'majorcards']
X = X.drop(potential_leakage, axis=1)
X.head()

Unnamed: 0,reports,age,income,owner,selfemp,dependents,months
0,0,37.66667,4.52,True,False,3,54
1,0,33.25,2.42,False,False,3,34
2,0,33.66667,4.5,True,False,4,58
3,0,30.5,2.54,False,False,0,25
4,0,32.16667,9.7867,True,False,2,64


In [86]:
scores = cross_val_score(pipe, X, Y, cv=cv, scoring='accuracy', error_score='raise')
print(f"Overall average cross-validation score of the model is: {scores.mean()}")

Overall average cross-validation score of the model is: 0.8256187832523711


#### Observation:
    This accuracy is quite a bit lower, which might be disappointing. However, we can expect the model to be right a little over 80% of the time when used on new applications, whereas the leaky model would likely do much worse than that (in spite of its higher apparent score in cross-validation).