# Data Leakage
Data leakage (or leakage) happens when your training data contains information about the target, but similar data will
not be available when the model is used for prediction. This leads to high performance on the training set (and
possibly even the validation data), but the model will perform poorly in production.

In other words, leakage causes a model to look accurate until you start making decisions with the model, and then the
model becomes very inaccurate.

There are two main types of leakage:

    1.  target leakage
    2.  train-test contamination.


# Target Leakage
Target leakage occurs when your predictors include data that will not be available at the time you make predictions. It
is important to think about target leakage in terms of the timing or chronological order that data becomes available,
not merely whether a feature helps to make good predictions.

An example will be helpful. Imagine you want to predict who will get sick with pneumonia.

The top few rows of your raw data look like this:

    got_pneumonia   age   weight   male    took_antibiotic_medicine

    False           65    100      False   False
    False           72    130      True    False
    True            58    100      False   True

People take antibiotic medicines after getting pneumonia in order to recover. The raw data shows a strong relationship
between those columns, but took_antibiotic_medicine is frequently changed after the value for got_pneumonia is
determined. This is target leakage.

The model would see that anyone who has a value of False for took_antibiotic_medicine didn't have pneumonia. Since
validation data comes from the same source as training data, the pattern will repeat itself in validation, and the
model will have great validation (or cross-validation) scores.

But the model will be very inaccurate when subsequently deployed in the real world, because even patients who will get
pneumonia won't have received antibiotics yet when we need to make predictions about their future health.

To prevent this type of data leakage, any variable updated (or created) after the target value is realized should be
excluded.

    Generally: if a feature changes after the target value is determined, it is unreliable and should be excluded:

        - You can be sick but not have started taking medicine yet - False Negative
        - You can take medicine prophylactically even though you are not infected - False Positive
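
The exclusion rule can be sketched on a toy version of the pneumonia table. The rows and the `realized_after_target`
list below are made up for illustration; in practice, deciding which columns belong in that list requires knowing how
and when the data was collected.

```python
import pandas as pd

# Toy rows mirroring the table above (illustrative values only).
patients = pd.DataFrame({
    'got_pneumonia': [False, False, True],
    'age': [65, 72, 58],
    'weight': [100, 130, 100],
    'male': [False, True, False],
    'took_antibiotic_medicine': [False, False, True],
})

# Hypothetical list: columns whose values are set or updated after the target is known.
realized_after_target = ['took_antibiotic_medicine']

y_toy = patients['got_pneumonia']
# Drop the target itself and every column that is realized after it.
X_toy = patients.drop(columns=['got_pneumonia'] + realized_after_target)

print(X_toy.columns.tolist())
```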

# Train - Test Contamination
A different type of leak occurs when you aren't careful to distinguish training data from validation data.

Recall that validation is meant to be a measure of how the model does on data that it hasn't considered before. You can
corrupt this process in subtle ways if the validation data affects the preprocessing behavior. This is sometimes called
train-test contamination.

After all, you incorporated data from the validation or test data into how you make predictions, so the model may do
well on that particular data even if it can't generalize to new data. This problem becomes even more subtle (and more
dangerous) when you do more complex feature engineering.

If your validation is based on a simple train-test split, exclude the validation data from any type of fitting,
including the fitting of preprocessing steps. This is easier if you use scikit-learn pipelines. When using
cross-validation, it's even more critical that you do your preprocessing inside the pipeline!
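
As a minimal sketch of what "preprocessing inside the pipeline" means, the example below uses synthetic data and an
arbitrarily chosen imputer, scaler and classifier (none of these come from the credit card example that follows).
Because the preprocessing steps are part of the estimator passed to `cross_val_score`, they are refitted on the
training part of every fold and only applied to the held-out part.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 10% missing values (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# The imputer and scaler are fitted per fold, on training rows only.
# Calling StandardScaler().fit(X_demo) before the split would instead
# let validation rows influence the preprocessing.
demo_pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(),
)
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring='accuracy')
print('Mean accuracy: {:.2%}'.format(demo_scores.mean()))
```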

In [73]:
import pandas as pd
import numpy as np

Use true_values and false_values, in order to change values 'yes' to True and 'no' to False.

In [74]:
card_df = pd.read_csv('data/AER_credit_card_data.csv', true_values=['yes'], false_values=['no'])

In [75]:
card_df['age'] = card_df['age'].apply(np.floor).astype(int)

In [76]:
print(card_df.head())

   card  reports  age  income     share  expenditure  owner  selfemp  \
0  True        0   37  4.5200  0.033270   124.983300   True    False   
1  True        0   33  2.4200  0.005217     9.854167  False    False   
2  True        0   33  4.5000  0.004156    15.000000   True    False   
3  True        0   30  2.5400  0.065214   137.869200  False    False   
4  True        0   32  9.7867  0.067051   546.503300   True    False   

   dependents  months  majorcards  active  
0           3      54           1      12  
1           3      34           1      13  
2           4      58           1       5  
3           0      25           1       7  
4           2      64           1       5  


In [77]:
X = card_df.drop('card', axis=1)
y = card_df['card']

In [78]:
print('Number of rows in the dataset:', X.shape[0])

Number of rows in the dataset: 1319


With only 1319 rows, this is a small dataset. We will use cross-validation to get a more reliable estimate of model
accuracy.

In [79]:
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

This dataset needs no preprocessing, so a Pipeline is not strictly necessary
    - however, we will use one anyway because it is good practice!

In [80]:
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100, random_state=33))
cv_scores = cross_val_score(my_pipeline, X, y,
                            cv=5,
                            scoring='accuracy')
print('Cross validation accuracy: {:.2%}'.format(cv_scores.mean()))

Cross validation accuracy: 97.95%


With experience, you'll find that it's very rare to find models that are accurate 98% of the time. It happens, but it's
uncommon enough that we should inspect the data more closely for target leakage.

Here is a summary of the data:

    ·   card: 1 if credit card application accepted, 0 if not
    ·   reports: Number of major derogatory reports
    ·   age: Age in years plus twelfths of a year
    ·   income: Yearly income (divided by 10,000)
    ·   share: Ratio of monthly credit card expenditure to yearly income
    ·   expenditure: Average monthly credit card expenditure
    ·   owner: 1 if owns home, 0 if rents
    ·   selfemp: 1 if self-employed, 0 if not
    ·   dependents: 1 + number of dependents
    ·   months: Months living at current address
    ·   majorcards: Number of major credit cards held
    ·   active: Number of active credit accounts

A few variables look suspicious.

    For example:
    - Does expenditure mean expenditure on this card
    or on cards used before applying?

At this point, basic data comparisons can be very helpful:

In [98]:
expenditure_cardholders = X.expenditure[y]  # Data for expenditure column, for cardholders.
expenditure_non_cardholders = X.expenditure[~y] # Data for expenditure column, for non-cardholders.

In [99]:
print('Fraction of those who did not receive a card and had no expenditures: {:.2f}'.format(
    (expenditure_non_cardholders == 0).mean()
))
print('Fraction of those who received a card and had no expenditures: {:.2f}'.format(
    (expenditure_cardholders == 0).mean()
))

Fraction of those who did not receive a card and had no expenditures: 1.00
Fraction of those who received a card and had no expenditures: 0.78


As shown above, everyone who did not receive a card had no expenditures, while only 2% of those who received a card had
no expenditures. It's not surprising that our model appeared to have a high accuracy. But this also seems to be a case
of target leakage, where expenditures probably means expenditures on the card they applied for.

    - No expenditures, because no card was received. That's pretty straightforward. But it was very important to check
    whether 'expenditure' means spending before or after the card application.

Because there cannot be any card expenditures without a card, this is a great example of what target leakage looks
like: the target is predicted from a feature whose value is only determined after the target is known.

Since share is partially determined by expenditure, it should be excluded too. The variables active and majorcards are a
little less clear, but from the description, they sound concerning. In most situations, it's better to be safe than
sorry if you can't track down the people who created the data to find out more.
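
One way to screen for suspicious columns before reading the data dictionary in detail is to score each feature on its
own. The sketch below is only an illustration on synthetic data: the column names `honest` and `leaky` and the choice
of a depth-1 decision tree are assumptions, not part of the original analysis. A single feature that predicts the
target almost perfectly is a signal to investigate how that column is produced; it is not proof of leakage.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit card data (illustrative only):
# 'leaky' is zero whenever the target is False, 'honest' is only weakly related.
rng = np.random.default_rng(0)
target = pd.Series(rng.random(500) < 0.75)
demo_X = pd.DataFrame({
    'honest': rng.normal(size=500) + 0.5 * target,
    'leaky': np.where(target, rng.uniform(10, 100, size=500), 0.0),
})

# Score each feature on its own with a depth-1 tree (a single threshold).
single_feature_scores = {
    col: cross_val_score(DecisionTreeClassifier(max_depth=1, random_state=0),
                         demo_X[[col]], target, cv=5, scoring='accuracy').mean()
    for col in demo_X.columns
}
print(single_feature_scores)
```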

We would run a model without target leakage as follows:

    - expenditure - a non-cardholder cannot have any expenditures on the card, so this feature causes target leakage
    - share - it is the ratio of monthly expenditure to yearly income; since share is derived from expenditure, it
    causes target leakage too
    - majorcards and active - once you become a cardholder, does the new card count towards majorcards? It certainly
    becomes an active account, so do these features describe accounts held before the application or current ones?
    To be safe, it is better to exclude these two features too.

In [103]:
potential_leaks = ['expenditure', 'share', 'majorcards', 'active']
new_X = X.drop(columns=potential_leaks)

In [104]:
new_cv_scores = cross_val_score(my_pipeline, new_X, y,
                                cv=5,
                                scoring='accuracy')
print('Cross validation accuracy: {:.2%}'.format(new_cv_scores.mean()))

Cross validation accuracy: 83.17%


# Conclusion
Data leakage can be a multi-million dollar mistake in many data science applications. Careful separation of training and
validation data can prevent train-test contamination, and pipelines can help implement this separation. Likewise, a
combination of caution, common sense, and data exploration can help identify target leakage.

# 1. The Data Science of Shoelaces
Nike has hired you as a data science consultant to help them save money on shoe materials. Your first assignment is to
review a model one of their employees built to predict how many shoelaces they'll need each month. The features going
into the machine learning model include:

    1.  The current month (January, February, etc)
    2.  Advertising expenditures in the previous month
    3.  Various macroeconomic features (like the unemployment rate) as of the beginning of the current month
    4.  The amount of leather they ended up using in the current month

The results show the model is almost perfectly accurate if you include the feature about how much leather they used.
But it is only moderately accurate if you leave that feature out. You realize this is because the amount of leather
they use is a perfect indicator of how many shoes they produce, which in turn tells you how many shoelaces they need.

Do you think the leather used feature constitutes a source of data leakage? If your answer is "it depends," what does
it depend on?

    This is tricky, and it depends on details of how data is collected (which is common when thinking about leakage).
    Would you at the beginning of the month decide how much leather will be used that month? If so, this is ok. But if
    that is determined during the month, you would not have access to it when you make the prediction. If you have a
    guess at the beginning of the month, and it is subsequently changed during the month, the actual amount used during
    the month cannot be used as a feature (because it causes leakage).

    - Even if you decide at the beginning of the month how much leather to buy, the amount actually used keeps changing
    with the number of shoes produced, and the number of shoelaces changes in parallel. A feature that changes together
    with the target is not available when you make the prediction, so using it causes target leakage.

# 2. Return of the Shoelaces
You have a new idea. You could use the amount of leather Nike ordered (rather than the amount they actually used)
leading up to a given month as a predictor in your shoelace model.

Does this change your answer about whether there is a leakage problem? If you answer "it depends," what does it depend
on?

    - This could be fine, but it depends on whether they order shoelaces first or leather first. If they order
    shoelaces first, you won't know how much leather they've ordered when you predict their shoelace needs. If they
    order leather first, then you'll have that number available when you place your shoelace order, and you should be
    ok. But this only works if the leather order is fixed once placed. The shoelace prediction then depends on a
    quantity that cannot change while you are predicting, so it will not cause leakage.

    In general, if a feature cannot change during or after prediction, it will not cause target leakage.
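
A feature that is fixed before the prediction is made can be built by lagging the data. The sketch below uses made-up
monthly figures and assumes, purely for illustration, that each month's leather order is placed for the following
month; the column names are hypothetical.

```python
import pandas as pd

# Made-up monthly figures (illustrative only).
monthly = pd.DataFrame({
    'month': pd.period_range('2023-01', periods=5, freq='M'),
    'leather_used': [120, 135, 128, 150, 160],      # known only after the month ends
    'leather_ordered': [130, 130, 140, 155, 150],   # assumed to be ordered for the following month
})

# Shift by one row so each month's feature comes from an order placed before that month began.
monthly['leather_ordered_prev'] = monthly['leather_ordered'].shift(1)

# The first month has no earlier order, so it is dropped; 'leather_used' is left out entirely.
features = monthly[['month', 'leather_ordered_prev']].dropna()
print(features)
```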

# 3. Getting Rich With Cryptocurrencies?
You saved Nike so much money that they gave you a bonus. Congratulations.

Your friend, who is also a data scientist, says he has built a model that will let you turn your bonus into millions of
dollars. Specifically, his model predicts the price of a new cryptocurrency (like Bitcoin, but a newer one) one day
ahead of the moment of prediction. His plan is to purchase the cryptocurrency whenever the model says the price of the
currency (in dollars) is about to go up.

The most important features in his model are:

    1.  Current price of the currency
    2.  Amount of the currency sold in the last 24 hours
    3.  Change in the currency price in the last 24 hours
    4.  Change in the currency price in the last 1 hour
    5.  Number of new tweets in the last 24 hours that mention the currency

The value of the cryptocurrency in dollars has fluctuated up and down by over $100 in the last year, and yet his
model's average error is less than $1. He says this is proof his model is accurate, and you should invest with him,
buying the currency whenever the model says it is about to go up.

Is he right? If there is a problem with his model, what is it?

    - There is no source of leakage here. All of the feature data is gathered in the past, so the values are fixed and
    cannot change during prediction. These features should be available at the moment you want to make a prediction,
    and they're unlikely to be changed in the training data after the prediction target is determined.

    But, the way he describes accuracy could be misleading if you aren't careful. If the price moves gradually, today's
    price will be an accurate predictor of tomorrow's price, but it may not tell you whether it's a good time to invest.

    For instance, if it is $100 today, a model predicting a price of $100 tomorrow may seem accurate, even if it can't
    tell you whether the price is going up or down from the current price. It simply shows no trend.

    A better prediction target would be the change in price over the next day. If you can consistently predict whether
    the price is about to go up or down (and by how much), you may have a winning investment opportunity.
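
The point about misleading accuracy can be illustrated with a simulation. The prices below are a synthetic random walk,
not real market data, and the step size is an arbitrary assumption. A "persistence" forecast that simply repeats
today's price has a small average error compared with the yearly price range, yet it carries no information about
direction.

```python
import numpy as np

# Simulated daily prices following a slow random walk (illustrative, not real market data).
rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0, 0.5, size=365))

# Persistence forecast: tomorrow's price is predicted to equal today's price.
today, tomorrow = prices[:-1], prices[1:]
persistence_mae = np.abs(tomorrow - today).mean()
price_range = prices.max() - prices.min()

# Expressed as a predicted change, the same forecast is always zero,
# so it never says whether the price will rise or fall.
predicted_change = np.zeros_like(today)
actual_change = tomorrow - today
correct_direction_days = int((np.sign(predicted_change) == np.sign(actual_change)).sum())

print('Yearly price range: {:.2f}'.format(price_range))
print('Persistence forecast MAE: {:.2f}'.format(persistence_mae))
print('Days with a correctly predicted direction:', correct_direction_days)
```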

# 4. Preventing Infections
An agency that provides healthcare wants to predict which patients from a rare surgery are at risk of infection, so it
can alert the nurses to be especially careful when following up with those patients.

You want to build a model. Each row in the modeling dataset will be a single patient who received the surgery, and the
prediction target will be whether they got an infection.

Some surgeons may do the procedure in a manner that raises or lowers the risk of infection. But how can you best
incorporate the surgeon information into the model?

You have a clever idea.

    1.  Take all surgeries by each surgeon and calculate the infection rate for each surgeon.
    2.  For each patient in the data, find out who the surgeon was and plug in that surgeon's average infection rate as
    a feature.

Does this pose any target leakage issues? Does it pose any train-test contamination issues?

    - This poses a risk of both target leakage and train-test contamination (though you may be able to avoid both if
    you are careful).

    You have target leakage if a given patient's outcome contributes to the infection rate for his surgeon, which is
    then plugged back into the prediction model for whether that patient becomes infected. You can avoid target leakage
    if you calculate the surgeon's infection rate by using only the surgeries before the patient we are predicting for.
    Calculating this for each surgery in your training data may be a little tricky.

    You also have a train-test contamination problem if you calculate this using all surgeries a surgeon performed,
    including those from the test-set. The result would be that your model could look very accurate on the test set,
    even if it wouldn't generalize well to new patients after the model is deployed. This would happen because the
    surgeon-risk feature accounts for data in the test set. Test sets exist to estimate how the model will do when
    seeing new data. So this contamination defeats the purpose of the test set.
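
One possible way to compute the surgeon feature from earlier surgeries only is sketched below. The surgery log is made
up, and the column names are hypothetical. Each row's rate uses only that surgeon's surgeries that come earlier in the
date-sorted log, so a patient's own outcome never feeds into their own feature. If the validation rows are also later
in time than the training rows, no later outcome enters an earlier row's feature.

```python
import pandas as pd

# Made-up surgery log, sorted by date (illustrative only).
surgeries = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-05', '2023-01-09', '2023-02-01',
                            '2023-02-10', '2023-03-03', '2023-03-20']),
    'surgeon': ['A', 'B', 'A', 'A', 'B', 'A'],
    'infected': [1, 0, 0, 1, 0, 0],
}).sort_values('date').reset_index(drop=True)

by_surgeon = surgeries.groupby('surgeon')['infected']

# Infections and surgery counts strictly before the current row, per surgeon.
prior_infections = by_surgeon.cumsum() - surgeries['infected']
prior_surgeries = by_surgeon.cumcount()

# NaN for a surgeon's first surgery, where no history exists yet.
surgeries['surgeon_prior_rate'] = (prior_infections / prior_surgeries).where(prior_surgeries > 0)
print(surgeries)
```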

# 5. Housing Prices
You will build a model to predict housing prices. The model will be deployed on an ongoing basis, to predict the price
of a new house when a description is added to a website. Here are four features that could be used as predictors.

    1.  Size of the house (in square meters)
    2.  Average sales price of homes in the same neighborhood
    3.  Latitude and longitude of the house
    4.  Whether the house has a basement

You have historic data to train and validate the model.

Which of the features is most likely to be a source of leakage?

    - 2 is the source of target leakage. Here is an analysis for each feature:

    1.  The size of a house is unlikely to be changed after it is sold (though technically it's possible). But
    typically this will be available when we need to make a prediction, and the data won't be modified after the home
    is sold. So it is pretty safe.

    2.  We don't know the rules for when this is updated. If the field is updated in the raw data after a home was sold,
    and the home's sale is used to calculate the average, this constitutes a case of target leakage. At an extreme, if
    only one home is sold in the neighborhood, and it is the home we are trying to predict, then the average will be
    exactly equal to the value we are trying to predict. In general, for neighborhoods with few sales, the model will
    perform very well on the training data. But when you apply the model, the home you are predicting won't have been
    sold yet, so this feature won't work the same as it did in the training data.

    3.  These don't change, and will be available at the time we want to make a prediction. So there's no risk of target
    leakage here.

    4.  This also doesn't change, and it is available at the time we want to make a prediction. So there's no risk of
    target leakage here.
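
The extreme case described for feature 2 can be reproduced with a few made-up sales (the values and column names are
illustrative assumptions). When the neighborhood average includes the house itself, a neighborhood with a single sale
yields a feature identical to the target. Excluding the house's own price removes that particular leak, although a
deployed model should go further and use only sales recorded before the listing date.

```python
import pandas as pd

# Made-up sales (illustrative only); neighborhood 'N2' has a single sale.
sales = pd.DataFrame({
    'neighborhood': ['N1', 'N1', 'N1', 'N2'],
    'price': [200_000, 220_000, 240_000, 500_000],
})

grp = sales.groupby('neighborhood')['price']

# Leaky version: the average includes the house being predicted.
sales['avg_incl_self'] = grp.transform('mean')

# Leave-one-out version: average of the other sales in the neighborhood,
# NaN when the house is the only sale there.
n_others = grp.transform('count') - 1
sales['avg_excl_self'] = ((grp.transform('sum') - sales['price']) / n_others).where(n_others > 0)
print(sales)
```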

# Conclusions v2:
    - Target leakage occurs if a feature can change during or after the moment the target is determined. Data that was
    fixed before the prediction, such as a price recorded over a specific past time interval, carries no risk of target
    leakage. If a feature and the target are highly dependent and the feature's value is not fixed but keeps changing
    along with the target, target leakage is inevitable. For example, if you use the number of smartphones sold as a
    feature to predict the number of headphones sold, you will experience target leakage, because the smartphone count
    is still changing, together with the target, over the period you are predicting. You could avoid this problem by
    using data from a fixed, already completed time interval.

    - Train-test contamination occurs if you validate your model using data that was used during preprocessing or
    training. If you train the model on specific data and then validate its accuracy using the same information, the
    measured accuracy will be much higher than the real one. For example, if you predict whether a patient got
    infected and the same patient appears in both the training and validation sets, you are experiencing train-test
    contamination. You could avoid this problem by validating only on data that your model hasn't seen before.
