Data leakage (or leakage) happens when your training data contains information about the target, but similar data will not be available when the model is used for prediction. This leads to high performance on the training set (and possibly even the validation data), but the model will perform poorly in production.

In other words, leakage makes a model look accurate until you start making decisions with it, at which point it becomes very inaccurate.

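There are two main types of leakage: target leakage and train-test contamination. Target leakage occurs when your predictors include data that will not be available at the time you make predictions. Train-test contamination occurs when validation or test data influences how the model is trained or how the data is preprocessed.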
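
As a minimal sketch of train-test contamination, the snippet below fits a scaler two ways: once on all rows, and once on the training rows only. It assumes synthetic data, and the names `X_demo`, `y_demo`, `leaky_scaler`, and `clean_scaler` are hypothetical and not part of this notebook's dataset. The point is only that the first scaler's statistics depend on validation rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (assumption: 200 rows, 3 numeric features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Contaminated: the scaler computes its mean and variance from ALL rows,
# so validation rows influence the preprocessing
leaky_scaler = StandardScaler().fit(X_demo)

# Clean: the scaler is fit on training rows only, then applied to validation rows
clean_scaler = StandardScaler().fit(X_train)
X_valid_scaled = clean_scaler.transform(X_valid)

# Compare the statistics each scaler learned
print("fit on all rows:     ", leaky_scaler.mean_)
print("fit on training rows:", clean_scaler.mean_)
```

Placing the preprocessing step inside a scikit-learn pipeline and passing that pipeline to `cross_val_score` applies the "clean" pattern automatically, because the step is refit on the training portion of every fold.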

In [1]:
import pandas as pd

In [10]:
data = pd.read_csv(".\\data\\aer-credit-card-data\\AER_credit_card_data.csv", true_values=['yes'], false_values=['no'])
data.head()

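|   | card | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|------|---------|-----|--------|-------|-------------|-------|---------|------------|--------|------------|--------|
| 0 | True | 0 | 37.66667 | 4.52 | 0.03327 | 124.9833 | True | False | 3 | 54 | 1 | 12 |
| 1 | True | 0 | 33.25 | 2.42 | 0.005217 | 9.854167 | False | False | 3 | 34 | 1 | 13 |
| 2 | True | 0 | 33.66667 | 4.5 | 0.004156 | 15.0 | True | False | 4 | 58 | 1 | 5 |
| 3 | True | 0 | 30.5 | 2.54 | 0.065214 | 137.8692 | False | False | 0 | 25 | 1 | 7 |
| 4 | True | 0 | 32.16667 | 9.7867 | 0.067051 | 546.5033 | True | False | 2 | 64 | 1 | 5 |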


In [11]:
y = data.card
X = data.drop(['card'], axis=1)

In [12]:
print(X.shape)
X.head()

(1319, 11)


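|   | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|---------|-----|--------|-------|-------------|-------|---------|------------|--------|------------|--------|
| 0 | 0 | 37.66667 | 4.52 | 0.03327 | 124.9833 | True | False | 3 | 54 | 1 | 12 |
| 1 | 0 | 33.25 | 2.42 | 0.005217 | 9.854167 | False | False | 3 | 34 | 1 | 13 |
| 2 | 0 | 33.66667 | 4.5 | 0.004156 | 15.0 | True | False | 4 | 58 | 1 | 5 |
| 3 | 0 | 30.5 | 2.54 | 0.065214 | 137.8692 | False | False | 0 | 25 | 1 | 7 |
| 4 | 0 | 32.16667 | 9.7867 | 0.067051 | 546.5033 | True | False | 2 | 64 | 1 | 5 |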


In [13]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# The pipeline holds only the model; preprocessing steps could be added before it
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))

# Accuracy on each of the 5 cross-validation folds
score = cross_val_score(my_pipeline, X, y, cv=5, scoring='accuracy')

In [14]:
score

array([0.98106061, 0.97727273, 0.98484848, 0.97348485, 0.98479087])

In [16]:
score.mean()  # not displayed: a notebook cell echoes only its last expression
X.columns

Index(['reports', 'age', 'income', 'share', 'expenditure', 'owner', 'selfemp',
       'dependents', 'months', 'majorcards', 'active'],
      dtype='object')

In [17]:
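# Columns that may only be known after the card decision is made
# (for example, card spending), so they are dropped as possible target leakage
potential_leaks = ["expenditure", "share", "majorcards", "active"]

X = X.drop(potential_leaks, axis=1)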

In [18]:
X.columns

Index(['reports', 'age', 'income', 'owner', 'selfemp', 'dependents', 'months'], dtype='object')

In [19]:
cv_scores = cross_val_score(my_pipeline, X, y, cv=5, scoring='accuracy')
cv_scores.mean()

0.8354821984099552

### Conclusion

- Data leakage can be a multi-million dollar mistake in many data science applications.
- Careful separation of training and validation data can prevent train-test contamination, and pipelines can help implement this separation.
- Likewise, a combination of caution, common sense, and data exploration can help identify target leakage.
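
The data exploration mentioned in the last point can be sketched as follows. This is only an illustration under stated assumptions: the `toy` DataFrame and every value in it are hypothetical, and only the column names `card` and `expenditure` mirror the dataset above. The idea is to check whether a feature's values are almost fully determined by the target, which is a hint that the feature may be recorded after the target is known.

```python
import pandas as pd

# Hypothetical toy data for illustration; these are NOT rows from the real dataset
toy = pd.DataFrame({
    "card": [True, True, True, False, False, False],
    "expenditure": [124.9, 9.8, 0.0, 0.0, 0.0, 0.0],
    "age": [37, 33, 41, 30, 52, 28],
})

has_card = toy["card"]

# Fraction of rows with zero expenditure, split by the target value
zero_share_with_card = (toy.loc[has_card, "expenditure"] == 0).mean()
zero_share_without_card = (toy.loc[~has_card, "expenditure"] == 0).mean()

print("Zero expenditure among cardholders:    ", zero_share_with_card)
print("Zero expenditure among non-cardholders:", zero_share_without_card)
```

If one group shows zero for nearly every row while the other does not, it is worth finding out how and when the feature is measured before keeping it in the model.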