# Data leakage

Data leakage occurs when your training data contains information about the target, but similar data will not be available when the model is used for prediction. This leads to high performance on the training set (and possibly even on the validation data), but the model will perform poorly in production.

There are two main types of leakage: target leakage and train-test contamination.

## Target leakage

Target leakage occurs when your predictors include data that will not be available at the time you make predictions. It is important to think about target leakage in terms of the timing or chronological order in which data becomes available, not merely whether a feature helps make good predictions.
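The timing criterion above can be sketched as a simple check: keep a feature only if its value is recorded no later than the moment the prediction is made. The feature names, the dates, and the helper `split_by_availability` below are all hypothetical and chosen only for illustration; in practice the recording time of each column has to come from knowledge of how the data was collected.

```python
from datetime import date

# Hypothetical moment at which the model has to make its prediction.
prediction_date = date(2024, 1, 15)

# Hypothetical features and the date on which each value was recorded.
feature_recorded_on = {
    "age": date(2023, 12, 1),
    "income": date(2024, 1, 2),
    "expenditure": date(2024, 3, 30),  # recorded after the decision was made
}


def split_by_availability(recorded_on, cutoff):
    """Separate features known by the cutoff from those recorded later."""
    safe = [name for name, day in recorded_on.items() if day <= cutoff]
    leaky = [name for name, day in recorded_on.items() if day > cutoff]
    return safe, leaky


safe_features, leaky_features = split_by_availability(
    feature_recorded_on, prediction_date
)
print("Available at prediction time:", safe_features)
print("Potential target leakage:", leaky_features)
```

A feature that lands in the second list may still be highly predictive, which is exactly why it is dangerous: the model learns from information it will never have in production.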

## Train-test contamination

A different type of leak occurs when you aren't careful to distinguish training data from validation data.

Recall that validation is meant to be a measure of how the model does on data that it hasn't considered before. You can corrupt this process in subtle ways if the validation data affects the preprocessing behavior. This is sometimes called train-test contamination.
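As a minimal sketch of the difference, the snippet below imputes missing values in two ways: once by fitting the imputer on all rows before cross-validation (contaminated), and once inside a pipeline so the imputer is refit on the training folds only. The synthetic data, the choice of `SimpleImputer` and `LogisticRegression`, and all variable names are assumptions made for illustration, not part of the original example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data (assumption): 200 rows, 3 features, a noisy binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Knock out roughly 10% of the values so that imputation is needed.
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

# Contaminated: the imputer sees every row, including the rows that
# cross-validation later holds out for validation.
X_filled = SimpleImputer(strategy="mean").fit_transform(X_demo)
contaminated_scores = cross_val_score(
    LogisticRegression(), X_filled, y_demo, cv=5, scoring="accuracy"
)

# Clean: the imputer is a pipeline step, so in each fold it is fit on the
# training portion only and then applied to the validation portion.
clean_pipeline = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
clean_scores = cross_val_score(
    clean_pipeline, X_demo, y_demo, cv=5, scoring="accuracy"
)

print("Contaminated CV accuracy: %.3f" % contaminated_scores.mean())
print("Clean CV accuracy: %.3f" % clean_scores.mean())
```

With mean imputation on a small synthetic set the two scores may be very close. The point of the sketch is the structure: any step that learns from data belongs inside the pipeline, so that validation rows never influence it.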

## Example


In [6]:
import pandas as pd 
import numpy as np 

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [4]:
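# Read the data, parsing 'yes'/'no' as booleans so that `card` becomes a boolean column
data = pd.read_csv('./AER_credit_card_data.csv', true_values=['yes'], false_values=['no'])

# Separate the target (`card`) from the predictors
X = data.drop(['card'], axis=1)
y = data.card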

In [5]:
data.shape

(1319, 12)

In [7]:
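# A single-step pipeline; wrapping the model this way leaves room for preprocessing steps
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))

# 5-fold cross-validated accuracy using all predictors
cv_scores = cross_val_score(my_pipeline, X, y,
                            cv=5,
                            scoring='accuracy')

cv_scores.mean()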

0.9787734762069362

In [8]:
# Compare `expenditure` between applicants who received a card and those who did not
# (`y` is boolean, so it can be used directly as a mask)
expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]

print('Fraction of those who did not receive a card and had no expenditures: %.2f'
      % ((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f'
      % ((expenditures_cardholders == 0).mean()))

Fraction of those who did not receive a card and had no expenditures: 1.00
Fraction of those who received a card and had no expenditures: 0.02
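The same check can be run over every column at once. The sketch below uses a small made-up DataFrame (the values and the helper `zero_fraction_by_class` are hypothetical, not taken from the credit card data) and reports, for each feature, the fraction of zero values within each target class. A column that is zero for an entire class, as `expenditure` is here, deserves a closer look for target leakage.

```python
import pandas as pd

# Toy data (assumption): a boolean target plus two numeric features.
toy = pd.DataFrame({
    "card": [True, True, True, False, False, False],
    "expenditure": [120.5, 0.0, 43.2, 0.0, 0.0, 0.0],
    "reports": [0, 1, 0, 0, 2, 0],
})


def zero_fraction_by_class(frame, target):
    """Fraction of zero values per feature, computed separately per target class."""
    features = frame.drop(columns=[target])
    is_zero = features.eq(0).astype(float)
    return is_zero.groupby(frame[target]).mean()


summary = zero_fraction_by_class(toy, "card")
print(summary)
```

This is only a screening heuristic: a clean split between classes suggests a feature worth investigating, but whether it is really a leak still depends on when and how the value is recorded.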


In [9]:
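# Drop leaky predictors from dataset.
# `expenditure` looks like a consequence of the target: everyone without a card
# had no expenditures. The other columns are dropped as potentially related.
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)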

# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y, 
                            cv=5,
                            scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())

Cross-val accuracy: 0.832429
