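This section explains what data leakage is and how to avoid it.  
  
What is data leakage:  
Data leakage occurs when the training data contains information about the target that will not be available when the model is actually used. The model looks very accurate during training and validation, but becomes much less accurate once it is used to make real predictions.  
  
It comes in two forms:  
1. target leakage  
2. train-test contamination  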

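First, a small example of target leakage:  
Having pneumonia and taking antibiotics are strongly correlated, but whether someone takes antibiotics can change after the target value (having pneumonia) has been determined, because patients are usually given antibiotics as a result of the diagnosis.  
To prevent this kind of target leakage, we need to exclude any variable that is created or updated after the target value is realized.  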
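
The effect can be reproduced on a toy dataset. The sketch below is only an illustration under stated assumptions: the data is synthetic, and the column names `age`, `weight`, `got_pneumonia` and `took_antibiotic` are hypothetical. The leaky column is built from the target on purpose, so the model that sees it scores far better than the one that does not.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "age": rng.integers(20, 80, size=n),
    "weight": rng.normal(70, 10, size=n),
    "got_pneumonia": rng.random(n) < 0.3,
})

# Leaky column: recorded after the diagnosis, so it agrees with the
# target for roughly 95% of the rows.
flip = rng.random(n) < 0.05
toy["took_antibiotic"] = toy["got_pneumonia"] ^ flip

y_toy = toy["got_pneumonia"]
model = RandomForestClassifier(n_estimators=50, random_state=0)

# Same model, with and without the feature that leaks the target.
score_with_leak = cross_val_score(
    model, toy.drop(columns="got_pneumonia"), y_toy, cv=5).mean()
score_without_leak = cross_val_score(
    model, toy[["age", "weight"]], y_toy, cv=5).mean()

print("Accuracy with took_antibiotic:    %f" % score_with_leak)
print("Accuracy without took_antibiotic: %f" % score_without_leak)
```

Since `age` and `weight` carry no signal in this toy data, the high score of the first model comes entirely from the leaked column, which would not be known at prediction time.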

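Train-test contamination:  
If the validation data influences the preprocessing, train-test contamination is likely to occur.  

Suppose all of the data is preprocessed, for example by imputing missing values, before it is split into training and test sets. The results are then likely to be affected.  
Even though no new data has been added to the source data, the preprocessing statistics (such as the imputation values) were computed partly from the validation rows, so the validation score ends up too optimistic.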
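
One common way to avoid this is to put the preprocessing inside a scikit-learn pipeline, so that cross-validation refits it on the training rows of each fold. The following is a minimal sketch, not the notebook's own code: it assumes synthetic data with artificially removed values, and the names `X_demo`, `y_demo` and `leak_free_pipeline` are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data for illustration only (not the credit card dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # remove about 10% of the values

# The imputer is a pipeline step, so in every fold its column means are
# computed from that fold's training rows only, never from validation rows.
leak_free_pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

demo_scores = cross_val_score(leak_free_pipeline, X_demo, y_demo,
                              cv=5, scoring="accuracy")
print("Cross-validation accuracy: %f" % demo_scores.mean())
```

Calling `SimpleImputer().fit_transform` on the full `X_demo` before `cross_val_score` would instead be the contaminated variant described above, because the column means would include the validation rows.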

In [1]:
# Work through an example to learn one way to detect and remove target leakage.
import pandas as pd

data = pd.read_csv('./input/AER_credit_card_data.csv', true_values=['yes'], false_values=['no'])

y = data.card
X = data.drop(['card'], axis=1)
print("Number of rows in the dataset", X.shape[0])
X.head()

Number of rows in the dataset 1319


Unnamed: 0,reports,age,income,share,expenditure,owner,selfemp,dependents,months,majorcards,active
0,0,37.66667,4.52,0.03327,124.9833,True,False,3,54,1,12
1,0,33.25,2.42,0.005217,9.854167,False,False,3,34,1,13
2,0,33.66667,4.5,0.004156,15.0,True,False,4,58,1,5
3,0,30.5,2.54,0.065214,137.8692,False,False,0,25,1,7
4,0,32.16667,9.7867,0.067051,546.5033,True,False,2,64,1,5


In [2]:
y.head()

0    True
1    True
2    True
3    True
4    True
Name: card, dtype: bool

In [3]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(my_pipeline, X, y,
                            cv=5, 
                            scoring='accuracy')
print("cross-validation accuracy: %f"%cv_scores.mean())

cross-validation accuracy: 0.979525


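As a rule of thumb, it is rare for a model to reach 98% accuracy, so at this point we should check for target leakage.  
  
Looking at the data, expenditure looks suspicious: does it mean spending on this card, or spending on cards used before applying?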

In [4]:
# Split expenditure by whether the applicant received a card
expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]

print("Fraction of those who received a card and had no expenditures: %.2f"%(expenditures_cardholders == 0).mean())
print("Fraction of those who did not receive a card and had no expenditures: %.2f"%(expenditures_noncardholders == 0).mean())

Fraction of those who received a card and had no expenditures: 0.02
Fraction of those who did not receive a card and had no expenditures: 1.00


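From the comparison above, we can clearly see that essentially everyone who did not receive a card had no expenditures (the fraction rounds to 1.00), while only 2% of those who received a card had no expenditures. This suggests that expenditure refers to spending on the card being applied for, which explains why the model is so accurate.  
This is a case of target leakage. The feature share is partly determined by expenditure, so it should be excluded as well.  
The features active (number of active accounts) and majorcards (number of major credit cards held) are less clear, but from their descriptions they appear to be related. In most cases it is better to be conservative, so we drop them too.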

In [6]:
# Drop the potentially leaky features from the data
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)

cv_scores = cross_val_score(my_pipeline, X2, y,
                            cv=5,
                            scoring='accuracy')
print("Cross-Val accuracy:%f"%cv_scores.mean())

Cross-Val accuracy:0.833952
