In [3]:
import pandas as pd

data = pd.read_csv('~/JProjects/kaggle/data/AER_credit_card_data.csv', true_values=['yes'], false_values=['no'])

y = data.card

X = data.drop('card', axis=1)


In [12]:
data.head()

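| Unnamed: 0 | card | reports | age | income | share | expenditure | owner | selfemp | dependents | months | majorcards | active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | 0 | 37.66667 | 4.52 | 0.03327 | 124.9833 | True | False | 3 | 54 | 1 | 12 |
| 1 | True | 0 | 33.25 | 2.42 | 0.005217 | 9.854167 | False | False | 3 | 34 | 1 | 13 |
| 2 | True | 0 | 33.66667 | 4.5 | 0.004156 | 15.0 | True | False | 4 | 58 | 1 | 5 |
| 3 | True | 0 | 30.5 | 2.54 | 0.065214 | 137.8692 | False | False | 0 | 25 | 1 | 7 |
| 4 | True | 0 | 32.16667 | 9.7867 | 0.067051 | 546.5033 | True | False | 2 | 64 | 1 | 5 |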


Here is a summary of the data, which you can also find under the data tab:

- `card`: 1 if credit card application accepted, 0 if not
- `reports`: Number of major derogatory reports
- `age`: Age in years plus twelfths of a year
- `income`: Yearly income (divided by 10,000)
- `share`: Ratio of monthly credit card expenditure to yearly income
- `expenditure`: Average monthly credit card expenditure
- `owner`: 1 if owns home, 0 if rents
- `selfemp`: 1 if self-employed, 0 if not
- `dependents`: 1 + number of dependents
- `months`: Months living at current address
- `majorcards`: Number of major credit cards held
- `active`: Number of active credit accounts

In [5]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


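# There is no preprocessing step, so a pipeline is not strictly needed here;
# wrapping the model in one keeps the workflow ready for later steps.
# No random_state is set, so the exact score varies slightly between runs.
pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')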

print('Accuracy: ', cv_scores.mean())

Accuracy:  0.9802943887544648


Does `expenditure` mean expenditure on this card or on cards used before applying?
A basic comparison of the data helps answer this:

In [15]:
exp_cardholders = X.expenditure[y]
exp_noncardholders = X.expenditure[~y]

print(f'Fraction of those who received a card and had no expenditure: {(exp_cardholders == 0).mean()}')
print(f'Fraction of those who did not receive a card and had no expenditure: {(exp_noncardholders == 0).mean()}')

Fraction of those who received a card and had no expenditure: 0.020527859237536656
Fraction of those who did not receive a card and had no expenditure: 1.0


As shown above, everyone who did not receive a card had no expenditures, while only 2% of those who received a card had none. It is therefore not surprising that the model appeared to have high accuracy. This looks like a case of target leakage, where `expenditure` probably means *expenditure on the card they applied for*, a value that only exists after the application has been decided.
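The same comparison can be run over every column at once. The sketch below is one possible way to do that, not part of the original analysis: the helper name `zero_fraction_by_target` is hypothetical, and the toy rows are invented for illustration rather than taken from the credit card data.

```python
import pandas as pd


def zero_fraction_by_target(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Fraction of zero values in each column of X, split by a boolean target y."""
    y = y.astype(bool)
    table = pd.DataFrame({
        "target_true": X[y].eq(0).mean(),
        "target_false": X[~y].eq(0).mean(),
    })
    # A large gap means the zeros in that column line up with the target.
    table["gap"] = (table["target_true"] - table["target_false"]).abs()
    return table.sort_values("gap", ascending=False)


# Toy data, invented for illustration (not rows from the real dataset).
toy_X = pd.DataFrame({
    "expenditure": [120.0, 0.0, 35.5, 80.0, 0.0, 0.0, 0.0],
    "reports": [0, 0, 1, 0, 3, 0, 2],
})
toy_y = pd.Series([True, True, True, True, False, False, False])

zero_table = zero_fraction_by_target(toy_X, toy_y)
print(zero_table)
```

A column whose zeros fall almost entirely in one class, as `expenditure` does in this toy frame, is worth a closer look. A large gap is only a hint, though; whether it reflects leakage still depends on when the value was recorded relative to the target.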

Since `share` is partially determined by `expenditure`, it carries the same leaked information and should be excluded too. The variables `active` and `majorcards` are a little less clear, but from their descriptions they sound concerning. In most situations, it's better to be safe than sorry if you can't track down the people who created the data to find out more.
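One rough way to spot features that are derived from a known leak is to check how closely each column tracks it. The sketch below uses rank correlation on synthetic data. The helper `rank_correlation_with` is hypothetical, and the formula used to build the toy `share` column is an assumption made for this example, not the dataset's documented definition.

```python
import numpy as np
import pandas as pd


def rank_correlation_with(X: pd.DataFrame, reference: str) -> pd.Series:
    """Rank (Spearman-style) correlation of every other column with X[reference]."""
    ranks = X.rank()
    others = ranks.drop(columns=reference)
    # Pearson correlation computed on ranks is the Spearman correlation.
    corr = others.corrwith(ranks[reference])
    # Order by absolute strength, strongest first.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Synthetic data, generated only to illustrate the check.
rng = np.random.default_rng(0)
n = 500
income = rng.uniform(1.5, 10.0, n)        # in units of 10,000
expenditure = rng.exponential(200.0, n)   # monthly spend

toy = pd.DataFrame({
    "expenditure": expenditure,
    # Assumed toy relationship: yearly spend divided by yearly income.
    "share": 12 * expenditure / (income * 10_000),
    "income": income,
    "months": rng.integers(0, 300, n),
})

leak_corr = rank_correlation_with(toy, "expenditure")
print(leak_corr)
```

In this toy frame `share` tracks `expenditure` closely while `income` and `months` do not. A correlation check like this can only flag candidates; columns such as `active` and `majorcards` may leak through timing rather than through a numeric relationship, so the column descriptions still need to be read.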

We would run a model without target leakage as follows:

In [18]:
# Drop leakage predictors
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
X2 = X.drop(potential_leaks, axis=1)

cv_scores = cross_val_score(pipeline, X2, y, cv=5, scoring='accuracy')

print(f'Accuracy: {cv_scores.mean()}')

Accuracy: 0.8369915888927295
