# Problem Statement

The objective of this competition is to predict the probability that a customer will fail to pay back their credit card balance in the future, based on their monthly customer profile. The binary target is computed by observing an 18-month performance window after the latest credit card statement: if the customer does not pay the amount due within 120 days of their latest statement date, it is considered a default event.

The dataset contains aggregated profile features for each customer at each statement date. Features are anonymized and normalized, and fall into the following general categories:

- D_* = Delinquency variables
- S_* = Spend variables
- P_* = Payment variables
- B_* = Balance variables
- R_* = Risk variables
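
As a minimal sketch, the prefix convention above can be used to group columns by category. The helper name `group_by_prefix` and the short sample column list are illustrative assumptions, not part of the competition material.

```python
from collections import defaultdict

# Prefix -> category, as listed in the problem statement.
PREFIX_TO_CATEGORY = {
    "D": "Delinquency",
    "S": "Spend",
    "P": "Payment",
    "B": "Balance",
    "R": "Risk",
}


def group_by_prefix(columns):
    """Group feature names by the category implied by their prefix (hypothetical helper)."""
    groups = defaultdict(list)
    for col in columns:
        prefix, sep, _ = col.partition("_")
        # Columns such as customer_ID or target do not match a known prefix and are skipped.
        if sep and prefix in PREFIX_TO_CATEGORY:
            groups[PREFIX_TO_CATEGORY[prefix]].append(col)
    return dict(groups)


# A handful of column names from the dataset, chosen only for illustration.
sample_columns = ["customer_ID", "S_2", "P_2", "D_39", "B_1", "R_1", "target"]
feature_groups = group_by_prefix(sample_columns)
print(feature_groups)
```

Note that `S_2` holds the statement date in this dataset, so it may need separate handling even though it carries the spend prefix.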

The following features are categorical:

['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

Our task is to predict, for each customer_ID, the probability of a future payment default (target = 1).

Note that the negative class has been subsampled for this dataset at 5%, and thus receives a 20x weighting in the scoring metric.
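
The 20x figure follows from the sampling rate: keeping 5% of the negatives means each retained negative stands in for 1 / 0.05 = 20 original ones. The sketch below shows how such per-row weights could be built, assuming a 0/1 `target` array. The toy labels are made up for illustration, and this is not the official competition metric.

```python
import numpy as np

NEGATIVE_SAMPLING_RATE = 0.05  # 5% subsampling, from the problem statement
negative_weight = 1.0 / NEGATIVE_SAMPLING_RATE  # each kept negative represents 20 originals

# Toy labels (illustrative only): 3 negatives, 2 positives.
toy_target = np.array([0, 0, 1, 0, 1])

# Negatives get the 20x weight, positives keep weight 1.
sample_weight = np.where(toy_target == 0, negative_weight, 1.0)

# Default rate after undoing the subsampling via the weights.
weighted_default_rate = (sample_weight * toy_target).sum() / sample_weight.sum()
print(sample_weight, weighted_default_rate)
```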

In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# Filtering the Large Dataset

In [3]:
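# One-time preprocessing: build smaller Train.csv / Test.csv files from the full dataset.
# Note: pd.concat over all chunks still loads the entire file into memory.
# Chunk = pd.read_csv("train_data.csv", chunksize=100000)
# Train = pd.concat(Chunk)
# Train_labels = pd.read_csv("train_labels.csv")

# Attach the target label to every statement row of each customer.
# Train = pd.merge(Train, Train_labels, how="inner", on=["customer_ID"])

# Keep the first 200,000 rows and save them as the working training set.
# Train = Train.head(200000)
# Train.to_csv("Train.csv", index=False)

# Same procedure for the test set, keeping the first 500,000 rows.
# Test = pd.read_csv("test_data.csv", chunksize=100000)
# Test = pd.concat(Test)
# Test = Test.head(500000)
# Test.to_csv("Test.csv", index=False)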
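
Because `pd.concat` over every chunk still materializes the whole file, a lighter variant would stop reading once enough rows have been collected. The sketch below assumes a hypothetical helper, `read_first_rows`, and demonstrates it on a small temporary CSV rather than the real `train_data.csv`.

```python
import os
import tempfile

import pandas as pd


def read_first_rows(path, n_rows, chunksize=100_000):
    """Read the first n_rows of a CSV chunk by chunk, stopping early (hypothetical helper)."""
    chunks, rows_read = [], 0
    with pd.read_csv(path, chunksize=chunksize) as reader:
        for chunk in reader:
            chunks.append(chunk)
            rows_read += len(chunk)
            if rows_read >= n_rows:
                break  # remaining chunks are never parsed
    return pd.concat(chunks, ignore_index=True).head(n_rows)


# Toy file standing in for the real data (values are illustrative only).
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.csv")
    pd.DataFrame(
        {"customer_ID": list("aabbc"), "P_2": [0.1, 0.2, 0.3, 0.4, 0.5]}
    ).to_csv(toy_path, index=False)
    subset = read_first_rows(toy_path, n_rows=3, chunksize=2)

print(subset)
```

For a plain "first N rows" subset, `pd.read_csv(path, nrows=N)` is a simpler alternative. The chunked loop is mainly useful when each chunk needs filtering before it is kept.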

In [9]:
Train = pd.read_csv("Train.csv")
Test = pd.read_csv("Test.csv")

In [11]:
Test.head()

| | customer_ID | S_2 | P_2 | D_39 | B_1 | B_2 | R_1 | S_3 | D_41 | B_3 | ... | D_136 | D_137 | D_138 | D_139 | D_140 | D_141 | D_142 | D_143 | D_144 | D_145 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00000469ba478561f23a92a868bd366de6f6527a684c9a... | 2019-02-19 | 0.631315 | 0.001912 | 0.010728 | 0.814497 | 0.007547 | 0.168651 | 0.009971 | 0.002347 | ... | NaN | NaN | NaN | NaN | 0.004669 | NaN | NaN | NaN | 0.008281 | NaN |
| 1 | 00000469ba478561f23a92a868bd366de6f6527a684c9a... | 2019-03-25 | 0.587042 | 0.005275 | 0.011026 | 0.810848 | 0.001817 | 0.241389 | 0.000166 | 0.009132 | ... | NaN | NaN | NaN | 0.000142 | 0.00494 | 0.009021 | NaN | 0.003695 | 0.003753 | 0.00146 |
| 2 | 00000469ba478561f23a92a868bd366de6f6527a684c9a... | 2019-04-25 | 0.609056 | 0.003326 | 0.01639 | 1.00462 | 0.000114 | 0.266976 | 0.004196 | 0.004192 | ... | NaN | NaN | NaN | 7.4e-05 | 0.002114 | 0.004656 | NaN | 0.003155 | 0.002156 | 0.006482 |
| 3 | 00000469ba478561f23a92a868bd366de6f6527a684c9a... | 2019-05-20 | 0.614911 | 0.009065 | 0.021672 | 0.816549 | 0.009722 | 0.188947 | 0.004123 | 0.015325 | ... | NaN | NaN | NaN | 0.004743 | 0.006392 | 0.00289 | NaN | 0.006044 | 0.005206 | 0.007855 |
| 4 | 00000469ba478561f23a92a868bd366de6f6527a684c9a... | 2019-06-15 | 0.591673 | 0.238794 | 0.015923 | 0.810456 | 0.002026 | 0.180035 | 0.000731 | 0.011281 | ... | NaN | NaN | NaN | 0.008133 | 0.004329 | 0.008384 | NaN | 0.001008 | 0.007421 | 0.009471 |


In [13]:
Train.nunique()

customer_ID     16576
S_2               396
P_2            198458
D_39           200000
B_1            200000
                ...  
D_142           34356
D_143          196497
D_144          198567
D_145          196497
target              2
Length: 191, dtype: int64
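
With 200,000 rows spread over 16,576 unique customers, each customer contributes about 12 statements on average (200000 / 16576 ≈ 12.07). A minimal sketch of how the per-customer statement count could be checked, using a made-up toy frame in place of `Train`:

```python
import pandas as pd

# Toy stand-in for Train (values are illustrative only).
toy = pd.DataFrame(
    {
        "customer_ID": ["a", "a", "a", "b", "b", "c"],
        "S_2": [
            "2019-01-15", "2019-02-15", "2019-03-15",
            "2019-01-20", "2019-02-20",
            "2019-03-01",
        ],
    }
)

# Number of statement rows recorded for each customer.
statements_per_customer = toy.groupby("customer_ID").size()

# Average statements per customer: total rows divided by unique customers.
mean_statements = len(toy) / toy["customer_ID"].nunique()

print(statements_per_customer)
print(mean_statements)
```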

In [14]:
Train.describe()

| | P_2 | D_39 | B_1 | B_2 | R_1 | S_3 | D_41 | B_3 | D_42 | D_43 | ... | D_137 | D_138 | D_139 | D_140 | D_141 | D_142 | D_143 | D_144 | D_145 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 198458.0 | 200000.0 | 200000.0 | 199928.0 | 200000.0 | 162335.0 | 199928.0 | 199928.0 | 29193.0 | 139942.0 | ... | 7186.0 | 7186.0 | 196497.0 | 198581.0 | 196497.0 | 34356.0 | 196497.0 | 198567.0 | 196497.0 | 200000.0 |
| mean | 0.653527 | 0.1534126 | 0.125356 | 0.616971 | 0.08121541 | 0.227386 | 0.06221769 | 0.1343925 | 0.18117 | 0.1560354 | ... | 0.01424359 | 0.162349 | 0.1798358 | 0.02622542 | 0.1653822 | 0.392596 | 0.1797362 | 0.05339187 | 0.06179908 | 0.253395 |
| std | 0.246135 | 0.2723813 | 0.213396 | 0.402275 | 0.2309947 | 0.197101 | 0.2081621 | 0.2348492 | 0.216851 | 0.216974 | ... | 0.09543979 | 0.258647 | 0.3798505 | 0.1441568 | 0.3490555 | 0.238923 | 0.3797464 | 0.1851044 | 0.1901802 | 0.434956 |
| min | -0.383019 | 3.892609e-07 | -0.899396 | 3e-06 | 2.96293e-08 | -0.254707 | 5.627163e-08 | 1.04218e-07 | -0.000219 | 8.705647e-07 | ... | 4.129697e-08 | 2e-06 | 7.139375e-08 | 5.277736e-08 | 5.642931e-08 | -0.014441 | 1.65358e-08 | 1.161969e-07 | 3.397747e-08 | 0.0 |
| 25% | 0.476334 | 0.00454139 | 0.008858 | 0.10055 | 0.002899612 | 0.127432 | 0.002897833 | 0.00529073 | 0.039449 | 0.04235285 | ... | 0.002589397 | 0.003541 | 0.003018427 | 0.002551375 | 0.003033245 | 0.196174 | 0.003034162 | 0.002752575 | 0.003027842 | 0.0 |
| 50% | 0.691541 | 0.00906638 | 0.031935 | 0.814164 | 0.005792558 | 0.164248 | 0.00576875 | 0.009872834 | 0.120749 | 0.0882885 | ... | 0.005141117 | 0.007028 | 0.006041458 | 0.005109293 | 0.006062756 | 0.382038 | 0.006077791 | 0.005499113 | 0.006059988 | 0.0 |
| 75% | 0.863455 | 0.2360171 | 0.129569 | 1.00224 | 0.008684554 | 0.260202 | 0.008652639 | 0.1635911 | 0.250728 | 0.1850985 | ... | 0.007667529 | 0.501621 | 0.00909657 | 0.007658175 | 0.009094362 | 0.566102 | 0.009088631 | 0.008267479 | 0.009093526 | 1.0 |
| max | 1.009998 | 5.33136 | 1.324053 | 1.01 | 2.507711 | 2.918675 | 6.798167 | 1.625262 | 3.252056 | 9.089694 | ... | 1.009913 | 1.509486 | 1.01 | 1.009994 | 1.174753 | 1.751388 | 1.01 | 1.343284 | 4.282032 | 1.0 |


In [21]:
print(Train.isnull().sum())

customer_ID         0
S_2                 0
P_2              1542
D_39                0
B_1                 0
                ...  
D_142          165644
D_143            3503
D_144            1433
D_145            3503
target              0
Length: 191, dtype: int64
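
Raw counts are easier to interpret as fractions: `D_142`, for instance, is missing in 165,644 of 200,000 rows, or about 83%. A minimal sketch of ranking columns by missing fraction follows. The toy frame and the 0.5 cutoff are assumptions for illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Train (values are illustrative only).
toy = pd.DataFrame(
    {
        "P_2": [0.6, np.nan, 0.7, 0.8],
        "D_142": [np.nan, np.nan, np.nan, 0.4],
        "B_1": [0.1, 0.2, 0.3, 0.4],
    }
)

# Fraction of missing values per column, largest first.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)

# Columns whose missing fraction exceeds an assumed cutoff.
MISSING_THRESHOLD = 0.5
sparse_columns = missing_fraction[missing_fraction > MISSING_THRESHOLD].index.tolist()

print(missing_fraction)
print(sparse_columns)
```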


In [41]:
Categorical_Features = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

# Print the distinct values and their frequencies for each categorical feature.
for feature in Categorical_Features:
    print(Train[feature].unique())
    print(Train[feature].value_counts())

[ 0.  2.  1. nan]
B_30
0.0    169651
1.0     28195
2.0      2082
Name: count, dtype: int64
[ 2.  1.  3.  5.  6.  7.  4. nan]
B_38
2.0    69185
3.0    45557
1.0    42290
5.0    16264
4.0    10948
7.0     9821
6.0     5863
Name: count, dtype: int64
[ 1.  0. nan]
D_114
1.0    119921
0.0     73734
Name: count, dtype: int64
[ 0. nan  1.]
D_116
0.0    193429
1.0       226
Name: count, dtype: int64
[ 4. -1.  6.  2.  1. nan  3.  5.]
D_117
-1.0    52208
 3.0    41808
 4.0    41322
 2.0    24095
 5.0    17049
 6.0    12774
 1.0     4399
Name: count, dtype: int64
[ 0.  1. nan]
D_120
0.0    170250
1.0     23405
Name: count, dtype: int64
[ 1. nan  0. -1.]
D_126
 1.0    154085
 0.0     32515
-1.0      9175
Name: count, dtype: int64
['CR' 'CO' 'CL' 'XZ' 'XM' 'XL']
D_63
CO    148773
CR     33751
CL     15991
XZ       925
XM       309
XL       251
Name: count, dtype: int64
['O' 'R' nan 'U' '-1']
D_64
O     105870
U      55177
R      29601
-1      1312
Name: count, dtype: int64
[nan  1.  0.]
D_66
1.0   