**Contents**

* References
* Libraries
* Importing Train Set
* Train Set preprocessing
  * Changing the format of customer_ID and S_2
  * Sorting the dataset by customer_ID and S_2
  * Grouping the dataset by customer_ID
  * Removing the columns with too many missing values
  * Handling the missing values in other columns
  * Removing the numerical columns with very low correlation with the output
  * Handling categorical data
  * Feature Scaling
* Importing Test Set
* Test Set preprocessing
  * Changing the format of S_2
  * Grouping the dataset by customer_ID
  * Handling the missing values
  * Handling categorical data
  * Missing categories in Test Set
  * Feature Scaling
* Classification models
  * Logistic Regression Classification
  * XGBoost Classification
  * Light Gradient Boosting Machine (LightGBM or LGBM) Classification
  * Linear Discriminant Analysis Classification
* Choosing the best model
* Submission

# References

**Notebooks referred:**
* https://www.kaggle.com/code/cdeotte/xgboost-starter-0-793
* https://www.kaggle.com/code/kingsshah/easiest-way-to-reduce-from-190-to-36-columns/notebook

**Dataset used:**
* https://www.kaggle.com/datasets/munumbutt/amexfeather

# Libraries

In [1]:
import numpy as np
import pandas as pd
import gc

# Importing Train Set

In [2]:
train = pd.read_feather('../input/amexfeather/train_data.ftr')
train = train.drop(columns = 'target')

train_labels = pd.read_csv('/kaggle/input/amex-default-prediction/train_labels.csv')

In [3]:
train.head()

Unnamed: 0,customer_ID,S_2,P_2,D_39,B_1,B_2,R_1,S_3,D_41,B_3,...,D_136,D_137,D_138,D_139,D_140,D_141,D_142,D_143,D_144,D_145
0,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-03-09,0.938477,0.001734,0.008728,1.006836,0.009224,0.124023,0.008774,0.004707,...,,,,0.002426,0.003706,0.003819,,0.000569,0.00061,0.002674
1,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-04-07,0.936523,0.005775,0.004925,1.000977,0.006153,0.126709,0.000798,0.002714,...,,,,0.003956,0.003166,0.005032,,0.009575,0.005493,0.009216
2,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-05-28,0.954102,0.091492,0.021652,1.009766,0.006817,0.123962,0.007599,0.009422,...,,,,0.003269,0.007328,0.000427,,0.003429,0.006985,0.002604
3,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-06-13,0.960449,0.002455,0.013687,1.00293,0.001372,0.117188,0.000685,0.005531,...,,,,0.006119,0.004517,0.003201,,0.008423,0.006527,0.009598
4,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-07-16,0.947266,0.002483,0.01519,1.000977,0.007607,0.11731,0.004654,0.009308,...,,,,0.003672,0.004944,0.008888,,0.00167,0.008125,0.009827


In [4]:
train_labels.head()

Unnamed: 0,customer_ID,target
0,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,0
1,00000fd6641609c6ece5454664794f0340ad84dddce9a2...,0
2,00001b22f846c82c51f6e3958ccd81970162bae8b007e8...,0
3,000041bdba6ecadd89a52d11886e8eaaec9325906c9723...,0
4,00007889e4fcd2614b6cbe7f8f3d2e5c728eca32d9eb8a...,0


Train data: (5531451, 190)

Train labels: (458913, 2)

# Train Set preprocessing

There are multiple entries for a single customer ID. Let's first group them so that each customer ID has a single entry. This significantly reduces the size of the dataset (from 5,531,451 rows to 458,913, one per customer) and makes further preprocessing easier.

**Changing the format of customer_ID and S_2**

To group the dataset, I first have to sort it, so that the 'last' aggregation picks up each customer's most recent statement. To sort by date, I first have to convert 'S_2' to 'datetime'. I have also converted 'customer_ID' from a hex string to an integer, keeping only its last 16 hex digits.

In [5]:
train['customer_ID'] = train['customer_ID'].str[-16:].apply(int, base = 16)
train_labels['customer_ID'] = train_labels['customer_ID'].str[-16:].apply(int, base = 16)
train['S_2'] = pd.to_datetime(train['S_2'])
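As a small illustration of the conversion above, the sketch below applies the same slicing and base-16 parsing to two made-up IDs. The `toy_ids` values are hypothetical and not taken from the dataset. Since only the last 16 hex digits are kept, the approach assumes those digits are unique per customer.

```python
import pandas as pd

# Hypothetical 64-character hex IDs (not real customer IDs)
toy_ids = pd.Series(['0' * 60 + '00ff', 'a' * 48 + '000000000000001a'])

# Keep the last 16 hex digits and parse them as a base-16 integer
as_int = toy_ids.str[-16:].apply(int, base=16)
print(as_int.tolist())
```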

**Sorting the dataset by customer_ID and S_2**

In [6]:
train = train.sort_values(['customer_ID', 'S_2']).reset_index(drop = True)
train_labels = train_labels.sort_values('customer_ID').reset_index(drop = True)

**Grouping the dataset by customer_ID**

In [7]:
train = train.groupby('customer_ID').agg('last')
train.reset_index(drop = False, inplace = True)

_ = gc.collect()
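A toy example of the sort-then-group step, assuming a made-up frame `toy` with one feature column. It also shows a detail that is easy to miss: pandas' `last` aggregation skips missing values by default, so for a given column the result can come from an earlier statement than the most recent one.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two customers, two statements each
toy = pd.DataFrame({
    'customer_ID': [1, 1, 2, 2],
    'S_2': pd.to_datetime(['2018-01-05', '2018-02-05', '2018-01-09', '2018-02-09']),
    'P_2': [0.9, np.nan, 0.4, 0.5],
})

# Sort chronologically within each customer, then keep one row per customer
toy = toy.sort_values(['customer_ID', 'S_2']).reset_index(drop=True)
last_per_customer = toy.groupby('customer_ID').agg('last').reset_index()
print(last_per_customer)
```

For customer 1 the latest 'P_2' is missing, so the aggregated value falls back to the earlier statement, while 'S_2' still shows the latest date.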

**Removing the columns with too many missing values**

I have removed the columns with more than 50% missing values.

In [8]:
nan_drop_frac = 0.5

train.dropna(axis = 1,
             thresh = int((1-nan_drop_frac) * len(train)),
             inplace = True)

del nan_drop_frac
_ = gc.collect()

25 numerical columns and 1 categorical column ('D_66') have been removed.
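To make the `thresh` arithmetic concrete, here is a minimal sketch on a made-up frame (the column names are hypothetical). `thresh` is the minimum number of non-missing values a column needs in order to be kept, so a column that is exactly 50% missing survives and only columns with more than 50% missing are dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'mostly_missing': [1.0, np.nan, np.nan, np.nan],  # 75% missing
    'half_missing':   [1.0, 2.0, np.nan, np.nan],     # 50% missing
    'complete':       [1.0, 2.0, 3.0, 4.0],           # 0% missing
})

nan_drop_frac = 0.5
thresh = int((1 - nan_drop_frac) * len(toy))  # at least 2 non-missing values required

kept = toy.dropna(axis=1, thresh=thresh)
print(list(kept.columns))
```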

In [9]:
all_columns = train.drop(columns = ['customer_ID', 'S_2']).columns

cat_columns = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
cat_columns = [col for col in cat_columns if col in all_columns]  #Need to do this because some categorical columns might have been removed in above code cell

num_columns = all_columns.drop(cat_columns)

**Handling the missing values in other columns**

In [10]:
from sklearn.impute import SimpleImputer

num_nan_columns = [col for col in num_columns if train[col].isnull().sum()]
cat_nan_columns = [col for col in cat_columns if train[col].isnull().sum()]

simple_imp = SimpleImputer(strategy = 'most_frequent')

if len(num_nan_columns):
    temp_imp = simple_imp.fit_transform(train[num_nan_columns])
    train[num_nan_columns] = pd.DataFrame(temp_imp, columns = num_nan_columns)

if len(cat_nan_columns):
    temp_imp = simple_imp.fit_transform(train[cat_nan_columns])
    train[cat_nan_columns] = pd.DataFrame(temp_imp, columns = cat_nan_columns)
    
del num_nan_columns, cat_nan_columns, simple_imp, temp_imp
_ = gc.collect()



Now the train set has 152 numerical features and 10 categorical features. I tried to import the test set with all the features, but that needed more RAM than was available, so I have to reduce the number of columns before importing it. PCA and LDA don't help here: each component they produce is a combination of all the original columns (each with its own coefficient), so every column would still have to be loaded. I needed a method that lets me drop the less relevant features entirely. That's why I used the correlation technique:

**Removing the numerical columns with very low correlation with the output**

I have kept the top 78 numerical columns, ranked by the absolute value of their correlation with the output. I chose 78 because these are the columns whose correlation coefficient with the output is more than 0.001 in absolute value. I have not removed any categorical columns.

In [11]:
how_many_num_columns = 78

corr_of_columns = abs(train[num_columns].corrwith(train_labels['target']))
corr_of_columns.sort_values(ascending = False, inplace = True)

num_columns = corr_of_columns[:how_many_num_columns].index
all_columns = list(num_columns) + list(cat_columns)

train = train[['customer_ID', 'S_2'] + all_columns]

del how_many_num_columns, corr_of_columns
_ = gc.collect()
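The selection above can be illustrated on synthetic data. This is only a sketch: the column names, the random data, and `top_k = 2` are assumptions made for the example, not values from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic binary target and three features of decreasing relevance
target = pd.Series(rng.integers(0, 2, size=n), name='target')
toy = pd.DataFrame({
    'signal': target + rng.normal(0, 0.3, size=n),        # strongly related
    'weak':   0.6 * target + rng.normal(0, 1.0, size=n),  # weakly related
    'noise':  rng.normal(0, 1.0, size=n),                 # unrelated
})

# Rank columns by absolute correlation with the target and keep the top k
abs_corr = toy.corrwith(target).abs().sort_values(ascending=False)
top_k = 2
selected = list(abs_corr.index[:top_k])
print(selected)
```

Note that `corrwith` aligns the two objects on their index, so the features and the labels must share the same row order and index, as they do after the sorting and `reset_index` steps earlier.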

Now the train set has 78 numerical features and 10 categorical features. I can now import only these selected columns of the test set without any RAM issue.

**Handling categorical data**

The get_dummies() function converts the categorical columns into dummy (indicator) columns and attaches them at the end (very similar to One Hot Encoding). I have used the prefix separator '=' for better readability.

e.g. if there is a categorical column 'ABC' which has categories 0 and 1, plain one-hot encoding would create two dummies, 'ABC=0' and 'ABC=1', attach them at the end, and remove the original 'ABC' column. Because I pass drop_first = True, the dummy for the first category ('ABC=0') is dropped, so only 'ABC=1' is kept.

This will not change any other columns.

In [12]:
train = pd.get_dummies(data = train,
                       prefix_sep = '=',
                       columns = cat_columns,
                       drop_first = True)
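A quick check of what `drop_first` does, on a made-up frame with a single categorical column 'ABC' (hypothetical data):

```python
import pandas as pd

toy = pd.DataFrame({'x': [0.1, 0.2, 0.3], 'ABC': [0, 1, 0]})

# Without drop_first: one dummy per category
full = pd.get_dummies(toy, prefix_sep='=', columns=['ABC'])

# With drop_first: the dummy for the first category is omitted
reduced = pd.get_dummies(toy, prefix_sep='=', columns=['ABC'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level per column avoids perfectly collinear dummies, which matters mainly for the linear models used later.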

**Feature Scaling**

In [13]:
from sklearn.preprocessing import StandardScaler

columns_to_scale = train.drop(columns = ['customer_ID', 'S_2']).columns

scaler = StandardScaler()
train[columns_to_scale] = pd.DataFrame(scaler.fit_transform(train[columns_to_scale]), columns = columns_to_scale)
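A minimal sketch of the scaling pattern used in this notebook, on made-up numbers: the scaler learns the mean and standard deviation from the train set only, and the same fitted scaler is later applied to the test set. The frames `toy_train` and `toy_test` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy_train = pd.DataFrame({'f1': [1.0, 2.0, 3.0], 'f2': [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({'f1': [2.0, 4.0], 'f2': [20.0, 50.0]})

# Learn mean and standard deviation from the train set only
scaler = StandardScaler().fit(toy_train)

# Apply the same transformation to both sets
train_scaled = pd.DataFrame(scaler.transform(toy_train), columns=toy_train.columns)
test_scaled = pd.DataFrame(scaler.transform(toy_test), columns=toy_test.columns)

print(train_scaled.mean().round(6).tolist())
print(test_scaled)
```

The scaled test columns generally do not have zero mean, which is expected: they are expressed in the train set's units.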

In [14]:
train.head()

Unnamed: 0,customer_ID,S_2,P_2,B_2,B_18,B_9,D_48,D_55,B_33,D_44,...,D_64=-1,D_64=O,D_64=R,D_64=U,D_68=1.0,D_68=2.0,D_68=3.0,D_68=4.0,D_68=5.0,D_68=6.0
0,3249127622875,2018-03-09,1.371094,1.014648,0.309082,-0.682617,-1.036133,-0.980469,0.859863,-0.567383,...,0.0,-1.068359,-0.429688,-0.638184,-0.151733,-0.210449,-0.317871,-0.315918,-0.547852,0.963379
1,23402014749356,2018-03-24,-1.572266,-1.384766,-0.978516,1.351562,1.543945,1.835938,-1.166016,-0.058472,...,0.0,-1.068359,-0.429688,1.568359,-0.151733,-0.210449,-0.317871,-0.315918,1.825195,-1.038086
2,37423844044824,2018-03-24,-0.198486,-1.293945,-0.975098,-0.117249,-0.173828,-0.102844,-1.167969,-0.030273,...,0.0,-1.068359,2.328125,-0.638184,-0.151733,-0.210449,-0.317871,-0.315918,1.825195,-1.038086
3,149683802139734,2018-03-07,1.097656,0.55127,1.162109,-0.663574,-1.169922,-0.94873,0.865723,-0.568848,...,0.0,0.936035,-0.429688,-0.638184,-0.151733,-0.210449,-0.317871,-0.315918,-0.547852,0.963379
4,160820792638518,2018-03-05,-0.444336,-1.135742,-0.967773,0.308105,0.447021,0.197876,-1.166016,-0.047913,...,0.0,0.936035,-0.429688,-0.638184,-0.151733,-0.210449,3.146484,-0.315918,-0.547852,-1.038086


In [15]:
train_labels.head()

Unnamed: 0,customer_ID,target
0,3249127622875,0
1,23402014749356,1
2,37423844044824,1
3,149683802139734,0
4,160820792638518,1


# Importing Test Set

In [16]:
columns_to_import = ['customer_ID', 'S_2'] + all_columns

In [17]:
test = pd.read_feather('../input/amexfeather/test_data.ftr',
                       columns = columns_to_import)

del columns_to_import
_ = gc.collect()

In [18]:
test.head()

Unnamed: 0,customer_ID,S_2,P_2,B_2,B_18,B_9,D_48,D_55,B_33,D_44,...,B_30,B_38,D_114,D_116,D_117,D_120,D_126,D_63,D_64,D_68
0,00000469ba478561f23a92a868bd366de6f6527a684c9a...,2019-02-19,0.631348,0.814453,0.592285,0.001013,0.626465,0.114563,1.003906,0.007584,...,0.0,1.0,,,,,0.0,CR,,
1,00000469ba478561f23a92a868bd366de6f6527a684c9a...,2019-03-25,0.586914,0.811035,0.59082,0.005535,0.611816,0.184082,1.004883,0.006645,...,0.0,1.0,,,,,0.0,CR,,
2,00000469ba478561f23a92a868bd366de6f6527a684c9a...,2019-04-25,0.608887,1.004883,0.591309,2.3e-05,0.62207,0.253906,1.006836,0.009605,...,0.0,2.0,,,,,0.0,CR,,
3,00000469ba478561f23a92a868bd366de6f6527a684c9a...,2019-05-20,0.614746,0.816406,0.59082,0.007206,0.615723,0.305664,1.009766,0.00782,...,0.0,2.0,,,,,0.0,CR,,
4,00000469ba478561f23a92a868bd366de6f6527a684c9a...,2019-06-15,0.591797,0.810547,0.593262,0.000569,0.591797,0.350342,1.001953,0.009956,...,0.0,2.0,0.0,0.0,-1.0,1.0,0.0,CR,U,6.0


Test data: (11363762, 90)

# Test Set preprocessing

Most of the steps are the same as for the train set. The only difference is that I don't have to do dimensionality reduction, because it is already done (I have imported only the selected columns).

**Changing the format of S_2**

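I can't change the format and order of the customer IDs because I need the original customer IDs for submission. I also skip the explicit sort here, so the 'last' aggregation below relies on each customer's rows already being in chronological order in the file.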

In [19]:
test['S_2'] = pd.to_datetime(test['S_2'])

**Grouping the dataset by customer_ID**

In [20]:
test = test.groupby('customer_ID').agg('last')
test.reset_index(drop = False, inplace = True)

_ = gc.collect()

**Handling the missing values**

In [21]:
num_nan_columns = [col for col in num_columns if test[col].isnull().sum()]
cat_nan_columns = [col for col in cat_columns if test[col].isnull().sum()]

simple_imp = SimpleImputer(strategy = 'most_frequent')

if len(num_nan_columns):
    temp_imp = simple_imp.fit_transform(test[num_nan_columns])
    test[num_nan_columns] = pd.DataFrame(temp_imp, columns = num_nan_columns)

if len(cat_nan_columns):
    temp_imp = simple_imp.fit_transform(test[cat_nan_columns])
    test[cat_nan_columns] = pd.DataFrame(temp_imp, columns = cat_nan_columns)
    
del num_nan_columns, cat_nan_columns, simple_imp, temp_imp
_ = gc.collect()
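The cell above fits a fresh imputer on the test set, so the fill values come from the test data. A common alternative, shown here only as a sketch on made-up frames (it is not what this notebook does), is to fit the imputer once on the train set and reuse it for the test set, so both sets are filled with the same values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy_train = pd.DataFrame({'a': [1.0, 1.0, 2.0, np.nan],
                          'b': [0.0, 5.0, 5.0, 5.0]})
toy_test = pd.DataFrame({'a': [np.nan, 3.0],
                         'b': [np.nan, np.nan]})

# Learn the most frequent value of each column from the train set
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(toy_train)

# Fill the test set with the values learned on train
toy_test_filled = pd.DataFrame(imputer.transform(toy_test),
                               columns=toy_test.columns,
                               index=toy_test.index)
print(toy_test_filled)
```

In the notebook this would mean keeping the fitted imputer alive instead of deleting it after the train step, and imputing the same set of columns in both datasets.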



**Handling categorical data**

In [22]:
test = pd.get_dummies(data = test,
                      prefix_sep = '=',
                      columns = cat_columns,
                      drop_first = True)

**Missing categories in Test Set**

I noticed that after converting the categorical columns into individual dummies, the train set had 114 columns but the test set had 112. So 2 categories are missing in the test set, which is why their dummies didn't get created. I have added a column of zeros for each of these categories, which encodes that the category is absent from the test set while keeping the same shape as the train set.

In [23]:
train_columns = train.columns
test_columns = test.columns

missing_columns = [col for col in train_columns if col not in test_columns]

for col in missing_columns:
    test[col] = 0  # a scalar is broadcast to every row
    
del train_columns, test_columns, missing_columns
_ = gc.collect()

Missing categories were 'D_64=-1' and 'D_68=1.0'.
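A possible pitfall with this approach, illustrated only on toy data and not checked against this dataset: with drop_first = True, get_dummies drops whichever category comes first in the data it is given. If the train set's first category never appears in the test set, the test encoding drops a different category, and adding a zero column afterwards silently relabels those rows as the baseline. The sketch below shows the effect and one way to avoid it by fixing the category list from the train set before encoding. The column 'D_x' and its values are hypothetical.

```python
import pandas as pd

toy_train = pd.DataFrame({'D_x': ['A', 'B', 'C', 'B']})
toy_test = pd.DataFrame({'D_x': ['B', 'C', 'C']})  # 'A' never appears

train_enc = pd.get_dummies(toy_train, prefix_sep='=', columns=['D_x'], drop_first=True)

# Naive: encode the test set on its own, then add missing columns as zeros.
# Here 'B' is the first test category, so 'D_x=B' is the one that gets dropped.
naive = pd.get_dummies(toy_test, prefix_sep='=', columns=['D_x'], drop_first=True)
naive = naive.reindex(columns=train_enc.columns, fill_value=0)

# Alternative: declare the train categories first, so the same level is dropped
cats = sorted(toy_train['D_x'].unique())
fixed = toy_test.assign(D_x=pd.Categorical(toy_test['D_x'], categories=cats))
fixed = pd.get_dummies(fixed, prefix_sep='=', columns=['D_x'], drop_first=True)

print(naive.astype(int))
print(fixed.astype(int))
```

In the naive result the first test row (category 'B') has zeros everywhere, exactly like a row of category 'A' would in the train set. In the fixed result 'D_x=B' is set for that row.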

**Feature Scaling**

In [24]:
test[columns_to_scale] = pd.DataFrame(scaler.transform(test[columns_to_scale]), columns = columns_to_scale)

del scaler, columns_to_scale
_ = gc.collect()

In [25]:
test.head()

Unnamed: 0,customer_ID,S_2,P_2,B_2,B_18,B_9,D_48,D_55,B_33,D_44,...,D_64=O,D_64=R,D_64=U,D_68=2.0,D_68=3.0,D_68=4.0,D_68=5.0,D_68=6.0,D_64=-1,D_68=1.0
0,00000469ba478561f23a92a868bd366de6f6527a684c9a...,2019-10-12,-0.254918,1.014514,0.046002,-0.668306,0.255605,0.483091,0.861928,-0.558213,...,-1.068429,-0.429716,1.567706,-0.210434,-0.317829,-0.315822,-0.54783,0.963695,0.0,-0.15175
1,00001bf2e77ff879fab36aa4fac689b9ba411dae63ae39...,2019-04-15,0.761109,1.012145,1.157323,-0.627684,-1.065017,-0.861433,0.855998,-0.56874,...,0.935953,-0.429716,-0.637875,-0.210434,-0.317829,-0.315822,-0.54783,0.963695,0.0,-0.15175
2,0000210045da4f81e5f122c6bde5c2a617d03eef67f82c...,2019-10-16,0.225783,0.529958,0.034221,-0.657095,0.271883,0.35181,0.846115,-0.572257,...,-1.068429,-0.429716,1.567706,-0.210434,-0.317829,3.166344,-0.54783,-1.037673,0.0,-0.15175
3,00003b41e58ede33b8daf61ab56d9952f17c9ad1c3976c...,2019-04-22,-0.462493,-0.936447,-1.135022,-0.068436,0.49299,0.703515,-1.175634,0.480338,...,-1.068429,2.327121,-0.637875,-0.210434,-0.317829,-0.315822,1.825383,-1.037673,0.0,-0.15175
4,00004b22eaeeeb0ec976890c1d9bfc14fd9427e98c4ee9...,2019-10-22,-1.427537,-1.343255,-0.972054,1.502339,1.48458,1.197848,-1.174566,0.963758,...,-1.068429,2.327121,-0.637875,-0.210434,-0.317829,-0.315822,1.825383,-1.037673,0.0,-0.15175


In [26]:
print('Training data shape: ', train.shape)
print('Training labels shape: ', train_labels.shape)
print('Test data shape: ', test.shape)

Training data shape:  (458913, 114)
Training labels shape:  (458913, 2)
Test data shape:  (924621, 114)


# Classification models

Defining the prerequisites

In [27]:
from sklearn.model_selection import cross_val_score

preds_of_models = pd.DataFrame({})
accuracies_of_models = {}

Making sure that the columns in the train set and the test set are in the same order

In [28]:
train.sort_index(axis = 1, inplace = True)
test.sort_index(axis = 1, inplace = True)

**Logistic Regression Classification**

In [29]:
from sklearn.linear_model import LogisticRegression

lrc = LogisticRegression(penalty = 'l2',
                         solver = 'sag',
                         tol = 1e-2,
                         random_state = 6)

lrc.fit(X = train.drop(columns = ['customer_ID', 'S_2']),
        y = train_labels['target'])

preds_of_models['LRC'] = pd.DataFrame(lrc.predict_proba(X = test.drop(columns = ['customer_ID', 'S_2'])))[1]

accuracies_lrc = cross_val_score(estimator = lrc,
                                 X = train.drop(columns = ['customer_ID', 'S_2']),
                                 y = train_labels['target'],
                                 scoring = 'accuracy',
                                 cv = 5)

accuracies_of_models['LRC'] = accuracies_lrc.mean()

del lrc, accuracies_lrc
_ = gc.collect()

**XGBoost Classification**

In [30]:
from xgboost import XGBClassifier

xgb = XGBClassifier()

xgb.fit(X = train.drop(columns = ['customer_ID', 'S_2']),
        y = train_labels['target'])

preds_of_models['XGB'] = pd.DataFrame(xgb.predict_proba(X = test.drop(columns = ['customer_ID', 'S_2'])))[1]

accuracies_xgb = cross_val_score(estimator = xgb,
                                 X = train.drop(columns = ['customer_ID', 'S_2']),
                                 y = train_labels['target'],
                                 scoring = 'accuracy',
                                 cv = 5)

accuracies_of_models['XGB'] = accuracies_xgb.mean()

del xgb, accuracies_xgb
_ = gc.collect()

**Light Gradient Boosting Machine (LightGBM or LGBM) Classification**

In [31]:
from lightgbm import LGBMClassifier

lgb = LGBMClassifier(boosting_type = 'dart',
                     n_estimators = 600,
                     objective = 'binary',
                     random_state = 3)

lgb.fit(X = train.drop(columns = ['customer_ID', 'S_2']),
        y = train_labels['target'])

preds_of_models['LGB'] = pd.DataFrame(lgb.predict_proba(X = test.drop(columns = ['customer_ID', 'S_2'])))[1]

accuracies_lgb = cross_val_score(estimator = lgb,
                                 X = train.drop(columns = ['customer_ID', 'S_2']),
                                 y = train_labels['target'],
                                 scoring = 'accuracy',
                                 cv = 5)

accuracies_of_models['LGB'] = accuracies_lgb.mean()

del lgb, accuracies_lgb
_ = gc.collect()

**Linear Discriminant Analysis Classification**

In [32]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()

lda.fit(X = train.drop(columns = ['customer_ID', 'S_2']),
        y = train_labels['target'])

preds_of_models['LDA'] = pd.DataFrame(lda.predict_proba(X = test.drop(columns = ['customer_ID', 'S_2'])))[1]

accuracies_lda = cross_val_score(estimator = lda,
                                 X = train.drop(columns = ['customer_ID', 'S_2']),
                                 y = train_labels['target'],
                                 scoring = 'accuracy',
                                 cv = 5)

accuracies_of_models['LDA'] = accuracies_lda.mean()

del lda, accuracies_lda
_ = gc.collect()
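The four cells above repeat the same fit / predict / cross-validate pattern. As a sketch of how that pattern could be folded into one loop, the example below uses synthetic data from `make_classification` and only the two scikit-learn models, so that it runs without extra packages. The variable names and the data are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed train and test features
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

models = {
    'LRC': LogisticRegression(max_iter=1000),
    'LDA': LinearDiscriminantAnalysis(),
}

preds = pd.DataFrame()
accuracies = {}
for name, model in models.items():
    # Mean 5-fold CV accuracy (cross_val_score clones the estimator internally)
    accuracies[name] = cross_val_score(model, X_train, y_train,
                                       scoring='accuracy', cv=5).mean()
    # Fit on the full training data, then predict P(class 1) for the test rows
    model.fit(X_train, y_train)
    preds[name] = model.predict_proba(X_test)[:, 1]

best = max(accuracies, key=accuracies.get)
print(pd.Series(accuracies))
print('Best model:', best)
```

XGBClassifier and LGBMClassifier follow the same scikit-learn estimator interface, so they could be added to the `models` dictionary in the same way.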

# Choosing the best model

In [33]:
print(pd.Series(accuracies_of_models))

LRC    0.894322
XGB    0.896503
LGB    0.899112
LDA    0.891422
dtype: float64


In [34]:
best_model = max(accuracies_of_models, key = lambda x: accuracies_of_models[x])

final_preds = preds_of_models[best_model]

print(f'The model with highest accuracy is {best_model} with {accuracies_of_models[best_model]*100} % accuracy')

del preds_of_models, accuracies_of_models, best_model
_ = gc.collect()

The model with highest accuracy is LGB with 89.91115947838072 % accuracy
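Accuracy is easier to interpret next to a baseline. A minimal sketch, assuming a made-up target vector: the share of the majority class is the accuracy a model would get by always predicting that class. In the notebook the same expression could be applied to `train_labels['target']`.

```python
import pandas as pd

# Hypothetical target with 6 zeros and 2 ones
toy_target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])

# Accuracy of a classifier that always predicts the most common class
majority_share = toy_target.value_counts(normalize=True).max()
print(majority_share)
```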


In [35]:
final_preds.head(10)

0    0.041858
1    0.002068
2    0.048701
3    0.368053
4    0.883488
5    0.001505
6    0.923070
7    0.231280
8    0.734474
9    0.004088
Name: LGB, dtype: float64

# Submission

In [36]:
submission = pd.DataFrame({'customer_ID': test['customer_ID'],
                           'prediction': final_preds}).set_index('customer_ID')
submission.to_csv('submission.csv')