In [1]:
import pandas as pd
import numpy as np 

In [14]:
df = pd.read_csv("UCI_Credit_Card.csv")

DATA Description
* ID: ID of each client
* LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
* SEX: Gender (1=male, 2=female)
* EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
* MARRIAGE: Marital status (1=married, 2=single, 3=others)
* AGE: Age in years
* PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, … 8=payment delay for eight months, 9=payment delay for nine months and above)
* PAY_2: Repayment status in August, 2005 (scale same as above)
* PAY_3: Repayment status in July, 2005 (scale same as above)
* PAY_4: Repayment status in June, 2005 (scale same as above)
* PAY_5: Repayment status in May, 2005 (scale same as above)
* PAY_6: Repayment status in April, 2005 (scale same as above)
* BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
* BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
* BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
* BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
* BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
* BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
* PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
* PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
* PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
* PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
* PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
* PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
* default.payment.next.month: Default payment (1=yes, 0=no)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          30000 non-null  int64  
 1   LIMIT_BAL                   30000 non-null  float64
 2   SEX                         30000 non-null  int64  
 3   EDUCATION                   30000 non-null  int64  
 4   MARRIAGE                    30000 non-null  int64  
 5   AGE                         30000 non-null  int64  
 6   PAY_0                       30000 non-null  int64  
 7   PAY_2                       30000 non-null  int64  
 8   PAY_3                       30000 non-null  int64  
 9   PAY_4                       30000 non-null  int64  
 10  PAY_5                       30000 non-null  int64  
 11  PAY_6                       30000 non-null  int64  
 12  BILL_AMT1                   30000 non-null  float64
 13  BILL_AMT2                   300

In [15]:
df.rename(columns={"default.payment.next.month":"Target"}, inplace = True)

In [16]:
print(df["MARRIAGE"].unique())
print(df["SEX"].unique())
print(df["EDUCATION"].unique())
print(df["PAY_0"].unique())

[1 2 3 0]
[2 1]
[2 1 3 5 4 6 0]
[ 2 -1  0 -2  1  3  4  8  7  5  6]


In [17]:
# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection, which may not modify df in recent pandas versions.
df["MARRIAGE"] = df["MARRIAGE"].replace({0: 3})
df["EDUCATION"] = df["EDUCATION"].replace({6: 5, 0: 5})
pay_cols = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
df[pay_cols] = df[pay_cols].replace({-1: 0, -2: 0})

The data description does not match the data set. For example, the description gives MARRIAGE as 1=married, 2=single, 3=others, but the data contain four unique values (0, 1, 2, 3), so I replace 0 with 3. The same problem appears in EDUCATION and in the payment columns (PAY_0, PAY_2, etc.), so I replace the undocumented values there with logical ones: EDUCATION values 0 and 6 become 5, and PAY values -1 and -2 become 0.
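The mismatch described here can also be checked programmatically before any value is replaced. This is only a minimal sketch on a small made-up frame: the `toy` data, the `documented` dictionary, and the helper `undocumented_codes` are illustrative names and are not part of the notebook.

```python
import pandas as pd

# Toy data standing in for the real columns (values are made up).
toy = pd.DataFrame({
    "MARRIAGE": [1, 2, 3, 0, 2],
    "EDUCATION": [2, 1, 6, 0, 5],
})

# Codes listed in the data description for each column.
documented = {
    "MARRIAGE": {1, 2, 3},
    "EDUCATION": {1, 2, 3, 4, 5, 6},
}

def undocumented_codes(frame, spec):
    """Return, per column, the sorted values that the description does not list."""
    return {
        col: sorted(set(frame[col].unique().tolist()) - allowed)
        for col, allowed in spec.items()
    }

extra = undocumented_codes(toy, documented)
print(extra)
```

Running the same helper on the real `df` would list every code that needs a decision before modelling.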

In [18]:
cat_col = ["SEX","MARRIAGE","EDUCATION"]
df = pd.get_dummies(df, columns = cat_col)

Converting the categorical columns into binary indicator (one-hot) columns.
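A small illustration of what this encoding step produces, assuming a made-up three-row frame (`toy` and `encoded` are hypothetical names). It also shows why one of the two SEX indicator columns carries no extra information and can be dropped.

```python
import pandas as pd

# Made-up rows with the same integer codes as the real columns.
toy = pd.DataFrame({"SEX": [1, 2, 2], "MARRIAGE": [1, 2, 3]})

# One indicator column is created per distinct category value.
encoded = pd.get_dummies(toy, columns=["SEX", "MARRIAGE"])
print(list(encoded.columns))

# SEX has only two values, so SEX_2 is always the complement of SEX_1
# and can be removed without losing information.
encoded = encoded.drop(columns=["SEX_2"])
print(encoded)
```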

In [19]:
df.drop(columns=["SEX_2","ID"], inplace = True)

In [20]:
pd.set_option('display.max_columns', 50)
df.head()

Unnamed: 0,LIMIT_BAL,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,Target,SEX_1,MARRIAGE_1,MARRIAGE_2,MARRIAGE_3,EDUCATION_1,EDUCATION_2,EDUCATION_3,EDUCATION_4,EDUCATION_5
0,20000.0,24,2,2,0,0,0,0,3913.0,3102.0,689.0,0.0,0.0,0.0,0.0,689.0,0.0,0.0,0.0,0.0,1,0,1,0,0,0,1,0,0,0
1,120000.0,26,0,2,0,0,0,2,2682.0,1725.0,2682.0,3272.0,3455.0,3261.0,0.0,1000.0,1000.0,1000.0,0.0,2000.0,1,0,0,1,0,0,1,0,0,0
2,90000.0,34,0,0,0,0,0,0,29239.0,14027.0,13559.0,14331.0,14948.0,15549.0,1518.0,1500.0,1000.0,1000.0,1000.0,5000.0,0,0,0,1,0,0,1,0,0,0
3,50000.0,37,0,0,0,0,0,0,46990.0,48233.0,49291.0,28314.0,28959.0,29547.0,2000.0,2019.0,1200.0,1100.0,1069.0,1000.0,0,0,1,0,0,0,1,0,0,0
4,50000.0,57,0,0,0,0,0,0,8617.0,5670.0,35835.0,20940.0,19146.0,19131.0,2000.0,36681.0,10000.0,9000.0,689.0,679.0,0,1,1,0,0,0,1,0,0,0


In [22]:
df.shape

(30000, 30)

In [24]:
X = df.drop(columns = ["Target"])
y = df["Target"]

In [23]:
from sklearn.model_selection import train_test_split

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                            test_size=0.25, random_state=42)

In [64]:
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso

# Lasso Classifier

In [27]:
def lasso_cv (x_train, y_train, x_test):
    model = LassoCV(cv=10, random_state=42, max_iter = 10000, n_jobs = -1)
    model.fit(x_train,y_train)
    y_pred = model.predict(x_test)
    alpha = model.alpha_
    return y_pred, alpha

In [28]:
result = lasso_cv(X_train, y_train, X_test)
y_pred = result[0]
alpha = result[1]

In [31]:
alpha

25.521486276881543

In [29]:
from sklearn.metrics import classification_report

In [32]:
lc_model = Lasso(max_iter = 1000,
                 alpha = alpha)
lc_model.fit(X_train, y_train)

Lasso(alpha=25.521486276881543, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='cyclic', tol=0.0001, warm_start=False)

In [43]:
y_pred_lc = lc_model.predict(X_test)

In [55]:
y_pred_lc = [0 if a_ < 0.24575563984827312 else 1 for a_ in y_pred_lc]

In [51]:
yy_pred_lc = lc_model.predict(X_train)

In [175]:
y_pred_lc2 = [0 if a_ < 0.24575563984827312 else 1 for a_ in yy_pred_lc]

In [59]:
## Finding threshold value: mean predicted score over the actual defaults (Target = 1) in the training set
sum(yy_pred_lc * y_train) / sum(y_train)

0.24575683113930463
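The threshold formula used in this cell averages the regression scores of the rows whose true label is 1. Here is a minimal sketch of that idea on made-up numbers; `scores`, `labels`, `threshold`, and `predicted` are illustrative names, and the values are not taken from the data set.

```python
import numpy as np

# Made-up continuous scores from a regression model and the true 0/1 labels.
scores = np.array([0.10, 0.50, 0.20, 0.25, 0.30])
labels = np.array([0, 1, 0, 1, 0])

# Multiplying by the labels zeroes out the class-0 rows, so the ratio is the
# average score among the actual positives.
threshold = (scores * labels).sum() / labels.sum()

# Scores at or above the threshold are labelled 1, the rest 0.
predicted = (scores >= threshold).astype(int)
print(threshold, predicted)
```

This is one heuristic among several. Because the threshold is derived from the training labels, it should be computed on the training set only and then applied unchanged to the test set.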

Test Error:

In [85]:
clf_rep_lasso = classification_report(y_test,y_pred_lc, zero_division=1)

In [86]:
print(clf_rep_lasso)

              precision    recall  f1-score   support

           0       0.85      0.56      0.68      5873
           1       0.29      0.65      0.40      1627

    accuracy                           0.58      7500
   macro avg       0.57      0.61      0.54      7500
weighted avg       0.73      0.58      0.62      7500



Train Error:

In [176]:
clf_rep_lasso2 = classification_report(y_train,y_pred_lc2, zero_division=1)
print(clf_rep_lasso2)

              precision    recall  f1-score   support

           0       0.84      0.56      0.67     17491
           1       0.29      0.64      0.40      5009

    accuracy                           0.58     22500
   macro avg       0.57      0.60      0.54     22500
weighted avg       0.72      0.58      0.61     22500



# Decision Tree Model

In [72]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

In [90]:
DT_model = DecisionTreeClassifier(random_state = 42)
dt_params = {"min_samples_leaf" : [1,5,100,1000,2000],
             "ccp_alpha" : [0, 1, 2, 5, 6]
            }

In [91]:
dt_cv_model = GridSearchCV(DT_model, 
                           dt_params, 
                           n_jobs = -1, 
                           verbose = 1,
                           cv = 5)

In [92]:
dt_cv_model.fit(X_train, y_train)

Fitting 5 folds for each of 25 candidates, totalling 125 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 125 out of 125 | elapsed:    6.2s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=42,
                                              splitter='best'),
             iid='deprecated', n_jobs=-1,
             param_grid={'ccp_alpha': [0, 1, 2, 5, 6],
                         'm

In [93]:
dt_cv_model.best_params_

{'ccp_alpha': 0, 'min_samples_leaf': 1000}

## Tuned DT Model

In [81]:
DT_model = DecisionTreeClassifier(random_state = 42,
                                 ccp_alpha = 0,
                                 min_samples_leaf = 1000)

In [82]:
DT_model.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1000, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

In [83]:
y_pred_dt = DT_model.predict(X_test)

In [84]:
y_pred_dt2 = DT_model.predict(X_train)

Test Error:

In [88]:
clf_rep_dt = classification_report(y_test,y_pred_dt, zero_division=1)
print(clf_rep_dt)

              precision    recall  f1-score   support

           0       0.83      0.96      0.89      5873
           1       0.68      0.32      0.43      1627

    accuracy                           0.82      7500
   macro avg       0.76      0.64      0.66      7500
weighted avg       0.80      0.82      0.79      7500



Train Error:

In [89]:
clf_rep_dt2 = classification_report(y_train,y_pred_dt2, zero_division=1)
print(clf_rep_dt2)

              precision    recall  f1-score   support

           0       0.83      0.96      0.89     17491
           1       0.70      0.33      0.45      5009

    accuracy                           0.82     22500
   macro avg       0.77      0.65      0.67     22500
weighted avg       0.80      0.82      0.79     22500



# Random Forest Model

In [94]:
from sklearn.ensemble import RandomForestClassifier

In [96]:
rf_model = RandomForestClassifier(n_estimators = 500, 
                                 min_samples_leaf = 5,
                                 n_jobs = -1,
                                 random_state = 42
                                 )

In [116]:
rf_params = {"max_features" : [15, 29]}

In [117]:
rf_cv_model = GridSearchCV(rf_model, 
                           rf_params, 
                           n_jobs = -1, 
                           verbose = 1,
                           cv = 5)

In [118]:
rf_cv_model.fit(X_train, y_train)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of  10 | elapsed:  2.5min remaining:  1.7min
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  3.0min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=5,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=500, n_jobs=-1,
                                              oob_score=False, random_state=42,
                                    

In [115]:
rf_cv_model.best_params_

{'max_features': 20}

## Tuned RF Model

In [119]:
rf_model = RandomForestClassifier(n_estimators = 500,
                                  min_samples_leaf = 5,
                                  n_jobs = -1,
                                  random_state = 42,
                                  max_features = 20)

In [120]:
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features=20,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=-1, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [121]:
y_pred_rf = rf_model.predict(X_test)

Test Error:

In [124]:
clf_rep_rf = classification_report(y_test,y_pred_rf, zero_division=1)
print(clf_rep_rf)

              precision    recall  f1-score   support

           0       0.84      0.95      0.89      5873
           1       0.65      0.36      0.46      1627

    accuracy                           0.82      7500
   macro avg       0.74      0.65      0.68      7500
weighted avg       0.80      0.82      0.80      7500



In [125]:
y_pred_rf2 = rf_model.predict(X_train)

Train Error:

In [126]:
clf_rep_rf2 = classification_report(y_train,y_pred_rf2, zero_division=1)
print(clf_rep_rf2)

              precision    recall  f1-score   support

           0       0.89      0.99      0.94     17491
           1       0.94      0.59      0.72      5009

    accuracy                           0.90     22500
   macro avg       0.91      0.79      0.83     22500
weighted avg       0.90      0.90      0.89     22500



# XGB Model

In [127]:
from xgboost import XGBClassifier

In [128]:
xgb_model = XGBClassifier(verbosity = 1,
                         min_child_weight = 10,
                         
                         )

In [129]:
xgb_params = {"eta" : [0.1, 1],
             "max_depth" : [1,3,6],
             "n_estimators" : [100, 500, 1000]}

In [130]:
xgb_cv_model = GridSearchCV(xgb_model, 
                           xgb_params, 
                           n_jobs = -1, 
                           verbose = 1,
                           cv = 5)

In [131]:
xgb_cv_model.fit(X_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  5.6min finished




GridSearchCV(cv=5, error_score=nan,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=10,
                                     missing=nan, monotone_constraints=None,
                                     n_estimat...
                                     random_state=None, reg_alpha=None,
                                     reg_lambda=None, scale_pos_weight=None,
                                     subsample=None, tree_method=None,
                                     use_label_encoder=True,
  

In [132]:
xgb_cv_model.best_params_

{'eta': 0.1, 'max_depth': 3, 'n_estimators': 100}

## Tuned XGB Model

In [133]:
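# Note: the grid search above selected n_estimators = 100,
# but this model is fitted with n_estimators = 1000.
xgb_model = XGBClassifier(verbosity = 1,
                          min_child_weight = 10,
                          eta = 0.1,
                          max_depth = 3,
                          n_estimators = 1000)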

In [134]:
xgb_model.fit(X_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eta=0.1, gamma=0,
              gpu_id=-1, importance_type='gain', interaction_constraints='',
              learning_rate=0.100000001, max_delta_step=0, max_depth=3,
              min_child_weight=10, missing=nan, monotone_constraints='()',
              n_estimators=1000, n_jobs=8, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=True,
              validate_parameters=1, verbosity=1)

In [139]:
y_pred_xgb = xgb_model.predict(X_test)

Test Error:

In [140]:
clf_rep_xgb = classification_report(y_test,y_pred_xgb, zero_division=1)
print(clf_rep_xgb)

              precision    recall  f1-score   support

           0       0.84      0.94      0.89      5873
           1       0.64      0.36      0.46      1627

    accuracy                           0.82      7500
   macro avg       0.74      0.65      0.68      7500
weighted avg       0.80      0.82      0.80      7500



Train Error:

In [143]:
y_pred_xgb2 = xgb_model.predict(X_train)

In [145]:
clf_rep_xgb2 = classification_report(y_train,y_pred_xgb2, zero_division=1)
print(clf_rep_xgb2)

              precision    recall  f1-score   support

           0       0.86      0.96      0.91     17491
           1       0.76      0.43      0.55      5009

    accuracy                           0.84     22500
   macro avg       0.81      0.70      0.73     22500
weighted avg       0.84      0.84      0.83     22500



# Results

In [177]:
print("------------- Lasso Classifier -------------")
print(clf_rep_lasso)
print(clf_rep_lasso2)
print("------------- Decision Tree Classifier -------------")
print(clf_rep_dt)
print(clf_rep_dt2)
print("------------- Random Forest Classifier -------------")
print(clf_rep_rf)
print(clf_rep_rf2)
print("------------- XGB Classifier -------------")
print(clf_rep_xgb)
print(clf_rep_xgb2)

------------- Lasso Classifier -------------
              precision    recall  f1-score   support

           0       0.85      0.56      0.68      5873
           1       0.29      0.65      0.40      1627

    accuracy                           0.58      7500
   macro avg       0.57      0.61      0.54      7500
weighted avg       0.73      0.58      0.62      7500

              precision    recall  f1-score   support

           0       0.84      0.56      0.67     17491
           1       0.29      0.64      0.40      5009

    accuracy                           0.58     22500
   macro avg       0.57      0.60      0.54     22500
weighted avg       0.72      0.58      0.61     22500

------------- Decision Tree Classifier -------------
              precision    recall  f1-score   support

           0       0.83      0.96      0.89      5873
           1       0.68      0.32      0.43      1627

    accuracy                           0.82      7500
   macro avg       0.76     

Algorithm | Train Accuracy | Test Accuracy | Test Precision (Class 1)
------------- | ------------- | ------------- | -------------
Lasso Classifier | 0.58 | 0.58 | 0.29
Decision Tree | 0.82 | 0.82 | 0.68
Random Forest | 0.90 | 0.82 | 0.65
XGB Classifier | 0.84 | 0.82 | 0.64


The Decision Tree, Random Forest, and XGB classifiers all reach the same test accuracy, 0.82, while the Lasso classifier lags behind at 0.58. The data are highly imbalanced (roughly 78% of the clients are class 0), so accuracy alone is not the right metric.

We should also look at the precision score for class 1, which shows how many of the positive predictions are correct. In this data set a target value of 1 means the client defaulted on the next payment, and the model should catch as many of these cases as it can. Note that the share of actual defaults caught is measured by recall, which the reports above list next to precision.

The Decision Tree looks like the most successful algorithm by precision (0.68 on the test set). Its train and test scores are very close for both accuracy and precision, which suggests the model fits the data well, with no sign of overfitting or underfitting. The Lasso classifier did not work well on this imbalanced data: with the threshold used above, its class 1 precision is only 0.29 and its class 0 recall is 0.56, so it labels many non-defaulting clients as defaults. The Random Forest model looks overfitted, because its train and test scores differ substantially (for example, class 1 precision is 0.94 on the training set but 0.65 on the test set).
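To make the point about accuracy on imbalanced data concrete, here is a minimal sketch that uses only the test-set class counts from the "support" column of the reports. The always-predict-0 baseline is an illustration and is not one of the models fitted in the notebook.

```python
# Test-set class counts taken from the "support" column of the reports.
support = {0: 5873, 1: 1627}
total = sum(support.values())

# A model that always predicts the majority class (0) is correct for every
# class-0 row and wrong for every class-1 row.
baseline_accuracy = support[0] / total

# Such a model never predicts 1, so it finds none of the actual defaults.
baseline_recall_class_1 = 0 / support[1]

print(round(baseline_accuracy, 3), baseline_recall_class_1)
```

A test accuracy of 0.82 should be read against this baseline rather than against 0.5, which is why the per-class precision and recall are more informative here.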