# Project: Home Credit Default Risk


# Modeling

## Sampling Methods
The dataset is imbalanced. Model performance could improve with different sampling techniques. We will test random oversampling, SMOTE (Synthetic Minority Oversampling Technique), and random undersampling.
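To make the difference between random oversampling and SMOTE concrete, here is a minimal sketch of the interpolation idea behind SMOTE, written in plain Python rather than with `imblearn`. The function name `smote_like_sample`, the toy `minority` points, and the choice of `k` are all illustrative assumptions. Random oversampling duplicates existing minority rows, while SMOTE places a new point on the line segment between a minority row and one of its nearest minority neighbors.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point between a minority row and one of its k nearest minority neighbors."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # Rank the remaining minority rows by squared Euclidean distance to the base row.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
    neighbor = rng.choice(others[:k])
    lam = rng.random()  # interpolation weight in [0, 1)
    # The synthetic point lies on the segment from base to neighbor.
    return [a + lam * (b - a) for a, b in zip(base, neighbor)]

# Toy minority-class rows (hypothetical values, not taken from the Home Credit data).
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
rng = random.Random(123)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because every synthetic row is a convex combination of two real rows, it always stays inside the range of the observed minority data.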

In [52]:
import imblearn
np.random.seed(123)

In [53]:
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.linear_model import LogisticRegressionCV, LassoCV, RidgeCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import xgboost as xgb

In [54]:
y = data['TARGET']
x = data.drop('TARGET', axis = 1)

# Create train and test sets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 123)

In [55]:
# Create sampling datasets to train the models with.
o_sam = imblearn.over_sampling.RandomOverSampler(sampling_strategy = 0.5, random_state= 123)
u_sam = imblearn.under_sampling.RandomUnderSampler(sampling_strategy = 0.5, random_state= 123)
smote = imblearn.over_sampling.SMOTE(random_state = 123)

In [56]:
x_over , y_over = o_sam.fit_resample(x_train, y_train)
x_under, y_under = u_sam.fit_resample(x_train, y_train)
x_smote, y_smote = smote.fit_resample(x_train, y_train)
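A float `sampling_strategy` is the target ratio of minority to majority rows after resampling. The sketch below works out the resulting class counts for a hypothetical training set; the helper names and the 9000/1000 split are assumptions for illustration only, and the helpers ignore the validation that `imblearn` performs on infeasible ratios. The SMOTE sampler above is created without `sampling_strategy`, so it uses the default behavior of balancing the classes, which corresponds to a ratio of 1.0 here.

```python
def oversampled_counts(n_majority, n_minority, ratio):
    # Over-sampling keeps the majority class and grows the minority class
    # until minority / majority == ratio.
    return n_majority, int(ratio * n_majority)

def undersampled_counts(n_majority, n_minority, ratio):
    # Under-sampling keeps the minority class and shrinks the majority class
    # until minority / majority == ratio.
    return int(n_minority / ratio), n_minority

# Hypothetical class counts (majority, minority), not the real training counts.
print(oversampled_counts(9000, 1000, 0.5))   # (9000, 4500)
print(undersampled_counts(9000, 1000, 0.5))  # (2000, 1000)
print(oversampled_counts(9000, 1000, 1.0))   # (9000, 9000), the balanced SMOTE default
```

Note the trade-off: oversampling at 0.5 enlarges the training set, while undersampling at 0.5 discards most of the majority rows.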

## Models:
We will be testing a total of five model approaches for our analysis. Three of the models will be linear (LogisticRegression, Lasso, and Ridge). Additional models will include a Random Forest classifier and K-nearest neighbors (KNN). We will use k-fold cross-validation for training each of the models. SVM was also considered for the analysis, but was removed due to computational time constraints. In total we tested 10 models, ranging in methods and specific parameters. The goal is to obtain the model with the best predictive power on the dataset. Each member of our team focused on developing one of the models: Pankhuri - RandomForest, Hasitha - KNN, Meghana - SVC, and Heber - (linear models & xgboost).

In [57]:
# Set up kfold cross validation. There will be 10 folds in the cross validation.
kfold = KFold(n_splits = 10)
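Since `KFold` is created here without shuffling, the folds are contiguous blocks of rows in their existing order. The sketch below reproduces that splitting rule in plain Python to show what the ten folds look like; `kfold_indices` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional row so that every row is used once.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=23, n_splits=10))
for train_idx, val_idx in folds[:3]:
    print("validation rows:", val_idx, "| training rows:", len(train_idx))
```

Each row appears in exactly one validation fold and in the training portion of the other nine.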

In [58]:
# Linear models being tested include logistic, lasso, and ridge regression.
# LassoCV and RidgeCV are regressors, so their continuous outputs are thresholded at 0.5 later on.
log_mod = LogisticRegressionCV(cv=kfold, random_state = 123)
lasso_mod = LassoCV(cv= kfold, random_state = 123)
ridge_mod = RidgeCV(cv= kfold)
# SVC is very resource-intensive. These models are defined here but were not run due to time restrictions.
SVC_mod1 = SVC(random_state = 123)
SVC_mod2 = SVC(kernel = 'linear', random_state = 123)
# We will test two random forest models: one with the default parameters and one with a restricted depth and minimum leaf size.
rf_mod1 = RandomForestClassifier()
rf_mod2 = RandomForestClassifier(max_depth = 5, min_samples_leaf = 500)
# Add two XGBoost models: one with the default parameters and one with a set number of estimators.
xgb_mod1 = xgb.XGBClassifier(objective = 'binary:logistic')
xgb_mod2 = xgb.XGBClassifier(n_estimators= 100, objective = 'binary:logistic')
# We will run 3 knn models with different values for n_neighbors.
knn_mod1 = KNeighborsClassifier()
knn_mod2 = KNeighborsClassifier(n_neighbors = 10)
knn_mod3 = KNeighborsClassifier(n_neighbors = 2)

In [59]:
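def modelperformance(model, xtrain, ytrain, xtest, ytest):
    # Fit on the (possibly resampled) training data, then predict on the held-out test set.
    model.fit(xtrain, ytrain)
    pred = np.asarray(model.predict(xtest))
    # Regressors (LassoCV, RidgeCV) return continuous values, so threshold them at 0.5.
    # Predictions that are already 0/1 labels are left unchanged.
    if not np.isin(pred, [0, 1]).all():
        pred = (pred > 0.5).astype(int)

    print('Model')
    print(metrics.confusion_matrix(ytest, pred))
    print()
    # Note: AUC is computed from the thresholded 0/1 labels, not from continuous scores.
    print('AUC:', round(metrics.roc_auc_score(ytest, pred), 4))
    print(metrics.classification_report(ytest, pred, digits = 4))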
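Because `modelperformance` passes thresholded 0/1 predictions to `roc_auc_score`, the reported AUC reduces to the average of the true positive rate and the true negative rate (the "macro avg" recall in the classification report). The sketch below checks this identity against two confusion matrices printed in this notebook; the helper name `auc_from_hard_labels` is an assumption for illustration.

```python
def auc_from_hard_labels(tn, fp, fn, tp):
    """AUC of a 0/1 prediction: the mean of sensitivity and specificity."""
    tpr = tp / (tp + fn)  # recall on the default class
    tnr = tn / (tn + fp)  # recall on the non-default class
    return (tpr + tnr) / 2

# All-zero predictions (confusion matrix [[56488, 0], [5015, 0]]).
print(round(auc_from_hard_labels(56488, 0, 5015, 0), 4))       # 0.5
# Ridge model trained on SMOTE data (confusion matrix [[39233, 17255], [1635, 3380]]).
print(round(auc_from_hard_labels(39233, 17255, 1635, 3380), 4))  # 0.6843
```

This is why a model that predicts a single class for every row always scores exactly 0.5 under this setup.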

## Imbalanced Performance

In [60]:
# Run the models on the imbalanced dataset.
modelperformance(log_mod, x_train, y_train, x_test, y_test)
modelperformance(lasso_mod, x_train, y_train, x_test, y_test)
modelperformance(ridge_mod, x_train, y_train, x_test, y_test)
# modelperformance(SVC_mod1, x_train, y_train, x_test, y_test)
# modelperformance(SVC_mod2, x_train, y_train, x_test, y_test)
modelperformance(rf_mod1, x_train, y_train, x_test, y_test)
modelperformance(rf_mod2, x_train, y_train, x_test, y_test)
modelperformance(xgb_mod1, x_train, y_train, x_test, y_test)
modelperformance(xgb_mod2, x_train, y_train, x_test, y_test)
modelperformance(knn_mod1, x_train, y_train, x_test, y_test)
modelperformance(knn_mod2, x_train, y_train, x_test, y_test)
modelperformance(knn_mod3, x_train, y_train, x_test, y_test)

Model
[[56488     0]
 [ 5015     0]]

AUC: 0.5
              precision    recall  f1-score   support

         0.0     0.9185    1.0000    0.9575     56488
         1.0     0.0000    0.0000    0.0000      5015

    accuracy                         0.9185     61503
   macro avg     0.4592    0.5000    0.4787     61503
weighted avg     0.8436    0.9185    0.8794     61503

Model
[[56488     0]
 [ 5015     0]]

AUC: 0.5
              precision    recall  f1-score   support

         0.0     0.9185    1.0000    0.9575     56488
         1.0     0.0000    0.0000    0.0000      5015

    accuracy                         0.9185     61503
   macro avg     0.4592    0.5000    0.4787     61503
weighted avg     0.8436    0.9185    0.8794     61503

Model
[[56475    13]
 [ 4999    16]]

AUC: 0.5015
              precision    recall  f1-score   support

         0.0     0.9187    0.9998    0.9575     56488
         1.0     0.5517    0.0032    0.0063      5015

    accuracy                         0

The first two models (logistic regression and lasso) predicted every application as non-default, and the ridge model flagged only 29 of the 61,503 test applications as defaults. These models are no better than always predicting the majority class, which is to be expected when training on the imbalanced dataset. Some of the more advanced models did a better job, but we can improve our results with different sampling techniques. The highest AUC value was 0.5135, using the XGBoost model.
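The gap between the high accuracy and the near-zero recall above can be recomputed directly from the printed confusion matrices. The following sketch does so in plain Python; `minority_metrics` is a hypothetical helper, and the zero-division guards are a simplification chosen for this example.

```python
def minority_metrics(tn, fp, fn, tp):
    """Accuracy plus precision, recall, and F1 for the default (positive) class."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # no positive predictions at all
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Confusion matrices taken from the imbalanced-data output above.
all_zero = minority_metrics(tn=56488, fp=0, fn=5015, tp=0)
ridge = minority_metrics(tn=56475, fp=13, fn=4999, tp=16)

for name, result in [("all-zero", all_zero), ("ridge", ridge)]:
    print(name, {k: round(v, 4) for k, v in result.items()})
```

Both models reach roughly 92% accuracy simply because about 92% of the test applications are non-defaults, which is why accuracy alone is not a useful selector here.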

## Oversampling Performance

In [63]:
# Training models on Oversampled data
modelperformance(log_mod, x_over, y_over, x_test, y_test)
modelperformance(lasso_mod, x_over, y_over, x_test, y_test)
modelperformance(ridge_mod, x_over, y_over, x_test, y_test)
# modelperformance(SVC_mod1, x_over, y_over, x_test, y_test)
# modelperformance(SVC_mod2, x_over, y_over, x_test, y_test)
modelperformance(rf_mod1, x_over, y_over, x_test, y_test)
modelperformance(rf_mod2, x_over, y_over, x_test, y_test)
modelperformance(xgb_mod1, x_over, y_over, x_test, y_test)
modelperformance(xgb_mod2, x_over, y_over, x_test, y_test)
modelperformance(knn_mod1, x_over, y_over, x_test, y_test)
modelperformance(knn_mod2, x_over, y_over, x_test, y_test)
modelperformance(knn_mod3, x_over, y_over, x_test, y_test)

Model
[[56488     0]
 [ 5015     0]]

AUC: 0.5
              precision    recall  f1-score   support

         0.0     0.9185    1.0000    0.9575     56488
         1.0     0.0000    0.0000    0.0000      5015

    accuracy                         0.9185     61503
   macro avg     0.4592    0.5000    0.4787     61503
weighted avg     0.8436    0.9185    0.8794     61503

Model
[[56488     0]
 [ 5015     0]]

AUC: 0.5
              precision    recall  f1-score   support

         0.0     0.9185    1.0000    0.9575     56488
         1.0     0.0000    0.0000    0.0000      5015

    accuracy                         0.9185     61503
   macro avg     0.4592    0.5000    0.4787     61503
weighted avg     0.8436    0.9185    0.8794     61503

Model
[[50194  6294]
 [ 3016  1999]]

AUC: 0.6436
              precision    recall  f1-score   support

         0.0     0.9433    0.8886    0.9151     56488
         1.0     0.2410    0.3986    0.3004      5015

    accuracy                         0

Oversampling had a clear impact on model performance. The highest AUC value was 0.6477, using the XGBoost model. Changing the parameters of the XGBoost model made no significant difference to the results.

## Undersampling Performance

In [64]:
# Training models on Undersampled data
modelperformance(log_mod, x_under, y_under, x_test, y_test)
modelperformance(lasso_mod, x_under, y_under, x_test, y_test)
modelperformance(ridge_mod, x_under, y_under, x_test, y_test)
# modelperformance(SVC_mod1, x_under, y_under, x_test, y_test)
# modelperformance(SVC_mod2, x_under, y_under, x_test, y_test)
modelperformance(rf_mod1, x_under, y_under, x_test, y_test)
modelperformance(rf_mod2, x_under, y_under, x_test, y_test)
modelperformance(xgb_mod1, x_under, y_under, x_test, y_test)
modelperformance(xgb_mod2, x_under, y_under, x_test, y_test)
modelperformance(knn_mod1, x_under, y_under, x_test, y_test)
modelperformance(knn_mod2, x_under, y_under, x_test, y_test)
modelperformance(knn_mod3, x_under, y_under, x_test, y_test)

Model
[[56488     0]
 [ 5015     0]]

AUC: 0.5
              precision    recall  f1-score   support

         0.0     0.9185    1.0000    0.9575     56488
         1.0     0.0000    0.0000    0.0000      5015

    accuracy                         0.9185     61503
   macro avg     0.4592    0.5000    0.4787     61503
weighted avg     0.8436    0.9185    0.8794     61503

Model
[[56488     0]
 [ 5015     0]]

AUC: 0.5
              precision    recall  f1-score   support

         0.0     0.9185    1.0000    0.9575     56488
         1.0     0.0000    0.0000    0.0000      5015

    accuracy                         0.9185     61503
   macro avg     0.4592    0.5000    0.4787     61503
weighted avg     0.8436    0.9185    0.8794     61503

Model
[[50169  6319]
 [ 3033  1982]]

AUC: 0.6417
              precision    recall  f1-score   support

         0.0     0.9430    0.8881    0.9147     56488
         1.0     0.2388    0.3952    0.2977      5015

    accuracy                         0

The highest AUC value with the undersampled data was 0.6555, using the XGBoost model.

## SMOTE Performance

In [65]:
# Training models on SMOTE sampled data
modelperformance(log_mod, x_smote, y_smote, x_test, y_test)
modelperformance(lasso_mod, x_smote, y_smote, x_test, y_test)
modelperformance(ridge_mod, x_smote, y_smote, x_test, y_test)
# modelperformance(SVC_mod1, x_smote, y_smote, x_test, y_test)
# modelperformance(SVC_mod2, x_smote, y_smote, x_test, y_test)
modelperformance(rf_mod1, x_smote, y_smote, x_test, y_test)
modelperformance(rf_mod2, x_smote, y_smote, x_test, y_test)
modelperformance(xgb_mod1, x_smote, y_smote, x_test, y_test)
modelperformance(xgb_mod2, x_smote, y_smote, x_test, y_test)
modelperformance(knn_mod1, x_smote, y_smote, x_test, y_test)
modelperformance(knn_mod2, x_smote, y_smote, x_test, y_test)
modelperformance(knn_mod3, x_smote, y_smote, x_test, y_test)

Model
[[46231 10257]
 [ 2773  2242]]

AUC: 0.6327
              precision    recall  f1-score   support

         0.0     0.9434    0.8184    0.8765     56488
         1.0     0.1794    0.4471    0.2560      5015

    accuracy                         0.7881     61503
   macro avg     0.5614    0.6327    0.5663     61503
weighted avg     0.8811    0.7881    0.8259     61503

Model
[[25093 31395]
 [ 1681  3334]]

AUC: 0.5545
              precision    recall  f1-score   support

         0.0     0.9372    0.4442    0.6027     56488
         1.0     0.0960    0.6648    0.1678      5015

    accuracy                         0.4622     61503
   macro avg     0.5166    0.5545    0.3853     61503
weighted avg     0.8686    0.4622    0.5673     61503

Model
[[39233 17255]
 [ 1635  3380]]

AUC: 0.6843
              precision    recall  f1-score   support

         0.0     0.9600    0.6945    0.8060     56488
         1.0     0.1638    0.6740    0.2635      5015

    accuracy                    

The highest AUC value for SMOTE data is 0.6843, using a Ridge Model.

# Model selection:
Based on the training results, the model we would want to test for the Kaggle competition is the Ridge model trained on the SMOTE data. It had the highest AUC score, 0.6843. AUC was chosen as the selection metric because of the imbalanced data: we wanted a model that can identify the applications likely to default, not just correctly identify the applications that will not default. In the rest of the sampling tests, XGBoost was the top-performing model. We may want to consider combining the two estimates to see if they provide a better result.
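One caveat about AUC as a selector: it measures how well a model ranks defaults above non-defaults, so it is sensitive to whether it receives continuous scores or thresholded labels. The sketch below is a small pairwise AUC implementation in plain Python, using toy values that are assumptions for illustration and not results from this notebook. It shows how a perfectly ranked set of scores can drop to 0.5 once every score falls below the 0.5 cutoff.

```python
def auc_from_scores(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: the positives get the highest scores, but no score exceeds 0.5.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4]
hard_labels = [int(s > 0.5) for s in scores]  # every row becomes 0

print(auc_from_scores(y_true, scores))       # 1.0
print(auc_from_scores(y_true, hard_labels))  # 0.5
```

If continuous scores are available for a model, evaluating AUC on them may give a different ranking of models than the thresholded version used here.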

In [89]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
import time

In [90]:
def modelperformancenewdata(model, xtrain, ytrain, xtest, ytest, newdata):
    # Fit once, then predict on both the held-out test set and the unlabeled new data.
    model.fit(xtrain, ytrain)
    y_pred = np.asarray(model.predict(xtest))
    pred = np.asarray(model.predict(newdata))

    # Threshold continuous (regressor) outputs at 0.5; 0/1 labels are left unchanged.
    if not np.isin(y_pred, [0, 1]).all():
        y_pred = (y_pred > 0.5).astype(int)

    if not np.isin(pred, [0, 1]).all():
        pred = (pred > 0.5).astype(int)

    # Calculate other evaluation metrics
    precision = precision_score(ytest, y_pred)
    recall = recall_score(ytest, y_pred)
    f1 = f1_score(ytest, y_pred)
    auc_score = round(metrics.roc_auc_score(ytest, y_pred), 4)
    # Use the ytest argument rather than the global y_test.
    accuracy = round(metrics.accuracy_score(ytest, y_pred), 4)

    # Print the evaluation metrics
    print("Prediction Accuracy:", accuracy)
    print("AUC Score:", auc_score)
    print("Precision:", precision)
    print("Recall:", recall)
    print("F1 Score:", f1)

    return pred

In [93]:
start_time = time.time()
Mod_performance_1 = smpl_sub.copy()
Mod_performance_1.iloc[:,1] = modelperformancenewdata(ridge_mod, x_smote, y_smote, x_test, y_test, data_test)

print(Mod_performance_1['TARGET'].value_counts())
Mod_performance_1.to_csv("Model_pred_1.csv", index = False, header = True)

Mod_performance_2 = smpl_sub.copy()
Mod_performance_2.iloc[:,1] = modelperformancenewdata(xgb_mod1, x_under, y_under, x_test, y_test, data_test)

print(Mod_performance_2['TARGET'].value_counts())
Mod_performance_2.to_csv("Model_pred_2.csv", index = False, header = True)

Mod_performance_3 = smpl_sub.copy()
Mod_performance_3.iloc[:,1] = modelperformancenewdata(xgb_mod1, x_over, y_over, x_test, y_test, data_test)

print(Mod_performance_3['TARGET'].value_counts())
Mod_performance_3.to_csv("Model_pred_3.csv", index = False, header = True)

print("---%s seconds ---" % (time.time() - start_time))

Prediction Accuracy: 0.6929
AUC Score: 0.6843
Precision: 0.16379937000242306
Recall: 0.6739780658025922
F1 Score: 0.26354775828460036
0    32354
1    16390
Name: TARGET, dtype: int64
Prediction Accuracy: 0.8275
AUC Score: 0.6555
Precision: 0.22333267365921236
Recall: 0.45004985044865403
F1 Score: 0.2985252298128431
0    41019
1     7725
Name: TARGET, dtype: int64
Prediction Accuracy: 0.8489
AUC Score: 0.6477
Precision: 0.2442027253167583
Recall: 0.4073778664007976
F1 Score: 0.3053583439204843
0    42569
1     6175
Name: TARGET, dtype: int64
---515.5551784038544 seconds ---


The top three models had AUC scores ranging from 0.6477 to 0.6843. The top model was ridge regression trained on SMOTE-sampled data. Total time to train and test these models was just under 9 minutes.

In [95]:
start_time = time.time()
Ems_performance =  smpl_sub.copy()
Ems_performance['TARGET'] = (Mod_performance_1['TARGET'] + Mod_performance_2['TARGET'] + Mod_performance_3['TARGET']) / 3
Ems_performance.to_csv("ems_pred.csv", index = False, header = True)
print("---%s seconds ---" % (time.time() - start_time))
# The averaging step itself is fast, but it depends on the three models above having been run first.
# Total runtime including those models is just under 9 minutes.

---0.18550324440002441 seconds ---
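The ensemble above averages three columns of 0/1 predictions, so each submitted value is the share of models that voted "default". A minimal sketch with made-up votes (the lists below are illustrative assumptions, not actual model output) shows the values this can produce:

```python
def average_votes(*prediction_lists):
    """Average 0/1 predictions row by row across several models."""
    return [sum(votes) / len(votes) for votes in zip(*prediction_lists)]

# Hypothetical 0/1 predictions from three models for four applications.
ridge_votes = [1, 0, 1, 0]
xgb_under_votes = [1, 0, 0, 0]
xgb_over_votes = [1, 0, 1, 1]

ensemble = average_votes(ridge_votes, xgb_under_votes, xgb_over_votes)
print([round(v, 3) for v in ensemble])  # [1.0, 0.0, 0.667, 0.333]
```

With three voters the ensemble can only take the values 0, 1/3, 2/3, and 1, which still gives a finer ranking of applications than any single 0/1 prediction.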


## Model Summary:
The task involved the development of a classification model for predicting loan applicant default likelihood. The provided dataset included various credit-related details for each application. Data preprocessing involved outlier removal, mean-value imputation for columns with missing values, and conversion of categorical variables into dummy variables. After preparing the final dataset, it was divided into training and testing sets. To train the models, k-fold cross-validation was employed, utilizing five different classification methods to create a total of ten models. To tackle class imbalance in the data, three sampling methods were implemented. The top three models were selected based on their performance, assessed using the area under the curve (AUC) as the metric. These top models comprised two XGBoost models and one Ridge Regression linear model, with AUC scores ranging from 0.6477 to 0.6843.
The final model was an ensemble combining the top three models along with their specific sampling techniques. This ensemble achieved the highest score among the four models submitted to Kaggle, with a score of 0.71609. We recommend that Home Credit adopt this combined approach for future application evaluations, as it outperforms majority-class classification by a 20% margin. By implementing this model, Home Credit can make more accurate predictions regarding the suitability of loaning money to applicants, based on their prior credit history.