## Background / Motivation

Loans have become a common aspect of modern life, but their approval presents a significant challenge to banks: assessing the likelihood of repayment. This challenge is heightened in regions such as the United States, where many individuals, often targeted by unscrupulous lenders, lack established credit histories. Further, banks face the risk of borrower default, underscoring the need for rigorous vetting. To address this, our project analyzes historical loan data to develop a predictive model. The goal is to aid banks in confidently granting loans while minimizing default risks, thereby fostering a more inclusive financial environment.

## Problem statement 
Beyond merely considering credit history, we aim to discern patterns, trends, and anomalies within the data that could potentially signal a borrower's inability to repay loans.

## Data sources

The data was published on Kaggle by Home Credit Group, a non-bank financial institution that focuses on lending to people who don't have a credit history (https://www.kaggle.com/competitions/home-credit-default-risk). The dataset was released as a competition to find the model with the best ROC AUC when predicting whether an individual will repay a loan. **The dataset has a binary target variable in which 8.1% of the values are ones, indicating that the applicant defaulted on their loan**. The features include gender, whether the individual owns a car, their approximate income, whether they have children, and many other factors.

## Stakeholders

There are two main stakeholders of our models. 

First, the banks. Banks want to issue as many loans as possible that will be paid back, because they charge interest on those loans and earn a return. Banks must also avoid lending to people who won't be able to repay, which would result in a loss. A large swath of the population has no established credit history, so banks cannot use a credit score to decide whether to lend to them. Banks would therefore benefit from an accurate model that predicts how likely these people are to pay back their loans.

Second, the people without credit histories. If banks were unwilling to give any loans to people without existing credit histories, these people would have a very difficult time becoming homeowners. By developing a model for the bank to give them confidence in their loans to these people, we are giving people access to home loans who otherwise wouldn’t have been able to obtain them. 


## Data quality check / cleaning / preparation 

### Feature Engineering:
1. Process additional datasets (i.e., previous applications/payments, cash/credit card balances).
2. Compute aggregate statistics such as mean, minimum, and maximum, and count the number of unique values, grouped by identifiers.
3. Create dummy variables (via one-hot encoding) for categorical features in the datasets.
4. Compute the unique and maximum contract statuses.
5. Join data frames on common identifiers to incorporate useful information from multiple sources into the training and test data.
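The aggregation and join steps listed above can be sketched with pandas. This is a minimal illustration on toy tables, not the project's actual pipeline: the column names mimic the Kaggle schema, and the `PREV_` output names are assumptions chosen for the example.

```python
import pandas as pd

# Toy stand-ins for the main application table and one additional table.
# Column names and values are assumptions for illustration only.
applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3],
                             "AMT_INCOME_TOTAL": [100.0, 80.0, 120.0]})
previous = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2, 2],
    "AMT_CREDIT": [10.0, 30.0, 5.0, 15.0, 25.0],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved", "Approved", "Canceled"],
})

# Numeric aggregates (mean, min, max) per applicant.
numeric_agg = previous.groupby("SK_ID_CURR")["AMT_CREDIT"].agg(["mean", "min", "max"])
numeric_agg.columns = [f"PREV_AMT_CREDIT_{name.upper()}" for name in numeric_agg.columns]

# One-hot encode the categorical column, then count each status per applicant.
status_dummies = pd.get_dummies(previous["NAME_CONTRACT_STATUS"], prefix="PREV_STATUS").astype(int)
status_counts = status_dummies.groupby(previous["SK_ID_CURR"]).sum()

# Number of distinct statuses per applicant.
status_nunique = (previous.groupby("SK_ID_CURR")["NAME_CONTRACT_STATUS"]
                  .nunique().rename("PREV_STATUS_NUNIQUE"))

# Join all aggregates on the shared identifier, then left-join onto the main table.
aggregates = numeric_agg.join(status_counts).join(status_nunique).reset_index()
features = applications.merge(aggregates, on="SK_ID_CURR", how="left")
print(features)
```

Applicants with no rows in the additional table (applicant 3 here) end up with missing values after the left join, which is one reason an imputation step is needed afterwards.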


### Data Preprocessing:
1. Remove the target variable from the training set.
2. Combine the train and test datasets to ensure consistent one-hot encoding of categorical variables across both datasets.
3. Deal with missing values by imputing them with the mean value of the respective columns.
4. Reduce dimensionality by removing highly correlated features: compute the correlation matrix and, for each pair of features whose correlation exceeds a threshold (0.7), drop one feature of the pair.
5. Rename columns using regular expressions.
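Steps 3 and 4 above can be sketched as follows. This is one common way to implement mean imputation and a correlation filter, shown on a toy frame; the helper names `impute_mean` and `drop_correlated` are hypothetical, and the choice to drop the later column of each pair is an assumption.

```python
import numpy as np
import pandas as pd

def impute_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values in each numeric column with that column's mean."""
    return df.fillna(df.mean(numeric_only=True))

def drop_correlated(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Drop the later feature of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],     # strongly correlated with "a"
    "c": [1.0, -1.0, -1.0, 1.0],   # weakly correlated with both
})
clean = drop_correlated(impute_mean(toy), threshold=0.7)
print(list(clean.columns))
```

Which feature of a correlated pair is kept depends on column order here; a variant could instead keep the feature with the stronger relationship to the target.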


### Dealing with Imbalance
To address class imbalance (0.92:0.08), we used a stratified 80-20 train-test split, resulting in 246K observations in the training dataset and 61.5K in the test dataset. Stratification keeps the class distribution consistent across both datasets, preserving the representation of each class.


### Dealing with "Large"
Next, we downsampled the 246K training observations to a more manageable 15K by applying a second stratified split to the training data, keeping 15K observations for actual training and discarding the rest. This reduces computational demands while preserving the class distribution.

We train our model on this smaller sample, then predict on the full 61.5K test dataset, balancing computational efficiency with model evaluation accuracy.
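The two stratified steps described above (the 80-20 split and the smaller training sample) can be sketched with scikit-learn's `train_test_split`. The data below is synthetic and much smaller than the real dataset, and the sample size of 1,000 stands in for the 15K used in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20_000                                # toy size; the real data has roughly 307.5K rows
X = rng.normal(size=(n, 5))               # synthetic features (assumption)
y = (rng.random(n) < 0.08).astype(int)    # about 8% positives, like the target

# Stratified 80-20 split: both parts keep roughly the same positive rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Stratified subsample of the training part; the remainder is discarded.
X_small, _, y_small, _ = train_test_split(
    X_train, y_train, train_size=1_000, stratify=y_train, random_state=42)

print(y.mean(), y_train.mean(), y_test.mean(), y_small.mean())
```

The four printed positive rates should be nearly identical, which is the property the stratification is meant to guarantee.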

## Approach

### Models
We opted against the MARS model because it is designed for continuous output variables and performs poorly on classification tasks like ours. Instead, we used Lasso, Ridge, Bagged Decision Trees, Random Forest, AdaBoost, LightGBM, and XGBoost, which are better suited to classification and cope more effectively with collinearity, feature selection, and large, high-dimensional datasets.

### Metrics
Our model evaluation primarily centers on two metrics:
1. The AUC score, which assesses our model's ability to distinguish defaulters (positive class) from non-defaulters (negative class). This is also the principal metric of the Kaggle competition from which our data comes.
2. The balance between precision and recall, which lets us limit both false positives, which could lead to the unjust rejection of creditworthy clients, and false negatives, which could result in the acceptance of high-risk clients.

This approach safeguards Home Credit's finances, promotes fair lending, and addresses our client's needs.
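The two metrics above can be illustrated on a handful of made-up scores. The sketch below is plain Python rather than the scikit-learn functions presumably used in the project; the function names and the toy data are assumptions.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (1 = default)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_auc(y_true, scores):
    """AUC as the probability that a random defaulter is scored above a
    random non-defaulter, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.45, 0.9]      # made-up predicted default probabilities
y_pred = [int(s >= 0.5) for s in scores]       # decision threshold of 0.5

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(precision, recall, f1, auc)
```

AUC depends only on how the scores rank the two classes, while precision and recall change with the decision threshold, which is why the threshold is tuned separately for each model later in the report.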

### Problems
Our key challenges:
1. Imbalanced dataset. This imbalance biased our model toward classifying most entries as the negative class, resulting in a misleadingly high baseline accuracy of 91%.
2. 300,000 rows of data. The dataset size posed computational challenges and required strategic sampling to ensure crucial observations were not overlooked.
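A short calculation shows why accuracy is misleading under this imbalance. It assumes the 8.1% positive rate from the data description and the combined train and test row counts reported above; a classifier that never predicts a default scores about 0.919 accuracy, in line with the baseline quoted above, while catching no defaulters at all.

```python
positive_rate = 0.081        # share of defaulters reported in the data description
n_rows = 307_500             # 246K training rows plus 61.5K test rows

n_pos = round(n_rows * positive_rate)
n_neg = n_rows - n_pos

# A classifier that always predicts "no default" gets every negative right
# and every positive wrong.
baseline_accuracy = n_neg / n_rows
baseline_recall = 0 / n_pos

print(f"accuracy={baseline_accuracy:.3f}, recall={baseline_recall:.1f}")
```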

Existing solutions on Kaggle have achieved a score of 0.80 AUC on the public leaderboard. However, as most high-scoring solutions remain undisclosed, our group did not reference Kaggle's solutions during the course of our project.

## Developing the model: Hyperparameter tuning

### Bagging Model (Bagged Decision Trees)
*By Victoria Shi*

1. Hyperparameter Tuning
    1. Coarse search: I used `RandomizedSearchCV` to run a randomized search over the parameter space defined by `params`, with ROC-AUC as the scoring metric and 5-fold stratified cross-validation.
       ```
        params = {
            'base_estimator__max_depth': [10, 15, 20],
            'base_estimator__max_features': [0.5, 0.75, 1.0],
            'max_features': [0.5, 0.75, 1.0],
            'bootstrap': [True, False],
            'bootstrap_features': [True, False],
            'base_estimator__class_weight': ["balanced"]  # set on the tree; "balanced_subsample" is only accepted by forests
        }
        ```
    2. Fine-Tuning: I performed another round of hyperparameter optimization to fine-tune the `BaggingClassifier`, this time using `GridSearchCV` with a narrower parameter grid called `fine_bag_params`. I kept the scoring metric and cross-validation setup the same. The best estimator from this grid search became the final classifier.
       ```
        fine_bag_params = {
            'base_estimator__max_depth': [15],
            'base_estimator__max_features': [0.5, 0.6],
            'max_features': [0.5, 0.6],
            'max_samples': [0.5, 0.75, 1.0],
            'bootstrap': [False],
            'bootstrap_features': [False]
        }
        ```
    3. Final Optimal BaggingClassifier
       - base_estimator__class_weight='balanced'
       - base_estimator__max_depth=12
       - base_estimator__max_features=0.5
       - bootstrap=False
       - max_features=0.50
       - n_estimators=300
2. Optimize the probability threshold: I plotted the precision-recall curve to visualize the tradeoff between precision and recall across decision thresholds, and used cross-validation to find the decision threshold that maximizes the F1 score, which is the harmonic mean of precision and recall.
    - Final threshold: 0.54
3. Prediction on test data: I applied the optimal decision threshold obtained in the previous step to calculate the accuracy, precision, and recall on the test dataset. Additionally, I plotted the confusion matrix.
    1. Final AUC: 0.7512754266244552
    2. Test Accuracy: 0.8304
    3. Test Precision: 0.2235
    4. Test Recall: 0.4447

In [None]:
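from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Final bagged-trees model; in scikit-learn >= 1.2 the `base_estimator` argument is named `estimator`
bagging_opt = BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced',
                                                                      max_depth=12,
                                                                      max_features=0.5,
                                                                      random_state=42),
                                bootstrap=False, max_features=0.50, n_estimators=300,
                                random_state=42).fit(X_train, y_train)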
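The threshold search described in step 2 can be sketched as a simple scan over candidate thresholds. This is an illustration, not the notebook's actual code: it assumes the predicted probabilities are already available (for example, out-of-fold predictions from cross-validation), and `best_f1_threshold` and the toy data are hypothetical.

```python
import numpy as np

def best_f1_threshold(y_true, proba, thresholds):
    """Return the first threshold that maximizes F1, and the F1 it achieves."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        pred = proba >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0   # F1 = 2TP / (2TP + FP + FN)
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Made-up labels and predicted default probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
proba = [0.12, 0.22, 0.31, 0.37, 0.44, 0.62, 0.58, 0.91]

t, f1 = best_f1_threshold(y_true, proba, np.arange(0.05, 1.0, 0.05))
print(t, f1)
```

Selecting the threshold on cross-validated predictions rather than on the test set keeps the reported test precision and recall honest.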

### Random Forest Classifier
*by Victoria Shi*

I developed the Random Forest model in the same way as the BaggingClassifier, but with a different search space.

1. Hyperparameter Tuning
    1. Coarse Search: I used `RandomizedSearchCV` to run a randomized search over the parameter space defined by `params`, with ROC-AUC as the scoring metric and 5-fold stratified cross-validation.
        ```
        rf_fine_params = {
            'max_depth': [12, 15, 18],
            'max_features': [15],
            'max_samples': [0.5, 0.75],
            'n_estimators': [300],
            'max_leaf_nodes': [1000],
            'min_samples_split': [1, 2, 4],
            'min_samples_leaf': [1, 2, 4],
            'warm_start': [True, False],
            'class_weight': ['balanced']
        }
        ```
    2. Fine-Tuning: I performed another round of hyperparameter optimization to fine-tune the `RandomForestClassifier`, this time using `GridSearchCV` with a narrower parameter grid called `rf_fine_params`. I kept the scoring metric and cross-validation setup the same. The best estimator from this grid search became the final classifier.
        ```
        rf_fine_params = {
            'max_depth': [12, 15, 18],
            'max_features': [15],
            'max_samples': [0.5, 0.75],
            'n_estimators': [300],
            'max_leaf_nodes': [1000],
            'min_samples_split': [1, 2, 4],
            'min_samples_leaf': [1, 2, 4],
            'warm_start': [True, False],
            'class_weight': ['balanced']
        }
        ```
    3. Final optimal hyperparameters
        - class_weight='balanced'
        - max_depth=16
        - max_features='sqrt'
        - max_leaf_nodes=1000
        - max_samples=0.75
        - min_samples_leaf=2
        - n_estimators=300
        - random_state=42
        - warm_start=True
2. Optimizing Probability Threshold: As before, I plotted the precision-recall curve and used cross-validation to find the threshold that maximized the F1 score.
    - Final threshold: 0.53
3. Prediction on test data:
    1. Final AUC: 0.7516
    2. Test Accuracy: 0.8202
    3. Test Precision: 0.2174
    4. Test Recall: 0.4723

In [None]:
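from sklearn.ensemble import RandomForestClassifier

# Final random forest, refit on the training data with the tuned hyperparameters
rfc_opt = RandomForestClassifier(class_weight='balanced', max_depth=16, max_features='sqrt',
                                 max_leaf_nodes=1000, max_samples=0.75,
                                 min_samples_leaf=2, n_estimators=300, random_state=42,
                                 warm_start=True).fit(X_train, y_train)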

### Lasso
*By William Pattie*

I began by doing a coarse grid search of the alpha parameter for the Lasso model using GridSearchCV. I did this on a small sample of the data (15k observations) since our original dataset was very large. 

In [None]:
#Initial tuning on sample dataset:
import numpy as np

# 200 log-spaced candidates from about 0.16 down to about 1.6e-5
alphas = 10 ** np.linspace(-0.5, -4.5, 200) * 0.5

I then ran a finer search around the coarse optimum to refine alpha, this time using `GridSearchCV` on the entire dataset.

In [None]:
#Finer tuning on the whole dataset:
# optimal_a is the best alpha found by the coarse search above
alphas_fine = np.linspace(optimal_a / 3, optimal_a * 3, 10)

I then tuned the threshold of the model by using cross_val_predict with the optimal alpha for lasso. I selected the best model based on F1 score.

My optimal Lasso hyperparameters were

- Alpha: 9.73151391349376e-05
- Threshold: 0.16

My roc_auc on test data was 0.755
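To make the two alpha grids above concrete, the sketch below evaluates them numerically. The coarse winner is not reported, so the final alpha is reused as a stand-in for `optimal_a`; that substitution is an assumption made purely for illustration.

```python
import numpy as np

# Coarse grid: 200 log-spaced values between 0.5 * 10**-0.5 and 0.5 * 10**-4.5.
alphas = 10 ** np.linspace(-0.5, -4.5, 200) * 0.5
print(alphas.max(), alphas.min())

# Stand-in for the coarse-search winner (assumption: the reported final alpha).
optimal_a = 9.73151391349376e-05

# Fine grid: 10 evenly spaced values from one third to three times the coarse optimum.
alphas_fine = np.linspace(optimal_a / 3, optimal_a * 3, 10)
print(alphas_fine)
```

The coarse grid covers four orders of magnitude on a log scale, while the fine grid is linear and spans only a factor of nine around the coarse optimum.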

### Ridge
*By William Pattie*

Tuning of my ridge model followed the same procedure. 

In [None]:
#Initial tuning on sample dataset:
alphas = 10 ** np.linspace(3, -1, 200) * 0.5

#Finer tuning on the whole dataset: 
alphas_fine = np.linspace(optimal_a / 3, optimal_a * 3, 10)

My optimal ridge parameters were: 

- Threshold: 0.16
- alpha: 222.335854

My ROC_AUC on test data was 0.756

### AdaBoost
*By William Pattie*

For my AdaBoost model, I began by running a coarse search on the sample data. I also tried a model with the default hyperparameters.

In [None]:
#Defining Parameter grid for Randomized Search CV
grid = dict()
grid['n_estimators'] = [10, 50, 100]
grid['learning_rate'] = [0.001, 0.01, 0.1, 1]
grid['base_estimator'] = [DecisionTreeClassifier(max_depth=3), DecisionTreeClassifier(max_depth=5),
                          DecisionTreeClassifier(max_depth=20)]

In [None]:
#DEFAULT HYPERPARAMETERS
grid = dict()
grid['n_estimators'] = [50]
grid['learning_rate'] = [1]
grid['base_estimator'] = [DecisionTreeClassifier(max_depth=1)]

The model with default hyperparameters performed much better than the best model from my coarse search, so I kept those parameters and plotted how ROC_AUC changed as I varied each of them. I used the graphs to select hyperparameter values that might make sense for a final GridSearch, then I ran a GridSearch on the whole dataset while tuning the threshold. The best model was selected by ROC_AUC score, and the best threshold for that model was selected by F1 score.

In [None]:
threshold_hyperparam_vals = np.arange(0.49, 0.51, 0.001)

#Defining Parameter grid
grid = dict()
grid['n_estimators'] = [30, 50, 80]
grid['learning_rate'] = [0.1, 1]
grid['base_estimator'] = [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=2)]

My optimal AdaBoost hyperparameters were

- learning_rate = 1
- n_estimators = 80
- base_estimator = DecisionTreeClassifier(max_depth=1)
- threshold = 0.495

My roc_auc on test data was: 0.757

### Gradient Boosting
*By Julia Chu*

##### Additional Data Processing:
Starting with a standard train-test split dataset, I applied the SMOTETomek sampling method to generate synthetic minority samples and remove nearby majority instances, aiming for data balance. The improved dataset's effectiveness was confirmed by comparing base model performance before and after resampling.

Based on the results from the base model, I plotted the distribution of false negative features against true positives. This allowed me to identify potential predictors that our model might be overlooking. Insights gathered from this analysis were subsequently employed to guide further feature engineering, enhancing our model's performance.

##### Model Development
In the initial optimization phase, I used the BayesianOptimization library, with early stopping, to tune the LightGBM model within our search space. Bayesian optimization treats the objective as a black box: it uses previously evaluated points to fit a Gaussian process surrogate of the objective and selects the next promising sampling point from it. I deliberately defined a wide search space to take advantage of the method's ability to home in on good parameters on its own. Over several runs, I incrementally increased both `init_round` (the number of exploratory steps before the Bayesian process, passed as `init_points`) and `opt_round` (the number of iterations, passed as `n_iter`). This optimization strategy yielded an AUC of 0.748 with a decision threshold of roughly 0.175.

In [None]:
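from bayes_opt import BayesianOptimization

# lgb_eval (the LightGBM objective), init_round and opt_round are defined elsewhere in the notebook
lgbBO = BayesianOptimization(lgb_eval, {'learning_rate': (0.01, 0.5),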
                                        'num_leaves': (24, 45),
                                        'feature_fraction': (0.1, 0.9),
                                        'bagging_fraction': (0.8, 1),
                                        'max_depth': (8, 25),
                                        'lambda_l1': (0, 5),
                                        'lambda_l2': (0, 3),
                                        'min_split_gain': (0.001, 0.1),
                                        'min_child_weight': (5, 50)}, random_state=0)
# optimize
lgbBO.maximize(init_points=init_round, n_iter=opt_round)

bayes_param = {'bagging_fraction': 0.8706181305911628, 'feature_fraction': 0.4985913556467668,
               'lambda_l1': 2.6591344619579216, 'lambda_l2': 0.8759030094132374, 'learning_rate': 0.021598957600363185,
               'max_depth': int(16.39003425765356), 'min_child_weight': 9.733155599372788,
               'min_split_gain': 0.08363623636722453, 'num_leaves': int(28.009367040413146)}


Random Search: 
After several rounds of Bayesian optimization, I observed trends in certain parameters; for example, max_depth consistently fell within the 20-35 range. Given the size and complexity of our dataset, this range for max_depth appeared reasonable and was therefore maintained during the subsequent Random Search. I also kept n_estimators constant at 100 throughout the tuning process. Once the remaining hyperparameters were finalized, I attempted to increase n_estimators only when further improvements in the AUC metric were not evident. The final model achieves an AUC of 0.7631; with a decision threshold of 0.175, it reaches a precision of 0.2308 and a recall of 0.4815 on the test dataset.

In [None]:
rsParams = {
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2],
    'n_estimators': [100],
    # 'boosting_type' : ['gbdt', 'dart'], # for better accuracy -> try dart
    'objective': ['tweedie', 'binary'],
    'num_leaves': [25, 50, 100],
    'min_child_weight': [0.1, 0.5, 2, 6],
    'colsample_bytree': [0.8, 0.95],
    'max_depth': [8, 16, 28, 32],
    'subsample': [0.5, 0.75, 0.9, 1],
    'min_child_samples': [300],
    'class_weight': ['balanced'],
    'reg_alpha': [0, 0.1, 1, 3, 10, 100],
    'reg_lambda': [0, 0.1, 1, 3, 10, 100],
    'early_stopping_round': [150]
}

rs_param = {'subsample': 0.75, 'reg_lambda': 1, 'reg_alpha': 1, 'objective': 'binary',
            'num_leaves': 50, 'n_estimators': 100, 'min_child_weight': 0.1, 'min_child_samples': 300,
            'max_depth': 32, 'learning_rate': 0.1, 'early_stopping_round': 150,
            'colsample_bytree': 0.8, 'class_weight': 'balanced'}

### XGBoost
*By Yiru Zhang*

## Model Ensemble 

Put the results of ensembling individual models. Feel free to add subsections in this section to add more innovative ensembling methods.

### Voting ensemble

The simplest voting ensemble will be the model where all models have equal weights.

You may come up with innovative methods of estimating weights of the individual models, such as based on their cross-val error. Sometimes, these methods may work better than stacking ensembles, as stacking ensembles tend to overfit.
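As a minimal sketch of the two ideas above, the snippet below averages predicted probabilities with equal weights and then with weights derived from cross-validated scores. The probabilities, the AUC values, and the "AUC minus 0.5" weighting rule are all hypothetical; they are one possible scheme, not results from this project.

```python
import numpy as np

def soft_vote(probas, weights=None):
    """Weighted average of per-model predicted probabilities.

    probas has shape (n_models, n_samples); weights default to equal.
    """
    probas = np.asarray(probas, dtype=float)
    if weights is None:
        weights = np.ones(len(probas))
    weights = np.asarray(weights, dtype=float)
    return weights @ probas / weights.sum()

# Made-up predicted default probabilities for three observations.
probas = [[0.2, 0.6, 0.9],   # model A
          [0.4, 0.4, 0.7],   # model B
          [0.3, 0.8, 0.8]]   # model C

equal = soft_vote(probas)

# Hypothetical cross-validated AUCs; weight each model by its skill above chance.
cv_auc = np.array([0.75, 0.76, 0.76])
weighted = soft_vote(probas, weights=cv_auc - 0.5)

print(equal, weighted)
```

When the individual models score within a point of each other, the weighted and equal-weight averages will be close, so the gain from weighting is likely to be small.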

### Stacking ensemble
Try out different models as the metamodel. You may split work as follows. The person who worked on certain types of models *(say AdaBoost and MARS)* also uses those models as a metamodel in the stacking ensemble.

### Ensemble of ensembled models

If you are creating multiple stacking ensembles *(based on different metamodels)*, you may ensemble them.

### Innovative ensembling methods
*(Optional)*

Some models may do better on certain subsets of the predictor space. You may find that out, and given a data point, choose the model(s) that will best predict for that data point. This is similar to the idea of developing a decision tree metamodel. However, decision tree is prone to overfitting.

Another idea may be to correct the individual models with the intercept and slope *(note the tree-based models don't have an intercept and may suffer from a constant bias)*, and then ensemble them. This is equivalent to having a simple linear regression meta-model for each of the individual models, and then ensembling the meta-models with a meta-metamodel or a voting ensemble.

## Limitations of the model with regard to prediction

While we were able to fine-tune most of the hyperparameters on our sampled dataset, this approach inherently reduces the dataset size and increases the classification uncertainty. Given additional computational resources, we would aim to conduct the hyperparameter search on the full dataset or employ a more strategic sampling method to ensure data quality. 

The feasibility of collecting our model's predictors depends largely on stakeholders' existing systems. If they are already gathering personal, geographical, and financial background information, integrating our model could be straightforward; otherwise, it could be resource-intensive.

Our model, based on loan application data, can in principle predict default risk at the moment of application. However, as it forecasts an event that may occur months or years later, ongoing monitoring of the borrower's circumstances is crucial for maintaining accurate risk assessments.


## Conclusions and Recommendations to stakeholder(s)

Despite the project's original intention to develop a model that wasn't heavily reliant on credit-related information, EXT Sources - a category of credit-specific data - emerged as the model's top-ranked feature. This observation implies potential biases and assumptions inherent in the dataset. It seems the data may be intrinsically skewed towards credit-related information, which is subsequently reflected in the models' outcomes.

Stakeholders might want to consider **alternative data sources** that could supplement traditional credit data, such as rental payment history, utility bill payments, or even non-traditional data like social media behavior or online reviews. This could potentially provide a more comprehensive picture of a borrower's financial responsibility.

Additionally, in the GB model, one top-ranked feature is "30_CNT_SOCIAL_CIRCLE" (How many observations of a client's social surroundings defaulted on 30 DPD (days past due)). Insights from this could be used to potentially apply a Social Influence / Social Selection model (like ERGM, ALAAM, or REM) to understand the possible impact an individual's network has on default risk **(Social Network Analysis)**.

Furthermore, the stakeholders should contemplate the **introduction of welfare workshops**. If creditworthiness plays a significant role in loan determination, there could be a need for a welfare workshop or a non-profit organization dedicated to educating vulnerable individuals on methods to maintain their credit standing even amidst financial instability.

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 25%;">
       <col span="1" style="width: 40%;">
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Individual Model</th>
    <th>Work other than individual model</th>    
    <th>Details of work other than individual model</th>
  </tr>
  <tr>
    <td>William Pattie</td>
    <td>Lasso, Ridge & Ada</td>
    <td></td>    
    <td></td>
  </tr>
  <tr>
    <td>Julia Chu</td>
    <td>LightGBM</td>
    <td>Project Timeline & Report, Sampling, Feature Engineering</td>    
    <td>Stacking ensembles and voting ensemble</td>
  </tr>
    <tr>
    <td>Victoria Shi</td>
    <td>Bagged trees & Random forest</td>
    <td>Data Preprocessing, Feature Engineering</td>    
    <td>Variable selection based on feature importance</td>
  </tr>
    <tr>
    <td>Yiru Zhang</td>
    <td>XGBoost</td>
    <td>Ensemble modeling</td>    
    <td>voting ensemble & stacking ensemble</td> 
  </tr>
</table>