## **Dataset description**

The dataset: SBA dataset posted on Kaggle. 

The dataset is from the U.S. Small Business Administration (SBA)
The U.S. SBA was founded in 1953 on the principle of promoting and assisting small enterprises in the U.S. credit market 
(SBA Overview and History, US Small Business Administration (2015)). Small businesses have been a primary source of job creation in the United States; therefore, fostering small business formation and growth has social benefits by creating job opportunities and reducing unemployment.
There have been many success stories of start-ups receiving SBA loan guarantees such as FedEx and Apple Computer. 
However, there have also been stories of small businesses and/or start-ups that have defaulted on their SBA-guaranteed loans.

More info on the original dataset: https://www.kaggle.com/mirbektoktogaraev/should-this-loan-be-approved-or-denied


## **Preparation**

Loaded the data into H2O-3 dataframe.  
Described imported dataframe.

In [1]:
import h2o
try:
    h2o.cluster().shutdown()
except:
    pass 

In [2]:
# Adjust as per limits on your PC
# Limit to 8 threads and 8GB memory
h2o.init(nthreads=8, max_mem_size=8)

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "17.0.7" 2023-04-18; OpenJDK Runtime Environment Temurin-17.0.7+7 (build 17.0.7+7); OpenJDK 64-Bit Server VM Temurin-17.0.7+7 (build 17.0.7+7, mixed mode, sharing)
  Starting server from /Users/aishwaryaadiki/anaconda3/lib/python3.11/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/hd/qln9g6n51snggb5qgwn55w9c0000gn/T/tmp9u4ms8ey
  JVM stdout: /var/folders/hd/qln9g6n51snggb5qgwn55w9c0000gn/T/tmp9u4ms8ey/h2o_aishwaryaadiki_started_from_python.out
  JVM stderr: /var/folders/hd/qln9g6n51snggb5qgwn55w9c0000gn/T/tmp9u4ms8ey/h2o_aishwaryaadiki_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,03 secs
H2O_cluster_timezone:,America/Chicago
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.44.0.3
H2O_cluster_version_age:,2 months and 14 days
H2O_cluster_name:,H2O_from_python_aishwaryaadiki_onchze
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,8 Gb
H2O_cluster_total_cores:,10
H2O_cluster_allowed_cores:,8


In [3]:
df_h = h2o.import_file('/Users/aishwaryaadiki/Downloads/SBA_loans_lab_6')

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [4]:
df_h.describe()

Unnamed: 0,City,State,Zip,Bank,BankState,NAICS,Term,NoEmp,NewExist,CreateJob,RetainedJob,FranchiseCode,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,GrAppv,SBA_Appv,Defaulted
type,enum,enum,int,enum,enum,int,int,int,int,int,int,int,int,enum,enum,int,int,int,int,int
mins,,,0.0,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,0.0,0.0,500.0,250.0,0.0
mean,,,53773.700544393985,,,399594.72817002464,110.90885988667263,11.38415085106738,1.2807185362326887,9.0115273614965,11.433813593723062,2755.5899473400323,0.7598827801349027,,,201926.88862444606,6.553808255436989,193398.5314152575,150058.35065866666,0.17522924046198418
maxs,,,99999.0,,,928120.0,461.0,8500.0,2.0,8800.0,8800.0,99999.0,2.0,,,11446325.0,996262.0,5000000.0,4869000.0,1.0
sigma,,,31182.082734237698,,,263331.7973921777,78.95806965689958,67.30104824719155,0.4520661496065112,246.60707495715351,247.46697982527505,12754.797773687817,0.6467730811038548,,,290056.8000335532,2367.8805220862832,284469.6765889608,229388.82282767945,0.38016411904602204
zeros,,,43,,,40267,161,1369,220,125596,87605,42030,64388,,,44,179829,0,0,148321
missing,3,4,0,316,318,0,0,0,23,0,0,0,0,932,855,0,0,0,0,0
0,Cuyahoga Falls,OH,44224.0,"PNC BANK, NATIONAL ASSOCIATION",OH,332710.0,84.0,3.0,1.0,0.0,3.0,1.0,1.0,N,N,99000.0,0.0,99000.0,49500.0,0.0
1,LAWRENCEVILLE,GA,30043.0,MERRILL LYNCH BANK USA,UT,454390.0,84.0,1.0,2.0,0.0,1.0,1.0,1.0,0,N,100900.0,0.0,108400.0,81300.0,0.0
2,MAPLE GROVE,MN,55369.0,TWIN CITIES-METRO CERT. DEVEL,MN,0.0,240.0,6.0,1.0,4.0,6.0,1.0,0.0,N,N,186000.0,0.0,186000.0,186000.0,0.0


## **Set attributes for the categorical  variables - make non-numerical columns "factor" (categorical)**

We will be using H2O-3 DRF model. The model can handle categorical variable and missing values.  
I will need to **mark** categorical variables with `asfactor` method

In [5]:
# Choose which columns to encode
cat_columns = ["City","State","Bank","BankState", "NewExist", "RevLineCr","LowDoc","Zip"]
encoded_columns = cat_columns
response = "Defaulted"

df_h[encoded_columns+[response]] = df_h[encoded_columns+[response]].asfactor()

## **Split to train/test/validation**

In [6]:
train,test,valid = df_h.split_frame(ratios=[.7, .15], seed=123)

## **Used H2O-3 to train Distributed Random Forest model with default parameters.**
Documentation: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html#defining-a-drf-model


In [7]:
from h2o.estimators import H2ORandomForestEstimator

predictors = train.columns
predictors.remove("Defaulted")
print("Predictor columns:", predictors)    
response_col = "Defaulted"

drf = H2ORandomForestEstimator(nfolds=0,seed=123)

drf.train(predictors, response_col, training_frame= train)
model_summary = drf.summary()

Predictor columns: ['City', 'State', 'Zip', 'Bank', 'BankState', 'NAICS', 'Term', 'NoEmp', 'NewExist', 'CreateJob', 'RetainedJob', 'FranchiseCode', 'UrbanRural', 'RevLineCr', 'LowDoc', 'DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']
drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%


In [8]:
model_summary

Unnamed: 0,number_of_trees,number_of_internal_trees,model_size_in_bytes,min_depth,max_depth,mean_depth,min_leaves,max_leaves,mean_leaves
,50.0,50.0,5293379.0,20.0,20.0,20.0,5725.0,7517.0,6730.1


### **Model performance metrics on Test dataset**

Alternative method to calculate metrics is to use `h2o.make_metrics(pred, actual)` function:   
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html#computing-model-metrics-from-general-predictions

In [9]:
print("Best accuracy threshold:",drf.model_performance(test).accuracy()[0][0],"\n",
      " Accuracy:",
      drf.model_performance(test).accuracy()[0][1])
print("Best F1 threshold:",drf.model_performance(test).F1()[0][0],"\n",
      " F1:",
      drf.model_performance(test).F1()[0][1])
print("Model AUC:", drf.model_performance(test).auc())
print("Model AUCPR:", drf.model_performance(test).aucpr())
drf.model_performance(test).confusion_matrix()

Best accuracy threshold: 0.456615068949759 
  Accuracy: 0.9346093691815
Best F1 threshold: 0.37789810081215014 
  F1: 0.8057286542035639
Model AUC: 0.9562319892237096
Model AUCPR: 0.8734020921112297


Unnamed: 0,0,1,Error,Rate
0,21392.0,751.0,0.0339,(751.0/22143.0)
1,1026.0,3685.0,0.2178,(1026.0/4711.0)
Total,22418.0,4436.0,0.0662,(1777.0/26854.0)


## **HyperParameter Tuning**
H2O GridSearch for the following search space.

```
# DRF hyperparameters
from h2o.grid.grid_search import H2OGridSearch

# DRF hyperparameters
drf_params2  = {'ntrees': [25,50,100],
               'max_depth': [10,15,20,25]}

# Train and validate a cartesian grid of DRFs
drf_grid2  = H2OGridSearch(model=H2ORandomForestEstimator,
                          grid_id='drf_grid2',
                          hyper_params=drf_params2)

drf_grid2.train(x=predictors,
               y=response_col,
               nfolds=0,
               training_frame=train,
               validation_frame=valid,
               seed=123)
```

You can follow the progress of the Grid search job by using H2O Flow, the URL is shown right after `h2o.init()`.
Answer following questions:
- What is the best set of parameters for the DRF model using AUCPR metric?
- What is the best model performance (AUCPR) on a validation dataset?
- What is the best model performance (AUCPR) on a test dataset?
- Produce Confusion Matrix using best F1 probability threshold

In [10]:
# DRF hyperparameters
from h2o.grid.grid_search import H2OGridSearch

# DRF hyperparameters
drf_params2  = {'ntrees': [25,50,100],
               'max_depth': [10,15,20,25]}

# Train and validate a cartesian grid of DRFs
drf_grid2  = H2OGridSearch(model=H2ORandomForestEstimator,
                          grid_id='drf_grid2',
                          hyper_params=drf_params2)

drf_grid2.train(x=predictors,
               y=response_col,
               nfolds=0,
               training_frame=train,
               validation_frame=valid,
               seed=123)


drf Grid Build progress: |███████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,max_depth,ntrees,model_ids,logloss
,20.0,100.0,drf_grid2_model_11,0.1851478
,25.0,100.0,drf_grid2_model_12,0.1862623
,20.0,50.0,drf_grid2_model_7,0.1882485
,15.0,100.0,drf_grid2_model_10,0.1953589
,20.0,25.0,drf_grid2_model_3,0.1958864
,15.0,50.0,drf_grid2_model_6,0.1971946
,15.0,25.0,drf_grid2_model_2,0.1986369
,25.0,50.0,drf_grid2_model_8,0.2039817
,10.0,100.0,drf_grid2_model_9,0.2347524
,10.0,50.0,drf_grid2_model_5,0.2358679


In [11]:
# Get the grid results, sorted by validation AUCPR
drf_grid2perf1 = drf_grid2.get_grid(sort_by='aucpr', decreasing=True)
drf_grid2perf1

Unnamed: 0,max_depth,ntrees,model_ids,aucpr
,25.0,100.0,drf_grid2_model_12,0.8806076
,20.0,100.0,drf_grid2_model_11,0.8798941
,25.0,50.0,drf_grid2_model_8,0.8758004
,20.0,50.0,drf_grid2_model_7,0.8757039
,15.0,100.0,drf_grid2_model_10,0.8717645
,25.0,25.0,drf_grid2_model_4,0.8706091
,20.0,25.0,drf_grid2_model_3,0.8702474
,15.0,50.0,drf_grid2_model_6,0.8682247
,15.0,25.0,drf_grid2_model_2,0.8638473
,10.0,100.0,drf_grid2_model_9,0.832002


In [12]:
# Grab the top GBM model, chosen by validation AUCPR
best_drf_grid2 = drf_grid2perf1.models[0]

# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance
best_best_drf_grid2_perf1 = best_drf_grid2.model_performance(test)

best_best_drf_grid2_perf1.aucpr()

0.879360370754568

### **The best max depth is 25 and n-trees is 100 for model 12**
### **Validation AUCPR value of 0.8806 for model 12**
### **Test AUCPR value is 0.8794**
### **Confusion Matrix using best F1 probability threshold:**

In [13]:
best_best_drf_grid2_perf1.confusion_matrix()

Unnamed: 0,0,1,Error,Rate
0,21410.0,733.0,0.0331,(733.0/22143.0)
1,997.0,3714.0,0.2116,(997.0/4711.0)
Total,22407.0,4447.0,0.0644,(1730.0/26854.0)


## **HyperParameter Tuning Version/Option 2 (Early Stopping)**


Tuned model using early stopping criteria by specifying `ntrees`=1000 and setting the parameters required for early stopping. Provided Validation data-frame for early stopping to work correctly.


Alternative to find optimal setting for `ntrees` (number of trees in RF) parameter is to set it to high number, and use early stopping to find optimal number of trees.

parameters tuned:
```
drf_params  = {'max_depth': [10,15, 25]}
```
Stopping parameters to be added to the `train` method:
- stopping_rounds
- stopping_metric
- stopping_tolerance


In [16]:
# DRF hyperparameters
from h2o.grid.grid_search import H2OGridSearch

# DRF hyperparameters
drf_params3  = {'ntrees': [1000],
               'max_depth': [10,15,20,25],
               'min_rows' : [5, 10, 15]}

search_criteria = {'strategy': "RandomDiscrete", 'stopping_metric': "rmse", 'stopping_tolerance': 0.0001, 'stopping_rounds': 10, 
                  'max_runtime_secs': 500}


# Train and validate a cartesian grid of DRFs
drf_grid3= H2OGridSearch(model=H2ORandomForestEstimator,
                          grid_id='drf_grid3',
                          hyper_params=drf_params3,
                          search_criteria=search_criteria)

drf_grid3.train(x=predictors,
               y=response_col,
               nfolds=0,
               training_frame=train,
               validation_frame=valid,
               seed=123)

# Get the grid results, sorted by validation AUCPR
drf_grid3perf2 = drf_grid3.get_grid(sort_by='aucpr', decreasing=True)
drf_grid3perf2


drf Grid Build progress: |███████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,max_depth,min_rows,ntrees,model_ids,aucpr
,25.0,5.0,1000.0,drf_grid3_model_10,0.8745997
,25.0,10.0,1000.0,drf_grid3_model_9,0.8709754
,20.0,10.0,1000.0,drf_grid3_model_4,0.8707267
,15.0,5.0,1000.0,drf_grid3_model_7,0.8705623
,25.0,15.0,1000.0,drf_grid3_model_5,0.865688
,15.0,10.0,1000.0,drf_grid3_model_1,0.8653282
,15.0,15.0,1000.0,drf_grid3_model_6,0.8612371
,10.0,5.0,1000.0,drf_grid3_model_3,0.8343764
,10.0,10.0,1000.0,drf_grid3_model_2,0.8329777


In [17]:
# Grab the top GBM model, chosen by validation AUCPR
best_drf_grid3 = drf_grid3perf2.models[0]

# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance
best_drf_grid3_perf2 = best_drf_grid3.model_performance(test)

best_drf_grid3_perf2.aucpr()

0.8741029599827863

### **The best max depth is 25, min rows is 5, and n trees is 1000 for model 10**
### **Validation AUCPR value of 0.8746 for model 10**
### **Test AUCPR value is 0.8741**
### **Confusion Matrix using best F1 probability threshold:**

In [18]:
best_drf_grid3_perf2.confusion_matrix()

Unnamed: 0,0,1,Error,Rate
0,21365.0,778.0,0.0351,(778.0/22143.0)
1,1017.0,3694.0,0.2159,(1017.0/4711.0)
Total,22382.0,4472.0,0.0668,(1795.0/26854.0)


## **Used best model parameters from the GridSearch and Trained new model as following:**
- Trained model on a combined dataset: train+validation
- `nfolds=0`
- Evaluated trained model on test dataset

In [22]:
combined_data = train.merge(valid, all_x=True)
combined_data.tail()

BankState,Zip,Bank,NAICS,Term,NewExist,Defaulted,GrAppv,SBA_Appv,UrbanRural,BalanceGross,NoEmp,LowDoc,CreateJob,RevLineCr,FranchiseCode,DisbursementGross,City,RetainedJob,State
WY,83002,WELLS FARGO BANK NATL ASSOC,0,84,1,0,22000.0,17600.0,1,0,2,Y,0,0,1,22000.0,JACKSON,0,WY
WY,83013,BANK OF JACKSON HOLE,114210,84,1,0,68000.0,61200.0,0,0,6,Y,0,N,1,68000.0,MORAN,0,WY
WY,83014,BANK OF THE WEST,235310,240,1,0,80000.0,72000.0,0,0,13,N,0,N,1,80000.0,WILSON,0,WY
WY,83025,FRONTIER CERT. DEVEL CO,451110,240,1,0,1166000.0,1166000.0,2,0,20,N,15,N,1,1166000.0,TETON VILLAGE,15,WY
WY,83101,BANK OF THE WEST,0,180,1,0,40000.0,36000.0,0,0,5,N,0,N,1,40000.0,KEMMERER,0,WY
WY,83110,FRONTIER CERT. DEVEL CO,444130,240,1,0,330000.0,330000.0,0,0,12,N,1,N,1,330000.0,AFTON,11,WY
WY,83116,BANK OF THE WEST,0,180,1,0,206000.0,154500.0,0,0,8,N,0,0,1,206000.0,DIAMONDVILLE,0,WY
WY,83422,BANK OF JACKSON HOLE,0,240,1,0,350000.0,297500.0,0,0,1,N,0,N,1,350000.0,DRIGGS,0,ID
WY,83455,BANK OF JACKSON HOLE,0,60,1,0,86400.0,69120.0,0,0,6,Y,0,N,1,86400.0,TETON VILLAGE,0,WY
WY,95030,WELLS FARGO BANK NATL ASSOC,114210,180,1,0,300000.0,225000.0,2,0,14,N,2,0,1,300000.0,CODY,14,WY


In [23]:
# DRF hyperparameters
from h2o.grid.grid_search import H2OGridSearch

# DRF hyperparameters
drf_params4  = {'ntrees': [1000],
               'max_depth': [25],
               'min_rows' : [5]}


# Train and validate a cartesian grid of DRFs
drf_grid4= H2OGridSearch(model=H2ORandomForestEstimator,
                          grid_id='drf_grid4',
                          hyper_params=drf_params4)

drf_grid4.train(x=predictors,
               y=response_col,
               nfolds=0,
               training_frame=combined_data,
               seed=123)

# Get the grid results, sorted by validation AUCPR
drf_grid4perf2 = drf_grid4.get_grid(sort_by='aucpr', decreasing=True)
drf_grid4perf2

drf Grid Build progress: |███████████████████████████████████████████████████████| (done) 100%


Unnamed: 0,max_depth,min_rows,ntrees,model_ids,aucpr
,25.0,5.0,1000.0,drf_grid4_model_1,0.8720056


In [24]:
# Grab the top GBM model, chosen by validation AUCPR
best_drf_grid4 = drf_grid4perf2.models[0]

# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance
best_drf_grid4_perf2 = best_drf_grid4.model_performance(test)
print("Question 4 AUCPR on test data: ")
best_drf_grid4_perf2.aucpr()

0.8775779188832054

### **Compared AUCPR (test dataset) metrics from the Models above:
    4th AUCPR on test data: 0.8776

    3rd AUCPR on test data: 0.8741

    2nd AUCPR on test data: 0.8794
    
    The difference between 2,3, and 4's AUCPR metric is negligible


In [26]:
best_drf_grid4_perf2.confusion_matrix()

Unnamed: 0,0,1,Error,Rate
0,21244.0,899.0,0.0406,(899.0/22143.0)
1,908.0,3803.0,0.1927,(908.0/4711.0)
Total,22152.0,4702.0,0.0673,(1807.0/26854.0)


## Question 5

Review following documentation section first: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/cross-validation.html

- Train DRF model with the parameters found in the Question 3.
- Use full data to train the model.
- Use cross-validation to produce reliable metrics for the model performance
- Produce model metrics using cross-validation 

**Note**: understand why the model training takes significantly longer than before we used cross-validation

In [None]:
# DRF hyperparameters
from h2o.grid.grid_search import H2OGridSearch

# DRF hyperparameters
drf_params5  = {'ntrees': [1000],
               'max_depth': [25],
               'min_rows' : [5]}


# Train and validate a cartesian grid of DRFs
drf_grid5= H2OGridSearch(model=H2ORandomForestEstimator,
                          grid_id='drf_grid5',
                          hyper_params=drf_params5)

drf_grid5.train(x=predictors,
               y=response_col,
               nfolds=2,
               training_frame=df_h,
               seed=123)

# Get the grid results, sorted by validation AUCPR
drf_grid5perf2 = drf_grid5.get_grid(sort_by='aucpr', decreasing=True)
# Grab the top GBM model, chosen by validation AUCPR
best_drf_grid5 = drf_grid5perf2.models[0]

# Now let's evaluate the model performance on a test set
# so we get an honest estimate of top model performance
best_drf_grid5_perf2 = best_drf_grid5.model_performance(test)
print("AUCPR value: ")
best_drf_grid5_perf2.aucpr()

AUCPR value: 0.8789

### The mode took too long to run and used up too many resources for just 2 cross validation folds. If not, I would have tried to run 5 folds as well.