# Final small business loan approval  

This notebook contains the final loan-approval workflow for my final-year capstone project.

## problem definition:
In this case-study project we assume the role of a loan officer at a bank and decide whether to approve or deny a loan by assessing its risk of default with machine learning models. The most accurate model will be selected for deployment and integrated with an interface for interactive decision making.

## project objective:
The project tries to answer the following questions: As a representative of the bank, should I grant a loan to a particular small business (Company X)? Why or why not? The decision is made by assessing the loan's risk.

generally:
The assessment is accomplished by estimating the loan's default probability through analyzing the historical dataset and then classifying the loan into one of two categories: (a) higher risk—likely to default on the loan (i.e., be charged off/failure to pay in full) or (b) lower risk—likely to pay off the loan in full. There have been many success stories of start-ups receiving SBA loan guarantees such as FedEx and Apple Computer. However, there have also been stories of small businesses and/or start-ups that have defaulted on their SBA-guaranteed loans. 
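
As a minimal sketch of the two-category rule described above, the function below turns an estimated default probability into a risk label. The name `classify_risk` and the 0.5 cut-off are assumptions made for illustration; a bank would choose its own threshold.

```python
def classify_risk(default_probability, threshold=0.5):
    """Label a loan from its estimated probability of default.

    `threshold` is a hypothetical cut-off, not a value taken from the dataset.
    """
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("default_probability must lie between 0 and 1")
    if default_probability >= threshold:
        return "higher risk"  # likely to be charged off
    return "lower risk"       # likely to be paid in full


print(classify_risk(0.72))
print(classify_risk(0.10))
```

Lowering the threshold makes the rule more conservative: more loans are flagged as higher risk, at the cost of turning away some that would have been repaid.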

## Dataset:
The dataset used in this project comes from the U.S. Small Business Administration (SBA). It is available on Kaggle: [Should This Loan be Approved or Denied?](https://www.kaggle.com/datasets/mirbektoktogaraev/should-this-loan-be-approved-or-denied)

The data are real loan records from the SBA. The case-study assignment, titled “Should This Loan be Approved or Denied?”, is designed to teach statistical thinking by focusing on how to use real data to make informed decisions for a particular purpose. For this assignment, students assume the role of a loan officer who is deciding whether to approve a loan to a small business.

In [1]:
# packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

# we want plots to appear inside the notebook
%matplotlib inline

# Models from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
# from xgboost import XGBClassifier

# Model evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve

# imports for column transformations
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Pipeline and feature selection
from sklearn.pipeline import Pipeline

In [41]:
import random
import json
from loguru import logger

In [42]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [43]:
import pickle 
from joblib import dump, load

In [4]:
import sklearn
sklearn.show_versions()


System:
    python: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:42:31) [MSC v.1937 64 bit (AMD64)]
executable: C:\Users\HP\Desktop\final_year_capstone\env\python.exe
   machine: Windows-11-10.0.22631-SP0

Python dependencies:
      sklearn: 1.3.0
          pip: 23.3.1
   setuptools: 68.2.2
        numpy: 1.26.4
        scipy: 1.11.3
       Cython: None
       pandas: 2.1.4
   matplotlib: 3.8.0
       joblib: 1.2.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: C:\Users\HP\Desktop\final_year_capstone\env\Library\bin\mkl_rt.2.dll
         prefix: mkl_rt
       user_api: blas
   internal_api: mkl
        version: 2023.1-Product
    num_threads: 4
threading_layer: intel

       filepath: C:\Users\HP\Desktop\final_year_capstone\env\vcomp140.dll
         prefix: vcomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8

       filepath: C:\Users\HP\Desktop\final_year_capstone\env\Library\bin\libiomp5md.d

# Loan Data

In [5]:
# suppress all warnings
import warnings
warnings.filterwarnings("ignore") # change "ignore" to "default" to show warnings again

In [8]:
# Import Dataset
sbanational_df = pd.read_csv('./SBA Loan Approval notebook/data/SBAnational/SBAnational.csv', low_memory=False)
sbanational_df.head()

Unnamed: 0,LoanNr_ChkDgt,Name,City,State,Zip,Bank,BankState,NAICS,ApprovalDate,ApprovalFY,...,RevLineCr,LowDoc,ChgOffDate,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv
0,1000014003,ABC HOBBYCRAFT,EVANSVILLE,IN,47711,FIFTH THIRD BANK,OH,451120,28-Feb-97,1997,...,N,Y,,28-Feb-99,"$60,000.00",$0.00,P I F,$0.00,"$60,000.00","$48,000.00"
1,1000024006,LANDMARK BAR & GRILLE (THE),NEW PARIS,IN,46526,1ST SOURCE BANK,IN,722410,28-Feb-97,1997,...,N,Y,,31-May-97,"$40,000.00",$0.00,P I F,$0.00,"$40,000.00","$32,000.00"
2,1000034009,"WHITLOCK DDS, TODD M.",BLOOMINGTON,IN,47401,GRANT COUNTY STATE BANK,IN,621210,28-Feb-97,1997,...,N,N,,31-Dec-97,"$287,000.00",$0.00,P I F,$0.00,"$287,000.00","$215,250.00"
3,1000044001,"BIG BUCKS PAWN & JEWELRY, LLC",BROKEN ARROW,OK,74012,1ST NATL BK & TR CO OF BROKEN,OK,0,28-Feb-97,1997,...,N,Y,,30-Jun-97,"$35,000.00",$0.00,P I F,$0.00,"$35,000.00","$28,000.00"
4,1000054004,"ANASTASIA CONFECTIONS, INC.",ORLANDO,FL,32801,FLORIDA BUS. DEVEL CORP,FL,0,28-Feb-97,1997,...,N,N,,14-May-97,"$229,000.00",$0.00,P I F,$0.00,"$229,000.00","$229,000.00"


In [9]:
sbanational_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 899164 entries, 0 to 899163
Data columns (total 27 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   LoanNr_ChkDgt      899164 non-null  int64  
 1   Name               899150 non-null  object 
 2   City               899134 non-null  object 
 3   State              899150 non-null  object 
 4   Zip                899164 non-null  int64  
 5   Bank               897605 non-null  object 
 6   BankState          897598 non-null  object 
 7   NAICS              899164 non-null  int64  
 8   ApprovalDate       899164 non-null  object 
 9   ApprovalFY         899164 non-null  object 
 10  Term               899164 non-null  int64  
 11  NoEmp              899164 non-null  int64  
 12  NewExist           899028 non-null  float64
 13  CreateJob          899164 non-null  int64  
 14  RetainedJob        899164 non-null  int64  
 15  FranchiseCode      899164 non-null  int64  
 16  Ur

In [10]:
sbanational_df.MIS_Status.value_counts()

MIS_Status
P I F     739609
CHGOFF    157558
Name: count, dtype: int64
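
The two classes are far from balanced, which matters when reading accuracy figures later on. A small sketch, using only the counts printed above, of the share of each class and of the accuracy a model would reach by always predicting the majority class (`class_shares` is a helper name introduced here for illustration):

```python
# Counts copied from the value_counts() output above.
counts = {"P I F": 739609, "CHGOFF": 157558}


def class_shares(counts):
    """Return each class's share of the labelled rows."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


shares = class_shares(counts)
majority_label = max(shares, key=shares.get)

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Always predicting '{majority_label}' gives accuracy {shares[majority_label]:.1%}")
```

Any model accuracy should be judged against this majority-class baseline rather than against 50%.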

In [11]:
sbanational_df.shape

(899164, 27)

In [12]:
# check for duplicates
sbanational_df.duplicated().any()

False

# Clean data

In [10]:
sbanational_df.head().T

Unnamed: 0,0,1,2,3,4
LoanNr_ChkDgt,1000014003,1000024006,1000034009,1000044001,1000054004
Name,ABC HOBBYCRAFT,LANDMARK BAR & GRILLE (THE),"WHITLOCK DDS, TODD M.","BIG BUCKS PAWN & JEWELRY, LLC","ANASTASIA CONFECTIONS, INC."
City,EVANSVILLE,NEW PARIS,BLOOMINGTON,BROKEN ARROW,ORLANDO
State,IN,IN,IN,OK,FL
Zip,47711,46526,47401,74012,32801
Bank,FIFTH THIRD BANK,1ST SOURCE BANK,GRANT COUNTY STATE BANK,1ST NATL BK & TR CO OF BROKEN,FLORIDA BUS. DEVEL CORP
BankState,OH,IN,IN,OK,FL
NAICS,451120,722410,621210,0,0
ApprovalDate,28-Feb-97,28-Feb-97,28-Feb-97,28-Feb-97,28-Feb-97
ApprovalFY,1997,1997,1997,1997,1997


In [13]:
# check for missing values
sbanational_df.isna().sum()

LoanNr_ChkDgt             0
Name                     14
City                     30
State                    14
Zip                       0
Bank                   1559
BankState              1566
NAICS                     0
ApprovalDate              0
ApprovalFY                0
Term                      0
NoEmp                     0
NewExist                136
CreateJob                 0
RetainedJob               0
FranchiseCode             0
UrbanRural                0
RevLineCr              4528
LowDoc                 2582
ChgOffDate           736465
DisbursementDate       2368
DisbursementGross         0
BalanceGross              0
MIS_Status             1997
ChgOffPrinGr              0
GrAppv                    0
SBA_Appv                  0
dtype: int64

In [14]:
len(sbanational_df["ChgOffDate"])

899164

In [15]:
# check the fraction of missing values in ChgOffDate
sbanational_df["ChgOffDate"].isna().sum()/len(sbanational_df["ChgOffDate"]) 

0.8190552557709161

In [16]:
# check percentage of missing values on all columns

def check_percentage_of_missing_data(df):
    total_rows = len(df)
    missing_percentage = (df.isna().sum() / total_rows)*100

    missing_dict = pd.DataFrame({
        "Column Name:": missing_percentage.index,
        "Percentage of missing Data:": missing_percentage.values
    })

    print(missing_dict.to_string(index=False))

In [17]:
check_percentage_of_missing_data(sbanational_df)

     Column Name:  Percentage of missing Data:
    LoanNr_ChkDgt                     0.000000
             Name                     0.001557
             City                     0.003336
            State                     0.001557
              Zip                     0.000000
             Bank                     0.173383
        BankState                     0.174162
            NAICS                     0.000000
     ApprovalDate                     0.000000
       ApprovalFY                     0.000000
             Term                     0.000000
            NoEmp                     0.000000
         NewExist                     0.015125
        CreateJob                     0.000000
      RetainedJob                     0.000000
    FranchiseCode                     0.000000
       UrbanRural                     0.000000
        RevLineCr                     0.503579
           LowDoc                     0.287156
       ChgOffDate                    81

In [18]:
# drop ChgOffDate
sbanational_df.drop('ChgOffDate', axis = 1, inplace=True)

In [19]:
# check for NaN values
sbanational_df.isna().sum()

LoanNr_ChkDgt           0
Name                   14
City                   30
State                  14
Zip                     0
Bank                 1559
BankState            1566
NAICS                   0
ApprovalDate            0
ApprovalFY              0
Term                    0
NoEmp                   0
NewExist              136
CreateJob               0
RetainedJob             0
FranchiseCode           0
UrbanRural              0
RevLineCr            4528
LowDoc               2582
DisbursementDate     2368
DisbursementGross       0
BalanceGross            0
MIS_Status           1997
ChgOffPrinGr            0
GrAppv                  0
SBA_Appv                0
dtype: int64

In [20]:
# drop all rows with missing values
sbanational_df.dropna(inplace=True)

In [21]:
# check if we still have missing values
sbanational_df.isna().sum()

LoanNr_ChkDgt        0
Name                 0
City                 0
State                0
Zip                  0
Bank                 0
BankState            0
NAICS                0
ApprovalDate         0
ApprovalFY           0
Term                 0
NoEmp                0
NewExist             0
CreateJob            0
RetainedJob          0
FranchiseCode        0
UrbanRural           0
RevLineCr            0
LowDoc               0
DisbursementDate     0
DisbursementGross    0
BalanceGross         0
MIS_Status           0
ChgOffPrinGr         0
GrAppv               0
SBA_Appv             0
dtype: int64

In [22]:
sbanational_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 886240 entries, 0 to 899163
Data columns (total 26 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   LoanNr_ChkDgt      886240 non-null  int64  
 1   Name               886240 non-null  object 
 2   City               886240 non-null  object 
 3   State              886240 non-null  object 
 4   Zip                886240 non-null  int64  
 5   Bank               886240 non-null  object 
 6   BankState          886240 non-null  object 
 7   NAICS              886240 non-null  int64  
 8   ApprovalDate       886240 non-null  object 
 9   ApprovalFY         886240 non-null  object 
 10  Term               886240 non-null  int64  
 11  NoEmp              886240 non-null  int64  
 12  NewExist           886240 non-null  float64
 13  CreateJob          886240 non-null  int64  
 14  RetainedJob        886240 non-null  int64  
 15  FranchiseCode      886240 non-null  int64  
 16  UrbanRu

In [23]:
date_columns = sbanational_df.select_dtypes(include=['datetime64'])
print(date_columns.columns.tolist())

[]


In [24]:
# convert date cols to date format
date_columns=['ApprovalDate','DisbursementDate']
sbanational_df[date_columns] = sbanational_df[date_columns].apply(pd.to_datetime)

# print columns
date_columns = sbanational_df.select_dtypes(include=['datetime64'])
print(date_columns.columns.tolist())

['ApprovalDate', 'DisbursementDate']
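
The raw dates look like `28-Feb-97`, i.e. day, abbreviated month name, two-digit year. As a sketch of what that format means, the standard-library parser below uses the explicit pattern `%d-%b-%y`; the helper name `parse_sba_date` is introduced here for illustration, and the same pattern can be passed to `pd.to_datetime` through its `format` argument.

```python
from datetime import datetime


def parse_sba_date(text):
    """Parse a date written as day-month abbreviation-two-digit year."""
    return datetime.strptime(text, "%d-%b-%y")


print(parse_sba_date("28-Feb-97"))

# Two-digit years are ambiguous: Python maps 69-99 to 1969-1999
# and 00-68 to 2000-2068, so very old loans need a manual check.
print(parse_sba_date("01-Jan-68").year)
```

If the dataset contains approvals from before 1969, the parsed year would land in the future and would have to be shifted back by 100 years.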


In [25]:
sbanational_df.head().T

Unnamed: 0,0,1,2,3,4
LoanNr_ChkDgt,1000014003,1000024006,1000034009,1000044001,1000054004
Name,ABC HOBBYCRAFT,LANDMARK BAR & GRILLE (THE),"WHITLOCK DDS, TODD M.","BIG BUCKS PAWN & JEWELRY, LLC","ANASTASIA CONFECTIONS, INC."
City,EVANSVILLE,NEW PARIS,BLOOMINGTON,BROKEN ARROW,ORLANDO
State,IN,IN,IN,OK,FL
Zip,47711,46526,47401,74012,32801
Bank,FIFTH THIRD BANK,1ST SOURCE BANK,GRANT COUNTY STATE BANK,1ST NATL BK & TR CO OF BROKEN,FLORIDA BUS. DEVEL CORP
BankState,OH,IN,IN,OK,FL
NAICS,451120,722410,621210,0,0
ApprovalDate,1997-02-28 00:00:00,1997-02-28 00:00:00,1997-02-28 00:00:00,1997-02-28 00:00:00,1997-02-28 00:00:00
ApprovalFY,1997,1997,1997,1997,1997


In [26]:
# clean all currency columns
currency_columns = ['DisbursementGross', 'BalanceGross', 'ChgOffPrinGr', 'GrAppv', 'SBA_Appv']
# raw string so that "\$" is read as a regex escape, not a Python string escape
sbanational_df[currency_columns] = sbanational_df[currency_columns].replace(r'[\$,]', '', regex=True).astype(float)
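
To make the cleaning step concrete, here is the same regular expression applied to single strings with the standard `re` module. `parse_currency` is a helper name introduced for this sketch only.

```python
import re


def parse_currency(text):
    """Strip dollar signs and thousands separators, then convert to float."""
    return float(re.sub(r"[\$,]", "", text))


print(parse_currency("$60,000.00"))
print(parse_currency("$0.00"))
```

The character class `[\$,]` matches either a dollar sign or a comma, so only the digits and the decimal point are left for `float` to read.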

In [27]:
sbanational_df.head().T

Unnamed: 0,0,1,2,3,4
LoanNr_ChkDgt,1000014003,1000024006,1000034009,1000044001,1000054004
Name,ABC HOBBYCRAFT,LANDMARK BAR & GRILLE (THE),"WHITLOCK DDS, TODD M.","BIG BUCKS PAWN & JEWELRY, LLC","ANASTASIA CONFECTIONS, INC."
City,EVANSVILLE,NEW PARIS,BLOOMINGTON,BROKEN ARROW,ORLANDO
State,IN,IN,IN,OK,FL
Zip,47711,46526,47401,74012,32801
Bank,FIFTH THIRD BANK,1ST SOURCE BANK,GRANT COUNTY STATE BANK,1ST NATL BK & TR CO OF BROKEN,FLORIDA BUS. DEVEL CORP
BankState,OH,IN,IN,OK,FL
NAICS,451120,722410,621210,0,0
ApprovalDate,1997-02-28 00:00:00,1997-02-28 00:00:00,1997-02-28 00:00:00,1997-02-28 00:00:00,1997-02-28 00:00:00
ApprovalFY,1997,1997,1997,1997,1997


# Feature Engineering

For this project we need an Industry column.
The NAICS column holds the industry code, and its first two digits identify a company's industry sector.

In [28]:
sbanational_df['Industry'] = sbanational_df['NAICS'].astype('str').apply(lambda x: x[:2])
sbanational_df['Industry'] = sbanational_df['Industry'].map({
    '11': 'Agriculture, forestry, fishing, hunting',
    '21': ' Mining, quarrying, oil and gas extraction',
    '22': 'Utilities',
    '23': 'Construction',
    '31': 'Manufacturing',
    '32': 'Manufacturing',
    '33': 'Manufacturing',
    '42': 'Wholesale_trade',
    '44': 'Retail_trade',
    '45': 'Retail_trade',
    '48': 'Transportation, warehousing',
    '49': 'Transportation, warehousing',
    '51': 'Information',
    '52': 'Finance, Insurance',
    '53': 'Real estate, rental, leasing',
    '54': 'Professional, scientific, technical services',
    '55': 'Management of companies, enterprises',
    '56': 'Administrative support, waste management',
    '61': 'Educational',
    '62': 'Healthcare, Social_assist',
    '71': 'Arts, Entertain, recreation',
    '72': 'Accomodation, Food services',
    '81': 'Other services',
    '92': 'Public adminstration',
    '0': 'Other'
})
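
The lookup above can be sketched on individual codes in plain Python. `naics_sector` and the three-entry `sector_names` dictionary are illustrative stand-ins for the full mapping.

```python
# A small excerpt of the sector dictionary, for illustration only.
sector_names = {
    "45": "Retail_trade",
    "72": "Accomodation, Food services",
    "0": "Other",
}


def naics_sector(naics_code, names):
    """Return the sector for a NAICS code, or None if the prefix is unmapped."""
    prefix = str(naics_code)[:2]
    return names.get(prefix)


print(naics_sector(451120, sector_names))
print(naics_sector(0, sector_names))
print(naics_sector(999999, sector_names))
```

A missing NAICS code is stored as `0`, whose string form has a single character, which is why the key `'0'` is needed. A prefix that is absent from the dictionary gives `None` here and `NaN` when `Series.map` is used, so it is worth checking the new column for missing values after mapping.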

In [29]:
sbanational_df.head().T

Unnamed: 0,0,1,2,3,4
LoanNr_ChkDgt,1000014003,1000024006,1000034009,1000044001,1000054004
Name,ABC HOBBYCRAFT,LANDMARK BAR & GRILLE (THE),"WHITLOCK DDS, TODD M.","BIG BUCKS PAWN & JEWELRY, LLC","ANASTASIA CONFECTIONS, INC."
City,EVANSVILLE,NEW PARIS,BLOOMINGTON,BROKEN ARROW,ORLANDO
State,IN,IN,IN,OK,FL
Zip,47711,46526,47401,74012,32801
Bank,FIFTH THIRD BANK,1ST SOURCE BANK,GRANT COUNTY STATE BANK,1ST NATL BK & TR CO OF BROKEN,FLORIDA BUS. DEVEL CORP
BankState,OH,IN,IN,OK,FL
NAICS,451120,722410,621210,0,0
ApprovalDate,1997-02-28 00:00:00,1997-02-28 00:00:00,1997-02-28 00:00:00,1997-02-28 00:00:00,1997-02-28 00:00:00
ApprovalFY,1997,1997,1997,1997,1997


# EDA

# Select Suitable Features

In [30]:
relevant_columns=['GrAppv', 'Term', 'NoEmp','NewExist', 'UrbanRural', 'Industry','MIS_Status']
df=sbanational_df.loc[:,relevant_columns]
df

Unnamed: 0,GrAppv,Term,NoEmp,NewExist,UrbanRural,Industry,MIS_Status
0,60000.0,84,4,2.0,0,Retail_trade,P I F
1,40000.0,60,2,2.0,0,"Accomodation, Food services",P I F
2,287000.0,180,7,1.0,0,"Healthcare, Social_assist",P I F
3,35000.0,60,2,1.0,0,Other,P I F
4,229000.0,240,14,1.0,0,Other,P I F
...,...,...,...,...,...,...,...
899159,70000.0,60,6,1.0,0,Retail_trade,P I F
899160,85000.0,60,6,1.0,0,Retail_trade,P I F
899161,300000.0,108,26,1.0,0,Manufacturing,P I F
899162,75000.0,60,6,1.0,0,Other,CHGOFF


# Transform Data

preprocessing stage

In [31]:
# Extract all non-numeric columns
non_numeric_columns = df.select_dtypes(include=['object'])
non_numeric_columns.head()

Unnamed: 0,Industry,MIS_Status
0,Retail_trade,P I F
1,"Accomodation, Food services",P I F
2,"Healthcare, Social_assist",P I F
3,Other,P I F
4,Other,P I F


In [32]:
column_names = non_numeric_columns.columns.tolist()
print(column_names)

['Industry', 'MIS_Status']


In [33]:
features = non_numeric_columns.columns.to_numpy()
features

array(['Industry', 'MIS_Status'], dtype=object)

In [34]:
# Encoding DataFrame
encoded_df=pd.DataFrame()
for column in df.columns:
    if df[column].dtype == 'object':
        encoded_df[column] = df[column].astype('category').cat.codes
    else:
        encoded_df[column] = df[column]

In [35]:
encoded_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 886240 entries, 0 to 899163
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   GrAppv      886240 non-null  float64
 1   Term        886240 non-null  int64  
 2   NoEmp       886240 non-null  int64  
 3   NewExist    886240 non-null  float64
 4   UrbanRural  886240 non-null  int64  
 5   Industry    886240 non-null  int8   
 6   MIS_Status  886240 non-null  int8   
dtypes: float64(2), int64(3), int8(2)
memory usage: 42.3 MB


# Training 

In [36]:
# Split the data
np.random.seed(42)

X = encoded_df.drop("MIS_Status", axis=1)
y = encoded_df["MIS_Status"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Modeling

### Expected Output
1. Whether loan is guaranteed or not
2. Major factors that led to loan results
3. Most Probable Banks

### evaluate preds function

In [37]:
# function to evaluate preds 
def evaluate_preds(y_true, y_preds):
    """
        Compares y_true labels with y_preds labels for a binary
        classifier and returns accuracy, precision, recall and F1.
    """
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    metric_dict = {"accuracy": round(accuracy, 2), 
                   "precision": round(precision, 2),
                   "recall": round(recall, 2),
                   "f1": round(f1, 2)}
    print(f"Acc: {accuracy*100:.2f}%")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 score: {f1:.2f}")

    return metric_dict
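
For reference, the four scores can be derived by hand from the confusion counts. The sketch below is a plain-Python illustration (the name `binary_metrics` and the toy labels are assumptions), not a replacement for the scikit-learn functions used above.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 1, 1]
print(binary_metrics(toy_true, toy_pred))
```

scikit-learn treats label `1` as the positive class by default. Because the category codes are assigned in sorted order, `CHGOFF` becomes 0 and `P I F` becomes 1, so the precision and recall printed by `evaluate_preds` describe the paid-in-full class, not the defaults.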

### Random Forest

In [32]:
np.random.seed(42)

# initialize model
rf_classifier = RandomForestClassifier()

# fit the model
rf_classifier.fit(X_train, y_train)

In [33]:
# check model score
rf_classifier.score(X_test, y_test)

0.9214546849611843

metrics

In [34]:
# Make predictions with the baseline random forest
# (default hyperparameters, no tuning yet)
rs_y_preds = rf_classifier.predict(X_test)

# evaluate the predictions
rfc_1_metrics = evaluate_preds(y_test, rs_y_preds)

Acc: 92.15%
Precision: 0.94
Recall: 0.96
F1 score: 0.95


# Incorporate external data


This is a crucial part of the project.

Incorporating external information is intended to help the model keep up with outside factors that may affect the overall loan decision, and so reduce its prediction errors. This segment adds `Industry Trends` as an external factor that drives the model's overall decision.

Industry trends values:
1. Positive 
2. Negative
3. Neutral

### Engineer industry trends column

In [38]:
# column to be created
column_to_move = 'Industry Trends'
last_index = len(sbanational_df.columns) - 1

sbanational_df.insert(last_index, column_to_move, "")
sbanational_df.head()

Unnamed: 0,LoanNr_ChkDgt,Name,City,State,Zip,Bank,BankState,NAICS,ApprovalDate,ApprovalFY,...,LowDoc,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv,Industry Trends,Industry
0,1000014003,ABC HOBBYCRAFT,EVANSVILLE,IN,47711,FIFTH THIRD BANK,OH,451120,1997-02-28,1997,...,Y,1999-02-28,60000.0,0.0,P I F,0.0,60000.0,48000.0,,Retail_trade
1,1000024006,LANDMARK BAR & GRILLE (THE),NEW PARIS,IN,46526,1ST SOURCE BANK,IN,722410,1997-02-28,1997,...,Y,1997-05-31,40000.0,0.0,P I F,0.0,40000.0,32000.0,,"Accomodation, Food services"
2,1000034009,"WHITLOCK DDS, TODD M.",BLOOMINGTON,IN,47401,GRANT COUNTY STATE BANK,IN,621210,1997-02-28,1997,...,N,1997-12-31,287000.0,0.0,P I F,0.0,287000.0,215250.0,,"Healthcare, Social_assist"
3,1000044001,"BIG BUCKS PAWN & JEWELRY, LLC",BROKEN ARROW,OK,74012,1ST NATL BK & TR CO OF BROKEN,OK,0,1997-02-28,1997,...,Y,1997-06-30,35000.0,0.0,P I F,0.0,35000.0,28000.0,,Other
4,1000054004,"ANASTASIA CONFECTIONS, INC.",ORLANDO,FL,32801,FLORIDA BUS. DEVEL CORP,FL,0,1997-02-28,1997,...,N,1997-05-14,229000.0,0.0,P I F,0.0,229000.0,229000.0,,Other


### random sentiments 

In [39]:
# industries

# Dictionary mapping for industries
industry_mapping = {
    '11': 'Agriculture, forestry, fishing, hunting',
    '21': ' Mining, quarrying, oil and gas extraction',
    '22': 'Utilities',
    '23': 'Construction',
    '31': 'Manufacturing',
    '32': 'Manufacturing',
    '33': 'Manufacturing',
    '42': 'Wholesale_trade',
    '44': 'Retail_trade',
    '45': 'Retail_trade',
    '48': 'Transportation, warehousing',
    '49': 'Transportation, warehousing',
    '51': 'Information',
    '52': 'Finance, Insurance',
    '53': 'Real estate, rental, leasing',
    '54': 'Professional, scientific, technical services',
    '55': 'Management of companies, enterprises',
    '56': 'Administrative support, waste management',
    '61': 'Educational',
    '62': 'Healthcare, Social_assist',
    '71': 'Arts, Entertain, recreation',
    '72': 'Accomodation, Food services',
    '81': 'Other services',
    '92': 'Public adminstration',
}

# Extract the values from the dictionary
industry_array = list(industry_mapping.values())

print("Industries array:")
print(industry_array)

Industries array:
['Agriculture, forestry, fishing, hunting', ' Mining, quarrying, oil and gas extraction', 'Utilities', 'Construction', 'Manufacturing', 'Manufacturing', 'Manufacturing', 'Wholesale_trade', 'Retail_trade', 'Retail_trade', 'Transportation, warehousing', 'Transportation, warehousing', 'Information', 'Finance, Insurance', 'Real estate, rental, leasing', 'Professional, scientific, technical services', 'Management of companies, enterprises', 'Administrative support, waste management', 'Educational', 'Healthcare, Social_assist', 'Arts, Entertain, recreation', 'Accomodation, Food services', 'Other services', 'Public adminstration']


In [44]:
# create random sentiments for training purposes

industry_sentiments_random = {}

for sector in industry_array:
    sentiment = random.choice(['positive', 'neutral', 'negative'])
    industry_sentiments_random[sector] = sentiment

logger.info(f"{json.dumps(industry_sentiments_random, indent=4)}")

2024-04-29 23:07:19.946 | INFO     | __main__:<module>:9 - {
    "Agriculture, forestry, fishing, hunting": "neutral",
    " Mining, quarrying, oil and gas extraction": "positive",
    "Utilities": "negative",
    "Construction": "positive",
    "Manufacturing": "negative",
    "Wholesale_trade": "negative",
    "Retail_trade": "positive",
    "Transportation, warehousing": "neutral",
    "Information": "neutral",
    "Finance, Insurance": "negative",
    "Real estate, rental, leasing": "negative",
    "Professional, scientific, technical services": "positive",
    "Management of companies, enterprises": "negative",
    "Administrative support, waste management": "negative",
    "Educational": "positive",
    "Healthcare, Social_assist": "positive",
    "Arts, Entertain, recreation": "neutral",
    "Accomodation, Food services": "positive",
    "Other services": "negative",
    "Public adminstration": "neutral"
}
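
The sentiments above change on every run, because `np.random.seed` does not seed Python's built-in `random` module. A possible way to make the assignment repeatable is sketched below; the function name `assign_random_sentiments` and the seed value are assumptions for illustration.

```python
import random


def assign_random_sentiments(sectors, seed=42):
    """Give each sector one random sentiment, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    choices = ["positive", "neutral", "negative"]
    sentiments = {}
    for sector in sectors:
        # A sector listed twice (e.g. Manufacturing) keeps only its last draw,
        # mirroring the dictionary assignment in the cell above.
        sentiments[sector] = rng.choice(choices)
    return sentiments


example_sectors = ["Manufacturing", "Manufacturing", "Utilities"]
first = assign_random_sentiments(example_sectors)
second = assign_random_sentiments(example_sectors)
print(first == second)
```

Using a dedicated `random.Random` instance keeps the draw stable even if other cells also call `random`.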


### populate Industry trends

using random trends

In [45]:
new_df = sbanational_df
for industry, sentiment in industry_sentiments_random.items():
    # Populate the overall sentiment in the dataset for the current industry
    new_df.loc[new_df['Industry'] == industry, 'Industry Trends'] = sentiment

new_df.head()

Unnamed: 0,LoanNr_ChkDgt,Name,City,State,Zip,Bank,BankState,NAICS,ApprovalDate,ApprovalFY,...,LowDoc,DisbursementDate,DisbursementGross,BalanceGross,MIS_Status,ChgOffPrinGr,GrAppv,SBA_Appv,Industry Trends,Industry
0,1000014003,ABC HOBBYCRAFT,EVANSVILLE,IN,47711,FIFTH THIRD BANK,OH,451120,1997-02-28,1997,...,Y,1999-02-28,60000.0,0.0,P I F,0.0,60000.0,48000.0,positive,Retail_trade
1,1000024006,LANDMARK BAR & GRILLE (THE),NEW PARIS,IN,46526,1ST SOURCE BANK,IN,722410,1997-02-28,1997,...,Y,1997-05-31,40000.0,0.0,P I F,0.0,40000.0,32000.0,positive,"Accomodation, Food services"
2,1000034009,"WHITLOCK DDS, TODD M.",BLOOMINGTON,IN,47401,GRANT COUNTY STATE BANK,IN,621210,1997-02-28,1997,...,N,1997-12-31,287000.0,0.0,P I F,0.0,287000.0,215250.0,positive,"Healthcare, Social_assist"
3,1000044001,"BIG BUCKS PAWN & JEWELRY, LLC",BROKEN ARROW,OK,74012,1ST NATL BK & TR CO OF BROKEN,OK,0,1997-02-28,1997,...,Y,1997-06-30,35000.0,0.0,P I F,0.0,35000.0,28000.0,,Other
4,1000054004,"ANASTASIA CONFECTIONS, INC.",ORLANDO,FL,32801,FLORIDA BUS. DEVEL CORP,FL,0,1997-02-28,1997,...,N,1997-05-14,229000.0,0.0,P I F,0.0,229000.0,229000.0,,Other


### remove other industry

In [46]:
# check percentage
rows_to_remove = new_df[new_df['Industry'] == 'Other'] 
total_rows = len(new_df)

percentage_to_remove = (len(rows_to_remove) / total_rows) * 100
print(f"Percentage of removed rows: {percentage_to_remove:.2f}%")
print(f"rows : {len(rows_to_remove)} out of {total_rows}")
print(f"remaining rows: {total_rows - len(rows_to_remove)}")

Percentage of removed rows: 22.37%
rows : 198267 out of 886240
remaining rows: 687973


In [47]:
# remove other data
# keep an independent copy so later assignments do not write into a view of new_df
df_cleaned = new_df[new_df['Industry'] != 'Other'].copy()

# Perform Decision Making on all the sectors 

- Use the generated dummy sentiments.
- Develop a rule that makes a decision based on Industry Trends.
- Train and score the model.

Sentiment choices:
- positive
- neutral
- negative

The random sentiments in the `industry_sentiments_random` dictionary are used for both training and testing.

In [48]:
df_cleaned.head().T

Unnamed: 0,0,1,2,5,7
LoanNr_ChkDgt,1000014003,1000024006,1000034009,1000084002,1000094005
Name,ABC HOBBYCRAFT,LANDMARK BAR & GRILLE (THE),"WHITLOCK DDS, TODD M.","B&T SCREW MACHINE COMPANY, INC",WEAVER PRODUCTS
City,EVANSVILLE,NEW PARIS,BLOOMINGTON,PLAINVILLE,SUMMERFIELD
State,IN,IN,IN,CT,FL
Zip,47711,46526,47401,6062,34491
Bank,FIFTH THIRD BANK,1ST SOURCE BANK,GRANT COUNTY STATE BANK,"TD BANK, NATIONAL ASSOCIATION",REGIONS BANK
BankState,OH,IN,IN,DE,AL
NAICS,451120,722410,621210,332721,811118
ApprovalDate,1997-02-28 00:00:00,1997-02-28 00:00:00,1997-02-28 00:00:00,1997-02-28 00:00:00,1997-02-28 00:00:00
ApprovalFY,1997,1997,1997,1997,1997


In [49]:
# value counts
df_cleaned["MIS_Status"].value_counts()

MIS_Status
P I F     548445
CHGOFF    139528
Name: count, dtype: int64

In [50]:
# make decision incorporating industry trends
decision_rules = {
    "positive": "P I F",
    "neutral":  "P I F",
    "negative": "CHGOFF"
}

# Re-evaluate every loan that was paid in full ("P I F") against its
# Industry Trends value; loans that were charged off are left unchanged.
for index, row in df_cleaned.iterrows():
    if row['MIS_Status'] == "P I F":
        industry_sentiment = row['Industry Trends']
        decision = decision_rules.get(industry_sentiment)
        df_cleaned.at[index, 'MIS_Status'] = decision
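
The relabelling rule can be checked on single values before running it over the whole table. The sketch below restates it as a small function; `relabel` is a name introduced here for illustration.

```python
decision_rules = {
    "positive": "P I F",
    "neutral": "P I F",
    "negative": "CHGOFF",
}


def relabel(status, sentiment, rules):
    """Apply the industry-trend rule to one loan's status."""
    if status != "P I F":
        return status            # charged-off loans are never upgraded
    return rules.get(sentiment)  # None if the sentiment is not in the rules


print(relabel("P I F", "negative", decision_rules))
print(relabel("P I F", "neutral", decision_rules))
print(relabel("CHGOFF", "positive", decision_rules))
```

Note the asymmetry: a paid-in-full loan in a negative-trend industry becomes `CHGOFF`, while a charged-off loan keeps its label whatever the trend.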

# Road to training

### feature selection

`NB: the 'Industry Trends' column is now included in the relevant columns.`

In [51]:

relevant_columns=['GrAppv', 'Term', 'NoEmp','NewExist', 'UrbanRural', 'Industry', 'Industry Trends','MIS_Status']
df_ext=df_cleaned.loc[:,relevant_columns]
df_ext

Unnamed: 0,GrAppv,Term,NoEmp,NewExist,UrbanRural,Industry,Industry Trends,MIS_Status
0,60000.0,84,4,2.0,0,Retail_trade,positive,P I F
1,40000.0,60,2,2.0,0,"Accomodation, Food services",positive,P I F
2,287000.0,180,7,1.0,0,"Healthcare, Social_assist",positive,P I F
5,517000.0,120,19,1.0,0,Manufacturing,negative,CHGOFF
7,45000.0,84,1,2.0,0,Other services,negative,CHGOFF
...,...,...,...,...,...,...,...,...
899156,50000.0,60,20,1.0,0,Manufacturing,negative,CHGOFF
899157,200000.0,36,40,1.0,0,Manufacturing,negative,CHGOFF
899159,70000.0,60,6,1.0,0,Retail_trade,positive,P I F
899160,85000.0,60,6,1.0,0,Retail_trade,positive,P I F


### Transform Data

In [52]:
# Extract all non-numeric columns
non_numeric_columns = df_ext.select_dtypes(include=['object'])
non_numeric_columns.head()

Unnamed: 0,Industry,Industry Trends,MIS_Status
0,Retail_trade,positive,P I F
1,"Accomodation, Food services",positive,P I F
2,"Healthcare, Social_assist",positive,P I F
5,Manufacturing,negative,CHGOFF
7,Other services,negative,CHGOFF


In [53]:
column_names = non_numeric_columns.columns.tolist()
print(column_names)

['Industry', 'Industry Trends', 'MIS_Status']


In [54]:
features = non_numeric_columns.columns.to_numpy()
features

array(['Industry', 'Industry Trends', 'MIS_Status'], dtype=object)

In [55]:
# Encoding DataFrame
encoded_df=pd.DataFrame()
for column in df_ext.columns:
    if df_ext[column].dtype == 'object':
        encoded_df[column] = df_ext[column].astype('category').cat.codes
    else:
        encoded_df[column] = df_ext[column]

In [56]:
encoded_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 687973 entries, 0 to 899161
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   GrAppv           687973 non-null  float64
 1   Term             687973 non-null  int64  
 2   NoEmp            687973 non-null  int64  
 3   NewExist         687973 non-null  float64
 4   UrbanRural       687973 non-null  int64  
 5   Industry         687973 non-null  int8   
 6   Industry Trends  687973 non-null  int8   
 7   MIS_Status       687973 non-null  int8   
dtypes: float64(2), int64(3), int8(3)
memory usage: 49.6 MB


In [57]:
encoded_df.head()

Unnamed: 0,GrAppv,Term,NoEmp,NewExist,UrbanRural,Industry,Industry Trends,MIS_Status
0,60000.0,84,4,2.0,0,16,2,1
1,40000.0,60,2,2.0,0,1,2,1
2,287000.0,180,7,1.0,0,8,2,1
5,517000.0,120,19,1.0,0,11,0,0
7,45000.0,84,1,2.0,0,12,0,0


##### Save 

In [59]:
category_mapping = {column: df_ext[column].astype('category').cat.categories for column in df_ext.columns if df_ext[column].dtype == 'object'}

In [60]:
# Save the category mappings to a file
with open('category_mapping.pkl', 'wb') as f:
    pickle.dump(category_mapping, f)

##### load

In [None]:
# Load the category mapping
with open('category_mapping.pkl', 'rb') as f:
    category_mapping = pickle.load(f)

##### use

In [61]:
category_mapping

{'Industry': Index([' Mining, quarrying, oil and gas extraction',
        'Accomodation, Food services',
        'Administrative support, waste management',
        'Agriculture, forestry, fishing, hunting',
        'Arts, Entertain, recreation', 'Construction', 'Educational',
        'Finance, Insurance', 'Healthcare, Social_assist', 'Information',
        'Management of companies, enterprises', 'Manufacturing',
        'Other services', 'Professional, scientific, technical services',
        'Public adminstration', 'Real estate, rental, leasing', 'Retail_trade',
        'Transportation, warehousing', 'Utilities', 'Wholesale_trade'],
       dtype='object'),
 'Industry Trends': Index(['negative', 'neutral', 'positive'], dtype='object'),
 'MIS_Status': Index(['CHGOFF', 'P I F'], dtype='object')}
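When the model is used on a new application, the string values have to be turned into the same integer codes that were used in training. A minimal sketch of that lookup, assuming the mapping has the structure shown above (one pandas `Index` of categories per column, where the position is the code). The helpers `encode_value` and `decode_value` are hypothetical, and the mapping below is a shortened stand-in rather than the saved file.

```python
import pandas as pd

# Shortened stand-in for the saved mapping (assumption: same structure as category_mapping)
category_mapping = {
    "Industry Trends": pd.Index(["negative", "neutral", "positive"]),
    "MIS_Status": pd.Index(["CHGOFF", "P I F"]),
}


def encode_value(column, value, mapping):
    """Return the integer code used in training for `value`, or raise if it is unseen."""
    code = mapping[column].get_indexer([value])[0]  # -1 means "not found"
    if code == -1:
        raise ValueError(f"{value!r} is not a known category for {column!r}")
    return int(code)


def decode_value(column, code, mapping):
    """Return the original label for an integer code."""
    return mapping[column][code]


print(encode_value("Industry Trends", "positive", category_mapping))  # 2
print(decode_value("MIS_Status", 0, category_mapping))  # CHGOFF
```

Raising on an unknown category is a design choice here: silently encoding it as -1 would feed the model a value it never saw during training.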

### Training

In [49]:
# Split the data
np.random.seed(42)

X = encoded_df.drop("MIS_Status", axis=1)
y = encoded_df["MIS_Status"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
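The split above relies on NumPy's global seed and does not guarantee that the train and test sets keep the same share of charged-off loans. A possible alternative, sketched on a toy frame (the data and the `_toy` names are made up for illustration), passes `random_state` and `stratify` directly to `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for encoded_df: 80 paid-in-full (1) and 20 charged-off (0) loans
toy = pd.DataFrame({"Term": range(100), "MIS_Status": [1] * 80 + [0] * 20})
X_toy = toy.drop("MIS_Status", axis=1)
y_toy = toy["MIS_Status"]

# random_state makes the split reproducible on its own;
# stratify keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of class 1 in each split
print(y_tr.mean(), y_te.mean())
```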

### Modeling

In [50]:
# initialize model
np.random.seed(42)

rf_classifier2 = RandomForestClassifier()

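# fit the model
rf_classifier2.fit(X_train, y_train)

# Note: a bare %time after the call times an empty statement, so the output
# below is not the fit time. To time the fit, write
# `%time rf_classifier2.fit(X_train, y_train)` instead.
%time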

CPU times: total: 0 ns
Wall time: 1.59 ms


In [51]:
# check model score
rf_classifier2.score(X_test, y_test)

0.9370471310730768

### Metrics

In [52]:
# Make predictions with the baseline random forest (default hyperparameters)
rf_y_preds = rf_classifier2.predict(X_test)

# evaluate the predictions (evaluate_preds is defined in a later cell, In [87])
rfc_2_metrics = evaluate_preds(y_test, rf_y_preds)

Acc: 93.70%
Precision: 0.94
Recall: 0.96
F1 score: 0.95


# Hyperparameter tuning with RandomizedSearchCV

In [62]:
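# Hyperparameter search space, passed to RandomizedSearchCV as a dict
# Note: "auto" for max_features was deprecated in scikit-learn 1.1 and removed in 1.3;
# for classifiers it was equivalent to "sqrt"
grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]
       }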

np.random.seed(42)

# Split into X & y
X = encoded_df.drop("MIS_Status", axis=1)
y = encoded_df["MIS_Status"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate model
clf = RandomForestClassifier(n_jobs=1)

# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=clf, 
                            param_distributions=grid,
                            n_iter=10, # number of models to try
                            cv=5, # cross validation
                            verbose=2)

# Fit the RandomizedSearchCV version of clf and time the fit
%time rs_clf.fit(X_train, y_train);

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=   0.0s
[CV] END max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=10; total time=   0.0s
[CV] END max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estimators=10; total time=   0.0s
[CV] END max_depth=10, max_features=auto, min_samples_leaf=1, min_samples_split=2, n_estima

In [53]:
# check best params
rs_clf.best_params_

{'n_estimators': 100,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 10}
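The log line "Fitting 5 folds for each of 10 candidates, totalling 50 fits" follows from `n_iter=10` and `cv=5`. For context, the full grid is far larger than the 10 sampled candidates. A quick check using only the standard library (the variable names are illustrative):

```python
from math import prod

grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}

# Size of the full grid: product of the number of options per hyperparameter
n_combinations = prod(len(values) for values in grid.values())
n_iter, cv = 10, 5

print(n_combinations)       # 540
print(n_iter * cv)          # 50 fits for the randomized search
print(n_combinations * cv)  # 2700 fits for an exhaustive search over the same grid
```

The randomized search therefore evaluates under 2% of the possible combinations, so the reported best parameters are the best of the sampled candidates, not necessarily of the whole grid.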

In [87]:
# function to evaluate predictions
def evaluate_preds(y_true, y_preds):
    """
    Performs an evaluation comparison of y_true labels vs. y_preds labels
    for a binary classification model.
    """
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    metric_dict = {"accuracy": round(accuracy, 2), 
                   "precision": round(precision, 2),
                   "recall": round(recall, 2),
                   "f1": round(f1, 2)}
    print(f"Acc: {accuracy*100:.2f}%")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 score: {f1:.2f}")

    return metric_dict
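Because `MIS_Status` is encoded as `CHGOFF` = 0 and `P I F` = 1, and scikit-learn's `precision_score`, `recall_score` and `f1_score` treat label 1 as the positive class by default, the precision, recall and F1 printed by `evaluate_preds` describe the paid-in-full class. For a default-risk model, the charged-off class is often the one of interest. The sketch below computes the same three metrics by hand for a chosen positive label, on made-up labels. The helper `binary_metrics` is hypothetical; scikit-learn offers the same choice through the `pos_label` argument of its metric functions.

```python
def binary_metrics(y_true, y_pred, pos_label=1):
    """Precision, recall and F1 for the class `pos_label`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 0 = CHGOFF, 1 = P I F
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 1, 1]

print(binary_metrics(y_true, y_pred, pos_label=1))  # metrics for paid-in-full
print(binary_metrics(y_true, y_pred, pos_label=0))  # metrics for charged-off
```

In this toy case the same predictions give a recall of 5/6 for the paid-in-full class but only 1/2 for the charged-off class, which is why the choice of positive label matters when reading the scores.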

In [54]:
# check model score
rs_clf.score(X_test, y_test)

0.9520549438569715

In [55]:
# Make predictions with the best hyperparameters, 
# rs_clf automatically selects the best params
rs_y_preds = rs_clf.predict(X_test)

# evaluate the predictions
rs_2_metrics = evaluate_preds(y_test, rs_y_preds)

Acc: 95.21%
Precision: 0.92
Recall: 0.91
F1 score: 0.91


# Hyperparameter tuning with GridSearchCV

In [1]:
# Narrower grid for the exhaustive search: 3 * 1 * 2 * 1 * 2 = 12 candidates
grid = {'n_estimators': [100, 200, 500],
        'max_depth': [None],
        'max_features': ['auto', 'sqrt'],
        'min_samples_split': [6],
        'min_samples_leaf': [1, 2]}

In [57]:
np.random.seed(42)

# Split into X & y
X = encoded_df.drop("MIS_Status", axis=1)
y = encoded_df["MIS_Status"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate model
clf = RandomForestClassifier(n_jobs=1)

# Setup GridSearchCV
gs_clf = GridSearchCV(estimator=clf, 
                      param_grid=grid,
                      cv=5, # cross validation
                      verbose=2)

# Fit the GridSearchCV version of clf
gs_clf.fit(X_train, y_train);

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=100; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=200; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, n_estimators=200; total time=   0.0s
[CV] END max_depth=None, max_features=auto, min_samples_leaf=1, min_samples_split=6, 


KeyboardInterrupt



# Export Model

#### save model 2

In [64]:
model_file_path = 'loan_model.pkl'

# Save the model to a file using pickle
with open(model_file_path, 'wb') as file:
    pickle.dump(rf_classifier2, file)

In [65]:
# Load the saved model to test it (the context manager closes the file afterwards)
with open("loan_model.pkl", "rb") as file:
    loaded_pickle_model = pickle.load(file)

In [66]:
# Make some predictions
pickle_y_preds = loaded_pickle_model.predict(X_test)
evaluate_preds(y_test, pickle_y_preds)

Acc: 93.70%
Precision: 0.94
Recall: 0.96
F1 score: 0.95


{'accuracy': 0.94, 'precision': 0.94, 'recall': 0.96, 'f1': 0.95}
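For deployment, the interface needs both the fitted model and the category mapping, since raw inputs must be encoded the same way as the training data. One option, sketched here on toy data, is to pickle the two together in a single dictionary. The toy features, the `bundle` keys and the in-memory round trip are assumptions for illustration, not the notebook's saved artifacts.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for X_train / y_train
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_toy, y_toy)

# Keep the model and the label mapping in one artifact
bundle = {"model": model, "mis_status_labels": ["CHGOFF", "P I F"]}

# Round trip through pickle (in memory here; writing to a file works the same way)
restored = pickle.loads(pickle.dumps(bundle))

# Predict with the restored model and translate codes back to labels
codes = restored["model"].predict(X_toy[:5])
labels = [restored["mis_status_labels"][code] for code in codes]
print(labels)
```

Bundling avoids shipping a model whose codes no longer match a separately edited mapping file.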