# **Should This Loan Be Approved Or Denied: An Exploration Of Risk Classification Using SBA Loan Data**

Matt Holsten, Rob Pitkin

Tufts University, Spring 2022

CS-135 Machine Learning, Final Project



---


# This Notebook
This notebook is part of our SBA Loan Analysis Project, described below.
See our [GitHub Repository](https://github.com/MatthewHolsten/sba-loan-risk-analysis) for the rest of the project, including the [Web App](https://matthewholsten.github.io/sba-loan-risk-analysis-webapp/) and [API](https://matthewholsten.pythonanywhere.com/) that this model helped create.


# Project Abstract

The U.S. Small Business Administration ("SBA") maintains a large dataset of SBA-guaranteed loans, stretching back almost 60 years, that documents the details and outcomes of each loan. Given the SBA's role and the successes it has helped create, there is growing demand for additional risk information that predicts whether a borrowing business will default on its loan.

To address this demand, we first performed a feature analysis to discover which aspects of SBA-backed loans contribute most to the risk of default. We then confirmed our analysis with two classification methods that predict loan outcomes before the loans are granted: a linear logistic regression model and a non-linear feed-forward neural network.

Lastly, we created a free, open-source API and web application (both in our GitHub repository) to deploy our machine learning models. We find 11 significant features that strongly indicate loan risk, with loan term length having the highest correlation. We also find that the linear and non-linear models reach accuracies of around 70% and 80%, respectively, demonstrating that our applications can serve as viable tools for SBA loan officers, lenders, and small businesses alike.



# Importing Data

In [None]:
import pandas as pd
df = pd.read_csv("SBAnational.csv")

In [None]:
df.head()

In [None]:
df.shape

In [None]:
MA_df = df.loc[df['State'] == 'MA']

In [None]:
print(MA_df.shape)
MA_df.head()

In [None]:
CA_df = df.loc[df['State'] == 'CA']

In [None]:
print(CA_df.shape)
CA_df.head()

# Cleaning up data

In [None]:
from datetime import datetime

def clean_data(data):
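    # Creating a new column for unix time instead of day-month-year.
    # timestamp() is used because strftime("%s") is not available on every
    # platform; both treat the parsed date as local time.
    data['ApprovalUnixTime'] = data['ApprovalDate'].apply(
        lambda x: int(datetime.strptime(x, '%d-%b-%y').timestamp()))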

    # Creating the recession feature (if the loan occurred during the Great
    # Recession)
    recession_start_unix = 1196485200
    recession_end_unix = 1243828800
    data['Recession'] = data['ApprovalUnixTime'].apply(
        lambda x: 1 if x >= recession_start_unix and x <= recession_end_unix
        else 0)

    # Fixing approval FY column (first getting rid of error value, then
    # converting to ints)
    data.loc[:,'ApprovalFY'] = data['ApprovalFY'].replace('1976A', 1976)
    data['ApprovalFY'] = data['ApprovalFY'].apply(int)


    # Mapping years to political party
    admin_yr = {
        1958: 'R', 1959: 'R', 1960: 'R', 1961: 'D',
        1962: 'D', 1963: 'D', 1964: 'D', 1965: 'D',
        1966: 'D', 1967: 'D', 1968: 'D', 1969: 'R',
        1970: 'R', 1971: 'R', 1972: 'R', 1973: 'R',
        1974: 'R', 1975: 'R', 1976: 'R',
        1977: 'D', 1978: 'D', 1979: 'D', 1980: 'D',
        1981: 'R', 1982: 'R', 1983: 'R', 1984: 'R', 
        1985: 'R', 1986: 'R', 1987: 'R', 1988: 'R', 
        1989: 'R', 1990: 'R', 1991: 'R', 1992: 'R', 
        1993: 'D', 1994: 'D', 1995: 'D', 1996: 'D',
        1997: 'D', 1998: 'D', 1999: 'D', 2000: 'D',
        2001: 'R', 2002: 'R', 2003: 'R', 2004: 'R',
        2005: 'R', 2006: 'R', 2007: 'R', 2008: 'R',
        2009: 'D', 2010: 'D', 2011: 'D', 2012: 'D',
        2013: 'D', 2014: 'D', 2015: 'D', 2016: 'D',
        2017: 'R', 2018: 'R', 2019: 'R', 2020: 'R',
        2021: 'D', 2022: 'D', 2023: 'D'
    }

    # Adding admin party column
    data['AdminParty'] = data['ApprovalFY'].apply(
        lambda x: 1 if admin_yr[x] == 'D' else 0
    )

    # Adding industry column
    data['Industry'] = data['NAICS'].apply(
        lambda x: int(x/10000)
    )

    # Adding real estate column
    data['RealEstate'] = data['Term'].apply(
        lambda x: 1 if x >= 240 else 0
    )

    # Adding SBA backed proportion column
    props = []
    for i in data.index:
        props.append(float(data['SBA_Appv'][i].replace('$','').replace(',','')) / \
                    float(data['GrAppv'][i].replace('$','').replace(',','')))
    data['SBAProportion'] = props

    # Adding gross approved amount (GrAppv) as a float column
    data['TotalLoan'] = data['GrAppv'].apply(
        lambda x: float(x.replace('$','').replace(',',''))
    )

    # Adding SBA loan float column
    data['SBALoan'] = data['SBA_Appv'].apply(
        lambda x: float(x.replace('$','').replace(',',''))
    )

    # Adding new business column --> mode replacement (mode was existing biz)
    data['NewBusiness'] = data['NewExist'].apply(
        lambda x: 1 if x == 2 else 0
    )

    # Mapping states to integers based on alphabetical order
    states = {
        'AK': 1, 'AL': 2, 'AR': 3, 'AZ': 4,
        'CA': 5, 'CO': 6, 'CT': 7, 'DC': 8,
        'DE': 9, 'FL': 10, 'GA': 11, 'HI': 12,
        'IA': 13, 'ID': 14, 'IL': 15, 'IN': 16,
        'KS': 17, 'KY': 18, 'LA': 19, 'MA': 20,
        'MD': 21, 'ME': 22, 'MI': 23, 'MN': 24,
        'MO': 25, 'MS': 26, 'MT': 27, 'NC': 28,
        'ND': 29, 'NE': 30, 'NH': 31, 'NJ': 32,
        'NM': 33, 'NV': 34, 'NY': 35, 'OH': 36,
        'OK': 37, 'OR': 38, 'PA': 39, 'RI': 40,
        'SC': 41, 'SD': 42, 'TN': 43, 'TX': 44,
        'UT': 45, 'VA': 46, 'VT': 47, 'WA': 48,
        'WI': 49, 'WV': 50, 'WY': 51
    }

    # Adding location --> mode replacement (mode is CA)
    data['StateNum'] = data['State'].apply(
        lambda x: states[x] if x in states else 5
    )

    # Adding target --> mode replacement (mode is Paid In Full)
    data['PaidOff'] = data['MIS_Status'].apply(
        lambda x: 1 if x != 'CHGOFF' else 0
    )

    return data
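
The recession flag above can also be sketched with plain date comparisons instead of Unix timestamps. This is a minimal illustration, assuming the same `%d-%b-%y` date strings and the December 1, 2007 to June 1, 2009 window that `recession_start_unix` and `recession_end_unix` correspond to in US Eastern time; the helper name `is_recession_loan` is hypothetical. One caveat worth keeping in mind: Python maps two-digit years 00-68 to 2000-2068, so any approval dates before 1969 would need separate handling.

```python
from datetime import date, datetime

# Window taken from the Unix bounds in clean_data (midnight US Eastern time)
RECESSION_START = date(2007, 12, 1)
RECESSION_END = date(2009, 6, 1)

def is_recession_loan(approval_date: str) -> int:
    """Return 1 if a '%d-%b-%y' date falls inside the recession window."""
    approved = datetime.strptime(approval_date.strip(), "%d-%b-%y").date()
    return int(RECESSION_START <= approved <= RECESSION_END)

print(is_recession_loan("15-Mar-08"))  # a date inside the window
print(is_recession_loan("28-Feb-97"))  # a date before the window
```

Comparing `date` objects avoids any dependence on the machine's local time zone, which the timestamp-based version inherits.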

In [None]:
data = clean_data(df.copy())

# data = clean_data(MA_df.copy())

# data = clean_data(CA_df.copy())
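
The currency handling inside `clean_data` can be isolated into a small helper. The sketch below assumes amounts formatted like `$60,000.00 ` (dollar sign, thousands separators, optional trailing space); the function names `parse_dollars` and `sba_proportion` and the sample amounts are made up for illustration.

```python
def parse_dollars(amount: str) -> float:
    """Convert a string such as '$60,000.00 ' to a float."""
    return float(amount.replace("$", "").replace(",", "").strip())

def sba_proportion(sba_appv: str, gr_appv: str) -> float:
    """Share of the approved amount that is guaranteed by the SBA."""
    return parse_dollars(sba_appv) / parse_dollars(gr_appv)

# Illustrative amounts, not rows from the dataset
print(parse_dollars("$60,000.00 "))
print(sba_proportion("$48,000.00 ", "$60,000.00 "))
```

Factoring the parsing out this way means `TotalLoan`, `SBALoan`, and `SBAProportion` would all share one definition of how a dollar string is read.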

# Checking unique values of columns

In [None]:
def print_uniques(data):
    # Our target is Paid off
    print("PaidOff\n", data['PaidOff'].unique(), "\n")

    """
       Our features are: Industry, Gross Disbursement, 
       New vs. Established business, Loans backed by real estate, 
       Economic recession, SBA's guaranteed portion, SBA backed proportion,
       Administration Party, Urban vs. Rural, Jobs Created, Jobs Retained,
       and Loan Term
    """

    print("Industry\n",data['Industry'].unique(), "\n")
    print("TotalLoan\n",data['TotalLoan'].unique(), "\n")
    print("New vs. Old\n",data['NewBusiness'].unique(), "\n")
    print("RealEstate\n",data['RealEstate'].unique(), "\n")
    print("Recession\n",data['Recession'].unique(), "\n")
    print("SBALoan\n",data['SBALoan'].unique(), "\n")
    print("SBAProportion\n",data['SBAProportion'].unique(), "\n")
    print("AdminParty\n",data['AdminParty'].unique(), "\n")
    print("UrbanRural\n",data['UrbanRural'].unique(), "\n")
    print("CreateJob\n",data['CreateJob'].unique(), "\n")
    print("RetainedJob\n",data['RetainedJob'].unique(), "\n")
    print("Term\n",data['Term'].unique(), "\n")

In [None]:
print_uniques(data)

# Feature Selection

In [None]:
import numpy as np
from sklearn.feature_selection import mutual_info_classif, f_classif
import matplotlib.pyplot as plt

def perform_feature_selection(data, include_loc=False):

    # Isolating target variable (Paid in full vs. Charged Off)
    target = data.copy().iloc[:,37]

    # Counting the number of positive examples vs. negative examples
    ones = 0
    for t in target:
        if t == 1:
            ones += 1
    ones = ones/target.size
    zeros = 1 - ones

    weights = [max(ones, zeros)/zeros, max(ones,zeros)/ones]

    # Dictionary of features to column indices
    features = {
        'Industry': 30,
        'TotalLoan': 33,
        'NewBusiness': 35,
        'RealEstate': 31,
        'Recession': 28,
        'SBALoan': 34,
        'AdminParty': 29,
        'SBAProportion': 32,
        'UrbanRural' : 16,
        'CreateJob' : 13,
        'RetainedJob' : 14,
        'Term' : 10,
        'StateNum' : 36
    }

    # Creating a list of feature columns
    fs = []
    fs_names = []
    for x in features:
        fs_names.append(x)
        fs.append(np.array(data.copy().iloc[:,features[x]]))
    if not include_loc:
        fs.pop(-1)
        fs_names.pop(-1)
    fs = np.array(fs).T
    print("Number of features:",fs.shape[1])

    # Feature selection: MI test and F-test.
    # discrete_features is passed as a boolean mask: a list of 0/1 integers
    # would be read as column indices rather than per-column flags.
    discrete_mask = np.array([1,0,1,1,1,0,1,0,1,0,0,0,1], dtype=bool)
    if not include_loc:
        discrete_mask = discrete_mask[:-1]
    mutual_info = mutual_info_classif(fs, target, discrete_features=discrete_mask)
    f_test, _ = f_classif(fs, target)
    return fs, fs_names, target, mutual_info, f_test, weights
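
The class weights returned above follow a simple rule: the majority class gets weight 1 and the minority class is scaled up by the ratio of the two class frequencies. A minimal sketch of that rule, using a made-up toy label vector and a hypothetical helper name `class_weights`:

```python
import numpy as np

def class_weights(target):
    """Return [weight for class 0, weight for class 1]; the majority class gets 1.0."""
    target = np.asarray(target)
    ones = np.mean(target == 1)
    zeros = 1.0 - ones
    majority = max(ones, zeros)
    return [majority / zeros, majority / ones]

# Toy labels: 8 paid-off loans and 2 charge-offs (illustrative only)
toy_target = [1] * 8 + [0] * 2
print(class_weights(toy_target))
```

With an 80/20 split, each charge-off counts roughly four times as much as a paid-off loan when these weights are later passed to the loss function.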

In [None]:
# Don't include location (state-level)
# fs, fs_names, target, mi, ftest, weights = perform_feature_selection(data, False)

# Include location (country-wide)
fs, fs_names, target, mi, ftest, weights = perform_feature_selection(data, True)

Number of features: 13
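
scikit-learn documents that an array passed as `discrete_features` is either a boolean mask or an array of column indices, and the two readings give very different results for a 0/1 list. The sketch below mimics the distinction with plain NumPy indexing; it is an illustration of the semantics, not scikit-learn's own code.

```python
import numpy as np

n_features = 13
flags = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1]  # 1 = discrete feature

# Integer entries are read as column indices, so only columns 0 and 1 are set
as_indices = np.zeros(n_features, dtype=bool)
as_indices[flags] = True

# A boolean array is read position by position, which is the intended meaning
as_mask = np.array(flags, dtype=bool)

print(as_indices.sum(), as_mask.sum())
```

Under the index reading only two columns would be treated as discrete, whereas the mask reading marks the seven categorical features.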


### Mutual Information Test

In [None]:
def print_MI_results(fs_names, mutual_info):
    # Mutual information results
    print("Mutual Information by index: ", mutual_info)
    fig = plt.figure()
    fig.set_size_inches(14.5, 8.5)
    plt.bar(fs_names, mutual_info)
    plt.title("Features vs. Mutual Information")
    plt.xlabel("Feature")
    plt.xticks(rotation=45)
    plt.ylabel("Mutual Information Value")
    plt.show()

In [None]:
print_MI_results(fs_names, mi)
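
To give a feel for what a mutual information score measures, here is a small plug-in calculation for two discrete variables. It is only a sketch of the textbook formula; `mutual_info_classif` uses its own estimators (including nearest-neighbor methods for continuous features), so its numbers will not match this helper in general. The function name `discrete_mutual_info` is hypothetical.

```python
import numpy as np

def discrete_mutual_info(x, y):
    """Plug-in mutual information (in nats) between two discrete arrays."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                p_x, p_y = np.mean(x == xv), np.mean(y == yv)
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

labels = np.array([0, 0, 1, 1])
print(discrete_mutual_info(labels, labels))        # feature identical to the target
print(discrete_mutual_info([0, 1, 0, 1], labels))  # feature independent of the target
```

A feature that fully determines a balanced binary target scores ln 2, and a feature that is independent of the target scores 0.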

### F-Test

In [None]:
def print_ftest_results(fs_names, f_test):
    # F-test results
    print("F-values by index: ", f_test)
    fig = plt.figure()
    fig.set_size_inches(14.5, 8.5)
    plt.bar(fs_names, f_test)
    plt.title("Features vs. F-Test value")
    plt.xlabel("Feature")
    plt.xticks(rotation=45)
    plt.ylabel("F-value")
    plt.show()

In [None]:
print_ftest_results(fs_names, ftest)
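
The F-value plotted above is a one-way ANOVA statistic: the variance of a feature between the target classes divided by its variance within them. A minimal sketch for a single feature, with toy data and a hypothetical helper name `anova_f`:

```python
import numpy as np

def anova_f(feature, target):
    """One-way ANOVA F statistic of one feature across the target classes."""
    feature, target = np.asarray(feature, dtype=float), np.asarray(target)
    groups = [feature[target == c] for c in np.unique(target)]
    grand_mean = feature.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(feature) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Toy data: the feature is one unit higher, on average, for class 1
print(anova_f([1, 2, 3, 2, 3, 4], [0, 0, 0, 1, 1, 1]))
```

Larger values mean the class means are further apart relative to the spread inside each class, which is why strongly separating features stand out in the bar chart.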

### Correlation Matrix

In [None]:
import seaborn as sn

def create_correlation_matrix(feature_cols, feature_names, feature_target):
    corr_df = pd.DataFrame(feature_cols, columns=feature_names)
    corr_df['PaidOff'] = feature_target
    fig = plt.figure()
    fig.set_size_inches(18.5, 10.5)
    corr_matrix = corr_df.corr()
    sn.heatmap(corr_matrix, annot=True)
    plt.show()

In [None]:
create_correlation_matrix(fs, fs_names, target)

# Separating test and training data

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch
import math

def create_train_test_sets(X, Y, weights):
    # Getting rid of 'NewBusiness'
    features = np.delete(X, 2, 1)

    # Getting rid of 'CreateJob'
    features = np.delete(features, 8, 1)

    # Normalizing total loan, sba loan, sba proportion, retained jobs,
    # and loan term
    means = []
    std_devs = []
    sc = StandardScaler()

    # Total Loan
    features[:,1] = sc.fit_transform(features[:,1].reshape(-1,1)).flatten()
    means.append(sc.mean_)
    std_devs.append(math.sqrt(sc.var_))

    # SBA Loan
    features[:,4] = sc.fit_transform(features[:,4].reshape(-1,1)).flatten()
    means.append(sc.mean_)
    std_devs.append(math.sqrt(sc.var_))

    # SBA Proportion
    features[:,6] = sc.fit_transform(features[:,6].reshape(-1,1)).flatten()
    means.append(sc.mean_)
    std_devs.append(math.sqrt(sc.var_))

    # Retained Jobs
    features[:,8] = sc.fit_transform(features[:,8].reshape(-1,1)).flatten()
    means.append(sc.mean_)
    std_devs.append(math.sqrt(sc.var_))

    # Loan Term
    features[:,9] = sc.fit_transform(features[:,9].reshape(-1,1)).flatten()
    means.append(sc.mean_)
    std_devs.append(math.sqrt(sc.var_))

    # Creating test and training sets
    x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.25, random_state=73)
    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.13, random_state=81)

    y_train = np.array(y_train)
    y_test = np.array(y_test)
    y_val = np.array(y_val)

    w_train = [weights[0] if x == 0 else weights[1] for x in y_train]
    w_test = [weights[0] if x == 0 else weights[1] for x in y_test]

    # Converting sets to tensors
    x_train = torch.tensor(x_train.astype(np.float32))
    x_test = torch.tensor(x_test.astype(np.float32))
    x_val = torch.tensor(x_val.astype(np.float32))
    y_train = torch.tensor(np.around(y_train.astype(np.float32)))
    y_test = torch.tensor(np.around(y_test.astype(np.float32)))
    y_val = torch.tensor(np.around(y_val.astype(np.float32)))
    
    data_sets = {
        "x_train" : x_train,
        "x_test" : x_test,
        "x_val" : x_val,
        "y_train" : y_train,
        "y_test" : y_test,
        "y_val" : y_val,
        "w_train" : w_train, 
        "w_test" : w_test, 
        "means" : means, 
        "sds" : std_devs
    }

    return data_sets

In [None]:
data_sets = create_train_test_sets(fs, target, weights)
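
Two details of the split are easy to work out by hand. First, the two `train_test_split` calls leave roughly 65% of the rows for training, 10% for validation, and 25% for testing. Second, the function above standardizes each column before splitting; a common alternative, sketched here as an option rather than as what this notebook does, is to compute the mean and standard deviation on the training rows only. The helper name and the sample values are illustrative.

```python
import numpy as np

# Fractions implied by test_size=0.25 followed by test_size=0.13 on the rest
test_frac = 0.25
val_frac = (1 - test_frac) * 0.13
train_frac = 1 - test_frac - val_frac
print(train_frac, val_frac, test_frac)

def standardize_with_train_stats(train_col, other_col):
    """Scale both columns using the mean and std of the training column only."""
    mean, std = train_col.mean(), train_col.std()
    return (train_col - mean) / std, (other_col - mean) / std

train_col = np.array([10.0, 20.0, 30.0])   # illustrative values
test_col = np.array([20.0, 40.0])
scaled_train, scaled_test = standardize_with_train_stats(train_col, test_col)
print(scaled_train, scaled_test)
```

`np.std` uses the population standard deviation by default, which matches the convention `StandardScaler` uses.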

# Logistic Regression

### Creating our Logistic Regression class

In [None]:
class SBALogisticRegression(torch.nn.Module):
    def __init__(self, input_dim, output_dim, means, std_devs):
        super(SBALogisticRegression, self).__init__()

        # Storing means and std_devs for future queries
        self.means = means
        self.std_devs = std_devs 

        # Creating our logistic regression unit
        self.linear = torch.nn.Linear(input_dim, output_dim)  
   
    def forward(self, x):
        outputs = torch.sigmoid(self.linear(x))
        return outputs

### Creating our LR model and training function

In [None]:
LRepochs = 50000
input_dim = data_sets["x_train"].shape[1] # Our features (11 with StateNum, 10 without)
output_dim = 1 # Single binary output 
learning_rate = 0.001

LRmodel = SBALogisticRegression(input_dim, output_dim, data_sets["means"], data_sets["sds"])
LR_train_loss = torch.nn.BCELoss(weight = torch.FloatTensor(data_sets["w_train"]))
LR_test_loss = torch.nn.BCELoss(weight = torch.FloatTensor(data_sets["w_test"]))
LRoptimizer = torch.optim.SGD(LRmodel.parameters(), lr=learning_rate)
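
The `weight` argument of `BCELoss` rescales each sample's loss term before the terms are averaged. The NumPy sketch below reproduces that arithmetic on two made-up predictions to show how a larger weight on a charge-off raises its contribution; the helper name `weighted_bce` is hypothetical.

```python
import numpy as np

def weighted_bce(probs, labels, weights):
    """Mean of per-sample weighted binary cross-entropy terms."""
    probs, labels, weights = (np.asarray(a, dtype=float) for a in (probs, labels, weights))
    per_sample = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return float(np.mean(weights * per_sample))

probs = [0.9, 0.2]      # predicted probability of "paid off"
labels = [1.0, 0.0]     # true outcomes: one paid off, one charged off
print(weighted_bce(probs, labels, [1.0, 1.0]))  # unweighted
print(weighted_bce(probs, labels, [1.0, 4.0]))  # charge-off counted four times
```

Because the weights are tied to individual samples, the training and test losses each need a weight vector of matching length, which is why two separate loss objects are created.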

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import RocCurveDisplay

def LR_train_and_test(model, optimizer, x_train, x_test, y_train, y_test):
    iter = 0
    predicted_labels = None
    final_predicted = None
    # fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1)
    for epoch in range(int(LRepochs)):
        iter+=1

        # Zeroing our gradients for the batch
        optimizer.zero_grad()

        # Feed-forward pass
        outputs = model(x_train)

        # Finding our loss and computing the gradients
        loss = LR_train_loss(torch.squeeze(outputs), y_train) 
        loss.backward()
        
        # Updating weights and biases
        optimizer.step()
        
        if iter % (LRepochs/50) == 0:
            # Calculating the loss and accuracy for the test dataset
            test_outputs = torch.squeeze(model(x_test))
            test_loss = LR_test_loss(test_outputs, y_test)
            
            predicted_prob = test_outputs.detach().numpy()
            predicted_test = test_outputs.round().detach().numpy()
            test_correct = np.sum(predicted_test == y_test.detach().numpy())
            test_accuracy = 100 * test_correct/y_test.size(0)
            
            # Calculating the loss and accuracy for the train dataset
            train_correct = np.sum(torch.squeeze(outputs).round().detach().numpy() == y_train.detach().numpy())
            train_accuracy = 100 * train_correct/y_train.size(0)
            
            print("Iteration:", iter,"\nTest - Loss:", test_loss.item(), "Accuracy:", test_accuracy)
            print("Train - Loss:", loss.item(), "Accuracy:", train_accuracy, "\n")

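            # np.append returns a new array and does not modify its input,
            # so keep the most recent probabilities and hard labels instead;
            # the ROC curve and confusion matrix then use the final checkpoint
            predicted_labels = predicted_prob
            final_predicted = predicted_test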

    # Build confusion matrix and ROC Curve
    fig = plt.figure()
    ConfusionMatrixDisplay.from_predictions(y_test.detach().numpy(), final_predicted)
    RocCurveDisplay.from_predictions(y_test.detach().numpy(), predicted_labels)
    plt.show()

In [None]:
LR_train_and_test(LRmodel, LRoptimizer, data_sets["x_train"], data_sets["x_test"], data_sets["y_train"], data_sets["y_test"])
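
Rounding the sigmoid output, as the training loop does, amounts to thresholding the predicted probability at 0.5. The sketch below shows how the four cells of a confusion matrix and the accuracy follow from that threshold; the labels and probabilities are made up, and `confusion_counts` is a hypothetical helper rather than part of scikit-learn.

```python
import numpy as np

def confusion_counts(y_true, probs, threshold=0.5):
    """Return (tn, fp, fn, tp) after thresholding predicted probabilities."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(probs) > threshold).astype(int)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    return tn, fp, fn, tp

y_true = [1, 1, 0, 0, 1]                 # illustrative labels
probs = [0.9, 0.4, 0.3, 0.7, 0.8]        # illustrative model outputs
tn, fp, fn, tp = confusion_counts(y_true, probs)
accuracy = (tp + tn) / len(y_true)
print(tn, fp, fn, tp, accuracy)
```

For a lending decision the two error types are not equally costly: a false positive here is a charged-off loan predicted as paid off, so the confusion matrix is more informative than accuracy alone.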

# Neural Network

### Creating our Neural Network class

In [None]:
class SBANeuralNet(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, means, std_devs, iterations):
        super(SBANeuralNet, self).__init__()
        # Storing means and std_devs for future queries
        self.means = means
        self.std_devs = std_devs
        self.iterations = iterations

        # Fully-connected Layers
        self.full_1 = torch.nn.Linear(input_dim, hidden_dim)
        self.full_2 = torch.nn.Linear(hidden_dim, output_dim)

        # Activation function
        self.sigmoid = torch.nn.Sigmoid()  

    def forward(self, x):
        # Aggregation of layer 1
        out = self.full_1(x)
        # Activation of layer 1
        out = self.sigmoid(out)

        # Aggregation of layer 2
        out = self.full_2(out)
        # Activation of layer 2
        out = self.sigmoid(out)
        return out
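
The forward pass of this network is two affine maps, each followed by a sigmoid. A NumPy sketch with random weights makes the shapes explicit; the layer sizes below are illustrative, and the weight matrices are stored as (inputs, outputs), which is the transpose of how `torch.nn.Linear` stores them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, b1, w2, b2):
    """Two-layer pass: sigmoid(sigmoid(x @ w1 + b1) @ w2 + b2)."""
    hidden = sigmoid(x @ w1 + b1)
    return sigmoid(hidden @ w2 + b2)

rng = np.random.default_rng(0)
n_features, n_hidden = 11, 16                 # illustrative sizes
x = rng.normal(size=(4, n_features))          # 4 samples
w1, b1 = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
w2, b2 = rng.normal(size=(n_hidden, 1)), np.zeros(1)
out = forward(x, w1, b1, w2, b2)
print(out.shape)
```

Each row of `out` is a single value between 0 and 1, read here as the predicted probability that the loan is paid off.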

### Creating our Neural Network model and training function

In [None]:
NNepochs = [5000, 10000, 20000] # Number of iterations
input_dim = data_sets["x_train"].shape[1] # Our features (11 with StateNum, 10 without)
hidden_dims = [2, 4, 8, 16, 32]  # Hidden layer sizes
output_dim = 1 # Single binary output 
learning_rates = np.logspace(-3, -1, 5)

NNmodels = []
model_parameters = []
for it in NNepochs:
    for hd in hidden_dims:
        for lr in learning_rates:
            model_parameters.append((hd, lr, it))
            NNmodel = SBANeuralNet(input_dim, hd, output_dim, data_sets["means"], data_sets["sds"], it)
            NNoptimizer = torch.optim.SGD(NNmodel.parameters(), lr=lr, momentum=0.9)
            NNmodels.append((NNmodel, NNoptimizer))

NN_train_loss = torch.nn.BCELoss(weight = torch.FloatTensor(data_sets["w_train"]))
NN_test_loss = torch.nn.BCELoss(weight = torch.FloatTensor(data_sets["w_test"]))
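
The nested loops above enumerate a full grid over iteration counts, hidden sizes, and learning rates. The short sketch below rebuilds the same grid with `itertools.product` to show how many models it implies and which learning rates `np.logspace(-3, -1, 5)` produces; the variable names are illustrative.

```python
from itertools import product
import numpy as np

epochs_grid = [5000, 10000, 20000]
hidden_grid = [2, 4, 8, 16, 32]
lr_grid = np.logspace(-3, -1, 5)   # 5 values from 1e-3 to 1e-1, evenly spaced in log10

# Same iteration order as the nested loops: epochs, then hidden size, then learning rate
grid = list(product(epochs_grid, hidden_grid, lr_grid))
print(len(grid))
print(np.round(lr_grid, 5))
```

Since every combination is trained separately on the full training set, the size of this grid directly drives the total training time.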

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import RocCurveDisplay

def NN_train_and_test(NNmodel, NNoptimizer, x_train, x_test, y_train, y_test, epochs):
    iter = 0
    best_accuracy = 0
    predicted_labels = None
    final_predicted = None
    for epoch in range(epochs):
        NNmodel.train(True)
        # Zeroing our gradients for the batch
        NNoptimizer.zero_grad()

        # Feed-forward pass
        outputs = NNmodel(x_train)

        # Finding our loss and computing the gradients
        loss = NN_train_loss(torch.squeeze(outputs), y_train) 
        loss.backward()
        
        # Updating weights and biases
        NNoptimizer.step()

        iter+=1
        if iter % 1000 == 0:
            NNmodel.train(False)
            # Calculating the loss and accuracy for the test dataset
            test_outputs = torch.squeeze(NNmodel(x_test))
            test_loss = NN_test_loss(test_outputs, y_test)
            
            predicted_test = test_outputs.round().detach().numpy()
            predicted_prob = test_outputs.detach().numpy()
            test_correct = np.sum(predicted_test == y_test.detach().numpy())
            test_accuracy = 100 * test_correct/y_test.size(0)
            
            # Calculating the loss and accuracy for the train dataset
            train_correct = np.sum(torch.squeeze(outputs).round().detach().numpy() == y_train.detach().numpy())
            train_accuracy = 100 * train_correct/y_train.size(0)
            
            print("Iteration:", iter,"\nTesting Loss:", test_loss.item(), "Testing Accuracy:", test_accuracy)
            print("Training Loss:", loss.item(), "Training Accuracy:", train_accuracy, "\n")

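            # np.append returns a new array and does not modify its input,
            # so keep the most recent probabilities and hard labels instead;
            # the ROC curve and confusion matrix then use the final checkpoint
            predicted_labels = predicted_prob
            final_predicted = predicted_test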
            
    # Build confusion matrix and ROC Curve
    fig = plt.figure()
    ConfusionMatrixDisplay.from_predictions(y_test.detach().numpy(), final_predicted)
    RocCurveDisplay.from_predictions(y_test.detach().numpy(), predicted_labels)
    plt.show()

In [None]:
for m, o in NNmodels:
    NN_train_and_test(m, o, data_sets["x_train"], data_sets["x_test"], data_sets["y_train"], data_sets["y_test"], m.iterations)

# Adding the ability to query the models

In [None]:
def query(model, sample, print_result=True):
    model.train(False)
    
    output = torch.squeeze(model(sample))
    output = output.round().detach().numpy()
    if output == 1.0:
        if print_result:
            print("Predicted low risk loan")
        return 1.0
    else:
        if print_result:
            print("Predicted high risk loan")
        return 0.0
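
Both model classes store `means` and `std_devs` so that a raw, unscaled loan can be standardized before it is passed to `query`. A minimal sketch of that step follows, assuming the statistics are listed in the same order as the columns that were scaled; the helper name `normalize_sample`, the column positions, and all numbers are illustrative and not taken from the dataset.

```python
import numpy as np

def normalize_sample(raw, means, sds, cols):
    """Standardize the listed columns of one raw sample with stored statistics."""
    sample = np.asarray(raw, dtype=float).copy()
    for col, mean, sd in zip(cols, means, sds):
        # .item() also accepts the one-element arrays a fitted scaler exposes
        sample[col] = (sample[col] - np.asarray(mean).item()) / np.asarray(sd).item()
    return sample

# Illustrative statistics for two scaled columns
raw = [44, 150000.0, 0, 0, 84.0]
means, sds, cols = [100000.0, 60.0], [50000.0, 24.0], [1, 4]
print(normalize_sample(raw, means, sds, cols))
```

The result would then be converted to a `float32` tensor before being handed to the model, matching the dtype used for the training data.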

# Validating the models

### Logistic Regression

In [None]:
accuracy = 0.0

for i, x in enumerate(data_sets["x_val"]):
    pred = query(LRmodel, x, False)
    if data_sets["y_val"][i].detach().numpy() == pred:
        accuracy += 1.0

print("Accuracy for LR:",100*(accuracy/data_sets["y_val"].size(0)))


### Neural Network

In [None]:
for i, (m, _) in enumerate(NNmodels):
    accuracy = 0.0

    for j, x in enumerate(data_sets["x_val"]):
        pred = query(m, x, False)
        if data_sets["y_val"][j].detach().numpy() == pred:
            accuracy += 1.0

    print("Hyper-parameters: HD1 -", model_parameters[i][0], "LearningRate -", model_parameters[i][1], "Iterations -", model_parameters[i][2])
    print("Accuracy for NN:",100*(accuracy/data_sets["y_val"].size(0)))

# Creating and training the best model from hyper-parameter tuning

In [None]:
best_epochs = 20000 # Number of iterations
input_dim = data_sets["x_train"].shape[1] # Our features (11 with StateNum, 10 without)
hidden_dim = 16 # Hidden layer size
output_dim = 1 # Single binary output 
lr = 0.01

best_model = SBANeuralNet(input_dim, hidden_dim, output_dim, data_sets["means"], data_sets["sds"], best_epochs)
optimizer = torch.optim.SGD(best_model.parameters(), lr=lr, momentum=0.9)

NN_train_loss = torch.nn.BCELoss(weight = torch.FloatTensor(data_sets["w_train"]))
NN_test_loss = torch.nn.BCELoss(weight = torch.FloatTensor(data_sets["w_test"]))

In [None]:
NN_train_and_test(best_model, optimizer, data_sets["x_train"], data_sets["x_test"], data_sets["y_train"], data_sets["y_test"], best_epochs)

### Running the optimal model on the validation set

In [None]:
accuracy = 0.0

for j, x in enumerate(data_sets["x_val"]):
    pred = query(best_model, x, False)
    if data_sets["y_val"][j].detach().numpy() == pred:
        accuracy += 1.0

print("Accuracy for NN:",100*(accuracy/data_sets["y_val"].size(0)))

Accuracy for NN: 80.85192697768763


# Exporting the Model

In [None]:
torch.save(best_model.state_dict(), "country_model.pt")
# torch.save(best_model.state_dict(), "ca_model.pt")
# torch.save(best_model.state_dict(), "ma_model.pt")

In [None]:
torch.onnx.export(best_model, torch.zeros(11), 'country_model.onnx', verbose=True)
# torch.onnx.export(best_model, torch.zeros(10), 'ca_model.onnx', verbose=True)
# torch.onnx.export(best_model, torch.zeros(10), 'ma_model.onnx', verbose=True)

# Experimenting with fewer features

### Removing features

In [None]:
print(fs_names)

fs_copy = fs.copy()
fs_names_copy = fs_names.copy()

# Getting rid of 'NewBusiness'
fs_copy = np.delete(fs_copy, 2, 1)
fs_names_copy.pop(2)

# Getting rid of 'CreateJob'
fs_copy = np.delete(fs_copy, 8, 1)
fs_names_copy.pop(8)

# Getting rid of 'TotalLoan'
fs_copy = np.delete(fs_copy, 1, 1)
fs_names_copy.pop(1)

# Getting rid of 'SBALoan'
fs_copy = np.delete(fs_copy, 3, 1)
fs_names_copy.pop(3)

# # Getting rid of loan term - make -2 if country-wide
# fs_copy = np.delete(fs_copy, -2, 1)
# fs_names_copy.pop(-2)

# # Getting rid of SBA proportion
# fs_copy = np.delete(fs_copy, 6, 1)
# fs_names_copy.pop(6)

# Getting rid of AdminParty
# fs_copy = np.delete(fs_copy, 5, 1)
# fs_names_copy.pop(5)

print(fs_copy.shape)

print(fs_names_copy)
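
Because each `np.delete` shifts the positions of the remaining columns, removing features by name is less error-prone than tracking indices by hand. The sketch below is one way to do that, using the feature order from `perform_feature_selection` and a toy matrix standing in for `fs`; `drop_features` is a hypothetical helper.

```python
import numpy as np

names = ['Industry', 'TotalLoan', 'NewBusiness', 'RealEstate', 'Recession',
         'SBALoan', 'AdminParty', 'SBAProportion', 'UrbanRural', 'CreateJob',
         'RetainedJob', 'Term', 'StateNum']
to_drop = {'NewBusiness', 'CreateJob', 'TotalLoan', 'SBALoan'}

def drop_features(matrix, names, to_drop):
    """Drop columns by name and return the reduced matrix and name list."""
    keep = [i for i, name in enumerate(names) if name not in to_drop]
    return matrix[:, keep], [names[i] for i in keep]

toy = np.arange(2 * len(names)).reshape(2, len(names))  # stand-in for fs
reduced, kept = drop_features(toy, names, to_drop)
print(reduced.shape, kept)
```

After these four removals, `SBAProportion`, `RetainedJob`, and `Term` sit at columns 4, 6, and 7, which are the positions that get standardized in the next step.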

### Creating training, test, and validation sets with new set of features

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import torch
import math

def create_less_train_test_sets(X, Y, weights):
    print(X.shape)

    # Normalizing SBA proportion, retained jobs, and loan term
    # (TotalLoan and SBALoan have been removed from this feature set)
    features = X
    means = []
    std_devs = []
    sc = StandardScaler()

    # Total Loan
    # features[:,1] = sc.fit_transform(features[:,1].reshape(-1,1)).flatten()
    # means.append(sc.mean_)
    # std_devs.append(math.sqrt(sc.var_))

    # SBA Proportion (column 4 once TotalLoan and SBALoan are removed)
    features[:,4] = sc.fit_transform(features[:,4].reshape(-1,1)).flatten()
    means.append(sc.mean_)
    std_devs.append(math.sqrt(sc.var_))

    # Retained Job
    features[:,6] = sc.fit_transform(features[:,6].reshape(-1,1)).flatten()
    means.append(sc.mean_)
    std_devs.append(math.sqrt(sc.var_))

    # Loan Term
    features[:,7] = sc.fit_transform(features[:,7].reshape(-1,1)).flatten()
    means.append(sc.mean_)
    std_devs.append(math.sqrt(sc.var_))

    print(features.shape)

    # Creating test and training sets
    x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.25, random_state=73)
    x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.13, random_state=81)

    y_train = np.array(y_train)
    y_test = np.array(y_test)
    y_val = np.array(y_val)

    w_train = [weights[0] if x == 0 else weights[1] for x in y_train]
    w_test = [weights[0] if x == 0 else weights[1] for x in y_test]

    # Converting sets to tensors
    x_train = torch.tensor(x_train.astype(np.float32))
    x_test = torch.tensor(x_test.astype(np.float32))
    x_val = torch.tensor(x_val.astype(np.float32))
    y_train = torch.tensor(np.around(y_train.astype(np.float32)))
    y_test = torch.tensor(np.around(y_test.astype(np.float32)))
    y_val = torch.tensor(np.around(y_val.astype(np.float32)))
    
    data_sets = {
        "x_train" : x_train,
        "x_test" : x_test,
        "x_val" : x_val,
        "y_train" : y_train,
        "y_test" : y_test,
        "y_val" : y_val,
        "w_train" : w_train, 
        "w_test" : w_test, 
        "means" : means, 
        "sds" : std_devs
    }

    return data_sets

In [None]:
less_data_sets = create_less_train_test_sets(fs_copy, target, weights)

### Creating and running our logistic regression model with new features and datasets

In [None]:
LRepochs = 50000
input_dim = 9 # Our features
output_dim = 1 # Single binary output 
learning_rate = 0.001

LR_less_model = SBALogisticRegression(input_dim, output_dim, less_data_sets["means"], less_data_sets["sds"])
LR_train_loss = torch.nn.BCELoss(weight = torch.FloatTensor(less_data_sets["w_train"]))
LR_test_loss = torch.nn.BCELoss(weight = torch.FloatTensor(less_data_sets["w_test"]))
LR_less_optimizer = torch.optim.SGD(LR_less_model.parameters(), lr=learning_rate)

In [None]:
LR_train_and_test(LR_less_model, LR_less_optimizer, less_data_sets["x_train"], less_data_sets["x_test"], less_data_sets["y_train"], less_data_sets["y_test"])

In [None]:
accuracy = 0.0

for i, x in enumerate(less_data_sets["x_val"]):
    pred = query(LR_less_model, x, False)
    if less_data_sets["y_val"][i].detach().numpy() == pred:
        accuracy += 1.0

print("Accuracy for LR:",100*(accuracy/less_data_sets["y_val"].size(0)))

Accuracy for LR: 68.46091548894135


### Creating and running our neural network with new features and datasets

In [None]:
best_less_model = SBANeuralNet(9, hidden_dim, output_dim, less_data_sets["means"], less_data_sets["sds"], best_epochs)
less_optimizer = torch.optim.SGD(best_less_model.parameters(), lr=lr, momentum=0.9)

NN_train_loss = torch.nn.BCELoss(weight = torch.FloatTensor(less_data_sets["w_train"]))
NN_test_loss = torch.nn.BCELoss(weight = torch.FloatTensor(less_data_sets["w_test"]))

In [None]:
NN_train_and_test(best_less_model, less_optimizer, less_data_sets["x_train"], less_data_sets["x_test"], less_data_sets["y_train"], less_data_sets["y_test"], best_epochs)

In [None]:
accuracy = 0.0

for j, x in enumerate(less_data_sets["x_val"]):
    pred = query(best_less_model, x, False)
    if less_data_sets["y_val"][j].detach().numpy() == pred:
        accuracy += 1.0

print("Accuracy for NN:",100*(accuracy/less_data_sets["y_val"].size(0)))

Accuracy for NN: 80.4161105978168
