<div class = 'main-container'>
    <div class = 'inner_container'>
        <h1>Brief: Proton Bank</h1>
        <p>In this scenario, you work for Proton Bank, a FinTech organization that offers competitive mortgages.<br><br>Previously, the organization reviewed each application on a case-by-case basis, but as it has become more popular, it is struggling to keep up with demand.<br><br>You have been asked to use recent loan data to build a Logistic Regression model that predicts whether an application will be approved. The aim is to use this model to identify the key factors that lead to an application being rejected. This will allow the company to filter out overly risky applications automatically and devote more time to working with higher-quality applications.</p>
        <h2>Project Deliverables</h2>
        <p>By the end of the hackathon, you will present:</p>
        <ul>
            <li>A report containing your key insights on the factors that drive loan approvals and rejections, including the evidence needed to back up your claims.</li>
            <li>Recommendations for what Proton Bank should do next based on your findings.</li>
        </ul>
        <br>
    </div>
</div>

<div class = 'main-container'>
    <div class = 'inner_container'>
        <h2>Data Dictionary</h2>
        <br>
        <details>
            <summary><b class = "sol_text">Proton Bank data</b></summary>      
            <br>
            <table class = "tb">
                <tr class = "tr-head">
                    <th>Feature</th>
                    <th>Description</th>
                    <th>Expected Values</th>
                </tr>
                <tr class = "tr-main">
                    <td>Loan_ID</td>
                    <td>Unique ID for loan application</td>
                    <td>String</td>
                </tr>
                <tr class = "tr-main-alt">
                    <td>Married</td>
                    <td>Indicates if the applicant is married</td>
                    <td>String</td>
                </tr>
                <tr class = "tr-main">
                    <td>Dependents</td>
                    <td>How many dependents the applicant has</td>
                    <td>Integer</td>
                </tr>
                <tr class = "tr-main-alt">
                    <td>Education</td>
                    <td>Applicant's level of education</td>
                    <td>String</td>
                </tr>
                <tr class = "tr-main">
                    <td>Self_Employed</td>
                    <td>Indicates if the applicant is self-employed</td>
                    <td>String</td>
                </tr>
                <tr class = "tr-main-alt">
                    <td>ApplicantIncome</td>
                    <td>Applicant's annual income</td>
                    <td>Integer</td>
                </tr>
                <tr class = "tr-main">
                    <td>CoapplicantIncome</td>
                    <td>Co-applicant's annual income</td>
                    <td>Integer</td>
                </tr>
                <tr class = "tr-main-alt">
                    <td>LoanAmount</td>
                    <td>The amount the applicant has requested to borrow (in thousands)</td>
                    <td>Integer</td>
                </tr>
                <tr class = "tr-main">
                    <td>Loan_Amount_Term</td>
                    <td>How long the loan has been requested for</td>
                    <td>Integer</td>
                </tr>
                <tr class = "tr-main-alt">
                    <td>Credit_History</td>
                    <td>Indicates if the applicant's credit history is sufficient</td>
                    <td>Float</td>
                </tr>
                <tr class = "tr-main">
                    <td>Property_Area</td>
                    <td>Description of the area where the applicant lives</td>
                    <td>String</td>
                </tr>
                <tr class = "tr-main-alt">
                    <td>Loan_Status</td>
                    <td>Indicates if the loan has been approved</td>
                    <td>1 (approved) or 0 (rejected)</td>
                </tr>             
            </table>
        </details>
        <br>
    </div>
</div>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

loans = pd.read_csv('../Resources/Data/proton_loans.csv')

display(loans.head(5))

In [None]:
# view the distribution of approved vs rejected

fig, ax = plt.subplots(figsize = (12, 8))

grouped = loans.groupby('Loan_Status').count().reset_index()

# draw on the axes created above so the labels set below apply to this plot
sns.barplot(x = 'Loan_Status', y = 'Loan_ID', data = grouped, ax = ax)

ax.set_title('Loan Approvals vs Rejections')
ax.set_xlabel('Loan Status')
ax.set_ylabel('Count')
ax.set_xticklabels(['Rejected', 'Approved'])

plt.show()

In [None]:
# view the % approval for categorical data

# this plot uses a grid of subplots, so flatten the axes array and pair each axis with one variable
fig, ax = plt.subplots(figsize = (12, 8), ncols = 2, nrows = 3)

# define list of categorical variables
cat_cols = ['Married', 'Dependents', 'Education', 'Self_Employed',
            'Credit_History', 'Property_Area']

for col, axis in zip(cat_cols, ax.flatten()):
    
    # for each variable, count how many loans were approved (sum) and how many applications are in each category (count)
    # a list (rather than a set) keeps the order of the aggregated columns deterministic
    group = loans.groupby(col)['Loan_Status'].agg(['sum', 'count']).reset_index()
    # calculate the percentage approval for each category
    group['%_Approved'] = 100 * group['sum'] / group['count']
    
    # plot a bar chart for each
    sns.barplot(x = col, y = '%_Approved', data = group, ax = axis)
    axis.set_title(col)
    axis.set_ylabel('% Approved')

plt.suptitle('% Approved Per Predictor')
plt.tight_layout()
plt.show()

In [None]:
# looking at these plots, there appears to be a noticeable difference in % approval across the categories of:

# Married
# Education
# Credit_History
# Property_Area
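
The visual differences noted above could be backed up with a formal test. The following is a minimal sketch of a chi-square test of independence between one categorical predictor and `Loan_Status`. The `toy_loans` DataFrame and the `approval_chi2` helper are hypothetical stand-ins with made-up values, not part of the original notebook; in practice the real `loans` DataFrame would be passed in instead.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# hypothetical toy data standing in for the `loans` DataFrame (values are made up)
toy_loans = pd.DataFrame({
    'Credit_History': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'Loan_Status':    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
})

def approval_chi2(df, col, target = 'Loan_Status'):
    """Return the chi-square p-value for independence between col and target."""
    # contingency table of category vs approval outcome
    table = pd.crosstab(df[col], df[target])
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value

p = approval_chi2(toy_loans, 'Credit_History')
print('p-value: {:.4f}'.format(p))
```

A small p-value would suggest that approval rates differ between categories by more than chance alone would explain. With small category counts the test is unreliable, so the result should be read alongside the plots rather than instead of them.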

In [None]:
# plot box plot to investigate the spread of numerical factors against loan approval

fig, ax = plt.subplots(figsize = (12, 16), ncols = 2, nrows = 2)

num_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']

# flatten the 2x2 axes grid so each numerical variable gets its own subplot
for col, axis in zip(num_cols, ax.flatten()):
    
    sns.boxplot(x = 'Loan_Status', y = col, data = loans, ax = axis)
    
    axis.set_title(col)
    axis.set_xlabel('Loan Status')

plt.suptitle('Numerical Predictors by Loan Status')
plt.tight_layout()
plt.show()

In [None]:
# there doesn't appear to be much difference between approved and rejected loans for any of these numerical fields

In [None]:
# calculate total household income

loans['Household_Income'] = loans['ApplicantIncome'] + loans['CoapplicantIncome']

# calculate the loan relative to the total household income
# LoanAmount is in thousands, so convert it first; the result is a ratio, not multiplied by 100

loans['loan_%_of_household_income'] = (loans['LoanAmount'] * 1000)/loans['Household_Income']
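
The ratio above divides by `Household_Income`, which would produce infinite values for any application where both incomes are zero. The following is a minimal sketch of one way to guard against that, assuming such rows could exist in the data. The `toy_loans` DataFrame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical toy data standing in for the `loans` DataFrame (values are made up)
toy_loans = pd.DataFrame({
    'ApplicantIncome':   [5000, 0, 3000],
    'CoapplicantIncome': [1500.0, 0.0, 2000.0],
    'LoanAmount':        [120.0, 100.0, np.nan],
})

toy_loans['Household_Income'] = toy_loans['ApplicantIncome'] + toy_loans['CoapplicantIncome']

# treat a zero household income as missing so the division gives NaN instead of inf
safe_income = toy_loans['Household_Income'].replace(0, np.nan)
toy_loans['loan_%_of_household_income'] = (toy_loans['LoanAmount'] * 1000) / safe_income

# count the rows where the ratio could not be calculated
n_undefined = int(toy_loans['loan_%_of_household_income'].isna().sum())
print('Rows with an undefined ratio: {}'.format(n_undefined))
```

Rows with an undefined ratio would still need to be dropped or imputed before modelling, since `LogisticRegression` does not accept missing values.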

In [None]:
# plot

fig, ax = plt.subplots(figsize = (12, 8), ncols = 2)
  
sns.boxplot(x = 'Loan_Status', y = 'loan_%_of_household_income', data = loans, ax = ax[0])

ax[0].set_title('Loan as a Ratio of Total Annual Household Income')
ax[0].set_xlabel('Loan Status')

sns.boxplot(x = 'Loan_Status', y = 'Household_Income', data = loans, ax = ax[1])

ax[1].set_title('Total Household Income')
ax[1].set_xlabel('Loan Status')

plt.show()

In [None]:
# approved loans appear to come from applicants with higher household incomes
# additionally, these applications had loan amounts that were smaller relative to the household income

In [None]:
# define predictors 

X = loans[['Married', 'Education', 'Credit_History', 'Property_Area', 'loan_%_of_household_income', 'LoanAmount', 'Household_Income']]

# turn categorical variables into dummies

X_dummy = pd.get_dummies(X, columns = ['Married', 'Education', 'Credit_History', 'Property_Area'])

# define target

y = loans['Loan_Status']

# split into training and test data

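# fix random_state so the split (and therefore the reported accuracy) is reproducible
X_train, X_test, y_train, y_test = train_test_split(X_dummy, y, stratify = y, test_size = 0.2, random_state = 42)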

In [None]:
# build the model

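# the numerical predictors are unscaled, so allow more iterations for the solver to converge
log_reg = LogisticRegression(max_iter = 1000)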
log_reg.fit(X_train, y_train)
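
Because the numerical predictors sit on very different scales and may contain missing values, one option is to wrap the model in a pipeline. The following is a minimal sketch, assuming median imputation and standardization are acceptable choices here; it is not the approach used in the original notebook. The `toy_X` and `toy_y` data are randomly generated for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical, randomly generated data standing in for X_dummy and y
rng = np.random.default_rng(0)
n = 200
toy_X = pd.DataFrame({
    'Household_Income': rng.normal(6000, 2000, n),
    'LoanAmount': rng.normal(150, 50, n),
    'Credit_History_1.0': rng.integers(0, 2, n).astype(float),
})
toy_y = (toy_X['Credit_History_1.0'] + rng.normal(0, 0.5, n) > 0.5).astype(int)

# inject some missing values to mimic incomplete applications
toy_X.loc[::25, 'LoanAmount'] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, stratify = toy_y,
                                          test_size = 0.2, random_state = 42)

# impute, scale and fit in one object so the test data is transformed
# using statistics learned from the training data only
pipe = make_pipeline(SimpleImputer(strategy = 'median'),
                     StandardScaler(),
                     LogisticRegression(max_iter = 1000))
pipe.fit(X_tr, y_tr)

toy_score = pipe.score(X_te, y_te)
print('Pipeline accuracy on toy data: {:.3f}'.format(toy_score))
```

Standardizing the predictors also puts the fitted coefficients on a common scale, which makes them easier to compare when looking for the factors with the largest effect.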

In [None]:
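# compare test accuracy against a baseline that always predicts the most common class in the test set
score = log_reg.score(X_test, y_test)
baseline = y_test.value_counts(normalize = True).max()

print('Model Accuracy: {:.3f}, Baseline Accuracy: {:.3f}'.format(score, baseline))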
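
Since the aim is to filter out risky applications, accuracy alone may hide how well the model identifies rejections. The following is a minimal sketch of a per-class breakdown using a confusion matrix. The `y_true` and `y_pred` lists are made-up labels for illustration; in practice they would be `y_test` and `log_reg.predict(X_test)`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# hypothetical labels (0 = rejected, 1 = approved), made up for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

# rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred, labels = [0, 1])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

# treat "rejected" (0) as the class of interest
rejected_recall = recall_score(y_true, y_pred, pos_label = 0)
rejected_precision = precision_score(y_true, y_pred, pos_label = 0)

print(cm)
print('Recall for rejections: {:.2f}'.format(rejected_recall))
print('Precision for rejections: {:.2f}'.format(rejected_precision))
```

Recall for rejections shows what share of truly risky applications the model catches, while precision shows how many of the applications it flags would actually have been rejected.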

In [None]:
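# check effects
# each coefficient is the change in the log-odds of approval per unit increase in the factor,
# so coefficients for the unscaled numerical factors are not directly comparable with those for the dummies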

effects = log_reg.coef_[0]
factors = X_dummy.columns

effect_dic = {'Factor' : factors, 'Effect' : effects}

effect_df = pd.DataFrame(effect_dic).set_index('Factor').sort_values(by = 'Effect')

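fig, ax = plt.subplots(figsize = (12, 12))

ax.barh(effect_df.index, width = effect_df['Effect'])

ax.set_title('Logistic Regression Coefficients')
ax.set_xlabel('Effect (change in log-odds of approval)')
ax.set_ylabel('Factor')

plt.show()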
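
Logistic regression coefficients are on the log-odds scale, which can be hard to explain in a report. The following is a minimal sketch of converting them to odds ratios by exponentiating. The `toy_effects` values are made up for illustration; in practice `effect_df['Effect']` would be used instead.

```python
import numpy as np
import pandas as pd

# hypothetical coefficients standing in for effect_df['Effect'] (values are made up)
toy_effects = pd.Series({
    'Credit_History_1.0': 1.2,
    'Married_Yes': 0.3,
    'Property_Area_Rural': -0.4,
}, name = 'Effect')

# exp(coefficient) is the multiplicative change in the odds of approval
# for a one-unit increase in the factor, holding the others fixed
odds_ratios = np.exp(toy_effects).rename('Odds_Ratio').sort_values()

print(odds_ratios)
```

An odds ratio above 1 means the factor is associated with higher odds of approval, and one below 1 with lower odds. For dummy variables, the comparison is between being in that category and not being in it.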