# Loan Repayment 
## Exploratory Data Analysis (EDA)
<br>
<font size='3'>
Problem statement:<br>
Banks, fintech companies, and other lending institutions need to rigorously analyze the credit score of every client. Mechanisms such as underwriting reports assess the fraud risk of a potential borrower. However, these risk-scoring mechanisms only inform whether or not to approve a given loan. A greater risk lies with funded loans where the borrower defaults (does not pay off the loan).
<br><br>
Hence, this loan repayment task matters for objectively assessing whether a funded borrower can actually pay off their loan. In essence, the task comes down to creating <b>predictive models that can accurately predict whether a funded borrower will pay off (True) or default (False).</b>
<br><br>
This notebook is the first part of the loan repayment task. It covers data transformation and exploration of the data for actionable insights. Data transformation includes removing unimportant records and columns, aggregating columns to create new parameter(s), and joining the data with other datasets to produce a set of parameters that is useful for storytelling and is later imported in the modelling section.
<br><br>
<b>Note: Please go through this notebook before reviewing modelling.ipynb</b>
</font>

In [1]:
import os
# Change directory to this file's directory
this_path = globals()['_dh'][-1]
os.chdir(this_path)
print("This file's directory:", os.getcwd())
# Change current path to parent of this file's directory
# to access all modules from parent
os.chdir('..')
source_path = os.getcwd()
print("Parent directory:", source_path)

This file's directory: /home/mattkhoo/Git-Loan-Repayment-EDA-Predictive/notebooks
Parent directory: /home/mattkhoo/Git-Loan-Repayment-EDA-Predictive


In [2]:
from main.data_loader import DataLoader
import main.viz_utils as viz
%matplotlib inline

## Load Data

In [3]:
# Load feather data
os.chdir(source_path)
feather_path = os.path.join(os.getcwd(), 'data-feathers')
# List every file found under the feather directory
for _, _, files in os.walk(feather_path):
    for file in files:
        print(file)

clarity_underwriting_variables
payment
loan


In [4]:
# Load loan data
loan_feather_path = os.path.join(feather_path, 'loan')
loan_manager = DataLoader()
loan_manager.load_feather(loan_feather_path)
# loan_manager.display(10, False)  # display last n rows

## Data Transformation on Loan 

In [5]:
# Drop unused columns
dropped = ['anon_ssn', 'applicationDate', 'originated', 'originatedDate', 'approved']
loan_manager.drop_column(dropped)
loan_manager.display_types()

loanId                               object
payFrequency                         object
apr                                 float64
nPaidOff                            float64
isFunded                              int64
loanStatus                           object
loanAmount                          float64
originallyScheduledPaymentAmount    float64
state                                object
leadType                             object
leadCost                              int64
fpStatus                             object
clarityFraudId                       object
hasCF                                 int64
dtype: object

In [6]:
loan_manager.data.isFunded.unique()

array([0, 1])

In [9]:
loan_manager.single_eqfilter('isFunded', 1)  # get all funded loans
loan_manager.drop_column(['isFunded'])  # now we can drop isFunded
loan_manager.data.shape

(38982, 13)

<font size="3"> Reasoning: The dropped columns are not useful for loan repayment analysis. Columns such as 'originated' and 'approved' describe stages that chronologically precede 'isFunded'. Since this is a loan repayment problem, <b>we only care about funded loans</b>. The date columns are dropped because the data does not record when each loan was actually funded or when it was originally due, so the punctuality of repayments cannot be measured. As for identifier columns, only primary/foreign keys are needed for later joins, so anon_ssn is removed. </font>
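For readers without access to the project's `DataLoader` class, the drop-and-filter steps above can be approximated in plain pandas. This is a minimal sketch under assumptions: the toy rows are made up for illustration, only a subset of the real columns is shown, and it is assumed that `drop_column` and `single_eqfilter` behave like `DataFrame.drop` and an equality mask respectively. The class's actual implementation is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the loan table; values are invented for illustration only.
loans = pd.DataFrame({
    "loanId": ["LL-1", "LL-2", "LL-3", "LL-4"],
    "anon_ssn": ["a", "b", "c", "d"],
    "approved": [True, False, True, True],
    "isFunded": [1, 0, 1, 0],
    "loanAmount": [500.0, 300.0, 800.0, 400.0],
})

# Drop columns that are not needed for repayment analysis
# (assumed equivalent of loan_manager.drop_column([...])).
loans = loans.drop(columns=["anon_ssn", "approved"])

# Keep only funded loans (assumed equivalent of single_eqfilter('isFunded', 1)).
funded = loans[loans["isFunded"] == 1]

# After filtering, isFunded is constant and carries no information, so drop it.
funded = funded.drop(columns=["isFunded"]).reset_index(drop=True)

print(funded.shape)
```

Filtering before dropping `isFunded` matters: once the column is gone there is nothing left to filter on, which is why the notebook applies the two calls in that order.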

In [None]:
# Load payment data
payment_feather_path = os.path.join(feather_path, 'payment')
payment_manager = DataLoader()
payment_manager.load_feather(payment_feather_path)
# payment_manager.display(10, False) # display last n rows

In [None]:
# Load underwriting data
undw_feather_path = os.path.join(feather_path, 'clarity_underwriting_variables')
undw_manager = DataLoader()
undw_manager.load_feather(undw_feather_path)
# undw_manager.display(10, False) # display last n rows