# **<center>EDA</center>**

In [1]:
import os

from dotenv import load_dotenv

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from minisom import MiniSom

from tqdm import tqdm

In [2]:
plt.style.use('dark_background')

In [3]:
# Load in the dotenv variables
load_dotenv()

project_path = os.getenv('Project_Path')[2:78]

# Change the notebook directory back one so that it can access the data
os.chdir(project_path)
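
The slice `[2:78]` above only works while the stored string keeps exactly the same length. A minimal sketch of a less brittle alternative follows, assuming the slice is there to trim stray characters such as quotes around the value (the notebook does not say why it is used). The helper name `resolve_project_path` is hypothetical.

```python
import os
from pathlib import Path


def resolve_project_path(var_name: str = "Project_Path") -> Path:
    """Read a directory path from an environment variable and validate it."""
    raw = os.getenv(var_name)
    if raw is None:
        raise KeyError(f"{var_name} is not set; check the .env file")
    # Assumption: the only debris around the value is whitespace or quotes
    path = Path(raw.strip().strip("'\"")).expanduser()
    if not path.is_dir():
        raise NotADirectoryError(f"{path} is not a directory")
    return path
```

With this helper, the cell could call `os.chdir(resolve_project_path())` after `load_dotenv()`, and a missing or malformed variable would fail with a clear message instead of a confusing path error.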

In [4]:
data = pd.read_csv('./data/interim/wrangled', low_memory = False)
y = pd.read_csv('./data/raw/loan.csv',low_memory = False)['loan_status']

In [5]:
data.head(3)

Unnamed: 0,annual_inc,pymnt_plan,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,pub_rec,revol_bal,initial_list_status,last_pymnt_amnt,...,emp_type_Clerk,emp_type_Designer,emp_type_Director,emp_type_Education,emp_type_Executive,emp_type_Healer,emp_type_Manager,emp_type_Technical,emp_type_Unemployed,emp_type_Vol
0,-0.789014,-0.003357,-0.364672,-1.771866,0.305877,0.975849,-0.335522,-0.145932,-0.97077,-0.415561,...,-0.077264,-0.052174,-0.196151,-0.185922,-0.180733,-0.244724,-0.454611,-0.390656,4.030308,-0.172132
1,-0.696292,-0.003357,-0.364672,0.143584,4.312132,0.975849,-0.335522,-0.679268,-0.97077,-0.426398,...,-0.077264,-0.052174,-0.196151,-0.185922,-0.180733,-0.244724,-0.454611,-0.390656,-0.24812,-0.172132
2,-0.970563,-0.003357,-0.364672,0.491479,1.307441,0.975849,-0.335522,-0.622684,-0.97077,-0.315809,...,-0.077264,-0.052174,-0.196151,-0.185922,-0.180733,-0.244724,-0.454611,-0.390656,4.030308,-0.172132


In [7]:
# Loop over every column in the wrangled data
for col in data.columns:
    # Print the column names in bullet format to make the next section easier
    print(f"- {col}")

- annual_inc
- pymnt_plan
- delinq_2yrs
- earliest_cr_line
- inq_last_6mths
- mths_since_last_delinq
- pub_rec
- revol_bal
- initial_list_status
- last_pymnt_amnt
- last_credit_pull_d
- collections_12_mths_ex_med
- mths_since_last_major_derog
- dti_joint
- tot_cur_bal
- il_util
- max_bal_bc
- inq_fi
- home_ownership_OTHER
- home_ownership_RENT
- verification_status_Source Verified
- verification_status_Verified
- purpose_home_improvement
- purpose_house
- purpose_major_purchase
- purpose_small_business
- purpose_wedding
- emp_type_Accountant
- emp_type_Admin
- emp_type_Analyst
- emp_type_Assistant
- emp_type_Clergy
- emp_type_Clerk
- emp_type_Designer
- emp_type_Director
- emp_type_Education
- emp_type_Executive
- emp_type_Healer
- emp_type_Manager
- emp_type_Technical
- emp_type_Unemployed
- emp_type_Vol


- annual_inc
- pymnt_plan
- delinq_2yrs
- earliest_cr_line
- inq_last_6mths
- mths_since_last_delinq
- pub_rec
- revol_bal
- initial_list_status
- last_pymnt_amnt
- last_credit_pull_d
- collections_12_mths_ex_med
- mths_since_last_major_derog
- dti_joint
- tot_cur_bal
- il_util
- max_bal_bc
- inq_fi
- home_ownership: Type of home ownership
- verification_status: Whether the information, or the source of the information, has been verified. One-hot encoded, so it spans multiple columns.
- purpose: The reason for the loan. One-hot encoded, so it spans multiple columns.
- emp_type: The type of employment that the borrower has. One-hot encoded, so it spans multiple columns.
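
To illustrate how a single categorical column turns into several one-hot columns, here is a small sketch on toy data using `pd.get_dummies`. The toy values and the use of `drop_first=True` are assumptions for illustration; the notebook does not show how the wrangling step actually encoded these columns.

```python
import pandas as pd

# Toy data, not taken from the loan file
toy = pd.DataFrame({
    "annual_inc": [40000, 85000, 62000],
    "home_ownership": ["RENT", "MORTGAGE", "OTHER"],
})

# Each category becomes its own 0/1 column named "<column>_<category>".
# drop_first=True drops the first category (alphabetically, MORTGAGE here)
# so that it acts as the baseline.
encoded = pd.get_dummies(toy, columns=["home_ownership"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
# ['annual_inc', 'home_ownership_OTHER', 'home_ownership_RENT']
```

A row with all dummy columns equal to 0 then belongs to the dropped baseline category.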

## **Basic Analysis**

While it may be possible to intuit the meaning of many columns from their names, it is important to define all of them.

- emp_length: The number of years that the borrower has been employed. 10 is the maximum, even for values over 10.
- annual_inc: The income of the borrower per year. This also includes joint income if another person is contributing to the application.
- delinq_2yrs: The number of delinquencies the borrower has had in the last 2 years.
- inq_last_6mths: The number of inquiries in the last 6 months, excluding auto and mortgage inquiries.
- pub_rec: The number of derogatory public records that the borrower has.
- revol_bal: Total balance for the borrower's revolving credit line.
- initial_list_status: The type of loan. The values are 0 and 1.
- last_pymnt_amnt: The amount that the borrower paid the last time a payment was made.
- tot_coll_amt: Total collection amounts ever owed.
- tot_cur_bal: Total current balance of all accounts.
- il_util: Ratio of tot_cur_bal to the credit limit on installment accounts.
- max_bal_bc: Maximum balance owed on all revolving accounts.
- home_ownership_RENT: Dummy variable from home ownership that is 1 when the borrower is renting.
- verification_status_Verified: Dummy variable for verification status. Indicates whether the income was verified by Lending Club.
- purpose_small_business: Dummy variable. 1 if the borrower took out the loan for their small business.
- emp_type_Manager: 1 if the borrower's profession was manager. I may have missed some if their description was vague.
- reason_Debt_Consolidation: Dummy variable that indicates whether the borrower requested the loan to clean up their debts. Presumably this person is in a different situation from someone who is taking on a new project and going into debt for the first time.
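
The cap on `emp_length` described above can be sketched as follows. This assumes the raw column holds strings such as `"10+ years"`, `"3 years"`, and `"< 1 year"`, and that `"< 1 year"` should count as 0; neither assumption is confirmed by this notebook, and `parse_emp_length` is a hypothetical helper name.

```python
import pandas as pd


def parse_emp_length(raw: pd.Series) -> pd.Series:
    """Convert employment-length strings to years, capped at 10."""
    text = raw.astype(str)
    # Take the first run of digits: "10+ years" -> 10, "3 years" -> 3
    years = pd.to_numeric(text.str.extract(r"(\d+)", expand=False), errors="coerce")
    # Assumption: "< 1 year" is treated as 0 years
    years = years.mask(text.str.contains("<"), 0)
    # Anything above 10 is capped at 10, matching the column definition
    return years.clip(upper=10)


toy_lengths = pd.Series(["10+ years", "< 1 year", "3 years", None])
parsed = parse_emp_length(toy_lengths)
```

Entries without any digits, such as missing values, come out as `NaN` and would still need to be handled separately.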


In [16]:
# Loop over every distinct loan status
for status in y.unique():
    # Print the loan statuses in bullet format to make the next section easier
    print(f"- {status}")

- Fully Paid
- Charged Off
- Current
- Default
- Late (31-120 days)
- In Grace Period
- Late (16-30 days)
- Does not meet the credit policy. Status:Fully Paid
- Does not meet the credit policy. Status:Charged Off
- Issued


- **Fully Paid:** The loan has been fully repaid on time in its entirety
- **Charged Off:** The loan has not been fully repaid and has been written off as not fully recoverable
- **Current:** The borrower is up to date on all the payments but the term of the loan has not expired and there are payments remaining
- **Default:** The late period has passed and Lending Club has started the process of charging off the loan
- **Late (31-120 days):** The borrower is late on a payment by 31-120 days
- **In Grace Period:** The borrower is within the grace period for a payment
- **Late (16-30 days):** The borrower is late on their payment by 16-30 days
- **Does not meet the credit policy. Status:Fully Paid:** Lending Club has a standard for credit scores that its applicants must meet. If they don't, they cannot continue the loan. These borrowers did not have a high enough credit score but still managed to fully pay back their loan.
- **Does not meet the credit policy. Status:Charged Off:** Similar to the above borrowers, these people did not meet the minimum credit requirements. Unfortunately, they were not able to pay back their loan and the loan was charged off.
- **Issued:** The loan has been issued but no payments have been made yet.
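
With the statuses defined, a natural next check is how often each one occurs. The sketch below counts statuses and computes their share on a small toy series; `status_summary` is a hypothetical helper, and the toy values are made up rather than taken from `loan.csv`.

```python
import pandas as pd


def status_summary(statuses: pd.Series) -> pd.DataFrame:
    """Return the count and relative share of each loan status."""
    counts = statuses.value_counts()
    share = (counts / counts.sum()).round(3)
    return pd.DataFrame({"count": counts, "share": share})


# Toy data for illustration only
toy_status = pd.Series(["Fully Paid", "Current", "Current", "Charged Off", "Current"])
summary = status_summary(toy_status)
print(summary)
```

In the notebook, the same helper could be applied to `y` to see how imbalanced the loan statuses are before any modelling.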