# Loan

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Descriptive Analysis

The _Loan_ dataset is used to predict the grade (score) of a loan application.

It includes:
- **loan-10k.lrn.csv**: training dataset
- **loan-10k.tes.csv**: test/validation dataset
- **loan-10k.sol.ex.csv**: sample solutions

In [22]:
datafolder = "../../datasets/Loan"

training_file = f"{datafolder}/loan-10k.lrn.csv"
test_file = f"{datafolder}/loan-10k.tes.csv"
solution_file = f"{datafolder}/loan-10k.sol.ex.csv"

training_df = pd.read_csv(training_file)
test_df = pd.read_csv(test_file)

training_df.head()


| | ID | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | emp_length | home_ownership | annual_inc | ... | debt_settlement_flag | issue_d_month | issue_d_year | earliest_cr_line_month | earliest_cr_line_year | last_pymnt_d_month | last_pymnt_d_year | last_credit_pull_d_month | last_credit_pull_d_year | grade |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24341 | 12500.0 | 12500.0 | 12500.0 | 36 months | 7.21 | 387.17 | < 1 year | MORTGAGE | 81000.0 | ... | N | 6 | 2018 | 6 | 2000 | 2 | 2019 | 2 | 2019 | A |
| 1 | 67534 | 33850.0 | 33850.0 | 33775.0 | 60 months | 20.99 | 915.57 | 1 year | MORTGAGE | 80000.0 | ... | N | 10 | 2015 | 9 | 1984 | 2 | 2019 | 2 | 2019 | E |
| 2 | 35080 | 10000.0 | 10000.0 | 10000.0 | 60 months | 20.0 | 264.94 | < 1 year | RENT | 36580.0 | ... | N | 9 | 2017 | 10 | 2006 | 1 | 2018 | 11 | 2018 | D |
| 3 | 4828 | 20250.0 | 20250.0 | 20250.0 | 36 months | 14.31 | 695.15 | 9 years | RENT | 48700.0 | ... | N | 0 | 2015 | 6 | 1996 | 6 | 2016 | 9 | 2017 | C |
| 4 | 59259 | 25000.0 | 25000.0 | 25000.0 | 36 months | 14.99 | 866.52 | 1 year | MORTGAGE | 85000.0 | ... | N | 11 | 2016 | 0 | 2002 | 2 | 2019 | 2 | 2019 | C |


In [None]:
print(f"""
The training dataset has the following properties:
    - Number of Instances: {training_df.shape[0]}
    - Number of Features: {training_df.shape[1]-1}
    - Number of Target Attributes: 1
    - Number of Dimensions: {training_df.ndim}
    - Has missing values: {'yes' if training_df.isnull().sum().sum() > 0 else 'no'}
    - Types of attributes: {training_df.dtypes.value_counts().to_dict()}
    - Duplicate Instances: {training_df.duplicated().sum()}
""")

print(f"""
The validation dataset has the following properties:
    - Number of Instances: {test_df.shape[0]}
    - Has missing values: {'yes' if test_df.isnull().sum().sum() > 0 else 'no'}
    - Duplicate Instances: {test_df.duplicated().sum()}
""")



The training dataset has the following properties:
    - Number of Instances: 10000
    - Number of Features: 91
    - Number of Target Attributes: 1
    - Number of Dimensions: 2
    - Has missing values: no
    - Types of attributes: {dtype('float64'): 69, dtype('O'): 14, dtype('int64'): 9}
    - Duplicate Instances: 0


The validation dataset has the following properties:
    - Number of Instances: 10000
    - Has missing values: no
    - Duplicate Instances: 0



### Attributes

In [24]:
print(f"{'Attribute':<25} {'Data Type':<10} {'Missing Values':<15} {'Min':<10} {'Max'}")
print("-" * 70)
for col in training_df.columns:
    if training_df[col].dtype in [np.int64, np.float64]:
        print(f"{col:<25} {str(training_df[col].dtype):<10} {training_df[col].isnull().sum():<15} {training_df[col].min():<10} {training_df[col].max()}")
    else:
        print(f"{col:<25} {str(training_df[col].dtype):<10} {training_df[col].isnull().sum():<15} {'n.a.':<10} {'n.a.':<10} {set(training_df[col].values)}") 

Attribute                 Data Type  Missing Values  Min        Max
----------------------------------------------------------------------
ID                        int64      0               0          99999
loan_amnt                 float64    0               1000.0     40000.0
funded_amnt               float64    0               1000.0     40000.0
funded_amnt_inv           float64    0               1000.0     40000.0
term                      object     0               n.a.       n.a.       {' 60 months', ' 36 months'}
int_rate                  float64    0               5.31       30.99
installment               float64    0               30.12      1717.63
emp_length                object     0               n.a.       n.a.       {'5 years', '2 years', '8 years', '6 years', '4 years', '9 years', '10+ years', '7 years', '3 years', '1 year', '< 1 year'}
home_ownership            object     0               n.a.       n.a.       {'ANY', 'OWN', 'RENT', 'OTHER', 'MORTGAGE'}
annual_inc 

#### Description & Data Types

- `ID` (_Integer_): Unique identifier of each record. It is used for reference only and is not a predictive feature.
- `grade` (_Object_): **Target attribute** indicating the loan score. The training dataset contains the grades A to G.

**Numerical attributes** (selection)
- `loan_amnt`, `funded_amnt`, `funded_amnt_inv`
- `int_rate`
- `installment`
- `annual_inc`
- `dti`
- `delinq_2yrs`
- `fico_range_low`, `fico_range_high`, `last_fico_range_high`, `last_fico_range_low`
- `inq_last_6mths`
- `open_acc`
- `pub_rec`
- `revol_bal`, `revol_util`
- `total_acc`
- `out_prncp`, `out_prncp_inv`
- `total_pymnt`, `total_pymnt_inv`, `last_pymnt_amnt`
- `total_rec_prncp`, `total_rec_int`, `total_rec_late_fee`
- `recoveries`
- `collection_recovery_fee`

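**Categorical attributes**:
- `term` (_ordinal_): Duration of the loan, either `36 months` or `60 months` (the raw values carry a leading space)
- `emp_length` (_ordinal_): Length of employment in years, from `< 1 year` to `10+ years`. TODO: check whether the distance between values is defined; the two end categories are open-ended bins.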
- `home_ownership`
- `verification_status`
- `loan_status`
- `pymnt_plan`
- `purpose`
- `addr_state`
- `initial_list_status`
- `hardship_flag`
- `disbursement_method`
- `debt_settlement_flag`
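
The two ordinal attributes could be turned into numbers roughly as follows. This is a minimal sketch on a toy frame, not the notebook's final preprocessing: the column names `term_months` and `emp_length_years` and the dictionary `emp_length_order` are hypothetical, and mapping `< 1 year` to 0 and `10+ years` to 10 is an assumption that treats the open-ended bins as if they were points on the same scale.

```python
import pandas as pd

# Toy stand-in for training_df with the two ordinal columns only.
toy_df = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],
    "emp_length": ["< 1 year", "10+ years", "3 years"],
})

# term: extract the number of months, ignoring the leading space and the unit.
toy_df["term_months"] = (
    toy_df["term"].str.extract(r"(\d+)", expand=False).astype(int)
)

# emp_length: explicit order of the eleven categories seen in the training data.
# Assumption: "< 1 year" -> 0 and "10+ years" -> 10.
emp_length_order = {
    "< 1 year": 0,
    "1 year": 1,
    **{f"{n} years": n for n in range(2, 10)},
    "10+ years": 10,
}
toy_df["emp_length_years"] = toy_df["emp_length"].map(emp_length_order)

print(toy_df[["term_months", "emp_length_years"]])
```

An explicit dictionary keeps the order visible and makes any unexpected category show up as `NaN` after `map`, instead of being silently assigned a code.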




## Visualisations

## Prepare & Split Dataset

- Separate target attribute from features 
- Remove `ID`

- Split dataset
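
The steps above can be sketched as follows, assuming scikit-learn's `train_test_split` is used for the split. The toy frame, the 50/50 split, the stratification by `grade`, and `random_state=42` are illustrative choices, not settings taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for training_df; the real frame has 92 columns.
toy_df = pd.DataFrame({
    "ID": range(14),
    "loan_amnt": [1000.0 * (i + 1) for i in range(14)],
    "int_rate": [5.0 + i for i in range(14)],
    "grade": list("ABCDEFG") * 2,
})

# Separate the target attribute from the features and remove ID.
y = toy_df["grade"]
X = toy_df.drop(columns=["ID", "grade"])

# Hold out part of the data for validation.
# stratify=y keeps the grade distribution similar in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)

print(X_train.shape, X_val.shape)
```

With the real data, a smaller `test_size` (for example 0.2) would be a more typical choice; 0.5 is used here only so that every grade fits into both parts of the tiny example.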

## Preprocessing

In [25]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder

### Missing Values & Duplicates

### Outliers

### Encoding

Options: one-hot encoding or label encoding.

Open question: does the target attribute need to be encoded?

### Correlation & Redundant Features

### Scale & Normalize Features

## Classification

## Evaluation