TODO:
* Function to remove a set of columns / rows and check they've been removed by returning a bool
* Can do PCA, but note that for financial loan data it is very hard to keep explainable. Can maybe save it for later
* Can do stratify Y
* Write a helper func to display the data dict
* Verify that dropping hardship loans works

# Exploratory Data Analysis and Cleaning

## Date: OCT 10, 2023

--------------------------


## Introduction

This notebook cleans the LendingClub accepted loans data, then exports it as a Parquet file. Due to the size of the dataset, the CSV is read in chunks, with a random sample taken from each chunk. Only fully paid and charged off / defaulted loans are sampled, as current loans hold no value in classifying the target variable. Those samples are merged and become the working dataset for the duration of the project. After unnecessary and leaky features are removed, the remaining features are formatted and null values are dealt with. Finally, the dataframe's size is optimized and it is exported.

### Table-of-contents


1. [Introduction](#Introduction)
   - [Table-of-contents](#Table-of-contents)
   - [Import-Librarys](#Import-Librarys)
   - [Data Dictionary](#Data-Dictionary)
   - [Define-Functions](#Define-Functions)
   - [Load in the data](#Load-the-data)
2. [Data Cleaning](#Data-Cleaning)
   - [Initial Exploration](#Initial-Exploration)
   - [Feature Pruning](#Feature-Pruning)
   - [Explore Columns to drop](#Explore-Columns-to-drop)
   - [Dataframe Null Values](#Dataframe-Null-Values)
3. [Dataframe optimization](#Dataframe-optimization)
4. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
5. [Feature Engineering](#Feature-Engineering)
6. [Conclusion](#Conclusion)


### Import-Librarys

In [None]:
#%pip install pandas-downcast

In [None]:
#%pip install missingno

In [None]:
#%run helpers.ipynb

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#from helpers import full_display
#pdcast and missingno are used further down (install them with the %pip cells above if needed)
import pdcast as pdc
import missingno as msno

from pathlib import Path

### Data-Dictionary

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [None]:
#pathlib is used to ensure compatibility across operating systems
try:
    data_destination = Path('../Data/Lending_club/Lending Club Data Dictionary Approved.csv')
    dict_df = pd.read_csv(data_destination, encoding='ISO-8859-1')
    display(dict_df.iloc[:,0:2])
except FileNotFoundError as e:
    print(e.args[1])
    print('Check file location')

#### Define-Functions

When initially loading the dataset, Pandas raised a DtypeWarning over mixed data types within various columns. Setting low_memory=False while reading the CSV in chunks lets Pandas load an entire chunk before inferring the data types. Once the script that scrapes the data dictionary is finished, explicit data types can be passed in instead of relying on Pandas' inference. The mixed_data_types function is still called as a sanity check.

In [None]:
def mixed_data_types(df:pd.DataFrame) -> bool:
    '''
    Takes in a dataframe and checks for columns with mixed data types
    If none are found return False, else True
    
    :param df: The dataframe to be checked
    :type df: obj
    :return bool: True if found, false if none were found
    :type return: bool
    '''
    
    #loop through each column
    for column in df:

        #filter out NaN values (which are floats) and get the unique data types
        unique_types = df[column].dropna(inplace=False).apply(type).unique()

        #if there are more than 1 datatype in a column
        if unique_types.size > 1:
            return True
    return False
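
The idea of passing explicit data types instead of relying on inference can be sketched as below. This is a minimal illustration on a toy in-memory CSV, not the real loan file; the `column_dtypes` mapping is a hypothetical stand-in for whatever the data dictionary script eventually produces.

```python
import io
import pandas as pd

# Toy CSV standing in for the loan file: the 'id' column mixes numbers and text
csv_text = "id,loan_amnt\n1001,5000\n1002,7500\nTotal rows,\n"

# Hypothetical dtype mapping, as might be derived from the data dictionary
column_dtypes = {"id": "string", "loan_amnt": "float64"}

typed_df = pd.read_csv(io.StringIO(csv_text), dtype=column_dtypes)

# Count the distinct Python types among the non-null values of each column
types_per_column = {
    column: len({type(value) for value in typed_df[column].dropna()})
    for column in typed_df
}

print(typed_df.dtypes)
print(types_per_column)
```

With the types fixed up front, each column holds a single type, so a check like mixed_data_types should have nothing to report.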

In [None]:
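def optimize_column(df:pd.DataFrame) -> pd.DataFrame:
    '''
    Takes in a dataframe and returns a copy with the smallest numeric datatype for each column
    Example int64 -> int32
    '''
    optimized = df.copy()
    for column in optimized:
        #downcast integer and float columns; other datatypes are left unchanged
        if pd.api.types.is_integer_dtype(optimized[column]):
            optimized[column] = pd.to_numeric(optimized[column], downcast='integer')
        elif pd.api.types.is_float_dtype(optimized[column]):
            optimized[column] = pd.to_numeric(optimized[column], downcast='float')
    return optimized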




In [None]:
def drop_row(df):
    '''
    takes in some rows and then drops those rows. Checks rows have been dropped
    '''
    pass
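
One possible shape for the drop-and-check helper described in the TODO is sketched below. The name `drop_and_verify` and its signature are hypothetical, and the example runs on a toy dataframe rather than the loan data.

```python
import pandas as pd

def drop_and_verify(df: pd.DataFrame, rows=None, columns=None) -> bool:
    """Drop the given row labels and/or columns in place.

    Returns True if none of them remain in the dataframe afterwards.
    """
    rows = list(rows) if rows is not None else []
    columns = list(columns) if columns is not None else []

    df.drop(index=rows, columns=columns, inplace=True)

    # Check that no dropped label survives in the index or the columns
    rows_gone = not df.index.isin(rows).any()
    columns_gone = not df.columns.isin(columns).any()
    return rows_gone and columns_gone

# Toy dataframe used only to exercise the helper
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=[10, 11, 12])
removed = drop_and_verify(toy_df, rows=[11], columns=["b"])
print(removed, toy_df.shape)
```

Dropping in place and returning a bool keeps the call site short; note that DataFrame.drop raises a KeyError if a label is missing, which may or may not be the desired behaviour here.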

#### Load the data

Due to the size of the dataset, it is read in chunks. After each chunk is read and checked for mixed data types, it is filtered and randomly sampled, and the sample is placed in a list. Only fully paid, defaulted and charged off loans are taken, as current loans (including late or in-grace-period loans) do not hold any value in predicting the target variable. This filtering is done while loading the data, otherwise the data becomes too large for memory. The samples are then combined into a single sample representative of the whole dataset. EDA will be performed on this single sample.

In [None]:
chunk_size = 5*100000
sample_size =  100000
random_state = 11

assert sample_size < chunk_size, f"Cannot take a sample of {sample_size} rows out of {chunk_size} rows"

print(f'Chunk size: {chunk_size} rows')
print(f'Rows to be sampled: {sample_size} rows')


#define a list that includes only finished loan statuses
finished_loan_status = ['Fully Paid',
                        'Charged Off',
                        'Does not meet the credit policy. Status:Fully Paid',
                        'Does not meet the credit policy. Status:Charged Off',
                        'Default']

sampled_dataframes = []
try:
    data_destination = Path('../Data/Lending_club/accepted_2007_to_2018Q4.csv')

    #split the csv into chunks and iterate over each chunk
    with pd.read_csv(data_destination, chunksize=chunk_size, low_memory = False) as reader:
        for count,chunk in enumerate(reader):
            
            if mixed_data_types(df=chunk):
                raise Exception("Mixed data types found")

            #filter the dataframe for loans that are finished or null
            filtered_chunk = chunk.loc[chunk['loan_status'].isin(finished_loan_status) | chunk['loan_status'].isnull()]

            #sample the filtered df (capped at the rows left after filtering) and append to list
            rows_to_sample = min(sample_size, len(filtered_chunk))
            sampled_df = filtered_chunk.sample(n=rows_to_sample, random_state=random_state)
            sampled_dataframes.append(sampled_df)
            
            print(f"{count} sampled dataframe shape: {sampled_df.shape}")
        print('Finished')

except FileNotFoundError as e:
    print(e.args[1])
    print('Check file name and location')
    
except Exception as e:
    #generic exceptions may carry a single argument only, so print the exception itself
    print(e)

No mixed data types were found within any column, so the random samples can be combined into a single sample dataframe. This sample will be used as the working dataset.

In [None]:
sample_accepted_df = pd.concat(sampled_dataframes, ignore_index=False)

&nbsp;

## Data Cleaning

### Initial Exploration

***Display the first 5 rows*** 

In [None]:
sample_accepted_df.head(5)

***Dataframe shape***

In [None]:
rows, columns = sample_accepted_df.shape
print(f'Dataframe rows: {rows}')
print(f'Dataframe columns: {columns}')

***Dataframe info***

In [None]:
sample_accepted_df.info()

Of the 151 columns, 113 are float64 and 38 are objects. The dataframe takes up approximately 580 MB.
Note:
- All numeric columns are float64 and the remaining columns are objects. These columns can be optimized later by changing the datatypes, saving memory and decreasing computation time.
- There is no datetime column

***Describe Dataframe***

In [None]:
sample_accepted_df.describe()

Some key points:

- Loan Amount
  
    - Average loan amount is ~15,000 USD with a standard deviation of 9,240 USD, a maximum of 40,000 USD and a minimum of 500 USD. This follows LendingClub's policies for minimum and maximum loan amounts.

- Funded amount
    - Nearly identical to the loan amount

- Funded amount by investors
    - Very similar to the funded amount

- Interest Rate
    - The interest rates are quite high. An average of 13%, with a minimum of 5.3% and a maximum of 31%.


   

### Feature Pruning

We will exclude any leaky features, irrelevant features and any features that were not present in the original loan application, focusing first on dropping irrelevant columns.

***Hardship Loans***

Hardship loans make up a very small proportion of the dataset, and add 15 columns of complexity. We will drop these columns and loans if they exist in our dataset, and limit our analysis to non hardship loans.

In [None]:
#fetch the value counts for the hardship flags
hardships = sample_accepted_df['hardship_flag'].value_counts()
display(hardships)

#if there are loans with the yes hardship flag
if 'Y' in hardships:
    #get the count of hardship loans by label rather than by position
    yes_hardship_count = hardships['Y']
    print(f'The hardship loans represent only {(yes_hardship_count/sample_accepted_df.shape[0])*100:.2f}% of the dataset')

    #get the index of the hardship loans
    rows_to_remove = sample_accepted_df.loc[sample_accepted_df['hardship_flag'] == 'Y'].index

    #drop the loans
    sample_accepted_df.drop(rows_to_remove, inplace=True)

    #check the rows have been dropped
    assert 'Y' not in sample_accepted_df['hardship_flag'].value_counts()
    print('Hardship loans have been dropped')

else:
    print('There are no hardship loans.')
    
columns_to_drop = ['hardship_flag', 'hardship_type',
                        'hardship_reason', 'hardship_status',
                        'hardship_amount', 'hardship_start_date',
                        'hardship_end_date', 'deferral_term',
                        'hardship_length', 'hardship_dpd',
                        'hardship_loan_status', 'payment_plan_start_date',
                        'orig_projected_additional_accrued_interest', 'hardship_payoff_balance_amount',
                        'hardship_last_payment_amount']
#drop in place, otherwise the result is discarded and the columns remain
sample_accepted_df.drop(columns = columns_to_drop, inplace=True)
print('Hardship columns have been dropped')

***Employee Title***

In [None]:
sample_accepted_df['emp_title'].value_counts()

There are too many unique Employee titles to attempt any sort of grouping or encoding for now. Possibly in the future we could use NLP or an external API to group the Employee Title.

In [None]:
sample_accepted_df.drop(columns = 'emp_title', inplace=True)

***Loan Status***

Any current loans have already been dropped when reading in the data. We can now finish grouping the completed loans.

More information on the loan statuses can be found here:  
https://www.lendingclub.com/help/investing-faq/what-do-the-different-note-statuses-mean

In [None]:
sample_accepted_df['loan_status'].value_counts()

The "Does not meet the credit policy" statuses mark loans that were made under an earlier credit policy and do not meet the current one. This has no effect on the loans themselves, so they can be grouped with their counterparts. Charged off and defaulted loans can also be grouped together.

In [None]:
status_mapping = {
    "Fully Paid": "Fully Paid",
    "Does not meet the credit policy. Status:Fully Paid": "Fully Paid",
    "Does not meet the credit policy. Status:Charged Off": "Charged Off/Default",
    "Charged Off": "Charged Off/Default",
    "Default": "Charged Off/Default",
}

#map the loans
sample_accepted_df['loan_status'] = sample_accepted_df['loan_status'].map(status_mapping)

Check the mapping has worked:

In [None]:
sample_accepted_df['loan_status'].value_counts()
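
Beyond eyeballing the value counts, the mapping can also be checked for statuses it does not cover, since `Series.map` with a dict turns any unmapped value into NaN. The snippet below is a small sketch on toy data; the shortened `toy_mapping` and the 'Current' value are illustrative only.

```python
import pandas as pd

# Toy statuses, including one ('Current') that the mapping does not cover
toy_statuses = pd.Series(["Fully Paid", "Default", "Current", "Charged Off"])

toy_mapping = {
    "Fully Paid": "Fully Paid",
    "Charged Off": "Charged Off/Default",
    "Default": "Charged Off/Default",
}

mapped = toy_statuses.map(toy_mapping)

# Values that were present before mapping but became NaN afterwards
unmapped = toy_statuses[mapped.isnull() & toy_statuses.notnull()].unique().tolist()
print(unmapped)
```

An empty list would mean every non-null status was covered by the mapping. On the real dataframe the check has to run against a copy of the column taken before it is overwritten.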

The mapping was successful; we are now left with only successful and failed loans.

***State / Zip Code***

We have 2 geographical features. We will drop both of them for now as they would add too much complexity to the model. However, in the future we could perhaps use a third-party API to introduce mean or median income data by region, allowing us to capture some of that geographical information.

In [None]:
display(sample_accepted_df['addr_state'].value_counts())
print('-'*20)
display(sample_accepted_df['zip_code'].value_counts())

In [None]:
#drop both geographical features, as described above
sample_accepted_df.drop(columns = ['addr_state', 'zip_code'], inplace=True)

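***FICO scores***

We can drop the last_fico_range_high and last_fico_range_low columns. They hold the borrower's most recently pulled FICO range, which is recorded after the loan was issued and would leak information into the model.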

In [None]:
sample_accepted_df['last_fico_range_high'].value_counts()

In [None]:
sample_accepted_df.drop(columns = ['last_fico_range_high','last_fico_range_low'], inplace=True)

***Description***

In [None]:
display(sample_accepted_df['desc'].value_counts())

There are too many unique descriptions to create dummy variables. We can drop this column.

In [None]:
sample_accepted_df.drop(columns = ['desc'], inplace=True)

***Leaky columns***

In [None]:
sample_accepted_df.info()

We can remove any columns that:  
- describe payments made toward the loan

In [None]:
sample_accepted_df.drop(columns = ['total_pymnt', 'total_rec_prncp',
                       'total_rec_int', 'total_rec_late_fee',
                       'recoveries', 'collection_recovery_fee',
                       'last_pymnt_d', 'last_pymnt_amnt'], inplace=True)

- describe loan attributes after acceptance

### Feature engineering

***Term***

***CONVERT TO JUST 36 AND 60 LIKE INSTRUCTOR SAID***

Convert from str to int

In [None]:
sample_accepted_df['term'].value_counts()
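
The str to int conversion could look something like the sketch below. It assumes the term values are strings of the form ' 36 months' and ' 60 months', and it runs on a toy series rather than the loan dataframe.

```python
import pandas as pd

# Toy values in the assumed ' 36 months' / ' 60 months' format
term_raw = pd.Series([" 36 months", " 60 months", " 36 months"], name="term")

# Pull out the digits and cast them to int
term_months = term_raw.str.extract(r"(\d+)", expand=False).astype(int)
print(term_months.tolist())
```

Extracting the digits with a regex avoids depending on the exact whitespace around the number.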

Remove columns that leak from the future, i.e. features about the loan recorded after it has been issued

### Dataframe-Null-Values

------------------------------------------

In [None]:
pd.set_option('display.max_rows', None)

In [None]:
sample_accepted_df.isnull().sum().sort_values(ascending=False)

In [None]:
sample_accepted_df.isnull().sum()/sample_accepted_df.shape[0]*100

Note how there seem to be groupings of nulls. We will explore these groupings.

In [None]:
msno.bar(sample_accepted_df)
plt.show()
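
One way to make the null groupings explicit is to bucket columns by their null count, since columns that share an identical count are likely missing for the same rows. The sketch below uses a toy dataframe; the column names are borrowed from the dataset but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy dataframe: the two hardship columns are null together
toy_df = pd.DataFrame({
    "loan_amnt": [1000, 2000, 3000, 4000],
    "hardship_type": [np.nan, np.nan, np.nan, "A"],
    "hardship_reason": [np.nan, np.nan, np.nan, "B"],
    "annual_inc": [50000, np.nan, 60000, 70000],
})

null_counts = toy_df.isnull().sum()

# Group column names by their null count, skipping fully populated columns
null_groups = {}
for column, count in null_counts[null_counts > 0].items():
    null_groups.setdefault(int(count), []).append(column)

print(null_groups)
```

Matching counts are only a hint; confirming that the nulls fall on the same rows would still need a row-level comparison.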

We can drop columns that are linked to LendingClub's internal tracking of the loans.

In [None]:
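#drop the tracking columns directly; extending columns_to_drop alone does not remove anything
sample_accepted_df.drop(columns = ['member_id','url'], inplace=True)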

&nbsp;

***Explore the groupings of nulls***

We will start with the smallest

In [None]:
null_rows = sample_accepted_df[sample_accepted_df['revol_bal'].isnull()]
print('Number of null rows in revol_bal: ', sample_accepted_df['revol_bal'].isnull().sum())
display(null_rows)

We can remove these null entries.

In [None]:
sample_accepted_df.dropna(subset=['revol_bal'], inplace=True)

In [None]:
sample_accepted_df['revol_bal'].isnull().sum()

***Annual income***

In [None]:
null_row = sample_accepted_df[sample_accepted_df['annual_inc'].isnull()]
null_row

***Why I removed the row***

In [None]:
sample_accepted_df.dropna(subset=['annual_inc'], inplace=True)

In [None]:
sample_accepted_df['annual_inc'].isnull().sum()

***Total acc***

In [None]:
null_row = sample_accepted_df[sample_accepted_df['total_acc'].isnull()]
null_row

&nbsp;

### Explore Columns to drop

### Dataframe-optimization

Since we

In [None]:
#info() prints its report directly, so it does not need wrapping in print()
sample_accepted_df.info()
sample_accepted_df = pdc.downcast(sample_accepted_df)
sample_accepted_df.info()
# Infer minimum schema for DataFrame.
schema = pdc.infer_schema(sample_accepted_df)
print(schema)
sample_accepted_df.shape

TODO: Optimize column datatypes to reduce code runtime and increase memory efficiency
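
As a rough sketch of what the datatype optimization could involve using plain pandas, the example below converts a low-cardinality text column to `category`, downcasts a float column, and compares memory before and after. The toy dataframe and its column names are illustrative assumptions, not the real working dataset.

```python
import pandas as pd

# Toy dataframe with a repetitive text column and a float64 column
toy_df = pd.DataFrame({
    "grade": ["A", "B", "C", "A"] * 250,
    "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0] * 250,
})

before = toy_df.memory_usage(deep=True).sum()

optimized = toy_df.copy()
# Few distinct values: store integer codes plus a small lookup table
optimized["grade"] = optimized["grade"].astype("category")
# Downcast to the smallest float type that can hold the values
optimized["loan_amnt"] = pd.to_numeric(optimized["loan_amnt"], downcast="float")

after = optimized.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

Converting to `category` pays off only when a column has few distinct values relative to its length, so high-cardinality text columns are better left alone.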

### Exploratory-Data-Analysis

Explore the relationship between interest rate and loan amount

In [None]:
# Separate the data between fully paid and charged off / defaulted loans
paid_loans = sample_accepted_df[sample_accepted_df['loan_status'] == "Fully Paid"]
defaulted_loans = sample_accepted_df[sample_accepted_df['loan_status'] == "Charged Off/Default"]

# A hexbin is more appropriate due to the number of datapoints being plotted. The count of each hex is plotted on the right
plt.hexbin(paid_loans['funded_amnt'], paid_loans['int_rate'], gridsize=20, label='Fully Paid')
plt.colorbar()
plt.xlabel('Loan Amount')
plt.xticks(rotation=45) 
plt.ylabel('Interest Rate')
plt.title('Hexbin plot of Interest Rate vs Loan Amount')
plt.show()

sns.boxplot(data=sample_accepted_df, x='loan_status', y='int_rate')
plt.xticks(rotation=45) 
plt.title('Boxplot of Interest Rate by Loan Status')
plt.xlabel('Loan Status')
plt.ylabel('Interest Rate')
plt.show()

Since current, late and "in grace period" loans were removed when loading the data, the boxplot compares only fully paid and charged off / defaulted loans. Charged off / defaulted loans have the higher median interest rate, with fully paid loans having the lower. On the hexbin plot, the majority of fully paid loans fall between $5,000 and $10,000, with an interest rate of approximately 12%; the charged off / defaulted loans carry a much higher interest rate, sitting further from this central grouping of data.

### Feature-Engineering

TODO:
- Loan-to-income ratio
- Loan purpose one hot encoding
- simplify loan grade and subgrade
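
The first two items in the list above might be approached roughly as follows. This is a sketch on toy data; it assumes the columns are named `loan_amnt`, `annual_inc` and `purpose`, and the `loan_to_income` column name is hypothetical.

```python
import pandas as pd

# Toy rows standing in for the working dataset
toy_df = pd.DataFrame({
    "loan_amnt": [10000.0, 5000.0, 20000.0],
    "annual_inc": [50000.0, 25000.0, 80000.0],
    "purpose": ["car", "debt_consolidation", "car"],
})

# Loan-to-income ratio: requested amount relative to yearly income
toy_df["loan_to_income"] = toy_df["loan_amnt"] / toy_df["annual_inc"]

# One-hot encode the loan purpose into purpose_* indicator columns
encoded_df = pd.get_dummies(toy_df, columns=["purpose"], prefix="purpose")
print(encoded_df.columns.tolist())
```

Rows with a zero or missing annual income would need handling before the ratio is computed on the real data.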

purpose

## Conclusion

### Resources used:

- https://stackoverflow.com/questions/51325601/how-to-stop-my-pandas-data-table-from-being-truncated-when-printed