# Optimising Loans

In this project, I'll practice working with chunked dataframes and optimizing a dataframe's memory usage.

I'll be working with personal loan data.

The Lending Club's website lists approved loans. Qualified investors can view the borrower's credit score, the purpose of the loan, and other details in the loan applications. Once an investor is ready to back a loan, they select the amount of money they want to fund. When the loan amount the borrower requested is fully funded, the borrower receives the money, minus the origination fee that Lending Club charges.

I'll be working with a dataset of loans approved from 2007-2011.

The entire dataset consumes about 67 megabytes of memory. 

For this project, I'll be imagining that I only have 10 megabytes of memory available.

Let's get started...

In [300]:
import pandas as pd

import numpy as np

pd.options.display.max_columns = 99

Note that the `nrows` parameter of `read_csv` lets us load only the first N rows.

In [2]:
loans_5 = pd.read_csv("loans_2007.csv", nrows = 5)
loans_5

| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | last_pymnt_amnt | last_credit_pull_d | collections_12_mths_ex_med | policy_code | application_type | acc_now_delinq | chargeoff_within_12_mths | delinq_amnt | pub_rec_bankruptcies | tax_liens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599.0 | 5000.0 | 5000.0 | 4975.0 | 36 months | 10.65% | 162.87 | B | B2 | ... | 171.62 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1077430 | 1314167.0 | 2500.0 | 2500.0 | 2500.0 | 60 months | 15.27% | 59.83 | C | C4 | ... | 119.66 | Sep-2013 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1077175 | 1313524.0 | 2400.0 | 2400.0 | 2400.0 | 36 months | 15.96% | 84.33 | C | C5 | ... | 649.91 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1076863 | 1277178.0 | 10000.0 | 10000.0 | 10000.0 | 36 months | 13.49% | 339.31 | C | C1 | ... | 357.48 | Apr-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1075358 | 1311748.0 | 3000.0 | 3000.0 | 3000.0 | 60 months | 12.69% | 67.79 | B | B5 | ... | 67.79 | Jun-2016 | 0.0 | 1.0 | INDIVIDUAL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [40]:
loans_5.shape

(5, 52)

We'll now load 1000 rows and check that they take up less than 5 MB of memory, using the `memory_usage` method.

In [18]:
loans_1000 = pd.read_csv("loans_2007.csv", nrows = 1000)
loans_1000.memory_usage(deep = True).sum() / (1024 * 1024) # Convert to MB by dividing by 1024 * 1024

1.5502548217773438
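
To make the role of `deep = True` concrete, here is a minimal sketch on a small synthetic dataframe (the values are illustrative, not the loan data). It compares the shallow and deep measurements column by column.

```python
import pandas as pd

# Synthetic stand-in for a slice of the loans data (values are illustrative).
df = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0] * 100,
    "grade": ["B", "C", "C"] * 100,
})

shallow = df.memory_usage(deep=False)  # counts only the array buffers
deep = df.memory_usage(deep=True)      # also inspects the stored Python objects

# A float64 column reports the same size either way: 300 rows * 8 bytes.
print(shallow["loan_amnt"], deep["loan_amnt"])

# A string column is typically larger once its contents are counted.
print(shallow["grade"], deep["grade"])

# Same conversion as above: bytes to MB.
total_mb = deep.sum() / (1024 * 1024)
print(f"{total_mb:.4f} MB")
```

Without `deep = True`, the footprint of string-heavy data can be underestimated, which matters here because many of the loan columns hold text.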

### Exploring Data in Chunks

Now let's try to understand the columns better by using chunks.

For each of these chunks, we want to know:

1. How many columns have a numeric type? How many have a string type?

1. How many unique values are there in each string column? How many of the string columns contain values that are less than 50% unique?

1. Which float columns have no missing values and could be candidates for conversion to the integer type?

We also want to calculate the total memory usage across all of the chunks.

Based on our previous exercise, where 1000 rows took about 1.55 MB, we could probably triple the number of rows per chunk and still stay under 5 MB. Let's try this.

In [35]:
chunks_iter = pd.read_csv("loans_2007.csv", chunksize = 3000)
for chunk in chunks_iter:
    print(chunk.memory_usage(deep = True).sum() / (1024 * 1024))

4.649059295654297
4.644805908203125
4.646563529968262
4.647915840148926
4.644108772277832
4.645991325378418
4.644582748413086
4.646951675415039
4.645077705383301
4.64512825012207
4.657840728759766
4.656707763671875
4.663515090942383
4.896956443786621
0.880854606628418


Looks good! Every chunk stays under 5 MB.
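
The choice of 3000 rows can be sanity-checked with a little arithmetic on the 1000-row measurement. This is only a rough sketch: it assumes the average memory per row stays about the same across the file, which the chunk sizes printed above suggest but do not guarantee.

```python
# Measured earlier: 1000 rows occupy about 1.55 MB.
mb_per_1000_rows = 1.5502548217773438
mb_per_row = mb_per_1000_rows / 1000

budget_mb = 5  # per-chunk ceiling used in this project

# Largest row count whose estimated footprint stays within the budget.
max_rows = int(budget_mb / mb_per_row)
print(max_rows)

# Estimated footprint of a 3000-row chunk, in MB.
estimate_3000 = round(3000 * mb_per_row, 2)
print(estimate_3000)
```

The estimate for 3000 rows comes out at about 4.65 MB, in line with the measured chunks, and leaves some headroom for chunks whose rows happen to be heavier than average.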

Let's see how many rows there are in the data.

In [37]:
chunks_iter = pd.read_csv("loans_2007.csv", chunksize = 3000)
num_rows = 0
for chunk in chunks_iter:
    num_rows += len(chunk)
num_rows

42538

### How many columns have a numeric type? How many have a string type?

In [142]:
chunks_iter = pd.read_csv("loans_2007.csv", chunksize = 3000)
chunk_dict = {}

for i, chunk in enumerate(chunks_iter, 1):
    
    chunk_name = f"C{i}"
    # np.object, np.float and np.int were removed in NumPy 1.24; select_dtypes avoids them
    obj_total = chunk.select_dtypes(include = "object").shape[1]
    num_total = chunk.select_dtypes(include = "number").shape[1]
    
    chunk_dict[chunk_name] = [obj_total, num_total]
    
    # Peek at the id column in the last three chunks
    if i in (13, 14, 15):
        print(chunk["id"][:5])
        print()
        
chunk_df = pd.DataFrame.from_dict(chunk_dict, orient='index', columns = ["Object", "Number"]).transpose()
chunk_df

36000    412050
36001    426918
36002    426414
36003    426858
36004    426845
Name: id, dtype: int64

39000    298963
39001    298946
39002    298649
39003    297158
39004    297783
Name: id, dtype: object

42000    247286
42001    246996
42002    246720
42003    246535
42004    246197
Name: id, dtype: object



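| | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Object | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 22 | 22 |
| Number | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 31 | 30 | 30 |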


We can see above that the *id* column is being read as an object in the last two chunks (C14 and C15). As we don't really need the *id* column for future analysis, we'll just ignore this.
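
The dtype change happens because `read_csv` infers types separately for each chunk. The sketch below reproduces the effect on a tiny synthetic CSV. The non-numeric value in the last row is an assumption made for illustration; the notebook does not show what actually causes the switch in `loans_2007.csv`.

```python
import io
import pandas as pd

# Synthetic CSV: the last row holds a non-numeric id (illustrative assumption).
csv_text = "id,loan_amnt\n1,5000\n2,2500\n3,2400\nnot a number,\n"

# Each chunk infers its dtypes independently of the others.
inferred = [
    chunk["id"].dtype
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]
print(inferred)

# Declaring the dtype up front keeps it identical in every chunk.
declared = [
    chunk["id"].dtype
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype={"id": str})
]
print(declared)
```

If the column were needed later, passing `dtype` to `read_csv` like this would be one way to keep the chunks consistent.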

### How many unique values are there in each string column? How many of the string columns contain less than 50% unique values?

In [211]:
chunks_iter = pd.read_csv("loans_2007.csv", chunksize = 3000)

unique_values_dict = {}

# Loop through all chunks
for chunk in chunks_iter:
    # np.object was removed in NumPy 1.24; select_dtypes gives the same string columns
    obj_total = list(chunk.select_dtypes(include = "object").columns)
    
    if "id" in obj_total:
        obj_total.remove("id")
        
    # Adding value counts to dictionary
    for col in obj_total:
        unique_elements = chunk[col].value_counts()
        if col not in unique_values_dict:
            unique_values_dict[col] = [unique_elements]
        else:
            unique_values_dict[col].append(unique_elements)
            
# Concatenating values in dictionary for each chunk
# Grouping by index and calculating sum
for col in unique_values_dict:
    concat = pd.concat(unique_values_dict[col])
    grouped = concat.groupby(concat.index).sum()
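    # Note: this reports columns with fewer than 50 distinct values (an absolute count, not a percentage)
    if len(grouped) < 50: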
        print(f"{col:25} : {len(grouped)}")

term                      : 2
grade                     : 7
sub_grade                 : 35
emp_length                : 11
home_ownership            : 5
verification_status       : 3
loan_status               : 9
pymnt_plan                : 2
purpose                   : 14
initial_list_status       : 1
application_type          : 1
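
The filter above keeps columns with fewer than 50 distinct values. To answer the "less than 50% unique" question literally, the distinct count can be divided by the number of non-null values seen. Here is a minimal sketch of that idea on a synthetic CSV (column values are made up), accumulating the counts chunk by chunk with `Series.add` instead of concatenating them at the end.

```python
import io
import pandas as pd

# Synthetic CSV with one repetitive column and one mostly unique column.
csv_text = (
    "grade,emp_title\n"
    "A,Ryder\nB,Acme\nA,Globex\nA,Initech\nB,Umbrella\nA,Hooli\n"
)

totals = {}  # column name -> running value counts across chunks

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    for col in chunk.columns:
        counts = chunk[col].value_counts()
        if col in totals:
            # Align on the value labels; a label missing on one side counts as 0.
            totals[col] = totals[col].add(counts, fill_value=0)
        else:
            totals[col] = counts

for col, counts in totals.items():
    n_unique = len(counts)
    n_values = int(counts.sum())  # non-null values seen across all chunks
    print(f"{col:12} unique={n_unique} share={n_unique / n_values:.0%}")
```

Keeping one running series per column means only the distinct values are held in memory, not one `value_counts` result per chunk.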


### Which float columns have no missing values and could be candidates for conversion to the integer type?

In [279]:
chunks_iter = pd.read_csv("loans_2007.csv", chunksize = 3000)

cols = []

for chunk in chunks_iter:
    floats = chunk.select_dtypes(include = "float")
    floats_null = floats.columns[~floats.isnull().any()].tolist()
    cols.append(floats_null)
                            
# Union: float columns that had no missing values in at least one chunk
union = sorted(set().union(*cols))

for i, u in enumerate(union, 1):
    print(f"{i:5} {u}")

    1 acc_now_delinq
    2 annual_inc
    3 chargeoff_within_12_mths
    4 collection_recovery_fee
    5 collections_12_mths_ex_med
    6 delinq_2yrs
    7 delinq_amnt
    8 dti
    9 funded_amnt
   10 funded_amnt_inv
   11 inq_last_6mths
   12 installment
   13 last_pymnt_amnt
   14 loan_amnt
   15 member_id
   16 open_acc
   17 out_prncp
   18 out_prncp_inv
   19 policy_code
   20 pub_rec
   21 pub_rec_bankruptcies
   22 recoveries
   23 revol_bal
   24 tax_liens
   25 total_acc
   26 total_pymnt
   27 total_pymnt_inv
   28 total_rec_int
   29 total_rec_late_fee
   30 total_rec_prncp


### Calculate memory usage across all chunks

In [293]:
chunks_iter = pd.read_csv("loans_2007.csv", chunksize = 3000)

total_mb = 0
for chunk in chunks_iter:
    # memory_usage reports bytes; convert each chunk's total to MB
    total_mb += chunk.memory_usage(deep = True).sum() / (1024 * 1024)
    
print(f"{total_mb:.2f} MBs")

66.22 MBs
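
The same loop can also keep a per-column total, which helps show where the memory goes before deciding what to optimize. This is a sketch on a tiny synthetic CSV (values are illustrative), not a measurement of the loan data.

```python
import io
import pandas as pd

# Synthetic CSV with one numeric and two text columns.
csv_text = (
    "loan_amnt,term,emp_title\n"
    "5000,36 months,Ryder\n2500,60 months,Acme\n"
    "2400,36 months,Globex\n10000,36 months,Initech\n"
)

per_column = None  # running per-column byte totals

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # index=False leaves the row index out so only the columns are compared.
    usage = chunk.memory_usage(deep=True, index=False)
    per_column = usage if per_column is None else per_column + usage

# Largest consumers first.
print(per_column.sort_values(ascending=False))
```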


### Optimizing String Columns

Let's first see what string columns we should look at.

In [317]:
chunks_iter = pd.read_csv("loans_2007.csv", chunksize = 3000)

for chunk in chunks_iter:
    print(chunk.head(2).transpose())
    break

                                      0                1
id                              1077501          1077430
member_id                    1.2966e+06      1.31417e+06
loan_amnt                          5000             2500
funded_amnt                        5000             2500
funded_amnt_inv                    4975             2500
term                          36 months        60 months
int_rate                         10.65%           15.27%
installment                      162.87            59.83
grade                                 B                C
sub_grade                            B2               C4
emp_title                           NaN            Ryder
emp_length                    10+ years         < 1 year
home_ownership                     RENT             RENT
annual_inc                        24000            30000
verification_status            Verified  Source Verified
issue_d                        Dec-2011         Dec-2011
loan_status                  Fu

Let's just go with the following columns.

In [320]:
useful_obj_cols = ['term', 
                   'grade',
                   'sub_grade', 
                   'emp_title', 
                   'home_ownership', 
                   'verification_status', 
                   'issue_d', 
                   'purpose', 
                   'earliest_cr_line', 
                   'revol_util', 
                   'last_pymnt_d', 
                   'last_credit_pull_d']

In [323]:
chunks_iter = pd.read_csv("loans_2007.csv", chunksize = 3000, usecols = useful_obj_cols)

for chunk in chunks_iter:
    print(chunk.dtypes)
    break

term                   object
grade                  object
sub_grade              object
emp_title              object
home_ownership         object
verification_status    object
issue_d                object
purpose                object
earliest_cr_line       object
revol_util             object
last_pymnt_d           object
last_credit_pull_d     object
dtype: object
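
All twelve columns arrive as `object`, so each one is a candidate for a more compact type. As a hedged sketch of where this could go next (the notebook has not chosen a conversion for each column yet), the example below tries three common options on synthetic data: `category` for low-cardinality text, `datetime64` for month-year strings, and `float` for percentage strings. The `revol_util` values are made up for illustration.

```python
import pandas as pd

# Synthetic rows shaped like some of the columns above (values are illustrative).
df = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months"] * 500,
    "grade": ["B", "C", "C"] * 500,
    "issue_d": ["Dec-2011", "Dec-2011", "Nov-2011"] * 500,
    "revol_util": ["83.7%", "9.4%", None] * 500,
})
before = df.memory_usage(deep=True).sum()

# Low-cardinality text: store each distinct value once plus small integer codes.
for col in ["term", "grade"]:
    df[col] = df[col].astype("category")

# Month-year strings such as "Dec-2011": parse into datetimes.
df["issue_d"] = pd.to_datetime(df["issue_d"], format="%b-%Y")

# Percentage strings: strip the sign and store as float (missing values become NaN).
df["revol_util"] = pd.to_numeric(df["revol_util"].str.rstrip("%"))

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(f"{before} bytes -> {after} bytes")
```

High-cardinality columns such as `emp_title` would probably gain little from `category`, so they are left out of the sketch.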
