# Practice Optimizing DataFrames and Processing in Chunks 

In this project, I'll work with financial lending data from [Lending Club](https://www.lendingclub.com/), a marketplace for personal loans that matches borrowers with investors.

The Lending Club's website lists approved loans. Qualified investors can view the borrower's credit score, the purpose of the loan, and other details in the loan applications. Once a lender is ready to back a loan, it selects the amount of money it wants to lend. When the loan amount the borrower requested is fully funded, the borrower receives the money, minus the origination fee that Lending Club charges. 

The entire dataset consumes about 67 megabytes of memory. We will optimize the dataframe and process it in chunks, simulating a situation where only 10 megabytes of memory are available.

In [92]:
import pandas as pd
pd.options.display.max_columns = 99

## Read dataset 

In [93]:
loans = pd.read_csv('loans_2007.csv', nrows = 5)
loans

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [94]:
# Memory usage of the first 1000 rows, in MB
memory_thousand = pd.read_csv('loans_2007.csv', nrows = 1000)
memory_thousand.memory_usage(deep = True).sum() / (2**20)

1.5502548217773438

## Calculate total memory usage 

Given that 1000 rows of the dataset consume about 1.55 MB of memory, I can expect chunks of 3000 rows to consume around 4.65 MB. Because only 10 MB of memory is available, it is good practice to keep each chunk below 50% of that budget.
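
The estimate above is simple proportional arithmetic, and it can be sketched as a small helper. The function name `rows_per_chunk` and its parameters are hypothetical, introduced only for illustration; the figures come from the 1000-row sample measured earlier.

```python
def rows_per_chunk(sample_mb, sample_rows, budget_mb, fraction=0.5):
    """Largest row count whose estimated size stays within fraction * budget_mb."""
    mb_per_row = sample_mb / sample_rows
    return int((budget_mb * fraction) / mb_per_row)

# About 1.55 MB per 1000 rows, as measured on the sample.
estimated_mb = 1.55 / 1000 * 3000                      # estimated size of a 3000-row chunk
max_rows = rows_per_chunk(1.55, 1000, budget_mb=10)    # ceiling for a 5 MB target
print(round(estimated_mb, 2), max_rows)
```

With these figures the ceiling works out to 3225 rows, so a chunk size of 3000 leaves a small margin. The estimate assumes later rows are about as large as the first 1000, which should be checked against the measured chunk sizes.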

In [95]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize = 3000)

for chunk in chunk_iter :
    print(chunk.memory_usage(deep = True).sum()/(2**20))  

4.649059295654297
4.644805908203125
4.646563529968262
4.647915840148926
4.644108772277832
4.645991325378418
4.644582748413086
4.646951675415039
4.645077705383301
4.64512825012207
4.657840728759766
4.656707763671875
4.663515090942383
4.896956443786621
0.880854606628418


## Explore dataframe in chunks

### Total number of rows 

In [96]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize = 3000)

nrow = 0
for chunk in chunk_iter :
    nrow += len(chunk)
    ncol = len(chunk.columns)
print(nrow, ncol)

42538 52


The loans dataframe has 42538 rows and 52 columns.

### Number of columns by data type

In [97]:
loans_chunks = pd.read_csv('loans_2007.csv', chunksize = 3000)

num_numerical_columns = []
num_object_columns = []
for chunk in loans_chunks : 
    numerical_columns = chunk.select_dtypes(exclude = ['object']).columns
    num_numerical_columns.append(len(numerical_columns))
    object_columns = chunk.select_dtypes(include = ['object']).columns
    num_object_columns.append(len(object_columns))
    
print(f"Number of numerical columns : {num_numerical_columns}")
print(f"Number of object columns : {num_object_columns}")

Number of numerical columns : [31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30]
Number of object columns : [21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22]


In [98]:
print(f"Columns in numerical_columns : {numerical_columns}")
print(f"Columns in object_columns : {object_columns}")

Columns in numerical_columns : Index(['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'installment', 'annual_inc', 'dti', 'delinq_2yrs', 'inq_last_6mths',
       'open_acc', 'pub_rec', 'revol_bal', 'total_acc', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
       'total_rec_int', 'total_rec_late_fee', 'recoveries',
       'collection_recovery_fee', 'last_pymnt_amnt',
       'collections_12_mths_ex_med', 'policy_code', 'acc_now_delinq',
       'chargeoff_within_12_mths', 'delinq_amnt', 'pub_rec_bankruptcies',
       'tax_liens'],
      dtype='object')
Columns in object_columns : Index(['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'verification_status', 'issue_d',
       'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code',
       'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status',
       'last_pymnt_d', 'last_credit_pull_d', 'application_type'],
      dtype='object')

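The counts change in the last two chunks because the 'id' column is read as an object column there instead of a numeric one. To optimize the dataframe, we need to convert the 'id' column to an integer type.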
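
One way to find out why 'id' cannot be parsed as a number is to coerce it chunk by chunk and collect the values that fail. The sketch below uses a small in-memory CSV as a hypothetical stand-in for `loans_2007.csv`; the non-numeric value is made up, and using the nullable `Int64` dtype is an assumption about how the gaps should be handled, not something the original notebook does.

```python
import io
import pandas as pd

# Hypothetical stand-in for loans_2007.csv: the third row has a non-numeric id.
csv_text = (
    "id,loan_amnt\n"
    "1077501,5000\n"
    "1077430,2500\n"
    "not-a-number,\n"
    "1076863,10000\n"
)

bad_ids = []
converted = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype={'id': 'object'}):
    ids = pd.to_numeric(chunk['id'], errors='coerce')       # non-numeric values become NaN
    bad_ids.extend(chunk.loc[ids.isnull(), 'id'].tolist())  # remember what failed to parse
    converted.append(ids.astype('Int64'))                   # nullable integer keeps the gap

id_column = pd.concat(converted)
print(bad_ids)
print(id_column.dtype)
```

Inspecting `bad_ids` first makes it possible to decide whether the offending rows should be dropped or kept as missing values before committing to an integer type.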

## Check columns for optimizing 

The next step is to find the columns that need to be optimized. The workflow for checking columns is as follows:

1. Check the unique values in each object column (find object columns in which less than 50% of the values are unique).
2. Check which numeric columns have no missing values and are therefore candidates for conversion to an integer type.
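
The two checks above can be sketched on a single small dataframe as follows. The helper names `category_candidates` and `integer_candidates` and the sample values are hypothetical; the whole-number test in the second helper is an extra assumption about what makes a float column safe to store as an integer.

```python
import pandas as pd

def category_candidates(df, max_unique_ratio=0.5):
    """Object columns whose share of unique values is below max_unique_ratio."""
    obj = df.select_dtypes(include=['object'])
    ratios = obj.nunique() / len(df)
    return ratios[ratios < max_unique_ratio].index.tolist()

def integer_candidates(df):
    """Float columns with no missing values and only whole numbers."""
    floats = df.select_dtypes(include=['float64'])
    return [col for col in floats.columns
            if floats[col].notnull().all() and (floats[col] % 1 == 0).all()]

# Made-up sample loosely modeled on the loans columns.
sample = pd.DataFrame({
    'grade': ['B', 'C', 'C', 'C', 'B', 'B'],
    'emp_title': ['a', 'b', 'c', 'd', 'e', 'f'],
    'loan_amnt': [5000.0, 2500.0, 2400.0, 10000.0, 3000.0, 5000.0],
    'dti': [27.65, 1.0, 8.72, 20.0, 17.94, 11.2],
    'annual_inc': [24000.0, 30000.0, None, 49200.0, 80000.0, 36000.0],
}).astype({'grade': 'object', 'emp_title': 'object'})

print(category_candidates(sample))
print(integer_candidates(sample))
```

When the data is read in chunks, the same ratios have to be computed from counts accumulated across all chunks, since a value that looks rare in one chunk may repeat in another.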

### Check object columns 

In [99]:
# Count number of unique values in object columns 
chunk_iter = pd.read_csv('loans_2007.csv', chunksize = 3000)

unique_series = {}
for chunk in chunk_iter : 
    object_df = chunk.select_dtypes(include = ['object'])
    object_columns = object_df.columns
    for col in object_columns : 
        col_unique = chunk[col].value_counts() 
        if col in unique_series : 
            unique_series[col].append(col_unique)
        else : 
            unique_series[col] = [col_unique]

nunique_series = {} 
for col in unique_series : 
    col_concat = pd.concat(unique_series[col])
    col_group = col_concat.groupby(col_concat.index).sum()
    nunique_series[col] = len(col_group)

nunique_series = pd.Series(nunique_series).sort_values(ascending = False)
nunique_series[nunique_series <= 50]

addr_state             50
sub_grade              35
purpose                14
emp_length             11
loan_status             9
grade                   7
home_ownership          5
verification_status     3
pymnt_plan              2
term                    2
application_type        1
initial_list_status     1
dtype: int64

To optimize the loans dataframe, the columns above should be converted to the category data type.
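
As a rough illustration of why low-cardinality columns benefit from the category type, the sketch below compares the memory of a made-up column of state codes stored as object and as category. The values and the row count are arbitrary and not taken from the dataset.

```python
import pandas as pd

# 3000 made-up rows with only five distinct values.
states = pd.Series(['AZ', 'GA', 'IL', 'CA', 'OR'] * 600, dtype='object')
as_category = states.astype('category')

object_mb = states.memory_usage(deep=True) / 2**20
category_mb = as_category.memory_usage(deep=True) / 2**20
print(round(object_mb, 3), round(category_mb, 3))
print(as_category.cat.codes.dtype)   # small integer codes replace the repeated strings
```

The saving comes from storing each distinct string once and a small integer code per row. One caveat when working in chunks: each chunk infers its own set of categories, so combining chunks whose categories differ can fall back to the object type.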

### Check numerical columns 

In [100]:
# Count missing values in each numeric column
chunk_iter = pd.read_csv('loans_2007.csv', chunksize = 3000)

missing = []
for chunk in chunk_iter : 
    numeric_df = chunk.select_dtypes(exclude = ['object'])
    missing.append(numeric_df.isnull().sum())

missing_series = pd.concat(missing)
missing_series = missing_series.groupby(missing_series.index).sum().sort_values(ascending = False)
missing_series

pub_rec_bankruptcies          1368
chargeoff_within_12_mths       148
collections_12_mths_ex_med     148
tax_liens                      108
acc_now_delinq                  32
open_acc                        32
delinq_amnt                     32
pub_rec                         32
delinq_2yrs                     32
total_acc                       32
inq_last_6mths                  32
annual_inc                       7
installment                      3
collection_recovery_fee          3
dti                              3
funded_amnt                      3
funded_amnt_inv                  3
total_rec_prncp                  3
last_pymnt_amnt                  3
loan_amnt                        3
total_rec_late_fee               3
out_prncp                        3
out_prncp_inv                    3
policy_code                      3
recoveries                       3
revol_bal                        3
total_pymnt                      3
total_pymnt_inv                  3
total_rec_int       

### Calculate memory usage across all chunks

In [101]:
# Check memory usage of object and numeric columns
chunk_iter = pd.read_csv('loans_2007.csv', chunksize = 3000)

object_memory = 0
numeric_memory = 0
total_memory = 0
for chunk in chunk_iter :
    object_df = chunk.select_dtypes(include = ['object'])
    numeric_df = chunk.select_dtypes(exclude = ['object'])
    object_memory += object_df.memory_usage(deep = True).sum()/(2**20)
    numeric_memory += numeric_df.memory_usage(deep = True).sum()/(2**20)
    total_memory += chunk.memory_usage(deep = True).sum()/(2**20)
    
print(f"Memory usage of object columns : {round(object_memory, 3)}MB")
print(f"Memory usage of numeric columns : {round(numeric_memory, 3)}MB")
print(f"Memory usage of total columns : {round(total_memory, 3)}MB")

Memory usage of object columns : 56.182MB
Memory usage of numeric columns : 10.036MB
Memory usage of total columns : 66.216MB


## Optimize columns 

### Object columns

In [102]:
# Check unique values in object columns
chunk_iter = pd.read_csv('loans_2007.csv', chunksize = 3000)

unique_series = {}
for chunk in chunk_iter : 
    object_df = chunk.select_dtypes(include = ['object'])
    object_columns = object_df.columns
    for col in object_columns : 
        col_unique = chunk[col].value_counts() 
        if col in unique_series : 
            unique_series[col].append(col_unique)
        else : 
            unique_series[col] = [col_unique]

for col in unique_series : 
    col_concat = pd.concat(unique_series[col])
    col_group = col_concat.groupby(col_concat.index).sum()
    print(col_group)

 36 months    31534
 60 months    11001
Name: term, dtype: int64
  5.42%    573
  5.79%    410
  5.99%    347
  6.00%     19
  6.03%    447
          ... 
 23.59%      4
 23.91%     11
 24.11%      3
 24.40%      1
 24.59%      1
Name: int_rate, Length: 394, dtype: int64
A    10183
B    12389
C     8740
D     6016
E     3394
F     1301
G      512
Name: grade, dtype: int64
A1    1142
A2    1520
A3    1823
A4    2905
A5    2793
B1    1882
B2    2113
B3    2997
B4    2590
B5    2807
C1    2264
C2    2157
C3    1658
C4    1370
C5    1291
D1    1053
D2    1485
D3    1322
D4    1140
D5    1016
E1     884
E2     791
E3     668
E4     552
E5     499
F1     392
F2     308
F3     236
F4     211
F5     154
G1     141
G2     107
G3      79
G4      99
G5      86
Name: sub_grade, dtype: int64
  old palm inc                       1
 Brocade Communications              1
 CenturyLink                         1
 Department of Homeland Security     1
 Down To Earth Distributors, Inc.    1
               

The object columns useful for analysis are 'id', 'term', 'int_rate', 'sub_grade', 'emp_title', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'purpose', 'addr_state', 'earliest_cr_line', 'revol_util', 'last_pymnt_d', 'last_credit_pull_d'. Based on the unique values and their counts, these columns should be converted as follows:

- id : as numeric
- term : as numeric (strip "months")
- int_rate : as numeric (strip %)
- sub_grade : as category
- emp_title : keep as object
- home_ownership : as category
- verification_status : as category
- issue_d : as datetime
- loan_status : as category
- purpose : as category
- addr_state : as category
- earliest_cr_line : as datetime
- revol_util : as numeric (strip %)
- last_pymnt_d : as datetime
- last_credit_pull_d : as datetime
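
A few of the conversions in the list can be sketched as a per-chunk helper. The function name `clean_chunk` and the sample rows are hypothetical, and the `'%b-%Y'` date format is an assumption based on values such as `Dec-2011` in the preview rows.

```python
import pandas as pd

def clean_chunk(chunk):
    """Apply a few of the listed conversions to one chunk (illustrative helper)."""
    out = chunk.copy()
    # ' 36 months' -> 36
    out['term'] = pd.to_numeric(out['term'].str.strip().str.split(' ').str[0])
    # '10.65%' -> 10.65
    out['int_rate'] = pd.to_numeric(out['int_rate'].str.strip().str.rstrip('%'))
    # 'Dec-2011' -> 2011-12-01 (format assumed from the preview rows)
    out['issue_d'] = pd.to_datetime(out['issue_d'], format='%b-%Y')
    out['sub_grade'] = out['sub_grade'].astype('category')
    return out

# Made-up rows shaped like the preview of the loans data.
sample = pd.DataFrame({
    'term': [' 36 months', ' 60 months'],
    'int_rate': ['10.65%', '15.27%'],
    'issue_d': ['Dec-2011', 'Dec-2011'],
    'sub_grade': ['B2', 'C4'],
})
cleaned = clean_chunk(sample)
print(cleaned.dtypes)
```

Keeping the conversions in one function means the same cleaning can be applied to every chunk inside the `for chunk in chunk_iter` loop without repeating the individual steps.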

In [103]:
# Convert object type to numerical type 
print(f"Previous total memory : {total_memory}")
chunk_iter = pd.read_csv('loans_2007.csv', chunksize = 3000)

total_memory = 0
for chunk in chunk_iter : 
    term_split = chunk['term'].str.split(" ").str[1]
    chunk['term'] = pd.to_numeric(term_split)
    int_strip = chunk['int_rate'].str.rstrip('%')
    chunk['int_rate'] = pd.to_numeric(int_strip)
    revol_strip = chunk['revol_util'].str.rstrip('%')
    chunk['revol_util'] = pd.to_numeric(revol_strip) 
    
    total_memory += chunk.memory_usage(deep = True).sum()/(2**20)
    
print(f"Current total memory : {total_memory}")

Previous total memory : 66.21605968475342
Current total memory : 59.37732982635498


In [107]:
# Convert object type to datetime
print(f"Previous total memory : {total_memory}")
chunk_iter = pd.read_csv('loans_2007.csv', chunksize = 3000, parse_dates = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d'])

total_memory = 0
for chunk in chunk_iter : 
    term_split = chunk['term'].str.split(" ").str[1]
    chunk['term'] = pd.to_numeric(term_split)
    int_strip = chunk['int_rate'].str.rstrip('%')
    chunk['int_rate'] = pd.to_numeric(int_strip)
    revol_strip = chunk['revol_util'].str.rstrip('%')
    chunk['revol_util'] = pd.to_numeric(revol_strip) 
    
    total_memory += chunk.memory_usage(deep = True).sum()/(2**20)
    
print(f"Current total memory : {round(total_memory,3)}MB")

Previous total memory : 59.37732982635498
Current total memory : 50.131991386413574


In [109]:
# Convert object type to category
print(f"Previous total memory : {round(total_memory,3)}MB")

cat_dtypes = {
    'sub_grade' : 'category',
    'home_ownership' : 'category',
    'verification_status' : 'category',
    'loan_status' : 'category',
    'purpose' : 'category',
    'addr_state' : 'category'
}

chunk_iter = pd.read_csv('loans_2007.csv', chunksize = 3000, 
                         dtype = cat_dtypes,
                         parse_dates = ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d'])

total_memory = 0
for chunk in chunk_iter : 
    term_split = chunk['term'].str.split(" ").str[1]
    chunk['term'] = pd.to_numeric(term_split)
    int_strip = chunk['int_rate'].str.rstrip('%')
    chunk['int_rate'] = pd.to_numeric(int_strip)
    revol_strip = chunk['revol_util'].str.rstrip('%')
    chunk['revol_util'] = pd.to_numeric(revol_strip) 

    total_memory += chunk.memory_usage(deep = True).sum()/(2**20)
    
print(f"Current total memory : {round(total_memory,3)}MB")

Previous total memory : 50.131991386413574
Current total memory : 34.713704109191895


In [110]:
chunk.dtypes

id                                    object
member_id                            float64
loan_amnt                            float64
funded_amnt                          float64
funded_amnt_inv                      float64
term                                 float64
int_rate                             float64
installment                          float64
grade                                 object
sub_grade                           category
emp_title                             object
emp_length                            object
home_ownership                      category
annual_inc                           float64
verification_status                 category
issue_d                       datetime64[ns]
loan_status                         category
pymnt_plan                            object
purpose                             category
title                                 object
zip_code                              object
addr_state                          category
dti       