# Optimizing Dataframes and Processing in Chunks

## Introduction

The main aim of this project is to practice working with dataframes in chunks and optimizing a dataframe's memory usage.

The data we will be using is part of the financial lending data from **Lending Club**, specifically the dataset of approved loans from 2007-2011.

If we read in the entire data set, it will consume about 67 MB of memory. In this project we will imagine that we only have 10 MB of memory.

In [1]:
import pandas as pd
pd.options.display.max_columns = 99

In [2]:
#Read in the first five lines
loans5 = pd.read_csv('loans_2007.csv', nrows=5)
loans5

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [None]:
#columns to convert to integer type
to_int64 = []  # placeholder, left empty until candidate columns are identified

In [3]:
loans5.apply(pd.isnull).sum()

id                            0
member_id                     0
loan_amnt                     0
funded_amnt                   0
funded_amnt_inv               0
term                          0
int_rate                      0
installment                   0
grade                         0
sub_grade                     0
emp_title                     2
emp_length                    0
home_ownership                0
annual_inc                    0
verification_status           0
issue_d                       0
loan_status                   0
pymnt_plan                    0
purpose                       0
title                         0
zip_code                      0
addr_state                    0
dti                           0
delinq_2yrs                   0
earliest_cr_line              0
inq_last_6mths                0
open_acc                      0
pub_rec                       0
revol_bal                     0
revol_util                    0
total_acc                     0
initial_

In [4]:
import numpy as np
loans5.select_dtypes(include=[np.number]).shape[1]

31

Next, let's read in the first 1000 rows and calculate the total memory usage for these rows. We'll then increase or reduce the number of rows to keep the memory usage under 5 MB.

In [5]:
loans_1000 = pd.read_csv('loans_2007.csv', nrows=1000)

In [6]:
#check the memory usage of the first 1000 rows
loans_1000.memory_usage(deep=True).sum()/(2**20)

1.5502548217773438

Since 1000 rows use only 1.55 MB, let's increase the number of rows to 3000 per chunk and evaluate the memory footprint of each chunk using batch processing.

In [7]:
loans_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
for chunk in loans_iter:
    print(chunk.memory_usage(deep=True).sum()/(2**20))
    

4.649059295654297
4.644805908203125
4.646563529968262
4.647915840148926
4.644108772277832
4.645991325378418
4.644582748413086
4.646951675415039
4.645077705383301
4.64512825012207
4.657840728759766
4.656707763671875
4.663515090942383
4.896956443786621
0.880854606628418
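
Rather than finding a chunk size by trial and error, we could estimate it from a small sample. The following is a minimal sketch under stated assumptions: the data is a synthetic stand-in for `loans_2007.csv`, and `estimate_chunksize`, `budget_mb`, and `safety` are hypothetical names introduced only for illustration.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for loans_2007.csv (hypothetical data, not the real file)
rng = np.random.default_rng(0)
n = 5000
demo = pd.DataFrame({
    "loan_amnt": rng.integers(1000, 35000, n).astype(float),
    "grade": rng.choice(list("ABCDEFG"), n),
    "purpose": rng.choice(["car", "credit_card", "other"], n),
})
csv_text = demo.to_csv(index=False)


def estimate_chunksize(csv_buffer, budget_mb, sample_rows=1000, safety=0.9):
    """Estimate how many rows fit in budget_mb, based on a small sample."""
    sample = pd.read_csv(csv_buffer, nrows=sample_rows)
    bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
    # The safety factor leaves headroom for chunks with longer strings
    return max(1, int(budget_mb * 2**20 * safety // bytes_per_row))


chunksize = estimate_chunksize(io.StringIO(csv_text), budget_mb=0.1)

# Verify the estimate by measuring the largest chunk actually produced
peak_mb = max(
    chunk.memory_usage(deep=True).sum() / 2**20
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=chunksize)
)
print(chunksize, round(peak_mb, 3))
```

The estimate is only as good as the sample: if later rows hold much longer strings than the first 1000, a chunk can still exceed the budget, which is why the sketch measures the peak afterwards.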


**How many rows are in the dataframe?**

In [8]:
loans_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
num_rows = 0
for chunk in loans_iter:
    num_rows += len(chunk)
print(num_rows)

42538


## Exploring the Data in Chunks

Let's try to understand our data better while working with dataframe chunks.

In [9]:
#for each chunk, how many columns have a numeric type 
#and how many have a string type
import numpy as np
loans_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
numeric_cols = []
string_cols = []
dtypes = []
for chunk in loans_iter:
    n = chunk.select_dtypes(include=[np.number]).shape[1]
    numeric_cols.append(n)
    s = chunk.select_dtypes(include=['object']).shape[1]
    string_cols.append(s)
    dtypes.append(chunk.dtypes)
    
print(numeric_cols)
print(string_cols)

[31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30]
[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22]


Notice that the counts of numeric and string columns change between the third-last and second-last iterations: one column switches from a numeric type to a string type. Let's print the data types for those chunks and see what changed.

In [10]:
#chunk data types on the third last iteration
print(dtypes[-3],'\n')

id                              int64
member_id                     float64
loan_amnt                     float64
funded_amnt                   float64
funded_amnt_inv               float64
term                           object
int_rate                       object
installment                   float64
grade                          object
sub_grade                      object
emp_title                      object
emp_length                     object
home_ownership                 object
annual_inc                    float64
verification_status            object
issue_d                        object
loan_status                    object
pymnt_plan                     object
purpose                        object
title                          object
zip_code                       object
addr_state                     object
dti                           float64
delinq_2yrs                   float64
earliest_cr_line               object
inq_last_6mths                float64
open_acc    

In [11]:
#chunk data types on the second last iteration
print(dtypes[-2],'\n')

id                             object
member_id                     float64
loan_amnt                     float64
funded_amnt                   float64
funded_amnt_inv               float64
term                           object
int_rate                       object
installment                   float64
grade                          object
sub_grade                      object
emp_title                      object
emp_length                     object
home_ownership                 object
annual_inc                    float64
verification_status            object
issue_d                        object
loan_status                    object
pymnt_plan                     object
purpose                        object
title                          object
zip_code                       object
addr_state                     object
dti                           float64
delinq_2yrs                   float64
earliest_cr_line               object
inq_last_6mths                float64
open_acc    

We can observe that the **id** column changed from **int64** to **object** data type.

Since this column is not very important - we will not use it anywhere in our analysis - we can ignore it.
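
If we did want consistent types across chunks, two options are to pin the column's dtype or to skip the column when reading. Here is a minimal sketch on a hypothetical four-row file (the text row is invented to trigger the same int64-to-object switch); it is an illustration, not necessarily what causes the change in the real data.

```python
import io

import pandas as pd

# Hypothetical miniature file: a late row carries text in the id column
csv_text = "id,loan_amnt\n1,5000\n2,2500\n3,2400\nLoans summary row,\n"

# Default behaviour: the dtype of `id` is inferred separately for each chunk
inferred = [
    chunk["id"].dtype
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]

# Option A: pin the dtype so that every chunk agrees
pinned = [
    chunk["id"].dtype
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype={"id": str})
]

# Option B: skip the column entirely, since we do not use it in the analysis
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2, usecols=lambda c: c != "id")
first_chunk = next(reader)

print(inferred, pinned, list(first_chunk.columns))
```

Skipping the column has the side benefit of saving the memory it would otherwise occupy in every chunk.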

**How many unique values are there in each string column?**

In [12]:
#How many unique values are there in each string column? 
loans_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
string_uniques = {}
for chunk in loans_iter:
    string_chunk = chunk.select_dtypes(include=['object'])
    sc_cols = string_chunk.columns
    for sc_col in sc_cols:
        vc_series = string_chunk[sc_col].value_counts()
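        if sc_col in string_uniques:
            #note: Series.append returns a new Series and the result is
            #discarded here, so only the counts from the first chunk
            #seen for each column are kept
            string_uniques[sc_col].append(vc_series)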
        else:
            string_uniques[sc_col] = vc_series
            
uniques_combined = {}

for sc_col in string_uniques:
    u_group = string_uniques[sc_col].groupby(string_uniques[sc_col].index).sum()
    uniques_combined[sc_col] = u_group
    #string columns with less than 50% unique values
    if uniques_combined[sc_col].shape[0] < (num_rows/2):
        print(sc_col, uniques_combined[sc_col].shape[0])
              

term 2
int_rate 36
grade 7
sub_grade 35
emp_title 2653
emp_length 11
home_ownership 3
verification_status 3
issue_d 2
loan_status 6
pymnt_plan 1
purpose 13
title 1406
zip_code 568
addr_state 43
earliest_cr_line 366
revol_util 884
initial_list_status 1
last_pymnt_d 54
last_credit_pull_d 55
application_type 1
id 3000
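
Note that `Series.append` returns a new Series rather than modifying the original in place (and the method was removed in pandas 2.0). One way to accumulate counts across every chunk is to collect the per-chunk Series in a list and combine them with `pd.concat`. A minimal sketch on a hypothetical five-row dataset:

```python
import io

import pandas as pd

# Hypothetical miniature dataset standing in for loans_2007.csv
csv_text = (
    "term,grade\n"
    "36 months,A\n"
    "60 months,B\n"
    "36 months,A\n"
    "36 months,C\n"
    "60 months,A\n"
)

per_chunk = {}  # column name -> list of per-chunk value_counts Series
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Excluding numeric columns leaves the string columns in this example
    for col in chunk.select_dtypes(exclude=["number"]).columns:
        per_chunk.setdefault(col, []).append(chunk[col].value_counts())

combined = {}
for col, parts in per_chunk.items():
    all_counts = pd.concat(parts)
    # The same value can appear in several chunks, so sum counts per value
    combined[col] = all_counts.groupby(all_counts.index).sum()

for col, counts in combined.items():
    print(col, counts.shape[0])
```

With this approach the number of unique values per column is `combined[col].shape[0]`, and the counts add up to the number of non-missing rows in the whole file rather than in a single chunk.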


### Float columns with missing values

**Which columns have no missing values and could be candidates for conversion to integer type?**

In [13]:
loans_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
#create a list to hold the series with the number of missing
#values in each column of each chunk
float_ms = []
for chunk in loans_iter:
    float_df = chunk.select_dtypes(include=['float'])
    num_missing = float_df.isnull().sum()
    float_ms.append(num_missing)
    
float_concat = pd.concat(float_ms)    
float_concat.groupby(float_concat.index).sum().sort_values()

member_id                        3
total_rec_int                    3
total_pymnt_inv                  3
total_pymnt                      3
revol_bal                        3
recoveries                       3
policy_code                      3
out_prncp_inv                    3
out_prncp                        3
total_rec_late_fee               3
loan_amnt                        3
last_pymnt_amnt                  3
total_rec_prncp                  3
funded_amnt_inv                  3
funded_amnt                      3
dti                              3
collection_recovery_fee          3
installment                      3
annual_inc                       7
inq_last_6mths                  32
total_acc                       32
delinq_2yrs                     32
pub_rec                         32
delinq_amnt                     32
open_acc                        32
acc_now_delinq                  32
tax_liens                      108
collections_12_mths_ex_med     148
chargeoff_within_12_

**What's the total memory usage across all chunks?**

In [14]:
loans_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
mem_usage = []
for chunk in loans_iter:
    mem = chunk.memory_usage(deep=True).sum()
    mem_usage.append(mem)
    
sum(mem_usage)/(2**20)

66.21605968475342

## Optimizing String Columns

We can achieve the greatest memory improvements by converting the string columns to more compact types.

Let's now convert the string columns whose values are less than 50% unique to the category type, and the columns that contain numeric values to the float type.

**Which string columns can we convert to a numeric type if we clean them?**

In [15]:
loans_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
string_uniques = {}
for chunk in loans_iter:
    string_chunk = chunk.select_dtypes(include=['object'])
    sc_cols = string_chunk.columns
    for sc_col in sc_cols:
        vc_series = string_chunk[sc_col].value_counts()
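        if sc_col in string_uniques:
            #note: Series.append returns a new Series and the result is
            #discarded here, so only the counts from the first chunk
            #seen for each column are kept
            string_uniques[sc_col].append(vc_series)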
        else:
            string_uniques[sc_col] = vc_series
            
uniques_combined = {}
cols_less_50 = {}
for sc_col in string_uniques:
    u_group = string_uniques[sc_col].groupby(string_uniques[sc_col].index).sum()
    uniques_combined[sc_col] = u_group
    #string columns with less than 50% unique values
    if uniques_combined[sc_col].shape[0] < (num_rows/2):
        cols_less_50[sc_col] = uniques_combined[sc_col].shape[0]
print('String columns with less than 50% unique values: ', '\n')
cols_less_50

String columns with less than 50% unique values:  



{'term': 2,
 'int_rate': 36,
 'grade': 7,
 'sub_grade': 35,
 'emp_title': 2653,
 'emp_length': 11,
 'home_ownership': 3,
 'verification_status': 3,
 'issue_d': 2,
 'loan_status': 6,
 'pymnt_plan': 1,
 'purpose': 13,
 'title': 1406,
 'zip_code': 568,
 'addr_state': 43,
 'earliest_cr_line': 366,
 'revol_util': 884,
 'initial_list_status': 1,
 'last_pymnt_d': 54,
 'last_credit_pull_d': 55,
 'application_type': 1,
 'id': 3000}

Among the above columns the following are the most useful for analysis:

In [16]:
useful_obj_cols = ['term', 'sub_grade', 'emp_title', 'home_ownership', 'verification_status', 'issue_d', 'purpose', 'earliest_cr_line', 'revol_util', 'last_pymnt_d', 'last_credit_pull_d']

useful_obj_cols

['term',
 'sub_grade',
 'emp_title',
 'home_ownership',
 'verification_status',
 'issue_d',
 'purpose',
 'earliest_cr_line',
 'revol_util',
 'last_pymnt_d',
 'last_credit_pull_d']

Let's now take a closer look at the unique values in the most useful columns:

In [17]:
for col in useful_obj_cols:
    print(col)
    print(uniques_combined[col], '\n')
#     print("-----------")

term
 36 months    2060
 60 months     940
Name: term, dtype: int64 

sub_grade
A1     97
A2    104
A3    105
A4    178
A5    141
B1    162
B2    168
B3    236
B4    203
B5    205
C1    191
C2    172
C3    103
C4    101
C5     84
D1     67
D2    111
D3     90
D4     72
D5     56
E1     61
E2     59
E3     43
E4     35
E5     45
F1     34
F2     23
F3     16
F4     11
F5     10
G1      3
G2      3
G3      4
G4      5
G5      2
Name: sub_grade, dtype: int64 

emp_title
(Collaborative) Abbott Nutrition Intl     1
16th MP BDE, U.S. Army                    1
1Life Healthcare                          1
1ST FRANKLIN FINANCIAL CORP               1
22squared, inc                            1
                                         ..
wyoming valley hospital                   1
yankee candle company                     1
zakheim and lavrar                        1
zashko inc.                               1
zoll medical corp                         1
Name: emp_title, Length: 2653, dtype: int64 

**We will convert the following columns to numeric by cleaning them**

In [18]:
to_numeric = ['term','revol_util']

**We shall convert the following to category**

In [19]:
to_category = {
    "sub_grade": "category", "home_ownership": "category", 
    "verification_status": "category", "purpose": "category"
}

**We shall convert the following columns to datetime**

In [20]:
to_datetime = ['issue_d', 'earliest_cr_line', 
               'last_pymnt_d','last_credit_pull_d']

**Converting to numeric, category and datetime**

In [21]:
loans_iter = pd.read_csv('loans_2007.csv', chunksize=3000, 
                         dtype=to_category, parse_dates=to_datetime)
for chunk in loans_iter:
    term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_cleaned)
    

chunk.dtypes

id                                    object
member_id                            float64
loan_amnt                            float64
funded_amnt                          float64
funded_amnt_inv                      float64
term                                 float64
int_rate                              object
installment                          float64
grade                                 object
sub_grade                           category
emp_title                             object
emp_length                            object
home_ownership                      category
annual_inc                           float64
verification_status                 category
issue_d                       datetime64[ns]
loan_status                           object
pymnt_plan                            object
purpose                             category
title                                 object
zip_code                              object
addr_state                            object
dti       
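
Two details of the cleaning step are worth spelling out. First, `str.rstrip(" months")` removes any trailing characters from the set `" months"`, not the literal suffix; it works here because digits are not in that set. Second, `term` comes out as float64 rather than an integer type because the column has missing values. The sketch below illustrates both points on a few hypothetical values and shows a regular-expression alternative (an assumption of mine, not the approach used in the cells above).

```python
import pandas as pd

# Hypothetical sample values in the same shape as the real columns
term = pd.Series([" 36 months", " 60 months", None])
revol_util = pd.Series(["83.7%", "9.4%", None])

# Character-set stripping, as in the cell above
stripped = pd.to_numeric(term.str.lstrip(" ").str.rstrip(" months"))

# Alternative: capture the digits explicitly with a regular expression
extracted = pd.to_numeric(term.str.extract(r"(\d+)", expand=False))

# Percent strings become plain floats once the trailing % is removed
revol_pct = pd.to_numeric(revol_util.str.rstrip("%"))

print(stripped.tolist(), extracted.tolist(), revol_pct.tolist())
```

The regular expression states the intent more directly and would keep working if the suffix ever contained a digit-like character.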

**Checking the memory footprint once again**

In [22]:
loans_iter = pd.read_csv('loans_2007.csv', chunksize=3000, 
                         dtype=to_category, parse_dates=to_datetime)
mem_usage = []
for chunk in loans_iter:
    mem = chunk.memory_usage(deep=True).sum()
    mem_usage.append(mem)
    
sum(mem_usage)/(2**20)

46.629088401794434

We now have an improvement in the memory footprint of about 20 MB, from 66 MB to 47 MB.
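
To see why the category type helps, here is a minimal sketch comparing the same low-cardinality column stored as strings and as a category. The values are synthetic (10,000 rows drawn from three hypothetical purposes), so the exact figures will differ from the loans data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.choice(["car", "credit_card", "other"], 10_000)
purpose = pd.Series(values, dtype=object)

# Stored as Python strings, every row carries its own string object
as_object_mb = purpose.memory_usage(deep=True) / 2**20

# Stored as a category, each row is a small integer code into a lookup table
as_category = purpose.astype("category")
as_category_mb = as_category.memory_usage(deep=True) / 2**20

print(round(as_object_mb, 3), round(as_category_mb, 3))
```

The saving shrinks as the share of unique values grows, which is the reasoning behind the 50% uniqueness threshold used earlier.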

## Optimizing Numeric Columns

Let's now optimize the numeric columns:

**Identify float columns that contain missing values, and that we can convert to a more space-efficient subtype**

In [24]:
loans_iter = pd.read_csv('loans_2007.csv', chunksize=3000, 
                         dtype=to_category, parse_dates=to_datetime)
num_mv = {}
for chunk in loans_iter:
    term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_cleaned)
    float_df = chunk.select_dtypes(include=['float'])
    for col in float_df.columns:
        missing_values = len(chunk) - chunk[col].count()
        if col in num_mv:
            num_mv[col] = num_mv[col] + missing_values
        else:
            num_mv[col] = missing_values
            
num_mv

{'member_id': 3,
 'loan_amnt': 3,
 'funded_amnt': 3,
 'funded_amnt_inv': 3,
 'installment': 3,
 'annual_inc': 7,
 'dti': 3,
 'delinq_2yrs': 32,
 'inq_last_6mths': 32,
 'open_acc': 32,
 'pub_rec': 32,
 'revol_bal': 3,
 'revol_util': 93,
 'total_acc': 32,
 'out_prncp': 3,
 'out_prncp_inv': 3,
 'total_pymnt': 3,
 'total_pymnt_inv': 3,
 'total_rec_prncp': 3,
 'total_rec_int': 3,
 'total_rec_late_fee': 3,
 'recoveries': 3,
 'collection_recovery_fee': 3,
 'last_pymnt_amnt': 3,
 'collections_12_mths_ex_med': 148,
 'policy_code': 3,
 'acc_now_delinq': 32,
 'chargeoff_within_12_mths': 148,
 'delinq_amnt': 32,
 'pub_rec_bankruptcies': 1368,
 'tax_liens': 108,
 'term': 3}

Let's drop the rows in which every value is missing, and measure the memory usage again.

In [28]:
loans_iter = pd.read_csv('loans_2007.csv', chunksize=3000, 
                         dtype=to_category, parse_dates=to_datetime)
num_mv = {}
mem_usage = []
for chunk in loans_iter:
    term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_cleaned)
    chunk = chunk.dropna(how='all')
    mem = chunk.memory_usage(deep=True).sum()
    mem_usage.append(mem)
    float_df = chunk.select_dtypes(include=['float'])
    for col in float_df.columns:
        missing_values = len(chunk) - chunk[col].count()
        if col in num_mv:
            num_mv[col] = num_mv[col] + missing_values
        else:
            num_mv[col] = missing_values
print(sum(mem_usage)/(2**20))            
num_mv

42.3846960067749


{'member_id': 3,
 'loan_amnt': 3,
 'funded_amnt': 3,
 'funded_amnt_inv': 3,
 'installment': 3,
 'annual_inc': 7,
 'dti': 3,
 'delinq_2yrs': 32,
 'inq_last_6mths': 32,
 'open_acc': 32,
 'pub_rec': 32,
 'revol_bal': 3,
 'revol_util': 93,
 'total_acc': 32,
 'out_prncp': 3,
 'out_prncp_inv': 3,
 'total_pymnt': 3,
 'total_pymnt_inv': 3,
 'total_rec_prncp': 3,
 'total_rec_int': 3,
 'total_rec_late_fee': 3,
 'recoveries': 3,
 'collection_recovery_fee': 3,
 'last_pymnt_amnt': 3,
 'collections_12_mths_ex_med': 148,
 'policy_code': 3,
 'acc_now_delinq': 32,
 'chargeoff_within_12_mths': 148,
 'delinq_amnt': 32,
 'pub_rec_bankruptcies': 1368,
 'tax_liens': 108,
 'term': 3}
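
As a possible next step, the float columns could be downcast with `pd.to_numeric`. The sketch below is an illustration on a small hypothetical dataframe, not a result from the loans data: columns without missing values are downcast to an integer subtype, and columns with missing values stay float but move to a smaller float subtype.

```python
import numpy as np
import pandas as pd

# Hypothetical values in the same shape as two of the loans columns
df = pd.DataFrame({
    "loan_amnt": [5000.0, 2500.0, 2400.0],  # no missing values
    "delinq_2yrs": [0.0, np.nan, 1.0],      # contains a missing value
})

optimized = df.copy()
for col in df.select_dtypes(include=["float"]).columns:
    if df[col].isnull().any():
        # NaN cannot be stored in a NumPy integer column, so stay with float
        optimized[col] = pd.to_numeric(df[col], downcast="float")
    else:
        # Whole-number floats can move to the smallest integer subtype that fits
        optimized[col] = pd.to_numeric(df[col], downcast="integer")

before = df.memory_usage(deep=True).sum()
after = optimized.memory_usage(deep=True).sum()
print(optimized.dtypes)
print(before, after)
```

When processing in chunks, the subtype chosen by `downcast` can differ from chunk to chunk, so it may be safer to decide on one dtype per column up front and pass it through the `dtype` argument of `read_csv`.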

## Conclusions and Next Steps