# Optimizing DataFrames and Processing in Chunks

In this project, we demonstrate how to process a large dataset in chunks and reduce its memory footprint, using Lending Club's data on loans approved from 2007 to 2011.

In [1]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 99

## Loading the Data

In [2]:
# Preview first five rows of dataset
first_five = pd.read_csv('loans_2007.csv', nrows=5)
first_five

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


## Calculating Memory Footprint

Reading in the entire dataset consumes about 65 megabytes of memory. We will pretend that we only have 10 megabytes to work with, so we need a chunk size small enough to stay within that budget while still letting us process all of the data. Let's start with 1000 rows and work from there.

In [3]:
chunk_1000 = pd.read_csv('loans_2007.csv', nrows=1000)
# Finding and converting memory size to megabytes
chunk_1000.memory_usage(deep=True).sum()/(1024*1024)

1.5273666381835938

We found that reading in the first 1000 rows will consume about 1.53 megabytes of memory. We have 10 megabytes of memory to work with, so we can load in bigger chunks of data. However, just to be on the safe side, we will load in no more than 50% of the 10 MB that we have to work with.

In this case, 3000 rows per chunk should fit, since 1.53 MB multiplied by 3 gives roughly 4.6 MB per chunk. Let's check whether this holds.

In [4]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
for chunk in chunk_iter:
    print(chunk.memory_usage(deep=True).sum()/(1024*1024))

4.580394744873047
4.576141357421875
4.577898979187012
4.579251289367676
4.575444221496582
4.577326774597168
4.575918197631836
4.578287124633789
4.576413154602051
4.57646369934082
4.589176177978516
4.588043212890625
4.594850540161133
4.828314781188965
0.868586540222168


As we can see, the largest chunk consumes about 4.83 MB. This stays just under our self-imposed limit of 5 MB (50% of the 10 MB we have available).

## Exploring the Data in Chunks

In [5]:
# Find out number of rows in entire dataset
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
num_rows = 0
for chunk in chunk_iter:
    num_rows += len(chunk)
print(num_rows)

42538


## How many columns have a numeric type? How many have a string type?

In [6]:
loans_chunks = pd.read_csv('loans_2007.csv', chunksize=3000)

numeric = []  # Number of numeric columns in each chunk
string = []  # Number of string columns in each chunk
for c in loans_chunks:
    nums = c.select_dtypes(include=[np.number]).shape[1]
    numeric.append(nums)
    strs = c.select_dtypes(include=['object']).shape[1]
    string.append(strs)

print(numeric)
print(string)

[31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 30, 30]
[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22, 22]


Most chunks contain 31 numeric columns and 21 string columns, but the last two chunks contain 30 numeric and 22 string columns. To find out which column changes type, we will compare the string columns of each chunk with those of the first chunk.

In [7]:
obj_cols = []
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)

for chunk in chunk_iter:
    chunk_obj_cols = chunk.select_dtypes(include=['object']).columns.tolist()
    if len(obj_cols) > 0:
        is_same = obj_cols == chunk_obj_cols
        if not is_same:
            print("overall obj cols:", obj_cols, "\n")
            print("chunk obj cols:", chunk_obj_cols, "\n")
    else:
        obj_cols = chunk_obj_cols

overall obj cols: ['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 

chunk obj cols: ['id', 'term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 

overall obj cols: ['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'revol_util', 'initial_list_status', 'last_pymnt_d', 'last_credit_pull_d', 'application_type'] 



We can see that in the last two chunks, the `id` column is read as a string (`object`) column rather than a numeric one. Since we are not interested in using the `id` column for our analysis, we will ignore it for now.
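If the `id` column were needed later, one possible way to give it the same type in every chunk is to coerce it with `pd.to_numeric(..., errors="coerce")`. The sketch below uses a small made-up CSV (not the loans file) in which one `id` value is non-numeric. Whether that is what actually happens in `loans_2007.csv` is an assumption.

```python
import io

import pandas as pd

# Hypothetical CSV: the third row carries a non-numeric id
csv_text = "id,loan_amnt\n1,1000\n2,2000\nnot-a-number,3000\n4,4000\n"

# Without intervention, each chunk infers its own dtype for `id`
raw_is_numeric = [
    pd.api.types.is_numeric_dtype(chunk["id"])
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]

# Coercing turns unparseable entries into NaN, so every chunk ends up numeric
coerced_is_numeric = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk["id"] = pd.to_numeric(chunk["id"], errors="coerce")
    coerced_is_numeric.append(pd.api.types.is_numeric_dtype(chunk["id"]))

print(raw_is_numeric, coerced_is_numeric)
```

The trade-off is that coerced entries become missing values, so the column is stored as a float rather than an integer in any chunk that contains one.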

## How many unique values are there in each string column? How many of the string columns contain values that are less than 50% unique?

In [8]:
loans_chunks = pd.read_csv('loans_2007.csv', chunksize=3000)

uniques = {}
for lc in loans_chunks:
    strings_only = lc.select_dtypes(include=['object'])
    cols = strings_only.columns
    for c in cols:
        val_counts = strings_only[c].value_counts()
        if c in uniques:
            uniques[c].append(val_counts)
        else:
            uniques[c] = [val_counts]

uniques_combined = {}
for col in uniques:
    u_concat = pd.concat(uniques[col])
    u_group = u_concat.groupby(u_concat.index).sum()
    uniques_combined[col] = u_group
    # Report columns with fewer than 50 distinct values (an absolute count, not a percentage)
    if u_group.shape[0] < 50:
        print(col, u_group.shape[0])

term 2
grade 7
sub_grade 35
emp_length 11
home_ownership 5
verification_status 3
loan_status 9
pymnt_plan 2
purpose 14
initial_list_status 1
application_type 1
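
The cell above flags columns with fewer than 50 distinct values. A ratio-based check (distinct values divided by non-null values, compared against 50%) could be sketched as follows. The helper name `unique_ratio` and the toy chunks are hypothetical and stand in for the chunks read from the CSV.

```python
import pandas as pd


def unique_ratio(chunks, col):
    """Distinct values divided by non-null values for `col`, combined across chunks."""
    per_chunk = [chunk[col].value_counts() for chunk in chunks]
    counts = pd.concat(per_chunk)
    # Sum the counts of values that appear in more than one chunk
    combined = counts.groupby(counts.index).sum()
    return len(combined) / combined.sum()


# Toy data standing in for two chunks of the loans file
chunk_a = pd.DataFrame({"grade": ["A", "B", "A", "A"], "emp_title": ["w", "x", "y", "z"]})
chunk_b = pd.DataFrame({"grade": ["B", "C", "A", "B"], "emp_title": ["p", "q", "r", "s"]})
toy_chunks = [chunk_a, chunk_b]

ratios = {col: unique_ratio(toy_chunks, col) for col in ["grade", "emp_title"]}
low_cardinality = [col for col, ratio in ratios.items() if ratio < 0.5]
print(ratios, low_cardinality)
```

Columns with a low ratio repeat the same values many times, which is what makes them good candidates for the `category` type.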


## Which float columns have no missing values and could be candidates for conversion to the integer type?

In [9]:
loans_chunks = pd.read_csv('loans_2007.csv', chunksize=3000)

missing = []
for lc in loans_chunks:
    floats = lc.select_dtypes(include=['float'])
    missing.append(floats.isnull().sum())

combined_missing = pd.concat(missing)
combined_missing.groupby(combined_missing.index).sum().sort_values()

member_id                        3
total_rec_int                    3
total_pymnt_inv                  3
total_pymnt                      3
revol_bal                        3
recoveries                       3
policy_code                      3
out_prncp_inv                    3
out_prncp                        3
total_rec_late_fee               3
loan_amnt                        3
last_pymnt_amnt                  3
total_rec_prncp                  3
funded_amnt_inv                  3
funded_amnt                      3
dti                              3
collection_recovery_fee          3
installment                      3
annual_inc                       7
inq_last_6mths                  32
total_acc                       32
delinq_2yrs                     32
pub_rec                         32
delinq_amnt                     32
open_acc                        32
acc_now_delinq                  32
tax_liens                      108
collections_12_mths_ex_med     148
chargeoff_within_12_

## Calculate the total memory usage across all chunks

In [10]:
loans_chunks = pd.read_csv('loans_2007.csv', chunksize=3000)

mem_usage = []

for lc in loans_chunks:
    mem_usage.append(lc.memory_usage(deep=True).sum() / 1024 ** 2)

sum(mem_usage)

65.24251079559326

## Optimizing String Columns

In [11]:
obj_cols

['term',
 'int_rate',
 'grade',
 'sub_grade',
 'emp_title',
 'emp_length',
 'home_ownership',
 'verification_status',
 'issue_d',
 'loan_status',
 'pymnt_plan',
 'purpose',
 'title',
 'zip_code',
 'addr_state',
 'earliest_cr_line',
 'revol_util',
 'initial_list_status',
 'last_pymnt_d',
 'last_credit_pull_d',
 'application_type']

In [12]:
useful_obj_cols = ['term', 'sub_grade', 'emp_title', 'home_ownership', 'verification_status',
                   'issue_d', 'purpose', 'earliest_cr_line', 'revol_util', 'last_pymnt_d', 'last_credit_pull_d']

In [13]:
# Create dictionary (key: column, value: list of Series objects representing each chunk's value counts)
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000)
str_cols_vc = {}
for chunk in chunk_iter:
    str_cols = chunk.select_dtypes(include=['object'])
    for col in str_cols.columns:
        current_col_vc = str_cols[col].value_counts()
        if col in str_cols_vc:
            str_cols_vc[col].append(current_col_vc)
        else:
            str_cols_vc[col] = [current_col_vc]

In [14]:
# Combine value counts
combined_vcs = {}

for col in str_cols_vc:
    combined_vc = pd.concat(str_cols_vc[col])
    final_vc = combined_vc.groupby(combined_vc.index).sum()
    combined_vcs[col] = final_vc

In [15]:
for col in useful_obj_cols:
    print(col)
    print(combined_vcs[col])
    print("-----------")

term
 36 months    31534
 60 months    11001
Name: term, dtype: int64
-----------
sub_grade
A1    1142
A2    1520
A3    1823
A4    2905
A5    2793
B1    1882
B2    2113
B3    2997
B4    2590
B5    2807
C1    2264
C2    2157
C3    1658
C4    1370
C5    1291
D1    1053
D2    1485
D3    1322
D4    1140
D5    1016
E1     884
E2     791
E3     668
E4     552
E5     499
F1     392
F2     308
F3     236
F4     211
F5     154
G1     141
G2     107
G3      79
G4      99
G5      86
Name: sub_grade, dtype: int64
-----------
emp_title
  old palm inc                       1
 Brocade Communications              1
 CenturyLink                         1
 Department of Homeland Security     1
 Down To Earth Distributors, Inc.    1
                                    ..
zashko inc.                          1
zeno office solutions                1
zion lutheran school                 1
zoll medical corp                    1
zozaya officiating                   1
Name: emp_title, Length: 30658, dtype: int

## Converting to Category

In [16]:
convert_col_dtypes = {
    "sub_grade": "category", "home_ownership": "category",
    "verification_status": "category", "purpose": "category"
}

In [17]:
chunk[useful_obj_cols]

Unnamed: 0,term,sub_grade,emp_title,home_ownership,verification_status,issue_d,purpose,earliest_cr_line,revol_util,last_pymnt_d,last_credit_pull_d
42000,36 months,C2,Best Buy,RENT,Not Verified,Feb-2008,debt_consolidation,Jul-2000,100.7%,Feb-2011,Jun-2016
42001,36 months,G2,CVS PHARMACY,OWN,Not Verified,Feb-2008,debt_consolidation,Mar-1989,51.9%,Nov-2008,Jun-2016
42002,36 months,E4,General Motors,RENT,Not Verified,Feb-2008,debt_consolidation,Dec-1998,80.7%,Feb-2011,Jun-2016
42003,36 months,G4,usa medical center,RENT,Not Verified,Feb-2008,debt_consolidation,Jul-1995,57.2%,Feb-2011,Jun-2011
42004,36 months,B3,InvestSource Inc,RENT,Not Verified,Feb-2008,debt_consolidation,Sep-2005,74%,Mar-2010,Aug-2010
...,...,...,...,...,...,...,...,...,...,...,...
42533,36 months,B3,,RENT,Not Verified,Jun-2007,other,,,Jun-2010,May-2007
42534,36 months,A5,,NONE,Not Verified,Jun-2007,other,,,Jun-2010,Aug-2007
42535,36 months,A3,Homemaker,MORTGAGE,Not Verified,Jun-2007,other,,,Jun-2010,Feb-2015
42536,,,,,,,,,,,


In [18]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, dtype=convert_col_dtypes, parse_dates=[
                         "issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"])

total_memory = []
for chunk in chunk_iter:
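    # Clean term and revol_util columns and convert to numeric.
    # Note: str.rstrip(" months") strips any trailing characters from that set, not the
    # literal suffix; this is safe here because the remaining digits are not in the set.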
    term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_cleaned)
    total_memory.append(chunk.memory_usage(deep=True).sum() / (1024 * 1024))

print('\nTotal memory usage: {:.2f} MB'.format(sum(total_memory)))


Total memory usage: 41.08 MB


After cleaning the data and converting some columns, we were able to drop the total memory usage from 65 MB to 41 MB. Next we will optimize the numeric columns.
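
Part of this saving comes from the `category` type, which stores each distinct string once and keeps only a small integer code per row. The effect can be seen in isolation on a synthetic column. The values below are made up for illustration and are not read from the loans file.

```python
import pandas as pd

# Hypothetical low-cardinality column: 4 distinct values repeated over 10,000 rows
purposes = ["debt_consolidation", "credit_card", "other", "car"]
as_strings = pd.Series(purposes * 2500, name="purpose")
as_category = as_strings.astype("category")

string_bytes = as_strings.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(string_bytes, category_bytes)
```

The saving shrinks as the number of distinct values approaches the number of rows, which is why a column such as `emp_title` was left out of `convert_col_dtypes`.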

## Optimizing Numeric Columns

In [19]:
# Find float columns with missing values
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, dtype=convert_col_dtypes, parse_dates=[
                         "issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"])
mv_counts = {}
for chunk in chunk_iter:
    term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_cleaned)
    float_cols = chunk.select_dtypes(include=['float'])
    for col in float_cols.columns:
        missing_values = len(chunk) - chunk[col].count()
        if col in mv_counts:
            mv_counts[col] = mv_counts[col] + missing_values
        else:
            mv_counts[col] = missing_values
mv_counts

{'member_id': 3,
 'loan_amnt': 3,
 'funded_amnt': 3,
 'funded_amnt_inv': 3,
 'installment': 3,
 'annual_inc': 7,
 'dti': 3,
 'delinq_2yrs': 32,
 'inq_last_6mths': 32,
 'open_acc': 32,
 'pub_rec': 32,
 'revol_bal': 3,
 'revol_util': 93,
 'total_acc': 32,
 'out_prncp': 3,
 'out_prncp_inv': 3,
 'total_pymnt': 3,
 'total_pymnt_inv': 3,
 'total_rec_prncp': 3,
 'total_rec_int': 3,
 'total_rec_late_fee': 3,
 'recoveries': 3,
 'collection_recovery_fee': 3,
 'last_pymnt_amnt': 3,
 'collections_12_mths_ex_med': 148,
 'policy_code': 3,
 'acc_now_delinq': 32,
 'chargeoff_within_12_mths': 148,
 'delinq_amnt': 32,
 'pub_rec_bankruptcies': 1368,
 'tax_liens': 108,
 'term': 3}

The results show that every `float` column contains at least a few missing values, so we cannot convert any of them to `int`. We can, however, downcast them to a smaller float type to decrease the dataframe size. Next we will convert the `float64` columns to `float32`.
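
Before running this on the full file, both points can be illustrated on a toy Series. The values are hypothetical and were chosen so that they are exactly representable in `float32`. Real columns with many significant digits may lose some precision when downcast.

```python
import numpy as np
import pandas as pd

# Hypothetical float column with one missing value
amounts = pd.Series([5000.0, np.nan, 2400.5], name="loan_amnt")

# A plain integer dtype cannot hold NaN, so the conversion is rejected
try:
    amounts.astype("int64")
    int_conversion_ok = True
except ValueError:
    int_conversion_ok = False

# Downcasting keeps the NaN and halves the bytes used per value
downcast = pd.to_numeric(amounts, downcast="float")
print(int_conversion_ok, downcast.dtype, amounts.nbytes, downcast.nbytes)
```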

In [20]:
chunk_iter = pd.read_csv('loans_2007.csv', chunksize=3000, dtype=convert_col_dtypes, parse_dates=[
                         "issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"])

total_memory = []
for chunk in chunk_iter:
    term_cleaned = chunk['term'].str.lstrip(" ").str.rstrip(" months")
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned)
    chunk['revol_util'] = pd.to_numeric(revol_cleaned)
    float_cols = chunk.select_dtypes(include=['float'])
    # Convert to float32
    for col in float_cols.columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast='float')
    total_memory.append(chunk.memory_usage(deep=True).sum() / (1024 * 1024))

print('\nTotal memory usage: {:.2f} MB'.format(sum(total_memory)))


Total memory usage: 36.04 MB


After converting all the `float64` columns to `float32`, we were able to decrease the memory size even further, from 41 MB to 36 MB.

## Conclusion

In this project, we processed a large dataset by loading it in chunks and reduced its footprint by converting columns to more efficient data types. Total memory usage fell from about 65 MB to 36 MB, a decrease of roughly 45%. The same techniques make it possible to work with larger datasets under a fixed memory budget.
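
For reference, the percentage reductions implied by the totals reported above can be recomputed as follows. The inputs are the rounded figures printed by the earlier cells, and the helper name `pct_reduction` is hypothetical.

```python
def pct_reduction(before_mb, after_mb):
    """Percentage decrease from `before_mb` to `after_mb`."""
    return (before_mb - after_mb) / before_mb * 100


baseline_mb = 65.24          # all chunks, default dtypes
after_strings_mb = 41.08     # after category, date and numeric conversions
after_floats_mb = 36.04      # after downcasting float64 to float32

string_step = pct_reduction(baseline_mb, after_strings_mb)
float_step = pct_reduction(after_strings_mb, after_floats_mb)
overall = pct_reduction(baseline_mb, after_floats_mb)
print(f"{string_step:.1f}% {float_step:.1f}% {overall:.1f}%")
```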