# Practice Optimizing Dataframes and Processing in Chunks

## Introduction

We'll be working with financial lending data from [Lending Club](https://www.lendingclub.com/), a marketplace for personal loans that matches borrowers with investors. 

The Lending Club's website lists approved loans. Qualified investors can view the borrower's credit score, the purpose of the loan, and other details in the loan applications. Once an investor is ready to back a loan, they select the amount of money they want to fund. When the loan amount the borrower requested is fully funded, the borrower receives the money, minus the origination fee that Lending Club charges.

We'll be working with a dataset of loans approved from 2007-2011, which you can download from [Lending Club's website](https://www.lendingclub.com/info/download-data.action). The **desc** column has been removed to make the system run more quickly.

In this guided project, we'll practice working with chunked dataframes and optimizing a dataframe's memory usage. 

If we read in the entire data set, it will consume about 67 megabytes of memory. **Let's imagine that we only have 10 megabytes of memory available throughout this project**, so you can practice the concepts you learned in the last two missions.

## Importing packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_columns = 99

%matplotlib inline

## First analysis

Let's look for quality issues in the first 5 rows and check the memory usage for the first 1000 rows:

In [2]:
first_5_rows = pd.read_csv("my_datasets/loans_2007.csv", nrows=5)
first_1000_rows = pd.read_csv("my_datasets/loans_2007.csv", nrows=1000)

In [3]:
# Show first 5 elements
first_5_rows

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,last_credit_pull_d,collections_12_mths_ex_med,policy_code,application_type,acc_now_delinq,chargeoff_within_12_mths,delinq_amnt,pub_rec_bankruptcies,tax_liens
0,1077501,1296599.0,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,,10+ years,RENT,24000.0,Verified,Dec-2011,Fully Paid,n,credit_card,Computer,860xx,AZ,27.65,0.0,Jan-1985,1.0,3.0,0.0,13648.0,83.7%,9.0,f,0.0,0.0,5863.155187,5833.84,5000.0,863.16,0.0,0.0,0.0,Jan-2015,171.62,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
1,1077430,1314167.0,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,Ryder,< 1 year,RENT,30000.0,Source Verified,Dec-2011,Charged Off,n,car,bike,309xx,GA,1.0,0.0,Apr-1999,5.0,3.0,0.0,1687.0,9.4%,4.0,f,0.0,0.0,1008.71,1008.71,456.46,435.17,0.0,117.08,1.11,Apr-2013,119.66,Sep-2013,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
2,1077175,1313524.0,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,,10+ years,RENT,12252.0,Not Verified,Dec-2011,Fully Paid,n,small_business,real estate business,606xx,IL,8.72,0.0,Nov-2001,2.0,2.0,0.0,2956.0,98.5%,10.0,f,0.0,0.0,3005.666844,3005.67,2400.0,605.67,0.0,0.0,0.0,Jun-2014,649.91,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
3,1076863,1277178.0,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,AIR RESOURCES BOARD,10+ years,RENT,49200.0,Source Verified,Dec-2011,Fully Paid,n,other,personel,917xx,CA,20.0,0.0,Feb-1996,1.0,10.0,0.0,5598.0,21%,37.0,f,0.0,0.0,12231.89,12231.89,10000.0,2214.92,16.97,0.0,0.0,Jan-2015,357.48,Apr-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0
4,1075358,1311748.0,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,University Medical Group,1 year,RENT,80000.0,Source Verified,Dec-2011,Current,n,other,Personal,972xx,OR,17.94,0.0,Jan-1996,0.0,15.0,0.0,27783.0,53.9%,38.0,f,461.73,461.73,3581.12,3581.12,2538.27,1042.85,0.0,0.0,0.0,Jun-2016,67.79,Jun-2016,0.0,1.0,INDIVIDUAL,0.0,0.0,0.0,0.0,0.0


In [4]:
# Total memory usage in MB (bytes / 2**20)
first_1000_rows.memory_usage(deep=True).sum() / 2**20

1.5387649536132812

According to the introduction, we have 10 MB of memory available, so let's pick a chunk size that uses about 5 MB per chunk and leaves headroom for intermediate results.
Since 1000 rows need about 1.5 MB, a chunk of 3000 rows should need about 4.6 MB:

In [5]:
chunk_iter = pd.read_csv("my_datasets/loans_2007.csv", chunksize=3000)
for chunk in chunk_iter:
    print(chunk.memory_usage(deep=True).sum()/(1024*1024))

4.614681243896484
4.6104278564453125
4.612185478210449
4.613537788391113
4.6097307205200195
4.6116132736206055
4.610204696655273
4.612573623657227
4.610699653625488
4.610750198364258
4.623462677001953
4.6223297119140625
4.62913703918457
4.8625898361206055
0.8746747970581055
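
The chunk-size estimate above can also be written out as arithmetic. This is only a minimal sketch: the helper name `rows_per_chunk` and the 5 MB budget per chunk are illustrative assumptions, and the sample figure is the one measured for the first 1000 rows. It also assumes that later rows look roughly like the sample, which is why a round 3000 rows leaves some margin.

```python
def rows_per_chunk(sample_bytes, sample_rows, budget_bytes):
    """Estimate how many rows fit in budget_bytes, given a measured sample."""
    bytes_per_row = sample_bytes / sample_rows
    return int(budget_bytes // bytes_per_row)

MB = 2**20
sample_bytes = 1.5387649536132812 * MB   # measured above for the first 1000 rows
max_rows = rows_per_chunk(sample_bytes, 1000, 5 * MB)
print(max_rows)   # upper bound on rows for a 5 MB chunk
```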


Let's also check the total number of rows:

In [6]:
chunk_iter = pd.read_csv("my_datasets/loans_2007.csv", chunksize=3000)
n_rows = sum([len(chunk) for chunk in chunk_iter])
print("Total number of rows:", n_rows)

Total number of rows: 42538


## Exploring the Data in Chunks

Let's try to understand the column types better while using dataframe chunks.

In [7]:
# How many columns have a numeric type? How many have a string type?

types = first_5_rows.dtypes.value_counts().index
columns_dict = {}
for t in types:
    mask = first_5_rows.dtypes == t
    col_list = first_5_rows.columns[mask]
    columns_dict[str(t)] = list(col_list)

for key, value in columns_dict.items():
    print("Columns of type",key,":")
    for v in value:
        print("-",v)
    print("\n")

Columns of type float64 :
- member_id
- loan_amnt
- funded_amnt
- funded_amnt_inv
- installment
- annual_inc
- dti
- delinq_2yrs
- inq_last_6mths
- open_acc
- pub_rec
- revol_bal
- total_acc
- out_prncp
- out_prncp_inv
- total_pymnt
- total_pymnt_inv
- total_rec_prncp
- total_rec_int
- total_rec_late_fee
- recoveries
- collection_recovery_fee
- last_pymnt_amnt
- collections_12_mths_ex_med
- policy_code
- acc_now_delinq
- chargeoff_within_12_mths
- delinq_amnt
- pub_rec_bankruptcies
- tax_liens


Columns of type object :
- term
- int_rate
- grade
- sub_grade
- emp_title
- emp_length
- home_ownership
- verification_status
- issue_d
- loan_status
- pymnt_plan
- purpose
- title
- zip_code
- addr_state
- earliest_cr_line
- revol_util
- initial_list_status
- last_pymnt_d
- last_credit_pull_d
- application_type


Columns of type int64 :
- id




In [8]:
# How many unique values are there in each string column? 
# How many of the string columns contain values that are less than 50% unique?

chunk_iter = pd.read_csv("my_datasets/loans_2007.csv", chunksize=3000)
uniques = {}
n_rows = 0

for chunk in chunk_iter:
    n_rows = n_rows + len(chunk)
    string_cols = chunk.select_dtypes(include=['object'])
    string_cols_names = string_cols.columns
    for c in string_cols_names:
        val_counts = string_cols[c].value_counts()
        if c in uniques:
            uniques[c].append(val_counts)
        else:
            uniques[c] = [val_counts]

uniques_combined = {}

print("Number of unique values per column:\n")
for col in uniques:
    u_concat = pd.concat(uniques[col])
    u_group = u_concat.groupby(u_concat.index).sum()
    uniques_combined[col] = u_group
    print(col, u_group.shape[0])
print("\nNumber of rows:",n_rows)

Number of unique values per column:

term 2
int_rate 394
grade 7
sub_grade 35
emp_title 30658
emp_length 11
home_ownership 5
verification_status 3
issue_d 55
loan_status 9
pymnt_plan 2
purpose 14
title 21264
zip_code 837
addr_state 50
earliest_cr_line 530
revol_util 1119
initial_list_status 1
last_pymnt_d 103
last_credit_pull_d 108
application_type 1
id 3538

Number of rows: 42538


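Of the string columns, only "**emp_title**" (30,658 unique values, about 72% of the rows) is clearly above the 50% threshold. "**title**" sits right at it (21,264 unique values out of 42,538 rows, about 50%), so converting it to a category would save little. Neither column is a good candidate for the category type. Note also that "**id**" appears in this list with 3,538 values, which suggests pandas read it as a string column only in the last two chunks (3000 + 538 rows), even though the first 5 rows showed it as an integer column.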
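
The 50% rule can be sketched on a tiny, made-up CSV. The data, the chunk size of 2, and the names `unique_share` and `category_candidates` are assumptions for illustration only. The combining step (concatenate the per-chunk `value_counts()` results, then group by the index and sum) is the same one used in the cell above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the loans file: 6 rows, two string columns.
csv_text = """grade,emp_title
A,Ryder
B,Acme
A,Initech
A,Globex
B,Umbrella
A,Hooli
"""

counts = {}   # column name -> list of per-chunk value_counts Series
n_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype=str):
    n_rows += len(chunk)
    for col in chunk.columns:
        counts.setdefault(col, []).append(chunk[col].value_counts())

unique_share = {}
for col, parts in counts.items():
    combined = pd.concat(parts)
    # The same value can appear in several chunks, so merge its counts.
    combined = combined.groupby(combined.index).sum()
    unique_share[col] = len(combined) / n_rows

# Columns whose values are less than 50% unique are category candidates.
category_candidates = [c for c, share in unique_share.items() if share < 0.5]
print(unique_share)
print(category_candidates)
```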

In [9]:
# Which float columns have no missing values and could be candidates for conversion to the integer type?

chunk_iter = pd.read_csv("my_datasets/loans_2007.csv", chunksize=3000)
missing_values = []

for chunk in chunk_iter:
    float_cols = chunk.select_dtypes(include=['float64'])
    missing_values.append(float_cols.isnull().sum())

missing_values_comb = pd.concat(missing_values)
missing_values_comb = missing_values_comb.groupby(missing_values_comb.index).sum().sort_values()
missing_values_comb

member_id                        3
total_rec_int                    3
total_pymnt_inv                  3
total_pymnt                      3
revol_bal                        3
recoveries                       3
policy_code                      3
out_prncp_inv                    3
out_prncp                        3
total_rec_late_fee               3
loan_amnt                        3
last_pymnt_amnt                  3
total_rec_prncp                  3
funded_amnt_inv                  3
funded_amnt                      3
dti                              3
collection_recovery_fee          3
installment                      3
annual_inc                       7
inq_last_6mths                  32
total_acc                       32
delinq_2yrs                     32
pub_rec                         32
delinq_amnt                     32
open_acc                        32
acc_now_delinq                  32
tax_liens                      108
collections_12_mths_ex_med     148
chargeoff_within_12_

No float column is free of missing values: each one has at least 3. None of them can therefore be cast directly to a NumPy integer type, which cannot represent missing values.
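
To see why missing values block the integer conversion, here is a small sketch on a made-up column. The nullable `"Int64"` extension dtype shown at the end is not used in this project. It is mentioned only as one possible alternative, under the assumption that the non-missing values are whole numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical float column with integer-valued data and one missing value.
delinq_2yrs = pd.Series([0.0, 1.0, np.nan, 2.0])

try:
    delinq_2yrs.astype("int64")
    plain_cast_ok = True
except ValueError:
    # NaN cannot be stored in a NumPy integer array.
    plain_cast_ok = False

# The nullable extension dtype keeps the gap as <NA> instead of failing.
nullable = delinq_2yrs.astype("Int64")
print(plain_cast_ok, nullable.dtype, nullable.isna().sum())
```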

In [10]:
# Calculate the total memory usage across all of the chunks.

chunk_iter = pd.read_csv("my_datasets/loans_2007.csv", chunksize=3000)
memory_usage_list = [chunk.memory_usage(deep=True).sum() for chunk in chunk_iter]

# Memory usage in MB
total_memory_usage = sum(memory_usage_list) / 2**20
total_memory_usage

65.72859859466553

With the default dtypes, the whole CSV file needs about 65.73 MB of memory.

## Optimizing String Columns

The string columns offer the greatest memory savings. Let's convert the columns whose values are less than 50% unique to the category type, and the columns that actually hold numeric values to the float type.

Let's start by selecting the useful string columns:

In [11]:
# string_cols_names still holds the string columns of the last chunk read in cell 8
string_cols_names = list(string_cols_names)
string_cols_names

['id',
 'term',
 'int_rate',
 'grade',
 'sub_grade',
 'emp_title',
 'emp_length',
 'home_ownership',
 'verification_status',
 'issue_d',
 'loan_status',
 'pymnt_plan',
 'purpose',
 'title',
 'zip_code',
 'addr_state',
 'earliest_cr_line',
 'revol_util',
 'initial_list_status',
 'last_pymnt_d',
 'last_credit_pull_d',
 'application_type']

In [12]:
useful_string_cols = ['term', 'sub_grade', 'emp_title', 'home_ownership', 'verification_status', 'issue_d', 'purpose', 'earliest_cr_line', 'revol_util', 'last_pymnt_d', 'last_credit_pull_d']
useful_string_cols

['term',
 'sub_grade',
 'emp_title',
 'home_ownership',
 'verification_status',
 'issue_d',
 'purpose',
 'earliest_cr_line',
 'revol_util',
 'last_pymnt_d',
 'last_credit_pull_d']

Let's use the `uniques_combined` dict that we created earlier to check the distinct values in the useful string columns.

In [13]:
for col in useful_string_cols:
    print("Column",col)
    print(uniques_combined[col])
    print("-----------------")

Column term
 36 months    31534
 60 months    11001
Name: term, dtype: int64
-----------------
Column sub_grade
A1    1142
A2    1520
A3    1823
A4    2905
A5    2793
B1    1882
B2    2113
B3    2997
B4    2590
B5    2807
C1    2264
C2    2157
C3    1658
C4    1370
C5    1291
D1    1053
D2    1485
D3    1322
D4    1140
D5    1016
E1     884
E2     791
E3     668
E4     552
E5     499
F1     392
F2     308
F3     236
F4     211
F5     154
G1     141
G2     107
G3      79
G4      99
G5      86
Name: sub_grade, dtype: int64
-----------------
Column emp_title
  old palm inc                                               1
 Brocade Communications                                      1
 CenturyLink                                                 1
 Department of Homeland Security                             1
 Down To Earth Distributors, Inc.                            1
 Plaid, Inc.                                                 1
 U.S. Dept. Of Homeland Security                            

Based on the output above, some of the string columns can be:
- converted to **category**: sub_grade, home_ownership, verification_status, purpose
- converted to **float**: term, revol_util
- converted to **datetime**: issue_d, earliest_cr_line, last_pymnt_d, last_credit_pull_d
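
The float and datetime conversions listed above can be tried on a few sample values first. This is a sketch rather than the project's exact code: the sample Series are made up from the first rows shown earlier, and the explicit `format="%b-%Y"` is an assumption based on values such as `Dec-2011`.

```python
import pandas as pd

term = pd.Series([" 36 months", " 60 months", None])
revol_util = pd.Series(["83.7%", "9.4%", "21%"])
issue_d = pd.Series(["Dec-2011", "Jan-1985"])

# Remove the literal " months" suffix, then convert to the smallest float type.
term_clean = term.str.strip().str.replace(" months", "", regex=False)
term_num = pd.to_numeric(term_clean, downcast="float")

# A single trailing "%" can safely be removed with rstrip.
revol_num = pd.to_numeric(revol_util.str.rstrip("%"), downcast="float")

# Month-year strings parse to the first day of that month.
issue_dt = pd.to_datetime(issue_d, format="%b-%Y")

print(term_num.tolist(), revol_num.tolist(), issue_dt.dt.year.tolist())
```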

In [14]:
convert_col_dtypes = {
    "sub_grade": "category", "home_ownership": "category", 
    "verification_status": "category", "purpose": "category"
}
date_cols = ["issue_d", "earliest_cr_line", "last_pymnt_d", "last_credit_pull_d"]

memory_usage_list = []
chunk_iter = pd.read_csv('my_datasets/loans_2007.csv', chunksize=3000, dtype=convert_col_dtypes, parse_dates=date_cols)
for chunk in chunk_iter:
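    # str.rstrip(" months") strips a set of characters rather than the literal suffix,
    # so remove the suffix explicitly instead.
    term_cleaned = chunk['term'].str.strip().str.replace(" months", "", regex=False)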
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned, downcast='float')
    chunk['revol_util'] = pd.to_numeric(revol_cleaned, downcast='float')
    memory_usage_list.append(chunk.memory_usage(deep=True).sum())

# Memory usage in MB
total_memory_usage = sum(memory_usage_list) / 2**20
total_memory_usage

41.250041007995605

After these changes, the required memory drops from about 65.7 MB to about 41.3 MB.
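
Most of that saving comes from the category type, which stores each distinct string once and keeps only a small integer code per row. The sketch below illustrates the effect on made-up data. The values and the 10,000-row size are assumptions, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# 10,000 rows but only 4 distinct values, similar in spirit to home_ownership.
values = ["RENT", "MORTGAGE", "OWN", "OTHER"] * 2500
as_object = pd.Series(values, dtype="object")
as_category = as_object.astype("category")

object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```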

## Optimizing Numeric Columns

Now let's optimize the numeric columns using the **pandas.to_numeric()** function. For the float columns, we can call it with the downcast='float' parameter so that pandas picks the smallest float type that can hold the values.

In [18]:
memory_usage_list = []
# float_cols still holds the float columns of the last chunk read in cell 9
float_cols_names = float_cols.columns

chunk_iter = pd.read_csv('my_datasets/loans_2007.csv', chunksize=3000, dtype=convert_col_dtypes, parse_dates=date_cols)
for chunk in chunk_iter:
    # Remove the literal " months" suffix (str.rstrip would strip a character set).
    term_cleaned = chunk['term'].str.strip().str.replace(" months", "", regex=False)
    revol_cleaned = chunk['revol_util'].str.rstrip("%")
    chunk['term'] = pd.to_numeric(term_cleaned, downcast='float')
    chunk['revol_util'] = pd.to_numeric(revol_cleaned, downcast='float')
    for col in float_cols_names:
        chunk[col] = pd.to_numeric(chunk[col], downcast='float')
    memory_usage_list.append(chunk.memory_usage(deep=True).sum())

# Memory usage in MB
total_memory_usage = sum(memory_usage_list) / 2**20
total_memory_usage

36.38195323944092

After this last optimization, the required memory drops from about 65.7 MB to about 36.4 MB, a reduction of roughly 45%.
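
The effect of downcast='float' on a single column can be checked in isolation. The sketch below uses made-up loan amounts. Note that float32 keeps only about 7 significant digits, so it is worth confirming that the downcast values still match the originals, as the last line does.

```python
import pandas as pd

# 1000 made-up loan amounts; pandas stores them as float64 by default.
loan_amnt = pd.Series([5000.0, 2500.0, 2400.0, 10000.0] * 250)
downcast = pd.to_numeric(loan_amnt, downcast="float")

print(loan_amnt.dtype, loan_amnt.nbytes)
print(downcast.dtype, downcast.nbytes)
print((downcast == loan_amnt).all())
```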