_Lambda School Data Science_

# Make features

Objectives
- Understand the purpose of feature engineering
- Work with strings in pandas
- Work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [2]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

--2019-05-30 21:29:16--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 64.48.1.20
Connecting to resources.lendingclub.com (resources.lendingclub.com)|64.48.1.20|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip’

LoanStats_2018Q4.cs     [      <=>           ]  21.42M  1.69MB/s    in 13s     

2019-05-30 21:29:34 (1.66 MB/s) - ‘LoanStats_2018Q4.csv.zip’ saved [22458773]



In [3]:
!unzip LoanStats_2018Q4.csv.zip

Archive:  LoanStats_2018Q4.csv.zip
  inflating: LoanStats_2018Q4.csv    


In [4]:
!head LoanStats_2018Q4.csv

Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)
"id","member_id","loan_amnt","funded_amnt","funded_amnt_inv","term","int_rate","installment","grade","sub_grade","emp_title","emp_length","home_ownership","annual_inc","verification_status","issue_d","loan_status","pymnt_plan","url","desc","purpose","title","zip_code","addr_state","dti","delinq_2yrs","earliest_cr_line","inq_last_6mths","mths_since_last_delinq","mths_since_last_record","open_acc","pub_rec","revol_bal","revol_util","total_acc","initial_list_status","out_prncp","out_prncp_inv","total_pymnt","total_pymnt_inv","total_rec_prncp","total_rec_int","total_rec_late_fee","recoveries","collection_recovery_fee","last_pymnt_d","last_pymnt_amnt","next_pymnt_d","last_credit_pull_d","collections_12_mths_ex_med","mths_since_last_major_derog","policy_code","application_type","annual_inc_joint","dti_joint","verification_status_joint","acc_now_delinq","tot_coll_amt","tot_cur_bal","open_acc_6m","open_act_il","op

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [0]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)

In [6]:
# skiprows=1 skips the prospectus notice on the first line of the file.
# skipfooter is only supported by the Python parser engine, so set engine='python'.
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1, skipfooter=2, engine='python')
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,10000,10000,10000.0,36 months,10.33%,324.23,B,B1,,< 1 year,MORTGAGE,280000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,Jan-1996,0,18.0,,14,0,9082,38%,23,w,9279.39,9279.39,964.08,964.08,...,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,,,9600,9600,9600.0,36 months,12.98%,323.37,B,B5,,,MORTGAGE,35704.0,Not Verified,Dec-2018,Current,n,,,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,8934.25,8934.25,994.35,994.35,...,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,4000,4000,4000.0,36 months,23.40%,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3762.39,3762.39,459.24,459.24,...,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,2500,2500,2500.0,36 months,13.56%,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2328.06,2328.06,251.94,251.94,...,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,30000,30000,30000.0,60 months,18.94%,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,29074.35,29074.35,2284.34,2284.34,...,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,


## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can check which columns have the `object` dtype, which is how pandas stores strings here.

In [0]:
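# List the dtype of every column; 'object' usually means strings
df.dtypes
# Drop identifier and free-text columns that we won't use as features
df = df.drop(['id', 'url', 'desc', 'member_id'], axis='columns')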
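
To narrow the dtype listing down to just the string columns, `select_dtypes` can help. The sketch below uses a small toy frame (the name `toy` and its values are illustrative assumptions, not the real LendingClub data) and forces the `object` dtype explicitly so the result does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

# Toy frame standing in for the LendingClub data (illustrative values only).
# dtype=object is set explicitly so the example behaves the same across pandas versions.
toy = pd.DataFrame({
    'loan_amnt': pd.Series([10000, 9600]),
    'term': pd.Series([' 36 months', ' 60 months'], dtype=object),
    'int_rate': pd.Series(['10.33%', '12.98%'], dtype=object),
})

# Keep only the columns stored with the object dtype (here, the strings).
string_columns = toy.select_dtypes(include='object').columns.tolist()
print(string_columns)
```

On the real dataframe the same call would be `df.select_dtypes(include='object')`, which gives a shortlist of columns that may need converting to numbers.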

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

In [0]:
def remove_percent(string):
  return float(string.strip("%"))

Apply the function to the `int_rate` column

In [0]:
df['int_rate'] = df['int_rate'].apply(remove_percent)
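
As an alternative to `apply`, the same conversion can be sketched with the vectorized `.str` accessor. The series `rates` below is a made-up stand-in for `df['int_rate']`.

```python
import pandas as pd

# Stand-in for df['int_rate'] (illustrative values only).
rates = pd.Series(['10.33%', '12.98%', '23.40%'])

# Strip the trailing percent sign from every element, then cast the column to float.
rates_float = rates.str.rstrip('%').astype(float)
print(rates_float.tolist())
```

Both approaches produce the same numbers; the `.str` version avoids calling a Python function once per row and skips missing values instead of raising on them.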


In [10]:
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,10000,10000,10000.0,36 months,10.33,324.23,B,B1,,< 1 year,MORTGAGE,280000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,Jan-1996,0,18.0,,14,0,9082,38%,23,w,9279.39,9279.39,964.08,964.08,720.61,243.47,0.0,0.0,...,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,9600,9600,9600.0,36 months,12.98,323.37,B,B5,,,MORTGAGE,35704.0,Not Verified,Dec-2018,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,8934.25,8934.25,994.35,994.35,665.75,328.6,0.0,0.0,...,0.0,0,0,6500,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,4000,4000,4000.0,36 months,23.4,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3762.39,3762.39,459.24,459.24,237.61,221.63,0.0,0.0,...,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,2500,2500,2500.0,36 months,13.56,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2328.06,2328.06,251.94,251.94,171.94,80.0,0.0,0.0,...,0.0,1,0,60124,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,30000,30000,30000.0,60 months,18.94,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,Dec-2018,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,29074.35,29074.35,2284.34,2284.34,925.65,1358.69,0.0,0.0,...,0.0,1,0,372872,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,


### Clean `emp_title`

Look at top 20 titles

In [11]:
df['emp_title'].value_counts(dropna = False).head(20)

NaN                   20947
Teacher                2090
Manager                1773
Registered Nurse        952
Driver                  924
RN                      726
Supervisor              697
Sales                   580
Project Manager         526
General Manager         523
Office Manager          521
Owner                   420
Director                402
Operations Manager      387
Truck Driver            387
Nurse                   326
Engineer                325
Sales Manager           304
manager                 301
Supervisor              270
Name: emp_title, dtype: int64

How often is `emp_title` null?

In [12]:
df['emp_title'].isnull().sum()

20947

Clean the title and handle missing values

In [13]:
def clean_titles(title):
  if isinstance(title, str):
    return title.strip().title()
  else:
    return 'Unknown'
  
df['emp_title'] = df['emp_title'].apply(clean_titles)
df['emp_title'].value_counts(dropna=False).head(20)

Unknown               20947
Teacher                2557
Manager                2395
Registered Nurse       1418
Driver                 1258
Supervisor             1160
Truck Driver            920
Rn                      834
Office Manager          805
Sales                   803
General Manager         791
Project Manager         720
Owner                   625
Director                523
Operations Manager      518
Sales Manager           500
Police Officer          440
Nurse                   425
Technician              420
Engineer                412
Name: emp_title, dtype: int64

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)

In [14]:
df['emp_title_manager'] = df['emp_title'].str.contains('Manager')
df['emp_title_manager'].head(20)

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18     True
19    False
Name: emp_title_manager, dtype: bool
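
The call above works because the titles were already title-cased and missing values replaced with 'Unknown'. If that cleaning step had not been done, `str.contains` accepts `case` and `na` parameters that cover both situations. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Uncleaned stand-in for df['emp_title'] (illustrative values only).
titles = pd.Series(['Project Manager', 'manager', 'Chef', np.nan])

# case=False ignores capitalization; na=False treats missing titles as "not a manager".
is_manager = titles.str.contains('manager', case=False, na=False)
print(is_manager.tolist())
```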

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"

In [0]:
df['issue_d'] = pd.to_datetime(df['issue_d'], infer_datetime_format=True)

df['issue_year'] = df['issue_d'].dt.year

df['issue_month'] = df['issue_d'].dt.month

df['issue_day'] = df['issue_d'].dt.day
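
Newer pandas releases deprecate `infer_datetime_format`, so it may be safer to state the format explicitly. Values such as `Dec-2018` correspond to the strptime pattern `%b-%Y` (abbreviated English month name, four-digit year). The sketch below assumes an English locale and uses a made-up series in place of `df['issue_d']`.

```python
import pandas as pd

# Stand-in for df['issue_d'] (illustrative values only).
issue_d = pd.Series(['Dec-2018', 'Nov-2018', 'Oct-2018'])

# Parse with an explicit format instead of letting pandas infer it.
parsed = pd.to_datetime(issue_d, format='%b-%Y')

# Pull the components out through the .dt accessor.
parts = pd.DataFrame({
    'issue_year': parsed.dt.year,
    'issue_month': parsed.dt.month,
    'issue_day': parsed.dt.day,
})
print(parts)
```

Because the source strings carry no day, every parsed date falls on the first of the month, so `issue_day` is constant and unlikely to be a useful feature.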

In [16]:
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,...,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,issue_year,issue_month,issue_day
0,10000,10000,10000.0,36 months,10.33,324.23,B,B1,Unknown,< 1 year,MORTGAGE,280000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,Jan-1996,0,18.0,,14,0,9082,38%,23,w,9279.39,9279.39,964.08,964.08,720.61,243.47,0.0,0.0,...,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,1
1,9600,9600,9600.0,36 months,12.98,323.37,B,B5,Unknown,,MORTGAGE,35704.0,Not Verified,2018-12-01,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,8934.25,8934.25,994.35,994.35,665.75,328.6,0.0,0.0,...,748,4200,0,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,1
2,4000,4000,4000.0,36 months,23.4,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3762.39,3762.39,459.24,459.24,237.61,221.63,0.0,0.0,...,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,1
3,2500,2500,2500.0,36 months,13.56,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2328.06,2328.06,251.94,251.94,171.94,80.0,0.0,0.0,...,16901,36500,18124,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,1
4,30000,30000,30000.0,60 months,18.94,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,29074.35,29074.35,2284.34,2284.34,925.65,1358.69,0.0,0.0,...,99468,15000,94072,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,1


# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid". Otherwise it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [17]:
# Check the exact format of the values in the term column
df['term'].loc[0]

' 36 months'

In [18]:
example_string = ' 40 months'
print(example_string)

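def term_to_int(string):
  # str.strip('months') removes any of the characters m, o, n, t, h, s from both ends,
  # not the word itself, so take the first whitespace-separated token instead
  return int(string.split()[0])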

example_string = term_to_int(example_string)
  
print(example_string)

type(example_string)

 40 months
40


int

In [0]:
df['term'] = df['term'].apply(term_to_int)

In [20]:
df['term'].dtype

dtype('int64')
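
Another way to do the `term` conversion is to pull the digits out with a regular expression, which does not depend on the exact wording or spacing around the number. This is a sketch on a made-up series, not the notebook's own method.

```python
import pandas as pd

# Stand-in for the original string-valued df['term'] (illustrative values only).
terms = pd.Series([' 36 months', ' 60 months'])

# Capture the first run of digits in each string, then cast to int.
# expand=False returns a Series rather than a one-column DataFrame.
term_months = terms.str.extract(r'(\d+)', expand=False).astype(int)
print(term_months.tolist())
```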

In [0]:
# Match the two "great" statuses exactly; str.contains would also match any longer status that merely includes them
df['loan_status_is_great'] = df['loan_status'].isin(['Current', 'Fully Paid']).astype(int)

In [0]:
df['last_pymnt_d'] = pd.to_datetime(df['last_pymnt_d'], infer_datetime_format=True)

In [23]:
df['last_pymnt_d'].value_counts(dropna=False)

2019-04-01    74937
2019-03-01    47723
2019-02-01     2040
2019-01-01     1582
2018-12-01     1057
2018-11-01      679
2018-10-01      234
NaT             160
Name: last_pymnt_d, dtype: int64

In [0]:
# Create last payment month
df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month

# Create last payment year
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.year

# Fill missing months and years with 0, then convert from floats to ints
df['last_pymnt_d_month'] = df['last_pymnt_d_month'].fillna(0).astype(int)

df['last_pymnt_d_year'] = df['last_pymnt_d_year'].fillna(0).astype(int)
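
Filling missing months and years with 0 gives integer columns, but 0 is not a real month or year and could mislead a model. One possible alternative, sketched here on a made-up series, is pandas' nullable `Int64` dtype, which keeps integers and missing values side by side.

```python
import pandas as pd

# Stand-in for df['last_pymnt_d'] with one missing payment date (illustrative values only).
last_pymnt_d = pd.to_datetime(pd.Series(['Apr-2019', None, 'Mar-2019']), format='%b-%Y')

# .dt.month yields floats when NaT is present; 'Int64' (capital I) stores ints plus <NA>.
month = last_pymnt_d.dt.month.astype('Int64')
print(month)
```

Whether to keep the missing marker or impute a value depends on the model used downstream; many estimators reject missing values, in which case a separate "is missing" indicator column is a common companion.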

In [25]:
df.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,...,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term,emp_title_manager,issue_year,issue_month,issue_day,loan_status_is_great,last_pymnt_d_month,last_pymnt_d_year
0,10000,10000,10000.0,36,10.33,324.23,B,B1,Unknown,< 1 year,MORTGAGE,280000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,Jan-1996,0,18.0,,14,0,9082,38%,23,w,9279.39,9279.39,964.08,964.08,720.61,243.47,0.0,0.0,...,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,1,1,4,2019
1,9600,9600,9600.0,36,12.98,323.37,B,B5,Unknown,,MORTGAGE,35704.0,Not Verified,2018-12-01,Current,n,home_improvement,Home improvement,401xx,KY,0.84,0,Nov-2003,0,69.0,,5,0,748,11.5%,23,w,8934.25,8934.25,994.35,994.35,665.75,328.6,0.0,0.0,...,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,1,1,4,2019
2,4000,4000,4000.0,36,23.4,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3762.39,3762.39,459.24,459.24,237.61,221.63,0.0,0.0,...,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,1,1,4,2019
3,2500,2500,2500.0,36,13.56,84.92,C,C1,Chef,10+ years,RENT,55000.0,Not Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,109xx,NY,18.24,0,Apr-2001,1,,45.0,9,1,4341,10.3%,34,w,2328.06,2328.06,251.94,251.94,171.94,80.0,0.0,0.0,...,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,1,1,4,2019
4,30000,30000,30000.0,60,18.94,777.23,D,D2,Postmaster,10+ years,MORTGAGE,90000.0,Source Verified,2018-12-01,Current,n,debt_consolidation,Debt consolidation,713xx,LA,26.52,0,Jun-1987,0,71.0,75.0,13,1,12315,24.2%,44,w,29074.35,29074.35,2284.34,2284.34,925.65,1358.69,0.0,0.0,...,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,,False,2018,12,1,1,4,2019


# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

In [0]:
# Replace null values with the string '0.00' so every value can be parsed
df['revol_util'] = df['revol_util'].fillna('0.00')

# Check that no null values remain
df['revol_util'].isnull().sum()

def strip_to_int(string):
  string = string.strip()      # Strip whitespace first so a trailing '%' is at the end
  string = string.strip('%')   # Strip the percent sign
  string = float(string)       # Parse as a float first, since int('11.5') raises ValueError
  string = int(string)         # Then truncate to an int (11.5 becomes 11)
  return string

In [35]:
# Apply func to series
df['revol_util'] = df['revol_util'].apply(strip_to_int)

df['revol_util'].head()

0    38
1    11
2    19
3    10
4    24
Name: revol_util, dtype: int64
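
The stretch goal asks for floats, whereas the cell above truncates to integers and records missing values as 0. A sketch that keeps the decimals and leaves missing values as `NaN` could look like this (the series is a made-up stand-in for the original `df['revol_util']`):

```python
import numpy as np
import pandas as pd

# Stand-in for the raw df['revol_util'] column (illustrative values only).
revol_util = pd.Series(['38%', '11.5%', np.nan, ' 19.2% '])

# Trim whitespace, drop the trailing percent sign, and parse as numbers.
# Missing entries pass through the .str methods and stay NaN.
revol_util_float = pd.to_numeric(revol_util.str.strip().str.rstrip('%'))
print(revol_util_float.tolist())
```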

In [71]:
# Create a top 20 list
top_20_titles = []

# Iterate through value count and pull the index of the highest counts
for i in df['emp_title'].value_counts().index:
  if len(top_20_titles) < 20:
    if i != 'Unknown': # Don't add 'Unknown' titles
      top_20_titles.append(i)
      
# Check to ensure list is correct
print(top_20_titles)
print(len(top_20_titles))

['Teacher', 'Manager', 'Registered Nurse', 'Driver', 'Supervisor', 'Truck Driver', 'Rn', 'Office Manager', 'Sales', 'General Manager', 'Project Manager', 'Owner', 'Director', 'Operations Manager', 'Sales Manager', 'Police Officer', 'Nurse', 'Technician', 'Engineer', 'Administrative Assistant']
20


In [73]:
# Create function to check if title is in the top 20
def isTop20(title):
  if title in top_20_titles:
    return title
  else:
    return 'Other'
  
df['emp_title'] = df['emp_title'].apply(isTop20)

df['emp_title'].value_counts()

Other                       110479
Teacher                       2557
Manager                       2395
Registered Nurse              1418
Driver                        1258
Supervisor                    1160
Truck Driver                   920
Rn                             834
Office Manager                 805
Sales                          803
General Manager                791
Project Manager                720
Owner                          625
Director                       523
Operations Manager             518
Sales Manager                  500
Police Officer                 440
Nurse                          425
Technician                     420
Engineer                       412
Administrative Assistant       409
Name: emp_title, dtype: int64
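
The loop and `apply` above can also be written with `value_counts`, `isin`, and `where`. The sketch below uses a tiny made-up series and a top 2 instead of a top 20 so the result is easy to check by eye; the names `n`, `top_titles`, and `collapsed` are illustrative.

```python
import pandas as pd

# Stand-in for the cleaned df['emp_title'] (illustrative values only).
titles = pd.Series(['Teacher', 'Teacher', 'Teacher', 'Manager', 'Manager',
                    'Chef', 'Unknown', 'Unknown', 'Unknown', 'Unknown'])
n = 2

# Count titles, ignoring the 'Unknown' placeholder, and keep the n most frequent.
top_titles = titles[titles != 'Unknown'].value_counts().head(n).index

# Keep a title where it is in the top n; replace everything else with 'Other'.
collapsed = titles.where(titles.isin(top_titles), 'Other')
print(collapsed.value_counts())
```

As in the cell above, 'Unknown' titles end up in 'Other' because they are excluded from the top list.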

Run the cells below to download and extract the Instacart data.

In [74]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

--2019-05-30 22:15:05--  https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.102.125
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.102.125|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205548478 (196M) [application/x-gzip]
Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’


2019-05-30 22:15:10 (40.1 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’ saved [205548478/205548478]



In [75]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

instacart_2017_05_01/
instacart_2017_05_01/._aisles.csv
instacart_2017_05_01/aisles.csv
instacart_2017_05_01/._departments.csv
instacart_2017_05_01/departments.csv
instacart_2017_05_01/._order_products__prior.csv
instacart_2017_05_01/order_products__prior.csv
instacart_2017_05_01/._order_products__train.csv
instacart_2017_05_01/order_products__train.csv
instacart_2017_05_01/._orders.csv
instacart_2017_05_01/orders.csv
instacart_2017_05_01/._products.csv
instacart_2017_05_01/products.csv


In [76]:
%cd instacart_2017_05_01

/content/instacart_2017_05_01
