_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [0]:
#!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

In [0]:
#!unzip LoanStats_2018Q4.csv.zip

In [0]:
#!head LoanStats_2018Q4.csv

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)

In [326]:
import pandas as pd
df = pd.read_csv('LoanStats_2018Q4.csv', header=1, skipfooter=2)

  


In [327]:
print(df.shape)
df.head()

(128412, 144)


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,...,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,,,5000,5000,5000.0,36 months,17.97%,180.69,D,D1,Administrative,6 years,MORTGAGE,59280.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,490xx,MI,10.51,0,Apr-2011,0,,,8,0,4599,19.1%,13,w,4456.17,4456.17,895.96,895.96,...,0.0,0,0,136927,11749,13800,10000,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
1,,,25000,25000,25000.0,60 months,14.47%,587.82,C,C2,teacher,10+ years,OWN,110000.0,Not Verified,Dec-2018,Current,n,,,credit_card,Credit card refinancing,117xx,NY,26.43,1,Jan-1997,0,7.0,,23,0,39053,45.7%,49,w,23533.24,23533.24,2908.95,2908.95,...,10.0,0,0,179321,95648,62800,91424,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
2,,,10000,10000,10000.0,36 months,10.33%,324.23,B,B1,,< 1 year,MORTGAGE,280000.0,Not Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,974xx,OR,6.15,2,Jan-1996,0,18.0,,14,0,9082,38%,23,w,8788.59,8788.59,1612.54,1612.54,...,28.6,0,0,367828,61364,20900,54912,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
3,,,4000,4000,4000.0,36 months,23.40%,155.68,E,E1,Security,3 years,RENT,90000.0,Source Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,070xx,NJ,26.33,0,Sep-2006,4,59.0,,15,0,5199,19.2%,20,w,3596.15,3596.15,770.6,770.6,...,0.0,0,0,98655,66926,21900,71555,,,,,,,,,,,,N,,,,,,,,,,,,,,,N,,,,,,
4,,,31450,31450,31450.0,36 months,7.56%,979.16,A,A3,Construction Manager,7 years,MORTGAGE,130000.0,Verified,Dec-2018,Current,n,,,debt_consolidation,Debt consolidation,895xx,NV,9.29,0,Jul-1997,0,,,11,0,65911,63.1%,17,w,26689.41,26689.41,5802.32,5802.32,...,33.3,0,0,519900,65911,62400,0,64367.0,Jul-1997,0.0,3.0,14.0,60.9,1.0,19.0,0.0,0.0,,N,,,,,,,,,,,,,,,N,,,,,,


In [328]:
df.isna().sum()

id                                            128412
member_id                                     128412
loan_amnt                                          0
funded_amnt                                        0
funded_amnt_inv                                    0
term                                               0
int_rate                                           0
installment                                        0
grade                                              0
sub_grade                                          0
emp_title                                      20947
emp_length                                     11704
home_ownership                                     0
annual_inc                                         0
verification_status                                0
issue_d                                            0
loan_status                                        0
pymnt_plan                                         0
url                                           

## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can get info about which columns have a datatype of "object" (usually strings).
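One way to list the object columns is sketched below. The `sample` frame is a hypothetical stand-in for `df`, built with a few values from the rows shown earlier, and its string columns are created with `dtype=object` explicitly to match the `dtype: object` output this notebook shows (newer pandas versions may load strings with a dedicated string dtype instead).

```python
import pandas as pd

# Hypothetical stand-in for df; values taken from the first rows shown above.
sample = pd.DataFrame({
    'loan_amnt': [5000, 25000, 10000],
    'int_rate': pd.Series(['17.97%', '14.47%', '10.33%'], dtype=object),
    'grade': pd.Series(['D', 'C', 'B'], dtype=object),
})

# Keep only the columns stored with the generic object dtype.
object_columns = sample.select_dtypes(include=[object]).columns.tolist()
print(object_columns)

# describe() restricted to object columns reports count, unique, top, and freq.
print(sample.describe(include=[object]))
```

On the full dataframe the same two calls would be `df.select_dtypes(include=[object])` and `df.describe(include=[object])`.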

### Convert `int_rate`

Define a function to remove percent signs from strings and convert to floats

Apply the function to the `int_rate` column
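A minimal sketch of these two steps, using a small Series in place of `df['int_rate']`. The function name `remove_percent` is an assumption, not necessarily the name used in the lesson.

```python
import pandas as pd

def remove_percent(value):
    """Trim whitespace, drop the percent sign, and convert to float."""
    return float(value.strip().strip('%'))

# Stand-in for df['int_rate']; values taken from the rows shown above.
rates = pd.Series(['17.97%', '14.47%', '10.33%'], name='int_rate')

# apply() calls the function once per element and returns a new Series.
rates_as_float = rates.apply(remove_percent)
print(rates_as_float.tolist())
```

Against the real column this would read `df['int_rate'] = df['int_rate'].apply(remove_percent)`, which works here because `int_rate` has no missing values.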

### Clean `emp_title`

Look at top 20 titles

How often is `emp_title` null?

Clean the title and handle missing values
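The three steps could look roughly like this. The Series is a small stand-in for `df['emp_title']`; the `' Teacher '` entry, the `clean_title` name, and the `'Unknown'` placeholder are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for df['emp_title']; the last entry is a made-up variant.
titles = pd.Series(
    ['Administrative', 'teacher', np.nan, 'Security', ' Teacher '],
    name='emp_title',
)

# Most common raw titles (head(20) gives the top 20 on the real column).
print(titles.value_counts().head(20))

# Number of missing titles.
print(titles.isna().sum())

def clean_title(title):
    """Normalize whitespace and capitalization; label missing titles."""
    if isinstance(title, str):
        return title.strip().title()
    return 'Unknown'  # assumed placeholder for missing values

cleaned = titles.apply(clean_title)
print(cleaned.tolist())
```

Normalizing case and whitespace merges variants such as `'teacher'` and `' Teacher '` into one category before counting.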

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)
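A sketch of how `str.contains` can produce a manager flag, assuming a case-insensitive match on the substring `'manager'` is what the feature should capture. The Series stands in for `df['emp_title']`.

```python
import pandas as pd

# Stand-in for df['emp_title'], including one missing value.
titles = pd.Series(
    ['Construction Manager', 'teacher', None, 'Security'],
    name='emp_title',
)

# case=False ignores capitalization; na=False treats missing titles as non-matches.
is_manager = titles.str.contains('manager', case=False, na=False)
print(is_manager.tolist())
```

On the full dataframe the result would be assigned to the new column, for example `df['emp_title_manager'] = df['emp_title'].str.contains('manager', case=False, na=False)`.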

## Work with dates

pandas documentation
- [to_datetime](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"
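A short sketch of both ideas, parsing `Mon-YYYY` strings and then reading components through `.dt`. The values come from the `earliest_cr_line` column shown earlier; the explicit `format='%b-%Y'` assumes English month abbreviations.

```python
import pandas as pd

# Stand-in for df['earliest_cr_line'].
earliest = pd.Series(['Apr-2011', 'Jan-1997', 'Jan-1996'], name='earliest_cr_line')

# Parse month-year strings; the day defaults to the first of the month.
parsed = pd.to_datetime(earliest, format='%b-%Y')

# The .dt accessor exposes date components as new Series.
years = parsed.dt.year
months = parsed.dt.month
print(years.tolist())
print(months.tolist())
```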

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid". Otherwise it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [329]:
df['term'].head()

0     36 months
1     60 months
2     36 months
3     36 months
4     36 months
Name: term, dtype: object

In [330]:
df['term'].nunique()  # check the number of unique values in the column

2

In [0]:
# Remove the literal ' months' suffix (str.strip would treat it as a set of characters), trim whitespace, then cast to int
df['term'] = df['term'].str.replace(' months', '', regex=False).str.strip().astype(int)

In [332]:
df['term'].head(10)

0    36
1    60
2    36
3    36
4    36
5    36
6    36
7    36
8    60
9    36
Name: term, dtype: int64

In [333]:
df['loan_status'].head(10)

0    Current
1    Current
2    Current
3    Current
4    Current
5    Current
6    Current
7    Current
8    Current
9    Current
Name: loan_status, dtype: object

In [334]:
df['loan_status'].unique()

array(['Current', 'Fully Paid', 'Late (31-120 days)', 'In Grace Period',
       'Late (16-30 days)', 'Charged Off', 'Default'], dtype=object)

In [0]:
df['loan_status_is_great'] = df['loan_status'].isin(['Current', 'Fully Paid']).astype(int)

In [336]:
pd.options.display.max_rows = 150
df[['loan_status','loan_status_is_great']].head(100)

Unnamed: 0,loan_status,loan_status_is_great
0,Current,1
1,Current,1
2,Current,1
3,Current,1
4,Current,1
5,Current,1
6,Current,1
7,Current,1
8,Current,1
9,Current,1


In [337]:
df['last_pymnt_d'].head(10)

0    Jun-2019
1    May-2019
2    May-2019
3    May-2019
4    Jun-2019
5    May-2019
6    May-2019
7    May-2019
8    May-2019
9    May-2019
Name: last_pymnt_d, dtype: object

In [338]:
df['last_pymnt_d'].describe()

count       128253
unique          10
top       Jun-2019
freq        103574
Name: last_pymnt_d, dtype: object

In [339]:
df['last_pymnt_d'].isna().sum()

159

In [0]:
last_pm_md = df['last_pymnt_d'].mode()
df['last_pymnt_d'].fillna(last_pm_md, inplace=True)  # No effect: mode() returns a Series, so fillna aligns it by index label and only fills a NaN at index 0; pass the scalar .mode()[0] instead

In [341]:
df['last_pymnt_d'].isna().sum()

159

In [342]:
df['last_pymnt_d'].describe()

count       128253
unique          10
top       Jun-2019
freq        103574
Name: last_pymnt_d, dtype: object

In [343]:
df['last_pymnt_d'].unique()

array(['Jun-2019', 'May-2019', 'Apr-2019', 'Feb-2019', 'Jan-2019',
       'Mar-2019', nan, 'Dec-2018', 'Jul-2019', 'Nov-2018', 'Oct-2018'],
      dtype=object)

In [344]:
df['last_pymnt_d'] = pd.to_datetime(df['last_pymnt_d'], format='%b-%Y')  # explicit format; infer_datetime_format is deprecated in pandas 2.x
df['last_pymnt_d'].head()

0   2019-06-01
1   2019-05-01
2   2019-05-01
3   2019-05-01
4   2019-06-01
Name: last_pymnt_d, dtype: datetime64[ns]

In [0]:
last_pm_md2 = df['last_pymnt_d'].mode()
df['last_pymnt_d'].fillna(last_pm_md2, inplace=True)  # No effect for the same reason: last_pm_md2 is a Series, not a scalar

In [346]:
df['last_pymnt_d'].isna().sum()

159
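The count stays at 159 because `mode()` returns a Series (there can be several modes), and `fillna` given a Series fills by matching index labels rather than broadcasting one value. The sketch below reproduces the behavior on a small made-up Series and shows one possible fix, taking the first mode as a scalar.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df['last_pymnt_d'] with two missing values.
payments = pd.Series(
    ['Jun-2019', 'May-2019', np.nan, 'Jun-2019', np.nan],
    name='last_pymnt_d',
)

mode_series = payments.mode()  # a Series whose only index label is 0

# Filling with a Series aligns on index; row 0 is not missing, so nothing changes.
aligned = payments.fillna(mode_series)
print(aligned.isna().sum())

# Filling with a scalar replaces every missing value.
filled = payments.fillna(mode_series.iloc[0])
print(filled.isna().sum())
```

Assigning the result back (`df['last_pymnt_d'] = df['last_pymnt_d'].fillna(...)`) also avoids relying on `inplace=True`. Whether filling with the mode is appropriate for a payment date is a separate modeling decision.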

In [347]:
df['last_pymnt_d'].head()

0   2019-06-01
1   2019-05-01
2   2019-05-01
3   2019-05-01
4   2019-06-01
Name: last_pymnt_d, dtype: datetime64[ns]

In [0]:
df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.year

In [349]:
df[['last_pymnt_d', 'last_pymnt_d_month', 'last_pymnt_d_year']].head(20)

Unnamed: 0,last_pymnt_d,last_pymnt_d_month,last_pymnt_d_year
0,2019-06-01,6.0,2019.0
1,2019-05-01,5.0,2019.0
2,2019-05-01,5.0,2019.0
3,2019-05-01,5.0,2019.0
4,2019-06-01,6.0,2019.0
5,2019-05-01,5.0,2019.0
6,2019-05-01,5.0,2019.0
7,2019-05-01,5.0,2019.0
8,2019-05-01,5.0,2019.0
9,2019-05-01,5.0,2019.0
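The month and year print as `6.0` and `2019.0` because the column still contains missing dates, and a plain integer column cannot hold NaN. If integer values are preferred, one option is the nullable `Int64` dtype, sketched here on a small made-up Series.

```python
import pandas as pd

# Made-up stand-in for df['last_pymnt_d'] with one missing date.
dates = pd.to_datetime(pd.Series(['Jun-2019', None, 'May-2019']), format='%b-%Y')

# With a missing date present, the extracted month is stored as float.
months = dates.dt.month
print(months.dtype)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>.
months_int = months.astype('Int64')
print(months_int)
```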


# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20.
- Take initiative and work on your own ideas!
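For the 'Other' option, one possible approach combines `value_counts`, `isin`, and `where`. The sketch uses made-up titles and keeps the top 2 instead of the top 20 so the effect is visible on a short Series.

```python
import pandas as pd

# Made-up titles; the real column would be df['emp_title'].
titles = pd.Series(
    ['Teacher', 'Manager', 'Teacher', 'Nurse', 'Manager', 'Teacher', 'Pilot']
)

top_n = 2  # the exercise asks for 20

# value_counts() sorts by frequency, so the first top_n labels are the most common.
top_titles = titles.value_counts().head(top_n).index

# where() keeps values for which the condition is True and substitutes 'Other' elsewhere.
reduced = titles.where(titles.isin(top_titles), 'Other')
print(reduced.tolist())
```

Missing titles would also become 'Other' here, since `isin` returns False for NaN.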

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data

In [0]:
#!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
#!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
#%cd instacart_2017_05_01

In [0]:
print(df['revol_util'].isna().sum())
df['revol_util'].describe()

In [0]:
df['revol_util'].mode()

In [0]:
rev_mode = df['revol_util'].mode()
print(rev_mode)
print(type(rev_mode))

In [0]:
df['revol_util'].isna().sum()

In [0]:
df['revol_util'].nunique()

In [0]:
df['revol_util'] = df['revol_util'].str.strip('%').astype(float)

In [0]:
df['revol_util'].head()

In [0]:
print(df['revol_util'].describe())
print(df['revol_util'].mode())

In [0]:
# mode() returns a Series; take its first value so every NaN is filled with a scalar
df['revol_util'] = df['revol_util'].fillna(df['revol_util'].mode()[0])
print(df['revol_util'].isna().sum())

In [0]:
print(df['emp_title'])