<a href="https://colab.research.google.com/github/ianforrest11/DS-Unit-1-Sprint-1-Dealing-With-Data/blob/master/module4-makefeatures/LS_DS_124_Make_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science_

# Make features

Objectives
- understand the purpose of feature engineering
- work with strings in pandas
- work with dates and times in pandas

Links
- [Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering)
- Python Data Science Handbook
  - [Chapter 3.10](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html), Vectorized String Operations
  - [Chapter 3.11](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html), Working with Time Series

## Get LendingClub data

[Source](https://www.lendingclub.com/info/download-data.action)

In [0]:
!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

In [0]:
!unzip LoanStats_2018Q4.csv.zip

In [0]:
!head LoanStats_2018Q4.csv

## Load LendingClub data

pandas documentation
- [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
- [`options.display`](https://pandas.pydata.org/pandas-docs/stable/options.html#available-options)
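
As a rough sketch of this step, the cell below reads a small in-memory stand-in for the downloaded file. The sample text, its column names, and the assumption that the file has one extra line at the top and two at the bottom are all illustrative; only `read_csv` and `set_option` are real pandas calls.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded CSV: one extra line at the top,
# a header row, two data rows, and two extra lines at the bottom.
raw = io.StringIO(
    "Notes offered by Prospectus\n"
    "loan_amnt,term,int_rate\n"
    "10000, 36 months, 10.33%\n"
    "4000, 60 months, 23.40%\n"
    "Total amount funded in policy code 1: 0\n"
    "Total amount funded in policy code 2: 0\n"
)

# skipfooter is only supported by the Python parsing engine.
sample_df = pd.read_csv(raw, skiprows=1, skipfooter=2, engine="python")

# Let pandas display more rows and columns before truncating the output.
pd.set_option("display.max_rows", 500)
pd.set_option("display.max_columns", 500)

print(sample_df.shape)
```

On the real file, the same call would take the CSV filename instead of the `StringIO` object.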

## Work with strings

For machine learning, we usually want to replace strings with numbers.

We can check which columns have the `object` dtype, which pandas uses for strings.
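
One possible way to list those columns is `select_dtypes`. The toy dataframe here is made up for illustration, and the text columns are created with `dtype="object"` explicitly so the sketch does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

# Toy dataframe (illustrative values only); text columns are forced to object dtype.
toy_df = pd.DataFrame({
    "loan_amnt": [10000, 4000],
    "int_rate": pd.Series([" 10.33%", " 23.40%"], dtype="object"),
    "emp_title": pd.Series(["Teacher", None], dtype="object"),
})

# Keep only the columns stored with the object dtype.
object_columns = toy_df.select_dtypes(include="object").columns.tolist()
print(object_columns)

# Summary statistics (count, unique, top, freq) for the object columns only.
print(toy_df.describe(include="object"))
```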

### Convert `int_rate`

Define a function to remove percent signs from strings and convert them to floats.

Apply the function to the `int_rate` column.
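
These two steps might look like the sketch below. The function name `remove_percent` and the sample values are hypothetical; the real column would be `df['int_rate']`.

```python
import pandas as pd

def remove_percent(string):
    """Strip surrounding whitespace and percent signs, then convert to float."""
    return float(string.strip().strip("%"))

# Illustrative values shaped like percent strings.
int_rate = pd.Series([" 10.33%", " 23.40%", "  6.00%"])

# Apply the function to every element of the column.
int_rate_float = int_rate.apply(remove_percent)
print(int_rate_float.tolist())
```

This version assumes every value is a string; missing values would need to be handled before calling `float`.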

### Clean `emp_title`

Look at the top 20 titles.

How often is `emp_title` null?

Clean the titles and handle missing values.
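
A minimal sketch of the three steps, using a made-up series in place of `df['emp_title']`. The helper name `clean_title` and the choice of `'Unknown'` as the fill value are assumptions, not something the lesson prescribes.

```python
import pandas as pd

# Illustrative titles with inconsistent case, stray whitespace, and a missing value.
emp_title = pd.Series(["Teacher", " teacher ", "REGISTERED NURSE", None, "manager"])

# Most frequent titles; head(20) gives the top 20 on a real column.
print(emp_title.value_counts().head(20))

# Count the missing titles.
n_missing = emp_title.isnull().sum()
print(n_missing)

def clean_title(title):
    """Return 'Unknown' for missing values, else a trimmed, title-cased string."""
    if isinstance(title, str):
        return title.strip().title()
    return "Unknown"

emp_title_clean = emp_title.apply(clean_title)
print(emp_title_clean.tolist())
```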

### Create `emp_title_manager`

pandas documentation: [`str.contains`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html)
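
A feature like this could be built with `str.contains`, as in the sketch below. The sample titles are made up, and matching the word "manager" case-insensitively is an assumption about what the feature should capture.

```python
import pandas as pd

# Illustrative titles; the real input would be the cleaned emp_title column.
emp_title = pd.Series(["General Manager", "Teacher", "manager", "Unknown"])

# case=False ignores capitalization; na=False counts missing values as "no match".
emp_title_manager = emp_title.str.contains("manager", case=False, na=False)
print(emp_title_manager.tolist())
```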

## Work with dates

pandas documentation
- [`to_datetime`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html)
- [Time/Date Components](https://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components) "You can access these properties via the `.dt` accessor"
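
As an illustration of the two links above, this sketch parses a few date strings and reads components through `.dt`. The sample values and the `"%b-%Y"` format are assumptions about how the dates are written; adjust the format to match the actual column.

```python
import pandas as pd

# Illustrative month-year strings with one missing value.
dates = pd.Series(["Dec-2018", "Nov-2018", None])

# Parse with an explicit format; missing values become NaT.
dates = pd.to_datetime(dates, format="%b-%Y")

# Date components are available through the .dt accessor.
date_year = dates.dt.year
date_month = dates.dt.month
print(date_year.tolist(), date_month.tolist())
```

Because of the missing value, the year and month come back as floats with `NaN` rather than as integers.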

# ASSIGNMENT

- Replicate the lesson code.

- Convert the `term` column from string to integer.

- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid". Otherwise it should contain the integer 0.

- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.

In [0]:
# Download the zipped CSV file with the wget shell command

!wget https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip

In [0]:
# Extract the CSV file with the unzip shell command

!unzip LoanStats_2018Q4.csv.zip

In [109]:
# Create dataframe, skipping the first row and the last two rows
# (skipfooter is only supported by the Python parsing engine)

import pandas as pd
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1, skipfooter=2, engine='python')


In [0]:
# Show more rows, columns, and characters per line when displaying dataframes

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [0]:
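# Create function to remove the 'months' suffix from a 'term' value and convert it to an int

def remove_months(string):
    # str.strip(' months') strips a set of characters rather than the suffix, so use replace
    return int(string.replace('months', '').strip())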

In [0]:
# Apply function to 'term' column, remove ' months' and convert strings to ints

df['term'] = df['term'].apply(remove_months)

In [15]:
# Confirm column has been converted to int

print(df['term'].dtypes)

int64


In [0]:
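# Add loan_status_is_great column: 1 if loan_status is exactly 'Current' or 'Fully Paid', else 0
# (isin matches whole values and treats missing values as False, unlike a substring match)

df['loan_status_is_great'] = df['loan_status'].isin(['Current', 'Fully Paid']).astype(int)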

In [32]:
# Confirm column was added and working properly

df[['loan_status', 'loan_status_is_great']].iloc[[0,11,20]]

| | loan_status | loan_status_is_great |
|---|---|---|
| 0 | Current | 1 |
| 11 | Fully Paid | 1 |
| 20 | Late (31-120 days) | 0 |


In [46]:
# Obtain type of last_pymnt_d column

print(df['last_pymnt_d'].dtypes)

object


In [0]:
# Convert last_pymnt_d to datetime
# (infer_datetime_format is deprecated since pandas 2.0, so it is omitted here)

df['last_pymnt_d'] = pd.to_datetime(df['last_pymnt_d'])

In [0]:
# Create last_pymnt_d_month & last_pymnt_d_year columns with the .dt accessor

df['last_pymnt_d_month'] = df['last_pymnt_d'].dt.month
df['last_pymnt_d_year'] = df['last_pymnt_d'].dt.year

In [0]:
# Fill missing values with 0 as a placeholder so both columns can be cast to int
df['last_pymnt_d_month'] = df['last_pymnt_d_month'].fillna(0)
df['last_pymnt_d_year'] = df['last_pymnt_d_year'].fillna(0)

df['last_pymnt_d_month'] = df['last_pymnt_d_month'].astype(int)
df['last_pymnt_d_year'] = df['last_pymnt_d_year'].astype(int)

In [0]:
# Convert month numbers to abbreviations such as 'Jun' (the placeholder 0 maps to an empty string)
import calendar
df['last_pymnt_d_month'] = df['last_pymnt_d_month'].apply(lambda x: calendar.month_abbr[x])

In [0]:
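# Restore missing values: the placeholder is 0, or '' if months were converted to abbreviations
import numpy as np
df['last_pymnt_d_month'] = df['last_pymnt_d_month'].replace([0, ''], np.nan)
df['last_pymnt_d_year'] = df['last_pymnt_d_year'].replace(0, np.nan)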

In [118]:
df[['last_pymnt_d_month','last_pymnt_d_year']]

| | last_pymnt_d_month | last_pymnt_d_year |
|---|---|---|
| 0 | 6 | 2019 |
| 1 | 5 | 2019 |
| 2 | 5 | 2019 |
| 3 | 5 | 2019 |
| 4 | 6 | 2019 |
| 5 | 5 | 2019 |
| 6 | 5 | 2019 |
| 7 | 5 | 2019 |
| 8 | 5 | 2019 |
| 9 | 5 | 2019 |


In [69]:
print(df['last_pymnt_d_month'].dtypes)

float64


# STRETCH OPTIONS

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!
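
As a starting point for the `emp_title` option, one possible approach keeps the most frequent titles and replaces the rest with `Series.where`. The sample data and the variable `top_n` are illustrative; on the real column `top_n` would be 20.

```python
import pandas as pd

# Illustrative titles; the real input would be the cleaned emp_title column.
emp_title = pd.Series(["Teacher", "Teacher", "Manager", "Manager", "Driver", "Chef"])

top_n = 2  # hypothetical small value for this toy example; use 20 on the real data

# Index of the top_n most frequent titles.
top_titles = emp_title.value_counts().head(top_n).index

# Keep titles that are in the top list and replace every other title with 'Other'.
emp_title_top = emp_title.where(emp_title.isin(top_titles), "Other")
print(emp_title_top.tolist())
```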

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [0]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [0]:
# %cd instacart_2017_05_01