# Automated Feature Engineering
 A machine learning model can only learn from the data we give it, and making sure that data is relevant to the task is one of the most crucial steps in the machine learning pipeline.
 
However, manual feature engineering is tedious and is limited both by human imagination (there are only so many features we can think to create) and by time (creating new features is time-intensive). Ideally, there would be an objective method to create an array of diverse new candidate features that we can then use for a machine learning task. This process is not meant to replace the data scientist, but to make her job easier by allowing her to supplement domain knowledge with an automated workflow.

In this notebook, I will walk through an implementation using Featuretools, an open-source Python library for automatically creating features from relational data (where the data is in structured tables). Although there are now many efforts to automate model selection and hyperparameter tuning, comparatively little work has gone into automating the feature engineering aspect of the pipeline.

### Dataset
To show the basic idea of Featuretools, we will use an example dataset consisting of three tables:

* `clients`: information about clients at a credit union
* `loans`: previous loans taken out by the clients
* `payments`: payments made/missed on the previous loans

The general problem of feature engineering is taking disparate data, often distributed across multiple tables, and combining it into a single table that can be used for training a machine learning model. Featuretools has the ability to do this for us, creating many new candidate features with minimal effort. These features are combined into a single table that can then be passed on to our model.

In [None]:
# Run this if featuretools is not already installed
# !pip install -U featuretools

In [1]:
# data manipulation 
import pandas as pd
import numpy as np

# automated feature engineering
import featuretools as ft

# ignore warnings from pandas
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Read in the data
clients = pd.read_csv('data/clients.csv', parse_dates = ['joined'])
loans = pd.read_csv('data/loans.csv', parse_dates = ['loan_start', 'loan_end'])
payments = pd.read_csv('data/payments.csv', parse_dates = ['payment_date'])

In [3]:
clients.head()

Unnamed: 0,client_id,joined,income,credit_score
0,46109,2002-04-16,172677,527
1,49545,2007-11-14,104564,770
2,41480,2013-03-11,122607,585
3,46180,2001-11-06,43851,562
4,25707,2006-10-06,211422,621


In [4]:
loans.sample(10)

Unnamed: 0,client_id,loan_type,loan_amount,repaid,loan_id,loan_start,loan_end,rate
346,39384,cash,11728,0,11700,2007-04-20,2009-11-09,5.78
146,35089,home,6263,1,10210,2011-12-01,2014-03-24,0.66
166,35214,home,9389,0,10336,2014-03-18,2016-05-07,1.4
310,49068,other,13789,1,10470,2011-05-29,2013-09-08,2.03
408,41472,other,6661,1,11148,2003-04-14,2005-08-05,1.12
93,25707,credit,9905,1,11505,2004-05-12,2006-08-17,0.9
415,49624,cash,8621,0,10454,2014-07-26,2015-12-29,3.18
400,41472,home,1270,0,11855,2014-04-17,2015-10-28,9.82
165,35214,credit,6352,0,11685,2003-04-03,2005-06-25,1.93
22,49545,other,8419,1,11054,2006-10-24,2009-07-05,0.83


In [5]:
payments.sample(10)

Unnamed: 0,loan_id,payment_amount,payment_date,missed
819,10045,1701,2004-06-09,0
1901,11716,1676,2010-12-19,0
2759,10305,717,2003-11-11,0
1360,11177,1431,2010-02-16,0
450,11837,1630,2002-12-30,1
2778,11991,494,2005-08-28,0
2673,11238,1616,2006-07-12,1
214,11514,1154,2014-07-03,1
2323,11975,661,2002-11-20,1
1343,10262,583,2007-01-14,1


## Manual Feature Engineering Examples
Let's look at a few examples of features we might make by hand. We will keep this relatively simple to avoid doing too much work! First we will focus on a single dataframe before combining them. In the `clients` dataframe, we can take the month of the `joined` column and the natural log of the `income` column. As we will see later, these are known in featuretools as transformation feature primitives because they act on columns in a single table.

In [None]:
# create a month column
clients['join_month'] = clients['joined'].dt.month

# create a log of income column
clients['log_income'] = np.log(clients['income'])

clients.head()

To incorporate information from the other tables, we use the `df.groupby` method, followed by a suitable aggregation function, followed by `df.merge`. For example, let's calculate the average, minimum, and maximum amount of previous loans for each client. In featuretools terms, this would be considered an aggregation feature primitive because we are using multiple tables in a one-to-many relationship to calculate aggregate statistics (don't worry, this will be explained shortly!).

In [6]:
# groupby client id and calculate mean, max, min previous loan size
stats = loans.groupby('client_id')['loan_amount'].agg(['mean', 'max', 'min'])
stats.columns = ['mean_loan_amount', 'max_loan_amount', 'min_loan_amount']
stats.head()

client_id,mean_loan_amount,max_loan_amount,min_loan_amount
25707,7963.95,13913,1212
26326,7270.0625,13464,1164
26695,7824.722222,14865,2389
26945,7125.933333,14593,653
29841,9813.0,14837,2778


In [7]:
# merge with the clients dataframe
clients.merge(stats, left_on = 'client_id', right_index=True, how = 'left').head(10)

Unnamed: 0,client_id,joined,income,credit_score,mean_loan_amount,max_loan_amount,min_loan_amount
0,46109,2002-04-16,172677,527,8951.6,14049,559
1,49545,2007-11-14,104564,770,10289.3,14971,3851
2,41480,2013-03-11,122607,585,7894.85,14399,811
3,46180,2001-11-06,43851,562,7700.85,14081,1607
4,25707,2006-10-06,211422,621,7963.95,13913,1212
5,39505,2011-10-14,153873,610,7424.05,14575,904
6,32726,2006-05-01,235705,730,6633.263158,14802,851
7,35089,2010-03-01,131176,771,6939.2,13194,773
8,35214,2003-08-08,95849,696,7173.555556,14767,667
9,48177,2008-06-09,190632,769,7424.368421,14740,659


We could go further and include information about payments in the clients dataframe. To do so, we would have to group payments by the loan_id, merge it with the loans, group the resulting dataframe by the client_id, and then merge it into the clients dataframe. This would allow us to include information about previous payments for each client.
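The four steps just described can be sketched in pandas as follows. This is only an illustration: the `*_toy` dataframes are hypothetical rows that reuse the notebook's column names rather than the real data, and the particular statistics (total paid per loan, number of missed payments) are choices made for the example.

```python
import pandas as pd

# Hypothetical toy rows that mirror the notebook's column names (not the real data)
clients_toy = pd.DataFrame({"client_id": [1, 2], "income": [50000, 80000]})
loans_toy = pd.DataFrame({
    "loan_id": [10, 11, 12],
    "client_id": [1, 1, 2],
    "loan_amount": [1000, 3000, 5000],
})
payments_toy = pd.DataFrame({
    "loan_id": [10, 10, 11, 12, 12],
    "payment_amount": [400, 600, 1500, 2000, 1000],
    "missed": [0, 1, 0, 1, 1],
})

# Step 1: group payments by loan_id and aggregate to one row per loan
loan_stats = payments_toy.groupby("loan_id").agg(
    total_paid=("payment_amount", "sum"),
    n_missed=("missed", "sum"),
)

# Step 2: merge the per-loan payment statistics into the loans table
loans_with_payments = loans_toy.merge(
    loan_stats, left_on="loan_id", right_index=True, how="left"
)

# Step 3: group the enriched loans by client_id and aggregate to one row per client
client_stats = loans_with_payments.groupby("client_id").agg(
    mean_total_paid=("total_paid", "mean"),
    total_missed=("n_missed", "sum"),
)

# Step 4: merge the per-client statistics into the clients table
clients_with_payments = clients_toy.merge(
    client_stats, left_on="client_id", right_index=True, how="left"
)
```

Even for two statistics this takes two group-bys and two merges, and every additional table in the chain adds another pair.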

Clearly, manual feature engineering can grow quite tedious with many columns and multiple tables, and I certainly don't want to do this by hand! Luckily, featuretools can automatically perform this entire process and will create more features than we would have ever thought of. Although I love pandas, there is only so much manual data manipulation I'm willing to stand!

## Featuretools
Now that we know what we are trying to avoid (tedious manual feature engineering), let's figure out how to automate this process. Featuretools operates on an idea known as Deep Feature Synthesis. The concept of Deep Feature Synthesis is to use basic building blocks known as feature primitives (like the transformations and aggregations done above) that can be stacked on top of each other to form new features. The depth of a "deep feature" is equal to the number of stacked primitives. The original paper can be read [here](https://www.featurelabs.com/wp-content/uploads/2017/12/DSAA_DSM_2015.pdf).
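To make "stacking" and "depth" concrete, here is a minimal pandas-only sketch that builds one feature of depth 1 and one of depth 2 by hand. It does not use featuretools at all; `loans_toy` is a hypothetical three-row table that borrows the notebook's column names, and the choice of primitives (mean, month, max) is just for illustration.

```python
import pandas as pd

# Hypothetical toy loans table reusing the notebook's column names
loans_toy = pd.DataFrame({
    "client_id": [1, 1, 2],
    "loan_amount": [1000, 3000, 5000],
    "loan_start": pd.to_datetime(["2010-03-15", "2012-11-01", "2014-07-20"]),
})

# Depth 1: a single aggregation primitive (mean) applied per client
mean_loan_amount = loans_toy.groupby("client_id")["loan_amount"].mean()

# Depth 2: a transformation primitive (month of loan_start) ...
loan_start_month = loans_toy["loan_start"].dt.month
# ... with an aggregation primitive (max) stacked on top of it, per client
max_loan_start_month = loan_start_month.groupby(loans_toy["client_id"]).max()
```

By the definition above, the second feature has depth 2 because it is the result of two primitives applied one after the other.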

I threw out some terms there, but don't worry because we'll cover them as we go. Featuretools builds on simple ideas to create a powerful method, and we will build up our understanding in much the same way.

The first part of Featuretools to understand is an entity. This is simply a table, or in pandas, a DataFrame. We corral multiple entities into a single object called an EntitySet. This is just a large data structure composed of many individual entities and the relationships between them.