In [4]:
"""
Topic: Implementation of Featuretools

Main source of this notebook: https://www.analyticsvidhya.com/blog/2018/08/guide-automated-feature-engineering-featuretools-python/

Featuretools Ref:
     1. https://towardsdatascience.com/why-automated-feature-engineering-will-change-the-way-you-do-machine-learning-5c15bf188b96?fbclid=IwAR1JvwK-wsJEQQ1k5gGPvjzj4rkLCwSqut9V0smclJpb-GlwFyZmv-d__eU
     2. https://github.com/Featuretools/Automated-Manual-Comparison/blob/master/Loan%20Repayment/notebooks/Automated%20Loan%20Repayment.ipynb
     # Explains relationships and feature primitives (meaning of parameters such as max_depth)
     3. https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219
"""

# Loading required Libraries and Data

In [1]:
import featuretools as ft
import numpy as np
import pandas as pd
from catboost import CatBoostRegressor

train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")

In [2]:
print(train.head())
print(test.head())

  Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility  \
0           FDA15         9.30          Low Fat         0.016047   
1           DRC01         5.92          Regular         0.019278   
2           FDN15        17.50          Low Fat         0.016760   
3           FDX07        19.20          Regular         0.000000   
4           NCD19         8.93          Low Fat         0.000000   

               Item_Type  Item_MRP Outlet_Identifier  \
0                  Dairy  249.8092            OUT049   
1            Soft Drinks   48.2692            OUT018   
2                   Meat  141.6180            OUT049   
3  Fruits and Vegetables  182.0950            OUT010   
4              Household   53.8614            OUT013   

   Outlet_Establishment_Year Outlet_Size Outlet_Location_Type  \
0                       1999      Medium               Tier 1   
1                       2009      Medium               Tier 3   
2                       1999      Medium               Tier

# Data Preparation

In [3]:
# saving identifiers
test_Item_Identifier = test['Item_Identifier']
test_Outlet_Identifier = test['Outlet_Identifier']
sales = train['Item_Outlet_Sales']
train.drop(['Item_Outlet_Sales'], axis=1, inplace=True)

Then we will combine the train and test set as it saves us the trouble of performing the same step(s) twice.

In [4]:
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
combi = pd.concat([train, test], ignore_index=True)

In [18]:
combi.isnull().sum()

Item_Identifier                 0
Item_Weight                  2439
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  4016
Outlet_Location_Type            0
Outlet_Type                     0
dtype: int64

There are many missing values in the Item_Weight and Outlet_Size variables. Let’s deal with them quickly:

In [19]:
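# imputing missing data: column mean for the numeric weight, a "missing" label for the outlet size
# (assigning the result back avoids chained in-place modification warnings)
combi['Item_Weight'] = combi['Item_Weight'].fillna(combi['Item_Weight'].mean())
combi['Outlet_Size'] = combi['Outlet_Size'].fillna("missing")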

# Data Preprocessing

In [5]:
combi['Item_Fat_Content'].value_counts()

Low Fat    8485
Regular    4824
LF          522
reg         195
low fat     178
Name: Item_Fat_Content, dtype: int64

Item_Fat_Content really contains only two categories, “Low Fat” and “Regular”; the remaining labels (“LF”, “low fat” and “reg”) are alternate spellings of these two. So, let’s convert it into a binary variable.

In [6]:
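# dictionary mapping every spelling to a binary code (0 = low fat, 1 = regular)
fat_content_dict = {'Low Fat':0, 'Regular':1, 'LF':0, 'reg':1, 'low fat':0}

# every key is a complete label, so exact-match replacement is enough (no regex needed)
combi['Item_Fat_Content'] = combi['Item_Fat_Content'].replace(fat_content_dict)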

# Feature Engineering using Featuretools

- Now we can start using Featuretools to perform automated feature engineering! Featuretools needs a unique identifier column in the dataset, and ours doesn’t have one yet. We do have two IDs in the data, one for the item and one for the outlet, so concatenating them gives a unique ID for each row of the combined dataset.

In [7]:
combi['id'] = combi['Item_Identifier'] + combi['Outlet_Identifier']
combi.drop(['Item_Identifier'], axis=1, inplace=True)

- Please note that I have dropped the feature Item_Identifier as it is no longer required. However, I have retained the feature Outlet_Identifier because I plan to use it later.
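Before relying on `id` as an index, it can be worth confirming that the concatenated key really is unique. The following is a minimal sketch on a toy frame; the rows are illustrative and not taken from the dataset, only the column names mirror the notebook.

```python
import pandas as pd

# Toy rows for illustration only; the column names mirror the notebook's.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01"],
    "Outlet_Identifier": ["OUT049", "OUT018", "OUT018"],
})

# Neither column is unique on its own, but their concatenation is.
toy["id"] = toy["Item_Identifier"] + toy["Outlet_Identifier"]

item_unique = toy["Item_Identifier"].is_unique
id_unique = toy["id"].is_unique
print(item_unique, id_unique)
```

On the real data the same check would be `combi['id'].is_unique`, run right after the `id` column is created.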

- Before proceeding, we have to create an **EntitySet**. An EntitySet is a structure that contains multiple dataframes and the relationships between them. So, let’s create an EntitySet and add the combi dataframe to it.

In [36]:
# creating an entity set 'es'
es = ft.EntitySet(id = 'sales')

# adding a dataframe 
es.entity_from_dataframe(entity_id = 'bigmart', dataframe = combi, index = 'id')

Entityset: sales
  Entities:
    bigmart [Rows: 14204, Columns: 11]
  Relationships:
    No relationships

Our data contains information at two levels, *item level* and *outlet level*. Featuretools can split a dataset into multiple tables. Below we create a new table **‘outlet’** from the bigmart table, keyed on the outlet ID **Outlet_Identifier**.

In [37]:
combi.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,id
0,9.3,0,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,FDA15OUT049
1,5.92,1,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,DRC01OUT018
2,17.5,0,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,FDN15OUT049
3,19.2,1,0.0,Fruits and Vegetables,182.095,OUT010,1998,missing,Tier 3,Grocery Store,FDX07OUT010
4,8.93,0,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,NCD19OUT013


In [38]:
# move the outlet-level columns into their own 'outlet' table, keyed by Outlet_Identifier
es.normalize_entity(base_entity_id='bigmart', new_entity_id='outlet', index = 'Outlet_Identifier', 
                    additional_variables = ['Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'])

Entityset: sales
  Entities:
    bigmart [Rows: 14204, Columns: 7]
    outlet [Rows: 10, Columns: 5]
  Relationships:
    bigmart.Outlet_Identifier -> outlet.Outlet_Identifier

In [42]:
additional_variables = ['Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']
combi.loc[:, additional_variables].drop_duplicates()

Unnamed: 0,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,1999,Medium,Tier 1,Supermarket Type1
1,2009,Medium,Tier 3,Supermarket Type2
3,1998,missing,Tier 3,Grocery Store
4,1987,High,Tier 3,Supermarket Type1
7,1985,Medium,Tier 3,Supermarket Type3
8,2002,missing,Tier 2,Supermarket Type1
9,2007,missing,Tier 2,Supermarket Type1
11,1997,Small,Tier 1,Supermarket Type1
19,2004,Small,Tier 2,Supermarket Type1
23,1985,Small,Tier 1,Grocery Store
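The ten rows above suggest that each outlet attribute takes a single value per outlet, which is what makes moving those columns into a separate table safe. A hedged sketch of that check on toy data (the values are illustrative, and `outlet_cols` is a name introduced here):

```python
import pandas as pd

# Toy item-level rows; values are illustrative.
toy = pd.DataFrame({
    "Outlet_Identifier": ["OUT049", "OUT049", "OUT018", "OUT018"],
    "Outlet_Size": ["Medium", "Medium", "Medium", "Medium"],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type1",
                    "Supermarket Type2", "Supermarket Type2"],
    "Item_MRP": [249.8, 141.6, 48.3, 60.0],
})

outlet_cols = ["Outlet_Size", "Outlet_Type"]

# Number of distinct values each column takes within one outlet.
per_outlet = toy.groupby("Outlet_Identifier")[outlet_cols + ["Item_MRP"]].nunique()

# Columns with exactly one value per outlet describe the outlet, not the item.
is_constant = (per_outlet == 1).all()
outlet_level = is_constant[is_constant].index.tolist()
print(outlet_level)
```

A column such as Item_MRP varies within an outlet, so it stays in the item-level table.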


In [44]:
# Let’s check the summary of our EntitySet.
print(es)

Entityset: sales
  Entities:
    bigmart [Rows: 14204, Columns: 7]
    outlet [Rows: 10, Columns: 5]
  Relationships:
    bigmart.Outlet_Identifier -> outlet.Outlet_Identifier


- As you can see above, it contains two entities – *bigmart* and *outlet*. There is also a relationship formed between the two tables, connected by **Outlet_Identifier**. **This relationship will play a key role in the generation of new features.**

- Now we will use **Deep Feature Synthesis (DFS)** to create new features automatically. Recall that DFS uses ***Feature Primitives*** to create features across the multiple tables present in the EntitySet.

In [61]:
# agg_primitives Default: ["sum", "std", "max", "skew", "min", "mean", "count", "percent_true", "n_unique", "mode"]
# trans_primitives Default: ["day", "year", "month", "weekday", "haversine", "num_words", "num_characters"]
feature_matrix, feature_names = ft.dfs(entityset=es,
                                       target_entity='bigmart',
                                       max_depth=2, # how many basic features (primitives) are stacked
                                       verbose=1,
                                       n_jobs=3)

# feature_names should probably be saved too: similar data in the future has to be run through the same feature generation again
"""
>> Deep Feature Synthesis <<

We now have all the pieces in place to understand deep feature synthesis (dfs). 
    In fact, we already performed dfs in the previous function call! 
    A deep feature is simply a feature made of stacking multiple primitives and dfs is 
    the name of the process that makes these features. 
The depth of a deep feature is the number of primitives required to make the feature.

For example, the MEAN(payments.payment_amount) column is a deep feature with a depth of 1 
    because it was created using a single aggregation. A feature with a depth of two is 
    LAST(loans.MEAN(payments.payment_amount)). This is made by stacking two aggregations: 
    LAST (most recent) on top of MEAN. This represents the average payment size of the 
    most recent loan for each client.
"""

Built 37 features
EntitySet scattered to workers in 2.185 seconds
Elapsed: 00:01 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks



In [53]:
feature_matrix.columns

Index(['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type',
       'Item_MRP', 'Outlet_Identifier', 'outlet.Outlet_Establishment_Year',
       'outlet.Outlet_Size', 'outlet.Outlet_Location_Type',
       'outlet.Outlet_Type', 'outlet.SUM(bigmart.Item_Weight)',
       'outlet.SUM(bigmart.Item_Fat_Content)',
       'outlet.SUM(bigmart.Item_Visibility)', 'outlet.SUM(bigmart.Item_MRP)',
       'outlet.STD(bigmart.Item_Weight)',
       'outlet.STD(bigmart.Item_Fat_Content)',
       'outlet.STD(bigmart.Item_Visibility)', 'outlet.STD(bigmart.Item_MRP)',
       'outlet.MAX(bigmart.Item_Weight)',
       'outlet.MAX(bigmart.Item_Fat_Content)',
       'outlet.MAX(bigmart.Item_Visibility)', 'outlet.MAX(bigmart.Item_MRP)',
       'outlet.SKEW(bigmart.Item_Weight)',
       'outlet.SKEW(bigmart.Item_Fat_Content)',
       'outlet.SKEW(bigmart.Item_Visibility)', 'outlet.SKEW(bigmart.Item_MRP)',
       'outlet.MIN(bigmart.Item_Weight)',
       'outlet.MIN(bigmart.Item_Fat_Content)',
       

In [57]:
feature_matrix.head()

id,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,outlet.Outlet_Establishment_Year,outlet.Outlet_Size,outlet.Outlet_Location_Type,outlet.Outlet_Type,...,outlet.MIN(bigmart.Item_Fat_Content),outlet.MIN(bigmart.Item_Visibility),outlet.MIN(bigmart.Item_MRP),outlet.MEAN(bigmart.Item_Weight),outlet.MEAN(bigmart.Item_Fat_Content),outlet.MEAN(bigmart.Item_Visibility),outlet.MEAN(bigmart.Item_MRP),outlet.COUNT(bigmart),outlet.NUM_UNIQUE(bigmart.Item_Type),outlet.MODE(bigmart.Item_Type)
DRA12OUT010,11.6,0,0.068535,Soft Drinks,143.0154,OUT010,1998,missing,Tier 3,Grocery Store,...,0,0.0,32.6558,12.72287,0.356757,0.101939,141.159742,925,16,Fruits and Vegetables
DRA12OUT013,11.6,0,0.040912,Soft Drinks,142.3154,OUT013,1987,High,Tier 3,Supermarket Type1,...,0,0.0,31.49,12.788139,0.353509,0.060242,141.128428,1553,16,Fruits and Vegetables
DRA12OUT017,11.6,0,0.041178,Soft Drinks,140.3154,OUT017,2007,missing,Tier 2,Supermarket Type1,...,0,0.0,32.09,12.78208,0.35256,0.061142,140.998931,1543,16,Snack Foods
DRA12OUT018,11.6,0,0.041113,Soft Drinks,142.0154,OUT018,2009,Medium,Tier 3,Supermarket Type2,...,0,0.0,31.89,12.803638,0.353816,0.059976,141.000899,1546,16,Fruits and Vegetables
DRA12OUT027,12.792854,0,0.040748,Soft Drinks,140.0154,OUT027,1985,Medium,Tier 3,Supermarket Type3,...,0,0.0,31.29,12.792854,0.353432,0.060344,141.012347,1559,16,Fruits and Vegetables


- There is one issue with this dataframe: its rows are no longer in the original order. We will restore that order using the id variable from the combi dataframe.

In [62]:
feature_matrix = feature_matrix.reindex(index=combi['id'])
feature_matrix = feature_matrix.reset_index()
feature_matrix.head()

Unnamed: 0,id,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,outlet.Outlet_Establishment_Year,outlet.Outlet_Size,outlet.Outlet_Location_Type,...,outlet.MIN(bigmart.Item_Fat_Content),outlet.MIN(bigmart.Item_Visibility),outlet.MIN(bigmart.Item_MRP),outlet.MEAN(bigmart.Item_Weight),outlet.MEAN(bigmart.Item_Fat_Content),outlet.MEAN(bigmart.Item_Visibility),outlet.MEAN(bigmart.Item_MRP),outlet.COUNT(bigmart),outlet.NUM_UNIQUE(bigmart.Item_Type),outlet.MODE(bigmart.Item_Type)
0,FDA15OUT049,9.3,0,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,...,0,0.0,32.4558,12.803003,0.352903,0.059,141.163199,1550,16,Fruits and Vegetables
1,DRC01OUT018,5.92,1,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,...,0,0.0,31.89,12.803638,0.353816,0.059976,141.000899,1546,16,Fruits and Vegetables
2,FDN15OUT049,17.5,0,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,...,0,0.0,32.4558,12.803003,0.352903,0.059,141.163199,1550,16,Fruits and Vegetables
3,FDX07OUT010,19.2,1,0.0,Fruits and Vegetables,182.095,OUT010,1998,missing,Tier 3,...,0,0.0,32.6558,12.72287,0.356757,0.101939,141.159742,925,16,Fruits and Vegetables
4,NCD19OUT013,8.93,0,0.0,Household,53.8614,OUT013,1987,High,Tier 3,...,0,0.0,31.49,12.788139,0.353509,0.060242,141.128428,1553,16,Fruits and Vegetables
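The reordering step above works because `reindex` aligns rows by index label rather than by position. A small sketch with made-up values (`fm` and `original_ids` are names introduced here for illustration):

```python
import pandas as pd

# A feature matrix as DFS might return it: indexed by id, in a different order.
fm = pd.DataFrame(
    {"Item_MRP": [48.3, 141.6, 249.8]},
    index=pd.Index(["DRC01OUT018", "FDN15OUT049", "FDA15OUT049"], name="id"),
)

# Original row order, as stored in the source dataframe's id column.
original_ids = pd.Series(["FDA15OUT049", "DRC01OUT018", "FDN15OUT049"], name="id")

# reindex looks each label up in fm, so the result follows original_ids exactly.
restored = fm.reindex(index=original_ids).reset_index()
print(restored)
```

Restoring the order matters here because the later train/test split slices the rows by position.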


# Model Building
- It is time to check how useful these generated features actually are. We will use them to build a model and predict Item_Outlet_Sales. Since our final data (feature_matrix) has many categorical features, I decided to use the **CatBoost** algorithm. __It can use categorical features directly and is scalable in nature.__ You can refer to this article to read more about CatBoost.

CatBoost requires all the categorical variables to be in the string format. So, we will convert the categorical variables in our data to string first:

In [64]:
categorical_features = np.where(feature_matrix.dtypes == 'object')[0]

for i in categorical_features:
    feature_matrix.iloc[:,i] = feature_matrix.iloc[:,i].astype('str')

Let’s split feature_matrix back into train and test sets.

In [65]:
feature_matrix.drop(['id'], axis=1, inplace=True)  # unique
train = feature_matrix[:8523]
test = feature_matrix[8523:]

In [66]:
# removing unnecessary variables
# (reassigning instead of inplace=True avoids SettingWithCopyWarning on the slices)
train = train.drop(['Outlet_Identifier'], axis=1)
test = test.drop(['Outlet_Identifier'], axis=1)

In [74]:
# identifying categorical features
categorical_features = np.where(train.dtypes == 'object')[0]
print(categorical_features)

[ 0  2  4  6  8  9 10]


Split the train data into training and validation sets to check the model’s performance locally.

In [68]:
from sklearn.model_selection import train_test_split

# splitting train data into training and validation set
xtrain, xvalid, ytrain, yvalid = train_test_split(train, sales, test_size=0.25, random_state=11)

In [69]:
train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,outlet.Outlet_Establishment_Year,outlet.Outlet_Size,outlet.Outlet_Location_Type,outlet.Outlet_Type,outlet.SUM(bigmart.Item_Weight),...,outlet.MIN(bigmart.Item_Fat_Content),outlet.MIN(bigmart.Item_Visibility),outlet.MIN(bigmart.Item_MRP),outlet.MEAN(bigmart.Item_Weight),outlet.MEAN(bigmart.Item_Fat_Content),outlet.MEAN(bigmart.Item_Visibility),outlet.MEAN(bigmart.Item_MRP),outlet.COUNT(bigmart),outlet.NUM_UNIQUE(bigmart.Item_Type),outlet.MODE(bigmart.Item_Type)
0,9.3,0,0.016047,Dairy,249.8092,1999,Medium,Tier 1,Supermarket Type1,19844.655,...,0,0.0,32.4558,12.803003,0.352903,0.059,141.163199,1550,16,Fruits and Vegetables
1,5.92,1,0.019278,Soft Drinks,48.2692,2009,Medium,Tier 3,Supermarket Type2,19794.425,...,0,0.0,31.89,12.803638,0.353816,0.059976,141.000899,1546,16,Fruits and Vegetables
2,17.5,0,0.01676,Meat,141.618,1999,Medium,Tier 1,Supermarket Type1,19844.655,...,0,0.0,32.4558,12.803003,0.352903,0.059,141.163199,1550,16,Fruits and Vegetables
3,19.2,1,0.0,Fruits and Vegetables,182.095,1998,missing,Tier 3,Grocery Store,11768.655,...,0,0.0,32.6558,12.72287,0.356757,0.101939,141.159742,925,16,Fruits and Vegetables
4,8.93,0,0.0,Household,53.8614,1987,High,Tier 3,Supermarket Type1,19859.98,...,0,0.0,31.49,12.788139,0.353509,0.060242,141.128428,1553,16,Fruits and Vegetables


Finally, we can now train our model. The evaluation metric we will use is RMSE (Root Mean Squared Error).

In [72]:
model_cat = CatBoostRegressor(iterations=100, learning_rate=0.3, depth=6, eval_metric='RMSE', random_seed=7)

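# training model
# note: use_best_model only takes effect with an eval_set; without one CatBoost turns it off (see the warning below)
model_cat.fit(xtrain, ytrain, cat_features=categorical_features, use_best_model=True)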

You should provide test set for use best model. use_best_model parameter swiched to false value.


0:	learn: 2115.6201157	total: 66.3ms	remaining: 6.56s
1:	learn: 1703.3104557	total: 83.5ms	remaining: 4.09s
2:	learn: 1430.5051247	total: 92.4ms	remaining: 2.99s
3:	learn: 1277.7410778	total: 101ms	remaining: 2.43s
4:	learn: 1192.4927416	total: 110ms	remaining: 2.09s
5:	learn: 1143.3930400	total: 115ms	remaining: 1.79s
6:	learn: 1111.4556516	total: 121ms	remaining: 1.61s
7:	learn: 1094.5453134	total: 130ms	remaining: 1.49s
8:	learn: 1082.9897641	total: 139ms	remaining: 1.4s
9:	learn: 1077.0754209	total: 147ms	remaining: 1.33s
10:	learn: 1074.5711811	total: 153ms	remaining: 1.24s
11:	learn: 1071.5899223	total: 161ms	remaining: 1.18s
12:	learn: 1069.9816730	total: 168ms	remaining: 1.12s
13:	learn: 1069.5956805	total: 172ms	remaining: 1.06s
14:	learn: 1068.7983887	total: 181ms	remaining: 1.02s
15:	learn: 1066.4024029	total: 189ms	remaining: 992ms
16:	learn: 1065.8685695	total: 198ms	remaining: 965ms
17:	learn: 1065.7720438	total: 202ms	remaining: 920ms
18:	learn: 1064.4690988	total: 210ms

<catboost.core.CatBoostRegressor at 0x7f1e341eecc0>

In [73]:
# validation score
model_cat.score(xvalid, yvalid)

1091.4893118858736

The same model got a score of 1155.12 on the public leaderboard. Without any feature engineering, the scores were ~1103 and ~1183 on the validation set and the public leaderboard, respectively. Since a lower RMSE is better, the features created by Featuretools are not just random features; they carry useful signal. Just as importantly, they were generated automatically, which saves a great deal of feature engineering time.
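The validation score above is on the scale of an RMSE, but what `score` returns has, to my knowledge, differed between CatBoost releases (recent documentation describes it as R² for regressors). Computing the metric explicitly avoids that ambiguity. A minimal sketch, with `rmse` as a helper name introduced here and illustrative numbers:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative numbers only; in the notebook this would be
# rmse(yvalid, model_cat.predict(xvalid)).
example = rmse([100.0, 200.0, 300.0], [110.0, 190.0, 320.0])
print(round(example, 4))
```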



# Featuretools Interpretability

- Making our data science solutions interpretable is a very important aspect of performing machine learning. __Features generated by Featuretools can be easily explained__ even to a non-technical person because they are based on the primitives, which are easy to understand.

- For example, the features outlet.SUM(bigmart.Item_Weight) and outlet.STD(bigmart.Item_MRP) mean outlet-level sum of weight of the items and standard deviation of the cost of the items, respectively.

- This makes it possible for people who are not machine learning experts to contribute their domain expertise as well.
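To make the reading of those two feature names concrete, here is a rough pandas equivalent on toy data. It is a sketch of the idea, not Featuretools' implementation: the values are illustrative, and pandas' `std` is the sample standard deviation, which I have not verified against Featuretools' STD primitive.

```python
import pandas as pd

# Toy item-level table; values are illustrative.
bigmart = pd.DataFrame({
    "Outlet_Identifier": ["OUT049", "OUT049", "OUT018", "OUT018"],
    "Item_Weight": [9.3, 17.5, 5.92, 8.0],
    "Item_MRP": [249.8, 141.6, 48.3, 60.3],
})

# Aggregate items per outlet, naming the columns the way Featuretools does.
outlet_feats = bigmart.groupby("Outlet_Identifier").agg(**{
    "outlet.SUM(bigmart.Item_Weight)": ("Item_Weight", "sum"),
    "outlet.STD(bigmart.Item_MRP)": ("Item_MRP", "std"),
}).reset_index()

# Attach the outlet-level values back to every item row of that outlet.
enriched = bigmart.merge(outlet_feats, on="Outlet_Identifier", how="left")
print(enriched)
```

Every item sold in the same outlet ends up with the same value for these columns, which matches the repeated values visible in the feature matrix above.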

# Remark
## Feature Primitives
- A __feature primitive__ is an operation applied to a table or a set of tables to create a feature. These represent simple calculations, many of which we already use in manual feature engineering, that can be stacked on top of each other to create complex deep features. Feature primitives fall into two categories:

- **Aggregation**: function that groups together children for each parent and calculates a statistic such as mean, min, max, or standard deviation across the children. An example is the maximum previous loan amount for each client. An aggregation covers multiple tables using relationships between tables.
- **Transformation**: an operation applied to one or more columns in a single table. An example would be taking the absolute value of a column, or finding the difference between two columns in one table.
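The difference between the two categories can be sketched in plain pandas. The `loans` table and its columns below are hypothetical, chosen to match the loan example in the text:

```python
import pandas as pd

# Hypothetical child table of loans, one row per loan.
loans = pd.DataFrame({
    "client_id": [1, 1, 2],
    "loan_amount": [500.0, 1200.0, 300.0],
    "repaid_amount": [500.0, 900.0, 300.0],
})

# Transformation: combines columns of one table, one value per existing row.
loans["outstanding"] = loans["loan_amount"] - loans["repaid_amount"]

# Aggregation: collapses the children of each parent into one value per parent.
max_loan_per_client = loans.groupby("client_id")["loan_amount"].max()

print(loans["outstanding"].tolist())
print(max_loan_per_client.to_dict())
```

The transformation keeps the row count of the table it is applied to, while the aggregation returns one row per parent (client).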

In [56]:
# List the primitives in a dataframe
primitives = ft.list_primitives()
pd.options.display.max_colwidth = 100

primitives[primitives['type'] == 'aggregation'].head()

Unnamed: 0,name,type,description
0,median,aggregation,Determines the middlemost number in a list of values.
1,num_true,aggregation,Counts the number of `True` values.
2,sum,aggregation,"Calculates the total addition, ignoring `NaN`."
3,time_since_last,aggregation,Calculates the time elapsed since the last datetime (in seconds).
4,skew,aggregation,Computes the extent to which a distribution differs from a normal distribution.


In [55]:
primitives[primitives['type'] == 'transform'].head()

Unnamed: 0,name,type,description
20,subtract_numeric_scalar,transform,Subtract a scalar from each element in the list.
21,year,transform,Determines the year value of a datetime.
22,num_characters,transform,Calculates the number of characters in a string.
23,divide_by_feature,transform,Divide a scalar by each value in the list.
24,modulo_numeric,transform,Element-wise modulo of two lists.


In [54]:
ft.list_primitives().head()

Unnamed: 0,name,type,description
0,median,aggregation,Determines the middlemost number in a list of values.
1,num_true,aggregation,Counts the number of `True` values.
2,sum,aggregation,"Calculates the total addition, ignoring `NaN`."
3,time_since_last,aggregation,Calculates the time elapsed since the last datetime (in seconds).
4,skew,aggregation,Computes the extent to which a distribution differs from a normal distribution.


- ***target_entity*** is the ID of the entity for which we wish to create new features (in this case, the entity ‘bigmart’). The parameter **max_depth** controls the complexity of the generated features by limiting how many primitives can be stacked. The parameter **n_jobs** enables parallel feature computation across multiple cores.

- That’s all you have to do with Featuretools. It has generated a bunch of new features on its own.

- The newly created features are the columns of feature_matrix, listed earlier with feature_matrix.columns.
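The stacking that max_depth limits can be illustrated by hand. The sketch below builds a depth-1 and a depth-2 feature with pandas on hypothetical `loans` and `payments` tables; it mimics the idea only and "last" here simply means the last row per client.

```python
import pandas as pd

# Hypothetical tables: clients -> loans -> payments.
loans = pd.DataFrame({"loan_id": [10, 11, 12], "client_id": [1, 1, 2]})
payments = pd.DataFrame({
    "loan_id": [10, 10, 11, 11, 12],
    "payment_amount": [100.0, 200.0, 50.0, 70.0, 30.0],
})

# Depth 1: one aggregation, MEAN(payments.payment_amount) for each loan.
mean_payment = payments.groupby("loan_id")["payment_amount"].mean()
loans["MEAN(payments.payment_amount)"] = loans["loan_id"].map(mean_payment)

# Depth 2: a second aggregation stacked on the first, taking the last loan
# (by row order here) of each client.
last_mean = loans.groupby("client_id")["MEAN(payments.payment_amount)"].last()
print(last_mean.to_dict())
```

Each extra level of stacking corresponds to one more step of this kind, which is why larger max_depth values produce many more, and more complex, features.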

## Relationships
- Relationships are a fundamental concept not only in featuretools, but in any relational database. The most common type of relationship is one-to-many. The best way to think of a one-to-many relationship is with the analogy of parent-to-child. A parent is a single individual, but can have multiple children. In the context of tables, a parent table has one row (observation) for every individual, uniquely identified by an index (also called a key), while a child table can have many rows for each parent. Things get a little more complicated because child tables can have children of their own, making these grandchildren of the original parent.

- As an example of a parent-to-child relationship (taken from the loan repayment notebook in reference 2), the app dataframe has one row for each client (identified by SK_ID_CURR) while the bureau dataframe has multiple previous loans for each client. Therefore, the bureau dataframe is the child of the app dataframe. The bureau dataframe in turn is the parent of bureau_balance because each loan has one row in bureau (identified by SK_ID_BUREAU) but multiple monthly records in bureau_balance. When we do manual feature engineering, keeping track of all these relationships is a massive time investment (and a potential source of error), but we can add these relationships to our EntitySet and let featuretools worry about keeping the tables straight!
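A parent-to-child relationship of this kind can be checked in pandas before it is declared to Featuretools. The two tiny tables below are hypothetical stand-ins for app and bureau; only the key names come from the text.

```python
import pandas as pd

# Hypothetical parent table: one row per client, keyed by SK_ID_CURR.
app = pd.DataFrame({"SK_ID_CURR": [100, 101]})

# Hypothetical child table: several previous loans per client.
bureau = pd.DataFrame({
    "SK_ID_BUREAU": [1, 2, 3],
    "SK_ID_CURR": [100, 100, 101],
})

# validate="many_to_one" makes pandas raise MergeError if the parent key
# is duplicated, i.e. if the relationship is not really one-to-many.
joined = bureau.merge(app, on="SK_ID_CURR", how="left", validate="many_to_one")

loans_per_client = bureau.groupby("SK_ID_CURR").size().to_dict()
print(loans_per_client)
```

The parent key must be unique while the child may repeat it, which is the same constraint the parent and child variables of a relationship are expected to satisfy.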

### Adding Relationships
Defining the relationships is straightforward using the diagram for the data tables. For each relationship, we need to first specify the parent variable and then the child variable. Altogether, there are a total of 6 relationships between the tables (counting the training and testing relationships as one). Below we specify these relationships and then add them to the EntitySet.

In [None]:
# Relationship between app_train and bureau
r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])

# Relationship between bureau and bureau balance
r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])

# Relationship between current app and previous apps
r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])

# Relationships between previous apps and cash, installments, and credit
r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])

In [None]:
# Add in the defined relationships
es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous,
                           r_previous_cash, r_previous_installments, r_previous_credit])
# Print out the EntitySet
es