# Predicting a customer's next purchase using automated feature engineering

<p style="margin:30px">
    <img width=50% src="https://www.featuretools.com/wp-content/uploads/2017/12/FeatureLabs-Logo-Tangerine-800.png" alt="Featuretools" />
</p>

**As customers use your product, they leave behind a trail of behaviors that indicate how they will act in the future. Through automated feature engineering we can identify the predictive patterns in granular customer behavioral data that can be used to improve the customer's experience and generate additional revenue for your business.**

In this tutorial, we show how [Featuretools](https://www.featuretools.com) can be used to perform feature engineering on a multi-table dataset of 3 million online grocery orders provided by Instacart, and then train an accurate machine learning model to predict what product a customer buys next.

*Note: If you are running this notebook yourself, refer to the [README on GitHub](https://github.com/featuretools/predict_next_purchase#running-the-tutorial) for instructions to download the Instacart dataset.*

## Highlights

* We automatically generate 150+ features using Deep Feature Synthesis and select the 20 most important features for predictive modeling
* We build a pipeline that can be reused for numerous prediction problems (you can try this yourself!)
* We quickly develop a model on a subset of the data and validate on the entire dataset in a scalable manner using [Dask](http://dask.pydata.org/en/latest/).

In [1]:
import featuretools as ft
from dask import bag
from dask.diagnostics import ProgressBar
import pandas as pd
import utils
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import os
ft.__version__

'0.5.1'

## Step 1. Load data

We start by loading just one partition of our dataset. In this case, a partition contains the complete purchase history for each user within it. At the end of the tutorial, we will run the pipeline on every partition to generate a final model. To learn more about loading data into Featuretools, read the guide [here](https://docs.featuretools.com/loading_data/using_entitysets.html).

In [2]:
es = utils.load_entityset("partitioned_data/part_1/")
es

Entityset: instacart
  Entities:
    order_products [Rows: 156605, Columns: 7]
    orders [Rows: 16328, Columns: 3]
    users [Rows: 1000, Columns: 2]
  Relationships:
    order_products.order_id -> orders.order_id
    orders.user_id -> users.user_id

## Step 2. Make Labels

For supervised machine learning, we need labels. These labels define what our predictive model will be used for. In this tutorial, we will predict if a customer will buy Bananas in the next 4 weeks.

We generate training examples by selecting a `cutoff_time` in the past to make our labels. Using users who had activity during the `training_window` days before the `cutoff_time`, we look to see if they purchase the product in the `prediction_window`.
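
The labeling rule above can be sketched in plain Python. This is only an illustration, not the code in `utils.make_labels` (which is not shown here): the function name `make_label`, the tuple layout of `purchases`, and the choice of which window boundaries are inclusive are all assumptions.

```python
from datetime import datetime, timedelta

def make_label(purchases, product_name, cutoff_time,
               prediction_window, training_window):
    """Return True/False for an active user, or None if the user is inactive.

    `purchases` is a list of (timestamp, product_name) tuples for one user.
    """
    window_start = cutoff_time - training_window
    window_end = cutoff_time + prediction_window

    # Only users with activity in the training window become training examples.
    active = any(window_start < t <= cutoff_time for t, _ in purchases)
    if not active:
        return None

    # The label only looks after the cutoff, inside the prediction window.
    return any(cutoff_time < t <= window_end and p == product_name
               for t, p in purchases)

cutoff = datetime(2015, 3, 15)
history = [
    (datetime(2015, 2, 20), "Milk"),    # inside the 60-day training window
    (datetime(2015, 3, 20), "Banana"),  # inside the 4-week prediction window
]
label = make_label(history, "Banana", cutoff,
                   prediction_window=timedelta(weeks=4),
                   training_window=timedelta(days=60))
print(label)  # True
```

A user with no activity in the training window yields `None` and would simply be left out of the training examples.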

If you are running this code yourself, feel free to experiment with any of these parameters! For example, try to predict if a customer will buy "Limes" instead of "Bananas" or increase the size of your `prediction_window`.

In [3]:
label_times = utils.make_labels(es=es,
                                product_name = "Banana",
                                cutoff_time = pd.Timestamp('March 15, 2015'),
                                prediction_window = ft.Timedelta("4 weeks"),
                                training_window = ft.Timedelta("60 days"))
label_times.head(5)

| | user_id | time | label |
|---|---|---|---|
| 0 | 1001 | 2015-03-15 | False |
| 1 | 1002 | 2015-03-15 | False |
| 2 | 1003 | 2015-03-15 | False |
| 3 | 1004 | 2015-03-15 | False |
| 4 | 1005 | 2015-03-15 | False |


We can see above that our training examples contain three pieces of information: a user id, the last time from which data may be used for feature engineering (called the "cutoff time"), and the label to predict. Together, these are called our "label times".

Next, we check the distribution of the labels:

In [4]:
label_times["label"].value_counts()

False    654
True     128
Name: label, dtype: int64
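
As a quick worked calculation on these counts, the snippet below computes the share of positive labels and the accuracy a model would get by always predicting `False`. The variable names are our own. The point of the sketch is that the classes are imbalanced, so plain accuracy can look high for a model that learns nothing, which is a common reason to evaluate with a ranking metric such as ROC AUC instead.

```python
n_false, n_true = 654, 128           # counts from the output above
total = n_false + n_true

positive_rate = n_true / total
majority_accuracy = n_false / total  # accuracy of always predicting False

print("positive rate: %.3f" % positive_rate)                # 0.164
print("majority-class accuracy: %.3f" % majority_accuracy)  # 0.836
```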

## Step 3. Automated Feature Engineering
With our label times in hand, we can use Deep Feature Synthesis to automatically generate features.

When we use DFS, we specify

* `target_entity` - the table to build features for
* `cutoff_time` - the point in time at which to calculate the features
* `training_window` - the amount of historical data we want to use when calculating features

A good way to think of the `cutoff_time` is that it lets us "pretend" we are at an earlier point in time when generating our features so we can simulate making predictions. We get this time for each customer from the label times we generated above.
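
To make the role of `cutoff_time` and `training_window` concrete, here is a minimal sketch of computing one feature by hand for a single user. It is not how Featuretools is implemented: the function name `count_product`, the row layout, and the inclusive or exclusive window boundaries are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def count_product(order_products, product_name, cutoff_time, training_window):
    """COUNT(order_products WHERE product_name = ...) as of `cutoff_time`."""
    start = cutoff_time - training_window
    return sum(
        1
        for row in order_products
        # Rows after the cutoff are "in the future" and must not be used;
        # rows older than the training window are ignored as well.
        if start < row["order_time"] <= cutoff_time
        and row["product_name"] == product_name
    )

rows = [
    {"order_time": datetime(2014, 12, 1), "product_name": "Banana"},  # too old
    {"order_time": datetime(2015, 2, 1), "product_name": "Banana"},   # in window
    {"order_time": datetime(2015, 3, 1), "product_name": "Limes"},    # other product
    {"order_time": datetime(2015, 3, 20), "product_name": "Banana"},  # after cutoff
]
n = count_product(rows, "Banana", datetime(2015, 3, 15), timedelta(days=60))
print(n)  # 1
```

Only the February purchase is counted: the December one falls outside the 60-day window, and the purchase on March 20 happens after the cutoff, so using it would leak the label into the features.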

In [5]:
feature_matrix, features = ft.dfs(target_entity="users", 
                                  cutoff_time=label_times,
                                  training_window=ft.Timedelta("60 days"), # same as above
                                  entityset=es,
                                  verbose=True)
# encode categorical values
fm_encoded, features_encoded = ft.encode_features(feature_matrix,
                                                  features)

print("Number of features %s" % len(features_encoded))
fm_encoded.head(10)

Built 76 features
Elapsed: 00:13 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks
Number of features 161


| user_id | MONTH(first_orders_time) = 1 | MONTH(first_orders_time) is unknown | PERCENT_TRUE(order_products.reordered) | DAY(first_orders_time) = 1 | DAY(first_orders_time) is unknown | YEAR(first_orders_time) = 2015 | YEAR(first_orders_time) is unknown | COUNT(order_products) | MODE(order_products.department) = produce | MODE(order_products.department) = dairy eggs | ... | MODE(orders.DAY(order_time)) = 2 | MODE(orders.DAY(order_time)) = 8 | MODE(orders.DAY(order_time)) = 4 | MODE(orders.DAY(order_time)) = 7 | MODE(orders.DAY(order_time)) = 6 | MODE(orders.DAY(order_time)) = 5 | MODE(orders.DAY(order_time)) = 9 | MODE(orders.DAY(order_time)) = 14 | MODE(orders.DAY(order_time)) is unknown | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | 1 | 0 | 0.424242 | 1 | 0 | 1 | 0 | 33 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False |
| 1002 | 1 | 0 | 0.244898 | 1 | 0 | 1 | 0 | 49 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | False |
| 1003 | 1 | 0 | 0.670886 | 1 | 0 | 1 | 0 | 79 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False |
| 1004 | 1 | 0 | 0.692308 | 1 | 0 | 1 | 0 | 169 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False |
| 1005 | 1 | 0 | 0.518519 | 1 | 0 | 1 | 0 | 54 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | False |
| 1006 | 1 | 0 | 0.105263 | 1 | 0 | 1 | 0 | 19 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False |
| 1007 | 1 | 0 | 0.478261 | 1 | 0 | 1 | 0 | 46 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False |
| 1008 | 1 | 0 | 0.03125 | 1 | 0 | 1 | 0 | 32 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | True |
| 1009 | 1 | 0 | 0.428571 | 1 | 0 | 1 | 0 | 14 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False |
| 1011 | 1 | 0 | 0.342105 | 1 | 0 | 1 | 0 | 38 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False |
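
Column names such as `MODE(order_products.department) = produce` and `... is unknown` in the output above come from one-hot encoding the categorical features. The sketch below shows the general idea in plain Python. It is not the `ft.encode_features` implementation: the function `one_hot_encode` is hypothetical, and treating missing values as "unknown" is an assumption.

```python
from collections import Counter

def one_hot_encode(name, values, top_n=10):
    """Encode a categorical column as 0/1 columns for its top_n values."""
    counts = Counter(v for v in values if v is not None)
    top_values = [v for v, _ in counts.most_common(top_n)]

    columns = {}
    for v in top_values:
        columns["%s = %s" % (name, v)] = [int(x == v) for x in values]
    # Anything outside the top values (including missing) is flagged as unknown.
    columns["%s is unknown" % name] = [int(x not in top_values) for x in values]
    return columns

departments = ["produce", "dairy eggs", "produce", "dairy eggs",
               "produce", "snacks", None]
encoded = one_hot_encode("MODE(order_products.department)", departments, top_n=2)
for column, flags in encoded.items():
    print(column, flags)
```

Keeping only the most frequent values bounds the number of new columns, which is how 76 generated features can expand to 161 encoded ones without creating a column for every product name.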


## Step 4. Machine Learning

Using the default parameters, we generated 161 potential features (after encoding) for our prediction problem. With a few simple commands, this feature matrix can be used for machine learning.

In [6]:
X = utils.merge_features_labels(fm_encoded, label_times)
X.drop(["user_id", "time"], axis=1, inplace=True)
X = X.fillna(0)
y = X.pop("label")

Let's train a Random Forest and validate it using 3-fold cross-validation.

In [7]:
clf = RandomForestClassifier(n_estimators=400, n_jobs=-1)
scores = cross_val_score(estimator=clf,X=X, y=y, cv=3,
                         scoring="roc_auc", verbose=True)

"AUC %.2f +/- %.2f" % (scores.mean(), scores.std())

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    9.5s finished


'AUC 0.84 +/- 0.03'

We can see we perform noticeably better than guessing! However, performance varies quite a bit across folds.
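
For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, so a model that guesses scores 0.5. The following is a small pure-Python sketch of that definition on made-up scores. It is for intuition only, since the tutorial itself relies on scikit-learn's `roc_auc` scoring, and the function name `roc_auc` is our own.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y]
    negatives = [s for y, s in zip(y_true, y_score) if not y]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5  # ties count as half
    return correct / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities, for illustration only.
y_true = [False, False, True, False, True]
y_score = [0.1, 0.4, 0.35, 0.2, 0.8]

print("AUC %.2f" % roc_auc(y_true, y_score))    # AUC 0.83
print("AUC %.2f" % roc_auc(y_true, [0.5] * 5))  # AUC 0.50 (uninformative scores)
```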

To improve this, let's identify the top 20 features using a Random Forest and then perform machine learning on the whole dataset (all of the partitions).

In [8]:
clf.fit(X, y)
top_features = utils.feature_importances(clf, features_encoded, n=20)

1: Feature: COUNT(order_products WHERE product_name = Banana), 0.129
2: Feature: MODE(order_products.product_name) = Banana, 0.040
3: Feature: MODE(orders.MODE(order_products.product_name)) = Banana, 0.029
4: Feature: STD(orders.PERCENT_TRUE(order_products.reordered)), 0.019
5: Feature: COUNT(order_products WHERE department = produce), 0.018
6: Feature: MEAN(orders.COUNT(order_products)), 0.017
7: Feature: COUNT(order_products WHERE department = dairy eggs), 0.017
8: Feature: MEAN(orders.NUM_UNIQUE(order_products.product_name)), 0.017
9: Feature: MEAN(orders.NUM_UNIQUE(order_products.aisle_id)), 0.017
10: Feature: NUM_UNIQUE(order_products.product_name), 0.016
11: Feature: STD(orders.NUM_UNIQUE(order_products.department)), 0.016
12: Feature: COUNT(order_products), 0.016
13: Feature: SUM(orders.PERCENT_TRUE(order_products.reordered)), 0.016
14: Feature: PERCENT_TRUE(order_products.reordered), 0.016
15: Feature: MEAN(orders.PERCENT_TRUE(order_products.reordered)), 0.016
16: Feature: STD(
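
The `utils.feature_importances` helper is not shown in this tutorial. Below is a minimal sketch of what such a helper might do, assuming it pairs each feature name with the forest's importance score (a fitted scikit-learn forest exposes these as `clf.feature_importances_`) and keeps the `n` largest. The function name `top_n_features` and the last feature in the list are hypothetical.

```python
def top_n_features(names, importances, n=20):
    """Return the n (name, importance) pairs with the largest importance."""
    ranked = sorted(zip(names, importances),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# The first three importances are copied from the output above;
# the last feature is made up to show that low scores are dropped.
names = [
    "COUNT(order_products)",
    "COUNT(order_products WHERE product_name = Banana)",
    "hypothetical_low_importance_feature",
    "MODE(order_products.product_name) = Banana",
]
importances = [0.016, 0.129, 0.001, 0.040]

top = top_n_features(names, importances, n=3)
for rank, (name, score) in enumerate(top, start=1):
    print("%d: Feature: %s, %.3f" % (rank, name, score))
```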

To persist these features, we can save them to disk.

In [9]:
ft.save_features(top_features, "top_features")

### Understanding feature engineering in Featuretools

Before moving forward, take a look at the features we created. You will see that they are more than just simple transformations of columns in our raw data. Instead, they are aggregations (and sometimes stacked aggregations) across the relationships in our dataset. If you're curious how this works, learn about the Deep Feature Synthesis algorithm in our documentation [here](https://docs.featuretools.com/automated_feature_engineering/afe.html).
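
To illustrate what "stacking" means, here is a hand-written version of one feature from the list above, `MEAN(orders.COUNT(order_products))`. The toy data and variable names are made up; the sketch only mirrors the two relationships in our entity set (`order_products` to `orders`, and `orders` to `users`) and is not how DFS is implemented.

```python
from collections import defaultdict

order_products = [  # (order_id, product_name)
    (1, "Banana"), (1, "Milk"), (1, "Limes"),
    (2, "Banana"), (2, "Yogurt"),
    (3, "Eggs"), (3, "Bread"),
]
orders = {1: "u1", 2: "u1", 3: "u2"}  # order_id -> user_id

# First aggregation: COUNT(order_products) for each order.
count_per_order = defaultdict(int)
for order_id, _ in order_products:
    count_per_order[order_id] += 1

# Second aggregation, stacked on the first:
# MEAN(orders.COUNT(order_products)) for each user.
counts_per_user = defaultdict(list)
for order_id, user_id in orders.items():
    counts_per_user[user_id].append(count_per_order[order_id])

mean_basket_size = {u: sum(c) / len(c) for u, c in counts_per_user.items()}
print(mean_basket_size)  # {'u1': 2.5, 'u2': 2.0}
```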

DFS is so powerful because with no manual work, the library figured out that historical purchases of bananas are important for predicting future purchases. Additionally, it surfaces that purchasing dairy or eggs and reordering behavior are important features. 

Even though these features are intuitive, Deep Feature Synthesis will automatically adapt as we change the prediction problem, saving us the time of manually brainstorming and implementing these data transformations.

## Scaling to full dataset

Once we have written the pipeline for one partition, we can easily scale it out to the full dataset using [Dask](http://dask.pydata.org/en/latest/). A similar pipeline could also be built using [Spark](http://spark.apache.org/docs/2.2.0/api/python/).

In [10]:
pbar = ProgressBar()
pbar.register()

First, we assemble our partitions and map them to entity sets using the function from before. A single partition contains all the data for each user within it, so this computation is easily parallelized. 
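
The reason this parallelizes so easily is that no partition needs data from any other. The sketch below shows the same map-then-concatenate pattern using the standard library's `concurrent.futures` in place of Dask, with a made-up `process_partition` function standing in for the load, label, and featurize steps.

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(partition):
    """Stand-in for load -> label -> featurize; returns one row per user."""
    return [{"user_id": user_id, "n_orders": len(orders)}
            for user_id, orders in partition.items()]

# Each partition holds the complete history of its own users.
partitions = [
    {1001: ["o1", "o2"], 1002: ["o3"]},
    {2001: ["o4", "o5", "o6"]},
]

# Partitions are independent, so they can be processed in any order.
with ThreadPoolExecutor(max_workers=2) as pool:
    per_partition = list(pool.map(process_partition, partitions))

# Concatenate the per-partition results into one table.
rows = [row for part in per_partition for row in part]
print(rows)
```

If users were split across partitions, per-user aggregations would need a shuffle step between partitions, and this simple map would no longer be enough.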

In [11]:
path = "partitioned_data/"
# collect the path of every entry in the partition directory
dirnames = [os.path.join(path, d) for d in os.listdir(path)]
b = bag.from_sequence(dirnames)
entity_sets = b.map(utils.load_entityset)

Next, we create label times for each entity set.

In [12]:
label_times = entity_sets.map(utils.dask_make_labels,
                              product_name = "Banana",
                              cutoff_time = pd.Timestamp('March 1, 2015'),
                              prediction_window = ft.Timedelta("4 weeks"),
                              training_window = ft.Timedelta("60 days"))
label_times

dask.bag<map-das..., npartitions=104>

In [13]:
# load in the features from before
top_features = ft.load_features("top_features")
feature_matrices = label_times.map(utils.calculate_feature_matrix, features=top_features)

Now, we compute with Dask. Running on a MacBook Pro with a 2.2 GHz Intel Core i7 and 16 GB of RAM, this takes roughly 20 minutes to an hour (the run shown below took about 57 minutes). The `compute` method accepts a `num_workers` argument, which defaults to using all of the cores on your computer. To limit this, specify the number of workers explicitly, for example `feature_matrices.compute(num_workers=2)`, replacing 2 with the number of cores you want to use.

In [14]:
fms_out = feature_matrices.compute()
X = pd.concat(fms_out)

[########################################] | 100% Completed | 57min 19.4s


Now, we repeat the same machine learning steps we used on the sample dataset.

In [15]:
X.drop(["user_id", "time"], axis=1, inplace=True)
X = X.fillna(0)
y = X.pop("label")

In [16]:
clf = RandomForestClassifier(n_estimators=400, n_jobs=-1)
scores = cross_val_score(estimator=clf,X=X, y=y, cv=3,
                         scoring="roc_auc", verbose=True)

"AUC %.2f +/- %.2f" % (scores.mean(), scores.std())

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  4.6min finished


'AUC 0.87 +/- 0.00'

We can now see that our performance has stabilized across folds, giving us much more confidence in our model.

Now, let's look at the top features.

In [17]:
clf.fit(X, y)
top_features = utils.feature_importances(clf, top_features, n=20)

1: Feature: COUNT(order_products WHERE product_name = Banana), 0.272
2: Feature: MODE(order_products.product_name) = Banana, 0.062
3: Feature: MODE(orders.MODE(order_products.product_name)) = Banana, 0.047
4: Feature: COUNT(order_products WHERE department = produce), 0.045
5: Feature: STD(orders.PERCENT_TRUE(order_products.reordered)), 0.043
6: Feature: MEAN(orders.PERCENT_TRUE(order_products.reordered)), 0.042
7: Feature: PERCENT_TRUE(order_products.reordered), 0.041
8: Feature: SUM(orders.PERCENT_TRUE(order_products.reordered)), 0.040
9: Feature: STD(orders.NUM_UNIQUE(order_products.department)), 0.039
10: Feature: STD(orders.NUM_UNIQUE(order_products.product_name)), 0.038
11: Feature: STD(orders.COUNT(order_products)), 0.038
12: Feature: MEAN(orders.NUM_UNIQUE(order_products.aisle_id)), 0.038
13: Feature: COUNT(order_products WHERE department = dairy eggs), 0.035
14: Feature: MEAN(orders.NUM_UNIQUE(order_products.product_name)), 0.034
15: Feature: MEAN(orders.COUNT(order_products)),

We can see the top features shifted around a bit, but mostly stayed the same.

## Next Steps

While this is an end-to-end example of going from raw data to a trained machine learning model, it is necessary to do further exploration before claiming we've built something impactful.

Fortunately, Featuretools makes it easy to build structured data science pipelines. As next steps, you could:
* Further validate these results by creating feature vectors at different cutoff times
* Define other prediction problems for this dataset (you can even change the entity you are making predictions on!)
* Save feature matrices to disk as CSVs so they can be reused with different problems without recalculating
* Experiment with parameters to Deep Feature Synthesis
* Create custom primitives for DFS. More info [here](https://docs.featuretools.com/automated_feature_engineering/primitives.html).
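
As a starting point for the first suggestion, the sketch below generates a series of evenly spaced cutoff times with the standard library. The helper `cutoff_times` and the chosen dates are hypothetical. Each resulting time could then be passed as `cutoff_time` to the label-making step to build and evaluate a separate set of training examples.

```python
from datetime import datetime, timedelta

def cutoff_times(first, n, step):
    """Return n cutoff times spaced `step` apart, starting at `first`."""
    return [first + i * step for i in range(n)]

times = cutoff_times(datetime(2015, 2, 1), n=3, step=timedelta(weeks=2))
for cutoff in times:
    print(cutoff.date())
# 2015-02-01
# 2015-02-15
# 2015-03-01
```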


<p>
    <img src="https://www.featurelabs.com/wp-content/uploads/2017/12/logo.png" alt="Featuretools" />
</p>


Featuretools was created by the developers at [Feature Labs](https://www.featurelabs.com/). If building impactful data science pipelines is important to you or your business, please [get in touch](https://www.featurelabs.com/contact).