# Predict Next Purchase

In this example, we build a machine learning application that predicts whether customers will purchase a product within the next shopping period. This application is structured into three important steps:

* Prediction Engineering
* Feature Engineering
* Machine Learning

In the first step, we label the historical transactions by using [Compose](https://compose.alteryx.com/). In the second step, we generate the features by using [Featuretools](https://docs.featuretools.com/). In the third step, we search for the best machine learning pipeline by using [EvalML](https://evalml.alteryx.com/). After working through these steps, you will know how to build machine learning applications for real-world problems like predicting consumer spending. Let's get started.

In [1]:
from demo.predict_next_purchase import load_sample
from evalml import AutoMLSearch
from evalml.preprocessing import split_data
import composeml as cp
import featuretools as ft
import matplotlib as mpl


We will use historical data of online grocery orders provided by Instacart.

In [2]:
df = load_sample()

df.head()

| id | order_id | product_id | add_to_cart_order | reordered | product_name | aisle_id | department_id | department | user_id | order_time |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 120 | 33120 | 13 | 0 | Organic Egg Whites | 86 | 16 | dairy eggs | 23750 | 2015-01-11 08:00:00 |
| 1 | 120 | 31323 | 7 | 0 | Light Wisconsin String Cheese | 21 | 16 | dairy eggs | 23750 | 2015-01-11 08:00:00 |
| 2 | 120 | 1503 | 8 | 0 | Low Fat Cottage Cheese | 108 | 16 | dairy eggs | 23750 | 2015-01-11 08:00:00 |
| 3 | 120 | 28156 | 11 | 0 | Total 0% Nonfat Plain Greek Yogurt | 120 | 16 | dairy eggs | 23750 | 2015-01-11 08:00:00 |
| 4 | 120 | 41273 | 4 | 0 | Broccoli Florets | 123 | 4 | produce | 23750 | 2015-01-11 08:00:00 |


## Prediction Engineering

Note that the prediction problem has two parameters:

* The name of the product.
* The length of the shopping period.

We can change these parameters to create different prediction problems. For example, will a customer purchase an avocado within the next 3 days, or a banana within the next week? Creating these variations only requires changing the parameter values. Exploring different scenarios this way is crucial for making better decisions.


### Defining the Labeling Process

In each shopping period, we will check whether a customer bought a product. Let’s define this as a labeling function with a parameter for the product name.

In [3]:
def bought_product(ds, product_name):
    return ds.product_name.str.contains(product_name).any()
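
To see what the labeling function returns, here is a minimal sketch that applies `bought_product` to two small, hand-made data slices. The rows are hypothetical stand-ins for a customer's shopping periods, not values taken from the Instacart sample. Note that `str.contains` matches substrings (and treats the pattern as a regular expression by default), so `'Banana'` also matches a name such as "Bag of Organic Bananas".

```python
import pandas as pd


def bought_product(ds, product_name):
    # Same labeling function as above: True if any row's product name
    # contains the given text.
    return ds.product_name.str.contains(product_name).any()


# Hypothetical data slices standing in for one customer's shopping periods.
with_banana = pd.DataFrame(
    {'product_name': ['Broccoli Florets', 'Bag of Organic Bananas']}
)
without_banana = pd.DataFrame(
    {'product_name': ['Organic Egg Whites', 'Low Fat Cottage Cheese']}
)

label_with = bool(bought_product(with_banana, 'Banana'))
label_without = bool(bought_product(without_banana, 'Banana'))
print(label_with, label_without)
```

If an exact product match is needed instead, the comparison inside the function could be changed to an equality check; the substring version is what this example uses.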

### Representing the Prediction Problem

We will represent the prediction problem using a label maker. This way, we can run searches on the online grocery orders to generate the training examples. This is done by setting the following parameters:

* The `target_entity` as the customer, because we want to label orders for each individual customer.
* The `labeling_function` as the function we defined previously.
* The `time_index` as the order time, because shopping periods are based on the order time.
* The `window_size` as the length of a shopping period. We can tweak this parameter to create variations of the prediction problem. In this case, we will use one week as the length of the shopping period.

In [None]:
lm = cp.LabelMaker(
    target_entity='user_id',
    time_index='order_time',
    labeling_function=bought_product,
    window_size='7d',
)

### Finding the Training Examples

Now, we can run a search to check whether the product was purchased within shopping periods. This is done using the following parameters:

* The online grocery orders sorted by the order time.
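* The `num_examples_per_instance` as the number of training examples to find per customer. In this case, `-1` searches for all existing examples.
* The `minimum_data` as the amount of data required before the search starts. In this case, we use three days.
* The `gap` as the time between consecutive examples. In this case, we use three days, which is shorter than the window size, so the shopping periods overlap.
* The `product_name` as the value passed through to the labeling function. In this case, we check for bananas.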

In [None]:
lt = lm.search(
    df.sort_values('order_time'),
    minimum_data='3d',
    num_examples_per_instance=-1,
    product_name='Banana',
    gap='3d',
    verbose=False,
)

lt.head()
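
To build intuition for how the window size, minimum data, and gap interact, the following is a simplified sketch in plain Python. It is not Compose's implementation: the `window_starts` helper and the purchases are hypothetical, and it assumes that the minimum data is measured from the customer's first order and that the gap is the spacing between the starts of consecutive shopping periods.

```python
from datetime import datetime, timedelta


def window_starts(first_time, last_time, minimum_data, gap):
    """Yield the start times of candidate shopping periods (simplified)."""
    start = first_time + minimum_data
    while start <= last_time:
        yield start
        start += gap


# Hypothetical purchases for one customer: (order time, product name).
purchases = [
    (datetime(2015, 1, 1), 'Broccoli Florets'),
    (datetime(2015, 1, 6), 'Banana'),
    (datetime(2015, 1, 12), 'Organic Egg Whites'),
]

window_size = timedelta(days=7)
labels = []
for start in window_starts(
    first_time=purchases[0][0],
    last_time=purchases[-1][0],
    minimum_data=timedelta(days=3),
    gap=timedelta(days=3),
):
    end = start + window_size
    # The label is True if any purchase inside [start, end) mentions 'Banana'.
    in_window = [name for time, name in purchases if start <= time < end]
    label = any('Banana' in name for name in in_window)
    labels.append((start.date().isoformat(), label))

for row in labels:
    print(row)
```

Because the gap (3 days) is shorter than the window (7 days), the same purchase can fall into more than one shopping period in this sketch.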


We can print out the settings and transforms that were used to make the labels. This is useful as a reference to understand how the labels were made.

In [None]:
lt.describe()

These plots show the discrete label distribution and the cumulative count across time.

In [None]:
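%matplotlib inline
# Import pyplot explicitly; `import matplotlib as mpl` alone does not
# guarantee that the `mpl.pyplot` submodule is loaded.
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(5, 8))
ax0 = fig.add_subplot(211)
ax1 = fig.add_subplot(212)

lt.plot.distribution(ax=ax0)
lt.plot.count_by_time(ax=ax1)
# Adjust the layout after plotting so the axis labels are accounted for.
fig.tight_layout();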

## Feature Engineering

Now, we are ready for feature engineering. To get started, let's create an entity set to represent the data.

In [None]:
es = ft.EntitySet('instacart')

es.entity_from_dataframe(
    dataframe=df.reset_index(),
    entity_id='order_products',
    time_index='order_time',
    index='id',
)

es.normalize_entity(
    base_entity_id='order_products',
    new_entity_id='orders',
    index='order_id',
    additional_variables=['user_id'],
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='orders',
    new_entity_id='users',
    index='user_id',
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='order_products',
    new_entity_id='products',
    index='product_id',
    additional_variables=['aisle_id', 'department_id'],
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='products',
    new_entity_id='aisles',
    index='aisle_id',
    additional_variables=['department_id'],
    make_time_index=False,
)

es.normalize_entity(
    base_entity_id='aisles',
    new_entity_id='departments',
    index='department_id',
    make_time_index=False,
)

es["order_products"]["department"].interesting_values = ['produce']
es["order_products"]["product_name"].interesting_values = ['Banana']
es.plot()
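
Conceptually, each `normalize_entity` call pulls the unique values of a column out of a base table into a new parent table, and moves any additional columns that depend on it. The following is a rough pandas sketch of that idea on a hypothetical three-row table. It illustrates the concept only and is not how Featuretools is implemented.

```python
import pandas as pd

# Hypothetical order_products rows: two products in order 120, one in order 121.
order_products = pd.DataFrame({
    'id': [0, 1, 2],
    'order_id': [120, 120, 121],
    'product_id': [33120, 31323, 41273],
    'user_id': [23750, 23750, 23751],
})

# "orders" keeps one row per order_id and takes user_id with it.
orders = (
    order_products[['order_id', 'user_id']]
    .drop_duplicates()
    .reset_index(drop=True)
)

# "users" keeps one row per user_id found in orders.
users = orders[['user_id']].drop_duplicates().reset_index(drop=True)

# The base table keeps order_id as the key linking each row to its order,
# while user_id now lives only in the orders table.
order_products = order_products.drop(columns=['user_id'])

print(orders)
print(users)
```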

Let's generate the features that correspond to our labels.

In [None]:
fm, fd = ft.dfs(
    entityset=es,
    target_entity='users',
    cutoff_time=lt,
    cutoff_time_in_index=True,
    include_cutoff_time=False,
    verbose=False,
)

fm.head()
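
Passing the label times as `cutoff_time` keeps information from the labeled shopping period out of the features: for each label, features are computed only from data observed before its cutoff time, and `include_cutoff_time=False` also excludes data stamped exactly at the cutoff. Below is a minimal sketch of that filtering rule in plain Python. The order times and the `count_orders_before` helper are hypothetical and only illustrate the idea for a single count feature.

```python
from datetime import datetime


def count_orders_before(order_times, cutoff, include_cutoff_time=False):
    """Count the orders that may be used for features at the given cutoff."""
    if include_cutoff_time:
        return sum(1 for t in order_times if t <= cutoff)
    return sum(1 for t in order_times if t < cutoff)


# Hypothetical order times for one customer.
order_times = [
    datetime(2015, 1, 1, 8),
    datetime(2015, 1, 4, 8),
    datetime(2015, 1, 9, 8),
]
cutoff = datetime(2015, 1, 4, 8)

strict = count_orders_before(order_times, cutoff)
inclusive = count_orders_before(order_times, cutoff, include_cutoff_time=True)
print(strict, inclusive)
```

With the strict rule, the order placed exactly at the cutoff is left out, so it can only influence the label and never the features.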

## Machine Learning

Now, we can create a machine learning model. Let's extract the labels from the feature matrix and split the data into training and holdout sets.

In [None]:
y = fm.pop('bought_product')
splits = split_data(fm, y, test_size=0.2, random_state=0)
X_train, X_holdout, y_train, y_holdout = splits

### Train Model

Next, we search for the optimal pipeline by trying out different models on the training set.

In [None]:
automl = AutoMLSearch(problem_type='binary', objective='f1', random_state=0)
automl.search(X_train, y_train, data_checks=None, show_iteration_plot=False)

In [None]:
automl.best_pipeline.describe()
automl.best_pipeline.graph()

### Test Model

Finally, we score the model performance by evaluating predictions on the holdout set.

In [None]:
best_pipeline = automl.best_pipeline.fit(X_train, y_train)
score = best_pipeline.score(X_holdout, y_holdout, objectives=['f1'])
dict(score)
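
For reference, the F1 objective is the harmonic mean of precision and recall. The sketch below computes it by hand for a few hypothetical holdout labels and predictions. The `f1_score` helper is written here for illustration and is not part of EvalML.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)      # true positives
    fp = sum(1 for t, p in pairs if not t and p)  # false positives
    fn = sum(1 for t, p in pairs if t and not p)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical holdout labels and model predictions.
y_true = [True, False, True, True, False]
y_pred = [True, True, False, True, False]

score = f1_score(y_true, y_pred)
print(round(score, 3))
```

Because F1 ignores true negatives, it tends to be more informative than accuracy when most shopping periods do not contain the product.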