# Prediction engineering case study using UK Retail Dataset 

In this case study, we look at prediction engineering. Prediction engineering is a step in predictive modeling where we:
* Define an outcome we are interested in predicting
* Scan the data to find past occurrences of the outcome
* Use these past occurrences as training examples for machine learning/modeling

We then use featuretools to extract features and learn a predictive model.

In this particular case study, we focus on a retail dataset that is openly available at 

We define the prediction problem as predicting whether a customer will make more than ``k`` purchases in a future time window.

In [113]:
import featuretools as ft
from utils import (find_training_examples, load_uk_retail_data, 
                   engineer_features_uk_retail, preview)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Imputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix
ft.__version__
%load_ext autoreload
%autoreload 2



# Step 1: Load and prepare data

In [24]:
item_purchases, invoices, items, customers = load_uk_retail_data()

The dataset has the following tables:
* ``item_purchases``
* ``invoices``
* ``items``
* ``customers``

The following relationships exist between the tables:
* A customer may have multiple invoices 
* An item may have been purchased multiple times 
* An invoice may have multiple item purchases 

In [25]:
# Each entity maps a name to (dataframe, index column[, time index column])
entities = {
    "item_purchases": (item_purchases, "item_purchase_id", "InvoiceDate"),
    "items": (items, "StockCode"),
    "customers": (customers, "CustomerID"),
    "invoices": (invoices, "InvoiceNo", "first_item_purchases_time"),
}

# Each relationship is (parent entity, parent key, child entity, child key)
relationships = [
    ("customers", "CustomerID", "invoices", "CustomerID"),
    ("invoices", "InvoiceNo", "item_purchases", "InvoiceNo"),
    ("items", "StockCode", "item_purchases", "StockCode"),
]
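Each relationship links a key in a parent table to the matching column in a child table. Before generating features, it can help to confirm that every child row points to an existing parent row. The sketch below shows one way to do that with plain pandas; the helper `check_relationship` and the toy tables are hypothetical and are not part of `utils`.

```python
import pandas as pd

def check_relationship(parent, parent_key, child, child_key):
    """Return the child rows whose key has no match in the parent table."""
    missing = ~child[child_key].isin(parent[parent_key])
    return child[missing]

# Toy stand-ins for the real tables (hypothetical values)
toy_invoices = pd.DataFrame({"InvoiceNo": ["A1", "A2"], "CustomerID": [1, 2]})
toy_item_purchases = pd.DataFrame({
    "item_purchase_id": [0, 1, 2],
    "InvoiceNo": ["A1", "A1", "A3"],  # "A3" has no parent invoice
})

orphans = check_relationship(toy_invoices, "InvoiceNo",
                             toy_item_purchases, "InvoiceNo")
print(orphans)
```

An empty result for each of the three relationships would suggest the tables are consistent with one another.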

# Step 2: Find training examples

In the code snippet below, we find training examples in the data. We set the following parameters:
* ``prediction_window`` = 14 days
* ``training_window`` = 21 days
* ``lead`` = 7 days
* ``threshold`` = 5: the number of purchases a customer must exceed in the prediction window to be considered engaged

In [105]:
label_times = find_training_examples(item_purchases, invoices,
                                     prediction_window=pd.Timedelta("14d"),
                                     training_window=pd.Timedelta("21d"),
                                     lead=pd.Timedelta("7d"),
                                     threshold=5)

In [123]:
preview(label_times,5)

| | CustomerID | t_start | cutoff_time | purchases>threshold |
|---|---|---|---|---|
| 0 | 17505.0 | 2011-05-18 | 2011-06-08 | False |
| 516 | 16444.0 | 2011-05-18 | 2011-06-08 | False |
| 517 | 16889.0 | 2011-05-18 | 2011-06-08 | False |
| 518 | 17613.0 | 2011-05-18 | 2011-06-08 | True |
| 519 | 17152.0 | 2011-05-18 | 2011-06-08 | False |


The output above shows the first 5 training examples. ``CustomerID`` identifies the customer. ``t_start`` is the timestamp after which we can use the data for generating features, and ``cutoff_time`` is the last timestamp from which we can use the customer's data. The last column is the label: it is ``True`` if the customer had more than 5 purchases in the period between ``cutoff_time + lead`` and ``cutoff_time + lead + prediction_window``.
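To make the label definition concrete, here is a minimal sketch of how a single label could be computed from one customer's purchase timestamps. It is not the implementation of `find_training_examples`; the function `label_customer`, the toy timestamps, and the choice of a half-open window (start excluded, end included) are all assumptions made for illustration.

```python
import pandas as pd

def label_customer(purchase_times, cutoff_time, lead, prediction_window, threshold):
    """True if the customer has more than `threshold` purchases in
    (cutoff_time + lead, cutoff_time + lead + prediction_window]."""
    start = cutoff_time + lead
    end = start + prediction_window
    in_window = (purchase_times > start) & (purchase_times <= end)
    return int(in_window.sum()) > threshold

# Hypothetical purchase timestamps for one customer
purchase_times = pd.Series(pd.to_datetime([
    "2011-06-10",                              # falls in the lead gap, ignored
    "2011-06-16", "2011-06-17", "2011-06-18",
    "2011-06-20", "2011-06-25", "2011-06-28",  # six purchases inside the window
    "2011-07-05",                              # after the window, ignored
]))

label = label_customer(purchase_times,
                       cutoff_time=pd.Timestamp("2011-06-08"),
                       lead=pd.Timedelta("7d"),
                       prediction_window=pd.Timedelta("14d"),
                       threshold=5)
print(label)
```

Purchases made during the lead period do not count toward the label. This reflects the role of ``lead``: the prediction is made some time before the window it refers to.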

# Step 3: Generate features
Next, we generate features for each of the training examples using featuretools, an automated feature engineering library. We go into detail about this package in the NYC-Taxi case study; here we simply use it to generate features.

In [107]:
feature_matrix = engineer_features_uk_retail(entities, relationships,
                                             label_times, training_window='21d')

In [122]:
preview(feature_matrix,10)

CustomerID,WEEK(first_invoices_time),HOUR(first_invoices_time),MAX(item_purchases.Quantity),STD(item_purchases.UnitPrice),DAY(first_invoices_time),IS_WEEKEND(first_invoices_time),MINUTE(first_invoices_time),MONTH(first_invoices_time),MAX(item_purchases.UnitPrice),MEAN(item_purchases.Quantity),...,MAX(invoices.STD(item_purchases.UnitPrice)),STD(invoices.MAX(item_purchases.Quantity)),MEAN(invoices.STD(item_purchases.UnitPrice)),MAX(invoices.MEAN(item_purchases.Quantity)),MAX(invoices.STD(item_purchases.Quantity)),MEAN(invoices.MAX(item_purchases.UnitPrice)),MEAN(invoices.MAX(item_purchases.Quantity)),MEAN(invoices.MEAN(item_purchases.Quantity)),STD(invoices.MAX(item_purchases.UnitPrice)),STD(invoices.MEAN(item_purchases.UnitPrice))
12353.0,20,17,,,19,False,47,5,,,...,,,,,,,,,,
12359.0,2,12,,,12,False,43,1,,,...,,,,,,,,,,
12360.0,21,9,,,23,False,43,5,,,...,,,,,,,,,,
12380.0,23,9,,,7,False,49,6,,,...,,,,,,,,,,
12415.0,1,11,600.0,2.367284,6,False,12,1,12.5,110.378378,...,2.37664,350.0,1.18832,113.260274,131.485301,8.375,250.0,6.630137,4.125,0.773973
12417.0,50,11,24.0,5.565414,17,False,51,12,28.0,11.608696,...,5.565414,0.0,5.565414,11.608696,6.761394,28.0,24.0,11.608696,0.0,0.0
12423.0,51,10,,,21,False,54,12,,,...,,,,,,,,,,
12426.0,21,12,,,29,True,26,5,,,...,,,,,,,,,,
12431.0,48,10,24.0,2.658325,1,False,3,12,7.95,8.0,...,2.658325,0.0,2.658325,8.0,6.947004,7.95,24.0,8.0,0.0,0.0
12437.0,2,14,48.0,6.323762,12,False,13,1,18.0,18.375,...,6.323762,0.0,6.323762,18.375,14.247259,18.0,48.0,18.375,0.0,0.0
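The feature names read from the inside out: ``MEAN(invoices.MAX(item_purchases.Quantity))`` is, for each customer, the mean over their invoices of the largest quantity on each invoice. The rows of ``NaN`` are most likely customers with no purchases inside the training window. The sketch below reproduces one such stacked feature by hand with pandas. It only illustrates the idea and is not how featuretools computes it internally; the toy data and the exact window boundaries are assumptions.

```python
import pandas as pd

# Toy item purchases (hypothetical values), already joined with the customer
purchases = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(["2011-05-20", "2011-05-20",
                                   "2011-06-01", "2011-04-01"]),
    "Quantity": [4, 10, 6, 3],
})
all_customers = pd.Index([1, 2], name="CustomerID")

cutoff_time = pd.Timestamp("2011-06-08")
training_window = pd.Timedelta("21d")

# Keep only purchases inside the training window that ends at the cutoff
in_window = purchases[(purchases["InvoiceDate"] > cutoff_time - training_window)
                      & (purchases["InvoiceDate"] <= cutoff_time)]

# Inner aggregation: largest quantity on each invoice
max_per_invoice = in_window.groupby(["CustomerID", "InvoiceNo"])["Quantity"].max()

# Outer aggregation: mean over each customer's invoices;
# customers with no purchases in the window end up as NaN
feature = (max_per_invoice.groupby(level="CustomerID").mean()
           .reindex(all_customers)
           .rename("MEAN(invoices.MAX(item_purchases.Quantity))"))
print(feature)
```

Customer 2 only purchased before the window opened, so the feature is missing for that row, which is the pattern the imputation step later has to handle.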


# Step 4: Train a model using a random forest
Now we are ready to train a model and evaluate it. To do this, we:
* Split our examples into training and test sets
* Impute missing values
* Train a model on the training data
* Test it on the data set aside for testing

We can split the data using the function ``train_test_split``, specifying the proportion we want for testing. In this case, we set it to 35%.

In [109]:
y = label_times['purchases>threshold']
X_train, X_test, y_train, y_test = train_test_split(feature_matrix, 
                                                    y, test_size=0.35)
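``train_test_split`` pairs rows of the feature matrix with labels by position, not by index. If the feature matrix is not in the same row order as ``label_times``, the labels need to be aligned first. The following is a minimal sketch of that alignment, assuming the feature matrix is indexed by ``CustomerID`` and there is exactly one label per customer; the toy frames are hypothetical.

```python
import pandas as pd

# Toy stand-ins (hypothetical values)
toy_label_times = pd.DataFrame({
    "CustomerID": [17505.0, 16444.0, 17613.0],
    "purchases>threshold": [False, False, True],
})
toy_feature_matrix = pd.DataFrame(
    {"MAX(item_purchases.Quantity)": [24.0, 600.0, 48.0]},
    index=pd.Index([16444.0, 17613.0, 17505.0], name="CustomerID"),
)

# Look up each label by CustomerID, in the feature matrix's row order
y_aligned = (toy_label_times.set_index("CustomerID")["purchases>threshold"]
             .reindex(toy_feature_matrix.index))

# Every row of the feature matrix should have found a label
assert not y_aligned.isna().any()
print(y_aligned)
```

If ``engineer_features_uk_retail`` already returns rows in the order of ``label_times``, this step changes nothing. Passing a fixed ``random_state`` to ``train_test_split`` would additionally make the split reproducible.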

We can impute the missing (``NaN``) values in the feature matrix using the ``Imputer`` in scikit-learn. It replaces the ``NaN`` values in a feature column with the mean of the remaining entries in that column. This is a simple imputation strategy.

In [116]:
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp = imp.fit(X_train)
X_train_imp = imp.transform(X_train)
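Note that ``Imputer`` was removed in scikit-learn 0.22. On newer versions, ``SimpleImputer`` from ``sklearn.impute`` plays the same role; it always works column-wise, so there is no ``axis`` argument. A minimal sketch on toy arrays (hypothetical values):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrices with missing entries (hypothetical values)
X_train_toy = np.array([[1.0, np.nan],
                        [3.0, 4.0],
                        [np.nan, 8.0]])
X_test_toy = np.array([[np.nan, np.nan]])

# Learn the column means from the training data only
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X_train_toy)

# Fill missing values in both sets with the training-set means
X_train_filled = imputer.transform(X_train_toy)
X_test_filled = imputer.transform(X_test_toy)
print(X_test_filled)
```

Fitting on the training data alone keeps information from the test set out of the imputed values.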

We can now train a random forest classifier (a type of ensemble classifier), again using the scikit-learn package.

In [117]:
clf = RandomForestClassifier(random_state=0,n_estimators=10,
                             class_weight="balanced",verbose=True)
clf.fit(X_train_imp, y_train)

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.0s finished


RandomForestClassifier(bootstrap=True, class_weight='balanced',
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=True, warm_start=False)
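The setting ``class_weight="balanced"`` matters here because engaged customers are the minority class. According to the scikit-learn documentation, the weights are computed as ``n_samples / (n_classes * np.bincount(y))``, so the rarer class receives the larger weight. The sketch below reproduces that formula on a toy label vector (hypothetical values):

```python
import numpy as np

# Hypothetical, imbalanced labels: four negatives and one positive
y_toy = np.array([False, False, False, False, True])

# Count the examples in each class
classes, counts = np.unique(y_toy, return_counts=True)

# Balanced weight per class: n_samples / (n_classes * class_count)
weights = len(y_toy) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

In this toy case the single positive example carries four times the weight of each negative one.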

# Step 5: Test the model
To test the model, we:
* Impute the missing values in the test set using the imputer fitted on the training data
* Use the trained classifier to predict the labels

In [118]:
X_test_imp = imp.transform(X_test)
predicted_labels = clf.predict(X_test_imp)

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.0s finished


We evaluate the model by calculating the confusion matrix on the test set.

In [120]:
tn, fp, fn, tp = confusion_matrix(y_test, predicted_labels).ravel()

In [121]:
tp,fp

(0, 10)
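With ``tp = 0`` and ``fp = 10``, precision is 0 / (0 + 10) = 0: none of the ten customers the model flagged were actually engaged. Precision and recall can be derived from the confusion-matrix counts as sketched below. The helper `precision_recall` and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive (True) class."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))    # predicted True, actually True
    fp = int(np.sum(~y_true & y_pred))   # predicted True, actually False
    fn = int(np.sum(y_true & ~y_pred))   # predicted False, actually True
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predictions
y_true_toy = [True, True, False, False, False, True]
y_pred_toy = [True, False, True, False, False, True]

precision, recall = precision_recall(y_true_toy, y_pred_toy)
print(precision, recall)
```

The ``precision_recall_fscore_support`` function imported at the top of the notebook reports the same quantities per class.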