# Welcome to the retail notebook!

In this demonstration, we will show you how the Retail EntitySet and Projects were created. 

In this notebook you will learn:

1. How an EntitySet can be made from a single table on S3
2. What a *prediction problem* is
3. How to load EntitySets and Projects into Tempo

# Step 1: Make an EntitySet
We start by loading the retail logs into a dataframe. `pd.read_csv` reads the CSV directly from a public S3 bucket; the file has information on Customers, Transactions, Orders and Products.

In [1]:
import featurelabs as fl
import pandas as pd

import utils

csv_s3 = "s3://featurelabs-static/online-retail-logs.csv"
data = pd.read_csv(csv_s3, parse_dates=["order_date"])

utils.overview(data)
utils.warnings(data)


+--------------+
|  Data Shape  |
+--------------+
Number of columns: 8
Number of rows: 541909

+------------------+
|  Missing Values  |
+------------------+
Most values missing from column: 135080
Average missing values by column: 17066.75

+----------------+
|  Memory Usage  |
+----------------+
Total memory used: 142.53 MB
Average memory by column: 15.84 MB

+--------------+
|  Data Types  |
+--------------+
                index
0                    
int64               1
datetime64[ns]      1
float64             2
object              4

+------------+
+------------+
DataFrame has 5268 duplicates
customer_id has 135080 missing values: (24% of total)
order_id has many unique values: 25900
product_id has many unique values: 4070
description has many unique values: 4223
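
The `utils` module is a local helper whose source isn't shown here. As a rough sketch of where numbers like these come from, the same kind of summary can be computed with plain pandas. The toy dataframe below is made up for illustration and is not the retail data.

```python
import pandas as pd

# Made-up stand-in for the retail logs (illustrative values only)
toy = pd.DataFrame({
    "order_id": ["A1", "A1", "A2", "A3"],
    "customer_id": [1.0, 1.0, None, 2.0],
    "price": [2.5, 2.5, 4.0, 1.0],
})

n_rows, n_cols = toy.shape                  # data shape
missing_by_column = toy.isna().sum()        # missing values per column
most_missing = missing_by_column.max()      # worst column
average_missing = missing_by_column.mean()  # average over columns
dtype_counts = toy.dtypes.value_counts()    # number of columns per dtype
unique_counts = toy.nunique()               # distinct values per column

print(f"Number of columns: {n_cols}")
print(f"Number of rows: {n_rows}")
print(f"Most values missing from column: {most_missing}")
```

Whether `utils.overview` computes its figures exactly this way is an assumption.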


In [2]:
# drop the duplicates
data = data.drop_duplicates()
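
Dropping the 5268 duplicates from 541909 rows leaves 536641, which matches the `order_products` row count reported below. A minimal sketch on made-up data shows how the duplicate count and `drop_duplicates` relate:

```python
import pandas as pd

# Made-up rows; the first two are exact duplicates of each other
toy = pd.DataFrame({
    "order_id": ["A1", "A1", "A2"],
    "product_id": ["P1", "P1", "P1"],
    "quantity": [1, 1, 1],
})

# duplicated() flags every repeat of an earlier row (the first copy is kept)
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(toy), len(deduplicated))
```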

EntitySets organize the data you work with to define prediction problems, perform feature engineering, and train machine learning models. They contain multiple tables, known as entities, and the relationships between them.

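In this case, we have a single table of data, but we can create new entities using `normalize_entity`. Each call below picks an index column in an existing entity and moves the listed additional variables along with it into the new entity.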

In [3]:
es = fl.EntitySet(id="Online Retail Logs")
es.entity_from_dataframe("order_products",
                         dataframe=data,
                         index="order_product_id",
                         variable_types={'description': fl.variable_types.Text})

# create a new "products" entity
es.normalize_entity(new_entity_id="products",
                    base_entity_id="order_products",
                    index="product_id",
                    additional_variables=["description"])

# create a new "orders" entity
es.normalize_entity(new_entity_id="orders",
                    base_entity_id="order_products",
                    index="order_id",
                    additional_variables=[
                        "customer_id", "country", "order_date"],
                    make_time_index="order_date")

# create a new "customers" entity based on the orders entity
es.normalize_entity(new_entity_id="customers",
                    base_entity_id="orders",
                    index="customer_id",
                    additional_variables=["country"])

es.add_last_time_indexes()
es



Entityset: Online Retail Logs
  Entities:
    order_products [Rows: 536641, Columns: 5]
    customers [Rows: 4373, Columns: 3]
    products [Rows: 4070, Columns: 2]
    orders [Rows: 25900, Columns: 3]
  Relationships:
    order_products.product_id -> products.product_id
    order_products.order_id -> orders.order_id
    orders.customer_id -> customers.customer_id
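
To make the normalization step concrete, here is a pandas-only sketch of the idea on made-up data. The `normalize` helper is hypothetical and is not part of `featurelabs`; it only illustrates how one table can be split into a parent table with one row per index value and a child table that keeps the index as a foreign key.

```python
import pandas as pd

# Made-up stand-in for the single retail table
logs = pd.DataFrame({
    "order_id": ["A1", "A1", "A2", "A3"],
    "product_id": ["P1", "P2", "P1", "P3"],
    "description": ["mug", "lamp", "mug", "clock"],
    "customer_id": [1.0, 1.0, 2.0, 1.0],
})


def normalize(df, index, additional_columns):
    """Hypothetical helper: split `index` and its dependent columns into a parent table."""
    # Parent: one row per unique index value, carrying the dependent columns
    parent = (df[[index] + additional_columns]
              .drop_duplicates(subset=index)
              .reset_index(drop=True))
    # Child: keeps the index column as a foreign key, drops the moved columns
    child = df.drop(columns=additional_columns)
    return parent, child


products, order_products = normalize(logs, "product_id", ["description"])
orders, order_products = normalize(order_products, "order_id", ["customer_id"])

print(products)
print(orders)
print(order_products)
```

The real `normalize_entity` also records the relationship between the two entities and can set a time index, which this sketch leaves out.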

In [4]:
# Select the column before aggregating so only 'price' is summed
utils.show(utils.static_histogram(data.groupby('customer_id')['price'].sum(),
                                  col_max=1500, n_bins=100),
           title='Total customer spending histogram', height=400)

In [5]:
# Number of order lines per product
utils.show(utils.static_histogram(data.groupby('product_id')['order_id'].count(),
                                  col_max=400, n_bins=100),
           title='Histogram of product orders', height=400)

# Step 2: Making predictions

The next step is to decide what we want to predict. *We can use the same EntitySet to make different predictions*. For example, we might be interested in predicting how much a customer will spend in the future, or how many of each product we'll sell. The code below builds the first of these: for every customer it records the time of their first order and, as the label, everything they spend after that time.

In [6]:
# Time of each customer's first order
first_order = data.groupby('customer_id')['order_date'].min().reset_index()


def spending_after_first_order(df):
    # Total spend minus whatever was spent at the first order's timestamp
    is_first = df['order_date'] == df['order_date'].min()
    return df['price'].sum() - df.loc[is_first, 'price'].sum()


future_spending = (data.groupby('customer_id')
                       .apply(spending_after_first_order)
                       .reset_index()
                       .rename(columns={0: 'future spending'}))
customer_predictions = first_order.merge(future_spending)
customer_predictions.to_csv('data/prediction_problems/future customer spending.csv')
utils.column_report(customer_predictions)


+-----------------------+
|  Time Column Summary  |
+-----------------------+

## order_date ##
Last Time: 2011-12-09 12:16:00
First Time: 2010-12-01 08:26:00

+--------------------------+
|  Numeric Column Summary  |
+--------------------------+

## customer_id ##
Maximum: 18287.0, Minimum: 12346.0, Mean: 15299.68
Quartile 3: 16778.25 | Median: 15300.50| Quartile 1: 13812.75

## future spending ##
Maximum: 41354.12, Minimum: 0.0, Mean: 247.30
Quartile 3: 210.01 | Median: 52.80| Quartile 1: 0.00
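
To see what the "future spending" label means on a small scale, here is a sketch on made-up data. It computes the same quantity by a slightly different route: it sums the rows dated after each customer's first order rather than subtracting the first order from the total. All values are illustrative.

```python
import pandas as pd

# Made-up transactions: customer 1 orders twice, customer 2 only once
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "order_date": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2011-02-01", "2011-03-01"]),
    "price": [5.0, 2.5, 10.0, 4.0],
})

# Broadcast each customer's first order time onto every one of their rows
first_order = toy.groupby("customer_id")["order_date"].transform("min")

# Keep only rows that happen strictly after that first order
after_first = toy[toy["order_date"] > first_order]

# Sum what is left; customers with no later orders get a label of 0
labels = (after_first.groupby("customer_id")["price"].sum()
          .reindex(toy["customer_id"].unique(), fill_value=0.0)
          .rename("future spending"))

print(labels)
```

Customer 1 gets a label of 10.0 (the February order) and customer 2 gets 0.0, which is why a first quartile of 0.00 is plausible in the real data: customers who ordered only once have nothing left to predict.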


In [10]:
def multiclass_maker(val):
    """Map a spending amount to one of four labelled ranges."""
    if val < 10:
        return 'Less than 10'
    if val < 100:
        return 'Between 10 and 100'
    if val < 1000:
        return 'Between 100 and 1000'
    return 'More than 1000'


customer_multiclass = customer_predictions.copy()
customer_multiclass['future spending'] = customer_multiclass['future spending'].apply(multiclass_maker)

customer_multiclass.to_csv('data/prediction_problems/predict customer spending multiclass.csv')
utils.show(utils.static_piechart(customer_multiclass['future spending']))

Unnamed: 0,customer_id,order_date,future spending
0,12346.0,2011-01-18 10:01:00,Less than 10
1,12347.0,2010-12-07 14:57:00,Between 100 and 1000
2,12348.0,2010-12-16 19:09:00,Between 100 and 1000
3,12349.0,2011-11-21 09:51:00,Less than 10
4,12350.0,2011-02-02 16:01:00,Less than 10
5,12352.0,2011-02-16 12:33:00,More than 1000
6,12353.0,2011-05-19 17:47:00,Less than 10
7,12354.0,2011-04-21 13:11:00,Less than 10
8,12355.0,2011-05-09 13:49:00,Less than 10
9,12356.0,2011-01-18 09:50:00,Between 10 and 100
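
As an aside, the same binning can be written with `pd.cut` instead of a hand-written function. This is an alternative sketch, not what the notebook ran; the sample values are made up to exercise the bin edges.

```python
import pandas as pd

# Made-up spending values, chosen to sit on and around the bin edges
spending = pd.Series([0.0, 9.99, 10.0, 99.99, 100.0, 1000.0, 41354.12])

bins = [float("-inf"), 10, 100, 1000, float("inf")]
labels = ['Less than 10', 'Between 10 and 100',
          'Between 100 and 1000', 'More than 1000']

# right=False makes each bin include its left edge and exclude its right edge,
# matching the strict `<` comparisons in multiclass_maker
spending_class = pd.cut(spending, bins=bins, labels=labels, right=False).astype(str)

print(spending_class.tolist())
```

One difference to keep in mind: a missing value fails every comparison in `multiclass_maker` and ends up as 'More than 1000', whereas `pd.cut` leaves it missing.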


# Step 3: Connecting to Tempo

Here, we'll upload the data directly to the webapp. Be warned: overwriting can destroy any work you've already done in the app with the retail dataset. Publishing an EntitySet that already exists fails with the `ValueError` shown at the end of this section. If you want to overwrite the existing EntitySet and Projects, uncomment the `client.unpublish_entityset(es)` line.

In [8]:
import featurelabs as fl
client = fl.Client()

# client.unpublish_entityset(es)
client.publish_entityset(es)
client.publish_project(project_name='Predict Future Spending',
                       label_times=customer_predictions,
                       entityset_id=es.id,
                       entity_id='customers',
                       label_type='regression',
                       description='For every customer, predict how much they '
                                   'will spend for the remainder of the dataset. '
                                   'This problem is trying to predict how good '
                                   'a certain customer will be.')

client.publish_project(project_name='Predict Future Spending (multiclass)',
                       label_times=customer_multiclass,
                       entityset_id=es.id,
                       entity_id='customers',
                       label_type='multiclass',
                       description='For every customer, predict whether they will spend '
                                   'less than 10, between 10 and 100, between 100 and 1000 '
                                   'or more than 1000. This is an easier prediction problem '
                                   'than regression, but still has all of the important information.')

ValueError: Duplicate EntitySet, try deleting EntitySet or changing EntitySet name
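
A note on the `description` arguments in the cell above: they are built from adjacent string literals, which Python joins with no separator, so every fragment except the last needs its own trailing space. A minimal sketch of the behavior:

```python
# Adjacent string literals are concatenated with nothing in between
without_space = ('predict how much they'
                 'will spend')
with_space = ('predict how much they '
              'will spend')

print(without_space)  # predict how much theywill spend
print(with_space)     # predict how much they will spend
```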