# Final Class Assignment

# Introduction 
* brief summary
    * problem 
    * motivation 

# Outline 
1. Problem Formalization - Translate business problem into machine learning task 
    * Mathematical formalization of learning problem 
    * Definition of target variable
    * Supervised/unsupervised ML? 
    * Post-process model predictions to dervie final results?
2. Selection of ML algorithms 
3. Define how to construct training and validation set
    * Nature of prediciton task? (Time-series, etc.)
    * How to evaluate model to ensure it generalizes well to unseen data? 
    * How to tune hyperparameters of train models? 
4. Explain Feature selection 
    * Why is that feature necessary?
    * Can you link features to existing theory? 
    * What information will the features capture? 
    * Bakcing up claims with EDA? 
    * How to encode data patterns in the model´s architecture?  
5. Description of baselines
    * Implementation of baselines
    * Why are proposed baselines relevant? 
    * What insights can you dervie form the baselines´results?
6. Results
    * key results and supplementary findings 
    * results in line w/ expectations? 
    * Areas where appraoch does well/fail? 
    * Explain which features/architectural choices contribute to good performance? 
    * What data is most useful?
    * Discussion of findings   
7. Concusion and next steps
    * Summary of approach and key results
    * Recommendations for marketing applications?
    * limitations of approach 
    * ideas for next steps?

# THE CURRENT STATE OF THE ART
Today, consumers are used to personalized experiences through interactions with brands like Amazon, eBay or Alibaba. But the majority of brick-and-mortar retailers, whether in grocery, drugstore, DIY, electronics or fashion, still rely on generic mass promotions and omnichannel offers. For customers, most of the products advertised are not relevant. Our research in Germany has shown that, on average, customers buy only
3 out of 100 advertised items in grocery stores, meaning 97% are not relevant or are not discovered.
Mass advertising can also be costly. A study by Nielsen1 shows that the majority of discounted sales would have occurred anyway. As a result, 59%
of mass promotions in food retail worldwide do not cover their costs. Because of this, many retailers end up fighting for consumers with daily low prices and mass promotions at a level where profits are barely being made.
Retail is changing and so should the approach to promotions. Retailers need a system that attracts new customers and wins their loyalty while increasing cart size and optimizing discounts toward higher profits.
This is where AI (artificial intelligence) comes into play. Modern machine learning algorithms are able to leverage the incredible amount of customer data retailers have. They can detect even the smallest clues in customer behavior and recommend the right product at the right time at the right price. All of this is calculated on demand in seconds and individually for each customer.
Our real-world testing with some of the largest German and U.S. grocers has shown that our tested AI, the SO1 Engine, is able to increase redemption rates many times over while increasing loyalty club member revenue by 15.8% and profit by 36.4%.
In this post, we take a look at 7 evidence-backed business cases. The purpose is to inspire retailers on how these technologies can help grow their business.

In [2]:
import os
import tqdm
import warnings
import functools

import numpy as np
import pandas as pd

import seaborn as sns
import sklearn.preprocessing
import sklearn.neighbors
import sklearn.model_selection
import sklearn.metrics

import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
pd.options.mode.chained_assignment = None  # default='warn'

import pyarrow

In [4]:
os.getcwd()

'/Users/asmir/mlim_project/mlim/exercises/ex4'

In [114]:
%cd ../assignment/
os.getcwd()

/Users/asmir/mlim_project/mlim/exercises/assignment


'/Users/asmir/mlim_project/mlim/exercises/assignment'

In [183]:
shop = pd.read_parquet("data_s2000.parquet") 
shop.head()

Unnamed: 0,week,shopper,product,price,discount,discount_offered,product_bought,purchase_w/o_dis,no_purchase_w_dis,discount_effect,...,no_shopping_events,week_basket_size,week_basket_value,week_customer_product_sales,mean_customer_product_sales,mean_basket_size,mean_basket_value,ave_offered_dis_week,ave_used_dis_week,category_label
0,0,0,71,629.0,0.0,0,1,1,0,0,...,90,10.0,5908.0,1.0,0.0869,8.992208,5273.981818,0.019338,1.090516,23
1,0,0,91,605.0,0.0,0,1,1,0,0,...,90,10.0,5908.0,1.0,0.094682,8.992208,5273.981818,0.019338,1.090516,0
2,0,0,116,715.0,0.0,0,1,1,0,0,...,90,10.0,5908.0,1.0,0.038911,8.992208,5273.981818,0.019338,1.090516,19
3,0,0,123,483.0,0.0,0,1,1,0,0,...,90,10.0,5908.0,1.0,0.055772,8.992208,5273.981818,0.019338,1.090516,18
4,0,0,157,592.0,0.0,0,1,1,0,0,...,90,10.0,5908.0,1.0,0.038911,8.992208,5273.981818,0.019338,1.090516,7


In [184]:
shop = shop[["week","shopper","product","price", "discount", "discount_offered", "product_bought",
            "purchase_w/o_dis", "no_purchase_w_dis", "discount_effect", "max_price"]]

In [185]:
#discount effect --> either neutral (if shopper would have bought the item anyways) or positive 
shop["discount_effect"] = np.where(((shop.discount_offered == 1) & (shop.product_bought == 1)), 1, 0)
shop[(shop.discount_effect == 1) & (shop.week != 0)].head()

Unnamed: 0,week,shopper,product,price,discount,discount_offered,product_bought,purchase_w/o_dis,no_purchase_w_dis,discount_effect,max_price
15305,1,5,202,326.0,35.0,1,1,0,0,1,502.0
15323,1,7,114,405.0,30.0,1,1,0,0,1,579.0
15325,1,7,188,316.0,40.0,1,1,0,0,1,527.0
15341,1,9,120,415.0,35.0,1,1,0,0,1,639.0
15346,1,9,225,481.0,20.0,1,1,0,0,1,602.0


In [189]:
shop

Unnamed: 0,week,shopper,product,price,discount,discount_offered,product_bought,purchase_w/o_dis,no_purchase_w_dis,discount_effect,...,week_basket_size,week_basket_value,mean_basket_size,mean_basket_value,avg_offered_dis_week,avg_used_dis_week,customer_mean_product_price,customer_discount_buy_share,product_dis_sells_share,customer_product_sales_share
0,0,0,71,629.0,0.0,0,1,1,0,0,...,10,5908.0,8.992208,5273.981818,0.000000,0.000000,585.901427,0.998703,1.0,0.086900
1,0,0,91,605.0,0.0,0,1,1,0,0,...,10,5908.0,8.992208,5273.981818,0.000000,0.000000,585.901427,0.998703,1.0,0.094682
2,0,0,116,715.0,0.0,0,1,1,0,0,...,10,5908.0,8.992208,5273.981818,0.000000,0.000000,585.901427,0.998703,1.0,0.038911
3,0,0,123,483.0,0.0,0,1,1,0,0,...,10,5908.0,8.992208,5273.981818,0.000000,0.000000,585.901427,0.998703,1.0,0.055772
4,0,0,157,592.0,0.0,0,1,1,0,0,...,10,5908.0,8.992208,5273.981818,0.000000,0.000000,585.901427,0.998703,1.0,0.038911
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1378715,89,1999,144,562.0,0.0,0,1,1,0,0,...,13,7515.0,9.322904,5336.280350,0.769231,0.769231,571.258750,0.998750,1.0,0.063750
1378716,89,1999,158,566.0,0.0,0,1,1,0,0,...,13,7515.0,9.322904,5336.280350,0.769231,0.769231,571.258750,0.998750,1.0,0.065000
1378717,89,1999,192,549.0,0.0,0,1,1,0,0,...,13,7515.0,9.322904,5336.280350,0.769231,0.769231,571.258750,0.998750,1.0,0.091250
1378718,89,1999,213,592.0,0.0,0,1,1,0,0,...,13,7515.0,9.322904,5336.280350,0.769231,0.769231,571.258750,0.998750,1.0,0.021250


In [187]:
# CUSTOMER DIMENSION 
# no_products_bought: number products bought by a customer i
tmp = shop[shop.product_bought == 1].groupby(['shopper'])['product'].agg('count').reset_index().rename(columns={"product":"no_products_bought"})
shop = pd.merge(shop, tmp, on="shopper", how="left")

# spend: Customer Lifetime Value (sum € spend by a customer i
tmp = shop[shop.product_bought == 1].groupby(['shopper'])['price'].agg('sum').reset_index().rename(columns={"price":"spend"})
shop = pd.merge(shop, tmp, on="shopper", how="left")

# customer_mean_product_price: average price of an item bought by a customer i  
#shop['customer_mean_product_price'] = shop['spend']/(shop['no_products_bought']+1)

# no_unique_products: number unique products bought by customer i 
tmp = shop[shop.product_bought == 1].groupby(['shopper'])['product'].agg('nunique').reset_index().rename(columns={"product":"no_unique_products"})
shop = pd.merge(shop, tmp, on="shopper", how="left")

# discount_buys: number products bought at discount by a customer i 
tmp = shop[shop.product_bought == 1].groupby(['shopper'])['discount'].agg('count').reset_index().rename(columns={"discount":"discount_buys"})
shop = pd.merge(shop, tmp, on="shopper", how="left")

# customer_discount_buy_share: the percentage of products bought at discount by customer i 
#shop['customer_discount_buy_share'] = shop['discount_buys']/(shop['no_products_bought']+1)

#------------------

# PRODUCT DIMENSION
# product_sells: number of times the product was sold 
tmp = shop[shop.product_bought == 1].groupby(['product'])['price'].agg('count').reset_index().rename(columns={"price":"product_sells"})
shop = pd.merge(shop, tmp, on="product", how="left")

# product_dis_sells
tmp = shop[shop.product_bought == 1].groupby(['product'])['discount_offered'].agg('count').reset_index().rename(columns={"discount_offered":"product_dis_sells"})
shop = pd.merge(shop, tmp, on="product", how="left")

# product_dis_sells_share
#shop['product_dis_sells_share'] = shop['product_dis_sells']/shop['product_sells']

#-----------------------------

# CUSTOMER X PRODUCT DIMENSION
# customer_product_sales: no product j sales for customer i 
tmp = shop[shop.product_bought == 1].groupby(['shopper','product'])['price'].agg('count').reset_index().rename(columns={"price":"customer_product_sales"})
shop = pd.merge(shop, tmp, on=['shopper','product'], how="left")

# customer_product_sales_share: share of product j of all product buys of customer i 
#shop['customer_product_sales_share'] = shop.customer_product_sales/(shop.no_products_bought+1)

# customer_prod_dis_buys: number buys of a product j at discount by customer i 
tmp = shop.groupby(["shopper","product"])["discount_effect"].agg("count").reset_index().rename(columns={"discount_effect":"customer_prod_dis_buys"})
shop = pd.merge(shop, tmp, on=["shopper","product"], how="left")

# customer_prod_bought_dis_share: share a product j is bought at a discount by customer i 
tmp = shop.groupby(["shopper","product"])["discount_effect"].agg("mean").reset_index().rename(columns={"discount_effect":"customer_prod_bought_dis_share"})
shop = pd.merge(shop, tmp, on=["shopper","product"], how="left")

# customer_prod_dis_offers: number discount offers of a product j for customer i 
tmp = shop.groupby(["shopper","product"])["discount_offered"].agg("count").reset_index().rename(columns={"discount_offered":"customer_prod_dis_offers"})
shop = pd.merge(shop, tmp, on=["shopper","product"], how="left")

# customer_product_dis_offered_share: share product j was offered at discount by customer i 
tmp = shop.groupby(['shopper','product'])['discount_offered'].agg('mean').reset_index().rename(columns={"discount_offered":"customer_prod_dis_offer_share"})
shop = pd.merge(shop, tmp, on=['shopper','product'], how="left")

#--------------------------

# WEEK X CUSTOMER DIMENSION
# week_basket_size: number products bought by a customer i in week t 
tmp = shop[shop.product_bought == 1].groupby(['week','shopper'])['product'].agg('count').reset_index().rename(columns={"product":"week_basket_size"})
shop = pd.merge(shop, tmp, on=['week','shopper'], how="left")

# week_basket_value: sum products in € by a customer i in week t 
tmp = shop[shop.product_bought == 1].groupby(['week','shopper'])['price'].agg('sum').reset_index().rename(columns={"price":"week_basket_value"})
shop = pd.merge(shop, tmp, on=['week','shopper'], how="left")

#-------------------

# CUSTOMER DIMENSION
# mean_basket_size: the average basket size of customer i 
tmp = shop[shop.product_bought == 1].groupby(['shopper'])['week_basket_size'].agg('mean').reset_index().rename(columns={"week_basket_size":"mean_basket_size"})
shop = pd.merge(shop, tmp, on="shopper", how="left")

# mean_basket_value: the average basket value in € of customer i 
tmp = shop[shop.product_bought == 1].groupby(['shopper'])['week_basket_value'].agg('mean').reset_index().rename(columns={"week_basket_value":"mean_basket_value"})
shop = pd.merge(shop, tmp, on="shopper", how="left")

#--------------------------

# WEEK X CUSTOMER DIMENSION
# avg_offered_dis_week: average offered discount per week or rather average coupon value per shopper
tmp = shop.groupby(["week","shopper"])["discount"].agg("mean").reset_index().rename(columns={"discount":"avg_offered_dis_week"})
shop = pd.merge(shop, tmp, on=["week","shopper"], how="left")

# avg_used_dis_week: average used discount per week
tmp = shop[shop.product_bought == 1].groupby(["week","shopper"])["discount"].agg("mean").reset_index().rename(columns={"discount":"avg_used_dis_week"})
shop = pd.merge(shop, tmp, on=["week","shopper"], how="left")

In [188]:
# customer_mean_product_price: average price of an item bought by a customer i  
shop['customer_mean_product_price'] = shop['spend']/(shop['no_products_bought']+1)
# customer_discount_buy_share: the percentage of products bought at discount by customer i 
shop['customer_discount_buy_share'] = shop['discount_buys']/(shop['no_products_bought']+1)
# product_dis_sells_share
shop['product_dis_sells_share'] = shop['product_dis_sells']/shop['product_sells']
# customer_product_sales_share: share of product j of all product buys of customer i 
shop['customer_product_sales_share'] = shop.customer_product_sales/(shop.no_products_bought+1)

In [190]:
# save master DataFrame as parquet
shop.to_parquet("final_data.parquet")

In [115]:
baskets = pd.read_parquet("baskets.parquet")
baskets.head() # 68.8 mm

Unnamed: 0,week,shopper,product,price
0,0,0,71,629
1,0,0,91,605
2,0,0,116,715
3,0,0,123,483
4,0,0,157,592


In [120]:
baskets = baskets[baskets.shopper <2000]

* 90 weeks (0-89)
* 100k shoppers (0-99999)
* 250 products (0-249)
* Price: 234-837 cent

In [116]:
coupons = pd.read_parquet("coupons.parquet")

In [117]:
coupons.head() # 45 mm

Unnamed: 0,week,shopper,product,discount
0,0,0,35,35
1,0,0,193,40
2,0,0,27,30
3,0,0,177,35
4,0,0,5,30


In [124]:
coupons = coupons[coupons.shopper <2000]

In [127]:
coupons.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 900000 entries, 0 to 44509999
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   week      900000 non-null  int64
 1   shopper   900000 non-null  int64
 2   product   900000 non-null  int64
 3   discount  900000 non-null  int64
dtypes: int64(4)
memory usage: 34.3 MB


* 90 weeks (0-89)
* 100k shoppers (0-99999)
* 250 products (0-249)
* 45mm discounts (0-40%)

In [118]:
idx = pd.read_parquet("coupon_index.parquet")

In [119]:
idx.head() # 10k -> 2000 customers x 5 produts for promotion 

Unnamed: 0,week,shopper,coupon
0,90,0,0
2000,90,0,1
4000,90,0,2
6000,90,0,3
8000,90,0,4


* Data for week 90 only
* 2k shoppers (ID 0-1999)
* 5 coupons: 0(0%), 1(15%), 2(20%), 3(25%), 4(30%)

In [136]:
shop = np.array([(x, y, z) for x in range(90) for y in range(2000) for z in range(250)])
shop = pd.DataFrame(t1)
shop.rename(columns={0:'week', 1:'shopper', 2:'product'}, inplace=True)

Unnamed: 0,week,shopper,product
0,0,0,0
1,0,0,1
2,0,0,2
3,0,0,3
4,0,0,4
...,...,...,...
44999995,89,1999,245
44999996,89,1999,246
44999997,89,1999,247
44999998,89,1999,248


In [143]:
#Merging coupon and basket data 
shop = pd.merge(shop, baskets, on=['week','shopper','product'], how='outer')
shop = pd.merge(shop, coupons, on=['week','shopper','product'], how='outer')
shop.head()

Unnamed: 0,week,shopper,product,price,discount
0,0,0,0,,
1,0,0,1,,
2,0,0,2,,
3,0,0,3,,
4,0,0,4,,


In [144]:
shop['discount'] = shop['discount'].replace(np.nan, 0)

In [145]:
#discount offered to the shopper
shop["discount_offered"] = np.where(shop.discount != 0, 1, 0)
#product purchased
shop["product_bought"] = np.where(shop.price.isna(), 0, 1)
#purchase without having a discount
shop["purchase_w/o_dis"] = np.where(((shop.product_bought == 1) & (shop.discount_offered == 0)), 1, 0)
#no purchase even though a discount was offered
shop["no_purchase_w_dis"] = np.where(((shop.product_bought == 0) & (shop.discount_offered == 1)), 1, 0)

In [148]:
#discount effect --> either neutral (if shopper would have bought the item anyways) or positive 
shop["discount_effect"] = np.where(((shop.discount_offered == 1) & (shop.product_bought == 1)), 1, 0)
shop[(shop.discount_effect == 1) & (shop.week != 0)].head()

Unnamed: 0,week,shopper,product,price,discount,discount_offered,product_bought,purchase_w/o_dis,no_purchase_w_dis,discount_effect
501452,1,5,202,326.0,35.0,1,1,0,0,1
501864,1,7,114,405.0,30.0,1,1,0,0,1
501938,1,7,188,316.0,40.0,1,1,0,0,1
502370,1,9,120,415.0,35.0,1,1,0,0,1
502475,1,9,225,481.0,20.0,1,1,0,0,1


In [149]:
#maximal price of product
max_price = shop.groupby("product")["price"].agg(max).reset_index()
max_price = max_price.rename(columns = {"price": "max_price"})

In [150]:
#merge max price to the shop DataFrame
shop = pd.merge(shop, max_price, on = "product", how = "left")

#impute missing prices (of the coupon data set) by the max price minues the offered discount (because this was the 
#price the shoppers was offered)
shop["price"] = np.where(shop.price.isna(), shop.max_price*(100-shop.discount)*0.01, shop.price)
shop

Unnamed: 0,week,shopper,product,price,discount,discount_offered,product_bought,purchase_w/o_dis,no_purchase_w_dis,discount_effect,max_price
0,0,0,0,688.0,0.0,0,0,0,0,0,688.0
1,0,0,1,560.0,0.0,0,0,0,0,0,560.0
2,0,0,2,773.0,0.0,0,0,0,0,0,773.0
3,0,0,3,722.0,0.0,0,0,0,0,0,722.0
4,0,0,4,620.0,0.0,0,0,0,0,0,620.0
...,...,...,...,...,...,...,...,...,...,...,...
44999995,89,1999,245,549.0,0.0,0,1,1,0,0,549.0
44999996,89,1999,246,491.4,30.0,1,0,0,1,0,702.0
44999997,89,1999,247,670.0,0.0,0,0,0,0,0,670.0
44999998,89,1999,248,490.0,0.0,0,0,0,0,0,490.0


In [161]:
shop[shop['product'] == 238]

Unnamed: 0,week,shopper,product,price,discount,discount_offered,product_bought,purchase_w/o_dis,no_purchase_w_dis,discount_effect,max_price
238,0,0,238,390.0,0.0,0,0,0,0,0,390.0
488,0,1,238,390.0,0.0,0,0,0,0,0,390.0
738,0,2,238,390.0,0.0,0,0,0,0,0,390.0
988,0,3,238,390.0,0.0,0,0,0,0,0,390.0
1238,0,4,238,390.0,0.0,0,0,0,0,0,390.0
...,...,...,...,...,...,...,...,...,...,...,...
44998988,89,1995,238,390.0,0.0,0,0,0,0,0,390.0
44999238,89,1996,238,390.0,0.0,0,0,0,0,0,390.0
44999488,89,1997,238,390.0,0.0,0,0,0,0,0,390.0
44999738,89,1998,238,390.0,0.0,0,0,0,0,0,390.0


In [None]:
# CUSTOMER DIMENSION 
# no_products_bought: number products bought by a customer i
tmp = shop[shop.product_bought == 1].groupby(['shopper'])['product'].agg('count').reset_index().rename(columns={"product":"no_products_bought"})
shop = pd.merge(shop, tmp, on="shopper", how="left")

# cltv: Customer Lifetime Value (sum € spend by a customer i
tmp = shop[shop.product_bought == 1].groupby(['shopper'])['price'].agg('sum').reset_index().rename(columns={"price":"cltv"})
shop = pd.merge(shop, tmp, on="shopper", how="left")

# customer_mean_product_price: average price of an item bought by a customer i  
shop['customer_mean_product_price'] = shop.cltv/shop.no_products_bought

# mean_basket_size: the average basket size of customer i 
tmp = shop[shop.product_bought == 1].groupby(['shopper'])['week_basket_size'].agg('mean').reset_index().rename(columns={"week_basket_size":"mean_basket_size"})
shop = pd.merge(shop, tmp, on="shopper", how="left")

# mean_basket_value: the average basket value in € of customer i 
tmp = shop[shop.product_bought == 1].groupby(['shopper'])['week_basket_value'].agg('mean').reset_index().rename(columns={"week_basket_value":"mean_basket_value"})
shop = pd.merge(shop, tmp, on="shopper", how="left")

# no_unique_products: number unique products bought by customer i 
tmp = shop[shop.product_bought == 1].groupby(['shopper'])['product'].agg('nunique').reset_index().rename(columns={"product":"no_unique_products"})
shop = pd.merge(shop, tmp, on="shopper", how="left")

# discount_buys: number products bought at discount by a customer i 
tmp = shop[shop.product_bought == 1].groupby(['shopper']['discount'].agg('count').reset_index().rename(columns={"discount":"discount_buys"}))
shop = pd.merge(shop, tmp, on="shopper", how="left")

# customer_discount_buy_share: the percentage of products bought at discount by customer i 
shop['customer_discount_buy_share'] = shop.discount_buys/shop.no_products_bought

#------------------

# PRODUCT DIMENSION
# product_sells: number of times the product was sold 
tmp = shop[shop.product_bought == 1].groupby(['product'])['price'].agg('count').reset_index().rename(columns={"price":"product_sells"})
shop = pd.merge(shop, tmp, on="shopper", how="left")
# mean_product_discounted

# mean_

# CUSTOMER X PRODUCT DIMENSION
# prod_bought_dis_share: share a product j is bought at a discount d by customer i 
tmp = shop.groupby(["shopper","product"])["discount_effect"].agg("mean").reset_index().rename(columns={"discount":"prod_bought_dis_share"})
shop = pd.merge(shop, tmp, on=["shopper","product"], how="left")

# customer_product_discounted_share: share product j was bought at discount by customer i 
tmp = shop[shop.product_bought == 1].groupby(['shopper','product'])['discount_offered'].agg('mean').reset_index().rename(columns={"discount_offered":"customer_product_discounted_share"})
shop = pd.merge(shop, tmp, on=['shopper','product'], how="left")

# customer_product_sales: no product j was bought by customer i 
tmp = shop[shop.product_bought == 1].groupby(['shopper','product'])['price'].agg('count').reset_index().rename(columns={"price":"customer_product_sales"})
shop = pd.merge(shop, tmp, on=['shopper','product'], how="left")

# mean_customer_product_sales: average 
shop['mean_customer_product_sales'] = shop.customer_product_sales/(shop.no_products_bought+1)

# WEEK X CUSTOMER DIMENSION
# week_basket_size: number products bought by a customer i in week t 
tmp = shop[shop.product_bought == 1].groupby(['week','shopper'])['product'].agg('count').reset_index().rename(columns={"product":"week_basket_size"})
shop = pd.merge(shop, tmp, on=['week','shopper'], how="left")

# week_basket_value: sum products in € by a customer i in week t 
tmp = shop[shop.product_bought == 1].groupby(['week','shopper'])['price'].agg('sum').reset_index().rename(columns={"price":"week_basket_value"})
shop = pd.merge(shop, tmp, on=['week','shopper'], how="left")

# avg_offered_dis_week: average offered discount per week or rather average coupon value per shopper
tmp = shop.groupby(["week","shopper"])["discount"].agg("mean").reset_index().rename(columns={"discount":"avg_offered_dis_week"})
shop = pd.merge(shop, tmp, on=["week","shopper"], how="left")

# avg_used_dis_week: average used discount per week
tmp = shop[shop.product_bought == 1].groupby(["week","shopper"])["discount"].agg("mean").reset_index().rename(columns={"discount":"avg_used_dis_week"})
shop = pd.merge(shop, tmp, on=["week","shopper"], how="left")

# WEEK X CUSTOMER X PRODUCT DIMENSION
# week_customer_product_sales: no prouct j was bought by customer i in week t 
tmp = shop[shop.product_bought == 1].groupby(['week', 'shopper', 'product'])['price'].agg('count').reset_index().rename(columns={"price":"week_customer_product_sales"})
shop = pd.merge(shop, tmp, on=['week', 'shopper', 'product'], how="left")

In [None]:
#clearing memory 
del baskets
del coupons
del data

# Heuristic Model 

#### Logic of the Heuristic Model 
**Assumption**: 
* (1) A customer will continue buying items with a coupon, if these items were only or mostly bought with a coupon previously. 
* (2) Items which are bought only once (with a coupon) will not be bought again.

1. Reserve week 89 data for the test set
2. Calculate the rate an item is bought with a coupon for each customer --> CHECK
3. Filter out items which are bought less often than twice with a coupon --> CHECK
4. Filter out items which are bought less often than the average item for that user --> CHECK
5. Order the items descending by coupon redemption rate
6. Return the top 5 products 
7. Assign a coupon to these products for the week t+1, here week 90

In [12]:
#Merging coupon data to basket data
bc = pd.merge(baskets, coupons, on=['week','shopper','product'], how='left')
bc.head()

Unnamed: 0,week,shopper,product,price,discount
0,0,0,71,629,
1,0,0,91,605,
2,0,0,116,715,
3,0,0,123,483,
4,0,0,157,592,


In [13]:
bc['discount'] = bc['discount'].replace(np.nan, 0)

In [24]:
bc['discounted'] = np.where(bc['discount']>0,1,0)

In [25]:
bc['dis_prod_share'] = bc.groupby(['shopper','product'])['discounted'].transform('mean')

In [30]:
bc['product_discount_buys'] = bc.groupby(['shopper', 'product'])['discounted'].transform('sum')

In [32]:
bc['product_buys'] = bc.groupby(['shopper', 'product'])['price'].transform('count')

In [34]:
bc['buys'] = bc.groupby(['shopper'])['price'].transform('count')

In [36]:
bc['unique_products'] = bc.groupby(['shopper'])['product'].transform('nunique')

In [55]:
bc['discount_buys'] = bc.groupby(['shopper'])['discounted'].transform('mean')

In [56]:
bc.head(n=10)

Unnamed: 0,week,shopper,product,price,discount,dis_prod_share,discounted,product_discount_buys,product_buys,buys,unique_products,discount_buys
0,0,0,71,629,0.0,0.029851,0,2,67,770,54,0.032468
1,0,0,91,605,0.0,0.041096,0,3,73,770,54,0.032468
2,0,0,116,715,0.0,0.033333,0,1,30,770,54,0.032468
3,0,0,123,483,0.0,0.0,0,0,43,770,54,0.032468
4,0,0,157,592,0.0,0.0,0,0,30,770,54,0.032468
5,0,0,167,582,0.0,0.0,0,0,32,770,54,0.032468
6,0,0,171,639,0.0,0.0,0,0,29,770,54,0.032468
7,0,0,184,651,0.0,0.019231,0,1,52,770,54,0.032468
8,0,0,207,410,0.0,0.166667,0,1,6,770,54,0.032468
9,0,0,225,602,0.0,0.029412,0,2,68,770,54,0.032468


In [58]:
bc_filtered = bc[(bc['product_discount_buys']>1) & (bc['product_buys']>(bc['buys']/bc['unique_products']))]

In [59]:
bc_filtered.head(n=10)

Unnamed: 0,week,shopper,product,price,discount,dis_prod_share,discounted,product_discount_buys,product_buys,buys,unique_products,discount_buys
0,0,0,71,629,0.0,0.029851,0,2,67,770,54,0.032468
1,0,0,91,605,0.0,0.041096,0,3,73,770,54,0.032468
9,0,0,225,602,0.0,0.029412,0,2,68,770,54,0.032468
10,0,1,22,528,0.0,0.0625,0,2,32,665,71,0.039098
26,0,3,98,481,0.0,0.076923,0,2,26,752,91,0.037234
31,0,4,25,540,0.0,0.038961,0,3,77,558,34,0.032258
33,0,4,156,575,0.0,0.055556,0,2,36,558,34,0.032258
43,0,6,109,667,0.0,0.058824,0,4,68,677,73,0.044313
49,0,7,61,739,10.0,0.055556,1,2,36,861,60,0.04065
50,0,7,79,736,0.0,0.033898,0,2,59,861,60,0.04065


In [77]:
bc_filtered_grouped = bc_filtered.groupby(['shopper', 'product']).mean().reset_index(drop=False)

In [78]:
bc_filtered_grouped.head(n=20)

Unnamed: 0,shopper,product,week,price,discount,dis_prod_share,discounted,product_discount_buys,product_buys,buys,unique_products,discount_buys
0,0,67,43.942857,624.228571,2.0,0.085714,0.085714,3.0,35.0,770.0,54.0,0.032468
1,0,71,43.716418,624.761194,0.671642,0.029851,0.029851,2.0,67.0,770.0,54.0,0.032468
2,0,87,46.818182,503.181818,3.409091,0.090909,0.090909,2.0,22.0,770.0,54.0,0.032468
3,0,91,45.315068,598.356164,1.09589,0.041096,0.041096,3.0,73.0,770.0,54.0,0.032468
4,0,202,44.25641,493.615385,1.666667,0.051282,0.051282,2.0,39.0,770.0,54.0,0.032468
5,0,225,46.338235,599.323529,0.441176,0.029412,0.029412,2.0,68.0,770.0,54.0,0.032468
6,1,1,43.461538,540.615385,3.461538,0.153846,0.153846,2.0,13.0,665.0,71.0,0.039098
7,1,21,45.388889,457.194444,1.25,0.055556,0.055556,2.0,36.0,665.0,71.0,0.039098
8,1,22,43.6875,515.625,2.34375,0.0625,0.0625,2.0,32.0,665.0,71.0,0.039098
9,1,63,42.456522,705.695652,1.847826,0.065217,0.065217,3.0,46.0,665.0,71.0,0.039098


In [None]:
#df[["order_id","user_id","order_number"]].sort_values(by=["order_number"]).groupby(by="user_id").agg({'order_id':lambda x: list(x)}).reset_index(drop=False)

In [82]:
topk = 5
heuristic_df = bc_filtered_grouped.groupby(['shopper']).apply(lambda x: x.nlargest(topk,['dis_prod_share'])).reset_index(drop=True)

In [83]:
heuristic_df

Unnamed: 0,shopper,product,week,price,discount,dis_prod_share,discounted,product_discount_buys,product_buys,buys,unique_products,discount_buys
0,0,87,46.818182,503.181818,3.409091,0.090909,0.090909,2.0,22.0,770.0,54.0,0.032468
1,0,67,43.942857,624.228571,2.000000,0.085714,0.085714,3.0,35.0,770.0,54.0,0.032468
2,0,202,44.256410,493.615385,1.666667,0.051282,0.051282,2.0,39.0,770.0,54.0,0.032468
3,0,91,45.315068,598.356164,1.095890,0.041096,0.041096,3.0,73.0,770.0,54.0,0.032468
4,0,71,43.716418,624.761194,0.671642,0.029851,0.029851,2.0,67.0,770.0,54.0,0.032468
...,...,...,...,...,...,...,...,...,...,...,...,...
398720,99997,31,49.190476,755.952381,3.571429,0.095238,0.095238,2.0,21.0,560.0,72.0,0.032143
398721,99998,84,38.600000,468.500000,6.500000,0.200000,0.200000,2.0,10.0,621.0,65.0,0.040258
398722,99998,248,49.172414,474.793103,3.103448,0.103448,0.103448,3.0,29.0,621.0,65.0,0.040258
398723,99998,103,42.580645,535.516129,0.645161,0.064516,0.064516,2.0,31.0,621.0,65.0,0.040258


In [84]:
heuristic_df_s = heuristic_df[heuristic_df['shopper'] < 2000]

In [102]:
heuristic_df_s.head(n=15) # shopper 3 only has 4 products 

Unnamed: 0,shopper,product,week,price,discount,dis_prod_share,discounted,product_discount_buys,product_buys,buys,unique_products,discount_buys,coupon
0,0,87,46.818182,503.181818,3.409091,0.090909,0.090909,2.0,22.0,770.0,54.0,0.032468,0
1,0,67,43.942857,624.228571,2.0,0.085714,0.085714,3.0,35.0,770.0,54.0,0.032468,1
2,0,202,44.25641,493.615385,1.666667,0.051282,0.051282,2.0,39.0,770.0,54.0,0.032468,2
3,0,91,45.315068,598.356164,1.09589,0.041096,0.041096,3.0,73.0,770.0,54.0,0.032468,3
4,0,71,43.716418,624.761194,0.671642,0.029851,0.029851,2.0,67.0,770.0,54.0,0.032468,4
5,1,220,42.2,570.3,6.5,0.2,0.2,2.0,10.0,665.0,71.0,0.039098,0
6,1,1,43.461538,540.615385,3.461538,0.153846,0.153846,2.0,13.0,665.0,71.0,0.039098,1
7,1,83,42.777778,621.666667,2.407407,0.074074,0.074074,2.0,27.0,665.0,71.0,0.039098,2
8,1,63,42.456522,705.695652,1.847826,0.065217,0.065217,3.0,46.0,665.0,71.0,0.039098,3
9,1,22,43.6875,515.625,2.34375,0.0625,0.0625,2.0,32.0,665.0,71.0,0.039098,4


In [101]:
heuristic_df_s['coupon'] = heuristic_df_s.groupby("shopper").cumcount()

In [86]:
#product_dict = heuristic_df_s.set_index('shopper').to_dict()['product'] 
#heuristic_idx['product'] = heuristic_idx['shopper'].map(product_dict)

In [104]:
heuristic_idx = idx.copy()

In [106]:
heur_idx = pd.merge(heuristic_idx, heuristic_df_s[['shopper','coupon','product']], on=['shopper', 'coupon'], how='left')
heur_idx.head(n=15)

Unnamed: 0,week,shopper,coupon,product
0,90,0,0,87.0
1,90,0,1,67.0
2,90,0,2,202.0
3,90,0,3,91.0
4,90,0,4,71.0
5,90,1,0,220.0
6,90,1,1,1.0
7,90,1,2,83.0
8,90,1,3,63.0
9,90,1,4,22.0


# XGBoost

#### Proposed Features

**Customer features**

* avg_coupon_use  --> exists
* avg_coupon_value --> exists
* no_products_bought (+_per_week) --> exists
* customer_lifetime_value (€) --> spend --> exists
* avg_basekt_value --> exists
* avg_basket_size --> exists
* avg_days_between_orders --> to-do --> check if needed, when trivial --> drop
 

**Product features** 

* product_frequency (how often is item bought by all users) --> exists
* product_buying_cycle (days between purchases by all users) --> exists
* avg_product_discount_value --> exists
* avg_product_discount_use --> exists
* discount_for_product_j --> exists

**Customer_x_Product features**

* ratio_product_j_bought_w/_discount 
* days_between_purchases_product_j 
* share_product_j_basketsize_week
* share_product_j_basketvalue_week

**Lagged features**

* coupon_used_for_product_j_t-1

In [None]:
for x in i:
    if i == 10:
        X_test['discount_value'] = 0.1
        xgb_pred_10 = predict(X_test)
    elif i == 20:
        X_test['discount_value'] = 0.2
        xgb_pred_20 = predict(X_test)
    elif ....
    else ...
    
list = xgb_pred_10, 20, 30 
    
df = pd.DataFrame()

for x in list:
    df[x]=x 
    
df[y10] = xgb_pred_10
df[y20] = xgb_pred_20

--> argmax --> calculate maximimum "uplift" !!!

--> top5 products 

# Setting
## Problem: Revenue Uplift Maximization 

$D_{it}^{*} = argmax_{D=[d_{1},...,d_{J}]}\displaystyle\sum_{j=0}^{J} \ price_{j}*[(1-d_{j})*p_{itj}(D)-p_{itj}(0)]$


Motivation!
- Problem of this paper: how well will a Machine Learning algorithm perform against a highdimensional problem. A product buy as well as a coupon redemption will be influenced by a series of features along the following dimensions.: 

- **Product-features**: nature of product; some times get shopped more often than others; detergent vs. Bread (quote some stats)
- **Customer-features**: personality traits: might give info whether a customer is a Persuadable, lost cause; sure thing; sleeping dog --> quote literature  
- **Customer-Product features**: mixture of the above dimensions; preference of customer towards products: Shopping certain products more often than others. 
- **Discount-features**: has a discount being offered/+ used? Was an item Not bought although a coupon was given? Was an item bought with a discount. 
- **Time-features**: overdue products in weeks (Christopher), lagged features: indicates recent purchase behavior. 
-  
All these dimensions are informing our model in training as covariates. 


Daten an der Time Dimension entlang in Control und treated splitten. 
- Check into transformed outcome approach: 1, if treated and Converted, 0 else 
- Code Data in a way that target includes treatment indirectly 
- Train model on new y 
- Take uplift scores in (€)
- Pick highest 5 products per user measured on uplift Score 
- Give coupon 

Regression auf t~X 
* any Machine Learning algorithm
* fine tune: k fold 
t0= 10%, t1=15%,... 

Regression Y~X
* any Machine Learning algorithm 
*  fine tune: k fold 
Uplift: residuals between —> figure out 

Was hat das für einen Effekt auf revenue (Price als Target: int (continuous) ) 

—> interval abchecken: Konfidenz auf den Effekt 


- [ ] ease of use

# Pipeline

1. sample negative classes
    1.1 find a way: a, b or c
    * random (gabels negative sampler) --> get 15/20
    * Christophers preference list --> get 10/15 from here 
    
    * NEEED postives + negatives == 50/50 --> per USER per WEEK!!!
2. features: 
    * target features: (HW3) 
        * discount offered
        * discount used
        * bought
        * not-bought
    * overdue products (time) ---> Christopher merging 
    * lagged features (time) --> TBD! Arash! 
    * features (dimensions: prod, user, prodxuser, discount) 
        * Asmir mergine features into master file 
3. Model (non-parametric) --> LGBM
    * train-test-split: first 80 weeks, test: last 10 weeks/ 88 train --> 89 test   --> TBD!
    * train --> parameters ---> k-fold CV + random parameter search --> hyperparameters 
    * training_final: hyperparameters
    * predict( see above: d=0,0.1,0.2,0.3,0.4) 
    * table: "uplifts" (revenue max. problem)
    * top5 products ordered by uplift (desc) 
    * give coupon: get discount from top 5
    

In [None]:
for x in i:
    if i == 10:
        X_test['discount_value'] = 0.1
        xgb_pred_10 = predict(X_test)
    elif i == 20:
        X_test['discount_value'] = 0.2
        xgb_pred_20 = predict(X_test)
    elif ....
    else ...
    
list = xgb_pred_10, 20, 30 
    
df = pd.DataFrame()

for x in list:
    df[x]=x 
    
df[y10] = xgb_pred_10
df[y20] = xgb_pred_20

--> argmax --> calculate maximimum "uplift" !!!

--> top5 products 

In [None]:
lop = shop[["shopper","week","product"]].sort_values(by=["shopper", "week"]).groupby(by=["shopper","week"]).agg({'product':lambda x: list(x)}).reset_index(drop=False)
lop = lop.rename(columns={"product": "list_of_products"})
# remove duplicates
lop["list_of_products"] = lop["list_of_products"].apply(lambda x: list(dict.fromkeys(x)))
lop.head()

In [None]:
import torch
import torch.nn as nn

from modules.model_base import ModelBase


class Model(ModelBase):
    def __init__(self, J, T, K, L, epsilon, pretrained):

        # init
        ModelBase.__init__(self, pretrained)
        self.epsilon = epsilon
        self.J = J

        # time embedding
        self.w_conv_t = nn.Parameter(torch.FloatTensor(T, K).uniform_(0.18, 0.22))

        # product embedding
        self.w_conv_j = nn.Parameter(torch.FloatTensor(J, L).uniform_(-0.025, 0.025))
        self.w_conv_j2 = nn.Parameter(torch.FloatTensor(J, L).uniform_(-0.025, 0.025))
        self.w_conv_j_d = nn.Parameter(torch.FloatTensor(J, L).uniform_(-0.025, 0.025))
        self.w_conv_j_d2 = nn.Parameter(torch.FloatTensor(J, L).uniform_(-0.025, 0.025))
        self.w_conv_j_pf = nn.Parameter(torch.FloatTensor(J, L).uniform_(-0.025, 0.025))
        self.w_conv_j_pf2 = nn.Parameter(
            torch.FloatTensor(J, L).uniform_(-0.025, 0.025)
        )

        # frequency embedding
        self.w_pf_filter = nn.Parameter(torch.FloatTensor(K).uniform_(0.5, 0.7))

        # output weights
        self.w_out_conv_t = nn.Parameter(torch.FloatTensor(K, 1).uniform_(-0.25, 0.25))
        self.w_out_conv_j = nn.Parameter(torch.FloatTensor(K, 1).uniform_(-0.1, 0.1))
        self.w_out_discount = nn.Parameter(torch.FloatTensor(J).uniform_(0.1, 0.2))
        self.w_out_discount_cross = nn.Parameter(
            torch.FloatTensor(J).uniform_(0.1, 0.2)
        )

        # bias
        self.ff_out_b = nn.Parameter(torch.FloatTensor(1, J, 1).uniform_(-3, -2.5))

        # load pretrained weights
        if pretrained is not None:
            self.load_weights()

    def forward(self, in_pf, in_np, in_bc):

        # dimensions
        B = in_pf.shape[0]
        J = in_pf.shape[1]

        # purchase frequency
        clamp_pf = torch.clamp(in_pf, self.epsilon, 1 - self.epsilon)
        logit_pf = torch.log(clamp_pf / (1 - clamp_pf))

        # purchase frequency cross
        logit_pf_cross_in = torch.einsum("bj,jl->bl", logit_pf, self.w_conv_j_pf)
        logit_pf_cross = torch.einsum("bl,jl->bj", logit_pf_cross_in, self.w_conv_j_pf2)

        # time convolution
        bc_conv_t = torch.einsum("btj,tk->bkj", in_bc, self.w_conv_t)
        bc_conv_t = bc_conv_t + torch.einsum("bj,k->bkj", in_pf, self.w_pf_filter)
        bc_conv_t = torch.nn.functional.leaky_relu(bc_conv_t, negative_slope=0.2)

        # time convolution residual path
        logit_conv_t = torch.squeeze(
            torch.einsum("bkj,kz->bjz", bc_conv_t, self.w_out_conv_t), 2
        )

        # product convolution
        bc_conv_tj_in = torch.einsum("bkj,jl->bkl", bc_conv_t, self.w_conv_j)
        bc_conv_tj_out = torch.einsum("bkl,jl->bkj", bc_conv_tj_in, self.w_conv_j2)
        logit_conv_tj = torch.squeeze(
            torch.einsum("bkj,km->bjm", bc_conv_tj_out, self.w_out_conv_j), 2
        )

        # discount
        logit_discount = torch.mul(in_np, self.w_out_discount)

        # discount cross
        cross_price_in = torch.einsum("bj,jl->bl", in_np, self.w_conv_j_d)
        cross_price_out = torch.einsum("bl,jl->bj", cross_price_in, self.w_conv_j_d2)
        logit_discount_cross = torch.mul(cross_price_out, self.w_out_discount_cross)

        return (
            torch.squeeze(self.ff_out_b, 2)
            + logit_pf
            + logit_pf_cross
            + logit_conv_t
            + logit_conv_tj
            + logit_discount
            + logit_discount_cross
        )

# EDA

In [28]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()

%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'

## When do shoppers order again? 

* create column: days_since_prior_order

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="days_since_prior_order", data=orders_df, color=color[4])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Days since prior order', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency distribution by days since prior order", fontsize=15)
plt.show()

## How many prior orders do shoppers have? 
* create column: order_number

In [None]:
cnt_srs = orders_df.groupby("user_id")["order_number"].aggregate(np.max).reset_index()
cnt_srs = cnt_srs.order_number.value_counts()

plt.figure(figsize=(12,8))
sns.barplot(cnt_srs.index, cnt_srs.values, alpha=0.8, color=color[2])
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Maximum order number', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

## How many items do shoppers buy? 
* create column: xyz

In [None]:
grouped_df = order_products_train_df.groupby("order_id")["add_to_cart_order"].aggregate("max").reset_index()
cnt_srs = grouped_df.add_to_cart_order.value_counts()

plt.figure(figsize=(12,8))
sns.barplot(cnt_srs.index, cnt_srs.values, alpha=0.8)
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Number of products in the given order', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()

# Bestsellers

In [29]:
bs = baskets['product'].value_counts().reset_index().head(20)
bs.columns = ['product', 'frequency_count']
bs

Unnamed: 0,product,frequency_count
0,105,987902
1,76,908456
2,196,748950
3,101,706808
4,199,699511
5,198,667407
6,64,657588
7,192,652528
8,162,620873
9,197,615295


# Shopper Model (Athey, Ruiz)

## Preprocessing

In [None]:
# Step 1
# create groups of products by subcategory, week and year


In [None]:
# creating baskets
usr_baskets = total_filter.groupby(['user_id','basket_hash']).apply(lambda x: x['article_text'].unique())
usr_baskets

In [None]:
# Step 3
# randomly taking a product from the same subcategory as a product that was bought 

import random
new_rows = pd.Series([random.choice(list(set(groups[(x.year, x.week, x.subcategory_name)]).difference(set(baskets[(x.user_id, x.basket_hash)])))) for x in total_filter.itertuples() if len(list(set(groups[(x.year, x.week, x.subcategory_name)]).difference(set(baskets[(x.user_id, x.basket_hash)])))) > 0])

In [None]:
# Step 4
# fill the rows of the new products with data of the remaining columns 
# from the original product except for price
new_sample = pd.DataFrame({'basket_hash': [x.basket_hash for x in total_filter.itertuples() if len(list(set(groups[(x.year, x.week, x.subcategory_name)]).difference(set(baskets[(x.user_id, x.basket_hash)])))) > 0],
                           'article_text': new_rows,                                      
                           'user_id': [x.user_id for x in total_filter.itertuples() if len(list(set(groups[(x.year, x.week, x.subcategory_name)]).difference(set(baskets[(x.user_id, x.basket_hash)])))) > 0],
                           'week': [x.week for x in total_filter.itertuples() if len(list(set(groups[(x.year, x.week, x.subcategory_name)]).difference(set(baskets[(x.user_id, x.basket_hash)])))) > 0],
                           'year': [x.year for x in total_filter.itertuples() if len(list(set(groups[(x.year, x.week, x.subcategory_name)]).difference(set(baskets[(x.user_id, x.basket_hash)])))) > 0],
                           'category_name': [x.category_name for x in total_filter.itertuples() if len(list(set(groups[(x.year, x.week, x.subcategory_name)]).difference(set(baskets[(x.user_id, x.basket_hash)])))) > 0],                           
                           'subcategory_name': [x.subcategory_name for x in total_filter.itertuples() if len(list(set(groups[(x.year, x.week, x.subcategory_name)]).difference(set(baskets[(x.user_id, x.basket_hash)])))) > 0]})

In [None]:
def top_value_count(x):
    return x.value_counts().idxmax()

In [None]:
# Step 5
# calculate the most frequent price at which a particular product 
# was sold in a respective week and year
prices_top_freq = df_prices.groupby(['year','week', 'article_text'])['price']
prices = prices_top_freq.apply(top_value_count).reset_index()

In [None]:
# add the prices for our new products by merging with the most frequent prices 
new_sample2 = pd.merge(new_sample, prices, how = 'left', on = ['year', 'week', 'article_text'])

new_sample2['price'] = new_sample2.groupby('article_text')['price'].transform(lambda x: x.fillna(method = 'ffill'))
new_sample2['price'] = new_sample2.groupby('article_text')['price'].transform(lambda x: x.fillna(method = 'bfill'))

In [None]:
# Step 6
# products are sampled to add to data and were not bought
new_sample2['bought'] = 0


In [None]:
new_sample2 = new_sample2[['basket_hash', 'article_text', 'user_id', 'price', 'category_name','subcategory_name', 'bought', 'week', 'year']]


In [None]:
# putting bought and sampled not bought products into one dataframe
final_df = total_filter.append(new_sample2).sort_index().reset_index(drop=True)

In [None]:
# add other items from basket into seperate column as a list
final_df['other_basket_prods'] = pd.Series([list(set(baskets[(x.user_id, x.basket_hash)]).difference(x.article_text)) for x in final_df.itertuples() ])

## Model architecture

In [None]:
# creating label encoders for items, users and weeks
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(final_df['article_text'])
final_df['encoded_prods'] = le.transform(final_df['article_text'])
final_df['other_basket_prods_encoded'] = final_df['other_basket_prods'].apply(lambda x : le.transform(x))

le_user = LabelEncoder()
le_user.fit(final_df['user_id'])
final_df['encoded_user'] = le_user.transform(final_df['user_id'])

le_week = LabelEncoder()
le_week.fit(final_df['week'])
final_df['encoded_week'] = le_week.transform(final_df['week'])

In [None]:
# splitting the data into train and test
from sklearn import model_selection

X = final_df.drop(["bought", 'basket_hash', 'category_name', 'subcategory_name'], axis = 1)
Y = final_df["bought"]

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    X, Y, test_size = 0.2, random_state = 42)

In [None]:
from keras.preprocessing.sequence import pad_sequences

largest_basket = X_train['other_basket_prods_encoded'].apply(lambda x: len(x)).max()
basket_prods_train_pad = pad_sequences(X_train['other_basket_prods_encoded'], maxlen = largest_basket + 1, padding = 'post')
basket_prods_test_pad = pad_sequences(X_test['other_basket_prods_encoded'], maxlen = largest_basket + 1, padding = 'post')

basket_prods_train_pad

In [None]:
import keras
from keras.layers import Input, Embedding, Dot, Reshape, Dense, , multiply, average, add
from keras.models import Model
from keras.optimizers import Adam

In [None]:
# defining the inputs for our model user, item, price and week
embedding_size = 20 
user_len = len(le_user.classes_) + 1
item_len = len(le.classes_) + 1
week_len = len(le_week.classes_) + 1

user = Input(name = 'user', shape = (1,))
item = Input(name = 'item', shape = (1,))
price = Input(name = 'price', shape = (1,))
week = Input(name = 'week', shape = (1,))
basket = Input(name = 'basket', shape = (None,))

The Item Popularity, $\lambda_{c}$, captures the overall item popularity and will be represented in our model by the item popularity embedding that goes straight into our last add function. It has an embedding dimension of 1.

 $\theta_{ut}^T \alpha_{c}$

In [None]:
# creating the first embedding layer for item popularity with embedding size of 1
item_pop = Embedding(name = 'item_pop', 
                           input_dim = item_len, 
                           output_dim = 1)(item)

# Reshape to be a single number (shape will be (None, 1))
item_pop = Reshape(target_shape = (1, ))(item_pop)


In [None]:
# creating the embeddings for user and item 
# Embedding the user (shape will be (None, 1, embedding_size))
user_embedding = Embedding(name = 'user_embedding',
                               input_dim = user_len,
                               output_dim = embedding_size)(user)

# shared item embedding layer for items and baskets
# use mask_zero = True, since we had to pad our baskets with zeros
prod_embed_shared = Embedding(name = 'prod_embed_shared_embedding', 
                           input_dim = item_len, 
                           output_dim = embedding_size,
                           input_length = None,
                           mask_zero =True)

# Embedding the product (shape will be (None, 1, embedding_size))
item_embedding = prod_embed_shared(item)

# Merge the layers with a dot product along the second axis 
# (shape will be (None, 1, 1))
user_item = Dot(name = 'user_item', axes = 2)([item_embedding, user_embedding])

# Reshape to be a single number (shape will be (None, 1))
user_item = Reshape(target_shape = (1, ))(user_item)

Jumping forward a bit, we consider the term $\rho_{c}^T(\frac{1}{i-1} \displaystyle\sum_{j=1}^{i-1} \alpha_{y_{tj}} )$ from equation 2. We first create a new item-item interaction vector $\rho_{c}$ that captures complementary effects between items. This is multiplied by the term $\displaystyle\sum_{j=1}^{i-1} \alpha_{y_{tj}}$, which is nothing more than the average of the vectors of all the other items in the shopping basket. Note that this $\alpha$ is the same as the $\alpha$ from our previous embedding. This is why we use a shared embedding layer.