In [1]:
# Set directories 
import os
import sys

src_dir = os.path.join(os.getcwd(), '..', 'src')
sys.path.append(src_dir)
data_dir = os.path.join(os.getcwd(), '..', 'data')
 
#from IPython.core.debugger import set_trace


# Import Functions
import pandas as pd
import numpy as np
from d00_utils.ImportFiles import ImportOrders, ImportAddInfos, SaveSampleOfOrders
from d01_preprocess.PrepTargetFlatfile import JoinOrderProductFiles, PrepFlatfile, PrepTarget
from d01_preprocess.PrepFtRecFreq import UserItemFreq, GeneralItemFreq, LastOrdOfItem, TimeToLastOrd
from d01_preprocess.PrepAddFeatures import AvgToCardOrder
from d01_preprocess.PrepTimeOfDay import OrderTimeofdayConverter, AddTimeOfDay, PrepTimeOfDayUser, PrepTimeOfDayProduct




First, I filter the orders and order_product dataframes so that they only contain data for a small subset of users. With this smaller data set I can experiment more quickly.
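A minimal sketch of that kind of user-level sampling, on toy data. This is not the code inside `SaveSampleOfOrders`; the helper `sample_users` and the column names `order_id`, `user_id` and `product_id` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the real tables (column names are assumptions).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "user_id":  [10, 10, 20, 20, 30, 30],
})
order_products = pd.DataFrame({
    "order_id":   [1, 1, 2, 3, 4, 5, 6],
    "product_id": [100, 101, 100, 102, 103, 100, 104],
})


def sample_users(orders, order_products, n_users, seed=0):
    """Hypothetical helper: keep only rows that belong to a random subset of users."""
    users = orders["user_id"].drop_duplicates().sample(n=n_users, random_state=seed)
    orders_small = orders[orders["user_id"].isin(users)]
    # Keep only the order lines of the orders that survived the filter.
    keep = order_products["order_id"].isin(orders_small["order_id"])
    return orders_small, order_products[keep]


orders_small, products_small = sample_users(orders, order_products, n_users=2)
```

Sampling by user rather than by row keeps every sampled user's full order history intact, which the user-level features below depend on.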


In [2]:
order_products__prior, order_products__train, orders = ImportOrders(data_dir,False)
#SaveSampleOfOrders(data_dir,order_products__prior, order_products__train, orders,100)


In [3]:
#order_products__prior, order_products__train, orders = ImportOrders(data_dir,small_sample=True)
#aisles,deparments,products = ImportAddInfos(data_dir)


The two order-product dataframes are combined, and additional order-level information from the orders dataframe is joined onto the result.
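Roughly, such a join could look like the sketch below. It is only an illustration on toy data, not the body of `JoinOrderProductFiles`; the column names (`order_id`, `user_id`, `eval_set`, `order_number`, `product_id`) are assumptions.

```python
import pandas as pd

# Toy order-level table (column names are assumptions).
orders = pd.DataFrame({
    "order_id":     [1, 2, 3],
    "user_id":      [10, 10, 10],
    "eval_set":     ["prior", "prior", "train"],
    "order_number": [1, 2, 3],
})
prior = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 101, 100]})
train = pd.DataFrame({"order_id": [3], "product_id": [101]})

# Stack both order-product tables, then attach the order-level columns.
joined = (
    pd.concat([prior, train], ignore_index=True)
      .merge(orders, on="order_id", how="left")
)
```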

In [4]:
order_products_gesamt = JoinOrderProductFiles(order_products__prior,order_products__train,orders)

# Prepare Flatfile and Target

The flatfile is created as follows. The train order is filtered out, and the products from all previous orders are grouped by user_id and product_id. For every user in the training set, this yields a dataframe with all products the user ordered before the train order. Each of these products could potentially have been reordered in the train order.
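The idea can be sketched on toy data as follows. This is an assumption about the general approach, not the implementation of `PrepFlatfile`; the `eval_set` column and its values `"prior"` and `"train"` are assumed names.

```python
import pandas as pd

# Toy joined order-product table (column names are assumptions).
joined = pd.DataFrame({
    "user_id":    [10, 10, 10, 10, 20, 20],
    "product_id": [100, 101, 100, 101, 100, 102],
    "eval_set":   ["prior", "prior", "prior", "train", "prior", "prior"],
})

# Users that have an order in the chosen set (here: "train").
target_users = joined.loc[joined["eval_set"] == "train", "user_id"].unique()

# One row per (user, product) pair seen in that user's earlier orders.
is_candidate = (joined["eval_set"] == "prior") & joined["user_id"].isin(target_users)
candidates = (
    joined.loc[is_candidate, ["user_id", "product_id"]]
          .drop_duplicates()
          .reset_index(drop=True)
)
```

User 20 has no train order in this toy data, so none of their products end up among the candidates.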

In [5]:
which_set = "train"

In [6]:
flatfile = PrepFlatfile(order_products_gesamt,orders,which_set)

Next we calculate the target. The flatfile is joined with all user-product combinations in the training set. If the product is also in the user's train order, the label is 1; otherwise the label is 0.
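A minimal sketch of such a labelling step, assuming the candidates and the train pairs are both keyed by `user_id` and `product_id` (the column name `target` is also an assumption, and this is not necessarily how `PrepTarget` does it):

```python
import pandas as pd

candidates = pd.DataFrame({"user_id": [10, 10, 10], "product_id": [100, 101, 102]})
# Product 999 is bought for the first time in the train order.
train_pairs = pd.DataFrame({"user_id": [10, 10], "product_id": [101, 999]})

# A left join keeps every candidate; pairs without a match get NaN, which becomes 0.
labelled = candidates.merge(
    train_pairs.drop_duplicates().assign(target=1),
    on=["user_id", "product_id"],
    how="left",
)
labelled["target"] = labelled["target"].fillna(0).astype(int)
```

With a left join, products that appear in the train order for the first time (999 here) never enter the flatfile, because they were not candidates to begin with.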

In [7]:
flatfile = PrepTarget(flatfile,order_products_gesamt)


# Recency & Frequency

I am going to start by building a model with some basic features.

The first features I am going to build are the following

* Frequency: How often has a customer previously ordered an item
* Frequency relative: In how many of the orders of a customer has the item been
* Recency: How long ago was the last time a user ordered an item
* In how many of the last three (just first guess) of 

All of these features capture the user-item relation.
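As an illustration of the first three features, here is a sketch on toy data. It measures recency in number of orders, which is an assumption (it could equally be measured in days), and the column names are assumed as well. It does not reproduce `UserItemFreq`, `LastOrdOfItem` or `TimeToLastOrd`.

```python
import pandas as pd

prior = pd.DataFrame({
    "user_id":      [10, 10, 10, 10, 10],
    "order_number": [1, 1, 2, 3, 3],
    "product_id":   [100, 101, 100, 100, 102],
})
train_order_number = 4  # assumed: the train order directly follows the last prior order

# Frequency and last order number per (user, product).
feats = (
    prior.groupby(["user_id", "product_id"])
         .agg(freq=("order_number", "nunique"), last_order=("order_number", "max"))
         .reset_index()
)

# Relative frequency: share of the user's prior orders that contain the product.
n_orders = (
    prior.groupby("user_id")["order_number"].nunique().rename("n_orders").reset_index()
)
feats = feats.merge(n_orders, on="user_id")
feats["freq_rel"] = feats["freq"] / feats["n_orders"]

# Recency: number of orders between the last purchase and the train order.
feats["orders_since_last"] = train_order_number - feats["last_order"]
```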


In [8]:
flatfile = UserItemFreq(flatfile,order_products_gesamt,orders,which_set)

In [9]:
last_order_of_item = LastOrdOfItem(order_products_gesamt,which_set)
flatfile = flatfile.merge(last_order_of_item,how = 'left', on = ['user_id','product_id'])

In [10]:
flatfile = TimeToLastOrd(flatfile,orders,which_set)

# Average Position in Cart

The average add-to-cart position could also have an influence. Items that tend to be chosen first might be more relevant. Conversely, items that are always chosen last could be less likely to appear in an order.
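Such a feature could be computed along these lines. The column names `add_to_cart_order` and `avg_cart_position` are assumptions, and the sketch is not the body of `AvgToCardOrder`.

```python
import pandas as pd

prior = pd.DataFrame({
    "user_id":           [10, 10, 10, 10],
    "product_id":        [100, 101, 100, 101],
    "add_to_cart_order": [1, 2, 1, 4],
})

# Mean position at which each user added each product to the cart.
avg_pos = (
    prior.groupby(["user_id", "product_id"], as_index=False)["add_to_cart_order"]
         .mean()
         .rename(columns={"add_to_cart_order": "avg_cart_position"})
)
```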

In [11]:
flatfile = AvgToCardOrder(flatfile,order_products_gesamt,which_set)

# Time of Day

Products may tend to be bought at specific times of day (by specific users). The following three features try to capture that:
* How often the user has already bought the item at the same time of day as the train order (user-product level)
* In how many orders at a given time of day the product has been bought (user-product level)
* Whether the product is generally ordered during the time of day at which the order is placed (product level)
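The first of these features can be sketched as below. The bucket edges, the bucket labels, the helper `time_of_day` and the column name `order_hour_of_day` are all assumptions for illustration; the actual conversion lives in `OrderTimeofdayConverter` and may differ.

```python
import pandas as pd


def time_of_day(hour):
    """Hypothetical helper: map an hour (0-23) to a coarse bucket with arbitrary edges."""
    return pd.cut(
        hour,
        bins=[-1, 5, 11, 17, 23],
        labels=["night", "morning", "afternoon", "evening"],
    ).astype(str)


prior = pd.DataFrame({
    "user_id":           [10, 10, 10, 10],
    "product_id":        [100, 100, 100, 101],
    "order_hour_of_day": [8, 9, 20, 8],
})
prior["tod"] = time_of_day(prior["order_hour_of_day"])

train_tod = "morning"  # assumed time of day of the train order

# How often each (user, product) pair was bought in the same bucket as the train order.
same_tod = (
    prior[prior["tod"] == train_tod]
    .groupby(["user_id", "product_id"])
    .size()
    .rename("n_same_tod")
    .reset_index()
)
```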

In [12]:
# product order and order table need information about time of day
flatfile,order_products_gesamt,orders = AddTimeOfDay(flatfile,order_products_gesamt,orders)
flatfile = PrepTimeOfDayUser(flatfile,order_products_gesamt,orders,which_set)
flatfile = PrepTimeOfDayProduct(flatfile,order_products_gesamt,orders,which_set)

# General 

If an item A is bought more often than an item B, item A will be in more orders. Thus the probability that item A is ordered is higher than that of item B. The following features are built:

* How many of the orders contain that specific item 
* Is the item one of the 10 % most ordered items (why 10 %? Just a first try) 

My intuition is that the user-product features will probably be more important.
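A sketch of how the two general features could be derived. The 90th-percentile cut-off mirrors the 10 % first guess above; the column names `order_share` and `is_top_10pct` are assumptions, and this is not the code of `GeneralItemFreq`.

```python
import pandas as pd

prior = pd.DataFrame({
    "order_id":   [1, 1, 2, 2, 3, 4],
    "product_id": [100, 101, 100, 102, 100, 101],
})

n_orders = prior["order_id"].nunique()

# Share of all orders that contain each product.
item = (
    prior.groupby("product_id")["order_id"]
         .nunique()
         .div(n_orders)
         .rename("order_share")
         .reset_index()
)

# Flag products whose share is at or above the 90th percentile.
threshold = item["order_share"].quantile(0.9)
item["is_top_10pct"] = (item["order_share"] >= threshold).astype(int)
```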




In [13]:
flatfile = GeneralItemFreq(flatfile,order_products_gesamt,which_set)

The flatfile is saved to disk so that it can be used later for modeling. A small sample data set is also saved for developing the workflow.

In [15]:
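flatfile.to_csv(os.path.join(data_dir, "02_intermediate", "flatfile.csv"), index=False)
# SampleSaveFlatfile is not among the imports in the first cell; it must be imported before this line runs
SampleSaveFlatfile(flatfile, 1000, data_dir)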