Goal: predict which previously purchased products will be in a user’s next order. Specifically, for each order_id in the test set, predict a space-delimited list of product_ids for that order. 

The orders file says which set (prior, train, or test) each order belongs to, without listing the products ordered. For any given user_id the last order is flagged as either train or test, while all previous orders are marked as prior. The prior file holds the products of the prior orders for both train and test users, while the train file holds the products of the last order of each train user. Altogether we have 3,421,083 order_ids for 206,209 user_ids:

prior (32,434,449 products) -> train user (131,209 user_ids) -> train (1,384,617 products)

prior (32,434,449 products) -> test user  (75,000 user_ids)  -> to forecast (??? products)
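
The split described above can be illustrated on a toy frame. This is only a sketch with made-up rows: the column names follow orders.csv, while the values and the variable names (`toy_orders`, `last_orders`, `earlier`) are assumptions for illustration.

```python
import pandas as pd

# Made-up rows that mimic the layout of orders.csv (values are illustrative only)
toy_orders = pd.DataFrame({
    'order_id':     [10, 11, 12, 20, 21, 22],
    'user_id':      [1, 1, 1, 2, 2, 2],
    'eval_set':     ['prior', 'prior', 'train', 'prior', 'prior', 'test'],
    'order_number': [1, 2, 3, 1, 2, 3],
})

# The last order of each user (highest order_number) carries the train/test flag
last_orders = toy_orders.loc[toy_orders.groupby('user_id')['order_number'].idxmax()]
print(last_orders[['user_id', 'order_id', 'eval_set']])

# Every earlier order of the same users is flagged as prior
earlier = toy_orders.drop(last_orders.index)
print(earlier['eval_set'].unique())
```

On the real data the same check could be run on the orders frame to confirm that each user has exactly one train or test order.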

For the 75,000 test user_ids there are also 75,000 order_ids that we need to include in the final answer. The "prior" pandas frame is 216 MB, orders is 88 MB, and train is 8 MB. Fortunately one can reduce the size by merging and aggregating the data (I am working with 5 MB in all_data, 6 MB in train, and 2 MB in test).
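
As a rough illustration of where such savings can come from, the sketch below uses made-up random data (not the competition files) to compare default integer columns with narrower ones, and then aggregates to one row per order. The frame and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for order_products__prior.csv (one row per purchased product)
rng = np.random.default_rng(0)
n_rows = 100_000
toy_prior = pd.DataFrame({
    'order_id': rng.integers(1, 10_000, size=n_rows),     # int64 by default
    'product_id': rng.integers(1, 49_000, size=n_rows),   # int64 by default
})
print('default dtypes:', toy_prior.memory_usage(index=False).sum(), 'bytes')

# Step 1: narrower integer types (the same idea as the dtype= argument of read_csv)
small = toy_prior.astype({'order_id': np.uint32, 'product_id': np.uint16})
print('narrow dtypes :', small.memory_usage(index=False).sum(), 'bytes')

# Step 2: aggregate to one row per order, holding a list of products
per_order = small.groupby('order_id')['product_id'].apply(list)
print('rows before/after aggregation:', len(small), len(per_order))
```

Note that pandas counts only the references in an object column unless `memory_usage(deep=True)` is used, so sizes reported for columns of lists can understate the memory actually held.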

We will see that the train file (our target, or y) has on average 6 reordered products per user (standard deviation 6, full range 0 to 71). This is after we exclude the new products; otherwise the average would be 11 with a range of 1 to 80. The data in the prior set (our X) has on average 8 products per order for each user (with 17 orders per user on average). Our task is to predict the 6 products in the basket from a set of 65 products (on average for each user) purchased in the past, even though there are some 49,000 products in total.

I am not using the rest of the files (at first disregarding the train users as well). The prediction (at first) is to take, for each test user, all their past orders and select the "n" most common products, where "n" is the average number of products in past orders. The program takes just a few minutes to run on Kaggle. The score it gives is 0.329, clearly not good enough (about 1500 out of 2000 on the leaderboard when first run), so we need to work harder and use this only as a starting point.
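
The selection rule can be sketched on a single made-up user. This is a minimal illustration, not the notebook's own code: the function name `most_frequent_products` and the toy history are hypothetical, and here the average is taken over the listed past orders only.

```python
from collections import Counter

def most_frequent_products(past_orders):
    """Return the n most frequent product_ids, with n = truncated mean basket size."""
    n = int(sum(len(order) for order in past_orders) / len(past_orders))
    counts = Counter(pid for order in past_orders for pid in order)
    return [pid for pid, _ in counts.most_common(n)]

# Made-up order history of one hypothetical user: 4 orders, 12 products in total
history = [[1, 2, 3], [1, 2], [1, 3, 5, 6], [1, 2, 7]]
print(most_frequent_products(history))
```

Here the mean basket size is 3, so the three products bought most often (1, 2 and 3) form the predicted basket.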

Before using any machine learning algorithm we need to understand the measure of success, so in the second part of the notebook we look at the train set and calculate F1 for the same prediction. I am concerned about the use of F1. For example, if the true order is [1,2,3,4,5,6] and we predict 2 of them correctly (1 in 3), i.e. [1,2,7,8,9,10], then F1 is 0.33 (and so are precision and recall). The best leaderboard score is 0.4, which is less than what 3 correct out of 6 would give (0.5). The F1 score does not necessarily penalize predicting the wrong number of products. For example, if we add [11,12,13,14,3] to the cart, i.e. just 1 correct in 5, we improve the score to 0.35. To get to 0.4 one would need to add [11,12,3], or again 1 in 3 correct. In that last case precision remains the same (0.33) and only recall increases (see the code at the end). The narrow task is to maximize F1, but is this really relevant to the business?
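
The arithmetic behind these numbers is easier to follow in closed form. The notation here is introduced only for this illustration: $h$ is the number of correctly predicted products, $n_{\text{pred}}$ the size of the predicted basket and $n_{\text{true}}$ the size of the true basket.

```latex
P = \frac{h}{n_{\text{pred}}}, \qquad
R = \frac{h}{n_{\text{true}}}, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2h}{n_{\text{pred}} + n_{\text{true}}}

% The three baskets discussed above, all with n_true = 6:
F_1 = \frac{2 \cdot 2}{6 + 6} \approx 0.33, \qquad
F_1 = \frac{2 \cdot 3}{11 + 6} \approx 0.35, \qquad
F_1 = \frac{2 \cdot 3}{9 + 6} = 0.40
```

Written this way, an extra guess raises F1 whenever its chance of being correct is higher than half the current F1, which is why longer baskets can pay off even at low precision.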



In [None]:
import pandas as pd
import numpy as np
from collections import Counter

myfolder = '../input/'
prior = pd.read_csv(myfolder + 'order_products__prior.csv', dtype={'order_id': np.uint32,
           'product_id': np.uint16}).drop(['add_to_cart_order', 'reordered'], axis=1)
orders = pd.read_csv(myfolder + 'orders.csv', dtype={'order_hour_of_day': np.uint8,
           'order_number': np.uint8, 'order_id': np.uint32, 'user_id': np.uint32,
           'days_since_prior_order': np.float16}).drop(['order_dow','order_hour_of_day'], axis=1)
orders.set_index('order_id', drop=False, inplace=True)

In [None]:
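# This might take a minute: attach the list of purchased products to each order.
# orders is indexed by order_id, so the assignment aligns on that index.

orders['prod_list'] = prior.groupby('order_id')['product_id'].apply(list)
# Train/test orders have no rows in prior; give them an empty list instead of NaN
# (filling the whole frame with '' would also turn days_since_prior_order into an object column)
orders['prod_list'] = orders['prod_list'].apply(lambda x: x if isinstance(x, list) else [])
orders['num_items'] = orders['prod_list'].apply(len).astype(np.uint8)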
    

In [None]:
# Aggregate again: for each user, build a list of lists with the products of all their orders

by_user = orders.groupby('user_id')
all_products = by_user['prod_list'].apply(list).to_frame()
# Note: the final (train/test) order has num_items == 0 and is included in these statistics;
# astype(np.uint8) truncates the mean towards zero
all_products['mean_items'] = by_user['num_items'].mean().astype(np.uint8)
all_products['max_items'] = by_user['num_items'].max().astype(np.uint8)
# Make user_id a regular column only (not also the index) so the later merges are unambiguous
all_products = all_products.reset_index()


In [None]:
# This function flattens the list of lists (of product_ids), then finds the most common elements
# in it and joins them into the space-delimited format required for the test set

def myfrequent(x):
    prodids = x.prod_list
    n = int(x.mean_items)   # number of products to predict
    C = Counter(elem for sublist in prodids for elem in sublist).most_common(n)
    return ' '.join(str(prod) for prod, _ in C)   # no IndexError if fewer than n products exist

test=orders[['order_id','user_id']].loc[orders['eval_set']=='test']
test=test.merge(all_products,on='user_id')
test['products']=test.apply(myfrequent,axis=1)
test[['order_id','products']].to_csv('mean_submission0.csv', index=False)  
test.head(3)

The leaderboard (LB) score is 0.329. To understand it better we look at the train set:
    

In [None]:
train = orders.loc[orders['eval_set']=='train', ['order_id','user_id']].copy()  # copy avoids SettingWithCopyWarning
train_orders = pd.read_csv(myfolder + 'order_products__train.csv', dtype={'order_id': np.uint32,
           'product_id': np.uint16, 'reordered': np.int8}).drop(['add_to_cart_order'], axis=1)
train_orders = train_orders[train_orders['reordered']==1].drop('reordered',axis=1)  # predicting for reordered only
# train is indexed by order_id, so the grouped lists align on that index
train['true'] = train_orders.groupby('order_id')['product_id'].apply(list)
# Orders without any reordered product get an empty list
train['true'] = train['true'].apply(lambda x: x if isinstance(x, list) else [])
train['true_n'] = train['true'].apply(len).astype(np.uint8)
train = train.merge(all_products, on='user_id')
# Flatten each user's list of lists into a single list of product_ids
train['prod_list'] = train['prod_list'].map(lambda x: [elem for sublist in x for elem in sublist])


In [None]:
def myfrequent2(x):     # select the n most common elements from the prod_list
    prodids = x.prod_list
    n = int(x.mean_items)
    C = Counter(prodids).most_common(n)
    return [prod for prod, _ in C]   # no IndexError if fewer than n products exist

def f1_score_single(x):    #copied from LiLi
    y_true = set(x.true)
    y_pred = set(x.prediction)
    cross_size = len(y_true & y_pred)
    if cross_size == 0: return 0.
    p = 1. * cross_size / len(y_pred)
    r = 1. * cross_size / len(y_true)
    return 2 * p * r / (p + r)

train['prediction']=train.apply(myfrequent2,axis=1)
train['f1']=train.apply(f1_score_single,axis=1).astype(np.float32)   # float16 is too coarse to average over many rows
print('The F1 score on the training set is  {0:.3f}.'.format(  train['f1'].mean()  ))
train.head(3)

The example below shows how one can raise F1 from 0.33 to 0.4 without increasing precision, just by increasing the number of products in the basket.

In [None]:

def f1(y_true,y_pred):    
    y_true = set(y_true)
    y_pred = set(y_pred)
    cross_size = len(y_true & y_pred)
    if cross_size == 0: return 0.
    p = 1. * cross_size / len(y_pred)
    r = 1. * cross_size / len(y_true)
    return 2 * p * r / (p + r)

y_true=[1,2,3,4,5,6]
y_pred=[1,2,7,8,9,10]        # 2 correct out of 6 predicted
print(' True, Pred, F1:   ',y_true,y_pred,f1(y_true, y_pred))
y_pred.extend([11,12,3])     # 3 more guesses, 1 of them correct: 3 correct out of 9
print(' True, Pred, F1:   ',y_true,y_pred,f1(y_true, y_pred))
y_pred=[1,2,3,8,9,10]        # for comparison: 3 correct out of 6 predicted
print(' True, Pred, F1:   ',y_true,y_pred,f1(y_true, y_pred))


Seeing this, and given that the highest LB score is about 0.4, I have to ask whether that score comes from an increase in recall only, not precision, obtained by increasing the number of products in an order. I checked a couple of other simple public kernels with a similar approach and noticed that they achieve a higher LB score at the cost of having on average 12 products in an order instead of 6.
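
To make the question concrete, the sketch below holds precision fixed and only varies the basket size. It rests on a stylized assumption that is not taken from the data: exactly one in three predicted products is correct whatever the basket size, and the true basket has 6 products. The helper `f1_from_counts` is a hypothetical name.

```python
def f1_from_counts(hits, n_pred, n_true):
    """F1 from the number of correct, predicted and true products."""
    if hits == 0:
        return 0.0
    precision = hits / n_pred
    recall = hits / n_true
    return 2 * precision * recall / (precision + recall)

n_true = 6
# Assumption for illustration: one in three predicted products is correct at every basket size
for n_pred in (6, 9, 12):
    hits = n_pred // 3
    p, r = hits / n_pred, hits / n_true
    print(f'n_pred={n_pred:2d}  precision={p:.2f}  recall={r:.2f}  '
          f'F1={f1_from_counts(hits, n_pred, n_true):.3f}')
```

Under that assumption, doubling the basket from 6 to 12 products lifts F1 from 0.333 to 0.444 while precision stays at 0.33, which is in line with the pattern seen in the public kernels.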