### Flat file creation of test data

The test dataset in "order_products__test_cap.csv" contains order ids and the products in each order. To predict which products the test users will order, the model needs these users' previous order histories, from which user features, product features and user-product features are derived. These features are obtained by the steps below.

In [4]:
# Importing libraries
import pandas as pd
import numpy as np
from collections import OrderedDict

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import MinMaxScaler

In [5]:
# Importing Data
aisles_df = pd.read_csv('aisles.csv')
products_df = pd.read_csv('products.csv')
orders_df = pd.read_csv('orders.csv')
order_products_prior_df = pd.read_csv('order_products__prior.csv')
departments_df = pd.read_csv('departments.csv')
order_products_train_df = pd.read_csv('order_products__train_cap.csv')
order_products_test_df = pd.read_csv('order_products__test_cap.csv')

In [4]:
order_products_test_df.head()

Unnamed: 0,order_id,product_id
0,1,49302
1,1,11109
2,1,10246
3,1,49683
4,1,43633


Obtaining order-based features (user id, order number, day of week, hour of day, days since prior order) by merging the test data with orders_df

In [5]:
order_products_test_df = order_products_test_df.merge(orders_df.drop('eval_set', axis=1), on='order_id')

In [6]:
order_products_test_df.head()

Unnamed: 0,order_id,product_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,1,49302,112108,4,4,10,9.0
1,1,11109,112108,4,4,10,9.0
2,1,10246,112108,4,4,10,9.0
3,1,49683,112108,4,4,10,9.0
4,1,43633,112108,4,4,10,9.0
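
A merge like the one above can silently drop or duplicate rows if the key is not unique on the right side. The following is a minimal sketch of one way to guard against that, using toy stand-ins for the real tables (the values and the names `test_rows` and `orders` are illustrative assumptions, not taken from the dataset).

```python
import pandas as pd

# Toy stand-ins for order_products_test_df and orders_df (illustrative values).
test_rows = pd.DataFrame({'order_id': [1, 1, 2],
                          'product_id': [49302, 11109, 196]})
orders = pd.DataFrame({'order_id': [1, 2, 3],
                       'user_id': [112108, 7, 8],
                       'eval_set': ['test', 'test', 'prior']})

merged = test_rows.merge(
    orders.drop('eval_set', axis=1),
    on='order_id',
    how='left',               # keep every test row, even if unmatched
    validate='many_to_one',   # raise if order_id is duplicated in `orders`
)

# Every test row should survive and should have found its user.
assert len(merged) == len(test_rows)
assert merged['user_id'].notna().all()
```

With `how='left'` an unmatched order would show up as a missing `user_id` instead of disappearing, which makes the problem easier to spot.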


In [22]:
order_products_test_df.to_csv('order-product-test(derieved).csv')

The test data contains 32803 unique orders and 32803 unique users, i.e. one test order per test user.

In [7]:
# unique test orders and test users
print(order_products_test_df.order_id.nunique())
print(order_products_test_df.user_id.nunique())

32803
32803


Obtaining the prior order history by attaching order details to each prior order-product row

In [8]:
order_products_prior_df = order_products_prior_df.merge(orders_df.drop('eval_set', axis=1), on='order_id')

#### Creating User-Product feature

Here I create a new dataframe called user_product_df, which holds one row per user-product pair. The following features are created:
- **user_product_avg_add_to_cart_order**: for the given user-product pair, the average add-to-cart position of the product in this user's orders
- **user_product_total_orders**: for the given user-product pair, how many times the user ordered this product
- **user_product_avg_days_since_prior_order**: for the given user-product pair, the average of days_since_prior_order over the user's orders that contain this product, i.e. the average gap to the user's previous order when the product was bought
- **user_product_avg_order_dow**: for the given user-product pair, the average day of the week on which the user orders this product
- **user_product_avg_order_hour_of_day**: for the given user-product pair, the average hour of the day at which the user orders this product
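
The feature definitions above can be sketched on a toy table. This example uses pandas named aggregation (available from pandas 0.25) as an alternative to the OrderedDict plus column renaming used in this notebook; the toy rows and the variable name `prior` are assumptions for illustration only.

```python
import pandas as pd

# Toy prior history: one row per (order, product), already joined with orders.
prior = pd.DataFrame({
    'user_id':                [1, 1, 1, 2],
    'product_id':             [10, 10, 20, 10],
    'order_id':               [100, 101, 101, 200],
    'add_to_cart_order':      [1, 3, 2, 5],
    'days_since_prior_order': [7.0, 9.0, 9.0, 30.0],
})

# Named aggregation sets the output column names directly,
# so no separate `.columns = [...]` assignment is needed.
user_product = (
    prior.groupby(['product_id', 'user_id'], as_index=False)
         .agg(
             user_product_avg_add_to_cart_order=('add_to_cart_order', 'mean'),
             user_product_total_orders=('order_id', 'count'),
             user_product_avg_days_since_prior_order=('days_since_prior_order', 'mean'),
         )
)
print(user_product)
```

Naming the columns inside `agg` avoids relying on the order of the aggregated columns matching a separately maintained list of names.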


In [9]:
user_prod_features1 = ['user_product_avg_add_to_cart_order','user_product_total_orders','user_product_avg_days_since_prior_order',
                      'user_product_avg_order_dow','user_product_avg_order_hour_of_day']

user_product_df = (order_products_prior_df.groupby(['product_id','user_id'],as_index=False) \
                                                .agg(OrderedDict([('add_to_cart_order','mean'),( 'order_id','count'),('days_since_prior_order','mean'),
                                                     ('order_dow','mean'),
                                                     ('order_hour_of_day','mean')])))
user_product_df.columns = ['product_id','user_id'] + user_prod_features1
user_product_df.head()

Unnamed: 0,product_id,user_id,user_product_avg_add_to_cart_order,user_product_total_orders,user_product_avg_days_since_prior_order,user_product_avg_order_dow,user_product_avg_order_hour_of_day
0,1,138,3.0,2,11.5,6.0,14.0
1,1,709,20.0,1,6.0,0.0,21.0
2,1,764,10.5,2,9.0,3.5,15.0
3,1,777,7.0,1,26.0,1.0,7.0
4,1,825,2.0,1,30.0,2.0,14.0


Selecting the prior user-product history for test users

In [10]:
test_ids = order_products_test_df['user_id'].unique() 
test_ids.shape

(32803L,)

In [11]:
df_T = user_product_df[user_product_df['user_id'].isin(test_ids)]
df_T.head(2)

Unnamed: 0,product_id,user_id,user_product_avg_add_to_cart_order,user_product_total_orders,user_product_avg_days_since_prior_order,user_product_avg_order_dow,user_product_avg_order_hour_of_day
0,1,138,3.0,2,11.5,6.0,14.0
10,1,1540,7.0,17,8.294118,1.235294,14.529412


In [12]:
df_T.user_id.nunique()

32803

A dataframe called test_carts is created, which lists the set of products each test user ordered in their latest order, i.e. the order in the test dataset.

In [13]:
test_carts = (order_products_test_df.groupby('user_id',as_index=False)
                                      .agg({'product_id':(lambda x: set(x))})
                                      .rename(columns={'product_id':'latest_cart'}))

In [14]:
test_carts.head()

Unnamed: 0,user_id,latest_cart
0,1,"{196, 26405, 13032, 39657, 25133, 38928, 26088..."
1,2,"{24838, 11913, 45066, 31883, 38547, 24852, 327..."
2,10,"{48720, 10177, 29650, 24654}"
3,14,"{3808, 11042, 15172, 29509, 8744, 42284, 29615..."
4,29,"{48800, 39170, 20874, 35507, 37645, 49615, 612..."


Next, df_T and test_carts are merged. The resulting dataset contains the historical (prior) order information for each test user: which products they ordered and how many times. A new column called in_cart indicates whether a previously ordered product is also present in the user's current test order (1) or not (0).

In [15]:
df_T = df_T.merge(test_carts, on='user_id')
df_T['in_cart'] = (df_T.apply(lambda row: row['product_id'] in row['latest_cart'], axis=1).astype(int))
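
Row-wise `apply` over roughly two million rows can be slow. As a hedged alternative (not the method used in this notebook), the same label can be derived with a left merge and pandas' `indicator` option. The toy frames `history` and `test_rows` below are assumptions for illustration.

```python
import pandas as pd

# Toy prior user-product pairs and toy test order rows (illustrative values).
history = pd.DataFrame({'user_id': [1, 1, 2], 'product_id': [196, 500, 196]})
test_rows = pd.DataFrame({'user_id': [1, 2], 'product_id': [196, 24838]})

# Distinct (user, product) pairs that appear in the test orders.
pairs = test_rows[['user_id', 'product_id']].drop_duplicates()

# indicator=True adds a `_merge` column: 'both' means the pair is in the test cart.
labelled = history.merge(pairs, on=['user_id', 'product_id'],
                         how='left', indicator=True)
labelled['in_cart'] = (labelled['_merge'] == 'both').astype(int)
labelled = labelled.drop(columns='_merge')
print(labelled)
```

This sketch skips the intermediate `latest_cart` set column entirely, so it only fits if that column is not needed later.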


In [16]:
df_T['in_cart'].value_counts()

0    1918855
1     206898
Name: in_cart, dtype: int64

#### Creating Product Based Features
Here I create a new dataframe called prod_features_df. The following product-based features are created:
- **product_total_orders**: number of distinct orders that contain the product
- **product_avg_add_to_cart_order**: average add-to-cart position of the product
- **product_avg_order_dow**: average day of the week on which the product is ordered
- **product_avg_order_hour_of_day**: average hour of the day at which the product is ordered
- **product_avg_days_since_prior_order**: average of days_since_prior_order over the orders that contain the product


In [17]:
prod_features = ['product_total_orders','product_avg_add_to_cart_order','product_avg_order_dow', 'product_avg_order_hour_of_day', 'product_avg_days_since_prior_order']

prod_features_df = (order_products_prior_df.groupby(['product_id'],as_index=False)
                                           .agg(OrderedDict(
                                                   [('order_id','nunique'),
                                                    ('add_to_cart_order','mean'),('order_dow','mean'),
                                      ('order_hour_of_day', 'mean'),
                                      ('days_since_prior_order', 'mean')])))
prod_features_df.columns = ['product_id'] + prod_features
prod_features_df.head()

Unnamed: 0,product_id,product_total_orders,product_avg_add_to_cart_order,product_avg_order_dow,product_avg_order_hour_of_day,product_avg_days_since_prior_order
0,1,1852,5.801836,2.776458,13.238121,10.432725
1,2,90,9.888889,2.922222,13.277778,10.482759
2,3,277,6.415162,2.736462,12.104693,10.565385
3,4,329,9.507599,2.683891,13.714286,14.686207
4,5,15,6.466667,2.733333,10.666667,12.428571


Merging df_T and prod_features_df

In [18]:
df_T = df_T.merge(prod_features_df, on='product_id')

df_T = df_T.dropna()

#### Creating User Based Features

Creating a dataframe containing user features (user_features_df):
- **user_total_orders**: total number of orders placed by the user
- **user_avg_cartsize**: average cart size of the user
- **user_total_products**: number of distinct products ordered by the user
- **user_avg_days_since_prior_order**: average number of days elapsed between consecutive orders
- **user_avg_order_dow**: average day of the week on which the user places orders
- **user_avg_order_hour_of_day**: average hour of the day at which the user places orders


In [19]:
user_features = ['user_total_orders','user_avg_cartsize','user_total_products','user_avg_days_since_prior_order','user_avg_order_dow','user_avg_order_hour_of_day']

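# user_avg_cartsize = number of prior rows per user / number of distinct orders.
# Note: if this runs under Python 2, `/` on two ints truncates the result.
user_features_df = (order_products_prior_df.groupby(['user_id'],as_index=False)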
                                           .agg(OrderedDict(
                                                   [('order_id',['nunique', (lambda x: x.shape[0] / x.nunique())]),
                                                    ('product_id','nunique'),
                                                    ('days_since_prior_order','mean'),('order_dow','mean'),
                                                    ('order_hour_of_day','mean')])))

user_features_df.columns = ['user_id'] + user_features
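
The outputs in this notebook show user_avg_cartsize as a whole number, which is consistent with integer division under Python 2 (the `32803L` output above suggests that interpreter). The following is a minimal sketch, on assumed toy data, of computing the average cart size with pandas Series division, which yields a fractional result.

```python
import pandas as pd

# Toy prior rows: user 1 has 5 items across 2 orders, user 2 has 1 item in 1 order.
prior = pd.DataFrame({
    'user_id':  [1, 1, 1, 1, 1, 2],
    'order_id': [100, 100, 100, 101, 101, 200],
})

grouped = prior.groupby('user_id')['order_id']
user_total_orders = grouped.nunique()   # distinct orders per user
items_per_user = grouped.size()         # order-product rows per user

# Cast to float so the division is fractional regardless of interpreter.
user_avg_cartsize = items_per_user.astype(float) / user_total_orders
print(user_avg_cartsize)
```

Here user 1 gets 2.5 items per order, whereas truncating division would report 2.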

In [20]:
df_T = df_T.merge(user_features_df, on='user_id')
df_T = df_T.dropna()

In [21]:
df_T['user_product_order_freq'] = df_T['user_product_total_orders'] / df_T['user_total_orders'] 
df_T.head()

Unnamed: 0,product_id,user_id,user_product_avg_add_to_cart_order,user_product_total_orders,user_product_avg_days_since_prior_order,user_product_avg_order_dow,user_product_avg_order_hour_of_day,latest_cart,in_cart,product_total_orders,...,product_avg_order_dow,product_avg_order_hour_of_day,product_avg_days_since_prior_order,user_total_orders,user_avg_cartsize,user_total_products,user_avg_days_since_prior_order,user_avg_order_dow,user_avg_order_hour_of_day,user_product_order_freq
0,1,138,3.0,2,11.5,6.0,14.0,{42475},0,1852,...,2.776458,13.238121,10.432725,32,4,55,10.4,3.040541,12.689189,0.0625
1,907,138,2.5,2,6.0,5.0,13.5,{42475},0,2025,...,3.030123,12.974321,16.501616,32,4,55,10.4,3.040541,12.689189,0.0625
2,1000,138,5.0,1,7.0,6.0,12.0,{42475},0,2610,...,2.7659,13.382759,9.858176,32,4,55,10.4,3.040541,12.689189,0.03125
3,3265,138,1.0,1,19.0,5.0,14.0,{42475},0,5270,...,2.811006,13.203605,12.138405,32,4,55,10.4,3.040541,12.689189,0.03125
4,4913,138,3.0,1,24.0,5.0,13.0,{42475},0,952,...,2.992647,13.35084,13.906358,32,4,55,10.4,3.040541,12.689189,0.03125


Apart from the user-based, product-based and user-product features, 5 more features are created that describe how a user's behavior towards a product differs from that of all users.

- **product_total_orders_delta_per_user**: difference between the total number of orders placed for the product and the number of orders placed for the product by this specific user.
- **product_avg_add_to_cart_order_delta_per_user**: difference between the product's average add-to-cart position over all users and its average add-to-cart position for this specific user.
- **product_avg_order_dow_per_user**: difference between the average day of the week on which the product is ordered over all users and the average day of the week on which this specific user orders it.
- **product_avg_order_hour_of_day_per_user**: difference between the product's average order hour of the day over all users and its average order hour of the day for this specific user.
- **product_avg_days_since_prior_order_per_user**: difference between the product's average days_since_prior_order over all users and its average days_since_prior_order for this specific user.
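
As a worked example of these definitions, the sketch below recomputes three of the deltas for the pair (product 1, user 138), using the values shown in the outputs above. The dictionary `delta_pairs` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

# Values for product_id 1 and user_id 138, copied from the outputs above.
row = pd.DataFrame({
    'product_total_orders':               [1852],
    'user_product_total_orders':          [2],
    'product_avg_add_to_cart_order':      [5.801836],
    'user_product_avg_add_to_cart_order': [3.0],
    'product_avg_order_dow':              [2.776458],
    'user_product_avg_order_dow':         [6.0],
})

# Hypothetical mapping: new column -> (product-level column, user-product column).
delta_pairs = {
    'product_total_orders_delta_per_user':
        ('product_total_orders', 'user_product_total_orders'),
    'product_avg_add_to_cart_order_delta_per_user':
        ('product_avg_add_to_cart_order', 'user_product_avg_add_to_cart_order'),
    'product_avg_order_dow_per_user':
        ('product_avg_order_dow', 'user_product_avg_order_dow'),
}

# Each delta is the product-level value minus the user-product value.
for new_col, (prod_col, user_col) in delta_pairs.items():
    row[new_col] = row[prod_col] - row[user_col]

print(row[list(delta_pairs)].round(6))
```

A positive add-to-cart delta means this user tends to add the product earlier than the average shopper; a negative day-of-week delta means the user buys it later in the week than average.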


In [22]:
df_T['product_total_orders_delta_per_user'] = df_T['product_total_orders'] - df_T['user_product_total_orders']

df_T['product_avg_add_to_cart_order_delta_per_user'] = df_T['product_avg_add_to_cart_order'] - \
                                                            df_T['user_product_avg_add_to_cart_order']

df_T['product_avg_order_dow_per_user'] = df_T['product_avg_order_dow'] - df_T['user_product_avg_order_dow']

df_T['product_avg_order_hour_of_day_per_user'] = df_T['product_avg_order_hour_of_day'] - \
                                                            df_T['user_product_avg_order_hour_of_day']

df_T['product_avg_days_since_prior_order_per_user'] = df_T['product_avg_days_since_prior_order'] - \
                                                            df_T['user_product_avg_days_since_prior_order']

In [23]:
df_T.head(2)

Unnamed: 0,product_id,user_id,user_product_avg_add_to_cart_order,user_product_total_orders,user_product_avg_days_since_prior_order,user_product_avg_order_dow,user_product_avg_order_hour_of_day,latest_cart,in_cart,product_total_orders,...,user_total_products,user_avg_days_since_prior_order,user_avg_order_dow,user_avg_order_hour_of_day,user_product_order_freq,product_total_orders_delta_per_user,product_avg_add_to_cart_order_delta_per_user,product_avg_order_dow_per_user,product_avg_order_hour_of_day_per_user,product_avg_days_since_prior_order_per_user
0,1,138,3.0,2,11.5,6.0,14.0,{42475},0,1852,...,55,10.4,3.040541,12.689189,0.0625,1850,2.801836,-3.223542,-0.761879,-1.067275
1,907,138,2.5,2,6.0,5.0,13.5,{42475},0,2025,...,55,10.4,3.040541,12.689189,0.0625,2023,1.153333,-1.969877,-0.525679,10.501616


In [24]:
df_T.shape

(1987019, 26)

#### One hot encoding of variable Department

The department name for each product id is obtained and merged into df_T. The categorical variable (department) is then one-hot encoded.


In [6]:
prod_dept_df = products_df.merge(departments_df, on = 'department_id')
prod_dept_df = prod_dept_df[['product_id', 'department']]
prod_dept_df.head()

Unnamed: 0,product_id,department
0,1,snacks
1,16,snacks
2,25,snacks
3,32,snacks
4,41,snacks


In [26]:
df_T = df_T.merge(prod_dept_df, on = 'product_id')
df_T = df_T.dropna()
df_T = pd.concat([df_T, pd.get_dummies(df_T['department'])], axis=1)
df_T.head()

Unnamed: 0,product_id,user_id,user_product_avg_add_to_cart_order,user_product_total_orders,user_product_avg_days_since_prior_order,user_product_avg_order_dow,user_product_avg_order_hour_of_day,latest_cart,in_cart,product_total_orders,...,household,international,meat seafood,missing,other,pantry,personal care,pets,produce,snacks
0,1,138,3.0,2,11.5,6.0,14.0,{42475},0,1852,...,0,0,0,0,0,0,0,0,0,1
1,1,1540,7.0,17,8.294118,1.235294,14.529412,"{37600, 1, 11266, 10310, 130, 40199, 6184, 396...",1,1852,...,0,0,0,0,0,0,0,0,0,1
2,1,8703,13.0,1,1.0,4.0,13.0,"{18023, 16714, 38928, 22802, 49235, 12341, 43352}",0,1852,...,0,0,0,0,0,0,0,0,0,1
3,1,14806,5.0,1,22.0,2.0,12.0,{40939},0,1852,...,0,0,0,0,0,0,0,0,0,1
4,1,15175,4.8,10,7.3,4.3,11.9,"{21572, 10310, 6184, 47402, 37710, 15672, 3077...",0,1852,...,0,0,0,0,0,0,0,0,0,1
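
The encoding step can be seen in isolation on a toy frame. This is a minimal sketch with assumed values; it drops the original column in the same step instead of using a separate `del`.

```python
import pandas as pd

# Toy product-to-department mapping (illustrative values).
df = pd.DataFrame({'product_id': [1, 2, 3],
                   'department': ['snacks', 'pantry', 'snacks']})

# One indicator column per department; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(df['department'], dtype=int)

# Replace the categorical column with its indicator columns.
encoded = pd.concat([df.drop(columns='department'), dummies], axis=1)
print(encoded)
```

Each row has exactly one indicator set to 1, matching the pattern in the output above where only the snacks column is 1 for product 1.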


In [27]:
del df_T['department']

In [29]:
df_T=df_T.dropna()

Thus the test users now have all the associated user, product and user-product features, which will be used to predict the products in their test orders. These features had to be added to the test data because in the notebook 2.Instacart-feature_engineering and flat file creation.ipynb the same features were created from the training data and are used to train the model. The test data needs the same features, since the predictions are based on them.

In [30]:
df_T.to_csv('flatfile_testdata.csv')
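
Because the model is trained on one flat file and evaluated on another, the two need the same feature columns in the same order. The sketch below shows one way to align them. It assumes the training flat file has the same kind of columns; the toy frames `train` and `test` and their values are hypothetical.

```python
import pandas as pd

# Toy flat files: the test frame lacks the 'pantry' indicator and has a different order.
train = pd.DataFrame({'user_total_orders': [5, 9],
                      'pantry': [1, 0],
                      'snacks': [0, 1],
                      'in_cart': [0, 1]})
test = pd.DataFrame({'snacks': [1],
                     'user_total_orders': [3],
                     'in_cart': [0]})

# Feature columns are everything in the training frame except the label.
feature_cols = [c for c in train.columns if c != 'in_cart']

# reindex reorders columns to match training and fills missing indicators with 0.
X_test = test.reindex(columns=feature_cols, fill_value=0)
print(X_test)
```

Filling with 0 is only appropriate for indicator columns such as the one-hot department flags; a missing numeric feature would need a different treatment.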