# Instacart: Forecasting (Part II)

by [Raul Maldonado](https://www.linkedin.com/in/raulm8)


**STILL IN DEVELOPMENT**

![InstaCart Logo](https://bloximages.chicago2.vip.townnews.com/pinalcentral.com/content/tncms/assets/v3/editorial/e/4c/e4cb9197-ddce-59e1-93c7-2868e145c705/5b858c3f5e8de.image.jpg?resize=400%2C212)

## 1. Introduction

**Background**:

Instacart, a grocery ordering and delivery app, aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples when you need them. After selecting products through the Instacart app, personal shoppers review your order and do the in-store shopping and delivery for you.


In its Kaggle competition, Instacart challenged the Kaggle community to use anonymized data on customer orders over time to predict which previously purchased products will be in a user’s next order.


### Import

In [1]:
import os

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib.style as style
style.use('seaborn-poster')  # sets chart sizes; renamed 'seaborn-v0_8-poster' in matplotlib >= 3.6
style.use('ggplot')

plt.rcParams["axes.grid"] = False

import seaborn as sns

# Personal (local-only) settings live in the Resources/Admin folder;
# add that folder to the import path.
import sys
sys.path.append("../Resources/Admin")
path = os.path.join('..','Resources','Data','RawData')

In [2]:
aislesDf = pd.read_csv(f'{path}/aisles.csv')
departmentsDf = pd.read_csv(f'{path}/departments.csv')
productsDf = pd.read_csv(f'{path}/products.csv')
ordersDf = pd.read_csv(f'{path}/orders.csv')
order_products_prior = pd.read_csv(f'{path}/order_products__prior.csv')
order_products_train = pd.read_csv(f'{path}/order_products__train.csv')

In [3]:
sampleSubmission_df = pd.read_csv(f'{path}/sample_submission.csv')

In [4]:
sampleSubmission_df.head(1)

Unnamed: 0,order_id,products
0,17,39276 29259


In [59]:
ordersDf_prior = ordersDf[ordersDf['eval_set']=='prior']

In [52]:
ordersDf.columns

Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='object')

In [11]:
ordersDf_train = ordersDf[ordersDf['eval_set']=='train']

In [54]:
orderProducts_trainDf = ordersDf_train.merge(order_products_train, how='inner', on='order_id')

reorder_dataset = orderProducts_trainDf[['reordered','order_dow','order_hour_of_day']]

In [55]:
reorder_dataset.head()

Unnamed: 0,reordered,order_dow,order_hour_of_day
0,1,4,8
1,1,4,8
2,1,4,8
3,1,4,8
4,1,4,8


In [69]:
X = reorder_dataset[['order_dow','order_hour_of_day']]
y = reorder_dataset['reordered']

In [70]:
X = pd.get_dummies(data= X, columns =['order_dow'])
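
To illustrate what the `pd.get_dummies` call above does to `order_dow`, here is a minimal sketch on a made-up three-row frame. The values are invented for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Toy stand-in for the real feature frame (values are invented).
X_toy = pd.DataFrame({
    "order_dow": [0, 1, 6],
    "order_hour_of_day": [8, 9, 10],
})

# Each distinct day of the week becomes its own indicator column;
# order_hour_of_day is passed through unchanged.
encoded = pd.get_dummies(data=X_toy, columns=["order_dow"])
dummy_cols = [c for c in encoded.columns if c.startswith("order_dow_")]

print(list(encoded.columns))
# Every row has exactly one active day-of-week indicator.
print(encoded[dummy_cols].sum(axis=1).tolist())
```

Only the days that actually appear in the data get a column, so a sample that misses a weekday produces fewer indicator columns than the full dataset would.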


In [64]:
y = y.values.reshape(-1,1)

In [71]:
y.shape

(1384617,)

In [78]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    test_size=0.25, random_state=12)
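
As a small sketch of what `stratify=y` contributes to the split above, the example below uses an invented 60/40 label vector and checks that both partitions keep the same class share. The data is hypothetical; only the `train_test_split` arguments mirror the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented labels: 60 positives and 40 negatives.
y_toy = np.array([1] * 60 + [0] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.25, random_state=12
)

# With stratification, the positive share is preserved in both partitions.
train_share = y_tr.mean()
test_share = y_te.mean()
print(train_share, test_share)
```

Without `stratify`, the two shares would generally differ from split to split, which matters more as the classes become more imbalanced.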

In [79]:
from sklearn.linear_model import LogisticRegression

In [80]:
model_logReg = LogisticRegression()

In [81]:
model_logReg.fit(X_train,y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [84]:
predict_logReg = model_logReg.predict(X_test)

In [82]:
model_logReg.score(X_test,y_test)

0.5985931158007252

In [83]:
from sklearn.metrics import classification_report

In [86]:
print(classification_report(y_test, predict_logReg))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00    138949
           1       0.60      1.00      0.75    207206

   micro avg       0.60      0.60      0.60    346155
   macro avg       0.30      0.50      0.37    346155
weighted avg       0.36      0.60      0.45    346155





In [87]:
# The F1-score for the non-reordered class (0) is 0.00: the model predicts "reordered" (1) for every row.
# Its 0.60 accuracy therefore only reflects the share of reordered items in the test set.

# We need a different model (or better features) that performs well on both classes, not just one.
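
The arithmetic behind that observation can be checked directly. The sketch below rebuilds a label vector from the support counts in the classification report (138949 and 207206) and scores a hypothetical "always predict reordered" rule against it; no model is involved.

```python
import numpy as np

# Support counts copied from the classification report.
n_not_reordered, n_reordered = 138949, 207206

y_true = np.concatenate([
    np.zeros(n_not_reordered, dtype=int),
    np.ones(n_reordered, dtype=int),
])

# Hypothetical baseline: predict class 1 for every row.
y_pred = np.ones_like(y_true)

baseline_accuracy = (y_pred == y_true).mean()
# Recall for class 0: correctly predicted 0s divided by all true 0s.
recall_class0 = ((y_pred == 0) & (y_true == 0)).sum() / n_not_reordered

print(round(baseline_accuracy, 4), recall_class0)
```

The baseline accuracy equals the logistic regression score reported earlier, which suggests the fitted model adds nothing beyond the class prior with these two features.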

## 3. Forecasting

**Objective:** Predicting product items in the next cart.

Based on the sample submission file, each prediction is a space-separated list of product IDs for a single order.

Example:

| Order ID | Products         |
| :------- | ---------------: |
| 123      |        9987 9912 |
| 124      | 9965 99123 12387 |
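
A minimal sketch of how per-row predictions could be collapsed into that format, assuming a long table with one `(order_id, product_id)` pair per predicted item. The frame below is invented to reproduce the example table; the `products` column name follows the sample submission.

```python
import pandas as pd

# Hypothetical long-format predictions: one row per predicted product.
predicted = pd.DataFrame({
    "order_id": [123, 123, 124, 124, 124],
    "product_id": [9987, 9912, 9965, 99123, 12387],
})

# Join each order's product IDs into one space-separated string.
submission = (
    predicted.groupby("order_id", sort=True)["product_id"]
    .apply(lambda ids: " ".join(str(i) for i in ids))
    .reset_index(name="products")
)

print(submission)
```

The result can then be written out with `submission.to_csv(..., index=False)`.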

In [89]:
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()



In [91]:
ordersDf_train = ordersDf[ordersDf['eval_set']=='train']


In [92]:
reorder_df = order_products_train.merge(ordersDf_train, on='order_id', how='left')\
                    .merge(productsDf, on='product_id', how='left')
print("Data's Dimensions: ", reorder_df.shape)

Data's Dimensions:  (1384617, 13)


In [93]:
reorder_df.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_name,aisle_id,department_id
0,1,49302,1,1,112108,train,4,4,10,9.0,Bulgarian Yogurt,120,16
1,1,11109,2,1,112108,train,4,4,10,9.0,Organic 4% Milk Fat Whole Milk Cottage Cheese,108,16
2,1,10246,3,0,112108,train,4,4,10,9.0,Organic Celery Hearts,83,4
3,1,49683,4,0,112108,train,4,4,10,9.0,Cucumber Kirby,83,4
4,1,43633,5,1,112108,train,4,4,10,9.0,Lightly Smoked Sardines in Olive Oil,95,15


In [94]:
reorder_df.columns

Index(['order_id', 'product_id', 'add_to_cart_order', 'reordered', 'user_id',
       'eval_set', 'order_number', 'order_dow', 'order_hour_of_day',
       'days_since_prior_order', 'product_name', 'aisle_id', 'department_id'],
      dtype='object')

In [95]:
# Preliminary run before the main training: fit on a random sample of 2,000 rows.
reorder_df = reorder_df.sample(n=2000, random_state=42)

features = reorder_df[['order_id', 'days_since_prior_order', 'order_hour_of_day',
                       'aisle_id', 'department_id', 'reordered']]

target = reorder_df['product_id']

In [96]:
model.fit(features,target)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [97]:
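# Accuracy on the same 2,000 rows the tree was fitted on, so a high score mostly reflects memorization.
model.score(features,target)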

0.9995
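
A score this close to 1.0 is what an unrestricted decision tree tends to produce on its own training rows. The sketch below illustrates the effect on purely random, invented data (no relation to the Instacart tables): the tree fits the training rows perfectly while held-out accuracy stays near chance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Invented data: continuous features and labels with no relationship at all.
X_rand = rng.random((400, 5))
y_rand = rng.integers(0, 50, size=400)  # 50 arbitrary "product" classes

X_tr, X_te, y_tr, y_te = train_test_split(
    X_rand, y_rand, test_size=0.25, random_state=42
)

# No depth limit, as in the notebook: the tree can isolate every training row.
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_accuracy = tree.score(X_tr, y_tr)
test_accuracy = tree.score(X_te, y_te)
print(train_accuracy, test_accuracy)
```

Scoring on held-out rows, or limiting `max_depth`, gives a more honest picture of how the model would behave on unseen orders.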

In [100]:
reorder_df_prior_demo = order_products_prior.merge(ordersDf_prior, on='order_id', how='left')\
                    .merge(productsDf, on='product_id', how='left')
print("Data's Dimensions: ", reorder_df_prior_demo.shape)

reorder_df_prior_demo = reorder_df_prior_demo.sample(n=2000,random_state=42)


features_demo = reorder_df_prior_demo[['order_id','days_since_prior_order','order_hour_of_day',\
                       'aisle_id','department_id','reordered']]
                      
target_demo= reorder_df_prior_demo['product_id']

Data's Dimensions:  (32434489, 13)


In [101]:
features_demo[:10]

Unnamed: 0,order_id,days_since_prior_order,order_hour_of_day,aisle_id,department_id,reordered
29481110,3109255,8.0,19,104,13,0
2852353,301098,1.0,15,83,4,0
11194500,1181866,8.0,17,24,4,0
15909397,1678630,26.0,14,115,7,1
6101870,644090,30.0,19,75,17,0
5278828,557169,10.0,12,129,1,1
6983365,737337,8.0,13,24,4,1
13169449,1389978,16.0,20,45,19,1
4203220,443457,6.0,9,126,11,1
15630910,1649178,17.0,8,120,16,1


In [102]:
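# NOTE: these columns differ from the ones passed to model.fit ('order_dow' replaces 'reordered' and shifts
# the later columns), so the tree reads values as the wrong features; features_demo[:10] matches the training layout.
prediction_results = model.predict(reorder_df_prior_demo[['order_id','days_since_prior_order','order_dow','order_hour_of_day',\
                       'aisle_id','department_id']][:10])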
prediction_results

array([ 9837, 47222, 42585, 41588,  9837, 41362, 41362,  9837, 41362,
        6867])

In [103]:
from sklearn.metrics import classification_report


In [104]:
reorder_df_prior_demo.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order,product_name,aisle_id,department_id
29481110,3109255,34099,16,0,135284,prior,9,0,19,8.0,Crushed Red Chili Pepper,104,13
2852353,301098,41950,5,0,7293,prior,2,4,15,1.0,Organic Tomato Cluster,83,4
11194500,1181866,45066,8,0,111385,prior,2,1,17,8.0,Honeycrisp Apple,24,4
15909397,1678630,8859,2,1,147365,prior,7,0,14,26.0,Natural Spring Water,115,7
6101870,644090,24781,2,0,99290,prior,7,0,19,30.0,"PODS Laundry Detergent, Ocean Mist Designed fo...",75,17


In [115]:
productsDf.loc[productsDf['product_id'].isin(prediction_results)][['product_id','product_name']]

Unnamed: 0,product_id,product_name
3797,3798,Pink Lady Apples
5604,5605,Sesame Seed
8173,8174,Organic Navel Orange
15289,15290,Orange Bell Pepper
23636,23637,Maple & Brown Sugar High Fiber Instant Oatmeal
33197,33198,Sparkling Natural Mineral Water
33456,33457,Chicken Breast Nuggets
37946,37947,Organic Peach Lowfat Yogurt
39957,39958,Organics Gummy Bears
46021,46022,Baked Whole Grain Wheat Original Crackers Thin...


(TODO): How should I continue? 
1. Feature Selection
2. Feature Engineering (Ratios)
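
One possible reading of "Feature Engineering (Ratios)" is a per-product reorder ratio: how often a product is reordered relative to how often it is ordered at all. The sketch below is an assumption about what such a feature could look like, computed on an invented toy frame; the names `times_ordered`, `times_reordered`, and `reorder_ratio` are hypothetical.

```python
import pandas as pd

# Invented stand-in for the prior order-product rows.
prior_toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "product_id": [10, 10, 20, 10, 30],
    "reordered": [0, 1, 0, 0, 0],
})

# Count orders and reorders per product, then take their ratio.
product_stats = (
    prior_toy.groupby("product_id")["reordered"]
    .agg(times_ordered="count", times_reordered="sum")
    .reset_index()
)
product_stats["reorder_ratio"] = (
    product_stats["times_reordered"] / product_stats["times_ordered"]
)

print(product_stats)
```

The same pattern, grouped by `user_id` or by `(user_id, product_id)`, would yield user-level and user-product-level ratios that could be merged back onto the training rows.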
