# Exploratory Data Analysis: Instacart

## Problem Statement

The Instacart data set is anonymized and contains samples of over 3 million grocery orders from 200,000+ Instacart users. The objective is to predict which previously purchased products (prior orders) will be in a user's next order (train and test orders) in order to explore the kinds of food Americans eat.

## Metrics and Assumptions

### Assumption:


- Option 1: The frequency of product orders within the best selling aisles in each department may indicate which previously purchased products will be in the Instacart user’s next order.
- Goal: predict which previously purchased products (prior orders) will be in a user's next order (train and test orders).
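As a rough illustration of the assumption above, the best-selling aisle in each department can be found by counting order lines per aisle. This is a minimal sketch on hypothetical toy data: the `lines` frame and its values are made up, and it assumes the order lines have already been merged with aisle and department names.

```python
import pandas as pd

# Toy stand-in for order lines merged with aisle and department names
# (hypothetical values, not taken from the Instacart files).
lines = pd.DataFrame({
    "department": ["produce", "produce", "produce",
                   "dairy eggs", "dairy eggs", "dairy eggs"],
    "aisle": ["fresh fruits", "fresh fruits", "fresh vegetables",
              "yogurt", "yogurt", "milk"],
    "product_id": [1, 2, 3, 4, 5, 6],
})

# Count order lines per (department, aisle).
aisle_counts = (
    lines.groupby(["department", "aisle"])
    .size()
    .reset_index(name="n_orders")
)

# Keep the aisle with the highest count in each department.
best_idx = aisle_counts.groupby("department")["n_orders"].idxmax()
best_aisles = aisle_counts.loc[best_idx].reset_index(drop=True)
print(best_aisles)
```

On the real data, the same grouping could be applied to the prior order lines once they carry `aisle` and `department` columns.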

#### NOTES:
- previously purchased products (prior orders)
- user's next order (train and test orders)
- train orders contains 'ordered products' info while test orders does not.
- This is a classification problem: for each user-product pair, we need to predict whether the product is a reorder or not.
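The user-product framing in the notes can be sketched as follows. This is only an illustration on hypothetical toy frames (`prior` and `train` are made up here); it assumes each order line already carries a `user_id`, which in the real files would first require joining the order lines with `orders.csv`.

```python
import pandas as pd

# Hypothetical toy data: products each user bought in prior orders ...
prior = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2],
    "product_id": [10, 11, 10, 10, 12],
})
# ... and the contents of each user's next (train) order.
train = pd.DataFrame({
    "user_id":    [1, 2],
    "product_id": [10, 13],
})

# Candidates: every distinct (user, product) pair seen in prior orders.
pairs = prior.drop_duplicates().reset_index(drop=True)

# Label is 1 if the pair also appears in the user's next order, else 0.
labeled = pairs.merge(
    train.assign(label=1), on=["user_id", "product_id"], how="left"
)
labeled["label"] = labeled["label"].fillna(0).astype(int)
print(labeled)
```

Product 13 never appears among the candidates because user 2 had not bought it before, which matches the "previously purchased products" restriction in the problem statement.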

## Setup

### Importing Necessary Modules

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import re
import seaborn as sns
color = sns.color_palette()

# Limit floats output to 3 decimal points
pd.set_option('display.float_format', lambda x: '%.3f' % x)

plt.style.use('fivethirtyeight')
%matplotlib inline 

# Suppress unnecessary warnings for readability and cleaner presentation
import warnings
warnings.filterwarnings('ignore') 

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats

## Loading CSV files into Dataframes

#### List of Files:

In [None]:
from subprocess import check_output
print(check_output(["ls", "../input/instacart-market-basket-analysis/"]).decode("utf8"))

In [None]:
order_products_train_df = pd.read_csv("../input/instacart-market-basket-analysis/order_products__train.csv")
order_products_prior_df = pd.read_csv("../input/instacart-market-basket-analysis/order_products__prior.csv")
orders_df = pd.read_csv("../input/instacart-market-basket-analysis/orders.csv")
products_df = pd.read_csv("../input/instacart-market-basket-analysis/products.csv")
aisles_df = pd.read_csv("../input/instacart-market-basket-analysis/aisles.csv")
departments_df = pd.read_csv("../input/instacart-market-basket-analysis/departments.csv")

In [None]:
orders_df.head()

In [None]:
# Check for missing values in each column
orders_df.isnull().sum()

In [None]:
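# Set NaN to zeros.
# 'days_since_prior_order' is NaN for a user's first order, so after this step
# first orders are indistinguishable from same-day reorders.
orders_df = orders_df.fillna(0)
orders_df.isnull().sum()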

### Explore Data

In [None]:
orders_df.head()

In [None]:
products_df.head()

In [None]:
departments_df.head()

In [None]:
aisles_df.head()

In [None]:
order_products_prior_df.head()

In [None]:
order_products_train_df.head()

In [None]:
print(aisles_df.shape, products_df.shape, departments_df.shape, order_products_prior_df.shape,
      order_products_train_df.shape, orders_df.shape)

#### Visualizing Product Portfolio
- Orders in Dataset (prior, train, test)
- Number of orders from three datasets.
- Bar graph for comparison.

In [None]:
orders_df.columns

In [None]:
# Named aggregation; dict-based renaming in .aggregate() was removed in pandas 1.0
combine_dataset = orders_df.groupby('eval_set')['order_id'].agg(Total_orders='count').reset_index()

combine_dataset

In [None]:
combine_dataset  = combine_dataset.groupby(['eval_set']).sum()['Total_orders'].sort_values(ascending=False)

sns.set_style('whitegrid')
f, ax = plt.subplots(figsize=(10,10))
sns.barplot(x=combine_dataset.index, y=combine_dataset.values, palette="RdBu")
plt.ylabel('Number of Orders', fontsize=14)
plt.title('Types of Datasets', fontsize=16)
plt.show()

#### Combine Departments and Aisles Dataframes:

In [None]:
# Inner joins of 'products_df' with 'departments_df' and 'aisles_df',
# matched on the 'department_id' and 'aisle_id' key columns.
# (Joining on the default index would misalign the IDs.)

product_combine = products_df.merge(departments_df, on='department_id', how='inner')
product_combine = product_combine.merge(aisles_df, on='aisle_id', how='inner')


In [None]:
product_combine.head()

In [None]:
"""
product_combine = product_combine.reset_index().set_index('product_id')
product_combine.sort_index(axis=0, ascending= True, kind= 'quicksort', inplace= True)
"""

In [None]:
order_products_train_df.head()

#### Frequency of Orders by Days of the Week:

In [None]:
orders_df.columns.values

In [None]:
# Each row of orders_df is one order, so counting rows per day gives orders per day
dayofweek = orders_df['order_dow'].value_counts()

In [None]:
sns.set_style('darkgrid')
f, ax = plt.subplots(figsize=(10, 10))
sns.barplot(x=dayofweek.index, y=dayofweek.values, palette="RdBu")
plt.ylabel('Number of Orders', fontsize=13)
plt.xlabel('Days of Order in a Week', fontsize=13)
plt.title('Number of Orders from Each Day of the Week', fontsize = 16)
plt.show()

- Days '0' and '1' have the most orders. The data does not say which weekday each number stands for; this notebook takes days 0 and 1 to be the weekend.
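To compare days on a common scale, the counts can be turned into shares of all orders. The sketch below uses a hypothetical toy frame (`toy_orders`) rather than `orders_df`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy orders: one row per order, with its day-of-week code.
toy_orders = pd.DataFrame({"order_dow": [0, 0, 0, 1, 1, 2, 3, 4, 5, 6]})

# normalize=True returns the fraction of orders per day instead of raw counts.
dow_share = toy_orders["order_dow"].value_counts(normalize=True).sort_index()
print(dow_share)
```

Sorting by index keeps the days in calendar order rather than ordering them by frequency.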

#### Time of Day with Most Orders:

In [None]:
orders_df.columns.values

In [None]:
orders_df.head()

In [None]:
# Each row of orders_df is one order, so counting rows per hour gives orders per hour
timeofday = orders_df['order_hour_of_day'].value_counts()


In [None]:
sns.set_style('darkgrid')
f, ax = plt.subplots(figsize=(15,10))
sns.barplot(x=timeofday.index, y=timeofday.values, palette="Blues_d")
plt.ylabel('Number of Orders', fontsize=14)
plt.xlabel('Time of Day of Orders', fontsize=14)
plt.show()

Peak hours for orders run from 8:00 to 16:00 (4:00 pm).

#### Continuous Bivariate Density of Orders between Day of Week and Hour of Day:

In [None]:
# Selecting a small sample size for kernel density axes 
smallset = orders_df[0:100000]

# Use KDE plot to depict the probability densities at different values in continuous variable.

In [None]:
day_vs_hours = sns.jointplot(x="order_hour_of_day", y="order_dow", data=smallset, kind="kde", color="dodgerblue")
day_vs_hours.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
day_vs_hours.ax_joint.collections[0].set_alpha(0)
day_vs_hours.set_axis_labels("Hour of Day (24 hour format)", "Day of the week")

In [None]:
day_vs_sincepriororder = sns.jointplot(x="days_since_prior_order", y="order_dow", data=smallset, kind="kde", color="dodgerblue")
day_vs_sincepriororder.set_axis_labels("Days Since Last Order", "Day of Week")

The order density in 'day_vs_sincepriororder' appears to peak at 7 days and 30 days since the prior order.

In [None]:
orders_df.columns

In [None]:
# Count the number of orders for each value of 'days_since_prior_order'
prior_order_dist = orders_df['days_since_prior_order'].value_counts().sort_index()

prior_order_dist.head()

In [None]:
sns.set_style('darkgrid')
f, ax = plt.subplots(figsize=(15,10))
sns.barplot(x=prior_order_dist.index, y=prior_order_dist.values, palette="Reds_d")
plt.ylabel('Number of Orders', fontsize=14)
plt.xlabel('Days Since Last Order', fontsize=14)
plt.show()

### Department Distribution

In [None]:
# Merging product_id, aisle_id, department_id from products_df, aisles_df, departments_df into order_products_prior_df

# This will allow me to pull out and aggregate column values to generate product distribution by department.


order_products_prior_df = pd.merge(order_products_prior_df, products_df, on='product_id', how='left')
order_products_prior_df = pd.merge(order_products_prior_df, aisles_df, on='aisle_id', how='left')
order_products_prior_df = pd.merge(order_products_prior_df, departments_df, on='department_id', how='left')
order_products_prior_df.head()

In [None]:
plt.figure(figsize=(10,10))
temp_series = order_products_prior_df['department'].value_counts()
labels = (np.array(temp_series.index))
sizes = (np.array((temp_series / temp_series.sum())*100))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=200)
plt.title("Departments Distribution", fontsize=15)
plt.show()

### Reorders

#### Frequency of Reorders of Previously Ordered Products:

In [None]:
order_products_prior_df.head()

In [None]:
order_products_prior_df.columns

In [None]:
# Named aggregation; dict-based renaming in .aggregate() was removed in pandas 1.0
freq_rereorder = order_products_prior_df.groupby('reordered')['product_id'].agg(Total_Products='count').reset_index()

# Share of all ordered products that fall into each 'reordered' class
freq_rereorder['Ratios'] = freq_rereorder['Total_Products'] / freq_rereorder['Total_Products'].sum()

freq_rereorder

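The ratio shows that about 59% of the products in prior orders are reorders, i.e. items the customer had already purchased before. Note that this is a share of ordered products, not of customers.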
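Because 'reordered' is a 0/1 flag, the same ratio is simply the mean of that column, and grouping by product gives a per-product reorder rate. A minimal sketch on hypothetical toy data (the `toy_lines` frame is made up for illustration):

```python
import pandas as pd

# Hypothetical toy order lines with a binary 'reordered' flag.
toy_lines = pd.DataFrame({
    "product_id": [10, 10, 11, 12, 12],
    "reordered":  [1, 1, 0, 1, 0],
})

# Overall reorder ratio: the mean of a 0/1 column is the share of ones.
overall_ratio = toy_lines["reordered"].mean()

# Per-product reorder rate, highest first.
per_product = (
    toy_lines.groupby("product_id")["reordered"]
    .mean()
    .sort_values(ascending=False)
)
print(overall_ratio)
print(per_product)
```

Products with very few order lines can show extreme rates, so in practice a minimum order count might be applied before ranking.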

In [None]:
freq_rereorder  = freq_rereorder.groupby(['reordered']).sum()['Total_Products'].sort_values(ascending=False)

sns.set_style('whitegrid')
f, ax = plt.subplots(figsize=(5, 8))
sns.barplot(x=freq_rereorder.index, y=freq_rereorder.values, palette='muted')
plt.ylabel('Number of Products', fontsize=13)
plt.xlabel('Reorder Frequency', fontsize=13)
plt.ticklabel_format(style='plain', axis='y')

plt.show()

#### Most Reordered Products:

In [None]:
order_products_train_df.head(5)

In [None]:
order_products_prior_df.head(5)

In [None]:
# Combine files together via concatenation in dataframe 'order_products_all'
# Double check new sum.

order_products_all = pd.concat([order_products_train_df, order_products_prior_df], axis = 0)

print("order_products_all size is : ", order_products_all.shape)

In [None]:
order_products_all.columns

In [None]:
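# Aggregate per product_id: Reorder_Sum (number of reorders) and Reorder_Total (number of order lines).
# Named aggregation; dict-based renaming in .aggregate() was removed in pandas 1.0.
mostreordered = order_products_all.groupby('product_id')['reordered'].agg(Reorder_Sum='sum', Reorder_Total='count').reset_index()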

# Add column for probability for reorder for each product_id:
mostreordered['Probability_of_Reorder'] = mostreordered['Reorder_Sum']/mostreordered['Reorder_Total']

mostreordered

In [None]:
# Add product names associated with their IDs:
mostreordered = pd.merge(mostreordered, products_df[['product_id','product_name']], on='product_id')

# Sort from highest probability:
mostreordered = mostreordered.sort_values(['Probability_of_Reorder'], ascending=False)

mostreordered

In [None]:
order_products_all.columns

In [None]:
order_products_prior_df.columns