The dataset for this project is a relational set of files describing customers' orders over time. The goal of this project is to predict which products will be in a user's next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 **Instacart** users. For each user, we are provided between 4 and 100 of their orders, with the sequence of products purchased in each order. We are also provided the day of the week and hour of day each order was placed, and a relative measure of time between orders.

In [None]:
#Importing useful libraries
import pandas as pd
import numpy as np
import datetime
import plotly.graph_objs as go 
from plotly.offline import init_notebook_mode,iplot,plot
init_notebook_mode(connected=True) 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
from subprocess import check_output
print(check_output(["ls", "../input/instacart-market-basket-analysis"]).decode("utf8"))

In [None]:
import zipfile

zf1 = zipfile.ZipFile('../input/instacart-market-basket-analysis/aisles.csv.zip')
zf2 = zipfile.ZipFile('../input/instacart-market-basket-analysis/departments.csv.zip')
zf3 = zipfile.ZipFile('../input/instacart-market-basket-analysis/products.csv.zip')
zf4 = zipfile.ZipFile('../input/instacart-market-basket-analysis/order_products__prior.csv.zip')
zf5 = zipfile.ZipFile('../input/instacart-market-basket-analysis/order_products__train.csv.zip')
zf6 = zipfile.ZipFile('../input/instacart-market-basket-analysis/orders.csv.zip')


In [None]:
aisles=pd.read_csv(zf1.open('aisles.csv'))
departments=pd.read_csv(zf2.open('departments.csv'))
products=pd.read_csv(zf3.open('products.csv'))
order_products_prior=pd.read_csv(zf4.open('order_products__prior.csv'))
order_products_train=pd.read_csv(zf5.open('order_products__train.csv'))
orders=pd.read_csv(zf6.open('orders.csv'))
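
As an alternative to opening each archive with `zipfile`, pandas can read a zipped CSV directly when the archive holds a single file. The sketch below builds a tiny made-up `aisles.csv.zip` in a temporary directory (the path and rows are assumptions for illustration, not the real data) and reads it in one call.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny stand-in for aisles.csv.zip so the example runs anywhere
# (the rows here are made up for illustration).
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "aisles.csv.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("aisles.csv", "aisle_id,aisle\n1,toy aisle a\n2,toy aisle b\n")

# pandas infers zip compression from the ".zip" suffix when the archive
# contains exactly one file, so no explicit ZipFile handle is needed.
aisles_demo = pd.read_csv(zip_path)
print(aisles_demo)
```

With the real files, the same call would take the `../input/...csv.zip` path instead of `zip_path`.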

In [None]:
aisles.head()

In [None]:
aisles.info()  #No missing values

In [None]:
departments.head()

In [None]:
departments.info() #no missing values

In [None]:
products.head()

In [None]:
products.info() #no missing values

In [None]:
order_products_prior.head()

In [None]:
order_products_prior.info()

In [None]:
order_products_train.head()

In [None]:
order_products_train.info() #no missing values

In [None]:
orders.head()

In [None]:
orders.info()

### **Checking for missing values in the dataframes order_products_prior and orders**

In [None]:
order_products_prior.isnull().sum() #No missing Values

In [None]:
missing_values = orders.isnull().sum() #MISSING VALUES

In [None]:
percentage = missing_values/orders.isnull().count()
percentage

### Values are missing because for every user's first order (order_number = 1) days_since_prior_order is NaN, which makes sense. We could impute 0 here, or, since the missing values affect only about 6% of rows, drop these rows and carry on with the analysis. I will use the second approach.

In [None]:
orders= orders[orders['days_since_prior_order'].notnull()]

In [None]:
orders.isnull().sum() #DataFrame after removing the null values
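
The two options for the missing `days_since_prior_order` values can be compared on a small made-up frame. This is only a sketch: `orders_demo` and its values are hypothetical stand-ins for `orders`. It shows that dropping the rows removes exactly one order per user, which is worth keeping in mind for any later per-user counts.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `orders`: two users, first order of each has no prior gap.
orders_demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 7.0, 14.0, np.nan, 30.0],
})

# Option 1: keep every order and impute 0 for the first order.
imputed = orders_demo.fillna({"days_since_prior_order": 0})

# Option 2 (the one used in this notebook): drop rows with a missing gap.
dropped = orders_demo[orders_demo["days_since_prior_order"].notnull()]

# Dropping removes one order per user, so per-user order counts shrink by 1.
print(imputed.groupby("user_id").size())
print(dropped.groupby("user_id").size())
```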

## Exploratory Data Analysis

### Analyzing the dataframe *orders*

In [None]:
#Column eval_set has 3 values- prior,train,test.
orders['eval_set'].value_counts()

In [None]:
sns.set_style('whitegrid')
plt.figure(figsize=(8,6))
ax= sns.countplot(x='eval_set', data= orders)
ax.set_title('Evaluation Set Type Vs Number of occurrences in Data Set', fontsize=16)

plt.show()

**Now let us check order_dow distribution**

In [None]:
plt.figure(figsize=(8,6))
ax= sns.countplot(x='order_dow', data= orders, palette='rocket')
ax.set_title('Day of week Vs Number of orders on that particular day', fontsize=16)

plt.show()

From the graph above we can see that the number of orders is highest on day 0, followed by day 1. The data only stores the codes 0-6 for order_dow; assuming 0 is Sunday and 1 is Monday, this makes sense, as people tend to buy groceries at the weekend or at the start of the week. Mid-week has fewer orders, with the minimum on day 4 (Thursday under the same assumption).

**Checking order_hour_of_day distribution**

In [None]:
plt.figure(figsize=(12,6))
ax= sns.countplot(x='order_hour_of_day', data= orders, palette='rocket')
ax.set_title('Hours of Day Vs Number of orders on that particular hour', fontsize=16)

plt.show()

The graph above clearly shows that orders peak in the morning around 10-11 AM and in the afternoon around 3-4 PM. Orders are lowest at night between 1 and 5 AM, when most people are asleep.

**Days_since_prior_order Distribution**

In [None]:
plt.figure(figsize=(12,6))
ax= sns.countplot(x='days_since_prior_order', data= orders, palette='rocket_r')
ax.set_title('Days since prior order Vs Number of orders', fontsize=16)

plt.show()

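From the graph it is clear that there is a peak at day 7, followed by smaller local peaks at days 14, 21 and 28, which suggests weekly shopping habits. The global peak is at day 30, which suggests a monthly pattern; this bar may also collect longer gaps if the values are capped at 30.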

**Now we will group orders according to day of week and hour of day for better visualization.**

In [None]:
orders_grouped = orders.groupby(['order_dow','order_hour_of_day'])['order_number'].aggregate('count').reset_index()
orders_grouped

In [None]:
#pivoting the data set for better visualization 
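#This table shows the number of orders for each day of week (rows) and hour of day (columns).
#Keyword arguments are used because recent pandas versions no longer accept positional ones here.
orders_grouped= orders_grouped.pivot(index='order_dow', columns='order_hour_of_day', values='order_number')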
orders_grouped

In [None]:
#Heatmap for visualization
plt.figure(figsize=(12,8))
sns.heatmap(orders_grouped, cmap='coolwarm')

From the heatmap above it is clear that orders peak on Sunday around 2 PM and on Monday around 10 AM (assuming order_dow 0 is Sunday).
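
Reading the busiest cells off a heatmap by eye can be cross-checked numerically. A minimal sketch, assuming a pivoted grid shaped like `orders_grouped` (the counts in `grid_demo` are made up): flatten the grid into a Series keyed by (hour, day) and ask for the largest entries.

```python
import pandas as pd

# Toy counts (made up): rows are order_dow, columns are order_hour_of_day.
grid_demo = pd.DataFrame(
    [[5, 40, 12], [7, 35, 9], [3, 10, 8]],
    index=pd.Index([0, 1, 2], name="order_dow"),
    columns=pd.Index([9, 10, 14], name="order_hour_of_day"),
)

# unstack() on a frame with a flat index returns a Series keyed by
# (column label, row label), i.e. (order_hour_of_day, order_dow) here.
flat = grid_demo.unstack()

# The two busiest (hour, day) combinations, largest first.
busiest = flat.nlargest(2)
print(busiest)

# idxmax() returns the key of the single largest cell.
peak_hour, peak_day = flat.idxmax()
print("peak at day", peak_day, "hour", peak_hour)
```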

In [None]:
plt.figure(figsize=(12,8))

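#numeric_only=True skips the string column eval_set, which recent pandas versions would otherwise reject
sns.heatmap( orders.corr(numeric_only=True), cmap='vlag', annot=True)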

### Merging order_products_prior with the dataframes departments, products and aisles and analyzing it

In [None]:
order_products_prior.head()

In [None]:
order_products_train.head()

In [None]:
#percentage of reordered products in order_products_prior
order_products_prior['reordered'].sum()/len(order_products_prior)

In [None]:
len(order_products_prior)

In [None]:
#percentage of reordered products in order_products_train
order_products_train['reordered'].sum()/len(order_products_train)

**Almost 60% of products are reordered in both the order_products_prior and order_products_train dataframes.**
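
Since `reordered` is a 0/1 flag, the share of reordered products is simply the mean of the column. A small sketch on made-up rows (`op_demo` is a hypothetical stand-in for `order_products_prior`) shows that the ratio and the mean give the same number.

```python
import pandas as pd

# Toy stand-in for order_products_prior (values made up).
op_demo = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2],
    "reordered": [1, 0, 1, 1, 0],
})

# For a 0/1 column, the mean is the same quantity as sum / row count.
rate_by_ratio = op_demo["reordered"].sum() / len(op_demo)
rate_by_mean = op_demo["reordered"].mean()
print(rate_by_ratio, rate_by_mean)
```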

In [None]:
#concat train order and prior orders
prior_train = pd.concat([order_products_prior, order_products_train]).sort_values(by=['order_id'])


In [None]:
prior_train.info()

In [None]:
#Merging with products dataframe
prior_train_orders = pd.merge(prior_train, products, on='product_id', how='left').sort_values(by=['order_id'])

In [None]:
prior_train_orders.head()

In [None]:
#merge with aisle and department

prior_train_orders = pd.merge(prior_train_orders, aisles, on='aisle_id', how='left')
prior_train_orders = pd.merge(prior_train_orders,departments, on='department_id', how='left')


In [None]:
prior_train_orders.head(5)

In [None]:
#Merging with dataframe orders
prior_train_orders = pd.merge(prior_train_orders, orders, on='order_id',how='left').sort_values(by=['order_id'])
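
A left merge keeps every product line even when its `order_id` has no match in `orders`, which can happen here because rows with a missing `days_since_prior_order` were removed from `orders` earlier. The sketch below, on made-up frames (`items_demo` and `orders_demo` are hypothetical), uses the `indicator` and `validate` options of `pd.merge` to surface such rows.

```python
import pandas as pd

# Toy stand-ins (values made up). Order 10 is missing from orders_demo,
# as a user's first order would be after the NaN rows were dropped.
items_demo = pd.DataFrame({"order_id": [10, 11, 11], "product_id": [1, 2, 3]})
orders_demo = pd.DataFrame({"order_id": [11], "user_id": [7], "eval_set": ["prior"]})

# indicator=True adds a `_merge` column; validate="m:1" raises if
# order_id is not unique on the right-hand side.
merged = pd.merge(items_demo, orders_demo, on="order_id",
                  how="left", indicator=True, validate="m:1")

# Rows flagged "left_only" found no matching order and carry NaN in the
# columns that come from orders_demo.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched), "row(s) without a matching order")
```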

In [None]:
prior_train_orders['eval_set'].value_counts()

In [None]:
col_order = ['user_id','order_id','product_id','aisle_id','department_id','add_to_cart_order',
 'reordered','product_name','aisle','department','eval_set','order_number','order_dow','order_hour_of_day','days_since_prior_order']

prior_train_orders = prior_train_orders[col_order]
prior_train_orders.head()

### Exploratory data analysis using the merged dataframe prior_train_orders

In [None]:
#Distribution of target Variable
target_var= prior_train_orders.groupby(['eval_set'])['reordered'].aggregate(['count','sum']).reset_index()
target_var

In [None]:
target_var['reordered_percentage']= target_var['sum']/target_var['count']
target_var

In [None]:
sns.barplot(x='eval_set', y='reordered_percentage' , data=target_var)

In [None]:
#How many orders were placed by every user

In [None]:
orders_per_user= orders.groupby('user_id')['order_id'].nunique().reset_index()
orders_per_user
#here order_id represents the number of unique orders for each user. We will plot this in a bar plot.

In [None]:
plt.figure(figsize=(30,15))
sns.countplot(x='order_id',data=orders_per_user)
plt.xticks(rotation='vertical')

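Because each user's first order was dropped earlier along with its missing days_since_prior_order, the counts plotted here are one lower than the number of orders provided per user (4 to 100). Very few users have placed more than 60 orders.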

In [None]:
#most frequently ordered / reordered products

In [None]:
reordered_products=prior_train_orders['product_name'].value_counts().reset_index().head(20)
reordered_products.columns=['product_name','frequency']
reordered_products

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x='product_name',y='frequency',data=reordered_products)
plt.xticks(rotation='vertical')

In [None]:
#From which aisle we got most orders/reorders

In [None]:
ordered_aisles=prior_train_orders['aisle'].value_counts().reset_index().head(20)
ordered_aisles.columns=['aisle_name','no_of_products_ordered']
ordered_aisles

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x='aisle_name',y='no_of_products_ordered',data=ordered_aisles)
plt.xticks(rotation='vertical')

In [None]:
reordered_aisles=prior_train_orders.groupby(['aisle'])['reordered'].aggregate('sum').sort_values(ascending=False).reset_index().head(20)
reordered_aisles.columns=['aisle_name','no_of_products_reordered']


In [None]:
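#divide by each aisle's own order count, matched by name rather than by row position
aisle_counts = prior_train_orders['aisle'].value_counts()
reordered_aisles['reordered_rate']= (reordered_aisles['no_of_products_reordered']
                                     /reordered_aisles['aisle_name'].map(aisle_counts))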

In [None]:
reordered_aisles.sort_values(by=['reordered_rate'], ascending=False, inplace=True)
reordered_aisles

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x='aisle_name',y='reordered_rate',data=reordered_aisles, alpha=0.7)
plt.xticks(rotation='vertical')

**Most reorders were placed from aisles such as fresh fruits, milk and water.**
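
A reorder rate is only meaningful when the numerator and denominator belong to the same aisle. One way to guarantee that, sketched here on made-up rows (`lines_demo` is a hypothetical stand-in for `prior_train_orders`), is to compute all three quantities in a single `groupby`.

```python
import pandas as pd

# Toy stand-in for prior_train_orders (values made up).
lines_demo = pd.DataFrame({
    "aisle": ["fresh fruits", "fresh fruits", "fresh fruits",
              "milk", "milk", "spices"],
    "reordered": [1, 1, 0, 1, 0, 0],
})

# One groupby yields the order count, reorder count, and rate per aisle,
# so both counts always refer to the same aisle.
aisle_stats = (
    lines_demo.groupby("aisle")["reordered"]
    .agg(no_of_products_ordered="count",
         no_of_products_reordered="sum",
         reordered_rate="mean")
    .sort_values("reordered_rate", ascending=False)
    .reset_index()
)
print(aisle_stats)
```

The same pattern would apply to departments by grouping on `department` instead of `aisle`.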

In [None]:
#From which department we got most orders/reorders

In [None]:
ordered_departments=prior_train_orders['department'].value_counts().reset_index().head(20)
ordered_departments.columns=['department_name','no_of_products_ordered']
ordered_departments

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x='department_name',y='no_of_products_ordered',data=ordered_departments)
plt.xticks(rotation='vertical')

It's clear from the graph above that most orders were placed from the departments produce and dairy eggs.

In [None]:
reordered_departments=prior_train_orders.groupby(['department'])['reordered'].aggregate('sum').sort_values(ascending=False).reset_index().head(20)
reordered_departments.columns=['department_name','no_of_products_reordered']

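#divide by each department's own order count, matched by name rather than by row position
department_counts = prior_train_orders['department'].value_counts()
reordered_departments['reorder_rate']= reordered_departments['no_of_products_reordered']/reordered_departments['department_name'].map(department_counts)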

reordered_departments.sort_values(by=['reorder_rate'], ascending=False, inplace=True)
reordered_departments

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x='department_name',y='reorder_rate',data=reordered_departments, alpha=0.7)
plt.xticks(rotation='vertical')

In [None]:
#cartsize of different orders

In [None]:
cartsize=prior_train_orders['order_id'].value_counts().reset_index()
cartsize.columns=['order_id','no_of_products_in_order']
cartsize

In [None]:
plt.figure(figsize=(20,10))
sns.histplot(x='no_of_products_in_order',data=cartsize,bins=70)
plt.xticks(rotation='vertical')

**Cart size has a right-skewed distribution, and very few orders have a cart size of more than 40. The maximum cart size is 145.**
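
The shape of the cart-size distribution can also be summarized numerically rather than read from the histogram. A minimal sketch on made-up order lines (`lines_demo` is hypothetical): count the lines per order, then look at the summary statistics, the skewness, and the share of carts above a threshold.

```python
import pandas as pd

# Toy order lines (made up): one row per product in an order.
lines_demo = pd.DataFrame(
    {"order_id": [1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4]}
)

# Number of products per order, i.e. the cart size.
cartsize_demo = lines_demo.groupby("order_id").size()

# describe() gives min, max and quartiles; a positive skew() value
# indicates a longer right tail.
print(cartsize_demo.describe())
print("skewness:", cartsize_demo.skew())

# Share of orders whose cart is larger than a chosen threshold (2 here).
print("share of carts larger than 2:", (cartsize_demo > 2).mean())
```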