# Data Science Project - Build a recommendation engine that predicts which products an Instacart consumer will purchase again.

## Whether you shop from meticulously planned grocery lists or let whimsy guide your grazing, our unique food rituals define who we are. Instacart, a grocery ordering and delivery app, aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples when you need them. After you select products through the Instacart app, personal shoppers review your order and do the in-store shopping and delivery for you.

## Instacart’s data science team plays a big part in providing this delightful shopping experience. Currently, they use transactional data to develop models that predict which products a user will buy again, try for the first time, or add to their cart next during a session. Recently, Instacart open-sourced this data - see their blog post on 3 Million Instacart Orders, Open Sourced.

## In this data science project, we are going to use this anonymized data on customer orders over time to predict which previously purchased products will be in a user’s next order.

#Datasets
#https://s3.amazonaws.com/hackerday.datascience/87/aisles.csv
#https://s3.amazonaws.com/hackerday.datascience/87/departments.csv
#https://s3.amazonaws.com/hackerday.datascience/87/sample_submission.csv
#https://s3.amazonaws.com/hackerday.datascience/87/order_products__prior.csv
#https://s3.amazonaws.com/hackerday.datascience/87/order_products__train.csv
#https://s3.amazonaws.com/hackerday.datascience/87/orders.csv
#https://s3.amazonaws.com/hackerday.datascience/87/products.csv

In [1]:
# This problem can be solved using three different approaches:

#1. Standard MBA or Market basket analysis (association rules or arules)

#2. using a predictive model to estimate the demand for a particular product

#3. Product recommendation engine using collaborative filtering


In [2]:
# Steps we are going to follow in this hackerday

#1. read all the csv files

#2. join the relevant files

#3. doing exploratory data analysis

#4. getting into the solution

In [9]:
#1 = dry fruits
#2 = coffee
#1 = soups
#2 = milk
#3 = vegetables


In [10]:
# transaction set

#Invoice 1 = {apple, oranges, rice, wheat, milk}
#Invoice 2 = {rice, milk, butter, bread, fruits, apple}
# .
# .
# .
#Invoice 20 = {...........}

#total number of transactions = 20

In [None]:
# ARULES

# Support = proportion of transactions in the invoice data that contain an item set
# X = {milk, rice}
# Support of X = 2/20 = 0.1

# Confidence = confidence of a rule
# rule: the user buys apple (Y) given that the user has already added X
# Confidence of (X > Y) = Sup(X U Y) / Sup(X)
# X = LHS and Y = RHS
# Confidence = (2/20) / (2/20) = 1 = 100%

# Lift = lift of a specific rule
# Lift (X > Y) = Sup(X U Y) / (Sup(X) * Sup(Y))
# taking apple (Y) to appear in 2 of the 20 invoices, Sup(Y) = 2/20
# Lift = (2/20) / ((2/20) * (2/20)) = 0.1 / 0.01 = 10
# a lift above 1 means X and Y occur together more often than expected if they were independent
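
The three measures can be checked by hand on a small basket list. The sketch below is a minimal base-R illustration; the five baskets and the helper `support()` are made up for this example and are not part of the Instacart data.

```r
# Toy transactions (hypothetical, for illustration only)
baskets <- list(
  c("apple", "oranges", "rice", "wheat", "milk"),
  c("rice", "milk", "butter", "bread", "fruits", "apple"),
  c("bread", "butter"),
  c("milk", "bread"),
  c("apple", "bread")
)

# Support: share of baskets that contain every item of the item set
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

X <- c("milk", "rice")   # LHS of the rule
Y <- "apple"             # RHS of the rule

sup_x  <- support(X, baskets)        # 2 of 5 baskets
sup_y  <- support(Y, baskets)        # 3 of 5 baskets
sup_xy <- support(c(X, Y), baskets)  # 2 of 5 baskets

confidence <- sup_xy / sup_x             # share of X-baskets that also hold Y
lift       <- sup_xy / (sup_x * sup_y)   # observed co-occurrence vs. independence

print(c(support = sup_xy, confidence = confidence, lift = lift))
```

Note the parentheses in the lift denominator: the joint support is divided by the product of the two individual supports.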


In [11]:
# Steps in creating Association rules
# Apply the minimum support criteria to identify most frequent item set
# these frequent item sets and the minimum confidence constraint are used to form rules
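
The two steps above can be sketched with a brute-force search over item pairs. This is only an illustration in base R under assumed thresholds (`min_support`, `min_confidence`) and hypothetical baskets; `apriori()` from arules does the same job far more efficiently and for longer item sets.

```r
# Hypothetical toy baskets
baskets <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "rice"),
  c("bread", "butter"),
  c("milk", "bread", "rice")
)

# Share of baskets containing all of the given items
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

min_support    <- 0.3   # assumed threshold
min_confidence <- 0.7   # assumed threshold

# Step 1a: frequent single items
items <- sort(unique(unlist(baskets)))
item_sup <- vapply(items, support, numeric(1))
frequent_items <- items[item_sup >= min_support]

# Step 1b: candidate pairs are built from frequent items only,
# then filtered by the same minimum support
pairs <- combn(frequent_items, 2, simplify = FALSE)
pair_sup <- vapply(pairs, support, numeric(1))
frequent_pairs <- pairs[pair_sup >= min_support]

# Step 2: each frequent pair yields two candidate rules (a > b and b > a);
# keep those that meet the minimum confidence
rules <- do.call(rbind, lapply(frequent_pairs, function(p) {
  data.frame(
    lhs = p,
    rhs = rev(p),
    support = support(p),
    confidence = support(p) / c(support(p[1]), support(p[2])),
    stringsAsFactors = FALSE
  )
}))
rules <- rules[rules$confidence >= min_confidence, ]
print(rules)
```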


In [12]:
# eclat algorithm
# apriori algorithm

In [13]:
# algorithmic definition (Apriori property)
# if a set of products is frequent in a dataset, then every subset of that
# set is also frequent

In [14]:
# plyr must be loaded before dplyr; otherwise plyr masks dplyr verbs such as
# summarise() and mutate(), and grouped summaries no longer respect group_by().
# Hmisc, arules and arulesViz have to be installed before they can be loaded.
library(plyr)
library(dplyr)
library(ggplot2)
library(Hmisc)
library(readr)
library(arules)
library(arulesViz)
library(stringr)
library(data.table)
library(methods)


ERROR: Error in library(Hmisc): there is no package called ‘Hmisc’


ERROR: Error in library(arules): there is no package called ‘arules’


ERROR: Error in library(arulesViz): there is no package called ‘arulesViz’



Attaching package: ‘data.table’

The following objects are masked from ‘package:dplyr’:

    between, last



In [None]:
orders <- fread("https://s3.amazonaws.com/hackerday.datascience/87/orders.csv")
orders

products<-fread('https://s3.amazonaws.com/hackerday.datascience/87/products.csv')
products

order_products_prior <- fread('https://s3.amazonaws.com/hackerday.datascience/87/order_products__prior.csv')
order_products_prior

order_products_train <- fread('https://s3.amazonaws.com/hackerday.datascience/87/order_products__train.csv')
order_products_train

aisles <- fread('https://s3.amazonaws.com/hackerday.datascience/87/aisles.csv')
aisles

departments <- fread('https://s3.amazonaws.com/hackerday.datascience/87/departments.csv')
departments



In [None]:
# On which day of the week did the company receive the most orders?

# order_dow runs from 0 to 6; 0 is taken to be Sunday
day_names <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday',
               'Thursday', 'Friday', 'Saturday')
orders$day_week_name <- day_names[orders$order_dow + 1]


In [None]:
orders$day_ordered <- factor(orders$day_week_name,levels = c("Sunday",
                                                             "Monday",
                                                             "Tuesday",
                                                             "Wednesday",
                                                             "Thursday",
                                                             "Friday",
                                                             "Saturday"))
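
The recoding and the factor conversion can also be done in one step. The sketch below uses a made-up `order_dow` vector and keeps the notebook's assumption that 0 stands for Sunday; `factor()` with `levels` and `labels` maps the codes to names and fixes the display order at the same time.

```r
# Hypothetical sample of day-of-week codes (0 = Sunday is an assumption)
order_dow <- c(0, 1, 1, 6, 3, 0)

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# levels gives the codes, labels gives the names in the same order
day_ordered <- factor(order_dow, levels = 0:6, labels = day_names)

# Days without orders still show up with a count of zero
print(table(day_ordered))
```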



In [None]:
# visualization of orders placed by different days of the week

dow_graph <- barplot(
  table(orders$day_ordered),
  main = "Total Orders by Day",
  xlab = 'Days',
  ylab = 'Number of Orders',
  col = 'blue')

text(
  x = dow_graph,
  y = table(orders$day_ordered),
  labels = table(orders$day_ordered),
  pos = 1,
  cex = 1.0,
  col = 'white'
)


In [None]:
##########EDA##############
orders <- fread("https://s3.amazonaws.com/hackerday.datascience/87/orders.csv")
orders

products<-fread('https://s3.amazonaws.com/hackerday.datascience/87/products.csv')
products

order_products_prior <- fread('https://s3.amazonaws.com/hackerday.datascience/87/order_products__prior.csv')
order_products_prior

order_products <- fread('https://s3.amazonaws.com/hackerday.datascience/87/order_products__train.csv')
order_products


aisles <- fread('https://s3.amazonaws.com/hackerday.datascience/87/aisles.csv')
aisles

departments <- fread('https://s3.amazonaws.com/hackerday.datascience/87/departments.csv')
departments


In [None]:
library(knitr)
kable(head(orders,12))

kable(head(order_products,10))

kable(head(products,10))

kable(head(order_products_prior,10))

kable(head(aisles,12))

kable(head(departments,12))


In [None]:
# Recoding the variables

orders <- orders %>%
  mutate(order_hour_of_day = as.numeric(order_hour_of_day),
         eval_set = as.factor(eval_set))

products <- products %>%
  mutate(product_name = as.factor(product_name))

aisles <- aisles %>%
  mutate(aisle = as.factor(aisle))

departments <- departments %>%
  mutate(department = as.factor(department))




In [None]:
### graphs/visualization for orders by hour of the day
orders %>%
  ggplot(aes(x=order_hour_of_day)) +
  geom_bar(fill='blue')



In [None]:
# orders by the day of the week

#Hypothesis
# Does the day of the week have any effect on the number of orders?

orders %>%
  ggplot(aes(x=order_dow)) +
  geom_bar(fill='red')


In [None]:
#conclusion: most orders are placed on Sunday (0) and Monday (1)

#Hypothesis
# Do people order more often after exactly 1 week?
orders %>%
  ggplot(aes(x=days_since_prior_order)) +
  geom_bar(fill='orange')



In [None]:
#conclusion: yes, they do order more often after exactly 1 week

# Question: how many prior orders were placed at each order number?
# dplyr::count() is called explicitly so that plyr's count() cannot mask it
orders %>% filter(eval_set=='prior') %>% dplyr::count(order_number) %>%
  ggplot(aes(order_number,n)) + geom_line(color='red',size=1)



In [None]:
# from the training set: number of items per order

# dplyr::summarise() is called explicitly so that the grouping is respected
order_products %>%
  group_by(order_id) %>%
  dplyr::summarise(n_items=last(add_to_cart_order)) %>%
  ggplot(aes(x=n_items)) +
  geom_bar(fill='red') +
  geom_rug() #+
  #coord_cartesian(xlim = c(0,80))


In [None]:
# from the prior set: number of items per order
order_products_prior %>%
  group_by(order_id) %>%
  dplyr::summarise(n_items=last(add_to_cart_order)) %>%
  ggplot(aes(x=n_items)) +
  geom_bar(fill='red') +
  geom_rug() #+
#coord_cartesian(xlim = c(0,80))


In [None]:
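######################################
order_products_prior[1:10,]
products[1:10,]

# keep order_id and product_id, then attach the product names
mydata <- order_products_prior[,1:2]
mydata <- merge(mydata,products,by='product_id')

mydata <- arrange(mydata,order_id)
head(mydata)

# only the order and the product name are needed to build the baskets
mydata <- mydata[,c('order_id','product_name')]
head(mydata)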


In [None]:
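# the data has one row per (order, product) pair
# for market basket analysis we need a transactional dataset: one basket of products per order
# split() groups the product names by order_id to build those baskets

dt <- split(mydata$product_name,mydata$order_id)

dt2 = as(dt,'transactions')

summary(dt2)

# inspect a single transaction (the fifth) instead of printing every basket
inspect(dt2[5])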
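
To see what `split()` does here, the sketch below runs the same conversion on a tiny, made-up table of order lines (the `order_lines` data frame is hypothetical). It uses base R only; the final coercion to arules transactions is left as a comment because it needs the arules package.

```r
# Hypothetical long-format order lines: one row per (order, product)
order_lines <- data.frame(
  order_id = c(1, 1, 1, 2, 2, 3),
  product_name = c("Banana", "Milk", "Bread", "Banana", "Eggs", "Milk"),
  stringsAsFactors = FALSE
)

# split() groups the product names by order, giving one basket per order_id
baskets <- split(order_lines$product_name, order_lines$order_id)
str(baskets)

# Number of products in each basket
print(lengths(baskets))

# With arules installed, the list can then be coerced as in the notebook:
# trans <- as(baskets, "transactions")
```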


In [None]:
# visualize the most frequent items in this dataset
itemFrequency(dt2,type='relative')
itemFrequencyPlot(dt2,topN=20,type='relative')
itemFrequencyPlot(dt2,topN=50,type='absolute')


In [None]:
#create rules
rule_1 = apriori(dt2,parameter = list(support=0.00001,
                                      confidence=0.90))
library(RColorBrewer)
plot(rule_1,control = list(col=brewer.pal(11,"Spectral")),main="")


rule_2 = apriori(dt2,parameter = list(support=0.0001,
                                      confidence=0.90))
plot(rule_2,control = list(col=brewer.pal(11,"Spectral")),main="")


rule_3 = apriori(dt2,parameter = list(support=0.001,
                                      confidence=0.90))
plot(rule_3,control = list(col=brewer.pal(11,"Spectral")),main="")


In [None]:
summary(rule_3)

rule_4 <- apriori(dt2,
                  parameter = list(support=0.001,
                                   confidence=0.8,
                                   minlen=3))

rule_5 <- apriori(dt2,
                  parameter = list(support=0.001,
                                   confidence=0.8,
                                   maxlen=4))


In [None]:
# converting the rules into a data frame
rules3 = as(rule_3,'data.frame')

inspect(subset(rule_3,subset= rhs %pin% 'Banana'))

# before recommending the products to the company you can sort the rules
inspect(head(sort(rule_3,by='lift'),5))

summary(rule_3)


In [None]:

plot(rule_3,method = 'graph',control = list(type='items',main=''))

subrule3 <- head(sort(rule_3,by='lift'),10)

plot(subrule3,method = 'graph',control = list(type='items',main=''))


In [None]:

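# shall we continue with the existing set of rules?

# NO, because we need to remove the redundant rules from the set

# identify the redundant rules: a rule is flagged when an earlier rule in the set is a subset of it
# sparse = FALSE asks for a dense logical matrix so that the NA assignment below works as intended
subset.matrix = is.subset(rule_3,rule_3,sparse = FALSE)
subset.matrix[lower.tri(subset.matrix,diag = TRUE)] <- NA

redundant = colSums(subset.matrix, na.rm = TRUE) >= 1
which(redundant)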
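
The subset-matrix trick is easier to follow on a toy example. The sketch below mimics it in base R with four hypothetical rules, each written as its full item set (LHS plus RHS) and listed in ranked order; it illustrates the idea and is not a replacement for `is.subset()` from arules.

```r
# Hypothetical rules in ranked order (for example by lift),
# each given as the set of all items in the rule
rule_items <- list(
  r1 = c("milk", "bread"),
  r2 = c("milk", "bread", "butter"),
  r3 = c("rice", "milk"),
  r4 = c("rice", "milk", "apple")
)
n <- length(rule_items)

# is_sub[i, j] is TRUE when every item of rule i also appears in rule j
is_sub <- outer(
  seq_len(n), seq_len(n),
  Vectorize(function(i, j) all(rule_items[[i]] %in% rule_items[[j]]))
)
dimnames(is_sub) <- list(names(rule_items), names(rule_items))

# Blank out the diagonal and lower triangle so that a rule is only
# compared with rules ranked above it
is_sub[lower.tri(is_sub, diag = TRUE)] <- NA

# A rule is redundant when at least one higher-ranked rule is a subset of it
redundant <- colSums(is_sub, na.rm = TRUE) >= 1
print(redundant)

pruned <- rule_items[!redundant]
print(names(pruned))
```

Here `r2` and `r4` are flagged because `r1` and `r3`, which rank above them, are contained in them.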


In [None]:

rule3_pruned <- rule_3[!redundant]
rules<-rule3_pruned

# inspect the pruned rules
inspect(rules)
