# Instacart Market Basket Analysis: Customer's Diet and Their Next Order List


### Before doing any extensive analysis, I made a list of questions that got me curious about the data. I used this list as a guide in my analysis.  

### QUESTIONS TO ANSWER USING THE DATA
1.	How many products?
2.	How many aisles?
3.	How many departments?
4.	How many customers?
5.	How many total orders?
6.	Are there missing data? What type of missing data?
7.	When are the peak hours (orders > 100,000)? When are orders highest and lowest?
8.	What day of the week has the highest and lowest order volume?
9.	What is the probability of each product being ordered?
10.	What is the probability of each department being ordered from?
11.	What is the probability of each aisle being ordered from?
12.	Can I identify meat eaters, vegetarians, and vegans, and their percentages in the entire customer list? (Hypothesis testing is in this section.)
13.	What is the probability of a customer being a meat eater, vegetarian, or vegan? (I used Bayesian statistics in this part.)
14.	What products appear in all of customer A's orders? These products will have a high probability of being reordered by customer A.
15.	How many orders does each customer have?
16.	What is the average number of products per order for each customer?
17.	Using the average number of products per order for each customer and the probability of a product being reordered by that customer, can I predict which products the customer will reorder?
18.	What is the accuracy of my predicted next order list compared with the actual next order list?

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import matplotlib.ticker as ticker
import seaborn as sns
from matplotlib import rcParams
from scipy import stats
sns.set(style="whitegrid", color_codes=True)
sns.set_context("poster")

In [3]:
prod=pd.read_csv('products.csv')
prod.head()


## Q1. How many products? 

In [None]:
prod.product_id.nunique()  # number of distinct products available

## Q2. How many aisles?

In [None]:
prod.aisle_id.nunique()  # number of distinct aisles

## Q3. How many departments?

In [None]:
prod.department_id.nunique()  # number of distinct departments

In [None]:
orders=pd.read_csv('orders.csv')
orders.head()    #order_dow days of the week

## Q4. How many customers? 

In [None]:
orders.user_id.unique().size

## Q5. How many total orders?


In [None]:
orders.order_id.size

In [None]:
## Maximum number of orders placed by a single customer
orders.order_number.max()

In [None]:
## user_id of the first customer who reaches the maximum order_number (100)
orders.set_index('user_id').order_number.idxmax()

## Q1-Q5 ANSWERS
There are a total of 7 csv files used in the data analysis, namely:
1) aisles, 2) departments, 3) order_products__prior, 4) order_products__train, 5) orders, 6) products, 7) sample_submission

Querying 5) orders and 6) products, we now know that there are

### 49,688 PRODUCTS
### 134 AISLES
### 21 DEPARTMENTS
### 206,209 CUSTOMERS
### 3,421,083 ORDERS

The customer with user_id 210 is the first one found with the maximum of 100 orders.
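
Counting distinct ids with `nunique()` is safer than taking the largest id, since the two agree only when the ids run from 1 with no gaps. A minimal sketch on a made-up toy table (the values are hypothetical, not taken from products.csv):

```python
import pandas as pd

# Toy product table (hypothetical values): product_id 3 is missing on purpose
toy_prod = pd.DataFrame({
    "product_id": [1, 2, 4, 5],
    "aisle_id": [10, 10, 11, 12],
})

n_by_max = toy_prod["product_id"].max()          # largest id, not a count
n_by_nunique = toy_prod["product_id"].nunique()  # number of distinct ids

print(n_by_max, n_by_nunique)
```

If both numbers match on the real files, the ids are contiguous and either approach gives the same answer.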

## The orders dataframe has 3 eval_set values (prior, train, test)
prior has the most orders and contains the order history of all users, while train and test hold the latest orders of selected users, which can be used for training and testing a model.

Prior set = 3,214,874 orders, Train set = 131,209 orders, Test set = 75,000 orders


In [None]:
orders.eval_set.value_counts()

## The orders dataframe is split into the 3 eval_set groups

In [None]:
oprior=orders[orders.eval_set=='prior']
otrain=orders[orders.eval_set=='train']
otest=orders[orders.eval_set=='test']

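## Q6. Are there missing data?
### Yes, but only in the orders dataframe. There are 206,209 "NaN" values in the days_since_prior_order column, one for the first order of each of the 206,209 users. The missingness is fully explained by an observed variable (it occurs exactly when order_number is 1), so it can be treated as MAR (missing at random); it is structural rather than accidental.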
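
One way to confirm this, sketched on a made-up toy table (the values are hypothetical; only the column names come from orders.csv), is to check that the NaN rows are exactly the rows where order_number is 1:

```python
import numpy as np
import pandas as pd

# Toy orders table (hypothetical values) with the columns used in this section
toy_orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 7.0, 14.0, np.nan, 30.0],
})

is_missing = toy_orders["days_since_prior_order"].isnull()
is_first_order = toy_orders["order_number"] == 1

# True when every NaN is a first order and every first order is a NaN
missing_only_on_first = bool((is_missing == is_first_order).all())
n_missing = int(is_missing.sum())
n_users = toy_orders["user_id"].nunique()

print(missing_only_on_first, n_missing, n_users)
```

Running the same comparison on the real orders dataframe would show whether the number of missing values equals the number of users for this reason.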

In [None]:
orders.isnull().sum()      #missing values in days_since_prior_order = users first order in INSTACART

In [None]:
prod.isnull().sum()

In [None]:
aisles=pd.read_csv('aisles.csv')
aisles.isnull().sum()

In [None]:
dep=pd.read_csv('departments.csv')     
dep.isnull().sum()      

In [None]:
prior=pd.read_csv('order_products__prior.csv')
prior.isnull().sum()

In [None]:
train=pd.read_csv('order_products__train.csv')
train.isnull().sum()

In [None]:
samp=pd.read_csv('sample_submission.csv')
samp.isnull().sum()

## Q7. When are the peak hours (orders > 100,000)? (hours 8 through 20, i.e. 8am until just before 9pm)

### When are orders highest and lowest? (lowest at 3am and highest at 10am)

In [None]:
hourdist=orders['order_hour_of_day'].value_counts().sort_index()
hourdist.plot(kind='bar')
_=plt.xlabel('hour of the day')
_=plt.ylabel('order volume')
_=plt.xticks(range(24))
_=plt.title('Daily Order Volume')

In [None]:
## Peak hours: hours 8 through 20 each have more than 100,000 orders
hourdist[hourdist>100000]

In [None]:
hourdist.idxmax()

In [None]:
hourdist.idxmin()

## Q8. What day of the week has the highest and lowest order volume?

## (Highest on Mondays and lowest on Fridays)

In [None]:
orday=orders.order_dow.value_counts()

In [None]:
orday.sort_index().plot(kind='bar', color='r')
_=plt.xlabel('day of week')
_=plt.ylabel('order volume')
## The csv files only store the integers 0-6; labelling 0 as Monday is an assumption
_=plt.xticks([0,1,2,3,4,5,6], ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
_=plt.title('Weekly Order Volume')

In [None]:
orday.idxmax()


In [None]:
orday.idxmin()

## Q7-Q8 ANSWERS
Peak hours, where orders exceed 100,000 per hour, run from hour 8 through hour 20 (8am until just before 9pm). Order volume is highest at 10am and lowest at 3am. Order volume is high on Mondays and Tuesdays, starts slowing down on Wednesdays, reaches its minimum on Fridays, and picks up a little on the weekend. This means that the website must perform at its best during the peak hours and on the days when order volume is high. For website maintenance that requires downtime, Friday between 12am and 6am is the best day and time.
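
The day and hour views can also be combined to look for the quietest slot directly. A minimal sketch on made-up toy orders (the values are hypothetical; only the column names match orders.csv), counting orders for each (day, hour) pair:

```python
import pandas as pd

# Toy orders (hypothetical values): one row per order
toy_orders = pd.DataFrame({
    "order_dow":         [0, 0, 0, 0, 0, 4],
    "order_hour_of_day": [10, 10, 10, 11, 11, 3],
})

# Order volume for every (day, hour) pair that occurs in the data
volume = toy_orders.groupby(["order_dow", "order_hour_of_day"]).size()

busiest = volume.idxmax()   # (day, hour) with the most orders
quietest = volume.idxmin()  # (day, hour) with the fewest orders

print(busiest, quietest)
```

Note that a (day, hour) pair with no orders at all does not appear in `volume`, so on sparse data the true quietest slot may be one that is missing from the index.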

## MERGING DATAFRAMES


### USING "otrain" dataframe

In [None]:
## alltrain has latest order of selected users but does not include their order history
alltrain=otrain.merge(train).merge(prod).merge(dep).merge(aisles)
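
The chained merges let pandas infer the join keys from the shared column names. A hedged sketch of the same pattern on toy tables (all values, including the aisle names, are made up) that names each key explicitly and uses the `validate` argument of `DataFrame.merge`, so an unexpected duplicate key raises an error instead of silently multiplying rows:

```python
import pandas as pd

# Toy tables (hypothetical values) with the same key columns as the real files
toy_orders = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 11]})
toy_lines = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 101, 100]})
toy_prod = pd.DataFrame({"product_id": [100, 101],
                         "product_name": ["Banana", "Milk"],
                         "aisle_id": [24, 84]})
toy_aisles = pd.DataFrame({"aisle_id": [24, 84],
                           "aisle": ["fresh fruits", "milk"]})

merged = (
    toy_orders
    .merge(toy_lines, on="order_id", validate="one_to_many")    # one order, many items
    .merge(toy_prod, on="product_id", validate="many_to_one")   # many items, one product row
    .merge(toy_aisles, on="aisle_id", validate="many_to_one")   # many items, one aisle row
)

print(merged.shape)
```

The result has one row per ordered item, which is the shape the later per-product and per-aisle counts rely on.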

In [None]:
alltrain.head(5)

In [None]:
## Number rows,columns in alltrain
alltrain.shape

In [None]:
## Number of user_id in alltrain
alltrain.user_id.unique().size

In [None]:
## Number of orders in alltrain
alltrain.order_id.unique().size

In [None]:
#Average number of products per order in alltrain
alltrain.groupby('order_id')['product_name'].size().mean()

### USING "oprior" dataframe

In [None]:
allprior=oprior.merge(prior).merge(prod).merge(dep).merge(aisles)

In [None]:
allprior.head()

In [None]:
## Number of rows, columns in allprior
allprior.shape

In [None]:
## Number of user_id in allprior
allprior.user_id.unique().size

In [None]:
## Number of orders in allprior
allprior.order_id.unique().size

## Q9. What is the probability of a specific product being ordered?

In [None]:
## Probability of products being ordered using "alltrain"
## (share of all ordered items in alltrain that are this product)
ProbA=alltrain.product_name.value_counts().rename_axis('product_name').reset_index(name='count')
ProbA['PA(ordered)']=ProbA['count']/len(alltrain)
ProbA.head(10)

In [None]:
## Probability of products being ordered using "allprior"
## (share of all ordered items in allprior that are this product)
ProbB=allprior.product_name.value_counts().rename_axis('product_name').reset_index(name='count')
ProbB['PB(ordered)']=ProbB['count']/len(allprior)
ProbB.head(10)
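
The probability computed here is a product's share of all ordered items. A different but related quantity is the share of orders that contain the product. A minimal sketch on made-up toy data (hypothetical values) showing both, in case the second reading is the one of interest:

```python
import pandas as pd

# Toy order lines (hypothetical values): one row per ordered item
toy = pd.DataFrame({
    "order_id":     [1, 1, 2, 3],
    "product_name": ["Banana", "Milk", "Banana", "Bread"],
})

# Share of all ordered items that are each product
item_share = toy["product_name"].value_counts(normalize=True)

# Share of orders that contain each product at least once
pairs = toy.drop_duplicates(["order_id", "product_name"])
order_share = pairs["product_name"].value_counts() / toy["order_id"].nunique()

print(item_share["Banana"], order_share["Banana"])
```

In the toy data, Banana is half of all items but appears in two of the three orders, so the two numbers differ.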

## Q10. What is the probability of each department being ordered from?


In [None]:
## Probability of each department being ordered from using "alltrain"
ProbDA=alltrain.department.value_counts().rename_axis('department').reset_index(name='count')
ProbDA['PDA(ordered)']=ProbDA['count']/len(alltrain)
ProbDA.head(10)

In [None]:
## Probability of each department being ordered from using "allprior"
ProbDB=allprior.department.value_counts().rename_axis('department').reset_index(name='count')
ProbDB['PDB(ordered)']=ProbDB['count']/len(allprior)
ProbDB.head(10)

## Q11. What is the probability of each aisle being ordered from?


In [None]:
## Probability of Aisle being ordered from using "alltrain"
aisle_percenta=alltrain.aisle_id.value_counts()/alltrain.aisle_id.size
aisle_percenta=pd.DataFrame(aisle_percenta).sort_index().reset_index()
aisle_percenta.columns=['aisle_id', 'aisle_percent_train']

In [None]:
##Percentage/probability of orders coming from each aisle using allprior 
aisle_percentb=allprior.aisle_id.value_counts()/allprior.aisle_id.size
aisle_percentb=pd.DataFrame(aisle_percentb).sort_index().reset_index()
aisle_percentb.columns=['aisle_id', 'aisle_percent_prior']

In [None]:
aisle_percent=aisle_percenta.merge(aisle_percentb)
aisle_percent.head()

In [None]:
fig, ax = plt.subplots(figsize=(8,10))
## Use aisle_id on the y-axis so each bar lines up with its aisle number
plt.barh(aisle_percenta.aisle_id, aisle_percenta.aisle_percent_train, color='b', alpha=0.5)
plt.barh(aisle_percentb.aisle_id, aisle_percentb.aisle_percent_prior, color='y', alpha=0.5)
plt.ylim(0,135)
plt.xlabel('Percent')
plt.ylabel('aisle_id')
plt.title('Aisle Distribution')
plt.legend(['train', 'prior'], loc='upper right')
ax.set_yticks(np.arange(0,136,5))
plt.show()

## Q12. Can customers be classified as MEAT_LOVERS, PESCATARIAN, NONVEGAN, or VEGAN?

I simplified the classification of these 4 groups by looking at which aisles customers order from.

Meat_Lovers have ordered at least one product from the meat aisles 5, 7, 35, 49, 96, 106, or 122. They may also order seafood.

Pescatarians have ordered seafood products from aisles 15, 34, 39, or 95 but nothing from the meat aisles.

NonVegans have ordered nothing from the meat and seafood aisles 5, 7, 15, 34, 35, 39, 49, 95, 96, 106, and 122, but have ordered dairy products like milk, cheeses, and creams from aisles 2, 21, 53, 84, 86, 108, or 120.

Vegans have ordered nothing from the meat and seafood aisles 5, 7, 15, 34, 35, 39, 49, 95, 96, 106, and 122 and nothing from the dairy aisles 2, 21, 53, 84, 86, 108, and 120.
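
The rules above can be sketched as a small function applied to the set of aisles each user has ordered from. The toy data and the helper name `classify` are illustrative assumptions; the aisle lists are the ones given above, with meat checked first, then seafood, then dairy.

```python
import pandas as pd

MEAT = {5, 7, 35, 49, 96, 106, 122}
SEAFOOD = {15, 34, 39, 95}
DAIRY = {2, 21, 53, 84, 86, 108, 120}

def classify(aisles):
    """Return the diet label for the set of aisle ids one user ordered from."""
    if aisles & MEAT:
        return "Meat_Lovers"
    if aisles & SEAFOOD:
        return "Pescatarian"
    if aisles & DAIRY:
        return "NonVegan"
    return "Vegan"

# Toy order lines (hypothetical values): one row per ordered item
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3, 4],
    "aisle_id": [24, 5, 39, 84, 24, 24],
})

diet = toy.groupby("user_id")["aisle_id"].apply(lambda s: classify(set(s)))
print(diet)
```

Because the label depends on everything a user has ever ordered, a single meat purchase is enough to make a user a Meat_Lover under these rules.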

#### Q12A. Different Functions are written that will be used to classify the four diet groups
#### Q12B. Diet Classifications will be identified by USERS
#### Q12C. Diet Classifications will be identified by ORDERS
#### Q12D. Generation of Simulated Datasets
#### Q12E. Hypothesis Testing part 1: Are there significant differences between Diet distributions classified by USERS vs. by ORDERS?
#### Q12F. Hypothesis Testing part 2: Are there significant differences between Diet distributions classified by USERS and by ORDERS using the Empirical dataset vs. the Simulated dataset?



In [None]:
## U is the dataframe of users with their maximum order number using "oprior"
## reset_index keeps user_id as a plain column only, which avoids an ambiguous index/column name in later merges
U=oprior.groupby('user_id')['order_number'].max().reset_index(name='max_order')[['max_order', 'user_id']]

In [None]:
## P_num is the dataframe of users with order_id and the number of products per order using "allprior"
## size() counts the rows (products) in each (user_id, order_id) group
P_num=allprior.groupby(['user_id', 'order_id']).size().reset_index(name='prod_per_order')

In [None]:
### Tot_p is the dataframe of users with total number of products they ordered using allprior
Tot_p=allprior.user_id.value_counts()
Tot_p=pd.DataFrame(Tot_p).reset_index()
Tot_p.columns=['user_id', 'total_products']

### Q12A. Functions used to identify Diet Classifications

### F1. This function draws a random sample of "n" users (30,000 by default) and returns their order history from allprior as an empirical dataset

In [None]:
def Sample_maker(Q, n=30000):  ## Q is a dataframe with user_id and max_order; n is the number of users to sample
    PO=Q.merge(P_num).merge(Tot_p)
    sample1=Q.sample(n=n, replace=False, random_state=0, axis=0)

    ## sample1 holds one row per order of each sampled user
    sample1=sample1.merge(PO, how='inner')
    sample1=sample1.loc[:,['user_id', 'order_id', 'max_order', 'prod_per_order', 'total_products']]

    ## Emp is the empirical dataframe: one row per product ordered by the sampled users
    Emp=sample1.merge(allprior, how='inner')

    return Emp

### F2. This Function classifies Users as Meat_Lovers, Pescatarian, Vegan, NonVegan according to users overall order history

In [None]:
def Diet_Class_user(A): ## A is a dataframe produced by the "Sample_maker" function
    meat=[5, 7, 35, 49, 96, 106, 122]
    seafood=[15, 34, 39, 95]
    dairy=[2, 21, 53, 84, 86, 108, 120]

    ## Products ordered from the meat and seafood aisles
    MS=A[A.aisle_id.isin(meat+seafood)]

    ## Meat_Lovers: users with at least one product from a meat aisle
    MF=MS[MS.aisle_id.isin(meat)].user_id.unique()
    Meat_L=pd.DataFrame(MF, columns=['user_id'])
    Meat_L['Diet']='Meat_Lovers'

    ## Users with at least one seafood product (they may or may not also buy meat)
    F=MS[MS.aisle_id.isin(seafood)].user_id.unique()

    ## Pescatarian: seafood buyers who never bought meat
    Pesca=np.setdiff1d(F,MF)
    Pesca=pd.DataFrame(Pesca, columns=['user_id'])
    Pesca['Diet']='Pescatarian'

    ## All vegetarians: users with no meat or seafood products
    Veg=A.loc[~A.user_id.isin(MS.user_id)]

    ## NonVegan: vegetarians with at least one dairy product
    F1=Veg[Veg.aisle_id.isin(dairy)].user_id.unique()
    NonVeg=pd.DataFrame(F1, columns=['user_id'])
    NonVeg['Diet']='NonVegan'

    ## Vegan: vegetarians with no dairy products
    Vega=Veg.loc[~Veg.user_id.isin(F1)].user_id.unique()
    Vega=pd.DataFrame(Vega, columns=['user_id'])
    Vega['Diet']='Vegan'

    ## Combine the dataframes of the four diets
    Sample_class=pd.concat([Meat_L, Pesca, NonVeg, Vega], ignore_index=True)

    return Sample_class

### F3. This Function classifies Meat_Lovers, Pescatarian, Vegan, NonVegan according to orders and disregarding who ordered it (the user)

In [None]:
def Diet_Class_orders(A): ## A is a dataframe with order_id and aisle_id columns
    meat=[5, 7, 35, 49, 96, 106, 122]
    seafood=[15, 34, 39, 95]
    dairy=[2, 21, 53, 84, 86, 108, 120]

    ## Products ordered from the meat and seafood aisles
    MS=A[A.aisle_id.isin(meat+seafood)]

    ## Meat_Lovers: orders with at least one product from a meat aisle
    MF=MS[MS.aisle_id.isin(meat)].order_id.unique()
    Meat_L=pd.DataFrame(MF, columns=['order_id'])
    Meat_L['Diet']='Meat_Lovers'

    ## Orders with at least one seafood product (they may or may not also contain meat)
    F=MS[MS.aisle_id.isin(seafood)].order_id.unique()

    ## Pescatarian: orders with seafood but no meat
    Pesca=np.setdiff1d(F,MF)
    Pesca=pd.DataFrame(Pesca, columns=['order_id'])
    Pesca['Diet']='Pescatarian'

    ## All vegetarian orders: no meat or seafood products
    Veg=A.loc[~A.order_id.isin(MS.order_id)]

    ## NonVegan: vegetarian orders with at least one dairy product
    F1=Veg[Veg.aisle_id.isin(dairy)].order_id.unique()
    NonVeg=pd.DataFrame(F1, columns=['order_id'])
    NonVeg['Diet']='NonVegan'

    ## Vegan: vegetarian orders with no dairy products
    Vega=Veg.loc[~Veg.order_id.isin(F1)].order_id.unique()
    Vega=pd.DataFrame(Vega, columns=['order_id'])
    Vega['Diet']='Vegan'

    ## Combine the dataframes of the four diets
    Sample_class=pd.concat([Meat_L, Pesca, NonVeg, Vega], ignore_index=True)

    return Sample_class

### F4. This function returns a dataframe with the size and percentage of each of the four Diet categories

In [None]:
def Diet_Percentage(B):
    Per=B.Diet.value_counts()
    Per=pd.DataFrame(Per).reset_index()
    Per.columns=['Diet','Size']
    Per['Percent']=Per.Size/Per.Size.sum()
    
    return Per

### F5. This function excludes the 'n' users previously used in generating a sample training data set from oprior


In [None]:
def Remaining_users(S): ## S is an Empirical Sample Generated DataFrame (E1, E2, E3)
    Rem=U.loc[~U.user_id.isin(S.user_id)]
    return Rem

### F6. This function generates simulated sample dataframe

In [None]:
def Simulated_sample(S):  # S is an empirical sample dataframe (E1, E2, E3)
    ## One row per order, with its user and its number of products
    G=S.loc[:, ['user_id', 'order_id', 'prod_per_order']]
    G=G.drop_duplicates()
    sizes=G.prod_per_order.values

    ## Draw one aisle per simulated product, with each aisle picked according to
    ## its share of products in the prior data (aisle_percentb)
    drawn=np.random.choice(a=aisle_percentb.aisle_id.values, size=sizes.sum(), p=aisle_percentb.aisle_percent_prior.values)

    ## Repeat each order_id and user_id once per product so every simulated
    ## product stays attached to the order and user it was generated for
    Sim1=pd.DataFrame({'order_id': np.repeat(G.order_id.values, sizes),
                       'user_id': np.repeat(G.user_id.values, sizes),
                       'aisle_id': drawn})
    Sim1['product_num']=Sim1.groupby('order_id').cumcount()

    return Sim1
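
The idea behind the simulated dataset is that every order keeps its real size while each product's aisle is drawn at random from the overall aisle distribution. A minimal, self-contained sketch of that idea, assuming made-up aisle probabilities and basket sizes (the names `aisle_ids`, `aisle_p` and `baskets` are hypothetical) and using NumPy's `Generator.choice` with a fixed seed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy inputs (hypothetical): aisle probabilities and basket sizes per order
aisle_ids = np.array([5, 24, 84])
aisle_p = np.array([0.2, 0.5, 0.3])
baskets = pd.DataFrame({"user_id": [1, 1, 2],
                        "order_id": [11, 12, 21],
                        "prod_per_order": [2, 3, 1]})

sizes = baskets["prod_per_order"].to_numpy()
sim = pd.DataFrame({
    # Repeat each id once per product so rows stay attached to their order
    "user_id": np.repeat(baskets["user_id"].to_numpy(), sizes),
    "order_id": np.repeat(baskets["order_id"].to_numpy(), sizes),
    # One random aisle per simulated product
    "aisle_id": rng.choice(aisle_ids, size=sizes.sum(), p=aisle_p),
})

print(sim)
```

Drawing all aisles in one call and repeating the ids avoids a Python loop over orders, which matters when the sample holds hundreds of thousands of orders.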

### F7. This function merges the dataframes of the two groups of proportions being compared

In [None]:
## This function merges the dataframes of the two groups of proportions being compared
def df_prop_compare(M, N):    ## M and N are dataframes with Diet classification, Size and Percent
    M=M.copy()                ## work on copies so the callers' dataframes keep their column names
    N=N.copy()
    M.columns=['Diet', 'Size1', 'Percent1']
    N.columns=['Diet', 'Size2', 'Percent2']
    MN=M.merge(N)
    return MN

### F8. This function calculates variance, standard deviation, difference in proportion, MOE, degrees of freedom, t_value and p_value

In [None]:
def diff_std_p_val(W):    ## W is a dataframe with the Diet classification Sizes and Percentages of 2 groups being compared
    n1=W.Size1.sum()      ## total sample size of group 1 (assumes every Diet class appears in both groups)
    n2=W.Size2.sum()      ## total sample size of group 2

    W['var_1']=W.Percent1*(1-W.Percent1)/n1     ## variance of each sample proportion
    W['var_2']=W.Percent2*(1-W.Percent2)/n2
    W['var1_2']=W.var_1+W.var_2
    W['std_var1_2']=W.var1_2**0.5               ## a.k.a. standard error of the difference

    W['%_diff']=abs(W.Percent1-W.Percent2)      ## absolute difference between two proportions

    W['moe']=1.96*W.std_var1_2                  ## 95% margin of error

    ## Welch-Satterthwaite degrees of freedom
    W['DF']=(W.var1_2**2)/((W.var_1**2)/(n1-1)+(W.var_2**2)/(n2-1))

    W['t_val']=(W['%_diff']-0)/W.std_var1_2

    W['p_val']=stats.t.sf(np.abs(W.t_val), W.DF)*2  ## two-sided p-value = Prob(|t| > t_val)

    return W
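
For a single pair of proportions, the same comparison can be worked through by hand. The sketch below is an assumption-laden illustration, not the notebook's function: it uses the normal approximation (a z-test) instead of the t distribution, which gives nearly identical p-values at sample sizes in the thousands, and the counts are made up. The helper name `two_proportion_z` is hypothetical.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Unpooled two-sample z-test for a difference in proportions.

    x1, x2 are the counts in the category; n1, n2 are the total sample sizes.
    Returns (difference, standard error, z, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    # For a standard normal Z, erfc(|z| / sqrt(2)) equals 2 * P(Z > |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1 - p2, se, z, p_value

# Toy counts (hypothetical): 450 of 1000 in one group vs 400 of 1000 in the other
diff, se, z, p_value = two_proportion_z(x1=450, n1=1000, x2=400, n2=1000)
print(diff, se, z, p_value)
```

The key point is that each variance term divides by the total sample size of its group, not by the number of observations in the category being tested.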

### Q12B. Identifying Diet Classification by USERS
In this section Diets are identified by USERS using their order history.  E1, E2, and E3 are empirical sample dataframes each with 30,000 users.  P_user1, P_user2 and P_user3 are dataframes with the percentage of each diet classification

In [None]:
## First Empirical Sample with 30,000 users from oprior DataFrame
E1=Sample_maker(U)           # Empirical sample generated
C1=Diet_Class_user(E1)       # Diets of the 30,000 sampled users, classified by USERS
P_user1=Diet_Percentage(C1)  # Size and percentage of each Diet classification
P_user1

In [None]:
## Second Empirical Sample with 30,000 users from oprior DataFrame
U1=Remaining_users(E1)      # The 30,000 users in the first empirical sample (E1) are excluded from the user list

E2=Sample_maker(U1)         # Empirical sample generated
C2=Diet_Class_user(E2)      # Diets of the 30,000 sampled users, classified by USERS
P_user2=Diet_Percentage(C2) # Size and percentage of each Diet classification
P_user2

In [None]:
## Third Empirical Sample with 30,000 users from oprior DataFrame

ET=pd.concat([E1,E2])           # The first two empirical samples are combined
U2=Remaining_users(ET)          # The 60,000 users in the first and second empirical samples (E1+E2) are excluded from the user list
E3=Sample_maker(U2)             # Empirical sample generated
C3=Diet_Class_user(E3)          # Diets of the 30,000 sampled users, classified by USERS
P_user3=Diet_Percentage(C3)     # Size and percentage of each Diet classification
P_user3

### Bar Graph Comparison of % Diet Classification according to USERS across three Empirical Sample Datasets


In [None]:
fig, ax = plt.subplots(figsize=(10,8))

X = np.arange(4)
plt.bar(X + 0.00, P_user1.Percent, color = 'b', width = 0.25)
plt.bar(X + 0.25, P_user2.Percent, color = 'g', width = 0.25)
plt.bar(X + 0.50, P_user3.Percent, color = 'r', width = 0.25)
plt.ylabel('Percent')

ax.set_xticks([p + 1.5 * 0.25 for p in X])
ax.set_xticklabels(P_user1.Diet)

plt.legend(['P_user1', 'P_user2', 'P_user3'], loc='upper right')
plt.title('Diet Distribution of 3 Empirical Sample (user_id)')
plt.show()

### Q12C. Identifying Diet Classification by ORDERS

In [None]:
## Reclassification of Diets using E1 
C_order1=Diet_Class_orders(E1)       # Diets are classified per order_id
P_order1=Diet_Percentage(C_order1)   # This returns the percentage of the different Diet Classification
P_order1

In [None]:
## Reclassification of Diets using E2 
C_order2=Diet_Class_orders(E2)          # Diets are classified per order_id
P_order2=Diet_Percentage(C_order2)      # This returns the percentage of the different Diet Classification
P_order2

In [None]:
## Reclassification of Diets using E3 
C_order3=Diet_Class_orders(E3)          # Diets are classified per order_id
P_order3=Diet_Percentage(C_order3)      # This returns the percentage of the different Diet Classification
P_order3

### Bar Graph Comparison of % Diet Classification according to ORDERS across three Empirical Sample Datasets

In [None]:
## Bar Graph comparison of Diet distribution using Empirical Samples classified using ORDERS
fig, ax = plt.subplots(figsize=(10,8))

X = np.arange(4)
plt.bar(X + 0.00, P_order1.Percent, color = 'c', width = 0.25)
plt.bar(X + 0.25, P_order2.Percent, color = 'm', width = 0.25)
plt.bar(X + 0.50, P_order3.Percent, color = 'g', width = 0.25)
plt.ylabel('Percent')

ax.set_xticks([p + 1.5 * 0.25 for p in X])
ax.set_xticklabels(P_order1.Diet)

plt.legend(['P_order1', 'P_order2', 'P_order3'], loc='upper right')
plt.title('Diet Distribution of 3 Empirical Samples (order_id)')
plt.show()


### Q12D. Generation of Simulated Datasets

In [None]:
## This is the first simulated Sample dataframe with randomly picked aisles
Simu1= Simulated_sample(E1)

In [None]:
#Simulated_sample orders Diet classified
Csim_or=Diet_Class_orders(Simu1)
Psimulated_order=Diet_Percentage(Csim_or)    #Percentage of each diet is calculated
Psimulated_order

In [None]:
#Simulated_sample users Diet classified
Csim_us=Diet_Class_user(Simu1)    
Psimulated_user=Diet_Percentage(Csim_us) ## Percentage of each diet calculated
Psimulated_user

In [None]:
## User_Order_Emp is the dataframe with Sizes and Percentages of Diets classifications from all "users" and "orders"
User_Order_Emp=df_prop_compare(P_user1, P_order1)
User_Order_Emp

### Q12E. Hypothesis Testing part 1: Are there significant differences between Diet distributions classified by USERS vs. by ORDERS?
Ho: There is no significant difference between classifying Diets using "users' overall orders" and using "individual orders, disregarding who ordered them".

H1: There is a significant difference between classifying Diets using "users' overall orders" and using "individual orders, disregarding who ordered them".

In [None]:
## This will graph Diet Distributions by USERS and by ORDERS using Empirical Samples
fig, ax = plt.subplots(figsize=(10,8))

X = np.arange(4)
plt.bar(X + 0.00, User_Order_Emp.Percent1, color = 'b', width = 0.25)
plt.bar(X + 0.25, User_Order_Emp.Percent2, color = 'g', width = 0.25)
plt.ylabel('Percent')

ax.set_xticks([p + 1.5 * 0.25 for p in X])
ax.set_xticklabels(User_Order_Emp.Diet)

plt.legend(['P_user1', 'P_order1'], loc='upper right')
plt.title('Diet Distribution by USERS and by ORDERS of Empirical Samples')
plt.show()

In [None]:
## pvalues, variances, standard deviation, % diff, moe, DF and t_values calculated to test Hypothesis I
Use_Or=diff_std_p_val(User_Order_Emp)
Use_Or

### All p_values are < 0.05, so we reject the null hypothesis in favor of H1.
H1: There is a significant difference between classifying Diets using "users' overall orders" and using "individual orders, disregarding who ordered them".

### Q12F. Hypothesis testing part 2: Are there significant differences between Diet distribution classified by USERS and by ORDERS using Empirical dataset vs. Simulated dataset
#### I.
Ho: There is no significant difference between Diet distribution classified by USERS using Empirical dataset versus Simulated dataset

H1: There is significant difference between Diet distribution classified by USERS using Empirical dataset versus Simulated dataset

In [None]:
## Sim_Emp_User is the dataframe comparing % of empirical sample and simulated sample both classified by USERS
Sim_Emp_User=df_prop_compare(P_user1, Psimulated_user)
Sim_Emp_User

In [None]:
## This will graph Diet Distributions by USERS using Empirical Sample DataSet and Simulated Sample Dataset
fig, ax = plt.subplots(figsize=(10,8))

X = np.arange(4)
plt.bar(X + 0.00, Sim_Emp_User.Percent1, color = 'r', width = 0.25)
plt.bar(X + 0.25, Sim_Emp_User.Percent2, color = 'm', width = 0.25)
plt.ylabel('Percent')

ax.set_xticks([p + 1.5 * 0.25 for p in X])
ax.set_xticklabels(Sim_Emp_User.Diet)

plt.legend(['P_user1', 'Psimulated_user'], loc='upper right')
plt.title('Diet Distribution by USERS Empirical and Simulated')
plt.show()

In [None]:
## pvalues, variances, standard deviation, % diff, moe, DF and t_values calculated to test Hypothesis II  (by USERS)
Sim_Emp=diff_std_p_val(Sim_Emp_User)
Sim_Emp

### All p_values are < 0.05, so we reject the null hypothesis in favor of H1.
H1: There is a significant difference between the Diet distribution classified by USERS using the Empirical dataset versus the Simulated dataset.

#### II.
Ho: There is no significant difference between Diet distribution classified by ORDERS using Empirical dataset versus Simulated dataset

H1: There is significant difference between Diet distribution classified by ORDERS using Empirical dataset versus Simulated dataset

In [None]:
## Sim_Emp_Order is the dataframe comparing % of empirical sample and simulated sample both classified by ORDERS
Sim_Emp_Order=df_prop_compare(P_order1, Psimulated_order)
Sim_Emp_Order

In [None]:
## This will graph Diet Distributions classified by ORDERS using Empirical and Simulated Sample Datasets
fig, ax = plt.subplots(figsize=(10,8))

X = np.arange(4)
plt.bar(X + 0.00, Sim_Emp_Order.Percent1, color = 'b', width = 0.25)
plt.bar(X + 0.25, Sim_Emp_Order.Percent2, color = 'm', width = 0.25)
plt.ylabel('Percent')

ax.set_xticks([p + 1.5 * 0.25 for p in X])
ax.set_xticklabels(Sim_Emp_Order.Diet)

plt.legend(['P_order1', 'Psimulated_order'], loc='upper right')
plt.title('Diet Distribution by ORDERS Empirical and Simulated')
plt.show()

In [None]:
##Sim_Emp_Order is a dataframe with calculated standard deviation, moe, degrees of freedom, t_values and p_values (ORDERS)
Sim_Emp_Order=diff_std_p_val(Sim_Emp_Order)
Sim_Emp_Order

### For Meat_Lovers, NonVegan, and Vegan, p_values are < 0.05, so we reject the null hypothesis in favor of H1.
#### H1: There is a significant difference between the Diet distribution classified by ORDERS using the Empirical dataset versus the Simulated dataset.
#### However, for Pescatarian classified by ORDERS, p_value > 0.05, which means there is no significant difference in the percentage of this group between the Empirical dataset and the Simulated dataset.

## Q13. What is the probability that a customer is in a Diet classification given that they purchased from an aisle, P(Diet|Aisle) (posterior)? What is the probability that a product is from an aisle given that the customer's Diet is known, P(Aisle|Diet) (likelihood)?

In this section, I used Bayes' theorem to calculate P(Diet|Aisle) from P(Aisle|Diet), the diet proportions P(Diet), and the aisle proportions P(Aisle).
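
As a worked illustration of the rule P(Diet|Aisle) = P(Aisle|Diet) * P(Diet) / P(Aisle), the sketch below uses made-up numbers for a single aisle. It is a simplified assumption, not the notebook's calculation: here P(Aisle) is obtained from the law of total probability over the four diets, which guarantees that the four posteriors sum to 1.

```python
# Toy numbers (hypothetical) for one aisle
prior = {"Meat_Lovers": 0.7, "Pescatarian": 0.05, "NonVegan": 0.2, "Vegan": 0.05}
likelihood = {"Meat_Lovers": 0.02, "Pescatarian": 0.03, "NonVegan": 0.04, "Vegan": 0.06}

# Law of total probability: P(aisle) = sum over diets of P(aisle | diet) * P(diet)
p_aisle = sum(likelihood[d] * prior[d] for d in prior)

# Bayes' theorem: P(diet | aisle) = P(aisle | diet) * P(diet) / P(aisle)
posterior = {d: likelihood[d] * prior[d] / p_aisle for d in prior}

print(p_aisle, posterior)
```

When P(Aisle) and P(Diet) are estimated from different populations or at different levels (for example, products versus users), the posteriors are not guaranteed to sum to 1, so checking that sum is a quick sanity test.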

In [None]:
## TA is a dataframe with all orders and each user with Diet Classification
TA=E1.merge(C1)
TA1=TA[TA.Diet=='Meat_Lovers']
TA2=TA[TA.Diet=='Pescatarian']
TA3=TA[TA.Diet=='NonVegan']
TA4=TA[TA.Diet=='Vegan']

In [None]:
## Probability of buying from each aisle given that they are Meat_Lovers (P(aisle|Meat_Lovers))
Prob1=TA1.aisle_id.value_counts()
Prob1=pd.DataFrame(Prob1).reset_index()
Prob1.columns=['aisle_id', 'count1']
Prob1['Prob1']=Prob1['count1']/Prob1['count1'].sum()

In [None]:
## Probability of buying from each aisle given that they are Pescatarian (P(aisle|Pescatarian))
Prob2=TA2.aisle_id.value_counts()
Prob2=pd.DataFrame(Prob2).reset_index()
Prob2.columns=['aisle_id', 'count2']
Prob2['Prob2']=Prob2['count2']/Prob2['count2'].sum()

In [None]:
## Probability of buying from each aisle given that they are NonVegan (P(aisle|NonVegan))
Prob3=TA3.aisle_id.value_counts()
Prob3=pd.DataFrame(Prob3).reset_index()
Prob3.columns=['aisle_id', 'count3']
Prob3['Prob3']=Prob3['count3']/Prob3['count3'].sum()

In [None]:
## Probability of buying from each aisle given that they are Vegan (P(aisle|Vegan))
Prob4=TA4.aisle_id.value_counts()
Prob4=pd.DataFrame(Prob4).reset_index()
Prob4.columns=['aisle_id', 'count4']
Prob4['Prob4']=Prob4['count4']/Prob4['count4'].sum()

In [None]:
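## Prior probability of each diet, looked up by name instead of by row position
## (the last column of P_user1 holds the proportion)
diet_prior=P_user1.set_index('Diet').iloc[:,-1]
Pmeatlover=diet_prior['Meat_Lovers']
Pnonvegan=diet_prior['NonVegan']
Pvegan=diet_prior['Vegan']
Pescatarian=diet_prior['Pescatarian']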

In [None]:
Pall=Prob1.merge(Prob2).merge(Prob3).merge(Prob4).merge(aisle_percentb)

In [None]:
Pall['P(meat_lover|aisle)']=Pall.Prob1*Pmeatlover/Pall.aisle_percent_prior
Pall['P(pescatarian|aisle)']=Pall.Prob2*Pescatarian/Pall.aisle_percent_prior
Pall['P(nonvegan|aisle)']=Pall.Prob3*Pnonvegan/Pall.aisle_percent_prior
Pall['P(vegan|aisle)']=Pall.Prob4*Pvegan/Pall.aisle_percent_prior

In [None]:
Pall=Pall.set_index('aisle_id').sort_index().reset_index()
Pall

In [None]:
## This graphs aisle_id vs P(Diet|Aisle)
fig, ax = plt.subplots(figsize=(20,5))
## Plot against aisle_id itself, since Pall may not contain every aisle
plt.scatter(Pall.aisle_id, Pall['P(meat_lover|aisle)'], color='r')
plt.scatter(Pall.aisle_id, Pall['P(pescatarian|aisle)'], color='m')
plt.scatter(Pall.aisle_id, Pall['P(nonvegan|aisle)'], color='b')
plt.scatter(Pall.aisle_id, Pall['P(vegan|aisle)'], color='y')
plt.xlim(0,135)
plt.xlabel('aisle_id')
plt.ylabel('probability')
plt.title('P(Diet|Aisle)')
plt.legend(['Meat_Lover', 'Pescatarian', 'NonVegan', 'Vegan'], loc='upper right')
ax.set_xticks(np.arange(0,136,5))
plt.show()

In [None]:
## This graphs aisle_id vs P(Aisle|Diet)
fig, ax = plt.subplots(figsize=(20,10))
## Plot against aisle_id itself, since Pall may not contain every aisle
plt.scatter(Pall.aisle_id, Pall['Prob1'], color='r')
plt.scatter(Pall.aisle_id, Pall['Prob2'], color='m')
plt.scatter(Pall.aisle_id, Pall['Prob3'], color='b')
plt.scatter(Pall.aisle_id, Pall['Prob4'], color='y')
plt.xlim(0,135)
plt.xlabel('aisle_id')
plt.ylabel('probability')
plt.title('P(Aisle|Diet)')
plt.legend(['Meat_Lover', 'Pescatarian', 'NonVegan', 'Vegan'], loc='upper right')
ax.set_xticks(np.arange(0,136,5))
plt.show()

## Q14. What products appear in all of customer A's orders? These products will have a high probability of being reordered by customer A
The allprior dataframe must be used here, since alltrain has only 1 order per user and no order history.

In [None]:
## Fraction of products reordered in allprior DataFrame
allprior[allprior.reordered==1].shape[0]/allprior.shape[0]

In [None]:
## Fraction of products not reordered in allprior DataFrame
allprior[allprior.reordered==0].shape[0]/allprior.shape[0]

In [None]:
## Fraction of products reordered in alltrain DataFrame
alltrain[alltrain.reordered==1].shape[0]/alltrain.shape[0]

In [None]:
## Fraction of products not reordered in alltrain DataFrame
alltrain[alltrain.reordered==0].shape[0]/alltrain.shape[0]

In [None]:
Reorder_prior=allprior[allprior.reordered==1]

In [None]:
## Products reordered by each users in allprior dataframe. Products with highest reorder count have highest probability of being reordered
allprior[allprior.reordered==1].groupby('user_id')['product_name'].value_counts()
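
To answer the "appears in all orders" part of the question directly, one option is to compare, for each user and product, the number of orders containing the product with the user's total number of orders. A minimal sketch on made-up toy data (hypothetical values; the column names match allprior):

```python
import pandas as pd

# Toy order lines (hypothetical values): one row per ordered item
toy = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 1, 2, 2],
    "order_id":     [11, 11, 12, 12, 13, 21, 22],
    "product_name": ["Banana", "Milk", "Banana", "Bread", "Banana", "Milk", "Milk"],
})

# Number of distinct orders per user
orders_per_user = toy.groupby("user_id")["order_id"].nunique()

# Number of distinct orders that contain each (user, product) pair
pairs = toy.drop_duplicates(["user_id", "order_id", "product_name"])
counts = (pairs.groupby(["user_id", "product_name"]).size()
               .reset_index(name="n_orders_with_product"))

# Share of the user's orders that contain the product (1.0 means every order)
counts["n_orders"] = counts["user_id"].map(orders_per_user)
counts["share"] = counts["n_orders_with_product"] / counts["n_orders"]

in_every_order = counts[counts["share"] == 1.0]
print(in_every_order)
```

The `share` column also works as a simple per-user reorder score: the closer it is to 1, the more consistently the user buys that product.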

## Q15. How many orders for each customer?

In [None]:
allprior.groupby('user_id')['order_number'].max()

## Q16. What is the average number of products per order for each customer?

### Using Train dataframe

In [None]:
alltrain.groupby('order_id')['product_name'].size().mean()   ## overall average products per order

In [None]:
alltrain.groupby('user_id')['order_id'].size().head()     # number of products ordered per customer


### Using allprior dataframe

In [None]:
allprior.groupby('order_id')['product_name'].size().mean()

In [None]:
allprior.groupby('user_id')['order_id'].size().head()
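
Grouping by `user_id` alone counts the total number of products a customer bought, not the average per order. One way to get the per-customer average is to compute the size of each order first and then average those sizes per user. This is a minimal sketch on toy data, assuming the `user_id`, `order_id`, and `product_id` columns used elsewhere in this notebook; the toy values are hypothetical.

```python
import pandas as pd

# Toy stand-in for allprior
toy = pd.DataFrame({
    'user_id':    [1, 1, 1, 1, 1, 2, 2, 2],
    'order_id':   [10, 10, 11, 12, 12, 20, 21, 21],
    'product_id': [101, 102, 101, 101, 102, 201, 201, 202],
})

# Basket size of every order (one value per user/order pair)
basket_size = toy.groupby(['user_id', 'order_id']).size().rename('n_products')

# Mean basket size per user
avg_per_user = basket_size.groupby(level='user_id').mean()
print(avg_per_user)
```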

## Q17. Using the average number of products per order for each customer and the probability of each product being reordered by that customer, can I predict the products the customer will reorder?
Q18. How accurate is my predicted next order list compared with the actual next order list?

In [None]:
Users1=otrain.user_id.sample(n=20000, replace=False, random_state=0, axis=0)

In [None]:
User_train=allprior.loc[allprior.user_id.isin(Users1)]

In [None]:
## U_order_train is the dataframe of User_train with their number of orders
U_order_train=User_train.groupby('user_id')['order_number'].agg(['max'])
U_order_train['user_id']=U_order_train.index.get_level_values('user_id').values
U_order_train.columns=['max_order', 'user_id']

### F9. This function produces a dataframe with the predicted order for each user

In [None]:
def predict_order(Use_Or):  
    ##Train Data with user_id and latest product ordered
    ytrain=Use_Or.loc[:,['user_id', 'product_id']]
    ytrain.columns=['user_id', 'product_id_latest_train']
    
    ##Number of products in latest order (Training Data)
    Tr=ytrain.user_id.value_counts()
    Tr=pd.DataFrame(Tr).reset_index()
    Tr.columns=['user_id', 'tQty']

    ## DataFrame with one row per user and product_id, where Qty is the number of times the user purchased that product
    R=Use_Or.groupby('user_id')['product_id'].value_counts()
    R=pd.DataFrame(R)
    R.columns=['Qty']
    R=R.reset_index()
    
    ##DataFrame with average product per order
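    Q=Use_Or.groupby(['user_id', 'order_id'])['product_id'].size()
    Q=pd.DataFrame(Q)
    Q.columns=['n_products']   ## name the count column explicitly instead of relying on a default label

    Q=Q.reset_index()
    p=Q.groupby('user_id')['n_products'].mean()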
    p=pd.DataFrame(p)
    p.columns=['ave_prod_per_order']
    p=p.reset_index()
    Q=Q.merge(p)
    Q['ave_prod_per_order']=Q['ave_prod_per_order'].round(decimals=0)
    Q=Q.loc[:,['user_id','order_id', 'ave_prod_per_order']]
    Q=Q.loc[:,['user_id', 'ave_prod_per_order']].drop_duplicates()

    ## Dataframe with user_id and total product purchased
    total_products=Use_Or.user_id.value_counts()
    total_products=pd.DataFrame(total_products)
    total_products.reset_index(inplace=True)
    total_products.columns=['user_id', 'total_products']
    total_products.drop_duplicates(inplace=True)
    
    ## This will compute probability of products to be purchased by user
    R=R.merge(total_products)
    R['Prob']=R.Qty/R.total_products
    

    ## This will predict next order list using calculated probabilities of products for each user and using average product per order
    B=[]
    user=pd.Series.tolist(Q.user_id)
    n=pd.Series.tolist(Q.ave_prod_per_order) ## This is a list of average product per order per user

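    for i in range(len(n)):
        H=R[R.user_id==user[i]]
        ## size must be an integer and cannot exceed the number of distinct products when replace=False
        size=min(int(n[i]), len(H))
        K=np.random.choice(a=H.product_id, size=size, p=H.Prob, replace=False)
        K=K.tolist()
        B.append(K)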
    ## This is the predicted latest order
    Reord=pd.DataFrame(B, index=Q.user_id).stack()
    Reord=pd.DataFrame(Reord, columns=['product_id'])
    Reord=Reord.reset_index()
    Reord.drop('level_1', axis=1, inplace=True)
    
    return Reord

In [None]:
## Produce a dataframe of predicted products for each user in User_train
U_pred=predict_order(User_train)
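
The sampling step inside `predict_order` can be seen in isolation on one user. The sketch below assumes a hypothetical purchase history and uses NumPy's seeded `default_rng` generator (a substitution for the unseeded `np.random.choice` call above) so that the draw is repeatable. Products are drawn without replacement, weighted by how often the user bought them.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the draw is repeatable

# Hypothetical purchase history for one user: product_id -> times purchased
history = {101: 6, 102: 3, 103: 1}
avg_basket = 2.0  # rounded average number of products per order (a float, as in Q)

products = np.array(list(history.keys()))
counts = np.array(list(history.values()), dtype=float)
prob = counts / counts.sum()  # empirical purchase probabilities, summing to 1

# The sample size must be an int and no larger than the number of distinct products
size = min(int(avg_basket), len(products))
predicted = rng.choice(products, size=size, replace=False, p=prob)
print(predicted.tolist())
```

Because the draw is random, two runs without a fixed seed can predict different baskets for the same user, so the scores computed later vary from run to run.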

### F10. This function reformats the predicted products into a single space-separated string, so that each user_id has one row listing all of its predicted products

In [None]:
def predicted_product_format(predicted):
    pred = ''
    for product in predicted:
        if product > 0:
            pred = pred + str(int(product)) + ' '
    
    if pred != '':
        return pred.rstrip()
    else:
        return 'None'

In [None]:
# this creates a DataFrame of user_id and predicted product_id list 
predicted_order = pd.DataFrame(U_pred.groupby('user_id')["product_id"].apply(predicted_product_format)).reset_index()
predicted_order

In [None]:
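## This generates the actual products ordered by Users1 in the alltrain dataset
## (User_train is built from allprior, so alltrain is filtered here to get each user's actual latest order)
train_order = pd.DataFrame(alltrain.loc[alltrain.user_id.isin(Users1)].groupby('user_id')["product_id"].apply(predicted_product_format)).reset_index()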
train_order.columns=['user_id', 'product_id_train']

In [None]:
combine_pred_train=predicted_order.merge(train_order)
combine_pred_train

### F11. This function calculates cosine similarity score

In [None]:
from collections import Counter
import math
## Cosine similarity function to compare % similarities in the train and predicted products

def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)
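
As a quick check of the formula, the sketch below applies the same cosine computation to two short order lists. The product IDs are made up for illustration, and the function is restated under the hypothetical name `cosine` with an added guard for empty inputs, which the version above does not have.

```python
import math
from collections import Counter

def cosine(c1, c2):
    # Same formula as counter_cosine_similarity, plus a guard for empty inputs
    terms = set(c1) | set(c2)
    dot = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    mag1 = math.sqrt(sum(v * v for v in c1.values()))
    mag2 = math.sqrt(sum(v * v for v in c2.values()))
    if mag1 == 0 or mag2 == 0:
        return 0.0
    return dot / (mag1 * mag2)

# Hypothetical actual and predicted order lists, formatted as in combine_pred_train
actual = Counter('196 12427 10258'.split())
predicted = Counter('12427 10258 25133'.split())

# Two shared products out of three in each list: 2 / (sqrt(3) * sqrt(3)) = 2/3
score = cosine(actual, predicted)
print(round(score, 4))
```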


In [None]:
i=0
cs=[]
st=combine_pred_train.product_id_train.values.tolist()
sp=combine_pred_train.product_id.values.tolist()

### F12. This function calculates the cosine similarity score for every row of the dataframe that holds the predicted and actual order lists

In [None]:
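def cos_sim_score(D):
    ## D is not used directly: the function reads the global lists st and sp built above.
    ## It splits their strings into token lists in place, so run it only once after st and sp are built.
    for i in range(len(st)):
        st[i]=st[i].split()
        sp[i]=sp[i].split()
        a=Counter(st[i])
        b=Counter(sp[i])
        cs.append(counter_cosine_similarity(a,b))
    return pd.Series(cs)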

In [None]:
## cosine_similarity_score is added to the combine_pred_train dataframe
combine_pred_train['cosine_similarity_score']=cos_sim_score(combine_pred_train)
combine_pred_train

In [None]:
combine_pred_train.cosine_similarity_score.mean()

### F13. These two functions calculate the F1 score for each row of a dataframe with predicted and actual next order lists

In [None]:
def f1_score_single(y_true, y_pred):
    y_true = set(y_true)
    y_pred = set(y_pred)
    cross_size = len(y_true & y_pred)
    if cross_size == 0: return 0.
    p = 1. * cross_size / len(y_pred)
    r = 1. * cross_size / len(y_true)
    return 2 * p * r / (p + r)
    
def f1_score(y_true, y_pred):
    return np.mean([f1_score_single(x, y) for x, y in zip(y_true, y_pred)])
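
A small worked example makes the score concrete. The sketch below restates the per-row computation under the hypothetical name `f1_single` and applies it to made-up product IDs: 2 of 4 predicted products are correct (precision 0.5) and 2 of 3 actual products are found (recall 2/3), which gives F1 = 4/7.

```python
def f1_single(y_true, y_pred):
    # Same computation as f1_score_single above
    y_true, y_pred = set(y_true), set(y_pred)
    overlap = len(y_true & y_pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(y_pred)
    recall = overlap / len(y_true)
    return 2 * precision * recall / (precision + recall)

# Hypothetical actual and predicted order lists
actual = '196 12427 10258'.split()
predicted = '12427 10258 25133 13176'.split()

score = f1_single(actual, predicted)
print(round(score, 4))
```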

In [None]:
## This loop calculates the F1 score for every sampled user in the combine_pred_train dataframe
## st and sp already hold token lists because cos_sim_score split them in place
F=[]
for y_true, y_pred in zip(st, sp):
    F.append(f1_score_single(y_true, y_pred))

In [None]:
combine_pred_train['F1_score']=pd.Series(F)
combine_pred_train

In [None]:
## This is the mean F1_score of all predicted orders of users 
np.mean(F)