### Contents

01 Import libraries and dataset

02 Mean of order_number, grouped by department

03 Loyalty flag

04 Spending habits by loyalty category

05 Low/high spender categories

06 Order frequency flag

07 Export dataframe

01 Import libraries and dataset

In [2]:
import pandas as pd
import numpy as np
import os

In [3]:
path = r"C:\Users\cathe\OneDrive\Data Analysis\2 4 Instacart Basket Analysis\02 Data"

In [4]:
df = pd.read_pickle(os.path.join(path, 'Prepared Data', 'ords_prods_merge_200225.pkl'))

In [5]:
df.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Regular day,Average orders,10,New customer
1,2539329,1,1,2,8,,14084,2,0,Organic Unsweetened Vanilla Almond Milk,91,16,12.5,Mid-range product,Regularly busy,Regular day,Average orders,10,New customer
2,2539329,1,1,2,8,,12427,3,0,Original Beef Jerky,23,19,4.4,Low-range product,Regularly busy,Regular day,Average orders,10,New customer
3,2539329,1,1,2,8,,26088,4,0,Aged White Cheddar Popcorn,23,19,4.7,Low-range product,Regularly busy,Regular day,Average orders,10,New customer
4,2539329,1,1,2,8,,26405,5,0,XL Pick-A-Size Paper Towel Rolls,54,17,1.0,Low-range product,Regularly busy,Regular day,Average orders,10,New customer


02 Mean of order_number, grouped by department

In [7]:
df.groupby('department_id').agg({'order_number': ['mean']})

department_id,order_number (mean)
1,15.457838
2,17.27792
3,17.170395
4,17.811403
5,15.215751
6,16.439806
7,17.225802
8,15.34065
9,15.895474
10,20.197148


There are minor differences from the results obtained from the subset, but they are not large. The sample used previously was not chosen randomly: it picked customers with lower user_ids, i.e. those who registered with Instacart earliest. Nevertheless, it seems to be representative of the population.

I don't think mean order_number gives the information that the exercise implies. If two customers each ordered from department 5 once, one in their first order and one in their 15th, the second would contribute more to the mean order_number for department 5, even though both have shopped in that department exactly once.
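A toy illustration of the point above. The two customers and their order numbers are made up for the example and are not taken from the Instacart data; counting distinct customers is shown only as one possible alternative measure.

```python
import pandas as pd

# Hypothetical toy data: two customers who each bought from department 5 exactly once.
toy = pd.DataFrame({
    'user_id': [1, 2],
    'department_id': [5, 5],
    'order_number': [1, 15],  # customer 1 on their 1st order, customer 2 on their 15th
})

# Mean order_number for the department: pulled upwards by the later order.
mean_order_number = toy.groupby('department_id')['order_number'].mean()

# Distinct customers per department: both customers count equally.
customers_per_department = toy.groupby('department_id')['user_id'].nunique()

print(mean_order_number)
print(customers_per_department)
```

Both customers used the department once, yet the mean order_number lands midway between 1 and 15 and says nothing about how many people shop there.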

03 Loyalty flag

The head of the dataframe above shows that the loyalty_flag column is already present, so I will reproduce the code used to create it below without running it again.

In [None]:
# Create max_order column

df['max_order'] = df.groupby(['user_id'])['order_number'].transform('max')

In [None]:
# Create loyalty_flag column (stage 1)

df.loc[df['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [None]:
df.loc[(df['max_order'] <= 40) & (df['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [None]:
df.loc[df['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

In [15]:
df['loyalty_flag'].value_counts()

loyalty_flag
Regular customer    15876363
Loyal customer      10284010
New customer         6243788
Name: count, dtype: int64
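The three `.loc` assignments above could also be written as a single binning step. This is only a sketch of an alternative, not the code that produced the column: it assumes `pd.cut` with right-closed bins, and the `toy` frame with its `max_order` values is hypothetical, chosen to sit on and around the thresholds.

```python
import numpy as np
import pandas as pd

# Hypothetical max_order values on and around the 10 and 40 thresholds.
toy = pd.DataFrame({'max_order': [5, 10, 11, 40, 41]})

# Right-closed bins: (-inf, 10], (10, 40], (40, inf)
toy['loyalty_flag'] = pd.cut(
    toy['max_order'],
    bins=[-np.inf, 10, 40, np.inf],
    labels=['New customer', 'Regular customer', 'Loyal customer'],
)

print(toy)
```

Note that `pd.cut` returns a categorical column rather than plain strings, which may or may not suit later steps.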

04 Spending habits by loyalty category

In [17]:
df.groupby('loyalty_flag').agg({'prices': ['min', 'max', 'mean']})

loyalty_flag,prices (min),prices (max),prices (mean)
Loyal customer,1.0,25.0,7.774439
New customer,1.0,25.0,7.802283
Regular customer,1.0,25.0,7.799262


There is very little difference between the spending habits of loyal, regular and new customers. Each category has the same minimum and maximum item price (`$1` and `$25` respectively). The average price paid per item differs very little: `$7.80` per item for new and regular customers, and about 3 cents less for loyal customers (`$7.77` per item).
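The same kind of summary can be produced with flat, self-describing column names by using pandas named aggregation. A minimal sketch on made-up data; the `toy` frame and the output column names `min_price`, `max_price` and `mean_price` are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Instacart dataset.
toy = pd.DataFrame({
    'loyalty_flag': ['New customer', 'New customer', 'Loyal customer'],
    'prices': [1.0, 9.0, 4.5],
})

# Named aggregation: each keyword becomes a single-level output column.
price_summary = toy.groupby('loyalty_flag').agg(
    min_price=('prices', 'min'),
    max_price=('prices', 'max'),
    mean_price=('prices', 'mean'),
)

print(price_summary)
```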

05 Low/high spender categories

In [19]:
# Create new column av_item_spend

df['av_item_spend'] = df.groupby('user_id')['prices'].transform('mean')


In [21]:
# Check result

df.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,av_item_spend
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,Mid-range product,Regularly busy,Regular day,Average orders,10,New customer,6.367797
1,2539329,1,1,2,8,,14084,2,0,Organic Unsweetened Vanilla Almond Milk,91,16,12.5,Mid-range product,Regularly busy,Regular day,Average orders,10,New customer,6.367797
2,2539329,1,1,2,8,,12427,3,0,Original Beef Jerky,23,19,4.4,Low-range product,Regularly busy,Regular day,Average orders,10,New customer,6.367797
3,2539329,1,1,2,8,,26088,4,0,Aged White Cheddar Popcorn,23,19,4.7,Low-range product,Regularly busy,Regular day,Average orders,10,New customer,6.367797
4,2539329,1,1,2,8,,26405,5,0,XL Pick-A-Size Paper Towel Rolls,54,17,1.0,Low-range product,Regularly busy,Regular day,Average orders,10,New customer,6.367797


This looks right: every row for the same user_id has the same av_item_spend. My first attempt gave different values for the same user_id, so it was clearly wrong.
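Rather than judging this by eye from the first few rows, the check can be made for every customer at once. A sketch on hypothetical toy data; the name `is_consistent` is made up for the example.

```python
import pandas as pd

# Hypothetical toy data: two customers with several purchased items each.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'prices': [9.0, 4.0, 1.0, 2.0, 6.0],
})
toy['av_item_spend'] = toy.groupby('user_id')['prices'].transform('mean')

# Each user should have exactly one distinct av_item_spend value.
values_per_user = toy.groupby('user_id')['av_item_spend'].nunique()
is_consistent = bool((values_per_user == 1).all())

print(is_consistent)
```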

In [23]:
# Create spend_level flag (stage 1)

df.loc[df['av_item_spend'] >= 10, 'spend_level'] = 'High spender'

In [25]:
df.loc[df['av_item_spend'] < 10, 'spend_level'] = 'Low spender'

In [27]:
# Check result

df['spend_level'].value_counts(dropna = False)

spend_level
Low spender     32284201
High spender      119960
Name: count, dtype: int64

06 Order frequency flag

In [29]:
# Create median_days_since_prior_order column

df['median_days_since_prior_order'] = df.groupby('user_id')['days_since_prior_order'].transform('median')


In [31]:
# Check result

df[['user_id', 'days_since_prior_order', 'median_days_since_prior_order']].head(60)

Unnamed: 0,user_id,days_since_prior_order,median_days_since_prior_order
0,1,,20.5
1,1,,20.5
2,1,,20.5
3,1,,20.5
4,1,,20.5
5,1,15.0,20.5
6,1,15.0,20.5
7,1,15.0,20.5
8,1,15.0,20.5
9,1,15.0,20.5


In [33]:
# Create user-defined function

def order_frequency(row):
    """Label a row by its customer's median days between orders."""
    median_days = row['median_days_since_prior_order']
    if median_days <= 10:
        return 'Frequent customer'
    elif median_days <= 20:
        return 'Regular customer'
    elif median_days > 20:
        return 'Non-frequent customer'
    else:
        # Reached only when the median is NaN (all comparisons are False)
        return 'Not enough data'

In [35]:
# Apply to dataframe to create column

df['order_frequency'] = df.apply(order_frequency, axis=1)

In [37]:
# Check output

df['order_frequency'].value_counts()

order_frequency
Frequent customer        21559380
Regular customer          7208433
Non-frequent customer     3636343
Not enough data                 5
Name: count, dtype: int64
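Applying a Python function row by row is slow on a dataframe with over 32 million rows. The same flag could probably be built with a vectorised `np.select`; the sketch below uses the same thresholds on a hypothetical `toy` frame and is an alternative, not the code that was run above.

```python
import numpy as np
import pandas as pd

# Hypothetical medians covering each branch, including a missing value.
toy = pd.DataFrame({'median_days_since_prior_order': [7.0, 10.0, 20.0, 20.5, np.nan]})
median_days = toy['median_days_since_prior_order']

# Conditions are checked in order; comparisons with NaN are all False.
conditions = [
    median_days <= 10,
    (median_days > 10) & (median_days <= 20),
    median_days > 20,
]
labels = ['Frequent customer', 'Regular customer', 'Non-frequent customer']

# Rows matching no condition (NaN medians) fall through to the default.
toy['order_frequency'] = np.select(conditions, labels, default='Not enough data')

print(toy)
```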

07 Export dataframe

In [39]:
df.to_pickle(os.path.join(path, 'Prepared Data', 'ords_prods_merge_200225_2.pkl'))