# Exercise 4.8

### Content List:
#### -- Import libraries & datasets
#### -- Ensure 'orders_products_merged.pkl' read successfully
#### -- Aggregated mean of the “order_number” column grouped by “department_id” 
#### -- How do the results for the entire dataframe differ from those of the subset?
#### -- Creating loyalty flag for existing customers
#### -- Determine whether prices of products purchased differ among loyal, new, and regular customers
#### -- Create a spending flag for each user based on the average price across all their orders
#### -- Create an order frequency flag
#### -- Export

## Import libraries & datasets

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os
import warnings

In [2]:
# Path variable
path = r'/Users/jsok/Instacart Basket Analysis'

In [3]:
# read pickle file and store in ords_prods_merge
ords_prods_merge = pd.read_pickle(os.path.join(path,'02 Data','Prepared Data','orders_products_merged.pkl'))

## Ensure 'orders_products_merged.pkl' read successfully.

In [4]:
ords_prods_merge.shape

(32404859, 18)

In [5]:
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,Busiest_day,Busiest_days,Busiest_period_of_day
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy,Average orders
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Average orders
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Most orders
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Average orders
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Most orders


##  Aggregated mean of the “order_number” column grouped by “department_id” 

In [6]:
# adding an .agg() to aggregate the mean order numbers
ords_prods_merge.groupby('department_id').agg({'order_number': ['mean']})

department_id,order_number (mean)
1,15.457838
2,17.27792
3,17.170395
4,17.811403
5,15.215751
6,16.439806
7,17.225802
8,15.34065
9,15.895474
10,20.197148


## How do the results for the entire dataframe differ from those of the subset?
#### The grouped mean computed on the entire dataframe differs from the subset's result in that it covers every department rather than only 8. The subset's means for those 8 departments are also unreliable, because they are computed from only part of the data, whereas the means in this aggregation use every row.
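The effect can be reproduced on a tiny, made-up dataframe. This is only a sketch: the values and the three-row slice are assumptions for illustration, not the subset used in the earlier exercise.

```python
import pandas as pd

# Toy data (invented for illustration): three departments, six rows
df = pd.DataFrame({
    'department_id': [1, 1, 2, 2, 3, 3],
    'order_number':  [2, 4, 10, 20, 5, 7],
})

# A slice of the first three rows: department 3 is absent
# and department 2 is only partially represented
subset = df[:3]

full_means = df.groupby('department_id')['order_number'].mean()
subset_means = subset.groupby('department_id')['order_number'].mean()

print(full_means)
print(subset_means)
```

Department 3 disappears from the subset result, and the mean for department 2 changes because one of its rows falls outside the slice.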

## Creating loyalty flag for existing customers

In [7]:
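# use transform to broadcast each user's maximum order number to every row for that user
# (the string 'max' avoids the FutureWarning newer pandas versions raise for np.max)
ords_prods_merge['max_order'] = ords_prods_merge.groupby(['user_id'])['order_number'].transform('max')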

In [8]:
# create a flag that assigns a loyalty label to a user ID based on max orders
# Loyal customer flag
ords_prods_merge.loc[ords_prods_merge['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [9]:
# Regular customer flag
ords_prods_merge.loc[(ords_prods_merge['max_order'] <= 40) & (ords_prods_merge['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [10]:
# New customer flag
ords_prods_merge.loc[ords_prods_merge['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

In [11]:
# checking the value counts of the new column
ords_prods_merge['loyalty_flag'].value_counts(dropna = False)

Regular customer    15876776
Loyal customer      10284093
New customer         6243990
Name: loyalty_flag, dtype: int64
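As an alternative to three separate `.loc` assignments, the same thresholds can be expressed as bins. The sketch below uses `pd.cut` on a small invented dataframe (the `orders` name and its values are assumptions); it is one possible formulation, not the approach this notebook takes.

```python
import pandas as pd

# Toy data (invented): three users with different maximum order numbers
orders = pd.DataFrame({
    'user_id':      [1, 1, 2, 2, 3],
    'order_number': [1, 41, 1, 11, 10],
})

# transform('max') returns one value per row, aligned with the original index
orders['max_order'] = orders.groupby('user_id')['order_number'].transform('max')

# Bins are right-inclusive by default: (0, 10], (10, 40], (40, inf]
orders['loyalty_flag'] = pd.cut(
    orders['max_order'],
    bins=[0, 10, 40, float('inf')],
    labels=['New customer', 'Regular customer', 'Loyal customer'],
)

print(orders)
```

Note that `pd.cut` returns a categorical column rather than plain strings, which can save memory on a dataframe of this size.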

## Determine whether the prices of products purchased differ among loyal, new, and regular customers

In [12]:
# group by the recently created 'loyalty_flag' and aggregate mean, min, max, and count
ords_prods_merge.groupby('loyalty_flag').agg({'prices': ['mean','min','max','count']})

loyalty_flag,prices (mean),prices (min),prices (max),prices (count)
Loyal customer,7.772758,0.99,25.0,10284093
New customer,7.800029,0.99,25.0,6243990
Regular customer,7.797197,0.99,25.0,15876776


#### The average price of products purchased hardly differs among customer types. Regular customers account for the most rows (purchased items). New customers buy the most expensive products on average (7.80), followed closely by regular customers, while loyal customers, interestingly, buy the least expensive (7.77).
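The dictionary form of `.agg()` produces two-level column headers. A hedged alternative is pandas named aggregation, which yields flat column names; the sketch below uses invented prices, and the output column names (`mean_price`, `n_rows`, and so on) are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (invented): two customer types, two purchased items each
df = pd.DataFrame({
    'loyalty_flag': ['Loyal customer', 'Loyal customer',
                     'New customer', 'New customer'],
    'prices': [2.0, 4.0, 1.0, 9.0],
})

# Named aggregation: new_column=(source_column, aggregation function)
price_stats = df.groupby('loyalty_flag').agg(
    mean_price=('prices', 'mean'),
    min_price=('prices', 'min'),
    max_price=('prices', 'max'),
    n_rows=('prices', 'count'),
)

print(price_stats)
```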

## Create a spending flag for each user based on the average price across all their orders

In [13]:
# use transform to broadcast the average price of all products each user purchased to every row for that user
ords_prods_merge['avg_order_price'] = ords_prods_merge.groupby(['user_id'])['prices'].transform('mean')

In [14]:
# create a flag that assigns a spending flag label to a user ID based on average price of orders
# Low spender
ords_prods_merge.loc[ords_prods_merge['avg_order_price'] < 10, 'spending_flag'] = 'Low spender'

In [15]:
# High spender
ords_prods_merge.loc[ords_prods_merge['avg_order_price'] >= 10, 'spending_flag'] = 'High spender'

In [16]:
# checking the value counts of the new column
ords_prods_merge['spending_flag'].value_counts(dropna = False)

Low spender     32285165
High spender      119694
Name: spending_flag, dtype: int64

#### The vast majority of rows belong to low spenders: only 119,694 of the 32,404,859 rows belong to high spenders. These counts are rows (purchased items), not distinct users.
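Because each row of the merged dataframe is a purchased item, `value_counts()` on a flag column counts rows. To count distinct users per flag instead, one option is `nunique()` on `user_id`. The sketch below runs on invented data and is not a result from the Instacart dataset.

```python
import pandas as pd

# Toy data (invented): user 1 has three items, user 2 has two, user 3 has one
df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 3],
    'prices':  [5.0, 6.0, 7.0, 12.0, 14.0, 9.0],
})

# Same flag logic as the notebook: average price per user, threshold at 10
df['avg_order_price'] = df.groupby('user_id')['prices'].transform('mean')
df.loc[df['avg_order_price'] < 10, 'spending_flag'] = 'Low spender'
df.loc[df['avg_order_price'] >= 10, 'spending_flag'] = 'High spender'

rows_per_flag = df['spending_flag'].value_counts()                  # counts rows
users_per_flag = df.groupby('spending_flag')['user_id'].nunique()   # counts users

print(rows_per_flag)
print(users_per_flag)
```

Here four rows are flagged as low spender, but they come from only two users.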

## Create an order frequency flag that marks the regularity of a user’s ordering behavior according to the median in the “days_since_prior_order” column.

In [17]:
# use transform to broadcast each user's median 'days_since_prior_order' to every row for that user
# (the string 'median' uses the pandas method, which skips NaN values)
ords_prods_merge['median_dspo'] = ords_prods_merge.groupby(['user_id'])['days_since_prior_order'].transform('median')

In [18]:
# create a flag that assigns an order frequency label to a user ID based on the median days since prior order
# Non frequent customer
ords_prods_merge.loc[ords_prods_merge['median_dspo'] > 20, 'order_freq_flag'] = 'Non-frequent customer'

In [19]:
# Regular customer
ords_prods_merge.loc[(ords_prods_merge['median_dspo'] > 10) & (ords_prods_merge['median_dspo'] <= 20), 'order_freq_flag'] = 'Regular customer'

In [20]:
# Frequent customer
ords_prods_merge.loc[ords_prods_merge['median_dspo'] <= 10, 'order_freq_flag'] = 'Frequent customer'

In [21]:
# checking the value counts of the new column (dropna = True omits the 5 rows with a missing median; see the null check in the next cell)
ords_prods_merge['order_freq_flag'].value_counts(dropna = True)

Frequent customer        21559853
Regular customer          7208564
Non-frequent customer     3636437
Name: order_freq_flag, dtype: int64

In [22]:
# checking where null values appear
print(ords_prods_merge.isnull().sum())

order_id                        0
user_id                         0
order_number                    0
order_day_of_week               0
order_hour_of_day               0
days_since_prior_order    2076096
product_id                      0
add_to_cart_order               0
reordered                       0
product_name                    0
aisle_id                        0
department_id                   0
prices                          0
_merge                          0
price_range_loc                 0
Busiest_day                     0
Busiest_days                    0
Busiest_period_of_day           0
max_order                       0
loyalty_flag                    0
avg_order_price                 0
spending_flag                   0
median_dspo                     5
order_freq_flag                 5
dtype: int64
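The 5 missing values in `median_dspo` and `order_freq_flag` are consistent with users whose `days_since_prior_order` is missing on every row, for example a user with only a first order; that explanation is an assumption, not something verified in this notebook. The sketch below shows the mechanism on invented data: the pandas median skips NaN values but returns NaN when a group has nothing else.

```python
import numpy as np
import pandas as pd

# Toy data (invented): user 1 has a first order (NaN) plus two later orders,
# user 2 has only a first order
df = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'days_since_prior_order': [np.nan, 7.0, 9.0, np.nan],
})

df['median_dspo'] = (
    df.groupby('user_id')['days_since_prior_order'].transform('median')
)

# Users whose median is NaN would receive no order frequency flag
unflagged_users = df.loc[df['median_dspo'].isna(), 'user_id'].unique()

print(df)
print(unflagged_users)
```

Running the same `isna()` filter on `ords_prods_merge` would list the affected users so they can be inspected before deciding how to handle them.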


## Export

In [23]:
# Export
ords_prods_merge.to_pickle(os.path.join(path, '02 Data','Prepared Data','orders_products_merged.pkl'))