# Task Group & Aggregate Data

## Contents
1. Aggregated mean of the “order_number” column grouped by “department_id”
2. Creating a loyalty flag for existing customers using transform() and loc()
3. Difference between the spending habits of the three types of customers?
4. Creating a spending flag for each user based on the average price across all their orders
5. Creating an order frequency flag that marks the regularity of a user’s ordering behavior according to the median in the “days_since_prior_order” column

In [2]:
# Import libraries

import pandas as pd
import numpy as np
import os

In [3]:
# Project folder path as string
path = r'/Users/sophie/Desktop/CareerFoundry /09 2023 Phython'
path

'/Users/sophie/Desktop/CareerFoundry /09 2023 Phython'

In [4]:
# Import data set, run checks
ords_prods_merge = pd.read_pickle(os.path.join(path, 'Data', 'prepared data ', 'ords_prods_merged_newvar.pkl'))
ords_prods_merge.shape

(32404859, 20)

In [5]:
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,new_customer,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day,busiest_days,busiest_hours,busiest_period_of_day
0,2539329,1,1,2,8,,True,196,1,0,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy,Average orders,Average orders
1,2398795,1,2,3,7,15.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Average orders,Average orders
2,473747,1,3,3,12,21.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Average orders,Most orders
3,2254736,1,4,4,7,29.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Average orders,Average orders
4,431534,1,5,4,15,28.0,False,196,1,1,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Most orders,Most orders


## 1. Aggregated mean of the “order_number” column grouped by “department_id”

In [6]:
# Mean of the "order_number" column for each department_id.
# Alternative that returns a DataFrame instead of a Series:
# ords_prods_merge.groupby('department_id').agg({'order_number': ['mean']})

ords_prods_merge.groupby('department_id')['order_number'].mean()

department_id
1     15.457838
2     17.277920
3     17.170395
4     17.811403
5     15.215751
6     16.439806
7     17.225802
8     15.340650
9     15.895474
10    20.197148
11    16.170638
12    15.887671
13    16.583536
14    16.773669
15    16.165037
16    17.665606
17    15.694469
18    19.310397
19    17.177343
20    16.473447
21    22.902379
Name: order_number, dtype: float64
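The two ways of writing the aggregation differ only in the shape of the result. A minimal sketch on made-up toy data (the values below are illustrative and not taken from the real data set):

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    'department_id': [1, 1, 2, 2, 2],
    'order_number': [1, 3, 2, 4, 6],
})

# Selecting the column first returns a Series: one mean per department
means_series = df.groupby('department_id')['order_number'].mean()

# agg() with a dict returns a DataFrame whose columns have two levels:
# ('order_number', 'mean')
means_frame = df.groupby('department_id').agg({'order_number': ['mean']})

print(means_series)
print(means_frame)
```

Both calls compute the same numbers; the Series form is usually easier to read when only one statistic is needed.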

## How do the results for the entire dataframe differ from those of the subset? 

### subset:
#### department_id
#### 4     18.825780
#### 7     17.472355
#### 13    17.993423
#### 14    19.246334
#### 16    19.463012
#### 17    11.294069
#### 19    19.305237
#### 20    17.599636

### In the subset, most departments have a higher mean order number than in the entire dataframe. The entire dataframe also includes more departments with lower mean order numbers (many around 15).

## 2. Creating a loyalty flag for existing customers using transform() and loc()

In [29]:
# Loyalty flag, preparation: check for missing values in 'user_id'
missing_in_user_id = ords_prods_merge['user_id'].isna().any()

print("Missing in 'user_id' column:", missing_in_user_id)

Missing in 'user_id' column: False


In [30]:
# Loyalty flag, preparation: check for missing values in 'order_number'
missing_in_order_number = ords_prods_merge['order_number'].isna().any()

print("Missing in 'order_number' column:", missing_in_order_number)

Missing in 'order_number' column: False


In [31]:
# Loyalty: max number of orders by EXISTING customers
# 1. Split the data into groups based on the “user_id” column.
# 2. Apply the transform() function on the “order_number” column to generate the maximum orders for each user.
# 3. Create a new column, “max_order,” into which you’ll place the results of your aggregation.
# Passing the string 'max' uses the built-in groupby maximum; passing np.max raises a FutureWarning in recent pandas versions.

ords_prods_merge['max_order'] = ords_prods_merge.groupby(['user_id'])['order_number'].transform('max')

In [52]:
# Check
ords_prods_merge.head(15)
ords_prods_merge['max_order'].head(15)

0     10
1     10
2     10
3     10
4     10
5     10
6     10
7     10
8     10
9     10
10    22
11    22
12    22
13    22
14    22
Name: max_order, dtype: int64
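The reason for using transform() here is that it returns one value per original row, so the result can be assigned straight back as a new column. A small sketch on invented toy data shows the difference from a plain aggregation:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# Plain aggregation: collapses to one row per user
per_user = toy.groupby('user_id')['order_number'].max()

# transform(): repeats each user's maximum on every row of that user,
# so the result has the same length and index as the original frame
toy['max_order'] = toy.groupby('user_id')['order_number'].transform('max')

print(per_user)
print(toy)
```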

In [33]:
# Deriving Columns with loc()

ords_prods_merge.loc[ords_prods_merge['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [34]:
ords_prods_merge.loc[(ords_prods_merge['max_order'] <= 40) & (ords_prods_merge['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [35]:
ords_prods_merge.loc[ords_prods_merge['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

In [36]:
ords_prods_merge['loyalty_flag'].value_counts(dropna = False)

loyalty_flag
Regular customer    15876776
Loyal customer      10284093
New customer         6243990
Name: count, dtype: int64

In [None]:
# Check
ords_prods_merge[['user_id', 'max_order', 'loyalty_flag', 'order_number']].head(60)
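As an alternative to three separate loc() assignments, the same thresholds can be expressed in one call with pd.cut(). This is only a sketch on toy values, not the method used in this notebook; note that it produces a categorical column rather than plain strings.

```python
import pandas as pd

# Toy values chosen to sit on both sides of each threshold
toy = pd.DataFrame({'max_order': [3, 10, 11, 40, 41]})

# Bins are right-closed by default: (0, 10], (10, 40], (40, inf]
toy['loyalty_flag'] = pd.cut(
    toy['max_order'],
    bins=[0, 10, 40, float('inf')],
    labels=['New customer', 'Regular customer', 'Loyal customer'],
)

print(toy)
```

The bin edges mirror the conditions above: at most 10 orders is a new customer, more than 10 and at most 40 is a regular customer, and more than 40 is a loyal customer.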

## 3. Difference between the spending habits of the three types of customers?

In [39]:
# Use the loyalty flag to check the basic statistics of the product prices for each loyalty category 
# Loyal Customer, Regular Customer, and New Customer
# Determine whether the prices of products purchased by loyal customers differ from those purchased by regular or new customers

ords_prods_merge.groupby('loyalty_flag')['prices'].mean()

loyalty_flag
Loyal customer      10.386336
New customer        13.294670
Regular customer    12.495717
Name: prices, dtype: float64

In [41]:
ords_prods_merge.groupby('loyalty_flag').agg({'prices': ['mean', 'min', 'max']})

loyalty_flag,prices (mean),prices (min),prices (max)
Loyal customer,10.386336,1.0,99999.0
New customer,13.29467,1.0,99999.0
Regular customer,12.495717,1.0,99999.0


In [49]:
# Check whether object is created
ords_prods_merge.groupby('loyalty_flag')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x44adeda50>

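### On average, the products purchased by loyal customers are cheaper than those purchased by the other two groups.
### The maximum price of 99999.0 in every group suggests outliers that may distort these means.
### I would also run a t-test for the stakeholders to check whether the differences are statistically significant.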
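A possible shape for such a test, sketched on made-up prices for two of the groups. The numbers are invented, and on the real data the 99999.0 prices visible in the max column would need to be handled first. The sketch computes Welch's t statistic by hand; `scipy.stats.ttest_ind(loyal, new, equal_var=False)` returns the same statistic together with a p-value.

```python
import numpy as np
import pandas as pd

# Made-up prices for two groups; NOT values from the real data set
toy = pd.DataFrame({
    'loyalty_flag': ['Loyal customer'] * 4 + ['New customer'] * 4,
    'prices': [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 16.0],
})

loyal = toy.loc[toy['loyalty_flag'] == 'Loyal customer', 'prices']
new = toy.loc[toy['loyalty_flag'] == 'New customer', 'prices']

# Welch's t statistic: difference in means divided by its standard error,
# using each group's own sample variance (Series.var() uses ddof=1)
std_error = np.sqrt(loyal.var() / len(loyal) + new.var() / len(new))
t_stat = (loyal.mean() - new.mean()) / std_error

print(t_stat)
```

A negative statistic here means the first group (loyal customers) has the lower mean price.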

## 4. Creating a spending flag for each user based on the average price across all their orders

#### Using the following criteria:
#### If the mean of the prices of products purchased by a user is lower than 10, then flag them as a “Low spender.”
#### If the mean of the prices of products purchased by a user is higher than or equal to 10, then flag them as a “High spender.”

In [54]:
# Creating a variable for the average price of the products bought by each customer, grouped by user_id

ords_prods_merge['average_spend'] = ords_prods_merge.groupby(['user_id'])['prices'].transform('mean')

In [55]:
ords_prods_merge['prices'].dtype

dtype('float64')

In [None]:
# Check
ords_prods_merge[['user_id', 'prices', 'order_number', 'max_order', 'average_spend']].head(30)

In [1]:
# user_id 1 has more rows than the check above displays.
# To list all of them, run this command:

# ords_prods_merge.loc[ords_prods_merge['user_id'] == 1][['user_id', 'prices', 'average_spend']] 

In [57]:
# Deriving Columns with loc()
# If the mean of the prices of products purchased by a user is lower than 10, then flag them as a “Low spender.”
# If the mean of the prices of products purchased by a user is higher than or equal to 10, then flag them as a “High spender.”

ords_prods_merge.loc[ords_prods_merge['average_spend'] < 10, 'spending flag'] = 'Low spender'

In [58]:
ords_prods_merge.loc[ords_prods_merge['average_spend'] >= 10, 'spending flag'] = 'High spender'

In [59]:
# Check
ords_prods_merge[['user_id', 'prices', 'average_spend', 'spending flag']].head(80)

Unnamed: 0,user_id,prices,average_spend,spending flag
0,1,9.0,6.367797,Low spender
1,1,9.0,6.367797,Low spender
2,1,9.0,6.367797,Low spender
3,1,9.0,6.367797,Low spender
4,1,9.0,6.367797,Low spender
...,...,...,...,...
75,120,9.0,9.385714,Low spender
76,120,9.0,9.385714,Low spender
77,185,9.0,7.715000,Low spender
78,195,9.0,7.149730,Low spender


In [60]:
# The means look suspicious, so check whether there are any high spenders at all
value_exists = (ords_prods_merge['spending flag'] == 'High spender').any()

if value_exists:
    print("Value High spender exists in column spending flag.")
else:
    print("Value High spender does not exist in column spending flag.")

Value High spender exists in column spending flag.
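For a two-way flag like this one, np.where() is a compact alternative to two loc() assignments, and value_counts() gives a fuller check than a single any(). A sketch on invented toy data:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'prices': [4.0, 6.0, 9.0, 11.0, 25.0],
})

# Per-user mean price, repeated on every row of that user
toy['average_spend'] = toy.groupby('user_id')['prices'].transform('mean')

# Below 10 is a low spender, 10 or above is a high spender
toy['spending flag'] = np.where(
    toy['average_spend'] < 10, 'Low spender', 'High spender'
)

# Count how many rows fall into each category
counts = toy['spending flag'].value_counts(dropna=False)
print(counts)
```

One caveat with this approach: a missing average would fail the `< 10` comparison and be labeled “High spender”, whereas the loc() version leaves such rows unflagged.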


## 5. Creating an order frequency flag that marks the regularity of a user’s ordering behavior according to the median in the “days_since_prior_order” column 

#### The criteria for the flag should be as follows:
#### If the median of “days_since_prior_order” is higher than 20, then the customer should be labeled a “Non-frequent customer.”
#### If the median is higher than 10 and lower than or equal to 20, then the customer should be labeled a “Regular customer.”
#### If the median is lower than or equal to 10, then the customer should be labeled a “Frequent customer.”

In [61]:
# Creating a variable for the median of “days_since_prior_order”, grouped by user_id
ords_prods_merge['order_regularity'] = ords_prods_merge.groupby(['user_id'])['days_since_prior_order'].transform('median')

In [62]:
# Check
ords_prods_merge[['user_id', 'days_since_prior_order', 'order_regularity']].head(30)

Unnamed: 0,user_id,days_since_prior_order,order_regularity
0,1,,20.5
1,1,15.0,20.5
2,1,21.0,20.5
3,1,29.0,20.5
4,1,28.0,20.5
5,1,19.0,20.5
6,1,20.0,20.5
7,1,14.0,20.5
8,1,0.0,20.5
9,1,30.0,20.5
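The first order of each user has no “days_since_prior_order” value, yet the median is still defined because the groupby median skips missing values. A minimal sketch on invented toy data:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; NaN marks each user's first order
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2],
    'days_since_prior_order': [np.nan, 10.0, 20.0, 30.0, np.nan, 7.0],
})

# NaN rows are ignored when the median is computed, but they still
# receive the group's median in the transformed column
toy['order_regularity'] = (
    toy.groupby('user_id')['days_since_prior_order'].transform('median')
)

print(toy)
```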


In [66]:
# Deriving Columns with loc()

# If the median of “days_since_prior_order” is higher than 20, then the customer should be labeled a “Non-frequent customer.”
# If the median is higher than 10 and lower than or equal to 20, then the customer should be labeled a “Regular customer.”
# If the median is lower than or equal to 10, then the customer should be labeled a “Frequent customer.”

ords_prods_merge.loc[ords_prods_merge['order_regularity'] > 20, 'order regularity flag'] = 'Non-frequent customer'

In [67]:
ords_prods_merge.loc[(ords_prods_merge['order_regularity'] > 10) & (ords_prods_merge['order_regularity'] <= 20), 'order regularity flag'] = 'Regular customer'

In [68]:
ords_prods_merge.loc[(ords_prods_merge['order_regularity'] <= 10), 'order regularity flag'] = 'Frequent customer'

In [69]:
# Check
ords_prods_merge[['user_id', 'order_regularity', 'order regularity flag']].head(30)

Unnamed: 0,user_id,order_regularity,order regularity flag
0,1,20.5,Non-frequent customer
1,1,20.5,Non-frequent customer
2,1,20.5,Non-frequent customer
3,1,20.5,Non-frequent customer
4,1,20.5,Non-frequent customer
5,1,20.5,Non-frequent customer
6,1,20.5,Non-frequent customer
7,1,20.5,Non-frequent customer
8,1,20.5,Non-frequent customer
9,1,20.5,Non-frequent customer
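The three conditions can also be written as a single np.select() call. The sketch below uses toy medians, and the label 'No median available' is a hypothetical placeholder for users whose median is missing (for example, users with a single order); the loc() approach above leaves such rows empty instead.

```python
import numpy as np
import pandas as pd

# Toy medians chosen to sit on both sides of each threshold
toy = pd.DataFrame({'order_regularity': [25.0, 20.0, 10.5, 10.0, 3.0, np.nan]})

conditions = [
    toy['order_regularity'] > 20,
    (toy['order_regularity'] > 10) & (toy['order_regularity'] <= 20),
    toy['order_regularity'] <= 10,
]
labels = ['Non-frequent customer', 'Regular customer', 'Frequent customer']

# Rows matching none of the conditions (NaN medians) get the default label,
# which is a hypothetical placeholder and not part of the original criteria
toy['order regularity flag'] = np.select(
    conditions, labels, default='No median available'
)

print(toy)
```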


In [70]:
# Export data to pkl
ords_prods_merge.to_pickle(os.path.join(path, 'Data', 'prepared data ', 'ords_prods_merge_flags.pkl'))