### Step 1 - Create a new notebook for this task. Be sure to import the relevant libraries, along with your dataframe, which should include your newly derived columns from the previous Exercise.

In [1]:
# initial setup, import, defs, loading dataframe

import pandas as pd
import numpy as np
import os
data_path = r'/home/nevesfernandes/20250701 Instacart Basket Analysis/2 Data/'
df_orders_products_merged = pd.read_pickle(os.path.join(data_path, '2 Prepared Data', 'orders_products_labeled.pkl'))

In [2]:
df_orders_products_merged.head()

Unnamed: 0.1,order_id,user_id,customer_sequential_order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,Unnamed: 0,product_name,department_id,prices,price_range_loc,busy,busy_days,busiest_period_of_day
0,2539329,1,1,2,8,,196,195,Soda,7,9.0,2,2,2,2
1,2539329,1,1,2,8,,14084,14084,Organic Unsweetened Vanilla Almond Milk,16,12.5,2,2,2,2
2,2539329,1,1,2,8,,12427,12427,Original Beef Jerky,19,4.4,1,2,2,2
3,2539329,1,1,2,8,,26088,26089,Aged White Cheddar Popcorn,19,4.7,1,2,2,2
4,2539329,1,1,2,8,,26405,26406,XL Pick-A-Size Paper Towel Rolls,17,1.0,1,2,2,2


In [3]:
#dropping Unnamed: 0 column
df_orders_products_merged.drop('Unnamed: 0', axis=1, inplace=True)

### Step 2 - In this Exercise, you learned how to find the aggregated mean of the “order_number” column grouped by “department_id” for a subset of your dataframe. Now, repeat this process for the entire dataframe.

In [4]:
df_orders_products_merged.groupby('department_id').agg({'customer_sequential_order_number': ['mean']})

Unnamed: 0_level_0,customer_sequential_order_number
Unnamed: 0_level_1,mean
department_id,Unnamed: 1_level_2
1,15.457838
2,17.27792
3,17.170395
4,17.811403
5,15.215751
6,16.439806
7,17.225814
8,15.34065
9,15.895474
10,20.197148


### Step 3 - Analyze the result. How do the results for the entire dataframe differ from those of the subset? Include your comments in a markdown cell below the executed code.

The results are similar: some departments show slightly higher averages, others slightly lower.

This is expected. Previously, we took a sample of 1 million records out of 32 million (the entire "population"). In statistical terms, assuming that the first million records are random enough, we could run a hypothesis test (a z-test) for each department to check whether the sample mean is consistent with the population mean. A z-test is appropriate here because the population standard deviation can be computed directly from the full dataframe.
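
As a rough illustration of that idea, here is a minimal sketch of a two-sided z-test for a sample mean. It runs on synthetic numbers, not on the Instacart data; the names `population`, `sample`, `z`, `p_value` and `reject_h0` are made up for this example, and the 5% significance level is an assumption.

```python
import math

import numpy as np

# Synthetic stand-in for one department's order numbers (assumption: not the real data)
rng = np.random.default_rng(42)
population = rng.integers(1, 100, size=100_000).astype(float)

# Treat the first 1,000 values as the "subset" whose mean we want to test
sample = population[:1_000]

# Population parameters are known because we hold the full data
mu = population.mean()
sigma = population.std(ddof=0)

# z statistic for the sample mean under H0: the sample comes from this population
z = (sample.mean() - mu) / (sigma / math.sqrt(len(sample)))

# Two-sided p-value from the standard normal distribution: 2 * (1 - Phi(|z|))
p_value = math.erfc(abs(z) / math.sqrt(2))

# Reject H0 at the (assumed) 5% significance level
reject_h0 = p_value < 0.05
print(f"z = {z:.3f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```

On the real dataframe, the same calculation would be repeated per `department_id`, with `mu` and `sigma` taken from the full data and the sample mean taken from the 1 million row subset.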

### Step 4 - Follow the instructions in the Exercise for creating a loyalty flag for existing customers using the transform() and loc() functions.

In [5]:
#first, we create a column called max_order that stores the highest order number (i.e. the total number of orders) of each customer
df_orders_products_merged['max_order'] = df_orders_products_merged.groupby(['user_id'])['customer_sequential_order_number'].transform('max')

**Note**: to avoid a warning that was showing up, I used the string `'max'`, as recommended, instead of the `np.max` used in the exercise.
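
To make the role of `transform()` more concrete, here is a small sketch on toy data (the values are hypothetical; only the column names come from this notebook). It contrasts `agg()`, which returns one value per group, with `transform()`, which returns one value per original row so the result can be assigned straight back as a new column.

```python
import pandas as pd

# Toy data with hypothetical values, reusing the notebook's column names
df_toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'customer_sequential_order_number': [1, 2, 3, 1, 2],
})

grouped = df_toy.groupby('user_id')['customer_sequential_order_number']

# agg() collapses each group to a single value (indexed by user_id)
per_user_max = grouped.agg('max')

# transform() broadcasts the group result back to every row of the group
df_toy['max_order'] = grouped.transform('max')

print(per_user_max)
print(df_toy)
```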

In [6]:
#using loc method to create loyalty flag
#The number convention here will be as described below. This number convention is to save memory. 
#Any future visualizations should have this for reference, to display something understandable to the user.
# 1 = new customer
# 2 = regular customer
# 3 = Loyal customer

df_orders_products_merged.loc[df_orders_products_merged['max_order'] > 40, 'loyalty_flag'] = 3
df_orders_products_merged.loc[(df_orders_products_merged['max_order'] <= 40) & (df_orders_products_merged['max_order'] > 10), 'loyalty_flag'] = 2
df_orders_products_merged.loc[df_orders_products_merged['max_order'] <= 10, 'loyalty_flag'] = 1

In [7]:
#changing datatype to effectively save memory (field was a float64)
df_orders_products_merged['loyalty_flag'] = df_orders_products_merged['loyalty_flag'].astype('int8')
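
As a possible alternative to the three `loc` assignments, the flag could be built in one pass with `np.select`, which avoids the intermediate float64 column. This is only a sketch on toy values (the `max_order` numbers below are hypothetical), using the same thresholds and number convention as above.

```python
import numpy as np
import pandas as pd

# Hypothetical max_order values, chosen to hit both thresholds
max_order = pd.Series([3, 10, 11, 40, 41, 99], name='max_order')

# Conditions are evaluated in order; the first match wins
conditions = [
    max_order > 40,   # 3 = Loyal customer
    max_order > 10,   # 2 = Regular customer (only reached when max_order <= 40)
]
choices = [3, 2]

# Everything else (max_order <= 10) falls back to 1 = New customer
loyalty_flag = pd.Series(
    np.select(conditions, choices, default=1),
    index=max_order.index,
).astype('int8')

print(loyalty_flag.tolist())
```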

### Step 5 - The marketing team at Instacart wants to know whether there’s a difference between the spending habits of the three types of customers you identified. Use the loyalty flag you created and check the basic statistics of the product prices for each loyalty category (Loyal Customer, Regular Customer, and New Customer). What you’re trying to determine is whether the prices of products purchased by loyal customers differ from those purchased by regular or new customers.

In [8]:
#grouping records by loyalty flag and calculating the average price; as_index=False keeps loyalty_flag as a regular column so its values can be renamed
df_grouped = df_orders_products_merged.groupby('loyalty_flag', as_index=False)['prices'].mean()

In [9]:
#renaming rows to present meaningful stuff
df_grouped['loyalty_flag'] = df_grouped['loyalty_flag'].replace({1:"New customer",2:"Regular Customer",3:"Loyal Customer"})

In [10]:
#renaming prices column to something more meaningful
df_grouped.rename(columns={'prices':'average_price'},inplace = True)

In [11]:
#presenting results, hiding a meaningless index (0,1,2)
df_grouped[['loyalty_flag','average_price']].style.hide(axis="index")

loyalty_flag,average_price
New customer,7.800373
Regular Customer,7.797473
Loyal Customer,7.772843


There's virtually no difference between averages (the loyal customer average is even a tiny bit lower than others).

This follows the directions of the task literally, although alternative metrics could also be used to check for differences in "spending habits" between types of customers.
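
Since the task mentions "basic statistics" and not only the mean, one option is to request several statistics at once with `agg()`. The sketch below uses toy data (hypothetical prices; only the column names come from this notebook), and the choice of statistics is an assumption about what "basic" covers.

```python
import pandas as pd

# Toy data with hypothetical prices, two rows per loyalty category
df_toy = pd.DataFrame({
    'loyalty_flag': [1, 1, 2, 2, 3, 3],
    'prices': [4.0, 12.0, 5.0, 9.0, 1.0, 7.0],
})

# One row per loyalty_flag, one column per statistic
price_stats = df_toy.groupby('loyalty_flag')['prices'].agg(
    ['mean', 'median', 'min', 'max', 'std']
)

print(price_stats)
```

Comparing the spread (`min`, `max`, `std`) alongside the mean can show whether the groups differ in ways that the average alone hides.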

### Step 6 - The team now wants to target different types of spenders in their marketing campaigns. This can be achieved by looking at the prices of the items people are buying. Create a spending flag for each user based on the average price across all their orders using the following criteria:
If the mean of the prices of products purchased by a user is lower than 10, then flag them as a “Low spender.”
If the mean of the prices of products purchased by a user is higher than or equal to 10, then flag them as a “High spender.”

In [12]:
#creating a column with the average price for all goods purchased by each customer
df_orders_products_merged['avg_price_per_user'] = df_orders_products_merged.groupby(['user_id'])['prices'].transform('mean')

In [13]:
#using loc method to create high spender flag
#Since flag can only take 2 possible values, we'll use boolean here. This convention is to save memory. 
#Any future visualizations should have this for reference, to display something understandable to the user.
# True = high spender
# False = low spender

df_orders_products_merged.loc[df_orders_products_merged['avg_price_per_user'] >= 10, 'high_spender_flag'] = True
df_orders_products_merged.loc[df_orders_products_merged['avg_price_per_user'] < 10, 'high_spender_flag'] = False

In [14]:
#changing datatype to effectively save memory (field was an object)
df_orders_products_merged['high_spender_flag'] = df_orders_products_merged['high_spender_flag'].astype('bool')
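
Because the flag has only two values, a possible shortcut is to assign the comparison result directly: it is already a boolean Series, so no `loc` steps or dtype conversion are needed. A minimal sketch with hypothetical averages:

```python
import pandas as pd

# Hypothetical per-user average prices
avg_price_per_user = pd.Series([7.5, 10.0, 12.3, 9.99])

# The comparison itself yields the boolean flag (True = high spender)
high_spender_flag = avg_price_per_user >= 10

print(high_spender_flag.tolist())
print(high_spender_flag.dtype)
```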

While it's not requested in the exercise, I'm curious about how many spenders of each type we have.

In [15]:
#grouping records by spender type, and counting how many unique users we have
df_grouped = df_orders_products_merged.groupby(['high_spender_flag'], as_index = False)['user_id'].nunique()

In [16]:
#renaming rows to present meaningful stuff
df_grouped['high_spender_flag'] = df_grouped['high_spender_flag'].replace({True:"High Spender",False:"Low Spender"})

In [17]:
#renaming user_id column to something more meaningful
df_grouped.rename(columns={'user_id':'number_of_customers'},inplace = True)

In [18]:
#presenting results, hiding a meaningless index (0,1)
df_grouped[['high_spender_flag','number_of_customers']].style.hide(axis="index")

high_spender_flag,number_of_customers
Low Spender,202822
High Spender,3387


### Step 7 - In order to send relevant notifications to users within the app (for instance, asking users if they want to buy the same item again), the Instacart team wants you to determine frequent versus non-frequent customers. Create an order frequency flag that marks the regularity of a user’s ordering behavior according to the median in the “days_since_prior_order” column. The criteria for the flag should be as follows:
* If the median of “days_since_prior_order” is higher than 20, then the customer should be labeled a “Non-frequent customer.”
* If the median is higher than 10 and lower than or equal to 20, then the customer should be labeled a “Regular customer.”
* If the median is lower than or equal to 10, then the customer should be labeled a “Frequent customer.”

In [19]:
#creating a column with the median values for days_since_prior_order for each customer
#Note that pandas ignores NaN by default when calculating the median, which is the desired behaviour in our case

df_orders_products_merged['median_days_between_orders'] = df_orders_products_merged.groupby(['user_id'])['days_since_prior_order'].transform('median')
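
The NaN handling can be checked on a small toy example (hypothetical values; the column names are the ones used in this notebook). A customer with several orders gets the median of the non-missing gaps, while a customer with a single order, whose only `days_since_prior_order` value is NaN, ends up with a NaN median.

```python
import numpy as np
import pandas as pd

# User 1 has three orders (the first has no prior order); user 2 has only one order
df_toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'days_since_prior_order': [np.nan, 7.0, 9.0, np.nan],
})

# NaN values are skipped inside each group; an all-NaN group yields NaN
median_days = df_toy.groupby('user_id')['days_since_prior_order'].transform('median')

print(median_days.tolist())
```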

In [20]:
#using loc method to create frequency flag. Note that a NaN median means that the customer never returned,
#therefore those customers are flagged as non-frequent (1).
#The number convention here will be as described below. This number convention is to save memory. 
#Any future visualizations should have this for reference, to display something understandable to the user.
# 1 = Non-frequent customer
# 2 = Regular customer
# 3 = Frequent customer

df_orders_products_merged.loc[(df_orders_products_merged['median_days_between_orders'] > 20) | (df_orders_products_merged['median_days_between_orders'].isnull()), 'frequency_flag'] = 1
df_orders_products_merged.loc[(df_orders_products_merged['median_days_between_orders'] <= 20) & (df_orders_products_merged['median_days_between_orders'] > 10), 'frequency_flag'] = 2
df_orders_products_merged.loc[df_orders_products_merged['median_days_between_orders'] <= 10, 'frequency_flag'] = 3

In [21]:
#changing datatype to save memory
df_orders_products_merged['frequency_flag'] = df_orders_products_merged['frequency_flag'].astype('int8')

As above, I'll investigate how the customers are distributed across this frequency flag.

In [22]:
#grouping records by frequency flag, and counting how many unique users we have
df_grouped = df_orders_products_merged.groupby(['frequency_flag'], as_index = False)['user_id'].nunique()

In [23]:
#renaming rows to present meaningful stuff
df_grouped['frequency_flag'] = df_grouped['frequency_flag'].replace({1:"Non-frequent customer",2:"Regular customer",3:"Frequent customer"})

In [24]:
#renaming user_id column to something more meaningful
df_grouped.rename(columns={'user_id':'number_of_customers'},inplace = True)

In [25]:
#presenting results, hiding a meaningless index (0,1,2)
df_grouped[['frequency_flag','number_of_customers']].style.hide(axis="index")

frequency_flag,number_of_customers
Non-frequent customer,59623
Regular customer,59992
Frequent customer,86594


The total number of customers here is **206209**, exactly the same total as in the spender breakdown above (202822 + 3387)!
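
This kind of consistency check could also be automated. The sketch below, on toy data with hypothetical values, asserts that the per-flag customer counts add up to the number of unique users and that no user carries more than one flag value.

```python
import pandas as pd

# Toy data: four users, each with a single frequency_flag value across their rows
df_toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 4],
    'frequency_flag': [3, 3, 1, 2, 2, 3],
})

# Unique customers per flag, and unique customers overall
counts = df_toy.groupby('frequency_flag')['user_id'].nunique()
total_users = df_toy['user_id'].nunique()

# Number of distinct flag values each user has (should always be 1)
flags_per_user = df_toy.groupby('user_id')['frequency_flag'].nunique()

# Every customer is counted exactly once across the flag categories
assert counts.sum() == total_users
assert (flags_per_user == 1).all()
print(counts.sum(), total_users)
```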

### Step 9 - Export your dataframe as a pickle file and store it correctly in your “Prepared Data” folder.

In [26]:
#exporting to pickle format
df_orders_products_merged.to_pickle(os.path.join(data_path, '2 Prepared Data', 'orders_products_labeled_v2.pkl'))