# Step 1: Notebook Setup

In [14]:
# Import libraries
import pandas as pd
import numpy as np

# Load the main dataframe
data_path = '/Users/dela/Documents/15-01-2025 Instacart Basket Analysis/02 Data/Prepared Data/ords_prods_merge.pkl'
ords_prods_merge = pd.read_pickle(data_path)

# Check the dataframe
print(ords_prods_merge.head())
print(ords_prods_merge.shape)


   order_id  user_id  order_number  order_day_of_week  order_hour_of_day  \
0   2539329        1             1                  2                  8   
1   2539329        1             1                  2                  8   
2   2539329        1             1                  2                  8   
3   2539329        1             1                  2                  8   
4   2539329        1             1                  2                  8   

   days_since_prior_order product_id  add_to_cart_order  reordered  \
0                     NaN        196                  1          0   
1                     NaN      14084                  2          0   
2                     NaN      12427                  3          0   
3                     NaN      26088                  4          0   
4                     NaN      26405                  5          0   

                              product_name  aisle_id  department_id  prices  
0                                     Soda  

# Step 2: Aggregating Data

In [16]:
# Group and calculate the mean of "order_number" by "department_id"
department_means = ords_prods_merge.groupby('department_id').agg({'order_number': 'mean'})

# Display the results
print(department_means)


               order_number
department_id              
1                 15.457838
2                 17.277920
3                 17.170395
4                 17.811403
5                 15.215751
6                 16.439806
7                 17.225802
8                 15.340650
9                 15.895474
10                20.197148
11                16.170638
12                15.887671
13                16.583536
14                16.773669
15                16.165037
16                17.666284
17                15.694469
18                19.310514
19                17.177343
20                16.473447
21                22.902379


In [18]:
# Save as a new dataframe for further analysis
department_means_df = department_means.reset_index()


# Step 3: Analysis of Aggregation

In [20]:
# Repeat the "department_id" aggregation on the subset.
# Note: `df` is assumed to be the subset dataframe created in an earlier exercise; it is not defined in this notebook.
subset_means = df.groupby('department_id').agg({'order_number': 'mean'})
print(subset_means)


               order_number
department_id              
1                 14.799835
2                 17.091743
3                 17.913113
4                 17.892927
5                 15.214270
6                 15.382135
7                 17.694027
8                 16.458105
9                 15.957363
10                20.091818
11                16.482026
12                15.615061
13                16.483771
14                17.524632
15                15.691875
16                18.014473
17                16.148899
18                19.602850
19                17.631171
20                17.138607
21                21.956893


### Observations
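- The means computed on the entire dataframe are more reliable than the subset means because they include every row. The two sets of values differ by up to about 1.1 per department.
- Department 21 has the highest mean `order_number` (22.90 on the full dataframe, 21.96 on the subset), followed by Department 10 (20.20 and 20.09). Department 5 has the lowest full-dataframe mean (15.22), and Department 17 (Household) is also among the lowest (15.69). Department 4 (Produce) ranks fourth at 17.81.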
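
The highest and lowest group means can be read off programmatically rather than by eye. The following is a minimal sketch on a small made-up frame; `toy` and its values are hypothetical and are not the Instacart data.

```python
import pandas as pd

# Hypothetical toy data standing in for ords_prods_merge
toy = pd.DataFrame({
    "department_id": [4, 4, 17, 17, 21, 21],
    "order_number": [10, 20, 5, 15, 30, 40],
})

# Mean order_number per department, as a Series indexed by department_id
toy_means = toy.groupby("department_id")["order_number"].mean()

# idxmax / idxmin return the index label (department_id) of the extreme mean
highest_dept = toy_means.idxmax()
lowest_dept = toy_means.idxmin()
print(highest_dept, lowest_dept)
```

Sorting the result with `sort_values(ascending=False)` gives the full ranking when more than the two extremes are of interest.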


# Step 4.1: Add a max_order column

In [25]:
# Add the "max_order" column using the string "max" instead of np.max
ords_prods_merge['max_order'] = ords_prods_merge.groupby('user_id')['order_number'].transform("max")
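
# The sketch below illustrates why transform() is used here rather than agg().
# It runs on a hypothetical toy frame, not on ords_prods_merge.

```python
import pandas as pd

# Hypothetical toy data: two users with three and two orders
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
})

# agg collapses each group to a single value: one row per user
per_user = toy.groupby("user_id")["order_number"].agg("max")

# transform broadcasts the group maximum back onto every original row,
# so the result can be assigned directly as a new column
toy["max_order"] = toy.groupby("user_id")["order_number"].transform("max")
print(toy)
```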


# Step 4.2: Create the loyalty_flag column

In [27]:
# Add the "loyalty_flag" column
ords_prods_merge.loc[ords_prods_merge['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'
ords_prods_merge.loc[(ords_prods_merge['max_order'] <= 40) & (ords_prods_merge['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'
ords_prods_merge.loc[ords_prods_merge['max_order'] <= 10, 'loyalty_flag'] = 'New customer'
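
# An alternative to the three .loc assignments is pd.cut, which bins in one call.
# This is a sketch on hypothetical toy data, not the approach used in this notebook;
# note that pd.cut returns a categorical column rather than plain strings.

```python
import pandas as pd

# Hypothetical max_order values chosen to sit on both sides of each threshold
toy = pd.DataFrame({"max_order": [3, 10, 11, 40, 41, 99]})

# Bins are right-inclusive by default: (-inf, 10], (10, 40], (40, inf]
toy["loyalty_flag"] = pd.cut(
    toy["max_order"],
    bins=[float("-inf"), 10, 40, float("inf")],
    labels=["New customer", "Regular customer", "Loyal customer"],
)
print(toy)
```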


# Step 4.3: Check the results

In [29]:
# Check the frequency of loyalty flags
print(ords_prods_merge['loyalty_flag'].value_counts())


loyalty_flag
Regular customer    15876441
Loyal customer      10284025
New customer         6243823
Name: count, dtype: int64


In [31]:
# Export the updated dataframe
export_path = '/Users/dela/Documents/15-01-2025 Instacart Basket Analysis/02 Data/Prepared Data/ords_prods_merge_updated.pkl'
ords_prods_merge.to_pickle(export_path)
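
# A quick way to confirm that an export worked is to read the file back and compare it
# with the in-memory dataframe. This sketch uses a hypothetical toy frame and a
# temporary directory instead of the project path.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for ords_prods_merge
toy = pd.DataFrame({
    "user_id": [1, 2],
    "loyalty_flag": ["New customer", "Loyal customer"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(toy_path)              # write the pickle
    restored = pd.read_pickle(toy_path)  # read it back

# DataFrame.equals checks shape, values, and dtypes
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```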


# Step 5: Compare Spending Habits Based on Loyalty Flag

In [33]:
loyalty_spending = ords_prods_merge.groupby('loyalty_flag').agg({'prices': ['mean', 'median', 'min', 'max']})
print(loyalty_spending)


                     prices                     
                       mean median  min      max
loyalty_flag                                    
Loyal customer    10.386384    7.4  1.0  99999.0
New customer      13.294943    7.4  1.0  99999.0
Regular customer  12.495916    7.4  1.0  99999.0


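## Mean Spending:

- Loyal customers: €10.39 (lowest mean price).
- New customers: €13.29 (highest mean price).
- Regular customers: €12.50 (mid-range).

## Median Prices:

- €7.40 in every group, indicating that the typical purchase is similarly priced regardless of loyalty.

## Minimum Prices:

- €1.00 in every group, so each group buys at least some low-cost items.

## Maximum Prices:

- €99,999.00 in every group. A price this high is implausible for a grocery item and is more likely a placeholder or data-entry error than a luxury product.

## Insights:

- These figures are means of the per-row item price, not totals per order or per customer.
- New customers have the highest mean price, possibly because they explore higher-priced products.
- Loyal customers have the lowest mean price, suggesting familiarity with affordable products or consistent purchases of budget-friendly items.
- Regular customers fall between the two.
- Because extreme values such as €99,999.00 inflate the mean but leave the median untouched, the identical medians suggest the groups' typical prices may be closer than the means imply.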
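
A single extreme price can move a group mean a long way while leaving the median unchanged. The sketch below shows the effect on made-up data; the group labels, the prices, and the cutoff of 100 are all assumptions chosen for illustration, not values taken from this analysis.

```python
import pandas as pd

# Hypothetical toy data: group B contains one extreme price
toy = pd.DataFrame({
    "loyalty_flag": ["A", "A", "A", "B", "B", "B"],
    "prices": [5.0, 7.0, 9.0, 5.0, 7.0, 99999.0],
})

# Mean and median per group, with the extreme value included
stats = toy.groupby("loyalty_flag")["prices"].agg(["mean", "median"])
print(stats)

# Assumed cutoff: treat any price above 100 as suspect and exclude it
cleaned = toy[toy["prices"] <= 100]
cleaned_stats = cleaned.groupby("loyalty_flag")["prices"].agg(["mean", "median"])
print(cleaned_stats)
```

Whether to exclude, cap, or replace such values depends on what they represent in the source data, so the cutoff should be checked against the data dictionary before it is applied.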

# Step 6: Create Spending Flags for Users


In [36]:
ords_prods_merge['avg_spending'] = ords_prods_merge.groupby('user_id')['prices'].transform('mean')


In [38]:
ords_prods_merge.loc[ords_prods_merge['avg_spending'] < 10, 'spending_flag'] = 'Low spender'
ords_prods_merge.loc[ords_prods_merge['avg_spending'] >= 10, 'spending_flag'] = 'High spender'


In [40]:
print(ords_prods_merge['spending_flag'].value_counts())


spending_flag
Low spender     31770062
High spender      634227
Name: count, dtype: int64


## Proportions of Spend Categories:

- Low spenders: 31,770,062 rows (approximately 98.04%).
- High spenders: 634,227 rows (approximately 1.96%).

### Observations

- The flag is assigned per row, so these counts are rows of the dataframe rather than distinct users. Users who order more contribute more rows.
- The vast majority of rows belong to users in the "Low spender" category, indicating a tendency toward budget-friendly or smaller purchases.
- A small proportion belongs to "High spenders", possibly a niche group that frequently purchases higher-priced or premium products.
- This segmentation could help the marketing team tailor strategies for both groups, such as offering loyalty rewards to high spenders or promoting deals that encourage higher spending among low spenders.
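
Because the flag is stored on every row, `value_counts()` counts rows, not users. The sketch below contrasts the two on hypothetical toy data; the assumption is that each user carries a single flag value, which holds when the flag is derived from a per-user `transform`.

```python
import pandas as pd

# Hypothetical toy data: user 1 has four rows, users 2 and 3 have one each
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 3],
    "spending_flag": ["Low spender"] * 4 + ["High spender", "Low spender"],
})

# Share of rows per flag (what value_counts on the full frame reports)
row_share = toy["spending_flag"].value_counts(normalize=True)

# Share of users per flag: keep one row per user before counting
user_share = (
    toy.drop_duplicates(subset="user_id")["spending_flag"]
    .value_counts(normalize=True)
)

print(row_share)
print(user_share)
```

In this toy frame the row share of low spenders is 5/6 while the user share is 2/3, which shows how heavy users can shift row-level proportions.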

# Step 7: Create Order Frequency Flags

In [43]:
ords_prods_merge['median_days_since_order'] = ords_prods_merge.groupby('user_id')['days_since_prior_order'].transform('median')


In [45]:
ords_prods_merge.loc[ords_prods_merge['median_days_since_order'] > 20, 'frequency_flag'] = 'Non-frequent customer'
ords_prods_merge.loc[(ords_prods_merge['median_days_since_order'] > 10) & (ords_prods_merge['median_days_since_order'] <= 20), 'frequency_flag'] = 'Regular customer'
ords_prods_merge.loc[ords_prods_merge['median_days_since_order'] <= 10, 'frequency_flag'] = 'Frequent customer'


In [47]:
print(ords_prods_merge['frequency_flag'].value_counts())


frequency_flag
Frequent customer        21559233
Regular customer          7208688
Non-frequent customer     3636363
Name: count, dtype: int64


## Proportions of Frequency Categories:

- Frequent customers: 21,559,233 rows (approximately 66.53%).
- Regular customers: 7,208,688 rows (approximately 22.25%).
- Non-frequent customers: 3,636,363 rows (approximately 11.22%).

### Observations

- As with the spending flag, these counts are rows rather than distinct users.
- About two thirds of rows belong to frequent customers, whose median gap between orders is 10 days or less.
- Regular customers account for a little under a quarter of rows, a significant segment with a moderate ordering frequency (median gap of more than 10 and up to 20 days).
- Non-frequent customers account for the smallest share, reflecting casual or sporadic engagement with the platform (median gap of more than 20 days).

### Implications

The segmentation of customers by order frequency can support targeted marketing campaigns:

- Frequent customers: reward programs or loyalty incentives to retain high activity.
- Regular customers: encouragement to increase order frequency, possibly via discounts or promotions.
- Non-frequent customers: strategies to re-engage these users, such as personalized offers or reminders.
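
The three frequency counts printed above sum to 32,404,284, while the two spending-flag counts sum to 32,404,289, so a handful of rows appear to have no frequency flag. One possible cause, assumed here rather than verified, is a user whose `days_since_prior_order` values are all missing, which makes the median `NaN` and fails every comparison. The sketch below reproduces that situation on hypothetical toy data and surfaces it with `value_counts(dropna=False)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: user 3 has only a first order, so no known gap
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "days_since_prior_order": [np.nan, 7.0, np.nan, 30.0, np.nan],
})

# Per-user median; NaN values are skipped, and an all-NaN group yields NaN
toy["median_days"] = (
    toy.groupby("user_id")["days_since_prior_order"].transform("median")
)

# Same flagging logic as in the notebook; NaN medians match none of the conditions
toy.loc[toy["median_days"] > 20, "frequency_flag"] = "Non-frequent customer"
toy.loc[
    (toy["median_days"] > 10) & (toy["median_days"] <= 20), "frequency_flag"
] = "Regular customer"
toy.loc[toy["median_days"] <= 10, "frequency_flag"] = "Frequent customer"

# dropna=False keeps the unflagged rows visible in the tally
counts = toy["frequency_flag"].value_counts(dropna=False)
print(counts)
```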

# Step 9: Export the Final Dataframe

In [56]:
# Define the export path
export_path = '/Users/dela/Documents/15-01-2025 Instacart Basket Analysis/02 Data/Prepared Data/ords_prods_merge_final.pkl'

# Export the dataframe as a pickle file
ords_prods_merge.to_pickle(export_path)

# Confirm the export by re-importing and checking the first few rows
test_import = pd.read_pickle(export_path)
print(test_import.head())



   order_id  user_id  order_number  order_day_of_week  order_hour_of_day  \
0   2539329        1             1                  2                  8   
1   2539329        1             1                  2                  8   
2   2539329        1             1                  2                  8   
3   2539329        1             1                  2                  8   
4   2539329        1             1                  2                  8   

   days_since_prior_order product_id  add_to_cart_order  reordered  \
0                     NaN        196                  1          0   
1                     NaN      14084                  2          0   
2                     NaN      12427                  3          0   
3                     NaN      26088                  4          0   
4                     NaN      26405                  5          0   

                              product_name  aisle_id  department_id  prices  \
0                                     Soda 