# Task 1) Notebook Setup & Data Import

## 1.1 Importing important libraries

In [11]:
import pandas as pd  # For DataFrames
import numpy as np  # For numeric calculations
import os  # For file management

## 1.2 Import Pickle file into Pandas

In [13]:
# Data set path

path = r"/Users/martin/anaconda_projects/11-02-2025 Instacart Basket Analysis"

In [14]:
# Import of the "ords_prods_merge" data set 

ords_prods_merge = pd.read_pickle(os.path.join(path, '02 data', 'Prepared Data', 'ords_prods_merge_2.pkl'))

## 1.3 Initial overview of the DataFrame

In [16]:
print(ords_prods_merge.head())

   Unnamed: 0  order_id  user_id eval_set  order_number  orders_day_of_week  \
0           0   2539329        1    prior             1                   2   
1           0   2539329        1    prior             1                   2   
2           0   2539329        1    prior             1                   2   
3           0   2539329        1    prior             1                   2   
4           0   2539329        1    prior             1                   2   

   order_hour_of_day  days_po  product_id  add_to_cart_order  reordered  \
0                  8      NaN         196                  1          0   
1                  8      NaN       14084                  2          0   
2                  8      NaN       12427                  3          0   
3                  8      NaN       26088                  4          0   
4                  8      NaN       26405                  5          0   

  _merge                             product_name  aisle_id  department_id

In [17]:
print(ords_prods_merge.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434212 entries, 0 to 32434211
Data columns (total 19 columns):
 #   Column                 Dtype   
---  ------                 -----   
 0   Unnamed: 0             int64   
 1   order_id               int64   
 2   user_id                int64   
 3   eval_set               object  
 4   order_number           int64   
 5   orders_day_of_week     int64   
 6   order_hour_of_day      int64   
 7   days_po                float64 
 8   product_id             int64   
 9   add_to_cart_order      int64   
 10  reordered              int64   
 11  _merge                 category
 12  product_name           object  
 13  aisle_id               int64   
 14  department_id          int64   
 15  prices                 float64 
 16  Busiest day            object  
 17  Busiest days           object  
 18  busiest_period_of_day  object  
dtypes: category(1), float64(2), int64(11), object(5)
memory usage: 4.4+ GB
None


# Task 2) Aggregating Order Numbers for the Full Dataset

In [19]:
# Calculate average number of orders per department

ords_prods_merge.groupby('department_id').agg({'order_number': 'mean'})

department_id,order_number
1,15.457687
2,17.27792
3,17.179756
4,17.811403
5,15.213779
6,16.439806
7,17.225773
8,15.34052
9,15.895474
10,20.197148


#  Insights: Average Number of Orders per Department  

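The mean of `order_number` per department **varies across departments**. Note that `order_number` is the sequence number of an order within a customer's history (1 for the first order, 2 for the second, and so on), so a higher mean indicates that a department's products tend to be bought in customers' later orders; it is not a count of orders placed in that department.  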

---

##  Key Observations  

-  **Highest average order numbers** are found in:  
  -  **Department 21** (≈ **22.9** orders)  
  -  **Department 10** (≈ **20.2** orders)  

-  **Lower values** appear in:  
  -  **Department 5** (≈ **15.2** orders)  
  -  **Department 8** (≈ **15.3** orders)  

-  **Overall trend:** Most departments fall within the range of **15 to 18 average orders**.  
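
To find the highest and lowest departments without scanning the table by eye, the aggregated means can be sorted. The following is a minimal sketch on a small invented frame (`toy` and its values are assumptions for illustration, not the Instacart data):

```python
import pandas as pd

# Toy stand-in for ords_prods_merge (values are invented for illustration).
toy = pd.DataFrame({
    "department_id": [1, 1, 2, 2, 2, 3],
    "order_number": [10, 20, 5, 7, 9, 30],
})

# Mean order_number per department, sorted from highest to lowest.
dept_means = (
    toy.groupby("department_id")["order_number"]
    .mean()
    .sort_values(ascending=False)
)
print(dept_means)
```

On the real DataFrame the same chain would presumably be applied to `ords_prods_merge`; `.head(2)` and `.tail(2)` on the sorted result then give the extremes.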

---

##  Why Does This Matter?  

This analysis provides **valuable insights into ordering behavior across different departments**.  

- Helps identify **departments with higher or lower order frequency**
- Supports **inventory planning and product availability strategies**
- Can be used for **further customer behavior analysis**

In the **next task**, this aggregation will be further utilized to **conduct deeper analyses**.  

---


# Task 3) Comparing Aggregated Order Numbers: Full Dataset vs. Subset

#  Analysis: Full Dataset vs. Subset  

When comparing the **aggregated average order numbers per department** for the **full dataset** and the **subset (1,000,000 rows)**, we observe some differences:  

##  Key Observations  

-  The values in **both datasets are similar**, but slight deviations exist.  
-  The **full dataset** provides a **more comprehensive view**, as it includes **all available data**.  
-  The **subset may not be fully representative**, as it only includes the **first 1,000,000 rows**.  
-  Some **departments show minor differences** in the average order number, suggesting that the **distribution of orders is not completely uniform** across departments.  
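
The comparison described above can be sketched as follows. This is an illustration on synthetic data, not the notebook's original code: the frame `toy`, its size, and the 1,000-row subsets are assumptions, and the rows are deliberately sorted so that row order matters, which exaggerates the effect of taking only the first rows.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for ords_prods_merge (all values are invented).
rng = np.random.default_rng(0)
n = 10_000
toy = pd.DataFrame({
    "department_id": rng.integers(1, 4, size=n),   # departments 1..3
    "order_number": rng.integers(1, 50, size=n),   # order numbers 1..49
})
# Sort so that the first rows are not a random draw from the data.
toy = toy.sort_values("order_number").reset_index(drop=True)

def dept_mean(df):
    """Mean order_number per department."""
    return df.groupby("department_id")["order_number"].mean()

comparison = pd.DataFrame({
    "full": dept_mean(toy),
    "first_rows": dept_mean(toy.head(1_000)),
    "random_sample": dept_mean(toy.sample(n=1_000, random_state=0)),
})
comparison["first_rows_diff"] = comparison["first_rows"] - comparison["full"]
comparison["random_diff"] = comparison["random_sample"] - comparison["full"]
print(comparison)
```

In this sketch the random sample tracks the full data much more closely than the first rows do, which is one way to obtain a "well-selected" subset when the full dataset is too large to process.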

---

##  Why Does This Matter?  

These observations highlight the importance of **working with the entire dataset whenever possible** to:  

- **Avoid biases** caused by partial data.
- Ensure a **more accurate representation** of ordering behavior.
- Improve **data-driven decision-making** with complete insights.

However, for **computational efficiency**, working with a **well-selected subset** can still provide useful insights when handling **large datasets**.  

# Task 4) Creating a Customer Loyalty Classification

### Step 1: Calculate maximum number of orders per customer

In [25]:
# With transform('max') we save the highest number of orders (order_number) for each user_id in a new column max_order.

ords_prods_merge['max_order'] = ords_prods_merge.groupby('user_id')['order_number'].transform('max')

# Why?
# Each row contains the highest order_number value of the respective customer.
# This indicates how often a customer has ordered in total.

### Step 2: Create customer categories with loc()

In [27]:
# Now we use loc() to define a loyalty category based on max_order:

ords_prods_merge.loc[ords_prods_merge['max_order'] > 40, 'loyalty_flag'] = 'Loyal Customer'
ords_prods_merge.loc[(ords_prods_merge['max_order'] <= 40) & (ords_prods_merge['max_order'] > 10), 'loyalty_flag'] = 'Regular Customer'
ords_prods_merge.loc[ords_prods_merge['max_order'] <= 10, 'loyalty_flag'] = 'New Customer'

# Categorization:
# Loyal Customer: More than 40 orders.
# Regular Customer: Between 11 and 40 orders.
# New Customer: 10 or fewer orders.
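
As an alternative to three separate `loc` assignments, the same thresholds can be expressed as bins with `pd.cut`. This is a sketch of a swapped-in technique, not the notebook's method; the sample `max_order` values are invented to hit each boundary.

```python
import pandas as pd

# Invented max_order values chosen to sit on and around the thresholds.
max_order = pd.Series([3, 10, 11, 40, 41, 99], name="max_order")

# Bins are right-inclusive by default: (0, 10], (10, 40], (40, inf].
loyalty_flag = pd.cut(
    max_order,
    bins=[0, 10, 40, float("inf")],
    labels=["New Customer", "Regular Customer", "Loyal Customer"],
)
print(loyalty_flag.tolist())
```

The result is a categorical column, which typically uses less memory than repeated strings on a frame with tens of millions of rows.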

### Step 3: Checking the results

In [29]:
# Count the rows per category to check that every row has been assigned a flag:

ords_prods_merge['loyalty_flag'].value_counts(dropna=False)

# value_counts() counts the frequency of each category in the loyalty_flag column.
# dropna=False ensures that NaN values (if present) are also counted.

loyalty_flag
Regular Customer    15891077
Loyal Customer      10293737
New Customer         6249398
Name: count, dtype: int64

##  Observations  

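-  These counts are **rows (ordered products), not customers**: each customer appears once for every product they bought.  
-  **Regular Customers** (11 to 40 orders) account for about **15.9M of the 32.4M rows (≈ 49%)**, the largest share.  
-  **Loyal Customers** (more than 40 orders) account for about **10.3M rows (≈ 32%)**. Each of these customers contributes many rows, so their share of users is likely smaller than their share of rows.  
-  **New Customers** (≤ 10 orders) account for about **6.2M rows (≈ 19%)**. Because each contributes comparatively few rows, this group may still contain a large number of individual users.  
-  No `NaN` category appears, so every row received a flag.  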
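
Since `value_counts()` on the merged frame counts rows, counting customers requires counting distinct `user_id` values per flag. A minimal sketch on an invented frame (`toy` is an assumption for illustration):

```python
import pandas as pd

# Toy frame: user 1 bought three products, user 2 two, user 3 one.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "loyalty_flag": (
        ["Loyal Customer"] * 3 + ["New Customer"] * 2 + ["Regular Customer"]
    ),
})

# Rows per flag (what value_counts() reports).
rows_per_flag = toy["loyalty_flag"].value_counts()

# Distinct customers per flag.
users_per_flag = toy.groupby("loyalty_flag")["user_id"].nunique()

print(rows_per_flag)
print(users_per_flag)
```

Here the row counts differ by flag while each flag contains exactly one customer, which shows why the two views should not be confused.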

---

###  Why Is This Important?  

This classification helps us **segment customers based on their purchase frequency**, which is valuable for:  

- **Targeted marketing strategies** – personalized promotions based on customer loyalty.
- **Customer retention programs** – identifying and engaging high-value customers.
- **Understanding user behavior** – spotting trends in repeat purchases and new customer acquisition.

---


# Task 5) Analyzing Spending Habits by Customer Loyalty

In [32]:
# Grouping the data by customer loyalty category (loyalty_flag)
# Then, calculating the mean, minimum, and maximum price for each group

ords_prods_merge.groupby('loyalty_flag').agg({'prices' : ['mean', 'min', 'max']})

loyalty_flag,prices mean,prices min,prices max
Loyal Customer,10.388747,1.0,99999.0
New Customer,13.29437,1.0,99999.0
Regular Customer,12.496203,1.0,99999.0


#  Analysis: Price Statistics by Customer Loyalty  

To analyze whether **spending habits** differ between loyalty categories, we grouped the data by **`loyalty_flag`** and calculated the following price statistics:  

##  Key Metrics  

- **Mean price (`mean`)** → The **average price** of products purchased by each customer group.  
- **Minimum price (`min`)** → The **cheapest product** bought in each category.  
- **Maximum price (`max`)** → The **most expensive product** bought in each category.  

---

##  Insights  
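In this output, the mean price is lowest for **Loyal Customers** (≈ 10.39) and highest for **New Customers** (≈ 13.29), with **Regular Customers** in between (≈ 12.50). The minimum (1.0) and maximum (99999.0) are identical in all three groups. A maximum of 99999.0 looks like an outlier or placeholder value, so the means are likely inflated and should be interpreted with caution.  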

By analyzing these price trends, we can:  
- Identify **differences in spending behavior**
- Optimize **marketing strategies** based on purchase patterns
- Develop **targeted promotions** for different customer segments

---


# Task 6) Classifying Customers Based on Spending Behavior

### Step 1: Calculate average price per user

In [None]:
# We use groupby('user_id') and transform('mean') to calculate the average price per user and create a new column avg_spending.

ords_prods_merge['avg_spending'] = ords_prods_merge.groupby('user_id')['prices'].transform('mean')

# Why transform()?
# So that every purchase made by a user receives the same average value.
# So we can simply assign a spending flag.
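
The difference between aggregating and transforming can be seen on a tiny invented frame (`toy` is an assumption for illustration): a plain `mean()` returns one value per user, while `transform('mean')` returns one value per original row, so it can be assigned directly as a new column.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 2],
    "prices": [4.0, 8.0, 12.0],
})

# Aggregation: one row per user (length 2).
per_user = toy.groupby("user_id")["prices"].mean()

# Transform: one value per original row (length 3), aligned to toy's index.
toy["avg_spending"] = toy.groupby("user_id")["prices"].transform("mean")

print(per_user)
print(toy)
```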

### Step 2: Create the spending flag with loc()

In [None]:
# Now we use loc() to classify the data as “Low Spender” or “High Spender” based on avg_spending:

# Categorization of users as low spenders or high spenders
ords_prods_merge.loc[ords_prods_merge['avg_spending'] < 10, 'spending_flag'] = 'Low spender'
ords_prods_merge.loc[ords_prods_merge['avg_spending'] >= 10, 'spending_flag'] = 'High spender'

# Categorization:
# Low spender: Average price of products purchased < 10 USD
# High spender: Average price of the products purchased ≥ 10 USD

### Step 3: Analyze the distribution of categories with value_counts():

In [None]:
ords_prods_merge['spending_flag'].value_counts(dropna=False)

# Shows how many rows (ordered products) fall under “Low spender” and “High spender”.
# In this dataset, there are far more “Low spender” rows (31.8 million) than “High spender” rows (635k).

### Step 4: Show a sample of relevant columns

In [None]:
ords_prods_merge[['user_id', 'avg_spending', 'spending_flag']].head(100)

# Check that spending_flag has been assigned correctly by comparing it with the avg_spending values.
# For example, user_id 1 and user_id 12 have both been classified as “Low spender” because their average price is < 10 (e.g. 6.36 or 8.11).

# Checking the Results  

To ensure that customers are correctly classified as **Low Spenders** and **High Spenders**, we performed the following checks:  

## 1. Counting spending categories (`value_counts()`)  
- Most rows (**31.8M**) fall into the **Low Spender** category. Each row is one ordered product, so this is not a count of customers.  
- Only a small share (**635K** rows) belongs to **High Spenders**, indicating that most purchases come from users whose average product price is below 10.  

## 2. Reviewing sample data (`head(100)`)  
- We checked individual users and confirmed that their **spending_flag** aligns with their **avg_spending**.  
- **Example:**  `user_id 1` has an **average product price of $6.36**, correctly classified as a **Low Spender**.  

---

### **Summary**  
These results indicate that the **spending flag classification is working correctly**  and can now be used for further analysis or marketing segmentation.  


# Task 7) Classifying Customers Based on Order Frequency

### Step 1: Calculate median days between orders per user

In [None]:
# We use groupby('user_id') and transform('median') to calculate the median number of days between orders per user and create a new column median_days_since_order.

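# Calculate the median number of days between orders per user
# (the column holding the days since the prior order is named 'days_po' in this DataFrame, see the info() output above)
ords_prods_merge['median_days_since_order'] = ords_prods_merge.groupby('user_id')['days_po'].transform('median')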

# Why transform()?
# - So that every purchase made by a user receives the same median value.
# - This allows us to easily assign an order frequency flag.

### Step 2: Create the order frequency flag with loc()

In [None]:
# Now we use loc() to classify the data as Frequent, Regular, or Non-frequent Customers based on median_days_since_order:

# Categorization of users based on ordering frequency
ords_prods_merge.loc[ords_prods_merge['median_days_since_order'] <= 10, 'order_frequency_flag'] = 'Frequent customer'
ords_prods_merge.loc[(ords_prods_merge['median_days_since_order'] > 10) & (ords_prods_merge['median_days_since_order'] <= 20), 'order_frequency_flag'] = 'Regular customer'
ords_prods_merge.loc[ords_prods_merge['median_days_since_order'] > 20, 'order_frequency_flag'] = 'Non-frequent customer'

# Categorization:
# - Frequent customer: Median days since last order ≤ 10
# - Regular customer: Median days since last order > 10 and ≤ 20
# - Non-frequent customer: Median days since last order > 20
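
One edge case is worth checking: a user whose rows all have a missing days-since-prior-order value (for example, a user with only a first order) gets a `NaN` median, and none of the three conditions matches, so the flag stays `NaN`. The sketch below reproduces this on an invented frame; `toy` and its values are assumptions, and the column name `days_po` follows the `info()` output shown earlier.

```python
import numpy as np
import pandas as pd

# User 1 has three orders; user 2 has only a first order (no prior order).
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "days_po": [np.nan, 7.0, 9.0, np.nan],
})

# Median per user; NaN values are skipped, an all-NaN group yields NaN.
toy["median_days"] = toy.groupby("user_id")["days_po"].transform("median")

toy.loc[toy["median_days"] <= 10, "order_frequency_flag"] = "Frequent customer"
toy.loc[(toy["median_days"] > 10) & (toy["median_days"] <= 20),
        "order_frequency_flag"] = "Regular customer"
toy.loc[toy["median_days"] > 20, "order_frequency_flag"] = "Non-frequent customer"

counts = toy["order_frequency_flag"].value_counts(dropna=False)
print(counts)
```

This is why `value_counts(dropna=False)` is useful in the next step: it makes any unflagged rows visible instead of silently dropping them.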

### Step 3: Analyze the distribution of categories with value_counts()

In [None]:
# Show the number of customers in each order frequency category
print(ords_prods_merge['order_frequency_flag'].value_counts(dropna=False))

# Shows how many rows (ordered products) fall under "Frequent", "Regular", or "Non-frequent customer".

### Step 4: Show a sample of relevant columns

In [None]:
# Check if the order_frequency_flag has been assigned correctly

ords_prods_merge[['user_id', 'median_days_since_order', 'order_frequency_flag']].head(100)



#  Checking the Results  

To ensure that customers are correctly classified as **Frequent, Regular, or Non-frequent customers**, we performed the following checks:  

##  Counting ordering frequency categories (`value_counts()`)  
-  The majority of customers belong to the **[most common category]**.  
-  A smaller group qualifies as **[least common category]**, indicating that some users order much less frequently.  

##  Reviewing sample data (`head(100)`)  
-  We checked individual users and confirmed that their **order_frequency_flag** aligns with their **median_days_since_order**.  
- **Example:**  
  -  `user_id 1` has a **median days since prior order of 7**, correctly classified as a **Frequent customer**.  

---

##  Summary  
These results indicate that the **order frequency classification is working correctly** and can now be used for:  

- **Personalized in-app notifications** (e.g., reorder reminders).
- **Customer segmentation for targeted promotions**.
- **Understanding user retention and engagement trends**.

# Task 9) Exporting Data

In [None]:
ords_prods_merge.to_pickle("ords_prods_merge.pkl")

In [None]:
import pandas as pd

In [None]:
# Cross-tabulate days since prior order ('days_po' in this DataFrame) against order_number
crosstab = pd.crosstab(ords_prods_merge['days_po'], ords_prods_merge['order_number'], dropna=False)