# Aggregating and Grouping Data

## Introduction
In this task, we perform aggregation and grouping on our dataset to derive insights based on customer behavior and product categories. Specifically, we:
1. Import the merged orders and products data.
2. Perform various aggregation operations.
3. Group data by customer and product attributes.
4. Export the final dataframe.

## Step 1: Load and Inspect the Data
We start by loading the combined orders and products dataframe and inspecting its shape and columns.

## Step 2: Aggregate Data by Product
We perform aggregation operations to find the total orders and mean prices for each product.
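
A minimal sketch of this per-product aggregation on a toy frame. The column names `order_id`, `product_id`, and `prices` follow the notebook's dataframe; the rows themselves are invented for illustration.

```python
import pandas as pd

# Toy order lines (values are made up for illustration)
toy = pd.DataFrame({
    "order_id":   [1, 1, 2, 3, 3],
    "product_id": [196, 14084, 196, 196, 14084],
    "prices":     [9.0, 12.5, 9.0, 9.0, 12.5],
})

# Named aggregation: each output column is a (source column, function) pair.
# 'count' counts order lines, which equals the number of orders as long as
# a product appears at most once per order.
product_stats = (
    toy.groupby("product_id")
       .agg(total_orders=("order_id", "count"),
            mean_price=("prices", "mean"))
       .reset_index()
)
print(product_stats)
```
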

## Step 3: Group Data by Customer
We group the data by customer to derive insights such as the total spend and average order size.
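
As a sketch under the assumption that "average order size" means items per order, the example below divides each user's item count by their number of distinct orders. The helper columns `total_items` and `n_orders` are illustrative names, and the data is invented.

```python
import pandas as pd

# Toy order lines: user 1 has two orders (10 and 11), user 2 has one (20)
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "order_id": [10, 10, 11, 20, 20],
    "prices":   [9.0, 12.5, 4.4, 4.7, 1.0],
})

customer_stats = toy.groupby("user_id").agg(
    total_spend=("prices", "sum"),
    total_items=("order_id", "count"),    # rows = items bought
    n_orders=("order_id", "nunique"),     # distinct orders
).reset_index()

# Items per order; a plain row count would give total items, not an average
customer_stats["avg_order_size"] = (
    customer_stats["total_items"] / customer_stats["n_orders"]
)
print(customer_stats)
```
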

## Step 4: Group Data by Product Category
We ensure that the `department_id` column is present and group the data by department to analyze trends.
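
One way to make the presence check explicit is to merge only when the column is missing. This is a sketch on toy data: the department IDs are invented, and `validate="many_to_one"` is an added safeguard (not part of the notebook) that raises if `product_id` is duplicated in the products table.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id":   [1, 1, 2],
    "product_id": [196, 14084, 196],
    "prices":     [9.0, 12.5, 9.0],
})
products = pd.DataFrame({
    "product_id":    [196, 14084],
    "department_id": [7, 16],   # invented values for illustration
})

# Merge only if the column is absent; merging twice would yield
# 'department_id_x' / 'department_id_y' instead of one clean column
if "department_id" not in orders.columns:
    orders = orders.merge(
        products[["product_id", "department_id"]],
        on="product_id",
        how="left",
        validate="many_to_one",
    )

# Here total_orders counts distinct orders that touched each department
department_stats = orders.groupby("department_id").agg(
    total_orders=("order_id", "nunique"),
    mean_price=("prices", "mean"),
).reset_index()
print(department_stats)
```
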

## Step 5: Create Additional Derived Metrics
We create additional derived metrics such as the frequency of reorder for each product.
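
The reorder frequency can be read as the share of a product's order lines that are reorders. A small sketch on invented data, assuming `reordered` is a 0/1 flag as in the notebook output:

```python
import pandas as pd

toy = pd.DataFrame({
    "product_id": [196, 196, 196, 14084, 14084],
    "reordered":  [0, 1, 1, 0, 0],
})

# transform('mean') returns one value per input row, so the result can be
# assigned straight back onto the original frame without a merge
toy["reorder_frequency"] = (
    toy.groupby("product_id")["reordered"].transform("mean")
)
print(toy.drop_duplicates("product_id"))
```
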

## Step 6: Export the Final Dataframe
Finally, we export the dataframe as a pickle file for further analysis. In the code, the export runs last, after Steps 7 and 8, so that the exported file includes both flags.
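
A quick round-trip check can confirm that a pickle export preserves the dataframe. The sketch below writes to a temporary directory rather than the project folder; the file name `toy.pkl` is hypothetical.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "user_id":       [1, 2],
    "spending_flag": ["Low spender", "High spender"],
})

# Write to a throwaway directory, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values and dtypes, which pickle keeps intact
roundtrip_ok = restored.equals(toy)
print(roundtrip_ok)
```
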

## Step 7: Create Spending Flag for Each User
We create a spending flag for each user based on the average price across all their orders.
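
A compact sketch of the flag on toy data, using the threshold of 10 from the notebook code. It uses `transform` plus `numpy.where` instead of the notebook's aggregate-then-merge approach; both should give the same labels.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "prices":  [4.0, 6.0, 12.0, 14.0],
})

# Mean price per user, broadcast to every row of that user
toy["mean_price"] = toy.groupby("user_id")["prices"].transform("mean")

# Below 10 -> low spender, otherwise high spender
toy["spending_flag"] = np.where(
    toy["mean_price"] < 10, "Low spender", "High spender"
)
print(toy)
```
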

## Step 8: Create Order Frequency Flag
We create an order frequency flag to categorize customers based on their ordering behavior.
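
The sketch below applies the thresholds from the notebook code (median above 20 days, above 10 days, otherwise frequent) with `numpy.select`. The "Unknown" label for users with a missing median is an assumption added here, not something the notebook defines.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3],
    "days_since_prior_order": [5.0, 7.0, 9.0, 15.0, 15.0, 30.0, 25.0],
})

# Median gap between orders per user, broadcast to each row
median_days = toy.groupby("user_id")["days_since_prior_order"].transform("median")

# Conditions are checked in order; the first match wins
conditions = [median_days > 20, median_days > 10, median_days <= 10]
labels = ["Non-frequent customer", "Regular customer", "Frequent customer"]

# NaN medians match no condition and fall through to the default
toy["order_frequency_flag"] = np.select(conditions, labels, default="Unknown")
print(toy.drop_duplicates("user_id"))
```
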

## Step 9: Clean and Structure the Notebook
We ensure the notebook is clean, well-structured, and all code is well commented.

## Summary
In this task, we:
1. Loaded the merged orders and products data.
2. Performed various aggregation operations to derive insights.
3. Grouped data by customer and product attributes.
4. Created additional derived metrics.
5. Created spending and order frequency flags for users.
6. Exported the final aggregated dataframe.


import pandas as pd
import os

# Path to the project folder
project_path = r'C:\Users\sudee\OneDrive\Documents\Python Scripts\Instacart Basket Analysis'

# File paths
combined_data_path = os.path.join(project_path, '02 Data', 'Prepared Data', 'orders_products_combined_with_labels.pkl')
df_products_path = os.path.join(project_path, '02 Data', 'Prepared Data', 'products_checked_clean.csv')

# Load the combined dataset
df_combined = pd.read_pickle(combined_data_path)

# Check the shape and head of the dataframe
print("Combined DataFrame shape:", df_combined.shape)
print(df_combined.head())
print(df_combined.columns)

# Load the products data
df_products = pd.read_csv(df_products_path)

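# Ensure 'department_id' is present: merge it in from the products data only if it is missing
# (an unconditional merge would create 'department_id_x'/'department_id_y' duplicates)
if 'department_id' not in df_combined.columns:
    df_combined = df_combined.merge(df_products[['product_id', 'department_id']], on='product_id', how='left')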

# Check if 'department_id' is now in df_combined
print("Columns after merge to include 'department_id':", df_combined.columns)

# Step 2: Aggregate data to find the total orders and mean prices for each product
product_aggregations = df_combined.groupby('product_id').agg(
    total_orders=('order_id', 'count'),
    mean_price=('prices', 'mean')
).reset_index()

# Check the result of aggregation
print("Product Aggregations:")
print(product_aggregations.head())

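# Step 3: Group data by customer to find total spend and average order size
customer_aggregations = df_combined.groupby('user_id').agg(
    total_spend=('prices', 'sum'),
    total_items=('order_id', 'count'),    # number of order lines (items) per user
    total_orders=('order_id', 'nunique')  # number of distinct orders per user
).reset_index()

# Average order size = items per order (a plain row count gives total items, not an average)
customer_aggregations['avg_order_size'] = customer_aggregations['total_items'] / customer_aggregations['total_orders']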

# Check the result of customer aggregation
print("Customer Aggregations:")
print(customer_aggregations.head())

# Step 4: Group data by department to find total orders and mean prices
department_aggregations = df_combined.groupby('department_id').agg(
    total_orders=('order_id', 'nunique'),  # distinct orders; 'count' would count item rows instead
    mean_price=('prices', 'mean')
).reset_index()

# Check the result of department aggregation
print("Department Aggregations:")
print(department_aggregations.head())

# Step 5: Create additional derived metrics
df_combined['reorder_frequency'] = df_combined.groupby('product_id')['reordered'].transform('mean')

# Check the result of derived metrics
print("Reorder Frequency for Products:")
print(df_combined[['product_id', 'reorder_frequency']].drop_duplicates().head())

# Step 7: Create spending flag for each user
user_spending = df_combined.groupby('user_id').agg(mean_price=('prices', 'mean')).reset_index()

# Define spending flag
user_spending['spending_flag'] = user_spending['mean_price'].apply(
    lambda x: 'Low spender' if x < 10 else 'High spender'
)

# Merge spending flag back to the main dataframe
df_combined = df_combined.merge(user_spending[['user_id', 'spending_flag']], on='user_id', how='left')

# Check the spending flag
print("Spending Flag for Users:")
print(user_spending.head())

# Step 8: Create order frequency flag
user_order_frequency = df_combined.groupby('user_id').agg(median_days_since_prior=('days_since_prior_order', 'median')).reset_index()

# Define order frequency flag
def order_frequency_flag(days):
    if days > 20:
        return 'Non-frequent customer'
    elif days > 10:
        return 'Regular customer'
    else:
        return 'Frequent customer'

user_order_frequency['order_frequency_flag'] = user_order_frequency['median_days_since_prior'].apply(order_frequency_flag)

# Merge order frequency flag back to the main dataframe
df_combined = df_combined.merge(user_order_frequency[['user_id', 'order_frequency_flag']], on='user_id', how='left')

# Check the order frequency flag
print("Order Frequency Flag for Users:")
print(user_order_frequency.head())

# Step 6: Export the final dataframe as a pickle file (run last so the flags from Steps 7 and 8 are included)
final_data_path = os.path.join(project_path, '02 Data', 'Prepared Data', 'final_aggregated_orders_products.pkl')
df_combined.to_pickle(final_data_path)
print(f"Final DataFrame exported to {final_data_path}")


Combined DataFrame shape: (32435059, 13)
   order_id  user_id  order_number  order_day_of_week  order_hour_of_day  \
0   2539329        1             1                  2                  8   
1   2539329        1             1                  2                  8   
2   2539329        1             1                  2                  8   
3   2539329        1             1                  2                  8   
4   2539329        1             1                  2                  8   

   days_since_prior_order  product_id  add_to_cart_order  reordered  prices  \
0               11.114836         196                  1          0     9.0   
1               11.114836       14084                  2          0    12.5   
2               11.114836       12427                  3          0     4.4   
3               11.114836       26088                  4          0     4.7   
4               11.114836       26405                  5          0     1.0   

         price_label     bu