## Exploratory Data Analysis
### Instacart Market Basket Analysis

This notebook performs exploratory data analysis on the Instacart Online Grocery Shopping dataset. 
The goals of the EDA are to:
- Understand the structure and statistics of the data.
- Visualize user behavior, product demand, and reorder patterns.
- Identify relationships and insights that will guide preprocessing and feature engineering.

### Loading the necessary libraries and data

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from pathlib import Path

RAW = Path("../data/raw")
PROCESSED = Path("../data/processed")

orders = pd.read_csv(RAW/"orders.csv")
order_prod_prior = pd.read_csv(RAW/"order_products__prior.csv")
order_prod_train = pd.read_csv(RAW/"order_products__train.csv")
products = pd.read_csv(RAW/"products.csv")
aisles = pd.read_csv(RAW/"aisles.csv")
departments = pd.read_csv(RAW/"departments.csv")

print("Orders:", orders.shape)
print("Order Products Prior:", order_prod_prior.shape)
print("Order Products Train:", order_prod_train.shape)
print("Products:", products.shape)
print("Aisles:", aisles.shape)
print("Departments:", departments.shape)



### Dataset Overview

- **orders.csv**: Contains order-level information such as order number, day of week, hour of day, and days since prior order.  
- **order_products__prior.csv**: Line-item details of products from prior orders (used for feature engineering and EDA).  
- **order_products__train.csv**: Line-item details of products from training orders (used for supervised learning).  
- **products.csv**: Product metadata including product name, aisle, and department IDs.  
- **aisles.csv**: Aisle names and IDs.  
- **departments.csv**: Department names and IDs.  

For EDA, we primarily use `orders` (eval_set = "prior") joined with `order_products__prior` and product metadata.
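
As a rough sketch of that join, the example below uses tiny made-up frames rather than the real CSVs. The toy values and the `eda_base` name are illustrative assumptions, not part of the notebook's pipeline.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration)
orders_toy = pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [10, 10, 11],
    "eval_set": ["prior", "train", "prior"],
    "order_dow": [0, 3, 5],
})
order_products_toy = pd.DataFrame({
    "order_id": [1, 1, 3],
    "product_id": [100, 101, 100],
    "reordered": [0, 1, 0],
})
products_toy = pd.DataFrame({
    "product_id": [100, 101],
    "product_name": ["Banana", "Whole Milk"],
    "aisle_id": [24, 84],
    "department_id": [4, 16],
})

# Keep only prior orders, then attach line items and product metadata
eda_base = (
    orders_toy.loc[orders_toy["eval_set"] == "prior"]
    .merge(order_products_toy, on="order_id", how="inner")
    .merge(products_toy, on="product_id", how="left")
)
print(eda_base[["order_id", "user_id", "product_name", "reordered"]])
```

A left join on the product metadata keeps every line item even if a product were missing from `products`, which makes such gaps visible as NaN rather than silently dropping rows.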


### Data Overview and Statistics

Before visualizing, we examine the structure and summary statistics of the datasets.  
This helps identify missing values, variable types, and potential issues (e.g., outliers, skewness).


In [None]:
# Orders dataset
display(orders.head())
orders.info()
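# Wrap in display(): a bare expression is only rendered when it is the last line of a cell
display(orders[['order_hour_of_day', 'order_dow', 'days_since_prior_order']].describe().T)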
print("orders eval_set distribution:", orders["eval_set"].value_counts(normalize=True))


- `order_id` and `user_id` are identifiers.  
- `eval_set` has values: prior, train, test (only prior used for EDA).  
- `order_number` gives each order's position in the user's sequence (1 for the first order, 2 for the second, and so on).  
- `order_dow` (0–6) is the day of the week, read here as Sunday through Saturday.  
- `order_hour_of_day` (0–23) is the hour at which the order was placed.  
- `days_since_prior_order` is the number of days between the current order and the user's previous order; it is missing for first orders.  
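
The statement that `days_since_prior_order` is missing only for first orders can be checked directly. The following is a minimal sketch on a made-up frame (`orders_toy` is hypothetical); on the real data the same comparison would be run against `orders`.

```python
import numpy as np
import pandas as pd

# Toy orders table; values are made up for illustration
orders_toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 7.0, 30.0, np.nan, 14.0],
})

is_first = orders_toy["order_number"] == 1
is_missing = orders_toy["days_since_prior_order"].isna()

# True only if NaN appears on every first order and nowhere else
nan_only_on_first = bool((is_first == is_missing).all())
print("NaN only on first orders:", nan_only_on_first)
```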

In [None]:
# Prior order products
display(order_prod_prior.head())
order_prod_prior.info()
order_prod_prior[['add_to_cart_order','reordered']].describe().T


- Contains product-level details for prior orders.  
- `add_to_cart_order` shows position of the product in the user's cart.  
- `reordered`: binary (1 if the product was ordered before by the same user, else 0).  
- This is the core table for analyzing product-level reorder behavior.  
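
To make the definition of `reordered` concrete, the sketch below rebuilds the flag from a small made-up order history. The frame and the `derived_reordered` column are illustrative assumptions; the real dataset ships the flag precomputed.

```python
import pandas as pd

# Toy line items, one row per (order, product); values are made up
items = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 1],
    "order_number": [1, 1, 2, 2, 3],
    "product_id":   [100, 101, 100, 102, 100],
})

# A product counts as reordered if the same user bought it in an earlier order.
# After sorting by order sequence, cumcount() gives the number of earlier purchases.
items = items.sort_values(["user_id", "order_number"]).reset_index(drop=True)
times_seen_before = items.groupby(["user_id", "product_id"]).cumcount()
items["derived_reordered"] = (times_seen_before > 0).astype(int)
print(items)
```

This relies on each product appearing at most once per order, which is assumed here for the toy data.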

In [None]:
# Products metadata
display(products.head())
print("Unique products:", products['product_id'].nunique())
print("Unique aisles:", aisles['aisle_id'].nunique())
print("Unique departments:", departments['department_id'].nunique())


- `products.csv` links products to aisles and departments.  
- There are 49,688 unique products, spread across 134 aisles and 21 departments.  
- This metadata will be useful for grouping and visualization.  

### Missing Values Check

In [None]:
# Missing values across datasets
missing_orders = orders.isnull().sum()
missing_prior = order_prod_prior.isnull().sum()
missing_products = products.isnull().sum()

print("Missing values in orders:")
print(missing_orders[missing_orders > 0])

print("\nMissing values in order_products__prior:")
if (missing_prior[missing_prior > 0]).any():
    print(missing_prior[missing_prior > 0])
else:
    print("No missing values")

print("\nMissing values in products:")
if (missing_products[missing_products > 0]).any():
    print(missing_products[missing_products > 0])
else:
    print("No missing values")


### Data Visualization
This section visualizes user behavior, product demand, and reorder patterns using the PRIOR orders.

Defining the path where the graphs and figures will be stored. 

In [None]:
FIG_DIR = Path("reports/figures")
FIG_DIR.mkdir(parents=True, exist_ok=True)

# Using PRIOR orders for EDA
orders_prior = orders.loc[orders["eval_set"] == 'prior'].copy()

print("orders_prior shape:", orders_prior.shape)
print("order eval_set value counts:\n", orders["eval_set"].value_counts())


### User Behavior
#### Orders by Day of Week
This plot shows how Instacart orders are distributed across the days of the week. 
The `order_dow` variable ranges from 0 to 6, where 0 is taken to be Sunday and 6 Saturday.

In [None]:
dow_map = {0: 'Sunday', 1: 'Monday', 2: 'Tuesday', 3: 'Wednesday', 
           4: 'Thursday', 5: 'Friday', 6: 'Saturday'}

dow_counts = (orders_prior['order_dow'].value_counts().sort_index()
              .rename_axis('order_dow').reset_index(name='orders'))

dow_counts['day'] = dow_counts['order_dow'].map(dow_map)
dow_counts['percentage'] = 100 * dow_counts['orders'] / dow_counts['orders'].sum()

# Plotting
plt.figure(figsize=(8,5))
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.grid(axis='x', linestyle='--', alpha=0.5)
bars = plt.bar(dow_counts['day'], dow_counts['percentage'],
               color = sns.color_palette("pastel"),
               edgecolor='black',
               width = 0.6)

# Annotate bars with percentages
for bar, pct in zip(bars, dow_counts['percentage']):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.15, 
             f'{pct:.1f}%', ha='center', va='bottom', fontsize=9)

plt.title('Orders by Day of Week (Prior Orders)', fontsize = 12, weight='bold')
plt.xlabel('Day of Week', fontsize = 10)
plt.ylabel('Percentage of Orders (%)', fontsize = 10)
plt.tight_layout()

fig_path = FIG_DIR / "EDA-01_orders_by_dow.png"
plt.savefig(fig_path, dpi=150)
plt.show()

**Interpretation.**  
Orders vary noticeably by day of week: most orders occur on Sundays and Mondays, indicating that customers tend to restock groceries at the start of the week.  
Order volume is lower from Tuesday through Saturday, suggesting less activity after the initial grocery purchase.  

This pattern implies that **day-of-week** can serve as an important feature in predictive models, capturing routine behavior and planning cycles in customer purchasing habits.

#### Orders by Hour of Day

This plot displays how Instacart orders are distributed by the hour of the day (0–23).  
Understanding hourly shopping trends helps identify peak activity periods and user engagement patterns.


In [None]:
# Aggregate total orders by hour of day
hour_counts = (orders_prior['order_hour_of_day'].value_counts().sort_index()
               .rename_axis('order_hour_of_day').reset_index(name='orders'))

hour_counts['percentage'] = 100 * hour_counts['orders'] / hour_counts['orders'].sum()

# Plotting
plt.figure(figsize=(10,5))
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.plot(hour_counts['order_hour_of_day'], hour_counts['percentage'],
         marker='o', color = "black", markersize=6, linewidth=2)
# Use a single color from the palette; fill_between draws one filled region
plt.fill_between(hour_counts['order_hour_of_day'], hour_counts['percentage'],
                 color = sns.color_palette("deep")[0], alpha=0.7)

plt.title('Orders by Hour of Day (Prior Orders)', fontsize = 12, weight='bold')
plt.xlabel('Hour of Day (0-23)', fontsize = 10)
plt.ylabel('Percentage of Orders (%)', fontsize = 10)
plt.xticks(range(0,24))
plt.tight_layout()

fig_path = FIG_DIR / "EDA-02_orders_by_hour.png"
plt.savefig(fig_path, dpi=150)
plt.show()




**Interpretation.**  
Order volume is lowest during late-night hours and rises sharply in the **morning**, peaking between **10 AM and 4 PM**. After 5 PM, activity declines steadily until about 6 AM the next day.  

This suggests that most users place grocery orders **during midday or early afternoon**, possibly during breaks or after work.  
The clear daily rhythm supports using **hour-of-day** as a time-based feature in predictive or recommendation models.
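
One common way to feed hour-of-day (or day-of-week) into a model is a sine/cosine encoding, so that hour 23 and hour 0 end up close together. This is a sketch of one possible option, not something this notebook does; the column names `hour_sin` and `hour_cos` and the helper `circle_dist` are hypothetical.

```python
import numpy as np
import pandas as pd

hours = pd.DataFrame({"order_hour_of_day": [0, 6, 12, 23]})

# Map the 24-hour cycle onto the unit circle
angle = 2 * np.pi * hours["order_hour_of_day"] / 24
hours["hour_sin"] = np.sin(angle)
hours["hour_cos"] = np.cos(angle)

def circle_dist(a, b):
    """Euclidean distance between two hours in the encoded (sin, cos) space."""
    encoded = hours.set_index("order_hour_of_day")[["hour_sin", "hour_cos"]]
    return float(np.linalg.norm(encoded.loc[a].to_numpy() - encoded.loc[b].to_numpy()))

# Hour 23 is close to hour 0; hour 12 is as far from hour 0 as possible
print(circle_dist(23, 0), circle_dist(12, 0))
```

The same idea would apply to `order_dow` with a period of 7 instead of 24. Tree-based models often work fine with the raw integer, so whether this encoding helps is an empirical question.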

#### Days Since Prior Order Distribution

This plot illustrates the distribution of `days_since_prior_order`, which measures the number of days between a user's current and previous orders. It helps reveal customers' ordering frequency patterns.

In [None]:
# Excluding the first order (NaN days_since_prior_order)
valid_orders = orders_prior.dropna(subset=['days_since_prior_order'])

plt.figure(figsize = (8,5))
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.grid(axis='x', linestyle='--', alpha=0.5)

# One bin per integer day (0-30); half-integer edges center each bar on its tick
plt.hist(valid_orders['days_since_prior_order'], 
         bins = np.arange(32) - 0.5, color = sns.color_palette("deep")[0], 
         edgecolor='black', alpha=0.8)
plt.title('Distribution of Days Since Prior Order', fontsize = 12, weight='bold')
plt.xlabel('Days Since Prior Order', fontsize = 10)
plt.ylabel('Number of Orders', fontsize = 10)
plt.xticks(range(0,31))
plt.tight_layout()

# Save figure
fig_path = FIG_DIR / "EDA-03_days_since_prior_order.png"
plt.savefig(fig_path, dpi=150)  
plt.show()


**Interpretation.**  
The distribution shows clear peaks at around **7 days** and **30 days**, suggesting that many customers follow **weekly** or **monthly** grocery cycles.  
Away from those peaks the distribution is right-skewed: fewer users reorder as the gap between orders grows. Because 30 is the largest value on the axis, the final spike may also absorb users whose true gap was longer.  

This recency pattern will be valuable for feature engineering: `days_since_prior_order` or its transformations (e.g., recency bins) can capture user purchase cadence in later models.
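
As a minimal sketch of the recency-bin idea, `pd.cut` can turn the raw day counts into ordered categories. The bin edges and labels below are illustrative assumptions rather than tuned values.

```python
import numpy as np
import pandas as pd

# Toy values; NaN stands for a user's first order
days = pd.Series([np.nan, 0, 3, 7, 8, 15, 30], name="days_since_prior_order")

# Intervals are right-closed: (-1, 3], (3, 7], (7, 14], (14, 30]
bins = [-1, 3, 7, 14, 30]
labels = ["0-3", "4-7", "8-14", "15-30"]
recency_bin = pd.cut(days, bins=bins, labels=labels)

# First orders have no recency, so they remain NaN after binning
print(recency_bin.tolist())
```

How to treat the NaN bucket (a separate "first order" category, or dropping those rows) is a modeling choice left open here.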

#### Orders per User Distribution

This plot shows the distribution of how many orders each user has placed.  
It provides insight into customer engagement and repeat shopping behavior.


In [None]:
# Count how many orders each user placed
user_order_counts = (orders_prior.groupby("user_id")["order_number"]
                     .max().reset_index(name = "total_orders"))

# Create bins
bins = np.arange(0, 105, 5)
hist, edges = np.histogram(user_order_counts["total_orders"], bins = bins)

# Cumulative percentage of users, accumulated across the order-count bins
cumulative = np.cumsum(hist) / user_order_counts.shape[0] * 100

fig, ax1 = plt.subplots(figsize = (9,5))
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.grid(axis='x', linestyle='--', alpha=0.5)

# Plotting Histogram
ax1.bar(edges[:-1], hist, width = 4.5, align="edge", 
        color="blue", edgecolor="black", alpha=0.5)
ax1.set_xlabel("Total Orders per User (grouped by 5)", fontsize=10)
ax1.set_ylabel("Number of Users", color="black")
ax1.tick_params(axis="y", labelcolor="black")
ax1.set_title("Distribution of User Order Counts", fontsize=13, weight="bold")

# Secondary y-axis for cumulative %
ax2 = ax1.twinx()
# Plot at the right bin edges: each cumulative value covers users up to that bin's upper bound
ax2.plot(edges[1:], cumulative, color="red", marker="o")
ax2.set_ylabel("Cumulative % of Users", color="black")
ax2.tick_params(axis="y", labelcolor="black")

plt.tight_layout()
fig_path = FIG_DIR / "EDA-04_orders_per_user_improved.png"
plt.savefig(fig_path, dpi=150)
plt.show()


**Interpretation.**  
Most users place **fewer than 10 orders**, with over **70% of users** completing fewer than 15. The histogram’s steep left side indicates a large number of casual or low-frequency shoppers, while the cumulative curve shows that only a small fraction of users are highly active (30+ orders).  
This demonstrates a **long-tail user engagement pattern**: a few loyal customers account for a disproportionate share of total orders. Such insights can inform retention strategies and customer segmentation.
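
The long-tail claim can be quantified by asking what share of all orders comes from the most active users. The sketch below does this on made-up counts; `top_share` is a hypothetical helper, and on the real data the input would be `user_order_counts["total_orders"]`.

```python
import numpy as np
import pandas as pd

def top_share(total_orders: pd.Series, top_frac: float) -> float:
    """Share of all orders placed by the top `top_frac` fraction of users."""
    ordered = total_orders.sort_values(ascending=False).to_numpy()
    n_top = max(1, int(np.ceil(top_frac * len(ordered))))
    return float(ordered[:n_top].sum() / ordered.sum())

# Made-up example: mostly light users plus two very heavy ones
toy_counts = pd.Series([3, 3, 4, 4, 5, 5, 6, 10, 60, 100])
print(f"Top 20% of users: {top_share(toy_counts, 0.2):.0%} of orders")
```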


### Product Demand
#### Top 15 Departments
This plot shows which product departments account for the highest share of orders. 
By joining prior order details with product metadata, we can identify where customer demand is concentrated across broad product categories.

In [None]:
# Join order_products_prior with product and department info
prior_with_dept = (order_prod_prior
                   .merge(products[["product_id", "department_id"]], 
                                          on = "product_id",
                                          how = "left")
                                          .merge(departments, on = "department_id", 
                                                 how = "left"))
# Count line items per department and keep the 15 largest departments
dept_count = (prior_with_dept['department']
             .value_counts().head(15)
             .rename_axis("department").reset_index(name = 'orders'))

# Express each count as a percentage of all prior line items, not just of the top 15
dept_count['percentage'] = 100 * dept_count['orders'] / len(prior_with_dept)

plt.figure(figsize = (9,5))
bars = plt.barh(dept_count['department'],
                dept_count['percentage'], color = sns.color_palette("muted"),
                edgecolor = 'black')

plt.title("Top 15 Departments by Share of orders")
plt.xlabel("Percentage of orders (%)")
plt.ylabel("Departments")
plt.gca().invert_yaxis()

# Annotate the bars with ordered percentage
for bar, pct in zip(bars, dept_count['percentage']):
    plt.text(pct + 0.2, bar.get_y() +
             bar.get_height()/2, 
             f"{pct:.1f}%",
             va = "center", fontsize = 7)
    
plt.tight_layout()

# Save before plt.show(); saving afterwards can write out an empty figure
fig_path = FIG_DIR/"EDA-05_top_department.png"
plt.savefig(fig_path, dpi = 150)
plt.show()


**Interpretation.**

The **produce** department dominates Instacart orders, followed by **dairy eggs**, **snacks**, and **beverages**. This indicates that users primarily shop for fresh essentials and daily-use items. 
The large share held by produce highlights Instacart's core grocery focus, while smaller departments such as babies represent niche shopping categories.

#### Top 15 Aisles by Share of orders
Aisles represent finer-grained product groupings within departments. The treemap visualizes the 15 aisles with the highest order volume, where each rectangle's area corresponds to its share of all prior order line items.

In [None]:
import squarify
import textwrap

# Join order_product_prior with product and aisle info
prior_with_aisle = (order_prod_prior.merge(products[['product_id', 'aisle_id']],
                                           on = 'product_id', how = 'left')
                                           .merge(aisles, on = 'aisle_id',
                                                  how = 'left'))
# Aggregate top 15 aisles

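aisle_counts = (prior_with_aisle['aisle'].value_counts().head(15)
                .rename_axis("aisle").reset_index(name = "orders"))
# Share of all prior line items, not just of the top 15 aisles
aisle_counts['percentage'] = 100 * aisle_counts['orders'] / len(prior_with_aisle)

def short(s, n = 18):
    # Truncate long aisle names to at most n characters so labels fit inside their tiles
    return s if len(s) <= n else s[:n-3] + "..."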
labels = [f"{short(a)}\n{p:.1f}%" for a, p in zip(aisle_counts['aisle'], aisle_counts['percentage'])]

# Build tree map

plt.figure(figsize = (10,8))
squarify.plot(sizes = aisle_counts['orders'],
              label = labels,
              alpha = 0.5, color = sns.color_palette('viridis', n_colors = 15), edgecolor = 'black', pad = 0.25,
              
              text_kwargs = dict(fontsize = 10) )

plt.title("Top 15 Aisles by Share of orders", fontsize = 12, weight = 'bold')
plt.tight_layout()
plt.show()

**Interpretation.**  
The treemap shows that **Fresh Vegetables**, **Fresh Fruits**, and **Packaged Vegetables & Fruits** dominate aisle-level demand, together accounting for the largest portion of orders. Smaller tiles like **Yogurt**, **Soft Drinks**, and **Cereal** indicate strong but secondary categories.  

This emphasizes Instacart’s heavy focus on **fresh produce** and **daily essentials**, while niche aisles contribute less to total order volume. The visualization helps identify which aisles could benefit most from targeted promotions or personalized recommendations.


#### Top 15 Products by Frequency

This plot shows the 15 most frequently ordered products. It highlights the high-volume staples that define user demand at the product level.


In [None]:
# Join prior orders with product names

prior_with_product = order_prod_prior.merge(
    products[["product_id", "product_name"]], on="product_id", how="left"
)

# Aggregate top 15 products
top_products = (
    prior_with_product["product_name"]
    .value_counts()
    .head(15)
    .rename_axis("product_name")
    .reset_index(name="orders")
)
# Share of all prior line items accounted for by each top product
top_products["percentage"] = 100 * top_products["orders"] / len(prior_with_product)

# Plot
plt.figure(figsize=(11, 5))
bars = plt.barh(top_products["product_name"], top_products["orders"],
                color= sns.color_palette("dark"), edgecolor="black", alpha = 0.5)
plt.gca().invert_yaxis()   # show highest on top
plt.title("Top 15 Products by Order Frequency", fontsize=13, weight="bold")
plt.xlabel("Number of Orders")
plt.ylabel("Product Name")
plt.yticks(fontsize = 8)

# Annotate bars
for bar, val in zip(bars, top_products["orders"]):
    plt.text(val + 2000, bar.get_y() + bar.get_height()/2,
             f"{val:,}", va="center", fontsize=7)

plt.tight_layout()

# Save
fig_path = FIG_DIR / "EDA-07_top_products.png"
plt.savefig(fig_path, dpi=150)
plt.show()


**Interpretation.**  
The results confirm a strong dominance of everyday staples such as **Bananas**, **Organic Strawberries**, and **Organic Baby Spinach**.  
These fresh produce items lead Instacart’s overall product frequency, underscoring that users primarily use the platform for recurring grocery basics.  

This top-products list also guides feature engineering: for example, **product-level popularity scores** or **baseline reorder probabilities** can be derived from these counts.
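
A minimal sketch of how such product-level features might be derived is shown below, using made-up line items. The column names `times_ordered`, `reorder_rate`, and `popularity` are hypothetical; on the real data the input would be `order_prod_prior`.

```python
import pandas as pd

# Toy prior line items; values are made up for illustration
line_items = pd.DataFrame({
    "product_id": [100, 100, 100, 101, 101, 102],
    "reordered":  [0,   1,   1,   0,   1,   0],
})

# Per product: how often it was ordered, and how often that was a reorder
product_stats = (
    line_items.groupby("product_id")["reordered"]
    .agg(times_ordered="count", reorder_rate="mean")
    .reset_index()
)

# Popularity as a share of all line items
product_stats["popularity"] = product_stats["times_ordered"] / len(line_items)
print(product_stats)
```

For rarely ordered products the raw `reorder_rate` is noisy, so some form of smoothing toward the global rate may be worth considering before using it as a feature.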

### Reorder Patterns
#### Reorder Rate by Day of the Week
The reorder rate is calculated as the share of line items where `reordered = 1`.  
By examining it across the day of week, we can detect weekly rhythms in user loyalty and replenishment behavior.