## Exploratory Data Analysis
### Instacart Market Basket Analysis

This notebook performs exploratory data analysis (EDA) on the Instacart Online Grocery Shopping dataset.
The goals of the EDA are to:
- Understand the structure and summary statistics of the data.
- Visualize user behavior, product demand, and reorder patterns.
- Identify relationships and insights that will guide preprocessing and feature engineering.

### Loading the necessary libraries and data

In [3]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from pathlib import Path

RAW = Path("../data/raw")
PROCESSED = Path("../data/processed")

orders = pd.read_csv(RAW/"orders.csv")
order_prod_prior = pd.read_csv(RAW/"order_products__prior.csv")
order_prod_train = pd.read_csv(RAW/"order_products__train.csv")
products = pd.read_csv(RAW/"products.csv")
aisles = pd.read_csv(RAW/"aisles.csv")
departments = pd.read_csv(RAW/"departments.csv")

print("Orders:", orders.shape)
print("Order Products Prior:", order_prod_prior.shape)
print("Order Products Train:", order_prod_train.shape)
print("Products:", products.shape)
print("Aisles:", aisles.shape)
print("Departments:", departments.shape)


Orders: (3421083, 7)
Order Products Prior: (32434489, 4)
Order Products Train: (1384617, 4)
Products: (49688, 4)
Aisles: (134, 2)
Departments: (21, 2)


### Dataset Overview

- **orders.csv**: Contains order-level information such as order number, day of week, hour of day, and days since prior order.  
- **order_products__prior.csv**: Line-item details of products from prior orders (used for feature engineering and EDA).  
- **order_products__train.csv**: Line-item details of products from training orders (used for supervised learning).  
- **products.csv**: Product metadata including product name, aisle, and department IDs.  
- **aisles.csv**: Aisle names and IDs.  
- **departments.csv**: Department names and IDs.  

For EDA, we primarily use the rows of `orders` where `eval_set == "prior"`, joined with `order_products__prior` and the product metadata.
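The join described above could be sketched roughly as follows. This is a minimal illustration on tiny made-up frames that reuse the column names of the real files; the toy values and the names `prior_orders` and `prior_full` are assumptions for the example, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration).
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [10, 10, 11],
    "eval_set": ["prior", "train", "prior"],
})
order_prod_prior = pd.DataFrame({
    "order_id": [1, 1, 3],
    "product_id": [100, 101, 100],
    "reordered": [0, 0, 1],
})
products = pd.DataFrame({
    "product_id": [100, 101],
    "product_name": ["Banana", "Whole Milk"],
    "aisle_id": [24, 84],
    "department_id": [4, 16],
})

# Keep only prior orders, then attach line items and product metadata.
prior_orders = orders[orders["eval_set"] == "prior"]
prior_full = (
    order_prod_prior
    .merge(prior_orders, on="order_id", how="inner", validate="many_to_one")
    .merge(products, on="product_id", how="left", validate="many_to_one")
)
print(prior_full.shape)
```

Passing `validate="many_to_one"` makes pandas raise an error if `order_id` or `product_id` is unexpectedly duplicated on the right-hand side, which guards against silently multiplying rows in a table with tens of millions of line items.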


### Data Overview and Statistics

Before visualizing, we examine the structure and summary statistics of the datasets.  
This helps identify missing values, variable types, and potential issues (e.g., outliers, skewness).


In [16]:
# Orders dataset
display(orders.head())
orders.info()
orders[['order_hour_of_day', 'order_dow', 'days_since_prior_order']].describe().T



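| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |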


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int64  
 1   user_id                 int64  
 2   eval_set                object 
 3   order_number            int64  
 4   order_dow               int64  
 5   order_hour_of_day       int64  
 6   days_since_prior_order  float64
dtypes: float64(1), int64(5), object(1)
memory usage: 182.7+ MB


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| order_hour_of_day | 3421083.0 | 13.452015 | 4.226088 | 0.0 | 10.0 | 13.0 | 16.0 | 23.0 |
| order_dow | 3421083.0 | 2.776219 | 2.046829 | 0.0 | 1.0 | 3.0 | 5.0 | 6.0 |
| days_since_prior_order | 3214874.0 | 11.114836 | 9.206737 | 0.0 | 4.0 | 7.0 | 15.0 | 30.0 |


- `order_id` and `user_id` are identifiers.  
- `eval_set` takes the values `prior`, `train`, and `test`; only `prior` is used for EDA.  
- `order_number` gives the sequence of orders per user (1 is the first order, 2 the second, and so on).  
- `order_dow` (0–6) is the day of the week, Sunday through Saturday.  
- `order_hour_of_day` (0–23) is the hour of the day at which the order was placed.  
- `days_since_prior_order` is the number of days between the current order and the user's previous order; it is missing for each user's first order.  
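One way to confirm that the missing values in `days_since_prior_order` line up exactly with first orders is sketched below. The frame is a small made-up stand-in for `orders`, and the variable names are chosen for the example only.

```python
import numpy as np
import pandas as pd

# Toy orders table (made-up values) with the same columns as orders.csv.
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 15.0, 21.0, np.nan, 7.0],
})

is_first = orders["order_number"] == 1
is_missing = orders["days_since_prior_order"].isna()

# True if the gap is missing exactly when the order is a user's first.
missing_only_on_first = bool((is_first == is_missing).all())

# If that holds, the NaN count should equal the number of distinct users.
n_missing = int(is_missing.sum())
n_users = orders["user_id"].nunique()
print(missing_only_on_first, n_missing, n_users)
```

The `describe()` output above is consistent with this reading: 3,421,083 rows minus 3,214,874 non-null values leaves 206,209 missing entries.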

In [17]:
# Prior order products
display(order_prod_prior.head())
order_prod_prior.info()
order_prod_prior[['add_to_cart_order','reordered']].describe().T


| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434489 entries, 0 to 32434488
Data columns (total 4 columns):
 #   Column             Dtype
---  ------             -----
 0   order_id           int64
 1   product_id         int64
 2   add_to_cart_order  int64
 3   reordered          int64
dtypes: int64(4)
memory usage: 989.8 MB


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| add_to_cart_order | 32434489.0 | 8.351076 | 7.126671 | 1.0 | 3.0 | 6.0 | 11.0 | 145.0 |
| reordered | 32434489.0 | 0.589697 | 0.491889 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |

- Each row is one product within a prior order.  
- `add_to_cart_order` is the position at which the product was added to the cart.  
- `reordered` is binary: 1 if the same user has ordered the product before, else 0. Its mean of 0.59 means that about 59% of line items are reorders.  
- This is the core table for analyzing product-level reorder behavior.  
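Because `reordered` is a 0/1 flag, reorder rates fall out of a simple mean. The sketch below shows one possible way to compute an overall rate and a per-product rate; the line items are made up, and `product_stats`, `n_orders`, and `reorder_rate` are names introduced here for illustration.

```python
import pandas as pd

# Toy line items (made-up values) with columns of order_products__prior.csv.
order_prod_prior = pd.DataFrame({
    "order_id":   [1, 1, 2, 2, 3, 3],
    "product_id": [100, 101, 100, 102, 100, 101],
    "reordered":  [0, 0, 1, 0, 1, 1],
})

# Because `reordered` is 0/1, its mean is the share of reordered line items.
overall_rate = order_prod_prior["reordered"].mean()

# Per-product reorder rate alongside how often each product was bought.
product_stats = (
    order_prod_prior
    .groupby("product_id")["reordered"]
    .agg(n_orders="count", reorder_rate="mean")
    .sort_values("n_orders", ascending=False)
)
print(round(overall_rate, 3))
print(product_stats)
```

Keeping the purchase count next to the rate matters in practice: a product bought only a handful of times can show an extreme reorder rate that says little about real demand.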

In [10]:
# Products metadata
display(products.head())
print("Unique products:", products['product_id'].nunique())
print("Unique aisles:", aisles['aisle_id'].nunique())
print("Unique departments:", departments['department_id'].nunique())


| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |


Unique products: 49688
Unique aisles: 134
Unique departments: 21


- `products.csv` links products to aisles and departments.  
- There are 49,688 unique products, spread across 134 aisles and 21 departments.  
- This metadata will be useful for grouping and visualization.  
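Grouping by aisle or department first requires attaching the readable names to each product. A minimal sketch follows, assuming the name columns are called `aisle` and `department` (they are not shown in the outputs above) and using made-up rows; `product_info` and `products_per_dept` are illustrative names.

```python
import pandas as pd

# Toy metadata tables (made-up rows); `aisle` and `department` are the
# assumed name columns of aisles.csv and departments.csv.
products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Cookies", "Salt", "Pepper"],
    "aisle_id": [61, 104, 104],
    "department_id": [19, 13, 13],
})
aisles = pd.DataFrame({
    "aisle_id": [61, 104],
    "aisle": ["cookies cakes", "spices seasonings"],
})
departments = pd.DataFrame({
    "department_id": [13, 19],
    "department": ["pantry", "snacks"],
})

# Attach human-readable aisle and department names to each product.
product_info = (
    products
    .merge(aisles, on="aisle_id", how="left")
    .merge(departments, on="department_id", how="left")
)

# Number of distinct products per department, largest first.
products_per_dept = (
    product_info.groupby("department")["product_id"]
    .nunique()
    .sort_values(ascending=False)
)
print(products_per_dept)
```

A left join keeps every product even if an ID has no match in the lookup table, so any unmatched rows show up as missing names instead of disappearing.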

### Missing Values Check

In [14]:
# Missing values across datasets
missing_orders = orders.isnull().sum()
missing_prior = order_prod_prior.isnull().sum()
missing_products = products.isnull().sum()

print("Missing values in orders:")
print(missing_orders[missing_orders > 0])

print("\nMissing values in order_products__prior:")
if (missing_prior[missing_prior > 0]).any():
    print(missing_prior[missing_prior > 0])
else:
    print("No missing values")

print("\nMissing values in products:")
if (missing_products[missing_products > 0]).any():
    print(missing_products[missing_products > 0])
else:
    print("No missing values")


Missing values in orders:
days_since_prior_order    206209
dtype: int64

Missing values in order_products__prior:
No missing values

Missing values in products:
No missing values


- `days_since_prior_order` has 206,209 missing values (about 6% of orders), as expected for first orders.  
- `order_products__prior` and `products` have no missing values. `aisles` and `departments` are not checked in the cell above.  
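The same check can be applied to any number of tables in one pass. The helper below is a sketch under that assumption; `missing_summary` is a hypothetical function written for this example, and the frames are toy stand-ins for the loaded tables.

```python
import numpy as np
import pandas as pd

def missing_summary(tables):
    """Return one row per (table, column) that contains missing values."""
    rows = []
    for name, df in tables.items():
        counts = df.isnull().sum()
        for column, n_missing in counts[counts > 0].items():
            rows.append({
                "table": name,
                "column": column,
                "n_missing": int(n_missing),
                "pct_missing": round(100 * n_missing / len(df), 2),
            })
    return pd.DataFrame(
        rows, columns=["table", "column", "n_missing", "pct_missing"]
    )

# Toy tables (made-up values) standing in for the loaded DataFrames.
tables = {
    "orders": pd.DataFrame({
        "order_id": [1, 2, 3, 4],
        "days_since_prior_order": [np.nan, 7.0, np.nan, 3.0],
    }),
    "aisles": pd.DataFrame({"aisle_id": [1, 2], "aisle": ["a", "b"]}),
}
summary = missing_summary(tables)
print(summary)
```

With the real data, the dictionary would map each table name to its loaded DataFrame (`orders`, `order_prod_prior`, `order_prod_train`, `products`, `aisles`, `departments`), and an empty result would mean no table has missing values.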