# Pandas Student Notebook: Foundations Practice (5)
## Dataset: Kaggle “Instacart Market Basket Analysis” (simplified)

### Goal of this notebook
This notebook trains **basket-level reasoning and normalization**: understanding transactions, products, customers, and why naïve aggregation leads to wrong conclusions.

Key habits reinforced:
- Always identify the transaction grain
- Aggregate before comparing
- Normalize counts (rates, shares) instead of raw totals
- Validate assumptions with sanity checks

You will work with these tables:
- `orders.csv`
- `order_products__prior.csv` (as `order_products`)
- `products.csv`
- `aisles.csv`
- `departments.csv`


In [47]:
import kagglehub
import pandas as pd
import os

# Download latest version
path = kagglehub.dataset_download("psparks/instacart-market-basket-analysis")

orders = pd.read_csv(os.path.join(path, 'orders.csv'))
order_products = pd.read_csv(os.path.join(path, 'order_products__prior.csv'))
products = pd.read_csv(os.path.join(path, 'products.csv'))
aisles = pd.read_csv(os.path.join(path, 'aisles.csv'))
departments = pd.read_csv(os.path.join(path, 'departments.csv'))


## 0. Setup + grain awareness

Load the five CSV files listed above.

Display:
- `shape` and `info()` for each table

Write as a comment:
- What is the grain of each table?
- Which tables can cause row explosion if joined incorrectly?


In [48]:

tables = {
    "orders": orders,
    "order_products": order_products,
    "products": products,
    "aisles": aisles,
    "departments": departments,
}

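for name, df in tables.items():
    print(f"\n{name}: shape={df.shape}")
    # df.info() prints its summary itself and returns None,
    # which is why display() adds a "None" line after each summary
    display(df.info())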


# Grain:
# - orders: 1 row per order_id
# - order_products (order_products__prior.csv): 1 row per order_id × product_id (order line)
# - products: 1 row per product_id
# - aisles: 1 row per aisle_id
# - departments: 1 row per department_id
#
# Row explosion risk:
# - order_products: any join involving order_products must respect its grain (order × product).
#   E.g. joining orders (order level) to order_products (line level) multiplies rows:
#   one order row becomes one row per product in that order.




orders: shape=(3421083, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int64  
 1   user_id                 int64  
 2   eval_set                object 
 3   order_number            int64  
 4   order_dow               int64  
 5   order_hour_of_day       int64  
 6   days_since_prior_order  float64
dtypes: float64(1), int64(5), object(1)
memory usage: 182.7+ MB


None


order_products: shape=(32434489, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434489 entries, 0 to 32434488
Data columns (total 4 columns):
 #   Column             Dtype
---  ------             -----
 0   order_id           int64
 1   product_id         int64
 2   add_to_cart_order  int64
 3   reordered          int64
dtypes: int64(4)
memory usage: 989.8 MB


None


products: shape=(49688, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49688 entries, 0 to 49687
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   product_id     49688 non-null  int64 
 1   product_name   49688 non-null  object
 2   aisle_id       49688 non-null  int64 
 3   department_id  49688 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 1.5+ MB


None


aisles: shape=(134, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   aisle_id  134 non-null    int64 
 1   aisle     134 non-null    object
dtypes: int64(1), object(1)
memory usage: 2.2+ KB


None


departments: shape=(21, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   department_id  21 non-null     int64 
 1   department     21 non-null     object
dtypes: int64(1), object(1)
memory usage: 468.0+ bytes


None
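
The row-explosion risk described in the grain comments can be made concrete on a toy example. The sketch below uses two small hand-made frames (`toy_orders` and `toy_lines` are hypothetical stand-ins, not the real tables) to show how a merge from order grain to line grain changes the row count while the number of distinct orders stays the same.

```python
import pandas as pd

# Hypothetical toy data: 2 orders, 5 order lines
toy_orders = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 11]})
toy_lines = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2],
    "product_id": [100, 101, 102, 100, 103],
})

# Grain check: the key that defines one row should have no duplicates
assert toy_orders["order_id"].is_unique
assert not toy_lines.duplicated(["order_id", "product_id"]).any()

# Joining order-level rows to line-level rows: the result has line grain
joined = toy_orders.merge(toy_lines, on="order_id", how="left")

n_rows = len(joined)                     # one row per order line
n_orders = joined["order_id"].nunique()  # number of distinct orders
print(f"rows after join: {n_rows}, distinct orders: {n_orders}")
```

Counting rows of `joined` would report line counts; counting distinct `order_id` values recovers the order count.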

## 1. Order-level reconstruction

1. Create an order-level table with:
- `order_id`
- `user_id`
- `order_number`
- `n_products`, the number of products in the order (use 0 if the order has no products)

2. Verify that there is exactly one row per order (in other words: each `order_id` occurs only once in the new table).

Don't worry about managing suffixes for now.

Constraints:
- No loops

Write as a comment:
- Why counting rows after a join is dangerous here


In [49]:
order_level = (
    order_products
    .groupby("order_id", as_index=False)
    .agg(n_products=("product_id", "size"))
)

order_level = (
    orders[["order_id", "user_id", "order_number"]]
    .merge(order_level, on="order_id", how="left")
)

# Orders with no prior rows get NaN -> treat as 0 products for this simplified setup
order_level["n_products"] = order_level["n_products"].fillna(0).astype(int)


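# Exactly one row per order
# (computed outside the f-string: reusing double quotes inside one needs Python 3.12+)
n_dup = order_level["order_id"].duplicated().sum()
print(f"Amount of duplicate order_id's: {n_dup}")
# Stricter alternative: fail loudly if the grain is broken
assert order_level["order_id"].is_unique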

order_level.head()

# Why counting rows after a join is dangerous:
# Because joining order-level data to order_products__prior changes grain (1 row/order → many rows/order),
# so row counts reflect duplication, not “more orders”.


Amount of duplicate order_id's: 0


Unnamed: 0,order_id,user_id,order_number,n_products
0,2539329,1,1,5
1,2398795,1,2,6
2,473747,1,3,5
3,2254736,1,4,5
4,431534,1,5,8
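
The left merge plus `fillna(0)` pattern above can be made a little more defensive. The sketch below, on hypothetical toy frames, uses the `validate` and `indicator` options of `DataFrame.merge` to confirm the one-row-per-order assumption and to list which orders had no lines before they are filled with 0. The toy names and values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: order 3 has no lines
toy_orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 10, 11]})
toy_lines = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 101, 100]})

line_counts = (
    toy_lines
    .groupby("order_id", as_index=False)
    .agg(n_products=("product_id", "size"))
)

checked = toy_orders.merge(
    line_counts,
    on="order_id",
    how="left",
    validate="one_to_one",  # raises MergeError if either side repeats order_id
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

# Orders present in toy_orders but absent from the line counts
unmatched_ids = checked.loc[checked["_merge"] == "left_only", "order_id"].tolist()

checked["n_products"] = checked["n_products"].fillna(0).astype(int)
print(checked)
print("orders without lines:", unmatched_ids)
```

Inspecting the unmatched orders before filling them with 0 helps separate "empty basket" from "lines live in a file that was not loaded".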


## 2. Product popularity: raw vs normalized

1) Compute how many times each product appears in orders (raw count).  
2) Compute how many *distinct orders* each product appears in.  
3) Compare the two metrics for the top 10 products.

Write as a comment:
- When do raw counts exaggerate popularity?


In [50]:
prod_raw = (
    order_products
    .groupby("product_id", as_index=False)
    .agg(raw_rows=("order_id", "size"),
         distinct_orders=("order_id", "nunique"))
)

top10_raw = prod_raw.sort_values("raw_rows", ascending=False).head(10)
top10_distinct = prod_raw.sort_values("distinct_orders", ascending=False).head(10)

display(top10_raw)
display(top10_distinct)

# Compare side-by-side for the top 10 by raw_rows
compare_top10 = (
    top10_raw[["product_id", "raw_rows", "distinct_orders"]]
    .merge(products[["product_id", "product_name"]], on="product_id", how="left")
)
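compare_top10

# When do raw counts exaggerate popularity?
# When a product can occupy more than one row per order (duplicate lines, or rows multiplied by a join):
# raw rows then count lines, not orders. Here raw_rows equals distinct_orders for the top 10,
# which is consistent with order_products holding one row per order_id × product_id.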


Unnamed: 0,product_id,raw_rows,distinct_orders
24848,24852,472565,472565
13172,13176,379450,379450
21133,21137,264683,264683
21899,21903,241921,241921
47198,47209,213584,213584
47755,47766,176815,176815
47615,47626,152657,152657
16793,16797,142951,142951
26204,26209,140627,140627
27839,27845,137905,137905


Unnamed: 0,product_id,raw_rows,distinct_orders
24848,24852,472565,472565
13172,13176,379450,379450
21133,21137,264683,264683
21899,21903,241921,241921
47198,47209,213584,213584
47755,47766,176815,176815
47615,47626,152657,152657
16793,16797,142951,142951
26204,26209,140627,140627
27839,27845,137905,137905


Unnamed: 0,product_id,raw_rows,distinct_orders,product_name
0,24852,472565,472565,Banana
1,13176,379450,379450,Bag of Organic Bananas
2,21137,264683,264683,Organic Strawberries
3,21903,241921,241921,Organic Baby Spinach
4,47209,213584,213584,Organic Hass Avocado
5,47766,176815,176815,Organic Avocado
6,47626,152657,152657,Large Lemon
7,16797,142951,142951,Strawberries
8,26209,140627,140627,Limes
9,27845,137905,137905,Organic Whole Milk
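
In the output above the two metrics are identical for the top 10. To see a case where they diverge, the sketch below builds a hypothetical line table in which one order lists the same product twice, and also counts repeated `(order_id, product_id)` pairs as a direct grain check.

```python
import pandas as pd

# Hypothetical line table in which order 1 lists product 100 twice
toy_lines = pd.DataFrame({
    "order_id":   [1, 1, 2, 3, 3],
    "product_id": [100, 100, 100, 101, 102],
})

toy_counts = (
    toy_lines
    .groupby("product_id")
    .agg(raw_rows=("order_id", "size"),
         distinct_orders=("order_id", "nunique"))
)

# Number of repeated (order_id, product_id) pairs;
# when this is 0 the two metrics must agree for every product
n_dup_pairs = int(toy_lines.duplicated(["order_id", "product_id"]).sum())

print(toy_counts)
print("duplicate order x product pairs:", n_dup_pairs)
```

Running the same `duplicated` check on the real `order_products` table would be one way to confirm why the two columns match there.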


## 3. Reorder behavior per product

For each product:
- compute reorder rate = mean of `reordered`
- compute total number of orders containing the product

Filter to products with at least 100 orders.

Write as a comment:
- Why is filtering on minimum volume necessary?


In [51]:
prod_reorder = (
    order_products
    .groupby("product_id", as_index=False)
    .agg(
        reorder_rate=("reordered", "mean"),
        n_orders=("order_id", "nunique"),
    )
)

prod_reorder_100 = prod_reorder[prod_reorder["n_orders"] >= 100].copy()
prod_reorder_100.sort_values("reorder_rate", ascending=False).head(10)

# Why filter on minimum volume?
# Small-n products have highly unstable rates: a few reorders can create extreme values that are noise, not signal.

Unnamed: 0,product_id,reorder_rate,n_orders
27734,27740,0.920792,101
35598,35604,0.9,100
38243,38251,0.891892,111
10232,10236,0.875969,129
20594,20598,0.875,112
35490,35496,0.862528,451
9288,9292,0.861691,2921
45495,45504,0.860233,9108
43386,43394,0.85903,8477
5511,5514,0.857683,3970
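
One rough way to quantify the instability mentioned in the comment is the binomial standard error, sqrt(p(1-p)/n). This is an illustrative approximation added here, not something the notebook computes, and it assumes the rows behave like independent trials. The frame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical per-product summary: same observed rate, very different volumes
toy_rates = pd.DataFrame({
    "product_id":   [1, 2, 3],
    "reorder_rate": [0.9, 0.9, 0.9],
    "n_orders":     [10, 100, 10_000],
})

# Approximate standard error of a proportion (assumes independent rows)
p = toy_rates["reorder_rate"]
toy_rates["std_err"] = np.sqrt(p * (1 - p) / toy_rates["n_orders"])

# Minimum-volume filter with the threshold used in the exercise
reliable = toy_rates[toy_rates["n_orders"] >= 100]

print(toy_rates)
```

The same 0.9 rate is far less certain at 10 orders than at 10,000, which is the reason for filtering before ranking.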


## 4. Joining product metadata safely

Join:
- products → aisles → departments

Then attach aisle and department to order-product rows.

Write as a comment:
- Why left joins are appropriate here


In [52]:
# products → aisles → departments
prod_dim = (
    products
    .merge(aisles, on="aisle_id", how="left", validate="many_to_one")
    .merge(departments, on="department_id", how="left", validate="many_to_one")
)

# Attach to line-level table
order_products_enriched = (
    order_products
    .merge(prod_dim[["product_id", "aisle_id", "aisle", "department_id", "department"]],
           on="product_id", how="left", validate="many_to_one")
)


order_products_enriched.head()

# Why left joins?
# Because order_products defines the fact rows; we want to keep every fact row even if metadata is missing.

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,aisle_id,aisle,department_id,department
0,2,33120,1,1,86,eggs,16,dairy eggs
1,2,28985,2,1,83,fresh vegetables,4,produce
2,2,9327,3,0,104,spices seasonings,13,pantry
3,2,45918,4,1,19,oils vinegars,13,pantry
4,2,30035,5,0,17,baking ingredients,13,pantry
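
The `validate="many_to_one"` argument used above is what makes these joins safe against row explosion. The sketch below shows both behaviours on hypothetical toy frames: a left join keeping a product whose aisle is missing, and `validate` raising an error when the dimension key is duplicated.

```python
import pandas as pd

toy_products = pd.DataFrame({"product_id": [1, 2, 3], "aisle_id": [10, 10, 99]})

# A clean dimension table and a hypothetical broken one with aisle_id 10 repeated
aisles_ok = pd.DataFrame({"aisle_id": [10, 11], "aisle": ["eggs", "milk"]})
aisles_dup = pd.DataFrame({"aisle_id": [10, 10], "aisle": ["eggs", "eggs (dup)"]})

# Left join keeps all 3 products; aisle_id 99 has no match and gets NaN
safe = toy_products.merge(aisles_ok, on="aisle_id", how="left", validate="many_to_one")
n_missing = int(safe["aisle"].isna().sum())

# With a duplicated key on the right, validate stops the merge instead of
# silently duplicating product rows
try:
    toy_products.merge(aisles_dup, on="aisle_id", how="left", validate="many_to_one")
    caught = False
except pd.errors.MergeError:
    caught = True

print(len(safe), n_missing, caught)
```

Counting NaN values in the joined metadata columns afterwards is a cheap check that every fact row found its dimension row.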


## 5. Department-level demand (correctly)

Compute for each department:
- number of distinct orders
- total number of product rows
- average products per order (department-specific)

Write as a comment:
- Why “total rows” alone is misleading


In [53]:
order_products_enriched

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,aisle_id,aisle,department_id,department
0,2,33120,1,1,86,eggs,16,dairy eggs
1,2,28985,2,1,83,fresh vegetables,4,produce
2,2,9327,3,0,104,spices seasonings,13,pantry
3,2,45918,4,1,19,oils vinegars,13,pantry
4,2,30035,5,0,17,baking ingredients,13,pantry
...,...,...,...,...,...,...,...,...
32434484,3421083,39678,6,1,74,dish detergents,17,household
32434485,3421083,11352,7,0,78,crackers,19,snacks
32434486,3421083,4600,8,0,52,frozen breakfast,1,frozen
32434487,3421083,24852,9,1,24,fresh fruits,4,produce


In [54]:
dept_demand = (
    order_products_enriched
    .groupby("department", as_index=False)
    .agg(
        distinct_orders=("order_id", "nunique"),
        total_rows=("order_id", "size"),
    )
)

dept_demand["avg_products_per_order_in_dept"] = (
    dept_demand["total_rows"] / dept_demand["distinct_orders"]
)

dept_demand.sort_values("distinct_orders", ascending=False).head(10)

# Why is “total rows” alone misleading?
# Because it conflates “many orders” with “big baskets”: distinct_orders reflects breadth (how many orders
# include the department), total_rows reflects volume (how many product lines it accounts for).


Unnamed: 0,department,distinct_orders,total_rows,avg_products_per_order_in_dept
19,produce,2409320,9479291,3.934426
7,dairy eggs,2177338,5414016,2.48653
3,beverages,1457351,2690129,1.845903
20,snacks,1391447,2887550,2.075214
10,frozen,1181018,2236432,1.893648
16,pantry,1117892,1875577,1.67778
2,bakery,881556,1176787,1.334898
8,deli,770300,1051249,1.364727
6,canned goods,681305,1068058,1.567665
9,dry goods pasta,597862,866627,1.449544
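
To illustrate how breadth and volume can disagree, the sketch below uses a hypothetical enriched line table in which two departments have the same `total_rows` but reach a different share of orders. The `order_share` column is an extra normalization added for illustration; it is not part of the exercise.

```python
import pandas as pd

# Hypothetical enriched lines: 3 orders in total
toy_enriched = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2, 3],
    "department": ["produce", "produce", "snacks", "produce", "snacks", "snacks"],
})

n_orders_total = toy_enriched["order_id"].nunique()

toy_dept = (
    toy_enriched
    .groupby("department", as_index=False)
    .agg(distinct_orders=("order_id", "nunique"),
         total_rows=("order_id", "size"))
)

# Breadth: fraction of all orders that contain the department at least once
toy_dept["order_share"] = toy_dept["distinct_orders"] / n_orders_total
# Depth: products per order among orders that contain the department
toy_dept["avg_per_order"] = toy_dept["total_rows"] / toy_dept["distinct_orders"]

print(toy_dept)
```

Both toy departments have three rows, yet one appears in every order and the other in two of three with more items each time.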


## 6. User behavior normalization

For each user:
- total number of orders
- average number of products per order

Attach these back to the order-level table as 2 separate columns

Write as a comment:
- Why you cannot use 'agg' here.

In [55]:
# Order-level table already has n_products per order. Now add per-user features to each order row.

order_level["user_total_orders"] = order_level.groupby("user_id")["order_id"].transform("count")
order_level["user_avg_basket_size"] = order_level.groupby("user_id")["n_products"].transform("mean")

order_level[["user_id", "order_number", "n_products", "user_total_orders", "user_avg_basket_size"]].head()

# Why transform instead of agg?
# Because we need a per-row column aligned back to the original order-level rows (same length as order_level).

Unnamed: 0,user_id,order_number,n_products,user_total_orders,user_avg_basket_size
0,1,1,5,11,5.363636
1,1,2,6,11,5.363636
2,1,3,5,11,5.363636
3,1,4,5,11,5.363636
4,1,5,8,11,5.363636
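
The difference between `agg` and `transform` is easiest to see side by side. The sketch below, on a hypothetical toy order table, shows that `agg` collapses to one row per user and needs a merge to get back to order grain, while `transform` returns a value for every original row directly.

```python
import pandas as pd

toy_order_level = pd.DataFrame({
    "order_id":   [1, 2, 3, 4, 5],
    "user_id":    [10, 10, 10, 11, 11],
    "n_products": [5, 6, 4, 2, 4],
})

# agg collapses to one row per user ...
per_user = (
    toy_order_level
    .groupby("user_id", as_index=False)
    .agg(user_total_orders=("order_id", "count"),
         user_avg_basket_size=("n_products", "mean"))
)

# ... so it needs a merge to get back to order grain
via_agg = toy_order_level.merge(per_user, on="user_id", how="left", validate="many_to_one")

# transform returns one value per original row, already aligned on the index
via_transform = toy_order_level.groupby("user_id")["n_products"].transform("mean")

same = bool((via_agg["user_avg_basket_size"] == via_transform).all())
print(per_user)
print("agg + merge matches transform:", same)
```

Both routes give the same column; `transform` simply skips the intermediate table and the join.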


## 7. Basket concentration metric

For each order:
- compute share of the most common department in that basket

Steps (high level):
1) Count products per department per order
2) Compute department share within each order
3) For each order, keep only the highest department share; this represents how dominant the main department is within that basket.

Write as a comment:
- What does a high concentration value mean behaviorally?


In [56]:
# 1) Count products per department per order
dept_counts = (
    order_products_enriched
    .groupby(["order_id", "department"], as_index=False)
    .agg(n=("product_id", "size"))
)
dept_counts

Unnamed: 0,order_id,department,n
0,2,dairy eggs,1
1,2,pantry,5
2,2,produce,3
3,3,bakery,1
4,3,dairy eggs,3
...,...,...,...
15226193,3421083,babies,4
15226194,3421083,frozen,1
15226195,3421083,household,1
15226196,3421083,produce,1


In [57]:
# 2) Compute department share within each order
dept_counts["order_total"] = dept_counts.groupby("order_id")["n"].transform("sum")
dept_counts["dept_share"] = dept_counts["n"] / dept_counts["order_total"]
dept_counts

Unnamed: 0,order_id,department,n,order_total,dept_share
0,2,dairy eggs,1,9,0.111111
1,2,pantry,5,9,0.555556
2,2,produce,3,9,0.333333
3,3,bakery,1,8,0.125000
4,3,dairy eggs,3,8,0.375000
...,...,...,...,...,...
15226193,3421083,babies,4,10,0.400000
15226194,3421083,frozen,1,10,0.100000
15226195,3421083,household,1,10,0.100000
15226196,3421083,produce,1,10,0.100000


In [58]:
# 3) Keep the maximum share per order
basket_conc = (
    dept_counts
    .groupby("order_id", as_index=False)
    .agg(basket_concentration=("dept_share", "max"))
)
basket_conc

Unnamed: 0,order_id,basket_concentration
0,2,0.555556
1,3,0.375000
2,4,0.307692
3,5,0.269231
4,6,0.666667
...,...,...
3214869,3421079,1.000000
3214870,3421080,0.555556
3214871,3421081,0.142857
3214872,3421082,0.285714


In [59]:
# Attach to order_level
order_level = order_level.merge(basket_conc, on="order_id", how="left")

# Orders with 0 products have no department share; set concentration to 0
# (NaN would also be defensible, but section 9 requires no missing values)
order_level["basket_concentration"] = order_level["basket_concentration"].fillna(0.0)

order_level[["order_id", "n_products", "basket_concentration"]].head()
order_level

Unnamed: 0,order_id,user_id,order_number,n_products,user_total_orders,user_avg_basket_size,basket_concentration
0,2539329,1,1,5,11,5.363636,0.400000
1,2398795,1,2,6,11,5.363636,0.500000
2,473747,1,3,5,11,5.363636,0.400000
3,2254736,1,4,5,11,5.363636,0.400000
4,431534,1,5,8,11,5.363636,0.500000
...,...,...,...,...,...,...,...
3421078,2266710,206209,10,9,14,9.214286,0.111111
3421079,1854736,206209,11,8,14,9.214286,0.250000
3421080,626363,206209,12,20,14,9.214286,0.250000
3421081,2977660,206209,13,9,14,9.214286,0.444444
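
The three steps above keep only the maximum share. A small variation, sketched below on hypothetical toy data, also keeps the name of the dominant department by using `idxmax` instead of `max`. The column name `top_department` is an illustrative choice, not something the exercise asks for.

```python
import pandas as pd

# Hypothetical enriched lines for two orders
toy_lines = pd.DataFrame({
    "order_id":   [1, 1, 1, 1, 2, 2],
    "product_id": [100, 101, 102, 200, 300, 400],
    "department": ["produce", "produce", "produce", "snacks", "pantry", "frozen"],
})

# 1) products per department per order
toy_counts = (
    toy_lines
    .groupby(["order_id", "department"], as_index=False)
    .agg(n=("product_id", "size"))
)

# 2) department share within each order
toy_counts["dept_share"] = (
    toy_counts["n"] / toy_counts.groupby("order_id")["n"].transform("sum")
)

# 3) row label of the largest share per order (ties are resolved by row order)
top_idx = toy_counts.groupby("order_id")["dept_share"].idxmax()
toy_conc = (
    toy_counts.loc[top_idx, ["order_id", "department", "dept_share"]]
    .rename(columns={"department": "top_department",
                     "dept_share": "basket_concentration"})
    .reset_index(drop=True)
)
print(toy_conc)
```

A value near 1 means the basket is almost entirely one department; a low value means the order is spread across many departments.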


## 8. Data quality check: impossible combinations

Flag orders as `suspicious_order` if:
- order has zero products
- OR reorder rate is 1 for a user’s first order

Show:
- count of suspicious orders
- sample rows



In [60]:
# Reorder rate per order (mean of reordered across its lines)
order_reorder_rate = (
    order_products
    .groupby("order_id", as_index=False)
    .agg(order_reorder_rate=("reordered", "mean"))
)
order_reorder_rate

Unnamed: 0,order_id,order_reorder_rate
0,2,0.666667
1,3,1.000000
2,4,0.923077
3,5,0.807692
4,6,0.000000
...,...,...
3214869,3421079,0.000000
3214870,3421080,0.444444
3214871,3421081,0.000000
3214872,3421082,0.571429


In [61]:
order_level = order_level.merge(order_reorder_rate, on="order_id", how="left")
order_level["order_reorder_rate"] = order_level["order_reorder_rate"].fillna(0.0)
order_level

Unnamed: 0,order_id,user_id,order_number,n_products,user_total_orders,user_avg_basket_size,basket_concentration,order_reorder_rate
0,2539329,1,1,5,11,5.363636,0.400000,0.000000
1,2398795,1,2,6,11,5.363636,0.500000,0.500000
2,473747,1,3,5,11,5.363636,0.400000,0.600000
3,2254736,1,4,5,11,5.363636,0.400000,1.000000
4,431534,1,5,8,11,5.363636,0.500000,0.625000
...,...,...,...,...,...,...,...,...
3421078,2266710,206209,10,9,14,9.214286,0.111111,0.333333
3421079,1854736,206209,11,8,14,9.214286,0.250000,0.750000
3421080,626363,206209,12,20,14,9.214286,0.250000,0.700000
3421081,2977660,206209,13,9,14,9.214286,0.444444,0.444444


In [62]:
order_level["suspicious_order"] = (
    (order_level["n_products"] == 0)
    | ((order_level["order_number"] == 1) & (order_level["order_reorder_rate"] == 1))
)

print("suspicious count:", order_level["suspicious_order"].sum())

suspicious_cols = ["order_id", "user_id", "order_number", "n_products", "order_reorder_rate"]
order_level.loc[order_level["suspicious_order"], suspicious_cols].head(10)


suspicious count: 206209


Unnamed: 0,order_id,user_id,order_number,n_products,order_reorder_rate
10,1187899,1,11,0,0.0
25,1492625,2,15,0,0.0
38,2774568,3,13,0,0.0
44,329954,4,6,0,0.0
49,2196797,5,5,0,0.0
53,1528013,6,4,0,0.0
74,525192,7,21,0,0.0
78,880375,8,4,0,0.0
82,1094988,9,4,0,0.0
88,1822501,10,6,0,0.0
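
Only `order_products__prior.csv` is loaded in this notebook, and all ten sample rows above are flagged because `n_products` is 0. One follow-up check worth considering (a suggestion, not part of the exercise) is to break the flag down by the `eval_set` column of `orders`, to see whether the zero-product orders are concentrated in particular sets. The sketch below shows the mechanics on hypothetical toy data; the `eval_set` values are illustrative.

```python
import pandas as pd

# Hypothetical toy data: orders 2 and 4 have no lines in the loaded file
toy_orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "user_id":  [10, 10, 11, 11],
    "eval_set": ["prior", "train", "prior", "test"],
})
toy_flags = pd.DataFrame({
    "order_id":   [1, 2, 3, 4],
    "n_products": [3, 0, 5, 0],
})
toy_flags["suspicious_order"] = toy_flags["n_products"] == 0

# Flagged orders ("sum") and all orders ("count") per eval_set
breakdown = (
    toy_flags
    .merge(toy_orders[["order_id", "eval_set"]],
           on="order_id", how="left", validate="one_to_one")
    .groupby("eval_set")["suspicious_order"]
    .agg(["sum", "count"])
)
print(breakdown)
```

If every flagged order falls outside the set that was loaded, the flag reflects missing input rather than a genuine data quality problem.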


## 9. Capstone: analytical feature table

Create `analysis_df` with one row per order containing:
- `order_id`
- `user_id`
- `order_number`
- `n_products`
- user average basket size
- rolling 3-order average basket size
- basket concentration
- suspicious_order (as int)

Requirements:
- No missing values in engineered features
- Show `head()` and `isna().sum()`

Write as a comment:
- Which feature would you be most careful with when modeling, and why?


In [63]:
analysis_df = order_level[[
    "order_id",
    "user_id",
    "order_number",
    "n_products",
    "user_avg_basket_size",
    "basket_concentration",
    "suspicious_order",
]].copy()

# enforce dtypes / no missings
analysis_df["suspicious_order"] = analysis_df["suspicious_order"].astype(int)

analysis_df.head()



Unnamed: 0,order_id,user_id,order_number,n_products,user_avg_basket_size,basket_concentration,suspicious_order
0,2539329,1,1,5,5.363636,0.4,0
1,2398795,1,2,6,5.363636,0.5,0
2,473747,1,3,5,5.363636,0.4,0
3,2254736,1,4,5,5.363636,0.4,0
4,431534,1,5,8,5.363636,0.5,0


In [64]:
analysis_df.isna().sum()

order_id                0
user_id                 0
order_number            0
n_products              0
user_avg_basket_size    0
basket_concentration    0
suspicious_order        0
dtype: int64
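
The feature list for `analysis_df` also asks for a rolling 3-order average basket size, which the selection above does not include. A minimal sketch of one way to compute it is shown below on hypothetical toy data. It assumes the window covers the current order and up to two previous orders of the same user, and that `min_periods=1` is acceptable for a user's first orders; the column name `rolling3_avg_basket_size` is an illustrative choice.

```python
import pandas as pd

# Hypothetical toy order-level table
toy_order_level = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 4, 1, 2],
    "n_products":   [5, 6, 5, 8, 2, 4],
})

# Rolling windows depend on row order, so sort by user and order sequence first
toy_order_level = (
    toy_order_level
    .sort_values(["user_id", "order_number"])
    .reset_index(drop=True)
)

# Mean over the current and up to two previous orders of the same user;
# min_periods=1 avoids NaN for a user's first two orders
toy_order_level["rolling3_avg_basket_size"] = (
    toy_order_level
    .groupby("user_id")["n_products"]
    .transform(lambda s: s.rolling(window=3, min_periods=1).mean())
)
print(toy_order_level)
```

Because this window includes the current order, the feature would leak the target if `n_products` itself were being predicted; shifting by one order within each user before rolling is a common alternative, at the cost of a missing value on each user's first order.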