# Manipulating Data (Advanced Practice)

This notebook contains **moderately advanced** problems on manipulating data with **NumPy and pandas**, each with a **complete solution**.

**Best practices used:**
- Reproducible data (`np.random.default_rng`)
- Clear separation of **Exercise** vs **Solution**
- Solutions include **sanity checks** (`assert`) where helpful
- No external files required
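
As a brief illustration of the reproducibility point, the sketch below shows that two generators created from the same seed yield the same draws. The variable names are illustrative and not part of the notebook; the seed `42` and the `integers(1, 61, ...)` call mirror what the notebook uses later.

```python
import numpy as np

# Two generators built from the same seed produce the same stream of draws.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
draws_a = rng_a.integers(1, 61, size=5)
draws_b = rng_b.integers(1, 61, size=5)
same_seed_matches = bool(np.array_equal(draws_a, draws_b))

# The generator is stateful: a second call continues the stream instead of
# restarting it, so re-running cells out of order changes the data downstream.
draws_a_next = rng_a.integers(1, 61, size=5)
```

Because a single `rng` is shared by every cell, the outputs shown in this notebook are only reproducible when the cells run top to bottom from a fresh kernel.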


In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.width', 120)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 30)

rng = np.random.default_rng(42)

## Dataset

We'll work with a small simulated dataset of 400 orders, placed by up to 60 customers over the first 180 days of 2025.

In [2]:
n = 400

customers = [f"C{c:03d}" for c in rng.integers(1, 61, size=n)]
regions = rng.choice(["North", "South", "East", "West"], size=n, p=[0.25, 0.25, 0.25, 0.25])
products = rng.choice(["Widget", "Gizmo", "Doodad", "Thingamajig"], size=n, p=[0.35, 0.25, 0.25, 0.15])
channels = rng.choice(["Online", "Retail"], size=n, p=[0.65, 0.35])

order_dates = pd.to_datetime("2025-01-01") + pd.to_timedelta(rng.integers(0, 180, size=n), unit="D")

qty = rng.integers(1, 8, size=n)
unit_price = rng.normal(loc=30, scale=10, size=n).clip(5, None).round(2)
discount = rng.choice([0.0, 0.05, 0.10, 0.15, 0.20], size=n, p=[0.40, 0.20, 0.20, 0.15, 0.05])

df = pd.DataFrame({
    "order_id": [f"O{i:05d}" for i in range(1, n + 1)],
    "order_date": order_dates,
    "customer_id": customers,
    "region": regions,
    "channel": channels,
    "product": products,
    "qty": qty,
    "unit_price": unit_price,
    "discount": discount,
})

df["gross"] = (df["qty"] * df["unit_price"]).round(2)
df["net"] = (df["gross"] * (1 - df["discount"])).round(2)

df.head()

|   | order_id | order_date | customer_id | region | channel | product | qty | unit_price | discount | gross | net |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | O00001 | 2025-05-23 | C006 | West | Online | Gizmo | 7 | 52.68 | 0.05 | 368.76 | 350.32 |
| 1 | O00002 | 2025-03-22 | C047 | West | Online | Doodad | 7 | 19.69 | 0.1 | 137.83 | 124.05 |
| 2 | O00003 | 2025-06-28 | C040 | East | Online | Thingamajig | 6 | 30.7 | 0.05 | 184.2 | 174.99 |
| 3 | O00004 | 2025-01-12 | C027 | North | Online | Doodad | 1 | 18.24 | 0.0 | 18.24 | 18.24 |
| 4 | O00005 | 2025-01-31 | C026 | North | Online | Thingamajig | 3 | 22.22 | 0.0 | 66.66 | 66.66 |


## Exercise 1 — Multi-metric aggregation + ranking

Create a table of **product performance by region** with these columns:
- `orders`: number of orders
- `units`: total quantity
- `revenue`: total net revenue
- `aov`: average order value (net per order)

Then, within each region, compute the product **rank** by `revenue` (highest revenue = rank 1). Return the top **2 products per region**.

**Requirements:**
- Use `groupby` + named aggregations
- Use `rank` or `nlargest` in a way that preserves ties sensibly
- Final output should include: `region`, `product`, `orders`, `units`, `revenue`, `aov`, `rev_rank`


In [3]:
# TODO
# 1) Aggregate by ['region', 'product']
# 2) Compute rank per region by revenue (descending)
# 3) Keep only top 2 per region
#
# top2 = ...
# top2

In [4]:
# SOLUTION

perf = (
    df.groupby(["region", "product"], as_index=False)
      .agg(
          orders=("order_id", "size"),
          units=("qty", "sum"),
          revenue=("net", "sum"),
      )
)
perf["revenue"] = perf["revenue"].round(2)
perf["aov"] = (perf["revenue"] / perf["orders"]).round(2)

# Rank within each region by revenue (descending).
# method='dense' gives 1,2,2,3... if ties happen.
perf["rev_rank"] = perf.groupby("region")["revenue"].rank(ascending=False, method="dense").astype(int)

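# head(2) keeps exactly two rows per region, so a tie at rank 2 is cut arbitrarily;
# filter on perf["rev_rank"] <= 2 instead to keep every tied product.
top2 = (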
    perf.sort_values(["region", "rev_rank", "revenue"], ascending=[True, True, False])
        .groupby("region", as_index=False)
        .head(2)
        .reset_index(drop=True)
)

# Sanity checks
assert set(["region", "product", "orders", "units", "revenue", "aov", "rev_rank"]).issubset(top2.columns)
assert top2.groupby("region").size().max() <= 2

top2

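|   | region | product | orders | units | revenue | aov | rev_rank |
|---|---|---|---|---|---|---|---|
| 0 | East | Widget | 40 | 155 | 4255.21 | 106.38 | 1 |
| 1 | East | Thingamajig | 19 | 77 | 2500.55 | 131.61 | 2 |
| 2 | North | Widget | 41 | 166 | 4220.04 | 102.93 | 1 |
| 3 | North | Gizmo | 20 | 83 | 2505.54 | 125.28 | 2 |
| 4 | South | Widget | 26 | 111 | 2631.05 | 101.19 | 1 |
| 5 | South | Doodad | 26 | 97 | 2606.23 | 100.24 | 2 |
| 6 | West | Widget | 42 | 186 | 5320.39 | 126.68 | 1 |
| 7 | West | Gizmo | 23 | 121 | 3399.68 | 147.81 | 2 |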
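
To make the tie-handling requirement concrete, here is a minimal sketch on hypothetical toy data (the `toy` frame and its values are invented for illustration). It contrasts truncating with `head(2)` against filtering on the dense rank when two products tie for second place.

```python
import pandas as pd

# Hypothetical toy data: products B and C tie for second place in region "X".
toy = pd.DataFrame({
    "region": ["X", "X", "X", "X"],
    "product": ["A", "B", "C", "D"],
    "revenue": [100.0, 80.0, 80.0, 50.0],
})
toy["rev_rank"] = (
    toy.groupby("region")["revenue"]
       .rank(ascending=False, method="dense")
       .astype(int)
)

# head(2) keeps exactly two rows per region, so one of the tied products is dropped.
by_head = (
    toy.sort_values(["region", "rev_rank"])
       .groupby("region")
       .head(2)
)

# Filtering on the dense rank keeps every product tied at rank 1 or 2.
by_rank = toy[toy["rev_rank"] <= 2]
```

Which behavior is "sensible" depends on the report: `head(2)` guarantees a fixed row count, while the rank filter guarantees that no tied product is silently omitted.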


## Exercise 2 — Vectorized feature engineering (no loops)

Create 3 new columns using **fully vectorized** operations:
1. `discount_band`:
   - `"none"` for discount = 0
   - `"low"` for (0, 0.10]
   - `"mid"` for (0.10, 0.15]
   - `"high"` for > 0.15
2. `is_weekend`: True if `order_date` is Saturday or Sunday
3. `net_per_unit`: net / qty

Then compute **net revenue share** by `discount_band` (a Series that sums to ~1.0).

**Requirements:**
- Use `np.select` or `pd.cut` (either is fine)
- No Python loops


In [5]:
# TODO
# df2 = df.copy()
# ...
# share = ...
# share

In [6]:
# SOLUTION

df2 = df.copy()

conditions = [
    df2["discount"].eq(0),
    df2["discount"].gt(0) & df2["discount"].le(0.10),
    df2["discount"].gt(0.10) & df2["discount"].le(0.15),
    df2["discount"].gt(0.15),
]
choices = ["none", "low", "mid", "high"]
df2["discount_band"] = np.select(conditions, choices, default="unknown")

df2["is_weekend"] = df2["order_date"].dt.dayofweek.ge(5)
df2["net_per_unit"] = (df2["net"] / df2["qty"]).round(2)

share = (df2.groupby("discount_band")["net"].sum() / df2["net"].sum()).sort_values(ascending=False)

# Sanity checks
assert df2["discount_band"].isin(["none", "low", "mid", "high"]).all()
# Shares must sum to 1 (up to floating-point error)
assert abs(float(share.sum()) - 1.0) < 1e-9

share

discount_band
low     0.424434
none    0.402861
mid     0.139894
high    0.032812
Name: net, dtype: float64
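
The requirements also allow `pd.cut`. A minimal sketch of that alternative follows, using the five discount levels from the dataset. It assumes discounts are never negative, since the first bin `(-inf, 0]` would also label negative values as `"none"`.

```python
import numpy as np
import pandas as pd

discount = pd.Series([0.0, 0.05, 0.10, 0.15, 0.20])

# Right-closed bins: (-inf, 0], (0, 0.10], (0.10, 0.15], (0.15, inf)
band = pd.cut(
    discount,
    bins=[-np.inf, 0.0, 0.10, 0.15, np.inf],
    labels=["none", "low", "mid", "high"],
    right=True,
)
```

Unlike `np.select`, `pd.cut` returns an ordered categorical, so a later `groupby` or sort follows the band order (`none`, `low`, `mid`, `high`) rather than alphabetical order.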

## Exercise 3 — Reshaping: pivot + tidy output

Create a **monthly** revenue pivot table with:
- index: `month` (YYYY-MM)
- columns: `channel`
- values: total `net`

Then convert it back to **tidy long format** with columns: `month`, `channel`, `revenue`.

**Requirements:**
- Use `dt.to_period('M')` or equivalent
- Use `pivot_table` (or `groupby().unstack()`) and then `stack()`/`melt()`


In [7]:
# TODO
# pivot = ...
# tidy = ...
# tidy.head()

In [8]:
# SOLUTION

tmp = df.copy()
tmp["month"] = tmp["order_date"].dt.to_period("M").astype(str)

pivot = (
    tmp.pivot_table(
        index="month",
        columns="channel",
        values="net",
        aggfunc="sum",
        fill_value=0.0,
    )
    .round(2)
    .sort_index()
)

tidy = (
    pivot.reset_index()
         .melt(id_vars="month", var_name="channel", value_name="revenue")
         .sort_values(["month", "channel"], ignore_index=True)
)

# Sanity checks
assert set(tidy.columns) == {"month", "channel", "revenue"}
assert tidy["revenue"].ge(0).all()

tidy.head(10)

|   | month | channel | revenue |
|---|---|---|---|
| 0 | 2025-01 | Online | 3540.31 |
| 1 | 2025-01 | Retail | 2165.13 |
| 2 | 2025-02 | Online | 6269.48 |
| 3 | 2025-02 | Retail | 2748.1 |
| 4 | 2025-03 | Online | 5468.41 |
| 5 | 2025-03 | Retail | 2973.59 |
| 6 | 2025-04 | Online | 3838.8 |
| 7 | 2025-04 | Retail | 3091.56 |
| 8 | 2025-05 | Online | 5661.93 |
| 9 | 2025-05 | Retail | 2026.75 |
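
The requirements mention `groupby().unstack()` and `stack()` as an equivalent route. The sketch below shows that variant on hypothetical toy orders; the column names mirror the notebook's `df`, but the values are invented.

```python
import pandas as pd

# Hypothetical toy orders; column names mirror the notebook's df.
toy = pd.DataFrame({
    "order_date": pd.to_datetime(["2025-01-05", "2025-01-20", "2025-02-03", "2025-02-10"]),
    "channel": ["Online", "Retail", "Online", "Online"],
    "net": [10.0, 20.0, 30.0, 5.0],
})
toy["month"] = toy["order_date"].dt.to_period("M").astype(str)

# Wide: one row per month, one column per channel; missing combinations become 0.
wide = toy.groupby(["month", "channel"])["net"].sum().unstack(fill_value=0.0)

# Long again: stack() moves the channel columns back into the index.
tidy = wide.stack().rename("revenue").reset_index()
```

Filling missing combinations with `0.0` before stacking means the long table keeps a row for every month and channel pair, including pairs with no orders.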


## Exercise 4 — Coercing messy numeric data + nullable integers

Create a new Series `messy_qty` derived from `qty`, but inject some messy values:
- Replace ~5% with the string `"missing"`
- Replace ~5% with numeric strings like `"7"`
- Replace ~3% with `None`

Tasks:
1. Coerce to numeric using `pd.to_numeric(..., errors='coerce')`
2. Convert to a **nullable integer dtype** (so missing values are allowed) using pandas `Int64`
3. Report how many values became missing (`NA`) after coercion

**Requirements:**
- Do not drop rows; preserve length
- Use `Int64` (nullable) not `int`


In [9]:
# TODO
# messy_qty = ...
# coerced = ...
# coerced_int = ...
# na_count = ...
# coerced_int.dtype, na_count

In [10]:
# SOLUTION

messy_qty = df["qty"].astype(object).copy()

idx = np.arange(len(messy_qty))
rng.shuffle(idx)

n_missing_str = int(0.05 * len(idx))
n_numeric_str = int(0.05 * len(idx))
n_none = int(0.03 * len(idx))

missing_str_idx = idx[:n_missing_str]
numeric_str_idx = idx[n_missing_str:n_missing_str + n_numeric_str]
none_idx = idx[n_missing_str + n_numeric_str:n_missing_str + n_numeric_str + n_none]

messy_qty.iloc[missing_str_idx] = "missing"
messy_qty.iloc[numeric_str_idx] = df["qty"].iloc[numeric_str_idx].astype(str).values
messy_qty.iloc[none_idx] = None

coerced = pd.to_numeric(messy_qty, errors="coerce")
coerced_int = coerced.astype("Int64")
na_count = int(coerced_int.isna().sum())

# Sanity checks
assert len(coerced_int) == len(df)
assert str(coerced_int.dtype) == "Int64"

coerced_int.dtype, na_count

(Int64Dtype(), 32)
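
The reported count mixes two kinds of missing values: entries that were already `None` and strings that failed to parse. A small sketch, on a hypothetical five-element Series, of how the two could be separated:

```python
import pandas as pd

# Hypothetical toy input: one numeric string, one unparseable string, one None.
raw = pd.Series([3, "7", "missing", None, 5], dtype=object)

coerced = pd.to_numeric(raw, errors="coerce")  # float64, NaN where parsing failed
as_int = coerced.astype("Int64")               # nullable integers, NaN becomes <NA>

already_missing = int(raw.isna().sum())                      # None in the input
failed_to_parse = int((coerced.isna() & raw.notna()).sum())  # e.g. "missing"
```

Applied to the notebook's `messy_qty`, the same split should by construction come out as 12 (`None`) and 20 (`"missing"`), which together give the 32 reported above.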

## Exercise 5 — Concatenation vs merge: aligning on keys

Create a `customers_df` with one row per customer:
- `customer_id`
- `signup_date` (random date in 2024)
- `segment` (A/B/C)

Then:
1. Produce a customer-level summary from `df` with: total `orders`, total `revenue`.
2. Join the summary to `customers_df` using a **key-based merge**.
3. Contrast this with `concat`-by-index: set index appropriately, concat, and show that it matches the merge result.

**Requirements:**
- Use `merge(..., how='left')`
- Demonstrate the **index alignment** behavior of `concat`
- Final output should have one row per customer in `customers_df`


In [11]:
# TODO
# customers_df = ...
# cust_summary = ...
# merged = ...
# via_concat = ...
# merged.head()

In [12]:
# SOLUTION

unique_customers = pd.Index(sorted(df["customer_id"].unique()))

signup_dates = pd.to_datetime("2024-01-01") + pd.to_timedelta(rng.integers(0, 366, size=len(unique_customers)), unit="D")
segments = rng.choice(["A", "B", "C"], size=len(unique_customers), p=[0.4, 0.4, 0.2])

customers_df = pd.DataFrame({
    "customer_id": unique_customers,
    "signup_date": signup_dates,
    "segment": segments,
}).sort_values("customer_id", ignore_index=True)

cust_summary = (
    df.groupby("customer_id", as_index=False)
      .agg(
          orders=("order_id", "size"),
          revenue=("net", "sum"),
      )
)
cust_summary["revenue"] = cust_summary["revenue"].round(2)

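# Every customer in customers_df was drawn from df, so each row finds a match here;
# the fillna below is a defensive default for customers without orders.
merged = customers_df.merge(cust_summary, on="customer_id", how="left")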
merged[["orders", "revenue"]] = merged[["orders", "revenue"]].fillna({"orders": 0, "revenue": 0.0})
merged["orders"] = merged["orders"].astype(int)

# Now do the same thing using concat with index alignment.
customers_ix = customers_df.set_index("customer_id")
summary_ix = cust_summary.set_index("customer_id")

via_concat = pd.concat([customers_ix, summary_ix], axis=1)
via_concat[["orders", "revenue"]] = via_concat[["orders", "revenue"]].fillna({"orders": 0, "revenue": 0.0})
via_concat["orders"] = via_concat["orders"].astype(int)
via_concat = via_concat.reset_index()

# Sanity check: the two approaches should match on key columns.
check_cols = ["customer_id", "signup_date", "segment", "orders", "revenue"]
m_sorted = merged.sort_values("customer_id", ignore_index=True)[check_cols]
c_sorted = via_concat.sort_values("customer_id", ignore_index=True)[check_cols]
pd.testing.assert_frame_equal(m_sorted, c_sorted)

merged.head()

|   | customer_id | signup_date | segment | orders | revenue |
|---|---|---|---|---|---|
| 0 | C001 | 2024-02-24 | C | 3 | 373.85 |
| 1 | C002 | 2024-06-05 | B | 7 | 685.8 |
| 2 | C003 | 2024-12-30 | B | 7 | 552.06 |
| 3 | C004 | 2024-06-25 | A | 3 | 415.89 |
| 4 | C005 | 2024-05-18 | A | 6 | 779.41 |
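
In the solution every customer has orders and both tables are sorted, so the difference between key-based and index-based joining is easy to miss. The sketch below uses hypothetical toy tables where one customer has no orders and the summary is in a different row order. The `indicator` and `validate` arguments are optional additions that the solution does not use.

```python
import pandas as pd

# Hypothetical toy tables: C003 has no orders, and summary is in a different row order.
customers = pd.DataFrame({"customer_id": ["C001", "C002", "C003"], "segment": ["A", "B", "C"]})
summary = pd.DataFrame({"customer_id": ["C002", "C001"], "orders": [7, 3]})

# Key-based merge: rows are matched on customer_id regardless of position.
# validate raises if either side has duplicate keys; indicator adds a _merge column.
merged = customers.merge(
    summary, on="customer_id", how="left", validate="one_to_one", indicator=True
)

# concat(axis=1) aligns on the index, so the key must be the index on both sides.
aligned = pd.concat(
    [customers.set_index("customer_id"), summary.set_index("customer_id")],
    axis=1,
).sort_index()

# Without the key as index, concat pairs rows by position and silently mixes customers.
misaligned = pd.concat([customers, summary.add_suffix("_right")], axis=1)
```

In `misaligned`, row 0 pairs `C001` with the orders of `C002`, which is the failure mode that setting the index (or using `merge`) avoids.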


## Exercise 6 — Group-wise time features: rolling 7-day revenue per region

Compute **daily net revenue** per region, then add a **7-day rolling sum** per region.

Tasks:
1. Aggregate to daily revenue per region: columns `region`, `date`, `daily_revenue`
2. Within each region, compute `roll7_revenue`: rolling 7-day sum over `daily_revenue`

**Requirements:**
- Sort correctly before rolling
- Use `groupby(...).rolling(...)` or `transform` patterns
- Use `min_periods=1` so the first days still get a value


In [13]:
# TODO
# daily = ...
# daily.head(10)

In [14]:
# SOLUTION

daily = (
    df.assign(date=df["order_date"].dt.normalize())
      .groupby(["region", "date"], as_index=False)
      .agg(daily_revenue=("net", "sum"))
)
daily["daily_revenue"] = daily["daily_revenue"].round(2)

daily = daily.sort_values(["region", "date"], ignore_index=True)

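# Rolling sum per region. window=7 counts the last 7 rows (days with orders),
# not 7 calendar days; a calendar window needs a time-based rolling("7D").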
daily["roll7_revenue"] = (
    daily.groupby("region", group_keys=False)["daily_revenue"]
         .apply(lambda s: s.rolling(window=7, min_periods=1).sum())
         .round(2)
)

# Sanity checks
assert daily.groupby("region")["date"].is_monotonic_increasing.all()
assert daily["roll7_revenue"].ge(daily["daily_revenue"] - 1e-9).all()  # rolling sum should be >= daily

daily.head(10)

|   | region | date | daily_revenue | roll7_revenue |
|---|---|---|---|---|
| 0 | East | 2025-01-06 | 56.65 | 56.65 |
| 1 | East | 2025-01-13 | 249.55 | 306.2 |
| 2 | East | 2025-01-14 | 58.93 | 365.13 |
| 3 | East | 2025-01-15 | 223.65 | 588.78 |
| 4 | East | 2025-01-17 | 34.1 | 622.88 |
| 5 | East | 2025-01-23 | 267.72 | 890.6 |
| 6 | East | 2025-01-24 | 146.11 | 1036.71 |
| 7 | East | 2025-01-27 | 19.24 | 999.3 |
| 8 | East | 2025-01-28 | 154.4 | 904.15 |
| 9 | East | 2025-01-30 | 70.86 | 916.08 |
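
Because regions do not have orders on every day, a window of 7 rows can span more than 7 calendar days (in the output above, the first seven East rows run from 2025-01-06 to 2025-01-24). The sketch below, on hypothetical toy data, contrasts a row-based window with a time-based `"7D"` window. The column names `roll7_rows` and `roll7_days` are illustrative.

```python
import pandas as pd

# Hypothetical toy data with gaps between dates, sorted by region then date.
toy = pd.DataFrame({
    "region": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-10",
                            "2025-01-01", "2025-01-07", "2025-01-08"]),
    "daily_revenue": [10.0, 20.0, 30.0, 5.0, 7.0, 1.0],
}).sort_values(["region", "date"], ignore_index=True)

# Row-based: the last 7 observations, however far apart in time they are.
toy["roll7_rows"] = (
    toy.groupby("region")["daily_revenue"]
       .transform(lambda s: s.rolling(window=7, min_periods=1).sum())
)

# Time-based: only observations within the 7 calendar days ending at each date.
# Needs a DatetimeIndex; the result comes back ordered by region, then date,
# which matches toy's sort order, so the values can be assigned positionally.
toy["roll7_days"] = (
    toy.set_index("date")
       .groupby("region")["daily_revenue"]
       .rolling("7D", min_periods=1)
       .sum()
       .to_numpy()
)
```

With the default right-closed window, `"7D"` covers the interval `(date - 7 days, date]`, so for region `B` the row on 2025-01-08 excludes the order from 2025-01-01 and sums to 8.0 instead of 13.0.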


## Quick recap

You practiced:
- Multi-metric `groupby` aggregations + ranking
- Vectorized feature engineering (`np.select`, datetime ops)
- Reshaping (`pivot_table`, `melt`)
- Coercing messy numeric data + nullable integer dtype (`Int64`)
- `merge` vs `concat` alignment logic
- Group-wise rolling windows
