# Sorting and Filtering — Advanced Practice (with Solutions)

This notebook contains **moderately advanced** practice problems on **filtering, boolean masks, query, sorting (including custom keys), MultiIndex sorting, stable sorting, and top-k selection** in pandas.

**Best practices used here:**
- Reproducible data (fixed random seed)
- Small, realistic dataset
- Clear problem statements + immediately runnable solutions
- Avoids external files (standalone notebook)


In [1]:
import pandas as pd
import numpy as np

pd.set_option('display.max_rows', 60)
pd.set_option('display.max_columns', 30)

rng = np.random.default_rng(7)

## Dataset

We'll work with a small "orders" dataset that has:
- missing values
- duplicates
- dates
- categorical-like columns

So we can practice realistic filtering and sorting patterns.

In [2]:
# --- Create a small, realistic dataset ---
n = 40
customers = np.array(["Acme", "Bravo", "Cyan", "Delta", "Eon", "Flux"])
regions = np.array(["EMEA", "APAC", "AMER", None])
statuses = np.array(["new", "processing", "shipped", "cancelled"]) 
priorities = np.array(["low", "medium", "high"])

base_date = np.datetime64("2025-10-01")
order_dates = base_date + rng.integers(0, 90, size=n).astype("timedelta64[D]")

df = pd.DataFrame({
    "order_id": rng.integers(1000, 1100, size=n),
    "customer": rng.choice(customers, size=n, replace=True),
    "region": rng.choice(regions, size=n, replace=True, p=[0.35, 0.25, 0.30, 0.10]),
    "order_date": pd.to_datetime(order_dates),
    "revenue": np.round(rng.normal(loc=2200, scale=700, size=n), 2),
    "discount": np.round(rng.uniform(0, 0.35, size=n), 2),
    "status": rng.choice(statuses, size=n, replace=True, p=[0.25, 0.35, 0.30, 0.10]),
    "priority": rng.choice(priorities, size=n, replace=True, p=[0.35, 0.40, 0.25]),
})

# Inject a few missing discounts
missing_idx = rng.choice(df.index, size=5, replace=False)
df.loc[missing_idx, "discount"] = np.nan

# Ensure some negative-ish outliers (for filtering practice)
df.loc[rng.choice(df.index, size=2, replace=False), "revenue"] *= -0.4

# Make a couple of deliberate duplicate order_id rows (realistic-ish)
dup_rows = df.sample(2, random_state=1).copy()
dup_rows["status"] = "processing"
df = pd.concat([df, dup_rows], ignore_index=True)

# Clean up: revenue should be float, order_date datetime
df["order_date"] = pd.to_datetime(df["order_date"])
df["revenue"] = df["revenue"].astype(float)

df.head(10)

Unnamed: 0,order_id,customer,region,order_date,revenue,discount,status,priority
0,1046,Delta,AMER,2025-12-25,775.27,0.26,cancelled,high
1,1021,Bravo,AMER,2025-11-26,1986.87,0.2,shipped,low
2,1084,Flux,AMER,2025-12-01,1570.05,,shipped,medium
3,1016,Flux,EMEA,2025-12-20,2314.84,0.31,processing,medium
4,1085,Bravo,APAC,2025-11-22,3771.33,0.14,cancelled,medium
5,1061,Delta,EMEA,2025-12-09,-647.116,,new,low
6,1011,Flux,APAC,2025-12-15,1763.24,0.02,new,medium
7,1004,Flux,EMEA,2025-10-21,2343.78,,new,medium
8,1044,Eon,,2025-10-05,2545.11,0.18,processing,high
9,1003,Delta,EMEA,2025-10-28,2076.52,0.33,new,high


A quick look at missing values and basic info:

In [3]:
df.info()
df.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype        
---  ------      --------------  -----        
 0   order_id    42 non-null     int64        
 1   customer    42 non-null     object       
 2   region      38 non-null     object       
 3   order_date  42 non-null     datetime64[s]
 4   revenue     42 non-null     float64      
 5   discount    36 non-null     float64      
 6   status      42 non-null     object       
 7   priority    42 non-null     object       
dtypes: datetime64[s](1), float64(2), int64(1), object(4)
memory usage: 2.8+ KB


order_id      0
customer      0
region        4
order_date    0
revenue       0
discount      6
status        0
priority      0
dtype: int64

---

## Problem 1 — Compound Filtering + Null Handling

**Task:** Create a filtered DataFrame `p1` containing orders that:
- are in region **EMEA** or **AMER**
- have **status not equal** to `cancelled`
- have **revenue >= 2000**
- treat missing `discount` as **0** for the purpose of filtering, and keep only rows where `discount <= 0.10`

**Output:** Show `p1` sorted by `order_date` (ascending) then `revenue` (descending).


In [4]:
# SOLUTION (Problem 1)

p1_mask = (
    df["region"].isin(["EMEA", "AMER"]) &
    (df["status"] != "cancelled") &
    (df["revenue"] >= 2000) &
    (df["discount"].fillna(0) <= 0.10)
)

p1 = df.loc[p1_mask].sort_values(["order_date", "revenue"], ascending=[True, False])
p1

Unnamed: 0,order_id,customer,region,order_date,revenue,discount,status,priority
7,1004,Flux,EMEA,2025-10-21,2343.78,,new,medium
23,1024,Delta,EMEA,2025-10-26,2116.97,0.04,new,medium
10,1014,Acme,AMER,2025-10-26,2055.85,0.09,new,low
19,1051,Flux,EMEA,2025-11-12,2880.45,0.02,processing,high
36,1061,Delta,EMEA,2025-12-03,3209.2,,shipped,low
20,1026,Eon,EMEA,2025-12-13,2334.92,0.07,shipped,medium
39,1015,Delta,EMEA,2025-12-29,2088.09,0.04,shipped,low
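
The `fillna(0)` step in the mask matters because a comparison against `NaN` evaluates to `False`. The sketch below uses a small made-up frame (`toy` is not part of the dataset) to show how the same test behaves with and without the fill:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame: one known small discount, one missing, one large.
toy = pd.DataFrame({"discount": [0.05, np.nan, 0.30]})

# NaN <= 0.10 is False, so the row with a missing discount is dropped here.
strict = toy["discount"] <= 0.10

# Filling with 0 first lets the missing row pass the same test.
filled = toy["discount"].fillna(0) <= 0.10

print(strict.tolist())  # [True, False, False]
print(filled.tolist())  # [True, True, False]
```

Whether a missing discount should count as 0 is a modelling decision. Here it follows the task statement.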


---

## Problem 2 — `query()` with Dates + String Filtering

**Task:** Using `df.query(...)`, create `p2` with orders that:
- occurred in **November 2025**
- are **not** in `APAC`
- customer name **starts with** either `A` or `B`

**Output:** Return columns: `order_id, customer, region, order_date, revenue`.

**Hint:** Calling `.str` methods such as `.str.startswith` inside `query` is awkward (depending on the pandas version it may require `engine="python"`), so build a helper boolean column first.


In [5]:
# SOLUTION (Problem 2)

tmp = df.copy()
tmp["cust_AB"] = tmp["customer"].str.startswith(("A", "B"))

# November 2025 range
start = pd.Timestamp("2025-11-01")
end = pd.Timestamp("2025-12-01")

p2 = (
    tmp.query("@start <= order_date < @end and region != 'APAC' and cust_AB")
       .loc[:, ["order_id", "customer", "region", "order_date", "revenue"]]
)

p2

Unnamed: 0,order_id,customer,region,order_date,revenue
1,1021,Bravo,AMER,2025-11-26,1986.87
14,1080,Bravo,EMEA,2025-11-14,2144.57
29,1069,Bravo,EMEA,2025-11-15,977.73
30,1088,Bravo,EMEA,2025-11-22,2729.72
31,1020,Acme,AMER,2025-11-19,1608.15
41,1020,Acme,AMER,2025-11-19,1608.15
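
The same filter can be written as a plain boolean mask, which is a handy cross-check for a `query` string. The sketch below uses a small made-up frame (`toy` and its values are illustrative only) and a half-open date range, so an order dated exactly `2025-12-01` is excluded:

```python
import pandas as pd

# Hypothetical data; the row on 2025-12-01 sits exactly on the excluded end bound.
toy = pd.DataFrame({
    "customer": ["Acme", "Bravo", "Cyan", "Acme", "Bravo"],
    "region": ["EMEA", "APAC", "AMER", "EMEA", "EMEA"],
    "order_date": pd.to_datetime(
        ["2025-11-03", "2025-11-10", "2025-11-20", "2025-12-01", "2025-11-30"]
    ),
})

start = pd.Timestamp("2025-11-01")
end = pd.Timestamp("2025-12-01")
starts_ab = toy["customer"].str.startswith(("A", "B"))

# Boolean-mask version: start <= order_date < end, not APAC, name starts with A or B.
mask = (
    toy["order_date"].ge(start)
    & toy["order_date"].lt(end)
    & toy["region"].ne("APAC")
    & starts_ab
)
via_mask = toy.loc[mask]

# query version with a helper column, mirroring the solution above.
via_query = (
    toy.assign(cust_AB=starts_ab)
       .query("@start <= order_date < @end and region != 'APAC' and cust_AB")
       .drop(columns="cust_AB")
)

print(via_mask.index.tolist())
print(via_query.index.tolist())
```

Both versions should select the same rows. If they differ, the `query` string is the first place to look.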


---

## Problem 3 — Stable Sorting + Tie-breaking

Suppose you want to rank orders by `status` (custom order), and within the same status keep the original row order (stable).

**Task:**
1. Define a status order: `new` < `processing` < `shipped` < `cancelled`
2. Sort by that status order **stably** (so ties preserve original order).

**Output:** Show only `order_id, status, order_date, revenue`.

**Hint:** Use a `CategoricalDtype` or `Categorical` and `kind='mergesort'` for stability.


In [6]:
# SOLUTION (Problem 3)

status_order = ["new", "processing", "shipped", "cancelled"]
p3 = df.copy()
p3["status"] = pd.Categorical(p3["status"], categories=status_order, ordered=True)

# Stable sort by status
p3_sorted = p3.sort_values("status", kind="mergesort")

p3_sorted[["order_id", "status", "order_date", "revenue"]].head(25)

Unnamed: 0,order_id,status,order_date,revenue
5,1061,new,2025-12-09,-647.116
6,1011,new,2025-12-15,1763.24
7,1004,new,2025-10-21,2343.78
9,1003,new,2025-10-28,2076.52
10,1014,new,2025-10-26,2055.85
12,1097,new,2025-12-22,2563.94
14,1080,new,2025-11-14,2144.57
16,1082,new,2025-10-12,1461.86
23,1024,new,2025-10-26,2116.97
25,1001,new,2025-10-23,-563.204
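
To see the stability guarantee in isolation, here is a minimal sketch on a made-up five-row frame (`toy` is illustrative only). Within each status, the original index order survives the sort:

```python
import pandas as pd

status_order = ["new", "processing", "shipped", "cancelled"]

# Hypothetical rows; the index 0..4 records the original order.
toy = pd.DataFrame({"status": ["shipped", "new", "shipped", "new", "cancelled"]})
toy["status"] = pd.Categorical(toy["status"], categories=status_order, ordered=True)

# mergesort is stable: rows with the same status keep their relative order.
out = toy.sort_values("status", kind="mergesort")

print(out.index.tolist())  # [1, 3, 0, 2, 4]
```

The ordered categorical supplies the custom order (`new` before `shipped` before `cancelled`), while `kind="mergesort"` handles the ties.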


---

## Problem 4 — Custom Sort Key (Case-insensitive + Length)

**Task:** Create a table `p4` of total revenue per customer:
- group by `customer`
- sum `revenue`
- then sort customer names by:
  1) **case-insensitive name**
  2) for ties (unlikely here), by **name length**

**Output:** A Series or DataFrame sorted with a custom key.

**Hint:** `sort_index(key=...)` receives the *entire index*.


In [7]:
# SOLUTION (Problem 4)

p4 = df.groupby("customer")["revenue"].sum()

# sort_index(key=...) takes a single vectorized key, so apply the two criteria
# as two stable passes: the secondary key (name length) first, then the primary
# key (casefolded name). The stable second pass preserves the length order among ties.
p4_sorted = (
    p4.sort_index(key=lambda idx: idx.str.len(), kind="mergesort")
      .sort_index(key=lambda idx: idx.str.casefold(), kind="mergesort")
)

# A cleaner approach is to just present a DataFrame with explicit sort keys:
p4_df = p4.reset_index(name="total_revenue")
p4_df["name_key"] = p4_df["customer"].str.casefold()
p4_df["len_key"] = p4_df["customer"].str.len()
p4_df = p4_df.sort_values(["name_key", "len_key"], ascending=[True, True]).drop(columns=["name_key", "len_key"])

p4_df

Unnamed: 0,customer,total_revenue
0,Acme,6986.956
1,Bravo,11610.22
2,Cyan,14488.62
3,Delta,22857.854
4,Eon,11325.28
5,Flux,14596.35


> Note: In practice, using explicit sort keys in columns (as in `p4_df`) is often the clearest approach when you need multi-criteria sorting on the index values.
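
The customer names in this dataset are all capitalized, so the case-insensitive key makes no visible difference above. A minimal sketch with made-up mixed-case names (`totals` is illustrative only) shows what the key changes:

```python
import pandas as pd

# Hypothetical totals indexed by mixed-case customer names.
totals = pd.Series(
    [10.0, 20.0, 30.0, 40.0],
    index=["bravo", "Cyan", "acme", "Delta"],
)

# Default ordering compares code points, so uppercase letters sort first.
default_order = totals.sort_index().index.tolist()

# The key receives the whole Index and must return an Index of the same length.
casefold_order = totals.sort_index(key=lambda idx: idx.str.casefold()).index.tolist()

print(default_order)   # ['Cyan', 'Delta', 'acme', 'bravo']
print(casefold_order)  # ['acme', 'bravo', 'Cyan', 'Delta']
```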

---

## Problem 5 — Top-k per Group (Filtering after Sorting)

**Task:** For each `region` (excluding null regions), find the **top 2 orders by revenue**.

Requirements:
- ignore rows with null `region`
- if a region has fewer than 2 rows, return what exists
- output should include: `region, order_id, customer, revenue, order_date`

**Hint:** Sort then use `groupby(...).head(k)`.


In [8]:
# SOLUTION (Problem 5)

p5 = (
    df.loc[df["region"].notna()]
      .sort_values(["region", "revenue"], ascending=[True, False])
      .groupby("region", as_index=False, sort=False)
      .head(2)
      .loc[:, ["region", "order_id", "customer", "revenue", "order_date"]]
)

p5

Unnamed: 0,region,order_id,customer,revenue,order_date
35,AMER,1000,Cyan,3074.4,2025-12-11
12,AMER,1097,Cyan,2563.94,2025-12-22
4,APAC,1085,Bravo,3771.33,2025-11-22
28,APAC,1096,Delta,2792.63,2025-11-13
36,EMEA,1061,Delta,3209.2,2025-12-03
19,EMEA,1051,Flux,2880.45,2025-11-12
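
Every region in the dataset has at least two rows, so the "fewer than 2 rows" requirement is not exercised above. The sketch below runs the same sort-then-`head` pattern on a made-up frame (`toy` is illustrative only) where one region has a single row and one row has no region:

```python
import pandas as pd

# Hypothetical data: EMEA has three rows, APAC has one, and one region is missing.
toy = pd.DataFrame({
    "region": ["EMEA", "EMEA", "EMEA", "APAC", None],
    "revenue": [100.0, 300.0, 200.0, 50.0, 999.0],
})

top2 = (
    toy.dropna(subset=["region"])                       # ignore null regions
       .sort_values(["region", "revenue"], ascending=[True, False])
       .groupby("region", sort=False)
       .head(2)                                         # up to 2 rows per region
)

print(top2.index.tolist())  # [3, 1, 2]
```

`head(2)` returns whatever exists for a short group, so APAC contributes its single row and the null-region row never appears.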


---

## Problem 6 — MultiIndex Sorting + Level-specific Order

**Task:** Create a MultiIndex DataFrame with index `(region, customer, order_id)` and then:
1. Sort by index with a **custom region order**: `AMER`, `EMEA`, `APAC`, then missing last
2. Within each region, sort customers alphabetically (case-insensitive)
3. Within each customer, sort by `order_id` ascending

**Output:** Show the first 20 rows of the sorted MultiIndex frame with columns `order_date, revenue, status`.


In [9]:
# SOLUTION (Problem 6)

m = df.copy()

# Put missing regions as a label so they can be ordered last explicitly
m["region_filled"] = m["region"].fillna("(missing)")

# Define custom region order
region_order = ["AMER", "EMEA", "APAC", "(missing)"]
m["region_filled"] = pd.Categorical(m["region_filled"], categories=region_order, ordered=True)

mi = (
    m.set_index(["region_filled", "customer", "order_id"])
     .loc[:, ["order_date", "revenue", "status"]]
)

# sort_index(key=...) on a MultiIndex applies the key to each level separately, so a
# casefold key would also be applied to the region and order_id levels. We'll sort in steps for clarity.

# 1) Sort by region level (categorical already defines order)
mi1 = mi.sort_index(level=0)

# 2) Within each region, sort by customer case-insensitively and then order_id.
# Approach: reset_index to columns, build sort keys, sort_values, then set index back.
tmp = mi1.reset_index()
tmp["cust_key"] = tmp["customer"].str.casefold()
tmp = tmp.sort_values(
    by=["region_filled", "cust_key", "customer", "order_id"],
    ascending=[True, True, True, True],
    kind="mergesort",
)
tmp = tmp.drop(columns=["cust_key"]).set_index(["region_filled", "customer", "order_id"])

tmp.head(20)

region_filled,customer,order_id,order_date,revenue,status
AMER,Acme,1001,2025-10-23,-563.204,new
AMER,Acme,1014,2025-10-26,2055.85,new
AMER,Acme,1020,2025-11-19,1608.15,processing
AMER,Acme,1020,2025-11-19,1608.15,processing
AMER,Acme,1046,2025-10-01,1476.43,processing
AMER,Acme,1099,2025-12-04,801.58,processing
AMER,Bravo,1021,2025-11-26,1986.87,shipped
AMER,Cyan,1000,2025-12-11,3074.4,processing
AMER,Cyan,1097,2025-12-22,2563.94,new
AMER,Delta,1046,2025-12-25,775.27,cancelled
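
A compact version of the same idea, on a made-up frame (`toy` is illustrative only) that includes a lowercase customer name and a missing region, is sketched below. The ordered categorical drives the region order and a temporary casefolded column drives the customer order:

```python
import pandas as pd

# Hypothetical rows; revenue doubles as a row label so the final order is easy to read.
toy = pd.DataFrame({
    "region": ["APAC", None, "AMER", "EMEA", "AMER"],
    "customer": ["Cyan", "Acme", "bravo", "Acme", "Acme"],
    "revenue": [1.0, 2.0, 3.0, 4.0, 5.0],
})

region_order = ["AMER", "EMEA", "APAC", "(missing)"]
toy["region"] = pd.Categorical(
    toy["region"].fillna("(missing)"), categories=region_order, ordered=True
)

# Temporary key column for case-insensitive customer ordering.
toy["cust_key"] = toy["customer"].str.casefold()

out = (
    toy.sort_values(["region", "cust_key"])
       .drop(columns="cust_key")
       .set_index(["region", "customer"])
)

print(out["revenue"].tolist())  # [5.0, 3.0, 4.0, 1.0, 2.0]
```

AMER comes first (with `Acme` ahead of `bravo`), then EMEA, then APAC, and the `(missing)` label sorts last because it is the final category.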


---

## Problem 7 — Filtering with `between`, `isin`, and Excluding Outliers

**Task:** Create `p7` selecting orders that:
- have `revenue` between **1500 and 3500** (inclusive)
- have `priority` in `{medium, high}`
- status is one of `{new, processing, shipped}`
- exclude any order_ids that appear **more than once** in the dataset (duplicates)

**Output:** Sort by `revenue` descending, and show `order_id, customer, revenue, priority, status`.


In [10]:
# SOLUTION (Problem 7)

duplicate_order_ids = df.loc[df["order_id"].duplicated(keep=False), "order_id"].unique()

p7 = df.loc[
    df["revenue"].between(1500, 3500, inclusive="both") &
    df["priority"].isin(["medium", "high"]) &
    df["status"].isin(["new", "processing", "shipped"]) &
    (~df["order_id"].isin(duplicate_order_ids))
].sort_values("revenue", ascending=False)

p7[["order_id", "customer", "revenue", "priority", "status"]]

Unnamed: 0,order_id,customer,revenue,priority,status
35,1000,Cyan,3074.4,medium,processing
28,1096,Delta,2792.63,high,shipped
30,1088,Bravo,2729.72,medium,processing
12,1097,Cyan,2563.94,high,new
7,1004,Flux,2343.78,medium,new
20,1026,Eon,2334.92,medium,shipped
3,1016,Flux,2314.84,medium,processing
37,1083,Flux,2153.94,high,shipped
23,1024,Delta,2116.97,medium,new
9,1003,Delta,2076.52,high,new
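
The `keep=False` argument is what makes the duplicate exclusion complete. With the default `keep="first"`, the first occurrence of a repeated ID is not flagged and would slip through. A minimal sketch on made-up IDs (`ids` is illustrative only):

```python
import pandas as pd

# Hypothetical order IDs; 1001 appears twice.
ids = pd.Series([1001, 1002, 1001, 1003])

# Default keep="first": only the later occurrence is marked as a duplicate.
later_only = ids.duplicated()

# keep=False: every occurrence of a repeated value is marked.
all_occurrences = ids.duplicated(keep=False)

# IDs that appear exactly once.
unique_only = ids[~all_occurrences]

print(later_only.tolist())       # [False, False, True, False]
print(all_occurrences.tolist())  # [True, False, True, False]
print(unique_only.tolist())      # [1002, 1003]
```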


---

## Problem 8 — `nlargest` with Conditions + Deterministic Ties

**Task:** Among orders that are **not cancelled** and have a **non-null region**, find the **top 5** by revenue.

Requirements:
- Use `nlargest`
- Break ties deterministically by using `order_date` (earlier first) then `order_id` (smaller first)

**Hint:** `nlargest` orders every column it is given in descending order and resolves remaining ties by position (`keep=`), so it cannot apply ascending tie-breakers such as earliest date or smallest ID on its own.
Use it to pull the candidate rows, then apply `sort_values` with the tie-breakers and take `head(5)`.


In [11]:
# SOLUTION (Problem 8)

base = df.loc[(df["status"] != "cancelled") & df["region"].notna()].copy()

# Use nlargest to get candidates; keep="all" retains every row tied at the cutoff,
# so the tie-breakers below decide which of the tied rows make the top 5
candidates = base.nlargest(5, "revenue", keep="all")

p8 = (
    candidates.sort_values(
        by=["revenue", "order_date", "order_id"],
        ascending=[False, True, True],
        kind="mergesort",
    )
    .head(5)
    .loc[:, ["order_id", "customer", "region", "order_date", "revenue", "status"]]
)

p8

Unnamed: 0,order_id,customer,region,order_date,revenue,status
36,1061,Delta,EMEA,2025-12-03,3209.2,shipped
35,1000,Cyan,AMER,2025-12-11,3074.4,processing
19,1051,Flux,EMEA,2025-11-12,2880.45,processing
28,1096,Delta,APAC,2025-11-13,2792.63,shipped
32,1072,Cyan,APAC,2025-11-15,2745.29,new
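
The revenues in this dataset happen to be distinct, so the tie-breakers never fire above. The sketch below uses a made-up frame with a deliberate tie (`toy` is illustrative only) to show one way to make the candidate step tie-safe, assuming `nlargest(..., keep="all")` is used to retain all rows tied at the cutoff:

```python
import pandas as pd

# Hypothetical data: two orders tie on revenue 500.0.
toy = pd.DataFrame({
    "order_id": [3, 1, 2, 4],
    "revenue": [500.0, 500.0, 900.0, 100.0],
})

# keep="first" (the default) picks the tied row that appears first in the frame.
first_only = toy.nlargest(2, "revenue")

# keep="all" returns every row tied at the cutoff, so it may return more than n rows.
with_ties = toy.nlargest(2, "revenue", keep="all")

# Deterministic top 2: highest revenue first, then smallest order_id.
top2 = with_ties.sort_values(
    ["revenue", "order_id"], ascending=[False, True]
).head(2)

print(top2["order_id"].tolist())  # [2, 1]
```

With the default `keep="first"`, the tied slot goes to whichever row comes first in the frame (order 3 here), not to the smaller `order_id`.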


## Extra mini-checks (optional)

These quick assertions are useful when practicing to ensure your outputs follow the rules.

In [12]:
# Optional sanity checks

# Problem 1 checks
if len(p1) > 0:
    assert set(p1["region"].unique()).issubset({"EMEA", "AMER"})
    assert (p1["status"] != "cancelled").all()
    assert (p1["revenue"] >= 2000).all()
    assert (p1["discount"].fillna(0) <= 0.10).all()

# Problem 7: ensure no duplicate order_id
assert not p7["order_id"].duplicated().any()

# Problem 8 checks
assert (p8["status"] != "cancelled").all()
assert p8["region"].notna().all()

"All checks passed (if no assertion errors)."

'All checks passed (if no assertion errors).'
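
The checks above verify the filters but not the sort order. One possible way to check ordering is sketched below on a made-up frame (`toy` is illustrative only) sorted like `p1`: dates ascending, and revenue descending within each date.

```python
import pandas as pd

# Hypothetical frame already sorted by order_date (asc), then revenue (desc).
toy = pd.DataFrame({
    "order_date": pd.to_datetime(["2025-10-01", "2025-10-01", "2025-10-05"]),
    "revenue": [300.0, 100.0, 50.0],
})

# Primary key: dates never decrease from one row to the next.
dates_sorted = toy["order_date"].is_monotonic_increasing

# Secondary key: within each date, revenue never increases.
revenue_sorted_within_date = (
    toy.groupby("order_date")["revenue"]
       .apply(lambda s: s.is_monotonic_decreasing)
       .all()
)

assert dates_sorted
assert revenue_sorted_within_date
```

Both properties are non-strict, so repeated dates or repeated revenues within a date still pass.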