# Selecting Data — Master Level (with Solutions)

This notebook is a **master-level** drill on Pandas selection.

## Themes
- MultiIndex + `IndexSlice` + partial keys + lexsorted requirements
- `searchsorted` for fast range selection
- boolean masks with `.dt` / `.str` / `.between` and `np.select`
- label-alignment pitfalls (Series/DataFrame assignment)
- safe assignment patterns (no chained indexing)
- robust "defensive" indexing (`reindex`, `.get_indexer_for`, `.filter`)
- advanced tools: `pd.eval`, `DataFrame.where`, `mask`, `align`

Best practice: treat selection as a **first-class, testable step** in your pipeline. Build masks, name them, and validate their counts before you select or assign.
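As a minimal sketch of that habit, the snippet below builds two named masks on a small hypothetical frame and checks them before selecting. The frame `toy`, its values, and the expected count are illustrative assumptions, not part of the dataset built later in this notebook.

```python
import pandas as pd

# Hypothetical toy frame, for illustration only.
toy = pd.DataFrame({
    'status': ['paid', 'refunded', 'paid', 'pending'],
    'revenue': [120.0, 80.0, 40.0, 300.0],
})

# Build masks as named, reusable boolean Series.
is_paid = toy['status'].eq('paid')
is_large = toy['revenue'].ge(100)
selection_mask = is_paid & is_large

# Validate the mask itself: boolean dtype and the same index as the frame.
assert selection_mask.dtype == bool
assert selection_mask.index.equals(toy.index)

# Validate the count against what you expect from the data.
n_selected = int(selection_mask.sum())
assert n_selected == 1

selected_toy = toy.loc[selection_mask]
```

Keeping the masks as separate variables makes each condition easy to inspect on its own (`is_paid.sum()`, `is_large.sum()`) when a count looks wrong.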


In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.width', 140)
pd.set_option('display.max_columns', 60)
pd.set_option('display.max_rows', 60)

rng = np.random.default_rng(2025)

## Setup: a larger, more realistic dataset

We simulate event-level data with:
- `DatetimeIndex`
- categorical dimensions (`region`, `store`, `sku`)
- a `user_id`
- numeric measures (`qty`, `unit_price`)
- a `status` field

Then we build a **MultiIndex** view for advanced selection.


In [2]:
n = 2_000

# dates over 90 days
dates = pd.to_datetime('2025-01-01') + pd.to_timedelta(rng.integers(0, 90, size=n), unit='D')

regions = np.array(['EU', 'NA', 'APAC'])
stores = np.array(['S1', 'S2', 'S3', 'S4', 'S5'])
skus = np.array([f'SKU-{i:03d}' for i in range(1, 41)])
statuses = np.array(['paid', 'refunded', 'chargeback', 'pending'])

df = pd.DataFrame({
    'ts': dates,
    'region': rng.choice(regions, size=n, p=[0.45, 0.4, 0.15]),
    'store': rng.choice(stores, size=n),
    'sku': rng.choice(skus, size=n),
    'user_id': rng.integers(10_000, 10_250, size=n),
    'qty': rng.integers(1, 8, size=n),
    'unit_price': rng.choice([19.99, 29.99, 49.99, 79.99, 99.99, 149.99], size=n),
    'status': rng.choice(statuses, size=n, p=[0.78, 0.12, 0.03, 0.07]),
})
df['revenue'] = df['qty'] * df['unit_price']

# Make a stable sort by time for time-range operations
df = df.sort_values('ts', kind='mergesort').reset_index(drop=True)

df.head()

,ts,region,store,sku,user_id,qty,unit_price,status,revenue
0,2025-01-01,EU,S3,SKU-001,10150,4,79.99,paid,319.96
1,2025-01-01,EU,S3,SKU-011,10017,5,49.99,paid,249.95
2,2025-01-01,NA,S1,SKU-033,10164,5,149.99,paid,749.95
3,2025-01-01,EU,S2,SKU-028,10209,2,29.99,paid,59.98
4,2025-01-01,EU,S4,SKU-026,10172,5,49.99,paid,249.95


Create a MultiIndex view `mi` for master-level indexing:

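**Index order** matters for performance and partial selection: partial keys are resolved from the leftmost level, and label slicing needs the index to be lexsorted.
We'll use a lexsorted MultiIndex: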

`(region, store, ts, sku, user_id)`
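To see why the sort matters, here is a small sketch on a hypothetical two-level index (`toy_index` and `toy_mi` are illustrative names, not the `mi` built below). It checks monotonicity, attempts a label slice on the unsorted index, and repeats the slice after `sort_index()`. In current pandas the unsorted slice is expected to raise `UnsortedIndexError`, which subclasses `KeyError`; treat that exact behaviour as version-dependent.

```python
import pandas as pd

# Hypothetical two-level index, deliberately out of order.
toy_index = pd.MultiIndex.from_tuples(
    [('NA', 'S2'), ('EU', 'S1'), ('NA', 'S1'), ('EU', 'S2')],
    names=['region', 'store'],
)
toy_mi = pd.DataFrame({'qty': [1, 2, 3, 4]}, index=toy_index)

# An unsorted MultiIndex is not monotonic.
sorted_before = toy_mi.index.is_monotonic_increasing

# Label-range slicing on the unsorted index is expected to fail.
try:
    toy_mi.loc[('EU', 'S1'):('NA', 'S1')]
    slice_failed = False
except KeyError:  # UnsortedIndexError is a KeyError subclass
    slice_failed = True

# After sort_index() the same slice works and is inclusive on both ends.
toy_sorted = toy_mi.sort_index()
sliced = toy_sorted.loc[('EU', 'S1'):('NA', 'S1')]
```

A cheap guard in a pipeline is to assert `frame.index.is_monotonic_increasing` right before any MultiIndex slice.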


In [3]:
mi = (
    df.set_index(['region', 'store', 'ts', 'sku', 'user_id'])
      .sort_index()  # important for slicing / IndexSlice
)

mi.head()

region,store,ts,sku,user_id,qty,unit_price,status,revenue
APAC,S1,2025-01-01,SKU-003,10132,3,99.99,paid,299.97
APAC,S1,2025-01-02,SKU-005,10169,1,79.99,paid,79.99
APAC,S1,2025-01-02,SKU-011,10095,4,49.99,paid,199.96
APAC,S1,2025-01-03,SKU-017,10179,3,149.99,paid,449.97
APAC,S1,2025-01-03,SKU-035,10220,1,29.99,paid,29.99


## Problem 1 — MultiIndex partial selection + `IndexSlice` (master)

**Task**: Select all rows satisfying:
- `region='EU'`
- `store` in `['S2', 'S4']`
- timestamp in `[2025-02-01, 2025-02-15]` inclusive

Return columns: `qty`, `unit_price`, `status`, `revenue`.

Requirement: use `pd.IndexSlice` and MultiIndex slicing (not boolean filtering on columns).


In [4]:
# SOLUTION
idx = pd.IndexSlice
start = pd.Timestamp('2025-02-01')
end = pd.Timestamp('2025-02-15')

# For non-contiguous store selection, pass a list at that level.
res_p1 = mi.loc[idx['EU', ['S2', 'S4'], start:end, :, :], ['qty', 'unit_price', 'status', 'revenue']]
res_p1.head(), res_p1.shape

(                                         qty  unit_price status  revenue
 region store ts         sku     user_id                                 
 EU     S2    2025-02-01 SKU-003 10180      6       79.99   paid   479.94
                         SKU-036 10089      6      149.99   paid   899.94
              2025-02-02 SKU-008 10025      4       19.99   paid    79.96
                                 10091      4       49.99   paid   199.96
              2025-02-03 SKU-008 10171      1       19.99   paid    19.99,
 (76, 4))
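A useful habit is to cross-check an `IndexSlice` selection against the equivalent boolean filter. The sketch below does this on a small hypothetical frame (`toy_rows` and `toy_mi` are illustrative and unrelated to the 2,000-row dataset), using only three index levels to keep it short.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy_rows = pd.DataFrame({
    'region': ['EU', 'EU', 'EU', 'NA', 'EU'],
    'store': ['S2', 'S3', 'S4', 'S2', 'S2'],
    'ts': pd.to_datetime(
        ['2025-02-01', '2025-02-05', '2025-02-15', '2025-02-03', '2025-02-20']
    ),
    'qty': [1, 2, 3, 4, 5],
})
toy_mi = toy_rows.set_index(['region', 'store', 'ts']).sort_index()

idx = pd.IndexSlice
lo, hi = pd.Timestamp('2025-02-01'), pd.Timestamp('2025-02-15')

# Scalar on level 0, list on level 1, inclusive label slice on level 2.
by_slice = toy_mi.loc[idx['EU', ['S2', 'S4'], lo:hi], :]

# Equivalent boolean filter on the flat frame, used only as a check.
by_mask = toy_rows[
    toy_rows['region'].eq('EU')
    & toy_rows['store'].isin(['S2', 'S4'])
    & toy_rows['ts'].between(lo, hi)  # between is inclusive by default
]

same_rows = sorted(by_slice['qty']) == sorted(by_mask['qty'])
```

Both ends of the `lo:hi` label slice are included, which matches the inclusive window in the task.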

## Problem 2 — Fast time-window selection with `searchsorted` (index-aware)

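Boolean filtering is expressive, but it evaluates the condition on every row. On a column that is already sorted, `searchsorted` finds the window bounds by binary search, and the rows can then be taken as a single positional slice.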

**Task**:
- Using the **time-sorted** `df` (not `mi`), select rows where `ts` is in `[2025-03-01, 2025-03-10)`.
- Do it with `searchsorted` on the `ts` column.

Return `ts`, `region`, `store`, `revenue`.


In [5]:
# SOLUTION
ts = df['ts']
left = pd.Timestamp('2025-03-01')
right = pd.Timestamp('2025-03-10')

# searchsorted requires sorted values — df is sorted by ts above.
i0 = ts.searchsorted(left, side='left')
i1 = ts.searchsorted(right, side='left')

res_p2 = df.iloc[i0:i1, :][['ts', 'region', 'store', 'revenue']]
res_p2.head(), (i0, i1), res_p2['ts'].min(), res_p2['ts'].max()

(             ts region store  revenue
 1355 2025-03-01     NA    S4   899.94
 1356 2025-03-01     NA    S4   139.93
 1357 2025-03-01   APAC    S2   119.94
 1358 2025-03-01     NA    S1   299.98
 1359 2025-03-01     NA    S5   149.97,
 (np.int64(1355), np.int64(1536)),
 Timestamp('2025-03-01 00:00:00'),
 Timestamp('2025-03-09 00:00:00'))
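The half-open window comes from using `side='left'` for both bounds. The sketch below shows this on a short hypothetical Series of sorted timestamps (`toy_ts` is illustrative) and confirms that the positional slice matches the boolean mask.

```python
import pandas as pd

# Hypothetical sorted timestamps with duplicates, for illustration only.
toy_ts = pd.Series(pd.to_datetime([
    '2025-02-27', '2025-03-01', '2025-03-01',
    '2025-03-09', '2025-03-10', '2025-03-12',
]))
assert toy_ts.is_monotonic_increasing  # searchsorted assumes sorted input

lo = pd.Timestamp('2025-03-01')
hi = pd.Timestamp('2025-03-10')

# side='left' on both bounds yields the half-open window [lo, hi):
# the first position >= lo, and the first position >= hi (excluded by the slice).
start_pos = toy_ts.searchsorted(lo, side='left')
stop_pos = toy_ts.searchsorted(hi, side='left')
by_position = toy_ts.iloc[start_pos:stop_pos]

# The boolean mask scans every row but must select the same rows.
by_mask = toy_ts[(toy_ts >= lo) & (toy_ts < hi)]
```

For a closed window `[lo, hi]`, the right bound would use `side='right'` instead.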

## Problem 3 — Composite conditions with `.dt`, `.str`, and `np.select`

**Task**: Create a new column `risk_bucket` with rules:
- `chargeback` OR (`refunded` AND revenue >= 300) → `'high'`
- `pending` → `'review'`
- otherwise `'normal'`

Then select rows:
- `risk_bucket != 'normal'`
- and `sku` ends with `'7'` or `'8'` (string condition)

Return: `ts`, `region`, `store`, `sku`, `status`, `revenue`, `risk_bucket`.

Goal: clean, vectorized logic and reusable masks.


In [6]:
# SOLUTION
df_p3 = df.copy()

cond_high = df_p3['status'].eq('chargeback') | (df_p3['status'].eq('refunded') & df_p3['revenue'].ge(300))
cond_review = df_p3['status'].eq('pending')

df_p3['risk_bucket'] = np.select(
    [cond_high, cond_review],
    ['high', 'review'],
    default='normal'
)

mask_bucket = df_p3['risk_bucket'].ne('normal')
mask_sku = df_p3['sku'].str.endswith(('7', '8'))

res_p3 = df_p3.loc[mask_bucket & mask_sku, ['ts', 'region', 'store', 'sku', 'status', 'revenue', 'risk_bucket']]
res_p3.head(), res_p3['risk_bucket'].value_counts()

(            ts region store      sku      status  revenue risk_bucket
 32  2025-01-02     NA    S2  SKU-008    refunded   319.96        high
 49  2025-01-03     EU    S5  SKU-008     pending   599.94      review
 56  2025-01-03     EU    S1  SKU-028  chargeback   449.97        high
 81  2025-01-04     EU    S2  SKU-008     pending   299.98      review
 112 2025-01-06   APAC    S5  SKU-027  chargeback   899.94        high,
 risk_bucket
 high      30
 review    25
 Name: count, dtype: int64)

## Problem 4 — Defensive selection: missing columns, missing labels, stable ordering

**Task**:
1. Select columns `['ts','region','store','sku','revenue','does_not_exist']` without raising.
2. Select a list of timestamps, some of which might not exist.
3. Preserve the requested order.

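Hint: `.reindex(columns=...)` and `.reindex(index=...)`. Note that `.reindex(index=...)` raises on an index with duplicate labels, so it only applies when the index is unique.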


In [8]:
# SOLUTION
cols = ['ts', 'region', 'store', 'sku', 'revenue', 'does_not_exist']

requested_ts = pd.to_datetime(
    ['2025-01-03', '2025-01-04', '2025-01-99'],
    errors='coerce'
).dropna()

# ts is not unique in df, and reindex raises on duplicate labels, so filter with isin instead.
tmp = df.set_index('ts', drop=False)

# Pick all rows whose ts is in requested_ts, then keep the requested_ts order.
safe_rows = (
    tmp.loc[tmp.index.isin(requested_ts)]
       .assign(_ts_order=lambda x: pd.Categorical(x.index, categories=requested_ts, ordered=True))
       .sort_values('_ts_order')
       .drop(columns='_ts_order')
       .reindex(columns=cols)
)
safe_rows

ts,ts,region,store,sku,revenue,does_not_exist
2025-01-03,2025-01-03,APAC,S3,SKU-012,749.95,NaN
2025-01-03,2025-01-03,NA,S1,SKU-021,399.95,NaN
2025-01-03,2025-01-03,EU,S3,SKU-002,179.94,NaN
2025-01-03,2025-01-03,APAC,S1,SKU-035,29.99,NaN
2025-01-03,2025-01-03,APAC,S2,SKU-024,1049.93,NaN
2025-01-03,2025-01-03,EU,S1,SKU-030,239.97,NaN
2025-01-03,2025-01-03,NA,S2,SKU-009,49.99,NaN
2025-01-03,2025-01-03,APAC,S5,SKU-001,1049.93,NaN
2025-01-03,2025-01-03,APAC,S3,SKU-024,349.93,NaN
2025-01-03,2025-01-03,EU,S1,SKU-028,449.97,NaN


## Problem 5 — Selection + assignment with alignment control (`align`, `where`, `mask`)

**Task**:
1. Compute store-level median revenue per day: `median_rev` for each `(ts, store)`.
2. Add a column `above_store_median` indicating whether each row's revenue is above its store's daily median.

Then:
- For rows where `above_store_median` is True and `status == 'paid'`, apply a 5% surcharge to `unit_price`.
- Recompute revenue.

**Master requirement**: do this with **alignment-safe** joins / mapping (no loops).


In [9]:
# SOLUTION
df_p5 = df.copy()

# 1) median revenue by (ts, store)
median_rev = (
    df_p5.groupby(['ts', 'store'], sort=False)['revenue']
         .median()
)

# 2) align back using MultiIndex mapping
key = pd.MultiIndex.from_frame(df_p5[['ts', 'store']])
df_p5['store_daily_median_rev'] = median_rev.reindex(key).to_numpy()

df_p5['above_store_median'] = df_p5['revenue'] > df_p5['store_daily_median_rev']

# apply surcharge
mask = df_p5['above_store_median'] & df_p5['status'].eq('paid')
df_p5.loc[mask, 'unit_price'] = df_p5.loc[mask, 'unit_price'] * 1.05
df_p5['revenue'] = df_p5['qty'] * df_p5['unit_price']

df_p5.loc[mask, ['ts','store','status','qty','unit_price','store_daily_median_rev','above_store_median','revenue']].head()

,ts,store,status,qty,unit_price,store_daily_median_rev,above_store_median,revenue
0,2025-01-01,S3,paid,4,83.9895,284.955,True,335.958
2,2025-01-01,S1,paid,5,157.4895,524.96,True,787.4475
8,2025-01-01,S5,paid,5,104.9895,369.96,True,524.9475
10,2025-01-01,S3,paid,7,52.4895,284.955,True,367.4265
11,2025-01-01,S4,paid,5,104.9895,374.95,True,524.9475
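An alternative to reindexing on a MultiIndex key is `groupby(...).transform('median')`, which returns a Series already aligned to the original rows. This is a different technique from the solution above, sketched here on a hypothetical frame (`toy_sales` is illustrative).

```python
import pandas as pd

# Hypothetical sales rows, for illustration only.
toy_sales = pd.DataFrame({
    'ts': pd.to_datetime(['2025-01-01'] * 3 + ['2025-01-02'] * 2),
    'store': ['S1', 'S1', 'S2', 'S1', 'S1'],
    'revenue': [100.0, 300.0, 50.0, 20.0, 40.0],
})

# transform broadcasts each group's median back onto that group's rows,
# so the result shares toy_sales' index and needs no explicit join.
group_median = toy_sales.groupby(['ts', 'store'])['revenue'].transform('median')

# Strictly above the (ts, store) median.
above = toy_sales['revenue'] > group_median
```

Because `group_median` carries the same index as `toy_sales`, the comparison aligns by label and stays correct even if the rows are later reordered.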


## Problem 6 — MultiIndex selection with `get_locs` (efficient complex selections)

Sometimes you want to select by multiple conditions on different levels and turn that into integer locations.

**Task**:
On `mi` (MultiIndex):
- region in `['EU', 'NA']`
- store == `'S3'`
- sku in a chosen subset (e.g. first 5 SKUs)

Return the first 10 rows of the selection.

Requirement: use `mi.index.get_locs(...)` and then `.iloc`.


In [10]:
# SOLUTION
sku_subset = skus[:5]
locs = mi.index.get_locs([
    ['EU', 'NA'],  # region
    'S3',          # store
    slice(None),   # ts
    list(sku_subset),
    slice(None)    # user_id
])

res_p6 = mi.iloc[locs, :].head(10)
res_p6

region,store,ts,sku,user_id,qty,unit_price,status,revenue
EU,S3,2025-01-01,SKU-001,10150,4,79.99,paid,319.96
EU,S3,2025-01-03,SKU-002,10206,6,29.99,paid,179.94
EU,S3,2025-01-05,SKU-005,10110,5,79.99,pending,399.95
EU,S3,2025-01-06,SKU-004,10176,7,19.99,paid,139.93
EU,S3,2025-01-07,SKU-003,10090,5,99.99,paid,499.95
EU,S3,2025-01-08,SKU-005,10177,6,29.99,paid,179.94
EU,S3,2025-01-30,SKU-003,10249,3,49.99,paid,149.97
EU,S3,2025-02-02,SKU-001,10122,6,149.99,paid,899.94
EU,S3,2025-02-03,SKU-001,10082,2,99.99,refunded,199.98
EU,S3,2025-02-03,SKU-002,10129,6,49.99,paid,299.94
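To make the returned positions concrete, the sketch below runs `get_locs` on a tiny hypothetical three-level index (`toy_idx` and `toy` are illustrative) where the positions can be read off by hand, and compares the `.iloc` result with the equivalent `.loc` call.

```python
import pandas as pd

# Hypothetical sorted index: from_product yields rows in lexicographic order.
toy_idx = pd.MultiIndex.from_product(
    [['EU', 'NA'], ['S1', 'S3'], ['SKU-001', 'SKU-002']],
    names=['region', 'store', 'sku'],
)
toy = pd.DataFrame({'qty': range(8)}, index=toy_idx)

# One selector per level: list, scalar, list.
locs = toy.index.get_locs([['EU', 'NA'], 'S3', ['SKU-001']])

# Integer positions can be reused with iloc (or with NumPy arrays).
via_iloc = toy.iloc[locs]

# The label-based equivalent, for comparison.
via_loc = toy.loc[(['EU', 'NA'], 'S3', ['SKU-001']), :]
```

Holding on to `locs` is handy when the same selection has to be applied to several positionally aligned objects.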


## Problem 7 — `pd.eval` for mask computation + `.loc` selection

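`pd.eval` can be handy for concise expressions. Any speedup on large data comes from the optional `numexpr` engine; with `engine='python'` the expression is evaluated with ordinary pandas operations, so the gain is readability rather than speed.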

**Task**:
Select rows where:
- `status == 'paid'`
- `qty >= 3`
- `revenue >= 200`
- and region is not APAC

Return `ts`, `region`, `store`, `qty`, `revenue`.

Requirement: build the mask with `pd.eval` using `engine='python'` (portable: it does not require `numexpr` to be installed).


In [11]:
# SOLUTION
df_p7 = df.copy()

mask = pd.eval(
    "(status == 'paid') & (qty >= 3) & (revenue >= 200) & (region != 'APAC')",
    engine='python',
    parser='pandas',
    local_dict=df_p7
)

res_p7 = df_p7.loc[mask, ['ts','region','store','qty','revenue']]
res_p7.head(), res_p7.shape

(           ts region store  qty  revenue
 0  2025-01-01     EU    S3    4   319.96
 1  2025-01-01     EU    S3    5   249.95
 2  2025-01-01     NA    S1    5   749.95
 4  2025-01-01     EU    S4    5   249.95
 11 2025-01-01     NA    S4    5   499.95,
 (595, 5))
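A closely related option is `DataFrame.eval`, which resolves column names from the frame itself. The sketch below applies the same expression to a small hypothetical frame (`toy_events` is illustrative) and checks it against the mask written with explicit Series methods.

```python
import pandas as pd

# Hypothetical events, for illustration only.
toy_events = pd.DataFrame({
    'status': ['paid', 'paid', 'refunded', 'paid'],
    'qty': [3, 1, 5, 4],
    'revenue': [250.0, 500.0, 400.0, 320.0],
    'region': ['EU', 'NA', 'EU', 'APAC'],
})

expr = "(status == 'paid') & (qty >= 3) & (revenue >= 200) & (region != 'APAC')"

# DataFrame.eval looks up status, qty, revenue and region as columns.
mask_eval = toy_events.eval(expr, engine='python')

# The same logic with explicit Series methods, used as a cross-check.
mask_explicit = (
    toy_events['status'].eq('paid')
    & toy_events['qty'].ge(3)
    & toy_events['revenue'].ge(200)
    & toy_events['region'].ne('APAC')
)

selected_events = toy_events.loc[mask_eval, ['region', 'qty', 'revenue']]
```

Comparing the two masks in a test is a cheap way to catch typos inside the expression string, which no linter will check.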

## Problem 8 — Alignment trap: assigning DataFrame into a slice (and fixing it)

**Task**:
1. Take a slice of `df` (first 5 rows) with columns `['qty','unit_price']`.
2. Build a replacement `DataFrame` of the same shape but with a *different index*.
3. Assign it into the slice and observe the NaNs due to alignment.
4. Fix it by assigning `.to_numpy()` (positional), or by aligning indexes.


In [12]:
# SOLUTION
df_p8 = df.copy()

target = df_p8.loc[df_p8.index[:5], ['qty','unit_price']]

replacement = pd.DataFrame(
    {'qty': [99, 98, 97, 96, 95], 'unit_price': [1, 2, 3, 4, 5]},
    index=pd.Index([10, 11, 12, 13, 14], name='different_index')
)

# 3) alignment-based assignment: labels 0..4 and 10..14 share nothing, so every cell becomes NaN
df_p8.loc[df_p8.index[:5], ['qty','unit_price']] = replacement
broken = df_p8.loc[df_p8.index[:5], ['qty','unit_price']]

# 4a) positional fix
df_p8_fix = df.copy()
df_p8_fix.loc[df_p8_fix.index[:5], ['qty','unit_price']] = replacement.to_numpy()
fixed_positional = df_p8_fix.loc[df_p8_fix.index[:5], ['qty','unit_price']]

# 4b) label-align fix (make indexes match)
df_p8_fix2 = df.copy()
replacement_aligned = replacement.set_axis(df_p8_fix2.index[:5], axis=0)
df_p8_fix2.loc[df_p8_fix2.index[:5], ['qty','unit_price']] = replacement_aligned
fixed_aligned = df_p8_fix2.loc[df_p8_fix2.index[:5], ['qty','unit_price']]

target, broken, fixed_positional, fixed_aligned

(   qty  unit_price
 0    4       79.99
 1    5       49.99
 2    5      149.99
 3    2       29.99
 4    5       49.99,
    qty  unit_price
 0  NaN         NaN
 1  NaN         NaN
 2  NaN         NaN
 3  NaN         NaN
 4  NaN         NaN,
    qty  unit_price
 0   99         1.0
 1   98         2.0
 2   97         3.0
 3   96         4.0
 4   95         5.0,
    qty  unit_price
 0   99         1.0
 1   98         2.0
 2   97         3.0
 3   96         4.0
 4   95         5.0)
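The same trap applies when assigning a Series to a column. The sketch below, on a hypothetical three-row frame (`toy_frame` and `new_qty` are illustrative), shows a partial label overlap: only the matching label receives a value, and the rest become NaN.

```python
import pandas as pd

# Hypothetical frame and replacement values, for illustration only.
toy_frame = pd.DataFrame({'qty': [1, 2, 3]}, index=[0, 1, 2])

# A Series whose labels only partly overlap the frame's index.
new_qty = pd.Series([10, 20, 30], index=[2, 3, 4])

# Label-based assignment: only label 2 matches; labels 0 and 1 become NaN.
aligned = toy_frame.copy()
aligned['qty'] = new_qty

# Positional assignment: strip the labels first with to_numpy().
positional = toy_frame.copy()
positional['qty'] = new_qty.to_numpy()
```

Partial overlaps are the dangerous case in practice, because the result contains some valid values and the NaNs are easy to miss without a count check.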

## Problem 9 — Stable top-k per group via selection (no sorting pitfalls)

**Task**:
For each `store`, select the **top 3 rows by revenue** among `paid` rows.
Return: `ts`, `store`, `sku`, `revenue`.

Constraints:
- Must preserve deterministic behavior.
- Must not rely on accidental index order.

Hint: filter → sort by `['store','revenue']` (plus a tie-breaker such as `ts`) with a stable sort → `groupby.head(3)`.


In [13]:
# SOLUTION
df_p9 = df.copy()
paid = df_p9['status'].eq('paid')

res_p9 = (
    df_p9.loc[paid, ['ts','store','sku','revenue']]
         .sort_values(['store', 'revenue', 'ts'], ascending=[True, False, True], kind='mergesort')
         .groupby('store', sort=False, as_index=False)
         .head(3)
         .reset_index(drop=True)
)

res_p9

,ts,store,sku,revenue
0,2025-01-01,S1,SKU-039,1049.93
1,2025-01-23,S1,SKU-004,1049.93
2,2025-01-26,S1,SKU-022,1049.93
3,2025-01-01,S2,SKU-037,1049.93
4,2025-01-09,S2,SKU-011,1049.93
5,2025-02-03,S2,SKU-008,1049.93
6,2025-01-02,S3,SKU-033,1049.93
7,2025-01-05,S3,SKU-023,1049.93
8,2025-01-21,S3,SKU-025,1049.93
9,2025-01-18,S4,SKU-007,1049.93
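The many ties at `1049.93` show why the tie-breaker matters. The sketch below wraps the same sort-then-head pattern in a hypothetical helper, `top_k_per_store` (an illustrative name, not a pandas function), and checks that reversing the input rows does not change the result.

```python
import pandas as pd

# Hypothetical paid rows with tied revenues, for illustration only.
toy_paid = pd.DataFrame({
    'store': ['S1', 'S1', 'S1', 'S2', 'S2'],
    'ts': pd.to_datetime(
        ['2025-01-03', '2025-01-01', '2025-01-02', '2025-01-05', '2025-01-04']
    ),
    'revenue': [500.0, 500.0, 100.0, 200.0, 200.0],
})

def top_k_per_store(frame, k):
    """Top-k rows by revenue per store; ties go to the earliest ts."""
    return (
        frame.sort_values(
            ['store', 'revenue', 'ts'],
            ascending=[True, False, True],
            kind='mergesort',  # stable sort
        )
        .groupby('store', sort=False)
        .head(k)
        .reset_index(drop=True)
    )

top1 = top_k_per_store(toy_paid, 1)
top1_reversed_input = top_k_per_store(toy_paid.iloc[::-1], 1)
```

Rows that tie on every sort key would still fall back to input order, so add further keys (for example `sku`) if full determinism is required.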


## Problem 10 — `DataFrame.where` / `mask` for selection-driven transformation

**Task**:
Create a new column `net_revenue` such that:
- if `status == 'paid'` → `revenue`
- if `status == 'refunded'` → `-revenue`
- otherwise → `0`

Then select all rows where `net_revenue != 0` and show totals by region.

Requirement: use `where` / `mask` (not `apply`).


In [14]:
# SOLUTION
df_p10 = df.copy()

df_p10['net_revenue'] = 0.0
df_p10['net_revenue'] = df_p10['net_revenue'].mask(df_p10['status'].eq('paid'), df_p10['revenue'])
df_p10['net_revenue'] = df_p10['net_revenue'].mask(df_p10['status'].eq('refunded'), -df_p10['revenue'])

nonzero = df_p10['net_revenue'].ne(0)
selected = df_p10.loc[nonzero, ['ts','region','status','revenue','net_revenue']]

totals = selected.groupby('region', sort=False)['net_revenue'].sum().sort_values(ascending=False)

selected.head(), totals

(          ts region status  revenue  net_revenue
 0 2025-01-01     EU   paid   319.96       319.96
 1 2025-01-01     EU   paid   249.95       249.95
 2 2025-01-01     NA   paid   749.95       749.95
 3 2025-01-01     EU   paid    59.98        59.98
 4 2025-01-01     EU   paid   249.95       249.95,
 region
 EU      161717.08
 NA      155387.84
 APAC     63131.33
 Name: net_revenue, dtype: float64)
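The same rule can be written with `where`, which keeps values where the condition is True and replaces the rest, the mirror image of `mask`. This is a sketch on a hypothetical four-row frame (`toy_orders` is illustrative), not a replacement for the solution above.

```python
import pandas as pd

# Hypothetical orders, one per status, for illustration only.
toy_orders = pd.DataFrame({
    'status': ['paid', 'refunded', 'pending', 'chargeback'],
    'revenue': [100.0, 40.0, 70.0, 25.0],
})

is_paid = toy_orders['status'].eq('paid')
is_refunded = toy_orders['status'].eq('refunded')

# Step 1: refunded rows keep -revenue, every other row becomes 0.
net = (-toy_orders['revenue']).where(is_refunded, 0.0)

# Step 2: paid rows keep revenue, every other row keeps the step-1 value.
net = toy_orders['revenue'].where(is_paid, net)
```

Reading `where(cond, other)` as "keep where cond, else other" and `mask(cond, other)` as "replace where cond with other" helps avoid inverting the condition by accident.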