# Selecting Data — Advanced Practice (with Solutions)

This notebook focuses on **robust, idiomatic Pandas data selection**.

## Best practices you'll see throughout
- Prefer **`.loc` / `.iloc`** for selection.
- Use **`.at` / `.iat`** for single scalar lookups.
- Avoid **chained indexing** (e.g., `df[df['x']>0]['y'] = ...`) → use `.loc`.
- Be mindful of **alignment** when assigning from a `Series`/`DataFrame`.
- Use clear boolean masks (`mask = ...`) and reuse them.
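
As a minimal sketch of the first three points, the toy frame below (its column names `x` and `y` are illustrative and not part of the orders dataset) builds a mask once, writes through a single `.loc` call, and reads one scalar with `.at`:

```python
import pandas as pd

# Toy frame; `x` and `y` are hypothetical column names used only here.
df = pd.DataFrame({"x": [-1, 2, 3], "y": [10, 20, 30]})

mask = df["x"] > 0  # build the mask once and reuse it

# One .loc call selects rows and column together, so the write lands in df.
# The chained form df[mask]["y"] = 0 would write to a temporary object instead.
df.loc[mask, "y"] = 0

# Single scalar lookup by row label and column name.
first_y = df.at[0, "y"]
```

The chained form is deliberately left as a comment: depending on the pandas version it either warns or silently leaves `df` unchanged.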


In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.width', 120)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

np.random.seed(42)

## Setup: a realistic dataset

We'll work with a small "orders" dataset that includes:
- a `DatetimeIndex`
- categories (`region`, `product`)
- numbers (`qty`, `unit_price`)
- and a computed `revenue` column


In [2]:
dates = pd.date_range('2025-01-01', periods=12, freq='D')
orders = pd.DataFrame(
    {
        'order_id': np.arange(1001, 1013),
        'region': ['EU', 'EU', 'NA', 'APAC', 'EU', 'NA', 'NA', 'APAC', 'EU', 'EU', 'APAC', 'NA'],
        'product': ['A', 'B', 'A', 'C', 'C', 'B', 'A', 'B', 'A', 'C', 'C', 'B'],
        'qty': [1, 4, 2, 3, 5, 1, 6, 2, 3, 2, 4, 1],
        'unit_price': [120, 80, 120, 60, 60, 80, 120, 80, 120, 60, 60, 80],
        'status': ['paid', 'paid', 'refunded', 'paid', 'paid', 'paid', 'paid', 'refunded', 'paid', 'paid', 'paid', 'paid'],
    },
    index=dates
)
orders['revenue'] = orders['qty'] * orders['unit_price']
orders

,order_id,region,product,qty,unit_price,status,revenue
2025-01-01,1001,EU,A,1,120,paid,120
2025-01-02,1002,EU,B,4,80,paid,320
2025-01-03,1003,NA,A,2,120,refunded,240
2025-01-04,1004,APAC,C,3,60,paid,180
2025-01-05,1005,EU,C,5,60,paid,300
2025-01-06,1006,NA,B,1,80,paid,80
2025-01-07,1007,NA,A,6,120,paid,720
2025-01-08,1008,APAC,B,2,80,refunded,160
2025-01-09,1009,EU,A,3,120,paid,360
2025-01-10,1010,EU,C,2,60,paid,120
2025-01-11,1011,APAC,C,4,60,paid,240
2025-01-12,1012,NA,B,1,80,paid,80


## Problem 1 — Multi-condition filtering + selective columns

**Task**: Select orders that satisfy **all** conditions:
- `region` is either `EU` or `NA`
- `status` is `paid`
- `qty >= 2`

Return only columns: `order_id`, `region`, `product`, `qty`, `revenue`.

**Goal**: practice clean boolean masks + `.loc[row_mask, col_list]`.


In [3]:
# SOLUTION
mask_region = orders['region'].isin(['EU', 'NA'])
mask_status = orders['status'].eq('paid')
mask_qty = orders['qty'].ge(2)

result_p1 = orders.loc[mask_region & mask_status & mask_qty, ['order_id', 'region', 'product', 'qty', 'revenue']]
result_p1

,order_id,region,product,qty,revenue
2025-01-02,1002,EU,B,4,320
2025-01-05,1005,EU,C,5,300
2025-01-07,1007,NA,A,6,720
2025-01-09,1009,EU,A,3,360
2025-01-10,1010,EU,C,2,120
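
A related point when combining masks: the comparison-operator form needs parentheses around each condition, while the method form (`.eq`, `.ge`) does not. The sketch below uses a hypothetical three-row frame `toy` to show both, plus `~` for negation:

```python
import pandas as pd

# Hypothetical toy data, not the orders dataset.
toy = pd.DataFrame({"qty": [1, 3, 5], "status": ["paid", "refunded", "paid"]})

# Operator form: each comparison needs its own parentheses because
# `&` binds more tightly than `>=` and `==`.
mask_ops = (toy["qty"] >= 2) & (toy["status"] == "paid")

# Method form: .ge / .eq chain naturally and need no extra parentheses.
mask_methods = toy["qty"].ge(2) & toy["status"].eq("paid")

# `~` negates a mask: every row that did not match.
rest = toy.loc[~mask_ops]
```

Both masks select only the last row; `rest` holds the other two.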


## Problem 2 — DatetimeIndex slicing + endpoint behavior

**Task**:
1. Select rows from **2025-01-03** through **2025-01-07** (inclusive).
2. From those rows, return only `region`, `qty`, `revenue`.

Tip: With `.loc` and label slicing, the endpoint is **included** for ordered indexes like dates.
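
The endpoint rule can be sketched on a toy series (the series `s` is illustrative and separate from `orders`): a label slice with `.loc` keeps both endpoints, while a positional slice with `.iloc` excludes the stop position.

```python
import pandas as pd

idx = pd.date_range("2025-01-01", periods=5, freq="D")
s = pd.Series(range(5), index=idx)

# Label slice: Jan 2, Jan 3 and Jan 4 are all returned.
by_label = s.loc["2025-01-02":"2025-01-04"]

# Positional slice: positions 1 and 2 only; position 3 is excluded.
by_position = s.iloc[1:3]
```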


In [4]:
# SOLUTION
result_p2 = orders.loc['2025-01-03':'2025-01-07', ['region', 'qty', 'revenue']]
result_p2

,region,qty,revenue
2025-01-03,NA,2,240
2025-01-04,APAC,3,180
2025-01-05,EU,5,300
2025-01-06,NA,1,80
2025-01-07,NA,6,720


## Problem 3 — `.iloc` positional selection + `take`

**Task**:
- Select the **1st, 4th, and last** row by position.
- Return columns: `order_id`, `product`, `status`.

Do it two ways:
1. Using `.iloc`
2. Using `.take` (nice when you already have integer positions)


In [5]:
# SOLUTION 1: iloc
pos = [0, 3, len(orders) - 1]
result_p3_iloc = orders.iloc[pos, :][['order_id', 'product', 'status']]

# SOLUTION 2: take
result_p3_take = orders.take(pos)[['order_id', 'product', 'status']]

result_p3_iloc, result_p3_take

(            order_id product status
 2025-01-01      1001       A   paid
 2025-01-04      1004       C   paid
 2025-01-12      1012       B   paid,
             order_id product status
 2025-01-01      1001       A   paid
 2025-01-04      1004       C   paid
 2025-01-12      1012       B   paid)
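
A possible third variant does everything in one `.iloc` call by translating column names to integer positions with `Index.get_indexer`. This is a sketch on a hypothetical frame `toy`, not the notebook's official solution:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {"order_id": [1, 2, 3, 4], "product": list("ABCD"), "status": ["paid"] * 4}
)

rows = [0, -1]  # first and last row; negative positions count from the end
cols = toy.columns.get_indexer(["order_id", "status"])  # names -> positions

# Rows and columns are both selected by position in a single step.
picked = toy.iloc[rows, cols]
```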

## Problem 4 — MultiIndex: selecting with `.loc` and `.xs`

We'll create a MultiIndex view by `(region, product, date)`.

**Task**:
1. Build `mi = orders.set_index(['region', 'product'], append=True).reorder_levels(['region', 'product', None]).sort_index()`
   - (This orders the levels as `region`, `product`, date. The original `DatetimeIndex` is unnamed, so its level is referenced as `None`.)
2. Select all rows for `region='EU'` and `product='C'`.
3. Then select all rows for `product='B'` across **all regions** using `.xs`.

**Goal**: comfort with MultiIndex selection patterns.
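
Before the full solution, here is a small sketch of the three selection patterns on a hypothetical two-level frame `toy`. It also shows `pd.IndexSlice`, which is not required by the task but keeps all index levels in the result:

```python
import pandas as pd

# Hypothetical toy data with a sorted two-level index.
toy = (
    pd.DataFrame(
        {
            "region": ["EU", "EU", "NA", "NA"],
            "product": ["A", "B", "A", "B"],
            "qty": [1, 2, 3, 4],
        }
    )
    .set_index(["region", "product"])
    .sort_index()
)

eu_b = toy.loc[("EU", "B"), "qty"]              # full key plus column -> scalar
all_b = toy.xs("B", level="product")            # drops the `product` level
all_b_kept = toy.loc[pd.IndexSlice[:, "B"], :]  # same rows, both levels kept
```

Sorting the index first matters here: slicing a MultiIndex with `pd.IndexSlice` expects a lexsorted index.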


In [6]:
# SOLUTION
mi = (
    orders
    .set_index(['region', 'product'], append=True)
    .reorder_levels(['region', 'product', None])
    .sort_index()
)

# 2) EU + C
result_p4_eu_c = mi.loc[('EU', 'C')]

# 3) product B across all regions
result_p4_product_b = mi.xs('B', level='product')

mi.head(8), result_p4_eu_c, result_p4_product_b.head()

(                           order_id  qty  unit_price    status  revenue
 region product                                                         
 APAC   B       2025-01-08      1008    2          80  refunded      160
        C       2025-01-04      1004    3          60      paid      180
                2025-01-11      1011    4          60      paid      240
 EU     A       2025-01-01      1001    1         120      paid      120
                2025-01-09      1009    3         120      paid      360
        B       2025-01-02      1002    4          80      paid      320
        C       2025-01-05      1005    5          60      paid      300
                2025-01-10      1010    2          60      paid      120,
             order_id  qty  unit_price status  revenue
 2025-01-05      1005    5          60   paid      300
 2025-01-10      1010    2          60   paid      120,
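                    order_id  qty  unit_price    status  revenue
 region                                                         
 APAC   2025-01-08      1008    2          80  refunded      160
 EU     2025-01-02      1002    4          80      paid      320
 NA     2025-01-06      1006    1          80      paid       80
        2025-01-12      1012    1          80      paid       80)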

## Problem 5 — Avoiding chained indexing + safe assignment

**Task**:
Apply a 10% discount to `unit_price` for rows where:
- `status == 'paid'`
- `region == 'EU'`
- `product` in `['A', 'C']`

Then recompute `revenue`.

**Constraints**:
- Do **not** use chained indexing.
- Do the update in-place using `.loc`.


In [7]:
# SOLUTION
orders_p5 = orders.copy()

mask = (
    orders_p5['status'].eq('paid')
    & orders_p5['region'].eq('EU')
    & orders_p5['product'].isin(['A', 'C'])
)

# safe assignment with .loc
orders_p5.loc[mask, 'unit_price'] = orders_p5.loc[mask, 'unit_price'] * 0.9
orders_p5['revenue'] = orders_p5['qty'] * orders_p5['unit_price']

orders_p5.loc[mask, ['order_id', 'region', 'product', 'qty', 'unit_price', 'revenue']]

,order_id,region,product,qty,unit_price,revenue
2025-01-01,1001,EU,A,1,108,108
2025-01-05,1005,EU,C,5,54,270
2025-01-09,1009,EU,A,3,108,324
2025-01-10,1010,EU,C,2,54,108
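
One caveat: the discounted prices here (108 and 54) happen to be whole numbers, so the integer `unit_price` column can still hold them. With a price that does not discount to a whole number, writing floats into an integer column may trigger a dtype warning or error, depending on the pandas version. A defensive sketch, using a hypothetical frame `toy` with a price of 99, casts the column to float first:

```python
import pandas as pd

# Hypothetical toy data; 99 * 0.9 is not a whole number.
toy = pd.DataFrame({"qty": [1, 2, 3], "unit_price": [120, 60, 99]})
mask = toy["qty"] >= 2

# Cast up front so fractional prices fit the column's dtype.
toy["unit_price"] = toy["unit_price"].astype("float64")

# Same single-step .loc assignment as in the solution.
toy.loc[mask, "unit_price"] = toy.loc[mask, "unit_price"] * 0.9
toy["revenue"] = toy["qty"] * toy["unit_price"]
```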


## Problem 6 — Alignment gotcha when assigning from a Series

Pandas aligns by index labels when assigning a `Series`.

**Task**:
1. Create a `Series` of 3 values with an index that is a **subset** of `orders.index`.
2. Assign it into a new column `promo_code` for those dates.
3. Then show a version that assigns **positionally** instead (ignoring labels).

**Goal**: understand label alignment vs positional assignment.


In [8]:
# SOLUTION
orders_p6 = orders.copy()

# 1) + 2) label-aligned assignment: pandas matches s to rows by index label,
#    and dates that are missing from s become NaN
s = pd.Series(
    ['P10', 'P20', 'P30'],
    index=pd.to_datetime(['2025-01-02', '2025-01-05', '2025-01-12']),
)
orders_p6['promo_code'] = s

label_aligned = orders_p6.loc['2025-01-01':'2025-01-06', ['order_id', 'promo_code']]

# 3) positional assignment (ignores labels) into 3 rows chosen by position
orders_p6b = orders.copy()
# start from an object column so the string codes fit without a dtype warning
orders_p6b['promo_code'] = pd.Series(np.nan, index=orders_p6b.index, dtype='object')
pos_rows = [1, 4, 11]  # positions of the 3 dates above (positional on purpose)
orders_p6b.iloc[pos_rows, orders_p6b.columns.get_loc('promo_code')] = s.to_numpy()
positional = orders_p6b.loc[s.index, ['order_id', 'promo_code']]

label_aligned, positional

(            order_id promo_code
 2025-01-01      1001        NaN
 2025-01-02      1002        P10
 2025-01-03      1003        NaN
 2025-01-04      1004        NaN
 2025-01-05      1005        P20
 2025-01-06      1006        NaN,
             order_id promo_code
 2025-01-02      1002        P10
 2025-01-05      1005        P20
 2025-01-12      1012        P30)
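
In the solution, both approaches put the codes on the same rows, so the difference is easy to miss. It becomes visible when the `Series` carries the same labels in a different order, as in this sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: same labels, reversed order in the Series.
df = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "b", "c"])
s = pd.Series([10, 20, 30], index=["c", "b", "a"])

df["by_label"] = s                # aligned on index labels: a->30, b->20, c->10
df["by_position"] = s.to_numpy()  # labels dropped: values placed in given order
```

Converting to a NumPy array is the explicit way to say "ignore the labels"; assigning the `Series` directly always aligns.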

## Problem 7 — `.query()` + external variables + `.between()`

**Task**:
1. Use `.query()` to select rows where:
   - `region != 'APAC'`
   - `status == 'paid'`
   - `qty` is between 2 and 5 inclusive
2. Use an external Python variable `min_rev` and keep only rows with `revenue >= min_rev`.

Return `order_id`, `region`, `qty`, `revenue`.


In [9]:
# SOLUTION
min_rev = 300

result_p7 = (
    orders
    .query("region != 'APAC' and status == 'paid' and qty.between(2, 5)")
    .query("revenue >= @min_rev")
    .loc[:, ['order_id', 'region', 'qty', 'revenue']]
)

result_p7

,order_id,region,qty,revenue
2025-01-02,1002,EU,4,320
2025-01-05,1005,EU,5,300
2025-01-09,1009,EU,3,360
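
For comparison, the same kind of filter can be written with plain boolean masks. The sketch below uses a hypothetical frame `toy` and checks that `.query()` with an `@` variable and the mask version agree:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {
        "region": ["EU", "APAC", "NA", "EU"],
        "qty": [2, 3, 6, 5],
        "revenue": [100, 400, 700, 350],
    }
)
min_rev = 300

# String expression; @min_rev refers to the Python variable above.
via_query = toy.query(
    "region != 'APAC' and qty >= 2 and qty <= 5 and revenue >= @min_rev"
)

# Equivalent boolean masks; .between is inclusive on both ends by default.
mask = (
    toy["region"].ne("APAC")
    & toy["qty"].between(2, 5)
    & toy["revenue"].ge(min_rev)
)
via_mask = toy.loc[mask]
```

Which form to prefer is largely a readability choice: `.query()` is compact for long conditions, while masks can be named and reused.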


## Problem 8 — Fast scalar access with `.at` / `.iat` + defensive selection

**Task**:
1. Retrieve the scalar `status` on date `2025-01-04` using `.at`.
2. Retrieve the scalar `order_id` on the 6th row (position 5) using `.iat`.
3. Safely select columns `['order_id', 'region', 'missing_col']` **without raising** an error.
   - (Hint: use `.reindex(columns=...)`)


In [11]:
# SOLUTION
status_0104 = orders.at[pd.Timestamp('2025-01-04'), 'status']

order_id_pos5 = orders.iat[5, orders.columns.get_loc('order_id')]

safe_cols = ['order_id', 'region', 'missing_col']
safe_selected = orders.reindex(columns=safe_cols)

status_0104, order_id_pos5, safe_selected.head()

('paid',
 np.int64(1006),
             order_id region  missing_col
 2025-01-01      1001     EU          NaN
 2025-01-02      1002     EU          NaN
 2025-01-03      1003     NA          NaN
 2025-01-04      1004   APAC          NaN
 2025-01-05      1005     EU          NaN)
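
`reindex` pads missing columns with `NaN`. If unknown names should be dropped instead, filtering the requested list against `df.columns` is one option. The sketch below (on a hypothetical frame `toy`) shows both, and also that `.at` raises `KeyError` for an unknown label, so scalar lookups need a guard when the key may be absent:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"order_id": [1, 2], "region": ["EU", "NA"]})
wanted = ["order_id", "region", "missing_col"]

# Option 1: keep every requested column; absent ones are filled with NaN.
padded = toy.reindex(columns=wanted)

# Option 2: keep only the requested columns that actually exist.
present = [c for c in wanted if c in toy.columns]
trimmed = toy.loc[:, present]

# .at does not tolerate unknown labels; row label 99 does not exist.
try:
    toy.at[99, "region"]
    found = True
except KeyError:
    found = False
```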