# Masking (Advanced)

This notebook contains **advanced** practice problems **with solutions** on **NumPy boolean masking**.

## Best practices you'll see
- Prefer **vectorized** operations over Python loops.
- Wrap each comparison in parentheses when combining with `&`, `|`, `~`: these operators bind more tightly than comparisons such as `<` and `==`.
- Keep masks in well-named variables when logic gets complex.
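- Use `np.where` for two-way choices, `np.select` for multi-way choices, and `np.isnan` / `np.isfinite` to detect missing or non-finite values.
- Be careful: indexing an N-dimensional array with a same-shape boolean mask returns a **1D** array of the selected elements.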
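
The precedence point is easiest to see on a tiny array. The sketch below uses a made-up array `a` (not part of the problems) and shows that the unparenthesized form is parsed as a chained comparison, which fails because it needs the truth value of a whole array.

```python
import numpy as np

a = np.array([1, 5, 9])  # illustrative data only

# Correct: each comparison is parenthesized before combining with &
inside = (a > 2) & (a < 8)

# Without parentheses, `a > 2 & a < 8` is read as `a > (2 & a) < 8`,
# a chained comparison that asks for the truth value of an array
try:
    a > 2 & a < 8
    raised = False
except ValueError:
    raised = True

print(inside, raised)
```

The failure is loud here, but with scalars or other operand types a missing pair of parentheses can silently give a wrong mask, so parenthesizing every comparison is the safer habit.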

> Tip: Try each problem yourself first by running the cells under **Problem** before revealing the **Solution** cells.

In [1]:
import numpy as np

np.set_printoptions(suppress=True, precision=4)

rng = np.random.default_rng(42)  # reproducible randomness

## Problem 1 — Multi-condition masking + masked assignment

You are given an array of sensor readings. Some are invalid:
- values `< 0` are invalid
- values `> 100` are invalid
- `np.nan` means "missing"

**Task**
1. Create a boolean mask for **valid** values.
2. Compute the **mean of valid values** (ignore invalid and NaN).
3. Create a cleaned copy of the array where **invalid and missing values are replaced with that mean**.

**Constraints**
- Do not use Python loops.
- Do not use `np.nanmean` (do it via masking).

In [2]:
x = np.array([10.0, np.nan, 25.0, -3.0, 80.0, 150.0, 60.0, np.nan, 0.0])
x

array([ 10.,  nan,  25.,  -3.,  80., 150.,  60.,  nan,   0.])

### Solution 1

In [3]:
# 1) valid mask: finite AND within range
valid = np.isfinite(x) & (x >= 0) & (x <= 100)
valid

array([ True, False,  True, False,  True, False,  True, False,  True])

In [4]:
# 2) mean of valid values (no loops, no nanmean)
valid_mean = x[valid].sum() / valid.sum()
valid_mean

np.float64(35.0)

In [5]:
# 3) cleaned copy: replace everything that's NOT valid
x_clean = x.copy()
x_clean[~valid] = valid_mean
x_clean

array([10., 35., 25., 35., 80., 35., 60., 35.,  0.])
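
The same result can be reached without masked assignment. As a minimal sketch, the cell below repeats the Problem 1 data so it runs on its own, takes the mean with `.mean()` on the masked selection, and builds the cleaned array with `np.where`, which returns a new array and leaves `x` untouched.

```python
import numpy as np

x = np.array([10.0, np.nan, 25.0, -3.0, 80.0, 150.0, 60.0, np.nan, 0.0])
valid = np.isfinite(x) & (x >= 0) & (x <= 100)

# .mean() on the masked selection is the same as sum / count
valid_mean = x[valid].mean()

# Keep x where valid, otherwise substitute the mean; x itself is not modified
x_clean = np.where(valid, x, valid_mean)

print(valid_mean, x_clean)
```

`np.where` is convenient when the original array must stay unchanged; copy plus masked assignment is the better fit when several replacement rules are applied one after another.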

## Problem 2 — Row filtering in a 2D array (keep rows)

You have a `transactions` matrix with columns:
- `amount`
- `fee`
- `is_refund` (0 or 1)

**Task**
Select the rows that satisfy **all** of the following:
- `amount` is between 20 and 200 inclusive
- `fee` is less than 5
- `is_refund` is 0

Then compute the **net amount** for those rows: `amount - fee`.

**Constraints**
- Use boolean masks.
- The filtered result should be a 1D vector of net amounts.

In [6]:
# columns: [amount, fee, is_refund]
transactions = np.array([
    [15.0,  1.2, 0],
    [25.0,  0.5, 0],
    [250.0, 2.0, 0],
    [120.0, 5.1, 0],
    [80.0,  2.5, 1],
    [199.0, 4.9, 0],
    [20.0,  0.0, 0],
])
transactions

array([[ 15. ,   1.2,   0. ],
       [ 25. ,   0.5,   0. ],
       [250. ,   2. ,   0. ],
       [120. ,   5.1,   0. ],
       [ 80. ,   2.5,   1. ],
       [199. ,   4.9,   0. ],
       [ 20. ,   0. ,   0. ]])

### Solution 2

In [7]:
amount = transactions[:, 0]
fee = transactions[:, 1]
is_refund = transactions[:, 2].astype(bool)

mask = (amount >= 20) & (amount <= 200) & (fee < 5) & (~is_refund)
mask

array([False,  True, False, False, False,  True,  True])

In [8]:
filtered = transactions[mask]
net = filtered[:, 0] - filtered[:, 1]
net

array([ 24.5, 194.1,  20. ])
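
Filtering rows first and then subtracting is one order of operations; subtracting first and then masking gives the same vector. The sketch below repeats the Problem 2 data so it runs on its own and unpacks the columns from the transpose.

```python
import numpy as np

# columns: [amount, fee, is_refund]
transactions = np.array([
    [15.0,  1.2, 0],
    [25.0,  0.5, 0],
    [250.0, 2.0, 0],
    [120.0, 5.1, 0],
    [80.0,  2.5, 1],
    [199.0, 4.9, 0],
    [20.0,  0.0, 0],
])

# Each row of the transpose is one column of the original matrix
amount, fee, is_refund = transactions.T

mask = (amount >= 20) & (amount <= 200) & (fee < 5) & (is_refund == 0)

# Compute net for every row, then select with the same mask
net = (amount - fee)[mask]

print(net)
```

Because every array shares one row mask, the amounts and fees stay aligned whichever order is used.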

## Problem 3 — Column-wise conditions (mask per column, then reduce)

You have a matrix `A` (shape `(n, m)`).

**Task**
Find the indices of rows where **at least 3 elements are negative**.

**Constraints**
- Use boolean masks and vectorized reductions.
- Output should be a 1D array of row indices.

In [9]:
A = rng.integers(-5, 6, size=(8, 6))
A

array([[-5,  3,  2, -1, -1,  4],
       [-5,  2, -3, -4,  0,  5],
       [ 3,  3,  2,  3,  0, -4],
       [ 4, -1,  0, -1, -3,  5],
       [ 3,  2, -1,  4,  0, -1],
       [-1, -3, -4,  1,  4, -5],
       [ 4,  4, -2,  1, -4,  3],
       [ 2, -2, -5,  5, -1,  4]])

### Solution 3

In [10]:
neg = A < 0
neg_counts = neg.sum(axis=1)
row_idx = np.where(neg_counts >= 3)[0]

neg_counts, row_idx

(array([3, 3, 1, 3, 2, 4, 2, 3]), array([0, 1, 3, 5, 7]))
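
Two NumPy helpers express the same idea a little more directly. As a sketch, the cell below hard-codes the matrix printed above (so it does not depend on the random generator), counts the `True` entries per row with `np.count_nonzero`, and gets the row indices with `np.flatnonzero`.

```python
import numpy as np

# Same values as the matrix A printed above, hard-coded for a self-contained cell
A = np.array([
    [-5,  3,  2, -1, -1,  4],
    [-5,  2, -3, -4,  0,  5],
    [ 3,  3,  2,  3,  0, -4],
    [ 4, -1,  0, -1, -3,  5],
    [ 3,  2, -1,  4,  0, -1],
    [-1, -3, -4,  1,  4, -5],
    [ 4,  4, -2,  1, -4,  3],
    [ 2, -2, -5,  5, -1,  4],
])

# Count the True entries of the mask along each row (axis=1)
neg_counts = np.count_nonzero(A < 0, axis=1)

# Indices of the True entries of a 1D mask, with no tuple to unpack
row_idx = np.flatnonzero(neg_counts >= 3)

print(neg_counts, row_idx)
```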

## Problem 4 — Keep 2D structure: masking rows vs masking elements

A common pitfall: `M[M > 0]` returns a **1D** array of selected elements.

You are given a matrix `M`.

**Task**
1. Create a matrix `P` where all negative values are replaced by `0`, but the **shape stays the same**.
2. Create a vector containing only the positive elements (1D is fine here).

**Constraints**
- Do not use Python loops.
- Use masking and/or `np.where`.

In [11]:
M = np.array([
    [-3,  2, -1,  4],
    [ 0, -2,  5, -6],
    [ 7, -8,  9, -1],
])
M

array([[-3,  2, -1,  4],
       [ 0, -2,  5, -6],
       [ 7, -8,  9, -1]])

### Solution 4

In [12]:
# 1) preserve shape
P = np.where(M < 0, 0, M)
P

array([[0, 2, 0, 4],
       [0, 0, 5, 0],
       [7, 0, 9, 0]])

In [13]:
# 2) positive elements only (1D)
pos = M[M > 0]
pos

array([2, 4, 5, 7, 9])
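
A few other shape-preserving options are worth comparing. The sketch below repeats `M` so it runs on its own; the row filter at the end uses an arbitrary threshold of 6 purely for illustration.

```python
import numpy as np

M = np.array([
    [-3,  2, -1,  4],
    [ 0, -2,  5, -6],
    [ 7, -8,  9, -1],
])

# Masked assignment on a copy keeps the (3, 4) shape
P = M.copy()
P[P < 0] = 0

# For this particular rule, an element-wise maximum with 0 gives the same matrix
P_alt = np.maximum(M, 0)

# A 1D row mask selects whole rows, so the result stays 2D
rows_with_big = M[(M > 6).any(axis=1)]

print(P.shape, rows_with_big.shape, M[M > 0].shape)
```

The rule of thumb: an element mask with the same shape as the matrix flattens the selection, while a 1D mask over rows (or masked assignment) keeps the 2D layout.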

## Problem 5 — Boolean masks + `np.select` for multi-class labeling

You have exam scores for students.

**Task**
Create a `labels` array (dtype string) with:
- `"A"` for scores >= 90
- `"B"` for 80–89
- `"C"` for 70–79
- `"D"` for 60–69
- `"F"` for < 60

**Constraints**
- Use boolean masks with `np.select` (not loops).
- Ensure the result is the same shape as `scores`.

In [14]:
scores = np.array([95, 83, 77, 61, 59, 100, 89, 70, 69, 0])
scores

array([ 95,  83,  77,  61,  59, 100,  89,  70,  69,   0])

### Solution 5

In [15]:
conditions = [
    scores >= 90,
    scores >= 80,
    scores >= 70,
    scores >= 60,
]
choices = ["A", "B", "C", "D"]

labels = np.select(conditions, choices, default="F")
labels

array(['A', 'B', 'C', 'D', 'F', 'A', 'B', 'C', 'D', 'F'], dtype='<U1')
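
The solution works because `np.select` uses the first condition that is `True` for each element, so the thresholds must be listed from highest to lowest. The sketch below, on a shortened score list, shows what happens when the order is reversed.

```python
import numpy as np

scores = np.array([95, 83, 77, 61, 59])  # shortened list for illustration

# Descending thresholds: the first True condition wins, so each band is correct
good = np.select(
    [scores >= 90, scores >= 80, scores >= 70, scores >= 60],
    ["A", "B", "C", "D"],
    default="F",
)

# Ascending thresholds: `scores >= 60` matches first for every passing score
bad = np.select(
    [scores >= 60, scores >= 70, scores >= 80, scores >= 90],
    ["D", "C", "B", "A"],
    default="F",
)

print(good, bad)
```

If the order cannot be guaranteed, writing each band as a closed range, for example `(scores >= 80) & (scores < 90)`, makes the conditions mutually exclusive and the order irrelevant.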

## Problem 6 — Time series style filtering + aligned arrays

You have aligned arrays:
- `dates` (strings)
- `open_`, `high`, `low`, `close` (floats)
- `volume` (ints)

**Task**
1. Find the days where **close > open** ("green" days).
2. Among those, keep only days whose **volume is in the top 25%** of all days (at or above the 75th percentile).
3. Return a 2D array with columns `[date, open, close, volume]` for the selected days.

**Constraints**
- Use masks; no loops.
- Use `np.percentile` for the top-25% volume threshold.
- Keep everything aligned (same row selection for all arrays).

In [16]:
dates = np.array([
    "2025-12-01", "2025-12-02", "2025-12-03", "2025-12-04", "2025-12-05",
    "2025-12-08", "2025-12-09", "2025-12-10", "2025-12-11", "2025-12-12",
])

open_  = np.array([100.0, 101.5,  98.2,  99.0, 102.3, 101.0, 103.5, 104.0, 102.0, 105.0])
close  = np.array([101.2, 100.1,  99.5,  98.8, 103.0, 100.5, 104.2, 102.0, 103.5, 106.5])
high   = np.maximum(open_, close) + rng.random(size=open_.shape) * 1.5
low    = np.minimum(open_, close) - rng.random(size=open_.shape) * 1.5
volume = rng.integers(1_000_000, 5_000_000, size=open_.shape)

dates, open_, close, volume

(array(['2025-12-01', '2025-12-02', '2025-12-03', '2025-12-04',
        '2025-12-05', '2025-12-08', '2025-12-09', '2025-12-10',
        '2025-12-11', '2025-12-12'], dtype='<U10'),
 array([100. , 101.5,  98.2,  99. , 102.3, 101. , 103.5, 104. , 102. ,
        105. ]),
 array([101.2, 100.1,  99.5,  98.8, 103. , 100.5, 104.2, 102. , 103.5,
        106.5]),
 array([4071375, 4329039, 2740866, 4219057, 4366767, 2549913, 4592350,
        2153312, 1958231, 3729982]))

### Solution 6

In [17]:
# 1) green days
green = close > open_

# 2) top 25% volume threshold
threshold = np.percentile(volume, 75)
high_vol = volume >= threshold

# combined mask
mask = green & high_vol
mask, threshold

(array([False, False, False, False,  True, False,  True, False, False,
        False]),
 np.float64(4301543.5))

In [18]:
# 3) assemble selected columns; cast each column to object *before* stacking,
#    otherwise column_stack coerces the numbers to strings
result = np.column_stack([
    dates[mask].astype(object),
    open_[mask].astype(object),
    close[mask].astype(object),
    volume[mask].astype(object),
])

result

array([['2025-12-05', 102.3, 103.0, 4366767],
       ['2025-12-09', 103.5, 104.2, 4592350]], dtype=object)
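
If the numeric columns need to keep their own dtypes for later arithmetic, a structured array is one alternative to a single 2D array. The sketch below hard-codes the two selected rows so it runs on its own; the field names and the volume cutoff are hypothetical choices, and the result is a 1D array of records rather than a 2D matrix.

```python
import numpy as np

# The two rows selected above, hard-coded for a self-contained cell
dates = np.array(["2025-12-05", "2025-12-09"])
open_ = np.array([102.3, 103.5])
close = np.array([103.0, 104.2])
volume = np.array([4366767, 4592350])

# Hypothetical field names; each field keeps its own dtype
row_dtype = np.dtype([
    ("date", "U10"),
    ("open", "f8"),
    ("close", "f8"),
    ("volume", "i8"),
])

table = np.empty(dates.shape, dtype=row_dtype)
table["date"] = dates
table["open"] = open_
table["close"] = close
table["volume"] = volume

# Boolean masks still work, built from a single field
busy = table[table["volume"] > 4_400_000]

print(table["close"].dtype, busy["date"])
```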

## Extra mini-check (optional) — mask debugging pattern

When a mask gets complicated, compute the intermediate masks separately and sanity-check how many elements each one selects.

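Example, using the arrays from Problem 6:
```python
m1 = close > open_                        # green days
m2 = volume >= np.percentile(volume, 75)  # high-volume days
mask = m1 & m2
print(m1.sum(), m2.sum(), mask.sum())     # rows selected by each mask and by both
```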