# Lab 04 — Pandas Selection & Filtering

**Focus Area:** Turning messy raw data into consistent, joined datasets using Pandas selection & filtering (`loc`, boolean masks, `query`, chained masks)

## Outcomes

By the end of this lab, you will be able to:

1. Construct correct boolean masks using `&`, `|`, and `~` with parentheses.
2. Select rows/columns using `df.loc[row_mask, col_list]` vs. positional `iloc`.
3. Apply chained masks safely without the chained‑assignment trap.
4. Use `DataFrame.query` for readable filters (and when to prefer it vs. masks).
5. Build reusable filter functions and compose them for clarity.

## Prerequisites & Setup

Create a synthetic dataset for practicing selection and filtering operations.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

# Set random seed for reproducibility
rng = np.random.default_rng(42)
n = 1000

# Create synthetic users dataset
users = pd.DataFrame({
    'user_id': np.arange(n),
    'age': rng.integers(16, 80, size=n),
    'country': rng.choice(['US','U.S.A.','USA','SG','DE','BR','IN'], size=n, p=[.25,.05,.1,.15,.15,.15,.15]),
    'sessions': rng.poisson(3, size=n),
    'avg_session_sec': rng.normal(300, 60, size=n).clip(30, 1500),
    'spend_usd': np.round(rng.lognormal(mean=3.0, sigma=0.7, size=n), 2)
})

print(f"Created dataset with {len(users)} users")
users.head()

## Part A — Boolean Masks & Parentheses

### A1. Build masks correctly

We'll create boolean masks for different conditions and combine them using `&` (AND), `|` (OR), and `~` (NOT) operators.

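**Important:** Wrap every inline comparison in parentheses before combining it with `&`, `|`, or `~`, e.g. `(users['age'] >= 18) & (users['sessions'] >= 5)`. Masks already stored in variables can be combined directly.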

In [None]:
# Create individual boolean masks
adults = users['age'] >= 18
heavy_users = users['sessions'] >= 5
us_like = users['country'].isin(['US','U.S.A.','USA'])

# Combine the named masks (each is already a boolean Series, so no extra parentheses are needed)
mask = adults & heavy_users & us_like

# Select rows and a subset of columns
view = users.loc[mask, ['user_id','age','country','sessions','avg_session_sec']]
print(f"Filtered to {len(view)} users out of {len(users)}")
print(f"\nShape: {view.shape}")
view.head()

**Checkpoint:** Why are parentheses required?

- Parentheses are required because of operator precedence: `&`, `|`, and `~` bind more tightly than comparison operators such as `>=` and `==`.
- `users['age'] >= 18 & users['sessions'] >= 5` is parsed as `users['age'] >= (18 & users['sessions']) >= 5`, a chained comparison that raises `ValueError` (the truth value of a Series is ambiguous).
- Wrap each comparison in parentheses: `(users['age'] >= 18) & (users['sessions'] >= 5)`. Masks already stored in variables, such as `adults & heavy_users & us_like`, need no extra parentheses.
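
The precedence rule can be checked on a tiny frame. This is a minimal sketch: the three-row `df` is a hypothetical stand-in, not the lab's `users` data.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values, not the lab's `users` data)
df = pd.DataFrame({"age": [15, 22, 40], "sessions": [6, 2, 7]})

error_message = None
try:
    # Without parentheses Python parses this as the chained comparison
    #   df["age"] >= (18 & df["sessions"]) >= 5
    # which needs the truth value of a whole Series and therefore raises.
    bad = df["age"] >= 18 & df["sessions"] >= 5
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)

# With parentheses, each comparison becomes a boolean Series first
good = (df["age"] >= 18) & (df["sessions"] >= 5)
print(good.tolist())  # [False, False, True]
```

Only the last row satisfies both conditions, so the parenthesized mask selects exactly one row.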

### A2. Negation and OR

Using the negation operator `~` and the OR operator `|`.

In [None]:
# Create filters for low engagement or non-US users
low_engagement = users['sessions'] <= 1
non_us = ~us_like

# Combine with OR operator
subset = users.loc[low_engagement | non_us, ['user_id','country','sessions']]

print(f"Found {len(subset)} users with low engagement or non-US")
print(f"\nBreakdown:")
print(f"  Low engagement users: {low_engagement.sum()}")
print(f"  Non-US users: {non_us.sum()}")
print(f"  Combined (with OR): {len(subset)}")

subset.sample(5, random_state=1)

**Quick note:** The `~` operator negates a boolean Series element-wise. Make sure the operand has boolean dtype: on an integer Series `~` is a bitwise NOT (`~1` is `-2`), which is rarely what you want.
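
A short sketch of the dtype caveat, using two hypothetical three-element Series (one boolean, one integer that merely looks boolean):

```python
import pandas as pd

flags_bool = pd.Series([True, False, True])
flags_int = pd.Series([1, 0, 1])  # looks boolean, but dtype is int64

negated_bool = ~flags_bool               # logical NOT
negated_int = ~flags_int                 # bitwise NOT on integers
negated_cast = ~flags_int.astype(bool)   # cast first, then logical NOT

print(negated_bool.tolist())  # [False, True, False]
print(negated_int.tolist())   # [-2, -1, -2]
print(negated_cast.tolist())  # [False, True, False]
```

Checking `series.dtype` before applying `~` is a cheap way to catch this.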

## Part B — `loc` vs `iloc` & Avoiding Chained Assignment

### B1. Label vs positional selection

In [None]:
# loc uses labels (row indices and column names)
first_10_by_label = users.loc[0:9, ['user_id','age']]

# iloc uses integer positions (0-based, exclusive end)
first_10_by_pos = users.iloc[0:10, [0,1]]  # same rows; columns by index

print("Using .loc (label-based, inclusive):")
print(f"Shape: {first_10_by_label.shape}")
display(first_10_by_label.head())

print("\nUsing .iloc (position-based, exclusive):")
print(f"Shape: {first_10_by_pos.shape}")
display(first_10_by_pos.head())

# Verify they're the same
print(f"\nAre they equal? {first_10_by_label.equals(first_10_by_pos)}")
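
With the default `0..n-1` index, labels and positions coincide, which hides the difference. The sketch below uses a hypothetical frame with a non-default index to make the two behaviours visibly diverge.

```python
import pandas as pd

# Hypothetical frame whose index labels are NOT 0..n-1
df = pd.DataFrame({"value": ["a", "b", "c", "d"]}, index=[10, 20, 30, 40])

by_label = df.loc[10:30]     # labels 10 through 30, end label included
by_position = df.iloc[0:2]   # positions 0 and 1, end position excluded

print(by_label["value"].tolist())     # ['a', 'b', 'c']
print(by_position["value"].tolist())  # ['a', 'b']
```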

### Correct single-column assignment with loc

This demonstrates the **correct** way to add or modify a column based on a condition.

In [None]:
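# Correct way: create the column with a default value, then set matching rows with loc.
# (Assigning True to only some rows of a brand-new column leaves NaN in the others,
#  which can force object dtype and makes `~` below misbehave.)
users['is_adult'] = False
users.loc[adults, 'is_adult'] = True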

print("Added 'is_adult' column:")
print(users[['age','is_adult']].head(10))
print(f"\nAdults: {users['is_adult'].sum()}")
print(f"Minors: {(~users['is_adult']).sum()}")

### B2. Chained assignment trap demo

#### What the "chained assignment trap" is

In pandas, expressions like `df[cond][col] = value` may operate on a temporary view or a temporary copy of your data. If it's a copy, the assignment modifies only the temporary object — not the original `df`. Because pandas has to balance speed and memory, whether you get a view or a copy is not guaranteed across versions/operations. That uncertainty is the "trap."

- **Chaining = two indexing steps** (e.g., `df[cond][col]`) instead of a single `.loc` step.
- **Symptom:** You see a `SettingWithCopyWarning` (often), or worse, *no warning* but the change silently doesn't stick.

In [None]:
# Demonstrate the trap on a copy of users so the original stays untouched
users_test = users.copy()

# BAD: assigning into a filtered subset (pandas may emit SettingWithCopyWarning)
print("=== BAD APPROACH: Assigning via a filtered subset ===")
tmp = users_test[users_test['sessions'] > 5]
tmp['flag_bad'] = 1
print(f"Rows in tmp: {len(tmp)}")
print(f"Non-null flags in tmp: {tmp['flag_bad'].notna().sum()}")

# Check whether the assignment reached users_test
if 'flag_bad' not in users_test.columns:
    print("❌ 'flag_bad' does not exist in users_test - the assignment only changed tmp!")
else:
    print(f"NaN count in users_test['flag_bad']: {users_test['flag_bad'].isna().sum()}")
    print("⚠️ The assignment reached users_test here, but that is not guaranteed.")

In [None]:
# GOOD: Direct assignment with loc
print("=== GOOD APPROACH: Using .loc ===")
users_test.loc[users_test['sessions'] > 5, 'flag_good'] = 1
print(f"Non-null flags: {users_test['flag_good'].notna().sum()}")
print(f"NaN count: {users_test['flag_good'].isna().sum()}")
print("✅ Assignment worked correctly!")

# Compare the results
print("\nComparison:")
print(f"Sessions > 5: {(users_test['sessions'] > 5).sum()} users")
print(f"flag_good set: {users_test['flag_good'].notna().sum()} users")

**Checkpoint:** Why might `tmp['flag'] = 1` not modify `users`?

**Answer:**
- `tmp = users[users['sessions'] > 5]` creates a new object. With a boolean mask, pandas returns a copy of the selected rows.
- When you assign `tmp['flag'] = 1`, you're modifying `tmp`. Because `tmp` is a copy, the original `users` DataFrame remains unchanged.
- More generally, pandas does not guarantee whether an indexing step returns a view or a copy, which is why this pattern is unreliable.
- The **correct** approach is to use `.loc` with a single indexing operation: `users.loc[users['sessions'] > 5, 'flag'] = 1`
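
The two safe patterns behave predictably and can be verified directly. This is a minimal sketch on a hypothetical three-row frame: a single `.loc` step writes into the original, while an explicit `.copy()` yields an independent object.

```python
import pandas as pd

df = pd.DataFrame({"sessions": [2, 7, 9], "flag": [0, 0, 0]})

# One indexing step on the original frame: the write lands in df
df.loc[df["sessions"] > 5, "flag"] = 1

# An explicit copy is an independent object: edits stay in `subset`
subset = df[df["sessions"] > 5].copy()
subset["flag"] = 99

print(df["flag"].tolist())      # [0, 1, 1]
print(subset["flag"].tolist())  # [99, 99]
```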

#### How to avoid the chained assignment trap

**Rules of thumb:**

1. **Use `.loc` for assignment:** `df.loc[row_mask, 'col'] = value`
2. **If working on a subset, make it explicit:** `subset = df[cond].copy()`
3. **Don't chain when assigning** (reading is fine): Prefer `df.loc[cond, 'col']` over `df[cond]['col']`
4. **Store complex conditions in named variables:**

In [None]:
# Good pattern: give a complex condition a name, then reuse it
m = (users['age'] > 30) & (users['country'] == 'US')
users.loc[m, 'segment'] = 'US_30plus'

# Alternative safe patterns
users.loc[:, 'segment2'] = np.where(m, 'US_30plus', 'Other')

# Using assign (creates new DataFrame)
users_with_segment = users.assign(segment3=np.where(m, 'US_30plus', 'Other'))

print("Different safe assignment patterns:")
print(users[['age', 'country', 'segment', 'segment2']].head(10))

## Part C — Chained Masks & Reusable Filters

### C1. Compose filters

Create reusable filter functions that can be combined for complex queries.

In [None]:
def f_is_adult(df):
    """Filter for adult users (age >= 18)"""
    return df['age'] >= 18

def f_high_value(df, p=90):
    """Filter for high-value users (spend at or above the p-th percentile)"""
    thr = df['spend_usd'].quantile(p/100)
    return df['spend_usd'] >= thr

def f_core_markets(df):
    """Filter for core market countries"""
    return df['country'].isin(['US','SG','DE'])

# Combine filters
mask = f_is_adult(users) & f_high_value(users, 85) & f_core_markets(users)
hv_core = users.loc[mask, ['user_id','age','country','spend_usd']]

print(f"High-value core market adults: {len(hv_core)} users")
print(f"Average spend: ${hv_core['spend_usd'].mean():.2f}")
print(f"Spend threshold (85th percentile): ${users['spend_usd'].quantile(0.85):.2f}")
hv_core.head()
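
When the list of filters grows, the `&` chain can be folded into a helper. The sketch below is one possible approach; `combine_filters`, `is_adult`, and `in_core` are hypothetical names, and the three-row frame is stand-in data.

```python
from functools import reduce
import operator

import pandas as pd

def combine_filters(df, *filters):
    """AND together the boolean masks returned by each filter function."""
    masks = [f(df) for f in filters]
    # Start from an all-True mask so zero filters means "keep every row"
    start = pd.Series(True, index=df.index)
    return reduce(operator.and_, masks, start)

# Small stand-in data
df = pd.DataFrame({"age": [17, 30, 45], "country": ["US", "BR", "DE"]})

def is_adult(d):
    return d["age"] >= 18

def in_core(d):
    return d["country"].isin(["US", "SG", "DE"])

combined = combine_filters(df, is_adult, in_core)
print(df.loc[combined, "age"].tolist())  # [45]
```

Filters that take extra arguments can be passed with `functools.partial` or a small lambda.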

### C2. Multi-condition with between/isin/str.contains

Using pandas' convenient methods for range checks and string matching.

In [None]:
# Using between and isin
mask2 = users['age'].between(25, 40) & users['country'].isin(['US','SG'])
print(f"Users aged 25-40 in US or SG: {mask2.sum()}")

# String contains - naive approach
mask3 = users['country'].str.contains('US', regex=False)
print(f"\nCountries containing 'US' (naive): {mask3.sum()}")
print(f"Unique countries matched: {users.loc[mask3, 'country'].unique()}")

**Note:** The naive `contains('US')` check matches any value containing the substring 'US', so it catches 'US' and 'USA' but misses 'U.S.A.'. Let's normalize first so all three variants match.

In [None]:
# Normalize first: strip dots and uppercase, so 'U.S.A.' becomes 'USA'
norm = users['country'].str.replace('.', '', regex=False).str.upper()
mask3b = norm.str.contains('US', regex=False)

print(f"After normalization:")
print(f"Countries containing 'US': {mask3b.sum()}")
print(f"Unique normalized countries matched: {norm[mask3b].unique()}")

# Combine all masks
filtered = users.loc[mask2 & mask3b]
print(f"\nFinal filtered count: {len(filtered)}")
filtered.head()

**Checkpoint:** Why normalize before `contains`?

**Answer:** The raw data has inconsistent country codes: 'US', 'U.S.A.', and 'USA'. Without normalization:
- `contains('US')` matches 'US' and 'USA' but not 'U.S.A.', because the dots break up the substring
- After removing dots and uppercasing, 'U.S.A.' becomes 'USA', so `contains('US')` matches all three variants
- This ensures we capture all US-related entries regardless of formatting

In [None]:
# Show the difference in counts
print("Comparison of matching approaches:")
print(f"  Raw 'US' contains: {mask3.sum()} matches")
print(f"  Normalized 'US' contains: {mask3b.sum()} matches")
print(f"  Difference: {mask3b.sum() - mask3.sum()} additional matches captured")
print(f"\nCountries in original data:")
print(users['country'].value_counts().sort_index())
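
Substring matching can still over-match in real data (any value that happens to contain the same letters). A stricter alternative is to map every known spelling to one canonical code and then compare exactly. The sketch below assumes a hypothetical `aliases` table and a small stand-in Series; the lowercase `"usa"` entry is invented for illustration and does not occur in the lab data.

```python
import pandas as pd

raw = pd.Series(["US", "U.S.A.", "usa", "SG", "DE"])

# Hypothetical alias table: known spelling -> canonical code
aliases = {"USA": "US"}

canonical = (
    raw.str.replace(".", "", regex=False)  # "U.S.A." -> "USA"
       .str.strip()
       .str.upper()                        # "usa" -> "USA"
       .replace(aliases)                   # "USA" -> "US"
)

is_us = canonical == "US"  # exact match, no substring surprises
print(canonical.tolist())  # ['US', 'US', 'US', 'SG', 'DE']
print(int(is_us.sum()))    # 3
```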

## Part D — `DataFrame.query` & `eval`

### D1. Using `query`

The `query` method provides a more readable syntax for filtering, especially with complex conditions.

In [None]:
# Query understands column names directly; use @ for Python variables
min_sess = 3
q1 = users.query('(age >= 18) & (sessions >= @min_sess) & country in ["US", "U.S.A.", "USA"]')

print(f"Query results: {len(q1)} users")
print(f"Matches adults with >= {min_sess} sessions in US variants")
q1[['user_id','age','country','sessions']].head()
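
Two pieces of `query` syntax are worth seeing side by side: `@name` reads a Python variable, and backticks quote a column name that is not a valid identifier. The frame and the `total spend` column below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [17, 25, 40],
    "total spend": [5.0, 50.0, 12.0],  # column name containing a space
})

min_age = 18
min_spend = 10.0

# @name reads a Python variable; backticks quote a non-identifier column name
picked = df.query("age >= @min_age and `total spend` >= @min_spend")

# The same filter written as a mask
same = df.loc[(df["age"] >= min_age) & (df["total spend"] >= min_spend)]

print(picked.index.tolist())  # [1, 2]
print(picked.equals(same))    # True
```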

### D2. When to choose query vs masks

**Prefer masks when:**
- You need IDE/type support and autocomplete
- Refactoring code (easier to find/replace)
- Complex Python expressions or custom functions
- Need to reuse the mask multiple times

**Prefer `query` when:**
- Readability is important (notebooks, presentations)
- Complex boolean logic across many columns
- Working interactively with data exploration
- Want to write SQL-like expressions

In [None]:
# Example: Compare readability
# Using masks
mask_version = users.loc[
    (users['age'] >= 18) & 
    (users['sessions'] >= 3) & 
    (users['spend_usd'] >= 20) & 
    users['country'].isin(['US', 'SG', 'DE'])
]

# Using query
query_version = users.query(
    '(age >= 18) and (sessions >= 3) and (spend_usd >= 20) and country in ["US", "SG", "DE"]'
)

print(f"Mask version: {len(mask_version)} results")
print(f"Query version: {len(query_version)} results")
print(f"Results match: {mask_version.index.equals(query_version.index)}")

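### D3. Bonus: `eval` for computed columns

The `eval` method evaluates a string expression against the DataFrame's columns. With an assignment expression it returns a new DataFrame containing the computed column unless you pass `inplace=True`. For large DataFrames it can be faster than plain pandas arithmetic when the optional `numexpr` engine is installed.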

In [None]:
# Using eval to create a computed column
users = users.eval('engagement = sessions * avg_session_sec')

print("Created 'engagement' column:")
print(users[['sessions', 'avg_session_sec', 'engagement']].head())

# Now use query with the new column
engagement_threshold = users.engagement.quantile(0.9)
crit = users.query('engagement >= @engagement_threshold')

print(f"\nTop 10% by engagement: {len(crit)} users")
print(f"Engagement threshold (90th percentile): {engagement_threshold:.0f}")
crit[['user_id', 'sessions', 'avg_session_sec', 'engagement']].head()
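
The difference between the default behaviour of `eval` and `inplace=True` is easy to miss. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

df = pd.DataFrame({"sessions": [1, 4], "avg_session_sec": [100.0, 250.0]})

# Default: eval returns a NEW DataFrame; df itself is unchanged
with_col = df.eval("engagement = sessions * avg_session_sec")
unchanged = "engagement" not in df.columns
print(unchanged)                        # True
print(with_col["engagement"].tolist())  # [100.0, 1000.0]

# inplace=True adds the column to df and returns None
result = df.eval("engagement = sessions * avg_session_sec", inplace=True)
print(result is None)                   # True
print("engagement" in df.columns)       # True
```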

## Part E — Bonus (Optional) — Filter orders Parquet artifacts

This section uses partitioned Parquet files from Lab 03. If you don't have these files, skip this section or run Lab 03 first.

### E1. Load the orders artifacts

In [None]:
# Check if the artifacts exist
from pathlib import Path

p = Path('artifacts/parquet/orders')
if p.exists():
    files = sorted(p.glob('shipcountry=*.parquet'))
    print(f"Found {len(files)} parquet files")
    if files:
        orders = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
        print(f"Loaded {len(orders)} orders")
        print(orders.head())
    else:
        print("No parquet files found in directory")
        orders = None
else:
    print(f"Directory '{p}' does not exist. Skipping Part E.")
    print("Run Lab 03 first to create the parquet artifacts.")
    orders = None

### E2. Boolean masks vs `query` on orders data

Let's compare both approaches on the orders dataset (if available).

In [None]:
if orders is not None:
    # Approach 1: Boolean masks
    m_country = orders['ShipCountry'].isin(['USA','Germany'])
    m_date = pd.to_datetime(orders['OrderDate'], errors='coerce').between('1997-01-01','1998-12-31')
    m_freight = orders['Freight'] >= orders['Freight'].quantile(0.9)
    
    subset_mask = orders.loc[
        m_country & m_date & m_freight, 
        ['OrderID','CustomerID','ShipCountry','OrderDate','Freight']
    ]
    
    print(f"=== Mask approach ===")
    print(f"Found {len(subset_mask)} orders")
    print(f"  Country filter: {m_country.sum()} orders")
    print(f"  Date filter: {m_date.sum()} orders")
    print(f"  Freight filter: {m_freight.sum()} orders")
    print(f"  Combined: {len(subset_mask)} orders")
    
    # Approach 2: Query
    orders2 = orders.assign(OrderDate=pd.to_datetime(orders['OrderDate'], errors='coerce'))
    freight_threshold = orders2['Freight'].quantile(0.9)
    
    q = orders2.query(
        'ShipCountry in ["USA","Germany"] and '
        '(OrderDate >= @pd.Timestamp("1997-01-01")) and '
        '(OrderDate <= @pd.Timestamp("1998-12-31")) and '
        'Freight >= @freight_threshold'
    )
    subset_query = q[['OrderID','CustomerID','ShipCountry','OrderDate','Freight']]
    
    print(f"\n=== Query approach ===")
    print(f"Found {len(subset_query)} orders")
    print(f"\nResults match: {len(subset_mask) == len(subset_query)}")
    
    display(subset_mask.head())
else:
    print("Skipping - no orders data available")

### E3. Group & sanity‑check

In [None]:
if orders is not None and 'subset_mask' in locals():
    print("Orders by country:")
    country_counts = subset_mask.groupby('ShipCountry').size().sort_values(ascending=False)
    print(country_counts)
    
    print(f"\nFreight statistics:")
    print(subset_mask['Freight'].describe())
else:
    print("Skipping - no filtered orders available")

**Checkpoint:** Confirm mask and query produce the same number of rows. If not, inspect dtype differences and how dates were handled.
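
Equal row counts do not prove that the same rows were selected. One way to inspect a mismatch is to compare index labels. The sketch below uses a hypothetical helper `diff_rows` and a stand-in frame in which dates are stored as strings, one of them malformed, to show how date handling can change the result.

```python
import pandas as pd

def diff_rows(a, b):
    """Return index labels present in exactly one of the two results."""
    return a.index.symmetric_difference(b.index)

# Hypothetical stand-in: dates stored as strings, one of them malformed
orders_demo = pd.DataFrame({
    "OrderDate": ["1997-03-01", "1998-07-15", "not a date"],
    "Freight": [10.0, 80.0, 95.0],
})
dates = pd.to_datetime(orders_demo["OrderDate"], errors="coerce")

# Proper datetime comparison: the malformed row becomes NaT and is excluded
by_dates = orders_demo.loc[dates.between("1997-01-01", "1998-12-31")]

# Plain string comparison: "not a date" sorts after "1997-01-01" and slips through
by_strings = orders_demo.loc[orders_demo["OrderDate"] >= "1997-01-01"]

mismatch = diff_rows(by_dates, by_strings).tolist()
print(mismatch)  # [2]
```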

## Part F — Wrap‑Up

### Summary Questions

1. **Show two equivalent filters: one with masks and one with `query`**
2. **Why is `df.loc[mask, 'col'] = ...` preferred over chained assignment?**
3. **When might `eval` or `query` *not* be appropriate?**

### Answer 1: Equivalent filters

In [None]:
# Mask approach
mask = (users['age'] >= 18) & users['sessions'].between(3, 10) & users['country'].isin(['US','U.S.A.','USA'])
sol_mask = users.loc[mask, ['user_id','age','sessions','country']]

# Query approach
sol_query = users.query(
    '(age >= 18) and (sessions >= 3) and (sessions <= 10) and country in ["US","U.S.A.","USA"]'
).loc[:, ['user_id','age','sessions','country']]

print("Mask approach:")
print(f"  Results: {len(sol_mask)}")
display(sol_mask.head())

print("\nQuery approach:")
print(f"  Results: {len(sol_query)}")
display(sol_query.head())

print(f"\nBoth produce same count: {len(sol_mask) == len(sol_query)}")

### Answer 2: Why use `.loc` for assignment?

**`df.loc[mask, 'col'] = ...` is preferred because:**

1. **Reliability:** It always modifies the original DataFrame, not a temporary copy
2. **Single indexing operation:** Uses one step instead of chained indexing
3. **Clear intent:** Explicitly shows you're modifying the DataFrame in place
4. **No warnings:** Avoids `SettingWithCopyWarning`
5. **Predictable behavior:** Works consistently across pandas versions

**Chained assignment problems:**
- `df[condition]['column'] = value` uses two indexing operations
- First `df[condition]` might return a view OR a copy (unpredictable)
- Second `['column'] = value` might modify a temporary object
- The original DataFrame might not be updated at all

### Answer 3: When NOT to use `eval` or `query`

**Avoid `eval`/`query` when:**

1. **Dynamic column names:** Column names determined at runtime or from variables
   ```python
   col = 'age'  # Query can't use @col as column name
   ```

2. **Complex Python functions:** Custom functions or methods not supported
   ```python
   # This won't work in query
   users.query('my_custom_function(age) > 18')
   ```

3. **Need IDE support:** Type checking, autocomplete, refactoring tools
4. **String operations:** Complex string methods like `str.extract()`, regex patterns
5. **Performance for small DataFrames:** Overhead of parsing string expressions
6. **Debugging:** Stack traces are less clear with string expressions
7. **Column names with spaces/special chars:** Require backticks, less readable
8. **Need to reuse or combine masks:** `query` returns filtered rows rather than a boolean Series, so its result can't be inverted or combined with other masks

**Example where masks are better:**

In [None]:
# Dynamic column selection - awkward with query (you would have to build the string yourself)
cols_to_check = ['age', 'sessions']
thresholds = {'age': 18, 'sessions': 3}

# Masks handle this naturally
mask = pd.Series(True, index=users.index)
for col in cols_to_check:
    mask &= users[col] >= thresholds[col]

result = users.loc[mask]
print(f"Dynamic filtering: {len(result)} results")

# Custom function - can't use query
def is_high_engagement(row):
    return row['sessions'] * row['avg_session_sec'] > 1000

high_eng = users.apply(is_high_engagement, axis=1)
print(f"Custom function: {high_eng.sum()} high-engagement users")
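
Row-wise `apply` calls a Python function once per row, which is slow on large frames. When the logic is plain arithmetic, a column-level expression gives the same mask. A minimal sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "sessions": [1, 4, 10],
    "avg_session_sec": [300.0, 200.0, 150.0],
})

# Row-wise: one Python function call per row
row_wise = df.apply(lambda r: r["sessions"] * r["avg_session_sec"] > 1000, axis=1)

# Vectorized: one column-level expression, same result
vectorized = (df["sessions"] * df["avg_session_sec"]) > 1000

print(vectorized.tolist())               # [False, False, True]
print(bool((row_wise == vectorized).all()))  # True
```

Reserve `apply(..., axis=1)` for logic that genuinely cannot be expressed with column operations.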

### Common Pitfalls Summary

**1. Missing parentheses in boolean combinations:**
```python
# Wrong: & binds tighter than >= and >, so this parses as a chained comparison
mask = users['age'] >= 18 & users['sessions'] > 3  # ValueError!

# Correct:
mask = (users['age'] >= 18) & (users['sessions'] > 3)
```

**2. Confusing `and`/`or` with `&`/`|`:**
```python
# Wrong: 'and' is for scalar boolean, not Series
mask = (users['age'] >= 18) and (users['sessions'] > 3)  # ERROR!

# Correct: use & for element-wise boolean operations
mask = (users['age'] >= 18) & (users['sessions'] > 3)
```

**3. Chained assignment:**
```python
# Wrong: might not modify original
users[users['age'] > 18]['category'] = 'adult'  # WARNING!

# Correct: use .loc
users.loc[users['age'] > 18, 'category'] = 'adult'
```

## Conclusion

In this lab, we covered:

✅ Boolean mask construction with proper use of `&`, `|`, `~`, and parentheses  
✅ Difference between `.loc` (label-based) and `.iloc` (position-based) indexing  
✅ Avoiding the chained assignment trap by using `.loc` for all assignments  
✅ Using `.query()` for readable filtering when appropriate  
✅ Building reusable filter functions for complex logic  
✅ Understanding when to prefer masks vs. query vs. eval  

**Key Takeaways:**
- Always use parentheses with boolean operators
- Use `.loc` for reliable assignment operations
- Choose the right tool: masks for flexibility, query for readability
- Normalize data before string matching operations
- Create reusable filter functions for maintainable code

## Bonus: Additional Examples and Visualizations

In [None]:
# Visualize the effect of different filters
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot 1: Age distribution with adult filter
axes[0, 0].hist(users['age'], bins=30, alpha=0.5, label='All users')
axes[0, 0].hist(users.loc[adults, 'age'], bins=30, alpha=0.5, label='Adults (>=18)')
axes[0, 0].axvline(18, color='red', linestyle='--', label='Age 18')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Count')
axes[0, 0].set_title('Age Distribution with Adult Filter')
axes[0, 0].legend()

# Plot 2: Sessions distribution
axes[0, 1].hist(users['sessions'], bins=20, alpha=0.5, label='All users')
axes[0, 1].hist(users.loc[heavy_users, 'sessions'], bins=20, alpha=0.5, label='Heavy users (>=5)')
axes[0, 1].axvline(5, color='red', linestyle='--', label='5 sessions')
axes[0, 1].set_xlabel('Sessions')
axes[0, 1].set_ylabel('Count')
axes[0, 1].set_title('Session Distribution with Heavy User Filter')
axes[0, 1].legend()

# Plot 3: Country distribution
country_counts = users['country'].value_counts()
axes[1, 0].bar(range(len(country_counts)), country_counts.values)
axes[1, 0].set_xticks(range(len(country_counts)))
axes[1, 0].set_xticklabels(country_counts.index, rotation=45)
axes[1, 0].set_xlabel('Country')
axes[1, 0].set_ylabel('Count')
axes[1, 0].set_title('Country Distribution')

# Plot 4: Spend distribution
axes[1, 1].hist(users['spend_usd'], bins=50, alpha=0.7)
axes[1, 1].axvline(users['spend_usd'].quantile(0.85), color='red', linestyle='--', 
                   label='85th percentile')
axes[1, 1].set_xlabel('Spend (USD)')
axes[1, 1].set_ylabel('Count')
axes[1, 1].set_title('Spend Distribution')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

print("Filter Statistics:")
print(f"  Total users: {len(users)}")
print(f"  Adults: {adults.sum()} ({adults.sum()/len(users)*100:.1f}%)")
print(f"  Heavy users: {heavy_users.sum()} ({heavy_users.sum()/len(users)*100:.1f}%)")
print(f"  US-like countries: {us_like.sum()} ({us_like.sum()/len(users)*100:.1f}%)")
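combined = adults & heavy_users & us_like  # `mask` was reassigned in later cells, so rebuild it
print(f"  Combined filter: {combined.sum()} ({combined.sum()/len(users)*100:.1f}%)")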

In [None]:
# Create a comparison table of different filtering approaches
comparison_data = {
    'Filter Description': [
        'All users',
        'Adults (age >= 18)',
        'Heavy users (sessions >= 5)',
        'US-like countries',
        'Adults AND Heavy AND US',
        'Low engagement OR Non-US',
        'Age 25-40 AND (US or SG)',
        'Top 10% engagement'
    ],
    'Count': [
        len(users),
        adults.sum(),
        heavy_users.sum(),
        us_like.sum(),
        (adults & heavy_users & us_like).sum(),  # not `mask`, which was reassigned in later cells
        (low_engagement | non_us).sum(),
        mask2.sum(),
        len(crit)
    ]
}

comparison_df = pd.DataFrame(comparison_data)
comparison_df['Percentage'] = (comparison_df['Count'] / len(users) * 100).round(1)
comparison_df['Bar'] = comparison_df['Percentage'].apply(lambda x: '█' * int(x/2))

print("\nFilter Comparison Summary:")
print("="*80)
display(comparison_df)