# 10. Boolean Indexing

-----

Boolean Indexing is the *primary* way you filter rows in a DataFrame. Instead of selecting data by its label or position, you ask a **True/False question** for every row. This "question" is called a **boolean mask**.

Think of it as putting a "filter" over your spreadsheet. You create a list of `True`/`False` values (the mask) as long as your DataFrame. `True` means "keep this row" and `False` means "hide this row." Pandas then shows you only the rows where the mask was `True`.

**How It Works in Memory**: When you write `df['Age'] > 30`, Pandas performs a vectorized comparison on the 'Age' column and creates a brand-new `pd.Series` of `dtype: bool` (e.g., `[True, False, True, ...]`). This is the **mask**. When you then pass this mask into the DataFrame (e.g., `df[mask]`), Pandas lines the mask up with the DataFrame's index and keeps every row whose mask value is `True`. The filtered DataFrame is a **copy** of the original data, not a view.
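
The two steps above can be watched one at a time. This is a minimal sketch on a small made-up `ages` DataFrame (the name and values are assumptions for illustration, separate from the running example later in this chapter).

```python
import pandas as pd

# Hypothetical data, used only for this illustration
ages = pd.DataFrame({'Age': [25, 40, 31]})

# Step 1: the comparison produces a new boolean Series (the mask)
mask = ages['Age'] > 30
print(mask.dtype)      # bool
print(mask.tolist())   # [False, True, True]

# Step 2: indexing with the mask keeps only the rows marked True
older = ages[mask]
print(older['Age'].tolist())  # [40, 31]

# The filtered result holds its own data: changing the original
# afterwards does not change the already-filtered result
ages.loc[1, 'Age'] = 99
print(older['Age'].tolist())  # [40, 31]
```

Note that the filtered rows keep their original index labels (1 and 2 here), which is why the mask and the DataFrame can be matched up by index.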

**When to Use This**: This is your main tool for data filtering.

  * Use this whenever you need to find data **based on its values**.
  * Examples: "Find all users older than 30," "Select all sales from the 'East' region," or "Get all rows where 'Profit' was negative."
  * It's also the *correct* way to **change data based on a condition** (e.g., "Set the 'Status' to 'High-Value' for all customers with 'Purchases' \> 10").

-----

### 0\. Syntax & Parameters

There are two main syntaxes for filtering.

#### 1\. `dataframe[mask]`

This is the most common for *just filtering*.

```python
# mask is a Series of True/False values
dataframe[mask]
```

#### 2\. `dataframe.loc[mask]` or `dataframe.loc[mask, columns]`

This is the **preferred** method: it can select columns at the same time, and it is the reliable way to set values based on a mask.

```python
# Select all columns for rows in the mask
dataframe.loc[mask]

# Select specific columns for rows in the mask
dataframe.loc[mask, ['Column1', 'Column2']]

# SET values based on a mask
dataframe.loc[mask, 'Column_to_Set'] = new_value
```

#### 3\. Multiple Conditions

You **cannot** use the Python keywords `and`, `or`, `not`. You *must* use the bitwise operators:

  * `&` (for **AND**)
  * `|` (for **OR**)
  * `~` (for **NOT**)

Each condition **must** be wrapped in its own parentheses `()`.

```python
# AND
dataframe[(condition1) & (condition2)]

# OR
dataframe[(condition1) | (condition2)]

# NOT
dataframe[~(condition1)]
```
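
When `&` and `|` appear in the same expression, Python evaluates `&` first, so the grouping matters. The sketch below uses a made-up `people` DataFrame and hypothetical mask names to show that two groupings of the same three conditions can select different rows.

```python
import pandas as pd

# Hypothetical data, used only for this illustration
people = pd.DataFrame({
    'Age': [25, 30, 22, 35],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago'],
})

is_ny = people['City'] == 'New York'
is_chicago = people['City'] == 'Chicago'
over_24 = people['Age'] > 24

# Without extra parentheses, & is evaluated before |
implicit = is_ny | is_chicago & over_24
# ... which is the same as this explicit grouping
and_first = is_ny | (is_chicago & over_24)
# Grouping the OR first gives a different mask
or_first = (is_ny | is_chicago) & over_24

print(implicit.tolist())   # [True, False, True, True]
print(and_first.tolist())  # [True, False, True, True]
print(or_first.tolist())   # [True, False, False, True]
```

Adding parentheses around every group, even where Python does not require them, keeps the intent visible to the next reader.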

-----

### 1\. Basic Example

Let's create a simple filter.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clara', 'David'],
    'Age': [25, 30, 22, 35],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago']
})
print("--- 1. Original DataFrame ---")
print(df)

# Example 1: Create the boolean mask
# This is just a Series of True/False
mask = df['Age'] > 28
print("\n--- 2. The Boolean Mask (df['Age'] > 28) ---")
print(mask)

# Example 2: Apply the mask to the DataFrame
df_filtered = df[mask]
print("\n--- 3. Filtered DataFrame (df[mask]) ---")
print(df_filtered)
```

**Output:**

```
--- 1. Original DataFrame ---
    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
2  Clara   22     New York
3  David   35      Chicago

--- 2. The Boolean Mask (df['Age'] > 28) ---
0    False
1     True
2    False
3     True
Name: Age, dtype: bool

--- 3. Filtered DataFrame (df[mask]) ---
    Name  Age         City
1    Bob   30  Los Angeles
3  David   35      Chicago
```

**Explanation:**
First, we created the `mask`. It was `False` for Alice (25) and Clara (22) but `True` for Bob (30) and David (35). When we passed this mask into `df[mask]`, Pandas returned only the rows (index 1 and 3) that corresponded to a `True` value.

-----

### 2\. Intermediate Example (Multiple Conditions)

This is the most critical part: combining conditions with `&` and `|`.

**Example 3: `&` (AND) Condition**
Find all people who are older than 25 *AND* live in New York.

```python
# Note the parentheses around each condition
mask_and = (df['Age'] > 25) & (df['City'] == 'New York')
print("\n--- 1. The AND Mask ---")
print(mask_and)

print("\n--- 2. Filtered with AND ---")
print(df[mask_and])
```

**Output:**

```
--- 1. The AND Mask ---
0    False
1    False
2    False
3    False
dtype: bool

--- 2. Filtered with AND ---
Empty DataFrame
Columns: [Name, Age, City]
Index: []
```

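**Explanation:** No row satisfies both conditions. Alice (25, New York) is `(False) & (True)`, which is `False`, because 25 is not greater than 25. Bob (30, Los Angeles) is `(True) & (False)`, which is also `False`. Clara and David each fail one condition too, so the result is an empty DataFrame. An empty result is not an error; it simply means nothing matched.

**Now loosen the age condition:** Age \> 20 AND City == 'New York'.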

```python
# Example 4: & (AND) Condition - Take 2
mask_and_2 = (df['Age'] > 20) & (df['City'] == 'New York')
print("\n--- 3. The Second AND Mask ---")
print(mask_and_2)

print("\n--- 4. Filtered with Second AND ---")
print(df[mask_and_2])
```

**Output:**

```
--- 3. The Second AND Mask ---
0     True
1    False
2     True
3    False
dtype: bool

--- 4. Filtered with Second AND ---
    Name  Age      City
0  Alice   25  New York
2  Clara   22  New York
```

**Explanation:** Alice (25, NY) was `(True) & (True)` -\> `True`. Clara (22, NY) was `(True) & (True)` -\> `True`. The other two rows had at least one `False`.

**Example 5: `|` (OR) Condition**
Find all people who are older than 30 *OR* live in New York.

```python
mask_or = (df['Age'] > 30) | (df['City'] == 'New York')
print("\n--- 5. The OR Mask ---")
print(mask_or)

print("\n--- 6. Filtered with OR ---")
print(df[mask_or])
```

**Output:**

```
--- 5. The OR Mask ---
0     True
1    False
2     True
3     True
dtype: bool

--- 6. Filtered with OR ---
    Name  Age      City
0  Alice   25  New York
2  Clara   22  New York
3  David   35   Chicago
```

**Explanation:**

  * Alice: (False OR True) -\> True
  * Bob: (False OR False) -\> False
  * Clara: (False OR True) -\> True
  * David: (True OR False) -\> True

**Example 6: `~` (NOT) Condition**
Find all people who *do not* live in New York.

```python
mask_not = ~(df['City'] == 'New York')
print("\n--- 7. The NOT Mask ---")
print(mask_not)

print("\n--- 8. Filtered with NOT ---")
print(df[mask_not])
```

**Output:**

```
--- 7. The NOT Mask ---
0    False
1     True
2    False
3     True
Name: City, dtype: bool

--- 8. Filtered with NOT ---
    Name  Age         City
1    Bob   30  Los Angeles
3  David   35      Chicago
```
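
For a single equality test, `!=` builds the same mask as `~(... == ...)`. Where `~` earns its keep is in negating a whole compound condition. A small sketch, rebuilding the example data under the assumed name `people`:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clara', 'David'],
    'Age': [25, 30, 22, 35],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago'],
})

# Two spellings of "does not live in New York"
not_ny_a = ~(people['City'] == 'New York')
not_ny_b = people['City'] != 'New York'
print(not_ny_a.equals(not_ny_b))  # True

# Negating a compound condition: everyone EXCEPT New Yorkers under 25
young_new_yorker = (people['Age'] < 25) & (people['City'] == 'New York')
print(people[~young_new_yorker]['Name'].tolist())  # ['Alice', 'Bob', 'David']
```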

-----

### 3\. Advanced or Tricky Case

Using `.loc` for filtering and, more importantly, for *setting* values.

**Example 7: Using `.loc` to filter rows and select columns**
This is cleaner than `df[mask][['Name', 'Age']]`.

```python
mask = df['Age'] > 28
df_loc_filtered = df.loc[mask, ['Name', 'Age']]
print("\n--- 1. Filtered with .loc (and selecting cols) ---")
print(df_loc_filtered)
```

**Output:**

```
--- 1. Filtered with .loc (and selecting cols) ---
    Name  Age
1    Bob   30
3  David   35
```

**Example 8: Using `.loc` to *set values* (The Killer Feature)**
This is the **correct** way to change data based on a condition.

```python
print("\n--- 2. Before Setting Value ---")
print(df)

# Set 'Status' to 'Senior' for anyone over 28
df.loc[df['Age'] > 28, 'Status'] = 'Senior'
df.loc[df['Age'] <= 28, 'Status'] = 'Junior'

print("\n--- 3. After Setting Value ---")
print(df)
```

**Output:**

```
--- 2. Before Setting Value ---
    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
2  Clara   22     New York
3  David   35      Chicago

--- 3. After Setting Value ---
    Name  Age         City  Status
0  Alice   25     New York  Junior
1    Bob   30  Los Angeles  Senior
2  Clara   22     New York  Junior
3  David   35      Chicago  Senior
```

**Example 9: Filtering with string methods (`.str`)**
Find all people whose name starts with 'A'.

```python
mask_str = df['Name'].str.startswith('A')
print("\n--- 4. String mask (startswith 'A') ---")
print(mask_str)

print("\n--- 5. Filtered by string method ---")
print(df[mask_str])
```

**Output:**

```
--- 4. String mask (startswith 'A') ---
0     True
1    False
2    False
3    False
Name: Name, dtype: bool

--- 5. Filtered by string method ---
    Name  Age      City  Status
0  Alice   25  New York  Junior
```
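
String masks need extra care when the column contains missing values: depending on the Pandas version and the column's dtype, a missing name can produce a missing entry in the mask, and such a mask cannot be used for filtering. Passing `na=False` to the string method is one way to treat missing entries as "no match". The `names` DataFrame below is made up for illustration.

```python
import pandas as pd

# Hypothetical data with one missing name
names = pd.DataFrame({'Name': ['Alice', None, 'Aaron', 'Bob']})

# na=False: a missing name counts as "does not start with 'A'"
mask = names['Name'].str.startswith('A', na=False)
print(mask.tolist())                 # [True, False, True, False]
print(names[mask]['Name'].tolist())  # ['Alice', 'Aaron']
```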

-----

### 4\. Real-World Use Case

**Example 10: Filtering for null values**
Find all rows where 'Status' is missing. Every row currently has a value, so we first blank one out by hand.

```python
df.loc[2, 'Status'] = np.nan # Manually create a NaN
print("\n--- 1. DF with NaN ---")
print(df)

mask_null = df['Status'].isna()
print("\n--- 2. Filtered for NaN ---")
print(df[mask_null])
```

**Output:**

```
--- 1. DF with NaN ---
    Name  Age         City  Status
0  Alice   25     New York  Junior
1    Bob   30  Los Angeles  Senior
2  Clara   22     New York     NaN
3  David   35      Chicago  Senior

--- 2. Filtered for NaN ---
    Name  Age      City  Status
2  Clara   22  New York     NaN
```

**Example 11: Filtering for non-null values**
Find all rows that are *not* missing a 'Status'.

```python
mask_not_null = df['Status'].notna()
print("\n--- 3. Filtered for NOT NaN ---")
print(df[mask_not_null])
```

**Output:**

```
--- 3. Filtered for NOT NaN ---
    Name  Age         City  Status
0  Alice   25     New York  Junior
1    Bob   30  Los Angeles  Senior
3  David   35      Chicago  Senior
```

**Example 12: Cleaning bad data**
Imagine you find bad data (e.g., a negative Age). You can use boolean indexing to find and fix it.

```python
df.loc[3, 'Age'] = -35 # Manually create bad data
print("\n--- 4. DF with Bad Data ---")
print(df)

# Find and fix
print("\n--- 5. Fixing bad data... ---")
df.loc[df['Age'] < 0, 'Age'] = 0 # Set negative ages to 0
print(df)
```

**Output:**

```
--- 4. DF with Bad Data ---
    Name  Age         City  Status
0  Alice   25     New York  Junior
1    Bob   30  Los Angeles  Senior
2  Clara   22     New York     NaN
3  David  -35      Chicago  Senior

--- 5. Fixing bad data... ---
    Name  Age         City  Status
0  Alice   25     New York  Junior
1    Bob   30  Los Angeles  Senior
2  Clara   22     New York     NaN
3  David    0      Chicago  Senior
```

*(Restoring data for next example)*
`df.loc[3, 'Age'] = 35`
`df.loc[2, 'Status'] = 'Junior'`

-----

### 5\. Common Mistakes / Pitfalls

**Mistake 13: Using `and`, `or` instead of `&`, `|`**

```python
# Wrong code
try:
    df[(df['Age'] > 20) and (df['City'] == 'New York')]
except ValueError as e:
    print("\n--- Mistake 1: Using 'and' ---")
    print(e)
```

**Error/Wrong Output:**
`ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().`
**Why it happens:** Python's `and` keyword needs a single `True` or `False` from each operand, so it asks the *entire Series* for one truth value, which is ambiguous. `&` instead combines the two Series *element-wise*.
**Correction:** `df[(df['Age'] > 20) & (df['City'] == 'New York')]`

**Mistake 14: Forgetting parentheses `()`**

```python
# Wrong code
try:
    df[df['Age'] > 20 & df['City'] == 'New York']
except TypeError as e:
    print("\n--- Mistake 2: Missing parentheses ---")
    print(e)
```

**Error/Wrong Output:**
`TypeError: cannot compare a dtyped [object] array with a scalar of type [bool]` (the exact wording varies between Pandas versions)
**Why it happens:** `&` binds more tightly than `>` and `==`. Without `()`, Python's order of operations tries to calculate `20 & df['City']` first, which is nonsense.
**Correction:** `df[(df['Age'] > 20) & (df['City'] == 'New York')]`
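
The same precedence problem appears with purely numeric columns, although the exception is then usually a different one. The sketch below uses a made-up `scores` DataFrame and catches both exception types, since which one appears depends on the column dtypes and may depend on the Pandas version.

```python
import pandas as pd

# Hypothetical numeric data, used only for this illustration
scores = pd.DataFrame({'A': [1, 2, 3], 'B': [0, 5, 7]})

# Without parentheses Python reads this as:
#   scores['A'] > (1 & scores['B']) > 0
# a chained comparison, which needs one truth value per Series
try:
    scores[scores['A'] > 1 & scores['B'] > 0]
    outcome = 'no error'
except (TypeError, ValueError) as err:
    outcome = type(err).__name__
print(outcome)  # typically ValueError for numeric columns

# With parentheses, each comparison is evaluated before &
fixed = scores[(scores['A'] > 1) & (scores['B'] > 0)]
print(fixed.index.tolist())  # [1, 2]
```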

**Mistake 15: The `SettingWithCopyWarning`**
This happens when you chain `[]` and `[]` to set a value.

```python
# Wrong code (this runs, but typically emits a warning)
print("\n--- Mistake 3: Chained Indexing (BAD) ---")
df_copy = df.copy()
# The assignment lands on a temporary copy, so df_copy is NOT modified
df_copy[df_copy['Age'] > 28]['Status'] = 'Senior_v2'
print(df_copy)
```

**Why it happens:** `df_copy[df_copy['Age'] > 28]` returns a *copy*. You are then setting a value on *that temporary copy*, not on the original `df_copy`.
**Example 16: Corrected code:**
Use `.loc` for a single, atomic operation.

```python
print("\n--- Corrected: Using .loc (GOOD) ---")
df_copy.loc[df_copy['Age'] > 28, 'Status'] = 'Senior_v2'
print(df_copy)
```
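
To see the difference between the two forms side by side, here is a minimal sketch on a made-up `staff` DataFrame. Warnings are silenced only to keep the demonstration quiet; the warning class that Pandas emits for the chained form differs between versions.

```python
import warnings
import pandas as pd

# Hypothetical data, used only for this illustration
staff = pd.DataFrame({
    'Age': [25, 30, 22, 35],
    'Status': ['Junior', 'Junior', 'Junior', 'Junior'],
})

# Chained indexing: the assignment lands on a temporary object
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    staff[staff['Age'] > 28]['Status'] = 'Senior'
after_chained = staff['Status'].tolist()
print(after_chained)  # ['Junior', 'Junior', 'Junior', 'Junior']

# One .loc call: the assignment reaches the original DataFrame
staff.loc[staff['Age'] > 28, 'Status'] = 'Senior'
after_loc = staff['Status'].tolist()
print(after_loc)      # ['Junior', 'Senior', 'Junior', 'Senior']
```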

-----

### 6\. Key Terms (Explained Simply)

  * **Boolean Mask:** A Series (or list) containing only `True` and `False` values, used to select rows.
  * **Bitwise Operators:** The operators `&` (AND), `|` (OR), and `~` (NOT) that are used on Series. They work element-by-element.
  * **Vectorization:** The operation (e.g., `df['Age'] > 30`) is applied to the *entire column at once* without you writing a `for` loop. This is why it's fast.
  * **Boolean Indexing:** The overall *technique* of using a boolean mask to select data.
  * **`SettingWithCopyWarning`**: Pandas's most famous warning. It means you are *probably* trying to modify a copy of your data instead of the original. You fix it by using `.loc`.

-----

### 7\. Best Practices

  * **Always use `&`, `|`, `~`** for multiple conditions.
  * **Always wrap each condition in parentheses `()`**.
  * **Use `.loc` to set values:** `df.loc[mask, 'col'] = value`. A single `.loc` call on the original DataFrame selects and assigns in one step, which avoids the chained-indexing problem behind most `SettingWithCopyWarning`s.
  * **Use `.loc[mask]` for filtering:** `df[mask]` is fine, but `df.loc[mask]` is more explicit and just as good.
  * **Create complex masks in variables:** If your filter is `(cond1 & cond2) | (cond3 & cond4)`, save it as `my_mask = ...` first, then use `df[my_mask]`. It's much cleaner.
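
As a sketch of the last point, each condition can get its own descriptive name before the pieces are combined. The mask names and the `people` DataFrame here are illustrative choices, not a required convention.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clara', 'David'],
    'Age': [25, 30, 22, 35],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago'],
})

# One named mask per condition
in_new_york = people['City'] == 'New York'
under_25 = people['Age'] < 25
in_chicago = people['City'] == 'Chicago'
over_30 = people['Age'] > 30

# The combined filter now reads almost like a sentence
target = (in_new_york & under_25) | (in_chicago & over_30)
selected = people.loc[target, ['Name', 'City']]
print(selected['Name'].tolist())  # ['Clara', 'David']
```

Named masks can also be printed or reused individually, which helps when a filter returns fewer rows than expected.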

-----

### 8\. Mini Summary

  * Boolean Indexing is filtering rows with a `True`/`False` **mask**.
  * Create a mask with a condition like `df['Age'] > 30`.
  * Apply the mask with `df[mask]`.
  * For multiple conditions, use `&` (AND), `|` (OR), `~` (NOT).
  * **CRITICAL:** `(condition1) & (condition2)`. The parentheses are mandatory.
  * To *change* data, **always** use `df.loc[mask, 'col_to_change'] = new_value`.

-----

### 10\. Practice Tasks

**Data for Tasks:**

```python
df_practice = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Carrot', 'Donut', 'Eclair'],
    'Category': ['Fruit', 'Fruit', 'Veg', 'Bakery', 'Bakery'],
    'Price': [0.5, 0.4, 0.2, 1.0, 1.5],
    'Stock': [50, 75, 100, 30, 20]
})
```

**Task 17 (Easy):**
Select a new DataFrame `df_cheap` that contains only products that cost less than 50 cents (`Price < 0.5`).

**Task 18 (Medium):**
Select a new DataFrame `df_bakery_stocked` that contains all 'Bakery' items that have 'Stock' \> 25.

**Task 19 (Hard):**
Select a new DataFrame `df_targets` that contains all 'Fruit' items *OR* any item (from any category) with a 'Price' over $1.00.

**Bonus Task 20 (Hardest):**
Use `.loc` to update `df_practice`. Create a new column 'Discount' and set it to `True` for all 'Fruit' items, but `False` for everything else.

-----

### 11\. Recommended Next Topic

You have mastered the standard (and fastest) way to filter. The next logical steps are to learn two "shortcut" methods for filtering.

**Recommended:** **.query() and .isin() for filtering**

-----

### 12\. Quick Reference Card

| Operation | Syntax | Example |
| :--- | :--- | :--- |
| **Single Condition** | `df[df['col'] > val]` | `df[df['Age'] > 18]` |
| **AND** | `df[(cond1) & (cond2)]` | `df[(df['Age'] > 18) & (df['Cat'] == 'A')]` |
| **OR** | `df[(cond1) \| (cond2)]` | `df[(df['Age'] > 18) \| (df['Cat'] == 'A')]` |
| **NOT** | `df[~(cond1)]` | `df[~(df['City'] == 'New York')]` |
| **String Method** | `df[df['col'].str.method()]` | `df[df['Name'].str.startswith('A')]` |
| **Is Null** | `df[df['col'].isna()]` | `df[df['Email'].isna()]` |
| **Is Not Null** | `df[df['col'].notna()]` | `df[df['Email'].notna()]` |
| **Set Value (GOOD)** | `df.loc[mask, 'col'] = val` | `df.loc[df['Age'] < 0, 'Age'] = 0` |

-----

### 13\. Common Interview Questions

1.  **I'm filtering with `df[df['A'] > 5 and df['B'] < 10]`. Why do I get a `ValueError`?**
      * You must use the bitwise `&` (AND) operator, not the Python `and` keyword. `and` tries to compare the truth of the *entire Series*, which is ambiguous.
2.  **What's the *other* mistake with that code?**
      * Once you switch to `&`, you also need parentheses `()` around each condition, because `&` binds more tightly than `>` and `<`. The correct syntax is `df[(df['A'] > 5) & (df['B'] < 10)]`.
3.  **How do you find all rows where `City` is *not* 'New York'?**
      * You can use the `~` (NOT) operator: `df[~(df['City'] == 'New York')]`.
      * You can also use the "not equals" operator: `df[df['City'] != 'New York']`.
4.  **How do you set the 'Status' to 'Expired' for all rows where 'Date' is before today?**
      * You *must* use `.loc` to avoid the `SettingWithCopyWarning`.
      * `today = pd.to_datetime('today')`
      * `df.loc[df['Date'] < today, 'Status'] = 'Expired'`
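
The answer to question 4 can be tried on a small made-up `subs` DataFrame. A fixed cutoff date stands in for "today" so the result is reproducible; real code would more likely use the current date.

```python
import pandas as pd

# Hypothetical subscription data, used only for this illustration
subs = pd.DataFrame({
    'Date': pd.to_datetime(['2023-12-31', '2024-06-01', '2024-01-15']),
    'Status': ['Active', 'Active', 'Active'],
})

# Fixed cutoff for reproducibility (assumption: stands in for today's date)
cutoff = pd.Timestamp('2024-02-01')

# One .loc call: select rows by mask and assign in the same step
subs.loc[subs['Date'] < cutoff, 'Status'] = 'Expired'
print(subs['Status'].tolist())  # ['Expired', 'Active', 'Expired']
```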

-----

### 14\. Performance Considerations

  * **Time Complexity:** Boolean indexing is **O(n)**, where 'n' is the number of rows. Pandas must create the mask (check the condition for every row) and then build the new DataFrame.
  * This is highly **vectorized** and extremely fast, far faster than any `for` loop you could write.
  * **Memory Usage (Copy vs. View):**
      * Filtering with a boolean mask, via `df[mask]` or `df.loc[mask]`, returns a **copy** of the selected rows.
      * The copy is a new DataFrame object with its own data for those rows, so it costs extra memory in proportion to the number of rows kept.
      * This is *why* setting a value through chained indexing triggers a warning: the assignment lands on that *new copy*, and the original DataFrame is not changed.
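
To compare the vectorized mask with a hand-written loop on your own machine, a rough sketch like the following can be used. The data is random and the function names are made up; timings depend on hardware and Pandas version, so none are quoted here.

```python
import timeit
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 random ages
rng = np.random.default_rng(0)
big = pd.DataFrame({'Age': rng.integers(0, 100, size=100_000)})

def with_mask():
    # Vectorized: one comparison over the whole column
    return big[big['Age'] > 30]

def with_loop():
    # Hand-written loop: collect matching positions one by one
    keep = [i for i, age in enumerate(big['Age']) if age > 30]
    return big.iloc[keep]

# Both approaches select exactly the same rows
same_rows = with_mask().equals(with_loop())
print(same_rows)  # True

print('mask:', timeit.timeit(with_mask, number=5))
print('loop:', timeit.timeit(with_loop, number=5))
```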

-----

### 15\. When NOT to Use This

  * **When you know the *label*:** If you want to get row 'a', don't write `df[df.index == 'a']`. That is slow. Use the fast, optimized `df.loc['a']`.
  * **When you know the *position*:** If you want the 5th row, don't try to find a condition. Use the fast, optimized `df.iloc[4]`.
  * **When filtering for a list of values:** You *can* write `df[(df['City'] == 'A') | (df['City'] == 'B') | (df['City'] == 'C')]`, but it's much, much cleaner to use `.isin()` (the next topic).
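
The three alternatives above can be sketched on a small made-up `cities` DataFrame with string labels. Note that `df.loc['a']` returns the row as a Series, while a boolean mask returns a one-row DataFrame.

```python
import pandas as pd

# Hypothetical data with string index labels
cities = pd.DataFrame(
    {'City': ['A', 'B', 'C', 'D', 'E'], 'Sales': [10, 20, 30, 40, 50]},
    index=['a', 'b', 'c', 'd', 'e'],
)

# Known label: direct lookup instead of cities[cities.index == 'a']
row_a = cities.loc['a']
print(row_a['Sales'])  # 10

# Known position: the 5th row is position 4
fifth = cities.iloc[4]
print(fifth['City'])   # E

# Several allowed values: isin() builds the same mask as chained ORs
chained = (cities['City'] == 'A') | (cities['City'] == 'B') | (cities['City'] == 'C')
compact = cities['City'].isin(['A', 'B', 'C'])
print(chained.equals(compact))  # True
```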