# Boolean Indexing and Conditional Filtering

### What Is Boolean Indexing?

Boolean indexing is a fundamental technique in Pandas that allows you to filter rows in a DataFrame based on one or more logical conditions. When you apply a condition to a column, Pandas returns a Boolean Series consisting of `True` and `False` values for each row. Using this Boolean Series as an index, you can select only the rows where the condition evaluates to `True`.

This method is essential for efficiently extracting subsets of data without the need for explicit loops, which can be slower and less readable. Boolean indexing provides a clear, concise, and performant way to manipulate and analyze data based on conditions.
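The two steps described above, building a Boolean Series and then indexing with it, can be seen on a tiny frame. This is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not the Titanic file used later in this section.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dev"],
    "Age": [22, 35, 41, None],
})

# Comparing a column to a scalar yields a Boolean Series, one value per row.
# A missing Age compares as False, so that row is excluded.
mask = toy["Age"] > 30
print(mask.tolist())            # [False, True, True, False]

# Indexing with the mask keeps only the rows where it is True.
older = toy[mask]
print(older["Name"].tolist())   # ['Bob', 'Cara']
```

Storing the mask in a variable first is optional, but it makes the intermediate Boolean Series easy to inspect.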

### Why Is It Important?

Boolean indexing is crucial in data analysis, especially in machine learning workflows, for several reasons:

- **Data Cleaning**: Quickly filter out or isolate rows with missing or invalid values.
- **Data Segmentation**: Separate data into meaningful groups based on conditions such as age, class, or survival status.
- **Feature Engineering**: Create new features based on conditional logic, such as categorizing passengers as children or adults.
- **Exploratory Data Analysis (EDA)**: Inspect specific subsets of data to understand patterns or anomalies.
- **Label and Sample Selection**: Filter samples that meet criteria for training or validation in machine learning models.

Mastering Boolean indexing allows you to handle complex filtering logic with readable and maintainable code, which is essential for working with real-world datasets.

### Syntax and Usage

The basic syntax uses the DataFrame with a condition inside square brackets:

```python
df[condition]
```

For example:

```python
import pandas as pd

df = pd.read_csv("data/train.csv")

print(df[df['Age'] > 30].head())
```

```
    PassengerId  Survived  Pclass  \
1             2         1       1
3             4         1       1
4             5         0       3
6             7         0       1
11           12         1       1

                                                 Name     Sex   Age  SibSp  \
1   Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
3        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                            Allen, Mr. William Henry    male  35.0      0
6                             McCarthy, Mr. Timothy J    male  54.0      0
11                           Bonnell, Miss. Elizabeth  female  58.0      0

    Parch    Ticket     Fare Cabin Embarked
1       0  PC 17599  71.2833   C85        C
3       0    113803  53.1000  C123        S
4       0    373450   8.0500   NaN        S
6       0     17463  51.8625   E46        S
11      0    113783  26.5500  C103        S
```

The expression `df[df['Age'] > 30]` returns all rows where the `Age` column is greater than 30; `.head()` limits the printed output to the first five of them.

### Common Conditional Operators

| Operator | Description |
| --- | --- |
| `==` | Equal to |
| `!=` | Not equal to |
| `>` | Greater than |
| `<` | Less than |
| `>=` | Greater than or equal |
| `<=` | Less than or equal |
| `&` | Logical AND (combine multiple conditions, requires parentheses) |
| `\|` | Logical OR (combine multiple conditions, requires parentheses) |
| `~` | Logical NOT (negation) |

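**Note:** When combining multiple conditions, enclose each condition in parentheses. In Python, `&` and `|` bind more tightly than comparison operators such as `==` and `>`, so an unparenthesized expression like `df['Age'] > 30 & df['Pclass'] == 1` is not parsed as two separate comparisons.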
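The effect of parentheses can be checked on a small example. The sketch below uses a made-up `toy` frame (hypothetical names and values, not the Titanic data) to show a parenthesized filter, its negation with `~`, and the failure of the unparenthesized form. The exact exception type for the last case may vary between pandas versions, so the code catches both `TypeError` and `ValueError`.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
toy = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Age": [29, 45, 52, 18],
})

# Parenthesized: each comparison is evaluated first, then combined row by row.
women_over_40 = toy[(toy["Sex"] == "female") & (toy["Age"] > 40)]
print(women_over_40.index.tolist())   # [2]

# ~ inverts a mask: every row that is NOT a woman over 40.
others = toy[~((toy["Sex"] == "female") & (toy["Age"] > 40))]
print(others.index.tolist())          # [0, 1, 3]

# Without parentheses, & is applied before the comparisons, so Python
# first tries to evaluate "female" & toy["Age"], which is an error.
try:
    toy[toy["Sex"] == "female" & toy["Age"] > 40]
    unparenthesized_failed = False
except (TypeError, ValueError) as exc:
    unparenthesized_failed = True
    print(type(exc).__name__)
```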

### Examples Using the Titanic Dataset

```python
# Passengers older than 60
print(df[df['Age'] > 60].head())

# Female passengers who survived
print(df[(df['Sex'] == 'female') & (df['Survived'] == 1)].head())

# Passengers who paid more than 100 or were in first class
print(df[(df['Fare'] > 100) | (df['Pclass'] == 1)].head())

# Passengers with missing Age values
print(df[df['Age'].isnull()].head())

# Passengers not in third class
print(df[df['Pclass'] != 3].head())
```

```
     PassengerId  Survived  Pclass                            Name   Sex  \
33            34         0       2           Wheadon, Mr. Edward H  male
54            55         0       1  Ostby, Mr. Engelhart Cornelius  male
96            97         0       1       Goldschmidt, Mr. George B  male
116          117         0       3            Connors, Mr. Patrick  male
170          171         0       1       Van der hoef, Mr. Wyckoff  male

      Age  SibSp  Parch      Ticket     Fare Cabin Embarked
33   66.0      0      0  C.A. 24579  10.5000   NaN        S
54   65.0      0      1      113509  61.9792   B30        C
96   71.0      0      0    PC 17754  34.6542    A5        C
116  70.5      0      0      370369   7.7500   NaN        Q
170  61.0      0      0      111240  33.5000   B19        S
   PassengerId  Survived  Pclass  \
1            2         1       1
2            3         1       3
3            4         1       1
8            9         1
...
```

### Additional Filtering Methods

- `.isin()` method to filter rows where column values match a list of values:

```python
print(df[df['Pclass'].isin([1, 2])].head())
```

```
    PassengerId  Survived  Pclass  \
1             2         1       1
3             4         1       1
6             7         0       1
9            10         1       2
11           12         1       1

                                                 Name     Sex   Age  SibSp  \
1   Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
3        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
6                             McCarthy, Mr. Timothy J    male  54.0      0
9                 Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1
11                           Bonnell, Miss. Elizabeth  female  58.0      0

    Parch    Ticket     Fare Cabin Embarked
1       0  PC 17599  71.2833   C85        C
3       0    113803  53.1000  C123        S
6       0     17463  51.8625   E46        S
9       0    237736  30.0708   NaN        C
11      0    113783  26.5500  C103        S
```


- `.between()` method to filter values within a range (both endpoints are included by default):

```python
print(df[df['Fare'].between(50, 100)].head())
```

```
    PassengerId  Survived  Pclass  \
1             2         1       1
3             4         1       1
6             7         0       1
34           35         0       1
35           36         0       1

                                                 Name     Sex   Age  SibSp  \
1   Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
3        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
6                             McCarthy, Mr. Timothy J    male  54.0      0
34                            Meyer, Mr. Edgar Joseph    male  28.0      1
35                     Holverson, Mr. Alexander Oskar    male  42.0      1

    Parch    Ticket     Fare Cabin Embarked
1       0  PC 17599  71.2833   C85        C
3       0    113803  53.1000  C123        S
6       0     17463  51.8625   E46        S
34      0  PC 17604  82.1708   NaN        C
35      0    113789  52.0000   NaN        S
```
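Both helper methods return ordinary Boolean masks, so they combine with `~`, `&`, and `|` like any other condition. The following sketch, again on a made-up `toy` frame with hypothetical values, negates an `.isin()` mask and checks that `.between()` with its default settings matches the explicit `>=` / `<=` form.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
toy = pd.DataFrame({
    "Pclass": [1, 2, 3, 3, 1],
    "Fare": [50.0, 75.5, 100.0, 7.25, 120.0],
})

# ~ negates the .isin() mask: rows whose Pclass is NOT 1 or 2.
not_upper = toy[~toy["Pclass"].isin([1, 2])]
print(not_upper.index.tolist())    # [2, 3]

# .between() includes both endpoints by default, so 50.0 and 100.0 match.
mid_fare = toy[toy["Fare"].between(50, 100)]
print(mid_fare["Fare"].tolist())   # [50.0, 75.5, 100.0]

# Equivalent explicit form using comparison operators.
same = toy[(toy["Fare"] >= 50) & (toy["Fare"] <= 100)]
print(mid_fare.equals(same))       # True
```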


### Practical Use Cases in Machine Learning

Boolean indexing is extensively used to prepare data for machine learning tasks:

- Selecting only labeled data points for supervised learning.
- Extracting specific demographic groups for targeted analysis.
- Removing or imputing missing or invalid data by filtering.
- Creating binary features or flags based on conditions (e.g., `is_child` where `Age < 12`).
- Segmenting data to analyze model performance across subpopulations.
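A few of these use cases can be sketched together. The example below is illustrative only: the `toy` frame and its values are invented, the column names mirror the Titanic dataset, and the survival rates it prints describe the toy rows rather than any real result.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
toy = pd.DataFrame({
    "Age": [8.0, 34.0, None, 11.0, 60.0],
    "Survived": [1, 0, 1, 1, 0],
})

# Drop rows with a missing Age by filtering; .copy() gives an independent frame.
known_age = toy[toy["Age"].notna()].copy()

# Binary flag: 1 where Age < 12, otherwise 0.
known_age["is_child"] = (known_age["Age"] < 12).astype(int)
print(known_age["is_child"].tolist())   # [1, 0, 1, 0]

# Compare an outcome across the two subpopulations.
child_rate = known_age.loc[known_age["is_child"] == 1, "Survived"].mean()
adult_rate = known_age.loc[known_age["is_child"] == 0, "Survived"].mean()
print(child_rate, adult_rate)           # 1.0 0.0
```

The same pattern applies when comparing a model's error rate across subgroups: build one mask per subgroup and compute the metric on each filtered subset.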

### Best Practices

- Always use parentheses `()` to group conditions when using logical operators `&` (AND), `|` (OR), and `~` (NOT).
- Avoid Python's built-in `and`/`or`/`not` keywords with Pandas Series. They do not operate element-wise and instead raise `ValueError: The truth value of a Series is ambiguous`.
- Use `.copy()` on filtered DataFrames if you intend to modify them to avoid `SettingWithCopyWarning`.
- For complex filtering logic, consider using the `.query()` method (covered in advanced topics).
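These practices can be illustrated on a small frame. This is a sketch under assumptions: `toy` and its values are hypothetical, and the `.query()` call shows only the simplest string form of that method.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration.
toy = pd.DataFrame({
    "Sex": ["female", "male", "female"],
    "Age": [29, 45, 52],
    "Fare": [10.0, 80.0, 120.0],
})

# Python's `and` asks a whole Series for a single truth value,
# which pandas refuses with a ValueError.
try:
    toy[(toy["Sex"] == "female") and (toy["Age"] > 40)]
except ValueError as exc:
    print(type(exc).__name__)   # ValueError

# .copy() makes the filtered subset independent of the original frame,
# so modifying it leaves `toy` unchanged.
women = toy[toy["Sex"] == "female"].copy()
women["Fare"] = women["Fare"] * 2
print(toy["Fare"].tolist())     # [10.0, 80.0, 120.0]

# .query() expresses the same filter as a string; inside the string,
# `and` is allowed and is applied element-wise.
via_query = toy.query("Sex == 'female' and Age > 40")
via_mask = toy[(toy["Sex"] == "female") & (toy["Age"] > 40)]
print(via_query.equals(via_mask))   # True
```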

### Exercises

Q1. Filter all passengers who survived

```python
survivors = df[df['Survived'] == 1]
print(survivors[['Name', 'Survived']].head())
```

```
                                                Name  Survived
1  Cumings, Mrs. John Bradley (Florence Briggs Th...         1
2                             Heikkinen, Miss. Laina         1
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)         1
8  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)         1
9                Nasser, Mrs. Nicholas (Adele Achem)         1
```


Q2. Show all male passengers older than 40

```python
older_males = df[(df['Sex'] == 'male') & (df['Age'] > 40)]
print(older_males[['Name', 'Sex', 'Age']].head())
```

```
                              Name   Sex   Age
6          McCarthy, Mr. Timothy J  male  54.0
33           Wheadon, Mr. Edward H  male  66.0
35  Holverson, Mr. Alexander Oskar  male  42.0
54  Ostby, Mr. Engelhart Cornelius  male  65.0
62     Harris, Mr. Henry Birkhardt  male  45.0
```


Q3. Find all passengers who paid more than 100 and were in 1st class

```python
high_fare_first_class = df[(df['Fare'] > 100) & (df['Pclass'] == 1)]
print(high_fare_first_class[['Name', 'Fare', 'Pclass']].head())
```

```
                                               Name      Fare  Pclass
27                   Fortune, Mr. Charles Alexander  263.0000       1
31   Spencer, Mrs. William Augustus (Marie Eugenie)  146.5208       1
88                       Fortune, Miss. Mabel Helen  263.0000       1
118                        Baxter, Mr. Quigg Edmond  247.5208       1
195                            Lurette, Miss. Elise  146.5208       1
```


Q4. Get all passengers with missing Age values

```python
missing_ages = df[df['Age'].isnull()]
print(missing_ages[['Name', 'Age']].head())
```

```
                             Name  Age
5                Moran, Mr. James  NaN
17   Williams, Mr. Charles Eugene  NaN
19        Masselmani, Mrs. Fatima  NaN
26        Emir, Mr. Farred Chehab  NaN
28  O'Dwyer, Miss. Ellen "Nellie"  NaN
```


Q5. Select female passengers who embarked from Cherbourg and survived

```python
female_survivors_cherbourg = df[
    (df['Sex'] == 'female') &
    (df['Embarked'] == 'C') &
    (df['Survived'] == 1)
]
print(female_survivors_cherbourg[['Name', 'Sex', 'Embarked', 'Survived']].head())
```

```
                                                 Name     Sex Embarked  \
1   Cumings, Mrs. John Bradley (Florence Briggs Th...  female        C
9                 Nasser, Mrs. Nicholas (Adele Achem)  female        C
19                            Masselmani, Mrs. Fatima  female        C
31     Spencer, Mrs. William Augustus (Marie Eugenie)  female        C
39                        Nicola-Yarred, Miss. Jamila  female        C

    Survived
1          1
9          1
19         1
31         1
39         1
```


### Summary

Boolean indexing and conditional filtering are core operations in Pandas for extracting relevant subsets of data based on logical conditions. By applying Boolean expressions to DataFrame columns, you generate a Boolean mask that efficiently filters rows without the need for iterative loops. This technique is essential for data cleaning, segmentation, feature engineering, and exploratory analysis, all of which are foundational steps in building robust machine learning models.

Using operators like `==`, `!=`, `>`, `<` alongside logical connectors such as `&` and `|` enables complex multi-condition filters to be expressed concisely and clearly. Additional methods like `.isin()` and `.between()` further simplify common filtering patterns.

Mastering Boolean indexing ensures your data manipulation code is both performant and maintainable, helping you to focus on extracting meaningful insights from your data and preparing it effectively for modeling and analysis.