# 03 - Data Selection and Filtering

## Introduction

Selecting and filtering data are among the most common operations in data engineering. You'll frequently need to extract specific rows and columns by label, by position, or by condition.

## What You'll Learn

- Selecting columns
- Selecting rows by index
- Selecting rows by condition (filtering)
- Using `loc` and `iloc`
- Boolean indexing
- Multiple conditions


In [1]:
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 28, 32, 27],
    'City': ['New York', 'London', 'Tokyo', 'Paris', 'Sydney', 'Berlin'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 58000],
    'Department': ['IT', 'Sales', 'IT', 'Marketing', 'IT', 'Sales']
})

print("Sample DataFrame:")
print(df)


Sample DataFrame:
      Name  Age      City  Salary Department
0    Alice   25  New York   50000         IT
1      Bob   30    London   60000      Sales
2  Charlie   35     Tokyo   70000         IT
3    Diana   28     Paris   55000  Marketing
4      Eve   32    Sydney   65000         IT
5    Frank   27    Berlin   58000      Sales


## Selecting Columns

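You can select one or more columns from a DataFrame. Indexing with a single label, `df['Name']`, returns a Series; indexing with a list of labels, `df[['Name', 'Age']]`, returns a DataFrame.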


In [2]:
# Select a single column (returns Series)
name_column = df['Name']
print("Single column (Series):")
print(name_column)
print(f"\nType: {type(name_column)}")


Single column (Series):
0      Alice
1        Bob
2    Charlie
3      Diana
4        Eve
5      Frank
Name: Name, dtype: object

Type: <class 'pandas.core.series.Series'>


In [3]:
# Select multiple columns (returns DataFrame)
name_age = df[['Name', 'Age']]
print("Multiple columns (DataFrame):")
print(name_age)
print(f"\nType: {type(name_age)}")


Multiple columns (DataFrame):
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    Diana   28
4      Eve   32
5    Frank   27

Type: <class 'pandas.core.frame.DataFrame'>
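
The return type depends on the brackets, not on the number of columns. As a minimal sketch (using a shortened copy of the sample data), a list containing one label still produces a DataFrame:

```python
import pandas as pd

# Shortened copy of the sample data, for illustration only
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

as_series = df['Name']    # single label: returns a Series
as_frame = df[['Name']]   # list with one label: returns a one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (3, 1)
```

This matters when later code expects DataFrame methods or a two-dimensional shape.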


## Filtering Rows by Condition

Filtering allows you to select rows that meet certain conditions. This is similar to SQL's WHERE clause.


In [4]:
# Filter rows where Age is greater than 30
older_than_30 = df[df['Age'] > 30]
print("People older than 30:")
print(older_than_30)


People older than 30:
      Name  Age    City  Salary Department
2  Charlie   35   Tokyo   70000         IT
4      Eve   32  Sydney   65000         IT
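
The filter above is boolean indexing: the comparison `df['Age'] > 30` first builds a Series of `True`/`False` values (often called a mask), and `df[...]` then keeps the rows where the mask is `True`. The sketch below splits those two steps apart; the variable name `mask` is just an illustrative choice.

```python
import pandas as pd

# Name and Age columns from the sample data
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 28, 32, 27],
})

# Step 1: the comparison returns a boolean Series aligned with df's index
mask = df['Age'] > 30
print(mask.tolist())  # [False, False, True, False, True, False]

# True counts as 1, so summing the mask counts the matching rows
print(mask.sum())     # 2

# Step 2: indexing with the mask keeps only the rows marked True
print(df[mask])
```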


In [5]:
# Filter rows where Salary is greater than 60000
high_salary = df[df['Salary'] > 60000]
print("People with salary > 60000:")
print(high_salary)


People with salary > 60000:
      Name  Age    City  Salary Department
2  Charlie   35   Tokyo   70000         IT
4      Eve   32  Sydney   65000         IT


In [6]:
# Filter rows with string conditions
it_department = df[df['Department'] == 'IT']
print("People in IT department:")
print(it_department)


People in IT department:
      Name  Age      City  Salary Department
0    Alice   25  New York   50000         IT
2  Charlie   35     Tokyo   70000         IT
4      Eve   32    Sydney   65000         IT


## Multiple Conditions

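You can combine multiple conditions using `&` (AND) and `|` (OR). **Important:** Use parentheses around each condition! In Python, `&` and `|` bind more tightly than comparison operators such as `>` and `==`, so an unparenthesized expression is not evaluated the way it reads. Python's `and` and `or` keywords do not work here either, because they cannot operate element-wise on a Series.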


In [7]:
# AND condition: Age > 30 AND Salary > 60000
condition = (df['Age'] > 30) & (df['Salary'] > 60000)
result = df[condition]
print("Age > 30 AND Salary > 60000:")
print(result)


Age > 30 AND Salary > 60000:
      Name  Age    City  Salary Department
2  Charlie   35   Tokyo   70000         IT
4      Eve   32  Sydney   65000         IT


In [8]:
# OR condition: Department == 'IT' OR Department == 'Sales'
condition = (df['Department'] == 'IT') | (df['Department'] == 'Sales')
result = df[condition]
print("Department is IT OR Sales:")
print(result)


Department is IT OR Sales:
      Name  Age      City  Salary Department
0    Alice   25  New York   50000         IT
1      Bob   30    London   60000      Sales
2  Charlie   35     Tokyo   70000         IT
4      Eve   32    Sydney   65000         IT
5    Frank   27    Berlin   58000      Sales


In [9]:
# Complex condition: (Age > 28) AND (Salary > 55000) AND (Department == 'IT')
condition = (df['Age'] > 28) & (df['Salary'] > 55000) & (df['Department'] == 'IT')
result = df[condition]
print("Complex condition:")
print(result)


Complex condition:
      Name  Age    City  Salary Department
2  Charlie   35   Tokyo   70000         IT
4      Eve   32  Sydney   65000         IT
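
Two related points can be sketched with the same sample data: the `~` operator negates a condition (NOT), and leaving out the parentheses typically fails with a `ValueError` rather than returning wrong rows. The `raised` flag below exists only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 28, 32, 27],
    'Salary': [50000, 60000, 70000, 55000, 65000, 58000],
    'Department': ['IT', 'Sales', 'IT', 'Marketing', 'IT', 'Sales'],
})

# NOT condition: everyone outside the IT department
not_it = df[~(df['Department'] == 'IT')]
print(not_it['Name'].tolist())  # ['Bob', 'Diana', 'Frank']

# Without parentheses, Python evaluates 30 & df['Salary'] first, and the
# resulting chained comparison asks for the truth value of a whole Series.
raised = False
try:
    df[df['Age'] > 30 & df['Salary'] > 60000]
except ValueError:
    raised = True
print(raised)  # True
```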


## Using loc for Label-Based Selection

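`loc` is used for label-based indexing. It selects by row and column labels, and label slices include both endpoints, so `df.loc[0:2]` returns the rows labeled 0, 1, and 2.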


In [10]:
# Select specific row and column using loc
# Syntax: df.loc[row_label, column_label]
print("Select row 0, column 'Name':")
print(df.loc[0, 'Name'])


Select row 0, column 'Name':
Alice


In [11]:
# Select multiple rows and columns
print("Select rows 0-2, columns 'Name' and 'Age':")
print(df.loc[0:2, ['Name', 'Age']])


Select rows 0-2, columns 'Name' and 'Age':
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


In [12]:
# Select all rows for specific columns
print("All rows, Name and Salary columns:")
print(df.loc[:, ['Name', 'Salary']])


All rows, Name and Salary columns:
      Name  Salary
0    Alice   50000
1      Bob   60000
2  Charlie   70000
3    Diana   55000
4      Eve   65000
5    Frank   58000


In [13]:
# Using loc with conditions
print("Using loc with condition:")
print(df.loc[df['Age'] > 30, ['Name', 'Age', 'Salary']])


Using loc with condition:
      Name  Age  Salary
2  Charlie   35   70000
4      Eve   32   65000


## Using iloc for Position-Based Selection

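`iloc` is used for integer position-based indexing. Positions are 0-based, and slices exclude the end position as in regular Python slicing, so `df.iloc[0:3]` returns the rows at positions 0, 1, and 2.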


In [14]:
# Select specific row and column using iloc
# Syntax: df.iloc[row_position, column_position]
print("Select row 0, column 0:")
print(df.iloc[0, 0])


Select row 0, column 0:
Alice


In [15]:
# Select first 3 rows, first 2 columns
print("First 3 rows, first 2 columns:")
print(df.iloc[0:3, 0:2])


First 3 rows, first 2 columns:
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


In [16]:
# Select all rows, specific columns by position
print("All rows, columns 0, 2, 3:")
print(df.iloc[:, [0, 2, 3]])


All rows, columns 0, 2, 3:
      Name      City  Salary
0    Alice  New York   50000
1      Bob    London   60000
2  Charlie     Tokyo   70000
3    Diana     Paris   55000
4      Eve    Sydney   65000
5    Frank    Berlin   58000
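
With the default integer index, labels and positions happen to coincide, which hides the difference between `loc` and `iloc`. The sketch below, using part of the sample data, shows where they diverge: slice endpoints, and rows that keep their original labels after filtering.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 28, 32, 27],
})

# Same slice bounds, different results
print(len(df.loc[0:2]))   # 3 (label slice, end label included)
print(len(df.iloc[0:2]))  # 2 (position slice, end position excluded)

# Filtering keeps the original labels (2 and 4), not a fresh 0..n-1 index
older = df[df['Age'] > 30]
print(older.index.tolist())    # [2, 4]
print(older.loc[2, 'Name'])    # Charlie (row labeled 2)
print(older.iloc[0]['Name'])   # Charlie (first row by position)
```

On the filtered frame, `older.loc[0]` would raise a `KeyError` because no row carries the label 0, while `older.iloc[0]` still works.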


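## Common Filtering Operations

Series methods such as `isin()`, `str.contains()`, and `between()` express common filter patterns more concisely than chained comparisons. Each returns a boolean Series that can be used as a mask.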


In [17]:
# Using isin() for multiple values
departments = df[df['Department'].isin(['IT', 'Sales'])]
print("People in IT or Sales:")
print(departments)


People in IT or Sales:
      Name  Age      City  Salary Department
0    Alice   25  New York   50000         IT
1      Bob   30    London   60000      Sales
2  Charlie   35     Tokyo   70000         IT
4      Eve   32    Sydney   65000         IT
5    Frank   27    Berlin   58000      Sales


In [18]:
# Using str.contains() for substring matching (case=False ignores letter case)
names_with_a = df[df['Name'].str.contains('a', case=False)]
print("Names containing 'a':")
print(names_with_a)


Names containing 'a':
      Name  Age      City  Salary Department
0    Alice   25  New York   50000         IT
2  Charlie   35     Tokyo   70000         IT
3    Diana   28     Paris   55000  Marketing
5    Frank   27    Berlin   58000      Sales


In [19]:
# Using between() for a range (inclusive='both' keeps both endpoints)
age_range = df[df['Age'].between(28, 32, inclusive='both')]
print("Age between 28 and 32:")
print(age_range)


Age between 28 and 32:
    Name  Age    City  Salary Department
1    Bob   30  London   60000      Sales
3  Diana   28   Paris   55000  Marketing
4    Eve   32  Sydney   65000         IT
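
The `inclusive` argument of `between()` controls which endpoints count as matches. As a small sketch on the sample ages, assuming a pandas version that accepts the string options (the cell above already relies on `inclusive='both'`):

```python
import pandas as pd

# Ages from the sample DataFrame
ages = pd.Series([25, 30, 35, 28, 32, 27])

results = {}
for option in ['both', 'neither', 'left', 'right']:
    # Keep only the ages inside the 28..32 range under each endpoint rule
    results[option] = ages[ages.between(28, 32, inclusive=option)].tolist()
    print(option, results[option])

# both    [30, 28, 32]   28 <= age <= 32
# neither [30]           28 <  age <  32
# left    [30, 28]       28 <= age <  32
# right   [30, 32]       28 <  age <= 32
```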


## Summary

In this notebook, you learned:
- ✅ How to select columns (single and multiple)
- ✅ How to filter rows by conditions
- ✅ How to combine multiple conditions with `&` and `|`
- ✅ How to use `loc` for label-based selection
- ✅ How to use `iloc` for position-based selection
- ✅ Common filtering operations (`isin`, `str.contains`, `between`)

**Next:** Learn how to clean data in `04_data_cleaning.ipynb`
