# Lecture 3: Pandas Filtering - Step by Step

Learn how to extract exactly the data you need from DataFrames, building up from simple to complex filtering.

## What We'll Learn:
1. Select single columns
2. Select multiple columns
3. Select rows by position (iloc)
4. Select rows and columns together (loc)
5. Filter rows using conditions (boolean filtering)
6. Combine multiple conditions

Let's start with our cancer research dataset! 🧬

## Load Our Dataset

First, let's load our DepMap cancer dataset and take a look at it.

In [None]:
import pandas as pd
import numpy as np

# Load the DepMap CRISPR dataset
url = "https://zenodo.org/records/17098555/files/combined_model_crispr_data_filtered.csv?download=1"
df = pd.read_csv(url)

print(f"Dataset shape: {df.shape}")
print(f"\nFirst 3 rows:")
print(df.head(3))

In [None]:
# Let's see what columns we have
print("Column names:")
print(df.columns.tolist()[:10])  # First 10 columns

---

## Section 1: Selecting a Single Column

### Guided Example 1.1: Getting one column

The simplest thing we can do is get just one column from our DataFrame.

In [None]:
# Get the 'cell_line_name' column
cell_lines = df['cell_line_name']

print("First 10 cell line names:")
print(cell_lines.head(10))

**What's happening here?**
- We use square brackets: `df['column_name']`
- This gives us a **Series** (a single column)
- It's like picking one column from a spreadsheet
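
To make the bullets above concrete without downloading anything, here is a minimal sketch on a tiny stand-in table. The name `toy` and every value in it are made up for illustration; only the column names are borrowed from the lecture dataset.

```python
import pandas as pd

# Toy stand-in for the lecture's DataFrame; every value here is made up
toy = pd.DataFrame({
    "cell_line_name": ["LINE_A", "LINE_B", "LINE_C"],
    "oncotree_lineage": ["Breast", "Lung", "Breast"],
})

names = toy["cell_line_name"]    # square brackets with one name -> a Series
print(type(names).__name__)      # Series
print(names.name)                # the Series keeps its column name

# Asking for a column that does not exist raises a KeyError
try:
    toy["not_a_column"]
    missing_raised = False
except KeyError:
    missing_raised = True
print(f"KeyError for a missing column: {missing_raised}")
```

If you see a `KeyError` on the real dataset, checking `df.columns` for typos or different capitalization is usually the first thing to try.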

### Guided Example 1.2: Looking at the column type

In [None]:
# Get a column and check what type it is
cancer_types = df['oncotree_lineage']

print(f"Data type: {type(cancer_types)}")
print(f"\nHow many values: {len(cancer_types)}")
print(f"\nUnique cancer types: {cancer_types.nunique()}")
print(f"\nFirst few values:")
print(cancer_types.head())

**What's new here?**
- When we select one column, we get a Series (not a DataFrame)
- We can use methods like `.nunique()` to count unique values
- Series share many methods with DataFrames, such as `.head()`
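
The difference between counting all values and counting unique values is easy to check on a small example. This is only a sketch: `toy` and its values are invented for illustration.

```python
import pandas as pd

# Toy stand-in; the values are made up for illustration
toy = pd.DataFrame({
    "oncotree_lineage": ["Breast", "Lung", "Breast", "Skin"],
})

lineage = toy["oncotree_lineage"]

n_values = len(lineage)            # counts every row, repeats included
n_unique = lineage.nunique()       # counts distinct values only
unique_values = lineage.unique()   # the distinct values, in order of first appearance
counts = lineage.value_counts()    # how many rows each value has

print(n_values, n_unique)
print(list(unique_values))
print(counts)
```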

### Practice Example 1.1: Get the model_id column

Select the 'model_id' column and print the first 8 values

In [None]:
# YOUR CODE HERE: get the 'model_id' column
# Print the first 8 values

### Practice Example 1.2: Get a gene column

Select the 'A1BG' gene column and print:
- How many values it has
- The first 5 values

In [None]:
# YOUR CODE HERE: get the 'A1BG' column
# Print the length and first 5 values

---

## Section 2: Selecting Multiple Columns

### Guided Example 2.1: Getting several columns

What if we want more than one column? We pass a **list** of column names.

In [None]:
# Get three columns: cell line name, cancer type, and primary disease
columns_we_want = ['cell_line_name', 'oncotree_lineage', 'oncotree_primary_disease']
subset = df[columns_we_want]

print("Selected 3 columns:")
print(subset.head())

**What's happening here?**
- We use **double brackets**: `df[['col1', 'col2', 'col3']]`
- The inner brackets make a list of column names
- The outer brackets select those columns from the DataFrame
- This gives us a **DataFrame** (not a Series)
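
The bracket count matters even when you ask for only one column. The sketch below uses a made-up `toy` table to compare single and double brackets.

```python
import pandas as pd

# Toy stand-in; the values are made up for illustration
toy = pd.DataFrame({
    "cell_line_name": ["LINE_A", "LINE_B", "LINE_C"],
    "A1BG": [-0.12, 0.03, 0.05],
})

as_series = toy["A1BG"]     # single brackets -> Series
as_frame = toy[["A1BG"]]    # double brackets -> DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series has a one-part shape (rows only), while the one-column DataFrame still reports (rows, columns).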

### Guided Example 2.2: Selecting columns including genes

In [None]:
# Get metadata and first two gene columns
cols = ['cell_line_name', 'oncotree_lineage', 'A1BG', 'A1CF']
data_subset = df[cols]

print("Cell lines with two gene scores:")
print(data_subset.head())
print(f"\nShape: {data_subset.shape}")

**What's new here?**
- We can mix metadata columns (like cell line name) with data columns (like gene scores)
- `.shape` shows us (rows, columns)
- The result is still a DataFrame

### Practice Example 2.1: Select three columns

Select these three columns: 'model_id', 'cell_line_name', 'oncotree_lineage'

Print the first 6 rows

In [None]:
# YOUR CODE HERE: select the three columns
# Print first 6 rows

### Practice Example 2.2: Select metadata and genes

Select: 'cell_line_name', 'A1BG', 'A1CF', 'A2M'

Print the shape and first 5 rows

In [None]:
# YOUR CODE HERE: select the columns
# Print shape and first 5 rows

---

## Section 3: Selecting Rows by Position (iloc)

### Guided Example 3.1: Getting specific rows

Sometimes we want specific rows by their position (row 0, 1, 2, ...), counting from 0. For that we use `.iloc[]`.

In [None]:
# Get the first 5 rows (positions 0, 1, 2, 3, 4)
first_five = df.iloc[0:5]

print("First 5 rows:")
print(first_five[['cell_line_name', 'oncotree_lineage']])

**What's happening here?**
- `.iloc[0:5]` means "get rows at positions 0, 1, 2, 3, 4"
- The number after `:` is NOT included (just like `range()`)
- `iloc` stands for "integer location"
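
The `range()` comparison can be checked directly. This sketch uses an invented five-row `toy` table and also shows what happens at the edges: a slice that runs past the end is clipped, while a single position that does not exist raises an `IndexError`.

```python
import pandas as pd

# Toy stand-in with five rows; the values are made up for illustration
toy = pd.DataFrame({
    "cell_line_name": ["LINE_A", "LINE_B", "LINE_C", "LINE_D", "LINE_E"],
})

first_three = toy.iloc[0:3]       # positions 0, 1, 2 (3 is excluded)
positions = list(range(0, 3))     # range() stops before 3 in the same way

tail = toy.iloc[3:100]            # a slice past the end is clipped, no error

try:
    toy.iloc[100]                 # a single out-of-range position is an error
    out_of_bounds_raised = False
except IndexError:
    out_of_bounds_raised = True

print(len(first_three), positions, len(tail), out_of_bounds_raised)
```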

### Guided Example 3.2: Different ways to use iloc

In [None]:
# Get just row 0
row_zero = df.iloc[0]
print("Just row 0:")
print(row_zero[['cell_line_name', 'oncotree_lineage']])

# Get rows 10 to 14 (the stop position, 15, is not included)
rows_10_to_15 = df.iloc[10:15]
print("\nRows 10-14:")
print(rows_10_to_15[['cell_line_name', 'oncotree_lineage']])

# Get the last 3 rows
last_three = df.iloc[-3:]
print("\nLast 3 rows:")
print(last_three[['cell_line_name', 'oncotree_lineage']])

**What's new here?**
- `.iloc[0]` - gets just one row (returns a Series)
- `.iloc[10:15]` - gets a range of rows (positions 10 through 14)
- `.iloc[-3:]` - negative numbers count from the end
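
`iloc` always counts positions, whatever the row labels are. The sketch below gives a made-up `toy` table the labels 10, 20, 30, 40 (an assumption chosen only to make labels and positions differ) and also shows that `iloc` accepts a second part for column positions.

```python
import pandas as pd

# Toy stand-in; values and the index labels 10..40 are made up for illustration
toy = pd.DataFrame(
    {
        "cell_line_name": ["LINE_A", "LINE_B", "LINE_C", "LINE_D"],
        "oncotree_lineage": ["Breast", "Lung", "Breast", "Skin"],
        "A1BG": [-0.12, 0.03, 0.05, -0.20],
    },
    index=[10, 20, 30, 40],
)

first_row = toy.iloc[0]       # position 0, which carries the label 10
last_row = toy.iloc[-1]       # last position, which carries the label 40
corner = toy.iloc[0:2, 0:2]   # first two rows AND first two columns, by position

print(first_row.name, last_row.name)
print(corner)
```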

### Practice Example 3.1: Get rows 20 to 24

Use iloc to get rows at positions 20, 21, 22, 23, 24

Show the 'cell_line_name' and 'oncotree_lineage' columns

In [None]:
# YOUR CODE HERE: get rows 20-24 using iloc
# Show cell_line_name and oncotree_lineage

### Practice Example 3.2: Get the first 10 rows

Use iloc to get the first 10 rows

Show the 'model_id' and 'cell_line_name' columns

In [None]:
# YOUR CODE HERE: get first 10 rows
# Show model_id and cell_line_name

### Practice Example 3.3: Get the last 5 rows

Use iloc with negative indexing to get the last 5 rows

In [None]:
# YOUR CODE HERE: get last 5 rows
# Show cell_line_name and oncotree_lineage

---

## Section 4: Selecting Rows AND Columns Together (loc)

### Guided Example 4.1: Using loc to select rows and columns

`.loc[]` lets us select specific rows AND specific columns at the same time, using their labels.

In [None]:
# Get first 5 rows, but only 3 specific columns
result = df.loc[0:4, ['cell_line_name', 'oncotree_lineage', 'A1BG']]

print("First 5 rows, 3 columns:")
print(result)

**What's happening here?**
- `.loc[rows, columns]` - takes two parts separated by a comma
- First part: which rows (0:4 means rows 0, 1, 2, 3, 4)
- Second part: which columns (a list of column names)
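- **Note:** `.loc[0:4]` INCLUDES row 4 (different from iloc!). `.loc` slices by index label, not by position; here the labels are 0, 1, 2, ... only because `pd.read_csv` gave the DataFrame its default index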
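
The inclusive/exclusive difference is a common source of off-by-one mistakes, so here is a small side-by-side sketch. The `toy` table, its values, and the relabeled index 10 to 50 are all made up for illustration.

```python
import pandas as pd

# Toy stand-in with the default index 0..4; values are made up for illustration
toy = pd.DataFrame({
    "cell_line_name": ["LINE_A", "LINE_B", "LINE_C", "LINE_D", "LINE_E"],
    "A1BG": [-0.12, 0.03, 0.05, -0.20, 0.01],
})

by_label = toy.loc[0:2]        # labels 0, 1, 2 -> end label INCLUDED
by_position = toy.iloc[0:2]    # positions 0, 1 -> stop position EXCLUDED
print(len(by_label), len(by_position))

# With different labels, .loc follows the labels and .iloc follows positions
relabeled = toy.copy()
relabeled.index = [10, 20, 30, 40, 50]

label_slice = relabeled.loc[10:30]      # labels 10, 20, 30
position_slice = relabeled.iloc[0:3]    # positions 0, 1, 2 (the same three rows)
print(list(label_slice.index), list(position_slice.index))
```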

### Guided Example 4.2: Using all rows or all columns

In [None]:
# Get ALL rows, but only 2 columns
all_rows_two_cols = df.loc[:, ['cell_line_name', 'A1BG']]
print("All rows, 2 columns (first 5 rows shown):")
print(all_rows_two_cols.head())

# Get rows 10-15 (6 rows, because .loc includes label 15), ALL columns (too many to show nicely)
few_rows_all_cols = df.loc[10:15, :]
print(f"\nRows 10-15, all columns - shape: {few_rows_all_cols.shape}")

**What's new here?**
- `:` by itself means "all"
- `.loc[:, ['col1', 'col2']]` - all rows, specific columns
- `.loc[0:10, :]` - specific rows, all columns
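
A short sketch of the `:` shorthand on a made-up `toy` table. The last line goes one small step beyond the examples above: `.loc` can also slice columns by name, and the end column is included just as the end row is.

```python
import pandas as pd

# Toy stand-in; the values are made up for illustration
toy = pd.DataFrame({
    "cell_line_name": ["LINE_A", "LINE_B", "LINE_C", "LINE_D"],
    "oncotree_lineage": ["Breast", "Lung", "Breast", "Skin"],
    "A1BG": [-0.12, 0.03, 0.05, -0.20],
    "A1CF": [0.07, -0.02, 0.01, 0.06],
})

all_rows = toy.loc[:, ["cell_line_name", "A1BG"]]   # every row, two columns
all_cols = toy.loc[1:2, :]                          # rows labeled 1 and 2, every column
gene_cols = toy.loc[:, "A1BG":"A1CF"]               # column slice by name, end included

print(all_rows.shape, all_cols.shape, list(gene_cols.columns))
```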

### Practice Example 4.1: Get rows 0-9, three columns

Use loc to get rows 0 through 9 (first 10 rows)

Select only: 'cell_line_name', 'oncotree_lineage', 'A1CF'

In [None]:
# YOUR CODE HERE: use .loc to get first 10 rows and 3 columns

### Practice Example 4.2: Get all rows, two gene columns

Use loc to get ALL rows, but only the 'A1BG' and 'A2M' columns

Print the first 8 rows

In [None]:
# YOUR CODE HERE: get all rows, two columns
# Print first 8 rows

---

## Section 5: Boolean Filtering - Basic Conditions

### Guided Example 5.1: Filtering with one condition

Now for the powerful part! We can filter rows based on their values.

In [None]:
# Find all breast cancer cell lines
breast_cancer = df[df['oncotree_lineage'] == 'Breast']

print(f"Number of breast cancer cell lines: {len(breast_cancer)}")
print("\nFirst few breast cancer lines:")
print(breast_cancer[['cell_line_name', 'oncotree_lineage']].head())

**What's happening here?**
- `df['oncotree_lineage'] == 'Breast'` creates True/False values
- True for rows where lineage is 'Breast'
- False for all other rows
- Passing those True/False values to `df[...]` keeps only the rows marked True

In [None]:
# Let's see what the True/False values look like
condition = df['oncotree_lineage'] == 'Breast'
print("First 10 True/False values:")
print(condition.head(10))
print(f"\nNumber of True values: {condition.sum()}")

**Understanding the condition:**
- The condition creates a Series of True/False (called a "boolean mask")
- We can sum it to count True values (True = 1, False = 0)
- Then we use it to filter: `df[condition]`
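
Here is the same idea on a table small enough to read in full. It is a sketch with made-up values; the names `toy` and `mask` are chosen only for this example.

```python
import pandas as pd

# Toy stand-in; the values are made up for illustration
toy = pd.DataFrame({
    "cell_line_name": ["LINE_A", "LINE_B", "LINE_C", "LINE_D"],
    "oncotree_lineage": ["Breast", "Lung", "Breast", "Skin"],
})

mask = toy["oncotree_lineage"] == "Breast"   # one True/False per row
print(mask.tolist())                         # [True, False, True, False]

n_true = mask.sum()        # True counts as 1, so the sum is the number of matches
share_true = mask.mean()   # the mean is the fraction of rows that match

kept = toy[mask]           # only the True rows survive
print(n_true, share_true)
print(kept)
```

Notice that the kept rows keep their original index labels (0 and 2 here); filtering does not renumber the rows.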

### Guided Example 5.2: Filtering with number comparisons

In [None]:
# Find cell lines where A1BG gene effect is very negative (< -0.1)
negative_a1bg = df[df['A1BG'] < -0.1]

print(f"Cell lines with A1BG < -0.1: {len(negative_a1bg)}")
print("\nFirst few:")
print(negative_a1bg[['cell_line_name', 'oncotree_lineage', 'A1BG']].head())

**What's new here?**
- We can use comparison operators: `<`, `>`, `<=`, `>=`, `==`, `!=`
- `df['A1BG'] < -0.1` finds all rows where A1BG is less than -0.1
- Very useful for finding cell lines where a gene has a strong effect!
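
One detail worth knowing about number comparisons is how missing values behave. The sketch below assumes, purely for illustration, that one gene score is missing (NaN); whether the lecture dataset contains missing scores is not stated here. All values are made up.

```python
import pandas as pd

# Toy stand-in; values are made up, and the NaN is an assumption for illustration
toy = pd.DataFrame({
    "cell_line_name": ["LINE_A", "LINE_B", "LINE_C", "LINE_D"],
    "A1BG": [-0.12, 0.03, float("nan"), -0.20],
})

below = toy["A1BG"] < -0.1        # True for LINE_A and LINE_D
not_below = toy["A1BG"] >= -0.1   # True for LINE_B only

# A comparison with NaN is always False, so LINE_C lands in neither group
print(below.tolist())
print(not_below.tolist())

n_missing = toy["A1BG"].isna().sum()   # count the missing values explicitly
print(n_missing)
```

So the rows matching `<` and the rows matching `>=` do not necessarily add up to the whole table; `.isna()` accounts for the rest.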

### Practice Example 5.1: Filter for lung cancer

Filter the DataFrame to get only lung cancer cell lines

(where 'oncotree_lineage' equals 'Lung')

How many are there? Show the first 5.

In [None]:
# YOUR CODE HERE: filter for Lung cancer
# Print the count and first 5 rows

### Practice Example 5.2: Filter for positive gene effects

Find all cell lines where A1CF is greater than 0.05

How many are there? Show 'cell_line_name', 'oncotree_lineage', and 'A1CF' for the first 5.

In [None]:
# YOUR CODE HERE: filter for A1CF > 0.05
# Print count and first 5 rows with selected columns

### Practice Example 5.3: Filter for strong negative effects

Find cell lines where A2M is less than -0.15

Show 'cell_line_name', 'oncotree_lineage', and 'A2M'. Sort by A2M value (most negative first).

In [None]:
# YOUR CODE HERE: filter for A2M < -0.15
# Show selected columns, sorted by A2M
# Hint: use .sort_values('A2M')

---

## Section 6: Combining Multiple Conditions

### Guided Example 6.1: Using AND (&) to combine conditions

What if we want rows that meet TWO conditions? We combine them with `&` (and).

In [None]:
# Find breast cancer cell lines where A1BG is very negative
breast_and_negative = df[(df['oncotree_lineage'] == 'Breast') & (df['A1BG'] < -0.1)]

print(f"Breast cancer with A1BG < -0.1: {len(breast_and_negative)}")
print("\nThese cell lines:")
print(breast_and_negative[['cell_line_name', 'oncotree_lineage', 'A1BG']].head())

**What's happening here?**
- `&` means "AND" - both conditions must be True
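- **Important:** Put each condition in parentheses: `(condition1) & (condition2)`. In Python, `&` binds more tightly than comparisons like `==` and `<`, so without the parentheses the expression is grouped the wrong way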
- This finds rows where lineage is Breast AND A1BG is less than -0.1
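
A sketch of two related points, on a made-up `toy` table: storing each condition in a named variable keeps the parentheses problem away, and Python's keyword `and` cannot be used in place of `&` because it tries to reduce a whole Series to a single True/False.

```python
import pandas as pd

# Toy stand-in; the values are made up for illustration
toy = pd.DataFrame({
    "cell_line_name": ["LINE_A", "LINE_B", "LINE_C", "LINE_D"],
    "oncotree_lineage": ["Breast", "Lung", "Breast", "Skin"],
    "A1BG": [-0.12, 0.03, 0.05, -0.20],
})

# Each condition stored under its own name
is_breast = toy["oncotree_lineage"] == "Breast"
is_negative = toy["A1BG"] < -0.1

both = toy[is_breast & is_negative]   # row-by-row AND
print(both)

# The keyword `and` is not row-by-row; pandas raises a ValueError
try:
    toy[is_breast and is_negative]
    and_raised = False
except ValueError:
    and_raised = True
print(f"ValueError with `and`: {and_raised}")
```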

### Guided Example 6.2: Using OR (|) to combine conditions

In [None]:
# Find cell lines that are EITHER breast OR lung cancer
breast_or_lung = df[(df['oncotree_lineage'] == 'Breast') | (df['oncotree_lineage'] == 'Lung')]

print(f"Breast OR Lung cancer: {len(breast_or_lung)}")
print("\nCounts by cancer type:")
print(breast_or_lung['oncotree_lineage'].value_counts())

**What's new here?**
- `|` means "OR" - at least one condition must be True
- `(condition1) | (condition2)` - returns True if either is True
- Useful for including multiple categories
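
When every OR condition tests the same column, `.isin()` gives the same mask with less typing. The sketch below compares the two spellings on a made-up `toy` table.

```python
import pandas as pd

# Toy stand-in; the values are made up for illustration
toy = pd.DataFrame({
    "cell_line_name": ["LINE_A", "LINE_B", "LINE_C", "LINE_D"],
    "oncotree_lineage": ["Breast", "Lung", "Breast", "Skin"],
})

lineage = toy["oncotree_lineage"]

with_or = (lineage == "Breast") | (lineage == "Lung")   # two conditions joined by OR
with_isin = lineage.isin(["Breast", "Lung"])            # one call with a list of values

same_mask = (with_or == with_isin).all()   # True when both masks agree on every row
print(same_mask)
print(toy[with_isin])
```

The `.isin()` form stays readable as the list of categories grows, while the `|` form is still needed when the conditions involve different columns.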

### Practice Example 6.1: Breast cancer with positive A1CF

Find breast cancer cell lines where A1CF is greater than 0.05

Use AND (&) to combine two conditions:
1. oncotree_lineage equals 'Breast'
2. A1CF is greater than 0.05

How many are there?

In [None]:
# YOUR CODE HERE: filter using & to combine conditions
# Print count and first few rows

### Practice Example 6.2: Lung or brain cancer

Find cell lines that are EITHER 'Lung' OR 'Brain' cancer

How many of each type?

In [None]:
# YOUR CODE HERE: filter using | (OR)
# Print counts using .value_counts()

### Practice Example 6.3: Complex filtering

Find breast cancer cell lines where BOTH genes show effects:
- A1BG is less than -0.08 AND
- A1CF is less than -0.05

Show 'cell_line_name', 'A1BG', and 'A1CF'

In [None]:
# YOUR CODE HERE: filter with three conditions combined
# (cancer type AND gene1 AND gene2)

### Practice Example 6.4: Very complex filtering

Find cell lines where:
- Cancer type is either 'Breast' or 'Ovary', AND
- A2M is less than -0.1

How many cell lines meet these criteria?

In [None]:
# YOUR CODE HERE: combine multiple conditions
# Hint: ((condition1 | condition2) & condition3)

---

## Summary

Congratulations! You've learned pandas filtering step-by-step:

**Section 1 - Single column:**
- ✅ `df['column_name']` - gets one column

**Section 2 - Multiple columns:**
- ✅ `df[['col1', 'col2']]` - gets several columns

**Section 3 - Rows by position:**
- ✅ `df.iloc[0:5]` - gets rows by position (stop position not included)

**Section 4 - Rows and columns:**
- ✅ `df.loc[rows, columns]` - gets specific rows and columns by label (end label included)

**Section 5 - Boolean filtering:**
- ✅ `df[df['col'] == value]` - filters rows by condition
- ✅ Can use `<`, `>`, `<=`, `>=`, `==`, `!=`

**Section 6 - Multiple conditions:**
- ✅ `df[(condition1) & (condition2)]` - AND (both must be true)
- ✅ `df[(condition1) | (condition2)]` - OR (at least one must be true)

**Next Steps:**
- Practice with your own datasets
- Combine filtering with sorting
- Learn about `.isin()` for filtering multiple values
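
As a starting point for combining filtering with sorting, here is a minimal sketch on a made-up `toy` table. It passes a boolean mask as the row part of `.loc`, picks columns in the same step, and then sorts the result.

```python
import pandas as pd

# Toy stand-in; the values are made up for illustration
toy = pd.DataFrame({
    "cell_line_name": ["LINE_A", "LINE_B", "LINE_C", "LINE_D"],
    "oncotree_lineage": ["Breast", "Lung", "Breast", "Skin"],
    "A1BG": [-0.12, 0.03, 0.05, -0.20],
})

is_negative = toy["A1BG"] < 0

# Filter rows and choose columns in one step, then sort (most negative first)
result = toy.loc[is_negative, ["cell_line_name", "A1BG"]].sort_values("A1BG")
print(result)
```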

Keep practicing! 🚀