# Lecture 4: Pandas Slicing and Selection
## Python for Biology
**Learning Objectives:**
- Select single columns and multiple columns from DataFrames
- Use `.loc[]` to select by labels (row/column names)
- Use `.iloc[]` to select by integer positions
- Use boolean indexing to filter rows based on conditions
- Combine multiple conditions with `&` (and) and `|` (or)

---

In [None]:
import pandas as pd

## Section 1: Creating Our DataFrames

We'll use two small DataFrames throughout this notebook:
1. **genes_df** - for guided examples (genes and expression data)
2. **proteins_df** - for practice exercises (protein properties)

### Demonstration DataFrame: Genes

In [None]:
# Create a small gene expression dataset for demonstrations
genes_df = pd.DataFrame({
    'gene_name': ['BRCA1', 'TP53', 'MYC', 'EGFR', 'KRAS'],
    'chromosome': ['17', '17', '8', '7', '12'],
    'expression': [4.5, 6.2, 3.8, 5.1, 7.3],
    'length': [1863, 1182, 1320, 3633, 567],
    'oncogene': [False, False, True, True, True]
})

print("Demonstration DataFrame: genes_df")
genes_df

### Practice DataFrame: Proteins

In [None]:
# Create a small protein dataset for practice exercises
proteins_df = pd.DataFrame({
    'protein_name': ['Insulin', 'Hemoglobin', 'Actin', 'Collagen', 'Albumin', 'Myosin'],
    'molecular_weight': [5.8, 64.5, 42.0, 300.0, 66.5, 520.0],
    'num_amino_acids': [51, 574, 375, 1050, 585, 1938],
    'location': ['Secreted', 'Cytoplasm', 'Cytoskeleton', 'Extracellular', 'Secreted', 'Cytoskeleton'],
    'structural': [False, False, True, True, False, True]
})

print("Practice DataFrame: proteins_df")
proteins_df

**What these DataFrames contain:**
- Each **row** represents one gene/protein
- Each **column** represents a property
- The **index** (row labels) is assigned automatically: 0, 1, 2, 3, 4...

---

## Section 2: Selecting Columns

### Guided Example 2.1: Select a single column

To get one column, use square brackets with the column name:

In [None]:
# Get just the gene names
gene_names = genes_df['gene_name']
print("Gene names:")
print(gene_names)
print(f"\nType: {type(gene_names)}")  # This is a Series!

**What happened:**
- `genes_df['gene_name']` selects the 'gene_name' column
- Returns a **Series** (a single column)
- A Series is like a list with an index (a label attached to each value)
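
To make the "list with an index" idea concrete, here is a small sketch that pulls the labels and the values out of the Series separately. It rebuilds `genes_df` from Section 1 so it can run on its own; `.tolist()` is used here only for display.

```python
import pandas as pd

# Same data as genes_df in Section 1, repeated so this snippet runs on its own
genes_df = pd.DataFrame({
    'gene_name': ['BRCA1', 'TP53', 'MYC', 'EGFR', 'KRAS'],
    'chromosome': ['17', '17', '8', '7', '12'],
    'expression': [4.5, 6.2, 3.8, 5.1, 7.3],
    'length': [1863, 1182, 1320, 3633, 567],
    'oncogene': [False, False, True, True, True]
})

gene_names = genes_df['gene_name']

print(gene_names.index.tolist())  # the row labels: [0, 1, 2, 3, 4]
print(gene_names.tolist())        # the values as a plain Python list
print(gene_names[2])              # look up one value by its label: MYC
```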

### Practice Example 2.1: Select a single column

Select the `molecular_weight` column from proteins_df

In [None]:
# YOUR CODE HERE: Select the molecular_weight column


# Should show: 5.8, 64.5, 42.0, 300.0, 66.5, 520.0

### Guided Example 2.2: Select multiple columns

To get multiple columns, use a **list** of column names:

In [None]:
# Get gene name and expression level
subset = genes_df[['gene_name', 'expression']]
print("Gene names and expression:")
print(subset)
print(f"\nType: {type(subset)}")  # This is a DataFrame!

**What happened:**
- `genes_df[['gene_name', 'expression']]` - note the **double square brackets**!
- Outer brackets = "select from DataFrame"
- Inner brackets = the list of columns we want
- Returns a **DataFrame** (multiple columns)
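
The bracket rule also applies when the list holds only one name. The sketch below (again rebuilding `genes_df` so it is self-contained) compares single and double brackets on the same column; the variable names `as_series` and `as_frame` are just illustrative.

```python
import pandas as pd

# Same data as genes_df in Section 1, repeated so this snippet runs on its own
genes_df = pd.DataFrame({
    'gene_name': ['BRCA1', 'TP53', 'MYC', 'EGFR', 'KRAS'],
    'chromosome': ['17', '17', '8', '7', '12'],
    'expression': [4.5, 6.2, 3.8, 5.1, 7.3],
    'length': [1863, 1182, 1320, 3633, 567],
    'oncogene': [False, False, True, True, True]
})

as_series = genes_df['expression']    # single brackets -> Series
as_frame = genes_df[['expression']]   # list with one name -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (5, 1): five rows, one column
```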

### Practice Example 2.2: Select multiple columns

Select `protein_name`, `molecular_weight`, and `location` from proteins_df

In [None]:
# YOUR CODE HERE: Select the three columns
# Remember: use double square brackets!


# Should show just those 3 columns

---

## Section 3: Using .loc[] - Label-based Selection

`.loc[]` lets you select rows and columns by their **names/labels**.

**Syntax:** `df.loc[row_selection, column_selection]`

### Guided Example 3.1: Select a single row with .loc

Select one row by its index:

In [None]:
# Get the row at index 0 (BRCA1)
row = genes_df.loc[0]
print("First row (index 0):")
print(row)
print(f"\nType: {type(row)}")  # Series

**What happened:**
- `genes_df.loc[0]` selects the row with index label 0
- Returns a Series with column names as the index
- We can access specific values: `row['gene_name']` would give 'BRCA1'
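
As a quick check of the last point, this sketch selects the row and then reads individual values from it by column name. It rebuilds `genes_df` so it runs independently.

```python
import pandas as pd

# Same data as genes_df in Section 1, repeated so this snippet runs on its own
genes_df = pd.DataFrame({
    'gene_name': ['BRCA1', 'TP53', 'MYC', 'EGFR', 'KRAS'],
    'chromosome': ['17', '17', '8', '7', '12'],
    'expression': [4.5, 6.2, 3.8, 5.1, 7.3],
    'length': [1863, 1182, 1320, 3633, 567],
    'oncogene': [False, False, True, True, True]
})

row = genes_df.loc[0]

# The row's index is the list of column names
print(row.index.tolist())

# Individual values are looked up by column name
print(row['gene_name'])   # BRCA1
print(row['expression'])  # 4.5
```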

### Practice Example 3.1: Select a single row

Select the row at index 2 from proteins_df (should be Actin)

In [None]:
# YOUR CODE HERE: Use .loc to select row at index 2


# Should show Actin with all its properties

### Guided Example 3.2: Select multiple rows with .loc

Select multiple rows using a list of indices:

In [None]:
# Get rows at indices 0, 2, and 4
rows = genes_df.loc[[0, 2, 4]]
print("Selected rows:")
print(rows)

**What happened:**
- `genes_df.loc[[0, 2, 4]]` selects rows 0, 2, and 4
- Returns a DataFrame with those rows
- Note the list inside .loc[]

### Practice Example 3.2: Select multiple rows

Select rows at indices 1, 3, and 5 from proteins_df

In [None]:
# YOUR CODE HERE: Select multiple rows


# Should show Hemoglobin, Collagen, and Myosin

### Guided Example 3.3: Select rows AND columns with .loc

The real power of .loc: select specific rows AND specific columns!

In [None]:
# Get gene_name and expression for rows 1, 2, 3
subset = genes_df.loc[[1, 2, 3], ['gene_name', 'expression']]
print("Selected rows and columns:")
print(subset)

**Syntax breakdown:**
- `genes_df.loc[rows, columns]`
- `[1, 2, 3]` = which rows (before the comma)
- `['gene_name', 'expression']` = which columns (after the comma)
- Returns a DataFrame containing only the selected rows and columns

### Practice Example 3.3: Select rows and columns

From proteins_df, select:
- Rows: 0, 2, 4
- Columns: 'protein_name', 'molecular_weight', 'structural'

In [None]:
# YOUR CODE HERE: Select specific rows and columns


# Should show Insulin, Actin, and Albumin with those 3 properties

### Guided Example 3.4: Using slices with .loc

You can use slices to select ranges (inclusive on both ends!):

In [None]:
# Get rows 1 through 3 (inclusive!)
rows = genes_df.loc[1:3]
print("Rows 1 to 3:")
print(rows)

**Important:** 
- `.loc[1:3]` includes BOTH 1 and 3 (unlike normal Python slicing!)
- With .loc, slices are **inclusive** on both ends
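
Inclusive slicing makes more sense once the labels are not numbers. The sketch below is an illustration that goes slightly beyond this lecture: it uses `set_index()` (not covered here) to turn the gene names into row labels, then slices from one name to another. `by_name` and `middle` are illustrative names.

```python
import pandas as pd

# Same data as genes_df in Section 1, repeated so this snippet runs on its own
genes_df = pd.DataFrame({
    'gene_name': ['BRCA1', 'TP53', 'MYC', 'EGFR', 'KRAS'],
    'chromosome': ['17', '17', '8', '7', '12'],
    'expression': [4.5, 6.2, 3.8, 5.1, 7.3],
    'length': [1863, 1182, 1320, 3633, 567],
    'oncogene': [False, False, True, True, True]
})

# Assumption for illustration: use gene names as the row labels
by_name = genes_df.set_index('gene_name')

# Slice from the 'TP53' label through the 'EGFR' label, both included
middle = by_name.loc['TP53':'EGFR']
print(middle.index.tolist())  # ['TP53', 'MYC', 'EGFR']
```

With labels like these there is no obvious "one past the end" value, which is one way to remember why `.loc` includes the stop label.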

### Practice Example 3.4: Use slice with .loc

Select rows 2 through 4 (inclusive) from proteins_df

In [None]:
# YOUR CODE HERE: Use a slice with .loc


# Should show Actin, Collagen, and Albumin

---

## Section 4: Using .iloc[] - Position-based Selection

`.iloc[]` selects by **integer positions** (like list indexing).

**Syntax:** `df.iloc[row_positions, column_positions]`

### Guided Example 4.1: Select by position with .iloc

Get specific rows and columns by their position numbers:

In [None]:
# Get the first row, first two columns
subset = genes_df.iloc[0, 0:2]
print("First row, first two columns:")
print(subset)

**What happened:**
- `genes_df.iloc[0, 0:2]`
- `0` = first row (position 0)
- `0:2` = columns at positions 0 and 1 (excludes 2, like normal Python slicing!)
- With .iloc, slices are **exclusive** on the right end
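
The comparison with normal Python slicing can be checked side by side. This sketch slices a plain list (the hypothetical `letters`) and the DataFrame with the same `0:2` range; `genes_df` is rebuilt so the snippet is self-contained.

```python
import pandas as pd

# Same data as genes_df in Section 1, repeated so this snippet runs on its own
genes_df = pd.DataFrame({
    'gene_name': ['BRCA1', 'TP53', 'MYC', 'EGFR', 'KRAS'],
    'chromosome': ['17', '17', '8', '7', '12'],
    'expression': [4.5, 6.2, 3.8, 5.1, 7.3],
    'length': [1863, 1182, 1320, 3633, 567],
    'oncogene': [False, False, True, True, True]
})

# A plain Python list: 0:2 gives positions 0 and 1
letters = ['a', 'b', 'c', 'd', 'e']
print(letters[0:2])  # ['a', 'b']

# .iloc follows the same rule: rows at positions 0 and 1
first_two = genes_df.iloc[0:2]
print(first_two['gene_name'].tolist())  # ['BRCA1', 'TP53']
```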

### Guided Example 4.2: Multiple rows and columns with .iloc

In [None]:
# Get first 3 rows, columns 0 and 2
subset = genes_df.iloc[0:3, [0, 2]]
print("First 3 rows, columns 0 and 2:")
print(subset)

**Key points:**
- `0:3` = rows 0, 1, 2 (excludes 3)
- `[0, 2]` = columns at positions 0 and 2
- All position-based, no column names needed!

### Practice Example 4.1: Use .iloc with positions

From proteins_df, select:
- Rows: positions 1, 3, 5 (use a list)
- Columns: positions 0 and 1

In [None]:
# YOUR CODE HERE: Use .iloc with specific positions


# Should show protein names and molecular weights for Hemoglobin, Collagen, and Myosin

### Guided Example 4.3: .loc vs .iloc - Key Differences

In [None]:
print("Using .loc[1:3]:")
print(genes_df.loc[1:3])  # Includes both 1 AND 3

print("\nUsing .iloc[1:3]:")
print(genes_df.iloc[1:3])  # Includes 1 but EXCLUDES 3

**Summary: .loc vs .iloc**

| Feature | .loc | .iloc |
|---------|------|-------|
| Uses | **Labels** (names/index) | **Positions** (integers) |
| Columns | Use column names | Use column positions |
| Slicing | **Inclusive** on both ends | **Exclusive** on right end |
| Example | `df.loc[1:3, 'gene_name']` | `df.iloc[1:3, 0]` |

**When to use which:**
- Use `.loc` when you know column/row **names**
- Use `.iloc` when you want rows/columns by **position**
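
With the default index, labels and positions happen to be the same numbers, which can hide the difference. The following sketch takes a subset whose labels no longer start at 0 and shows where `.loc` and `.iloc` diverge. The name `subset` is illustrative, and `genes_df` is rebuilt so the snippet runs on its own.

```python
import pandas as pd

# Same data as genes_df in Section 1, repeated so this snippet runs on its own
genes_df = pd.DataFrame({
    'gene_name': ['BRCA1', 'TP53', 'MYC', 'EGFR', 'KRAS'],
    'chromosome': ['17', '17', '8', '7', '12'],
    'expression': [4.5, 6.2, 3.8, 5.1, 7.3],
    'length': [1863, 1182, 1320, 3633, 567],
    'oncogene': [False, False, True, True, True]
})

# Keep rows labelled 2, 3, 4; the labels travel with the rows
subset = genes_df.loc[[2, 3, 4]]
print(subset.index.tolist())  # [2, 3, 4]

# Position 0 is the first row of the subset
print(subset.iloc[0]['gene_name'])  # MYC

# Label 2 is that same row
print(subset.loc[2]['gene_name'])   # MYC

# Label 0 no longer exists in the subset, so .loc cannot find it
try:
    subset.loc[0]
except KeyError:
    print("No row with label 0 in subset")
```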

### Practice Example 4.2: Compare .loc and .iloc

Use both methods to select the same data from proteins_df:
- First 4 rows
- 'protein_name' and 'location' columns

Try it with .loc first, then with .iloc

In [None]:
# YOUR CODE HERE: Use .loc
print("Using .loc:")


# YOUR CODE HERE: Use .iloc
print("\nUsing .iloc:")


# Both should give the same result!

---

## Section 5: Boolean Indexing - Filtering with Conditions

**Boolean indexing** lets you filter rows based on conditions. This is one of the most powerful features of pandas!

### Guided Example 5.1: Creating a boolean mask

First, let's understand what a boolean mask is:

In [None]:
# Create a condition: expression > 5
mask = genes_df['expression'] > 5

print("Original expression values:")
print(genes_df['expression'])

print("\nBoolean mask (expression > 5):")
print(mask)

**What happened:**
- `genes_df['expression'] > 5` creates a Series of True/False values
- True where the condition is met, False otherwise
- This is called a **boolean mask**
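
A mask is an ordinary Series, so it can be inspected like one. As a small sketch (rebuilding `genes_df` so it is self-contained), the code below checks its dtype and counts the True values; summing works because True counts as 1 and False as 0.

```python
import pandas as pd

# Same data as genes_df in Section 1, repeated so this snippet runs on its own
genes_df = pd.DataFrame({
    'gene_name': ['BRCA1', 'TP53', 'MYC', 'EGFR', 'KRAS'],
    'chromosome': ['17', '17', '8', '7', '12'],
    'expression': [4.5, 6.2, 3.8, 5.1, 7.3],
    'length': [1863, 1182, 1320, 3633, 567],
    'oncogene': [False, False, True, True, True]
})

mask = genes_df['expression'] > 5

print(mask.dtype)     # bool
print(mask.tolist())  # [False, True, False, True, True]
print(mask.sum())     # 3 genes meet the condition
```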

### Guided Example 5.2: Using the mask to filter

Now use the mask to select only rows where the condition is True:

In [None]:
# Use the mask to filter the DataFrame
high_expression = genes_df[mask]

print("Genes with expression > 5:")
print(high_expression)

**What happened:**
- `genes_df[mask]` keeps only rows where mask is True
- TP53 (6.2), EGFR (5.1), and KRAS (7.3) have expression > 5, so they're selected
- This is **filtering** the DataFrame

### Guided Example 5.3: Filter in one step

Usually, we combine creating the mask and filtering into one step:

In [None]:
# Filter for genes with expression > 5 (in one step)
high_expression = genes_df[genes_df['expression'] > 5]

print("Genes with high expression:")
print(high_expression)

**This is the standard way:**
- `genes_df[genes_df['expression'] > 5]`
- Inner part creates the mask
- Outer part uses it to filter
- Same result, less code!

### Practice Example 5.1: Simple boolean filtering

Filter proteins_df to show only proteins with molecular_weight > 100

In [None]:
# YOUR CODE HERE: Filter for molecular_weight > 100


# Should show Collagen and Myosin

### Guided Example 5.4: Different comparison operators

You can use many different comparisons:

In [None]:
# Equal to
print("Genes on chromosome 17:")
print(genes_df[genes_df['chromosome'] == '17'])

print("\nGenes with length <= 1500:")
print(genes_df[genes_df['length'] <= 1500])

print("\nGenes that are oncogenes:")
print(genes_df[genes_df['oncogene'] == True])

**Comparison operators:**
- `==` equal to
- `!=` not equal to
- `>` greater than
- `<` less than
- `>=` greater than or equal
- `<=` less than or equal
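
The guided example covers `==` and `<=`; the sketch below tries two of the remaining operators on the same data. It also shows that a column that already holds True/False values can be used as a mask directly, which gives the same rows as comparing it with `== True`. Variable names are illustrative.

```python
import pandas as pd

# Same data as genes_df in Section 1, repeated so this snippet runs on its own
genes_df = pd.DataFrame({
    'gene_name': ['BRCA1', 'TP53', 'MYC', 'EGFR', 'KRAS'],
    'chromosome': ['17', '17', '8', '7', '12'],
    'expression': [4.5, 6.2, 3.8, 5.1, 7.3],
    'length': [1863, 1182, 1320, 3633, 567],
    'oncogene': [False, False, True, True, True]
})

# Not equal to: genes that are NOT on chromosome 17
not_chr17 = genes_df[genes_df['chromosome'] != '17']
print(not_chr17['gene_name'].tolist())  # ['MYC', 'EGFR', 'KRAS']

# Greater than or equal: a length of exactly 1320 is included
long_genes = genes_df[genes_df['length'] >= 1320]
print(long_genes['gene_name'].tolist())  # ['BRCA1', 'MYC', 'EGFR']

# A boolean column is already a mask
oncogenes = genes_df[genes_df['oncogene']]
print(oncogenes['gene_name'].tolist())  # ['MYC', 'EGFR', 'KRAS']
```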

### Practice Example 5.2: Different comparisons

From proteins_df:
1. Find proteins located in 'Cytoplasm'
2. Find proteins with num_amino_acids < 600
3. Find structural proteins (structural == True)

In [None]:
# YOUR CODE HERE: Proteins in Cytoplasm
print("Proteins in Cytoplasm:")


# YOUR CODE HERE: Proteins with < 600 amino acids
print("\nProteins with < 600 amino acids:")


# YOUR CODE HERE: Structural proteins
print("\nStructural proteins:")


---

## Section 6: Combining Conditions

### Guided Example 6.1: AND conditions with &

Use `&` to combine conditions (both must be True):

In [None]:
# Find oncogenes with high expression (> 5)
result = genes_df[(genes_df['oncogene'] == True) & (genes_df['expression'] > 5)]

print("Oncogenes with expression > 5:")
print(result)

**Important syntax:**
- `&` means AND (both conditions must be True)
- **Must use parentheses** around each condition!
- `(condition1) & (condition2)`
- Without parentheses, you'll get an error, because Python evaluates `&` before comparisons like `==` and `>`
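
To see the parentheses rule in action, this sketch runs the filter once without parentheses and once with them. The exact exception type and message may differ between pandas versions, so the example catches a general `Exception` and only prints its type; `genes_df` is rebuilt so the snippet is self-contained.

```python
import pandas as pd

# Same data as genes_df in Section 1, repeated so this snippet runs on its own
genes_df = pd.DataFrame({
    'gene_name': ['BRCA1', 'TP53', 'MYC', 'EGFR', 'KRAS'],
    'chromosome': ['17', '17', '8', '7', '12'],
    'expression': [4.5, 6.2, 3.8, 5.1, 7.3],
    'length': [1863, 1182, 1320, 3633, 567],
    'oncogene': [False, False, True, True, True]
})

# Without parentheses: Python tries `True & genes_df['expression']` first
raised = False
try:
    genes_df[genes_df['oncogene'] == True & genes_df['expression'] > 5]
except Exception as err:  # exact type depends on the pandas version
    raised = True
    print(f"Failed with {type(err).__name__}")

# With parentheses: each comparison is evaluated first, then combined
result = genes_df[(genes_df['oncogene'] == True) & (genes_df['expression'] > 5)]
print(result['gene_name'].tolist())  # ['EGFR', 'KRAS']
```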

### Practice Example 6.1: Combine conditions with AND

Find proteins that are:
- Secreted AND
- Have molecular_weight > 50

In [None]:
# YOUR CODE HERE: Filter with two conditions using &
# Remember: use parentheses around each condition!


# Should show Albumin

### Guided Example 6.2: OR conditions with |

Use `|` for OR (either condition can be True):

In [None]:
# Find genes on chromosome 17 OR chromosome 8
result = genes_df[(genes_df['chromosome'] == '17') | (genes_df['chromosome'] == '8')]

print("Genes on chromosome 17 or 8:")
print(result)

**Important:**
- `|` means OR (either or both conditions can be True)
- Still need parentheses around each condition!
- `(condition1) | (condition2)`

### Practice Example 6.2: Combine conditions with OR

Find proteins that are located in:
- 'Cytoskeleton' OR 'Extracellular'

In [None]:
# YOUR CODE HERE: Filter with OR condition


# Should show Actin, Collagen, and Myosin

### Guided Example 6.3: Complex conditions

You can combine AND and OR:

In [None]:
# Find genes that are:
# (oncogenes with expression > 4) OR (on chromosome 17 with length > 1500)
result = genes_df[
    ((genes_df['oncogene'] == True) & (genes_df['expression'] > 4)) |
    ((genes_df['chromosome'] == '17') & (genes_df['length'] > 1500))
]

print("Complex filter:")
print(result)

**Breaking it down:**
- First part: `(oncogene == True) & (expression > 4)` → EGFR, KRAS (MYC is an oncogene, but its expression of 3.8 is not > 4)
- Second part: `(chromosome == '17') & (length > 1500)` → BRCA1
- Combined with OR: gets rows matching either condition
- Use extra parentheses to group conditions clearly!
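
One way to keep a long filter readable is to store each part in its own named mask and combine them at the end. The sketch below rewrites the filter above in that style; the names `expressed_oncogene` and `long_chr17` are illustrative, and `genes_df` is rebuilt so the snippet runs on its own.

```python
import pandas as pd

# Same data as genes_df in Section 1, repeated so this snippet runs on its own
genes_df = pd.DataFrame({
    'gene_name': ['BRCA1', 'TP53', 'MYC', 'EGFR', 'KRAS'],
    'chromosome': ['17', '17', '8', '7', '12'],
    'expression': [4.5, 6.2, 3.8, 5.1, 7.3],
    'length': [1863, 1182, 1320, 3633, 567],
    'oncogene': [False, False, True, True, True]
})

# Part 1: oncogenes with expression > 4
expressed_oncogene = (genes_df['oncogene'] == True) & (genes_df['expression'] > 4)
print(genes_df[expressed_oncogene]['gene_name'].tolist())  # ['EGFR', 'KRAS']

# Part 2: genes on chromosome 17 with length > 1500
long_chr17 = (genes_df['chromosome'] == '17') & (genes_df['length'] > 1500)
print(genes_df[long_chr17]['gene_name'].tolist())  # ['BRCA1']

# Combine the two named masks with OR
result = genes_df[expressed_oncogene | long_chr17]
print(result['gene_name'].tolist())  # ['BRCA1', 'EGFR', 'KRAS']
```

Printing each part separately is also a handy way to debug a filter that returns unexpected rows.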

### Practice Example 6.3: Complex filtering

Find proteins that meet ANY of these criteria:
- (molecular_weight > 500) OR
- (structural == True AND num_amino_acids < 400)

In [None]:
# YOUR CODE HERE: Complex filter with AND and OR
# Use extra parentheses to group conditions!


# Should show Actin and Myosin

### Guided Example 6.4: Using .isin() for multiple values

When checking if a column matches multiple values, use `.isin()`:

In [None]:
# Find genes on chromosomes 7, 8, or 12
# Instead of: (chr=='7') | (chr=='8') | (chr=='12')
result = genes_df[genes_df['chromosome'].isin(['7', '8', '12'])]

print("Genes on chromosomes 7, 8, or 12:")
print(result)

**Why .isin() is useful:**
- Much cleaner than multiple OR conditions
- `.isin([list, of, values])` checks if column value is in the list
- Perfect for "belongs to this group" filters
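
As a check that `.isin()` really matches the chained OR version, this sketch builds both filters and compares the results. It also shows a common pitfall with this dataset: the chromosome values are stored as strings, so a list of integers matches nothing. The names `with_or` and `with_isin` are illustrative.

```python
import pandas as pd

# Same data as genes_df in Section 1, repeated so this snippet runs on its own
genes_df = pd.DataFrame({
    'gene_name': ['BRCA1', 'TP53', 'MYC', 'EGFR', 'KRAS'],
    'chromosome': ['17', '17', '8', '7', '12'],
    'expression': [4.5, 6.2, 3.8, 5.1, 7.3],
    'length': [1863, 1182, 1320, 3633, 567],
    'oncogene': [False, False, True, True, True]
})

chrom = genes_df['chromosome']

# The long way: three conditions joined with |
with_or = genes_df[(chrom == '7') | (chrom == '8') | (chrom == '12')]

# The short way: one .isin() call
with_isin = genes_df[chrom.isin(['7', '8', '12'])]

print(with_or.equals(with_isin))        # True
print(with_isin['gene_name'].tolist())  # ['MYC', 'EGFR', 'KRAS']

# Pitfall: '7' (string) is not the same as 7 (integer)
print(chrom.isin([7, 8, 12]).any())     # False
```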

### Practice Example 6.4: Use .isin()

Find proteins located in any of these locations:
- 'Cytoplasm', 'Cytoskeleton', or 'Secreted'

In [None]:
# YOUR CODE HERE: Use .isin() to filter


# Should show all except Collagen

---

## Section 7: Practical Applications

### Challenge 1: Multi-step filtering

From genes_df:
1. Filter for oncogenes only
2. From those, select only gene_name and expression columns
3. Sort by expression (highest first)

**Hint:** Chain operations together!

In [None]:
# YOUR CODE HERE


# Should show KRAS, EGFR, MYC (in that order) with their expression, highest first

### Challenge 2: Statistics on filtered data

From proteins_df:
1. Filter for structural proteins only
2. Calculate the mean molecular_weight of structural proteins
3. Calculate the max num_amino_acids of structural proteins

In [None]:
# YOUR CODE HERE


# Print the statistics

### Challenge 3: Create a subset DataFrame

From proteins_df, create a new DataFrame called `small_proteins` that contains:
- Only proteins with molecular_weight < 100
- Only these columns: 'protein_name', 'molecular_weight', 'num_amino_acids'
- Sorted by molecular_weight (smallest first)

In [None]:
# YOUR CODE HERE: Create small_proteins


# Display the result

---

## Summary

Congratulations! You've mastered pandas slicing and selection!

**Column selection:**
- ✅ Single column: `df['column']` → Series
- ✅ Multiple columns: `df[['col1', 'col2']]` → DataFrame

**Label-based selection (.loc):**
- ✅ Select by labels/names
- ✅ Syntax: `df.loc[rows, columns]`
- ✅ Slices are **inclusive** on both ends
- ✅ Example: `df.loc[1:3, 'gene_name']`

**Position-based selection (.iloc):**
- ✅ Select by integer positions
- ✅ Syntax: `df.iloc[row_positions, col_positions]`
- ✅ Slices are **exclusive** on right end
- ✅ Example: `df.iloc[1:3, 0]`

**Boolean indexing:**
- ✅ Filter with conditions: `df[df['col'] > 5]`
- ✅ Combine with AND: `(condition1) & (condition2)`
- ✅ Combine with OR: `(condition1) | (condition2)`
- ✅ Multiple values: `df[df['col'].isin([val1, val2])]`

**Key differences:**

| Operation | Use when... | Example |
|-----------|-------------|----------|
| `df['col']` | Getting one column | `df['gene_name']` |
| `.loc[]` | Using labels/names | `df.loc[0:3, 'expression']` |
| `.iloc[]` | Using positions | `df.iloc[0:3, 2]` |
| Boolean | Filtering rows | `df[df['expression'] > 5]` |

**Common mistakes to avoid:**
- ⚠️ Forgetting inner brackets: `df[['col1', 'col2']]` not `df['col1', 'col2']`
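- ⚠️ Mixing up .loc and .iloc: `.loc` takes labels (inclusive slices), `.iloc` takes integer positions (exclusive on the right end)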
- ⚠️ Forgetting parentheses in boolean: `(cond1) & (cond2)`
- ⚠️ Using `and`/`or` instead of `&`/`|` in boolean indexing
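
The last mistake is worth seeing once. Python's `and` needs a single True/False answer, but a mask holds one answer per row, so pandas raises a `ValueError`. The sketch below triggers that error on purpose and then shows the working `&` version; `genes_df` is rebuilt so the snippet runs on its own.

```python
import pandas as pd

# Same data as genes_df in Section 1, repeated so this snippet runs on its own
genes_df = pd.DataFrame({
    'gene_name': ['BRCA1', 'TP53', 'MYC', 'EGFR', 'KRAS'],
    'chromosome': ['17', '17', '8', '7', '12'],
    'expression': [4.5, 6.2, 3.8, 5.1, 7.3],
    'length': [1863, 1182, 1320, 3633, 567],
    'oncogene': [False, False, True, True, True]
})

# Wrong: `and` asks "is this whole Series True?", which has no single answer
got_value_error = False
try:
    genes_df[(genes_df['oncogene'] == True) and (genes_df['expression'] > 5)]
except ValueError as err:
    got_value_error = True
    print(f"ValueError: {err}")

# Right: `&` compares the two masks row by row
result = genes_df[(genes_df['oncogene'] == True) & (genes_df['expression'] > 5)]
print(result['gene_name'].tolist())  # ['EGFR', 'KRAS']
```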

**Next steps:**
- Practice with real biological datasets
- Combine filtering with groupby and aggregation
- Learn about setting values in DataFrames

You're now ready to slice and filter data like a pro! 🔬📊