# Lecture 4 Notebook 8: Comparing Groups with Box Plots

**Learning Objectives:**
- Understand what box plots show and how to read them
- Interpret the five-number summary (min, Q1, median, Q3, max)
- Create box plots to compare gene expression across cancer types
- Use both matplotlib and pandas methods for box plots
- Customize box plot appearance and colors
- Identify outliers and interpret biological significance
- Compare multiple genes side-by-side

---

## 1. Introduction: What is a Box Plot?

A **box plot** (also called a box-and-whisker plot) shows the **distribution of data** through five key statistics:

1. **Minimum**: Lowest value. On the plot, the lower whisker stops at the lowest value that is not an outlier
2. **Q1 (First Quartile)**: 25th percentile - 25% of data below this value
3. **Median (Second Quartile)**: 50th percentile - middle value
4. **Q3 (Third Quartile)**: 75th percentile - 75% of data below this value
5. **Maximum**: Highest value. On the plot, the upper whisker stops at the highest value that is not an outlier

**Additional features:**
- **IQR (Interquartile Range)**: Q3 - Q1, the height of the box
- **Outliers**: Points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, drawn as individual markers
- **Whiskers**: Lines extending from the box to the lowest and highest values that are not outliers
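
The rules above can be worked through by hand on a few numbers. This is a minimal sketch using made-up toy values (not from the course dataset) and NumPy's default percentile method. The variable names are illustrative only.

```python
import numpy as np

# Toy values (hypothetical, not from the course dataset); 30.0 is an obvious outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# Quartiles and IQR
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: anything outside these limits is drawn as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme values that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Outliers: {outliers}")
```

Note that the upper whisker ends at 8.0, not at the upper fence (13.0): whiskers always end on an actual data point.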

**Perfect for:**
- 📊 Comparing distributions across multiple groups
- 🧬 Analyzing gene expression across different cancer types
- 🔍 Identifying which groups have high variability
- 🎯 Finding outliers (unusual cell lines)

**Biological question:** "Is BRCA1 expression different in breast cancer vs myeloid cancer?"

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## 2. Load Gene Expression Data

In [None]:
# Load the gene expression data
url = "https://zenodo.org/records/17377786/files/expression_filtered.csv?download=1"
gene_df = pd.read_csv(url)

print(f"Dataset shape: {gene_df.shape}")
print(f"\nCancer types:")
print(gene_df['oncotree_lineage'].value_counts())
gene_df.head()

---

## 3. Understanding Box Plot Components

Before we create box plots, let's calculate the five-number summary manually to understand what we're visualizing:

In [None]:
# Calculate five-number summary for BRCA1
brca1_data = gene_df['BRCA1']

print("Five-Number Summary for BRCA1:")
print(f"Minimum:  {brca1_data.min():.2f}")
print(f"Q1 (25%): {brca1_data.quantile(0.25):.2f}")
print(f"Median:   {brca1_data.median():.2f}")
print(f"Q3 (75%): {brca1_data.quantile(0.75):.2f}")
print(f"Maximum:  {brca1_data.max():.2f}")
print(f"\nIQR (Q3-Q1): {brca1_data.quantile(0.75) - brca1_data.quantile(0.25):.2f}")
print(f"Mean (for comparison): {brca1_data.mean():.2f}")

**What these numbers tell us:**
- **Median**: The typical BRCA1 expression level
- **IQR**: How spread out the middle 50% of values are
- **Range (max - min)**: The full spread of the data
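
As a shortcut, pandas can report the same statistics in a single call with `Series.describe()`. The sketch below uses a small hypothetical Series (`GENE_X` is a placeholder name, not a column from the dataset).

```python
import pandas as pd

# Hypothetical expression values standing in for a gene column
expr = pd.Series([2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 6.3], name="GENE_X")

# describe() returns count, mean, std, min, 25%, 50%, 75%, max
summary = expr.describe()

# Pick out just the five-number summary
five_number = summary[["min", "25%", "50%", "75%", "max"]]
print(five_number)
```

The same call on `gene_df['BRCA1']` should reproduce the numbers printed in the cell above.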

Now let's visualize this with a simple box plot!

---

## 4. Your First Box Plot: Single Gene

Let's create a simple box plot for BRCA1 expression:

In [None]:
# Simple box plot for BRCA1
fig, ax = plt.subplots(figsize=(6, 8))

bp = ax.boxplot([gene_df['BRCA1']], 
                 labels=['BRCA1'],
                 patch_artist=True,
                 showmeans=True)

# Color the box
bp['boxes'][0].set_facecolor('skyblue')
bp['boxes'][0].set_alpha(0.7)

ax.set_ylabel('Expression (log2 TPM + 1)', fontsize=12)
ax.set_title('BRCA1 Expression Distribution', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

**Reading the box plot:**
- **Orange line in box**: Median (50th percentile)
- **Green triangle**: Mean (average)
- **Box edges**: Q1 (bottom) and Q3 (top)
- **Whiskers**: Extend to the lowest and highest data points that lie within 1.5×IQR of the box edges
- **Individual points**: Outliers beyond the whiskers

**Key parameters:**
- `patch_artist=True`: Allows us to color the boxes
- `showmeans=True`: Shows mean as a triangle marker
- `labels`: Names for each box
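
`ax.boxplot()` returns a dictionary of the artists it drew, so the plotted values can be read back from it instead of being estimated by eye. Here is a minimal sketch on toy values (hypothetical, not from the dataset); it may help with the exercise questions below.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy values (hypothetical); 30.0 should show up as an outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

fig, ax = plt.subplots()
bp = ax.boxplot([values], showmeans=True)

# Each dictionary entry is a list of line artists; read their y-coordinates
median_drawn = float(bp['medians'][0].get_ydata()[0])
mean_drawn = float(bp['means'][0].get_ydata()[0])
outliers_drawn = [float(y) for y in bp['fliers'][0].get_ydata()]
# Each whisker line runs from the box edge to the whisker end
whisker_ends = [float(w.get_ydata()[1]) for w in bp['whiskers']]

print(f"Median: {median_drawn}")
print(f"Mean: {mean_drawn:.2f}")
print(f"Whisker ends: {whisker_ends}")
print(f"Outliers: {outliers_drawn}")

plt.close(fig)  # we only wanted the numbers here, not the picture
```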

### 📊 Exercise 1: Box Plot of TP53

Create a box plot for TP53 expression. Use a different color (e.g., 'lightcoral').

**Questions:**
- What is the median TP53 expression?
- Are there more outliers than BRCA1?
- Is the IQR larger or smaller than BRCA1?

In [None]:
# Your code here:


---

## 5. Comparing Groups: BRCA1 Across Cancer Types

Now the real power of box plots: **comparing distributions across multiple groups!**

Let's compare BRCA1 expression in Breast vs Myeloid cancer cell lines:

### Method 1: Using Matplotlib Boxplot

First, we need to prepare our data as a list with one set of values per group:

In [None]:
# Prepare data for box plot - one list per cancer type
lineages = gene_df['oncotree_lineage'].unique()
data_to_plot = [
    gene_df[gene_df['oncotree_lineage'] == lineage]['BRCA1']
    for lineage in lineages
]

print(f"Number of groups: {len(data_to_plot)}")
print(f"Groups: {lineages}")

In [None]:
# Create box plot comparing cancer types
fig, ax = plt.subplots(figsize=(10, 6))

bp = ax.boxplot(data_to_plot,
                labels=lineages,
                patch_artist=True,
                notch=True,
                showmeans=True)

# Customize colors (cycle through the palette if there are more groups than colors)
colors = ['skyblue', 'lightcoral', 'lightgreen', 'wheat', 'plum']
for i, patch in enumerate(bp['boxes']):
    patch.set_facecolor(colors[i % len(colors)])
    patch.set_alpha(0.7)

ax.set_xlabel('Cancer Type', fontsize=12)
ax.set_ylabel('BRCA1 Expression (log2 TPM + 1)', fontsize=12)
ax.set_title('BRCA1 Expression Across Cancer Types', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

**What's new?**
- `notch=True`: Adds notches showing an approximate 95% confidence interval around the median
  - If the notches of two boxes don't overlap, their medians likely differ (a rough visual guide, not a formal test)
- Multiple boxes side-by-side for comparison
- `plt.xticks(rotation=45)`: Rotates labels to prevent overlap
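
To see where the notch comes from: by default, matplotlib places the notch at median ± 1.57 × IQR / √n. The sketch below recomputes that by hand on synthetic data (assumed for illustration only) and compares it with `matplotlib.cbook.boxplot_stats`, the helper that computes box plot statistics.

```python
import numpy as np
from matplotlib import cbook

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
values = rng.normal(loc=5.0, scale=1.0, size=100)

# Notch limits by hand: median +/- 1.57 * IQR / sqrt(n)
q1, med, q3 = np.percentile(values, [25, 50, 75])
half_width = 1.57 * (q3 - q1) / np.sqrt(len(values))
notch_low, notch_high = med - half_width, med + half_width

# The same limits as computed by matplotlib ('cilo' and 'cihi')
stats = cbook.boxplot_stats(values)[0]

print(f"By hand:    {notch_low:.3f} to {notch_high:.3f}")
print(f"matplotlib: {stats['cilo']:.3f} to {stats['cihi']:.3f}")
```

Because the half-width shrinks with √n, groups with many cell lines get narrow notches and small groups get wide ones.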

**Biological questions:**
- Which cancer type has the highest median BRCA1 expression?
- Which has the most variability (tallest box)?
- Are there cancer-type-specific outliers?

### Method 2: Using Pandas Boxplot (Much Easier!)

Pandas has a built-in method that's much simpler:

In [None]:
# Pandas makes it super easy!
fig, ax = plt.subplots(figsize=(10, 6))

gene_df.boxplot(column='BRCA1',
                by='oncotree_lineage',
                ax=ax,
                patch_artist=True,
                grid=False)

# Clean up the automatic title
ax.set_title('BRCA1 Expression Across Cancer Types', fontsize=14, fontweight='bold')
ax.set_xlabel('Cancer Type', fontsize=12)
ax.set_ylabel('BRCA1 Expression (log2 TPM + 1)', fontsize=12)

# Remove the automatic "Boxplot grouped by ..." suptitle that pandas adds
plt.suptitle('')

plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

**Why use pandas boxplot?**
- ✅ Much simpler syntax - just specify column and grouping variable
- ✅ Automatically groups data
- ✅ No need to prepare data lists manually
- ✅ Perfect for quick exploratory analysis

**When to use matplotlib boxplot?**
- When you need more customization (colors, notches, etc.)
- For publication-quality figures
- When data isn't in a DataFrame

### 📊 Exercise 2: TP53 Across Cancer Types

Create a box plot comparing TP53 expression across cancer types.

Use **either** method (matplotlib or pandas - your choice!).

**Questions to answer:**
- Which cancer type has the highest median TP53?
- Which has the most outliers?
- Do the notches overlap between Breast and Myeloid?

In [None]:
# Your code here:


---

## 6. Comparing Multiple Genes Side-by-Side

Let's compare how BRCA1 and TP53 expression differ across cancer types using subplots:

In [None]:
# Compare BRCA1 and TP53 across cancer types
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

genes = ['BRCA1', 'TP53']
colors_list = [['skyblue', 'lightcoral'], ['lightgreen', 'wheat']]

for idx, gene in enumerate(genes):
    # Prepare data
    data_to_plot = [
        gene_df[gene_df['oncotree_lineage'] == lineage][gene]
        for lineage in lineages
    ]
    
    # Create box plot
    bp = axes[idx].boxplot(
        data_to_plot,
        labels=lineages,
        patch_artist=True,
        showmeans=True,
        notch=True
    )
    
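    # Color the boxes (cycle through this gene's palette if there are more groups than colors)
    palette = colors_list[idx]
    for i, patch in enumerate(bp['boxes']):
        patch.set_facecolor(palette[i % len(palette)])
        patch.set_alpha(0.7)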
    
    axes[idx].set_xlabel('Cancer Type', fontsize=11)
    axes[idx].set_ylabel(f'{gene} Expression', fontsize=11)
    axes[idx].set_title(f'{gene} Across Cancer Types', fontsize=13, fontweight='bold')
    axes[idx].grid(True, alpha=0.3, axis='y')
    axes[idx].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

**Comparison questions:**
- Which gene shows more variation across cancer types?
- Are the patterns similar between BRCA1 and TP53?
- Which gene has more outliers overall?

### 📊 Exercise 3: Four Genes in a 2×2 Grid

Create a 2×2 grid of box plots comparing these genes across cancer types:
- BRCA1, TP53, MYC, EGFR

**Hints:**
- Use `fig, axes = plt.subplots(2, 2, figsize=(16, 12))`
- Flatten axes with `axes.flatten()`
- Loop through genes with `enumerate()`
- Use different colors for each gene

In [None]:
# Your code here:


---

## 7. Interpreting Box Plots: What to Look For

Let's break down what different features mean biologically:

### Feature 1: Median Differences

Compare the median (orange line) across groups:

In [None]:
# Calculate median BRCA1 for each cancer type
print("Median BRCA1 Expression by Cancer Type:")
print("="*50)
for lineage in lineages:
    median_expr = gene_df[gene_df['oncotree_lineage'] == lineage]['BRCA1'].median()
    print(f"{lineage:15s}: {median_expr:.2f}")

**Biological interpretation:**
- Higher median = typical expression is higher in this cancer type
- Could reflect tissue-specific regulation
- BRCA1 in breast cancer is biologically relevant!

### Feature 2: Box Height (IQR) - Variability

Taller boxes = more variability within that group:

In [None]:
# Calculate IQR for each cancer type
print("IQR (Interquartile Range) for BRCA1 by Cancer Type:")
print("="*50)
for lineage in lineages:
    data = gene_df[gene_df['oncotree_lineage'] == lineage]['BRCA1']
    iqr = data.quantile(0.75) - data.quantile(0.25)
    print(f"{lineage:15s}: {iqr:.2f}")

**Biological interpretation:**
- High IQR = heterogeneous cell lines within this cancer type
- Could indicate:
  - Multiple subtypes within the cancer category
  - Different stages or grades
  - Varied mutations affecting this gene
- Low IQR = consistent expression across cell lines
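
The per-group loops above can also be written as a single pandas `groupby` call. This is a sketch on a tiny hypothetical table that mimics the column layout of `gene_df`. The values are made up and `iqr` is a helper defined here, not a pandas built-in.

```python
import pandas as pd

# Tiny hypothetical table with the same column layout as gene_df
toy_df = pd.DataFrame({
    "oncotree_lineage": ["Breast"] * 4 + ["Myeloid"] * 4,
    "BRCA1": [3.0, 4.0, 5.0, 6.0, 1.0, 2.0, 2.0, 9.0],
})

def iqr(series):
    """Interquartile range: Q3 - Q1."""
    return series.quantile(0.75) - series.quantile(0.25)

# One row per cancer type, one column per statistic
summary = toy_df.groupby("oncotree_lineage")["BRCA1"].agg(
    median="median",
    iqr=iqr,
    n="count",
)
print(summary)
```

Including the group size `n` is useful context: a large IQR in a group with only a handful of cell lines is much weaker evidence of heterogeneity.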

### Feature 3: Outliers - Unusual Cell Lines

In [None]:
# Identify outliers for each cancer type
print("Outliers in BRCA1 Expression:")
print("="*50)

for lineage in lineages:
    data = gene_df[gene_df['oncotree_lineage'] == lineage]['BRCA1']
    q1 = data.quantile(0.25)
    q3 = data.quantile(0.75)
    iqr = q3 - q1
    
    # Outliers are beyond 1.5 * IQR from quartiles
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    
    outliers = gene_df[
        (gene_df['oncotree_lineage'] == lineage) &
        ((gene_df['BRCA1'] < lower_bound) | (gene_df['BRCA1'] > upper_bound))
    ]
    
    if len(outliers) > 0:
        print(f"\n{lineage}:")
        print(outliers[['cell_line_name', 'BRCA1']])
    else:
        print(f"\n{lineage}: No outliers")

**Biological interpretation of outliers:**
- 🔬 **Low BRCA1 outlier in breast cancer**: Possible BRCA1 mutation!
- 🔬 **High expression outlier**: May have gene amplification
- 🔬 **Consistent outliers**: Worth investigating - could be a unique subtype
- ⚠️ **Could also be**: Experimental artifacts or measurement errors

### Feature 4: Overlapping Notches

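When notches (the narrowing around the median) **overlap**, the plot gives **no clear evidence** that the medians differ.

When notches **don't overlap**, there's evidence (at roughly the 95% confidence level) that the medians differ! Treat this as a visual guide rather than a formal statistical test.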
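
Notch overlap can also be checked numerically instead of by eye. The sketch below uses `matplotlib.cbook.boxplot_stats` to get each group's notch limits. `notches_overlap` is a hypothetical helper written for this example, and the groups are synthetic.

```python
import numpy as np
from matplotlib import cbook

def notches_overlap(a, b):
    """Return True if the notch intervals of two groups overlap."""
    stats_a = cbook.boxplot_stats(a)[0]
    stats_b = cbook.boxplot_stats(b)[0]
    # Two intervals overlap when each one starts before the other ends
    return bool(stats_a['cilo'] <= stats_b['cihi'] and
                stats_b['cilo'] <= stats_a['cihi'])

# Synthetic groups (hypothetical, for illustration only)
rng = np.random.default_rng(1)
group_a = rng.normal(5.0, 1.0, 60)
group_b = group_a + 0.05             # same spread, barely shifted
group_c = rng.normal(8.0, 1.0, 60)   # clearly higher

print("A vs B overlap:", notches_overlap(group_a, group_b))
print("A vs C overlap:", notches_overlap(group_a, group_c))
```

With real data, the same function could be applied to any two entries of `data_to_plot`.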

### 📊 Exercise 4: Biological Interpretation Challenge

Choose a gene of interest and:
1. Create a box plot across cancer types
2. Calculate and print median for each type
3. Calculate and print IQR for each type
4. Identify and print outliers
5. Write a 2-3 sentence biological interpretation

**Suggested genes:** EGFR, KRAS, TP53, MYC

In [None]:
# Your code here:


---

## 8. Advanced: Horizontal Box Plots

Sometimes horizontal box plots are easier to read, especially with long labels:

In [None]:
# Horizontal box plot
fig, ax = plt.subplots(figsize=(10, 6))

data_to_plot = [
    gene_df[gene_df['oncotree_lineage'] == lineage]['BRCA1']
    for lineage in lineages
]

bp = ax.boxplot(data_to_plot,
                labels=lineages,
                patch_artist=True,
                vert=False,  # Make it horizontal!
                showmeans=True,
                notch=True)

# Color boxes (cycle through the palette if there are more groups than colors)
colors = ['skyblue', 'lightcoral', 'lightgreen', 'wheat', 'plum']
for i, patch in enumerate(bp['boxes']):
    patch.set_facecolor(colors[i % len(colors)])
    patch.set_alpha(0.7)

ax.set_xlabel('BRCA1 Expression (log2 TPM + 1)', fontsize=12)
ax.set_ylabel('Cancer Type', fontsize=12)
ax.set_title('BRCA1 Expression Across Cancer Types (Horizontal)', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

**When to use horizontal:**
- Long category names that would overlap
- Many categories (the labels stack vertically, so each one stays readable)
- Publication figures where space is limited

---

## 9. Bonus: Violin Plots - Box Plots with Full Distribution

**Violin plots** combine box plots with density plots - showing the full distribution shape!

In [None]:
# Create violin plot
fig, ax = plt.subplots(figsize=(10, 6))

# Prepare one set of values per cancer type (the input format violinplot expects)
positions = range(1, len(lineages) + 1)
data_to_plot = [
    gene_df[gene_df['oncotree_lineage'] == lineage]['BRCA1']
    for lineage in lineages
]

vp = ax.violinplot(data_to_plot, positions=positions, 
                    showmeans=True, showmedians=True)

# Customize violin colors
colors = ['skyblue', 'lightcoral', 'lightgreen', 'wheat', 'plum']
for i, pc in enumerate(vp['bodies']):
    pc.set_facecolor(colors[i % len(colors)])
    pc.set_alpha(0.7)

ax.set_xticks(positions)
ax.set_xticklabels(lineages, rotation=45, ha='right')
ax.set_xlabel('Cancer Type', fontsize=12)
ax.set_ylabel('BRCA1 Expression (log2 TPM + 1)', fontsize=12)
ax.set_title('BRCA1 Expression Across Cancer Types (Violin Plot)', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

**Violin plots show:**
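- **Width**: Density of data at that value (like a density plot rotated and mirrored)
- **Horizontal lines inside the violin**: Mean and median (from `showmeans=True` and `showmedians=True`); both use the same line style by default, so they can be hard to tell apart
- **Short bars at the top and bottom**: Minimum and maximum of the data
- **Vertical center line**: Full range (min to max)

Unlike violin plots from some other libraries, matplotlib's `violinplot` does not draw a median dot or an IQR bar.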

**Advantages over box plots:**
- Shows the full distribution shape (bimodal, skewed, etc.)
- Can see if data has multiple peaks
- More informative than just five numbers

**Disadvantages:**
- Can be harder to read for non-experts
- Requires more data points to look good
- Takes more space
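
One way to get both views at once is to draw a narrow box plot on top of each violin. This is a minimal sketch on synthetic data (the group names and values are made up); "Group B" is deliberately bimodal to show what the box alone would hide.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic groups (hypothetical); Group B has two peaks
rng = np.random.default_rng(42)
groups = {
    "Group A": rng.normal(4.0, 0.5, 80),
    "Group B": np.concatenate([rng.normal(3.0, 0.3, 40),
                               rng.normal(5.0, 0.3, 40)]),
}
data = list(groups.values())
positions = list(range(1, len(data) + 1))

fig, ax = plt.subplots(figsize=(6, 4))

# Violins show the distribution shape; hide their min/max bars
vp = ax.violinplot(data, positions=positions, showextrema=False)
for body in vp['bodies']:
    body.set_facecolor('skyblue')
    body.set_alpha(0.6)

# Narrow box plots on the same positions add median, quartiles and whiskers
bp = ax.boxplot(data, positions=positions, widths=0.15, showfliers=False)

# Set tick labels last, because boxplot resets the x ticks
ax.set_xticks(positions)
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel('Expression (synthetic)')
ax.set_title('Violin + Box Plot Overlay')
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a script, add `plt.show()`.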

### 📊 Exercise 5: Create Your Own Violin Plot

Create a violin plot comparing TP53 expression across cancer types.

**Questions:**
- Do you see any bimodal distributions (two peaks)?
- Which cancer type has the widest distribution?
- Do violin plots give you more information than box plots for this data?

In [None]:
# Your code here:


---

## 10. Key Takeaways

✅ **Box plots** show the five-number summary: min, Q1, median, Q3, max

✅ **Components:**
   - Box: IQR (Q3 - Q1), contains middle 50% of data
   - Line in box: Median (50th percentile)
   - Whiskers: Extend to min/max within 1.5×IQR
   - Points: Outliers beyond whiskers

✅ **Two methods:**
   - Matplotlib: `ax.boxplot()` - more customization
   - Pandas: `df.boxplot()` - simpler syntax

✅ **What to look for:**
   - Median differences: Which group has higher/lower values?
   - Box height (IQR): Which group is more variable?
   - Overlapping notches: Are medians significantly different?
   - Outliers: Unusual observations worth investigating

✅ **Key parameters:**
   - `patch_artist=True`: Enable coloring
   - `notch=True`: Show the confidence interval around the median
   - `showmeans=True`: Display mean marker
   - `vert=False`: Make horizontal

✅ **Biological interpretation:**
   - High median: Typical expression is high in this group
   - Large IQR: Heterogeneous cell lines (possible subtypes)
   - Outliers: Mutations, amplifications, or artifacts
   - Non-overlapping notches: Medians likely differ between the groups (rough visual guide)

✅ **Violin plots** add full distribution shape to box plot information

✅ **Always pair with biological knowledge** - ask "why" patterns exist!

---

## 🎯 Challenge Problems

1. **Comprehensive Comparison**: Create a 3×2 grid comparing 6 different genes across cancer types using box plots

2. **Box + Violin Combo**: Create side-by-side plots with box plot on left, violin plot on right for the same data

3. **Statistical Testing**: For a gene of interest, identify which cancer type pairs have significantly different medians (non-overlapping notches)

4. **Outlier Investigation**: Pick a gene, find all outliers, and investigate if they share common characteristics (same cancer subtype, similar other gene expressions, etc.)

5. **Publication Figure**: Create a multi-panel figure with:
   - Panel A: Box plot of BRCA1 across cancer types
   - Panel B: Box plot of TP53 across cancer types  
   - Panel C: Violin plot comparing the two genes in breast cancer only
   - Add figure labels (A, B, C) and a caption

---

**Congratulations!** You've completed all the core visualization techniques for gene expression analysis! 🎉

You can now:
- Visualize distributions (histograms, density plots)
- Compare multiple plots (subplots)
- Explore relationships (scatter plots)
- Compare groups (box plots, violin plots)

These are the fundamental tools for biological data analysis!