# Data Visualization with Seaborn

Seaborn is a Python visualization library built on top of matplotlib that makes it easy to create beautiful, informative statistical plots.

## Learning Objectives

By the end of this notebook, you will be able to:
1. Create common plot types with seaborn (scatter, box, violin, heatmap)
2. Customize plot styles and color palettes
3. Add statistical information to plots automatically
4. Create multi-panel figures with seaborn
5. Choose the right plot for your data

---

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed
np.random.seed(42)

print('Libraries loaded!')

Libraries loaded!


---

## Load Gene Dependency Data

We'll use real DepMap data: CRISPR gene dependency scores for breast and myeloid cancer cell lines, with one row per cell line and one column per gene.

In [2]:
# Load data
url = "https://zenodo.org/records/17098555/files/combined_model_crispr_data_filtered.csv?download=1"
df = pd.read_csv(url)

print(f'Dataset shape: {df.shape}')
print('\nFirst few columns:')
print(df.columns[:10].tolist())
print('\nCancer types:')
print(df['oncotree_lineage'].value_counts())

Dataset shape: (94, 17211)

First few columns:
['model_id', 'cell_line_name', 'stripped_cell_line_name', 'oncotree_lineage', 'oncotree_primary_disease', 'oncotree_subtype', 'A1BG', 'A1CF', 'A2M', 'A2ML1']

Cancer types:
oncotree_lineage
Breast     53
Myeloid    41
Name: count, dtype: int64


---

## Part 1: Setting the Style

Seaborn has several built-in styles that control the overall look of your plots.

### Example 1: Comparing Styles

In [None]:
# Available styles
styles = ['darkgrid', 'whitegrid', 'dark', 'white', 'ticks']

fig = plt.figure(figsize=(15, 8))

for i, style in enumerate(styles):
    # A style only affects axes created while it is active, so build
    # each subplot inside an axes_style() context (5 of the 6 grid slots)
    with sns.axes_style(style):
        ax = fig.add_subplot(2, 3, i + 1)
    
    # Simple scatter plot
    ax.scatter(df['BRCA1'][:20], df['TP53'][:20], alpha=0.6)
    ax.set_xlabel('BRCA1')
    ax.set_ylabel('TP53')
    ax.set_title(f'Style: {style}')

plt.tight_layout()
plt.show()

# Use 'whitegrid' for the rest of the notebook
sns.set_style('whitegrid')
print('→ "whitegrid" is a good default for scientific plots')
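
A style is a bundle of matplotlib settings. As a small sketch that needs no data, `sns.axes_style()` returns those settings as a dictionary, so two styles can be compared directly to see what actually changes:

```python
import seaborn as sns

# axes_style() returns the matplotlib rc parameters a style would set
white = sns.axes_style('white')
whitegrid = sns.axes_style('whitegrid')

# List the settings that differ between the two styles
changed = {key: (white[key], whitegrid[key])
           for key in white if white[key] != whitegrid[key]}
for key, (w, wg) in changed.items():
    print(f'{key}: white={w!r}, whitegrid={wg!r}')
```

Because these settings are read when an axes is created, `sns.set_style()` should be called before `plt.subplots()`, not after.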

### 📝 Practice Question 1

**Task:** Try the 'white' style and create a scatter plot of MYC vs ATR dependency.

Add appropriate axis labels and a title.

In [None]:
# YOUR CODE HERE
# Set style to 'white' and create scatter plot


---

## Part 2: Scatter Plots with Regression Lines

Seaborn's `regplot()` adds a regression line and confidence interval automatically.

### Example 2: Basic Regression Plot

In [None]:
# Create regression plot
fig, ax = plt.subplots(figsize=(8, 6))

sns.regplot(x='BRCA1', y='TP53', data=df, 
            scatter_kws={'alpha': 0.5, 's': 50},
            line_kws={'color': 'red', 'linewidth': 2},
            ax=ax)

ax.set_xlabel('BRCA1 Dependency')
ax.set_ylabel('TP53 Dependency')
ax.set_title('BRCA1 vs TP53 with Regression Line')

plt.tight_layout()
plt.show()

print('→ Shaded area shows 95% confidence interval')
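
`regplot()` draws the fitted line but does not report its slope or the correlation. A minimal sketch of getting those numbers with NumPy is shown below; the data are hypothetical stand-ins for two gene columns, not DepMap values:

```python
import numpy as np

# Hypothetical data standing in for two gene columns (not DepMap values)
rng = np.random.default_rng(0)
x = rng.normal(size=94)
y = 0.5 * x + rng.normal(scale=0.2, size=94)

# Least-squares straight line: the same kind of fit regplot draws by default
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(x, y)[0, 1]

print(f'slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.2f}')
```

With the real data, `x` and `y` would be two columns such as `df['BRCA1']` and `df['TP53']`, assuming neither contains missing values.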

### Example 3: Scatter Plot by Category

In [None]:
# Use lmplot for categorical grouping
g = sns.lmplot(x='ATR', y='CHEK1', data=df, 
               hue='oncotree_lineage',  # Color by cancer type
               height=6, aspect=1.3,
               scatter_kws={'alpha': 0.6, 's': 50})

g.set_axis_labels('ATR Dependency', 'CHEK1 Dependency')
plt.title('ATR vs CHEK1 by Cancer Type')
plt.tight_layout()
plt.show()

print('→ Different colors for different cancer types')

### 📝 Practice Question 2

**Task:** Create a regression plot showing the relationship between RPA1 and ATR.

Customize:
- Scatter points: green color, size 60, alpha 0.4
- Line: purple color, width 3

**Hint:** Use `scatter_kws` and `line_kws` parameters

In [None]:
# YOUR CODE HERE
# Create customized regression plot


---

## Part 3: Box Plots and Violin Plots

Great for comparing distributions across groups.

### Example 4: Box Plot

In [None]:
# Compare ATR dependency between cancer types
fig, ax = plt.subplots(figsize=(8, 6))

sns.boxplot(x='oncotree_lineage', y='ATR', data=df, ax=ax)

ax.set_xlabel('Cancer Type')
ax.set_ylabel('ATR Dependency')
ax.set_title('ATR Dependency by Cancer Type')

plt.tight_layout()
plt.show()

print('Box shows: median (line), Q1-Q3 (box), whiskers (most extreme points within 1.5*IQR of the box), outliers (dots)')
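
To make the box plot elements concrete, here is a sketch that computes them by hand with NumPy. The values are hypothetical and chosen so that one point lands outside the whiskers:

```python
import numpy as np

# Hypothetical values, chosen so that one point falls outside the whiskers
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond these are drawn as individual outlier dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f'Q1={q1}, median={median}, Q3={q3}, IQR={iqr}')
print(f'whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}')
```

Note that the whiskers end at actual data points, not at the fences themselves.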

### Example 5: Violin Plot (Shows Full Distribution)

In [None]:
# Violin plot shows distribution shape
fig, ax = plt.subplots(figsize=(8, 6))

sns.violinplot(x='oncotree_lineage', y='TP53', data=df, ax=ax)

ax.set_xlabel('Cancer Type')
ax.set_ylabel('TP53 Dependency')
ax.set_title('TP53 Dependency Distribution by Cancer Type')

plt.tight_layout()
plt.show()

print('→ Width shows density: wider = more data points at that value')

### Example 6: Combined Violin + Box Plot

In [None]:
# Best of both worlds!
fig, ax = plt.subplots(figsize=(8, 6))

sns.violinplot(x='oncotree_lineage', y='BRCA1', data=df, 
               inner='box',  # Add box plot inside
               ax=ax)

ax.set_xlabel('Cancer Type')
ax.set_ylabel('BRCA1 Dependency')
ax.set_title('BRCA1: Distribution + Quartiles')

plt.tight_layout()
plt.show()

print('→ Shows both shape (violin) and quartiles (box)')

### 📝 Practice Question 3

**Task:** Create a box plot comparing MYC dependency between breast and myeloid cancers.

Add custom colors using the `palette` parameter:
```python
palette=['lightblue', 'lightcoral']
```

In [None]:
# YOUR CODE HERE
# Create colored box plot


---

## Part 4: Heatmaps

Perfect for visualizing correlation matrices or gene expression patterns.

### Example 7: Correlation Heatmap

In [None]:
# Select a few interesting genes
genes_of_interest = ['ATR', 'ATRIP', 'CHEK1', 'RPA1', 'TP53', 'BRCA1']
gene_subset = df[genes_of_interest]

# Calculate correlation matrix
corr_matrix = gene_subset.corr()

# Create heatmap
fig, ax = plt.subplots(figsize=(8, 7))

sns.heatmap(corr_matrix, 
            annot=True,  # Show values
            fmt='.2f',   # 2 decimal places
            cmap='coolwarm',  # Color scheme
            center=0,    # Center colormap at 0
            square=True,  # Square cells
            ax=ax)

ax.set_title('Gene Correlation Matrix')
plt.tight_layout()
plt.show()

print('→ Red = positive correlation, Blue = negative correlation')
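
A correlation matrix is symmetric, so half of the cells repeat the other half. One common option, sketched here with hypothetical data and made-up gene names, is to hide the diagonal and upper triangle with the `mask` parameter of `sns.heatmap()`:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical scores for four made-up genes (not DepMap values)
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(50, 4)),
                    columns=['GENE_A', 'GENE_B', 'GENE_C', 'GENE_D'])
corr = demo.corr()

# True marks cells to hide: the diagonal and everything above it
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f',
            cmap='coolwarm', vmin=-1, vmax=1, square=True, ax=ax)
ax.set_title('Lower Triangle of a Correlation Matrix')
plt.tight_layout()
plt.show()
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale comparable between heatmaps built from different gene sets.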

### Example 8: Gene Dependency Heatmap (Clustered)

In [None]:
# Show dependency pattern across samples
# Take first 10 samples and 8 genes
sample_subset = df.iloc[:10]
genes_to_plot = ['ATR', 'BRCA1', 'TP53', 'MYC', 'CHEK1', 'RPA1', 'PTEN', 'MDM2']
expression_data = sample_subset[genes_to_plot].T  # Transpose: genes as rows

# Create clustered heatmap
g = sns.clustermap(expression_data, 
                   cmap='viridis',
                   figsize=(10, 6),
                   cbar_kws={'label': 'Dependency Score'})

g.fig.suptitle('Hierarchical Clustering of Gene Dependencies', y=1.02)
plt.show()

print('→ Dendrograms show similarity: closer branches = more similar patterns')

### 📝 Practice Question 4

**Task:** Create a correlation heatmap for these genes: ['MYC', 'PTEN', 'MDM2', 'CDK4']

Customize:
- Use 'RdBu_r' colormap (red-blue reversed)
- Show annotations
- Add a title

In [None]:
# YOUR CODE HERE
# Create correlation heatmap


---

## Part 5: Distribution Plots

Visualize how data is distributed.

### Example 9: Histogram with KDE

In [None]:
# Distribution of ATR dependency
fig, ax = plt.subplots(figsize=(8, 5))

sns.histplot(df['ATR'], bins=20, kde=True, ax=ax)

ax.set_xlabel('ATR Dependency')
ax.set_ylabel('Count')
ax.set_title('Distribution of ATR Dependency\n(with kernel density estimate)')

plt.tight_layout()
plt.show()

print('→ KDE (smooth curve) shows estimated probability density')

### Example 10: Overlapping Distributions

In [None]:
# Compare distributions between cancer types
fig, ax = plt.subplots(figsize=(10, 6))

sns.histplot(data=df, x='TP53', hue='oncotree_lineage', 
             bins=20, alpha=0.5, kde=True, ax=ax)

ax.set_xlabel('TP53 Dependency')
ax.set_ylabel('Count')
ax.set_title('TP53 Distribution by Cancer Type')

plt.tight_layout()
plt.show()

print('→ Overlapping histograms show if distributions differ between groups')

### 📝 Practice Question 5

**Task:** Create a histogram of BRCA1 dependency with a KDE overlay.

Customize:
- Use 25 bins
- Color: 'coral'
- Add vertical line at the mean (use `ax.axvline()`)

In [None]:
# YOUR CODE HERE
# Create histogram with mean line


---

## Part 6: Multi-Panel Plots

Create publication-quality figures with multiple subplots.

### Example 11: Pairplot (All Pairwise Relationships)

In [None]:
# Select a few genes and create pairplot
genes_for_pairs = ['ATR', 'ATRIP', 'CHEK1', 'RPA1']
pair_data = df[genes_for_pairs + ['oncotree_lineage']]

g = sns.pairplot(pair_data, hue='oncotree_lineage', 
                 height=2.5, aspect=1,
                 plot_kws={'alpha': 0.6, 's': 30})

g.fig.suptitle('Pairwise Gene Correlations', y=1.02)
plt.show()

print('→ Diagonal: distributions, Off-diagonal: scatter plots')
print('→ Great for exploratory analysis!')

### Example 12: FacetGrid (Multiple Subplots by Category)

In [None]:
# Create separate histograms for each cancer type
g = sns.FacetGrid(df, col='oncotree_lineage', height=4, aspect=1.2)
g.map(sns.histplot, 'ATR', bins=15, kde=True)
g.set_axis_labels('ATR Dependency', 'Count')
g.fig.suptitle('ATR Distribution by Cancer Type', y=1.02)
plt.tight_layout()
plt.show()

print('→ Side-by-side comparison makes differences clear')

### 📝 Practice Question 6

**Task:** Create a 2x2 subplot figure showing:
- Top left: Box plot of BRCA1 by cancer type
- Top right: Violin plot of TP53 by cancer type
- Bottom left: Scatter plot of BRCA1 vs TP53
- Bottom right: Histogram of ATR

**Hint:** Use `plt.subplots(2, 2)` and seaborn functions with `ax=` parameter

In [None]:
# YOUR CODE HERE
# Create 2x2 multi-panel figure


---

## Part 7: Color Palettes

Choose colors that make your data clear and accessible.

### Example 13: Built-in Color Palettes

In [None]:
# Compare different palettes
palettes = ['deep', 'muted', 'bright', 'pastel', 'dark', 'colorblind']

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

for i, palette_name in enumerate(palettes):
    ax = axes[i]
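    # Note: seaborn >= 0.13 warns when palette is passed without hue;
    # there, add hue='oncotree_lineage', legend=False to silence it
    sns.boxplot(x='oncotree_lineage', y='ATR', data=df, 
                palette=palette_name, ax=ax)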
    ax.set_title(f'Palette: {palette_name}')
    ax.set_xlabel('')
    ax.set_ylabel('ATR Dependency' if i % 3 == 0 else '')

plt.tight_layout()
plt.show()

print('→ "colorblind" palette is accessible to people with color vision deficiency')

### Example 14: Custom Color Palette

In [None]:
# Create your own color scheme
custom_colors = ['#FF6B6B', '#4ECDC4']  # Red and teal

fig, ax = plt.subplots(figsize=(8, 6))

sns.violinplot(x='oncotree_lineage', y='CHEK1', data=df,
               palette=custom_colors, ax=ax)

ax.set_xlabel('Cancer Type')
ax.set_ylabel('CHEK1 Dependency')
ax.set_title('Custom Color Palette')

plt.tight_layout()
plt.show()

print('→ Use hex codes for precise color control')
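
Built-in palettes can also be turned into hex codes, which is handy when the same colors are needed in several figures. The sketch below takes two colors from the 'colorblind' palette and maps them to the two lineage names from this dataset; the variable names are illustrative:

```python
import seaborn as sns

# Take the first two colors of the built-in 'colorblind' palette
colors = sns.color_palette('colorblind', n_colors=2)
hex_codes = colors.as_hex()
print(hex_codes)

# Map each category to a fixed color (category names taken from the dataset)
lineage_palette = dict(zip(['Breast', 'Myeloid'], hex_codes))
print(lineage_palette)
```

Passing this dictionary as `palette=lineage_palette` together with `hue='oncotree_lineage'` should keep each cancer type the same color across plots, regardless of the order in which the groups appear.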

### 📝 Practice Question 7 (Challenge)

**Task:** Create a publication-ready figure with:

1. Use the 'ticks' style (white background with axis ticks)
2. Create a 1x2 subplot figure
3. Left panel: Scatter plot with regression line (ATR vs RPA1)
4. Right panel: Box plot by cancer type (TP53)
5. Use 'colorblind' palette
6. Add overall title: "Gene Dependencies in Cancer"

Make it look professional!

In [None]:
# YOUR CODE HERE
# Create publication-quality figure


---

## Summary

### Key Seaborn Functions:

**1. Scatter & Regression:**
```python
sns.regplot(x='gene1', y='gene2', data=df)
sns.lmplot(x='gene1', y='gene2', hue='group', data=df)
```

**2. Distributions:**
```python
sns.boxplot(x='group', y='value', data=df)
sns.violinplot(x='group', y='value', data=df)
sns.histplot(data=df, x='value', kde=True)
```

**3. Heatmaps:**
```python
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
sns.clustermap(data, cmap='viridis')
```

**4. Multi-panel:**
```python
sns.pairplot(df, hue='group')
g = sns.FacetGrid(df, col='category')
g.map(sns.histplot, 'value')
```

### Style Settings:
```python
sns.set_style('whitegrid')  # or 'darkgrid', 'white', 'dark', 'ticks'
sns.set_palette('colorblind')  # Accessible colors
```

### Common Parameters:
- `data=df` - DataFrame to plot
- `x='col1', y='col2'` - Columns to plot
- `hue='category'` - Color by category
- `palette='colorblind'` - Color scheme
- `ax=ax` - Matplotlib axis to use
- `alpha=0.5` - Transparency

### When to Use Each Plot:

| Plot Type | Use When |
|-----------|----------|
| **regplot/lmplot** | Showing correlation between 2 continuous variables |
| **boxplot** | Comparing distributions across groups (shows quartiles) |
| **violinplot** | Comparing distributions (shows full shape) |
| **heatmap** | Showing correlation matrix or expression patterns |
| **histplot** | Showing distribution of single variable |
| **pairplot** | Exploring relationships between many variables |

### Tips for Publication-Quality Figures:
1. Use `sns.set_style('white')` or `'ticks'` for clean look
2. Choose `'colorblind'` palette for accessibility
3. Add clear axis labels and titles
4. Use `plt.tight_layout()` to prevent label overlap
5. Set figure size appropriately (`figsize=(8, 6)`)
6. Adjust `alpha` for overlapping points

### Advantages of Seaborn:
- ✅ Less code than pure matplotlib
- ✅ Beautiful defaults
- ✅ Automatic statistical visualizations
- ✅ Easy integration with pandas DataFrames
- ✅ Built-in themes and color palettes

Remember: Seaborn works **with** matplotlib, not instead of it!
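
As a small illustration of that point, axes-level seaborn functions return an ordinary matplotlib `Axes`, so any matplotlib method can be applied afterwards. The toy data here are hypothetical:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from matplotlib.axes import Axes

# Hypothetical toy data (not DepMap values)
toy = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 1, 4, 3]})

# Axes-level seaborn functions return the matplotlib Axes they drew on
ax = sns.scatterplot(data=toy, x='x', y='y')
print(isinstance(ax, Axes))

# ...so any matplotlib method can be used to finish the figure
ax.axhline(toy['y'].mean(), color='gray', linestyle='--')
ax.set_title('Seaborn plot, matplotlib finishing touches')
plt.show()
```

Figure-level functions such as `lmplot()`, `pairplot()`, and `clustermap()` return grid objects instead, which is why the examples above set their titles through the grid's figure.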