# Lecture 4 Notebook 7: Exploring Relationships with Scatter Plots

**Learning Objectives:**
- Understand what scatter plots show and when to use them
- Create basic scatter plots with matplotlib
- Interpret patterns: correlations, clusters, and outliers
- Compare multiple relationships using subplots
- Color points by category to reveal subgroups
- Connect scatter plot patterns to biological questions

---

## 1. Introduction: What is a Scatter Plot?

A **scatter plot** shows the relationship between **two numerical variables**. Each point represents one observation.

**In our gene expression data:**
- Each point = one cell line
- X-axis = expression of Gene A
- Y-axis = expression of Gene B

**Why use scatter plots?**
- 🔍 Discover if two genes are **correlated** (co-expressed)
- 🧬 Identify genes that work together in the same pathway
- 📊 Find outliers (unusual cell lines)
- 🎯 Reveal clusters (subgroups of cell lines)

**Key biological insight:** If the expression of two genes follows a linear trend, that is a hint (not proof) that they may be:
- Part of the same protein complex
- Co-regulated by the same transcription factor
- Working together in a pathway
- One regulating the other

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## 2. Load Gene Expression Data

In [None]:
# Load the gene expression data
url = "https://zenodo.org/records/17377786/files/expression_filtered.csv?download=1"
gene_df = pd.read_csv(url)

print(f"Dataset shape: {gene_df.shape}")
print(f"Number of cell lines: {gene_df.shape[0]}")
gene_df.head()
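
The cells that follow assume certain columns are present (gene symbols such as `BRCA1`, plus `oncotree_lineage` and `cell_line_name`). A quick check right after loading can save a confusing `KeyError` later. The sketch below uses a tiny made-up table; `missing_columns` and `demo_df` are hypothetical names for illustration, not part of the dataset or of pandas.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]

# Tiny stand-in table (made-up values); in the notebook, pass gene_df instead.
demo_df = pd.DataFrame({
    "cell_line_name": ["line_1", "line_2"],
    "oncotree_lineage": ["Breast", "Myeloid"],
    "BRCA1": [3.1, 4.2],
})

required = ["cell_line_name", "oncotree_lineage", "BRCA1", "BRCA2"]
missing = missing_columns(demo_df, required)
print(f"Missing columns: {missing}")
```

With the real data you would call `missing_columns(gene_df, [...])` with the genes you plan to plot.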

---

## 3. Your First Scatter Plot: BRCA1 vs BRCA2

Let's explore the relationship between **BRCA1** and **BRCA2**, two genes involved in DNA repair.

**Biological question:** Are these genes co-expressed in cancer cell lines?

In [None]:
# Create a scatter plot
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(gene_df['BRCA1'], gene_df['BRCA2'], 
           alpha=0.6, s=50, color='skyblue',
           edgecolor='black', linewidth=0.5)

ax.set_xlabel('BRCA1 Expression (log2(TPM + 1))', fontsize=12)
ax.set_ylabel('BRCA2 Expression (log2(TPM + 1))', fontsize=12)
ax.set_title('BRCA1 vs BRCA2 Expression Across Cell Lines', fontsize=14, fontweight='bold')

# Add grid for easier reading
ax.grid(True, alpha=0.3, linestyle='--')

plt.tight_layout()
plt.show()

**Understanding the plot:**
- Each **point** = one cell line
- **X-position** = BRCA1 expression in that cell line
- **Y-position** = BRCA2 expression in that cell line

**Key parameters:**
- `alpha=0.6`: Transparency (0 = invisible, 1 = opaque). Makes overlapping points easier to see.
- `s=50`: Marker size, given as an area in points² (try changing this!)
- `color='skyblue'`: Color of points
- `edgecolor='black'`: Border color around each point
- `linewidth=0.5`: Thickness of the border

**What pattern do you see?** Is there a relationship?

### 📊 Exercise 1: Scatter Plot of TP53 vs MDM2

Create a scatter plot showing the relationship between TP53 and MDM2.

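**Biological context:** MDM2 is a negative regulator of TP53: it targets the p53 protein (encoded by TP53) for degradation. What pattern would you expect? Keep in mind that our data measure mRNA levels, while this regulation acts on the protein.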

**Hints:**
- Use a different color (e.g., 'lightcoral')
- Add appropriate labels and title
- Include a grid

In [None]:
# Your code here:


---

## 4. Strong Correlation: TSC1 vs TSC2

Let's look at a gene pair that we **expect** to be strongly correlated.

**Biological background:**
- **TSC1** and **TSC2** form the TSC protein complex
- This complex regulates mTOR signaling (critical for cell growth)
- Cells need BOTH proteins to form a functional complex

**Prediction:** These genes should show **strong positive correlation** because cells that express one usually express the other!

In [None]:
# TSC1 and TSC2: Genes in the same protein complex
fig, ax = plt.subplots(figsize=(8, 6))

ax.scatter(gene_df['TSC1'], gene_df['TSC2'],
           alpha=0.6, s=60,
           color='lightcoral',
           edgecolor='darkred',
           linewidth=0.5)

ax.set_xlabel('TSC1 Expression (log2(TPM + 1))', fontsize=12)
ax.set_ylabel('TSC2 Expression (log2(TPM + 1))', fontsize=12)
ax.set_title('TSC1 vs TSC2: Co-regulated Genes', fontsize=14, fontweight='bold')

ax.grid(True, alpha=0.3, linestyle='--')

plt.tight_layout()
plt.show()

**What do you see?**
- Is there a **linear pattern** (points form a line)?
- Does the line go **upward** (positive correlation) or **downward** (negative correlation)?
- Are there any **outliers** (points far from the pattern)?

**Biological interpretation:**
- A strong positive correlation is consistent with co-regulation, although correlation alone cannot prove it
- Outliers might be cell lines with mutations or unusual regulatory mechanisms
- If the trend is strong, it supports our biological expectation!
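
To put a number on "how linear" and "which direction", one option is the Pearson correlation coefficient together with a least-squares line. The sketch below runs on synthetic data (not real expression values); `gene_x` and `gene_y` are stand-ins for two expression columns.

```python
import numpy as np

# Synthetic stand-in for two co-expressed genes (NOT real expression data).
rng = np.random.default_rng(0)
gene_x = rng.normal(loc=5.0, scale=1.0, size=200)
gene_y = 0.8 * gene_x + rng.normal(loc=0.0, scale=0.3, size=200)

# Pearson r: +1 is a perfect upward line, -1 a perfect downward line,
# and values near 0 mean no linear trend.
r = np.corrcoef(gene_x, gene_y)[0, 1]

# Least-squares straight line: y = slope * x + intercept.
slope, intercept = np.polyfit(gene_x, gene_y, deg=1)

print(f"Pearson r = {r:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

On the real table, `gene_df['TSC1'].corr(gene_df['TSC2'])` computes the same Pearson coefficient, and the fitted line can be drawn over the scatter plot with `ax.plot`.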

### Patterns to Look For in Scatter Plots

**1. Positive Correlation** (upward slope)
- As Gene A increases, Gene B increases
- Suggests: co-regulation, same pathway, or functional relationship

**2. Negative Correlation** (downward slope)
- As Gene A increases, Gene B decreases
- Suggests: antagonistic relationship, or one inhibits the other

**3. No Correlation** (random cloud)
- No clear pattern
- Suggests: no linear relationship; the genes may be independently regulated

**4. Outliers**
- Points far from the main pattern
- Could be: unusual cell lines, mutations, or experimental artifacts

**5. Clusters**
- Groups of points separate from others
- Could indicate: different cancer subtypes or tissue origins
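
The first three patterns can be reproduced with synthetic data, which helps calibrate your eye before looking at real genes. The sketch below also includes a U-shaped case as an assumption-labeled extra: it shows that a Pearson coefficient near zero means "no *linear* trend", not necessarily "no relationship". All values are simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
noise = rng.normal(scale=0.5, size=n)

# Simulated "Gene B" values for each pattern, paired with "Gene A" = x.
patterns = {
    "positive": x + noise,          # upward slope
    "negative": -x + noise,         # downward slope
    "none": rng.normal(size=n),     # random cloud, unrelated to x
}
r_values = {name: np.corrcoef(x, y)[0, 1] for name, y in patterns.items()}

# U-shaped relationship on a symmetric grid: clearly related, yet r is ~0.
x_grid = np.linspace(-3, 3, 301)
r_curved = np.corrcoef(x_grid, x_grid ** 2)[0, 1]

for name, r in r_values.items():
    print(f"{name:>8}: r = {r:+.2f}")
print(f"  curved: r = {r_curved:+.2f}")
```

Plotting each pair with `ax.scatter` next to its `r` value is a useful way to connect the number to the picture.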

### 📊 Exercise 2: Find a Strong Correlation

Explore other gene pairs! Create scatter plots for these pairs and identify which shows the strongest correlation:

1. MYC vs MYCN
2. AKT1 vs AKT2
3. KRAS vs NRAS

**Hint:** Create 3 subplots side-by-side to compare them easily!

In [None]:
# Your code here:


---

## 5. Comparing Multiple Relationships with Subplots

Let's compare two gene pairs side-by-side to see which shows the stronger correlation:

In [None]:
# Compare two gene pairs
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Left: BRCA1 vs BRCA2
axes[0].scatter(gene_df['BRCA1'], gene_df['BRCA2'],
                alpha=0.6, s=50, color='skyblue',
                edgecolor='black', linewidth=0.5)
axes[0].set_xlabel('BRCA1 Expression')
axes[0].set_ylabel('BRCA2 Expression')
axes[0].set_title('BRCA1 vs BRCA2')
axes[0].grid(True, alpha=0.3)

# Right: TSC1 vs TSC2
axes[1].scatter(gene_df['TSC1'], gene_df['TSC2'],
                alpha=0.6, s=60, color='lightcoral',
                edgecolor='darkred', linewidth=0.5)
axes[1].set_xlabel('TSC1 Expression')
axes[1].set_ylabel('TSC2 Expression')
axes[1].set_title('TSC1 vs TSC2 (Strong Correlation)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

**Comparison:**
- Which pair shows a tighter linear pattern?
- Which has more scatter around the trend?
- What does this tell us about their biological relationship?

Side-by-side plots make it easy to directly compare correlation strength!
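
When the panels differ only in which genes they show, a loop over gene pairs avoids copy-pasting the plotting code. The sketch below uses a synthetic table with hypothetical gene names (`GENE_A`, `GENE_B`, `GENE_C`) and adds the Pearson coefficient to each title; swap in `gene_df` and real gene symbols in the notebook.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic table with hypothetical gene names (stand-in for gene_df).
rng = np.random.default_rng(2)
base = rng.normal(5, 1, size=150)
demo_df = pd.DataFrame({
    "GENE_A": base,
    "GENE_B": base + rng.normal(0, 0.3, size=150),  # tightly coupled to GENE_A
    "GENE_C": rng.normal(5, 1, size=150),           # unrelated to GENE_A
})

pairs = [("GENE_A", "GENE_B"), ("GENE_A", "GENE_C")]

# squeeze=False always returns a 2D array of axes, even for a single panel.
fig, axes = plt.subplots(1, len(pairs), figsize=(7 * len(pairs), 6), squeeze=False)
for ax, (gene_x, gene_y) in zip(axes.flat, pairs):
    r = demo_df[gene_x].corr(demo_df[gene_y])  # Pearson r is the pandas default
    ax.scatter(demo_df[gene_x], demo_df[gene_y], alpha=0.6, s=50,
               color="skyblue", edgecolor="black", linewidth=0.5)
    ax.set_xlabel(f"{gene_x} Expression")
    ax.set_ylabel(f"{gene_y} Expression")
    ax.set_title(f"{gene_x} vs {gene_y} (r = {r:.2f})")
    ax.grid(True, alpha=0.3)

plt.tight_layout()
# In a notebook the figure renders at the end of the cell;
# in a plain script, call plt.show() here.
```

Adding a pair to `pairs` adds a panel without touching the plotting code.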

### 📊 Exercise 3: Grid of Correlations

Create a 2×2 grid showing scatter plots of all pairwise combinations of these 3 genes:
BRCA1, TP53, MYC

Plot combinations:
- (0,0): BRCA1 vs TP53
- (0,1): BRCA1 vs MYC
- (1,0): TP53 vs MYC
- (1,1): Leave empty or add text summary

**Hint:** Use `axes.flatten()` to make looping easier!

In [None]:
# Your code here:


---

## 6. Advanced: Color by Category

So far, all our points are the same color. But what if we want to see if **different cancer types** show different patterns?

We can **color points by category** to reveal subgroups!

### First, let's see what cancer types we have:

In [None]:
# Check unique cancer lineages
print("Cancer types in our dataset:")
print(gene_df['oncotree_lineage'].unique())
print(f"\nNumber of types: {gene_df['oncotree_lineage'].nunique()}")
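
Beyond the list of names, it helps to know how many cell lines each lineage contributes, since a group with only a handful of points can look like a cluster or an outlier by chance. A minimal sketch with made-up labels standing in for `gene_df['oncotree_lineage']`:

```python
import pandas as pd

# Hypothetical lineage labels standing in for gene_df['oncotree_lineage'].
lineage = pd.Series(
    ["Breast", "Myeloid", "Breast", "Lung", "Breast", "Myeloid"],
    name="oncotree_lineage",
)

# value_counts() sorts from the most to the least frequent label.
counts = lineage.value_counts()
print(counts)
```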

### Color Points by Cancer Type

Now let's create a scatter plot where each cancer type has a different color:

In [None]:
# Color points by lineage
fig, ax = plt.subplots(figsize=(10, 7))

# Get unique lineages (skip missing labels)
lineages = gene_df['oncotree_lineage'].dropna().unique()
colors = ['red', 'blue', 'green', 'orange', 'purple']

# Plot each lineage separately with a different color
for i, lineage in enumerate(lineages):
    # Filter data for this lineage
    mask = gene_df['oncotree_lineage'] == lineage
    # Reuse colors if there are more lineages than colors
    # (zip(lineages, colors) would silently drop the extra lineages)
    color = colors[i % len(colors)]

    ax.scatter(gene_df.loc[mask, 'BRCA1'],
               gene_df.loc[mask, 'BRCA2'],
               alpha=0.6, s=60,
               color=color,
               label=lineage,
               edgecolor='black',
               linewidth=0.5)

ax.set_xlabel('BRCA1 Expression (log2(TPM + 1))', fontsize=12)
ax.set_ylabel('BRCA2 Expression (log2(TPM + 1))', fontsize=12)
ax.set_title('BRCA1 vs BRCA2 by Cancer Type', fontsize=14, fontweight='bold')
ax.legend(title='Cancer Lineage', loc='best')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

**How this works:**
1. We get all unique cancer types
2. We loop through each type
3. For each type, we:
   - Create a **mask** (boolean filter) to select only that type
   - Plot those points with a specific color
   - Add a label for the legend
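
The mask-per-category loop can also be written with `groupby`, which hands back each category's rows in turn. The sketch below, on a small made-up table, checks that both approaches select the same rows; treat it as an alternative to try, not as the notebook's required method.

```python
import pandas as pd

# Made-up values; in the notebook the table is gene_df.
demo_df = pd.DataFrame({
    "oncotree_lineage": ["Breast", "Myeloid", "Breast", "Myeloid", "Lung"],
    "BRCA1": [3.0, 4.5, 3.4, 4.1, 2.2],
    "BRCA2": [2.1, 3.9, 2.6, 3.5, 1.8],
})

# Mask approach (as in the cell above): one boolean filter per lineage.
mask = demo_df["oncotree_lineage"] == "Breast"
breast_via_mask = demo_df.loc[mask, ["BRCA1", "BRCA2"]]

# groupby approach: iterate over (lineage, rows) pairs, sorted by lineage.
groups = {
    lineage: rows[["BRCA1", "BRCA2"]]
    for lineage, rows in demo_df.groupby("oncotree_lineage")
}

print(breast_via_mask.equals(groups["Breast"]))
```

Inside a plotting loop this becomes `for lineage, rows in gene_df.groupby('oncotree_lineage'): ax.scatter(rows['BRCA1'], rows['BRCA2'], label=lineage)`. By default `groupby` skips rows whose lineage is missing.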

**What can we learn?**
- Do certain cancer types **cluster together**?
- Are there cancer-type-specific patterns?
- Are outliers from specific cancer types?

**Biological questions:**
- Do breast cancer cell lines show different BRCA1/BRCA2 patterns than myeloid cancers?
- Are there tissue-specific regulatory mechanisms?

### 📊 Exercise 4: Color by Cancer Type - Different Genes

Create a scatter plot of TP53 vs MYC, colored by cancer type.

**Questions to answer:**
- Do you see any cancer-type-specific clusters?
- Which cancer type shows the most variation?
- Are there any clear outliers?

**Hint:** Copy the code above and change the genes!

In [None]:
# Your code here:


---

## 7. Customizing Point Appearance

Let's explore different ways to customize scatter plot appearance:

### Varying Point Size

We can make point size represent a third variable!

In [None]:
# Size points by a third variable (e.g., MYC expression)
fig, ax = plt.subplots(figsize=(10, 7))

# Create sizes proportional to MYC expression
sizes = gene_df['MYC'] * 10  # Scale for visibility

scatter = ax.scatter(gene_df['BRCA1'], gene_df['BRCA2'],
                     alpha=0.5, s=sizes,
                     color='skyblue',
                     edgecolor='black',
                     linewidth=0.5)

ax.set_xlabel('BRCA1 Expression (log2(TPM + 1))', fontsize=12)
ax.set_ylabel('BRCA2 Expression (log2(TPM + 1))', fontsize=12)
ax.set_title('BRCA1 vs BRCA2 (Point size = MYC expression)', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

**What this shows:**
- Larger points = higher MYC expression
- This adds a **third dimension** to our 2D plot!
- We can ask: "Do cell lines with high BRCA1/BRCA2 also have high MYC?"
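
Because `s` is a marker area in points², multiplying expression by a constant makes a value of 0 invisible and lets the visual contrast depend on the gene's range. One possible alternative is to rescale the values linearly into a fixed size range. In the sketch below, `scale_sizes` and its default range of 20 to 200 are illustrative choices, not a matplotlib feature.

```python
import numpy as np

def scale_sizes(values, min_size=20.0, max_size=200.0):
    """Linearly rescale values to marker areas between min_size and max_size."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.nanmin(values), np.nanmax(values)
    if hi == lo:
        # All values are equal: give every point the same mid-range size.
        return np.full(values.shape, (min_size + max_size) / 2)
    return min_size + (values - lo) / (hi - lo) * (max_size - min_size)

expression = np.array([0.0, 2.5, 5.0, 10.0])  # made-up expression values
sizes = scale_sizes(expression)
print(sizes)
```

In the notebook this would be used as `ax.scatter(..., s=scale_sizes(gene_df['MYC']))`.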

### Using Different Markers

We can use different shapes for different categories:

In [None]:
# Different markers for different cancer types
fig, ax = plt.subplots(figsize=(10, 7))

lineages = gene_df['oncotree_lineage'].dropna().unique()
markers = ['o', 's', '^', 'D', 'v']  # circle, square, triangle up, diamond, triangle down
colors = ['red', 'blue', 'green', 'orange', 'purple']

for i, lineage in enumerate(lineages):
    mask = gene_df['oncotree_lineage'] == lineage
    # Cycle through the styles so no lineage is dropped when there are
    # more lineages than markers (zip() would silently skip the extras)
    ax.scatter(gene_df.loc[mask, 'BRCA1'],
               gene_df.loc[mask, 'BRCA2'],
               alpha=0.6, s=80,
               color=colors[i % len(colors)],
               marker=markers[i % len(markers)],
               label=lineage,
               edgecolor='black',
               linewidth=0.5)

ax.set_xlabel('BRCA1 Expression (log2(TPM + 1))', fontsize=12)
ax.set_ylabel('BRCA2 Expression (log2(TPM + 1))', fontsize=12)
ax.set_title('BRCA1 vs BRCA2 (Different Markers by Cancer Type)', fontsize=14, fontweight='bold')
ax.legend(title='Cancer Lineage', loc='best')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

**Why use different markers?**
- Useful when colors are hard to tell apart
- Keeps the plot readable for colorblind viewers
- Still works in black-and-white printing

**Common markers:**
- `'o'`: circle (default)
- `'s'`: square
- `'^'`: triangle up
- `'v'`: triangle down
- `'D'`: diamond
- `'*'`: star
- `'+'`: plus
- `'x'`: x
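
With more categories than markers, it helps to assign styles systematically so that no two categories end up looking identical. The sketch below builds a category-to-style lookup; `build_style_map` is a hypothetical helper and the lineage labels are made up. With the default five markers and five colors it yields 25 distinct marker/color pairs before repeating.

```python
def build_style_map(categories,
                    markers=("o", "s", "^", "D", "v"),
                    colors=("red", "blue", "green", "orange", "purple")):
    """Assign each category a (marker, color) pair."""
    style_map = {}
    for i, category in enumerate(categories):
        marker = markers[i % len(markers)]
        # Shift the color by one on each pass through the markers,
        # so a reused marker is paired with a different color.
        color = colors[(i + i // len(markers)) % len(colors)]
        style_map[category] = (marker, color)
    return style_map

lineages = ["Breast", "Myeloid", "Lung", "Skin", "Bone", "Kidney"]  # hypothetical labels
styles = build_style_map(lineages)
for lineage, (marker, color) in styles.items():
    print(f"{lineage}: marker={marker!r}, color={color!r}")
```

In a plotting loop, look up `marker, color = styles[lineage]` and pass both to `ax.scatter`.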

### 📊 Exercise 5: Customize Your Own Scatter Plot

Create a scatter plot with these features:
1. X-axis: TSC1, Y-axis: TSC2
2. Color by cancer type
3. Point size based on BRCA1 expression
4. Add a legend and grid

**Bonus:** Try different marker shapes for each cancer type!

In [None]:
# Your code here:


---

## 8. Identifying Outliers

Outliers can be biologically interesting! Let's identify cell lines whose BRCA1 expression is unusually high or low:

In [None]:
# Find outliers in BRCA1 expression
brca1_mean = gene_df['BRCA1'].mean()
brca1_std = gene_df['BRCA1'].std()

# Define outliers as points > 2 standard deviations from mean
outlier_threshold = 2
outliers = (gene_df['BRCA1'] > brca1_mean + outlier_threshold * brca1_std) | \
           (gene_df['BRCA1'] < brca1_mean - outlier_threshold * brca1_std)

print(f"Number of outliers: {outliers.sum()}")
print("\nOutlier cell lines:")
print(gene_df[outliers][['cell_line_name', 'oncotree_lineage', 'BRCA1', 'BRCA2']])

### Visualize Outliers with Different Colors

In [None]:
# Plot with outliers highlighted
fig, ax = plt.subplots(figsize=(10, 7))

# Normal points
ax.scatter(gene_df[~outliers]['BRCA1'],
           gene_df[~outliers]['BRCA2'],
           alpha=0.5, s=50,
           color='skyblue',
           label='Normal',
           edgecolor='black',
           linewidth=0.5)

# Outliers
ax.scatter(gene_df[outliers]['BRCA1'],
           gene_df[outliers]['BRCA2'],
           alpha=0.8, s=100,
           color='red',
           label='BRCA1 outliers (>2 SD)',
           edgecolor='darkred',
           linewidth=1.5,
           marker='^')

ax.set_xlabel('BRCA1 Expression (log2(TPM + 1))', fontsize=12)
ax.set_ylabel('BRCA2 Expression (log2(TPM + 1))', fontsize=12)
ax.set_title('BRCA1 vs BRCA2 with Outliers Highlighted', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

**Why identify outliers?**
- May have mutations in these genes
- May represent unusual biological states
- Could be experimental artifacts
- Worth investigating further!
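
The rule above flags cell lines that are extreme in one gene. A different notion of outlier, closer to "points far from the pattern", is a point that sits far from the trend line even though each of its values looks ordinary. One way to find such points, sketched here on synthetic data with one planted point, is to fit a line and apply the same 2 SD rule to the residuals. This swaps in a residual-based technique and is not the method used in the cells above.

```python
import numpy as np

# Synthetic correlated data (NOT real expression values).
rng = np.random.default_rng(3)
x = rng.normal(5, 1, size=100)
y = 0.9 * x + rng.normal(0, 0.2, size=100)

# Plant one point with an ordinary x value but a y value far above the trend.
x = np.append(x, 5.0)
y = np.append(y, 8.0)

# Fit a straight line and measure each point's vertical distance from it.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Flag points whose residual is more than 2 SD from the mean residual.
# (np.std uses ddof=0, while pandas .std() uses ddof=1; the difference is small here.)
z = (residuals - residuals.mean()) / residuals.std()
off_trend = np.abs(z) > 2

# The same point is NOT extreme in x alone.
extreme_in_x = np.abs(x - x.mean()) > 2 * x.std()

print(f"Off-trend points: {off_trend.sum()}")
print(f"Planted point off-trend: {off_trend[-1]}, extreme in x: {extreme_in_x[-1]}")
```

Comparing the two masks shows which cell lines are unusual in a single gene and which break the relationship between the two genes.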

### 📊 Exercise 6: Find Outliers in Your Gene Pair

Choose any two genes and:
1. Calculate outliers based on either gene (>2 SD from mean)
2. Create a scatter plot highlighting outliers
3. Print the names of outlier cell lines
4. Investigate: Do the outliers come from specific cancer types?

**Hint:** Use the code above as a template!

In [None]:
# Your code here:


---

## 9. Key Takeaways

✅ **Scatter plots** show relationships between two numerical variables

✅ **Each point** represents one observation (cell line in our case)

✅ **Patterns to recognize:**
   - Positive correlation: upward slope
   - Negative correlation: downward slope
   - No correlation: random cloud
   - Outliers: points far from pattern
   - Clusters: distinct groups

✅ **Key parameters:**
   - `alpha`: transparency (0-1)
   - `s`: point size
   - `color`: point color
   - `marker`: point shape
   - `edgecolor`: border color

✅ **Advanced techniques:**
   - Color by category to reveal subgroups
   - Vary size by third variable
   - Use different markers for categories
   - Highlight outliers

✅ **Biological insights:**
   - Strong correlations suggest co-regulation
   - Clusters may indicate cancer subtypes
   - Outliers may have mutations or be biologically interesting

✅ **Always ask:** "What biological story does this pattern tell?"

---

## 🎯 Challenge Problems

1. **Correlation Matrix**: Create a grid showing all 6 pairwise scatter plots for 4 genes of your choice (a 2×3 grid fits them exactly)

2. **Publication Figure**: Create a figure with:
   - Top: 2 scatter plots comparing gene pairs
   - Bottom: 1 scatter plot colored by cancer type with outliers highlighted

3. **Discovery Challenge**: Find two genes that show:
   - Strong positive correlation overall
   - BUT different patterns in Breast vs Myeloid cancers when colored separately

4. **Marker Exploration**: Create a scatter plot using size for one variable, color for another, and shape for a third (hint: this shows 5 dimensions of data!)

---

**Next:** In the next notebook, we'll learn about box plots and violin plots for comparing distributions across groups! 📊