# Lecture 4: Correlation Analysis in Cancer Research 🔬

Discover relationships between genes using correlation analysis with pandas.

## What We'll Learn:
1. **Data Preparation** - Load and filter cancer cell line data
2. **Calculate Correlations** - Use pandas methods to find gene relationships
3. **Interpret Results** - Understand what correlation values mean biologically
4. **Practice Exercises** - Apply correlation analysis to find gene dependencies

## Research Context
In cancer research, genes often work together in pathways. When two genes show strong correlation in their dependency scores across cell lines, it suggests they may:
- Function in the same pathway
- Regulate each other
- Be synthetic lethal partners

Let's explore gene correlations in breast cancer cell lines! 🧬

---
## Step 1: Load and Filter Data 📊

First, we'll load the DepMap gene dependency dataset and focus on breast cancer cell lines. We'll specifically examine **ATR** and **ATRIP** - two genes involved in DNA damage response that often work together.

In [1]:
# Load the DepMap dataset
import pandas as pd

# Note: Using the same dataset URL as previous lectures
url = "https://zenodo.org/records/17098555/files/combined_model_crispr_data_filtered.csv?download=1"
df = pd.read_csv(url)
print(f"Dataset shape: {df.shape}")

# Filter for breast cancer cell lines
breast_df = df.loc[df['oncotree_lineage'] == 'Breast']
print(f"Breast cancer lines: {len(breast_df)}")

# Check if ATR and ATRIP exist in our dataset
if 'ATR' in df.columns and 'ATRIP' in df.columns:
    # Extract ATR and ATRIP columns
    atr_data = breast_df['ATR']
    atrip_data = breast_df['ATRIP']
    
    print(f"\nATR data: {len(atr_data)} cell lines")
    print(f"ATRIP data: {len(atrip_data)} cell lines")
    print(f"\nATR effect range: [{atr_data.min():.3f}, {atr_data.max():.3f}]")
    print(f"ATRIP effect range: [{atrip_data.min():.3f}, {atrip_data.max():.3f}]")
else:
    # If ATR/ATRIP not in dataset, use alternative genes for demonstration
    print("\nATR/ATRIP not found in dataset. Using alternative genes for demonstration...")
    gene_cols = [col for col in df.columns if col not in ['model_id', 'cell_line_name', 'stripped_cell_line_name', 
                                                           'oncotree_lineage', 'oncotree_primary_disease', 'oncotree_subtype']]
    # Select first two gene columns for demonstration
    gene1, gene2 = gene_cols[0], gene_cols[1]
    atr_data = breast_df[gene1]
    atrip_data = breast_df[gene2]
    print(f"\nUsing {gene1} and {gene2} for demonstration")
    print(f"{gene1} data: {len(atr_data)} cell lines")
    print(f"{gene2} data: {len(atrip_data)} cell lines")

Dataset shape: (94, 17211)
Breast cancer lines: 53

ATR data: 53 cell lines
ATRIP data: 53 cell lines

ATR effect range: [-2.290, -0.474]
ATRIP effect range: [-1.482, 0.101]


---
## Step 2: Calculate Correlations 📈

Now let's calculate the correlation between genes. **Correlation** measures how two variables move together. In DepMap, a more negative gene effect score means a cell line depends more strongly on that gene, so for dependency scores:
- **Positive correlation (0 to 1)**: Cell lines that depend more on one gene tend to depend more on the other
- **Negative correlation (-1 to 0)**: Cell lines that depend more on one gene tend to depend less on the other
- **No correlation (near 0)**: No linear relationship between the genes

We'll use pandas' built-in `.corr()` method which calculates the Pearson correlation coefficient.
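
To make the definition concrete, here is a minimal sketch that computes Pearson's r by hand and checks it against `.corr()`. The two series are made-up toy values for two hypothetical genes, not DepMap data. The last step also shows a rank-based (Spearman) variant, computed here as Pearson on ranks, which can be a useful cross-check when outlier cell lines dominate.

```python
import numpy as np
import pandas as pd

# Toy dependency scores for two hypothetical genes across six cell lines
# (made-up numbers for illustration, not DepMap values)
gene_a = pd.Series([-1.2, -0.8, -1.5, -0.3, -0.9, -0.1])
gene_b = pd.Series([-0.9, -0.5, -1.1, -0.2, -0.4, 0.1])

# Pearson r = sum of co-deviations / sqrt(product of summed squared deviations)
dev_a = gene_a - gene_a.mean()
dev_b = gene_b - gene_b.mean()
r_manual = (dev_a * dev_b).sum() / np.sqrt((dev_a**2).sum() * (dev_b**2).sum())

# pandas computes the same quantity
r_pandas = gene_a.corr(gene_b)

# Spearman correlation is Pearson correlation applied to the ranks
r_spearman = gene_a.rank().corr(gene_b.rank())

print(f"Manual Pearson:  {r_manual:.3f}")
print(f"pandas Pearson:  {r_pandas:.3f}")
print(f"Spearman (rank): {r_spearman:.3f}")
```

The manual and pandas values should agree up to floating-point error, which is a quick way to convince yourself what `.corr()` is doing.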

In [2]:
# Method 1: Using pandas corr() method (recommended)
correlation = atr_data.corr(atrip_data)
print(f"Gene correlation: {correlation:.3f}")

# Method 2: Using correlation matrix for multiple comparisons
# Select genes of interest - adjust based on available genes
if 'ATR' in df.columns and 'ATRIP' in df.columns:
    genes_of_interest = ['ATR', 'ATRIP', 'CHEK1', 'RPA1'] if all(g in df.columns for g in ['ATR', 'ATRIP', 'CHEK1', 'RPA1']) else ['ATR', 'ATRIP']
else:
    # Use first 4 available genes
    gene_cols = [col for col in df.columns if col not in ['model_id', 'cell_line_name', 'stripped_cell_line_name', 
                                                           'oncotree_lineage', 'oncotree_primary_disease', 'oncotree_subtype']]
    genes_of_interest = gene_cols[:4]

correlation_matrix = breast_df[genes_of_interest].corr()
print("\nCorrelation matrix:")
print(correlation_matrix.round(3))

# Extract specific correlation
gene1_name = genes_of_interest[0]
gene2_name = genes_of_interest[1]
specific_corr = correlation_matrix.loc[gene1_name, gene2_name]
print(f"\n{gene1_name}-{gene2_name} correlation: {specific_corr:.3f}")

# Find the strongest correlation (excluding diagonal)
import numpy as np
# Work on a copy so that masking the diagonal does not alter correlation_matrix
corr_values = correlation_matrix.to_numpy(copy=True)
np.fill_diagonal(corr_values, np.nan)
# Row/column position of the largest absolute off-diagonal value
i, j = np.unravel_index(np.nanargmax(np.abs(corr_values)), corr_values.shape)
strongest_pair = (genes_of_interest[i], genes_of_interest[j])
print(f"\nStrongest correlation: {strongest_pair[0]} & {strongest_pair[1]} (r={corr_values[i, j]:.3f})")

Gene correlation: 0.591

Correlation matrix:
         ATR  ATRIP  CHEK1   RPA1
ATR    1.000  0.591  0.464  0.535
ATRIP  0.591  1.000  0.447  0.611
CHEK1  0.464  0.447  1.000  0.376
RPA1   0.535  0.611  0.376  1.000

ATR-ATRIP correlation: 0.591

Strongest correlation: ATRIP & RPA1 (r=0.611)


---
## Step 3: Interpret Correlation Results 🔍

Understanding correlation values is crucial for biological interpretation. Let's create a function to help interpret our findings and judge whether they might be biologically relevant.

In [None]:
# Interpret correlation strength
def interpret_correlation(r):
    abs_r = abs(r)
    if abs_r >= 0.7:
        return "Strong correlation"
    elif abs_r >= 0.3:
        return "Moderate correlation"
    else:
        return "Weak correlation"

interpretation = interpret_correlation(correlation)
print(f"Interpretation: {interpretation} (r={correlation:.3f})")

# Check if correlation is positive or negative
if correlation > 0:
    print("Positive correlation: genes have similar dependency patterns")
    print("→ When cells depend on one gene, they tend to depend on the other")
else:
    print("Negative correlation: genes have opposite dependency patterns")
    print("→ When cells depend on one gene, they tend NOT to depend on the other")

# Heuristic effect-size check (rule-of-thumb thresholds, not a statistical significance test)
print("\n" + "="*50)
print("Biological Significance Assessment:")
print("="*50)

if abs(correlation) > 0.5:
    print("⚠️  Likely biologically significant")
    print("   - Strong enough to suggest functional relationship")
    print("   - May indicate genes work in same pathway")
elif abs(correlation) > 0.3:
    print("🔍 Potentially biologically relevant")
    print("   - Moderate relationship detected")
    print("   - Worth further investigation")
else:
    print("ℹ️  May not be biologically meaningful")
    print("   - Weak or no relationship")
    print("   - Genes likely function independently")

# Additional context
print("\n💡 Research Implications:")
if correlation > 0.7:
    print("• Consider these genes as potential co-targets for therapy")
    print("• They may be part of the same protein complex or pathway")
elif correlation < -0.5:
    print("• These genes may have compensatory roles")
    print("• Could represent synthetic lethal interactions")
else:
    print("• Further analysis needed to understand relationship")
    print("• Consider examining in specific cancer subtypes")
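
The thresholds in the cell above are rules of thumb. One way to get a rough sense of the uncertainty around r is an approximate confidence interval from the Fisher z-transform. The sketch below uses only the standard library; the function name `pearson_confidence_interval` is hypothetical, and the calculation assumes roughly bivariate-normal data and that all 53 breast lines have scores for both genes.

```python
import math

def pearson_confidence_interval(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson r via the Fisher z-transform."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)              # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)    # approximate standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)   # back-transform to the r scale

# Values reported earlier in this notebook: r = 0.591 across 53 breast lines
# (n = 53 assumes no missing scores for either gene)
ci_low, ci_high = pearson_confidence_interval(0.591, 53)
print(f"r = 0.591, approx. 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

A wide interval is a reminder that, with a few dozen cell lines, a single correlation value should be read as an estimate rather than a precise measurement.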

---
## Visualizing Correlations 📊

Let's create a simple visualization to better understand the correlation patterns.

In [None]:
import matplotlib.pyplot as plt

# Create scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(atr_data, atrip_data, alpha=0.6, s=50)

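# Add trend line (fit only on cell lines with a score for both genes, since np.polyfit cannot handle NaN)
valid = atr_data.notna() & atrip_data.notna()
z = np.polyfit(atr_data[valid], atrip_data[valid], 1)
p = np.poly1d(z)
plt.plot(atr_data[valid], p(atr_data[valid]), "r--", alpha=0.8, label=f'Trend line (r={correlation:.3f})')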

# Labels and title
if 'ATR' in df.columns and 'ATRIP' in df.columns:
    plt.xlabel('ATR Dependency Score')
    plt.ylabel('ATRIP Dependency Score')
    plt.title('ATR vs ATRIP Dependency in Breast Cancer Cell Lines')
else:
    plt.xlabel(f'{gene1_name} Dependency Score')
    plt.ylabel(f'{gene2_name} Dependency Score')
    plt.title(f'{gene1_name} vs {gene2_name} Dependency in Breast Cancer Cell Lines')

plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Interpretation of the plot
print("\n📊 Plot Interpretation:")
if correlation > 0.3:
    print("• Points follow an upward trend - positive correlation")
    print("• Cell lines cluster along the trend line")
elif correlation < -0.3:
    print("• Points follow a downward trend - negative correlation")
    print("• Inverse relationship visible")
else:
    print("• Points are scattered without clear pattern")
    print("• No strong linear relationship")

---
## Advanced: Correlation Heatmap 🗺️

Let's create a heatmap to visualize correlations between multiple genes at once.

In [None]:
import seaborn as sns

# Select more genes for comprehensive analysis
gene_cols = [col for col in df.columns if col not in ['model_id', 'cell_line_name', 'stripped_cell_line_name', 
                                                       'oncotree_lineage', 'oncotree_primary_disease', 'oncotree_subtype']]

# Select first 10 genes for visualization
selected_genes = gene_cols[:10]
correlation_matrix_large = breast_df[selected_genes].corr()

# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix_large, 
            annot=True, 
            fmt='.2f', 
            cmap='coolwarm', 
            center=0,
            vmin=-1, vmax=1,
            square=True,
            linewidths=0.5)
plt.title('Gene Dependency Correlation Heatmap\nBreast Cancer Cell Lines')
plt.tight_layout()
plt.show()

# Find highly correlated gene pairs
print("\n🔗 Highly Correlated Gene Pairs (|r| > 0.5):")
for i in range(len(selected_genes)):
    for j in range(i+1, len(selected_genes)):
        corr_val = correlation_matrix_large.iloc[i, j]
        if abs(corr_val) > 0.5:
            print(f"• {selected_genes[i]} & {selected_genes[j]}: r={corr_val:.3f}")

---
## 📚 Practice Exercises

Now it's your turn! Apply correlation analysis to explore gene relationships in the cancer dataset.

### Exercise 1: Cancer Type Comparison
Compare correlations between the same gene pair in breast vs myeloid cancer. Do the correlations differ?

In [None]:
# Your code here:
# 1. Calculate correlation for two genes in breast cancer
# 2. Calculate the same correlation in myeloid cancer
# 3. Compare and interpret the difference

# breast_corr = ?
# myeloid_corr = ?
# print(f"Breast cancer correlation: {breast_corr:.3f}")
# print(f"Myeloid cancer correlation: {myeloid_corr:.3f}")

### Exercise 2: Find Gene Partners
For a given gene, find its top 3 most positively correlated and top 3 most negatively correlated genes.

In [None]:
# Your code here:
# 1. Select a target gene
# 2. Calculate correlations with all other genes
# 3. Find top positive and negative correlations

# target_gene = gene_cols[0]  # or choose your own
# correlations_with_target = ?
# top_positive = ?
# top_negative = ?
# print(f"Top positive correlations with {target_gene}:")
# print(top_positive)
# print(f"\nTop negative correlations with {target_gene}:")
# print(top_negative)

### Exercise 3: Pathway Analysis
Given a list of genes known to be in the same pathway, calculate their average pairwise correlation. Is it higher than random gene pairs?

In [None]:
# Your code here:
# Hypothetical DNA repair pathway genes (use available genes in your dataset)
# pathway_genes = [g for g in ['BRCA1', 'BRCA2', 'RAD51', 'PALB2'] if g in df.columns]
# if len(pathway_genes) < 2:
#     pathway_genes = gene_cols[:4]  # Use first 4 genes as example

# Calculate average pairwise correlation
# pathway_corr_matrix = ?
# avg_pathway_corr = ?

# Compare with random genes
# random_genes = ?
# random_corr_matrix = ?
# avg_random_corr = ?

# print(f"Average correlation in pathway: {avg_pathway_corr:.3f}")
# print(f"Average correlation in random genes: {avg_random_corr:.3f}")

### Exercise 4: Correlation Stability
Check if correlations are stable by calculating them on different subsets of cell lines.

In [None]:
# Your code here:
# 1. Split breast cancer cell lines into two halves
# 2. Calculate correlation in each half
# 3. Compare the correlations

# first_half = breast_df.iloc[:len(breast_df)//2]
# second_half = breast_df.iloc[len(breast_df)//2:]

# corr_first = ?
# corr_second = ?

# print(f"Correlation in first half: {corr_first:.3f}")
# print(f"Correlation in second half: {corr_second:.3f}")
# print(f"Difference: {abs(corr_first - corr_second):.3f}")

### Exercise 5: Advanced Challenge - Finding Synthetic Lethals
Find gene pairs with strong negative correlation (< -0.5), which might indicate synthetic lethal relationships.

In [None]:
# Your code here:
# 1. Calculate correlation matrix for all genes
# 2. Find pairs with correlation < -0.5
# 3. List potential synthetic lethal pairs

# # Calculate correlations for first 20 genes (for speed)
# genes_subset = gene_cols[:20]
# corr_matrix = breast_df[genes_subset].corr()

# # Find negative correlations
# synthetic_lethal_candidates = []
# for i in range(len(genes_subset)):
#     for j in range(i+1, len(genes_subset)):
#         if corr_matrix.iloc[i, j] < -0.5:
#             synthetic_lethal_candidates.append(???)

# print(f"Potential synthetic lethal pairs: {len(synthetic_lethal_candidates)}")
# for pair in synthetic_lethal_candidates:
#     print(pair)

---
## 🎯 Solutions

Run these cells to see solutions to the exercises.

In [None]:
# Solution 1: Cancer Type Comparison
print("Solution 1: Cancer Type Comparison")
print("="*40)

# Select two genes
gene1, gene2 = gene_cols[0], gene_cols[1]

# Breast cancer correlation
breast_corr = breast_df[gene1].corr(breast_df[gene2])

# Myeloid cancer correlation
myeloid_df = df[df['oncotree_lineage'] == 'Myeloid']
myeloid_corr = myeloid_df[gene1].corr(myeloid_df[gene2])

print(f"{gene1} vs {gene2}:")
print(f"Breast cancer correlation: {breast_corr:.3f}")
print(f"Myeloid cancer correlation: {myeloid_corr:.3f}")
print(f"Difference: {abs(breast_corr - myeloid_corr):.3f}")

if abs(breast_corr - myeloid_corr) > 0.3:
    print("\n⚠️ Large difference between cancer types!")
    print("This suggests cancer-type-specific gene relationships")
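
When more than two lineages are involved, filtering one at a time gets repetitive. A possible alternative is to loop over `groupby` groups and collect one correlation per lineage. The sketch below runs on a small made-up table; `GENE_A`, `GENE_B`, and their values are hypothetical and constructed so that the sign of the correlation flips between lineages.

```python
import pandas as pd

# Toy table standing in for the DepMap frame; gene names and values are made up
toy = pd.DataFrame({
    "oncotree_lineage": ["Breast"] * 4 + ["Myeloid"] * 4,
    "GENE_A": [-1.0, -0.5, -0.2, -0.8, -0.3, -0.9, -0.6, -0.1],
    "GENE_B": [-0.9, -0.4, -0.1, -0.7, -0.8, -0.2, -0.5, -1.0],
})

# One Pearson r per lineage, collected into a Series
per_lineage = pd.Series({
    lineage: group["GENE_A"].corr(group["GENE_B"])
    for lineage, group in toy.groupby("oncotree_lineage")
})
print(per_lineage.round(3))
```

The same pattern should carry over to `df` and real gene columns, as long as each lineage has enough cell lines for the correlation to be meaningful.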

In [None]:
# Solution 2: Find Gene Partners
print("Solution 2: Find Gene Partners")
print("="*40)

target_gene = gene_cols[0]
print(f"Target gene: {target_gene}\n")

# Calculate correlations with the other genes (limited to the first 30 gene columns for speed)
correlations_with_target = breast_df[gene_cols[:30]].corr()[target_gene].drop(target_gene)

# Top positive correlations
top_positive = correlations_with_target.nlargest(3)
print("Top 3 positive correlations:")
for gene, corr in top_positive.items():
    print(f"  {gene}: {corr:.3f}")

# Top negative correlations
top_negative = correlations_with_target.nsmallest(3)
print("\nTop 3 negative correlations:")
for gene, corr in top_negative.items():
    print(f"  {gene}: {corr:.3f}")
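
Building a full correlation matrix just to read off one column does more work than needed. A lighter option is `DataFrame.corrwith`, which correlates every column with a single Series. The sketch below uses synthetic scores with hypothetical gene names, where `GENE_1` is built to track the target and `GENE_2` to oppose it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic scores: 50 "cell lines" x 6 hypothetical genes
scores = pd.DataFrame(rng.normal(size=(50, 6)),
                      columns=[f"GENE_{i}" for i in range(6)])
# Make GENE_1 track GENE_0 and GENE_2 oppose it, so there is something to find
scores["GENE_1"] = scores["GENE_0"] + 0.3 * rng.normal(size=50)
scores["GENE_2"] = -scores["GENE_0"] + 0.3 * rng.normal(size=50)

target = "GENE_0"
# Correlate every other column with the target, without the full matrix
partners = scores.drop(columns=target).corrwith(scores[target])

print("Most positive partners:")
print(partners.nlargest(2).round(3))
print("Most negative partners:")
print(partners.nsmallest(2).round(3))
```

With thousands of gene columns this avoids computing all gene-by-gene pairs, which is what makes scanning the whole dataset for one target practical.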

In [None]:
# Solution 3: Pathway Analysis
print("Solution 3: Pathway Analysis")
print("="*40)

# Use first 4 genes as "pathway" genes
pathway_genes = gene_cols[:4]
print(f"Analyzing genes: {pathway_genes}\n")

# Calculate average pairwise correlation for pathway
pathway_corr_matrix = breast_df[pathway_genes].corr()
n = len(pathway_genes)
pathway_corrs = []
for i in range(n):
    for j in range(i+1, n):
        pathway_corrs.append(pathway_corr_matrix.iloc[i, j])
avg_pathway_corr = np.mean(pathway_corrs)

# Random genes for comparison
import random
random.seed(42)
random_genes = random.sample(gene_cols[10:30], 4)
random_corr_matrix = breast_df[random_genes].corr()
random_corrs = []
for i in range(n):
    for j in range(i+1, n):
        random_corrs.append(random_corr_matrix.iloc[i, j])
avg_random_corr = np.mean(random_corrs)

print(f"Average correlation in selected genes: {avg_pathway_corr:.3f}")
print(f"Average correlation in random genes: {avg_random_corr:.3f}")

if avg_pathway_corr > avg_random_corr + 0.1:
    print("\n✅ Selected genes show higher correlation than random!")
else:
    print("\n❌ No significant difference from random genes")
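
A single random gene set is a noisy baseline. One possible refinement is to draw many random sets and ask how often they match or exceed the pathway's average correlation. The sketch below does this on synthetic data; the helper `mean_pairwise_corr`, the gene names, and the shared "pathway signal" are all hypothetical.

```python
import numpy as np
import pandas as pd

def mean_pairwise_corr(frame):
    """Mean of the upper-triangle (off-diagonal) Pearson correlations."""
    corr = frame.corr().to_numpy()
    upper = np.triu_indices_from(corr, k=1)   # each pair once, no diagonal
    return float(np.nanmean(corr[upper]))

rng = np.random.default_rng(1)
n_lines = 60
shared = rng.normal(size=n_lines)             # hypothetical shared pathway signal
data = pd.DataFrame(rng.normal(size=(n_lines, 30)),
                    columns=[f"GENE_{i}" for i in range(30)])
pathway = ["GENE_0", "GENE_1", "GENE_2", "GENE_3"]
for g in pathway:
    data[g] = shared + 0.5 * rng.normal(size=n_lines)

observed = mean_pairwise_corr(data[pathway])

# Null distribution: many random 4-gene sets drawn from the non-pathway genes
background = [c for c in data.columns if c not in pathway]
null = np.array([
    mean_pairwise_corr(data[list(rng.choice(background, size=4, replace=False))])
    for _ in range(200)
])
frac_at_least = float((null >= observed).mean())

print(f"Observed mean r in pathway: {observed:.3f}")
print(f"Fraction of random sets >= observed: {frac_at_least:.3f}")
```

The fraction of random sets at or above the observed value acts as a rough empirical p-value, which is easier to defend than a fixed `+ 0.1` margin.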

In [None]:
# Solution 4: Correlation Stability
print("Solution 4: Correlation Stability")
print("="*40)

gene1, gene2 = gene_cols[0], gene_cols[1]
print(f"Testing stability of {gene1} vs {gene2} correlation\n")

# Split data
first_half = breast_df.iloc[:len(breast_df)//2]
second_half = breast_df.iloc[len(breast_df)//2:]

# Calculate correlations
corr_first = first_half[gene1].corr(first_half[gene2])
corr_second = second_half[gene1].corr(second_half[gene2])
corr_full = breast_df[gene1].corr(breast_df[gene2])

print(f"Full dataset correlation: {corr_full:.3f}")
print(f"First half correlation: {corr_first:.3f}")
print(f"Second half correlation: {corr_second:.3f}")
print(f"Difference: {abs(corr_first - corr_second):.3f}")

if abs(corr_first - corr_second) < 0.2:
    print("\n✅ Correlation is stable across subsets")
else:
    print("\n⚠️ Correlation varies significantly between subsets")
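
Splitting into two halves gives only one comparison, and the result depends on row order. A bootstrap is a common alternative: resample cell lines with replacement many times and look at the spread of r. The sketch below uses synthetic scores for two hypothetical genes with the same number of lines as the breast subset; nothing here is DepMap data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_lines = 53   # same size as the breast subset in this notebook
# Synthetic, correlated scores for two hypothetical genes
x = rng.normal(size=n_lines)
pair = pd.DataFrame({"GENE_A": x,
                     "GENE_B": 0.6 * x + 0.8 * rng.normal(size=n_lines)})
r_full = pair["GENE_A"].corr(pair["GENE_B"])

# Resample cell lines with replacement and recompute r each time
boot = np.empty(1000)
for b in range(boot.size):
    idx = rng.integers(0, n_lines, size=n_lines)
    sample = pair.iloc[idx]
    boot[b] = sample["GENE_A"].corr(sample["GENE_B"])

# Percentile interval over the bootstrap replicates
low, high = np.percentile(boot, [2.5, 97.5])
print(f"r on all lines: {r_full:.3f}")
print(f"Bootstrap 95% interval: [{low:.3f}, {high:.3f}]")
```

A narrow interval suggests the correlation does not hinge on a handful of cell lines; a wide one suggests caution before interpreting it.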

In [None]:
# Solution 5: Finding Synthetic Lethals
print("Solution 5: Finding Synthetic Lethal Candidates")
print("="*40)

# Calculate correlations for first 20 genes
genes_subset = gene_cols[:20]
corr_matrix = breast_df[genes_subset].corr()

# Find strong negative correlations
synthetic_lethal_candidates = []
for i in range(len(genes_subset)):
    for j in range(i+1, len(genes_subset)):
        corr_val = corr_matrix.iloc[i, j]
        if corr_val < -0.3:  # Using -0.3 as threshold for demonstration
            synthetic_lethal_candidates.append({
                'gene1': genes_subset[i],
                'gene2': genes_subset[j],
                'correlation': corr_val
            })

# Sort by correlation strength
synthetic_lethal_candidates.sort(key=lambda x: x['correlation'])

print(f"Found {len(synthetic_lethal_candidates)} potential synthetic lethal pairs\n")

if synthetic_lethal_candidates:
    print("Top synthetic lethal candidates:")
    for pair in synthetic_lethal_candidates[:5]:
        print(f"  {pair['gene1']} & {pair['gene2']}: r={pair['correlation']:.3f}")
else:
    print("No strong negative correlations found in this subset")
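
The nested loops work for 20 genes but become slow as the gene list grows. A vectorized sketch of the same search is shown below: mask the upper triangle of the correlation matrix, reshape it to one row per pair, and filter. The data are synthetic and the gene names hypothetical, with `GENE_B` built to oppose `GENE_A`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
base = rng.normal(size=80)
toy = pd.DataFrame({
    "GENE_A": base,
    "GENE_B": -base + 0.3 * rng.normal(size=80),   # built to oppose GENE_A
    "GENE_C": rng.normal(size=80),
    "GENE_D": rng.normal(size=80),
})

corr = toy.corr()
# Keep only the upper triangle so each pair appears once, then go long-form
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()
pairs.index.names = ["gene1", "gene2"]

# Strongly negative pairs, most negative first
candidates = pairs[pairs < -0.5].sort_values()
print(candidates.round(3))
```

As in the loop version, any pair this returns is only a candidate; a negative correlation in dependency scores is a starting point for follow-up, not evidence of synthetic lethality on its own.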

---
## 🎊 Summary

You've learned how to perform correlation analysis on cancer gene dependency data! Here's what we covered:

✅ **Data Loading & Filtering**: Prepare cancer-specific datasets  
✅ **Correlation Calculation**: Use `.corr()` method for single pairs and matrices  
✅ **Result Interpretation**: Understand biological meaning of correlations  
✅ **Visualization**: Create scatter plots and heatmaps  
✅ **Advanced Analysis**: Find gene partners and synthetic lethal candidates  

**Key Takeaways**:
- Correlations can point to functional relationships between genes, but they do not prove them
- Positive correlations suggest genes work together
- Negative correlations might indicate synthetic lethality
- Always validate findings with biological knowledge

**Next Steps**:
- Explore correlations in different cancer types
- Validate findings with pathway databases
- Use correlation analysis to identify drug targets

Keep discovering gene relationships! 🧬🔬