# Lecture 3: DepMap CRISPR Data Analysis

🗺️ **Our Analysis Roadmap:**
1. **Filter Breast Cancer Data** - Extract only breast cancer cell lines from the dataset
2. **Calculate Mean Gene Effects** - Average gene scores across all breast cancer cell lines  
3. **Sort & Select Top 10** - Find the most negative scores (most essential genes)
4. **Repeat for Myeloid Cancer** - Same process: filter → mean → sort → top 10
5. **Compare & Visualize** - Compare top 10 lists - what's different between cancer types?

🚀 **Advanced Option:** Later we'll learn `df.groupby()` to analyze both cancer types at once!

## Load the Data

First, let's load the same DepMap CRISPR dataset we used in the pandas introduction notebook.

In [2]:
import pandas as pd

# Load the DepMap CRISPR dataset
url = "https://zenodo.org/records/17098555/files/combined_model_crispr_data_filtered.csv?download=1"
df = pd.read_csv(url)

print(f"Dataset shape: {df.shape}")
print(f"Cancer types available: {df['oncotree_lineage'].unique()}")

Dataset shape: (94, 17211)
Cancer types available: ['Myeloid' 'Breast']


## Step 1: Filter Breast Cancer Data 🎯

Extract only breast cancer cell lines from the dataset. We'll use boolean indexing to filter rows where `oncotree_lineage` equals "Breast".

In [3]:
# Filter for breast cancer cell lines
breast_cancer = df[df['oncotree_lineage'] == 'Breast']

print(f"Total cell lines: {len(df)}")
print(f"Breast cancer cell lines: {len(breast_cancer)}")
print("\nBreast cancer cell lines:")
print(breast_cancer[['cell_line_name', 'oncotree_primary_disease']].head())

Total cell lines: 94
Breast cancer cell lines: 53

Breast cancer cell lines:
  cell_line_name   oncotree_primary_disease
2        SK-BR-3  Invasive Breast Carcinoma
3           MCF7  Invasive Breast Carcinoma
4          KPL-1  Invasive Breast Carcinoma
8        ZR-75-1  Invasive Breast Carcinoma
9        HCC1187  Invasive Breast Carcinoma
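
To make the boolean indexing step concrete, here is a minimal sketch on a made-up four-row table. The cell line names, the `GENE1` column, and all scores are hypothetical stand-ins, not values from the DepMap dataset; only the `oncotree_lineage` column name is taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the DepMap table (all values are made up for illustration)
toy = pd.DataFrame({
    "cell_line_name": ["LINE_A", "LINE_B", "LINE_C", "LINE_D"],
    "oncotree_lineage": ["Breast", "Myeloid", "Breast", "Myeloid"],
    "GENE1": [-1.0, -0.5, -2.0, 0.1],
})

# The comparison produces one True/False value per row
mask = toy["oncotree_lineage"] == "Breast"
print(mask.tolist())  # [True, False, True, False]

# Indexing with the mask keeps only the rows where it is True
toy_breast = toy[mask]
print(toy_breast["cell_line_name"].tolist())  # ['LINE_A', 'LINE_C']
```

The filtered frame keeps the original row labels (0 and 2 here), which is why the index in the output above skips numbers.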


## Step 2: Calculate Mean Gene Effects 📊

Average gene scores across all breast cancer cell lines. We need to:
1. Select only the gene columns (numeric data)
2. Calculate the mean for each gene across all breast cancer cell lines

In [4]:
# Get the gene columns (all columns except the first 6 metadata columns)
gene_columns = df.columns[6:]  # Skip metadata columns
print(f"Number of genes: {len(gene_columns)}")
print(f"First few gene names: {list(gene_columns[:5])}")

# Calculate mean gene effects for breast cancer
breast_gene_means = breast_cancer[gene_columns].mean()

print(f"\nMean gene effects calculated for {len(breast_gene_means)} genes")
print(f"Example - A1BG mean effect: {breast_gene_means['A1BG']:.4f}")

Number of genes: 17205
First few gene names: ['A1BG', 'A1CF', 'A2M', 'A2ML1', 'A3GALT2']

Mean gene effects calculated for 17205 genes
Example - A1BG mean effect: -0.0397
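
The sketch below shows what `.mean()` does on a toy table: by default it works down the rows, returning one value per column (gene), and it skips missing values. It also shows `select_dtypes` as one possible alternative to slicing by position. That alternative assumes every metadata column is non-numeric, which has not been checked for the real dataset. All names and values are hypothetical.

```python
import pandas as pd

# Toy table: two metadata columns followed by two gene columns (values made up)
toy = pd.DataFrame({
    "cell_line_name": ["LINE_A", "LINE_B", "LINE_C"],
    "oncotree_lineage": ["Breast", "Breast", "Breast"],
    "GENE1": [-1.0, -2.0, -3.0],
    "GENE2": [0.5, None, 1.5],  # one missing score, stored as NaN
})

# Alternative to positional slicing: keep every numeric column.
# Assumption: no metadata column is numeric.
toy_gene_columns = toy.select_dtypes(include="number").columns

# .mean() defaults to axis=0: one mean per gene, taken across cell lines.
# NaN values are skipped by default (skipna=True).
toy_means = toy[toy_gene_columns].mean()
print(toy_means)
```

Here `GENE2` is averaged over the two available scores only, so a gene with many missing values still gets a mean, just from fewer cell lines.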


## Step 3: Sort & Select Top 10 🔝

Find the most negative scores (most essential genes). In CRISPR screens, more negative scores indicate genes that are more essential for cell survival.

In [5]:
# Sort genes by their mean effects (ascending = most negative first)
breast_top10 = breast_gene_means.sort_values(ascending=True).head(10)

print("Top 10 most essential genes in breast cancer:")
print("=" * 45)
for i, (gene, score) in enumerate(breast_top10.items(), 1):
    print(f"{i:2d}. {gene:<8} {score:8.4f}")

# Let's also look at some basic statistics
print("\nStatistics for breast cancer gene effects:")
print(f"Most essential (lowest): {breast_gene_means.min():.4f}")
print(f"Least essential (highest): {breast_gene_means.max():.4f}")
print(f"Mean effect: {breast_gene_means.mean():.4f}")

Top 10 most essential genes in breast cancer:
 1. RAN       -4.1840
 2. HSPE1     -3.4315
 3. SNRPF     -3.1414
 4. SMU1      -3.0940
 5. PSMA6     -3.0273
 6. SNRPA1    -2.9927
 7. RRM1      -2.9468
 8. PCNA      -2.9238
 9. PLK1      -2.9126
10. SF3B5     -2.9118

Statistics for breast cancer gene effects:
Most essential (lowest): -4.1840
Least essential (highest): 0.3305
Mean effect: -0.1403
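
As a side note, pandas also offers `Series.nsmallest()`, which returns the lowest values directly. The sketch below uses a made-up Series of mean scores (hypothetical gene names and values) to show that, for distinct values, it gives the same result as sorting and taking the head.

```python
import pandas as pd

# Made-up mean gene effects for four hypothetical genes
toy_means = pd.Series({"GENE1": -0.2, "GENE2": -3.1, "GENE3": 0.4, "GENE4": -1.5})

# Approach used in this notebook: sort ascending, then take the first n
via_sort = toy_means.sort_values(ascending=True).head(2)

# Alternative: ask for the n smallest values directly
via_nsmallest = toy_means.nsmallest(2)

print(via_nsmallest.index.tolist())  # ['GENE2', 'GENE4']
print(via_sort.equals(via_nsmallest))  # True
```

Either form works for this analysis; `sort_values()` is handy when the full ranking is needed later, while `nsmallest()` states the intent more directly.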


## Step 4: Repeat for Myeloid Cancer 🔄

Same process: filter → mean → sort → top 10

In [6]:
# Filter for myeloid cancer cell lines
myeloid_cancer = df[df['oncotree_lineage'] == 'Myeloid']

print(f"Myeloid cancer cell lines: {len(myeloid_cancer)}")
print("\nMyeloid cancer cell lines:")
print(myeloid_cancer[['cell_line_name', 'oncotree_primary_disease']].head())

Myeloid cancer cell lines: 41

Myeloid cancer cell lines:
  cell_line_name      oncotree_primary_disease
0            HEL        Acute Myeloid Leukemia
1     HEL 92.1.7        Acute Myeloid Leukemia
5         MV4-11        Acute Myeloid Leukemia
6          KU812  Myeloproliferative Neoplasms
7           NCO2  Myeloproliferative Neoplasms


In [7]:
# Calculate mean gene effects for myeloid cancer
myeloid_gene_means = myeloid_cancer[gene_columns].mean()

# Sort and get top 10 most essential genes
myeloid_top10 = myeloid_gene_means.sort_values(ascending=True).head(10)

print("Top 10 most essential genes in myeloid cancer:")
print("=" * 45)
for i, (gene, score) in enumerate(myeloid_top10.items(), 1):
    print(f"{i:2d}. {gene:<8} {score:8.4f}")

Top 10 most essential genes in myeloid cancer:
 1. RAN       -3.9426
 2. HSPE1     -3.5099
 3. RPL17     -3.2110
 4. RPS8      -2.9228
 5. RPS29     -2.8929
 6. RRM1      -2.8591
 7. PLK1      -2.8287
 8. RPS19     -2.7984
 9. UBL5      -2.7908
10. PSMA6     -2.7539
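
Because the breast and myeloid analyses repeat the same three steps, they could be wrapped in a small function. The sketch below is one possible way to do that; `top_essential_genes` is a hypothetical helper, not part of pandas or the original notebook, and the toy data is made up.

```python
import pandas as pd

def top_essential_genes(data, lineage, gene_cols, n=10):
    """Filter to one lineage, average each gene, return the n most negative means."""
    subset = data[data["oncotree_lineage"] == lineage]  # filter
    means = subset[gene_cols].mean()                    # mean per gene
    return means.sort_values(ascending=True).head(n)    # sort, then top n

# Toy data with two lineages and three hypothetical genes
toy = pd.DataFrame({
    "oncotree_lineage": ["Breast", "Breast", "Myeloid", "Myeloid"],
    "GENE1": [-2.0, -1.0, -0.5, -0.5],
    "GENE2": [-0.5, -0.5, -3.0, -2.0],
    "GENE3": [0.0, 1.0, 0.0, 0.5],
})
toy_genes = ["GENE1", "GENE2", "GENE3"]

for lineage in ["Breast", "Myeloid"]:
    top = top_essential_genes(toy, lineage, toy_genes, n=2)
    print(lineage, top.index.tolist())
```

With the notebook's variables, the equivalent call would be something like `top_essential_genes(df, 'Myeloid', gene_columns)`.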


## Step 5: Compare & Visualize 📈

Compare top 10 lists - what's different between cancer types?

In [8]:
# Compare the two lists
breast_genes = set(breast_top10.index)
myeloid_genes = set(myeloid_top10.index)

# Find overlapping and unique genes
overlap = breast_genes.intersection(myeloid_genes)
breast_only = breast_genes - myeloid_genes
myeloid_only = myeloid_genes - breast_genes

print("📊 COMPARISON RESULTS:")
print("=" * 50)
print(f"Genes in both top 10 lists: {len(overlap)}")
print(f"Overlapping genes: {list(overlap)}")
print(f"\nUnique to breast cancer: {len(breast_only)}")
print(f"Breast-specific genes: {list(breast_only)}")
print(f"\nUnique to myeloid cancer: {len(myeloid_only)}")
print(f"Myeloid-specific genes: {list(myeloid_only)}")

📊 COMPARISON RESULTS:
Genes in both top 10 lists: 5
Overlapping genes: ['RRM1', 'PSMA6', 'HSPE1', 'RAN', 'PLK1']

Unique to breast cancer: 5
Breast-specific genes: ['SMU1', 'SNRPA1', 'PCNA', 'SF3B5', 'SNRPF']

Unique to myeloid cancer: 5
Myeloid-specific genes: ['RPL17', 'RPS8', 'RPS29', 'RPS19', 'UBL5']
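
A set comparison only says whether a gene made the cutoff. One way to see how far apart the two cancer types really are is to put both mean scores side by side for every gene in either list. The sketch below does this with made-up means and a top 2 cutoff; all gene names, values, and variable names are hypothetical.

```python
import pandas as pd

# Made-up mean gene effects for two lineages
toy_breast_means = pd.Series({"GENE1": -3.0, "GENE2": -2.5, "GENE3": -2.0, "GENE4": -0.1})
toy_myeloid_means = pd.Series({"GENE1": -2.8, "GENE2": -1.0, "GENE3": -2.4, "GENE4": -0.2})

toy_breast_top = toy_breast_means.nsmallest(2)    # GENE1, GENE2
toy_myeloid_top = toy_myeloid_means.nsmallest(2)  # GENE1, GENE3

# Every gene that appears in at least one top list
union = sorted(set(toy_breast_top.index) | set(toy_myeloid_top.index))

# Look up both means for each of those genes and compute the gap
side_by_side = pd.DataFrame({
    "breast_mean": toy_breast_means.loc[union],
    "myeloid_mean": toy_myeloid_means.loc[union],
})
side_by_side["difference"] = side_by_side["breast_mean"] - side_by_side["myeloid_mean"]
print(side_by_side)
```

In this toy example `GENE3` is outside the breast top 2 yet still has a strongly negative breast score, so "unique to one list" does not necessarily mean "not essential in the other cancer type".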


## 🚀 Advanced Option: Using GroupBy

Here's how we could do the same analysis more efficiently using pandas `groupby()`:

In [9]:
# Advanced approach: analyze both cancer types at once
cancer_comparison = df[df['oncotree_lineage'].isin(['Breast', 'Myeloid'])]

# Group by cancer type and calculate means for all genes
grouped_means = cancer_comparison.groupby('oncotree_lineage')[gene_columns].mean()

print("Gene effect means by cancer type:")
print(grouped_means.iloc[:, :5])  # Show first 5 genes

# Get top 5 for each cancer type
print("\nTop 5 most essential genes by cancer type:")
for cancer_type in ['Breast', 'Myeloid']:
    top5 = grouped_means.loc[cancer_type].sort_values().head(5)
    print(f"\n{cancer_type}:")
    for gene, score in top5.items():
        print(f"  {gene}: {score:.4f}")

Gene effect means by cancer type:
                      A1BG      A1CF       A2M     A2ML1   A3GALT2
oncotree_lineage                                                  
Breast           -0.039728 -0.122851  0.021073  0.076419 -0.119875
Myeloid          -0.074593 -0.056000  0.041108  0.050162 -0.107921

Top 5 most essential genes by cancer type:

Breast:
  RAN: -4.1840
  HSPE1: -3.4315
  SNRPF: -3.1414
  SMU1: -3.0940
  PSMA6: -3.0273

Myeloid:
  RAN: -3.9426
  HSPE1: -3.5099
  RPL17: -3.2110
  RPS8: -2.9228
  RPS29: -2.8929
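
The grouped result has one row per cancer type and one column per gene. Transposing it gives one row per gene, which can be easier to rank. The sketch below shows this on made-up data; the gene names and scores are hypothetical, and this is just one possible follow-up, not the only way to use `groupby()`.

```python
import pandas as pd

# Toy data with two lineages and three hypothetical genes
toy = pd.DataFrame({
    "oncotree_lineage": ["Breast", "Breast", "Myeloid", "Myeloid"],
    "GENE1": [-2.0, -1.0, -0.5, -0.5],
    "GENE2": [-0.5, -0.5, -3.0, -2.0],
    "GENE3": [0.0, 1.0, 0.0, 0.5],
})

# One row per lineage, one column per gene
toy_grouped = toy.groupby("oncotree_lineage")[["GENE1", "GENE2", "GENE3"]].mean()

# Transpose so that genes are rows and lineages are columns
by_gene = toy_grouped.T

# Rank each lineage column and keep the names of the 2 lowest-scoring genes
toy_top = {
    lineage: by_gene[lineage].nsmallest(2).index.tolist()
    for lineage in by_gene.columns
}
print(toy_top)  # {'Breast': ['GENE1', 'GENE2'], 'Myeloid': ['GENE2', 'GENE1']}
```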


## 🎯 Key Takeaways

Through this analysis, we learned:

1. **Data Filtering**: How to extract subsets of data based on conditions
2. **Aggregation**: Computing summary statistics (mean) across multiple observations
3. **Sorting**: Ranking data to find extremes (most essential genes)
4. **Comparison**: Analyzing differences between groups

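**Biological Insights**:
- Some genes are essential across cancer types: RAN, HSPE1, PSMA6, RRM1, and PLK1 appear in both top 10 lists (potential broad targets)
- Other genes appear in only one list, such as the splicing genes SNRPF and SF3B5 for breast or the ribosomal protein genes RPL17 and RPS8 for myeloid. These are candidates for cancer-specific dependencies (potential precision medicine targets), but a top 10 cutoff alone does not show that a gene is dispensable in the other cancer type
- CRISPR screens help identify genetic vulnerabilities in cancer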
