# 📊 Biological Data Analysis Toolkit

## Real Pandas Tools for Lab Data Management

Welcome to your **pandas practice toolkit**! These tools show how to manage and analyze real biological datasets with pandas, Python's most widely used library for tabular data analysis.

**What you'll build:**
- 🧬 Gene dependency data analyzer
- 🔬 Quality control dashboard
- 💊 Drug screening hit selector
- 📈 Patient cohort analyzer
- 🧪 Experiment result comparator
- 🎯 Multi-gene vulnerability profiler

**Skills used:** DataFrame filtering, sorting, grouping, and aggregation - everything from Lecture 3!

**Real data:** We'll work with actual cancer cell line data from the DepMap project (Zenodo repository)
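
As a quick warm-up, the four skills listed above can be sketched on a tiny made-up table. The column `GENE_X` and all values are invented for illustration and are not taken from DepMap; only the column names `cell_line_name` and `oncotree_lineage` mirror the real dataset.

```python
import pandas as pd

# Toy data: made-up gene-effect scores, NOT values from DepMap
toy_df = pd.DataFrame({
    "cell_line_name": ["A", "B", "C", "D", "E", "F"],
    "oncotree_lineage": ["Lung", "Lung", "Breast", "Breast", "Skin", "Skin"],
    "GENE_X": [-1.2, -0.1, -0.8, -0.6, 0.2, -0.4],
})

# Filtering: keep rows where the effect is below a threshold
essential = toy_df[toy_df["GENE_X"] < -0.5]

# Sorting: most negative (most dependent) first
ranked = toy_df.sort_values("GENE_X")

# Grouping + aggregation: per-lineage mean and row count
summary = toy_df.groupby("oncotree_lineage")["GENE_X"].agg(["mean", "count"])

print(essential["cell_line_name"].tolist())
print(ranked["cell_line_name"].tolist())
print(summary)
```

Every tool below is built from combinations of these same operations.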

## 📥 Setup and Data Loading

First, let's load pandas and download our real biological dataset!

In [1]:
import pandas as pd
import numpy as np

print("✓ Libraries loaded successfully!")
print(f"Pandas version: {pd.__version__}")

✓ Libraries loaded successfully!
Pandas version: 2.3.2


In [None]:
# Load real cancer cell line CRISPR screening data
print("🔽 Downloading DepMap CRISPR dataset...")
# Full dataset (all cell lines): a much slower download, but far more data to explore
url_all_cell_lines = "https://zenodo.org/records/17161166/files/combined_model_crispr_data.csv?download=1"

url = "https://zenodo.org/records/17098555/files/combined_model_crispr_data_filtered.csv?download=1"
crispr_df = pd.read_csv(url)

print(f"✓ Dataset loaded: {crispr_df.shape[0]} cell lines, {crispr_df.shape[1]} columns")
print(f"✓ Contains data for {crispr_df.shape[1] - 6} genes")
print(f"\nCancer types in dataset: {crispr_df['oncotree_lineage'].nunique()}")
print(crispr_df['oncotree_lineage'].value_counts())

print("\nReady for analysis! 🧬")

🔽 Downloading DepMap CRISPR dataset...
✓ Dataset loaded: 1165 cell lines, 17211 columns
✓ Contains data for 17205 genes

Cancer types in dataset: 30
oncotree_lineage
Lung                         124
Lymphoid                      92
CNS/Brain                     89
Head and Neck                 75
Skin                          73
Esophagus/Stomach             69
Bowel                         62
Ovary/Fallopian Tube          59
Breast                        53
Pancreas                      47
Bone                          45
Peripheral Nervous System     44
Soft Tissue                   43
Myeloid                       41
Kidney                        35
Biliary Tract                 34
Bladder/Urinary Tract         34
Uterus                        34
Liver                         24
Pleura                        21
Cervix                        18
Eye                           15
Thyroid                       11
Prostate                      10
Ampulla of Vater               5
Testis   

## 🧬 Tool 1: Gene Dependency Data Analyzer

**The Problem:** You have gene knockout data for hundreds of cell lines and thousands of genes. You need to quickly identify which genes are essential in specific cancer types.

**Real Scenario:** You're researching breast cancer and need to find genes that are specifically essential in breast cancer cell lines but not in other cancer types.

**Skills:** Filtering, sorting, grouping

**Difficulty:** ⭐⭐ Intermediate
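
Before the full tool, here is a minimal sketch of the core idea: the share of cell lines in which a gene falls below an essentiality threshold, computed per cancer type. The gene name `GENE_X`, the values, and the threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data: made-up gene-effect scores, NOT values from DepMap
toy_df = pd.DataFrame({
    "oncotree_lineage": ["Breast", "Breast", "Breast", "Lung", "Lung", "Lung"],
    "GENE_X": [-0.9, -0.7, -0.1, -0.2, 0.1, -0.6],
})
threshold = -0.5  # illustrative cutoff

# A boolean Series averages to the fraction of True values,
# so the group-wise mean gives the fraction of essential lines per lineage
is_essential = toy_df["GENE_X"] < threshold
pct_by_lineage = is_essential.groupby(toy_df["oncotree_lineage"]).mean() * 100

print(pct_by_lineage.round(1))
```

Comparing these percentages between lineages is one simple way to spot a gene that matters in one cancer type but not in another.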

In [5]:
# Gene Dependency Data Analyzer
# =============================

def analyze_gene_essentiality(df, gene_name, cancer_type=None, threshold=-0.5):
    """
    Analyze gene essentiality across cell lines
    
    Parameters:
    - df: DataFrame with CRISPR data
    - gene_name: Name of gene to analyze
    - cancer_type: Specific cancer type to focus on (None = all)
    - threshold: Effect threshold for "essential" classification
    """
    
    # Check if gene exists
    if gene_name not in df.columns:
        print(f"❌ Gene '{gene_name}' not found in dataset")
        return None
    
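    # Filter by cancer type if specified
    if cancer_type:
        analysis_df = df[df['oncotree_lineage'] == cancer_type].copy()
        title = f"{cancer_type} Cancer"
    else:
        analysis_df = df.copy()
        title = "All Cancer Types"

    # Guard against an unknown cancer type, which would otherwise cause a division by zero below
    if analysis_df.empty:
        print(f"❌ No cell lines found for cancer type '{cancer_type}'")
        return None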
    
    # Calculate statistics
    gene_data = analysis_df[gene_name]
    
    # Identify essential cell lines
    essential_lines = analysis_df[analysis_df[gene_name] < threshold]
    
    # Sort by gene effect
    top_sensitive = analysis_df.nsmallest(10, gene_name)
    
    # Create results dictionary
    results = {
        'gene': gene_name,
        'cancer_type': title,
        'n_cell_lines': len(analysis_df),
        'mean_effect': gene_data.mean(),
        'std_effect': gene_data.std(),
        'min_effect': gene_data.min(),
        'max_effect': gene_data.max(),
        'n_essential': len(essential_lines),
        'pct_essential': (len(essential_lines) / len(analysis_df)) * 100,
        'top_sensitive': top_sensitive
    }
    
    return results

def display_gene_analysis(results):
    """Display gene analysis results in a professional format"""
    
    if results is None:
        return
    
    print("🧬 GENE ESSENTIALITY ANALYSIS")
    print("=" * 60)
    print(f"Gene: {results['gene']}")
    print(f"Cancer Type: {results['cancer_type']}")
    print(f"Cell Lines Analyzed: {results['n_cell_lines']}")
    
    print("\n📊 EFFECT STATISTICS:")
    print(f"  Mean effect: {results['mean_effect']:>8.4f}")
    print(f"  Std deviation: {results['std_effect']:>8.4f}")
    print(f"  Range: {results['min_effect']:.4f} to {results['max_effect']:.4f}")
    
    print("\n🎯 ESSENTIALITY SUMMARY:")
    print(f"  Essential cell lines: {results['n_essential']} ({results['pct_essential']:.1f}%)")
    
    # Interpret results
    if results['pct_essential'] > 50:
        status = "✅ Highly essential gene (>50% of lines)"
    elif results['pct_essential'] > 20:
        status = "⚠️  Moderately essential gene (20-50% of lines)"
    else:
        status = "ℹ️  Non-essential or context-dependent (< 20% of lines)"
    print(f"  Status: {status}")
    
    print("\n🔝 TOP 10 MOST SENSITIVE CELL LINES:")
    display_cols = ['cell_line_name', 'oncotree_lineage', results['gene']]
    top_df = results['top_sensitive'][display_cols].copy()
    top_df.columns = ['Cell Line', 'Cancer Type', 'Gene Effect']
    
    for idx, row in top_df.iterrows():
        print(f"  {row['Cell Line']:20} {row['Cancer Type']:15} {row['Gene Effect']:>8.4f}")


# Example 1: Analyze TP53 in breast cancer
print("Example 1: TP53 essentiality in Breast Cancer")
print("-" * 60)
tp53_breast = analyze_gene_essentiality(crispr_df, 'TP53', cancer_type='Breast', threshold=-0.5)
display_gene_analysis(tp53_breast)

print("\n" + "=" * 60 + "\n")

# Example 2: Analyze KRAS across all cancer types
print("Example 2: KRAS essentiality across all cancers")
print("-" * 60)
kras_all = analyze_gene_essentiality(crispr_df, 'KRAS', cancer_type=None, threshold=-0.5)
display_gene_analysis(kras_all)

Example 1: TP53 essentiality in Breast Cancer
------------------------------------------------------------
🧬 GENE ESSENTIALITY ANALYSIS
Gene: TP53
Cancer Type: Breast Cancer
Cell Lines Analyzed: 53

📊 EFFECT STATISTICS:
  Mean effect:   0.1751
  Std deviation:   0.2667
  Range: -0.7338 to 1.1820

🎯 ESSENTIALITY SUMMARY:
  Essential cell lines: 1 (1.9%)
  Status: ℹ️  Non-essential or context-dependent (< 20% of lines)

🔝 TOP 10 MOST SENSITIVE CELL LINES:
  HCC1143              Breast           -0.7338
  HCC70                Breast           -0.3026
  HCC1419              Breast           -0.1716
  MDA-MB-453           Breast           -0.0836
  UACC-3199            Breast           -0.0786
  SUM-52PE             Breast           -0.0068
  ACC-3133             Breast           -0.0038
  COLO 824             Breast            0.0114
  UACC-893             Breast            0.0134
  MDA-MB-415           Breast            0.0262


Example 2: KRAS essentiality across all cancers
------------

### 🎯 Try It Yourself!

**Exercise:** Analyze the essentiality of the EGFR gene in Myeloid cancer. How does it compare to TP53?

In [6]:
# Your code here:
# egfr_myeloid = analyze_gene_essentiality(...)
# display_gene_analysis(egfr_myeloid)

## 🔬 Tool 2: Quality Control Dashboard

**The Problem:** Before analyzing your dataset, you need to check data quality: missing values, outliers, and distribution of values.

**Real Scenario:** You've just received gene expression data from the sequencing facility. You need to check if any samples have quality issues before proceeding with analysis.

**Skills:** Filtering, aggregation, statistical analysis

**Difficulty:** ⭐⭐ Intermediate
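
One design choice worth seeing in isolation is how outlier samples get flagged. The sketch below contrasts a one-sided rule (only unusually high means) with a two-sided z-score rule (unusually high or low). The sample names and values are made up, and the cutoff of 2 standard deviations is an assumption, not a universal standard.

```python
import pandas as pd

# Made-up per-sample mean values, NOT taken from the dataset
sample_means = pd.Series(
    [-0.14, -0.15, -0.13, -0.14, -0.16, -0.14, -0.15, -0.13, -0.30, 0.02],
    index=[f"sample_{i}" for i in range(10)],
)

# One-sided rule: flags only samples whose mean is unusually HIGH
high_cutoff = sample_means.mean() + 2 * sample_means.std()
high_only = sample_means[sample_means > high_cutoff]

# Two-sided rule: flags samples far from the mean in either direction
z_scores = (sample_means - sample_means.mean()) / sample_means.std()
two_sided = sample_means[z_scores.abs() > 2]

print("One-sided:", list(high_only.index))
print("Two-sided:", list(two_sided.index))
```

In this toy data the one-sided rule misses the unusually low sample, which matters for gene-effect scores because strongly negative values are often the interesting ones.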

In [7]:
# Quality Control Dashboard
# ========================

def quality_control_report(df, sample_col='cell_line_name'):
    """
    Generate comprehensive QC report for biological dataset
    
    Parameters:
    - df: DataFrame to analyze
    - sample_col: Column containing sample names
    """
    
    print("🔬 QUALITY CONTROL DASHBOARD")
    print("=" * 70)
    
    # 1. Dataset Overview
    print("\n📋 DATASET OVERVIEW:")
    print(f"  Total samples: {len(df)}")
    print(f"  Total features: {len(df.columns)}")
    print(f"  Metadata columns: {len([col for col in df.columns if df[col].dtype == 'object'])}")
    print(f"  Numeric columns: {len([col for col in df.columns if df[col].dtype in ['float64', 'int64']])}")
    
    # 2. Missing Data Analysis
    print("\n🔍 MISSING DATA ANALYSIS:")
    missing_counts = df.isnull().sum()
    columns_with_missing = missing_counts[missing_counts > 0]
    
    if len(columns_with_missing) == 0:
        print("  ✅ No missing values detected")
    else:
        print(f"  ⚠️  {len(columns_with_missing)} columns have missing values")
        print(f"\n  Top 5 columns with most missing data:")
        top_missing = columns_with_missing.nlargest(5)
        for col, count in top_missing.items():
            pct = (count / len(df)) * 100
            print(f"    {col:20} {count:4d} missing ({pct:5.1f}%)")
    
    # 3. Sample Quality Metrics
    print("\n📊 SAMPLE QUALITY METRICS:")
    
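    # Get numeric columns only
    numeric_cols = df.select_dtypes(include=[np.number]).columns

    # Defaults so the recommendations step still works when there are no numeric columns
    outlier_samples = df.iloc[0:0]
    low_var_features = pd.Series(dtype=float)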
    
    if len(numeric_cols) > 0:
        # Calculate per-sample statistics
        sample_means = df[numeric_cols].mean(axis=1)
        sample_stds = df[numeric_cols].std(axis=1)
        sample_missing = df[numeric_cols].isnull().sum(axis=1)
        
        print("  Mean value per sample:")
        print(f"    Average: {sample_means.mean():>8.4f}")
        print(f"    Range: {sample_means.min():.4f} to {sample_means.max():.4f}")
        
        print("\n  Variability per sample:")
        print(f"    Average StdDev: {sample_stds.mean():>8.4f}")
        print(f"    Range: {sample_stds.min():.4f} to {sample_stds.max():.4f}")
        
        # Identify outlier samples (one-sided: flags only samples whose mean is unusually high)
        mean_threshold = sample_means.mean() + 2 * sample_means.std()
        outlier_samples = df[sample_means > mean_threshold]
        
        if len(outlier_samples) > 0:
            print(f"\n  ⚠️  {len(outlier_samples)} potential outlier samples detected")
            print("      (mean value > 2 standard deviations from dataset mean)")
        else:
            print("\n  ✅ No outlier samples detected")
    
    # 4. Feature Quality Metrics
    print("\n📈 FEATURE QUALITY METRICS:")
    
    if len(numeric_cols) > 0:
        feature_missing = df[numeric_cols].isnull().sum()
        feature_vars = df[numeric_cols].var()
        
        # Low variance features
        low_var_features = feature_vars[feature_vars < 0.001]
        print(f"  Low variance features (var < 0.001): {len(low_var_features)}")
        
        # High variance features
        high_var_features = feature_vars.nlargest(5)
        print("\n  Top 5 most variable features:")
        for feat, var in high_var_features.items():
            print(f"    {feat:20} variance = {var:.4f}")
    
    # 5. Data Distribution Check
    print("\n📉 DATA DISTRIBUTION:")
    if len(numeric_cols) > 0:
        all_values = df[numeric_cols].values.flatten()
        all_values = all_values[~np.isnan(all_values)]
        
        print(f"  Overall mean: {np.mean(all_values):>8.4f}")
        print(f"  Overall median: {np.median(all_values):>8.4f}")
        print(f"  Overall std: {np.std(all_values):>8.4f}")
        
        # Rough skewness check: a large gap between mean and median suggests asymmetry
        if abs(np.mean(all_values) - np.median(all_values)) > 0.1:
            print("  ⚠️  Data appears skewed (mean ≠ median)")
        else:
            print("  ✅ No strong skew detected (mean ≈ median)")
    
    # 6. Final Recommendation
    print("\n💡 RECOMMENDATIONS:")
    
    issues = []
    if len(columns_with_missing) > 0:
        issues.append("Handle missing data before analysis")
    if len(outlier_samples) > 0:
        issues.append("Investigate outlier samples")
    if len(low_var_features) > len(numeric_cols) * 0.1:
        issues.append("Consider removing low-variance features")
    
    if len(issues) == 0:
        print("  ✅ Dataset passes all quality checks!")
        print("  Ready for downstream analysis.")
    else:
        print("  Action items:")
        for i, issue in enumerate(issues, 1):
            print(f"    {i}. {issue}")


# Run QC on our dataset
quality_control_report(crispr_df, sample_col='cell_line_name')

🔬 QUALITY CONTROL DASHBOARD

📋 DATASET OVERVIEW:
  Total samples: 1165
  Total features: 17211
  Metadata columns: 6
  Numeric columns: 17205

🔍 MISSING DATA ANALYSIS:
  ✅ No missing values detected

📊 SAMPLE QUALITY METRICS:
  Mean value per sample:
    Average:  -0.1409
    Range: -0.1672 to -0.1134

  Variability per sample:
    Average StdDev:   0.4125
    Range: 0.3890 to 0.4707

  ⚠️  34 potential outlier samples detected
      (mean value > 2 standard deviations from dataset mean)

📈 FEATURE QUALITY METRICS:
  Low variance features (var < 0.001): 0

  Top 5 most variable features:
    YRDC                 variance = 0.6919
    CCND1                variance = 0.6168
    EIF1AX               variance = 0.5098
    SCAP                 variance = 0.4818
    MYC                  variance = 0.4577

📉 DATA DISTRIBUTION:
  Overall mean:  -0.1409
  Overall median:  -0.0373
  Overall std:   0.4127
  ⚠️  Data appears skewed (mean ≠ median)

💡 RECOMMENDATIONS:
  Action items:
    1. Investigate outlier samples

## 💊 Tool 3: Drug Screening Hit Selector

**The Problem:** You've screened thousands of genes and need to prioritize which ones to validate experimentally based on multiple criteria.

**Real Scenario:** Your lab can only follow up on 10 genes for experimental validation. You need to select the best candidates based on essentiality, cancer-specificity, and consistency across replicates.

**Skills:** Multi-column filtering, sorting, ranking

**Difficulty:** ⭐⭐⭐ Advanced

In [8]:
# Drug Screening Hit Selector
# ==========================

def select_drug_targets(df, target_cancer_type, control_cancer_types, 
                       n_targets=10, essentiality_threshold=-0.3):
    """
    Select top drug target candidates based on cancer-specific essentiality
    
    Parameters:
    - df: CRISPR screening data
    - target_cancer_type: Cancer type to find specific targets for
    - control_cancer_types: List of cancer types to compare against
    - n_targets: Number of top targets to return
    - essentiality_threshold: Mean effect below which a gene counts as essential in the target cancer
    """
    
    print("💊 DRUG TARGET SELECTION")
    print("=" * 70)
    print(f"Target cancer: {target_cancer_type}")
    print(f"Control cancers: {', '.join(control_cancer_types)}")
    print(f"Essentiality threshold: {essentiality_threshold}")
    print(f"Selecting top {n_targets} targets\n")
    
    # Filter data
    target_df = df[df['oncotree_lineage'] == target_cancer_type]
    control_df = df[df['oncotree_lineage'].isin(control_cancer_types)]
    
    print("📊 Dataset composition:")
    print(f"  Target cancer lines: {len(target_df)}")
    print(f"  Control cancer lines: {len(control_df)}")
    
    # Get gene columns
    gene_cols = [col for col in df.columns if col not in 
                ['model_id', 'cell_line_name', 'stripped_cell_line_name', 
                 'oncotree_lineage', 'oncotree_primary_disease', 'oncotree_subtype']]
    
    # Calculate statistics for each gene
    target_means = target_df[gene_cols].mean()
    control_means = control_df[gene_cols].mean()
    
    # Calculate specificity score (how much more essential in target vs control)
    specificity = target_means - control_means
    
    # Create ranking dataframe
    ranking_df = pd.DataFrame({
        'gene': gene_cols,
        'target_mean': target_means.values,
        'control_mean': control_means.values,
        'specificity_score': specificity.values,
        'target_essential': (target_means < essentiality_threshold).values
    })
    
    # Filter for genes essential in target cancer
    essential_targets = ranking_df[ranking_df['target_essential']].copy()
    
    # Sort by specificity (most negative in target, least negative in control)
    essential_targets = essential_targets.sort_values('specificity_score', ascending=True)
    
    # Get top targets
    top_targets = essential_targets.head(n_targets)
    
    print("\n🎯 SCREENING RESULTS:")
    print(f"  Total genes screened: {len(gene_cols):,}")
    print(f"  Essential in {target_cancer_type}: {len(essential_targets)}")
    print(f"  Top candidates selected: {len(top_targets)}")
    
    print(f"\n🏆 TOP {n_targets} DRUG TARGET CANDIDATES:")
    print("  (Ranked by cancer-specificity)\n")
    
    print(f"{'Rank':<6} {'Gene':<12} {'Target':<10} {'Control':<10} {'Specificity':<12} {'Priority'}")
    print("-" * 70)
    
    for idx, row in enumerate(top_targets.itertuples(), 1):
        # Determine priority
        if row.specificity_score < -0.2:
            priority = "⭐⭐⭐ High"
        elif row.specificity_score < -0.1:
            priority = "⭐⭐ Medium"
        else:
            priority = "⭐ Low"
        
        print(f"{idx:<6} {row.gene:<12} {row.target_mean:<10.4f} {row.control_mean:<10.4f} "
              f"{row.specificity_score:<12.4f} {priority}")
    
    # Additional statistics
    print("\n📈 VALIDATION RECOMMENDATIONS:")
    high_priority = top_targets[top_targets['specificity_score'] < -0.2]
    med_priority = top_targets[(top_targets['specificity_score'] >= -0.2) & 
                               (top_targets['specificity_score'] < -0.1)]
    
    print(f"  High priority targets (start here): {len(high_priority)}")
    print(f"  Medium priority targets: {len(med_priority)}")
    print(f"  Low priority targets: {len(top_targets) - len(high_priority) - len(med_priority)}")
    
    print("\n💡 NEXT STEPS:")
    print("  1. Validate top 3 high-priority targets with dose-response curves")
    print("  2. Check literature for known druggability of candidates")
    print("  3. Perform secondary screens in patient-derived models")
    print("  4. Begin medicinal chemistry optimization for validated hits")
    
    return top_targets


# Example: Find breast cancer specific targets
breast_targets = select_drug_targets(
    crispr_df,
    target_cancer_type='Breast',
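    control_cancer_types=['Lung', 'Skin', 'Brain'],  # NOTE: no lineage is labeled 'Brain' (the dataset uses 'CNS/Brain'), so only Lung and Skin lines are matched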
    n_targets=10,
    essentiality_threshold=-0.3
)

💊 DRUG TARGET SELECTION
Target cancer: Breast
Control cancers: Lung, Skin, Brain
Essentiality threshold: -0.3
Selecting top 10 targets

📊 Dataset composition:
  Target cancer lines: 53
  Control cancer lines: 197

🎯 SCREENING RESULTS:
  Total genes screened: 17,205
  Essential in Breast: 2255
  Top candidates selected: 10

🏆 TOP 10 DRUG TARGET CANDIDATES:
  (Ranked by cancer-specificity)

Rank   Gene         Target     Control    Specificity  Priority
----------------------------------------------------------------------
1      FOXA1        -0.7159    -0.1788    -0.5371      ⭐⭐⭐ High
2      PPP1R15B     -1.2813    -0.7933    -0.4880      ⭐⭐⭐ High
3      SPDEF        -0.3916    0.0244     -0.4160      ⭐⭐⭐ High
4      EIF1AX       -2.1430    -1.7448    -0.3982      ⭐⭐⭐ High
5      CDK1         -2.6556    -2.2712    -0.3844      ⭐⭐⭐ High
6      TRPS1        -0.3163    0.0677     -0.3840      ⭐⭐⭐ High
7      SMU1         -3.0940    -2.7334    -0.3606      ⭐⭐⭐ High
8      PPP2CA       -1.31
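
Note that the control group above contains 197 lines, which equals Lung (124) plus Skin (73): `isin` silently ignores labels that match no rows, and the lineage listing at the top of this notebook spells the brain lineage `CNS/Brain`. A small check like the following can catch such mismatches early. `find_unknown_labels` is a hypothetical helper written for this sketch, not part of pandas or of the toolkit above.

```python
import pandas as pd

def find_unknown_labels(df, requested, column="oncotree_lineage"):
    """Return the requested labels that never occur in df[column] (hypothetical helper)."""
    known = set(df[column].unique())
    return [label for label in requested if label not in known]

# Toy frame holding only the labels needed for the demonstration
toy_df = pd.DataFrame({"oncotree_lineage": ["Lung", "Skin", "CNS/Brain", "Breast"]})

unknown = find_unknown_labels(toy_df, ["Lung", "Skin", "Brain"])
print(unknown)
```

Calling such a check on `control_cancer_types` before filtering would turn a silent mismatch into a visible warning.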

## 📈 Tool 4: Patient Cohort Analyzer

**The Problem:** You need to stratify cell lines (or patients) into groups based on multiple molecular features for clinical trial design.

**Real Scenario:** You're designing a clinical trial and need to identify which cancer subtypes would benefit most from targeting specific genes.

**Skills:** Grouping, aggregation, complex filtering

**Difficulty:** ⭐⭐⭐ Advanced
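
Grouping with several aggregations produces a table with two-level column labels, which can be confusing at first. The sketch below, on made-up values with placeholder gene names `GENE_X` and `GENE_Y`, shows one way to pull out a single statistic and count sensitive genes per cohort; the cutoff of -0.2 is simply the value used later in this tool.

```python
import pandas as pd

# Toy data: made-up gene-effect scores, NOT values from DepMap
toy_df = pd.DataFrame({
    "oncotree_lineage": ["Lung", "Lung", "Breast", "Breast"],
    "GENE_X": [-0.4, -0.2, -0.1, 0.1],
    "GENE_Y": [-0.5, -0.3, 0.0, 0.2],
})

# Columns of the result are (gene, statistic) pairs
stats = toy_df.groupby("oncotree_lineage")[["GENE_X", "GENE_Y"]].agg(["mean", "count"])

# Select the "mean" statistic for every gene from the second column level
means = stats.xs("mean", axis=1, level=1)

# Count, per cohort, how many genes have a mean effect below the cutoff
n_sensitive = (means < -0.2).sum(axis=1)

print(stats)
print(n_sensitive)
```

This computes the per-cohort count in one vectorized step instead of looping over cohorts.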

In [None]:
# Patient Cohort Analyzer
# ======================

def analyze_patient_cohorts(df, genes_of_interest, group_by='oncotree_lineage'):
    """
    Analyze patient/cell line cohorts based on gene dependencies
    
    Parameters:
    - df: CRISPR data
    - genes_of_interest: List of genes to analyze
    - group_by: Column to group samples by
    """
    
    print("📈 PATIENT COHORT ANALYSIS")
    print("=" * 70)
    print(f"Analyzing {len(genes_of_interest)} genes across cohorts")
    print(f"Grouping by: {group_by}")
    print(f"Genes: {', '.join(genes_of_interest)}\n")
    
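    # Check all genes exist
    missing_genes = [g for g in genes_of_interest if g not in df.columns]
    if missing_genes:
        print(f"⚠️  Warning: Genes not found: {missing_genes}")
        genes_of_interest = [g for g in genes_of_interest if g in df.columns]

    # Stop early if nothing is left to analyze (avoids a division by zero below)
    if not genes_of_interest:
        print("❌ None of the requested genes are in the dataset")
        return

    # Per-cohort summary table (mean, std, count for each gene); kept for inspection,
    # the ranking below recomputes the means it needs
    cohort_stats = df.groupby(group_by)[genes_of_interest].agg(['mean', 'std', 'count'])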
    
    # Calculate sensitivity score for each cohort
    # (number of genes with mean effect < -0.2)
    sensitivity_scores = {}
    
    for cohort in df[group_by].unique():
        cohort_df = df[df[group_by] == cohort]
        gene_means = cohort_df[genes_of_interest].mean()
        n_sensitive = (gene_means < -0.2).sum()
        sensitivity_scores[cohort] = n_sensitive
    
    # Sort cohorts by sensitivity
    sorted_cohorts = sorted(sensitivity_scores.items(), key=lambda x: x[1], reverse=True)
    
    print("🎯 COHORT SENSITIVITY RANKING:")
    print("  (Number of target genes with strong dependency)\n")
    
    for rank, (cohort, score) in enumerate(sorted_cohorts, 1):
        n_samples = len(df[df[group_by] == cohort])
        pct_sensitive = (score / len(genes_of_interest)) * 100
        
        # Add visual indicator
        if pct_sensitive >= 60:
            indicator = "🔥 EXCELLENT"
        elif pct_sensitive >= 40:
            indicator = "✅ GOOD"
        elif pct_sensitive >= 20:
            indicator = "⚠️  MODERATE"
        else:
            indicator = "❌ POOR"
        
        print(f"  {rank}. {cohort:<20} {score}/{len(genes_of_interest)} genes "
              f"({pct_sensitive:>5.1f}%)  {indicator}  n={n_samples}")
    
    # Detailed gene-by-cohort analysis
    print(f"\n📊 DETAILED GENE ANALYSIS BY COHORT:\n")
    
    # Show top 3 cohorts
    for cohort, score in sorted_cohorts[:3]:
        print(f"\n  {cohort} ({score} sensitive genes):")
        print(f"  {'-' * 60}")
        
        cohort_df = df[df[group_by] == cohort]
        gene_means = cohort_df[genes_of_interest].mean().sort_values()
        
        for gene, mean_effect in gene_means.items():
            if mean_effect < -0.2:
                priority = "⭐⭐⭐"
            elif mean_effect < -0.1:
                priority = "⭐⭐"
            else:
                priority = "⭐"
            
            print(f"    {gene:<15} Mean: {mean_effect:>7.4f}  {priority}")
    
    # Clinical trial recommendations
    print("\n💊 CLINICAL TRIAL RECOMMENDATIONS:")
    print("\n  Phase I/II (Initial efficacy):")
    top_cohort = sorted_cohorts[0][0]
    print(f"    - Start with {top_cohort} patients (highest sensitivity)")
    print("    - Expected response rate: HIGH")
    
    if len(sorted_cohorts) > 1:
        second_cohort = sorted_cohorts[1][0]
        print("\n  Phase II expansion:")
        print(f"    - Expand to {second_cohort} patients")
        print("    - Compare response rates between cohorts")
    
    poor_cohorts = [c for c, s in sorted_cohorts if (s / len(genes_of_interest)) < 0.2]
    if poor_cohorts:
        print("\n  Exclusion criteria:")
        print(f"    - Consider excluding: {', '.join(poor_cohorts[:3])}")
        print("    - Low predicted response based on molecular profile")


# Example: Analyze cohorts for a combination therapy
combo_genes = ['EGFR', 'KRAS', 'PIK3CA', 'BRAF', 'TP53']
analyze_patient_cohorts(crispr_df, combo_genes, group_by='oncotree_lineage')

## 🧪 Tool 5: Experiment Result Comparator

**The Problem:** You've run the same experiment across multiple conditions and need to identify the most significant differences.

**Real Scenario:** You tested drug combinations on different cancer cell lines and need to find which combinations work best for which cancer types.

**Skills:** Complex filtering, sorting, comparison

**Difficulty:** ⭐⭐⭐ Advanced
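
Labeling each cell line by which of two genes it depends on can be done row by row, but a vectorized alternative is often faster on large tables. The following is a minimal sketch using `np.select` on made-up values; the gene names `GENE_A` and `GENE_B` and the threshold are placeholders.

```python
import numpy as np
import pandas as pd

# Toy data: made-up gene-effect scores, NOT values from DepMap
toy_df = pd.DataFrame({
    "GENE_A": [-0.8, -0.5, 0.1, -0.1],
    "GENE_B": [-0.6, 0.0, -0.4, 0.2],
})
threshold = -0.3  # illustrative cutoff

a_ess = toy_df["GENE_A"] < threshold
b_ess = toy_df["GENE_B"] < threshold

# np.select assigns, for each row, the label of the FIRST condition that is True,
# so the "both" case must come before the single-gene cases
toy_df["dependency_category"] = np.select(
    [a_ess & b_ess, a_ess, b_ess],
    ["Both Essential", "GENE_A Only", "GENE_B Only"],
    default="Neither Essential",
)

print(toy_df)
```

The order of the conditions plays the same role as the order of the `if`/`elif` branches in a row-wise function.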

In [None]:
# Experiment Result Comparator
# ============================


def compare_gene_effects(df, gene1, gene2, cancer_types=None):
    """
    Compare effects of two genes to identify synergistic relationships

    Parameters:
    - df: CRISPR data
    - gene1, gene2: Genes to compare
    - cancer_types: List of cancer types to analyze (None = all)
    """

    print("🧪 EXPERIMENT RESULT COMPARISON")
    print("=" * 70)
    print(f"Comparing: {gene1} vs {gene2}\n")

    # Check genes exist
    if gene1 not in df.columns or gene2 not in df.columns:
        print("❌ One or both genes not found in dataset")
        return

    # Filter by cancer types if specified
    if cancer_types:
        analysis_df = df[df["oncotree_lineage"].isin(cancer_types)].copy()
        cancer_desc = ", ".join(cancer_types)
    else:
        analysis_df = df.copy()
        cancer_desc = "All cancer types"

    print(f"📊 Analysis scope: {cancer_desc}")
    print(f"   Cell lines analyzed: {len(analysis_df)}\n")

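    # Combined effect: sum of the two single-gene effects (more negative = stronger joint dependency)
    analysis_df["combined_effect"] = analysis_df[gene1] + analysis_df[gene2]
    # NOTE: this expression reduces algebraically to (gene1 + gene2) / 2, the mean of the two
    # effects. It does not measure genetic synergy and is not used in the report below.
    analysis_df["synergy_score"] = (
        analysis_df["combined_effect"] - (analysis_df[gene1] + analysis_df[gene2]) / 2
    )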

    # Categorize cell lines
    def categorize_dependency(row):
        g1, g2 = row[gene1], row[gene2]
        threshold = -0.3

        if g1 < threshold and g2 < threshold:
            return "Both Essential"
        elif g1 < threshold:
            return f"{gene1} Only"
        elif g2 < threshold:
            return f"{gene2} Only"
        else:
            return "Neither Essential"

    analysis_df["dependency_category"] = analysis_df.apply(
        categorize_dependency, axis=1
    )

    # Summary statistics
    print("🎯 DEPENDENCY CATEGORIES:")
    category_counts = analysis_df["dependency_category"].value_counts()

    for category, count in category_counts.items():
        pct = (count / len(analysis_df)) * 100
        print(f"  {category:<25} {count:>3} cell lines ({pct:>5.1f}%)")

    # Find synergistic combinations
    print("\n💡 SYNERGISTIC OPPORTUNITIES:")
    both_essential = analysis_df[analysis_df["dependency_category"] == "Both Essential"]

    if len(both_essential) > 0:
        print(f"  ✅ Found {len(both_essential)} cell lines dependent on BOTH genes")
        print("     → Strong candidates for combination therapy\n")

        # Show top candidates
        top_combo = both_essential.nsmallest(5, "combined_effect")
        print("  Top 5 combination therapy candidates:")

        for idx, row in top_combo.iterrows():
            print(
                f"    {row['cell_line_name']:<20} {row['oncotree_lineage']:<15} "
                f"{gene1}: {row[gene1]:.3f}  {gene2}: {row[gene2]:.3f}"
            )
    else:
        print("  ℹ️  No cell lines show strong dependency on both genes")
        print("     → Combination therapy may not be synergistic")

    # Analyze by cancer type
    print("\n📈 CANCER TYPE ANALYSIS:")
    cancer_stats = (
        analysis_df.groupby("oncotree_lineage")
        .agg(
            {gene1: ["mean", "std"], gene2: ["mean", "std"], "combined_effect": "mean"}
        )
        .round(4)
    )

    # Sort by combined effect
    cancer_stats = cancer_stats.sort_values(("combined_effect", "mean"))

    print("\n  Cancer types ranked by combination potential:\n")
    print(
        f"  {'Cancer Type':<20} {gene1+' Mean':<12} {gene2+' Mean':<12} {'Combined':<12}"
    )
    print(f"  {'-'*60}")

    for cancer_type in cancer_stats.index:
        g1_mean = cancer_stats.loc[cancer_type, (gene1, "mean")]
        g2_mean = cancer_stats.loc[cancer_type, (gene2, "mean")]
        comb_mean = cancer_stats.loc[cancer_type, ("combined_effect", "mean")]

        print(
            f"  {cancer_type:<20} {g1_mean:<12.4f} {g2_mean:<12.4f} {comb_mean:<12.4f}"
        )

    # Recommendations
    print("\n💊 THERAPEUTIC STRATEGY RECOMMENDATIONS:")

    best_cancer = cancer_stats.index[0]
    print(f"\n  Primary indication: {best_cancer}")
    print("    - Strongest combined dependency")
    print("    - Prioritize for Phase I trials")

    # Check for gene-specific cancers
    gene1_only = analysis_df[analysis_df["dependency_category"] == f"{gene1} Only"]
    gene2_only = analysis_df[analysis_df["dependency_category"] == f"{gene2} Only"]

    if len(gene1_only) > len(both_essential):
        print(f"\n  Alternative strategy: {gene1} monotherapy")
        print(f"    - {len(gene1_only)} cell lines show {gene1}-specific dependency")
        print("    - May be more effective than combination in some contexts")

    if len(gene2_only) > len(both_essential):
        print(f"\n  Alternative strategy: {gene2} monotherapy")
        print(f"    - {len(gene2_only)} cell lines show {gene2}-specific dependency")
        print("    - May be more effective than combination in some contexts")


# Example: Compare EGFR and KRAS for combination therapy
compare_gene_effects(
    crispr_df, "EGFR", "KRAS", cancer_types=["Lung", "Breast", "Ovary/Fallopian Tube"]
)

## 🎯 Tool 6: Multi-Gene Vulnerability Profiler

**The Problem:** You need to create comprehensive vulnerability profiles for each cell line based on multiple genes to guide personalized treatment strategies.

**Real Scenario:** You're developing a precision medicine approach and need to match patients to the most effective targeted therapies based on their molecular profile.

**Skills:** Advanced filtering, aggregation, ranking, multi-dimensional analysis

**Difficulty:** ⭐⭐⭐⭐ Expert
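
The profiler sorts each gene into a vulnerability tier based on its effect score. As a compact illustration of that binning step, here is a sketch using `pd.cut` on made-up values; the gene names are placeholders, and the tier boundaries (-0.3, -0.15, 0) are the ones used in the function below.

```python
import numpy as np
import pandas as pd

# Made-up effect scores for four placeholder genes
effects = pd.Series({"GENE_A": -0.9, "GENE_B": -0.2, "GENE_C": -0.05, "GENE_D": 0.3})
essentiality_threshold = -0.3

# right=False makes each bin include its left edge and exclude its right edge:
# [-inf, -0.3), [-0.3, -0.15), [-0.15, 0), [0, inf)
tiers = pd.cut(
    effects,
    bins=[-np.inf, essentiality_threshold, -0.15, 0, np.inf],
    labels=["critical", "moderate", "weak", "resistant"],
    right=False,
)

print(tiers)
print(tiers.value_counts())
```

Setting `right=False` keeps the bin edges consistent with comparisons of the form `low <= effect < high`.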

In [None]:
# Multi-Gene Vulnerability Profiler
# =================================

def create_vulnerability_profile(df, gene_panel, sample_name, essentiality_threshold=-0.3):
    """
    Create a comprehensive vulnerability profile for a specific sample
    
    Parameters:
    - df: CRISPR data
    - gene_panel: List of genes to include in profile
    - sample_name: Name of cell line to profile
    - essentiality_threshold: Threshold for calling gene essential
    """
    
    print("🎯 MULTI-GENE VULNERABILITY PROFILE")
    print("=" * 70)
    print(f"Sample: {sample_name}")
    print(f"Gene panel: {len(gene_panel)} genes")
    print(f"Essentiality threshold: {essentiality_threshold}\n")
    
    # Find the sample
    sample_df = df[df['cell_line_name'] == sample_name]
    
    if len(sample_df) == 0:
        print(f"❌ Sample '{sample_name}' not found in dataset")
        return None
    
    sample_row = sample_df.iloc[0]
    
    # Sample metadata
    print("📋 SAMPLE INFORMATION:")
    print(f"  Cell line: {sample_row['cell_line_name']}")
    print(f"  Cancer type: {sample_row['oncotree_lineage']}")
    print(f"  Primary disease: {sample_row['oncotree_primary_disease']}")
    print(f"  Subtype: {sample_row['oncotree_subtype']}")
    
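    # Analyze gene panel (skip genes that are absent or have no measurement)
    gene_effects = {}
    for gene in gene_panel:
        if gene not in sample_row.index:
            print(f"  ⚠️  Gene {gene} not found in dataset")
        elif pd.isna(sample_row[gene]):
            print(f"  ⚠️  Gene {gene} has no measurement for this sample")
        else:
            gene_effects[gene] = sample_row[gene]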
    
    # Sort genes by effect
    sorted_genes = sorted(gene_effects.items(), key=lambda x: x[1])
    
    # Categorize vulnerabilities
    critical = [(g, e) for g, e in sorted_genes if e < essentiality_threshold]
    moderate = [(g, e) for g, e in sorted_genes if essentiality_threshold <= e < -0.15]
    weak = [(g, e) for g, e in sorted_genes if -0.15 <= e < 0]
    resistant = [(g, e) for g, e in sorted_genes if e >= 0]
    
    print("\n🎯 VULNERABILITY SUMMARY:")
    print(f"  Critical vulnerabilities: {len(critical)} genes (effect < {essentiality_threshold})")
    print(f"  Moderate vulnerabilities: {len(moderate)} genes")
    print(f"  Weak vulnerabilities: {len(weak)} genes")
    print(f"  Resistant/insensitive: {len(resistant)} genes")
    
    # Calculate vulnerability score over the genes that were actually scored
    vuln_score = len(critical) * 3 + len(moderate) * 2 + len(weak) * 1
    max_score = len(gene_effects) * 3
    vuln_pct = (vuln_score / max_score) * 100 if max_score else 0.0
    
    print(f"\n  Overall vulnerability score: {vuln_score}/{max_score} ({vuln_pct:.1f}%)")
    
    if vuln_pct > 60:
        print("  Status: 🔥 HIGHLY VULNERABLE - Multiple therapeutic options")
    elif vuln_pct > 40:
        print("  Status: ✅ MODERATELY VULNERABLE - Several treatment options")
    elif vuln_pct > 20:
        print("  Status: ⚠️  PARTIALLY VULNERABLE - Limited treatment options")
    else:
        print("  Status: ❌ RESISTANT - Few effective options in this panel")
    
    # Display critical vulnerabilities
    if critical:
        print("\n🎯 CRITICAL VULNERABILITIES (Prioritize these targets):\n")
        print(f"  {'Rank':<6} {'Gene':<12} {'Effect':<12} {'Druggability':<20} {'Priority'}")
        print(f"  {'-'*70}")
        
        # Known druggable genes (simplified - in reality use a database)
        druggable_genes = ['EGFR', 'KRAS', 'BRAF', 'PIK3CA', 'ERBB2', 'MET', 'ALK']
        
        for rank, (gene, effect) in enumerate(critical, 1):
            if gene in druggable_genes:
                druggable = "✅ Known target"
                priority = "⭐⭐⭐ HIGH"
            else:
                druggable = "🔬 Research target"
                priority = "⭐⭐ MEDIUM"
            
            print(f"  {rank:<6} {gene:<12} {effect:<12.4f} {druggable:<20} {priority}")
    
    # Treatment recommendations
    print("\n💊 TREATMENT RECOMMENDATIONS:\n")
    
    if len(critical) >= 3:
        print("  Strategy: Combination therapy")
        print("  Rationale: Multiple critical vulnerabilities identified")
        print("\n  Suggested combinations:")
        for i in range(min(3, len(critical))):
            gene = critical[i][0]
            print(f"    {i+1}. {gene} inhibitor (primary target)")
        
        if len(critical) > 3:
            print("\n  Alternative targets if primary fails:")
            for i in range(3, min(6, len(critical))):
                gene = critical[i][0]
                print(f"    - {gene}")
    
    elif len(critical) >= 1:
        print("  Strategy: Targeted monotherapy")
        print(f"  Primary target: {critical[0][0]}")
        
        if len(moderate) > 0:
            print("\n  Consider combination with:")
            for gene, effect in moderate[:3]:
                print(f"    - {gene} (moderate vulnerability)")
    
    else:
        print("  Strategy: Alternative approaches")
        print("  Rationale: No critical vulnerabilities in tested gene panel")
        print("\n  Recommendations:")
        print("    1. Expand gene panel testing")
        print("    2. Consider immunotherapy approaches")
        print("    3. Investigate metabolic vulnerabilities")
    
    # Clinical trial matching
    print("\n🔬 CLINICAL TRIAL MATCHING:")
    if critical:
        print("  Potentially eligible trials:")
        for gene, effect in critical[:3]:
            print(f"    - {gene} inhibitor trials in {sample_row['oncotree_lineage']} cancer")
    
    return {
        'sample': sample_name,
        'cancer_type': sample_row['oncotree_lineage'],
        'critical_vulnerabilities': critical,
        'vulnerability_score': vuln_score,
        'vulnerability_pct': vuln_pct
    }


# Example: Create a vulnerability profile for a specific cell line
oncogene_panel = ['EGFR', 'KRAS', 'BRAF', 'PIK3CA', 'MET', 'ERBB2', 'TP53', 
                 'PTEN', 'AKT1', 'NRAS', 'ALK', 'ROS1']

profile = create_vulnerability_profile(
    crispr_df,
    gene_panel=oncogene_panel,
    sample_name='MCF7',
    essentiality_threshold=-0.3
)
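
The tiering step inside `create_vulnerability_profile` can also be written with `pd.cut`, which bins a whole Series in one call. The sketch below is only an illustration: the effect scores are made up, and `tier_effects` and `moderate_cutoff` are hypothetical names that mirror the thresholds used above.

```python
import numpy as np
import pandas as pd

def tier_effects(effects, essentiality_threshold=-0.3, moderate_cutoff=-0.15):
    """Assign each gene effect to a vulnerability tier (hypothetical helper)."""
    bins = [-np.inf, essentiality_threshold, moderate_cutoff, 0, np.inf]
    labels = ["critical", "moderate", "weak", "resistant"]
    # right=False makes every bin [lower, upper), matching the comparisons above
    return pd.cut(effects, bins=bins, labels=labels, right=False)

# Made-up effect scores, for illustration only
toy_effects = pd.Series(
    {"GENE_A": -0.80, "GENE_B": -0.30, "GENE_C": -0.10, "GENE_D": 0.05}
)
tiers = tier_effects(toy_effects)
print(tiers)
print(tiers.value_counts())
```

A score exactly equal to the threshold lands in the "moderate" tier, the same as in the list comprehensions above.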

---
## 📚 Practice Challenges

Now it's your turn! Apply what you've learned to solve these biological data analysis challenges.

### Challenge 1: Find Pan-Cancer Essential Genes (Intermediate)

**Task:** Identify genes that are essential across ALL cancer types (mean effect < -0.4 in every cancer type).

**Hints:**
- Use `groupby()` to calculate mean effect per cancer type
- Filter for genes where ALL cancer types show strong dependency
- These are potential universal cancer targets!
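
If the `groupby()` and "ALL cancer types" hints feel abstract, the pattern can be tried first on a tiny table. This is a minimal sketch with made-up numbers and placeholder gene names, not DepMap values:

```python
import pandas as pd

# Made-up scores: two lineages, two genes (illustration only)
toy = pd.DataFrame({
    "lineage": ["Lung", "Lung", "Breast", "Breast"],
    "GENE_A": [-0.9, -0.7, -0.6, -0.5],  # strongly negative in both lineages
    "GENE_B": [-0.8, -0.6, -0.1, 0.0],   # strongly negative only in Lung
})

# One row per lineage, one column per gene
lineage_means = toy.groupby("lineage")[["GENE_A", "GENE_B"]].mean()

# .all() checks the condition down each column, i.e. across every lineage
essential_everywhere = (lineage_means < -0.4).all()
print(essential_everywhere[essential_everywhere].index.tolist())
```

Only `GENE_A` passes, because its mean is below -0.4 in both lineages.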

In [None]:
# Your code here:
# Step 1: Get gene columns
# Step 2: Group by cancer type and calculate means
# Step 3: Find genes essential in ALL cancer types
# Step 4: Display results


### Challenge 2: Build a Precision Medicine Matcher (Advanced)

**Task:** For each cell line, identify its top 3 most critical gene dependencies and create a treatment recommendation.

**Hints:**
- For each row, find the 3 genes with most negative effects
- Create a summary DataFrame with cell line, cancer type, and top 3 targets
- Bonus: Group by cancer type to find common vulnerabilities
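
The first hint can be tried on a single row before scaling up. A minimal sketch with made-up numbers; the cell line and gene names are placeholders:

```python
import pandas as pd

# Made-up effect scores: rows are cell lines, columns are genes
toy = pd.DataFrame(
    {
        "GENE_A": [-1.2, 0.1],
        "GENE_B": [-0.3, -0.9],
        "GENE_C": [0.0, -0.4],
        "GENE_D": [-0.7, 0.2],
    },
    index=["line_1", "line_2"],
)

# A single row is a Series indexed by gene name;
# nsmallest(2) keeps the two most negative values
top2 = toy.loc["line_1"].nsmallest(2)
print(top2.index.tolist())
```

Every column here is numeric, so the row has a numeric dtype. On a table that also holds text columns, select the gene columns first, since `nsmallest` only works on numeric data.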

In [None]:
# Your code here:
# Create a function that finds top N vulnerabilities per sample
# Apply to all samples and create summary table


### Challenge 3: Biomarker Discovery (Expert)

**Task:** Find genes whose effects differ strongly between cancer types, which makes them candidate cancer-type biomarkers.

**Hints:**
- Calculate variance of mean effects across cancer types for each gene
- High variance = gene effect varies a lot between cancer types = potential biomarker
- Display top 10 biomarker candidates with their cancer-specific effects
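
To see why variance across cancer-type means picks out lineage-specific genes, compare a gene with the same effect everywhere to one whose effect depends on lineage. The numbers below are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "lineage": ["Lung", "Lung", "Breast", "Breast", "Skin", "Skin"],
    "GENE_A": [-0.5, -0.5, -0.5, -0.5, -0.5, -0.5],  # same effect in every lineage
    "GENE_B": [-1.0, -1.0, 0.0, 0.0, -0.5, -0.5],    # effect depends on lineage
})

# Mean effect per lineage, then variance of those means for each gene
lineage_means = toy.groupby("lineage")[["GENE_A", "GENE_B"]].mean()
variance_across_lineages = lineage_means.var()
print(variance_across_lineages.sort_values(ascending=False))
```

`GENE_A` has zero variance across lineages, while `GENE_B` ranks first.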

In [None]:
# Your code here:
# Calculate variance across cancer types for each gene
# Find genes with highest variance
# Show how these genes differ between cancer types


---
## 🎯 Solutions

Try the challenges above first, then check these solutions!

In [None]:
# Solution 1: Pan-Cancer Essential Genes
print("Solution 1: Pan-Cancer Essential Genes\n")
print("=" * 70)

# Get gene columns
gene_cols = [col for col in crispr_df.columns if col not in 
            ['model_id', 'cell_line_name', 'stripped_cell_line_name',
             'oncotree_lineage', 'oncotree_primary_disease', 'oncotree_subtype']]

# Calculate mean effect by cancer type
cancer_gene_means = crispr_df.groupby('oncotree_lineage')[gene_cols].mean()

# Find genes essential in ALL cancer types (all means < -0.4)
threshold = -0.4
pan_essential = []

for gene in gene_cols:
    if (cancer_gene_means[gene] < threshold).all():
        pan_essential.append(gene)

print(f"Found {len(pan_essential)} pan-cancer essential genes\n")

if len(pan_essential) > 0:
    print("These genes are potential universal cancer targets:\n")
    # Rank by mean effect across cancer types so the 10 shown are the strongest
    ranked = cancer_gene_means[pan_essential].mean().sort_values()
    for gene, mean_across_cancers in ranked.head(10).items():
        print(f"  {gene:<15} Mean effect: {mean_across_cancers:.4f}")
else:
    print("No genes meet the strict pan-cancer essentiality criteria.")
    print("Try a less stringent threshold (e.g., -0.3)")

In [None]:
# Solution 2: Precision Medicine Matcher
print("Solution 2: Precision Medicine Matcher\n")
print("=" * 70)

def find_top_vulnerabilities(row, gene_cols, n=3):
    """Find top N most critical gene dependencies for a sample"""
    # Rows of a mixed-type DataFrame arrive as object dtype; nsmallest needs numbers
    gene_effects = row[gene_cols].astype(float)
    top_genes = gene_effects.nsmallest(n)
    return list(zip(top_genes.index, top_genes.values))

# Apply to all samples
top_vulns = crispr_df.apply(
    lambda row: find_top_vulnerabilities(row, gene_cols, n=3),
    axis=1
)

# Create summary DataFrame
summary_data = []
for idx, vulns in top_vulns.items():
    row = crispr_df.loc[idx]  # idx is an index label, so use .loc rather than .iloc
    summary_data.append({
        'cell_line': row['cell_line_name'],
        'cancer_type': row['oncotree_lineage'],
        'target_1': vulns[0][0],
        'effect_1': vulns[0][1],
        'target_2': vulns[1][0],
        'effect_2': vulns[1][1],
        'target_3': vulns[2][0],
        'effect_3': vulns[2][1]
    })

precision_med_df = pd.DataFrame(summary_data)

print("\nPrecision Medicine Recommendations (first 10 cell lines):\n")
print(precision_med_df.head(10).to_string(index=False))

# Find most common targets by cancer type
print("\n\nMost Common Targets by Cancer Type:\n")
for cancer in precision_med_df['cancer_type'].unique():
    cancer_df = precision_med_df[precision_med_df['cancer_type'] == cancer]
    all_targets = (list(cancer_df['target_1']) + 
                  list(cancer_df['target_2']) + 
                  list(cancer_df['target_3']))
    target_counts = pd.Series(all_targets).value_counts()
    print(f"\n{cancer}:")
    print(f"  {target_counts.head(3).to_dict()}")
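
The row-wise `.apply()` above calls a Python function once per cell line. One possible alternative, sketched here on made-up numbers, ranks every row at once with NumPy's `argsort`. The helper name `top_n_targets` is hypothetical, and treating missing scores as "never selected" is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

def top_n_targets(effects, n=3):
    """Return the n most negative columns per row (hypothetical helper)."""
    values = effects.to_numpy(dtype=float)
    # Replace NaN with +inf so missing scores are never picked as most negative
    values = np.where(np.isnan(values), np.inf, values)
    # Column positions of the n smallest values in each row
    order = np.argsort(values, axis=1)[:, :n]
    names = effects.columns.to_numpy()[order]
    columns = [f"target_{i + 1}" for i in range(n)]
    return pd.DataFrame(names, index=effects.index, columns=columns)

# Made-up effect scores, one missing value
toy = pd.DataFrame(
    {
        "GENE_A": [-1.2, 0.1],
        "GENE_B": [-0.3, -0.9],
        "GENE_C": [np.nan, -0.4],
        "GENE_D": [-0.7, 0.2],
    },
    index=["line_1", "line_2"],
)
top_targets = top_n_targets(toy, n=2)
print(top_targets)
```

On the real data the call would be `top_n_targets(crispr_df[gene_cols])`, with the metadata columns joined back afterwards.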

In [None]:
# Solution 3: Biomarker Discovery
print("Solution 3: Biomarker Discovery\n")
print("=" * 70)

# Calculate mean effect by cancer type for each gene
cancer_gene_means = crispr_df.groupby('oncotree_lineage')[gene_cols].mean()

# Calculate variance across cancer types (high variance = cancer-specific)
gene_variance = cancer_gene_means.var()

# Get top biomarker candidates
top_biomarkers = gene_variance.nlargest(10)

print("\nTop 10 Cancer-Type Biomarker Candidates:\n")
print(f"{'Gene':<15} {'Variance':<12} {'Interpretation'}")
print("-" * 60)

for gene, var in top_biomarkers.items():
    print(f"{gene:<15} {var:<12.4f} High cancer-type specificity")

# Show cancer-specific effects for top biomarker
top_biomarker = top_biomarkers.index[0]
print(f"\n\nDetailed Profile for Top Biomarker: {top_biomarker}")
print("=" * 60)

biomarker_profile = cancer_gene_means[top_biomarker].sort_values()
print(f"\n{'Cancer Type':<20} {'Mean Effect':<12} {'Classification'}")
print("-" * 60)

for cancer, effect in biomarker_profile.items():
    if effect < -0.3:
        classification = "🔥 Critical dependency"
    elif effect < -0.1:
        classification = "⚠️  Moderate dependency"
    else:
        classification = "✅ Not essential"
    
    print(f"{cancer:<20} {effect:<12.4f} {classification}")

print("\n💡 Clinical Significance:")
print(f"   {top_biomarker} could be used to:")
print("   1. Stratify patients for clinical trials")
print(f"   2. Predict response to {top_biomarker}-targeted therapy")
print("   3. Guide precision medicine treatment decisions")

---
## 🎉 Congratulations!

You've built a complete **Biological Data Analysis Toolkit** using pandas!

### 💡 What You've Accomplished:

1. ✅ **Gene Expression Analyzer** - Filter and analyze gene essentiality across cancers
2. ✅ **Quality Control Dashboard** - Validate dataset quality systematically
3. ✅ **Drug Screening Selector** - Prioritize targets with multi-criteria filtering
4. ✅ **Patient Cohort Analyzer** - Stratify samples using groupby operations
5. ✅ **Experiment Comparator** - Compare gene effects across conditions
6. ✅ **Vulnerability Profiler** - Create comprehensive molecular profiles

### 🔑 Skills Mastered:

**DataFrame Operations:**
- `pd.read_csv()` - Loading real biological data
- `.head()`, `.tail()`, `.shape` - Quick data inspection
- `.info()`, `.describe()` - Column types and statistical summaries

**Filtering & Selection:**
- Boolean indexing - `df[df['gene'] < threshold]`
- `.loc[]` and `.iloc[]` - Label-based and position-based selection
- `.isin()` - Filter for multiple values
- `.query()` - SQL-like filtering syntax

**Sorting & Ranking:**
- `.sort_values()` - Sort by single or multiple columns
- `.nsmallest()`, `.nlargest()` - Find extreme values efficiently
- `.rank()` - Rank data for prioritization

**Aggregation & Grouping:**
- `.groupby()` - Split-apply-combine operations
- `.agg()` - Multiple aggregations at once
- `.mean()`, `.std()`, `.var()` - Statistical functions

**Advanced Techniques:**
- Multi-column filtering with `&` and `|`
- Custom functions with `.apply()`
- Creating derived columns for analysis
- Professional report generation
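
As a quick refresher, several of the methods listed above are shown together here on a tiny made-up table. The column names and values are placeholders, not DepMap data:

```python
import pandas as pd

toy = pd.DataFrame({
    "cell_line": ["A", "B", "C", "D"],
    "lineage": ["Lung", "Lung", "Breast", "Skin"],
    "effect": [-0.9, -0.2, -0.6, 0.1],
})

# .isin(): keep rows whose lineage appears in a list
lung_or_breast = toy[toy["lineage"].isin(["Lung", "Breast"])]

# .query(): a filter written as an expression string
strong = toy.query("effect < -0.5")

# .agg(): several summary statistics per group in one call
summary = toy.groupby("lineage")["effect"].agg(["mean", "min", "count"])

# .rank(): 1 = most negative effect
toy["rank"] = toy["effect"].rank(method="min").astype(int)

print(strong["cell_line"].tolist())
print(summary)
print(toy.sort_values("rank"))
```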

### 🚀 Real-World Applications:

These skills are used daily in:
- **Cancer research** - Analyzing screening data
- **Drug discovery** - Prioritizing therapeutic targets
- **Clinical trials** - Patient stratification
- **Precision medicine** - Personalized treatment selection
- **Biomarker discovery** - Finding diagnostic/prognostic markers

### 📚 Next Steps:

Build on these foundations:
- **Visualization** - Create plots with matplotlib/seaborn
- **Statistical testing** - Add scipy for significance testing
- **Machine learning** - Use scikit-learn for predictions
- **Advanced pandas** - Multi-index, time series, merging datasets
- **Large datasets** - Learn Dask/Polars for big data

**Keep this notebook!** These patterns are templates for analyzing your own research data.

**Happy analyzing!** 🧬💻🔬