# Lecture 3: Essential Statistical Methods with Pandas 📊

Learn how to calculate key statistics from your data using pandas' built-in methods.

## What We'll Learn:
1. **Column-wise Statistics** - Mean, median, standard deviation, min/max
2. **Selecting Gene Columns** - Work with specific subsets of data
3. **Gene Analysis Workflow** - Filter → Select → Calculate → Analyze
4. **Grouped Statistics** - Compare statistics across different categories
5. **Practice Questions** - Apply statistical methods to cancer research!

Let's explore statistical analysis with our cancer dataset! 🧬

## Load Our Dataset

Let's start with our familiar DepMap cancer dataset.

In [2]:
import pandas as pd
import numpy as np


# Load the DepMap CRISPR dataset
url = "https://zenodo.org/records/17098555/files/combined_model_crispr_data_filtered.csv?download=1"
df = pd.read_csv(url)

print(f"Dataset shape: {df.shape}")
print(f"Cancer types: {df['oncotree_lineage'].unique()}")
print(f"\nFirst few columns: {list(df.columns[:10])}")

Dataset shape: (94, 17211)
Cancer types: ['Myeloid' 'Breast']

First few columns: ['model_id', 'cell_line_name', 'stripped_cell_line_name', 'oncotree_lineage', 'oncotree_primary_disease', 'oncotree_subtype', 'A1BG', 'A1CF', 'A2M', 'A2ML1']


---
## 1. Column-wise Statistics 📈

### Basic Statistics for All Numeric Columns
Calculate statistics across all numeric columns at once.

In [None]:
# Calculate mean for all numeric columns
# First we have to filter out the non-numeric columns
numeric_df = df.select_dtypes(include=[np.number])
all_means = numeric_df.mean()
print(f"Calculated means for {len(all_means)} columns. First 10:")
print(all_means.head(10).round(4))


Calculated means for 17205 columns. First 10:
A1BG      -0.0549
A1CF      -0.0937
A2M        0.0298
A2ML1      0.0650
A3GALT2   -0.1147
A4GALT    -0.0387
A4GNT      0.0328
AAAS      -0.1159
AACS      -0.0456
AADAC      0.1086
dtype: float64
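
The filtering step above can also be folded into the `.mean()` call itself. The sketch below uses a tiny made-up table (the `toy` frame and the `GENE1`/`GENE2` columns are hypothetical stand-ins, not part of the DepMap data) to show that both routes should give the same per-column means.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the DepMap table: one text column plus two numeric "gene" columns
toy = pd.DataFrame({
    "cell_line_name": ["line_a", "line_b", "line_c"],
    "GENE1": [-0.2, 0.0, 0.2],
    "GENE2": [-1.0, -0.5, 0.0],
})

# Option 1: keep only the numeric columns, then average each column
means_via_select = toy.select_dtypes(include=[np.number]).mean()

# Option 2: let .mean() skip the non-numeric columns itself
means_via_flag = toy.mean(numeric_only=True)

print(means_via_select)
print(means_via_flag)
```

Keeping a separate `numeric_df`, as the lecture does, is handy when several statistics are computed on the same numeric subset.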


### Statistics for Specific Columns
Calculate statistics for selected genes of interest.

In [4]:
# Mean for specific columns
genes_of_interest = ['A1BG', 'A1CF', 'A2M']
specific_means = df[genes_of_interest].mean()

print("Mean effects for specific genes:")
for gene, mean_val in specific_means.items():
    print(f"{gene:<6}: {mean_val:8.4f}")

# Multiple statistics at once
print("\n=== DETAILED STATISTICS FOR SELECTED GENES ===")
detailed_stats = df[genes_of_interest].describe()
print(detailed_stats.round(4))

Mean effects for specific genes:
A1BG  :  -0.0549
A1CF  :  -0.0937
A2M   :   0.0298

=== DETAILED STATISTICS FOR SELECTED GENES ===
          A1BG     A1CF      A2M
count  94.0000  94.0000  94.0000
mean   -0.0549  -0.0937   0.0298
std     0.1413   0.1067   0.0994
min    -0.6263  -0.4517  -0.1861
25%    -0.1130  -0.1384  -0.0265
50%    -0.0709  -0.0916   0.0283
75%     0.0043  -0.0309   0.0814
max     0.4641   0.2108   0.4119


### Individual Statistical Methods
Explore different statistical measures available in pandas.

In [7]:
# Let's focus on A1BG gene for detailed analysis
gene = 'A1BG'
gene_data = df[gene]

print(f"=== STATISTICAL ANALYSIS OF {gene} GENE ===")
print(f"Mean:              {gene_data.mean():.4f}")
print(f"Median:            {gene_data.median():.4f}")
print(f"Standard Dev:      {gene_data.std():.4f}")
print(f"Variance:          {gene_data.var():.4f}")
print(f"Minimum:           {gene_data.min():.4f}")
print(f"Maximum:           {gene_data.max():.4f}")
print(f"Range:             {gene_data.max() - gene_data.min():.4f}")

# Quantiles
print("\n=== QUANTILES ===")
print(f"25th percentile:   {gene_data.quantile(0.25):.4f}")
print(f"50th percentile:   {gene_data.quantile(0.50):.4f}")
print(f"75th percentile:   {gene_data.quantile(0.75):.4f}")

# Count statistics
print("\n=== COUNTS ===")
print(f"Total values:      {gene_data.count()}")
print(f"Missing values:    {gene_data.isnull().sum()}")
print(f"Negative effects:  {(gene_data < 0).sum()}")
print(f"Positive effects:  {(gene_data > 0).sum()}")

=== STATISTICAL ANALYSIS OF A1BG GENE ===
Mean:              -0.0549
Median:            -0.0709
Standard Dev:      0.1413
Variance:          0.0200
Minimum:           -0.6263
Maximum:           0.4641
Range:             1.0904

=== QUANTILES ===
25th percentile:   -0.1130
50th percentile:   -0.0709
75th percentile:   0.0043

=== COUNTS ===
Total values:      94
Missing values:    0
Negative effects:  69
Positive effects:  25
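
One detail that may help when comparing these numbers with other tools: pandas' `.std()` and `.var()` use the sample formulas by default (`ddof=1`), whereas NumPy's `np.std` defaults to the population formula (`ddof=0`). The sketch below illustrates the difference on a short made-up series; the values are arbitrary and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Arbitrary illustrative values (not from the DepMap data)
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()                    # pandas default: ddof=1 (sample)
population_std = np.std(values.to_numpy())   # NumPy default: ddof=0 (population)
sample_var = values.var()                    # also ddof=1, equals sample_std squared

print(f"pandas std (ddof=1): {sample_std:.4f}")
print(f"NumPy std (ddof=0):  {population_std:.4f}")
print(f"pandas var (ddof=1): {sample_var:.4f}")

# The median and the 50th percentile are the same quantity
print(values.median() == values.quantile(0.50))
```

Passing `ddof=0` to `.std()` reproduces the NumPy default if a population estimate is what you need.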


---
## 2. Our Gene Analysis Workflow 🔬

Let's implement the complete workflow shown in the lecture slides!

### Step 1: Filter for Breast Cancer

In [10]:
# Step 1: Filter for breast cancer (we already learned this!)
breast_df = df.loc[df['oncotree_lineage'] == 'Breast']
print(f"Found {len(breast_df)} breast cancer cell lines")
print(breast_df[['cell_line_name', 'oncotree_primary_disease']].head())

Found 53 breast cancer cell lines
  cell_line_name   oncotree_primary_disease
2        SK-BR-3  Invasive Breast Carcinoma
3           MCF7  Invasive Breast Carcinoma
4          KPL-1  Invasive Breast Carcinoma
8        ZR-75-1  Invasive Breast Carcinoma
9        HCC1187  Invasive Breast Carcinoma


### Step 2: Select Only Gene Effect Columns

In [8]:
# Step 2: Select only gene effect columns
metadata_cols = ['model_id', 'cell_line_name', 'stripped_cell_line_name', 
                'oncotree_lineage', 'oncotree_primary_disease', 'oncotree_subtype']
gene_columns = [col for col in df.columns if col not in metadata_cols]

print(f"Analyzing {len(gene_columns)} genes")
print(f"First few genes: {gene_columns[:5]}")

# Alternative approach using column slicing
gene_columns_slice = df.columns[6:]  # Skip first 6 metadata columns
print(f"\nUsing slicing: {len(gene_columns_slice)} genes")

Analyzing 17205 genes
First few genes: ['A1BG', 'A1CF', 'A2M', 'A2ML1', 'A3GALT2']

Using slicing: 17205 genes
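
Both approaches give the same answer here, but they rely on different assumptions. The sketch below, on a made-up four-column table (`toy`, `toy_metadata_cols`, and the `GENE*` names are hypothetical), shows that position-based slicing only works while every metadata column sits at the front, whereas name-based selection does not depend on column order.

```python
import pandas as pd

# Toy table with two metadata columns followed by two "gene" columns
toy = pd.DataFrame({
    "model_id": ["M1", "M2"],
    "oncotree_lineage": ["Breast", "Myeloid"],
    "GENE1": [-0.1, 0.2],
    "GENE2": [0.3, -0.4],
})
toy_metadata_cols = ["model_id", "oncotree_lineage"]

# Three ways to get the gene column names
by_name = [col for col in toy.columns if col not in toy_metadata_cols]
by_position = list(toy.columns[len(toy_metadata_cols):])
by_drop = list(toy.drop(columns=toy_metadata_cols).columns)
print(by_name, by_position, by_drop)

# If the column order changes, slicing by position picks up a metadata column
shuffled = toy[["model_id", "GENE1", "oncotree_lineage", "GENE2"]]
by_position_shuffled = list(shuffled.columns[len(toy_metadata_cols):])
by_name_shuffled = [col for col in shuffled.columns if col not in toy_metadata_cols]
print(by_position_shuffled)
print(by_name_shuffled)
```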


### Step 3: Calculate Mean Gene Effect for Each Gene

In [11]:
# Step 3: Calculate mean gene effect for each gene across all breast cancer lines
breast_gene_means = breast_df[gene_columns].mean()

print("Sample of gene means:")
print(breast_gene_means.head(10).round(4))

# Example output matching the slides
print("\n=== EXAMPLE OUTPUT (matching slides) ===")
example_genes = ['A1BG', 'A1CF', 'A2M', 'A2ML1', 'A4GALT']
if all(gene in breast_gene_means.index for gene in example_genes):
    for gene in example_genes:
        print(f"{gene:<8} {breast_gene_means[gene]:8.3f}")
else:
    print("Some example genes not found, showing available genes:")
    for gene, mean_val in breast_gene_means.head(5).items():
        print(f"{gene:<8} {mean_val:8.3f}")

Sample of gene means:
A1BG      -0.0397
A1CF      -0.1229
A2M        0.0211
A2ML1      0.0764
A3GALT2   -0.1199
A4GALT    -0.0199
A4GNT      0.0304
AAAS      -0.1467
AACS      -0.0558
AADAC      0.1150
dtype: float64

=== EXAMPLE OUTPUT (matching slides) ===
A1BG       -0.040
A1CF       -0.123
A2M         0.021
A2ML1       0.076
A4GALT     -0.020


---
## 3. Grouped Statistics by Cancer Type 🔍

### Compare Statistics Across Cancer Types
Use `.groupby()` to calculate statistics for different cancer types.

In [12]:
# Group by cancer type and calculate mean for selected genes
selected_genes = ['A1BG', 'A1CF', 'A2M']
grouped_means = df.groupby('oncotree_lineage')[selected_genes].mean()

print("Gene effect means by cancer type:")
print(grouped_means.round(4))

# Calculate other statistics by group
print("\nStandard deviation by cancer type:")
grouped_std = df.groupby('oncotree_lineage')[selected_genes].std()
print(grouped_std.round(4))

# Count of cell lines per cancer type
print("\nNumber of cell lines per cancer type:")
print(df['oncotree_lineage'].value_counts())

Gene effect means by cancer type:
                    A1BG    A1CF     A2M
oncotree_lineage                        
Breast           -0.0397 -0.1229  0.0211
Myeloid          -0.0746 -0.0560  0.0411

Standard deviation by cancer type:
                    A1BG    A1CF     A2M
oncotree_lineage                        
Breast            0.1223  0.1086  0.0808
Myeloid           0.1620  0.0924  0.1193

Number of cell lines per cancer type:
oncotree_lineage
Breast     53
Myeloid    41
Name: count, dtype: int64
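
The `.groupby()` call is effectively the filter-then-mean workflow from Step 1 and Step 3 repeated for every category at once. A minimal sketch on made-up data (the `toy` frame and `GENE1`/`GENE2` columns are hypothetical) compares the two routes for one group:

```python
import pandas as pd

# Toy data: two "Breast" rows and three "Myeloid" rows
toy = pd.DataFrame({
    "oncotree_lineage": ["Breast", "Breast", "Myeloid", "Myeloid", "Myeloid"],
    "GENE1": [-0.2, -0.4, 0.1, 0.2, 0.3],
    "GENE2": [0.0, 0.2, -0.5, -0.3, -0.1],
})
gene_cols = ["GENE1", "GENE2"]

# Route 1: filter one cancer type, then average its gene columns
filtered_means = toy.loc[toy["oncotree_lineage"] == "Breast", gene_cols].mean()

# Route 2: group by cancer type and average every group in one call
grouped = toy.groupby("oncotree_lineage")[gene_cols].mean()

print(grouped)
print(filtered_means)  # should match the "Breast" row of grouped
```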


### Advanced Grouped Statistics

In [13]:
# Multiple statistics at once using .agg()
stats_summary = df.groupby('oncotree_lineage')['A1BG'].agg([
    'count', 'mean', 'median', 'std', 'min', 'max'
]).round(4)

print("Comprehensive A1BG statistics by cancer type:")
print(stats_summary)

# Custom statistics using lambda functions
custom_stats = df.groupby('oncotree_lineage')['A1BG'].agg([
    ('mean', 'mean'),
    ('range', lambda x: x.max() - x.min()),
    ('negative_count', lambda x: (x < 0).sum()),
    ('essential_count', lambda x: (x < -0.1).sum())
]).round(4)

print("\nCustom statistics for A1BG:")
print(custom_stats)

Comprehensive A1BG statistics by cancer type:
                  count    mean  median     std     min     max
oncotree_lineage                                               
Breast               53 -0.0397 -0.0439  0.1223 -0.3328  0.2573
Myeloid              41 -0.0746 -0.0814  0.1620 -0.6263  0.4641

Custom statistics for A1BG:
                    mean   range  negative_count  essential_count
oncotree_lineage                                                 
Breast           -0.0397  0.5901              36               14
Myeloid          -0.0746  1.0904              33               15
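
As an alternative to the list of `(name, function)` tuples, pandas also supports named aggregation with keyword arguments, which some readers find easier to scan. The sketch below runs on made-up data; `value_range` and `count_below` are hypothetical helper functions introduced only for this illustration.

```python
import pandas as pd

# Toy data: three "Breast" rows and two "Myeloid" rows
toy = pd.DataFrame({
    "oncotree_lineage": ["Breast", "Breast", "Breast", "Myeloid", "Myeloid"],
    "GENE1": [-0.30, -0.05, 0.10, -0.20, -0.15],
})

def value_range(s):
    """Spread between the largest and smallest value in the group."""
    return s.max() - s.min()

def count_below(s, threshold=-0.1):
    """Number of values in the group below the threshold."""
    return (s < threshold).sum()

# Keyword names become the output column names
summary = toy.groupby("oncotree_lineage")["GENE1"].agg(
    mean="mean",
    value_range=value_range,
    below_threshold=count_below,
)
print(summary)
```

Named functions also make the intent of each column clearer than anonymous lambdas when the summary grows.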


---
## 📚 Practice Questions

Now it's your turn! Apply statistical methods to analyze the cancer dataset.

### Question 1: Basic Statistics
Calculate the mean, median, and standard deviation for the A2M gene. Which is larger, the mean or the median? What does this suggest about the shape of the distribution?

In [None]:
# Your answer here:
# a2m_mean = ?
# a2m_median = ?
# a2m_std = ?
# print(f"A2M - Mean: {a2m_mean:.4f}, Median: {a2m_median:.4f}, Std: {a2m_std:.4f}")
# print(f"Distribution is: {'right-skewed' if a2m_mean > a2m_median else 'left-skewed' if a2m_mean < a2m_median else 'symmetric'}")


### Question 2: Gene Selection and Analysis
Select the first 5 gene columns (after metadata) and calculate their correlation matrix. Which two genes are most positively correlated?

In [None]:
# Your answer here:
# first_5_genes = ?
# correlation_matrix = ?
# print("Correlation matrix:")
# print(correlation_matrix.round(3))
# Find highest correlation (excluding diagonal)


### Question 3: Cancer Type Comparison
For the A1CF gene, calculate the mean and standard deviation for each cancer type. Which cancer type shows more variability (higher standard deviation)?

In [None]:
# Your answer here:
# a1cf_by_cancer = ?
# print("A1CF statistics by cancer type:")
# print(a1cf_by_cancer)


### Question 4: Essential Gene Analysis
Create a function that identifies "essential genes" (mean effect < -0.1) for a given cancer type. Apply it to find essential genes in breast cancer. How many essential genes are there?

In [None]:
# Your answer here:
# def find_essential_genes(cancer_type, threshold=-0.1):
#     # Filter for cancer type
#     # Calculate gene means
#     # Find genes below threshold
#     # return essential_genes

# breast_essential = find_essential_genes('Breast')
# print(f"Essential genes in breast cancer: {len(breast_essential)}")
# print("Most essential (lowest 5):")
# print(breast_essential.head().round(3))


### Question 5: Advanced Challenge
Create a comprehensive statistical report:
1. For each cancer type, find the gene with the most negative mean effect
2. Calculate what percentage of genes are "essential" (< -0.1) in each cancer type
3. Identify genes that are essential in one cancer type (mean effect < -0.1) but clearly non-essential in the other (mean effect > -0.05)

In [None]:
# Your answer here:
# Step 1: Most negative gene per cancer type
# grouped_means = ?
# most_negative_per_cancer = ?

# Step 2: Percentage of essential genes
# essential_percentages = ?

# Step 3: Cancer-specific essential genes
# breast_essential = ?
# myeloid_essential = ?
# breast_specific = ?
# myeloid_specific = ?


---
## 🎯 Solutions

Try the questions above first, then run these cells to check your answers!

In [None]:
# Solution 1
a2m_mean = df['A2M'].mean()
a2m_median = df['A2M'].median()
a2m_std = df['A2M'].std()
print("Solution 1 - A2M statistics:")
print(f"Mean: {a2m_mean:.4f}, Median: {a2m_median:.4f}, Std: {a2m_std:.4f}")
print(f"Distribution is: {'right-skewed' if a2m_mean > a2m_median else 'left-skewed' if a2m_mean < a2m_median else 'symmetric'}")
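
The mean-versus-median comparison used above is a rule of thumb rather than a formal test: a mean above the median often, but not always, points to a longer right tail. The sketch below uses two made-up series (not taken from the dataset) and also prints pandas' `.skew()`, which gives a direct sample-skewness estimate to cross-check against.

```python
import pandas as pd

# Made-up values: one balanced series and one with a single large positive value
symmetric = pd.Series([-0.2, -0.1, 0.0, 0.1, 0.2])
right_tailed = pd.Series([-0.2, -0.1, 0.0, 0.1, 1.2])

for name, s in [("symmetric", symmetric), ("right_tailed", right_tailed)]:
    print(f"{name:<13} mean={s.mean():7.3f}  median={s.median():7.3f}  skew={s.skew():7.3f}")
```

With only 94 cell lines, small differences between mean and median are best read as hints, ideally confirmed with a histogram.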

In [None]:
# Solution 2
first_5_genes = df.columns[6:11]  # First 5 gene columns after metadata
correlation_matrix = df[first_5_genes].corr()
print("Solution 2 - Correlation matrix:")
print(correlation_matrix.round(3))

# Find highest correlation (excluding diagonal)
corr_values = correlation_matrix.to_numpy(copy=True)  # Work on a copy so the DataFrame is not modified
np.fill_diagonal(corr_values, np.nan)  # Mask the diagonal (each gene with itself)
max_corr_idx = np.unravel_index(np.nanargmax(corr_values), corr_values.shape)
gene1, gene2 = first_5_genes[max_corr_idx[0]], first_5_genes[max_corr_idx[1]]
max_corr = corr_values[max_corr_idx]
print(f"\nHighest correlation: {gene1} & {gene2} ({max_corr:.3f})")
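
The same search can be written with pandas alone by masking everything except the upper triangle and reshaping the matrix into a series of gene pairs. This is only a sketch on made-up data (`GENE_A`, `GENE_B`, and `GENE_C` are hypothetical), not the approach used in the lecture.

```python
import numpy as np
import pandas as pd

# Made-up data: GENE_B loosely tracks GENE_A, GENE_C moves in the opposite direction
toy = pd.DataFrame({
    "GENE_A": [0.1, 0.2, 0.3, 0.4, 0.5],
    "GENE_B": [0.2, 0.1, 0.4, 0.3, 0.6],
    "GENE_C": [0.5, 0.4, 0.3, 0.2, 0.1],
})
corr = toy.corr()

# True only strictly above the diagonal, so each pair appears exactly once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Reshape to a Series indexed by (gene, gene) pairs and drop the masked entries
pairs = corr.where(upper).unstack().dropna()

best_pair = pairs.idxmax()
print(pairs.round(3))
print(f"Highest correlation: {best_pair[0]} & {best_pair[1]} ({pairs.max():.3f})")
```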

In [None]:
# Solution 3
a1cf_by_cancer = df.groupby('oncotree_lineage')['A1CF'].agg(['mean', 'std']).round(4)
print("Solution 3 - A1CF statistics by cancer type:")
print(a1cf_by_cancer)
most_variable = a1cf_by_cancer['std'].idxmax()
print(f"\nMost variable cancer type: {most_variable} (std = {a1cf_by_cancer.loc[most_variable, 'std']:.4f})")

In [None]:
# Solution 4
def find_essential_genes(cancer_type, threshold=-0.1):
    # Filter for cancer type
    cancer_df = df[df['oncotree_lineage'] == cancer_type]
    # Calculate gene means
    gene_means = cancer_df[gene_columns].mean()
    # Find genes below threshold
    essential_genes = gene_means[gene_means < threshold].sort_values()
    return essential_genes

breast_essential = find_essential_genes('Breast')
print(f"Solution 4 - Essential genes in breast cancer: {len(breast_essential)}")
print("Most essential (lowest 5):")
print(breast_essential.head().round(3))
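
The solution above reads `df` and `gene_columns` from the notebook's global scope. A variant that takes them as arguments is easier to test on small inputs. The sketch below is one possible shape for that; `find_essential_genes_in`, `toy`, and `toy_genes` are hypothetical names used only for this illustration.

```python
import pandas as pd

def find_essential_genes_in(data, cancer_type, gene_cols, threshold=-0.1):
    """Hypothetical variant of find_essential_genes with explicit inputs."""
    # Filter rows for the cancer type and keep only the gene columns
    subset = data.loc[data["oncotree_lineage"] == cancer_type, gene_cols]
    # Average each gene across the selected cell lines
    gene_means = subset.mean()
    # Keep genes whose mean is below the threshold, most negative first
    return gene_means[gene_means < threshold].sort_values()

# Toy data: two "Breast" rows and one "Myeloid" row
toy = pd.DataFrame({
    "oncotree_lineage": ["Breast", "Breast", "Myeloid"],
    "GENE1": [-0.5, -0.3, 0.1],
    "GENE2": [-0.1, 0.1, -0.9],
    "GENE3": [-0.2, -0.2, 0.0],
})
toy_genes = ["GENE1", "GENE2", "GENE3"]

breast_hits = find_essential_genes_in(toy, "Breast", toy_genes)
myeloid_hits = find_essential_genes_in(toy, "Myeloid", toy_genes)
print(breast_hits)
print(myeloid_hits)
```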

In [None]:
# Solution 5
print("Solution 5 - Comprehensive Statistical Report")
print("=" * 50)

# Step 1: Most negative gene per cancer type
grouped_means = df.groupby('oncotree_lineage')[gene_columns].mean()
most_negative_per_cancer = grouped_means.min(axis=1)
most_negative_genes = grouped_means.idxmin(axis=1)

print("\n1. Most essential gene per cancer type:")
for cancer in most_negative_per_cancer.index:
    gene = most_negative_genes[cancer]
    effect = most_negative_per_cancer[cancer]
    print(f"   {cancer}: {gene} ({effect:.4f})")

# Step 2: Percentage of essential genes
print("\n2. Percentage of essential genes (< -0.1):")
for cancer in grouped_means.index:
    essential_count = (grouped_means.loc[cancer] < -0.1).sum()
    total_genes = len(grouped_means.columns)
    percentage = (essential_count / total_genes) * 100
    print(f"   {cancer}: {essential_count}/{total_genes} ({percentage:.1f}%)")

# Step 3: Cancer-specific essential genes
breast_essential_genes = set(grouped_means.loc['Breast'][grouped_means.loc['Breast'] < -0.1].index)
myeloid_essential_genes = set(grouped_means.loc['Myeloid'][grouped_means.loc['Myeloid'] < -0.1].index)
breast_nonessential = set(grouped_means.loc['Breast'][grouped_means.loc['Breast'] > -0.05].index)
myeloid_nonessential = set(grouped_means.loc['Myeloid'][grouped_means.loc['Myeloid'] > -0.05].index)

breast_specific = breast_essential_genes & myeloid_nonessential
myeloid_specific = myeloid_essential_genes & breast_nonessential

print("\n3. Cancer-specific essential genes:")
print(f"   Breast-specific essential: {len(breast_specific)}")
print(f"   Myeloid-specific essential: {len(myeloid_specific)}")
# Sets are unordered, so sort to get the same examples on every run
if breast_specific:
    print(f"   Example breast-specific: {sorted(breast_specific)[:5]}")
if myeloid_specific:
    print(f"   Example myeloid-specific: {sorted(myeloid_specific)[:5]}")

---
## 🎊 Congratulations!

You've mastered statistical analysis with pandas! Here's what you learned:

✅ **Column-wise Statistics**: `.mean()`, `.median()`, `.std()`, `.min()`, `.max()`  
✅ **Gene Selection**: Separating metadata from gene data  
✅ **Analysis Workflow**: Filter → Select → Calculate → Analyze  
✅ **Grouped Statistics**: `.groupby()` for comparing categories  
✅ **Advanced Methods**: `.agg()`, custom functions, correlations  
✅ **Real Applications**: Finding essential genes, cancer comparisons  

**Key Statistical Insights**:
- Different cancer types show unique genetic vulnerabilities
- Mean vs median comparisons reveal distribution shapes
- Standard deviation measures variability within groups
- Correlation analysis reveals gene relationships

**Next Steps**: 
- Explore more advanced statistical tests
- Learn about data visualization techniques
- Apply these methods to your own research datasets

Keep analyzing! 📊🚀