# Lecture 3: Pandas Statistics - Step by Step

Learn how to calculate statistics from your data, building from simple to complex calculations.

## What We'll Learn:
1. Calculate one statistic (like mean) for one column
2. Calculate multiple statistics for one column
3. Calculate statistics for several columns
4. Calculate statistics by groups (like cancer type)
5. Complex grouped statistics

Let's start with our cancer dataset! 🧬

## Load Our Dataset

First, let's load our DepMap cancer dataset.

In [None]:
import pandas as pd
import numpy as np

# Load the DepMap CRISPR dataset
url = "https://zenodo.org/records/17098555/files/combined_model_crispr_data_filtered.csv?download=1"
df = pd.read_csv(url)

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df[['cell_line_name', 'oncotree_lineage', 'A1BG', 'A1CF']].head())

---

## Section 1: One Statistic for One Column

### Guided Example 1.1: Calculate the mean

The simplest statistic is the average (mean) of one column.

In [None]:
# Calculate the mean of the A1BG gene
a1bg_mean = df['A1BG'].mean()

print(f"Mean A1BG effect: {a1bg_mean}")
print(f"Rounded to 4 decimals: {a1bg_mean:.4f}")

**What's happening here?**
- `.mean()` calculates the average of all values in the column
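- Each value is a gene effect score for one cell line
- A negative score means knocking out the gene reduced cell growth; the more negative the score, the more that cell line depends on the gene
- A score near zero or above means the cell line did not depend on the gene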
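
One detail worth knowing before moving on: real columns often contain missing values. The sketch below uses a small made-up Series (the name `GENE_X` and its numbers are hypothetical, not DepMap values) to show that `.mean()` skips `NaN` by default.

```python
import numpy as np
import pandas as pd

# Hypothetical gene effect scores for five cell lines (made-up numbers, not DepMap values)
scores = pd.Series([-0.2, 0.1, np.nan, -0.4, 0.1], name="GENE_X")

# By default, .mean() skips missing values (NaN) and averages the rest
mean_skipping_nan = scores.mean()

# With skipna=False, a single NaN makes the whole result NaN
mean_keeping_nan = scores.mean(skipna=False)

print(f"Mean, NaN skipped: {mean_skipping_nan:.4f}")
print(f"Mean, NaN kept:    {mean_keeping_nan}")
```

If a mean looks surprising, checking how many values are missing with `scores.isna().sum()` is a reasonable first step.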

### Guided Example 1.2: Other basic statistics

In [None]:
# Calculate different statistics for A1BG
a1bg_median = df['A1BG'].median()
a1bg_std = df['A1BG'].std()
a1bg_min = df['A1BG'].min()
a1bg_max = df['A1BG'].max()

print("A1BG Statistics:")
print(f"Mean:   {a1bg_mean:.4f}")  # a1bg_mean was computed in Example 1.1
print(f"Median: {a1bg_median:.4f}")
print(f"Std:    {a1bg_std:.4f}")
print(f"Min:    {a1bg_min:.4f}")
print(f"Max:    {a1bg_max:.4f}")

**What's new here?**
- `.median()` - the middle value when sorted
- `.std()` - standard deviation (how spread out the values are)
- `.min()` - the smallest value
- `.max()` - the largest value
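
To build some intuition for these statistics, here is a minimal sketch on made-up numbers (not from the dataset). It shows how one outlier pulls the mean but not the median, and that pandas computes the sample standard deviation (`ddof=1`) unless told otherwise.

```python
import pandas as pd

# Hypothetical values with one outlier (10.0)
values = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])

mean_value = values.mean()      # pulled upward by the outlier
median_value = values.median()  # middle value of the sorted data

sample_std = values.std()            # pandas default: ddof=1 (divide by n - 1)
population_std = values.std(ddof=0)  # divide by n, which is NumPy's default

print(f"Mean:           {mean_value:.4f}")
print(f"Median:         {median_value:.4f}")
print(f"Sample std:     {sample_std:.4f}")
print(f"Population std: {population_std:.4f}")
```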

### Practice Example 1.1: Calculate mean for A1CF

Calculate the mean for the 'A1CF' gene column.

Print it rounded to 4 decimal places.

In [None]:
# YOUR CODE HERE: calculate mean for A1CF

### Practice Example 1.2: Find min and max for A2M

For the 'A2M' gene:
- Calculate the minimum value
- Calculate the maximum value
- Calculate the range (max - min)

In [None]:
# YOUR CODE HERE: find min, max, and range for A2M

---

## Section 2: Multiple Statistics at Once

### Guided Example 2.1: Using .describe()

Get many statistics at once with `.describe()`.

In [None]:
# Get all statistics for A1BG at once
a1bg_stats = df['A1BG'].describe()

print("Complete statistics for A1BG:")
print(a1bg_stats)

**What's happening here?**
- `.describe()` calculates many statistics at once
- Gives you: count, mean, std, min, 25%, 50% (median), 75%, max
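- The percentiles (25%, 50%, 75%) show how the values are distributed: a quarter of the values fall at or below the 25% mark, half at or below the median, and three quarters at or below the 75% mark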
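
The result of `.describe()` is itself a Series, so individual statistics can be pulled out by label. The sketch below uses a made-up Series (`GENE_X` is a hypothetical name) and also shows the `percentiles` parameter for requesting other cut points.

```python
import pandas as pd

# Hypothetical gene effect scores (made-up numbers)
scores = pd.Series([-0.3, -0.1, 0.0, 0.1, 0.3], name="GENE_X")

summary = scores.describe()
print(summary.index.tolist())  # the labels of the statistics

# Pull out a single statistic by its label
median_from_describe = summary["50%"]
print(f"Median via describe(): {median_from_describe:.4f}")
print(f"Median via median():   {scores.median():.4f}")

# Request the 10th and 90th percentiles in the summary
wide_summary = scores.describe(percentiles=[0.1, 0.9])
print(wide_summary[["10%", "90%"]])
```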

### Guided Example 2.2: Counting specific values

In [None]:
# Count how many values are negative, positive, or exactly zero
negative_count = (df['A1BG'] < 0).sum()
positive_count = (df['A1BG'] > 0).sum()
zero_count = (df['A1BG'] == 0).sum()

total = len(df['A1BG'])

print("A1BG value distribution:")
print(f"Negative: {negative_count} ({negative_count/total*100:.1f}%)")
print(f"Positive: {positive_count} ({positive_count/total*100:.1f}%)")
print(f"Zero:     {zero_count}")
print(f"Total:    {total}")

**What's new here?**
- `(df['col'] < 0).sum()` counts how many values are negative
- The condition `df['col'] < 0` creates a True/False value for every row
- `.sum()` counts the True values (True = 1, False = 0)
- Dividing a count by the total and multiplying by 100 gives a percentage
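
The counting pattern above can be sketched on a tiny made-up Series (hypothetical values). It also shows one thing to watch for: a missing value compares as False, and `len()` counts it while `.count()` does not, so the choice of denominator changes the percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value
scores = pd.Series([-0.2, 0.05, np.nan, -0.01, 0.3], name="GENE_X")

is_negative = scores < 0            # boolean Series; NaN compares as False
negative_count = is_negative.sum()  # True counts as 1, False as 0

n_rows = len(scores)       # all rows, including the NaN
n_valid = scores.count()   # non-missing values only

print(is_negative.tolist())
print(f"Negative: {negative_count} of {n_valid} non-missing values "
      f"({negative_count / n_valid * 100:.1f}%)")
print(f"Rows including NaN: {n_rows}")
```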

### Practice Example 2.1: Use describe for A1CF

Use `.describe()` to get all statistics for the A1CF gene.

What is the median value (50%)?

In [None]:
# YOUR CODE HERE: use describe on A1CF

### Practice Example 2.2: Count strong effects for A2M

For the A2M gene:
- Count how many values are less than -0.1 (strong negative effect)
- Count how many values are greater than 0.1 (strong positive effect)
- Print both counts

In [None]:
# YOUR CODE HERE: count strong effects for A2M

---

## Section 3: Statistics for Multiple Columns

### Guided Example 3.1: Mean for several genes

Calculate statistics for multiple columns at once.

In [None]:
# Select three genes and calculate their means
genes = ['A1BG', 'A1CF', 'A2M']
gene_means = df[genes].mean()

print("Mean effects for three genes:")
print(gene_means)

**What's happening here?**
- Select multiple columns with `df[['col1', 'col2', 'col3']]`
- `.mean()` calculates mean for each column
- Returns a Series with one mean per gene
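
Because the result is a Series indexed by column name, it can be sorted or searched like any other Series. A minimal sketch, assuming a made-up DataFrame (`df_demo` and the `GENE_*` names are hypothetical):

```python
import pandas as pd

# Hypothetical scores for three genes across three cell lines
df_demo = pd.DataFrame({
    "GENE_X": [-0.2, 0.0, 0.2],
    "GENE_Y": [-0.5, -0.3, -0.1],
    "GENE_Z": [0.1, 0.2, 0.3],
})
genes = ["GENE_X", "GENE_Y", "GENE_Z"]

column_means = df_demo[genes].mean()     # one mean per column (the default, axis=0)
row_means = df_demo[genes].mean(axis=1)  # one mean per row instead

print(column_means.sort_values())
print(f"Gene with the lowest mean: {column_means.idxmin()}")
print(row_means)
```

The same pattern works with `.median()`, `.std()`, `.min()`, and `.max()`.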

### Guided Example 3.2: Describe for multiple columns

In [None]:
# Get complete statistics for three genes
genes = ['A1BG', 'A1CF', 'A2M']
gene_stats = df[genes].describe()

print("Complete statistics for three genes:")
print(gene_stats.round(4))

**What's new here?**
- `.describe()` on multiple columns gives a table
- Each column shows statistics for one gene
- Each row shows one statistic (mean, std, min, etc.)
- `.round(4)` rounds all numbers to 4 decimal places
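
The table returned by `.describe()` is an ordinary DataFrame, so rows and cells can be selected with `.loc`. The sketch below assumes a small made-up DataFrame (`df_demo` and the `GENE_*` names are hypothetical) and also shows `.T`, which flips the table so each gene becomes a row.

```python
import pandas as pd

# Hypothetical scores for three genes across three cell lines
df_demo = pd.DataFrame({
    "GENE_X": [-0.2, 0.0, 0.2],
    "GENE_Y": [-0.5, -0.3, -0.1],
    "GENE_Z": [0.1, 0.2, 0.3],
})

gene_stats = df_demo.describe()

# Keep only the rows (statistics) of interest
print(gene_stats.loc[["mean", "std"]].round(4))

# Read a single cell: one statistic for one gene
print(gene_stats.loc["mean", "GENE_Y"])

# Transpose: one row per gene, one column per statistic
print(gene_stats.T.round(4))
```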

### Practice Example 3.1: Standard deviation for multiple genes

Calculate the standard deviation for these four genes:
- 'A1BG', 'A1CF', 'A2M', 'A2ML1'

Which gene has the highest variability (largest std)?

In [None]:
# YOUR CODE HERE: calculate std for four genes

### Practice Example 3.2: Compare three genes

For genes 'A1BG', 'A1CF', and 'A2M':
- Use `.describe()` to get all statistics
- Which gene has the most negative minimum value?
- Which gene has the highest maximum value?

In [None]:
# YOUR CODE HERE: describe three genes and compare

---

## Section 4: Statistics by Groups

### Guided Example 4.1: Mean by cancer type

Calculate statistics separately for each cancer type.

In [None]:
# Calculate mean A1BG for each cancer type
a1bg_by_cancer = df.groupby('oncotree_lineage')['A1BG'].mean()

print("Mean A1BG effect by cancer type:")
print(a1bg_by_cancer)

**What's happening here?**
- `.groupby('column')` groups rows by that column's values
- Then we select which column to analyze: `['A1BG']`
- Then calculate the statistic: `.mean()`
- The result has one mean per cancer type (here, one for Breast and one for Myeloid)
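
One way to see what `.groupby()` does is to compare it with filtering by hand. The sketch below uses a made-up DataFrame (the `lineage` labels "A" and "B" and the `GENE_X` values are hypothetical) and checks that the grouped mean for one group equals the mean of the rows filtered to that group.

```python
import pandas as pd

# Hypothetical data: two lineages, one gene
df_demo = pd.DataFrame({
    "lineage": ["A", "A", "B", "B", "B"],
    "GENE_X": [-0.2, -0.4, 0.1, 0.2, 0.3],
})

# Split rows by lineage, pick the column, then average within each group
mean_by_lineage = df_demo.groupby("lineage")["GENE_X"].mean()
print(mean_by_lineage)

# The same number for lineage "A", computed by filtering manually
manual_mean_a = df_demo.loc[df_demo["lineage"] == "A", "GENE_X"].mean()
print(f"Manual mean for A: {manual_mean_a:.4f}")
```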

### Guided Example 4.2: Multiple statistics by group

In [None]:
# Calculate several statistics for A1BG by cancer type
a1bg_grouped = df.groupby('oncotree_lineage')['A1BG'].agg(['mean', 'median', 'std', 'count'])

print("A1BG statistics by cancer type:")
print(a1bg_grouped.round(4))

**What's new here?**
- `.agg()` lets us calculate multiple statistics at once
- Pass a list of statistic names: `['mean', 'median', 'std', 'count']`
- Each row is a cancer type
- Each column is a statistic
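
When a column has missing values, it helps to know that `'count'` only counts non-missing entries, while `'size'` counts every row in the group. A minimal sketch on made-up data (the `lineage` labels and `GENE_X` values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score in lineage "A"
df_demo = pd.DataFrame({
    "lineage": ["A", "A", "A", "B", "B"],
    "GENE_X": [-0.2, np.nan, -0.4, 0.1, 0.3],
})

summary = df_demo.groupby("lineage")["GENE_X"].agg(["mean", "count", "size"])
print(summary.round(4))

# Read one statistic for one group from the result table
print(summary.loc["A", "count"])  # non-missing values in lineage A
print(summary.loc["A", "size"])   # all rows in lineage A
```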

### Practice Example 4.1: A1CF mean by cancer type

Calculate the mean A1CF effect for each cancer type.

Which cancer type has the more negative mean?

In [None]:
# YOUR CODE HERE: calculate A1CF mean by cancer type

### Practice Example 4.2: A2M statistics by cancer type

For the A2M gene, calculate by cancer type:
- mean
- min
- max
- count

Use `.agg()` to get all at once.

In [None]:
# YOUR CODE HERE: calculate multiple A2M statistics by cancer type

---

## Section 5: Multiple Genes and Groups

### Guided Example 5.1: Multiple genes by cancer type

Calculate statistics for several genes, grouped by cancer type.

In [None]:
# Calculate mean for three genes, by cancer type
genes = ['A1BG', 'A1CF', 'A2M']
multi_gene_means = df.groupby('oncotree_lineage')[genes].mean()

print("Mean effects by cancer type for three genes:")
print(multi_gene_means.round(4))

**What's happening here?**
- After groupby, we select multiple columns: `[['gene1', 'gene2', 'gene3']]`
- Each row is a cancer type
- Each column is a gene
- Values are the means for that cancer type and gene
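
The grouped result is a regular DataFrame, so a single value can be read with `.loc[row, column]`. The sketch below, on made-up data (`lineage` labels and `GENE_*` names are hypothetical), also shows what happens when several statistics are requested for several genes: the columns become (gene, statistic) pairs.

```python
import pandas as pd

# Hypothetical data: two lineages, two genes
df_demo = pd.DataFrame({
    "lineage": ["A", "A", "B", "B"],
    "GENE_X": [-0.2, -0.4, 0.1, 0.3],
    "GENE_Y": [0.0, 0.2, -0.6, -0.4],
})
genes = ["GENE_X", "GENE_Y"]

means = df_demo.groupby("lineage")[genes].mean()
print(means.round(4))
print(means.loc["B", "GENE_Y"])  # mean of GENE_Y within lineage B

# Several statistics for several genes: columns are (gene, statistic) pairs
per_gene_stats = df_demo.groupby("lineage")[genes].agg(["mean", "std"])
print(per_gene_stats.round(4))
print(per_gene_stats[("GENE_X", "mean")])  # select one pair with a tuple
```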

### Guided Example 5.2: Custom calculations by group

In [None]:
# Count how many cell lines have a strong negative A1BG effect per cancer type
def count_strong_effects(column):
    """Return the number of values below -0.1."""
    return (column < -0.1).sum()

strong_effects = df.groupby('oncotree_lineage')['A1BG'].agg([
    'count',
    'mean',
    ('strong_negative', count_strong_effects)
])

print("A1BG analysis by cancer type:")
print(strong_effects.round(4))

**What's new here?**
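- We can create custom statistics with functions
- Pass tuples to `.agg()`: `('name', function)`; the name becomes the column label in the result
- The function receives one group's values for the column and returns a single number
- Any calculation that reduces a column to one number can be used this way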
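
As an alternative to the tuple form above, pandas also supports named aggregation, where each keyword argument becomes an output column. This is a minimal sketch on made-up data (the `lineage` labels, `GENE_X` values, and the helper `count_strong_negative` are hypothetical), not a replacement for the example above.

```python
import pandas as pd

# Hypothetical data: two lineages, one gene
df_demo = pd.DataFrame({
    "lineage": ["A", "A", "A", "B", "B"],
    "GENE_X": [-0.2, -0.4, 0.0, 0.1, -0.3],
})

def count_strong_negative(column):
    """Return the number of values below -0.1."""
    return (column < -0.1).sum()

# Named aggregation: output_column_name=statistic
result = df_demo.groupby("lineage")["GENE_X"].agg(
    n="count",
    mean="mean",
    strong_negative=count_strong_negative,
)
print(result.round(4))
```

For a short one-off calculation, a lambda such as `strong_negative=lambda s: (s < -0.1).sum()` does the same job without a separate function.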

### Practice Example 5.1: Three genes by cancer type

Calculate the median for these genes by cancer type:
- 'A1BG', 'A1CF', 'A2M'

Which cancer type has the most negative median for A1CF?

In [None]:
# YOUR CODE HERE: calculate median for three genes by cancer type

### Practice Example 5.2: Count essential genes by cancer type

For each cancer type, count how many of these genes are "essential" (mean < -0.1):
- 'A1BG', 'A1CF', 'A2M', 'A2ML1'

Hint: Calculate means by cancer type first, then count how many are < -0.1

In [None]:
# YOUR CODE HERE: count essential genes by cancer type
# genes = ['A1BG', 'A1CF', 'A2M', 'A2ML1']
# means = df.groupby('oncotree_lineage')[genes].mean()
# essential_count = (means < -0.1).sum(axis=1)
# print(essential_count)

### Practice Example 5.3: Compare variability

For genes 'A1BG' and 'A1CF':
- Calculate standard deviation by cancer type
- Which gene shows more variability in Breast cancer?
- Which gene shows more variability in Myeloid cancer?

In [None]:
# YOUR CODE HERE: compare variability between genes and cancer types

---

## Summary

Congratulations! You've learned pandas statistics step by step:

**Section 1 - One statistic, one column:**
- ✅ `.mean()`, `.median()`, `.std()`, `.min()`, `.max()`

**Section 2 - Multiple statistics:**
- ✅ `.describe()` - get many statistics at once
- ✅ Count values with conditions

**Section 3 - Multiple columns:**
- ✅ Calculate statistics for several columns
- ✅ Compare statistics across genes

**Section 4 - Groups:**
- ✅ `.groupby()` - calculate by category
- ✅ `.agg()` - multiple statistics for groups

**Section 5 - Complex analysis:**
- ✅ Multiple genes and groups together
- ✅ Custom calculations

**Next Steps:**
- Combine statistics with filtering
- Use statistics to find interesting patterns
- Visualize your statistics

Keep analyzing! 📊🚀