# Lecture 4 Notebook 3: GroupBy Operations

**Learning Objectives:**
- Understand the split-apply-combine pattern
- Use `groupby()` to analyze data by categories
- Apply aggregation functions (mean, sum, count, etc.)
- Perform exploratory data analysis on gene expression data

---

## 1. Introduction to GroupBy

The **split-apply-combine** pattern is the idea behind `groupby()`. It has three steps:

1. **Split**: Divide data into groups based on a category
2. **Apply**: Perform a calculation on each group independently
3. **Combine**: Merge the results back together
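
The three steps can also be written out by hand, which makes it clearer what `groupby()` automates. This is a minimal sketch on a made-up table: the `toy` DataFrame and its `group` and `value` columns are illustrative and not part of the course data.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the pattern
toy = pd.DataFrame({
    'group': ['a', 'b', 'a', 'b'],
    'value': [1.0, 2.0, 3.0, 6.0],
})

# 1. Split: one sub-table per distinct group label
pieces = {label: toy[toy['group'] == label] for label in toy['group'].unique()}

# 2. Apply: compute the mean of `value` within each piece
means = {label: piece['value'].mean() for label, piece in pieces.items()}

# 3. Combine: collect the per-group results into one Series
manual_result = pd.Series(means).sort_index()

# groupby() performs all three steps in a single expression
groupby_result = toy.groupby('group')['value'].mean()

print(manual_result)
print(groupby_result)
```

Both printed Series should hold the same per-group means; the `groupby()` version is simply shorter and scales to many groups.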

This pattern answers questions like:
- What's the average gene expression in each cancer type?
- How many cell lines do we have per tissue?
- Which treatment group has the highest cell count?

In [1]:
import pandas as pd


## 2. Simple Example: Team Performance

Let's start with a simple dataset about team scores:

In [2]:
# Create sample data
data = {
    'player': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'team': ['Red', 'Blue', 'Red', 'Blue', 'Red', 'Blue'],
    'score': [85, 92, 78, 88, 95, 82],
    'games_played': [10, 12, 10, 11, 10, 12]
}

df = pd.DataFrame(data)
df

| | player | team | score | games_played |
|---|---|---|---|---|
| 0 | Alice | Red | 85 | 10 |
| 1 | Bob | Blue | 92 | 12 |
| 2 | Charlie | Red | 78 | 10 |
| 3 | David | Blue | 88 | 11 |
| 4 | Eve | Red | 95 | 10 |
| 5 | Frank | Blue | 82 | 12 |


### Question: What's the average score per team?

We can answer this with `groupby()`:

In [6]:
# Group by team and calculate mean score
df.groupby('team')['score'].mean()

team
Blue    87.333333
Red     86.000000
Name: score, dtype: float64

**What happened?**
1. **Split**: Pandas separated the data into Red team and Blue team
2. **Apply**: Calculated the mean of scores for each team
3. **Combine**: Returned a Series with one mean per team, indexed by team name
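
To look at the split step on its own, a GroupBy object can be iterated over: each iteration yields the group label and the matching rows. The sketch below rebuilds the team data under a separate name (`scores_df`, chosen here so it does not overwrite `df`) so that it runs on its own.

```python
import pandas as pd

# Same team data as above, rebuilt so this cell is self-contained
scores_df = pd.DataFrame({
    'player': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'team': ['Red', 'Blue', 'Red', 'Blue', 'Red', 'Blue'],
    'score': [85, 92, 78, 88, 95, 82],
})

# Iterating yields (group label, sub-DataFrame) pairs: this is the "split"
for team_name, team_rows in scores_df.groupby('team'):
    print(f"--- {team_name} ---")
    print(team_rows)

# get_group() pulls out a single group by its label
red_team = scores_df.groupby('team').get_group('Red')
print(red_team['score'].mean())
```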

### Multiple Statistics with `.agg()`

We can get multiple statistics at once:

In [7]:
# Get mean, max, and count for each team
df.groupby('team')['score'].agg(['mean', 'max', 'count'])

| team | mean | max | count |
|---|---|---|---|
| Blue | 87.333333 | 92 | 3 |
| Red | 86.000000 | 95 | 3 |
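
The column names in this result come straight from the function names. If more descriptive labels are wanted, `.agg()` also accepts keyword arguments (pandas calls this named aggregation). A sketch on the same team scores follows; the output labels `avg_score`, `best_score`, and `n_players` are made up for illustration.

```python
import pandas as pd

scores_df = pd.DataFrame({
    'team': ['Red', 'Blue', 'Red', 'Blue', 'Red', 'Blue'],
    'score': [85, 92, 78, 88, 95, 82],
})

# keyword = name of the output column, value = aggregation to run
summary = scores_df.groupby('team')['score'].agg(
    avg_score='mean',
    best_score='max',
    n_players='count',
)
print(summary)
```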


### Counting Group Members

Use `.size()` to count how many members are in each group:

In [8]:
# Count players per team
df.groupby('team').size()

team
Blue    3
Red     3
dtype: int64
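
`.size()` and `.count()` look similar but differ when values are missing: `.size()` counts the rows in each group, while `.count()` counts only the non-missing values in a column. A sketch with one score deliberately set to missing (the `with_gap` data is hypothetical):

```python
import numpy as np
import pandas as pd

with_gap = pd.DataFrame({
    'team': ['Red', 'Blue', 'Red', 'Blue'],
    'score': [85, 92, np.nan, 88],  # one missing score on the Red team
})

# size(): number of rows per group, missing values included
rows_per_team = with_gap.groupby('team').size()

# count(): number of non-missing scores per group
scores_per_team = with_gap.groupby('team')['score'].count()

print(rows_per_team)
print(scores_per_team)
```

On data without missing values, as in the team example above, the two give the same numbers.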

---

## 3. Try It Yourself: Cell Biology Experiment

Here's data from a cell biology experiment measuring cell viability under different treatments:

In [9]:
# Cell biology experiment data
cell_data = {
    'cell_line': ['HeLa', 'HeLa', 'HeLa', 'HEK293', 'HEK293', 'HEK293', 'MCF7', 'MCF7', 'MCF7'],
    'treatment': ['Control', 'Drug_A', 'Drug_B', 'Control', 'Drug_A', 'Drug_B', 'Control', 'Drug_A', 'Drug_B'],
    'viability': [98, 75, 82, 95, 68, 78, 97, 72, 85],
    'cell_count': [1000, 750, 820, 950, 680, 780, 970, 720, 850]
}

cells_df = pd.DataFrame(cell_data)
cells_df

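| | cell_line | treatment | viability | cell_count |
|---|---|---|---|---|
| 0 | HeLa | Control | 98 | 1000 |
| 1 | HeLa | Drug_A | 75 | 750 |
| 2 | HeLa | Drug_B | 82 | 820 |
| 3 | HEK293 | Control | 95 | 950 |
| 4 | HEK293 | Drug_A | 68 | 680 |
| 5 | HEK293 | Drug_B | 78 | 780 |
| 6 | MCF7 | Control | 97 | 970 |
| 7 | MCF7 | Drug_A | 72 | 720 |
| 8 | MCF7 | Drug_B | 85 | 850 |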


### 🧪 Exercise 1: Average Viability per Treatment

Calculate the mean viability for each treatment (Control, Drug_A, Drug_B) across all cell lines.

**Hint:** Use `groupby('treatment')['viability'].mean()`

In [10]:
# Your code here:


### 🧪 Exercise 2: Cell Count by Cell Line

Calculate the mean cell count for each cell line (HeLa, HEK293, MCF7).

**Hint:** Use `groupby('cell_line')['cell_count'].mean()`

In [11]:
# Your code here:


### 🧪 Exercise 3: Multiple Statistics

For each treatment, calculate the mean, standard deviation, and count of viability values.

**Hint:** Use `.agg(['mean', 'std', 'count'])`

In [12]:
# Your code here:


---

## 4. Real Data: DepMap Gene Expression Analysis

Now let's apply groupby to real cancer cell line data from DepMap!

In [14]:
# Load the gene expression data
url = "https://zenodo.org/records/17377786/files/expression_filtered.csv?download=1"
gene_df = pd.read_csv(url)

# Display first few rows
gene_df.head()

| | model_id | cell_line_name | stripped_cell_line_name | oncotree_lineage | oncotree_primary_disease | oncotree_subtype | oncotree_code | ccle_name | depmap_model_type | A1BG | ... | ZWILCH | ZWINT | ZXDA | ZXDB | ZXDC | ZYG11A | ZYG11B | ZYX | ZZEF1 | ZZZ3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACH-002401 | 21MT-2 | 21MT2 | Breast | Invasive Breast Carcinoma | Breast Invasive Ductal Carcinoma | IDC | 21MT2_BREAST | IDC | 0.646696 | ... | 4.290528 | 3.304012 | 1.321368 | 2.475321 | 2.634805 | 1.432083 | 2.173767 | 5.442602 | 2.618363 | 3.881381 |
| 1 | ACH-002399 | 21NT | 21NT | Breast | Invasive Breast Carcinoma | Breast Invasive Ductal Carcinoma | IDC | 21NT_BREAST | IDC | 0.597972 | ... | 4.659108 | 5.637216 | 2.059359 | 3.368145 | 3.210284 | 1.628363 | 2.892195 | 5.471226 | 2.946576 | 4.243463 |
| 2 | ACH-001683 | ACC-3133 | UACC3133 | Breast | Invasive Breast Carcinoma | Breast Invasive Lobular Carcinoma | ILC | UACC3133_BREAST | ILC | 3.62136 | ... | 5.147168 | 6.341126 | 1.470006 | 2.652376 | 2.958056 | 1.904889 | 2.652636 | 6.70371 | 2.667631 | 2.778488 |
| 3 | ACH-000557 | AML-193 | AML193 | Myeloid | Acute Myeloid Leukemia | AML with Myelodysplasia-Related Changes | AMLMRC | AML193_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE | AMLMRC | 3.333567 | ... | 4.126725 | 7.424527 | 0.773312 | 1.892076 | 4.531631 | 0.055397 | 1.880827 | 7.356346 | 2.775413 | 3.967814 |
| 4 | ACH-000248 | AU565 | AU565 | Breast | Invasive Breast Carcinoma | Invasive Breast Carcinoma | BRCA | AU565_BREAST | BRCA | 2.171889 | ... | 4.618109 | 7.147286 | 0.656105 | 2.342458 | 3.317148 | 0.01266 | 2.992381 | 7.515187 | 3.835247 | 4.026357 |


In [15]:
# Check the shape
print(f"Dataset shape: {gene_df.shape}")
print(f"Columns: {gene_df.columns.tolist()[:10]}...")  # Show first 10 columns

Dataset shape: (89, 17130)
Columns: ['model_id', 'cell_line_name', 'stripped_cell_line_name', 'oncotree_lineage', 'oncotree_primary_disease', 'oncotree_subtype', 'oncotree_code', 'ccle_name', 'depmap_model_type', 'A1BG']...


### How Many Cell Lines Per Cancer Type?

Let's count the number of cell lines in each cancer lineage:

In [16]:
# Count cell lines per lineage
lineage_counts = gene_df.groupby('oncotree_lineage').size().sort_values(ascending=False)
lineage_counts

oncotree_lineage
Breast     50
Myeloid    39
dtype: int64

**Interpretation:** The counts show how well each cancer type is represented. This filtered dataset contains only two lineages: Breast (50 cell lines) and Myeloid (39 cell lines).
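
For a plain count per category, `value_counts()` on the column is a common shorthand that also sorts from most to least frequent. A sketch on a small invented table (the `lineage_demo` DataFrame is hypothetical, so the cell runs without downloading the DepMap file):

```python
import pandas as pd

lineage_demo = pd.DataFrame({
    'oncotree_lineage': ['Breast', 'Myeloid', 'Breast', 'Breast', 'Myeloid'],
})

# The groupby route used above
via_groupby = (
    lineage_demo.groupby('oncotree_lineage')
    .size()
    .sort_values(ascending=False)
)

# The shorthand: counts per distinct value, sorted descending by default
via_value_counts = lineage_demo['oncotree_lineage'].value_counts()

print(via_groupby)
print(via_value_counts)
```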

### Average Gene Expression by Cancer Type

Let's look at BRCA1 expression across different cancer lineages:

In [17]:
# Mean BRCA1 expression per lineage
brca1_by_lineage = gene_df.groupby('oncotree_lineage')['BRCA1'].mean().sort_values(ascending=False)
brca1_by_lineage

oncotree_lineage
Myeloid    4.805774
Breast     3.736428
Name: BRCA1, dtype: float64

**Biological Insight:** Which cancer types have the highest BRCA1 expression? Does this make sense biologically?

### Comprehensive Statistics

Let's get mean, standard deviation, and count for BRCA1:

In [18]:
# Detailed statistics for BRCA1 by lineage
brca1_stats = gene_df.groupby('oncotree_lineage')['BRCA1'].agg([
    'mean',
    'std',
    'count'
]).round(2)

brca1_stats

| oncotree_lineage | mean | std | count |
|---|---|---|---|
| Breast | 3.74 | 1.05 | 50 |
| Myeloid | 4.81 | 0.63 | 39 |
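
One detail worth knowing when reading the `std` column: pandas reports the sample standard deviation (dividing by n - 1) by default, whereas NumPy's `np.std` divides by n unless `ddof=1` is passed. A sketch with made-up expression values (the `expr_demo` table is hypothetical):

```python
import numpy as np
import pandas as pd

expr_demo = pd.DataFrame({
    'lineage': ['A', 'A', 'A', 'B', 'B', 'B'],
    'expression': [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# pandas default: sample standard deviation (ddof=1)
pandas_std = expr_demo.groupby('lineage')['expression'].std()

# NumPy default: population standard deviation (ddof=0)
values_a = expr_demo.loc[expr_demo['lineage'] == 'A', 'expression'].to_numpy()
numpy_default = np.std(values_a)
numpy_sample = np.std(values_a, ddof=1)

print(pandas_std['A'], numpy_default, numpy_sample)
```

The pandas value for group `A` should match `numpy_sample`, not `numpy_default`. The gap between the two matters most for small groups.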


### 🧬 Exercise 4: Analyze TP53 Expression

TP53 is a crucial tumor suppressor gene. Calculate the mean TP53 expression for each cancer lineage and sort the result from highest to lowest expression.

**Hint:** Follow the same pattern as BRCA1 above

In [None]:
# Your code here:


### 🧬 Exercise 5: Compare Multiple Genes

Compare the mean expression of BRCA1, TP53, and MYC across cancer lineages.

**Hint:** Use `groupby('oncotree_lineage')[['BRCA1', 'TP53', 'MYC']].mean()`

In [None]:
# Your code here:


### 🧬 Exercise 6: Advanced - Different Stats per Gene

For BRCA1, get mean and std. For TP53, get mean and max. Use a dictionary with `.agg()`.

**Hint:** 
```python
gene_df.groupby('oncotree_lineage').agg({
    'BRCA1': ['mean', 'std'],
    'TP53': ['mean', 'max']
})
```
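
A dictionary-style `.agg()` returns columns with two levels (gene first, then statistic). The sketch below shows one way to read such a result. It uses an invented table with hypothetical columns `lineage`, `geneA`, and `geneB` rather than the DepMap data.

```python
import pandas as pd

demo = pd.DataFrame({
    'lineage': ['X', 'X', 'Y', 'Y'],
    'geneA': [1.0, 3.0, 2.0, 6.0],
    'geneB': [5.0, 7.0, 1.0, 2.0],
})

stats = demo.groupby('lineage').agg({
    'geneA': ['mean', 'std'],
    'geneB': ['mean', 'max'],
})

# Columns are (gene, statistic) pairs
print(stats.columns.tolist())

# Select one statistic with a tuple
gene_a_mean = stats[('geneA', 'mean')]

# Optionally flatten to single-level names such as 'geneA_mean'
flat = stats.copy()
flat.columns = ['_'.join(pair) for pair in flat.columns]
print(flat)
```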

In [None]:
# Your code here:


---

## 5. Key Takeaways

✅ **GroupBy** implements the split-apply-combine pattern

✅ Use `.groupby('column')` to split data by categories

✅ Common aggregations: `.mean()`, `.sum()`, `.count()`, `.std()`, `.max()`, `.min()`

✅ Use `.agg()` for multiple statistics

✅ Use `.size()` to count the rows in each group

✅ GroupBy is essential for comparative biology questions:
- How does X differ between groups?
- Which group has the highest/lowest Y?
- Are there patterns across categories?
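
Questions of the form "which group has the highest or lowest Y?" can be answered by chaining `idxmax()` or `idxmin()` onto the grouped result, which returns the group label rather than the value. A sketch that rebuilds part of the cell biology data from Section 3 under a new name (`counts_df`, chosen for this example):

```python
import pandas as pd

counts_df = pd.DataFrame({
    'treatment': ['Control', 'Drug_A', 'Drug_B'] * 3,
    'cell_count': [1000, 750, 820, 950, 680, 780, 970, 720, 850],
})

# Mean cell count per treatment
mean_counts = counts_df.groupby('treatment')['cell_count'].mean()

# idxmax()/idxmin() return the index label (here: the treatment name)
highest = mean_counts.idxmax()
lowest = mean_counts.idxmin()

print(mean_counts)
print(f"Highest mean cell count: {highest}")
print(f"Lowest mean cell count: {lowest}")
```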

---

## 🎯 Challenge: Explore on Your Own

Try answering these biological questions using groupby:

1. Which cancer lineage has the most variable (highest std) BRCA1 expression?
2. Are there any lineages with only 1 or 2 cell lines? (Use `.size()`)
3. Compare the average expression of your favorite gene across lineages
4. Find the lineage with the highest average expression across all genes (hint: this is tricky!)

---

**Next:** In the next notebook, we'll learn how to visualise data with matplotlib! 📊