# Lecture 3: Pandas Sorting - Step by Step

Learn how to organize your data by sorting, building from simple to complex operations.

## What We'll Learn:
1. Sort by one column (ascending and descending)
2. Sort by multiple columns
3. Find top N and bottom N values quickly
4. Sort within groups
5. Rank values

Let's organize our cancer dataset! 🧬

## Load Our Dataset

First, let's load our DepMap cancer dataset.

In [None]:
import pandas as pd
import numpy as np

# Load the DepMap CRISPR dataset
url = "https://zenodo.org/records/17098555/files/combined_model_crispr_data_filtered.csv?download=1"
df = pd.read_csv(url)

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
print(df[['cell_line_name', 'oncotree_lineage', 'A1BG', 'A1CF']].head())

---

## Section 1: Sort by One Column

### Guided Example 1.1: Sort alphabetically (ascending)

The simplest sorting - organize by one column in ascending order (A to Z).

In [None]:
# Sort cell lines alphabetically by name
sorted_names = df.sort_values('cell_line_name')

print("First 5 cell lines alphabetically:")
print(sorted_names[['cell_line_name', 'oncotree_lineage']].head())

**What's happening here?**
- `.sort_values('column_name')` sorts the entire DataFrame by that column
- Default is ascending order (A to Z, or smallest to largest)
- Returns a new sorted DataFrame (doesn't change the original)
- The entire row moves together when sorting
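
To check the last two points on a tiny table, here is a minimal sketch. The `toy` DataFrame, its cell line names, and its values are made up for illustration and are not taken from the DepMap file.

```python
import pandas as pd

# Hypothetical toy data, not from the DepMap dataset
toy = pd.DataFrame({
    "cell_line_name": ["line_C", "line_A", "line_B"],
    "A1BG": [0.10, -0.25, 0.03],
})

toy_sorted = toy.sort_values("cell_line_name")

# The original keeps its order; the sorted result is a new DataFrame
print(toy["cell_line_name"].tolist())         # ['line_C', 'line_A', 'line_B']
print(toy_sorted["cell_line_name"].tolist())  # ['line_A', 'line_B', 'line_C']

# Rows move as a unit: each name keeps its own A1BG value and index label
print(toy_sorted)
```

Notice that the index labels (1, 2, 0) travel with their rows, which is how you can tell the rows were reordered rather than relabeled.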

### Guided Example 1.2: Sort numerically (ascending and descending)

Sort by numbers, both smallest first and largest first.

In [None]:
# Sort by A1BG - ascending (most negative first)
sorted_ascending = df.sort_values('A1BG', ascending=True)

print("Cell lines with most negative A1BG (ascending):")
print(sorted_ascending[['cell_line_name', 'oncotree_lineage', 'A1BG']].head())

# Sort by A1BG - descending (most positive first)
sorted_descending = df.sort_values('A1BG', ascending=False)

print("\nCell lines with most positive A1BG (descending):")
print(sorted_descending[['cell_line_name', 'oncotree_lineage', 'A1BG']].head())

**What's new here?**
- `ascending=True` - smallest to largest (default)
- `ascending=False` - largest to smallest
- For gene effects: negative values (essential genes) come first with ascending=True
- Remember: ascending=True means the values go up (from -1 to 0 to 1)
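
The two directions are easy to compare side by side on a handful of numbers. This is a small sketch with invented values; the `gene_effect` column name is a placeholder, not a column in our dataset.

```python
import pandas as pd

# Hypothetical toy data with made-up gene effect scores
toy = pd.DataFrame({
    "cell_line_name": ["line_A", "line_B", "line_C", "line_D"],
    "gene_effect": [0.2, -1.1, 0.0, -0.4],
})

ascending = toy.sort_values("gene_effect")  # ascending=True is the default
descending = toy.sort_values("gene_effect", ascending=False)

print(ascending["gene_effect"].tolist())   # [-1.1, -0.4, 0.0, 0.2]
print(descending["gene_effect"].tolist())  # [0.2, 0.0, -0.4, -1.1]
```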

### Practice Example 1.1: Sort by A1CF

Sort the DataFrame by the 'A1CF' column in ascending order.

Show the first 5 rows with columns: cell_line_name, oncotree_lineage, A1CF

In [None]:
# YOUR CODE HERE: sort by A1CF ascending

### Practice Example 1.2: Sort descending

Sort by 'A2M' in descending order (largest values first).

Show the top 5 cell lines.

In [None]:
# YOUR CODE HERE: sort by A2M descending

---

## Section 2: Sort by Multiple Columns

### Guided Example 2.1: Sort by two columns (same order)

Sort by cancer type first, then by cell line name within each type.

In [None]:
# Sort by cancer type, then by name (both ascending)
sorted_multi = df.sort_values(['oncotree_lineage', 'cell_line_name'])

print("Sorted by cancer type, then by name:")
print(sorted_multi[['oncotree_lineage', 'cell_line_name', 'A1BG']].head(10))

**What's happening here?**
- Pass a list of columns: `['column1', 'column2']`
- Sorts by first column, then breaks ties with second column
- All Breast cell lines are grouped together, sorted alphabetically by name
- Then come all Myeloid cell lines, also sorted alphabetically by name
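
The tie-breaking behavior is easier to see on four rows. A minimal sketch, assuming made-up cell line names (the column names match our dataset, the values do not):

```python
import pandas as pd

# Hypothetical toy data: two lineages, two cell lines each
toy = pd.DataFrame({
    "oncotree_lineage": ["Myeloid", "Breast", "Myeloid", "Breast"],
    "cell_line_name": ["line_D", "line_B", "line_C", "line_A"],
})

# Lineage decides the main order; names only break ties within a lineage
toy_sorted = toy.sort_values(["oncotree_lineage", "cell_line_name"])
print(toy_sorted)
```

The result lists Breast/line_A, Breast/line_B, Myeloid/line_C, Myeloid/line_D, in that order.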

### Guided Example 2.2: Different orders for different columns

Sort by cancer type (A to Z), but within each type, sort by gene effect (largest to smallest).

In [None]:
# Sort by cancer type (ascending) and A1BG (descending)
sorted_mixed = df.sort_values(
    ['oncotree_lineage', 'A1BG'],
    ascending=[True, False]  # First ascending, second descending
)

print("Each cancer type, showing highest A1BG values first:")
print(sorted_mixed[['cell_line_name', 'oncotree_lineage', 'A1BG']].head(10))

**What's new here?**
- `ascending=[True, False]` - a list matching the column list
- First column (cancer type) sorted ascending (A to Z)
- Second column (A1BG) sorted descending (largest to smallest)
- Within Breast cancer, cells with highest A1BG appear first
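
Here is the same idea on a toy table, so you can verify the order by eye. The A1BG values below are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data with made-up A1BG values
toy = pd.DataFrame({
    "oncotree_lineage": ["Myeloid", "Breast", "Myeloid", "Breast"],
    "A1BG": [0.3, -0.2, -0.5, 0.1],
})

# One True/False per column, in the same order as the column list
toy_sorted = toy.sort_values(
    ["oncotree_lineage", "A1BG"],
    ascending=[True, False],
)

print(toy_sorted["oncotree_lineage"].tolist())  # ['Breast', 'Breast', 'Myeloid', 'Myeloid']
print(toy_sorted["A1BG"].tolist())              # [0.1, -0.2, 0.3, -0.5]
```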

### Practice Example 2.1: Sort by cancer type and A1CF

Sort by:
- 'oncotree_lineage' (ascending)
- 'A1CF' (ascending)

Show the first 8 rows.

In [None]:
# YOUR CODE HERE: sort by cancer type and A1CF

### Practice Example 2.2: Three columns with different orders

Sort by:
- 'oncotree_lineage' (ascending)
- 'A1BG' (descending)
- 'A1CF' (descending)

Show the first 10 rows.

In [None]:
# YOUR CODE HERE: sort by three columns

---

## Section 3: Finding Top/Bottom N Values

### Guided Example 3.1: Using nlargest()

A quick way to find the largest values without sorting everything.

In [None]:
# Find 5 cell lines with largest A1BG values
top_5_a1bg = df.nlargest(5, 'A1BG')

print("Top 5 highest A1BG values:")
print(top_5_a1bg[['cell_line_name', 'oncotree_lineage', 'A1BG']])

**What's happening here?**
- `.nlargest(n, 'column')` finds the n largest values in that column
- Faster than sorting the entire DataFrame when you only need top N
- Returns the complete rows for those top values
- Useful for "top 10" type queries
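
If you want to convince yourself that `.nlargest()` picks the same rows as sorting and taking the head, the sketch below compares the two on made-up values. It assumes there are no ties in the column.

```python
import pandas as pd

# Hypothetical toy data with made-up A1BG values (no ties)
toy = pd.DataFrame({
    "cell_line_name": ["line_A", "line_B", "line_C", "line_D", "line_E"],
    "A1BG": [0.2, -0.3, 0.5, 0.1, -0.1],
})

top_2 = toy.nlargest(2, "A1BG")
top_2_via_sort = toy.sort_values("A1BG", ascending=False).head(2)

print(top_2["cell_line_name"].tolist())           # ['line_C', 'line_A']
print(top_2_via_sort["cell_line_name"].tolist())  # ['line_C', 'line_A']
```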

### Guided Example 3.2: Using nsmallest()

Find the smallest values: the more negative the gene effect, the more essential the gene is to that cell line.

In [None]:
# Find 5 cell lines with smallest (most negative) A1BG values
bottom_5_a1bg = df.nsmallest(5, 'A1BG')

print("Cell lines where A1BG is most essential (most negative):")
print(bottom_5_a1bg[['cell_line_name', 'oncotree_lineage', 'A1BG']])

**What's new here?**
- `.nsmallest(n, 'column')` finds the n smallest values
- For gene effects, smallest = most negative = most essential
- These cells are most dependent on the gene
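
One detail worth trying on toy data is what happens when values tie at the cutoff. The sketch below uses invented scores and the `keep` parameter of `.nsmallest()`, which the lecture code does not use; treat it as an optional extra.

```python
import pandas as pd

# Hypothetical toy data: line_B and line_C tie at -0.2
toy = pd.DataFrame({
    "cell_line_name": ["line_A", "line_B", "line_C", "line_D"],
    "A1BG": [-0.5, -0.2, -0.2, 0.3],
})

# Default keep="first": exactly n rows, the earliest tied row wins
bottom_2 = toy.nsmallest(2, "A1BG")
print(bottom_2["cell_line_name"].tolist())  # ['line_A', 'line_B']

# keep="all": every row tied for the last place is included
bottom_2_all = toy.nsmallest(2, "A1BG", keep="all")
print(bottom_2_all["cell_line_name"].tolist())  # ['line_A', 'line_B', 'line_C']
```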

### Practice Example 3.1: Find top 10 for A1CF

Use `.nlargest()` to find the 10 cell lines with the highest A1CF values.

Show cell_line_name, oncotree_lineage, and A1CF.

In [None]:
# YOUR CODE HERE: find top 10 A1CF values

### Practice Example 3.2: Find both extremes

For the A2M gene:
- Find the 3 largest values using `.nlargest()`
- Find the 3 smallest values using `.nsmallest()`
- Print both results

In [None]:
# YOUR CODE HERE: find both extremes for A2M

---

## Section 4: Sorting Within Groups

### Guided Example 4.1: Sort then take top per group

Find the top cell lines within each cancer type.

In [None]:
# Sort by A1BG (ascending - most negative first)
sorted_a1bg = df.sort_values('A1BG')

# Get top 3 from each cancer type
top_per_cancer = sorted_a1bg.groupby('oncotree_lineage').head(3)

print("Top 3 most A1BG-sensitive lines per cancer type:")
print(top_per_cancer[['cell_line_name', 'oncotree_lineage', 'A1BG']])

**What's happening here?**
- First, sort the entire DataFrame by A1BG
- Then use `.groupby('column').head(n)` to get first n rows from each group
- Since we sorted first, `.head()` gives us the smallest values per group
- Returns the 3 most sensitive lines for each cancer type; the rows stay in A1BG order, so Breast and Myeloid lines can be interleaved in the output
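
A six-row sketch makes the row order of the result concrete. The data is made up; the optional second sort at the end is just one way to list each cancer type as a block.

```python
import pandas as pd

# Hypothetical toy data: three cell lines per lineage, made-up A1BG values
toy = pd.DataFrame({
    "cell_line_name": ["line_A", "line_B", "line_C", "line_D", "line_E", "line_F"],
    "oncotree_lineage": ["Breast", "Myeloid", "Breast", "Myeloid", "Breast", "Myeloid"],
    "A1BG": [0.1, -0.6, -0.4, 0.2, -0.1, -0.3],
})

sorted_toy = toy.sort_values("A1BG")
top_2_per_type = sorted_toy.groupby("oncotree_lineage").head(2)

# groupby().head() keeps the order of sorted_toy, so lineages alternate here
print(top_2_per_type["cell_line_name"].tolist())  # ['line_B', 'line_C', 'line_F', 'line_E']

# Optional: sort again to show each lineage as one block
grouped_view = top_2_per_type.sort_values(["oncotree_lineage", "A1BG"])
print(grouped_view["cell_line_name"].tolist())    # ['line_C', 'line_E', 'line_B', 'line_F']
```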

### Guided Example 4.2: Compare extremes across groups

Find both the highest and the lowest value within each group.

In [None]:
# Get most negative A1BG per cancer type
most_sensitive = df.sort_values('A1BG').groupby('oncotree_lineage').head(1)

# Get most positive A1BG per cancer type
least_sensitive = df.sort_values('A1BG', ascending=False).groupby('oncotree_lineage').head(1)

print("Most A1BG-sensitive cell line per cancer type:")
print(most_sensitive[['cell_line_name', 'oncotree_lineage', 'A1BG']])

print("\nLeast A1BG-sensitive cell line per cancer type:")
print(least_sensitive[['cell_line_name', 'oncotree_lineage', 'A1BG']])

**What's new here?**
- Same technique, used twice with different sort orders
- `.head(1)` gets just the first row from each group
- Lets you find the extreme value within each category
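
The same pattern on a toy table, with invented values, shows which row each `.head(1)` picks:

```python
import pandas as pd

# Hypothetical toy data: three cell lines per lineage, made-up A1BG values
toy = pd.DataFrame({
    "cell_line_name": ["line_A", "line_B", "line_C", "line_D", "line_E", "line_F"],
    "oncotree_lineage": ["Breast", "Myeloid", "Breast", "Myeloid", "Breast", "Myeloid"],
    "A1BG": [0.1, -0.6, -0.4, 0.2, -0.1, -0.3],
})

# Ascending sort: the first row per group is that group's minimum
most_sensitive = toy.sort_values("A1BG").groupby("oncotree_lineage").head(1)

# Descending sort: the first row per group is that group's maximum
least_sensitive = (
    toy.sort_values("A1BG", ascending=False)
    .groupby("oncotree_lineage")
    .head(1)
)

print(most_sensitive[["oncotree_lineage", "cell_line_name", "A1BG"]])
print(least_sensitive[["oncotree_lineage", "cell_line_name", "A1BG"]])
```

Here the most sensitive lines are line_B (Myeloid, -0.6) and line_C (Breast, -0.4), and the least sensitive are line_D (Myeloid, 0.2) and line_A (Breast, 0.1).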

### Practice Example 4.1: Top 5 per cancer type

For the A1CF gene:
- Sort by A1CF descending (highest first)
- Get the top 5 cell lines from each cancer type
- Show the results

In [None]:
# YOUR CODE HERE: get top 5 A1CF per cancer type

### Practice Example 4.2: Highest mean A1BG per cancer type

For each cancer type:
- Calculate the mean of A1BG and A1CF
- Sort by A1BG descending
- Show the cancer type with the highest mean A1BG

Hint: Use `.groupby().mean()` first, then sort

In [None]:
# YOUR CODE HERE: calculate means, then sort

---

## Section 5: Ranking Values

### Guided Example 5.1: Basic ranking

Assign rank numbers to all values.

In [None]:
# Create ranks for A1BG (smallest value = rank 1)
df['A1BG_rank'] = df['A1BG'].rank(ascending=True)

# Show the top 10 ranked cell lines
top_ranked = df.nsmallest(10, 'A1BG_rank')

print("Top 10 ranked cell lines (most A1BG-sensitive):")
print(top_ranked[['cell_line_name', 'oncotree_lineage', 'A1BG', 'A1BG_rank']])

**What's happening here?**
- `.rank()` assigns a rank number to each value
- `ascending=True` means smallest value gets rank 1
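- Every non-missing value gets a rank (not just the top ones); by default, tied values share the average of their ranks, so ranks can be fractional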
- Useful for comparing positions across different analyses
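
Ranks are easiest to understand on a few numbers, especially when values tie or are missing. This sketch uses invented scores; `method="min"` is an optional `.rank()` setting that the lecture code does not use.

```python
import pandas as pd

# Hypothetical scores: the two -0.5 values tie for the lowest position
scores = pd.Series([-0.5, 0.1, -0.5, 0.3], name="A1BG")

# Default method="average": tied values share the mean of their positions
print(scores.rank().tolist())                 # [1.5, 3.0, 1.5, 4.0]

# method="min": tied values all get the lowest position in the tie
print(scores.rank(method="min").tolist())     # [1.0, 3.0, 1.0, 4.0]

# ascending=False: the largest value gets rank 1
print(scores.rank(ascending=False).tolist())  # [3.5, 2.0, 3.5, 1.0]

# Missing values are left unranked (NaN) by default
with_missing = pd.Series([0.2, None, -0.1]).rank()
print(with_missing.isna().tolist())           # [False, True, False]
```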

### Guided Example 5.2: Ranking within groups

Rank separately within each cancer type.

In [None]:
# Rank A1BG within each cancer type
df['A1BG_rank_by_type'] = df.groupby('oncotree_lineage')['A1BG'].rank(ascending=True)

# Show top ranked from each cancer type
top_per_type = df[df['A1BG_rank_by_type'] <= 3].sort_values(['oncotree_lineage', 'A1BG_rank_by_type'])

print("Top 3 ranked cell lines within each cancer type:")
print(top_per_type[['cell_line_name', 'oncotree_lineage', 'A1BG', 'A1BG_rank_by_type']])

**What's new here?**
- `.groupby('column')['other_column'].rank()` ranks within groups
- There is a rank 1 in Breast cancer and a separate rank 1 in Myeloid cancer
- Each group has its own ranking from 1 to N
- Filter for `rank <= 3` to get top 3 from each group
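
To watch each group restart its ranking at 1, try it on a toy table. The values below are made up, and `rank_by_type` is just an illustrative column name.

```python
import pandas as pd

# Hypothetical toy data: three cell lines per lineage, made-up A1BG values
toy = pd.DataFrame({
    "cell_line_name": ["line_A", "line_B", "line_C", "line_D", "line_E", "line_F"],
    "oncotree_lineage": ["Breast", "Myeloid", "Breast", "Myeloid", "Breast", "Myeloid"],
    "A1BG": [0.1, -0.6, -0.4, 0.2, -0.1, -0.3],
})

# Rank within each lineage; the result lines up with the original rows
toy["rank_by_type"] = toy.groupby("oncotree_lineage")["A1BG"].rank(ascending=True)
print(toy["rank_by_type"].tolist())  # [3.0, 1.0, 1.0, 3.0, 2.0, 2.0]

# Keep only the rank-1 row of each lineage
top_1 = toy[toy["rank_by_type"] <= 1]
print(top_1["cell_line_name"].tolist())  # ['line_B', 'line_C']
```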

### Practice Example 5.1: Rank A1CF

Create ranks for the A1CF gene:
- Rank with ascending=True (most negative = rank 1)
- Show the cell lines with ranks 1 through 5
- Display: cell_line_name, A1CF, and the rank

In [None]:
# YOUR CODE HERE: rank A1CF and show top 5

### Practice Example 5.2: Compare ranks across cancer types

For the A2M gene:
- Create overall ranks (ascending=True)
- Create ranks within each cancer type
- Find cell lines that are rank 1 within their cancer type
- What are their overall ranks?

In [None]:
# YOUR CODE HERE: create both rank types and compare

### Practice Example 5.3: Average rank by cancer type

Create A1BG ranks for all cell lines, then:
- Calculate the mean rank for each cancer type
- Which cancer type has better (lower) average ranks?
- What does this tell you about A1BG sensitivity?

In [None]:
# YOUR CODE HERE: calculate average rank by cancer type

---

## Summary

Congratulations! You've learned pandas sorting step-by-step:

**Section 1 - Sort by one column:**
- ✅ `.sort_values('column')` - ascending and descending
- ✅ `ascending=True` (default) or `ascending=False`

**Section 2 - Multiple columns:**
- ✅ Sort by several columns: `.sort_values(['col1', 'col2'])`
- ✅ Different orders: `ascending=[True, False]`

**Section 3 - Top/Bottom N:**
- ✅ `.nlargest(n, 'column')` - quick way to get the top N without a full sort
- ✅ `.nsmallest(n, 'column')` - quick way to get the bottom N without a full sort

**Section 4 - Sorting within groups:**
- ✅ Sort first, then `.groupby('column').head(n)`
- ✅ Get top N from each category

**Section 5 - Ranking:**
- ✅ `.rank()` - assign rank numbers to all values
- ✅ Ranking within groups with `.groupby()`

**Next Steps:**
- Combine sorting with filtering and statistics
- Use sorting to prepare data for visualization
- Apply sorting in complex data analysis workflows

Keep organizing your data! 📊🚀