# Lecture 3: Pandas Sorting Mastery 🔄

Learn how to organize and sort your data with pandas `sort_values()`, `sort_index()`, and related methods.

## What We'll Learn:
1. **Basic Sorting** - Sort by single columns (ascending/descending)
2. **Multi-column Sorting** - Sort by multiple columns with different orders
3. **Index Sorting** - Sort by row or column index
4. **Advanced Techniques** - Custom sorting, rank, nlargest/nsmallest
5. **Practical Workflows** - Find candidate targets and select cell lines
6. **Practice Questions** - Apply sorting to cancer research data!

Let's master data organization with our cancer dataset! 🧬

## Load Our Dataset

Let's start with our familiar DepMap cancer dataset.

In [None]:
import pandas as pd
import numpy as np

# Load the DepMap CRISPR dataset
url = "https://zenodo.org/records/17098555/files/combined_model_crispr_data_filtered.csv?download=1"
df = pd.read_csv(url)

print(f"Dataset shape: {df.shape}")
print(f"\nFirst few columns: {list(df.columns[:10])}")
print(f"\nCancer types: {df['oncotree_lineage'].unique()}")

---
## 1. Basic Sorting 📊

### Sort by a Single Column
Use `.sort_values()` to sort your DataFrame by any column.

In [None]:
# Sort by cell line name (alphabetically)
sorted_by_name = df.sort_values('cell_line_name')
print("First 5 cell lines alphabetically:")
print(sorted_by_name[['cell_line_name', 'oncotree_lineage']].head())

print("\nLast 5 cell lines alphabetically:")
print(sorted_by_name[['cell_line_name', 'oncotree_lineage']].tail())

### Ascending vs Descending Order
Control the sort direction with the `ascending` parameter.

In [None]:
# Sort by A1BG gene effect - ascending (most negative first)
sorted_a1bg_asc = df.sort_values('A1BG', ascending=True)
print("Cell lines most sensitive to A1BG knockout (ascending):")
print(sorted_a1bg_asc[['cell_line_name', 'oncotree_lineage', 'A1BG']].head())

# Sort by A1BG gene effect - descending (most positive first)
sorted_a1bg_desc = df.sort_values('A1BG', ascending=False)
print("\nCell lines least sensitive to A1BG knockout (descending):")
print(sorted_a1bg_desc[['cell_line_name', 'oncotree_lineage', 'A1BG']].head())
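
Real screening data often contains missing values, so it helps to know where they end up. The sketch below uses a made-up `toy` table (the names and numbers are invented for illustration, not taken from the DepMap file). By default `sort_values()` places `NaN` last in both directions, and `na_position='first'` moves it to the top.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this example
toy = pd.DataFrame({
    'cell_line': ['L1', 'L2', 'L3', 'L4'],
    'effect': [0.2, np.nan, -0.7, -0.1],
})

# NaN goes last by default, whether ascending or descending
asc_order = toy.sort_values('effect')['cell_line'].tolist()
desc_order = toy.sort_values('effect', ascending=False)['cell_line'].tolist()

# na_position='first' puts missing values at the top instead
nan_first = toy.sort_values('effect', na_position='first')['cell_line'].tolist()

print(asc_order)   # ['L3', 'L4', 'L1', 'L2']
print(desc_order)  # ['L1', 'L4', 'L3', 'L2']
print(nan_first)   # ['L2', 'L3', 'L4', 'L1']
```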

### In-place Sorting
Pass `inplace=True` to reorder the existing DataFrame instead of getting a sorted copy back. The call then returns `None`, so do not assign its result.

In [None]:
# Create a copy to demonstrate in-place sorting
df_copy = df[['cell_line_name', 'oncotree_lineage', 'A1BG', 'A1CF']].copy()

print("Before sorting (first 3 rows):")
print(df_copy.head(3))

# Sort in-place by A1CF
df_copy.sort_values('A1CF', inplace=True)

print("\nAfter in-place sorting by A1CF (first 3 rows):")
print(df_copy.head(3))
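
To make the difference between the two styles concrete, here is a minimal sketch on a made-up three-row table (the `toy` data is invented for illustration). It also shows `ignore_index=True`, which renumbers the rows after sorting.

```python
import pandas as pd

# Hypothetical toy data, invented for this example
toy = pd.DataFrame({'gene': ['B', 'A', 'C'], 'effect': [0.1, -0.4, -0.2]})

# Without inplace, sort_values returns a new DataFrame and leaves toy untouched
sorted_copy = toy.sort_values('effect')
original_order = toy['gene'].tolist()

# With inplace=True the method returns None and reorders toy itself
result = toy.sort_values('effect', inplace=True)
inplace_order = toy['gene'].tolist()

# ignore_index=True relabels the sorted rows 0..n-1
renumbered = toy.sort_values('effect', ascending=False, ignore_index=True)

print(original_order)             # ['B', 'A', 'C']
print(result)                     # None
print(inplace_order)              # ['A', 'C', 'B']
print(renumbered.index.tolist())  # [0, 1, 2]
```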

---
## 2. Multi-column Sorting 🎯

### Sort by Multiple Columns
Pass a list of column names to sort by multiple criteria.

In [None]:
# Sort by cancer type first, then by cell line name within each type
sorted_multi = df.sort_values(['oncotree_lineage', 'cell_line_name'])
print("Sorted by cancer type, then by name:")
print(sorted_multi[['oncotree_lineage', 'cell_line_name', 'A1BG']].head(10))

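# Show the rows around the first change of cancer type
lineage = sorted_multi['oncotree_lineage']
first_change = int((lineage != lineage.iloc[0]).to_numpy().argmax())
print("\n...transition between cancer types...")
print(sorted_multi[['oncotree_lineage', 'cell_line_name', 'A1BG']].iloc[max(first_change - 2, 0):first_change + 3])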

### Different Sort Orders for Different Columns
Use a list in the `ascending` parameter to specify different orders.

In [None]:
# Sort by cancer type (ascending) and A1BG effect (descending)
sorted_mixed = df.sort_values(
    ['oncotree_lineage', 'A1BG'],
    ascending=[True, False]  # First column ascending, second descending
)

print("Breast cancer - highest A1BG effects:")
breast_top = sorted_mixed[sorted_mixed['oncotree_lineage'] == 'Breast'].head(5)
print(breast_top[['cell_line_name', 'oncotree_lineage', 'A1BG']])

print("\nMyeloid cancer - highest A1BG effects:")
myeloid_top = sorted_mixed[sorted_mixed['oncotree_lineage'] == 'Myeloid'].head(5)
print(myeloid_top[['cell_line_name', 'oncotree_lineage', 'A1BG']])

### Complex Multi-column Sorting
Combine multiple columns with different sort orders for sophisticated organization.

In [None]:
# Sort by: cancer type (asc), primary disease (asc), A1BG (desc)
complex_sort = df.sort_values(
    ['oncotree_lineage', 'oncotree_primary_disease', 'A1BG'],
    ascending=[True, True, False]
)

print("Complex sorting - grouped by cancer type and disease:")
display_cols = ['cell_line_name', 'oncotree_lineage', 'oncotree_primary_disease', 'A1BG']
print(complex_sort[display_cols].head(10).to_string(index=False))

---
## 3. Index Sorting 🔢

### Sort by Index
Use `.sort_index()` to sort by the DataFrame's index.

In [None]:
# Create a DataFrame with cell line names as index
df_indexed = df.set_index('cell_line_name')

# Shuffle the rows to demonstrate sorting
df_shuffled = df_indexed.sample(frac=1, random_state=42)
print("Shuffled DataFrame (first 5 rows):")
print(df_shuffled[['oncotree_lineage', 'A1BG']].head())

# Sort by index (cell line names)
df_sorted_index = df_shuffled.sort_index()
print("\nSorted by index (cell line names):")
print(df_sorted_index[['oncotree_lineage', 'A1BG']].head())

### Sort Columns Alphabetically
Sort columns by their names using `axis=1`.

In [None]:
# Get first 10 gene columns
gene_cols = df.columns[6:16]
genes_df = df[gene_cols].head()

print("Original column order:")
print(list(genes_df.columns))

# Sort columns alphabetically
genes_sorted_cols = genes_df.sort_index(axis=1)
print("\nColumns sorted alphabetically:")
print(list(genes_sorted_cols.columns))

# Sort columns in reverse order
genes_reverse_cols = genes_df.sort_index(axis=1, ascending=False)
print("\nColumns sorted reverse alphabetically:")
print(list(genes_reverse_cols.columns))

---
## 4. Advanced Sorting Techniques 🚀

### Finding Top/Bottom N Values with nlargest() and nsmallest()

In [None]:
# Find 5 cell lines with most negative A1BG effect (most essential)
most_essential_a1bg = df.nsmallest(5, 'A1BG')
print("Top 5 cell lines where A1BG is most essential:")
print(most_essential_a1bg[['cell_line_name', 'oncotree_lineage', 'A1BG']].to_string(index=False))

# Find 5 cell lines with most positive A1BG effect (least essential)
least_essential_a1bg = df.nlargest(5, 'A1BG')
print("\nTop 5 cell lines where A1BG is least essential:")
print(least_essential_a1bg[['cell_line_name', 'oncotree_lineage', 'A1BG']].to_string(index=False))

# Rank by a derived column that combines two genes
print("\n5 cell lines with highest combined A1BG and A1CF effects:")
df['combined_effect'] = df['A1BG'] + df['A1CF']
highest_combined = df.nlargest(5, 'combined_effect')
print(highest_combined[['cell_line_name', 'A1BG', 'A1CF', 'combined_effect']].to_string(index=False))
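
As a quick check of how `nlargest()` relates to sorting, the sketch below compares it with `sort_values(...).head(n)` on made-up data without ties, then shows the `keep` parameter on data with a tie. Both tables are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data without ties
toy = pd.DataFrame({
    'cell_line': ['L1', 'L2', 'L3', 'L4'],
    'effect': [0.5, 0.9, 0.3, -0.2],
})

top2 = toy.nlargest(2, 'effect')
via_sort = toy.sort_values('effect', ascending=False).head(2)
same_rows = top2['cell_line'].tolist() == via_sort['cell_line'].tolist()
print(same_rows)  # True

# Hypothetical toy data where two rows tie for second place
tied = pd.DataFrame({
    'cell_line': ['T1', 'T2', 'T3'],
    'effect': [0.9, 0.5, 0.5],
})

# Default keep='first' returns exactly n rows
top2_first = tied.nlargest(2, 'effect')

# keep='all' keeps every row tied for the last place, so it can return more than n
top2_all = tied.nlargest(2, 'effect', keep='all')

print(len(top2_first), len(top2_all))  # 2 3
```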

### Ranking Data with rank()
Assign ranks to your data based on values.

In [None]:
# Rank cell lines by A1BG sensitivity
df['A1BG_rank'] = df['A1BG'].rank(ascending=True)  # Lower values = lower rank (more essential)

# Show top 5 most essential (lowest ranks)
top_ranked = df.nsmallest(5, 'A1BG_rank')
print("Top 5 ranked cell lines (most A1BG-sensitive):")
print(top_ranked[['cell_line_name', 'A1BG', 'A1BG_rank']].to_string(index=False))

# Different ranking methods
print("\nDifferent ranking methods for tied values:")
sample_data = pd.DataFrame({
    'gene': ['A', 'B', 'C', 'D', 'E'],
    'effect': [-0.5, -0.3, -0.3, -0.1, 0.2]
})
sample_data['rank_average'] = sample_data['effect'].rank(method='average')
sample_data['rank_min'] = sample_data['effect'].rank(method='min')
sample_data['rank_max'] = sample_data['effect'].rank(method='max')
sample_data['rank_dense'] = sample_data['effect'].rank(method='dense')
print(sample_data)
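
For reference, this self-contained sketch spells out what each tie-handling method gives for the same five values. It also includes two options not used in the cell above, `method='first'` and `pct=True`.

```python
import pandas as pd

# Same five values as the sample above; B and C are tied
effect = pd.Series([-0.5, -0.3, -0.3, -0.1, 0.2], index=list('ABCDE'))

ranks = pd.DataFrame({
    'average': effect.rank(method='average'),  # ties share the mean rank
    'min': effect.rank(method='min'),          # ties take the lowest rank
    'max': effect.rank(method='max'),          # ties take the highest rank
    'dense': effect.rank(method='dense'),      # like min, but no gaps after ties
    'first': effect.rank(method='first'),      # ties broken by order of appearance
    'pct': effect.rank(pct=True),              # average rank divided by the count
})
print(ranks)
```

Reading across the tied rows B and C: `average` gives 2.5 to both, `min` gives 2, `max` gives 3, `dense` gives 2 (and D then gets 3 instead of 4), and `first` gives 2 and 3.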

### Custom Sorting with Key Functions
Sort using custom logic with the `key` parameter.

In [None]:
# Create a subset for demonstration
subset = df[['cell_line_name', 'oncotree_lineage', 'A1BG', 'A1CF']].head(20).copy()

# Sort by the absolute value of A1BG (strongest effects regardless of direction)
subset_sorted = subset.sort_values('A1BG', key=lambda x: x.abs(), ascending=False)
print("Cell lines sorted by absolute A1BG effect (strongest effects first):")
print(subset_sorted.head(10).to_string(index=False))

# Sort by string length of cell line names
subset_name_length = subset.sort_values('cell_line_name', 
                                        key=lambda x: x.str.len(), 
                                        ascending=False)
print("\nCell lines sorted by name length (longest first):")
subset_name_length['name_length'] = subset_name_length['cell_line_name'].str.len()
print(subset_name_length[['cell_line_name', 'name_length']].head().to_string(index=False))
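
Another common use of `key` is case-insensitive sorting. The function passed as `key` receives the whole column as a Series and must return a Series of the same length. In this sketch the mixed-case spellings are made up for illustration.

```python
import pandas as pd

# Hypothetical names with inconsistent capitalization
names = pd.DataFrame({'cell_line_name': ['hela', 'A549', 'MCF7', 'a375']})

# Default string sorting puts uppercase letters before lowercase ones
default_order = names.sort_values('cell_line_name')['cell_line_name'].tolist()

# Lowercasing inside key compares names without regard to case
folded_order = names.sort_values(
    'cell_line_name', key=lambda s: s.str.lower()
)['cell_line_name'].tolist()

print(default_order)  # ['A549', 'MCF7', 'a375', 'hela']
print(folded_order)   # ['a375', 'A549', 'hela', 'MCF7']
```

The stored values are unchanged; `key` only affects the comparison.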

### Sorting After Groupby Operations
Combine grouping and sorting for powerful analysis.

In [None]:
# Calculate mean A1BG effect by cancer type and sort
cancer_means = df.groupby('oncotree_lineage')['A1BG'].agg(['mean', 'std', 'count'])
cancer_means_sorted = cancer_means.sort_values('mean', ascending=True)

print("Cancer types sorted by mean A1BG sensitivity:")
print(cancer_means_sorted.round(4))

# Find the top 5 most variable genes
gene_cols = df.columns[6:50]  # Sample of gene columns
gene_variability = df[gene_cols].std().sort_values(ascending=False)
print("\nTop 5 most variable genes (highest standard deviation):")
print(gene_variability.head().round(4))

# Sort within groups: order by cancer type, then by A1BG, and keep the first 3 rows of each group
print("\nTop 3 most A1BG-sensitive lines per cancer type:")
top_per_cancer = df.sort_values(['oncotree_lineage', 'A1BG']).groupby('oncotree_lineage').head(3)
print(top_per_cancer[['cell_line_name', 'oncotree_lineage', 'A1BG']].to_string(index=False))
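
The two patterns in this section, sorting an aggregate and taking the top rows per group, are easier to follow on a tiny table. This is a sketch on made-up data (column names and values are invented). Sorting by the group column first keeps each group's rows together in the result, because `groupby(...).head(n)` preserves the row order of the sorted frame.

```python
import pandas as pd

# Hypothetical toy data, invented for this example
toy = pd.DataFrame({
    'lineage': ['Lung', 'Breast', 'Lung', 'Breast', 'Lung'],
    'cell_line': ['L1', 'B1', 'L2', 'B2', 'L3'],
    'effect': [-0.3, 0.1, -0.9, -0.2, 0.4],
})

# Pattern 1: aggregate per group, then sort the aggregated values
mean_by_lineage = toy.groupby('lineage')['effect'].mean().sort_values()
lineage_order = mean_by_lineage.index.tolist()

# Pattern 2: sort rows, then keep the first 2 rows of every group
top2 = toy.sort_values(['lineage', 'effect']).groupby('lineage').head(2)
top2_order = top2['cell_line'].tolist()

print(lineage_order)  # ['Lung', 'Breast']
print(top2_order)     # ['B2', 'B1', 'L2', 'L1']
```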

---
## 5. Practical Sorting Workflows 🔬

### Research Scenario 1: Finding Drug Targets
Identify the most essential genes across cancer types.

In [None]:
# Calculate mean effect for each gene across all cell lines
gene_cols = df.columns[6:100]  # First 94 genes for demonstration
gene_means = df[gene_cols].mean().sort_values()

print("Top 10 potential drug targets (most essential genes):")
top_targets = gene_means.head(10)
for gene, effect in top_targets.items():
    print(f"{gene:15} {effect:8.4f}")

# Find the most essential genes within breast cancer lines
# (these may also be essential in other cancer types)
print("\nMost essential genes in breast cancer lines:")
breast_df = df[df['oncotree_lineage'] == 'Breast']
breast_gene_means = breast_df[gene_cols].mean().sort_values()
breast_targets = breast_gene_means.head(5)
for gene, effect in breast_targets.items():
    print(f"{gene:15} {effect:8.4f}")

### Research Scenario 2: Cell Line Selection
Select optimal cell lines for experiments based on multiple criteria.

In [None]:
# Score cell lines based on sensitivity to multiple genes
target_genes = ['A1BG', 'A1CF', 'A2M']
df['sensitivity_score'] = (df[target_genes] < -0.05).sum(axis=1)  # Count of target genes below the threshold

# Sort by sensitivity score and cancer type
optimal_lines = df.sort_values(['sensitivity_score', 'oncotree_lineage'], 
                               ascending=[False, True])

print("Cell lines most sensitive to target genes:")
print(optimal_lines[['cell_line_name', 'oncotree_lineage', 'sensitivity_score'] + target_genes].head(10).round(3))
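
The scoring step above boils down to comparing a block of columns to a threshold and counting the `True` values in each row. Here is a minimal sketch with made-up gene columns (`geneA`, `geneB`, `geneC` and all values are invented) so the counts can be checked by hand.

```python
import pandas as pd

# Hypothetical toy data, invented for this example
toy = pd.DataFrame({
    'cell_line': ['L1', 'L2', 'L3'],
    'geneA': [-0.20, 0.10, -0.06],
    'geneB': [-0.10, -0.01, 0.30],
    'geneC': [0.00, -0.30, -0.50],
})
genes = ['geneA', 'geneB', 'geneC']

# Boolean table: True where the effect is below the threshold
below = toy[genes] < -0.05

# Summing across columns (axis=1) counts the True values in each row
toy['score'] = below.sum(axis=1)

# Highest score first; ties broken alphabetically by cell line
ranked = toy.sort_values(['score', 'cell_line'], ascending=[False, True])

print(toy['score'].tolist())         # [2, 1, 2]
print(ranked['cell_line'].tolist())  # ['L1', 'L3', 'L2']
```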

---
## 📚 Practice Questions

Now it's your turn! Apply sorting techniques to analyze the cancer dataset.

### Question 1: Basic Sorting
Sort the DataFrame by the 'oncotree_primary_disease' column in descending order. Which 5 distinct primary diseases come first in that (reverse alphabetical) order?

In [None]:
# Your answer here:
# sorted_disease = ?
# print("Top 5 primary diseases (reverse alphabetical):")
# print(?)

### Question 2: Multi-column Sorting
Sort the data first by 'oncotree_lineage' (ascending) and then by 'A2M' gene effect (descending). Show the top 3 cell lines for each cancer type based on A2M effect.

In [None]:
# Your answer here:
# multi_sorted = ?
# top_per_type = ?
# print(top_per_type[['cell_line_name', 'oncotree_lineage', 'A2M']])

### Question 3: Finding Extremes
Use nlargest and nsmallest to find:
1. The 3 cell lines with the most negative A1CF effects
2. The 3 cell lines with the most positive A1CF effects
What cancer types do these extreme cell lines belong to?

In [None]:
# Your answer here:
# most_negative = ?
# most_positive = ?
# print("Most negative A1CF effects:")
# print(?)
# print("\nMost positive A1CF effects:")
# print(?)

### Question 4: Ranking Analysis
Create ranks for all cell lines based on their A1BG values (most negative = rank 1). Then:
1. Find the top 5 ranked breast cancer cell lines
2. Calculate the average rank for each cancer type

In [None]:
# Your answer here:
# df['a1bg_rank'] = ?
# breast_ranked = ?
# print("Top 5 ranked breast cancer lines:")
# print(?)
# 
# avg_rank_by_type = ?
# print("\nAverage A1BG rank by cancer type:")
# print(?)

### Question 5: Advanced Challenge
Create a "vulnerability index" for each cell line:
1. Count how many genes have an effect < -0.1 for each cell line
2. Sort cell lines by this vulnerability index (highest first)
3. Show the top 10 most vulnerable cell lines
4. Compare the average vulnerability index between cancer types

In [None]:
# Your answer here:
# gene_cols = df.columns[6:]  # Gene columns (exclude helper columns added earlier, e.g. 'combined_effect')
# 
# # Calculate vulnerability index
# df['vulnerability_index'] = ?
# 
# # Sort by vulnerability
# most_vulnerable = ?
# 
# print("Top 10 most vulnerable cell lines:")
# print(?)
# 
# # Compare by cancer type
# vulnerability_by_type = ?
# print("\nAverage vulnerability by cancer type:")
# print(?)

---
## 🎯 Solutions

Try the questions above first, then run these cells to check your answers!

In [None]:
# Solution 1
sorted_disease = df.sort_values('oncotree_primary_disease', ascending=False)
print("Solution 1 - Top 5 primary diseases (reverse alphabetical):")
print(sorted_disease['oncotree_primary_disease'].unique()[:5])

In [None]:
# Solution 2
multi_sorted = df.sort_values(['oncotree_lineage', 'A2M'], 
                             ascending=[True, False])
top_per_type = multi_sorted.groupby('oncotree_lineage').head(3)
print("Solution 2 - Top 3 cell lines per cancer type by A2M:")
print(top_per_type[['cell_line_name', 'oncotree_lineage', 'A2M']].to_string(index=False))

In [None]:
# Solution 3
most_negative = df.nsmallest(3, 'A1CF')
most_positive = df.nlargest(3, 'A1CF')

print("Solution 3 - Most negative A1CF effects:")
print(most_negative[['cell_line_name', 'oncotree_lineage', 'A1CF']].to_string(index=False))
print("\nMost positive A1CF effects:")
print(most_positive[['cell_line_name', 'oncotree_lineage', 'A1CF']].to_string(index=False))

print("\nCancer types in extremes:")
extreme_types = pd.concat([most_negative, most_positive])['oncotree_lineage'].value_counts()
print(extreme_types)

In [None]:
# Solution 4
df['a1bg_rank'] = df['A1BG'].rank(ascending=True)

# Top 5 ranked breast cancer lines
breast_ranked = df[df['oncotree_lineage'] == 'Breast'].nsmallest(5, 'a1bg_rank')
print("Solution 4 - Top 5 ranked breast cancer lines:")
print(breast_ranked[['cell_line_name', 'A1BG', 'a1bg_rank']].to_string(index=False))

# Average rank by cancer type
avg_rank_by_type = df.groupby('oncotree_lineage')['a1bg_rank'].mean().sort_values()
print("\nAverage A1BG rank by cancer type:")
print(avg_rank_by_type.round(1))

In [None]:
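# Solution 5
# Keep only gene columns: skip helper columns added to df earlier in the notebook
helper_cols = {'combined_effect', 'A1BG_rank', 'sensitivity_score', 'a1bg_rank', 'vulnerability_index'}
gene_cols = [col for col in df.columns[6:] if col not in helper_cols]

# Calculate vulnerability index
df['vulnerability_index'] = (df[gene_cols] < -0.1).sum(axis=1)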

# Sort by vulnerability
most_vulnerable = df.sort_values('vulnerability_index', ascending=False)

print("Solution 5 - Top 10 most vulnerable cell lines:")
print(most_vulnerable[['cell_line_name', 'oncotree_lineage', 'vulnerability_index']].head(10).to_string(index=False))

# Compare by cancer type
vulnerability_by_type = df.groupby('oncotree_lineage')['vulnerability_index'].agg(['mean', 'std', 'max'])
vulnerability_by_type = vulnerability_by_type.sort_values('mean', ascending=False)
print("\nAverage vulnerability by cancer type:")
print(vulnerability_by_type.round(1))

---
## 🎊 Congratulations!

You've mastered pandas sorting techniques! Here's what you learned:

✅ **Basic Sorting**: `.sort_values()` with ascending/descending  
✅ **Multi-column Sorting**: Sort by multiple criteria with different orders  
✅ **Index Sorting**: `.sort_index()` for row and column organization  
✅ **Advanced Techniques**: `nlargest()`, `nsmallest()`, `rank()`, custom sorting  
✅ **Practical Workflows**: Finding targets, selecting cell lines, vulnerability analysis  

**Key Sorting Tips**:
- `inplace=True` modifies the original DataFrame and returns `None`; reassigning the sorted result works just as well
- Combine sorting with groupby for powerful analysis
- `nlargest()`/`nsmallest()` are more efficient than sort + head for finding extremes
- Custom key functions enable sophisticated sorting logic

**Next Steps**: 
- Combine sorting with filtering for complex queries
- Use sorted data for visualization
- Apply sorting in data preprocessing pipelines

Keep organizing your data effectively! 🚀