# Introduction to Proteomics - Data Analysis

**Stanford Data Science in Precision Medicine - Module 4**

---

## iPOP Study: Patient Z Infection and Recovery

This notebook analyzes longitudinal proteomic data from the iPOP study, focusing on Patient Z during infection and recovery. We will explore:

- Mass spectrometry data processing and quality control
- Differential protein expression analysis
- Pathway enrichment and functional analysis
- Clinical interpretation of acute phase response

### Learning Objectives
By completing this analysis, you will understand:
1. Proteomic data structure and preprocessing
2. Statistical methods for differential expression
3. Pathway analysis and biological interpretation
4. Clinical applications of proteomics in precision medicine


In [None]:
import sys
import os
sys.path.append("./scripts")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings("ignore")

from proteomic_utils import (
    ProteomeDataProcessor,
    DifferentialProteinAnalyzer,
    ProteinPathwayAnalyzer,
    ProteomicsVisualizer
)
from config import DEMO_CONFIG

plt.style.use("seaborn-v0_8")
sns.set_palette("husl")

print("✓ Libraries imported successfully")
print("✓ Proteomics utilities loaded")

## 1. Data Loading and Initial Exploration

We begin by loading the iPOP Patient Z proteomic data and exploring its structure.

In [None]:
# Initialize data processor
processor = ProteomeDataProcessor()

# Generate demo data based on iPOP Patient Z study
abundance_data, sample_metadata = processor.generate_demo_data(DEMO_CONFIG)

print("Data Shape:", abundance_data.shape)
print("\nSample Metadata:")
print(sample_metadata)

print("\nFirst few proteins:")
print(abundance_data.columns[:10].tolist())

print("\nAbundance data preview:")
abundance_data.head()

## 2. Data Quality Control

Quality control is crucial in proteomics due to technical variability in mass spectrometry.
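
As a rough illustration of what a QC summary can contain, the sketch below computes missing-value fractions and a per-protein coefficient of variation with plain pandas. It is not the implementation of `processor.quality_control`; the function `basic_qc`, the 50% missingness cutoff, and the toy table are assumptions made for this example, with samples in rows and proteins in columns.

```python
import numpy as np
import pandas as pd


def basic_qc(abundance: pd.DataFrame) -> dict:
    """Summarize missingness and variability (hypothetical helper).

    Assumes samples are rows and proteins are columns.
    """
    missing_per_protein = abundance.isna().mean(axis=0)
    missing_per_sample = abundance.isna().mean(axis=1)
    # Coefficient of variation per protein, ignoring missing values
    cv = abundance.std(axis=0, skipna=True) / abundance.mean(axis=0, skipna=True)
    return {
        "n_samples": abundance.shape[0],
        "n_proteins": abundance.shape[1],
        "overall_missing_fraction": float(abundance.isna().to_numpy().mean()),
        "max_missing_per_sample": float(missing_per_sample.max()),
        # The 50% cutoff is an arbitrary choice for illustration
        "proteins_over_50pct_missing": missing_per_protein[
            missing_per_protein > 0.5
        ].index.tolist(),
        "median_cv": float(cv.median()),
    }


# Toy abundance table (made-up values)
toy = pd.DataFrame(
    {
        "CRP": [10.0, 12.0, np.nan, 11.0],
        "ALB": [20.0, 21.0, 19.0, 20.0],
        "HP": [np.nan, np.nan, np.nan, 5.0],
    },
    index=["s1", "s2", "s3", "s4"],
)
qc_summary = basic_qc(toy)
for key, value in qc_summary.items():
    print(f"{key}: {value}")
```

Proteins with very high missingness are common candidates for filtering or imputation before statistical testing.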

In [None]:
# Perform quality control
qc_results = processor.quality_control(abundance_data, sample_metadata)

print("Quality Control Results:")
for key, value in qc_results.items():
    print(f"{key}: {value}")

# Visualize missing values
visualizer = ProteomicsVisualizer()
visualizer.plot_missing_values(abundance_data)
plt.show()

## 3. Data Normalization

We apply median centering normalization to account for technical variation between samples.
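
Median centering can be sketched in a few lines of pandas: subtract each sample's median from all of its protein values so that every sample ends up with a median of zero. Whether `processor.normalize_data` does exactly this is an assumption; the helper `median_center` and the toy values are hypothetical, and the values are assumed to be on a log scale.

```python
import pandas as pd


def median_center(abundance: pd.DataFrame) -> pd.DataFrame:
    """Subtract each sample's median (hypothetical helper).

    Assumes samples are rows, proteins are columns, and values are log-scale.
    """
    sample_medians = abundance.median(axis=1)
    # axis=0 aligns the medians with the row index, one median per sample
    return abundance.sub(sample_medians, axis=0)


# Toy table: sample s2 is shifted up by 2 units relative to s1
toy = pd.DataFrame(
    {"P1": [10.0, 12.0], "P2": [11.0, 13.0], "P3": [15.0, 17.0]},
    index=["s1", "s2"],
)
centered = median_center(toy)
print(centered)
print(centered.median(axis=1))
```

After centering, the constant offset between the two toy samples disappears while the differences between proteins within each sample are preserved.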

In [None]:
# Apply normalization
normalized_data = processor.normalize_data(abundance_data, method="median_centering")

print("Normalization completed")
print("\nBefore normalization - sample medians:")
print(abundance_data.median(axis=1))

print("\nAfter normalization - sample medians:")
print(normalized_data.median(axis=1))

# Visualize normalization effect
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Transpose so each box is one sample (rows are samples, columns are proteins)
abundance_data.T.boxplot(ax=ax1)
ax1.set_title("Before Normalization")
ax1.set_xticklabels([])

normalized_data.T.boxplot(ax=ax2)
ax2.set_title("After Normalization")
ax2.set_xticklabels([])

plt.tight_layout()
plt.show()

## 4. Differential Protein Expression Analysis

We compare protein abundance between the infection and recovery phases with a per-protein t-test. The results include a fold change, a raw p-value, and an adjusted p-value for each protein.
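
For readers who want to see the mechanics, here is a minimal sketch of a per-protein Welch t-test followed by Benjamini-Hochberg adjustment, using `scipy.stats.ttest_ind`. It is not the code behind `diff_analyzer.t_test_analysis`: the helpers `bh_adjust` and `welch_t_test`, the choice of Welch's test and of Benjamini-Hochberg, and the toy values are all assumptions. On log-scale data the difference in group means corresponds to a log fold change.

```python
import numpy as np
import pandas as pd
from scipy import stats


def bh_adjust(p_values) -> np.ndarray:
    """Benjamini-Hochberg adjusted p-values (hypothetical helper)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working back from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def welch_t_test(data: pd.DataFrame, group1, group2) -> pd.DataFrame:
    """Welch t-test per protein (samples in rows, proteins in columns)."""
    g1, g2 = data.loc[group1], data.loc[group2]
    t_stat, p_val = stats.ttest_ind(
        g1.to_numpy(), g2.to_numpy(), axis=0, equal_var=False
    )
    return pd.DataFrame(
        {
            "mean_difference": (g1.mean(axis=0) - g2.mean(axis=0)).to_numpy(),
            "t_statistic": t_stat,
            "p_value": p_val,
            "p_value_adj": bh_adjust(p_val),
        },
        index=data.columns,
    ).sort_values("p_value")


# Toy log-scale data: "UP" is elevated during infection, "FLAT" is not
toy = pd.DataFrame(
    {
        "UP": [8.0, 8.2, 7.9, 5.0, 5.1, 4.9],
        "FLAT": [5.0, 5.2, 4.9, 5.1, 4.9, 5.0],
    },
    index=["inf1", "inf2", "inf3", "rec1", "rec2", "rec3"],
)
toy_results = welch_t_test(
    toy, ["inf1", "inf2", "inf3"], ["rec1", "rec2", "rec3"]
)
print(toy_results)
```

With only a handful of samples per group, as is typical for a single-patient time course, these p-values should be read as a ranking aid rather than strong evidence.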

In [None]:
# Initialize differential analyzer
diff_analyzer = DifferentialProteinAnalyzer()

# Define infection vs recovery comparison
infection_samples = sample_metadata[sample_metadata["condition"] == "infection"].index
recovery_samples = sample_metadata[sample_metadata["condition"] == "recovery"].index

print(f"Infection samples: {len(infection_samples)}")
print(f"Recovery samples: {len(recovery_samples)}")

# Perform t-test analysis
ttest_results = diff_analyzer.t_test_analysis(
    normalized_data,
    group1_samples=list(infection_samples),
    group2_samples=list(recovery_samples)
)

print("\nTop 10 most significant proteins:")
significant_proteins = ttest_results.sort_values("p_value").head(10)
print(significant_proteins[["fold_change", "p_value", "p_value_adj"]])

## 5. Pathway Enrichment Analysis

We test the significantly upregulated proteins (adjusted p-value < 0.05 and fold change > 1.5) for enrichment in the pathways defined in `DEMO_CONFIG`.
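
One common way to test enrichment is a one-sided Fisher's exact test on a 2x2 table of hits versus pathway membership, which yields both a p-value and an odds ratio. The sketch below uses `scipy.stats.fisher_exact`; it is an assumption that `pathway_analyzer.pathway_enrichment` works this way, and the function `fisher_enrichment`, the background set, and the protein lists are made up for illustration.

```python
from scipy.stats import fisher_exact


def fisher_enrichment(hits, pathway_members, background) -> dict:
    """One-sided Fisher's exact test for over-representation (hypothetical helper)."""
    background = set(background)
    hits = set(hits) & background
    pathway_members = set(pathway_members) & background

    in_both = len(hits & pathway_members)            # hit and in pathway
    hit_only = len(hits - pathway_members)           # hit, not in pathway
    pathway_only = len(pathway_members - hits)       # in pathway, not a hit
    neither = len(background - hits - pathway_members)

    odds_ratio, p_value = fisher_exact(
        [[in_both, hit_only], [pathway_only, neither]], alternative="greater"
    )
    return {"overlap": in_both, "odds_ratio": odds_ratio, "p_value": p_value}


# Toy example: 20 measured proteins, 4 hits, 4 pathway members, 3 shared
background = ["CRP", "SAA1", "HP", "ALB"] + [f"P{i}" for i in range(1, 17)]
hits = ["CRP", "SAA1", "HP", "P1"]
pathway = ["CRP", "SAA1", "HP", "ALB"]

toy_enrichment = fisher_enrichment(hits, pathway, background)
print(toy_enrichment)
```

The background should be the set of proteins actually measured in the experiment, since proteins that were never detected cannot appear among the hits.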

In [None]:
# Initialize pathway analyzer
pathway_analyzer = ProteinPathwayAnalyzer()

# Get significantly upregulated proteins
upregulated = ttest_results[
    (ttest_results["p_value_adj"] < 0.05) & 
    (ttest_results["fold_change"] > 1.5)
].index.tolist()

print(f"Number of upregulated proteins: {len(upregulated)}")
print(f"Upregulated proteins: {upregulated}")

# Perform pathway enrichment
enrichment_results = pathway_analyzer.pathway_enrichment(
    upregulated,
    DEMO_CONFIG["pathways"]
)

print("\nPathway Enrichment Results:")
for pathway, result in enrichment_results.items():
    if result["p_value"] < 0.05:
        # Single quotes inside the f-string avoid a syntax error before Python 3.12
        print(f"{pathway}: p={result['p_value']:.3f}, OR={result['odds_ratio']:.2f}")

## 6. Comprehensive Data Visualization

We create multiple visualizations to explore the proteomic changes during infection and recovery.
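
To make the PCA step less of a black box, the sketch below computes principal component scores directly from a singular value decomposition with NumPy. It is only an illustration and not the code behind `visualizer.pca_plot`; the helper `pca_scores` and the toy values are assumptions, and the input is assumed to contain no missing values.

```python
import numpy as np
import pandas as pd


def pca_scores(data: pd.DataFrame, n_components: int = 2):
    """PCA via SVD on column-centered data (hypothetical helper).

    Assumes samples are rows, proteins are columns, and there are no NaNs.
    Returns the sample scores and the explained variance ratio per component.
    """
    centered = data - data.mean(axis=0)
    u, s, _ = np.linalg.svd(centered.to_numpy(), full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    columns = [f"PC{i + 1}" for i in range(n_components)]
    return (
        pd.DataFrame(scores, index=data.index, columns=columns),
        explained[:n_components],
    )


# Toy data: two infection-like and two recovery-like samples (made-up values)
toy = pd.DataFrame(
    [[8.0, 7.0, 3.0], [8.2, 7.1, 3.1], [5.0, 4.0, 6.0], [5.1, 4.2, 5.9]],
    index=["inf1", "inf2", "rec1", "rec2"],
    columns=["P1", "P2", "P3"],
)
scores, explained = pca_scores(toy)
print(scores)
print("Explained variance ratio:", explained)
```

In the toy data the first component separates the two groups, which is the pattern one would look for in the PCA plot of the real samples.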

In [None]:
# Create volcano plot
print("Creating volcano plot...")
visualizer.volcano_plot(ttest_results)
plt.show()

# Create PCA plot
print("\nCreating PCA plot...")
visualizer.pca_plot(normalized_data, sample_metadata)
plt.show()

# Create heatmap of top proteins
print("\nCreating heatmap...")
top_proteins = ttest_results.sort_values("p_value").head(20).index
heatmap_data = normalized_data[top_proteins]
visualizer.abundance_heatmap(heatmap_data, sample_metadata)
plt.show()

## 7. Clinical Interpretation

We interpret the results in the context of acute phase response and precision medicine.
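
One simple check is whether each acute phase protein moves in the direction expected during infection: up for CRP, SAA1, and HP, down for ALB and APOA1. The sketch below is a hypothetical helper, not part of `proteomic_utils`. It assumes fold changes are expressed on a log scale (infection relative to recovery), so that the sign gives the direction, and the toy values are made up.

```python
import pandas as pd

# Expected direction during infection for the proteins used in this notebook
EXPECTED_DIRECTION = {
    "CRP": "up",
    "SAA1": "up",
    "HP": "up",
    "ALB": "down",
    "APOA1": "down",
}


def check_acute_phase(log_fc: pd.Series) -> pd.DataFrame:
    """Compare observed and expected direction of change (hypothetical helper).

    Assumes log_fc holds log fold changes, infection relative to recovery.
    Proteins absent from log_fc are skipped.
    """
    rows = []
    for protein, expected in EXPECTED_DIRECTION.items():
        if protein not in log_fc.index:
            continue
        # A log fold change of exactly zero is reported as "down" here
        observed = "up" if log_fc[protein] > 0 else "down"
        rows.append(
            {
                "protein": protein,
                "expected": expected,
                "observed": observed,
                "consistent": observed == expected,
            }
        )
    return pd.DataFrame(rows).set_index("protein")


# Toy log fold changes (made-up values; HP is deliberately missing)
toy_log_fc = pd.Series({"CRP": 2.1, "SAA1": 1.7, "ALB": -0.4, "APOA1": 0.2})
direction_table = check_acute_phase(toy_log_fc)
print(direction_table)
```

A protein flagged as inconsistent is not necessarily an error; small effects near zero can easily change sign with few samples.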

In [None]:
# Analyze acute phase response proteins
acute_phase_proteins = ["CRP", "SAA1", "HP", "ALB", "APOA1"]
available_acute_phase = [p for p in acute_phase_proteins if p in normalized_data.columns]

print("Acute Phase Response Analysis:")
print("==============================")

for protein in available_acute_phase:
    if protein in ttest_results.index:
        fc = ttest_results.loc[protein, "fold_change"]
        pval = ttest_results.loc[protein, "p_value"]
        print(f"{protein}: FC={fc:.2f}, p={pval:.3f}")

# Plot protein trajectories
if available_acute_phase:
    visualizer.plot_protein_comparison(
        normalized_data,
        available_acute_phase[:4],
        sample_metadata
    )
    plt.show()

## 8. Summary and Conclusions

### Key Findings

1. **Acute Phase Response**: Clear proteomic signature of acute phase response during infection
2. **Pathway Enrichment**: Significant enrichment in immune response and complement pathways
3. **Recovery Dynamics**: Distinct protein expression patterns during recovery phase
4. **Clinical Relevance**: Demonstrates precision medicine potential of proteomics

### Biological Interpretation

The analysis reveals characteristic changes in the proteome during infection:

- **Positive acute phase proteins** (CRP, SAA1, Haptoglobin) show increased expression
- **Negative acute phase proteins** (Albumin, Apolipoproteins) show decreased expression
- **Complement system** activation is evident from pathway analysis
- **Recovery phase** shows protein levels returning toward baseline

### Clinical Applications

This analysis demonstrates how proteomics can be used for:

1. **Disease monitoring**: Tracking infection progression and recovery
2. **Biomarker discovery**: Identifying proteins associated with disease states
3. **Personalized medicine**: Tailoring treatment based on individual protein profiles
4. **Systems biology**: Understanding complex biological networks

### Future Directions

- Integration with genomic and transcriptomic data
- Machine learning approaches for predictive modeling
- Clinical validation of identified biomarkers
- Expansion to larger patient cohorts