# RNA-seq Analysis Template

This notebook demonstrates the complete RNA-seq analysis workflow using the RNA-seq Analysis Platform modules.

## Workflow Overview
1. Setup & Imports
2. Data Loading & Validation
3. Sample Metadata Assignment
4. Differential Expression Analysis (PyDESeq2)
5. Interactive Visualizations (Volcano, Heatmap, PCA)
6. Pathway Enrichment (GO & KEGG)
7. Gene Panel Analysis
8. Export Results

## 1. Setup & Imports

In [None]:
import sys
sys.path.append('..')  # Add parent directory to path

import pandas as pd
import numpy as np

from rnaseq_parser import RNASeqParser, DataType
from de_analysis import DEAnalysisEngine
from pathway_enrichment import PathwayEnrichment
from visualizations import create_volcano_plot, create_clustered_heatmap, create_pca_plot
from gene_panels import GenePanelAnalyzer
from export_engine import ExportEngine, ExportData

## 2. Data Loading

Load your RNA-seq count matrix. The parser supports:
- CSV/TSV files
- Excel files (.xlsx)
- Auto-detection of data type (RAW_COUNTS, NORMALIZED, PRE_ANALYZED)

In [None]:
# Initialize parser
parser = RNASeqParser()

# Parse your data file
result = parser.parse("../tests/data/sample_counts.csv")

print(f"Data type detected: {result.data_type}")
print(f"Matrix shape: {result.expression_df.shape}")
print(f"Can run DE analysis: {result.can_run_de}")

# Preview data
result.expression_df.head()
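
The parser's detection rules are not shown in this notebook. As a rough illustration of how such auto-detection could work, the sketch below guesses a data type from column names and values. `SketchDataType`, `guess_data_type`, and the list of DE column names are hypothetical and are not part of `rnaseq_parser`; the code is plain Python so it runs without the platform modules.

```python
from enum import Enum


class SketchDataType(Enum):
    """Hypothetical stand-in for the platform's DataType enum."""
    RAW_COUNTS = "raw_counts"
    NORMALIZED = "normalized"
    PRE_ANALYZED = "pre_analyzed"


# Column names that usually indicate a finished DE results table (assumed list).
DE_RESULT_COLUMNS = {"log2foldchange", "padj", "pvalue"}


def guess_data_type(columns, rows):
    """Guess the data type from column names and numeric rows.

    columns: column names, excluding the gene ID column
    rows: one list of numeric values per gene
    """
    lowered = {name.lower() for name in columns}
    if lowered & DE_RESULT_COLUMNS:
        return SketchDataType.PRE_ANALYZED

    # Raw counts are non-negative whole numbers; anything else is treated as normalized.
    values = [value for row in rows for value in row]
    all_counts = all(v >= 0 and float(v).is_integer() for v in values)
    return SketchDataType.RAW_COUNTS if all_counts else SketchDataType.NORMALIZED


raw = guess_data_type(["Sample_1", "Sample_2"], [[10, 0], [250, 31]])
norm = guess_data_type(["Sample_1", "Sample_2"], [[3.32, 0.0], [7.97, 4.95]])
pre = guess_data_type(["log2FoldChange", "padj"], [[1.5, 0.01]])
print(raw.name, norm.name, pre.name)
```

A heuristic like this can misclassify edge cases (for example, normalized values that happen to be whole numbers), so it is worth confirming the detected type before running DE analysis.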

## 3. Sample Metadata Assignment

Assign each sample to an experimental condition.

In [None]:
# Define sample-to-condition mapping
sample_conditions = {
    "Sample_1": "Control",
    "Sample_2": "Control",
    "Sample_3": "Control",
    "Sample_4": "Control",
    "Sample_5": "Control",
    "Sample_6": "Treatment",
    "Sample_7": "Treatment",
    "Sample_8": "Treatment",
    "Sample_9": "Treatment",
    "Sample_10": "Treatment"
}

# Create metadata DataFrame
metadata_df = pd.DataFrame({
    'condition': sample_conditions
}, index=list(sample_conditions.keys()))

metadata_df
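
Before running DE analysis it can help to confirm that every sample in the expression matrix has a condition and that each condition has replicates. The sketch below shows one way to do that check in plain Python. `check_metadata` and the `min_replicates` default are illustrative assumptions, not part of the platform API.

```python
from collections import Counter


def check_metadata(sample_names, sample_conditions, min_replicates=2):
    """Return a list of human-readable problems (an empty list means OK)."""
    problems = []

    # Samples present in the matrix but missing from the mapping, and vice versa.
    missing = [s for s in sample_names if s not in sample_conditions]
    extra = [s for s in sample_conditions if s not in sample_names]
    if missing:
        problems.append(f"Samples without a condition: {missing}")
    if extra:
        problems.append(f"Conditions assigned to unknown samples: {extra}")

    # Each condition needs replicates for dispersion estimation.
    for condition, n in Counter(sample_conditions.values()).items():
        if n < min_replicates:
            problems.append(f"Condition '{condition}' has only {n} replicate(s)")
    return problems


matrix_samples = [f"Sample_{i}" for i in range(1, 11)]
conditions = {
    s: ("Control" if i < 5 else "Treatment") for i, s in enumerate(matrix_samples)
}

ok = check_metadata(matrix_samples, conditions)
bad = check_metadata(matrix_samples, {"Sample_1": "Control", "Sample_X": "Treatment"})
print(ok)
for problem in bad:
    print(problem)
```

In this notebook, `sample_names` would be the sample labels of `result.expression_df`.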

## 4. Differential Expression Analysis

Run PyDESeq2 to identify differentially expressed genes.

**Parameters:**
- `design_factor`: Column name in metadata (e.g., "condition")
- `comparisons`: List of (test, reference) tuples

In [None]:
# Initialize DE engine
engine = DEAnalysisEngine()

# Run analysis
de_results = engine.run_all_comparisons(
    counts_df=result.expression_df,
    metadata_df=metadata_df,
    comparisons=[("Treatment", "Control")],
    design_factor="condition"
)

# Get results for our comparison
de_result = de_results[("Treatment", "Control")]

print(f"Total genes analyzed: {len(de_result.results_df)}")
print(f"Significant genes (padj < 0.05): {(de_result.results_df['padj'] < 0.05).sum()}")

# Top 10 significant genes
de_result.results_df.nsmallest(10, 'padj')[['gene', 'log2FoldChange', 'padj']]
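
The `padj` column in DESeq2-style results holds p-values adjusted for multiple testing with the Benjamini-Hochberg procedure. The sketch below implements that adjustment in plain Python to show why a gene with a raw p-value under 0.05 can still fail the `padj < 0.05` filter. `benjamini_hochberg` is an illustrative helper and the p-values are made up for the example.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
padj = benjamini_hochberg(pvals)
n_raw = sum(p < 0.05 for p in pvals)
n_significant = sum(p < 0.05 for p in padj)
print(f"{n_raw} genes pass on raw p-values, {n_significant} pass on padj")
```

DESeq2-style pipelines also apply independent filtering, so some genes may have a missing `padj`; the comparison `padj < 0.05` treats those as not significant.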

## 5. Visualizations

### 5a. Volcano Plot

In [None]:
volcano_fig = create_volcano_plot(
    de_result.results_df,
    lfc_threshold=1.0,
    padj_threshold=0.05
)
volcano_fig.show()
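
A volcano plot places each gene at (log2 fold change, -log10 adjusted p-value) and colors it by whether it passes both thresholds. The sketch below shows that classification in plain Python. `classify_gene`, `volcano_point`, and the gene names are hypothetical, and `create_volcano_plot` may handle boundary cases differently.

```python
import math


def classify_gene(log2fc, padj, lfc_threshold=1.0, padj_threshold=0.05):
    """Label a gene 'up', 'down', or 'ns' (not significant)."""
    if padj is None or math.isnan(padj) or padj >= padj_threshold:
        return "ns"
    if log2fc > lfc_threshold:
        return "up"
    if log2fc < -lfc_threshold:
        return "down"
    return "ns"


def volcano_point(log2fc, padj):
    """Return the (x, y) position of a gene on the volcano plot."""
    return log2fc, -math.log10(padj)


# Made-up (log2FoldChange, padj) pairs for illustration.
genes = {
    "GENE_A": (2.3, 1e-8),
    "GENE_B": (-1.7, 0.003),
    "GENE_C": (0.4, 0.001),   # significant but small effect
    "GENE_D": (3.0, 0.2),     # large effect but not significant
}
labels = {name: classify_gene(lfc, padj) for name, (lfc, padj) in genes.items()}
x, y = volcano_point(*genes["GENE_A"])
print(labels)
```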

### 5b. Clustered Heatmap

**Note:** The heatmap expects a genes × samples matrix. `log_normalized_counts` is stored in the canonical samples × genes orientation, so transpose it first.

In [None]:
heatmap_fig = create_clustered_heatmap(
    de_result.log_normalized_counts.T,  # Transpose to genes × samples
    sample_conditions,
    top_n=50
)
heatmap_fig.show()
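
The data preparation behind a clustered heatmap can be sketched as: transpose to genes × samples, keep the most variable genes, and z-score each gene across samples. The code below illustrates these steps in plain Python. Selecting genes by variance is an assumption about what `top_n` means; `create_clustered_heatmap` may rank genes differently.

```python
from statistics import mean, pstdev, pvariance


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def top_variable_genes(genes_by_samples, gene_names, top_n):
    """Keep the top_n genes with the highest variance across samples."""
    ranked = sorted(
        zip(gene_names, genes_by_samples),
        key=lambda pair: pvariance(pair[1]),
        reverse=True,
    )
    return ranked[:top_n]


def zscore(values):
    """Scale one gene's values to mean 0 and standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    return [0.0 if sd == 0 else (v - mu) / sd for v in values]


# Made-up log-normalized values: 4 samples (rows) x 3 genes (columns).
gene_names = ["GENE_A", "GENE_B", "GENE_C"]
samples_by_genes = [
    [5.0, 2.0, 7.0],
    [5.0, 2.1, 7.5],
    [5.0, 4.0, 3.0],
    [5.0, 4.1, 2.5],
]

genes_by_samples = transpose(samples_by_genes)
selected = top_variable_genes(genes_by_samples, gene_names, top_n=2)
heatmap_rows = {name: zscore(values) for name, values in selected}
print(list(heatmap_rows))
```

The constant gene (`GENE_A`) is dropped because it has zero variance and would add nothing to the clustering.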

### 5c. PCA Plot

In [None]:
pca_fig = create_pca_plot(
    de_result.log_normalized_counts,  # samples × genes (no transpose)
    sample_conditions
)
pca_fig.show()
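
PCA works on the samples × genes matrix because each sample is one observation and each gene is one feature. As a minimal sketch of the idea, the code below computes the first principal component with power iteration in plain Python. `first_principal_component` is a hypothetical helper with made-up data; a real analysis would use a library routine, and `create_pca_plot` may scale or filter genes first.

```python
import math


def first_principal_component(samples_by_genes, n_iter=200):
    """Return (PC1 score per sample, fraction of variance explained by PC1)."""
    n_samples = len(samples_by_genes)
    n_genes = len(samples_by_genes[0])

    # Center each gene (column) so PCA captures variation around the mean.
    col_means = [
        sum(row[j] for row in samples_by_genes) / n_samples for j in range(n_genes)
    ]
    centered = [
        [row[j] - col_means[j] for j in range(n_genes)] for row in samples_by_genes
    ]

    # Sample-by-sample Gram matrix; its top eigenvector, scaled by the
    # square root of the eigenvalue, gives the PC1 scores.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]

    # Power iteration from an arbitrary non-constant start vector.
    vec = [float(i + 1) for i in range(n_samples)]
    for _ in range(n_iter):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]

    eigenvalue = sum(
        v * sum(g * w for g, w in zip(row, vec)) for v, row in zip(vec, gram)
    )
    total_variance = sum(gram[i][i] for i in range(n_samples))
    scores = [v * math.sqrt(eigenvalue) for v in vec]
    return scores, eigenvalue / total_variance


# Made-up data: samples 1-2 and samples 3-4 form two groups.
samples_by_genes = [
    [5.0, 2.0, 7.0],
    [5.0, 2.1, 7.5],
    [5.0, 4.0, 3.0],
    [5.0, 4.1, 2.5],
]
scores, explained = first_principal_component(samples_by_genes)
print([round(s, 2) for s in scores], round(explained, 3))
```

The sign of a principal component is arbitrary, so only the relative positions of the samples are meaningful: samples from the same group share a sign on PC1.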

## 6. Pathway Enrichment

Query Enrichr for GO Biological Process and KEGG pathways.

**Note:** Requires an internet connection.

In [None]:
# Initialize pathway enrichment
pe = PathwayEnrichment()

# Select genes for enrichment (adaptive thresholds)
genes, error = pe.select_genes_for_enrichment(
    de_result.results_df,
    padj_threshold=0.05,
    lfc_threshold=1.0
)

if error:
    print(f"Warning: {error}")
    # Define the enrichment variables so the export cell in section 8 still runs
    genes = [] if genes is None else genes
    go_results = kegg_results = pd.DataFrame()
    go_error = kegg_error = error
else:
    print(f"Selected {len(genes)} genes for enrichment")

    # GO enrichment
    go_results, go_error = pe.get_go_enrichment(genes)
    if go_error:
        print(f"GO enrichment failed: {go_error}")
    else:
        print("\nTop 10 GO terms:")
        display(go_results.head(10))

    # KEGG enrichment
    kegg_results, kegg_error = pe.get_kegg_enrichment(genes)
    if kegg_error:
        print(f"KEGG enrichment failed: {kegg_error}")
    else:
        print("\nTop 10 KEGG pathways:")
        display(kegg_results.head(10))
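
Over-representation tools such as Enrichr typically score a gene list against each term with a Fisher's exact (hypergeometric) test. To make the enrichment p-values easier to interpret, the sketch below computes the one-sided hypergeometric tail probability in plain Python. `hypergeom_pvalue` is an illustrative helper, and the universe and pathway sizes are made-up numbers rather than values from Enrichr.

```python
from math import comb


def hypergeom_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """Probability of seeing at least n_overlap pathway genes by chance.

    n_universe: number of genes in the background
    n_pathway: number of background genes annotated to the pathway
    n_selected: number of genes in the submitted list
    n_overlap: number of submitted genes that are in the pathway
    """
    total = comb(n_universe, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# 200 selected genes, a 100-gene pathway, 20,000 background genes:
# about 1 overlapping gene is expected by chance.
p_enriched = hypergeom_pvalue(20000, 100, 200, 8)
p_expected = hypergeom_pvalue(20000, 100, 200, 1)
print(f"overlap of 8: p = {p_enriched:.2e}")
print(f"overlap of 1: p = {p_expected:.2f}")
```

Because many terms are tested at once, these raw p-values still need multiple-testing correction before terms are called significant.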

## 7. Gene Panel Analysis

Analyze expression of curated dermatology gene panels.

In [None]:
# Initialize gene panel analyzer
analyzer = GenePanelAnalyzer('../config/gene_panels.yaml')

# Score Anti-aging panel
scores = analyzer.score_panel(
    de_result.log_normalized_counts,
    "Anti-aging",
    sample_conditions
)

print("Anti-aging panel scores (z-score normalized):")
for condition, score in scores.items():
    print(f"  {condition}: {score:.2f}")

# Plot panel
panel_fig = analyzer.plot_panel(
    de_result.log_normalized_counts,
    "Anti-aging",
    sample_conditions
)
panel_fig.show()
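
One plausible way to compute a z-score panel score is to standardize each panel gene across samples, average the z-scores per sample, and then average per condition. The sketch below follows that recipe in plain Python. It is an assumption about the scoring, not the actual logic of `GenePanelAnalyzer.score_panel`, and the gene names and values are made up.

```python
from statistics import mean, pstdev


def score_panel_sketch(expression, panel_genes, sample_conditions):
    """Return {condition: mean panel z-score}.

    expression: {sample: {gene: log-normalized value}}
    """
    samples = list(expression)
    per_sample = {s: [] for s in samples}

    for gene in panel_genes:
        if any(gene not in expression[s] for s in samples):
            continue  # skip panel genes absent from the matrix
        values = [expression[s][gene] for s in samples]
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            continue  # a constant gene carries no information
        for s, v in zip(samples, values):
            per_sample[s].append((v - mu) / sd)

    # Average the per-sample scores within each condition.
    by_condition = {}
    for s in samples:
        if per_sample[s]:
            by_condition.setdefault(sample_conditions[s], []).append(mean(per_sample[s]))
    return {cond: mean(vals) for cond, vals in by_condition.items()}


expression = {
    "S1": {"GENE_A": 4.0, "GENE_B": 6.0},
    "S2": {"GENE_A": 4.2, "GENE_B": 6.2},
    "S3": {"GENE_A": 6.0, "GENE_B": 8.0},
    "S4": {"GENE_A": 6.2, "GENE_B": 8.2},
}
conditions = {"S1": "Control", "S2": "Control", "S3": "Treatment", "S4": "Treatment"}

scores = score_panel_sketch(expression, ["GENE_A", "GENE_B", "GENE_MISSING"], conditions)
for condition, score in scores.items():
    print(f"  {condition}: {score:.2f}")
```

Under this scheme a positive score means the panel genes are, on average, expressed above the across-sample mean in that condition.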

## 8. Export Results

Export to Excel, high-resolution images, and PDF reports.

In [None]:
# Initialize export engine
export_engine = ExportEngine()

# Export volcano plot as high-res PNG
export_engine.export_figure(
    volcano_fig,
    "volcano_plot.png",
    format="png",
    scale=3  # 3x pixel dimensions (roughly 300 DPI in print)
)
print("✓ Exported volcano_plot.png")

# Export as SVG (vector)
export_engine.export_figure(
    volcano_fig,
    "volcano_plot.svg",
    format="svg"
)
print("✓ Exported volcano_plot.svg")
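
The link between `scale` and print resolution is simple arithmetic: the scale factor multiplies the figure's pixel dimensions, and DPI depends on how large the image is printed. The sketch below works through the numbers, assuming a 700 × 500 px base figure and assuming that `export_figure` passes `scale` through to the renderer as a pixel multiplier. Both helper functions are hypothetical.

```python
def export_pixels(width_px, height_px, scale):
    """Pixel dimensions of the exported image for a given scale factor."""
    return round(width_px * scale), round(height_px * scale)


def print_size_inches(pixels, dpi=300):
    """Physical size at which the image reaches the requested DPI."""
    return tuple(p / dpi for p in pixels)


pixels = export_pixels(700, 500, scale=3)  # assumed base figure size
inches = print_size_inches(pixels, dpi=300)
print(f"{pixels[0]} x {pixels[1]} px prints at {inches[0]} x {inches[1]} in at 300 DPI")
```

If a journal asks for a wider figure at 300 DPI, increase `scale` or the base figure size accordingly; SVG export avoids the question because it is resolution independent.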

### Excel Export

Create a multi-sheet workbook with DE results, enrichment results, and settings, then build a PDF report from the same `ExportData` object.

In [None]:
from export_engine import EnrichmentResult

# Prepare export data
export_data = ExportData(
    data_type=DataType.RAW_COUNTS,
    de_results=de_results,
    expression_matrix=de_result.log_normalized_counts,
    enrichment_results={
        ("Treatment", "Control"): EnrichmentResult(
            go_results=go_results if not go_error else pd.DataFrame(),
            kegg_results=kegg_results if not kegg_error else pd.DataFrame(),
            genes_used=genes,
            selection_note=f"{len(genes)} genes (padj<0.05, |log2FC|>1)",
            error=go_error or kegg_error
        )
    },
    figures={
        "volcano": volcano_fig,
        "heatmap": heatmap_fig,
        "pca": pca_fig,
        "panel_Anti-aging": panel_fig
    },
    settings={
        "padj_threshold": 0.05,
        "lfc_threshold": 1.0,
        "comparisons": [("Treatment", "Control")]
    },
    sample_conditions=sample_conditions,
    active_comparison=("Treatment", "Control")
)

# Export to Excel
export_engine.export_excel("analysis_results.xlsx", export_data)
print("✓ Exported analysis_results.xlsx")

# Export to PDF
export_engine.export_pdf_report("analysis_report.pdf", export_data)
print("✓ Exported analysis_report.pdf")

## Summary

This notebook demonstrated:
1. ✓ Data loading and validation
2. ✓ Sample metadata assignment
3. ✓ Differential expression analysis with PyDESeq2
4. ✓ Interactive visualizations (volcano, heatmap, PCA)
5. ✓ Pathway enrichment (GO & KEGG)
6. ✓ Gene panel analysis
7. ✓ Multi-format export (Excel, PNG, SVG, PDF)

**Next Steps:**
- Customize thresholds for your specific research question
- Explore additional gene panels (Skin Barrier, Anti-inflammation, etc.)
- See `advanced_exploration.ipynb` for multi-comparison workflows