# Beaver Tutorial 4: GWAS Analysis (Data Owner)

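This tutorial demonstrates running a **Genome-Wide Association Study (GWAS)** pipeline with PLINK from the Data Owner's side of a privacy-preserving collaboration: the real genotypes stay with you, and only the results you approve are shared.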

## Overview

- **Data Owner (DO)**: Provides real genomic data (PLINK bed/bim/fam files)
- **Data Scientist (DS)**: Defines the GWAS analysis pipeline
- **PLINK**: Industry-standard tool for genomic analysis
- **Output**: Manhattan plots, QQ plots, significant SNPs (no raw genotypes exposed)

## How to Run This Tutorial

### Option 1: Two Browser Tabs (Solo Testing)
1. Create a session with yourself in BioVault
2. Open two Jupyter tabs from the same session
3. Run this notebook in one tab, the DS notebook in the other

### Option 2: With a Collaborator
1. Create a session with your collaborator
2. You run this DO notebook, they run the DS notebook

---

## Step 1: Setup and Install Dependencies

In [None]:
# Install required packages
!uv pip install pandas numpy matplotlib pandas-plink xarray dask -q
print("Dependencies installed!")

In [None]:
# Check if PLINK is available
import shutil
plink_path = shutil.which("plink")
if plink_path:
    print(f"PLINK found at: {plink_path}")
    !plink --version | head -2
else:
    print("WARNING: PLINK not found in PATH!")
    print("Install PLINK from: https://www.cog-genomics.org/plink/")
    print("Or run: brew install plink (on macOS)")

In [None]:
import beaver
from beaver import Twin

bv = beaver.ctx()
session = bv.active_session()

print(f"You are: {bv.user}")
print(f"Session peer: {session.peer}")

## Step 2: Load Genomic Data

We'll load two kinds of genomic data:
- **Real data**: Circassian and Chechen populations (from a Jordan GWAS study)
- **Mock data**: Synthetic PLINK dataset for testing

In [None]:
from pathlib import Path

# Path to real genomic data (machine-specific: change this to your local copy)
DATA_DIR = Path("/Users/madhavajay/dev/biovaults/datasets/jordan_gwas")

# Real datasets (without .bed/.bim/.fam extension)
REAL_DATASET1 = str(DATA_DIR / "Circassian_qc")
REAL_DATASET2 = str(DATA_DIR / "Chechen_qc")

# Mock dataset
MOCK_DATASET = str(DATA_DIR / "mock")

print("Data paths configured:")
print(f"  Real Dataset 1: {REAL_DATASET1}")
print(f"  Real Dataset 2: {REAL_DATASET2}")
print(f"  Mock Dataset: {MOCK_DATASET}")

In [None]:
# Verify data files exist
import os

for prefix, name in [(REAL_DATASET1, "Circassian"), (REAL_DATASET2, "Chechen"), (MOCK_DATASET, "Mock")]:
    bed = f"{prefix}.bed"
    bim = f"{prefix}.bim"
    fam = f"{prefix}.fam"
    
    if os.path.exists(bed) and os.path.exists(bim) and os.path.exists(fam):
        bed_size = os.path.getsize(bed) / (1024*1024)
        print(f"✓ {name}: {bed_size:.1f} MB")
    else:
        print(f"✗ {name}: Missing files!")
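
The check above only confirms that the three files exist. As an optional extra, a PLINK 1 `.bed` file can be sanity-checked from its header and size: it starts with the three magic bytes `0x6c 0x1b 0x01` (variant-major layout) and then stores each variant in `ceil(n_samples / 4)` bytes. The sketch below is illustrative and not part of the Beaver workflow; the helper names `expected_bed_size` and `check_bed` are hypothetical, and in practice `n_samples` and `n_variants` would come from the line counts of the `.fam` and `.bim` files.

```python
import math
import tempfile
from pathlib import Path

# PLINK 1 .bed header: two magic bytes plus 0x01 for variant-major layout.
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])


def expected_bed_size(n_samples: int, n_variants: int) -> int:
    """3 header bytes plus ceil(n_samples / 4) bytes per variant (2 bits per genotype)."""
    return 3 + n_variants * math.ceil(n_samples / 4)


def check_bed(bed_path, n_samples: int, n_variants: int) -> bool:
    """Return True if the header and the file size match the expected layout."""
    path = Path(bed_path)
    with path.open("rb") as fh:
        header = fh.read(3)
    size_ok = path.stat().st_size == expected_bed_size(n_samples, n_variants)
    return header == BED_MAGIC and size_ok


# Self-contained demo on a tiny synthetic file: 5 samples, 2 variants.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bed"
    demo.write_bytes(BED_MAGIC + bytes(2 * math.ceil(5 / 4)))
    demo_ok = check_bed(demo, n_samples=5, n_variants=2)
    demo_bad = check_bed(demo, n_samples=5, n_variants=3)  # wrong variant count

print(demo_ok, demo_bad)
```

A size mismatch usually means the `.bed` file does not belong with the `.bim`/`.fam` pair next to it.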

In [None]:
from pandas_plink import read_plink1_bin

# Preview the datasets
print("Loading data previews...")

ds1 = read_plink1_bin(f"{REAL_DATASET1}.bed", f"{REAL_DATASET1}.bim", f"{REAL_DATASET1}.fam", verbose=False)
ds2 = read_plink1_bin(f"{REAL_DATASET2}.bed", f"{REAL_DATASET2}.bim", f"{REAL_DATASET2}.fam", verbose=False)
mock = read_plink1_bin(f"{MOCK_DATASET}.bed", f"{MOCK_DATASET}.bim", f"{MOCK_DATASET}.fam", verbose=False)

print(f"\nCircassian: {ds1.shape[0]} samples × {ds1.shape[1]} variants")
print(f"Chechen: {ds2.shape[0]} samples × {ds2.shape[1]} variants")
print(f"Mock: {mock.shape[0]} samples × {mock.shape[1]} variants")

## Step 3: Create Genomic Data Twins

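Package the data as a Twin: the `private` side holds the real dataset paths and counts, and the `public` side holds the mock equivalents that the DS can see.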

In [None]:
# Create a dictionary with paths to the real datasets
real_data = {
    "dataset1_prefix": REAL_DATASET1,
    "dataset2_prefix": REAL_DATASET2,
    "dataset1_name": "Circassian_qc",
    "dataset2_name": "Chechen_qc",
    "n_samples_1": int(ds1.shape[0]),
    "n_samples_2": int(ds2.shape[0]),
    "n_variants_1": int(ds1.shape[1]),
    "n_variants_2": int(ds2.shape[1]),
}

# Mock data - just use the mock dataset for both
mock_data = {
    "dataset1_prefix": MOCK_DATASET,
    "dataset2_prefix": MOCK_DATASET,  # Same mock for both in testing
    "dataset1_name": "mock",
    "dataset2_name": "mock",
    "n_samples_1": int(mock.shape[0]),
    "n_samples_2": int(mock.shape[0]),
    "n_variants_1": int(mock.shape[1]),
    "n_variants_2": int(mock.shape[1]),
}

# Create Twin
gwas_data = Twin(
    public=mock_data,
    private=real_data,
    name="gwas_data",
)

print("GWAS Data Twin created:")
display(gwas_data)
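
Because the DS develops against the public (mock) side and the same code later runs against the private (real) side, it helps if both dictionaries have the same keys and value types. The following is a minimal sketch of such a check in plain Python; `schema_mismatches` is a hypothetical helper, not a Beaver function, and the sample counts in the example are made up for illustration.

```python
def schema_mismatches(public: dict, private: dict) -> list:
    """List keys present on only one side, and keys whose value types differ."""
    problems = []
    for key in sorted(set(public) ^ set(private)):
        problems.append(f"key only on one side: {key}")
    for key in sorted(set(public) & set(private)):
        if type(public[key]) is not type(private[key]):
            problems.append(f"type differs for {key}")
    return problems


# Illustrative values only; the real dictionaries are mock_data and real_data.
example_public = {"dataset1_name": "mock", "n_samples_1": 100}
example_private = {"dataset1_name": "Circassian_qc", "n_samples_1": 250}

print(schema_mismatches(example_public, example_private))
```

In this notebook the call would be `schema_mismatches(mock_data, real_data)`, run before creating the Twin; an empty list means the two sides line up.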

## Step 4: Publish Data to Session

In [None]:
session.remote_vars["gwas_data"] = gwas_data
print("\nGWAS data published! DS can now see the mock dataset info.")

## Step 5: Create Live Progress Tracker

In [None]:
# Create progress tracker
gwas_progress = Twin(
    public={
        "step": 0,
        "total_steps": 5,
        "current_task": "waiting",
        "status": "waiting",
        "details": "",
    },
    private=None,
    name="gwas_progress",
)

# Enable live sync
gwas_progress.enable_live(interval=1.0)

# Publish
session.remote_vars["gwas_progress"] = gwas_progress
print("GWAS progress tracker published with live sync!")

## Step 6: Wait for GWAS Analysis Request

**Run DS notebook Steps 1-4 now!**

In [None]:
# Wait for GWAS request
request = bv.wait_for_request(gwas_data, timeout=600)  # 10 minute timeout

if request:
    print("GWAS analysis request received!")
    display(request)
else:
    print("No request received. Run DS notebook!")

## Step 7: Test on Mock Data First

Run a quick test on the mock data to verify the GWAS pipeline works.

In [None]:
if request:
    print("Testing GWAS pipeline on mock data...\n")
    print("(This will be fast since mock data is small)\n")
    
    mock_result = request.run_mock()
    
    print("\n=== Mock GWAS Result ===")
    # Use a distinct name so the mock summary never shadows the real `result` in Step 8
    mock_public = mock_result.data.public
    if mock_public:
        print(f"Status: {mock_public.get('status', 'unknown')}")
        print(f"Total SNPs: {mock_public.get('total_snps', 'N/A')}")
        print(f"Samples: {mock_public.get('total_samples', 'N/A')}")

In [None]:
# Show captured figures from mock run
if request and mock_result.data.public_figures:
    print(f"Captured {len(mock_result.data.public_figures)} figure(s) from mock run:")
    mock_result.data.show_figures("public")

## Step 8: Run GWAS on Real Data

Now run the full GWAS pipeline on real data. This will take longer due to the size of the real datasets.

In [None]:
# Helper to update progress
def update_gwas_progress(step, total, task, status, details=""):
    gwas_progress.public = {
        "step": step,
        "total_steps": total,
        "current_task": task,
        "status": status,
        "details": details,
    }
    session.remote_vars["gwas_progress"] = gwas_progress
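
The progress dictionary is plain data, so either side can render it however they like. As a rough sketch (not part of Beaver; `format_progress` is a hypothetical helper), a one-line text progress bar built from the same five fields might look like this:

```python
def format_progress(state: dict, width: int = 20) -> str:
    """Render a progress dict (step, total_steps, current_task, status, details) as one line."""
    total = max(int(state.get("total_steps", 0)), 1)      # avoid division by zero
    step = min(max(int(state.get("step", 0)), 0), total)  # clamp into [0, total]
    filled = width * step // total
    bar = "#" * filled + "-" * (width - filled)
    line = f"[{bar}] {step}/{total} {state.get('current_task', '')} ({state.get('status', '')})"
    details = state.get("details", "")
    return f"{line}: {details}" if details else line


waiting = {"step": 0, "total_steps": 5, "current_task": "waiting", "status": "waiting", "details": ""}
done = {"step": 5, "total_steps": 5, "current_task": "Complete", "status": "complete",
        "details": "GWAS analysis finished"}

print(format_progress(waiting))
print(format_progress(done))
```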

In [None]:
result = None  # stays None if no request arrived, so later cells can test it safely
if request:
    print("Starting GWAS on REAL data...")
    print("DS can watch progress via the live 'gwas_progress' variable!\n")
    
    # Update status
    update_gwas_progress(0, 5, "Starting", "running", "Initializing GWAS pipeline")
    
    # Run the GWAS
    result = request.run()
    
    # Update final status and print the summary only if the run produced a private result
    private = result.data.private if result else None
    if private:
        update_gwas_progress(5, 5, "Complete", "complete", "GWAS analysis finished")
        
        print("\n=== Real GWAS Complete ===")
        print(f"Status: {private.get('status', 'unknown')}")
        print(f"Total SNPs: {private.get('total_snps', 'N/A')}")
        print(f"Samples: {private.get('total_samples', 'N/A')}")
        print(f"Genome-wide significant: {private.get('gw_significant_count', 0)}")
        print(f"Suggestive: {private.get('suggestive_count', 0)}")
    else:
        print("\nGWAS run returned no private result. Check the pipeline log before approving.")

In [None]:
# Display GWAS plots
if result and result.data.private_figures:
    print(f"GWAS produced {len(result.data.private_figures)} figure(s):")
    result.data.show_figures("private")

In [None]:
# Show captured PLINK output (last 2000 chars)
if result and result.data.private_stdout:
    print("=== GWAS Pipeline Log (last 2000 chars) ===")
    print(result.data.private_stdout[-2000:])

## Step 9: Review and Approve Results

Before approving, review what will be shared with the DS:
- Summary statistics (SNP counts, sample counts)
- Manhattan and QQ plots
- Top significant SNPs (if any)
- **NOT** the raw genotype data
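
One way to make this review less dependent on eyeballing is to compare the result against an explicit allowlist before approving. The sketch below is only an illustration in plain Python: the allowed keys are the ones this notebook prints, while `MAX_TOP_SNPS` and the helper `review_problems` are hypothetical choices you would adapt to your own disclosure policy.

```python
# Keys this notebook expects in the result summary.
ALLOWED_KEYS = {
    "status", "total_snps", "total_samples", "cases", "controls",
    "gw_significant_count", "suggestive_count", "top_snps",
}
MAX_TOP_SNPS = 20  # hypothetical cap on how many per-SNP rows may be released


def review_problems(summary: dict) -> list:
    """Return reasons not to approve; an empty list means nothing was flagged."""
    problems = [f"unexpected key: {key}" for key in sorted(set(summary) - ALLOWED_KEYS)]
    top_snps = summary.get("top_snps") or []
    if len(top_snps) > MAX_TOP_SNPS:
        problems.append(f"top_snps has {len(top_snps)} rows (limit {MAX_TOP_SNPS})")
    return problems


clean = {"status": "ok", "total_snps": 1000, "top_snps": []}
leaky = {"status": "ok", "genotype_matrix": [[0, 1, 2]]}

print(review_problems(clean))
print(review_problems(leaky))
```

A check like this does not replace reading the figures and the log, since plots and stdout can also carry individual-level detail.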

In [None]:
if result:
    print("=== Result to Approve ===")
    private = result.data.private or {}  # tolerate a run that produced no summary
    
    print(f"\nSummary Statistics:")
    print(f"  Total SNPs: {private.get('total_snps', 'N/A')}")
    print(f"  Total Samples: {private.get('total_samples', 'N/A')}")
    print(f"  Cases: {private.get('cases', 'N/A')}")
    print(f"  Controls: {private.get('controls', 'N/A')}")
    
    print(f"\nSignificant Findings:")
    print(f"  Genome-wide significant (P < 5e-8): {private.get('gw_significant_count', 0)}")
    print(f"  Suggestive (P < 1e-5): {private.get('suggestive_count', 0)}")
    
    print(f"\nAttachments:")
    print(f"  Figures: {len(result.data.private_figures) if result.data.private_figures else 0}")
    print(f"  Stdout: {len(result.data.private_stdout) if result.data.private_stdout else 0} chars")
    
    if private.get('top_snps'):
        print(f"\nTop SNPs:")
        for snp in private['top_snps'][:5]:
            print(f"  {snp}")

In [None]:
if result:
    # Approve and send to DS
    result.approve()
    print("\nGWAS results approved and sent to DS!")

In [None]:
# Disable live sync
gwas_progress.disable_live()
print("Live sync disabled.")

## Summary

In this tutorial you:

1. **Loaded real GWAS data** (Circassian and Chechen populations)
2. **Created a Twin** with real and mock genomic datasets
3. **Published a live progress tracker** for real-time monitoring
4. **Ran the GWAS pipeline** using PLINK on both mock and real data
5. **Approved results** including plots and statistics

### Privacy Preserved!

- The DS received Manhattan plots, QQ plots, and summary statistics
- The DS did NOT receive raw genotype data (.bed files)
- The DS did NOT receive individual-level data
- You controlled exactly what statistical summaries were shared

### GWAS Pipeline Steps

1. **Merge datasets** - Combine two populations
2. **PCA** - Population stratification correction
3. **Association testing** - Logistic regression with covariates
4. **Visualization** - Manhattan and QQ plots
5. **Report** - Summary statistics and top SNPs
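
The actual pipeline is defined in the DS notebook, so the exact commands may differ. As a rough sketch of what steps 1 to 3 could look like with PLINK 1.9, the helper below only builds the command lines and does not run them. The function name, output directory, and number of principal components are assumptions, and the phenotype is assumed to already be coded in the `.fam` files.

```python
import shlex


def build_plink_commands(prefix1: str, prefix2: str, out_dir: str = "gwas_out", n_pcs: int = 10) -> list:
    """Build argv lists for merge, PCA, and logistic regression (nothing is executed)."""
    merged = f"{out_dir}/merged"
    pca = f"{out_dir}/pca"
    assoc = f"{out_dir}/assoc"
    return [
        # 1. Merge the two binary filesets into one
        ["plink", "--bfile", prefix1, "--bmerge", prefix2, "--make-bed", "--out", merged],
        # 2. Principal components for population stratification
        ["plink", "--bfile", merged, "--pca", str(n_pcs), "--out", pca],
        # 3. Logistic regression with the PCs as covariates
        ["plink", "--bfile", merged, "--logistic", "hide-covar",
         "--covar", f"{pca}.eigenvec", "--covar-number", f"1-{n_pcs}", "--out", assoc],
    ]


commands = build_plink_commands("Circassian_qc", "Chechen_qc")
for argv in commands:
    print(shlex.join(argv))
```

Each list could be passed to `subprocess.run(argv, check=True)` once the output directory exists; steps 4 and 5 (plots and report) would then read the association output file.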