# Introduction to Genomic Data Analysis

**Stanford Data Ocean - Data Science in Precision Medicine**

---

## 📚 Overview

This notebook demonstrates comprehensive genomic data analysis using the **1000 Genomes Project** dataset. We'll explore:

- **Population Genetics**: Analyzing genetic variant frequencies across global populations
- **Clinical Annotation**: Integrating genomic variants with cancer databases (COSMIC)
- **Gene-Level Analysis**: Interval-based annotation for tumor suppressor genes
- **Data Visualization**: Creating informative plots for genomic insights

### 🎯 Learning Objectives

By the end of this analysis, you will understand:
1. How to query large-scale genomic databases using cloud infrastructure (AWS Athena)
2. Methods for analyzing population-specific genetic variations
3. Techniques for annotating variants with clinical significance
4. Approaches to visualizing genomic data effectively

## 🛠️ Setup and Imports

Let's start by importing our custom genomic analysis utilities and setting up the environment.

In [None]:
# Add the scripts directory to Python path
import sys
import os
sys.path.append('../scripts')

# Import our custom genomic utilities
from genomic_utils import GenomicDataProcessor, VariantAnnotator, GenomicVisualizer
from config import DATABASE_TABLES, DEMO_SNPS, get_demo_snp_info

# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pyathena
from IPython.display import display

# Configure display options
pd.set_option('display.max_columns', None)
plt.style.use('default')
sns.set_palette("husl")

print("✅ Libraries imported successfully!")
print("🧬 Ready for genomic data analysis")

## 🔗 Database Connection

Connect to AWS Athena to access the genomic databases. Make sure your AWS credentials are configured.

In [None]:
# AWS Configuration
S3_STAGING_DIR = "s3://athena-output-351869726285/"
AWS_REGION = "us-east-1"

# Initialize database connection
try:
    conn = pyathena.connect(
        s3_staging_dir=S3_STAGING_DIR,
        region_name=AWS_REGION,
        encryption_option='SSE_S3'
    )
    print("🎉 Successfully connected to AWS Athena!")
    print(f"📍 Region: {AWS_REGION}")
    print(f"📁 S3 Staging: {S3_STAGING_DIR}")
except Exception as e:
    print(f"❌ Connection failed: {e}")
    print("💡 Make sure your AWS credentials are configured properly")
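
Before connecting, it can help to check where credentials might come from. The sketch below is a heuristic and not part of `pyathena` or this project: the helper name `find_credential_sources` is hypothetical, and it only looks at the standard environment variables and the default shared credentials file, so it will miss other sources such as IAM roles or SSO.

```python
import os
from pathlib import Path


def find_credential_sources(env=None, home=None):
    """Return a list of likely AWS credential sources (heuristic only)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    sources = []
    # Static keys exported in the environment
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment variables")
    # A named profile selected via AWS_PROFILE
    if env.get("AWS_PROFILE"):
        sources.append(f"profile '{env['AWS_PROFILE']}'")
    # Default shared credentials file (~/.aws/credentials)
    if (home / ".aws" / "credentials").is_file():
        sources.append("shared credentials file")
    return sources


found = find_credential_sources()
print("Credential sources found:", found or "none (check your AWS setup)")
```

An empty result does not prove that credentials are missing; it only means none of the three common locations checked here is populated.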

## 🔍 Data Exploration

Let's start by exploring the structure of our genomic datasets.

In [None]:
# Query sample data from the 1000 Genomes Project
print("🔍 Exploring 1000 Genomes Project Dataset...")
sample_query = f"SELECT * FROM {DATABASE_TABLES['g1000_csv_int']} LIMIT 10"
sample_data = pd.read_sql(sample_query, conn)

print(f"📋 Dataset shape: {sample_data.shape}")
print(f"📝 Columns: {list(sample_data.columns)}")
display(sample_data.head())

## 👁️ SNP Analysis: The Genetics of Eye Color

Let's analyze **rs12913832**, a well-studied SNP in the *HERC2* gene that is strongly associated with blue versus brown eye color.

In [None]:
# Initialize our genomic data processor
processor = GenomicDataProcessor()

# Query the eye color SNP
eye_color_snp = "rs12913832"
snp_info = get_demo_snp_info(eye_color_snp)

print(f"👁️ Analyzing: {snp_info['name']}")
print(f"📍 Description: {snp_info['description']}")

# Query the SNP data
snp_query = f"""
SELECT * FROM {DATABASE_TABLES['g1000_csv_int']}
WHERE rsid = '{eye_color_snp}'
"""

snp_data = pd.read_sql(snp_query, conn)
display(snp_data)
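
Because the query above interpolates the rsID directly into the SQL string, it is worth validating the identifier first. A minimal sketch, assuming rsIDs follow the usual `rs` plus digits pattern; `build_snp_query` and the table name in the usage line are hypothetical and not part of `genomic_utils`.

```python
import re

# rsIDs are assumed to be "rs" followed by one or more digits
RSID_PATTERN = re.compile(r"rs\d+")


def build_snp_query(table, rsid):
    """Build the SNP lookup query after checking the rsID format."""
    if not RSID_PATTERN.fullmatch(rsid):
        raise ValueError(f"Not a valid rsID: {rsid!r}")
    return f"SELECT * FROM {table} WHERE rsid = '{rsid}'"


# Hypothetical table name, used only for illustration
print(build_snp_query("example_db.g1000", "rs12913832"))
```

Rejecting anything that is not a plain rsID keeps stray quotes or SQL fragments out of the query text, which matters if the identifier ever comes from user input rather than a hard-coded constant.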

In [None]:
# Extract and visualize population frequencies
if not snp_data.empty:
    info_text = snp_data.iloc[0]['info']
    population_frequencies = processor.extract_population_frequencies(info_text)
    
    print(f"\n🌍 Population Frequencies for {eye_color_snp}:")
    for population, frequency in population_frequencies.items():
        print(f"   {population}: {frequency:.2f}%")
    
    # Create visualization
    processor.plot_population_frequencies(
        population_frequencies, 
        title=f"Population Frequencies: {snp_info['name']} ({eye_color_snp})"
    )
else:
    print("❌ SNP data not found")

## 🔬 Offline Analysis (Without AWS)

If you don't have AWS access, you can still explore the functionality using sample data:

In [None]:
# Demo with sample data (works without AWS)
print("🧬 Running offline demo with sample data...")

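# Sample VCF INFO field for rs12913832 (eye color SNP)
# Note: the overall AC/AF values in this string are inconsistent with the
# per-population *_AF values, so treat the string as illustrative.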
sample_info = "AC=2;AN=5008;AF=0.000399361;EAS_AF=0.002;AMR_AF=0.2017;AFR_AF=0.0023;EUR_AF=0.6362;SAS_AF=0.028"

# Extract population frequencies
processor = GenomicDataProcessor()
pop_frequencies = processor.extract_population_frequencies(sample_info)

print("\n🌍 Eye Color SNP Population Frequencies:")
for population, frequency in pop_frequencies.items():
    print(f"   {population:20s}: {frequency:6.2f}%")

# Create visualization
processor.plot_population_frequencies(
    pop_frequencies,
    title="Eye Color SNP (rs12913832) - Population Frequencies"
)
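
To make the parsing step concrete, here is a minimal sketch of how per-population frequencies could be pulled out of an INFO string. It is not the implementation in `genomic_utils.py`: the function name `parse_population_afs` is hypothetical, and it assumes the 1000 Genomes convention of `<POP>_AF` keys holding fractions, which it converts to percentages.

```python
def parse_population_afs(info):
    """Return {population code: allele frequency in percent} from an INFO string."""
    frequencies = {}
    for field in info.split(";"):
        key, sep, value = field.partition("=")
        # Keep only per-population keys such as EUR_AF; skip flags and the global AF
        if not sep or not key.endswith("_AF"):
            continue
        try:
            # Assumes a single alternate allele per site
            frequencies[key[:-3]] = float(value) * 100
        except ValueError:
            continue  # e.g. comma-separated values at multi-allelic sites
    return frequencies


sample_info = (
    "AC=2;AN=5008;AF=0.000399361;EAS_AF=0.002;AMR_AF=0.2017;"
    "AFR_AF=0.0023;EUR_AF=0.6362;SAS_AF=0.028"
)
for population, percent in parse_population_afs(sample_info).items():
    print(f"{population}: {percent:.2f}%")
```

The sketch returns short population codes (`EAS`, `EUR`, and so on); mapping them to full population names would be a separate lookup step.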

## 🎯 Summary and Key Findings

### 🔬 What We Accomplished

1. **Population Genetics Analysis**: Demonstrated how the allele frequency of a single variant differs across global populations
2. **Data Processing**: Used `GenomicDataProcessor` to extract per-population frequencies from the VCF INFO field
3. **Visualization**: Plotted population frequencies for the eye color SNP

### 🌟 Key Insights

- **Genetic Diversity**: Allele frequencies can differ sharply between populations; in the sample data, the rs12913832 allele frequency ranges from 0.002 (EAS) to 0.6362 (EUR)
- **Clinical Relevance**: Population frequency data is crucial for variant interpretation
- **Quality Control**: Proper data filtering is essential for reliable analysis

### 🚀 Next Steps

- Try the `demo.py` script for a complete offline demonstration
- Explore the `scripts/genomic_utils.py` file to see all available methods
- Set up AWS credentials to analyze real genomic data

---

*This analysis demonstrates practical applications of cloud-based genomic data analysis for precision medicine research as part of the Stanford Data Ocean certificate program.*