# Data Acquisition

## Overview

This notebook demonstrates how to acquire DNA methylation data from the Gene Expression Omnibus (GEO) database for analyzing the epigenetic response to High-Intensity Interval Training (HIIT). We download and process the GSE171140 dataset, which contains Illumina EPIC (850K) methylation array data from skeletal muscle samples collected at multiple time points during a HIIT intervention study.

### Learning Objectives

By the end of this notebook, you will be able to:

1. Download GEO series matrix files programmatically
2. Extract and parse methylation data from compressed archives
3. Parse sample metadata to understand experimental design
4. Create structured sample mappings for downstream analysis

### Dataset Description

The GSE171140 dataset examines epigenetic adaptations to HIIT in human skeletal muscle, including:

- **Timepoints**: Baseline (PRE), 4 weeks (4WP), 8 weeks (8WP), and 12 weeks (12WP) of training
- **Control period**: Measurements after a non-training control period
- **Platform**: Illumina Infinium MethylationEPIC BeadChip (850K CpG sites)
- **Tissue**: Skeletal muscle biopsies

## 1. Environment Setup

First, we import the necessary modules from our package. The `src.data` module provides utilities for downloading, loading, and parsing GEO data.

In [None]:
# Standard library imports
import sys
import logging
from pathlib import Path

# Third-party imports
import pandas as pd  # needed later for pd.isna() when summarizing labels

# Add project root to path for imports
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Project-specific imports
from src.data.loader import GEODataLoader
from src.data.sample_mapping import SampleMapper

# Configure logging for informative output
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

print(f"Project root: {project_root}")

## 2. Initialize the Data Loader

The `GEODataLoader` class handles all aspects of acquiring GEO data:

- Constructing the correct FTP download URL
- Downloading compressed series matrix files with progress tracking
- Extracting gzip-compressed files
- Loading the methylation data matrix into pandas DataFrames

We specify the GEO accession ID and the directory where data will be stored.
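
To make the first bullet concrete, here is a minimal sketch of how such a URL can be built. It assumes GEO's public directory layout, in which each series sits in a folder named after its accession with the last three digits replaced by `nnn`. The function name `series_matrix_url` is hypothetical and is not part of `GEODataLoader`, whose internals may differ. Series that span several platforms use different file names and are not handled here.

```python
GEO_BASE_URL = "https://ftp.ncbi.nlm.nih.gov/geo/series"


def series_matrix_url(accession: str) -> str:
    """Build the series matrix URL for a single-platform GEO series (sketch)."""
    accession = accession.strip().upper()
    digits = accession[3:]
    if not accession.startswith("GSE") or not digits.isdigit():
        raise ValueError(f"Not a GEO series accession: {accession!r}")
    # GSE171140 is stored under GSE171nnn; accessions below 1000 under GSEnnn
    folder = f"GSE{digits[:-3]}nnn"
    return (
        f"{GEO_BASE_URL}/{folder}/{accession}/matrix/"
        f"{accession}_series_matrix.txt.gz"
    )


print(series_matrix_url("GSE171140"))
```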

In [None]:
# Define data directories
data_dir = project_root / 'data' / 'raw'
processed_dir = project_root / 'data' / 'processed'

# Create directories if they don't exist
data_dir.mkdir(parents=True, exist_ok=True)
processed_dir.mkdir(parents=True, exist_ok=True)

# Initialize the GEO data loader
# The metadata_lines parameter specifies how many lines of metadata
# precede the actual data matrix. The count differs between GEO series,
# so confirm that 74 is right for this file before relying on it.
loader = GEODataLoader(
    geo_accession='GSE171140',
    data_dir=data_dir,
    metadata_lines=74
)

print(f"Data directory: {loader.data_dir}")
print(f"Expected file: {loader.series_matrix_path}")

## 3. Download Series Matrix File

The series matrix file contains both metadata (sample information, experimental design) and the actual methylation beta values. This file is typically large (hundreds of MB to several GB for EPIC arrays) and is provided in gzip-compressed format.

The download function:
- Checks if the file already exists (skips download if present)
- Shows download progress with estimated time remaining
- Handles network errors gracefully
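
The skip-if-present and error-handling behavior described above can be sketched with the standard library alone. This is an illustration under assumptions, not the code behind `loader.download_series_matrix`: the helper name `download_if_missing` is hypothetical, and progress reporting is left out to keep the example short.

```python
import shutil
import urllib.error
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: Path, force: bool = False) -> Path:
    """Download url to dest unless dest already exists (hypothetical helper)."""
    dest = Path(dest)
    if dest.exists() and not force:
        return dest  # reuse the existing file instead of downloading again
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name so an interrupted download never looks complete
    tmp = dest.with_name(dest.name + ".part")
    try:
        with urllib.request.urlopen(url, timeout=60) as response, open(tmp, "wb") as out:
            shutil.copyfileobj(response, out, length=1024 * 1024)
    except (urllib.error.URLError, OSError) as exc:
        tmp.unlink(missing_ok=True)
        raise RuntimeError(f"Download failed for {url}: {exc}") from exc
    tmp.replace(dest)
    return dest
```

Writing to a `.part` file first means that a later run only skips the download when a previous one actually finished.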

In [None]:
# Download the series matrix file from GEO
# This may take several minutes depending on your internet connection
# The file is approximately 1 GB compressed

gz_path = loader.download_series_matrix(force=False)

print(f"\nCompressed file location: {gz_path}")
print(f"File size: {gz_path.stat().st_size / (1024**2):.2f} MB")

## 4. Extract Compressed File

After downloading, we extract the gzip-compressed file. The uncompressed series matrix file will be significantly larger than the compressed version.
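
For readers curious about what the extraction step involves, the sketch below streams a `.gz` file to disk with Python's `gzip` module. The function name `extract_gz` is hypothetical, and `loader.extract_gz_file` may be implemented differently.

```python
import gzip
import shutil
from pathlib import Path


def extract_gz(gz_path: Path, force: bool = False) -> Path:
    """Decompress file.txt.gz to file.txt next to it (illustrative sketch)."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"Expected a .gz file, got: {gz_path.name}")
    out_path = gz_path.with_suffix("")  # drops only the trailing .gz
    if out_path.exists() and not force:
        return out_path
    # Copy in chunks so the whole file never has to fit in memory
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```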

In [None]:
# Extract the compressed file
# This creates the uncompressed .txt file

extracted_path = loader.extract_gz_file(force=False)

print(f"Extracted file: {extracted_path}")
print(f"Uncompressed size: {extracted_path.stat().st_size / (1024**2):.2f} MB")

## 5. Preview File Contents

Before loading the full dataset, it's useful to preview the file structure. GEO series matrix files have a specific format:

1. **Metadata section** (lines starting with `!`): Contains sample information, experimental design, platform details
2. **Data header**: Column names (sample IDs)
3. **Data matrix**: Rows are probe IDs (CpG sites), columns are samples, values are beta values
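
In GEO series matrix files the data table is bracketed by the marker lines `!series_matrix_table_begin` and `!series_matrix_table_end`. The sketch below uses the first marker to count how many lines precede the data header, which is one way to check a hard-coded metadata line count. The function name is hypothetical, the example lines are made up, and whether `GEODataLoader` counts the marker line itself is an assumption to verify.

```python
from typing import Iterable

TABLE_BEGIN = "!series_matrix_table_begin"


def count_lines_before_table(lines: Iterable[str]) -> int:
    """Return how many lines precede the data header, marker line included."""
    for line_number, line in enumerate(lines, start=1):
        if line.strip().lower() == TABLE_BEGIN:
            return line_number
    raise ValueError("No table marker found; is this a series matrix file?")


# Made-up miniature file, for illustration only
example = [
    '!Series_title\t"Example series"\n',
    '!Sample_geo_accession\t"GSM0000001"\t"GSM0000002"\n',
    "!series_matrix_table_begin\n",
    '"ID_REF"\t"GSM0000001"\t"GSM0000002"\n',
    '"cg00000029"\t0.51\t0.48\n',
    "!series_matrix_table_end\n",
]
print(count_lines_before_table(example))  # prints 3
```

With a real file, an open text handle can be passed in place of the list, so only the header portion of the file is read.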

In [None]:
# Preview the first lines of the file to understand its structure
preview_lines = loader.preview_file(num_lines=30)

print("File structure preview:")
print("=" * 80)
for i, line in enumerate(preview_lines, start=1):
    # Truncate long lines for display
    display_line = line[:100] + "..." if len(line) > 100 else line
    print(f"{i:3d}: {display_line}")

## 6. Extract Sample Metadata

The metadata section contains crucial information about each sample:

- **Sample_geo_accession**: Unique GEO identifier (GSMxxxxxx)
- **Sample_title**: Descriptive sample name with experimental details
- **Sample_source_name_ch1**: Tissue/source description
- **Sample_characteristics_ch1**: Key-value pairs with age, sex, treatment, etc.

This information is essential for creating labels for classification tasks.
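
As a rough picture of how these fields can be read, the sketch below collects every `!Sample_` line of a series matrix file into a dictionary of value lists, using the tab-separated, double-quoted layout of those lines. The function name and the `_2`, `_3` suffixes for repeated fields are assumptions made for this example; `loader.get_metadata` may organize repeated fields such as `Sample_characteristics_ch1` differently.

```python
import csv
from typing import Dict, Iterable, List


def parse_sample_metadata(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each !Sample_ field to its per-sample values (illustrative sketch)."""
    metadata: Dict[str, List[str]] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or not row[0].startswith("!Sample_"):
            continue
        key = row[0][1:]  # drop the leading "!"
        # Fields can repeat (e.g. characteristics); keep each under its own name
        name, repeat = key, 1
        while name in metadata:
            repeat += 1
            name = f"{key}_{repeat}"
        metadata[name] = row[1:]
    return metadata
```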

In [None]:
# Extract metadata from the series matrix file
metadata = loader.get_metadata()

# Display available metadata fields
print("Available metadata fields:")
print("-" * 40)
for field_name in sorted(metadata.keys()):
    num_values = len(metadata[field_name])
    print(f"  {field_name}: {num_values} values")

# Show sample count
n_samples = len(metadata.get('Sample_geo_accession', []))
print(f"\nTotal samples: {n_samples}")

In [None]:
# Preview sample titles to understand naming convention
print("Sample title examples:")
print("-" * 60)

sample_titles = metadata.get('Sample_title', [])
for i, title in enumerate(sample_titles[:10]):
    sample_id = metadata['Sample_geo_accession'][i]
    print(f"  {sample_id}: {title}")

print(f"\n... (showing first {min(10, len(sample_titles))} of {len(sample_titles)} samples)")

## 7. Create Sample Mapping

The `SampleMapper` class parses sample names and metadata to create a structured mapping table. This is critical for:

1. **Binary classification**: HIIT intervention vs. Control/Baseline
2. **Multiclass classification**: Different HIIT durations (4W, 8W, 12W)
3. **Time-series analysis**: Tracking changes within individuals over time

The mapper automatically:
- Extracts timepoint information from sample names
- Identifies individual subjects for paired analyses
- Parses demographic information (age, sex) from characteristics
- Creates classification labels
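
The timepoint extraction and label creation can be pictured with a small regular-expression sketch. Everything here is an assumption for illustration: the real sample titles in GSE171140 may be laid out differently, the class names `Control` and `HIIT` and the functions `parse_timepoint` and `label_sample` are hypothetical, and control-period samples are not handled. `SampleMapper` remains the reference implementation.

```python
import re
from typing import Optional

# Timepoint codes from the dataset description; the title layout is assumed
TIMEPOINT_RE = re.compile(r"(?<![A-Za-z0-9])(PRE|4WP|8WP|12WP)(?![A-Za-z0-9])")


def parse_timepoint(title: str) -> Optional[str]:
    """Return the timepoint code found in a sample title, or None."""
    match = TIMEPOINT_RE.search(title.upper())
    return match.group(1) if match else None


def label_sample(title: str) -> dict:
    """Derive timepoint, binary class, and HIIT duration class from a title."""
    time_point = parse_timepoint(title)
    if time_point is None:
        return {"time_point": None, "binary_class": None, "multi_class": None}
    if time_point == "PRE":
        return {"time_point": "PRE", "binary_class": "Control", "multi_class": None}
    # "4WP" -> "4W": duration class for samples taken during training
    return {
        "time_point": time_point,
        "binary_class": "HIIT",
        "multi_class": time_point[:-1],
    }


print(label_sample("S01_8WP"))  # hypothetical title
```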

In [None]:
# Initialize the sample mapper
mapper = SampleMapper()

# Create comprehensive sample mapping from metadata
sample_mapping = mapper.create_sample_mapping(
    metadata,
    output_path=str(data_dir / 'GSE171140_sample_mapping.csv')
)

# Display the mapping structure
print("Sample mapping columns:")
print(sample_mapping.columns.tolist())
print(f"\nTotal samples mapped: {len(sample_mapping)}")

In [None]:
# Preview the sample mapping table
# Note: Specific values will depend on your dataset

display_cols = [
    'sample_id', 'sample_name', 'individual_id', 
    'time_point', 'binary_class', 'multi_class'
]

print("Sample mapping preview:")
print("=" * 100)
print(sample_mapping[display_cols].head(10).to_string())

## 8. Analyze Experimental Design

Understanding the experimental design is crucial for proper statistical analysis. We examine:

- Distribution of samples across timepoints
- Balance between experimental groups
- Number of unique individuals (for paired analyses)
- Demographic characteristics
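
For the paired-analysis check in particular, a subject-by-timepoint table makes gaps visible at a glance. The sketch below builds one with `pd.crosstab` on a toy mapping; the subject IDs are made up, and the column names simply mirror those used for `sample_mapping` in this notebook.

```python
import pandas as pd

# Toy mapping with made-up subjects, for illustration only
toy_mapping = pd.DataFrame({
    "individual_id": ["S01", "S01", "S01", "S01", "S02", "S02"],
    "time_point": ["PRE", "4WP", "8WP", "12WP", "PRE", "4WP"],
})

expected_timepoints = ["PRE", "4WP", "8WP", "12WP"]

# Rows are subjects, columns are timepoints, cells count samples
design = pd.crosstab(toy_mapping["individual_id"], toy_mapping["time_point"])
design = design.reindex(columns=expected_timepoints, fill_value=0)

# Subjects missing at least one timepoint cannot enter a fully paired analysis
incomplete = design.index[(design == 0).any(axis=1)].tolist()

print(design)
print(f"Subjects with missing timepoints: {incomplete}")
```

Replacing `toy_mapping` with `sample_mapping` gives the same view for the real data, provided the `time_point` values match the expected codes.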

In [None]:
# Timepoint distribution
print("Timepoint Distribution:")
print("-" * 40)
timepoint_counts = sample_mapping['time_point'].value_counts()
for tp, count in timepoint_counts.items():
    pct = count / len(sample_mapping) * 100
    print(f"  {tp}: {count} samples ({pct:.1f}%)")

In [None]:
# Binary classification balance
print("\nBinary Classification Distribution:")
print("-" * 40)
binary_counts = sample_mapping['binary_class'].value_counts()
for cls, count in binary_counts.items():
    pct = count / len(sample_mapping) * 100
    print(f"  {cls}: {count} samples ({pct:.1f}%)")

In [None]:
# Multiclass distribution (HIIT duration categories)
# dropna=False keeps the non-HIIT samples, which have no duration label
print("\nMulticlass Distribution (HIIT duration):")
print("-" * 40)
multi_counts = sample_mapping['multi_class'].value_counts(dropna=False)
for cls, count in multi_counts.items():
    if pd.isna(cls):
        print(f"  Non-HIIT (Baseline/Control): {count} samples")
    else:
        print(f"  {cls} HIIT: {count} samples")

In [None]:
# Individual subject count
unique_individuals = sample_mapping['individual_id'].nunique()
print(f"\nUnique individuals: {unique_individuals}")

# Check for samples per individual (for repeated measures design)
samples_per_individual = sample_mapping.groupby('individual_id').size()
print(f"Samples per individual: min={samples_per_individual.min()}, "
      f"max={samples_per_individual.max()}, median={samples_per_individual.median():.0f}")

## 9. Create Classification Labels

We create numeric labels suitable for machine learning algorithms:

- **Binary labels**: 0 = Control/Baseline, 1 = HIIT intervention
- **Multiclass labels**: 0 = 4W HIIT, 1 = 8W HIIT, 2 = 12W HIIT
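
The encoding above amounts to a dictionary lookup per sample. The sketch below applies it with `Series.map` on a toy table; the class names `Control`, `HIIT`, `4W`, `8W`, and `12W` are assumed spellings, and the `SampleMapper` methods used in the next cells may differ in detail.

```python
import pandas as pd

BINARY_CODES = {"Control": 0, "HIIT": 1}
MULTI_CODES = {"4W": 0, "8W": 1, "12W": 2}

# Toy mapping with made-up sample IDs; None marks an unknown or non-HIIT sample
toy_mapping = pd.DataFrame(
    {
        "binary_class": ["Control", "HIIT", "HIIT", "HIIT", None],
        "multi_class": [None, "4W", "8W", "12W", None],
    },
    index=["GSM_A", "GSM_B", "GSM_C", "GSM_D", "GSM_E"],
)

# Values without a code become NaN and are dropped before casting to int
binary_labels = toy_mapping["binary_class"].map(BINARY_CODES).dropna().astype(int)
multiclass_labels = toy_mapping["multi_class"].map(MULTI_CODES).dropna().astype(int)

print(binary_labels)
print(multiclass_labels)
```

Because the labels keep the sample IDs as their index, they can later be aligned with the columns of the methylation matrix.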

In [None]:
# Create binary classification labels
binary_labels = mapper.create_binary_labels(
    sample_mapping,
    positive_class='HIIT',
    exclude_unknown=True
)

print("Binary labels distribution:")
print(binary_labels.value_counts(dropna=False))

In [None]:
# Create multiclass labels for HIIT duration
multiclass_labels = mapper.create_multiclass_labels(
    sample_mapping,
    include_baseline=False
)

print("Multiclass labels distribution (HIIT samples only):")
print(multiclass_labels.value_counts(dropna=False))

## 10. Load Methylation Data Matrix

Next, we load the actual methylation beta values. This is the most memory-intensive step, as the EPIC array measures approximately 850,000 CpG sites in every sample.

**Note**: For very large datasets, consider loading a subset of probes initially for testing, then loading the full dataset for final analysis.
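
One way to load such a subset is to read the file directly with `pandas.read_csv` and cap the number of rows. The sketch below is independent of `GEODataLoader`: the function name `load_probe_subset` is hypothetical and the toy values are made up. It relies on `comment="!"` to skip the metadata and marker lines, which assumes that `!` never appears inside the data rows.

```python
import io

import pandas as pd


def load_probe_subset(source, n_probes: int = 1000) -> pd.DataFrame:
    """Read only the first n_probes rows of a series matrix table (sketch)."""
    return pd.read_csv(
        source,
        sep="\t",
        comment="!",      # skips metadata lines and the table markers
        index_col=0,      # the ID_REF column holds the probe IDs
        nrows=n_probes,
    )


# Made-up miniature file, for illustration only
toy_text = (
    '!Series_title\t"Toy series"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM0000001"\t"GSM0000002"\n'
    '"cg00000029"\t0.51\t0.48\n'
    '"cg00000108"\t0.92\t0.95\n'
    '"cg00000109"\t0.88\t0.90\n'
    "!series_matrix_table_end\n"
)

subset = load_probe_subset(io.StringIO(toy_text), n_probes=2)
print(subset)
```

A file path can be passed instead of the `StringIO` object. Once the downstream code works on the subset, the full matrix can be loaded as shown in the next cell.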

In [None]:
# Load the methylation data matrix
# This may take several minutes for EPIC array data

print("Loading methylation data matrix...")
print("This may take several minutes for large datasets.")

methylation_data = loader.load_methylation_matrix()

print("\nData matrix dimensions:")
print(f"  CpG probes (rows): {methylation_data.shape[0]:,}")
print(f"  Samples (columns): {methylation_data.shape[1]}")
print(f"  Total values: {methylation_data.size:,}")

In [None]:
# Preview the methylation data structure
print("Methylation data preview:")
print("=" * 80)
print(f"\nProbe ID index examples: {list(methylation_data.index[:5])}")
print(f"Sample ID column examples: {list(methylation_data.columns[:5])}")

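# Check beta value range (beta values should lie between 0 and 1)
print("\nBeta value statistics:")
print(f"  Min: {methylation_data.min().min():.4f}")
print(f"  Max: {methylation_data.max().max():.4f}")
# Mean of the per-sample means; this equals the overall mean only when
# every sample has the same number of non-missing probes
print(f"  Mean of sample means: {methylation_data.mean().mean():.4f}")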

## 11. Save Processed Data

Save the sample mapping and basic data information for use in subsequent notebooks.
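
A later notebook could read the acquisition info back with a small helper like the one sketched here. The function name `load_acquisition_info` is hypothetical, and the required keys simply mirror the fields written in the cell that follows.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"geo_accession", "n_samples", "n_probes", "timepoints", "data_file"}


def load_acquisition_info(path: Path) -> dict:
    """Load acquisition_info.json and check that the expected keys exist."""
    with open(path, encoding="utf-8") as handle:
        info = json.load(handle)
    missing = REQUIRED_KEYS - info.keys()
    if missing:
        raise KeyError(f"Acquisition info is missing keys: {sorted(missing)}")
    return info
```

Failing early on a missing key keeps a stale or hand-edited file from causing harder-to-trace errors further down the pipeline.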

In [None]:
import json

# The sample mapping was written during creation; confirm the file exists
mapping_path = data_dir / 'GSE171140_sample_mapping.csv'
if mapping_path.exists():
    print(f"Sample mapping saved to: {mapping_path}")
else:
    print(f"Warning: sample mapping not found at {mapping_path}")

# Save acquisition info for reproducibility
acquisition_info = {
    'geo_accession': 'GSE171140',
    'n_samples': len(sample_mapping),
    'n_probes': methylation_data.shape[0],
    # dropna() keeps NaN out of the list, since NaN is not valid JSON
    'timepoints': sample_mapping['time_point'].dropna().unique().tolist(),
    'data_file': str(loader.series_matrix_path)
}

info_path = processed_dir / 'acquisition_info.json'
with open(info_path, 'w') as f:
    json.dump(acquisition_info, f, indent=2)

print(f"Acquisition info saved to: {info_path}")

## Summary

In this notebook, we successfully:

1. Downloaded the GSE171140 series matrix file from GEO
2. Extracted and previewed the file structure
3. Parsed sample metadata to understand the experimental design
4. Created a structured sample mapping with classification labels
5. Loaded the full methylation data matrix
6. Saved processed outputs for subsequent analysis

### Next Steps

Continue to **02_preprocessing.ipynb** to:
- Filter low-variance CpG sites
- Handle missing values
- Assess and correct batch effects
- Prepare data for feature selection

In [None]:
# Session summary
print("=" * 60)
print("DATA ACQUISITION COMPLETE")
print("=" * 60)
print("\nDataset: GSE171140")
print(f"Samples: {len(sample_mapping)}")
print(f"CpG probes: {methylation_data.shape[0]:,}")
print("\nFiles created:")
print(f"  - {mapping_path}")
print(f"  - {info_path}")