# Dataset Discovery and Download from CELLxGENE Census

**Objective**: Discover and download datasets containing NK cells from healthy controls.

This notebook:
1. Searches CELLxGENE Census for datasets matching our criteria
2. Filters for healthy controls, peripheral blood, and 10x Genomics platforms
3. Downloads source H5AD files
4. Extracts NK cells and assigns age groups
5. Saves dataset catalog for reproducibility

## Setup and Imports

In [None]:
import sys
sys.path.append('..')

import os
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import project modules
from src import utils, download

# Set random seeds for reproducibility
utils.set_random_seeds(42)

print("Imports successful!")

## Load Configuration

In [None]:
# Load datasets configuration
config = utils.load_config("../config/datasets.yaml")

# Setup logging
logger = utils.setup_logging(
    log_file="../results/reports/01_dataset_discovery.log",
    log_level="INFO"
)

logger.info("Configuration loaded successfully")
logger.info(f"Census version: {config['census_version']}")

# Display inclusion criteria
print("\nInclusion Criteria:")
print("="*60)
for key, value in config['inclusion_criteria'].items():
    if isinstance(value, list) and len(value) <= 5:
        print(f"{key}: {', '.join(map(str, value))}")
    elif isinstance(value, list):
        print(f"{key}: {len(value)} items")
    else:
        print(f"{key}: {value}")

## Discover Datasets

Search CELLxGENE Census for datasets containing NK cells from healthy donors.
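
As a rough illustration of what such a search involves, the sketch below turns a dictionary of inclusion criteria into a filter expression of the kind Census obs queries accept through their `value_filter` argument. It is not the implementation of `download.discover_datasets`, which is not shown in this notebook. The helper `build_value_filter` and the `example_criteria` dictionary are hypothetical, and the column names (`cell_type`, `disease`, `tissue_general`, `is_primary_data`) are assumed to follow the Census obs schema rather than the keys in `datasets.yaml`.

```python
def build_value_filter(criteria):
    """Join per-column criteria into a single 'and'-combined filter string.

    Hypothetical helper: lists become `column in [...]`, scalars become
    `column == value`.
    """
    clauses = []
    for column, allowed in criteria.items():
        if isinstance(allowed, (list, tuple)):
            values = ", ".join(repr(v) for v in allowed)
            clauses.append(f"{column} in [{values}]")
        else:
            # repr() quotes strings and leaves booleans as True/False
            clauses.append(f"{column} == {allowed!r}")
    return " and ".join(clauses)


# Illustrative criteria only; the real values live in the project config.
example_criteria = {
    "cell_type": ["natural killer cell"],
    "disease": "normal",
    "tissue_general": "blood",
    "is_primary_data": True,
}

value_filter = build_value_filter(example_criteria)
print(value_filter)
```

Keeping the criteria in a dictionary and generating the filter string from it means the same configuration can be logged, displayed, and queried without the three drifting apart.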

In [None]:
# Discover datasets matching our criteria
datasets_df = download.discover_datasets(config, logger)

print(f"\nFound {len(datasets_df)} datasets matching criteria")
print(f"Total NK cells: {datasets_df['nk_cell_count'].sum():,}")

# Display top datasets by NK cell count
print("\nTop 10 datasets by NK cell count:")
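# Sort explicitly so the table really shows the largest datasets first
display(datasets_df.sort_values('nk_cell_count', ascending=False).head(10))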

## Visualize Dataset Statistics

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Distribution of NK cell counts
axes[0].hist(datasets_df['nk_cell_count'], bins=30, color='#3498DB', edgecolor='black')
axes[0].set_xlabel('NK Cell Count', fontsize=12)
axes[0].set_ylabel('Number of Datasets', fontsize=12)
axes[0].set_title('Distribution of NK Cells per Dataset', fontsize=14)
axes[0].axvline(
    datasets_df['nk_cell_count'].median(),
    color='red',
    linestyle='--',
    label=f"Median: {datasets_df['nk_cell_count'].median():.0f}"
)
axes[0].legend()
axes[0].grid(alpha=0.3, axis='y')

# Cumulative NK cells
sorted_counts = datasets_df['nk_cell_count'].sort_values(ascending=False).reset_index(drop=True)
cumsum = sorted_counts.cumsum()
axes[1].plot(cumsum.index + 1, cumsum.values, linewidth=2, color='#E74C3C')
axes[1].set_xlabel('Number of Datasets', fontsize=12)
axes[1].set_ylabel('Cumulative NK Cell Count', fontsize=12)
axes[1].set_title('Cumulative NK Cells by Dataset', fontsize=14)
axes[1].grid(alpha=0.3)

plt.tight_layout()
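# Make sure the figures directory exists before saving
Path('../results/figures').mkdir(parents=True, exist_ok=True)
plt.savefig('../results/figures/dataset_statistics.pdf', dpi=300, bbox_inches='tight')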
plt.show()

logger.info("Dataset statistics plots saved")

## Download Datasets

Download H5AD files for all discovered datasets.

**Note**: This step can take several hours and requires substantial disk space (roughly 10-50 GB).
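
Given the space requirement, it may be worth checking free disk space before starting. The following is a minimal sketch using only the standard library. The helper `has_free_space` is hypothetical and not part of the project's `download` module, and the 50 GB threshold is simply the upper end of the estimate above.

```python
import shutil
from pathlib import Path


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / 1e9 >= required_gb


# The path must already exist; in this notebook it would be the raw-data directory.
target = Path(".")
required_gb = 50  # assumption: upper end of the 10-50 GB estimate

if has_free_space(target, required_gb):
    print(f"Enough free space under {target.resolve()} for the download")
else:
    print(f"Less than {required_gb} GB free under {target.resolve()}; consider limiting the datasets")
```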

For testing, you can limit the number of datasets by uncommenting the line below:

In [None]:
# Optional: Limit number of datasets for testing
# datasets_df = datasets_df.head(5)  # Download only first 5 datasets

# Download datasets
output_dir = Path("../data/raw")
output_dir.mkdir(parents=True, exist_ok=True)

logger.info(f"Starting download of {len(datasets_df)} datasets")

downloaded_files = download.batch_download(
    datasets_df,
    output_dir,
    census_version=config['census_version'],
    logger=logger
)

print(f"\nSuccessfully downloaded {len(downloaded_files)} of {len(datasets_df)} datasets")

## Save Dataset Catalog

Save metadata for reproducibility and reference.
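
One optional addition for reproducibility is a checksum manifest of the downloaded files, so that a later run can confirm it is working with identical inputs. The sketch below is an assumption about how this could be done, not something `download.save_dataset_catalog` is known to do. The helpers `sha256_of_file` and `write_checksum_manifest` are hypothetical, and the demo runs on a temporary placeholder file rather than on real H5AD data.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large H5AD files are never loaded fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_manifest(data_dir, manifest_path, pattern="*.h5ad"):
    """Write a CSV of (filename, size_bytes, sha256) for matching files and return the rows."""
    rows = [
        (p.name, p.stat().st_size, sha256_of_file(p))
        for p in sorted(Path(data_dir).glob(pattern))
    ]
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "size_bytes", "sha256"])
        writer.writerows(rows)
    return rows


# Demo on a temporary directory with a placeholder file (not a real H5AD file).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.h5ad"
    demo_file.write_bytes(b"not a real H5AD file")
    manifest_rows = write_checksum_manifest(tmp, Path(tmp) / "checksums.csv")
    print(manifest_rows)
```

In this notebook, the same function could be pointed at the raw-data directory and the manifest written next to the dataset catalog.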

In [None]:
# Save dataset catalog
metadata_dir = Path("../data/metadata")
metadata_dir.mkdir(parents=True, exist_ok=True)

catalog_path = metadata_dir / "dataset_catalog.csv"

download.save_dataset_catalog(datasets_df, catalog_path, logger)

print(f"\nDataset catalog saved to: {catalog_path}")
print("\nSummary:")
print(f"  Total datasets: {len(datasets_df)}")
print(f"  Total NK cells: {datasets_df['nk_cell_count'].sum():,}")
print(f"  Mean NK cells/dataset: {datasets_df['nk_cell_count'].mean():.0f}")
print(f"  Median NK cells/dataset: {datasets_df['nk_cell_count'].median():.0f}")
print(f"  Min NK cells: {datasets_df['nk_cell_count'].min():,}")
print(f"  Max NK cells: {datasets_df['nk_cell_count'].max():,}")

## Next Steps

Proceed to `02_exploratory_analysis.ipynb` to:
- Load downloaded datasets
- Explore data structure and metadata
- Identify potential batch effects
- Plan QC strategy

In [None]:
logger.info("Dataset discovery and download complete!")
utils.log_memory_usage(logger)