<p align="left">
  <img src="https://img.shields.io/badge/Research%20Mode-ON-4cbb17?style=for-the-badge" alt="Research Mode">
</p>

# 02 · Data Exploration: ASAP CRN Learning Lab
*A guided launchpad for your second ASAP-CRN workspace adventure.*

Welcome to the **ASAP-CRN Learning Lab Pilot Workshop Series!**  

This notebook walks you through the essentials of data inspection and preliminary analyses in **Verily Workbench**.

> 💡 **Tip:** Run each cell in order for the smoothest setup experience.  
> You can always come back later to experiment and make it your own.

In [1]:
# setting up environment
import sys
print(sys.executable)
from pathlib import Path
import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1200)

import os
import math
import matplotlib.pyplot as plt
from PIL import Image

try:
    import scanpy as sc
except ImportError as e:
    print("Error -> ", e)
    print("Installing scanpy")
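    # -y skips the interactive confirmation prompt; scanpy is distributed on conda-forge
    !conda install -y -c conda-forge scanpy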
    import scanpy as sc

/Users/amaraalexander/miniconda3/envs/ASAP-CRN/bin/python
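
Before relying on the install fallback above, it can help to see which packages the current interpreter is missing. The snippet below is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, not part of the workshop code, and the package list simply mirrors the imports in the setup cell.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level module names used by the setup cell (PIL is provided by the Pillow package)
required = ["pandas", "matplotlib", "PIL", "scanpy"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

Checking with `find_spec` does not import anything, so it is quick and has no side effects on the running kernel.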


In [None]:
# Set general folder paths
HOME = Path.home()
WS_ROOT = HOME / "workspace"
DATA_DIR = WS_ROOT / "Data"
WS_FILES = WS_ROOT / "ws_files"

if not WS_ROOT.exists():
    print(f"{WS_ROOT} doesn't exist. We need to remount our resources")
    !wb resource mount    

print("Home directory:     ", HOME)
print("Workspace root:     ", WS_ROOT)
print("Data directory:     ", DATA_DIR)
print("ws_files directory: ", WS_FILES)

print("\nContents of workspace root:")
for p in WS_ROOT.glob("*"):
    print(" -", p.name, "/" if p.is_dir() else "")

### 3. Building Dataset Paths

Next, we define the path to the dataset of interest.  
In this example, we are working with the **PMDBS single-cell RNA-seq cohort** dataset:

- **Workflow** → `pmdbs_sc_rnaseq`  
- **Team** → `cohort`  
- **Source** → `pmdbs`  
- **Type** → `sc-rnaseq`  

These components are combined to construct the bucket and dataset names.  
We then set the path to the **cohort analysis outputs** and preview the available files.


In [None]:
## Build and set path to desired dataset

DATASETS_PATH = WS_ROOT / "01_PMDBS_scRNAseq"

workflow       = "pmdbs_sc_rnaseq"
dataset_team   = "cohort"
dataset_source = "pmdbs"
dataset_type   = "sc-rnaseq"

bucket_name  = f"asap-curated-{dataset_team}-{dataset_source}-{dataset_type}"
dataset_name = f"asap-{dataset_team}-{dataset_source}-{dataset_type}"

dataset_path = DATASETS_PATH / bucket_name / workflow
print("Dataset Path:", dataset_path)
cohort_analysis_path = dataset_path / "cohort_analysis"

!ls  {cohort_analysis_path} 
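
The `!ls` preview only works inside a notebook and prints an error rather than raising when the path is missing. As a hedged alternative, the sketch below lists a directory with `pathlib`; `list_dir` is a hypothetical helper, and the demo runs on a throwaway temporary directory rather than the real bucket.

```python
import tempfile
from pathlib import Path


def list_dir(path, pattern="*"):
    """Return sorted entry names under `path`; directories get a trailing '/'."""
    path = Path(path)
    if not path.is_dir():
        # Missing or unmounted folders yield an empty listing instead of an error
        return []
    return sorted(p.name + ("/" if p.is_dir() else "") for p in path.glob(pattern))


# Demo on a temporary directory; in the notebook you would pass cohort_analysis_path
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp)
    (demo / "figures").mkdir()
    (demo / "example.csv").write_text("a,b\n1,2\n")
    demo_listing = list_dir(demo)
    print(demo_listing)
```

An empty list from `list_dir(cohort_analysis_path)` would suggest that the resources are not mounted or that one of the path components is misspelled.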

### 4. Metadata Resources

Alongside the dataset, we also define a path to the **release metadata resources**.  
This folder contains tables describing samples, subjects, brain regions, and experimental conditions.  
Previewing the contents helps us confirm which metadata files are available for integration.


In [None]:
# Define metadata folder path
ds_metadata_path = WS_ROOT / "release_resources/cohort-pmdbs-sc-rnaseq/metadata"

# Preview contents
!ls {ds_metadata_path}

### 5. Local Output Directory

To keep our work organized, we create a local directory inside `ws_files` called `pilot_workshop_files`.  
This is where we'll save any outputs (plots, tables, subsetted data) that we want to retain or share.  
If the directory doesn't exist yet, we create it.


In [None]:
# Define a local path for workshop files
local_data_path = WS_FILES / "pilot_workshop_files"

# Create the directory (and any missing parents); no error if it already exists
local_data_path.mkdir(parents=True, exist_ok=True)

print(f"Local data directory ready at: {local_data_path}")

### 6. Loading Data

We now bring in the curated dataset files:

- **`asap-cohort.final_metadata.csv`** → cell-level metadata table
- **`asap-cohort.final.h5ad`** â†’ full AnnData object containing expression data and annotations  

We copy these files into our local `pilot_workshop_files` directory (if not already present) and load them.  
The metadata CSV is read into a pandas DataFrame, while the `.h5ad` file is opened as an AnnData object in backed mode, which leaves the expression matrix on disk instead of loading it all into memory.


In [None]:
# Define the expected local path
cell_metadata_local_path = local_data_path / f"asap-{dataset_team}.final_metadata.csv"
if not cell_metadata_local_path.exists():
    cell_metadata_og_path = cohort_analysis_path / f"asap-{dataset_team}.final_metadata.csv"
    !cp {cell_metadata_og_path} {cell_metadata_local_path}

# Load the cell-level metadata table
cell_metadata_df = pd.read_csv(cell_metadata_local_path, low_memory=False)
print(f"We have loaded the cell_metadata for N={cell_metadata_df.shape[0]} cells")


In [None]:

adata_local_path = local_data_path / f"asap-{dataset_team}.final.h5ad"

# Check if the adata file already exists locally.
if not adata_local_path.exists():
    adata_cell_metadata_og_path = cohort_analysis_path / f"asap-{dataset_team}.final.h5ad"
    !cp {adata_cell_metadata_og_path} {adata_local_path}

# backed="r" opens the file read-only and reads the expression matrix from disk on demand
adata = sc.read_h5ad(adata_local_path, backed="r")
adata
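
Both copy steps above shell out to `!cp`, which keeps going even when the source file is missing, so the failure only surfaces later at load time. The following is a minimal sketch of a pure-Python alternative; `copy_if_missing` is a hypothetical helper, and the demo uses temporary files rather than the real dataset.

```python
import shutil
import tempfile
from pathlib import Path


def copy_if_missing(src, dst):
    """Copy src to dst unless dst already exists; return True if a copy happened."""
    src, dst = Path(src), Path(dst)
    if dst.exists():
        return False
    if not src.exists():
        # Fail early with a clear message instead of at read time
        raise FileNotFoundError(f"Source file not found: {src}")
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return True


# Demo with throwaway files; the second call is a no-op because the copy exists
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.csv"
    dst = Path(tmp) / "local" / "copy.csv"
    src.write_text("cell_id,cluster\nc1,0\n")
    first = copy_if_missing(src, dst)
    second = copy_if_missing(src, dst)
    copied_text = dst.read_text()
    print(first, second)
```

In the notebook, the same call could stand in for each `if not ... exists()` block, for example with the original and local metadata paths as `src` and `dst`.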

### 7. Data Exploration

With both the metadata and the AnnData object loaded, we can begin exploring the dataset. A key step is **merging dataset-level metadata into cell-level metadata**. This allows us to annotate each cell with experimental conditions and subject information, enabling richer analyses.

Specifically, we combine:
- **Sample-level metadata** (`SAMPLE.csv`)  
- **Subject-level metadata** (`SUBJECT.csv`)  
- **Brain sample metadata** (`PMDBS.csv`)  
- **Experimental condition metadata** (`CONDITION.csv`)  

From each table, we select only the relevant columns (IDs, demographics, brain regions, conditions) to keep the merged metadata concise and focused. This merged metadata will later allow us to subset the dataset (e.g., by diagnosis or brain region) and encode Parkinson's disease state for downstream analysis.
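
To make the merge logic concrete, here is a small sketch on toy tables. The key columns (`ASAP_sample_id`, `ASAP_subject_id`, `condition_id`) follow the metadata tables described above, but every value is invented, and the `sample` column used to link cells to samples is an assumption: check the actual cell-level metadata for the real column name.

```python
import pandas as pd

# Toy stand-ins for the four metadata tables; all values are invented
sample = pd.DataFrame({
    "ASAP_sample_id": ["S1", "S2", "S3"],
    "ASAP_subject_id": ["P1", "P1", "P2"],
    "condition_id": ["C1", "C1", "C2"],
})
subject = pd.DataFrame({
    "ASAP_subject_id": ["P1", "P2"],
    "sex": ["F", "M"],
    "primary_diagnosis": ["PD", "Control"],
})
pmdbs = pd.DataFrame({
    "ASAP_sample_id": ["S1", "S2", "S3"],
    "brain_region": ["region_a", "region_b", "region_a"],
})
condition = pd.DataFrame({
    "condition_id": ["C1", "C2"],
    "intervention_name": ["case", "control"],
})

# Build one row per sample; validate= raises if a key is unexpectedly duplicated
sample_meta = (
    sample
    .merge(subject, on="ASAP_subject_id", how="left", validate="many_to_one")
    .merge(pmdbs, on="ASAP_sample_id", how="left", validate="one_to_one")
    .merge(condition, on="condition_id", how="left", validate="many_to_one")
)

# Hypothetical cell table; "sample" as the linking column is an assumption
cells = pd.DataFrame({
    "cell_id": ["c1", "c2", "c3", "c4"],
    "sample": ["S1", "S1", "S2", "S3"],
})
cell_meta = cells.merge(
    sample_meta, left_on="sample", right_on="ASAP_sample_id",
    how="left", validate="many_to_one",
)
print(cell_meta[["cell_id", "sample", "primary_diagnosis", "brain_region"]])
```

Using `how="left"` keeps every cell even when a sample has no metadata match, and `validate` guards against silently multiplying rows when a lookup table contains duplicate keys.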


In [None]:
# Sample-level metadata
SAMPLE = pd.read_csv(ds_metadata_path / "SAMPLE.csv", index_col=0)
# Subject-level metadata
SUBJECT = pd.read_csv(ds_metadata_path / "SUBJECT.csv", index_col=0)
# Brain-sample metadata
PMDBS = pd.read_csv(ds_metadata_path / "PMDBS.csv", index_col=0)
# Experimental condition metadata
CONDITION = pd.read_csv(ds_metadata_path / "CONDITION.csv", index_col=0)

# Keep only the columns we need from each table
sample_cols = [
    "ASAP_sample_id",
    "ASAP_subject_id",
    "ASAP_team_id",
    "ASAP_dataset_id",
    "replicate",
    "condition_id",
]
subject_cols = [
    "ASAP_subject_id",
    "source_subject_id",
    "sex",
    "age_at_collection",
    "primary_diagnosis",
]
pmdbs_cols = [
    "ASAP_sample_id",
    "brain_region",
    "region_level_1",
    "region_level_2",
    "region_level_3",
]
condition_cols = [
    "condition_id",
    "intervention_name",
    "intervention_id",
    "protocol_id",
]