<p align="left">
  <img src="https://img.shields.io/badge/Research%20Mode-ON-4cbb17?style=for-the-badge" alt="Research Mode">
</p>

# 02 · Data Exploration: ASAP CRN Learning Lab
*A guided launchpad for your second ASAP-CRN workspace adventure.*

Welcome to the **ASAP-CRN Learning Lab Pilot Workshop Series!**  

This notebook walks you through the essentials of data inspection and preliminary analyses in **Verily Workbench**.

> 💡 **Tip:** Run each cell in order for the smoothest setup experience.  
> You can always come back later to experiment and make it your own.

In [None]:
# setting up environment
import sys
print(sys.executable)
from pathlib import Path
import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1200)

import os
import math
import matplotlib.pyplot as plt
from PIL import Image

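try:
    import scanpy as sc
except ImportError as e:
    print("Error -> ", e)
    print("Installing scanpy")
    # %conda targets the environment backing this kernel; -y skips the confirmation prompt
    %conda install -y -c conda-forge scanpy
    import scanpy as sc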

/Users/amaraalexander/miniconda3/envs/ASAP-CRN/bin/python
Error ->  No module named 'scanpy'
Installing scanpy
Collecting package metadata (current_repodata.json): done
Solving environment: done


# All requested packages already installed.



ModuleNotFoundError: No module named 'scanpy'
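
The output above shows conda reporting the package as already installed while the kernel still cannot import it. One possible cause is that the install went to a different environment than the one the kernel is running in. The sketch below is a small diagnostic, not part of the workshop setup: it prints the kernel's interpreter and checks whether a module is visible to it. The helper name `module_visible` is hypothetical.

```python
import importlib.util
import sys


def module_visible(name: str) -> bool:
    """Return True if the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


# The interpreter backing this kernel; packages must be installed
# into the environment this path belongs to.
print("Kernel interpreter:", sys.executable)

for pkg in ("pandas", "scanpy"):
    status = "found" if module_visible(pkg) else "missing"
    print(f"{pkg}: {status}")
```

If a package shows as missing here but conda says it is installed, comparing `sys.executable` with the environment conda reports is a reasonable next step.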

### Set Project Paths

In [None]:
#set general folder paths
HOME = Path.home()
WS_ROOT = HOME / "workspace"
DATA_DIR = WS_ROOT / "Data"
WS_FILES = WS_ROOT / "ws_files"

if not WS_ROOT.exists():
    print(f"{WS_ROOT} doesn't exist. We need to remount our resources")
    !wb resource mount

print("Home directory:     ", HOME)
print("Workspace root:     ", WS_ROOT)
print("Data directory:     ", DATA_DIR)
print("ws_files directory: ", WS_FILES)

print("\nContents of workspace root:")
for p in WS_ROOT.glob("*"):
    # Append "/" directly to directory names (no stray space)
    print(f" - {p.name}{'/' if p.is_dir() else ''}")
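
Before building dataset paths on top of these folders, it can help to confirm which of them actually exist. The following is a minimal sketch, assuming the same `~/workspace` layout as the cell above; the helper name `missing_paths` and the labels in `expected` are hypothetical.

```python
from pathlib import Path


def missing_paths(paths: dict[str, Path]) -> list[str]:
    """Return the labels whose paths do not exist on disk."""
    return [label for label, path in paths.items() if not path.exists()]


# Hypothetical usage with the folder layout assumed in this notebook.
ws_root = Path.home() / "workspace"
expected = {
    "workspace root": ws_root,
    "data": ws_root / "Data",
    "ws_files": ws_root / "ws_files",
}
missing = missing_paths(expected)
print("Missing:", missing if missing else "none")
```

An empty result suggests the resources are mounted as expected; anything listed is a hint to remount before continuing.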

In [None]:
## Build and set path to desired dataset

DATASETS_PATH = WS_ROOT / "01_PMDBS_scRNAseq"

workflow       = "pmdbs_sc_rnaseq"
dataset_team   = "cohort"
dataset_source = "pmdbs"
dataset_type   = "sc-rnaseq"

bucket_name  = f"asap-curated-{dataset_team}-{dataset_source}-{dataset_type}"
dataset_name = f"asap-{dataset_team}-{dataset_source}-{dataset_type}"

dataset_path = DATASETS_PATH / bucket_name / workflow
print("Dataset Path:", dataset_path)
cohort_analysis_path = dataset_path / "cohort_analysis"

!ls  {cohort_analysis_path} 

In [None]:
# Define metadata folder path
ds_metadata_path = WS_ROOT / "release_resources/cohort-pmdbs-sc-rnaseq/metadata"

# Preview contents
!ls {ds_metadata_path}

If it doesn't already exist, create a local directory to store any outputs you want to retain and share.

In [None]:
# Define a local path for workshop files
local_data_path = WS_FILES / "pilot_workshop_files"

# Create the directory if it doesn't already exist
local_data_path.mkdir(parents=True, exist_ok=True)

print(f"Local data directory ready at: {local_data_path}")

### Load in Data 

We will use: 
- asap-cohort.final_metadata.csv
- asap-cohort.final.h5ad
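
Each file is copied into the local workshop directory only when it is not already there, so re-running the notebook does not repeat the copy. As an illustration of that pattern in pure Python (an alternative to the `!cp` shell calls, not the notebook's own method), here is a minimal sketch; the helper name `copy_if_missing` is hypothetical.

```python
import shutil
from pathlib import Path


def copy_if_missing(src: Path, dst: Path) -> bool:
    """Copy src to dst unless dst already exists.

    Returns True if a copy happened, False if dst was already present.
    """
    if dst.exists():
        return False
    # Make sure the destination folder exists before copying
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return True
```

A call such as `copy_if_missing(source_file, local_file)` can then replace the `if not ... exists()` check followed by `!cp`.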

In [None]:
# Define the expected local path
cell_metadata_local_path = local_data_path / f"asap-{dataset_team}.final_metadata.csv"
if not cell_metadata_local_path.exists():
    cell_metadata_og_path = cohort_analysis_path / f"asap-{dataset_team}.final_metadata.csv"
    !cp {cell_metadata_og_path} {cell_metadata_local_path}

# Load the cell-level metadata table
cell_metadata_df = pd.read_csv(cell_metadata_local_path, low_memory=False)
print(f"We have loaded the cell_metadata for N={cell_metadata_df.shape[0]} cells")


In [None]:

adata_local_path = local_data_path / f"asap-{dataset_team}.final.h5ad"

# Copy the AnnData file locally if it isn't there yet
if not adata_local_path.exists():
    adata_og_path = cohort_analysis_path / f"asap-{dataset_team}.final.h5ad"
    !cp {adata_og_path} {adata_local_path}

adata = sc.read_h5ad(adata_local_path, backed="r")
adata

### Data Exploration

Now that both the cell metadata and the AnnData object are loaded, we can begin exploring their contents further.

#### Merging dataset metadata with cell-metadata

We will use the available metadata to subset the data to cells of interest.  
First, let's load some of the _dataset_-level metadata and map it onto our _cell_-level metadata, so that each cell is annotated with its experimental conditions.
In the next section we will use these annotations to create subsets of our `asap-cohort` PMDBS snRNAseq dataset and to encode Parkinson's disease state.

Specifically, we'll combine sample-level, subject-level, PMDBS-specific, and experimental-condition metadata by joining the `SAMPLE`, `SUBJECT`, `PMDBS`, and `CONDITION` tables.
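
To make the join concrete, here is a minimal sketch on toy tables. It assumes the four tables can be linked on `ASAP_subject_id`, `ASAP_sample_id`, and `condition_id`; every value and every `*_toy` name below is invented for illustration and does not come from the dataset.

```python
import pandas as pd

# Toy stand-ins for the four tables; all values are invented.
sample_toy = pd.DataFrame({
    "ASAP_sample_id": ["s1", "s2"],
    "ASAP_subject_id": ["u1", "u2"],
    "condition_id": ["c1", "c2"],
})
subject_toy = pd.DataFrame({
    "ASAP_subject_id": ["u1", "u2"],
    "primary_diagnosis": ["dx_a", "dx_b"],
})
pmdbs_toy = pd.DataFrame({
    "ASAP_sample_id": ["s1", "s2"],
    "brain_region": ["region_a", "region_b"],
})
condition_toy = pd.DataFrame({
    "condition_id": ["c1", "c2"],
    "intervention_name": ["int_a", "int_b"],
})

# Left joins keep every sample row; validate= raises an error if a
# lookup table unexpectedly has duplicate keys.
sample_meta_toy = (
    sample_toy
    .merge(subject_toy, on="ASAP_subject_id", how="left", validate="many_to_one")
    .merge(pmdbs_toy, on="ASAP_sample_id", how="left", validate="many_to_one")
    .merge(condition_toy, on="condition_id", how="left", validate="many_to_one")
)
print(sample_meta_toy)
```

Left joins starting from the sample table are one reasonable choice here because each sample keeps exactly one row, which makes the result easy to map onto cells by sample ID afterwards.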

In [None]:
# Sample-level metadata
SAMPLE = pd.read_csv(ds_metadata_path / "SAMPLE.csv", index_col=0)
# Subject-level metadata
SUBJECT = pd.read_csv(ds_metadata_path / "SUBJECT.csv", index_col=0)
# Brain-sample (PMDBS) metadata
PMDBS = pd.read_csv(ds_metadata_path / "PMDBS.csv", index_col=0)
# Experimental condition metadata
CONDITION = pd.read_csv(ds_metadata_path / "CONDITION.csv", index_col=0)

# Keep only the columns we need
sample_cols = [
    "ASAP_sample_id",
    "ASAP_subject_id",
    "ASAP_team_id",
    "ASAP_dataset_id",
    "replicate",
    "condition_id",
]
subject_cols = [
    "ASAP_subject_id",
    "source_subject_id",
    "sex",
    "age_at_collection",
    "primary_diagnosis",
]
pmdbs_cols = [
    "ASAP_sample_id",
    "brain_region",
    "region_level_1",
    "region_level_2",
    "region_level_3",
]
condition_cols = [
    "condition_id",
    "intervention_name",
    "intervention_id",
    "protocol_id",
]