<p align="center">
  <img src="https://img.shields.io/badge/Research%20Mode-ON-4cbb17?style=for-the-badge" alt="Research Mode">
</p>

# 01 · Getting Started: ASAP CRN Learning Lab  
*A guided launchpad for your first ASAP-CRN workspace adventure.*

Welcome to the **ASAP-CRN Learning Lab Pilot Workshop Series!**  

This notebook walks you through the essentials of workspace orientation and setup in **Verily Workbench**.

> 💡 **Tip:** Run each cell in order for the smoothest setup experience.  
> You can always come back later to experiment and make it your own.


In [None]:
# setting up environment
import sys
print(sys.executable)
from pathlib import Path
import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1200)

import os
import math
import matplotlib.pyplot as plt
from PIL import Image

try:
    import scanpy as sc
except ImportError as e:
    print("Error -> ", e)
    print("Installing scanpy")
    !{sys.executable} -m pip install scanpy
    import scanpy as sc


# Table of Contents
1. [Workspace Orientation](#workspace-orientation)    
2. [Setting Project Paths](#setting-project-paths)
    - [2.1 Understanding the Path Components](#dataset-paths)
    - [2.2 Accessing Metadata Files](#analysis-output-files)
    - [2.3 Inspecting Curated Files](#metadata-files)
3. [Exploring a Dataset](#exploring-a-dataset)  
   - [3.1 Inspect QC Plots](#inspect-qc-plots)
   - [3.2 Downloading Data Files Locally](#copying-data-locally)
   - [3.3 Exploring the Cell Metadata](#preview-cell-metadata)
   - [3.4 Exploring the AnnData](#preview-anndata)
5. [Reproducibility Notes](#reproducibility-notes)  
6. [Next Steps](#next-steps)  

## 1. Workspace Orientation

In the **ASAP-CRN Learning Lab** workspace, data and resources are mounted under your home directory, typically:

- `~/workspace/` – workspace mount for data and outputs  
- `~/workspace/*/asap-curated-team/` – team-specific curated and derived datasets
- `~/workspace/*/asap-curated-cohort/` – multi-team curated and derived datasets  
- `~/workspace/ws_files/` – your personal scratch space for files and results  

General subfolders:
- `~/cohort_analysis` - Processed cohort-level outputs
- `~/preprocess` - Intermediate files from data curation outputs

In this section, we'll confirm these paths and see what's available.

In [None]:
#set general folder paths
HOME = Path.home()
WS_ROOT = HOME / "workspace"
DATA_DIR = WS_ROOT / "Data"
WS_FILES = WS_ROOT / "ws_files"

if not WS_ROOT.exists():
    print(f"{WS_ROOT} doesn't exist. We need to remount our resources")
    !wb resource mount    

print("Home directory:     ", HOME)
print("Workspace root:     ", WS_ROOT)
print("Data directory:     ", DATA_DIR)
print("ws_files directory: ", WS_FILES)

print("\nContents of workspace root:")
for p in WS_ROOT.glob("*"):
    print(" -", p.name, "/" if p.is_dir() else "")
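
To look one or two levels below the workspace root without opening the File Browser, a depth-limited listing can help. The following is a minimal sketch; `list_tree` and its `max_depth` parameter are hypothetical helpers, not part of Workbench. It assumes the mounts are browsable as ordinary directories, and listing very large mounted folders may be slow.

```python
from pathlib import Path


def list_tree(root: Path, max_depth: int = 2) -> list[str]:
    """Return indented entry names under root, down to max_depth levels."""
    lines: list[str] = []

    def walk(folder: Path, depth: int) -> None:
        if depth > max_depth:
            return
        try:
            entries = sorted(folder.iterdir(), key=lambda p: p.name)
        except OSError:
            # Unreadable or vanished folders are skipped rather than aborting the listing.
            return
        for entry in entries:
            suffix = "/" if entry.is_dir() else ""
            lines.append("  " * (depth - 1) + entry.name + suffix)
            if entry.is_dir():
                walk(entry, depth + 1)

    walk(root, 1)
    return lines


ws_root = Path.home() / "workspace"
if ws_root.exists():
    print("\n".join(list_tree(ws_root, max_depth=2)))
else:
    print(f"{ws_root} not found; mount the workspace resources first.")
```

Keeping `max_depth` small avoids walking into the full dataset hierarchy, which can contain many files.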

## 2. Setting Project Paths

For the next steps, we will work with datasets processed using the **PMDBS scRNAseq** workflow. Specifically, we will focus on the **cohort-level dataset**: `asap-cohort-pmdbs-sc-rnaseq`.  

This dataset represents a multi-dataset integration: samples from **five contributing datasets**, processed, curated, and harmonized into a single cohort resource.

### 2.1 Understanding the Path Components
The dataset paths follow a structured hierarchy. Each component has a specific meaning:

- **`workflow`**: identifies the workflow used for aggregation and integration.  
  Here we use the **[PMDBS scRNAseq workflow](https://github.com/ASAP-CRN/pmdbs-sc-rnaseq-wf)**.

- **`dataset_team`**: identifies the contributing team or grouping of datasets. For cohort-level analyses, this value is **`cohort`**, indicating multiple datasets combined.

- **`source`**: describes the biological source of the samples.  
  In this case, **`pmdbs`** refers to *post-mortem–derived brain samples*.

- **`dataset_type`**: describes the type of data generated.  
  Here it is **`sc-rnaseq`**, indicating single-cell RNA sequencing.

- **`bucket_name`**: the Google Cloud Storage bucket containing the curated dataset.

- **`dataset_name`**: a unique identifier for each curated dataset or collection.


In [None]:
## Build and set path to desired dataset

DATASETS_PATH = WS_ROOT / "01_PMDBS_scRNAseq"

workflow       = "pmdbs_sc_rnaseq"
dataset_team   = "cohort"
dataset_source = "pmdbs"
dataset_type   = "sc-rnaseq"

bucket_name  = f"asap-curated-{dataset_team}-{dataset_source}-{dataset_type}"
dataset_name = f"asap-{dataset_team}-{dataset_source}-{dataset_type}"

dataset_path = DATASETS_PATH / bucket_name / workflow
print("Dataset Path:", dataset_path)

### 2.2 Accessing Metadata Files

Metadata for each dataset is located within the `release_resources` directory. See the [Data Dictionary](https://storage.googleapis.com/asap-public-assets/wayfinding/ASAP-CRN-Cloud-Data-Dictionary.pdf) for an overview of the metadata tables.

> **Note:**  
> Metadata files are organized using the **short `dataset_name`**, not the `bucket_name`.  
> You can always use the File Browser tab of the side panel to explore directories and right-click any folder or file to copy its full path.


In [None]:
#Define metadata folder path
ds_metadata_path = WS_ROOT / "release_resources/cohort-pmdbs-sc-rnaseq/metadata"

#preview contents
!ls {ds_metadata_path} 

In [None]:
# preview a dataset metadata file 
display(pd.read_csv(ds_metadata_path / "CONDITION.csv", index_col=0).head(10))
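
To get an overview of every metadata table before opening any of them, it can help to list each CSV with its column headers. This is a minimal sketch using only the standard library; `table_headers` is a hypothetical helper, and it assumes the metadata tables are plain comma-separated files with a header row.

```python
import csv
from pathlib import Path


def table_headers(folder: Path) -> dict[str, list[str]]:
    """Map each CSV file name in folder to its header row."""
    headers: dict[str, list[str]] = {}
    for csv_path in sorted(folder.glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            # Read only the first row; an empty file yields an empty header list.
            headers[csv_path.name] = next(csv.reader(handle), [])
    return headers


# Same metadata folder as defined earlier in this notebook.
metadata_dir = Path.home() / "workspace" / "release_resources/cohort-pmdbs-sc-rnaseq/metadata"
if metadata_dir.exists():
    for name, cols in table_headers(metadata_dir).items():
        print(f"{name}: {len(cols)} columns -> {cols[:5]}")
else:
    print(f"{metadata_dir} not found; check that resources are mounted.")
```

Because only the first line of each file is read, this stays fast even for large tables.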

### 2.3 Inspecting Curated Files

Now that our path components are defined, we can inspect the curated files available in the `cohort_analysis` directory.

In [None]:
# Build the folder path to the cohort analysis directory
cohort_analysis_path = dataset_path / "cohort_analysis"

# Preview the directory contents
print("Contents of cohort_analysis:")
!ls {cohort_analysis_path}

# Optional pure-Python preview:
# [f.name for f in cohort_analysis_path.glob("**/*") if f.is_file()]
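
Before copying anything locally, it is useful to know how large each file is. The sketch below lists files with their sizes, largest first; `summarize_files` is a hypothetical helper, and the folder path is rebuilt from the components defined earlier so the cell can run on its own.

```python
from pathlib import Path


def summarize_files(folder: Path) -> list[tuple[str, float]]:
    """Return (relative path, size in MiB) for every file under folder, largest first."""
    rows = [
        (str(f.relative_to(folder)), f.stat().st_size / 2**20)
        for f in folder.rglob("*")
        if f.is_file()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Same location as cohort_analysis_path in this notebook.
analysis_dir = (
    Path.home()
    / "workspace"
    / "01_PMDBS_scRNAseq"
    / "asap-curated-cohort-pmdbs-sc-rnaseq"
    / "pmdbs_sc_rnaseq"
    / "cohort_analysis"
)
if analysis_dir.exists():
    for rel_path, size_mib in summarize_files(analysis_dir)[:10]:
        print(f"{size_mib:10.1f} MiB  {rel_path}")
else:
    print(f"{analysis_dir} not found; check that resources are mounted.")
```

Sorting by size makes it easy to spot files, such as `.h5ad` objects, that may take a while to copy.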

## 3. Exploring a Dataset

With the directory structure in place, we can begin exploring the processed outputs produced by the PMDBS scRNA-seq workflow.

### 3.1 Inspect QC Plots

The curated dataset includes several **QC violin plots** summarizing key metrics (e.g., doublet score, gene counts, mitochondrial content). Let's load and display all violin plot images found in the folder.

In [None]:
# Find all violin plot images
images = [
    os.path.join(cohort_analysis_path, f)
    for f in os.listdir(cohort_analysis_path)
    if f.lower().endswith("violin.png")
]

n = len(images)
if n == 0:
    print("No violin plots found in cohort_analysis.")
else:
    cols = 3
    rows = math.ceil(n / cols)

    plt.figure(figsize=(14, 4 * rows))

    for i, img_path in enumerate(images):
        img = Image.open(img_path)
        plt.subplot(rows, cols, i + 1)
        plt.imshow(img)
        plt.title(os.path.basename(img_path), fontsize=10)
        plt.axis("off")

    plt.tight_layout()
    plt.show()

### 3.2 Downloading data files locally

In this step, we'll set up a local directory inside our JupyterLab environment to store data files.  The `ws_files` area is a scratch space tied to your workspace: anything saved here can be accessed later in the notebook, processed with Python, or uploaded back to a workspace bucket for sharing.

We'll create a folder called `pilot_workshop_files` under our `ws_files` path (`WS_FILES`).  This keeps all downloaded datasets organized in one place.

In [None]:
# Define a local path for workshop files
local_data_path = WS_FILES / "pilot_workshop_files"

# Create the directory if it doesn't already exist
local_data_path.mkdir(parents=True, exist_ok=True)

print(f"Local data directory ready at: {local_data_path}")


We can now copy the curated AnnData object and its associated `obs` field (cell metadata) into our local workspace directory.
> **Note:**
> Copying the data locally before loading it into the notebook is recommended, because reads from local storage are generally faster than reads from the mounted resources.

In [None]:
# Downloading obs field (cell metadata)
# Define the expected local path for the metadata file.
cell_metadata_local_path = local_data_path / f"asap-{dataset_team}.final_metadata.csv"

# Check if the metadata file already exists locally.
if not cell_metadata_local_path.exists():
    # Construct the original path where the metadata file is stored.
    cell_metadata_og_path = cohort_analysis_path / f"asap-{dataset_team}.final_metadata.csv"

    # Use a shell command (`cp`) to copy the file from the original location
    # into the local workshop_files directory for analysis.
    !cp {cell_metadata_og_path} {cell_metadata_local_path}

In [None]:
# Downloading the anndata object
# Define the expected local path
adata_local_path = local_data_path / f"asap-{dataset_team}.final.h5ad"

# Check if the AnnData file already exists locally.
if not adata_local_path.exists():
    # Construct the original path where the AnnData file is stored.
    adata_og_path = cohort_analysis_path / f"asap-{dataset_team}.final.h5ad"

    # Use a shell command (`cp`) to copy the file from the original location
    # into the local data directory for analysis.
    !cp {adata_og_path} {adata_local_path}
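
The same copy-if-missing pattern can be written in pure Python, which avoids shell quoting issues and raises a clear error when the source file is absent. This is a minimal sketch; `copy_if_missing` is a hypothetical helper, not part of the workshop code.

```python
import shutil
from pathlib import Path


def copy_if_missing(src: Path, dest_dir: Path) -> Path:
    """Copy src into dest_dir unless a file of the same name is already there."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    if dest.exists():
        # Keep the existing local copy; delete it manually to force a refresh.
        return dest
    if not src.exists():
        raise FileNotFoundError(f"Source file not found: {src}")
    shutil.copy2(src, dest)  # copy2 also preserves timestamps where possible
    return dest
```

In this notebook it could be called as `copy_if_missing(cohort_analysis_path / "asap-cohort.final.h5ad", local_data_path)`. Note that an interrupted copy can leave a partial file behind, which this check would then treat as complete.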

### 3.3 Exploring the Cell Metadata

Once the data is available in our `pilot_workshop_files` directory, we can begin exploring its metadata.

The `obs` field provides a compact entry point into the dataset's metadata, making it easier to explore without handling the full expression matrix.

Each row in `obs` corresponds to a single cell (identified by a unique *barcode*) and contains:

- **Quality control metrics**: e.g. CellBender `cell_probability`, `n_genes_by_counts`, `total_counts`.  
- **Dataset references**: e.g. `sample`, `batch`, `team`, `dataset`.  
- **Downstream analysis results**: e.g. `UMAP_1`, `UMAP_2`, CellAssign `cell_type`, and Leiden cluster assignments.  

Together, these annotations summarize the key observations for each cell and provide a rich foundation for both quality assessment and downstream biological interpretation.
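
To illustrate how a table like this can be summarized, here is a minimal sketch on a toy DataFrame. The column names follow the ones listed above, but the barcodes, samples, cell types, and numbers are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the cell metadata table; all values are invented.
toy_obs = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s2", "s2", "s2"],
        "cell_type": ["neuron", "astrocyte", "neuron", "neuron", "microglia"],
        "n_genes_by_counts": [2100, 1500, 2500, 1900, 1200],
        "total_counts": [5200, 3100, 6400, 4800, 2500],
    },
    index=["AAAC-1", "AAAG-1", "AACT-1", "AAGT-1", "ACGT-1"],  # hypothetical barcodes
)

# Number of cells per annotated cell type.
cells_per_type = toy_obs["cell_type"].value_counts()

# Median QC metrics within each cell type.
qc_by_type = toy_obs.groupby("cell_type")[["n_genes_by_counts", "total_counts"]].median()

# Cell counts broken down by sample and cell type.
cells_per_sample_type = pd.crosstab(toy_obs["sample"], toy_obs["cell_type"])

print(cells_per_type, qc_by_type, cells_per_sample_type, sep="\n\n")
```

The same three calls should work on the real table once it is loaded, provided the corresponding columns are present.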


In [None]:
# load the adata object metadata
cell_metadata_df = pd.read_csv(cell_metadata_local_path, low_memory=False)
print(f"We have loaded the cell_metadata for N={cell_metadata_df.shape[0]} cells")
# Preview the contents 
cell_metadata_df.columns.to_list()

In [None]:
cell_metadata_df["cell_type"].value_counts()

### 3.4 Exploring the AnnData

With the curated dataset stored locally, we can load it into memory using the `scanpy` library.  The data is stored in an **AnnData** object (`.h5ad` format), which is a common structure for single-cell analysis.  

AnnData organizes the dataset into:
- **`.X`** → the main data matrix (e.g., gene expression counts).  
- **`.obs`** → per-cell annotations (metadata such as QC metrics, sample IDs, cell types).  
- **`.var`** → per-feature annotations (metadata about genes/features).  
- **`.uns`** → unstructured annotations (analysis results, parameters, plots).  

We'll load the file in *backed mode* (`backed="r"`), which allows us to access the object without reading the entire matrix into memory. This is useful for large datasets.

In [None]:
# Load the curated AnnData object in backed mode
adata = sc.read_h5ad(adata_local_path, backed="r")

# Display a summary of the AnnData object
adata

The summary above shows the number of cells (rows) and genes (columns), along with the available annotations.  Our `adata` object contains the cell-wise metadata in the `adata.obs` field and the gene-wise metadata in `adata.var`.  This `*.final.h5ad` file contains only the top 3000 highly variable genes.

Next, let's visualize the dataset using a UMAP embedding to explore how cells are distributed by their assigned cell type and phase.


In [None]:
# Plot a UMAP embedding colored by cell type
sc.pl.embedding(adata, basis="umap", color=["cell_type"])


In [None]:
# Plot a UMAP embedding colored by phase
sc.pl.embedding(adata, basis="umap", color=["phase"])

## 5. Reproducibility / Versioning Notes

To keep analyses reproducible and collaborative:

- **Record environment details**: note Python version and key package versions (`scanpy`, `pandas`, etc.).
- **Save intermediate outputs**: store curated files and metadata snapshots in `ws_files`.
- **Use Git for version control**: commit notebooks and scripts to a shared repository.
- **Document changes**: add notes in Markdown cells or a changelog section at the end of the notebook.

These practices make it easier for collaborators (and your future self) to reproduce results.
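
One way to record environment details from within Python is to collect them into a small dictionary that can be printed or saved next to your results. This is a minimal sketch; `environment_snapshot` is a hypothetical helper, and the package list is only an example.

```python
import json
import os
import platform
from importlib import metadata


def environment_snapshot(packages: list[str]) -> dict:
    """Collect interpreter, platform, and package versions into a dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        "packages": versions,
    }


snapshot = environment_snapshot(["scanpy", "anndata", "pandas", "matplotlib"])
print(json.dumps(snapshot, indent=2))
```

Writing the JSON output to a file in `ws_files` alongside your results keeps the environment record together with the outputs it describes.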


In [None]:
!conda env export

In [None]:
!jupyter labextension list

In [None]:
!grep ^processor /proc/cpuinfo | wc -l

In [None]:
!grep "^MemTotal:" /proc/meminfo

## 6. Next Steps

This notebook introduced:
- Orienting your workspace
- Loading and inspecting data resources

👉 Continue to the next notebook: **[02_data_exploration.ipynb](./02_data_exploration.ipynb)**  
There we'll dive deeper into exploring the curated data resources.
