### [!IMPORTANT]
**Data Migration Notice**: Arc's Virtual Cell Atlas data has migrated to the [Google Cloud Marketplace](https://console.cloud.google.com/marketplace/product/bigquery-public-data/arc-institute?project=gcp-public-data-arc-institute). 

**Note**: The new bucket is subject to [Requester Pays](https://docs.cloud.google.com/storage/docs/requester-pays). Users can access up to 2 TB of data per month for free before fees apply.

Access to the legacy GCS buckets (`gs://arc-ctc-tahoe100/` and `gs://arc-scbasecount/`) will be deprecated on **March 31, 2026**. Please update your workflows to use the Google Cloud Marketplace bucket `gs://arc-institute-virtual-cell-atlas`.

# Summary

* This tutorial shows how to use Python to access the Virtual Cell Challenge dataset hosted by the Arc Institute.
* The data can be streamed or downloaded locally.
  * For small jobs (e.g., summarizing metadata), streaming is recommended.
  * For large jobs (e.g., training a model), downloading is recommended.
* See the [README](README.md) for a description of the dataset and metrics.

# Setup

### Installation

If needed, install the necessary dependencies.

You can use the [conda environment](../conda_envs/python.yml) provided in this git repository.

### Load dependencies

In [None]:
import io
import pandas as pd
import scanpy as sc
import gcsfs

In [None]:
# initialize GCS file system for reading data from GCS
fs = gcsfs.GCSFileSystem()
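
Because the Marketplace bucket is Requester Pays, a file system created without a billing project may have its requests rejected. The following is a minimal sketch, assuming the `project` and `requester_pays` constructor arguments of `gcsfs.GCSFileSystem`; the helper names and the project ID `my-billing-project` are hypothetical placeholders.

```python
def requester_pays_kwargs(billing_project: str) -> dict:
    """Build keyword arguments that bill GCS requests to `billing_project`."""
    return {"project": billing_project, "requester_pays": True}


def make_requester_pays_fs(billing_project: str):
    """Create a GCS file system whose requests are billed to `billing_project`."""
    import gcsfs  # imported lazily so the helper above works without gcsfs installed

    return gcsfs.GCSFileSystem(**requester_pays_kwargs(billing_project))


# Usage (hypothetical project ID; replace with your own):
# fs = make_requester_pays_fs("my-billing-project")
```

Data transfer is then attributed to the named project, so the free monthly allowance and any fees apply to that project's billing account.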

### Data location

In [None]:
# GCS bucket path for Virtual Cell Challenge 2025 datasets
vcc_base_path = "gs://arc-institute-virtual-cell-atlas/virtual-cell-challenge/2025/"

### Loading Data

The training, validation, and test datasets are provided in `.h5ad` format (AnnData).

In [None]:
train_adata_path = vcc_base_path + "train/adata_Training.h5ad"
val_adata_path = vcc_base_path + "validation/adata_Validation.h5ad"

print(f"Training data path: {train_adata_path}")
print(f"Validation data path: {val_adata_path}")
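
The paths above are hard-coded. To confirm that they exist, or to discover the files for other splits, the bucket prefix can be listed. This is a sketch, assuming `fs` is any fsspec-compatible file system (such as the `gcsfs` instance created earlier) and using its `find` method; `list_files_with_suffix` is an illustrative helper, not part of any library.

```python
def list_files_with_suffix(fs, base_path: str, suffix: str) -> list:
    """Return all files under `base_path` whose names end with `suffix`, sorted."""
    # fs.find() walks the prefix recursively and returns file paths only
    return sorted(p for p in fs.find(base_path) if p.endswith(suffix))


# Usage:
# list_files_with_suffix(fs, vcc_base_path, ".h5ad")
```

The returned paths may omit the `gs://` scheme, depending on the file system implementation.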

### Streaming Data

To stream the data directly into memory using `scanpy`, uncomment the following lines:

In [None]:
# with fs.open(train_adata_path) as f:
#     adata = sc.read_h5ad(f)
# adata
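
For large jobs, downloading the file once and reading it from local disk avoids re-streaming it on every run. This is a sketch, assuming `fs` is the file system created above (any fsspec-compatible object with a `get` method works) and a hypothetical local directory `data`; `download_if_missing` is an illustrative helper.

```python
import os


def download_if_missing(fs, remote_path: str, local_dir: str) -> str:
    """Copy `remote_path` into `local_dir` unless it is already there; return the local path."""
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, os.path.basename(remote_path))
    if not os.path.exists(local_path):
        fs.get(remote_path, local_path)  # fsspec: copy the remote file to local disk
    return local_path


# Usage:
# local_train = download_if_missing(fs, train_adata_path, "data")
# adata = sc.read_h5ad(local_train, backed="r")  # backed mode keeps the matrix on disk
```

Skipping files that already exist also avoids counting repeated downloads against the Requester Pays allowance.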

### Perturbation Counts

CSV files with perturbation counts are also available for each split.

In [None]:
counts_path = vcc_base_path + "train/pert_counts_Training.csv"
# with fs.open(counts_path) as f:  # close the remote handle after reading
#     counts = pd.read_csv(f)
# counts.head()
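
To check the column layout before loading the whole table, the first few rows can be read from the open stream. This is a sketch using only the standard library, assuming the file is UTF-8 encoded CSV with a header row; the actual column names are not listed here, so none are assumed. `peek_csv` is an illustrative helper.

```python
import csv
import io


def peek_csv(binary_file, n_rows: int = 5):
    """Return (header, first n_rows data rows) from a binary CSV stream."""
    text = io.TextIOWrapper(binary_file, encoding="utf-8", newline="")
    reader = csv.reader(text)
    header = next(reader, [])  # empty list if the file has no rows
    rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


# Usage:
# with fs.open(counts_path, "rb") as f:
#     header, rows = peek_csv(f)
```

With pandas, passing `nrows=5` to `pd.read_csv` gives a similar preview as a DataFrame.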