### [!IMPORTANT]
**Data Migration Notice**: Arc's Virtual Cell Atlas data has migrated to the [Google Cloud Marketplace](https://console.cloud.google.com/marketplace/product/bigquery-public-data/arc-institute?project=gcp-public-data-arc-institute). 

**Note**: The new bucket is subject to [Requester Pays](https://docs.cloud.google.com/storage/docs/requester-pays). Users can access up to 2TB of data per month for free before fees apply.

Access to the current GCS buckets (`gs://arc-ctc-tahoe100/` and `gs://arc-scbasecount/`) will be deprecated on **March 31, 2026**. Please update your workflows to use the Google Cloud Marketplace bucket `gs://arc-institute-virtual-cell-atlas`.

# Summary

* This tutorial shows how to use Python to access the scBaseCount dataset hosted by the Arc Institute.
* The data can be streamed or downloaded locally.
  * For small jobs (e.g., summarizing some of the metadata), streaming is recommended.
  * For large jobs (e.g., training a model), downloading is recommended.
* See the [README](README.md#metadata) for a description of the obs metadata.


# Setup

### Installation

If needed, install the necessary dependencies.

You can use the [conda environment](../conda_envs/python.yml) provided in this git repository.

# Load packages

In [1]:
import os
import pandas as pd
import scanpy as sc
import pyarrow.dataset as ds
import gcsfs

In [2]:
# initialize GCS file system for reading data from GCS
fs = gcsfs.GCSFileSystem()
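
Because the new bucket is Requester Pays, requests may need to name a project to bill. The sketch below is one way to set that up, assuming you have a Google Cloud project with billing enabled. `make_requester_pays_fs` and the project ID `my-billing-project` are hypothetical names; `project` and `requester_pays` are arguments of `gcsfs.GCSFileSystem`.

```python
def make_requester_pays_fs(billing_project: str):
    """Return a GCS file system whose requests are billed to `billing_project`."""
    # Imported lazily so that defining the helper needs neither gcsfs nor credentials.
    import gcsfs

    # requester_pays=True sends `project` as the user project on each request.
    return gcsfs.GCSFileSystem(project=billing_project, requester_pays=True)


# Usage (needs valid Google Cloud credentials, so it is left commented out):
# fs = make_requester_pays_fs("my-billing-project")
```

If the default `gcsfs.GCSFileSystem()` call already works in your environment, this step is not needed.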

# Data location

In [5]:
# GCS bucket path
gcs_base_path = "gs://arc-institute-virtual-cell-atlas/scbasecount/2026-01-12/"

In [6]:
# STARsolo feature type
feature_type = "GeneFull_Ex50pAS"

# List available files

Let's see what we have to work with!

First, load some helper code.

In [7]:
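# helper function to list files under a GCS path, labeled by parent directory (organism)
def get_file_table(gcs_base_path: str, target: str=None, endswith: str=None) -> pd.DataFrame:
    # require a filter; otherwise `f.endswith(None)` below raises a TypeError
    if not target and not endswith:
        raise ValueError("Provide either `target` (exact file name) or `endswith` (suffix)")
    # recursively list everything under the base path
    files = fs.glob("/".join([gcs_base_path.rstrip("/"), "**"]))
    if target:
        # keep files whose base name matches exactly
        files = [f for f in files if os.path.basename(f) == target]
    else:
        # keep files with the given suffix
        files = [f for f in files if f.endswith(endswith)]
    # the name of the parent directory is the organism
    file_list = [[f.split("/")[-2], f] for f in files]
    return pd.DataFrame(file_list, columns=["organism", "file_path"])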

## Parquet files

* Contain the obs metadata
* These can be read efficiently with [pyarrow](https://arrow.apache.org/docs/python/index.html)
  * We will read in via pyarrow and convert to pandas

In [8]:
# set the path to the metadata files
gcs_path = "/".join([gcs_base_path.rstrip("/"), "metadata", feature_type])
gcs_path

'gs://arc-institute-virtual-cell-atlas/scbasecount/2026-01-12/metadata/GeneFull_Ex50pAS'

### List per-sample metadata files

Per-sample (SRX accession) metadata (e.g., tissue)

In [9]:
# list files
sample_pq_files = get_file_table(gcs_path, "sample_metadata.parquet")
print(sample_pq_files.shape)
sample_pq_files.head()

(27, 2)


|  | organism | file_path |
|---|---|---|
| 0 | Arabidopsis_thaliana | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 1 | Bos_taurus | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 2 | Caenorhabditis_elegans | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 3 | Callithrix_jacchus | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 4 | Chlorocebus_aethiops | arc-institute-virtual-cell-atlas/scbasecount/2... |


**Notes:**

* As you can see, the files are organized by `feature_type` (STARsolo output type) and `organism`
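
The layout described above can be written down as small path helpers. This is a sketch inferred from the paths printed in this tutorial (`<base>/<kind>/<feature_type>/<organism>/<file>`); the function names are hypothetical and not part of any Arc Institute package.

```python
from pathlib import PurePosixPath

BASE = "gs://arc-institute-virtual-cell-atlas/scbasecount/2026-01-12"


def h5ad_path(feature_type: str, organism: str, srx_accession: str, base: str = BASE) -> str:
    """Build the h5ad path for one sample (SRX accession)."""
    return f"{base.rstrip('/')}/h5ad/{feature_type}/{organism}/{srx_accession}.h5ad"


def metadata_path(feature_type: str, organism: str, kind: str = "sample", base: str = BASE) -> str:
    """Build the path of the per-sample (kind='sample') or per-obs (kind='obs') Parquet file."""
    return f"{base.rstrip('/')}/metadata/{feature_type}/{organism}/{kind}_metadata.parquet"


def organism_from_path(path: str) -> str:
    """The organism is the name of the file's parent directory."""
    return PurePosixPath(path).parent.name


print(h5ad_path("GeneFull_Ex50pAS", "Callithrix_jacchus", "SRX9522456"))
print(organism_from_path(metadata_path("GeneFull_Ex50pAS", "Homo_sapiens")))
```

Note that directory names use underscores (`Homo_sapiens`), whereas the `organism` column in the metadata uses spaces (`Homo sapiens`).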

### List per-obs metadata files

Per-observation (cell) metadata

In [10]:
# list files
obs_pq_files = get_file_table(gcs_path, "obs_metadata.parquet")
print(obs_pq_files.shape)
obs_pq_files.head()

(27, 2)


|  | organism | file_path |
|---|---|---|
| 0 | Arabidopsis_thaliana | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 1 | Bos_taurus | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 2 | Caenorhabditis_elegans | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 3 | Callithrix_jacchus | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 4 | Chlorocebus_aethiops | arc-institute-virtual-cell-atlas/scbasecount/2... |


## h5ad files 

* Contain count matrices and per-obs metadata

In [11]:
# set the path
gcs_path = "/".join([gcs_base_path.rstrip("/"), "h5ad", feature_type])
gcs_path

'gs://arc-institute-virtual-cell-atlas/scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS'

In [12]:
# list files
h5ad_files = get_file_table(gcs_path, endswith=".h5ad")
print(h5ad_files.shape)
h5ad_files.head()

(61378, 2)


|  | organism | file_path |
|---|---|---|
| 0 | Arabidopsis_thaliana | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 1 | Arabidopsis_thaliana | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 2 | Arabidopsis_thaliana | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 3 | Arabidopsis_thaliana | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 4 | Arabidopsis_thaliana | arc-institute-virtual-cell-atlas/scbasecount/2... |


# Explore the per-sample metadata

### Just human samples

In [13]:
# get the per-sample metadata file path
infile = sample_pq_files[sample_pq_files["organism"] == "Homo_sapiens"]["file_path"].values[0]
infile

'arc-institute-virtual-cell-atlas/scbasecount/2026-01-12/metadata/GeneFull_Ex50pAS/Homo_sapiens/sample_metadata.parquet'

In [14]:
# load the metadata
sample_metadata = ds.dataset(infile, filesystem=fs, format="parquet").to_table().to_pandas()
print(sample_metadata.shape)
sample_metadata.head()

(35263, 17)


|  | entrez_id | srx_accession | file_path | obs_count | lib_prep | tech_10x | cell_prep | organism | tissue | tissue_ontology_term_id | disease | disease_ontology_term_id | perturbation | cell_line | antibody_derived_tag | czi_collection_id | czi_collection_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26358130 | SRX19162061 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 713 | 10x_Genomics | 5_prime_gex | single_cell | Homo sapiens | lungs, lung-associated lymph nodes | UBERON:0000170,UBERON:0039167 | unsure |  | unsure | unsure | no |  |  |
| 1 | 30414235 | ERX11557254 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 4858 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | unsure |  | cancer | MONDO:0004992 | cytostatic kinase inhibitors, various drug con... | RKO (PTEN knockout), HAP1 (for KBM-7) | no |  |  |
| 2 | 26344578 | SRX19148509 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 1969 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | Oligodendrocyte Progenitor Cells |  | Chronic non-cancer pain with opioid-induced co... | MONDO:0024317 | Ro1138452 | PTt-P6-MsNL | no |  |  |
| 3 | 34046538 | ERX10987183 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 35028 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | foetal liver, foetal bone marrow |  | Trisomy 21 (Down's syndrome) | MONDO:0700126 | unsure | not applicable | no |  |  |
| 4 | 34046390 | ERX10987201 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 16908 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | foetal liver, foetal bone marrow |  | Trisomy 21 (Down's syndrome) | MONDO:0700126 | Human foetal samples from 15 trisomy 21 foetus... | Sample name: TS21_4_F, Immunophenotype: CD235a... | no |  |  |


In [15]:
# All human?
sample_metadata["organism"].value_counts()

organism
Homo sapiens    35263
Name: count, dtype: int64

In [16]:
# 10X library prep methods
sample_metadata["tech_10x"].value_counts()

tech_10x
3_prime_gex          24957
5_prime_gex           7146
other                  893
feature_barcoding      788
multiome               651
vdj                    462
not_applicable         335
cellplex                21
flex                     6
atac                     4
Name: count, dtype: int64

In [17]:
# cell prep method
sample_metadata["cell_prep"].value_counts()

cell_prep
single_cell       32748
single_nucleus     2462
unsure               52
not_applicable        1
Name: count, dtype: int64

### All organisms

Let's scale up to everything!

In [18]:
# Read in the metadata for all organisms
sample_metadata = []
for i,row in sample_pq_files.iterrows():
    sample_metadata.append(
        ds.dataset(row["file_path"], filesystem=fs, format="parquet").to_table().to_pandas()
    )
sample_metadata = pd.concat(sample_metadata)

print(f"Number of samples: {sample_metadata.shape[0]}")
sample_metadata.head()

Number of samples: 61378


|  | entrez_id | srx_accession | file_path | obs_count | lib_prep | tech_10x | cell_prep | organism | tissue | tissue_ontology_term_id | disease | disease_ontology_term_id | perturbation | cell_line | antibody_derived_tag | czi_collection_id | czi_collection_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26779669 | SRX19498703 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 2865 | 10x_Genomics | 3_prime_gex | single_nucleus | Arabidopsis thaliana | whole flowers | UBERON:0000914 | unsure |  | unsure | not_applicable | no |  |  |
| 1 | 20529885 | SRX14437315 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 15115 | 10x_Genomics | 3_prime_gex | single_cell | Arabidopsis thaliana | root tip |  | unsure |  | half MS (0.5x) growth medium | Col-0 cells | no |  |  |
| 2 | 13263988 | SRX10136412 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 4340 | 10x_Genomics | 3_prime_gex | single_cell | Arabidopsis thaliana | developing leaf tissue |  | unsure |  | unsure | unsure | no |  |  |
| 3 | 13263989 | SRX10136413 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 4116 | 10x_Genomics | 3_prime_gex | single_cell | Arabidopsis thaliana | true leaves |  | unsure |  | TMMp::TMM-YFP | unsure | no |  |  |
| 4 | 32020960 | SRX23731314 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 2227 | 10x_Genomics | 3_prime_gex | single_cell | Arabidopsis thaliana | Protoplasts |  | unsure |  | Transient expression of TFs fused to the gluco... | unsure | no |  |  |
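
Each per-organism table carries its own 0-based row labels, so `pd.concat` with default settings produces repeated index values in the combined table. The sketch below uses made-up toy tables to show the effect and how `ignore_index=True` renumbers the rows; whether you need unique labels depends on how you index the table later.

```python
import pandas as pd

# Toy stand-ins for two per-organism sample tables (values are made up).
human = pd.DataFrame({"srx_accession": ["SRX_A", "SRX_B"], "obs_count": [100, 200]})
mouse = pd.DataFrame({"srx_accession": ["SRX_C", "SRX_D"], "obs_count": [300, 400]})

# Default concat keeps each table's own row labels, so labels repeat.
stacked = pd.concat([human, mouse])
# ignore_index=True assigns a fresh 0..n-1 index instead.
renumbered = pd.concat([human, mouse], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Boolean-mask filtering, as used in this tutorial, works with either form; label-based lookups such as `.loc[0]` return several rows when labels repeat.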


In [19]:
# cells
print(f"Obs count: {sample_metadata['obs_count'].sum()}")

Obs count: 502472775


In [21]:
# samples per organism
sample_metadata["organism"].value_counts()

organism
Homo sapiens               35263
Mus musculus               21439
Macaca mulatta              1275
Danio rerio                  842
Drosophila melanogaster      416
Callithrix jacchus           406
Sus scrofa                   317
Rattus norvegicus            271
Arabidopsis thaliana         232
Bos taurus                   176
Gallus gallus                146
Heterocephalus glaber        133
Ovis aries                   112
Pan troglodytes               72
Caenorhabditis elegans        58
Mesocricetus auratus          46
Oryctolagus cuniculus         38
Zea mays                      35
Oryza sativa                  35
Chlorocebus aethiops          20
Equus caballus                13
Solanum lycopersicum          10
Schistosoma mansoni            9
Monodelphis domestica          6
Gasterosteus aculeatus         5
Gorilla gorilla                2
Taeniopygia guttata            1
Name: count, dtype: int64

In [22]:
# tech_10x
sample_metadata["tech_10x"].value_counts()

tech_10x
3_prime_gex          45125
5_prime_gex          10563
other                 1954
multiome              1253
feature_barcoding      892
not_applicable         828
vdj                    694
cellplex                54
atac                     8
flex                     6
fixed_rna                1
Name: count, dtype: int64

In [23]:
# check that the file paths point to existing h5ad files (assumes you have gsutil installed)
!which gsutil && gsutil ls {sample_metadata["file_path"].values[0]}

/home/nickyoungblut/bin/google-cloud-sdk/bin/gsutil


gs://arc-institute-virtual-cell-atlas/scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/Arabidopsis_thaliana/SRX19498703.h5ad


# Explore the per-obs metadata

* `obs` ≃ cell

In [24]:
# The list of metadata files per organism
obs_pq_files

|  | organism | file_path |
|---|---|---|
| 0 | Arabidopsis_thaliana | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 1 | Bos_taurus | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 2 | Caenorhabditis_elegans | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 3 | Callithrix_jacchus | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 4 | Chlorocebus_aethiops | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 5 | Danio_rerio | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 6 | Drosophila_melanogaster | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 7 | Equus_caballus | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 8 | Gallus_gallus | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 9 | Gasterosteus_aculeatus | arc-institute-virtual-cell-atlas/scbasecount/2... |


In [25]:
# let's read in the metadata for a single organism
target_organism = "Bos_taurus"

In [26]:
# extract the file path
infile = obs_pq_files[obs_pq_files["organism"] == target_organism]["file_path"].values[0]

In [27]:
# read in the first 100000 rows
obs_metadata = ds.dataset(infile, filesystem=fs, format="parquet").head(100000).to_pandas()
print(obs_metadata.shape)
obs_metadata.head()

(100000, 10)


|  | cell_barcode | SRX_accession | gene_count_Unique | umi_count_Unique | gene_count_UniqueAndMult-EM | umi_count_UniqueAndMult-EM | gene_count_UniqueAndMult-Uniform | umi_count_UniqueAndMult-Uniform | cell_type | cell_ontology_term_id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAACCCACACCTATCC | ERX13041271 | 5580 | 19602.0 | 5667 | 20662.0 | 5872 | 20661.992188 |  |  |
| 1 | AAACCCACAGACTGCC | ERX13041271 | 6478 | 27106.0 | 6607 | 28614.003906 | 6849 | 28614.011719 |  |  |
| 2 | AAACCCACATCGTGCG | ERX13041271 | 3731 | 9476.0 | 3813 | 10022.0 | 3950 | 10021.996094 |  |  |
| 3 | AAACCCAGTGTGAATA | ERX13041271 | 3879 | 10705.0 | 3988 | 11340.000977 | 4143 | 11339.99707 |  |  |
| 4 | AAACCCATCACAATGC | ERX13041271 | 4100 | 10589.0 | 4226 | 11160.0 | 4376 | 11160.0 |  |  |


In [29]:
# distribution of gene counts
obs_metadata["gene_count_Unique"].describe()

count    100000.000000
mean       2642.376580
std        1701.194558
min          21.000000
25%        1248.000000
50%        2509.000000
75%        3778.000000
max       11205.000000
Name: gene_count_Unique, dtype: float64

In [30]:
# distribution of umi counts
obs_metadata["umi_count_Unique"].describe()

count    100000.000000
mean      11468.872070
std       11903.444336
min         500.000000
25%        3174.000000
50%        8530.500000
75%       15539.000000
max      230823.000000
Name: umi_count_Unique, dtype: float64

## Get per-obs metadata for specific samples

Method:

1. Query the sample metadata
2. Use the filtered sample metadata to query the cell metadata
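
The two steps can be sketched on toy data as follows. The tables and values here are made up for illustration; only the column names (`srx_accession` in the sample table, `SRX_accession` in the cell table) follow the real metadata.

```python
import pandas as pd

# Toy sample-level table (one row per SRX accession); values are made up.
samples = pd.DataFrame({
    "srx_accession": ["SRX1", "SRX2", "SRX3"],
    "organism": ["Ovis aries", "Equus caballus", "Ovis aries"],
    "obs_count": [12000, 500, 15000],
})
# Toy cell-level table (one row per cell barcode).
cells = pd.DataFrame({
    "SRX_accession": ["SRX1", "SRX1", "SRX2", "SRX3"],
    "cell_barcode": ["AAAC", "AAAG", "AAAT", "AACA"],
})

# Step 1: filter the sample metadata.
wanted_organisms = ["Ovis aries", "Equus caballus"]
keep = samples[samples["organism"].isin(wanted_organisms) & (samples["obs_count"] > 10000)]

# Step 2: inner-join so that only cells from the kept samples remain.
merged = keep.merge(cells, left_on="srx_accession", right_on="SRX_accession", how="inner")
print(merged[["srx_accession", "organism", "cell_barcode"]])
```

The join keys differ only in capitalization, so both `left_on` and `right_on` must be given explicitly.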

#### Filter sample metadata

Let's get all sheep and horse samples with `obs_count > 10000`

In [31]:
target_organisms = ["Ovis aries", "Equus caballus"]
obs_count_cutoff = 10000

In [32]:
# get the target samples
target_samples = sample_metadata[(sample_metadata["organism"].isin(target_organisms)) & (sample_metadata["obs_count"] > obs_count_cutoff)]
print(target_samples.shape)
target_samples.head()

(30, 17)


|  | entrez_id | srx_accession | file_path | obs_count | lib_prep | tech_10x | cell_prep | organism | tissue | tissue_ontology_term_id | disease | disease_ontology_term_id | perturbation | cell_line | antibody_derived_tag | czi_collection_id | czi_collection_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 35575334 | SRX26348972 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 16167 | 10x_Genomics | 3_prime_gex | single_cell | Equus caballus | endometrium |  | not specified |  | not specified | not specified | no |  |  |
| 3 | 31747002 | SRX23498642 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 13357 | 10x_Genomics | 3_prime_gex | single_cell | Equus caballus | synovial tissue | UBERON:0007616 | osteoarthritis | MONDO:0005178 | none | not applicable | no |  |  |
| 7 | 35575330 | SRX26348968 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 10322 | 10x_Genomics | 3_prime_gex | single_cell | Equus caballus | endometrium | UBERON:0001295 | unsure |  | unsure | unsure | no |  |  |
| 11 | 31746999 | SRX23498639 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 10395 | 10x_Genomics | 3_prime_gex | single_cell | Equus caballus | synovial fluid | UBERON:0001090 | osteoarthritis | MONDO:0005178 | none | not applicable | no |  |  |
| 1 | 37705683 | SRX28018339 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 14600 | 10x_Genomics | 3_prime_gex | single_cell | Ovis aries | mammary gland | UBERON:0001911 | none |  | lactation stage comparison | none | no |  |  |


In [33]:
# filter the obs metadata
target_orgs = [x.replace(" ", "_") for x in target_samples["organism"].unique().tolist()]
target_obs_files = obs_pq_files[obs_pq_files["organism"].isin(target_orgs)]
target_obs_files

|  | organism | file_path |
|---|---|---|
| 7 | Equus_caballus | arc-institute-virtual-cell-atlas/scbasecount/2... |
| 19 | Ovis_aries | arc-institute-virtual-cell-atlas/scbasecount/2... |


In [34]:
# read in the obs metadata
obs_metadata = []
for i,row in target_obs_files.iterrows():
    obs_metadata.append(
        ds.dataset(row["file_path"], filesystem=fs, format="parquet").to_table().to_pandas()
    )
obs_metadata = pd.concat(obs_metadata)

# merge with the target samples
obs_metadata = target_samples.merge(obs_metadata, left_on="srx_accession", right_on="SRX_accession")

print(obs_metadata.shape)
obs_metadata.head()

(383938, 27)


|  | entrez_id | srx_accession | file_path | obs_count | lib_prep | tech_10x | cell_prep | organism | tissue | tissue_ontology_term_id | ... | cell_barcode | SRX_accession | gene_count_Unique | umi_count_Unique | gene_count_UniqueAndMult-EM | umi_count_UniqueAndMult-EM | gene_count_UniqueAndMult-Uniform | umi_count_UniqueAndMult-Uniform | cell_type | cell_ontology_term_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35575334 | SRX26348972 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 16167 | 10x_Genomics | 3_prime_gex | single_cell | Equus caballus | endometrium |  | ... | AAACCCAAGAAGAGCA | SRX26348972 | 1485 | 3118.0 | 1537 | 3409.999756 | 1557 | 3410.0 |  |  |
| 1 | 35575334 | SRX26348972 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 16167 | 10x_Genomics | 3_prime_gex | single_cell | Equus caballus | endometrium |  | ... | AAACCCAAGACTACGG | SRX26348972 | 2485 | 6926.0 | 2590 | 7282.001465 | 2629 | 7282.001465 |  |  |
| 2 | 35575334 | SRX26348972 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 16167 | 10x_Genomics | 3_prime_gex | single_cell | Equus caballus | endometrium |  | ... | AAACCCAAGAGAGTGA | SRX26348972 | 678 | 1389.0 | 708 | 1620.000244 | 717 | 1619.999756 |  |  |
| 3 | 35575334 | SRX26348972 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 16167 | 10x_Genomics | 3_prime_gex | single_cell | Equus caballus | endometrium |  | ... | AAACCCAAGCATTTCG | SRX26348972 | 874 | 1801.0 | 904 | 2082.0 | 913 | 2082.000244 |  |  |
| 4 | 35575334 | SRX26348972 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 16167 | 10x_Genomics | 3_prime_gex | single_cell | Equus caballus | endometrium |  | ... | AAACCCAAGGTAAACT | SRX26348972 | 527 | 960.0 | 557 | 1068.999512 | 565 | 1068.999756 |  |  |


In [36]:
# gene_count distribution per sample
obs_metadata.groupby(["organism", "srx_accession"])["gene_count_Unique"].describe()

| organism | srx_accession | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| Equus caballus | SRX23498639 | 10395.0 | 1877.877345 | 1203.399156 | 269.0 | 869.5 | 1867.0 | 2558.0 | 7680.0 |
| Equus caballus | SRX23498642 | 13357.0 | 2139.489856 | 934.999318 | 166.0 | 1487.0 | 2180.0 | 2646.0 | 7010.0 |
| Equus caballus | SRX26348968 | 10322.0 | 1378.516082 | 853.656327 | 99.0 | 802.0 | 1229.0 | 1747.75 | 8712.0 |
| Equus caballus | SRX26348972 | 16167.0 | 1227.339024 | 694.72539 | 370.0 | 736.0 | 1016.0 | 1511.0 | 8515.0 |
| Ovis aries | SRX12469009 | 18378.0 | 2305.976657 | 2230.277011 | 181.0 | 456.0 | 1207.0 | 3702.0 | 11467.0 |
| Ovis aries | SRX16872034 | 12515.0 | 2081.487655 | 1099.099994 | 88.0 | 1353.0 | 1922.0 | 2579.0 | 12328.0 |
| Ovis aries | SRX16872035 | 12658.0 | 2340.459946 | 1199.428937 | 104.0 | 1541.25 | 2173.5 | 2908.0 | 13030.0 |
| Ovis aries | SRX16872037 | 12483.0 | 1977.945526 | 1056.576855 | 74.0 | 1275.0 | 1817.0 | 2452.0 | 12143.0 |
| Ovis aries | SRX16872039 | 12749.0 | 1848.423798 | 1336.094627 | 167.0 | 1041.0 | 1414.0 | 2058.0 | 9690.0 |
| Ovis aries | SRX16872040 | 12991.0 | 2008.823724 | 1426.545301 | 140.0 | 1143.0 | 1563.0 | 2252.0 | 10193.0 |


In [37]:
# umi_count distribution per sample
obs_metadata.groupby(["organism", "srx_accession"])["umi_count_Unique"].describe()

| organism | srx_accession | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| Equus caballus | SRX23498639 | 10395.0 | 5371.010254 | 5839.104492 | 500.0 | 1450.5 | 4206.0 | 7038.0 | 61960.0 |
| Equus caballus | SRX23498642 | 13357.0 | 7223.476074 | 5019.631348 | 503.0 | 3461.0 | 7055.0 | 9203.0 | 82629.0 |
| Equus caballus | SRX26348968 | 10322.0 | 3605.947266 | 3930.548828 | 500.0 | 1561.5 | 2845.0 | 4462.0 | 112096.0 |
| Equus caballus | SRX26348972 | 16167.0 | 2750.778809 | 2259.334717 | 923.0 | 1377.5 | 2141.0 | 3360.0 | 76825.0 |
| Ovis aries | SRX12469009 | 18378.0 | 16621.232422 | 25256.837891 | 535.0 | 945.0 | 3372.5 | 27888.5 | 293109.0 |
| Ovis aries | SRX16872034 | 12515.0 | 4799.313965 | 4025.047607 | 500.0 | 2355.5 | 3916.0 | 6199.0 | 95753.0 |
| Ovis aries | SRX16872035 | 12658.0 | 5756.79248 | 4855.183105 | 501.0 | 2821.5 | 4696.5 | 7468.0 | 115398.0 |
| Ovis aries | SRX16872037 | 12483.0 | 4443.443848 | 3717.892334 | 500.0 | 2185.0 | 3626.0 | 5750.5 | 88728.0 |
| Ovis aries | SRX16872039 | 12749.0 | 3900.116943 | 4643.742188 | 500.0 | 1546.0 | 2346.0 | 3863.0 | 90552.0 |
| Ovis aries | SRX16872040 | 12991.0 | 4448.382812 | 5350.766113 | 500.0 | 1742.0 | 2682.0 | 4405.0 | 104565.0 |


# Read h5ad files

### Example: select marmoset samples

In [39]:
# get the target samples
query = (sample_metadata["organism"] == "Callithrix jacchus") & (sample_metadata["obs_count"] < 3000)
target_samples = sample_metadata[query].head(n=3)
target_samples

|  | entrez_id | srx_accession | file_path | obs_count | lib_prep | tech_10x | cell_prep | organism | tissue | tissue_ontology_term_id | disease | disease_ontology_term_id | perturbation | cell_line | antibody_derived_tag | czi_collection_id | czi_collection_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 12444878 | SRX9522456 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 2941 | 10x_Genomics | 3_prime_gex | single_cell | Callithrix jacchus | Retina | UBERON:0000966 | unsure |  | unsure | unsure | no |  |  |
| 17 | 12444872 | SRX9522450 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 124 | 10x_Genomics | 3_prime_gex | single_cell | Callithrix jacchus | Retina | UBERON:0000966 | Not applicable |  | GFP-related perturbation | Not applicable | no |  |  |
| 46 | 12444871 | SRX9522449 | gs://arc-institute-virtual-cell-atlas/scbaseco... | 110 | 10x_Genomics | 3_prime_gex | single_cell | Callithrix jacchus | Retina superior region | UBERON:0000966 | unsure |  | unsure | unsure | no |  |  |


In [40]:
# read in the anndata for those samples
adata = []
for infile in target_samples["file_path"].tolist():
    print(infile)
    with fs.open(infile, 'rb') as f:
        adata.append(sc.read_h5ad(f))

# combine anndata objects
adata = sc.concat(adata)
adata

gs://arc-institute-virtual-cell-atlas/scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/Callithrix_jacchus/SRX9522456.h5ad
gs://arc-institute-virtual-cell-atlas/scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/Callithrix_jacchus/SRX9522450.h5ad
gs://arc-institute-virtual-cell-atlas/scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/Callithrix_jacchus/SRX9522449.h5ad


  utils.warn_names_duplicates("obs")


AnnData object with n_obs × n_vars = 3175 × 28346
    obs: 'gene_count_Unique', 'umi_count_Unique', 'gene_count_UniqueAndMult-EM', 'umi_count_UniqueAndMult-EM', 'gene_count_UniqueAndMult-Uniform', 'umi_count_UniqueAndMult-Uniform', 'SRX_accession', 'cell_type', 'cell_ontology_term_id'
    layers: 'UniqueAndMult-EM', 'UniqueAndMult-Uniform'

In [41]:
# number of obs per SRX accession
adata.obs["SRX_accession"].value_counts()

SRX_accession
SRX9522456    2941
SRX9522450     124
SRX9522449     110
Name: count, dtype: int64
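
The `warn_names_duplicates("obs")` line in the output above suggests that some cell barcodes occur in more than one sample, so the observation names of the combined object are not unique. One possible remedy, sketched here on a made-up pandas table rather than a real AnnData object, is to prefix each barcode with its SRX accession.

```python
import pandas as pd

# Toy obs table after concatenating two samples; barcode "AAAC" appears in both.
obs = pd.DataFrame(
    {"SRX_accession": ["SRX1", "SRX1", "SRX2"]},
    index=pd.Index(["AAAC", "AAAG", "AAAC"], name="cell_barcode"),
)
print(obs.index.is_unique)  # False

# Combine accession and barcode into one label per cell.
obs.index = pd.Index(
    [f"{srx}_{barcode}" for srx, barcode in zip(obs["SRX_accession"], obs.index)],
    name="obs_name",
)
print(obs.index.tolist())   # ['SRX1_AAAC', 'SRX1_AAAG', 'SRX2_AAAC']
```

For an AnnData object the same labels would be assigned to `adata.obs_names`; `adata.obs_names_make_unique()` is a built-in alternative that appends numeric suffixes instead.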

In [42]:
# add per-sample metadata to the anndata object
adata.obs = adata.obs.reset_index().merge(
    target_samples, left_on="SRX_accession", right_on="srx_accession", how="inner"
)
adata.obs.head()

|  | index | gene_count_Unique | umi_count_Unique | gene_count_UniqueAndMult-EM | umi_count_UniqueAndMult-EM | gene_count_UniqueAndMult-Uniform | umi_count_UniqueAndMult-Uniform | SRX_accession | cell_type | cell_ontology_term_id | ... | organism | tissue | tissue_ontology_term_id | disease | disease_ontology_term_id | perturbation | cell_line | antibody_derived_tag | czi_collection_id | czi_collection_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAACCCAAGTCCCTAA | 1 | 1.0 | 1 | 1.0 | 1 | 1.0 | SRX9522456 |  |  | ... | Callithrix jacchus | Retina | UBERON:0000966 | unsure |  | unsure | unsure | no |  |  |
| 1 | AAACCCATCCCTGTTG | 1 | 1.0 | 1 | 1.0 | 1 | 1.0 | SRX9522456 |  |  | ... | Callithrix jacchus | Retina | UBERON:0000966 | unsure |  | unsure | unsure | no |  |  |
| 2 | AAACGAAAGCCTTGAT | 1 | 1.0 | 1 | 1.0 | 1 | 1.0 | SRX9522456 |  |  | ... | Callithrix jacchus | Retina | UBERON:0000966 | unsure |  | unsure | unsure | no |  |  |
| 3 | AAACGAACAACCGCCA | 1 | 1.0 | 1 | 1.0 | 1 | 1.0 | SRX9522456 |  |  | ... | Callithrix jacchus | Retina | UBERON:0000966 | unsure |  | unsure | unsure | no |  |  |
| 4 | AAACGAAGTCGTGGAA | 1 | 1.0 | 1 | 1.0 | 1 | 1.0 | SRX9522456 |  |  | ... | Callithrix jacchus | Retina | UBERON:0000966 | unsure |  | unsure | unsure | no |  |  |
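
As the `index` column above shows, the merge moves the cell barcodes out of the row index. If you would rather keep barcodes as the index, one option is to restore it after merging. The sketch below uses made-up toy tables; it also switches to a left join with `validate="many_to_one"`, which is a design choice of this example (not the tutorial's method) that keeps every cell, preserves row order, and raises an error if an accession appears twice in the sample table.

```python
import pandas as pd

# Toy per-cell table indexed by barcode, and a toy per-sample table (values made up).
obs = pd.DataFrame(
    {"SRX_accession": ["SRX1", "SRX2", "SRX1"]},
    index=pd.Index(["AAAC", "AAAG", "AAAT"], name="cell_barcode"),
)
samples = pd.DataFrame({"srx_accession": ["SRX1", "SRX2"], "tissue": ["Retina", "Liver"]})

merged = (
    obs.reset_index()                      # move barcodes into a column
    .merge(
        samples,
        left_on="SRX_accession",
        right_on="srx_accession",
        how="left",                        # keep every cell, in the original order
        validate="many_to_one",            # fail if a sample appears more than once
    )
    .set_index("cell_barcode")             # restore barcodes as the row index
)
print(merged)
```

Keeping the row count and order unchanged matters for AnnData because the rows of `obs` must stay aligned with the rows of the count matrix. In the tutorial's data the index is unnamed, so the column created by `reset_index()` is called `index` rather than `cell_barcode`.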


# Downloading files

You can use [gsutil](https://cloud.google.com/storage/docs/gsutil) to download any of the files in the bucket
and work with them locally. 

For example:

```bash
gsutil cp gs://arc-institute-virtual-cell-atlas/scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/Homo_sapiens/ERX4319106.h5ad .
```

For large data transfers, it is better to use `gsutil rsync`:

```bash
gsutil -m rsync gs://arc-institute-virtual-cell-atlas/scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/Callithrix_jacchus/ .
```
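
Since the bucket is Requester Pays, `gsutil` may need to be told which project to bill through its top-level `-u` option. The sketch below only builds and prints the command so you can inspect it before running it; `gsutil_rsync_command` and the project ID `my-billing-project` are hypothetical names, and whether `-u` is required depends on your account setup.

```python
import shlex
from typing import List, Optional


def gsutil_rsync_command(src: str, dest: str, billing_project: Optional[str] = None) -> List[str]:
    """Build a `gsutil rsync` command, optionally billing a user project."""
    cmd = ["gsutil"]
    if billing_project:
        # Top-level option naming the project to bill for Requester Pays access.
        cmd += ["-u", billing_project]
    # -m runs the transfer with parallel threads/processes.
    cmd += ["-m", "rsync", src, dest]
    return cmd


SRC = "gs://arc-institute-virtual-cell-atlas/scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/Callithrix_jacchus/"
cmd = gsutil_rsync_command(SRC, ".", billing_project="my-billing-project")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the printed command looks right.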

***

# Session Info

In [77]:
!conda list

# packages in environment at /home/nickyoungblut/miniforge3/envs/scbasecount-py:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
aiohappyeyeballs          2.6.1              pyhd8ed1ab_0    conda-forge
aiohttp                   3.11.18         py313h8060acc_0    conda-forge
aiosignal                 1.3.2              pyhd8ed1ab_0    conda-forge
anndata                   0.11.4             pyhd8ed1ab_0    conda-forge
anyio                     4.9.0              pyh29332c3_0    conda-forge
argon2-cffi               23.1.0             pyhd8ed1ab_1    conda-forge
argon2-cffi-bindings      21.2.0          py313h536fd9c_5    conda-forge
array-api-compat          1.11.2             pyh29332c3_0    conda-forge
arrow                     1.3.0              pyhd8ed1ab_1    conda-forge
asttokens                 3.0.0              py