# Download census slice

Download a local copy of [CELLxGENE Census](https://chanzuckerberg.github.io/cellxgene-census/), with several filters applied:
- `collection_id` (default: `283d65eb-dd53-496d-adb7-7570c7caa443`)
- Dataset `start`/`end` (default: `2:7`, ≈134k cells)
- `n_vars` (default: 20k)

By default, outputs are written to `data/census-benchmark_<start>:<end>` (see `out_dir` / `out_root=data` params below).

## Execute this notebook with [Papermill](https://papermill.readthedocs.io/)

**❗️❗️ NOTE: this notebook has been ported to the `alb download` CLI; prefer that for programmatic execution. ❗️❗️**

```bash
nb=download-census-slice.ipynb
mkdir -p out  # the executed notebook is written here
papermill $nb out/$nb

# slice just one dataset:
papermill $nb -p start 3 -p end 4 out/$nb
```
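
When driving these runs from a script, the same command lines can be assembled in Python. The following is a minimal sketch, not part of this repository: `papermill_cmd` is a hypothetical helper that only builds the argument list (using the `-p <name> <value>` flag shown above) and does not execute anything.

```python
import shlex


def papermill_cmd(nb, out_dir="out", **params):
    """Build a `papermill` argv list; `papermill_cmd` is a hypothetical helper."""
    cmd = ["papermill", nb]
    for name, value in params.items():
        # Each parameter becomes `-p <name> <value>`, as in the shell examples.
        cmd += ["-p", name, str(value)]
    # The executed notebook is written to <out_dir>/<nb>.
    cmd.append(f"{out_dir}/{nb}")
    return cmd


cmd = papermill_cmd("download-census-slice.ipynb", start=3, end=4)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `out/` exists.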

In [1]:
from benchmarks.utils import *

## [Papermill](https://papermill.readthedocs.io/en/latest/) params:

In [2]:
census_uri = None
census_version = "2023-12-15"

collection_id = '283d65eb-dd53-496d-adb7-7570c7caa443'
# Slice datasets from `collection_id`
start = 2
end = 7

# Slice the first `n_vars` vars
n_vars = 20_000

out_root = "data"
out_dir = None
force = True  # rm existing out_dir before writing

In [3]:
if out_dir is None:
    suffix = "" if start is None and end is None else f"_{start or ''}:{end or ''}"
    out_dir = f'{out_root}/census-benchmark{suffix}'
else:
    out_dir = f"{out_root}/{out_dir}"
err(f"Downloading to {out_dir}")

Downloading to data/census-benchmark_2:7
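
The naming rule in the cell above can also be written as a pure function, which makes the edge cases easier to see. This is an illustrative sketch: `resolve_out_dir` is a hypothetical name, and the logic simply mirrors the cell (including the fact that `start or ''` renders a `0` bound as an empty string).

```python
def resolve_out_dir(out_root="data", out_dir=None, start=None, end=None):
    """Mirror the notebook's naming rule; `resolve_out_dir` is a hypothetical name."""
    if out_dir is None:
        # No suffix when neither bound is set; otherwise `_<start>:<end>`,
        # with an unset (or zero) bound rendered as an empty string.
        suffix = "" if start is None and end is None else f"_{start or ''}:{end or ''}"
        return f"{out_root}/census-benchmark{suffix}"
    # An explicit out_dir is always placed under out_root.
    return f"{out_root}/{out_dir}"


print(resolve_out_dir(start=2, end=7))      # data/census-benchmark_2:7
print(resolve_out_dir())                    # data/census-benchmark
print(resolve_out_dir(start=3))             # data/census-benchmark_3:
print(resolve_out_dir(out_dir="my-slice"))  # data/my-slice
```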


In [4]:
census = cellxgene_census.open_soma(uri=census_uri, census_version=census_version)
datasets = get_dataset_ids(census, collection_id)
len(datasets), datasets[:10]

(138,
 ['8e10f1c4-8e98-41e5-b65f-8cd89a887122',
  'b165f033-9dec-468a-9248-802fc6902a74',
  'ff7d15fa-f4b6-4a0e-992e-fd0c9d088ded',
  'fe1a73ab-a203-45fd-84e9-0f7fd19efcbd',
  'fbf173f9-f809-4d84-9b65-ae205d35b523',
  'fa554686-fc07-44dd-b2de-b726d82d26ec',
  'f9034091-2e8f-4ac6-9874-e7b7eb566824',
  'f8dda921-5fb4-4c94-a654-c6fc346bfd6d',
  'f7d003d4-40d5-4de8-858c-a9a8b48fcc67',
  'f6d9f2ad-5ec7-4d53-b7f0-ceb0e7bcd181'])

In [5]:
exp = census["census_data"]["homo_sapiens"]
exp

<Experiment 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_data/homo_sapiens' (open for 'r') (2 items)
    'ms': 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_data/homo_sapiens/ms' (unopened)
    'obs': 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_data/homo_sapiens/obs' (unopened)>

In [6]:
from benchmarks.census import get_datasets_df
ddf = get_datasets_df(census, collection_id)
ddf

| | soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_version_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 102 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Human Brain Cell Atlas v1.0 | 10.1126/science.add7046 | 8e10f1c4-8e98-41e5-b65f-8cd89a887122 | 674227af-0c09-4b2c-8abb-0cb0621e9af2 | All neurons | 8e10f1c4-8e98-41e5-b65f-8cd89a887122.h5ad | 2480956 |
| 1 | 139 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Human Brain Cell Atlas v1.0 | 10.1126/science.add7046 | b165f033-9dec-468a-9248-802fc6902a74 | 8f2775a8-1a55-4f3c-baf8-23d91b5e6ba3 | All non-neuronal cells | b165f033-9dec-468a-9248-802fc6902a74.h5ad | 888263 |
| 2 | 198 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Human Brain Cell Atlas v1.0 | 10.1126/science.add7046 | ff7d15fa-f4b6-4a0e-992e-fd0c9d088ded | ad4f40b1-ad1b-43ef-85dd-a5d0dbc35766 | Dissection: Cerebral cortex (Cx) - Cuneus, ros... | ff7d15fa-f4b6-4a0e-992e-fd0c9d088ded.h5ad | 28051 |
| 3 | 199 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Human Brain Cell Atlas v1.0 | 10.1126/science.add7046 | fe1a73ab-a203-45fd-84e9-0f7fd19efcbd | ca8e9e3a-852f-487f-9196-e6eb094e18ff | Dissection: Amygdaloid complex (AMY) - basolat... | fe1a73ab-a203-45fd-84e9-0f7fd19efcbd.h5ad | 35285 |
| 4 | 200 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Human Brain Cell Atlas v1.0 | 10.1126/science.add7046 | fbf173f9-f809-4d84-9b65-ae205d35b523 | 3cd2b19b-7bda-4ad5-81d7-a7e486d9ef27 | Dissection: Thalamus (THM) - lateral nuclear c... | fbf173f9-f809-4d84-9b65-ae205d35b523.h5ad | 17660 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 133 | 329 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Human Brain Cell Atlas v1.0 | 10.1126/science.add7046 | 07b1d7c8-5c2e-42f7-9246-26f746cd6013 | 2fd1d8cc-b463-4191-8ad9-ca8f5582e377 | Dissection: Myelencephalon (medulla oblongata)... | 07b1d7c8-5c2e-42f7-9246-26f746cd6013.h5ad | 27210 |
| 134 | 330 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Human Brain Cell Atlas v1.0 | 10.1126/science.add7046 | 04a23820-ffa8-4be5-9f65-64db15631d1e | 5452b43f-7a47-4fd7-8447-81550c05e5cb | Supercluster: Upper rhombic lip | 04a23820-ffa8-4be5-9f65-64db15631d1e.h5ad | 137162 |
| 135 | 331 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Human Brain Cell Atlas v1.0 | 10.1126/science.add7046 | 03d38670-1444-4001-bc53-9936e61d9b20 | 91bfa4c5-6167-4f38-8850-8b75bacdefd4 | Dissection: Hypothalamus (HTH) - preoptic regi... | 03d38670-1444-4001-bc53-9936e61d9b20.h5ad | 20027 |
| 136 | 332 | 283d65eb-dd53-496d-adb7-7570c7caa443 | Human Brain Cell Atlas v1.0 | 10.1126/science.add7046 | 0325478a-9b52-45b5-b40a-2e2ab0d72eb1 | 789736b6-7d28-444d-a600-b76c607d041e | Supercluster: Upper-layer intratelencephalic | 0325478a-9b52-45b5-b40a-2e2ab0d72eb1.h5ad | 455006 |


In [9]:
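# Take an explicit copy so that adding `sum_cells` below does not trigger pandas' SettingWithCopyWarning
datasets = ddf.iloc[2:20][['dataset_id', 'dataset_title', 'dataset_total_cell_count']].copy()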
datasets['sum_cells'] = datasets.dataset_total_cell_count.cumsum()
datasets

| | dataset_id | dataset_title | dataset_total_cell_count | sum_cells |
|---|---|---|---|---|
| 2 | ff7d15fa-f4b6-4a0e-992e-fd0c9d088ded | Dissection: Cerebral cortex (Cx) - Cuneus, ros... | 28051 | 28051 |
| 3 | fe1a73ab-a203-45fd-84e9-0f7fd19efcbd | Dissection: Amygdaloid complex (AMY) - basolat... | 35285 | 63336 |
| 4 | fbf173f9-f809-4d84-9b65-ae205d35b523 | Dissection: Thalamus (THM) - lateral nuclear c... | 17660 | 80996 |
| 5 | fa554686-fc07-44dd-b2de-b726d82d26ec | Dissection: Cerebral cortex (Cx) - Superior oc... | 29674 | 110670 |
| 6 | f9034091-2e8f-4ac6-9874-e7b7eb566824 | Dissection: Myelencephalon (medulla oblongata)... | 23120 | 133790 |
| 7 | f8dda921-5fb4-4c94-a654-c6fc346bfd6d | Dissection: Cerebral cortex (Cx) - Occipitotem... | 31899 | 165689 |
| 8 | f7d003d4-40d5-4de8-858c-a9a8b48fcc67 | Supercluster: Astrocyte | 155025 | 320714 |
| 9 | f6d9f2ad-5ec7-4d53-b7f0-ceb0e7bcd181 | Dissection: Thalamus (THM) - lateral nuclear c... | 6877 | 327591 |
| 10 | f5a04dff-d394-4023-8811-65494e8bb11d | Dissection: Basal nuclei (BN) - Putamen - Pu | 34416 | 362007 |
| 11 | f502c312-05dc-4fd4-a762-92a63e92b539 | Dissection: Paleocortex (PalCx) - Anterior Olf... | 31230 | 393237 |
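
The running total in `sum_cells` is what determines how many cells a given `start`/`end` slice contains. As a quick check, the sketch below recomputes slice sizes from the counts in the table above. It assumes `start`/`end` act as a half-open Python-style range over the dataset index, which is consistent with the cell count quoted for the default `2:7` slice; `cells_in_slice` is a hypothetical helper, not a function from `benchmarks`.

```python
# `dataset_total_cell_count` for dataset indices 2..11, copied from the table above.
cell_counts = {
    2: 28051, 3: 35285, 4: 17660, 5: 29674, 6: 23120,
    7: 31899, 8: 155025, 9: 6877, 10: 34416, 11: 31230,
}


def cells_in_slice(start, end):
    """Sum cell counts for dataset indices in the half-open range [start, end)."""
    return sum(cell_counts[i] for i in range(start, end))


print(cells_in_slice(2, 7))  # 133790 (default slice)
print(cells_in_slice(3, 4))  # 35285 (single dataset)
```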


In [None]:
%%time
download_datasets(exp, datasets, out_dir, start=start, end=end, n_vars=n_vars, rm=force)

In [None]:
h_size = check_output(['du', '-sh', out_dir]).decode().split('\t')[0]
print(f"{out_dir}: {h_size}")
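
The cell above shells out to `du`, which may not be available on every platform. A portable alternative, sketched here under the assumption that apparent file size (rather than on-disk block usage, which is what `du` reports) is good enough, walks the directory with the standard library. Both function names are hypothetical.

```python
import os


def dir_size_bytes(path):
    """Total apparent size of regular files under `path`, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):  # skip symlinks to avoid double counting
                total += os.path.getsize(fp)
    return total


def human_size(n):
    """Format a byte count with binary units, e.g. 1536 -> '1.5K'."""
    for unit in ["B", "K", "M", "G", "T"]:
        if n < 1024 or unit == "T":
            return f"{n}B" if unit == "B" else f"{n:.1f}{unit}"
        n /= 1024
```

In the notebook this could be used as `print(f"{out_dir}: {human_size(dir_size_bytes(out_dir))}")`; expect the figure to differ somewhat from `du -sh`.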