# Download census slice

Download a local slice of [CELLxGENE Census](https://chanzuckerberg.github.io/cellxgene-census/), filtered by:
- `collection_id` (default: `283d65eb-dd53-496d-adb7-7570c7caa443`)
- Dataset `start`/`end`: a slice over the collection's dataset list (default: `2:7`, ≈133k cells)
- `n_vars`: keep only the first `n_vars` vars (default: 20k)

By default, outputs are written to `data/census-benchmark_<start>:<end>` (see `out_dir` / `out_root=data` params below).

## Execute this notebook with [Papermill](https://papermill.readthedocs.io/)
```bash
nb=download-census-slice.ipynb
mkdir -p out  # the executed notebook is written here
papermill $nb out/$nb

# slice just one dataset:
papermill $nb -p start 3 -p end 4 out/$nb
```
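The same run can also be driven from Python through Papermill's `execute_notebook` function. The sketch below is one way to do that, not part of the original notebook: the helper names `output_path` and `run_slice` are hypothetical, and it assumes `papermill` is installed and that the notebook sits in the current directory.

```python
import os


def output_path(nb, out_dir="out"):
    """Mirror the CLI example: the executed copy lands at <out_dir>/<nb>."""
    return f"{out_dir}/{nb}"


def run_slice(start, end, nb="download-census-slice.ipynb", out_dir="out"):
    """Execute the notebook with `start`/`end` overridden (hypothetical helper)."""
    import papermill as pm  # third-party; imported lazily so the helpers above work without it

    os.makedirs(out_dir, exist_ok=True)
    return pm.execute_notebook(
        nb,
        output_path(nb, out_dir),
        parameters={"start": start, "end": end},
    )


# Equivalent of `papermill $nb -p start 3 -p end 4 out/$nb`:
# run_slice(3, 4)
```

Parameters not listed in `parameters` keep the defaults declared in the notebook's params cell.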

In [1]:
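# `utils` supplies every name used below: cellxgene_census, err, get_datasets,
# download_datasets, check_output
from utils import *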

## [Papermill](https://papermill.readthedocs.io/en/latest/) params:

In [2]:
census_uri = None
census_version = "2023-12-15"

collection_id = '283d65eb-dd53-496d-adb7-7570c7caa443'
# Slice `[start:end]` of the datasets in `collection_id`
start = 2
end = 7

# Slice the first `n_vars` vars
n_vars = 20_000

out_root = "data"
out_dir = None

In [3]:
if out_dir is None:
    suffix = "" if start is None and end is None else f"_{start or ''}:{end or ''}"
    out_dir = f'{out_root}/census-benchmark{suffix}'
else:
    out_dir = f"{out_root}/{out_dir}"
err(f"Downloading to {out_dir}")

Downloading to /mnt/nvme/census-benchmark_2:7
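The naming rule in the cell above can be restated as a small function to see how the parameters combine. `resolve_out_dir` is a hypothetical name used only for illustration; its body copies the cell's logic. The `/mnt/nvme` prefix in the logged output is consistent with a run that overrode `out_root`, though that is an inference from the output rather than something the notebook states.

```python
def resolve_out_dir(out_root="data", out_dir=None, start=2, end=7):
    """Restate the notebook's output-directory rule as a pure function."""
    if out_dir is None:
        # No explicit directory: derive one from the slice bounds.
        suffix = "" if start is None and end is None else f"_{start or ''}:{end or ''}"
        return f"{out_root}/census-benchmark{suffix}"
    # An explicit `out_dir` is still placed under `out_root`.
    return f"{out_root}/{out_dir}"


print(resolve_out_dir())                        # data/census-benchmark_2:7
print(resolve_out_dir(out_root="/mnt/nvme"))    # /mnt/nvme/census-benchmark_2:7
print(resolve_out_dir(start=None, end=None))    # data/census-benchmark
print(resolve_out_dir(start=3, end=None))       # data/census-benchmark_3:
print(resolve_out_dir(out_dir="custom"))        # data/custom
```

Note that `start or ''` also renders `start=0` as an empty string, so `0:7` and an open-ended `:7` map to the same directory name.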


In [4]:
census = cellxgene_census.open_soma(uri=census_uri, census_version=census_version)
datasets = get_datasets(census, collection_id)
len(datasets), datasets[:10]

(138,
 ['8e10f1c4-8e98-41e5-b65f-8cd89a887122',
  'b165f033-9dec-468a-9248-802fc6902a74',
  'ff7d15fa-f4b6-4a0e-992e-fd0c9d088ded',
  'fe1a73ab-a203-45fd-84e9-0f7fd19efcbd',
  'fbf173f9-f809-4d84-9b65-ae205d35b523',
  'fa554686-fc07-44dd-b2de-b726d82d26ec',
  'f9034091-2e8f-4ac6-9874-e7b7eb566824',
  'f8dda921-5fb4-4c94-a654-c6fc346bfd6d',
  'f7d003d4-40d5-4de8-858c-a9a8b48fcc67',
  'f6d9f2ad-5ec7-4d53-b7f0-ceb0e7bcd181'])

In [5]:
experiment = census["census_data"]["homo_sapiens"]
experiment

<Experiment 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_data/homo_sapiens' (open for 'r') (2 items)
    'ms': 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_data/homo_sapiens/ms' (unopened)
    'obs': 's3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_data/homo_sapiens/obs' (unopened)>

In [6]:
%%time
download_datasets(experiment, datasets, out_dir, start=start, end=end, n_vars=n_vars)

Downloading 5 datasets:
	ff7d15fa-f4b6-4a0e-992e-fd0c9d088ded
	fe1a73ab-a203-45fd-84e9-0f7fd19efcbd
	fbf173f9-f809-4d84-9b65-ae205d35b523
	fa554686-fc07-44dd-b2de-b726d82d26ec
	f9034091-2e8f-4ac6-9874-e7b7eb566824
  subset_census(query, out_dir)


CPU times: user 26min 26s, sys: 58.2 s, total: 27min 25s
Wall time: 1min 43s
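The five IDs logged above match an ordinary Python slice `[start:end]` over the dataset list printed earlier. The sketch below reproduces that selection using the first seven IDs from that listing. That `download_datasets` slices exactly this way internally is an assumption based on the output, since its source is not shown here.

```python
# First seven dataset IDs, copied from the `datasets[:10]` output above.
first_ids = [
    "8e10f1c4-8e98-41e5-b65f-8cd89a887122",
    "b165f033-9dec-468a-9248-802fc6902a74",
    "ff7d15fa-f4b6-4a0e-992e-fd0c9d088ded",
    "fe1a73ab-a203-45fd-84e9-0f7fd19efcbd",
    "fbf173f9-f809-4d84-9b65-ae205d35b523",
    "fa554686-fc07-44dd-b2de-b726d82d26ec",
    "f9034091-2e8f-4ac6-9874-e7b7eb566824",
]

start, end = 2, 7
selected = first_ids[start:end]  # `end` is exclusive, so this picks indices 2..6

print(len(selected))  # 5
for dataset_id in selected:
    print(dataset_id)

# The `-p start 3 -p end 4` example from the top of the notebook picks one dataset:
print(first_ids[3:4])  # ['fe1a73ab-a203-45fd-84e9-0f7fd19efcbd']
```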


In [8]:
h_size = check_output(['du', '-sh', out_dir]).decode().split('\t')[0]
print(f"{out_dir}: {h_size}")

/mnt/nvme/census-benchmark_2:7: 714M
