# Download census slice

Download a local copy of [CELLxGENE Census](https://chanzuckerberg.github.io/cellxgene-census/), with several filters applied:
- `collection_id` (default: `283d65eb-dd53-496d-adb7-7570c7caa443`)
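- Dataset slice `start`/`end`, a half-open range over the collection's datasets (default: `2:7`, ≈133k cells)
- `n_vars`, the number of leading vars (genes) to keep (default: 20k)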

By default, outputs are written to `data/census-benchmark_<start>:<end>` (see `out_dir` / `out_root=data` params below).

## Execute this notebook with [Papermill](https://papermill.readthedocs.io/)

**❗️❗️ NOTE: this notebook has been ported to the `alb download` CLI; use that instead for programmatic execution. ❗️❗️**

```bash
nb=download-census-slice.ipynb
mkdir -p out  # the executed notebook is written here
papermill $nb out/$nb

# slice just one dataset:
papermill $nb -p start 3 -p end 4 out/$nb
```
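
The same runs can also be launched from Python through Papermill's `execute_notebook` function. The following is a minimal sketch, assuming `papermill` is installed and the notebook is in the working directory. The `build_params` and `run_slice` helpers and the `nb_out_dir` argument are hypothetical names introduced for illustration; they are not part of this repo.

```python
from pathlib import Path


def build_params(start=None, end=None, **extra):
    """Collect only the parameters that should override the notebook defaults."""
    params = {k: v for k, v in {"start": start, "end": end}.items() if v is not None}
    params.update(extra)
    return params


def run_slice(nb="download-census-slice.ipynb", nb_out_dir="out", **params):
    """Execute the notebook with Papermill; the executed copy goes to `nb_out_dir`."""
    import papermill as pm  # imported lazily so `build_params` works without it

    Path(nb_out_dir).mkdir(parents=True, exist_ok=True)
    out_path = str(Path(nb_out_dir) / nb)
    pm.execute_notebook(nb, out_path, parameters=build_params(**params))
    return out_path


# Equivalent of `papermill $nb -p start 3 -p end 4 out/$nb`:
# run_slice(start=3, end=4)
```

Parameters left as `None` are omitted, so the notebook's own defaults apply to them.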

In [None]:
from benchmarks.utils import *

## [Papermill](https://papermill.readthedocs.io/en/latest/) params:

In [None]:
# Optional SOMA URI; when None, the Census release named by `census_version` is opened
census_uri = None
census_version = "2023-12-15"

collection_id = '283d65eb-dd53-496d-adb7-7570c7caa443'
# Half-open `[start, end)` slice of the datasets in `collection_id`
start = 2
end = 7

# Slice the first `n_vars` vars
n_vars = 20_000

out_root = "data"
out_dir = None
force = True  # remove any existing out_dir before writing

In [None]:
if out_dir is None:
    suffix = "" if start is None and end is None else f"_{start or ''}:{end or ''}"
    out_dir = f'{out_root}/census-benchmark{suffix}'
else:
    out_dir = f"{out_root}/{out_dir}"
err(f"Downloading to {out_dir}")
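
To make the naming rule above easier to check, here is the same logic wrapped in a function with a few sample inputs. This is a sketch that mirrors the cell; `resolve_out_dir` is a hypothetical name, not a helper from `benchmarks.utils`.

```python
def resolve_out_dir(out_root="data", out_dir=None, start=2, end=7):
    """Return the output directory, mirroring the notebook's default naming."""
    if out_dir is None:
        if start is None and end is None:
            suffix = ""
        else:
            # A missing bound is rendered as an empty string, e.g. "_:5"
            suffix = f"_{start or ''}:{end or ''}"
        return f"{out_root}/census-benchmark{suffix}"
    # An explicit `out_dir` is always interpreted relative to `out_root`
    return f"{out_root}/{out_dir}"


print(resolve_out_dir())                      # data/census-benchmark_2:7
print(resolve_out_dir(start=3, end=4))        # data/census-benchmark_3:4
print(resolve_out_dir(start=None, end=5))     # data/census-benchmark_:5
print(resolve_out_dir(start=None, end=None))  # data/census-benchmark
print(resolve_out_dir(out_dir="my-slice"))    # data/my-slice
```

Because the suffix uses `start or ''`, a `start` of `0` is rendered the same way as `None`.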

In [None]:
import cellxgene_census  # explicit, in case the star import above does not provide it
census = cellxgene_census.open_soma(uri=census_uri, census_version=census_version)
datasets = get_dataset_ids(census, collection_id)
len(datasets), datasets[:10]

In [None]:
exp = census["census_data"]["homo_sapiens"]
exp

In [None]:
from benchmarks.census import get_datasets_df
ddf = get_datasets_df(census, collection_id)
ddf

In [None]:
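# `.copy()` avoids pandas' SettingWithCopyWarning when adding a column to a slice
datasets = ddf.iloc[2:20][['dataset_id', 'dataset_title', 'dataset_total_cell_count']].copy()
datasets['sum_cells'] = datasets.dataset_total_cell_count.cumsum()  # running total of cells
datasets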
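
The running total in `sum_cells` is useful for picking a `start`/`end` range that stays within a target number of cells. Below is a sketch of the same idea in plain Python. The cell counts are made up for illustration (they are not real Census values), and `datasets_within_budget` is a hypothetical helper.

```python
from itertools import accumulate


def datasets_within_budget(cell_counts, max_cells):
    """Return how many leading datasets fit within `max_cells` cells in total."""
    n = 0
    for running_total in accumulate(cell_counts):
        if running_total > max_cells:
            break
        n += 1
    return n


counts = [10_000, 25_000, 40_000, 60_000]  # hypothetical per-dataset cell counts

print(list(accumulate(counts)))                 # [10000, 35000, 75000, 135000]
print(datasets_within_budget(counts, 80_000))   # 3
```

With these sample counts, the first three datasets total 75,000 cells, so a budget of 80,000 cells corresponds to a slice covering three datasets.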

In [None]:
%%time
download_datasets(exp, datasets, out_dir, start=start, end=end, n_vars=n_vars, rm=force)

In [None]:
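from subprocess import check_output  # explicit, in case the star import above does not provide it
h_size = check_output(['du', '-sh', out_dir]).decode().split('\t')[0]  # `du` prints "<size>\t<path>"
print(f"{out_dir}: {h_size}")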
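
Where `du` is unavailable (for example on Windows), the size can be approximated in pure Python. This is a sketch with hypothetical helper names. It sums apparent file sizes, whereas `du` reports disk blocks used, so the two numbers can differ somewhat.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total


def human_size(n):
    """Format a byte count with binary (1024-based) units, roughly like `du -h`."""
    for unit in ("B", "K", "M", "G", "T"):
        if n < 1024 or unit == "T":
            return f"{n}{unit}" if unit == "B" else f"{n:.1f}{unit}"
        n /= 1024


# Usage, with `out_dir` as defined earlier in the notebook:
# print(f"{out_dir}: {human_size(dir_size_bytes(out_dir))}")
```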