This notebook will show how to map data using the MapMyCells data products released with the HMBA consensus Basal Ganglia taxonomy.

We start by installing the `cell_type_mapper` library (the actual backend for MapMyCells) and the `abc_atlas_access` library (a convenient way to download other data provided by the Allen Institute). You may need to restart your kernel after running these cells so that the notebook recognizes the newly installed libraries.

In [1]:
pip install "cell_type_mapper @ git+https://github.com/alleninstitute/cell_type_mapper.git@rc/v1.5.2"


Collecting cell_type_mapper@ git+https://github.com/alleninstitute/cell_type_mapper.git@rc/v1.5.2
  Cloning https://github.com/alleninstitute/cell_type_mapper.git (to revision rc/v1.5.2) to /private/var/folders/8b/hnw5vq8s20jbpz51wdhd11fr0000gp/T/pip-install-8c53nk7s/cell-type-mapper_0bad1e0b8b984773bf66c09ace614cdc
  Running command git clone --filter=blob:none --quiet https://github.com/alleninstitute/cell_type_mapper.git /private/var/folders/8b/hnw5vq8s20jbpz51wdhd11fr0000gp/T/pip-install-8c53nk7s/cell-type-mapper_0bad1e0b8b984773bf66c09ace614cdc
  Running command git checkout -b rc/v1.5.2 --track origin/rc/v1.5.2
  Switched to a new branch 'rc/v1.5.2'
  branch 'rc/v1.5.2' set up to track 'origin/rc/v1.5.2'.
  Resolved https://github.com/alleninstitute/cell_type_mapper.git to commit 2aefd39caf5799555c4bcad6ba437ca9d93d4bfd
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... do

In [10]:
pip install "abc_atlas_access[notebooks] @ git+https://github.com/alleninstitute/abc_atlas_access.git"

Collecting abc_atlas_access@ git+https://github.com/alleninstitute/abc_atlas_access.git (from abc_atlas_access[notebooks]@ git+https://github.com/alleninstitute/abc_atlas_access.git)
  Cloning https://github.com/alleninstitute/abc_atlas_access.git to /private/var/folders/8b/hnw5vq8s20jbpz51wdhd11fr0000gp/T/pip-install-i48hugce/abc-atlas-access_ddbb425a58264c86b248be905825cb7b
  Running command git clone --filter=blob:none --quiet https://github.com/alleninstitute/abc_atlas_access.git /private/var/folders/8b/hnw5vq8s20jbpz51wdhd11fr0000gp/T/pip-install-i48hugce/abc-atlas-access_ddbb425a58264c86b248be905825cb7b
  Resolved https://github.com/alleninstitute/abc_atlas_access.git to commit a82a6770c99ad166105c3e6fccd47f31ee69b44c
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting boto3 (from abc_atlas_access@ git+https://github.com/alleninstitute/abc_atlas_access.git->

In [33]:
import json
import matplotlib.pyplot as plt
import pandas as pd

import cell_type_mapper
from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache

To run MapMyCells, you need two supporting data files: [an HDF5 file](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/input_data_files/precomputed_stats_file.md) that defines your taxonomy and the average gene expression profiles of the taxa therein, and a [lookup table of marker genes](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/input_data_files/marker_gene_lookup.md) for your taxonomy. For the sake of this exercise, let's download those two files for the human basal ganglia data. Refer back [to the main page](https://github.com/AllenInstitute/HMBA_BasalGanglia_Consensus_Taxonomy/blob/main/index.md#cell-type-mapping-with-mapmycells) for the locations of the relevant files for all three species (human, marmoset, and macaque).
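The download commands in this notebook write into a `data/` directory and the mapping step later writes into `results/`. `wget -O` does not create a missing parent directory, so it can help to create both up front. A minimal sketch, assuming the relative directory names used in this notebook; the helper name `ensure_dirs` is hypothetical:

```python
from pathlib import Path


def ensure_dirs(*dir_names):
    """Create each directory (and any parents) if it does not already exist."""
    created = []
    for name in dir_names:
        path = Path(name)
        # exist_ok=True makes the call safe to repeat
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# "data" and "results" are the relative directories used in this notebook
ensure_dirs("data", "results")
```

Because `exist_ok=True` is set, re-running the cell after a kernel restart is harmless.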

In [7]:
! wget "https://released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com/HMBA/BasalGanglia/BICAN_05072025_pre-print_release/MapMyCells/Human.precomputed_stats.20250507.h5" -O "data/Human.precomputed_stats.20250507.h5"

--2025-05-08 10:07:31--  https://released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com/HMBA/BasalGanglia/BICAN_05072025_pre-print_release/MapMyCells/Human.precomputed_stats.20250507.h5
Resolving released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com (released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com)... 3.5.85.43, 52.218.232.209, 52.218.153.41, ...
Connecting to released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com (released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com)|3.5.85.43|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 755535936 (721M) [binary/octet-stream]
Saving to: ‘data/Human.precomputed_stats.20250507.h5’


2025-05-08 10:08:31 (12.0 MB/s) - ‘data/Human.precomputed_stats.20250507.h5’ saved [755535936/755535936]



In [1]:
! wget "https://released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com/HMBA/BasalGanglia/BICAN_05072025_pre-print_release/MapMyCells/Human.query_markers.20250507.json" -O "data/Human.query_markers.20250507.json"

--2025-05-08 10:38:10--  https://released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com/HMBA/BasalGanglia/BICAN_05072025_pre-print_release/MapMyCells/Human.query_markers.20250507.json
Resolving released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com (released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com)... 52.218.232.161, 52.92.133.26, 3.5.82.215, ...
Connecting to released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com (released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com)|52.218.232.161|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 837105 (817K) [application/json]
Saving to: ‘data/Human.query_markers.20250507.json’


2025-05-08 10:38:11 (8.79 MB/s) - ‘data/Human.query_markers.20250507.json’ saved [837105/837105]



You also need some unlabeled data to map. We will use [the abc_atlas_access tool](https://alleninstitute.github.io/abc_atlas_access/intro.html) to download some Whole Human Brain data that we can map onto the Basal Ganglia taxonomy. **Note:** we are going to download the non-neuronal data only, not because it is interesting, but because it is small (5 GB as opposed to 32 GB for neuronal data).

In [5]:
abc_cache = AbcProjectCache.from_cache_dir('abc_cache')

In [6]:
abc_cache.list_data_files(directory='WHB-10Xv3')

['WHB-10Xv3-Neurons/log2',
 'WHB-10Xv3-Neurons/raw',
 'WHB-10Xv3-Nonneurons/log2',
 'WHB-10Xv3-Nonneurons/raw']

In [7]:
unmapped_human_data_path = abc_cache.get_data_path(
    directory='WHB-10Xv3',
    file_name='WHB-10Xv3-Nonneurons/raw'
)

WHB-10Xv3-Nonneurons-raw.h5ad: 100%|██████| 4.75G/4.75G [06:57<00:00, 11.4MMB/s]


Now we will actually [perform the mapping](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/mapping_cells.md).

In [9]:
from cell_type_mapper.cli.from_specified_markers import FromSpecifiedMarkersRunner

In [12]:
# paths to files where mapping output will be written
json_dst_path = "results/human_nonneurons_mapping.json"
csv_dst_path = "results/human_nonneruons_mapping.csv"

# the lookup table of marker genes which we downloaded from S3
query_marker_path = "data/Human.query_markers.20250507.json"

# the human-specific precomputed stats file which we downloaded from S3
precomputed_path = "data/Human.precomputed_stats.20250507.h5"

config = {
    "query_path": str(unmapped_human_data_path.resolve().absolute()),
    "extended_result_path": json_dst_path,
    "csv_result_path": csv_dst_path,
    "verbose_csv": True,
    "query_markers": {
       "serialized_lookup": query_marker_path
    },
    "precomputed_stats": {
        "path": precomputed_path
    },
    "type_assignment": {
        "n_processors": 4,
        "normalization": "raw",
        "bootstrap_factor": 0.5,
        "bootstrap_iteration": 100
    }
}
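The mapping run takes a long time, so it can be worth checking the paths in the configuration before starting it. A minimal sketch, assuming only the keys that appear in the `config` dictionary above; the helper name `find_path_problems` is hypothetical and is not part of `cell_type_mapper`:

```python
from pathlib import Path


def find_path_problems(config):
    """Return a list of human-readable problems with the paths in `config`."""
    problems = []

    # input files that must already exist
    inputs = {
        "query_path": config["query_path"],
        "query_markers.serialized_lookup": config["query_markers"]["serialized_lookup"],
        "precomputed_stats.path": config["precomputed_stats"]["path"],
    }
    for key, value in inputs.items():
        if not Path(value).is_file():
            problems.append(f"{key}: input file not found: {value}")

    # output files whose parent directory must already exist
    for key in ("extended_result_path", "csv_result_path"):
        value = config.get(key)
        if value is not None and not Path(value).parent.is_dir():
            problems.append(
                f"{key}: output directory not found: {Path(value).parent}"
            )

    return problems
```

Calling `find_path_problems(config)` on the dictionary above should return an empty list when all inputs are downloaded and the output directory exists.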

In [13]:
runner = FromSpecifiedMarkersRunner(
    args=[],
    input_data=config
)
runner.run()

=== Running Hierarchical Mapping 1.5.2 with config ===
{
  "log_level": "ERROR",
  "extended_result_path": "results/human_nonneurons_mapping.json",
  "verbose_stdout": true,
  "type_assignment": {
    "bootstrap_factor": 0.5,
    "log_level": "ERROR",
    "bootstrap_factor_lookup": null,
    "rng_seed": 11235813,
    "chunk_size": 10000,
    "bootstrap_iteration": 100,
    "n_processors": 4,
    "min_markers": 10,
    "n_runners_up": 5,
    "normalization": "raw"
  },
  "cloud_safe": false,
  "hdf5_result_path": null,
  "log_path": null,
  "query_markers": {
    "collapse_markers": false,
    "log_level": "ERROR",
    "serialized_lookup": "data/Human.query_markers.20250507.json"
  },
  "csv_result_path": "results/human_nonneruons_mapping.csv",
  "extended_result_dir": null,
  "query_gene_id_col": null,
  "max_gb": 100.0,
  "obsm_clobber": false,
  "verbose_csv": false,
  "drop_level": null,
  "summary_metadata_path": null,
  "map_to_ensembl": false,
  "nodes_to_drop": null,
  "precompu

{
  "NUMEXPR_NUM_THREADS": "",
  "MKL_NUM_THREADS": "",
  "OMP_NUM_THREADS": ""
}


BENCHMARK: spent 5.9319e-02 seconds creating query marker cache
Running CPU implementation of type assignment.
40000 of 888263 cells in 1.75e+00 min; predict 3.70e+01 min of 3.88e+01 min left
60000 of 888263 cells in 2.43e+00 min; predict 3.36e+01 min of 3.60e+01 min left
80000 of 888263 cells in 3.01e+00 min; predict 3.04e+01 min of 3.34e+01 min left
100000 of 888263 cells in 3.72e+00 min; predict 2.93e+01 min of 3.30e+01 min left
120000 of 888263 cells in 4.42e+00 min; predict 2.83e+01 min of 3.27e+01 min left
140000 of 888263 cells in 5.16e+00 min; predict 2.76e+01 min of 3.27e+01 min left
160000 of 888263 cells in 5.79e+00 min; predict 2.63e+01 min of 3.21e+01 min left
180000 of 888263 cells in 6.46e+00 min; predict 2.54e+01 min of 3.19e+01 min left
200000 of 888263 cells in 7.29e+00 min; predict 2.51e+01 min of 3.24e+01 min left
220000 of 888263 cells in 7.99e+00 min; predict 2.43e+01 min of 3.23e+01 min left
240000 of 888263 cells in 8.64e+00 min; predict 2.33e+01 min of 3.20e+01

# Output of mapping file

The results of our mapping are now in two files: the CSV file pointed to by `csv_dst_path` and the JSON file pointed to by `json_dst_path`. Dedicated documentation of the contents of the mapping output [can be found here](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/output.md).

## CSV output file

The CSV file is effectively just a dataframe with one row per cell. For every taxonomy level, each row gives the assigned cell type (both as a guaranteed-unique "label" and as a more human-readable "name") along with quality metrics assessing the confidence in the mapping (see the detailed documentation linked above).

In [23]:
mapping_csv = pd.read_csv(csv_dst_path, comment='#')

In [24]:
mapping_csv

Unnamed: 0,cell_id,Neighborhood_label,Neighborhood_name,Neighborhood_bootstrapping_probability,Class_label,Class_name,Class_bootstrapping_probability,Subclass_label,Subclass_name,Subclass_bootstrapping_probability,Group_label,Group_name,Group_bootstrapping_probability,Cluster_label,Cluster_name,Cluster_alias,Cluster_bootstrapping_probability
0,10X362_3:TCAGTGAGTATTGACC,CS20250428_NEIGH_0001,Nonneuron,1.0,CS20250428_CLASS_0010,OPC-Oligo,1.0,CS20250428_SUBCL_0022,Oligodendrocyte,1.00,CS20250428_GROUP_0025,Oligo OPALIN,0.69,CS20250428_CLUST_0227,Human-1,Human-1,0.55
1,10X362_5:TCCGTGTGTGAAAGTT,CS20250428_NEIGH_0001,Nonneuron,1.0,CS20250428_CLASS_0010,OPC-Oligo,1.0,CS20250428_SUBCL_0022,Oligodendrocyte,1.00,CS20250428_GROUP_0026,Oligo PLEKHG1,0.99,CS20250428_CLUST_0231,Human-15,Human-15,0.99
2,10X362_5:CACGGGTAGAGCAGAA,CS20250428_NEIGH_0001,Nonneuron,1.0,CS20250428_CLASS_0010,OPC-Oligo,1.0,CS20250428_SUBCL_0022,Oligodendrocyte,1.00,CS20250428_GROUP_0026,Oligo PLEKHG1,0.58,CS20250428_CLUST_0231,Human-15,Human-15,1.00
3,10X362_5:GATTCTTGTATGTCAC,CS20250428_NEIGH_0001,Nonneuron,1.0,CS20250428_CLASS_0010,OPC-Oligo,1.0,CS20250428_SUBCL_0022,Oligodendrocyte,1.00,CS20250428_GROUP_0026,Oligo PLEKHG1,1.00,CS20250428_CLUST_0230,Human-12,Human-12,1.00
4,10X362_6:AGGACTTGTATCCTTT,CS20250428_NEIGH_0001,Nonneuron,1.0,CS20250428_CLASS_0010,OPC-Oligo,1.0,CS20250428_SUBCL_0022,Oligodendrocyte,1.00,CS20250428_GROUP_0026,Oligo PLEKHG1,1.00,CS20250428_CLUST_0231,Human-15,Human-15,1.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
888258,10X194_8:GAAATGAGTTCGGCTG,CS20250428_NEIGH_0001,Nonneuron,1.0,CS20250428_CLASS_0008,Immune,1.0,CS20250428_SUBCL_0018,Macrophage,1.00,CS20250428_GROUP_0001,BAM,1.00,CS20250428_CLUST_0157,Human-71,Human-71,0.97
888259,10X350_4:TTTACCATCGCACGAC,CS20250428_NEIGH_0001,Nonneuron,1.0,CS20250428_CLASS_0008,Immune,1.0,CS20250428_SUBCL_0018,Macrophage,0.95,CS20250428_GROUP_0001,BAM,1.00,CS20250428_CLUST_0160,Human-75,Human-75,0.92
888260,10X225_1:AGAAGCGTCCATATGG,CS20250428_NEIGH_0001,Nonneuron,1.0,CS20250428_CLASS_0008,Immune,1.0,CS20250428_SUBCL_0018,Macrophage,1.00,CS20250428_GROUP_0001,BAM,1.00,CS20250428_CLUST_0159,Human-74,Human-74,0.90
888261,10X221_5:TTGAACGCAGCCTTCT,CS20250428_NEIGH_0001,Nonneuron,1.0,CS20250428_CLASS_0008,Immune,1.0,CS20250428_SUBCL_0018,Macrophage,1.00,CS20250428_GROUP_0001,BAM,1.00,CS20250428_CLUST_0160,Human-75,Human-75,1.00
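One common use of the bootstrapping probabilities is to keep only confidently mapped cells before counting cell types. A minimal sketch, using three columns from the first five rows of the table above; the cutoff of 0.9 is an arbitrary illustration, not a recommended value:

```python
import pandas as pd

# first five rows of the table above, restricted to three columns
example = pd.DataFrame({
    "cell_id": [
        "10X362_3:TCAGTGAGTATTGACC",
        "10X362_5:TCCGTGTGTGAAAGTT",
        "10X362_5:CACGGGTAGAGCAGAA",
        "10X362_5:GATTCTTGTATGTCAC",
        "10X362_6:AGGACTTGTATCCTTT",
    ],
    "Group_name": [
        "Oligo OPALIN",
        "Oligo PLEKHG1",
        "Oligo PLEKHG1",
        "Oligo PLEKHG1",
        "Oligo PLEKHG1",
    ],
    "Cluster_bootstrapping_probability": [0.55, 0.99, 1.00, 1.00, 1.00],
})

min_probability = 0.9  # hypothetical cutoff, chosen only for illustration

# keep cells whose cluster-level assignment meets the cutoff
confident = example[
    example["Cluster_bootstrapping_probability"] >= min_probability
]

# count the surviving cells per group
group_counts = confident["Group_name"].value_counts()
print(group_counts)
```

The same filter and `value_counts()` call should apply unchanged to the full `mapping_csv` dataframe.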


## JSON output file

The JSON output contains everything in the CSV file, along with helpful metadata about your mapping run as [documented here](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/output.md#json-output-file).

In [27]:
with open(json_dst_path, 'rb') as src:
    mapping_json = json.load(src)

For instance, to see the configuration parameters used in your mapping run, you can print the `'config'` entry:

In [34]:
print(json.dumps(mapping_json['config'], indent=2))

{
  "log_level": "ERROR",
  "extended_result_path": "results/human_nonneurons_mapping.json",
  "verbose_stdout": true,
  "type_assignment": {
    "bootstrap_factor": 0.5,
    "log_level": "ERROR",
    "bootstrap_factor_lookup": null,
    "rng_seed": 11235813,
    "chunk_size": 10000,
    "bootstrap_iteration": 100,
    "n_processors": 4,
    "min_markers": 10,
    "n_runners_up": 5,
    "normalization": "raw"
  },
  "cloud_safe": false,
  "hdf5_result_path": null,
  "log_path": null,
  "query_markers": {
    "collapse_markers": false,
    "log_level": "ERROR",
    "serialized_lookup": "data/Human.query_markers.20250507.json"
  },
  "csv_result_path": "results/human_nonneruons_mapping.csv",
  "extended_result_dir": null,
  "query_gene_id_col": null,
  "max_gb": 100.0,
  "obsm_clobber": false,
  "verbose_csv": false,
  "drop_level": null,
  "summary_metadata_path": null,
  "map_to_ensembl": false,
  "nodes_to_drop": null,
  "precomputed_stats": {
    "path": "data/Human.precomputed_stats

The actual cell type assignments are stored as a list under `'results'`, with one entry per cell. The first entry looks like this:

In [28]:
print(json.dumps(mapping_json['results'][0], indent=2))

{
  "CCN20250428_LEVEL_0": {
    "assignment": "CS20250428_NEIGH_0001",
    "bootstrapping_probability": 1.0,
    "aggregate_probability": 1.0,
    "avg_correlation": 0.7511638600643517,
    "runner_up_assignment": [],
    "runner_up_correlation": [],
    "runner_up_probability": [],
    "directly_assigned": true
  },
  "CCN20250428_LEVEL_1": {
    "assignment": "CS20250428_CLASS_0010",
    "bootstrapping_probability": 1.0,
    "aggregate_probability": 1.0,
    "avg_correlation": 0.8847336088608828,
    "runner_up_assignment": [],
    "runner_up_correlation": [],
    "runner_up_probability": [],
    "directly_assigned": true
  },
  "CCN20250428_LEVEL_2": {
    "assignment": "CS20250428_SUBCL_0022",
    "bootstrapping_probability": 1.0,
    "aggregate_probability": 1.0,
    "avg_correlation": 0.809104201147677,
    "runner_up_assignment": [],
    "runner_up_correlation": [],
    "runner_up_probability": [],
    "directly_assigned": true
  },
  "CCN20250428_LEVEL_3": {
    "assignment": 

One complication is that this file refers to cell type assignments only by their unique machine-readable labels. Fortunately, the cell type taxonomy, along with the mapping between machine-readable and human-readable cell type labels, is also provided in this file. The `TaxonomyTree` class provides a helpful interface to that data.
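To see the machine-readable labels for a single cell at a glance, the per-level entries of a result record can be collapsed into a small dictionary. A minimal sketch, using the three complete levels printed above as sample data; the helper name `summarize_assignment` is hypothetical, and skipping non-dictionary values is an assumption that any other fields in a record are plain values rather than per-level entries:

```python
def summarize_assignment(record):
    """Map each taxonomy level to (assigned label, bootstrapping probability)."""
    summary = {}
    for level, value in record.items():
        # only per-level entries are dicts carrying an "assignment" key
        if isinstance(value, dict) and "assignment" in value:
            summary[level] = (
                value["assignment"],
                value["bootstrapping_probability"],
            )
    return summary


# the three complete levels from the printed record, other fields omitted
sample_record = {
    "CCN20250428_LEVEL_0": {
        "assignment": "CS20250428_NEIGH_0001",
        "bootstrapping_probability": 1.0,
    },
    "CCN20250428_LEVEL_1": {
        "assignment": "CS20250428_CLASS_0010",
        "bootstrapping_probability": 1.0,
    },
    "CCN20250428_LEVEL_2": {
        "assignment": "CS20250428_SUBCL_0022",
        "bootstrapping_probability": 1.0,
    },
}

summary = summarize_assignment(sample_record)
for level, (label, probability) in summary.items():
    print(level, label, probability)
```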

In [29]:
from cell_type_mapper.taxonomy.taxonomy_tree import TaxonomyTree

In [30]:
taxonomy = TaxonomyTree(data=mapping_json['taxonomy_tree'])



In [32]:
taxonomy.label_to_name(level='CCN20250428_LEVEL_3', label='CS20250428_GROUP_0025')

'Oligo OPALIN'
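The same lookup can be applied to every level of a result record. A minimal sketch, assuming only the `label_to_name(level=..., label=...)` call shown above; the helper `assignment_names` is hypothetical, and the stand-in lookup function exists only so that the example runs without the mapping output (in the notebook you would pass `taxonomy.label_to_name` instead):

```python
def assignment_names(record, label_to_name):
    """Map each taxonomy level in `record` to a human-readable name."""
    names = {}
    for level, value in record.items():
        # only per-level entries are dicts carrying an "assignment" key
        if isinstance(value, dict) and "assignment" in value:
            names[level] = label_to_name(level=level, label=value["assignment"])
    return names


# stand-in for taxonomy.label_to_name, built from names shown in this notebook
known_names = {
    ("CCN20250428_LEVEL_0", "CS20250428_NEIGH_0001"): "Nonneuron",
    ("CCN20250428_LEVEL_1", "CS20250428_CLASS_0010"): "OPC-Oligo",
    ("CCN20250428_LEVEL_2", "CS20250428_SUBCL_0022"): "Oligodendrocyte",
}


def stand_in_label_to_name(level, label):
    return known_names[(level, label)]


sample_record = {
    "CCN20250428_LEVEL_0": {"assignment": "CS20250428_NEIGH_0001"},
    "CCN20250428_LEVEL_1": {"assignment": "CS20250428_CLASS_0010"},
    "CCN20250428_LEVEL_2": {"assignment": "CS20250428_SUBCL_0022"},
}

names = assignment_names(sample_record, stand_in_label_to_name)
for level, name in names.items():
    print(level, name)
```

Passing the lookup in as an argument keeps the helper independent of how the taxonomy is loaded.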