# Goal
This notebook will show how to map data using the MapMyCells data products released with the HMBA consensus Basal Ganglia taxonomy.

[Original Notebook here](https://github.com/AllenInstitute/HMBA_BasalGanglia_Consensus_Taxonomy/blob/main/examples/using_MapMyCells_data.ipynb)

# Install MapMyCells dependencies
We start by installing `cell_type_mapper` (the actual backend for MapMyCells) and `abc_atlas_access` (a convenient way to download other data provided by the Allen Institute).
```bash
uv add "cell_type_mapper@git+https://github.com/alleninstitute/cell_type_mapper.git@rc/v1.5.2"
uv add "abc_atlas_access[notebooks] @ git+https://github.com/alleninstitute/abc_atlas_access.git"
```

# Download mapping files for basal ganglia

To run MapMyCells, you need two supporting data files:
- [an HDF5 file](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/input_data_files/precomputed_stats_file.md) that defines your taxonomy and the average gene expression profile of each taxon in it
- [a lookup table of marker genes](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/input_data_files/marker_gene_lookup.md) for your taxonomy.

Let's download those two files for the **human** basal ganglia data. 
Refer back [to the main page](https://github.com/AllenInstitute/HMBA_BasalGanglia_Consensus_Taxonomy/blob/main/index.md#cell-type-mapping-with-mapmycells) for the locations of the relevant files for all three species (human, marmoset, and macaque).

```bash
wget https://released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com/HMBA/BasalGanglia/BICAN_05072025_pre-print_release/MapMyCells/Human.precomputed_stats.20250507.h5
wget https://released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com/HMBA/BasalGanglia/BICAN_05072025_pre-print_release/MapMyCells/Human.query_markers.20250507.json
```
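
Before going further, it can be worth confirming that both downloads completed. The sketch below is not part of the original notebook: it checks that the stats file starts with the standard 8-byte HDF5 signature and that the marker lookup parses as JSON. The function name `check_reference_files` is hypothetical, and the check assumes the HDF5 file was written without a user block (the usual case).

```python
import json
from pathlib import Path

# HDF5 files written without a user block begin with this fixed 8-byte signature.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def check_reference_files(stats_path, markers_path):
    """Return a list of problems found; an empty list means both files look sane."""
    problems = []

    stats_path = Path(stats_path)
    if not stats_path.is_file():
        problems.append(f"missing file: {stats_path}")
    else:
        with open(stats_path, "rb") as src:
            if src.read(8) != HDF5_SIGNATURE:
                problems.append(f"not an HDF5 file: {stats_path}")

    markers_path = Path(markers_path)
    if not markers_path.is_file():
        problems.append(f"missing file: {markers_path}")
    else:
        try:
            with open(markers_path, "rb") as src:
                json.load(src)
        except ValueError:
            # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses
            problems.append(f"not valid JSON: {markers_path}")

    return problems
```

Calling `check_reference_files("Human.precomputed_stats.20250507.h5", "Human.query_markers.20250507.json")` from the download directory should return an empty list if both files arrived intact.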

---

In [1]:
import json
import matplotlib.pyplot as plt
import pandas as pd
import scanpy as sc
from pathlib import Path

import cell_type_mapper
from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache
from cell_type_mapper.cli.from_specified_markers import FromSpecifiedMarkersRunner

# Prepare Query data

Your query data must be:
- **Format**: `.h5ad` file (AnnData format)
- **Structure**: 
  - `X` layer with **RAW** gene expression data (cells × genes)
  - `obs` with cell metadata
  - `var` indexed by gene identifiers (**Ensembl IDs** for MapMyCells-supported taxonomies)
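
The requirements above can be spot-checked before launching a long mapping run. This sketch is an illustration and not part of `cell_type_mapper`: the helper names are hypothetical, and it assumes human Ensembl gene IDs of the form `ENSG` followed by 11 digits, optionally with a version suffix.

```python
import re

# Human Ensembl gene IDs: "ENSG" + 11 digits, optionally followed by ".<version>".
ENSEMBL_HUMAN_GENE = re.compile(r"^ENSG\d{11}(\.\d+)?$")


def fraction_ensembl_ids(gene_ids):
    """Fraction of identifiers that look like human Ensembl gene IDs."""
    gene_ids = list(gene_ids)
    if not gene_ids:
        return 0.0
    n_match = sum(1 for g in gene_ids if ENSEMBL_HUMAN_GENE.match(str(g)))
    return n_match / len(gene_ids)


def looks_like_raw_counts(values):
    """True if every value is a non-negative whole number, as raw counts should be."""
    return all(v >= 0 and float(v).is_integer() for v in values)
```

With an AnnData object `adata`, one could call `fraction_ensembl_ids(adata.var_names)` and, for a sparse `X`, `looks_like_raw_counts(adata.X[:100].data)` on the first 100 cells. A low fraction or a `False` result suggests the file holds gene symbols or normalized values rather than what the mapper expects.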

In [2]:
QUERY_PATH="/home/gdallagl/myworkdir/XDP/data/XDP/disease/240805_SL-EXD_0328_B22FKKYLT4/SI-TT-H1/adata/raw_adata.h5ad"

# paths to files where mapping output will be written
json_dst_path = str(Path(QUERY_PATH).parent.parent / "map_my_cell" / "mapping.json")
csv_dst_path = str(Path(QUERY_PATH).parent.parent / "map_my_cell" / "mapping.csv")

# updated adata path
adata_labelled_path = str(Path(QUERY_PATH).parent / "labelled_adata.h5ad")

# FIXED (always mapping against basal ganglia)
# the lookup table of marker genes which we downloaded from S3
query_marker_path = "/home/gdallagl/myworkdir/XDP/data/AllenAtlas/BGT_human_20250507/Human.query_markers.20250507.json"
# the human-specific precomputed stats file which we downloaded from S3
precomputed_path = "/home/gdallagl/myworkdir/XDP/data/AllenAtlas/BGT_human_20250507/Human.precomputed_stats.20250507.h5"
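
The output files above live in a `map_my_cell` directory that may not exist yet. The sketch below creates the parent directories ahead of time. It assumes the mapper does not create missing directories itself, which is an assumption of this sketch rather than verified behavior; the helper name `ensure_parent_dirs` is hypothetical.

```python
from pathlib import Path


def ensure_parent_dirs(*file_paths):
    """Create the parent directory of each path if it is missing."""
    for file_path in file_paths:
        Path(file_path).parent.mkdir(parents=True, exist_ok=True)
```

In this notebook the call would be `ensure_parent_dirs(json_dst_path, csv_dst_path, adata_labelled_path)`; it is harmless when the directories already exist.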

# Run Mapping

Now we will actually [perform the mapping](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/mapping_cells.md).

In [3]:
config = {
    # query (input) and output paths
    "query_path": QUERY_PATH,
    "extended_result_path": json_dst_path,
    "csv_result_path": csv_dst_path,
    "verbose_csv": True,

    # input paths (reference data)
    "query_markers": {
       "serialized_lookup": query_marker_path
    },
    "precomputed_stats": {
        "path": precomputed_path
    },

    "type_assignment": {
        "n_processors": 4,
        "normalization": "raw",  # query X holds raw counts (not normalized)
        "bootstrap_factor": 0.5,
        "bootstrap_iteration": 100
    }
}
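
Because `config` is a plain dictionary of JSON-compatible values, it can be written to disk before the run, which keeps a record of the settings even if the mapping fails partway. This is a minimal sketch; the helper names and the idea of a separate config file are assumptions, not something the original notebook does.

```python
import json
from pathlib import Path


def save_config(config, config_path):
    """Write the mapping configuration to disk as indented JSON."""
    config_path = Path(config_path)
    config_path.parent.mkdir(parents=True, exist_ok=True)
    with open(config_path, "w") as dst:
        json.dump(config, dst, indent=2)
    return config_path


def load_config(config_path):
    """Read a configuration previously written by save_config."""
    with open(config_path) as src:
        return json.load(src)
```

If the runner follows the usual argschema conventions, the saved file could presumably also be passed on the command line via `--input_json`; treat that as an assumption to check against the `cell_type_mapper` documentation.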

In [4]:
runner = FromSpecifiedMarkersRunner(
    args=[],
    input_data=config
)
runner.run()
print("Done!")

=== Running Hierarchical Mapping 1.5.2 with config ===
{
  "query_markers": {
    "serialized_lookup": "/home/gdallagl/myworkdir/XDP/data/AllenAtlas/BGT_human_20250507/Human.query_markers.20250507.json",
    "collapse_markers": false,
    "log_level": "ERROR"
  },
  "drop_level": null,
  "query_path": "/home/gdallagl/myworkdir/XDP/data/XDP/disease/240805_SL-EXD_0328_B22FKKYLT4/SI-TT-H1/adata/raw_adata.h5ad",
  "obsm_clobber": false,
  "csv_result_path": "/home/gdallagl/myworkdir/XDP/data/XDP/disease/240805_SL-EXD_0328_B22FKKYLT4/SI-TT-H1/map_my_cell/mapping.csv",
  "extended_result_dir": null,
  "log_path": null,
  "precomputed_stats": {
    "log_level": "ERROR",
    "path": "/home/gdallagl/myworkdir/XDP/data/AllenAtlas/BGT_human_20250507/Human.precomputed_stats.20250507.h5"
  },
  "hdf5_result_path": null,
  "nodes_to_drop": null,
  "flatten": false,
  "verbose_csv": true,
  "type_assignment": {
    "n_processors": 4,
    "chunk_size": 10000,
    "bootstrap_factor": 0.5,
    "min_mark

{
  "NUMEXPR_NUM_THREADS": "",
  "MKL_NUM_THREADS": "",
  "OMP_NUM_THREADS": ""
}
  self.env(f"anndata version: {anndata.__version__}")


BENCHMARK: spent 1.4562e-01 seconds creating query marker cache
Running CPU implementation of type assignment.
BENCHMARK: spent 3.0735e+02 seconds assigning cell types
Writing marker genes to output file
MAPPING FROM SPECIFIED MARKERS RAN SUCCESSFULLY
CLEANING UP
Done!


# Output of mapping file

The results of our mapping are now in two files: the CSV file pointed to by `csv_dst_path` and the JSON file pointed to by `json_dst_path`. Dedicated documentation of the contents of the mapping output [can be found here](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/output.md).

## CSV output file

The CSV file is effectively just a dataframe. For every cell at every taxonomy level, you have its assigned cell type (both as a guaranteed unique "label" and a more human readable "name") along with quality metrics assessing the confidence in the mapping (see the detailed documentation above).

In [5]:
mapping_csv = pd.read_csv(csv_dst_path, comment='#')
mapping_csv = mapping_csv.set_index("cell_id")

display(mapping_csv)

cell_id,Neighborhood_label,Neighborhood_name,Neighborhood_bootstrapping_probability,Neighborhood_aggregate_probability,Neighborhood_correlation_coefficient,Class_label,Class_name,Class_bootstrapping_probability,Class_aggregate_probability,Class_correlation_coefficient,...,Group_name,Group_bootstrapping_probability,Group_aggregate_probability,Group_correlation_coefficient,Cluster_label,Cluster_name,Cluster_alias,Cluster_bootstrapping_probability,Cluster_aggregate_probability,Cluster_correlation_coefficient
AAACCAAAGATAACAG-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.6303,CS20250428_CLASS_0000,Astro-Epen,0.66,0.66,0.6408,...,Astrocyte,1.00,0.66,0.5962,CS20250428_CLUST_0253,Human-230,Human-230,1.00,0.66,0.5403
AAACCAAAGCATGGAG-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7041,CS20250428_CLASS_0008,Immune,1.00,1.00,0.8486,...,Microglia,1.00,1.00,0.5218,CS20250428_CLUST_0223,Human-532,Human-532,0.50,0.50,0.4024
AAACCAAAGGCGGAGT-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7610,CS20250428_CLASS_0010,OPC-Oligo,1.00,1.00,0.8914,...,Oligo PLEKHG1,1.00,1.00,0.7088,CS20250428_CLUST_0231,Human-15,Human-15,0.99,0.99,0.6570
AAACCAAAGGTCTATG-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.6028,CS20250428_CLASS_0000,Astro-Epen,1.00,1.00,0.6255,...,Astrocyte,1.00,1.00,0.5057,CS20250428_CLUST_0167,Human-159,Human-159,0.98,0.98,0.4633
AAACCAAAGGTTGTAT-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7130,CS20250428_CLASS_0010,OPC-Oligo,1.00,1.00,0.8124,...,Oligo OPALIN,1.00,1.00,0.6399,CS20250428_CLUST_0227,Human-1,Human-1,1.00,1.00,0.5997
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GTTGTGGGTCCGTCCA-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7741,CS20250428_CLASS_0010,OPC-Oligo,1.00,1.00,0.8936,...,ImOligo,1.00,1.00,0.6727,CS20250428_CLUST_0203,Human-53,Human-53,0.80,0.80,0.7143
GTTGTGGGTCGAAGCC-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7561,CS20250428_CLASS_0010,OPC-Oligo,1.00,1.00,0.9014,...,Oligo OPALIN,0.98,0.98,0.6007,CS20250428_CLUST_0227,Human-1,Human-1,1.00,0.98,0.8427
GTTGTGGGTGCACGTA-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7351,CS20250428_CLASS_0010,OPC-Oligo,1.00,1.00,0.8877,...,Oligo OPALIN,1.00,1.00,0.7132,CS20250428_CLUST_0227,Human-1,Human-1,1.00,1.00,0.6591
GTTGTGGGTGGCGTGT-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.5699,CS20250428_CLASS_0000,Astro-Epen,1.00,1.00,0.6010,...,Astrocyte,1.00,1.00,0.4413,CS20250428_CLUST_0168,Human-160,Human-160,0.81,0.81,0.4814
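
The quality metrics make it straightforward to flag uncertain assignments. The sketch below marks cells whose bootstrapping probability at a given level falls below a cutoff. The 0.9 cutoff and the helper name `flag_low_confidence` are illustrative assumptions, not recommendations from the MapMyCells documentation; the column names follow the `<Level>_bootstrapping_probability` pattern seen in the table above.

```python
import pandas as pd


def flag_low_confidence(mapping, level="Group", min_probability=0.9):
    """Return a boolean Series: True where the assignment at `level` is uncertain."""
    column = f"{level}_bootstrapping_probability"
    if column not in mapping.columns:
        raise KeyError(f"{column} not found in mapping table")
    return mapping[column] < min_probability


# Tiny stand-in for `mapping_csv`, using values from the table above.
example = pd.DataFrame(
    {
        "Cluster_name": ["Human-230", "Human-532", "Human-15"],
        "Cluster_bootstrapping_probability": [1.00, 0.50, 0.99],
    },
    index=pd.Index(
        ["AAACCAAAGATAACAG-1", "AAACCAAAGCATGGAG-1", "AAACCAAAGGCGGAGT-1"],
        name="cell_id",
    ),
)
uncertain = flag_low_confidence(example, level="Cluster")
print(example[uncertain])  # only AAACCAAAGCATGGAG-1 (probability 0.50) is flagged
```

On the real table, `mapping_csv[flag_low_confidence(mapping_csv, level="Cluster")]` would list the cells worth a second look at the cluster level.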


# Add Metadata to adata and Save

In [6]:
# Read adata
query_adata = sc.read_h5ad(QUERY_PATH)

# Merge metadata
query_adata.obs = query_adata.obs.join(mapping_csv, how="left")
display(query_adata.obs)

# save 
query_adata.write(adata_labelled_path)


n_cells_original = len(query_adata.obs)
n_cells_mapped = mapping_csv.index.nunique()
print(f"Original cells: {n_cells_original}")
print(f"Mapped cells: {n_cells_mapped}")

Unnamed: 0,x,y,pct_intronic,is_cell,Neighborhood_label,Neighborhood_name,Neighborhood_bootstrapping_probability,Neighborhood_aggregate_probability,Neighborhood_correlation_coefficient,Class_label,...,Group_name,Group_bootstrapping_probability,Group_aggregate_probability,Group_correlation_coefficient,Cluster_label,Cluster_name,Cluster_alias,Cluster_bootstrapping_probability,Cluster_aggregate_probability,Cluster_correlation_coefficient
AAACCAAAGATAACAG-1,33013.428400,444.562211,0.585746,True,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.6303,CS20250428_CLASS_0000,...,Astrocyte,1.00,0.66,0.5962,CS20250428_CLUST_0253,Human-230,Human-230,1.00,0.66,0.5403
AAACCAAAGCATGGAG-1,21225.159553,4193.340108,0.596834,True,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7041,CS20250428_CLASS_0008,...,Microglia,1.00,1.00,0.5218,CS20250428_CLUST_0223,Human-532,Human-532,0.50,0.50,0.4024
AAACCAAAGGCGGAGT-1,16429.015488,3852.143369,0.716821,True,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7610,CS20250428_CLASS_0010,...,Oligo PLEKHG1,1.00,1.00,0.7088,CS20250428_CLUST_0231,Human-15,Human-15,0.99,0.99,0.6570
AAACCAAAGGTCTATG-1,43242.438818,-1822.457007,0.609493,True,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.6028,CS20250428_CLASS_0000,...,Astrocyte,1.00,1.00,0.5057,CS20250428_CLUST_0167,Human-159,Human-159,0.98,0.98,0.4633
AAACCAAAGGTTGTAT-1,14118.369028,-4852.270550,0.588186,True,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7130,CS20250428_CLASS_0010,...,Oligo OPALIN,1.00,1.00,0.6399,CS20250428_CLUST_0227,Human-1,Human-1,1.00,1.00,0.5997
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GTTGTGGGTCCGTCCA-1,57226.380126,-635.403689,0.597685,True,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7741,CS20250428_CLASS_0010,...,ImOligo,1.00,1.00,0.6727,CS20250428_CLUST_0203,Human-53,Human-53,0.80,0.80,0.7143
GTTGTGGGTCGAAGCC-1,46045.094685,3106.066837,0.630523,True,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7561,CS20250428_CLASS_0010,...,Oligo OPALIN,0.98,0.98,0.6007,CS20250428_CLUST_0227,Human-1,Human-1,1.00,0.98,0.8427
GTTGTGGGTGCACGTA-1,9851.267283,7168.714885,0.541554,True,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7351,CS20250428_CLASS_0010,...,Oligo OPALIN,1.00,1.00,0.7132,CS20250428_CLUST_0227,Human-1,Human-1,1.00,1.00,0.6591
GTTGTGGGTGGCGTGT-1,17765.143818,-4292.867209,0.630550,True,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.5699,CS20250428_CLASS_0000,...,Astrocyte,1.00,1.00,0.4413,CS20250428_CLUST_0168,Human-160,Human-160,0.81,0.81,0.4814


Original cells: 34704
Mapped cells: 34704
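
Equal counts do not by themselves prove that the join matched every barcode: if the two indexes differed, the left join would silently leave missing values. A more direct check, sketched here with a hypothetical helper name and stand-in data, is to count cells without a label after the join.

```python
import pandas as pd


def count_unlabelled(obs, label_column="Cluster_label"):
    """Number of cells whose label is missing after a left join."""
    return int(obs[label_column].isna().sum())


# Stand-in data: the second barcode has no row in the mapping table.
obs = pd.DataFrame({"is_cell": [True, True]}, index=["AAAC-1", "AAAG-1"])
mapping = pd.DataFrame({"Cluster_label": ["CS20250428_CLUST_0253"]}, index=["AAAC-1"])
joined = obs.join(mapping, how="left")
print(count_unlabelled(joined))  # 1
```

For the notebook's data, `count_unlabelled(query_adata.obs)` should return 0 when every cell received a cluster assignment.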


---


## Optional: JSON output file

The JSON output contains everything in the CSV file, along with helpful metadata about your mapping run as [documented here](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/output.md#json-output-file).

In [None]:
with open(json_dst_path, 'rb') as src:
    mapping_json = json.load(src)

For instance, to see the configuration parameters corresponding to your mapping run, you can look at

In [None]:
print(json.dumps(mapping_json['config'], indent=2))

The actual cell type assignments are stored as a list under `'results'` as in

In [None]:
print(json.dumps(mapping_json['results'][0], indent=2))

One complication is that the cell type assignments are only referred to by their unique machine-readable labels in this file. Fortunately, the cell type taxonomy, along with the mapping between machine- and human-readable cell type labels is also provided in this file. The `TaxonomyTree` class provides a helpful interface with that data.

In [None]:
from cell_type_mapper.taxonomy.taxonomy_tree import TaxonomyTree

In [None]:
taxonomy = TaxonomyTree(data=mapping_json['taxonomy_tree'])

In [None]:
taxonomy.label_to_name(level='CCN20250428_LEVEL_3', label='CS20250428_GROUP_0025')