# Goal
This notebook will show how to map data using the MapMyCells data products released with the HMBA consensus Basal Ganglia taxonomy.

[Original Notebook here](https://github.com/AllenInstitute/HMBA_BasalGanglia_Consensus_Taxonomy/blob/main/examples/using_MapMyCells_data.ipynb)

# Install MapMyCells dependencies
We start by installing two libraries: `cell_type_mapper` (the backend behind MapMyCells) and `abc_atlas_access` (a convenient way to download other data provided by the Allen Institute).
```bash
uv add "cell_type_mapper@git+https://github.com/alleninstitute/cell_type_mapper.git@rc/v1.5.2"
uv add "abc_atlas_access[notebooks] @ git+https://github.com/alleninstitute/abc_atlas_access.git"
```

# Download mapping files for basal ganglia

To run MapMyCells, you need two supporting data files:
- [an HDF5 file](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/input_data_files/precomputed_stats_file.md) that defines your taxonomy and the average gene expression profile of each taxon in it
- [a lookup table of marker genes](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/input_data_files/marker_gene_lookup.md) for your taxonomy

Let's download those two files for the **human** basal ganglia data. 
Refer back [to the main page](https://github.com/AllenInstitute/HMBA_BasalGanglia_Consensus_Taxonomy/blob/main/index.md#cell-type-mapping-with-mapmycells) for the locations of the relevant files for all three species (human, marmoset, and macaque).

```bash
wget https://released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com/HMBA/BasalGanglia/BICAN_05072025_pre-print_release/MapMyCells/Human.precomputed_stats.20250507.h5
wget https://released-taxonomies-802451596237-us-west-2.s3.us-west-2.amazonaws.com/HMBA/BasalGanglia/BICAN_05072025_pre-print_release/MapMyCells/Human.query_markers.20250507.json
```
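
A truncated or failed download is easy to miss, so it can help to sanity-check the two files before mapping. The sketch below is a minimal illustration, not part of `cell_type_mapper`: the helper names `is_hdf5` and `is_json` are hypothetical, and the HDF5 check assumes the format signature sits at the very start of the file (that is, no user block precedes it).

```python
import json
from pathlib import Path

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # first 8 bytes of an HDF5 file


def is_hdf5(path) -> bool:
    """Return True if the file starts with the HDF5 format signature."""
    with open(path, "rb") as src:
        return src.read(8) == HDF5_SIGNATURE


def is_json(path) -> bool:
    """Return True if the whole file parses as JSON."""
    try:
        with open(path, "rb") as src:
            json.load(src)
        return True
    except ValueError:  # covers JSON and text-decoding errors
        return False


# Check the downloaded files if they are in the current directory
for name, check in [
    ("Human.precomputed_stats.20250507.h5", is_hdf5),
    ("Human.query_markers.20250507.json", is_json),
]:
    if Path(name).exists():
        print(name, "OK" if check(name) else "looks corrupted")
```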

---

In [1]:
import json
import matplotlib.pyplot as plt
import pandas as pd
import scanpy as sc
from pathlib import Path

import cell_type_mapper
from abc_atlas_access.abc_atlas_cache.abc_project_cache import AbcProjectCache
from cell_type_mapper.cli.from_specified_markers import FromSpecifiedMarkersRunner

# Hyperparameters
LIBRARY = "H3"

# Prepare Query data

Your query data must be:
- **Format**: `.h5ad` file (AnnData format)
- **Structure**: 
  - `X` layer with **RAW** gene expression data (cells × genes)
  - `obs` with cell metadata
  - `var` with gene names, **Ensembl IDs** for MapMyCells-supported taxonomies
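
The requirements above can be checked before mapping. The sketch below is a minimal, library-free illustration: the helper names `looks_like_ensembl_gene_id` and `looks_like_raw_counts` are hypothetical (they are not part of `cell_type_mapper`), and the regular expression assumes the usual Ensembl gene ID layout (`ENS`, an optional species code, `G`, 11 digits, and an optional version suffix).

```python
import re

# Assumed Ensembl gene ID layout: "ENS" + optional species code + "G" + 11 digits,
# optionally followed by a ".<version>" suffix.
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def looks_like_ensembl_gene_id(gene_id: str) -> bool:
    """Return True if gene_id matches the assumed Ensembl gene ID pattern."""
    return ENSEMBL_GENE_ID.match(gene_id) is not None


def looks_like_raw_counts(values) -> bool:
    """Return True if every value is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for v in values)


# Gene symbols such as "TP53" are flagged; Ensembl IDs pass
gene_ids = ["ENSG00000139618", "ENSG00000141510", "TP53"]
bad_ids = [g for g in gene_ids if not looks_like_ensembl_gene_id(g)]
print(bad_ids)
```

In a notebook, the same helpers could be fed `adata.var_names` and a sample of values from `adata.X`.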

In [2]:
QUERY_PATH=f"/home/gdallagl/myworkdir/XDP/data/XDP/disease/240805_SL-EXD_0328_B22FKKYLT4/SI-TT-{LIBRARY}/adata/raw_adata.h5ad"

# paths to files where mapping output will be written
json_dst_path = str(Path(QUERY_PATH).parent.parent / "map_my_cell" / "mapping.json")
csv_dst_path = str(Path(QUERY_PATH).parent.parent / "map_my_cell" / "mapping.csv")

# updated adata path
adata_labelled_path = str(Path(QUERY_PATH).parent / "labelled_adata.h5ad")

# FIXED (always mapping against basal ganglia)
# the lookup table of marker genes which we downloaded from S3
query_marker_path = "/home/gdallagl/myworkdir/XDP/data/AllenAtlas/BGT_human_20250507/Human.query_markers.20250507.json"
# the human-specific precomputed stats file which we downloaded from S3
precomputed_path = "/home/gdallagl/myworkdir/XDP/data/AllenAtlas/BGT_human_20250507/Human.precomputed_stats.20250507.h5"
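
The output locations above are derived from the query path: the mapping results go into a `map_my_cell` directory next to `adata`, and the labelled AnnData is written beside the raw one. A minimal sketch of that derivation follows; the helper name `derive_output_paths` and the example path are hypothetical and only mirror the layout used in this notebook.

```python
from pathlib import PurePosixPath


def derive_output_paths(query_path: str) -> dict:
    """Derive output paths from a query path, following the notebook's layout."""
    query = PurePosixPath(query_path)
    out_dir = query.parent.parent / "map_my_cell"
    return {
        "json": str(out_dir / "mapping.json"),
        "csv": str(out_dir / "mapping.csv"),
        "labelled_adata": str(query.parent / "labelled_adata.h5ad"),
    }


# Hypothetical example path, not the one used in this notebook
paths = derive_output_paths("/data/SI-TT-H3/adata/raw_adata.h5ad")
print(paths["csv"])
```

If the `map_my_cell` directory may not exist yet, creating it beforehand with `Path(...).mkdir(parents=True, exist_ok=True)` is a cheap precaution; whether the mapper creates missing directories itself is not shown here.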

# Run Mapping

Now we will actually [perform the mapping](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/mapping_cells.md).

In [3]:
config = {
    # query (input) and result (output) paths
    "query_path": QUERY_PATH,
    "extended_result_path": json_dst_path,
    "csv_result_path": csv_dst_path,
    "verbose_csv": True,

    # input paths (taxonomy reference files)
    "query_markers": {
       "serialized_lookup": query_marker_path
    },
    "precomputed_stats": {
        "path": precomputed_path
    },

    "type_assignment": {
        "n_processors": 8,
        "normalization": "raw",  # the query X holds raw counts, not log2(CPM)
        "bootstrap_factor": 0.5,
        "bootstrap_iteration": 100
    }
}
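
The `bootstrap_factor` and `bootstrap_iteration` settings control how mapping confidence is estimated: the mapper repeatedly picks a random subset of marker genes, correlates the cell with each candidate type's average profile, and lets the iterations vote. The toy sketch below illustrates that idea in plain Python. It is not the `cell_type_mapper` implementation, and the profiles, the function names, and the `seed` argument are invented for illustration.

```python
import math
import random
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def bootstrap_vote(cell, profiles, factor=0.5, iterations=100, seed=0):
    """Vote for the best-correlated profile over random marker-gene subsets."""
    rng = random.Random(seed)
    n_genes = len(cell)
    n_keep = max(2, round(factor * n_genes))
    votes = Counter()
    for _ in range(iterations):
        genes = rng.sample(range(n_genes), n_keep)
        sub_cell = [cell[g] for g in genes]
        best = max(
            profiles,
            key=lambda name: pearson(sub_cell, [profiles[name][g] for g in genes]),
        )
        votes[best] += 1
    winner, count = votes.most_common(1)[0]
    # Fraction of iterations that chose the winner
    return winner, count / iterations


# Invented toy profiles over 8 marker genes
profiles = {
    "type_A": [9, 8, 7, 1, 0, 1, 6, 0],
    "type_B": [0, 1, 1, 8, 9, 7, 0, 6],
}
cell = [8, 9, 6, 0, 1, 0, 5, 1]
winner, probability = bootstrap_vote(cell, profiles)
print(winner, probability)
```

The returned fraction plays the role of the `*_bootstrapping_probability` columns in the output: a value well below 1 means different gene subsets disagreed about the assignment.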

In [4]:
runner = FromSpecifiedMarkersRunner(
    args=[],
    input_data=config
)
runner.run()
print("Done!")

=== Running Hierarchical Mapping 1.5.2 with config ===
{
  "query_path": "/home/gdallagl/myworkdir/XDP/data/XDP/disease/240805_SL-EXD_0328_B22FKKYLT4/SI-TT-H3/adata/raw_adata.h5ad",
  "map_to_ensembl": false,
  "obsm_clobber": false,
  "verbose_csv": true,
  "hdf5_result_path": null,
  "max_gb": 100.0,
  "query_markers": {
    "serialized_lookup": "/home/gdallagl/myworkdir/XDP/data/AllenAtlas/BGT_human_20250507/Human.query_markers.20250507.json",
    "collapse_markers": false,
    "log_level": "ERROR"
  },
  "log_level": "ERROR",
  "verbose_stdout": true,
  "obsm_key": null,
  "summary_metadata_path": null,
  "csv_result_path": "/home/gdallagl/myworkdir/XDP/data/XDP/disease/240805_SL-EXD_0328_B22FKKYLT4/SI-TT-H3/map_my_cell/mapping.csv",
  "precomputed_stats": {
    "path": "/home/gdallagl/myworkdir/XDP/data/AllenAtlas/BGT_human_20250507/Human.precomputed_stats.20250507.h5",
    "log_level": "ERROR"
  },
  "drop_level": null,
  "cloud_safe": false,
  "extended_result_path": "/home/gdal

{
  "NUMEXPR_NUM_THREADS": "",
  "MKL_NUM_THREADS": "",
  "OMP_NUM_THREADS": ""
}
  self.env(f"anndata version: {anndata.__version__}")


BENCHMARK: spent 1.4915e-01 seconds creating query marker cache
Running CPU implementation of type assignment.
BENCHMARK: spent 5.3183e+02 seconds assigning cell types
Writing marker genes to output file
MAPPING FROM SPECIFIED MARKERS RAN SUCCESSFULLY
CLEANING UP
Done!


# Output of mapping file

The results of our mapping are now in two files: the CSV file pointed to by `csv_dst_path` and the JSON file pointed to by `json_dst_path`. Dedicated documentation of the contents of the mapping output [can be found here](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/output.md).

## CSV output file

The CSV file is effectively just a dataframe. For every cell at every taxonomy level, you have its assigned cell type (both as a guaranteed unique "label" and a more human readable "name") along with quality metrics assessing the confidence in the mapping (see the detailed documentation above).

In [5]:
mapping_csv = pd.read_csv(csv_dst_path, comment='#')
mapping_csv = mapping_csv.set_index("cell_id")

display(mapping_csv)

cell_id,Neighborhood_label,Neighborhood_name,Neighborhood_bootstrapping_probability,Neighborhood_aggregate_probability,Neighborhood_correlation_coefficient,Class_label,Class_name,Class_bootstrapping_probability,Class_aggregate_probability,Class_correlation_coefficient,...,Group_name,Group_bootstrapping_probability,Group_aggregate_probability,Group_correlation_coefficient,Cluster_label,Cluster_name,Cluster_alias,Cluster_bootstrapping_probability,Cluster_aggregate_probability,Cluster_correlation_coefficient
AAACCAAAGAACAGAC-1,CS20250428_NEIGH_0002,Subpallium GABA,1.0,1.0,0.6523,CS20250428_CLASS_0003,CN LGE GABA,1.00,1.00,0.6431,...,STRv D1 NUDAP MSN,1.00,1.0000,0.5333,CS20250428_CLUST_0516,Human-484,Human-484,1.00,1.0000,0.4764
AAACCAAAGAATATCC-1,CS20250428_NEIGH_0002,Subpallium GABA,1.0,1.0,0.7357,CS20250428_CLASS_0003,CN LGE GABA,1.00,1.00,0.6947,...,OT D1 ICj,1.00,1.0000,0.6176,CS20250428_CLUST_0338,Human-164,Human-164,1.00,1.0000,0.5817
AAACCAAAGAGTCGTC-1,CS20250428_NEIGH_0002,Subpallium GABA,1.0,1.0,0.7695,CS20250428_CLASS_0003,CN LGE GABA,1.00,1.00,0.7420,...,STRd D2 Striosome MSN,1.00,1.0000,0.5643,CS20250428_CLUST_0485,Human-189,Human-189,0.47,0.4700,0.4365
AAACCAAAGATCCCAA-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7086,CS20250428_CLASS_0000,Astro-Epen,1.00,1.00,0.7801,...,Astrocyte,0.99,0.9900,0.3414,CS20250428_CLUST_0255,Human-232,Human-232,0.95,0.9405,0.4905
AAACCAAAGCAAGCGA-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.5931,CS20250428_CLASS_0000,Astro-Epen,1.00,1.00,0.6480,...,Astrocyte,0.91,0.9100,0.2416,CS20250428_CLUST_0249,Human-14,Human-14,0.51,0.4641,0.3844
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GTTGTGGGTACGTGAT-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.6906,CS20250428_CLASS_0008,Immune,1.00,1.00,0.8336,...,Microglia,1.00,1.0000,0.6258,CS20250428_CLUST_0225,Human-543,Human-543,0.71,0.7100,0.3815
GTTGTGGGTAGGCATT-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.6890,CS20250428_CLASS_0010,OPC-Oligo,1.00,1.00,0.8381,...,OPC,1.00,1.0000,0.8231,CS20250428_CLUST_0232,Human-11,Human-11,0.87,0.8700,0.3882
GTTGTGGGTATCGTGG-1,CS20250428_NEIGH_0002,Subpallium GABA,1.0,1.0,0.6163,CS20250428_CLASS_0006,F M GABA,0.54,0.54,0.5724,...,ZI-HTH GABA,0.97,0.5238,0.4203,CS20250428_CLUST_0356,Human-521_1,Human-521_1,1.00,0.5238,0.4203
GTTGTGGGTCATACCC-1,CS20250428_NEIGH_0001,Nonneuron,1.0,1.0,0.7080,CS20250428_CLASS_0010,OPC-Oligo,1.00,1.00,0.8174,...,Oligo OPALIN,0.61,0.6100,0.5653,CS20250428_CLUST_0227,Human-1,Human-1,1.00,0.6100,0.6234
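
The probability columns can be used to flag uncertain assignments. The sketch below parses a small excerpt of the table above with the standard library, skipping `#` comment lines as `pd.read_csv(..., comment='#')` does, and lists cells whose cluster-level bootstrapping probability falls below a cutoff. The 0.9 cutoff and the helper name `low_confidence_cells` are illustrative assumptions, not recommendations from the MapMyCells documentation, and the comment line in the excerpt is a placeholder for the real file's metadata header.

```python
import csv
import io

# Excerpt of the mapping output shown above (three cells, two cluster-level columns).
# The first line stands in for the "#"-prefixed metadata header of the real file.
EXCERPT = """\
# placeholder metadata line
cell_id,Cluster_name,Cluster_bootstrapping_probability
AAACCAAAGAACAGAC-1,Human-484,1.00
AAACCAAAGAGTCGTC-1,Human-189,0.47
AAACCAAAGCAAGCGA-1,Human-14,0.51
"""


def low_confidence_cells(handle, column, cutoff=0.9):
    """Return (cell_id, probability) pairs whose probability is below cutoff."""
    rows = csv.DictReader(line for line in handle if not line.startswith("#"))
    return [
        (row["cell_id"], float(row[column]))
        for row in rows
        if float(row[column]) < cutoff
    ]


flagged = low_confidence_cells(
    io.StringIO(EXCERPT), "Cluster_bootstrapping_probability"
)
for cell_id, probability in flagged:
    print(cell_id, probability)
```

With the pandas dataframe from the cell above, the equivalent filter would be a boolean mask on the same column.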


# Add Metadata to adata and Save

In [6]:
# Read the query AnnData
query_adata = sc.read_h5ad(QUERY_PATH)

# Left-join the mapping results onto obs (obs index matched against cell_id)
query_adata.obs = query_adata.obs.join(mapping_csv, how="left")
display(query_adata.obs)

# Sanity check before saving: every cell should have a mapping
n_cells_original = len(query_adata.obs)
n_cells_mapped = mapping_csv.index.nunique()
print(f"Original cells: {n_cells_original}")
print(f"Mapped cells: {n_cells_mapped}")
assert n_cells_original == n_cells_mapped, "Mapping incomplete!"

# Save the labelled AnnData
query_adata.write(adata_labelled_path)

Unnamed: 0,x,y,pct_intronic,is_cell,dbscan_clusters,dbscan_score,has_spatial,Neighborhood_label,Neighborhood_name,Neighborhood_bootstrapping_probability,...,Group_name,Group_bootstrapping_probability,Group_aggregate_probability,Group_correlation_coefficient,Cluster_label,Cluster_name,Cluster_alias,Cluster_bootstrapping_probability,Cluster_aggregate_probability,Cluster_correlation_coefficient
AAACCAAAGAACAGAC-1,36750.336875,-4918.947027,0.582893,True,1,0.914737,True,CS20250428_NEIGH_0002,Subpallium GABA,1.0,...,STRv D1 NUDAP MSN,1.00,1.0000,0.5333,CS20250428_CLUST_0516,Human-484,Human-484,1.00,1.0000,0.4764
AAACCAAAGAATATCC-1,6546.427751,-8964.535170,0.676476,True,1,0.949614,True,CS20250428_NEIGH_0002,Subpallium GABA,1.0,...,OT D1 ICj,1.00,1.0000,0.6176,CS20250428_CLUST_0338,Human-164,Human-164,1.00,1.0000,0.5817
AAACCAAAGAGTCGTC-1,45734.792830,8427.068976,0.712612,True,1,0.978896,True,CS20250428_NEIGH_0002,Subpallium GABA,1.0,...,STRd D2 Striosome MSN,1.00,1.0000,0.5643,CS20250428_CLUST_0485,Human-189,Human-189,0.47,0.4700,0.4365
AAACCAAAGATCCCAA-1,51201.227539,525.607101,0.686855,True,1,0.960045,True,CS20250428_NEIGH_0001,Nonneuron,1.0,...,Astrocyte,0.99,0.9900,0.3414,CS20250428_CLUST_0255,Human-232,Human-232,0.95,0.9405,0.4905
AAACCAAAGCAAGCGA-1,8939.060270,5043.015355,0.672605,True,1,0.973961,True,CS20250428_NEIGH_0001,Nonneuron,1.0,...,Astrocyte,0.91,0.9100,0.2416,CS20250428_CLUST_0249,Human-14,Human-14,0.51,0.4641,0.3844
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GTTGTGGGTACGTGAT-1,7312.232149,5689.570899,0.671489,True,1,0.790263,True,CS20250428_NEIGH_0001,Nonneuron,1.0,...,Microglia,1.00,1.0000,0.6258,CS20250428_CLUST_0225,Human-543,Human-543,0.71,0.7100,0.3815
GTTGTGGGTAGGCATT-1,11313.567114,8499.633020,0.751973,True,1,0.966612,True,CS20250428_NEIGH_0001,Nonneuron,1.0,...,OPC,1.00,1.0000,0.8231,CS20250428_CLUST_0232,Human-11,Human-11,0.87,0.8700,0.3882
GTTGTGGGTATCGTGG-1,31393.199291,-120.827487,0.610678,True,1,0.963507,True,CS20250428_NEIGH_0002,Subpallium GABA,1.0,...,ZI-HTH GABA,0.97,0.5238,0.4203,CS20250428_CLUST_0356,Human-521_1,Human-521_1,1.00,0.5238,0.4203
GTTGTGGGTCATACCC-1,46188.098628,-7745.600240,0.691346,True,1,0.903554,True,CS20250428_NEIGH_0001,Nonneuron,1.0,...,Oligo OPALIN,0.61,0.6100,0.5653,CS20250428_CLUST_0227,Human-1,Human-1,1.00,0.6100,0.6234


Original cells: 36649
Mapped cells: 36649


---

## Optional: JSON output file

The JSON output contains everything in the CSV file, along with helpful metadata about your mapping run as [documented here](https://github.com/AllenInstitute/cell_type_mapper/blob/main/docs/output.md#json-output-file).

In [7]:
# with open(json_dst_path, 'rb') as src:
#     mapping_json = json.load(src)

For instance, to see the configuration parameters used for your mapping run, inspect the `'config'` entry:

In [8]:
# print(json.dumps(mapping_json['config'], indent=2))

The actual cell type assignments are stored as a list under `'results'` as in

In [9]:
# print(json.dumps(mapping_json['results'][0], indent=2))

One complication is that the cell type assignments are only referred to by their unique machine-readable labels in this file. Fortunately, the cell type taxonomy, along with the mapping between machine- and human-readable cell type labels is also provided in this file. The `TaxonomyTree` class provides a helpful interface with that data.

In [10]:
# from cell_type_mapper.taxonomy.taxonomy_tree import TaxonomyTree

In [11]:
# taxonomy = TaxonomyTree(data=mapping_json['taxonomy_tree'])

In [12]:
# taxonomy.label_to_name(level='CCN20250428_LEVEL_3', label='CS20250428_GROUP_0025')
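
If only the CSV output is at hand, a similar label-to-name lookup can be rebuilt from its paired `*_label` and `*_name` columns. The sketch below is a minimal illustration with a hypothetical helper, `build_label_to_name`, and two rows copied from the table shown earlier. It is independent of `TaxonomyTree` and only covers labels that actually occur among the mapped cells.

```python
def build_label_to_name(rows, levels):
    """Map each machine-readable label to its human-readable name, per level."""
    lookup = {level: {} for level in levels}
    for row in rows:
        for level in levels:
            lookup[level][row[f"{level}_label"]] = row[f"{level}_name"]
    return lookup


# Two rows excerpted from the CSV output (Neighborhood and Class levels only)
rows = [
    {
        "Neighborhood_label": "CS20250428_NEIGH_0002",
        "Neighborhood_name": "Subpallium GABA",
        "Class_label": "CS20250428_CLASS_0003",
        "Class_name": "CN LGE GABA",
    },
    {
        "Neighborhood_label": "CS20250428_NEIGH_0001",
        "Neighborhood_name": "Nonneuron",
        "Class_label": "CS20250428_CLASS_0000",
        "Class_name": "Astro-Epen",
    },
]
lookup = build_label_to_name(rows, ["Neighborhood", "Class"])
print(lookup["Class"]["CS20250428_CLASS_0003"])
```

With pandas, the `rows` argument could come from `mapping_csv.to_dict("records")`.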