The purpose of this notebook is to demonstrate the prototype gene mapper library put together to generalize the processes of

- Mapping gene symbols to gene identifiers
- Mapping gene identifiers across authorities (NCBI, ENSEMBL, etc.)
- Mapping genes in one species to ortholog genes in another species

for MapMyCells.

## Setup

In [1]:
import h5py
import json
import pathlib
import subprocess
import sqlite3
import time

import abc_atlas_access.abc_atlas_cache.abc_project_cache as abc_cache_library

import mmc_gene_mapper
import mmc_gene_mapper.utils.timestamp as timestamp
import mmc_gene_mapper.utils.file_utils as file_utils
import mmc_gene_mapper.ensembl_download.scraper as ensembl_scraper
import mmc_gene_mapper.mapper.mapper as mapper
import mmc_gene_mapper.cli.create_db_file as create_db_file



In [2]:
abc_cache = abc_cache_library.AbcProjectCache.from_cache_dir('../data/abc_cache')

type.compare_manifests('releases/20250331/manifest.json', 'releases/20250531/manifest.json')
To load another version of the dataset, run
type.load_manifest('releases/20250531/manifest.json')


In [3]:
data_dir = pathlib.Path(
    mmc_gene_mapper.__file__).parent.parent.parent / "data"
assert data_dir.is_dir()

Below we will instantiate the class that is used to actually do the mapping. This class works by querying a sqlite database file that contains all of the relevant gene identifier mappings. We will store that database at the location defined below.



In [4]:
db_path = data_dir/"gene_mapper_example.db"
assert db_path.parent.is_dir()
print(db_path)

/Users/scott.daniel/KnowledgeEngineering/mmc_gene_mapper/data/gene_mapper_example.db


You can either

a) Create the database, a process which will take about 2 hours as the code downloads and ingests data from both NCBI and ENSEMBL

b) Copy the database from Isilon to the path above

I have put a functional copy of the database here
```
/allen/aibs/technology/danielsf/gene_mapper/gene_mapper_example.db
```
Just copy that from Isilon to the path defined as `db_path` above, and this notebook should run without having to go through the time-consuming database creation process.

**Note:** the database file is 15 GB in size.

In [5]:
if not db_path.is_file():
    print("====running database creation; this will take ~ 2 hours=====")
    create_db_file.create_db_file(
        db_path=db_path,
        local_dir=data_dir/timestamp.get_timestamp(),
        ensembl_version=114,
        suppress_download_stdout=True,
        clobber=False
    )
gene_mapper = mapper.MMCGeneMapper(
    db_path=db_path
)

## What, qualitatively, is in this database?

The database you just created/downloaded was constructed by downloading
- `gene_info.gz`
- `gene2ensembl.gz`
- `gene_orthologs.gz`

from
```
https://ftp.ncbi.nlm.nih.gov/gene/DATA/
```

These files provide
- mappings between NCBI gene identifiers and gene symbols (`gene_info.gz`)
- mappings between NCBI gene identifiers and species taxons (`gene_info.gz`)
- mappings between NCBI gene identifiers and ENSEMBL IDs (`gene2ensembl.gz`)
- ortholog mappings between genes of different species (`gene_orthologs.gz`)

We also ingest
```
https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
```
to get the mapping between NCBI species taxon and human-readable species names.

Finally, we scrape all of the `gff3` files from a specific release (default 114) published on
```
https://ftp.ensembl.org/pub/
```
to provide mappings between ENSEMBL IDs and gene symbols.

The data from these sources is used to construct a series of tables (described below) that are queried to map the user's input list of genes (be they symbols, ENSEMBL IDs, or NCBI IDs) to the desired outputs (ENSEMBL IDs or NCBI IDs, possibly from a different species).
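As a rough illustration of the kind of parsing this ingest involves, here is a minimal sketch that reads a few `gene_info`-style rows from an in-memory gzip stream. It is not the library's ingest code. It assumes only that the file is tab-separated, has a header line starting with `#`, and that the first three columns are the taxon ID, the NCBI gene ID, and the symbol. The rows and the helper name `read_gene_info` are made up for illustration.

```python
import csv
import gzip
import io

# Toy stand-in for gene_info.gz. Only the first three columns are
# reproduced, and the rows are illustrative, not real NCBI records.
raw = (
    "#tax_id\tGeneID\tSymbol\n"
    "10090\t111\tGeneA\n"
    "10090\t222\tGeneB\n"
    "9606\t333\tGENEA\n"
)
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))


def read_gene_info(stream):
    """Return (taxon, gene_id, symbol) tuples from a gzipped TSV stream."""
    rows = []
    with gzip.open(stream, mode="rt", encoding="utf-8") as src:
        for line in csv.reader(src, delimiter="\t"):
            # Skip blank lines and the '#'-prefixed header.
            if not line or line[0].startswith("#"):
                continue
            rows.append((int(line[0]), int(line[1]), line[2]))
    return rows


records = read_gene_info(buffer)
print(records)
```

Reading the file as a stream rather than loading it whole matters for the real `gene_info.gz`, which covers every species NCBI tracks.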


## Actual description of tables in the database

The database backing the `gene_mapper` contains the following tables

### Metadata tables

#### authority
For lack of a better term, "authority" is how we refer to NCBI and ENSEMBL as institutions. The `authority` table consists of the columns

- id -- an integer used for internal indexing
- name -- the human-readable name of the authority

#### citation
The `citation` table tracks all of the raw data files used to justify the various mappings (symbol-to-identifier, ENSEMBL-to-NCBI, orthologs, etc.). It consists of the columns

- id -- an integer used for internal indexing
- name -- a short, human-readable name for the citation (this will be used whenever you have to specify a citation when calling the `gene_mapper`'s methods)
- metadata -- a JSON serialized dict containing whatever information is necessary to specify the data backing this citation. This is a very free-form column as different entries will be specified in different ways (see below). Usually it will be a URL and a "downloaded on" timestamp.
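To make the shape of these two metadata tables concrete, here is a minimal sketch using an in-memory sqlite database. The column types, the toy citation, and the helper `get_citation` are assumptions made for illustration; the real schema may differ in its details.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed column types; the notebook only names the columns.
conn.execute("CREATE TABLE authority (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE citation (id INTEGER PRIMARY KEY, name TEXT, metadata TEXT)"
)
conn.executemany(
    "INSERT INTO authority VALUES (?, ?)", [(0, "NCBI"), (1, "ENSEMBL")]
)

# The metadata column holds a JSON-serialized dict (toy contents here).
toy_metadata = {"url": "https://example.org/toy.gff3", "downloaded_on": "2025-01-01"}
conn.execute(
    "INSERT INTO citation VALUES (?, ?, ?)",
    (0, "TOY-CITATION", json.dumps(toy_metadata)),
)


def get_citation(conn, name):
    """Look a citation up by its short name and decode its metadata."""
    row = conn.execute(
        "SELECT name, metadata FROM citation WHERE name=?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return {"name": row[0], "metadata": json.loads(row[1])}


citation = get_citation(conn, "TOY-CITATION")
print(json.dumps(citation, indent=2))
```

Storing the metadata as serialized JSON is what lets each citation carry a differently shaped record without changing the table definition.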

We can list the available authorities like so

In [6]:
authority_list = gene_mapper.get_all_authorities()
print(authority_list)

['NCBI', 'ENSEMBL']


Similarly, we can list the available citations (we will only list a few here, since each species-specific gff3 from ENSEMBL is its own citation).

In [7]:
citation_list = gene_mapper.get_all_citations()
for c in citation_list[:5]:
    print(c['name'])


NCBI
ENSEMBL-9606-101
ENSEMBL-10090-98
ENSEMBL-80966-114
ENSEMBL-211598-114


As mentioned above, different citations contain different metadata. For instance, the `NCBI` citation (referring to the data downloaded from the NCBI FTP server) contains a list of the files downloaded, their hashes, and the date they were downloaded on. (NCBI updates its FTP server daily; it's not clear that they keep readily available version information around, though we can certainly investigate that.) This is what the `NCBI` citation metadata looks like:

In [8]:
for citation in citation_list:
    if citation['name'] == 'NCBI':
        print(json.dumps(citation, indent=2))

{
  "name": "NCBI",
  "metadata": {
    "gene_info.gz": {
      "host": "ftp.ncbi.nlm.nih.gov",
      "src_path": "gene/DATA/gene_info.gz",
      "hash": "md5:932969426286efd0114db1c7b7a0a9b2",
      "downloaded_on": "2025-07-25-16-52-21"
    },
    "gene2ensembl.gz": {
      "host": "ftp.ncbi.nlm.nih.gov",
      "src_path": "gene/DATA/gene2ensembl.gz",
      "hash": "md5:181c1cb0feac8aa8cb45b98fa7a758a0",
      "downloaded_on": "2025-07-25-16-53-39"
    },
    "gene_orthologs.gz": {
      "host": "ftp.ncbi.nlm.nih.gov",
      "src_path": "gene/DATA/gene_orthologs.gz",
      "hash": "md5:744c66facf3775f54bd6d82b468af89d",
      "downloaded_on": "2025-07-25-16-54-10"
    }
  }
}


Here is an example of the citation generated for one of the ENSEMBL gff3 files. It is worth pointing out that the gff3 files are processed through the code in the `brain-bican/bkbit` GitHub repository to make the data easier to parse on ingest. `bkbit` also produces this metadata.

In [9]:
for citation in citation_list:
    if citation['name'] == 'ENSEMBL-10090-98':
        print(json.dumps(citation, indent=2))

{
  "name": "ENSEMBL-10090-98",
  "metadata": {
    "biolink:OrganismTaxon": {
      "id": "NCBITaxon:10090",
      "iri": "http://purl.obolibrary.org/obo/NCBITaxon_10090",
      "category": [
        "biolink:OrganismTaxon"
      ],
      "name": "house mouse",
      "full_name": "Mus musculus"
    },
    "bican:GenomeAssembly": {
      "id": "NCBIAssembly:GCF_000001635.26",
      "category": [
        "bican:GenomeAssembly"
      ],
      "name": "GRCm38",
      "in_taxon": [
        "NCBITaxon:10090"
      ],
      "in_taxon_label": "Mus musculus",
      "version": "26"
    },
    "bican:GenomeAnnotation": {
      "id": "bican:annotation-ENSEMBL-10090-98",
      "category": [
        "bican:GenomeAnnotation"
      ],
      "description": "ENSEMBL Mus musculus Annotation Release 98",
      "in_taxon": [
        "NCBITaxon:10090"
      ],
      "in_taxon_label": "Mus musculus",
      "version": "98",
      "digest": [
        "urn:uuid:1696a139-804c-4576-98df-2be64dac34ff"
      ],
  

### Data tables

The data tables in the database are

#### NCBI_species
The `NCBI_species` table just ingests the NCBI organism taxonomy and records
- id -- the NCBI taxon ID of the species
- name -- the human readable name of the species. **Note:** we record all possible names ("common", "scientific", etc.) of the species in different rows. That way, users can give us whatever name they have for a species and, if possible, we can return an `id` for cross-referencing with other tables.

This is meant to be hidden from the user. When actually mapping genes, the `gene_mapper` will detect the species associated with the user's input list of genes. Users can specify a species into which they want their genes mapped using the human readable name. The mapper will perform the necessary database query to convert the human-readable name into the integer species taxon.

Here we query the database "by hand" to demonstrate its contents.

In [10]:
with sqlite3.connect(gene_mapper.db_path) as conn:
    cursor = conn.cursor()
    for name in ('house mouse', 'Mus musculus', 'mouse', 'human', 'Homo sapiens'):
        result = cursor.execute(
            """
            SELECT id FROM NCBI_species WHERE name=?
            """,
            (name,)
        ).fetchall()
        print(name,result)

house mouse [(10090,)]
Mus musculus [(10090,)]
mouse [(10088,), (10090,)]
human [(9606,)]
Homo sapiens [(9606,)]


**Note:** in the case of "mouse", there is more than one matching taxon. Calling a mapping with `dst_species="mouse"` will fail.

In [11]:
import traceback

mouse_symbols = ["Xkr4", "Npbwr1", "not_a_symbol", "Rrs1"]
try:
    gene_mapper.map_genes(
        gene_list=mouse_symbols,
        dst_species="mouse",
        dst_authority="ENSEMBL"
    )
except Exception as err:
    msg = traceback.format_exc()
    print(
        f"Failed with error message:\n=======\n{msg}"
    )

Failed with error message:
Traceback (most recent call last):
  File "/var/folders/8b/hnw5vq8s20jbpz51wdhd11fr0000gp/T/ipykernel_80157/4200133508.py", line 5, in <module>
    gene_mapper.map_genes(
  File "/Users/scott.daniel/KnowledgeEngineering/mmc_gene_mapper/src/mmc_gene_mapper/mapper/mapper.py", line 220, in map_genes
    dst_species = query_utils.get_species(
                  ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/scott.daniel/KnowledgeEngineering/mmc_gene_mapper/src/mmc_gene_mapper/query_db/query.py", line 35, in get_species
    species_taxon = _get_species_taxon(
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/scott.daniel/KnowledgeEngineering/mmc_gene_mapper/src/mmc_gene_mapper/query_db/query.py", line 87, in _get_species_taxon
    raise ValueError(
ValueError: 2 species match name mouse
[(10088,), (10090,)]
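One way to guard against this kind of ambiguity in your own code is to resolve the species name before calling the mapper. The sketch below rebuilds a toy `NCBI_species` table from the rows printed above and raises a `ValueError` when a name does not match exactly one taxon. The helper `lookup_taxon` is hypothetical and is not part of `mmc_gene_mapper`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NCBI_species (id INTEGER, name TEXT)")
# Rows taken from the query results shown earlier in this notebook.
conn.executemany(
    "INSERT INTO NCBI_species VALUES (?, ?)",
    [
        (10090, "house mouse"),
        (10090, "Mus musculus"),
        (10090, "mouse"),
        (10088, "mouse"),
        (9606, "human"),
        (9606, "Homo sapiens"),
    ],
)


def lookup_taxon(conn, name):
    """Return the single taxon ID matching name, or raise ValueError."""
    rows = conn.execute(
        "SELECT DISTINCT id FROM NCBI_species WHERE name=? ORDER BY id",
        (name,),
    ).fetchall()
    taxa = [row[0] for row in rows]
    if len(taxa) != 1:
        raise ValueError(f"{len(taxa)} species match name {name}: {taxa}")
    return taxa[0]


print(lookup_taxon(conn, "Mus musculus"))
try:
    lookup_taxon(conn, "mouse")
except ValueError as err:
    print(err)
```

Using a scientific name such as `"Mus musculus"` sidesteps the problem, since it matches a single taxon in the results shown above.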



A function is provided for listing all of the species the database knows about.

**Note:** This is not a list of all of the species for which the database has gene information. It is merely a list of the species that can be associated with a taxon ID in the database (for cross-referencing with the tables that contain gene information).

In [12]:
species_list = gene_mapper.get_all_species()
print(species_list[1679491:1679499])

['Homo heidelbergensis', 'Homo heidelbergensis Schoetensack, 1908', 'Homo lar', 'Homo lar Linneaus, 1771', 'Homo neanderthalensis', 'Homo sapiens', 'Homo sapiens Linnaeus, 1758', 'Homo sapiens environmental sample']


In [13]:
len(species_list)

4111928

#### gene
The `gene` table contains all of the information about individual genes ingested into the database. Columns are
- authority -- an integer for cross-referencing the `authority` table
- citation -- an integer for cross-referencing the `citation` table
- species_taxon -- an integer for cross-referencing the `NCBI_species` table
- id -- an integer for internal indexing
- identifier -- a string. The full unique identifier (e.g. ENSMUSG00000012345) of the gene
- symbol -- a string. The human-readable name of the gene (**Note:** the BICAN bkbit files contain `symbol` and `name` entries for genes. If these differ, each is ingested as a separate row in the `gene` table, with `name` being recorded in the `symbol` column of its row).

#### gene_equivalence
The `gene_equivalence` table records mappings between authorities (ENSEMBL and NCBI). Its columns are
- species_taxon -- an integer for cross-referencing with the `NCBI_species` table (in what species are we looking for equivalences?)
- citation -- an integer for cross-referencing with the `citation` table (who said these genes were equivalent?)
- authority0 -- an integer for cross-referencing with the `authority` table
- gene0 -- an integer; the ID of the gene in `authority0`
- authority1 -- an integer for cross-referencing with the `authority` table
- gene1 -- an integer; the ID of the gene in `authority1` that is equivalent to `gene0` according to `citation`.

**Note:** we record equivalences symmetrically: every `(gene0, gene1)` pair is also recorded as `(gene1, gene0)`, so users do not have to worry about which index (`gene0` or `gene1`) their input data is compared to.
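The symmetric storage described above can be sketched as follows. The column types, the integer IDs, and the helpers `insert_symmetric` and `equivalents` are assumptions for illustration only, not the library's ingest code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gene_equivalence (
        species_taxon INTEGER,
        citation INTEGER,
        authority0 INTEGER,
        gene0 INTEGER,
        authority1 INTEGER,
        gene1 INTEGER
    )
    """
)


def insert_symmetric(conn, species_taxon, citation,
                     authority0, gene0, authority1, gene1):
    """Record one equivalence in both directions."""
    rows = [
        (species_taxon, citation, authority0, gene0, authority1, gene1),
        (species_taxon, citation, authority1, gene1, authority0, gene0),
    ]
    conn.executemany(
        "INSERT INTO gene_equivalence VALUES (?, ?, ?, ?, ?, ?)", rows
    )


def equivalents(conn, src_authority, gene, dst_authority):
    """Always query on (authority0, gene0); symmetry makes this sufficient."""
    rows = conn.execute(
        """
        SELECT gene1 FROM gene_equivalence
        WHERE authority0=? AND gene0=? AND authority1=?
        """,
        (src_authority, gene, dst_authority),
    ).fetchall()
    return [row[0] for row in rows]


# Toy IDs: authority 0 and 1, internal gene IDs 11 and 501.
insert_symmetric(conn, 10090, 0, 0, 11, 1, 501)
print(equivalents(conn, 0, 11, 1))
print(equivalents(conn, 1, 501, 0))
```

The trade-off is a table twice as large in exchange for a single query shape regardless of mapping direction.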

#### gene_ortholog
The `gene_ortholog` table records cross-species ortholog relationships. Its columns are
- authority -- an integer for cross-referencing the `authority` table (are we looking for NCBI orthologs or ENSEMBL orthologs?)
- citation -- an integer for cross-referencing the `citation` table (who said these genes were orthologs?)
- species -- an integer for cross-referencing the `NCBI_species` table
- gene -- an integer, the ID of the gene in `species` whose orthologs we are looking for
- ortholog_group -- an integer; genes with the same value of `ortholog_group` are orthologs of each other
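To show how an `ortholog_group` column can answer "what is the ortholog of this gene in that species?", here is a minimal sketch built on a toy table with the columns listed above. The self-join is one plausible way to query such a table; it is an assumption, not a description of the query the library actually runs. The gene IDs and the helper `find_orthologs` are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gene_ortholog (
        authority INTEGER,
        citation INTEGER,
        species INTEGER,
        gene INTEGER,
        ortholog_group INTEGER
    )
    """
)
conn.executemany(
    "INSERT INTO gene_ortholog VALUES (?, ?, ?, ?, ?)",
    [
        (0, 0, 10090, 11, 1),  # toy mouse gene 11 ...
        (0, 0, 9606, 21, 1),   # ... shares group 1 with toy human gene 21
        (0, 0, 10090, 12, 2),
        (0, 0, 9606, 22, 2),
        (0, 0, 10090, 13, 3),  # group 3 has no human member
    ],
)


def find_orthologs(conn, gene, src_species, dst_species):
    """Return genes in dst_species sharing an ortholog_group with gene."""
    rows = conn.execute(
        """
        SELECT dst.gene
        FROM gene_ortholog AS src
        JOIN gene_ortholog AS dst
          ON src.ortholog_group = dst.ortholog_group
         AND src.authority = dst.authority
         AND src.citation = dst.citation
        WHERE src.gene = ? AND src.species = ? AND dst.species = ?
        ORDER BY dst.gene
        """,
        (gene, src_species, dst_species),
    ).fetchall()
    return [row[0] for row in rows]


print(find_orthologs(conn, 11, 10090, 9606))
print(find_orthologs(conn, 13, 10090, 9606))
```

Grouping genes this way avoids storing every pairwise relationship: adding a species to a group adds one row per gene rather than one row per pair.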


## Actually mapping data

The `gene_mapper` provides a single function to map gene identifiers from one form `(authority, species)` to another. To run it you specify
- the list of gene identifiers you have
- the name of the species you want your genes mapped into (the mapper will automatically detect the species your genes are already in)
- the authority (ENSEMBL or NCBI) you want your genes mapped into
- (optionally) the name of the citation to use for any cross species ortholog mappings (default is to use the NCBI lookup table)

The mapper will then go through the necessary transformations and return a dict containing the list of identifiers your genes mapped onto (`gene_list`) as well as a metadata structure (`metadata`) listing the transformations that were applied to the input `gene_list`, to help you understand how your genes were mapped.

### Mapping symbols to identifiers



In [14]:
mouse_symbols = ["Xkr4", "Npbwr1", "not_a_symbol", "Rrs1"]
mouse_ens_mapping = gene_mapper.map_genes(
    gene_list=mouse_symbols,
    dst_species="Mus musculus",
    dst_authority="ENSEMBL",
    ortholog_citation="NCBI"
)
print(json.dumps(mouse_ens_mapping, indent=2))

{
  "metadata": [
    {
      "mapping": {
        "axis": "authority",
        "from": "symbol",
        "to": "ENSEMBL"
      },
      "citation": {
        "name": "ENSEMBL-10090-98",
        "metadata": {
          "biolink:OrganismTaxon": {
            "id": "NCBITaxon:10090",
            "iri": "http://purl.obolibrary.org/obo/NCBITaxon_10090",
            "category": [
              "biolink:OrganismTaxon"
            ],
            "name": "house mouse",
            "full_name": "Mus musculus"
          },
          "bican:GenomeAssembly": {
            "id": "NCBIAssembly:GCF_000001635.26",
            "category": [
              "bican:GenomeAssembly"
            ],
            "name": "GRCm38",
            "in_taxon": [
              "NCBITaxon:10090"
            ],
            "in_taxon_label": "Mus musculus",
            "version": "26"
          },
          "bican:GenomeAnnotation": {
            "id": "bican:annotation-ENSEMBL-10090-98",
            "category": [
       

For ease of reading, here is the metadata part of the result. In this case, since the input data was already aligned to mouse, there is only one mapping step: the mapping from gene symbol to ENSEMBL. Below, we will see cases of mappings that involve multiple steps.

In [15]:
print(json.dumps(mouse_ens_mapping['metadata'], indent=2))

[
  {
    "mapping": {
      "axis": "authority",
      "from": "symbol",
      "to": "ENSEMBL"
    },
    "citation": {
      "name": "ENSEMBL-10090-98",
      "metadata": {
        "biolink:OrganismTaxon": {
          "id": "NCBITaxon:10090",
          "iri": "http://purl.obolibrary.org/obo/NCBITaxon_10090",
          "category": [
            "biolink:OrganismTaxon"
          ],
          "name": "house mouse",
          "full_name": "Mus musculus"
        },
        "bican:GenomeAssembly": {
          "id": "NCBIAssembly:GCF_000001635.26",
          "category": [
            "bican:GenomeAssembly"
          ],
          "name": "GRCm38",
          "in_taxon": [
            "NCBITaxon:10090"
          ],
          "in_taxon_label": "Mus musculus",
          "version": "26"
        },
        "bican:GenomeAnnotation": {
          "id": "bican:annotation-ENSEMBL-10090-98",
          "category": [
            "bican:GenomeAnnotation"
          ],
          "description": "ENSEMBL Mus m

Here is the actual list of mapped genes.

In [16]:
print("input genes")
print(json.dumps(mouse_symbols, indent=2))
print("map to")
print(json.dumps(mouse_ens_mapping['gene_list'], indent=2))

input genes
[
  "Xkr4",
  "Npbwr1",
  "not_a_symbol",
  "Rrs1"
]
map to
[
  "ENSMUSG00000051951",
  "ENSMUSG00000033774",
  "symbol:ENSEMBL:UNMAPPABLE_NO_MATCH_0",
  "ENSMUSG00000061024"
]


**Note:** the gene that failed to map (`"not_a_symbol"`) is mapped to a unique placeholder name that attempts to communicate at which point the mapping failed (there was no match in the symbol -> ENSEMBL mapping stage).
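If you want to separate successes from failures after a mapping call, something like the following sketch may help. It relies on two things visible in the output above: the output list is in the same order as the input list, and failure placeholders contain the substring `UNMAPPABLE`. The helper `split_mapping` is hypothetical and the lists are copied from the output above rather than produced by a live call.

```python
input_genes = ["Xkr4", "Npbwr1", "not_a_symbol", "Rrs1"]
mapped_genes = [
    "ENSMUSG00000051951",
    "ENSMUSG00000033774",
    "symbol:ENSEMBL:UNMAPPABLE_NO_MATCH_0",
    "ENSMUSG00000061024",
]


def split_mapping(src, dst, marker="UNMAPPABLE"):
    """Pair inputs with outputs and separate mapped from unmapped genes."""
    if len(src) != len(dst):
        raise ValueError("input and output lists must have the same length")
    mapped = {}
    failed = {}
    for src_gene, dst_gene in zip(src, dst):
        if marker in dst_gene:
            failed[src_gene] = dst_gene
        else:
            mapped[src_gene] = dst_gene
    return mapped, failed


mapped, failed = split_mapping(input_genes, mapped_genes)
print(mapped)
print(failed)
```

Note that a dict keyed on the input gene collapses duplicate inputs; keep the paired lists instead if your input can contain repeats.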

Here is a similar call mapping symbols to identifiers in NCBI

In [17]:
mouse_ncbi_mapping = gene_mapper.map_genes(
    gene_list=mouse_symbols,
    dst_species="Mus musculus",
    dst_authority="NCBI"
)
print(json.dumps(mouse_ncbi_mapping, indent=2))

{
  "metadata": [
    {
      "mapping": {
        "axis": "authority",
        "from": "symbol",
        "to": "NCBI"
      },
      "citation": {
        "name": "NCBI",
        "metadata": {
          "gene_info.gz": {
            "host": "ftp.ncbi.nlm.nih.gov",
            "src_path": "gene/DATA/gene_info.gz",
            "hash": "md5:932969426286efd0114db1c7b7a0a9b2",
            "downloaded_on": "2025-07-25-16-52-21"
          },
          "gene2ensembl.gz": {
            "host": "ftp.ncbi.nlm.nih.gov",
            "src_path": "gene/DATA/gene2ensembl.gz",
            "hash": "md5:181c1cb0feac8aa8cb45b98fa7a758a0",
            "downloaded_on": "2025-07-25-16-53-39"
          },
          "gene_orthologs.gz": {
            "host": "ftp.ncbi.nlm.nih.gov",
            "src_path": "gene/DATA/gene_orthologs.gz",
            "hash": "md5:744c66facf3775f54bd6d82b468af89d",
            "downloaded_on": "2025-07-25-16-54-10"
          }
        }
      }
    }
  ],
  "gene_list": [
    "NC

### Mapping from ENSEMBL to NCBI


In [18]:
ens_to_ncbi = gene_mapper.map_genes(
    dst_authority='NCBI',
    gene_list=["ENSMUSG00000051951",
               "ENSMUSG00000030337",
               "ENSMUSG00000087247",
               "nope",
               "ENSMUSG00000025911"],
    dst_species="Mus musculus"
)
print(json.dumps(ens_to_ncbi, indent=2))

{
  "metadata": [
    {
      "mapping": {
        "axis": "authority",
        "from": "symbol",
        "to": "NCBI"
      },
      "citation": {
        "name": "NCBI",
        "metadata": {
          "gene_info.gz": {
            "host": "ftp.ncbi.nlm.nih.gov",
            "src_path": "gene/DATA/gene_info.gz",
            "hash": "md5:932969426286efd0114db1c7b7a0a9b2",
            "downloaded_on": "2025-07-25-16-52-21"
          },
          "gene2ensembl.gz": {
            "host": "ftp.ncbi.nlm.nih.gov",
            "src_path": "gene/DATA/gene2ensembl.gz",
            "hash": "md5:181c1cb0feac8aa8cb45b98fa7a758a0",
            "downloaded_on": "2025-07-25-16-53-39"
          },
          "gene_orthologs.gz": {
            "host": "ftp.ncbi.nlm.nih.gov",
            "src_path": "gene/DATA/gene_orthologs.gz",
            "hash": "md5:744c66facf3775f54bd6d82b468af89d",
            "downloaded_on": "2025-07-25-16-54-10"
          }
        }
      }
    },
    {
      "mapping": {
   

You will note that the metadata includes two mappings: one from ENSEMBL to NCBI, and one from symbol to NCBI. This is because the mapper tries to interpret `"nope"` as a gene symbol (it is obviously not an ENSEMBL or an NCBI ID). That mapping fails, which is why
```
"nope" -> "symbol:NCBI:UNMAPPABLE_NO_MATCH_0"
```
but the metadata for the attempted mapping is still recorded.
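The notebook does not show how the mapper decides what kind of identifier it has been given. Purely as an illustration of the idea, here is a sketch of a pattern-based guess. The regular expressions and the function `guess_identifier_type` are assumptions based on the identifier formats that appear in this notebook; they are not the library's detection logic, and they ignore edge cases such as version suffixes on ENSEMBL IDs.

```python
import re

# Assumed patterns, inferred from identifiers shown in this notebook:
# ENSEMBL gene IDs look like "ENS" + optional species letters + "G" + digits,
# and NCBI IDs look like "NCBIGene:" + digits.
ENSEMBL_PATTERN = re.compile(r"^ENS[A-Z]*G\d+$")
NCBI_PATTERN = re.compile(r"^NCBIGene:\d+$")


def guess_identifier_type(gene):
    """Return 'ENSEMBL', 'NCBI', or 'symbol' for a gene string."""
    if ENSEMBL_PATTERN.match(gene):
        return "ENSEMBL"
    if NCBI_PATTERN.match(gene):
        return "NCBI"
    # Anything else falls through to being treated as a symbol.
    return "symbol"


for gene in ["ENSMUSG00000051951", "ENSG00000168646", "NCBIGene:67269", "nope"]:
    print(gene, "->", guess_identifier_type(gene))
```

Under a scheme like this, `"nope"` is treated as a symbol by default, which is consistent with the extra symbol-to-NCBI step recorded in the metadata.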



Here is a call mapping from NCBI to ENSEMBL

In [19]:
ncbi_to_ens = gene_mapper.map_genes(
    dst_authority='ENSEMBL',
    gene_list=["NCBIGene:67269", "NCBIGene:54200", "nope", "NCBIGene:70911"],
    dst_species="Mus musculus"
)
print(json.dumps(ncbi_to_ens, indent=2))

{
  "metadata": [
    {
      "mapping": {
        "axis": "authority",
        "from": "symbol",
        "to": "ENSEMBL"
      },
      "citation": {
        "name": "ENSEMBL-10090-98",
        "metadata": {
          "biolink:OrganismTaxon": {
            "id": "NCBITaxon:10090",
            "iri": "http://purl.obolibrary.org/obo/NCBITaxon_10090",
            "category": [
              "biolink:OrganismTaxon"
            ],
            "name": "house mouse",
            "full_name": "Mus musculus"
          },
          "bican:GenomeAssembly": {
            "id": "NCBIAssembly:GCF_000001635.26",
            "category": [
              "bican:GenomeAssembly"
            ],
            "name": "GRCm38",
            "in_taxon": [
              "NCBITaxon:10090"
            ],
            "in_taxon_label": "Mus musculus",
            "version": "26"
          },
          "bican:GenomeAnnotation": {
            "id": "bican:annotation-ENSEMBL-10090-98",
            "category": [
       

To simulate "production scale," we will now download all of the genes in the Yao et al. 2023 10X gene panel and try mapping them from ENSEMBL to NCBI to see how long it takes to process that many genes.

In [20]:
wmb_gene_df = abc_cache.get_metadata_dataframe(
    directory='WMB-10X',
    file_name='gene'
)
wmb_genes = wmb_gene_df.gene_identifier.values
t0 = time.time()
wmb_ncbi = gene_mapper.map_genes(
    dst_authority='NCBI',
    gene_list=wmb_genes,
    dst_species='Mus musculus',
)
dur = time.time()-t0
print(f'mapping {len(wmb_genes)} genes took {dur:.2e} seconds')

n_mapped = 0
for k in wmb_ncbi['gene_list']:
    if 'UNMAPPABLE' not in k:
        n_mapped += 1
print(f'{n_mapped} genes had unique NCBI matches')


mapping 32285 genes took 1.09e+01 seconds
24507 genes had unique NCBI matches


Now we will download the marker genes used by MapMyCells for the Whole Mouse Brain taxonomy and try mapping them from ENSEMBL to NCBI.

In [21]:
src_path = "https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/mapmycells/WMB-10X/20240831/mouse_markers_230821.json"
wmb_marker_path = pathlib.Path('../data/mouse_markers_230821.json')
if not wmb_marker_path.is_file():
    print(f"downloading {wmb_marker_path}")
    process = subprocess.Popen(
        args=[
           "wget",
            src_path,
            "-O",
            str(wmb_marker_path),
        ],
        stderr=subprocess.DEVNULL
    )
    process.wait()

In [22]:
marker_genes = set()
with open(wmb_marker_path, 'rb') as src:
    marker_lookup = json.load(src)
for key in marker_lookup:
    if key in ('log', 'metadata'):
        continue
    marker_genes = marker_genes.union(set(marker_lookup[key]))
marker_genes = sorted(marker_genes)

Let's now map our mouse marker genes from ENSEMBL to NCBI

In [23]:
print(f'{len(marker_genes)} marker genes')
t0 = time.time()
marker_ncbi = gene_mapper.map_genes(
    dst_authority='NCBI',
    gene_list=marker_genes,
    dst_species='Mus musculus'
)
dur = time.time()-t0
print(f'mapping {len(marker_genes)} genes took {dur:.2e} seconds')

n_mapped = 0
for k in marker_ncbi['gene_list']:
    if 'UNMAPPABLE' not in k:
        n_mapped += 1

print(f'{n_mapped} genes had unique NCBI matches')


6558 marker genes
mapping 6558 genes took 5.37e-01 seconds
6493 genes had unique NCBI matches


### Ortholog mapping

To test ortholog mapping, let's take the genes from Yao et al. 2023 that only mapped to one NCBI gene and map them to human orthologs.

First let's just map 5 genes so that we can display the output and see what it looks like

In [24]:
gene_list = marker_genes[10:15] + ['not_an_actual_gene']
print(json.dumps(gene_list, indent=2))
test_orthologs = gene_mapper.map_genes(
    dst_authority='ENSEMBL',
    dst_species='human',
    gene_list=gene_list
)
print("maps to")
print(json.dumps(test_orthologs['gene_list'], indent=2))

[
  "ENSMUSG00000000142",
  "ENSMUSG00000000159",
  "ENSMUSG00000000184",
  "ENSMUSG00000000197",
  "ENSMUSG00000000202",
  "not_an_actual_gene"
]
maps to
[
  "ENSG00000168646",
  "ENSG00000183067",
  "ENSG00000118971",
  "ENSG00000102452",
  "ENSG00000204347",
  "symbol:NCBI:UNMAPPABLE_NO_MATCH_0"
]


Since we just did an ortholog mapping, which will have involved several transformation steps, let's look at the contents of the mapping metadata in more detail.

In [25]:
print(len(test_orthologs['metadata']))

4


Each element in `test_orthologs["metadata"]` records the metadata for a particular mapping step as a dict. `"mapping"` explains the purpose of the mapping step.

In [26]:
for metadata_entry in test_orthologs['metadata']:
    print(metadata_entry['mapping'])

{'axis': 'authority', 'from': 'symbol', 'to': 'NCBI'}
{'axis': 'authority', 'from': 'ENSEMBL', 'to': 'NCBI'}
{'axis': 'species', 'from': {'name': 'Balb/c mouse', 'taxon': 10090}, 'to': {'name': 'human', 'taxon': 9606}}
{'axis': 'authority', 'from': 'NCBI', 'to': 'ENSEMBL'}


Since the input genes were a mixture of ENSEMBL IDs and one symbol (`"not_an_actual_gene"`), the mapper first maps from symbol to NCBI and from ENSEMBL to NCBI (recall that we use NCBI identifiers to record the mapping between ortholog genes), then maps from mouse to human (in NCBI). Finally, since the user requested `dst_authority="ENSEMBL"`, the mapper maps from `"NCBI"->"ENSEMBL"`.
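For long chains it can be handy to condense the `"mapping"` entries into one line per step. The sketch below does that for the four steps printed above. The list is copied from that output rather than taken from a live call, and the helper `describe_step` is hypothetical.

```python
steps = [
    {"axis": "authority", "from": "symbol", "to": "NCBI"},
    {"axis": "authority", "from": "ENSEMBL", "to": "NCBI"},
    {
        "axis": "species",
        "from": {"name": "Balb/c mouse", "taxon": 10090},
        "to": {"name": "human", "taxon": 9606},
    },
    {"axis": "authority", "from": "NCBI", "to": "ENSEMBL"},
]


def describe_step(step):
    """Render one mapping step as 'axis: from -> to'."""

    def label(endpoint):
        # Species endpoints are dicts; authority endpoints are plain strings.
        if isinstance(endpoint, dict):
            return f"{endpoint['name']} (taxon {endpoint['taxon']})"
        return str(endpoint)

    return f"{step['axis']}: {label(step['from'])} -> {label(step['to'])}"


summary = [describe_step(step) for step in steps]
for line in summary:
    print(line)
```

With a live result, the same helper could be applied to `entry["mapping"]` for each entry in the returned `metadata` list.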

The `"citation"` element of the metadata dict records the actual metadata describing how that step in the mapping was performed.

In [27]:
print(json.dumps(test_orthologs['metadata'][2]['citation'], indent=2))

{
  "gene_info.gz": {
    "host": "ftp.ncbi.nlm.nih.gov",
    "src_path": "gene/DATA/gene_info.gz",
    "hash": "md5:932969426286efd0114db1c7b7a0a9b2",
    "downloaded_on": "2025-07-25-16-52-21"
  },
  "gene2ensembl.gz": {
    "host": "ftp.ncbi.nlm.nih.gov",
    "src_path": "gene/DATA/gene2ensembl.gz",
    "hash": "md5:181c1cb0feac8aa8cb45b98fa7a758a0",
    "downloaded_on": "2025-07-25-16-53-39"
  },
  "gene_orthologs.gz": {
    "host": "ftp.ncbi.nlm.nih.gov",
    "src_path": "gene/DATA/gene_orthologs.gz",
    "hash": "md5:744c66facf3775f54bd6d82b468af89d",
    "downloaded_on": "2025-07-25-16-54-10"
  }
}


Now let's map all 32,000 mouse genes to human.

In [28]:
t0 = time.time()
print("mapping genes like")
print(wmb_genes[:5])
print("to human ENSEMBL genes")
wmb_to_whb_orthologs = gene_mapper.map_genes(
    dst_authority='ENSEMBL',
    dst_species='human',
    gene_list=wmb_genes
)
dur = time.time()-t0
print(f"mapping {len(wmb_to_whb_orthologs['gene_list'])} orthologs took {dur:.2e} seconds")
n_mapped = 0
for k in wmb_to_whb_orthologs['gene_list']:
    if 'UNMAPPABLE' not in k:
        n_mapped += 1
print(f'{n_mapped} genes had orthologs')


mapping genes like
['ENSMUSG00000051951' 'ENSMUSG00000089699' 'ENSMUSG00000102331'
 'ENSMUSG00000102343' 'ENSMUSG00000025900']
to human ENSEMBL genes
mapping 32285 orthologs took 2.15e+01 seconds
16460 genes had orthologs


In [29]:
print(json.dumps(wmb_to_whb_orthologs['metadata'], indent=2))

[
  {
    "mapping": {
      "axis": "authority",
      "from": "ENSEMBL",
      "to": "NCBI"
    },
    "citation": {
      "name": "NCBI",
      "metadata": {
        "gene_info.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_info.gz",
          "hash": "md5:932969426286efd0114db1c7b7a0a9b2",
          "downloaded_on": "2025-07-25-16-52-21"
        },
        "gene2ensembl.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene2ensembl.gz",
          "hash": "md5:181c1cb0feac8aa8cb45b98fa7a758a0",
          "downloaded_on": "2025-07-25-16-53-39"
        },
        "gene_orthologs.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_orthologs.gz",
          "hash": "md5:744c66facf3775f54bd6d82b468af89d",
          "downloaded_on": "2025-07-25-16-54-10"
        }
      }
    }
  },
  {
    "mapping": {
      "axis": "species",
      "from": {
        "name": "Balb/c mouse",
        "

#### Map from mouse to naked mole rat

Now, inspired by a recent community forum post, let's take our mouse marker genes, convert them into NCBI identifiers, and finally convert those into naked mole rat orthologs.

In [30]:
t0 = time.time()

naked_mole_rat = gene_mapper.map_genes(
    dst_authority='ENSEMBL',
    dst_species='naked mole rat',
    gene_list=marker_genes,
    ortholog_citation='NCBI'
)

dur = time.time()-t0
print(f"mapping took {dur:.2e} seconds")
n_mapped = 0
for g in naked_mole_rat['gene_list']:
    if 'UNMAPPABLE' not in g:
        n_mapped += 1
print(f"{n_mapped} genes of {len(marker_genes)} had orthologs")
print(marker_genes[:5])
print(naked_mole_rat['gene_list'][:5])

mapping took 1.44e+00 seconds
5048 genes of 6558 had orthologs
['ENSMUSG00000000028', 'ENSMUSG00000000037', 'ENSMUSG00000000056', 'ENSMUSG00000000058', 'ENSMUSG00000000078']
['ENSHGLG00000015159', 'NCBI:ENSEMBL:UNMAPPABLE_NO_MATCH_0', 'ENSHGLG00000008789', 'ENSHGLG00000002962', 'ENSHGLG00000007077']


Here is the metadata for the naked mole rat ortholog mapping

In [31]:
print(json.dumps(naked_mole_rat['metadata'], indent=2))

[
  {
    "mapping": {
      "axis": "authority",
      "from": "ENSEMBL",
      "to": "NCBI"
    },
    "citation": {
      "name": "NCBI",
      "metadata": {
        "gene_info.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_info.gz",
          "hash": "md5:932969426286efd0114db1c7b7a0a9b2",
          "downloaded_on": "2025-07-25-16-52-21"
        },
        "gene2ensembl.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene2ensembl.gz",
          "hash": "md5:181c1cb0feac8aa8cb45b98fa7a758a0",
          "downloaded_on": "2025-07-25-16-53-39"
        },
        "gene_orthologs.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_orthologs.gz",
          "hash": "md5:744c66facf3775f54bd6d82b468af89d",
          "downloaded_on": "2025-07-25-16-54-10"
        }
      }
    }
  },
  {
    "mapping": {
      "axis": "species",
      "from": {
        "name": "Balb/c mouse",
        "