The purpose of this notebook is to demonstrate the prototype gene mapper library put together to generalize the processes of

- Mapping gene symbols to gene identifiers
- Mapping gene identifiers across authorities (NCBI, ENSEMBL, etc.)
- Mapping genes in one species to ortholog genes in another species

for MapMyCells.

In addition to the `mmc_gene_mapper` codebase and its dependencies, this notebook depends on the `abc_atlas_access` tool, though only for downloading metadata to simulate mapping tens of thousands of genes at once (as we would with real user data).

In [1]:
import h5py
import json
import pathlib
import subprocess
import sqlite3
import time

import abc_atlas_access.abc_atlas_cache.abc_project_cache as abc_cache_library

import mmc_gene_mapper
import mmc_gene_mapper.utils.file_utils as file_utils
import mmc_gene_mapper.mapper.mapper as mapper

In [2]:
abc_cache = abc_cache_library.AbcProjectCache.from_cache_dir('../data/abc_cache')

In [3]:
data_dir = pathlib.Path(
    mmc_gene_mapper.__file__).parent.parent.parent / "data"
assert data_dir.is_dir()

In [4]:
db_path = data_dir/"gene_mapper_example.db"
assert db_path.parent.is_dir()

There are some files in
```
/allen/scratch/aibstemp/danielsf/gene_mapper_data
```
that are used to ingest data beyond what is available in the NCBI FTP server. Please copy those files to `../data/db_creation_data` in this repository (the cell below will fail if any of the files are not present).

In [5]:
data_file_spec=[
    {"type": "bkbit",
     "path": "../data/db_creation_data/mouse_ENSEMBL-10090-98.jsonld"},
    {"type": "bkbit",
     "path": "../data/db_creation_data/human_ENSEMBL-9606-101.jsonld"},
    {"type": "hmba_orthologs",
     "path": "../data/db_creation_data/all_gene_ids.csv",
     "name": "HMBA",
     "baseline_species": "human"}
]

error_msg = ""
for spec in data_file_spec:
    pth = pathlib.Path(spec['path'])
    if not pth.is_file():
        error_msg += f"{pth} is not a file\n"
if len(error_msg) > 0:
    raise RuntimeError(error_msg)

Below we instantiate the class that is used to actually do the mapping. The first time you run the cell, it will take ~ 15 minutes as it downloads ~ 4 GB of data from NCBI and ingests that data, alongside the local data from the cell above, to create an ~ 8 GB sqlite database that backs the mapping process.

As long as you leave `clobber=False` below, you will not have to recreate that database on subsequent uses.

In [6]:
gene_mapper = mapper.MMCGeneMapper(
    db_path=db_path,
    local_dir=data_dir,
    data_file_spec=data_file_spec,
    clobber=False,
    force_download=False,
    suppress_download_stdout=True
)

    chunk 0.00e+00, 5.00e+06
    chunk 5.00e+06, 1.00e+07
    chunk 1.00e+07, 1.50e+07
    chunk 1.50e+07, 2.00e+07
    chunk 2.00e+07, 2.50e+07
    chunk 2.50e+07, 3.00e+07
    chunk 3.00e+07, 3.50e+07
    chunk 3.50e+07, 4.00e+07
    chunk 4.00e+07, 4.50e+07
    chunk 4.50e+07, 5.00e+07
    chunk 5.00e+07, 5.50e+07
    chunk 5.50e+07, 6.00e+07
    chunk 6.00e+07, 6.00e+07
    INGESTING 55421 GENES
    INGESTING 60671 GENES


## Contents of the database backing the mapper

The database backing the `gene_mapper` contains the following tables

### Metadata tables

#### authority
The `authority` table just tracks the gene identifying authorities the database knows about (i.e. ENSEMBL and NCBI). It consists of the columns

- id -- an integer used for internal indexing
- name -- the human-readable name of the authority

#### citation
The `citation` table tracks all of the datasets used to justify the various mappings (symbol-to-identifier, ENSEMBL-to-NCBI, orthologs, etc.). It consists of the columns

- id -- an integer used for internal indexing
- name -- a short, human-readable name for the citation (this will be used whenever you have to specify a citation when calling the `gene_mapper`'s methods)
- metadata -- a JSON serialized dict containing whatever information is necessary to recreate or specify the data backing this citation. This is a very free-form column as different entries will be specified in different ways (see below)

We can list the available authorities like so

In [7]:
authority_list = gene_mapper.get_all_authorities()
print(authority_list)

['NCBI', 'ENSEMBL']


Similarly, we can list the available citations.

In [8]:
citation_list = gene_mapper.get_all_citations()
print([c['name'] for c in citation_list])

['NCBI', 'ENSEMBL-10090-98', 'ENSEMBL-9606-101', 'HMBA']


As mentioned above, different citations contain different metadata. For instance, the `NCBI` citation (referring to the data downloaded from the NCBI FTP server) contains a list of the files downloaded, their hashes, and the date they were downloaded (NCBI updates its FTP server daily; it's not clear that they keep readily available version information around, though we can certainly investigate that). This is what the `NCBI` citation metadata looks like:

In [9]:
for citation in citation_list:
    if citation['name'] == 'NCBI':
        print(json.dumps(citation, indent=2))

{
  "name": "NCBI",
  "metadata": {
    "gene_info.gz": {
      "host": "ftp.ncbi.nlm.nih.gov",
      "src_path": "gene/DATA/gene_info.gz",
      "hash": "md5:47e8cd29e4c0b95a2e86cf78582316c5",
      "downloaded_on": "2025-04-07-10-24-52"
    },
    "gene2ensembl.gz": {
      "host": "ftp.ncbi.nlm.nih.gov",
      "src_path": "gene/DATA/gene2ensembl.gz",
      "hash": "md5:4febdfd55f0b42aba4e0253a8e258416",
      "downloaded_on": "2025-04-07-10-27-04"
    },
    "gene_orthologs.gz": {
      "host": "ftp.ncbi.nlm.nih.gov",
      "src_path": "gene/DATA/gene_orthologs.gz",
      "hash": "md5:6152a853d48a76299909b711d19756a7",
      "downloaded_on": "2025-04-07-10-27-27"
    }
  }
}
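
The recorded hashes make it possible to check whether a local copy of a file still matches what was ingested. Below is a minimal sketch of such a check. It assumes the `hash` field is the MD5 hex digest of the raw file bytes prefixed with `md5:`, which this notebook does not confirm. `md5_tag` and `file_matches_citation` are hypothetical helpers, not part of `mmc_gene_mapper`.

```python
import hashlib
import pathlib
import tempfile


def md5_tag(path, chunk_size=1024 * 1024):
    """Return 'md5:<hexdigest>' for the file at path, reading it in chunks."""
    hasher = hashlib.md5()
    with open(path, "rb") as src:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            hasher.update(chunk)
    return f"md5:{hasher.hexdigest()}"


def file_matches_citation(path, recorded_hash):
    """Hypothetical helper: compare a local file against a recorded hash string."""
    return md5_tag(path) == recorded_hash


# Demonstrate on a throwaway file rather than a real NCBI download
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = pathlib.Path(tmp_dir) / "demo.txt"
    demo_path.write_bytes(b"hello")
    demo_hash = md5_tag(demo_path)
    demo_ok = file_matches_citation(demo_path, demo_hash)
    demo_bad = file_matches_citation(demo_path, "md5:" + "0" * 32)
```

Reading in chunks keeps memory use flat, which matters for multi-GB files like the NCBI downloads.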


The citations based on BICAN bkbit files contain all of the GenomeAnnotation and GenomeAssembly information in those files.

In [10]:
for citation in citation_list:
    if citation['name'] == 'ENSEMBL-10090-98':
        print(json.dumps(citation, indent=2))

{
  "name": "ENSEMBL-10090-98",
  "metadata": {
    "biolink:OrganismTaxon": {
      "id": "NCBITaxon:10090",
      "iri": "http://purl.obolibrary.org/obo/NCBITaxon_10090",
      "category": [
        "biolink:OrganismTaxon"
      ],
      "name": "house mouse",
      "full_name": "Mus musculus"
    },
    "bican:GenomeAssembly": {
      "id": "NCBIAssembly:GCF_000001635.26",
      "category": [
        "bican:GenomeAssembly"
      ],
      "name": "GRCm38",
      "in_taxon": [
        "NCBITaxon:10090"
      ],
      "in_taxon_label": "Mus musculus",
      "version": "26"
    },
    "bican:GenomeAnnotation": {
      "id": "bican:annotation-ENSEMBL-10090-98",
      "category": [
        "bican:GenomeAnnotation"
      ],
      "description": "ENSEMBL Mus musculus Annotation Release 98",
      "in_taxon": [
        "NCBITaxon:10090"
      ],
      "in_taxon_label": "Mus musculus",
      "version": "98",
      "digest": [
        "urn:uuid:58618e83-7b4c-46ed-b300-9df8abf1f5d0"
      ],
  

The `HMBA` citation is an ingest of a lookup table created by our science team. As such, it has the least detailed information.

In [11]:
for citation in citation_list:
    if citation['name'] == 'HMBA':
        print(json.dumps(citation, indent=2))

{
  "name": "HMBA",
  "metadata": {
    "file": "../data/db_creation_data/all_gene_ids.csv",
    "hash": "md5:3a038d261dfde93c95c0c1d9ede39a24"
  }
}


### Data tables

The data tables in the database are

#### NCBI_species
The `NCBI_species` table just ingests the NCBI organism taxonomy and records
- id -- the NCBI taxon ID of the species
- name -- the human-readable name of the species. **Note:** we record all possible names ("common", "scientific", etc.) of the species in different rows. That way, users can give us whatever name they have for a species and, if possible, we can return an `id` for cross-referencing with other tables.

This is meant to be hidden from the user. The user specifies the species when performing a mapping and the gene_mapper makes the necessary database call. Below, we will make the call "by hand", so you can see what is in the database.

In [12]:
with sqlite3.connect(gene_mapper.db_path) as conn:
    cursor = conn.cursor()
    for name in ('house mouse', 'Mus musculus', 'mouse', 'human', 'Homo sapiens'):
        result = cursor.execute(
            """
            SELECT id FROM NCBI_species WHERE name=?
            """,
            (name,)
        ).fetchall()
        print(name,result)

house mouse [(10090,)]
Mus musculus [(10090,)]
mouse [(10088,), (10090,)]
human [(9606,)]
Homo sapiens [(9606,)]


**Note:** in the case of "mouse", there is more than one matching taxon. Calling a mapping with `species_name="mouse"` will fail.

In [13]:
mouse_symbols = ["Xkr4", "Npbwr1", "not_a_symbol", "Rrs1"]
gene_mapper.identifiers_from_symbols(
    gene_symbol_list=mouse_symbols,
    species_name="mouse",
    authority_name="ENSEMBL"
)

RuntimeError: 2 species match name mouse
[(10088,), (10090,)]
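
The failure above follows from requiring a species name to resolve to exactly one taxon. Here is a minimal sketch of that check against a toy, in-memory copy of the rows printed earlier. `resolve_species` is a hypothetical function written for illustration; the library's internal implementation may differ.

```python
import sqlite3


def resolve_species(conn, name):
    """Return the single taxon id matching name; raise if none or several match."""
    rows = conn.execute(
        "SELECT id FROM NCBI_species WHERE name=?", (name,)
    ).fetchall()
    if len(rows) != 1:
        raise RuntimeError(f"{len(rows)} species match name {name}\n{rows}")
    return rows[0][0]


# Toy stand-in for the real table, populated with the rows shown above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NCBI_species (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO NCBI_species VALUES (?, ?)",
    [
        (10090, "house mouse"),
        (10090, "Mus musculus"),
        (10088, "mouse"),
        (10090, "mouse"),
        (9606, "human"),
        (9606, "Homo sapiens"),
    ],
)

print(resolve_species(conn, "Mus musculus"))
try:
    resolve_species(conn, "mouse")
except RuntimeError as err:
    print(err)
```

Raising on ambiguity, rather than silently picking one taxon, forces the caller to pass an unambiguous name such as `"Mus musculus"`.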

#### gene
The `gene` table contains all of the information about individual genes ingested into the database. Columns are
- authority -- an integer for cross-referencing the `authority` table
- citation -- an integer for cross-referencing the `citation` table
- species_taxon -- an integer for cross-referencing the `NCBI_species` table
- id -- an integer for internal indexing
- identifier -- a string. The full unique identifier (e.g. ENSMUSG00000051951) of the gene
- symbol -- a string. The human-readable name of the gene (**Note:** the BICAN bkbit files contain `symbol` and `name` entries for genes. If these differ, each is ingested as a separate row in the `gene` table, with `name` recorded in the `symbol` column).
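
To make the `symbol`/`name` note concrete, here is a minimal sketch of a symbol lookup against a toy table with the columns listed above. The integer keys, the alternate name `hypothetical_alt_name`, and the `lookup_symbols` function are made up for illustration. Only the two ENSEMBL identifiers are taken from this notebook's outputs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gene (
        authority INTEGER,
        citation INTEGER,
        species_taxon INTEGER,
        id INTEGER,
        identifier TEXT,
        symbol TEXT
    )
    """
)
# Toy rows: the second row stands in for a bkbit `name` that differs
# from the gene's `symbol` and is therefore stored as its own row
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 10090, 0, "ENSMUSG00000051951", "Xkr4"),
        (1, 1, 10090, 1, "ENSMUSG00000051951", "hypothetical_alt_name"),
        (1, 1, 10090, 2, "ENSMUSG00000033774", "Npbwr1"),
    ],
)


def lookup_symbols(conn, symbols, authority, species_taxon):
    """Map each symbol to a sorted list of identifiers (empty if unknown)."""
    result = {}
    for symbol in symbols:
        rows = conn.execute(
            "SELECT DISTINCT identifier FROM gene "
            "WHERE symbol=? AND authority=? AND species_taxon=?",
            (symbol, authority, species_taxon),
        ).fetchall()
        result[symbol] = sorted(row[0] for row in rows)
    return result


toy_mapping = lookup_symbols(
    conn, ["Xkr4", "hypothetical_alt_name", "not_a_symbol"], 1, 10090
)
print(toy_mapping)
```

Because both spellings live in the `symbol` column, a single query serves users who supply either one.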

#### gene_equivalence
The `gene_equivalence` table records mappings between authorities (ENSEMBL and NCBI). Its columns are
- species_taxon -- an integer for cross-referencing with the `NCBI_species` table (in what species are we looking for equivalences?)
- citation -- an integer for cross-referencing with the `citation` table (who said these genes were equivalent?)
- authority0 -- an integer for cross-referencing with the `authority` table
- gene0 -- an integer; the ID of the gene in `authority0`
- authority1 -- an integer for cross-referencing with the `authority` table
- gene1 -- an integer; the ID of the gene in `authority1` that is equivalent to `gene0` according to `citation`.

**Note:** we record equivalences symmetrically: every `(gene0, gene1)` pair is also recorded as `(gene1, gene0)`, so users do not have to worry about which index (`gene0` or `gene1`) their input data is compared to.
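
The effect of symmetric recording can be sketched with a toy table that has the columns listed above. Everything here is illustrative: the integer keys are made up, and `insert_symmetric` and `equivalents` are hypothetical functions rather than the library's ingestion code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gene_equivalence (
        species_taxon INTEGER,
        citation INTEGER,
        authority0 INTEGER,
        gene0 INTEGER,
        authority1 INTEGER,
        gene1 INTEGER
    )
    """
)


def insert_symmetric(conn, species_taxon, citation, auth_a, gene_a, auth_b, gene_b):
    """Record one equivalence in both directions."""
    conn.executemany(
        "INSERT INTO gene_equivalence VALUES (?, ?, ?, ?, ?, ?)",
        [
            (species_taxon, citation, auth_a, gene_a, auth_b, gene_b),
            (species_taxon, citation, auth_b, gene_b, auth_a, gene_a),
        ],
    )


def equivalents(conn, authority0, gene0, authority1):
    """Look up equivalents by always matching the input against gene0."""
    rows = conn.execute(
        "SELECT gene1 FROM gene_equivalence "
        "WHERE authority0=? AND gene0=? AND authority1=?",
        (authority0, gene0, authority1),
    ).fetchall()
    return sorted(row[0] for row in rows)


# Made-up integer keys: authority 0 and 1, genes 100 and 200
insert_symmetric(conn, 10090, 0, auth_a=1, gene_a=100, auth_b=0, gene_b=200)
forward = equivalents(conn, 1, 100, 0)
backward = equivalents(conn, 0, 200, 1)
print(forward, backward)
```

Storing each pair twice doubles the row count but lets every lookup use the same `WHERE gene0=?` query, whichever direction the user is mapping.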

#### gene_ortholog
The `gene_ortholog` table records cross-species ortholog relationships. Its columns are
- authority -- an integer for cross-referencing the `authority` table (are we looking for NCBI orthologs or ENSEMBL orthologs?)
- citation -- an integer for cross-referencing the `citation` table (who said these genes were orthologs?)
- species0 -- an integer for cross-referencing the `NCBI_species` table
- gene0 -- an integer, the ID of the gene in `species0` whose orthologs we are looking for
- species1 -- an integer, the index in `NCBI_species` of the other species we are considering
- gene1 -- an integer, the index of the gene in `species1` that is considered an ortholog of `gene0`

**Note:** we ingest all `[(species0, gene0), (species1, gene1)]` pairs twice, permuting the order so that users can search for orthologs in both directions (`species0 -> species1` and `species1 -> species0`) easily.


### Actually mapping data

The `gene_mapper` provides functions to perform the various mappings we support, namely
- gene symbol to gene identifier within one authority
- identifier to identifier across authorities
- identifier to identifier across species within one authority

The functions return a dict containing all of the relevant mappings, including 1:N mappings (it is left to the user to decide what to do in those cases) along with the metadata necessary to understand the provenance of the mapping.
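
Since 1:N mappings are left to the user, a common first step is to sort the inputs by how many matches each received. Below is a minimal sketch, assuming the returned dict has a `mapping` key whose values are lists (as in the outputs printed in this notebook). `summarize_mapping` is a hypothetical helper, and the example result is hand-written, including a made-up 1:N entry.

```python
def summarize_mapping(result):
    """Split the keys of result['mapping'] by how many matches each has."""
    summary = {"unmapped": [], "one_to_one": [], "one_to_many": []}
    for key, matches in result["mapping"].items():
        if len(matches) == 0:
            summary["unmapped"].append(key)
        elif len(matches) == 1:
            summary["one_to_one"].append(key)
        else:
            summary["one_to_many"].append(key)
    return summary


# Hand-written stand-in for a mapper result; the 1:N entry is made up
example_result = {
    "metadata": {},
    "mapping": {
        "Xkr4": ["ENSMUSG00000051951"],
        "not_a_symbol": [],
        "made_up_symbol": ["made_up_id_a", "made_up_id_b"],
    },
}
summary = summarize_mapping(example_result)
print(summary)
```

A downstream pipeline could then keep only the `one_to_one` entries, or apply its own rule for resolving the `one_to_many` ones.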

#### Mapping symbols to identifiers

For instance, here is a function call mapping gene symbols to gene identifiers in ENSEMBL

In [14]:
mouse_symbols = ["Xkr4", "Npbwr1", "not_a_symbol", "Rrs1"]
mouse_ens_mapping = gene_mapper.identifiers_from_symbols(
    gene_symbol_list=mouse_symbols,
    species_name="Mus musculus",
    authority_name="ENSEMBL"
)
print(json.dumps(mouse_ens_mapping, indent=2))

{
  "metadata": {
    "authority": {
      "name": "ENSEMBL",
      "idx": 1
    },
    "citation": {
      "name": "ENSEMBL-10090-98",
      "metadata": {
        "biolink:OrganismTaxon": {
          "id": "NCBITaxon:10090",
          "iri": "http://purl.obolibrary.org/obo/NCBITaxon_10090",
          "category": [
            "biolink:OrganismTaxon"
          ],
          "name": "house mouse",
          "full_name": "Mus musculus"
        },
        "bican:GenomeAssembly": {
          "id": "NCBIAssembly:GCF_000001635.26",
          "category": [
            "bican:GenomeAssembly"
          ],
          "name": "GRCm38",
          "in_taxon": [
            "NCBITaxon:10090"
          ],
          "in_taxon_label": "Mus musculus",
          "version": "26"
        },
        "bican:GenomeAnnotation": {
          "id": "bican:annotation-ENSEMBL-10090-98",
          "category": [
            "bican:GenomeAnnotation"
          ],
          "description": "ENSEMBL Mus musculus Annotation 

For ease of reading, here is the output broken down into metadata

In [15]:
print(json.dumps(mouse_ens_mapping['metadata'], indent=2))

{
  "authority": {
    "name": "ENSEMBL",
    "idx": 1
  },
  "citation": {
    "name": "ENSEMBL-10090-98",
    "metadata": {
      "biolink:OrganismTaxon": {
        "id": "NCBITaxon:10090",
        "iri": "http://purl.obolibrary.org/obo/NCBITaxon_10090",
        "category": [
          "biolink:OrganismTaxon"
        ],
        "name": "house mouse",
        "full_name": "Mus musculus"
      },
      "bican:GenomeAssembly": {
        "id": "NCBIAssembly:GCF_000001635.26",
        "category": [
          "bican:GenomeAssembly"
        ],
        "name": "GRCm38",
        "in_taxon": [
          "NCBITaxon:10090"
        ],
        "in_taxon_label": "Mus musculus",
        "version": "26"
      },
      "bican:GenomeAnnotation": {
        "id": "bican:annotation-ENSEMBL-10090-98",
        "category": [
          "bican:GenomeAnnotation"
        ],
        "description": "ENSEMBL Mus musculus Annotation Release 98",
        "in_taxon": [
          "NCBITaxon:10090"
        ],
        "i

and actual mapping

In [16]:
print(json.dumps(mouse_ens_mapping['mapping'], indent=2))

{
  "Xkr4": [
    "ENSMUSG00000051951"
  ],
  "Npbwr1": [
    "ENSMUSG00000033774"
  ],
  "not_a_symbol": [],
  "Rrs1": [
    "ENSMUSG00000061024"
  ]
}


Here is a similar call mapping symbols to identifiers in NCBI

In [17]:
mouse_ncbi_mapping = gene_mapper.identifiers_from_symbols(
    gene_symbol_list=mouse_symbols,
    species_name="Mus musculus",
    authority_name="NCBI"
)
print(json.dumps(mouse_ncbi_mapping, indent=2))

{
  "metadata": {
    "authority": {
      "name": "NCBI",
      "idx": 0
    },
    "citation": {
      "name": "NCBI",
      "metadata": {
        "gene_info.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_info.gz",
          "hash": "md5:47e8cd29e4c0b95a2e86cf78582316c5",
          "downloaded_on": "2025-04-07-10-24-52"
        },
        "gene2ensembl.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene2ensembl.gz",
          "hash": "md5:4febdfd55f0b42aba4e0253a8e258416",
          "downloaded_on": "2025-04-07-10-27-04"
        },
        "gene_orthologs.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_orthologs.gz",
          "hash": "md5:6152a853d48a76299909b711d19756a7",
          "downloaded_on": "2025-04-07-10-27-27"
        }
      },
      "idx": 0
    }
  },
  "mapping": {
    "Xkr4": [
      "NCBIGene:10090"
    ],
    "Npbwr1": [
      "NCBIGene:10090"
    ],
  

#### Mapping from ENSEMBL to NCBI

Below is a function call mapping ENSEMBL IDs to NCBI IDs using the gene equivalences downloaded from the NCBI FTP server. Note that all of the mappings map one ENSEMBL ID to two NCBI IDs.

In [18]:
ens_to_ncbi = gene_mapper.equivalent_genes(
    input_authority='ENSEMBL',
    output_authority='NCBI',
    gene_list=["ENSMUSG00000030337", "ENSMUSG00000037747", "nope", "ENSMUSG00000021983"],
    species_name="Mus musculus",
    citation_name="NCBI"
)
print(json.dumps(ens_to_ncbi, indent=2))

{
  "metadata": {
    "key_authority": "ENSEMBL",
    "value_authority": "NCBI",
    "citation": {
      "name": "NCBI",
      "idx": 0,
      "metadata": {
        "gene_info.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_info.gz",
          "hash": "md5:47e8cd29e4c0b95a2e86cf78582316c5",
          "downloaded_on": "2025-04-07-10-24-52"
        },
        "gene2ensembl.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene2ensembl.gz",
          "hash": "md5:4febdfd55f0b42aba4e0253a8e258416",
          "downloaded_on": "2025-04-07-10-27-04"
        },
        "gene_orthologs.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_orthologs.gz",
          "hash": "md5:6152a853d48a76299909b711d19756a7",
          "downloaded_on": "2025-04-07-10-27-27"
        }
      }
    }
  },
  "mapping": {
    "ENSMUSG00000021983": [
      "NCBIGene:10090",
      "NCBIGene:10090"
    ],
    "ENSMU

Here is a call mapping from NCBI to ENSEMBL

In [19]:
ncbi_to_ens = gene_mapper.equivalent_genes(
    input_authority='NCBI',
    output_authority='ENSEMBL',
    gene_list=["NCBIGene:67269", "NCBIGene:54200", "nope", "NCBIGene:70911"],
    species_name="Mus musculus",
    citation_name="NCBI"
)
print(json.dumps(ncbi_to_ens, indent=2))

{
  "metadata": {
    "key_authority": "NCBI",
    "value_authority": "ENSEMBL",
    "citation": {
      "name": "NCBI",
      "idx": 0,
      "metadata": {
        "gene_info.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_info.gz",
          "hash": "md5:47e8cd29e4c0b95a2e86cf78582316c5",
          "downloaded_on": "2025-04-07-10-24-52"
        },
        "gene2ensembl.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene2ensembl.gz",
          "hash": "md5:4febdfd55f0b42aba4e0253a8e258416",
          "downloaded_on": "2025-04-07-10-27-04"
        },
        "gene_orthologs.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_orthologs.gz",
          "hash": "md5:6152a853d48a76299909b711d19756a7",
          "downloaded_on": "2025-04-07-10-27-27"
        }
      }
    }
  },
  "mapping": {
    "NCBIGene:54200": [],
    "NCBIGene:67269": [],
    "nope": [],
    "NCBIGene:70911": []

To simulate "production scale," we will now download all of the genes in the Yao et al. 2023 10X gene panel and try mapping them from ENSEMBL to NCBI to see how long it takes to process that many genes.

In [20]:
wmb_gene_df = abc_cache.get_metadata_dataframe(
    directory='WMB-10X',
    file_name='gene'
)
wmb_genes = wmb_gene_df.gene_identifier.values
t0 = time.time()
wmb_ens_to_ncbi = gene_mapper.equivalent_genes(
    input_authority='ENSEMBL',
    output_authority='NCBI',
    gene_list=wmb_genes,
    species_name='Mus musculus',
    citation_name='NCBI'
)
dur = time.time()-t0
print(f'mapping {len(wmb_genes)} genes took {dur:.2e} seconds')

n_mapped = 0
n_one_to_one = 0
for k in wmb_genes:
    if len(wmb_ens_to_ncbi['mapping'][k]) > 0:
        n_mapped += 1
    if len(wmb_ens_to_ncbi['mapping'][k]) == 1:
        n_one_to_one += 1
print(f'{n_mapped} genes had NCBI matches')
print(f'{n_one_to_one} genes had exactly one NCBI match')

mapping 32285 genes took 1.84e+01 seconds
24656 genes had NCBI matches
24527 genes had exactly one NCBI match


Now we will download the marker genes used by MapMyCells for the Whole Mouse Brain taxonomy and try mapping them from ENSEMBL to NCBI.

In [21]:
src_path = "https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/mapmycells/WMB-10X/20240831/mouse_markers_230821.json"
wmb_marker_path = pathlib.Path('../data/mouse_markers_230821.json')
if not wmb_marker_path.is_file():
    print(f"downloading {wmb_marker_path}")
    process = subprocess.Popen(
        args=[
           "wget",
            src_path,
            "-O",
            str(wmb_marker_path),
        ],
        stderr=subprocess.DEVNULL
    )
    process.wait()

In [22]:
marker_genes = set()
with open(wmb_marker_path, 'rb') as src:
    marker_lookup = json.load(src)
for key in marker_lookup:
    if key in ('log', 'metadata'):
        continue
    marker_genes = marker_genes.union(set(marker_lookup[key]))
marker_genes = sorted(marker_genes)

In [23]:
print(f'{len(marker_genes)} marker genes')
t0 = time.time()
marker_ens_to_ncbi = gene_mapper.equivalent_genes(
    input_authority='ENSEMBL',
    output_authority='NCBI',
    gene_list=marker_genes,
    species_name='Mus musculus',
    citation_name='NCBI'
)
dur = time.time()-t0
print(f'mapping {len(marker_genes)} genes took {dur:.2e} seconds')

n_mapped = 0
n_one_to_one = 0
for k in marker_genes:
    if len(marker_ens_to_ncbi['mapping'][k]) > 0:
        n_mapped += 1
    if len(marker_ens_to_ncbi['mapping'][k]) == 1:
        n_one_to_one += 1
print(f'{n_mapped} genes had NCBI matches')
print(f'{n_one_to_one} genes had exactly one NCBI match')

6558 marker genes
mapping 6558 genes took 1.30e+00 seconds
6525 genes had NCBI matches
6493 genes had exactly one NCBI match


#### Ortholog mapping

To test ortholog mapping, let's take the genes from Yao et al. 2023 that only mapped to one NCBI gene and map them to human orthologs.

In [24]:
wmb_ncbi_genes = [
    wmb_ens_to_ncbi['mapping'][ens][0]
    for ens in wmb_genes
    if len(wmb_ens_to_ncbi['mapping'][ens]) == 1
]

First let's just map 5 genes so that we can display the output and see what it looks like.

In [25]:
test_orthologs = gene_mapper.ortholog_genes(
    authority='NCBI',
    src_species_name='Mus musculus',
    dst_species_name='human',
    gene_list=wmb_ncbi_genes[:5] + ['not_an_actual_gene'],
    citation_name='NCBI'
)
print(json.dumps(test_orthologs, indent=2))

{
  "metadata": {
    "authority": "NCBI",
    "citation": {
      "gene_info.gz": {
        "host": "ftp.ncbi.nlm.nih.gov",
        "src_path": "gene/DATA/gene_info.gz",
        "hash": "md5:47e8cd29e4c0b95a2e86cf78582316c5",
        "downloaded_on": "2025-04-07-10-24-52"
      },
      "gene2ensembl.gz": {
        "host": "ftp.ncbi.nlm.nih.gov",
        "src_path": "gene/DATA/gene2ensembl.gz",
        "hash": "md5:4febdfd55f0b42aba4e0253a8e258416",
        "downloaded_on": "2025-04-07-10-27-04"
      },
      "gene_orthologs.gz": {
        "host": "ftp.ncbi.nlm.nih.gov",
        "src_path": "gene/DATA/gene_orthologs.gz",
        "hash": "md5:6152a853d48a76299909b711d19756a7",
        "downloaded_on": "2025-04-07-10-27-27"
      }
    },
    "src_species_taxon": 10090,
    "dst_species_taxon": 9606
  },
  "mapping": {
    "NCBIGene:10090": [
      "NCBIGene:9606"
    ],
    "not_an_actual_gene": []
  }
}


Now let's map all 24,527 genes and see how long that takes.

In [26]:
t0 = time.time()
wmb_to_whb_orthologs = gene_mapper.ortholog_genes(
    authority='NCBI',
    src_species_name='Mus musculus',
    dst_species_name='human',
    gene_list=wmb_ncbi_genes,
    citation_name='NCBI'
)
dur = time.time()-t0
print(f'mapping {len(wmb_ncbi_genes)} orthologs took {dur:.2e} seconds')
n_mapped = 0
n_one_to_one = 0
for k in wmb_ncbi_genes:
    if len(wmb_to_whb_orthologs['mapping'][k]) > 0:
        n_mapped += 1
    if len(wmb_to_whb_orthologs['mapping'][k]) == 1:
        n_one_to_one += 1
print(f'{n_mapped} genes had orthologs')
print(f'{n_one_to_one} genes had exactly one ortholog')

mapping 24527 orthologs took 7.72e+00 seconds
24527 genes had orthologs
24527 genes had exactly one ortholog


To map orthologs using the "by-hand" lookup table created by our science team, change the kwarg to `citation_name='HMBA'`.

In [27]:
t0 = time.time()
wmb_to_whb_orthologs_hmba = gene_mapper.ortholog_genes(
    authority='NCBI',
    src_species_name='Mus musculus',
    dst_species_name='human',
    gene_list=wmb_ncbi_genes,
    citation_name='HMBA'
)
dur = time.time()-t0
print(f'mapping {len(wmb_ncbi_genes)} orthologs took {dur:.2e} seconds')
n_mapped = 0
n_one_to_one = 0
for k in wmb_ncbi_genes:
    if len(wmb_to_whb_orthologs_hmba['mapping'][k]) > 0:
        n_mapped += 1
    if len(wmb_to_whb_orthologs_hmba['mapping'][k]) == 1:
        n_one_to_one += 1
print(f'{n_mapped} genes had orthologs')
print(f'{n_one_to_one} genes had exactly one ortholog')

mapping 24527 orthologs took 7.56e+00 seconds
24527 genes had orthologs
24527 genes had exactly one ortholog
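
Identical counts do not by themselves show that the two citations agree gene by gene. Below is a minimal sketch of a per-gene comparison between two `mapping` dicts. `compare_mappings` is a hypothetical helper, and the inputs are made-up placeholders rather than real mapper output.

```python
def compare_mappings(mapping_a, mapping_b):
    """Count shared keys on which two mapping dicts agree or disagree."""
    shared = set(mapping_a) & set(mapping_b)
    agree = sum(
        1 for key in shared
        if sorted(mapping_a[key]) == sorted(mapping_b[key])
    )
    return {
        "shared": len(shared),
        "agree": agree,
        "disagree": len(shared) - agree,
    }


# Made-up placeholder mappings standing in for two citations' results
toy_a = {"gene_1": ["ortholog_x"], "gene_2": ["ortholog_y"], "gene_3": []}
toy_b = {"gene_1": ["ortholog_x"], "gene_2": ["ortholog_z"], "gene_3": []}
comparison = compare_mappings(toy_a, toy_b)
print(comparison)
```

With real results, the inputs would be `wmb_to_whb_orthologs['mapping']` and `wmb_to_whb_orthologs_hmba['mapping']`.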
