The purpose of this notebook is to demonstrate the prototype gene mapper library put together to generalize the processes of

- Mapping gene symbols to gene identifiers
- Mapping gene identifiers across authorities (NCBI, ENSEMBL, etc.)
- Mapping genes in one species to ortholog genes in another species

for MapMyCells.

In addition to the `mmc_gene_mapper` codebase and its dependencies, this notebook depends on the `abc_atlas_access` tool, though only for downloading metadata to simulate mapping tens of thousands of genes at once (as we would with real user data).

In [None]:
import h5py
import json
import pathlib
import subprocess
import sqlite3
import time

import abc_atlas_access.abc_atlas_cache.abc_project_cache as abc_cache_library

import mmc_gene_mapper
import mmc_gene_mapper.utils.file_utils as file_utils
import mmc_gene_mapper.mapper.mapper as mapper

In [2]:
abc_cache = abc_cache_library.AbcProjectCache.from_cache_dir('../data/abc_cache')

In [3]:
data_dir = pathlib.Path(
    mmc_gene_mapper.__file__).parent.parent.parent / "data"
assert data_dir.is_dir()

In [4]:
db_path = data_dir/"gene_mapper_example.db"
assert db_path.parent.is_dir()

There are some files in
```
/allen/scratch/aibstemp/danielsf/gene_mapper_data
```
that are used to ingest data beyond what is available on the NCBI FTP server. Please copy those files to `../data/db_creation_data` in this repository (the cell below will fail if any of the files are not present).

In [5]:
data_file_spec=[
    {"type": "bkbit",
     "path": "../data/db_creation_data/mouse_ENSEMBL-10090-98.jsonld",
     "name": "bican:Mouse"},
    {"type": "bkbit",
     "path": "../data/db_creation_data/human_ENSEMBL-9606-101.jsonld",
     "name": "bican:Human"},
    {"type": "hmba_orthologs",
     "path": "../data/db_creation_data/all_gene_ids.csv",
     "name": "HMBA",
     "baseline_species": "human"}
]

error_msg = ""
for spec in data_file_spec:
    pth = pathlib.Path(spec['path'])
    if not pth.is_file():
        error_msg += f"{pth} is not a file\n"
if len(error_msg) > 0:
    raise RuntimeError(error_msg)

Below we instantiate the class that is used to actually do the mapping. The first time you run the cell, it will take ~ 15 minutes as it downloads ~ 4 GB of data from NCBI and ingests that data, alongside the local data from the cell above, to create an ~ 8 GB sqlite database that backs the mapping process.

As long as you leave `clobber=False` below, you will not have to recreate that database on subsequent uses.

In [6]:
gene_mapper = mapper.MMCGeneMapper(
    db_path=db_path,
    local_dir=data_dir,
    data_file_spec=data_file_spec,
    clobber=False,
    force_download=False
)

    chunk 0.00e+00, 5.00e+06
    chunk 5.00e+06, 1.00e+07
    chunk 1.00e+07, 1.50e+07
    chunk 1.50e+07, 2.00e+07
    chunk 2.00e+07, 2.50e+07
    chunk 2.50e+07, 3.00e+07
    chunk 3.00e+07, 3.50e+07
    chunk 3.50e+07, 4.00e+07
    chunk 4.00e+07, 4.50e+07
    chunk 4.50e+07, 5.00e+07
    chunk 5.00e+07, 5.50e+07
    chunk 5.50e+07, 6.00e+07
    INGESTING 55421 GENES
    INGESTING 60671 GENES
    GOT SPECIES MAP


The database backing the `gene_mapper` contains the following tables

### Metadata

#### authority
The `authority` table just tracks the gene identifying authorities the database knows about (i.e. ENSEMBL and NCBI). It consists of the columns

- id -- an integer used for internal indexing
- name -- the human-readable name of the authority

#### citation
The `citation` table tracks all of the datasets used to justify the various mappings (symbol-to-identifier, ENSEMBL-to-NCBI, orthologs, etc.). It consists of the columns

- id -- an integer used for internal indexing
- name -- a short, human-readable name for the citation (this will be used whenever you have to specify a citation when calling the `gene_mapper`'s methods)
- metadata -- a JSON serialized dict containing whatever information is necessary to recreate or specify the data backing this citation. This is a very free-form column as different entries will be specified in different ways (see below)
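
To make the layout of these two tables concrete, here is a minimal sketch using an in-memory sqlite database. The `CREATE TABLE` statements and column types are assumptions made for illustration; they are not taken from the `mmc_gene_mapper` source, and the real database may be defined differently.

```python
import json
import sqlite3

# Assumed schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authority (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE citation (id INTEGER, name TEXT, metadata TEXT)")

conn.executemany(
    "INSERT INTO authority (id, name) VALUES (?, ?)",
    [(0, "NCBI"), (1, "ENSEMBL")],
)

# The metadata column is free-form, so it is stored as a JSON string.
ncbi_metadata = {
    "gene_info.gz": {
        "host": "ftp.ncbi.nlm.nih.gov",
        "src_path": "gene/DATA/gene_info.gz",
    }
}
conn.execute(
    "INSERT INTO citation (id, name, metadata) VALUES (?, ?, ?)",
    (0, "NCBI", json.dumps(ncbi_metadata)),
)

authority_names = [
    row[0] for row in conn.execute("SELECT name FROM authority ORDER BY id")
]
(raw_metadata,) = conn.execute(
    "SELECT metadata FROM citation WHERE name = ?", ("NCBI",)
).fetchone()
citation_metadata = json.loads(raw_metadata)
print(authority_names)
print(json.dumps(citation_metadata, indent=2))
```

Storing `metadata` as serialized JSON keeps the table schema fixed even though each citation describes its provenance differently.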

We can list the available authorities like so

In [7]:
authority_list = gene_mapper.get_all_authorities()
print(authority_list)

['NCBI', 'ENSEMBL']


Similarly, we can list the available citations.

In [8]:
citation_list = gene_mapper.get_all_citations()
print([c['name'] for c in citation_list])

['NCBI', 'bican:Mouse', 'bican:Human', 'HMBA']


As mentioned above, different citations contain different metadata. For instance, the `NCBI` citation (referring to the data downloaded from the NCBI FTP server) contains a list of the files downloaded, their hashes, and the dates they were downloaded on. (NCBI updates its FTP server daily; it is not clear that they keep readily available version information around, though we can certainly investigate that.)

In [26]:
for citation in citation_list:
    if citation['name'] == 'NCBI':
        print(json.dumps(citation, indent=2))

{
  "name": "NCBI",
  "metadata": {
    "gene_info.gz": {
      "host": "ftp.ncbi.nlm.nih.gov",
      "src_path": "gene/DATA/gene_info.gz",
      "hash": "md5:d15a97631494d3f0cba0dff10595a29b",
      "downloaded_on": "2025-04-03-11-34-02"
    },
    "gene2ensembl.gz": {
      "host": "ftp.ncbi.nlm.nih.gov",
      "src_path": "gene/DATA/gene2ensembl.gz",
      "hash": "md5:3cd9020c4bcd7fc7307b1dcf8f4c8dea",
      "downloaded_on": "2025-04-03-11-35-23"
    },
    "gene_orthologs.gz": {
      "host": "ftp.ncbi.nlm.nih.gov",
      "src_path": "gene/DATA/gene_orthologs.gz",
      "hash": "md5:9dbbeb0e454b3b3d03ec3b0d1755a4ce",
      "downloaded_on": "2025-04-03-11-35-54"
    }
  }
}
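
For readers wondering how `hash` and `downloaded_on` entries like those above could be produced, the following is a rough sketch using only the standard library. The helper names `file_fingerprint` and `format_download_time` are hypothetical; this is not necessarily how `mmc_gene_mapper` computes these fields, only one way to arrive at values in the same format.

```python
import datetime
import hashlib
import pathlib
import tempfile


def file_fingerprint(path, chunk_size=1024 * 1024):
    """Return 'md5:<hexdigest>' for a file, reading it in chunks."""
    hasher = hashlib.md5()
    with open(path, "rb") as src:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            hasher.update(chunk)
    return f"md5:{hasher.hexdigest()}"


def format_download_time(when):
    """Format a datetime as YYYY-MM-DD-HH-MM-SS, as in the output above."""
    return when.strftime("%Y-%m-%d-%H-%M-%S")


# Demonstrate on a throwaway file rather than a real NCBI download.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = pathlib.Path(tmp_dir) / "example.gz"
    example_path.write_bytes(b"not really a gzip file")
    example_hash = file_fingerprint(example_path)

example_time = format_download_time(datetime.datetime(2025, 4, 3, 11, 34, 2))
print(example_hash)
print(example_time)
```

Because NCBI updates its files daily, recording a content hash alongside the download time gives a way to tell whether two databases were built from the same inputs.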


The citations based on BICAN bkbit files contain all of the GenomeAnnotation and GenomeAssembly information in those files.

In [27]:
for citation in citation_list:
    if citation['name'] == 'bican:Mouse':
        print(json.dumps(citation, indent=2))

{
  "name": "bican:Mouse",
  "metadata": {
    "biolink:OrganismTaxon": {
      "id": "NCBITaxon:10090",
      "iri": "http://purl.obolibrary.org/obo/NCBITaxon_10090",
      "category": [
        "biolink:OrganismTaxon"
      ],
      "name": "house mouse",
      "full_name": "Mus musculus"
    },
    "bican:GenomeAssembly": {
      "id": "NCBIAssembly:GCF_000001635.26",
      "category": [
        "bican:GenomeAssembly"
      ],
      "name": "GRCm38",
      "in_taxon": [
        "NCBITaxon:10090"
      ],
      "in_taxon_label": "Mus musculus",
      "version": "26"
    },
    "bican:GenomeAnnotation": {
      "id": "bican:annotation-ENSEMBL-10090-98",
      "category": [
        "bican:GenomeAnnotation"
      ],
      "description": "ENSEMBL Mus musculus Annotation Release 98",
      "in_taxon": [
        "NCBITaxon:10090"
      ],
      "in_taxon_label": "Mus musculus",
      "version": "98",
      "digest": [
        "urn:uuid:58618e83-7b4c-46ed-b300-9df8abf1f5d0"
      ],
      "

### Data

The data tables in the database are

#### species
The `species` table just ingests the NCBI organism taxonomy and records
- id -- the NCBI taxon ID of the species
- name -- the human-readable name of the species. **Note:** we record all possible names ("common", "scientific", etc.) of the species in different rows. That way, users can give us whatever name they have for a species and, if possible, we can return an `id` for cross-referencing with other tables.
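
The one-row-per-name layout can be sketched as follows. The schema and the helper `taxon_from_name` are assumptions for illustration, not the library's actual implementation.

```python
import sqlite3

# Assumed schema: one row per (taxon ID, name) combination.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE species (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO species (id, name) VALUES (?, ?)",
    [
        (10090, "Mus musculus"),
        (10090, "house mouse"),
        (9606, "Homo sapiens"),
        (9606, "human"),
    ],
)


def taxon_from_name(conn, species_name):
    """Return the taxon ID for a name, or None if it is unknown or ambiguous."""
    rows = conn.execute(
        "SELECT DISTINCT id FROM species WHERE name = ?", (species_name,)
    ).fetchall()
    if len(rows) != 1:
        return None
    return rows[0][0]


print(taxon_from_name(conn, "house mouse"))
print(taxon_from_name(conn, "Mus musculus"))
print(taxon_from_name(conn, "not a species"))
```

Both the common and the scientific name resolve to the same `id`, which is what lets later cells pass either `"Mus musculus"` or `"human"` as a species name.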

#### gene
The `gene` table contains all of the information about individual genes ingested into the database. Columns are
- authority -- an integer for cross-referencing the `authority` table
- citation -- an integer for cross-referencing the `citation` table
- species_taxon -- an integer for cross-referencing the `species` table
- id -- an integer for internal indexing
- identifier -- a string. The full unique identifier (e.g. ENSMUSG00000030337) of the gene
- symbol -- a string. The human-readable name of the gene (**Note:** the BICAN bkbit files contain `symbol` and `name` entries for genes. If these differ, each corresponds to a different row in the `gene` table with `name` being recorded in `symbol` as appropriate).
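
The note about `symbol` and `name` can be illustrated with a small sketch. The function `gene_rows` and the example record are hypothetical; the real ingestion code may handle these fields differently.

```python
def gene_rows(identifier, symbol, name):
    """
    Return the rows to record for one gene: one row for its symbol and,
    if the name differs, a second row with the name stored in 'symbol'.
    """
    labels = [symbol]
    if name != symbol:
        labels.append(name)
    return [{"identifier": identifier, "symbol": label} for label in labels]


# Made-up record, for illustration only.
differing = gene_rows("EXAMPLE:0001", symbol="Abc1", name="example gene 1")
matching = gene_rows("EXAMPLE:0002", symbol="Abc2", name="Abc2")
print(differing)
print(matching)
```

Recording both labels means a lookup by either the symbol or the longer name lands on the same `identifier`.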

#### gene_equivalence
The `gene_equivalence` table records mappings between authorities (ENSEMBL and NCBI). Its columns are
- species_taxon -- an integer for cross-referencing with the `species` table (in what species are we looking for equivalences?)
- citation -- an integer for cross-referencing with the `citation` table (who said these genes were equivalent?)
- authority0 -- an integer for cross-referencing with the `authority` table
- gene0 -- an integer; the ID of the gene in `authority0`
- authority1 -- an integer for cross-referencing with the `authority` table
- gene1 -- an integer; the ID of the gene in `authority1` that is equivalent to `gene0` according to `citation`.

**Note:** we record equivalences symmetrically: every `(gene0, gene1)` pair is also recorded as `(gene1, gene0)`, so users do not have to worry about which index (`gene0` or `gene1`) their input data is compared to.
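
A minimal sketch of this symmetric storage, assuming a simple sqlite schema with the columns listed above. The helpers `insert_equivalence` and `equivalent_ids`, and the integer gene IDs, are made up for illustration.

```python
import sqlite3

# Assumed schema mirroring the columns described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gene_equivalence (
        species_taxon INTEGER,
        citation INTEGER,
        authority0 INTEGER,
        gene0 INTEGER,
        authority1 INTEGER,
        gene1 INTEGER
    )
    """
)


def insert_equivalence(conn, species_taxon, citation,
                       authority0, gene0, authority1, gene1):
    """Record one equivalence in both directions."""
    conn.executemany(
        "INSERT INTO gene_equivalence VALUES (?, ?, ?, ?, ?, ?)",
        [
            (species_taxon, citation, authority0, gene0, authority1, gene1),
            (species_taxon, citation, authority1, gene1, authority0, gene0),
        ],
    )


def equivalent_ids(conn, authority0, gene0, authority1):
    """Look up equivalents by always querying on the gene0 column."""
    rows = conn.execute(
        "SELECT gene1 FROM gene_equivalence "
        "WHERE authority0 = ? AND gene0 = ? AND authority1 = ? "
        "ORDER BY gene1",
        (authority0, gene0, authority1),
    )
    return [row[0] for row in rows]


NCBI, ENSEMBL = 0, 1
# Hypothetical IDs: ENSEMBL gene 501 is equivalent to NCBI genes 11 and 12.
insert_equivalence(conn, 10090, 0, ENSEMBL, 501, NCBI, 11)
insert_equivalence(conn, 10090, 0, ENSEMBL, 501, NCBI, 12)

print(equivalent_ids(conn, ENSEMBL, 501, NCBI))
print(equivalent_ids(conn, NCBI, 11, ENSEMBL))
```

The cost is twice as many rows, but every query has the same shape regardless of which authority the input genes come from.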

#### gene_ortholog
The `gene_ortholog` table records cross-species ortholog relationships. Its columns are
- authority -- an integer for cross-referencing the `authority` table (are we looking for NCBI orthologs or ENSEMBL orthologs?)
- citation -- an integer for cross-referencing the `citation` table (who said these genes were orthologs?)
- species0 -- an integer for cross-referencing the `species` table
- gene0 -- an integer, the ID of the gene in `species0` whose orthologs we are looking for
- species1 -- an integer, the index in `species` of the other species we are considering
- gene1 -- an integer, the index of the gene in `species1` that is considered an ortholog of `gene0`

**Note:** we ingest all `[(species0, gene0), (species1, gene1)]` pairs twice, permuting the order so that users can search for orthologs in both directions (`species0 -> species1` and `species1 -> species0`) easily.
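
The permutation step can be sketched in plain Python. The functions `ortholog_rows` and `find_orthologs` are hypothetical, and using the numeric part of the NCBI gene identifier as the integer gene ID is an assumption made to keep the example short.

```python
from itertools import permutations


def ortholog_rows(pair):
    """Return both orderings of one ((species, gene), (species, gene)) pair."""
    return [
        (species0, gene0, species1, gene1)
        for (species0, gene0), (species1, gene1) in permutations(pair, 2)
    ]


def find_orthologs(rows, src_species, gene, dst_species):
    """Search rows in the src -> dst direction only."""
    return sorted(
        gene1
        for (species0, gene0, species1, gene1) in rows
        if species0 == src_species and gene0 == gene and species1 == dst_species
    )


MOUSE, HUMAN = 10090, 9606
# One mouse/human ortholog pair, ingested in both orders.
rows = ortholog_rows(((MOUSE, 19888), (HUMAN, 6101)))

print(find_orthologs(rows, MOUSE, 19888, HUMAN))
print(find_orthologs(rows, HUMAN, 6101, MOUSE))
```

Because each pair is stored in both orders, the same one-directional search answers mouse-to-human and human-to-mouse questions.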


In [25]:
mouse_symbols = ["Xkr4", "Npbwr1", "not_a_symbol", "Rrs1"]
mouse_ens_mapping = gene_mapper.identifiers_from_symbols(
    gene_symbol_list=mouse_symbols,
    species_name="Mus musculus",
    authority_name="ENSEMBL"
)
print(json.dumps(mouse_ens_mapping, indent=2))

{
  "metadata": {
    "authority": {
      "name": "ENSEMBL",
      "idx": 1
    },
    "citation": {
      "name": "bican:Mouse",
      "metadata": {
        "biolink:OrganismTaxon": {
          "id": "NCBITaxon:10090",
          "iri": "http://purl.obolibrary.org/obo/NCBITaxon_10090",
          "category": [
            "biolink:OrganismTaxon"
          ],
          "name": "house mouse",
          "full_name": "Mus musculus"
        },
        "bican:GenomeAssembly": {
          "id": "NCBIAssembly:GCF_000001635.26",
          "category": [
            "bican:GenomeAssembly"
          ],
          "name": "GRCm38",
          "in_taxon": [
            "NCBITaxon:10090"
          ],
          "in_taxon_label": "Mus musculus",
          "version": "26"
        },
        "bican:GenomeAnnotation": {
          "id": "bican:annotation-ENSEMBL-10090-98",
          "category": [
            "bican:GenomeAnnotation"
          ],
          "description": "ENSEMBL Mus musculus Annotation Relea

In [11]:
mouse_ncbi_mapping = gene_mapper.identifiers_from_symbols(
    gene_symbol_list=mouse_symbols,
    species_name="Mus musculus",
    authority_name="NCBI"
)
print(json.dumps(mouse_ncbi_mapping, indent=2))

{
  "metadata": {
    "authority": {
      "name": "NCBI",
      "idx": 0
    },
    "citation": {
      "name": "NCBI",
      "metadata": {
        "gene_info.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_info.gz",
          "hash": "md5:d15a97631494d3f0cba0dff10595a29b",
          "downloaded_on": "2025-04-03-11-34-02"
        },
        "gene2ensembl.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene2ensembl.gz",
          "hash": "md5:3cd9020c4bcd7fc7307b1dcf8f4c8dea",
          "downloaded_on": "2025-04-03-11-35-23"
        },
        "gene_orthologs.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_orthologs.gz",
          "hash": "md5:9dbbeb0e454b3b3d03ec3b0d1755a4ce",
          "downloaded_on": "2025-04-03-11-35-54"
        }
      },
      "idx": 0
    }
  },
  "mapping": {
    "Xkr4": [
      "NCBIGene:497097"
    ],
    "Npbwr1": [
      "NCBIGene:226304"
    ],


In [12]:
ens_to_ncbi = gene_mapper.equivalent_genes(
    input_authority='ENSEMBL',
    output_authority='NCBI',
    gene_list=["ENSMUSG00000030337", "ENSMUSG00000037747", "nope", "ENSMUSG00000021983"],
    species_name="Mus musculus",
    citation_name="NCBI"
)
print(json.dumps(ens_to_ncbi, indent=2))

{
  "metadata": {
    "key_authority": "ENSEMBL",
    "value_authority": "NCBI",
    "citation": {
      "name": "NCBI",
      "idx": 0,
      "metadata": {
        "gene_info.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_info.gz",
          "hash": "md5:d15a97631494d3f0cba0dff10595a29b",
          "downloaded_on": "2025-04-03-11-34-02"
        },
        "gene2ensembl.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene2ensembl.gz",
          "hash": "md5:3cd9020c4bcd7fc7307b1dcf8f4c8dea",
          "downloaded_on": "2025-04-03-11-35-23"
        },
        "gene_orthologs.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_orthologs.gz",
          "hash": "md5:9dbbeb0e454b3b3d03ec3b0d1755a4ce",
          "downloaded_on": "2025-04-03-11-35-54"
        }
      }
    }
  },
  "mapping": {
    "ENSMUSG00000021983": [
      "NCBIGene:50769",
      "NCBIGene:108168164"
    ],
    "E

In [13]:
ncbi_to_ens = gene_mapper.equivalent_genes(
    input_authority='NCBI',
    output_authority='ENSEMBL',
    gene_list=["NCBIGene:67269", "NCBIGene:54200", "nope", "NCBIGene:70911"],
    species_name="Mus musculus",
    citation_name="NCBI"
)
print(json.dumps(ncbi_to_ens, indent=2))

{
  "metadata": {
    "key_authority": "NCBI",
    "value_authority": "ENSEMBL",
    "citation": {
      "name": "NCBI",
      "idx": 0,
      "metadata": {
        "gene_info.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_info.gz",
          "hash": "md5:d15a97631494d3f0cba0dff10595a29b",
          "downloaded_on": "2025-04-03-11-34-02"
        },
        "gene2ensembl.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene2ensembl.gz",
          "hash": "md5:3cd9020c4bcd7fc7307b1dcf8f4c8dea",
          "downloaded_on": "2025-04-03-11-35-23"
        },
        "gene_orthologs.gz": {
          "host": "ftp.ncbi.nlm.nih.gov",
          "src_path": "gene/DATA/gene_orthologs.gz",
          "hash": "md5:9dbbeb0e454b3b3d03ec3b0d1755a4ce",
          "downloaded_on": "2025-04-03-11-35-54"
        }
      }
    }
  },
  "mapping": {
    "NCBIGene:54200": [
      "ENSMUSG00000003271"
    ],
    "NCBIGene:67269": [
      "ENS

In [14]:
src_path = "https://allen-brain-cell-atlas.s3-us-west-2.amazonaws.com/mapmycells/WMB-10X/20240831/mouse_markers_230821.json"
wmb_marker_path = pathlib.Path('../data/mouse_markers_230821.json')
if not wmb_marker_path.is_file():
    print(f"downloading {wmb_marker_path}")
    process = subprocess.Popen(
        args=[
            "wget",
            src_path,
            "-O",
            str(wmb_marker_path),
        ],
        stderr=subprocess.DEVNULL
    )
    process.wait()

In [15]:
wmb_gene_df = abc_cache.get_metadata_dataframe(
    directory='WMB-10X',
    file_name='gene'
)
wmb_genes = wmb_gene_df.gene_identifier.values
t0 = time.time()
wmb_ens_to_ncbi = gene_mapper.equivalent_genes(
    input_authority='ENSEMBL',
    output_authority='NCBI',
    gene_list=wmb_genes,
    species_name='Mus musculus',
    citation_name='NCBI'
)
dur = time.time()-t0
print(f'mapping {len(wmb_genes)} genes took {dur:.2e} seconds')

n_mapped = 0
n_one_to_one = 0
for k in wmb_genes:
    if len(wmb_ens_to_ncbi['mapping'][k]) > 0:
        n_mapped += 1
    if len(wmb_ens_to_ncbi['mapping'][k]) == 1:
        n_one_to_one += 1
print(f'{n_mapped} genes had NCBI matches')
print(f'{n_one_to_one} genes had exactly one NCBI match')

mapping 32285 genes took 1.98e+01 seconds
24656 genes had NCBI matches
24527 genes had exactly one NCBI match
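
The counting loop in the cell above recurs in later cells, so it could be factored into a small helper. This is only a sketch: `summarize_mapping` is a hypothetical function, not part of `mmc_gene_mapper`, and it assumes (as the loop above does) that every input gene appears as a key in the mapping, with an empty list when nothing matched.

```python
def summarize_mapping(mapping, gene_list):
    """Count how many genes mapped at all and how many mapped one-to-one."""
    n_mapped = 0
    n_one_to_one = 0
    for gene in gene_list:
        n_matches = len(mapping[gene])
        if n_matches > 0:
            n_mapped += 1
        if n_matches == 1:
            n_one_to_one += 1
    return {
        "n_genes": len(gene_list),
        "n_mapped": n_mapped,
        "n_one_to_one": n_one_to_one,
    }


# Toy data standing in for the 'mapping' entry of a gene_mapper result.
toy_mapping = {
    "gene_a": ["x"],
    "gene_b": ["y", "z"],
    "gene_c": [],
}
toy_summary = summarize_mapping(toy_mapping, ["gene_a", "gene_b", "gene_c"])
print(toy_summary)
```

With a real result, the call would look like `summarize_mapping(wmb_ens_to_ncbi['mapping'], wmb_genes)`.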


In [16]:
marker_genes = set()
with open(wmb_marker_path, 'rb') as src:
    marker_lookup = json.load(src)
for key in marker_lookup:
    if key in ('log', 'metadata'):
        continue
    marker_genes = marker_genes.union(set(marker_lookup[key]))
marker_genes = sorted(marker_genes)

In [17]:
print(f'{len(marker_genes)} marker genes')
t0 = time.time()
marker_ens_to_ncbi = gene_mapper.equivalent_genes(
    input_authority='ENSEMBL',
    output_authority='NCBI',
    gene_list=marker_genes,
    species_name='Mus musculus',
    citation_name='NCBI'
)
dur = time.time()-t0
print(f'that took {dur:.2e} seconds')

n_mapped = 0
n_one_to_one = 0
for k in marker_genes:
    if len(marker_ens_to_ncbi['mapping'][k]) > 0:
        n_mapped += 1
    if len(marker_ens_to_ncbi['mapping'][k]) == 1:
        n_one_to_one += 1
print(f'{n_mapped} genes had NCBI matches')
print(f'{n_one_to_one} genes had exactly one NCBI match')

6558 marker genes
that took 1.32e+00 seconds
6525 genes had NCBI matches
6493 genes had exactly one NCBI match


In [18]:
wmb_ncbi_genes = [
    wmb_ens_to_ncbi['mapping'][ens][0]
    for ens in wmb_genes
    if len(wmb_ens_to_ncbi['mapping'][ens]) == 1
]

In [19]:
test_orthologs = gene_mapper.ortholog_genes(
    authority='NCBI',
    src_species_name='Mus musculus',
    dst_species_name='human',
    gene_list=wmb_ncbi_genes[:5] + ['not_an_actual_gene'],
    citation_name='NCBI'
)
print(json.dumps(test_orthologs, indent=2))

{
  "metadata": {
    "authority": "NCBI",
    "citation": {
      "gene_info.gz": {
        "host": "ftp.ncbi.nlm.nih.gov",
        "src_path": "gene/DATA/gene_info.gz",
        "hash": "md5:d15a97631494d3f0cba0dff10595a29b",
        "downloaded_on": "2025-04-03-11-34-02"
      },
      "gene2ensembl.gz": {
        "host": "ftp.ncbi.nlm.nih.gov",
        "src_path": "gene/DATA/gene2ensembl.gz",
        "hash": "md5:3cd9020c4bcd7fc7307b1dcf8f4c8dea",
        "downloaded_on": "2025-04-03-11-35-23"
      },
      "gene_orthologs.gz": {
        "host": "ftp.ncbi.nlm.nih.gov",
        "src_path": "gene/DATA/gene_orthologs.gz",
        "hash": "md5:9dbbeb0e454b3b3d03ec3b0d1755a4ce",
        "downloaded_on": "2025-04-03-11-35-54"
      }
    },
    "src_species_taxon": 10090,
    "dst_species_taxon": 9606
  },
  "mapping": {
    "NCBIGene:19888": [
      "NCBIGene:6101"
    ],
    "NCBIGene:20671": [
      "NCBIGene:64321"
    ],
    "NCBIGene:27395": [
      "NCBIGene:29088"
    ],
    "NCB

In [20]:
t0 = time.time()
wmb_to_whb_orthologs = gene_mapper.ortholog_genes(
    authority='NCBI',
    src_species_name='Mus musculus',
    dst_species_name='human',
    gene_list=wmb_ncbi_genes,
    citation_name='NCBI'
)
dur = time.time()-t0
print(f'mapping {len(wmb_ncbi_genes)} orthologs took {dur:.2e} seconds')
n_mapped = 0
n_one_to_one = 0
for k in wmb_ncbi_genes:
    if len(wmb_to_whb_orthologs['mapping'][k]) > 0:
        n_mapped += 1
    if len(wmb_to_whb_orthologs['mapping'][k]) == 1:
        n_one_to_one += 1
print(f'{n_mapped} genes had orthologs')
print(f'{n_one_to_one} genes had exactly one ortholog')

mapping 24527 orthologs took 8.52e+00 seconds
16508 genes had orthologs
16508 genes had exactly one ortholog
