# Use NCBI Datasets to Retrieve Information on Genes

The objective of this notebook is to use the `ncbi.datasets` Python library to download and extract gene data.

In this example, we will get information about the Gonadotropin Releasing Hormone Receptor (GNRHR) gene family, which plays a key role in sexual development and function across vertebrates.

This notebook will demonstrate the following features that are available through NCBI Datasets for retrieving gene data:  
1. Getting gene metadata as a JSON-formatted gene descriptor, by specifying either an NCBI Gene ID or gene symbol  
2. Downloading gene datasets, which include gene, transcript, and protein sequences, and a data report (gene metadata in a more human-readable YAML format)

Gonadotropin Releasing Hormone (GnRH) binds and activates the GnRH receptor (GnRHR) to stimulate secretion of follicle stimulating hormone (FSH) and luteinizing hormone (LH). These hormones in turn regulate reproductive development and function.

While the role of GnRH in reproduction is conserved across vertebrates, GnRH receptor copy number is variable. 
In humans and some primates, there are two GnRHR genes, while in fish and amphibians, three GnRHR genes have been identified, with additional duplications observed in some fish. The presence of additional receptors and GnRH ligands suggests that the extra gene copies could have acquired new functions [1].

[1] Moncaut N, Somoza G, Power DM, Canário AV. Five gonadotrophin-releasing hormone receptors in a teleost fish: isolation, tissue distribution and phylogenetic relationships. J Mol Endocrinol. 2005 Jun;34(3):767-79. doi: 10.1677/jme.1.01757. PMID: 15956346.


First we will load the libraries necessary to run this notebook. These include:
1. ncbi.datasets, the NCBI Datasets Python library
2. pandas, the Python data analysis library
3. pprint, which allows "pretty-printing" of Python data structures

In [1]:
# load all libraries
from pprint import pprint

import pandas as pd
import ncbi.datasets


## Get gene descriptors for three human GnRHR genes 
First we're going to get gene descriptors (metadata) for three human GnRHR genes, GNRHR, GNRHR2, and GNRHR2P1, by specifying the NCBI Gene IDs for these genes. 

Gene descriptors provide gene metadata in a machine-readable JSON format.

Gene descriptors contain a lot of interesting metadata and it's easy to pull out just the fields that you're interested in. As an example, here we'll show gene symbols, chromosome number and the corresponding SwissProt accession for the three genes.

In [1]:
# start a datasets gene API instance
api_client = ncbi.datasets.ApiClient()
ds_gene_instance = ncbi.datasets.GeneApi(api_client)

# Retrieve descriptors for three known gene-ids.
genes_and_messages = ds_gene_instance.gene_metadata_by_id([2798, 114814, 404718])

# Look up the symbols for each gene
def report_on_gene_descriptors(genes_and_messages, leader='\t', report_errors=True):
    if report_errors:
        for message in genes_and_messages.messages or []:
            print(f'{leader}Error for: ({",".join(message.error.invalid_identifiers)})')
            print(f'{leader}{leader}Reason: ({message.error.reason})')

    if not genes_and_messages.genes:
        print(f'{leader}Gene-ids not found in this organism')
        return

    for gene in map(lambda g: g.gene, genes_and_messages.genes):
        print(f'{leader}{gene.gene_id} -> {gene.symbol} (Chromosome(s): {gene.chromosomes}, SwissProt: {gene.swiss_prot_accessions})')

report_on_gene_descriptors(genes_and_messages)

	114814 -> GNRHR2 (Chromosome(s): ['1'], SwissProt: ['Q96P88.4'])
	2798 -> GNRHR (Chromosome(s): ['4'], SwissProt: ['P30968.1'])
	404718 -> GNRHR2P1 (Chromosome(s): ['14'], SwissProt: None)
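
The same fields can also be collected into a pandas table instead of being printed. The sketch below is a minimal illustration that runs offline: `descriptor_rows` and `join_or_blank` are hypothetical names, and the dictionaries are stand-ins for the `gene` objects returned by the API, filled with the values printed above.

```python
import pandas as pd

# Hypothetical stand-in for the API response: the values are copied from the
# printed output above, so this sketch runs without a network connection.
descriptor_rows = [
    {'gene_id': 114814, 'symbol': 'GNRHR2', 'chromosomes': ['1'],
     'swiss_prot_accessions': ['Q96P88.4']},
    {'gene_id': 2798, 'symbol': 'GNRHR', 'chromosomes': ['4'],
     'swiss_prot_accessions': ['P30968.1']},
    {'gene_id': 404718, 'symbol': 'GNRHR2P1', 'chromosomes': ['14'],
     'swiss_prot_accessions': None},
]

def join_or_blank(values):
    # list-valued fields can be None when no value is available
    return ','.join(values) if values else ''

# one row per gene, with list-valued fields flattened to strings
summary = pd.DataFrame([
    {
        'gene_id': row['gene_id'],
        'symbol': row['symbol'],
        'chromosome': join_or_blank(row['chromosomes']),
        'swiss_prot': join_or_blank(row['swiss_prot_accessions']),
    }
    for row in descriptor_rows
]).set_index('gene_id')

print(summary)
```

Flattening the list-valued fields to strings keeps the table easy to sort and export; the pseudogene GNRHR2P1 simply ends up with an empty SwissProt cell.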


## Get gene descriptors for GNRHR genes throughout primates
Now we're going to get gene descriptors (metadata) for a large set of GnRHR genes in primates. To get these gene descriptors we can specify gene symbol and species-level Tax IDs.  
To get these species-level Tax IDs, we'll use the Gene API to retrieve the taxonomy tree under the primate order (Tax ID 9443) and collect its species-level nodes.
We're going to search for two symbols, GNRHR and GNRHR2. Interestingly, very few primates have an annotated gene ortholog of human GNRHR2, and this is reflected in the results. In fact, the only non-human primates in which we can find GNRHR2 are the Rhesus monkey (Macaca mulatta) and the white-tufted-ear marmoset.
For those primates without an apparent GNRHR2, it's also possible that we might miss the gene because of inconsistent (or missing) nomenclature.
In a future iteration of Datasets, we plan to give users access to orthology information that would allow retrieval of a more comprehensive set of gene homologs across organisms.

In [1]:
# Get gene descriptors by gene symbol + Tax ID for every primate species
primate_tax_id = 9443

def species_tax_ids(tree):
    
    if tree.children:
        for child in tree.children:
            yield from species_tax_ids(child)
        
    if tree.rank == 'SPECIES':
        yield tree.tax_id, tree.sci_name, tree.common_name

primate_tax_tree = ds_gene_instance.gene_tax_tree(taxon=primate_tax_id)

symbols = ['GNRHR', 'GNRHR2']
for tax_id, sci_name, common_name in species_tax_ids(primate_tax_tree):
    print(f'Fetch for {common_name} ({sci_name} / {tax_id})')
    gene_descriptors = ds_gene_instance.gene_metadata_by_tax_and_symbol(symbols=symbols, taxon=tax_id)
    report_on_gene_descriptors(gene_descriptors, report_errors=False)


Fetch for eastern hoolock gibbon (Hoolock leuconedys / 593543)
	Gene-ids not found in this organism
Fetch for hoolock gibbon (Hoolock hoolock / 61851)
	Gene-ids not found in this organism
Fetch for None (Hoolock leuconedys x Hoolock tianxing / 1934191)
	Gene-ids not found in this organism
Fetch for Skywalker hoolock gibbon (Hoolock tianxing / 1934190)
	Gene-ids not found in this organism
Fetch for agile gibbon (Hylobates agilis / 9579)
	Gene-ids not found in this organism
Fetch for common gibbon (Hylobates lar / 9580)
	Gene-ids not found in this organism
Fetch for pileated gibbon (Hylobates pileatus / 9589)
	Gene-ids not found in this organism
Fetch for silvery gibbon (Hylobates moloch / 81572)
	116468321 -> GNRHR (Chromosome(s): ['Un'], SwissProt: None)
Fetch for northern white-cheeked gibbon (Nomascus leucogenys / 61853)
	100589627 -> GNRHR (Chromosome(s): ['9'], SwissProt: None)
Fetch for Red-cheeked gibbon (Nomascus gabriellae / 61852)
	Gene-ids not found in this organism
Fetch for

	108530291 -> GNRHR (Chromosome(s): ['Un'], SwissProt: None)
Fetch for Burmese snub-nosed monkey (Rhinopithecus strykeri / 1194336)
	Gene-ids not found in this organism
Fetch for golden snub-nosed monkey (Rhinopithecus roxellana / 61622)
	104664117 -> GNRHR (Chromosome(s): ['2'], SwissProt: None)
Fetch for Gray snub-nosed monkey (Rhinopithecus brelichi / 224329)
	Gene-ids not found in this organism
Fetch for None (Rhinopithecus bieti 1 RL-2012 / 1194334)
	Gene-ids not found in this organism
Fetch for None (Rhinopithecus bieti 2 RL-2012 / 1194335)
	Gene-ids not found in this organism
Fetch for Tonkin snub-nosed monkey (Rhinopithecus avunculus / 66062)
	Gene-ids not found in this organism
Fetch for Hanuman langur (Semnopithecus entellus / 88029)
	Gene-ids not found in this organism
Fetch for Simakobou (Simias concolor / 170207)
	Gene-ids not found in this organism
Fetch for Azara's night monkey (Aotus azarai / 30591)
	Gene-ids not found in this organism
Fetch for gray-bellied night monke

	Gene-ids not found in this organism
Fetch for ruffed lemur (Varecia variegata / 9455)
	Gene-ids not found in this organism
Fetch for gray-backed sportive lemur (Lepilemur dorsalis / 78583)
	Gene-ids not found in this organism
Fetch for Hubbard's sportive lemur (Lepilemur hubbardorum / 756882)
	Gene-ids not found in this organism
Fetch for James' sportive lemur (Lepilemur jamesorum / 486960)
	Gene-ids not found in this organism
Fetch for None (Lepilemur aeeclis / 342399)
	Gene-ids not found in this organism
Fetch for None (Lepilemur ahmansonorum / 886957)
	Gene-ids not found in this organism
Fetch for None (Lepilemur ankaranensis / 342401)
	Gene-ids not found in this organism
Fetch for None (Lepilemur betsileo / 886958)
	Gene-ids not found in this organism
Fetch for None (Lepilemur fleuretae / 886959)
	Gene-ids not found in this organism
Fetch for None (Lepilemur grewcockorum / 886960)
	Gene-ids not found in this organism
Fetch for None (Lepilemur hollandorum / 886961)
	Gene-ids not found in this organism
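
The recursive walk over the taxonomy tree can be tried offline on a toy tree. This is only a sketch: `node` and `toy_tree` are hypothetical, `SimpleNamespace` objects stand in for the tree nodes returned by the API, and the genus-level Tax ID and the `'GENUS'` rank string are placeholders. The two species entries are taken from the output above.

```python
from types import SimpleNamespace

def species_tax_ids(tree):
    # same depth-first walk as in the cell above: descend into children first,
    # then yield the node itself if it is ranked as a species
    if tree.children:
        for child in tree.children:
            yield from species_tax_ids(child)
    if tree.rank == 'SPECIES':
        yield tree.tax_id, tree.sci_name, tree.common_name

def node(tax_id, sci_name, common_name, rank, children=None):
    # hypothetical helper that mimics the attributes read from API tree nodes
    return SimpleNamespace(tax_id=tax_id, sci_name=sci_name,
                           common_name=common_name, rank=rank,
                           children=children)

# Toy tree: the parent node's Tax ID (0) and rank are placeholders
toy_tree = node(0, 'Nomascus', None, 'GENUS', children=[
    node(61853, 'Nomascus leucogenys', 'northern white-cheeked gibbon', 'SPECIES'),
    node(61852, 'Nomascus gabriellae', 'Red-cheeked gibbon', 'SPECIES'),
])

species = list(species_tax_ids(toy_tree))
for tax_id, sci_name, common_name in species:
    print(f'{common_name} ({sci_name} / {tax_id})')
```

Because the function is a generator, species are produced one at a time, so the fetch loop can start querying before the whole tree has been traversed.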

## Get gene descriptors for GNRHR genes across vertebrates and build a table

Let's expand the taxonomic scope even further, and look at a selection of vertebrates.

We'll use a pre-determined list of Gene IDs to get gene descriptors for these genes and build an easily readable table with key information about these genes.

In [1]:
# extract fields of interest from descriptors class to build a table
cols = '''
common_name
taxonomic_name
symbol
type
chromosome
num_transcripts
ensembl_id
omim_id
uniprot_id
nomenclature_id
nomenclature_auth
genome_coordinates
'''
cols = cols.split('\n')[1:-1]

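def _range_repr(intervals):
    # format each interval as begin_end and join the intervals with commas
    # (parameter renamed so it no longer shadows the built-in range)
    return ','.join(f'{interval.begin}_{interval.end}' for interval in intervals)

def _ranges_repr(ranges):
    # prefix each group of intervals with the accession of its sequence
    return ','.join(f'{r.accession_version}:{_range_repr(r.range)}' for r in ranges)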

# specify genes of interest and retrieve descriptors
gene_ids = [2798, 114814, 404718, 14715, 109324103, 109309182, 281798, 395368, 403718, 427517, 471226, 7226731, 100001586, 100135415, 100135416, 100135417, 100136028, 100270671, 100270672, 101318246, 101932446, 101935915, 101953943, 102193667, 102202954, 102205592, 102346610, 102363373, 102364206, 102366752, 102536567, 102687824, 102694185, 102770612, 103899900, 103899926, 105916404, 105919697, 105934126, 108392639, 109987527, 109994050, 109999298, 110488224, 110495632, 110496352, 110513414, 110520912, 112994411, 112996301, 114645297, 114667483]
gene_descriptors = ds_gene_instance.gene_metadata_by_id(gene_ids)

# collect elements of the descriptor class into a dictionary based on each gene ID
table_data = {}
for gene in map(lambda g: g.gene, gene_descriptors.genes):
    table_data[gene.gene_id] = [gene.common_name]
    table_data[gene.gene_id].append(gene.taxname)
    table_data[gene.gene_id].append(gene.symbol)
    table_data[gene.gene_id].append(gene.type)
    table_data[gene.gene_id].append(gene.chromosome)
    if gene.transcripts:
        table_data[gene.gene_id].append(len(gene.transcripts))
    else:
        table_data[gene.gene_id].append(0)
    table_data[gene.gene_id].append(gene.ensembl_gene_ids)
    table_data[gene.gene_id].append(gene.omim_ids)
    table_data[gene.gene_id].append(gene.swiss_prot_accessions)
    if gene.nomenclature_authority:
        table_data[gene.gene_id].append(gene.nomenclature_authority.identifier)
        table_data[gene.gene_id].append(gene.nomenclature_authority.authority)
    else:
        table_data[gene.gene_id].append(None)
        table_data[gene.gene_id].append(None)        
    table_data[gene.gene_id].append(_ranges_repr(gene.genomic_ranges))

        
df = pd.DataFrame.from_dict(table_data, orient='index', columns=cols)
df.index.name = 'gene_id'
df

gene_id,common_name,taxonomic_name,symbol,type,chromosome,num_transcripts,ensembl_id,omim_id,uniprot_id,nomenclature_id,nomenclature_auth,genome_coordinates
100001586,zebrafish,Danio rerio,gnrhr4,PROTEIN_CODING,,2,[ENSDARG00000038116],,,ZDB-GENE-050419-76,ZFIN,NC_007129.7:25390740_25402909
100135415,tropical clawed frog,Xenopus tropicalis,gnrhr2,PROTEIN_CODING,,3,[ENSXETG00000021161],,,XB-GENE-5867415,Xenbase,NC_030679.2:109001475_109014876
100135416,tropical clawed frog,Xenopus tropicalis,gnrhr2/nmi,PROTEIN_CODING,,2,[ENSXETG00000005637],,,,,NC_030679.2:116455054_116472596
100135417,tropical clawed frog,Xenopus tropicalis,gnrhr,PROTEIN_CODING,,2,[ENSXETG00000001290],,,XB-GENE-5753573,Xenbase,NC_030684.2:142203708_142211556
100136028,rainbow trout,Oncorhynchus mykiss,gnrh-r,PROTEIN_CODING,,1,[ENSOMYG00000000839],,,,,NC_035078.1:14963448_14968063
100270671,zebrafish,Danio rerio,gnrhr2,PROTEIN_CODING,,1,[ENSDARG00000003553],,,ZDB-GENE-090128-3,ZFIN,NC_007118.7:52364748_52368843
100270672,zebrafish,Danio rerio,gnrhr1,PROTEIN_CODING,,2,[ENSDARG00000100593],,,ZDB-GENE-090128-2,ZFIN,NC_007130.7:43213293_43235554
101318246,common bottlenose dolphin,Tursiops truncatus,GNRHR,PROTEIN_CODING,,1,,,,,,NC_047038.1:85406171_85428905
101932446,Painted turtle,Chrysemys picta,LOC101932446,PROTEIN_CODING,,1,[ENSCPBG00000024942],,,,,NW_007281386.1:1139258_1141674
101935915,Painted turtle,Chrysemys picta,LOC101935915,PROTEIN_CODING,,2,[ENSCPBG00000007948],,,,,NW_007281382.1:370463_379501


## Use gene descriptor data to build a simple table of gene copy numbers

As we mentioned before, while the role of the gonadotropin releasing hormone and receptor in reproduction is highly conserved across vertebrates, gene copy number for the receptor genes is not. Let's take a quick look at how gene count varies in different organisms.  
In our list, rainbow trout has the most gene copies (6), while numerous mammals, including mouse, dolphin and alpaca, have only a single annotated gene copy.

In [1]:
# count genes per organism
gene_cnt = df.groupby('common_name')['symbol'].count().reset_index()
gene_cnt.columns = ['organism', 'gene_count']
gene_cnt.sort_values('gene_count', ascending=False, inplace=True)
gene_cnt

,organism,gene_count
16,rainbow trout,6
8,coelacanth,4
20,zebrafish,3
2,Painted turtle,3
19,tropical clawed frog,3
4,ballan wrasse,3
15,mummichog,3
14,human,3
12,emu,2
18,spotted gar,2
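
An equivalent, slightly shorter way to get the same counts is `Series.value_counts()`. The sketch below compares both approaches on a small sample: `sample` holds only the ten gene rows displayed in the descriptor table earlier, so its counts are lower than those in the full table above.

```python
import pandas as pd

# common_name and symbol values from the ten table rows displayed earlier;
# the full table has more rows, so these counts are not the final ones
sample = pd.DataFrame({
    'common_name': [
        'zebrafish', 'tropical clawed frog', 'tropical clawed frog',
        'tropical clawed frog', 'rainbow trout', 'zebrafish', 'zebrafish',
        'common bottlenose dolphin', 'Painted turtle', 'Painted turtle',
    ],
    'symbol': [
        'gnrhr4', 'gnrhr2', 'gnrhr2/nmi', 'gnrhr', 'gnrh-r', 'gnrhr2',
        'gnrhr1', 'GNRHR', 'LOC101932446', 'LOC101935915',
    ],
})

# approach used in the cell above
by_groupby = sample.groupby('common_name')['symbol'].count()

# value_counts counts rows per organism and sorts by count, descending
by_value_counts = sample['common_name'].value_counts()

gene_cnt_sample = (by_value_counts
                   .rename_axis('organism')
                   .reset_index(name='gene_count'))
print(gene_cnt_sample)
```

The two approaches agree as long as `symbol` has no missing values, since `count()` skips nulls while `value_counts()` counts rows.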


## Use gene datasets to build a transcript-focused table

Finally, we are going to download a gene dataset for the human GnRH receptor genes, and use metadata included within the dataset to build a transcript-focused table.
We'll take this metadata from the data report file, which is similar to the gene descriptor described above, but in a more human-readable YAML format.  
Note that gene datasets include a data table that is gene-oriented, with one gene record per row. Here, we are building a custom transcript-oriented table.


In [1]:
%%time

gene_ids = [2798, 114814, 404718]
gene_ds_download = ds_gene_instance.download_gene_package(gene_ids, _preload_content=False)

# write to a zip file
zipfile_name = 'gene_ds.zip'
with open(zipfile_name, 'wb') as f:
    f.write(gene_ds_download.data)

print(f'Download complete to {zipfile_name}')

Download complete to gene_ds.zip
CPU times: user 13 ms, sys: 3.39 ms, total: 16.4 ms
Wall time: 176 ms


In [1]:
!unzip -v {zipfile_name}

Archive:  gene_ds.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
     661  Defl:N      384  42% 09-04-2020 16:27 bc3c97af  README.md
    5515  Defl:N     1284  77% 09-04-2020 16:27 c33d83dc  ncbi_dataset/data/data_report.yaml
    1434  Defl:N      478  67% 09-04-2020 16:27 fa82a51a  ncbi_dataset/data/data_table.tsv
     202  Defl:N      115  43% 09-04-2020 16:27 c98a95f2  ncbi_dataset/data/dataset_catalog.json
       0  Defl:N        2   0% 09-04-2020 16:27 00000000  ncbi_dataset/fetch.txt
--------          -------  ---                            -------
    7812             2263  71%                            5 files


In [1]:
import json
import zipfile

from google.protobuf.json_format import ParseDict
import yaml

import ncbi.datasets.v1alpha1.reports.gene_pb2 as gene_report_pb2

def gene_report_for(path_to_zipfile):
    '''
    Return an object representing the data report.
    path_to_zipfile: The relative path to the zipfile containing the gene data report
    '''
    # name the handle 'archive' to avoid shadowing the built-in zip()
    with zipfile.ZipFile(path_to_zipfile, 'r') as archive:
        gene_report_as_dict = yaml.safe_load(archive.read('ncbi_dataset/data/data_report.yaml'))
    # convert the plain dict into the protobuf message type for attribute access
    gene_report = gene_report_pb2.GeneDescriptors()
    ParseDict(gene_report_as_dict, gene_report)
    return gene_report

def _5prime_len(transcript):
    if not transcript.cds or not transcript.cds.range:
        return None
    return transcript.cds.range[0].begin - 1

def _3prime_len(transcript):
    if not transcript.cds or not transcript.cds.range:
        return None
    return transcript.length - transcript.cds.range[0].end

gene_report = gene_report_for(zipfile_name)

rows = []
for gene in gene_report.genes:

    # transcripts for each gene are embedded as lists and require additional handling
    for transcript in gene.transcripts:
        rows.append({
            'gene_id': gene.gene_id,
            'gene_symbol': gene.symbol,
            'gene_taxonomy': gene.taxname,            
            'accVer': transcript.accession_version,
            'name': transcript.name,
            'length': transcript.length,
            '5`UTR_len': _5prime_len(transcript),
            '3`UTR_len': _3prime_len(transcript),
            'protAccVer': transcript.protein.accession_version or None,
            'protName': transcript.protein.isoform_name or None,
            'protLength': transcript.protein.length or None,
            'exonAccVer': transcript.exons.accession_version,
            'numExons': len(transcript.exons.range),
        })

transcript_table = pd.DataFrame(rows)

transcript_table


,gene_id,gene_symbol,gene_taxonomy,accVer,name,length,5`UTR_len,3`UTR_len,protAccVer,protName,protLength,exonAccVer,numExons
0,114814,GNRHR2,Homo sapiens,NR_002328.4,transcript variant 1,1626,,,,,,NC_000001.11,3
1,114814,GNRHR2,Homo sapiens,NR_104034.1,transcript variant 3,786,,,,,,NC_000001.11,3
2,114814,GNRHR2,Homo sapiens,NR_104033.1,transcript variant 2,1035,,,,,,NC_000001.11,4
3,2798,GNRHR,Homo sapiens,NM_000406.3,transcript variant 1,4402,53.0,3362.0,NP_000397.1,isoform 1,328.0,NC_000004.12,3
4,2798,GNRHR,Homo sapiens,NM_001012763.2,transcript variant 2,4017,53.0,3214.0,NP_001012781.1,isoform 2,249.0,NC_000004.12,3
