# Use NCBI Datasets to Retrieve Information on Genes

This notebook demonstrates how to use the `ncbi.datasets` Python library to download and extract gene data from NCBI Datasets.

There are two major types of gene data available from NCBI Datasets:

1. Gene datasets, which include gene, transcript, and protein sequences, and a data report (gene metadata in JSON lines format)

2. Gene summaries, which are brief descriptions of the gene datasets described above
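
For orientation, a JSON Lines file holds one JSON object per line, so a data report can be read line by line with the standard library. The sketch below uses a made-up two-record sample; the field names (`geneId`, `symbol`, `taxname`) and the helper `read_jsonl` are assumptions for illustration and may not match the actual report schema.

```python
import json

# Hypothetical sample in JSON Lines format: one JSON object per line.
# Field names are illustrative assumptions, not the confirmed report schema.
sample_jsonl = """\
{"geneId": "2798", "symbol": "GNRHR", "taxname": "Homo sapiens"}
{"geneId": "114814", "symbol": "GNRHR2", "taxname": "Homo sapiens"}
"""

def read_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = read_jsonl(sample_jsonl)
for record in records:
    print(record["geneId"], record["symbol"], record["taxname"])
```

Because each line is an independent JSON object, a large report can also be processed one record at a time instead of being loaded whole.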

As an example, we will get gene data for the Gonadotropin Releasing Hormone Receptor (GNRHR) gene family, which plays a key role in sexual development and function across vertebrates.

While the role of GnRH in reproduction is conserved across vertebrates, GnRH receptor copy number is variable.
In humans and some primates, there are two GnRHR genes, while in fish and amphibians, three GnRHR genes have been identified, with additional duplications observed in some fish. The presence of additional receptors and GnRH ligands suggests that the extra gene copies could have acquired new functions [1].

We expect to observe this variable gene copy number in the data we obtain using NCBI Datasets.

[1] Moncaut N, Somoza G, Power DM, Canário AV. Five gonadotrophin-releasing hormone receptors in a teleost fish: isolation, tissue distribution and phylogenetic relationships. J Mol Endocrinol. 2005 Jun;34(3):767-79. doi: 10.1677/jme.1.01757. PMID: 15956346.


First we will load the libraries needed to run this notebook:
1. `ncbi.datasets`, the NCBI Datasets client library
2. `pandas`, the Python data analysis library

In [1]:
# load all libraries
import pandas as pd
import ncbi.datasets

## Get gene summaries for three human GnRHR genes 
First we're going to get gene summaries for three human GnRHR genes, GNRHR, GNRHR2, and GNRHR2P1, by specifying the NCBI Gene IDs for these genes. 

Gene summaries contain a lot of interesting metadata and it's easy to pull out just the fields that you're interested in. As an example, we'll show how to parse gene symbols, chromosome number and the corresponding SwissProt accession for the three genes.

In [1]:
# Start a Datasets gene API instance
api_client = ncbi.datasets.ApiClient()
ds_gene_instance = ncbi.datasets.GeneApi(api_client)

# Retrieve gene summaries for the three genes using NCBI Gene IDs
gene_summary = ds_gene_instance.gene_metadata_by_id([2798, 114814, 404718])

# Look up the symbols, chromosome number and SwissProt accession for each gene
def report_on_gene_descriptors(gene_summary, leader='\t', report_errors=True):
    if report_errors:
        for message in gene_summary.messages or []:
            print(f'{leader}Error for: ({",".join(message.error.invalid_identifiers)})')
            print(f'{leader}{leader}Reason: ({message.error.reason})')

    if not gene_summary.genes:
        print(f'{leader}No genes found')
        return

    for gene in map(lambda g: g.gene, gene_summary.genes):
        print(f'{leader}{gene.symbol} (GeneID: {gene.gene_id}), Chromosome: {gene.chromosomes}, SwissProt: {gene.swiss_prot_accessions}')

report_on_gene_descriptors(gene_summary)

	GNRHR2 (GeneID: 114814), Chromosome: ['1'], SwissProt: ['Q96P88']
	GNRHR (GeneID: 2798), Chromosome: ['4'], SwissProt: ['P30968']
	GNRHR2P1 (GeneID: 404718), Chromosome: ['14'], SwissProt: None
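
The field-picking pattern used in `report_on_gene_descriptors` can be tried without a network connection. The sketch below uses `types.SimpleNamespace` stand-ins shaped like the attributes the function reads (`symbol`, `gene_id`, `chromosomes`, `swiss_prot_accessions`). The stand-ins and the helper `summarize_gene` are hypothetical, and the values are copied from the output above.

```python
from types import SimpleNamespace

def summarize_gene(gene):
    """Return a plain dict of the printed fields; a missing list becomes []."""
    return {
        "symbol": gene.symbol,
        "gene_id": gene.gene_id,
        "chromosomes": list(gene.chromosomes or []),
        "swiss_prot": list(gene.swiss_prot_accessions or []),
    }

# Stand-in objects for illustration only, not real API responses.
fake_genes = [
    SimpleNamespace(symbol="GNRHR", gene_id="2798",
                    chromosomes=["4"], swiss_prot_accessions=["P30968"]),
    SimpleNamespace(symbol="GNRHR2P1", gene_id="404718",
                    chromosomes=["14"], swiss_prot_accessions=None),
]

summaries = [summarize_gene(gene) for gene in fake_genes]
for summary in summaries:
    print(summary)
```

Normalizing `None` to an empty list keeps later code from having to special-case genes without a SwissProt accession, such as GNRHR2P1.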


## Finding genes by symbol in primates

Now we're going to look for GnRHR genes in primates. To find these genes, we'll query by gene symbol and NCBI Taxonomy ID.

First we'll use the Gene API to get all primate species-level Taxonomy IDs. 

Then for each primate species, we'll query for two gene symbols, GNRHR and GNRHR2, and from the gene summary, we'll return the gene symbol, NCBI Gene ID, Chromosome, and SwissProt accession (if available). 

Interestingly, very few primates have an annotated ortholog of human GNRHR2, and this is reflected in the results. In fact, the only non-human primates with an annotated gene named GNRHR2 are rhesus monkey (Macaca mulatta) and white-tufted-ear marmoset (Callithrix jacchus).
For many primates, we may not get any results because gene nomenclature is missing.  
In a future iteration of Datasets, we plan to give users access to orthology information that would allow retrieval of a more comprehensive set of gene homologs across organisms.

In [1]:
# Get gene summaries by gene symbol + taxonomy ID for all primate species
primate_tax_id = 9443

def species_tax_ids(tree):
    
    if tree.children:
        for child in tree.children:
            yield from species_tax_ids(child)
        
    if tree.rank == 'SPECIES':
        yield tree.tax_id, tree.sci_name, tree.common_name

primate_tax_tree = ds_gene_instance.gene_tax_tree(taxon=primate_tax_id)

symbols = ['GNRHR', 'GNRHR2']
for tax_id, sci_name, common_name in species_tax_ids(primate_tax_tree):
    if not common_name:
        print(f'\n{sci_name}, TaxID: {tax_id}')
    else:
        print(f'\n{sci_name} ({common_name}), TaxID: {tax_id}')
    gene_descriptors = ds_gene_instance.gene_metadata_by_tax_and_symbol(symbols=symbols, taxon=tax_id)
    report_on_gene_descriptors(gene_descriptors, report_errors=False)


Hoolock leuconedys (eastern hoolock gibbon), TaxID: 593543
	No genes found

Hoolock hoolock (hoolock gibbon), TaxID: 61851
	No genes found

Hoolock leuconedys x Hoolock tianxing, TaxID: 1934191
	No genes found

Hoolock tianxing (Skywalker hoolock gibbon), TaxID: 1934190
	No genes found

Hylobates agilis (agile gibbon), TaxID: 9579
	No genes found

Hylobates lar (common gibbon), TaxID: 9580
	No genes found

Hylobates pileatus (pileated gibbon), TaxID: 9589
	No genes found

Hylobates moloch (silvery gibbon), TaxID: 81572
	GNRHR (GeneID: 116468321), Chromosome: ['Un'], SwissProt: None

Nomascus leucogenys (northern white-cheeked gibbon), TaxID: 61853
	GNRHR (GeneID: 100589627), Chromosome: ['9'], SwissProt: None

Nomascus gabriellae (Red-cheeked gibbon), TaxID: 61852
	No genes found

Nomascus siki (southern white-cheeked gibbon), TaxID: 9586
	No genes found

Symphalangus syndactylus (siamang), TaxID: 9590
	No genes found

Pan troglodytes (chimpanzee), TaxID: 9598
	GNRHR (GeneID: 471226), [remainder of line lost]

[output truncated]

Callimico goeldii (Goeldi's marmoset), TaxID: 9495
	No genes found

Callithrix penicillata (black-pencilled marmoset), TaxID: 57378
	No genes found

Callithrix geoffroyi (Geoffroy's marmoset), TaxID: 52231
	No genes found

Callithrix aurita (white-eared marmoset), TaxID: 57375
	No genes found

Callithrix jacchus (white-tufted-ear marmoset), TaxID: 9483
	GNRHR (GeneID: 100385305), Chromosome: ['3'], SwissProt: None
	LOC100399755 (GeneID: 100399755), Chromosome: ['18'], SwissProt: ['Q95MG6']

Callithrix kuhlii (Wied's marmoset), TaxID: 867363
	No genes found

Callithrix pygmaea (pygmy marmoset), TaxID: 9493
	No genes found

Leontopithecus rosalia (golden lion tamarin), TaxID: 30588
	No genes found

Leontopithecus chrysopygus (golden-rumped lion tamarin), TaxID: 58710
	No genes found

Saguinus oedipus (cotton-top tamarin), TaxID: 9490
	No genes found

Saimiri boliviensis (Bolivian squirrel monkey), TaxID: 27679
	GNRHR (GeneID: 101052255), Chromosome: ['Un'], SwissProt: None
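
To see how the recursive `species_tax_ids` generator walks a taxonomy tree, here is a self-contained sketch on a toy tree. The `node` helper and the tree shape are assumptions built from `types.SimpleNamespace`, not the actual object returned by `gene_tax_tree`. Species names and TaxIDs are taken from the output above; the negative TaxIDs on the family and genus nodes are placeholders.

```python
from types import SimpleNamespace

def node(tax_id, sci_name, rank, common_name=None, children=None):
    """Build a stand-in tree node (hypothetical helper)."""
    return SimpleNamespace(tax_id=tax_id, sci_name=sci_name, rank=rank,
                           common_name=common_name, children=children or [])

def species_tax_ids(tree):
    """Depth-first walk that yields only species-rank nodes."""
    for child in tree.children:
        yield from species_tax_ids(child)
    if tree.rank == 'SPECIES':
        yield tree.tax_id, tree.sci_name, tree.common_name

# Negative tax IDs are placeholders for the non-species nodes.
toy_tree = node(-1, 'Hylobatidae', 'FAMILY', children=[
    node(-2, 'Hylobates', 'GENUS', children=[
        node(9580, 'Hylobates lar', 'SPECIES', 'common gibbon'),
        node(81572, 'Hylobates moloch', 'SPECIES', 'silvery gibbon'),
    ]),
    node(-3, 'Symphalangus', 'GENUS', children=[
        node(9590, 'Symphalangus syndactylus', 'SPECIES', 'siamang'),
    ]),
])

species = list(species_tax_ids(toy_tree))
for tax_id, sci_name, common_name in species:
    print(tax_id, sci_name, common_name)
```

Only nodes whose rank is `SPECIES` are yielded, so the family and genus nodes contribute nothing except their children.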

## Build a table of key metadata for GnRHR genes across vertebrates

Let's expand the taxonomic scope even further, and look at a selection of vertebrates.

We'll use a pre-determined list of Gene IDs to get gene summaries for these genes and build an easily readable table with key information about these genes.

In [1]:
# extract fields of interest from descriptors class to build a table
cols = '''
common_name
taxonomic_name
symbol
type
chromosome
num_transcripts
ensembl_id
omim_id
uniprot_id
nomenclature_id
nomenclature_auth
genome_coordinates
'''
cols = cols.split('\n')[1:-1]

def _range_repr(intervals):
    # join each interval as begin_end
    return ','.join(f'{interval.begin}_{interval.end}' for interval in intervals)

def _ranges_repr(ranges):
    # prefix each range's intervals with its sequence accession
    return ','.join(f'{r.accession_version}:{_range_repr(r.range)}' for r in ranges)

# specify genes of interest and retrieve descriptors
gene_ids = [2798, 114814, 404718, 14715, 109324103, 109309182, 281798, 395368, 403718, 427517, 471226, 7226731, 100001586, 100135415, 100135416, 100135417, 100136028, 100270671, 100270672, 101318246, 101932446, 101935915, 101953943, 102193667, 102202954, 102205592, 102346610, 102363373, 102364206, 102366752, 102536567, 102687824, 102694185, 102770612, 103899900, 103899926, 105916404, 105919697, 105934126, 108392639, 109987527, 109994050, 109999298, 110488224, 110495632, 110496352, 110513414, 110520912, 112994411, 112996301, 114645297, 114667483]
gene_descriptors = ds_gene_instance.gene_metadata_by_id(gene_ids)

# collect elements of the descriptor class into a dictionary based on each gene ID
table_data = {}
for gene in map(lambda g: g.gene, gene_descriptors.genes):
    table_data[gene.gene_id] = [gene.common_name]
    table_data[gene.gene_id].append(gene.taxname)
    table_data[gene.gene_id].append(gene.symbol)
    table_data[gene.gene_id].append(gene.type)
    table_data[gene.gene_id].append(gene.chromosome)
    if gene.transcripts:
        table_data[gene.gene_id].append(len(gene.transcripts))
    else:
        table_data[gene.gene_id].append(0)
    table_data[gene.gene_id].append(gene.ensembl_gene_ids)
    table_data[gene.gene_id].append(gene.omim_ids)
    table_data[gene.gene_id].append(gene.swiss_prot_accessions)
    if gene.nomenclature_authority:
        table_data[gene.gene_id].append(gene.nomenclature_authority.identifier)
        table_data[gene.gene_id].append(gene.nomenclature_authority.authority)
    else:
        table_data[gene.gene_id].append(None)
        table_data[gene.gene_id].append(None)        
    table_data[gene.gene_id].append(_ranges_repr(gene.genomic_ranges))

        
df = pd.DataFrame.from_dict(table_data, orient='index', columns=cols)
df.index.name = 'gene_id'
df

gene_id,common_name,taxonomic_name,symbol,type,chromosome,num_transcripts,ensembl_id,omim_id,uniprot_id,nomenclature_id,nomenclature_auth,genome_coordinates
100001586,zebrafish,Danio rerio,gnrhr4,PROTEIN_CODING,,2,[ENSDARG00000038116],,,ZDB-GENE-050419-76,ZFIN,NC_007129.7:25390740_25402909
100135415,tropical clawed frog,Xenopus tropicalis,gnrhr2,PROTEIN_CODING,,3,[ENSXETG00000021161],,,XB-GENE-5867415,Xenbase,NC_030679.2:109001475_109014876
100135416,tropical clawed frog,Xenopus tropicalis,gnrhr2/nmi,PROTEIN_CODING,,2,[ENSXETG00000005637],,,,,NC_030679.2:116455054_116472596
100135417,tropical clawed frog,Xenopus tropicalis,gnrhr,PROTEIN_CODING,,2,[ENSXETG00000001290],,,XB-GENE-5753573,Xenbase,NC_030684.2:142203708_142211556
100136028,rainbow trout,Oncorhynchus mykiss,gnrh-r,PROTEIN_CODING,,1,[ENSOMYG00000000839],,,,,NC_048566.1:29925736_29930354
100270671,zebrafish,Danio rerio,gnrhr2,PROTEIN_CODING,,1,[ENSDARG00000003553],,,ZDB-GENE-090128-3,ZFIN,NC_007118.7:52364748_52368843
100270672,zebrafish,Danio rerio,gnrhr1,PROTEIN_CODING,,2,[ENSDARG00000100593],,,ZDB-GENE-090128-2,ZFIN,NC_007130.7:43213293_43235554
101318246,common bottlenose dolphin,Tursiops truncatus,GNRHR,PROTEIN_CODING,,1,,,,,,NC_047038.1:85406171_85428905
101932446,Painted turtle,Chrysemys picta,LOC101932446,PROTEIN_CODING,,1,[ENSCPBG00000024942],,,,,NW_007281386.1:1139258_1141674
101935915,Painted turtle,Chrysemys picta,LOC101935915,PROTEIN_CODING,,2,[ENSCPBG00000007948],,,,,NW_007281382.1:370463_379501
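
The `genome_coordinates` strings follow the `accession:begin_end` layout produced by `_ranges_repr`. As a rough sketch, they can be split back into parts as shown below. The helper name `parse_genome_coordinates` is hypothetical, and the span calculation assumes 1-based inclusive coordinates, which the notebook does not state.

```python
def parse_genome_coordinates(text):
    """Parse 'ACC:begin_end[,begin_end...]' into (accession, begin, end) tuples."""
    intervals = []
    accession = None
    for token in text.split(','):
        # A token containing ':' starts a new accession; accessions may
        # themselves contain '_', so split on ':' before splitting on '_'.
        if ':' in token:
            accession, token = token.split(':', 1)
        begin, end = token.split('_')
        intervals.append((accession, int(begin), int(end)))
    return intervals

# Value copied from the zebrafish gnrhr4 row in the table above.
coords = parse_genome_coordinates('NC_007129.7:25390740_25402909')
accession, begin, end = coords[0]
span = end - begin + 1  # assumes 1-based inclusive coordinates
print(accession, begin, end, span)
```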


## Build a table showing GnRHR gene copy number across vertebrates

We're going to build a table showing how gene count varies among a selected group of vertebrates.  
Note that rainbow trout has the most gene copies at 6, while numerous mammals, including mouse, dolphin and alpaca, only have a single annotated gene copy. 

In [1]:
# count genes per organism and sort by descending count
gene_cnt = df.groupby('common_name')['symbol'].count().reset_index()
gene_cnt.columns = ['organism', 'gene_count']
gene_cnt.sort_values('gene_count', ascending=False, inplace=True)
gene_cnt

,organism,gene_count
16,rainbow trout,6
8,coelacanth,4
20,zebrafish,3
2,Painted turtle,3
19,tropical clawed frog,3
4,ballan wrasse,3
15,mummichog,3
14,human,3
12,emu,2
18,spotted gar,2
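
The same group-and-count step can be sketched with only the standard library. The example below counts organisms over the ten rows shown in the metadata table excerpt earlier, so its totals cover just that excerpt and differ from the full table (for instance, only one rainbow trout row appears in the excerpt).

```python
from collections import Counter

# Organism names for the ten rows shown in the metadata table excerpt.
organisms = [
    'zebrafish', 'tropical clawed frog', 'tropical clawed frog',
    'tropical clawed frog', 'rainbow trout', 'zebrafish', 'zebrafish',
    'common bottlenose dolphin', 'Painted turtle', 'Painted turtle',
]

gene_count = Counter(organisms)

# most_common() sorts by descending count, like sort_values(ascending=False).
for organism, count in gene_count.most_common():
    print(f'{organism}: {count}')
```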


## Use gene datasets to build a transcript-focused table

Finally, we are going to download a gene dataset for the human GnRHR genes, and use metadata included as part of the dataset to build a transcript-focused table.  
We'll take this metadata from the data report file, `data_report.jsonl`.

In [1]:
%%time

gene_ids = [2798, 114814, 404718]
gene_ds_download = ds_gene_instance.download_gene_package(gene_ids, _preload_content=False)

# write the downloaded package to a zip file
zipfile_name = 'gene_ds.zip'
with open(zipfile_name, 'wb') as f:
    f.write(gene_ds_download.data)

print(f'Download saved to {zipfile_name}')

Download saved to gene_ds.zip
CPU times: user 12.2 ms, sys: 3.71 ms, total: 15.9 ms
Wall time: 153 ms


In [1]:
!unzip -v {zipfile_name}

Archive:  gene_ds.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
     661  Defl:N      384  42% 11-20-2020 13:53 bc3c97af  README.md
    4623  Defl:N     1187  74% 11-20-2020 13:53 f9703695  ncbi_dataset/data/data_report.jsonl
    1434  Defl:N      478  67% 11-20-2020 13:53 fa82a51a  ncbi_dataset/data/data_table.tsv
     203  Defl:N      116  43% 11-20-2020 13:53 47253497  ncbi_dataset/data/dataset_catalog.json
--------          -------  ---                            -------
    6921             2165  69%                            4 files
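
Where the `unzip` command is not available, the archive contents can also be listed from Python. The sketch below builds a small in-memory archive with placeholder contents so that it runs without a download; the helper `list_archive` is a hypothetical name.

```python
import io
import zipfile

# Build a small in-memory archive so the example runs without a download.
# The file contents are placeholders, not real NCBI data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w', compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr('README.md', 'placeholder readme\n')
    archive.writestr('ncbi_dataset/data/data_report.jsonl', '{"symbol": "GNRHR"}\n')

def list_archive(file_or_path):
    """Return (name, uncompressed_size) pairs for every member of a zip archive."""
    with zipfile.ZipFile(file_or_path, 'r') as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]

buffer.seek(0)
members = list_archive(buffer)
for name, size in members:
    print(f'{size:>8}  {name}')
```

Calling `list_archive('gene_ds.zip')` should produce the equivalent listing for the downloaded package.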


In [1]:
import zipfile
from google.protobuf.json_format import ParseDict
import jsonlines
import ncbi.datasets.v1alpha1.reports.gene_pb2 as gene_report_pb2

def gene_report_for(path_to_zipfile):
    '''
    Return an object representing the data report.
    path_to_zipfile: The relative path to the zipfile containing the gene data report
    '''
    gene_report = gene_report_pb2.GeneDescriptors()
    # read the JSON Lines data report straight from the archive, one gene per line
    with zipfile.ZipFile(path_to_zipfile, 'r') as archive:
        with archive.open('ncbi_dataset/data/data_report.jsonl') as file:
            reader = jsonlines.Reader(file)
            for gene_dict in reader.iter(type=dict, skip_invalid=True):
                ParseDict(gene_dict, gene_report.genes.add())
    return gene_report

def _5prime_len(transcript):
    if not transcript.cds or not transcript.cds.range:
        return None
    return transcript.cds.range[0].begin - 1

def _3prime_len(transcript):
    if not transcript.cds or not transcript.cds.range:
        return None
    return transcript.length - transcript.cds.range[0].end

gene_report = gene_report_for(zipfile_name)

rows = []
for gene in gene_report.genes:

    # transcripts for each gene are embedded as lists and require additional handling
    for transcript in gene.transcripts:
        rows.append({
            'gene_id': gene.gene_id,
            'gene_symbol': gene.symbol,
            'gene_taxonomy': gene.taxname,            
            'accVer': transcript.accession_version,
            'name': transcript.name,
            'length': transcript.length,
            '5`UTR_len': _5prime_len(transcript),
            '3`UTR_len': _3prime_len(transcript),
            'protAccVer': transcript.protein.accession_version or None,
            'protName': transcript.protein.isoform_name or None,
            'protLength': transcript.protein.length or None,
            'exonAccVer': transcript.exons.accession_version,
            'numExons': len(transcript.exons.range),
        })

transcript_table = pd.DataFrame(rows)

transcript_table


,gene_id,gene_symbol,gene_taxonomy,accVer,name,length,5`UTR_len,3`UTR_len,protAccVer,protName,protLength,exonAccVer,numExons
0,114814,GNRHR2,Homo sapiens,NR_002328.4,transcript variant 1,1626,,,,,,NC_000001.11,3
1,114814,GNRHR2,Homo sapiens,NR_104034.1,transcript variant 3,786,,,,,,NC_000001.11,3
2,114814,GNRHR2,Homo sapiens,NR_104033.1,transcript variant 2,1035,,,,,,NC_000001.11,4
3,2798,GNRHR,Homo sapiens,NM_000406.3,transcript variant 1,4402,53.0,3362.0,NP_000397.1,isoform 1,328.0,NC_000004.12,3
4,2798,GNRHR,Homo sapiens,NM_001012763.2,transcript variant 2,4017,53.0,3214.0,NP_001012781.1,isoform 2,249.0,NC_000004.12,3
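
The UTR arithmetic in `_5prime_len` and `_3prime_len` can be checked by hand against the table. In the sketch below, the CDS coordinates for NM_000406.3 are inferred from the table values rather than read from the data report, and the functions `utr_lengths` and `protein_length` are hypothetical helpers. It assumes 1-based inclusive CDS coordinates and a CDS that ends with a stop codon.

```python
def utr_lengths(transcript_length, cds_begin, cds_end):
    """Return (5' UTR length, 3' UTR length) for 1-based inclusive CDS coordinates."""
    return cds_begin - 1, transcript_length - cds_end

def protein_length(cds_begin, cds_end):
    """Codons in the CDS minus one, assuming the CDS ends with a stop codon."""
    cds_length = cds_end - cds_begin + 1
    return cds_length // 3 - 1

# CDS coordinates inferred from the NM_000406.3 row (an assumption):
# begin = 53 + 1 = 54, end = 4402 - 3362 = 1040
five_prime, three_prime = utr_lengths(4402, 54, 1040)
print(five_prime, three_prime)    # 53 3362
print(protein_length(54, 1040))   # 328
```

The inferred CDS spans 987 bases, or 329 codons including the stop, which agrees with the protein length of 328 reported for NP_000397.1.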
