# Preface: The BRCA Exchange Internal API
BRCA Exchange uses an internal API for communication between its front end and its backend. This notebook guides developers through that API and describes the features it offers.

Full API documentation can be found here: [API Overview](https://github.com/BRCAChallenge/brca-exchange/blob/master/website/content/api_docs/api_overview.md).

In [1]:
# import the objects we'll be using throughout this notebook
from pprint import pprint  # for pretty-printing results
import requests as rq  # for issuing HTTP(S) queries

# Searching for Variants

## Getting All Variants
As a starting exercise, let's get a list of all variants from the backend. The list is paged to prevent overwhelming the client with the 26,000+ variants available on the site.

In [107]:
# in these examples, the query is split across separate lines to improve readability
query = """
https://brcaexchange.org/backend/data/
?format=json
&order_by=Gene_Symbol
&direction=ascending
&page_size=20
&page_num=0
&search_term=
&include=Variant_in_ENIGMA
&include=Variant_in_ClinVar
&include=Variant_in_1000_Genomes
&include=Variant_in_ExAC&include=Variant_in_LOVD
&include=Variant_in_BIC&include=Variant_in_ESP
&include=Variant_in_exLOVD&include=Variant_in_ENIGMA_BRCA12_Functional_Assays
&include=Variant_in_GnomAD
"""

# the lines in the query are joined before we issue it
first20 = rq.get(query.replace('\n', '')).json()

**first20** now contains a dictionary with the following keys:
- **count**: an integer indicating the total number of variants in the database
- **deletedCount**: an integer indicating the number of variants that have been removed from the database
- **synonyms**: an integer, 0 for this example since we didn't search for anything specific. (this will be explained later)
- **data**: a list of the actual variants returned; the length of this list will always be <= page_size
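
The hand-written query string above can also be assembled from a list of parameters. The following is a minimal sketch using the standard library's `urllib.parse.urlencode`; the helper name `build_search_url` and its default arguments are illustrative assumptions, not part of the BRCA Exchange API.

```python
from urllib.parse import urlencode

BASE_URL = "https://brcaexchange.org/backend/data/"

# the source flags used by the queries in this notebook
SOURCES = (
    "Variant_in_ENIGMA", "Variant_in_ClinVar", "Variant_in_1000_Genomes",
    "Variant_in_ExAC", "Variant_in_LOVD", "Variant_in_BIC", "Variant_in_ESP",
    "Variant_in_exLOVD", "Variant_in_ENIGMA_BRCA12_Functional_Assays",
    "Variant_in_GnomAD",
)

def build_search_url(page_num=0, page_size=20, search_term="", sources=SOURCES):
    """Hypothetical helper: assemble a search URL from its parts."""
    params = [
        ("format", "json"),
        ("order_by", "Gene_Symbol"),
        ("direction", "ascending"),
        ("page_size", page_size),
        ("page_num", page_num),
        ("search_term", search_term),
    ]
    # 'include' is a repeated key: one entry per source
    params += [("include", source) for source in sources]
    return BASE_URL + "?" + urlencode(params)

url = build_search_url()
print(url)
```

The resulting string can be passed to `rq.get(url).json()` in the same way as the joined query above.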

Each variant in `data` holds a wealth of information. It's beyond the scope of this document to describe the full contents of the variant object, but for now we'll consider the field `Genomic_Coordinate_hg38` to be a good description of the variant. Specifically, it's the genomic HGVS notation for that variant, but with the chromosome identifier instead of an accession. For example, `chr17:g.43111549:T>C` indicates a variant on chromosome 17 at genomic position 43111549. The `T>C` part indicates that it's a single-nucleotide substitution in which a `T`, thymine, has been replaced (`>`) by a `C`, cytosine. Consult the following site for a thorough explanation of HGVS notation: https://varnomen.hgvs.org/.
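
To make the structure of this field concrete, here is a small sketch that splits a coordinate string into its parts. It assumes every value follows the `chr<chromosome>:g.<position>:<ref>><alt>` shape seen in this notebook; the function name `parse_coordinate` is hypothetical, and any string that doesn't fit the pattern raises a `ValueError`.

```python
import re

# matches strings such as 'chr17:g.43111549:T>C' or 'chr17:g.43098204:AT>A'
COORD_RE = re.compile(
    r"^chr(?P<chrom>[^:]+):g\.(?P<pos>\d+):(?P<ref>[ACGT]+)>(?P<alt>[ACGT]+)$"
)

def parse_coordinate(coord):
    """Return (chromosome, position, reference allele, alternate allele)."""
    match = COORD_RE.match(coord)
    if match is None:
        raise ValueError("unrecognized coordinate format: %r" % coord)
    return (
        match.group("chrom"),
        int(match.group("pos")),
        match.group("ref"),
        match.group("alt"),
    )

parsed = parse_coordinate("chr17:g.43111549:T>C")
print(parsed)  # ('17', 43111549, 'T', 'C')
```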

Let's take a look at that field for the variants we've retrieved.

In [89]:
# let's see which variants they are
print("Genomic coordinates of variants in page:")
pprint([x['Genomic_Coordinate_hg38'] for x in first20['data']])

Genomic coordinates of variants in page:
['chr17:g.43065086:C>T',
 'chr17:g.43122547:T>A',
 'chr17:g.43098204:AT>A',
 'chr17:g.43051983:C>G',
 'chr17:g.43094573:T>G',
 'chr17:g.43080681:C>T',
 'chr17:g.43111828:A>C',
 'chr17:g.43044897:G>A',
 'chr17:g.43082145:T>C',
 'chr17:g.43112322:G>A',
 'chr17:g.43099786:T>C',
 'chr17:g.43099859:G>C',
 'chr17:g.43090063:AG>A',
 'chr17:g.43078027:G>A',
 'chr17:g.43108968:G>C',
 'chr17:g.43082506:C>G',
 'chr17:g.43102706:G>A',
 'chr17:g.43086602:C>T',
 'chr17:g.43071077:T>C',
 'chr17:g.43079318:G>A']


Let's also take a look at the non-empty fields of the first variant in the results. Note that an empty field may be `None` (`null` in JavaScript), the string '-', the string 'None', or simply an empty string, ''.

In [90]:
variant = first20['data'][0]
pprint([(k, variant[k]) for k in variant if variant[k] not in (None, '-', 'None', '')])

[('ClinVarAccession_ENIGMA', 'SCV000244620.1'),
 ('Variant_in_LOVD', True),
 ('DBID_LOVD', 'BRCA1_003341'),
 ('Chr', '17'),
 ('Submitter_ClinVar',
  'Evidence-based_Network_for_the_Interpretation_of_Germline_Mutant_Alleles_(ENIGMA)'),
 ('Submitters_LOVD', 'ENIGMA consortium (Brisbane,AU)'),
 ('Pathogenicity_expert', 'Benign / Little Clinical Significance'),
 ('Method_ClinVar', 'curation'),
 ('BIC_Nomenclature', 'IVS 17-1135G>A'),
 ('Change_Type_id', 6),
 ('Collection_method_ENIGMA', 'Curation'),
 ('Source_URL', 'http://www.ncbi.nlm.nih.gov/clinvar/?term=SCV000244620'),
 ('HGVS_cDNA_LOVD', 'NM_007294.3:c.5075-1135G>A'),
 ('EAS_Allele_frequency_1000_Genomes', '0.0536'),
 ('Reference_Sequence', 'NM_007294.3'),
 ('Individuals_LOVD', '1'),
 ('Variant_in_ExAC', False),
 ('Allele_Origin_ClinVar', 'germline'),
 ('Variant_in_1000_Genomes', True),
 ('Edited_date_LOVD', '2017-08-18 17:32:14'),
 ('AFR_Allele_frequency_1000_Genomes', '0.0015'),
 ('EUR_Allele_frequency_1000_Genomes', '0.0'),
 ('Sour

## Navigating the Results
As mentioned, the API will only return a single page of results at a time. The number of results per page is specified in the parameter `page_size`. `page_num` indicates the page you're currently viewing, with page 0 being the first one in the result set. To retrieve the next page, you must increment the argument `page_num` while keeping the rest of the parameters the same. Here's the next page of variants, for example.

In [91]:
page_num = 1
next20 = rq.get("".join(("""
https://brcaexchange.org/backend/data/
?format=json
&order_by=Gene_Symbol
&direction=ascending
&page_size=20
&page_num=%(page_num)d
&search_term=
&include=Variant_in_ENIGMA
&include=Variant_in_ClinVar
&include=Variant_in_1000_Genomes
&include=Variant_in_ExAC&include=Variant_in_LOVD
&include=Variant_in_BIC&include=Variant_in_ESP
&include=Variant_in_exLOVD&include=Variant_in_ENIGMA_BRCA12_Functional_Assays
&include=Variant_in_GnomAD
""" % {'page_num': page_num}).split('\n'))).json()

# as before, next20 contains 20 elements, but they're not the same elements
print("Variants in page: %d" % len(next20['data']))

# again, let's see which variants are on this page
print("Genomic coordinates of variants in page %d:" % page_num)
pprint([x['Genomic_Coordinate_hg38'] for x in next20['data']])

Variants in page: 20
Genomic coordinates of variants in page 1:
['chr17:g.43098204:AT>A',
 'chr17:g.43044897:G>A',
 'chr17:g.43078027:G>A',
 'chr17:g.43080681:C>T',
 'chr17:g.43099409:TAC>T',
 'chr17:g.43122547:T>A',
 'chr17:g.43108968:G>C',
 'chr17:g.43094113:T>C',
 'chr17:g.43101650:G>C',
 'chr17:g.43050608:CA>C',
 'chr17:g.43082506:C>G',
 'chr17:g.43111828:A>C',
 'chr17:g.43071077:T>C',
 'chr17:g.43086602:C>T',
 'chr17:g.43082145:T>C',
 'chr17:g.43112322:G>A',
 'chr17:g.43099859:G>C',
 'chr17:g.43099786:T>C',
 'chr17:g.43051983:C>G',
 'chr17:g.43064696:T>A']
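
Incrementing `page_num` by hand gets tedious for larger result sets. The sketch below wraps the paging logic in a generator. It relies on the `count` and `data` keys described earlier and stops once `count` variants have been seen or a page comes back empty; `iter_variants` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real request so that the example runs offline.

```python
def iter_variants(fetch_page):
    """Yield every variant, requesting one page at a time.

    fetch_page(page_num) must return a dict with 'count' and 'data' keys,
    like the JSON responses shown above.
    """
    page_num = 0
    seen = 0
    while True:
        page = fetch_page(page_num)
        data = page["data"]
        for variant in data:
            yield variant
        seen += len(data)
        # stop when everything has been seen, or when a page is empty
        if not data or seen >= page["count"]:
            break
        page_num += 1

# offline stand-in for the API: five made-up records, two per page
def fake_fetch(page_num, page_size=2):
    records = [{"id": i} for i in range(5)]
    start = page_num * page_size
    return {"count": len(records), "data": records[start:start + page_size]}

ids = [variant["id"] for variant in iter_variants(fake_fetch)]
print(ids)  # [0, 1, 2, 3, 4]
```

Against the real backend, `fetch_page` would issue the same query as above with the given `page_num` and return the decoded JSON.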


## Filtering the Variants
Extending the example from before, let's supply some criteria to narrow down the list. Criteria are specified by adding `filter=<type>` and then `filterValue=<value>` later in the parameter list. The following filter types are available:

- `Pathogenicity_expert`: pathogenicity of the variant as determined by ENIGMA; values are one of the following:
  - Pathogenic
  - Likely Pathogenic
  - Benign / Little Clinical Significance
  - Likely Benign
  - Not Yet Reviewed
- `Gene_Symbol`: the HUGO symbol for the gene in which the variant occurs; may be one of:
  - BRCA1
  - BRCA2

Omitting a filter removes the restriction, e.g. leaving out `Gene_Symbol` returns results for both genes. Each filter type can be specified only once, and filter types are ANDed together, i.e. a variant must match all filters to be returned. If you specify multiple criteria, the order of the `filter` parameters must match the order of the `filterValue` parameters. For instance, if both types are specified and `filter=Gene_Symbol` comes first in the parameter list, then `filterValue=BRCA1` must occur before `filterValue=Likely Benign`.
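
One way to keep filter types and values in matching order is to generate both from a single ordered list of pairs. This is a sketch only; `filter_params` is a hypothetical helper, and it emits each `filter` immediately followed by its `filterValue`, which is one ordering that satisfies the rule above.

```python
from urllib.parse import urlencode

def filter_params(filters):
    """Turn an ordered list of (filter_type, value) pairs into query parameters."""
    seen = set()
    params = []
    for filter_type, value in filters:
        # each filter type can be specified only once
        if filter_type in seen:
            raise ValueError("filter specified more than once: %s" % filter_type)
        seen.add(filter_type)
        params.append(("filter", filter_type))
        params.append(("filterValue", value))
    return params

query_string = urlencode(filter_params([
    ("Gene_Symbol", "BRCA1"),
    ("Pathogenicity_expert", "Likely Benign"),
]))
print(query_string)
# filter=Gene_Symbol&filterValue=BRCA1&filter=Pathogenicity_expert&filterValue=Likely+Benign
```

Note that `urlencode` encodes the space in `Likely Benign` as `+`, the usual form-encoding for query strings.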

Let's try querying for pathogenic BRCA2 variants.

In [92]:
query = """
https://brcaexchange.org/backend/data/
?format=json
&filter=Pathogenicity_expert
&filterValue=Pathogenic
&filter=Gene_Symbol
&filterValue=BRCA2
&order_by=Gene_Symbol
&direction=ascending
&page_size=20
&page_num=0
&search_term=
&include=Variant_in_ENIGMA
&include=Variant_in_ClinVar
&include=Variant_in_1000_Genomes
&include=Variant_in_ExAC
&include=Variant_in_LOVD
&include=Variant_in_BIC
&include=Variant_in_ESP
&include=Variant_in_exLOVD
&include=Variant_in_ENIGMA_BRCA12_Functional_Assays
&include=Variant_in_GnomAD
"""
results = rq.get(query.replace('\n', '')).json()

print("Number of pathogenic BRCA2 variants: %d" % results['count'])

Number of pathogenic BRCA2 variants: 2637


## Searching for a Variant

Say that we know a particular value associated with our variant, e.g. its HGVS string or its SCV accession number from ClinVar. We can use the `search_term` parameter to supply this value; variants that contain the value in any of a large number of columns will be returned as search results. This is the same parameter that is filled in when you use the search boxes on the BRCA Exchange website.

Here's an example of querying for a variant by its HGVS cDNA string (i.e., the change as located within a coding transcript):

In [108]:
# we're searching for the variant NM_007294.3:c.2389G>T, i.e. a change described relative to
# the RefSeq transcript accession NM_007294.3 (the .3 is the version of the transcript)
# c. indicates that it's a change in a coding sequence
# it occurs at position 2389, and is a single-nucleotide substitution from G to T
query = """
https://brcaexchange.org/backend/data/
?format=json
&search_term=NM_007294.3:c.2389G>T
&include=Variant_in_ENIGMA
&include=Variant_in_ClinVar
&include=Variant_in_1000_Genomes
&include=Variant_in_ExAC
&include=Variant_in_LOVD
&include=Variant_in_BIC
&include=Variant_in_ESP
&include=Variant_in_exLOVD
&include=Variant_in_ENIGMA_BRCA12_Functional_Assays
&include=Variant_in_GnomAD
"""
response = rq.get(query.replace('\n', '')).json()
print("Variants matching query: %d" % response['count'])

Variants matching query: 1


Nice, only one variant as expected. We can inspect the `data` field for our search results as before.

Note that the supplied string might not match the canonical data for that variant; there are many nomenclatures for describing genetic variants, some of them defunct but still used in other databases or in the literature. In these cases, the search will still return the variant, but will indicate that it matched on a synonym instead of the variant's canonical data. For example, here's a search for a synonym of the canonical variant `chr17:g.43067763:T>C`, expressed as the BIC designation `IVS16-68_A>G`:

In [120]:
query = """
https://brcaexchange.org/backend/data/
?format=json
&search_term=IVS16-68_A>G
&include=Variant_in_ENIGMA
&include=Variant_in_ClinVar
&include=Variant_in_1000_Genomes
&include=Variant_in_ExAC
&include=Variant_in_LOVD
&include=Variant_in_BIC
&include=Variant_in_ESP
&include=Variant_in_exLOVD
&include=Variant_in_ENIGMA_BRCA12_Functional_Assays
&include=Variant_in_GnomAD
"""
response = rq.get(query.replace('\n', '')).json()
print("Variants matching query: %d" % response['count'])
print("Synonym matches: %d" % response['synonyms'])

Variants matching query: 1
Synonym matches: 1


Both `count` and `synonyms` contain 1, indicating there was a single result and that that result was a match on a synonym.

On a related note, the `deletedCount` field is non-zero when the only variants matching a search are ones that have been deleted from BRCA Exchange. In those cases, `count` *is* zero.
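
The three counters can be read together to classify the outcome of a search. The following sketch encodes the rules described above; `describe_search` is a hypothetical helper, and the dictionaries passed to it are hand-made stand-ins for real responses.

```python
def describe_search(response):
    """Summarize a search response using its count, synonyms and deletedCount keys."""
    count = response.get("count", 0)
    synonyms = response.get("synonyms", 0)
    deleted = response.get("deletedCount", 0)
    if count == 0 and deleted > 0:
        return "only deleted variants matched"
    if count == 0:
        return "no matches"
    if synonyms > 0:
        return "%d match(es), %d via synonym" % (count, synonyms)
    return "%d match(es) on canonical data" % count

# shaped like the synonym search above
summary = describe_search({"count": 1, "synonyms": 1, "deletedCount": 0})
print(summary)  # 1 match(es), 1 via synonym
```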

## Alternate Formats

The `format` parameter can take values other than the `json` we've been using so far. Specifically, it can take `csv` or `tsv` to produce a comma-separated or tab-separated list of newline-delimited records, respectively. Note that paging is not used in this case, so the response contains the full set of results. The first line is always a header that contains the names of the columns in the rows that follow it.
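
Because the first line is a header, such a response maps naturally onto the standard library's `csv.DictReader`. The sketch below parses a short, hand-written sample rather than a real response; the two column names are assumed to match the JSON field names used earlier, and a real export would contain many more columns.

```python
import csv
import io

def parse_delimited(text, fmt="csv"):
    """Parse a csv or tsv response body into a list of dicts keyed by column name."""
    delimiter = "," if fmt == "csv" else "\t"
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

# illustrative two-column excerpt, not actual API output
sample = (
    "Gene_Symbol,Genomic_Coordinate_hg38\n"
    "BRCA1,chr17:g.43065086:C>T\n"
    "BRCA1,chr17:g.43122547:T>A\n"
)

rows = parse_delimited(sample, "csv")
print(len(rows))                           # 2
print(rows[0]["Genomic_Coordinate_hg38"])  # chr17:g.43065086:C>T
```

With a live query, the text to parse would be the `.text` attribute of the response returned by `rq.get(...)`.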

# Getting Variant Details
While the variant search endpoint produces quite a lot of data per variant, we can request even more information about variants on a per-variant basis. In order to query these endpoints, you must have the BRCA Exchange database ID of the variant, i.e. the `id` field in the variant data object we saw before.

There are a few variant-specific endpoints:
- `/backend/data/variant/?variant_id=<id>`: returns a data payload like the variant search, but with the full version history included
- `/backend/data/variant/<id>/reports`: returns reports from sources which underlie the information on BRCA Exchange
- `/backend/data/variantpapers/?variant_id=<id>`: returns literature references to this variant found by our literature crawler
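
As a convenience, the three URLs can be built from a variant ID in one place. This is a sketch with a hypothetical helper name, `variant_urls`; the URL shapes follow the calls used in this notebook.

```python
BACKEND = "https://brcaexchange.org/backend/data"

def variant_urls(variant_id):
    """Return the variant-specific endpoint URLs for a BRCA Exchange database ID."""
    return {
        "history": "%s/variant/?variant_id=%d" % (BACKEND, variant_id),
        "reports": "%s/variant/%d/reports" % (BACKEND, variant_id),
        "papers": "%s/variantpapers/?variant_id=%d" % (BACKEND, variant_id),
    }

urls = variant_urls(224971)
for name, url in urls.items():
    print(name, url)
```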

## Variant History

Roughly once a month, the BRCA Exchange database is refreshed from external sources, producing a new "release" of the dataset. Each release is tagged with a numeric ID; the releases are listed at [BRCA Exchange Releases](https://brcaexchange.org/releases).

During this refresh, each variant is compared against the new data; if there's been a change, then a new entry for the variant is generated in the version history. (The release ID is included in the variant version as the field `Data_Release_id` from the search endpoints we used earlier.) The full description of a variant thus consists of its entire history, from when it was first added to the database to the present (or, in the case of variants that have been deleted, to the revision in which it was deleted). **The history is always in reverse chronological order**, with element 0 being the most recent version, and the last element being the version in which the variant was first introduced.

For example, here are the details for the first variant in the list of all variants that we queried earlier. (Note that the search features we covered earlier return variants from the most recent release by default.)

In [93]:
# for the following examples we'll consider the first variant we retrieved from the 'all variants' query we conducted before
variant_id = 224971

response = rq.get("""https://brcaexchange.org/backend/data/variant/?variant_id=%d""" % variant_id).json()

The response is almost the same as the one from the search endpoints we covered before, but instead of a single entry per variant, `data` holds a list of entries, one per release in which this variant changed. This variant was introduced in release 1 and has been through 11 revisions, which we confirm below. Note that the release IDs are ordered (newest first) but not consecutive, indicating that this variant changed in only some of the releases.

In [94]:
print("Number of revisions: %d" % len(response['data']))
pprint([x['Data_Release']['id'] for x in response['data']])

Number of revisions: 11
[30, 25, 23, 22, 21, 18, 12, 11, 10, 2, 1]
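
Because the history is in reverse chronological order, the newest and oldest versions sit at the two ends of the list. The sketch below works on the release IDs printed above; it assumes release IDs are consecutive integers when it lists the releases in which the variant did not change, and `summarize_history` is a hypothetical helper.

```python
def summarize_history(release_ids):
    """Summarize a list of release IDs given in reverse chronological order."""
    latest, first = release_ids[0], release_ids[-1]
    # assumption: every integer between first and latest is a release
    unchanged = sorted(set(range(first, latest + 1)) - set(release_ids))
    return {
        "latest": latest,
        "first": first,
        "newest_first": all(a > b for a, b in zip(release_ids, release_ids[1:])),
        "unchanged_in": unchanged,
    }

# release IDs copied from the output above
history = summarize_history([30, 25, 23, 22, 21, 18, 12, 11, 10, 2, 1])
print(history["latest"], history["first"])  # 30 1
print(len(history["unchanged_in"]))         # 19
```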


Let's take a closer look at the variant data, specifically the `Data_Release` field.

In [95]:
response['data'][0]['Data_Release']

{'name': 29,
 'notes': 'Release notes for BRCA Exchange data version 29, dated April 12, 2019\n\n\nThis is the most recent update of the BRCA Exchange variant data since the release in March, 2019. It includes variant data from 1000 Genomes, BIC, ClinVar, ESP, ExAC, ExUV, LOVD, ENIGMA, and Findlay et al (PMID: 30209399). The data was assembled with an automated Luigi pipeline, running software from https://github.com/BRCAChallenge/brca-exchange/tree/data_release_2019-04-12\n\n\nChanges in this release:\n* Due to some issues processing functional analysis data, this release contains the same LOVD data as the release from 14 February, 2019.\n* Because BIC is no longer being updated, this release contains the same BIC data as the release from March 22, 2019. \n',
 'md5sum': '',
 'id': 30,
 'sources': 'Bic, ClinVar, ESP, ExAC, ENIGMA, LOVD, ExUV, 1000 Genomes, Findlay BRCA1 Ring Function Scores',
 'date': '2019-04-12T00:00:00',
 'archive': 'release-04-12-19.tar.gz',
 'schema': ''}

Unlike the search endpoints, which include just the `Data_Release_id` per variant, this endpoint returns much more information about the release, including its description, the sources involved in its generation, and its release date.

### Differential History

To make it easier to determine what's changed between revisions, each version of the variant includes a field called `Diff`; this field lists the data that has been added, removed, and changed in that version compared to the previous one. Let's take a look at the `Diff` field for the second version of our variant, i.e. the second-to-last element of the list, since the history is in reverse chronological order:

In [96]:
response['data'][-2]['Diff']

[{'field': 'Allele_Frequency',
  'removed': '-',
  'added': '0.357029 (1000 Genomes)',
  'field_type': 'individual'},
 {'field': 'EAS_Allele_frequency_1000_Genomes',
  'removed': '-',
  'added': '0.372',
  'field_type': 'individual'},
 {'field': 'AFR_Allele_frequency_1000_Genomes',
  'removed': '-',
  'added': '0.2345',
  'field_type': 'individual'},
 {'field': 'EUR_Allele_frequency_1000_Genomes',
  'removed': '-',
  'added': '0.3539',
  'field_type': 'individual'},
 {'field': 'Source',
  'removed': None,
  'added': ['1000_Genomes'],
  'field_type': 'list'},
 {'field': 'SAS_Allele_frequency_1000_Genomes',
  'removed': '-',
  'added': '0.498',
  'field_type': 'individual'},
 {'field': 'Max_Allele_Frequency',
  'removed': '-',
  'added': '0.498000 (SAS from 1000 Genomes)',
  'field_type': 'individual'},
 {'field': 'Allele_frequency_1000_Genomes',
  'removed': '-',
  'added': '0.357029',
  'field_type': 'individual'},
 {'field': 'AMR_Allele_frequency_1000_Genomes',
  'removed': '-',
  'ad

It appears that this is the version in which the 1000 Genomes allele-frequency data for this variant was first populated.
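
A `Diff` list can be scanned programmatically to find fields that went from empty to populated. The sketch below reuses the empty-value markers from earlier in this notebook and runs on two entries copied from the output above; `newly_populated` is a hypothetical helper.

```python
# values treated as "empty" elsewhere in this notebook
EMPTY = (None, "-", "None", "")

def newly_populated(diff):
    """Return the names of fields that were empty before and have a value now."""
    return [
        entry["field"]
        for entry in diff
        if entry["removed"] in EMPTY and entry["added"] not in EMPTY
    ]

# two entries copied from the Diff output above
sample_diff = [
    {"field": "Allele_Frequency", "removed": "-",
     "added": "0.357029 (1000 Genomes)", "field_type": "individual"},
    {"field": "Source", "removed": None,
     "added": ["1000_Genomes"], "field_type": "list"},
]

fields = newly_populated(sample_diff)
print(fields)  # ['Allele_Frequency', 'Source']
```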

## Variant Reports

Information about variants is aggregated from a number of sources; each evidence point that supports the information is also collected by BRCA Exchange, and is called a *report*. You can get the full list of reports for a variant like so (again, for our example variant):

In [97]:
reports = rq.get("https://brcaexchange.org/backend/data/variant/%d/reports" % variant_id).json()

In [98]:
print("Number of reports: %d" % len(reports['data']))
pprint([(x['Source'], x['Data_Release']['id']) for x in reports['data']])

Number of reports: 13
[('ClinVar', 30),
 ('ClinVar', 25),
 ('ClinVar', 23),
 ('ClinVar', 22),
 ('ClinVar', 21),
 ('ClinVar', 18),
 ('ClinVar', 12),
 ('ClinVar', 11),
 ('ClinVar', 10),
 ('LOVD', 30),
 ('LOVD', 23),
 ('LOVD', 22),
 ('LOVD', 21)]


Each report includes the full data about the variant as of the release in which the report was collected, which preserves the context of the report. In addition, the report records the source of the submission in its `Source` field.
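
Since every report carries its `Source` and `Data_Release`, the list is easy to summarize per source. The sketch below rebuilds just those two fields from the output above (real reports contain far more data) and counts the reports per source, along with the newest release in which each source appears.

```python
from collections import Counter

# (source, release id) pairs copied from the output above
pairs = [
    ("ClinVar", 30), ("ClinVar", 25), ("ClinVar", 23), ("ClinVar", 22),
    ("ClinVar", 21), ("ClinVar", 18), ("ClinVar", 12), ("ClinVar", 11),
    ("ClinVar", 10), ("LOVD", 30), ("LOVD", 23), ("LOVD", 22), ("LOVD", 21),
]
sample_reports = [
    {"Source": source, "Data_Release": {"id": release_id}}
    for source, release_id in pairs
]

# number of reports contributed by each source
per_source = Counter(report["Source"] for report in sample_reports)

# newest release in which each source has a report
latest_release = {}
for report in sample_reports:
    source = report["Source"]
    release_id = report["Data_Release"]["id"]
    latest_release[source] = max(latest_release.get(source, release_id), release_id)

print(dict(per_source))  # {'ClinVar': 9, 'LOVD': 4}
print(latest_release)    # {'ClinVar': 30, 'LOVD': 30}
```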

## Literature References

In addition to reports, BRCA Exchange also harvests references to variants in academic papers from a variety of sources (typically PubMed). Note that this feature is a work in progress and currently in beta, so take the current data with a grain of salt.

You can view the references to our candidate variant using the following endpoint:

In [99]:
refs = rq.get('https://brcaexchange.org/backend/data/variantpapers/?variant_id=%d' % variant_id).json()
refs

{'data': []}

Well, that's a little disappointing...it seems we didn't find any references in the literature for this variant. Let's try another:

In [101]:
refs = rq.get('https://brcaexchange.org/backend/data/variantpapers/?variant_id=%d' % 223762).json()
refs['data']

[{'crawl_date': '2019-08-26T23:33:38',
  'title': 'High proportion of recurrent germline mutations in the BRCA1 gene in breast and ovarian cancer patients from the Prague area.',
  'url': '',
  'journal': 'Breast cancer research : BCR',
  'authors': 'Pohlreich, Petr; Zikan, Michal; Stribrna, Jana; Kleibl, Zdenek; Janatova, Marketa; Kotlas, Jaroslav; Zidovska, Jana; Novotny, Jan; Petruzelka, Lubos; Szabo, Csilla; Matous, Bohuslav',
  'id': 265,
  'deleted': False,
  'points': 10,
  'year': 2005,
  'keywords': 'Breast Neoplasms/Czech Republic/DNA Mutational Analysis/DNA, Neoplasm/Ethnic Groups/Exons/Family/Female/Gene Amplification/Genes, BRCA1/Genes, BRCA2/Humans/Ovarian Neoplasms/Polymerase Chain Reaction/RNA, Neoplasm/Risk Factors',
  'mentions': ['61Gly  Sequencing  1  29  -   F-43  11  c.1135delA  c.1016delA  p.Lys339fsX340  PTT  2  41  Colon (50), lung (64)   F-361  11  c.1246delA  c.1127delA <<< p.Asn376fs>>>X393  PTT  1  37  Ovarian (52, 54, 55)   F-21  11  c.1806C>T  c.1687C>T  

It's a bit beyond the scope of this document to describe these results in full, but the fields are fairly self-explanatory.

Note that these results are independent of the version of the variant; instead, the field `crawl_date` provides the date at which the citation was collected. Compare that date to the one in the `Data_Release` object for an idea of what the evidence for that variant looked like at the time the citation was collected. More importantly, refer to the year in which the paper was published for a better idea of the context in which this citation should be regarded.
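
Both dates are ISO 8601 strings, so they can be compared directly once parsed. The sketch below uses a `crawl_date` and a release `date` copied from outputs in this notebook purely for illustration (they belong to different example variants); it assumes Python 3.7 or later for `datetime.fromisoformat`.

```python
from datetime import datetime

# values copied from outputs earlier in this notebook, for illustration only
crawl_date = datetime.fromisoformat("2019-08-26T23:33:38")
release_date = datetime.fromisoformat("2019-04-12T00:00:00")

# how long after the release the citation was collected
lag = crawl_date - release_date
print(lag.days)  # 136

# the publication year gives further context for the citation
paper_year = 2005
print(crawl_date.year - paper_year)  # 14
```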