# Demo of endpoint rarest_variants

Documentation: http://geco.deib.polimi.it/popstudy/api/ui/#/default/server.api.rarest_variants

Requirements to run this demo: https://github.com/tomalf2/data_summarization_1KGP/blob/master/demo/README_requirements.txt

In this demo, we ask for the rarest variants found in a population of healthy female individuals from East Asian countries who carry both of the following variants, each written as (chromosome)-(start)-(reference allele)-(alternative allele):

 1-13271-G-C 

 1-10176--C 
 
 Both are aligned to assembly hg19; the empty reference allele in the second variant denotes an insertion.
 Let's begin by building the request body that selects this population.

In [9]:
import json
population = {
    'having_meta': {
        'health_status': "true",
        'super_population': ['EAS'],
        'gender': 'female',
        'assembly': 'hg19'
        },
    'having_variants': {
        'with': [
            {'chrom': 1, 'start': 10176, 'ref': '', 'alt': 'C'},
            {'chrom': 1, 'start': 13271, 'ref': 'G', 'alt': 'C'},
            ]
    }
}
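
The variant notation used above maps directly onto the dictionaries listed under `having_variants`. As a minimal sketch (the helper name `variant_from_key` is hypothetical and not part of the API, and keeping non-numeric chromosome names as strings is an assumption), the conversion could look like this:

```python
def variant_from_key(key):
    """Convert 'chrom-start-ref-alt' into the dict shape used in 'having_variants'."""
    # An empty field between two dashes yields an empty allele, e.g. '1-10176--C'.
    chrom, start, ref, alt = key.split('-')
    return {
        # Assumption: numeric chromosomes are sent as integers, others kept as text.
        'chrom': int(chrom) if chrom.isdigit() else chrom,
        'start': int(start),
        'ref': ref,
        'alt': alt,
    }


variants = [variant_from_key(k) for k in ('1-10176--C', '1-13271-G-C')]
print(variants)
```

The resulting list has the same content as the `with` entry written by hand in the cell above.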

# Considerations and preliminary operations

Since the endpoints /rarest_variants and /most_common_variants have to analyze millions of variants (about 4.5 million for each individual), they can take some time to return a response. So before sending our request, let's check the population size with the help of the endpoint /donor_distribution. Remember that /donor_distribution also requires the mandatory attribute distribute_by.

In [10]:
dd_param = population.copy()
# distribute_by accepts a list of attribute names. Gender is an arbitrary choice here; any other attribute, fixed or not, works as well.
dd_param['distribute_by'] = ['gender'] 
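
A note on `population.copy()`: it makes a shallow copy, which is sufficient here because only a new top-level key is added. The sketch below uses a small stand-in dictionary (named `example`, not the real request body) to show when `copy.deepcopy` from the standard library would be needed instead:

```python
import copy

example = {'having_meta': {'gender': 'female'}}

# Shallow copy: the outer dict is new, the nested dicts are shared.
shallow = example.copy()
shallow['distribute_by'] = ['gender']          # top-level key: original is untouched
shallow['having_meta']['gender'] = 'male'      # nested edit: visible in the original too
leaked = example['having_meta']['gender']

# Deep copy: nested dicts are duplicated, so edits stay local.
example['having_meta']['gender'] = 'female'    # restore the original value
deep = copy.deepcopy(example)
deep['having_meta']['gender'] = 'male'

print(leaked, example['having_meta']['gender'])
```

If a later cell needs to change something inside `having_meta` or `having_variants` for one request only, a deep copy avoids altering `population` by accident.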

Send the request parameters to /donor_distribution and inspect the response body.

In [11]:
import requests
import pandas as pd
from matplotlib import pyplot as plt

donors_request = requests.post('http://geco.deib.polimi.it/popstudy/api/donor_distribution', json=dd_param)
print(' response status code: {}'.format(donors_request.status_code))
dd_response_body = donors_request.json()
# inspect the result
dd_df = pd.DataFrame.from_records(dd_response_body['rows'], columns=dd_response_body['columns'])
dd_df.fillna(value='ANY', inplace=True)    # replace None (meaning any value) with 'ANY'
dd_df

response status code: 200


Unnamed: 0,GENDER,DONORS
0,ANY,22
1,female,22


We have just seen that the population considered in this example contains 22 individuals, which means that the API is going to analyze about 99 million variants. As you can imagine, finding the rarest variants in this set takes some time: in this case, about 3 minutes (execution time can be roughly estimated as 8 seconds × population size). To reduce the population size further, you can add constraints on the country of origin (for example, selecting only the BEB (Bangladesh) population) or on the DNA source type (for example, only blood), or add further region constraints.
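
The arithmetic behind these figures can be written down as a small sketch. The two constants are the approximate values quoted in the text, not measured guarantees, and the function name `estimate_workload` is only illustrative:

```python
VARIANTS_PER_INDIVIDUAL = 4_500_000   # approximate figure quoted in the text
SECONDS_PER_INDIVIDUAL = 8            # rough rule of thumb quoted in the text


def estimate_workload(population_size):
    """Return (variants to analyze, estimated seconds) for a population size."""
    variants = population_size * VARIANTS_PER_INDIVIDUAL
    seconds = population_size * SECONDS_PER_INDIVIDUAL
    return variants, seconds


variants, seconds = estimate_workload(22)
print('{:,} variants, about {:.1f} minutes'.format(variants, seconds / 60))
```

For the 22 donors found above this gives 99,000,000 variants and 176 seconds, which is the "about 3 minutes" mentioned in the text.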

# Find the rarest variants

Continuing with the initial goal, we now POST to /rarest_variants. Before doing so, we add a few optional parameters that ask for the 30 rarest variants from each genomic variation data source and filter out variants with a frequency below 0.99%.

In [12]:
rv_param = population.copy()
rv_param['filter_output'] = {
    'limit': 30,
    'min_frequency': 0.0099
}
# send the request
rarest_request = requests.post('http://geco.deib.polimi.it/popstudy/api/rarest_variants', json=rv_param)
print(' response status code: {}'.format(rarest_request.status_code))
rarest_response_body = rarest_request.json()

response status code: 200
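
Because this request can run for minutes, it may be worth making the waiting time and error handling explicit. The following is a hedged sketch rather than part of the original demo: the helper name `post_json` and the 600-second default are assumptions, while `timeout`, `raise_for_status()` and `json()` are standard `requests` features.

```python
def post_json(url, body, timeout_s=600, post=None):
    """POST a JSON body and return the decoded JSON response.

    timeout_s is an assumed upper bound, chosen to be well above the
    roughly 3 minutes estimated for this population.
    """
    if post is None:
        import requests  # imported lazily so the helper can be exercised offline
        post = requests.post
    response = post(url, json=body, timeout=timeout_s)
    response.raise_for_status()   # raise on 4xx/5xx instead of parsing an error body
    return response.json()


# Usage with the objects defined in this notebook (not executed here):
# rarest_response_body = post_json(
#     'http://geco.deib.polimi.it/popstudy/api/rarest_variants', rv_param)
```

The optional `post` argument only exists so that the function can be tried with a stand-in callable; in normal use it is left out.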


# Inspect response data
The response includes the 30 rarest mutations (from each data source) found in the individuals of the selected population with a frequency greater than or equal to 0.99%, ordered by ascending frequency and occurrence of the variant.

In [13]:
columns = rarest_response_body['columns']
rows = rarest_response_body['rows']
df = pd.DataFrame.from_records(rows, columns=columns)
df.fillna(value='', inplace=True)    # replace None values (e.g. an empty REF allele) with ''
df

Unnamed: 0,CHROM,START,REF,ALT,POPULATION_SIZE,POSITIVE_DONORS,OCCURRENCE_OF_VARIANT,FREQUENCY_OF_VARIANT
0,1,76836,T,G,22,1,1,0.022727
1,1,77872,G,A,22,1,1,0.022727
2,1,72524,A,G,22,1,1,0.022727
3,1,74790,G,A,22,1,1,0.022727
4,1,72296,,TAT,22,1,1,0.022727
5,1,77864,C,T,22,1,1,0.022727
6,1,64929,G,A,22,1,1,0.022727
7,1,68594,T,G,22,1,1,0.022727
8,1,74788,C,G,22,1,1,0.022727
9,1,60349,A,G,22,1,1,0.022727
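
The FREQUENCY_OF_VARIANT values shown above are consistent with dividing the occurrence by twice the population size, that is, counting two allele copies per individual. This relationship is inferred from the numbers in the table, not taken from the API documentation, so treat the sketch below as an assumption:

```python
def allele_frequency(occurrence, population_size, copies_per_individual=2):
    """Occurrence divided by the total number of allele copies (assumed diploid)."""
    return occurrence / (copies_per_individual * population_size)


freq = allele_frequency(occurrence=1, population_size=22)
print(round(freq, 6))

# The same value also passes the filter set earlier (min_frequency = 0.0099).
passes_filter = freq >= 0.0099
```

With one occurrence among 22 donors this gives 1/44, which rounds to the 0.022727 reported in every row shown.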
