# Demo of endpoint rarest_variants

Documentation: 
- http://geco.deib.polimi.it/popstudy/api/ui/#/default/server.api.rarest_variants
- http://geco.deib.polimi.it/popstudy/api/ui/#/default/server.api.donor_distribution

Requirements to run this demo: https://github.com/tomalf2/data_summarization_1KGP/blob/master/demo/README_requirements

Try this demo online: https://colab.research.google.com/drive/1v5ZFDh1KyHjs9ItN-zr4jOTNbhyHVrDK

In this demo, we are going to ask for the rarest variants found in a population of healthy female individuals from the BEB population (Bengali in Bangladesh, the population code used in the request below) who carry the following two variants, each described as a tuple (chromosome)-(start)-(reference allele)-(alternative allele):

- 1-13272-G-C
- 1-10177--C (empty reference allele)

Both variants are aligned on assembly hg19. Let's begin by building the body parameters that select this population.

In [1]:
import json
population = {
    'having_meta': {
        'health_status': "true",
        'population': ['BEB'],
        'gender': 'female',
        'assembly': 'hg19'
        },
    'having_variants': {
        'with': [
            {'chrom': 1, 'start': 10177, 'ref': '', 'alt': 'C'},
            {'chrom': 1, 'start': 13272, 'ref': 'G', 'alt': 'C'},
            ]
    }
}
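
The variant tuples quoted in the introduction can be converted into `having_variants` entries mechanically. The following is a minimal sketch, assuming the `chrom-start-ref-alt` notation used in this demo; `parse_variant` is a hypothetical helper, not part of the API, and it only handles numeric chromosome names.

```python
def parse_variant(text):
    """Convert 'chrom-start-ref-alt' into a dict shaped like the entries of 'with'."""
    parts = text.strip().split('-')
    if len(parts) != 4:
        raise ValueError('expected chrom-start-ref-alt, got {!r}'.format(text))
    chrom, start, ref, alt = parts
    # An empty ref or alt (as in '1-10177--C') is kept as an empty string.
    return {'chrom': int(chrom), 'start': int(start), 'ref': ref, 'alt': alt}

variants = [parse_variant(v) for v in ('1-10177--C', '1-13272-G-C')]
print(variants)
```

The resulting list has the same shape as the value of `'with'` in the request body above.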

# Considerations and preliminary operations

Because the endpoints /rarest_variants and /most_common_variants have to analyze millions of variants (about 4.5 million for each individual), they can take some time to return a response. So before sending our request, let's check the population size with the help of the endpoint /donor_distribution. Remember that /donor_distribution also requires the mandatory attribute distribute_by.

In [2]:
dd_param = population.copy()
# distribute_by accepts a list of attribute names. We arbitrarily chose gender, but other attributes, fixed or non-fixed, are accepted too.
dd_param['distribute_by'] = ['gender'] 
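
`dict.copy()` makes a shallow copy, which is enough in the cell above because only a new top-level key is added. If a later request needs to change a nested constraint, `copy.deepcopy` avoids altering the original dictionary. A small sketch on a reduced, illustrative dictionary (`base` is not the full `population` object):

```python
import copy

base = {'having_meta': {'gender': 'female', 'assembly': 'hg19'}}

shallow = base.copy()
shallow['distribute_by'] = ['gender']      # top-level key: only the copy gets it
shallow['having_meta']['gender'] = 'male'  # nested dict is shared with base

print('distribute_by' in base)             # False
print(base['having_meta']['gender'])       # male: base was changed through the copy

base['having_meta']['gender'] = 'female'   # restore the original value
deep = copy.deepcopy(base)
deep['having_meta']['gender'] = 'male'     # nested dict is independent now
print(base['having_meta']['gender'])       # female
```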

Send the request parameters to /donor_distribution and inspect the response body:

In [3]:
import requests
import pandas as pd
from matplotlib import pyplot as plt

donors_request = requests.post('http://geco.deib.polimi.it/popstudy/api/donor_distribution', json=dd_param)
print(' response status code: {}'.format(donors_request.status_code))
dd_response_body = donors_request.json()
# inspect the result
dd_df = pd.DataFrame.from_records(dd_response_body['rows'], columns=dd_response_body['columns'])
dd_df.fillna(value='ANY', inplace=True)    # replaces Nones (== any value) with 'ANY'
dd_df

response status code: 200


| | GENDER | DONORS |
|---|---|---|
| 0 | ANY | 7 |
| 1 | female | 7 |


We have just shown that the population considered in this example contains 7 individuals, which means that the API is going to analyze ~28 million variants. As you can imagine, finding the rarest variants in this set requires some time. In this case it will take about 1 minute to answer the request (execution time can be roughly estimated as 8 seconds × population size). If you wish to reduce the population size further, you can add constraints, for example on the DNA source type (blood/lcl), or add further region constraints.
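
The rule of thumb above can be written down as a small helper. This is only a sketch: `estimate_seconds` is a hypothetical function, the 8 seconds per donor is the rough figure quoted in the text, and the row layout `[GENDER, DONORS]` is assumed from the table shown earlier.

```python
SECONDS_PER_DONOR = 8  # rough figure quoted in the text, not a guarantee

def estimate_seconds(population_size, seconds_per_donor=SECONDS_PER_DONOR):
    """Rough execution-time estimate for a rarest-variants request."""
    return population_size * seconds_per_donor

# Rows as returned by donor_distribution in this demo: [GENDER, DONORS].
rows = [[None, 7], ['female', 7]]
# Assumption: the row whose GENDER is None ("any value") holds the total.
population_size = next(donors for gender, donors in rows if gender is None)

seconds = estimate_seconds(population_size)
print('{} donors -> about {} s'.format(population_size, seconds))
```

For the 7 donors of this demo the estimate is 56 seconds, in line with the "about 1 minute" mentioned above.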

# Find the rarest variants

Continuing with the initial goal, we are going to POST to /rarest_variants. Before doing so, we add a few optional parameters asking the API to return the 30 rarest variants from each genomic variation data source and to filter out the variants with a frequency below 0.99%.

In [4]:
rv_param = population.copy()
rv_param['filter_output'] = {
    'limit': 30,
    'min_frequency': 0.0099
}
# send the request
rarest_request = requests.post('http://geco.deib.polimi.it/popstudy/api/rarest_variants', json=rv_param)
print(' response status code: {}'.format(rarest_request.status_code))
rarest_response_body = rarest_request.json()

response status code: 200


# Inspect response data
The response includes the 30 rarest mutations (from each data source) found in the individuals of the selected population with frequency greater than or equal to 0.99%, ordered by ascending frequency and occurrence of the variant.
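
To make the meaning of `limit` and `min_frequency` concrete, here is a local illustration of the filtering and ordering described above. It is a sketch of the described behaviour, not the server's implementation; the `rarest` function and the records are made up for illustration only.

```python
# Made-up records for illustration only: (variant, occurrence, frequency).
records = [
    ('v1', 5, 0.25),
    ('v2', 1, 0.005),
    ('v3', 2, 0.10),
    ('v4', 1, 0.05),
]

def rarest(records, limit, min_frequency):
    """Keep records with frequency >= min_frequency, rarest first, at most `limit`."""
    kept = [r for r in records if r[2] >= min_frequency]
    # Ascending by frequency, then by occurrence.
    kept.sort(key=lambda r: (r[2], r[1]))
    return kept[:limit]

result = rarest(records, limit=2, min_frequency=0.0099)
print(result)
```

Here `v2` is discarded because its frequency is below the threshold, and the two rarest remaining records are returned.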

In [5]:
columns = rarest_response_body['columns']
rows = rarest_response_body['rows']
df = pd.DataFrame.from_records(rows, columns=columns)
df.fillna(value='', inplace=True)    # replaces Nones with ''
df['POSITIVE_RATIO'] = df['POSITIVE_DONORS'] / df['POPULATION_SIZE']  # positive donors / population size
df

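| | CHROM | START | REF | ALT | POPULATION_SIZE | POSITIVE_DONORS | OCCURRENCE_OF_VARIANT | FREQUENCY_OF_VARIANT | POSITIVE_RATIO |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 86191 | G | A | 7 | 1 | 1 | 0.071429 | 0.142857 |
| 1 | 1 | 88709 | C | G | 7 | 1 | 1 | 0.071429 | 0.142857 |
| 2 | 1 | 86027 | T | C | 7 | 1 | 1 | 0.071429 | 0.142857 |
| 3 | 1 | 86064 | G | C | 7 | 1 | 1 | 0.071429 | 0.142857 |
| 4 | 1 | 82608 | C | G | 7 | 1 | 1 | 0.071429 | 0.142857 |
| 5 | 1 | 87408 | C | T | 7 | 1 | 1 | 0.071429 | 0.142857 |
| 6 | 1 | 13115 | T | G | 7 | 1 | 1 | 0.071429 | 0.142857 |
| 7 | 1 | 64512 | G | | 7 | 1 | 1 | 0.071429 | 0.142857 |
| 8 | 1 | 74791 | G | A | 7 | 1 | 1 | 0.071429 | 0.142857 |
| 9 | 1 | 72525 | A | G | 7 | 1 | 1 | 0.071429 | 0.142857 |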


Note that FREQUENCY_OF_VARIANT in the above result is half the POSITIVE_RATIO. This means that each of these variants is heterozygous in the donors of the selected population that carry it.
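
The relation between the two columns can be checked by hand with the values from the table. This sketch assumes that the frequency counts alleles, two per donor on an autosome such as chromosome 1; that assumption is consistent with the numbers shown but is not stated by the API documentation quoted here.

```python
population_size = 7   # POPULATION_SIZE
positive_donors = 1   # POSITIVE_DONORS
occurrence = 1        # OCCURRENCE_OF_VARIANT

positive_ratio = positive_donors / population_size
# Assumption: two alleles per donor, so the denominator is 2 * population size.
frequency = occurrence / (2 * population_size)

print(round(positive_ratio, 6), round(frequency, 6))
```

One occurrence in one positive donor means the donor carries the variant on a single chromosome copy, which is why the frequency is half the positive ratio.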