## Getting the studies for a species

This notebook walks through using the European Variation Archive (EVA) API to list all the studies in the EVA for a particular species (e.g. mouse), and then to get basic summary statistics for those studies: the number of variants and the number of samples.

For reference, here is the documentation for the API we will use:

* [EVA API](https://www.ebi.ac.uk/eva/webservices/rest/swagger-ui.html)

In [1]:
import requests
from collections import defaultdict

In [2]:
taxonomy_id = 10090  # NCBI Taxonomy ID for mouse (Mus musculus)

For a given species, the EVA may have variants mapped to multiple reference genomes. Here we'll get all the assemblies for which there is data for this taxonomy.

In [3]:
species_list_url = 'https://www.ebi.ac.uk/eva/webservices/rest/v1/meta/species/list'
response = requests.get(species_list_url)
species_list = response.json()['response'][0]['result']

taxonomy_code = None
assemblies = set()
for s in species_list:
    if s['taxonomyId'] == taxonomy_id:
        if not taxonomy_code:
            taxonomy_code = s['taxonomyCode']
        assemblies.add(s['assemblyCode'])

In [4]:
taxonomy_code

'mmusculus'

In [5]:
assemblies

{'grcm38', 'grcm39', 'mgscv37'}
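The filtering loop in cell 3 can also be packaged as a small function, which makes it easier to reuse for other species. The following is a minimal sketch: `find_assemblies` is a hypothetical helper name, and the sample records are made up to mirror the three fields used above (`taxonomyId`, `taxonomyCode`, `assemblyCode`). They are not real API output.

```python
def find_assemblies(species_list, taxonomy_id):
    """Return (taxonomy_code, assembly_codes) for one taxonomy ID.

    taxonomy_code is None and assembly_codes is empty when nothing matches.
    """
    taxonomy_code = None
    assemblies = set()
    for s in species_list:
        if s['taxonomyId'] == taxonomy_id:
            if taxonomy_code is None:
                taxonomy_code = s['taxonomyCode']
            assemblies.add(s['assemblyCode'])
    return taxonomy_code, assemblies


# Illustrative records only: same field names as the API response, made-up values.
example_species = [
    {'taxonomyId': 10090, 'taxonomyCode': 'mmusculus', 'assemblyCode': 'grcm38'},
    {'taxonomyId': 10090, 'taxonomyCode': 'mmusculus', 'assemblyCode': 'grcm39'},
    {'taxonomyId': 1, 'taxonomyCode': 'example', 'assemblyCode': 'asm1'},
]

code, found = find_assemblies(example_species, 10090)
print(code, sorted(found))
```

In the notebook itself, the call would be `find_assemblies(species_list, taxonomy_id)`.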

Now, for each assembly, we'll list the studies in the EVA and keep those for which the summary endpoint returns a result. We'll then collect summary statistics for each of them: the number of variants and the number of samples.

In [6]:
studies_per_assembly = defaultdict(list)

for assembly_code in assemblies:
    studies_list_url = 'https://www.ebi.ac.uk/eva/webservices/rest/v1/meta/studies/list' \
                       f'?species={taxonomy_code}_{assembly_code}'
    response = requests.get(studies_list_url)
    studies = response.json()['response'][0]['result']

    for study in studies:
        study_id = study['studyId']
        study_url = f'https://www.ebi.ac.uk/eva/webservices/rest/v1/studies/{study_id}/summary'
        response = requests.get(study_url)
        # Currently we don't have all stats for studies not submitted directly to the EVA, hence this check
        if not response.ok:
            continue
        results = response.json()['response'][0]
        if results['numTotalResults'] != 0:
            studies_per_assembly[assembly_code].append(study_id)

In [7]:
studies_per_assembly

defaultdict(list,
            {'grcm38': ['PRJEB45961',
              'PRJEB45429',
              'PRJEB41714',
              'PRJEB28344',
              'PRJEB28956',
              'PRJEB11471',
              'PRJEB6911',
              'PRJEB43298',
              'PRJEB53276'],
             'grcm39': ['PRJEB53906', 'PRJEB53593'],
             'mgscv37': ['PRJEB48005', 'PRJEB39892']})
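Every call in this notebook unpacks the response with `response.json()['response'][0]['result']`, which raises a `KeyError` or `IndexError` if the payload has a different shape. A hedged sketch of a more defensive alternative follows. `unwrap_result` is a hypothetical helper, and the envelope layout is assumed only from how the cells above index into it.

```python
def unwrap_result(payload):
    """Return the 'result' list from an EVA-style response envelope.

    Assumes the shape used in the cells above:
    {'response': [{'result': [...], ...}]}. Returns [] if any level is missing.
    """
    responses = payload.get('response') or []
    if not responses:
        return []
    return responses[0].get('result') or []


# Hand-written payloads for illustration, not real API output.
ok_payload = {'response': [{'numTotalResults': 1,
                            'result': [{'studyId': 'PRJEB53906'}]}]}
empty_payload = {'response': []}

print(unwrap_result(ok_payload))
print(unwrap_result(empty_payload))
```

With this helper, a line such as `studies = response.json()['response'][0]['result']` could become `studies = unwrap_result(response.json())`, so an unexpected payload yields an empty list and the loop is simply skipped.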

In [8]:
variant_counts = {}
sample_counts = {}

for assembly_code, studies in studies_per_assembly.items():
    for study_id in studies:
        # A study can have multiple files, so we sum the counts across all of its files
        total_num_variants = 0
        total_num_samples = 0
        
        files_url = f'https://www.ebi.ac.uk/eva/webservices/rest/v1/studies/{study_id}/files' \
                    f'?species={taxonomy_code}_{assembly_code}'
        response = requests.get(files_url)
        files = response.json()['response'][0]['result']
        for file in files:
            if 'stats' in file:
                total_num_variants += file['stats']['variantsCount']
                total_num_samples += file['stats']['samplesCount']
                
        variant_counts[study_id] = total_num_variants
        sample_counts[study_id] = total_num_samples
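With the loop above, a total of 0 can mean either that the files report zero variants or that none of the files carries a `stats` entry. The sketch below separates the per-file summation into a function that also reports how many files had stats. `sum_file_stats` is a hypothetical name, and the example records are made up, reusing only the field names from the cell above (`stats`, `variantsCount`, `samplesCount`).

```python
def sum_file_stats(files):
    """Sum variant and sample counts over the files that carry a 'stats' entry.

    Returns (num_variants, num_samples, num_files_with_stats).
    """
    num_variants = 0
    num_samples = 0
    num_with_stats = 0
    for f in files:
        stats = f.get('stats')
        if stats is None:
            continue
        num_variants += stats['variantsCount']
        num_samples += stats['samplesCount']
        num_with_stats += 1
    return num_variants, num_samples, num_with_stats


# Made-up file records using the same field names as the cell above.
example_files = [
    {'stats': {'variantsCount': 100, 'samplesCount': 3}},
    {'stats': {'variantsCount': 50, 'samplesCount': 2}},
    {},  # a file without stats is skipped
]

print(sum_file_stats(example_files))
```

A third value of 0 would indicate that the study has no file-level stats at all, as opposed to a genuinely empty study.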

In [9]:
variant_counts

{'PRJEB45961': 5996023,
 'PRJEB45429': 35967533,
 'PRJEB41714': 0,
 'PRJEB28344': 1181836,
 'PRJEB28956': 0,
 'PRJEB11471': 95624835,
 'PRJEB6911': 80706582,
 'PRJEB43298': 0,
 'PRJEB53276': 3456,
 'PRJEB53906': 101584582,
 'PRJEB53593': 34174,
 'PRJEB48005': 6955,
 'PRJEB39892': 349}

In [10]:
sample_counts

{'PRJEB45961': 150,
 'PRJEB45429': 157,
 'PRJEB41714': 0,
 'PRJEB28344': 26,
 'PRJEB28956': 0,
 'PRJEB11471': 72,
 'PRJEB6911': 36,
 'PRJEB43298': 0,
 'PRJEB53276': 16,
 'PRJEB53906': 104,
 'PRJEB53593': 940,
 'PRJEB48005': 10,
 'PRJEB39892': 30}
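The two dictionaries can be combined with `studies_per_assembly` to get totals per assembly. The following is a minimal sketch: `totals_per_assembly` is a hypothetical helper, and the inputs are a hand-copied subset of the outputs shown above (the `grcm39` and `mgscv37` studies only) so that the block runs on its own.

```python
def totals_per_assembly(studies_per_assembly, variant_counts, sample_counts):
    """Return {assembly_code: (total_variants, total_samples)}."""
    totals = {}
    for assembly_code, study_ids in studies_per_assembly.items():
        totals[assembly_code] = (
            sum(variant_counts[s] for s in study_ids),
            sum(sample_counts[s] for s in study_ids),
        )
    return totals


# A subset of the outputs shown above (grcm39 and mgscv37 only), copied by hand.
subset_studies = {
    'grcm39': ['PRJEB53906', 'PRJEB53593'],
    'mgscv37': ['PRJEB48005', 'PRJEB39892'],
}
subset_variants = {'PRJEB53906': 101584582, 'PRJEB53593': 34174,
                   'PRJEB48005': 6955, 'PRJEB39892': 349}
subset_samples = {'PRJEB53906': 104, 'PRJEB53593': 940,
                  'PRJEB48005': 10, 'PRJEB39892': 30}

for assembly, (n_variants, n_samples) in totals_per_assembly(
        subset_studies, subset_variants, subset_samples).items():
    print(f'{assembly}: {n_variants} variants, {n_samples} samples')
```

In the notebook, the call would be `totals_per_assembly(studies_per_assembly, variant_counts, sample_counts)`. Note that these are sums of per-file counts, so a sample that appears in more than one file is counted each time.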