# GWAS Catalog and Summary Statistics REST API workshop

* The following shows basic examples of how to access and parse data from the GWAS Catalog through its REST APIs.
* There are TWO REST APIs: the GWAS Catalog API and the Summary Statistics API.
* Although these examples are written in Python, any other programming language works equally well.
* Examples in other languages will be available soon.


### Contents:

* **Exercise 1**: fetching data from the API manually, via a browser
* **Exercise 2**: fetching data for a list of variants (GWAS Catalog API)
* **Exercise 3**: fetching summary statistics data for a trait, filtered by p-value (Summary Stats API)
* **Exercise 4**: combining the two APIs

## Exercise 1
### _Requests are just URLs_

Fetch data of a single study with accession ID [GCST001795](https://www.ebi.ac.uk/gwas/studies/GCST001795) from the GWAS Catalog REST API using a browser.   

**Generate the URL:**

* API URL: `https://www.ebi.ac.uk/gwas/rest/api`
* Endpoint: `studies`
* AccessionID: `GCST001795`

**URL:**

[https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001795](https://www.ebi.ac.uk/gwas/rest/api/studies/GCST001795)

Visit the URL in a browser to see the response from the REST API.
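Because a request is just a URL, the same address can also be assembled in code. The following is a minimal sketch; `build_url` is a hypothetical helper introduced here for illustration and is not part of the API or of the workshop code.

```python
def build_url(api_url, endpoint, identifier):
    """Join the API base URL, an endpoint and an identifier with single slashes."""
    parts = [api_url.rstrip('/'), endpoint.strip('/'), identifier.strip('/')]
    return '/'.join(parts)


# Values taken from the exercise above
study_url = build_url('https://www.ebi.ac.uk/gwas/rest/api', 'studies', 'GCST001795')
print(study_url)
```

Pasting the printed address into a browser should give the same response as the link above.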

### Understanding the returned data:

* A number of simple key-value pairs, e.g.:

```json
    "initialSampleSize" : "1,656 Han Chinese ancestry cases, 3,394 Han Chinese ancestry controls",
    "snpCount" : 2100739,
    "imputed" : true,
    "accessionId" : "GCST001795",
```

* Lists that allow multiple elements for a key:

```json
    "genotypingTechnologies" : [ {
        "genotypingTechnology" : "Genome-wide genotyping array"
    } ],
```
* Lists whose values are themselves complex objects, e.g. `ancestries`.


* The returned data is highly structured and easy for a computer to read.
* The same information is accessible via the UI.
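To illustrate how these structures map onto Python objects, the sketch below parses a fragment assembled from the two snippets shown above using the standard `json` module. The fragment is only a small part of the real response, and the variable names are chosen here for illustration.

```python
import json

# A fragment of the response, assembled from the snippets above
fragment = '''
{
    "initialSampleSize" : "1,656 Han Chinese ancestry cases, 3,394 Han Chinese ancestry controls",
    "snpCount" : 2100739,
    "imputed" : true,
    "accessionId" : "GCST001795",
    "genotypingTechnologies" : [ {
        "genotypingTechnology" : "Genome-wide genotyping array"
    } ]
}
'''

study = json.loads(fragment)

# Simple key-value pairs become plain Python values (str, int, bool)
accession = study['accessionId']
snp_count = study['snpCount']
imputed = study['imputed']

# A list of objects becomes a list of dictionaries
technologies = [item['genotypingTechnology'] for item in study['genotypingTechnologies']]

print(accession, snp_count, imputed, technologies)
```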

In the following examples we write small Python scripts that organize this data to make it easy for humans to read.

## Exercise 2

### _Retrieve data for a list of variants_

1. Fetch the trait and p-value of all associations for multiple rsIDs. 
2. Organize the data in a table.
3. Be careful: not every rsID has associations!
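One way to cope with the third point is to keep the parsing step separate from the network call, so that an empty or incomplete response simply yields no rows. The sketch below assumes the field names used in this exercise (`_embedded`, `associations`, `efoTraits`, `trait`, `pvalue`); the function `extract_rows`, the rsIDs and the sample payloads are hypothetical stand-ins, not real API output.

```python
def extract_rows(variant, decoded):
    """Turn one decoded API response into a list of flat rows.

    Returns an empty list when the response holds no associations.
    """
    associations = decoded.get('_embedded', {}).get('associations', [])
    rows = []
    for association in associations:
        traits = ",".join(t['trait'] for t in association.get('efoTraits', []))
        rows.append({'variant': variant,
                     'trait': traits,
                     'pvalue': association.get('pvalue')})
    return rows


# Hand-written stand-ins for decoded responses (illustrative values only)
with_hits = {'_embedded': {'associations': [
    {'pvalue': 1e-06, 'efoTraits': [{'trait': 'trait A'}, {'trait': 'trait B'}]}
]}}
without_hits = {'_embedded': {'associations': []}}

print(extract_rows('rs0000001', with_hits))
print(extract_rows('rs0000002', without_hits))
```

Because the function never touches the network, it can be checked against saved or hand-written responses before being run against the live API.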

```python
# Import required packages
import requests     # HTTP library - manages data transfer from web resource (e.g. GWAS Catalog)
import pandas as pd # Data analysis library, a bit like R for Python!


# API Address:
apiUrl = 'https://www.ebi.ac.uk/gwas/rest/api'

# List of variants:
variants = ['rs142968358', 'rs62402518', 'rs12199222', 'rs7329174', 'rs9879858765']

# Store extracted data in this list:
extractedData = []

# Iterating over all variants:
for variant in variants:

    # Accessing data for a single variant:
    requestUrl = '%s/singleNucleotidePolymorphisms/%s/associations?projection=associationBySnp' % (apiUrl, variant)
    response = requests.get(requestUrl, headers={"Content-Type": "application/json"})

    # Testing if rsID exists (any unsuccessful status is treated as "not found"):
    if not response.ok:
        print("[Warning] %s is not in the GWAS Catalog!!" % variant)
        continue

    # Test if the returned data can be parsed as JSON:
    try:
        decoded = response.json()
    except ValueError:
        print("[Warning] Failed to decode data for %s" % variant)
        continue

    # One row per association; multiple EFO traits are joined with commas:
    for association in decoded['_embedded']['associations']:
        trait = ",".join([efoTrait['trait'] for efoTrait in association['efoTraits']])
        pvalue = association['pvalue']

        extractedData.append({'variant': variant,
                              'trait': trait,
                              'pvalue': pvalue})

# Format data into a table (data frame):
table = pd.DataFrame.from_dict(extractedData)
table
```



| | pvalue | trait | variant |
|---|---|---|---|
| 0 | 1e-06 | autism spectrum disorder | rs142968358 |
| 1 | 8e-06 | lung carcinoma,smoking status measurement | rs62402518 |
| 2 | 6e-07 | acute myeloid leukemia | rs12199222 |
| 3 | 7e-07 | body height | rs12199222 |
| 4 | 1e-08 | systemic lupus erythematosus | rs7329174 |
| 5 | 6e-06 | systemic lupus erythematosus | rs7329174 |
| 6 | 8e-09 | crohn's disease | rs7329174 |
| 7 | 3e-06 | systemic lupus erythematosus | rs7329174 |
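The result table contains only four of the five queried rsIDs. As a small sketch of how to spot variants that returned nothing, the snippet below compares the queried list with the variant column of the table; the values are copied from this exercise, and the variable names are illustrative.

```python
# Variants that were queried in the exercise
variants = ['rs142968358', 'rs62402518', 'rs12199222', 'rs7329174', 'rs9879858765']

# Variant column of the result table above
variants_in_table = ['rs142968358', 'rs62402518', 'rs12199222', 'rs12199222',
                     'rs7329174', 'rs7329174', 'rs7329174', 'rs7329174']

# Keep the original query order while reporting what is absent
found = set(variants_in_table)
missing = [v for v in variants if v not in found]
print(missing)  # ['rs9879858765']
```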


## Exercise 3
### _Summary Statistics API_

* Get all associations for type II diabetes mellitus: EFO_0001360
* Filter by p-value
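The p-value filter is passed to the API as a query parameter. A minimal sketch of assembling such a request URL with the standard library's `urlencode` is shown below; the endpoint path and the parameter names `p_upper` and `size` are the ones used in this exercise, while the variable names are illustrative.

```python
from urllib.parse import urlencode

api_url = 'https://www.ebi.ac.uk/gwas/summary-statistics/api'
trait = 'EFO_0001360'

# Query parameters used in this exercise: p-value ceiling and page size
params = {'p_upper': '0.000000001', 'size': 10}

request_url = '%s/traits/%s/associations?%s' % (api_url, trait, urlencode(params))
print(request_url)
```

Keeping the parameters in a dictionary makes it easy to change the threshold or page size without editing the URL string by hand.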

```python
# API Address:
apiUrl = 'https://www.ebi.ac.uk/gwas/summary-statistics/api'

# Trait (EFO ID) and upper p-value threshold:
trait = "EFO_0001360"
p_upper = "0.000000001"

# Request the first 10 associations that pass the p-value filter:
requestUrl = '%s/traits/%s/associations?p_upper=%s&size=10' % (apiUrl, trait, p_upper)
response = requests.get(requestUrl, headers={"Content-Type": "application/json"})
response.raise_for_status()  # Stop here if the request failed

# The returned response is a "response" object, from which we have to extract and parse the information:
decoded = response.json()
extractedData = []

# Here the associations are stored in a dictionary, so we iterate over its values:
for association in decoded['_embedded']['associations'].values():
    pvalue = association['p_value']
    variant = association['variant_id']
    studyID = association['study_accession']

    extractedData.append({'variant': variant,
                          'studyID': studyID,
                          'pvalue': pvalue})

ssTable = pd.DataFrame.from_dict(extractedData)
ssTable
```


| | pvalue | studyID | variant |
|---|---|---|---|
| 0 | 1.7e-24 | GCST005047 | rs7020996 |
| 1 | 4.2e-19 | GCST005047 | rs10965243 |
| 2 | 6.1e-13 | GCST005047 | rs10965245 |
| 3 | 4.4e-34 | GCST005047 | rs2383208 |
| 4 | 1e-22 | GCST005047 | rs10965250 |
| 5 | 1.1000000000000001e-27 | GCST005047 | rs10811661 |
| 6 | 5.9e-15 | GCST005047 | rs1333051 |
| 7 | 4.8e-10 | GCST005047 | rs12913233 |
| 8 | 7.6e-10 | GCST005047 | rs11636554 |
| 9 | 1e-09 | GCST005047 | rs1191196 |
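Once the rows are available locally, they can be ranked or filtered further without another request. The sketch below uses four rows copied from the table above; the stricter cut-off of `1e-20` is purely illustrative and is not a recommendation.

```python
# A few rows copied from the table above
rows = [
    {'pvalue': 1.7e-24, 'studyID': 'GCST005047', 'variant': 'rs7020996'},
    {'pvalue': 4.4e-34, 'studyID': 'GCST005047', 'variant': 'rs2383208'},
    {'pvalue': 4.8e-10, 'studyID': 'GCST005047', 'variant': 'rs12913233'},
    {'pvalue': 1e-09, 'studyID': 'GCST005047', 'variant': 'rs1191196'},
]

# Smallest p-value first
ranked = sorted(rows, key=lambda row: row['pvalue'])
strongest = ranked[0]

# Apply a stricter, purely illustrative cut-off on the client side
threshold = 1e-20
significant = [row['variant'] for row in ranked if row['pvalue'] <= threshold]

print(strongest['variant'], significant)
```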


## Exercise 4
### _Combine the two APIs!_

* Get all associations for type II diabetes mellitus: EFO_0001360
* Filter by p-value
* Add the PubMed ID and trait name from the corresponding study record in the GWAS Catalog

```python
def getStudy(studyLink):
    """Return the trait name and PubMed ID of a study, given its Summary Statistics API link."""
    # Accessing data for a single study in the Summary Statistics API:
    response = requests.get(studyLink, headers={"Content-Type": "application/json"})
    decoded = response.json()

    # Following the link to the same study in the GWAS Catalog API:
    gwasData = requests.get(decoded['_links']['gwas_catalog']['href'], headers={"Content-Type": "application/json"})
    decodedGwasData = gwasData.json()

    traitName = decodedGwasData['diseaseTrait']['trait']
    pubmedId = decodedGwasData['publicationInfo']['pubmedId']

    return traitName, pubmedId


extractedData = []

# `decoded` is the Summary Statistics response fetched in Exercise 3:
for association in decoded['_embedded']['associations'].values():
    pvalue = association['p_value']
    variant = association['variant_id']
    studyID = association['study_accession']
    studyLink = association['_links']['study']['href']
    traitName, pubmedId = getStudy(studyLink)

    extractedData.append({'variant': variant,
                          'studyID': studyID,
                          'pvalue': pvalue,
                          'traitName': traitName,
                          'pubmedID': pubmedId})


ssWithGWASTable = pd.DataFrame.from_dict(extractedData)
ssWithGWASTable
```


| | pubmedID | pvalue | studyID | traitName | variant |
|---|---|---|---|---|---|
| 0 | 22885922 | 1.7e-24 | GCST005047 | Type 2 diabetes | rs7020996 |
| 1 | 22885922 | 4.2e-19 | GCST005047 | Type 2 diabetes | rs10965243 |
| 2 | 22885922 | 6.1e-13 | GCST005047 | Type 2 diabetes | rs10965245 |
| 3 | 22885922 | 4.4e-34 | GCST005047 | Type 2 diabetes | rs2383208 |
| 4 | 22885922 | 1e-22 | GCST005047 | Type 2 diabetes | rs10965250 |
| 5 | 22885922 | 1.1000000000000001e-27 | GCST005047 | Type 2 diabetes | rs10811661 |
| 6 | 22885922 | 5.9e-15 | GCST005047 | Type 2 diabetes | rs1333051 |
| 7 | 22885922 | 4.8e-10 | GCST005047 | Type 2 diabetes | rs12913233 |
| 8 | 22885922 | 7.6e-10 | GCST005047 | Type 2 diabetes | rs11636554 |
| 9 | 22885922 | 1e-09 | GCST005047 | Type 2 diabetes | rs1191196 |
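All ten rows above come from the same study, so looking the study up once per association repeats identical requests. One possible refinement, sketched here under stated assumptions, is to cache lookups by link. `make_cached_lookup` and `fake_fetch` are hypothetical names introduced for illustration, the link string is a placeholder, and the fake function stands in for a real network call so the example runs offline.

```python
def make_cached_lookup(fetch):
    """Wrap `fetch(link)` so that each distinct link is fetched only once."""
    cache = {}

    def lookup(link):
        if link not in cache:
            cache[link] = fetch(link)
        return cache[link]

    return lookup


# Stand-in for a network call; records how often it is really invoked
calls = []

def fake_fetch(link):
    calls.append(link)
    return ('Type 2 diabetes', '22885922')


lookup = make_cached_lookup(fake_fetch)
links = ['study/GCST005047'] * 10   # placeholder link, repeated once per association
results = [lookup(link) for link in links]

print(len(results), len(calls))  # 10 1
```

In the exercise, the same wrapper could be applied to the study lookup function, so that each distinct study link triggers only one round of requests.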
