# APIs, Web Scraping, and Data Munging

An API (Application Programming Interface) is a set of methods for communicating with a program. RESTful APIs are used to access web services over HTTP. There are many bioinformatics APIs; we will use the Ensembl REST API to demonstrate how to use them.

The requests library can be used to submit HTTP requests to APIs and to process the returned data.

In [1]:
import requests

In [2]:
r = requests.get('https://rest.ensembl.org/info/rest?',headers={ "Content-Type" : "application/json"})

### HTTP Status Codes

HTTP status codes are returned by the web server in response to every request. You can find a full list of codes and their explanations [here](https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html).

In [3]:
r.status_code

200
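The first digit of a status code gives its class, which is often all a script needs to decide what to do next. The following is a minimal sketch in Python 3 using only the standard library; `describe_status` is a hypothetical helper name, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label for an HTTP status code based on its class."""
    if 100 <= code < 200:
        category = "informational"
    elif 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "unknown"
    try:
        # HTTPStatus knows the standard reason phrase for registered codes.
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unrecognized"
    return "{} {} ({})".format(code, phrase, category)


for code in (200, 404, 429, 503):
    print(describe_status(code))
```

With a real response object, `r.ok` and `r.raise_for_status()` offer a shorter way to stop on 4xx and 5xx codes.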

### JSON

JSON (JavaScript Object Notation) is a data format that is commonly used to send and receive data on the web. If you are curious about the specification or technical details, you can visit [json.org](http://www.json.org/).

In [4]:
r.json()

{u'release': u'4.8'}
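Calling `r.json()` parses the JSON text in the response body into ordinary Python objects. The sketch below (Python 3) shows the same conversion with the standard `json` module on a hand-written string that mirrors the response above; the variable names are illustrative.

```python
import json

# A JSON document is just text; this string mirrors the response body above.
body = '{"release": "4.8"}'

# json.loads turns JSON text into Python dicts, lists, strings and numbers.
data = json.loads(body)
print(type(data).__name__)
print(data["release"])

# json.dumps goes the other way and can pretty-print nested structures.
print(json.dumps(data, indent=2, sort_keys=True))
```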

In [5]:
r = requests.get('https://rest.ensembl.org/info/species',headers={ "Content-Type" : "application/json"})

In [6]:
r.status_code

200

In [7]:
r.json()

{u'species': [{u'accession': u'GCA_000146045.2',
   u'aliases': [u'4932',
    u'saccer',
    u"saccharomyces cerevisiae (baker's yeast)",
    u"baker's yeast",
    u'scer',
    u'saccharomyces cerevisiae',
    u'scerevisiae',
    u's_cerevisiae'],
   u'assembly': u'R64-1-1',
   u'common_name': u"baker's yeast",
   u'display_name': u'Saccharomyces cerevisiae',
   u'division': u'Ensembl',
   u'groups': [u'core', u'otherfeatures', u'variation', u'funcgen'],
   u'name': u'saccharomyces_cerevisiae',
   u'release': 87,
   u'strain': u'S288C',
   u'strain_collection': None,
   u'taxon_id': u'4932'},
  {u'accession': None,
   u'aliases': [u'ciosav',
    u'51511',
    u'ciona savignyi',
    u'csavignyi',
    u'c.savignyi',
    u'csav',
    u'sea squirt ciona savignyi'],
   u'assembly': u'CSAV2.0',
   u'common_name': u'Sea squirt Ciona savignyi',
   u'display_name': u'C.savignyi',
   u'division': u'Ensembl',
   u'groups': [u'core', u'otherfeatures'],
   u'name': u'ciona_savignyi',
   u'release':
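Each species record carries a list of aliases, so a common munging step is to turn the list of records into a lookup table. This is a minimal sketch in Python 3 on an abbreviated copy of the two records shown above; `build_alias_index` is a hypothetical helper, and with a live response the input would be `r.json()['species']`.

```python
# Abbreviated copies of the two records shown in the output above.
species = [
    {
        "name": "saccharomyces_cerevisiae",
        "common_name": "baker's yeast",
        "aliases": ["4932", "saccer", "scer", "s_cerevisiae"],
    },
    {
        "name": "ciona_savignyi",
        "common_name": "Sea squirt Ciona savignyi",
        "aliases": ["ciosav", "51511", "csav"],
    },
]


def build_alias_index(records):
    """Map every alias (lower-cased) to the canonical species name."""
    index = {}
    for record in records:
        for alias in record.get("aliases", []):
            index[alias.lower()] = record["name"]
    return index


alias_index = build_alias_index(species)
print(alias_index["scer"])
print(alias_index["csav"])
```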

In [8]:
r = requests.get('https://rest.ensembl.org/info/variation/homo_sapiens',headers={ "Content-Type" : "application/json"})

In [9]:
r.status_code

200

In [10]:
r.json()

[{u'data_types': [u'variation'],
  u'description': u'Variants (including SNPs and indels) imported from dbSNP',
  u'name': u'dbSNP',
  u'somatic_status': u'mixed',
  u'url': u'http://www.ncbi.nlm.nih.gov/projects/SNP/',
  u'version': u'147'},
 {u'data_types': [u'variation_synonym'],
  u'description': u'Variants dbSNP annotates as being from LSDBs',
  u'name': u'LSDB',
  u'somatic_status': u'germline',
  u'type': u'lsdb',
  u'url': u'http://www.ncbi.nlm.nih.gov/projects/SNP/',
  u'version': u'147'},
 {u'data_types': [u'variation', u'variation_synonym'],
  u'description': u'PhenCode is a collaborative project to better understand the relationship between genotype and phenotype in humans',
  u'name': u'PhenCode',
  u'url': u'http://phencode.bx.psu.edu/',
  u'version': u'30/04/2014'},
 {u'data_types': [u'variation_synonym'],
  u'description': u'The registry of Hereditary Auto-inflammatory Disorders Mutations ',
  u'name': u'Infevers',
  u'somatic_status': u'germline',
  u'type': u'lsdb',
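Records in the same response do not always share the same keys: PhenCode above has no `somatic_status` entry, for example. The following Python 3 sketch groups the sources shown above by that field and uses `dict.get` to supply a default; the `"not reported"` label and the helper name are assumptions made for illustration.

```python
from collections import defaultdict

# Abbreviated copies of the records shown in the output above.
sources = [
    {"name": "dbSNP", "somatic_status": "mixed"},
    {"name": "LSDB", "somatic_status": "germline", "type": "lsdb"},
    {"name": "PhenCode"},  # no somatic_status key in the response
    {"name": "Infevers", "somatic_status": "germline", "type": "lsdb"},
]


def group_by_status(records):
    """Group source names by somatic_status, tolerating missing keys."""
    groups = defaultdict(list)
    for record in records:
        # record["somatic_status"] would raise KeyError for PhenCode.
        status = record.get("somatic_status", "not reported")
        groups[status].append(record["name"])
    return dict(groups)


by_status = group_by_status(sources)
for status, names in sorted(by_status.items()):
    print(status, names)
```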
 

In [53]:
r = requests.get('https://rest.ensembl.org/lookup/id/ENSG00000157764?expand=1',headers={ "Content-Type" : "application/json"})
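Query parameters such as `expand=1` can also be passed as a dictionary instead of being typed into the URL. The sketch below (Python 3) uses `requests.Request(...).prepare()` to build the request without sending it, so the final URL can be inspected; the variable names are illustrative.

```python
import requests

# Build (but do not send) the same lookup request, passing the query
# string as a dict instead of writing "?expand=1" by hand.
request = requests.Request(
    "GET",
    "https://rest.ensembl.org/lookup/id/ENSG00000157764",
    params={"expand": 1},
    headers={"Content-Type": "application/json"},
)
prepared = request.prepare()

print(prepared.url)
print(prepared.headers["Content-Type"])
```

Passing the same `params` and `headers` arguments to `requests.get` sends the request.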

In [12]:
r.status_code

200

In [13]:
r.json()

{u'Transcript': [{u'Exon': [{u'assembly_name': u'GRCh38',
     u'db_type': u'core',
     u'end': 140783157,
     u'id': u'ENSE00003685923',
     u'object_type': u'Exon',
     u'seq_region_name': u'7',
     u'species': u'homo_sapiens',
     u'start': 140783021,
     u'strand': -1,
     u'version': 1},
    {u'assembly_name': u'GRCh38',
     u'db_type': u'core',
     u'end': 140781693,
     u'id': u'ENSE00003559218',
     u'object_type': u'Exon',
     u'seq_region_name': u'7',
     u'species': u'homo_sapiens',
     u'start': 140781576,
     u'strand': -1,
     u'version': 1},
    {u'assembly_name': u'GRCh38',
     u'db_type': u'core',
     u'end': 140778075,
     u'id': u'ENSE00003521664',
     u'object_type': u'Exon',
     u'seq_region_name': u'7',
     u'species': u'homo_sapiens',
     u'start': 140777991,
     u'strand': -1,
     u'version': 1},
    {u'assembly_name': u'GRCh38',
     u'db_type': u'core',
     u'end': 140777088,
     u'id': u'ENSE00003527888',
     u'object_type': u'Exo

## *Exercise: Find the gene name of ENSG00000157764 from the API response*

In [57]:
type(r.json()['Transcript'])

list
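Checking types one level at a time, as above, is a good way to find your way around a deeply nested response. The sketch below automates that exploration in Python 3; `outline` is a hypothetical helper, and `sample` is a heavily abbreviated stand-in for `r.json()` that keeps only a few of the fields shown above.

```python
def outline(obj, depth=0, max_depth=3):
    """Return lines describing the shape of a nested dict/list structure."""
    pad = "  " * depth
    lines = []
    if depth >= max_depth:
        return lines
    if isinstance(obj, dict):
        for key in sorted(obj):
            value = obj[key]
            lines.append("{}{}: {}".format(pad, key, type(value).__name__))
            lines.extend(outline(value, depth + 1, max_depth))
    elif isinstance(obj, list) and obj:
        # Only the first element is inspected; the rest usually look alike.
        lines.append("{}[0] of {}: {}".format(pad, len(obj), type(obj[0]).__name__))
        lines.extend(outline(obj[0], depth + 1, max_depth))
    return lines


# Abbreviated stand-in for the lookup response shown above.
sample = {
    "Transcript": [
        {"Exon": [{"id": "ENSE00003685923", "start": 140783021, "end": 140783157}]}
    ]
}

for line in outline(sample):
    print(line)
```

Running `outline(r.json(), max_depth=1)` on the real response lists only the top-level keys, which is a useful starting point for the exercise.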

In [47]:
r = requests.get('https://rest.ensembl.org/overlap/id/ENSG00000157764?feature=variation',headers={ "Content-Type" : "application/json"})

In [48]:
r.json()

[{u'alleles': [u'G', u'A'],
  u'assembly_name': u'GRCh38',
  u'clinical_significance': [],
  u'consequence_type': u'3_prime_UTR_variant',
  u'end': 140719364,
  u'feature_type': u'variation',
  u'id': u'rs565779474',
  u'seq_region_name': u'7',
  u'source': u'dbSNP',
  u'start': 140719364,
  u'strand': 1},
 {u'alleles': [u'A', u'C'],
  u'assembly_name': u'GRCh38',
  u'clinical_significance': [],
  u'consequence_type': u'3_prime_UTR_variant',
  u'end': 140719366,
  u'feature_type': u'variation',
  u'id': u'rs761919219',
  u'seq_region_name': u'7',
  u'source': u'dbSNP',
  u'start': 140719366,
  u'strand': 1},
 {u'alleles': [u'C', u'T'],
  u'assembly_name': u'GRCh38',
  u'clinical_significance': [],
  u'consequence_type': u'3_prime_UTR_variant',
  u'end': 140719369,
  u'feature_type': u'variation',
  u'id': u'rs769907969',
  u'seq_region_name': u'7',
  u'source': u'dbSNP',
  u'start': 140719369,
  u'strand': 1},
 {u'alleles': [u'G', u'A'],
  u'assembly_name': u'GRCh38',
  u'clinical_sign

In [51]:
r.json()[0]['start']

140719364
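Because the response is a list of dictionaries, the whole list can be summarized with comprehensions instead of indexing one record at a time. This is a minimal sketch in Python 3 over abbreviated copies of the first three variants shown above; the variable names are illustrative.

```python
from collections import Counter

# Abbreviated copies of the first three records in the output above.
variants = [
    {"id": "rs565779474", "start": 140719364, "end": 140719364,
     "alleles": ["G", "A"], "consequence_type": "3_prime_UTR_variant"},
    {"id": "rs761919219", "start": 140719366, "end": 140719366,
     "alleles": ["A", "C"], "consequence_type": "3_prime_UTR_variant"},
    {"id": "rs769907969", "start": 140719369, "end": 140719369,
     "alleles": ["C", "T"], "consequence_type": "3_prime_UTR_variant"},
]

# Map each variant id to its start coordinate.
positions = {v["id"]: v["start"] for v in variants}

# Count how often each consequence type occurs.
consequence_counts = Counter(v["consequence_type"] for v in variants)

# Keep variants that span one base and have single-base alleles.
single_base = [
    v["id"] for v in variants
    if v["start"] == v["end"] and all(len(a) == 1 for a in v["alleles"])
]

print(positions["rs565779474"])
print(consequence_counts.most_common(1))
print(single_base)
```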

## *Exercise: Use the Ensembl API documentation (https://rest.ensembl.org/) to find an endpoint to get information for rs565779474*

## Web Scraping

In [18]:
r = requests.get('https://www.ams.usda.gov/mnreports/wa_py001.txt')

In [19]:
r.content

'\r\nWA_PY001                                                                        \r\nWashington, DC          Fri. Mar 24, 2017          USDA Market News             \r\n                                                                                \r\nWeekly Combined Regional Shell Eggs                                             \r\n\r\nAverage prices on sales to volume buyers, USDA Grade A and Grade A,             \r\nWhite eggs in cartons, delivered warehouse, cents per dozen                     \r\n                                                                                \r\nREGIONS                             EX LARGE    LARGE   MEDIUM                  \r\n                                                                                \r\nNORTHEAST                              82.00    80.00    71.00                  \r\nSOUTHEAST                              79.50    77.50    70.00                  \r\nMIDWEST                                74.50    72.50    65.50     

In [20]:
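# r.text is the decoded body, so this split also works on Python 3,
# where r.content is bytes and cannot be split on a str separator
text = r.text.split('\r\n')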

In [21]:
text

['',
 'WA_PY001                                                                        ',
 'Washington, DC          Fri. Mar 24, 2017          USDA Market News             ',
 '                                                                                ',
 'Weekly Combined Regional Shell Eggs                                             ',
 '',
 'Average prices on sales to volume buyers, USDA Grade A and Grade A,             ',
 'White eggs in cartons, delivered warehouse, cents per dozen                     ',
 '                                                                                ',
 'REGIONS                             EX LARGE    LARGE   MEDIUM                  ',
 '                                                                                ',
 'NORTHEAST                              82.00    80.00    71.00                  ',
 'SOUTHEAST                              79.50    77.50    70.00                  ',
 'MIDWEST                                74.50    72.50

In [22]:
import re

for line in text:
    # Keep only lines that contain a price such as 82.00
    if re.search(r'\d+\.\d\d', line):
        print(line.split())

['NORTHEAST', '82.00', '80.00', '71.00']
['SOUTHEAST', '79.50', '77.50', '70.00']
['MIDWEST', '74.50', '72.50', '65.50']
['SOUTH', 'CENTRAL', '90.50', '80.50', '72.50']
['COMBINED', 'REGIONAL', '82.08', '77.74', '69.87']
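Splitting on whitespace breaks multi-word region names such as SOUTH CENTRAL into separate tokens, so rows end up with different lengths. One possible fix, sketched below in Python 3, is to anchor a regular expression on the three trailing prices and treat everything before them as the region name. `parse_price_line` is a hypothetical helper, and the spacing in the sample lines is illustrative.

```python
import re

# Region name, then exactly three prices such as 82.00, then optional padding.
PRICE_ROW = re.compile(r'^(.*?)\s+(\d+\.\d\d)\s+(\d+\.\d\d)\s+(\d+\.\d\d)\s*$')


def parse_price_line(line):
    """Return (region, [ex_large, large, medium]) or None for non-data lines."""
    match = PRICE_ROW.match(line)
    if match is None:
        return None
    # Collapse any run of spaces inside the region name to a single space.
    region = " ".join(match.group(1).split())
    prices = [float(p) for p in match.groups()[1:]]
    return region, prices


sample_lines = [
    "REGIONS                             EX LARGE    LARGE   MEDIUM      ",
    "NORTHEAST                              82.00    80.00    71.00      ",
    "SOUTH CENTRAL                          90.50    80.50    72.50      ",
]

rows = [parsed for parsed in map(parse_price_line, sample_lines) if parsed]
for region, prices in rows:
    print(region, prices)
```

Converting the prices to `float` at this stage means later steps can compare or average them directly.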


In [28]:
# open() works on both Python 2 and 3; the file() builtin exists only in Python 2
meck = open('meck_san.html', 'r')

In [30]:
for line in meck:
    print(line)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

    <head id="ctl00_Head1"><meta http-equiv="X-UA-Compatible" content="IE=edge" /><meta name="keywords" content="NCENVPBL" /><title>

	Establishment

</title><link href="../App_Themes/Mount%20Redmond/BaseStyles.css" type="text/css" rel="stylesheet" /><link href="../App_Themes/Mount%20Redmond/Styles.css" type="text/css" rel="stylesheet" /><link href="/NCEnvPbl/WebResource.axd?d=bjXvMHIGHt5AIHiflgE4qxfla9QW-s7grX_OliyKmo7TSFvOs9oaqQjicW-m9p1YFvchQ32nrt2oh7rhhbfgF88DChhAqDyR_KC02SaTmoZbjLYi358w9moX6vzQRoPooVDvmO-xzW_WSIqCFXmcVQ2&amp;t=635302236840000000" type="text/css" rel="stylesheet" /></head>

    <body id="ctl00_Body1" class="pBack">

        <form name="aspnetForm" method="post" action="ShowESTABLISHMENTTablePage.aspx?ESTTST_CTY=60" id="aspnetForm">

<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEP

In [31]:
from bs4 import BeautifulSoup

In [36]:
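# Naming the parser explicitly avoids a warning and keeps results consistent across machines
text = BeautifulSoup(open('meck_san.html'), 'html.parser')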
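Once a page is parsed, individual elements can be pulled out by tag name and attribute instead of scanning the raw text line by line. The sketch below (Python 3) runs on a small hand-written fragment modelled on the page above; the `value="..."` attribute is a placeholder and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# A small fragment modelled on the page above; value="..." is a placeholder.
html = """
<html>
  <head><title>
    Establishment
  </title></head>
  <body id="ctl00_Body1" class="pBack">
    <form name="aspnetForm" method="post"
          action="ShowESTABLISHMENTTablePage.aspx?ESTTST_CTY=60" id="aspnetForm">
      <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="..." />
    </form>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text of the <title> element with surrounding whitespace removed.
title = soup.title.get_text(strip=True)

# Find one element by tag name and id, then read an attribute like a dict key.
form = soup.find("form", id="aspnetForm")
action = form["action"]

# find_all returns every match; here, the hidden inputs inside the form.
hidden_names = [tag["name"] for tag in form.find_all("input", attrs={"type": "hidden"})]

print(title)
print(action)
print(hidden_names)
```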

In [45]:
text.html

<html xmlns="http://www.w3.org/1999/xhtml">
<head id="ctl00_Head1"><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="NCENVPBL" name="keywords"/><title>
	Establishment
</title><link href="../App_Themes/Mount%20Redmond/BaseStyles.css" rel="stylesheet" type="text/css"/><link href="../App_Themes/Mount%20Redmond/Styles.css" rel="stylesheet" type="text/css"/><link href="/NCEnvPbl/WebResource.axd?d=bjXvMHIGHt5AIHiflgE4qxfla9QW-s7grX_OliyKmo7TSFvOs9oaqQjicW-m9p1YFvchQ32nrt2oh7rhhbfgF88DChhAqDyR_KC02SaTmoZbjLYi358w9moX6vzQRoPooVDvmO-xzW_WSIqCFXmcVQ2&amp;t=635302236840000000" rel="stylesheet" type="text/css"/></head>
<body class="pBack" id="ctl00_Body1">
<form action="ShowESTABLISHMENTTablePage.aspx?ESTTST_CTY=60" id="aspnetForm" method="post" name="aspnetForm">
<input id="__VIEWSTATE" name="__VIEWSTATE" type="hidden" value="/wEPDwUJMzkyNjY5MTAxDxQrAANkaGQWAmYPZBYCAgMPZBYCAgEPZBYCAgkPZBYEAgMPZBYCZg9kFgICAw8WCB4hRVNUQUJMSVNITUVOVFRhYmxlQ29udHJvbF9PcmRlckJ5BbECPE5hbWUgU2VsY3RUeX