# HTTP/HTML and Web Services
## 10/24/2023


In [41]:
%%html
<script src="https://bits.csb.pitt.edu/preamble.js"></script>

In [47]:
import re
m = re.search(r'([^:]):(\d+)([^:]?):([^:]+)(:([^:]+))?',"A:44B:CYS:SG")

In [48]:
m.groups()

('A', '44', 'B', 'CYS', ':SG', 'SG')

In [50]:
%%html
<div id="regexqq" style="width: 500px"></div>
<script>

    var divid = '#regexqq';
	jQuery(divid).asker({
	    id: divid,
	    question: "What is m.group(3)?",
		answers: ['Error','SG','CYS','44B','44','B',':SG'],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>

<center><img src='imgs/OSI_model.jpg'></center>

# HTTP

Hypertext Transfer Protocol

A request-response protocol in a client-server framework.

# HTTP Requests

The request consists of the following:

* A request line with desired method (action)
* Request Headers
* An empty line.
* An optional message body.

Example
<pre>GET / HTTP/1.1
Host: cnn.com
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Cookie: SelectedEdition=www; optimizelyEndUserId=oeu1364949768474r0.014468349516391754; optimizelySegments=%7B%22170962340%22%3A%22false%22%2C%22171657961%22%3A%22gc%22%2C%22172148679%22%3A%22none%22%2C%22172265329%22%3A%22search%22%7D; optimizelyBuckets=%7B%7D; s_vi=[CS]v1|25F54343051D3284-6000013900246AA6[CE]</pre>
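
The four parts listed above can be assembled by hand. The following is a minimal sketch, not how any browser or library actually builds its requests; `build_request` is a hypothetical helper, and it relies on the fact that HTTP/1.1 lines end in a carriage return plus line feed.

```python
def build_request(method, path, headers, body=""):
    """Assemble the text of an HTTP/1.1 request message."""
    lines = [f"{method} {path} HTTP/1.1"]                 # request line
    lines += [f"{k}: {v}" for k, v in headers.items()]    # request headers
    lines.append("")                                      # empty line ends the headers
    lines.append(body)                                    # optional message body
    return "\r\n".join(lines)


request_text = build_request("GET", "/", {"Host": "cnn.com",
                                          "Connection": "keep-alive"})
print(request_text)
```

A GET request usually has no body, so the message simply ends after the empty line.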


# HTTP Request Methods

**GET**

Requests a representation of the specified resource. Requests using GET should only retrieve data and should have no other effect.

**POST**

Submits data to the server in the request body.  Can have side-effects.

**HEAD**

Asks for the response identical to the one that would correspond to a GET request, but without the response body (just the headers). 

**OPTIONS, PUT, DELETE, TRACE and CONNECT**

HTTP 1.1 methods that are less commonly used.

# Sending Data

**GET**

Data must be in the query string of the URL.  This is the part of the URL after a question mark:

    http://server/program/path/?query_string

The query_string is made up of name=value pairs separated by &:

    http://server/program/path/?field1=value1&field2=value2&field3=value3
    
URLs have practical length limits (these differ by browser and server, but it is generally safest to stay under 2000 characters)
    
**POST**

The data is sent in the request body. HTTP itself imposes no length limit on the body (just a patience limit), although individual servers may set their own.
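
To see the name=value structure of a query string, the standard library can take one apart. This sketch reuses the placeholder URL from above; `urlsplit` and `parse_qsl` are both in `urllib.parse`.

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://server/program/path/?field1=value1&field2=value2&field3=value3"

query = urlsplit(url).query   # everything after the question mark
pairs = parse_qsl(query)      # list of (name, value) tuples

print(query)
print(pairs)
```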

http://www.rcsb.org/pdb/ngl/ngl.do?pdbid=3ERK&bionumber=1

In [39]:
%%html
<div id="httpgetdata" style="width: 500px"></div>
<script>

    var divid = '#httpgetdata';
	jQuery(divid).asker({
	    id: divid,
	    question: "How many data items are being sent in this URL?",
		answers: ['0','1','2','3','4'],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>

# HTTP Response

A response consists of the following:

* A Status-Line (includes [response code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes))
* Response Headers
* An empty line
* An optional message body

<br>

Example:

```
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 10 Oct 2013 14:23:59 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: CG=US:PA:Pittsburgh; path=/
Last-Modified: Thu, 10 Oct 2013 14:23:14 GMT
Vary: Accept-Encoding
Cache-Control: max-age=60, private
Expires: Thu, 10 Oct 2013 14:24:58 GMT
Content-Encoding: gzip

<!DOCTYPE HTML> <html lang="en-US"> <head> <title>CNN.com - Breaking News, U.S., World, Weather, Entertainment &amp; Video News</title> 
``` 
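
The same structure can be picked apart with plain string operations. This is only a sketch using a shortened copy of the example response above; in practice `requests` does this for you and exposes the pieces as `response.status_code` and `response.headers`.

```python
raw_head = (
    "HTTP/1.1 200 OK\r\n"
    "Server: nginx\r\n"
    "Content-Type: text/html\r\n"
    "Content-Encoding: gzip\r\n"
    "\r\n"
)

# the empty line separates the status line and headers from the body
head = raw_head.split("\r\n\r\n", 1)[0]
status_line, *header_lines = head.split("\r\n")

version, code, reason = status_line.split(" ", 2)
headers = dict(line.split(": ", 1) for line in header_lines)

print(version, int(code), reason)
print(headers)
```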


# requests 

`requests` is a simple, high-level interface for making HTTP requests.

Also:
 * `urllib.request` - the standard library interface (called `urllib2` in Python 2)
 * `urllib.parse` - URL handling in the standard library, has the `urlencode` function
 * `urllib3` - third-party library that `requests` is built on; despite the name it is not part of the standard library
 * `http.client` - low-level interface to HTTP requests (called `httplib` in Python 2)
 * `mechanize` - much higher level interface for scripting web interactions - use this if you need to submit form data, passwords, etc.


# `get`

`requests.get` takes a URL and returns a response object that contains the message body.

**Note:** `urllib.request` (`urllib2` in Python 2) and `mechanize` have a `urlopen` function that returns a file-like object.

In [3]:
import requests
response = requests.get('http://mscbio2025.csb.pitt.edu')
print(response.text)

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="refresh" content="0; url='https://mscbio2025.github.io/'" />
  </head>
  <body>
  </body>
</html>



**Must include protocol in URL**

In [4]:
fail = requests.get('www.cnn.com')

MissingSchema: Invalid URL 'www.cnn.com': No scheme supplied. Perhaps you meant https://www.cnn.com?
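
One way to guard against this error is to add a scheme before making the request. A minimal sketch; `ensure_scheme` is a hypothetical helper, and defaulting to `https` is an assumption that will not suit every server.

```python
def ensure_scheme(url, default="https"):
    """Prepend a scheme when the URL does not already contain one."""
    if "://" in url:
        return url
    return f"{default}://{url}"


print(ensure_scheme("www.cnn.com"))
print(ensure_scheme("http://mscbio2025.csb.pitt.edu"))
```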

# Requesting with Data (POST)

In [5]:
url = 'http://pocketquery.csb.pitt.edu/pocket.cgi'
values = {'json' : '{"pdbid_text":["1ycr"],"start":0,"num":24,"sort":"score","dir":"desc","cmd":"cluster"}'}

data = requests.post(url,data=values)
print(data)

<Response [200]>


In [6]:
the_page = data.text

In [7]:
 the_page.split('\n')[16:18]

['1YCR,B,178,2,5.856,19;23,PHE;TRP,-6.31,-12.62,-6.61,-6.01,5.20125,10.4025,4.7898,5.6127,133.9,267.8,131.1,136.7,72.3,144.6,64.7,79.9,0,0,0,0,0,0,0,0,0.972816',
 '1YCR,B,129,1,0,19,PHE,-6.61,-6.61,-6.61,-6.61,5.6127,5.6127,5.6127,5.6127,131.1,131.1,131.1,131.1,79.9,79.9,79.9,79.9,0,0,0,0,0,0,0,0,0.96009']

# Requesting with Data (GET)

In [8]:
values = {'structureId' : '1ycr'}
data = requests.get('http://www.pdb.org/pdb/explore/explore.do',values)

print(data.text)

<!DOCTYPE html><html lang="en"><head><script src="https://www.googletagmanager.com/gtag/js?id=G-5JMGYPWJRR" async></script><script>//- global rcsb-config object
var RC = {
      googleAnalyticsTrackingId: 'UA-3923365-3'
    , instance: 'production'
    , isProductionServer: true
    , dataUrl: 'https://data.rcsb.org/'
    , searchUrl: 'https://search.rcsb.org/rcsbsearch/v2/'
    , alignmentUrl: 'https://alignment.rcsb.org/api/v1-beta/'
    , internalAnalyticsOriginHeaderKey: 'Rcsb-Analytics-Traffic-Origin'
    , internalAnalyticsOriginHeaderValue: 'internal'
    , internalAnalyticsStageHeaderKey: 'Rcsb-Analytics-Traffic-Stage'
    , internalAnalyticsStageHeaderValue: 'production'
    , MOLSTAR_IMG_URL: 'https://cdn.rcsb.org/images/structures/'
};
</script><script src="/search/search-data?ts=5660297"></script><script src="/js/search/react-search.js?ts=5660297"></script><script>!function(){if("performance"in window==0&&(window.performance={}),Date.now=Date.now||function(){return(new Date

# Encoding Data

URLs are only allowed to have alphanumeric characters (and a handful of punctuation marks).  This means data needs to be encoded when passing it as a value. `requests` will do this for you if you pass values as a dictionary.

In [9]:
values = {'q':"What's the meaning of life?"}
data = requests.get('http://www.google.com/search',values)

In [10]:
data.url

'http://www.google.com/search?q=What%27s+the+meaning+of+life%3F'
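
If you need the encoded string without making a request, the standard library's `urllib.parse.urlencode` applies the same kind of encoding. A short sketch with the same values:

```python
from urllib.parse import urlencode

values = {'q': "What's the meaning of life?"}
encoded = urlencode(values)   # spaces become +, other unsafe characters become %XX

print(encoded)  # q=What%27s+the+meaning+of+life%3F
```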

# Faking It

<img src="imgs/zeldacomp.jpeg" width="500">

...or a python script

In [11]:
import requests, re
page = requests.get('https://www.whatsmybrowser.org/').text
re.findall(r'You&rsquo;re using (.*?)\.',page)

['python-requests 2', 'python-requests 2']

In [12]:
page = requests.get('https://www.whatsmybrowser.org/',headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}).text
re.findall(r'You&rsquo;re using (.*?)\.',page)

['Chrome 61', 'Chrome 61']

**Use `mechanize` if a site is giving you trouble**

# Getting Data From the Web

If it's on the web, you can get it into python.

* *screen scraping* - getting data from your computer screen; often to the extreme of using screen shots and OCR
* *web scraping* - downloading HTML content and parsing out what you need


# Web Scraping

The advantage of web scraping is that it always works - if you can see the data in your browser, you can see it in python.

Disadvantages:

* HTMLParser isn't very sophisticated and requires you to manage context
    * see [pyquery](https://pypi.python.org/pypi/pyquery) for a more powerful HTML parsing package inspired by JQuery
    * [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is another popular library for extracting information from HTML
    
* Scraping is slow and inefficient
    * may need to make several requests to get your data (think pagination)
    * downloads a lot more than just your data (html)
    * your code may be easily broken by changes in the website design
    
**Only parse raw HTML if there is no better option**
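
To illustrate what "manage context" means, here is a minimal sketch with the standard library's `html.parser.HTMLParser`. The parser only reports events (start tag, end tag, text), so the class has to remember for itself that it is inside a `<title>`. `TitleParser` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # the context we have to track ourselves
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed("<html><head><title>NCBI Blast</title></head>"
            "<body><p>text</p></body></html>")
print(parser.title)
```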

# ReSTful Web Services

REpresentational State Transfer

* Client–server
* Stateless
* Cacheable
* Layered system
* Uniform interface

    * basically, resources are identified by their url
    * ReST does *not* specify the format of the resources, but they are often provided in **XML** or **JSON**

# XML

*Extensible Markup Language*. A very general way to express structured data; a generalization of HTML.

* **Tag** Key building block of XML - starts with `<` and ends with `>`
    * `<div>` start tag
    * `</div>` end tag
    * `<br />` empty element tag (no matching end tag)
 
* **Element** A component of the document; everything from a start tag through its matching end tag.  May contain child elements.

* **Attribute** A name-value pair within a start or empty element tag; the value must be quoted
        <div width="90">
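
These terms map directly onto what a parser returns. A small sketch with the standard library's `xml.etree.ElementTree`; the document string is made up for illustration.

```python
import xml.etree.ElementTree as ET

doc = '<div width="90"><p>first</p><br /><p>second</p></div>'
root = ET.fromstring(doc)

tag = root.tag                              # name of the outermost element
attributes = root.attrib                    # attributes as a dictionary
children = [child.tag for child in root]    # child elements, in document order

print(tag)         # div
print(attributes)  # {'width': '90'}
print(children)    # ['p', 'br', 'p']
```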

# XML Example

Let's consider accessing NCBI's Entrez service to get data from the Gene Expression Omnibus (note we could use BioPython and avoid doing this directly).

Let's look for data from humans with between 100 and 500 samples:

* (human[Organism]) AND 100:500[Number of Samples]

We construct a URL to represent this query.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term=human[Organism]+AND+100:500[Number+of+Samples]

db=gds - GEO datasets 

* GEO DataSets is a study-level database which users can search for studies relevant to their interests. The database stores descriptions of all original submitter-supplied records, as well as curated DataSets.


In [13]:
result = requests.get(
    "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={'db': 'gds',
            'term': 'human[Organism] AND 100:500[Number of Samples]'}).text
print(result[:297])

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>6651</Count><RetMax>20</RetMax><RetStart>0</RetStart><IdList>
<Id>200245735</Id>
<Id>200241425</Id>
<I


In [38]:
%%html
<div id="xmlq" style="width: 500px"></div>
<script>

    var divid = '#xmlq';
	jQuery(divid).asker({
	    id: divid,
	    question: "On the previous slide, RetMax is what?",
		answers: ['Element','Attribute','Tag','Component'],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>

# `xml.etree.ElementTree`

In [14]:
import xml.etree.ElementTree as ET
root = ET.fromstring(result)
print(root.tag,root.attrib)

eSearchResult {}


In [15]:
for child in root:
    print(child.tag)

Count
RetMax
RetStart
IdList
TranslationSet
TranslationStack
QueryTranslation


In [16]:
root.find('IdList')

<Element 'IdList' at 0x7f2918f418f0>

In [17]:
ids = [child.text for child in root.find('IdList')]
print(ids)

['200245735', '200241425', '200181711', '200181709', '200245630', '200241428', '200232216', '200231719', '200245626', '200240155', '200222110', '200185796', '200233715', '200227832', '200226400', '200211631', '200211158', '200131747', '200237999', '200225817']


In [18]:
root.findall('Id') #findall (like find) only considers direct children

[]

In [19]:
ids = [elem.text for elem in root.iter('Id')] #iter considers the full tree
print(ids)

['200245735', '200241425', '200181711', '200181709', '200245630', '200241428', '200232216', '200231719', '200245626', '200240155', '200222110', '200185796', '200233715', '200227832', '200226400', '200211631', '200211158', '200131747', '200237999', '200225817']
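
`find` and `findall` also accept a limited XPath-style path, which is another way to reach nested elements. A sketch on a hand-made sample shaped like the response above (the sample is not real output):

```python
import xml.etree.ElementTree as ET

sample = ("<eSearchResult><Count>2</Count><IdList>"
          "<Id>200245735</Id><Id>200241425</Id>"
          "</IdList></eSearchResult>")
root = ET.fromstring(sample)

direct = root.findall('Id')                                # direct children only
by_path = [e.text for e in root.findall('IdList/Id')]      # explicit path
anywhere = [e.text for e in root.findall('.//Id')]         # any depth below root

print(direct)
print(by_path)
print(anywhere)
```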


# XML Parsing Alternative: Regular Expressions

In [20]:
print(result[:297])

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>6651</Count><RetMax>20</RetMax><RetStart>0</RetStart><IdList>
<Id>200245735</Id>
<Id>200241425</Id>
<I


In [21]:
import re
regex = re.compile(r'<Id>(\d+)</Id>')
print(regex.search(result))

<re.Match object; span=(257, 275), match='<Id>200245735</Id>'>


In [22]:
ids = [m.group(1) for m in regex.finditer(result)]
print(ids)

['200245735', '200241425', '200181711', '200181709', '200245630', '200241428', '200232216', '200231719', '200245626', '200240155', '200222110', '200185796', '200233715', '200227832', '200226400', '200211631', '200211158', '200131747', '200237999', '200225817']


# JSON: JavaScript Object Notation

A lightweight data-interchange format.  

Essentially, data is represented as JavaScript literals, which look very similar to Python dictionaries and lists.

In [23]:
import json

In [24]:
json.dumps({'a':1, 'b':[1.2,1.3,1.4]})

'{"a": 1, "b": [1.2, 1.3, 1.4]}'
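
Going the other way, `json.loads` turns JSON text back into Python objects. A short sketch of the round trip, including the spots where the two notations differ (`true`/`null` in JSON versus `True`/`None` in Python):

```python
import json

text = json.dumps({'a': 1, 'b': [1.2, 1.3, 1.4]})
data = json.loads(text)                      # back to a dict with a list inside

flags = json.loads('{"ok": true, "missing": null}')

print(data['b'][0])
print(flags)
```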

# PDB Restful

https://data.rcsb.org/redoc/index.html

Access PDB data through endpoints (URLs).  The path of the endpoints starts with https://data.rcsb.org/rest/v1/core, followed by the type of the resource, e.g. entry, polymer_entity, and the identifier. 
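
Since every endpoint follows the same pattern, the URL can be built from its parts. A minimal sketch; `core_endpoint` is a hypothetical helper, not part of any PDB client library.

```python
CORE_URL = "https://data.rcsb.org/rest/v1/core"


def core_endpoint(resource, *identifiers):
    """Join the base path, the resource type, and the identifier(s)."""
    return "/".join([CORE_URL, resource, *map(str, identifiers)])


print(core_endpoint("entry", "4hhb"))
print(core_endpoint("polymer_entity", "4hhb", 1))
```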

In [25]:
response = requests.get('https://data.rcsb.org/rest/v1/core/entry/4hhb')
print(response.text)

{"audit_author":[{"name":"Fermi, G.","pdbx_ordinal":1},{"name":"Perutz, M.F.","pdbx_ordinal":2}],"cell":{"angle_alpha":90.0,"angle_beta":99.34,"angle_gamma":90.0,"length_a":63.15,"length_b":83.59,"length_c":53.8,"zpdb":4},"citation":[{"country":"UK","id":"primary","journal_abbrev":"J.Mol.Biol.","journal_id_astm":"JMOBAK","journal_id_csd":"0070","journal_id_issn":"0022-2836","journal_volume":"175","page_first":"159","page_last":"174","pdbx_database_id_doi":"10.1016/0022-2836(84)90472-8","pdbx_database_id_pub_med":6726807,"rcsb_authors":["Fermi, G.","Perutz, M.F.","Shaanan, B.","Fourme, R."],"rcsb_is_primary":"Y","rcsb_journal_abbrev":"J Mol Biol","title":"The crystal structure of human deoxyhaemoglobin at 1.74 A resolution","year":1984},{"country":"UK","id":"1","journal_abbrev":"Nature","journal_id_astm":"NATUAS","journal_id_csd":"0006","journal_id_issn":"0028-0836","journal_volume":"295","page_first":"535","rcsb_authors":["Perutz, M.F.","Hasnain, S.S.","Duke, P.J.","Sessler, J.L.","Hah

In [26]:
pdbinfo = json.loads(response.text)
pdbinfo

{'audit_author': [{'name': 'Fermi, G.', 'pdbx_ordinal': 1},
  {'name': 'Perutz, M.F.', 'pdbx_ordinal': 2}],
 'cell': {'angle_alpha': 90.0,
  'angle_beta': 99.34,
  'angle_gamma': 90.0,
  'length_a': 63.15,
  'length_b': 83.59,
  'length_c': 53.8,
  'zpdb': 4},
 'citation': [{'country': 'UK',
   'id': 'primary',
   'journal_abbrev': 'J.Mol.Biol.',
   'journal_id_astm': 'JMOBAK',
   'journal_id_csd': '0070',
   'journal_id_issn': '0022-2836',
   'journal_volume': '175',
   'page_first': '159',
   'page_last': '174',
   'pdbx_database_id_doi': '10.1016/0022-2836(84)90472-8',
   'pdbx_database_id_pub_med': 6726807,
   'rcsb_authors': ['Fermi, G.', 'Perutz, M.F.', 'Shaanan, B.', 'Fourme, R.'],
   'rcsb_is_primary': 'Y',
   'rcsb_journal_abbrev': 'J Mol Biol',
   'title': 'The crystal structure of human deoxyhaemoglobin at 1.74 A resolution',
   'year': 1984},
  {'country': 'UK',
   'id': '1',
   'journal_abbrev': 'Nature',
   'journal_id_astm': 'NATUAS',
   'journal_id_csd': '0006',
   'jou

In [27]:
pdbinfo['rcsb_entry_info']

{'assembly_count': 1,
 'branched_entity_count': 0,
 'cis_peptide_count': 0,
 'deposited_atom_count': 4779,
 'deposited_hydrogen_atom_count': 0,
 'deposited_model_count': 1,
 'deposited_modeled_polymer_monomer_count': 574,
 'deposited_nonpolymer_entity_instance_count': 6,
 'deposited_polymer_entity_instance_count': 4,
 'deposited_polymer_monomer_count': 574,
 'deposited_solvent_atom_count': 221,
 'deposited_unmodeled_polymer_monomer_count': 0,
 'disulfide_bond_count': 0,
 'entity_count': 5,
 'experimental_method': 'X-ray',
 'experimental_method_count': 1,
 'inter_mol_covalent_bond_count': 0,
 'inter_mol_metalic_bond_count': 4,
 'molecular_weight': 64.74,
 'na_polymer_entity_types': 'Other',
 'nonpolymer_bound_components': ['HEM'],
 'nonpolymer_entity_count': 2,
 'nonpolymer_molecular_weight_maximum': 0.62,
 'nonpolymer_molecular_weight_minimum': 0.09,
 'polymer_composition': 'heteromeric protein',
 'polymer_entity_count': 2,
 'polymer_entity_count_dna': 0,
 'polymer_entity_count_rna': 0
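
Once parsed, the result is just nested dictionaries and lists. A sketch of pulling values out, using a hand-copied excerpt of the output above rather than a live request:

```python
pdbinfo_excerpt = {
    'audit_author': [{'name': 'Fermi, G.', 'pdbx_ordinal': 1},
                     {'name': 'Perutz, M.F.', 'pdbx_ordinal': 2}],
    'cell': {'length_a': 63.15, 'length_b': 83.59, 'length_c': 53.8},
}

# a list of dictionaries: loop over it and pick one key from each
authors = [author['name'] for author in pdbinfo_excerpt['audit_author']]

# a nested dictionary: index one level at a time
lengths = [pdbinfo_excerpt['cell'][key] for key in ('length_a', 'length_b', 'length_c')]

# .get avoids a KeyError when a key is absent (this excerpt has no rcsb_entry_info)
method = pdbinfo_excerpt.get('rcsb_entry_info', {}).get('experimental_method')

print(authors, lengths, method)
```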

In [28]:
response = requests.get('https://data.rcsb.org/rest/v1/core/polymer_entity/4hhb/1')
json.loads(response.text)

{'rcsb_cluster_membership': [{'cluster_id': 105, 'identity': 100},
  {'cluster_id': 112, 'identity': 95},
  {'cluster_id': 96, 'identity': 90},
  {'cluster_id': 47, 'identity': 70},
  {'cluster_id': 20, 'identity': 50},
  {'cluster_id': 31, 'identity': 30}],
 'entity_poly': {'nstd_linkage': 'no',
  'nstd_monomer': 'no',
  'pdbx_seq_one_letter_code': 'VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR',
  'pdbx_seq_one_letter_code_can': 'VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR',
  'pdbx_strand_id': 'A,C',
  'rcsb_artifact_monomer_count': 0,
  'rcsb_conflict_count': 0,
  'rcsb_deletion_count': 0,
  'rcsb_entity_polymer_type': 'Protein',
  'rcsb_insertion_count': 0,
  'rcsb_mutation_count': 0,
  'rcsb_non_std_monomer_count': 0,
  'rcsb_sample_sequence_length': 141,
  'type': 'polypeptide(L)'},
 'ent

In [29]:
response = requests.get('https://data.rcsb.org/rest/v1/core/polymer_entity_instance/4HHB/A')
json.loads(response.text)

{'rcsb_ligand_neighbors': [{'atom_id': 'NE2',
   'auth_seq_id': 87,
   'comp_id': 'HIS',
   'distance': 2.143,
   'ligand_asym_id': 'E',
   'ligand_atom_id': 'FE',
   'ligand_comp_id': 'HEM',
   'ligand_entity_id': '3',
   'ligand_is_bound': 'Y',
   'ligand_model_id': 1,
   'seq_id': 87}],
 'rcsb_polymer_entity_instance_container_identifiers': {'asym_id': 'A',
  'auth_asym_id': 'A',
  'auth_to_entity_poly_seq_mapping': ['1',
   '2',
   '3',
   '4',
   '5',
   '6',
   '7',
   '8',
   '9',
   '10',
   '11',
   '12',
   '13',
   '14',
   '15',
   '16',
   '17',
   '18',
   '19',
   '20',
   '21',
   '22',
   '23',
   '24',
   '25',
   '26',
   '27',
   '28',
   '29',
   '30',
   '31',
   '32',
   '33',
   '34',
   '35',
   '36',
   '37',
   '38',
   '39',
   '40',
   '41',
   '42',
   '43',
   '44',
   '45',
   '46',
   '47',
   '48',
   '49',
   '50',
   '51',
   '52',
   '53',
   '54',
   '55',
   '56',
   '57',
   '58',
   '59',
   '60',
   '61',
   '62',
   '63',
   '64',
   '65',
   

# BLAST URLAPI

[http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html](http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html)

A URLAPI request looks like this:

`http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=<command>&{<name>=<value>}`

where CMD can be

* **PUT** add a job to the queue
* **GET** get formatted results from a job


# BLAST PUT

Some common attributes to PUT:

* DATABASE - what database to search, *mandatory*
* PROGRAM - what program to use (blastn, blastp, blastx, tblastn, tblastx)
* QUERY - Accession(s), gi(s), or FASTA sequence(s), *mandatory*
* MATRIX_NAME - matrix to use (default BLOSUM62)


In [40]:
%%html
<div id="webblast" style="width: 500px"></div>
<script>

    var divid = '#webblast';
	jQuery(divid).asker({
	    id: divid,
	    question: "When making a BLAST request with a long sequence, which HTTP request type should you use?",
		answers: ['GET','POST','REQUEST','Does not matter'],
        server: "https://bits.csb.pitt.edu/asker.js/example/asker.cgi",
		charter: chartmaker})
    
$(".jp-InputArea .o:contains(html)").closest('.jp-InputArea').hide();


</script>

In [31]:
response = requests.post('https://www.ncbi.nlm.nih.gov/blast/Blast.cgi',
                         data={'CMD': 'PUT', 'DATABASE': 'nr', 'PROGRAM': 'blastp',
                               'QUERY': 'SQETFSDLWKLLPEN'})

https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=PUT&DATABASE=nr&QUERY=SQETFSDLWKLLPEN&PROGRAM=blastp

In [32]:
print(response.text)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="jig" content="ncbitoggler ncbiautocomplete"/>
<meta name="ncbi_app" content="static" />
<meta name="ncbi_pdid" content="blastformatreq" />
<meta name="ncbi_stat" content="false" />
<meta name="ncbi_sessionid" content="21915490536CA8C1_0000SID" />
<meta name="ncbi_phid" content="21915490536CA8C10000000000000001" />
<title>NCBI Blast</title>
<meta http-equiv="Pragma" content="no-cache">
<link rel="stylesheet" type="text/css" href="css/uswds.min.css" media="screen" />
<link rel="stylesheet"  type="text/css" href="https://www.ncbi.nlm.nih.gov/style-guide/static/nwds/css/nwds.css"/>

<link rel="stylesheet" href="css/headerNew.css?v=1"/>
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.5.0/css/all.

# BLAST GET

Using the request id parsed from the result of the PUT, get the results of the search.  Common attributes:

* RID - mandatory
* FORMAT_TYPE - HTML, Text, ASN.1, XML
* ALIGNMENTS - number of alignments (default 500)
* ALIGNMENT_VIEW - Pairwise, QueryAnchored, QueryAnchoredNoIdentities, FlatQueryAnchored, FlatQueryAnchoredNoIdentities, Tabular

In [33]:
rid = re.search(r'RID = (\S+)',response.text).group(1)
print(rid)

KDJMCZ64013
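
The one-liner above raises an `AttributeError` if the response contains no RID, because `re.search` returns `None`. A slightly more defensive sketch; `extract_rid` is a hypothetical helper and the sample strings are made up.

```python
import re


def extract_rid(text):
    """Return the request id from a PUT response, or None if there is none."""
    m = re.search(r'RID = (\S+)', text)
    return m.group(1) if m else None


print(extract_rid("    RID = KDJMCZ64013\n"))
print(extract_rid("<html>no request id here</html>"))
```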


In [34]:
result = requests.get('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=GET&RID=%s&FORMAT_TYPE=XML' % rid).text

In [35]:
print(result)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="jig" content="ncbitoggler"/>
<meta name="ncbitoggler" content="animation:'none'"/>
<title>NCBI Blast:</title>
<script type="text/javascript" src="/core/jig/1.15.2/js/jig.min.js             "></script>
<script type="text/javascript">    jQuery.getScript("/core/alerts/alerts.js", function() {
        galert(['div#header', 'body > *:nth-child(1)'])
    });</script>
<meta http-equiv="Pragma" content="no-cache">
<link rel="stylesheet" type="text/css" href="css/uswds.min.css" media="screen" />
<link rel="stylesheet"  type="text/css" href="https://www.ncbi.nlm.nih.gov/style-guide/static/nwds/css/nwds.css"/>

<link rel="stylesheet" href="css/headerNew.css?v=1"/>
<link rel="stylesheet" href="https://use.fontawesome.com/r

In [36]:
result = requests.get('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=GET&RID=%s&FORMAT_TYPE=XML' % rid).text

In [37]:
print(result)

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="jig" content="ncbitoggler"/>
<meta name="ncbitoggler" content="animation:'none'"/>
<title>NCBI Blast:</title>
<script type="text/javascript" src="/core/jig/1.15.2/js/jig.min.js             "></script>
<script type="text/javascript">    jQuery.getScript("/core/alerts/alerts.js", function() {
        galert(['div#header', 'body > *:nth-child(1)'])
    });</script>
<meta http-equiv="Pragma" content="no-cache">
<link rel="stylesheet" type="text/css" href="css/uswds.min.css" media="screen" />
<link rel="stylesheet"  type="text/css" href="https://www.ncbi.nlm.nih.gov/style-guide/static/nwds/css/nwds.css"/>

<link rel="stylesheet" href="css/headerNew.css?v=1"/>
<link rel="stylesheet" href="https://use.fontawesome.com/r

# Exercise: Anything new?

Write a script to see if a website has changed since the last time you checked.  The script should save the text of the website in the current directory and compare the previously saved text to the current text.
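
One possible shape for the compare-and-save step, as a sketch only: `has_changed` is a hypothetical helper, the downloading is left out, and a temporary directory stands in for the current directory so the example cleans up after itself. Treating the first check as "not changed" is a design choice, not a requirement.

```python
import tempfile
from pathlib import Path


def has_changed(current_text, saved_path):
    """Compare text to the saved copy, then save it for the next check."""
    path = Path(saved_path)
    previous = path.read_text() if path.exists() else None
    path.write_text(current_text)
    # with nothing saved yet there is nothing to compare against
    return previous is not None and previous != current_text


with tempfile.TemporaryDirectory() as tmp:
    saved = Path(tmp) / "page.txt"
    first = has_changed("version 1", saved)
    same = has_changed("version 1", saved)
    changed = has_changed("version 2", saved)

print(first, same, changed)  # False False True
```

In the real script, `current_text` would come from `requests.get(url).text`.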

# Exercise: BLAST IT!

Use the BLAST URLAPI to find structures with similar sequence to a user-supplied FASTA protein sequence. Your script will take the name of a FASTA file as its only argument. The contents of this file should be provided as the QUERY parameter in a BLAST URL PUT request querying the pdb database (DATABASE=pdb) using blastp (PROGRAM=blastp).  You should extract the RID from the response using a regular expression.

Using the RID, you then submit a GET request. Your GET request may either return the desired data or, if the search hasn't finished, a status HTML page (even if you request XML). You should look for the presence of the string Status=WAITING in the response. If this string is present, you should repeat your GET request every 5 seconds (`time.sleep(5)`) until you get a response without it.  A search typically takes about 30 seconds.

Print out the final XML response.

Example fasta file: http://mscbio2025.net/files/brca.fasta
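
The waiting logic can be tried out without contacting NCBI by passing in the function that does the fetching. A sketch under stated assumptions: `poll_until_ready` is a hypothetical helper, the cap on the number of tries is an addition so the loop cannot run forever, and the canned replies are placeholders rather than real BLAST output.

```python
import time


def poll_until_ready(fetch, delay=5, max_tries=60):
    """Call fetch() until the reply no longer contains Status=WAITING."""
    for _ in range(max_tries):
        result = fetch()
        if 'Status=WAITING' not in result:
            return result
        time.sleep(delay)
    raise TimeoutError("still waiting after %d tries" % max_tries)


# stand-in for the real GET request: two waiting pages, then a result
replies = iter(['Status=WAITING', 'Status=WAITING', '<xml>results</xml>'])
final = poll_until_ready(lambda: next(replies), delay=0)
print(final)
```

In the real script, `fetch` would wrap the `requests.get` call that asks for the RID.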

In [None]:
import re
import sys
import time
import xml.etree.ElementTree as ET

import requests

BLAST_URL = 'https://www.ncbi.nlm.nih.gov/blast/Blast.cgi'

if len(sys.argv) < 2:
    print("Need fasta file")
    sys.exit(1)

# read the query sequence; the with block closes the file for us
with open(sys.argv[1]) as f:
    fasta = f.read()

# submit the search with POST, since a FASTA query can exceed URL length limits
values = {'CMD': 'PUT', 'DATABASE': 'pdb', 'PROGRAM': 'blastp', 'QUERY': fasta,
          'PROGRAM_NAME': 'blastp', 'BLAST_PROGRAM': 'blastp'}
res = requests.post(BLAST_URL, data=values)
m = re.search(r'RID = (\S+)', res.text)
if not m:
    print("No RID found in response")
    sys.exit(1)
rid = m.group(1)

# poll every 5 seconds until the response no longer reports Status=WAITING
values = {'CMD': 'GET', 'RID': rid, 'FORMAT_TYPE': 'XML'}
result = 'Status=WAITING'
while 'Status=WAITING' in result:
    time.sleep(5)
    result = requests.get(BLAST_URL, params=values).text

root = ET.fromstring(result)
# print out some summary information from the first 10 hits
for cnt, hit in enumerate(root.iter('Hit')):
    if cnt >= 10:
        break
    hsp = hit.find('Hit_hsps').find('Hsp')
    evalue = hsp.find('Hsp_evalue').text
    ident = float(hsp.find('Hsp_identity').text)
    length = float(hsp.find('Hsp_align-len').text)
    pdb_ch = hit.find('Hit_accession').text
    m = re.search(r'(\S+)_', pdb_ch)  # accession looks like <pdbid>_<chain>
    resolution = '0'
    if m:
        pdb = m.group(1)
        # look up the resolution through the data.rcsb.org entry endpoint
        # (the older www.pdb.org REST service has been retired);
        # the resolution_combined field name is an assumption, check the schema
        info = requests.get('https://data.rcsb.org/rest/v1/core/entry/' + pdb).json()
        res_list = info.get('rcsb_entry_info', {}).get('resolution_combined')
        if res_list:
            resolution = str(res_list[0])
    print('%s %s %.2f %s' % (pdb_ch, evalue, ident/length, resolution))
