[UniProt](http://www.uniprot.org/) is a resource for protein functional annotation. Its reviewed (Swiss-Prot) entries are annotated manually and are therefore of very high quality.

Querying UniProt is simple, but BioPython has no direct support for it. Instead, we will use the [requests](http://requests.readthedocs.org/en/latest/) module to grab data directly through the web interface, which is a good skill to have. The `requests` module does not come with Python, so install it with `pip install --user requests` on the command line (on Windows: START/cmd).

The basic documentation for the UniProt web API is [here](http://www.uniprot.org/help/programmatic_access). Let's have the examples speak for themselves. First, let's aim at grabbing these records: http://www.uniprot.org/uniprot/?query=name%3Ap53+AND+reviewed%3Ayes+AND+organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22&sort=score
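
The long query string in that URL is just a percent-encoded search expression. As a minimal sketch using only the standard library (the variable names `params` and `encoded` are illustrative, not part of any UniProt API), this shows how the readable query turns into the URL form:

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# The search expression exactly as you would type it into the UniProt search box.
params = {"query": 'name:p53 AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]"'}

# urlencode escapes reserved characters (':' becomes %3A, '"' becomes %22, ...)
# and turns spaces into '+'.
encoded = urlencode(params)
print(encoded)
```

The printed string should match the part of the URL after the `?`, apart from the extra `&sort=score`. The `requests` module does this encoding for us when we pass a dictionary as `params`.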

In [46]:
import requests

q={"query":'name:p53 AND reviewed:yes AND organism:"Homo sapiens (Human) [9606]"'}
r=requests.get("http://www.uniprot.org/uniprot",params=q)
print "r=",r
print "r.url=", r.url
print "r.text=", r.text[:500], "(...)" #HTML!

r= <Response [200]>
r.url= http://www.uniprot.org/uniprot/?query=name%3Ap53+AND+reviewed%3Ayes+AND+organism%3A%22Homo+sapiens+%28Human%29+%5B9606%5D%22
r.text= <!DOCTYPE html SYSTEM "about:legacy-compat">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head><title>name:p53 AND reviewed:yes AND organism:&#034;Homo sapiens (Human) [9606]&#034; in UniProtKB</title><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/><meta content="width=device-width, initial-scale=1" name="viewport"/><link href="/" rel="home"/><link href="http://creativecommons.org/licenses/by-nd/3.0/" rel="lic (...)


This is the same HTML you get when browsing UniProt yourself. It is nice to read in a browser, but useless for programmatic access. The UniProt docs reveal that we can pass the `format` and `columns` parameters (and many more). Let's try:

In [13]:
q["format"]="list"
q["limit"]=20
r=requests.get("http://www.uniprot.org/uniprot",params=q)
print "List:"
print r.text #nice!
#Can also get tabulated data
q["format"]="tab"
r=requests.get("http://www.uniprot.org/uniprot",params=q)
print "Tab:"
print r.text

List:
Q9BRQ8
A1A5B4
Q96KQ4
Q13625
Q9BXH1
Q96PG8
Q9UQB8
Q9H305
Q5BN46
Q8IWT3
Q96F07
O14682
O14681
Q8TAE8
Q15051
P47929
Q99732
O15151
Q00987
Q15648

Tab:
Entry	Entry name	Status	Protein names	Gene names	Organism	Length
Q9BRQ8	AIFM2_HUMAN	reviewed	Apoptosis-inducing factor 2 (EC 1.-.-.-) (Apoptosis-inducing factor homologous mitochondrion-associated inducer of death) (Apoptosis-inducing factor-like mitochondrion-associated inducer of death) (p53-responsive gene 3 protein)	AIFM2 AMID PRG3	Homo sapiens (Human)	373
A1A5B4	ANO9_HUMAN	reviewed	Anoctamin-9 (Transmembrane protein 16J) (Tumor protein p53-inducible protein 5) (p53-induced gene 5 protein)	ANO9 PIG5 TMEM16J TP53I5	Homo sapiens (Human)	782
Q96KQ4	ASPP1_HUMAN	reviewed	Apoptosis-stimulating of p53 protein 1 (Protein phosphatase 1 regulatory subunit 13B)	PPP1R13B ASPP1 KIAA0771	Homo sapiens (Human)	1090
Q13625	ASPP2_HUMAN	reviewed	Apoptosis-stimulating of p53 protein 2 (Bcl2-binding protein) (Bbp) (Renal carcinoma antigen NY-REN-51) (Tu

In [15]:
q["format"]="list"
q["limit"]=20
r=requests.get("http://www.uniprot.org/uniprot",params=q)
ids=r.text.strip().split("\n")
print ids #Here we have our ids!


[u'Q9BRQ8', u'A1A5B4', u'Q96KQ4', u'Q13625', u'Q9BXH1', u'Q96PG8', u'Q9UQB8', u'Q9H305', u'Q5BN46', u'Q8IWT3', u'Q96F07', u'O14682', u'O14681', u'Q8TAE8', u'Q15051', u'P47929', u'Q99732', u'O15151', u'Q00987', u'Q15648']


In [17]:
q["format"]="xml"
r=requests.get("http://www.uniprot.org/uniprot",params=q)
xml_data=r.text #...and in XML

The UniProt XML format is now supported by BioPython, so we can use SeqIO as before to get SeqRecords out of our result.

In [24]:
print xml_data[:500], "(...)"
from Bio import SeqIO
import StringIO
records=SeqIO.parse(StringIO.StringIO(xml_data),"uniprot-xml")
for rec in records: #each rec is a SeqRecord; a separate name keeps the response r intact
    print " ----------------- new record ----------------- "
    print rec.name, rec.id, rec.dbxrefs

<?xml version='1.0' encoding='UTF-8'?>
<uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
<entry dataset="Swiss-Prot" created="2006-05-30" modified="2016-01-20" version="127">
<accession>Q9BRQ8</accession>
<accession>B3KXI0</accession>
<accession>Q63Z39</accession>
<name>AIFM2_HUMAN</name>
<protein>
<recommendedName>
<fullName>Apoptosis-inducing factor 2 (...)
 ----------------- new record ----------------- 
AIFM2_HUMAN Q9BRQ8 ['Bgee:Q9BRQ8', 'BioGrid:124325', 'BioMuta:AIFM2', 'CCDS:CCDS7297.1', 'CTD:84883', 'ChiTaRS:AIFM2', 'CleanEx:HS_AIFM2', 'CleanEx:HS_PRG3', 'DMDM:74752283', 'DNASU:84883', 'DOI:10.1002/pmic.200900783', 'DOI:10.1016/S0014-5793(02)03049-1', 'DOI:10.1038/nature02462', 'DOI:10.1038/ncomms5919', 'DOI:10.1038/ng1285', 'DOI:10.1038/sj.onc.1207909', 'DOI:10.1074/jbc.M202285200', 'DOI:10.1074/jbc.M414018200', 'DOI:10.1101/gr.259650

It is always friendly not to overload the servers with many repeated requests. So please grab the 200 longest reviewed human proteins and store them in a file called "human_200.xml" (Ville exercise). We can then work with this file.
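
One way to avoid repeated requests is to download once and read from the file afterwards. The following is only a sketch of that idea, not a solution to the exercise: `fetch_cached` and `fake_download` are hypothetical names, and the download is replaced by a stand-in so the code runs offline (Python 3 syntax).

```python
import io
import os
import tempfile

def fetch_cached(path, fetch):
    """Return the text stored at `path`; call `fetch()` only when the file is missing."""
    if not os.path.exists(path):
        text = fetch()
        with io.open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        return text
    with io.open(path, encoding="utf-8") as handle:
        return handle.read()

# Stand-in for a real download so the sketch runs offline. In the notebook it
# could be: lambda: requests.get("http://www.uniprot.org/uniprot", params=q).text
calls = []

def fake_download():
    calls.append("download")
    return u"<uniprot>placeholder</uniprot>"

cache_path = os.path.join(tempfile.mkdtemp(), "human_200.xml")
first = fetch_cached(cache_path, fake_download)   # downloads and writes the file
second = fetch_cached(cache_path, fake_download)  # served from the file
print(len(calls))  # the stand-in download ran only once
```

Passing the download as a function keeps the caching logic independent of how the query is built, so the query for the exercise can be plugged in later.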

*Take a detour to the GO and named tuple materials now*

Named tuples work well in combination with tab-delimited data.

In [45]:
import collections
import csv #Check this one out
import StringIO

URec = collections.namedtuple('URec', 'id, entry_name, keywords')
q={"query":'organism:"Homo sapiens (Human) [9606]"',"columns":"id,entry name,keywords", "limit":10, "format":"tab"}
r=requests.get("http://www.uniprot.org/uniprot",params=q)
print r.text

#Now make a list of URec named tuples out of the table (Ville exercise)


Entry	Entry name	Keywords
P31947	1433S_HUMAN	3D-structure; Alternative splicing; Complete proteome; Cytoplasm; Direct protein sequencing; Nucleus; Phosphoprotein; Polymorphism; Reference proteome; Secreted; Ubl conjugation
P30462	1B14_HUMAN	Complete proteome; Disulfide bond; Glycoprotein; Host-virus interaction; Immunity; MHC I; Membrane; Polymorphism; Reference proteome; Signal; Transmembrane; Transmembrane helix; Ubl conjugation
P30479	1B41_HUMAN	3D-structure; Complete proteome; Disulfide bond; Glycoprotein; Host-virus interaction; Immunity; MHC I; Membrane; Polymorphism; Reference proteome; Signal; Transmembrane; Transmembrane helix; Ubl conjugation
P30483	1B45_HUMAN	Complete proteome; Disulfide bond; Glycoprotein; Host-virus interaction; Immunity; MHC I; Membrane; Polymorphism; Reference proteome; Signal; Transmembrane; Transmembrane helix; Ubl conjugation
P30486	1B48_HUMAN	Complete proteome; Disulfide bond; Glycoprotein; Host-virus interaction; Immunity; MHC I; Membrane; Polymorph
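
One possible sketch of turning such a table into named tuples, assuming Python 3 and using the `csv` module. The `table` string holds two rows copied from the output above so the example runs offline; in the notebook it would be `r.text`. Splitting the keywords into a list is a design choice of this sketch, not something the exercise requires.

```python
import collections
import csv
import io

URec = collections.namedtuple("URec", "id, entry_name, keywords")

# Two rows copied from the output above; in the notebook this would be r.text.
table = (
    "Entry\tEntry name\tKeywords\n"
    "P31947\t1433S_HUMAN\t3D-structure; Alternative splicing; Complete proteome; "
    "Cytoplasm; Direct protein sequencing; Nucleus; Phosphoprotein; Polymorphism; "
    "Reference proteome; Secreted; Ubl conjugation\n"
    "P30462\t1B14_HUMAN\tComplete proteome; Disulfide bond; Glycoprotein; "
    "Host-virus interaction; Immunity; MHC I; Membrane; Polymorphism; "
    "Reference proteome; Signal; Transmembrane; Transmembrane helix; Ubl conjugation\n"
)

# QUOTE_NONE: treat quote characters as ordinary text, fields are split on tabs only.
reader = csv.reader(io.StringIO(table), delimiter="\t", quoting=csv.QUOTE_NONE)
next(reader)  # skip the header line

# One URec per row; the keywords column becomes a list of individual keywords.
records = [URec(acc, name, kw.split("; ")) for acc, name, kw in reader]

for rec in records:
    print(rec.id, rec.entry_name, len(rec.keywords))
```

Fields can then be read by name, for example `records[0].entry_name`, instead of by position.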

# ID mapping

UniProt offers an online service for ID mapping; see the [ID mapping examples](http://www.uniprot.org/help/programmatic_access#id_mapping_examples) in the documentation.