In [1]:
import requests
import json
import pandas as pd
from io import StringIO

`requests` is the recommended module for sending requests to, and receiving resources from, a web-based API endpoint.

`json` is the built-in module for working with text data in JSON format.

# UniProt REST API

In [2]:
uniprot_url = "https://www.uniprot.org/uploadlists"
headers = {
    "User-Agent": "Python, toan.phung@uq.net.au"
}
acc_file = "../data/testlist.txt"

`https://www.uniprot.org/uploadlists` is the URL of the UniProt REST API endpoint that we will use to request information.

`headers` is the metadata that should be included with every API request, so that UniProt administrators can debug potential problems.

In [29]:
with open(acc_file, "rt") as source_acc:
    l = [i.strip() for i in source_acc]

Open the file containing our list of UniProt accession ids and store them as a list of strings.
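
A slightly more defensive way to read the list is sketched below. It assumes one accession per line and additionally skips blank lines and repeated accessions, which the list comprehension above would pass through to the query. The helper name `read_accessions` is hypothetical, and an in-memory `StringIO` stands in for the file so the example runs on its own.

```python
from io import StringIO


def read_accessions(handle):
    """Return unique, non-empty accessions from a line-oriented handle, in file order."""
    seen = set()
    accessions = []
    for line in handle:
        acc = line.strip()
        # Skip blank lines and accessions that were already collected
        if acc and acc not in seen:
            seen.add(acc)
            accessions.append(acc)
    return accessions


# StringIO stands in for open(acc_file, "rt") so the sketch is self-contained
sample = StringIO("P25045\nQ07844\n\nP25045\n  P22147  \n")
print(read_accessions(sample))
```

With a real file, the same function could be called as `read_accessions(source_acc)` inside the `with open(...)` block.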


In [65]:
parameters = {
    "query": " ".join(l),
    "format": "tab",
    "from": "ACC,ID",
    "to": "REFSEQ_NT_ID"
}


The `uploadlists` API endpoint offers a few options.

For example, one can use UniProt's ability to convert an input id into the corresponding id in a different database.
Above, we created a parameters dictionary to convert from UniProt accessions. The dictionary contains 4 keys.
- `query`: a string constructed from the list above, with each item joined by a space
- `format`: the desired return file format
- `from`: the input id type (UniProt accession and id)
- `to`: the output id type (RefSeq nucleotide sequence id)
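
To see what these parameters look like on the wire, the sketch below encodes the same dictionary with the standard library's `urllib.parse.urlencode`. `requests` should perform an equivalent encoding when a dictionary is passed as `params`. The two-item accession list is just a shortened stand-in for `l`.

```python
from urllib.parse import urlencode

accessions = ["P25045", "Q07844"]  # shortened stand-in for the list read from file

parameters = {
    "query": " ".join(accessions),
    "format": "tab",
    "from": "ACC,ID",
    "to": "REFSEQ_NT_ID",
}

# Spaces become "+" and commas become "%2C" in the encoded query string
query_string = urlencode(parameters)
print(query_string)
```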


In [66]:
response = requests.get(uniprot_url, params=parameters, headers=headers)
print(response.status_code)

200


`response` is the variable containing the request result from UniProt. A status code of 200 indicates that the request succeeded.

In [38]:
result_refseq_nt = pd.read_csv(StringIO(response.text), sep="\t")
result_refseq_nt.head()

Unnamed: 0,From,To
0,P25045,NM_001182805.1
1,Q07844,NM_001181854.1
2,P22147,NM_001181038.1
3,P39931,NM_001182137.1
4,P27692,NM_001182366.1


To save the output of this operation, you can write it out directly from the response result or from the data frame.
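
Both routes might look like the sketch below. Saving the raw text needs only a file handle; for the data frame, `result_refseq_nt.to_csv(path, sep="\t", index=False)` would do the same job. The helper name `save_text` and the temporary output path are assumptions made for illustration, and the sample string stands in for `response.text`.

```python
import os
import tempfile


def save_text(text, path):
    """Write tab-separated text to disk unchanged."""
    with open(path, "wt", encoding="utf-8", newline="") as handle:
        handle.write(text)


# Stand-in for response.text; the two rows are copied from the table above
sample_text = "From\tTo\nP25045\tNM_001182805.1\nQ07844\tNM_001181854.1\n"

out_path = os.path.join(tempfile.mkdtemp(), "refseq_nt.tsv")  # hypothetical location
save_text(sample_text, out_path)
print(out_path)
```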


`result_refseq_nt` stores the UniProt tabulated data in a `pandas` data frame.

The first column is the original input id and the second column is the corresponding id in the RefSeq nucleotide database.
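
Because one accession can map to several RefSeq records, it can be handy to group the second column by the first. The sketch below does this with the standard library's `csv` module on a small sample. The first two rows come from the table above, while the third row uses an invented placeholder id purely to show a one-to-many mapping.

```python
import csv
from collections import defaultdict
from io import StringIO

sample_text = (
    "From\tTo\n"
    "P25045\tNM_001182805.1\n"
    "Q07844\tNM_001181854.1\n"
    "Q07844\tNM_EXAMPLE.1\n"  # invented placeholder row, not a real RefSeq id
)

# Collect every target id under the accession it was mapped from
mapping = defaultdict(list)
for row in csv.DictReader(StringIO(sample_text), delimiter="\t"):
    mapping[row["From"]].append(row["To"])

print(dict(mapping))
```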

In [39]:
parameters["to"] = "P_REFSEQ_AC"
response = requests.get(uniprot_url, params=parameters, headers=headers)
result_refseq_ac = pd.read_csv(StringIO(response.text), sep="\t")
result_refseq_ac.head()

Unnamed: 0,From,To
0,P25045,NP_014025.1
1,Q07844,NP_013066.1
2,P22147,NP_011342.1
3,P39931,NP_013351.1
4,P27692,NP_013703.1


Above, we changed the `to` parameter to target RefSeq protein accession ids instead.

---

Now, what if we want to get more information from the UniProt database instead of just doing id conversion?

In [53]:
extra_parameters = ["id", "entry name", "reviewed", "protein names", "organism", "sequence"]
parameters["to"] = "ACC"
parameters["columns"] = ",".join(extra_parameters)
print(parameters["columns"])

id,entry name,reviewed,protein names,organism,sequence


We add a fifth key to our parameters:
- `columns`: a string composed of the column names of the desired data for the collection of ids.
Column names are separated by ",". Above, we request the id, entry name, reviewed status, protein names, organism and
protein sequence from UniProt.

For all column names accessible through this mode, see https://www.uniprot.org/help/uniprotkb_column_names


In [45]:
response = requests.get(uniprot_url, params=parameters, headers=headers)
result_uniprot = pd.read_csv(StringIO(response.text), sep="\t")
print(result_uniprot.columns)
result_uniprot.head()


Unnamed: 0,Entry,Entry name,Status,Protein names,Organism,Sequence,yourlist:M201907126746803381A1F0E0DB47453E0216320D51E8FES
0,P25045,LCB1_YEAST,reviewed,Serine palmitoyltransferase 1 (SPT 1) (SPT1) (...,Saccharomyces cerevisiae (strain ATCC 204508 /...,MAHIPEVLPKSIPIPAFIVTTSSYLWYYFNLVLTQIPGGQFIVSYI...,P25045
1,Q07844,RIX7_YEAST,reviewed,Ribosome biogenesis ATPase RIX7,Saccharomyces cerevisiae (strain ATCC 204508 /...,MVKVKSKKNSLTSSLDNKIVDLIYRLLEEKTLDRKRSLRQESQGEE...,Q07844
2,P22147,XRN1_YEAST,reviewed,5'-3' exoribonuclease 1 (EC 3.1.13.-) (DNA str...,Saccharomyces cerevisiae (strain ATCC 204508 /...,MGIPKFFRYISERWPMILQLIEGTQIPEFDNLYLDMNSILHNCTHG...,P22147
3,P39931,SS120_YEAST,reviewed,Protein SSP120,Saccharomyces cerevisiae (strain ATCC 204508 /...,MRFLRGFVFSLAFTLYKVTATAEIGSEINVENEAPPDGLSWEEWHM...,P39931
4,P27692,SPT5_YEAST,reviewed,Transcription elongation factor SPT5 (Chromati...,Saccharomyces cerevisiae (strain ATCC 204508 /...,MSDNSDTNVSMQDHDQQFADPVVVPQSTDTKDENTSDKDTVDSGNV...,P27692


An example with a more extensive set of parameters is shown below.


In [82]:
parameters = {
    "query": " ".join(l),
    "format": "tab",
    "from": "ACC,ID",
    "to": "ACC",
    # Adjacent string literals inside parentheses are joined into one string
    "columns": (
        "id,entry name,reviewed,protein names,genes,organism,length,"
        "organism-id,go-id,go(cellular component),comment(SUBCELLULAR LOCATION),"
        "feature(TOPOLOGICAL_DOMAIN),feature(GLYCOSYLATION),comment(MASS SPECTROMETRY),"
        "sequence,feature(ALTERNATIVE SEQUENCE),comment(ALTERNATIVE PRODUCTS)"
    ),
}

response = requests.get(uniprot_url, params=parameters, headers=headers)

result_uniprot_extensive = pd.read_csv(StringIO(response.text), sep="\t")
print(result_uniprot_extensive.columns)
result_uniprot_extensive.head()

Index(['Entry', 'Entry name', 'Status', 'Protein names', 'Gene names',
       'Organism', 'Length', 'Organism ID', 'Gene ontology IDs',
       'Gene ontology (cellular component)', 'Subcellular location [CC]',
       'Topological domain', 'Glycosylation', 'Mass spectrometry', 'Sequence',
       'Alternative sequence', 'Alternative products (isoforms)',
       'yourlist:M201907156746803381A1F0E0DB47453E0216320D52DB7D6'],
      dtype='object')


Unnamed: 0,Entry,Entry name,Status,Protein names,Gene names,Organism,Length,Organism ID,Gene ontology IDs,Gene ontology (cellular component),Subcellular location [CC],Topological domain,Glycosylation,Mass spectrometry,Sequence,Alternative sequence,Alternative products (isoforms),yourlist:M201907156746803381A1F0E0DB47453E0216320D52DB7D6
0,P25045,LCB1_YEAST,reviewed,Serine palmitoyltransferase 1 (SPT 1) (SPT1) (...,LCB1 END8 TSC2 YMR296C,Saccharomyces cerevisiae (strain ATCC 204508 /...,558,559292,GO:0004758; GO:0005783; GO:0016021; GO:0017059...,endoplasmic reticulum [GO:0005783]; integral c...,SUBCELLULAR LOCATION: Cytoplasm. Endoplasmic r...,TOPO_DOM 1 49 Lumenal. {ECO:0000269|PubMed:154...,,,MAHIPEVLPKSIPIPAFIVTTSSYLWYYFNLVLTQIPGGQFIVSYI...,,,P25045
1,Q07844,RIX7_YEAST,reviewed,Ribosome biogenesis ATPase RIX7,RIX7 YLL034C,Saccharomyces cerevisiae (strain ATCC 204508 /...,837,559292,GO:0000055; GO:0005524; GO:0005634; GO:0005730...,nucleolus [GO:0005730]; nucleus [GO:0005634]; ...,"SUBCELLULAR LOCATION: Nucleus, nucleolus {ECO:...",,,,MVKVKSKKNSLTSSLDNKIVDLIYRLLEEKTLDRKRSLRQESQGEE...,,,Q07844
2,P22147,XRN1_YEAST,reviewed,5'-3' exoribonuclease 1 (EC 3.1.13.-) (DNA str...,XRN1 DST2 KEM1 RAR5 SEP1 SKI1 YGL173C G1645,Saccharomyces cerevisiae (strain ATCC 204508 /...,1528,559292,GO:0000184; GO:0000741; GO:0000932; GO:0000956...,cytoplasm [GO:0005737]; cytoplasmic stress gra...,"SUBCELLULAR LOCATION: Cytoplasm. Cytoplasm, pe...",,,,MGIPKFFRYISERWPMILQLIEGTQIPEFDNLYLDMNSILHNCTHG...,,,P22147
3,P39931,SS120_YEAST,reviewed,Protein SSP120,SSP120 YLR250W L9672.4,Saccharomyces cerevisiae (strain ATCC 204508 /...,234,559292,GO:0000324; GO:0005509; GO:0005737; GO:0005793,cytoplasm [GO:0005737]; endoplasmic reticulum-...,,,,,MRFLRGFVFSLAFTLYKVTATAEIGSEINVENEAPPDGLSWEEWHM...,,,P39931
4,P27692,SPT5_YEAST,reviewed,Transcription elongation factor SPT5 (Chromati...,SPT5 YML010W YM9571.08,Saccharomyces cerevisiae (strain ATCC 204508 /...,1063,559292,GO:0000993; GO:0001042; GO:0001179; GO:0003677...,DSIF complex [GO:0032044]; mitochondrion [GO:0...,SUBCELLULAR LOCATION: Nucleus {ECO:0000269|Pub...,,,,MSDNSDTNVSMQDHDQQFADPVVVPQSTDTKDENTSDKDTVDSGNV...,,,P27692


Tabulated output from UniProt does not include isoform sequences. If you want to get those sequences as well, you will have
to work with FASTA output instead of tabulated output. An extra parameter, `"include": "yes"`, is also needed.


In [73]:
parameters = {
    "query": " ".join(l),
    "format": "fasta",
    "from": "ACC,ID",
    "to": "ACC",
    "include": "yes",
    "reviewed": "yes"
}
response = requests.get(uniprot_url, params=parameters, headers=headers)

With the FASTA data retrieved, we still need to save it to a file.


In [74]:
with open("../data/all_isoforms.fasta", "wb") as fasta_file:
    fasta_file.write(response.content)
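
Once saved, the FASTA text can be split back into records. The following is a minimal sketch of a parser, assuming headers start with `>` and a sequence may wrap over several lines. The function name `parse_fasta` is hypothetical, and the two records are shortened placeholders rather than real UniProt output.

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (header, sequence) tuples from FASTA-formatted text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the final record once the input is exhausted
    if header is not None:
        yield header, "".join(chunks)


# Placeholder records; real isoform headers and sequences will differ
sample = StringIO(">entry_1 canonical\nMAHIPEVLPK\nSIPIPAF\n>entry_1-2 isoform\nMAHIPEV\n")
records = list(parse_fasta(sample))
print(records)
```

For the file written above, the same function could be applied to a handle opened with `open("../data/all_isoforms.fasta", "rt")`.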
    

For queries that use a more general search rather than an id or accession, the API endpoint has to be changed to `https://www.uniprot.org/uniprot`.


In [None]:
uniprot_url = "https://www.uniprot.org/uniprot"
parameters = 
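
As a rough sketch only, a general search might be parameterised as follows, assuming the endpoint accepts `query`, `format`, `columns` and `limit` keys and a `field:value` query syntax. These key names and the example query are assumptions rather than something stated above, and should be checked against the UniProt help pages. The code only builds the URL and sends nothing.

```python
from urllib.parse import urlencode

uniprot_url = "https://www.uniprot.org/uniprot"

# All keys and the query syntax below are assumptions for illustration.
# 559292 is the organism id that appears in the tabulated output earlier.
search_parameters = {
    "query": "organism:559292 AND reviewed:yes",
    "format": "tab",
    "columns": "id,entry name,protein names",
    "limit": 10,
}

search_url = uniprot_url + "/?" + urlencode(search_parameters)
print(search_url)
```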

# NCBI API


In [None]:
result = pd.read_csv(StringIO(response.text), sep="\t")


`result` stores the UniProt tabulated data in a `pandas` data frame.


In [40]:
eutil_path = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
acc_list = ['NM_009417','NM_000547','NM_001003009','NM_019353']
query = ""
for i in range(len(acc_list)):
    acc_list[i] = acc_list[i] + "[accn]"
query = "+OR+".join(acc_list)
params = [
    "db=nuccore",
    "term={}".format(query),
    "usehistory=y"
]
url = "&".join(params)
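
The same query string can also be assembled without modifying `acc_list` in place. The sketch below builds the search term with a generator expression and lets `urllib.parse.urlencode` handle the escaping, so the brackets and spaces are percent-encoded rather than written by hand. The variable names `term` and `search_query` are illustrative.

```python
from urllib.parse import urlencode

acc_list = ["NM_009417", "NM_000547", "NM_001003009", "NM_019353"]

# Tag every accession with the [accn] field and join the terms with OR
term = " OR ".join("{}[accn]".format(acc) for acc in acc_list)

search_query = urlencode({
    "db": "nuccore",
    "term": term,
    "usehistory": "y",
})
print(search_query)
```

Passing the dictionary directly as `params=` to `requests.get` should give an equivalent request.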

In [41]:
res = requests.get(eutil_path + "esearch.fcgi?" + url, headers=headers)

In [6]:
res.content

b'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">\n<eSearchResult><Count>4</Count><RetMax>4</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>NCID_1_54036580_130.14.22.33_9001_1559264498_1086444762_0MetA0_S_MegaStore</WebEnv><IdList>\n<Id>253735815</Id>\n<Id>927442695</Id>\n<Id>402766536</Id>\n<Id>350529408</Id>\n</IdList><TranslationSet/><TranslationStack>   <TermSet>    <Term>NM_009417[accn]</Term>    <Field>accn</Field>    <Count>1</Count>    <Explode>N</Explode>   </TermSet>   <TermSet>    <Term>NM_000547[accn]</Term>    <Field>accn</Field>    <Count>1</Count>    <Explode>N</Explode>   </TermSet>   <OP>OR</OP>   <TermSet>    <Term>NM_001003009[accn]</Term>    <Field>accn</Field>    <Count>1</Count>    <Explode>N</Explode>   </TermSet>   <OP>OR</OP>   <TermSet>    <Term>NM_019353[accn]</Term>    <Field>accn</Field>    <Count>1</Count>    <Explode>N<

E-search returns an XML document containing the ID assigned to the query, which is used to retrieve the result.
The assigned ID is stored in two parts, one in the `QueryKey` tag and the other in the `WebEnv` tag.

If the result set is large, it is suggested to fetch it manually in batches of 500 sequences at a time.
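
One way to do this is to page through the stored result set with the E-utilities `retstart` and `retmax` parameters. The sketch below only computes the parameter set for each batch. The function name `batch_params`, the total of 1200 records, and the placeholder `query_key` and `WebEnv` values are assumptions made for illustration.

```python
def batch_params(total, query_key, web_env, batch_size=500):
    """Yield one E-fetch parameter dict per batch of at most batch_size records."""
    for start in range(0, total, batch_size):
        yield {
            "db": "nuccore",
            "query_key": query_key,
            "WebEnv": web_env,
            "retstart": start,      # index of the first record in this batch
            "retmax": batch_size,   # maximum number of records to return
            "rettype": "gb",
            "retmode": "xml",
        }


# Placeholder values; real ones come from the E-search response
batches = list(batch_params(1200, "1", "WEBENV_PLACEHOLDER"))
print([b["retstart"] for b in batches])
```

Each dictionary could then be passed as `params=` in a separate `requests.get` call to `efetch.fcgi`.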

In [7]:
from bs4 import BeautifulSoup

In [43]:
soup = BeautifulSoup(res.content, features="lxml-xml")

In [44]:
print(soup.find("Count").text)

4


In [45]:
query_key = soup.find("QueryKey").text
web_env = soup.find("WebEnv").text
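
If `bs4` or `lxml` is not available, the same two values can be read with the standard library's `xml.etree.ElementTree`. This is a minimal sketch on a trimmed-down copy of the response above; the `WebEnv` string is shortened to a placeholder.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the E-search response; WebEnv is a placeholder value
sample_xml = (
    "<eSearchResult>"
    "<Count>4</Count><RetMax>4</RetMax><RetStart>0</RetStart>"
    "<QueryKey>1</QueryKey><WebEnv>WEBENV_PLACEHOLDER</WebEnv>"
    "<IdList><Id>253735815</Id><Id>927442695</Id>"
    "<Id>402766536</Id><Id>350529408</Id></IdList>"
    "</eSearchResult>"
)

root = ET.fromstring(sample_xml)
query_key = root.findtext("QueryKey")
web_env = root.findtext("WebEnv")
ids = [node.text for node in root.iter("Id")]
print(query_key, web_env, ids)
```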

Now we can use E-fetch to retrieve the result. E-fetch can return data in different formats.
More information on the available return types is at
https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/

In [46]:
retrieve_params = [
    "db=nuccore",
    "query_key={}".format(query_key),
    "WebEnv={}".format(web_env),
    "rettype=gb",
    "retmode=xml"
]
retrieve_url = "&".join(retrieve_params)

In [47]:
res = requests.get(eutil_path + "efetch.fcgi?" + retrieve_url, headers=headers)
soup = BeautifulSoup(res.text, features="lxml-xml")

In [48]:
entries = []
for gb_seq in soup.find_all("GBSeq"):
    entry = {}
    entry["locus"] = gb_seq.find("GBSeq_locus").text
    entry["definition"] = gb_seq.find("GBSeq_definition").text
    entry["id"] = gb_seq.find("GBSeqid").text
    entry["org"] = gb_seq.find("GBSeq_organism").text
    entry["sequence"] = gb_seq.find("GBSeq_sequence").text
    entries.append(entry)
        

Convert the output into a more familiar tabulated format such as a `pandas.DataFrame`.

In [49]:
df = pd.DataFrame(entries)
df.head()

Unnamed: 0,definition,id,locus,org
0,"Homo sapiens thyroid peroxidase (TPO), transcr...",ref|NM_000547.5|,NM_000547,Homo sapiens
1,"Mus musculus thyroid peroxidase (Tpo), mRNA",ref|NM_009417.3|,NM_009417,Mus musculus
2,"Rattus norvegicus thyroid peroxidase (Tpo), mRNA",ref|NM_019353.2|,NM_019353,Rattus norvegicus
3,Canis lupus familiaris thyroid peroxidase (TPO...,ref|NM_001003009.2|,NM_001003009,Canis lupus familiaris


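Alternatively, write the entries out in FASTA format. Each record becomes a header line with the id and definition, followed by the sequence.

In [None]:
with open("../data/ncbi_sequences.fasta", "wt") as fasta_file:  # placeholder output path
    for i in entries:
        fasta_file.write(">{} {}\n{}\n".format(i["id"], i["definition"], i["sequence"]))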
