In [6]:
import requests
import pandas as pd
import logging

### The uniprot.py file in the "go" folder of the go2pdb package was not working, mostly because UniProt appears to have updated its REST API, so the URLs the package used are no longer valid. Here we update and test them.

### search_go

The page at https://www.uniprot.org/help/api_queries describes how to build the REST URLs for the desired queries. Here is an example query that searches for all UniProt entries with GO code 0016151 AND that have PDB entries: https://www.uniprot.org/uniprotkb?query=pdb%20AND%20%28go%3A0016151%29

The query returns 97 results. Including pdb in the search matters because we want the results to have 3D structural information available when we map them to PDB IDs later in the script.
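
The percent-encoded query strings do not have to be written by hand. Below is a minimal sketch that assumes only Python's standard `urllib.parse`; the helper names `build_site_url` and `build_rest_url` are hypothetical and not part of go2pdb. It reproduces the website URL shown above and the TSV stream URL used in this notebook for GO code 0016151.

```python
from urllib.parse import quote, urlencode

SITE_SEARCH = "https://www.uniprot.org/uniprotkb"
REST_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def build_site_url(query: str) -> str:
    """Return the website search URL for a UniProt query string."""
    # quote_via=quote encodes spaces as %20 (the default quote_plus uses "+").
    return SITE_SEARCH + "?" + urlencode({"query": query}, quote_via=quote)


def build_rest_url(query: str, fields=("accession", "id", "protein_name")) -> str:
    """Return a TSV stream URL for a query (hypothetical helper)."""
    params = {"fields": ",".join(fields), "format": "tsv", "query": query}
    return REST_STREAM + "?" + urlencode(params, quote_via=quote)


site_url = build_site_url("pdb AND (go:0016151)")
rest_url = build_rest_url("pdb go:0016151")
print(site_url)
print(rest_url)
```

Building the URL from a dictionary keeps the field list and the query readable, and avoids encoding mistakes when the query contains parentheses or colons.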

In [49]:
_LOGGER = logging.getLogger(__name__)
SEARCH_GO_URL = "https://rest.uniprot.org/uniprotkb/stream?fields=accession%2Cid%2Cprotein_name&format=tsv&query=pdb%20go%3A{go}"

In [50]:
def search_go(go_codes, ssl_verify) -> pd.DataFrame:
    """Search UniProt by GO code.

    :param list go_codes:  list of GO codes to search (e.g. "GO:0016151")
    :param bool ssl_verify:  whether to verify SSL certificates
    :returns:  UniProt IDs and additional information
    """
    rows = []
    for go_code in go_codes:
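        # The URL template takes only the numeric part of "GO:0016151".
        _, code = go_code.split(":")
        search_url = SEARCH_GO_URL.format(go=code)
        req = requests.get(search_url, verify=ssl_verify)
        # Fail early on HTTP errors instead of parsing an error body as TSV.
        req.raise_for_status()
        lines = req.text.splitlines()
        # Skip the TSV header row and tag each hit with the GO code searched.
        rows += [line.split("\t") + [go_code] for line in lines[1:]]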
    df = pd.DataFrame(
        data=rows,
        columns=[
            "UniProt entry ID",
            "UniProt entry name",
            "UniProt protein names",
            "UniProt GO code",
        ],
    )
    return df.drop_duplicates(ignore_index=True)


Testing the function using GO:0016151 (nickel cation binding)

In [51]:
nickel_go = search_go(["GO:0016151"], True)

In [52]:
nickel_go

Unnamed: 0,UniProt entry ID,UniProt entry name,UniProt protein names,UniProt GO code
0,O25560,HYPB_HELPY,Hydrogenase/urease maturation factor HypB (Hyd...,GO:0016151
1,O33599,LYTM_STAA8,"Glycyl-glycine endopeptidase LytM, EC 3.4.24.7...",GO:0016151
2,P04905,GSTM1_RAT,"Glutathione S-transferase Mu 1, EC 2.5.1.18 (G...",GO:0016151
3,P07374,UREA_CANEN,"Urease, EC 3.5.1.5 (Jack bean urease, JBU) (Ur...",GO:0016151
4,P07451,CAH3_HUMAN,"Carbonic anhydrase 3, EC 4.2.1.1 (Carbonate de...",GO:0016151
...,...,...,...,...
92,Q8ZPH0,Q8ZPH0_SALTY,"Putative hydrogenase-1 large subunit, EC 1.12.7.2",GO:0016151
93,Q92YH7,Q92YH7_RHIME,"ABC transporter, periplasmic solute-binding pr...",GO:0016151
94,Q9L868,Q9L868_DESDE,[NiFe] hydrogenase large subunit,GO:0016151
95,U5RTE2,U5RTE2_9CLOT,"Carbon-monoxide dehydrogenase (Acceptor), EC 1...",GO:0016151


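We get essentially the same results as the manual query on the website. Note that the DataFrame above ends at index 95, i.e. 96 rows, against the 97 results reported on the site.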
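
The parsing step inside search_go (split the body into lines, drop the header, split on tabs) can be checked offline without calling the API. Below is a minimal sketch using the standard `csv` module on a hand-built sample. The header row is an assumption about what the API returns, the two data rows are copied from the output above, and `parse_uniprot_tsv` is a hypothetical helper, not part of go2pdb.

```python
import csv
import io

# Hand-built sample body. The header row is an assumption; the data rows are
# taken from the search_go output shown above.
SAMPLE_TSV = (
    "Entry\tEntry Name\tProtein names\n"
    "Q8ZPH0\tQ8ZPH0_SALTY\tPutative hydrogenase-1 large subunit, EC 1.12.7.2\n"
    "Q9L868\tQ9L868_DESDE\t[NiFe] hydrogenase large subunit\n"
)


def parse_uniprot_tsv(text: str, go_code: str) -> list:
    """Parse a TSV body into rows, appending the GO code that was searched."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    # rows[0] is the header; an empty body simply yields no rows.
    return [row + [go_code] for row in rows[1:] if row]


parsed = parse_uniprot_tsv(SAMPLE_TSV, "GO:0016151")
for row in parsed:
    print(row)
```

Each parsed row has four elements, matching the four column names passed to the DataFrame in search_go.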

### search_id

The steps to obtain the new URL are essentially the same as for the search_go function: run a search on the site and derive the API URL from it. This URL uses the query field "accession". Here is how the search looks on the site for the UniProt ID "P85092": https://www.uniprot.org/uniprotkb?query=(accession:P85092)

In [31]:
SEARCH_ID_URL = "https://rest.uniprot.org/uniprotkb/stream?fields=accession%2Cid%2Cprotein_name&format=tsv&query=%28accession%3A{id}%29"

In [32]:
def search_id(uniprot_ids, ssl_verify) -> pd.DataFrame:
    """Search UniProt by ID.

    :param list uniprot_ids:  list of UniProt IDs to search
    :param bool ssl_verify:  whether to verify SSL certificates
    :returns:  UniProt IDs and additional information
    """
    rows = []
    for uni_id in uniprot_ids:
        search_url = SEARCH_ID_URL.format(id=uni_id)
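        req = requests.get(search_url, verify=ssl_verify)
        # Fail early on HTTP errors instead of parsing an error body as TSV.
        req.raise_for_status()
        lines = req.text.splitlines()
        # Skip the TSV header row.
        rows += [line.split("\t") for line in lines[1:]]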
    df = pd.DataFrame(
        data=rows,
        columns=[
            "UniProt entry ID",
            "UniProt entry name",
            "UniProt protein names",
        ],
    )
    return df.drop_duplicates(ignore_index=True)

We will reuse the IDs from the previous result to test this function.

In [59]:
nickel_id = search_id(nickel_go['UniProt entry ID'], True)

In [60]:
nickel_id

Unnamed: 0,UniProt entry ID,UniProt entry name,UniProt protein names
0,O25560,HYPB_HELPY,Hydrogenase/urease maturation factor HypB (Hyd...
1,O33599,LYTM_STAA8,"Glycyl-glycine endopeptidase LytM, EC 3.4.24.7..."
2,P04905,GSTM1_RAT,"Glutathione S-transferase Mu 1, EC 2.5.1.18 (G..."
3,P07374,UREA_CANEN,"Urease, EC 3.5.1.5 (Jack bean urease, JBU) (Ur..."
4,P07451,CAH3_HUMAN,"Carbonic anhydrase 3, EC 4.2.1.1 (Carbonate de..."
...,...,...,...
92,Q8ZPH0,Q8ZPH0_SALTY,"Putative hydrogenase-1 large subunit, EC 1.12.7.2"
93,Q92YH7,Q92YH7_RHIME,"ABC transporter, periplasmic solute-binding pr..."
94,Q9L868,Q9L868_DESDE,[NiFe] hydrogenase large subunit
95,U5RTE2,U5RTE2_9CLOT,"Carbon-monoxide dehydrogenase (Acceptor), EC 1..."
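
search_id sends one request per ID, which means 96 requests for the table above. Several accessions could instead be combined into a single query. Below is a minimal sketch of building such URLs, assuming that OR-combined accession terms are accepted by the stream endpoint; `chunked` and `build_batch_urls` are hypothetical helpers, and the chunk size is an arbitrary choice to keep URLs short.

```python
from urllib.parse import quote, urlencode

REST_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_urls(uniprot_ids, chunk_size=25):
    """Return one TSV stream URL per chunk of accessions (hypothetical helper)."""
    urls = []
    for chunk in chunked(uniprot_ids, chunk_size):
        # Assumption: the endpoint accepts "(accession:A) OR (accession:B)".
        query = " OR ".join(f"(accession:{uid})" for uid in chunk)
        params = {
            "fields": "accession,id,protein_name",
            "format": "tsv",
            "query": query,
        }
        urls.append(REST_STREAM + "?" + urlencode(params, quote_via=quote))
    return urls


urls = build_batch_urls(["O25560", "O33599", "P04905"], chunk_size=2)
for url in urls:
    print(url)
```

With a single ID per chunk, the generated URL is identical to SEARCH_ID_URL filled in for that ID, so the existing parsing code would apply unchanged.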


### get_pdb_ids

The get_pdb_ids function needed more work: not only the URL but also the parameters used to map UniProt IDs to PDB IDs on the site had to be updated. There is now one URL for submitting the mapping request, which returns a job ID, and a second URL that takes the job ID to retrieve the data.
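
Because the mapping runs as a job, the data may not be ready the moment the job ID comes back. Below is a minimal sketch of a polling step that could sit between the two requests. It is not part of go2pdb: `wait_for_job` is a hypothetical helper, and the status values "NEW", "RUNNING", and "FINISHED" are assumptions about what the job-status response contains. The status source is passed in as a callable so the logic can be exercised without network access.

```python
import time


def wait_for_job(job_id, fetch_status, poll_interval=3.0, max_polls=20):
    """Poll until a mapping job is no longer pending.

    `fetch_status` is any callable that takes a job ID and returns the
    decoded JSON of the status response as a dict.
    """
    for _ in range(max_polls):
        status = fetch_status(job_id)
        state = status.get("jobStatus")
        # Assumption: pending jobs report "NEW" or "RUNNING".
        if state in ("NEW", "RUNNING"):
            time.sleep(poll_interval)
            continue
        # Assumption: a finished job reports "FINISHED" or returns results
        # directly (no "jobStatus" key); anything else is a failure.
        if state in (None, "FINISHED"):
            return status
        raise RuntimeError(f"mapping job {job_id} ended with status {state}")
    raise TimeoutError(f"mapping job {job_id} still pending after {max_polls} polls")


# Offline demonstration with a fake status source instead of HTTP.
_fake_responses = iter([{"jobStatus": "RUNNING"}, {"jobStatus": "FINISHED"}])
result = wait_for_job("demo-job", lambda _job_id: next(_fake_responses), poll_interval=0)
print(result)
```

With requests, `fetch_status` could be a small function that calls a job-status URL (assumed here to be https://rest.uniprot.org/idmapping/status/{jobId}) and returns `.json()` from the response.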

In [88]:
# URL to initiate the mapping job
MAPPING_URL = "https://rest.uniprot.org/idmapping/run"

# URL to retrieve data from the mapping job
MAPPING_DATA_URL = "https://rest.uniprot.org/idmapping/stream/{jobId}?format=tsv"

In [89]:
def get_pdb_ids(uniprot_ids, ssl_verify) -> pd.DataFrame:
    """Get PDB IDs corresponding to UniProt IDs.

    :param list uniprot_ids:  list of UniProt IDs.
    :param bool ssl_verify:  whether to verify SSL certificates
    :returns:  mapping of UniProt IDs to PDB IDs.
    """
    # Mapping parameters: from UniProtKB accession/ID to PDB.
    params = {
        "from": "UniProtKB_AC-ID",
        "to": "PDB",
        "ids": " ".join(uniprot_ids),
    }
    
    # Submit the mapping job and grab its job ID.
    req = requests.post(MAPPING_URL, data=params, verify=ssl_verify)
    req.raise_for_status()
    job_id = req.json()["jobId"]

    # Retrieve the data using the job ID. This assumes the job has
    # already finished by the time the request is made.
    data_url = MAPPING_DATA_URL.format(jobId=job_id)
    data_req = requests.get(data_url, verify=ssl_verify)
    data_req.raise_for_status()

    # Skip the TSV header row.
    lines = data_req.text.splitlines()
    rows = [line.split("\t") for line in lines[1:]]
    df = pd.DataFrame(data=rows, columns=["UniProt entry ID", "PDB ID"])
    return df.drop_duplicates(ignore_index=True)

Testing on the list of uniprot ids from before

In [96]:
nickel_pdb = get_pdb_ids(nickel_id['UniProt entry ID'], True)

In [97]:
nickel_pdb

Unnamed: 0,UniProt entry ID,PDB ID
0,O25560,4LPS
1,O33599,1QWY
2,O33599,2B0P
3,O33599,2B13
4,O33599,2B44
...,...,...
523,U5RTE2,6YU9
524,U5RTE2,6YUA
525,V0V766,6SYX
526,V0V766,6SZD


We have now retrieved the PDB IDs for the list of UniProt IDs with nickel cation binding.