# Create UniProt Mapping File

This notebook creates a file that maps Ensembl IDs to UniProt accession numbers. It takes the Ensembl IDs from Agora's gene metadata file (`syn25953363`), queries UniProtKB for matching accession numbers, and writes the resulting mapping to a TSV file. The notebook uses UniProtKB-Swiss-Prot as its source, so every accession returned has been reviewed and annotated by UniProt and is likely to be a primary accession.

## Installation requirements

Install Python and agora-data-tools following the instructions in this repository's README. This notebook assumes it is being run from the same `pipenv` virtual environment as agora-data-tools. 

Then install the following package using `pip`:
```
pip install unipressed
```

In [1]:
from unipressed import IdMappingClient
import synapseclient
import time
import pandas as pd
import agoradatatools.etl.utils as utils
import agoradatatools.etl.extract as extract

config_filename = "../../../../config.yaml"

Find the specific version of the gene metadata file to use, as specified in Agora's config file.

In [3]:
config = utils._get_config(config_path=config_filename)
datasets = config["datasets"]

metadata_synID = None

for dataset in datasets:
    if "gene_info" in dataset.keys():
        files = dataset["gene_info"]["files"]
        for file_info in files:
            if "gene_metadata" in file_info.values():
                metadata_synID = file_info["id"]

print(metadata_synID)

syn25953363.13
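
The lookup above can be wrapped in a small helper that stops at the first match. This is only a sketch: it assumes the config layout implied by the loop (a list of single-key dataset dictionaries, each holding a `files` list). The `find_file_id` function, the `name` key, and the sample config entries are hypothetical and are not part of agora-data-tools.

```python
from typing import Optional


def find_file_id(datasets: list, dataset_name: str, file_name: str) -> Optional[str]:
    """Return the `id` of the first file entry that has `file_name` as a value."""
    for dataset in datasets:
        if dataset_name not in dataset:
            continue
        for file_info in dataset[dataset_name]["files"]:
            if file_name in file_info.values():
                return file_info["id"]
    return None


# Hypothetical config fragment; the real config.yaml may use different keys.
sample_datasets = [
    {"other_dataset": {"files": [{"name": "other_file", "id": "syn00000001.1"}]}},
    {"gene_info": {"files": [{"name": "gene_metadata", "id": "syn25953363.13"}]}},
]

metadata_id = find_file_id(sample_datasets, "gene_info", "gene_metadata")
print(metadata_id)
```

Unlike the loop in the cell above, which keeps the last match it sees, this helper returns the first one. It also returns `None` when nothing matches, which makes a missing entry easy to check for before downloading.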


In [None]:
syn = synapseclient.Synapse()
syn.login(silent=True)

gene_metadata = extract.get_entity_as_df(syn_id=metadata_synID, source="feather", syn=syn)
gene_metadata["ensembl_gene_id"]

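Query UniProt for accession numbers that match the Ensembl IDs. Using `UniProtKB-Swiss-Prot` as the destination database ensures that all accession numbers returned have been reviewed and are highly likely to be primary accessions. The IDs are submitted in batches of 1,000, and each job is polled every 2 seconds until it finishes.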

In [None]:
ensembl_ids = gene_metadata["ensembl_gene_id"].tolist()
print(len(ensembl_ids))

# Break the query into smaller chunks to avoid long jobs that could fail
batch_size = 1000
results = []

for start in range(0, len(ensembl_ids), batch_size):
    end = min(len(ensembl_ids), start + batch_size)
    print(f"Querying genes {start + 1} - {end}")

    request = IdMappingClient.submit(
        source="Ensembl", dest="UniProtKB-Swiss-Prot", ids=ensembl_ids[start:end]
    )

    # Poll the job every 2 seconds until UniProt reports that it has finished
    found = False
    while not found:
        time.sleep(2)

        status = request.get_status()
        if status == "FINISHED":
            results.extend(request.each_result())
            found = True
        else:
            print("Waiting for response from UniProt...")
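
The polling loop above waits indefinitely if a job never reaches `FINISHED`. One possible variation, sketched below without any network calls, separates the batching from the waiting and gives up after a fixed number of status checks. The `chunked` and `wait_until_finished` helpers, the `max_attempts` limit, and the fake status sequence are all illustrative assumptions; in the notebook, `request.get_status` would be passed in as the `get_status` argument.

```python
import time
from typing import Callable, Iterator, List


def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids` with at most `size` items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def wait_until_finished(
    get_status: Callable[[], str], interval: float = 2.0, max_attempts: int = 150
) -> None:
    """Poll `get_status` until it returns "FINISHED" or the attempt limit is hit."""
    for _ in range(max_attempts):
        if get_status() == "FINISHED":
            return
        time.sleep(interval)
    raise TimeoutError(f"Job not finished after {max_attempts} status checks")


# Stand-in for request.get_status: reports "RUNNING" twice, then "FINISHED".
fake_statuses = iter(["RUNNING", "RUNNING", "FINISHED"])
wait_until_finished(lambda: next(fake_statuses), interval=0.0)

# Made-up IDs, used only to show how the batches are sized.
batches = list(chunked([f"ENSG{i:011d}" for i in range(2500)], 1000))
print([len(batch) for batch in batches])
```

Raising `TimeoutError` lets the caller decide whether to retry the batch or stop, rather than leaving the notebook cell running with no end condition.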

In [6]:
mapping = pd.DataFrame(results).rename(
    columns={"from": "ensembl_gene_id", "to": "UniProtKB_accession"}
)
mapping = mapping[["UniProtKB_accession", "ensembl_gene_id"]]
mapping

|       | UniProtKB_accession | ensembl_gene_id |
|-------|---------------------|-----------------|
| 0     | O43657              | ENSG00000000003 |
| 1     | Q9H2S6              | ENSG00000000005 |
| 2     | O60762              | ENSG00000000419 |
| 3     | Q8IZE3              | ENSG00000000457 |
| 4     | Q9NSG2              | ENSG00000000460 |
| ...   | ...                 | ...             |
| 19691 | P23610              | ENSG00000288722 |
| 19692 | Q8IX94              | ENSG00000288784 |
| 19693 | A0A8I5KQE6          | ENSG00000288920 |
| 19694 | W6CW81              | ENSG00000289721 |
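
The reshaping step can be tried on a couple of sample records without querying UniProt. The sketch below assumes each result is a dictionary with `from` and `to` keys, as the rename in the cell above implies. The `results_to_mapping` helper is a hypothetical name, and the two records reuse the first two rows of the output shown above.

```python
import pandas as pd

COLUMN_NAMES = {"from": "ensembl_gene_id", "to": "UniProtKB_accession"}


def results_to_mapping(records: list) -> pd.DataFrame:
    """Convert `from`/`to` records into a two-column mapping table."""
    # Passing `columns` keeps both columns present even when `records` is empty.
    frame = pd.DataFrame(records, columns=["from", "to"]).rename(columns=COLUMN_NAMES)
    return frame[["UniProtKB_accession", "ensembl_gene_id"]]


sample_results = [
    {"from": "ENSG00000000003", "to": "O43657"},
    {"from": "ENSG00000000005", "to": "Q9H2S6"},
]

sample_mapping = results_to_mapping(sample_results)
empty_mapping = results_to_mapping([])
print(sample_mapping)
```

Naming the columns up front means an empty result list still yields a table with both headers, whereas selecting columns from a bare `pd.DataFrame([])` would raise a `KeyError`.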


In [7]:
mapping.to_csv(path_or_buf="../../output/ensg_to_uniprot_mapping.tsv", sep="\t", header=True, index=False)

## Extra information printouts

Total number of Ensembl IDs that match to a UniProt accession:

In [15]:
matches = len(mapping["ensembl_gene_id"].drop_duplicates())
total = gene_metadata.shape[0]
pct = round(matches * 100 / total, ndigits = 2)

print(f'{matches:.0f} of {total:.0f} ({pct:.2f}%) Ensembl IDs match to an accession')

19671 of 37452 (52.52%) Ensembl IDs match to an accession
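
Since roughly half of the Ensembl IDs have no match, it can also be useful to list the unmatched IDs. The sketch below shows one way to do that on toy data. The `toy_metadata` and `toy_mapping` frames and their short IDs are made up for illustration and only mirror the column names used in this notebook.

```python
import pandas as pd

# Toy stand-ins for `gene_metadata` and `mapping`; the IDs are illustrative only.
toy_metadata = pd.DataFrame({"ensembl_gene_id": ["ENSG01", "ENSG02", "ENSG03", "ENSG04"]})
toy_mapping = pd.DataFrame(
    {
        "UniProtKB_accession": ["P1", "P2", "P3"],
        "ensembl_gene_id": ["ENSG01", "ENSG01", "ENSG03"],
    }
)

# Compare unique IDs on both sides so repeated rows do not skew the percentage
all_ids = toy_metadata["ensembl_gene_id"].drop_duplicates()
is_matched = all_ids.isin(toy_mapping["ensembl_gene_id"])

matches = int(is_matched.sum())
total = len(all_ids)
unmatched = all_ids[~is_matched].tolist()

print(f"{matches} of {total} ({matches / total:.2%}) Ensembl IDs match to an accession")
print(unmatched)
```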


Ensembl IDs that match to more than one UniProt accession:

In [19]:
dupes = mapping["ensembl_gene_id"].loc[mapping["ensembl_gene_id"].duplicated()].drop_duplicates()
print(f'{len(dupes):d} Ensembl IDs map to more than one UniProt accession')
mapping.loc[mapping["ensembl_gene_id"].isin(dupes)]

24 Ensembl IDs map to more than one UniProt accession


|      | UniProtKB_accession | ensembl_gene_id |
|------|---------------------|-----------------|
| 404  | Q9HDB5              | ENSG00000021645 |
| 405  | Q9Y4C0              | ENSG00000021645 |
| 1724 | O95467              | ENSG00000087460 |
| 1725 | P63092              | ENSG00000087460 |
| 1726 | Q5JWF2              | ENSG00000087460 |
| 2462 | O00241              | ENSG00000101307 |
| 2463 | Q5TFQ8              | ENSG00000101307 |
| 3038 | Q96PG8              | ENSG00000105327 |
| 3039 | Q9BXH1              | ENSG00000105327 |
| 3585 | P0DI83              | ENSG00000109113 |
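
If a downstream consumer needs one row per gene, the one-to-many rows can be collapsed. The sketch below is one possible approach, not something this notebook does. It uses three rows taken from the outputs above, and the `;` separator is an arbitrary choice.

```python
import pandas as pd

toy_mapping = pd.DataFrame(
    {
        "UniProtKB_accession": ["Q9HDB5", "Q9Y4C0", "O43657"],
        "ensembl_gene_id": ["ENSG00000021645", "ENSG00000021645", "ENSG00000000003"],
    }
)

# keep=False flags every row of a duplicated ID, not just the repeats
multi = toy_mapping[toy_mapping["ensembl_gene_id"].duplicated(keep=False)]

# One row per gene, with its accessions joined into a single string
collapsed = (
    multi.groupby("ensembl_gene_id")["UniProtKB_accession"]
    .agg(";".join)
    .reset_index()
)
print(collapsed)
```

Using `duplicated(keep=False)` selects all rows for a repeated ID in one step, which replaces the two-step `duplicated()` plus `isin()` pattern used in the cell above.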


UniProt accessions that match to more than one Ensembl ID:

In [20]:
dupes2 = mapping["UniProtKB_accession"].loc[mapping["UniProtKB_accession"].duplicated()].drop_duplicates()
print(f'{len(dupes2):d} UniProt accessions map to more than one Ensembl ID')
mapping.loc[mapping["UniProtKB_accession"].isin(dupes2)]

474 UniProt accessions map to more than one Ensembl ID


|       | UniProtKB_accession | ensembl_gene_id |
|-------|---------------------|-----------------|
| 119   | O76009              | ENSG00000006059 |
| 137   | Q53H12              | ENSG00000006530 |
| 278   | Q9P203              | ENSG00000011114 |
| 319   | Q8N806              | ENSG00000012963 |
| 360   | Q99666              | ENSG00000015568 |
| ...   | ...                 | ...             |
| 19675 | Q6IEY1              | ENSG00000284733 |
| 19680 | Q9H1A7              | ENSG00000285437 |
| 19683 | P68431              | ENSG00000287080 |
| 19684 | Q14953              | ENSG00000288206 |
