# Create UniProt Mapping File

This notebook creates a file that maps Ensembl IDs to UniProt accession numbers. It collects all Ensembl IDs present in the Agora input files (excluding the `druggability` and `gene_metadata` files), queries UniProtKB for matching accession numbers, and writes the resulting mapping to a TSV file. The notebook uses UniProtKB-Swiss-Prot as its source, which ensures that all returned accessions have been reviewed and annotated by UniProt and are likely to be primary accessions only.

## Installation requirements

Install Python and agora-data-tools following the instructions in this repository's README. This notebook assumes it is being run from the same `pipenv` virtual environment as agora-data-tools. 

Then install the following packages using `pip`:
```
pip install unipressed
```

In [None]:
from unipressed import IdMappingClient
import time
import pandas as pd
import preprocessing_utils

config_filename = "../../../../config.yaml"

## Get Ensembl IDs from data sets that will be processed by agora-data-tools

Loop through all data sets in the config file to collect every Ensembl ID they use. NOTE: In the future, once druggability genes are removed from the `gene_metadata` data set, it would be simpler to load that data set alone rather than looping through all of these files.

In [None]:
ensembl_ids = preprocessing_utils.get_all_adt_ensembl_ids(
    config_filename=config_filename,
    exclude_files=["gene_metadata", "druggability"],
    token=None,
)
print()
print(f"{len(ensembl_ids)} Ensembl IDs found.")

Query UniProt for accession numbers that match the Ensembl IDs. Using `UniProtKB-Swiss-Prot` as the destination ensures that all accession numbers returned have been reviewed and are highly likely to be primary accessions.

In [5]:
# Break the query into batches of 1000 IDs to avoid long jobs that could fail
batch_size = 1000
results = []

for start in range(0, len(ensembl_ids), batch_size):
    end = min(len(ensembl_ids), start + batch_size)
    print(f"Querying genes {start + 1} - {end}")

    request = IdMappingClient.submit(
        source="Ensembl", dest="UniProtKB-Swiss-Prot", ids=ensembl_ids[start:end]
    )

    # Poll every 2 seconds until UniProt reports the job as finished
    found = False
    while not found:
        time.sleep(2)

        status = request.get_status()
        if status == "FINISHED":
            results.extend(request.each_result())
            found = True
        else:
            print("Waiting for response from UniProt...")

Querying genes 1 - 1000
Querying genes 1001 - 2000
Querying genes 2001 - 3000
Querying genes 3001 - 4000
Querying genes 4001 - 5000
Querying genes 5001 - 6000
Querying genes 6001 - 7000
Querying genes 7001 - 8000
Querying genes 8001 - 9000
Querying genes 9001 - 10000
Querying genes 10001 - 11000
Querying genes 11001 - 12000
Querying genes 12001 - 13000
Querying genes 13001 - 14000
Querying genes 14001 - 15000
Querying genes 15001 - 16000
Querying genes 16001 - 17000
Querying genes 17001 - 18000
Querying genes 18001 - 19000
Querying genes 19001 - 20000
Querying genes 20001 - 21000
Querying genes 21001 - 22000
Querying genes 22001 - 23000
Querying genes 23001 - 24000
Querying genes 24001 - 25000
Querying genes 25001 - 26000
Querying genes 26001 - 27000
Querying genes 27001 - 28000
Querying genes 28001 - 29000
Querying genes 29001 - 30000
Querying genes 30001 - 31000
Querying genes 31001 - 32000
Querying genes 32001 - 33000
Querying genes 33001 - 34000
Querying genes 34001 - 35000
...
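The polling loop above only exits when the status is `"FINISHED"`, so a job that fails or stalls would keep it waiting indefinitely. Below is a minimal sketch of a bounded poll. The helper name `wait_for_job` and its parameters are hypothetical, and treating `"ERROR"` as the failure status is an assumption about what the job may report.

```python
import time


def wait_for_job(get_status, timeout_s=300.0, poll_s=2.0, sleep=time.sleep):
    """Poll get_status() until it returns "FINISHED".

    Hypothetical helper: get_status is any zero-argument callable that
    returns a status string, e.g. the bound method request.get_status.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status == "FINISHED":
            return status
        if status == "ERROR":  # assumed failure status, not confirmed here
            raise RuntimeError("ID mapping job reported an error")
        if waited >= timeout_s:
            raise TimeoutError(f"Job still {status!r} after {waited:.0f} s")
        sleep(poll_s)
        waited += poll_s
```

Inside the batch loop, a call such as `wait_for_job(request.get_status)` could stand in for the `while not found` block, followed by collecting `request.each_result()` once it returns.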

In [6]:
mapping = pd.DataFrame(results).rename(
    columns={"from": "RESOURCE_IDENTIFIER", "to": "UniProtKB_accession"}
)
mapping = mapping[["UniProtKB_accession", "RESOURCE_IDENTIFIER"]]
mapping

|  | UniProtKB_accession | RESOURCE_IDENTIFIER |
| --- | --- | --- |
| 0 | A0A075B6I4 | ENSG00000211642 |
| 1 | Q13641 | ENSG00000146242 |
| 2 | Q6PCB7 | ENSG00000130304 |
| 3 | Q7Z591 | ENSG00000106948 |
| 4 | Q5SZD1 | ENSG00000197261 |
| ... | ... | ... |
| 18456 | Q6ZUI0 | ENSG00000188001 |
| 18457 | O43747 | ENSG00000166747 |
| 18458 | Q9UBU2 | ENSG00000155011 |
| 18459 | Q86VY9 | ENSG00000164484 |


In [7]:
mapping.to_csv(path_or_buf="../../output/ensg_to_uniprot_mapping.tsv", sep="\t", header=True, index=False)
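As a quick sanity check on the written file, the TSV can be read back and compared with the frame that produced it. The sketch below is illustrative only: it uses a two-row toy frame (rows copied from the output shown earlier) and a temporary directory rather than the notebook's real output path.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for `mapping`; both rows appear in the notebook output.
toy = pd.DataFrame(
    {
        "UniProtKB_accession": ["A0A075B6I4", "Q13641"],
        "RESOURCE_IDENTIFIER": ["ENSG00000211642", "ENSG00000146242"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ensg_to_uniprot_mapping.tsv")
    # Same options as the notebook: tab-separated, header row, no index column
    toy.to_csv(path, sep="\t", header=True, index=False)
    # dtype=str keeps the identifiers as plain strings when reading back
    reloaded = pd.read_csv(path, sep="\t", dtype=str)

print(list(reloaded.columns))
print(reloaded.shape)
```

Because `index=False` is used when writing, the reloaded frame has only the two named columns and no extra index column.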

## Extra information printouts

Total number of Ensembl IDs that match to a UniProt accession:

In [9]:
matches = len(mapping["RESOURCE_IDENTIFIER"].drop_duplicates())
total = len(ensembl_ids)
pct = round(matches * 100 / total, ndigits = 2)

print(f'{matches:.0f} of {total:.0f} ({pct:.2f}%) Ensembl IDs match to an accession')

18437 of 35858 (51.42%) Ensembl IDs match to an accession
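Since only about half of the queried IDs match, it can be useful to list the ones that did not. The sketch below shows one way to do that on toy data; `toy_ids` and `toy_mapping` are hypothetical stand-ins for `ensembl_ids` and `mapping`, and `ENSG00000000000` is a made-up placeholder ID used only to represent an unmatched entry.

```python
import pandas as pd

# Hypothetical stand-in for `ensembl_ids`; the last ID is a made-up placeholder
toy_ids = ["ENSG00000211642", "ENSG00000146242", "ENSG00000000000"]

# Hypothetical stand-in for `mapping`
toy_mapping = pd.DataFrame(
    {
        "UniProtKB_accession": ["A0A075B6I4", "Q13641"],
        "RESOURCE_IDENTIFIER": ["ENSG00000211642", "ENSG00000146242"],
    }
)

# IDs that were queried but never appear in the mapping
matched = set(toy_mapping["RESOURCE_IDENTIFIER"])
unmatched = [ens_id for ens_id in toy_ids if ens_id not in matched]

print(f"{len(unmatched)} of {len(toy_ids)} Ensembl IDs have no accession")
```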


Ensembl IDs that match to more than one UniProt accession:

In [10]:
dupes = mapping["RESOURCE_IDENTIFIER"].loc[mapping["RESOURCE_IDENTIFIER"].duplicated()].drop_duplicates()
print(f'{len(dupes):d} Ensembl IDs map to more than one UniProt accession')
mapping.loc[mapping["RESOURCE_IDENTIFIER"].isin(dupes)]

23 Ensembl IDs map to more than one UniProt accession


|  | UniProtKB_accession | RESOURCE_IDENTIFIER |
| --- | --- | --- |
| 538 | P0CAP2 | ENSG00000255529 |
| 539 | Q6EEV4 | ENSG00000255529 |
| 2499 | O95467 | ENSG00000087460 |
| 2500 | P63092 | ENSG00000087460 |
| 2501 | Q5JWF2 | ENSG00000087460 |
| 2846 | P39880 | ENSG00000257923 |
| 2847 | Q13948 | ENSG00000257923 |
| 2943 | O96007 | ENSG00000164172 |
| 2944 | O96033 | ENSG00000164172 |
| 4298 | Q8NFQ8 | ENSG00000169905 |
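If a one-row-per-gene view is easier to inspect, the accessions can be grouped by Ensembl ID. This is a sketch on toy data, not part of the notebook's pipeline; `toy_mapping` is a hypothetical stand-in for `mapping`, with rows copied from the outputs above.

```python
import pandas as pd

# Hypothetical stand-in for `mapping`, using rows shown in the notebook output
toy_mapping = pd.DataFrame(
    {
        "UniProtKB_accession": [
            "P0CAP2", "Q6EEV4", "O95467", "P63092", "Q5JWF2", "Q13641",
        ],
        "RESOURCE_IDENTIFIER": [
            "ENSG00000255529", "ENSG00000255529", "ENSG00000087460",
            "ENSG00000087460", "ENSG00000087460", "ENSG00000146242",
        ],
    }
)

# Collect every accession for each Ensembl ID into a list
per_gene = toy_mapping.groupby("RESOURCE_IDENTIFIER", sort=False)[
    "UniProtKB_accession"
].agg(list)

# Keep only the Ensembl IDs that have more than one accession
multi = per_gene[per_gene.map(len) > 1]
print(multi)
```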


UniProt accessions that match to more than one Ensembl ID:

In [11]:
dupes2 = mapping["UniProtKB_accession"].loc[mapping["UniProtKB_accession"].duplicated()].drop_duplicates()
print(f'{len(dupes2):d} UniProt accessions map to more than one Ensembl ID')
mapping.loc[mapping["UniProtKB_accession"].isin(dupes2)]

28 UniProt accessions map to more than one Ensembl ID


|  | UniProtKB_accession | RESOURCE_IDENTIFIER |
| --- | --- | --- |
| 498 | Q08493 | ENSG00000285188 |
| 664 | Q5JQF8 | ENSG00000184388 |
| 845 | P62805 | ENSG00000197061 |
| 1474 | Q71DI3 | ENSG00000203852 |
| 1553 | P0C0S8 | ENSG00000196747 |
| ... | ... | ... |
| 17564 | Q08493 | ENSG00000105650 |
| 18069 | P01562 | ENSG00000197919 |
| 18161 | P62805 | ENSG00000278705 |
| 18253 | P62807 | ENSG00000277224 |
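In the output above, rows that share an accession are scattered across the table. One possible alternative, sketched here on toy data, is to select them with `duplicated(keep=False)` and sort so that shared accessions sit next to each other. `toy_mapping` is a hypothetical stand-in for `mapping`, with rows copied from the outputs above.

```python
import pandas as pd

# Hypothetical stand-in for `mapping`, using rows shown in the notebook output
toy_mapping = pd.DataFrame(
    {
        "UniProtKB_accession": ["Q08493", "P62805", "Q13641", "Q08493", "P62805"],
        "RESOURCE_IDENTIFIER": [
            "ENSG00000285188", "ENSG00000197061", "ENSG00000146242",
            "ENSG00000105650", "ENSG00000278705",
        ],
    }
)

# keep=False marks every row of a repeated accession, not just the later ones
is_shared = toy_mapping["UniProtKB_accession"].duplicated(keep=False)

# A stable sort groups rows by accession while preserving their original order
shared = toy_mapping[is_shared].sort_values("UniProtKB_accession", kind="stable")

n_shared = shared["UniProtKB_accession"].nunique()
print(f"{n_shared} UniProt accessions map to more than one Ensembl ID")
print(shared)
```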
