# Create UniProt Mapping File

This notebook creates a file that maps Ensembl IDs to UniProt accession numbers. It collects all Ensembl IDs present in the Agora input files (excluding druggability), queries UniProtKB for matching accession numbers, and writes the mapping to a TSV file. The notebook uses UniProtKB-Swiss-Prot as its source, which ensures that all returned accessions have been reviewed and annotated by UniProt and are likely to be primary accessions only.

## Installation requirements

Install Python and agora-data-tools following the instructions in this repository's README. This notebook assumes it is being run from the same `pipenv` virtual environment as agora-data-tools. 

Then install the following packages using `pip`:
```
pip install unipressed
```

In [None]:
from unipressed import IdMappingClient
import time
import pandas as pd
import preprocessing_utils

config_filename = "../../../../config.yaml"

## Get the list of nominated targets for Agora

In [None]:
targets_df = preprocessing_utils.load_file_with_name("target_list", config_filename=config_filename)

## Get Ensembl IDs from data sets that will be processed by agora-data-tools

Loop through all data sets in the config file to get all Ensembl IDs used in every data set. NOTE: In the future, it would be simpler to just load the `gene_metadata` data set once druggability genes are removed from it, rather than looping through all of these files. 

In [None]:
ensembl_ids = preprocessing_utils.get_all_adt_ensembl_ids(
    config_filename=config_filename,
    exclude_files=["gene_metadata", "druggability"],
)
print("")
print(str(len(ensembl_ids)) + " Ensembl IDs found.")
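
For illustration, the idea of gathering every Ensembl ID used across several data sets can be sketched with plain pandas. This is only a sketch under stated assumptions: `collect_ensembl_ids` is a hypothetical helper, the toy DataFrames stand in for real input files, and the assumption that every data set stores its IDs in an `ensembl_gene_id` column may not hold. The real `get_all_adt_ensembl_ids` may work differently.

```python
import pandas as pd


def collect_ensembl_ids(frames, column="ensembl_gene_id"):
    """Hypothetical helper: return the sorted, unique, non-null IDs in `column` across all frames."""
    ids = set()
    for df in frames:
        # Skip data sets that do not have the ID column at all
        if column in df.columns:
            ids.update(df[column].dropna().astype(str))
    return sorted(ids)


# Toy stand-ins for two input data sets (assumed column name)
df_a = pd.DataFrame({"ensembl_gene_id": ["ENSG00000130203", "ENSG00000142192"]})
df_b = pd.DataFrame({"ensembl_gene_id": ["ENSG00000142192", None]})

toy_ids = collect_ensembl_ids([df_a, df_b])
print(f"{len(toy_ids)} Ensembl IDs found.")
```

Collecting into a set removes IDs that appear in more than one data set, so each ID is sent to UniProt only once.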

Query UniProt for accession numbers that match the Ensembl IDs. Using `UniProtKB-Swiss-Prot` as the destination database ensures that all accession numbers returned have been reviewed and are highly likely to be primary accessions.

In [None]:
# Break the query into smaller chunks to avoid long jobs that could fail
batch_size = 1000
results = []

for start in range(0, len(ensembl_ids), batch_size):
    end = min(len(ensembl_ids), start + batch_size)
    print(f"Querying genes {start + 1} - {end}")

    request = IdMappingClient.submit(
        source="Ensembl", dest="UniProtKB-Swiss-Prot", ids=ensembl_ids[start:end]
    )

    # Poll every 2 seconds until UniProt reports that the job has finished.
    # Note: this loop has no timeout, so interrupt the kernel if a job stalls.
    finished = False
    while not finished:
        time.sleep(2)

        status = request.get_status()
        if status == "FINISHED":
            results.extend(request.each_result())
            finished = True
        else:
            print("Waiting for response from UniProt...")
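
The polling loop above waits indefinitely for a `"FINISHED"` status. One possible way to bound the wait is sketched below. `wait_until_finished` and its parameters are hypothetical names introduced for illustration, and the `"RUNNING"` strings in the demo are placeholders rather than a statement about which statuses UniProt returns.

```python
import time


def wait_until_finished(get_status, poll_seconds=2.0, timeout_seconds=600.0, sleep=time.sleep):
    """Hypothetical helper: poll `get_status()` until it returns "FINISHED".

    Returns the number of polls made. Raises TimeoutError once the time spent
    sleeping reaches `timeout_seconds` (time spent inside `get_status` is not counted).
    """
    waited = 0.0
    polls = 0
    while True:
        polls += 1
        if get_status() == "FINISHED":
            return polls
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job not finished after about {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with a fake status source so that no network access is needed
fake_statuses = iter(["RUNNING", "RUNNING", "FINISHED"])
polls = wait_until_finished(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(f"Finished after {polls} polls")
```

In this notebook, the helper could be called as `wait_until_finished(request.get_status)` before collecting `request.each_result()`. Passing `sleep` as a parameter is a design choice that keeps the demo fast and testable.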

In [None]:
mapping = pd.DataFrame(results).rename(
    columns={"from": "RESOURCE_IDENTIFIER", "to": "UniProtKB_accession"}
)
mapping = mapping[["UniProtKB_accession", "RESOURCE_IDENTIFIER"]]

nomination_string = "Agora Nominated Target for Alzheimer’s Disease"

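# Flag rows whose Ensembl ID is a nominated target. A single .loc call avoids
# pandas chained assignment, which may fail to modify `mapping` in place.
mapping["OPTIONAL_INFORMATION"] = ""
mapping.loc[
    mapping["RESOURCE_IDENTIFIER"].isin(targets_df["ensembl_gene_id"]),
    "OPTIONAL_INFORMATION",
] = nomination_string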
mapping

In [None]:
mapping.to_csv(path_or_buf="../../output/ensg_to_uniprot_mapping.tsv", sep="\t", header=True, index=False)
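
As a quick sanity check, the written file can be read back with pandas. The sketch below uses a temporary file and placeholder rows rather than the real output, so the values are illustrative only. `keep_default_na=False` is used because the empty `OPTIONAL_INFORMATION` strings would otherwise be read back as `NaN`.

```python
import os
import tempfile

import pandas as pd

# Placeholder rows only; these are not results produced by this notebook
toy_mapping = pd.DataFrame(
    {
        "UniProtKB_accession": ["ACC_1", "ACC_2"],
        "RESOURCE_IDENTIFIER": ["GENE_1", "GENE_2"],
        "OPTIONAL_INFORMATION": ["", "nominated"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_mapping.tsv")
    # Same writer options as the notebook cell above
    toy_mapping.to_csv(path, sep="\t", header=True, index=False)
    # Read the tab-separated file back, keeping empty strings as empty strings
    reloaded = pd.read_csv(path, sep="\t", keep_default_na=False)

print(list(reloaded.columns))
print(reloaded["OPTIONAL_INFORMATION"].tolist())
```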

## Extra information printouts

Total number of Ensembl IDs that match to a UniProt accession:

In [None]:
matches = len(mapping["RESOURCE_IDENTIFIER"].drop_duplicates())
total = len(ensembl_ids)
pct = matches * 100 / total

print(f"{matches} of {total} ({pct:.2f}%) Ensembl IDs match to an accession")

Ensembl IDs that match to more than one UniProt accession:

In [None]:
dupes = mapping["RESOURCE_IDENTIFIER"].loc[mapping["RESOURCE_IDENTIFIER"].duplicated()].drop_duplicates()
print(f'{len(dupes):d} Ensembl IDs map to more than one UniProt accession')
mapping.loc[mapping["RESOURCE_IDENTIFIER"].isin(dupes)].sort_values(by="RESOURCE_IDENTIFIER")
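
An alternative way to summarize one-to-many matches is to count distinct accessions per Ensembl ID with `groupby`. The sketch below runs on placeholder rows (the `A*` and `G*` values are made up) and is not the method used in this notebook.

```python
import pandas as pd

# Placeholder mapping: G1 maps to two accessions, and A3 is shared by G2 and G3
toy = pd.DataFrame(
    {
        "UniProtKB_accession": ["A1", "A2", "A3", "A3"],
        "RESOURCE_IDENTIFIER": ["G1", "G1", "G2", "G3"],
    }
)

# Number of distinct accessions for each Ensembl ID
accessions_per_gene = toy.groupby("RESOURCE_IDENTIFIER")["UniProtKB_accession"].nunique()

# Keep only the Ensembl IDs with more than one accession
multi = accessions_per_gene[accessions_per_gene > 1]
print(multi)
```

Swapping the two column names gives the reverse summary: accessions that are shared by more than one Ensembl ID.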

UniProt accessions that match to more than one Ensembl ID:

In [None]:
dupes2 = mapping["UniProtKB_accession"].loc[mapping["UniProtKB_accession"].duplicated()].drop_duplicates()
print(f'{len(dupes2):d} UniProt accessions map to more than one Ensembl ID')
mapping.loc[mapping["UniProtKB_accession"].isin(dupes2)].sort_values(by="UniProtKB_accession")

Are any nominated targets missing a UniProt accession?

In [None]:
ens = targets_df["ensembl_gene_id"].drop_duplicates()
missing = len(ens) - sum(ens.isin(mapping["RESOURCE_IDENTIFIER"]))

if missing == 0:
    print("All nominated targets have a matching UniProt accession.")

else:
    print(f"{missing} of {len(ens)} nominated targets are missing a UniProt accession.")
    # Vectorized lookup instead of rebuilding a list for every element
    missing_ens = ens[~ens.isin(mapping["RESOURCE_IDENTIFIER"])]
    print(
        targets_df[targets_df["ensembl_gene_id"].isin(missing_ens)][
            ["ensembl_gene_id", "hgnc_symbol"]
        ]
    )