# Preprocess Gene Annotations

This notebook creates a table of gene annotations by:
1. Querying Biomart for all Ensembl IDs in the database
2. Querying MyGene for annotation about those IDs
3. Querying Ensembl for the most recent Ensembl release for each ID
4. Building a permalink to the Ensembl archive page for each ID

This gene annotation table is read in by `agoradataprocessing/process.py` to be used in the `gene_info` transformation. 

***Note:*** *This notebook is exploratory and should eventually be converted to a Python script that is run through an automated process.*

## Installation requirements

#### Linux / Windows / Mac

Install R: https://cran.r-project.org/

Install Python and agora-data-tools following the instructions in this repository's README. This notebook assumes it is being run from the same `pipenv` virtual environment as agora-data-tools. 

Then install the following packages using `pip`:
```
pip install rpy2 mygene
```

#### Note for Macs with M1 chips (2020 and newer)

Install as above, but make sure that your R installation is the arm64 version (R-4.X.X-arm64.pkg) so that the architecture matches what pip is using. 
You may also need to install an older version of `rpy2` on the Mac:
```
pip install rpy2==3.5.12
```

In [1]:
from rpy2.robjects import r
import pandas as pd
import mygene
import numpy as np
import requests
import agoradatatools.etl.utils as utils
import agoradatatools.etl.extract as extract
import preprocessing_utils

r('if (!require("BiocManager", character.only = TRUE)) { install.packages("BiocManager") }')
r('if (!require("biomaRt")) { BiocManager::install("biomaRt") }')

r.library("biomaRt")

biomart_filename = "../../output/biomart_ensg_list.txt"
archive_filename = "../../output/ensembl_archive_list.csv"
config_filename = "../../../../config.yaml"







# Part 1: Get gene annotation data

## Query Biomart for a list of all Ensembl IDs in the database of human genes. 

Here we use the R library `biomaRt`. There is no canonical Python library with the features we need for this script. 

In [2]:
# Sometimes Biomart doesn't respond and the command needs to be sent again. Try up to 5 times.
ensembl_ids = None

for attempt in range(5):
    try:
        mart = r.useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")
        ensembl_ids = r.getBM(
            attributes=r.c("ensembl_gene_id", "chromosome_name", "hgnc_symbol"),
            mart=mart,
            useCache=False,
        )

    except Exception:
        print("Trying again...")
        ensembl_ids = None
    else:
        break

if ensembl_ids is None or ensembl_ids.nrow == 0:
    print("Biomart was unresponsive after 5 attempts. Try again later.")

else:
    # Convert each column of the R data frame to a Python list
    ensembl_ids_df = pd.DataFrame(
        {
            "ensembl_gene_id": list(ensembl_ids.rx2("ensembl_gene_id")),
            "chromosome_name": list(ensembl_ids.rx2("chromosome_name")),
            "hgnc_symbol": list(ensembl_ids.rx2("hgnc_symbol")),
        }
    )
    print(ensembl_ids_df)
    print(str(ensembl_ids_df.shape[0]) + " genes found.")

       ensembl_gene_id chromosome_name hgnc_symbol
0      ENSG00000210049              MT       MT-TF
1      ENSG00000211459              MT     MT-RNR1
2      ENSG00000210077              MT       MT-TV
3      ENSG00000210082              MT     MT-RNR2
4      ENSG00000209082              MT      MT-TL1
...                ...             ...         ...
70707  ENSG00000288629               1            
70708  ENSG00000288678               1            
70709  ENSG00000290825               1            
70710  ENSG00000227232               1      WASH7P
70711  ENSG00000290826               1            

[70712 rows x 3 columns]
70712 genes found.


## Remove HASGs from Biomart results
This removes all human alternative sequence genes (HASGs) and patches from the Biomart results. These genes can be identified by their chromosome name.

In [3]:
ensembl_ids_df = preprocessing_utils.filter_hasgs(
    df=ensembl_ids_df, chromosome_name_column="chromosome_name"
)
print(str(ensembl_ids_df.shape[0]) + " genes remaining after HASG filtering.")

63188 genes remaining after HASG filtering.
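
The actual filtering is done by `preprocessing_utils.filter_hasgs`, whose implementation is not shown in this notebook. As a rough sketch of the idea only, assuming that alternative-sequence and patch genes have Biomart chromosome names beginning with `CHR_` (an assumption about the naming convention), a name-based filter could look like the following. The helper `is_alt_sequence` and the third sample row are hypothetical.

```python
def is_alt_sequence(chromosome_name: str) -> bool:
    """Return True for names that look like alternative sequences or patches.

    Assumption: such names start with "CHR_". The real filter_hasgs may
    apply a different rule.
    """
    return chromosome_name.startswith("CHR_")


rows = [
    ("ENSG00000210049", "MT"),
    ("ENSG00000227232", "1"),
    ("ENSG_EXAMPLE_ALT", "CHR_HSCHR6_MHC_COX_CTG1"),  # made-up ID for illustration
]

# Keep only rows located on primary chromosomes
kept = [(gene, chrom) for gene, chrom in rows if not is_alt_sequence(chrom)]
print(len(kept), "genes remaining after filtering.")
```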


## Add Ensembl IDs from data sets that will be processed by agora-data-tools

Some of these datasets have older/retired Ensembl IDs that no longer exist in the current Ensembl database. Loop through all data sets in the config file to get all Ensembl IDs they use, and add missing ones to the gene list. 
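
The "add missing ones" step boils down to a set difference between the IDs seen in the data files and the IDs returned by Biomart. A minimal sketch with illustrative values (the variable names here are not the ones used in the cells below):

```python
# IDs returned by Biomart (illustrative subset)
biomart_ids = {"ENSG00000210049", "ENSG00000227232"}

# IDs collected from the data files, including placeholder values
file_ids = ["ENSG00000227232", "ENSG00000277328", "n/A", None]

# Drop placeholders before comparing
valid_file_ids = {i for i in file_ids if i and i != "n/A"}

# Anything left over is absent from Biomart and must be appended to the gene list
missing = sorted(valid_file_ids - biomart_ids)
print(missing)
```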

In [4]:
config = utils._get_config(config_path=config_filename)
datasets = config["datasets"]

files = {}

for dataset in datasets:
    dataset_name = list(dataset.keys())[0]

    for entity in dataset[dataset_name]["files"]:
        entity_id = entity["id"]
        entity_format = entity["format"]
        entity_name = entity["name"]

        # Ignore json files, which are post-processed and not what we're interested in.
        # Also ignore "gene_metadata" since that's the file we're making here.
        if entity_format != "json" and entity_name != "gene_metadata":
            files[entity_name] = (entity_id, entity_format)

# There are some duplicate synIDs in this list but that doesn't really matter
files

{'genes_biodomains': ('syn44151254.4', 'csv'),
 'neuropath_regression_results': ('syn22017882.5', 'csv'),
 'proteomics': ('syn18689335.3', 'csv'),
 'proteomics_tmt': ('syn35221005.2', 'csv'),
 'proteomics_srm': ('syn52579640.4', 'csv'),
 'target_exp_validation_harmonized': ('syn24184512.8', 'csv'),
 'metabolomics': ('syn26064497.1', 'feather'),
 'igap': ('syn12514826.5', 'csv'),
 'eqtl': ('syn12514912.3', 'csv'),
 'diff_exp_data': ('syn27211942.1', 'tsv'),
 'target_list': ('syn12540368.47', 'csv'),
 'median_expression': ('syn27211878.2', 'csv'),
 'druggability': ('syn13363443.11', 'csv'),
 'tep_adi_info': ('syn51942280.2', 'csv'),
 'team_info': ('syn12615624.18', 'csv'),
 'team_member_info': ('syn12615633.18', 'csv'),
 'overall_scores': ('syn25575156.13', 'table'),
 'networks': ('syn11685347.1', 'csv')}

### We should now have a list of all raw data files ingested. Get each one and add its Ensembl IDs to the gene list.

In [5]:
# Assumes you have already logged in with a valid token
syn = utils._login_to_synapse(token=None)

# The various column names used to store Ensembl IDs in the files
col_names = ["ENSG", "ensembl_gene_id", "GeneID", "ensembl_id"]
file_ensembl_list = []

for file in files.keys():
    df = extract.get_entity_as_df(syn_id=files[file][0], source=files[file][1], syn=syn)

    file_ensembl_ids = None

    for C in col_names:
        if C in df.columns:
            file_ensembl_ids = df[C]

    # networks file is a special case
    if file == "networks":
        file_ensembl_ids = pd.melt(
            df[["geneA_ensembl_gene_id", "geneB_ensembl_gene_id"]]
        )["value"]

    # genes_biodomains is a special case -- the ensembl_id field has some semicolon-separated lists in it
    if file == "genes_biodomains":
        df = df[["Biodomain", "ensembl_id"]].drop_duplicates().dropna()
        df = utils.split_delimited_field_to_multiple_rows(
            df=df, split_field="ensembl_id", delim=";"
        )
        file_ensembl_ids = df["ensembl_id"].drop_duplicates()

    if file_ensembl_ids is not None:
        file_ensembl_list = file_ensembl_list + file_ensembl_ids.tolist()
        if "n/A" in file_ensembl_ids.tolist():
            print(file + " has an n/A Ensembl ID")
        # Use isna(): `np.NaN in list` depends on object identity, and np.NaN was removed in NumPy 2.0
        if file_ensembl_ids.isna().any():
            print(file + " has an NaN Ensembl ID")
    else:
        print("WARNING: no Ensembl ID column found for " + file + "!")

Welcome, Jaclyn Beck!






target_exp_validation_harmonized has an n/A Ensembl ID


In [6]:
# Remove duplicates and the "n/A" placeholder that one data set uses in place of an Ensembl ID.
# Set subtraction does not raise an error if "n/A" is absent.
file_ensembl_list = list(set(file_ensembl_list) - {"n/A"})

# Add Ensembl IDs that are in the files but not in the biomart result
missing = set(file_ensembl_list) - set(ensembl_ids_df["ensembl_gene_id"])
print(
    str(len(missing))
    + " genes from the data files are missing from Biomart results and will be added."
)

missing_df = pd.DataFrame({"ensembl_gene_id": list(missing), "chromosome_name": ""})
ensembl_ids_df = pd.concat([ensembl_ids_df, missing_df])
ensembl_ids_df = ensembl_ids_df.dropna(subset=["ensembl_gene_id"])
print(len(ensembl_ids_df))

1821 genes from the data files are missing from Biomart results and will be added.
65009


In [7]:
# Write to a file
ensembl_ids_df.to_csv(path_or_buf=biomart_filename, sep="\t", header=False, index=False)

## Get info on each gene from mygene

In [8]:
mg = mygene.MyGeneInfo()

mygene_output = mg.getgenes(
    ensembl_ids_df["ensembl_gene_id"],
    fields=["symbol", "name", "summary", "type_of_gene", "alias"],
    as_dataframe=True,
)

mygene_output.index.rename("ensembl_gene_id", inplace=True)
mygene_output.head()

INFO:biothings.client:querying 1-1000...
INFO:biothings.client:done.
INFO:biothings.client:querying 1001-2000...
INFO:biothings.client:done.
INFO:biothings.client:querying 2001-3000...
INFO:biothings.client:done.
INFO:biothings.client:querying 3001-4000...
INFO:biothings.client:done.
INFO:biothings.client:querying 4001-5000...
INFO:biothings.client:done.
INFO:biothings.client:querying 5001-6000...
INFO:biothings.client:done.
INFO:biothings.client:querying 6001-7000...
INFO:biothings.client:done.
INFO:biothings.client:querying 7001-8000...
INFO:biothings.client:done.
INFO:biothings.client:querying 8001-9000...
INFO:biothings.client:done.
INFO:biothings.client:querying 9001-10000...
INFO:biothings.client:done.
INFO:biothings.client:querying 10001-11000...
INFO:biothings.client:done.
INFO:biothings.client:querying 11001-12000...
INFO:biothings.client:done.
INFO:biothings.client:querying 12001-13000...
INFO:biothings.client:done.
INFO:biothings.client:querying 13001-14000...
INFO:biothings

ensembl_gene_id,_id,_version,name,symbol,type_of_gene,alias,summary,notfound
ENSG00000210049,4558,2.0,tRNA-Phe,TRNF,tRNA,,,
ENSG00000211459,4549,2.0,s-rRNA,RNR1,rRNA,MTRNR1,Enables DNA binding activity and DNA-binding t...,
ENSG00000210077,4577,2.0,tRNA-Val,TRNV,tRNA,MTTV,,
ENSG00000210082,4550,2.0,l-rRNA,RNR2,rRNA,MTRNR2,Enables G protein-coupled receptor binding act...,
ENSG00000209082,4567,2.0,tRNA-Leu,TRNL1,tRNA,MTTL1,Implicated in cardiomyopathy. [provided by All...,


In [9]:
print("Annotations found for " + str(sum(mygene_output["notfound"].isna())) + " genes.")
print(
    "No annotations found for "
    + str(sum(mygene_output["notfound"] == True))
    + " genes."
)

Annotations found for 63844 genes.
No annotations found for 1176 genes.


# Part 2: Clean the data

## Join and standardize columns / values

For consistency with the `agora-data-tools` transform process, this uses the etl standardize functions.

In [10]:
gene_table_merged = pd.merge(
    left=ensembl_ids_df,
    right=mygene_output,
    how="left",
    on="ensembl_gene_id",
    validate="many_to_many",
)

gene_table_merged = utils.standardize_column_names(gene_table_merged)
gene_table_merged = utils.standardize_values(gene_table_merged)

print(gene_table_merged.shape)
gene_table_merged.head()

(65022, 11)


Unnamed: 0,ensembl_gene_id,chromosome_name,hgnc_symbol,_id,_version,name,symbol,type_of_gene,alias,summary,notfound
0,ENSG00000210049,MT,MT-TF,4558,2.0,tRNA-Phe,TRNF,tRNA,,,
1,ENSG00000211459,MT,MT-RNR1,4549,2.0,s-rRNA,RNR1,rRNA,MTRNR1,Enables DNA binding activity and DNA-binding t...,
2,ENSG00000210077,MT,MT-TV,4577,2.0,tRNA-Val,TRNV,tRNA,MTTV,,
3,ENSG00000210082,MT,MT-RNR2,4550,2.0,l-rRNA,RNR2,rRNA,MTRNR2,Enables G protein-coupled receptor binding act...,
4,ENSG00000209082,MT,MT-TL1,4567,2.0,tRNA-Leu,TRNL1,tRNA,MTTL1,Implicated in cardiomyopathy. [provided by All...,


## Fix alias field

Fix `NaN` values in the `alias` field and make sure every alias value is a list, not a string.

In [11]:
# NaN or NULL alias values become empty lists
for row in gene_table_merged.loc[gene_table_merged["alias"].isnull(), "alias"].index:
    gene_table_merged.at[row, "alias"] = []

# Some alias values are a single string, not a list. Turn them into lists here.
gene_table_merged["alias"] = gene_table_merged["alias"].apply(
    lambda cell: cell if isinstance(cell, list) else [cell]
)

# Some alias values are lists of lists or have duplicate values
def flatten(row):
    flattened = []
    for item in row:
        if isinstance(item, list):
            flattened = flattened + item
        else:
            flattened.append(item)
    return flattened


gene_table_merged["alias"] = gene_table_merged["alias"].apply(
    lambda row: list(set(flatten(row)))
)
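
The three clean-up steps above can also be folded into one helper, which is easier to unit test. This is a sketch under the assumption that alias values are `NaN`, a single string, a list, or a list containing nested lists. The name `normalize_alias` is hypothetical, and unlike the `set`-based cell above it keeps aliases in first-seen order.

```python
import math


def normalize_alias(value):
    """Return a flat, de-duplicated list of alias strings."""
    # Missing values (None or float NaN) become an empty list
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []

    # Wrap a bare string so it is not iterated character by character
    if not isinstance(value, list):
        value = [value]

    flat = []
    for item in value:
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)

    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(flat))


print(normalize_alias(float("nan")))
print(normalize_alias("MTTV"))
print(normalize_alias([["A", "B"], "A", "C"]))
```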

## Remove duplicate Ensembl IDs from the list. 

Duplicates in the list typically have the same Ensembl ID but different gene symbols. This usually happens when a single Ensembl ID maps to multiple Entrez IDs in the NCBI database. There is no good way to reconcile this, so duplicates are resolved in two steps. First, if exactly one entry has an `hgnc_symbol` from Biomart that matches its `symbol` from NCBI, that entry is kept and the others are discarded. Otherwise, the first entry in the list for that Ensembl ID is kept (chosen from the matching entries if there are several), which is what the Agora front end does, and the gene symbols and aliases of the discarded rows are added as aliases to the kept row.
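
To make the rule concrete, here is a small sketch that applies it to plain dicts instead of DataFrame rows. The function name `collapse_duplicates` is hypothetical, and aliases are sorted here for a stable result, which the notebook itself does not do. The sample rows are the two `ENSG00000188660` entries shown in the output further down.

```python
def collapse_duplicates(rows):
    """Collapse rows that share one Ensembl ID into a single row.

    Each row is a dict with "hgnc_symbol", "symbol", and "alias" keys.
    """
    matches = [r for r in rows if r["hgnc_symbol"] == r["symbol"]]

    # Exactly one row agrees with Biomart: keep it unchanged
    if len(matches) == 1:
        return dict(matches[0])

    # Several rows agree: only those compete. None agree: all rows compete.
    candidates = matches if len(matches) > 1 else rows
    first, rest = candidates[0], candidates[1:]

    # Fold the symbols and aliases of the discarded rows into the kept row
    aliases = list(first["alias"])
    for row in rest:
        aliases.append(row["symbol"])
        aliases.extend(row["alias"])

    kept = dict(first)
    kept["alias"] = sorted(set(aliases))
    return kept


group = [
    {"hgnc_symbol": "LINC00319", "symbol": "LOC124900467", "alias": []},
    {"hgnc_symbol": "LINC00319", "symbol": "CH507-42P11.6", "alias": []},
]
result = collapse_duplicates(group)
print(result["symbol"], result["alias"])
```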

In [12]:
# duplicated() will return True if the ID is a duplicate and is not the first one to appear in the list.
dupes = gene_table_merged["ensembl_gene_id"].duplicated()
dupe_vals = gene_table_merged[dupes]

# Rows with duplicated Ensembl IDs
all_duplicated = gene_table_merged.loc[
    gene_table_merged["ensembl_gene_id"].isin(dupe_vals["ensembl_gene_id"])
]
all_duplicated

Unnamed: 0,ensembl_gene_id,chromosome_name,hgnc_symbol,_id,_version,name,symbol,type_of_gene,alias,summary,notfound
4089,ENSG00000287838,9.0,,101927042,1.0,uncharacterized LOC101927042,LOC101927042,ncRNA,[],,
4090,ENSG00000287838,9.0,,124902157,2.0,uncharacterized LOC124902157,LOC124902157,ncRNA,[],,
5130,ENSG00000230417,10.0,LINC00856,414243,1.0,long intergenic non-protein coding RNA 595,LINC00595,ncRNA,[C10orf101],,
5131,ENSG00000230417,10.0,LINC00856,414243,1.0,long intergenic non-protein coding RNA 595,LINC00595,ncRNA,[C10orf101],,
5132,ENSG00000230417,10.0,LINC00595,414243,1.0,long intergenic non-protein coding RNA 595,LINC00595,ncRNA,[C10orf101],,
5133,ENSG00000230417,10.0,LINC00595,414243,1.0,long intergenic non-protein coding RNA 595,LINC00595,ncRNA,[C10orf101],,
8675,ENSG00000188660,21.0,LINC00319,124900467,1.0,uncharacterized LOC124900467,LOC124900467,protein-coding,[],,
8676,ENSG00000188660,21.0,LINC00319,102724398,1.0,uncharacterized CH507-42P11.6,CH507-42P11.6,ncRNA,[],,
12016,ENSG00000278903,21.0,,124905527,1.0,uncharacterized LOC124905527,LOC124905527,ncRNA,[],,
12017,ENSG00000278903,21.0,,124905312,1.0,uncharacterized LOC124905312,LOC124905312,ncRNA,[],,


In [13]:
non_dupes = set(gene_table_merged.index) - set(all_duplicated.index)
keep_df = gene_table_merged.loc[list(non_dupes)].copy(deep=True)

# For each duplicated Ensembl ID, collapse to 1 row and append that row to keep_df
for ens_id in set(all_duplicated["ensembl_gene_id"]):
    group = all_duplicated.loc[all_duplicated["ensembl_gene_id"] == ens_id].copy(
        deep=True
    )
    matches = group["hgnc_symbol"] == group["symbol"]

    # If there is a single entry with a matching symbol from NCBI, use that row and ignore the others
    if sum(matches) == 1:
        # DataFrame.append was removed in pandas 2.0, so use pd.concat here as well
        keep_df = pd.concat([keep_df, group.loc[matches]], ignore_index=True)
    # Multiple or no matching symbols, save the first entry in the list and add the other rows as aliases
    else:
        # For multiple matching symbols, discard non-matching entries and continue
        if sum(matches) > 1:
            group = group.loc[matches]

        # Add all duplicate symbols and their aliases to the alias field of the first entry
        for row in group.index[1:]:
            group.at[group.index[0], "alias"].append(group["symbol"][row])
            if len(group.at[row, "alias"]) > 0:
                group.at[group.index[0], "alias"] = (
                    group.at[group.index[0], "alias"] + group["alias"][row]
                )

        # Make sure we didn't add duplicate aliases
        group.at[group.index[0], "alias"] = list(set(group.at[group.index[0], "alias"]))

        # Keep the first row only, which now has all the aliases
        keep_df = pd.concat([keep_df, group.iloc[0].to_frame().T], ignore_index=True)

print(
    str(len(all_duplicated.drop_duplicates("ensembl_gene_id")))
    + " duplicated genes have been processed."
)
gene_table_merged = keep_df.reset_index(drop=True)
gene_table_merged.tail(n=10)

8 duplicated genes have been processed.


Unnamed: 0,ensembl_gene_id,chromosome_name,hgnc_symbol,_id,_version,name,symbol,type_of_gene,alias,summary,notfound
64998,ENSG00000277936,,,84311.0,1.0,mitochondrial ribosomal protein L45,MRPL45,protein-coding,"[MRP-L45, L45mt, Mba1, mL45]",Mammalian mitochondrial ribosomal proteins are...,
64999,ENSG00000277328,,,,,,,,[],,True
65000,ENSG00000287838,9.0,,101927042.0,1.0,uncharacterized LOC101927042,LOC101927042,ncRNA,[LOC124902157],,
65001,ENSG00000249738,5.0,,105377683.0,1.0,uncharacterized LOC105377683,LOC105377683,ncRNA,[LOC285626],,
65002,ENSG00000293331,1.0,,101928626.0,2.0,uncharacterized LOC101928626,LOC101928626,ncRNA,[LOC124901156],,
65003,ENSG00000276518,,,128966722.0,2.0,putative killer cell immunoglobulin-like recep...,LOC128966722,protein-coding,"[LOC128966731, LOC128966733, LOC128966730, LOC...",,
65004,ENSG00000230417,10.0,LINC00595,414243.0,1.0,long intergenic non-protein coding RNA 595,LINC00595,ncRNA,"[LINC00595, C10orf101]",,
65005,ENSG00000278903,21.0,,124905527.0,1.0,uncharacterized LOC124905527,LOC124905527,ncRNA,"[LOC124905468, LOC124905312]",,
65006,ENSG00000230373,15.0,GOLGA6L5P,100133220.0,1.0,"golgin A6 family like 3, pseudogene",GOLGA6L3P,pseudo,"[GOLGA6L17P, GOLGA6L21P, GOLGA6L3]",,
65007,ENSG00000188660,21.0,LINC00319,124900467.0,1.0,uncharacterized LOC124900467,LOC124900467,protein-coding,[CH507-42P11.6],,


# Part 3: Create Ensembl archive permalinks

## Get a table of Ensembl archive URLs

This is where we need to use the R biomaRt library specifically, instead of any of the available Python interfaces to Biomart, to get a table of Ensembl release versions and their corresponding archive URLs. 

In [14]:
archive_df = r.listEnsemblArchives()
archive_df.to_csvfile(path=archive_filename, row_names=False, quote=False)

print(archive_df)

             name     date                                 url version
1  Ensembl GRCh37 Feb 2014          https://grch37.ensembl.org  GRCh37
2     Ensembl 111 Jan 2024 https://jan2024.archive.ensembl.org     111
3     Ensembl 110 Jul 2023 https://jul2023.archive.ensembl.org     110
4     Ensembl 109 Feb 2023 https://feb2023.archive.ensembl.org     109
5     Ensembl 108 Oct 2022 https://oct2022.archive.ensembl.org     108
6     Ensembl 107 Jul 2022 https://jul2022.archive.ensembl.org     107
7     Ensembl 106 Apr 2022 https://apr2022.archive.ensembl.org     106
8     Ensembl 105 Dec 2021 https://dec2021.archive.ensembl.org     105
9     Ensembl 104 May 2021 https://may2021.archive.ensembl.org     104
10    Ensembl 103 Feb 2021 https://feb2021.archive.ensembl.org     103
11    Ensembl 102 Nov 2020 https://nov2020.archive.ensembl.org     102
12    Ensembl 101 Aug 2020 https://aug2020.archive.ensembl.org     101
13    Ensembl 100 Apr 2020 https://apr2020.archive.ensembl.org     100
14    

## Query Ensembl for each gene's version

Ensembl's REST API can only take 1000 genes at once, so this is looped to query groups of 1000. 
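
The batching itself is independent of the REST call and can be sketched as a small generator. The helper name `batched` and the placeholder IDs are illustrative. The batch size of 1000 comes from the limit stated above.

```python
def batched(items, size=1000):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


# Placeholder IDs, only used to show how the batches are sized
ids = [f"ENSG{n:011d}" for n in range(2500)]
batch_sizes = [len(batch) for batch in batched(ids)]
print(batch_sizes)  # [1000, 1000, 500]
```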

In [15]:
url = "https://rest.ensembl.org/archive/id"
headers = {"Content-Type": "application/json", "Accept": "application/json"}

ids = gene_table_merged["ensembl_gene_id"].tolist()
print(len(ids))

# We can only query 1000 genes at a time
batch_ind = range(0, len(ids), 1000)
results = []

for B in batch_ind:
    end = min(len(ids), B + 1000)
    print("Querying genes " + str(B + 1) + " - " + str(end))

    # Let requests serialize the payload rather than building the JSON string by hand
    request_data = {"id": ids[B:end]}

    ok = False
    tries = 0

    while tries < 5 and not ok:
        try:
            res = requests.post(url, headers=headers, json=request_data)
            ok = res.ok
        except requests.RequestException:
            ok = False

        tries = tries + 1

        if not ok:
            # res.raise_for_status()
            print(
                "Error retrieving Ensembl versions for genes "
                + str(B + 1)
                + " - "
                + str(end)
                + ". Trying again..."
            )
        else:
            results = results + res.json()
            break

print(len(results))

versions = pd.json_normalize(results)

versions.tail()

65008
Querying genes 1 - 1000
Querying genes 1001 - 2000
Querying genes 2001 - 3000
Querying genes 3001 - 4000
Querying genes 4001 - 5000
Querying genes 5001 - 6000
Querying genes 6001 - 7000
Querying genes 7001 - 8000
Querying genes 8001 - 9000
Querying genes 9001 - 10000
Querying genes 10001 - 11000
Querying genes 11001 - 12000
Querying genes 12001 - 13000
Querying genes 13001 - 14000
Querying genes 14001 - 15000
Querying genes 15001 - 16000
Querying genes 16001 - 17000
Querying genes 17001 - 18000
Querying genes 18001 - 19000
Querying genes 19001 - 20000
Querying genes 20001 - 21000
Querying genes 21001 - 22000
Querying genes 22001 - 23000
Querying genes 23001 - 24000
Querying genes 24001 - 25000
Querying genes 25001 - 26000
Querying genes 26001 - 27000
Querying genes 27001 - 28000
Querying genes 28001 - 29000
Querying genes 29001 - 30000
Querying genes 30001 - 31000
Querying genes 31001 - 32000
Querying genes 32001 - 33000
Querying genes 33001 - 34000
Querying genes 34001 - 35000
Q

Unnamed: 0,assembly,peptide,release,latest,possible_replacement,version,id,type,is_current
65003,GRCh38,,111,ENSG00000276518.1,[],1,ENSG00000276518,Gene,1
65004,GRCh38,,111,ENSG00000230417.12,[],12,ENSG00000230417,Gene,1
65005,GRCh38,,111,ENSG00000278903.5,[],5,ENSG00000278903,Gene,1
65006,GRCh38,,111,ENSG00000230373.9,[],9,ENSG00000230373,Gene,1
65007,GRCh38,,111,ENSG00000188660.5,[],5,ENSG00000188660,Gene,1


In [16]:
versions.groupby("release").size()

release
100       22
101        8
102       16
103       15
104       19
105        9
106       34
107       10
108        4
109        4
110       11
111    63843
80        21
81         2
82        10
84       673
87        61
89        20
91        75
93        53
95        33
96        31
97        18
98         9
99         7
dtype: int64

In [17]:
# Check that all IDs are the same between the result and the gene table
print(len(versions["id"]))
print(len(gene_table_merged))
print(
    all(versions["id"].isin(gene_table_merged["ensembl_gene_id"]))
    and all(gene_table_merged["ensembl_gene_id"].isin(versions["id"]))
)

65008
65008
True


In [18]:
# Make sure everything is GRCh38, not GRCh37
all(versions["assembly"] == "GRCh38")

True

## Create permalinks based on archive version

**Not all of these versions have an archive.** We can go back to the closest previous archive for these but the link isn't guaranteed to work.
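
As a standalone sketch of the fallback, the lookup below returns the newest archived release that is not newer than the requested one, then builds the permalink from it. The function names and the two-entry `ARCHIVE_URLS` table are illustrative. Only the rows for releases 80 and 111 are included, so intermediate releases would fall back further here than they do in the notebook.

```python
from bisect import bisect_right

# Illustrative subset of the archive table: release -> archive URL
ARCHIVE_URLS = {
    80: "https://may2015.archive.ensembl.org",
    111: "https://jan2024.archive.ensembl.org",
}


def closest_archived_release(release, archived):
    """Return the newest archived release that is <= `release`.

    `archived` must be sorted in ascending order.
    """
    pos = bisect_right(archived, release)
    if pos == 0:
        raise ValueError(f"No archive at or before release {release}")
    return archived[pos - 1]


def permalink(gene_id, release):
    """Build an archive link for `gene_id` using the closest archived release."""
    closest = closest_archived_release(release, sorted(ARCHIVE_URLS))
    return ARCHIVE_URLS[closest] + "/Homo_sapiens/Gene/Summary?db=core;g=" + gene_id


print(permalink("ENSG00000238909", 84))
```

Raising an error when no earlier archive exists makes the gap visible instead of silently producing an empty link.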

In [19]:
archive_table = pd.read_csv(archive_filename)

# Remove GRCh37 from the archive list
archive_table = archive_table[archive_table["version"] != "GRCh37"].reset_index()

archive_table["numeric_version"] = archive_table["version"].astype(int)


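def closest_release(release, archive_table):
    # Compare against the column's values; `release in archive_table` would test column labels instead
    if release in archive_table["numeric_version"].values:
        return release

    # Otherwise fall back to the newest archived release that is older than this one
    return max([V for V in archive_table["numeric_version"] if V <= release])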

In [20]:
versions["closest_release"] = 0

releases = versions["release"].drop_duplicates().astype(int)

# Only have to call closest_release once per version, instead of >70k times
for release in releases:
    versions.loc[
        versions["release"] == str(release), "closest_release"
    ] = closest_release(release, archive_table)

versions.groupby("closest_release").size()

closest_release
80       915
95        33
96        31
97        18
98         9
99         7
100       22
101        8
102       16
103       15
104       19
105        9
106       34
107       10
108        4
109        4
110       11
111    63843
dtype: int64

In [21]:
versions["permalink"] = ""

for i in versions.index:
    match = archive_table["numeric_version"] == versions.at[i, "closest_release"]
    url = archive_table.loc[match, "url"].to_string(index=False)
    if len(url) > 0:
        versions.at[i, "permalink"] = (
            url + "/Homo_sapiens/Gene/Summary?db=core;g=" + versions.at[i, "id"]
        )

versions.head()

Unnamed: 0,assembly,peptide,release,latest,possible_replacement,version,id,type,is_current,closest_release,permalink
0,GRCh38,,111,ENSG00000210049.1,[],1,ENSG00000210049,Gene,1,111,https://jan2024.archive.ensembl.org/Homo_sapie...
1,GRCh38,,111,ENSG00000211459.2,[],2,ENSG00000211459,Gene,1,111,https://jan2024.archive.ensembl.org/Homo_sapie...
2,GRCh38,,111,ENSG00000210077.1,[],1,ENSG00000210077,Gene,1,111,https://jan2024.archive.ensembl.org/Homo_sapie...
3,GRCh38,,111,ENSG00000210082.2,[],2,ENSG00000210082,Gene,1,111,https://jan2024.archive.ensembl.org/Homo_sapie...
4,GRCh38,,111,ENSG00000209082.1,[],1,ENSG00000209082,Gene,1,111,https://jan2024.archive.ensembl.org/Homo_sapie...


In [22]:
versions[versions["closest_release"] < 100].head()

Unnamed: 0,assembly,peptide,release,latest,possible_replacement,version,id,type,is_current,closest_release,permalink
63180,GRCh38,,84,ENSG00000238909.1,[],1,ENSG00000238909,Gene,,80,https://may2015.archive.ensembl.org/Homo_sapie...
63181,GRCh38,,84,ENSG00000265155.1,[],1,ENSG00000265155,Gene,,80,https://may2015.archive.ensembl.org/Homo_sapie...
63183,GRCh38,,84,ENSG00000275447.1,[],1,ENSG00000275447,Gene,,80,https://may2015.archive.ensembl.org/Homo_sapie...
63184,GRCh38,,84,ENSG00000263623.1,[],1,ENSG00000263623,Gene,,80,https://may2015.archive.ensembl.org/Homo_sapie...
63190,GRCh38,,84,ENSG00000238644.1,[],1,ENSG00000238644,Gene,,80,https://may2015.archive.ensembl.org/Homo_sapie...


In [23]:
print(versions["permalink"][0])
print(versions["permalink"][25])

https://jan2024.archive.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000210049
https://jan2024.archive.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000210174


In [24]:
# Does every gene have an associated URL?
url_base_len = len(archive_table["url"][0]) + 1
all([len(url) > url_base_len for url in versions["permalink"]])

True

# Part 4: Add permalinks to the gene table

In [25]:
# Rename on the new selection rather than in place, which avoids a SettingWithCopyWarning
versions = versions[["id", "release", "possible_replacement", "permalink"]].rename(
    columns={"id": "ensembl_gene_id", "release": "ensembl_release"}
)

gene_table_merged = pd.merge(
    left=gene_table_merged,
    right=versions,
    how="left",
    on="ensembl_gene_id",
    validate="one_to_one",
)

print(gene_table_merged.shape)
gene_table_merged.head()

(65008, 14)


Unnamed: 0,ensembl_gene_id,chromosome_name,hgnc_symbol,_id,_version,name,symbol,type_of_gene,alias,summary,notfound,ensembl_release,possible_replacement,permalink
0,ENSG00000210049,MT,MT-TF,4558,2.0,tRNA-Phe,TRNF,tRNA,[],,,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...
1,ENSG00000211459,MT,MT-RNR1,4549,2.0,s-rRNA,RNR1,rRNA,[MTRNR1],Enables DNA binding activity and DNA-binding t...,,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...
2,ENSG00000210077,MT,MT-TV,4577,2.0,tRNA-Val,TRNV,tRNA,[MTTV],,,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...
3,ENSG00000210082,MT,MT-RNR2,4550,2.0,l-rRNA,RNR2,rRNA,[MTRNR2],Enables G protein-coupled receptor binding act...,,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...
4,ENSG00000209082,MT,MT-TL1,4567,2.0,tRNA-Leu,TRNL1,tRNA,[MTTL1],Implicated in cardiomyopathy. [provided by All...,,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...


### Final cleanup
Unfilled "possible_replacement" entries should be changed from NaN to empty lists. 

"possible_replacement" entries that have data in them exist as a list of dicts, and need to have the Ensembl IDs pulled out of them as a list of strings. 

Remove unneeded columns. 
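
The first two steps can be expressed as one small function. This is a sketch that assumes each populated entry is a list of dicts carrying a `stable_id` key, as the cell below expects. The helper name `replacement_ids` and the sample ID are made up for illustration.

```python
def replacement_ids(value):
    """Return the replacement Ensembl IDs as a list of strings."""
    # NaN or any other non-list value means there are no replacements
    if not isinstance(value, list):
        return []
    return [entry["stable_id"] for entry in value]


print(replacement_ids(float("nan")))
print(replacement_ids([]))
print(replacement_ids([{"stable_id": "ENSG00000000001"}]))  # made-up ID
```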

In [26]:
for row in gene_table_merged.loc[
    gene_table_merged["possible_replacement"].isnull(), "possible_replacement"
].index:
    gene_table_merged.at[row, "possible_replacement"] = []

gene_table_merged["possible_replacement"] = gene_table_merged.apply(
    lambda row: row["possible_replacement"]
    if len(row["possible_replacement"]) == 0
    else [x["stable_id"] for x in row["possible_replacement"]],
    axis=1,
)

gene_table_merged = gene_table_merged[
    [
        "ensembl_gene_id",
        "name",
        "alias",
        "summary",
        "symbol",
        "type_of_gene",
        "ensembl_release",
        "possible_replacement",
        "permalink",
    ]
]

gene_table_merged

Unnamed: 0,ensembl_gene_id,name,alias,summary,symbol,type_of_gene,ensembl_release,possible_replacement,permalink
0,ENSG00000210049,tRNA-Phe,[],,TRNF,tRNA,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...
1,ENSG00000211459,s-rRNA,[MTRNR1],Enables DNA binding activity and DNA-binding t...,RNR1,rRNA,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...
2,ENSG00000210077,tRNA-Val,[MTTV],,TRNV,tRNA,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...
3,ENSG00000210082,l-rRNA,[MTRNR2],Enables G protein-coupled receptor binding act...,RNR2,rRNA,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...
4,ENSG00000209082,tRNA-Leu,[MTTL1],Implicated in cardiomyopathy. [provided by All...,TRNL1,tRNA,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...
...,...,...,...,...,...,...,...,...,...
65003,ENSG00000276518,putative killer cell immunoglobulin-like recep...,"[LOC128966731, LOC128966733, LOC128966730, LOC...",,LOC128966722,protein-coding,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...
65004,ENSG00000230417,long intergenic non-protein coding RNA 595,"[LINC00595, C10orf101]",,LINC00595,ncRNA,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...
65005,ENSG00000278903,uncharacterized LOC124905527,"[LOC124905468, LOC124905312]",,LOC124905527,ncRNA,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...
65006,ENSG00000230373,"golgin A6 family like 3, pseudogene","[GOLGA6L17P, GOLGA6L21P, GOLGA6L3]",,GOLGA6L3P,pseudo,111,[],https://jan2024.archive.ensembl.org/Homo_sapie...


### Write to a file
This will get uploaded to Synapse as [syn25953363](https://www.synapse.org/#!Synapse:syn25953363).

In [27]:
gene_table_merged = gene_table_merged.sort_values(by="ensembl_gene_id").reset_index(
    drop=True
)
gene_table_merged.to_feather("../../output/gene_table_merged_GRCh38.p14.feather")