# Preprocess Gene Annotations

This notebook creates a table of gene annotations by:
1. Querying Biomart for all Ensembl IDs in the database
2. Querying MyGene for annotation about those IDs
3. Querying Ensembl for the most recent Ensembl release for each ID
4. Building a permalink to the Ensembl archive page for each ID

This gene annotation table is read in by `agoradataprocessing/process.py` to be used in the `gene_info` transformation. 

***Note:*** *This notebook is exploratory and should eventually be converted to a Python script that is run through an automated process.*

## Installation requirements

#### Linux / Windows / Mac

Install R: https://cran.r-project.org/

Install Python and agora-data-tools following the instructions in this repository's README. This notebook assumes it is being run from the same `pipenv` virtual environment as agora-data-tools. 

Then install the following packages using `pip`:
```
pip install rpy2 mygene
```

#### Note for Macs with M1 chips (2020 and newer)

Install as above, but make sure that your R installation is the arm64 version (R-4.X.X-arm64.pkg) so that the architecture matches what pip is using. 
You may also need to install an older version of `rpy2` on the Mac:
```
pip install rpy2==3.5.12
```
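
To check which architecture the active Python interpreter reports before installing R, a quick sketch like the following can help. It uses only the standard library; the expectation that a native Apple silicon build reports `arm64` (and a build running under Rosetta reports `x86_64`) is an assumption about typical setups, not something this notebook verifies.

```python
import platform
import sys

# Architecture string reported by the running interpreter,
# e.g. "arm64" or "x86_64" on a Mac
machine = platform.machine()

print("Python architecture:", machine)
print("Python version:", sys.version.split()[0])
```

If the reported architecture does not match the R installer you downloaded, `rpy2` may fail to load R.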

In [None]:
from rpy2.robjects import r
import pandas as pd
import mygene
import numpy as np
import requests
import agoradatatools.etl.utils as utils
import agoradatatools.etl.extract as extract
import preprocessing_utils

r(
    'if (!require("BiocManager", character.only = TRUE)) { install.packages("BiocManager") }'
)
r('if (!require("biomaRt")) { BiocManager::install("biomaRt") }')

r.library("biomaRt")

ensembl_ids_filename = "../../output/ensembl_id_list.txt"
archive_filename = "../../output/ensembl_archive_list.csv"
config_filename = "../../../../config.yaml"

# Part 1: Get gene annotation data

## [Deprecated] Query Biomart for a list of all Ensembl IDs in the database of human genes. 

Here we use the R library `biomaRt`. There is no canonical Python library with the features we need for this script. 

*We no longer get all genes from BioMart, so this section is unused. The code is here in case we need it again.*

In [None]:
"""
ensembl_ids_df = preprocessing_utils.r_query_biomart()
ensembl_ids_df = preprocessing_utils.filter_hasgs(
    df=ensembl_ids_df, chromosome_name_column="chromosome_name"
)
print(str(ensembl_ids_df.shape[0]) + " genes remaining after HASG filtering.")
"""

## Get Ensembl IDs from data sets that will be processed by agora-data-tools

Loop through all data sets in the config file to get all Ensembl IDs used in every data set.

In [3]:
config = utils._get_config(config_path=config_filename)
datasets = config["datasets"]

files = {}

for dataset in datasets:
    dataset_name = list(dataset.keys())[0]

    for entity in dataset[dataset_name]["files"]:
        entity_id = entity["id"]
        entity_format = entity["format"]
        entity_name = entity["name"]

        # Ignore json files, which are post-processed and not what we're interested in.
        # Also ignore "gene_metadata" since that's the file we're making here.
        if entity_format != "json" and entity_name != "gene_metadata":
            files[entity_name] = (entity_id, entity_format)

# There are some duplicate synIDs in this list, but that doesn't really matter
files

{'genes_biodomains': ('syn44151254.5', 'csv'),
 'neuropath_regression_results': ('syn22017882.5', 'csv'),
 'proteomics': ('syn18689335.3', 'csv'),
 'proteomics_tmt': ('syn35221005.2', 'csv'),
 'proteomics_srm': ('syn52579640.4', 'csv'),
 'target_exp_validation_harmonized': ('syn24184512.9', 'csv'),
 'metabolomics': ('syn26064497.1', 'feather'),
 'igap': ('syn12514826.5', 'csv'),
 'eqtl': ('syn12514912.3', 'csv'),
 'diff_exp_data': ('syn27211942.1', 'tsv'),
 'target_list': ('syn12540368.47', 'csv'),
 'median_expression': ('syn27211878.2', 'csv'),
 'druggability': ('syn13363443.11', 'csv'),
 'tep_adi_info': ('syn51942280.2', 'csv'),
 'team_info': ('syn12615624.18', 'csv'),
 'team_member_info': ('syn12615633.18', 'csv'),
 'overall_scores': ('syn25575156.13', 'table'),
 'networks': ('syn11685347.1', 'csv')}

### We should now have a list of all raw data files ingested. Get each one and create a list of IDs.
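
As an offline illustration of this step, the sketch below pulls IDs out of toy tables that store them under different column names. The `collect_ids` helper and the toy tables (plain dicts of columns rather than data frames) are hypothetical and are not part of agora-data-tools. Unlike the notebook cell, the helper returns the first matching column instead of the last.

```python
# Candidate column names, mirroring the ones used in this notebook
COL_NAMES = ["ENSG", "ensembl_gene_id", "GeneID", "ensembl_id"]


def collect_ids(table, col_names=COL_NAMES):
    """Return non-missing IDs from the first matching column, or None if no column matches."""
    for col in col_names:
        if col in table:
            # Skip missing values and the "n/A" placeholder
            return [v for v in table[col] if v is not None and v != "n/A"]
    return None


# Toy tables (hypothetical): one keyed by "GeneID", one with no ID column
toy_a = {"GeneID": ["ENSG00000164972", None, "n/A", "ENSG00000169105"]}
toy_b = {"team": ["A", "B"]}

ids_a = collect_ids(toy_a)
ids_b = collect_ids(toy_b)
print(ids_a)
print(ids_b)
```

Returning `None` for a table with no ID column keeps "no column found" distinct from "column found but empty", which is the same distinction the warning in the cell below relies on.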

In [4]:
syn = utils._login_to_synapse(
    token=None
)  # Assumes you have already logged in with a valid token

# The various column names used to store Ensembl IDs in the files
col_names = ["ENSG", "ensembl_gene_id", "GeneID", "ensembl_id"]
file_ensembl_list = []

for file in files.keys():
    df = extract.get_entity_as_df(syn_id=files[file][0], source=files[file][1], syn=syn)

    file_ensembl_ids = None

    for C in col_names:
        if C in df.columns:
            file_ensembl_ids = df[C]

    # networks file is a special case
    if file == "networks":
        file_ensembl_ids = pd.melt(
            df[["geneA_ensembl_gene_id", "geneB_ensembl_gene_id"]]
        )["value"]

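    if file_ensembl_ids is not None:
        file_id_list = file_ensembl_ids.tolist()
        if "n/A" in file_id_list:
            print(file + " has an n/A Ensembl ID")
            # Drop every "n/A" placeholder, not only the first occurrence
            file_id_list = [x for x in file_id_list if x != "n/A"]
        # isna() also works on NumPy 2.x, where the np.NaN alias was removed
        if file_ensembl_ids.isna().any():
            print(file + " has an NaN Ensembl ID")
        file_ensembl_list = file_ensembl_list + file_id_list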
    else:
        print("WARNING: no Ensembl ID column found for " + file + "!")


genes_biodomains has an NaN Ensembl ID


In [5]:
file_ensembl_list = list(set(file_ensembl_list))

ensembl_ids_df = pd.DataFrame({"ensembl_gene_id": file_ensembl_list})

""" Removed due to no longer getting genes from BioMart, but saving code
# Add Ensembl IDs that are in the files but not in the biomart result
missing = set(file_ensembl_list) - set(ensembl_ids_df["ensembl_gene_id"])
print(
    str(len(missing))
    + " genes from the data files are missing from Biomart results and will be added."
)

missing_df = pd.DataFrame({"ensembl_gene_id": list(missing), "chromosome_name": ""})
ensembl_ids_df = pd.concat([ensembl_ids_df, missing_df])
"""

ensembl_ids_df = ensembl_ids_df.dropna(subset=["ensembl_gene_id"])
print(len(ensembl_ids_df))

37452


In [6]:
# Write to a file to save the list of IDs
ensembl_ids_df.to_csv(
    path_or_buf=ensembl_ids_filename, sep="\t", header=False, index=False
)

## Get info on each gene from mygene

In [7]:
mg = mygene.MyGeneInfo()

mygene_output = mg.getgenes(
    ensembl_ids_df["ensembl_gene_id"],
    fields=["symbol", "name", "summary", "type_of_gene", "alias"],
    as_dataframe=True,
)

mygene_output.index.rename("ensembl_gene_id", inplace=True)
mygene_output.head()

INFO:biothings.client:querying 1-1000...
INFO:biothings.client:done.
INFO:biothings.client:querying 1001-2000...
INFO:biothings.client:done.
INFO:biothings.client:querying 2001-3000...
INFO:biothings.client:done.
INFO:biothings.client:querying 3001-4000...
INFO:biothings.client:done.
INFO:biothings.client:querying 4001-5000...
INFO:biothings.client:done.
INFO:biothings.client:querying 5001-6000...
INFO:biothings.client:done.
INFO:biothings.client:querying 6001-7000...
INFO:biothings.client:done.
INFO:biothings.client:querying 7001-8000...
INFO:biothings.client:done.
INFO:biothings.client:querying 8001-9000...
INFO:biothings.client:done.
INFO:biothings.client:querying 9001-10000...
INFO:biothings.client:done.
INFO:biothings.client:querying 10001-11000...
INFO:biothings.client:done.
INFO:biothings.client:querying 11001-12000...
INFO:biothings.client:done.
INFO:biothings.client:querying 12001-13000...
INFO:biothings.client:done.
INFO:biothings.client:querying 13001-14000...
INFO:biothings

| ensembl_gene_id | _id | _version | alias | name | summary | symbol | type_of_gene | notfound |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENSG00000164972 | 84688 | 2.0 | [C9orf24, CBE1, NYD-SP22, SMRP1, bA573M23.4] | sperm microtubule inner protein 6 | This gene encodes a nuclear- or perinuclear-lo... | SPMIP6 | protein-coding | |
| ENSG00000169105 | 113189 | 2.0 | [ATCS, D4ST1, EDSMC1, HNK1ST] | carbohydrate sulfotransferase 14 | This gene encodes a member of the HNK-1 family... | CHST14 | protein-coding | |
| ENSG00000255136 | ENSG00000255136 | 1.0 | | TPBGL antisense RNA 1 | | TPBGL-AS1 | | |
| ENSG00000105499 | 8605 | 1.0 | CPLA2-gamma | phospholipase A2 group IVC | This gene encodes a protein which is a member ... | PLA2G4C | protein-coding | |
| ENSG00000104611 | 63898 | 1.0 | [PPP1R38, SH2A] | SH2 domain containing 4A | Enables phosphatase binding activity. Located ... | SH2D4A | protein-coding | |


In [8]:
print("Annotations found for " + str(sum(mygene_output["notfound"].isna())) + " genes.")
print(
    "No annotations found for "
    + str(sum(mygene_output["notfound"] == True))
    + " genes."
)

Annotations found for 36284 genes.
No annotations found for 1175 genes.


# Part 2: Clean the data

## Join and standardize columns / values

For consistency with the `agora-data-tools` transform process, this uses the etl standardize functions.

In [9]:
gene_table_merged = pd.merge(
    left=ensembl_ids_df,
    right=mygene_output,
    how="left",
    on="ensembl_gene_id",
    validate="one_to_many",  # left IDs are unique; MyGene may return several rows per ID
)

gene_table_merged = utils.standardize_column_names(gene_table_merged)
gene_table_merged = utils.standardize_values(gene_table_merged)

print(gene_table_merged.shape)
gene_table_merged.head()

(37459, 9)


| | ensembl_gene_id | _id | _version | alias | name | summary | symbol | type_of_gene | notfound |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ENSG00000164972 | 84688 | 2.0 | [C9orf24, CBE1, NYD-SP22, SMRP1, bA573M23.4] | sperm microtubule inner protein 6 | This gene encodes a nuclear- or perinuclear-lo... | SPMIP6 | protein-coding | |
| 1 | ENSG00000169105 | 113189 | 2.0 | [ATCS, D4ST1, EDSMC1, HNK1ST] | carbohydrate sulfotransferase 14 | This gene encodes a member of the HNK-1 family... | CHST14 | protein-coding | |
| 2 | ENSG00000255136 | ENSG00000255136 | 1.0 | | TPBGL antisense RNA 1 | | TPBGL-AS1 | | |
| 3 | ENSG00000105499 | 8605 | 1.0 | CPLA2-gamma | phospholipase A2 group IVC | This gene encodes a protein which is a member ... | PLA2G4C | protein-coding | |
| 4 | ENSG00000104611 | 63898 | 1.0 | [PPP1R38, SH2A] | SH2 domain containing 4A | Enables phosphatase binding activity. Located ... | SH2D4A | protein-coding | |


## Fix alias field

Fix `NaN` values in the `alias` field and make sure every alias value is a list, not a string.
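
The normalization described above can be sketched on single values without a data frame. `normalize_alias` is a hypothetical helper written for illustration. It de-duplicates with `dict.fromkeys`, which keeps first-seen order, whereas the notebook cell uses `set`, which does not guarantee any order.

```python
import math


def normalize_alias(value):
    """Return a de-duplicated, flat list of aliases for one cell."""
    # Missing values (None or float NaN) become an empty list
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    # A bare string becomes a one-element list
    if not isinstance(value, list):
        value = [value]
    flat = []
    for item in value:
        # Nested lists are flattened by one level
        if isinstance(item, list):
            flat.extend(item)
        else:
            flat.append(item)
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(flat))


print(normalize_alias(float("nan")))                   # []
print(normalize_alias("CPLA2-gamma"))                  # ['CPLA2-gamma']
print(normalize_alias([["PPP1R38", "SH2A"], "SH2A"]))  # ['PPP1R38', 'SH2A']
```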

In [10]:
# NaN or NULL alias values become empty lists
for row in gene_table_merged.loc[gene_table_merged["alias"].isnull(), "alias"].index:
    gene_table_merged.at[row, "alias"] = []

# Some alias values are a single string, not a list. Turn them into lists here.
gene_table_merged["alias"] = gene_table_merged["alias"].apply(
    lambda cell: cell if isinstance(cell, list) else [cell]
)


# Some alias values are lists of lists or have duplicate values
def flatten(row):
    flattened = []
    for item in row:
        if isinstance(item, list):
            flattened = flattened + item
        else:
            flattened.append(item)
    return flattened


gene_table_merged["alias"] = gene_table_merged["alias"].apply(
    lambda row: list(set(flatten(row)))
)

## Remove duplicate Ensembl IDs from the list. 

Duplicates in the list typically have the same Ensembl ID but different gene symbols. This usually happens when a single Ensembl ID maps to multiple Entrez IDs in the NCBI database. There is no good way to reconcile this, so we first look for entries whose `symbol` is something other than "LOC#######" and designate such an entry as the main row. If multiple entries or no entries meet that criterion, we use the first entry in the list for each Ensembl ID and discard the rest, which is what the Agora front end does. The gene symbols and aliases of the discarded rows are then added as aliases to the row that is kept.
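
A minimal sketch of this rule on plain dictionaries, assuming each row has a `symbol` string and an `alias` list. `collapse_duplicates` is a hypothetical helper, and the toy rows are a shortened version of the ENSG00000276387 entries in this notebook.

```python
def collapse_duplicates(rows):
    """Collapse rows that share one Ensembl ID into a single row."""
    # Rows whose symbol is not a "LOC..." placeholder sort first; sorted() is
    # stable, so the original order is kept within each group
    ordered = sorted(rows, key=lambda row: row["symbol"].startswith("LOC"))
    primary = dict(ordered[0])
    aliases = list(primary["alias"])
    # Symbols and aliases of the discarded rows become aliases of the kept row
    for other in ordered[1:]:
        aliases.append(other["symbol"])
        aliases.extend(other["alias"])
    # Remove duplicate aliases while keeping first-seen order
    primary["alias"] = list(dict.fromkeys(aliases))
    return primary


rows = [
    {"symbol": "LOC124900571", "alias": []},
    {"symbol": "KIR2DL1", "alias": ["NKAT1", "CD158A"]},
]
merged = collapse_duplicates(rows)
print(merged)
```

Because the sort is stable, a group in which every symbol starts with "LOC" (or none does) simply keeps its first row, matching the fallback described above.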

In [11]:
# duplicated() will return True if the ID is a duplicate and is not the first one to appear in the list.
dupes = gene_table_merged["ensembl_gene_id"].duplicated()
dupe_vals = gene_table_merged[dupes]

# Rows with duplicated Ensembl IDs
all_duplicated = gene_table_merged.loc[
    gene_table_merged["ensembl_gene_id"].isin(dupe_vals["ensembl_gene_id"])
]
all_duplicated

| | ensembl_gene_id | _id | _version | alias | name | summary | symbol | type_of_gene | notfound |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6011 | ENSG00000276518 | 128966722 | 1.0 | [] | putative killer cell immunoglobulin-like recep... | | LOC128966722 | protein-coding | |
| 6012 | ENSG00000276518 | 128966732 | 1.0 | [] | putative killer cell immunoglobulin-like recep... | | LOC128966732 | protein-coding | |
| 6013 | ENSG00000276518 | 128966730 | 1.0 | [] | putative killer cell immunoglobulin-like recep... | | LOC128966730 | protein-coding | |
| 6014 | ENSG00000276518 | 128966731 | 1.0 | [] | putative killer cell immunoglobulin-like recep... | | LOC128966731 | protein-coding | |
| 6015 | ENSG00000276518 | 128966733 | 1.0 | [] | putative killer cell immunoglobulin-like recep... | | LOC128966733 | protein-coding | |
| 12139 | ENSG00000230373 | 100133220 | 1.0 | [GOLGA6L3] | golgin A6 family like 3, pseudogene | | GOLGA6L3P | pseudo | |
| 12140 | ENSG00000230373 | 642402 | 1.0 | [GOLGA6L21P] | golgin A6 family like 17, pseudogene | | GOLGA6L17P | pseudo | |
| 23329 | ENSG00000276387 | 124900571 | 1.0 | [] | killer cell immunoglobulin-like receptor 2DS1 | | LOC124900571 | protein-coding | |
| 23330 | ENSG00000276387 | 3802 | 2.0 | [NKAT1, KIR2DL3, NKAT, KIR221, CD158A, p58.1, ... | killer cell immunoglobulin like receptor, two ... | Killer cell immunoglobulin-like receptors (KIR... | KIR2DL1 | protein-coding | |
| 31304 | ENSG00000249738 | 285626 | 1.0 | [] | uncharacterized LOC285626 | | LOC285626 | ncRNA | |


In [12]:
non_dupes = set(gene_table_merged.index) - set(all_duplicated.index)
keep_df = gene_table_merged.loc[list(non_dupes)].copy(deep=True)

# For each duplicated Ensembl ID, collapse to 1 row and append that row to keep_df
for ens_id in set(all_duplicated["ensembl_gene_id"]):
    group = all_duplicated.loc[all_duplicated["ensembl_gene_id"] == ens_id].copy(
        deep=True
    )
    # Put any entries with symbols that aren't "LOC#####" at the top of the data frame
    matches = group["symbol"].str.startswith("LOC") == False
    group = pd.concat([group.loc[matches], group.loc[matches == False]]).reset_index(
        drop=True
    )

    # Add all duplicate symbols and their aliases to the alias field of the first entry
    for row in group.index[1:]:
        group.at[group.index[0], "alias"].append(group["symbol"][row])
        if len(group.at[row, "alias"]) > 0:
            group.at[group.index[0], "alias"] = (
                group.at[group.index[0], "alias"] + group["alias"][row]
            )

    # Make sure we didn't add duplicate aliases
    group.at[group.index[0], "alias"] = list(set(group.at[group.index[0], "alias"]))

    # Keep the first row only, which now has all the aliases
    keep_df = pd.concat([keep_df, group.iloc[0].to_frame().T], ignore_index=True)

print(
    str(len(all_duplicated.drop_duplicates("ensembl_gene_id")))
    + " duplicated genes have been processed."
)
gene_table_merged = keep_df.reset_index(drop=True)
gene_table_merged.tail(n=10)

4 duplicated genes have been processed.


| | ensembl_gene_id | _id | _version | alias | name | summary | symbol | type_of_gene | notfound |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 37442 | ENSG00000163811 | 23160 | 1.0 | [NET12, UTP5] | WD repeat domain 43 | Enables RNA binding activity. Involved in posi... | WDR43 | protein-coding | |
| 37443 | ENSG00000226467 | 10554 | 1.0 | [G15, LPLAT1, 1-AGPAT1, LPAATA, LPAAT-alpha] | 1-acylglycerol-3-phosphate O-acyltransferase 1 | This gene encodes an enzyme that converts lyso... | AGPAT1 | protein-coding | |
| 37444 | ENSG00000120533 | 56943 | 1.0 | [Sus1, e(y)2, DC6] | ENY2 transcription and export complex 2 subunit | Enables nuclear receptor coactivator activity.... | ENY2 | protein-coding | |
| 37445 | ENSG00000214759 | ENSG00000214759 | 1.0 | [] | ribosomal protein L36a pseudogene 2 | | RPL36AP2 | | |
| 37446 | ENSG00000253981 | ENSG00000253981 | 1.0 | [] | ALG1 like 13, pseudogene | | ALG1L13P | | |
| 37447 | ENSG00000267206 | 158062 | 1.0 | [hLcn5, LCN5, UNQ643] | lipocalin 6 | Predicted to enable small molecule binding act... | LCN6 | protein-coding | |
| 37448 | ENSG00000276387 | 3802 | 2.0 | [NKAT1, LOC124900571, KIR2DL3, NKAT, KIR221, C... | killer cell immunoglobulin like receptor, two ... | Killer cell immunoglobulin-like receptors (KIR... | KIR2DL1 | protein-coding | |
| 37449 | ENSG00000276518 | 128966722 | 1.0 | [LOC128966730, LOC128966732, LOC128966731, LOC... | putative killer cell immunoglobulin-like recep... | | LOC128966722 | protein-coding | |
| 37450 | ENSG00000230373 | 100133220 | 1.0 | [GOLGA6L21P, GOLGA6L17P, GOLGA6L3] | golgin A6 family like 3, pseudogene | | GOLGA6L3P | pseudo | |
| 37451 | ENSG00000249738 | 285626 | 1.0 | [LOC105377683] | uncharacterized LOC285626 | | LOC285626 | ncRNA | |


# Part 3: Create Ensembl archive permalinks

## Get a table of Ensembl archive URLs

This is where we need to use the R biomaRt library specifically, instead of any of the available Python interfaces to Biomart, to get a table of Ensembl release versions and their corresponding archive URLs. 

In [13]:
archive_df = r.listEnsemblArchives()
archive_df.to_csvfile(path=archive_filename, row_names=False, quote=False)

print(archive_df)

             name     date                                 url version
1  Ensembl GRCh37 Feb 2014          https://grch37.ensembl.org  GRCh37
2     Ensembl 111 Jan 2024 https://jan2024.archive.ensembl.org     111
3     Ensembl 110 Jul 2023 https://jul2023.archive.ensembl.org     110
4     Ensembl 109 Feb 2023 https://feb2023.archive.ensembl.org     109
5     Ensembl 108 Oct 2022 https://oct2022.archive.ensembl.org     108
6     Ensembl 107 Jul 2022 https://jul2022.archive.ensembl.org     107
7     Ensembl 106 Apr 2022 https://apr2022.archive.ensembl.org     106
8     Ensembl 105 Dec 2021 https://dec2021.archive.ensembl.org     105
9     Ensembl 104 May 2021 https://may2021.archive.ensembl.org     104
10    Ensembl 103 Feb 2021 https://feb2021.archive.ensembl.org     103
11    Ensembl 102 Nov 2020 https://nov2020.archive.ensembl.org     102
12    Ensembl 101 Aug 2020 https://aug2020.archive.ensembl.org     101
13    Ensembl 100 Apr 2020 https://apr2020.archive.ensembl.org     100
14    

## Query Ensembl for each gene's version

Ensembl's REST API can only take 1000 genes at once, so this is looped to query groups of 1000. 
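
The batching itself can be sketched without any network access. The `batches` helper and the toy IDs below are hypothetical; the `{"id": [...]}` payload shape follows the request built in this notebook, and `json.dumps` is used here to serialize it.

```python
import json


def batches(items, size=1000):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Toy ID list (hypothetical values, 2500 entries)
toy_ids = ["ENSG" + str(n).zfill(11) for n in range(2500)]

batch_sizes = [len(batch) for batch in batches(toy_ids)]
print(batch_sizes)  # [1000, 1000, 500]

# Request body for one (very small) batch
payload = json.dumps({"id": toy_ids[:2]})
print(payload)  # {"id": ["ENSG00000000000", "ENSG00000000001"]}
```

Serializing with `json.dumps` avoids depending on how Python prints a list, which matters if an ID ever contains a quote character.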

In [14]:
url = "https://rest.ensembl.org/archive/id"
headers = {"Content-Type": "application/json", "Accept": "application/json"}

ids = gene_table_merged["ensembl_gene_id"].tolist()
print(len(ids))

# We can only query 1000 genes at a time
batch_ind = range(0, len(ids), 1000)
results = []

for B in batch_ind:
    end = min(len(ids), B + 1000)
    print("Querying genes " + str(B + 1) + " - " + str(end))

    # Let requests serialize the payload so the IDs are always sent as valid JSON
    request_data = {"id": ids[B:end]}

    ok = False
    tries = 0

    while tries < 5 and not ok:
        try:
            res = requests.post(url, headers=headers, json=request_data)
            ok = res.ok
        except requests.exceptions.RequestException:
            # Network-level failure (timeout, dropped connection, etc.)
            ok = False

        tries = tries + 1

        if not ok:
            # res.raise_for_status()
            print(
                "Error retrieving Ensembl versions for genes "
                + str(B + 1)
                + " - "
                + str(end)
                + ". Trying again..."
            )
        else:
            results = results + res.json()
            break

print(len(results))

versions = pd.json_normalize(results)

versions.tail()

37452
Querying genes 1 - 1000
Querying genes 1001 - 2000
Querying genes 2001 - 3000
Querying genes 3001 - 4000
Querying genes 4001 - 5000
Querying genes 5001 - 6000
Querying genes 6001 - 7000
Querying genes 7001 - 8000
Querying genes 8001 - 9000
Querying genes 9001 - 10000
Querying genes 10001 - 11000
Querying genes 11001 - 12000
Querying genes 12001 - 13000
Querying genes 13001 - 14000
Querying genes 14001 - 15000
Querying genes 15001 - 16000
Querying genes 16001 - 17000
Querying genes 17001 - 18000
Querying genes 18001 - 19000
Querying genes 19001 - 20000
Querying genes 20001 - 21000
Querying genes 21001 - 22000
Querying genes 22001 - 23000
Querying genes 23001 - 24000
Querying genes 24001 - 25000
Querying genes 25001 - 26000
Querying genes 26001 - 27000
Querying genes 27001 - 28000
Querying genes 28001 - 29000
Querying genes 29001 - 30000
Querying genes 30001 - 31000
Querying genes 31001 - 32000
Querying genes 32001 - 33000
Querying genes 33001 - 34000
Querying genes 34001 - 35000
Q

| | is_current | assembly | id | version | type | peptide | latest | possible_replacement | release |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 37447 | 1 | GRCh38 | ENSG00000267206 | 6 | Gene | | ENSG00000267206.6 | [] | 111 |
| 37448 | 1 | GRCh38 | ENSG00000276387 | 4 | Gene | | ENSG00000276387.4 | [] | 111 |
| 37449 | 1 | GRCh38 | ENSG00000276518 | 1 | Gene | | ENSG00000276518.1 | [] | 111 |
| 37450 | 1 | GRCh38 | ENSG00000230373 | 9 | Gene | | ENSG00000230373.9 | [] | 111 |
| 37451 | 1 | GRCh38 | ENSG00000249738 | 10 | Gene | | ENSG00000249738.10 | [] | 111 |


In [15]:
versions.groupby("release").size()

release
100       22
101        8
102       16
103       15
104       19
105        9
106       35
107       10
108        4
109        4
110       11
111    36286
80        21
81         2
82        10
84       673
87        61
89        20
91        75
93        53
95        33
96        31
97        18
98         9
99         7
dtype: int64

In [16]:
# Check that all IDs are the same between the result and the gene table
print(len(versions["id"]))
print(len(gene_table_merged))
print(
    all(versions["id"].isin(gene_table_merged["ensembl_gene_id"]))
    and all(gene_table_merged["ensembl_gene_id"].isin(versions["id"]))
)

37452
37452
True


In [17]:
# Make sure everything is GRCh38, not GRCh37
all(versions["assembly"] == "GRCh38")

True

## Create permalinks based on archive version

**Not all of these versions have an archive.** We can go back to the closest previous archive for these but the link isn't guaranteed to work.
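
The fallback can be illustrated with a small lookup, assuming the archive table is reduced to a mapping from release number to archive URL. The helper names are hypothetical; the three URLs and the permalink path are taken from this notebook's outputs.

```python
def closest_release(release, available):
    """Return `release` if it is archived, else the newest archived release before it."""
    candidates = [v for v in available if v <= release]
    if not candidates:
        raise ValueError(f"No archive at or before release {release}")
    return max(candidates)


def build_permalink(base_url, ensembl_id):
    """Build a gene summary link on a given archive site."""
    return base_url + "/Homo_sapiens/Gene/Summary?db=core;g=" + ensembl_id


# Subset of the archive table: release number -> archive URL
archives = {
    111: "https://jan2024.archive.ensembl.org",
    110: "https://jul2023.archive.ensembl.org",
    80: "https://may2015.archive.ensembl.org",
}

# Release 84 has no archive in this subset, so it falls back to release 80
release = closest_release(84, archives)
link = build_permalink(archives[release], "ENSG00000266701")
print(release)
print(link)
```

Raising an error when no earlier archive exists makes that case visible instead of silently producing an empty link.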

In [18]:
archive_table = pd.read_csv(archive_filename)

# Remove GRCh37 from the archive list
archive_table = archive_table[archive_table["version"] != "GRCh37"].reset_index()

archive_table["numeric_version"] = archive_table["version"].astype(int)


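def closest_release(release, archive_table):
    # Test membership against the column values; `in archive_table` would only
    # check the data frame's column names
    if release in archive_table["numeric_version"].values:
        return release

    return max([V for V in archive_table["numeric_version"] if V <= release])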

In [19]:
versions["closest_release"] = 0

releases = versions["release"].drop_duplicates().astype(int)

# Only have to call closest_release once per release, instead of once per row (>37k times)
for release in releases:
    versions.loc[versions["release"] == str(release), "closest_release"] = (
        closest_release(release, archive_table)
    )

versions.groupby("closest_release").size()

closest_release
80       915
95        33
96        31
97        18
98         9
99         7
100       22
101        8
102       16
103       15
104       19
105        9
106       35
107       10
108        4
109        4
110       11
111    36286
dtype: int64

In [20]:
versions["permalink"] = ""

for i in versions.index:
    match = archive_table["numeric_version"] == versions.at[i, "closest_release"]
    urls = archive_table.loc[match, "url"]
    # Take the URL value directly; to_string() is meant for display and does
    # not return an empty string when nothing matches
    if not urls.empty:
        versions.at[i, "permalink"] = (
            urls.iloc[0] + "/Homo_sapiens/Gene/Summary?db=core;g=" + versions.at[i, "id"]
        )

versions.head()

| | is_current | assembly | id | version | type | peptide | latest | possible_replacement | release | closest_release | permalink |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | GRCh38 | ENSG00000164972 | 14 | Gene | | ENSG00000164972.14 | [] | 111 | 111 | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 1 | 1 | GRCh38 | ENSG00000169105 | 8 | Gene | | ENSG00000169105.8 | [] | 111 | 111 | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 2 | 1 | GRCh38 | ENSG00000255136 | 3 | Gene | | ENSG00000255136.3 | [] | 111 | 111 | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 3 | 1 | GRCh38 | ENSG00000105499 | 14 | Gene | | ENSG00000105499.14 | [] | 111 | 111 | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 4 | 1 | GRCh38 | ENSG00000104611 | 12 | Gene | | ENSG00000104611.12 | [] | 111 | 111 | https://jan2024.archive.ensembl.org/Homo_sapie... |


In [21]:
versions[versions["closest_release"] < 100].head()

| | is_current | assembly | id | version | type | peptide | latest | possible_replacement | release | closest_release | permalink |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 51 | | GRCh38 | ENSG00000266701 | 1 | Gene | | ENSG00000266701.1 | [] | 84 | 80 | https://may2015.archive.ensembl.org/Homo_sapie... |
| 99 | | GRCh38 | ENSG00000268225 | 2 | Gene | | ENSG00000268225.2 | [] | 98 | 98 | https://sep2019.archive.ensembl.org/Homo_sapie... |
| 119 | | GRCh38 | ENSG00000281018 | 1 | Gene | | ENSG00000281018.1 | [] | 84 | 80 | https://may2015.archive.ensembl.org/Homo_sapie... |
| 120 | | GRCh38 | ENSG00000216011 | 2 | Gene | | ENSG00000216011.2 | [] | 84 | 80 | https://may2015.archive.ensembl.org/Homo_sapie... |
| 135 | | GRCh38 | ENSG00000264103 | 1 | Gene | | ENSG00000264103.1 | [] | 84 | 80 | https://may2015.archive.ensembl.org/Homo_sapie... |


In [22]:
print(versions["permalink"][0])
print(versions["permalink"][25])

https://jan2024.archive.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000164972
https://jul2023.archive.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000279049


In [23]:
# Does every gene have an associated URL?
url_base_len = len(archive_table["url"][0]) + 1
all([len(url) > url_base_len for url in versions["permalink"]])

True

# Part 4: Add permalinks to the gene table

In [24]:
# Chain rename() on the selection to avoid renaming a slice in place
versions = versions[["id", "release", "possible_replacement", "permalink"]].rename(
    columns={"id": "ensembl_gene_id", "release": "ensembl_release"}
)

gene_table_merged = pd.merge(
    left=gene_table_merged,
    right=versions,
    how="left",
    on="ensembl_gene_id",
    validate="one_to_one",
)

print(gene_table_merged.shape)
gene_table_merged.head()

(37452, 12)


| | ensembl_gene_id | _id | _version | alias | name | summary | symbol | type_of_gene | notfound | ensembl_release | possible_replacement | permalink |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ENSG00000164972 | 84688 | 2.0 | [SMRP1, C9orf24, CBE1, bA573M23.4, NYD-SP22] | sperm microtubule inner protein 6 | This gene encodes a nuclear- or perinuclear-lo... | SPMIP6 | protein-coding | | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 1 | ENSG00000169105 | 113189 | 2.0 | [ATCS, EDSMC1, HNK1ST, D4ST1] | carbohydrate sulfotransferase 14 | This gene encodes a member of the HNK-1 family... | CHST14 | protein-coding | | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 2 | ENSG00000255136 | ENSG00000255136 | 1.0 | [] | TPBGL antisense RNA 1 | | TPBGL-AS1 | | | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 3 | ENSG00000105499 | 8605 | 1.0 | [CPLA2-gamma] | phospholipase A2 group IVC | This gene encodes a protein which is a member ... | PLA2G4C | protein-coding | | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 4 | ENSG00000104611 | 63898 | 1.0 | [PPP1R38, SH2A] | SH2 domain containing 4A | Enables phosphatase binding activity. Located ... | SH2D4A | protein-coding | | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |


### Final cleanup
Unfilled "possible_replacement" entries should be changed from NaN to empty lists. 

"possible_replacement" entries that have data in them exist as a list of dicts, and need to have the Ensembl IDs pulled out of them as a list of strings. 

Remove unneeded columns. 
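
The conversion of `possible_replacement` values can be sketched on single cells. `replacement_ids` is a hypothetical helper; only the `stable_id` key comes from the notebook code, while the `score` key and the ID value in the toy entry are illustrative.

```python
def replacement_ids(entries):
    """Turn one possible_replacement cell into a list of stable ID strings."""
    # NaN (a float) or None means nothing was returned for this gene
    if entries is None or isinstance(entries, float):
        return []
    return [entry["stable_id"] for entry in entries]


# Toy cells: a missing value, an empty list, and one replacement entry
print(replacement_ids(float("nan")))  # []
print(replacement_ids([]))            # []
print(replacement_ids([{"stable_id": "ENSG00000000001", "score": 0.9}]))  # ['ENSG00000000001']
```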

In [25]:
for row in gene_table_merged.loc[
    gene_table_merged["possible_replacement"].isnull(), "possible_replacement"
].index:
    gene_table_merged.at[row, "possible_replacement"] = []

gene_table_merged["possible_replacement"] = gene_table_merged.apply(
    lambda row: (
        row["possible_replacement"]
        if len(row["possible_replacement"]) == 0
        else [x["stable_id"] for x in row["possible_replacement"]]
    ),
    axis=1,
)

gene_table_merged = gene_table_merged[
    [
        "ensembl_gene_id",
        "name",
        "alias",
        "summary",
        "symbol",
        "type_of_gene",
        "ensembl_release",
        "possible_replacement",
        "permalink",
    ]
]

gene_table_merged

| | ensembl_gene_id | name | alias | summary | symbol | type_of_gene | ensembl_release | possible_replacement | permalink |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ENSG00000164972 | sperm microtubule inner protein 6 | [SMRP1, C9orf24, CBE1, bA573M23.4, NYD-SP22] | This gene encodes a nuclear- or perinuclear-lo... | SPMIP6 | protein-coding | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 1 | ENSG00000169105 | carbohydrate sulfotransferase 14 | [ATCS, EDSMC1, HNK1ST, D4ST1] | This gene encodes a member of the HNK-1 family... | CHST14 | protein-coding | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 2 | ENSG00000255136 | TPBGL antisense RNA 1 | [] | | TPBGL-AS1 | | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 3 | ENSG00000105499 | phospholipase A2 group IVC | [CPLA2-gamma] | This gene encodes a protein which is a member ... | PLA2G4C | protein-coding | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 4 | ENSG00000104611 | SH2 domain containing 4A | [PPP1R38, SH2A] | Enables phosphatase binding activity. Located ... | SH2D4A | protein-coding | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 37447 | ENSG00000267206 | lipocalin 6 | [hLcn5, LCN5, UNQ643] | Predicted to enable small molecule binding act... | LCN6 | protein-coding | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 37448 | ENSG00000276387 | killer cell immunoglobulin like receptor, two ... | [NKAT1, LOC124900571, KIR2DL3, NKAT, KIR221, C... | Killer cell immunoglobulin-like receptors (KIR... | KIR2DL1 | protein-coding | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 37449 | ENSG00000276518 | putative killer cell immunoglobulin-like recep... | [LOC128966730, LOC128966732, LOC128966731, LOC... | | LOC128966722 | protein-coding | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |
| 37450 | ENSG00000230373 | golgin A6 family like 3, pseudogene | [GOLGA6L21P, GOLGA6L17P, GOLGA6L3] | | GOLGA6L3P | pseudo | 111 | [] | https://jan2024.archive.ensembl.org/Homo_sapie... |


### Write to a file
This will get uploaded to Synapse as [syn25953363](https://www.synapse.org/#!Synapse:syn25953363).

In [26]:
gene_table_merged = gene_table_merged.sort_values(by="ensembl_gene_id").reset_index(
    drop=True
)
gene_table_merged
gene_table_merged.to_feather("../../output/gene_table_merged_GRCh38.p14.feather")