# Preprocess Gene Annotations

This notebook creates a table of gene annotations by:
1. Querying Biomart for all Ensembl IDs in the database
2. Querying MyGene for annotation about those IDs
3. Querying Ensembl for the most recent Ensembl release for each ID
4. Building a permalink to the Ensembl archive page for each ID

This gene annotation table is read in by `agoradataprocessing/process.py` to be used in the `gene_info` transformation. 

***Note:*** *This notebook is exploratory and should eventually be converted to a Python script that is run through an automated process.*

## Installation requirements

#### Linux / Windows 

Install R: https://cran.r-project.org/

In an R console, execute the following commands:
```R
install.packages("BiocManager")
BiocManager::install("biomaRt")
```

Install Python following the instructions in this repository's README. Then install the following packages using `pip`: 
```
pip install rpy2 pandas numpy mygene
```

#### Mac

Follow the steps above, with two modifications for **Macs with an M1 chip** (2020 or newer):
1. Do not install the ARM version of R (e.g. R-4.X.X-arm64.pkg). Install the version for Intel Macs (e.g. R-4.X.X.pkg) instead.
2. Python libraries need to be built for the x86 architecture for `rpy2` to work. To do this with conda, use the following commands to create a conda environment with x86 packages:

```
CONDA_SUBDIR=osx-64 conda create -n agora_x86 python=3.9
conda activate agora_x86
pip install rpy2 pandas numpy mygene
```
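
To confirm that the activated environment really runs an x86 Python before installing `rpy2`, a quick check along these lines may help. This is a minimal sketch: it assumes that an x86 interpreter reports a value such as `x86_64` from `platform.machine()`, and the helper name `is_x86_python` is made up for illustration.

```python
import platform


def is_x86_python() -> bool:
    """Return True when the running interpreter reports an x86 architecture."""
    # platform.machine() reports the architecture the interpreter runs as,
    # for example 'x86_64' or 'arm64' on macOS and 'AMD64' on Windows.
    return platform.machine().lower() in ("x86_64", "amd64", "i386", "i686")


print(platform.machine(), is_x86_python())
```

If this prints `arm64`, the environment was most likely created without `CONDA_SUBDIR=osx-64`.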

In [1]:
from rpy2.robjects import r
import pandas as pd
import mygene
import numpy as np
import requests
import synapseclient
import agoradatatools.etl.transform as transform
import agoradatatools.etl.utils as utils
import agoradatatools.etl.extract as extract

r.library('biomaRt')

biomart_filename = '../output/biomart_ensg_list.txt'
archive_filename = '../output/ensembl_archive_list.csv'
config_filename = '../../config.yaml'

# Part 1: Get gene annotation data

## Query Biomart for a list of all Ensembl IDs in the database of human genes. 

Here we use the R library `biomaRt`. There is no canonical Python library with the features we need for this script. 

In [2]:
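# Sometimes Biomart doesn't respond and the command needs to be sent again. Try up to 5 times.
# Assumes rpy2 3.x, where R errors are raised as rpy2.rinterface_lib.embedded.RRuntimeError.
from rpy2.rinterface_lib.embedded import RRuntimeError

# Start with None so the check below still works if every attempt fails
ensembl_ids = None

for attempt in range(5):
    try:
        mart = r.useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl')

        ensembl_ids = r.getBM(attributes = 'ensembl_gene_id', mart = mart, useCache = False)

    except RRuntimeError as err:
        print(err)
        print('Trying again...')

    else:
        break

if ensembl_ids is None or ensembl_ids.nrow == 0: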
    print('Biomart was unresponsive after 5 attempts. Try again later.')

else:
    # Convert the ensembl_gene_id column from R object to a python list
    ensembl_ids = list(ensembl_ids.rx2('ensembl_gene_id'))

    print(ensembl_ids[0:5])
    print(str(len(ensembl_ids)) + " genes found.")

# Save biomart IDs in a separate variable for debugging, since ensembl_ids will get expanded below
ensembl_ids_biomart = ensembl_ids 

['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460']
69292 genes found.


## Add Ensembl IDs from data sets that will be processed by agora-data-tools

Some of these datasets have older/retired Ensembl IDs that no longer exist in the current Ensembl database. Loop through all data sets in the config file to get all Ensembl IDs they use, and add missing ones to the gene list. 
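
As a rough illustration of that lookup, the sketch below pulls Ensembl IDs out of a data frame whose ID column may have any of several names. The toy data frames and the helper name `find_ensembl_ids` are assumptions for illustration; the candidate column names are the ones this notebook checks for. Unlike the loop in the notebook, this version returns the first matching column and filters out missing values and the `"n/A"` placeholder immediately.

```python
import pandas as pd

# Column names that may hold Ensembl IDs (the same candidates the notebook checks)
CANDIDATE_COLUMNS = ['ENSG', 'ensembl_gene_id', 'GeneID']


def find_ensembl_ids(df, candidates=CANDIDATE_COLUMNS):
    """Return the IDs from the first candidate column present in df, or []."""
    for col in candidates:
        if col in df.columns:
            # Drop missing values and the "n/A" placeholder seen in some files
            ids = df[col].dropna()
            return ids[ids != "n/A"].tolist()
    return []


# Toy inputs: one frame with an ID column, one without
toy_a = pd.DataFrame({'ENSG': ['ENSG00000000003', 'n/A', None]})
toy_b = pd.DataFrame({'team': ['example']})

print(find_ensembl_ids(toy_a))
print(find_ensembl_ids(toy_b))
```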

In [3]:
config = utils._get_config(config_path = config_filename)
datasets = config[1]['datasets']

files = {}

for dataset in datasets:
    dataset_name = list(dataset.keys())[0]
    
    for entity in dataset[dataset_name]['files']:
        entity_id = entity['id']
        entity_format = entity['format']
        entity_name = entity['name']
        
        # Ignore json files, which are post-processed and not what we're interested in. 
        # Also ignore "gene_metadata" since that's the file we're making here.
        if entity_format != 'json' and entity_name != "gene_metadata":
            files[entity_name] = [entity_id, entity_format]

files

{'neuropath_regression_results': ['syn22017882', 'csv'],
 'agora_proteomics': ['syn18689335', 'csv'],
 'agora_proteomics_tmt': ['syn35221005', 'csv'],
 'target_exp_validation_harmonized': ['syn24184512', 'csv'],
 'srm_data': ['syn25454540', 'csv'],
 'metabolomics': ['syn26064497', 'feather'],
 'igap': ['syn12514826', 'csv'],
 'eqtl': ['syn12514912', 'csv'],
 'proteomics': ['syn18689335', 'csv'],
 'rna_expression_change': ['syn27211942', 'tsv'],
 'target_list': ['syn12540368', 'csv'],
 'median_expression': ['syn27211878', 'csv'],
 'druggability': ['syn13363443', 'csv'],
 'team_info': ['syn12615624', 'csv'],
 'team_member_info': ['syn12615633', 'csv'],
 'networks': ['syn11685347', 'csv'],
 'diff_exp_data': ['syn27211942', 'tsv'],
 'proteomics_tmt': ['syn35221005', 'csv']}

In [4]:
files_unique = {}
ids = []
for F in files.keys():
    if files[F][0] not in ids:
        ids.append(files[F][0])
        files_unique[F] = files[F]

print(len(files))
print(len(files_unique))

files = files_unique

18
15


### We should now have a list of all raw data files ingested. Get each one and add its Ensembl IDs to the gene list.

In [5]:
syn = utils._login_to_synapse(authtoken = None) # Assumes you have already logged in 

# The various column names used to store Ensembl IDs in the files
col_names = ['ENSG', 'ensembl_gene_id', 'GeneID']

for file in files.keys():
    df = extract.get_entity_as_df(syn_id=files[file][0],
                                  format=files[file][1],
                                  syn=syn)
    file_ensembl_ids = None
    
    for C in col_names:
        if C in df.columns:
            file_ensembl_ids = df[C]
    
    # networks file is a special case
    if file == "networks":
        file_ensembl_ids = pd.melt(df[['geneA_ensembl_gene_id','geneB_ensembl_gene_id']])['value']
        
    if file_ensembl_ids is not None:
        ensembl_ids = ensembl_ids + file_ensembl_ids.tolist()
        if "n/A" in file_ensembl_ids.tolist():
            print(file + " has an n/A Ensembl ID")
        # isna() detects missing values reliably; `np.NaN in list` depends on object identity
        if file_ensembl_ids.isna().any():
            print(file + " has an NaN Ensembl ID")


target_exp_validation_harmonized has an n/A Ensembl ID
eqtl has an NaN Ensembl ID


In [6]:
# Deduplicate and drop placeholder values: one data set has an Ensembl ID set to "n/A"
# and another has NaN. Filtering avoids list.remove(), which cannot reliably match NaN.
ensembl_ids = [gene_id for gene_id in set(ensembl_ids) if pd.notna(gene_id) and gene_id != "n/A"]

# Convert to a pandas data frame
ensembl_ids_df = pd.DataFrame({'ensembl_gene_id': ensembl_ids})
len(ensembl_ids_df)

70432

In [7]:
# Write to a file
ensembl_ids_df.to_csv(path_or_buf = biomart_filename, sep = '\t', header = False, index = False)

## Get info on each gene from mygene

In [8]:
mg = mygene.MyGeneInfo()

mygene_output = mg.getgenes(ensembl_ids_df['ensembl_gene_id'], 
                            fields=["symbol", "name", "summary", "type_of_gene", "alias"], 
                            as_dataframe=True)

mygene_output.index.rename("ensembl_gene_id", inplace=True)
mygene_output.head()

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
...

| ensembl_gene_id | _id | _version | alias | name | summary | symbol | type_of_gene | notfound |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENSG00000108823 | 6442 | 1.0 | [50DAG, ADL, DAG2, DMDA2, LGMD2D, LGMDR3, SCAR... | sarcoglycan alpha | This gene encodes a component of the dystrophi... | SGCA | protein-coding |  |
| ENSG00000160051 | 55721 | 1.0 |  | IQ motif containing C |  | IQCC | protein-coding |  |
| ENSG00000222635 | 106480092 | 1.0 |  | RNA, U6 small nuclear 1203, pseudogene |  | RNU6-1203P | pseudo |  |
| ENSG00000197321 | 6840 | 1.0 | MFM10 | supervillin | This gene encodes a bipartite protein with dis... | SVIL | protein-coding |  |
| ENSG00000257480 | 347894 | 1.0 |  | mitochondrial ribosomal protein L2 pseudogene 1 |  | MRPL2P1 | pseudo |  |


In [9]:
print("Annotations found for " + str(sum(mygene_output['notfound'].isna())) + " genes.")
print("No annotations found for " + str(sum(mygene_output['notfound'] == True)) + " genes.")

Annotations found for 70111 genes.
No annotations found for 1143 genes.


# Part 2: Clean the data

## Join and standardize columns / values

For consistency with the `agora-data-tools` transform process, this uses the etl standardize functions.

In [10]:
# This merge may not be strictly necessary: mygene should return at least one row for every gene queried,
# even if it can't find the gene in the database
gene_table_merged = pd.merge(left = ensembl_ids_df, right = mygene_output, how = 'left', on = 'ensembl_gene_id')

gene_table_merged = transform.standardize_column_names(gene_table_merged)
gene_table_merged = transform.standardize_values(gene_table_merged)

print(gene_table_merged.shape)
gene_table_merged.head()

(71254, 9)


|  | ensembl_gene_id | _id | _version | alias | name | summary | symbol | type_of_gene | notfound |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ENSG00000108823 | 6442 | 1.0 | [50DAG, ADL, DAG2, DMDA2, LGMD2D, LGMDR3, SCAR... | sarcoglycan alpha | This gene encodes a component of the dystrophi... | SGCA | protein-coding |  |
| 1 | ENSG00000160051 | 55721 | 1.0 |  | IQ motif containing C |  | IQCC | protein-coding |  |
| 2 | ENSG00000222635 | 106480092 | 1.0 |  | RNA, U6 small nuclear 1203, pseudogene |  | RNU6-1203P | pseudo |  |
| 3 | ENSG00000197321 | 6840 | 1.0 | MFM10 | supervillin | This gene encodes a bipartite protein with dis... | SVIL | protein-coding |  |
| 4 | ENSG00000257480 | 347894 | 1.0 |  | mitochondrial ribosomal protein L2 pseudogene 1 |  | MRPL2P1 | pseudo |  |


## Fix alias field

Fix `NaN` values in the `alias` field and make sure every alias value is a list, not a string.

In [11]:
# NaN or NULL alias values become empty lists
for row in gene_table_merged.loc[gene_table_merged['alias'].isnull(), 'alias'].index:
    gene_table_merged.at[row, 'alias'] = []

# Some alias values are a single string, not a list. Turn them into lists here.
gene_table_merged['alias'] = gene_table_merged['alias'].apply(lambda cell: cell if isinstance(cell, list) else [cell])
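
The effect of these two steps on individual values can be sketched on toy inputs as follows. The helper name `as_alias_list` is hypothetical and is not part of `agoradatatools`; it only illustrates the three cases the cell above handles.

```python
import math


def as_alias_list(value):
    """Normalize one alias cell: missing -> [], string -> [string], list -> list."""
    if isinstance(value, list):
        return value
    # Missing values arrive as None or as a float NaN
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    return [value]


print(as_alias_list(float('nan')))
print(as_alias_list('MFM10'))
print(as_alias_list(['ADL', 'DAG2']))
```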

## Remove duplicate Ensembl IDs from the list. 

Duplicates in the list typically have the same Ensembl ID but different gene symbols. This usually happens when a single Ensembl ID maps to multiple Entrez IDs in the NCBI database. There is no good way to reconcile this, so we keep the first entry for each Ensembl ID and discard the rest, which is what the Agora front end does. The gene symbols of the discarded duplicate rows are then added as aliases to the retained row.

In [12]:
# duplicated() returns True if the ID is a duplicate and is not the first one to appear in the list.
dupes = gene_table_merged['ensembl_gene_id'].duplicated()
dupe_vals = gene_table_merged[dupes]

# Rows with duplicated Ensembl IDs
gene_table_merged.loc[gene_table_merged['ensembl_gene_id'].isin(dupe_vals['ensembl_gene_id'])]

|  | ensembl_gene_id | _id | _version | alias | name | summary | symbol | type_of_gene | notfound |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 254 | ENSG00000260788 | 124903732 | 1.0 | [] | uncharacterized LOC124903732 |  | LOC124903732 | ncRNA |  |
| 255 | ENSG00000260788 | 105371366 | 1.0 | [] | uncharacterized LOC105371366 |  | LOC105371366 | ncRNA |  |
| 1628 | ENSG00000282767 | 124900566 | 1.0 | [] | U8 small nucleolar RNA |  | LOC124900566 | snoRNA |  |
| 1629 | ENSG00000282767 | 124900359 | 1.0 | [] | U8 small nucleolar RNA |  | LOC124900359 | snoRNA |  |
| 1634 | ENSG00000284507 | 124906466 | 1.0 | [] | double homeobox protein 4 |  | LOC124906466 | protein-coding |  |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71166 | ENSG00000283767 | 124906459 | 1.0 | [] | double homeobox protein 4 |  | LOC124906459 | protein-coding |  |
| 71167 | ENSG00000283767 | 124906463 | 1.0 | [] | double homeobox protein 4 |  | LOC124906463 | protein-coding |  |
| 71168 | ENSG00000283767 | 124906464 | 1.0 | [] | double homeobox protein 4 |  | LOC124906464 | protein-coding |  |
| 71169 | ENSG00000283767 | 124906460 | 1.0 | [] | double homeobox protein 4 |  | LOC124906460 | protein-coding |  |


In [13]:
# Remove duplicates from the list. reset_index() keeps the old row labels in a new 'index' column.
gene_table_merged = gene_table_merged[~dupes].reset_index()

# For each duplicate row, add its symbol as an alias
for row in dupe_vals.index:
    match = gene_table_merged['ensembl_gene_id'] == dupe_vals['ensembl_gene_id'][row]
    match_ind = gene_table_merged[match].index[0] # There should only be one row

    # Add the duplicate's symbol to the alias list
    gene_table_merged.at[match_ind, 'alias'].append(dupe_vals['symbol'][row])
    
    # Make sure we didn't add duplicate aliases
    gene_table_merged.at[match_ind, 'alias'] = list(set(gene_table_merged.at[match_ind, 'alias']))

print(gene_table_merged.shape)

# Printed out table should have unique Ensembl IDs with aliases properly added
gene_table_merged.loc[gene_table_merged['ensembl_gene_id'].isin(dupe_vals['ensembl_gene_id'])]

(70432, 10)


|  | index | ensembl_gene_id | _id | _version | alias | name | summary | symbol | type_of_gene | notfound |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 254 | 254 | ENSG00000260788 | 124903732 | 1.0 | [LOC105371366] | uncharacterized LOC124903732 |  | LOC124903732 | ncRNA |  |
| 1627 | 1628 | ENSG00000282767 | 124900566 | 1.0 | [LOC124900359] | U8 small nucleolar RNA |  | LOC124900566 | snoRNA |  |
| 1632 | 1634 | ENSG00000284507 | 124906466 | 1.0 | [LOC124905409, LOC124906461, LOC124906453, LOC... | double homeobox protein 4 |  | LOC124906466 | protein-coding |  |
| 3029 | 3048 | ENSG00000284210 | 124906466 | 1.0 | [LOC124905409, LOC124906461, LOC124906453, LOC... | double homeobox protein 4 |  | LOC124906466 | protein-coding |  |
| 4436 | 4472 | ENSG00000273624 | 124900632 | 1.0 | [LOC124905519, LOC124905327] | U6 spliceosomal RNA |  | LOC124900632 | snRNA |  |
| 5642 | 5680 | ENSG00000276367 | 124900356 | 1.0 | [LOC124900358, LOC124900359] | U8 small nucleolar RNA |  | LOC124900356 | snoRNA |  |
| 5685 | 5725 | ENSG00000283898 | 124906466 | 1.0 | [LOC124905409, LOC124906461, LOC124906453, LOC... | double homeobox protein 4 |  | LOC124906466 | protein-coding |  |
| 7859 | 7916 | ENSG00000284156 | 124906466 | 1.0 | [LOC124905409, LOC124906461, LOC124906453, LOC... | double homeobox protein 4 |  | LOC124906466 | protein-coding |  |
| 8572 | 8646 | ENSG00000278294 | 124908250 | 1.0 | [LOC124907156, LOC124907485] | 5.8S ribosomal RNA |  | LOC124908250 | rRNA |  |
| 9634 | 9710 | ENSG00000282177 | 124900357 | 1.0 | [LOC124900358, LOC124900359, LOC124900356] | U8 small nucleolar RNA |  | LOC124900357 | snoRNA |  |
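
The duplicate handling in this section can also be sketched on a toy table with `groupby`, which avoids the per-row loop. This is only an illustration: the toy rows are shortened copies of rows shown in this notebook, and the helper name `collapse_duplicates` is hypothetical.

```python
import pandas as pd


def collapse_duplicates(df):
    """Keep the first row per Ensembl ID and fold later symbols into its aliases."""
    first = df.drop_duplicates('ensembl_gene_id', keep='first').copy()
    dupes = df[df['ensembl_gene_id'].duplicated(keep='first')]

    # Symbols of the discarded rows, grouped by the Ensembl ID they belong to
    extra = dupes.groupby('ensembl_gene_id')['symbol'].agg(list)

    merged = [
        # dict.fromkeys removes repeated aliases while keeping first-seen order
        list(dict.fromkeys(aliases + extra.get(gene_id, [])))
        for gene_id, aliases in zip(first['ensembl_gene_id'], first['alias'])
    ]
    first['alias'] = pd.Series(merged, index=first.index)
    return first.reset_index(drop=True)


toy = pd.DataFrame({
    'ensembl_gene_id': ['ENSG00000260788', 'ENSG00000260788', 'ENSG00000197321'],
    'symbol': ['LOC124903732', 'LOC105371366', 'SVIL'],
    'alias': [[], [], ['MFM10']],
})

result = collapse_duplicates(toy)
print(result)
```

Unlike the `list(set(...))` call in the notebook, `dict.fromkeys` keeps the aliases in a stable order, which can make the output easier to compare between runs.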


# Part 3: Create Ensembl archive permalinks

## Get a table of Ensembl archive URLs

This is where we need to use the R biomaRt library specifically, instead of any of the available Python interfaces to Biomart, to get a table of Ensembl release versions and their corresponding archive URLs. 

In [14]:
archive_df = r.listEnsemblArchives()
archive_df.to_csvfile(path = archive_filename, row_names = False, quote = False)

print(archive_df)

             name     date                                 url version
1  Ensembl GRCh37 Feb 2014          https://grch37.ensembl.org  GRCh37
2     Ensembl 108 Oct 2022 https://oct2022.archive.ensembl.org     108
3     Ensembl 107 Jul 2022 https://jul2022.archive.ensembl.org     107
4     Ensembl 106 Apr 2022 https://apr2022.archive.ensembl.org     106
5     Ensembl 105 Dec 2021 https://dec2021.archive.ensembl.org     105
6     Ensembl 104 May 2021 https://may2021.archive.ensembl.org     104
7     Ensembl 103 Feb 2021 https://feb2021.archive.ensembl.org     103
8     Ensembl 102 Nov 2020 https://nov2020.archive.ensembl.org     102
9     Ensembl 101 Aug 2020 https://aug2020.archive.ensembl.org     101
10    Ensembl 100 Apr 2020 https://apr2020.archive.ensembl.org     100
11     Ensembl 99 Jan 2020 https://jan2020.archive.ensembl.org      99
12     Ensembl 98 Sep 2019 https://sep2019.archive.ensembl.org      98
13     Ensembl 97 Jul 2019 https://jul2019.archive.ensembl.org      97
...

## Query Ensembl for each gene's version

Ensembl's REST API can only take 1000 genes at once, so this is looped to query groups of 1000. 

*TODO: Gracefully handle HTTPError occurrences and try again with the same indices.*
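
One way to approach this TODO is to separate batching from retrying, so that a failed batch is re-sent with the same IDs before moving on. The sketch below is an illustration under stated assumptions, not the notebook's implementation: `chunked`, `post_with_retries`, `FakeResponse`, and `make_flaky_send` are hypothetical names, and the `send` callable stands in for something like `lambda batch: requests.post(url, headers=headers, json={"id": batch})`. A fake response class is used so the example runs without network access.

```python
import time


def chunked(items, size=1000):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def post_with_retries(send, batch, max_tries=5, wait_seconds=0.0):
    """Call send(batch) until the response is OK or max_tries is exhausted.

    `send` is any callable returning an object with `.ok` and `.json()`,
    which is the interface a requests.Response provides.
    """
    for attempt in range(1, max_tries + 1):
        response = send(batch)
        if response.ok:
            return response.json()
        print(f"Attempt {attempt} of {max_tries} failed for this batch.")
        time.sleep(wait_seconds)
    raise RuntimeError(f"Batch of {len(batch)} IDs failed after {max_tries} tries")


class FakeResponse:
    """Stand-in for requests.Response, used only to exercise the sketch offline."""

    def __init__(self, ok, payload):
        self.ok = ok
        self._payload = payload

    def json(self):
        return self._payload


def make_flaky_send(failures):
    """Return a send() that fails `failures` times, then echoes the batch."""
    state = {'calls': 0}

    def send(batch):
        state['calls'] += 1
        ok = state['calls'] > failures
        return FakeResponse(ok, [{'id': gene_id} for gene_id in batch])

    return send


ids = [f"ENSG{n:011d}" for n in range(2500)]
results = []
for batch in chunked(ids):
    # Each batch fails once and then succeeds, so every ID is still collected
    results += post_with_retries(make_flaky_send(failures=1), batch)

print(len(results))
```

Raising an error when a batch never succeeds makes a gap in the results visible immediately, instead of surfacing later as a length mismatch.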

In [15]:
url = "https://rest.ensembl.org/archive/id"
headers = {"Content-Type" : "application/json", "Accept" : "application/json"}

ids = gene_table_merged['ensembl_gene_id'].tolist()
print(len(ids))

# We can only query 1000 genes at a time
batch_ind = range(0, len(ids), 1000)
results = []

for B in batch_ind:
    end = min(len(ids), B + 1000)
    print("Querying genes " + str(B+1) + " - " + str(end))
    
    # Let requests serialize the payload so the JSON body is always valid
    request_data = {"id": ids[B:end]}
    
    ok = False
    tries = 0
    
    while tries < 5 and not ok:
        res = requests.post(url, headers=headers, json=request_data)
        ok = res.ok
        tries = tries + 1
        
        if not res.ok:
            #res.raise_for_status()
            print("Error retrieving Ensembl versions for genes " + str(B+1) + " - " + str(end) + 
                  ". Trying again...")
        else:
            results = results + res.json()
            break

print(len(results))

versions = pd.json_normalize(results)

versions.tail()

70432
Querying genes 1 - 1000
Querying genes 1001 - 2000
Querying genes 2001 - 3000
...

|  | version | possible_replacement | id | type | latest | is_current | assembly | release | peptide |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 70427 | 1 | [] | ENSG00000289495 | Gene | ENSG00000289495.1 | 1 | GRCh38 | 108 |  |
| 70428 | 1 | [] | ENSG00000287894 | Gene | ENSG00000287894.1 | 1 | GRCh38 | 108 |  |
| 70429 | 7 | [] | ENSG00000198526 | Gene | ENSG00000198526.7 | 1 | GRCh38 | 108 |  |
| 70430 | 15 | [] | ENSG00000105750 | Gene | ENSG00000105750.15 | 1 | GRCh38 | 108 |  |
| 70431 | 13 | [] | ENSG00000197586 | Gene | ENSG00000197586.13 | 1 | GRCh38 | 108 |  |


In [16]:
versions.groupby('release').size()

release
100       23
101        8
102       15
103       15
104       18
105        9
106       32
107       10
108    69292
80        21
81         1
82        10
84       673
87        61
89        20
91        75
93        53
95        33
96        31
97        18
98         8
99         6
dtype: int64

In [17]:
# Check that all IDs are the same between the result and the gene table
print(len(versions['id']))
print(len(gene_table_merged))
print(all(versions['id'].isin(gene_table_merged['ensembl_gene_id'])) and 
      all(gene_table_merged['ensembl_gene_id'].isin(versions['id'])))

70432
70432
True


In [18]:
# Make sure everything is GRCh38, not GRCh37
all(versions['assembly'] == "GRCh38")

True

## Create permalinks based on archive version

**Not all of these versions have an archive.** For releases without one, we fall back to the closest earlier archive, although that link is not guaranteed to work.
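
Here "closest previous archive" means the largest archived release number that does not exceed the gene's release. A minimal sketch of that lookup using `bisect` on a sorted list follows; the function name `closest_archived_release` and the toy release list are illustrative assumptions, not the real archive table.

```python
from bisect import bisect_right


def closest_archived_release(release, archived):
    """Return the largest archived release <= release, or None if there is none."""
    ordered = sorted(archived)
    # bisect_right gives the insertion point after any entry equal to `release`
    position = bisect_right(ordered, release)
    return ordered[position - 1] if position > 0 else None


toy_archives = [80, 91, 93, 95]
print(closest_archived_release(84, toy_archives))  # falls back to an older archive
print(closest_archived_release(93, toy_archives))  # exact match
print(closest_archived_release(79, toy_archives))  # nothing old enough
```

Returning `None` when no archive is old enough makes that case explicit, whereas calling `max()` on an empty list would raise a `ValueError`.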

In [19]:
archive_table = pd.read_csv(archive_filename)

# Remove GRCh37 from the archive list
archive_table = archive_table[archive_table['version'] != "GRCh37"].reset_index()

archive_table['numeric_version'] = archive_table['version'].astype(int)

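def closest_release(release, archive_table):
    """Return `release` if it has an archive, else the closest earlier archived release."""
    archived = archive_table['numeric_version']

    # Compare against the column's values; `release in archive_table` would test column names
    if release in archived.values:
        return release
    
    return max([V for V in archived if V <= release])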

In [20]:
versions['closest_release'] = 0

releases = versions['release'].drop_duplicates().astype(int)

# Only have to call closest_release once per version, instead of >70k times
for release in releases:
    versions.loc[versions['release'] == str(release), 'closest_release'] = closest_release(release, archive_table)
    
versions.groupby('closest_release').size()

closest_release
80       786
91        75
93        53
95        33
96        31
97        18
98         8
99         6
100       23
101        8
102       15
103       15
104       18
105        9
106       32
107       10
108    69292
dtype: int64

In [21]:
versions['permalink'] = ''

for i in versions.index:
    match = archive_table['numeric_version'] == versions.at[i, 'closest_release']
    url = archive_table.loc[match, 'url'].to_string(index = False)
    if len(url) > 0:
        versions.at[i, 'permalink'] = url + "/Homo_sapiens/Gene/Summary?db=core;g=" + versions.at[i, 'id']

versions.head()

|  | version | possible_replacement | id | type | latest | is_current | assembly | release | peptide | closest_release | permalink |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 17 | [] | ENSG00000108823 | Gene | ENSG00000108823.17 | 1 | GRCh38 | 108 |  | 108 | https://oct2022.archive.ensembl.org/Homo_sapie... |
| 1 | 12 | [] | ENSG00000160051 | Gene | ENSG00000160051.12 | 1 | GRCh38 | 108 |  | 108 | https://oct2022.archive.ensembl.org/Homo_sapie... |
| 2 | 1 | [] | ENSG00000222635 | Gene | ENSG00000222635.1 | 1 | GRCh38 | 108 |  | 108 | https://oct2022.archive.ensembl.org/Homo_sapie... |
| 3 | 16 | [] | ENSG00000197321 | Gene | ENSG00000197321.16 | 1 | GRCh38 | 108 |  | 108 | https://oct2022.archive.ensembl.org/Homo_sapie... |
| 4 | 1 | [] | ENSG00000257480 | Gene | ENSG00000257480.1 | 1 | GRCh38 | 108 |  | 108 | https://oct2022.archive.ensembl.org/Homo_sapie... |


In [22]:
versions[versions['closest_release'] < 100].head()

|  | version | possible_replacement | id | type | latest | is_current | assembly | release | peptide | closest_release | permalink |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 56 | 1 | [] | ENSG00000228962 | Gene | ENSG00000228962.1 |  | GRCh38 | 91 |  | 91 | https://dec2017.archive.ensembl.org/Homo_sapie... |
| 60 | 2 | [{'stable_id': 'ENSG00000283682', 'score': 0.7... | ENSG00000222526 | Gene | ENSG00000222526.2 |  | GRCh38 | 84 |  | 80 | https://may2015.archive.ensembl.org/Homo_sapie... |
| 105 | 2 | [] | ENSG00000216045 | Gene | ENSG00000216045.2 |  | GRCh38 | 84 |  | 80 | https://may2015.archive.ensembl.org/Homo_sapie... |
| 131 | 1 | [] | ENSG00000270523 | Gene | ENSG00000270523.1 |  | GRCh38 | 95 |  | 95 | https://jan2019.archive.ensembl.org/Homo_sapie... |
| 293 | 1 | [] | ENSG00000167945 | Gene | ENSG00000167945.1 |  | GRCh38 | 93 |  | 93 | https://jul2018.archive.ensembl.org/Homo_sapie... |


In [23]:
print(versions['permalink'][0])
print(versions['permalink'][25])

https://oct2022.archive.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000108823
https://oct2022.archive.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000286001
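
The URL pattern visible in this output could also be assembled column-wise rather than row by row. The sketch below shows the idea on toy data; the two archive URLs and gene IDs are copied from outputs in this notebook, while the variable names `archive_urls` and `toy_versions` are illustrative.

```python
import pandas as pd

# Release number -> archive base URL (two entries copied from the outputs above)
archive_urls = {
    108: 'https://oct2022.archive.ensembl.org',
    91: 'https://dec2017.archive.ensembl.org',
}

toy_versions = pd.DataFrame({
    'id': ['ENSG00000108823', 'ENSG00000228962'],
    'closest_release': [108, 91],
})

# Map each release number to its archive base URL, then append the gene page path
base_url = toy_versions['closest_release'].map(archive_urls)
toy_versions['permalink'] = (
    base_url + '/Homo_sapiens/Gene/Summary?db=core;g=' + toy_versions['id']
)

print(toy_versions['permalink'].tolist())
```

With this approach a release that has no entry in the mapping yields a missing permalink instead of an empty string, so the follow-up completeness check would need to test for missing values.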


In [24]:
# Does every gene have an associated URL?
url_base_len = len(archive_table['url'][0]) + 1
all([len(url) > url_base_len for url in versions['permalink']])

True

# Part 4: Add permalinks to the gene table

In [25]:
# Select and rename in one step so pandas works on a new frame rather than a slice
versions = versions[['id', 'release', 'permalink']].rename(
    columns={'id': 'ensembl_gene_id', 'release': 'ensembl_release'})

gene_table_merged = pd.merge(left = gene_table_merged, right = versions, how = 'left', on = 'ensembl_gene_id')

print(gene_table_merged.shape)
gene_table_merged.head()

(70432, 12)


|  | index | ensembl_gene_id | _id | _version | alias | name | summary | symbol | type_of_gene | notfound | ensembl_release | permalink |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | ENSG00000108823 | 6442 | 1.0 | [50DAG, ADL, DAG2, DMDA2, LGMD2D, LGMDR3, SCAR... | sarcoglycan alpha | This gene encodes a component of the dystrophi... | SGCA | protein-coding |  | 108 | https://oct2022.archive.ensembl.org/Homo_sapie... |
| 1 | 1 | ENSG00000160051 | 55721 | 1.0 | [] | IQ motif containing C |  | IQCC | protein-coding |  | 108 | https://oct2022.archive.ensembl.org/Homo_sapie... |
| 2 | 2 | ENSG00000222635 | 106480092 | 1.0 | [] | RNA, U6 small nuclear 1203, pseudogene |  | RNU6-1203P | pseudo |  | 108 | https://oct2022.archive.ensembl.org/Homo_sapie... |
| 3 | 3 | ENSG00000197321 | 6840 | 1.0 | [MFM10] | supervillin | This gene encodes a bipartite protein with dis... | SVIL | protein-coding |  | 108 | https://oct2022.archive.ensembl.org/Homo_sapie... |
| 4 | 4 | ENSG00000257480 | 347894 | 1.0 | [] | mitochondrial ribosomal protein L2 pseudogene 1 |  | MRPL2P1 | pseudo |  | 108 | https://oct2022.archive.ensembl.org/Homo_sapie... |


### Write to a file
This will get uploaded to Synapse as [syn25953363](https://www.synapse.org/#!Synapse:syn25953363).

In [26]:
gene_table_merged.to_feather('../output/gene_table_merged_GRCh38.p13.feather')