In [1]:
import pandas as pd
import mygene
import numpy as np

flat_file_url = "https://gist.github.com/kdaily/2ed85e0dd3048fea8424b40243ddfa1c/raw/420086bd941962df66992667972c13462e504cc6/gencode.v24.primary_assembly.refFlat.txt"

Agora uses one particular dataset that has historically been packaged as part of an RData file.  After finding the original
code used to generate gene_info.RData, I realized that it can no longer be run.  It relies on the presence of a couple of
columns that come from the mygene package in BioConductor.  This notebook is the most faithful reproduction of that data
workflow I could produce, and it generates an interoperable file corresponding to the one we have been using in Agora for
a long time.

This is the provenance of gene_info.feather.

The first cell, above, contains the set-up required to run the notebook.

The first step is to read the raw data into a pandas dataframe and standardize the gene identifiers by stripping the
version suffix and dropping duplicates.  The result is a pandas Series that needs to be converted back into a dataframe.

In [2]:
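# Read only the first column of the flat file and name it ensembl_gene_id.
gene_table = pd.read_csv(flat_file_url, sep='\t', header=None, usecols=[0], names=['ensembl_gene_id'])
# Strip the version suffix (everything from the first dot onward), then drop duplicate IDs.
gene_table = gene_table["ensembl_gene_id"].replace("\\..*", "", regex=True).drop_duplicates()

# Convert the resulting Series back into a single-column dataframe.
gene_table = pd.DataFrame(gene_table)
gene_table.columns = ['ensembl_gene_id']

gene_table.shape # should be the same as the R counterpart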

(60725, 1)
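
As a minimal sketch of what the identifier clean-up above does, the same regex and de-duplication can be tried on a
toy Series.  The identifiers, version numbers, and the names versioned_ids, stable_ids and toy_gene_table are
hypothetical and only illustrate the pattern; to_frame() is used here as one possible way to get back to a dataframe.

```python
import pandas as pd

# Hypothetical versioned identifiers (not taken from the real flat file).
versioned_ids = pd.Series(
    ["ENSG00000000003.14", "ENSG00000000003.15", "ENSG00000000005.5"],
    name="ensembl_gene_id",
)

# Drop everything from the first dot onward, then collapse repeated genes.
stable_ids = versioned_ids.replace(r"\..*", "", regex=True).drop_duplicates()

# to_frame() turns the Series into a one-column dataframe that keeps the Series name.
toy_gene_table = stable_ids.to_frame().reset_index(drop=True)
print(toy_gene_table)
```

Two versions of the same gene collapse into a single row, which is why the row count can be smaller than the number of
distinct versioned identifiers in the raw file.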

Next, we fetch a few key fields for each gene through mygene, as the original code did with the BioConductor package.
Interestingly, the field X_Score (a measurement of how well the search algorithm did in finding this gene) is no longer
present.  Feel free to modify the query to include that field and verify it for yourself.

Note: "query" is the name of the index of the returned dataframe, and it needs to be renamed to "ensembl_gene_id".

In [None]:
mg = mygene.MyGeneInfo()
bioconductor_gene_info = mg.getgenes(gene_table['ensembl_gene_id'], fields=["symbol", "name", "summary", "type_of_gene", "go.MF"], as_dataframe=True)
bioconductor_gene_info.index.rename("ensembl_gene_id", inplace=True)
bioconductor_gene_info.head()

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-20000...done.
querying 20001-21000...done.
querying 21001-22000...done.
querying 22001-23000...done.
querying 23001-24000...done.
querying 24001-25000...done.
querying 25001-26000...done.
querying 26001-27000...done.
querying 27001-28000...done.
querying 28001-29000...done.
querying 29001-30000...done.
querying 30001-31000...done.
querying 31001-32000...done.
querying 32001-33000...done.
querying 33001-34000...done.
querying 34001-35000...done.
...
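
To illustrate why the index name matters, here is a minimal sketch on toy data, assuming only that the result is
indexed by a level called "query" as noted above.  The dataframes toy_annotations and toy_genes and their values are
hypothetical.  Unlike the notebook, this sketch also calls reset_index() so the join key is a regular column, and it
uses indicator=True to list identifiers that received no annotation.

```python
import pandas as pd

# Hypothetical stand-in for a mygene result; the real one has many more rows and columns.
toy_annotations = pd.DataFrame(
    {"symbol": ["GENE_A", "GENE_B"]},
    index=pd.Index(["ENSG_TOY_1", "ENSG_TOY_2"], name="query"),
)

# Rename the index to the join key and expose it as a regular column.
toy_annotations = toy_annotations.rename_axis("ensembl_gene_id").reset_index()

toy_genes = pd.DataFrame({"ensembl_gene_id": ["ENSG_TOY_1", "ENSG_TOY_2", "ENSG_TOY_3"]})

# indicator=True adds a "_merge" column recording where each row was found.
toy_merged = toy_genes.merge(toy_annotations, how="left", on="ensembl_gene_id", indicator=True)
unmatched = toy_merged.loc[toy_merged["_merge"] == "left_only", "ensembl_gene_id"].tolist()
print(unmatched)
```

A left merge keeps every row of the gene table, so unannotated genes show up as missing values rather than
disappearing.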

Then we merge the two tables, clean up the column names, and perform some validation on the values.

In [None]:
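gene_table_merged = pd.merge(left=gene_table, right=bioconductor_gene_info, how='left', on="ensembl_gene_id")
# Remove special characters from the column names, then turn spaces, hyphens and dots into underscores.
# regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal string by default.
gene_table_merged.columns = gene_table_merged.columns.str.replace(r"[#@&*^?()%$!/,]", "", regex=True)
gene_table_merged.columns = gene_table_merged.columns.str.replace(r"[ .\-]", "_", regex=True)
gene_table_merged.columns = gene_table_merged.columns.str.lower()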

gene_table_merged["go_mf"] = gene_table_merged["go_mf"].fillna('').astype(str)
# Wrap single values in a list so every non-missing entry is a list; missing values are left as they are.
gene_table_merged["go_mf_pubmed"] = gene_table_merged["go_mf_pubmed"].apply(lambda x: x if isinstance(x, list) or pd.isna(x) else [x])

for col in gene_table_merged.columns:
    print("Missing values from " + col + ": " + str(gene_table_merged[col].isna().sum()))

gene_table_merged.shape
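
The column clean-up and list normalization in this cell can be sketched on a toy dataframe.  The data and the helper
as_list_or_missing are hypothetical and only illustrate the idea; this sketch assumes the intent is to keep missing
values missing rather than wrapping them in a list.

```python
import pandas as pd

# Hypothetical data: one list, one bare scalar, one missing value.
toy = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG_TOY_1", "ENSG_TOY_2", "ENSG_TOY_3"],
        "go.MF.pubmed": [[111, 222], 333, None],
    }
)

# Replace spaces, dots and hyphens with underscores, then lower-case the names.
toy.columns = toy.columns.str.replace(r"[ .\-]", "_", regex=True).str.lower()


def as_list_or_missing(value):
    """Wrap scalars in a list; return lists and missing values unchanged."""
    if isinstance(value, list):
        return value
    if pd.isna(value):
        return value
    return [value]


toy["go_mf_pubmed"] = toy["go_mf_pubmed"].apply(as_list_or_missing)
print(toy)
```

Checking isinstance(value, list) before pd.isna(value) matters, because pd.isna applied to a list returns an array
rather than a single boolean.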

Lastly, we save the gene_table_merged as a feather file:

In [None]:
gene_table_merged.to_feather('./gene_table_merged.feather')