In [1]:
import pandas as pd
import mygene
import numpy as np

flat_file_url = "https://gist.github.com/kdaily/2ed85e0dd3048fea8424b40243ddfa1c/raw/420086bd941962df66992667972c13462e504cc6/gencode.v24.primary_assembly.refFlat.txt"

Agora uses one particular dataset that has historically been packaged as part of an RData file.  After finding the
original code used to generate gene_info.RData, I realized that it can no longer be run: it relies on the presence of a
couple of columns that came from the mygene package in BioConductor.  This notebook is the most faithful reproduction of
that data workflow, and it generates an interoperable file corresponding to the one we have been using in Agora for a
long time.

This is the provenance of gene_info.feather.

The next cell contains the set-up required to run the notebook:

The first step is to read the raw data into a pandas DataFrame and standardize the gene IDs by stripping the version
suffix and dropping duplicates.  The result is a pandas Series, which needs to be converted back into a DataFrame.

In [2]:
gene_table = pd.read_csv(flat_file_url, sep='\t', header=None, usecols=[0], names=['ensembl_gene_id'])
gene_table = gene_table["ensembl_gene_id"].replace("\\..*", "", regex=True).drop_duplicates()

gene_table = pd.DataFrame(gene_table)
gene_table.columns = ['ensembl_gene_id']

gene_table.shape # should be the same as the R counterpart

(60725, 1)
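
As a minimal sketch of the ID clean-up performed above, the same steps can be tried on a tiny in-memory table.  The sample IDs and the name sample_table are made up for illustration; the real values come from the refFlat file.

```python
import pandas as pd

# Hypothetical sample IDs with version suffixes; not taken from the real file.
raw = pd.DataFrame(
    {"ensembl_gene_id": ["ENSG00000223972.5", "ENSG00000223972.4", "ENSG00000227232.5"]}
)

# Drop the first dot and everything after it (the version suffix).
stripped = raw["ensembl_gene_id"].str.replace(r"\..*$", "", regex=True)

# drop_duplicates() on a Series returns a Series; to_frame() turns it back
# into a one-column DataFrame, and reset_index(drop=True) renumbers the rows.
sample_table = stripped.drop_duplicates().to_frame().reset_index(drop=True)
print(sample_table)
```

Using to_frame() keeps the column name, so the separate pd.DataFrame(...) and column assignment become unnecessary in this variant.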

Next, we query mygene (the Python counterpart of the BioConductor package) in order to retrieve a few key fields.
Interestingly, the field X_Score, a measurement of how well the search algorithm did in finding this gene, is no longer
present.  Feel free to modify the query to include that field and verify it for yourself.

Note: the index of the returned dataframe is named "query"; it needs to be renamed to "ensembl_gene_id" so that it can be joined with gene_table.

In [3]:
mg = mygene.MyGeneInfo()
bioconductor_gene_info = mg.getgenes(gene_table['ensembl_gene_id'], fields=["symbol", "name", "summary", "type_of_gene"], as_dataframe=True)
bioconductor_gene_info.index.rename("ensembl_gene_id", inplace=True)
bioconductor_gene_info.head()

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-17000...done.
querying 17001-18000...done.
querying 18001-19000...done.
querying 19001-20000...done.
querying 20001-21000...done.
querying 21001-22000...done.
querying 22001-23000...done.
querying 23001-24000...done.
querying 24001-25000...done.
querying 25001-26000...done.
querying 26001-27000...done.
querying 27001-28000...done.
querying 28001-29000...done.
querying 29001-30000...done.
querying 30001-31000...done.
querying 31001-32000...done.
querying 32001-33000...done.
querying 33001-34000...done.
querying 34001-35000...done.
queryin

ensembl_gene_id,_id,_version,name,symbol,type_of_gene,summary,notfound
ENSG00000223972,100287102,1.0,DEAD/H-box helicase 11 like 1 (pseudogene),DDX11L1,pseudo,,
ENSG00000227232,653635,1.0,"WASP family homolog 7, pseudogene",WASH7P,pseudo,,
ENSG00000278267,102466751,1.0,microRNA 6859-1,MIR6859-1,ncRNA,microRNAs (miRNAs) are short (20-24 nt) non-co...,
ENSG00000243485,ENSG00000243485,1.0,MIR1302-2 host gene,MIR1302-2HG,,,
ENSG00000274890,,,,,,,True
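
The index handling can be tried without a network call.  The sketch below builds a small stand-in for the getgenes result (mock_info is a hypothetical name; its values are copied from the preview above) and renames the "query" index with rename_axis, which returns a new frame instead of mutating in place.

```python
import numpy as np
import pandas as pd

# Stand-in for the dataframe returned by getgenes(..., as_dataframe=True).
# Only two rows and three columns are mocked here.
mock_info = pd.DataFrame(
    {
        "_id": ["100287102", np.nan],
        "symbol": ["DDX11L1", np.nan],
        "notfound": [np.nan, True],
    },
    index=pd.Index(["ENSG00000223972", "ENSG00000274890"], name="query"),
)

# Rename the index so it matches the join key used in gene_table.
mock_info = mock_info.rename_axis("ensembl_gene_id")

# Rows where 'notfound' is missing are the genes the query resolved.
found = mock_info[mock_info["notfound"].isna()]
print(mock_info.index.name, len(found))
```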


We join, and then standardize, our datasets:

In [4]:
gene_table_merged = pd.merge(left=gene_table, right=bioconductor_gene_info, how='left', on="ensembl_gene_id")
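# regex=True is required: pandas 2.0 and later treat the pattern as a literal string by default
gene_table_merged.columns = gene_table_merged.columns.str.replace(r"[#@&*^?()%$!/,]", "", regex=True)
# inside a character class the hyphen is escaped so that it is matched literally
gene_table_merged.columns = gene_table_merged.columns.str.replace(r"[ \-.]", "_", regex=True)
gene_table_merged.columns = gene_table_merged.columns.str.lower()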


# The next two lines would be relevant if we wanted to bring in the go.MF field.  Since we do not, they are commented out.
# Older datasets should still contain those columns, so the logic is provided in case you encounter them.
# gene_table_merged["go_mf"] = gene_table_merged["go_mf"].fillna('').astype(str)
# gene_table_merged["go_mf_pubmed"] = gene_table_merged["go_mf_pubmed"].fillna(np.nan).apply(lambda x: x if x is None or isinstance(x, list) else [x])

gene_table_merged.shape


(60727, 8)
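
The merged table has two more rows than gene_table (60727 versus 60725).  Because the left keys were de-duplicated, a left merge can only grow when the right side repeats a key.  The notebook does not say which IDs are involved, so the sketch below only shows how to find such keys, using hypothetical IDs and symbols.

```python
import pandas as pd

# Hypothetical left table with unique keys.
left = pd.DataFrame({"ensembl_gene_id": ["ENSG_A", "ENSG_B", "ENSG_C"]})

# Hypothetical right table in which ENSG_A appears twice in the index.
right = pd.DataFrame(
    {"symbol": ["A1", "A2", "B1"]},
    index=pd.Index(["ENSG_A", "ENSG_A", "ENSG_B"], name="ensembl_gene_id"),
)

merged = pd.merge(left=left, right=right, how="left", on="ensembl_gene_id")

# Each repeated key on the right adds one row per extra occurrence.
extra_rows = len(merged) - len(left)

# List the keys that occur more than once on the right side.
dup_keys = right.index[right.index.duplicated(keep=False)].unique().tolist()
print(extra_rows, dup_keys)
```

Running the same duplicated() check on bioconductor_gene_info.index would list the IDs responsible for the two extra rows.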

It's important that we check the values here.  We expect ensembl_gene_id to be populated for every row (in other words, its count of non-missing values should match the row count of the previous cell), while missing values in the other columns are expected.  The 'notfound' column indicates that querying for that particular gene yielded no result.  Therefore, the columns used for internal purposes (the ones starting with an underscore) should each contain as many missing values as there are rows flagged 'notfound'.

In [5]:
for col in gene_table_merged.columns:
    print("Missing values from " + col + ": " + str(gene_table_merged[col].isna().sum()))
    
not_found = gene_table_merged[gene_table_merged['notfound'].notna()]
not_found.shape

Missing values from ensembl_gene_id: 0
Missing values from _id: 3985
Missing values from _version: 3985
Missing values from name: 20166
Missing values from symbol: 20166
Missing values from type_of_gene: 35887
Missing values from summary: 46423
Missing values from notfound: 56742


(3985, 8)
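
The expectation described above can also be written as assertions rather than checked by eye.  This is a minimal sketch on a hypothetical three-row miniature of gene_table_merged; the name sample and its values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of gene_table_merged.
sample = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG_A", "ENSG_B", "ENSG_C"],
        "_id": ["1", "2", np.nan],
        "_version": [1.0, 1.0, np.nan],
        "symbol": ["A", np.nan, np.nan],
        "notfound": [np.nan, np.nan, True],
    }
)

missing = sample.isna().sum()                        # missing values per column
n_not_found = int(sample["notfound"].notna().sum())  # rows flagged as not found
internal_cols = [c for c in sample.columns if c.startswith("_")]

# The join key must never be missing.
assert missing["ensembl_gene_id"] == 0
# Internal columns should be missing exactly where the query found nothing.
assert (missing[internal_cols] == n_not_found).all()
print(missing.to_string())
```

In the real output, _id and _version each have 3985 missing values and 3985 rows are flagged notfound, so both assertions would hold.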

Most importantly, we want to make sure that the other columns hold no information whenever 'notfound' is True.  That ensures the dataset is clean.

In [6]:
interesting_columns = [col for col in not_found.columns if not col.startswith('_')] # all columns that don't start with _
interesting_columns.remove('ensembl_gene_id')
interesting_columns.remove('notfound')

for col in interesting_columns:
    print(not_found[not_found[col].notna()].shape[0])

0
0
0
0
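
The same check can be done in one vectorized step instead of a loop.  The sketch below uses a hypothetical miniature table (sample and its values are made up) and counts, per annotation column, how many 'notfound' rows still carry a value.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of gene_table_merged.
sample = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG_A", "ENSG_B", "ENSG_C"],
        "_id": ["1", np.nan, np.nan],
        "name": ["gene a", np.nan, np.nan],
        "symbol": ["A", np.nan, np.nan],
        "notfound": [np.nan, True, True],
    }
)

not_found_rows = sample[sample["notfound"].notna()]

# Annotation columns: everything except internal columns, the key, and the flag.
annotation_cols = [
    c
    for c in sample.columns
    if not c.startswith("_") and c not in ("ensembl_gene_id", "notfound")
]

# Number of non-missing values per annotation column among 'notfound' rows.
leaked = not_found_rows[annotation_cols].notna().sum()
assert (leaked == 0).all(), "some 'notfound' rows still carry annotation"
print(leaked.to_string())
```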


Lastly, we can confidently remove the rows where notfound is True and write our feather files:

In [7]:
gene_table_merged_py = gene_table_merged.copy() # this copy gets used for analysis in the ./comparisson.ipynb file
gene_table_merged = gene_table_merged[gene_table_merged['notfound'].isnull()].reset_index()

gene_table_merged_py.to_feather('../output/gene_table_merged_py.feather')
gene_table_merged.to_feather('../output/gene_table_merged.feather')
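
After filtering, the row labels are no longer consecutive, and reset_index() renumbers them.  Without drop=True it also keeps the old labels as an extra column named index, which then ends up in the written file.  Whether that column is wanted downstream is not stated in this notebook, so the sketch below only shows the difference on hypothetical data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; the middle row is flagged as not found.
sample = pd.DataFrame(
    {
        "ensembl_gene_id": ["ENSG_A", "ENSG_B", "ENSG_C"],
        "notfound": [np.nan, True, np.nan],
    }
)

kept = sample[sample["notfound"].isnull()]  # row labels are now 0 and 2

# Default behaviour: the old labels are preserved in a new 'index' column.
with_old_index = kept.reset_index()

# drop=True discards the old labels and only renumbers the rows.
without_old_index = kept.reset_index(drop=True)

print(list(with_old_index.columns))
print(list(without_old_index.columns))
```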