The script that generates both datasets is included in the same folder. First, we load the datasets:

In [7]:
import pandas as pd

gene_info_py = pd.read_feather("../output/gene_table_merged_py.feather")
gene_info_r = pd.read_feather("../output/gene_table_merged_r.feather")

Then we standardize the columns of the R dataset:

In [8]:
gene_info_r.rename(columns={"X_id": "_id", "X_version": "_version", "ensembl_gene_id": "_ensembl_gene_id"}, inplace=True)
interesting_columns = [col for col in gene_info_r.columns if col[0] != "_"]

We should observe the same results across datasets:

In [9]:
print("Shape from python processing:" + str(gene_info_py.shape))
print("Genes found by the query: " + str(gene_info_py[gene_info_py['notfound'].isnull()].shape[0]))
print("Genes missing:" + str(gene_info_py[gene_info_py['notfound'].notna()].shape[0]))

Shape from python processing:(60727, 8)
Genes found by the query: 56742
Genes missing:3985


In [10]:
print("Shape from R processing:" + str(gene_info_r.shape))
print(gene_info_r[gene_info_r['notfound'].isnull()].shape)
print(gene_info_r[gene_info_r['notfound'].notna()].shape)

Shape from R processing:(60727, 8)
(56742, 8)
(3985, 8)
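
The two count checks above can also be collected into one table so the numbers sit side by side. This is only a minimal sketch: the helper name `found_missing_counts` and the toy frames are hypothetical, and it assumes, as the cells above do, that a non-null `notfound` value marks a gene the query did not find.

```python
import pandas as pd


def found_missing_counts(frames, flag="notfound"):
    """Build one row of total/found/missing counts per labelled DataFrame."""
    rows = {
        label: {
            "total": len(df),
            # A null flag means the query returned the gene.
            "found": int(df[flag].isnull().sum()),
            # A non-null flag means the gene was reported as not found.
            "missing": int(df[flag].notna().sum()),
        }
        for label, df in frames.items()
    }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy frames standing in for gene_info_py and gene_info_r (hypothetical values).
toy_py = pd.DataFrame({"symbol": ["A", "B", None], "notfound": [None, None, True]})
toy_r = pd.DataFrame({"symbol": ["A", "B", None], "notfound": [None, None, True]})

summary = found_missing_counts({"python": toy_py, "r": toy_r})
print(summary)
```

With the real data, the call would presumably be `found_missing_counts({"python": gene_info_py, "r": gene_info_r})`.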


Check that the notfound flag isn't misleading in either dataset, i.e. that genes flagged as not found have no values in the other columns:

In [5]:
for dataset in [gene_info_py, gene_info_r]:
    # Rows that the query flagged as not found.
    flagged = dataset['notfound'].notna()
    for col in interesting_columns:
        if col == 'notfound':
            continue
        # The flag is consistent when every flagged row is empty in this column.
        if dataset.loc[flagged, col].isnull().all():
            print(col + " values matches notfound flag")
        else:
            print(col + " values don't match notfound flag")

name values matches notfound flag
symbol values matches notfound flag
type_of_gene values matches notfound flag
summary values matches notfound flag
name values matches notfound flag
symbol values matches notfound flag
type_of_gene values matches notfound flag
summary values matches notfound flag
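
A pass/fail message hides how many rows disagree, so it can help to count them. The following is a rough sketch rather than part of the original analysis: `flag_mismatches` and the toy frame are hypothetical, and it assumes a flagged row should be empty in every other column.

```python
import pandas as pd


def flag_mismatches(df, columns, flag="notfound"):
    """Count, per column, the flagged rows that still carry a value."""
    flagged = df[df[flag].notna()]
    return {
        col: int(flagged[col].notna().sum())
        for col in columns
        if col != flag
    }


# Toy frame (hypothetical values): the last row is flagged but still has a symbol.
toy = pd.DataFrame({
    "symbol": ["A", None, "C"],
    "name": ["alpha", None, None],
    "notfound": [None, True, True],
})

print(flag_mismatches(toy, ["symbol", "name", "notfound"]))
```

A count of zero for every column would mean the flag is consistent with the data.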


We can now compare the values within each column and try to establish equality between them:

In [6]:
for col in interesting_columns:
    print(col)
    for dataset in [gene_info_py, gene_info_r]:
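        try:
            # str() is required here: concatenating a float to a string raises a TypeError.
            print(col + "'s mean: " + str(dataset[col].mean()))
        except (TypeError, ValueError):
            # Non-numeric columns (e.g. strings) have no mean, so skip them.
            pass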
        print(col + "'s mode: " + str(dataset[col].mode()[0]))
        print("Common values for " + col + ":")
        print(dataset[col].value_counts(dropna=False)[:3])
        print()

name
name's mode: Y RNA
Common values for name:
NaN                                         20166
Y RNA                                         756
Metazoan signal recognition particle RNA      166
Name: name, dtype: int64

name's mode: Y RNA
Common values for name:
NaN                                         20166
Y RNA                                         756
Metazoan signal recognition particle RNA      166
Name: name, dtype: int64

symbol
symbol's mode: Y_RNA
Common values for symbol:
NaN            20166
Y_RNA            756
Metazoa_SRP      166
Name: symbol, dtype: int64

symbol's mode: Y_RNA
Common values for symbol:
NaN            20166
Y_RNA            756
Metazoa_SRP      166
Name: symbol, dtype: int64

type_of_gene
type_of_gene's mode: protein-coding
Common values for type_of_gene:
NaN               35887
protein-coding    19201
ncRNA              4811
Name: type_of_gene, dtype: int64

type_of_gene's mode: protein-coding
Common values for type_of_gene:
NaN               3

The modes and most common values agree between the two datasets, so it seems safe to say they are equivalent.
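
Matching summary statistics do not strictly guarantee that the rows match one by one. A minimal sketch of a direct comparison follows. It assumes both frames share a key column such as `_ensembl_gene_id` (suggested by the rename step, but not confirmed here for the Python dataset); the helper `compare_columns` and the toy frames are hypothetical.

```python
import pandas as pd


def compare_columns(left, right, columns, key):
    """Return the columns whose values differ once both frames are sorted by `key`."""
    left_sorted = left.sort_values(key).reset_index(drop=True)
    right_sorted = right.sort_values(key).reset_index(drop=True)
    # Series.equals treats missing values in the same position as equal.
    return [
        col for col in columns
        if not left_sorted[col].equals(right_sorted[col])
    ]


# Toy frames standing in for gene_info_py and gene_info_r (hypothetical values),
# deliberately stored in a different row order.
toy_py = pd.DataFrame({
    "_ensembl_gene_id": ["ENSG2", "ENSG1", "ENSG3"],
    "symbol": ["B", "A", None],
})
toy_r = pd.DataFrame({
    "_ensembl_gene_id": ["ENSG1", "ENSG2", "ENSG3"],
    "symbol": ["A", "B", None],
})

differing = compare_columns(toy_py, toy_r, ["symbol"], key="_ensembl_gene_id")
print(differing)
```

An empty list means every compared column is identical. Note that `Series.equals` also requires matching dtypes, so a column that round-trips through feather with a different dtype in R and Python would be reported as differing even when the values look the same.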