In [18]:
import synapseclient as sc
import pandas as pd
import numpy as np

syn = sc.Synapse()
syn.login(silent=True)

deffective_rna = 'syn17015360'
deffective_rna = syn.get(deffective_rna)
deffective_rna = pd.read_json(deffective_rna.path, orient='records')

rna = 'syn27211942' # tsv
rna = syn.get(rna)
rna = pd.read_csv(rna.path, sep='\t')

targets = 'syn12540368' # csv
targets = syn.get(targets)
targets = pd.read_csv(targets.path)

gene_info = 'syn25953363'  # feather
gene_info = syn.get(gene_info)
gene_info = pd.read_feather(gene_info.path)

# these were taken from the configuration file that we use
models_to_keep = ['Diagnosis AD-CONTROL ALL', 
                  'Diagnosis.AOD AD-CONTROL ALL', 
                  'Diagnosis.Sex AD-CONTROL FEMALE', 
                  'Diagnosis.Sex AD-CONTROL MALE']

rna.columns


Index(['Model', 'Tissue', 'Comparison', 'ensembl_gene_id', 'logFC', 'CI.L',
       'CI.R', 'AveExpr', 't', 'P.Value', 'adj.P.Val', 'gene_biotype',
       'chromosome_name', 'Direction', 'hgnc_symbol', 'percentage_gc_content',
       'gene.length', 'Sex', 'Study'],
      dtype='object')

First, some basic characteristics of these files:

In [12]:
print(deffective_rna.groupby('ensembl_gene_id').agg('count')) # 23379 unique genes
print(rna.groupby('ensembl_gene_id').agg('count')) # 24395 unique genes


                 hgnc_symbol  logfc  fc  ci_l  ci_r  adj_p_val  tissue  study  \
ensembl_gene_id                                                                 
ENSG00000000003           36     36  36    36    36         36      36     36   
ENSG00000000419           36     36  36    36    36         36      36     36   
ENSG00000000457           36     36  36    36    36         36      36     36   
ENSG00000000460           36     36  36    36    36         36      36     36   
ENSG00000000938           36     36  36    36    36         36      36     36   
...                      ...    ...  ..   ...   ...        ...     ...    ...   
ENSG00000283097           28     28  28    28    28         28      28     28   
ENSG00000283103           36     36  36    36    36         36      36     36   
ENSG00000283108           24     24  24    24    24         24      24     24   
ENSG00000283118           16     16  16    16    16         16      16     16   
ENSG00000283122           16
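
The grouped counts above are mostly useful for the number of groups. A minimal sketch of a more direct comparison, using toy stand-ins for the two tables (the frames `processed` and `source` and their values are made up for illustration, not taken from the Synapse files):

```python
import pandas as pd

# Toy stand-ins for the two tables; the IDs are hypothetical.
processed = pd.DataFrame({"ensembl_gene_id": ["ENSG1", "ENSG1", "ENSG2"]})
source = pd.DataFrame({"ensembl_gene_id": ["ENSG1", "ENSG2", "ENSG3", "ENSG3"]})

# nunique() gives the distinct-gene count without building a grouped frame.
n_processed = processed["ensembl_gene_id"].nunique()
n_source = source["ensembl_gene_id"].nunique()

# Genes present in the source table but absent from the processed one.
missing_genes = sorted(
    set(source["ensembl_gene_id"]) - set(processed["ensembl_gene_id"])
)

print(n_processed, n_source, missing_genes)
```

The set difference names the genes that disappear, which the two counts alone cannot show.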

Prepare data for models in the configuration.  Here we just want to know if we're losing anything due to a mismatch between the "models to keep" and the aggregated models.

In [13]:
rna['tmp'] = rna[['Model', 'Comparison', 'Sex']].agg(' '.join, axis=1)
print("Current number of rows: " + str(rna.shape[0]))

rna = rna[rna['tmp'].isin(models_to_keep)]
print("Number of rows after filtering models: " + str(rna.shape[0]))

Current number of rows: 745580
Number of rows after filtering models: 745580
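
If the row count had changed here, the next question would be which model labels failed to match. A small sketch of that check on toy data (the frame `toy` and the shortened `models_to_keep` list are illustrative assumptions, not the real configuration):

```python
import pandas as pd

# Shortened, hypothetical configuration list for the example.
models_to_keep = ["Diagnosis AD-CONTROL ALL", "Diagnosis.Sex AD-CONTROL MALE"]

toy = pd.DataFrame({
    "Model": ["Diagnosis", "Diagnosis.Sex", "Diagnosis.Sex"],
    "Comparison": ["AD-CONTROL", "AD-CONTROL", "AD-CONTROL"],
    "Sex": ["ALL", "MALE", "FEMALE"],
})

# Build the same composite label as in the cell above.
label = toy[["Model", "Comparison", "Sex"]].agg(" ".join, axis=1)

# Labels that appear in the data but not in the configuration list.
unmatched = sorted(set(label) - set(models_to_keep))

# Number of rows the isin() filter would drop.
n_dropped = int((~label.isin(models_to_keep)).sum())

print(unmatched, n_dropped)
```

An empty `unmatched` list corresponds to the result above, where the row count is the same before and after filtering.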


As expected, no rows are lost, since Jake only includes the models we use in Agora.  Next come the conditions that filter rows out of the differential expression dataset.

In summary, we run some logic to create a pandas Series containing only ensembl_gene_ids that meet these conditions. Once we're done, we filter the `rna` dataset to contain only rows matching those ensembl_gene_ids.

In [14]:
adjusted_p_value_threshold = 1 # from the configuration file we use now
transformed_rna = rna.loc[
    ((rna['adj.P.Val'] <= adjusted_p_value_threshold) | (rna['ensembl_gene_id'].isin(targets['ensembl_gene_id'])))
    & (rna['ensembl_gene_id'].isin(gene_info['ensembl_gene_id']))]

transformed_rna.shape 

(722408, 20)
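
To make the mask easier to reason about, here is a sketch that splits it into named pieces on toy data. The frames and the 0.05 threshold are assumptions for illustration only. With a threshold of 1, as in our configuration, the p-value clause is true for every non-missing p-value, so only the `gene_info` membership test can remove rows.

```python
import pandas as pd

# Toy stand-ins; the IDs and p-values are made up.
rna_toy = pd.DataFrame({
    "ensembl_gene_id": ["ENSG1", "ENSG2", "ENSG3", "ENSG4"],
    "adj.P.Val": [0.01, 0.80, 0.90, 0.02],
})
targets_toy = pd.DataFrame({"ensembl_gene_id": ["ENSG2"]})
gene_info_toy = pd.DataFrame({"ensembl_gene_id": ["ENSG1", "ENSG2", "ENSG3"]})

is_target = rna_toy["ensembl_gene_id"].isin(targets_toy["ensembl_gene_id"])
in_gene_info = rna_toy["ensembl_gene_id"].isin(gene_info_toy["ensembl_gene_id"])

def kept_ids(threshold):
    """Return the gene IDs that survive (p-value OR target) AND gene_info."""
    passes_p = rna_toy["adj.P.Val"] <= threshold
    mask = (passes_p | is_target) & in_gene_info
    return rna_toy.loc[mask, "ensembl_gene_id"].tolist()

strict = kept_ids(0.05)  # hypothetical stricter threshold
loose = kept_ids(1)      # threshold from our configuration

print(strict, loose)
```

In the strict case ENSG2 survives only because it is a target, and ENSG4 is dropped despite its low p-value because it is absent from `gene_info_toy`.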

The logic above accounts for some of the row loss: it drops 23,172 rows (745,580 to 722,408), which does not exactly match the 23,136 rows missing.  This logic will propagate later.  On to the next piece of logic:

In [15]:
transformed_rna = transformed_rna.drop_duplicates(['ensembl_gene_id'])
transformed_rna.shape

(23379, 20)

Then we filter the original `rna` dataset (what we call `differential_expression_data` in some places).  It is this filtering of `rna` by ensembl_gene_id that causes the row difference between these datasets.

In [16]:
rna = rna[rna['ensembl_gene_id'].isin(transformed_rna['ensembl_gene_id'])]
rna.shape

(722408, 20)
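
Note that this makes the filter gene-level rather than row-level: once a gene has one qualifying row, all of its rows are kept. A sketch on toy data (the frame and the `passes` column are hypothetical) shows the difference, which only becomes visible when the p-value threshold is below 1:

```python
import pandas as pd

# Toy data: ENSG1 has one passing row and one failing row.
toy = pd.DataFrame({
    "ensembl_gene_id": ["ENSG1", "ENSG1", "ENSG2", "ENSG3"],
    "tissue": ["CBE", "DLPFC", "CBE", "CBE"],
    "passes": [True, False, True, False],  # stand-in for the row-level mask
})

# Row-level filter, then reduce to one row per gene.
row_level = toy[toy["passes"]]
unique_genes = row_level.drop_duplicates(["ensembl_gene_id"])

# Filter the original frame by the surviving gene IDs.
gene_level = toy[toy["ensembl_gene_id"].isin(unique_genes["ensembl_gene_id"])]

print(len(row_level), len(unique_genes), len(gene_level))
```

Here the failing ENSG1 row comes back in `gene_level`. In our run both shapes are (722408, 20), so no rows were reintroduced this way.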

Lastly, we filter `rna` again to exclude rows missing an `hgnc_symbol`.  We do need to join this with gene_info first.

In [17]:
rna = pd.merge(left=rna,
               right=gene_info,
               on='ensembl_gene_id',
               how='left')

rna = rna[rna['hgnc_symbol'].notna()]
rna.shape

(722444, 28)
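
The row count went up from 722408 to 722444 here, even though a left join followed by a `notna()` filter might be expected to keep or reduce it. One possible explanation, not verified against the real `gene_info` file, is that some ensembl_gene_ids appear more than once on the right side. A sketch of that effect and two ways to detect it, on toy frames:

```python
import pandas as pd

# Toy frames; ENSG2 is deliberately duplicated on the right side.
left = pd.DataFrame({
    "ensembl_gene_id": ["ENSG1", "ENSG2", "ENSG2"],
    "logFC": [0.1, 0.2, 0.3],
})
right = pd.DataFrame({
    "ensembl_gene_id": ["ENSG1", "ENSG2", "ENSG2"],
    "name": ["a", "b", "b2"],
})

# Each left ENSG2 row matches both right ENSG2 rows: 1 + 2 * 2 = 5 rows.
merged = left.merge(right, on="ensembl_gene_id", how="left")

# Check 1: list the keys that are duplicated on the right side.
dup_mask = right["ensembl_gene_id"].duplicated(keep=False)
dup_keys = right.loc[dup_mask, "ensembl_gene_id"].unique().tolist()

# Check 2: ask pandas to enforce unique right-side keys.
try:
    left.merge(right, on="ensembl_gene_id", how="left", validate="many_to_one")
    validated = True
except pd.errors.MergeError:
    validated = False

print(len(merged), dup_keys, validated)
```

Running the `duplicated` check on `gene_info['ensembl_gene_id']` would confirm or rule out this explanation for the 36 extra rows.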