# Download and explore GTEx datasets.
Namely, the eQTL datasets (one file per tissue)  and the gene median TPM dataset.  
Data Source link: https://gtexportal.org/home/datasets

## As of March 22 2023 I am using the [polars](https://www.pola.rs) library to process the larger eQTL dataset (~71 million rows). To load all 49 tissue files (each containing eQTLs) with polars, the files must first be decompressed, which was done with the Linux command `gunzip *.txt.gz`
### The median TPM workflow (second half of this notebook) still uses pandas which is fine.
### HOWEVER, Updates to the GTEX_EXP (median TPM) workflow are going to be made in THIS notebook

#### The end of this workflow differs from the original GTEx.ipynb notebook located in /Users/stearb/Dropbox/CHOP/R03/code/GTEx. We are using Jonathan Silverstein's workflow for the Neo4j CSV creation, meaning the files produced by this workflow are the inputs to JS's workflow. So we only need to create 2 files, a nodes.tsv and an edges.tsv (instead of the ~6 files: CUIs, CUI-CUIs, Code-CUIs, Terms, etc.). The nodes and edges files are then inputs to the OWLNETS Python script, which transforms them into the 11 CSV files that get appended to the base UMLS CSVs.

#### The guide for how to create these new nodes and edges files can be found in the Data Distillery's [GitHub](https://github.com/dbmi-pitt/UBKG/tree/main/user%20guide)
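
As a rough illustration of the two target files, the sketch below serializes one node row and one edge row as tab-separated text. The node columns follow the `col_order` list used later in this notebook and the edge columns follow the `subject`/`predicate`/`object` frames built below; the helper name `to_tsv` is hypothetical, and the UBKG user guide remains the authoritative description of the format.

```python
import csv
import io

# Node columns as listed in `col_order` later in this notebook.
NODE_COLS = ['node_id', 'unit', 'node_namespace', 'value', 'lowerbound',
             'node_synonyms', 'node_label', 'node_dbxrefs', 'node_definition',
             'upperbound']
EDGE_COLS = ['subject', 'predicate', 'object']


def to_tsv(rows, columns):
    """Hypothetical helper: write a list of dicts as tab-separated text.

    Columns missing from a row are written as empty strings.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, delimiter='\t',
                            restval='', lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# One example eQTL node and one eQTL -> gene edge, using values shown in this notebook.
eqtl_id = 'GTEXEQTL eQTL.chr1.1434243.G.A.b38.Cells.Cultured.fibroblasts'
nodes_tsv = to_tsv([{'node_id': eqtl_id, 'value': 5.686160e-41}], NODE_COLS)
edges_tsv = to_tsv([{'subject': eqtl_id, 'predicate': 'located in',
                     'object': 'HGNC HGNC:24007'}], EDGE_COLS)

print(nodes_tsv)
print(edges_tsv)
```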

## As of Feb 18 2023, the new format for GTEX_EQTLs is: 

#### `GTEX_EQTL eQTL_chr{CHROMOSOME_NUMBER}_{LOCATION}_{REF}_{ALT}_b38_{TISSUE}`

#### ie) `GTEX_EQTL eQTL_chr11_538371_G_A_b38_Skin_Not_Sun_Exposed_Suprapubic`

(changed from the old format of rsID-HGNC_ID-tissue)
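
A minimal sketch of how a code in this format can be assembled from a GTEx `variant_id` and a tissue name. The function name `make_eqtl_code` and its parameters are illustrative, not part of the pipeline; later cells in this notebook switch the SAB to `GTEXEQTL` and the delimiter to `.`, which the optional arguments mimic.

```python
def make_eqtl_code(variant_id, tissue, sab='GTEX_EQTL', delim='_'):
    """Build an eQTL CodeID (hypothetical helper).

    variant_id: GTEx variant ID such as 'chr11_538371_G_A_b38'
    tissue:     tissue name with spaces, e.g. 'Skin Not Sun Exposed Suprapubic'
    """
    variant = variant_id.replace('_', delim)
    tissue_part = tissue.replace(' ', delim)
    return f"{sab} eQTL{delim}{variant}{delim}{tissue_part}"


old_style = make_eqtl_code('chr11_538371_G_A_b38',
                           'Skin Not Sun Exposed Suprapubic')
new_style = make_eqtl_code('chr11_538371_G_A_b38',
                           'Skin Not Sun Exposed Suprapubic',
                           sab='GTEXEQTL', delim='.')
print(old_style)
print(new_style)
```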

## As of March 7th we are using the `signif_variant_gene_pairs.txt` files and NOT the `egenes.txt` files. `signif_variant_gene_pairs.txt.gz` contains ~71 million eQTLs while `egenes` contains only ~2.1 million

## As of May 2023 the SABs for the GTEx eQTL and GTEx expression datasets will be GTEXEQTL and GTEXEXP, respectively

In [1]:
#!jupyter nbconvert --to script GTEx_JS.ipynb
#!sed -i '' '/.head(/d' GTEx_JS.ipynb
#!sed -i '' '/^#/d' GTEx_JS.ipynb

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import polars as pl
from polars import Config
Config.set_fmt_str_lengths(1000)  # call the setter; assigning to it would silently replace the method
import numpy as np
import os
from collections import Counter
import pyarrow
from cmapPy.pandasGEXpress.parse_gct import parse

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

## Modeling Notes/Questions 
Median Expression 
* Median Expression Code nodes have Interval nodes as Terms. Each Interval Term node is connected to many Median Expression Code nodes. Is this allowed (Code nodes sharing Term nodes)?
* Median Expression Code nodes CodeID and CODE (see above)

eQTL
* Currently using HGNC/UBERON/dbSNP Concept nodes to uniquely identify every eQTL 
* Instead, use the variant_id to link the eQTL node (with variant_id) to UBERON/HGNC nodes



* Concepts can link to multiple Codes from different ontologies
    * Ex) CUI C1416553 links to HGNC and OMIM

###  First, Get the Sample Annotations, which contain sample and uberon to tissue mappings (GTEx--UBERON mappings).
Data Source: https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt

In [2]:
#!curl --insecure https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt > gtex_sample_annotations.txt
samp_annos = pd.read_csv('gtex_sample_annotations_select.txt',sep='\t')
print(samp_annos.shape)
samp_annos.head(3)

(22951, 3)


Unnamed: 0,SMTS,SMTSD,SMUBRID
0,Blood,Whole Blood,13756
1,Blood,Whole Blood,13756
2,Blood,Whole Blood,13756


In [3]:
#SAMPID_2_TISSUE.to_csv('gtex_sample_annotations_select.txt',sep='\t',index=False)

In [4]:
# Select the 3 columns  we need from the annotation dataset     # dont actually need sample id
# SAMPID = SAMPLE ID
# SMTS = Tissue
# SMTSD = More specific Tissue
# SMUBRID = UBERON ID
# .copy() avoids pandas' SettingWithCopyWarning when the column is modified below
SAMPID_2_TISSUE = samp_annos[['SMTS','SMTSD','SMUBRID']].copy() # 'SAMPID',

# Remove dashes, -, and parentheses so the tissue strings match the tissue strings from the eqtl datasets.
# regex=False treats '(' and ')' as literal characters regardless of the pandas version.
SAMPID_2_TISSUE['SMTSD'] = (SAMPID_2_TISSUE['SMTSD'].str.replace(' - ',' ',regex=False)
                                                    .str.replace('(','',regex=False)
                                                    .str.replace(')','',regex=False))

# Get only unique tissue IDs for mapping tissues--UBERON codes later.
SAMPID_2_TISSUE_unique = SAMPID_2_TISSUE.drop_duplicates('SMTSD')\
                                        .rename(columns={'SMTSD':'tissue'})  # .drop('SAMPID',axis=1)

print(SAMPID_2_TISSUE.shape)
SAMPID_2_TISSUE.head()

(22951, 3)


Unnamed: 0,SMTS,SMTSD,SMUBRID
0,Blood,Whole Blood,13756
1,Blood,Whole Blood,13756
2,Blood,Whole Blood,13756
3,Brain,Brain Frontal Cortex BA9,9834
4,Brain,Brain Frontal Cortex BA9,9834
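
The tissue-name cleanup above can be checked on single strings. This is a plain-Python sketch of the same three replacements; the function name `normalize_tissue` is made up for illustration.

```python
def normalize_tissue(name):
    """Drop ' - ' separators and parentheses from a GTEx SMTSD tissue name."""
    return name.replace(' - ', ' ').replace('(', '').replace(')', '')


print(normalize_tissue('Brain - Frontal Cortex (BA9)'))
print(normalize_tissue('Whole Blood'))
```

Names without dashes or parentheses, such as `Whole Blood`, pass through unchanged.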


###  Also, get the master list of HGNC IDs (that maps gene names to HGNC IDs)
Data Source: ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/non_alt_loci_set.txt

In [5]:
#!curl ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/non_alt_loci_set.txt > hgnc_master.txt

In [6]:
#!curl ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/non_alt_loci_set.txt > hgnc_master.txt
hgnc_master = pd.read_csv('hgnc_master.txt',sep='\t')
hgnc_select = hgnc_master[['hgnc_id','symbol','ensembl_gene_id']]
print(hgnc_select.shape)
hgnc_select.head(3)
#hgnc_master[['hgnc_id','symbol']].to_csv('hgnc_master_2cols.txt')

(43653, 3)


Unnamed: 0,hgnc_id,symbol,ensembl_gene_id
0,HGNC:5,A1BG,ENSG00000121410
1,HGNC:37133,A1BG-AS1,ENSG00000268895
2,HGNC:24086,A1CF,ENSG00000148584


# Import eQTL data (as of May 20th, 2023 we filter eQTLs to reduce the size of the resulting nodes/edges files, keeping only 'common' eQTLs: those present in all 49 tissues)
File:  GTEx_Analysis_v8_eQTL/  
(Contains 49 tissue files)  

[Nominal P-value Explanation](https://stats.stackexchange.com/questions/536116/what-is-the-concepts-of-nominal-and-actual-significance-level)
 
[Ensembl Gene ID Period Explanation](https://useast.ensembl.org/Help/Faq?id=488)

## Not using this for loop to load in eqtls anymore (that code is at the bottom of this nb). Use the next block to load in the single 71m eqtl dataset using polars 

In [8]:
%%time
eqtl = pl.read_parquet('/Users/stearb/Desktop/DESKTOP_TRANSFER/R03_local/data/gtex/eqtls_all_71m.parquet')
eqtl = eqtl.drop('__index_level_0__')
#eqtl

CPU times: user 3.97 s, sys: 2.4 s, total: 6.37 s
Wall time: 4.16 s


In [8]:
#hgnc_select# = hgnc_select.rename(columns={'ensembl_gene_id':'gene_id'})

In [9]:
#eqtl = eqtl.rename({'gene_id':'ensembl_gene_id'})
#eqtl['ensembl_gene_id'] = eqtl['gene_id'].apply(lambda x: x.split('.')[0]) 
#eqtl.select(pl.col('gene_id'))

In [9]:
%%time

###### Leave the '.' in the 'gene_id' column (Ensembl uses a .N suffix to specify the
# Nth revision of a gene's definition). The '.' is OK to keep in the code now that we are ingesting Ensembl IDs.
# The HGNC Ensembl gene IDs don't have the '.' suffix, though, so we make a temporary column
# without it, merge on that column, and then delete it.
eqtl = eqtl.with_columns(pl.col("gene_id").str.split(".").arr.get(0).alias("ensembl_gene_id"))
df = eqtl.join(pl.from_pandas(hgnc_select),on='ensembl_gene_id',how='left').drop('ensembl_gene_id')
del eqtl
df.shape

CPU times: user 11 s, sys: 29.7 s, total: 40.7 s
Wall time: 29.7 s


(71589767, 6)
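
To make the merge logic concrete, here is a small plain-Python sketch of the same idea: strip the version suffix from the Ensembl gene ID, then look the bare ID up in an HGNC table. The lookup dict holds one row taken from the `hgnc_select` preview above; the version suffix `.11` and the function names are illustrative assumptions.

```python
# One row from the hgnc_select preview: ensembl_gene_id -> (hgnc_id, symbol)
hgnc_by_ensembl = {'ENSG00000121410': ('HGNC:5', 'A1BG')}


def strip_version(gene_id):
    """'ENSG00000121410.11' -> 'ENSG00000121410' (illustrative version suffix)."""
    return gene_id.split('.')[0]


def annotate(gene_id):
    """Left-join behaviour: unknown genes get (None, None) instead of being dropped."""
    return hgnc_by_ensembl.get(strip_version(gene_id), (None, None))


print(annotate('ENSG00000121410.11'))
print(annotate('ENSG99999999999.1'))
```

Rows that come back as `(None, None)` correspond to the nulls that a left join leaves behind.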

## If we use Ensembl IDs we don't need to drop nulls

In [11]:
df = df.drop_nulls() # drops 71m to ~61m   

In [12]:
# get heart eqtls
#df.select(pl.col('tissue')).unique().to_pandas()
#heart_tissues = ['Artery Coronary','Heart Atrial Appendage','Artery Aorta','Heart Left Ventricle']
#df_heart = df.filter(pl.col('tissue').is_in(heart_tissues))
#df_heart = df_heart
#df_heart
#df_heart.write_parquet('gtex_heart_eqtls.parquet')

### Specify delimiter for GTEX_EQTL CODEs. It used to be '_', now we are using '.'

In [13]:
EQTL_CODE_DELIM = '.'
SAB = 'GTEXEQTL'

In [14]:
df = df.with_columns(pl.col('tissue').str.replace_all(' ',EQTL_CODE_DELIM).alias('tissue_underscore'))\
       .with_columns(pl.col('variant_id').str.replace_all('_',EQTL_CODE_DELIM).alias('variant_id'))

df = df.with_columns(pl.map([pl.col('variant_id'),pl.col('tissue_underscore')], 
                      lambda s: f"{SAB} eQTL{EQTL_CODE_DELIM}" 
                              +s[0] +EQTL_CODE_DELIM+ s[1]).alias("CodeID_gtex"))


In [15]:
#df.select('CodeID_gtex').head(3).to_pandas()

In [16]:
%%time
# MERGE IN UBERON CODEs
eqtl_ub = df.join(pl.from_pandas(SAMPID_2_TISSUE_unique),on='tissue',how='left')
eqtl_ub = eqtl_ub.rename({'SMUBRID':'UBERON_code'}) # 'gene_name': 'symbol',
eqtl_all_GTEx = eqtl_ub
del eqtl_ub

CPU times: user 2.52 s, sys: 3.65 s, total: 6.17 s
Wall time: 1.92 s


## Optionally filter by eqtl frequency -- do this first and then do all the other merges
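### Concatenate the variant_id field with the gene_id and count how many times the resulting string appears. Each variant-gene pair should appear at most once per tissue, so its frequency equals the number of tissues it is found in, and we can filter on that frequency
For example, with a threshold of 5, any eQTL found in fewer than 5 tissues would be dropped. The threshold actually used is set in `TISSUE_FREQUENCY` below.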
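
The counting idea can be sketched in plain Python with `collections.Counter`. The rows and the function name `common_pairs` are toy examples for illustration only; the notebook does the same thing column-wise with polars.

```python
from collections import Counter


def common_pairs(rows, min_tissues):
    """Return the (variant_id, gene_id) pairs seen in at least min_tissues rows.

    rows: iterable of (variant_id, gene_id, tissue) tuples, one per eQTL record.
    Assumes each pair occurs at most once per tissue.
    """
    counts = Counter((variant, gene) for variant, gene, _tissue in rows)
    return {pair for pair, n in counts.items() if n >= min_tissues}


# Toy data: the first pair is found in 3 tissues, the second in only 1.
toy_rows = [
    ('chr1.100.G.A.b38', 'ENSG_A.1', 'Lung'),
    ('chr1.100.G.A.b38', 'ENSG_A.1', 'Liver'),
    ('chr1.100.G.A.b38', 'ENSG_A.1', 'Whole Blood'),
    ('chr2.200.C.T.b38', 'ENSG_B.1', 'Lung'),
]
kept = common_pairs(toy_rows, min_tissues=3)
print(kept)
```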

In [17]:
eqtl_all_GTEx = eqtl_all_GTEx.with_columns(pl.map([pl.col('variant_id'),pl.col('gene_id')], 
                      lambda s: s[0] +EQTL_CODE_DELIM+ s[1]).alias("eqtl_gene"))

In [18]:
# get frequency counts of this new string
cnts = eqtl_all_GTEx.select([pl.col("eqtl_gene").value_counts()])
# split the array into 2 cols
cnts = cnts.with_columns(pl.col('eqtl_gene').struct.field("counts").alias("eqtl_gene_freq"))
cnts = cnts.with_columns(pl.col('eqtl_gene').struct.field("eqtl_gene").alias("eqtl_gene_str"))


### ONLY INCLUDE EQTL IF ITS IN 49 (all of them) TISSUES
#### ~71m  `-->`  ~2m eQTLs

In [19]:
# 49 tissues in total
#eqtl_all_GTEx.select(pl.col('tissue').unique())

In [20]:
TISSUE_FREQUENCY = 49
filtered_ids = cnts.filter(pl.col('eqtl_gene_freq') >= TISSUE_FREQUENCY).select(pl.col('eqtl_gene_str')).to_series()
filtered_ids.shape

(42086,)

In [21]:
eqtl_all_GTEx = eqtl_all_GTEx.filter(pl.col('eqtl_gene').is_in(filtered_ids))
eqtl_all_GTEx.shape

(2089174, 11)

In [22]:
## Select rare eqtls -- NOT doing this
#TISSUE_FREQUENCY = 1
#filtered_ids = cnts.filter(pl.col('eqtl_gene_freq') <= TISSUE_FREQUENCY).select(pl.col('eqtl_gene_str')).to_series()
#filtered_ids.shape
#eqtl_all_GTEx_filt = eqtl_all_GTEx.filter(pl.col('eqtl_gene').is_in(filtered_ids))
#eqtl_all_GTEx_filt.shape

In [25]:
eqtl_all_GTEx

gene_id,pval_beta,variant_id,tissue,hgnc_id,symbol,tissue_underscore,CodeID_gtex,SMTS,UBERON_code,eqtl_gene
str,f64,str,str,str,str,str,str,str,str,str
"""ENSG0000016007…",5.6862e-41,"""chr1.1434243.G…","""Cells Cultured…","""HGNC:24007""","""ATAD3B""","""Cells.Cultured…","""GTEXEQTL eQTL.…","""Skin""","""EFO_0002009""","""chr1.1434243.G…"
"""ENSG0000016007…",5.6862e-41,"""chr1.1497758.C…","""Cells Cultured…","""HGNC:24007""","""ATAD3B""","""Cells.Cultured…","""GTEXEQTL eQTL.…","""Skin""","""EFO_0002009""","""chr1.1497758.C…"
"""ENSG0000016007…",5.6862e-41,"""chr1.1499000.C…","""Cells Cultured…","""HGNC:24007""","""ATAD3B""","""Cells.Cultured…","""GTEXEQTL eQTL.…","""Skin""","""EFO_0002009""","""chr1.1499000.C…"
"""ENSG0000016007…",5.6862e-41,"""chr1.1499128.C…","""Cells Cultured…","""HGNC:24007""","""ATAD3B""","""Cells.Cultured…","""GTEXEQTL eQTL.…","""Skin""","""EFO_0002009""","""chr1.1499128.C…"
"""ENSG0000016007…",5.6862e-41,"""chr1.1499639.G…","""Cells Cultured…","""HGNC:24007""","""ATAD3B""","""Cells.Cultured…","""GTEXEQTL eQTL.…","""Skin""","""EFO_0002009""","""chr1.1499639.G…"
"""ENSG0000016007…",5.6862e-41,"""chr1.1500526.T…","""Cells Cultured…","""HGNC:24007""","""ATAD3B""","""Cells.Cultured…","""GTEXEQTL eQTL.…","""Skin""","""EFO_0002009""","""chr1.1500526.T…"
"""ENSG0000016007…",5.6862e-41,"""chr1.1520463.T…","""Cells Cultured…","""HGNC:24007""","""ATAD3B""","""Cells.Cultured…","""GTEXEQTL eQTL.…","""Skin""","""EFO_0002009""","""chr1.1520463.T…"
"""ENSG0000016007…",5.6862e-41,"""chr1.1531013.G…","""Cells Cultured…","""HGNC:24007""","""ATAD3B""","""Cells.Cultured…","""GTEXEQTL eQTL.…","""Skin""","""EFO_0002009""","""chr1.1531013.G…"
"""ENSG0000016007…",5.6862e-41,"""chr1.1534481.C…","""Cells Cultured…","""HGNC:24007""","""ATAD3B""","""Cells.Cultured…","""GTEXEQTL eQTL.…","""Skin""","""EFO_0002009""","""chr1.1534481.C…"
"""ENSG0000016007…",5.6862e-41,"""chr1.1558462.T…","""Cells Cultured…","""HGNC:24007""","""ATAD3B""","""Cells.Cultured…","""GTEXEQTL eQTL.…","""Skin""","""EFO_0002009""","""chr1.1558462.T…"


In [23]:
#eqtl_all_GTEx.select('CodeID_gtex').to_pandas().to_csv('/Users/stearb/Desktop/common_eqtls.csv',index=False)

In [24]:
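# Add any node-metadata columns that are missing from df, filled with NaN, so every nodes frame has the same set of columns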
def fill_missing_cols(df):
    
    if 'node_id' not in df.columns:
        raise ValueError('Must have at least a "node_id" column.')
    
    all_cols = set([ 'node_label', 'node_synonyms', 'node_dbxrefs',
            'node_definition','node_namespace','value','lowerbound','upperbound','unit'])
    
    missing_cols = list(all_cols - set(df.columns))
    
    nan_cols_df = pd.DataFrame(np.full([len(df), len(missing_cols)], np.nan),columns=missing_cols)

    if isinstance(df, pd.DataFrame):
        nan_cols_df.index = df.index
        return pd.concat([df,nan_cols_df],axis=1)
    
    elif isinstance(df, pl.DataFrame):
        # no index for polars
        return pl.concat([df,pl.from_pandas(nan_cols_df)],how='horizontal')
    else:
        raise ValueError(f'Must pass either a pandas DataFrame or a polars DataFrame but received "{type(df)}".')



### Bin P-values

In [27]:
# Create bins w/o having to map bins to the entire df
terms_pval = pd.DataFrame()
bins = [0,1e-12,1e-11,1e-10,1e-9,1e-8,1e-7,1e-6,1e-5,1e-4,1e-3,.005,.01,.02,.03,.04,.05,.06]

b = []
for i in range(0,len(bins)-1):
    b.append((bins[i],bins[i+1]) )

terms_pval['bin'] = b

#terms_pval['lowerbound'] = [format(float(i[0]),'.12f') for i in terms_pval['bin']]
#terms_pval['upperbound'] = [format(float(i[1]),'.12f') for i in terms_pval['bin']]

terms_pval['Term'] = [str(i[0])+','+str(i[1]) for i in terms_pval['bin']]

terms_pval['CodeID_PVAL'] = ['PVALUEBINS '+i for i in terms_pval['Term']]
terms_pval['rel'] = 'p_value'
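
For a single p-value, the bin lookup can be sketched with the standard-library `bisect` module. This assumes right-closed bins `(low, high]`, as suggested by the `(0.0, 1.0e-12]` labels in this notebook's output, and builds the node ID with `.` between the two bounds, as the p-value node IDs are formatted further down. The function name `pval_bin_code` is hypothetical.

```python
from bisect import bisect_left

# Same bin edges as in the cell above.
BINS = [0, 1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3,
        .005, .01, .02, .03, .04, .05, .06]


def pval_bin_code(pval, bins=BINS):
    """Return the PVALUEBINS node_id for pval, or None if no bin contains it.

    Bins are treated as right-closed intervals (low, high] (an assumption).
    """
    i = bisect_left(bins, pval)      # first index with bins[i] >= pval
    if i == 0 or i == len(bins):     # pval <= lowest edge or > highest edge
        return None
    return f"PVALUEBINS {bins[i - 1]}.{bins[i]}"


print(pval_bin_code(5.6862e-41))
print(pval_bin_code(0.03))
print(pval_bin_code(0.07))
```

Values outside the bin range return `None`, which corresponds to the NaN bins that the notebook filters out later.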

## Create both Pvalue bin nodes (and the relationships to/from GTEX_EQTL nodes) as well as putting actual pvalues in the `value` column. 

### Create p-value nodes

In [28]:
nodes_pval = pd.DataFrame(terms_pval['CodeID_PVAL'].drop_duplicates()).rename(columns={'CodeID_PVAL':'node_id'})
nodes_pval = fill_missing_cols(nodes_pval)
nodes_pval.head(3)

Unnamed: 0,node_id,node_namespace,unit,node_synonyms,lowerbound,node_definition,node_label,node_dbxrefs,value,upperbound
0,"PVALUEBINS 0,1e-12",,,,,,,,,
1,"PVALUEBINS 1e-12,1e-11",,,,,,,,,
2,"PVALUEBINS 1e-11,1e-10",,,,,,,,,


In [29]:
nodes_pval['node_id'] = [i.replace(',','.') for i in nodes_pval['node_id']]
#nodes_pval

### Create p-value edges. Use `pl.cut()` to map p_vals to bins

In [30]:
binned_pvals_pl = pl.cut(eqtl_all_GTEx['pval_beta'], bins)

eqtl_all_GTEx_saved = eqtl_all_GTEx

# get CodeID_gtex and concat with pvalues and the bins we just mapped
eqtl_all_GTEx = pl.concat([eqtl_all_GTEx.select(['CodeID_gtex']),binned_pvals_pl],how='horizontal')

In [31]:
#eqtl_all_GTEx_df = eqtl_all_GTEx.select(['CodeID_gtex','pval_beta']).to_pandas()
#pvals_2_map = list(eqtl_all_GTEx_df['pval_beta'].values)
#binned_pvals = pd.cut(pvals_2_map, bins)
#eqtl_all_GTEx['binned_pvals'] = binned_pvals.astype(str) # cojvert from categorical to str type
#eqtl_all_GTEx.head(2)

In [33]:
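# rename and cast to str
# Run this cell only once: after the rename the 'category' column no longer exists,
# so re-running it raises SchemaFieldNotFoundError.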
eqtl_all_GTEx = eqtl_all_GTEx.rename({"category": "binned_pvals"})\
                        .with_columns(pl.col('binned_pvals').cast(pl.Utf8, strict=False).alias('binned_pvals'))

SchemaFieldNotFoundError: category

In [31]:
#print(len(eqtl_all_GTEx))
#print(len(eqtl_all_GTEx['CodeID_gtex'].unique()))

In [34]:
eqtl_all_GTEx

CodeID_gtex,pval_beta,break_point,binned_pvals
str,f64,f64,str
"""GTEXEQTL eQTL.…",1.1447e-275,1.0000e-12,"""(0.0, 1.0e-12]…"
"""GTEXEQTL eQTL.…",1.1447e-275,1.0000e-12,"""(0.0, 1.0e-12]…"
"""GTEXEQTL eQTL.…",1.1447e-275,1.0000e-12,"""(0.0, 1.0e-12]…"
"""GTEXEQTL eQTL.…",1.1447e-275,1.0000e-12,"""(0.0, 1.0e-12]…"
"""GTEXEQTL eQTL.…",1.1447e-275,1.0000e-12,"""(0.0, 1.0e-12]…"
"""GTEXEQTL eQTL.…",1.1447e-275,1.0000e-12,"""(0.0, 1.0e-12]…"
"""GTEXEQTL eQTL.…",1.1447e-275,1.0000e-12,"""(0.0, 1.0e-12]…"
"""GTEXEQTL eQTL.…",1.1447e-275,1.0000e-12,"""(0.0, 1.0e-12]…"
"""GTEXEQTL eQTL.…",1.1447e-275,1.0000e-12,"""(0.0, 1.0e-12]…"
"""GTEXEQTL eQTL.…",1.1447e-275,1.0000e-12,"""(0.0, 1.0e-12]…"


In [32]:
print(len(eqtl_all_GTEx))

# Filter out pvals >= .05, they are resulting in NaNs
eqtl_all_GTEx = eqtl_all_GTEx.filter(pl.col('pval_beta') < 0.05)   

print(len(eqtl_all_GTEx))

# Filter out pvals >= .05, they are resulting in NaNs
#eqtl_all_GTEx_df = eqtl_all_GTEx_df[eqtl_all_GTEx_df['pval_beta'] <.05]
#assert not eqtl_all_GTEx_df['binned_pvals'].isna().sum()
#assert not 'nan' in eqtl_all_GTEx_df['binned_pvals']

assert not eqtl_all_GTEx.filter(pl.col('binned_pvals') == 'nan').shape[0]
assert not eqtl_all_GTEx.select(pl.col('binned_pvals').is_null()).sum().to_pandas().values[0][0]

2089174
2089169


In [33]:
eqtl_all_GTEx_df = eqtl_all_GTEx.to_pandas()

In [34]:
# Reformat and change name so we can merge on this column to get pval code IDs
# regex=False makes every replacement a literal match, independent of the pandas version
eqtl_all_GTEx_df['binned_pvals'] = eqtl_all_GTEx_df['binned_pvals'].str.replace(']',')',regex=False)

eqtl_all_GTEx_df = eqtl_all_GTEx_df.rename(columns={'binned_pvals':'bin'})

eqtl_all_GTEx_df['bin'] =  eqtl_all_GTEx_df['bin'].str.replace('1.0e','1e',regex=False)
eqtl_all_GTEx_df['bin'] =  eqtl_all_GTEx_df['bin'].str.replace('0.0,','0,',regex=False)

terms_pval['bin'] = terms_pval['bin'].astype(str)

terms_pval['bin'] = ['(1e-10, 1e-9)' if i == '(1e-10, 1e-09)' else i for i in terms_pval['bin'] ]
terms_pval['bin'] = ['(1e-9, 1e-8)' if i == '(1e-09, 1e-08)' else i for i in terms_pval['bin'] ]
terms_pval['bin'] =  ['(1e-8, 1e-7)' if i == '(1e-08, 1e-07)' else i for i in terms_pval['bin'] ]
terms_pval['bin'] = ['(1e-7, 1e-6)' if i == '(1e-07, 1e-06)' else i for i in terms_pval['bin'] ]
terms_pval['bin'] = ['(1e-6, 0.00001)' if i == '(1e-06, 1e-05)' else i for i in terms_pval['bin'] ]
terms_pval['bin'] = ['(0.00001, 0.0001)' if i == '(1e-05, 0.0001)' else i for i in terms_pval['bin'] ]
# '(1e-10, 1e-09)'  -- '(1e-10, 1e-9)'
# '(1e-09, 1e-08)' --  '(1e-9, 1e-8)'
# '(1e-08, 1e-07)' -- '(1e-8, 1e-7)'
# '(1e-07, 1e-06)' -- '(1e-7, 1e-6)'
# '(1e-06, 1e-05)' -- '(1e-6, 0.00001)'
# '(1e-05, 0.0001)' -- '(0.00001, 0.0001)'

assert not len(set(eqtl_all_GTEx_df['bin']) - set(terms_pval['bin']))

In [35]:
# Merge in pval code ids
eqtl_all_GTEx_df = pd.merge(eqtl_all_GTEx_df,terms_pval,on='bin',how='left')
#eqtl_all_GTEx_df

In [36]:
eqtl_all_GTEx_df['CodeID_PVAL'] = [i.replace(',','.') for i in eqtl_all_GTEx_df['CodeID_PVAL']]

In [37]:
eqtl_all_GTEx_df['predicate'] = 'p_value'

pval_edges = eqtl_all_GTEx_df[['CodeID_gtex','predicate','CodeID_PVAL']]
pval_edges.columns = ['subject','predicate','object']

pval_edges = pval_edges.drop_duplicates()
pval_edges = pl.from_pandas(pval_edges)

# Create eQTL edges file

In [38]:
# Create HGNC/UBERON code IDs
eqtl_all_GTEx = eqtl_all_GTEx_saved


In [39]:
eqtl_all_GTExP = eqtl_all_GTEx.to_pandas()

In [40]:
eqtl_all_GTExP['uberon_CodeID'] = ['UBERON ' + i if not i.startswith('EFO') else i 
                                     for i in eqtl_all_GTExP['UBERON_code'] ]

In [41]:
eqtl_all_GTExP['hgnc_codeID'] = ['HGNC ' + i for i in eqtl_all_GTExP['hgnc_id'] ]

In [42]:
eqtl_all_GTEx = pl.from_pandas(eqtl_all_GTExP)

In [43]:
#|eqtl_all_GTEx = eqtl_all_GTEx.with_columns(pl.map([pl.col('hgnc_id')], 
 #                     lambda s: "HGNC " +s[0]).alias("hgnc_codeID"))\
 #                            .with_columns(pl.map([pl.col('UBERON_code')], 
 #                     lambda s: "UBERON " +s[0]).alias("uberon_CodeID"))


### Make eqtl-HGNC and eqtl-UBERON edges
#### Took out the dbsnp (rsID) edges for now bc the bigger eqtl Data set that we're now using doesn't contain those mappings

In [44]:
#  Create predicates (relationship)
#eqtl_all_GTEx['gtex_hgnc_rel'] =    'RO_0001025'   #'located in'    old rel:  'eqtl_in_gene'
#eqtl_all_GTEx['gtex_uberon_rel'] =  'RO_0001025'   #'located in'    old rel: 'eqtl_in_tissue'
LOCATED_IN_RO_PURL = 'http://purl.obolibrary.org/obo/RO_0001025'

eqtl_all_GTEx = eqtl_all_GTEx.with_columns(pl.lit('located in').alias('gtex_hgnc_ub_rel'))

######## GTEX_EQTL -- HGNC ########
edges1 = eqtl_all_GTEx.select(['CodeID_gtex','gtex_hgnc_ub_rel','hgnc_codeID'])

# edges1.isna().sum()
#edges1 = edges1.dropna().reset_index(drop=True)  # drop rows where HGNC_ID is  NaN

######### GTEX_EQTL -- UBERON #######
edges2 = eqtl_all_GTEx.select(['CodeID_gtex','gtex_hgnc_ub_rel','uberon_CodeID'])

edges1.columns = edges2.columns = ['subject','predicate','object'] 
     
edges_eqtl = pl.concat([edges1,edges2])
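
For one eQTL record, the two edges built above look like this. The sketch mirrors the prefixing rules used in the earlier cells (`'HGNC '` plus the HGNC ID, `'UBERON '` plus the UBERON code unless the code starts with `EFO`); the function name `eqtl_edges` is made up for illustration.

```python
def eqtl_edges(code_id, hgnc_id, uberon_code, predicate='located in'):
    """Return the eQTL -> gene and eQTL -> tissue edges as (subject, predicate, object)."""
    # EFO codes are left as-is, everything else gets the 'UBERON ' SAB prefix.
    tissue_obj = uberon_code if uberon_code.startswith('EFO') else 'UBERON ' + uberon_code
    return [
        (code_id, predicate, 'HGNC ' + hgnc_id),
        (code_id, predicate, tissue_obj),
    ]


edges = eqtl_edges('GTEXEQTL eQTL.chr1.1434243.G.A.b38.Cells.Cultured.fibroblasts',
                   'HGNC:24007', 'EFO_0002009')
for edge in edges:
    print(edge)
```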

# Create GTEX_EQTL nodes
### (Dont need to make HGNC or UBERON nodes)

In [45]:
nodes_gtex_eqtl = eqtl_all_GTEx.select(['CodeID_gtex','pval_beta'])   # ,'pval_beta','unit'
nodes_gtex_eqtl.columns = ['node_id','value']

nodes_eqtl_gtex = fill_missing_cols(nodes_gtex_eqtl)

col_order = ['node_id', 'unit', 'node_namespace', 'value','lowerbound', 'node_synonyms',
 'node_label', 'node_dbxrefs', 'node_definition', 'upperbound']


nodes_pval = nodes_pval[col_order]
nodes_eqtl_gtex = nodes_eqtl_gtex.select(col_order)

nodes_eqtl = pl.concat([nodes_eqtl_gtex,pl.from_pandas(nodes_pval)]) 
nodes_eqtl = nodes_eqtl.unique()

# Create  eQTL to chromosomal ontology edges. 
### This is not a true 'mapping' because we already know where the eQTL is. We just have to assign it to its chromosomal interval, which we can do by rounding each location down to the start of its interval (the location minus the location modulo the interval size)
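
A minimal sketch of the interval assignment for a single location, assuming 1-based, closed intervals of 1,000 bases (`1-1000`, `1001-2000`, ...) in the same `chromosome.low-high` layout as the chromosomal ontology node IDs. The function name `interval_for` is hypothetical.

```python
def interval_for(chromosome, location, size=1_000):
    """Return 'chromosome.low-high' for the interval that contains location.

    Intervals are assumed to be 1-based and closed, so 1000 belongs to
    1-1000 and 1001 starts the next interval.
    """
    low = ((location - 1) // size) * size + 1
    high = low + size - 1
    return f"{chromosome}.{low}-{high}"


print(interval_for('chr1', 1434243))
print(interval_for('chr1', 1434000))
```

Subtracting 1 before the integer division keeps a location that is an exact multiple of the interval size inside the interval that ends at that location.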


In [46]:
# get just the gtex_eqtl nodes
#  name change is confusing, need to fix
eqtl_nodes = nodes_eqtl.filter(pl.col('node_id').str.starts_with(f'{SAB} eQTL'))
eqtl_nodes

# get chr and location from node_id field into their own columns
# using EQTL_CODE_DELIM here!!!! Defined above. Important to know what char 
# to split the node_id field by to get the location of the eQTL
eqtl_nodes = eqtl_nodes.with_columns(pl.col("node_id").str.split(EQTL_CODE_DELIM).alias('node_id_split'))\
                            .with_columns(pl.col('node_id_split').arr.get(1).alias('chromosome'))\
                            .with_columns(pl.col('node_id_split').arr.get(2).alias('location'))\
                            .with_columns(pl.col('location').cast(pl.Int64, strict=False).alias('location'))

In [105]:
#x = eqtl_nodes.select('node_id').sample(10000).to_pandas()['node_id']

#eqtl_nodes = eqtl_nodes.select('node_id')

In [51]:
chro_path='/Users/stearb/Desktop/DESKTOP_TRANSFER/DataDistilleryFiles/HSCLO/OWLNETS_node_metadata.txt'
#chro = pl.read_csv(chro_path,separator='\t')

In [52]:
chro =  pd.read_csv(chro_path,sep='\t')
chro

Unnamed: 0,node_id,node_label,node_definition,node_synonyms,node_dbxrefs,value,lowerbound,upperbound,unit
0,CHLO chr1.1-1000000,,,,,,,,
1,CHLO chr1.1000001-2000000,,,,,,,,
2,CHLO chr1.2000001-3000000,,,,,,,,
3,CHLO chr1.3000001-4000000,,,,,,,,
4,CHLO chr1.4000001-5000000,,,,,,,,
...,...,...,...,...,...,...,...,...,...
3431148,CHLO MtDNA.5001-6000,,,,,,,,
3431149,CHLO MtDNA.6001-7000,,,,,,,,
3431150,CHLO MtDNA.7001-8000,,,,,,,,
3431151,CHLO MtDNA.8001-9000,,,,,,,,


In [54]:
chro['node_id'] = [i.replace('CHLO','HSCLO') for i in chro['node_id']]
chro

Unnamed: 0,node_id,node_label,node_definition,node_synonyms,node_dbxrefs,value,lowerbound,upperbound,unit
0,HSCLO chr1.1-1000000,,,,,,,,
1,HSCLO chr1.1000001-2000000,,,,,,,,
2,HSCLO chr1.2000001-3000000,,,,,,,,
3,HSCLO chr1.3000001-4000000,,,,,,,,
4,HSCLO chr1.4000001-5000000,,,,,,,,
...,...,...,...,...,...,...,...,...,...
3431148,HSCLO MtDNA.5001-6000,,,,,,,,
3431149,HSCLO MtDNA.6001-7000,,,,,,,,
3431150,HSCLO MtDNA.7001-8000,,,,,,,,
3431151,HSCLO MtDNA.8001-9000,,,,,,,,


### Dynamically determine the `chrom-interval delimiter`. Node IDs used to look like `chr1:1-1000`, with `:` after the chromosome, but that has since been changed. We count the number of `:` characters in a sample of node IDs: if it is 0, `chrom_interval_delimiter` is set to `-`; otherwise it is set to `:`

In [55]:
# Load and format chromosomal ontology dataset
chro_path='/Users/stearb/Desktop/DESKTOP_TRANSFER/DataDistilleryFiles/HSCLO/OWLNETS_node_metadata.txt'
chro = pl.read_csv(chro_path,separator='\t')

if chro.select('node_id').head(1000).to_pandas()['node_id'].str.count(':').sum():
    print('chrom_interval_delimiter is ":"')
    chrom_interval_delimiter=':'
else:
    print('chrom_interval_delimiter is "-"')
    chrom_interval_delimiter='-'
    
# don't want mtDNA, and some rows are chromosome rows that we don't want (no chrom_interval_delimiter in those)
# node_id values start with the SAB (e.g. 'CHLO MtDNA.5001-6000'), so match 'MtDNA' anywhere in the string
chro = chro.filter(~pl.col('node_id').str.contains('MtDNA'))\
           .filter(pl.col('node_id').str.contains(chrom_interval_delimiter))


chrom_interval_delimiter is "-"


In [56]:
# Reformat cols so we can isolate chr/start/end into their own cols
chro = chro.with_columns(pl.col("node_id").str.split('.')\
                  .arr.to_struct(n_field_strategy="max_width").alias('temp')).unnest('temp')

chro.columns = ['node_id','node_label','node_definition','node_synonyms',
 'node_dbxrefs','value','lowerbound','upperbound','unit', 'chromosome','intervals']

chro = chro.with_columns(pl.col("intervals").str.split("-")\
                  .arr.to_struct(n_field_strategy="max_width").alias('temp')).unnest('temp')      

chro.columns = ['node_id','node_label','node_definition','node_synonyms',
 'node_dbxrefs','value','lowerbound','upperbound','unit', 'chromosome','intervals','low','high']

chro = chro.with_columns(pl.col('chromosome').str.replace('HSCLO ','').alias('chromosome'))

chro = chro.with_columns(pl.col('low').cast(pl.Int64, strict=False).alias('low'))\
            .with_columns(pl.col('high').cast(pl.Int64, strict=False).alias('high'))

chro = chro.with_columns((pl.col('high') - pl.col('low')).alias('diff'))
chro = chro.filter(pl.col('diff') <= 1_000)

chro = chro.with_columns(pl.map([pl.col('low'),pl.col('high')],lambda s: s[0]+','+s[1]).alias("bin"))

#chro.filter(pl.col('high_mod100')!=0)#.filter(pl.col('chromosome')=='chr1')
chro_ends = chro.filter(pl.col('diff')!=999)

chro = chro.filter(pl.col('diff')==999) # this gives the rows that contain the ending location of each chrom
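
The column splitting above can be checked on individual node IDs with a small regular expression. This sketch accepts either `.` or `:` between the chromosome and the interval and returns `None` for IDs without an interval; that chromosome-level IDs look like `CHLO chr1` is an assumption, and the function name `parse_interval_node` is hypothetical.

```python
import re

# SAB, a space, chromosome, '.' or ':', then low-high
_NODE_RE = re.compile(r'^(\S+) ([^.:\s]+)[.:](\d+)-(\d+)$')


def parse_interval_node(node_id):
    """Split an interval node_id into (sab, chromosome, low, high), or return None."""
    match = _NODE_RE.match(node_id)
    if match is None:
        return None
    sab, chromosome, low, high = match.groups()
    return sab, chromosome, int(low), int(high)


parsed = parse_interval_node('CHLO chr1.1000001-2000000')
print(parsed)
print(parsed[3] - parsed[2])   # high - low, the 'diff' column above
```

With this layout, 1 kb intervals have `high - low == 999`, which is the value the `diff` filter above keeps.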


In [57]:
#def rounddown(x):
#    x100 = x % 1_000 # will round down to nearest nth, whateverr this num is
#    #return x if not x100 else x + 10_000 - x100
#    return x if not x100 else x - x100

# Implement rounddown() in two steps (both column wise so super fast) on the eqtl_nodes dataframe.
# Intervals are 1-based and closed (1-1000, 1001-2000, ...), so subtract 1 before taking the modulo;
# otherwise a location that is an exact multiple of 1,000 is assigned to the following interval.
eqtl_nodes = eqtl_nodes.with_columns( ((pl.col('location') - 1) % 1_000).alias('mod1000'))

eqtl_nodes = eqtl_nodes.with_columns(
                   (pl.col('location') - 1 - pl.col("mod1000")).alias('low'))

In [58]:

eqtl_nodes = eqtl_nodes.with_columns((pl.col('low') + 1000).alias('high'))

eqtl_nodes = eqtl_nodes.with_columns((pl.col('low')+1).alias('low'))

eqtl_nodes = eqtl_nodes.with_columns(
                        pl.map([pl.col('chromosome'),pl.col('low'),pl.col('high')],
                               lambda s: s[0] + '.' + s[1] + '-' + s[2]).alias("chro_node_id"))

# ------ quality check ------------------------------------------  # rewrite so it reflects filtering
#eqtl_ids = eqtl_nodes.select('chro_node_id').unique()
#chro_ids = chro.select('node_id').unique()
#eqtl_ids_set = set(list(eqtl_ids.to_pandas()['chro_node_id'].values))
#chro_ids_set = set(list(chro_ids.to_pandas()['node_id'].values))
# check that all node_ids I made are in the main id list
#assert len(eqtl_ids_set & chro_ids_set) == len(eqtl_ids_set)         # uncomment after fixing 
# ----------------------------------------------------------------

# Add the relationship/predicate b/t the eqtl and chromosomal location
LOCATED_IN_RO_PURL = 'http://purl.obolibrary.org/obo/RO_0001025'

eqtl_nodes = eqtl_nodes.with_columns(pl.lit('located in').alias('predicate'))

edges_chrom_ont = eqtl_nodes.select('node_id','predicate','chro_node_id')
edges_chrom_ont.columns = ['subject','predicate','object']

# Add SAB to chromosome ontology Codes
edges_chrom_ont = edges_chrom_ont.with_columns(pl.concat_str('CHLO '+pl.col('object')).alias('object'))

# there are many dups here -- bc of multiple eqtls in each tissue -- we need that for the eqtl--tissue edges
# but not here, so drop dups.
edges_chrom_ont_dd = edges_chrom_ont.unique() # drop duplicates
edges_chrom_ont_dd.shape

(1240810, 3)

### Concat edges_eqtl (which has eqtl--uberon and eqtl--hgnc edges) with the eqtl--chromosomal_ontology edges and the pval edges

In [59]:
col_order_edges = ['subject','predicate','object']

edges_eqtl = edges_eqtl.select(col_order_edges)
edges_chrom_ont_dd = edges_chrom_ont_dd.select(col_order_edges)
pval_edges = pval_edges.select(col_order_edges)

# character to replace the colon with in the code, ie chr:1-1000

# make sure these chlo nodes match the ones below (check the format/delim is the same)
edges_chrom_ont_dd = edges_chrom_ont_dd.with_columns(pl.col('object').str.replace(':','.').alias('object'))
edges_eqtl = pl.concat([edges_eqtl,edges_chrom_ont_dd,pval_edges])

edges_eqtl = edges_eqtl.unique()

edges_eqtl.unique().shape

(5780111, 3)

### Add CHLO nodes to the eQTL node set so we can ingest GTEX_EQTL data without needing CHLO already in the graph.

In [60]:
chro_path='/Users/stearb/Desktop/DESKTOP_TRANSFER/DataDistilleryFiles/HSCLO/OWLNETS_node_metadata.txt'
chro = pl.read_csv(chro_path,separator='\t')
chro = fill_missing_cols(chro).select(col_order)

# make sure these chlo nodes match the ones above (check the format/delim is the same)
chro = chro.with_columns(pl.col('node_id').str.replace(':','.').alias('node_id'))

### check that CHLO codes in `edges_chrom_ont_dd.select('object')` match those in `chro.select('node_id')`

In [61]:
assert not len(set([i[0] for i in edges_chrom_ont_dd.select('object').to_pandas().values]) -\
set([i[0] for i in chro.select('node_id').to_pandas().values]))
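
The assertion above passes or fails without saying which codes are missing. A minimal sketch of the same set-difference check that also reports the offending identifiers is shown here on toy data; `missing_node_ids` is a hypothetical helper name, not something defined in the notebook.

```python
def missing_node_ids(edge_objects, node_ids):
    """Return edge targets that have no matching node, sorted for stable output."""
    return sorted(set(edge_objects) - set(node_ids))

# Toy identifiers in the 'SAB code' shape used in this notebook.
toy_edge_objects = ['CHLO MtDNA.5001-6000', 'CHLO MtDNA.6001-7000', 'CHLO MtDNA.5001-6000']
toy_node_ids = ['CHLO MtDNA.5001-6000', 'CHLO MtDNA.6001-7000', 'CHLO MtDNA.7001-8000']

missing = missing_node_ids(toy_edge_objects, toy_node_ids)
assert not missing, f'{len(missing)} edge targets have no node, e.g. {missing[:5]}'
```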

In [62]:
nodes_all = pd.concat([nodes_eqtl.to_pandas(),chro.to_pandas()])

In [63]:
#edges_eqtl.filter(pl.col('object').str.contains('PVALUE')).select('object').head().to_pandas()

In [64]:
nodes_all = nodes_all.drop_duplicates()
edges_eqtl = edges_eqtl.unique()

In [65]:
#edges_eqtl.select(pl.col('subject').str.startswith('CHLO'))
#eedf = edges_eqtl.to_pandas()

#eedf[eedf['object'].str.startswith('CHLO')]

In [66]:
#nodes_all[nodes_all['node_id'].str.startswith('CHLO')]

In [73]:
print('saving nodes and edges...')

nodes.to_csv('/Users/stearb/Desktop/DESKTOP_TRANSFER/DataDistilleryFiles/gtex/gtex_eqtl/OWLNETS_node_metadata.txt',
             sep='\t',index=False)

#edges_eqtl.to_pandas().to_csv('/Users/stearb/Desktop/DESKTOP_TRANSFER/DataDistilleryFiles/gtex/gtex_eqtl/OWLNETS_edgelist.txt',
#             sep='\t',index=False)

saving nodes and edges...


In [79]:
edges_eqtl = pd.read_csv(
    '/Users/stearb/Desktop/DESKTOP_TRANSFER/DataDistilleryFiles/gtex/gtex_eqtl/OWLNETS_edgelist.txt',sep='\t')
#edges_eqtl

In [80]:
edges_eqtl[edges_eqtl['object'].str.startswith('CHLO')]

#edges_eqtl['object'] = [i.replace('CHLO','HSCLO') for i in edges_eqtl['object']]


Unnamed: 0,subject,predicate,object


In [78]:
edges_eqtl.to_csv('/Users/stearb/Desktop/DESKTOP_TRANSFER/DataDistilleryFiles/gtex/gtex_eqtl/OWLNETS_edgelist.txt',
             sep='\t',index=False)

In [68]:
nodes = pd.read_csv('/Users/stearb/Desktop/DESKTOP_TRANSFER/DataDistilleryFiles/gtex/gtex_eqtl/OWLNETS_node_metadata.txt',
             sep='\t')
nodes

Unnamed: 0,node_id,unit,node_namespace,value,lowerbound,node_synonyms,node_label,node_dbxrefs,node_definition,upperbound
0,GTEXEQTL eQTL.chr1.1434243.G.A.b38.Cells.Cultured.fibroblasts,,,5.686160e-41,,,,,,
1,GTEXEQTL eQTL.chr1.1497758.C.T.b38.Cells.Cultured.fibroblasts,,,5.686160e-41,,,,,,
2,GTEXEQTL eQTL.chr1.1499000.C.A.b38.Cells.Cultured.fibroblasts,,,5.686160e-41,,,,,,
3,GTEXEQTL eQTL.chr1.1499128.C.T.b38.Cells.Cultured.fibroblasts,,,5.686160e-41,,,,,,
4,GTEXEQTL eQTL.chr1.1499639.G.T.b38.Cells.Cultured.fibroblasts,,,5.686160e-41,,,,,,
...,...,...,...,...,...,...,...,...,...,...
5478253,CHLO MtDNA.5001-6000,,,,,,,,,
5478254,CHLO MtDNA.6001-7000,,,,,,,,,
5478255,CHLO MtDNA.7001-8000,,,,,,,,,
5478256,CHLO MtDNA.8001-9000,,,,,,,,,


In [72]:
nodes['node_id'] = [i.replace('CHLO','HSCLO') for i in nodes['node_id']]
nodes

Unnamed: 0,node_id,unit,node_namespace,value,lowerbound,node_synonyms,node_label,node_dbxrefs,node_definition,upperbound
0,GTEXEQTL eQTL.chr1.1434243.G.A.b38.Cells.Cultured.fibroblasts,,,5.686160e-41,,,,,,
1,GTEXEQTL eQTL.chr1.1497758.C.T.b38.Cells.Cultured.fibroblasts,,,5.686160e-41,,,,,,
2,GTEXEQTL eQTL.chr1.1499000.C.A.b38.Cells.Cultured.fibroblasts,,,5.686160e-41,,,,,,
3,GTEXEQTL eQTL.chr1.1499128.C.T.b38.Cells.Cultured.fibroblasts,,,5.686160e-41,,,,,,
4,GTEXEQTL eQTL.chr1.1499639.G.T.b38.Cells.Cultured.fibroblasts,,,5.686160e-41,,,,,,
...,...,...,...,...,...,...,...,...,...,...
5478253,HSCLO MtDNA.5001-6000,,,,,,,,,
5478254,HSCLO MtDNA.6001-7000,,,,,,,,,
5478255,HSCLO MtDNA.7001-8000,,,,,,,,,
5478256,HSCLO MtDNA.8001-9000,,,,,,,,,
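
`i.replace('CHLO','HSCLO')` rewrites the substring wherever it occurs in the identifier. That is harmless for the IDs shown here, but a prefix-only rename is a little safer if a code ever contains the same letters. A minimal sketch, assuming every identifier has the form `SAB<space>code`; `rename_sab` is a hypothetical helper.

```python
def rename_sab(node_id: str, old: str = 'CHLO', new: str = 'HSCLO') -> str:
    """Swap the SAB only when it is the leading token of a 'SAB code' identifier."""
    sab, sep, code = node_id.partition(' ')
    if sep and sab == old:
        return f'{new} {code}'
    return node_id

examples = ['CHLO MtDNA.5001-6000',
            'GTEXEQTL eQTL.chr1.1434243.G.A.b38.Cells.Cultured.fibroblasts']
renamed = [rename_sab(i) for i in examples]
```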


# Now let's look at the Gene Median TPM dataset
File: GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct  
Description: This file contains genes as rows and tissues as columns. Unlike the other GTEx gene expression dataset we looked at, where the columns were individual samples (17,382 of them), the gene median dataset has only 54 columns, one per tissue type. It reports just the median expression level of each gene in each tissue.

In [7]:
gene_median_tpm = '/Users/stearb/desktop/DESKTOP_TRANSFER/R03_local/data/gtex/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct'

gct_obj=parse(gene_median_tpm)

df = gct_obj.data_df
print(df.shape)
df.head(3)

(56200, 54)


cid,Adipose - Subcutaneous,Adipose - Visceral (Omentum),Adrenal Gland,Artery - Aorta,Artery - Coronary,Artery - Tibial,Bladder,Brain - Amygdala,Brain - Anterior cingulate cortex (BA24),Brain - Caudate (basal ganglia),Brain - Cerebellar Hemisphere,Brain - Cerebellum,Brain - Cortex,Brain - Frontal Cortex (BA9),Brain - Hippocampus,Brain - Hypothalamus,Brain - Nucleus accumbens (basal ganglia),Brain - Putamen (basal ganglia),Brain - Spinal cord (cervical c-1),Brain - Substantia nigra,Breast - Mammary Tissue,Cells - Cultured fibroblasts,Cells - EBV-transformed lymphocytes,Cervix - Ectocervix,Cervix - Endocervix,Colon - Sigmoid,Colon - Transverse,Esophagus - Gastroesophageal Junction,Esophagus - Mucosa,Esophagus - Muscularis,Fallopian Tube,Heart - Atrial Appendage,Heart - Left Ventricle,Kidney - Cortex,Kidney - Medulla,Liver,Lung,Minor Salivary Gland,Muscle - Skeletal,Nerve - Tibial,Ovary,Pancreas,Pituitary,Prostate,Skin - Not Sun Exposed (Suprapubic),Skin - Sun Exposed (Lower leg),Small Intestine - Terminal Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole Blood
ENSG00000223972.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166403,0.0,0.0,0.0,0.0
ENSG00000227232.5,4.06403,3.37111,2.68549,4.04762,3.90076,3.63963,5.16375,1.43859,1.69285,1.56605,4.99231,5.72099,2.48317,2.14667,1.68599,1.74811,1.53899,1.44167,2.73049,1.74194,4.43876,1.6786,2.49477,5.62935,7.09749,4.64777,3.59509,4.32641,3.11749,4.10335,6.13409,1.52031,0.924962,2.77081,2.21451,1.76541,4.50841,3.52767,1.41667,6.68531,6.6341,1.80871,5.42546,7.08318,5.93298,6.13265,4.19378,5.92631,3.06248,4.70253,6.27255,7.19001,5.74554,2.64743
ENSG00000278267.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
# Show names of the fields for the object
#gct_obj.__dict__.keys()

In [13]:
# Show row (gene) metadata (Column (tissue) metadata is just the column names)
#gct_obj.row_metadata_df.head(3)

In [9]:
# Reformat and rename both axes, also use stack to flatten the dataframe (make tissues into a column)
medgene_flat = df.stack().reset_index().rename(columns={'rid':'Transcript ID',
                                                        'cid':'tissue',0:'Median_TPM'})

# Reformat tissue strings so they match the UBERON mappings data
medgene_flat['tissue'] = medgene_flat['tissue'].str.replace(' - ',' ').str.replace('\(','').str.replace('\)','')
print(medgene_flat.shape)
medgene_flat.head(3)

(3034800, 3)


Unnamed: 0,Transcript ID,tissue,Median_TPM
0,ENSG00000223972.5,Adipose Subcutaneous,0.0
1,ENSG00000223972.5,Adipose Visceral (Omentum),0.0
2,ENSG00000223972.5,Adrenal Gland,0.0
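
The output above still shows `Adipose Visceral (Omentum)` with its parentheses. This suggests (an inference from the displayed output, not something verified against the notebook's environment) that the escaped patterns `'\('` and `'\)'` were matched literally rather than as regular expressions. A minimal sketch of the same flatten-and-clean step on toy data, passing `regex=False` so the result does not depend on the pandas default:

```python
import pandas as pd

# Toy wide table: genes as rows, tissues as columns (values copied from the head above).
wide = pd.DataFrame(
    {'Adipose - Subcutaneous': [0.0, 4.06403],
     'Adipose - Visceral (Omentum)': [0.0, 3.37111]},
    index=pd.Index(['ENSG00000223972.5', 'ENSG00000227232.5'], name='rid'),
)
wide.columns.name = 'cid'

# stack() moves the tissue columns into rows, giving one row per gene-tissue pair.
flat = (wide.stack().reset_index()
            .rename(columns={'rid': 'Transcript ID', 'cid': 'tissue', 0: 'Median_TPM'}))

# regex=False treats each pattern as a literal string, whatever the pandas version.
flat['tissue'] = (flat['tissue']
                  .str.replace(' - ', ' ', regex=False)
                  .str.replace('(', '', regex=False)
                  .str.replace(')', '', regex=False))
```

If the tissue names in the UBERON mapping table have no parentheses, labels that keep them will not match in the later merge on `tissue`.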


#### Look at row (gene) metadata and map them to HGNC IDs

In [10]:
gtx_genes = gct_obj.row_metadata_df['Description'].to_frame()
gtx_genes['Transcript ID'] = gtx_genes.index
gtx_genes.rename(columns={'Description':'symbol'},inplace=True)
gtx_genes.reset_index(drop=True, inplace=True)

print(f'# of unique genes: {gct_obj.row_metadata_df["Description"].nunique()} (some overlap)')
print(f'# of unique transcript IDs: {gtx_genes["Transcript ID"].nunique()} (no overlap)')
#gtx_genes.head()

# of unique genes: 54592 (some overlap)
# of unique transcript IDs: 56200 (no overlap)


In [14]:
gtx_genes

Unnamed: 0,symbol,Transcript ID
0,DDX11L1,ENSG00000223972.5
1,WASH7P,ENSG00000227232.5
2,MIR6859-1,ENSG00000278267.1
3,MIR1302-2HG,ENSG00000243485.5
4,FAM138A,ENSG00000237613.2
...,...,...
56195,MT-ND6,ENSG00000198695.2
56196,MT-TE,ENSG00000210194.1
56197,MT-CYB,ENSG00000198727.2
56198,MT-TT,ENSG00000210195.2


In [11]:
# how much overlap is there with the HGNC master genes list and the gtx_genes?
#venn2([set(gtx_genes['symbol']),
#       set(hgnc_master['symbol'])],
#       set_labels = ('GTEx genes', 'HGNC master')); plt.show()

### Use the master hgnc list from the website and filter by gtx_genes so we only get the overlap

In [12]:
hgnc_master

Unnamed: 0,hgnc_id,symbol,name,locus_group,locus_type,status,location,location_sortable,alias_symbol,alias_name,prev_symbol,prev_name,gene_group,gene_group_id,date_approved_reserved,date_symbol_changed,date_name_changed,date_modified,entrez_id,ensembl_gene_id,vega_id,ucsc_id,ena,refseq_accession,ccds_id,uniprot_ids,pubmed_id,mgd_id,rgd_id,lsdb,cosmic,omim_id,mirbase,homeodb,snornabase,bioparadigms_slc,orphanet,pseudogene.org,horde_id,merops,imgt,iuphar,kznf_gene_catalog,mamit-trnadb,cd,lncrnadb,enzyme_id,intermediate_filament_db,rna_central_ids,lncipedia,gtrnadb,agr,mane_select,gencc
0,HGNC:5,A1BG,alpha-1-B glycoprotein,protein-coding gene,gene with protein product,Approved,19q13.43,19q13.43,,,,,Immunoglobulin like domain containing,594,1989-06-30,,,2023-01-20,1.0,ENSG00000121410,OTTHUMG00000183507,uc002qsd.5,,NM_130786,CCDS12976,P04217,2591067,MGI:2152878,RGD:69417,,,138670,,,,,,,,I43.950,,,,,,,,,,,,HGNC:5,ENST00000263100.8|NM_130786.4,
1,HGNC:37133,A1BG-AS1,A1BG antisense RNA 1,non-coding RNA,"RNA, long non-coding",Approved,19q13.43,19q13.43,FLJ23569,,NCRNA00181|A1BGAS|A1BG-AS,non-protein coding RNA 181|A1BG antisense RNA (non-protein coding)|A1BG antisense RNA 1 (non-protein coding),Antisense RNAs,1987,2009-07-20,2010-11-25,2012-08-15,2013-06-27,503538.0,ENSG00000268895,OTTHUMG00000183508,uc002qse.3,BC040926,NR_015380,,,,,,,,,,,,,,,,,,,,,,,,,,A1BG-AS1,,HGNC:37133,,
2,HGNC:24086,A1CF,APOBEC1 complementation factor,protein-coding gene,gene with protein product,Approved,10q11.23,10q11.23,ACF|ASP|ACF64|ACF65|APOBEC1CF,,,,RNA binding motif containing,725,2007-11-23,,,2023-01-20,29974.0,ENSG00000148584,OTTHUMG00000018240,uc057tgv.1,AF271790,NM_014576,CCDS7242|CCDS7241|CCDS73133|CCDS7243,Q9NQ94,11815617|11072063,MGI:1917115,RGD:619834,,,618199,,,,,,,,,,,,,,,,,,,,HGNC:24086,ENST00000373997.8|NM_014576.4,
3,HGNC:7,A2M,alpha-2-macroglobulin,protein-coding gene,gene with protein product,Approved,12p13.31,12p13.31,FWP007|S863-7|CPAMD5,,,,Alpha-2-macroglobulin family,2148,1986-01-01,,,2023-01-20,2.0,ENSG00000175899,OTTHUMG00000150267,uc001qvk.2,BX647329|X68728|M11313,NM_000014,CCDS44827,P01023,2408344|9697696,MGI:2449119,RGD:2004,LRG_591|http://ftp.ebi.ac.uk/pub/databases/lrgex/LRG_591.xml,,103950,,,,,,,,I39.001,,,,,,,,,,,,HGNC:7,ENST00000318602.12|NM_000014.6,HGNC:7
4,HGNC:27057,A2M-AS1,A2M antisense RNA 1,non-coding RNA,"RNA, long non-coding",Approved,12p13.31,12p13.31,,,,A2M antisense RNA 1 (non-protein coding)|A2M antisense RNA 1,Antisense RNAs,1987,2012-06-23,,2018-03-21,2018-03-21,144571.0,ENSG00000245105,OTTHUMG00000168289,uc009zgj.2,,NR_026971,,,,,,,,,,,,,,,,,,,,,,,,,,A2M-AS1,,HGNC:27057,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43648,HGNC:25820,ZYG11B,"zyg-11 family member B, cell cycle regulator",protein-coding gene,gene with protein product,Approved,1p32.3,01p32.3,FLJ13456,,ZYG11,zyg-11 homolog (C. elegans)|zyg-11 homolog B (C. elegans),ZYG11 cell cycle regulator family|Armadillo like helical domain containing,6|1492,2005-06-06,2005-07-11,2012-12-10,2023-01-20,79699.0,ENSG00000162378,OTTHUMG00000008938,uc001cuj.4,AB051517,NM_024646,CCDS30717,Q9C0D3,11214970,MGI:2685277,RGD:1307814,,,618673,,,,,,,,,,,,,,,,,,,,HGNC:25820,ENST00000294353.7|NM_024646.3,HGNC:25820
43649,HGNC:13200,ZYX,zyxin,protein-coding gene,gene with protein product,Approved,7q34,07q34,,,,,Zyxin family|MicroRNA protein coding host genes,1402|1691,1997-10-16,,,2023-01-20,7791.0,ENSG00000159840,OTTHUMG00000023822,uc003wcx.5,X95735,NM_003461,CCDS5883,Q15942,8917469|8940160,MGI:103072,RGD:620698,,,602002,,,,,,,,,,,,,,,,,,,,HGNC:13200,ENST00000322764.10|NM_003461.5,
43650,HGNC:51695,ZYXP1,zyxin pseudogene 1,pseudogene,pseudogene,Approved,8q24.23,08q24.23,,,,,,,2015-05-07,,,2015-05-07,106480342.0,ENSG00000274572,OTTHUMG00000187758,,,NG_044600,,,,,,,,,,,,,,PGOHUM00000303383,,,,,,,,,,,,,,HGNC:51695,,
43651,HGNC:29027,ZZEF1,zinc finger ZZ-type and EF-hand domain containing 1,protein-coding gene,gene with protein product,Approved,17p13.2,17p13.2,KIAA0399|ZZZ4|FLJ10821,,,"zinc finger, ZZ-type with EF hand domain 1",Zinc fingers ZZ-type|EF-hand domain containing,91|863,2004-02-27,,2016-02-12,2023-01-20,23140.0,ENSG00000074755,OTTHUMG00000090741,uc002fxe.4,BC035319,NM_015113,CCDS11043,O43149,9455477,MGI:2444286,RGD:1311189,,,619459,,,,,,,,,,,,,,,,,,,,HGNC:29027,ENST00000381638.7|NM_015113.4,


In [15]:
# Merge in HGNC IDs (and other gene info) from hgnc_master, matching on gene symbol
cols2include = ['symbol','hgnc_id','name','locus_group','locus_type','location']
                                        #,'entrez_id','ensembl_gene_id','uniprot_ids']
symbols_hgnc=pd.merge(left=gtx_genes,right=hgnc_master[cols2include],   # include more gene IDs/gene data here
                           how='inner',
                           on='symbol')#.dropna()#.drop_duplicates('Transcript ID')

#symbols_hgnc.head(3)

#### Now map column names (tissues) to UBERON codes on medgene_flat

In [16]:
medgene_flat_ub = pd.merge(left=medgene_flat,right=SAMPID_2_TISSUE_unique,on='tissue')
#medgene_flat_ub.head(3)
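
`pd.merge` defaults to an inner join, so any tissue label without an exact match in `SAMPID_2_TISSUE_unique` is dropped silently. A minimal sketch of how unmatched labels could be listed before merging, using `indicator=True`; the frames and the second `SMUBRID` value are toy placeholders, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins; the real SAMPID_2_TISSUE_unique table is built earlier in the notebook.
flat = pd.DataFrame({'tissue': ['Adipose Subcutaneous', 'Adipose Visceral (Omentum)'],
                     'Median_TPM': [4.06403, 3.37111]})
tissue_map = pd.DataFrame({'tissue': ['Adipose Subcutaneous', 'Adipose Visceral Omentum'],
                           'SMUBRID': ['0002190', '0000000']})  # second ID is a placeholder

# A left join with indicator=True marks rows that found no partner as 'left_only'.
checked = flat.merge(tissue_map, on='tissue', how='left', indicator=True)
unmatched = sorted(checked.loc[checked['_merge'] == 'left_only', 'tissue'].unique())
```

An empty `unmatched` list would confirm that the inner join loses no tissues.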

In [17]:
# Join flattened median gene dataset with the HGNC IDs (and other gene info)
medgene_merge  = pd.merge(left=medgene_flat_ub,
                     right=symbols_hgnc,
                     how='inner',
                     on='Transcript ID').dropna()

print(medgene_merge.shape)
#medgene_merge.head(3)

(1573380, 11)


### SKIP THIS---DONT NEED CUIs---Merge in HGNC CUIs from the umls_genes (HGNC_IDs--HGNC CUI mapping)

In [27]:
#medgene_merge_2  =  pd.merge(left=umls_genes,right=medgene_merge,on='hgnc_id')
#assert medgene_merge.shape[0] == medgene_merge_2.shape[0]

In [18]:
medgene_merge_2 = medgene_merge

In [19]:
### Rename the GTEx SMUBRID column to UBERON_code (no CUI merge is needed here)
medgene_merge_2.rename(columns={'SMUBRID':'UBERON_code'},inplace=True)

# Drop rows where UBERON code starts with 'EFO', which drops 70,738 rows
#medgene_merge_2 = medgene_merge_2[~medgene_merge_2['UBERON_code'].str.contains('EFO')]

In [20]:
# Append UBERON to front of UBERON code unless it starts with EFO, then leave it alone
medgene_merge_2['UBERON_CODEID'] = ['UBERON '+i if 'EFO_' not in i else i for i in medgene_merge_2['UBERON_code'] ]

In [21]:
# The prefix is added to UBERON_CODEID, so that is the column to check for a stray 'UBERON EFO_'
assert len(medgene_merge_2[medgene_merge_2['UBERON_CODEID'].str.startswith('UBERON EFO_')]) == 0
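
The prefixing rule can be sketched as a small function, which makes it easy to test on its own. This is an illustration on toy codes; `to_code_id` is a hypothetical name and the EFO identifier is only an example value.

```python
def to_code_id(code: str, sab: str = 'UBERON') -> str:
    """Prefix a bare code with its SAB; leave EFO_ identifiers unchanged."""
    return code if 'EFO_' in code else f'{sab} {code}'

codes = ['0002190', 'EFO_0000572']  # illustrative values
code_ids = [to_code_id(c) for c in codes]

# No EFO identifier should have picked up the UBERON prefix.
assert not any(c.startswith('UBERON EFO_') for c in code_ids)
```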

In [22]:
# We need to manually add the UMLS CUIs for some of these UBERON codes (again).
# As a reminder, they do not exist in UMLS as UBERON CUIs, which we searched for,
# bc they were in UMLS (from NCIT) before UBERON was added (meaning they didn't appear
# in the search when I searched for UBERON Concepts, so I need to add them manually here).

#extra_uberon_xref =  pd.DataFrame( [['0000458','C0227837'], # endocervix
#                                   ['0001255','C0005682'],  # urinary bladder
#                                   ['0003889','C0015560'],  # fallopian tube
#                                   ['0012249','C0227829']], # ectocervix
#                                     columns=['UBERON_code','UBERON_CUI'])

#umls_uberon_xref_2 = umls_uberon_xref.append(extra_uberon_xref) # add these extra UBERON Code-CUI mappings to the rest

### SKIP THIS---DONT NEED CUIs--- Merge in UBERON CUIs

In [23]:
#medgene_merge_3  =  pd.merge(left=medgene_merge_2,right=umls_uberon_xref_2,on='UBERON_code') 
#assert medgene_merge_2.shape[0]==medgene_merge_3.shape[0]

In [24]:
medgene_merge_3 = medgene_merge_2

### The last thing to do for the median_gene_tpm data set is to add unique code/Concept node identifiers

In [25]:
medgene_merge_3['tissue_no_space'] = [i.replace(' ','-') for i in medgene_merge_3['tissue']]
# regex=False so '.' is treated as a literal period, not the regex wildcard
medgene_merge_3['Transcript ID DASH'] =  medgene_merge_3['Transcript ID'].str.replace('.','-',regex=False)

In [45]:
# Turn the  'Median_TPM' into strings so we can include them in the unique hash ID 
median_tpm_strs = [str(i) for i in medgene_merge_3['Median_TPM']]


In [46]:

# No single column is 100% unique, so we combine the transcript ID and tissue
# columns to create unique code strings.
medgene_merge_3['CODE_GTEX_Expression'] = medgene_merge_3['Transcript ID DASH'] + '-' + \
                                                    medgene_merge_3['tissue_no_space'] 


In [47]:

medgene_merge_3['CodeID_GTEX_Expression'] = ['GTEXEXP '+i for i in medgene_merge_3['CODE_GTEX_Expression']]

# USE BASE-64 METHOD
# create a 'GTEX EXP' CUI using a hash on that column.
#medgene_merge_3['CUI_GTEX_Expression']  = ['KC' + str(int(hashlib.sha256(uid.encode('utf8')).hexdigest(),
#                                                      base=16))[:CUI_LEN] for uid in medgene_merge_3['CODE_GTEX_Expression']]

# Make CUIs
#medgene_merge_3['CUI_GTEX_Expression'] = [i for i in CUIbase64(medgene_merge_3['CodeID_GTEX_Expression'])]

# Check for collisions
assert len(medgene_merge_3['CODE_GTEX_Expression'].unique()) == medgene_merge_3.shape[0]
assert medgene_merge_3['CODE_GTEX_Expression'].unique().shape == medgene_merge_3['CodeID_GTEX_Expression'].unique().shape 
#assert  medgene_merge_3['CODE_GTEX_Expression'].unique().shape== medgene_merge_3['CUI_GTEX_Expression'].unique().shape
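
The uniqueness assertions above only compare counts. A minimal sketch of the same code construction on toy rows, using `collections.Counter` so that any colliding codes are named in the error message; `make_code` is a hypothetical helper that mirrors the string operations in the cells above.

```python
from collections import Counter

def make_code(transcript_id: str, tissue: str) -> str:
    """Join a dash-normalised Ensembl ID and tissue name into one code string."""
    return transcript_id.replace('.', '-') + '-' + tissue.replace(' ', '-')

rows = [('ENSG00000223972.5', 'Testis'),
        ('ENSG00000227232.5', 'Adipose Subcutaneous'),
        ('ENSG00000227232.5', 'Adrenal Gland')]
codes = [make_code(t, s) for t, s in rows]
code_ids = ['GTEXEXP ' + c for c in codes]

# Report the offending codes instead of only asserting on counts.
collisions = [c for c, n in Counter(codes).items() if n > 1]
assert not collisions, f'duplicate codes: {collisions[:5]}'
```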

In [48]:
#  Select just the columns we need (CUIs and CodeIDs)
#medgene_select = medgene_merge_3[['CUI_GTEX_Expression','CODE_GTEX_Expression','CodeID_GTEX_Expression','CUI_hgnc','Median_TPM','UBERON_CUI']]

medgene_select = medgene_merge_3#[['CUI_GTEX_Expression','CODE_GTEX_Expression','CodeID_GTEX_Expression','CUI_hgnc','Median_TPM','UBERON_CUI']]
#medgene_select.head(3)

### There are 2 CUI-CUI  mappings and 1 CUI-CodeID mapping we need for the GTEX EXP CUIs
- CUI-CUIs
    - GTEX_Expression CUI -- HGNC CUI
    - GTEX_Expression CUI -- UBERON CUI
- CUI-CODE
    - GTEX_Expression CUI -- GTEX_Expression CodeID

CUIs (combine the Expression CUIs with the eQTL CUIs and save)

In [49]:
#GTEX_Ex_CUIs = pd.DataFrame(np.transpose([medgene_select['CUI_GTEX_Expression'].drop_duplicates().values]
#                                          ),columns=['CUI:ID']) 

#CUIs_all_eqtl_2=CUIs_all_eqtl.to_frame().rename(columns={0:'CUI:ID'})#,inplace=True)

# Combine eqtl CUIs with med expression CUIs
#CUIs_all = CUIs_all_eqtl_2.append(GTEX_Ex_CUIs)

#assert CUIs_all.shape  == CUIs_all.drop_duplicates().shape 
#CUIs_all.to_csv('/Users/stearb/desktop/R03_local/data/ingest_files/GTEx/CUIs_GTEx.csv',index=False)

CUI-CUIs (and combine the Expression CUI-CUIs with the eQTL CUI-CUIs and save)

In [50]:
%%capture
'''
# GTEX_EXP --> HGNC 
GTEX_Ex_2_HGNC = medgene_select[['CUI_GTEX_Expression','CUI_hgnc']].rename(columns={'CUI_GTEX_Expression':':START_ID','CUI_hgnc':':END_ID'})
GTEX_Ex_2_HGNC[':TYPE'] = 'median_expression_in_gene'
GTEX_Ex_2_HGNC['SAB'] = 'GTEX_EXP__HGNC'

# INVERSE, HGNC --> GTEX_EXP
HGNC_2_GTEX_Ex = medgene_select[['CUI_hgnc','CUI_GTEX_Expression']].rename(columns={'CUI_GTEX_Expression':':END_ID','CUI_hgnc':':START_ID'})
HGNC_2_GTEX_Ex[':TYPE'] = 'gene_has_median_expression'
HGNC_2_GTEX_Ex['SAB'] = 'GTEX_EXP__HGNC'

############################################
############################################

# GTEX_EXP --> UBERON
GTEX_Ex_2_UBERON  = medgene_select[['CUI_GTEX_Expression','UBERON_CUI']].rename(columns={'CUI_GTEX_Expression':':START_ID','UBERON_CUI':':END_ID'})
GTEX_Ex_2_UBERON[':TYPE'] = 'median_expression_in_tissue'
GTEX_Ex_2_UBERON['SAB'] = 'GTEX_EXP__UBERON'


# UBERON -->  GTEX_EXP
UBERON_2_GTEX_Ex  = medgene_select[['UBERON_CUI','CUI_GTEX_Expression']].rename(columns={'CUI_GTEX_Expression':':END_ID','UBERON_CUI':':START_ID'})
UBERON_2_GTEX_Ex[':TYPE'] = 'tissue_has_median_expression'
UBERON_2_GTEX_Ex['SAB'] = 'GTEX_EXP__UBERON'


# Combine CUI-CUIs
GTEX_Ex_CUI_CUI = pd.concat([GTEX_Ex_2_HGNC,HGNC_2_GTEX_Ex,GTEX_Ex_2_UBERON,UBERON_2_GTEX_Ex])


# Add in the eqtl CUI-CUIs
#CUI_CUI_eqtls.rename(columns={'CUI_gtex':':START_ID','CUI':':END_ID'},inplace=True)

# Combine with the eQTL CUI-CUIs
CUI_CUIs_all = pd.concat([GTEX_Ex_CUI_CUI, CUI_CUI_eqtls]) 

# Check that there are no duplicates when we combine the 2 CUI-CUI dataframes
assert CUI_CUIs_all.duplicated().sum() == 0

#CUI_CUIs_all.to_csv('/Users/stearb/desktop/R03_local/data/ingest_files/GTEx/CUI-CUI_GTEx.csv',index=False)
'''

CODEs (and combine the Expression Codes with the eQTL Codes and save)

In [51]:
#GTEX_Ex_CODEs = medgene_select[['CodeID_GTEX_Expression','CODE_GTEX_Expression']].rename(columns={'CodeID_GTEX_Expression':'CodeID',
#                                                                                   'CODE_GTEX_Expression':'CODE'})
#GTEX_Ex_CODEs['SAB'] = 'GTEX_EXP'

#GTEX_Ex_CODEs = GTEX_Ex_CODEs[['CodeID','SAB','CODE']] 

# Add in eqtl CODEs
#CODEs_all = pd.concat([GTEX_Ex_CODEs,gtex_codes_eqtl])

#assert CODEs_all.shape == CODEs_all.drop_duplicates().shape
#assert CODEs_all.nunique()['CodeID'] == CODEs_all.nunique()['CODE']

#CODEs_all.to_csv('/Users/stearb/desktop/R03_local/data/ingest_files/GTEx/CODEs_GTEx.csv',index=False)

###  CUI_CODEs (and combine the Expression CUI-Codes with the eQTL CUI-Codes and save)
### fix duplicates issues
There aren't duplicates in GTEX_Expression CUI_CODE and there aren't duplicates in GTEX eqtl CUI_CODE

In [52]:
#GTEX_Ex_CUI_CODE  = medgene_select[['CUI_GTEX_Expression',
#                                    'CodeID_GTEX_Expression']].rename(columns={'CUI_GTEX_Expression':'CUI',
#                                                                               'CodeID_GTEX_Expression':'CODE'})
# Add in eqtl CUI-CODEs
#GTEX_CUI_CODEs_all  = pd.concat([GTEX_Ex_CUI_CODE,GTEX_eqtl_CUI_CODEs])

#assert GTEX_CUI_CODEs_all.shape == GTEX_CUI_CODEs_all.drop_duplicates().shape

# No Overlap between GTEX eqtl CUIs and GTEX Expression CUIs
#assert set() == set(GTEX_eqtl_CUI_CODEs['CUI']).intersection(set(GTEX_Ex_CUI_CODE['CUI']))

#GTEX_CUI_CODEs_all.to_csv('/Users/stearb/desktop/R03_local/data/ingest_files/GTEx/CUI-CODEs_GTEx.csv',index=False)

There is actually no overlap between the SUI:IDs or the names/Terms between the EQTL SUIs (eqtl pvals) and the EXP SUIs (TPM values)

In [53]:
#venn2([set(GTEX_EXP_SUIs['SUI:ID']),set(GTEX_EQTL_SUIs['SUI:ID'])])
#venn2([set(GTEX_EXP_SUIs['name']),set(GTEX_EQTL_SUIs['name'])])

In [54]:
#medgene_select['Median_TPM'].hist(bins=100)    
# less than 2,000  = 1,837,599
#  total = 1,839,188
# zeros = 742,955
#medgene_select[(medgene_select['Median_TPM'] > 50000)]['Median_TPM'].hist()
#max(medgene_select[(medgene_select['Median_TPM'] != 0)]['Median_TPM'])

### Put TPM values into bins

In [55]:
# Only 2 TPM values below 0.0007

tpm_bins = list([0.0000000,7e-4,8e-4,9e-4]) + list(np.linspace(1e-3,9e-3,9)) + \
           list(np.round(np.linspace(1e-2,9e-2,9),2)) + list(np.round(np.linspace(.1,1,10),2)) + \
           list(np.linspace(2,100,99)) + list(np.arange(100,1100,100)[1:]) +  \
            list(np.arange(2000,11000,1000)) + list(np.arange(20000,110000,10000)) + [300000]

# Separate main df into df's where 'Median_TPM' is == 0 , and where 'Median_TPM' is != 0
# (.copy() so the column assignments below do not act on a view of medgene_select)
tpm_0s = medgene_select[medgene_select['Median_TPM'] == 0.00].copy()
tpm_intervals = medgene_select[medgene_select['Median_TPM'] != 0.00].copy()

# Create TPM Bins  
tpm_intervals['tpm_bins'] = pd.cut(tpm_intervals['Median_TPM'], tpm_bins)

# Create SUIs for intervals
#tpm_intervals['SUI_tpm_bins']  = ['KS' + str(int(hashlib.sha256(str(uid).encode('utf8')).hexdigest(),base=16))[:SUI_LEN] for uid in tpm_intervals['tpm_bins']]

# Save expression bins as strings first
bin_strings_exp = tpm_intervals['tpm_bins'].astype(str)

# Create SUIs the new way.
#tpm_intervals['SUI_tpm_bins'] = [i for i in CUIbase64(bin_strings_exp)]


# Separate lower bounds and upper bounds of intervals into 2 columns, unless it's NaN (float),
# then just set both bounds to 0.0
# CHECK WHAT THE FLOATS IN HERE ARE, THEY SHOULDN'T BE 0's
tpm_intervals['bins_lowerbound'] = [i.left if type(i) is not float else 0.0 for i in tpm_intervals['tpm_bins'] ]
tpm_intervals['bins_upperbound'] = [i.right if type(i) is not float else 0.0 for i in tpm_intervals['tpm_bins'] ]

tpm_intervals.drop('tpm_bins',axis=1,inplace=True)

# Create SUIs for  TPM == 0. It will be the same SUI (the 0.0 SUI) for all of these
#tpm_0s['SUI_tpm_bins']  = ['KS' + str(int(hashlib.sha256(str(uid).encode('utf8')).hexdigest(),base=16))[:SUI_LEN] for uid in tpm_0s['Median_TPM']]

bin_strings_exp_0 = tpm_0s['Median_TPM'].astype(str)

#tpm_0s['SUI_tpm_bins'] = [i for i in CUIbase64(bin_strings_exp_0)]

# Add columns to concat with main tpm_intervals
tpm_0s['bins_upperbound'] = 0.0
tpm_0s['bins_lowerbound'] = 0.0

# Join them back together
tpm_all_intervals =  pd.concat([tpm_intervals,tpm_0s]).reset_index(drop=True)
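
By default `pd.cut` builds right-closed intervals such as `(4.0, 5.0]`, so a value equal to the lowest edge, or above the highest edge, falls outside every bin and comes back as NaN. That is one possible source of the float entries flagged in the comment above. A minimal sketch on a shortened, illustrative edge list (not the notebook's `tpm_bins`), keeping NaN bounds visible instead of mapping them to 0.0:

```python
import numpy as np
import pandas as pd

# A shortened edge list for illustration; the notebook's tpm_bins is much finer.
toy_bins = [0.0, 0.1, 0.2, 1.0, 2.0, 3.0, 4.0, 5.0]
tpm = pd.Series([0.0, 0.166403, 4.06403, 2.68549, 6.0])

binned = pd.cut(tpm, toy_bins)  # right-closed intervals, e.g. (4.0, 5.0]

# Extract (lower, upper) per row; out-of-range values stay NaN so they can be inspected.
bounds = [(iv.left, iv.right) if isinstance(iv, pd.Interval) else (np.nan, np.nan)
          for iv in binned]
out_of_range = tpm[binned.isna()].tolist()
```

Here `out_of_range` holds the 0.0 and 6.0 entries: the first sits on the open lower edge, the second lies above the last edge.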

In [56]:
# Create 'name' column for interval Terms.
tpm_all_intervals['name'] = tpm_all_intervals['bins_lowerbound'].astype(str) +','+ tpm_all_intervals['bins_upperbound'].astype(str)

tpm_all_intervals.drop(['bins_lowerbound','bins_upperbound'],axis=1,inplace=True)

In [57]:
tpm_all_intervals.head(3)

Unnamed: 0,Transcript ID,tissue,Median_TPM,SMTS,UBERON_code,symbol,hgnc_id,name,locus_group,locus_type,location,UBERON_CODEID,tissue_no_space,Transcript ID DASH,CODE_GTEX_Expression,CodeID_GTEX_Expression
0,ENSG00000223972.5,Testis,0.166403,Testis,473,DDX11L1,HGNC:37102,"0.1,0.2",pseudogene,pseudogene,1p36.33,UBERON 0000473,Testis,ENSG00000223972-5,ENSG00000223972-5-Testis,GTEXEXP ENSG00000223972-5-Testis
1,ENSG00000227232.5,Adipose Subcutaneous,4.06403,Adipose Tissue,2190,WASH7P,HGNC:38034,"4.0,5.0",pseudogene,pseudogene,1p36.33,UBERON 0002190,Adipose-Subcutaneous,ENSG00000227232-5,ENSG00000227232-5-Adipose-Subcutaneous,GTEXEXP ENSG00000227232-5-Adipose-Subcutaneous
2,ENSG00000227232.5,Adrenal Gland,2.68549,Adrenal Gland,2369,WASH7P,HGNC:38034,"2.0,3.0",pseudogene,pseudogene,1p36.33,UBERON 0002369,Adrenal-Gland,ENSG00000227232-5,ENSG00000227232-5-Adrenal-Gland,GTEXEXP ENSG00000227232-5-Adrenal-Gland


### Create Expression Bin edges

In [58]:
tpm_all_intervals['bin_CodeID'] = ['EXPBINS ' + i for i in tpm_all_intervals['name']]

# .copy() so adding the predicate column does not act on a view of tpm_all_intervals
edges_gtexEXP_bins = tpm_all_intervals[['CodeID_GTEX_Expression','bin_CodeID']].copy()
edges_gtexEXP_bins['predicate'] = 'has_expression'

edges_gtexEXP_bins = edges_gtexEXP_bins[['CodeID_GTEX_Expression','predicate','bin_CodeID']]

edges_gtexEXP_bins.columns = ['subject','predicate','object']
edges_gtexEXP_bins.head(3)

Unnamed: 0,subject,predicate,object
0,GTEXEXP ENSG00000223972-5-Testis,has_expression,"EXPBINS 0.1,0.2"
1,GTEXEXP ENSG00000227232-5-Adipose-Subcutaneous,has_expression,"EXPBINS 4.0,5.0"
2,GTEXEXP ENSG00000227232-5-Adrenal-Gland,has_expression,"EXPBINS 2.0,3.0"


In [59]:
edges_gtexEXP_bins['object'] = [i.replace(',','.') for i in edges_gtexEXP_bins['object']]

In [100]:
edges_gtexEXP_bins

Unnamed: 0,subject,predicate,object
0,GTEXEXP ENSG00000223972-5-Testis,has_expression,EXPBINS 0.1.0.2
1,GTEXEXP ENSG00000227232-5-Adipose-Subcutaneous,has_expression,EXPBINS 4.0.5.0
2,GTEXEXP ENSG00000227232-5-Adrenal-Gland,has_expression,EXPBINS 2.0.3.0
3,GTEXEXP ENSG00000227232-5-Artery-Aorta,has_expression,EXPBINS 4.0.5.0
4,GTEXEXP ENSG00000227232-5-Artery-Coronary,has_expression,EXPBINS 3.0.4.0
...,...,...,...
1573915,GTEXEXP ENSG00000210195-2-Thyroid,has_expression,EXPBINS 0.0.0.0
1573916,GTEXEXP ENSG00000210195-2-Uterus,has_expression,EXPBINS 0.0.0.0
1573917,GTEXEXP ENSG00000210195-2-Vagina,has_expression,EXPBINS 0.0.0.0
1573918,GTEXEXP ENSG00000210195-2-Whole-Blood,has_expression,EXPBINS 0.0.0.0


In [108]:
'''Save all SUIs (and combine the Expression SUIs with the eQTL SUIs and save)
(Combine GTEX EXPRESSION SUIs  with GTEX EQTL)  ; Columns: SUI:ID, name
        
Add the 0 SUI and the interval SUIs to the main SUI file  
Add just the 0 SUI to the SUIs_all (EQTL SUIs + EXPRESSION SUIs)  
We need a separate script to add the TPM interval SUIs bc they contain numerical properties  
BUT, add both the 0 SUIs and the interval SUIs to the CODE-SUIs_all, bc the relationships (CODE-SUIs) can be put in the UMLS master CODE-SUIs file.
# Add in all intervals'''
#GTEX_EXP_SUIs = tpm_all_intervals[['SUI_tpm_bins','name']].rename(columns={'SUI_tpm_bins':'SUI:ID'
#                                                               }).drop_duplicates()

# Combine GTEX_EXP_SUIs with GTEX EQTL SUIs
#GTEX_SUIs = pd.concat([GTEX_EXP_SUIs,GTEX_EQTL_SUIs]).drop_duplicates()

#assert len(GTEX_SUIs[GTEX_SUIs['SUI:ID'].duplicated()]) == 0

#GTEX_SUIs.to_csv('/Users/stearb/desktop/R03_local/data/ingest_files/GTEx/SUIs_GTEx.csv',index=False)


# Create edges files (GTEX_EXP--HGNC  &  GTEX_EXP---UBERON) 

In [101]:
medgene_select.head(1)

Unnamed: 0,Transcript ID,tissue,Median_TPM,SMTS,UBERON_code,symbol,hgnc_id,name,locus_group,locus_type,location,UBERON_CODEID,tissue_no_space,Transcript ID DASH,CODE_GTEX_Expression,CodeID_GTEX_Expression
0,ENSG00000223972.5,Adipose Subcutaneous,0.0,Adipose Tissue,2190,DDX11L1,HGNC:37102,DEAD/H-box helicase 11 like 1 (pseudogene),pseudogene,pseudogene,1p36.33,UBERON 0002190,Adipose-Subcutaneous,ENSG00000223972-5,ENSG00000223972-5-Adipose-Subcutaneous,GTEXEXP ENSG00000223972-5-Adipose-Subcutaneous


In [60]:
medgene_select['hgnc_codeID'] = 'HGNC '+ medgene_select['hgnc_id']
#medgene_select['uberon_CodeID'] = 'UBERON '+ medgene_select['UBERON_code']

##################################
######## WRONG RELATIONSHIPS ###
medgene_select['gtex_hgnc_predicate'] =  'RO:0002206'  # expressed in (old rel: 'median_expression_in_gene')
medgene_select['gtex_uberon_predicate'] =  'RO:0002206'  # expressed in (old rel: 'median_expression_in_tissue')

edges_gtexExp_HGNC = medgene_select[['CodeID_GTEX_Expression','gtex_hgnc_predicate','hgnc_codeID']]
edges_gtexExp_UB = medgene_select[['CodeID_GTEX_Expression','gtex_uberon_predicate','UBERON_CODEID']]

edges_gtexExp_HGNC.columns = edges_gtexExp_UB.columns = ['subject','predicate','object']

In [61]:
edges_exp = pd.concat([edges_gtexEXP_bins,edges_gtexExp_HGNC,edges_gtexExp_UB])
edges_exp = edges_exp.reset_index(drop=True)
#edges_exp

In [62]:
edges_exp = edges_exp.drop_duplicates()
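
The edge-building cells above can be condensed into one small helper. This is only a sketch on a toy frame: `make_edges` and `toy` are hypothetical names that do not appear in the notebook, the toy frame repeats the single `medgene_select` row shown above, and the predicate is the same `RO:0002206` value that the notebook itself flags as needing review.

```python
import pandas as pd

# Toy stand-in for medgene_select: the one row shown above, duplicated on purpose
# so that drop_duplicates() has something to remove.
toy = pd.DataFrame({
    "CodeID_GTEX_Expression": ["GTEXEXP ENSG00000223972-5-Adipose-Subcutaneous"] * 2,
    "hgnc_id": ["HGNC:37102"] * 2,
    "UBERON_CODEID": ["UBERON 0002190"] * 2,
})


def make_edges(df, object_col, predicate):
    """Return a subject/predicate/object frame for one relationship type (hypothetical helper)."""
    edges = df[["CodeID_GTEX_Expression", object_col]].copy()
    edges.insert(1, "predicate", predicate)
    edges.columns = ["subject", "predicate", "object"]
    return edges


# Same CodeID construction as the notebook: 'HGNC ' + hgnc_id.
toy["hgnc_codeID"] = "HGNC " + toy["hgnc_id"]

edges_toy = (
    pd.concat(
        [
            make_edges(toy, "hgnc_codeID", "RO:0002206"),
            make_edges(toy, "UBERON_CODEID", "RO:0002206"),
        ],
        ignore_index=True,
    )
    .drop_duplicates()
    .reset_index(drop=True)
)
print(edges_toy)
```

Taking an explicit `.copy()` before renaming columns keeps the helper from touching the source frame, which the chained `.columns` assignment above relies on implicitly.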

# Create nodes file
#### We don't need to save the dbSNP nodes here; they're saved in the eQTL nodes file

In [63]:
nodes_GTEX_EXP = pd.DataFrame(medgene_select['CodeID_GTEX_Expression'] )
nodes_GTEX_EXP['GTEX_EXP_terms'] = np.nan

nodes_GTEX_EXP.columns = ['node_id','node_label']
nodes_GTEX_EXP.head(3)

Unnamed: 0,node_id,node_label
0,GTEXEXP ENSG00000223972-5-Adipose-Subcutaneous,
1,GTEXEXP ENSG00000223972-5-Adrenal-Gland,
2,GTEXEXP ENSG00000223972-5-Artery-Aorta,


### And the Expression_bin nodes file

In [64]:
nodes_gtexEXP_bins = pd.DataFrame(edges_gtexEXP_bins['object'].drop_duplicates().reset_index(drop=True))
nodes_gtexEXP_bins['node_label'] = np.nan
nodes_gtexEXP_bins.columns = ['node_id','node_label']
#nodes_gtexEXP_bins

In [66]:
## UBERON NODES
nodes_exp_ub = pd.DataFrame(edges_gtexExp_UB['object'].drop_duplicates())
nodes_exp_ub.columns = ['node_id']
nodes_exp_ub['node_label'] = np.nan
nodes_exp_ub = nodes_exp_ub.reset_index(drop=True)

In [67]:
nodes_exp = pd.concat([nodes_gtexEXP_bins,nodes_exp_ub,nodes_GTEX_EXP])
nodes_exp = nodes_exp.reset_index(drop=True)
nodes_exp

Unnamed: 0,node_id,node_label
0,EXPBINS 0.1.0.2,
1,EXPBINS 4.0.5.0,
2,EXPBINS 2.0.3.0,
3,EXPBINS 3.0.4.0,
4,EXPBINS 5.0.6.0,
...,...,...
1573578,GTEXEXP ENSG00000210196-2-Testis,
1573579,GTEXEXP ENSG00000210196-2-Thyroid,
1573580,GTEXEXP ENSG00000210196-2-Uterus,
1573581,GTEXEXP ENSG00000210196-2-Vagina,


In [70]:
nodes_exp = fill_missing_cols(nodes_exp)

In [114]:
nodes_exp

Unnamed: 0,node_id,node_label,node_dbxrefs,node_namespace,node_definition,value,lowerbound,unit,upperbound,node_synonyms
0,EXPBINS 0.1.0.2,,,,,,,,,
1,EXPBINS 4.0.5.0,,,,,,,,,
2,EXPBINS 2.0.3.0,,,,,,,,,
3,EXPBINS 3.0.4.0,,,,,,,,,
4,EXPBINS 5.0.6.0,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
1574118,GTEXEXP ENSG00000210196-2-Testis,,,,,,,,,
1574119,GTEXEXP ENSG00000210196-2-Thyroid,,,,,,,,,
1574120,GTEXEXP ENSG00000210196-2-Uterus,,,,,,,,,
1574121,GTEXEXP ENSG00000210196-2-Vagina,,,,,,,,,
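
`fill_missing_cols` is defined elsewhere in this notebook and is not shown in this section. Judging only from the output above, it appears to add the remaining OWLNETS node metadata columns as empty values. A minimal sketch under that assumption follows; `fill_missing_cols_sketch` and `NODE_COLUMNS` are hypothetical names, and the column order is copied from the output above rather than from the real helper.

```python
import numpy as np
import pandas as pd

# Column order as it appears in the nodes_exp output above (assumed to be the target schema).
NODE_COLUMNS = [
    "node_id", "node_label", "node_dbxrefs", "node_namespace", "node_definition",
    "value", "lowerbound", "unit", "upperbound", "node_synonyms",
]


def fill_missing_cols_sketch(df):
    """Add any absent node metadata columns as NaN and return them in a fixed order."""
    out = df.copy()
    for col in NODE_COLUMNS:
        if col not in out.columns:
            out[col] = np.nan
    return out[NODE_COLUMNS]


demo = pd.DataFrame({"node_id": ["EXPBINS 0.1.0.2"], "node_label": [np.nan]})
demo_filled = fill_missing_cols_sketch(demo)
print(demo_filled.columns.tolist())
```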


In [72]:
#print([i for i in nodes_exp['node_id'] if 'EFO' in i])
#nodes_exp = nodes_exp[~nodes_exp['node_id'].str.startswith('UBERON EFO_')].reset_index(drop=True)

In [73]:
nodes_exp.to_csv('/Users/stearb/Desktop/DESKTOP_TRANSFER/DataDistilleryFiles/gtex/gtex_exp/OWLNETS_node_metadata.txt',
             sep='\t',index=False)

edges_exp.to_csv('/Users/stearb/Desktop/DESKTOP_TRANSFER/DataDistilleryFiles/gtex/gtex_exp/OWLNETS_edgelist.txt',
             sep='\t',index=False)
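
Before handing the two files to the OWLNETS script, it may help to check how the edge endpoints line up with the node file. The sketch below is an optional sanity check, not part of the original workflow: `summarize_edge_node_coverage` is a hypothetical helper, and the demo frames reuse only identifiers shown earlier in this notebook. HGNC endpoints are expected to be reported as missing, since this workflow does not write HGNC nodes to the GTEXEXP node file.

```python
import pandas as pd


def summarize_edge_node_coverage(nodes, edges):
    """Count edge endpoints absent from the node file, grouped by SAB prefix."""
    known = set(nodes["node_id"])
    endpoints = pd.concat([edges["subject"], edges["object"]]).drop_duplicates()
    missing = endpoints[~endpoints.isin(known)]
    # The SAB is the text before the first space in a CodeID such as 'UBERON 0002190'.
    return missing.str.split(" ", n=1).str[0].value_counts().to_dict()


gtex_id = "GTEXEXP ENSG00000223972-5-Adipose-Subcutaneous"
nodes_demo = pd.DataFrame({"node_id": [gtex_id, "UBERON 0002190"]})
edges_demo = pd.DataFrame({
    "subject": [gtex_id, gtex_id],
    "predicate": ["RO:0002206", "RO:0002206"],
    "object": ["HGNC HGNC:37102", "UBERON 0002190"],
})

coverage = summarize_edge_node_coverage(nodes_demo, edges_demo)
n_duplicate_nodes = int(nodes_demo["node_id"].duplicated().sum())
print(coverage, n_duplicate_nodes)
```

On the real data the same call would be `summarize_edge_node_coverage(nodes_exp, edges_exp)`; a nonzero duplicate count would suggest adding `drop_duplicates()` to `nodes_exp` before saving.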

In [13]:
nodes_exp = pd.read_csv('/Users/stearb/Desktop/DESKTOP_TRANSFER/DataDistilleryFiles/gtex/gtex_exp/OWLNETS_node_metadata.txt',sep='\t')


In [74]:
edges_exp = pd.read_csv('/Users/stearb/Desktop/DESKTOP_TRANSFER/DataDistilleryFiles/gtex/gtex_exp/OWLNETS_edgelist.txt',sep='\t')

In [75]:

#print([i for i in edges_exp['object'] if 'EFO' in i])


edges_exp[edges_exp['object'].str.startswith('UBERON EFO_')]['object'].unique()

array([], dtype=object)