## How to derive transcript counts
The transcript counts are simply log1p, i.e. log(1 + x), of the values in pe-rna-counts.csv, a file Lauren gave me.

In [1]:
import numpy as np
import pandas as pd

# Rows are Ensembl gene IDs (first CSV column), columns are samples.
counts = np.log1p(pd.read_csv('pe-rna-counts.csv', index_col=0))
counts

Unnamed: 0,PL1013,PL1015,PL1023,PL1043,PL1159,PL1182,PL1226,PL1362,PL1365,PL1383,...,PL2353,PL2360,PL2406,PL475,PL519,PL629,PL687,PL808,PL810,PL893
ENSG00000000419,3.583519,5.303305,7.074963,6.028279,0.000000,6.946976,4.499810,6.747587,0.000000,5.966147,...,7.100028,6.905753,7.099202,5.537334,0.000000,5.910797,6.429719,6.931472,6.118097,0.000000
ENSG00000000457,6.001415,0.000000,3.931826,6.336826,5.720312,5.774552,6.373320,6.146329,0.000000,4.787492,...,6.309918,7.311218,7.907652,6.418365,6.255750,2.302585,6.251904,6.502790,5.308268,5.902633
ENSG00000000460,5.652489,0.000000,5.934894,0.000000,4.795791,0.000000,5.036953,0.000000,0.000000,0.000000,...,7.098376,6.505784,0.000000,4.795791,0.000000,5.493061,6.322565,6.006353,4.189655,0.000000
ENSG00000000938,7.959975,7.421178,8.806424,8.457868,8.099251,7.702556,8.263075,8.235891,8.310169,8.835647,...,9.149209,10.213322,8.968778,7.729296,8.234830,7.521318,8.253227,9.376024,8.475954,8.511980
ENSG00000000971,0.000000,3.044522,6.502790,4.434251,0.000000,0.000000,0.000000,5.288267,0.000000,6.871091,...,0.000000,5.855072,5.288267,3.663562,0.000000,4.007333,0.000000,5.326978,0.000000,4.127134
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSG00000293497,0.000000,2.995732,6.553935,4.477337,0.000000,0.000000,3.713572,0.000000,0.000000,4.574711,...,2.639057,6.046547,0.000000,3.555348,5.955837,0.000000,5.723585,6.408527,4.852030,0.000000
ENSG00000293508,0.000000,0.693147,5.153292,5.613128,0.000000,0.000000,5.159055,0.000000,0.000000,5.723585,...,1.609438,0.000000,2.079442,0.000000,4.418841,5.857933,5.811141,5.117994,0.000000,2.197225
ENSG00000293510,5.303305,5.533389,0.000000,4.804021,6.575076,0.000000,6.304449,2.397895,0.000000,4.584967,...,0.000000,6.425222,5.868463,0.000000,4.418841,4.025352,7.156956,7.134547,5.902633,4.110874
ENSG00000293514,5.313206,5.283204,6.011267,5.252273,0.000000,0.000000,6.857514,5.666427,0.000000,0.000000,...,0.000000,6.527958,5.786897,5.509388,0.000000,4.948760,0.000000,5.308268,6.575076,6.981006
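
As a quick illustration of what the transform does: np.log1p(x) computes log(1 + x), so a count of 0 stays 0 instead of becoming -inf, and np.expm1 undoes it. For instance, the value 3.583519 shown above for ENSG00000000419 in PL1013 corresponds to an untransformed value of about 35. The sketch below uses a tiny made-up table with hypothetical gene and sample labels, not values from pe-rna-counts.csv.

```python
import numpy as np
import pandas as pd

# Made-up counts with hypothetical labels (not taken from pe-rna-counts.csv).
raw = pd.DataFrame(
    {"sample_a": [0, 35, 1000], "sample_b": [1, 0, 250]},
    index=["gene_x", "gene_y", "gene_z"],
)

# log1p(x) = log(1 + x): zero counts map to exactly 0 rather than -inf.
logged = np.log1p(raw)

# expm1 is the inverse of log1p, so the original values can be recovered
# (up to floating-point error).
recovered = np.expm1(logged)
```

Keeping the inverse in mind is useful if a downstream tool expects untransformed counts.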


The keys are Ensembl gene IDs, which are not the identifiers that Pathway Commons accepts. To translate the Ensembl IDs into more familiar gene names, I used the genes.csv file to create two dictionaries, one for each direction.

In [2]:
# genes.csv has a "code" column (Ensembl ID) and a "name" column (gene name).
gene_table = pd.read_csv('genes.csv')
to_name = gene_table.set_index("code")["name"].to_dict()
to_ensembl = gene_table.set_index("name")["code"].to_dict()

I created the genes.csv file manually. I couldn't create it automatically because the gene names mentioned in papers don't translate cleanly to Ensembl IDs. For example, some papers mention EPHX, epoxide hydrolase, but there are actually three variants of epoxide hydrolase (EPHX1, EPHX2, EPHX3), each with a different Ensembl ID, so I included all of them. While I'd prefer to do this kind of thing through an API to minimize the likelihood of human error, there wasn't a straightforward way to do that, so I did it manually.
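
Because the file is curated by hand, it may be worth checking that no code or name appears twice; to_dict() silently keeps only the last entry when a key repeats, so a duplicate would drop a gene from one of the dictionaries. This is a minimal sketch, not part of the original workflow: it uses a small in-memory table in place of genes.csv, the ENSG_PLACEHOLDER_* codes are not real Ensembl IDs, and find_duplicates is a hypothetical helper name.

```python
import pandas as pd

# Stand-in for genes.csv with the same two columns; the codes are placeholders,
# and the repeated code is deliberate to show what the check catches.
gene_table = pd.DataFrame(
    {
        "code": ["ENSG_PLACEHOLDER_1", "ENSG_PLACEHOLDER_2", "ENSG_PLACEHOLDER_2"],
        "name": ["EPHX1", "EPHX2", "EPHX3"],
    }
)


def find_duplicates(table: pd.DataFrame, column: str) -> list:
    """Return the sorted values that appear more than once in `column`."""
    repeated = table.loc[table[column].duplicated(keep=False), column]
    return sorted(repeated.unique())


duplicate_codes = find_duplicates(gene_table, "code")
duplicate_names = find_duplicates(gene_table, "name")
print("duplicate codes:", duplicate_codes)
print("duplicate names:", duplicate_names)
```

Both lists being empty on the real file would mean to_name and to_ensembl are exact inverses of each other.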


Next, we need to filter the counts so that they only include the genes I identified in papers as being associated with preeclampsia. Note that not all of the genes I identified are present in our counts data.

In [3]:
genes_associated_with_pe = list(pd.read_csv('genes.csv')['code'])
genes = [g for g in genes_associated_with_pe if g in counts.index]
len(genes)

78
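
To see which curated genes are absent from the counts, rather than only how many are present, the two sets of labels can be compared directly. A minimal sketch, assuming made-up labels in place of counts.index and the code column of genes.csv:

```python
import pandas as pd

# Hypothetical stand-ins for counts.index and the "code" column of genes.csv.
counts_index = pd.Index(["code_a", "code_b", "code_c"])
curated_codes = pd.Index(["code_b", "code_c", "code_d", "code_e"])

# Curated codes that have a row in the counts table.
present = curated_codes.intersection(counts_index)

# Curated codes with no row in the counts table.
missing = curated_codes.difference(counts_index)

print(len(present), "present;", len(missing), "missing:", list(missing))
```

Listing the missing codes makes it easier to tell a gene that is truly absent from one whose ID was mistyped in the hand-built file.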

In [4]:
counts_we_have = counts.loc[genes]

It's probably a good idea to change the index of the counts_we_have dataframe to the common gene names for ease of use.

In [5]:
counts_we_have.index = [to_name[x] for x in counts_we_have.index]
counts_we_have

Unnamed: 0,PL1013,PL1015,PL1023,PL1043,PL1159,PL1182,PL1226,PL1362,PL1365,PL1383,...,PL2353,PL2360,PL2406,PL475,PL519,PL629,PL687,PL808,PL810,PL893
RPS29,9.073145,8.135347,9.229894,9.534523,9.115480,8.322151,9.376442,9.049819,8.565793,8.789051,...,9.832529,9.743222,9.550164,9.359952,8.670944,8.232972,9.381601,9.373224,8.905731,9.459230
EEF1A1,12.409879,11.462379,12.842304,12.799694,12.148532,12.134368,12.709605,12.011991,11.687065,12.302857,...,12.980988,13.140993,13.013849,12.336670,11.855785,11.738759,12.593443,13.154979,12.371195,12.421758
IGF2,10.336406,9.230241,10.410396,9.392912,9.100972,10.404263,9.463120,8.696845,9.644328,10.233295,...,10.211854,9.476926,9.688808,9.787459,8.963416,10.138994,8.751791,10.203444,10.171719,9.555702
UBC,9.726691,9.422625,10.644615,10.225535,10.163349,10.247680,10.465786,10.041247,9.954608,10.444124,...,10.740757,11.025588,11.097486,10.653511,9.857129,9.505916,10.520321,10.927125,10.202258,9.728181
FAU,9.953515,8.407601,10.009648,10.112167,9.944342,9.017241,10.053888,9.181941,8.977651,9.326967,...,10.331985,10.400924,9.986621,9.813508,9.281358,8.737613,10.090216,10.551324,9.711661,9.677653
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ACE,0.693147,5.337538,5.402677,6.232448,6.957497,0.000000,0.000000,5.379897,0.000000,8.014336,...,0.000000,4.262680,6.975414,6.230481,0.000000,5.181784,0.000000,6.572283,4.615121,5.908083
APOE,4.682131,4.820282,4.787492,6.139885,0.000000,0.000000,5.204007,6.156979,0.000000,5.332719,...,6.660575,5.690359,0.000000,5.680173,4.330733,5.891644,6.586172,6.383507,5.993961,5.958425
EPHX1,3.871201,4.060443,0.000000,6.450472,6.156979,3.784190,3.951244,6.423247,5.298317,5.883322,...,6.222576,5.752573,0.000000,0.000000,4.262680,4.189655,5.579730,5.497168,0.000000,5.308268
EPHX2,0.000000,1.098612,5.257495,5.961005,0.693147,7.470224,5.318120,4.867534,0.000000,0.000000,...,0.693147,6.356108,0.000000,6.846943,0.000000,4.204693,5.834811,6.541030,4.828314,0.000000
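
An alternative to rebuilding the index with a list comprehension is DataFrame.rename, which returns a new frame and leaves any label without a mapping entry unchanged instead of raising KeyError. A minimal sketch with made-up labels (the frame and the code_to_name mapping are hypothetical, not from the real files):

```python
import pandas as pd

# Made-up frame and mapping; code_b is deliberately left out of the mapping.
frame = pd.DataFrame({"sample_a": [1.0, 2.0]}, index=["code_a", "code_b"])
code_to_name = {"code_a": "GENE_A"}

# rename keeps labels that have no entry in the mapping, whereas
# [code_to_name[x] for x in frame.index] would raise KeyError for code_b.
renamed = frame.rename(index=code_to_name)
print(list(renamed.index))
```

Here every code in counts_we_have came from genes.csv, so both approaches should give the same result; the difference only matters if the index could contain unmapped codes.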
