### Pathway Commons helps select the genes (features) to use as features for a specific gene. The RNASeq dataset contains more than 20,000 genes, and each one is a candidate feature.

##### 1. One way to select features from more than 20,000 candidates is to keep only the genes that have a **controls-expression-of** relation with a selected gene.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read the RNASeq dataset and add a GeneId column (the gene symbol before '|') used to select genes
RNASeq = pd.read_csv('prepared/RNASeq.csv')
geneSplit = RNASeq['HybridizationREF'].str.split('|', n=0, expand=True)
RNASeq["GeneId"] = geneSplit.iloc[:, 0]
RNASeq

Unnamed: 0,HybridizationREF,TCGA-02-0047-01A,TCGA-02-0055-01A,TCGA-02-2483-01A,TCGA-02-2485-01A,TCGA-02-2486-01A,TCGA-06-0129-01A,TCGA-06-0130-01A,TCGA-06-0132-01A,TCGA-06-0141-01A,...,TCGA-41-4097-01A,TCGA-41-5651-01A,TCGA-76-4925-01A,TCGA-76-4926-01B,TCGA-76-4927-01A,TCGA-76-4928-01B,TCGA-76-4929-01A,TCGA-76-4931-01A,TCGA-76-4932-01A,GeneId
0,?|100130426,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,?
1,?|100133144,1.619742,0.000000,1.559100,3.999567,2.475344,4.480310,-0.644984,1.014712,2.358481,...,3.932137,0.285698,3.610700,3.258760,1.268794,3.346375,3.083333,2.622930,0.000000,?
2,?|100134869,2.757258,3.972445,3.801138,3.902759,2.264506,4.072440,1.570754,1.768798,3.126312,...,3.796878,3.373105,4.117662,3.804147,2.683854,2.696261,1.700262,2.992333,3.314726,?
3,?|10357,5.773564,4.972440,5.915141,6.520796,5.966629,6.252266,5.132149,5.288702,5.943574,...,4.788085,6.131009,6.756695,6.140689,5.426939,5.407978,5.959487,5.878205,5.911668,?
4,?|10431,9.791685,9.790795,10.270095,8.876517,9.093659,9.327291,9.596396,9.471455,9.766777,...,9.212488,8.987817,9.055101,9.199774,9.278179,9.060000,9.385647,8.840315,8.964348,?
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20526,ZYX|7791,11.917362,13.487015,12.306861,12.313720,12.929258,11.906380,13.488796,12.484875,12.721659,...,12.205940,11.631881,13.302111,12.361849,12.866714,14.047882,11.512933,12.816231,13.230089,ZYX
20527,ZZEF1|23140,10.448212,9.243237,9.700586,10.154215,9.293331,10.939146,9.632465,9.510155,10.207305,...,10.706289,9.571047,9.555891,10.093747,10.280976,10.001377,10.863138,10.608635,9.634856,ZZEF1
20528,ZZZ3|26009,9.237409,9.488141,9.462370,9.452192,9.042860,10.121765,9.814525,9.084337,9.469504,...,9.227342,9.070279,9.830382,9.543483,9.080459,9.001994,9.305167,9.152136,9.548352,ZZZ3
20529,psiTPTE22|387590,2.757258,3.624522,8.640045,3.974006,6.119194,8.393311,4.089083,6.592484,6.295394,...,3.788091,3.533576,3.585167,3.372646,3.771315,2.979623,3.407162,2.371977,0.000000,psiTPTE22
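
To illustrate what the split in the cell above does, here is a minimal sketch on a toy frame whose three identifiers are copied from the output above. The names `toy`, `parts`, `named`, and the `EntrezId` column are illustrative and not part of the notebook. It assumes `HybridizationREF` always has the form `SYMBOL|ID`. Rows whose symbol is `?` carry no gene symbol, so they presumably cannot be matched by name against Pathway Commons, and the sketch drops them.

```python
import pandas as pd

# Toy frame; identifiers copied from the RNASeq output above.
toy = pd.DataFrame({"HybridizationREF": ["?|100130426", "ZYX|7791", "ZZEF1|23140"]})

# Split "SYMBOL|ID" once on the pipe character into two columns (0 and 1).
parts = toy["HybridizationREF"].str.split("|", n=1, expand=True)
toy["GeneId"] = parts[0]
toy["EntrezId"] = parts[1]  # hypothetical column name for the part after '|'

# Keep only rows that have a real gene symbol.
named = toy[toy["GeneId"] != "?"]
print(named)
```

Whether the `?` rows should be dropped depends on the downstream model; the notebook itself keeps them.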


##### Read the Pathway Commons dataset, which lists the relations between genes

In [3]:
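# Skip malformed lines (on_bad_lines requires pandas >= 1.3; older versions used error_bad_lines=False)
pCommons = pd.read_csv('data/PathwayCommons12.All.hgnc.sif', sep=r'\s+', on_bad_lines='skip', index_col=False, dtype='unicode')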
pCommons.head()

Unnamed: 0,Gene,RelationName,To
0,A1BG,controls-expression-of,A2M
1,A1BG,interacts-with,ABCC6
2,A1BG,interacts-with,ACE2
3,A1BG,interacts-with,ADAM10
4,A1BG,interacts-with,ADAM17
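
Before filtering, it can help to see which relation types occur and how often. The sketch below rebuilds the five rows shown above as a toy frame (`toy_sif` and `relation_counts` are illustrative names) and counts them by `RelationName`; on the full `pCommons` frame the same call would presumably list every relation type in the file.

```python
import pandas as pd

# Toy frame rebuilt from the five rows of pCommons.head() shown above.
toy_sif = pd.DataFrame({
    "Gene": ["A1BG"] * 5,
    "RelationName": [
        "controls-expression-of",
        "interacts-with",
        "interacts-with",
        "interacts-with",
        "interacts-with",
    ],
    "To": ["A2M", "ABCC6", "ACE2", "ADAM10", "ADAM17"],
})

# Count how many rows each relation type has.
relation_counts = toy_sif["RelationName"].value_counts()
print(relation_counts)
```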


## Another dataset ranks significantly mutated genes. I use it to find the most significantly mutated genes.

In [7]:
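# Skip malformed lines (on_bad_lines requires pandas >= 1.3; older versions used error_bad_lines=False)
SigMut = pd.read_csv('data/GBM-TP/SignificantlyMutatedGenes.txt', sep='\t', on_bad_lines='skip', dtype='unicode')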
SigMut

Unnamed: 0,rank,gene,longname,codelen,nnei,nncd,nsil,nmis,nstp,nspl,nind,nnon,npat,nsite,pCV,pCL,pFN,p,q
0,1,TP53,tumor protein p53,1314,60,0,0,78,4,7,8,97,80,60,1.851640e-15,1.000000e-05,1.000000e-05,1.000000e-16,6.754967e-13
1,2,PIK3R1,"phosphoinositide-3-kinase, regulatory subunit ...",2361,3,0,0,14,1,2,16,33,32,27,1.000000e-16,1.000000e-05,4.367000e-02,1.000000e-16,6.754967e-13
2,3,RB1,retinoblastoma 1 (including osteosarcoma),2891,63,0,1,0,9,9,7,25,24,22,1.000000e-16,2.057000e-01,2.900000e-02,1.110223e-16,6.754967e-13
3,4,NF1,"neurofibromin 1 (neurofibromatosis, von Reckli...",8807,1,0,1,4,12,5,14,35,29,34,1.000000e-16,2.310000e-01,8.820000e-01,1.221245e-15,5.572848e-12
4,5,PTEN,phosphatase and tensin homolog (mutated in mul...,1244,58,0,0,46,15,7,21,89,86,72,4.117264e-15,7.000000e-02,5.030000e-01,1.831868e-14,6.687417e-11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18248,18249,ZSWIM4,"zinc finger, SWIM-type containing 4",3018,12,0,0,0,0,0,0,0,0,0,1,,,1,1
18249,18250,ZSWIM7,"zinc finger, SWIM-type containing 7",578,347,0,0,0,0,0,0,0,0,0,1,,,1,1
18250,18251,ZW10,"ZW10, kinetochore associated, homolog (Drosoph...",2402,93,0,0,0,0,0,0,0,0,0,1,,,1,1
18251,18252,ZWINT,ZW10 interactor,868,26,0,0,0,0,0,0,0,0,0,1,,,1,1


In [13]:
# Keep only the gene column, then the 10 top-ranked genes
SigMut = SigMut[['gene']]
SigMut = SigMut.head(10)
SigMut.head()

Unnamed: 0,gene
0,TP53
1,PIK3R1
2,RB1
3,NF1
4,PTEN


##### Filter Pathway Commons for the genes that have a relation with the selected genes

In [19]:
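# For each top-ranked gene, save the genes it is related to (all relation types).
# The output directory 'pathwayCommons/' must already exist.
for rank, gene in enumerate(SigMut['gene'], start=1):
    filtered = pCommons.loc[pCommons['Gene'] == gene]
    filtered['To'].to_csv(f'pathwayCommons/{rank}_{gene}_relation_gene.csv', index=False)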
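
The loop above keeps every relation type. To restrict the selection to **controls-expression-of**, as described in step 1, one option is to add a condition on `RelationName` and then use the resulting gene list to subset the RNASeq rows by `GeneId`. The following is a minimal sketch, not the notebook's own method: `expression_targets` is a hypothetical helper, and the rows in `toy_pc` and `toy_rna` (including the placeholder names `GENE_A` to `GENE_D`) are made up for illustration. It only follows the direction "selected gene controls expression of target"; the reverse direction would filter on the `To` column instead.

```python
import pandas as pd

# Toy stand-ins for pCommons and RNASeq; every row here is made up.
toy_pc = pd.DataFrame({
    "Gene": ["TP53", "TP53", "TP53", "RB1"],
    "RelationName": [
        "controls-expression-of",
        "interacts-with",
        "controls-expression-of",
        "controls-expression-of",
    ],
    "To": ["GENE_A", "GENE_B", "GENE_C", "GENE_A"],
})
toy_rna = pd.DataFrame({
    "GeneId": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "sample_1": [1.0, 2.0, 3.0, 4.0],
})


def expression_targets(sif, gene, relation="controls-expression-of"):
    """Return the unique 'To' genes linked to `gene` by `relation` (hypothetical helper)."""
    mask = (sif["Gene"] == gene) & (sif["RelationName"] == relation)
    return sif.loc[mask, "To"].drop_duplicates().tolist()


# Genes whose expression TP53 controls in the toy data.
targets = expression_targets(toy_pc, "TP53")

# Keep only the RNASeq rows for those genes; these become the features.
features = toy_rna[toy_rna["GeneId"].isin(targets)]
print(targets)
print(features)
```

With the real data, `toy_pc` and `toy_rna` would be replaced by `pCommons` and `RNASeq`, and the helper could be called once per gene in `SigMut['gene']`.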