# NetworKIN Data Formatting

This notebook converts kinase-protein interaction data from the NetworKIN database into the .gmt format. The data was retrieved from the NetworKIN database on Wed, Jun 7 2017 14:55:39. The resulting file will be added to the KEA2 database and is formatted for use by the ENRICHR and X2K tools.

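## Import the packages needed by the following program

In [1]:
#init.ipy is not shown here; the cells below use pd (pandas) and plt (matplotlib.pyplot)
#without importing them, so this script presumably provides both
%run /home/maayanlab/Desktop/Projects/Scripts/init.ipy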

## Create a dataframe from a file containing NetworKIN data

In [2]:
#Read data from the tsv file 'networkin_human_predictions.tsv' into dataframe 'net_df'
#None of the yeast data from the NetworKIN site was downloaded or included
net_df = pd.read_table('~/Desktop/Projects/KEA3/networkin_human_predictions.tsv')

#View dataframe
net_df.head()

| Unnamed: 0 | #substrate | position | id | networkin_score | tree | netphorest_group | netphorest_score | string_identifier | string_score | substrate_name | sequence | string_path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A1CF (ENSP00000363105) | 154 | ACTR2 | 0.1659 | KIN | ACTR2_ACTR2B_TGFbR2_group | 0.0485 | ENSP00000282641 | 0.086 | A1CF | REEILsEMKKV | ENSP00000241416, 0.5192 ENSP00000256759, 0.480... |
| 1 | A1CF (ENSP00000363105) | 154 | ACTR2B | 0.1659 | KIN | ACTR2_ACTR2B_TGFbR2_group | 0.0485 | ENSP00000282641 | 0.0837 | A1CF | REEILsEMKKV | ENSP00000340361, 0.5192 ENSP00000256759, 0.480... |
| 2 | A1CF (ENSP00000363105) | 154 | AMPKa1 | 0.0145 | KIN | AMPK_group | 0.0061 | ENSP00000282641 | 0.096 | A1CF | REEILsEMKKV | ENSP00000346148, 0.6272 ENSP00000233242, 0.213... |
| 3 | A1CF (ENSP00000363105) | 154 | AMPKa2 | 0.0145 | KIN | AMPK_group | 0.0061 | ENSP00000282641 | 0.1564 | A1CF | REEILsEMKKV | ENSP00000360290, 0.7304 ENSP00000385269, 0.675... |
| 4 | A1CF (ENSP00000363105) | 154 | ARAF | 0.2307 | KIN | ARAF_BRAF_RAF1_group | 0.0877 | ENSP00000282641 | 0.1414 | A1CF | REEILsEMKKV | ENSP00000366244, 0.76 ENSP00000356520, 0.688 E... |


### Some Notes Regarding NetworKIN scoring methods [useful for future data analyses]

The STRING network score is assigned based on network proximity, and the Netphorest classifiers produce Netphorest probability scores based on the peptide sequences. An algorithm combines the two into the networkin_score shown in the dataframe above. The calculations are also meant to account for the bias toward over-studied proteins. However, I will not be filtering out any of the kinases based on their scores, so these columns will not be necessary.
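
Although no score filter is applied in this notebook, a later analysis might want to keep only the higher-scoring predictions. The following is a minimal sketch on a few toy rows copied from the preview above; the helper name `filter_by_score` and the cutoff passed as `min_score` are hypothetical and are not a NetworKIN recommendation.

```python
import pandas as pd

def filter_by_score(predictions, min_score):
    """Keep rows whose networkin_score is at least min_score (hypothetical cutoff)."""
    return predictions[predictions['networkin_score'] >= min_score].copy()

# Toy rows mimicking three columns of the NetworKIN table
toy = pd.DataFrame({
    'substrate_name': ['A1CF', 'A1CF', 'A1CF'],
    'id': ['ACTR2', 'AMPKa1', 'ARAF'],
    'networkin_score': [0.1659, 0.0145, 0.2307],
})

# The 0.1 cutoff is an arbitrary illustration value
kept = filter_by_score(toy, min_score=0.1)
print(kept)
```

Returning a copy keeps the filtered dataframe independent of the full table, so later in-place edits do not touch the original data.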

In [3]:
#select columns necessary for .gmt format and filter into new dataframe 'df'
#should this be 'id' or 'substrate name'
#drop duplicate rows in the same step, so that no in-place change is made
#on a slice of 'net_df' (which raises a SettingWithCopyWarning)
df = net_df[['substrate_name', 'id']].drop_duplicates()

#len(df) before dropping duplicates is 5193537
#len(df) following drop of duplicates is 4634810
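
Selecting columns from one dataframe and then modifying that selection in place is the usual trigger for pandas' SettingWithCopyWarning. As a minimal sketch on toy data (the names `toy` and `pairs` are illustrative only), chaining the selection with `drop_duplicates()` returns a new dataframe and leaves the source untouched. It also shows one way duplicates can arise: rows that differ only in a column that is not selected.

```python
import pandas as pd

toy = pd.DataFrame({
    'substrate_name': ['A1CF', 'A1CF', 'A1CF'],
    'id': ['ACTR2', 'ACTR2', 'ARAF'],
    'position': [154, 160, 154],  # made-up positions for illustration
})

# Once 'position' is left out, the first two rows are the same (substrate, kinase) pair
pairs = toy[['substrate_name', 'id']].drop_duplicates()

print(len(toy), len(pairs))
```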


## Specify Species of Kinases

In [None]:
#Join kinase names with species name 'Homo sapiens'
#Build the labels in one vectorized step (e.g. 'ACTR2_Homo sapiens') and
#insert them as the first column 'kinase_organism'
df.insert(0, 'kinase_organism', df['id'] + '_Homo sapiens')

#View dataframe
df.head()
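
The term names used here follow a `<kinase>_<organism>` pattern. As a small sketch of that convention, the two helpers below (hypothetical names `make_label` and `split_label`, not part of the notebook) build a label and recover its parts. Splitting on the last underscore is an assumption that keeps kinase names containing underscores intact.

```python
def make_label(kinase, organism='Homo sapiens'):
    """Join a kinase name and an organism into one term name."""
    return '_'.join([kinase, organism])

def split_label(label):
    """Split a term name back into (kinase, organism) at the last underscore."""
    kinase, organism = label.rsplit('_', 1)
    return kinase, organism

label = make_label('ACTR2')
print(label)
print(split_label(label))
```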


## Group by 'kinase_organism' and Aggregate Kinase Targets

In [None]:
#Group rows by 'kinase_organism', which becomes the index of dataframe 'kin'
#Aggregate the substrates of each kinase into a tuple; only 'substrate_name' is
#aggregated, so no tuple-valued 'id' column is carried into the .gmt rows
kin = df.groupby('kinase_organism')[['substrate_name']].agg(lambda x: tuple(x))

#Create a new column 'Description' holding 'NetworKIN' as description of data
kin.insert(0, 'Description', 'NetworKIN')

#Visualize Data
kin.head()
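
To make the aggregation step concrete, here is a minimal sketch on toy rows showing how grouping collapses one row per kinase-substrate pair into one row per kinase. The substrate name `SUB2` is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'kinase_organism': ['ACTR2_Homo sapiens', 'ARAF_Homo sapiens', 'ACTR2_Homo sapiens'],
    'substrate_name': ['A1CF', 'A1CF', 'SUB2'],  # 'SUB2' is a made-up substrate
})

# Collect the substrates of each kinase into one tuple per group
grouped = toy.groupby('kinase_organism')['substrate_name'].agg(lambda s: tuple(s))

print(grouped)
```

The group keys become the index of the result, which is why the index has to be reset before the kinase name can be written out as an ordinary column.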

# Exploratory Data Analysis

## Calculate Number of Protein targets for each kinase
Create new column with the number of substrates related to each kinase, and sort the dataframe by this column.

In [None]:
# Create column representing counts of protein targets per kinase
kin['kinase_substrate_num'] = kin['substrate_name'].apply(len)

# Sort kinases from max to min according to number of protein targets each has
kin.sort_values(by='kinase_substrate_num', ascending=False, inplace=True)

# View dataframe
kin.head()

## Create Histogram to display distribution of number of targets per kinase

In [None]:
# Create histogram displaying the distribution of the number
# of targets per kinase
kin['kinase_substrate_num'].plot.hist(bins=63)

#Show histogram
plt.show()

# Creation of Final .GMT File

## Create Dictionary of Tab-Separated Rows of the Dataframe

In [None]:
#Reset index of the dataframe
kin.reset_index(inplace = True)

#create column 'substrates_merged' in which all 'substrate_name' elements are joined by a \t symbol
kin['substrates_merged'] = ['\t'.join(x) for x in kin['substrate_name']]

#drop the now-unnecessary column 'substrate_name'
kin.drop('substrate_name', axis=1, inplace = True)

#also drop the data-exploratory column 'kinase_substrate_num'
kin.drop('kinase_substrate_num', axis=1, inplace = True)

#Create dictionary 'NetworKIN_num' with index numbers as keys
NetworKIN_num = dict([(key, '') for key in kin.index])

# loop through rows with iterrows(), joining only the three string columns
# that make up a .gmt row: term name, description, tab-separated substrates
for index, rowData in kin.iterrows():
    line = '\t'.join(rowData[['kinase_organism', 'Description', 'substrates_merged']])
    NetworKIN_num[index] = line
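
Each line of a .gmt file holds a term name, a description, and then the members of the set, all separated by tabs. The sketch below builds one such line in plain Python; the helper name `gmt_line` and the substrate `SUB2` are illustrative assumptions, not part of the notebook.

```python
def gmt_line(term, description, members):
    """Return one .gmt row: term, description, then every member, tab-separated."""
    return '\t'.join([term, description, *members])

# 'SUB2' is a made-up substrate used only to show a second member
line = gmt_line('ACTR2_Homo sapiens', 'NetworKIN', ('A1CF', 'SUB2'))
print(repr(line))
```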

## Write Info from Dictionary into .GMT File

In [None]:
#Write the tab-separated rows into a new .gmt file, one kinase per line
with open('NetworKIN.gmt', 'w') as openfile:
    for index in NetworKIN_num:
        openfile.write(str(NetworKIN_num[index]) + '\n')
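
A quick way to sanity-check the output is to read the file back and confirm that each line splits into a term, a description, and a list of substrates. The following sketch uses only the standard library; the function name `read_gmt` is hypothetical, and it runs against a small temporary demo file rather than the real 'NetworKIN.gmt'.

```python
import os
import tempfile

def read_gmt(path):
    """Parse a .gmt file into {term: (description, [members])}."""
    gene_sets = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            gene_sets[fields[0]] = (fields[1], fields[2:])
    return gene_sets

# Round trip on a one-line demo file ('SUB2' is a made-up substrate)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.gmt')
    with open(demo_path, 'w') as out:
        out.write('ACTR2_Homo sapiens\tNetworKIN\tA1CF\tSUB2\n')
    parsed = read_gmt(demo_path)

print(parsed)
```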