## Purpose of script

**Input data:** 
- KNN Imputer filled full TCGA dataset - contains only the cancer type and protein expression

**Output data:** 
- Machine learning (70%) and validation (30%) TCGA datasets restricted to the shortlisted proteins - contain the cancer type and the validated proteins as defined in the following paper: Akbani, R., Ng, P., Werner, H. et al. A pan-cancer proteomic perspective on The Cancer Genome Atlas. Nat Commun 5, 3887 (2014). https://doi.org/10.1038/ncomms4887
- Unique gene list - contains all the genes associated with the shortlisted proteins; this will be used to create the distance matrix
- Protein to gene conversion list - contains the shortlisted proteins and the gene or genes each protein is associated with; this will be used to create the distance matrix

## Importing data

In [1]:
# Importing necessary packages
import pandas as pd 
import numpy as np 

In [2]:
# Reading in full TCGA data which contains all the proteins
Full_TCGA_data_ori = pd.read_csv("../R/Data/Processed_Data/KNN_filled_data.csv")
Full_TCGA_data_ori.drop(['Unnamed: 0'], axis=1, inplace=True)

Full_TCGA_data_ori = Full_TCGA_data_ori.sort_values(by='project_id')


Full_TCGA_data_ori

Unnamed: 0,project_id,1433BETA,1433EPSILON,1433ZETA,4EBP1,4EBP1_pS65,4EBP1_pT37T46,4EBP1_pT70,53BP1,ACC_pS79,...,XPF,XRCC1,YAP,YAP_pS127,YB1,YB1_pS102,YTHDF2,YTHDF3,ZAP.70,ZEB1
5424,0.0,0.066729,0.412330,-0.229260,-0.034305,-0.030930,0.71239,-0.240950,-0.916615,-0.285390,...,0.436293,-0.361380,0.106262,-0.529325,-0.530005,0.002020,-0.397034,-0.382803,-1.629115,1.311241
5207,0.0,0.218150,-0.096406,-0.192420,-0.445430,-0.102640,0.19628,-0.095456,-1.399500,-0.478150,...,-0.100140,-0.676610,0.304890,-0.540600,-0.334280,1.229900,-0.328632,-0.330245,-1.254223,0.192863
1792,0.0,1.880500,0.604240,-0.310380,0.601800,-0.180450,0.11812,-0.202220,-2.182900,1.554200,...,-0.586610,0.469800,-0.166270,-0.920470,-0.943470,0.278090,-0.948951,-0.073959,-0.796032,-0.208793
6377,0.0,0.212780,0.574020,-0.323870,-0.513780,-0.263550,0.92589,-0.153070,0.321780,0.091441,...,0.297383,-0.486600,-0.330460,-1.106000,-0.150950,0.589190,-1.291660,-0.892555,-0.931502,0.991503
259,0.0,0.004928,0.273890,-0.357320,-0.472230,0.188780,1.94850,-0.110500,0.136650,-0.000208,...,-0.060720,-0.334700,-0.248240,-1.378500,-0.026322,0.213700,-0.524764,-0.613327,-1.288974,0.253337
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5750,27.0,0.002211,0.074397,0.670265,-0.121499,-0.282985,-1.47840,-0.092727,-0.218190,-0.157340,...,-0.079172,-0.158024,-0.058338,-0.231340,0.260340,-0.322565,0.869280,0.122838,-0.288481,-0.661225
2624,27.0,0.071701,0.021365,-0.206400,-0.440120,-0.247860,0.21098,-0.434760,-0.258630,-0.064565,...,-0.307446,-0.177410,0.123340,0.190050,-0.344350,-0.332820,-0.788958,-0.064927,-1.246426,-0.502380
5007,27.0,0.006940,-0.064350,-0.074589,0.198150,-0.141280,0.30652,-0.319210,-0.613640,-0.160340,...,-0.108959,-0.065061,0.339020,0.419500,0.000640,-0.157810,-0.014350,-0.172026,-1.309777,-0.239612
6925,27.0,0.131440,-0.078326,0.308450,0.464070,-0.370510,-1.33630,-0.191930,-1.317700,-0.028137,...,-0.072274,0.038092,0.406310,-0.303230,0.127870,-0.026296,-0.806935,-0.084823,-1.404030,0.284475


## Creating dataset with the shortlisted proteins from the paper

**First do this before the following code:** From the reference paper, download Supplementary Data 9 and put it into the folder R/Data/Processed_Data.

**NOTE:** ensure that the 'Pan-can ab list' sheet is the first sheet in the Excel file you just downloaded, because the code below reads the first sheet by default.
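The dependency on sheet order can be avoided by requesting the sheet by name. The following is a minimal sketch, not the code used in this notebook. It assumes the sheet is named exactly 'Pan-can ab list' and that the real column names sit in the second row of the sheet, which is what the cells below rely on. `keep_validated` and `load_validated_antibodies` are hypothetical helper names.

```python
import pandas as pd


def keep_validated(antibodies: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose antibody validation status is 'Validated'."""
    return antibodies[antibodies['Antibody validation status'] == 'Validated']


def load_validated_antibodies(path: str,
                              sheet_name: str = 'Pan-can ab list') -> pd.DataFrame:
    """Read the named sheet and return the validated antibodies only."""
    # header=1 uses the second row of the sheet as the column names (assumption)
    antibodies = pd.read_excel(path, sheet_name=sheet_name, header=1)
    return keep_validated(antibodies)
```

With this approach, `load_validated_antibodies("../R/Data/Processed_Data/41467_2014_BFncomms4887_MOESM464_ESM.xlsx")` would not depend on which sheet comes first in the workbook.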

In [75]:
# reads by default 1st sheet of an excel file
short_protein_list = pd.read_excel("../R/Data/Processed_Data/41467_2014_BFncomms4887_MOESM464_ESM.xlsx")

# Set a specific row as column names
short_protein_list.columns = short_protein_list.iloc[0]

# Drop the row that was used as column names
short_protein_list = short_protein_list[1:]

# Keeping only validated proteins
short_protein_list = short_protein_list[short_protein_list['Antibody validation status'] == 'Validated']

short_protein_list



Unnamed: 0,#,Protein Name,Gene Name,Antibody validation status,Antibody Origin,Antibody Source (Company),Catalog Number,Dilution
2,2,4E-BP1_pS65,EIF4EBP1,Validated,Rabbit,CST,9456,1:250
3,3,4E-BP1_pT37,EIF4EBP1,Validated,Rabbit,CST,9459,1:100
4,4,4E-BP1,EIF4EBP1,Validated,Rabbit,CST,9452,1:100
6,6,ACC_pS79,ACACA ACACB,Validated,Rabbit,CST,3661,1:250
8,8,Akt_pS473,AKT1 AKT2 AKT3,Validated,Rabbit,CST,9271,1:250
...,...,...,...,...,...,...,...,...
171,171,Transglutaminase,TGM2,Validated,Mouse,Lab Vision,MS-224,1:100
172,172,TFRC,TFRC,Validated,Rabbit,SDI / Novus,22500002,1:100
174,174,Tuberin_pT1462,TSC2,Validated,Rabbit,CST,3617,1:500
180,180,ETS-1,ETS-1,Validated,Rabbit,BethYl,A303-501A,1:750


In [76]:
# reads by default 1st sheet of an excel file
short_protein_list = pd.read_excel("../R/Data/Processed_Data/41467_2014_BFncomms4887_MOESM464_ESM.xlsx")

# Setting a specific row as column names
short_protein_list.columns = short_protein_list.iloc[0]

# Drop the row that was used as column names
short_protein_list = short_protein_list[1:]

# Keeping only validated proteins
short_protein_list = short_protein_list[short_protein_list['Antibody validation status'] == 'Validated']

#short_protein_list1 = short_protein_list['Official Ab Name ']
short_protein_list = short_protein_list['Protein Name']
short_protein_list = [x for x in short_protein_list if str(x) != 'nan']


# ----------------------------------------------------------------------------
# Ensuring that the protein names in my data frame and their list match
short_protein_list = pd.DataFrame(short_protein_list)
short_protein_list.replace({'-':''}, regex=True, inplace=True)
short_protein_list.replace({'_':''}, regex=True, inplace=True)
short_protein_list.replace({' ':''}, regex=True, inplace=True)
short_protein_list = short_protein_list[0].str.lower()
# regex=True is needed for r'\W' to be treated as a pattern (pandas >= 2.0 matches literally by default)
short_protein_list = short_protein_list.str.replace(r'\W', '', regex=True)

# Working variable for the dataset with normalised column names.
# NOTE: this is an alias, not a copy, so the renames below also apply to Full_TCGA_data_ori
Full_TCGA_data = Full_TCGA_data_ori
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace(r'\W', '', regex=True)
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace('-','')
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace('_','')
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace(' ','')


# Checking how many names in the paper's protein list are in my data frame
common_proteins1 = short_protein_list.isin(Full_TCGA_data.columns.str.lower())
unique_values, counts = np.unique(common_proteins1, return_counts=True)

for value, count in zip(unique_values, counts):
    print(f"{value} occurs {count} times")



False occurs 4 times
True occurs 109 times




Since 4 proteins were not found in common, I will check whether the paper uses a different name for them by looking at the second sheet, 'Standard Ab List_Website', whose 'Official Ab name' column gives the official names corresponding to the protein list used in the paper.
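The name check can be illustrated on a handful of names copied from the tables shown earlier in this notebook. This is a minimal sketch using only the standard library; `normalise` is a hypothetical helper that applies the same idea as the cells here (lower-case the name and drop every non-alphanumeric character).

```python
import re


def normalise(name: str) -> str:
    """Lower-case a name and drop every non-alphanumeric character."""
    # \W does not match '_', so the underscore is listed explicitly
    return re.sub(r'[\W_]+', '', name).lower()


# Names copied from the paper's list and from the dataset columns shown above
paper_names = ['4E-BP1_pS65', '4E-BP1_pT37', 'ACC_pS79']
dataset_columns = ['4EBP1_pS65', '4EBP1_pT37T46', 'ACC_pS79']

dataset_keys = {normalise(column) for column in dataset_columns}
unmatched = [name for name in paper_names if normalise(name) not in dataset_keys]
print(unmatched)  # ['4E-BP1_pT37']
```

Names that remain unmatched after normalisation, such as '4E-BP1_pT37' here, are the ones that need the manual lookup.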

In [77]:
# Exporting the uncommon proteins for manual edit
common_proteins1 = pd.DataFrame(common_proteins1)
X = pd.DataFrame(common_proteins1[common_proteins1[0]== False])
index = X.index.values
y1 = short_protein_list.iloc[index]
y1.to_csv("../R/Data/Processed_Data/Not_common_proteins.csv")


In [78]:
# Importing the manually edited version
found_proteins1 = pd.read_csv("../R/Data/Processed_Data/Not_common_proteins_manual_edit.csv")
found_proteins = pd.DataFrame(found_proteins1['protein'])

found_proteins.replace({'-':''}, regex=True, inplace=True)
found_proteins.replace({'_':''}, regex=True, inplace=True)
found_proteins.replace({' ':''}, regex=True, inplace=True)
found_proteins = found_proteins['protein'].str.lower()
found_proteins = found_proteins.str.replace(r'\W', '', regex=True)

common_proteins2 = found_proteins.isin(Full_TCGA_data.columns.str.lower())

unique_values, counts = np.unique(common_proteins2, return_counts=True)


# Print the results
for value, count in zip(unique_values, counts):
    print(f"{value} occurs {count} times")

# After the manual edit I was able to identify all of them in my dataset

True occurs 4 times


All 113 validated proteins were found in my dataset.

Now I can pull out these proteins from the full TCGA dataset.

In [79]:
X1 = pd.DataFrame(common_proteins1[common_proteins1[0]== True])
index = X1.index.values
y1 = short_protein_list.iloc[index]

common_proteins2 = pd.DataFrame(common_proteins2)
X2 = pd.DataFrame(common_proteins2[common_proteins2['protein']== True])
index = X2.index.values
y2 = found_proteins1['protein'].iloc[index]


y = pd.concat([y1,y2], ignore_index=True)

y.replace({'-':''}, regex=True, inplace=True)
y.replace({'_':''}, regex=True, inplace=True)
y.replace({' ':''}, regex=True, inplace=True)
y = y.str.lower()
y = y.str.replace(r'\W', '', regex=True)

Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace(r'\W', '', regex=True)
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace('-','')
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace('_','')
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace(' ','')

common = y.isin(Full_TCGA_data.columns.str.lower())

unique_values, counts = np.unique(common, return_counts=True)

# Print the results
for value, count in zip(unique_values, counts):
    print(f"{value} occurs {count} times")



True occurs 113 times


In [80]:
# Producing the shortlisted Full TCGA dataframe and saving this
y = list(y)
# Prepend the cancer-type column (the first column of the dataset)
y.insert(0, Full_TCGA_data.columns[0])
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace(r'\W', '', regex=True)
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace('-','')
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace('_','')
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace(' ','')
Full_TCGA_data.columns = Full_TCGA_data.columns.str.lower() 


shortlisted_TCGA_data = Full_TCGA_data[y]
shortlisted_TCGA_data.to_csv("../R/Data/Processed_Data/Shortlisted_protein_list.csv")


## Creating the machine learning and validation TCGA datasets

In [82]:
from sklearn.model_selection import train_test_split

# Group the DataFrame by cancer type (the 'projectid' column)
grouped_df = shortlisted_TCGA_data.groupby('projectid')

# Create empty DataFrames for the 70% and 30% splits
ML_70_percent = pd.DataFrame()
V_30_percent = pd.DataFrame()

# Iterate over groups and split each group
for group_name, group_data in grouped_df:
    group_data_70, group_data_30 = train_test_split(group_data, test_size=0.3, random_state=42)
    ML_70_percent = pd.concat([group_data_70, ML_70_percent])
    V_30_percent = pd.concat([group_data_30, V_30_percent])


In [86]:
# Saving the final machine learning and validation dataset
ML_70_percent = ML_70_percent.sort_values(by='projectid')
ML_70_percent.to_csv("../R/Data/Processed_Data/Final_Machine_Learning_Data.csv")
V_30_percent = V_30_percent.sort_values(by='projectid')
V_30_percent.to_csv("../R/Data/Processed_Data/Final_Validation_Data.csv")
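An alternative to splitting each cancer type in a loop is the `stratify` argument of `train_test_split`, which keeps the class proportions in a single call. The following is a minimal sketch on toy data, not the method used above. `stratified_split` is a hypothetical helper, and the toy values are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(data: pd.DataFrame, label_col: str = 'projectid',
                     test_size: float = 0.3, seed: int = 42):
    """Split once, keeping the class proportions of label_col in both parts."""
    return train_test_split(data, test_size=test_size, random_state=seed,
                            stratify=data[label_col])


# Toy data: three cancer types with ten samples each (illustrative values only)
toy = pd.DataFrame({'projectid': [0.0] * 10 + [1.0] * 10 + [2.0] * 10,
                    'protein': range(30)})
ml_part, validation_part = stratified_split(toy)
print(validation_part['projectid'].value_counts().sort_index())
```

The two approaches may not give identical row counts: the loop rounds the 30% within each cancer type, whereas `stratify` rounds once over the whole dataset. `stratify` also raises an error if a class has too few samples to appear in both parts.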

## Creating a dataset which contains the validated proteins and their associated gene

In [63]:
protein_gene_list =  pd.read_excel("../R/Data/Processed_Data/41467_2014_BFncomms4887_MOESM464_ESM.xlsx")

# Setting a specific row as column names
protein_gene_list.columns = protein_gene_list.iloc[0]

# Drop the row that was used as column names
protein_gene_list = protein_gene_list[1:]

# Keeping only validated proteins
protein_gene_list = protein_gene_list[protein_gene_list['Antibody validation status'] == 'Validated']

# .copy() so that later column assignments do not act on a view of the filtered frame
protein_gene_list = protein_gene_list[['Protein Name', 'Gene Name']].copy()


protein_gene_list





Unnamed: 0,Protein Name,Gene Name
2,4ebp1ps65,EIF4EBP1
3,4ebp1pt37,EIF4EBP1
4,4ebp1,EIF4EBP1
6,accps79,ACACA ACACB
8,aktps473,AKT1 AKT2 AKT3
...,...,...
171,transglutaminase,TGM2
172,tfrc,TFRC
174,tuberinpt1462,TSC2
180,ets1,ETS-1


In [66]:
# Ensuring names are in the same format
protein_gene_list['Protein Name'] = protein_gene_list['Protein Name'].str.replace(r'\W', '', regex=True)
protein_gene_list['Protein Name'] = protein_gene_list['Protein Name'].str.replace('-','')
protein_gene_list['Protein Name'] = protein_gene_list['Protein Name'].str.replace('_','')
protein_gene_list['Protein Name'] = protein_gene_list['Protein Name'].str.replace(' ','')
protein_gene_list['Protein Name'] = protein_gene_list['Protein Name'].str.lower() 

# Create a dictionary mapping the different protein names to those found in my dataset
replacement_dict = dict(zip(found_proteins1['0'], found_proteins1['protein']))

# Replace names based on the mapping in replacement_dict
protein_gene_list['Protein Name'] = protein_gene_list['Protein Name'].replace(replacement_dict)

#-------------------------------------------------------------------------------------------------#
# Double-checking that the mapping was done correctly
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace(r'\W', '', regex=True)
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace('-','')
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace('_','')
Full_TCGA_data.columns = Full_TCGA_data.columns.str.replace(' ','')

common1 = protein_gene_list['Protein Name'].isin(Full_TCGA_data.columns.str.lower())

unique_values, counts = np.unique(common1, return_counts=True)

# Print the results
for value, count in zip(unique_values, counts):
    print(f"{value} occurs {count} times")



True occurs 113 times


Now the protein to gene dataframe contains the protein names as they appear in my dataset. This is required because this dataframe will be used as the reference when creating the protein interactions matrix.
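The renaming step relies on `Series.replace` with a dictionary, which swaps whole values that match a key and leaves everything else untouched. A minimal sketch, using one name pair that appears in the tables of this notebook (the rest of the mapping lives in the manually edited CSV and is not reproduced here):

```python
import pandas as pd

# One old -> new name pair taken from the tables in this notebook
replacement_example = {'4ebp1pt37': '4ebp1pt37t46'}

names = pd.Series(['4ebp1ps65', '4ebp1pt37', '4ebp1'])
renamed = names.replace(replacement_example)
print(list(renamed))  # ['4ebp1ps65', '4ebp1pt37t46', '4ebp1']
```

Because only exact matches are replaced, names such as '4ebp1' are not affected even though they are a prefix of a key.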

## Creating a unique gene list

In [68]:
# Creating a list of the unique gene names (a protein may map to several space-separated genes)
unique_list = []

for value in protein_gene_list['Gene Name']:
    unique_list.extend(value.split())

unique_list = list(set(unique_list))



In [69]:
unique_list = pd.DataFrame(unique_list)
unique_list.to_csv("../R/Data/Processed_Data/Shortlisted_unique_gene_list.csv")
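Note that `list(set(...))` does not guarantee a stable order, so the rows of the saved CSV can differ between runs. If a reproducible file is wanted, sorting the set is one option. A minimal sketch on a few 'Gene Name' entries copied from the table above:

```python
# 'Gene Name' entries copied from the table above; some hold several genes
gene_entries = ['EIF4EBP1', 'EIF4EBP1', 'ACACA ACACB', 'AKT1 AKT2 AKT3']

# Split each entry on whitespace, de-duplicate, then sort for a stable order
unique_genes = sorted({gene for entry in gene_entries for gene in entry.split()})
print(unique_genes)
# ['ACACA', 'ACACB', 'AKT1', 'AKT2', 'AKT3', 'EIF4EBP1']
```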

## Creating a protein-gene conversion list

This is necessary because the R script needs a protein-gene conversion dataframe in which each individual gene sits in its own cell, i.e. one gene per column.

In [71]:
gene = protein_gene_list['Gene Name'].str.split(expand=True)
protein_to_gene = pd.concat([protein_gene_list['Protein Name'], gene], axis=1)
protein_to_gene.head()

Unnamed: 0,Protein Name,0,1,2
2,4ebp1ps65,EIF4EBP1,,
3,4ebp1pt37t46,EIF4EBP1,,
4,4ebp1,EIF4EBP1,,
6,accps79,ACACA,ACACB,
8,aktps473,AKT1,AKT2,AKT3


In [72]:
protein_to_gene.to_csv("../R/Data/Processed_Data/Protein_to_gene_conversion.csv")