The intention of this notebook is to provide the performance for baseline models.

<ins>Contents of this notebook:</ins>

- [ ] Regression baseline
- [ ] SVM baseline
- [ ] Random forest baseline
- [ ] MLP baseline
- [x] Drug mean baseline

# Imports <a name="imports"/>

In [2]:
!pip install torch
!pip install torch_geometric

Collecting torch
  Downloading torch-1.10.2-cp36-cp36m-manylinux1_x86_64.whl (881.9 MB)
     |████████████████████████████████| 881.9 MB 7.9 kB/s              
Installing collected packages: torch
Successfully installed torch-1.10.2


In [1]:
import sys
import os
import pickle
import random
import torch
import torch.nn as nn
import tqdm as tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

sys.path.append('/home/ec2-user/SageMaker/projects/gnn/GNN-material/src')
sys.path.append('/home/ec2-user/SageMaker/projects/gnn/data/')
sys.path = list(set(sys.path))

  from .autonotebook import tqdm as notebook_tqdm


# Global Settings <a name="global-settings"/>

- __train set__: Data to train the model on. Used for model selection.
- __validation set__: Data to validate how well the model has been trained. Used for model selection.
- __test set__: Data for prediction error of the final model (the model which has been selected in the selection process).

In [2]:
RANDOM_STATE = 42
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)
torch.manual_seed(RANDOM_STATE)

TRAIN_RATIO = 0.8
TEST_RATIO = 0.2

---

# Dataset Preparation <a name="Dataset Preparation"/>

In this notebook we are going to build simple baselines to the other models. The dataset for this model will be a simple table with the following columns
- cell-line indicator (`CELL_LINE_NAME`)
- drug (`DRUG_ID`)
- gene expression features (`<gene_id>_gexpr`)
- mutation features (`<gene_id>_mut`)
- copy number features (`<gene_id>_cnvg`, `<gene_id>_cnvp`)
- target value (`LN_IC50`)

Notes: 
- for the same cell-line `i`, there will be drug values `1, ...j, ..., q`, where `q = # of drugs` and `i = {1, ..., m}`, where `i = # of cell-lines`
- for the same cell-line `i` the feature values are exactly the same
  - this is because the gene expression, mutation and copy number values are PER CELL-LINE and HAVE NOTHING TO DO WITH THE DRUG
- the ic50 value however will be per cell-line drug tuple and thus be dependent on the cell-line as well as the drug

## Load Root Datasets <a name="load-root-datasets"/>

In [4]:
# --- Drug response matrix ---
with open('../../data/processed/' + 'gdsc2_drm.pkl', 'rb') as f: 
    drm = pickle.load(f)
    assert not drm.isna().sum().any(), "Some rows have missing cells."
print(drm.shape)
print("Number of unique drug names: ", len(drm.DRUG_ID.unique()))
print("Number of unique drug-ids  : ", len(drm.DRUG_NAME.unique()))
print("Number of unique cell-lines: ", len(drm.CELL_LINE_NAME.unique()))
drm.head(5)

(137835, 9)
Number of unique drug names:  181
Number of unique drug-ids  :  181
Number of unique cell-lines:  856


Unnamed: 0,DATASET,CELL_LINE_NAME,DRUG_NAME,DRUG_ID,SANGER_MODEL_ID,AUC,RMSE,Z_SCORE,LN_IC50
333161,GDSC2,PFSK-1,Camptothecin,1003,SIDM01132,0.930105,0.088999,0.432482,-1.462148
333162,GDSC2,A673,Camptothecin,1003,SIDM00848,0.614932,0.111423,-1.420322,-4.869447
333163,GDSC2,ES5,Camptothecin,1003,SIDM00263,0.790953,0.142754,-0.599894,-3.360684
333164,GDSC2,ES7,Camptothecin,1003,SIDM00269,0.592624,0.135642,-1.515791,-5.045014
333165,GDSC2,EW-11,Camptothecin,1003,SIDM00203,0.733992,0.128066,-0.807038,-3.74162


In [5]:
# --- Gene feature matrix ---
with open('../../data/processed/gdsc2/990/' + f'thresh_gdsc2_990_gene_mat.pkl', 'rb') as f: 
    cl_gene_mat = pickle.load(f)
    assert not cl_gene_mat.isna().sum().any(), "Some rows have missing cells."
print(cl_gene_mat.shape)
print("Number of unique cell-lines: ", len(cl_gene_mat.CELL_LINE_NAME.unique()))
cl_gene_mat.head(5)

(856, 1173)
Number of unique cell-lines:  856


Unnamed: 0,CELL_LINE_NAME,PLK1_gexpr,RUVBL1_gexpr,TERT_gexpr,EIF4EBP1_gexpr,RPS6_gexpr,STXBP1_gexpr,CDC42_gexpr,SFN_gexpr,POLB_gexpr,...,MYC_mut,CCNB1_mut,TICAM1_mut,CENPE_mut,CFLAR_mut,AKT1_mut,FBXO11_mut,PSMB8_mut,PRKCD_mut,FBXO7_mut
0,22RV1,3.376377,7.935522,3.22259,9.062096,12.713941,4.554855,4.461362,8.570664,8.331353,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
1,23132-87,3.525281,8.745985,3.329384,10.303132,12.802973,4.206494,4.112894,9.06954,8.788701,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,42-MG-BA,3.769823,6.651761,3.492581,9.790909,11.856622,5.397945,4.45785,3.673494,7.443434,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,451Lu,3.922186,8.386289,3.479026,9.40878,12.290141,6.228226,5.412469,4.064795,8.27265,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,5637,3.531716,8.347308,3.243133,9.884705,12.588677,3.683614,4.389642,9.927606,8.998844,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0


In [6]:
# Inner join drug response matrix with the features.
join = pd.merge(left=drm[['CELL_LINE_NAME', 'LN_IC50']],
                right=cl_gene_mat,
                on=['CELL_LINE_NAME'],
                how='inner')
print(join.shape)
join.head(5)

(137835, 1174)


Unnamed: 0,CELL_LINE_NAME,LN_IC50,PLK1_gexpr,RUVBL1_gexpr,TERT_gexpr,EIF4EBP1_gexpr,RPS6_gexpr,STXBP1_gexpr,CDC42_gexpr,SFN_gexpr,...,MYC_mut,CCNB1_mut,TICAM1_mut,CENPE_mut,CFLAR_mut,AKT1_mut,FBXO11_mut,PSMB8_mut,PRKCD_mut,FBXO7_mut
0,PFSK-1,-1.462148,4.457667,8.072458,3.422545,10.496699,12.776203,4.993946,4.82063,3.56576,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,PFSK-1,-4.996545,4.457667,8.072458,3.422545,10.496699,12.776203,4.993946,4.82063,3.56576,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,PFSK-1,3.213791,4.457667,8.072458,3.422545,10.496699,12.776203,4.993946,4.82063,3.56576,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,PFSK-1,-1.847259,4.457667,8.072458,3.422545,10.496699,12.776203,4.993946,4.82063,3.56576,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,PFSK-1,2.10137,4.457667,8.072458,3.422545,10.496699,12.776203,4.993946,4.82063,3.56576,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [28]:
gexpr_cols = join.filter(like='_gexpr'); print(gexpr_cols.shape)
cnvg_cols = join.filter(like='_cnvg'); print(cnvg_cols.shape)
cnvp_cols = join.filter(like='_cnvp'); print(cnvp_cols.shape)

(137835, 293)
(137835, 293)
(137835, 293)


In [30]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Apply standard scaling to the specified columns
join[gexpr_cols.columns.tolist()] = scaler.fit_transform(join[gexpr_cols.columns.tolist()])
join[cnvg_cols.columns.tolist()] = scaler.fit_transform(join[cnvg_cols.columns.tolist()])
join[cnvp_cols.columns.tolist()] = scaler.fit_transform(join[cnvp_cols.columns.tolist()])
join.head(3)

Unnamed: 0,CELL_LINE_NAME,LN_IC50,PLK1_gexpr,RUVBL1_gexpr,TERT_gexpr,EIF4EBP1_gexpr,RPS6_gexpr,STXBP1_gexpr,CDC42_gexpr,SFN_gexpr,...,MYC_mut,CCNB1_mut,TICAM1_mut,CENPE_mut,CFLAR_mut,AKT1_mut,FBXO11_mut,PSMB8_mut,PRKCD_mut,FBXO7_mut
0,PFSK-1,-1.462148,0.95287,-0.436216,-0.218618,1.393951,0.697062,0.277039,0.485848,-1.070033,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,PFSK-1,-4.996545,0.95287,-0.436216,-0.218618,1.393951,0.697062,0.277039,0.485848,-1.070033,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,PFSK-1,3.213791,0.95287,-0.436216,-0.218618,1.393951,0.697062,0.277039,0.485848,-1.070033,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


## Create Modeling Datasets <a name="create-modeling-datasets"/>

In [11]:
from sklearn.model_selection import train_test_split

In [31]:
drm_train, drm_test = train_test_split(
    join, 
    test_size=0.2,
    random_state=42,
    stratify=drm['CELL_LINE_NAME']
)
print(f"train: {drm_train.shape[0]:7,}")
print(f"test : {drm_test.shape[0]:7,}")

train: 110,268
test :  27,567


In [32]:
X = join.loc[:, ~join.columns.isin(['CELL_LINE_NAME', 'LN_IC50'])]
y = join['LN_IC50']

X_train = drm_train.loc[:, ~drm_train.columns.isin(['CELL_LINE_NAME', 'LN_IC50'])]
y_train = drm_train['LN_IC50']

X_test = drm_test.loc[:, ~drm_test.columns.isin(['CELL_LINE_NAME', 'LN_IC50'])]
y_test = drm_test['LN_IC50']

In [33]:
print(f"""Final dataset shapes:
=====================
             {'X':<15s} {'y':<10s}
Training   : {str(X_train.shape):15s} {str(y_train.shape):10s}
Testing    : {str(X_test.shape):15s} {str(y_test.shape):10s}
""")

Final dataset shapes:
             X               y         
Training   : (110268, 1172)  (110268,) 
Testing    : (27567, 1172)   (27567,)  



# Model Building <a name="model-building"/>

## Linear Regression <a name="#linear-regression"/>

In [34]:
from sklearn.linear_model import LinearRegression

linregr = LinearRegression()
linregr.fit(X_train, y_train)
print(linregr.score(X_test, y_test))

0.10090062485507845


In [35]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from scipy.stats import pearsonr, spearmanr
  
y_pred = linregr.predict(X_test)

mae = mean_absolute_error(y_true=y_test, y_pred=y_pred)
mse = mean_squared_error(y_true=y_test, y_pred=y_pred, squared=True)
rmse = mean_squared_error(y_true=y_test, y_pred=y_pred, squared=False)
pcc, _ = pearsonr(y_test, y_pred)
scc, _ = spearmanr(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("MAE  :", mae)
print("MSE  :", mse)
print("RMSE :", rmse)
print("PCC  :", pcc)
print("SCC  :", scc)
print("R2   :", r2)

MAE  : 1.9694354439357937
MSE  : 6.433135124786564
RMSE : 2.536362577548124
PCC  : 0.3202600667251888
SCC  : 0.3333729438920468


## Ridge Regression

In [39]:
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

ridge = Ridge(alpha=0.5)
ridge.fit(X_train, y_train)

# Predict on test data
y_pred = ridge.predict(X_test)

from sklearn.metrics import mean_absolute_error, mean_squared_error
from scipy.stats import pearsonr, spearmanr
  
y_pred = ridge.predict(X_test)

mae = mean_absolute_error(y_true=y_test, y_pred=y_pred)
mse = mean_squared_error(y_true=y_test, y_pred=y_pred, squared=True)
rmse = mean_squared_error(y_true=y_test, y_pred=y_pred, squared=False)
pcc, _ = pearsonr(y_test, y_pred)
scc, _ = spearmanr(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("MAE  :", mae)
print("MSE  :", mse)
print("RMSE :", rmse)
print("PCC  :", pcc)
print("SCC  :", scc)
print("R2   :", r2)

MAE  : 1.9626524830148593
MSE  : 6.393542803236027
RMSE : 2.528545590499809
PCC  : 0.32708891069741136
SCC  : 0.34069741173046386


## Lasso Regression

In [46]:
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score

lasso = Lasso(alpha=0.3)
lasso.fit(X_train, y_train)

# Predict on test data
y_pred = lasso.predict(X_test)

from sklearn.metrics import mean_absolute_error, mean_squared_error
from scipy.stats import pearsonr, spearmanr
  
y_pred = lasso.predict(X_test)

mae = mean_absolute_error(y_true=y_test, y_pred=y_pred)
mse = mean_squared_error(y_true=y_test, y_pred=y_pred, squared=True)
rmse = mean_squared_error(y_true=y_test, y_pred=y_pred, squared=False)
pcc, _ = pearsonr(y_test, y_pred)
scc, _ = spearmanr(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("MAE  :", mae)
print("MSE  :", mse)
print("RMSE :", rmse)
print("PCC  :", pcc)
print("SCC  :", scc)
print("R2   :", r2)

MAE  : 2.042081832190281
MSE  : 6.940327659961665
RMSE : 2.634450162740162
PCC  : 0.21156487305145985
SCC  : 0.1825619715724882


## GradientBoostingRegressor

In [49]:
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)

y_pred = gbr.predict(X_test)

mae = mean_absolute_error(y_true=y_test, y_pred=y_pred)
mse = mean_squared_error(y_true=y_test, y_pred=y_pred, squared=True)
rmse = mean_squared_error(y_true=y_test, y_pred=y_pred, squared=False)
pcc, _ = pearsonr(y_test, y_pred)
scc, _ = spearmanr(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
  
print("MAE  :", mae)
print("MSE  :", mse)
print("RMSE :", rmse)
print("PCC  :", pcc)
print("SCC  :", scc)
print("R2   :", r2)

MAE  : 1.9700595586832526
MSE  : 6.43330385305926
RMSE : 2.5363958391897863
PCC  : 0.3194547546295964
SCC  : 0.33296268479824437
R2   : 0.10087704327601832


## Random Forest <a name="random-forest"/>

In [47]:
from sklearn.ensemble import RandomForestRegressor

rfregr = RandomForestRegressor()
rfregr.fit(X_train, y_train)
print(rfregr.score(X_test, y_test))

0.10634275969721563


In [48]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from scipy.stats import pearsonr, spearmanr
  
y_pred = rfregr.predict(X_test)

mae = mean_absolute_error(y_true=y_test, y_pred=y_pred)
mse = mean_squared_error(y_true=y_test, y_pred=y_pred, squared=True)
rmse = mean_squared_error(y_true=y_test, y_pred=y_pred, squared=False)
pcc, _ = pearsonr(y_test, y_pred)
scc, _ = spearmanr(y_test, y_pred)
  
print("MAE  :", mae)
print("MSE  :", mse)
print("RMSE :", rmse)
print("PCC  :", pcc)
print("SCC  :", scc)

MAE  : 1.9625570449299503
MSE  : 6.394196171235257
RMSE : 2.528674785581423
PCC  : 0.32698616989315465
SCC  : 0.3405751272046242


## Drug Mean

In this section we are going to calculate the performance if we would exclusively predict the drug mean from the training set.

In [92]:
import pickle
import pandas as pd 

path_to_drm = '../../data/processed/gdsc2/gdsc2_drm.pkl'

with open(path_to_drm, 'rb') as f: 
    drm = pickle.load(f)
    print(f"Finished reading drug response matrix: {drm.shape}")
drm.head(3)

Finished reading drug response matrix: (137835, 9)


Unnamed: 0,DATASET,CELL_LINE_NAME,DRUG_NAME,DRUG_ID,SANGER_MODEL_ID,AUC,RMSE,Z_SCORE,LN_IC50
333161,GDSC2,PFSK-1,Camptothecin,1003,SIDM01132,0.930105,0.088999,0.432482,-1.462148
333162,GDSC2,A673,Camptothecin,1003,SIDM00848,0.614932,0.111423,-1.420322,-4.869447
333163,GDSC2,ES5,Camptothecin,1003,SIDM00263,0.790953,0.142754,-0.599894,-3.360684


In [107]:
drm_train, drm_test = train_test_split(
    drm, 
    test_size=0.2,
    random_state=42,
    stratify=drm['CELL_LINE_NAME']
)
print(f"train: {drm_train.shape[0]:7,}")
print(f"test : {drm_test.shape[0]:7,}")

train: 110,268
test :  27,567


In [101]:
drug_mean_train = drm_train.groupby(['DRUG_NAME'])['LN_IC50'].mean().to_frame().reset_index()
drug_mean_train.rename(columns={'LN_IC50': 'AVG_LN_IC50'}, inplace=True)
drug_mean_train.head(3)

Unnamed: 0,DRUG_NAME,AVG_LN_IC50
0,5-Fluorouracil,4.400241
1,ABT737,1.881357
2,AGI-5198,4.704752


In [110]:
drm_test_with_avg = pd.merge(drm_test, drug_mean_train, on='DRUG_NAME', how='left')
print(drm_test_with_avg.shape)
drm_test_with_avg.head(3)

(27567, 10)


Unnamed: 0,DATASET,CELL_LINE_NAME,DRUG_NAME,DRUG_ID,SANGER_MODEL_ID,AUC,RMSE,Z_SCORE,LN_IC50,AVG_LN_IC50
0,GDSC2,GOTO,Dinaciclib,1180,SIDM00544,0.735695,0.116286,0.797829,-1.478976,-2.675501
1,GDSC2,COR-L311,AZ960,1250,SIDM00509,0.841005,0.126621,-0.124919,1.680296,1.916362
2,GDSC2,SCC-25,Ruxolitinib,1507,SIDM01082,0.968012,0.056758,-0.522201,4.19073,4.789988


In [115]:
# --- Calculate metrics ---
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from scipy.stats     import pearsonr, spearmanr

In [120]:
y_true = drm_test_with_avg['LN_IC50'].values.tolist()
y_pred = drm_test_with_avg['AVG_LN_IC50'].values.tolist()

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
pcc = pearsonr(y_true, y_pred).statistic
scc = spearmanr(y_true, y_pred).statistic
r2 = r2_score(y_true, y_pred)

print(f"MSE : {mse}")
print(f"RMSE: {rmse}")
print(f"MAE : {mae}")
print(f"PCC : {pcc}")
print(f"SCC : {scc}")
print(f"R2  : {r2}")

MSE : 2.4007143763336902
RMSE: 1.5494238852985616
MAE : 1.164739616495685
PCC : 0.8137098935802614
SCC : 0.7705617591171653
R2  : 0.6620864191924242
