The purpose of this notebook is to establish the performance of baseline models against which the other models can be compared.

<ins>Contents of this notebook:</ins>

- [ ] Build regression model
- [ ] Build SVM model 
- [ ] Build random forest 
- [ ] Build MLP

## Table of Contents

* [Imports](#imports)
* [Global Settings](#global-settings)
* [Dataset Preparation](#dataset-preparation)
    * [Load Root Datasets](#load-root-datasets)
    * [Create Modeling Datasets](#create-modeling-datasets)
* [Model Building](#model-building)
    * [Linear Regression](#linear-regression)
    * [SVM](#svm)
    * [Random Forest](#random-forest)

# Imports <a name="imports"/>

In [24]:
import sys
import os
import pickle
import random
import torch
import torch.nn as nn
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

sys.path.append('/home/ec2-user/SageMaker/projects/gnn/GNN-material/src')
sys.path.append('/home/ec2-user/SageMaker/projects/gnn/data/')
sys.path = list(dict.fromkeys(sys.path))  # de-duplicate while preserving the import search order

# Global Settings <a name="global-settings"/>

- __train set__: Data the model is fitted on.
- __validation set__: Data used to check how well the model has been trained and to compare candidate models (model selection).
- __test set__: Held-out data used to estimate the prediction error of the final model (the model chosen in the selection process).

In [32]:
RANDOM_STATE = 42
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)
torch.manual_seed(RANDOM_STATE)

TRAIN_RATIO = 0.8
TEST_VAL_RATIO = 1 - TRAIN_RATIO
VAL_RATIO = 0.5
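
As a quick sanity check on these ratios, the sketch below works out the partition sizes they imply for the 137835-row table used later in this notebook. It assumes that the held-out share is rounded up to a whole row at each stage (an assumption about the splitting routine, though it agrees with the shapes printed further down); `split_sizes` is a hypothetical helper and not part of the notebook.

```python
import math

TRAIN_RATIO = 0.8
TEST_VAL_RATIO = 1 - TRAIN_RATIO   # share held out for validation + test
VAL_RATIO = 0.5                    # share of the held-out rows passed as test_size in the second split


def split_sizes(n_rows: int) -> tuple[int, int, int]:
    """Return (train, validation, test) row counts for a two-stage split.

    Assumption: the held-out share is rounded up at each stage.
    """
    n_holdout = math.ceil(n_rows * TEST_VAL_RATIO)
    n_train = n_rows - n_holdout
    n_test = math.ceil(n_holdout * VAL_RATIO)
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(137835)
print(n_train, n_val, n_test)  # 110268 13783 13784
```

In other words, the settings amount to an approximate 80/10/10 split.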

---

# Dataset Preparation <a name="dataset-preparation"/>

In this notebook we build simple baselines for comparison with the other models. The dataset for these baselines is a single table with the following columns:
- cell-line indicator (`CELL_LINE_NAME`)
- drug (`DRUG_ID`)
- gene expression features (`<gene_id>_gexpr`)
- mutation features (`<gene_id>_mut`)
- copy number features (`<gene_id>_cnvg`, `<gene_id>_cnvp`)
- target value (`LN_IC50`)

Notes:
- for each cell-line `i` (with `i = 1, ..., m`, where `m = # of cell-lines`) there is one row per tested drug `j` in `1, ..., q`, where `q = # of drugs`
- for the same cell-line `i` the feature values are exactly the same
  - this is because the gene expression, mutation and copy number values are measured per cell-line and do not depend on the drug
- the `LN_IC50` value, however, is defined per (cell-line, drug) tuple and thus depends on the cell-line as well as the drug
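
The structure described in these notes can be checked on a toy table. The following is a minimal sketch with made-up cell-lines, drugs and values (only the column naming follows the convention above): within one cell-line every feature column takes a single value, while `LN_IC50` varies with the drug.

```python
import pandas as pd

# Toy data: all names and numbers are invented for illustration.
toy = pd.DataFrame({
    "CELL_LINE_NAME": ["A", "A", "A", "B", "B"],
    "DRUG_ID":        [1, 2, 3, 1, 3],
    "GENE1_gexpr":    [3.4, 3.4, 3.4, 5.1, 5.1],
    "GENE1_mut":      [0.0, 0.0, 0.0, 1.0, 1.0],
    "LN_IC50":        [-1.5, 2.1, 0.3, -4.9, 1.2],
})

# Number of distinct values per column within each cell-line.
per_cell_line = toy.groupby("CELL_LINE_NAME").nunique()
print(per_cell_line)

# Feature columns are recognised by their suffix.
feature_cols = [c for c in toy.columns if c.endswith(("_gexpr", "_mut", "_cnvg", "_cnvp"))]
features_constant = bool((per_cell_line[feature_cols] == 1).all().all())
target_varies = bool((per_cell_line["LN_IC50"] > 1).all())
print(features_constant, target_varies)
```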

## Load Root Datasets <a name="load-root-datasets"/>

In [18]:
# --- Drug response matrix ---
with open('../../data/processed/' + 'gdsc2_drm.pkl', 'rb') as f: 
    drm = pickle.load(f)
    assert not drm.isna().sum().any(), "Some rows have missing cells."
print(drm.shape)
print("Number of unique drug-ids  : ", len(drm.DRUG_ID.unique()))
print("Number of unique drug names: ", len(drm.DRUG_NAME.unique()))
print("Number of unique cell-lines: ", len(drm.CELL_LINE_NAME.unique()))
drm.head(5)

(137835, 9)
Number of unique drug-ids  :  181
Number of unique drug names:  181
Number of unique cell-lines:  856


| | DATASET | CELL_LINE_NAME | DRUG_NAME | DRUG_ID | SANGER_MODEL_ID | AUC | RMSE | Z_SCORE | LN_IC50 |
|---|---|---|---|---|---|---|---|---|---|
| 333161 | GDSC2 | PFSK-1 | Camptothecin | 1003 | SIDM01132 | 0.930105 | 0.088999 | 0.432482 | -1.462148 |
| 333162 | GDSC2 | A673 | Camptothecin | 1003 | SIDM00848 | 0.614932 | 0.111423 | -1.420322 | -4.869447 |
| 333163 | GDSC2 | ES5 | Camptothecin | 1003 | SIDM00263 | 0.790953 | 0.142754 | -0.599894 | -3.360684 |
| 333164 | GDSC2 | ES7 | Camptothecin | 1003 | SIDM00269 | 0.592624 | 0.135642 | -1.515791 | -5.045014 |
| 333165 | GDSC2 | EW-11 | Camptothecin | 1003 | SIDM00203 | 0.733992 | 0.128066 | -0.807038 | -3.74162 |


In [19]:
# --- Gene feature matrix ---
with open('../../data/processed/gdsc2/990/' + 'thresh_gdsc2_990_gene_mat.pkl', 'rb') as f:
    cl_gene_mat = pickle.load(f)
    assert not cl_gene_mat.isna().sum().any(), "Some rows have missing cells."
print(cl_gene_mat.shape)
print("Number of unique cell-lines: ", len(cl_gene_mat.CELL_LINE_NAME.unique()))
cl_gene_mat.head(5)

(856, 1173)
Number of unique cell-lines:  856


| | CELL_LINE_NAME | PLK1_gexpr | RUVBL1_gexpr | TERT_gexpr | EIF4EBP1_gexpr | RPS6_gexpr | STXBP1_gexpr | CDC42_gexpr | SFN_gexpr | POLB_gexpr | ... | MYC_mut | CCNB1_mut | TICAM1_mut | CENPE_mut | CFLAR_mut | AKT1_mut | FBXO11_mut | PSMB8_mut | PRKCD_mut | FBXO7_mut |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22RV1 | 3.376377 | 7.935522 | 3.22259 | 9.062096 | 12.713941 | 4.554855 | 4.461362 | 8.570664 | 8.331353 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 23132-87 | 3.525281 | 8.745985 | 3.329384 | 10.303132 | 12.802973 | 4.206494 | 4.112894 | 9.06954 | 8.788701 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 42-MG-BA | 3.769823 | 6.651761 | 3.492581 | 9.790909 | 11.856622 | 5.397945 | 4.45785 | 3.673494 | 7.443434 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 451Lu | 3.922186 | 8.386289 | 3.479026 | 9.40878 | 12.290141 | 6.228226 | 5.412469 | 4.064795 | 8.27265 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 5637 | 3.531716 | 8.347308 | 3.243133 | 9.884705 | 12.588677 | 3.683614 | 4.389642 | 9.927606 | 8.998844 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [23]:
# Inner join drug response matrix with the features.
join = pd.merge(left=drm[['CELL_LINE_NAME', 'LN_IC50']],
                right=cl_gene_mat,
                on=['CELL_LINE_NAME'],
                how='inner')
print(join.shape)
join.head(5)

(137835, 1174)


| | CELL_LINE_NAME | LN_IC50 | PLK1_gexpr | RUVBL1_gexpr | TERT_gexpr | EIF4EBP1_gexpr | RPS6_gexpr | STXBP1_gexpr | CDC42_gexpr | SFN_gexpr | ... | MYC_mut | CCNB1_mut | TICAM1_mut | CENPE_mut | CFLAR_mut | AKT1_mut | FBXO11_mut | PSMB8_mut | PRKCD_mut | FBXO7_mut |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PFSK-1 | -1.462148 | 4.457667 | 8.072458 | 3.422545 | 10.496699 | 12.776203 | 4.993946 | 4.82063 | 3.56576 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | PFSK-1 | -4.996545 | 4.457667 | 8.072458 | 3.422545 | 10.496699 | 12.776203 | 4.993946 | 4.82063 | 3.56576 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | PFSK-1 | 3.213791 | 4.457667 | 8.072458 | 3.422545 | 10.496699 | 12.776203 | 4.993946 | 4.82063 | 3.56576 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | PFSK-1 | -1.847259 | 4.457667 | 8.072458 | 3.422545 | 10.496699 | 12.776203 | 4.993946 | 4.82063 | 3.56576 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | PFSK-1 | 2.10137 | 4.457667 | 8.072458 | 3.422545 | 10.496699 | 12.776203 | 4.993946 | 4.82063 | 3.56576 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
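
The joined table has the same number of rows as the drug response matrix (137835), which is what a many-to-one join produces when every cell-line in `drm` has exactly one row of features. Here is a minimal sketch of how that expectation could be asserted, using toy stand-ins for `drm` and `cl_gene_mat` (the frames and values are invented); `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for the drug response matrix and the per-cell-line feature matrix.
toy_drm = pd.DataFrame({
    "CELL_LINE_NAME": ["A", "A", "B", "B", "B"],
    "LN_IC50": [-1.5, 2.1, -4.9, 1.2, 0.7],
})
toy_features = pd.DataFrame({
    "CELL_LINE_NAME": ["A", "B"],
    "GENE1_gexpr": [3.4, 5.1],
})

# validate="many_to_one" raises pandas.errors.MergeError if a cell-line has
# more than one feature row, which would otherwise duplicate response rows.
toy_join = pd.merge(
    left=toy_drm,
    right=toy_features,
    on="CELL_LINE_NAME",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# With a left join, "left_only" marks response rows that found no features;
# an inner join would drop such rows without notice.
n_unmatched = int((toy_join["_merge"] == "left_only").sum())
print(len(toy_join) == len(toy_drm), n_unmatched)
```

If `n_unmatched` is zero, the left join and the inner join return the same rows.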


## Create Modeling Datasets <a name="create-modeling-datasets"/>

In [22]:
from sklearn.model_selection import train_test_split

In [37]:
X = join.loc[:, ~join.columns.isin(['CELL_LINE_NAME', 'LN_IC50'])]
y = join['LN_IC50']

# Split off the training set. The remainder (X_test, y_test) temporarily
# holds both the validation and the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=TEST_VAL_RATIO,
    random_state=RANDOM_STATE,
    stratify=join['CELL_LINE_NAME']
)

# Split the remainder into validation and test sets.
# X_test.index holds index labels, so look the rows up with .loc rather than .iloc.
X_val, X_test, y_val, y_test = train_test_split(
    X_test, y_test,
    test_size=VAL_RATIO,
    random_state=RANDOM_STATE,
    stratify=join.loc[X_test.index, 'CELL_LINE_NAME']
)
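
One property worth verifying after a two-stage split like this is that the three partitions are disjoint and together cover every row. The sketch below does so on synthetic data (the frame, group labels and sizes are made up, with `group` playing the role of `CELL_LINE_NAME`).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_rows = 200
toy = pd.DataFrame({
    "group": np.repeat(["A", "B", "C", "D"], n_rows // 4),
    "x": rng.normal(size=n_rows),
    "y": rng.normal(size=n_rows),
})
X, y = toy[["x"]], toy["y"]

# Stage 1: 80% train, 20% held out.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=toy["group"]
)
# Stage 2: split the held-out rows evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42,
    stratify=toy.loc[X_hold.index, "group"],
)

idx_train, idx_val, idx_test = set(X_train.index), set(X_val.index), set(X_test.index)
disjoint = (
    idx_train.isdisjoint(idx_val)
    and idx_train.isdisjoint(idx_test)
    and idx_val.isdisjoint(idx_test)
)
complete = (idx_train | idx_val | idx_test) == set(toy.index)
print(disjoint, complete, len(idx_train), len(idx_val), len(idx_test))
```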

In [51]:
print(f"""Final dataset shapes:
=====================
             {'X':<15s} {'y':<10s}
Training   : {str(X_train.shape):15s} {str(y_train.shape):10s}
Validation : {str(X_val.shape):15s} {str(y_val.shape):10s}
Testing    : {str(X_test.shape):15s} {str(y_test.shape):10s}
""")

Final dataset shapes:
=====================
             X               y         
Training   : (110268, 1172)  (110268,) 
Validation : (13783, 1172)   (13783,)  
Testing    : (13784, 1172)   (13784,)  



# Model Building <a name="model-building"/>

## Linear Regression <a name="linear-regression"/>

In [52]:
from sklearn.linear_model import LinearRegression

linregr = LinearRegression()
linregr.fit(X_train, y_train)
print(linregr.score(X_test, y_test))

0.09883063829470629
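
For regressors, scikit-learn's `score` returns the coefficient of determination, `R^2 = 1 - SS_res / SS_tot`, so a value near 0.1 means the model explains roughly 10% of the variance of `LN_IC50` on the test set. A minimal from-scratch sketch on made-up numbers (`r2` is a hypothetical helper) shows the two reference points: always predicting the mean gives 0, and a perfect prediction gives 1.

```python
def r2(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [-2.0, 0.0, 1.0, 5.0]            # invented targets, mean = 1.0
mean_only = [1.0] * len(y_true)            # constant prediction of the mean
print(r2(y_true, mean_only))               # 0.0
print(r2(y_true, y_true))                  # 1.0
print(r2(y_true, [-1.0, 0.0, 1.0, 4.0]))   # about 0.923
```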


In [61]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from scipy.stats import pearsonr, spearmanr
  
y_pred = linregr.predict(X_test)

mae = mean_absolute_error(y_true=y_test, y_pred=y_pred)
mse = mean_squared_error(y_true=y_test, y_pred=y_pred)
rmse = np.sqrt(mse)  # square root of the MSE; the `squared` argument is deprecated in recent scikit-learn releases
pcc, _ = pearsonr(y_test, y_pred)
scc, _ = spearmanr(y_test, y_pred)
  
print("MAE  :", mae)
print("MSE  :", mse)
print("RMSE :", rmse)
print("PCC  :", pcc)
print("SCC  :", scc)

MAE  : 1.9755975583211287
MSE  : 6.387249238542124
RMSE : 2.5273007811778405
PCC  : 0.31676971989382074
SCC  : 0.3362652193859476
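
To make explicit what the five numbers above measure, here is a from-scratch sketch on invented values. The helper names are hypothetical, and the Spearman variant assumes there are no tied values (`scipy.stats.spearmanr` additionally handles ties); under that assumption it is simply the Pearson correlation of the ranks.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error; RMSE is its square root."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(a, b):
    """Pearson correlation coefficient (PCC)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(a, b):
    """Spearman correlation coefficient (SCC) for tie-free data."""
    return pearson(ranks(a), ranks(b))


y_true = [1.0, 2.0, 3.0, 4.0]   # invented targets
y_pred = [1.5, 1.0, 3.5, 6.0]   # invented predictions

print("MAE  :", mae(y_true, y_pred))             # 1.0
print("MSE  :", mse(y_true, y_pred))             # 1.375
print("RMSE :", math.sqrt(mse(y_true, y_pred)))
print("PCC  :", pearson(y_true, y_pred))
print("SCC  :", spearman(y_true, y_pred))
```

MAE and RMSE are in the units of `LN_IC50`, whereas PCC and SCC are unit-free and only describe how well the predictions track the targets linearly and by rank, respectively.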


## SVM <a name="svm"/>

## Random Forest <a name="random-forest"/>

In [64]:
from sklearn.ensemble import RandomForestRegressor

rfregr = RandomForestRegressor()
rfregr.fit(X_train, y_train)
print(rfregr.score(X_test, y_test))

0.0993441489397966


In [66]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from scipy.stats import pearsonr, spearmanr
  
y_pred = rfregr.predict(X_test)

mae = mean_absolute_error(y_true=y_test, y_pred=y_pred)
mse = mean_squared_error(y_true=y_test, y_pred=y_pred)
rmse = np.sqrt(mse)  # square root of the MSE; the `squared` argument is deprecated in recent scikit-learn releases
pcc, _ = pearsonr(y_test, y_pred)
scc, _ = spearmanr(y_test, y_pred)
  
print("MAE  :", mae)
print("MSE  :", mse)
print("RMSE :", rmse)
print("PCC  :", pcc)
print("SCC  :", scc)

MAE  : 1.9754301494797675
MSE  : 6.383609611390763
RMSE : 2.5265806164440434
PCC  : 0.31741554648458736
SCC  : 0.3372081225889679
