# The Hitchhiker's Guide to Accelerating Early Drug Discovery with AI
### Part I: Hello World!

This notebook is a **first introduction** to the problem. It's aimed at:
* Domain experts who know a lot of biology but nothing of AI
* Data scientists who want to know what it's like to work with Bio data 

If you're an expert in both, bear with us: it will get interesting, I promise!

### Now let's get our hands dirty!

To get data, we'll tap into the resources provided by [depmap](https://depmap.org/portal/download/all/). We'll download the D2_combined_gene_dep_scores file from the **DEMETER2 Data v6** dataset. This file contains the estimated gene effect (posterior mean estimate) for each cell line and gene.

In this dataset, genes are indexed by Entrez ID during analysis, and they are labeled as "HGNC_symbol (Entrez_ID)".

### Glossary:
* **Cell line**: An immortalised cell line is a population of cells from a multicellular organism which would normally not proliferate indefinitely but, due to mutation, have evaded normal cellular senescence and instead can keep undergoing division.
* **Gene dependency**: Gene dependency or essentiality is defined as the degree to which a gene is essential for cell proliferation and survival. Counterintuitively, the convention of this data is that a **lower number** means that a cell line is more likely to be dependent on that gene: i.e., a lower number translates to a stronger negative effect of the absence of that gene on cell growth. Scores around -1 or lower indicate a strong dependency, which is what we are looking for.

So ideally we want to find genes with a high dependency (a lower number in this data), and inhibit them!
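To make the sign convention concrete, here is a minimal sketch on a made-up table. The gene names, cell line names and scores are all hypothetical, chosen only to show that "strongest dependency" means "most negative score".

```python
import pandas as pd

# Hypothetical dependency scores (rows: cell lines, columns: genes); values are made up.
toy_scores = pd.DataFrame(
    {
        "GENE_A": [-1.2, -0.9, -1.1],
        "GENE_B": [0.1, -0.05, 0.2],
        "GENE_C": [-0.3, -1.4, 0.0],
    },
    index=["line_1", "line_2", "line_3"],
)

# Lower (more negative) means more dependent, so the strongest dependency
# of each cell line is the column holding the minimum score in that row.
strongest_per_line = toy_scores.idxmin(axis=1)

# Genes ranked from strongest to weakest median dependency across cell lines.
ranked_genes = toy_scores.median().sort_values()

print(strongest_per_line)
print(ranked_genes)
```

Sorting in ascending order is the natural choice here: the genes at the top of the list are the ones the cells rely on most.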

In the code below we are loading the data and transposing it, so that the index contains cell lines, and the columns are genes.


In [17]:
import pandas as pd  

url = "https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/13515395/D2_combined_gene_dep_scores.csv"
df = pd.read_csv(url, index_col=0).T

In [257]:
df

cell_line_display_name,A1BG,NAT2,ADA,CDH2,AKT3,MED6,NR2E3,NAALAD2,DUXB,PDZK1P1,...,RCE1,HNRNPDL,DMTF1,PPP4R1,CDH1,SLC12A6,KCNE2,DGCR2,CASP8AP2,SCO2
127399,,,,-0.194962,-0.256108,-0.174220,-0.140052,,,,...,-0.201644,-0.363670,0.184260,-0.115616,-0.125958,,0.088853,,-0.843295,
1321N1,,,,-0.028171,0.100751,-0.456124,-0.174618,,,,...,0.074889,0.152158,0.036011,0.117300,0.101725,,-0.110628,,-0.307031,
143B,0.146042,0.102854,0.168839,0.063047,-0.008077,-0.214376,-0.153619,0.133830,0.138673,0.030345,...,0.006735,-0.033385,0.197651,-0.016372,0.077486,0.106165,0.057286,0.025596,-0.413669,0.122669
184A1,-0.190388,0.384106,-0.120700,-0.237251,0.060267,-0.338946,-0.057551,0.134511,,0.144463,...,0.209009,-0.156839,-0.155837,-0.001141,,0.227968,0.028095,-0.080611,-1.849696,-0.078856
184B5,0.907063,0.403192,0.004394,-0.017059,-0.094749,-0.328074,-0.089573,0.362029,,-0.098161,...,-0.137465,-1.037848,-0.261262,-0.228016,,0.088744,0.159467,0.014071,-0.414154,0.032661
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
YKG1,0.111530,0.073460,0.227977,0.000769,-0.072564,-0.175593,-0.155250,0.105052,0.143781,0.357053,...,-0.079333,-0.358065,-0.090982,0.168945,-0.173036,0.141616,0.109206,0.153414,-0.046700,0.075238
YMB1,,,,-0.139126,0.017161,-0.226356,-0.445319,,,,...,-0.050825,0.286697,0.134608,-0.166845,0.065173,,-0.048763,,-0.865486,
ZR751,-0.079313,-0.130921,-0.134479,0.047022,0.123615,-0.311682,-0.211145,-0.014285,0.074681,-0.053025,...,-0.143304,-0.078062,-0.022528,0.021830,0.308641,0.100142,0.128882,0.159781,-1.039110,0.100361
ZR7530,-0.141559,0.127358,0.083506,-0.097644,0.046846,-0.355300,-0.095010,0.049151,0.129006,0.038661,...,-0.146587,-0.050230,0.127782,-0.031292,0.035794,0.160643,-0.179656,0.286456,-0.301415,-0.117268


**Some housekeeping first:** We'll get rid of the Entrez ID, and we'll also remove the lineage type from the cell line name so we can merge it with other datasets later.

In [20]:
df.columns = [ x.split(" ")[0] for x in df.columns ]
cell_line_name = [ x.split("_")[0] for x in df.index ]
df.index = cell_line_name
df.index.name = 'cell_line_display_name'
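The string handling in the housekeeping cell can be checked in isolation on a few labels. The sketch below uses illustrative labels that follow the "HGNC_symbol (Entrez_ID)" and "NAME_LINEAGE" patterns; the specific labels and the helper `split_gene_label` are assumptions made for this example, not part of the dataset or of any library.

```python
import re

# Hypothetical labels, written in the same style as the dataset's column and row names.
gene_labels = ["A1BG (1)", "GENEX (999)"]
cell_line_labels = ["143B_BONE", "ZR751_BREAST"]


def split_gene_label(label):
    """Return (symbol, entrez_id) from a label such as 'A1BG (1)'."""
    match = re.fullmatch(r"(\S+) \((\S+)\)", label)
    if match is None:
        # Fail loudly instead of silently producing a wrong symbol.
        raise ValueError(f"unexpected gene label: {label!r}")
    return match.group(1), match.group(2)


# Same result as x.split(" ")[0], but with a format check.
symbols = [split_gene_label(label)[0] for label in gene_labels]

# Everything before the first underscore is kept as the cell line name.
cell_lines = [label.split("_")[0] for label in cell_line_labels]

print(symbols)
print(cell_lines)
```

The plain `split` calls in the notebook are shorter. The regular expression is only worth it if you want malformed labels to raise an error rather than pass through unnoticed.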

In [260]:
df['MAT2A'].median()

-0.281954973202

### Additional Data

Ok, now before we start building on top of the data, let's learn how to combine it with other datasets. What we have above is the gene dependency: i.e., how essential a gene is for the proliferation of a given cell line. That gives us an idea about potential targets for treatment. 

Now we will introduce the **copy number** and **gene expression** data. Gene expression tells us how genes are activated or suppressed in cancer cells, helping to identify potential targets for treatment, understand cancer progression, and develop personalized therapies. Copy number refers to the number of copies of a particular gene present in the genome of a cell. In cancer cells, the copy number of certain genes can vary due to genetic abnormalities.

To get this data, you can go to Custom Downloads in depmap, and select the information you need. We're now focusing on the following genes for simplicity:
* PRMT5, MTAP, MAT2A, ZNF185, PDGFRB, GJA1, MUCL1, MAP1A

How did we decide on these? Magic. 

In [273]:
additional = pd.read_csv('https://raw.githubusercontent.com/Hitchhikers-AI-Guide/AIGuideToDrugDiscovery/main/data/depmap_export.csv', index_col=1).dropna(axis=1, how='all').dropna(axis=0, how='all') #removing empty rows and columns
additional.head(5)

cell_line_display_name,depmap_id,lineage_1,lineage_2,lineage_3,lineage_5,lineage_6,Copy Number Public 23Q4 MUCL1,Copy Number Public 23Q4 PRMT5,Copy Number Public 23Q4 MAP1A,Copy Number Public 23Q4 MAT2A,...,Copy Number Public 23Q4 MTAP,Copy Number Public 23Q4 ZNF185,Expression Public 23Q4 MTAP,Expression Public 23Q4 PRMT5,Expression Public 23Q4 PDGFRB,Expression Public 23Q4 ZNF185,Expression Public 23Q4 GJA1,Expression Public 23Q4 MAP1A,Expression Public 23Q4 MAT2A,Expression Public 23Q4 MUCL1
127399,ACH-001270,Soft Tissue,Synovial Sarcoma,Synovial Sarcoma,,,0.794242,0.80972,0.789335,1.049616,...,1.054304,0.744569,4.959306,5.880686,6.083852,3.100978,4.517276,0.321928,7.053003,0.0
170MGBA,ACH-002680,CNS/Brain,Diffuse Glioma,Glioblastoma,Glioblastoma,,0.94438,0.59901,0.716587,1.030777,...,0.429817,0.780689,3.500802,4.947666,6.081723,0.815575,8.105385,5.461398,7.079698,0.0
1777NRPMET,ACH-001438,Testis,Non-Seminomatous Germ Cell Tumor,Embryonal Carcinoma,,,1.0773,0.867353,0.84088,1.052921,...,0.918797,0.746289,4.665052,6.203788,5.103917,3.813525,7.075319,1.682573,6.453847,0.0
201T,ACH-002089,Lung,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,NSCLC Adenocarcinoma,,1.073952,1.017229,0.992259,0.931516,...,0.000294,0.644727,,,,,,,,
21NT,ACH-002399,Breast,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma,,,0.897753,1.113557,0.867614,0.956954,...,0.883159,0.807497,,,,,,,,
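The two `dropna` calls used when loading the additional data only remove rows and columns that are completely empty. A minimal sketch on a made-up table (the column names and values are hypothetical) shows the difference from a plain `dropna()`, which would drop anything containing a single missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical export with one fully empty column and one fully empty row.
toy = pd.DataFrame(
    {
        "lineage_1": ["Lung", "Breast", np.nan],
        "lineage_4": [np.nan, np.nan, np.nan],  # empty column
        "copy_number": [1.0, np.nan, np.nan],
    },
    index=["line_1", "line_2", "line_3"],
)

# how="all" drops a column (axis=1) or a row (axis=0) only if every value is missing.
cleaned = toy.dropna(axis=1, how="all").dropna(axis=0, how="all")

print(cleaned)
```

Here `line_2` survives even though its copy number is missing, which is why NaN values still show up in the table above.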


In [272]:
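# the raw export, before setting the index and dropping empty rows and columns
pd.read_csv('https://raw.githubusercontent.com/Hitchhikers-AI-Guide/AIGuideToDrugDiscovery/main/data/depmap_export.csv')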

Unnamed: 0,depmap_id,cell_line_display_name,lineage_1,lineage_2,lineage_3,lineage_5,lineage_6,lineage_4,Copy Number Public 23Q4 MUCL1,Copy Number Public 23Q4 PRMT5,...,Copy Number Public 23Q4 MTAP,Copy Number Public 23Q4 ZNF185,Expression Public 23Q4 MTAP,Expression Public 23Q4 PRMT5,Expression Public 23Q4 PDGFRB,Expression Public 23Q4 ZNF185,Expression Public 23Q4 GJA1,Expression Public 23Q4 MAP1A,Expression Public 23Q4 MAT2A,Expression Public 23Q4 MUCL1
0,ACH-001270,127399,Soft Tissue,Synovial Sarcoma,Synovial Sarcoma,,,,0.794242,0.809720,...,1.054304,0.744569,4.959306,5.880686,6.083852,3.100978,4.517276,0.321928,7.053003,0.0
1,ACH-002680,170MGBA,CNS/Brain,Diffuse Glioma,Glioblastoma,Glioblastoma,,,0.944380,0.599010,...,0.429817,0.780689,3.500802,4.947666,6.081723,0.815575,8.105385,5.461398,7.079698,0.0
2,ACH-001438,1777NRPMET,Testis,Non-Seminomatous Germ Cell Tumor,Embryonal Carcinoma,,,,1.077300,0.867353,...,0.918797,0.746289,4.665052,6.203788,5.103917,3.813525,7.075319,1.682573,6.453847,0.0
3,ACH-002089,201T,Lung,Non-Small Cell Lung Cancer,Lung Adenocarcinoma,NSCLC Adenocarcinoma,,,1.073952,1.017229,...,0.000294,0.644727,,,,,,,,
4,ACH-002399,21NT,Breast,Invasive Breast Carcinoma,Breast Invasive Ductal Carcinoma,,,,0.897753,1.113557,...,0.883159,0.807497,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1817,ACH-000146,THP1,Myeloid,Acute Myeloid Leukemia,Acute Myeloid Leukemia,M5,,,0.991017,0.983336,...,0.000115,0.584093,0.084064,6.691534,1.117695,0.454176,0.176323,0.084064,8.017254,0.0
1818,ACH-000835,GCT,Soft Tissue,Undifferentiated Pleomorphic Sarcoma/Malignant...,Undifferentiated Pleomorphic Sarcoma/Malignant...,,,,1.028573,1.015083,...,0.765158,0.523483,5.072963,7.163499,2.613532,3.472488,4.550901,1.389567,7.982594,0.0
1819,ACH-001300,CHLA15,Peripheral Nervous System,Neuroblastoma,Neuroblastoma,,,,1.239743,0.945503,...,0.937817,0.883267,3.878725,6.396262,3.816600,0.978196,2.319040,2.687061,7.652702,0.0
1820,ACH-001301,COGN278,Peripheral Nervous System,Neuroblastoma,Neuroblastoma,,MYCN Amp,,0.991967,0.983763,...,0.990240,1.000230,4.524816,6.962318,2.831877,0.773996,5.000451,3.119356,7.330021,0.0


### Your first in-silico experiment:

Which gene dependency correlates the most with the copy number for MTAP? In the code below we will compute the correlation between each gene's dependency score and the MTAP copy number across cell lines, then rank the genes in descending order.


In [108]:
# for all of the genes in the dependency data
# which ones correlate the most with 
# the MTAP copy number?

res = df.corrwith(additional['Copy Number Public 23Q4 MTAP']).sort_values(ascending=False)

In [263]:
res.head(10)

PRMT5                        0.574670
LOC105374879&LOC107986554    0.472545
WDR77                        0.423746
RIOK1                        0.308628
SLC39A12-AS1&LOC389834       0.301664
MAT2A                        0.290771
CEP68                        0.285653
FAM107A                      0.258787
LOC441455&MKRN7P             0.252400
CLNS1A                       0.248265
dtype: float64
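It helps to see what `corrwith` does on a table small enough to check by eye. The sketch below uses made-up gene names and values. It illustrates two behaviours the analysis relies on: rows are matched by index label (cell line), and missing values are skipped per gene.

```python
import numpy as np
import pandas as pd

# Hypothetical dependency scores for three genes across four cell lines.
dependency = pd.DataFrame(
    {
        "GENE_A": [-1.0, -0.5, 0.0, 0.5],   # rises with copy number
        "GENE_B": [0.5, 0.0, -0.5, -1.0],   # falls with copy number
        "GENE_C": [0.1, np.nan, 0.3, 0.2],  # one missing measurement
    },
    index=["line_1", "line_2", "line_3", "line_4"],
)

# Hypothetical copy numbers; line_5 has no dependency data and is ignored.
copy_number = pd.Series(
    [0.0, 0.5, 1.0, 1.5, 2.0],
    index=["line_1", "line_2", "line_3", "line_4", "line_5"],
)

# Pearson correlation of every column with the series, matched on the index.
res_toy = dependency.corrwith(copy_number).sort_values(ascending=False)

print(res_toy)
```

A positive correlation means that a lower copy number goes with a more negative score, i.e. a stronger dependency. That is why the interesting genes appear at the top of the descending ranking.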

If all goes well, you'll see PRMT5, RIOK1 and MAT2A in the list. That's not a coincidence: as it turns out, in about 15% of human cancers there's a loss of a gene called MTAP. This loss is associated with cancer severity. MTAP works closely with another gene, MAT2A, and when both are lost, the result is deadly for cancer cells because a protein called PRMT5 gets blocked.
 
  
See, for instance: 
* [MTAP Deletions in Cancer Create Vulnerability to Targeting of the MAT2A/PRMT5/RIOK1 Axis](https://www.sciencedirect.com/science/article/pii/S2211124716302996) 
* [Combined inhibition of MTAP and MAT2a mimics synthetic lethality in tumor models via PRMT5 inhibition](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10770533/)

So here's a curveball: why is [WDR77](https://www.ncbi.nlm.nih.gov/gene/79084) here? Go browse the literature and try to figure it out!

So let's explore this effect. Essentially, when MTAP is absent in cancer cells, the loss of MAT2A/PRMT5/RIOK1 has a strong effect. That is exactly what we see in the data: the copy number of MTAP is correlated with the gene effect, i.e., the lower the copy number, the stronger (more negative, remember the convention) the effect of these genes.

Let's find a way to visualize this.

In [264]:
import plotly.express as px

# keep only the cell lines that are present in both tables
joint_data = pd.concat([df, additional], axis=1, join='inner')
# trendline='ols' requires the statsmodels package
px.scatter( joint_data, y = 'PRMT5', x = 'Copy Number Public 23Q4 MTAP', color = 'lineage_1', trendline='ols' )





What we see above is quite interesting: there are a number of cell lines for which MTAP's copy number is close to zero, and for these cell lines we see that PRMT5's gene effect is more negative (i.e., stronger dependency). 


This is nice but also not the best way to visualize the data. Let's try bucketing the copy number into bins and looking at the effect within each bin. The code below cuts the MTAP copy number into up to 6 quantile groups (duplicates='drop' merges bins that share an edge), and looks at the distribution of PRMT5's, MAT2A's and RIOK1's gene effect in each group.

In [153]:
joint_data['MTAP Copy Number Quantile'] = pd.qcut( joint_data['Copy Number Public 23Q4 MTAP'], 6, labels=False, duplicates='drop' )

for gene in ['PRMT5', 'MAT2A', 'RIOK1']:
    fig = px.violin( joint_data, x = 'MTAP Copy Number Quantile', y = gene,  points = 'all', box=True  )
    fig.show()
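One detail of `pd.qcut` is worth a quick look: when many cell lines share the same copy number (for example, all the deleted ones near zero), several quantile edges coincide and `duplicates='drop'` merges them. The sketch below uses made-up values to show that the result can have fewer bins than requested.

```python
import pandas as pd

# Hypothetical copy-number values: half of the cell lines share the value 0.
copy_number = pd.Series([0.0] * 6 + [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
# Hypothetical gene effect: strongly negative where the copy number is 0.
effect = pd.Series([-1.0] * 6 + [-0.2, -0.1, 0.0, 0.1, 0.0, 0.1])

# Ask for 6 quantile bins; bins whose edges coincide are merged.
bins = pd.qcut(copy_number, 6, labels=False, duplicates="drop")
n_bins = bins.nunique()

# Average effect within each bin, the numeric counterpart of the violin plots.
mean_effect_per_bin = effect.groupby(bins).mean()

print("bins obtained:", n_bins)
print(mean_effect_per_bin)
```

When reading the violin plots, it is worth checking how many groups you actually got and how many cell lines fall in each, for instance with `value_counts()` on the quantile column.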

That's better: now we can see a clear effect, especially for PRMT5, whenever MTAP's copy number is low. Let's try the same exercise, but now using expression data.

In [154]:
joint_data['MTAP Expression Quantile'] = pd.qcut( joint_data['Expression Public 23Q4 MTAP'], 6, labels=False, duplicates='drop' )

for gene in ['PRMT5', 'MAT2A', 'RIOK1']:
    fig = px.violin( joint_data, x = 'MTAP Expression Quantile', y = gene,  points = 'all', box=True  )
    fig.show()

And as expected, we observe the same behavior: whenever MTAP isn't expressed, we observe stronger gene dependency. 

We've so far done a few things:
* Loaded an entire dataset for Gene dependency. This data tells us which genes are essential for the growth of certain cell lines.
* Merged this data with expression and copy number data for a small subset of genes.
* Found that MAT2A, PRMT5 and RIOK1's dependencies are correlated with MTAP's expression and copy number, corroborating findings that MTAP's deletion presents an opportunity for treatment using MAT2A/PRMT5/RIOK1.
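The merge step in this recap relies on `pd.concat(..., axis=1, join='inner')`, which silently keeps only the cell lines present in both tables. A minimal sketch with made-up tables (all names and values are hypothetical) shows how to see what was left out.

```python
import pandas as pd

# Hypothetical dependency table and additional table with partly different cell lines.
dep = pd.DataFrame({"GENE_A": [-1.0, -0.2, 0.1]}, index=["line_1", "line_2", "line_3"])
extra = pd.DataFrame({"copy_number": [0.0, 1.0, 0.9]}, index=["line_2", "line_3", "line_4"])

# Column-wise concatenation; join="inner" keeps the intersection of the two indexes.
joined = pd.concat([dep, extra], axis=1, join="inner")

# Cell lines that had dependency data but no match in the additional table.
dropped = dep.index.difference(joined.index)

print(joined)
print("dropped:", list(dropped))
```

On the real data, comparing `len(df)`, `len(additional)` and `len(joint_data)` gives a quick sense of how many cell lines the analysis is actually based on.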

Let's now ask a few simpler questions:
* What happens in different cancer types? You can use the features lineage_1, lineage_2 and lineage_3 in the joint_data to explore. For instance... which gene has the strongest dependency per lineage type?

In [155]:
genes = df.columns
lineages = joint_data['lineage_1'].unique()

median_dep_per_lineage = joint_data.groupby('lineage_1')[genes].median()
median_dep_per_lineage.head(5)

lineage_1,A1BG,NAT2,ADA,CDH2,AKT3,MED6,NR2E3,NAALAD2,DUXB,PDZK1P1,...,RCE1,HNRNPDL,DMTF1,PPP4R1,CDH1,SLC12A6,KCNE2,DGCR2,CASP8AP2,SCO2
Biliary Tract,,,,-0.134051,0.185132,0.078481,0.015465,,,,...,-0.5223,-0.114177,0.139492,-0.189923,0.105446,,0.000362,,-0.342565,
Bladder/Urinary Tract,-0.034754,-0.097546,-0.009613,-0.05175,-0.065325,-0.408861,-0.152046,-0.002347,0.120372,0.085327,...,-0.034185,-0.006242,-0.111248,0.080209,-0.066471,0.081051,-0.013882,0.105869,-0.357238,-0.124898
Bone,0.019898,0.015789,0.068316,0.063047,0.086339,-0.169326,-0.153619,0.132095,0.11149,0.030345,...,0.021702,-0.152583,0.015087,-0.016372,-0.058102,-0.00192,0.027272,0.14022,-0.401351,-0.02149
Bowel,-0.07685,-0.053574,-0.086468,0.062052,0.070845,-0.32663,-0.085908,0.107212,0.120195,-0.021937,...,-0.079436,-0.079643,-0.019285,-0.039689,-0.02338,0.11626,0.017792,0.061854,-0.524577,0.02585
Breast,-0.015329,-0.057465,0.007641,0.008477,0.061731,-0.333732,-0.13656,0.096272,0.139046,0.0072,...,-0.068192,-0.041876,0.082791,0.023083,0.021964,0.086839,0.003523,0.073731,-0.507736,0.021587


In [156]:
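results = {}
# iterate over the groupby index, which skips cell lines with a missing lineage
for lineage in median_dep_per_lineage.index:
    # the 10 genes with the most negative median score, i.e. the strongest dependencies
    results[lineage] = median_dep_per_lineage.loc[lineage].sort_values(ascending=True).iloc[:10]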

results['Bowel']

LOC100130331   -1.720645
SF3B2          -1.649171
RBX1           -1.648663
RPL14          -1.581234
SNRPD1         -1.561428
RPL5           -1.524563
RPL7           -1.515645
CTNNB1         -1.495507
COPB1          -1.492408
RPS27A         -1.439314
Name: Bowel, dtype: float64
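If you only want the single strongest dependency per lineage, rather than a top-10 list, `idxmin` on the table of medians does it in one step. The sketch below uses a made-up table; the gene names and scores are hypothetical.

```python
import pandas as pd

# Hypothetical joint table: one lineage column plus dependency scores for two genes.
toy = pd.DataFrame(
    {
        "lineage_1": ["Bowel", "Bowel", "Breast", "Breast"],
        "GENE_A": [-1.5, -1.3, -0.2, -0.4],
        "GENE_B": [-0.1, -0.3, -1.0, -1.2],
    },
    index=["line_1", "line_2", "line_3", "line_4"],
)

# Median dependency of each gene within each lineage.
median_dep = toy.groupby("lineage_1")[["GENE_A", "GENE_B"]].median()

# The most negative median per lineage, and the gene it belongs to.
strongest_gene = median_dep.idxmin(axis=1)
strongest_score = median_dep.min(axis=1)

print(pd.DataFrame({"gene": strongest_gene, "median_score": strongest_score}))
```

Keep in mind that genes which are essential everywhere (ribosomal proteins, for instance) will tend to top every lineage. Comparing a lineage's median against the median over all cell lines is one way to look for lineage-specific dependencies.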

### Some light machine learning

So previously we identified that whenever MTAP's expression/copy number was low, that presented an opportunity to target MAT2A/PRMT5/RIOK1. But what if we could go more granular and attempt to predict the expected dependency of a given gene using way more features?

That's the ultimate goal - i.e., take a series of measurements - and based on these measurements know exactly the right targets for treatment. Well let's create a toy model for that!

The function below does the following: it takes the joint dataset we created, subsets it on X_features, and uses these features to train a model to predict y. It then uses this model to make predictions on held-out test data, and uses these predictions to evaluate the quality of the model trained with those features.

In [225]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score 
from sklearn.model_selection import train_test_split

def get_and_evaluate_linear_model( model, data, X_features, y_feature, random_state = 42  ):
    ''' This function takes a model, a dataframe, a list of features and a target feature and returns a dictionary with the model, 
    the r2, the mse, the predicted values and the test values'''

    data_subset = data[ X_features + [ y_feature ] ].dropna()
    X = data_subset[ X_features ]
    y = data_subset[ y_feature ]

    #splitting the data into train and test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state= random_state)
    model.fit(X_train, y_train)

    #evaluating the model with r2 and mse
    y_pred = model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    
    
    output = {}
    output['model'] = model
    output['r2'] = r2
    output['mse'] = mse
    output['y_pred_test'] = y_pred
    output['y_test'] = y_test
    output['features'] = X_features
    output['n_features'] = len(X_features)
    output['y_feature'] = y_feature
    output['model_name'] = str(model)
    output['y_pred'] = model.predict(data_subset[X_features])
    output['y'] = data_subset[y_feature]
    try:
        # linear models expose coef_; tree-based models do not
        output['coefficients'] = dict(zip(X_features, model.coef_))
    except AttributeError:
        output['coefficients'] = None
    
    return output 

Let's first try it with a baseline model. This is a simple linear regression that predicts PRMT5's dependency using MTAP's copy number. As we saw a correlation of about 0.57 between the two, we expect this single feature to explain roughly a third of the variance (0.57² ≈ 0.33).

In [247]:
model = LinearRegression()
output = get_and_evaluate_linear_model( model, joint_data, ['Copy Number Public 23Q4 MTAP'], 'PRMT5' )

print( 'Features: ', ', '.join(output['features']) )
print( 'R-squared (the higher the better): ', output['r2'] )
print( 'mean squared error (the lower the better): ', output['mse'] )

print( 'Coefficients: ' )
for feature, coefficient in sorted( output['coefficients'].items(), key= lambda x: abs(x[1]), reverse=True ):
    print( feature, coefficient )

Features:  Copy Number Public 23Q4 MTAP
R-squared (the higher the better):  0.3187891279066697
mean squared error (the lower the better):  0.04706268360546403
Coefficients: 
Copy Number Public 23Q4 MTAP 0.44703043574569923
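The R-squared above can be related to the correlation computed earlier. For a linear regression with a single feature, the R-squared on the data used for fitting equals the squared Pearson correlation, so a correlation of about 0.57 corresponds to an R-squared of about 0.33. On held-out data the value will differ somewhat. The sketch below checks the identity on synthetic data; the numbers are generated here and are not taken from depmap.

```python
import numpy as np

# Synthetic data: y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.7, size=200)

# Ordinary least squares fit of a straight line.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# R-squared = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Pearson correlation between the feature and the target.
pearson_r = np.corrcoef(x, y)[0, 1]

print("R-squared:", r_squared)
print("squared correlation:", pearson_r ** 2)
```

The identity only holds for one feature and in-sample. With several features, or on a test split, R-squared has to be computed from the predictions as the evaluation function does.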


In [248]:
px.scatter( x = output['y_pred_test'], y = output['y_test'], labels = {'x': 'Predictions (testing data)', 'y': 'Realized PRMT5 (testing data)'} )

Moving on to bigger things, let's now try with more features! Here we'll select all numerical variables we added in our additional dataset.

In [249]:
from pandas.api.types import is_numeric_dtype

#here we select only the numeric features from the additional data, we will use them to predict the dependency scores
features = [ x for x in additional if is_numeric_dtype(additional[x]) ]

model = LinearRegression()
output = get_and_evaluate_linear_model( model, joint_data, features, 'PRMT5' )

print( 'Features: ', ', '.join(output['features']) )
print( 'R-squared (the higher the better): ', output['r2'] )
print( 'mean squared error (the lower the better): ', output['mse'] )

print( 'Coefficients: ' )
for feature, coefficient in sorted( output['coefficients'].items(), key= lambda x: abs(x[1]), reverse=True ):
    print( feature, coefficient )

Features:  Copy Number Public 23Q4 MUCL1, Copy Number Public 23Q4 PRMT5, Copy Number Public 23Q4 MAP1A, Copy Number Public 23Q4 MAT2A, Copy Number Public 23Q4 PDGFRB, Copy Number Public 23Q4 GJA1, Copy Number Public 23Q4 MTAP, Copy Number Public 23Q4 ZNF185, Expression Public 23Q4 MTAP, Expression Public 23Q4 PRMT5, Expression Public 23Q4 PDGFRB, Expression Public 23Q4 ZNF185, Expression Public 23Q4 GJA1, Expression Public 23Q4 MAP1A, Expression Public 23Q4 MAT2A, Expression Public 23Q4 MUCL1
R-squared (the higher the better):  0.3849404259859315
mean squared error (the lower the better):  0.04713671524328601
Coefficients: 
Copy Number Public 23Q4 MTAP 0.2347538095886346
Copy Number Public 23Q4 PRMT5 0.1768962631189619
Copy Number Public 23Q4 GJA1 -0.15690046447957026
Copy Number Public 23Q4 MAP1A 0.11371081330245937
Copy Number Public 23Q4 MUCL1 0.08607494060958937
Copy Number Public 23Q4 MAT2A 0.058146249461147166
Copy Number Public 23Q4 PDGFRB -0.052558330943991184
Expression Public

In [250]:
px.scatter( x = output['y_pred_test'], y = output['y_test'], labels = {'x': 'Predictions (testing data)', 'y': 'Realized PRMT5 (testing data)'} )

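Because the training and testing data are chosen randomly, the numbers will vary if you change random_state. The two R-squared values (about 0.32 with one feature and 0.38 with all of them) are close to each other in my runs, a difference of roughly 7 percentage points. Keep in mind that rows with any missing feature are dropped, so the two models are not scored on exactly the same cell lines. But we can do better: what happens if we bring in bigger models?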

In [251]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor( max_depth= 4, random_state=42 )
output = get_and_evaluate_linear_model( model, joint_data, features, 'PRMT5' )

print( 'Features: ', ', '.join(output['features']) )
print( 'R-squared (the higher the better): ', output['r2'] )
print( 'mean squared error (the lower the better): ', output['mse'] )

Features:  Copy Number Public 23Q4 MUCL1, Copy Number Public 23Q4 PRMT5, Copy Number Public 23Q4 MAP1A, Copy Number Public 23Q4 MAT2A, Copy Number Public 23Q4 PDGFRB, Copy Number Public 23Q4 GJA1, Copy Number Public 23Q4 MTAP, Copy Number Public 23Q4 ZNF185, Expression Public 23Q4 MTAP, Expression Public 23Q4 PRMT5, Expression Public 23Q4 PDGFRB, Expression Public 23Q4 ZNF185, Expression Public 23Q4 GJA1, Expression Public 23Q4 MAP1A, Expression Public 23Q4 MAT2A, Expression Public 23Q4 MUCL1
R-squared (the higher the better):  0.44102823554399206
mean squared error (the lower the better):  0.04283827779192868
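Since a single train/test split gives a single number, one way to judge whether a gain in R-squared is meaningful is to repeat the split with several seeds and look at the spread. The sketch below does this on synthetic data generated with scikit-learn's `make_regression`, standing in for the real features and target. The sample size, noise level and number of seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the dependency score.
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

scores = []
for seed in range(5):
    # Same 60/40 split as in the notebook, but with a different seed each time.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    scores.append(r2_score(y_test, model.predict(X_test)))

print("R-squared per seed:", np.round(scores, 3))
print("mean:", np.mean(scores), "std:", np.std(scores))
```

If the difference between two models is smaller than the spread across seeds, it is probably not worth reading much into it.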


In [266]:
features

['Copy Number Public 23Q4 MUCL1',
 'Copy Number Public 23Q4 PRMT5',
 'Copy Number Public 23Q4 MAP1A',
 'Copy Number Public 23Q4 MAT2A',
 'Copy Number Public 23Q4 PDGFRB',
 'Copy Number Public 23Q4 GJA1',
 'Copy Number Public 23Q4 MTAP',
 'Copy Number Public 23Q4 ZNF185',
 'Expression Public 23Q4 MTAP',
 'Expression Public 23Q4 PRMT5',
 'Expression Public 23Q4 PDGFRB',
 'Expression Public 23Q4 ZNF185',
 'Expression Public 23Q4 GJA1',
 'Expression Public 23Q4 MAP1A',
 'Expression Public 23Q4 MAT2A',
 'Expression Public 23Q4 MUCL1']

In [252]:
px.scatter( x = output['y_pred_test'], y = output['y_test'], labels = {'x': 'Predictions (testing data)', 'y': 'Realized PRMT5 (testing data)'} )

In [268]:
{ features[i] : model.feature_importances_[i] for i in range(len(features)) }

{'Copy Number Public 23Q4 MUCL1': 0.011407636529843534,
 'Copy Number Public 23Q4 PRMT5': 0.03580835456050252,
 'Copy Number Public 23Q4 MAP1A': 0.023095985251015496,
 'Copy Number Public 23Q4 MAT2A': 0.019256412195827462,
 'Copy Number Public 23Q4 PDGFRB': 0.0229250281089463,
 'Copy Number Public 23Q4 GJA1': 0.007293093774018603,
 'Copy Number Public 23Q4 MTAP': 0.6785980786849979,
 'Copy Number Public 23Q4 ZNF185': 0.00539926039638626,
 'Expression Public 23Q4 MTAP': 0.04919384859102367,
 'Expression Public 23Q4 PRMT5': 0.0072532165273554395,
 'Expression Public 23Q4 PDGFRB': 0.05081480534241779,
 'Expression Public 23Q4 ZNF185': 0.036059449841725696,
 'Expression Public 23Q4 GJA1': 0.016016248434067303,
 'Expression Public 23Q4 MAP1A': 0.015218178315123782,
 'Expression Public 23Q4 MAT2A': 0.012617977814544701,
 'Expression Public 23Q4 MUCL1': 0.009042425632203645}
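A dictionary of importances is easier to read once it is sorted. The sketch below wraps `feature_importances_` in a pandas Series and ranks it. It uses a synthetic dataset in which the target is driven by one feature, so the feature names and data are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features; the target is driven almost entirely by feature_a.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(300, 3)), columns=["feature_a", "feature_b", "feature_c"]
)
y = 2.0 * X["feature_a"] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(max_depth=4, random_state=42).fit(X, y)

# Impurity-based importances, labeled by feature and sorted from high to low.
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(
    ascending=False
)

print(importances)
```

Applied to the model above, the same ranking puts MTAP's copy number far ahead of every other feature, in line with what the correlation analysis suggested.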

We're slowly improving! Can we do even better? Probably! How? I don't know, that's up to you to find out!

### Conclusion

So we've learned a few things today:
1. How to ingest data from depmap, work on dependency data and merge it with expression and copy number data.
2. How to visualize quite intricate literature findings in the data and understand their significance when it comes to drug discovery.
3. How to build simple ML models to predict dependency.

Do you have a hypothesis that you want to test? Or do you have the data skills and need a hypothesis? Get in touch and let's cure cancer!