# Session Info

Packages required to run this notebook:

In [13]:
import types
def imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            yield val.__name__
list(imports())

['builtins',
 'builtins',
 'pandas',
 'pickle',
 'rdkit.Chem',
 'pubchempy',
 'session_info',
 'types',
 'pkg_resources',
 'pip']

# Data loading 

Methods to load the raw data from the `data` folder and cache it for faster reloading.

### Data Pre-processing Guidelines

- **documentation of original download information**
    - original download link 
    - the date of download
    - the doi or bibliography of the linked publication 
    - basic description of the dataset
    - |time-consuming| ideally, a simple description of the methodology used to generate the dataset
        - how were the samples collected?
        - at what time point are the samples lysed? 
        - any further pre-processing steps?  
<br />

- **documentation of dataset(s)** 
    - *the information type represented by the dataset (i.e. gene expression, drug response, etc.)*
    - any supplementary spreadsheet(s) associated with the dataset or metadata
    - sample size (e.g. number of samples, number of genes, etc.)
    - main row and column domains (e.g. genes, samples, drugs, etc.)
    - identifiers used for drug, gene, protein etc. 
    - presence of specific drugs or genes/proteins of interest
        - e.g. CDK4/6 inhibitors: palbociclib, ribociclib and abemaciclib  
<br />  

- **documentation of the pre-processing steps**
    - *the final shape of the processed dataset associated with metadata, e.g. (n_samples, n_genes)*
    - |time-consuming| the technique used to transform the dataset
        - e.g. log2 transformation, z-score normalization, etc.
        - e.g. the method used to impute missing values
        - any removal of data and reasoning (i.e. due to missing values, etc.)
    - index to identifier mapping (e.g. gene index to gene symbol mapping)
        - then, the processed dataset will have indexes matched with a corresponding identifier/symbol 
        - e.g. gene index 0 corresponds to gene symbol A1BG
        - e.g. drug index 0 corresponds to drug palbociclib
        - when performing further filtering, the original index order must be preserved or traced to allow for mapping back to the original identifiers
    - creating a paired dataset from two different datasets
        - e.g. drug response and gene expression
        - e.g. drug response and mutation status
        - e.g. gene expression and mutation status
        - must perform model-to-name mapping between the two datasets and document the mapping logic
            - e.g. models are cell lines, matched by cell line name (no spaces, lower case)
            - e.g. models are cell lines, matched by a common identifier (e.g. Sanger_Model_ID)
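
The name-based matching rule in the last bullets can be sketched as follows. This is a minimal illustration, not the notebook's actual pairing code: the helper `normalise_name`, the column `match_key` and all numeric values are assumptions made up for the example.

```python
import pandas as pd

def normalise_name(name: str) -> str:
    """Lower-case a cell line name and drop spaces (the rule stated above)."""
    return name.replace(" ", "").lower()

# Toy tables; the numeric values are made up for illustration
drug_response = pd.DataFrame({
    "CELL_LINE_NAME": ["MC-CAR", "ES3", "EW-11"],
    "LN_IC50": [2.4, 3.1, 1.7],
})
expression = pd.DataFrame({
    "cell_line_name": ["mc-car", "ES 3", "SLR 21"],
    "TSPAN6": [4.3, 4.6, 3.2],
})

drug_response["match_key"] = drug_response["CELL_LINE_NAME"].map(normalise_name)
expression["match_key"] = expression["cell_line_name"].map(normalise_name)

# indicator=True adds a `_merge` column recording which side each row came from,
# which is a convenient way to document the mapping logic and its failures
paired = drug_response.merge(expression, on="match_key", how="outer", indicator=True)
print(paired[["match_key", "_merge"]])
```

Rows marked `both` are the paired models; rows marked `left_only` or `right_only` are the unmatched ones that the documentation should account for.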



### GDSC 1 

GDSC1 is a drug response dataset, retrieved from [Genomics of Drug Sensitivity in Cancer](http://www.cancerrxgene.org/). The data is stored in the `data/drug-response/GDSC1` folder.

Data Retrieval Date: 2022-06-01

Yang, W., Soares, J., Greninger, P., Edelman, E. J., Lightfoot, H., Forbes, S., Bindal, N., Beare, D., Smith, J. A., Thompson, I. R., Ramaswamy, S., Futreal, P. A., Haber, D. A., Stratton, M. R., Benes, C., McDermott, U., & Garnett, M. J. (2013). Genomics of Drug Sensitivity in Cancer (GDSC): A resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Research, 41(Database issue), D955–D961. https://doi.org/10.1093/nar/gks1111

#### Methodology

retrieved from [Genomics of Drug Sensitivity in Cancer](https://www.cancerrxgene.org/help#t_curve)

> The GDSC1 dataset was generated jointly by the Wellcome Sanger Institute and Massachusetts General Hospital between 2009 and 2015 using a matched set of cancer cell lines (the GDSC1000).

> Compounds were stored in aliquots at -80°C and were subjected to a maximum of 5 freeze-thaw cycles.

> Cells were seeded in 96-well or 384-well plates and compound dose titrations were delivered using tip based liquid handling apparatus. Cell viability was measured using either Syto60 or Resazurin. Drug treatments in this dataset used two formats:

> - 9-point dose curve incorporating a 2-fold dilution step (256-fold range)
> - 5-point dose curve incorporating a 4-fold dilution step (256-fold range)
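
Both formats span the same 256-fold range because n concentrations are separated by n - 1 dilution steps. The short check below makes the arithmetic explicit; `fold_range` is a hypothetical helper written for this note, not part of any GDSC tooling.

```python
def fold_range(n_points: int, dilution_step: float) -> float:
    """Ratio of the highest to the lowest concentration in a serial dilution."""
    # n_points concentrations are separated by (n_points - 1) dilution steps
    return dilution_step ** (n_points - 1)

nine_point = fold_range(9, 2)  # 2 ** 8
five_point = fold_range(5, 4)  # 4 ** 4
print(nine_point, five_point)
```

A maximum concentration of 2.0 divided by 256 gives 0.0078125, which is consistent with the `MIN_CONC` and `MAX_CONC` values printed for Erlotinib in the table output.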



In [1]:
## Initial Loading of Data

import pandas as pd 

# forward slashes avoid invalid escape sequences and work on every OS
gdsc1 = pd.read_excel('data/drug-response/GDSC1/GDSC1_fitted_dose_response_25Feb20.xlsx')
print(gdsc1.head())

  DATASET  NLME_RESULT_ID  NLME_CURVE_ID  COSMIC_ID CELL_LINE_NAME  \
0   GDSC1             281       12974350     683665         MC-CAR   
1   GDSC1             281       12975300     684055            ES3   
2   GDSC1             281       12975647     684057            ES5   
3   GDSC1             281       12975980     684059            ES7   
4   GDSC1             281       12976330     684062          EW-11   

  SANGER_MODEL_ID     TCGA_DESC  DRUG_ID  DRUG_NAME PUTATIVE_TARGET  \
0       SIDM00636            MM        1  Erlotinib            EGFR   
1       SIDM00265  UNCLASSIFIED        1  Erlotinib            EGFR   
2       SIDM00263  UNCLASSIFIED        1  Erlotinib            EGFR   
3       SIDM00269  UNCLASSIFIED        1  Erlotinib            EGFR   
4       SIDM00203  UNCLASSIFIED        1  Erlotinib            EGFR   

     PATHWAY_NAME  COMPANY_ID WEBRELEASE  MIN_CONC  MAX_CONC   LN_IC50  \
0  EGFR signaling        1045          Y  0.007813       2.0  2.395685   
1  E

In [2]:
gdsc1_info = pd.read_csv('data/drug-response/GDSC1/GDSC1_DrugData.csv')

In [5]:
## Caching loaded data into pickle obj 

import pickle

with open('data/drug-response/GDSC1/cache_gdsc1.pkl', 'wb') as f:
    pickle.dump(gdsc1, f)
    pickle.dump(gdsc1_info, f)


In [1]:
## Loading cached data
import pickle

with open('data/drug-response/GDSC1/cache_gdsc1.pkl', 'rb') as f:
    gdsc1 = pickle.load(f)
    gdsc1_info = pickle.load(f)
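
The caching cells write two objects into one pickle file, so they must be read back in the same order they were written. The sketch below demonstrates that behaviour on throwaway data; the file name `cache_demo.pkl` and the dictionaries are assumptions used only for illustration.

```python
import os
import pickle
import tempfile

# Toy stand-ins for the two cached objects (hypothetical values)
first = {"name": "response table"}
second = {"name": "drug annotations"}

path = os.path.join(tempfile.mkdtemp(), "cache_demo.pkl")

# Write: each dump appends one pickled object to the stream
with open(path, "wb") as f:
    pickle.dump(first, f)
    pickle.dump(second, f)

# Read: each load consumes the next object, so the order must match the dumps
with open(path, "rb") as f:
    loaded_first = pickle.load(f)
    loaded_second = pickle.load(f)

print(loaded_first["name"], "|", loaded_second["name"])
```

An alternative that is less sensitive to ordering is to dump a single dictionary keyed by object name, at the cost of changing the cache layout used here.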

In [2]:
# investigating the structure of the gdsc1 dataset 

print(gdsc1.head())

  DATASET  NLME_RESULT_ID  NLME_CURVE_ID  COSMIC_ID CELL_LINE_NAME  \
0   GDSC1             281       12974350     683665         MC-CAR   
1   GDSC1             281       12975300     684055            ES3   
2   GDSC1             281       12975647     684057            ES5   
3   GDSC1             281       12975980     684059            ES7   
4   GDSC1             281       12976330     684062          EW-11   

  SANGER_MODEL_ID     TCGA_DESC  DRUG_ID  DRUG_NAME PUTATIVE_TARGET  \
0       SIDM00636            MM        1  Erlotinib            EGFR   
1       SIDM00265  UNCLASSIFIED        1  Erlotinib            EGFR   
2       SIDM00263  UNCLASSIFIED        1  Erlotinib            EGFR   
3       SIDM00269  UNCLASSIFIED        1  Erlotinib            EGFR   
4       SIDM00203  UNCLASSIFIED        1  Erlotinib            EGFR   

     PATHWAY_NAME  COMPANY_ID WEBRELEASE  MIN_CONC  MAX_CONC   LN_IC50  \
0  EGFR signaling        1045          Y  0.007813       2.0  2.395685   
1  E

In [3]:
print(gdsc1.shape)

(310904, 19)


In [4]:
print(gdsc1_info.head())

   drug_id   drug_name                            synonyms  \
0     1559  Luminespib  AUY922, VER-52296,NVP-AUY922,  AUY   
1     1372  Trametinib                GSK1120212, Mekinist   
2     1909  Venetoclax       ABT-199, Veneclexta, GDC-0199   
3     1017    Olaparib       AZD2281, KU0059436,  Lynparza   
4     1021    Axitinib                    AG-13736, Inlyta   

                        pathway_name            targets   pubchem  
0  Protein stability and degradation              HSP90  10096043  
1                 ERK MAPK signaling         MEK1, MEK2  11707110  
2               Apoptosis regulation               BCL2  49846579  
3                   Genome integrity       PARP1, PARP2  23725625  
4                      RTK signaling  PDGFR, KIT, VEGFR   6450551  


In [5]:
palbo = gdsc1.loc[gdsc1['DRUG_NAME'] == 'Palbociclib']

print(palbo.shape)

(901, 19)


In [6]:
ribo = gdsc1.loc[gdsc1['DRUG_NAME'] == 'Ribociclib']

print(ribo.shape)

(0, 19)


In [7]:
Abemaciclib = gdsc1.loc[gdsc1['DRUG_NAME'] == 'Abemaciclib']

print(Abemaciclib.shape)

(0, 19)
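
The three per-drug lookups can also be done in one pass. The sketch below runs on a toy stand-in for `gdsc1` (the `LN_IC50` values are made up) and compares names case-insensitively, which is an added assumption meant to guard against spelling variants such as `palbociclib`.

```python
import pandas as pd

# Toy stand-in for `gdsc1`; only the DRUG_NAME column matters here
toy = pd.DataFrame({
    "DRUG_NAME": ["Erlotinib", "Palbociclib", "palbociclib", "Erlotinib"],
    "LN_IC50": [2.4, 1.1, 0.9, 3.0],  # illustrative values only
})

cdk46_inhibitors = ["Palbociclib", "Ribociclib", "Abemaciclib"]

# Count rows per drug, case-insensitively; absent drugs are reported as 0
names = toy["DRUG_NAME"].str.lower()
counts = {drug: int((names == drug.lower()).sum()) for drug in cdk46_inhibitors}
print(counts)
```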


Technical information
- gdsc1 
- dataset type: drug response
- dataset shape: (310904, 19) (n_measurements, n_features)
- **each row** represents a drug response measurement of a cell line for a given drug
- **each column** is a feature of the drug response measurement
- the column `LN_IC50` is the natural logarithm of the fitted IC50 value
- the column `AUC` is the area under the fitted dose-response curve
- the columns `DRUG_ID` and `DRUG_NAME` are the **identifiers of the drug**
    - `DRUG_ID` can be queried to show further information on drug targets from the supplementary spreadsheet `GDSC1_DrugData.csv` or the `gdsc1_info` object in Python (column `drug_id`)
- the columns `COSMIC_ID`, `SANGER_MODEL_ID` and `CELL_LINE_NAME` are the **identifiers of the cell line**
- drugs present: **Palbociclib** (901 rows); Ribociclib and Abemaciclib are absent (0 rows each)


Pre-processing may not be required for this dataset on its own; however, further pre-processing is needed when it is paired with other datasets.
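
Querying drug annotations by `DRUG_ID` amounts to a join between the response table and the drug information table. The sketch below uses toy frames: the drug identifiers, names and targets mirror rows printed above, while the cell line pairings and `LN_IC50` numbers are made up.

```python
import pandas as pd

# Toy response rows; pairings and LN_IC50 values are illustrative only
response = pd.DataFrame({
    "DRUG_ID": [1559, 1559, 1372],
    "CELL_LINE_NAME": ["ES3", "ES5", "ES3"],
    "LN_IC50": [-2.0, -1.5, 0.3],
})
# Toy annotation rows in the layout of `gdsc1_info`
drug_info = pd.DataFrame({
    "drug_id": [1559, 1372],
    "drug_name": ["Luminespib", "Trametinib"],
    "targets": ["HSP90", "MEK1, MEK2"],
})

# A left join keeps every response row, even if a drug has no annotation
annotated = response.merge(
    drug_info, how="left", left_on="DRUG_ID", right_on="drug_id"
)
print(annotated[["CELL_LINE_NAME", "drug_name", "targets", "LN_IC50"]])
```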

### CCLE 22Q2

CCLE (Cancer Cell Line Encyclopedia) is a gene expression dataset, retrieved from [Cancer Cell Line Encyclopedia](https://depmap.org/portal/download/all/). Data is pulled with the option 'DepMap Public 22Q2' in the selection menu.

The data is stored in the `data/gene-expression/CCLE_Public_22Q2` folder.

Data Retrieval Date: 2022-06-01

Ghandi, M., Huang, F. W., Jané-Valbuena, J., Kryukov, G. V., Lo, C. C., McDonald, E. R., Barretina, J., Gelfand, E. T., Bielski, C. M., Li, H., Hu, K., Andreev-Drakhlin, A. Y., Kim, J., Hess, J. M., Haas, B. J., Aguet, F., Weir, B. A., Rothberg, M. V., Paolella, B. R., … Sellers, W. R. (2019). Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature, 569(7757), Article 7757. https://doi.org/10.1038/s41586-019-1186-3


#### Methodology

From Ghandi et al, 2019:

> WGS for 329 cell lines and WES for 326 cell lines were performed at the Broad Institute Genomics Platform. Libraries were constructed and sequenced on either an Illumina HiSeq 2000 or Illumina GAIIX, with the use of 101-base-pair (bp) paired-end reads for WGS and 76-bp paired-end reads for WES. Output from Illumina software was processed by the Picard data-processing pipeline to yield BAM files containing well-calibrated, aligned reads. All sample information tracking was performed by automated LIMS messaging.

In [2]:
# importing ccle data 

import pandas as pd

ccle = pd.read_csv('data/gene-expression/CCLE_Public_22Q2/CCLE_expression.csv')

In [3]:
print(ccle.shape)

print(ccle.describe())

(1406, 19222)
       TSPAN6 (7105)  TNMD (64102)  DPM1 (8813)  SCYL3 (57147)  \
count    1406.000000   1406.000000  1406.000000    1406.000000   
mean        3.363532      0.069776     6.495860       2.366410   
std         1.645531      0.345324     0.646531       0.544784   
min         0.000000      0.000000     3.654206       0.594549   
25%         2.862946      0.000000     6.097505       2.003602   
50%         3.804776      0.000000     6.479295       2.334854   
75%         4.430620      0.000000     6.916029       2.682573   
max         8.131857      5.251340     9.175100       4.746850   

       C1orf112 (55732)   FGR (2268)   CFH (3075)  FUCA2 (2519)  GCLC (2729)  \
count       1406.000000  1406.000000  1406.000000   1406.000000  1406.000000   
mean           3.674362     0.445801     2.167746      5.140341     4.639761   
std            0.784917     1.250105     2.241927      1.817106     1.152074   
min            0.056584     0.000000     0.000000      0.000000     1.1

In [4]:
# rename ccle columns: split 'SYMBOL (ENTREZ_ID)' into gene symbol and Entrez ID

entrez = []
gene_name = []

for c in ccle.columns:
    if c == 'Unnamed: 0':
        # the first column holds the cell line identifier (DepMap_ID)
        entrez.append('CELLLINE')
        gene_name.append('CELLLINE')
    else:
        # the Entrez ID sits inside the brackets; the gene symbol precedes them
        left, right = c.find('('), c.find(')')
        entrez.append(c[left+1:right])
        gene_name.append(c[:left].strip())

In [5]:
ccle.columns = gene_name
print(ccle.head())
print(ccle.shape)

     CELLLINE    TSPAN6      TNMD      DPM1     SCYL3  C1orf112       FGR  \
0  ACH-001113  4.331992  0.000000  7.364397  2.792855  4.470537  0.028569   
1  ACH-001289  4.566815  0.584963  7.106537  2.543496  3.504620  0.000000   
2  ACH-001339  3.150560  0.000000  7.379032  2.333424  4.227279  0.056584   
3  ACH-001538  5.085340  0.000000  7.154109  2.545968  3.084064  0.000000   
4  ACH-000242  6.729145  0.000000  6.537607  2.456806  3.867896  0.799087   

        CFH     FUCA2      GCLC  ...      H3C2      H3C3  AC098582.1  \
0  1.226509  3.042644  6.499686  ...  2.689299  0.189034    0.201634   
1  0.189034  3.813525  4.221104  ...  1.286881  1.049631    0.321928   
2  1.310340  6.687061  3.682573  ...  0.594549  1.097611    0.831877   
3  5.868143  6.165309  4.489928  ...  0.214125  0.632268    0.298658   
4  7.208381  5.569856  7.127014  ...  1.117695  2.358959    0.084064   

   DUS4L-BCAP29  C8orf44-SGK3  ELOA3B    NPBWR1  ELOA3D  ELOA3      CDR1  
0      2.130931      0.555816

In [6]:
gene_entrez = pd.DataFrame({'gene_name': gene_name, 'entrez': entrez})
print(gene_entrez.head())

  gene_name    entrez
0  CELLLINE  CELLLINE
1    TSPAN6      7105
2      TNMD     64102
3      DPM1      8813
4     SCYL3     57147


In [8]:
import pickle 

with open('data/gene-expression/CCLE_Public_22Q2/ccle_expression.pkl', 'wb') as f:
    pickle.dump(gene_entrez, f)
    pickle.dump(ccle, f)

In [10]:
import pickle

with open('data/gene-expression/CCLE_Public_22Q2/ccle_expression.pkl', 'rb') as f:
    gene_entrez = pickle.load(f)
    ccle = pickle.load(f)

In [11]:
print(ccle.shape)

(1406, 19222)


In [12]:
print(ccle.head())

     CELLLINE    TSPAN6      TNMD      DPM1     SCYL3  C1orf112       FGR  \
0  ACH-001113  4.331992  0.000000  7.364397  2.792855  4.470537  0.028569   
1  ACH-001289  4.566815  0.584963  7.106537  2.543496  3.504620  0.000000   
2  ACH-001339  3.150560  0.000000  7.379032  2.333424  4.227279  0.056584   
3  ACH-001538  5.085340  0.000000  7.154109  2.545968  3.084064  0.000000   
4  ACH-000242  6.729145  0.000000  6.537607  2.456806  3.867896  0.799087   

        CFH     FUCA2      GCLC  ...      H3C2      H3C3  AC098582.1  \
0  1.226509  3.042644  6.499686  ...  2.689299  0.189034    0.201634   
1  0.189034  3.813525  4.221104  ...  1.286881  1.049631    0.321928   
2  1.310340  6.687061  3.682573  ...  0.594549  1.097611    0.831877   
3  5.868143  6.165309  4.489928  ...  0.214125  0.632268    0.298658   
4  7.208381  5.569856  7.127014  ...  1.117695  2.358959    0.084064   

   DUS4L-BCAP29  C8orf44-SGK3  ELOA3B    NPBWR1  ELOA3D  ELOA3      CDR1  
0      2.130931      0.555816

In [14]:
import pandas as pd 

ccle_sample_info = pd.read_csv('data/gene-expression/CCLE_Public_22Q2/sample_info.csv')


In [15]:
print(ccle_sample_info.head())

    DepMap_ID cell_line_name stripped_cell_line_name  \
0  ACH-000016         SLR 21                   SLR21   
1  ACH-000032     MHH-CALL-3                MHHCALL3   
2  ACH-000033      NCI-H1819                NCIH1819   
3  ACH-000043       Hs 895.T                  HS895T   
4  ACH-000049         HEK TE                   HEKTE   

                                     CCLE_Name alias  COSMICID     sex  \
0                                 SLR21_KIDNEY   NaN       NaN     NaN   
1  MHHCALL3_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE   NaN       NaN  Female   
2                                NCIH1819_LUNG   NaN       NaN  Female   
3                            HS895T_FIBROBLAST   NaN       NaN  Female   
4                                 HEKTE_KIDNEY   NaN       NaN     NaN   

         source       RRID  WTSI_Master_Cell_ID  ...   lineage_sub_subtype  \
0  Academic lab  CVCL_V607                  NaN  ...                   NaN   
1          DSMZ  CVCL_0089                  NaN  ...          

In [16]:
import pickle 

with open('data/gene-expression/CCLE_Public_22Q2/ccle_sample_info.pkl', 'wb') as f:
    pickle.dump(ccle_sample_info, f)

In [17]:
import pickle

with open('data/gene-expression/CCLE_Public_22Q2/ccle_sample_info.pkl', 'rb') as f:
    ccle_sample_info = pickle.load(f)

Dataset Documentation
- Dataset name: ccle_expression
- dataset type: gene expression by RNASeq
- dataset shape: (1406, 19222) (n_cell_lines, 1 identifier column + 19221 gene columns)
- **each row** represents the gene expression profile of one cell line
- **each column** after the first column is the expression of a specific gene across the cell lines
- the first column `CELLLINE` is renamed from `Unnamed: 0`, represents the cell line `DepMap_ID`, and is the **identifier of the cell line**
- `DepMap_ID` can be queried to show further information on the cell line from the supplementary spreadsheet `sample_info.csv` or the `ccle_sample_info` object in Python, including linkage to other identifiers such as `Sanger_Model_ID`

- biologically/clinically implicated genes of note for CDK4/6 inhibitors: #TODO

Pre-processing Documentation
- column renaming was performed: the first column was renamed from `Unnamed: 0` to `CELLLINE`, and the Entrez IDs were stripped from the gene column names and stored in a separate dataframe (`gene_entrez`) as part of data cleaning.
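
Linking `DepMap_ID` to `Sanger_Model_ID` is the step that lets CCLE rows be paired with GDSC rows. A minimal sketch on toy frames follows; the `SIDM_TOY_*` identifiers are made up, and the new column name `SANGER_MODEL_ID` is an assumption chosen to match the GDSC tables.

```python
import pandas as pd

# Toy stand-ins; the SIDM_TOY_* values are made up for illustration
sample_info = pd.DataFrame({
    "DepMap_ID": ["ACH-001113", "ACH-001289", "ACH-001339"],
    "Sanger_Model_ID": ["SIDM_TOY_A", None, "SIDM_TOY_C"],
})
expression = pd.DataFrame({
    "CELLLINE": ["ACH-001113", "ACH-001289", "ACH-001339"],
    "TSPAN6": [4.331992, 4.566815, 3.150560],
})

# Build the lookup once, then map it onto the expression table
depmap_to_sanger = sample_info.set_index("DepMap_ID")["Sanger_Model_ID"]
expression["SANGER_MODEL_ID"] = expression["CELLLINE"].map(depmap_to_sanger)

# Cell lines without a Sanger ID cannot be paired this way and should be documented
unmapped = expression.loc[expression["SANGER_MODEL_ID"].isna(), "CELLLINE"].tolist()
print(unmapped)
```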

### GDSC 2

GDSC2 is a drug response dataset, retrieved from [Genomics of Drug Sensitivity in Cancer](http://www.cancerrxgene.org/). The data is stored in the `data/drug-response/GDSC2` folder.

In [9]:
## Initial Loading of Data

import pandas as pd 

gdsc2 = pd.read_excel('data/drug-response/GDSC2/GDSC2_fitted_dose_response_25Feb20.xlsx')
print(gdsc2.head())

  DATASET  NLME_RESULT_ID  NLME_CURVE_ID  COSMIC_ID CELL_LINE_NAME  \
0   GDSC2             282       13320532     749709        HCC1954   
1   GDSC2             282       13320565     749710        HCC1143   
2   GDSC2             282       13320598     749711        HCC1187   
3   GDSC2             282       13320631     749712        HCC1395   
4   GDSC2             282       13320668     749713        HCC1599   

  SANGER_MODEL_ID TCGA_DESC  DRUG_ID     DRUG_NAME PUTATIVE_TARGET  \
0       SIDM00872      BRCA     1003  Camptothecin            TOP1   
1       SIDM00866      BRCA     1003  Camptothecin            TOP1   
2       SIDM00885      BRCA     1003  Camptothecin            TOP1   
3       SIDM00884      BRCA     1003  Camptothecin            TOP1   
4       SIDM00877      BRCA     1003  Camptothecin            TOP1   

      PATHWAY_NAME  COMPANY_ID WEBRELEASE  MIN_CONC  MAX_CONC   LN_IC50  \
0  DNA replication        1046          Y  0.000098       0.1 -0.251083   
1  DNA r

In [12]:
gdsc2_info = pd.read_csv('data/drug-response/GDSC2/GDSC2_DrugData.csv')



In [13]:
## Caching loaded data into pickle obj 

import pickle

with open('data/drug-response/GDSC2/cache_gdsc2.pkl', 'wb') as f:
    pickle.dump(gdsc2, f)
    pickle.dump(gdsc2_info, f)

In [14]:
## Loading cached data

import pickle

with open('data/drug-response/GDSC2/cache_gdsc2.pkl', 'rb') as f:
    gdsc2 = pickle.load(f)
    gdsc2_info = pickle.load(f)

### Goncalves 2022 Proteomic Cell Paper (n=949)

Data was retrieved from the [supplemental information of the original Cancer Cell article](https://www.cell.com/cancer-cell/fulltext/S1535-6108(22)00274-4) on 01-02-2023.

Gonçalves, E., Poulos, R. C., Cai, Z., Barthorpe, S., Manda, S. S., Lucas, N., Beck, A., Bucio-Noble, D., Dausmann, M., Hall, C., Hecker, M., Koh, J., Lightfoot, H., Mahboob, S., Mali, I., Morris, J., Richardson, L., Seneviratne, A. J., Shepherd, R., … Reddel, R. R. (2022). Pan-cancer proteomic map of 949 human cell lines. Cancer Cell, 40(8), 835-849.e8. https://doi.org/10.1016/j.ccell.2022.06.010

The data is stored in the `data/proteomic-expression/goncalves-2022-cell` folder.

This dataset contains the proteomic expression of 949 cell lines. 

#### Methodology

From Gonçalves et al, 2022 (Results page): 

>To construct a pan-cancer proteomic map, proteomes of 949 human cancer cell lines from 28 tissues and more than 40 genetically and histologically diverse cancer types were quantified (Figures 1A and S1A, Table S1). The proteome for each cell line was acquired by DIA-MS from six replicates using a workflow that enables high throughput and minimal instrument downtime (see STAR Methods, Figure S1B). The resulting dataset was derived from 6,864 DIA-MS runs acquired over 10,000 MS h (Table S1), including peptide preparations derived from the human embryonic kidney cell line HEK293T that were used throughout all data acquisition periods and instruments for quality control. These data, together with the spectral library, were deposited in the Proteomics Identification Database (Perez-Riverol et al., 2019) with dataset identifier PXD030304. Raw DIA-MS data were processed with DIA-NN (Demichev et al., 2020), using retention time-dependent normalization and with a spectral library generated by DIA-NN. For full details of data processing steps and parameters, see STAR Methods and Table S1. MaxLFQ (Cox et al., 2014) was then used to quantify a total of 8,498 proteins (Table S2, Figure S1C), with a median of 5,237 proteins (min-max range: 2,523–6,251) quantified per cell line (Table S1, Figure 1A).

For more detailed information on the methodology, see the STAR Methods section of the paper. In brief, protein expression was measured using DIA-MS, processed using DIA-NN, and quantified using MaxLFQ. The quantified values were then log2-transformed.

For more information on MaxLFQ, see [Cox et al., 2014](https://www.sciencedirect.com/science/article/pii/S1535947620333107).

In [27]:
import pandas as pd

# loading in the proteomic data

main_file = pd.ExcelFile('data/proteomic-expression/goncalves-2022-cell/goncalves-2022-cell-949-protein-matrix.xlsx')
print(main_file.sheet_names)

full_protein_matrix = pd.read_excel(main_file, 'Full protein matrix', header=1)
print(full_protein_matrix.head(2))

sin_peptile_exclusion_matrix = pd.read_excel(main_file, 'Prot matrix excl single-peptide', header=1)
print(sin_peptile_exclusion_matrix.head(2))


['Full protein matrix', 'Prot matrix excl single-peptide']
  Project_Identifier  Q9Y651;SOX21_HUMAN  P37108;SRP14_HUMAN  \
0     SIDM00018;K052                 NaN             7.10955   
1    SIDM00023;TE-12                 NaN             6.82802   

   Q96JP5;ZFP91_HUMAN  Q9Y4H2;IRS2_HUMAN  P36578;RL4_HUMAN  \
0             3.38802                NaN           7.86661   
1             4.14346            2.21578           7.62878   

   Q6SPF0;SAMD1_HUMAN  O76031;CLPX_HUMAN  Q8WUQ7;CATIN_HUMAN  \
0             3.77937            4.19666                 NaN   
1             3.23990            4.60902                 NaN   

   A6NIH7;U119B_HUMAN  ...  Q8WXF0;SRS12_HUMAN  P02763;A1AG1_HUMAN  \
0             2.67750  ...                 NaN                 NaN   
1             2.88893  ...                 NaN                 NaN   

   Q9ULK4;MED23_HUMAN  P22352;GPX3_HUMAN  P0C221;CC175_HUMAN  \
0                 NaN                NaN             4.50249   
1                 NaN        

In [29]:
print(full_protein_matrix.shape)
print(sin_peptile_exclusion_matrix.shape)

(949, 8499)
(949, 6693)


In [28]:
# loading in the sample info for the proteomic data

info_file = pd.ExcelFile('data/proteomic-expression/goncalves-2022-cell/goncalves-2022-cell-949-sample-info.xlsx')
print(info_file.sheet_names)

goncalve_cell_line_info = pd.read_excel(info_file, 'Cell line level sample info', header=1)

print(goncalve_cell_line_info.head(1))
print(goncalve_cell_line_info.shape)

['Legend', 'Cell line level sample info', 'Replicate level sample info', 'Variable isolation windows', 'Complete DIANN input file list', 'DIA-NN parameters']
    model_id Project_Identifier Cell_line Source Identifier Gender  \
0  SIDM00896     SIDM00896;BC-1      BC-1   ATCC  CVCL_1079   Male   

                   Tissue_type                    Cancer_type  \
0  Haematopoietic and Lymphoid  B-Cell Non-Hodgkin's Lymphoma   

              Cancer_subtype Haem_lineage  ...        F6        F7        F8  \
0  Primary effusion lymphoma   B lymphoid  ... -0.460411  0.769345  1.655736   

         F9       F10       F11       F12       F13       F14       F15  
0  2.109067 -0.056367 -0.550705  0.229186  0.486789  0.299905 -0.319625  

[1 rows x 42 columns]
(949, 42)


In [31]:
# pickle the goncalve_proteome and goncalve_proteome_info

import pickle

with open('data/proteomic-expression/goncalves-2022-cell/goncalve_proteome.pkl', 'wb') as f:
    pickle.dump(full_protein_matrix, f)
    pickle.dump(sin_peptile_exclusion_matrix, f)
    pickle.dump(goncalve_cell_line_info, f)

In [11]:
# load the goncalve_proteome and goncalve_proteome_info

import pickle

with open('data/proteomic-expression/goncalves-2022-cell/goncalve_proteome.pkl', 'rb') as f:
    full_protein_matrix = pickle.load(f)
    sin_peptile_exclusion_matrix = pickle.load(f)
    goncalve_cell_line_info = pickle.load(f)

In [12]:
print(full_protein_matrix.head(2))

  Project_Identifier  Q9Y651;SOX21_HUMAN  P37108;SRP14_HUMAN  \
0     SIDM00018;K052                 NaN             7.10955   
1    SIDM00023;TE-12                 NaN             6.82802   

   Q96JP5;ZFP91_HUMAN  Q9Y4H2;IRS2_HUMAN  P36578;RL4_HUMAN  \
0             3.38802                NaN           7.86661   
1             4.14346            2.21578           7.62878   

   Q6SPF0;SAMD1_HUMAN  O76031;CLPX_HUMAN  Q8WUQ7;CATIN_HUMAN  \
0             3.77937            4.19666                 NaN   
1             3.23990            4.60902                 NaN   

   A6NIH7;U119B_HUMAN  ...  Q8WXF0;SRS12_HUMAN  P02763;A1AG1_HUMAN  \
0             2.67750  ...                 NaN                 NaN   
1             2.88893  ...                 NaN                 NaN   

   Q9ULK4;MED23_HUMAN  P22352;GPX3_HUMAN  P0C221;CC175_HUMAN  \
0                 NaN                NaN             4.50249   
1                 NaN                NaN             4.52413   

   P02753;RET4_HUMAN  Q

Dataset Documentation
- Dataset name: full_protein_matrix

- dataset type: proteomic expression by DIA-MS

- dataset shape: (949, 8499) (n_cell_lines, 1 identifier column + 8498 quantified proteins); note that each protein expression value is already log2-transformed.

- **each row** represents a proteomic measurement of a cell line

- **each column** after the first column is the specific expression of a protein in a given cell line

- the first column `Project_Identifier` is the **identifier of the cell line**, in the form `model_id;cell_line_name` (e.g. `SIDM00018;K052`). It can be matched to the `model_id` column of the sample info spreadsheet or of the Python object `goncalve_cell_line_info`. `model_id` appears to be consistent with the Sanger model ID format.
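
Because `Project_Identifier` combines the model ID and the cell line name, splitting it gives a column that can be joined directly on `model_id`. The sketch below uses two rows printed earlier in this section; the new column names `model_id` and `cell_line` are choices made for this example.

```python
import pandas as pd

# Toy matrix using identifiers and values printed earlier in this section
toy_matrix = pd.DataFrame({
    "Project_Identifier": ["SIDM00018;K052", "SIDM00023;TE-12"],
    "P37108;SRP14_HUMAN": [7.10955, 6.82802],
})

# Split once on ';' so that any later ';' stays inside the cell line name
parts = toy_matrix["Project_Identifier"].str.split(";", n=1, expand=True)
toy_matrix.insert(0, "model_id", parts[0])
toy_matrix.insert(1, "cell_line", parts[1])
print(toy_matrix[["model_id", "cell_line"]])
```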





In [14]:
# preprocess the dataset by zeroing the nan values 

full_protein_matrix = full_protein_matrix.fillna(0)
sin_peptile_exclusion_matrix = sin_peptile_exclusion_matrix.fillna(0)

# print(full_protein_matrix.head(2))
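
After `fillna(0)`, a protein that was not quantified can no longer be told apart from a log2 value of exactly 0. One way to keep that information for the pre-processing documentation is to record a missingness mask before filling. The sketch below is a suggestion on toy data, not a step the notebook performs; `was_missing` and `missing_fraction` are names introduced here.

```python
import numpy as np
import pandas as pd

# Toy log2 intensities with a missing value, mirroring rows printed earlier
toy = pd.DataFrame({
    "Project_Identifier": ["SIDM00018;K052", "SIDM00023;TE-12"],
    "Q9Y4H2;IRS2_HUMAN": [np.nan, 2.21578],
    "P36578;RL4_HUMAN": [7.86661, 7.62878],
})

protein_cols = toy.columns.drop("Project_Identifier")

# Record where values were missing before they are overwritten
was_missing = toy[protein_cols].isna()
missing_fraction = was_missing.mean()  # per-protein share of missing values

filled = toy.fillna(0)
print(missing_fraction)
```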

In [15]:
import pickle

with open('data/proteomic-expression/goncalves-2022-cell/goncalve_proteome_fillna.pkl', 'wb') as f:
    pickle.dump(full_protein_matrix, f)
    pickle.dump(sin_peptile_exclusion_matrix, f)
    pickle.dump(goncalve_cell_line_info, f)

In [17]:
# load the goncalve_proteome and goncalve_proteome_info

import pickle

with open('data/proteomic-expression/goncalves-2022-cell/goncalve_proteome_fillna.pkl', 'rb') as f:
    full_protein_matrix = pickle.load(f)
    sin_peptile_exclusion_matrix = pickle.load(f)
    goncalve_cell_line_info = pickle.load(f)

### STRING Database for Protein-Protein Interactions

In [1]:
import pandas as pd 

# loading in the STRING protein-protein interaction links

string_df = pd.read_csv("data/protein-interaction/STRING/9606.protein.links.detailed.v11.5.txt.gz", delimiter=' ')

In [2]:
string_df.head()

Unnamed: 0,protein1,protein2,neighborhood,fusion,cooccurence,coexpression,experimental,database,textmining,combined_score
0,9606.ENSP00000000233,9606.ENSP00000379496,0,0,0,54,0,0,144,155
1,9606.ENSP00000000233,9606.ENSP00000314067,0,0,0,0,180,0,61,197
2,9606.ENSP00000000233,9606.ENSP00000263116,0,0,0,62,152,0,101,222
3,9606.ENSP00000000233,9606.ENSP00000361263,0,0,0,0,161,0,64,181
4,9606.ENSP00000000233,9606.ENSP00000409666,0,0,0,82,213,0,72,270
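
STRING reports each score as an integer scaled to the range 0 to 1000, so interactions are usually filtered by a `combined_score` threshold before use. The sketch below does this on the five rows printed above; `interactors_above` is a hypothetical helper, and it assumes the query protein appears in the `protein1` column.

```python
import pandas as pd

# Toy edge list copied from the rows printed above
toy_links = pd.DataFrame({
    "protein1": ["9606.ENSP00000000233"] * 5,
    "protein2": [
        "9606.ENSP00000379496", "9606.ENSP00000314067",
        "9606.ENSP00000263116", "9606.ENSP00000361263",
        "9606.ENSP00000409666",
    ],
    "combined_score": [155, 197, 222, 181, 270],
})

def interactors_above(links: pd.DataFrame, protein_id: str, min_score: int) -> list:
    """Return partners of `protein_id` with combined_score >= min_score."""
    hits = links[(links["protein1"] == protein_id)
                 & (links["combined_score"] >= min_score)]
    return hits["protein2"].tolist()

# The threshold of 200 is arbitrary and chosen only to suit the toy rows
partners = interactors_above(toy_links, "9606.ENSP00000000233", 200)
print(partners)
```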


In [3]:
string_df_info = pd.read_csv("data/protein-interaction/STRING/9606.protein.info.v11.5.txt.gz", delimiter='\t')

In [4]:
string_df_info.head()

Unnamed: 0,#string_protein_id,preferred_name,protein_size,annotation
0,9606.ENSP00000000233,ARF5,180,ADP-ribosylation factor 5; GTP-binding protein...
1,9606.ENSP00000000412,M6PR,277,Cation-dependent mannose-6-phosphate receptor;...
2,9606.ENSP00000001008,FKBP4,459,Peptidyl-prolyl cis-trans isomerase FKBP4; Imm...
3,9606.ENSP00000001146,CYP26B1,512,Cytochrome P450 26B1; Involved in the metaboli...
4,9606.ENSP00000002125,NDUFAF7,441,"Protein arginine methyltransferase NDUFAF7, mi..."


In [5]:
string_df_alias = pd.read_csv("data/protein-interaction/STRING/9606.protein.aliases.v11.5.txt.gz", delimiter='\t')

In [6]:
string_df_alias.tail(10)

Unnamed: 0,#string_protein_id,alias,source
4213381,9606.ENSP00000485678,Q8NGQ2,BLAST_UniProt_AC
4213382,9606.ENSP00000485678,Q8NGQ2,Ensembl_HGNC_UniProt_ID(supplied_by_UniProt)
4213383,9606.ENSP00000485678,Q8NGQ2,Ensembl_HGNC_UniProt_ID(supplied_by_UniProt)_AC
4213384,9606.ENSP00000485678,Q96R34,BLAST_UniProt_AC
4213385,9606.ENSP00000485678,Q96R34,Ensembl_HGNC_UniProt_ID(supplied_by_UniProt)_AC
4213386,9606.ENSP00000485678,hsa:219952,BLAST_KEGG_KEGGID
4213387,9606.ENSP00000485678,olfactory receptor family 6 subfamily Q member 1,Ensembl_HGNC_Approved_Name
4213388,9606.ENSP00000485678,uc010rjz.2,BLAST_UniProt_DR_UCSC
4213389,9606.ENSP00000485678,uc010rjz.2,Ensembl_HGNC_UCSC_ID(supplied_by_UCSC)
4213390,9606.ENSP00000485678,uc010rjz.2,Ensembl_HGNC_UniProt_ID(supplied_by_UniProt)_D...
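
The alias table lists the same alias several times, once per source, so a lookup should deduplicate the STRING IDs before deciding whether a match is ambiguous. A minimal sketch on rows taken from the output above follows; `string_ids_for_alias` is a hypothetical helper name.

```python
import pandas as pd

# Toy alias table copied from rows printed above
toy_alias = pd.DataFrame({
    "#string_protein_id": ["9606.ENSP00000485678"] * 3,
    "alias": ["Q8NGQ2", "Q8NGQ2", "hsa:219952"],
    "source": [
        "BLAST_UniProt_AC",
        "Ensembl_HGNC_UniProt_ID(supplied_by_UniProt)",
        "BLAST_KEGG_KEGGID",
    ],
})

def string_ids_for_alias(alias_df: pd.DataFrame, alias: str) -> list:
    """Return the distinct STRING IDs whose alias matches, ignoring case."""
    match = alias_df["alias"].str.lower() == alias.lower()
    return sorted(alias_df.loc[match, "#string_protein_id"].unique())

ids = string_ids_for_alias(toy_alias, "q8ngq2")
print(ids)
```

A result with exactly one element is an unambiguous match; an empty list or several elements needs manual review.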


In [7]:
# load the goncalve_proteome and goncalve_proteome_info

import pickle

with open('data/proteomic-expression/goncalves-2022-cell/goncalve_proteome_fillna.pkl', 'rb') as f:
    full_protein_matrix = pickle.load(f)
    sin_peptile_exclusion_matrix = pickle.load(f)
    goncalve_cell_line_info = pickle.load(f)


# import CCLE gene expression data using pickle

with open('data/gene-expression/CCLE_Public_22Q2/ccle_expression.pkl', 'rb') as f:
    gene_entrez = pickle.load(f)
    ccle = pickle.load(f)

# import CCLE sample info data using pickle

with open('data/gene-expression/CCLE_Public_22Q2/ccle_sample_info.pkl', 'rb') as f:
    ccle_sample_info = pickle.load(f)


In [8]:
sin_peptile_exclusion_matrix.head()

Unnamed: 0,Project_Identifier,P37108;SRP14_HUMAN,Q96JP5;ZFP91_HUMAN,Q9Y4H2;IRS2_HUMAN,P36578;RL4_HUMAN,Q6SPF0;SAMD1_HUMAN,O76031;CLPX_HUMAN,Q8WUQ7;CATIN_HUMAN,A6NIH7;U119B_HUMAN,Q9BTD8;RBM42_HUMAN,...,P33151;CADH5_HUMAN,Q5EBL4;RIPL1_HUMAN,P49715;CEBPA_HUMAN,Q5TA45;INT11_HUMAN,O14924;RGS12_HUMAN,Q7Z3B1;NEGR1_HUMAN,O60669;MOT2_HUMAN,Q13571;LAPM5_HUMAN,Q96JM2;ZN462_HUMAN,P35558;PCKGC_HUMAN
0,SIDM00018;K052,7.10955,3.41494,0.0,7.86661,3.89547,4.19666,0.0,0.0,3.19088,...,0.0,0.0,3.90064,2.63998,0.0,0.0,0.0,0.0,0.0,0.0
1,SIDM00023;TE-12,6.82802,4.14346,2.23781,7.62878,3.19811,4.60902,0.0,2.47059,3.69535,...,0.0,0.0,0.0,3.19608,0.0,0.0,0.0,0.0,0.0,0.0
2,SIDM00040;TMK-1,7.01426,4.19987,2.44055,8.12459,0.0,4.76881,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,SIDM00041;STS-0421,5.28591,3.35789,0.0,7.97268,0.0,4.52092,0.0,0.0,2.73088,...,0.0,0.0,0.0,2.79023,0.0,0.0,0.0,0.0,0.0,0.0
4,SIDM00042;PL4,5.70786,0.0,0.0,6.22574,0.0,4.49579,0.0,0.0,2.87981,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
columns_protein = sin_peptile_exclusion_matrix.columns

columns_protein[1]

'P37108;SRP14_HUMAN'
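
Each protein column label has the form `ACCESSION;ENTRY_NAME`, so it can be split before being matched against STRING. The sketch below uses a hypothetical helper, `split_protein_column`. Note that the mnemonic before `_HUMAN` is often, but not always, the gene symbol (for example, `RL4_HUMAN` corresponds to the gene RPL4), so the accession is likely the safer key.

```python
def split_protein_column(label: str) -> tuple:
    """Split 'ACCESSION;ENTRY_NAME' into its two parts."""
    accession, _, entry_name = label.partition(";")
    return accession, entry_name

accession, entry_name = split_protein_column("P37108;SRP14_HUMAN")

# Drop the species suffix to get the mnemonic part of the entry name
mnemonic = entry_name.rsplit("_", 1)[0]
print(accession, entry_name, mnemonic)
```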

#### Core functions

In [17]:
import pandas as pd 
import numpy as np

def get_protein_id_by_name(name: str, info: pd.DataFrame, alias: pd.DataFrame, 
                           absolute_match = True,
                           edit_distance=1):
    """Return the STRING protein id for `name`, or None if it is missing or ambiguous.

    The `preferred_name` column of `info` is searched first (case-insensitive);
    if there is no hit, the `alias` column of `alias` is searched instead.
    NOTE: `absolute_match` and `edit_distance` are not used yet, so only
    exact case-insensitive matches are returned.
    """

    # get the `#string_protein_id` column using name 
    name_id = info.loc[info['preferred_name'].str.lower() == name.lower()]['#string_protein_id']
    
    # if the name_id is not empty and only one value, return the value
    if not name_id.empty and len(name_id) == 1:
        return name_id.values[0]
    
    if len(name_id) > 1:
        print('Warning: more than one id found for the given name')
        return None
    
    if name_id.empty:  
        
        # get the `#string_protein_id` column from the alias dataframe using the `alias` column, the
        # param `name` is the value of the `alias` column
        alias_id = alias.loc[alias['alias'].str.lower() == name.lower()]['#string_protein_id']

        if len(alias_id) > 1:
            if alias_id.eq(alias_id.iloc[0]).all():
                return alias_id.values[0]
            else:
                print('Warning: more than one id found for the given name (alias)')
                print(alias_id)
                return None

        if not alias_id.empty and len(alias_id) == 1:
            return alias_id.values[0]
        
        if alias_id.empty:
            return None 

    

def select_columns_by_string_ids(omics_df: pd.DataFrame, string_ids: list) -> pd.DataFrame:
    # select the columns from the omics_df that have the given string_id
    # return a dataframe with the selected columns
    pass


def get_protein_interactors(id: str, df: pd.DataFrame):
    # get the interactors of the protein with the given id
    # return a list of protein ids
    pass

def get_protein_name_by_id():
    pass

def get_protein_id_by_best_name():

    pass

def run_test_get_protein_id_by_name():

    test_id = get_protein_id_by_name('HSP90AA1', string_df_info, string_df_alias)
    print(test_id)

    test_id = get_protein_id_by_name('HSP90Aa1', string_df_info, string_df_alias, absolute_match=False)
    print(test_id)

    test_id = get_protein_id_by_name('HSP90A1', string_df_info, string_df_alias, absolute_match=False)
    print(test_id)

    for name in columns_protein[:10]:
        name = name.split(';')[0]
        id = get_protein_id_by_name(name, string_df_info, string_df_alias)
        print(name, id)


for gene in gene_entrez['gene_name'][:10]:
    string_id = get_protein_id_by_name(gene, string_df_info, string_df_alias, absolute_match=False)
    print(gene, string_id)

CELLLINE None
TSPAN6 9606.ENSP00000362111
TNMD 9606.ENSP00000362122
DPM1 9606.ENSP00000360644
SCYL3 9606.ENSP00000356746
C1orf112 9606.ENSP00000286031
FGR 9606.ENSP00000363117
CFH 9606.ENSP00000356399
FUCA2 9606.ENSP00000002165
GCLC 9606.ENSP00000229416


#### Create link between ccle and goncalves

In [52]:
# create a relational dataframe between the goncalve proteome name/id, string protein id

data = []
miss_count = 0 
for i, proteo_id in enumerate(columns_protein):
    protein_str_list = proteo_id.split(';')
    if len(protein_str_list) > 1:
        protein_id = protein_str_list[0]
        protein_name = protein_str_list[1].split('_')[0]
        string_id = get_protein_id_by_name(protein_id, string_df_info, string_df_alias)
        if string_id is not None:
            data.append([proteo_id, string_id, protein_id, protein_name])
            print(f'iteration {i} protein_id: {proteo_id}, string_id: {string_id}, protein_name: {protein_name}, missing so far {miss_count}')
        else:
            miss_count += 1
            print(f'protein_id: {proteo_id}, string_id: {string_id}, protein_name: {protein_name} not found')

goncalve_to_string_id_df = pd.DataFrame(data, columns=['goncalve_protein_id', 'string_protein_id', 'protein_id', 'protein_name'])





iteration 1 protein_id: P37108;SRP14_HUMAN, string_id: 9606.ENSP00000267884, protein_name: SRP14, missing so far 0
iteration 2 protein_id: Q96JP5;ZFP91_HUMAN, string_id: 9606.ENSP00000339030, protein_name: ZFP91, missing so far 0
iteration 3 protein_id: Q9Y4H2;IRS2_HUMAN, string_id: 9606.ENSP00000365016, protein_name: IRS2, missing so far 0
iteration 4 protein_id: P36578;RL4_HUMAN, string_id: 9606.ENSP00000311430, protein_name: RL4, missing so far 0
iteration 5 protein_id: Q6SPF0;SAMD1_HUMAN, string_id: 9606.ENSP00000431971, protein_name: SAMD1, missing so far 0
iteration 6 protein_id: O76031;CLPX_HUMAN, string_id: 9606.ENSP00000300107, protein_name: CLPX, missing so far 0
iteration 7 protein_id: Q8WUQ7;CATIN_HUMAN, string_id: 9606.ENSP00000415078, protein_name: CATIN, missing so far 0
iteration 8 protein_id: A6NIH7;U119B_HUMAN, string_id: 9606.ENSP00000344942, protein_name: U119B, missing so far 0
iteration 9 protein_id: Q9BTD8;RBM42_HUMAN, string_id: 9606.ENSP00000262633, protein_nam

In [53]:
print(f'Size of original proteome: {len(columns_protein)} Size of goncalve_to_string_id_df: {len(goncalve_to_string_id_df)}')
print(f'Percentage of proteins with string id: {len(goncalve_to_string_id_df)/len(columns_protein)*100:.2f}%')

Size of original proteome: 6693 Size of goncalve_to_string_id_df: 6624
Percentage of proteins with string id: 98.97%


In [54]:
goncalve_to_string_id_df.head()

Unnamed: 0,goncalve_protein_id,string_protein_id,protein_id,protein_name
0,P37108;SRP14_HUMAN,9606.ENSP00000267884,P37108,SRP14
1,Q96JP5;ZFP91_HUMAN,9606.ENSP00000339030,Q96JP5,ZFP91
2,Q9Y4H2;IRS2_HUMAN,9606.ENSP00000365016,Q9Y4H2,IRS2
3,P36578;RL4_HUMAN,9606.ENSP00000311430,P36578,RL4
4,Q6SPF0;SAMD1_HUMAN,9606.ENSP00000431971,Q6SPF0,SAMD1


In [None]:
# to pickle
goncalve_to_string_id_df.to_pickle('data/protein-interaction/STRING/goncalve_to_string_id_df.pkl')

### PDE Ribociclib Data (Sungyoung)

Type: Drug response dataset (single drug: ribociclib)

Source: in-house data of external collaborators

Data is stored in the `data\drug-response\PDE_Ribociclib_ExtInHouse` folder.

##### Data Description and Methods

The dataset consists of multiple Excel files, each containing multiple sheets. Drug response is measured as the percentage (%) decrease in Ki67 positivity versus control. Ki67 is known to play a role in cell proliferation (Soliman & Yussif, 2016). Responders are defined as samples with a decrease in Ki67 positivity of at least 25% or at least 50% versus control, depending on the cutoff used. Two doses of ribociclib were tested, 100 nM and 500 nM.

From a brief visual inspection of the data, the `datamatrix` sheets in `response_mimi` and `response_ml_training_data` appear to be identical and to refer to the "responders" of the 100 nM ribociclib treatment with a 25% decrease in Ki67 positivity versus control. Proteomic expression data were analyzed using Spectronaut 8 and quantified using MaxQuant Version 1.5.2.8 (Nguyen et al, 2018).

Soliman, N. A., & Yussif, S. M. (2016). Ki-67 as a prognostic marker according to breast cancer molecular subtype. Cancer Biology & Medicine, 13(4), 496–504. https://doi.org/10.20892/j.issn.2095-3941.2016.0066

Nguyen, E. V., Centenera, M. M., Moldovan, M., Das, R., Irani, S., Vincent, A. D., Chan, H., Horvath, L. G., Lynn, D. J., Daly, R. J., & Butler, L. M. (2018). Identification of Novel Response and Predictive Biomarkers to Hsp90 Inhibitors Through Proteomic Profiling of Patient-derived Prostate Tumor Explants. Molecular & Cellular Proteomics, 17(8), 1470–1486. https://doi.org/10.1074/mcp.RA118.000633
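
The response definition above can be sketched in a few lines. This is a sketch under two assumptions: the decrease is computed as `(vehicle - treated) / vehicle`, which reproduces the `%Decrease` values stored in the `Response groups` sheet for sample P1, and "at least" is read as an inclusive threshold (`>=`). The function names are hypothetical.

```python
def percent_decrease(vehicle: float, treated: float) -> float:
    """Fractional decrease in Ki67 positivity relative to the vehicle control."""
    return (vehicle - treated) / vehicle


def classify_response(decrease: float, cutoff: float) -> str:
    """Label a sample as 'Responder' when the decrease reaches the cutoff."""
    return 'Responder' if decrease >= cutoff else 'Non-Responder'


# Ki67 values for sample P1, copied from the 'Response groups' sheet
vehicle, rib_100nm, rib_500nm = 72.23, 57.53, 23.28

for dose, treated in [('100 nM', rib_100nm), ('500 nM', rib_500nm)]:
    decrease = percent_decrease(vehicle, treated)
    for cutoff in (0.25, 0.50):
        label = classify_response(decrease, cutoff)
        print(f'{dose}: decrease={decrease:.6f}, cutoff={cutoff:.2f} -> {label}')
```

For P1 this gives a Non-Responder label at 100 nM and a Responder label at 500 nM under both cutoffs, which agrees with the labels stored in the sheet.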






In [49]:
import pandas as pd 

# loading in the training data file

main_file = pd.ExcelFile('data/drug-response/PDE_Ribociclib_ExtInHouse/Ribociclib_Response_training_data_with_all.xlsx')
print(main_file.sheet_names)

['datamatrix']


In [50]:
pde_drug_response_full = pd.ExcelFile('data/drug-response/PDE_Ribociclib_ExtInHouse/Ribociclib_Response_Mimi.xlsx')
print(pde_drug_response_full.sheet_names)

# load in 'Response groups' 

pde_response_all = pd.read_excel(pde_drug_response_full, 'Response groups', header=1)
print(pde_response_all.head(2))

# drop row if 'Sample ID' is NaN

pde_response_all = pde_response_all.dropna(subset=['Sample ID'])
# print(pde_response_all.head(2))



['Response groups', '100mM_NonResp_Resp_25', '100mM_NonResp_Resp_50', '500mM_NonResp_Resp_50', 'datamatrix']
  Patient ID Sample ID  Vehicle  Rib 100nm  Rib 500nm %Decrease 100  \
0     33011L        P1    72.23      57.53      23.28      0.203517   
1     33042L        P2    17.32      13.12       1.04      0.242494   

  25% decrease cutoff 50% decrease cutoff %Decrease 500 25% decrease cutoff.1  \
0       Non-Responder       Non-Responder      0.677696             Responder   
1       Non-Responder       Non-Responder      0.939954             Responder   

  50% decrease cutoff.1   PR  Unnamed: 12  
0             Responder  NaN          NaN  
1             Responder  NaN          NaN  




In [51]:
# first, load the datamatrix sheet with header=1, so the second row (the NR/RP response labels) becomes the header

ribociclib_response = pd.read_excel(main_file, 'datamatrix', header=1)
print(ribociclib_response.head(2))

response = ribociclib_response.columns
# print(response)
# keep the first two letters of each column name (NR or RP), which drops suffixes such as '.1' added by pandas
pde = ribociclib_response.iloc[0]
pde = pde.tolist()[2:]
# print(pde.tolist()[2:])
response = [c[:2] for c in response][2:]
# print(response)

pde_response = pd.DataFrame({'pde': pde, 'response': response})
print(pde_response.head(2))

# ribociclib_response.columns = ribociclib_response.iloc[0]
# print(ribociclib_response.head(2))
# print(ribociclib_response.shape)

             Unnamed: 0 Unnamed: 1         NR       NR.1       NR.2  \
0  PG.ProteinAccessions   GeneName         P1         P2         P5   
1                A0AV96      RBM47  13.205421  14.677432  16.944985   

        NR.3       NR.4       NR.5       NR.6       NR.7  ...       RP.5  \
0         P6         P8        P10        P16        P17  ...        P12   
1  16.384994  14.139414  15.210211  15.953077  14.870737  ...  15.116242   

        RP.6       RP.7       RP.8       RP.9     RP.10      RP.11     RP.12  \
0        P13        P14        P15        P21       P23        P24       P26   
1  17.175345  16.620938  15.568577  15.789191  16.16687  15.494391  15.83786   

       RP.13      RP.14  
0        P28        P29  
1  15.110893  14.802329  

[2 rows x 32 columns]
  pde response
0  P1       NR
1  P2       NR




In [52]:
# then, load the datamatrix sheet again with header=2, so the third row (PG.ProteinAccessions, GeneName, sample IDs) becomes the header

ribociclib_expression = pd.read_excel(main_file, 'datamatrix', header=2)
# print(ribociclib_expression.head(2))

ribociclib_protein_id_to_name = ribociclib_expression[['PG.ProteinAccessions', 'GeneName']]
print(ribociclib_protein_id_to_name.head(2))

ribociclib_expression.drop(['PG.ProteinAccessions'], axis=1, inplace=True)
# print(ribociclib_expression.head(2))

ribociclib_expression = ribociclib_expression.T
ribociclib_expression.columns = ribociclib_expression.iloc[0]
ribociclib_expression.drop(['GeneName'], axis=0, inplace=True)
# ribociclib_expression.rename(columns={'GeneName': 'PDE_ID'}, inplace=True)
print(ribociclib_expression.head(2))

# print(ribociclib_expression.columns)

  PG.ProteinAccessions GeneName
0               A0AV96    RBM47
1               A0AVT1     UBA6
GeneName      RBM47       UBA6      ESYT2      SHTN1  ARHGAP10      ILVBL  \
P1        13.205421  13.312968  15.075366   13.94654  14.17361  15.575587   
P2        14.677432  16.032694  16.367723  13.795947   13.9498  16.216862   

GeneName   SH3PXD2B       NBAS      TARS3       VWA8  ...     NDUFB9  \
P1        17.923365  13.991249  15.093301  14.356347  ...  16.408133   
P2        17.704219  14.215525  14.185029  13.544857  ...  16.329183   

GeneName       SQOR      AP1M2      NUMBL     SLC4A4       SCIN      DDX49  \
P1        16.968146  14.367639  14.754497  14.172981  13.620273  15.741047   
P2        17.964655  14.088264   13.63167  15.344575  13.366413   17.89591   

GeneName      WASF2      ENPP4    SEC23IP  
P1        15.439548  13.925692  15.574333  
P2         15.40177  13.765186  14.488299  

[2 rows x 4675 columns]




In [53]:
# pickle the ribociclib_response and ribociclib_expression, and ribociclib_protein_id_to_name, pde_response and pde_response_all

import pickle

with open('data/drug-response/PDE_Ribociclib_ExtInHouse/ribociclib_pde_cleaned.pkl', 'wb') as f:
    pickle.dump(ribociclib_response, f)
    pickle.dump(ribociclib_expression, f)
    pickle.dump(ribociclib_protein_id_to_name, f)
    pickle.dump(pde_response, f)
    pickle.dump(pde_response_all, f)

In [59]:
# load the ribociclib_response and ribociclib_expression, and ribociclib_protein_id_to_name, pde_response and pde_response_all

import pickle

with open('data/drug-response/PDE_Ribociclib_ExtInHouse/ribociclib_pde_cleaned.pkl', 'rb') as f:
    ribociclib_response = pickle.load(f)
    ribociclib_expression = pickle.load(f)
    ribociclib_protein_id_to_name = pickle.load(f)
    pde_response = pickle.load(f)
    pde_response_all = pickle.load(f)

In [61]:
pde_response_all.head()

Unnamed: 0,Patient ID,Sample ID,Vehicle,Rib 100nm,Rib 500nm,%Decrease 100,25% decrease cutoff,50% decrease cutoff,%Decrease 500,25% decrease cutoff.1,50% decrease cutoff.1,PR,Unnamed: 12
0,33011L,P1,72.23,57.53,23.28,0.203517,Non-Responder,Non-Responder,0.677696,Responder,Responder,,
1,33042L,P2,17.32,13.12,1.04,0.242494,Non-Responder,Non-Responder,0.939954,Responder,Responder,,
2,33198L,P3,27.78,1.7,1.09,0.938805,Responder,Responder,0.960763,Responder,Responder,,
3,33208R,P4,22.51,8.38,4.57,0.627721,Responder,Responder,0.796979,Responder,Responder,,
4,33209L,P5,5.16,4.2,0.71,0.186047,Non-Responder,Non-Responder,0.862403,Responder,Responder,,


In [62]:
ribociclib_expression.head()

GeneName,RBM47,UBA6,ESYT2,SHTN1,ARHGAP10,ILVBL,SH3PXD2B,NBAS,TARS3,VWA8,...,NDUFB9,SQOR,AP1M2,NUMBL,SLC4A4,SCIN,DDX49,WASF2,ENPP4,SEC23IP
P1,13.205421,13.312968,15.075366,13.94654,14.17361,15.575587,17.923365,13.991249,15.093301,14.356347,...,16.408133,16.968146,14.367639,14.754497,14.172981,13.620273,15.741047,15.439548,13.925692,15.574333
P2,14.677432,16.032694,16.367723,13.795947,13.9498,16.216862,17.704219,14.215525,14.185029,13.544857,...,16.329183,17.964655,14.088264,13.63167,15.344575,13.366413,17.89591,15.40177,13.765186,14.488299
P5,16.944985,15.671765,16.218363,13.257103,12.994196,15.978814,18.000886,14.017516,15.262168,16.353614,...,17.343333,17.551941,16.112098,14.324413,13.245405,13.794597,18.389832,15.712554,15.377941,16.264339
P6,16.384994,13.374703,14.232031,14.121748,14.396288,15.550587,18.098222,13.198621,13.358601,14.552683,...,16.310153,17.334495,14.415664,15.659967,14.340428,13.744643,14.592098,16.482169,14.003858,13.751223
P8,14.139414,15.390785,16.093715,13.583502,13.901949,15.797204,14.786828,14.865055,14.752258,15.173983,...,16.516411,17.022279,13.623395,13.546393,12.912828,13.523286,16.835643,16.440494,13.291755,15.172701


In [64]:
# build the '%Decrease 100' response table indexed by 'Sample ID', ready to join with ribociclib_expression (also indexed by sample ID)

pde_decrease_100 = pde_response_all[['%Decrease 100', 'Sample ID']]
pde_decrease_100 = pde_decrease_100.set_index('Sample ID')


In [65]:
pde_decrease_100.head()

Unnamed: 0_level_0,%Decrease 100
Sample ID,Unnamed: 1_level_1
P1,0.203517
P2,0.242494
P3,0.938805
P4,0.627721
P5,0.186047


In [70]:
# print(pde_response_all.index)
# print(ribociclib_expression.index)

ribociclib_pde_decrease100_expression = pde_decrease_100.join(ribociclib_expression, how='inner')

# print(ribociclib_pde_whole_dataset.head(2))

# ribociclib_pde_whole_dataset.to_pickle('data/preprocessed/ribociclib_pde_whole_dataset.pkl')

In [71]:
ribociclib_pde_decrease100_expression.head()

Unnamed: 0,%Decrease 100,RBM47,UBA6,ESYT2,SHTN1,ARHGAP10,ILVBL,SH3PXD2B,NBAS,TARS3,...,NDUFB9,SQOR,AP1M2,NUMBL,SLC4A4,SCIN,DDX49,WASF2,ENPP4,SEC23IP
P1,0.203517,13.205421,13.312968,15.075366,13.94654,14.17361,15.575587,17.923365,13.991249,15.093301,...,16.408133,16.968146,14.367639,14.754497,14.172981,13.620273,15.741047,15.439548,13.925692,15.574333
P2,0.242494,14.677432,16.032694,16.367723,13.795947,13.9498,16.216862,17.704219,14.215525,14.185029,...,16.329183,17.964655,14.088264,13.63167,15.344575,13.366413,17.89591,15.40177,13.765186,14.488299
P3,0.938805,15.684734,15.088263,16.384994,12.988215,12.909802,15.281106,13.580255,14.046009,14.677432,...,16.856803,18.154386,14.205416,14.091711,14.560993,14.558801,18.091205,15.757352,13.34643,15.986624
P4,0.627721,15.269914,14.434301,16.410483,13.444889,13.323079,15.901127,13.642135,14.736527,14.184101,...,17.283643,16.823041,13.76008,13.063194,13.445943,13.976719,16.432988,13.880238,13.576823,15.048702
P5,0.186047,16.944985,15.671765,16.218363,13.257103,12.994196,15.978814,18.000886,14.017516,15.262168,...,17.343333,17.551941,16.112098,14.324413,13.245405,13.794597,18.389832,15.712554,15.377941,16.264339


In [73]:
pde_decrease_100_cutoff25 = pde_response_all[['25% decrease cutoff', 'Sample ID']]
pde_decrease_100_cutoff25 = pde_decrease_100_cutoff25.set_index('Sample ID')
ribociclib_pde_decrease100_cutoff25_expression = pde_decrease_100_cutoff25.join(ribociclib_expression, how='inner')


In [74]:
ribociclib_pde_decrease100_cutoff25_expression.head()

Unnamed: 0,25% decrease cutoff,RBM47,UBA6,ESYT2,SHTN1,ARHGAP10,ILVBL,SH3PXD2B,NBAS,TARS3,...,NDUFB9,SQOR,AP1M2,NUMBL,SLC4A4,SCIN,DDX49,WASF2,ENPP4,SEC23IP
P1,Non-Responder,13.205421,13.312968,15.075366,13.94654,14.17361,15.575587,17.923365,13.991249,15.093301,...,16.408133,16.968146,14.367639,14.754497,14.172981,13.620273,15.741047,15.439548,13.925692,15.574333
P2,Non-Responder,14.677432,16.032694,16.367723,13.795947,13.9498,16.216862,17.704219,14.215525,14.185029,...,16.329183,17.964655,14.088264,13.63167,15.344575,13.366413,17.89591,15.40177,13.765186,14.488299
P3,Responder,15.684734,15.088263,16.384994,12.988215,12.909802,15.281106,13.580255,14.046009,14.677432,...,16.856803,18.154386,14.205416,14.091711,14.560993,14.558801,18.091205,15.757352,13.34643,15.986624
P4,Responder,15.269914,14.434301,16.410483,13.444889,13.323079,15.901127,13.642135,14.736527,14.184101,...,17.283643,16.823041,13.76008,13.063194,13.445943,13.976719,16.432988,13.880238,13.576823,15.048702
P5,Non-Responder,16.944985,15.671765,16.218363,13.257103,12.994196,15.978814,18.000886,14.017516,15.262168,...,17.343333,17.551941,16.112098,14.324413,13.245405,13.794597,18.389832,15.712554,15.377941,16.264339


In [76]:
with open('data/preprocessed/ribociclib_pde_decrease100.pkl', 'wb') as f:
    pickle.dump(ribociclib_pde_decrease100_cutoff25_expression, f)
    pickle.dump(ribociclib_pde_decrease100_expression, f)

This dataset can now be used for drug response prediction.
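
As a minimal sketch of how such a joined table might be split into a feature matrix and binary labels, assuming the label column is `25% decrease cutoff` and all other columns are protein expression values. The toy frame copies three samples and two proteins from the output above, and encoding `Responder` as 1 is an arbitrary choice.

```python
import pandas as pd

# Toy frame with the same layout as ribociclib_pde_decrease100_cutoff25_expression
toy = pd.DataFrame(
    {
        '25% decrease cutoff': ['Non-Responder', 'Non-Responder', 'Responder'],
        'RBM47': [13.205421, 14.677432, 15.684734],
        'UBA6': [13.312968, 16.032694, 15.088263],
    },
    index=['P1', 'P2', 'P3'],
)

label_column = '25% decrease cutoff'

# features: every column except the label
X = toy.drop(columns=[label_column]).to_numpy(dtype=float)

# labels: 1 for Responder, 0 for Non-Responder
y = (toy[label_column] == 'Responder').astype(int).to_numpy()

print(X.shape, y.tolist())
```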

# Data Integration

## Integration of GDSC2 and CCLE dataset 

### Steps 
1. GDSC2 contains drug response data. Each drug can be mapped to a chemical structure, which can be encoded as a SMILES string or a molecular fingerprint. These are the drug features.
2. CCLE contains gene expression data. The expression value of each gene in a cell line is one gene feature.
3. The drug features and gene features can be combined into a single drug-gene interaction feature vector.
4. Each drug-cell line pair is one row in the new dataset (drug-gene interaction features + drug response).
5. The first set of columns holds the drug features and the second set holds the gene features. The output vector is the drug response.
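
The steps above can be sketched on toy data. Everything in this example is hypothetical: the drug and cell line names, the three-bit "fingerprints" and the response values are placeholders, and the column names `drug` and `cell_line` are not taken from GDSC2 or CCLE.

```python
import pandas as pd

# Toy inputs (hypothetical values, not taken from GDSC2 or CCLE)
drug_features = pd.DataFrame(
    [[1, 0, 1], [0, 1, 1]],
    index=['drugA', 'drugB'],
    columns=['fp_0', 'fp_1', 'fp_2'],
)
gene_features = pd.DataFrame(
    [[4.3, 0.0], [4.6, 0.6]],
    index=['cell1', 'cell2'],
    columns=['TSPAN6', 'TNMD'],
)
response = pd.DataFrame({
    'drug': ['drugA', 'drugA', 'drugB'],
    'cell_line': ['cell1', 'cell2', 'cell1'],
    'LN_IC50': [4.25, 3.58, 1.34],
})

# One row per drug-cell line pair: attach drug features, then gene features
pairs = (
    response
    .merge(drug_features, left_on='drug', right_index=True, how='inner')
    .merge(gene_features, left_on='cell_line', right_index=True, how='inner')
)

# Drug feature columns first, gene feature columns second
feature_columns = list(drug_features.columns) + list(gene_features.columns)
X = pairs[feature_columns].to_numpy(dtype=float)
y = pairs['LN_IC50'].to_numpy(dtype=float)

print(X.shape, y.shape)
```

Selecting `feature_columns` explicitly keeps the column order fixed (drug features, then gene features) regardless of how the merges arrange the intermediate table.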

In [1]:
import pandas as pd
import pickle

# import GDSC2 drug response data using pickle

with open('data/drug-response/GDSC2/cache_gdsc2.pkl', 'rb') as f:
    gdsc2 = pickle.load(f)
    gdsc2_info = pickle.load(f)

# import CCLE gene expression data using pickle

with open('data/gene-expression/CCLE_Public_22Q2/ccle_expression.pkl', 'rb') as f:
    gene_entrez = pickle.load(f)
    ccle = pickle.load(f)

# import CCLE sample info data using pickle

with open('data/gene-expression/CCLE_Public_22Q2/ccle_sample_info.pkl', 'rb') as f:
    ccle_sample_info = pickle.load(f)


### Drug ID/Name to Drug Target identification

In [3]:
gdsc2.head()

Unnamed: 0,DATASET,NLME_RESULT_ID,NLME_CURVE_ID,COSMIC_ID,CELL_LINE_NAME,SANGER_MODEL_ID,TCGA_DESC,DRUG_ID,DRUG_NAME,PUTATIVE_TARGET,PATHWAY_NAME,COMPANY_ID,WEBRELEASE,MIN_CONC,MAX_CONC,LN_IC50,AUC,RMSE,Z_SCORE
0,GDSC2,282,13320532,749709,HCC1954,SIDM00872,BRCA,1003,Camptothecin,TOP1,DNA replication,1046,Y,9.8e-05,0.1,-0.251083,0.975203,0.112373,0.906631
1,GDSC2,282,13320565,749710,HCC1143,SIDM00866,BRCA,1003,Camptothecin,TOP1,DNA replication,1046,Y,9.8e-05,0.1,1.343315,0.978464,0.067577,1.683567
2,GDSC2,282,13320598,749711,HCC1187,SIDM00885,BRCA,1003,Camptothecin,TOP1,DNA replication,1046,Y,9.8e-05,0.1,1.736985,0.9951,0.045205,1.875399
3,GDSC2,282,13320631,749712,HCC1395,SIDM00884,BRCA,1003,Camptothecin,TOP1,DNA replication,1046,Y,9.8e-05,0.1,-2.309078,0.867832,0.107282,-0.096212
4,GDSC2,282,13320668,749713,HCC1599,SIDM00877,BRCA,1003,Camptothecin,TOP1,DNA replication,1046,Y,9.8e-05,0.1,-3.106684,0.777532,0.098286,-0.484878


In [4]:
gdsc2_info.head()

Unnamed: 0,drug_id,drug_name,synonyms,pathway_name,targets,pubchem
0,1559,Luminespib,"AUY922, VER-52296,NVP-AUY922, AUY",Protein stability and degradation,HSP90,10096043
1,1372,Trametinib,"GSK1120212, Mekinist",ERK MAPK signaling,"MEK1, MEK2",11707110
2,1909,Venetoclax,"ABT-199, Veneclexta, GDC-0199",Apoptosis regulation,BCL2,49846579
3,1017,Olaparib,"AZD2281, KU0059436, Lynparza",Genome integrity,"PARP1, PARP2",23725625
4,1021,Axitinib,"AG-13736, Inlyta",RTK signaling,"PDGFR, KIT, VEGFR",6450551


Drug features are further preprocessed by retrieving the SMILES string of each drug using the PubChem ID linked in GDSC2, and converting the SMILES string into a Morgan fingerprint using RDKit.

[Refs needed]

In [34]:
# generate a dataframe between drug name and pubchem id using gdsc2_info

drug_pubchem = gdsc2_info[['drug_name', 'pubchem']]
drug_pubchem = drug_pubchem.drop_duplicates()

# manually modify the pubchem ids that are written as 'none' or 'several'
drug_pubchem.loc[drug_pubchem['pubchem'] == 'none', 'pubchem'] = '-'

# https://pubchem.ncbi.nlm.nih.gov/compound/44259, accessed 09-02-2023
drug_pubchem.loc[drug_pubchem['drug_name'] == 'Staurosporine', 'pubchem'] = 44259

# https://pubchem.ncbi.nlm.nih.gov/compound/457193, accessed 09-02-2023
drug_pubchem.loc[drug_pubchem['drug_name'] == 'Dactinomycin', 'pubchem'] = 457193

# remove the drug name with no pubchem id
drug_pubchem = drug_pubchem[drug_pubchem['pubchem'] != '-']

# in the case of multiple (comma-separated) pubchem ids, only retain the first one
# na=False treats non-string entries (the integer ids set above) as not containing a comma
has_multiple = drug_pubchem['pubchem'].str.contains(",", na=False)
multiples = drug_pubchem[has_multiple]

# modify the pubchem id to only retain the first one in multiples
drug_pubchem.loc[has_multiple, 'pubchem'] = drug_pubchem.loc[has_multiple, 'pubchem'].str.split(",").str[0]

# remove duplicates
drug_pubchem = drug_pubchem.drop_duplicates()

pubchem_list = list(drug_pubchem['pubchem'])

In [35]:
import pubchempy as pcp

# using pubchempy to retrieve the smiles string of each pubchem id

smiles_list = []

for pubchem in pubchem_list:
    try: 
        compound = pcp.Compound.from_cid(pubchem)
        smiles = compound.isomeric_smiles
        smiles_list.append(smiles)
    except Exception as e:
        print(drug_pubchem[drug_pubchem['pubchem'] == pubchem]['drug_name'])
        smiles_list.append('')

# generate a dataframe between drug name and smiles string

drug_smiles = pd.DataFrame({'drug_name': drug_pubchem['drug_name'], 'smiles': smiles_list})
print(drug_smiles.head())

# compare the number of drugs in drug_smiles with the number of rows in gdsc2_info
print(drug_smiles.shape, gdsc2_info.shape)


    drug_name                                             smiles
0  Luminespib  CCNC(=O)C1=C(/C(=C/2\C=C(C(=CC2=O)O)C(C)C)/ON1...
1  Trametinib  CC1=C2C(=C(N(C1=O)C)NC3=C(C=C(C=C3)I)F)C(=O)N(...
2  Venetoclax  CC1(CCC(=C(C1)C2=CC=C(C=C2)Cl)CN3CCN(CC3)C4=CC...
3    Olaparib  C1CC1C(=O)N2CCN(CC2)C(=O)C3=C(C=CC(=C3)CC4=NNC...
4    Axitinib  CNC(=O)C1=CC=CC=C1SC2=CC3=C(C=C2)C(=NN3)/C=C/C...
(152, 2) (198, 6)


In [36]:
# pickle the drug_smiles and drug_pubchem, both have been modified and cleaned

import pickle

with open('data/drug-response/GDSC2/gdsc2_drug_smiles.pkl', 'wb') as f:
    pickle.dump(drug_smiles, f)

with open('data/drug-response/GDSC2/gdsc2_drug_pubchem.pkl', 'wb') as f:
    pickle.dump(drug_pubchem, f)

In [46]:
# Using RDKit to generate molecular fingerprints from the SMILES strings of the GDSC2 drugs

from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

# generate a list of rdkit mol objects from the smiles string
mol_list = [Chem.MolFromSmiles(smiles) for smiles in drug_smiles['smiles']]
# print(mol_list[0])

# generate a list of fingerprints from the rdkit mol objects
fp_list = [AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024) for mol in mol_list]


fpnp = np.array(fp_list[0])

# TODO: require further preprocessing documentation for the code above 

Retrieve the gene features from CCLE, and convert the gene features into a gene expression matrix.
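
A minimal sketch of this conversion, assuming the expression table has one `CELLLINE` identifier column followed by one column per gene, as `ccle` does. The toy frame copies two cell lines and two genes from the `ccle` output, and the variable names are illustrative.

```python
import pandas as pd

# Toy frame with the same layout as `ccle`
ccle_toy = pd.DataFrame({
    'CELLLINE': ['ACH-001113', 'ACH-001289'],
    'TSPAN6': [4.331992, 4.566815],
    'TNMD': [0.0, 0.584963],
})

# Move the identifier into the index so only numeric columns remain
expression = ccle_toy.set_index('CELLLINE')

expression_matrix = expression.to_numpy(dtype=float)  # shape (n_cell_lines, n_genes)
cell_line_ids = expression.index.tolist()             # row labels
gene_names = expression.columns.tolist()              # column labels

print(expression_matrix.shape)
```

Keeping `cell_line_ids` and `gene_names` next to the matrix preserves the index-to-identifier mapping once the data leaves pandas.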

In [3]:
print(ccle.head())

     CELLLINE    TSPAN6      TNMD      DPM1     SCYL3  C1orf112       FGR  \
0  ACH-001113  4.331992  0.000000  7.364397  2.792855  4.470537  0.028569   
1  ACH-001289  4.566815  0.584963  7.106537  2.543496  3.504620  0.000000   
2  ACH-001339  3.150560  0.000000  7.379032  2.333424  4.227279  0.056584   
3  ACH-001538  5.085340  0.000000  7.154109  2.545968  3.084064  0.000000   
4  ACH-000242  6.729145  0.000000  6.537607  2.456806  3.867896  0.799087   

        CFH     FUCA2      GCLC  ...      H3C2      H3C3  AC098582.1  \
0  1.226509  3.042644  6.499686  ...  2.689299  0.189034    0.201634   
1  0.189034  3.813525  4.221104  ...  1.286881  1.049631    0.321928   
2  1.310340  6.687061  3.682573  ...  0.594549  1.097611    0.831877   
3  5.868143  6.165309  4.489928  ...  0.214125  0.632268    0.298658   
4  7.208381  5.569856  7.127014  ...  1.117695  2.358959    0.084064   

   DUS4L-BCAP29  C8orf44-SGK3  ELOA3B    NPBWR1  ELOA3D  ELOA3      CDR1  
0      2.130931      0.555816

### Selecting Specific Drugs 

Match a specific drug from GDSC2 and gather the available genomic data from CCLE.
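
The matching can also be expressed as two merges instead of explicit loops. The sketch below runs on toy stand-ins for `gdsc2`, `ccle_sample_info` and `ccle`; the identifiers and values are hypothetical, and the function name `join_response_with_expression` is not part of this notebook. Unlike the playground code below, it drops the `CELLLINE` column during the join.

```python
import pandas as pd

# Toy stand-ins for gdsc2, ccle_sample_info and ccle (hypothetical values)
gdsc2_toy = pd.DataFrame({
    'SANGER_MODEL_ID': ['SIDM1', 'SIDM2', 'SIDM3', 'SIDM1'],
    'DRUG_NAME': ['Ribociclib', 'Ribociclib', 'Ribociclib', 'Camptothecin'],
    'LN_IC50': [4.25, 3.58, 4.02, -0.25],
})
sample_info_toy = pd.DataFrame({
    'Sanger_Model_ID': ['SIDM1', 'SIDM2', None],
    'DepMap_ID': ['ACH-1', 'ACH-2', 'ACH-3'],
})
ccle_toy = pd.DataFrame({
    'CELLLINE': ['ACH-1', 'ACH-2', 'ACH-3'],
    'TSPAN6': [4.33, 4.57, 3.15],
    'TNMD': [0.0, 0.58, 0.0],
})


def join_response_with_expression(drug_name, gdsc2, sample_info, ccle):
    """Return LN_IC50 plus expression columns, indexed by SANGER_MODEL_ID."""
    response = gdsc2.loc[gdsc2['DRUG_NAME'] == drug_name, ['SANGER_MODEL_ID', 'LN_IC50']]
    # rows without a Sanger model id cannot be matched, so drop them
    id_map = sample_info[['Sanger_Model_ID', 'DepMap_ID']].dropna()
    joined = (
        response
        # Sanger model id -> DepMap id
        .merge(id_map, left_on='SANGER_MODEL_ID', right_on='Sanger_Model_ID', how='inner')
        # DepMap id -> expression row
        .merge(ccle, left_on='DepMap_ID', right_on='CELLLINE', how='inner')
        .drop(columns=['Sanger_Model_ID', 'DepMap_ID', 'CELLLINE'])
        .set_index('SANGER_MODEL_ID')
    )
    return joined


joined_toy = join_response_with_expression('Ribociclib', gdsc2_toy, sample_info_toy, ccle_toy)
print(joined_toy)
```

Inner merges keep only the cell lines present in all three tables, so `SIDM3`, which has no entry in the identifier map, is dropped.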

#### Ribociclib as an example - Playground

In [19]:
# select a specific drug

drug_name = 'Ribociclib'
drug_dataset = gdsc2.loc[gdsc2['DRUG_NAME'] == drug_name]

drug_response_data = drug_dataset[['SANGER_MODEL_ID', 'LN_IC50']]
# celllines = drug_dataset['SANGER_MODEL_ID']

print(drug_response_data.head())
print(drug_response_data.shape)
id_ccle_info = ccle_sample_info[['Sanger_Model_ID', 'DepMap_ID']].dropna()

# find the intersection between the cell lines in drug response data and the cell lines in CCLE gene expression data using the Sanger_Model_ID

celllines = drug_response_data['SANGER_MODEL_ID'].unique()
celllines = [cellline for cellline in celllines if cellline in id_ccle_info['Sanger_Model_ID'].unique()]

print(len(celllines))
print(celllines)
# locate the DepMap_ID of the cell lines in drug response data

depmap_id = []
for cellline in celllines:
    depmap_id.append(id_ccle_info.loc[id_ccle_info['Sanger_Model_ID'] == cellline]['DepMap_ID'].values[0])

print(depmap_id)
print(ccle.head())
# construct the gene expression dataframe by finding row names that are in the DepMap_ID list

# take a copy so that the insert() further down does not act on a slice of ccle
matched_gene_expression_dataset = ccle.loc[ccle['CELLLINE'].isin(depmap_id)].copy()
import numpy as np

print(matched_gene_expression_dataset.shape)
print(matched_gene_expression_dataset.head(2))
# There are only 44 cell lines with matched gene expression data from CCLE to the drug response data from GDSC2. This is insufficient for training a model.
# creating matching training feature and label data, gene expressions are features, drug response ic50 is label


# extract CELLLINE column from matched_gene_expression_dataset

matched_cellline = matched_gene_expression_dataset['CELLLINE'].tolist()
matched_sanger_model_id = []

# find the Sanger_Model_ID of the matched cell lines

for cellline in matched_cellline:
    matched_sanger_model_id.append(id_ccle_info.loc[id_ccle_info['DepMap_ID'] == cellline]['Sanger_Model_ID'].values[0])

# print(len(matched_sanger_model_id), len(matched_cellline)) # sanity check, they should be the same

# join the drug response data and the gene expression data through sanger model id as a medium 

matched_drug_response_data = drug_response_data.loc[drug_response_data['SANGER_MODEL_ID'].isin(matched_sanger_model_id)]

# print(matched_drug_response_data.shape)

matched_drug_response_data = matched_drug_response_data.set_index('SANGER_MODEL_ID')

matched_gene_expression_dataset.insert(0, 'SANGER_MODEL_ID', matched_sanger_model_id)
matched_gene_expression_dataset = matched_gene_expression_dataset.set_index('SANGER_MODEL_ID')
# remove 'CELLLINE' column from matched_gene_expression_dataset
# matched_gene_expression_dataset = matched_gene_expression_dataset.drop(columns=['CELLLINE'])

# print(matched_gene_expression_dataset.shape)

# join the matched_drug_response_data and the matched_gene_expression_dataset

joined_dataset = matched_drug_response_data.join(matched_gene_expression_dataset, how='inner')

# print(joined_dataset.shape)
# print(joined_dataset.head())

# feature and label data creation

# extract the feature data from the joined dataset

feature_data = joined_dataset.drop(columns=['LN_IC50'])
feature_data.drop(columns=['CELLLINE'], inplace=True)

# extract the label data from the joined dataset

label_data = joined_dataset['LN_IC50']

# convert the feature data and label data to numpy array

feature_data_np = feature_data.to_numpy()
label_data_np = label_data.to_numpy()

# print(feature_data_np.shape, label_data_np.shape)

# print(feature_data.head())
# print(label_data.head())

      SANGER_MODEL_ID   LN_IC50
70035       SIDM00872  4.254618
70036       SIDM00866  3.583018
70037       SIDM00885  4.023289
70038       SIDM00884  3.146215
70039       SIDM00877  6.135124
(47, 2)


In [None]:
# numeric index to cell line name mapping

cellline_name = joined_dataset['CELLLINE'].tolist()
index_dict = dict(enumerate(cellline_name))

#### Streamlining and Optimization

In [2]:
import numpy as np
# select a specific drug

drug_name = 'Ribociclib'

def create_joint_dataset_from_ccle_gdsc2(drug_name: str, keep_drug_name: bool = False, separate_feature_label: bool = False): 

    drug_dataset = gdsc2.loc[gdsc2['DRUG_NAME'] == drug_name]

    drug_response_data = drug_dataset[['SANGER_MODEL_ID', 'LN_IC50']]
    id_ccle_info = ccle_sample_info[['Sanger_Model_ID', 'DepMap_ID']].dropna()

    # find the intersection between the cell lines in drug response data and the cell lines in CCLE gene expression data using the Sanger_Model_ID

    celllines = drug_response_data['SANGER_MODEL_ID'].unique()
    celllines = [cellline for cellline in celllines if cellline in id_ccle_info['Sanger_Model_ID'].unique()]

    # locate the DepMap_ID of the cell lines in drug response data

    depmap_id = []
    for cellline in celllines:
        depmap_id.append(id_ccle_info.loc[id_ccle_info['Sanger_Model_ID'] == cellline]['DepMap_ID'].values[0])

    # construct the gene expression dataframe by finding row names that are in the DepMap_ID list

    # .copy() so the later insert() modifies an independent frame rather than a slice of ccle
    matched_gene_expression_dataset = ccle.loc[ccle['CELLLINE'].isin(depmap_id)].copy()

    # build matching feature and label data: gene expression values are the features, LN_IC50 is the label
    # extract the CELLLINE column (DepMap IDs) from matched_gene_expression_dataset

    matched_cellline = matched_gene_expression_dataset['CELLLINE'].tolist()
    matched_sanger_model_id = []

    # find the Sanger_Model_ID of the matched cell lines

    for cellline in matched_cellline:
        matched_sanger_model_id.append(id_ccle_info.loc[id_ccle_info['DepMap_ID'] == cellline]['Sanger_Model_ID'].values[0])

    # join the drug response data and the gene expression data, using the Sanger model ID as the shared key

    matched_drug_response_data = drug_response_data.loc[drug_response_data['SANGER_MODEL_ID'].isin(matched_sanger_model_id)]

    # print(matched_drug_response_data.shape)

    matched_drug_response_data = matched_drug_response_data.set_index('SANGER_MODEL_ID')

    matched_gene_expression_dataset.insert(0, 'SANGER_MODEL_ID', matched_sanger_model_id)
    matched_gene_expression_dataset = matched_gene_expression_dataset.set_index('SANGER_MODEL_ID')

    # join the matched_drug_response_data and the matched_gene_expression_dataset

    joined_dataset = matched_drug_response_data.join(matched_gene_expression_dataset, how='inner')

    if keep_drug_name:
        joined_dataset.insert(1, 'DRUG_NAME', drug_name)
    
    if separate_feature_label:
        # feature and label data creation

        # extract the feature data from the joined dataset

        feature_data = joined_dataset.drop(columns=['LN_IC50', 'CELLLINE'])

        # extract the label data from the joined dataset

        label_data = joined_dataset['LN_IC50']

        return feature_data, label_data
    
    return joined_dataset

joined_dataset = create_joint_dataset_from_ccle_gdsc2('Ribociclib', keep_drug_name=True, separate_feature_label=False)

print(joined_dataset.head())

# convert the feature data and label data to numpy array

# feature_data_np = feature_data.to_numpy()
# label_data_np = label_data.to_numpy()

# print(feature_data_np.shape, label_data_np.shape)

# print(feature_data.head())
# print(label_data.head())

                  LN_IC50   DRUG_NAME    CELLLINE    TSPAN6      TNMD  \
SANGER_MODEL_ID                                                         
SIDM00872        4.254618  Ribociclib  ACH-000859  5.296090  0.000000   
SIDM00866        3.583018  Ribociclib  ACH-000374  5.214125  0.000000   
SIDM00885        4.023289  Ribociclib  ACH-000111  5.241840  0.201634   
SIDM00884        3.146215  Ribociclib  ACH-000699  3.481557  0.000000   
SIDM00877        6.135124  Ribociclib  ACH-000196  4.349082  0.000000   

                     DPM1     SCYL3  C1orf112       FGR       CFH  ...  \
SANGER_MODEL_ID                                                    ...   
SIDM00872        6.794416  3.452859  5.260778  0.042644  4.339137  ...   
SIDM00866        6.328047  3.168321  4.654206  0.042644  0.432959  ...   
SIDM00885        5.615299  3.090853  3.732269  0.084064  0.111031  ...   
SIDM00884        7.070389  2.341986  3.679199  0.176323  3.420887  ...   
SIDM00877        6.479942  2.790772  4.03121
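The two lookup loops in `create_joint_dataset_from_ccle_gdsc2` can also be expressed as two inner merges. The following is a minimal sketch on made-up toy frames, not the notebook's actual data; `join_by_merge`, `gdsc2_toy`, `sample_info_toy` and `ccle_toy` are hypothetical names introduced only for illustration. One difference to keep in mind: the loop version takes the first match (`.values[0]`) when an ID appears more than once, whereas a merge would produce one row per match.

```python
import pandas as pd

# Toy stand-ins for gdsc2, ccle_sample_info and ccle (all values are made up).
gdsc2_toy = pd.DataFrame({
    "DRUG_NAME": ["Ribociclib", "Ribociclib", "Palbociclib"],
    "SANGER_MODEL_ID": ["SIDM1", "SIDM2", "SIDM1"],
    "LN_IC50": [4.2, 3.5, 2.9],
})
sample_info_toy = pd.DataFrame({
    "Sanger_Model_ID": ["SIDM1", "SIDM2", None],
    "DepMap_ID": ["ACH-1", "ACH-2", "ACH-3"],
})
ccle_toy = pd.DataFrame({
    "CELLLINE": ["ACH-1", "ACH-2", "ACH-3"],
    "GENE_A": [5.3, 5.2, 1.0],
    "GENE_B": [0.0, 0.2, 0.4],
})


def join_by_merge(drug_name, gdsc, sample_info, expression):
    """Hypothetical merge-based variant of the Sanger -> DepMap -> expression join."""
    response = gdsc.loc[gdsc["DRUG_NAME"] == drug_name, ["SANGER_MODEL_ID", "LN_IC50"]]
    # Rows without a Sanger ID cannot be matched, so drop them as the original does.
    id_map = sample_info[["Sanger_Model_ID", "DepMap_ID"]].dropna()
    return (
        response
        .merge(id_map, left_on="SANGER_MODEL_ID", right_on="Sanger_Model_ID")  # inner join
        .merge(expression, left_on="DepMap_ID", right_on="CELLLINE")           # inner join
        .drop(columns=["Sanger_Model_ID", "DepMap_ID"])
        .set_index("SANGER_MODEL_ID")
    )


toy_joined = join_by_merge("Ribociclib", gdsc2_toy, sample_info_toy, ccle_toy)
print(toy_joined)
```

The column layout (`LN_IC50`, `CELLLINE`, then the gene columns) matches the joined dataset printed above, so downstream cells would not need to change under this assumption.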

In [37]:
# get multiple drugs and join them together to form a single dataset

drug_names = ['Ribociclib', 'Palbociclib']

dfs = [create_joint_dataset_from_ccle_gdsc2(drug_name, keep_drug_name=True, separate_feature_label=False) for drug_name in drug_names]

all_dfs = pd.concat(dfs)

# print(all_dfs.head(50))

                  LN_IC50    DRUG_NAME    CELLLINE    TSPAN6      TNMD  \
SANGER_MODEL_ID                                                          
SIDM00872        4.254618   Ribociclib  ACH-000859  5.296090  0.000000   
SIDM00866        3.583018   Ribociclib  ACH-000374  5.214125  0.000000   
SIDM00885        4.023289   Ribociclib  ACH-000111  5.241840  0.201634   
SIDM00884        3.146215   Ribociclib  ACH-000699  3.481557  0.000000   
SIDM00877        6.135124   Ribociclib  ACH-000196  4.349082  0.000000   
SIDM00774        3.632985   Ribociclib  ACH-000691  3.523562  0.000000   
SIDM00772        3.074197   Ribociclib  ACH-000755  3.587365  0.000000   
SIDM00675        4.642447   Ribociclib  ACH-000276  3.934517  0.000000   
SIDM00097        3.518966   Ribociclib  ACH-000147  3.310340  0.000000   
SIDM00148        3.027614   Ribociclib  ACH-000019  2.403268  0.000000   
SIDM00122        3.917605   Ribociclib  ACH-000288  3.390943  0.000000   
SIDM00135        4.303794   Ribociclib
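After `pd.concat`, a cell line screened against both drugs appears twice under the same `SANGER_MODEL_ID`, so the index is no longer unique. A small sketch of one way to handle this, using made-up values and hypothetical frame names (`ribo`, `palbo`), is to move `DRUG_NAME` into the index:

```python
import pandas as pd

# Toy per-drug frames; values are made up for illustration.
ribo = pd.DataFrame(
    {"LN_IC50": [4.2, 3.5], "DRUG_NAME": "Ribociclib"},
    index=pd.Index(["SIDM1", "SIDM2"], name="SANGER_MODEL_ID"),
)
palbo = pd.DataFrame(
    {"LN_IC50": [2.9, 3.1], "DRUG_NAME": "Palbociclib"},
    index=pd.Index(["SIDM1", "SIDM3"], name="SANGER_MODEL_ID"),
)

combined = pd.concat([ribo, palbo])

# SIDM1 was screened against both drugs, so its label repeats in the index.
has_dupes = combined.index.duplicated().any()

# Promote DRUG_NAME into the index so each (cell line, drug) pair is a unique key.
keyed = combined.set_index("DRUG_NAME", append=True).sort_index()
value = keyed.loc[("SIDM1", "Palbociclib"), "LN_IC50"]
print(has_dupes, keyed.index.is_unique, value)
```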

In [3]:
# get ribociclib data 

ribociclib_data = create_joint_dataset_from_ccle_gdsc2('Ribociclib', keep_drug_name=False, separate_feature_label=False)

In [6]:
ribociclib_data.to_pickle('data/preprocessed/ribociclib_data.pkl')

In [4]:
palbociclib_data = create_joint_dataset_from_ccle_gdsc2('Palbociclib', keep_drug_name=False, separate_feature_label=False)
# palbociclib_data.to_pickle('data/preprocessed/palbociclib_data.pkl')

In [5]:
ribociclib_data.to_csv('data/preprocessed/ribociclib_data.csv')
palbociclib_data.to_csv('data/preprocessed/palbociclib_data.csv')
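When reading these files back, the pickle restores the frame as saved, while the CSV needs the index column named explicitly. A minimal round-trip sketch on a toy frame written to a temporary directory (the file names here are hypothetical, not the notebook's paths):

```python
import os
import tempfile

import pandas as pd

# Toy frame shaped like the saved datasets: SANGER_MODEL_ID index, numeric columns.
toy = pd.DataFrame(
    {"LN_IC50": [4.254618, 3.583018], "TSPAN6": [5.296090, 5.214125]},
    index=pd.Index(["SIDM00872", "SIDM00866"], name="SANGER_MODEL_ID"),
)

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    csv_path = os.path.join(tmp, "toy.csv")
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path)

    from_pickle = pd.read_pickle(pkl_path)
    # Without index_col the IDs would come back as an ordinary column.
    from_csv = pd.read_csv(csv_path, index_col="SANGER_MODEL_ID")

# Both calls raise AssertionError if the round trip changed the data.
pd.testing.assert_frame_equal(toy, from_pickle)
pd.testing.assert_frame_equal(toy, from_csv)
```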

In [38]:
# for fun, let's try to create a dataset for all drugs

# all_drug_names = gdsc2['DRUG_NAME'].unique().tolist()

# all_dfs = [create_joint_dataset_from_ccle_gdsc2(drug_name, keep_drug_name=True, separate_feature_label=False) for drug_name in all_drug_names]

# all_dfs = pd.concat(all_dfs)

# print(all_dfs.shape)
# print(all_dfs.head(1000))
# # pickle the dataset for later use

# import pickle

# with open('data/preprocessed/all_drugs_ccle_gdsc2.pkl', 'wb') as f:
#     pickle.dump(all_dfs, f)
    

(102466, 19224)
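The shape reported above gives a rough idea of how much memory the all-drug frame needs. The estimate below is a back-of-the-envelope sketch that treats every cell as a fixed-width float; the string columns (`DRUG_NAME`, `CELLLINE`) are stored differently, so the real footprint will deviate somewhat.

```python
import numpy as np

n_rows, n_cols = 102466, 19224  # shape reported for the all-drug dataset
n_cells = n_rows * n_cols

# Bytes needed if every cell were stored at the given float precision.
bytes_float64 = n_cells * np.dtype("float64").itemsize
bytes_float32 = n_cells * np.dtype("float32").itemsize

print(f"{n_cells:,} cells")
print(f"float64: {bytes_float64 / 2**30:.1f} GiB")
print(f"float32: {bytes_float32 / 2**30:.1f} GiB")
```

Under this assumption, downcasting the expression columns to `float32` would halve the footprint, which may be worth considering before pickling the full dataset.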


## Integration of GDSC2 and Goncalves dataset

In [25]:
# load the Goncalves proteome matrices and the cell line info (three objects stored in one pickle file, read back in the order they were dumped)

import pickle
import pandas as pd

with open('data/proteomic-expression/goncalves-2022-cell/goncalve_proteome_fillna.pkl', 'rb') as f:
    full_protein_matrix = pickle.load(f)
    sin_peptile_exclusion_matrix = pickle.load(f)
    goncalve_cell_line_info = pickle.load(f)

# import GDSC2 drug response data using pickle

with open('data/drug-response/GDSC2/cache_gdsc2.pkl', 'rb') as f:
    gdsc2 = pickle.load(f)
    gdsc2_info = pickle.load(f)
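Both cache files hold several objects in a single pickle file, and each `pickle.load` call returns the next object in the order it was dumped. A self-contained sketch of that behaviour with toy objects and a temporary file (the file name `cache.pkl` is hypothetical):

```python
import os
import pickle
import tempfile

first = {"name": "matrix_a"}
second = [1, 2, 3]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache.pkl")

    # Dump two objects back to back into the same file.
    with open(path, "wb") as f:
        pickle.dump(first, f)
        pickle.dump(second, f)

    # Each load() consumes exactly one object, in dump order.
    with open(path, "rb") as f:
        loaded_first = pickle.load(f)
        loaded_second = pickle.load(f)
        try:
            pickle.load(f)  # nothing left in the file
            hit_end = False
        except EOFError:
            hit_end = True

print(loaded_first, loaded_second, hit_end)
```

Because the order is the only thing tying a variable name to its object, the load order has to mirror the dump order exactly.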

In [26]:
full_protein_matrix.head()

Unnamed: 0,Project_Identifier,Q9Y651;SOX21_HUMAN,P37108;SRP14_HUMAN,Q96JP5;ZFP91_HUMAN,Q9Y4H2;IRS2_HUMAN,P36578;RL4_HUMAN,Q6SPF0;SAMD1_HUMAN,O76031;CLPX_HUMAN,Q8WUQ7;CATIN_HUMAN,A6NIH7;U119B_HUMAN,...,Q8WXF0;SRS12_HUMAN,P02763;A1AG1_HUMAN,Q9ULK4;MED23_HUMAN,P22352;GPX3_HUMAN,P0C221;CC175_HUMAN,P02753;RET4_HUMAN,Q9NWZ8;GEMI8_HUMAN,O43427;FIBP_HUMAN,O75319;DUS11_HUMAN,Q8IZU2;WDR17_HUMAN
0,SIDM00018;K052,0.0,7.10955,3.38802,0.0,7.86661,3.77937,4.19666,0.0,2.6775,...,0.0,0.0,0.0,0.0,4.50249,0.0,2.53983,1.99209,0.0,0.0
1,SIDM00023;TE-12,0.0,6.82802,4.14346,2.21578,7.62878,3.2399,4.60902,0.0,2.88893,...,0.0,0.0,0.0,0.0,4.52413,0.0,1.45604,3.03762,0.0,0.0
2,SIDM00040;TMK-1,0.0,7.01426,3.85803,2.27808,8.12459,3.01438,4.76881,0.0,0.0,...,0.0,2.05053,0.0,0.0,4.27994,0.0,3.64597,0.0,0.0,3.93987
3,SIDM00041;STS-0421,0.0,5.28591,3.51695,0.0,7.97268,3.25532,4.52092,0.0,2.79756,...,0.0,0.0,0.0,0.0,4.65226,0.0,2.25913,0.0,0.0,3.05021
4,SIDM00042;PL4,0.0,5.70786,2.03732,0.0,6.22574,1.28246,4.49579,0.0,0.0,...,0.0,0.0,0.0,0.0,3.98994,0.0,0.0,0.0,0.0,0.0


In [27]:
# retrieve model_id based on Project_Identifier

df = goncalve_cell_line_info.loc[goncalve_cell_line_info['Project_Identifier'] == 'SIDM00018;K052']
df.head()

Unnamed: 0,model_id,Project_Identifier,Cell_line,Source,Identifier,Gender,Tissue_type,Cancer_type,Cancer_subtype,Haem_lineage,...,F6,F7,F8,F9,F10,F11,F12,F13,F14,F15
865,SIDM00018,SIDM00018;K052,K052,JCRB,CVCL_1321,Male,Haematopoietic and Lymphoid,Acute Myeloid Leukemia,Adult Acute Myeloid Leukemia,Myeloid,...,-0.372191,-2.835982,-2.347274,-0.309295,-0.248008,-1.632443,0.192951,-0.862211,1.037266,-0.184462


In [28]:
df['model_id']

865    SIDM00018
Name: model_id, dtype: object

In [29]:
# chain set_index rather than calling it in place on a column slice (avoids SettingWithCopyWarning)
sanger_model_ids = goncalve_cell_line_info[['model_id', 'Project_Identifier']].set_index('Project_Identifier')
sanger_model_ids.head()

Project_Identifier,model_id
SIDM00896;BC-1,SIDM00896
SIDM00312;L-363,SIDM00312
SIDM00277;EoL-1-cell,SIDM00277
SIDM01119;NCI-H727,SIDM01119
SIDM00657;MV-4-11,SIDM00657
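In the rows shown, `Project_Identifier` looks like `model_id` followed by `;` and the cell line name. If that pattern holds for the whole table (an assumption, not something verified here), the mapping can be cross-checked by splitting the string. A sketch on a toy copy of the rows above (`info_toy` is a hypothetical name):

```python
import pandas as pd

# Toy copy of a few rows from the cell line info table shown above.
info_toy = pd.DataFrame({
    "model_id": ["SIDM00896", "SIDM00312", "SIDM00277"],
    "Project_Identifier": ["SIDM00896;BC-1", "SIDM00312;L-363", "SIDM00277;EoL-1-cell"],
})

# Split once on the first ';' and keep the part before it.
prefix = info_toy["Project_Identifier"].str.split(";", n=1).str[0]

# True only if every prefix equals the recorded model_id.
all_match = bool((prefix == info_toy["model_id"]).all())
print(all_match)
```

Running the same comparison on `goncalve_cell_line_info` would show whether the lookup table is strictly needed or can serve as a sanity check.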


In [30]:
full_protein_matrix.set_index('Project_Identifier', inplace=True)

In [31]:
# join the full_protein_matrix and the sanger_model_ids by Project_Identifier

joined_full_protein_matrix = full_protein_matrix.join(sanger_model_ids, how='inner')

In [32]:
joined_full_protein_matrix.head()

Project_Identifier,Q9Y651;SOX21_HUMAN,P37108;SRP14_HUMAN,Q96JP5;ZFP91_HUMAN,Q9Y4H2;IRS2_HUMAN,P36578;RL4_HUMAN,Q6SPF0;SAMD1_HUMAN,O76031;CLPX_HUMAN,Q8WUQ7;CATIN_HUMAN,A6NIH7;U119B_HUMAN,Q9BTD8;RBM42_HUMAN,...,P02763;A1AG1_HUMAN,Q9ULK4;MED23_HUMAN,P22352;GPX3_HUMAN,P0C221;CC175_HUMAN,P02753;RET4_HUMAN,Q9NWZ8;GEMI8_HUMAN,O43427;FIBP_HUMAN,O75319;DUS11_HUMAN,Q8IZU2;WDR17_HUMAN,model_id
SIDM00018;K052,0.0,7.10955,3.38802,0.0,7.86661,3.77937,4.19666,0.0,2.6775,3.07932,...,0.0,0.0,0.0,4.50249,0.0,2.53983,1.99209,0.0,0.0,SIDM00018
SIDM00023;TE-12,0.0,6.82802,4.14346,2.21578,7.62878,3.2399,4.60902,0.0,2.88893,3.69535,...,0.0,0.0,0.0,4.52413,0.0,1.45604,3.03762,0.0,0.0,SIDM00023
SIDM00040;TMK-1,0.0,7.01426,3.85803,2.27808,8.12459,3.01438,4.76881,0.0,0.0,3.48951,...,2.05053,0.0,0.0,4.27994,0.0,3.64597,0.0,0.0,3.93987,SIDM00040
SIDM00041;STS-0421,0.0,5.28591,3.51695,0.0,7.97268,3.25532,4.52092,0.0,2.79756,3.3745,...,0.0,0.0,0.0,4.65226,0.0,2.25913,0.0,0.0,3.05021,SIDM00041
SIDM00042;PL4,0.0,5.70786,2.03732,0.0,6.22574,1.28246,4.49579,0.0,0.0,3.71758,...,0.0,0.0,0.0,3.98994,0.0,0.0,0.0,0.0,0.0,SIDM00042


In [33]:
joined_full_protein_matrix.shape

(949, 8499)
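An inner join silently drops rows whose `Project_Identifier` has no match, and duplicates rows if a key repeats. The sketch below, on toy frames with hypothetical names, shows one way to guard against both using `DataFrame.merge` with `validate="one_to_one"` and an index difference; it is an optional check, not part of the original pipeline.

```python
import pandas as pd

# Toy proteome matrix and ID lookup, both indexed by Project_Identifier.
proteome_toy = pd.DataFrame(
    {"P1": [7.1, 6.8, 7.0]},
    index=pd.Index(["SIDM1;A", "SIDM2;B", "SIDM3;C"], name="Project_Identifier"),
)
ids_toy = pd.DataFrame(
    {"model_id": ["SIDM1", "SIDM2"]},
    index=pd.Index(["SIDM1;A", "SIDM2;B"], name="Project_Identifier"),
)

# validate="one_to_one" raises MergeError if a key repeats on either side.
joined_toy = proteome_toy.merge(
    ids_toy, left_index=True, right_index=True, how="inner", validate="one_to_one"
)

# Identifiers present in the proteome but lost in the join.
dropped = proteome_toy.index.difference(joined_toy.index)
print(joined_toy.shape, list(dropped))
```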

In [34]:
joined_full_protein_matrix.set_index('model_id', inplace=True)
joined_full_protein_matrix.head()

model_id,Q9Y651;SOX21_HUMAN,P37108;SRP14_HUMAN,Q96JP5;ZFP91_HUMAN,Q9Y4H2;IRS2_HUMAN,P36578;RL4_HUMAN,Q6SPF0;SAMD1_HUMAN,O76031;CLPX_HUMAN,Q8WUQ7;CATIN_HUMAN,A6NIH7;U119B_HUMAN,Q9BTD8;RBM42_HUMAN,...,Q8WXF0;SRS12_HUMAN,P02763;A1AG1_HUMAN,Q9ULK4;MED23_HUMAN,P22352;GPX3_HUMAN,P0C221;CC175_HUMAN,P02753;RET4_HUMAN,Q9NWZ8;GEMI8_HUMAN,O43427;FIBP_HUMAN,O75319;DUS11_HUMAN,Q8IZU2;WDR17_HUMAN
SIDM00018,0.0,7.10955,3.38802,0.0,7.86661,3.77937,4.19666,0.0,2.6775,3.07932,...,0.0,0.0,0.0,0.0,4.50249,0.0,2.53983,1.99209,0.0,0.0
SIDM00023,0.0,6.82802,4.14346,2.21578,7.62878,3.2399,4.60902,0.0,2.88893,3.69535,...,0.0,0.0,0.0,0.0,4.52413,0.0,1.45604,3.03762,0.0,0.0
SIDM00040,0.0,7.01426,3.85803,2.27808,8.12459,3.01438,4.76881,0.0,0.0,3.48951,...,0.0,2.05053,0.0,0.0,4.27994,0.0,3.64597,0.0,0.0,3.93987
SIDM00041,0.0,5.28591,3.51695,0.0,7.97268,3.25532,4.52092,0.0,2.79756,3.3745,...,0.0,0.0,0.0,0.0,4.65226,0.0,2.25913,0.0,0.0,3.05021
SIDM00042,0.0,5.70786,2.03732,0.0,6.22574,1.28246,4.49579,0.0,0.0,3.71758,...,0.0,0.0,0.0,0.0,3.98994,0.0,0.0,0.0,0.0,0.0


In [36]:
sin_peptile_exclusion_matrix.set_index('Project_Identifier', inplace=True)

# join the sin_peptile_exclusion_matrix and the sanger_model_ids by Project_Identifier

joined_sin_peptile_exclusion_matrix = sin_peptile_exclusion_matrix.join(sanger_model_ids, how='inner')

joined_sin_peptile_exclusion_matrix.head()

Project_Identifier,P37108;SRP14_HUMAN,Q96JP5;ZFP91_HUMAN,Q9Y4H2;IRS2_HUMAN,P36578;RL4_HUMAN,Q6SPF0;SAMD1_HUMAN,O76031;CLPX_HUMAN,Q8WUQ7;CATIN_HUMAN,A6NIH7;U119B_HUMAN,Q9BTD8;RBM42_HUMAN,Q9P258;RCC2_HUMAN,...,Q5EBL4;RIPL1_HUMAN,P49715;CEBPA_HUMAN,Q5TA45;INT11_HUMAN,O14924;RGS12_HUMAN,Q7Z3B1;NEGR1_HUMAN,O60669;MOT2_HUMAN,Q13571;LAPM5_HUMAN,Q96JM2;ZN462_HUMAN,P35558;PCKGC_HUMAN,model_id
SIDM00018;K052,7.10955,3.41494,0.0,7.86661,3.89547,4.19666,0.0,0.0,3.19088,7.35806,...,0.0,3.90064,2.63998,0.0,0.0,0.0,0.0,0.0,0.0,SIDM00018
SIDM00023;TE-12,6.82802,4.14346,2.23781,7.62878,3.19811,4.60902,0.0,2.47059,3.69535,5.7079,...,0.0,0.0,3.19608,0.0,0.0,0.0,0.0,0.0,0.0,SIDM00023
SIDM00040;TMK-1,7.01426,4.19987,2.44055,8.12459,0.0,4.76881,0.0,0.0,0.0,5.52283,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,SIDM00040
SIDM00041;STS-0421,5.28591,3.35789,0.0,7.97268,0.0,4.52092,0.0,0.0,2.73088,4.29429,...,0.0,0.0,2.79023,0.0,0.0,0.0,0.0,0.0,0.0,SIDM00041
SIDM00042;PL4,5.70786,0.0,0.0,6.22574,0.0,4.49579,0.0,0.0,2.87981,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,SIDM00042


In [37]:
joined_sin_peptile_exclusion_matrix.shape

(949, 6693)

In [38]:
joined_sin_peptile_exclusion_matrix.set_index('model_id', inplace=True)

In [39]:
joined_sin_peptile_exclusion_matrix.head()

model_id,P37108;SRP14_HUMAN,Q96JP5;ZFP91_HUMAN,Q9Y4H2;IRS2_HUMAN,P36578;RL4_HUMAN,Q6SPF0;SAMD1_HUMAN,O76031;CLPX_HUMAN,Q8WUQ7;CATIN_HUMAN,A6NIH7;U119B_HUMAN,Q9BTD8;RBM42_HUMAN,Q9P258;RCC2_HUMAN,...,P33151;CADH5_HUMAN,Q5EBL4;RIPL1_HUMAN,P49715;CEBPA_HUMAN,Q5TA45;INT11_HUMAN,O14924;RGS12_HUMAN,Q7Z3B1;NEGR1_HUMAN,O60669;MOT2_HUMAN,Q13571;LAPM5_HUMAN,Q96JM2;ZN462_HUMAN,P35558;PCKGC_HUMAN
SIDM00018,7.10955,3.41494,0.0,7.86661,3.89547,4.19666,0.0,0.0,3.19088,7.35806,...,0.0,0.0,3.90064,2.63998,0.0,0.0,0.0,0.0,0.0,0.0
SIDM00023,6.82802,4.14346,2.23781,7.62878,3.19811,4.60902,0.0,2.47059,3.69535,5.7079,...,0.0,0.0,0.0,3.19608,0.0,0.0,0.0,0.0,0.0,0.0
SIDM00040,7.01426,4.19987,2.44055,8.12459,0.0,4.76881,0.0,0.0,0.0,5.52283,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
SIDM00041,5.28591,3.35789,0.0,7.97268,0.0,4.52092,0.0,0.0,2.73088,4.29429,...,0.0,0.0,0.0,2.79023,0.0,0.0,0.0,0.0,0.0,0.0
SIDM00042,5.70786,0.0,0.0,6.22574,0.0,4.49579,0.0,0.0,2.87981,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [40]:
# pickle the joined_full_protein_matrix and the joined_sin_peptile_exclusion_matrix

import pickle

with open('data/proteomic-expression/goncalves-2022-cell/goncalve_proteome_fillna_processed.pkl', 'wb') as f:
    pickle.dump(joined_full_protein_matrix, f)
    pickle.dump(joined_sin_peptile_exclusion_matrix, f)

In [44]:
def create_joint_dataset_from_proteome_gdsc(drug_name: str, proteome: pd.DataFrame, gdsc: pd.DataFrame):
    # proteome is expected to be indexed by model_id (the Sanger model ID)
    drug_dataset = gdsc.loc[gdsc['DRUG_NAME'] == drug_name]
    # chain set_index instead of mutating a column slice in place (avoids SettingWithCopyWarning)
    drug_response_data = drug_dataset[['SANGER_MODEL_ID', 'LN_IC50']].set_index('SANGER_MODEL_ID')

    # join the proteome and the drug_response_data on the Sanger model ID

    joined_dataset = proteome.join(drug_response_data, how='inner')

    return joined_dataset

ribociclib_proteome_data = create_joint_dataset_from_proteome_gdsc('Ribociclib', joined_sin_peptile_exclusion_matrix, gdsc2)

In [45]:
ribociclib_proteome_data.head()

Unnamed: 0,P37108;SRP14_HUMAN,Q96JP5;ZFP91_HUMAN,Q9Y4H2;IRS2_HUMAN,P36578;RL4_HUMAN,Q6SPF0;SAMD1_HUMAN,O76031;CLPX_HUMAN,Q8WUQ7;CATIN_HUMAN,A6NIH7;U119B_HUMAN,Q9BTD8;RBM42_HUMAN,Q9P258;RCC2_HUMAN,...,Q5EBL4;RIPL1_HUMAN,P49715;CEBPA_HUMAN,Q5TA45;INT11_HUMAN,O14924;RGS12_HUMAN,Q7Z3B1;NEGR1_HUMAN,O60669;MOT2_HUMAN,Q13571;LAPM5_HUMAN,Q96JM2;ZN462_HUMAN,P35558;PCKGC_HUMAN,LN_IC50
SIDM00097,6.33883,4.04533,0.0,7.03146,1.36365,4.5767,0.0,1.77983,3.04222,4.75797,...,1.86544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.518966
SIDM00122,5.6102,2.97959,0.0,6.96398,3.38582,4.30554,0.0,0.0,3.52632,4.68572,...,2.10132,0.0,2.11006,0.0,1.74507,0.0,0.0,0.0,0.0,3.917605
SIDM00135,5.8206,3.61277,0.0,7.5649,0.0,3.61047,0.0,0.0,0.0,4.2338,...,0.0,0.0,0.0,0.0,4.1711,0.0,0.0,0.0,0.0,4.303794
SIDM00146,5.87047,0.0,0.0,8.00034,3.25225,4.75633,0.0,0.0,3.97645,4.45868,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.306478
SIDM00148,6.04663,0.0,0.0,6.77012,4.98828,5.58545,0.0,0.0,3.17816,4.62375,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.027614
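The joined proteome frame keeps `LN_IC50` as its last column, so features and labels can be separated the same way as for the CCLE data earlier in the notebook. A minimal sketch on a two-row toy frame that reuses values from the table above (`toy`, `X` and `y` are hypothetical names):

```python
import pandas as pd

# Two rows and two protein columns copied from the output above, plus the label.
toy = pd.DataFrame(
    {
        "P37108;SRP14_HUMAN": [6.33883, 5.6102],
        "Q96JP5;ZFP91_HUMAN": [4.04533, 2.97959],
        "LN_IC50": [3.518966, 3.917605],
    },
    index=["SIDM00097", "SIDM00122"],
)

# Features: every protein column; label: the drug response.
X = toy.drop(columns=["LN_IC50"]).to_numpy()
y = toy["LN_IC50"].to_numpy()
print(X.shape, y.shape)
```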
