# Session Info

Packages required to run this notebook, listed by inspecting the modules imported into the notebook's global namespace.

In [13]:
import types
def imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            yield val.__name__
list(imports())

['builtins',
 'builtins',
 'pandas',
 'pickle',
 'rdkit.Chem',
 'pubchempy',
 'session_info',
 'types',
 'pkg_resources',
 'pip']

# Data loading 

Methods to load raw data from the `data` folder and cache it for faster reloading.
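
The load-then-cache pattern used throughout this section could be wrapped in one helper. This is a minimal sketch, not code from this notebook: `load_or_build` and its arguments are hypothetical names introduced for illustration.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the pickled object at cache_path if it exists; otherwise call
    build(), pickle its result to cache_path and return it.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    obj = build()
    # create the cache folder on first use
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_path, 'wb') as f:
        pickle.dump(obj, f)
    return obj
```

For example, `build` could be a function that calls `pd.read_excel` on a raw file, so the slow Excel parsing only happens when no cache exists yet.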

### Data Pre-processing Guidelines

- **documentation of original download information**
    - original download link 
    - the date of download
    - the doi or bibliography of the linked publication 
    - basic description of the dataset
    - |time-consuming| ideally, a simple description of the methodology used to generate the dataset
        - how were the samples collected?
        - at what time point are the samples lysed? 
        - any further pre-processing steps?  
<br />

- **documentation of dataset(s)** 
    - *the information type represented by the dataset (i.e. gene expression, drug response, etc.)*
    - any supplementary spreadsheet(s) associated with the dataset or metadata
    - sample size (e.g. number of samples, number of genes, etc.)
    - main row and column domains (e.g. genes, samples, drugs, etc.)
    - identifiers used for drug, gene, protein etc. 
    - presence of specific drugs or genes/proteins of interest
        - e.g. CDK4/6 inhibitors: palbociclib, ribociclib and abemaciclib  
<br />  

- **documentation of the pre-processing steps**
    - *the final shape of the processed dataset associated with metadata, e.g. (n_samples, n_genes)*
    - |time-consuming| the technique used to transform the dataset
        - e.g. log2 transformation, z-score normalization, etc.
        - e.g. the method used to impute missing values
        - any removal of data and reasoning (i.e. due to missing values, etc.)
    - index to identifier mapping (e.g. gene index to gene symbol mapping)
        - then, the processed dataset will have indexes matched with a corresponding identifier/symbol 
        - e.g. gene index 0 corresponds to gene symbol A1BG
        - e.g. drug index 0 corresponds to drug palbociclib
        - when performing further filtering, the original index order must be preserved or traced to allow for mapping back to the original identifiers
    - creating a paired dataset from two different datasets
        - e.g. drug response and gene expression
        - e.g. drug response and mutation status
        - e.g. gene expression and mutation status
        - must perform model-to-name mapping between the two datasets and document the mapping logic
            - e.g. models are cell lines, matched by cell line name (no spaces, lower case)
            - e.g. models are cell lines, matched by a common identifier (e.g. Sanger_Model_ID)
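
The pairing and index-mapping guidelines above can be sketched as follows. The two dataframes are toy stand-ins with made-up values, and `normalise_name` and `match_key` are hypothetical names; only the matching rule (no spaces, lower case) comes from the guidelines.

```python
import pandas as pd


def normalise_name(name):
    """Lower-case a cell line name and drop spaces (the matching rule above)."""
    return str(name).replace(' ', '').lower()


# toy data standing in for two real datasets (values are made up)
drug_response = pd.DataFrame({'CELL_LINE_NAME': ['MC-CAR', 'ES3', 'EW-11'],
                              'LN_IC50': [2.4, 3.1, 1.8]})
expression = pd.DataFrame({'cell_line_name': ['mc-car', 'E S3', 'HEK TE'],
                           'A1BG': [4.2, 0.3, 1.1]})

drug_response['match_key'] = drug_response['CELL_LINE_NAME'].map(normalise_name)
expression['match_key'] = expression['cell_line_name'].map(normalise_name)

# the inner join keeps only the models present in both datasets
paired = drug_response.merge(expression, on='match_key', how='inner')

# index-to-identifier mapping: row position in the paired dataset -> model key
index_to_model = dict(enumerate(paired['match_key']))
print(paired)
print(index_to_model)
```

Storing `index_to_model` next to the paired dataset keeps the rows traceable after later filtering steps.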



### GDSC 1 

GDSC1 is a drug response dataset, retrieved from [Genomics of Drug Sensitivity in Cancer](http://www.cancerrxgene.org/). The data is stored in the `data/drug-response/GDSC1` folder.

Data Retrieval Date: 2022-06-01

Yang, W., Soares, J., Greninger, P., Edelman, E. J., Lightfoot, H., Forbes, S., Bindal, N., Beare, D., Smith, J. A., Thompson, I. R., Ramaswamy, S., Futreal, P. A., Haber, D. A., Stratton, M. R., Benes, C., McDermott, U., & Garnett, M. J. (2013). Genomics of Drug Sensitivity in Cancer (GDSC): A resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Research, 41(Database issue), D955–D961. https://doi.org/10.1093/nar/gks1111

#### Methodology

Quoted from the [Genomics of Drug Sensitivity in Cancer help pages](https://www.cancerrxgene.org/help#t_curve):

> The GDSC1 dataset was generated jointly by the Wellcome Sanger Institute and Massachusetts General Hospital between 2009 and 2015 using a matched set of cancer cell lines (the GDSC1000).

> Compounds were stored in aliquots at -80°C and were subjected to a maximum of 5 freeze-thaw cycles.

> Cells were seeded in 96-well or 384-well plates and compound dose titrations were delivered using tip based liquid handling apparatus. Cell viability was measured using either Syto60 or Resazurin. Drug treatments in this dataset used two formats:

> - 9-point dose curve incorporating a 2-fold dilution step (256-fold range)
> - 5-point dose curve incorporating a 4-fold dilution step (256-fold range)
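
Both formats span the same 256-fold range because a series of n points has n - 1 dilution steps: 2^8 = 4^4 = 256. A small sketch of that arithmetic follows; the helper names are hypothetical and the 2.0 starting concentration is only an example value.

```python
def fold_range(n_points, dilution_factor):
    """Ratio of the highest to the lowest concentration in a serial dilution."""
    return dilution_factor ** (n_points - 1)


def dose_series(max_conc, n_points, dilution_factor):
    """Concentrations of a serial dilution, from highest to lowest."""
    return [max_conc / dilution_factor ** i for i in range(n_points)]


print(fold_range(9, 2))   # 9-point, 2-fold steps
print(fold_range(5, 4))   # 5-point, 4-fold steps
print(dose_series(2.0, 9, 2))
```

With a maximum of 2.0, the lowest point of the 9-point series is 2.0 / 256 = 0.0078125, which agrees with the `MIN_CONC` and `MAX_CONC` values printed for the first GDSC1 row.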



In [1]:
## Initial Loading of Data

import pandas as pd 

gdsc1 = pd.read_excel('data/drug-response/GDSC1/GDSC1_fitted_dose_response_25Feb20.xlsx')
print(gdsc1.head())

  DATASET  NLME_RESULT_ID  NLME_CURVE_ID  COSMIC_ID CELL_LINE_NAME  \
0   GDSC1             281       12974350     683665         MC-CAR   
1   GDSC1             281       12975300     684055            ES3   
2   GDSC1             281       12975647     684057            ES5   
3   GDSC1             281       12975980     684059            ES7   
4   GDSC1             281       12976330     684062          EW-11   

  SANGER_MODEL_ID     TCGA_DESC  DRUG_ID  DRUG_NAME PUTATIVE_TARGET  \
0       SIDM00636            MM        1  Erlotinib            EGFR   
1       SIDM00265  UNCLASSIFIED        1  Erlotinib            EGFR   
2       SIDM00263  UNCLASSIFIED        1  Erlotinib            EGFR   
3       SIDM00269  UNCLASSIFIED        1  Erlotinib            EGFR   
4       SIDM00203  UNCLASSIFIED        1  Erlotinib            EGFR   

     PATHWAY_NAME  COMPANY_ID WEBRELEASE  MIN_CONC  MAX_CONC   LN_IC50  \
0  EGFR signaling        1045          Y  0.007813       2.0  2.395685   
1  E

In [2]:
gdsc1_info = pd.read_csv('data/drug-response/GDSC1/GDSC1_DrugData.csv')

In [5]:
## Caching loaded data into pickle obj 

import pickle

with open('data/drug-response/GDSC1/cache_gdsc1.pkl', 'wb') as f:
    pickle.dump(gdsc1, f)
    pickle.dump(gdsc1_info, f)


In [1]:
## Loading cached data
import pickle

with open('data/drug-response/GDSC1/cache_gdsc1.pkl', 'rb') as f:
    gdsc1 = pickle.load(f)
    gdsc1_info = pickle.load(f)

In [2]:
# investigating the structure of the gdsc1 dataset 

print(gdsc1.head())

  DATASET  NLME_RESULT_ID  NLME_CURVE_ID  COSMIC_ID CELL_LINE_NAME  \
0   GDSC1             281       12974350     683665         MC-CAR   
1   GDSC1             281       12975300     684055            ES3   
2   GDSC1             281       12975647     684057            ES5   
3   GDSC1             281       12975980     684059            ES7   
4   GDSC1             281       12976330     684062          EW-11   

  SANGER_MODEL_ID     TCGA_DESC  DRUG_ID  DRUG_NAME PUTATIVE_TARGET  \
0       SIDM00636            MM        1  Erlotinib            EGFR   
1       SIDM00265  UNCLASSIFIED        1  Erlotinib            EGFR   
2       SIDM00263  UNCLASSIFIED        1  Erlotinib            EGFR   
3       SIDM00269  UNCLASSIFIED        1  Erlotinib            EGFR   
4       SIDM00203  UNCLASSIFIED        1  Erlotinib            EGFR   

     PATHWAY_NAME  COMPANY_ID WEBRELEASE  MIN_CONC  MAX_CONC   LN_IC50  \
0  EGFR signaling        1045          Y  0.007813       2.0  2.395685   
1  E

In [3]:
print(gdsc1.shape)

(310904, 19)


In [4]:
print(gdsc1_info.head())

   drug_id   drug_name                            synonyms  \
0     1559  Luminespib  AUY922, VER-52296,NVP-AUY922,  AUY   
1     1372  Trametinib                GSK1120212, Mekinist   
2     1909  Venetoclax       ABT-199, Veneclexta, GDC-0199   
3     1017    Olaparib       AZD2281, KU0059436,  Lynparza   
4     1021    Axitinib                    AG-13736, Inlyta   

                        pathway_name            targets   pubchem  
0  Protein stability and degradation              HSP90  10096043  
1                 ERK MAPK signaling         MEK1, MEK2  11707110  
2               Apoptosis regulation               BCL2  49846579  
3                   Genome integrity       PARP1, PARP2  23725625  
4                      RTK signaling  PDGFR, KIT, VEGFR   6450551  


In [5]:
palbo = gdsc1.loc[gdsc1['DRUG_NAME'] == 'Palbociclib']

print(palbo.shape)

(901, 19)


In [6]:
ribo = gdsc1.loc[gdsc1['DRUG_NAME'] == 'Ribociclib']

print(ribo.shape)

(0, 19)


In [7]:
Abemaciclib = gdsc1.loc[gdsc1['DRUG_NAME'] == 'Abemaciclib']

print(Abemaciclib.shape)

(0, 19)


Technical information
- dataset name: `gdsc1`
- dataset type: drug response
- dataset shape: (310904, 19) (n_measurements, n_features)
- **each row** represents one drug response measurement of a cell line for a given drug
- **each column** is a feature of that drug response measurement
- the column `LN_IC50` is the natural logarithm of the fitted IC50 value
- the column `AUC` is the area under the fitted dose-response curve
- the columns `DRUG_ID` and `DRUG_NAME` are the **identifiers of the drug**
    - `DRUG_ID` can be looked up in the supplementary spreadsheet `GDSC1_DrugData.csv` (the `gdsc1_info` object in Python, column `drug_id`) for further information such as drug targets
- the columns `COSMIC_ID`, `SANGER_MODEL_ID` and `CELL_LINE_NAME` are the **identifiers of the cell line**
- CDK4/6 inhibitors present: **Palbociclib** (901 rows); Ribociclib and Abemaciclib are absent (0 rows each)


This dataset may not need pre-processing on its own; further pre-processing is needed when it is paired with other datasets.
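
The `DRUG_ID` lookup described above can be sketched as a left merge. The frames below are toy stand-ins for `gdsc1` and `gdsc1_info`: the column names and the two drug entries follow the printed outputs, while the `LN_IC50` values and the `_toy` names are illustrative assumptions.

```python
import pandas as pd

# toy stand-ins for `gdsc1` and `gdsc1_info` (response values are illustrative)
gdsc1_toy = pd.DataFrame({
    'CELL_LINE_NAME': ['MC-CAR', 'ES3', 'MC-CAR'],
    'DRUG_ID': [1, 1, 1017],
    'LN_IC50': [2.40, 3.38, 1.25],
})
gdsc1_info_toy = pd.DataFrame({
    'drug_id': [1017, 1372],
    'drug_name': ['Olaparib', 'Trametinib'],
    'targets': ['PARP1, PARP2', 'MEK1, MEK2'],
})

# a left merge keeps every response row; indicator=True records whether a match was found
annotated = gdsc1_toy.merge(
    gdsc1_info_toy, left_on='DRUG_ID', right_on='drug_id',
    how='left', indicator=True)

# drug IDs in the response table that have no entry in the drug table
unmatched = annotated.loc[annotated['_merge'] == 'left_only', 'DRUG_ID'].unique()
print(annotated[['CELL_LINE_NAME', 'DRUG_ID', 'drug_name', 'targets']])
print(unmatched)
```

A left merge is used here so that response rows are never dropped silently; checking `unmatched` shows which drugs lack annotation.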

### CCLE 22Q2

CCLE (Cancer Cell Line Encyclopedia) is a gene expression dataset, retrieved from [Cancer Cell Line Encyclopedia](https://depmap.org/portal/download/all/). Data is pulled with the option 'DepMap Public 22Q2' in the selection menu.

The data is stored in the `data/gene-expression/CCLE_Public_22Q2` folder.

Data Retrieval Date: 2022-06-01

Ghandi, M., Huang, F. W., Jané-Valbuena, J., Kryukov, G. V., Lo, C. C., McDonald, E. R., Barretina, J., Gelfand, E. T., Bielski, C. M., Li, H., Hu, K., Andreev-Drakhlin, A. Y., Kim, J., Hess, J. M., Haas, B. J., Aguet, F., Weir, B. A., Rothberg, M. V., Paolella, B. R., … Sellers, W. R. (2019). Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature, 569(7757), Article 7757. https://doi.org/10.1038/s41586-019-1186-3


#### Methodology

From Ghandi et al, 2019:

> WGS for 329 cell lines and WES for 326 cell lines were performed at the Broad Institute Genomics Platform. Libraries were constructed and sequenced on either an Illumina HiSeq 2000 or Illumina GAIIX, with the use of 101-base-pair (bp) paired-end reads for WGS and 76-bp paired-end reads for WES. Output from Illumina software was processed by the Picard data-processing pipeline to yield BAM files containing well-calibrated, aligned reads. All sample information tracking was performed by automated LIMS messaging.

In [2]:
# importing ccle data 

import pandas as pd

ccle = pd.read_csv('data/gene-expression/CCLE_Public_22Q2/CCLE_expression.csv')

In [3]:
print(ccle.shape)

print(ccle.describe())

(1406, 19222)
       TSPAN6 (7105)  TNMD (64102)  DPM1 (8813)  SCYL3 (57147)  \
count    1406.000000   1406.000000  1406.000000    1406.000000   
mean        3.363532      0.069776     6.495860       2.366410   
std         1.645531      0.345324     0.646531       0.544784   
min         0.000000      0.000000     3.654206       0.594549   
25%         2.862946      0.000000     6.097505       2.003602   
50%         3.804776      0.000000     6.479295       2.334854   
75%         4.430620      0.000000     6.916029       2.682573   
max         8.131857      5.251340     9.175100       4.746850   

       C1orf112 (55732)   FGR (2268)   CFH (3075)  FUCA2 (2519)  GCLC (2729)  \
count       1406.000000  1406.000000  1406.000000   1406.000000  1406.000000   
mean           3.674362     0.445801     2.167746      5.140341     4.639761   
std            0.784917     1.250105     2.241927      1.817106     1.152074   
min            0.056584     0.000000     0.000000      0.000000     1.1

In [4]:
# rename ccle columns: split each 'SYMBOL (ENTREZ_ID)' label into gene symbol and Entrez ID

entrez = []
gene_name = []

for c in ccle.columns:
    if c == 'Unnamed: 0':
        # the first, unnamed column holds the cell line identifier
        entrez.append('CELLLINE')
        gene_name.append('CELLLINE')
    else:
        left, right = c.find('('), c.find(')')
        # Entrez ID inside the brackets
        entrez.append(c[left+1:right])
        # gene symbol before the brackets
        gene_name.append(c[:left].strip())

In [5]:
ccle.columns = gene_name
print(ccle.head())
print(ccle.shape)

     CELLLINE    TSPAN6      TNMD      DPM1     SCYL3  C1orf112       FGR  \
0  ACH-001113  4.331992  0.000000  7.364397  2.792855  4.470537  0.028569   
1  ACH-001289  4.566815  0.584963  7.106537  2.543496  3.504620  0.000000   
2  ACH-001339  3.150560  0.000000  7.379032  2.333424  4.227279  0.056584   
3  ACH-001538  5.085340  0.000000  7.154109  2.545968  3.084064  0.000000   
4  ACH-000242  6.729145  0.000000  6.537607  2.456806  3.867896  0.799087   

        CFH     FUCA2      GCLC  ...      H3C2      H3C3  AC098582.1  \
0  1.226509  3.042644  6.499686  ...  2.689299  0.189034    0.201634   
1  0.189034  3.813525  4.221104  ...  1.286881  1.049631    0.321928   
2  1.310340  6.687061  3.682573  ...  0.594549  1.097611    0.831877   
3  5.868143  6.165309  4.489928  ...  0.214125  0.632268    0.298658   
4  7.208381  5.569856  7.127014  ...  1.117695  2.358959    0.084064   

   DUS4L-BCAP29  C8orf44-SGK3  ELOA3B    NPBWR1  ELOA3D  ELOA3      CDR1  
0      2.130931      0.555816

In [6]:
gene_entrez = pd.DataFrame({'gene_name': gene_name, 'entrez': entrez})
print(gene_entrez.head())

  gene_name    entrez
0  CELLLINE  CELLLINE
1    TSPAN6      7105
2      TNMD     64102
3      DPM1      8813
4     SCYL3     57147


In [8]:
import pickle 

with open('data/gene-expression/CCLE_Public_22Q2/ccle_expression.pkl', 'wb') as f:
    pickle.dump(gene_entrez, f)
    pickle.dump(ccle, f)

In [10]:
import pickle

with open('data/gene-expression/CCLE_Public_22Q2/ccle_expression.pkl', 'rb') as f:
    gene_entrez = pickle.load(f)
    ccle = pickle.load(f)

In [11]:
print(ccle.shape)

(1406, 19222)


In [12]:
print(ccle.head())

     CELLLINE    TSPAN6      TNMD      DPM1     SCYL3  C1orf112       FGR  \
0  ACH-001113  4.331992  0.000000  7.364397  2.792855  4.470537  0.028569   
1  ACH-001289  4.566815  0.584963  7.106537  2.543496  3.504620  0.000000   
2  ACH-001339  3.150560  0.000000  7.379032  2.333424  4.227279  0.056584   
3  ACH-001538  5.085340  0.000000  7.154109  2.545968  3.084064  0.000000   
4  ACH-000242  6.729145  0.000000  6.537607  2.456806  3.867896  0.799087   

        CFH     FUCA2      GCLC  ...      H3C2      H3C3  AC098582.1  \
0  1.226509  3.042644  6.499686  ...  2.689299  0.189034    0.201634   
1  0.189034  3.813525  4.221104  ...  1.286881  1.049631    0.321928   
2  1.310340  6.687061  3.682573  ...  0.594549  1.097611    0.831877   
3  5.868143  6.165309  4.489928  ...  0.214125  0.632268    0.298658   
4  7.208381  5.569856  7.127014  ...  1.117695  2.358959    0.084064   

   DUS4L-BCAP29  C8orf44-SGK3  ELOA3B    NPBWR1  ELOA3D  ELOA3      CDR1  
0      2.130931      0.555816

In [14]:
import pandas as pd 

ccle_sample_info = pd.read_csv('data/gene-expression/CCLE_Public_22Q2/sample_info.csv')


In [15]:
print(ccle_sample_info.head())

    DepMap_ID cell_line_name stripped_cell_line_name  \
0  ACH-000016         SLR 21                   SLR21   
1  ACH-000032     MHH-CALL-3                MHHCALL3   
2  ACH-000033      NCI-H1819                NCIH1819   
3  ACH-000043       Hs 895.T                  HS895T   
4  ACH-000049         HEK TE                   HEKTE   

                                     CCLE_Name alias  COSMICID     sex  \
0                                 SLR21_KIDNEY   NaN       NaN     NaN   
1  MHHCALL3_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE   NaN       NaN  Female   
2                                NCIH1819_LUNG   NaN       NaN  Female   
3                            HS895T_FIBROBLAST   NaN       NaN  Female   
4                                 HEKTE_KIDNEY   NaN       NaN     NaN   

         source       RRID  WTSI_Master_Cell_ID  ...   lineage_sub_subtype  \
0  Academic lab  CVCL_V607                  NaN  ...                   NaN   
1          DSMZ  CVCL_0089                  NaN  ...          

In [16]:
import pickle 

with open('data/gene-expression/CCLE_Public_22Q2/ccle_sample_info.pkl', 'wb') as f:
    pickle.dump(ccle_sample_info, f)

In [17]:
import pickle

with open('data/gene-expression/CCLE_Public_22Q2/ccle_sample_info.pkl', 'rb') as f:
    ccle_sample_info = pickle.load(f)

Dataset Documentation
- Dataset name: ccle_expression 
- dataset type: gene expression by RNASeq
- dataset shape: (1406, 19222) (n_cell_lines, 1 identifier column + 19221 genes)
- **each row** represents the gene expression profile of one cell line
- **each column** after the first is the expression of one gene across cell lines
- the first column `CELLLINE`, renamed from `Unnamed: 0`, holds the cell line `DepMap_ID` and is the **identifier of the cell line**
- `DepMap_ID` can be looked up in the supplementary spreadsheet `sample_info.csv` (the `ccle_sample_info` object in Python) for further cell line information, including links to other identifiers such as `Sanger_Model_ID`

- biologically/clinically implicated genes of note for cdk4/6 inhibitors: #TODO

Pre-processing Documentation
- column renaming was performed: the first column was renamed from `Unnamed: 0` to `CELLLINE`, and the Entrez IDs were stripped from the gene column names and stored in a separate dataframe (`gene_entrez`) as part of data cleaning.
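
Linking expression rows to other datasets through the sample info table could look like the sketch below. It assumes the sample info has `DepMap_ID` and `Sanger_Model_ID` columns as described above; the frames are toy stand-ins and the `SIDM_A` / `SIDM_B` values are placeholders, not real pairings.

```python
import pandas as pd

# toy stand-ins for `ccle` and `ccle_sample_info`; Sanger IDs are placeholders
ccle_toy = pd.DataFrame({'CELLLINE': ['ACH-001113', 'ACH-001289', 'ACH-000242'],
                         'TSPAN6': [4.33, 4.57, 6.73]})
sample_info_toy = pd.DataFrame({'DepMap_ID': ['ACH-001113', 'ACH-000242'],
                                'Sanger_Model_ID': ['SIDM_A', 'SIDM_B']})

# Series indexed by DepMap_ID, used as a lookup table
id_map = sample_info_toy.set_index('DepMap_ID')['Sanger_Model_ID']
ccle_toy['Sanger_Model_ID'] = ccle_toy['CELLLINE'].map(id_map)

# record cell lines without a mapping instead of dropping them silently
missing = ccle_toy.loc[ccle_toy['Sanger_Model_ID'].isna(), 'CELLLINE'].tolist()
print(ccle_toy)
print(missing)
```

Keeping the list of unmapped cell lines documents the mapping logic, as the pre-processing guidelines ask.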

### GDSC 2

GDSC2 is a drug response dataset, retrieved from [Genomics of Drug Sensitivity in Cancer](http://www.cancerrxgene.org/). The data is stored in the `data/drug-response/GDSC2` folder.

In [9]:
## Initial Loading of Data

import pandas as pd 

gdsc2 = pd.read_excel('data/drug-response/GDSC2/GDSC2_fitted_dose_response_25Feb20.xlsx')
print(gdsc2.head())

  DATASET  NLME_RESULT_ID  NLME_CURVE_ID  COSMIC_ID CELL_LINE_NAME  \
0   GDSC2             282       13320532     749709        HCC1954   
1   GDSC2             282       13320565     749710        HCC1143   
2   GDSC2             282       13320598     749711        HCC1187   
3   GDSC2             282       13320631     749712        HCC1395   
4   GDSC2             282       13320668     749713        HCC1599   

  SANGER_MODEL_ID TCGA_DESC  DRUG_ID     DRUG_NAME PUTATIVE_TARGET  \
0       SIDM00872      BRCA     1003  Camptothecin            TOP1   
1       SIDM00866      BRCA     1003  Camptothecin            TOP1   
2       SIDM00885      BRCA     1003  Camptothecin            TOP1   
3       SIDM00884      BRCA     1003  Camptothecin            TOP1   
4       SIDM00877      BRCA     1003  Camptothecin            TOP1   

      PATHWAY_NAME  COMPANY_ID WEBRELEASE  MIN_CONC  MAX_CONC   LN_IC50  \
0  DNA replication        1046          Y  0.000098       0.1 -0.251083   
1  DNA r

In [12]:
gdsc2_info = pd.read_csv('data/drug-response/GDSC2/GDSC2_DrugData.csv')



In [13]:
## Caching loaded data into pickle obj 

import pickle

with open('data/drug-response/GDSC2/cache_gdsc2.pkl', 'wb') as f:
    pickle.dump(gdsc2, f)
    pickle.dump(gdsc2_info, f)

In [14]:
## Loading cached data

import pickle

with open('data/drug-response/GDSC2/cache_gdsc2.pkl', 'rb') as f:
    gdsc2 = pickle.load(f)
    gdsc2_info = pickle.load(f)

### Goncalves 2022 Proteomic Cell Paper (n=949)

Data was retrieved from the [supplemental information of the original Cancer Cell article](https://www.cell.com/cancer-cell/fulltext/S1535-6108(22)00274-4) on 01-02-2023.

Gonçalves, E., Poulos, R. C., Cai, Z., Barthorpe, S., Manda, S. S., Lucas, N., Beck, A., Bucio-Noble, D., Dausmann, M., Hall, C., Hecker, M., Koh, J., Lightfoot, H., Mahboob, S., Mali, I., Morris, J., Richardson, L., Seneviratne, A. J., Shepherd, R., … Reddel, R. R. (2022). Pan-cancer proteomic map of 949 human cell lines. Cancer Cell, 40(8), 835-849.e8. https://doi.org/10.1016/j.ccell.2022.06.010

The data is stored in the `data/proteomic-expression/goncalves-2022-cell` folder.

This dataset contains the proteomic expression of 949 cell lines. 

#### Methodology

From Gonçalves et al, 2022 (Results page): 

>To construct a pan-cancer proteomic map, proteomes of 949 human cancer cell lines from 28 tissues and more than 40 genetically and histologically diverse cancer types were quantified (Figures 1A and S1A, Table S1). The proteome for each cell line was acquired by DIA-MS from six replicates using a workflow that enables high throughput and minimal instrument downtime (see STAR Methods, Figure S1B). The resulting dataset was derived from 6,864 DIA-MS runs acquired over 10,000 MS h (Table S1), including peptide preparations derived from the human embryonic kidney cell line HEK293T that were used throughout all data acquisition periods and instruments for quality control. These data, together with the spectral library, were deposited in the Proteomics Identification Database (Perez-Riverol et al., 2019) with dataset identifier PXD030304. Raw DIA-MS data were processed with DIA-NN (Demichev et al., 2020), using retention time-dependent normalization and with a spectral library generated by DIA-NN. For full details of data processing steps and parameters, see STAR Methods and Table S1. MaxLFQ (Cox et al., 2014) was then used to quantify a total of 8,498 proteins (Table S2, Figure S1C), with a median of 5,237 proteins (min-max range: 2,523–6,251) quantified per cell line (Table S1, Figure 1A).

For more detailed information on the methodology, see the STAR Methods section of the paper. In brief, protein expression was measured using DIA-MS, and the data were processed using DIA-NN and quantified using MaxLFQ. The data were then log2 transformed.

For more information on MaxLFQ, see [Cox et al, 2014](https://www.sciencedirect.com/science/article/pii/S1535947620333107).

In [27]:
import pandas as pd

# loading in the proteomic data

main_file = pd.ExcelFile('data/proteomic-expression/goncalves-2022-cell/goncalves-2022-cell-949-protein-matrix.xlsx')
print(main_file.sheet_names)

full_protein_matrix = pd.read_excel(main_file, 'Full protein matrix', header=1)
print(full_protein_matrix.head(2))

sin_peptile_exclusion_matrix = pd.read_excel(main_file, 'Prot matrix excl single-peptide', header=1)
print(sin_peptile_exclusion_matrix.head(2))


['Full protein matrix', 'Prot matrix excl single-peptide']
  Project_Identifier  Q9Y651;SOX21_HUMAN  P37108;SRP14_HUMAN  \
0     SIDM00018;K052                 NaN             7.10955   
1    SIDM00023;TE-12                 NaN             6.82802   

   Q96JP5;ZFP91_HUMAN  Q9Y4H2;IRS2_HUMAN  P36578;RL4_HUMAN  \
0             3.38802                NaN           7.86661   
1             4.14346            2.21578           7.62878   

   Q6SPF0;SAMD1_HUMAN  O76031;CLPX_HUMAN  Q8WUQ7;CATIN_HUMAN  \
0             3.77937            4.19666                 NaN   
1             3.23990            4.60902                 NaN   

   A6NIH7;U119B_HUMAN  ...  Q8WXF0;SRS12_HUMAN  P02763;A1AG1_HUMAN  \
0             2.67750  ...                 NaN                 NaN   
1             2.88893  ...                 NaN                 NaN   

   Q9ULK4;MED23_HUMAN  P22352;GPX3_HUMAN  P0C221;CC175_HUMAN  \
0                 NaN                NaN             4.50249   
1                 NaN        

In [29]:
print(full_protein_matrix.shape)
print(sin_peptile_exclusion_matrix.shape)

(949, 8499)
(949, 6693)


In [28]:
# loading in the sample info for the proteomic data

info_file = pd.ExcelFile('data/proteomic-expression/goncalves-2022-cell/goncalves-2022-cell-949-sample-info.xlsx')
print(info_file.sheet_names)

goncalve_cell_line_info = pd.read_excel(info_file, 'Cell line level sample info', header=1)

print(goncalve_cell_line_info.head(1))
print(goncalve_cell_line_info.shape)

['Legend', 'Cell line level sample info', 'Replicate level sample info', 'Variable isolation windows', 'Complete DIANN input file list', 'DIA-NN parameters']
    model_id Project_Identifier Cell_line Source Identifier Gender  \
0  SIDM00896     SIDM00896;BC-1      BC-1   ATCC  CVCL_1079   Male   

                   Tissue_type                    Cancer_type  \
0  Haematopoietic and Lymphoid  B-Cell Non-Hodgkin's Lymphoma   

              Cancer_subtype Haem_lineage  ...        F6        F7        F8  \
0  Primary effusion lymphoma   B lymphoid  ... -0.460411  0.769345  1.655736   

         F9       F10       F11       F12       F13       F14       F15  
0  2.109067 -0.056367 -0.550705  0.229186  0.486789  0.299905 -0.319625  

[1 rows x 42 columns]
(949, 42)


In [31]:
# pickle the goncalve_proteome and goncalve_proteome_info

import pickle

with open('data/proteomic-expression/goncalves-2022-cell/goncalve_proteome.pkl', 'wb') as f:
    pickle.dump(full_protein_matrix, f)
    pickle.dump(sin_peptile_exclusion_matrix, f)
    pickle.dump(goncalve_cell_line_info, f)

In [32]:
# load the goncalve_proteome and goncalve_proteome_info

import pickle

with open('data/proteomic-expression/goncalves-2022-cell/goncalve_proteome.pkl', 'rb') as f:
    full_protein_matrix = pickle.load(f)
    sin_peptile_exclusion_matrix = pickle.load(f)
    goncalve_cell_line_info = pickle.load(f)

In [34]:
print(full_protein_matrix.head(2))

  Project_Identifier  Q9Y651;SOX21_HUMAN  P37108;SRP14_HUMAN  \
0     SIDM00018;K052                 NaN             7.10955   
1    SIDM00023;TE-12                 NaN             6.82802   

   Q96JP5;ZFP91_HUMAN  Q9Y4H2;IRS2_HUMAN  P36578;RL4_HUMAN  \
0             3.38802                NaN           7.86661   
1             4.14346            2.21578           7.62878   

   Q6SPF0;SAMD1_HUMAN  O76031;CLPX_HUMAN  Q8WUQ7;CATIN_HUMAN  \
0             3.77937            4.19666                 NaN   
1             3.23990            4.60902                 NaN   

   A6NIH7;U119B_HUMAN  ...  Q8WXF0;SRS12_HUMAN  P02763;A1AG1_HUMAN  \
0             2.67750  ...                 NaN                 NaN   
1             2.88893  ...                 NaN                 NaN   

   Q9ULK4;MED23_HUMAN  P22352;GPX3_HUMAN  P0C221;CC175_HUMAN  \
0                 NaN                NaN             4.50249   
1                 NaN                NaN             4.52413   

   P02753;RET4_HUMAN  Q

Dataset Documentation
- Dataset name: full_protein_matrix

- dataset type: proteomic expression by DIA-MS

- dataset shape: (949, 8499) (n_cell_lines, 1 identifier column + 8498 quantified proteins); note that each protein expression value is already log2 transformed.

- **each row** represents the proteomic profile of one cell line

- **each column** after the first is the expression of one protein across cell lines

- the first column `Project_Identifier` is the **identifier of the cell line**, in the form `model_id;cell line name` (e.g. `SIDM00018;K052`). It can be matched to the `model_id` column of the sample info spreadsheet, loaded as the Python object `goncalve_cell_line_info`. `model_id` appears to be consistent with the Sanger model ID format.
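
To match `Project_Identifier` against `model_id`, the identifier can be split on its `;` separator. This is a minimal sketch on a toy frame built from the two rows printed above; the `model_id` and `cell_line` column names added here are illustrative choices.

```python
import pandas as pd

# toy frame using the first two rows and one protein column from the printed output
toy = pd.DataFrame({'Project_Identifier': ['SIDM00018;K052', 'SIDM00023;TE-12'],
                    'P37108;SRP14_HUMAN': [7.10955, 6.82802]})

# split once on the first ';' into model ID and cell line name
parts = toy['Project_Identifier'].str.split(';', n=1, expand=True)
toy.insert(0, 'model_id', parts[0])
toy.insert(1, 'cell_line', parts[1])
print(toy)
```

Splitting with `n=1` leaves any further `;` inside the cell line name untouched; whether such names occur in the full dataset has not been checked here.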





### STRING Database for Protein-Protein Interactions

### PDE Ribociclib Data (Sungyoung)

Type: Drug response dataset (single drug: ribociclib)

Source: in-house data of external collaborators

Data is stored in the `data/drug-response/PDE_Ribociclib_ExtInHouse` folder.

##### Data Description and Methods

The dataset consists of multiple Excel files, each containing multiple sheets. Drug response is measured as the percentage (%) decrease in Ki67 positivity versus control. Ki67 is known to play a role in cell proliferation (Soliman et al, 2016). Responders are defined as samples whose Ki67 positivity decreases by at least 25% or at least 50% versus control, depending on the cutoff used. Two doses of ribociclib were tested, 100 nM and 500 nM.

From a brief visual inspection of the data, the `datamatrix` sheets in `response_mimi` and `response_ml_training_data` appear to be the same and to refer to "responders" to the 100 nM ribociclib treatment with a 25% decrease in Ki67 positivity versus control. Proteomic expression data were analyzed using Spectronaut 8 and quantified using MaxQuant Version 1.5.2.8 (Nguyen et al, 2018).

Soliman, N. A., & Yussif, S. M. (2016). Ki-67 as a prognostic marker according to breast cancer molecular subtype. Cancer Biology & Medicine, 13(4), 496–504. https://doi.org/10.20892/j.issn.2095-3941.2016.0066

Nguyen, E. V., Centenera, M. M., Moldovan, M., Das, R., Irani, S., Vincent, A. D., Chan, H., Horvath, L. G., Lynn, D. J., Daly, R. J., & Butler, L. M. (2018). Identification of Novel Response and Predictive Biomarkers to Hsp90 Inhibitors Through Proteomic Profiling of Patient-derived Prostate Tumor Explants *. Molecular & Cellular Proteomics, 17(8), 1470–1486. https://doi.org/10.1074/mcp.RA118.000633
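
The response definition above can be reproduced from the raw Ki67 values. The sketch below uses the first two rows of the `Response groups` sheet as printed later in this section; the `classify` helper is hypothetical, and treating the cutoff as inclusive ("at least") is an assumption taken from the wording above.

```python
import pandas as pd

# first two rows of the 'Response groups' sheet (Ki67 positivity per condition)
ki67 = pd.DataFrame({
    'Sample ID': ['P1', 'P2'],
    'Vehicle': [72.23, 17.32],
    'Rib 100nm': [57.53, 13.12],
    'Rib 500nm': [23.28, 1.04],
})


def classify(vehicle, treated, cutoff):
    """Fractional Ki67 decrease versus vehicle, and responder labels.

    Assumes a sample is a responder when decrease >= cutoff.
    """
    decrease = (vehicle - treated) / vehicle
    labels = decrease.ge(cutoff).map({True: 'Responder', False: 'Non-Responder'})
    return decrease, labels


dec_100, resp_100_25 = classify(ki67['Vehicle'], ki67['Rib 100nm'], cutoff=0.25)
dec_500, resp_500_50 = classify(ki67['Vehicle'], ki67['Rib 500nm'], cutoff=0.50)
print(pd.DataFrame({'Sample ID': ki67['Sample ID'],
                    'decrease_100': dec_100, 'resp_100_25': resp_100_25,
                    'decrease_500': dec_500, 'resp_500_50': resp_500_50}))
```

For P1, (72.23 - 57.53) / 72.23 is about 0.2035, which is below the 25% cutoff, in line with the `%Decrease 100` and `25% decrease cutoff` columns of the sheet.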






In [30]:
import pandas as pd 

# loading in the training data file

main_file = pd.ExcelFile('data/drug-response/PDE_Ribociclib_ExtInHouse/Ribociclib_Response_training_data_with_all.xlsx')
print(main_file.sheet_names)

['datamatrix']
             Unnamed: 0 Unnamed: 1         NR       NR.1       NR.2  \
0  PG.ProteinAccessions   GeneName         P1         P2         P5   
1                A0AV96      RBM47  13.205421  14.677432  16.944985   

        NR.3       NR.4       NR.5       NR.6       NR.7  ...       RP.5  \
0         P6         P8        P10        P16        P17  ...        P12   
1  16.384994  14.139414  15.210211  15.953077  14.870737  ...  15.116242   

        RP.6       RP.7       RP.8       RP.9     RP.10      RP.11     RP.12  \
0        P13        P14        P15        P21       P23        P24       P26   
1  17.175345  16.620938  15.568577  15.789191  16.16687  15.494391  15.83786   

       RP.13      RP.14  
0        P28        P29  
1  15.110893  14.802329  

[2 rows x 32 columns]
  pde response
0  P1       NR
1  P2       NR




In [53]:
pde_drug_response_full = pd.ExcelFile('data/drug-response/PDE_Ribociclib_ExtInHouse/Ribociclib_Response_Mimi.xlsx')
print(pde_drug_response_full.sheet_names)

# load in 'Response groups' 

pde_response_all = pd.read_excel(pde_drug_response_full, 'Response groups', header=1)
print(pde_response_all.head(2))

# drop row if 'Sample ID' is NaN

pde_response_all = pde_response_all.dropna(subset=['Sample ID'])
# print(pde_response_all.head(2))



['Response groups', '100mM_NonResp_Resp_25', '100mM_NonResp_Resp_50', '500mM_NonResp_Resp_50', 'datamatrix']
  Patient ID Sample ID  Vehicle  Rib 100nm  Rib 500nm %Decrease 100  \
0     33011L        P1    72.23      57.53      23.28      0.203517   
1     33042L        P2    17.32      13.12       1.04      0.242494   

  25% decrease cutoff 50% decrease cutoff %Decrease 500 25% decrease cutoff.1  \
0       Non-Responder       Non-Responder      0.677696             Responder   
1       Non-Responder       Non-Responder      0.939954             Responder   

  50% decrease cutoff.1   PR  Unnamed: 12  
0             Responder  NaN          NaN  
1             Responder  NaN          NaN  
  Patient ID Sample ID  Vehicle  Rib 100nm  Rib 500nm %Decrease 100  \
0     33011L        P1    72.23      57.53      23.28      0.203517   
1     33042L        P2    17.32      13.12       1.04      0.242494   

  25% decrease cutoff 50% decrease cutoff %Decrease 500 25% decrease cutoff.1  \
0     



In [None]:
# first, load the datamatrix sheet with header=1, i.e. the second sheet row
# (0-indexed row 1), which holds the response labels (NR / RP)

ribociclib_response = pd.read_excel(main_file, 'datamatrix', header=1)
print(ribociclib_response.head(2))

# the first data row holds the sample IDs (P1, P2, ...); skip the two annotation columns
pde = ribociclib_response.iloc[0].tolist()[2:]

# pandas de-duplicates repeated labels as NR, NR.1, ...; the first two letters recover the label
response = [c[:2] for c in ribociclib_response.columns][2:]

pde_response = pd.DataFrame({'pde': pde, 'response': response})
print(pde_response.head(2))

In [48]:
# then, load the datamatrix sheet with header=2, i.e. the third sheet row
# (0-indexed row 2), which holds the protein accession, gene name and sample ID labels

ribociclib_expression = pd.read_excel(main_file, 'datamatrix', header=2)
# print(ribociclib_expression.head(2))

ribociclib_protein_id_to_name = ribociclib_expression[['PG.ProteinAccessions', 'GeneName']]
print(ribociclib_protein_id_to_name.head(2))

ribociclib_expression.drop(['PG.ProteinAccessions'], axis=1, inplace=True)
# print(ribociclib_expression.head(2))

ribociclib_expression = ribociclib_expression.T
ribociclib_expression.columns = ribociclib_expression.iloc[0]
ribociclib_expression.drop(['GeneName'], axis=0, inplace=True)
# ribociclib_expression.rename(columns={'GeneName': 'PDE_ID'}, inplace=True)
print(ribociclib_expression.head(2))

# print(ribociclib_expression.columns)

  PG.ProteinAccessions GeneName
0               A0AV96    RBM47
1               A0AVT1     UBA6
GeneName      RBM47       UBA6      ESYT2      SHTN1  ARHGAP10      ILVBL  \
P1        13.205421  13.312968  15.075366   13.94654  14.17361  15.575587   
P2        14.677432  16.032694  16.367723  13.795947   13.9498  16.216862   

GeneName   SH3PXD2B       NBAS      TARS3       VWA8  ...     NDUFB9  \
P1        17.923365  13.991249  15.093301  14.356347  ...  16.408133   
P2        17.704219  14.215525  14.185029  13.544857  ...  16.329183   

GeneName       SQOR      AP1M2      NUMBL     SLC4A4       SCIN      DDX49  \
P1        16.968146  14.367639  14.754497  14.172981  13.620273  15.741047   
P2        17.964655  14.088264   13.63167  15.344575  13.366413   17.89591   

GeneName      WASF2      ENPP4    SEC23IP  
P1        15.439548  13.925692  15.574333  
P2         15.40177  13.765186  14.488299  

[2 rows x 4675 columns]
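
The reshaping above (drop the accession column, transpose, promote the `GeneName` row to the header) can also be written with `set_index`. The following is a minimal sketch on invented values; the toy frame only mimics the column layout of the `datamatrix` sheet and is not the real data.

```python
import pandas as pd

# toy stand-in for the 'datamatrix' sheet (values are invented)
toy = pd.DataFrame({
    'PG.ProteinAccessions': ['A0AV96', 'A0AVT1'],
    'GeneName': ['RBM47', 'UBA6'],
    'P1': [13.2, 13.3],
    'P2': [14.7, 16.0],
})

# keep the accession-to-gene mapping before dropping the accession column
id_to_name = toy[['PG.ProteinAccessions', 'GeneName']]

# genes become columns, samples become rows
expression = (
    toy.drop(columns=['PG.ProteinAccessions'])
       .set_index('GeneName')
       .T
       .astype(float)
)
print(expression)
```

Because the gene names are moved into the index before the transpose, the values stay numeric. Transposing a frame that still contains the string column yields `object` columns, which usually need an explicit cast before modelling.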




In [None]:
# pickle the ribociclib_response and ribociclib_expression, and ribociclib_protein_id_to_name, pde_response and pde_response_all

import pickle

with open('data/drug-response/PDE_Ribociclib_ExtInHouse/ribociclib_pde_cleaned.pkl', 'wb') as f:
    pickle.dump(ribociclib_response, f)
    pickle.dump(ribociclib_expression, f)
    pickle.dump(ribociclib_protein_id_to_name, f)
    pickle.dump(pde_response, f)
    pickle.dump(pde_response_all, f)

In [None]:
# load the ribociclib_response and ribociclib_expression, and ribociclib_protein_id_to_name, pde_response and pde_response_all

import pickle

with open('data/drug-response/PDE_Ribociclib_ExtInHouse/ribociclib_pde_cleaned.pkl', 'rb') as f:
    ribociclib_response = pickle.load(f)
    ribociclib_expression = pickle.load(f)
    ribociclib_protein_id_to_name = pickle.load(f)
    pde_response = pickle.load(f)
    pde_response_all = pickle.load(f)
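
The two cells above rely on the fact that several `pickle.dump` calls on one file handle write the objects back to back, and that `pickle.load` returns them in the same order. A minimal sketch of that behaviour with placeholder objects follows; the file name and the dictionary keys are made up for illustration.

```python
import os
import pickle
import tempfile

# hypothetical objects standing in for the cached dataframes
first = {'name': 'response'}
second = {'name': 'expression'}

path = os.path.join(tempfile.mkdtemp(), 'example_cache.pkl')

# each dump writes one more pickled object to the same file
with open(path, 'wb') as f:
    pickle.dump(first, f)
    pickle.dump(second, f)

# each load reads the next object, so the order must mirror the dumps
with open(path, 'rb') as f:
    loaded_first = pickle.load(f)
    loaded_second = pickle.load(f)

# alternative: a single dict makes the file independent of the load order
with open(path, 'wb') as f:
    pickle.dump({'response': first, 'expression': second}, f)

with open(path, 'rb') as f:
    cache = pickle.load(f)
```

With the sequential layout, swapping two `load` lines silently assigns the wrong object to a name. The dict variant avoids this, at the cost of changing the cache format.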

# Data Integration

## Integration of GDSC2 and CCLE dataset 

### Steps 
1. GDSC2 contains drug data. Each drug can be mapped to a chemical structure, and the structure can be encoded as a SMILES string or a fingerprint. These encodings are the drug features.
2. CCLE contains gene expression data. The expression values of each gene are the gene features.
3. The drug features and gene features can be combined to create drug-gene interaction features.
4. The drug-gene interaction features can be used to train a model to predict drug response (the response values are stored in GDSC2).
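
The steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the column names `cell_line`, `drug_name` and `ln_ic50`, the four-bit fingerprints and all values are hypothetical, and plain concatenation is just one possible way to combine the two feature blocks.

```python
import numpy as np
import pandas as pd

# hypothetical drug features: one short fingerprint per drug (step 1)
drug_features = pd.DataFrame(
    [[1, 0, 1, 0], [0, 1, 1, 0]],
    index=['DrugA', 'DrugB'],
    columns=['fp_0', 'fp_1', 'fp_2', 'fp_3'],
)

# hypothetical gene features: one expression vector per cell line (step 2)
gene_features = pd.DataFrame(
    [[4.3, 0.0, 7.4], [4.6, 0.6, 7.1]],
    index=['LineA', 'LineB'],
    columns=['TSPAN6', 'TNMD', 'DPM1'],
)

# hypothetical response table: one row per (cell line, drug) measurement
response = pd.DataFrame({
    'cell_line': ['LineA', 'LineA', 'LineB'],
    'drug_name': ['DrugA', 'DrugB', 'DrugA'],
    'ln_ic50': [1.2, -0.4, 2.1],
})

# attach both feature blocks to each measured pair (step 3)
pairs = (
    response
    .merge(drug_features, left_on='drug_name', right_index=True)
    .merge(gene_features, left_on='cell_line', right_index=True)
)

# feature matrix and target vector for a downstream model (step 4)
feature_cols = list(drug_features.columns) + list(gene_features.columns)
X = pairs[feature_cols].to_numpy(dtype=float)
y = pairs['ln_ic50'].to_numpy()
print(X.shape, y.shape)
```

Both merges use the default inner join, so any pair whose drug or cell line has no features is dropped rather than filled with missing values.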

In [1]:
import pandas as pd
import pickle

# import GDSC2 drug response data using pickle

with open('data/drug-response/GDSC2/cache_gdsc2.pkl', 'rb') as f:
    gdsc2 = pickle.load(f)
    gdsc2_info = pickle.load(f)

# import CCLE gene expression data using pickle

with open('data/gene-expression/CCLE_Public_22Q2/ccle_expression.pkl', 'rb') as f:
    gene_entrez = pickle.load(f)
    ccle = pickle.load(f)

# import CCLE sample info data using pickle

with open('data/gene-expression/CCLE_Public_22Q2/ccle_sample_info.pkl', 'rb') as f:
    ccle_sample_info = pickle.load(f)


Further preprocessing of drug features: retrieve the SMILES string of each drug using the PubChem ID linked in GDSC2, then convert the SMILES string into a Morgan fingerprint using RDKit.

[Refs needed]

In [34]:
# generate a dataframe between drug name and pubchem id using gdsc2_info

drug_pubchem = gdsc2_info[['drug_name', 'pubchem']]
drug_pubchem = drug_pubchem.drop_duplicates()

# manually modify the pubchem ids that are written as 'none' or 'several'
# 'none' is replaced by the placeholder '-', which is filtered out further below
drug_pubchem.loc[drug_pubchem['pubchem'] == 'none', 'pubchem'] = '-'

# https://pubchem.ncbi.nlm.nih.gov/compound/44259, accessed 09-02-2023
drug_pubchem.loc[drug_pubchem['drug_name'] == 'Staurosporine', 'pubchem'] = 44259

# https://pubchem.ncbi.nlm.nih.gov/compound/457193, accessed 09-02-2023
drug_pubchem.loc[drug_pubchem['drug_name'] == 'Dactinomycin', 'pubchem'] = 457193

# remove the drug name with no pubchem id
drug_pubchem = drug_pubchem[drug_pubchem['pubchem'] != '-']

# in the case of multiple pubchem ids, only retain the first one
# (non-string ids give NaN in str.contains, so '== True' treats them as False)
has_multiple = drug_pubchem['pubchem'].str.contains(",") == True
multiples = drug_pubchem[has_multiple]

# modify the pubchem id to only retain the first one in multiples
drug_pubchem.loc[has_multiple, 'pubchem'] = drug_pubchem.loc[has_multiple, 'pubchem'].str.split(",").str[0]

# remove duplicates
drug_pubchem = drug_pubchem.drop_duplicates()

pubchem_list = list(drug_pubchem['pubchem'])
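
After the cell above, the `pubchem` column holds a mix of strings and integers. A possible alternative, sketched here on an invented table, is to normalise everything to strings first and let `pd.to_numeric` flag unusable entries. The `cid` column name and the toy values are assumptions for illustration only.

```python
import pandas as pd

# toy table mimicking the 'drug_name' / 'pubchem' columns (values invented)
toy = pd.DataFrame({
    'drug_name': ['DrugA', 'DrugB', 'DrugC', 'DrugD'],
    'pubchem': ['123', 'none', '456, 789', 1011],
})

# work on strings only, so integer and text ids behave the same
ids = toy['pubchem'].astype(str).str.strip()

# keep the first id when several are listed
ids = ids.str.split(',').str[0].str.strip()

# anything that is not numeric (e.g. 'none') is treated as missing
toy['cid'] = pd.to_numeric(ids, errors='coerce')
cleaned = toy.dropna(subset=['cid']).astype({'cid': int})
print(cleaned[['drug_name', 'cid']])
```

Free-text entries such as 'several' would also be dropped by this approach, so they still need a manual lookup like the two done in the cell above.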

In [35]:
import pubchempy as pcp

# using pubchempy to retrieve the smiles string of each pubchem id

smiles_list = []

for pubchem in pubchem_list:
    try: 
        compound = pcp.Compound.from_cid(pubchem)
        smiles = compound.isomeric_smiles
        smiles_list.append(smiles)
    except Exception as e:
        print(drug_pubchem[drug_pubchem['pubchem'] == pubchem]['drug_name'])
        smiles_list.append('')

# generate a dataframe between drug name and smiles string

drug_smiles = pd.DataFrame({'drug_name': drug_pubchem['drug_name'], 'smiles': smiles_list})
print(drug_smiles.head())

# compare the number of drugs in drug_smiles with the number of rows in gdsc2_info
print(drug_smiles.shape, gdsc2_info.shape)


    drug_name                                             smiles
0  Luminespib  CCNC(=O)C1=C(/C(=C/2\C=C(C(=CC2=O)O)C(C)C)/ON1...
1  Trametinib  CC1=C2C(=C(N(C1=O)C)NC3=C(C=C(C=C3)I)F)C(=O)N(...
2  Venetoclax  CC1(CCC(=C(C1)C2=CC=C(C=C2)Cl)CN3CCN(CC3)C4=CC...
3    Olaparib  C1CC1C(=O)N2CCN(CC2)C(=O)C3=C(C=CC(=C3)CC4=NNC...
4    Axitinib  CNC(=O)C1=CC=CC=C1SC2=CC3=C(C=C2)C(=NN3)/C=C/C...
(152, 2) (198, 6)


In [36]:
# pickle the drug_smiles and drug_pubchem, both have been modified and cleaned

import pickle

with open('data/drug-response/GDSC2/gdsc2_drug_smiles.pkl', 'wb') as f:
    pickle.dump(drug_smiles, f)

with open('data/drug-response/GDSC2/gdsc2_drug_pubchem.pkl', 'wb') as f:
    pickle.dump(drug_pubchem, f)

In [46]:
# Using RDKit to generate molecular fingerprints from the SMILES strings of the GDSC2 drugs

from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

# generate a list of rdkit mol objects from the smiles string
mol_list = [Chem.MolFromSmiles(smiles) for smiles in drug_smiles['smiles']]
# print(mol_list[0])

# generate a list of Morgan fingerprints (radius 2, 1024 bits) from the rdkit mol objects
fp_list = [AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024) for mol in mol_list]

# convert the first fingerprint to a numpy array as a quick check
fpnp = np.array(fp_list[0])

# TODO: require further preprocessing documentation for the code above 
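
As a starting point for the documentation requested in the TODO, the sketch below shows one way to turn SMILES strings into a 0/1 feature matrix while skipping entries that cannot be parsed (for example the empty strings stored when a PubChem lookup fails). The helper `smiles_to_bits` and the toy inputs are hypothetical and not part of the notebook; the radius and bit length match the cell above.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def smiles_to_bits(smiles, radius=2, n_bits=1024):
    """Return a 0/1 numpy vector, or None if the SMILES cannot be used."""
    mol = Chem.MolFromSmiles(smiles) if smiles else None
    if mol is None or mol.GetNumAtoms() == 0:
        return None
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    bits = np.zeros(n_bits, dtype=np.uint8)
    bits[list(fp.GetOnBits())] = 1  # set the positions of the 'on' bits
    return bits

# toy input: two valid SMILES, one empty string, one invalid string
toy_smiles = {
    'ethanol': 'CCO',
    'benzene': 'c1ccccc1',
    'missing': '',
    'broken': 'not_a_smiles',
}

rows = {name: smiles_to_bits(s) for name, s in toy_smiles.items()}
kept = {name: bits for name, bits in rows.items() if bits is not None}

# one row per usable drug, one column per fingerprint bit
fp_matrix = np.vstack(list(kept.values()))
print(list(kept), fp_matrix.shape)
```

Keeping the names of the retained drugs next to the matrix makes it possible to align the rows with the response table later.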

Retrieve the gene features from CCLE, and convert the gene features into a gene expression matrix.
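
One way to obtain such a matrix is to move the cell line identifier into the index so that only numeric gene columns remain. A minimal sketch on a toy frame with invented values follows; it assumes the identifier column is named `CELLLINE`, as in the `ccle` table.

```python
import pandas as pd

# toy stand-in for the `ccle` table (values invented)
toy_ccle = pd.DataFrame({
    'CELLLINE': ['LineA', 'LineB'],
    'TSPAN6': [4.33, 4.57],
    'TNMD': [0.0, 0.58],
})

# index by cell line so that only numeric gene columns remain
expr = toy_ccle.set_index('CELLLINE')

# matrix of shape (n_cell_lines, n_genes) plus the matching labels
expr_matrix = expr.to_numpy(dtype=float)
gene_names = list(expr.columns)
cell_lines = list(expr.index)
print(expr_matrix.shape)
```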

In [35]:
print(ccle.head())

     CELLLINE    TSPAN6      TNMD      DPM1     SCYL3  C1orf112       FGR  \
0  ACH-001113  4.331992  0.000000  7.364397  2.792855  4.470537  0.028569   
1  ACH-001289  4.566815  0.584963  7.106537  2.543496  3.504620  0.000000   
2  ACH-001339  3.150560  0.000000  7.379032  2.333424  4.227279  0.056584   
3  ACH-001538  5.085340  0.000000  7.154109  2.545968  3.084064  0.000000   
4  ACH-000242  6.729145  0.000000  6.537607  2.456806  3.867896  0.799087   

        CFH     FUCA2      GCLC  ...      H3C2      H3C3  AC098582.1  \
0  1.226509  3.042644  6.499686  ...  2.689299  0.189034    0.201634   
1  0.189034  3.813525  4.221104  ...  1.286881  1.049631    0.321928   
2  1.310340  6.687061  3.682573  ...  0.594549  1.097611    0.831877   
3  5.868143  6.165309  4.489928  ...  0.214125  0.632268    0.298658   
4  7.208381  5.569856  7.127014  ...  1.117695  2.358959    0.084064   

   DUS4L-BCAP29  C8orf44-SGK3  ELOA3B    NPBWR1  ELOA3D  ELOA3      CDR1  
0      2.130931      0.555816