# Data loading

Methods to pre-cache and load raw data from the data folder.

### Data Pre-processing Guidelines

- **documentation of original download information**
    - original download link 
    - the date of download
    - the doi or bibliography of the linked publication 
    - basic description of the dataset
    - |time-consuming| ideally, a simple description of the methodology used to generate the dataset
        - how were the samples collected?
        - at what time point are the samples lysed? 
        - any further pre-processing steps?  
<br />

- **documentation of dataset(s)** 
    - *the information type represented by the dataset (e.g. gene expression, drug response)*
    - any supplementary spreadsheet(s) associated with the dataset or metadata
    - sample size (e.g. number of samples, number of genes, etc.)
    - main row and column domains (e.g. genes, samples, drugs, etc.)
    - identifiers used for drug, gene, protein etc. 
    - presence of specific drugs or genes/proteins of interest
        - e.g. CDK4/6 inhibitors: palbociclib, ribociclib and abemaciclib  
<br />  

- **documentation of the pre-processing steps**
    - *the final shape of the processed dataset associated with metadata, e.g. (n_samples, n_genes)*
    - |time-consuming| the technique used to transform the dataset
        - e.g. log2 transformation, z-score normalization, etc.
        - e.g. the method used to impute missing values
        - any removal of data and the reasoning behind it (e.g. due to missing values)
    - index to identifier mapping (e.g. gene index to gene symbol mapping)
        - then, the processed dataset will have indexes matched with a corresponding identifier/symbol 
        - e.g. gene index 0 corresponds to gene symbol A1BG
        - e.g. drug index 0 corresponds to drug palbociclib
        - when performing further filtering, the original index order must be preserved or traced to allow for mapping back to the original identifiers
    - creating a paired dataset from two different datasets
        - e.g. drug response and gene expression
        - e.g. drug response and mutation status
        - e.g. gene expression and mutation status
        - must perform model-to-name mapping between the two datasets and document the mapping logic
            - e.g. models are cell lines, matched by cell line name (no spaces, lower case)
            - e.g. models are cell lines, matched by a common identifier (e.g. Sanger_Model_ID)
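
The index-to-identifier mapping and the name-based pairing rule described above can be sketched as follows. This is a minimal illustration on made-up toy data; the helper `normalise_name`, the column names `cell_line` and `match_key`, and all values are assumptions, not part of any real dataset.

```python
import pandas as pd


def normalise_name(name: str) -> str:
    """Lower-case a model name and remove all whitespace (hypothetical helper)."""
    return "".join(str(name).split()).lower()


# Toy stand-ins for two datasets to be paired (values are invented).
response = pd.DataFrame(
    {"cell_line": ["MC-CAR", "ES 3", "EW-11"], "ln_ic50": [2.4, 3.1, 1.7]}
)
expression = pd.DataFrame(
    {"cell_line": ["mc-car", "ES3", "HCC1954"], "A1BG": [0.5, 1.2, 0.9]}
)

# Apply the same matching rule to both datasets and keep the key as a column,
# so the mapping logic stays visible in the paired table.
response["match_key"] = response["cell_line"].map(normalise_name)
expression["match_key"] = expression["cell_line"].map(normalise_name)

# Inner join: only models present in both datasets survive.
paired = response.merge(expression, on="match_key", suffixes=("_resp", "_expr"))

# Index-to-identifier mapping: positional index -> gene symbol.
gene_columns = ["A1BG"]
index_to_gene = dict(enumerate(gene_columns))

# Filtering with a boolean mask keeps the original index labels,
# so filtered rows can still be traced back to their identifiers.
genes = pd.Series(["A1BG", "A2M", "TP53"])
filtered = genes[genes.isin(["A1BG", "TP53"])]

print(paired[["match_key", "ln_ic50", "A1BG"]])
print(filtered.index.tolist())
```

Keeping both original name columns (`cell_line_resp`, `cell_line_expr`) next to the key makes it easier to audit which raw names were treated as the same model.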



### GDSC 1 

GDSC1 is a drug response dataset, retrieved from [Genomics of Drug Sensitivity in Cancer](http://www.cancerrxgene.org/). The data is stored in the `data/drug-response/GDSC1` folder.

Data Retrieval Date: 2022-06-01

Yang, W., Soares, J., Greninger, P., Edelman, E. J., Lightfoot, H., Forbes, S., Bindal, N., Beare, D., Smith, J. A., Thompson, I. R., Ramaswamy, S., Futreal, P. A., Haber, D. A., Stratton, M. R., Benes, C., McDermott, U., & Garnett, M. J. (2013). Genomics of Drug Sensitivity in Cancer (GDSC): A resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Research, 41(Database issue), D955–D961. https://doi.org/10.1093/nar/gks1111

#### Methodology

Retrieved from [Genomics of Drug Sensitivity in Cancer](https://www.cancerrxgene.org/help#t_curve):

> The GDSC1 dataset was generated jointly by the Wellcome Sanger Institute and Massachusetts General Hospital between 2009 and 2015 using a matched set of cancer cell lines (the GDSC1000).

> Compounds were stored in aliquots at -80°C and were subjected to a maximum of 5 freeze-thaw cycles.

> Cells were seeded in 96-well or 384-well plates and compound dose titrations were delivered using tip based liquid handling apparatus. Cell viability was measured using either Syto60 or Resazurin. Drug treatments in this dataset used two formats:

> - 9-point dose curve incorporating a 2-fold dilution step (256-fold range)
> - 5-point dose curve incorporating a 4-fold dilution step (256-fold range)
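
The 256-fold range quoted for both formats follows from the number of dilution steps: 8 steps of 2-fold (2^8) and 4 steps of 4-fold (4^4) both equal 256. The short sketch below works this through; the function `dilution_series` is a hypothetical helper, and the starting concentration of 2.0 is taken from the `MAX_CONC` value shown for the first row of the GDSC1 table (which of the two formats that row used is not stated here).

```python
def dilution_series(max_conc: float, n_points: int, fold: float) -> list:
    """Concentrations from highest to lowest for a fixed-fold dilution series."""
    return [max_conc / fold**i for i in range(n_points)]


nine_point = dilution_series(2.0, n_points=9, fold=2)
five_point = dilution_series(2.0, n_points=5, fold=4)

# Ratio of highest to lowest concentration in each format.
print(nine_point[0] / nine_point[-1])
print(five_point[0] / five_point[-1])

# Lowest concentration in each format.
print(nine_point[-1], five_point[-1])
```

With a maximum of 2.0, both formats bottom out at 0.0078125, which is consistent with the `MIN_CONC` of 0.007813 reported alongside that `MAX_CONC`.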



In [1]:
## Initial Loading of Data

import pandas as pd 

gdsc1 = pd.read_excel('data/drug-response/GDSC1/GDSC1_fitted_dose_response_25Feb20.xlsx')
print(gdsc1.head())

  DATASET  NLME_RESULT_ID  NLME_CURVE_ID  COSMIC_ID CELL_LINE_NAME  \
0   GDSC1             281       12974350     683665         MC-CAR   
1   GDSC1             281       12975300     684055            ES3   
2   GDSC1             281       12975647     684057            ES5   
3   GDSC1             281       12975980     684059            ES7   
4   GDSC1             281       12976330     684062          EW-11   

  SANGER_MODEL_ID     TCGA_DESC  DRUG_ID  DRUG_NAME PUTATIVE_TARGET  \
0       SIDM00636            MM        1  Erlotinib            EGFR   
1       SIDM00265  UNCLASSIFIED        1  Erlotinib            EGFR   
2       SIDM00263  UNCLASSIFIED        1  Erlotinib            EGFR   
3       SIDM00269  UNCLASSIFIED        1  Erlotinib            EGFR   
4       SIDM00203  UNCLASSIFIED        1  Erlotinib            EGFR   

     PATHWAY_NAME  COMPANY_ID WEBRELEASE  MIN_CONC  MAX_CONC   LN_IC50  \
0  EGFR signaling        1045          Y  0.007813       2.0  2.395685   
1  E

In [2]:
gdsc1_info = pd.read_csv('data/drug-response/GDSC1/GDSC1_DrugData.csv')

In [5]:
## Caching loaded data into pickle obj 

import pickle

with open('data/drug-response/GDSC1/cache_gdsc1.pkl', 'wb') as f:
    pickle.dump(gdsc1, f)
    pickle.dump(gdsc1_info, f)


In [1]:
## Loading cached data
import pickle

with open('data/drug-response/GDSC1/cache_gdsc1.pkl', 'rb') as f:
    gdsc1 = pickle.load(f)
    gdsc1_info = pickle.load(f)
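
The cache-then-load pattern used in the cells above can be wrapped in a single helper that reads the raw files only when no cache exists yet. The sketch below is one possible shape for such a helper, not the notebook's actual method; `load_or_cache`, `slow_loader` and the demo file name are hypothetical, and a temporary directory with toy frames stands in for the real data folder.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd


def load_or_cache(cache_path, loader):
    """Return the pickled object at cache_path if it exists;
    otherwise call loader(), pickle its result, and return it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = loader()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data


# Demo with toy data; a real loader would call pd.read_excel / pd.read_csv.
calls = []


def slow_loader():
    calls.append("read")  # record each time the raw data is (re)loaded
    return {
        "gdsc1": pd.DataFrame({"DRUG_ID": [1], "LN_IC50": [2.395685]}),
        "gdsc1_info": pd.DataFrame({"drug_id": [1559]}),
    }


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "cache_demo.pkl"
    first = load_or_cache(cache_file, slow_loader)   # runs the loader
    second = load_or_cache(cache_file, slow_loader)  # served from the cache

print(len(calls))
```

Storing both frames in one dictionary avoids relying on the order of successive `pickle.load` calls. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.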

In [2]:
# investigating the structure of the gdsc1 dataset 

print(gdsc1.head())

  DATASET  NLME_RESULT_ID  NLME_CURVE_ID  COSMIC_ID CELL_LINE_NAME  \
0   GDSC1             281       12974350     683665         MC-CAR   
1   GDSC1             281       12975300     684055            ES3   
2   GDSC1             281       12975647     684057            ES5   
3   GDSC1             281       12975980     684059            ES7   
4   GDSC1             281       12976330     684062          EW-11   

  SANGER_MODEL_ID     TCGA_DESC  DRUG_ID  DRUG_NAME PUTATIVE_TARGET  \
0       SIDM00636            MM        1  Erlotinib            EGFR   
1       SIDM00265  UNCLASSIFIED        1  Erlotinib            EGFR   
2       SIDM00263  UNCLASSIFIED        1  Erlotinib            EGFR   
3       SIDM00269  UNCLASSIFIED        1  Erlotinib            EGFR   
4       SIDM00203  UNCLASSIFIED        1  Erlotinib            EGFR   

     PATHWAY_NAME  COMPANY_ID WEBRELEASE  MIN_CONC  MAX_CONC   LN_IC50  \
0  EGFR signaling        1045          Y  0.007813       2.0  2.395685   
1  E

In [3]:
print(gdsc1.shape)

(310904, 19)


In [4]:
print(gdsc1_info.head())

   drug_id   drug_name                            synonyms  \
0     1559  Luminespib  AUY922, VER-52296,NVP-AUY922,  AUY   
1     1372  Trametinib                GSK1120212, Mekinist   
2     1909  Venetoclax       ABT-199, Veneclexta, GDC-0199   
3     1017    Olaparib       AZD2281, KU0059436,  Lynparza   
4     1021    Axitinib                    AG-13736, Inlyta   

                        pathway_name            targets   pubchem  
0  Protein stability and degradation              HSP90  10096043  
1                 ERK MAPK signaling         MEK1, MEK2  11707110  
2               Apoptosis regulation               BCL2  49846579  
3                   Genome integrity       PARP1, PARP2  23725625  
4                      RTK signaling  PDGFR, KIT, VEGFR   6450551  


In [5]:
palbo = gdsc1.loc[gdsc1['DRUG_NAME'] == 'Palbociclib']

print(palbo.shape)

(901, 19)


In [6]:
ribo = gdsc1.loc[gdsc1['DRUG_NAME'] == 'Ribociclib']

print(ribo.shape)

(0, 19)


In [7]:
abema = gdsc1.loc[gdsc1['DRUG_NAME'] == 'Abemaciclib']

print(abema.shape)

(0, 19)
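
The three per-drug checks above can be folded into one loop that also reports the number of distinct cell lines, since a row count is a count of measurements rather than of cell lines. The sketch below runs on a toy frame; `count_drug_rows` is a hypothetical helper and the case-insensitive, whitespace-tolerant matching is an added assumption, not what the cells above do.

```python
import pandas as pd


def count_drug_rows(df, drug_names, name_col="DRUG_NAME", cell_col="CELL_LINE_NAME"):
    """Count rows and distinct cell lines for each requested drug name."""
    # Normalise names once so 'palbociclib ' and 'Palbociclib' compare equal.
    names = df[name_col].astype(str).str.strip().str.lower()
    summary = {}
    for drug in drug_names:
        rows = df.loc[names == drug.strip().lower()]
        summary[drug] = {
            "n_rows": len(rows),
            "n_cell_lines": rows[cell_col].nunique(),
        }
    return summary


# Toy stand-in for the drug response table (values are invented).
toy = pd.DataFrame(
    {
        "DRUG_NAME": ["Palbociclib", "palbociclib ", "Erlotinib"],
        "CELL_LINE_NAME": ["MC-CAR", "ES3", "ES3"],
    }
)

summary = count_drug_rows(toy, ["Palbociclib", "Ribociclib", "Abemaciclib"])
for drug, counts in summary.items():
    print(drug, counts)
```

An empty result for an exact name does not by itself rule out a compound stored under another name, so checking the `synonyms` column of `gdsc1_info` may be worthwhile before concluding a drug is absent.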


- gdsc1
- dataset type: drug response
- dataset shape: (310904, 19), i.e. (n_measurements, n_features)
- **each row** represents one drug response measurement of a cell line for a given drug
- **each column** is a feature of the drug response measurement
- the column `LN_IC50` is the natural log of the fitted IC50 value of the drug response
- the column `AUC` is the area under the dose-response curve
- the columns `DRUG_ID` and `DRUG_NAME` are the **identifiers of the drug**
    - `DRUG_ID` can be queried against the supplementary spreadsheet `GDSC1_DrugData.csv` (the `gdsc1_info` object in Python) to show further information on drug targets
- the columns `COSMIC_ID`, `SANGER_MODEL_ID` and `CELL_LINE_NAME` are the **identifiers of the cell line**
- drugs of interest present: **Palbociclib** (n = 901 rows); Ribociclib and Abemaciclib return 0 rows on an exact name match
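
Looking up drug targets through `DRUG_ID` amounts to a join between the response table and the drug information table. The sketch below shows this on toy frames that reuse the column names seen in the outputs above (`DRUG_ID` in the response table, `drug_id` and `targets` in the info table); the values are invented, and the back-transformation assumes `LN_IC50` is a natural logarithm, as its `LN_` prefix suggests.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for gdsc1 and gdsc1_info (values are invented).
response = pd.DataFrame(
    {
        "DRUG_ID": [1, 1, 1559],
        "CELL_LINE_NAME": ["MC-CAR", "ES3", "ES5"],
        "LN_IC50": [2.395685, 0.0, -1.0],
    }
)
drug_info = pd.DataFrame(
    {
        "drug_id": [1559, 1372],
        "drug_name": ["Luminespib", "Trametinib"],
        "targets": ["HSP90", "MEK1, MEK2"],
    }
)

# Left join keeps every measurement, even when the drug has no info entry.
annotated = response.merge(
    drug_info, how="left", left_on="DRUG_ID", right_on="drug_id"
)

# Back-transform to the IC50 scale (assumes a natural-log column).
annotated["IC50"] = np.exp(annotated["LN_IC50"])

print(annotated[["DRUG_ID", "CELL_LINE_NAME", "targets", "IC50"]])
```

A left join is used here so that measurements for drugs missing from the info table show up with empty `targets` instead of silently disappearing.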


Pre-processing may not be required for this dataset on its own. Further pre-processing is needed, however, if it is paired with other datasets.

### CCLE 22Q2

CCLE (Cancer Cell Line Encyclopedia) is a gene expression dataset, retrieved from the [DepMap portal](https://depmap.org/portal/download/all/). The data was downloaded with the option 'DepMap Public 22Q2' chosen in the selection menu.

The data is stored in the `data/gene-expression/CCLE_Public_22Q2` folder.

Data Retrieval Date: 2022-06-01

Ghandi, M., Huang, F. W., Jané-Valbuena, J., Kryukov, G. V., Lo, C. C., McDonald, E. R., Barretina, J., Gelfand, E. T., Bielski, C. M., Li, H., Hu, K., Andreev-Drakhlin, A. Y., Kim, J., Hess, J. M., Haas, B. J., Aguet, F., Weir, B. A., Rothberg, M. V., Paolella, B. R., … Sellers, W. R. (2019). Next-generation characterization of the Cancer Cell Line Encyclopedia. Nature, 569(7757), 503–508. https://doi.org/10.1038/s41586-019-1186-3


#### Methodology

From Ghandi et al., 2019:

> WGS for 329 cell lines and WES for 326 cell lines were performed at the Broad Institute Genomics Platform. Libraries were constructed and sequenced on either an Illumina HiSeq 2000 or Illumina GAIIX, with the use of 101-base-pair (bp) paired-end reads for WGS and 76-bp paired-end reads for WES. Output from Illumina software was processed by the Picard data-processing pipeline to yield BAM files containing well-calibrated, aligned reads. All sample information tracking was performed by automated LIMS messaging.

In [None]:
#TODO 

### GDSC 2

GDSC2 is a drug response dataset, retrieved from [Genomics of Drug Sensitivity in Cancer](http://www.cancerrxgene.org/). The data is stored in the `data/drug-response/GDSC2` folder.

In [9]:
## Initial Loading of Data

import pandas as pd 

gdsc2 = pd.read_excel('data/drug-response/GDSC2/GDSC2_fitted_dose_response_25Feb20.xlsx')
print(gdsc2.head())

  DATASET  NLME_RESULT_ID  NLME_CURVE_ID  COSMIC_ID CELL_LINE_NAME  \
0   GDSC2             282       13320532     749709        HCC1954   
1   GDSC2             282       13320565     749710        HCC1143   
2   GDSC2             282       13320598     749711        HCC1187   
3   GDSC2             282       13320631     749712        HCC1395   
4   GDSC2             282       13320668     749713        HCC1599   

  SANGER_MODEL_ID TCGA_DESC  DRUG_ID     DRUG_NAME PUTATIVE_TARGET  \
0       SIDM00872      BRCA     1003  Camptothecin            TOP1   
1       SIDM00866      BRCA     1003  Camptothecin            TOP1   
2       SIDM00885      BRCA     1003  Camptothecin            TOP1   
3       SIDM00884      BRCA     1003  Camptothecin            TOP1   
4       SIDM00877      BRCA     1003  Camptothecin            TOP1   

      PATHWAY_NAME  COMPANY_ID WEBRELEASE  MIN_CONC  MAX_CONC   LN_IC50  \
0  DNA replication        1046          Y  0.000098       0.1 -0.251083   
1  DNA r

In [12]:
gdsc2_info = pd.read_csv('data/drug-response/GDSC2/GDSC2_DrugData.csv')



In [13]:
## Caching loaded data into pickle obj 

import pickle

with open('data/drug-response/GDSC2/cache_gdsc2.pkl', 'wb') as f:
    pickle.dump(gdsc2, f)
    pickle.dump(gdsc2_info, f)

In [14]:
## Loading cached data

import pickle

with open('data/drug-response/GDSC2/cache_gdsc2.pkl', 'rb') as f:
    gdsc2 = pickle.load(f)
    gdsc2_info = pickle.load(f)

### Goncalves 2022 Proteomic Cell Paper (n=949)

### Open Cell Protein Interaction Map