<h1>The Genomics of Drug Sensitivity in Cancer (GDSC)</h1> The GDSC dataset combines drug response data with genomic profiles of cancer cell lines, allowing researchers to investigate the relationship between genetic features and drug sensitivity.

<h3> Task </h3>
The primary task associated with this dataset is to predict drug sensitivity (measured as IC50 values) based on genomic features of cancer cell lines. This can involve regression tasks to predict exact IC50 values or classification tasks to categorize cell lines as sensitive or resistant to specific drugs. The dataset also allows for the identification of genomic markers that correlate with drug response.
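
The regression and classification framings can be sketched on a handful of rows. The example below is a minimal illustration, not part of the original analysis: it reuses five Camptothecin rows copied from the GDSC2 preview shown in this notebook, keeps `LN_IC50` as the regression target, and derives a binary label from `Z_SCORE`. The cutoff `SENSITIVE_Z_CUTOFF` and the `LABEL` column are assumptions introduced for illustration; the dataset does not define a sensitive/resistant threshold.

```python
import numpy as np
import pandas as pd

# Five Camptothecin rows copied from the GDSC2 preview in this notebook.
responses = pd.DataFrame({
    "CELL_LINE_NAME": ["PFSK-1", "A673", "ES5", "ES7", "EW-11"],
    "DRUG_NAME": ["Camptothecin"] * 5,
    "LN_IC50": [-1.463887, -4.869455, -3.360586, -5.044940, -3.741991],
    "Z_SCORE": [0.433123, -1.421100, -0.599569, -1.516647, -0.807232],
})

# Regression framing: the continuous LN_IC50 value is the target as-is.
y_regression = responses["LN_IC50"]

# Classification framing: binarize the response.
# ASSUMPTION: this cutoff is hypothetical and chosen only for illustration.
SENSITIVE_Z_CUTOFF = -1.0
responses["LABEL"] = np.where(
    responses["Z_SCORE"] < SENSITIVE_Z_CUTOFF, "sensitive", "resistant"
)

print(responses[["CELL_LINE_NAME", "LN_IC50", "Z_SCORE", "LABEL"]])
```

Other labeling rules (a per-drug median split, for example) fit the same pattern; only the condition passed to `np.where` changes.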

<h3> Datasets </h3>
<h4>GDSC2-dataset.csv:</h4> Contains drug sensitivity data, including IC50 values, for various drugs tested against cancer cell lines. (Original source file)
<h4>Cell_Lines_Details.xlsx:</h4> Provides detailed information about the cancer cell lines, including genomic features such as mutations, copy number alterations, and gene expression. (Original source file)
<h4>Compounds-annotation.csv:</h4> Offers information about the drugs used in the screening, including their targets and pathways. (Original source file)
<h4>GDSC_DATASET.csv:</h4> This is the main dataset file for analysis. It's a merged file combining key information from the above three files, created to facilitate easier analysis. This consolidated dataset includes all necessary features for drug sensitivity prediction and is recommended for use in your analysis.
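
How the merged file was built is not documented here, so the following is only a sketch of one plausible way to combine the three source tables with pandas. It assumes the response table joins to the cell line details on the COSMIC identifier and to the compound annotations on `DRUG_ID`. The tiny in-memory frames stand in for the real files; their values are taken from the GDSC2 preview in this notebook, with the additional assumption that `TCGA_DESC`, `PUTATIVE_TARGET` and `PATHWAY_NAME` correspond to the matching columns of the other two files.

```python
import pandas as pd

# Stand-in for GDSC2-dataset.csv (two rows from the preview in this notebook).
responses = pd.DataFrame({
    "COSMIC_ID": [683667, 684052],
    "CELL_LINE_NAME": ["PFSK-1", "A673"],
    "DRUG_ID": [1003, 1003],
    "DRUG_NAME": ["Camptothecin", "Camptothecin"],
    "LN_IC50": [-1.463887, -4.869455],
})

# Stand-in for Cell_Lines_Details.xlsx.
# ASSUMPTION: the cancer type values mirror TCGA_DESC from the preview.
cell_lines = pd.DataFrame({
    "Sample Name": ["PFSK-1", "A673"],
    "COSMIC identifier": [683667, 684052],
    "Cancer Type (matching TCGA label)": ["MB", "UNCLASSIFIED"],
})

# Stand-in for Compounds-annotation.csv.
# ASSUMPTION: TARGET / TARGET_PATHWAY mirror PUTATIVE_TARGET / PATHWAY_NAME.
compounds = pd.DataFrame({
    "DRUG_ID": [1003],
    "DRUG_NAME": ["Camptothecin"],
    "TARGET": ["TOP1"],
    "TARGET_PATHWAY": ["DNA replication"],
})

# Left joins keep every response row; validate= raises an error if a key
# is duplicated on the lookup side, which would silently multiply rows.
merged = (
    responses
    .merge(cell_lines, left_on="COSMIC_ID", right_on="COSMIC identifier",
           how="left", validate="many_to_one")
    .merge(compounds[["DRUG_ID", "TARGET", "TARGET_PATHWAY"]],
           on="DRUG_ID", how="left", validate="many_to_one")
)

print(merged.T)
```

Using `how="left"` keeps the row count equal to that of the response table, which makes it easy to check that the join did not drop or duplicate measurements.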

<h3>Detailed Column Descriptions:</h3>
<h4>1. GDSC2-dataset.csv:</h4>
DATASET: Identifier for the specific GDSC dataset version.<br>
NLME_RESULT_ID: Unique identifier for the non-linear mixed effects (NLME) model result.<br>
NLME_CURVE_ID: Identifier for the dose-response curve fitted by NLME.<br>
COSMIC_ID: Unique identifier for the cell line from the COSMIC database.<br>
CELL_LINE_NAME: Name of the cancer cell line used in the experiment.<br>
SANGER_MODEL_ID: Identifier used by the Sanger Institute for the cell line model.<br>
TCGA_DESC: Description of the cancer type according to The Cancer Genome Atlas (TCGA).<br>
DRUG_ID: Unique identifier for the drug used in the experiment.<br>
DRUG_NAME: Name of the drug used in the experiment.<br>
PUTATIVE_TARGET: The presumed molecular target of the drug.<br>
PATHWAY_NAME: The biological pathway affected by the drug.<br>
COMPANY_ID: Identifier for the company that provided the drug.<br>
WEBRELEASE: Web release flag for this data (Y in the rows shown in this notebook).<br>
MIN_CONC: Minimum concentration of the drug used in the experiment.<br>
MAX_CONC: Maximum concentration of the drug used in the experiment.<br>
LN_IC50: Natural log of the half-maximal inhibitory concentration (IC50).<br>
AUC: Area under the fitted dose-response curve; lower values indicate greater sensitivity.<br>
RMSE: Root mean square error of the fitted dose-response curve, indicating fit quality (lower is better).<br>
Z_SCORE: Z-score of LN_IC50 relative to the other cell lines screened with the same drug; negative values indicate above-average sensitivity.<br>
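
Because `LN_IC50` is a natural log, the IC50 itself is recovered with the exponential, and comparing it with `MIN_CONC` and `MAX_CONC` shows whether the fitted value lies inside the tested concentration range. The sketch below uses five Camptothecin rows copied from the GDSC2 preview in this notebook. It assumes that `LN_IC50` and the concentration columns share the same unit (micromolar, as far as the GDSC documentation indicates); the `IC50`, `ABOVE_MAX_CONC` and `BELOW_MIN_CONC` columns are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Five Camptothecin rows copied from the GDSC2 preview in this notebook.
curves = pd.DataFrame({
    "CELL_LINE_NAME": ["PFSK-1", "A673", "ES5", "ES7", "EW-11"],
    "MIN_CONC": [0.0001] * 5,
    "MAX_CONC": [0.1] * 5,
    "LN_IC50": [-1.463887, -4.869455, -3.360586, -5.044940, -3.741991],
})

# Undo the natural log to get IC50 on the concentration scale.
# ASSUMPTION: LN_IC50 is expressed in the same unit as MIN_CONC / MAX_CONC.
curves["IC50"] = np.exp(curves["LN_IC50"])

# Flag fitted IC50 values that fall outside the tested range, i.e. values
# that the curve fit extrapolated rather than observed directly.
curves["ABOVE_MAX_CONC"] = curves["IC50"] > curves["MAX_CONC"]
curves["BELOW_MIN_CONC"] = curves["IC50"] < curves["MIN_CONC"]

print(curves)
```

In these five rows only PFSK-1 has a fitted IC50 above the highest tested concentration, so its value is an extrapolation and may deserve extra caution in downstream modeling.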
<h4>2. Cell_Lines_Details.xlsx:</h4>
Sample Name: Unique identifier for the cell line sample.<br>
COSMIC identifier: Unique ID from the COSMIC database for the cell line.<br>
Whole Exome Sequencing (WES): Genetic mutation data from whole exome sequencing.<br>
Copy Number Alterations (CNA): Data on gene copy number changes in the cell line.<br>
Gene Expression: Information on gene expression levels in the cell line.<br>
Methylation: Data on DNA methylation patterns in the cell line.<br>
Drug Response: Information on how the cell line responds to various drugs.<br>
GDSC Tissue descriptor 1: Primary tissue type classification.<br>
GDSC Tissue descriptor 2: Secondary tissue type classification.<br>
Cancer Type (matching TCGA label): Cancer type according to TCGA classification.<br>
Microsatellite instability Status (MSI): Indicates the cell line's MSI status.<br>
Screen Medium: The growth medium used for culturing the cell line.<br>
Growth Properties: Characteristics of how the cell line grows in culture.<br>
<h4>3. Compounds-annotation.csv:</h4>
DRUG_ID: Unique identifier for the drug.<br>
SCREENING_SITE: Location where the drug screening was performed.<br>
DRUG_NAME: Name of the drug compound.<br>
SYNONYMS: Alternative names for the drug.<br>
TARGET: The molecular target(s) of the drug.<br>
TARGET_PATHWAY: The biological pathway(s) targeted by the drug.<br>

In [1]:
# Import libraries for exploratory data analysis (EDA)

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt



In [6]:
# Loading data

GDSC = pd.read_csv('GDSC_Data/GDSC2-dataset.csv')
GDSC

Unnamed: 0,DATASET,NLME_RESULT_ID,NLME_CURVE_ID,COSMIC_ID,CELL_LINE_NAME,SANGER_MODEL_ID,TCGA_DESC,DRUG_ID,DRUG_NAME,PUTATIVE_TARGET,PATHWAY_NAME,COMPANY_ID,WEBRELEASE,MIN_CONC,MAX_CONC,LN_IC50,AUC,RMSE,Z_SCORE
0,GDSC2,343,15946310,683667,PFSK-1,SIDM01132,MB,1003,Camptothecin,TOP1,DNA replication,1046,Y,0.000100,0.1,-1.463887,0.930220,0.089052,0.433123
1,GDSC2,343,15946548,684052,A673,SIDM00848,UNCLASSIFIED,1003,Camptothecin,TOP1,DNA replication,1046,Y,0.000100,0.1,-4.869455,0.614970,0.111351,-1.421100
2,GDSC2,343,15946830,684057,ES5,SIDM00263,UNCLASSIFIED,1003,Camptothecin,TOP1,DNA replication,1046,Y,0.000100,0.1,-3.360586,0.791072,0.142855,-0.599569
3,GDSC2,343,15947087,684059,ES7,SIDM00269,UNCLASSIFIED,1003,Camptothecin,TOP1,DNA replication,1046,Y,0.000100,0.1,-5.044940,0.592660,0.135539,-1.516647
4,GDSC2,343,15947369,684062,EW-11,SIDM00203,UNCLASSIFIED,1003,Camptothecin,TOP1,DNA replication,1046,Y,0.000100,0.1,-3.741991,0.734047,0.128059,-0.807232
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
242031,GDSC2,343,16188242,1659928,SNU-175,SIDM00216,COREAD,2499,N-acetyl cysteine,Metabolism,Metabolism,1101,Y,2.001054,2000.0,10.127082,0.976746,0.074498,0.156872
242032,GDSC2,343,16188695,1660034,SNU-407,SIDM00214,COREAD,2499,N-acetyl cysteine,Metabolism,Metabolism,1101,Y,2.001054,2000.0,8.576377,0.913378,0.057821,-1.626959
242033,GDSC2,343,16188953,1660035,SNU-61,SIDM00194,COREAD,2499,N-acetyl cysteine,Metabolism,Metabolism,1101,Y,2.001054,2000.0,10.519636,0.975001,0.058090,0.608442
242034,GDSC2,343,16189493,1674021,SNU-C5,SIDM00498,COREAD,2499,N-acetyl cysteine,Metabolism,Metabolism,1101,Y,2.001054,2000.0,10.694579,0.969969,0.101013,0.809684


In [20]:
# Check for duplicated rows, then inspect column info and unique values
print("Number of duplicated row:", GDSC.duplicated().sum())
print('______________________________________________________________________________________')
print("Information about the columns:")
GDSC.info()  # info() prints its report directly and returns None; wrapping it in print() adds a stray "None"
print('______________________________________________________________________________________')
print("Unique Cell lines:\n",GDSC['CELL_LINE_NAME'].unique())
print('______________________________________________________________________________________')
print("Unique Drug Name:\n",GDSC["DRUG_NAME"].unique())


Number of duplicated row: 0
______________________________________________________________________________________
Information about the columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242036 entries, 0 to 242035
Data columns (total 19 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   DATASET          242036 non-null  object 
 1   NLME_RESULT_ID   242036 non-null  int64  
 2   NLME_CURVE_ID    242036 non-null  int64  
 3   COSMIC_ID        242036 non-null  int64  
 4   CELL_LINE_NAME   242036 non-null  object 
 5   SANGER_MODEL_ID  242036 non-null  object 
 6   TCGA_DESC        240969 non-null  object 
 7   DRUG_ID          242036 non-null  int64  
 8   DRUG_NAME        242036 non-null  object 
 9   PUTATIVE_TARGET  214881 non-null  object 
 10  PATHWAY_NAME     242036 non-null  object 
 11  COMPANY_ID       242036 non-null  int64  
 12  WEBRELEASE       242036 non-null  object 
 13  MIN_CONC         242036 non-null 