Optimized SQL database schema for GDSC Cancer Genomics data. Includes detailed Entity-Relationship diagrams, normalization logic, and performance-focused queries.
This project involves the design, implementation, and optimization of a relational database system for the Genomics of Drug Sensitivity in Cancer (GDSC) dataset.
Starting with raw, unstructured data from Kaggle, I engineered a robust MySQL architecture capable of handling complex biological relationships between cancer cell lines, drugs, and genetic targets. The project demonstrates a full data lifecycle: from 0NF to 3NF normalization, conceptual modeling (ERD), to advanced SQL implementation including Stored Procedures, Views, and Indexing.
The database schema was designed to eliminate data redundancy and ensure referential integrity. It features a central fact table (drug_sensitivity) connected to various dimensional tables (Cells, Drugs, Targets) in a structure optimized for analytical queries.
- Separation of Concerns:
Cell_linesandDrugsare decoupled to allow independent updates. - Complex Relationships: Handled Many-to-Many relationships between
DrugsandTargetsvia intermediate tables (drug_targets), and extended hierarchical data forPathways. - Bioinformatics Specifics: Dedicated tables for
microsatellite_instabilityandtissue_descriptorsto capture granular biological metadata without bloating the main tables.
The raw dataset contained massive redundancy. I applied rigorous normalization rules:
- 1NF (Atomicity): Split multi-valued attributes (e.g., comma-separated drug synonyms) into distinct rows.
- 2NF (Partial Dependency): Decoupled non-key attributes dependent only on part of the composite key. Separated
Drugproperties fromSensitivityexperiment results. - 3NF (Transitive Dependency): Removed transitive dependencies (e.g.,
Pathwaysdependent onTargets, not directly onDrugs).
Beyond standard CRUD operations, this project utilizes enterprise-level database features:
- ** Stored Procedures:** Automating complex workflows (e.g.,
sp_AddDrugTrialto safely insert new sensitivity records while checking foreign key constraints). - ** Views:** Created virtual tables (e.g.,
vw_HighSensitivityDrugs) to simplify complex joins for data analysts, pre-filtering drugs with high IC50 scores. - ** Indexing:** Implemented indexes on frequently queried columns (e.g.,
drug_id,cell_line_name) to drastically improveJOINperformance and query speed. - Data Integrity: Enforced
ON DELETE CASCADEand foreign key constraints to maintain database health.
The database allows for complex biological questions to be answered via SQL:
- Query 1: "Find the top 5 most effective drugs for 'Lung Cancer' cell lines." (
JOIN,GROUP BY,ORDER BY) - Query 2: "Correlate Microsatellite Instability (MSI) status with drug resistance."
- Query 3: "List all pathways targeted by a specific drug company."
- Clone the repository:
git clone [https://github.com/yourusername/GDSC-Cancer-Database.git](https://github.com/yourusername/GDSC-Cancer-Database.git)
- Import the Schema:
Open MySQL Workbench, go to
File > Open SQL Scriptand runschema.sql. - Load Data:
Run the
insert_data.sqlscript (or import CSVs via the Workbench Table Import Wizard). - Run Queries:
Execute
analysis_queries.sqlto see the database in action.
- Database Engine: MySQL 8.0
- Design Tool: MySQL Workbench (ER Modeling)
- Languages: SQL (DDL, DML, DQL, DCL)
- Source Data: Kaggle (GDSC)
