Skip to content

Alime21/GDSC-Cancer-Database-Project

Repository files navigation

GDSC-Cancer-Database-Project

Optimized SQL database schema for GDSC Cancer Genomics data. Includes detailed Entity-Relationship diagrams, normalization logic, and performance-focused queries.

Project Overview

This project involves the design, implementation, and optimization of a relational database system for the Genomics of Drug Sensitivity in Cancer (GDSC) dataset.

Starting with raw, unstructured data from Kaggle, I engineered a robust MySQL architecture capable of handling complex biological relationships between cancer cell lines, drugs, and genetic targets. The project demonstrates a full data lifecycle: from 0NF to 3NF normalization, conceptual modeling (ERD), to advanced SQL implementation including Stored Procedures, Views, and Indexing.

Database Architecture & ER Diagram

The database schema was designed to eliminate data redundancy and ensure referential integrity. It features a central fact table (drug_sensitivity) connected to various dimensional tables (Cells, Drugs, Targets) in a structure optimized for analytical queries.

Entity Relationship Diagram

Key Architectural Decisions:

  • Separation of Concerns: Cell_lines and Drugs are decoupled to allow independent updates.
  • Complex Relationships: Handled Many-to-Many relationships between Drugs and Targets via intermediate tables (drug_targets), and extended hierarchical data for Pathways.
  • Bioinformatics Specifics: Dedicated tables for microsatellite_instability and tissue_descriptors to capture granular biological metadata without bloating the main tables.

Technical Implementation

1. Data Normalization Process (0NF → 3NF)

The raw dataset contained massive redundancy. I applied rigorous normalization rules:

  • 1NF (Atomicity): Split multi-valued attributes (e.g., comma-separated drug synonyms) into distinct rows.
  • 2NF (Partial Dependency): Decoupled non-key attributes dependent only on part of the composite key. Separated Drug properties from Sensitivity experiment results.
  • 3NF (Transitive Dependency): Removed transitive dependencies (e.g., Pathways dependent on Targets, not directly on Drugs).

2. Advanced SQL Features

Beyond standard CRUD operations, this project utilizes enterprise-level database features:

  • ** Stored Procedures:** Automating complex workflows (e.g., sp_AddDrugTrial to safely insert new sensitivity records while checking foreign key constraints).
  • ** Views:** Created virtual tables (e.g., vw_HighSensitivityDrugs) to simplify complex joins for data analysts, pre-filtering drugs with high IC50 scores.
  • ** Indexing:** Implemented indexes on frequently queried columns (e.g., drug_id, cell_line_name) to drastically improve JOIN performance and query speed.
  • Data Integrity: Enforced ON DELETE CASCADE and foreign key constraints to maintain database health.

Sample Analysis (SQL)

The database allows for complex biological questions to be answered via SQL:

  • Query 1: "Find the top 5 most effective drugs for 'Lung Cancer' cell lines." (JOIN, GROUP BY, ORDER BY)
  • Query 2: "Correlate Microsatellite Instability (MSI) status with drug resistance."
  • Query 3: "List all pathways targeted by a specific drug company."

How to Run

  1. Clone the repository:
    git clone [https://github.com/yourusername/GDSC-Cancer-Database.git](https://github.com/yourusername/GDSC-Cancer-Database.git)
  2. Import the Schema: Open MySQL Workbench, go to File > Open SQL Script and run schema.sql.
  3. Load Data: Run the insert_data.sql script (or import CSVs via the Workbench Table Import Wizard).
  4. Run Queries: Execute analysis_queries.sql to see the database in action.

Tech Stack

  • Database Engine: MySQL 8.0
  • Design Tool: MySQL Workbench (ER Modeling)
  • Languages: SQL (DDL, DML, DQL, DCL)
  • Source Data: Kaggle (GDSC)

About

Optimized SQL database schema for GDSC Cancer Genomics data. Includes detailed Entity-Relationship diagrams, normalization logic, and performance-focused queries.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors