GDSC-Cancer-Database-Project

Optimized SQL database schema for GDSC Cancer Genomics data. Includes detailed Entity-Relationship diagrams, normalization logic, and performance-focused queries.

Project Overview

This project involves the design, implementation, and optimization of a relational database system for the Genomics of Drug Sensitivity in Cancer (GDSC) dataset.

Starting with the raw, unnormalized data from Kaggle, I designed a MySQL schema that models the relationships between cancer cell lines, drugs, and genetic targets. The project covers the full data lifecycle: normalization from 0NF to 3NF, conceptual modeling (ERD), and SQL implementation including Stored Procedures, Views, and Indexing.

Database Architecture & ER Diagram

The database schema was designed to eliminate data redundancy and ensure referential integrity. It features a central fact table (drug_sensitivity) connected to several dimension tables (Cell_lines, Drugs, Targets), a layout that suits analytical queries.

Key Architectural Decisions:

Separation of Concerns: Cell_lines and Drugs are decoupled to allow independent updates.
Complex Relationships: Many-to-Many relationships between Drugs and Targets are resolved through an intermediate table (drug_targets), and the model is extended to hold hierarchical data for Pathways.
Bioinformatics Specifics: Dedicated tables for microsatellite_instability and tissue_descriptors to capture granular biological metadata without bloating the main tables.
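
To make the layout above concrete, here is a minimal sketch of the core tables. It uses Python's built-in sqlite3 module as a runnable stand-in for MySQL, so types and some syntax differ from the real schema. Every column name (cell_line_id, ln_ic50, and so on) is an assumption for illustration; only the table names drug_sensitivity and drug_targets come from this README.

```python
import sqlite3

# SQLite stands in for MySQL; column names below are illustrative assumptions.
SCHEMA = """
CREATE TABLE cell_lines (
    cell_line_id   INTEGER PRIMARY KEY,
    cell_line_name TEXT NOT NULL UNIQUE
);
CREATE TABLE drugs (
    drug_id   INTEGER PRIMARY KEY,
    drug_name TEXT NOT NULL
);
CREATE TABLE targets (
    target_id   INTEGER PRIMARY KEY,
    target_name TEXT NOT NULL
);
-- Junction table resolving the many-to-many Drugs <-> Targets relationship.
CREATE TABLE drug_targets (
    drug_id   INTEGER NOT NULL,
    target_id INTEGER NOT NULL,
    PRIMARY KEY (drug_id, target_id),
    FOREIGN KEY (drug_id) REFERENCES drugs (drug_id),
    FOREIGN KEY (target_id) REFERENCES targets (target_id)
);
-- Fact table: one row per (cell line, drug) measurement.
CREATE TABLE drug_sensitivity (
    cell_line_id INTEGER NOT NULL,
    drug_id      INTEGER NOT NULL,
    ln_ic50      REAL,
    PRIMARY KEY (cell_line_id, drug_id),
    FOREIGN KEY (cell_line_id) REFERENCES cell_lines (cell_line_id),
    FOREIGN KEY (drug_id) REFERENCES drugs (drug_id)
);
"""


def build_schema():
    """Create the sketch schema in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript(SCHEMA)
    return conn


def table_names(conn):
    """Return the names of all tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(table_names(build_schema()))
```

The foreign keys are declared at table level because MySQL parses but ignores inline column-level REFERENCES clauses, so this form carries over more directly.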

Technical Implementation

1. Data Normalization Process (0NF → 3NF)

The raw dataset was heavily redundant. I normalized it in three steps:

1NF (Atomicity): Split multi-valued attributes (e.g., comma-separated drug synonyms) into distinct rows.
2NF (Partial Dependency): Moved non-key attributes that depend on only part of the composite key into their own tables, e.g., separating Drug properties from Sensitivity experiment results.
3NF (Transitive Dependency): Removed transitive dependencies (e.g., Pathways dependent on Targets, not directly on Drugs).
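
The first two steps can be sketched in plain Python. This is an illustration of the idea, not the project's actual migration code: the function names, the tuple layouts, and the ln_ic50 field are assumptions, and the sample values in the demo are made up.

```python
def split_synonyms(drug_rows):
    """1NF: expand one comma-separated synonyms cell into one row per synonym."""
    rows = []
    for drug_id, synonyms in drug_rows:
        for synonym in synonyms.split(","):
            synonym = synonym.strip()
            if synonym:  # skip empty fragments such as a trailing comma
                rows.append((drug_id, synonym))
    return rows


def split_drug_attributes(flat_rows):
    """2NF: drug_name depends only on drug_id, so it moves to its own table.

    flat_rows holds (cell_line, drug_id, drug_name, ln_ic50) tuples.
    Returns (drugs, sensitivity): a drug_id -> drug_name mapping and the
    measurement rows with the redundant drug_name column removed.
    """
    drugs = {}
    sensitivity = []
    for cell_line, drug_id, drug_name, ln_ic50 in flat_rows:
        drugs[drug_id] = drug_name
        sensitivity.append((cell_line, drug_id, ln_ic50))
    return drugs, sensitivity


if __name__ == "__main__":
    # Made-up placeholder values, not real GDSC records.
    print(split_synonyms([(1, "alias_a, alias_b"), (2, "alias_c")]))
    flat = [("line_x", 1, "drug_a", -1.5), ("line_y", 1, "drug_a", 2.0)]
    print(split_drug_attributes(flat))
```

After the 2NF split, a drug name is stored once per drug rather than once per measurement, which is where most of the redundancy in a flat export tends to come from.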

2. Advanced SQL Features

Beyond standard CRUD operations, the project uses the following database features:

Stored Procedures: Automate multi-step workflows (e.g., sp_AddDrugTrial safely inserts new sensitivity records while checking foreign key constraints).
Views: Virtual tables (e.g., vw_HighSensitivityDrugs) simplify complex joins for data analysts and pre-filter for drugs with low IC50 values; a lower IC50 means the cell line is more sensitive to the drug.
Indexing: Indexes on frequently queried columns (e.g., drug_id, cell_line_name) speed up JOINs and lookups.
Data Integrity: Foreign key constraints with ON DELETE CASCADE remove dependent rows when a parent row is deleted, so no orphaned records remain.
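
The features above can be tried out in a small, self-contained sketch. It again uses Python's sqlite3 as a stand-in for MySQL. SQLite has no stored procedures, so the hypothetical add_drug_trial function plays the role of sp_AddDrugTrial; the column names, the index name, and the ln_ic50 < 0 cutoff in the view are all assumptions rather than the project's actual definitions.

```python
import sqlite3


def setup():
    """Create a two-table sketch with an index, a view, and a cascading FK."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # required for FK checks in SQLite
    conn.executescript("""
        CREATE TABLE drugs (
            drug_id   INTEGER PRIMARY KEY,
            drug_name TEXT NOT NULL
        );
        CREATE TABLE drug_sensitivity (
            cell_line_name TEXT NOT NULL,
            drug_id        INTEGER NOT NULL,
            ln_ic50        REAL NOT NULL,
            PRIMARY KEY (cell_line_name, drug_id),
            FOREIGN KEY (drug_id) REFERENCES drugs (drug_id) ON DELETE CASCADE
        );
        -- Index on the join column (name is an assumption).
        CREATE INDEX idx_sensitivity_drug ON drug_sensitivity (drug_id);
        -- Lower IC50 means higher sensitivity; the 0.0 cutoff is arbitrary.
        CREATE VIEW vw_HighSensitivityDrugs AS
            SELECT d.drug_name, s.cell_line_name, s.ln_ic50
            FROM drug_sensitivity AS s
            JOIN drugs AS d ON d.drug_id = s.drug_id
            WHERE s.ln_ic50 < 0.0;
    """)
    return conn


def add_drug_trial(conn, cell_line_name, drug_id, ln_ic50):
    """Stand-in for sp_AddDrugTrial: insert one row, or roll back and return False."""
    try:
        with conn:  # commits on success, rolls back if the insert fails
            conn.execute(
                "INSERT INTO drug_sensitivity VALUES (?, ?, ?)",
                (cell_line_name, drug_id, ln_ic50),
            )
        return True
    except sqlite3.IntegrityError:  # unknown drug_id or duplicate key
        return False


if __name__ == "__main__":
    conn = setup()
    with conn:
        conn.execute("INSERT INTO drugs VALUES (1, 'drug_a')")  # made-up row
    print(add_drug_trial(conn, "line_x", 1, -1.5))
    print(add_drug_trial(conn, "line_x", 99, 0.3))  # drug 99 does not exist
    print(conn.execute("SELECT * FROM vw_HighSensitivityDrugs").fetchall())
```

Deleting a row from drugs removes its rows in drug_sensitivity through the cascade, so analysts querying the view never see measurements for a drug that no longer exists.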

Sample Analysis (SQL)

The schema makes it possible to answer complex biological questions directly in SQL:

Query 1: "Find the top 5 most effective drugs for 'Lung Cancer' cell lines." (JOIN, GROUP BY, ORDER BY)
Query 2: "Correlate Microsatellite Instability (MSI) status with drug resistance."
Query 3: "List all pathways targeted by a specific drug company."
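
As an illustration of Query 1, the sketch below ranks drugs by mean ln(IC50) across lung cell lines, lowest first, since a lower IC50 indicates a more effective drug. It runs against an in-memory SQLite database rather than MySQL, and the tissue column, the other column names, and all inserted rows are made-up assumptions, not GDSC data or the project's actual query.

```python
import sqlite3

# JOIN + GROUP BY + ORDER BY: the five drugs with the lowest mean ln(IC50).
TOP_DRUGS_SQL = """
SELECT d.drug_name, AVG(s.ln_ic50) AS mean_ln_ic50
FROM drug_sensitivity AS s
JOIN drugs AS d ON d.drug_id = s.drug_id
JOIN cell_lines AS c ON c.cell_line_id = s.cell_line_id
WHERE c.tissue = ?
GROUP BY d.drug_id, d.drug_name
ORDER BY mean_ln_ic50 ASC
LIMIT 5
"""


def demo_connection():
    """Build a tiny database with made-up rows for the example query."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cell_lines (
            cell_line_id INTEGER PRIMARY KEY, cell_line_name TEXT, tissue TEXT
        );
        CREATE TABLE drugs (drug_id INTEGER PRIMARY KEY, drug_name TEXT);
        CREATE TABLE drug_sensitivity (
            cell_line_id INTEGER, drug_id INTEGER, ln_ic50 REAL
        );
        INSERT INTO cell_lines VALUES
            (1, 'line_a', 'lung'), (2, 'line_b', 'lung'), (3, 'line_c', 'skin');
        INSERT INTO drugs VALUES (1, 'drug_a'), (2, 'drug_b');
        INSERT INTO drug_sensitivity VALUES
            (1, 1, -2.0), (2, 1, -1.0), (1, 2, 1.0), (2, 2, 3.0), (3, 2, -5.0);
    """)
    return conn


def top_drugs(conn, tissue):
    """Return (drug_name, mean_ln_ic50) pairs, most effective drug first."""
    return conn.execute(TOP_DRUGS_SQL, (tissue,)).fetchall()


if __name__ == "__main__":
    print(top_drugs(demo_connection(), "lung"))
```

The SELECT statement itself should carry over to MySQL largely unchanged; the parameter placeholder is the main difference, since MySQL drivers for Python typically use %s instead of ?.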

How to Run

Clone the repository:

git clone https://github.com/yourusername/GDSC-Cancer-Database.git

Import the Schema: Open MySQL Workbench, go to File > Open SQL Script and run schema.sql.
Load Data: Run the insert_data.sql script (or import CSVs via the Workbench Table Import Wizard).
Run Queries: Execute analysis_queries.sql to see the database in action.

Tech Stack

Database Engine: MySQL 8.0
Design Tool: MySQL Workbench (ER Modeling)
Languages: SQL (DDL, DML, DQL, DCL)
Source Data: Kaggle (GDSC)
