Database Design Group Project
Paper Title: Optimizing SQL Databases for Big Data Workloads: Techniques and Best Practices.
In this project, we've implemented the following optimization techniques from our paper:
- Indexing
- Caching
- Partitioning
- Sharding
We used the following Kaggle dataset for our implementation: https://www.kaggle.com/datasets/jpmiller/healthcare
Steps to run:
Software needed:
- Install MySQL
- Install MySQL Workbench
- Install Python
- Install the mysql-connector-python and pandas libraries to connect to the database (a quick import check follows this list)
- An IDE to run the scripts (e.g., VS Code)
- Download the dataset from the link above
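If you're unsure whether the libraries are installed correctly, a quick sanity check like the one below confirms both imports resolve. Note the PyPI package for the connector is mysql-connector-python; this snippet is just a check, not one of the project scripts.

```python
# Sanity check: both libraries import and report their versions.
# Install them first with: pip install mysql-connector-python pandas
import mysql.connector
import pandas as pd

print("mysql-connector:", mysql.connector.__version__)
print("pandas:", pd.__version__)
```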
MySQL Workbench setup:
- Once you install Workbench, move it to the Applications folder on your system
- Set up the root user and your password
- Run the following command in the query editor: CREATE DATABASE `YOUR_DB_NAME`; (note that MySQL uses backticks, not double quotes, for identifiers)
- Run the following commands so the large dataset can be loaded without connection errors:
  SET GLOBAL local_infile = 1;
  SET GLOBAL wait_timeout = 31536000;
  SET GLOBAL interactive_timeout = 31536000;
  SET GLOBAL net_read_timeout = 1200;
  SET GLOBAL net_write_timeout = 1200;
  SET GLOBAL max_allowed_packet = 1073741824;
Changes to be made in sqlConnector.py:
- Update host, user, password, and database in dbConnect (copy these details from MySQL Workbench); see the sketch after this list
- Update the hospitalsDatasetFilePath variable to point to hospitals.csv on your system
- Update the hospitalPricesDatasetFilePath variable to point to hospital_prices.csv on your system
- Save the changes, then run the scripts for each technique (indexing, caching, partitioning, sharding) as described in the sections below
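As a reference, here is a minimal sketch of what the connection setup in sqlConnector.py might look like. The host, user, password, database name, and file paths are placeholders; replace them with your own values from Workbench and your filesystem.

```python
import mysql.connector

# Placeholder credentials -- replace with the values from MySQL Workbench.
def dbConnect():
    return mysql.connector.connect(
        host="localhost",
        user="root",
        password="YOUR_PASSWORD",
        database="YOUR_DB_NAME",
        allow_local_infile=True,  # matches the SET GLOBAL local_infile = 1 step above
    )

# Placeholder paths -- point these at the two CSVs from the Kaggle dataset.
hospitalsDatasetFilePath = "/path/to/hospitals.csv"
hospitalPricesDatasetFilePath = "/path/to/hospital_prices.csv"
```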
Steps to run Indexing:
- First run sqlConnector.py to connect to the database and create and load the tables
- Then run indexing.py to execute the queries in queries.py and view their execution times
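For orientation, here is a minimal sketch of the before/after timing idea, assuming an illustrative hospitals table with a state column; the project's real queries live in queries.py.

```python
import time
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="YOUR_PASSWORD", database="YOUR_DB_NAME")
cur = conn.cursor()

# Illustrative query -- the project's real queries live in queries.py.
query = "SELECT COUNT(*) FROM hospitals WHERE state = 'CA'"

def timed(sql):
    # Return the wall-clock time taken to execute and fetch a query.
    start = time.perf_counter()
    cur.execute(sql)
    cur.fetchall()
    return time.perf_counter() - start

print("without index:", timed(query))
cur.execute("CREATE INDEX idx_state ON hospitals (state)")  # add the index
print("with index:   ", timed(query))

cur.close()
conn.close()
```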
Steps to run Caching:
- First run sqlConnectorCaching.py to connect to the database and create the cache table
- Then run Caching.py to execute the queries in queries.py and view their execution times
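The cache-table approach can be sketched as below: the first run of a query stores its result, and repeats are served from the cache table. The query_cache schema and column names are our illustrative assumptions, not necessarily the project's exact design.

```python
import json
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="YOUR_PASSWORD", database="YOUR_DB_NAME")
cur = conn.cursor()

# Cache table keyed by the query text; results are stored as JSON.
cur.execute("""
    CREATE TABLE IF NOT EXISTS query_cache (
        query_text VARCHAR(500) PRIMARY KEY,
        result     JSON
    )
""")

def cached_query(sql):
    # Serve a repeated query from the cache table when possible.
    cur.execute("SELECT result FROM query_cache WHERE query_text = %s", (sql,))
    row = cur.fetchone()
    if row is not None:          # cache hit
        return json.loads(row[0])
    cur.execute(sql)             # cache miss: run the query and store it
    result = cur.fetchall()
    cur.execute("INSERT INTO query_cache (query_text, result) VALUES (%s, %s)",
                (sql, json.dumps(result, default=str)))  # default=str handles Decimal/date values
    conn.commit()
    return result
```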
Steps to run Range Partitioning:
- First run sqlConnectorRangePartitioning.py to create the tables and set up the range partitions
- Next run rangePartitioning.py to fetch the queries from queries.py, execute them, and view their execution times
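A minimal sketch of range partitioning in MySQL follows, assuming an illustrative price column and boundaries; the actual partitioned columns and ranges are defined in sqlConnectorRangePartitioning.py.

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="YOUR_PASSWORD", database="YOUR_DB_NAME")
cur = conn.cursor()

# MySQL requires the partitioning column to appear in every unique key,
# hence the composite primary key here.
cur.execute("""
    CREATE TABLE IF NOT EXISTS hospital_prices_range (
        id    INT NOT NULL,
        price INT NOT NULL,
        PRIMARY KEY (id, price)
    )
    PARTITION BY RANGE (price) (
        PARTITION p_low  VALUES LESS THAN (1000),
        PARTITION p_mid  VALUES LESS THAN (10000),
        PARTITION p_high VALUES LESS THAN MAXVALUE
    )
""")
conn.commit()
cur.close()
conn.close()
```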
Steps to run Hash Partitioning:
- First run sqlConnectorHashPartitioning.py to create the tables and set up the hash partitions
- Next run hashPartitioning.py to fetch the queries from queries.py, execute them, and view their execution times
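Similarly, a minimal sketch of hash partitioning, assuming an illustrative hospital_id key and four partitions; the actual setup lives in sqlConnectorHashPartitioning.py.

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="YOUR_PASSWORD", database="YOUR_DB_NAME")
cur = conn.cursor()

# For an integer column, MySQL assigns each row to partition MOD(hospital_id, 4).
cur.execute("""
    CREATE TABLE IF NOT EXISTS hospital_prices_hash (
        hospital_id INT NOT NULL,
        price       INT,
        PRIMARY KEY (hospital_id)
    )
    PARTITION BY HASH (hospital_id)
    PARTITIONS 4
""")
conn.commit()
cur.close()
conn.close()
```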
Steps to run Sharding:
- Create databases: 'hospitals_shard_1', 'hospitals_shard_2', 'hospitals_shard_3'.
- Execute sqlConnectorSharding.py to connect to the databases, create the tables, execute the queries, and measure the execution times
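Conceptually, the sharding script routes each row to one of the three shard databases created above. Here is a hedged sketch of that routing, with illustrative table and column names rather than the project's exact schema.

```python
import mysql.connector

SHARDS = ["hospitals_shard_1", "hospitals_shard_2", "hospitals_shard_3"]

def shard_conn(key):
    # Deterministically map an integer key to one of the three shard databases.
    db = SHARDS[key % len(SHARDS)]
    return mysql.connector.connect(host="localhost", user="root",
                                   password="YOUR_PASSWORD", database=db)

def insert_hospital(hospital_id, name):
    # Write each row only to the shard that owns its key.
    conn = shard_conn(hospital_id)
    cur = conn.cursor()
    cur.execute("INSERT INTO hospitals (id, name) VALUES (%s, %s)",
                (hospital_id, name))
    conn.commit()
    cur.close()
    conn.close()
```

Because the key-to-shard mapping is deterministic, reads for a given key can be routed to the same shard without consulting a lookup table.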
Team Members:
- Manav Kothari
- Keerthi Anand
- Harshitha Devina Anto
- Veronica Chittora
- Siddhi Patil