Database Design Group Project
Paper Title: Optimizing SQL Databases for Big Data Workloads: Techniques and Best Practices.
In this project, we've implemented the following optimization techniques from our paper:
- Indexing
- Caching
- Partitioning
- Sharding
We used the following Kaggle dataset for our implementation: https://www.kaggle.com/datasets/jpmiller/healthcare
Steps to run:
Software needed:
- Install MySQL
- Install MySQL Workbench
- Install Python
- Install the mysql-connector-python and pandas libraries to connect to the database (a quick import check follows this list)
- An IDE to run the scripts (e.g., VS Code)
- Download the dataset from the link above
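If you're unsure whether the libraries are installed correctly, a quick sanity check like the one below confirms both imports resolve. Note the PyPI package for the connector is mysql-connector-python; this snippet is just a check, not one of the project scripts.

```python
# Sanity check: both libraries import and report their versions.
# Install them first with: pip install mysql-connector-python pandas
import mysql.connector
import pandas as pd

print("mysql-connector:", mysql.connector.__version__)
print("pandas:", pd.__version__)
```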
MySQL Workbench setup:
- Once you install Workbench, move it to the Applications folder on your system
- Set up the root user and your password
- Run the following command in the query editor: CREATE DATABASE `YOUR_DB_NAME`; (note that MySQL uses backticks, not double quotes, for identifiers)
- Run the following commands so the large dataset can be loaded without connection errors:
  SET GLOBAL local_infile = 1;
  SET GLOBAL wait_timeout = 31536000;
  SET GLOBAL interactive_timeout = 31536000;
  SET GLOBAL net_read_timeout = 1200;
  SET GLOBAL net_write_timeout = 1200;
  SET GLOBAL max_allowed_packet = 1073741824;
Changes to be made in sqlConnector.py:
- Update host, user, password, and database in dbConnect (copy these details from MySQL Workbench); see the sketch after this list
- Update the hospitalsDatasetFilePath variable to point to hospitals.csv on your system
- Update the hospitalPricesDatasetFilePath variable to point to hospital_prices.csv on your system
- Save the changes, then run the scripts for each technique (indexing, caching, partitioning, sharding) as described in the sections below
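As a reference, here is a minimal sketch of what the connection setup in sqlConnector.py might look like. The host, user, password, database name, and file paths are placeholders; replace them with your own values from Workbench and your filesystem.

```python
import mysql.connector

# Placeholder credentials -- replace with the values from MySQL Workbench.
def dbConnect():
    return mysql.connector.connect(
        host="localhost",
        user="root",
        password="YOUR_PASSWORD",
        database="YOUR_DB_NAME",
        allow_local_infile=True,  # matches the SET GLOBAL local_infile = 1 step above
    )

# Placeholder paths -- point these at the two CSVs from the Kaggle dataset.
hospitalsDatasetFilePath = "/path/to/hospitals.csv"
hospitalPricesDatasetFilePath = "/path/to/hospital_prices.csv"
```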
Steps to run Indexing:
- First run sqlConnector.py to connect to the database and create and load the tables
- Then run indexing.py to execute the queries in queries.py and view their execution times
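For orientation, here is a minimal sketch of the before/after timing idea, assuming an illustrative hospitals table with a state column; the project's real queries live in queries.py.

```python
import time
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="YOUR_PASSWORD", database="YOUR_DB_NAME")
cur = conn.cursor()

# Illustrative query -- the project's real queries live in queries.py.
query = "SELECT COUNT(*) FROM hospitals WHERE state = 'CA'"

def timed(sql):
    # Return the wall-clock time taken to execute and fetch a query.
    start = time.perf_counter()
    cur.execute(sql)
    cur.fetchall()
    return time.perf_counter() - start

print("without index:", timed(query))
cur.execute("CREATE INDEX idx_state ON hospitals (state)")  # add the index
print("with index:   ", timed(query))

cur.close()
conn.close()
```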
Steps to run Caching:
- First run sqlConnectorCaching.py to connect to the database and create the cache table
- Then run Caching.py to execute the queries in queries.py and view their execution times
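The cache-table approach can be sketched as below: the first run of a query stores its result, and repeats are served from the cache table. The query_cache schema and column names are our illustrative assumptions, not necessarily the project's exact design.

```python
import json
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="YOUR_PASSWORD", database="YOUR_DB_NAME")
cur = conn.cursor()

# Cache table keyed by the query text; results are stored as JSON.
cur.execute("""
    CREATE TABLE IF NOT EXISTS query_cache (
        query_text VARCHAR(500) PRIMARY KEY,
        result     JSON
    )
""")

def cached_query(sql):
    # Serve a repeated query from the cache table when possible.
    cur.execute("SELECT result FROM query_cache WHERE query_text = %s", (sql,))
    row = cur.fetchone()
    if row is not None:          # cache hit
        return json.loads(row[0])
    cur.execute(sql)             # cache miss: run the query and store it
    result = cur.fetchall()
    cur.execute("INSERT INTO query_cache (query_text, result) VALUES (%s, %s)",
                (sql, json.dumps(result, default=str)))  # default=str handles Decimal/date values
    conn.commit()
    return result
```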
Steps to run Range Partitioning:
- First run sqlConnectorRangePartitioning.py to create the tables and set up the range partitions
- Next run rangePartitioning.py to fetch the queries from queries.py, execute them, and view their execution times
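A minimal sketch of range partitioning in MySQL follows, assuming an illustrative price column and boundaries; the actual partitioned columns and ranges are defined in sqlConnectorRangePartitioning.py.

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="YOUR_PASSWORD", database="YOUR_DB_NAME")
cur = conn.cursor()

# MySQL requires the partitioning column to appear in every unique key,
# hence the composite primary key here.
cur.execute("""
    CREATE TABLE IF NOT EXISTS hospital_prices_range (
        id    INT NOT NULL,
        price INT NOT NULL,
        PRIMARY KEY (id, price)
    )
    PARTITION BY RANGE (price) (
        PARTITION p_low  VALUES LESS THAN (1000),
        PARTITION p_mid  VALUES LESS THAN (10000),
        PARTITION p_high VALUES LESS THAN MAXVALUE
    )
""")
conn.commit()
cur.close()
conn.close()
```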
Steps to run Hash Partitioning:
- First run sqlConnectorHashPartitioning.py to create the tables and set up the hash partitions
- Next run hashPartitioning.py to fetch the queries from queries.py, execute them, and view their execution times
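Similarly, a minimal sketch of hash partitioning, assuming an illustrative hospital_id key and four partitions; the actual setup lives in sqlConnectorHashPartitioning.py.

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root",
                               password="YOUR_PASSWORD", database="YOUR_DB_NAME")
cur = conn.cursor()

# For an integer column, MySQL assigns each row to partition MOD(hospital_id, 4).
cur.execute("""
    CREATE TABLE IF NOT EXISTS hospital_prices_hash (
        hospital_id INT NOT NULL,
        price       INT,
        PRIMARY KEY (hospital_id)
    )
    PARTITION BY HASH (hospital_id)
    PARTITIONS 4
""")
conn.commit()
cur.close()
conn.close()
```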
Steps to run Sharding:
- Create databases: 'hospitals_shard_1', 'hospitals_shard_2', 'hospitals_shard_3'.
- Execute sqlConnectorSharding.py to connect to the databases, create the tables, execute the queries, and measure the execution times
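Conceptually, the sharding script routes each row to one of the three shard databases created above. Here is a hedged sketch of that routing, with illustrative table and column names rather than the project's exact schema.

```python
import mysql.connector

SHARDS = ["hospitals_shard_1", "hospitals_shard_2", "hospitals_shard_3"]

def shard_conn(key):
    # Deterministically map an integer key to one of the three shard databases.
    db = SHARDS[key % len(SHARDS)]
    return mysql.connector.connect(host="localhost", user="root",
                                   password="YOUR_PASSWORD", database=db)

def insert_hospital(hospital_id, name):
    # Write each row only to the shard that owns its key.
    conn = shard_conn(hospital_id)
    cur = conn.cursor()
    cur.execute("INSERT INTO hospitals (id, name) VALUES (%s, %s)",
                (hospital_id, name))
    conn.commit()
    cur.close()
    conn.close()
```

Because the key-to-shard mapping is deterministic, reads for a given key can be routed to the same shard without consulting a lookup table.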
Team Members:
- Manav Kothari
- Keerthi Anand
- Harshitha Devina Anto
- Veronica Chittora
- Siddhi Patil