In this project, I have compared different techniques for transforming a large dataset and measured the time taken by each technique.
For this, I created a table in MySQL, Extracted the data from it, Transformed it, and then Loaded it back into a table. This ETL process has been implemented in 3 cases (discussed below), and I have analyzed how long each case takes to run. The whole process has been repeated for different numbers of rows in the table: 5000, 10,000, 15,000, 20,000 and 25,000 rows.
Database name: mydb
Table Column names: id, name, age, address, gender
Transformations applied on all records:
- Converted name to uppercase
- Added 1 to original age
- Converted address to uppercase
- Complemented the gender
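The four transformations above can be sketched as a single record-level function. The tuple layout follows the table columns (id, name, age, address, gender); the M/F encoding for gender is an assumption, since the README does not state how gender is stored.

```python
# Sketch of the four transformations applied to one record.
# Record layout assumed: (id, name, age, address, gender).
def transform(record):
    rid, name, age, address, gender = record
    return (
        rid,
        name.upper(),                    # name to uppercase
        age + 1,                         # add 1 to original age
        address.upper(),                 # address to uppercase
        "F" if gender == "M" else "M",   # complement the gender (assumed M/F)
    )
```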
CASE 0:
- Extracted data from table 'student1'
- Transformed the data while Loading it into table 'studentc1'
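One way CASE 0 could be implemented is a single `INSERT ... SELECT` that applies all four transformations inside the database, so no rows pass through Python. The sketch below uses an in-memory SQLite database as a stand-in for the MySQL database 'mydb' (an assumption for testability); the same statement works in MySQL via a connector such as mysql-connector-python.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; table and column names follow the README.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student1 (id INTEGER, name TEXT, age INTEGER, address TEXT, gender TEXT)")
conn.execute("CREATE TABLE studentc1 (id INTEGER, name TEXT, age INTEGER, address TEXT, gender TEXT)")
conn.execute("INSERT INTO student1 VALUES (1, 'alice', 20, 'delhi', 'F')")

# Transform while loading: the SELECT applies all four transformations in SQL.
conn.execute("""
    INSERT INTO studentc1
    SELECT id,
           UPPER(name),
           age + 1,
           UPPER(address),
           CASE gender WHEN 'M' THEN 'F' ELSE 'M' END
    FROM student1
""")
print(conn.execute("SELECT * FROM studentc1").fetchall())
```

Pushing the transformation into the database avoids a Python-side loop entirely, which is why this case can compete with the multithreaded one on some row counts.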
CASE 1:
- Extracted data from table 'student2' to a file 'output.csv'
- Applied the transformations on the file.
- Loaded the data from the file to the table 'studentc2'
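The file-based pipeline of CASE 1 could look like the sketch below: dump rows to 'output.csv', re-read the file, and apply the transformations row by row. The sample rows are illustrative, and the final load back into 'studentc2' (e.g. via `executemany` or `LOAD DATA INFILE` in MySQL) is elided.

```python
import csv

# Illustrative rows in the (id, name, age, address, gender) layout.
rows = [[1, "alice", 20, "delhi", "F"], [2, "bob", 21, "agra", "M"]]

# Extract step: write the rows to 'output.csv'.
with open("output.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Transform step: read the file back and apply the four transformations.
# Note that csv yields strings, so age must be converted before adding 1.
with open("output.csv", newline="") as f:
    transformed = [
        [r[0], r[1].upper(), int(r[2]) + 1, r[3].upper(),
         "F" if r[4] == "M" else "M"]
        for r in csv.reader(f)
    ]
print(transformed)
```

The extra serialize/parse round trip through the file is a plausible reason this case is consistently the slowest in the results table.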
CASE 2:
- Extracted data from table 'student3' into a number of small files (each file contains a subset of the whole data)
- Applied the transformations on all the files simultaneously using multithreading (using 5 threads)
- Loaded the data from the files to the table 'studentc3'
For multithreading, I have used the ThreadPoolExecutor class from Python's concurrent.futures module.
For all the above 3 cases and all the number of records, I have noted the time in 'analysis.csv' file.
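The per-run timing could be captured with a small wrapper like the one below; `run_case` is a hypothetical stand-in for one of the three ETL cases, and the row format written to 'analysis.csv' is an assumption.

```python
import csv
import time

def timed(run_case):
    # Time one ETL run with a monotonic high-resolution clock.
    start = time.perf_counter()
    run_case()
    return time.perf_counter() - start

# Stand-in workload in place of a real ETL case.
elapsed = timed(lambda: sum(range(1000)))

# Append one result row: (row count, case number, elapsed time) -- assumed layout.
with open("analysis.csv", "a", newline="") as f:
    csv.writer(f).writerow([5000, 0, elapsed])
```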
Result in analysis.csv:
In the table below, the columns correspond to the 3 cases (CASE 0, 1, 2) and the rows to the different record counts (5000, 10,000, 15,000, 20,000 and 25,000 rows).
| ROWS  | CASE 0      | CASE 1      | CASE 2      |
|-------|-------------|-------------|-------------|
| 5000  | 1.890067577 | 5.041134357 | 0.242992401 |
| 10000 | 8.837369204 | 10.03750229 | 0.488361597 |
| 15000 | 1.531428337 | 4.055989265 | 1.608299494 |
| 20000 | 3.995323658 | 7.610261917 | 1.230496168 |
| 25000 | 1.074378967 | 7.832616568 | 0.915403366 |
From the table, we can conclude that CASE 2, i.e. the multithreaded case, takes the least time in four of the five runs (at 15,000 rows CASE 0 is marginally faster), making it the fastest approach overall.