In this project, I have compared different techniques for transforming a large dataset and measured the time taken by each technique.
For this, I created a table in MySQL, Extracted the data from it, Transformed it, and then Loaded it back into a table. This ETL process has been implemented in 3 cases (discussed below), and I have analyzed how long each case takes to run. The whole process has been repeated for different numbers of rows in the table: 5000, 10,000, 15,000, 20,000 and 25,000 rows.
Database name: mydb
Table Column names: id, name, age, address, gender
Transformations applied on all records:
- Converted name to uppercase
- Added 1 to original age
- Converted address to uppercase
- Complemented the gender
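The four transformations above can be sketched as a single record-level function. The tuple layout follows the table columns (id, name, age, address, gender); the M/F encoding for gender is an assumption, since the README does not state how gender is stored.

```python
# Sketch of the four transformations applied to one record.
# Record layout assumed: (id, name, age, address, gender).
def transform(record):
    rid, name, age, address, gender = record
    return (
        rid,
        name.upper(),                    # name to uppercase
        age + 1,                         # add 1 to original age
        address.upper(),                 # address to uppercase
        "F" if gender == "M" else "M",   # complement the gender (assumed M/F)
    )
```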
CASE 0:
- Extracted data from table 'student1'
- Transformed the data while Loading it into table 'studentc1'
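One way CASE 0 could be implemented is a single `INSERT ... SELECT` that applies all four transformations inside the database, so no rows pass through Python. The sketch below uses an in-memory SQLite database as a stand-in for the MySQL database 'mydb' (an assumption for testability); the same statement works in MySQL via a connector such as mysql-connector-python.

```python
import sqlite3

# In-memory SQLite stand-in for MySQL; table and column names follow the README.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student1 (id INTEGER, name TEXT, age INTEGER, address TEXT, gender TEXT)")
conn.execute("CREATE TABLE studentc1 (id INTEGER, name TEXT, age INTEGER, address TEXT, gender TEXT)")
conn.execute("INSERT INTO student1 VALUES (1, 'alice', 20, 'delhi', 'F')")

# Transform while loading: the SELECT applies all four transformations in SQL.
conn.execute("""
    INSERT INTO studentc1
    SELECT id,
           UPPER(name),
           age + 1,
           UPPER(address),
           CASE gender WHEN 'M' THEN 'F' ELSE 'M' END
    FROM student1
""")
print(conn.execute("SELECT * FROM studentc1").fetchall())
```

Pushing the transformation into the database avoids a Python-side loop entirely, which is why this case can compete with the multithreaded one on some row counts.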
CASE 1:
- Extracted data from table 'student2' to a file 'output.csv'
- Applied the transformations on the file.
- Loaded the data from the file to the table 'studentc2'
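The file-based pipeline of CASE 1 could look like the sketch below: dump rows to 'output.csv', re-read the file, and apply the transformations row by row. The sample rows are illustrative, and the final load back into 'studentc2' (e.g. via `executemany` or `LOAD DATA INFILE` in MySQL) is elided.

```python
import csv

# Illustrative rows in the (id, name, age, address, gender) layout.
rows = [[1, "alice", 20, "delhi", "F"], [2, "bob", 21, "agra", "M"]]

# Extract step: write the rows to 'output.csv'.
with open("output.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Transform step: read the file back and apply the four transformations.
# Note that csv yields strings, so age must be converted before adding 1.
with open("output.csv", newline="") as f:
    transformed = [
        [r[0], r[1].upper(), int(r[2]) + 1, r[3].upper(),
         "F" if r[4] == "M" else "M"]
        for r in csv.reader(f)
    ]
print(transformed)
```

The extra serialize/parse round trip through the file is a plausible reason this case is consistently the slowest in the results table.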
CASE 2:
- Extracted data from table 'student3' into a number of small files (each file contains a subset of the whole data)
- Applied the transformations on all the files simultaneously using multithreading (using 5 threads)
- Loaded the data from the files to the table 'studentc3'
For multithreading, I have used the ThreadPoolExecutor class from Python's concurrent.futures module.
For all the above 3 cases and all the number of records, I have noted the time in 'analysis.csv' file.
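The per-run timing could be captured with a small wrapper like the one below; `run_case` is a hypothetical stand-in for one of the three ETL cases, and the row format written to 'analysis.csv' is an assumption.

```python
import csv
import time

def timed(run_case):
    # Time one ETL run with a monotonic high-resolution clock.
    start = time.perf_counter()
    run_case()
    return time.perf_counter() - start

# Stand-in workload in place of a real ETL case.
elapsed = timed(lambda: sum(range(1000)))

# Append one result row: (row count, case number, elapsed time) -- assumed layout.
with open("analysis.csv", "a", newline="") as f:
    csv.writer(f).writerow([5000, 0, elapsed])
```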
Result in analysis.csv:
In the table below, the columns correspond to the 3 cases (CASE 0, 1, 2) and the rows to the different record counts (5000, 10,000, 15,000, 20,000 and 25,000 rows).
| ROWS  | CASE 0      | CASE 1      | CASE 2      |
|-------|-------------|-------------|-------------|
| 5000  | 1.890067577 | 5.041134357 | 0.242992401 |
| 10000 | 8.837369204 | 10.03750229 | 0.488361597 |
| 15000 | 1.531428337 | 4.055989265 | 1.608299494 |
| 20000 | 3.995323658 | 7.610261917 | 1.230496168 |
| 25000 | 1.074378967 | 7.832616568 | 0.915403366 |
From the table, we can conclude that CASE 2, i.e. the multithreaded case, takes the least time in four of the five runs (at 15,000 rows CASE 0 is marginally faster), making it the fastest approach overall.