Batch Processing Data Pipeline

Workflow

Data

The dataset used is yellow taxi trip data taken from the website NYC goverment data source. The data has a parquet format and is packed every month

The Yellow Taxi Trip Records dataset is a collection of data about yellow taxi trips taken in New York City. The dataset includes information about pickup and drop-off times and locations, fares, and other details of each trip. The data is collected from GPS devices installed in the taxis and covers trips taken in all five boroughs of New York City. The dataset is used by researchers and analysts to study transportation patterns and traffic flows, and to gain insights into urban life.

Sample Dataset:

VendorID	tpep_pickup_datetime	tpep_dropoff_datetime	passenger_count	trip_distance	RatecodeID	store_and_fwd_flag	PULocationID	DOLocationID	payment_type	fare_amount	extra	mta_tax	tip_amount	improvement_surcharge	total_amount	congestion_surcharge
1	2022-01-01 00:35:40.000	2022-01-01 00:53:29.000	2	3,8	1	N	142	236	1	14,5	3	0,5	3,65	0,3	21,95	2,5
1	2022-01-01 00:33:43.000	2022-01-01 00:42:07.000	1	2,1	1	N	236	42	1	8	0,5	0,5	4	0,3	13,3	0
2	2022-01-01 00:53:21.000	2022-01-01 01:02:19.000	1	0,97	1	N	166	166	1	7,5	0,5	0,5	1,76	0,3	10,56	0
2	2022-01-01 00:25:21.000	2022-01-01 00:35:23.000	1	1,09	1	N	114	68	2	8	0,5	0,5	0	0,3	11,8	2,5
2	2022-01-01 00:36:48.000	2022-01-01 01:14:20.000	1	4,3	1	N	68	163	1	23,5	0,5	0,5	3	0,3	30,3	2,5

Technologies Used

This pipeline is built using the following technologies:

Python - a programming language used for data engineering and processing. The libraries and dependencies used are listed in requirements.txt
Apache Spark (PySpark) - a fast and distributed processing engine used for large-scale data processing.
Apache Airflow - to schedule and orchestrate the data engineering pipeline
Azure Blob Storage - to store the parquet data files in a scalable and cost-effective manner.
Azure SQL Database - to store and manage the processed data from the Yellow Taxi Trip Records dataset in a structured and efficient manner after transformed with pyspark.

Project Structure

Ingesting

In this process the parquet file is taken with the url then stored in blob storage. The url that is retrieved is only the url of the file that has not been stored in blob storage, so that when the python code is executed it will only retrieve new data and does not cause duplicate data. This process is run using Extract.py.

Transform

Data is retrieved from blob storage using pyspark, then transforms by adding the following columns:

Compute the trip duration in minutes from the "tpep_pickup_datetime" and "tpep_dropoff_datetime".
Extract the year, month, day of the week, and quarter from the "tpep_pickup_datetime" column.
Format the day of the week and month name columns.

Additional column after the transformation process:

trip_duration_minutes	year	month	day_of_week	day_name	month_name	quarter
17,816666666666666	2022	1	7	Saturday	January	1
8,4	2022	1	7	Saturday	January	1
8,966666666666667	2022	1	7	Saturday	January	1
10,033333333333333	2022	1	7	Saturday	January	1
37,53333333333333	2022	1	7	Saturday	January	1

Transformation process with pyspark using Transform.py.

Load

Then the data that has been transformed is uploaded to the Azure SQL database. The upload process will only add data that is not yet in the database, so there will be no duplication of data. This process uses Load.py.

Scheduling

Python code is run once a month on day 11 using file taxi_dag.py. Scheduling using apache airflow which is installed locally due to lack of resources, so the data may not be updated.

*NB: Python code is written based on OOP best practices

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
__pycache__		__pycache__
venv		venv
.gitattributes		.gitattributes
.gitignore		.gitignore
Extract.py		Extract.py
Load.py		Load.py
Pipeline Workflow.png		Pipeline Workflow.png
README.md		README.md
Transform.py		Transform.py
requirements.txt		requirements.txt
taxi_dag.py		taxi_dag.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Batch Processing Data Pipeline

Workflow

Data

Technologies Used

Project Structure

Ingesting

Transform

Load

Scheduling

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Batch Processing Data Pipeline

Workflow

Data

Technologies Used

Project Structure

Ingesting

Transform

Load

Scheduling

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages