Formula 2 End-to-End ETLT Pipeline

Overview

This project aims to create an automated pipeline for extracting, transforming, and updating the Formula 2 dataset on Kaggle.

Data Source: www.fiaformula2.com
Destination: Kaggle Dataset

Architecture

Details

Extraction: Data is extracted from the official F2 website using race IDs. The extracted data is stored as a CSV file in an AWS S3 bucket. The extraction is orchestrated by Apache Airflow running in a Docker container on an AWS EC2 instance. Web scraping is performed with BeautifulSoup and Pandas.
Transformation: A Lambda function reads the raw race data from the S3 bucket and performs necessary transformations. The transformed data is then concatenated with the existing data and stored in another AWS S3 bucket. Pandas is used for the transformation process.
Upload to Kaggle: A second Lambda function, triggered by AWS EventBridge, uses the Kaggle API to upload the updated data. Each upload creates a new version of the dataset on Kaggle.
Database Generation: Optionally, AWS Glue can be used to generate a database from the updated dataset. This enables executing queries using Amazon Athena.
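
The transformation step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual lambda_load.py: the column names (race_id, driver, position), the whitespace clean-up, and the helper name append_race are assumptions. The S3 reads and writes are replaced by in-memory CSV text so the sketch runs without AWS access.

```python
import io

import pandas as pd


def append_race(existing_csv: str, raw_csv: str) -> str:
    """Concatenate newly scraped race rows onto the existing dataset (hypothetical helper)."""
    # Keep race IDs as strings, matching the str format used by the DAG.
    existing = pd.read_csv(io.StringIO(existing_csv), dtype={"race_id": str})
    raw = pd.read_csv(io.StringIO(raw_csv), dtype={"race_id": str})

    # Assumed clean-up step: trim stray whitespace left over from scraping.
    raw["driver"] = raw["driver"].str.strip()

    combined = pd.concat([existing, raw], ignore_index=True)
    # Re-running the pipeline for the same race should not duplicate rows.
    combined = combined.drop_duplicates(subset=["race_id", "driver"], keep="last")
    return combined.to_csv(index=False)
```

In a real Lambda handler, the two CSV strings would come from S3 object bodies and the returned text would be written to the destination bucket.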

The pipeline enables regular updates of the Formula 2 dataset on Kaggle, ensuring that it remains up-to-date with the latest race information.
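
A rough sketch of how a new dataset version might be published with the Kaggle API is shown here. The function names, the dataset ID, and the license entry are placeholders chosen for illustration, not values from this project. The Kaggle client reads a dataset-metadata.json file from the upload folder, so the sketch writes that file first.

```python
import json
from pathlib import Path


def write_dataset_metadata(folder: str, dataset_id: str, title: str) -> Path:
    """Write the dataset-metadata.json file that the Kaggle client reads from the upload folder."""
    metadata = {
        "id": dataset_id,  # placeholder, in "username/dataset-slug" form
        "title": title,
        "licenses": [{"name": "CC0-1.0"}],  # placeholder license, an assumption
    }
    path = Path(folder) / "dataset-metadata.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(metadata, indent=2))
    return path


def publish_new_version(folder: str, notes: str) -> None:
    """Create a new dataset version from the files in `folder`."""
    # Imported lazily: the kaggle package may look for credentials as soon as it is imported.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()  # uses KAGGLE_USERNAME / KAGGLE_KEY or ~/.kaggle/kaggle.json
    api.dataset_create_version(folder, version_notes=notes)
```

Inside a Lambda function the upload folder would need to live under /tmp, since that is the only writable path in the execution environment.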

Special requirements:

Configure AWS account through AWS CLI (to interact with AWS)
Configure AWS connection via Airflow UI
Docker / Docker-Compose (to run Apache Airflow)
AWS role with permission to read and write S3 objects
Kaggle API key
Pandas layer for Lambda function

Files Notes

f2-dag.py: The data extraction task takes a list of race IDs, each as a str
utils.py: Utility functions; the race IDs are defined here
lambda_load.py: Lambda function for transformation and load. PUT/GET permissions on the S3 bucket are required
draft.ipynb: Notebook to test each function separately
test_dag_integrity.py: Assess DAG integrity
DATA/: Data cataloged by race_id, event, and season
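
To illustrate the notes above, here is a small sketch of how race IDs could be normalized to the list-of-str form and mapped to a storage path organized by season, event, and race ID. The names RaceRef, raw_key, and as_dag_argument, the key layout, and the example IDs are all hypothetical. The project's real IDs live in utils.py.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RaceRef:
    """Identifies one race; the fields mirror how DATA is cataloged."""
    race_id: str
    event: str
    season: int


def raw_key(race: RaceRef) -> str:
    """Build a hypothetical object key grouped by season, event, and race ID."""
    event_slug = race.event.lower().replace(" ", "-")
    return f"DATA/{race.season}/{event_slug}/{race.race_id}.csv"


def as_dag_argument(race_ids) -> list[str]:
    """Normalize race IDs to a list of stripped strings."""
    return [str(r).strip() for r in race_ids]
```

For example, as_dag_argument([1045, " 1046 "]) yields ["1045", "1046"], so IDs typed as integers or copied with stray spaces still reach the extraction task as clean strings.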

Data description

Detailed description here: Kaggle Dataset

Alarchemn/F2-Data-Pipeline
