The primary objective of this project is to consolidate data from several databases containing information on movies, actors, ratings, and comments into a single unified database, making it possible to view the ratings and comments associated with each movie.
This ETL (Extract, Transform, Load) pipeline is developed using Python, dbt, PostgreSQL, Docker, and Apache Airflow.
The initial step extracts data from the source databases, which are pre-populated with the data in the source_db_init folder.
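As an illustration, a seed script in that folder might look like the sketch below; the table names, columns, and sample rows are hypothetical placeholders, since the actual schemas are defined by the files in source_db_init:

```sql
-- Hypothetical seed script (e.g. a file inside source_db_init/);
-- the real table definitions live in that folder.
CREATE TABLE IF NOT EXISTS movies (
    movie_id SERIAL PRIMARY KEY,
    title    TEXT NOT NULL,
    year     INTEGER
);

INSERT INTO movies (title, year) VALUES
    ('The Matrix', 1999),
    ('Inception', 2010);
```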
The etl_script folder contains the Python scripts that handle extraction and loading: they read the data from the source databases and write it into the destination database.
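A minimal sketch of what such a script might do is shown below, using psycopg2; the connection strings and table names are hypothetical placeholders, not the project's actual settings:

```python
# Sketch of an extract-and-load step. The DSNs and table names below are
# illustrative placeholders; the real scripts in etl_script define their own.
import psycopg2

SOURCE_DSN = "host=source_db dbname=movies user=postgres password=postgres"
DEST_DSN = "host=dest_db dbname=warehouse user=postgres password=postgres"

def copy_table(table: str) -> None:
    """Copy every row of one table from the source to the destination database."""
    with psycopg2.connect(SOURCE_DSN) as src, psycopg2.connect(DEST_DSN) as dst:
        with src.cursor() as read_cur, dst.cursor() as write_cur:
            # Table names come from a fixed list below, not user input.
            read_cur.execute(f"SELECT * FROM {table}")
            rows = read_cur.fetchall()
            if rows:
                placeholders = ", ".join(["%s"] * len(rows[0]))
                write_cur.executemany(
                    f"INSERT INTO {table} VALUES ({placeholders})", rows
                )

if __name__ == "__main__":
    for table in ("movies", "ratings", "comments"):
        copy_table(table)
```

For larger tables, batched fetches or PostgreSQL's COPY would be preferable to a single fetchall, but the shape of the extract-then-load step is the same.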
Once the data is loaded into the destination database, dbt (data build tool) transforms it: dbt models run SQL against PostgreSQL to combine the data from the different sources into a single unified database containing both the comments and the ratings for each movie.
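As an illustration, a dbt model for this kind of join might look like the following; the model, source, and column names are hypothetical, and the source() references assume sources declared in the project's dbt schema files:

```sql
-- models/movie_feedback.sql (hypothetical name): combines ratings and
-- comments per movie. Source and column names are illustrative placeholders.
select
    m.movie_id,
    m.title,
    avg(r.rating)                as avg_rating,
    count(distinct c.comment_id) as comment_count
from {{ source('destination', 'movies') }}   as m
left join {{ source('destination', 'ratings') }}  as r on r.movie_id = m.movie_id
left join {{ source('destination', 'comments') }} as c on c.movie_id = m.movie_id
group by m.movie_id, m.title
```

Running `dbt run` against the destination database would materialize a model like this as a table or view, depending on the configured materialization.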
To run this project, ensure you have the following installed:
- Python 3
- dbt
- Docker
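To confirm the prerequisites are available, each of the following commands should print a version number:

```bash
python3 --version
dbt --version
docker --version
docker compose version
```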
The project is containerized using Docker and orchestrated with Docker Compose. To build and run the project, follow these steps:
- Clone the repository.
- Navigate to the project directory.
- Run the following command to build the Docker images and start the services:

```bash
docker compose up
```
This command builds all the necessary images and starts the containers, which run the scripts defined in the project.
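For orientation, a docker-compose.yml for a stack like this might declare services along the following lines; the service names, images, and settings here are illustrative only (the project's actual file defines its own services, including any Airflow components):

```yaml
# Illustrative sketch only; the project's real docker-compose.yml
# defines its own services, images, and environment variables.
services:
  source_db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: postgres
    volumes:
      # The official postgres image runs scripts in this directory at startup.
      - ./source_db_init:/docker-entrypoint-initdb.d
  dest_db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: postgres
  etl:
    build: ./etl_script
    depends_on:
      - source_db
      - dest_db
```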