ETL Process:
- Bronze layer: The process starts by extracting the CSV data files for the various stock tickers. Using PySpark, the script reads these files, enforces a schema to maintain data integrity, and stores the raw data in a Delta table known as the 'Bronze' layer (a PySpark ingestion sketch follows this list).
- Silver layer: The transformation phase cleans and restructures the data, including type casting and the aggregation of financial metrics such as high, low, opening, and closing prices, as well as volume traded. The resulting structured data is saved in another Delta table, termed the 'Silver' layer, which serves two purposes: storing individual asset performance and supporting market trend analysis.
- Gold layer: The final phase creates the 'Gold' layer, where the data is further refined for high-level analysis, including the average return on investment (ROI) and volatility for each stock, segmented by year and month. The Gold layer represents the most valuable and insightful data, ready for in-depth analysis and decision-making.
- PySpark: Utilized for its powerful in-memory data processing capabilities, essential for handling the large datasets typically found in financial markets.
- Delta Lake: Chosen for its ACID transaction support and time-travel features, enabling reliable and consistent data storage and retrieval.
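As a rough sketch of the extraction step described above, the snippet below reads the raw ticker CSVs with an enforced schema and appends them to a Bronze Delta table. The schema, column names, file paths, and table location are illustrative assumptions; the project's actual script may differ.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Spark session with Delta Lake support (assumed configuration).
spark = (
    SparkSession.builder
    .appName("bronze-ingestion")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical raw schema: columns are kept as strings in the Bronze layer and
# typed later in the Silver step; the real schema may already enforce types here.
# Assumes each CSV row carries its ticker symbol; otherwise it could be derived
# from the file name.
schema = StructType([
    StructField("ticker", StringType(), True),
    StructField("date", StringType(), True),
    StructField("open", StringType(), True),
    StructField("high", StringType(), True),
    StructField("low", StringType(), True),
    StructField("close", StringType(), True),
    StructField("volume", StringType(), True),
])

# Enforcing the schema on read keeps the column layout stable across all ticker files.
raw_df = spark.read.csv("data/raw/*.csv", header=True, schema=schema)

# Append the raw data to the Bronze Delta table (path is an assumption).
raw_df.write.format("delta").mode("append").save("spark-warehouse/bronze/stock_prices")
```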
Jenkins Pipeline Stages:
- Initialization: Sets the groundwork for the pipeline execution, including environment variable setup.
- Clone Repository: Retrieves the project source code from the specified GitHub repository.
- Tests: Verifies the presence of the essential CSV files required for the investment analysis, ensuring data integrity before proceeding further.
- Prepare Artifact: Compiles the necessary components of the project into a compressed file. This artifact represents the build that will be deployed.
- Upload to MinIO: Securely transfers the prepared artifact to MinIO for storage, ensuring that the build is available for deployment.
- Checkout Main: Switches to the main branch of the repository, preparing it for the integration of new features or updates.
- Merge Feature Branch: Merges the feature branch into the main branch, bringing the new development into the primary codebase.
- Deploy: Represents the stage where the application would be deployed; the specific deployment steps depend on the project requirements.
Developer Workflow:
- When a developer wants to add a new feature, they start by creating a new branch on GitHub, often named after the feature, e.g., `feature/01/buildRacineProject`.
- After developing the feature and pushing their code to the GitHub branch, a Jenkins pipeline is triggered. The pipeline runs the following stages (a Jenkinsfile sketch follows this list):
- Initialization: Displays a start-up message.
- Cloning the Repository: Jenkins clones the repository from GitHub.
- Tests: Runs scripts to check for the presence of necessary files and other tests.
- Preparing the Artifact: Creates a folder for the artifact, which is then compressed.
- Uploading to MinIO: The artifact is uploaded to MinIO, an object storage service.
- Merging with the Main Branch: The feature branch is merged into the main branch on GitHub.
- Deployment: Begins the deployment of the application.
- Post-Process: Cleanup and feedback steps that run after the pipeline, depending on build success.
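A declarative Jenkinsfile covering these stages might look roughly like the sketch below. The stage bodies, script paths, MinIO client alias, repository URL, and credential handling are assumptions for illustration, not the project's actual pipeline definition.

```groovy
// Hypothetical Jenkinsfile sketch; paths, bucket names, and credentials are assumptions.
pipeline {
    agent any
    stages {
        stage('Initialization') {
            steps { echo 'Starting the investment-analysis pipeline...' }
        }
        stage('Clone Repository') {
            // Assumes a multibranch pipeline exposing BRANCH_NAME.
            steps { git branch: env.BRANCH_NAME, url: 'https://github.com/<org>/<repo>.git' }
        }
        stage('Tests') {
            steps { sh './scripts/check_csv_files.sh' } // assumed test script
        }
        stage('Prepare Artifact') {
            steps { sh 'mkdir -p artifact && tar -czf artifact/build.tar.gz src/' }
        }
        stage('Upload to MinIO') {
            steps { sh 'mc cp artifact/build.tar.gz myminio/builds/' } // assumed mc alias
        }
        stage('Merge with Main') {
            // Credential handling for the push is omitted here.
            steps { sh 'git checkout main && git merge "$BRANCH_NAME" && git push origin main' }
        }
        stage('Deployment') {
            steps { echo 'Deployment steps depend on the target environment.' }
        }
    }
    post {
        success { echo 'Pipeline finished successfully.' }
        failure { echo 'Pipeline failed; check the stage logs.' }
    }
}
```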
Makefile Commands:
- Command: `make up`
- Action: Navigates to the `infra` directory and starts the Docker containers using `docker-compose up`. The `--build` flag ensures that the Docker images are built, and the `-d` flag runs the containers in detached mode.
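Based on that description, the `up` target presumably looks something like this sketch (the `infra` directory comes from the text; the exact invocation may differ):

```makefile
# Sketch: build the images and start the whole stack in detached mode.
up:
	cd infra && docker-compose up --build -d
```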
- Command: `make create-bronze-table`
- Action: Executes a Docker command to run a Spark job inside the `local-spark` container. This job runs the `BronzeTables.py` script, creating the Bronze tables in your data architecture for initial data loading and raw data storage.
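A plausible shape for this target is sketched below; the script path inside the container is an assumption. The `create-silver-table` and `create-gold-table` targets described next follow the same pattern with their respective scripts.

```makefile
# Sketch: submit the Bronze setup script to Spark inside the running container.
# The in-container path to BronzeTables.py is an assumption.
create-bronze-table:
	docker exec local-spark spark-submit /opt/spark/jobs/BronzeTables.py
```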
- Command: `make create-silver-table`
- Action: Similar to the Bronze tables, this command runs the `SilverTables.py` script inside the `local-spark` container. Silver tables represent an intermediate layer where data is cleaned and transformed.
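To make that intermediate layer concrete, the sketch below shows one plausible Silver transformation: casting the raw Bronze columns to proper types and keeping one row per ticker and trading day. Column names, table paths, and the Delta-enabled session setup are assumptions carried over from the Bronze sketch above.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes a SparkSession already configured for Delta, as in the Bronze sketch.
spark = SparkSession.builder.getOrCreate()

bronze_df = spark.read.format("delta").load("spark-warehouse/bronze/stock_prices")

# Cast the raw string columns to proper types and drop duplicate rows.
silver_df = (
    bronze_df
    .withColumn("date", F.to_date("date"))
    .withColumn("open", F.col("open").cast("double"))
    .withColumn("high", F.col("high").cast("double"))
    .withColumn("low", F.col("low").cast("double"))
    .withColumn("close", F.col("close").cast("double"))
    .withColumn("volume", F.col("volume").cast("long"))
    .dropDuplicates(["ticker", "date"])
)

silver_df.write.format("delta").mode("overwrite").save("spark-warehouse/silver/stock_prices")
```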
- Command: `make create-gold-table`
- Action: Runs the `GoldTables.py` script. Gold tables are typically the final layer in a data pipeline, containing refined, business-level data ready for analysis and decision-making.
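The Gold metrics described earlier (average ROI and volatility per stock, segmented by year and month) could be derived roughly as follows. Treating the average daily return as ROI and its standard deviation as volatility is an illustrative assumption, as are the column names and paths.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

# Assumes the Delta-enabled session and the Silver table from the previous sketches.
spark = SparkSession.builder.getOrCreate()
silver_df = spark.read.format("delta").load("spark-warehouse/silver/stock_prices")

# Daily return per ticker: (close - previous close) / previous close.
w = Window.partitionBy("ticker").orderBy("date")
returns_df = silver_df.withColumn(
    "daily_return",
    (F.col("close") - F.lag("close").over(w)) / F.lag("close").over(w),
)

# Average return and volatility (stddev of daily returns) per ticker, year, and month.
gold_df = (
    returns_df
    .groupBy("ticker", F.year("date").alias("year"), F.month("date").alias("month"))
    .agg(
        F.avg("daily_return").alias("avg_roi"),
        F.stddev("daily_return").alias("volatility"),
    )
)

gold_df.write.format("delta").mode("overwrite").save("spark-warehouse/gold/monthly_performance")
```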
- Command: `make run-etl`
- Action: Executes the ETL (Extract, Transform, Load) process by running the `investment_etl.py` script in the `local-spark` container, transforming and loading data into the respective tables.
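This target presumably mirrors the table-creation ones, pointing `spark-submit` at the ETL driver script (the in-container path is again an assumption):

```makefile
# Sketch: run the end-to-end ETL script inside the Spark container.
run-etl:
	docker exec local-spark spark-submit /opt/spark/jobs/investment_etl.py
```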
- Command: `make api`
- Action: Navigates to the API directory and starts the Python API server by running `main.py`. This server handles API requests for your application (see the Makefile sketch after the dashboard command below).
- Command: `make dashboard`
- Action: Moves into the dashboard directory and starts a simple HTTP server using Python. This server hosts your application's dashboard for data visualization and user interaction.
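The `api` and `dashboard` targets are presumably thin wrappers along these lines; the directory names and the port are assumptions (Python's built-in `http.server` module is one common way to serve a static dashboard):

```makefile
# Sketch of the api and dashboard targets; directory names and port are assumptions.
api:
	cd api && python main.py

dashboard:
	cd dashboard && python -m http.server 8000
```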
- Command: `make all`
- Action: Sequentially executes the commands to create Bronze, Silver, and Gold tables, followed by running the ETL process. This comprehensive command sets up the entire data pipeline.
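Given that description, `all` is most likely expressed as a dependency chain over the other targets, roughly:

```makefile
# Sketch: `all` chains the table-creation targets and then the ETL run.
all: create-bronze-table create-silver-table create-gold-table run-etl
```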