This repository provides a modular Data Management System (DMS) for air quality sensors. It ingests data from multiple sensor manufacturers via their APIs, standardizes and aggregates the data, applies calibration models for correction, performs quality checks, and stores the processed data in a database through an automated pipeline. It also includes Apache Superset for data visualization, allowing users to create dashboards by connecting to the sample database.
Make sure you have the following installed:
- Git
- Docker
  - For Windows: install Docker Desktop or Docker Engine inside WSL2
  - For Linux: install Docker Engine
- Clone the repository

  ```bash
  git clone https://github.com/CSTEPBLR/AQDMS.git
  ```

- Configure settings

  An `.env.example` file is included for reference. Copy it to `.env` and update values as needed.

  ```bash
  cp .env.example .env
  ```

- Append the following to `.env` to get the appropriate permissions to make changes to Airflow DAGs:

  ```bash
  echo "AIRFLOW_UID=$(id -u)" >> .env
  ```

- Start Docker to get Airflow, Postgres, pgAdmin, and Superset up and running:

  ```bash
  docker compose up -d
  ```

- Access the services using the following links and enter your credentials as set in `.env`:
Airflow UI: http://localhost:8080
pgAdmin (Postgres UI): http://127.0.0.1:5050/login
Superset UI: http://localhost:8088
```
├── README.md
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── .env.example
├── dags
│   ├── sample_dag.py
│   └── api_dag.py
├── data
│   ├── sample_raw_data
│   ├── sample_calibration_model
│   ├── init-db.sql
│   └── load_sample_metadata.sql
└── src
    ├── common
    │   ├── api_client
    │   ├── config
    │   └── db
    ├── ingestion
    │   └── manufacturer
    ├── processing
    │   ├── staging
    │   │   └── manufacturer
    │   ├── standardize
    │   ├── aggregate
    │   └── calibrate
    │       └── calibrate_aggregated_data
    ├── quality_check
    └── visualization
```

1. ingestion → fetch and store raw manufacturer data (through API / sample JSON)
2. processing → process raw data in 4 steps: stage → standardize → aggregate → calibrate
3. quality_check → apply checks on calibrated data
4. visualization → visualize quality-checked data (a minimal DAG sketch of this flow is shown below)
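As a rough illustration of how these four stages could be wired together in Airflow, here is a minimal sketch; the task names and stub callables are hypothetical, not the actual DAGs shipped in dags/:

```python
# Hypothetical sketch of the four-stage pipeline as an Airflow DAG;
# task names and stub callables are illustrative, not the repo's actual DAGs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest():
    """Placeholder for the real ingestion entry point in src/ingestion."""


def process():
    """Placeholder for stage -> standardize -> aggregate -> calibrate."""


def run_quality_checks():
    """Placeholder for the checks in src/quality_check."""


def visualize():
    """Placeholder for the visualization step in src/visualization."""


# Airflow 2.4+ style; older versions use schedule_interval instead of schedule.
with DAG("pipeline_sketch", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    ingestion = PythonOperator(task_id="ingestion", python_callable=ingest)
    processing = PythonOperator(task_id="processing", python_callable=process)
    quality_check = PythonOperator(task_id="quality_check", python_callable=run_quality_checks)
    visualization = PythonOperator(task_id="visualization", python_callable=visualize)

    ingestion >> processing >> quality_check >> visualization
```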
Shared configuration helpers can be found in the src/common/config folder.
DB-related configurations can be found in the src/common/db folder.
API keys, DB credentials, and Docker service credentials are externalized via .env.
Manufacturer-specific logic is available here:
- ingestion/manufacturer/
- processing/staging/manufacturer/

If new manufacturers need to be added, update the required configs and API clients.
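As a hedged starting point, a new client under ingestion/manufacturer/ might look like the sketch below; the class name, endpoint path, and parameters are assumptions for illustration, not the repo's actual interface in src/common/api_client:

```python
# Hypothetical sketch of a new manufacturer client; the actual base class,
# module paths, endpoint shape, and config keys in src/common may differ.
import requests


class NewManufacturerClient:
    """Fetches raw sensor readings from a (hypothetical) manufacturer API."""

    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url
        self.api_key = api_key

    def fetch_raw_data(self, device_id: str, start: str, end: str) -> list[dict]:
        # Endpoint path and query parameters are illustrative only.
        response = requests.get(
            f"{self.base_url}/devices/{device_id}/readings",
            params={"from": start, "to": end},
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()
```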
This repo supports two options:
- Running the pipeline with sample data (for when a manufacturer's API credentials are not available).
- Running the full pipeline with valid API credentials.
Two DAGs are available for these scenarios. A DAG called sample_dag ingests preloaded sample raw sensor data from two manufacturers, AQMS and Sensit Ramp. This is for demonstration and dashboarding purposes only; production data is intentionally excluded to keep the repo lightweight.
sample_dag uses the USE_SAMPLE_DATA=true flag, which skips external API calls and loads pre-generated sample raw data from local JSON files to traverse all steps of the pipeline. Enable the DAG at http://localhost:8080 to get started.
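The sketch below illustrates, under stated assumptions, how such a flag can gate ingestion; the function, file path, and fallback behavior are hypothetical, not the repo's actual implementation:

```python
# Hypothetical illustration of gating ingestion on USE_SAMPLE_DATA;
# the repo's actual flag handling, paths, and fallback may differ.
import json
import os


def ingest_raw_data(manufacturer: str) -> list[dict]:
    if os.getenv("USE_SAMPLE_DATA", "false").lower() == "true":
        # Load pre-generated sample readings instead of calling any external API.
        sample_path = f"data/sample_raw_data/{manufacturer}.json"  # illustrative path
        with open(sample_path) as f:
            return json.load(f)
    # With real credentials, this branch would call the manufacturer's API client.
    raise NotImplementedError("API ingestion requires valid credentials")
```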
A second DAG, api_dag, can ingest data directly from the manufacturer APIs and complete all stages of the pipeline.
For api_dag, Airflow configurations need to be added; please refer to the README.md in the dags folder.
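One common Airflow pattern for this kind of configuration is shown below; whether db_config is actually an Airflow Variable (rather than, say, a Connection) is an assumption here, so defer to the dags/README.md:

```python
# Hypothetical example: reading db_config inside api_dag from an Airflow
# Variable. Whether db_config is really a Variable (vs. a Connection or other
# mechanism) is an assumption; see dags/README.md for the actual setup.
from airflow.models import Variable

db_config = Variable.get("db_config", deserialize_json=True)
db_host = db_config["host"]  # illustrative key
```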
Database schema and required metadata tables are predefined in data/init-db.sql.
Sample metadata required to run the data pipeline is inserted through data/load_sample_metadata.sql.
For more information, refer to the README in the data folder.
- If using the pgAdmin UI, register the database using credentials from .env.
- If using the Superset dashboard:
  - generate a Superset secret_key in the .env file by running the following in a terminal:

    ```bash
    openssl rand -base64 42
    ```

  - connect to the Postgres database using the SQLAlchemy URI option:

    ```
    postgresql+psycopg2://postgres_user:postgres_pwd@postgres_host:5432/postgres_db
    ```
Users are advised to develop their own machine learning (ML) calibration models for each sensor used in low-cost sensor devices from individual manufacturers. The ML calibration models provided are for reference purposes only and must not be used as standard or production calibration models in any application. Open-source code users are responsible for generating and placing the calibration ML models in the correct path.
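For orientation only, a user-supplied model might be loaded and applied as in this sketch; the file name, serialization format (joblib), and feature columns are assumptions, not the repo's contract:

```python
# Hypothetical sketch of loading and applying a user-supplied calibration
# model; the file name, serialization format (joblib), and feature columns
# are assumptions for illustration only.
import joblib
import pandas as pd

model = joblib.load("data/sample_calibration_model/pm25_model.pkl")  # assumed file name

raw = pd.DataFrame(
    {"pm25_raw": [12.4], "temperature": [29.1], "humidity": [61.0]}  # illustrative features
)
calibrated = model.predict(raw)
print(calibrated)
```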
- The Airflow UI port may be automatically forwarded to a random local port. Make sure to check the alternate URL (it depends on the forwarded localhost port, for example: http://localhost:49677/).
- Airflow will show an import error for api_dag until db_config is added. Add db_config to load api_dag.