Air Quality Data Management System (AQDMS)

This repository provides a modular Data Management System (DMS) for air quality sensors. It ingests data from multiple sensor manufacturers via their APIs, standardizes and aggregates the data, applies calibration models for correction, performs quality checks, and stores the processed data in a database through an automated pipeline. It also includes Apache Superset for data visualization, allowing users to create dashboards by connecting to the sample database.

Prerequisites

Make sure you have the following installed:

  1. Git
  2. Docker
     • For Windows: install Docker Desktop or Docker Engine inside WSL2
     • For Linux: install Docker Engine

Setup

  1. Clone the repository
git clone https://github.com/CSTEPBLR/AQDMS.git
  2. Configure settings. An .env.example file is included for reference; copy it to a .env file and update the values as needed.
cp .env.example .env
  3. Append the following to .env to get the appropriate permissions to make changes to Airflow DAGs.
echo "AIRFLOW_UID=$(id -u)" >> .env  
  4. Start Docker to get Airflow, Postgres, pgAdmin, and Superset up and running.
docker compose up -d
  5. Access the services using the following links and enter your credentials as set in .env:
Airflow UI: http://localhost:8080 
pgAdmin UI: http://127.0.0.1:5050/login
Superset UI: http://localhost:8088

Project structure

.
├── README.md
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
├── .env.example
├── dags
│    ├── sample_dag.py
│    └── api_dag.py
├── data
│    ├── sample_raw_data
│    ├── sample_calibration_model
│    ├── init-db.sql
│    └── load_sample_metadata.sql
└── src
    ├── common
    │    ├── api_client
    │    ├── config
    │    └── db
    ├── ingestion
    │    └── manufacturer
    ├── processing
    │    ├── staging
    │    │    └── manufacturer
    │    ├── standardize
    │    ├── aggregate
    │    └── calibrate
    │         └── calibrate_aggregated_data
    ├── quality_check
    └── visualization

Pipeline Overview

1. ingestion      → fetch and store raw manufacturer data (via API or sample JSON)
2. processing     → process raw data in 4 steps: stage → standardize → aggregate → calibrate  
3. quality_check  → apply checks on calibrated data  
4. visualization  → visualize quality checked data
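
As a rough orientation, these stages map onto chained Airflow tasks. Below is a minimal sketch of that shape; the DAG id, task ids, and callables are hypothetical, and the repo's actual definitions live in dags/sample_dag.py and dags/api_dag.py.

# Minimal sketch of the pipeline shape as an Airflow DAG.
# DAG id, task ids, and callables are illustrative, not the repo's actual ones.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest(**_):
    """Fetch and store raw manufacturer data (API or sample JSON)."""

def process(**_):
    """Stage -> standardize -> aggregate -> calibrate."""

def quality_check(**_):
    """Apply checks on calibrated data."""


with DAG(
    dag_id="aq_pipeline_sketch",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    t_ingest = PythonOperator(task_id="ingestion", python_callable=ingest)
    t_process = PythonOperator(task_id="processing", python_callable=process)
    t_qc = PythonOperator(task_id="quality_check", python_callable=quality_check)

    t_ingest >> t_process >> t_qc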

Configuration

Shared configuration helpers can be found in the src/common/config folder, and DB-related configuration in the src/common/db folder. API keys, DB credentials, and Docker service credentials are externalized via .env.
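
As a hedged illustration of that pattern (these key names are hypothetical; the real ones are listed in .env.example):

# Illustrative only: how .env-backed settings are typically read in Python.
# These key names are hypothetical; see .env.example for the real ones.
import os

DB_HOST = os.getenv("POSTGRES_HOST", "localhost")
DB_PORT = int(os.getenv("POSTGRES_PORT", "5432"))
API_KEY = os.getenv("MANUFACTURER_API_KEY", "")
USE_SAMPLE_DATA = os.getenv("USE_SAMPLE_DATA", "false").lower() == "true"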

Manufacturer-specific logic is available here:

- ingestion/manufacturer/
- processing/staging/manufacturer/

To add a new manufacturer, update the required configs and API clients; a sketch of what a new client might look like follows.
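
The repo's actual client interface lives in src/common/api_client; the skeleton below is a hypothetical illustration, with class, method, and endpoint names invented for the example.

# Hypothetical skeleton for a new manufacturer API client.
# Class, method, and endpoint names are illustrative; match them to the
# existing clients in src/common/api_client and src/ingestion/manufacturer.
import requests


class NewManufacturerClient:
    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url
        self.api_key = api_key

    def fetch_raw(self, sensor_id: str) -> dict:
        """Fetch raw readings for one sensor from the manufacturer API."""
        resp = requests.get(
            f"{self.base_url}/sensors/{sensor_id}/readings",
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()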

Using the data pipeline

This repo supports two options:

  1. Running the pipeline with sample data (for the case when a manufacturer's API credentials are not available).
  2. Running the full pipeline with valid API credentials.

There are two DAGs available for these scenarios. A DAG called sample_dag ingests preloaded raw sensor sample data from two manufacturers, AQMS and Sensit Ramp. This is for demonstration and dashboarding purposes only; production data is intentionally excluded to keep the repo lightweight.

sample_dag uses the USE_SAMPLE_DATA=true flag, which skips external API calls and loads pre-generated sample raw data from local JSON files to traverse all steps of the pipeline. Enable the DAG on http://localhost:8080 to get started.
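
Conceptually, the switch works as sketched below (file paths and function names here are illustrative, not the repo's exact code):

# Sketch of the USE_SAMPLE_DATA switch; paths and names are illustrative.
import json
import os


def fetch_from_api(manufacturer: str) -> list[dict]:
    """Placeholder for the live API path; see src/ingestion/manufacturer."""
    raise NotImplementedError


def load_raw_data(manufacturer: str) -> list[dict]:
    if os.getenv("USE_SAMPLE_DATA", "false").lower() == "true":
        # Skip external API calls and read pre-generated sample JSON instead.
        with open(f"data/sample_raw_data/{manufacturer}.json") as f:
            return json.load(f)
    # Otherwise, fetch live data from the manufacturer API.
    return fetch_from_api(manufacturer)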

A second DAG called api_dag can ingest data directly from the APIs and complete all stages of the pipeline. For api_dag, Airflow configurations need to be added; please refer to the README.md in the dags folder.

Database Initialization

The database schema and required metadata tables are predefined in data/init-db.sql. Sample metadata required to run the data pipeline is inserted through data/load_sample_metadata.sql.

For more info, refer to the README in the data folder.
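
If you ever need to re-apply those SQL files by hand (for example, after wiping the database), a sketch with psycopg2 follows; the connection values are placeholders to be replaced with the credentials from your .env.

# Sketch: manually apply the schema and sample metadata to Postgres.
# Connection values are placeholders; use the credentials from your .env.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="postgres_db", user="postgres_user", password="postgres_pwd",
)
with conn, conn.cursor() as cur:
    for path in ("data/init-db.sql", "data/load_sample_metadata.sql"):
        with open(path) as f:
            cur.execute(f.read())
conn.close()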

Additional configuration

  1. If using the pgAdmin UI, register the database using the credentials from .env.
  2. If using the Superset dashboard:
    • generate a Superset secret_key in the .env file using openssl rand -base64 42 in a terminal
    • connect to the Postgres database using the SQLAlchemy URI option (a quick connectivity check is sketched below): postgresql+psycopg2://postgres_user:postgres_pwd@postgres_host:5432/postgres_db
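
To sanity-check that URI before pasting it into Superset, a quick test with SQLAlchemy; the credentials are placeholders, and note that from your own machine the host is localhost (assuming the Postgres port is published), while from inside the Superset container it is the Postgres service name.

# Quick check that the SQLAlchemy URI for Superset actually connects.
# Replace the placeholder credentials with the values from your .env.
from sqlalchemy import create_engine, text

uri = "postgresql+psycopg2://postgres_user:postgres_pwd@localhost:5432/postgres_db"
with create_engine(uri).connect() as conn:
    print(conn.execute(text("SELECT version()")).scalar())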

NOTE

Users are advised to develop their own machine learning (ML) calibration models for each sensor used in low-cost sensor devices from individual manufacturers. The ML calibration models provided are for reference purposes only and must not be used as standard or production calibration models in any application. Open-source code users are responsible for generating and placing the calibration ML models in the correct path.

Known issues

  1. The Airflow UI may automatically forward to a random local port. Make sure to check the alternate URL (it depends on the forwarded localhost port, for example http://localhost:49677/).
  2. Airflow will give an import error for api_dag. Add db_config to load api_dag.
