Tooling for the Sanalyz project
An ETL pipeline that extracts, transforms, and loads data into the Sanalyz API.
- Python 3.8 or higher
- Required Python libraries:
  - pandas
  - numpy
  - tqdm
  - requests
Install the dependencies using:
pip install -r requirements.txt
First, download the supported datasets into a folder. Currently, the following datasets are supported:
Ensure that the datasets are in CSV format and placed in a folder. The folder structure should look like this:
data/
├── monkeypox.csv
├── covid19.csv
The names of the files are not important, as long as they are in CSV format.
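Since only the CSV format matters and not the file names, a quick way to check what a folder contains is to list its CSV files. The following is a minimal sketch using only the standard library; `find_csv_files` is a hypothetical helper written for illustration, not a function of the Sanalyz tooling.

```python
from pathlib import Path


def find_csv_files(folder):
    """Return the .csv files directly inside `folder`, sorted by path.

    Hypothetical helper: file names are ignored, only the extension is checked
    (case-insensitively). Subfolders are not searched.
    """
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".csv"
    )


if __name__ == "__main__":
    # "data" is the example folder name used in this README.
    if Path("data").is_dir():
        for csv_path in find_csv_files("data"):
            print(csv_path.name)
```

If this prints nothing for your folder, the datasets are probably in another format or in a nested subfolder.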
If you want to add support for new datasets, see the Adding Support for New Datasets section.
When your dataset is ready, run the ETL pipeline with the following command:
python etl <path_to_datasets_folder> <api_base_url>
For example:
python etl data https://api.sanalyz.com
This will extract data from the datasets, clean it, transform it into a form ready for loading, and then load it into the Sanalyz API.
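The stages named above (clean, transform, load) can be pictured as a chain of small functions. This is only an illustrative sketch, not the project's actual code: the function names, the `date`/`country`/`cases` columns, and the `send` callback are all assumptions. The real pipeline presumably sends data to the API base URL over HTTP, but its endpoints are not described here, so the sketch takes the sending step as a parameter.

```python
def clean(rows):
    """Strip surrounding whitespace and drop rows that contain an empty value."""
    cleaned = []
    for row in rows:
        stripped = {key: value.strip() for key, value in row.items()}
        if all(stripped.values()):
            cleaned.append(stripped)
    return cleaned


def transform(rows):
    """Convert cleaned rows into payloads (hypothetical shape, for illustration)."""
    return [
        {"date": row["date"], "country": row["country"], "cases": int(row["cases"])}
        for row in rows
    ]


def load(payloads, send):
    """Pass each payload to `send` and return the number of payloads sent."""
    for payload in payloads:
        send(payload)
    return len(payloads)


def run_pipeline(rows, send):
    """Run the clean -> transform -> load stages over already-extracted rows."""
    return load(transform(clean(rows)), send)
```

For a dry run, `send` can simply be the `append` method of a list, which collects the payloads instead of transmitting them.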
To add support for a new dataset:
- Create a new extractor script in the etl/extractors folder.
- Inherit from the Extractor base class and implement the try_extract method.
- Refer to existing extractors (e.g., covid.py, mpox.py) for examples.
Once the new extractor is added, the ETL pipeline will automatically detect and use it.
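As a rough illustration of the steps above, a new extractor might look like the sketch below. The real Extractor base class and the exact signature of try_extract live in the project and are not shown in this document, so the example defines its own stand-in base class. The `MeaslesExtractor` name, the column names, the return type, and the `extract_with_first_match` helper are all assumptions. It reads the file with the standard `csv` module to stay self-contained, whereas the existing extractors may well use pandas; check covid.py or mpox.py for the actual contract.

```python
import csv
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional


class Extractor(ABC):
    """Stand-in for the project's Extractor base class (signature assumed)."""

    @abstractmethod
    def try_extract(self, path: Path) -> Optional[list]:
        """Return extracted rows, or None if the file is not recognised."""


class MeaslesExtractor(Extractor):
    """Hypothetical extractor for a CSV with date, country, and cases columns."""

    REQUIRED_COLUMNS = {"date", "country", "cases"}

    def try_extract(self, path):
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Decline files whose header does not match this dataset.
            if not self.REQUIRED_COLUMNS.issubset(reader.fieldnames or []):
                return None
            return [
                {
                    "date": row["date"],
                    "country": row["country"],
                    "cases": int(row["cases"]),
                }
                for row in reader
            ]


def extract_with_first_match(path, extractors):
    """Try each extractor in turn and return the first non-None result."""
    for extractor in extractors:
        rows = extractor.try_extract(path)
        if rows is not None:
            return rows
    return None
```

Returning None for unrecognised files is one plausible reading of the "try" in try_extract: it lets a pipeline offer every file to every extractor without depending on file names.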