OncoAtlas/susflow


SUSFlow

Modern Python library for downloading, parsing and engineering DATASUS public health datasets. SUSFlow provides:

  • resilient FTP access to DATASUS
  • a local cache that mirrors the FTP tree
  • transparent decompression of proprietary .dbc files to tabular data
  • helpers to load datasets as pandas DataFrames ready for analysis

This repository focuses on practical reproducibility and safe access to legacy public data systems.

Portuguese (Brazil) documentation and module index: Português do Brasil

Contents

  • Module documentation in docs/en/ (layouts, variable dictionaries, notes)
  • Library code: susflow/
  • Utilities: tools/ (FTP mapping and inspection)

Installation

Install in editable mode during development:

git clone https://github.com/OncoAtlas/susflow.git
cd susflow
python -m venv .venv
. ./.venv/bin/activate
pip install -U pip
pip install -e .

Install from PyPI (recommended for most users):

pip install susflow

To install a specific released version:

pip install susflow==0.1.1

Core runtime dependencies are declared in pyproject.toml. Typical extras for performance:

  • pyarrow or fastparquet (Parquet cache)
  • pandas (DataFrame API)

Basic usage

Each DATASUS system is available as a module under susflow.systems. The API is deliberately small: list_files handles discovery, download fetches files, and read converts them to a pandas DataFrame.

Example: SINASC (Live Births)

from susflow.systems import sinasc

# list files for a state
sinasc.list_files(uf="SP")

# download and return a pandas.DataFrame
df = sinasc.read(uf="SP", year=2020)

Example: PNI (Vaccinations)

from susflow.systems import pni
df = pni.read(uf="RJ", year=2015)

Caching behavior

By default, downloads are stored under ~/.susflow/cache/ in a directory tree that mirrors the FTP paths. If a requested file is already present locally, the library skips the download and reads directly from the cache. To force a re-download, set force=True on the download and read helpers.
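
The cache rule described above can be sketched in a few lines. This is only an illustration of the idea, not SUSFlow's actual implementation: `cache_path_for`, `fetch` and the `downloader` callback are hypothetical names, and the only details taken from this README are the default cache root and the `force` flag.

```python
from pathlib import Path, PurePosixPath
from typing import Callable

# Default cache root as stated in this README.
CACHE_ROOT = Path.home() / ".susflow" / "cache"


def cache_path_for(ftp_path: str, root: Path = CACHE_ROOT) -> Path:
    """Map a remote FTP path to a local path that mirrors the same tree."""
    # Drop the leading "/" and any "." or ".." parts so the result stays inside root.
    parts = [p for p in PurePosixPath(ftp_path).parts if p not in ("/", ".", "..")]
    return root.joinpath(*parts)


def fetch(
    ftp_path: str,
    downloader: Callable[[str, Path], None],  # hypothetical: writes the remote file to the local path
    root: Path = CACHE_ROOT,
    force: bool = False,
) -> Path:
    """Return the cached file, downloading only when it is missing or force=True."""
    local = cache_path_for(ftp_path, root)
    if local.exists() and not force:
        return local  # cache hit: no network access
    local.parent.mkdir(parents=True, exist_ok=True)
    downloader(ftp_path, local)
    return local
```

With this layout, a second call for the same remote path returns the local file without calling the downloader, while `force=True` always calls it again.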

Performance guidance

  • Downcast numeric types and convert repeated strings to category to reduce memory.
  • Convert commonly used datasets to Parquet once and reuse local Parquet caches.
  • For very large datasets, prefer processing in chunks or using DuckDB or Polars to avoid excessive RAM use.
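
As a rough illustration of the first point, the sketch below downcasts numeric columns and converts low-cardinality text columns to category. The `shrink` function and its `max_unique_ratio` threshold are hypothetical and not part of SUSFlow; only standard pandas calls are used.

```python
import pandas as pd
from pandas.api import types as pdt


def shrink(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy of df with smaller dtypes where this is safe."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pdt.is_integer_dtype(s):
            # e.g. int64 -> int8/int16/int32, depending on the value range
            out[col] = pd.to_numeric(s, downcast="integer")
        elif pdt.is_float_dtype(s):
            # float64 -> float32 when the values survive the conversion
            out[col] = pd.to_numeric(s, downcast="float")
        elif (pdt.is_object_dtype(s) or pdt.is_string_dtype(s)) and len(s) > 0:
            # Repeated strings (state codes, sex, etc.) compress well as category.
            if s.nunique() / len(s) <= max_unique_ratio:
                out[col] = s.astype("category")
    return out
```

A typical pattern would be to call such a helper once on a freshly loaded DataFrame and then write the result with `DataFrame.to_parquet`, which needs pyarrow or fastparquet installed.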

Developer tools and linters

We recommend the following dev tools for contributors:

. ./.venv/bin/activate
pip install -U ruff black isort pytest pytest-mock coverage
ruff check .
black --check .
isort --check-only .
pytest -q

Testing strategy

  • Unit tests should mock FTP and file IO; see tests/unit/ for examples.
  • Integration tests that access live FTP data should be opt-in and run manually (network-dependent).
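
A minimal sketch of the first point, using only the standard library. `list_remote_files` is a hypothetical helper written for this example, not a SUSFlow function, and the host name is a placeholder. The same pattern works with the `mocker` fixture from pytest-mock.

```python
import ftplib
from unittest import mock


def list_remote_files(host: str, directory: str) -> list[str]:
    """Hypothetical helper: return the sorted file names in an FTP directory."""
    with ftplib.FTP(host, timeout=30) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(directory)
        return sorted(ftp.nlst())


def test_list_remote_files_without_network() -> None:
    # Replace ftplib.FTP so no connection is ever opened.
    with mock.patch("ftplib.FTP") as ftp_cls:
        # The object bound by "with ftplib.FTP(...) as ftp" is __enter__'s return value.
        ftp = ftp_cls.return_value.__enter__.return_value
        ftp.nlst.return_value = ["B.dbc", "A.dbc"]

        assert list_remote_files("ftp.example.org", "/some/dir") == ["A.dbc", "B.dbc"]

        ftp_cls.assert_called_once_with("ftp.example.org", timeout=30)
        ftp.cwd.assert_called_once_with("/some/dir")
```

Because the FTP class is patched, the test runs offline and in milliseconds, which keeps it suitable for the default pytest run.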

Utilities

tools/mapear_ftp.py helps locate and audit DATASUS FTP directory structures when paths change. It can save structured maps to tools/mapas/ for offline analysis.

Contributing

See CONTRIBUTING.md for guidelines: coding style, tests, and PR workflow. See docs/contributing/coverage.md for coverage instructions.

License

This project is released under the MIT License — see LICENSE.

Contact

Open issues and pull requests are welcome. For larger changes please open an issue to discuss scope before implementing.

About

High-performance ETL pipeline and standardized local data lake builder for DATASUS public health data.