Modern Python library for downloading, parsing and engineering DATASUS public health datasets. SUSFlow provides:
- resilient FTP access to DATASUS
- a local cache that mirrors the FTP tree
- transparent decompression of proprietary .dbc files to tabular data
- helpers to load datasets as pandas DataFrames ready for analysis
This repository focuses on practical reproducibility and safe access to legacy public data systems.
Portuguese (Brazil) documentation and module index: Português do Brasil
- Module documentation in docs/en/ (layouts, variable dictionaries, notes)
- Library code: susflow/
- Utilities: tools/ (FTP mapping and inspection)
Quick links
- CNES — health establishments
- PNI — immunizations
- SIM — mortality
- SINAN — notifiable diseases
- SINASC — live births
- SIASUS — ambulatory information system (SUS)
- SIHSUS — hospital information system (SUS)
- FTP file patterns summary
Install in editable mode during development:
git clone https://github.com/OncoAtlas/susflow.git
cd susflow
python -m venv .venv
. ./.venv/bin/activate
pip install -U pip
pip install -e .
Install from PyPI (recommended for most users):
pip install susflow
To install a specific released version:
pip install susflow==0.1.1
Core runtime dependencies are declared in pyproject.toml. Typical extras for performance:
- pyarrow or fastparquet (Parquet cache)
- pandas (DataFrame API)
Each DATASUS system is available under susflow.systems. APIs are lightweight: list_files, download and read helpers manage discovery, download and conversion.
Example: SINASC (Live Births)
from susflow.systems import sinasc
# list files for a state
sinasc.list_files(uf="SP")
# download and return a pandas.DataFrame
df = sinasc.read(uf="SP", year=2020)
Example: PNI (Vaccinations)
from susflow.systems import pni
df = pni.read(uf="RJ", year=2015)
By default, downloads are stored under ~/.susflow/cache/, mirroring FTP paths. If a requested file is already present locally, the library skips the download and reads directly from the cache. To force a re-download, set force=True on the download/reader helpers.
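The cache rule described above (mirror the remote path locally, skip the download when the file exists, honor force=True) can be sketched roughly as follows. This is an illustration only, not SUSFlow's actual implementation: cache_path_for, fetch and the downloader callback are hypothetical names, and the remote path in the usage lines is a made-up placeholder.

```python
from pathlib import Path, PurePosixPath

# Default cache root as described in this README.
CACHE_ROOT = Path.home() / ".susflow" / "cache"


def cache_path_for(remote_path, cache_root=CACHE_ROOT):
    """Map a remote FTP path to a local path that mirrors it (hypothetical helper)."""
    # Drop the leading slash so the remote path becomes relative to the cache root.
    relative = PurePosixPath(remote_path.lstrip("/"))
    return Path(cache_root).joinpath(*relative.parts)


def fetch(remote_path, downloader, cache_root=CACHE_ROOT, force=False):
    """Return the local copy, calling downloader(remote_path, local_path) only when needed."""
    local = cache_path_for(remote_path, cache_root)
    if local.exists() and not force:
        return local  # cache hit: no network access
    local.parent.mkdir(parents=True, exist_ok=True)
    downloader(remote_path, local)
    return local


if __name__ == "__main__":
    import tempfile

    def fake_downloader(remote, local):
        # Stand-in for a real FTP transfer.
        local.write_bytes(b"placeholder")

    with tempfile.TemporaryDirectory() as tmp:
        print(fetch("/some/remote/dir/FILE.dbc", fake_downloader, cache_root=tmp))
```

Passing the downloader as a callback keeps the caching decision separate from the transfer itself, which also makes the logic easy to test without a network connection.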
- Downcast numeric types and convert repeated strings to category to reduce memory.
- Convert commonly used datasets to Parquet once and reuse local Parquet caches.
- For very large datasets prefer processing in chunks or using DuckDB/Polars to avoid excessive RAM.
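As a rough illustration of the first tip, the sketch below downcasts numeric columns and converts low-cardinality string columns to category using plain pandas. The function name shrink_dataframe, the 0.5 uniqueness threshold and the toy columns are assumptions made for this example; they are not part of SUSFlow.

```python
import pandas as pd
from pandas.api import types as ptypes


def shrink_dataframe(df, max_unique_ratio=0.5):
    """Return a copy of df with smaller dtypes where that is straightforward.

    Hypothetical helper: integers and floats are downcast with pd.to_numeric,
    and string columns become 'category' when at most max_unique_ratio of
    their values are distinct.
    """
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if ptypes.is_bool_dtype(s):
            continue  # booleans are already compact
        if ptypes.is_integer_dtype(s):
            out[col] = pd.to_numeric(s, downcast="integer")
        elif ptypes.is_float_dtype(s):
            # float64 -> float32 can lose precision; check that this is acceptable.
            out[col] = pd.to_numeric(s, downcast="float")
        elif ptypes.is_object_dtype(s) or ptypes.is_string_dtype(s):
            if len(s) > 0 and s.nunique(dropna=True) / len(s) <= max_unique_ratio:
                out[col] = s.astype("category")
    return out


if __name__ == "__main__":
    # Toy data, not real DATASUS records.
    toy = pd.DataFrame(
        {
            "count": [1, 2, 3, 4],
            "weight": [1.5, 2.5, 3.5, 4.5],
            "uf": ["SP", "SP", "RJ", "SP"],
        }
    )
    small = shrink_dataframe(toy)
    print(small.dtypes)
    print(toy.memory_usage(deep=True).sum(), small.memory_usage(deep=True).sum())
```

The savings are usually largest on wide tables with many coded columns; on a tiny frame like the toy one, the category overhead can outweigh the gain.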
We recommend the following dev tools for contributors:
. ./.venv/bin/activate
pip install -U ruff black isort pytest pytest-mock coverage
ruff check .
black --check .
isort --check-only .
pytest -q
- Unit tests should mock FTP and file IO; see tests/unit/ for examples.
- Integration tests that access live FTP data should be opt-in and run manually (network-dependent).
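One possible shape for such a unit test, using only the standard library, is sketched below. It patches ftplib.FTP with unittest.mock so no connection is opened. The function list_remote_files and the host name are hypothetical stand-ins, not SUSFlow code; the project's real tests in tests/unit/ may be organized differently (for example with pytest-mock).

```python
import ftplib
from unittest import mock


def list_remote_files(host, directory):
    """Hypothetical helper: return the sorted file names in an FTP directory."""
    with ftplib.FTP(host) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(directory)
        return sorted(ftp.nlst())


def test_list_remote_files_without_network():
    # Replace the FTP class so that no socket is ever opened.
    with mock.patch("ftplib.FTP") as ftp_cls:
        # The object bound by 'with ftplib.FTP(host) as ftp' in the helper.
        ftp = ftp_cls.return_value.__enter__.return_value
        ftp.nlst.return_value = ["B.dbc", "A.dbc"]

        result = list_remote_files("ftp.example.invalid", "/some/dir")

    ftp_cls.assert_called_once_with("ftp.example.invalid")
    ftp.login.assert_called_once_with()
    ftp.cwd.assert_called_once_with("/some/dir")
    assert result == ["A.dbc", "B.dbc"]


if __name__ == "__main__":
    test_list_remote_files_without_network()
```

Because the helper looks up ftplib.FTP at call time, patching the attribute on the ftplib module is enough; code that imports the class directly (from ftplib import FTP) would need the patch target adjusted to the importing module.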
tools/mapear_ftp.py helps locate and audit DATASUS FTP directory structures when paths change. It can save structured maps to tools/mapas/ for offline analysis.
See CONTRIBUTING.md for guidelines: coding style, tests, and PR workflow. See docs/contributing/coverage.md for coverage instructions.
This project is released under the MIT License — see LICENSE.
Open issues and pull requests are welcome. For larger changes please open an issue to discuss scope before implementing.