Open Information Extractor for Portuguese based on dependency analysis (SpaCy + Stanza).
This guide shows all ways to run the project via src/dptoie/main.py, with all argument variations, both locally (Poetry) and with Docker / Docker Compose.
- Minimum requirements: Python 3.12+ and Poetry for local runs, or Docker (optional)
- Models: Stanza downloads models automatically on first run. You can set `STANZA_RESOURCES_DIR` to use a local models directory (e.g., `./models/.stanza_resources`).
- Installation (Poetry)
- How to run (local, via Poetry)
- How to run with Docker (without Compose)
- How to run with Docker Compose
- Quick references
- How to cite this project
- Authors
Installation (Poetry):
    poetry install
How to run (local, via Poetry). General form:
    poetry run python3 src/dptoie/main.py \
        -i <input_path> \
        -it <txt|conll> \
        -o <output_path> \
        -ot <json|csv|txt> \
        [-cc] [-sc] [-hs] [-a] [-t] [-debug]
- -i, --input: path to the input file. Default: `./inputs/teste.txt`
- -it, --input-type: input file type. Options: `txt` or `conll`. Default: `txt`
  - For `txt` input: each line in the file is a sentence; the system generates a temporary `.conll`.
  - For `conll` input: the input file is already in CoNLL-U format (one sentence per block, separated by an empty line).
- -o, --output: path to the output file. Default: `./outputs/output.json`
- -ot, --output-type: output format. Options: `json`, `csv`, `txt`. Default: `json`
- -cc, --coordinating_conjunctions: enable extractions using coordinating conjunctions
- -sc, --subordinating_conjunctions: enable extractions using subordinating conjunctions
- -hs, --hidden_subjects: enable extractions with hidden subjects (not implemented)
- -a, --appositive: enable appositive extractions
- -t, --transitive: enable transitivity for appositives (only has effect when `-a` is active)
- -debug: verbose debug mode
Important:
- Extraction modules are disabled by default. Enable the ones you want using the flags `-cc -sc -a -t`.
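When the extractor is called from another script, the options above can be assembled programmatically. The following is a minimal sketch, not part of the project: `build_command` and `RULE_FLAGS` are hypothetical names, and rejecting `-t` without `-a` is a choice made here (the option list only says `-t` has no effect in that case).

```python
# Rule flags taken from the option list above; -hs is left out because it is
# documented as not implemented.
RULE_FLAGS = {"cc": "-cc", "sc": "-sc", "a": "-a", "t": "-t"}


def build_command(input_path, output_path, input_type="txt",
                  output_type="json", rules=(), debug=False):
    """Return the argument list for src/dptoie/main.py (hypothetical helper)."""
    if input_type not in ("txt", "conll"):
        raise ValueError(f"unsupported input type: {input_type}")
    if output_type not in ("json", "csv", "txt"):
        raise ValueError(f"unsupported output type: {output_type}")
    # Assumption of this sketch: treat -t without -a as a mistake.
    if "t" in rules and "a" not in rules:
        raise ValueError("-t only has an effect together with -a")

    cmd = ["poetry", "run", "python3", "src/dptoie/main.py",
           "-i", str(input_path), "-it", input_type,
           "-o", str(output_path), "-ot", output_type]
    cmd += [RULE_FLAGS[rule] for rule in rules]
    if debug:
        cmd.append("-debug")
    return cmd


if __name__ == "__main__":
    cmd = build_command("./inputs/teste.txt", "./outputs/output.json",
                        rules=("cc", "a", "t"))
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the project root.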
- TXT input, JSON output (the default types):
    poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/out.json -ot json
- TXT input, CSV output, enabling coordinating conjunctions (flag example):
    poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/out.csv -ot csv -cc
- TXT input, human-readable text output:
    poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/out.txt -ot txt -cc -sc -a -t
- Input already in CoNLL-U, JSON output:
    poetry run python3 src/dptoie/main.py -i ./inputs/teste.conll -it conll -o ./outputs/out.json -ot json -cc -sc -a -t
- Only coordinating conjunctions:
    poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/cc.json -ot json -cc
- Debug mode for detailed inspection:
    poetry run python3 src/dptoie/main.py -i ./inputs/ceten-200.txt -it txt -o ./outputs/out.json -ot json -cc -debug
- Show the argument list:
    poetry run python3 src/dptoie/main.py -h
Expected outputs:
- JSON: a list of objects per sentence, with extractions inside `extractions` and possible `sub_extractions`.
- CSV: columns `id,sentence,arg1,rel,arg2` (includes sub-extractions with hierarchical ids like `1.1`).
- TXT: the sentence followed by extractions and sub-extractions formatted as lines.
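As an illustration of consuming the CSV output, the sketch below groups rows by the part of the id before the first dot, so that a sub-extraction such as `1.1` lands next to its parent `1`. It assumes the file starts with a header row carrying the column names listed above and uses a comma delimiter; neither detail is confirmed here. `group_extractions` is a hypothetical helper and the sample rows are made up, not real tool output.

```python
import csv
import io
from collections import defaultdict


def group_extractions(csv_file):
    """Group (id, arg1, rel, arg2) tuples by the top-level part of the id."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        parent_id = row["id"].split(".")[0]
        grouped[parent_id].append(
            (row["id"], row["arg1"], row["rel"], row["arg2"])
        )
    return dict(grouped)


# Made-up sample in the documented column layout (assumed header row).
sample = io.StringIO(
    "id,sentence,arg1,rel,arg2\n"
    "1,A B C,A,B,C\n"
    "1.1,A B C,A,B,D\n"
    "2,X Y Z,X,Y,Z\n"
)
groups = group_extractions(sample)
for parent_id, triples in groups.items():
    print(parent_id, triples)
```

For a real file, something like `open("./outputs/out.csv", newline="", encoding="utf-8")` can be passed in place of the in-memory sample.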
Build the image (from the project root):
    docker build -t dptoie_python .
Run a one-off command (mounting the current directory and pointing to files inside the container):
    docker run --rm -it \
        -e STANZA_RESOURCES_DIR=/dptoie_python/models/.stanza_resources \
        -v "$(pwd)":/dptoie_python \
        -w /dptoie_python \
        dptoie_python \
        poetry run python3 src/dptoie/main.py -i /dptoie_python/inputs/teste.conll -it conll -o /dptoie_python/outputs/out.json -ot json -cc -sc -a -t
Note: adjust the -i and -o paths as needed; use -it txt when the input is line-by-line text.
The docker-compose.yml file already includes the dptoie_python service. You can edit its `command:` line for the desired scenario. Recommended example command:
    command: poetry run python3 src/dptoie/main.py -i /dptoie_python/inputs/teste.conll -it conll -o /dptoie_python/outputs/out.json -ot json -cc -sc -a -t
Then run:
    docker compose up --build
Use `run` to execute other custom commands:
    docker compose run dptoie_python poetry run python3 src/dptoie/main.py -i /dptoie_python/inputs/ceten-200.txt -it txt -o /dptoie_python/outputs/out.csv -ot csv -cc
Tips:
- The volume `.:/dptoie_python` allows using local files inside the container.
- `STANZA_RESOURCES_DIR` (exposed in the compose file) can point to `models/.stanza_resources` to avoid repeated downloads.
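The same variable can be set for a local run launched from Python. This is a minimal sketch under that assumption; `env_with_stanza_dir` is a hypothetical helper that only prepares an environment mapping and does not download or verify any models.

```python
import os
from pathlib import Path


def env_with_stanza_dir(models_dir, base_env=None):
    """Return a copy of the environment with STANZA_RESOURCES_DIR set.

    The path is made absolute so it does not depend on the working
    directory of the child process.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["STANZA_RESOURCES_DIR"] = str(Path(models_dir).resolve())
    return env


if __name__ == "__main__":
    env = env_with_stanza_dir("./models/.stanza_resources")
    print(env["STANZA_RESOURCES_DIR"])
```

The resulting mapping can be handed to `subprocess.run(..., env=env)` when starting `src/dptoie/main.py`.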
Quick references:
- TXT input: each line is a sentence; the system creates a temporary `.conll`.
- CoNLL-U input: use `-it conll` and ensure sentences are separated by an empty line.
- Rule activation: all rules are disabled by default; add the desired flags.
- Relative paths are interpreted from the project root; in Docker, use absolute paths inside the container (e.g., `/dptoie_python/...`).
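The two input conventions above can be checked before running the extractor. The sketch below is an assumption-laden illustration, not project code: `split_conllu_sentences` and `txt_sentences` are hypothetical helpers, and the sample sentences are made up.

```python
def split_conllu_sentences(text):
    """Split CoNLL-U text into sentence blocks separated by blank lines."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the sentence block collected so far.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def txt_sentences(text):
    """Return the non-empty lines of a TXT input (one sentence per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Made-up two-sentence CoNLL-U sample (10 tab-separated columns per token).
sample = (
    "# text = Maria dorme.\n"
    "1\tMaria\tMaria\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tdorme\tdormir\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# text = Chove.\n"
    "1\tChove\tchover\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)
blocks = split_conllu_sentences(sample)
print(len(blocks))  # prints 2
```

Comparing `len(blocks)` with the number of sentences you expect is a quick way to spot a missing blank line between CoNLL-U sentences.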
If you find this repo helpful, please consider citing:
    @Article{dptoie2025,
      author={xxx xxx},
      title={xxxx},
      journal={dddd},
      year={xxx},
      month={x},
      day={cc},
      issn={xxx},
      doi={xxxxx},
      url={asas}
    }
Authors:
- Andre Walker
- Rafael Glauber
- Daniela Barreiro Claro