This repository contains the result of activity A3.2 - R3.2.3 Orodje za ekstrakcijo povezav (Relation Extraction Tool), which was produced within the project Razvoj slovenščine v digitalnem okolju (Development of Slovene in a Digital Environment).
This repository contains a relation extraction model for the Slovenian language and a Docker service that uses this model to extract relations. The repository for the method used to train the model is available at https://github.com/monologg/R-BERT. We fine-tuned the CroSloEngual BERT model for our task.
- src/ contains the source code of our work: the script for predicting relations and the FastAPI service.
- BERT_data.zip contains our fine-tuned BERT model.
- methods contains scripts for training and testing models with three different relation extraction methods.
- process_wikipedia_pages contains scripts for converting HTML pages from Slovenian Wikipedia to text with marked relations and entities.
To run this service we first need to extract the folder contained in BERT_data.zip into the root of this project.
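If no unzip tool is at hand, the extraction step can also be done with Python's standard library. The following is a minimal sketch; the function name extract_model_archive and its target_dir argument are illustrative and not part of this project.

```python
import zipfile
from pathlib import Path


def extract_model_archive(archive_path, target_dir="."):
    """Extract every member of the archive into target_dir.

    Returns the sorted top-level names found in the archive, which makes it
    easy to confirm that the expected model folder was unpacked.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        # Member names inside a zip archive always use forward slashes.
        return sorted({name.split("/")[0] for name in archive.namelist()})
```

For Docker use, calling extract_model_archive("BERT_data.zip", ".") from the project root should leave the model folder next to the Dockerfiles.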
The service is ready when the logger outputs: INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
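When the service is started from a script, watching the log by hand is inconvenient. A rough alternative is to poll the port until it accepts TCP connections. The sketch below assumes the default port 8000 from this README; the helper name wait_for_port is hypothetical. An open port only shows that the server is listening, not that every request will succeed.

```python
import socket
import time


def wait_for_port(host="localhost", port=8000, timeout=60.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made within timeout seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable); retry after a short pause.
            time.sleep(interval)
    return False
```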
Docker images require at least 4GB of RAM to be built and run.
To run GPU-accelerated Docker containers you need an NVIDIA GPU and either CUDA for WSL (on Windows 10 or 11) or the NVIDIA Container Toolkit (on Linux).
To build the GPU docker image run:
docker buildx build --platform linux/amd64 . -t bert_relation_extraction_gpu -f DockerfileGPU
To run the image in a GPU accelerated container use:
docker run --rm -it --name bert_relation_extraction \
--platform linux/amd64 \
--gpus=all \
-e useGPU=True \
--mount type=bind,source="$(pwd)"/BERT_data,target=/BERT_data,ro \
-p 8000:8000 \
bert_relation_extraction_gpu
To build the CPU-only docker image run:
docker buildx build --platform linux/amd64 . -t bert_relation_extraction -f Dockerfile
To run the image in a CPU-only container use:
docker run --rm -it --name bert_relation_extraction \
--platform linux/amd64 \
--mount type=bind,source="$(pwd)"/BERT_data,target=/BERT_data,ro \
-p 8000:8000 \
bert_relation_extraction
To run this project without Docker we recommend Python 3.8.
First, we need to extract the folder contained in BERT_data.zip into the folder src.
To install dependencies run pip install -r requirements.txt -f https://download.pytorch.org/whl/cpu/torch_stable.html in the root folder of this project.
Run uvicorn main:app --host 0.0.0.0 --port 8000 in the folder src to run the application on http://0.0.0.0:8000.
For GPU acceleration you need to have the CUDA toolkit.
To enable GPU acceleration, manually set the use_gpu parameter of classla.Pipeline in src/mark_entities.py to True
and set the device string in src/predict.py to "cuda".
The REST API is provided by FastAPI/uvicorn.
After starting up the API, the OpenAPI/Swagger documentation will become accessible at http://localhost:8000/docs and http://localhost:8000/openapi.json.
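The OpenAPI document can also be inspected programmatically, for example to confirm which endpoints a running instance exposes. The sketch below relies only on the standard OpenAPI layout, in which operations are listed under the top-level paths key; the function names are illustrative.

```python
import json
import urllib.request


def list_operations(openapi_doc):
    """Return sorted (METHOD, path) pairs found under the 'paths' key."""
    http_methods = {"get", "post", "put", "patch", "delete", "head", "options", "trace"}
    operations = []
    for path, item in openapi_doc.get("paths", {}).items():
        for key in item:
            # Path items may also hold non-operation keys such as "parameters".
            if key.lower() in http_methods:
                operations.append((key.upper(), path))
    return sorted(operations)


def fetch_openapi(url="http://localhost:8000/openapi.json"):
    """Download and parse the OpenAPI document; the service must be running."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)
```

With the service running, list_operations(fetch_openapi()) lists the available method and path pairs.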
The service exposes a GET and a POST endpoint at http://localhost:8000/predict/rel. Both endpoints require three parameters:
- text (String): Text on which relation extraction will be performed.
- only_ne_as_mentions (Boolean): If set to true, the service will only use mentions in the text which are recognized as named entities.
- relationship_threshold (Float): Each relation prediction has a confidence score between 0.0 and 1.0. This parameter can be used to prune predictions with a score below the threshold.
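For the GET endpoint, the three parameters have to be encoded into the URL. The sketch below assumes they are accepted as query parameters, which is how FastAPI usually treats simple GET arguments but is not confirmed here; the helper name build_get_url is hypothetical.

```python
from urllib.parse import urlencode


def build_get_url(text, only_ne_as_mentions=False, relationship_threshold=0.4,
                  base_url="http://localhost:8000/predict/rel"):
    """Build a GET URL for the endpoint.

    Assumption: the three parameters are passed as query parameters.
    """
    query = urlencode({
        "text": text,
        # Lowercase "true"/"false" is the conventional spelling in query strings.
        "only_ne_as_mentions": str(only_ne_as_mentions).lower(),
        "relationship_threshold": relationship_threshold,
    })
    return f"{base_url}?{query}"
```

urlencode percent-encodes non-ASCII characters such as š, so Slovenian text can be passed in unchanged.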
To test the service, try sending a request with curl:
curl -X POST -H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-d '{"text": "France Prešeren je rojen v Vrbi.", "only_ne_as_mentions": false, "relationship_threshold": 0.4}' \
'http://localhost:8000/predict/rel'
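The same POST request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl command above; the function names are illustrative, and the structure of the JSON response is not described in this README, so the result is returned as parsed without further interpretation.

```python
import json
import urllib.request


def build_predict_request(text, only_ne_as_mentions=False, relationship_threshold=0.4,
                          url="http://localhost:8000/predict/rel"):
    """Build the POST request with the same JSON body and headers as the curl example."""
    body = json.dumps({
        "text": text,
        "only_ne_as_mentions": only_ne_as_mentions,
        "relationship_threshold": relationship_threshold,
    }).encode("utf-8")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def predict_relations(text, **kwargs):
    """Send the request and return the parsed JSON response; the service must be running."""
    request = build_predict_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

For example, predict_relations("France Prešeren je rojen v Vrbi.", relationship_threshold=0.4) sends the same payload as the curl command.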
This service can be used with BERT models fine-tuned by the method from the methods/BERT folder. To use this service with your own model,
create your own BERT_data folder in the root of this project for Docker use, or in the folder src for local use. This folder
must contain pytorch_model.bin, training_args.bin and config.json, which you get from fine-tuning the BERT model.
You also need to add the vocab.txt file from the BERT model and properties-with-labels.txt, which contains relation labels and descriptions.
Examples of these files can be found in the BERT_data.zip file.
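Before starting the service with a custom model, it can help to check that the folder is complete. The sketch below looks only for the five file names listed above and does not validate their contents; the helper name missing_model_files is hypothetical.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = (
    "pytorch_model.bin",
    "training_args.bin",
    "config.json",
    "vocab.txt",
    "properties-with-labels.txt",
)


def missing_model_files(folder):
    """Return the required file names that are not present in folder."""
    root = Path(folder)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list from missing_model_files("BERT_data") means all five files are present.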
Note: This project uses an NER tagger for the Slovenian language. If you want to use this project for another language, you will need to change
src/mark_entities_in_text.py and perhaps the dependencies in requirements.txt.
Note: This project uses the transformers.AutoTokenizer.from_pretrained function for BERT tokenization. If you use a BERT model with a different recommended tokenization
method, you can change it in the load_auto_tokenizer function in src/utils.py.
Note: Relations in properties-with-labels.txt must be in the same order as they were in labels.txt in the R-BERT project
when you fine-tuned the BERT model.
The operation Razvoj slovenščine v digitalnem okolju (Development of Slovene in a Digital Environment) is co-financed by the Republic of Slovenia and the European Union from the European Regional Development Fund. The operation is carried out within the Operational Programme for the Implementation of the European Cohesion Policy in the period 2014-2020.
