# NLP Bio Portuguese Chunking

An API for extracting chunks (noun phrases) from clinical texts.

"Chunk Is All You Need" 😄😄😄

To read in Portuguese, click here: README in Portuguese

## Index

  1. About
  2. POS-Tagger
  3. How to run locally
  4. Running via Docker
  5. How to cite

## About

Chunking is a way of grouping sequential elements of a text (sentence) into phrases (noun phrases, verb phrases, prepositional phrases, etc.) based on their part-of-speech (POS) tags. It differs from named entity recognition (NER), which locates and classifies relevant spans of text.

In this work, we extract the noun phrases (phrases that have a noun as their head).

We use two methods to generate the POS tags of sentences:

  1. The spacy library, which tokenizes the sentence and extracts the POS tag of each word, using the pt_core_news_md model.
  2. A BERT token-classification model trained on the MacMorpho corpus, using as checkpoint the BioBERTpt model, which was trained with clinical and biomedical texts in Portuguese.

Next, we apply a function that extracts all the nouns from the sentence, grouping them with their complements (adjectives, adverbs, etc.).

Example:

Original sentence:

```
Data de Criação do Documento: 22/04/2014   Dispneia importante aos esforços + dor tipo peso no peito no esforço. Obeso, has, icc  c # cintilografia miocardica para avaliar angina.
```

Sentence's chunks:

```
['Data de Criação do Documento 22/04/2014', 'Dispneia importante aos esforços', 'dor tipo peso no peito no esforço', 'Obeso', 'has', 'icc', 'cintilografia miocardica', 'angina']
```
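The grouping step described above can be sketched as a single pass over (word, tag) pairs. This is a minimal illustration only: the function name and the exact tag sets are our assumptions, not the repository's actual code.

```python
# Tags that head a chunk (nouns) and tags that may complement one,
# loosely following the MacMorpho tag set described below.
NOUN_TAGS = {"N", "NPROP"}
COMPLEMENT_TAGS = {"ADJ", "ADV", "ART", "PREP", "NUM", "PCP"}

def extract_noun_chunks(tagged_tokens):
    """Group each noun with its adjacent complements into a chunk."""
    chunks, current, has_noun = [], [], False
    for word, tag in tagged_tokens:
        if tag in NOUN_TAGS:
            current.append(word)
            has_noun = True
        elif tag in COMPLEMENT_TAGS:
            current.append(word)
        else:
            # Any other tag (verb, punctuation, ...) closes the chunk;
            # keep it only if it actually contains a noun.
            if has_noun:
                chunks.append(" ".join(current))
            current, has_noun = [], False
    if has_noun:
        chunks.append(" ".join(current))
    return chunks

print(extract_noun_chunks(
    [("Dispneia", "N"), ("importante", "ADJ"), ("aos", "PREP"),
     ("esforços", "N"), ("+", "PU"), ("avaliar", "V"), ("angina", "N")]
))
# → ['Dispneia importante aos esforços', 'angina']
```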

## POS-Tagger

In addition to the POS-tagger provided by spacy, we also trained our own model by fine-tuning the BioBERTpt(all) language model on the Portuguese MacMorpho corpus for 10 epochs, reaching an overall F1 score of 0.9818.

Our model is available on Hugging Face: https://huggingface.co/pucpr-br/postagger-bio-portuguese.

If you appreciate our work, don't forget to like the model on Hugging Face ❤️

How to use the POS-tagger model (without the chunking part):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("pucpr-br/postagger-bio-portuguese")
model = AutoModelForTokenClassification.from_pretrained("pucpr-br/postagger-bio-portuguese")
```
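When using the raw model rather than a `pipeline`, the per-subtoken predictions must be merged back to word level, since BERT's WordPiece tokenizer splits words into subtokens. A minimal sketch of that post-processing step, assuming standard `##` continuation markers (the helper name is ours, not part of the repository):

```python
def merge_wordpieces(subtokens, tags):
    """Merge WordPiece subtokens ('##...') back into whole words,
    keeping the tag predicted for each word's first subtoken."""
    words, word_tags = [], []
    for tok, tag in zip(subtokens, tags):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # continuation piece: glue onto last word
        else:
            words.append(tok)
            word_tags.append(tag)
    return list(zip(words, word_tags))

print(merge_wordpieces(["Disp", "##neia", "importante"], ["N", "N", "ADJ"]))
# → [('Dispneia', 'N'), ('importante', 'ADJ')]
```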

Here are the grammatical tags returned by the model:

| Tag | Meaning |
| --- | --- |
| ADJ | Adjective |
| ADV | Adverb |
| ADV-KS | Subordinating subjunctive adverb |
| ADV-KS-REL | Subordinating relative adverb |
| ART | Article |
| CUR | Currency |
| IN | Interjection |
| KC | Coordinating conjunction |
| KS | Subordinating conjunction |
| N | Noun |
| NPROP | Proper noun |
| NUM | Number |
| PCP | Participle |
| PDEN | Denotative word |
| PREP | Preposition |
| PROADJ | Adjective pronoun |
| PRO-KS | Subordinating subjunctive pronoun |
| PRO-KS-REL | Subordinating connective relative pronoun |
| PROPESS | Personal pronoun |
| PROSUB | Noun pronoun |
| V | Verb |
| VAUX | Auxiliary verb |

More information and examples at: http://nilc.icmc.usp.br/macmorpho/macmorpho-manual.pdf

PS: If you need other POS-taggers trained for the Portuguese language in the clinical or medical domain, you can also try these models trained with Flair.

## How to run locally to extract the chunks

To generate the chunks (noun phrases), you can run it directly from these notebooks: with spacy and with POS-Tagger Bio Portuguese.

Or run a server and access it via a web interface, following the steps below (the examples use the spacy library, as it is a lighter model to run, especially within containers):

  1. Clone this repository.
  2. Install the necessary libraries (if you prefer, use Anaconda):

```shell
pip install flask==4.3.0
pip install spacy==2.3.7
```

or through the command:

```shell
pip install -r requirements.txt
```

  3. Run app.py (it is configured to run on port 5000):

```shell
python app.py
```

  4. In the browser, go to http://localhost:5000/
  5. Write a clinical sentence or select one of the example sentences and click the search button.

All the chunks identified in the input sentence will be returned colored.
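For reference, the kind of server being started here can be as small as the sketch below. The `/chunk` route, the placeholder `get_chunks` helper, and the JSON shape are illustrative assumptions; the repository's actual `app.py` serves an HTML interface and wires in the spacy-based extractor.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def get_chunks(sentence):
    # Placeholder: a real app would call the spacy-based chunk extractor here.
    return [sentence]

@app.route("/chunk", methods=["POST"])
def chunk():
    sentence = request.get_json().get("sentence", "")
    return jsonify({"chunks": get_chunks(sentence)})

if __name__ == "__main__":
    # Match the port the instructions above assume.
    app.run(host="0.0.0.0", port=5000)
```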

## Running in a container via Docker

To run the API inside a Docker container, with no need to worry about the environment and libraries, just follow these steps:

  1. If you don't have it, install Docker following these guidelines.
  2. Run the following commands (to run the container on port 5000):

```shell
docker build -t chunking .
docker run --name chunking_instance -p 0.0.0.0:5000:5000 -d chunking
```

You can also run directly from our image on Docker Hub:

```shell
docker run --name chunking_instance -p 0.0.0.0:5000:5000 -d terumi/chunking:version1
```

  3. In the browser, go to http://localhost:5000/

## How to cite

```bibtex
@article{Schneider_Gumiel_Oliveira_Montenegro_Barzotto_Moro_Pagano_Paraiso_2023,
  place={Brasil},
  title={Developing a Transformer-based Clinical Part-of-Speech Tagger for Brazilian Portuguese},
  volume={15},
  url={https://jhi.sbis.org.br/index.php/jhi-sbis/article/view/1086},
  DOI={10.59681/2175-4411.v15.iEspecial.2023.1086},
  abstractNote={Electronic Health Records are a valuable source of information to be extracted by means of natural language processing (NLP) tasks, such as morphosyntactic word tagging. Although there have been significant advances in health NLP, such as the Transformer architecture, languages such as Portuguese are still underrepresented. This paper presents taggers developed for Portuguese texts, fine-tuned using BioBERtpt (clinical/biomedical) and BERTimbau (generic) models on a POS-tagged corpus. We achieved an accuracy of 0.9826, state-of-the-art for the corpus used. In addition, we performed a human-based evaluation of the trained models and others in the literature, using authentic clinical narratives. Our clinical model achieved 0.8145 in accuracy compared to 0.7656 for the generic model. It also showed competitive results compared to models trained specifically with clinical texts, evidencing domain impact on the base model in NLP tasks.},
  number={Especial},
  journal={Journal of Health Informatics},
  author={Schneider, Elisa Terumi Rubel and Gumiel, Yohan Bonescki and Oliveira, Lucas Ferro Antunes de and Montenegro, Carolina de Oliveira and Barzotto, Laura Rubel and Moro, Claudia and Pagano, Adriana and Paraiso, Emerson Cabrera},
  year={2023},
  month={jul.}
}
```
