# NLP Bio Portuguese Chunking

An API for extracting chunks (noun phrases) from clinical texts.

"Chunk Is All You Need" 😄😄😄

To read in Portuguese, click here: README in Portuguese

## Index

  1. About
  2. POS-Tagger
  3. How to run locally
  4. Running via Docker
  5. How to cite

## About

Chunking is a way of grouping sequential elements of a text (sentence) into phrases (noun phrases, verb phrases, prepositional phrases, etc.) based on their part-of-speech (POS) tags. It differs from named entity recognition (NER), which locates and classifies relevant spans of text.

In this work, we extract the noun phrases (phrases that have a noun as their head).

We use two methods to generate the POS tags of sentences:

  1. The spacy library, which tokenizes the sentence and extracts the POS tag of each word, using the pt_core_news_md model.
  2. A BERT token-classification model trained on the MacMorpho corpus, using as checkpoint the BioBERTpt model, which was trained with clinical and biomedical texts in Portuguese.

Next, we apply a function that extracts all the nouns from the sentence, grouping them with their complements (adjectives, adverbs, etc.).

Example:

Original sentence:

```
Data de Criação do Documento: 22/04/2014   Dispneia importante aos esforços + dor tipo peso no peito no esforço. Obeso, has, icc  c # cintilografia miocardica para avaliar angina.
```

Sentence's chunks:

```
['Data de Criação do Documento 22/04/2014', 'Dispneia importante aos esforços', 'dor tipo peso no peito no esforço', 'Obeso', 'has', 'icc', 'cintilografia miocardica', 'angina']
```
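The grouping step described above can be sketched as a single pass over (word, tag) pairs. This is a minimal illustration only: the function name and the exact tag sets are our assumptions, not the repository's actual code.

```python
# Tags that head a chunk (nouns) and tags that may complement one,
# loosely following the MacMorpho tag set described below.
NOUN_TAGS = {"N", "NPROP"}
COMPLEMENT_TAGS = {"ADJ", "ADV", "ART", "PREP", "NUM", "PCP"}

def extract_noun_chunks(tagged_tokens):
    """Group each noun with its adjacent complements into a chunk."""
    chunks, current, has_noun = [], [], False
    for word, tag in tagged_tokens:
        if tag in NOUN_TAGS:
            current.append(word)
            has_noun = True
        elif tag in COMPLEMENT_TAGS:
            current.append(word)
        else:
            # Any other tag (verb, punctuation, ...) closes the chunk;
            # keep it only if it actually contains a noun.
            if has_noun:
                chunks.append(" ".join(current))
            current, has_noun = [], False
    if has_noun:
        chunks.append(" ".join(current))
    return chunks

print(extract_noun_chunks(
    [("Dispneia", "N"), ("importante", "ADJ"), ("aos", "PREP"),
     ("esforços", "N"), ("+", "PU"), ("avaliar", "V"), ("angina", "N")]
))
# → ['Dispneia importante aos esforços', 'angina']
```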

## POS-Tagger

In addition to the POS-tagger provided by spacy, we also trained our own model by fine-tuning the BioBERTpt(all) language model on the Portuguese MacMorpho corpus for 10 epochs, reaching an overall F1 score of 0.9818.

Our model is available on Hugging Face: https://huggingface.co/pucpr-br/postagger-bio-portuguese.

If you appreciate our work, don't forget to like the model on Hugging Face ❤️

How to use the POS-tagger model (without the chunking part):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("pucpr-br/postagger-bio-portuguese")
model = AutoModelForTokenClassification.from_pretrained("pucpr-br/postagger-bio-portuguese")
```
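When using the raw model rather than a `pipeline`, the per-subtoken predictions must be merged back to word level, since BERT's WordPiece tokenizer splits words into subtokens. A minimal sketch of that post-processing step, assuming standard `##` continuation markers (the helper name is ours, not part of the repository):

```python
def merge_wordpieces(subtokens, tags):
    """Merge WordPiece subtokens ('##...') back into whole words,
    keeping the tag predicted for each word's first subtoken."""
    words, word_tags = [], []
    for tok, tag in zip(subtokens, tags):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # continuation piece: glue onto last word
        else:
            words.append(tok)
            word_tags.append(tag)
    return list(zip(words, word_tags))

print(merge_wordpieces(["Disp", "##neia", "importante"], ["N", "N", "ADJ"]))
# → [('Dispneia', 'N'), ('importante', 'ADJ')]
```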

Here are the grammatical tags returned by the model:

| Tag | Meaning |
| --- | --- |
| ADJ | Adjective |
| ADV | Adverb |
| ADV-KS | Subordinating subjunctive adverb |
| ADV-KS-REL | Subordinating relative adverb |
| ART | Article |
| CUR | Currency |
| IN | Interjection |
| KC | Coordinating conjunction |
| KS | Subordinating conjunction |
| N | Noun |
| NPROP | Proper noun |
| NUM | Number |
| PCP | Participle |
| PDEN | Denotative word |
| PREP | Preposition |
| PROADJ | Adjective pronoun |
| PRO-KS | Subordinating subjunctive pronoun |
| PRO-KS-REL | Subordinating connective relative pronoun |
| PROPESS | Personal pronoun |
| PROSUB | Noun pronoun |
| V | Verb |
| VAUX | Auxiliary verb |

More information and examples at: http://nilc.icmc.usp.br/macmorpho/macmorpho-manual.pdf

PS: If you need other POS-taggers trained for the Portuguese language in the clinical or medical domain, you can also try these models trained with Flair.

## How to run locally to extract the chunks

To generate the chunks (noun phrases), you can run it directly from these notebooks: with spacy and with POS-Tagger Bio Portuguese.

Or run a server and access it via a web interface, following the steps below (the examples use the spacy library, as it is a lighter model to run, especially within containers):

  1. Clone this repository.
  2. Install the necessary libraries (if you prefer, use Anaconda):

```shell
pip install flask==4.3.0
pip install spacy==2.3.7
```

or through the command:

```shell
pip install -r requirements.txt
```

  3. Run app.py (it is configured to run on port 5000):

```shell
python app.py
```

  4. In the browser, go to http://localhost:5000/
  5. Write a clinical sentence or select one of the example sentences and click the search button.

All the chunks identified in the input sentence will be returned colored.
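For reference, the kind of server being started here can be as small as the sketch below. The `/chunk` route, the placeholder `get_chunks` helper, and the JSON shape are illustrative assumptions; the repository's actual `app.py` serves an HTML interface and wires in the spacy-based extractor.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def get_chunks(sentence):
    # Placeholder: a real app would call the spacy-based chunk extractor here.
    return [sentence]

@app.route("/chunk", methods=["POST"])
def chunk():
    sentence = request.get_json().get("sentence", "")
    return jsonify({"chunks": get_chunks(sentence)})

if __name__ == "__main__":
    # Match the port the instructions above assume.
    app.run(host="0.0.0.0", port=5000)
```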

## Running in a container via Docker

To run the API inside a Docker container, with no need to worry about the environment and libraries, just follow these steps:

  1. If you don't have it, install Docker following these guidelines.
  2. Run the following commands (to run the container on port 5000):

```shell
docker build -t chunking .
docker run --name chunking_instance -p 0.0.0.0:5000:5000 -d chunking
```

You can also run directly from our image on Docker Hub:

```shell
docker run --name chunking_instance -p 0.0.0.0:5000:5000 -d terumi/chunking:version1
```

  3. In the browser, go to http://localhost:5000/

## How to cite

```bibtex
@article{Schneider_Gumiel_Oliveira_Montenegro_Barzotto_Moro_Pagano_Paraiso_2023,
  place={Brasil},
  title={Developing a Transformer-based Clinical Part-of-Speech Tagger for Brazilian Portuguese},
  volume={15},
  url={https://jhi.sbis.org.br/index.php/jhi-sbis/article/view/1086},
  DOI={10.59681/2175-4411.v15.iEspecial.2023.1086},
  abstractNote={Electronic Health Records are a valuable source of information to be extracted by means of natural language processing (NLP) tasks, such as morphosyntactic word tagging. Although there have been significant advances in health NLP, such as the Transformer architecture, languages such as Portuguese are still underrepresented. This paper presents taggers developed for Portuguese texts, fine-tuned using BioBERtpt (clinical/biomedical) and BERTimbau (generic) models on a POS-tagged corpus. We achieved an accuracy of 0.9826, state-of-the-art for the corpus used. In addition, we performed a human-based evaluation of the trained models and others in the literature, using authentic clinical narratives. Our clinical model achieved 0.8145 in accuracy compared to 0.7656 for the generic model. It also showed competitive results compared to models trained specifically with clinical texts, evidencing domain impact on the base model in NLP tasks.},
  number={Especial},
  journal={Journal of Health Informatics},
  author={Schneider, Elisa Terumi Rubel and Gumiel, Yohan Bonescki and Oliveira, Lucas Ferro Antunes de and Montenegro, Carolina de Oliveira and Barzotto, Laura Rubel and Moro, Claudia and Pagano, Adriana and Paraiso, Emerson Cabrera},
  year={2023},
  month={jul.}
}
```
