<a href="https://colab.research.google.com/github/CinthiaS/portuguese-ner/blob/main/Portuguese_Name_Entity_Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Portuguese Named Entity Recognition

This notebook shows how to use a model to identify named entities in Brazilian Portuguese text. The named entity recognition model is [XLM-R (XLM-RoBERTa)](https://huggingface.co/CinthiaS/xlm-roberta-base-finetuned-panx-pt), fine-tuned on a Brazilian Portuguese text corpus.

The model was fine-tuned on a subset of PAN-X, a labeled multilingual corpus for named entity recognition.
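PAN-X labels each token with an IOB2 tag over three entity types (PER, LOC, ORG): `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. The sketch below shows one way such tags could be grouped into entity spans. The tokens and tags are written by hand for illustration and are not taken from the corpus, and `iob2_to_spans` is a hypothetical helper, not part of any library.

```python
# Illustrative tokens and IOB2 tags, written by hand (not taken from PAN-X).
tokens = ["Barbara", "mora", "em", "São", "Paulo"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]


def iob2_to_spans(tokens, tags):
    """Group IOB2-tagged tokens into (entity_text, label) pairs."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        starts_entity = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_entity:
            # Close any open entity, then start a new one.
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continue the open entity of the same type.
            current.append(token)
        else:
            # "O" tag: close any open entity.
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


print(iob2_to_spans(tokens, tags))
```

An `I-` tag that does not match the open entity is treated as the start of a new entity here; that is a design choice for robustness, not something the corpus format requires.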


In [2]:
!pip install transformers


## Import Libraries

In [3]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
from spacy import displacy

## Load model and Tokenizer

In [4]:
tokenizer = AutoTokenizer.from_pretrained("CinthiaS/xlm-roberta-base-finetuned-panx-pt")
model = AutoModelForTokenClassification.from_pretrained("CinthiaS/xlm-roberta-base-finetuned-panx-pt")

## Predict

In [7]:
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")
example = "Barbara mora em São Paulo e trabalha no banco Itaú"

ner_results = nlp(example)
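
When an aggregation strategy is set, the pipeline returns a list of dictionaries, one per entity, with the keys `entity_group`, `score`, `word`, `start`, and `end`, where `start` and `end` are character offsets into the input text. A minimal sketch of how such a result could be grouped by label follows. `group_by_label` and `span` are hypothetical helpers, and `sample_results` is a hand-written stand-in that only mimics the structure of `ner_results`; it is not actual model output and omits `score` and `word`.

```python
from collections import defaultdict


def group_by_label(text, results):
    """Map each entity label to the surface strings it covers in `text`."""
    grouped = defaultdict(list)
    for ent in results:
        # Slice the original text with the character offsets.
        grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


# Hand-written stand-in for `ner_results` (assumed labels, NOT real model output).
sample_text = "Barbara mora em São Paulo e trabalha no banco Itaú"


def span(surface, label):
    """Build an entity dict by locating `surface` in `sample_text`."""
    start = sample_text.index(surface)
    return {"entity_group": label, "start": start, "end": start + len(surface)}


sample_results = [span("Barbara", "PER"), span("São Paulo", "LOC"), span("Itaú", "ORG")]
print(group_by_label(sample_text, sample_results))
```

In the notebook, the same helper could be called as `group_by_label(example, ner_results)`.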

## Display Results

In [8]:
# Convert the pipeline output to displaCy's manual entity format.
docs = [{
    "text": example,
    "ents": [
        {"start": ent["start"], "end": ent["end"], "label": ent["entity_group"]}
        for ent in ner_results
    ],
    "title": "Result",
}]

# Background color for each entity label.
colors = {"PER": "linear-gradient(90deg, #999999, #cccccc)",
          "LOC": "linear-gradient(90deg, #aa9cfc, #fc9ce7)",
          "ORG": "linear-gradient(90deg, #111111, #fc9ce7)"}

options = {"ents": ["PER", "LOC", "ORG"], "colors": colors}

displacy.render(docs, style='ent', manual=True, options=options, jupyter=True)
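
To display several sentences, the conversion step can be wrapped in a function. The sketch below assumes the pipeline output structure used above (`entity_group`, `start`, `end`); `to_displacy_doc` is a hypothetical helper name, and the sample entities are written by hand rather than produced by the model. It also sorts entities by start offset, since displaCy's manual mode expects them in order.

```python
def to_displacy_doc(text, results, title=None):
    """Build one displaCy manual-mode document from aggregated NER results."""
    ents = [
        {"start": ent["start"], "end": ent["end"], "label": ent["entity_group"]}
        for ent in results
    ]
    # displaCy's manual mode expects entities ordered by position in the text.
    ents.sort(key=lambda ent: ent["start"])
    return {"text": text, "ents": ents, "title": title}


# Hand-written stand-in results (NOT real model output), deliberately unsorted.
demo_text = "Barbara mora em São Paulo"
demo_results = [
    {"entity_group": "LOC", "start": 16, "end": 25},
    {"entity_group": "PER", "start": 0, "end": 7},
]

demo_doc = to_displacy_doc(demo_text, demo_results, title="Result")
print(demo_doc)
```

With the objects defined in this notebook, a batch could then be rendered as `displacy.render([to_displacy_doc(t, nlp(t)) for t in texts], style='ent', manual=True, options=options, jupyter=True)`, where `texts` is a list of input sentences you supply.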