

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/healthcare_jsl/NER_LEGAL_PT.ipynb)




# **Detect legal entities in Portuguese text**

## 1. Colab Setup

In [None]:
# Install the johnsnowlabs library to access Spark-OCR and Spark-NLP for Healthcare, Finance, and Legal.
! pip install johnsnowlabs

In [None]:
from google.colab import files
print("Please upload your John Snow Labs license using the button below")
license_keys = files.upload()

In [None]:
from johnsnowlabs import *

# After uploading your license, run this cell to install all licensed Python wheels and pre-download the JARs for the Spark session JVM.
# Restart your notebook afterwards for the changes to take effect.

jsl.install()

## 2. Start Session

In [None]:
from johnsnowlabs import *
# Automatically load the license data and start a session with all the JARs the user has access to
spark = jsl.start()

In [None]:
spark

## 3. Select the DL model

In [None]:
# If you change the model, re-run all the cells below.
# Applicable models: lener_bert_base, lener_bert_large
MODEL_NAME = "lener_bert_base"
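
Each NER model must be paired with the embeddings it was trained on, and the pipeline cell in section 5 chooses them with an `if`/`else` that falls back to the large embeddings for any unrecognized name. As a minimal sketch of a stricter alternative, the lookup below rejects unknown names. The dictionary and the helper `embeddings_for` are hypothetical names introduced for illustration; the model and embeddings names are the ones used in this notebook.

```python
# Pairing of NER models and embeddings as used in this notebook.
# EMBEDDINGS_FOR_MODEL and embeddings_for are hypothetical helpers, not part of any library.
EMBEDDINGS_FOR_MODEL = {
    "lener_bert_base": "bert_portuguese_base_cased",
    "lener_bert_large": "bert_portuguese_large_cased",
}


def embeddings_for(model_name: str) -> str:
    """Return the embeddings name paired with an NER model, or raise on an unknown name."""
    try:
        return EMBEDDINGS_FOR_MODEL[model_name]
    except KeyError:
        known = ", ".join(sorted(EMBEDDINGS_FOR_MODEL))
        raise ValueError(f"Unknown model {model_name!r}; expected one of: {known}") from None


print(embeddings_for("lener_bert_base"))
```

A typo in `MODEL_NAME` then fails immediately with a readable message instead of silently loading mismatched embeddings.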

## 4. Sample examples

In [None]:
# Enter examples to be transformed as strings in this list
text_list = [
    """a primeira câmara desta corte , em acórdão constante da relação 31/2000 , ata 27/2000 , ministro marcos vinicios vilaça , julgou regulares com ressalva as contas de carlos aureliano motta de souza , ex-diretor-geral do stm , no ano de 1999 ( peça 1 , p. 49-51 ) .""",
    """com isso , a corte , por meio da decisão 877/2000 – plenário , manifestou-se nos seguintes termos : 8.1 - determinar à secex/rj que , com base nos artigos 41 e 43 , inciso ii , da lei nº 8.443/92 : 8.1.1 - promova a audiência do responsável acima identificado , para que , no prazo regimental , apresente razões de justificativa quanto a ocorrência de antecipações de pagamentos para fornecimento de esquadrias de alumínio , ar-condicionado e elevadores da obra de construção do edifício da 1ª cjm/rj , em afronta aos artigos 62 e 63 da lei nº 4.320/64 ; 38 do decreto nº 93.872/86 ; e 65 , inciso ii , letra `` c '' , da lei nº 8.666/93 , bem como quanto às alterações contratuais ."""
]

## 5. Define Spark NLP pipeline

In [None]:
# Explicit import so the Pipeline name used below does not depend on the wildcard import
from pyspark.ml import Pipeline

document_assembler = nlp.DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = nlp.Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# The NER model was trained with BERT embeddings, so the pipeline must load the matching ones.
if MODEL_NAME == 'lener_bert_base':
    bert_model = 'bert_portuguese_base_cased'
else:
    bert_model = 'bert_portuguese_large_cased'

# Use the embeddings selected above instead of a hard-coded name
embeddings = nlp.BertEmbeddings.pretrained(bert_model, 'pt') \
    .setInputCols("document", "token") \
    .setOutputCol("embeddings")

ner_model = medical.NerModel.pretrained(MODEL_NAME, 'pt', 'clinical/models') \
    .setInputCols(['document', 'token', 'embeddings']) \
    .setOutputCol('ner')

ner_converter = nlp.NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('ner_chunk')

nlp_pipeline = Pipeline(stages=[document_assembler, 
                                tokenizer,
                                embeddings,
                                ner_model,
                                ner_converter])

bert_portuguese_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]
lener_bert_base download started this may take some time.
[OK!]
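
The last stage of the pipeline, the NER converter, turns per-token tags into entity chunks. As a rough illustration of that idea, the pure-Python sketch below merges IOB-tagged tokens into `(text, label)` pairs. It is not Spark NLP's implementation: the function name `iob_to_chunks` is hypothetical, the tags are written by hand rather than produced by the model, the label names are illustrative, and tokens are joined with spaces, whereas the real stage works with character offsets in the original text.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    words: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "I" and entity == label:
            # Same entity continues: extend the open chunk
            words.append(token)
            continue
        # Anything else closes the open chunk, if there is one
        if words:
            chunks.append((" ".join(words), label))
        words, label = [], None
        if prefix in ("B", "I"):
            # Start a new chunk (a stray I- tag is treated as a beginning)
            words, label = [token], entity
    if words:
        chunks.append((" ".join(words), label))
    return chunks


# Hand-written tags for illustration only, not model output
tokens = ["ministro", "marcos", "vinicios", "vilaça", ",", "lei", "nº", "8.443/92"]
tags = ["O", "B-PESSOA", "I-PESSOA", "I-PESSOA", "O",
        "B-LEGISLACAO", "I-LEGISLACAO", "I-LEGISLACAO"]
print(iob_to_chunks(tokens, tags))
```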


## 6. Run the pipeline

In [None]:
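from pyspark.sql.types import StringType

# Build a single-column DataFrame named 'text', with one row per example
df = spark.createDataFrame(text_list, StringType()).toDF('text')
# fit() is required by the Pipeline API; every stage is pretrained, so no training happens here
result = nlp_pipeline.fit(df).transform(df)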
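
Besides the visualization, the detected chunks can be read as plain tuples. The sketch below assumes each chunk annotation exposes `result`, `begin`, `end`, and a `metadata` map with an `entity` key, which is how collected Spark NLP annotations are usually laid out. The helper `chunk_rows` is a hypothetical name, and the annotation in the demo is a hand-made stand-in so the snippet runs without a Spark session.

```python
from types import SimpleNamespace


def chunk_rows(annotations):
    """Return (text, entity, begin, end) for each chunk annotation."""
    return [(a.result, a.metadata["entity"], a.begin, a.end) for a in annotations]


# Hand-made stand-in for one collected annotation; the values are illustrative, not model output
fake_chunks = [
    SimpleNamespace(result="lei nº 8.443/92", metadata={"entity": "LEGISLACAO"}, begin=0, end=14),
]
for row in chunk_rows(fake_chunks):
    print(row)
```

With the real DataFrame, the equivalent call would be `chunk_rows(result.collect()[0].ner_chunk)` for the first example.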

## 7. Visualize results

In [None]:
from sparknlp_display import NerVisualizer

NerVisualizer().display(
    result = result.collect()[0],
    label_col = 'ner_chunk',
    document_col = 'document'
)