

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_PT.ipynb)




# **Detect legal entities in Portuguese text**

## 1. Colab Setup

In [None]:
import os
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

sparknlp_version = license_keys["PUBLIC_VERSION"]
jsl_version = license_keys["JSL_VERSION"]

In [4]:
print('SparkNLP Version:', sparknlp_version)
print('SparkNLP-JSL Version:', jsl_version)

SparkNLP Version: 3.1.2
SparkNLP-JSL Version: 3.1.2
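
The cells in this notebook read three entries from the uploaded license file: `PUBLIC_VERSION`, `JSL_VERSION`, and `SECRET`. A missing entry otherwise only surfaces later as a `KeyError`, so a quick check up front can save a failed setup run. The following is a minimal sketch; the helper name `missing_license_keys` and the sample dictionary are hypothetical and not part of Spark NLP, and a real license file may contain further entries.

```python
# Keys that this notebook reads from the license file.
REQUIRED_KEYS = ("PUBLIC_VERSION", "JSL_VERSION", "SECRET")


def missing_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required entries that are absent or empty in `keys`."""
    return [name for name in required if not keys.get(name)]


# Hand-written sample, not a real license file.
example_keys = {"PUBLIC_VERSION": "3.1.2", "JSL_VERSION": "3.1.2"}
missing = missing_license_keys(example_keys)
print(missing)  # ['SECRET']
```

In the notebook itself, the same check could be run as `missing_license_keys(license_keys)` right after the JSON file is loaded.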


Install dependencies

In [2]:
%%capture
for k,v in license_keys.items(): 
    %set_env $k=$v

!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jsl_colab_setup.sh
!bash jsl_colab_setup.sh

# Install Spark NLP Display for visualization
!pip install --ignore-installed spark-nlp-display

## 2. Start the Spark session

In [3]:
import json
import pandas as pd
import numpy as np

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

import sparknlp_jsl
from sparknlp_jsl.annotator import *

spark = sparknlp_jsl.start(license_keys['SECRET'])

spark

## 3. Select the DL model

In [6]:
# If you change the model, re-run all the cells below.
# Applicable models: lener_bert_base, lener_bert_large
MODEL_NAME = "lener_bert_base"

## 4. Sample texts

In [7]:
# Enter examples to be transformed as strings in this list
text_list = [
    """a primeira câmara desta corte , em acórdão constante da relação 31/2000 , ata 27/2000 , ministro marcos vinicios vilaça , julgou regulares com ressalva as contas de carlos aureliano motta de souza , ex-diretor-geral do stm , no ano de 1999 ( peça 1 , p. 49-51 ) .""",
    """com isso , a corte , por meio da decisão 877/2000 – plenário , manifestou-se nos seguintes termos : 8.1 - determinar à secex/rj que , com base nos artigos 41 e 43 , inciso ii , da lei nº 8.443/92 : 8.1.1 - promova a audiência do responsável acima identificado , para que , no prazo regimental , apresente razões de justificativa quanto a ocorrência de antecipações de pagamentos para fornecimento de esquadrias de alumínio , ar-condicionado e elevadores da obra de construção do edifício da 1ª cjm/rj , em afronta aos artigos 62 e 63 da lei nº 4.320/64 ; 38 do decreto nº 93.872/86 ; e 65 , inciso ii , letra `` c '' , da lei nº 8.666/93 , bem como quanto às alterações contratuais ."""
]

## 5. Define Spark NLP pipeline

In [13]:
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# The NER model was trained with BERT embeddings, so the pipeline must use the matching ones.
if MODEL_NAME == 'lener_bert_base':
    bert_model = 'bert_portuguese_base_cased'
else:
    bert_model = 'bert_portuguese_large_cased'
embeddings = BertEmbeddings.pretrained(bert_model, 'pt')\
    .setInputCols("document", "token") \
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained(MODEL_NAME, 'pt', 'clinical/models') \
    .setInputCols(['document', 'token', 'embeddings']) \
    .setOutputCol('ner')

ner_converter = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('ner_chunk')

nlp_pipeline = Pipeline(stages=[
    document_assembler, 
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

bert_portuguese_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]
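
The pipeline pairs each NER model with the BERT embeddings it was trained with. One way to make that pairing explicit, and to fail early on a mistyped `MODEL_NAME`, is a dictionary lookup. This is only a sketch: the names `EMBEDDINGS_FOR_MODEL` and `embeddings_for` are hypothetical, and the mapping simply restates the two model and embedding names already used in this notebook.

```python
# Mapping from NER model name to the embeddings model it expects,
# taken from the names used in this notebook.
EMBEDDINGS_FOR_MODEL = {
    "lener_bert_base": "bert_portuguese_base_cased",
    "lener_bert_large": "bert_portuguese_large_cased",
}


def embeddings_for(model_name):
    """Return the embeddings name for `model_name`, or raise on unknown names."""
    try:
        return EMBEDDINGS_FOR_MODEL[model_name]
    except KeyError:
        known = ", ".join(sorted(EMBEDDINGS_FOR_MODEL))
        raise ValueError(
            f"Unknown model {model_name!r}; expected one of: {known}"
        ) from None


bert_model = embeddings_for("lener_bert_base")
print(bert_model)  # bert_portuguese_base_cased
```

Unlike a plain `if`/`else`, the lookup rejects an unexpected model name instead of silently falling back to the large embeddings.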


## 6. Run the pipeline

In [14]:
empty_df = spark.createDataFrame([['']]).toDF('text')
pipeline_model = nlp_pipeline.fit(empty_df)
df = spark.createDataFrame(pd.DataFrame({'text': text_list}))
result = pipeline_model.transform(df)

## 7. Visualize results

In [15]:
from sparknlp_display import NerVisualizer

NerVisualizer().display(
    result = result.collect()[0],
    label_col = 'ner_chunk',
    document_col = 'document'
)
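
Besides the visual rendering, it can be handy to list the detected chunks and their labels as plain tuples. The sketch below assumes each chunk is available as a dictionary with a `result` string and a `metadata` mapping that holds the label under `entity`. The helper `chunks_to_rows` is hypothetical, and the sample chunks and their labels are hand-written for illustration; they are not output from the model.

```python
def chunks_to_rows(chunks):
    """Turn chunk dictionaries into (chunk_text, entity_label) tuples."""
    rows = []
    for chunk in chunks:
        metadata = chunk.get("metadata") or {}
        # A chunk without an `entity` entry gets None as its label.
        rows.append((chunk["result"], metadata.get("entity")))
    return rows


# Hand-written placeholders, not real model output.
sample_chunks = [
    {"result": "carlos aureliano motta de souza", "metadata": {"entity": "PESSOA"}},
    {"result": "lei nº 8.443/92", "metadata": {"entity": "LEGISLACAO"}},
]

rows = chunks_to_rows(sample_chunks)
for text, label in rows:
    print(f"{label}\t{text}")
```

With the actual pipeline output, the input could presumably be built as `[c.asDict() for c in result.collect()[0]['ner_chunk']]`, since each annotation comes back as a PySpark `Row`.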