![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/BertForTokenClassification.ipynb)

# `BertForTokenClassification` Models

## 1. Colab Setup

In [None]:
# Installing pyspark and spark-nlp

! pip install -q pyspark==3.3.0 spark-nlp==4.0.2
! pip install -q spark-nlp-display

In [2]:
# Import Libraries

import pandas as pd
import numpy as np
import json
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

from sparknlp_display import NerVisualizer

from pyspark.sql.types import StringType, IntegerType

## 2. Start Spark Session

In [3]:
spark = sparknlp.start()

print("Spark NLP Version :", sparknlp.version())

spark

Spark NLP Version : 4.0.2


## 3. Writing a Generic NER Function

In [4]:
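def get_entities(model, text, lang="en", case=True):
    """Run a pretrained BertForTokenClassification model and show its entities.

    model: name of the pretrained model
    text:  list of input strings
    lang:  language code of the pretrained model
    case:  whether the classifier should be case sensitive
    """
    # Wrap the raw text into document annotations
    document_assembler = DocumentAssembler()\
          .setInputCol("text") \
          .setOutputCol("document")

    # Split documents into sentences with the multilingual ("xx") detector
    sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
           .setInputCols(["document"])\
           .setOutputCol("sentence")

    # Split sentences into tokens
    tokenizer = Tokenizer()\
          .setInputCols(["sentence"])\
          .setOutputCol("token")

    # Merge token-level tags into entity chunks (runs after the classifier)
    ner_converter = NerConverter()\
          .setInputCols(["sentence", "token", "ner"])\
          .setOutputCol("ner_chunk")

    # Assign one tag per token with the requested pretrained model
    token_classifier = BertForTokenClassification.pretrained(model, lang)\
          .setInputCols(["sentence", "token"])\
          .setOutputCol("ner")\
          .setCaseSensitive(case)\
          .setMaxSentenceLength(512)

    pipeline = Pipeline(stages=[document_assembler, 
                                sentence_detector, 
                                tokenizer, 
                                token_classifier, 
                                ner_converter])

    # Every stage is pretrained or rule based, so fitting on an empty frame is enough
    empty_data = spark.createDataFrame([[""]]).toDF("text")

    pipeline_model = pipeline.fit(empty_data)

    # One row per input string
    df = spark.createDataFrame(text, StringType()).toDF("text")

    result = pipeline_model.transform(df)

    # Pair each chunk with its metadata, then print the chunk text and entity label
    result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
                      .select(F.expr("cols['0']").alias("chunk"),
                              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
    
    # Highlight the entities of the first input text only
    NerVisualizer().display(
            result = result.collect()[0],
            label_col = 'ner_chunk',
            document_col = 'document')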
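
In the function above, `NerConverter` turns token-level tags into the entity chunks shown in the output tables. The idea can be sketched in plain Python, assuming the classifier emits IOB-style tags (`B-`, `I-`, `O`). The helper `merge_iob` and the sample tokens and tags are hypothetical illustrations, not part of Spark NLP and not output from the models in this notebook.

```python
def merge_iob(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Close the chunk being built, if any, and reset the state.
        nonlocal current_tokens, current_label
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        label = None if tag == "O" else tag[2:]
        # A chunk ends at "O", at a new "B-" tag, or when the label changes.
        if tag == "O" or tag.startswith("B-") or label != current_label:
            flush()
        if label is not None:
            current_label = label
            current_tokens.append(token)

    flush()
    return chunks


# Hand-written sample input (assumed tags, for illustration only)
tokens = ["Lien", "Chan", "visited", "Ukraine", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(merge_iob(tokens, tags))
```

The real annotator also tracks character offsets and sentence boundaries, which this sketch leaves out.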



## 4. BertForTokenClassification Models and Outputs

### `bert_base_token_classifier_conll03` model

In [5]:
model = "bert_base_token_classifier_conll03"

text = ["""China on Thursday accused Taipei of spoiling the atmosphere for a resumption of talks across the Taiwan Strait with a visit to Ukraine by Taiwanese Vice President Lien Chan this week that infuriated Beijing."""]

get_entities(model, text)


sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_base_token_classifier_conll03 download started this may take some time.
Approximate size to download 385.4 MB
[OK!]
+-------------+---------+
|chunk        |ner_label|
+-------------+---------+
|China        |LOC      |
|Taipei       |LOC      |
|Taiwan Strait|LOC      |
|Ukraine      |LOC      |
|Taiwanese    |MISC     |
|Lien Chan    |PER      |
|Beijing      |LOC      |
+-------------+---------+
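
For longer texts it can help to summarize how many chunks each label received. A minimal sketch, assuming the rows of the table above have been copied by hand into a Python list (the names `rows` and `label_counts` are illustrative only):

```python
from collections import Counter

# (chunk, ner_label) rows copied by hand from the table printed above
rows = [
    ("China", "LOC"),
    ("Taipei", "LOC"),
    ("Taiwan Strait", "LOC"),
    ("Ukraine", "LOC"),
    ("Taiwanese", "MISC"),
    ("Lien Chan", "PER"),
    ("Beijing", "LOC"),
]

# Count how many chunks were assigned to each label
label_counts = Counter(label for _, label in rows)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```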



### `bert_large_token_classifier_conll03` model

In [6]:
model = "bert_large_token_classifier_conll03"

text = ["""China on Thursday accused Taipei of spoiling the atmosphere for a resumption of talks across the Taiwan Strait with a visit to Ukraine by Taiwanese Vice President Lien Chan this week that infuriated Beijing."""]

get_entities(model, text)


sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_large_token_classifier_conll03 download started this may take some time.
Approximate size to download 1.2 GB
[OK!]
+-------------+---------+
|chunk        |ner_label|
+-------------+---------+
|China        |LOC      |
|Taipei       |LOC      |
|Taiwan Strait|LOC      |
|Ukraine      |LOC      |
|Taiwanese    |MISC     |
|Lien Chan    |PER      |
|Beijing      |LOC      |
+-------------+---------+



### `bert_base_token_classifier_ontonote` model

In [7]:
model = "bert_base_token_classifier_ontonote"

text = ["""After the Handover: Taiwan - Macau Relations in the New Millennium Last year nearly one million travelers from Taiwan passed through Macau Airport heading for mainland China or visiting Macau itself and accounting for 80 % of passengers using the airport. Around 1000 of Macau 's civil servants have studied in Taiwan and every year some 30 people travel from Macau to Taiwan for business or work along with more than 400 students. But against the backdrop of continued proclamations from the PRC about how Macau's 'one country two systems' arrangement sets an example for solving the Taiwan problem these figures along with the Taiwan - Macau exchanges that they represent have been willfully neglected by Macau's mainstream media. Macau and Taiwan: so near and yet so far. What are the connections that bind these two places and what are the opportunities for new developments in the wake of the 1999 handover? On December 19 1999 the eve of Macau's transfer of sovereignty the China Times in Taiwan published a survey on the Taiwanese public's views about Macau's return to Chinese control. According to the survey 31 % of people in Taiwan weren't even aware that Macau was about to be handed back to China while 53 % answered that they didn't know whether the handover would be beneficial or harmful for Macau's development. But when asked whether the one country two systems formula as used for Hong Kong and Macau was acceptable for Taiwan 59 % responded No   while 27 % said that they didn't know.  As these results clearly indicate the people of Taiwan still know little about Macau in spite of existing exchanges and more than half of them flatly reject Beijing's one country two systems formula - so readily taken up in Macau - as a solution to the Taiwan question."""]

get_entities(model, text)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_base_token_classifier_ontonote download started this may take some time.
Approximate size to download 385.5 MB
[OK!]
+------------------+---------+
|chunk             |ner_label|
+------------------+---------+
|Taiwan - Macau    |NORP     |
|the New Millennium|DATE     |
|Last year         |DATE     |
|nearly one million|CARDINAL |
|Taiwan            |GPE      |
|Macau Airport     |FAC      |
|China             |GPE      |
|Macau             |GPE      |
|80 %              |PERCENT  |
|Around 1000       |CARDINAL |
|Macau             |GPE      |
|Taiwan            |GPE      |
|some              |CARDINAL |
|30                |CARDINAL |
|Macau             |GPE      |
|Taiwan            |GPE      |
|more than 400     |CARDINAL |
|PRC               |GPE      |
|Macau's           |GPE      |
|one               |CARDINAL |
+------------------+---------+
only showing top 20 rows



### `bert_large_token_classifier_ontonote` model

In [8]:
model = "bert_large_token_classifier_ontonote"

text = ["""After the Handover: Taiwan - Macau Relations in the New Millennium Last year nearly one million travelers from Taiwan passed through Macau Airport heading for mainland China or visiting Macau itself and accounting for 80 % of passengers using the airport. Around 1000 of Macau 's civil servants have studied in Taiwan and every year some 30 people travel from Macau to Taiwan for business or work along with more than 400 students. But against the backdrop of continued proclamations from the PRC about how Macau's 'one country two systems' arrangement sets an example for solving the Taiwan problem these figures along with the Taiwan - Macau exchanges that they represent have been willfully neglected by Macau's mainstream media. Macau and Taiwan: so near and yet so far. What are the connections that bind these two places and what are the opportunities for new developments in the wake of the 1999 handover? On December 19 1999 the eve of Macau's transfer of sovereignty the China Times in Taiwan published a survey on the Taiwanese public's views about Macau's return to Chinese control. According to the survey 31 % of people in Taiwan weren't even aware that Macau was about to be handed back to China while 53 % answered that they didn't know whether the handover would be beneficial or harmful for Macau's development. But when asked whether the one country two systems formula as used for Hong Kong and Macau was acceptable for Taiwan 59 % responded No   while 27 % said that they didn't know.  As these results clearly indicate the people of Taiwan still know little about Macau in spite of existing exchanges and more than half of them flatly reject Beijing's one country two systems formula - so readily taken up in Macau - as a solution to the Taiwan question."""]

get_entities(model, text)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_large_token_classifier_ontonote download started this may take some time.
Approximate size to download 1.2 GB
[OK!]
+----------------------------+---------+
|chunk                       |ner_label|
+----------------------------+---------+
|Taiwan - Macau              |NORP     |
|the New Millennium Last year|DATE     |
|one million                 |CARDINAL |
|Taiwan                      |GPE      |
|Macau Airport               |FAC      |
|China                       |GPE      |
|Macau                       |GPE      |
|80 %                        |PERCENT  |
|1000                        |CARDINAL |
|Macau                       |GPE      |
|Taiwan                      |GPE      |
|every year                  |DATE     |
|30                          |CARDINAL |
|Macau                       |GPE      |
|Taiwan                      |GPE      |
|than                        |CARD

### `bert_base_token_classifier_few_nerd` model

In [9]:
model = "bert_base_token_classifier_few_nerd"

text = ["""This rivalry intensified in 1919 when Arsenal were unexpectedly promoted to the First Division, taking a place that Tottenham believed should be theirs."""]

get_entities(model, text)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_base_token_classifier_few_nerd download started this may take some time.
Approximate size to download 385.6 MB
[OK!]
+---------+---------------------+
|chunk    |ner_label            |
+---------+---------------------+
|Arsenal  |ganization-sportsteam|
|Tottenham|ganization-sportsteam|
+---------+---------------------+
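
The labels in this table begin with `ganization-` rather than `organization-`. One possible explanation, offered as an assumption and not confirmed by the notebook, is that the model emits labels such as `organization-sportsteam` without an IOB prefix (`B-` or `I-`), and a later step drops the first two characters as if a prefix were present. The sketch below only reproduces that string effect in plain Python. Both helpers are hypothetical and are not Spark NLP functions.

```python
def strip_iob_prefix(tag):
    """Naively drop a two-character IOB prefix such as "B-" or "I-"."""
    return tag[2:]


def strip_iob_prefix_safe(tag):
    """Drop the prefix only when one is actually present."""
    return tag[2:] if tag[:2] in ("B-", "I-") else tag


# With a real IOB prefix, naive stripping gives the expected label
print(strip_iob_prefix("B-PER"))

# Without a prefix, naive stripping removes the first two letters of the label
print(strip_iob_prefix("organization-sportsteam"))

# Checking for the prefix first keeps the label intact
print(strip_iob_prefix_safe("organization-sportsteam"))
```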



### `bert_token_classifier_scandi_ner` model

In [10]:
model = "bert_token_classifier_scandi_ner"

text = ["""Hans er professor ved Statens Universitet, som ligger i København, og han er en rigtig københavner."""]

get_entities(model, text, lang="xx")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_token_classifier_scandi_ner download started this may take some time.
Approximate size to download 636 MB
[OK!]
+-------------------+---------+
|chunk              |ner_label|
+-------------------+---------+
|Hans               |PER      |
|Statens Universitet|ORG      |
|København          |LOC      |
|københavner        |MISC     |
+-------------------+---------+



### `bert_token_classifier_dutch_udlassy_ner` model

In [11]:
model = "bert_token_classifier_dutch_udlassy_ner"

text = ["""William Henry Gates III (geboren 28 oktober 1955) is een Amerikaanse zakenmagnaat, softwareontwikkelaar, investeerder en filantroop. Hij is vooral bekend als medeoprichter van Microsoft Corporation. Tijdens zijn carrière bij Microsoft bekleedde Gates de functies van voorzitter, chief executive officer (CEO), president en chief software architect, terwijl hij ook de grootste individuele aandeelhouder was tot mei 2014. Hij is een van de bekendste ondernemers en pioniers van de microcomputerrevolutie van de jaren 70 en 80. Gates, geboren en getogen in Seattle, Washington, richtte in 1975 samen met jeugdvriend Paul Allen Microsoft op in Albuquerque, New Mexico; het werd 's werelds grootste personal computer softwarebedrijf. Gates leidde het bedrijf als voorzitter en CEO totdat hij in januari 2000 aftrad als CEO, maar hij bleef voorzitter en werd chief software architect. Eind jaren negentig kreeg Gates kritiek vanwege zijn zakelijke tactieken, die als concurrentiebeperkend werden beschouwd. Deze mening is bevestigd door tal van gerechtelijke uitspraken. In juni 2006 kondigde Gates aan dat hij zou overgaan naar een parttime functie bij Microsoft en fulltime gaan werken bij de Bill&Melinda Gates Foundation, de particuliere liefdadigheidsstichting die hij en zijn vrouw, Melinda Gates, in 2000 hebben opgericht. Hij droeg geleidelijk zijn taken over aan Ray Ozzie en Craig Mundie. Hij trad in februari 2014 af als voorzitter van Microsoft en nam een nieuwe functie aan als technologieadviseur ter ondersteuning van de nieuw aangestelde CEO Satya Nadella."""]

get_entities(model, text, lang="nl")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_token_classifier_dutch_udlassy_ner download started this may take some time.
Approximate size to download 388.5 MB
[OK!]
+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|William Henry Gates III|PERSON   |
|28 oktober 1955        |DATE     |
|Amerikaanse            |NORP     |
|Microsoft Corporation  |ORG      |
|Microsoft              |ORG      |
|Gates                  |PERSON   |
|mei 2014               |DATE     |
|jaren 70 en 80         |DATE     |
|Gates                  |PERSON   |
|Seattle                |GPE      |
|Washington             |GPE      |
|1975                   |DATE     |
|Paul Allen             |PERSON   |
|Microsoft              |ORG      |
|Albuquerque            |GPE      |
|New Mexico             |GPE      |
|Gates                  |PERSON   |
|januari 2000           |DATE     |
|jaren nege

### `bert_token_classifier_spanish_ner` model

In [12]:
model = "bert_token_classifier_spanish_ner"

text = ["""Me llamo Antonio y trabajo en la fábrica de Mercedes-Benz en Madrid."""]

get_entities(model, text, lang="es")

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_token_classifier_spanish_ner download started this may take some time.
Approximate size to download 390.9 MB
[OK!]
+-------------+---------+
|chunk        |ner_label|
+-------------+---------+
|Antonio      |PER      |
|Mercedes-Benz|ORG      |
|Madrid       |LOC      |
+-------------+---------+



### `bert_token_classifier_swedish_ner` model

In [13]:
model = "bert_token_classifier_swedish_ner"

text = ["""William Henry Gates III (född 28 oktober 1955) är en amerikansk affärsmagnat, mjukvaruutvecklare, investerare och filantrop. Han är mest känd som medgrundare av Microsoft Corporation. Under sin karriär på Microsoft innehade Gates befattningar som styrelseordförande, verkställande direktör (VD), VD och programvaruarkitekt samtidigt som han var den största enskilda aktieägaren fram till maj 2014. Han är en av de mest kända företagarna och pionjärerna inom mikrodatorrevolutionen på 1970- och 1980-talet. Född och uppvuxen i Seattle, Washington, grundade Gates Microsoft tillsammans med barndomsvän Paul Allen 1975 i Albuquerque, New Mexico; det blev vidare världens största datorprogramföretag. Gates ledde företaget som styrelseordförande och VD tills han avgick som VD i januari 2000, men han förblev ordförande och blev chef för programvaruarkitekt. Under slutet av 1990-talet hade Gates kritiserats för sin affärstaktik, som har ansetts konkurrensbegränsande. Detta yttrande har upprätthållits genom många domstolsbeslut. I juni 2006 meddelade Gates att han skulle gå över till en deltidsroll på Microsoft och heltid på Bill&Melinda Gates Foundation, den privata välgörenhetsstiftelsen som han och hans fru, Melinda Gates, grundade 2000. Han överförde gradvis sina uppgifter till Ray Ozzie och Craig Mundie. Han avgick som styrelseordförande i Microsoft i februari 2014 och tillträdde en ny tjänst som teknologrådgivare för att stödja den nyutnämnda VD Satya Nadella."""]
get_entities(model, text, lang="sv", case = False)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_token_classifier_swedish_ner download started this may take some time.
Approximate size to download 444.1 MB
[OK!]
+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|William Henry Gates III|PER      |
|amerikansk             |NAT      |
|Microsoft              |ORG      |
|Microsoft              |ORG      |
|direktör               |TIT      |
|VD                     |TIT      |
|VD                     |TIT      |
|programvaruarkitekt    |TIT      |
|Seattle,               |LOC      |
|Washington             |LOC      |
|Microsoft              |ORG      |
|Allen                  |ORG      |
|Albuquerque            |LOC      |
|New Mexico             |LOC      |
|Gates                  |PER      |
|VD                     |TIT      |
|VD                     |TIT      |
|Gates                  |ORG      |
|Gates           

### `bert_token_classifier_turkish_ner` model

In [14]:
model = "bert_token_classifier_turkish_ner"

text = ["""Haziran 2006'da William Gates, Microsoft şirketinde yarı zamanlı bir göreve ve 2000 yılında eşi Melinda Gates ile birlikte kurdukları özel hayır kurumu olan Bill&Melinda Gates Vakfı'nda tam zamanlı çalışmaya geçeceğini duyurdu. Görevlerini kademeli olarak Ray Ozzie ve Craig Mundie' ye devretti. """]

get_entities(model, text, lang="tr", case = False)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_token_classifier_turkish_ner download started this may take some time.
Approximate size to download 393.6 MB
[OK!]
+----------------------------+---------+
|chunk                       |ner_label|
+----------------------------+---------+
|William Gates               |PER      |
|Microsoft                   |ORG      |
|Melinda Gates               |PER      |
|Bill&Melinda Gates Vakfı'nda|ORG      |
|Ray Ozzie                   |PER      |
|Craig Mundie                |PER      |
+----------------------------+---------+



### `bert_token_classifier_parsbert_armanner` model

In [15]:
model = "bert_token_classifier_parsbert_armanner"

text = ["""دفتر مرکزی شرکت کامیکو در شهر ساسکاتون ساسکاچوان قرار دارد."""]

get_entities(model, text, lang="fa", case = False)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_token_classifier_parsbert_armanner download started this may take some time.
Approximate size to download 578.7 MB
[OK!]
+----------------------+---------+
|chunk                 |ner_label|
+----------------------+---------+
|شرکت کامیکو           |org      |
|شهر ساسکاتون ساسکاچوان|loc      |
+----------------------+---------+



### `bert_token_classifier_parsbert_ner` model

In [16]:
model = "bert_token_classifier_parsbert_ner"

text = ["""دفتر مرکزی شرکت کامیکو در شهر ساسکاتون ساسکاچوان قرار دارد."""]

get_entities(model, text, lang="fa", case = False)

sentence_detector_dl download started this may take some time.
Approximate size to download 514.9 KB
[OK!]
bert_token_classifier_parsbert_ner download started this may take some time.
Approximate size to download 578.7 MB
[OK!]
+----------------------+------------+
|chunk                 |ner_label   |
+----------------------+------------+
|دفتر مرکزی شرکت کامیکو|organization|
|شهر ساسکاتون ساسکاچوان|location    |
+----------------------+------------+

