![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/14.BertForTokenClassification_In_Spark_NLP.ipynb)

# BertForTokenClassification

**BertForTokenClassification** can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

Pretrained models can be loaded with the `pretrained` method of the companion object. If no name is provided, the default model is `"bert_base_token_classifier_conll03"`.

Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀.
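To make the "linear layer on top of the hidden-states output" concrete, here is a minimal pure-Python sketch of a token classification head. The dimensions, hidden states, weights, and label set are hypothetical toy values chosen for illustration; they are not taken from any real BERT checkpoint, and this is not how Spark NLP implements the annotator internally.

```python
import math

# Hypothetical toy setup: 3 tokens, hidden size 4, 3 labels.
labels = ["O", "B-PER", "I-PER"]

# One hidden-state vector per token (in a real model these come from BERT).
hidden_states = [
    [0.1, 0.0, 0.2, 0.0],  # "is"
    [1.0, 0.0, 0.0, 0.5],  # "John"
    [0.0, 1.0, 0.0, 0.5],  # "Parker"
]

# Linear head: one weight row and one bias per label (hand-picked toy values).
weights = [
    [0.0, 0.0, 1.0, 0.0],  # O
    [1.0, 0.0, 0.0, 0.0],  # B-PER
    [0.0, 1.0, 0.0, 0.0],  # I-PER
]
bias = [0.0, 0.0, 0.0]


def linear(hidden):
    """Compute one logit per label for a single token."""
    return [sum(w * x for w, x in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = [softmax(linear(h)) for h in hidden_states]

# Each token gets the label with the highest probability.
predicted = [labels[p.index(max(p))] for p in probabilities]
print(predicted)
```

Each token is classified independently from its own hidden state, which is why the annotator emits exactly one label per token.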

## Colab Setup

In [None]:
! pip install -q pyspark==3.2.0 spark-nlp

In [None]:
import sparknlp

spark = sparknlp.start(spark32=True)

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd


print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 3.4.0
Apache Spark version: 3.2.0


**BertForTokenClassification Models**


*   `bert_base_token_classifier_conll03`
*   `bert_large_token_classifier_conll03`
*   `bert_base_token_classifier_ontonote`
*   `bert_large_token_classifier_ontonote`
*   `bert_token_classifier_ner_ud_gsd`
*   `bert_token_classifier_scandi_ner`
*   `bert_token_classifier_chinese_ner`
*   `bert_token_classifier_dutch_udlassy_ner`

**AlbertForTokenClassification Models**



*   `albert_base_token_classifier_conll03`
*   `albert_large_token_classifier_conll03`
*   `albert_xlarge_token_classifier_conll03`

**DistilBertForTokenClassification Models**


*   `distilbert_base_token_classifier_conll03`
*   `distilbert_base_token_classifier_ontonotes`


**RoBertaForTokenClassification Models**


*   `roberta_token_classifier_ticker`
*   `roberta_token_classifier_icelandic_ner`
*   `roberta_base_token_classifier_conll03`
*   `roberta_large_token_classifier_conll03`
*   `roberta_token_classifier_timex_semeval`
*   `distilroberta_base_token_classifier_ontonotes`
*   `roberta_base_token_classifier_ontonotes`
*   `roberta_large_token_classifier_ontonotes`



**XlmRoBertaForTokenClassification Models**


*   `xlm_roberta_large_token_classifier_conll03`
*   `xlm_roberta_large_token_classifier_hrl`
*   `xlm_roberta_base_token_classifier_conll03`



**XlnetForTokenClassification Models**



*   `xlnet_base_token_classifier_conll03`
*   `xlnet_large_token_classifier_conll03`






You can find all of these models, and more, in the [Spark NLP Models Hub](https://nlp.johnsnowlabs.com/models?edition=Spark+NLP&task=Named+Entity+Recognition).



## BertForTokenClassification Pipeline

Now, let's create a Spark NLP pipeline with the `bert_base_token_classifier_conll03` model and check the results.

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

tokenClassifier = BertForTokenClassification \
      .pretrained('bert_base_token_classifier_conll03', 'en') \
      .setInputCols(['token', 'document']) \
      .setOutputCol('ner') \
      .setCaseSensitive(True) \
      .setMaxSentenceLength(512)

# The output column uses IOB/IOB2 tags, so NerConverter can merge them into entity chunks
ner_converter = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler, 
    tokenizer,
    tokenClassifier,
    ner_converter
])

example = spark.createDataFrame([['My name is John Parker! I live in New York and I am a member of the New York Road Runners.']]).toDF("text")
model = pipeline.fit(example)
result = model.transform(example)

bert_base_token_classifier_conll03 download started this may take some time.
Approximate size to download 385.4 MB
[OK!]


In [None]:
model.stages

[DocumentAssembler_31f857e963aa,
 REGEX_TOKENIZER_e4e9b3ddd4a1,
 BERT_FOR_TOKEN_CLASSIFICATION_675a6a750b89,
 NerConverter_1c3a0c474945]

We can check the classes of the `bert_base_token_classifier_conll03` model with the `getClasses()` method.

In [None]:
tokenClassifier.getClasses()

['B-LOC', 'I-ORG', 'I-MISC', 'I-LOC', 'I-PER', 'B-MISC', 'B-ORG', 'O', 'B-PER']
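The class list mixes `B-` (beginning) and `I-` (inside) variants of each entity type, plus `O` for tokens outside any entity. As a small illustration, the underlying entity types can be recovered by stripping the prefixes. The `classes` list below is copied from the output above; the variable names are only for this sketch.

```python
# Class list as returned by getClasses() in the cell above.
classes = ['B-LOC', 'I-ORG', 'I-MISC', 'I-LOC', 'I-PER', 'B-MISC', 'B-ORG', 'O', 'B-PER']

# Drop the "B-"/"I-" prefix and skip "O" to get the distinct entity types.
entity_types = sorted({c.split("-", 1)[1] for c in classes if c != "O"})
print(entity_types)
```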

In [None]:
result.columns

['text', 'document', 'token', 'ner', 'entities']

In [None]:
result.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 

Check the NER label of each token:

In [None]:
# Zip the token and NER label arrays position by position, then explode to one row per token
result_df = result.select(F.explode(F.arrays_zip(result.token.result, result.ner.result)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"))

result_df.show(50, truncate=100)

+-------+---------+
|  token|ner_label|
+-------+---------+
|     My|        O|
|   name|        O|
|     is|        O|
|   John|    B-PER|
| Parker|    I-PER|
|      !|        O|
|      I|        O|
|   live|        O|
|     in|        O|
|    New|    B-LOC|
|   York|    I-LOC|
|    and|        O|
|      I|        O|
|     am|        O|
|      a|        O|
| member|        O|
|     of|        O|
|    the|        O|
|    New|    B-ORG|
|   York|    I-ORG|
|   Road|    I-ORG|
|Runners|    I-ORG|
|      .|        O|
+-------+---------+
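The `arrays_zip` plus `explode` pattern pairs array elements by position and then emits one row per position. A rough pure-Python analogy, using hypothetical short lists rather than the real DataFrame columns, is `itertools.zip_longest`: like `arrays_zip`, it pads a shorter array with a null value (`None` here) instead of truncating.

```python
from itertools import zip_longest

# Hypothetical miniature versions of the annotation arrays.
tokens = ["John", "Parker", "!"]
ner_labels = ["B-PER", "I-PER", "O"]
entities = ["John Parker"]  # shorter: one chunk covers two tokens

# One tuple per position; missing positions in shorter lists become None.
rows = list(zip_longest(tokens, ner_labels, entities))
for row in rows:
    print(row)
```

This is why only arrays of equal length, such as tokens and their labels, line up meaningfully when zipped together.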



Inspect the extracted chunks:

In [None]:
result_df_1= result.select(F.explode(F.arrays_zip(result.entities.result, result.entities.begin, result.entities.end, result.entities.metadata)).alias("col"))\
                   .select(F.expr("col['0']").alias("entities"),
                            F.expr("col['1']").alias("begin"),
                            F.expr("col['2']").alias("end"),
                            F.expr("col['3']['entity']").alias("ner_label"))
result_df_1.show(50, truncate=False)

+---------------------+-----+---+---------+
|entities             |begin|end|ner_label|
+---------------------+-----+---+---------+
|John Parker          |11   |21 |PER      |
|New York             |34   |41 |LOC      |
|New York Road Runners|68   |88 |ORG      |
+---------------------+-----+---+---------+
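To show how token-level IOB2 tags turn into the chunks above, here is a minimal pure-Python sketch of the grouping logic. It only illustrates the idea and is not `NerConverter`'s actual implementation; the helper names `token_offsets` and `iob2_to_chunks` are hypothetical. Offsets are inclusive character positions, matching the `begin` and `end` columns.

```python
def token_offsets(text, tokens):
    """Find each token in the text, scanning left to right (inclusive offsets)."""
    offsets, cursor = [], 0
    for tok in tokens:
        begin = text.index(tok, cursor)
        end = begin + len(tok) - 1
        offsets.append((begin, end))
        cursor = end + 1
    return offsets


def iob2_to_chunks(text, tokens, tags):
    """Merge consecutive B-/I- tags of the same type into (chunk, begin, end, label)."""
    chunks, current = [], None  # current = [begin, end, label]
    for (begin, end), tag in zip(token_offsets(text, tokens), tags):
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current[2] != label))
        if starts_new:
            if current:
                chunks.append(current)
            current = [begin, end, label]
        elif prefix == "I":
            current[1] = end  # extend the open chunk
        else:  # "O" closes any open chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(text[b:e + 1], b, e, label) for b, e, label in chunks]


text = ("My name is John Parker! I live in New York and I am a member "
        "of the New York Road Runners.")
tokens = ["My", "name", "is", "John", "Parker", "!", "I", "live", "in", "New",
          "York", "and", "I", "am", "a", "member", "of", "the", "New", "York",
          "Road", "Runners", "."]
tags = ["O", "O", "O", "B-PER", "I-PER", "O", "O", "O", "O", "B-LOC", "I-LOC",
        "O", "O", "O", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O"]

chunks = iob2_to_chunks(text, tokens, tags)
for chunk in chunks:
    print(chunk)
```

Fed the tokens and labels from the earlier table, the sketch yields the same three chunks and offsets as the pipeline output.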



## BertForTokenClassification with LightPipeline

Now we will use our model with `LightPipeline`, which runs a fitted pipeline directly on plain Python strings instead of a Spark DataFrame.

In [None]:
from sparknlp.base import LightPipeline

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

tokenClassifier = BertForTokenClassification \
    .pretrained('bert_base_token_classifier_conll03', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('ner') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

# The output column uses IOB/IOB2 tags, so NerConverter can merge them into entity chunks
ner_converter = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler, 
    tokenizer,
    tokenClassifier,
    ner_converter
])

# None of the stages needs training data, so fitting on an empty DataFrame is enough
empty_df = spark.createDataFrame([['']]).toDF("text")
model = pipeline.fit(empty_df)

bert_base_token_classifier_conll03 download started this may take some time.
Approximate size to download 385.4 MB
[OK!]


We've built our model. Now let's wrap it in a `LightPipeline` and call `fullAnnotate` on sample text.

In [None]:
light_model = LightPipeline(model)
# fullAnnotate returns one result per input text; take the first (and only) one
light_result = light_model.fullAnnotate("Steven Rothery is the original guitarist and the longest continuous member of the British rock band Marillion.")[0]

In [None]:
light_result

{'document': [Annotation(document, 0, 109, Steven Rothery is the original guitarist and the longest continuous member of the British rock band Marillion., {})],
 'entities': [Annotation(chunk, 0, 13, Steven Rothery, {'entity': 'PER', 'sentence': '0', 'chunk': '0'}),
  Annotation(chunk, 82, 88, British, {'entity': 'MISC', 'sentence': '0', 'chunk': '1'}),
  Annotation(chunk, 100, 108, Marillion, {'entity': 'ORG', 'sentence': '0', 'chunk': '2'})],
 'ner': [Annotation(named_entity, 0, 5, B-PER, {'Some(I-LOC)': '1.87171E-5', 'Some(B-PER)': '0.9990536', 'Some(B-ORG)': '2.6759852E-4', 'Some(O)': '1.4683903E-4', 'Some(I-ORG)': '3.1453208E-5', 'Some(B-LOC)': '2.1402798E-4', 'Some(I-MISC)': '2.0058473E-5', 'Some(B-MISC)': '1.285331E-4', 'Some(I-PER)': '1.19188604E-4', 'word': 'Steven', 'sentence': '0'}),
  Annotation(named_entity, 7, 13, I-PER, {'Some(I-LOC)': '1.2193699E-4', 'Some(B-PER)': '4.6251802E-4', 'Some(B-ORG)': '4.08042E-5', 'Some(O)': '1.7726915E-4', 'Some(I-ORG)': '2.6501116E-4', 'So

In [None]:
light_result.keys()

dict_keys(['document', 'token', 'ner', 'entities'])
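Each `ner` annotation in the `fullAnnotate` output carries a score for every label in its metadata. The sketch below parses one such metadata dictionary, copied from the first `ner` annotation shown above, and picks the highest-scoring label. The `Some(...)` key format is assumed from the output of this Spark NLP version and may differ in other releases; `label_scores` is a hypothetical helper name.

```python
# Metadata of the first "ner" annotation, copied from the output above.
metadata = {
    'Some(I-LOC)': '1.87171E-5', 'Some(B-PER)': '0.9990536',
    'Some(B-ORG)': '2.6759852E-4', 'Some(O)': '1.4683903E-4',
    'Some(I-ORG)': '3.1453208E-5', 'Some(B-LOC)': '2.1402798E-4',
    'Some(I-MISC)': '2.0058473E-5', 'Some(B-MISC)': '1.285331E-4',
    'Some(I-PER)': '1.19188604E-4', 'word': 'Steven', 'sentence': '0',
}


def label_scores(metadata):
    """Extract {label: score} from keys shaped like 'Some(LABEL)'."""
    scores = {}
    for key, value in metadata.items():
        if key.startswith("Some(") and key.endswith(")"):
            scores[key[len("Some("):-1]] = float(value)
    return scores


scores = label_scores(metadata)
best_label = max(scores, key=scores.get)
print(metadata["word"], best_label, scores[best_label])
```

The non-score keys (`word`, `sentence`) are skipped, so the result contains only the nine label scores.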

Check the NER label of each token:

In [None]:
# The token and ner annotations are aligned by position
tokens = [tok.result for tok in light_result["token"]]
ner_labels = [ner.result for ner in light_result["ner"]]

result_df = pd.DataFrame({"tokens": tokens, "ner_labels": ner_labels})
result_df.head(20)

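|   | tokens     | ner_labels |
|---|------------|------------|
| 0 | Steven     | B-PER      |
| 1 | Rothery    | I-PER      |
| 2 | is         | O          |
| 3 | the        | O          |
| 4 | original   | O          |
| 5 | guitarist  | O          |
| 6 | and        | O          |
| 7 | the        | O          |
| 8 | longest    | O          |
| 9 | continuous | O          |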


Let's check the chunk results:

In [None]:
# Collect the text, character offsets, and entity label of each chunk
entities = light_result["entities"]

result_df = pd.DataFrame({
    "chunks": [e.result for e in entities],
    "begin": [e.begin for e in entities],
    "end": [e.end for e in entities],
    "ner_label": [e.metadata["entity"] for e in entities],
})
result_df.head(20)

|   | chunks         | begin | end | ner_label |
|---|----------------|-------|-----|-----------|
| 0 | Steven Rothery | 0     | 13  | PER       |
| 1 | British        | 82    | 88  | MISC      |
| 2 | Marillion      | 100   | 108 | ORG       |
