![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/26.01.BertForTokenClassification.ipynb)

# Named Entity Recognition (Token Classification) with Transformers

This notebook covers the parameters and usage of Transformer-based NER annotators.

**📖 Learning Objectives:**

1. Be able to create a pipeline for NER using a Transformer-based annotator.

2. Understand how to use the annotators for predictions.

3. Become comfortable using the different parameters of the annotators.

4. Import Transformers models from Hugging Face to Spark NLP.


**🔗 Helpful Links:**

- Documentation : [Transformers in Spark NLP](https://nlp.johnsnowlabs.com/docs/en/transformers)

- Python Docs : [BertForTokenClassification](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/classifier_dl/bert_for_token_classification/index.html#sparknlp.annotator.classifier_dl.bert_for_token_classification.BertForTokenClassification)

- Scala Docs : [BertForTokenClassification](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/classifier/dl/BertForTokenClassification)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public/).

## Transformers and Spark NLP

Since version `3.1.0`, Spark NLP 🚀 annotators support models exported from `HuggingFace` 🤗 and `TF Hub`. With the `saved_model` feature in HuggingFace, you can import any of the following types of models into Spark NLP in a few lines of code.




<div align="center">

| **Architecture** | **Annotator**     |
|---------------|----------------------|
| Albert        | AlbertForTokenClassification     |
| BERT          | BertForTokenClassification       |
| CamemBERT     | CamemBertForTokenClassification  |
| DeBERTa       | DeBertaForTokenClassification    |
| DistilBERT    | DistilBertForTokenClassification |
| Longformer    | LongformerForTokenClassification |
| RoBERTa       | RoBertaForTokenClassification    |
| XLM-RoBERTa   | XlmRoBertaForTokenClassification |
| Xlnet         | XlnetForTokenClassification      |

</div>

> We will keep working on the remaining annotators and extend this support to additional Transformers models. To stay updated, visit [this page](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) on the compatibility and development of the adaptations of TF Hub and HuggingFace to Spark NLP. Stay tuned for the next releases.

### **Token Classification - NER**

As mentioned above, many Transformers models are already implemented in Spark NLP. For NER specifically, there is a **ForTokenClassification** annotator for each of them, where the model prefix can be any of:

- `BERT` ([BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805), Jacob Devlin et al.): Randomly replaces a fraction of the input tokens (15% in the paper) with a _MASK_ token or with random tokens in order to learn a language model. Given two sentences, training involves two tasks:
    - Predict the original tokens at the corrupted positions.
    - Predict whether the two sentences are consecutive.
- `ALBERT` ([ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942), Zhenzhong Lan et al.): Same as BERT, with parameter-reduction changes that lower memory usage. During training, instead of predicting whether the two sentences are consecutive, the model predicts whether they were swapped (two consecutive sentences are given as input, and the model predicts whether they appear in the correct order).
- `RoBERTa` ([RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692), Yinhan Liu et al.): Same as BERT, but with different training methods (e.g., dynamic masking, where the masked positions change between epochs instead of staying fixed).
- `CamemBERT` ([CamemBERT: a Tasty French Language Model](https://arxiv.org/abs/1911.03894), Louis Martin et al.): Based on the RoBERTa model and trained on a French dataset.
- `DistilBERT` ([DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108), Victor Sanh et al.): A distilled version of BERT (the number of parameters is reduced through knowledge distillation, in which a smaller model learns from a larger one).
- `Longformer` ([Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150), Iz Beltagy et al.): Allows the use of up to 4096 tokens instead of the usual limit of 512. To contain the added computational cost, it replaces the dense attention matrices with sparse ones.
- `XlmRoBerta` ([Unsupervised Cross-lingual Representation Learning at Scale](https://arxiv.org/abs/1911.02116), Alexis Conneau et al.): Applies the training methods from RoBERTa to the XLM model.
- `Xlnet` ([XLNet: Generalized Autoregressive Pretraining for Language Understanding](https://arxiv.org/abs/1906.08237), Zhilin Yang et al.): Unlike the token masking applied in BERT models, it trains the language model by permuting the order in which tokens are predicted.


For more details on these models and others available on HuggingFace, please visit the [HuggingFace documentation](https://huggingface.co/docs/transformers/model_summary).
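
To make the BERT masking objective described above more concrete, here is a minimal pure-Python sketch. It assumes the recipe from the BERT paper (a selected token becomes `[MASK]` 80% of the time, a random token 10% of the time, and stays unchanged 10% of the time) and simplifies it by selecting each token independently. The function name `mask_tokens` and the toy vocabulary are illustrative only and are not part of Spark NLP or HuggingFace.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for a simplified BERT-style masking.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)              # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)      # 80%: replace with the mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)       # 10%: keep the token unchanged
        else:
            corrupted.append(tok)
            labels.append(None)             # not a prediction target
    return corrupted, labels

tokens = "Microsoft founder Bill Gates plans to build a new factory in Germany .".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.15, seed=1)
for original, shown, target in zip(tokens, corrupted, labels):
    print(f"{original:10} -> {shown:10} target={target}")
```

Real pre-training operates on sub-word pieces rather than whole words, so treat this only as an illustration of the idea.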

## **🎬 Colab Setup**

In [None]:
! pip install -q pyspark==3.1.2 spark-nlp==4.3.1

In [None]:
import sparknlp

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

from pyspark.ml import Pipeline
from pyspark.sql import functions as F

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

spark

Spark NLP version:  4.3.1
Apache Spark version:  3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `DOCUMENT`, `TOKEN`

- Output: `NAMED_ENTITY`

## **🔎 Parameters**

- `caseSensitive`: Controls whether case is ignored when matching tokens; set it with `setCaseSensitive()`. The default depends on the model.

- `maxSentenceLength`: Maximum sentence length to process. The limit is 512 for all models except `Longformer`, whose limit is 4096.

- `batchSize`: Size of every batch. Larger values allow faster processing but require more memory. The default is 8.

- `configProtoBytes`: ConfigProto from TensorFlow, serialized into a byte array. Get it with `config_proto.SerializeToString()`.
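
As a rough mental model of how `maxSentenceLength` and `batchSize` interact, the sketch below truncates tokenized sentences and groups them into batches. It is a conceptual illustration only: the helper `make_batches` is hypothetical, it is not how Spark NLP is implemented internally, and it ignores the sub-word tokenization that the real models use.

```python
def make_batches(sentences, batch_size=8, max_sentence_length=512):
    """Truncate each tokenized sentence, then group sentences into batches."""
    truncated = [tokens[:max_sentence_length] for tokens in sentences]
    return [truncated[i:i + batch_size] for i in range(0, len(truncated), batch_size)]

# Five toy sentences of different lengths (in tokens).
sentences = [["tok"] * n for n in (5, 600, 12, 3, 700)]
batches = make_batches(sentences, batch_size=2, max_sentence_length=512)
print([[len(s) for s in batch] for batch in batches])  # [[5, 512], [12, 3], [512]]
```

Larger batches mean fewer model calls but more sentences held in memory at once, which is the trade-off the `batchSize` description refers to.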

## Defining the Spark NLP Pipeline

In [None]:
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForTokenClassification
import pyspark.sql.functions as F

Let's prepare the prerequisite columns first, so we can reuse them with different annotators.

In [None]:
document_assembler = DocumentAssembler() \
        .setInputCol('text') \
        .setOutputCol('document')

tokenizer = Tokenizer() \
        .setInputCols(['document']) \
        .setOutputCol('token')
        
pipeline = Pipeline(stages=[document_assembler,
                            tokenizer])

In [None]:
example_df = spark.createDataFrame([["Microsoft founder Bill Gates plans to build a new factory in Germany."]]).toDF("text")

example_df = pipeline.fit(example_df).transform(example_df)

## 📍 **BertForTokenClassification**

➤ BertForTokenClassification can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

➤ For extended examples of usage, see the Examples. To see which models are compatible and how to import them see [Import Transformers into Spark NLP](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) 🚀.

In [None]:
bert_tagger = BertForTokenClassification.pretrained("bert_base_token_classifier_conll03", "en") \
        .setInputCols(['document', 'token']) \
        .setOutputCol('ner')\
        .setMaxSentenceLength(512)\
        .setCaseSensitive(True)

bert_base_token_classifier_conll03 download started this may take some time.
Approximate size to download 385.4 MB
[OK!]


In [None]:
result = bert_tagger.transform(example_df)
result.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|                 ner|
+--------------------+--------------------+--------------------+--------------------+
|Microsoft founder...|[{document, 0, 68...|[{token, 0, 8, Mi...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+



In [None]:
result.select(F.explode(F.arrays_zip("token.result", "ner.result")).alias("cols"))\
      .select(F.expr("cols['0']").alias("token"), F.expr("cols['1']").alias("ner_label"))\
      .show(50, truncate=False)

+---------+---------+
|token    |ner_label|
+---------+---------+
|Microsoft|B-ORG    |
|founder  |O        |
|Bill     |B-PER    |
|Gates    |I-PER    |
|plans    |O        |
|to       |O        |
|build    |O        |
|a        |O        |
|new      |O        |
|factory  |O        |
|in       |O        |
|Germany  |B-LOC    |
|.        |O        |
+---------+---------+
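
The labels above follow the BIO convention: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. The sketch below shows, in plain Python, how such labels can be merged into entity chunks. In a real pipeline this job is done by Spark NLP's `NerConverter` (used later in this notebook); the helper `group_bio` is a simplified, hypothetical stand-in.

```python
def group_bio(tokens, labels):
    """Merge BIO-tagged tokens into (chunk_text, entity_type) pairs."""
    chunks = []             # finished (text, type) pairs
    words, kind = [], None  # entity currently being built
    # The trailing ("", "O") sentinel flushes an entity that ends the sentence.
    for token, label in list(zip(tokens, labels)) + [("", "O")]:
        prefix, _, entity = label.partition("-")
        if prefix == "I" and entity == kind:
            words.append(token)  # continue the current entity
            continue
        if words:
            chunks.append((" ".join(words), kind))
        # Start a new entity on B- (or on an I- with no matching entity open).
        words, kind = ([token], entity) if prefix in ("B", "I") else ([], None)
    return chunks

tokens = ["Microsoft", "founder", "Bill", "Gates", "plans", "to", "build",
          "a", "new", "factory", "in", "Germany", "."]
labels = ["B-ORG", "O", "B-PER", "I-PER", "O", "O", "O",
          "O", "O", "O", "O", "B-LOC", "O"]
print(group_bio(tokens, labels))
# [('Microsoft', 'ORG'), ('Bill Gates', 'PER'), ('Germany', 'LOC')]
```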



In [None]:
bert_tagger.extractParamMap()

{Param(parent='BERT_FOR_TOKEN_CLASSIFICATION_675a6a750b89', name='batchSize', doc='Size of every batch'): 8,
 Param(parent='BERT_FOR_TOKEN_CLASSIFICATION_675a6a750b89', name='engine', doc='Deep Learning engine used for this model'): 'tensorflow',
 Param(parent='BERT_FOR_TOKEN_CLASSIFICATION_675a6a750b89', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='BERT_FOR_TOKEN_CLASSIFICATION_675a6a750b89', name='maxSentenceLength', doc='Max sentence length to process'): 512,
 Param(parent='BERT_FOR_TOKEN_CLASSIFICATION_675a6a750b89', name='caseSensitive', doc='whether to ignore case in tokens for embeddings matching'): True,
 Param(parent='BERT_FOR_TOKEN_CLASSIFICATION_675a6a750b89', name='inputCols', doc='previous annotations columns, if renamed'): ['document',
  'token'],
 Param(parent='BERT_FOR_TOKEN_CLASSIFICATION_675a6a750b89', name='outputCol', doc='output annotation column. can be left default.'): 'ner'}

In [None]:
bert_tagger.getMaxSentenceLength()

512

In [None]:
bert_tagger.getCaseSensitive()

True

## **📍 RoBertaForTokenClassification**

➤ RoBertaForTokenClassification can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.


➤ For extended examples of usage, see the Examples. To see which models are compatible and how to import them see [Import Transformers into Spark NLP](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) 🚀.

In [None]:
from sparknlp.annotator import RoBertaForTokenClassification

tokenClassifier = RoBertaForTokenClassification \
    .pretrained('roberta_base_token_classifier_ontonotes', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('ner')


result = tokenClassifier.transform(example_df)
result.show()

roberta_base_token_classifier_ontonotes download started this may take some time.
Approximate size to download 434.7 MB
[OK!]
+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|                 ner|
+--------------------+--------------------+--------------------+--------------------+
|Microsoft founder...|[{document, 0, 68...|[{token, 0, 8, Mi...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+



In [None]:
result.select(F.explode(F.arrays_zip("token.result", "ner.result")).alias("cols"))\
    .select(F.expr("cols['0']").alias("token"), F.expr("cols['1']").alias("ner_label"))\
    .show(50, truncate=False)

+---------+---------+
|token    |ner_label|
+---------+---------+
|Microsoft|B-ORG    |
|founder  |O        |
|Bill     |B-PERSON |
|Gates    |I-PERSON |
|plans    |O        |
|to       |O        |
|build    |O        |
|a        |O        |
|new      |O        |
|factory  |O        |
|in       |O        |
|Germany  |B-GPE    |
|.        |O        |
+---------+---------+



## **📍 XlmRoBertaForTokenClassification for Turkish**

➤ XlmRoBertaForTokenClassification can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.


➤ For extended examples of usage, see the Examples. To see which models are compatible and how to import them see [Import Transformers into Spark NLP](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) 🚀.

In [None]:
from sparknlp.annotator import XlmRoBertaForTokenClassification

example_df = spark.createDataFrame([["Benim adım Cesur Yurttaş ve İstanbul'da yaşıyorum."]]).toDF("text")

document_assembler = DocumentAssembler() \
        .setInputCol('text') \
        .setOutputCol('document')

tokenizer = Tokenizer() \
        .setInputCols(['document']) \
        .setOutputCol('token')

ner_tagger = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_base_token_classifier_ner", "tr") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    ner_tagger
])

result = pipeline.fit(example_df).transform(example_df)
result.show()

xlm_roberta_base_token_classifier_ner download started this may take some time.
Approximate size to download 812.2 MB
[OK!]
+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|                 ner|
+--------------------+--------------------+--------------------+--------------------+
|Benim adım Cesur ...|[{document, 0, 49...|[{token, 0, 4, Be...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+



In [None]:
result.select(F.explode(F.arrays_zip("token.result", "ner.result")).alias("cols"))\
      .select(F.expr("cols['0']").alias("token"), F.expr("cols['1']").alias("ner_label"))\
      .show(50, truncate=False)

+-----------+---------+
|token      |ner_label|
+-----------+---------+
|Benim      |O        |
|adım       |O        |
|Cesur      |B-PER    |
|Yurttaş    |I-PER    |
|ve         |O        |
|İstanbul'da|B-LOC    |
|yaşıyorum  |O        |
|.          |O        |
+-----------+---------+



## 📍**CamemBertForTokenClassification for French**

➤ CamemBertForTokenClassification can load CamemBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

➤ For extended examples of usage, see the Examples. To see which models are compatible and how to import them see [Import Transformers into Spark NLP](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669) 🚀.

In [None]:
from sparknlp.annotator import CamemBertForTokenClassification

example_df = spark.createDataFrame([["Je m'appelle Myriam Gomaz, j'habite à Paris, France."]]).toDF("text")

document_assembler = DocumentAssembler() \
        .setInputCol('text') \
        .setOutputCol('document')

tokenizer = Tokenizer() \
        .setInputCols(['document']) \
        .setOutputCol('token')

camembert_tagger = CamemBertForTokenClassification\
    .pretrained("camembert_classifier_ner", "fr") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    camembert_tagger
])

result = pipeline.fit(example_df).transform(example_df)
result.show()

camembert_classifier_ner download started this may take some time.
Approximate size to download 393.3 MB
[OK!]
+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|                 ner|
+--------------------+--------------------+--------------------+--------------------+
|Je m'appelle Myri...|[{document, 0, 51...|[{token, 0, 1, Je...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+



In [None]:
result.select(F.explode(F.arrays_zip("token.result", "ner.result")).alias("cols"))\
      .select(F.expr("cols['0']").alias("token"), F.expr("cols['1']").alias("ner_label"))\
      .show(50, truncate=False)

+---------+---------+
|token    |ner_label|
+---------+---------+
|Je       |O        |
|m'appelle|O        |
|Myriam   |I-PER    |
|Gomaz    |I-PER    |
|,        |O        |
|j'habite |O        |
|à        |O        |
|Paris    |I-LOC    |
|,        |O        |
|France   |I-LOC    |
|.        |O        |
+---------+---------+
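
In the output above, every entity token carries an `I-` prefix, including the first token of each entity. This resembles the IOB1 convention, where `B-` is only needed to separate two adjacent entities of the same type; that reading is an assumption about this model's tag scheme, not something stated in its documentation here. If downstream code expects an explicit `B-` at the start of each entity, a small conversion such as the hypothetical `iob1_to_iob2` helper below can be applied.

```python
def iob1_to_iob2(labels):
    """Rewrite IOB1 tags so that every entity starts with an explicit B- tag."""
    converted = []
    previous_type = None  # entity type of the previous token, None after "O"
    for label in labels:
        if label == "O":
            converted.append(label)
            previous_type = None
            continue
        prefix, _, entity = label.partition("-")
        if prefix == "I" and entity != previous_type:
            converted.append("B-" + entity)  # first token of a new entity
        else:
            converted.append(label)          # continuation or existing B- tag
        previous_type = entity
    return converted

labels = ["O", "O", "I-PER", "I-PER", "O", "O", "O", "I-LOC", "O", "I-LOC", "O"]
print(iob1_to_iob2(labels))
# ['O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC', 'O', 'B-LOC', 'O']
```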



##  📍 **Using LightPipeline**

[LightPipelines](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline) are Spark NLP-specific pipelines, equivalent to Spark ML Pipelines, but meant to deal with smaller amounts of data. They're useful when working with small datasets, debugging results, or running either training or prediction from an API that serves one-off requests.

Spark NLP LightPipelines are Spark ML pipelines converted to run on a single machine with multi-threading, **becoming more than 10x faster** for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum). To use them, we simply plug in a trained (fitted) pipeline and then annotate plain text. We don't even need to convert the input text to a DataFrame in order to feed it into a pipeline that accepts a DataFrame as input in the first place. This feature is quite useful for getting predictions for a few lines of text from a trained ML model.

For more details, check the following 
[Medium post](https://medium.com/spark-nlp/spark-nlp-101-lightpipeline-a544e93f20f1).

This class accepts a string or a list of strings as input, without the need to transform your text into a Spark DataFrame. The [.annotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.annotate) method returns a dictionary (or a list of dictionaries if a list is passed as input) with the results of each step in the pipeline. To retrieve all metadata from the annotators in the result, use the method [.fullAnnotate()](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline.fullAnnotate) instead, which always returns a list.

To extract the results from the object, you just need to parse the dictionary.

Let's use the `bert_large_token_classifier_ontonote` model with `LightPipeline` and `.fullAnnotate()` it with sample data.

In [None]:
from sparknlp.annotator import NerConverter

document_assembler = DocumentAssembler() \
        .setInputCol('text') \
        .setOutputCol('document')

tokenizer = Tokenizer() \
        .setInputCols(['document']) \
        .setOutputCol('token')

tokenClassifier = BertForTokenClassification \
    .pretrained('bert_large_token_classifier_ontonote', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('ner') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

ner_converter = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('entities')

pipeline = Pipeline(stages=[document_assembler, 
                            tokenizer,
                            tokenClassifier,
                            ner_converter])

empty_df = spark.createDataFrame([['']]).toDF("text")
model = pipeline.fit(empty_df)

bert_large_token_classifier_ontonote download started this may take some time.
Approximate size to download 1.2 GB
[OK!]


In [None]:
from sparknlp.base import LightPipeline

light_model= LightPipeline(model)
light_result= light_model.fullAnnotate("Steven Rothery is the original guitarist and the longest continuous member of the British rock band Marillion.")[0]

In [None]:
light_result

{'document': [Annotation(document, 0, 109, Steven Rothery is the original guitarist and the longest continuous member of the British rock band Marillion., {}, [])],
 'token': [Annotation(token, 0, 5, Steven, {'sentence': '0'}, []),
  Annotation(token, 7, 13, Rothery, {'sentence': '0'}, []),
  Annotation(token, 15, 16, is, {'sentence': '0'}, []),
  Annotation(token, 18, 20, the, {'sentence': '0'}, []),
  Annotation(token, 22, 29, original, {'sentence': '0'}, []),
  Annotation(token, 31, 39, guitarist, {'sentence': '0'}, []),
  Annotation(token, 41, 43, and, {'sentence': '0'}, []),
  Annotation(token, 45, 47, the, {'sentence': '0'}, []),
  Annotation(token, 49, 55, longest, {'sentence': '0'}, []),
  Annotation(token, 57, 66, continuous, {'sentence': '0'}, []),
  Annotation(token, 68, 73, member, {'sentence': '0'}, []),
  Annotation(token, 75, 76, of, {'sentence': '0'}, []),
  Annotation(token, 78, 80, the, {'sentence': '0'}, []),
  Annotation(token, 82, 88, British, {'sentence': '0'}, []

➤ Let's check the classes that `bert_large_token_classifier_ontonote` model can predict

In [None]:
tokenClassifier.getClasses()

['I-TIME',
 'B-PERSON',
 'B-GPE',
 'B-LAW',
 'B-NORP',
 'B-LOC',
 'I-ORG',
 'I-QUANTITY',
 'B-DATE',
 'B-PRODUCT',
 'B-FAC',
 'I-DATE',
 'I-WORK_OF_ART',
 'B-TIME',
 'B-QUANTITY',
 'I-PERCENT',
 'I-LAW',
 'I-GPE',
 'I-NORP',
 'I-ORDINAL',
 'I-EVENT',
 'I-LOC',
 'B-EVENT',
 'I-FAC',
 'B-ORDINAL',
 'B-LANGUAGE',
 'B-MONEY',
 'B-PERCENT',
 'I-LANGUAGE',
 'B-ORG',
 'I-MONEY',
 'I-PRODUCT',
 'O',
 'B-WORK_OF_ART',
 'I-CARDINAL',
 'I-PERSON',
 'B-CARDINAL']

In [None]:
light_result.keys()

dict_keys(['document', 'token', 'ner', 'entities'])

➤ Parsing the dictionary for NER labels

In [None]:
import pandas as pd
tokens= []
ner_labels= []

for i, k in list(zip(light_result["token"], light_result["ner"])):
  tokens.append(i.result)
  ner_labels.append(k.result)

result_df= pd.DataFrame({"tokens": tokens, "ner_labels": ner_labels})
result_df.head(20)

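|   | tokens     | ner_labels |
|---|------------|------------|
| 0 | Steven     | B-PERSON   |
| 1 | Rothery    | I-PERSON   |
| 2 | is         | O          |
| 3 | the        | O          |
| 4 | original   | O          |
| 5 | guitarist  | O          |
| 6 | and        | O          |
| 7 | the        | O          |
| 8 | longest    | O          |
| 9 | continuous | O          |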


➤ Parsing the dictionary for `NerConverter` metadata



In [None]:
chunks= []
begin= []
end= []
ner_label= []

for i in light_result["entities"]:
  chunks.append(i.result)
  begin.append(i.begin)
  end.append(i.end)
  ner_label.append(i.metadata["entity"])

result_df= pd.DataFrame({"chunks": chunks, "begin": begin, "end": end, "ner_label": ner_label})
result_df.head(20)

|   | chunks         | begin | end | ner_label |
|---|----------------|-------|-----|-----------|
| 0 | Steven Rothery | 0     | 13  | PERSON    |
| 1 | British        | 82    | 88  | NORP      |
| 2 | Marillion      | 100   | 108 | ORG       |
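
Since `.fullAnnotate()` always returns a list, the same parsing can be wrapped in a helper that handles several texts at once. The sketch below is hedged: `FakeAnnotation` is a hypothetical stand-in that only mimics the attributes read in the loop above (`result`, `begin`, `end`, `metadata`), so the example runs without Spark, and `entities_to_rows` is an illustrative name rather than a Spark NLP function.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the attributes read from Spark NLP
# annotations in this notebook: result, begin, end and metadata.
FakeAnnotation = namedtuple("FakeAnnotation", "result begin end metadata")

def entities_to_rows(full_results):
    """Flatten a list of fullAnnotate-style dicts into one row per entity."""
    rows = []
    for text_id, annotations in enumerate(full_results):
        for chunk in annotations["entities"]:
            rows.append({
                "text_id": text_id,  # index of the input text in the list
                "chunk": chunk.result,
                "begin": chunk.begin,
                "end": chunk.end,
                "ner_label": chunk.metadata["entity"],
            })
    return rows

# Values copied from the result shown above.
sample = [{"entities": [
    FakeAnnotation("Steven Rothery", 0, 13, {"entity": "PERSON"}),
    FakeAnnotation("British", 82, 88, {"entity": "NORP"}),
    FakeAnnotation("Marillion", 100, 108, {"entity": "ORG"}),
]}]
for row in entities_to_rows(sample):
    print(row)
```

With a real pipeline, the list returned by `light_model.fullAnnotate([...])` could be passed in place of `sample`, and the rows handed to `pd.DataFrame(rows)`.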


# From HuggingFace to Spark NLP

Here you will learn how to export a model from HuggingFace to Spark NLP. 

For compatibility details and examples, check [this page](https://nlp.johnsnowlabs.com/docs/en/transformers#import-transformers-into-spark-nlp).

## Export and Save HuggingFace model

- Let's install `HuggingFace` and `TensorFlow`. You don't need `TensorFlow` to be installed for Spark NLP; however, we need it to load and save models from HuggingFace.
- We pin TensorFlow to version `2.11.0` and Transformers to `4.25.1`. Other releases may also work, but these are the versions that have been tested successfully.

In [None]:
!pip install -q transformers==4.25.1 tensorflow==2.11.0

- HuggingFace comes with a native `saved_model` feature inside the `save_pretrained` function for TensorFlow-based models. We will use it to save the model as a TF `SavedModel`.
- We'll use the [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER) model from HuggingFace as an example.
- In addition to `TFBertForTokenClassification`, we also need to save the `BertTokenizer`. This is the same for every model: these are the assets needed for tokenization inside Spark NLP.

In [None]:
from transformers import TFBertForTokenClassification, BertTokenizer 
import tensorflow as tf

MODEL_NAME = 'dslim/bert-base-NER'

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained('./{}_tokenizer/'.format(MODEL_NAME))

# In case no TF/Keras weights are provided for the model,
# fall back to `from_pt` and convert the PyTorch weights to TensorFlow
try:
  print('try downloading TF weights')
  model = TFBertForTokenClassification.from_pretrained(MODEL_NAME)
except Exception:
  print('try downloading PyTorch weights')
  model = TFBertForTokenClassification.from_pretrained(MODEL_NAME, from_pt=True)

# Define TF Signature
@tf.function(
  input_signature=[
      {
          "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
          "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
          "token_type_ids": tf.TensorSpec((None, None), tf.int32, name="token_type_ids"),
      }
  ]
)
def serving_fn(input):
    return model(input)

model.save_pretrained("./{}".format(MODEL_NAME), saved_model=True, signatures={"serving_default": serving_fn})

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/59.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/829 [00:00<?, ?B/s]

try downloading TF weights


Downloading (…)"tf_model.h5";:   0%|          | 0.00/434M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFBertForTokenClassification.

All the layers of TFBertForTokenClassification were initialized from the model checkpoint at dslim/bert-base-NER.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


➤ Let's have a look inside these two directories and see what we are dealing with:

In [None]:
!ls -l {MODEL_NAME}

total 421088
-rw-r--r-- 1 root root       999 Mar  3 13:08 config.json
drwxr-xr-x 3 root root      4096 Mar  3 13:08 saved_model
-rw-r--r-- 1 root root 431179820 Mar  3 13:08 tf_model.h5


In [None]:
!ls -l {MODEL_NAME}/saved_model/1

total 9152
drwxr-xr-x 2 root root    4096 Mar  3 13:08 assets
-rw-r--r-- 1 root root      53 Mar  3 13:08 fingerprint.pb
-rw-r--r-- 1 root root  165837 Mar  3 13:08 keras_metadata.pb
-rw-r--r-- 1 root root 9190201 Mar  3 13:08 saved_model.pb
drwxr-xr-x 2 root root    4096 Mar  3 13:08 variables


In [None]:
!ls -l {MODEL_NAME}_tokenizer

total 220
-rw-r--r-- 1 root root    125 Mar  3 13:06 special_tokens_map.json
-rw-r--r-- 1 root root    551 Mar  3 13:06 tokenizer_config.json
-rw-r--r-- 1 root root 213450 Mar  3 13:06 vocab.txt


- As you can see, we need the SavedModel from the `saved_model/1/` path.
- We will also need `vocab.txt` from the tokenizer.
- All we need to do is copy `vocab.txt` to `saved_model/1/assets`, where Spark NLP will look for it.
- In addition to the vocabulary, we also need the `labels` and their `ids`, which are saved inside the model's config. We will save them in `labels.txt`.
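
The label step in the last bullet relies on sorting label names by their numeric id, presumably so that the line position in `labels.txt` matches each label's id. A minimal sketch of that sorting idiom follows; the `label2id` mapping here is made up for illustration, whereas real models store theirs in `model.config.label2id`.

```python
# Made-up mapping for illustration; real models keep theirs in config.label2id.
label2id = {"B-PER": 1, "O": 0, "I-PER": 2, "B-LOC": 3}

# Iterating a dict yields its keys, so sorting with key=label2id.get
# orders the label names by their numeric id.
labels = sorted(label2id, key=label2id.get)
print(labels)  # ['O', 'B-PER', 'I-PER', 'B-LOC']

# Position i in the sorted list (0-based) holds the label whose id is i.
assert all(label2id[name] == index for index, name in enumerate(labels))
```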

In [None]:
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)

!cp {MODEL_NAME}_tokenizer/vocab.txt {asset_path}

In [None]:
# get label2id dictionary 
labels = model.config.label2id
# sort the dictionary based on the id
labels = sorted(labels, key=labels.get)

with open(asset_path+'/labels.txt', 'w') as f:
    f.write('\n'.join(labels))

➤ We have our vocab.txt and labels.txt inside assets directory

In [None]:
! ls -l {MODEL_NAME}/saved_model/1/assets

total 216
-rw-r--r-- 1 root root     51 Mar  3 13:09 labels.txt
-rw-r--r-- 1 root root 213450 Mar  3 13:09 vocab.txt
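
Before loading the export into Spark NLP, it can help to confirm that the files described in the steps above are in place. The following sketch uses only the standard library; `missing_export_files` is a hypothetical helper, and the list of expected files is taken from the directory listings shown in this notebook rather than from any official specification.

```python
import tempfile
from pathlib import Path

def missing_export_files(saved_model_dir):
    """Return the expected files that are absent from a SavedModel directory."""
    root = Path(saved_model_dir)
    expected = [
        root / "saved_model.pb",
        root / "assets" / "vocab.txt",
        root / "assets" / "labels.txt",
    ]
    return [path.relative_to(root).as_posix() for path in expected if not path.is_file()]

# Demonstration on a temporary directory where labels.txt was forgotten.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "assets").mkdir()
    (root / "saved_model.pb").touch()
    (root / "assets" / "vocab.txt").touch()
    print(missing_export_files(root))  # ['assets/labels.txt']
```

In this notebook the call would be `missing_export_files('{}/saved_model/1'.format(MODEL_NAME))`, and an empty list would mean all three files were found.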


## Import and Save BertForTokenClassification in Spark NLP

- Let's use the `loadSavedModel` function in `BertForTokenClassification`, which allows us to load a TensorFlow model in SavedModel format.
- Most params, such as `setMaxSentenceLength`, can be set later when you load this model in `BertForTokenClassification` at runtime, so don't worry about what you set them to now.
- `loadSavedModel` accepts two params: the first is the path to the TF SavedModel, and the second is the SparkSession, that is, the `spark` variable we previously started via `sparknlp.start()`.
- NOTE: `loadSavedModel` accepts local paths in addition to distributed file systems such as `HDFS`, `S3`, `DBFS`, etc. This feature was introduced in the Spark NLP 4.2.2 release. Keep in mind that the best and recommended way to move/share/reuse Spark NLP models is to use `write.save`, so you can use `.load()` from any file system natively.

In [None]:
from sparknlp.annotator import *
from sparknlp.base import *


tokenClassifier = BertForTokenClassification.loadSavedModel(
     '{}/saved_model/1'.format(MODEL_NAME),
     spark
 )\
 .setInputCols(["document",'token'])\
 .setOutputCol("ner")

➤ Let's save it on disk so it is easier to move around and can be used later via the `.load` function

In [None]:
tokenClassifier.write().overwrite().save("./{}_spark_nlp".format(MODEL_NAME))

➤ Let's clean up stuff we don't need anymore

In [None]:
! rm -rf {MODEL_NAME}_tokenizer {MODEL_NAME}

Awesome 😎  !

This is your BertForTokenClassification model from HuggingFace 🤗  loaded and saved by Spark NLP 🚀 

In [None]:
! ls -l {MODEL_NAME}_spark_nlp

total 429704
-rw-r--r-- 1 root root 440007186 Mar  3 13:10 bert_classification_tensorflow
drwxr-xr-x 5 root root      4096 Mar  3 13:09 fields
drwxr-xr-x 2 root root      4096 Mar  3 13:09 metadata


➤ Now let's see how we can use it on other machines, clusters, or any place you wish to use your new and shiny BertForTokenClassification model 😊

In [None]:
tokenClassifier_loaded = BertForTokenClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
  .setInputCols(["document",'token'])\
  .setOutputCol("ner")

➤ That's it! You can now go wild and use hundreds of `BertForTokenClassification` models from HuggingFace 🤗 in Spark NLP 🚀 

➤ You can see what labels were used to train this model via the `getClasses` function:

In [None]:
tokenClassifier_loaded.getClasses()

['B-LOC', 'I-ORG', 'I-MISC', 'I-LOC', 'I-PER', 'B-MISC', 'B-ORG', 'O', 'B-PER']
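
The list returned by `getClasses` mixes `B-` and `I-` variants of the same entity type. To see the distinct entity types at a glance, the prefixes can be stripped with a few lines of plain Python; `entity_types` is an illustrative helper name, not a Spark NLP method.

```python
def entity_types(classes):
    """Collect the distinct entity types behind a list of BIO labels."""
    # split("-", 1) keeps types that themselves contain "-" or "_" intact.
    return sorted({label.split("-", 1)[1] for label in classes if label != "O"})

classes = ['B-LOC', 'I-ORG', 'I-MISC', 'I-LOC', 'I-PER', 'B-MISC', 'B-ORG', 'O', 'B-PER']
print(entity_types(classes))  # ['LOC', 'MISC', 'ORG', 'PER']
```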
