[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/transformers/HuggingFace%20in%20Spark%20NLP%20-%20BertForTokenClassification.ipynb)

## Import BertForTokenClassification models from HuggingFace 🤗  into Spark NLP 🚀 

Let's keep in mind a few things before we start 😊 

- This feature is available in `Spark NLP 3.2.x` and later, so please make sure you have upgraded to a recent Spark NLP release
- You can import BERT models trained or fine-tuned for token classification via `BertForTokenClassification` or `TFBertForTokenClassification`. These models are usually listed under the `Token Classification` category and carry the `bert` tag
- Reference: [TFBertForTokenClassification](https://huggingface.co/transformers/model_doc/bert.html#tfbertfortokenclassification)
- Some [example models](https://huggingface.co/models?filter=bert&pipeline_tag=token-classification)
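
As a quick sanity check for the first point, here is a minimal sketch of a version comparison. The helper names `parse_version` and `supports_token_classification_import` are made up for this example; the only Spark NLP call referenced is `sparknlp.version()`, shown in a comment so the block runs without Spark NLP installed.

```python
from itertools import takewhile


def parse_version(version):
    """Turn a string such as '4.2.8' or '3.2.0rc1' into a 3-tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        # Keep only the leading digits so suffixes like 'rc1' are ignored.
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # pad '3.2' to (3, 2, 0)
    return tuple(parts)


def supports_token_classification_import(version, minimum=(3, 2, 0)):
    """Return True if the given Spark NLP version is at least `minimum`."""
    return parse_version(version) >= minimum


# Typical use once Spark NLP is installed:
#   import sparknlp
#   assert supports_token_classification_import(sparknlp.version())
print(supports_token_classification_import("4.2.8"))
```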

## Export and Save HuggingFace model

- Let's install `HuggingFace` and `TensorFlow`. Spark NLP itself does not need `TensorFlow` to be installed, but we need it to load and save models from HuggingFace.
- We pin TensorFlow to version `2.11.0` and Transformers to `4.25.1`. Later releases may work as well; these are simply the versions that have been tested successfully.

In [10]:
!pip install -q transformers==4.25.1 tensorflow==2.11.0
!pip install tensorflow-addons
# Clone onnx-tensorflow only if it is not already present, then install it in editable mode
![ -d onnx-tensorflow ] || git clone https://github.com/onnx/onnx-tensorflow.git
!cd onnx-tensorflow && pip install -e .
# The PyPI package for PyTorch is named `torch`; `pip install pytorch` fails while building the wheel
!pip install torch torchvision
!pip install ftfy

In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


- HuggingFace comes with a native `saved_model` option inside the `save_pretrained` function for TensorFlow-based models. We will use it to save the model as a TF `SavedModel`.
- The original walkthrough uses the [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER) model from HuggingFace as an example. The cells below instead load a custom fine-tuned PyTorch checkpoint from Google Drive and convert it to TensorFlow with `from_pt=True`.
- In addition to `TFBertForTokenClassification` we also need to save the `BertTokenizer`. This is the same for every model; these are the assets needed for tokenization inside Spark NLP.

In [77]:
import torch
import torch.nn as nn
import torch.optim as optim

import pandas as pd
import numpy as np
import os
from pprint import pprint
import string    
import random
import json
import spacy
from spacy import displacy
#from transformers import BertTokenizer, BertForTokenClassification
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification, TFAutoModelForTokenClassification

In [79]:
save_model_address = '/content/drive/MyDrive/Data_Science/thesis/ML_NER/NERModel_config'
#save_model = BertForTokenClassification.from_pretrained(save_model_address, num_labels=20)
#tokenizer = BertTokenizer.from_pretrained(save_model_address,do_lower_case=True)

save_model = TFAutoModelForTokenClassification.from_pretrained(save_model_address, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(save_model_address, do_lower_case=True, model_max_length=256)

nlp = pipeline("ner", model=save_model, tokenizer=tokenizer, aggregation_strategy='simple',ignore_labels =['X','O'])

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForTokenClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForTokenClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


In [8]:
# Test with a long input text
orig_string = '''Learn and Master Software Testing Quickly from the experts - GUARANTEED! THE IN-DEPTH SOFTWARE TESTING TRAINING - By SoftwareTestingHelp Team. "TOP STUDENT PICK" on Udemy in the Software Testing category! 26+ hours of HD content. Value for money! DON'T settle for other basic courses of less thanhours! Few Student reviews from hundreds ofstar reviews: "The course is an eye opener into the world of IT. Theophilus. "Money well spent, excellent delivery. Very informative and practical. Would highly recommend to anyone interested in pursuing software testing as a career. Olanrewaju. "Truly the best software testing training I have come across both in dept and in substance. Kingsley. "This is really "The Best Software Training Course". I hardly know anything regarding testing, instructor had taken utmost care in providing the knowledge starting from basics, the terminology etc...I am very much satisfied with this course. I strongly recommend this course. Vijaya. "Great tutorials ..in detail ...learned a lot ...must see tutorial for all testers. Masud. "The instructor is just a perfect lecturer! Entire course is very informative and useful for software testers as beginners with a lot of practical examples. Who wants to understand principles of testing and main techniques of it - enroll in this course. Oleksii. "The instructor according to me.....God has gifted her a real talent to be one of the best tutors in this world. Biju. Introducing the Most Practical, Precise and Inexpensive Software Testing Course. It is going to include everything there is to know for you to become a perfect Software Tester. 
This software testing QA training course is designed by working professionals in a way that, course it will progress from introducing you to the basics of software testing to advanced topics like Software configuration management, creating a test plan, test estimations etc along with introduction and familiarity with Automation testing and test management tools like QTP (intro), QC, JIRA, and Bugzilla. Course Benefits: Syllabus: We came up with a unique list of topics that will help you gradually work your way into the testing world. Practice sessions: Assignments in a way that you will get to apply the theory you learnt immediately. Video sessions of Instructor led live training sessions. Practical learning experienc e with live project work and examples. Support: Our Team is going to be available to you via email or the website for you to reach out to us. Over Lectures and more than+ hours of HD content! Learn Software Testing and Automation basics from a professional trainer from your own desk. Information packed practical training starting from basics to advanced testing techniques. Best suitable for beginners to advanced level users and who learn faster when demonstrated. Get â€œCertificate of completion. LIVE PROJECT End to End Software Testing Training Included. Learn Software Testing and Automation basics from a professional trainer from your own desk. Information packed practical training starting from basics to advanced testing techniques. Best suitable for beginners to advanced level users and who learn faster when demonstrated. Course content designed by considering current software testing technology and the job market. Practical assignments at the end of every session. Practical learning experience with live project work and examples. Lifetime enrollment - Pay one time fee and access video training sessions as many times as you want. Resume Preparation Guidance for Testers Included. Software Testing Interview Questions and Preparation Tips Included. 
Download Real Software Testing Templates like Test Plan, Test Cases and other important Templates. Software Testing Certification Guidance. Learn Test Management Tools like JIRA, and Bugzilla. Get all future course updates free!'''
#results = nlp(sentences)
#results
#len(tokenizer.tokenize(sentences, truncation=True))
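# Split the text into chunks of at most max_length characters,
# cutting at the last '.', '!', '?' or ',' inside each window.
list_of_lines = []
max_length = 256 * 4  # character budget; assumes roughly 4 characters per token for a 256-token input
remaining = orig_string  # work on a copy so orig_string stays intact
while len(remaining) > max_length:
    cut = max(remaining[:max_length].rfind(ch) for ch in ".!?,")
    if cut <= 0:
        cut = max_length  # no usable punctuation in this window: fall back to a hard cut
    list_of_lines.append(remaining[:cut])
    remaining = remaining[cut:].lstrip(".!?, ")  # drop the separator before the next chunk
list_of_lines.append(remaining)
list_of_lines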

['Learn and Master Software Testing Quickly from the experts - GUARANTEED! THE IN-DEPTH SOFTWARE TESTING TRAINING - By SoftwareTestingHelp Team. "TOP STUDENT PICK" on Udemy in the Software Testing category! 26+ hours of HD content. Value for money! DON\'T settle for other basic courses of less thanhours! Few Student reviews from hundreds ofstar reviews: "The course is an eye opener into the world of IT. Theophilus. "Money well spent, excellent delivery. Very informative and practical. Would highly recommend to anyone interested in pursuing software testing as a career. Olanrewaju. "Truly the best software testing training I have come across both in dept and in substance. Kingsley. "This is really "The Best Software Training Course". I hardly know anything regarding testing, instructor had taken utmost care in providing the knowledge starting from basics, the terminology etc...I am very much satisfied with this course. I strongly recommend this course. Vijaya. "Great tutorials ..in deta
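
The character-window split above can cut in the middle of a sentence (at a comma, for instance). As an alternative, here is a minimal sketch that packs whole sentences greedily into chunks. The function name `pack_sentences` and the regex-based sentence splitter are assumptions for illustration, not part of the original notebook, and the splitter is deliberately naive.

```python
import re


def pack_sentences(text, max_chars):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own oversized chunk
    rather than being cut mid-sentence.
    """
    # Split after '.', '!' or '?' followed by whitespace; the punctuation stays attached.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


sample = "Value for money! Learn Software Testing. Get all future course updates free!"
for chunk in pack_sentences(sample, max_chars=45):
    print(len(chunk), chunk)
```

With the notebook's text, `pack_sentences(orig_string, 256 * 4)` would be the closest equivalent of the loop above, under the same rough characters-per-token assumption.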

In [21]:
# Repair mis-encoded characters with the ftfy ("fixes text for you") library
import ftfy
results = nlp(ftfy.fix_text(list_of_lines[0]))
results

[{'entity': 'B-KNOW',
  'score': 0.91182387,
  'index': 4,
  'word': 'software',
  'start': 17,
  'end': 25},
 {'entity': 'I-KNOW',
  'score': 0.89576316,
  'index': 5,
  'word': 'testing',
  'start': 26,
  'end': 33},
 {'entity': 'B-KNOW',
  'score': 0.8580251,
  'index': 17,
  'word': 'software',
  'start': 86,
  'end': 94},
 {'entity': 'I-KNOW',
  'score': 0.8224982,
  'index': 18,
  'word': 'testing',
  'start': 95,
  'end': 102},
 {'entity': 'B-KNOW',
  'score': 0.81303245,
  'index': 116,
  'word': 'software',
  'start': 543,
  'end': 551},
 {'entity': 'I-KNOW',
  'score': 0.9022264,
  'index': 117,
  'word': 'testing',
  'start': 552,
  'end': 559},
 {'entity': 'B-KNOW',
  'score': 0.8579681,
  'index': 132,
  'word': 'software',
  'start': 601,
  'end': 609},
 {'entity': 'I-KNOW',
  'score': 0.8925069,
  'index': 133,
  'word': 'testing',
  'start': 610,
  'end': 617}]
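
The output above is token-level: each word carries its own `B-` or `I-` tag. Below is a minimal sketch of how such predictions could be merged into entity spans using the `start`/`end` offsets. The helper `merge_bio_entities` and its `max_gap` tolerance are assumptions for this example; the sample data is copied from the first four predictions shown above.

```python
def merge_bio_entities(predictions, max_gap=1):
    """Merge consecutive B-/I- token predictions into entity spans.

    max_gap is an assumed tolerance: an I- token continues the previous span
    only if it starts at most max_gap characters after that span ends.
    """
    spans = []
    for pred in predictions:
        prefix, _, label = pred["entity"].partition("-")
        label = label or prefix  # tags without a B-/I- prefix keep their full name
        last = spans[-1] if spans else None
        continues_last = (
            prefix == "I"
            and last is not None
            and last["label"] == label
            and pred["start"] - last["end"] <= max_gap
        )
        if continues_last:
            last["end"] = pred["end"]
            last["scores"].append(pred["score"])
        else:
            spans.append({
                "label": label,
                "start": pred["start"],
                "end": pred["end"],
                "scores": [pred["score"]],
            })
    # Report the mean token score for each merged span.
    return [
        {
            "label": s["label"],
            "start": s["start"],
            "end": s["end"],
            "score": sum(s["scores"]) / len(s["scores"]),
        }
        for s in spans
    ]


# First four predictions copied from the output above.
predictions = [
    {"entity": "B-KNOW", "score": 0.91182387, "index": 4, "word": "software", "start": 17, "end": 25},
    {"entity": "I-KNOW", "score": 0.89576316, "index": 5, "word": "testing", "start": 26, "end": 33},
    {"entity": "B-KNOW", "score": 0.8580251, "index": 17, "word": "software", "start": 86, "end": 94},
    {"entity": "I-KNOW", "score": 0.8224982, "index": 18, "word": "testing", "start": 95, "end": 102},
]
for span in merge_bio_entities(predictions):
    print(span["label"], span["start"], span["end"], round(span["score"], 3))
```

Slicing the input text with each span's `start` and `end` then recovers the surface form of the entity.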

In [22]:
fixed = ftfy.fix_text(list_of_lines[0])
fixed

'Learn and Master Software Testing Quickly from the experts - GUARANTEED! THE IN-DEPTH SOFTWARE TESTING TRAINING - By SoftwareTestingHelp Team. "TOP STUDENT PICK" on Udemy in the Software Testing category! 26+ hours of HD content. Value for money! DON\'T settle for other basic courses of less thanhours! Few Student reviews from hundreds ofstar reviews: "The course is an eye opener into the world of IT. Theophilus. "Money well spent, excellent delivery. Very informative and practical. Would highly recommend to anyone interested in pursuing software testing as a career. Olanrewaju. "Truly the best software testing training I have come across both in dept and in substance. Kingsley. "This is really "The Best Software Training Course". I hardly know anything regarding testing, instructor had taken utmost care in providing the knowledge starting from basics, the terminology etc...I am very much satisfied with this course. I strongly recommend this course. Vijaya. "Great tutorials ..in detai

In [53]:
tokens = tokenizer(
    fixed, 
    return_attention_mask=False,
    truncation=True,
    return_special_tokens_mask=True,
    return_offsets_mapping=tokenizer.is_fast,
    return_tensors='pt'
)
tokens

{'input_ids': tensor([[  101,  4553,  1998,  3040,  4007,  5604,  2855,  2013,  1996,  8519,
          1011, 12361,   999,  1996,  1999,  1011,  5995,  4007,  5604,  2731,
          1011,  2011,  4007, 22199,  2075, 16001,  2361,  2136,  1012,  1000,
          2327,  3076,  4060,  1000,  2006, 20904, 26662,  1999,  1996,  4007,
          5604,  4696,   999,  2656,  1009,  2847,  1997, 10751,  4180,  1012,
          3643,  2005,  2769,   999,  2123,  1005,  1056,  7392,  2005,  2060,
          3937,  5352,  1997,  2625,  2084,  6806,  9236,   999,  2261,  3076,
          4391,  2013,  5606,  1997, 14117,  4391,  1024,  1000,  1996,  2607,
          2003,  2019,  3239, 16181,  2046,  1996,  2088,  1997,  2009,  1012,
         14833, 21850,  7393,  1012,  1000,  2769,  2092,  2985,  1010,  6581,
          6959,  1012,  2200, 12367,  8082,  1998,  6742,  1012,  2052,  3811,
         16755,  2000,  3087,  4699,  1999, 11828,  4007,  5604,  2004,  1037,
          2476,  1012, 19330,  2319, 1

In [54]:
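# `tokens` holds PyTorch tensors (return_tensors='pt'), while save_model is a
# TensorFlow model, so pull out the bookkeeping arrays and convert the rest.
import tensorflow as tf

# Offsets are only returned by fast tokenizers
offset_mapping = tokens.pop("offset_mapping").numpy()[0] if tokenizer.is_fast else None
special_tokens_mask = tokens.pop("special_tokens_mask").numpy()[0]

tf_inputs = {name: tf.constant(tensor.numpy()) for name, tensor in tokens.items()}
entities = save_model(**tf_inputs)[0][0].numpy()  # logits for the first (only) sequence
input_ids = tokens["input_ids"].numpy()[0]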
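
The `entities` array computed above holds raw logits with one row per word piece. Here is a minimal sketch of turning such rows into labels, skipping special tokens via the mask. The function names, the three-label mapping, and the logits are made up for illustration; with the real model the mapping would come from `save_model.config.id2label`.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def decode_token_labels(logits, special_tokens_mask, id2label):
    """Return (label, probability) for every non-special token."""
    decoded = []
    for row, is_special in zip(logits, special_tokens_mask):
        if is_special:
            continue  # skip [CLS], [SEP] and padding positions
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        decoded.append((id2label[best], probs[best]))
    return decoded


# Hypothetical mapping and logits for the sequence [CLS], two word pieces, [SEP].
id2label = {0: "O", 1: "B-KNOW", 2: "I-KNOW"}
logits = [[0.0, 0.0, 0.0], [0.1, 3.0, 0.2], [0.3, 0.1, 2.5], [0.0, 0.0, 0.0]]
mask = [1, 0, 0, 1]
print(decode_token_labels(logits, mask, id2label))
```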


In [76]:
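# torch.onnx.export expects a PyTorch nn.Module plus a tuple of example inputs.
# save_model is a TensorFlow model here, so the call below raises a TypeError;
# it is left commented out because the SavedModel export that follows does not need ONNX.
# torch.onnx.export(save_model, **tokens)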

In [90]:
import tensorflow as tf

MODEL_NAME = '/content/drive/MyDrive/Data_Science/thesis/ML_NER/NERModel_config'

# Define TF Signature
@tf.function(
  input_signature=[
      {
          "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
          "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
          "token_type_ids": tf.TensorSpec((None, None), tf.int32, name="token_type_ids"),
      }
  ]
)
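def serving_fn(inputs):
    # Forward the named input tensors to the model; this becomes the serving signature
    return save_model(inputs)

# saved_model=True additionally writes a TF SavedModel under <dir>/saved_model/1
save_model.save_pretrained("{}/converting".format(MODEL_NAME), saved_model=True, signatures={"serving_default": serving_fn})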



In [81]:
!apt-get install tree -q



In [91]:
!tree {MODEL_NAME}/converting

/content/drive/MyDrive/Data_Science/thesis/ML_NER/NERModel_config/converting
├── config.json
├── saved_model
│   └── 1
│       ├── assets
│       ├── fingerprint.pb
│       ├── keras_metadata.pb
│       ├── saved_model.pb
│       └── variables
│           ├── variables.data-00000-of-00001
│           └── variables.index
└── tf_model.h5

4 directories, 7 files


In [92]:
!tree {MODEL_NAME}

/content/drive/MyDrive/Data_Science/thesis/ML_NER/NERModel_config
├── config.json
├── converting
│   ├── config.json
│   ├── saved_model
│   │   └── 1
│   │       ├── assets
│   │       ├── fingerprint.pb
│   │       ├── keras_metadata.pb
│   │       ├── saved_model.pb
│   │       └── variables
│   │           ├── variables.data-00000-of-00001
│   │           └── variables.index
│   └── tf_model.h5
├── eval_results.txt
├── my_model_tf
│   └── saved_model
│       └── 1
│           ├── assets
│           │   ├── labels.txt
│           │   └── vocab.txt
│           ├── saved_model.pd
│           └── variables
├── pytorch_model.bin
└── vocab.txt

10 directories, 14 files


In [109]:
!cp -r {MODEL_NAME}/my_model_tf/saved_model/1/assets {MODEL_NAME}/converting/saved_model/1

In [125]:
!tree {MODEL_NAME}/converting

/content/drive/MyDrive/Data_Science/thesis/ML_NER/NERModel_config/converting
├── config.json
├── saved_model
│   └── 1
│       ├── assets
│       │   ├── labels.txt
│       │   └── vocab.txt
│       ├── fingerprint.pb
│       ├── keras_metadata.pb
│       ├── saved_model.pb
│       └── variables
│           ├── variables.data-00000-of-00001
│           └── variables.index
└── tf_model.h5

4 directories, 9 files
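
Here the `assets` folder was copied from an earlier export. If no such folder exists, `labels.txt` can be generated instead. The sketch below assumes Spark NLP expects one label per line, ordered by label id, which is how the upstream import notebooks write the file. The helper name `write_labels_file` and the three-label mapping are hypothetical; with the real model the mapping would be `save_model.config.label2id`.

```python
import tempfile
from pathlib import Path


def write_labels_file(label2id, assets_dir):
    """Write labels.txt with one label per line, ordered by label id."""
    assets = Path(assets_dir)
    assets.mkdir(parents=True, exist_ok=True)
    ordered = sorted(label2id, key=label2id.get)
    target = assets / "labels.txt"
    target.write_text("\n".join(ordered), encoding="utf-8")
    return target


# Hypothetical mapping, deliberately out of order to show the sorting.
label2id = {"I-KNOW": 2, "O": 0, "B-KNOW": 1}
with tempfile.TemporaryDirectory() as tmp:
    path = write_labels_file(label2id, Path(tmp) / "assets")
    print(path.read_text(encoding="utf-8").splitlines())
```

The tokenizer's `vocab.txt` still has to be placed next to `labels.txt`, as in the tree above.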


In [95]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Installing PySpark 3.2.3 and Spark NLP 4.2.8
setup Colab for PySpark 3.2.3 and Spark NLP 4.2.8


In [96]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()

In [117]:
from sparknlp.annotator import *

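# The HuggingFace tokenizer above was loaded with do_lower_case=True (uncased vocabulary),
# so case sensitivity is switched off here to match it.
bert = BertForTokenClassification.loadSavedModel(
     '{}/converting/saved_model/1'.format(MODEL_NAME),
     spark
 )\
 .setInputCols(["document",'token'])\
 .setOutputCol("ner")\
 .setCaseSensitive(False)\
 .setMaxSentenceLength(128)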

In [122]:
bert.write().overwrite().save("./{}".format(MODEL_NAME))

In [120]:
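# Optional cleanup, left disabled: MODEL_NAME is the source checkpoint directory on Drive here,
# so deleting it would also remove pytorch_model.bin and the converted SavedModel.
# !rm -rf {MODEL_NAME}_tokenizer {MODEL_NAME}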

In [126]:
! ls -l {MODEL_NAME}

total 425716
-rw------- 1 root root      1136 Sep 26 01:35 config.json
drwx------ 3 root root      4096 Jan 30 20:32 converting
-rw------- 1 root root       554 Sep 26 01:35 eval_results.txt
drwx------ 3 root root      4096 Jan 30 20:32 my_model_tf
-rw------- 1 root root 435689969 Sep 26 01:35 pytorch_model.bin
-rw------- 1 root root    231508 Sep 26 01:35 vocab.txt


In [128]:
tokenClassifier_loaded = BertForTokenClassification.load("./{}".format(MODEL_NAME))\
  .setInputCols(["document",'token'])\
  .setOutputCol("ner")

In [129]:
tokenClassifier_loaded.getClasses()

['I-TOOL',
 'B-TOOL',
 'I-KNOW',
 '[SEP]',
 'B-LANG',
 'I-LANG',
 'B-FRAM',
 'I-FRAM',
 'B-KNOW',
 'I-PLAT',
 '[CLS]',
 'O',
 'B-PLAT']

In [133]:
from pyspark.ml import Pipeline
from sparknlp.base import *

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Named spark_tokenizer so it does not shadow the HuggingFace `tokenizer` defined earlier
spark_tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Named ner_pipeline so it does not shadow the transformers `pipeline` function imported earlier
ner_pipeline = Pipeline(stages=[
    document_assembler,
    spark_tokenizer,
    tokenClassifier_loaded
])

# A couple of simple examples
example = spark.createDataFrame([["Learn and Master software testing Quickly from the experts - GUARANTEED! THE IN-DEPTH software testing TRAINING - By SoftwareTestingHelp Team. on Udemy in the software testing category!"], ['My name is Clara and I live in Berkeley, California.']]).toDF("text")

result = ner_pipeline.fit(example).transform(example)

# result is a DataFrame
result.select("text", "ner.result").show()

+--------------------+--------------------+
|                text|              result|
+--------------------+--------------------+
|Learn and Master ...|[O, O, O, B-KNOW,...|
|My name is Clara ...|[O, O, O, O, O, O...|
+--------------------+--------------------+

