[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/transformers/HuggingFace%20in%20Spark%20NLP%20-%20BertForTokenClassification.ipynb)

## Import BertForTokenClassification models from HuggingFace 🤗  into Spark NLP 🚀 

Let's keep in mind a few things before we start 😊 

- This feature is only in `Spark NLP 3.2.x` and after. So please make sure you have upgraded to the latest Spark NLP release
- You can import BERT models trained/fine-tuned for token classification via `BertForTokenClassification` or `TFBertForTokenClassification`. These models are usually under `Token Classification` category and have `bert` in their labels
- Reference: [TFBertForTokenClassification](https://huggingface.co/transformers/model_doc/bert.html#tfbertfortokenclassification)
- Some [example models](https://huggingface.co/models?filter=bert&pipeline_tag=token-classification)

## Export and Save HuggingFace model

- Let's install `HuggingFace` and `TensorFlow`. You don't need `TensorFlow` to be installed for Spark NLP, however, we need it to load and save models from HuggingFace.
- We lock TensorFlow on `2.11.0` version and Transformers on `4.25.1`. This doesn't mean it won't work with the future releases, but we wanted you to know which versions have been tested successfully.

In [1]:
!pip install -q transformers==4.25.1 tensorflow==2.11.0

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.8/5.8 MB[0m [31m50.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m588.3/588.3 MB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m104.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m23.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m439.2/439.2 KB[0m [31m43.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m78.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.0/6.0 MB[0m [31m105.8 MB/s[0m eta [36m0:00:00[0m
[?25h

In [1]:
!pip install tensorflow-addons
!git clone https://github.com/onnx/onnx-tensorflow.git && cd onnx-tensorflow && pip install -e . 
!pip install pytorch
!pip install torchvision
!pip install ftfy

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.8/5.8 MB[0m [31m45.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m588.3/588.3 MB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m16.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m89.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.0/6.0 MB[0m [31m81.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m70.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m439.2/439.2 KB[0m [31m40.3 MB/s[0m eta [36m0:00:00[0m
[?25hLooking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tensorflow-addons
  Downloading tensorflow_ad

In [2]:
# connect ggdrive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


- HuggingFace comes with a native `saved_model` feature inside `save_pretrained` function for TensorFlow based models. We will use that to save it as TF `SavedModel`.
- We'll use [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER) model from HuggingFace as an example
- In addition to `TFBertForTokenClassification` we also need to save the `BertTokenizer`. This is the same for every model, these are assets needed for tokenization inside Spark NLP.

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim

import pandas as pd
import numpy as np
import os
from pprint import pprint
import string    
import random
import json
import spacy
from spacy import displacy
#from transformers import BertTokenizer, BertForTokenClassification
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification, TFAutoModelForTokenClassification



In [4]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification, TFAutoModelForTokenClassification

In [4]:
#from transformers import BertTokenizer, BertForTokenClassification
save_model_address = '/content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config'
#save_model = BertForTokenClassification.from_pretrained(save_model_address)
#tokenizer = BertTokenizer.from_pretrained(save_model_address,do_lower_case=True, model_max_length=256)

save_model = TFAutoModelForTokenClassification.from_pretrained(save_model_address, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(save_model_address, do_lower_case=True, model_max_length=256)

nlp = pipeline("ner", model=save_model, tokenizer=tokenizer, aggregation_strategy='simple',ignore_labels =['X','O'])

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForTokenClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForTokenClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


In [5]:
# test cau dai
orig_string = '''Learn and Master Software Testing Quickly from the experts - GUARANTEED! THE IN-DEPTH SOFTWARE TESTING TRAINING - By SoftwareTestingHelp Team. "TOP STUDENT PICK" on Udemy in the Software Testing category! 26+ hours of HD content. Value for money! DON'T settle for other basic courses of less thanhours! Few Student reviews from hundreds ofstar reviews: "The course is an eye opener into the world of IT. Theophilus. "Money well spent, excellent delivery. Very informative and practical. Would highly recommend to anyone interested in pursuing software testing as a career. Olanrewaju. "Truly the best software testing training I have come across both in dept and in substance. Kingsley. "This is really "The Best Software Training Course". I hardly know anything regarding testing, instructor had taken utmost care in providing the knowledge starting from basics, the terminology etc...I am very much satisfied with this course. I strongly recommend this course. Vijaya. "Great tutorials ..in detail ...learned a lot ...must see tutorial for all testers. Masud. "The instructor is just a perfect lecturer! Entire course is very informative and useful for software testers as beginners with a lot of practical examples. Who wants to understand principles of testing and main techniques of it - enroll in this course. Oleksii. "The instructor according to me.....God has gifted her a real talent to be one of the best tutors in this world. Biju. Introducing the Most Practical, Precise and Inexpensive Software Testing Course. It is going to include everything there is to know for you to become a perfect Software Tester. This software testing QA training course is designed by working professionals in a way that, course it will progress from introducing you to the basics of software testing to advanced topics like Software configuration management, creating a test plan, test estimations etc along with introduction and familiarity with Automation testing and test management tools like QTP (intro), QC, JIRA, and Bugzilla. Course Benefits: Syllabus: We came up with a unique list of topics that will help you gradually work your way into the testing world. Practice sessions: Assignments in a way that you will get to apply the theory you learnt immediately. Video sessions of Instructor led live training sessions. Practical learning experienc e with live project work and examples. Support: Our Team is going to be available to you via email or the website for you to reach out to us. Over Lectures and more than+ hours of HD content! Learn Software Testing and Automation basics from a professional trainer from your own desk. Information packed practical training starting from basics to advanced testing techniques. Best suitable for beginners to advanced level users and who learn faster when demonstrated. Get â€œCertificate of completion. LIVE PROJECT End to End Software Testing Training Included. Learn Software Testing and Automation basics from a professional trainer from your own desk. Information packed practical training starting from basics to advanced testing techniques. Best suitable for beginners to advanced level users and who learn faster when demonstrated. Course content designed by considering current software testing technology and the job market. Practical assignments at the end of every session. Practical learning experience with live project work and examples. Lifetime enrollment - Pay one time fee and access video training sessions as many times as you want. Resume Preparation Guidance for Testers Included. Software Testing Interview Questions and Preparation Tips Included. Download Real Software Testing Templates like Test Plan, Test Cases and other important Templates. Software Testing Certification Guidance. Learn Test Management Tools like JIRA, and Bugzilla. Get all future course updates free!'''
#results = nlp(sentences)
#results
#len(tokenizer.tokenize(sentences, truncation=True))
list_of_lines = []
max_length = 256*4
while len(orig_string) > max_length:
    line_length = max(orig_string[:max_length].rfind(i) for i in ".!?,")
    list_of_lines.append(orig_string[:line_length])
    orig_string = orig_string[line_length + 1:]
list_of_lines.append(orig_string)
list_of_lines

['Learn and Master Software Testing Quickly from the experts - GUARANTEED! THE IN-DEPTH SOFTWARE TESTING TRAINING - By SoftwareTestingHelp Team. "TOP STUDENT PICK" on Udemy in the Software Testing category! 26+ hours of HD content. Value for money! DON\'T settle for other basic courses of less thanhours! Few Student reviews from hundreds ofstar reviews: "The course is an eye opener into the world of IT. Theophilus. "Money well spent, excellent delivery. Very informative and practical. Would highly recommend to anyone interested in pursuing software testing as a career. Olanrewaju. "Truly the best software testing training I have come across both in dept and in substance. Kingsley. "This is really "The Best Software Training Course". I hardly know anything regarding testing, instructor had taken utmost care in providing the knowledge starting from basics, the terminology etc...I am very much satisfied with this course. I strongly recommend this course. Vijaya. "Great tutorials ..in deta

In [6]:
# fix ky tu bang thu vien fix that for you
import ftfy
results = nlp(ftfy.fix_text(list_of_lines[0]))
results

[{'entity_group': 'KNOW',
  'score': 0.9037937,
  'word': 'software testing',
  'start': 17,
  'end': 33},
 {'entity_group': 'KNOW',
  'score': 0.8402617,
  'word': 'software testing',
  'start': 86,
  'end': 102},
 {'entity_group': 'KNOW',
  'score': 0.8576295,
  'word': 'software testing',
  'start': 543,
  'end': 559},
 {'entity_group': 'KNOW',
  'score': 0.8752378,
  'word': 'software testing',
  'start': 601,
  'end': 617}]

In [7]:
fixed = ftfy.fix_text(list_of_lines[0])
fixed

'Learn and Master Software Testing Quickly from the experts - GUARANTEED! THE IN-DEPTH SOFTWARE TESTING TRAINING - By SoftwareTestingHelp Team. "TOP STUDENT PICK" on Udemy in the Software Testing category! 26+ hours of HD content. Value for money! DON\'T settle for other basic courses of less thanhours! Few Student reviews from hundreds ofstar reviews: "The course is an eye opener into the world of IT. Theophilus. "Money well spent, excellent delivery. Very informative and practical. Would highly recommend to anyone interested in pursuing software testing as a career. Olanrewaju. "Truly the best software testing training I have come across both in dept and in substance. Kingsley. "This is really "The Best Software Training Course". I hardly know anything regarding testing, instructor had taken utmost care in providing the knowledge starting from basics, the terminology etc...I am very much satisfied with this course. I strongly recommend this course. Vijaya. "Great tutorials ..in detai

In [10]:
MODEL_NAME = '/content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config'
try:
  print('try downloading TF weights')
  save_model = TFAutoModelForTokenClassification.from_pretrained(MODEL_NAME)
except:
  print('try downloading PyTorch weights')
  save_model = TFAutoModelForTokenClassification.from_pretrained(MODEL_NAME, from_pt=True)


try downloading TF weights
try downloading PyTorch weights


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertForTokenClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing TFBertForTokenClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForTokenClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.


In [11]:


#tokenizer = AutoTokenizer.from_pretrained(save_model_address, do_lower_case=True, model_max_length=256)
tokenizer.save_pretrained('.{}'.format(MODEL_NAME))

('./content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config/tokenizer_config.json',
 './content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config/special_tokens_map.json',
 './content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config/vocab.txt',
 './content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config/added_tokens.json',
 './content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config/tokenizer.json')

In [12]:
tokens = tokenizer(
    fixed, 
    return_attention_mask=False,
    truncation=True,
    return_special_tokens_mask=True,
    return_offsets_mapping=tokenizer.is_fast,
    return_tensors='pt'
)
tokens

{'input_ids': tensor([[  101,  4553,  1998,  3040,  4007,  5604,  2855,  2013,  1996,  8519,
          1011, 12361,   999,  1996,  1999,  1011,  5995,  4007,  5604,  2731,
          1011,  2011,  4007, 22199,  2075, 16001,  2361,  2136,  1012,  1000,
          2327,  3076,  4060,  1000,  2006, 20904, 26662,  1999,  1996,  4007,
          5604,  4696,   999,  2656,  1009,  2847,  1997, 10751,  4180,  1012,
          3643,  2005,  2769,   999,  2123,  1005,  1056,  7392,  2005,  2060,
          3937,  5352,  1997,  2625,  2084,  6806,  9236,   999,  2261,  3076,
          4391,  2013,  5606,  1997, 14117,  4391,  1024,  1000,  1996,  2607,
          2003,  2019,  3239, 16181,  2046,  1996,  2088,  1997,  2009,  1012,
         14833, 21850,  7393,  1012,  1000,  2769,  2092,  2985,  1010,  6581,
          6959,  1012,  2200, 12367,  8082,  1998,  6742,  1012,  2052,  3811,
         16755,  2000,  3087,  4699,  1999, 11828,  4007,  5604,  2004,  1037,
          2476,  1012, 19330,  2319, 1

In [13]:
if tokenizer.is_fast:
    offset_mapping = tokens.pop("offset_mapping").cpu().numpy()[0]
elif offset_mappings:
    offset_mapping = offset_mappings[i]
else:
    offset_mapping = None

special_tokens_mask = tokens.pop("special_tokens_mask").cpu().numpy()[0]

with torch.no_grad():
    entities = save_model(**tokens)[0][0].cpu().numpy()
    input_ids = tokens["input_ids"].cpu().numpy()[0]


ValueError: ignored

In [None]:
torch.onnx.export(save_model, **tokens)

TypeError: ignored

In [14]:
import tensorflow as tf

MODEL_NAME = '/content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config'

# Define TF Signature
@tf.function(
  input_signature=[
      {
          "input_ids": tf.TensorSpec((None, None), tf.int32, name="input_ids"),
          "attention_mask": tf.TensorSpec((None, None), tf.int32, name="attention_mask"),
          "token_type_ids": tf.TensorSpec((None, None), tf.int32, name="token_type_ids"),
      }
  ]
)
def serving_fn(input):
    return save_model(input)

save_model.save_pretrained("{}/converting".format(MODEL_NAME), saved_model=True, signatures={"serving_default": serving_fn})



In [15]:
!apt-get install tree -q

Reading package lists...
Building dependency tree...
Reading state information...
The following package was automatically installed and is no longer required:
  libnvidia-common-510
Use 'apt autoremove' to remove it.
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 27 not upgraded.
Need to get 43.0 kB of archives.
After this operation, 115 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/universe amd64 tree amd64 1.8.0-1 [43.0 kB]
Fetched 43.0 kB in 0s (352 kB/s)
Selecting previously unselected package tree.
(Reading database ... 129496 files and directories currently installed.)
Preparing to unpack .../tree_1.8.0-1_amd64.deb ...
Unpacking tree (1.8.0-1) ...
Setting up tree (1.8.0-1) ...
Processing triggers for man-db (2.9.1-1) ...


In [16]:
!tree {MODEL_NAME}/converting

[01;34m/content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config/converting[00m
├── config.json
├── [01;34msaved_model[00m
│   └── [01;34m1[00m
│       ├── [01;34massets[00m
│       ├── fingerprint.pb
│       ├── keras_metadata.pb
│       ├── saved_model.pb
│       └── [01;34mvariables[00m
│           ├── variables.data-00000-of-00001
│           └── variables.index
└── tf_model.h5

4 directories, 7 files


In [17]:
!tree {MODEL_NAME}

[01;34m/content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config[00m
├── config.json
├── [01;34mconverting[00m
│   ├── config.json
│   ├── [01;34msaved_model[00m
│   │   └── [01;34m1[00m
│   │       ├── [01;34massets[00m
│   │       ├── fingerprint.pb
│   │       ├── keras_metadata.pb
│   │       ├── saved_model.pb
│   │       └── [01;34mvariables[00m
│   │           ├── variables.data-00000-of-00001
│   │           └── variables.index
│   └── tf_model.h5
├── eval_results.txt
├── pytorch_model.bin
└── vocab.txt

5 directories, 11 files


In [None]:
!cp -r {MODEL_NAME}/my_model_tf/saved_model/1/assets {MODEL_NAME}/converting/saved_model/1

In [18]:
!tree {MODEL_NAME}/converting

[01;34m/content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config/converting[00m
├── config.json
├── [01;34msaved_model[00m
│   └── [01;34m1[00m
│       ├── [01;34massets[00m
│       ├── fingerprint.pb
│       ├── keras_metadata.pb
│       ├── saved_model.pb
│       └── [01;34mvariables[00m
│           ├── variables.data-00000-of-00001
│           └── variables.index
└── tf_model.h5

4 directories, 7 files


In [19]:
asset_path = '{}/saved_model/1/assets'.format(MODEL_NAME)

!cp {MODEL_NAME}/vocab.txt {asset_path}

cp: cannot create regular file '/content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config/saved_model/1/assets': No such file or directory


In [23]:
# get label2id dictionary 
labels = save_model.config.label2id
# sort the dictionary based on the id
labels = sorted(labels, key=labels.get)

with open(asset_path+'/labels.txt', 'w') as f:
    f.write('\n'.join(labels))

FileNotFoundError: ignored

In [3]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Installing PySpark 3.2.3 and Spark NLP 4.2.8
setup Colab for PySpark 3.2.3 and Spark NLP 4.2.8
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m281.5/281.5 MB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m453.8/453.8 KB[0m [31m40.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.7/199.7 KB[0m [31m23.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for pyspark (setup.py) ... [?25l[?25hdone


In [5]:
# Install Spark NLP from PyPI
!pip install spark-nlp==4.2.8
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m281.4/281.4 MB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.7/199.7 KB[0m [31m26.8 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... [?25l[?25hdone
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845512 sha256=c79182a1832032d98bfd796278b6202b1f4dd7b4975e07fac6ac8a2eecd23f13
  Stored in directory: /root/.cache/pip/wheels/43/dc/11/ec201cd671da62fa9c5cc

In [6]:
import sparknlp
# let's start Spark with Spark NLP
#spark = sparknlp.start()

In [8]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP")\
    .master("local[*]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.8")\
    .getOrCreate()

In [9]:
from sparknlp.annotator import *
from sparknlp.training import CoNLL
from pyspark.ml import Pipeline
from sparknlp.common import *
from sparknlp.base import *

In [10]:
training_data = CoNLL().readDataset(spark = spark, path = '/content/drive/MyDrive/Data_Science/thesis/ML_NER/Train_Dataset/*')
training_date.show()

NameError: ignored

In [12]:
MODEL_NAME = '/content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config'
training_data = CoNLL().readDataset(spark = spark, path = '/content/drive/MyDrive/Data_Science/thesis/ML_NER/Train_Dataset/*')

bert = BertEmbeddings.pretrained('bert_base_cased','en') \
 .setInputCols(["sentence",'token'])\
 .setOutputCol("bert")

nerTagger = NerDLApproach() \
 .setInputCols(["sentence",'token','bert'])\
 .setLabelColumn("label") \
 .setOutputCol("ner") \
 .setMaxEpochs(1) \
 .setEnableMemoryOptimizer(True)

pipeline = Pipeline(stages = [bert, nerTagger])

ner_model = pipeline.fit(training_data)

bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]


In [13]:
ner_model.stages[1].write().save('/content/drive/MyDrive/Data_Science/thesis/NERModel_embeddings/NER_bert_1')

In [11]:
from sparknlp.annotator import *
MODEL_NAME = '/content/drive/MyDrive/Data_Science/thesis/NER_new/NERModel_config'
bert = BertForTokenClassification.loadSavedModel(
     '{}/converting/saved_model/1'.format(MODEL_NAME),
     spark
 )\
 .setInputCols(["document",'token'])\
 .setOutputCol("ner")\
 .setCaseSensitive(True)\
 .setMaxSentenceLength(128)

ConnectionRefusedError: ignored

In [6]:
bert.write().overwrite().save("./{}".format(MODEL_NAME))

In [None]:
!rm -rf {MODEL_NAME}_tokenizer {MODEL_NAME}

In [28]:
! ls -l {MODEL_NAME}

total 425712
-rw------- 1 root root      1136 Sep 26 01:35 config.json
drwx------ 3 root root      4096 Feb  2 05:33 converting
-rw------- 1 root root       554 Sep 26 01:35 eval_results.txt
-rw------- 1 root root 435689969 Sep 26 01:35 pytorch_model.bin
-rw------- 1 root root    231508 Sep 26 01:35 vocab.txt


In [7]:
tokenClassifier_loaded = BertForTokenClassification.load("./{}".format(MODEL_NAME))\
  .setInputCols(["document",'token'])\
  .setOutputCol("ner")

In [8]:
tokenClassifier_loaded.getClasses()

['I-TOOL',
 'B-TOOL',
 'I-KNOW',
 '[SEP]',
 'B-LANG',
 'I-LANG',
 'B-FRAM',
 'I-FRAM',
 'B-KNOW',
 'I-PLAT',
 '[CLS]',
 'O',
 'B-PLAT']

In [14]:
from sparknlp.base import *
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

sentence = SentenceDetector() \
    .setInputCols(['document']) \
    .setOutputCol('sentence')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

bert = BertEmbeddings.pretrained('bert_base_cased','en') \
 .setInputCols(["sentence",'token'])\
 .setOutputCol("bert") \
 .setCaseSensitive(True)

loaded_ner_model = NerDLModel.load("/content/drive/MyDrive/Data_Science/thesis/NERModel_embeddings/NER_bert_1")\
   .setInputCols(["sentence", "token", "bert"])\
   .setOutputCol("ner")

converter = NerConverter()\
    .setInputCols(["document","token","ner"])\
    .setOutputCol("ner_span")

custom_ner_pipeline = Pipeline(stages=[
    document_assembler,
    sentence,
    tokenizer,
    bert,
    loaded_ner_model,
    converter
])
'''
custom_ner_pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier_loaded,
    converter    
])
'''
# couple of simple examples


bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]


'\ncustom_ner_pipeline = Pipeline(stages=[\n    document_assembler,\n    tokenizer,\n    tokenClassifier_loaded,\n    converter    \n])\n'

In [37]:
text = "software testing Quickly from the experts - GUARANTEED! THE IN-DEPTH software testing TRAINING - By SoftwareTestingHelp Team. on Udemy in the software testing category!"
prediction_data = spark.createDataFrame([[text]]).toDF("text")
prediction_model = custom_ner_pipeline.fit(prediction_data)
preds = prediction_model.transform(prediction_data)
preds.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                bert|                 ner|            ner_span|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|software testing ...|[{document, 0, 16...|[{document, 0, 54...|[{token, 0, 7, so...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 15, s...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [48]:
from pyspark.sql.functions import split, col, monotonically_increasing_id
preds = preds.withColumn("id", monotonically_increasing_id())

In [64]:
preds.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------+
|                text|            document|            sentence|               token|                bert|                 ner|            ner_span|         id|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------+
|software testing ...|[{document, 0, 16...|[{document, 0, 54...|[{token, 0, 7, so...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 15, s...|94489280512|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------+



In [55]:
from pyspark.sql.functions import col
test = test.withColumn("result", col("result").cast("string")) \
  .withColumn("metadata", col("metadata").cast("string")) 

AttributeError: ignored

In [None]:
test.write.options(header='True', delimiter=',', quotes = '"') \
 .csv("/content/drive/MyDrive/Data_Science/thesis/result")

In [62]:
preds.select("ner_span.result","ner_span.metadata").show()

+--------------------+--------------------+
|              result|            metadata|
+--------------------+--------------------+
|[software testing...|[{entity -> KNOW,...|
+--------------------+--------------------+



In [67]:
import pyspark.sql.functions as F
preds.select(F.explode(F.arrays_zip("ner_span.result","ner_span.metadata")).alias("entities"), 'id') \
.select(F.expr("entities['result']").alias("chunk"), 
        F.expr("entities['metadata'].entity").alias("entity"), 'id'). \
        show(truncate=False)

+----------------+------+-----------+
|chunk           |entity|id         |
+----------------+------+-----------+
|software testing|KNOW  |94489280512|
|software testing|KNOW  |94489280512|
+----------------+------+-----------+



In [15]:
example = spark.createDataFrame([["software testing Quickly from the experts - GUARANTEED! THE IN-DEPTH software testing TRAINING - By SoftwareTestingHelp Team. on Udemy in the software testing category!"]]).toDF("text")

result = pipeline.fit(example).transform(example)

# result is a DataFrame
result.select("text", "ner.result").show()

IllegalArgumentException: ignored

In [42]:
example = spark.createDataFrame([["software testing Quickly from the experts - GUARANTEED! THE IN-DEPTH software testing TRAINING - By SoftwareTestingHelp Team. on Udemy in the software testing category!"]]).toDF("text")

result = pipeline.fit(example).transform(example)

# result is a DataFrame
result.select("text", "ner.result").show()

+--------------------+--------------------+
|                text|              result|
+--------------------+--------------------+
|software testing ...|[B-KNOW, I-KNOW, ...|
+--------------------+--------------------+



In [25]:
df = spark.read.option('header','true').csv('/content/drive/MyDrive/Data_Science/thesis/K18/File K18/sample_crawl_dataset/Coursera_DataScience.csv',inferSchema=True, escape = '"')

Py4JJavaError: ignored

In [26]:
from pyspark.sql.functions import lower, col
df = df.select(df.SkillWillLearn.alias('text'))
df = df.withColumn('text', lower(col('text')))

df.show()

AttributeError: ignored

In [12]:
result = pipeline.fit(df).transform(df)

# result is a DataFrame
result.select("text", "ner").show()


+--------------------+--------------------+
|                text|                 ner|
+--------------------+--------------------+
|machine learning ...|[{named_entity, 0...|
|in an age now dri...|[{named_entity, 0...|
|in this course, y...|[{named_entity, 0...|
|explain the princ...|[{named_entity, 0...|
|in the first cour...|[{named_entity, 0...|
|define data scien...|[{named_entity, 0...|
|this mooc – a joi...|[{named_entity, 0...|
|describe the use ...|[{named_entity, 0...|
|while telling sto...|[{named_entity, 0...|
|in this course, y...|[{named_entity, 0...|
|properly identify...|[{named_entity, 0...|
|development of an...|[{named_entity, 0...|
|describe differen...|[{named_entity, 0...|
|distinguish betwe...|[{named_entity, 0...|
|apply tidyverse f...|[{named_entity, 0...|
|describe differen...|[{named_entity, 0...|
|articulate differ...|[{named_entity, 0...|
|build an investme...|[{named_entity, 0...|
|in this course yo...|[{named_entity, 0...|
|the capstone proj...|[{named_en

In [35]:
result = result.select("text", "ner.annotatorType", "ner.begin", "ner.end", "ner.result", "ner.metadata", "ner.embeddings")

In [36]:
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

result = result.withColumn("text", col("text").cast("string")) \
.withColumn("annotatorType", col("annotatorType").cast("string")) \
.withColumn("begin", col("begin").cast("string")) \
.withColumn("end", col("end").cast("string")) \
.withColumn("result", col("result").cast("string")) \
.withColumn("metadata", col("metadata").cast("string")) \
.withColumn("embeddings", col("embeddings").cast("string"))

In [50]:
result.select('text','ner_span.result','ner_span').show()

+--------------------+--------------------+--------------------+
|                text|              result|            ner_span|
+--------------------+--------------------+--------------------+
|software testing ...|[software testing...|[{chunk, 0, 15, s...|
+--------------------+--------------------+--------------------+



In [45]:
result = result.withColumn("ner", col("ner").cast("string")) \
.withColumn("document", col("document").cast("string")) \
.withColumn("token", col("token").cast("string"))

In [38]:
result.write.options(header='True', delimiter=',', quotes = '"') \
 .csv("/content/drive/MyDrive/Data_Science/thesis/result")

Py4JJavaError: ignored

In [52]:
result.select('text','token.annotatorType','token.begin','token.end','token.result','token.metadata', 'token.embeddings').show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|       annotatorType|               begin|                 end|              result|            metadata|          embeddings|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|machine learning ...|[token, token, to...|[0, 8, 17, 20, 24...|[6, 15, 18, 22, 3...|[machine, learnin...|[{sentence -> 0},...|[[], [], [], [], ...|
|in an age now dri...|[token, token, to...|[0, 3, 6, 10, 14,...|[1, 4, 8, 12, 19,...|[in, an, age, now...|[{sentence -> 0},...|[[], [], [], [], ...|
|in this course, y...|[token, token, to...|[0, 3, 8, 14, 16,...|[1, 6, 13, 14, 18...|[in, this, course...|[{sentence -> 0},...|[[], [], [], [], ...|
|explain the princ...|[token, token, to...|[0, 8, 12, 23, 26...|[6, 10, 21, 24, 2...|[explain, the, pr...|

In [13]:
result.select('token.result','ner.result').show(truncate=100)

+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                                              result|                                                                                              result|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|[machine, learning, is, the, science, of, getting, computers, to, act, without, being, explicitly...|[B-KNOW, I-KNOW, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, B-KNOW, I-KNOW, O, O, O, O...|
|[in, an, age, now, driven, by, ", big, data, ",, we, need, to, cut, through, the, noise, and, pre...|[O, O, O, O, O, O, O, B-KNOW, I-KNOW, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O,

In [15]:
from pyspark.sql import functions as F
result_df = result.select(F.explode(F.arrays_zip(result.token.result,
                                                 result.ner.result, 
                                                 result.ner.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"),
                          F.expr("cols['2']['confidence']").alias("confidence"))

result_df.show(50, truncate=100)

+-------------+---------+----------+
|        token|ner_label|confidence|
+-------------+---------+----------+
|      machine|   B-KNOW|      null|
|     learning|   I-KNOW|      null|
|           is|        O|      null|
|          the|        O|      null|
|      science|        O|      null|
|           of|        O|      null|
|      getting|        O|      null|
|    computers|        O|      null|
|           to|        O|      null|
|          act|        O|      null|
|      without|        O|      null|
|        being|        O|      null|
|   explicitly|        O|      null|
|   programmed|        O|      null|
|            .|        O|      null|
|           in|        O|      null|
|          the|        O|      null|
|         past|        O|      null|
|       decade|        O|      null|
|            ,|        O|      null|
|      machine|   B-KNOW|      null|
|     learning|   I-KNOW|      null|
|          has|        O|      null|
|        given|        O|      null|
|

In [22]:
result.select('token.result', 'ner.result').show()

+--------------------+--------------------+
|              result|              result|
+--------------------+--------------------+
|[machine, learnin...|[B-KNOW, I-KNOW, ...|
|[in, an, age, now...|[O, O, O, O, O, O...|
|[in, this, course...|[O, O, O, O, O, O...|
|[explain, the, pr...|[O, O, O, O, B-KN...|
|[in, the, first, ...|[O, O, O, O, O, O...|
|[define, data, sc...|[O, B-KNOW, I-KNO...|
|[this, mooc, –, a...|[O, O, O, O, O, O...|
|[describe, the, u...|[O, O, O, O, B-KN...|
|[while, telling, ...|[O, B-KNOW, I-KNO...|
|[in, this, course...|[O, O, O, O, O, O...|
|[properly, identi...|[O, O, O, O, O, O...|
|[development, of,...|[O, O, O, O, O, O...|
|[describe, differ...|[O, O, O, O, O, O...|
|[distinguish, bet...|[O, O, O, O, O, O...|
|[apply, tidyverse...|[O, B-FRAM, O, O,...|
|[describe, differ...|[O, O, O, O, B-KN...|
|[articulate, diff...|[O, O, O, O, O, O...|
|[build, an, inves...|[O, O, O, O, O, O...|
|[in, this, course...|[O, O, O, O, O, O...|
|[the, capstone, p...|[O, O, O, 

In [51]:
import pyspark.sql.functions as F

last = result.select(F.explode(F.arrays_zip("ner_span.result","ner_span.metadata")).alias("entities")) \
.select(F.expr("entities['0']").alias("chunk"), 
        F.expr("entities['1'].entity").alias("entity")). \
        show()

AnalysisException: ignored

In [45]:
last.select(F.expr("entities['0']").alias("chunk"), 
            F.expr("entities['1'].entity").alias("entity")).show()

AttributeError: ignored

In [47]:
last.select(F.explode(F.expr("entities").alias("chunk")).show()

AttributeError: ignored