![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/dl-ner/ner_bert.ipynb)

## 0. Colab Setup

In [1]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.4

# Install Spark NLP (pinned to the version printed in step 3, which matches pyspark 2.4.4)
! pip install --ignore-installed spark-nlp==2.5.0

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)


## Deep Learning NER

In the following example, we walk through training an LSTM NER model and using it for prediction. This annotator is implemented on top of TensorFlow.

The annotator takes a series of word embedding vectors and a CoNLL training dataset, plus a validation dataset. Spark NLP includes its own predefined TensorFlow graphs, but all layers are trained during the fit() stage.

DL NER computes several BiLSTM layers to extract entities automatically, and it leverages batch-based distributed calls to native TensorFlow libraries during prediction.

#### 1. Call necessary imports and set the resource folder path.

In [0]:
import os
import sys
sys.path.append('../../')

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

import time
import zipfile
# Set the location of the resource directory
resource_path = "../../../src/test/resources/"

#### 2. Download CoNLL 2003 data if not present

In [0]:
# Download CoNLL 2003 Dataset
import os
from pathlib import Path
import urllib.request
url = "https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/"
# Browsable listing: https://github.com/patverga/torch-ner-nlp-from-scratch/tree/master/data/conll2003
file_train = "eng.train"
file_testa = "eng.testa"
file_testb = "eng.testb"

# Fetch each split only if it is not already in the working directory
for file_name in (file_train, file_testa, file_testb):
    if not Path(file_name).is_file():
        print("Downloading " + file_name)
        urllib.request.urlretrieve(url + file_name, file_name)

#### 3. Create the spark session

In [4]:
import sparknlp 

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  2.5.0
Apache Spark version:  2.4.4


#### 4. Load dataset and cache into memory

In [5]:
from sparknlp.training import CoNLL
training_data = CoNLL().readDataset(spark, './eng.train')
training_data.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[[document, 0, 47...|[[document, 0, 47...|[[token, 0, 1, EU...|[[pos, 0, 1, NNP,...|[[named_entity, 0...|
|     Peter Blackburn|[[document, 0, 14...|[[document, 0, 14...|[[token, 0, 4, Pe...|[[pos, 0, 4, NNP,...|[[named_entity, 0...|
| BRUSSELS 1996-08-22|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 7, BR...|[[pos, 0, 7, NNP,...|[[named_entity, 0...|
|The European Comm...|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 2, Th...|[[pos, 0, 2, DT, ...|[[named_entity, 0...|
|Germany 's repres...|[[document, 0, 21...|[[document, 0, 21...|[[token, 0, 6, Ge...|[[pos, 0, 6, NNP,..
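
To make the columns above easier to read, here is a minimal plain-Python sketch of how a CoNLL 2003 file is laid out: one token per line with its POS tag, chunk tag, and NER label, a blank line between sentences, and `-DOCSTART-` lines between documents. The `parse_conll` helper and the hand-written `sample` text are illustrative assumptions; this is not how `CoNLL().readDataset` is implemented.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str, str]]  # (token, POS tag, NER label)


def parse_conll(text: str) -> List[Sentence]:
    """Split CoNLL 2003 text into sentences of (token, pos, label) triples."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines end a sentence; -DOCSTART- lines mark document boundaries.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        # Columns: token, POS tag, chunk tag, NER label (the last column).
        current.append((fields[0], fields[1], fields[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample in the CoNLL 2003 layout (hypothetical, not read from eng.train).
sample = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""

sentences = parse_conll(sample)
print(len(sentences))
print(sentences[1])
```

Each parsed sentence corresponds roughly to one row of the DataFrame, with the token, POS, and label columns holding one annotation per token.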

#### 5. Create the annotator components with appropriate params and in the right order, then put everything in a Pipeline

In [6]:
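# pretrained() with no arguments downloads the default model.
# NOTE: setCaseSensitive(False) makes token matching ignore case, which may not suit a cased model.
bert = BertEmbeddings.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("bert") \
  .setCaseSensitive(False)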


bert_base_cased download started this may take some time.
Approximate size to download 389.2 MB
[OK!]


In [7]:
training_data.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[[document, 0, 47...|[[document, 0, 47...|[[token, 0, 1, EU...|[[pos, 0, 1, NNP,...|[[named_entity, 0...|
|     Peter Blackburn|[[document, 0, 14...|[[document, 0, 14...|[[token, 0, 4, Pe...|[[pos, 0, 4, NNP,...|[[named_entity, 0...|
| BRUSSELS 1996-08-22|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 7, BR...|[[pos, 0, 7, NNP,...|[[named_entity, 0...|
|The European Comm...|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 2, Th...|[[pos, 0, 2, DT, ...|[[named_entity, 0...|
|Germany 's repres...|[[document, 0, 21...|[[document, 0, 21...|[[token, 0, 6, Ge...|[[pos, 0, 6, NNP,..

In [8]:
%%time
from pathlib import Path


# WARNING: Setting benchmark to True is slow, might crash your system, and is not recommended
# on standard Colab notebooks. High-end hardware and/or a GPU is required.
# dataframe.cache() does not solve this; results must be serialized to disk for maximum efficiency.
# You might need to restart your driver after this step finishes.
benchmark = False


with_bert_path = "./with_bert.parquet"
if benchmark:
    # Compute the embeddings once and write them to disk if they are not there yet
    if not Path(with_bert_path).is_dir():
        bert.transform(training_data).write.parquet(with_bert_path)
    # Always read back from disk, so the variable is defined even when the files already exist
    training_with_bert = spark.read.parquet(with_bert_path).cache()
else:
    training_with_bert = bert.transform(training_data)


print(training_with_bert.count())
training_with_bert.select("token", "bert").show()

14041
+--------------------+--------------------+
|               token|                bert|
+--------------------+--------------------+
|[[token, 0, 1, EU...|[[word_embeddings...|
|[[token, 0, 4, Pe...|[[word_embeddings...|
|[[token, 0, 7, BR...|[[word_embeddings...|
|[[token, 0, 2, Th...|[[word_embeddings...|
|[[token, 0, 6, Ge...|[[word_embeddings...|
|[[token, 0, 0, ",...|[[word_embeddings...|
|[[token, 0, 1, He...|[[word_embeddings...|
|[[token, 0, 1, He...|[[word_embeddings...|
|[[token, 0, 7, Fi...|[[word_embeddings...|
|[[token, 0, 2, Bu...|[[word_embeddings...|
|[[token, 0, 6, Sp...|[[word_embeddings...|
|[[token, 0, 0, .,...|[[word_embeddings...|
|[[token, 0, 3, On...|[[word_embeddings...|
|[[token, 0, 2, Th...|[[word_embeddings...|
|[[token, 0, 4, Sh...|[[word_embeddings...|
|[[token, 0, 6, Br...|[[word_embeddings...|
|[[token, 0, 0, ",...|[[word_embeddings...|
|[[token, 0, 3, Bo...|[[word_embeddings...|
|[[token, 0, 6, Ge...|[[word_embeddings...|
|[[token, 0, 1, It...|[[wo
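
The benchmark branch above follows a compute-once, reload-from-disk pattern. As a rough plain-Python analogy (using `pickle` instead of Parquet, with a hypothetical `load_or_compute` helper and a toy `expensive` function standing in for the BERT transform), the idea can be sketched like this:

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(path, compute):
    """Run compute() only if no cached result exists, then read the result from disk."""
    path = Path(path)
    if not path.is_file():
        with path.open("wb") as f:
            pickle.dump(compute(), f)
    # Always read back from disk, whether the file was just written or already there.
    with path.open("rb") as f:
        return pickle.load(f)


calls = []


def expensive():
    # Stand-in for a slow step; records how many times it actually runs.
    calls.append(1)
    return [x * x for x in range(5)]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "squares.pkl"
    first = load_or_compute(cache_file, expensive)
    second = load_or_compute(cache_file, expensive)

print(first == second, len(calls))
```

The second call skips the computation entirely, which is the behavior the Parquet round trip is meant to provide for the embeddings.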

In [0]:
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "bert"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(1)\
  .setRandomSeed(0)\
  .setVerbose(0)

converter = NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_span")

pipeline = Pipeline(
    stages = [
    nerTagger,
    converter
  ])
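
The `NerConverter` stage groups per-token tags into entity spans. The following plain-Python sketch shows the general idea of that grouping under simple assumptions (tags use `B-`/`I-` prefixes or `O`, and spans are joined with spaces). The `tags_to_spans` helper is hypothetical and is not Spark NLP's implementation.

```python
from typing import List, Optional, Tuple


def tags_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group consecutive tagged tokens into (entity_text, entity_type) spans."""
    spans: List[Tuple[str, str]] = []
    words: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        # A span continues only on an I- tag of the same type as the open span.
        continues = prefix == "I" and ent_type == current_type
        if words and not continues:
            spans.append((" ".join(words), current_type))
            words, current_type = [], None
        if tag != "O":
            words.append(token)
            current_type = ent_type
    if words:
        spans.append((" ".join(words), current_type))
    return spans


tokens = ["Peter", "Blackburn", "visited", "Germany", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
spans = tags_to_spans(tokens, tags)
print(spans)

# When every tag is "O" there is nothing to group, so the result is empty.
print(tags_to_spans(tokens, ["O"] * len(tokens)))
```

An all-`O` prediction produces no spans at all, so an empty span column is the expected result whenever the tagger labels nothing as an entity.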

#### 6. Train the pipeline. (This will take some time)

In [10]:
%%time

start = time.time()
print("Start fitting")
# We have to limit the rows in Colab, otherwise we will run into exceptions caused by RAM limitations
model = pipeline.fit(training_with_bert.limit(25))
print("Fitting is ended")
print(time.time() - start)

Start fitting
Fitting is ended
7.180534839630127
CPU times: user 21.5 ms, sys: 6.81 ms, total: 28.3 ms
Wall time: 7.18 s


#### 7. Let's predict with the model

In [0]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        bert,
        model
    ]
)

In [12]:
prediction_data = spark.createDataFrame([["Germany is a nice place"]]).toDF("text")
prediction_data.show()

+--------------------+
|                text|
+--------------------+
|Germany is a nice...|
+--------------------+



In [0]:
prediction_model = prediction_pipeline.fit(prediction_data)

In [14]:
%%time

lp = LightPipeline(prediction_model)
result = lp.annotate("International Business Machines Corporation (IBM) is an American multinational information technology company headquartered in Armonk.")
for e in list(zip(result['token'], result['ner'])):
    print(e)

('International', 'O')
('Business', 'O')
('Machines', 'O')
('Corporation', 'O')
('(', 'O')
('IBM', 'O')
(')', 'O')
('is', 'O')
('an', 'O')
('American', 'O')
('multinational', 'O')
('information', 'O')
('technology', 'O')
('company', 'O')
('headquartered', 'O')
('in', 'O')
('Armonk', 'O')
('.', 'O')
CPU times: user 56.3 ms, sys: 7.62 ms, total: 63.9 ms
Wall time: 1.19 s
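
Every token above is tagged `O`, which is plausible for a model trained for a single epoch on only 25 rows. A quick way to spot this kind of degenerate output is to count the predicted labels. The sketch below assumes the annotate result is a dict with parallel `token` and `ner` lists, as in the cell above; `label_counts` and `sample_result` are hypothetical names, and the sample is hand-written rather than produced by the model.

```python
from collections import Counter


def label_counts(result):
    """Count how often each NER label appears in an annotate-style result dict."""
    return Counter(result["ner"])


# Hand-written stand-in for a LightPipeline.annotate() result.
sample_result = {
    "token": ["IBM", "is", "headquartered", "in", "Armonk", "."],
    "ner": ["O", "O", "O", "O", "O", "O"],
}

counts = label_counts(sample_result)
only_outside = set(counts) == {"O"}
print(counts)
print("Model predicted no entities:", only_outside)
```

If the check reports that no entities were predicted, training on more rows or for more epochs is the first thing to try.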


In [15]:
%%time

# This might take 8 minutes. Timing is not linear

prediction_model.transform(prediction_data).show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------+
|                text|            document|            sentence|               token|                bert|                 ner|ner_span|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------+
|Germany is a nice...|[[document, 0, 22...|[[document, 0, 22...|[[token, 0, 6, Ge...|[[word_embeddings...|[[named_entity, 0...|      []|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------+

CPU times: user 27.2 ms, sys: 6.09 ms, total: 33.3 ms
Wall time: 883 ms


#### 8. Save the trained pipeline, including the NER model, to disk

In [0]:
prediction_model.write().overwrite().save("./ner_dl_model")

#### 9. Load it again, deserializing from disk

In [0]:
from pyspark.ml import PipelineModel, Pipeline

loaded_prediction_model = PipelineModel.read().load("./ner_dl_model")

In [18]:
%%time
lp = LightPipeline(loaded_prediction_model)
result = lp.annotate("Peter is a good person.")
for e in list(zip(result['token'], result['ner']))[:10]:
    print(e)

('Peter', 'O')
('is', 'O')
('a', 'O')
('good', 'O')
('person', 'O')
('.', 'O')
CPU times: user 55.9 ms, sys: 12.4 ms, total: 68.3 ms
Wall time: 723 ms


In [19]:
for stage in loaded_prediction_model.stages:
    print(stage)
print(loaded_prediction_model.stages[-1].stages)

DocumentAssembler_7a6bc03a0a25
SentenceDetector_8130627c0d5f
REGEX_TOKENIZER_cf7c9407b892
BERT_EMBEDDINGS_abf30dcdf344
PipelineModel_e7f7bc4a5dcc
[NerDLModel_ba63241e33e5, NerConverter_422eed39d1e4]
