# NER with BERT in Spark NLP

## Source

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/blogposts/3.NER_with_BERT.ipynb

Article:
https://towardsdatascience.com/named-entity-recognition-ner-with-bert-in-spark-nlp-874df20d1d77

## Installation

In [None]:
# This is only to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

setup Colab for PySpark 3.0.2 and Spark NLP 3.1.0
Get:1 http://ppa.launchpad.net/c2d4u.tea

## Import libraries and download datasets

In [None]:
import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *

In [None]:
# if you have a GPU
spark = sparknlp.start(gpu=True)
# otherwise, start a CPU session instead:
# spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.1.0
Apache Spark version:  3.0.2


In [None]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [None]:
path = "/gdrive/MyDrive/Colab Notebooks/SparkNLP/NER/CONLLs/"
!ls {path.replace(' ', '\ ')} -lah

total 7.9M
-rw------- 1 root root 809K Jun 16 07:49 eng.testa
-rw------- 1 root root 3.2M Jun 16 07:49 eng.train
-rw------- 1 root root 809K Jun 16 08:52 eng_TST.testa
-rw------- 1 root root 3.2M Jun 16 08:52 eng_TST.train


In [None]:
# Open the local CoNLL file.
# The eng_TST.train and eng_TST.testa files contain B-TST and I-TST tags
# (instead of B-ORG and I-ORG) to test whether tags can be renamed
# using this approach (the change appears to work).
with open(path + "eng_TST.train") as f:
    c = f.read()

print(c[:200])

-DOCSTART- -X- -X- O

EU NNP B-NP B-TST
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Black
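
Each non-empty line in the file holds four space-separated fields (token, POS tag, chunk tag, NER tag), and a blank line ends a sentence. The sketch below is not part of the original notebook: it parses the first sentence printed above with plain Python and counts the NER labels. The `parse_conll` helper and the `SAMPLE` string are illustrative names, and the parser assumes exactly four fields per line.

```python
from collections import Counter

# First sentence of eng_TST.train, copied from the output above.
SAMPLE = """-DOCSTART- -X- -X- O

EU NNP B-NP B-TST
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
"""


def parse_conll(text):
    """Split CoNLL-2003 text into sentences of (token, pos, chunk, ner) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line or a document marker closes the sentence in progress.
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, chunk, ner = line.split()
        current.append((token, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conll(SAMPLE)
label_counts = Counter(ner for sent in sentences for (_, _, _, ner) in sent)
print(len(sentences), dict(label_counts))
```

Running the same counter over the whole file is a quick way to confirm that no B-ORG or I-ORG tags are left after the rename.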


## Building NER pipeline

In [None]:
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, path + "eng_TST.train")
training_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[[document, 0, 47...|[[document, 0, 47...|[[token, 0, 1, EU...|[[pos, 0, 1, NNP,...|[[named_entity, 0...|
|     Peter Blackburn|[[document, 0, 14...|[[document, 0, 14...|[[token, 0, 4, Pe...|[[pos, 0, 4, NNP,...|[[named_entity, 0...|
| BRUSSELS 1996-08-22|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 7, BR...|[[pos, 0, 7, NNP,...|[[named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



In [None]:
training_data.count()

14041

### Loading BERT

Spark NLP offers four pre-trained variants of BERT (bert_base_uncased, bert_base_cased, bert_large_uncased, and bert_large_cased), along with many smaller BERT models on our [Models Hub](https://nlp.johnsnowlabs.com/models?q=bert&task=Embeddings). Which one to use depends on your use case, training set, and the complexity of the task you are trying to model.

The code snippet below loads a pre-trained BERT model (here small_bert_L2_128) from the Spark NLP public resources and passes the sentence and token columns to setInputCols(). In short, the BertEmbeddings() annotator takes the sentence and token columns and writes the BERT embeddings to the bert column. With a BERT base model, each token is mapped to a 768-dimensional vector; small_bert_L2_128 produces 128-dimensional vectors.

As the authors of the official BERT paper explain, different BERT layers capture different information. The last layer is too close to the pre-training objectives (masked language modeling and next sentence prediction), so it may be biased toward those targets.

In [None]:
# we use BERT Tiny
bert_annotator = BertEmbeddings.pretrained('small_bert_L2_128', 'en') \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert") \
    .setBatchSize(8)

small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]


In [None]:
from sparknlp.training import CoNLL

test_data = CoNLL().readDataset(spark, path + 'eng_TST.testa')

test_data = bert_annotator.transform(test_data)

test_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|                bert|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[[document, 0, 64...|[[document, 0, 64...|[[token, 0, 6, CR...|[[pos, 0, 6, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|   LONDON 1996-08-30|[[document, 0, 16...|[[document, 0, 16...|[[token, 0, 5, LO...|[[pos, 0, 5, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|West Indian all-r...|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 3, We...|[[pos, 0, 3, NNP,...|[[named_entity, 0...|[[word_embeddings...|
+--------------------+--------------------+--------------------+--------------------+--------------------+

In [None]:
# let's save our transformed test dataset for evaluation
# (overwrite mode lets this cell be rerun without a "path already exists" error)
test_data.write.mode("overwrite").parquet("test_withEmbeds.parquet")

In [None]:
test_data.select("bert.result","bert.embeddings",'label.result').show()

+--------------------+--------------------+--------------------+
|              result|          embeddings|              result|
+--------------------+--------------------+--------------------+
|[cricket, -, leic...|[[-1.6099579, 0.5...|[O, O, B-TST, O, ...|
|[london, 1996-08-30]|[[-0.6607419, 0.7...|          [B-LOC, O]|
|[west, indian, al...|[[-1.2108911, 0.9...|[B-MISC, I-MISC, ...|
|[their, stay, on,...|[[-0.9397633, 0.0...|[O, O, O, O, O, O...|
|[after, bowling, ...|[[-1.1267813, 1.1...|[O, O, B-TST, O, ...|
|[trailing, by, 21...|[[-1.8359275, 0.4...|[O, O, O, O, B-TS...|
|[essex, ,, howeve...|[[-1.2150196, 0.2...|[B-TST, O, O, O, ...|
|[hussain, ,, cons...|[[-1.6078968, 0.5...|[B-PER, O, O, O, ...|
|[by, the, close, ...|[[-1.8683753, 1.1...|[O, O, O, B-TST, ...|
|[at, the, oval, ,...|[[-1.8740944, 0.6...|[O, O, B-LOC, O, ...|
|[he, was, well, b...|[[-1.6607122, 1.2...|[O, O, O, O, O, B...|
|[derbyshire, kept...|[[-1.1823792, 0.2...|[B-TST, O, O, O, ...|
|[australian, tom,...|[[-

In [None]:
from pyspark.ml import Pipeline

nerTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "bert"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(5) \
    .setLr(0.001) \
    .setPo(0.005) \
    .setBatchSize(32) \
    .setEvaluationLogExtended(True) \
    .setEnableOutputLogs(True) \
    .setTestDataset("test_withEmbeds.parquet")

pipeline = Pipeline(
    stages=[
        bert_annotator,
        nerTagger
    ])

You can tune the learning rate (setLr), the learning rate decay coefficient (setPo), the batch size (setBatchSize), and the dropout rate (setDropout). Please see the [official APIs](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/ner/dl/NerDLApproach.html) for the entire list.
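
To get a feel for how setLr and setPo interact, here is a small sketch that is not part of the original notebook. It assumes an inverse-time decay of the form lr / (1 + po * epoch) with a zero-based epoch index. That schedule is an assumption, so check the API documentation before relying on it. The `decayed_lr` helper is a hypothetical name.

```python
def decayed_lr(lr, po, epoch):
    """Learning rate for a zero-based epoch index under inverse-time decay.

    Assumed schedule (not confirmed by this tutorial): lr / (1 + po * epoch).
    """
    return lr / (1.0 + po * epoch)


# Values passed to setLr, setPo, and setMaxEpochs in the pipeline above.
schedule = [decayed_lr(0.001, 0.005, epoch) for epoch in range(5)]
for epoch, value in enumerate(schedule, start=1):
    print(f"epoch {epoch}: lr = {value:.6f}")
```

Under this assumption, a coefficient of 0.005 barely changes the learning rate over 5 epochs. A larger setPo value would be needed for the decay to matter in such a short run.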

In [None]:
%%time

ner_model = pipeline.fit(training_data)

CPU times: user 2.52 s, sys: 396 ms, total: 2.91 s
Wall time: 9min 5s


In [None]:
!ls -l /root/annotator_logs/

total 8
-rw-r--r-- 1 root root 4210 Jun 16 09:07 NerDLApproach_78a48b94535a.log


In [None]:
!cat /root/annotator_logs/NerDLApproach_*.log

Name of the selected graph: ner-dl/blstm_10_128_128_120.pb
Training started - total epochs: 5 - lr: 0.001 - batch size: 32 - labels: 9 - chars: 58 - training examples: 14041


Epoch 1/5 started, lr: 0.001, dataset size: 14041


Epoch 1/5 - 103.49s - loss: 2545.3042 - batches: 441
Quality on test dataset: 
time to finish evaluation: 7.75s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1490	 280	 347	 0.8418079	 0.8111051	 0.8261713
B-TST	 634	 122	 707	 0.83862436	 0.4727815	 0.6046734
I-TST	 217	 80	 534	 0.73063976	 0.28894806	 0.41412216
I-MISC	 97	 31	 249	 0.7578125	 0.2803468	 0.40928265
I-LOC	 138	 66	 119	 0.6764706	 0.53696495	 0.59869844
I-PER	 1239	 206	 65	 0.85743946	 0.95015335	 0.9014187
B-MISC	 525	 135	 397	 0.79545456	 0.5694143	 0.6637168
B-PER	 1667	 511	 173	 0.7653811	 0.90597826	 0.8297661
tp: 6007 fp: 1431 fn: 2591 labels: 8
Macro-average	 prec: 0.78295374, rec: 0.60196155, f1: 0.68063086
Micro-average	 prec: 0.80760956, rec: 0.69865084, f1: 0.7491893


Epoch 2/5 starte
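
The averages in the log can be reproduced from the per-label tp, fp, and fn counts. The sketch below is not part of the original notebook: it recomputes the epoch 1 figures in plain Python. The micro-average pools the counts across labels, while the macro-average takes the mean of the per-label precision and recall. The macro F1 in the log matches the harmonic mean of macro precision and macro recall rather than the mean of per-label F1 scores. This is inferred from the numbers, not from the library source.

```python
# (tp, fp, fn) per label, copied from the epoch 1 evaluation log above.
counts = {
    "B-LOC": (1490, 280, 347),
    "B-TST": (634, 122, 707),
    "I-TST": (217, 80, 534),
    "I-MISC": (97, 31, 249),
    "I-LOC": (138, 66, 119),
    "I-PER": (1239, 206, 65),
    "B-MISC": (525, 135, 397),
    "B-PER": (1667, 511, 173),
}


def f1_score(prec, rec):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0


def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec, f1_score(prec, rec)


per_label = {label: prf(*c) for label, c in counts.items()}

# Micro-average: pool the counts over all labels, then compute the metrics once.
total_tp, total_fp, total_fn = (sum(col) for col in zip(*counts.values()))
micro = prf(total_tp, total_fp, total_fn)

# Macro-average: mean of per-label precision and recall, F1 from those means.
macro_prec = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_rec = sum(r for _, r, _ in per_label.values()) / len(per_label)
macro = (macro_prec, macro_rec, f1_score(macro_prec, macro_rec))

print("micro: prec=%.4f rec=%.4f f1=%.4f" % micro)
print("macro: prec=%.4f rec=%.4f f1=%.4f" % macro)
```

The gap between the two averages comes from the rare labels: I-TST and I-MISC have low recall, which pulls the macro-average down much more than the micro-average.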

**Some notes:**
- We used the smallest BERT model, called BERT Tiny.
- It is very small and requires less memory than other Transformer models.
- If you have more memory or access to accelerated hardware, please choose a larger BERT model for higher accuracy.
- You can also train for more epochs to approach our SOTA metrics.

We chose the smallest BERT model and only 5 epochs so that this tutorial can run within a small Colab VM.

In [None]:
# let's save our trained NER model on disk
# so we can load it in a new session or move it to another location
# since we fit NerDL model inside the pipeline, we can access it via stages
ner_model.stages[1].write().overwrite().save('./NER_bert_20200219')

In [None]:
test_data.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|                bert|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[[document, 0, 64...|[[document, 0, 64...|[[token, 0, 6, CR...|[[pos, 0, 6, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|   LONDON 1996-08-30|[[document, 0, 16...|[[document, 0, 16...|[[token, 0, 5, LO...|[[pos, 0, 5, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|West Indian all-r...|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 3, We...|[[pos, 0, 3, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|Their stay on top...|[[document, 0, 20...|[[document, 0, 20...|[[token, 0, 4, Th...|[[pos, 0, 4, PRP$...|

In [None]:
# let's only feed the sentence, token, and label columns from our test dataset;
# the pipeline recomputes the bert column itself
predictions = ner_model.transform(test_data.select("sentence", "token", "label"))
predictions.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+
|            sentence|               token|               label|                bert|                 ner|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|[[document, 0, 64...|[[token, 0, 6, CR...|[[named_entity, 0...|[[word_embeddings...|[[named_entity, 0...|
|[[document, 0, 16...|[[token, 0, 5, LO...|[[named_entity, 0...|[[word_embeddings...|[[named_entity, 0...|
|[[document, 0, 18...|[[token, 0, 3, We...|[[named_entity, 0...|[[word_embeddings...|[[named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



In [None]:
predictions.select('token.result','label.result','ner.result').show(truncate=40)

+----------------------------------------+----------------------------------------+----------------------------------------+
|                                  result|                                  result|                                  result|
+----------------------------------------+----------------------------------------+----------------------------------------+
|[CRICKET, -, LEICESTERSHIRE, TAKE, OV...|   [O, O, B-TST, O, O, O, O, O, O, O, O]|   [O, O, B-TST, O, O, O, O, O, O, O, O]|
|                    [LONDON, 1996-08-30]|                              [B-LOC, O]|                              [B-LOC, O]|
|[West, Indian, all-rounder, Phil, Sim...|[B-MISC, I-MISC, O, B-PER, I-PER, O, ...|[B-MISC, I-MISC, O, B-PER, I-PER, O, ...|
|[Their, stay, on, top, ,, though, ,, ...|[O, O, O, O, O, O, O, O, O, O, O, O, ...|[O, O, O, O, O, O, O, O, O, O, O, O, ...|
|[After, bowling, Somerset, out, for, ...|[O, O, B-TST, O, O, O, O, O, O, O, O,...|[O, B-TST, I-TST, O, O, O, O, O, O, O...|
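
The ner column holds one IOB tag per token, so the tags still need to be grouped into entity chunks. Spark NLP provides the NerConverter annotator for that step. The function below is not the library's implementation and is not part of the original notebook: it is a plain-Python sketch of the grouping logic, applied to tokens and tags taken from the outputs in this notebook. `bio_to_chunks` is an illustrative name, and the last example row is cut to its first four tokens.

```python
def bio_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) chunks."""
    chunks, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        # A B- tag, or an I- tag of a different type, starts a new chunk.
        starts_new = prefix == "B" or (prefix == "I" and entity != current_type)
        if tag == "O" or starts_new:
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if tag != "O":
            current_type = entity
            current_tokens.append(token)
    if current_tokens:
        chunks.append((current_type, " ".join(current_tokens)))
    return chunks


tokens = ["West", "Indian", "all-rounder", "Phil", "Simmons", "took", "four"]
tags = ["B-MISC", "I-MISC", "O", "B-PER", "I-PER", "O", "O"]
print(bio_to_chunks(tokens, tags))

# First four tokens of the "After bowling Somerset ..." row: ground truth vs. prediction.
tokens = ["After", "bowling", "Somerset", "out"]
print(bio_to_chunks(tokens, ["O", "O", "B-TST", "O"]))
print(bio_to_chunks(tokens, ["O", "B-TST", "I-TST", "O"]))
```

Comparing the last two results shows how a single shifted B- tag turns the correct chunk "Somerset" into the wrong chunk "bowling Somerset".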


In [None]:
predictions.printSchema()

root
 |-- sentence: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (n

In [None]:
import pyspark.sql.functions as F

predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth"),
        F.expr("cols['2']").alias("prediction")).show(truncate=False)

+--------------+------------+----------+
|token         |ground_truth|prediction|
+--------------+------------+----------+
|CRICKET       |O           |O         |
|-             |O           |O         |
|LEICESTERSHIRE|B-TST       |B-TST     |
|TAKE          |O           |O         |
|OVER          |O           |O         |
|AT            |O           |O         |
|TOP           |O           |O         |
|AFTER         |O           |O         |
|INNINGS       |O           |O         |
|VICTORY       |O           |O         |
|.             |O           |O         |
|LONDON        |B-LOC       |B-LOC     |
|1996-08-30    |O           |O         |
|West          |B-MISC      |B-MISC    |
|Indian        |I-MISC      |I-MISC    |
|all-rounder   |O           |O         |
|Phil          |B-PER       |B-PER     |
|Simmons       |I-PER       |I-PER     |
|took          |O           |O         |
|four          |O           |O         |
+--------------+------------+----------+
only showing top
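
The arrays_zip and explode combination above pairs up the three arrays position by position and emits one row per token. The sketch below is not part of the original notebook: it mirrors that logic in plain Python on two rows taken from the prediction output (the second one cut to its first four tokens), then computes token-level accuracy and counts the mismatches. The variable names are illustrative.

```python
from collections import Counter

# (tokens, ground truth, prediction) per sentence, taken from the output above.
rows = [
    (["LONDON", "1996-08-30"], ["B-LOC", "O"], ["B-LOC", "O"]),
    (["After", "bowling", "Somerset", "out"],
     ["O", "O", "B-TST", "O"],
     ["O", "B-TST", "I-TST", "O"]),
]

# Equivalent of arrays_zip + explode: one (token, ground_truth, prediction) per token.
flat = [triple for tokens, gold, pred in rows for triple in zip(tokens, gold, pred)]

correct = sum(gold == pred for _, gold, pred in flat)
accuracy = correct / len(flat)

# Which (ground_truth, prediction) pairs disagree, and how often.
confusions = Counter((gold, pred) for _, gold, pred in flat if gold != pred)

print(f"token accuracy: {accuracy:.3f}")
for (gold, pred), n in confusions.items():
    print(f"{gold} predicted as {pred}: {n}")
```

Token-level accuracy is dominated by the O tag, so the per-label precision and recall from the training log remain the more informative metrics.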

## Convert to Pandas

In [None]:
import pandas as pd

df = predictions.select('token.result','label.result','ner.result').toPandas()

df

| | result | result.1 | result.2 |
|---|---|---|---|
| 0 | [CRICKET, -, LEICESTERSHIRE, TAKE, OVER, AT, T... | [O, O, B-TST, O, O, O, O, O, O, O, O] | [O, O, B-TST, O, O, O, O, O, O, O, O] |
| 1 | [LONDON, 1996-08-30] | [B-LOC, O] | [B-LOC, O] |
| 2 | [West, Indian, all-rounder, Phil, Simmons, too... | [B-MISC, I-MISC, O, B-PER, I-PER, O, O, O, O, ... | [B-MISC, I-MISC, O, B-PER, I-PER, O, O, O, O, ... |
| 3 | [Their, stay, on, top, ,, though, ,, may, be, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, B-TST,... | [O, O, O, O, O, O, O, O, O, O, O, O, O, B-TST,... |
| 4 | [After, bowling, Somerset, out, for, 83, on, t... | [O, O, B-TST, O, O, O, O, O, O, O, O, B-LOC, I... | [O, B-TST, I-TST, O, O, O, O, O, O, O, O, B-LO... |
| ... | ... | ... | ... |
| 3245 | [But, the, prices, may, move, in, a, close, ra... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O] | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O] |
| 3246 | [Brokers, said, blue, chips, like, IDLC, ,, Ba... | [O, O, O, O, O, B-TST, O, B-TST, I-TST, O, B-T... | [O, O, O, O, O, O, O, B-LOC, O, O, B-TST, O, O... |
| 3247 | [They, said, there, was, still, demand, for, b... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |
| 3248 | [The, DSE, all, share, price, index, closed, 2... | [O, B-TST, O, O, O, O, O, O, O, O, O, O, O, O,... | [O, B-TST, O, O, O, O, O, O, O, O, O, O, O, O,... |
