# NER with BERT in Spark NLP

In [35]:
!python -V

Python 3.7.4


## Installation

In [None]:
# https://medium.com/spark-nlp/introduction-to-spark-nlp-installation-and-getting-started-part-ii-d009f7a177f3

## JDK v8 (pick the line for your OS)

! brew tap AdoptOpenJDK/openjdk              # macOS
! sudo apt-get install openjdk-8-jre         # Ubuntu, Debian
! su -c "yum install java-1.8.0-openjdk"     # Fedora, Red Hat, etc.

# ---------------------- #
# JDK v8 on Colab
import os

! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# ---------------------- #

## Spark NLP

! pip install --ignore-installed pyspark==2.4.4
! pip install --ignore-installed spark-nlp==2.4.1

## Import libraries and download datasets

In [36]:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

import sparknlp
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [2]:
spark = sparknlp.start()
#spark = sparknlp.start(gpu=True)

In [37]:
print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  2.4.1
Apache Spark version:  2.4.4


In [None]:
def start(gpu=False):
    builder = SparkSession.builder \
        .appName("Spark NLP") \
        .master("local[*]") \
        .config("spark.driver.memory", "8G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
        .config("spark.kryoserializer.buffer.max", "1000M")
    if gpu:
        builder.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-gpu_2.11:2.4.1")
    else:
        builder.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.4.1")

    return builder.getOrCreate()

  
spark = start(gpu=False)

In [38]:
from urllib.request import urlretrieve

urlretrieve('https://github.com/JohnSnowLabs/spark-nlp/raw/master/src/test/resources/conll2003/eng.train',
           'eng.train')

urlretrieve('https://github.com/JohnSnowLabs/spark-nlp/raw/master/src/test/resources/conll2003/eng.testa',
           'eng.testa')


('eng.testa', <http.client.HTTPMessage at 0x1a1a1318d0>)

In [4]:
with open("eng.train") as f:
    c = f.read()

# Peek at the first 500 characters of the raw CoNLL file
print(c[:500])

-DOCSTART- -X- -X- O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP I-NP I-LOC
1996-08-22 CD I-NP O

The DT I-NP O
European NNP I-NP I-ORG
Commission NNP I-NP I-ORG
said VBD I-VP O
on IN I-PP O
Thursday NNP I-NP O
it PRP B-NP O
disagreed VBD I-VP O
with IN I-PP O
German JJ I-NP I-MISC
advice NN I-NP O
to TO I-PP O
consumers NNS I-NP
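
The raw file printed above has one token per line with four space-separated columns (token, POS tag, chunk tag, NER tag), blank lines between sentences, and a `-DOCSTART-` marker between documents. As a minimal sketch of that layout, the following pure-Python parser groups the lines into sentences. The function name `parse_conll` and the embedded `SAMPLE` string are illustrative only; the notebook itself relies on `CoNLL().readDataset()` for this step.

```python
# Illustrative sample copied from the first lines of eng.train shown above.
SAMPLE = """-DOCSTART- -X- -X- O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""


def parse_conll(text):
    """Split CoNLL-2003 text into sentences of (token, pos, chunk, ner) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # Blank lines and document markers close the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed or truncated rows
        current.append(tuple(parts))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conll(SAMPLE)
print(len(sentences))   # number of sentences found
print(sentences[1])     # the "Peter Blackburn" sentence
```

Each parsed sentence corresponds to one row of the DataFrame that `CoNLL().readDataset()` returns.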


## Building NER pipeline

In [39]:
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, './eng.train')
training_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[[document, 0, 47...|[[document, 0, 47...|[[token, 0, 1, EU...|[[pos, 0, 1, NNP,...|[[named_entity, 0...|
|     Peter Blackburn|[[document, 0, 14...|[[document, 0, 14...|[[token, 0, 4, Pe...|[[pos, 0, 4, NNP,...|[[named_entity, 0...|
| BRUSSELS 1996-08-22|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 7, BR...|[[pos, 0, 7, NNP,...|[[named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



In [7]:
training_data.count()

14041

### Loading BERT

Spark NLP ships four pre-trained BERT variants: bert_base_uncased, bert_base_cased, bert_large_uncased, and bert_large_cased. Which one to use depends on your use case, your training set, and the complexity of the task you are modeling.

In the code snippet below, we load the bert_base_cased version from the Spark NLP public resources and pass the sentence and token columns to setInputCols(). In short, the BertEmbeddings() annotator takes the sentence and token columns and writes the BERT embeddings to the bert column. With the base models, each token is translated into a 768-dimensional vector. setPoolingLayer() can be set to 0 (the first layer, and the fastest), -1 (the last layer), or -2 (the second-to-last hidden layer).

As explained by the authors of the official BERT paper, different BERT layers capture different information. The last layer is too close to the pre-training objectives (masked language modeling and next sentence prediction), so it may be biased toward those targets; if you do not fine-tune the model, this can lead to a poor representation. If you want to use the last hidden layer anyway, set the pooling layer to -1. In the end, it is a trade-off between model accuracy and the computational resources you have.
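
To make the 0 / -1 / -2 convention concrete, here is a small sketch that treats the encoder output as a plain Python list of per-layer vectors and picks one of them by index. It does not call Spark NLP: `hidden_states`, `select_layer`, and the 12-layer stack are hypothetical stand-ins, and how the library chooses the layer internally is not shown here.

```python
NUM_LAYERS = 12   # assumption: a BERT-base-sized stack
DIM = 768         # vector size per token for the base models

# Placeholder hidden states for a single token: layer i is filled with the
# value i, so the selected layer is easy to recognise in the output.
hidden_states = [[float(i)] * DIM for i in range(NUM_LAYERS)]


def select_layer(layers, pooling_layer):
    """Pick one layer using the 0 (first), -1 (last), -2 (second-to-last) convention."""
    if pooling_layer not in (0, -1, -2):
        raise ValueError("pooling_layer must be 0, -1 or -2")
    return layers[pooling_layer]


for choice in (0, -1, -2):
    vector = select_layer(hidden_states, choice)
    print(choice, len(vector), vector[0])
```

Whichever layer is chosen, the resulting token vector has the same dimension, so the downstream NER model does not need to change.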

In [40]:
bert_annotator = BertEmbeddings.pretrained('bert_base_cased', 'en') \
 .setInputCols(["sentence",'token'])\
 .setOutputCol("bert")\
 .setCaseSensitive(False)\
 .setPoolingLayer(0)

bert_base_cased download started this may take some time.
Approximate size to download 389.2 MB
[OK!]


In [None]:
# BertEmbeddings.load("local/path/")

In [41]:
from sparknlp.training import CoNLL

test_data = CoNLL().readDataset(spark, './eng.testa')

test_data = bert_annotator.transform(test_data)

test_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|                bert|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[[document, 0, 64...|[[document, 0, 64...|[[token, 0, 6, CR...|[[pos, 0, 6, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|   LONDON 1996-08-30|[[document, 0, 16...|[[document, 0, 16...|[[token, 0, 5, LO...|[[pos, 0, 5, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|West Indian all-r...|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 3, We...|[[pos, 0, 3, NNP,...|[[named_entity, 0...|[[word_embeddings...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

In [26]:
# Run once to create the parquet file that setTestDataset() reads during training
#test_data.limit(1000).write.parquet("test_withEmbeds.parquet")

In [42]:
test_data.select("bert.result","bert.embeddings",'label.result').show()

+--------------------+--------------------+--------------------+
|              result|          embeddings|              result|
+--------------------+--------------------+--------------------+
|[cricket, -, leic...|[[-0.1976323, -0....|[O, O, B-ORG, O, ...|
|[london, 1996-08-30]|[[0.60744655, 0.2...|          [B-LOC, O]|
|[west, indian, al...|[[-0.77769196, -0...|[B-MISC, I-MISC, ...|
|[their, stay, on,...|[[-1.0986965, 0.9...|[O, O, O, O, O, O...|
|[after, bowling, ...|[[-1.1295222, 0.4...|[O, O, B-ORG, O, ...|
|[trailing, by, 21...|[[-1.763109, 0.64...|[O, O, O, O, B-OR...|
|[essex, ,, howeve...|[[-0.7097074, -0....|[B-ORG, O, O, O, ...|
|[hussain, ,, cons...|[[-0.18262176, 0....|[B-PER, O, O, O, ...|
|[by, the, close, ...|[[-0.44396043, 0....|[O, O, O, B-ORG, ...|
|[at, the, oval, ,...|[[-1.0189406, -0....|[O, O, B-LOC, O, ...|
|[he, was, well, b...|[[-0.48848447, 0....|[O, O, O, O, O, B...|
|[derbyshire, kept...|[[-0.08332734, -1...|[B-ORG, O, O, O, ...|
|[australian, tom,...|[[0

In [21]:
import numpy as np

emb_vector = np.array(test_data.select("bert.embeddings").take(1))

emb_vector

array([[[[-0.1976323 , -0.42026019,  0.55059767, ...,  0.43232313,
           0.08174106,  0.20429248],
         [-0.95179886,  0.40518612, -0.18066669, ..., -0.56889868,
           0.60068387,  0.0411097 ],
         [-0.02262329,  0.5078364 ,  0.12273782, ..., -1.2903105 ,
          -0.2035525 ,  0.46557245],
         ...,
         [ 0.05775764, -0.63264298, -0.66510761, ..., -0.70176458,
           0.57686883, -0.59081137],
         [ 0.26866663, -1.05502069, -0.19403641, ...,  0.30519044,
           1.04685092,  0.74213707],
         [-0.32736883,  0.54801679,  1.04293954, ...,  0.25098351,
           0.55407429,  0.26589558]]]])

In [43]:
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "bert"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(1)\
  .setLr(0.001)\
  .setPo(0.005)\
  .setBatchSize(8)\
  .setRandomSeed(0)\
  .setVerbose(1)\
  .setValidationSplit(0.2)\
  .setEvaluationLogExtended(True) \
  .setEnableOutputLogs(True)\
  .setIncludeConfidence(True)\
  .setTestDataset("test_withEmbeds.parquet")


pipeline = Pipeline(
    stages = [
    bert_annotator,
    nerTagger
  ])

You can also set the learning rate (setLr), the learning rate decay coefficient (setPo), the batch size (setBatchSize), and the dropout rate (setDropout). See the official repo for the full list of parameters.
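
To get a feel for how a decay coefficient interacts with the base learning rate, the sketch below uses inverse-time decay, `lr / (1 + po * epoch)`, with the values passed to setLr and setPo above. Whether NerDLApproach applies exactly this schedule is an assumption here and should be checked against the library source; `decayed_lr` is an illustrative helper, not a Spark NLP function.

```python
def decayed_lr(lr, po, epoch):
    """Inverse-time decay: the rate shrinks as the epoch index grows."""
    return lr / (1.0 + po * epoch)


base_lr = 0.001   # value passed to setLr above
po = 0.005        # value passed to setPo above

for epoch in range(0, 10, 3):
    print(epoch, round(decayed_lr(base_lr, po, epoch), 6))
```

With a coefficient this small the rate barely moves over ten epochs, so the decay only matters for longer training runs or larger values of po.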

In [22]:
%%time

ner_model = pipeline.fit(training_data.limit(1000))

CPU times: user 26.3 ms, sys: 15.1 ms, total: 41.4 ms
Wall time: 44.6 s


In [44]:
ner_model

PipelineModel_cebcb55e7e77

In [None]:
# On Colab, training on the entire train set (CoNLL-2003) for 10 epochs takes about 30 minutes on a GPU with pooling layer 0

In [45]:
predictions = ner_model.transform(test_data)
predictions.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|                bert|                 ner|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[[document, 0, 64...|[[document, 0, 64...|[[token, 0, 6, CR...|[[pos, 0, 6, NNP,...|[[named_entity, 0...|[[word_embeddings...|[[named_entity, 0...|
|   LONDON 1996-08-30|[[document, 0, 16...|[[document, 0, 16...|[[token, 0, 5, LO...|[[pos, 0, 5, NNP,...|[[named_entity, 0...|[[word_embeddings...|[[named_entity, 0...|
|West Indian all-r...|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 3, We...|[[pos, 0, 3, NNP,...|[[named_entity, 0...|[[word_embeddings...|[[

In [17]:
predictions.select('token.result','label.result','ner.result').show(truncate=40)

+----------------------------------------+----------------------------------------+----------------------------------------+
|                                  result|                                  result|                                  result|
+----------------------------------------+----------------------------------------+----------------------------------------+
|[CRICKET, -, LEICESTERSHIRE, TAKE, OV...|   [O, O, I-ORG, O, O, O, O, O, O, O, O]|   [O, O, I-PER, O, O, O, O, O, O, O, O]|
|                    [LONDON, 1996-08-30]|                              [I-LOC, O]|                              [I-LOC, O]|
|[West, Indian, all-rounder, Phil, Sim...|[I-MISC, I-MISC, O, I-PER, I-PER, O, ...|[I-LOC, O, O, I-ORG, I-ORG, O, O, O, ...|
|[Their, stay, on, top, ,, though, ,, ...|[O, O, O, O, O, O, O, O, O, O, O, O, ...|[O, O, O, O, O, O, O, O, O, O, O, O, ...|
|[After, bowling, Somerset, out, for, ...|[O, O, I-ORG, O, O, O, O, O, O, O, O,...|[O, O, O, O, O, O, O, O, O, O, O, O, ...|


In [98]:
predictions.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = tr

In [20]:
import pyspark.sql.functions as F

predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth"),
        F.expr("cols['2']").alias("prediction")).show(truncate=False)

+--------------+------------+----------+
|token         |ground_truth|prediction|
+--------------+------------+----------+
|CRICKET       |O           |O         |
|-             |O           |O         |
|LEICESTERSHIRE|I-ORG       |I-PER     |
|TAKE          |O           |O         |
|OVER          |O           |O         |
|AT            |O           |O         |
|TOP           |O           |O         |
|AFTER         |O           |O         |
|INNINGS       |O           |O         |
|VICTORY       |O           |O         |
|.             |O           |O         |
|LONDON        |I-LOC       |I-LOC     |
|1996-08-30    |O           |O         |
|West          |I-MISC      |I-LOC     |
|Indian        |I-MISC      |O         |
|all-rounder   |O           |O         |
|Phil          |I-PER       |I-ORG     |
|Simmons       |I-PER       |I-ORG     |
|took          |O           |O         |
|four          |O           |O         |
+--------------+------------+----------+
only showing top
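
A token-by-token table like the one above is easy to summarise. The sketch below hard-codes the 20 rows shown in the output and computes overall token accuracy, accuracy on entity tokens only, and a count of each (ground truth, prediction) error pair. The variable names are illustrative; on the full DataFrame the same counts could be computed with Spark aggregations instead.

```python
from collections import Counter

# (token, ground_truth, prediction) rows copied from the output above
rows = [
    ("CRICKET", "O", "O"), ("-", "O", "O"), ("LEICESTERSHIRE", "I-ORG", "I-PER"),
    ("TAKE", "O", "O"), ("OVER", "O", "O"), ("AT", "O", "O"), ("TOP", "O", "O"),
    ("AFTER", "O", "O"), ("INNINGS", "O", "O"), ("VICTORY", "O", "O"), (".", "O", "O"),
    ("LONDON", "I-LOC", "I-LOC"), ("1996-08-30", "O", "O"),
    ("West", "I-MISC", "I-LOC"), ("Indian", "I-MISC", "O"), ("all-rounder", "O", "O"),
    ("Phil", "I-PER", "I-ORG"), ("Simmons", "I-PER", "I-ORG"),
    ("took", "O", "O"), ("four", "O", "O"),
]

# Overall accuracy is dominated by the easy "O" tokens.
accuracy = sum(1 for _, gold, pred in rows if gold == pred) / len(rows)

# Accuracy restricted to tokens that are part of an entity in the ground truth.
entity_rows = [(gold, pred) for _, gold, pred in rows if gold != "O"]
entity_accuracy = sum(1 for gold, pred in entity_rows if gold == pred) / len(entity_rows)

# Which label confusions occur, and how often.
errors = Counter((gold, pred) for _, gold, pred in rows if gold != pred)

print("token accuracy:", accuracy)
print("entity-token accuracy:", round(entity_accuracy, 3))
print(errors.most_common())
```

The gap between the two accuracy figures shows why plain token accuracy is a weak metric for NER: most tokens are `O`, and a model trained for one epoch on 1,000 sentences gets those right while missing most entities.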

### Loading from local

In [None]:
# Load the model trained for 10 epochs on GPU with the entire train set

loaded_ner_model = NerDLModel.load("NER_bert_20200226")\
 .setInputCols(["sentence", "token", "bert"])\
 .setOutputCol("ner")

In [34]:
predictions_loaded = loaded_ner_model.transform(test_data)

predictions_loaded.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth"),
        F.expr("cols['2']").alias("prediction")).show(30, truncate=False)

+--------------+------------+----------+
|token         |ground_truth|prediction|
+--------------+------------+----------+
|CRICKET       |O           |O         |
|-             |O           |O         |
|LEICESTERSHIRE|I-ORG       |B-PER     |
|TAKE          |O           |O         |
|OVER          |O           |O         |
|AT            |O           |O         |
|TOP           |O           |O         |
|AFTER         |O           |O         |
|INNINGS       |O           |O         |
|VICTORY       |O           |O         |
|.             |O           |O         |
|LONDON        |I-LOC       |B-LOC     |
|1996-08-30    |O           |O         |
|West          |I-MISC      |B-MISC    |
|Indian        |I-MISC      |I-MISC    |
|all-rounder   |O           |O         |
|Phil          |I-PER       |B-PER     |
|Simmons       |I-PER       |I-PER     |
|took          |O           |O         |
|four          |O           |O         |
|for           |O           |O         |
|38            |

In [29]:
import pandas as pd

df = predictions_loaded.select('token.result','label.result','ner.result').toPandas()

df

Unnamed: 0,result,result.1,result.2
0,"[CRICKET, -, LEICESTERSHIRE, TAKE, OVER, AT, T...","[O, O, I-ORG, O, O, O, O, O, O, O, O]","[O, O, B-PER, O, O, O, O, O, O, O, O]"
1,"[LONDON, 1996-08-30]","[I-LOC, O]","[B-LOC, O]"
2,"[West, Indian, all-rounder, Phil, Simmons, too...","[I-MISC, I-MISC, O, I-PER, I-PER, O, O, O, O, ...","[B-MISC, I-MISC, O, B-PER, I-PER, O, O, O, O, ..."
3,"[Their, stay, on, top, ,, though, ,, may, be, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, I-ORG,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, B-ORG,..."
4,"[After, bowling, Somerset, out, for, 83, on, t...","[O, O, I-ORG, O, O, O, O, O, O, O, O, I-LOC, I...","[O, O, O, O, O, O, O, O, O, O, O, B-LOC, I-LOC..."
...,...,...,...
3245,"[But, the, prices, may, move, in, a, close, ra...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]"
3246,"[Brokers, said, blue, chips, like, IDLC, ,, Ba...","[O, O, O, O, O, I-ORG, O, I-ORG, I-ORG, O, I-O...","[O, O, O, O, O, B-ORG, O, B-LOC, O, O, B-PER, ..."
3247,"[They, said, there, was, still, demand, for, b...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3248,"[The, DSE, all, share, price, index, closed, 2...","[O, I-ORG, O, O, O, O, O, O, O, O, O, O, O, O,...","[O, B-ORG, I-ORG, O, O, O, O, O, O, O, O, O, O..."
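
In the output above, the ground-truth column starts entities with `I-` (for example `I-LOC` for LONDON) while the loaded model starts them with `B-` (`B-LOC`). A plain string comparison would count those as errors even when the entity is right. One way to put both columns on the same footing is to convert the ground truth to the `B-`-first scheme before comparing. The sketch below assumes the ground truth follows the IOB1 convention, where `B-` only appears between two adjacent entities of the same type; `iob1_to_iob2` is an illustrative helper, not part of Spark NLP.

```python
def iob1_to_iob2(tags):
    """Rewrite IOB1 tags so that every entity starts with a B- tag."""
    converted = []
    prev_type = None  # entity type of the previous token, None after "O"
    for tag in tags:
        if tag == "O":
            converted.append(tag)
            prev_type = None
            continue
        prefix, entity = tag.split("-", 1)
        # An I- tag that does not continue the previous entity opens a new one.
        if prefix == "I" and prev_type != entity:
            prefix = "B"
        converted.append(prefix + "-" + entity)
        prev_type = entity
    return converted


# First six tags of row 2 in the table above
gold = ["I-MISC", "I-MISC", "O", "I-PER", "I-PER", "O"]
pred = ["B-MISC", "I-MISC", "O", "B-PER", "I-PER", "O"]

print(iob1_to_iob2(gold))
print(iob1_to_iob2(gold) == pred)
```

After conversion, the remaining differences between the two columns are genuine labeling errors, not tagging-scheme mismatches.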


### BERT with poolingLayer -2

In [42]:
bert_annotator.setPoolingLayer(-2)

BERT_EMBEDDINGS_abf30dcdf344

In [43]:
pipeline = Pipeline(
    stages = [
    bert_annotator,
    nerTagger
  ])

In [44]:
ner_model_v2 = pipeline.fit(training_data.limit(1000))

In [45]:
predictions_v2 = ner_model_v2.transform(test_data.limit(10))

predictions_v2.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth"),
        F.expr("cols['2']").alias("prediction")).show(truncate=False)

+--------------+------------+----------+
|token         |ground_truth|prediction|
+--------------+------------+----------+
|CRICKET       |O           |O         |
|-             |O           |O         |
|LEICESTERSHIRE|I-ORG       |I-ORG     |
|TAKE          |O           |O         |
|OVER          |O           |O         |
|AT            |O           |O         |
|TOP           |O           |O         |
|AFTER         |O           |O         |
|INNINGS       |O           |O         |
|VICTORY       |O           |O         |
|.             |O           |O         |
|LONDON        |I-LOC       |I-LOC     |
|1996-08-30    |O           |O         |
|West          |I-MISC      |I-ORG     |
|Indian        |I-MISC      |I-ORG     |
|all-rounder   |O           |I-PER     |
|Phil          |I-PER       |I-PER     |
|Simmons       |I-PER       |I-PER     |
|took          |O           |O         |
|four          |O           |O         |
+--------------+------------+----------+
only showing top

### With GloVe embeddings

In [52]:
glove = WordEmbeddingsModel().pretrained() \
 .setInputCols(["sentence",'token'])\
 .setOutputCol("glove")\
 .setCaseSensitive(False)

test_data = CoNLL().readDataset(spark, './eng.testa')

test_data = glove.transform(test_data.limit(1000))

test_data.write.parquet("test_withGloveEmbeds.parquet")


glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [55]:
nerTagger.setInputCols(["sentence", "token", "glove"])
nerTagger.setTestDataset("test_withGloveEmbeds.parquet")

glove_pipeline = Pipeline(
    stages = [
    glove,
    nerTagger
  ])

In [56]:
%%time

ner_model_v3 = glove_pipeline.fit(training_data.limit(1000))

CPU times: user 26.8 ms, sys: 7.29 ms, total: 34.1 ms
Wall time: 17 s


In [60]:
predictions_v3 = ner_model_v3.transform(test_data.limit(10))

# test_data.sample(False,0.1,0)

predictions_v3.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth"),
        F.expr("cols['2']").alias("prediction")).show(truncate=False)

+--------------+------------+----------+
|token         |ground_truth|prediction|
+--------------+------------+----------+
|CRICKET       |O           |I-MISC    |
|-             |O           |O         |
|LEICESTERSHIRE|I-ORG       |I-ORG     |
|TAKE          |O           |O         |
|OVER          |O           |O         |
|AT            |O           |O         |
|TOP           |O           |O         |
|AFTER         |O           |O         |
|INNINGS       |O           |O         |
|VICTORY       |O           |O         |
|.             |O           |O         |
|LONDON        |I-LOC       |I-LOC     |
|1996-08-30    |O           |O         |
|West          |I-MISC      |O         |
|Indian        |I-MISC      |O         |
|all-rounder   |O           |O         |
|Phil          |I-PER       |I-PER     |
|Simmons       |I-PER       |I-PER     |
|took          |O           |O         |
|four          |O           |O         |
+--------------+------------+----------+
only showing top

In [68]:
np.array (predictions.select('token.result').take(1))[0][0]

array(['CRICKET', '-', 'LEICESTERSHIRE', 'TAKE', 'OVER', 'AT', 'TOP',
       'AFTER', 'INNINGS', 'VICTORY', '.'], dtype='<U14')

In [72]:
import pandas as pd

tokens = np.array (predictions.select('token.result').take(1))[0][0]
ground = np.array (predictions.select('label.result').take(1))[0][0]
label_bert_0 = np.array (predictions.select('ner.result').take(1))[0][0]
label_bert_2 = np.array (predictions_v2.select('ner.result').take(1))[0][0]
label_glove = np.array (predictions_v3.select('ner.result').take(1))[0][0]

pd.DataFrame({'token':tokens,
              'ground':ground,
              'label_bert_0':label_bert_0,
              'label_bert_2':label_bert_2,
              'label_glove':label_glove})

Unnamed: 0,token,ground,label_bert_0,label_bert_2,label_glove
0,CRICKET,O,O,O,I-MISC
1,-,O,O,O,O
2,LEICESTERSHIRE,I-ORG,O,I-ORG,I-ORG
3,TAKE,O,O,O,O
4,OVER,O,O,O,O
5,AT,O,O,O,O
6,TOP,O,O,O,O
7,AFTER,O,O,O,O
8,INNINGS,O,O,O,O
9,VICTORY,O,O,O,O


### Saving the trained model

In [85]:
ner_model_v3.stages

[WORD_EMBEDDINGS_MODEL_48cffc8b9a76, NerDLModel_9f1a235716e7]

In [86]:
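# stages[1] is the trained NerDLModel. Note that ner_model_v3 was trained on
# GloVe embeddings, even though the path below has "bert" in its name.
ner_model_v3.stages[1].write().overwrite().save('NER_bert_20200219')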

## Prediction Pipeline

In [87]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

bert = BertEmbeddings.pretrained('bert_base_cased', 'en') \
 .setInputCols(["sentence",'token'])\
 .setOutputCol("bert")\
 .setCaseSensitive(False)

loaded_ner_model = NerDLModel.load("NER_bert_20200219")\
 .setInputCols(["sentence", "token", "bert"])\
 .setOutputCol("ner")

converter = NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_span")

ner_prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        bert,
        loaded_ner_model,
        converter])

bert_base_cased download started this may take some time.
Approximate size to download 389.2 MB
[OK!]


In [88]:
empty_data = spark.createDataFrame([['']]).toDF("text")

empty_data.show()

+----+
|text|
+----+
|    |
+----+



In [89]:
prediction_model = ner_prediction_pipeline.fit(empty_data)


In [90]:
text = "Peter Parker is a nice guy and lives in New York."
sample_data = spark.createDataFrame([[text]]).toDF("text")
sample_data.show()

+--------------------+
|                text|
+--------------------+
|Peter Parker is a...|
+--------------------+



In [None]:

preds = prediction_model.transform(sample_data)

preds.show()

In [None]:
preds.select('ner_span.result').take(1)

In [83]:

preds.select(F.explode(F.arrays_zip("ner_span.result","ner_span.metadata")).alias("entities")) \
.select(F.expr("entities['0']").alias("chunk"),
        F.expr("entities['1'].entity").alias("entity")).show(truncate=False)

+-----+------+
|chunk|entity|
+-----+------+
+-----+------+



In [93]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

loaded_ner_model = NerDLModel.load("NER_bert_20200219")\
 .setInputCols(["sentence", "token", "glove"])\
 .setOutputCol("ner")

converter = NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_span")

glove_ner_prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        glove,
        loaded_ner_model,
        converter])

In [94]:
glove_prediction_model = glove_ner_prediction_pipeline.fit(empty_data)

In [95]:

preds = glove_prediction_model.transform(sample_data)

preds.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|               glove|                 ner|            ner_span|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Peter Parker is a...|[[document, 0, 48...|[[document, 0, 48...|[[token, 0, 4, Pe...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 11, P...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [96]:

preds.select(F.explode(F.arrays_zip("ner_span.result","ner_span.metadata")).alias("entities")) \
.select(F.expr("entities['0']").alias("chunk"),
        F.expr("entities['1'].entity").alias("entity")).show(truncate=False)

+------------+------+
|chunk       |entity|
+------------+------+
|Peter Parker|PER   |
+------------+------+



### Pretrained Pipelines

In [37]:
from sparknlp.pretrained import PretrainedPipeline

pretrained_pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')

#onto_recognize_entities_sm
#explain_document_dl

recognize_entities_dl download started this may take some time.
Approx size to download 159 MB
[OK!]


In [39]:
text = "The Mona Lisa is a 16th century oil painting created by Leonardo. It's held at the Louvre in Paris."

result = pretrained_pipeline.annotate(text)

list(zip(result['token'], result['ner']))

[('The', 'O'),
 ('Mona', 'I-PER'),
 ('Lisa', 'I-PER'),
 ('is', 'O'),
 ('a', 'O'),
 ('16th', 'O'),
 ('century', 'O'),
 ('oil', 'O'),
 ('painting', 'O'),
 ('created', 'O'),
 ('by', 'O'),
 ('Leonardo', 'I-PER'),
 ('.', 'O'),
 ("It's", 'I-ORG'),
 ('held', 'O'),
 ('at', 'O'),
 ('the', 'O'),
 ('Louvre', 'I-LOC'),
 ('in', 'O'),
 ('Paris', 'I-LOC'),
 ('.', 'O')]

In [42]:
pretrained_pipeline2 = PretrainedPipeline('explain_document_dl', lang='en')


explain_document_dl download started this may take some time.
Approx size to download 168.4 MB
[OK!]


In [51]:
text = "The Mona Lisa is a 16th centry oil painting created by Leonrdo. It's held at the Louvre in Paris."

result2 = pretrained_pipeline2.annotate(text)

result2
list(zip(result2['token'],  result2['checked'], result2['pos'], result2['ner'],  result2['lemma'],  result2['stem']))

[('The', 'The', 'DT', 'O', 'The', 'the'),
 ('Mona', 'Mona', 'NNP', 'I-PER', 'Mona', 'mona'),
 ('Lisa', 'Lisa', 'NNP', 'I-PER', 'Lisa', 'lisa'),
 ('is', 'is', 'VBZ', 'O', 'be', 'i'),
 ('a', 'a', 'DT', 'O', 'a', 'a'),
 ('16th', '6th', 'CD', 'O', '6th', '6th'),
 ('centry', 'centry', 'NN', 'O', 'centry', 'centri'),
 ('oil', 'oil', 'NN', 'O', 'oil', 'oil'),
 ('painting', 'painting', 'NN', 'O', 'painting', 'paint'),
 ('created', 'created', 'VBN', 'O', 'create', 'creat'),
 ('by', 'by', 'IN', 'O', 'by', 'by'),
 ('Leonrdo', 'Leonardo', 'NNP', 'I-ORG', 'Leonardo', 'leonardo'),
 ('.', '.', '.', 'O', '.', '.'),
 ("It's", 'Itys', 'NNP', 'I-ORG', 'Itys', 'iti'),
 ('held', 'held', 'VBD', 'O', 'hold', 'held'),
 ('at', 'at', 'IN', 'O', 'at', 'at'),
 ('the', 'the', 'DT', 'O', 'the', 'the'),
 ('Louvre', 'Louvre', 'NNP', 'I-LOC', 'Louvre', 'louvr'),
 ('in', 'in', 'IN', 'O', 'in', 'in'),
 ('Paris', 'Paris', 'NNP', 'I-LOC', 'Paris', 'pari'),
 ('.', '.', '.', 'O', '.', '.')]

In [None]:
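xx = pretrained_pipeline2.fullAnnotate(text)

# fullAnnotate returns a list with one result dict per input text, so index it first
# (the chunk column name depends on how the pipeline was defined)
[(n.result, n.metadata['entity']) for n in xx[0]['ner_span']]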

## Using your own custom word embeddings

In [17]:
custom_embeddings = WordEmbeddings()\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("glove")\
  .setStoragePath('/Users/vkocaman/cache_pretrained/PubMed-shuffle-win-2.bin', "BINARY")\
  .setDimension(200)

In [18]:
custom_embeddings.fit(training_data.limit(10)).transform(training_data.limit(10)).show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|               glove|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[[document, 0, 47...|[[document, 0, 47...|[[token, 0, 1, EU...|[[pos, 0, 1, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|     Peter Blackburn|[[document, 0, 14...|[[document, 0, 14...|[[token, 0, 4, Pe...|[[pos, 0, 4, NNP,...|[[named_entity, 0...|[[word_embeddings...|
| BRUSSELS 1996-08-22|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 7, BR...|[[pos, 0, 7, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|The European Comm...|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 2, Th...|[[pos, 0, 2, DT, ...|

## Creating your own CoNLL dataset

In [29]:
import json
import os
from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *
import sparknlp

spark = sparknlp.start()

def get_ann_pipeline ():
    
    document_assembler = DocumentAssembler() \
        .setInputCol("text")\
        .setOutputCol('document')

    sentence = SentenceDetector()\
        .setInputCols(['document'])\
        .setOutputCol('sentence')\
        .setCustomBounds(['\n'])

    tokenizer = Tokenizer() \
        .setInputCols(["sentence"]) \
        .setOutputCol("token")

    pos = PerceptronModel.pretrained() \
              .setInputCols(["sentence", "token"]) \
              .setOutputCol("pos")
    
    embeddings = WordEmbeddingsModel.pretrained()\
          .setInputCols(["sentence", "token"])\
          .setOutputCol("embeddings")

    ner_model = NerDLModel.pretrained() \
          .setInputCols(["sentence", "token", "embeddings"]) \
          .setOutputCol("ner")

    ner_converter = NerConverter()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")

    ner_pipeline = Pipeline(
        stages = [
            document_assembler,
            sentence,
            tokenizer,
            pos,
            embeddings,
            ner_model,
            ner_converter
        ]
    )

    empty_data = spark.createDataFrame([[""]]).toDF("text")

    ner_pipelineFit = ner_pipeline.fit(empty_data)

    ner_lp_pipeline = LightPipeline(ner_pipelineFit)

    print ("Spark NLP NER lightpipeline is created")

    return ner_lp_pipeline


In [30]:
conll_pipeline = get_ann_pipeline ()

pos_anc download started this may take some time.
Approximate size to download 4.3 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.5 MB
[OK!]
Spark NLP NER lightpipeline is created


In [32]:
parsed = conll_pipeline.annotate ("Peter Parker is a nice guy and lives in New York.")
parsed

{'document': ['Peter Parker is a nice guy and lives in New York.'],
 'ner_chunk': ['Peter Parker', 'New York'],
 'pos': ['NNP',
  'NNP',
  'VBZ',
  'DT',
  'JJ',
  'NN',
  'CC',
  'NNS',
  'IN',
  'NNP',
  'NNP',
  '.'],
 'token': ['Peter',
  'Parker',
  'is',
  'a',
  'nice',
  'guy',
  'and',
  'lives',
  'in',
  'New',
  'York',
  '.'],
 'ner': ['I-PER',
  'I-PER',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'I-LOC',
  'I-LOC',
  'O'],
 'embeddings': ['Peter',
  'Parker',
  'is',
  'a',
  'nice',
  'guy',
  'and',
  'lives',
  'in',
  'New',
  'York',
  '.'],
 'sentence': ['Peter Parker is a nice guy and lives in New York.']}
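
The `ner_chunk` values above are the result of merging consecutive entity tags into spans. As a rough sketch of that idea (not the actual NerConverter implementation, which works with character offsets), the function below groups the `token` and `ner` lists from the output into (chunk, entity type) pairs. `tags_to_chunks` is an illustrative name, and joining tokens with a single space is a simplification.

```python
def tags_to_chunks(tokens, tags):
    """Merge runs of same-type entity tags into (text, entity_type) chunks."""
    chunks = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        entity = None if tag == "O" else tag.split("-", 1)[1]
        starts_new = tag.startswith("B-") or entity != current_type
        # Close the open chunk when we leave it or a new entity begins.
        if current_tokens and (entity is None or starts_new):
            chunks.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if entity is not None:
            current_tokens.append(token)
            current_type = entity
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_type))
    return chunks


# token and ner lists copied from the annotate() output above
tokens = ["Peter", "Parker", "is", "a", "nice", "guy", "and",
          "lives", "in", "New", "York", "."]
tags = ["I-PER", "I-PER", "O", "O", "O", "O", "O",
        "O", "O", "I-LOC", "I-LOC", "O"]

print(tags_to_chunks(tokens, tags))
```

The same function handles both `I-`-only and `B-`/`I-` tag sequences, since a `B-` tag always closes any open chunk before starting a new one.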

In [36]:
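conll_lines = ''

# CoNLL columns: token, POS tag, chunk tag, NER tag.
# No chunker runs in this pipeline, so the POS tag is repeated
# as a placeholder for the chunk column.
for token, pos, ner in zip(parsed['token'], parsed['pos'], parsed['ner']):
    conll_lines += "{} {} {} {}\n".format(token, pos, pos, ner)

print(conll_lines)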

Peter NNP NNP I-PER
Parker NNP NNP I-PER
is VBZ VBZ O
a DT DT O
nice JJ JJ O
guy NN NN O
and CC CC O
lives NNS NNS O
in IN IN O
New NNP NNP I-LOC
York NNP NNP I-LOC
. . . O
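
To turn lines like these into a file that has the same layout as eng.train, the sentences need a `-DOCSTART-` header and a blank line after each sentence. The sketch below shows one way to assemble that text for several sentences at once. `to_conll` is an illustrative helper, the two sample sentences reuse tokens from the eng.train excerpt earlier in the notebook, and the POS tag is again repeated in the chunk column as a placeholder.

```python
def to_conll(sentences):
    """Format sentences of (token, pos, ner) triples as CoNLL-2003 style text."""
    lines = ["-DOCSTART- -X- -X- O", ""]
    for sentence in sentences:
        for token, pos, ner in sentence:
            # Columns: token, POS tag, chunk tag (placeholder), NER tag
            lines.append("{} {} {} {}".format(token, pos, pos, ner))
        lines.append("")  # a blank line ends each sentence
    return "\n".join(lines)


sentences = [
    [("Peter", "NNP", "I-PER"), ("Blackburn", "NNP", "I-PER")],
    [("BRUSSELS", "NNP", "I-LOC"), ("1996-08-22", "CD", "O")],
]

conll_text = to_conll(sentences)
print(conll_text)
```

Writing `conll_text` to a file and passing its path to `CoNLL().readDataset(spark, ...)` should then load it the same way as eng.train, assuming the reader accepts the placeholder chunk column.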

