# NER with BERT in Spark NLP

## Installation

In [1]:
# This is only needed to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2021-04-12 15:06:32--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.26
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.26|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2021-04-12 15:06:33--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1593 (1.6K) [text/plain]
Saving to: ‘STDOUT’

-                     0%[                    ]       0  --.-KB/s               setup Colab for PySpark 3.1.1 and Spark NLP 3.0.1

2021-04-12 15:06:33 (2.23 MB

## Import libraries and download datasets

In [2]:
import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *

In [3]:
# if you have GPU
# spark = sparknlp.start(gpu=True)
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.0.1
Apache Spark version:  3.1.1


In [4]:
from urllib.request import urlretrieve

urlretrieve('https://github.com/JohnSnowLabs/spark-nlp/raw/master/src/test/resources/conll2003/eng.train',
           'eng.train')

urlretrieve('https://github.com/JohnSnowLabs/spark-nlp/raw/master/src/test/resources/conll2003/eng.testa',
           'eng.testa')


('eng.testa', <http.client.HTTPMessage at 0x7fefba656d90>)

In [5]:
with open("eng.train") as f:
    c = f.read()

# peek at the first 200 characters of the raw CoNLL file
print(c[:200])

-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
Black
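
The preview above shows the CoNLL-2003 layout: one token per line with four space-separated columns (token, POS tag, chunk tag, NER tag), a blank line between sentences, and a `-DOCSTART-` line between documents. As a rough illustration of what a reader for this format has to do, here is a minimal pure-Python sketch. `parse_conll` is a hypothetical helper written for this example, not part of Spark NLP; `CoNLL().readDataset` remains the way to load the file into Spark.

```python
def parse_conll(text):
    """Split CoNLL-2003 text into sentences of (token, pos, chunk, ner) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # a blank line or a document marker closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed or truncated lines
        current.append(tuple(parts))
    if current:
        sentences.append(current)
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

Peter NNP B-NP B-PER
"""

sentences = parse_conll(sample)
tokens = [tok for tok, _, _, _ in sentences[0]]
ner_tags = [ner for _, _, _, ner in sentences[0]]
print(tokens)
print(ner_tags)
```

Only the last column is the training target for NER; the POS and chunk columns are carried along as extra annotations.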


## Building NER pipeline

In [6]:
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, './eng.train')
training_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[{document, 0, 47...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
|     Peter Blackburn|[{document, 0, 14...|[{document, 0, 14...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
| BRUSSELS 1996-08-22|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 7, BR...|[{pos, 0, 7, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



In [7]:
training_data.count()

14041

### Loading BERT

Spark NLP provides four pre-trained BERT variants (bert_base_uncased, bert_base_cased, bert_large_uncased, bert_large_cased) as well as many smaller BERT models on our [Models Hub](https://nlp.johnsnowlabs.com/models?q=bert&task=Embeddings). Which one to use depends on your use case, your training set, and the complexity of the task you are trying to model.

In the code snippet below, we load a pre-trained BERT model from the Spark NLP public resources and pass the sentence and token columns to setInputCols(). In short, the BertEmbeddings() annotator takes the sentence and token columns and writes the BERT embeddings to the bert column. With the BERT base models, each token is mapped to a 768-dimensional vector; the small_bert_L2_128 model used here produces 128-dimensional vectors.

As explained by the authors of the BERT paper, different BERT layers capture different information. The last layer is too close to the target functions (i.e. masked language model and next sentence prediction) used during pre-training, so it may be biased towards those targets.

In [8]:
# we use BERT Tiny
bert_annotator = BertEmbeddings.pretrained('small_bert_L2_128', 'en') \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert") \
    .setBatchSize(8)

small_bert_L2_128 download started this may take some time.
Approximate size to download 16.1 MB
[OK!]


In [9]:
from sparknlp.training import CoNLL

test_data = CoNLL().readDataset(spark, './eng.testa')

test_data = bert_annotator.transform(test_data)

test_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|                bert|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[{document, 0, 64...|[{document, 0, 64...|[{token, 0, 6, CR...|[{pos, 0, 6, NNP,...|[{named_entity, 0...|[{word_embeddings...|
|   LONDON 1996-08-30|[{document, 0, 16...|[{document, 0, 16...|[{token, 0, 5, LO...|[{pos, 0, 5, NNP,...|[{named_entity, 0...|[{word_embeddings...|
|West Indian all-r...|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 3, We...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|[{word_embeddings...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

In [10]:
# let's transform and save our test dataset for evaluation
test_data.write.parquet("test_withEmbeds.parquet")

In [11]:
test_data.select("bert.result","bert.embeddings",'label.result').show()

+--------------------+--------------------+--------------------+
|              result|          embeddings|              result|
+--------------------+--------------------+--------------------+
|[cricket, -, leic...|[[-1.6099558, 0.5...|[O, O, B-ORG, O, ...|
|[london, 1996-08-30]|[[-0.66074246, 0....|          [B-LOC, O]|
|[west, indian, al...|[[-1.2108907, 0.9...|[B-MISC, I-MISC, ...|
|[their, stay, on,...|[[-0.9397644, 0.0...|[O, O, O, O, O, O...|
|[after, bowling, ...|[[-1.126781, 1.11...|[O, O, B-ORG, O, ...|
|[trailing, by, 21...|[[-1.8359265, 0.4...|[O, O, O, O, B-OR...|
|[essex, ,, howeve...|[[-1.2150189, 0.2...|[B-ORG, O, O, O, ...|
|[hussain, ,, cons...|[[-1.607896, 0.52...|[B-PER, O, O, O, ...|
|[by, the, close, ...|[[-1.868376, 1.14...|[O, O, O, B-ORG, ...|
|[at, the, oval, ,...|[[-1.874095, 0.69...|[O, O, B-LOC, O, ...|
|[he, was, well, b...|[[-1.6607136, 1.2...|[O, O, O, O, O, B...|
|[derbyshire, kept...|[[-1.1823796, 0.2...|[B-ORG, O, O, O, ...|
|[australian, tom,...|[[-

In [12]:
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "bert"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(5)\
  .setLr(0.001)\
  .setPo(0.005)\
  .setBatchSize(32)\
  .setEvaluationLogExtended(True) \
  .setEnableOutputLogs(True)\
  .setTestDataset("test_withEmbeds.parquet")

pipeline = Pipeline(
    stages = [
    bert_annotator,
    nerTagger
  ])

You can also set the learning rate (setLr), the learning rate decay coefficient (setPo), the batch size (setBatchSize) and the dropout rate (setDropout). Please see the [official API documentation](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/ner/dl/NerDLApproach.html) for the full list of parameters.
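
To get a feel for how the decay coefficient interacts with the learning rate, here is a minimal sketch that assumes an inverse-time schedule of the form `lr / (1 + po * epoch)`. That formula is our assumption for illustration, not something stated in this notebook; it does agree with the second-epoch rate starting with 9.950249E that appears in the training log further down. `decayed_lr` is a hypothetical helper.

```python
def decayed_lr(lr, po, epoch):
    """Learning rate at a 0-based epoch, assuming lr / (1 + po * epoch)."""
    return lr / (1.0 + po * epoch)


# same values as setLr(0.001) and setPo(0.005), over 5 epochs
schedule = [decayed_lr(0.001, 0.005, epoch) for epoch in range(5)]

for epoch, rate in enumerate(schedule, start=1):
    print(f"Epoch {epoch}/5 lr: {rate:.7g}")
```

With such a small coefficient the rate barely moves over 5 epochs, so setPo only starts to matter for longer training runs.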

In [13]:
%%time

ner_model = pipeline.fit(training_data)

CPU times: user 5.42 s, sys: 618 ms, total: 6.03 s
Wall time: 16min 55s


In [14]:
!ls -l /root/annotator_logs/

total 8
-rw-r--r-- 1 root root 4200 Apr 12 15:27 NerDLApproach_16a0e7b3577f.log


In [15]:
!cat /root/annotator_logs/NerDLApproach_16a0e7b3577f.log

Name of the selected graph: ner-dl/blstm_10_128_128_120.pb
Training started - total epochs: 5 - lr: 0.001 - batch size: 32 - labels: 9 - chars: 58 - training examples: 14041


Epoch 1/5 started, lr: 0.001, dataset size: 14041


Epoch 1/5 - 169.06s - loss: 2376.4592 - batches: 441
Quality on test dataset: 
time to finish evaluation: 11.74s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1429	 224	 408	 0.8644888	 0.7778987	 0.81891114
I-ORG	 248	 96	 503	 0.7209302	 0.33022636	 0.45296806
I-MISC	 128	 84	 218	 0.6037736	 0.3699422	 0.45878136
I-LOC	 123	 19	 134	 0.86619717	 0.47859922	 0.6165413
I-PER	 1188	 74	 116	 0.9413629	 0.9110429	 0.9259548
B-MISC	 547	 163	 375	 0.7704225	 0.5932755	 0.6703431
B-ORG	 741	 209	 600	 0.78	 0.5525727	 0.6468791
B-PER	 1495	 145	 345	 0.9115854	 0.8125	 0.8591954
tp: 5899 fp: 1014 fn: 2699 labels: 8
Macro-average	 prec: 0.8073451, rec: 0.6032572, f1: 0.6905373
Micro-average	 prec: 0.8533198, rec: 0.6860898, f1: 0.7606215


Epoch 2/5 started, lr: 9.950249E
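
The per-label and averaged scores in the log can be reproduced from the tp/fp/fn counts alone. The sketch below copies the epoch 1 counts from the log above and recomputes them in plain Python. How the macro F1 is combined (from the macro precision and macro recall, rather than as a mean of per-label F1 scores) is inferred from the numbers in the log, not from the library source.

```python
# (tp, fp, fn) per label, copied from the epoch 1 evaluation log
counts = {
    "B-LOC": (1429, 224, 408),
    "I-ORG": (248, 96, 503),
    "I-MISC": (128, 84, 218),
    "I-LOC": (123, 19, 134),
    "I-PER": (1188, 74, 116),
    "B-MISC": (547, 163, 375),
    "B-ORG": (741, 209, 600),
    "B-PER": (1495, 145, 345),
}


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


per_label = {label: prf(*c) for label, c in counts.items()}

# macro: average the per-label precision and recall, then combine them
macro_prec = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_rec = sum(r for _, r, _ in per_label.values()) / len(per_label)
macro_f1 = 2 * macro_prec * macro_rec / (macro_prec + macro_rec)

# micro: pool the counts over all labels first
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_prec, micro_rec, micro_f1 = prf(tp, fp, fn)

print(f"Macro-average prec: {macro_prec:.4f}, rec: {macro_rec:.4f}, f1: {macro_f1:.4f}")
print(f"Micro-average prec: {micro_prec:.4f}, rec: {micro_rec:.4f}, f1: {micro_f1:.4f}")
```

The gap between the two averages comes from the weak I-ORG, I-MISC and I-LOC labels: they pull the macro average down, while the micro average is dominated by the frequent B-PER, B-LOC and I-PER labels.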

**Some notes:**
- we used the smallest BERT model, BERT Tiny
- it is very small and needs less memory than the other Transformer models
- if you have more memory or access to accelerated hardware, choose a larger BERT model for higher accuracy
- you can also train for more epochs to reach our SOTA metrics

We chose the smallest BERT model and only 5 epochs so that this tutorial runs within this small Colab VM.

In [20]:
# let's save our trained NER model on disk
# so we can load it in a new session or move it to another location
# since we fit NerDL model inside the pipeline, we can access it via stages
ner_model.stages[1].write().overwrite().save('./NER_bert_20200219')

In [21]:
test_data.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|                bert|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|CRICKET - LEICEST...|[{document, 0, 64...|[{document, 0, 64...|[{token, 0, 6, CR...|[{pos, 0, 6, NNP,...|[{named_entity, 0...|[{word_embeddings...|
|   LONDON 1996-08-30|[{document, 0, 16...|[{document, 0, 16...|[{token, 0, 5, LO...|[{pos, 0, 5, NNP,...|[{named_entity, 0...|[{word_embeddings...|
|West Indian all-r...|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 3, We...|[{pos, 0, 3, NNP,...|[{named_entity, 0...|[{word_embeddings...|
|Their stay on top...|[{document, 0, 20...|[{document, 0, 20...|[{token, 0, 4, Th...|[{pos, 0, 4, PRP$...|

In [26]:
# let's feed only the sentence, token and label columns from our test dataset; the pipeline recomputes the embeddings
predictions = ner_model.transform(test_data.select("sentence", "token", "label"))
predictions.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+
|            sentence|               token|               label|                bert|                 ner|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|[{document, 0, 64...|[{token, 0, 6, CR...|[{named_entity, 0...|[{word_embeddings...|[{named_entity, 0...|
|[{document, 0, 16...|[{token, 0, 5, LO...|[{named_entity, 0...|[{word_embeddings...|[{named_entity, 0...|
|[{document, 0, 18...|[{token, 0, 3, We...|[{named_entity, 0...|[{word_embeddings...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



In [27]:
predictions.select('token.result','label.result','ner.result').show(truncate=40)

+----------------------------------------+----------------------------------------+----------------------------------------+
|                                  result|                                  result|                                  result|
+----------------------------------------+----------------------------------------+----------------------------------------+
|[CRICKET, -, LEICESTERSHIRE, TAKE, OV...|   [O, O, B-ORG, O, O, O, O, O, O, O, O]|   [O, O, B-ORG, O, O, O, O, O, O, O, O]|
|                    [LONDON, 1996-08-30]|                              [B-LOC, O]|                              [B-LOC, O]|
|[West, Indian, all-rounder, Phil, Sim...|[B-MISC, I-MISC, O, B-PER, I-PER, O, ...|[B-MISC, I-MISC, O, B-PER, I-PER, O, ...|
|[Their, stay, on, top, ,, though, ,, ...|[O, O, O, O, O, O, O, O, O, O, O, O, ...|[O, O, O, O, O, O, O, O, O, O, O, O, ...|
|[After, bowling, Somerset, out, for, ...|[O, O, B-ORG, O, O, O, O, O, O, O, O,...|[O, O, I-PER, O, O, O, O, O, O, O, O,...|


In [28]:
predictions.printSchema()

root
 |-- sentence: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (n
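
Every column in this schema is an array of the same annotation struct: annotatorType, begin, end, result, metadata and embeddings. The sketch below mirrors that struct with a plain dataclass to show how the offsets relate to the text. `AnnotationSketch` is a hypothetical stand-in for illustration only, and the offsets are the ones shown for the first training row (`{token, 0, 1, EU...` and `{document, 0, 47...`).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnnotationSketch:
    """Plain-Python mirror of the annotation struct printed by printSchema()."""
    annotatorType: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: List[float] = field(default_factory=list)


text = "EU rejects German call to boycott British lamb ."
document = AnnotationSketch("document", 0, 47, text)
token = AnnotationSketch("token", 0, 1, "EU")

# begin and end are character offsets into the text, and end is inclusive
covered = text[token.begin:token.end + 1]
print(covered)
print(len(text) - 1 == document.end)
```

Because `end` is inclusive, slicing needs `end + 1`; forgetting this is an easy way to lose the last character of every entity.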

In [29]:
import pyspark.sql.functions as F

predictions.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth"),
        F.expr("cols['2']").alias("prediction")).show(truncate=False)

+--------------+------------+----------+
|token         |ground_truth|prediction|
+--------------+------------+----------+
|CRICKET       |O           |O         |
|-             |O           |O         |
|LEICESTERSHIRE|B-ORG       |B-ORG     |
|TAKE          |O           |O         |
|OVER          |O           |O         |
|AT            |O           |O         |
|TOP           |O           |O         |
|AFTER         |O           |O         |
|INNINGS       |O           |O         |
|VICTORY       |O           |O         |
|.             |O           |O         |
|LONDON        |B-LOC       |B-LOC     |
|1996-08-30    |O           |O         |
|West          |B-MISC      |B-MISC    |
|Indian        |I-MISC      |I-MISC    |
|all-rounder   |O           |O         |
|Phil          |B-PER       |B-PER     |
|Simmons       |I-PER       |I-PER     |
|took          |O           |O         |
|four          |O           |O         |
+--------------+------------+----------+
only showing top

## Convert to Pandas

In [31]:
import pandas as pd

df = predictions.select('token.result','label.result','ner.result').toPandas()

df

Unnamed: 0,result,result.1,result.2
0,"[CRICKET, -, LEICESTERSHIRE, TAKE, OVER, AT, T...","[O, O, B-ORG, O, O, O, O, O, O, O, O]","[O, O, B-ORG, O, O, O, O, O, O, O, O]"
1,"[LONDON, 1996-08-30]","[B-LOC, O]","[B-LOC, O]"
2,"[West, Indian, all-rounder, Phil, Simmons, too...","[B-MISC, I-MISC, O, B-PER, I-PER, O, O, O, O, ...","[B-MISC, I-MISC, O, B-PER, I-PER, O, O, O, O, ..."
3,"[Their, stay, on, top, ,, though, ,, may, be, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, B-ORG,...","[O, O, O, O, O, O, O, O, O, O, O, O, O, B-ORG,..."
4,"[After, bowling, Somerset, out, for, 83, on, t...","[O, O, B-ORG, O, O, O, O, O, O, O, O, B-LOC, I...","[O, O, I-PER, O, O, O, O, O, O, O, O, B-LOC, I..."
...,...,...,...
3245,"[But, the, prices, may, move, in, a, close, ra...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, O]"
3246,"[Brokers, said, blue, chips, like, IDLC, ,, Ba...","[O, O, O, O, O, B-ORG, O, B-ORG, I-ORG, O, B-O...","[O, O, O, O, O, O, O, B-LOC, O, O, B-LOC, O, O..."
3247,"[They, said, there, was, still, demand, for, b...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3248,"[The, DSE, all, share, price, index, closed, 2...","[O, B-ORG, O, O, O, O, O, O, O, O, O, O, O, O,...","[O, B-ORG, O, O, O, O, O, O, O, O, O, O, O, O,..."
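
Once the token, label and prediction lists are on the Python side, it is easy to look for disagreements row by row. Here is a minimal sketch using the first seven tokens of row 4 from the table above (the rest of that row is truncated in the display). `token_mismatches` and `token_accuracy` are hypothetical helpers written for this example.

```python
def token_mismatches(tokens, gold, pred):
    """Return (token, gold_tag, predicted_tag) for every disagreement."""
    return [(t, g, p) for t, g, p in zip(tokens, gold, pred) if g != p]


def token_accuracy(gold, pred):
    """Share of tokens whose predicted tag equals the gold tag."""
    pairs = list(zip(gold, pred))
    return sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0


# excerpt of row 4: "After bowling Somerset out for 83 on ..."
tokens = ["After", "bowling", "Somerset", "out", "for", "83", "on"]
gold = ["O", "O", "B-ORG", "O", "O", "O", "O"]
pred = ["O", "O", "I-PER", "O", "O", "O", "O"]

mismatches = token_mismatches(tokens, gold, pred)
accuracy = token_accuracy(gold, pred)
print(mismatches)
print(round(accuracy, 3))
```

Token accuracy looks flattering on NER data because most tokens are `O`, which is why the training log reports per-label precision, recall and F1 instead.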


## Prediction Pipeline

In [None]:
from pyspark.ml import Pipeline

document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

bert = BertEmbeddings.pretrained('small_bert_L2_128', 'en') \
 .setInputCols(["sentence",'token'])\
 .setOutputCol("bert")\
 .setCaseSensitive(True)

loaded_ner_model = NerDLModel.load("NER_bert_20200219")\
 .setInputCols(["sentence", "token", "bert"])\
 .setOutputCol("ner")

converter = NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_span")

ner_prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        bert,
        loaded_ner_model,
        converter])

In [None]:
empty_data = spark.createDataFrame([['']]).toDF("text")

empty_data.show()

In [None]:
prediction_model = ner_prediction_pipeline.fit(empty_data)


In [None]:
text = "Peter Parker is a nice guy and lives in New York."
sample_data = spark.createDataFrame([[text]]).toDF("text")
sample_data.show()

In [None]:

preds = prediction_model.transform(sample_data)

preds.show()

In [None]:
preds.select('ner_span.result').take(1)

In [None]:

preds.select(F.explode(F.arrays_zip("ner_span.result","ner_span.metadata")).alias("entities")) \
.select(F.expr("entities['0']").alias("chunk"),
        F.expr("entities['1'].entity").alias("entity")).show(truncate=False)

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

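# NOTE: `glove` is the WordEmbeddingsModel defined in the GloVe embeddings section
# further down; run that cell first. The NER model loaded here must also have been
# trained on the same embeddings it receives at prediction time.
loaded_ner_model = NerDLModel.load("NER_bert_20200219")\
 .setInputCols(["sentence", "token", "glove"])\
 .setOutputCol("ner")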

converter = NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_span")

glove_ner_prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        glove,
        loaded_ner_model,
        converter])

In [None]:
glove_prediction_model = glove_ner_prediction_pipeline.fit(empty_data)

In [None]:

preds = glove_prediction_model.transform(sample_data)

preds.show()

In [None]:

preds.select(F.explode(F.arrays_zip("ner_span.result","ner_span.metadata")).alias("entities")) \
.select(F.expr("entities['0']").alias("chunk"),
        F.expr("entities['1'].entity").alias("entity")).show(truncate=False)

### Pretrained Pipelines

In [None]:
from sparknlp.pretrained import PretrainedPipeline

pretrained_pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')

#onto_recognize_entities_sm
#explain_document_dl

In [None]:
text = "The Mona Lisa is a 16th century oil painting created by Leonardo. It's held at the Louvre in Paris."

result = pretrained_pipeline.annotate(text)

list(zip(result['token'], result['ner']))

In [None]:
pretrained_pipeline2 = PretrainedPipeline('explain_document_dl', lang='en')


In [None]:
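# the misspellings ("centry", "Leonrdo") look deliberate: they give the spell checker something to fix (see 'checked' below)
text = "The Mona Lisa is a 16th centry oil painting created by Leonrdo. It's held at the Louvre in Paris."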

result2 = pretrained_pipeline2.annotate(text)

result2
list(zip(result2['token'],  result2['checked'], result2['pos'], result2['ner'],  result2['lemma'],  result2['stem']))

In [None]:
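# fullAnnotate returns a list with one result dict per input text, so take the first element
xx = pretrained_pipeline2.fullAnnotate(text)[0]

# the chunk column name depends on the pipeline; inspect xx.keys() if 'ner_span' is missing
[(n.result, n.metadata['entity']) for n in xx['ner_span']]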

### With GloVe embeddings

In [None]:
# pretrained() is called on the class; with no arguments it loads the default model
glove = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("glove") \
    .setCaseSensitive(False)

test_data = CoNLL().readDataset(spark, './eng.testa')

test_data = glove.transform(test_data.limit(1000))

test_data.write.parquet("test_withGloveEmbeds.parquet")


In [None]:
nerTagger.setInputCols(["sentence", "token", "glove"])
nerTagger.setTestDataset("test_withGloveEmbeds.parquet")

glove_pipeline = Pipeline(
    stages = [
    glove,
    nerTagger
  ])

In [None]:
%%time

ner_model_v3 = glove_pipeline.fit(training_data.limit(1000))

In [None]:
predictions_v3 = ner_model_v3.transform(test_data.limit(10))

# test_data.sample(False,0.1,0)

predictions_v3.select(F.explode(F.arrays_zip('token.result','label.result','ner.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("token"),
        F.expr("cols['1']").alias("ground_truth"),
        F.expr("cols['2']").alias("prediction")).show(truncate=False)

In [None]:
import numpy as np

np.array(predictions.select('token.result').take(1))[0][0]

In [None]:
import numpy as np
import pandas as pd

# compare the first row of each prediction set
# NOTE: predictions_v2 is not defined anywhere in this notebook; it has to come from a second BERT run
tokens = np.array(predictions.select('token.result').take(1))[0][0]
ground = np.array(predictions.select('label.result').take(1))[0][0]
label_bert_0 = np.array(predictions.select('ner.result').take(1))[0][0]
label_bert_2 = np.array(predictions_v2.select('ner.result').take(1))[0][0]
label_glove = np.array(predictions_v3.select('ner.result').take(1))[0][0]

pd.DataFrame({'token':tokens,
              'ground':ground,
              'label_bert_0':label_bert_0,
              'label_bert_2':label_bert_2,
              'label_glove':label_glove})

## Using your own custom Word Embedding

In [None]:
custom_embeddings = WordEmbeddings()\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("glove")\
  .setStoragePath('/Users/vkocaman/cache_pretrained/PubMed-shuffle-win-2.bin', "BINARY")\
  .setDimension(200)

In [None]:
custom_embeddings.fit(training_data.limit(10)).transform(training_data.limit(10)).show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|               glove|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[[document, 0, 47...|[[document, 0, 47...|[[token, 0, 1, EU...|[[pos, 0, 1, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|     Peter Blackburn|[[document, 0, 14...|[[document, 0, 14...|[[token, 0, 4, Pe...|[[pos, 0, 4, NNP,...|[[named_entity, 0...|[[word_embeddings...|
| BRUSSELS 1996-08-22|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 7, BR...|[[pos, 0, 7, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|The European Comm...|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 2, Th...|[[pos, 0, 2, DT, ...|

## Creating your own CoNLL dataset

In [None]:
import json
import os
from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *
import sparknlp

spark = sparknlp.start()

def get_ann_pipeline():
    """Build a LightPipeline that tokenizes, POS-tags and NER-tags raw text."""
    
    document_assembler = DocumentAssembler() \
        .setInputCol("text")\
        .setOutputCol('document')

    sentence = SentenceDetector()\
        .setInputCols(['document'])\
        .setOutputCol('sentence')\
        .setCustomBounds(['\n'])

    tokenizer = Tokenizer() \
        .setInputCols(["sentence"]) \
        .setOutputCol("token")

    pos = PerceptronModel.pretrained() \
              .setInputCols(["sentence", "token"]) \
              .setOutputCol("pos")
    
    embeddings = WordEmbeddingsModel.pretrained()\
          .setInputCols(["sentence", "token"])\
          .setOutputCol("embeddings")

    ner_model = NerDLModel.pretrained() \
          .setInputCols(["sentence", "token", "embeddings"]) \
          .setOutputCol("ner")

    ner_converter = NerConverter()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")

    ner_pipeline = Pipeline(
        stages = [
            document_assembler,
            sentence,
            tokenizer,
            pos,
            embeddings,
            ner_model,
            ner_converter
        ]
    )

    empty_data = spark.createDataFrame([[""]]).toDF("text")

    ner_pipelineFit = ner_pipeline.fit(empty_data)

    ner_lp_pipeline = LightPipeline(ner_pipelineFit)

    print ("Spark NLP NER lightpipeline is created")

    return ner_lp_pipeline


In [None]:
conll_pipeline = get_ann_pipeline ()

pos_anc download started this may take some time.
Approximate size to download 4.3 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.5 MB
[OK!]
Spark NLP NER lightpipeline is created


In [None]:
parsed = conll_pipeline.annotate ("Peter Parker is a nice guy and lives in New York.")
parsed

{'document': ['Peter Parker is a nice guy and lives in New York.'],
 'ner_chunk': ['Peter Parker', 'New York'],
 'pos': ['NNP',
  'NNP',
  'VBZ',
  'DT',
  'JJ',
  'NN',
  'CC',
  'NNS',
  'IN',
  'NNP',
  'NNP',
  '.'],
 'token': ['Peter',
  'Parker',
  'is',
  'a',
  'nice',
  'guy',
  'and',
  'lives',
  'in',
  'New',
  'York',
  '.'],
 'ner': ['I-PER',
  'I-PER',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'I-LOC',
  'I-LOC',
  'O'],
 'embeddings': ['Peter',
  'Parker',
  'is',
  'a',
  'nice',
  'guy',
  'and',
  'lives',
  'in',
  'New',
  'York',
  '.'],
 'sentence': ['Peter Parker is a nice guy and lives in New York.']}
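
The `ner` and `ner_chunk` entries above show what NerConverter adds: consecutive tagged tokens of the same entity type are merged into one chunk. The grouping logic can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions, not Spark NLP's actual implementation, and `tags_to_chunks` is a hypothetical helper; it accepts both `B-` and `I-` prefixes at the start of an entity, since the pretrained model above emits `I-PER` and `I-LOC` only.

```python
def tags_to_chunks(tokens, tags):
    """Merge consecutive tokens with the same entity type into (text, type) chunks."""
    chunks, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        # an O tag, a B- prefix or a change of entity type closes the open chunk
        starts_new = tag == "O" or prefix == "B" or entity != current_type
        if starts_new and current_tokens:
            chunks.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if tag != "O":
            current_tokens.append(token)
            current_type = entity
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_type))
    return chunks


tokens = ["Peter", "Parker", "is", "a", "nice", "guy", "and",
          "lives", "in", "New", "York", "."]
tags = ["I-PER", "I-PER", "O", "O", "O", "O", "O",
        "O", "O", "I-LOC", "I-LOC", "O"]

chunks = tags_to_chunks(tokens, tags)
print(chunks)
```

The two chunks it finds, Peter Parker and New York, are the same ones listed under `ner_chunk` in the output above.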

In [None]:
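conll_lines = ''

# CoNLL-2003 columns: token, POS tag, chunk tag, NER tag.
# No chunker runs in this pipeline, so the POS tag is reused as a placeholder chunk tag.
for token, pos, ner in zip(parsed['token'], parsed['pos'], parsed['ner']):
    conll_lines += "{} {} {} {}\n".format(token, pos, pos, ner)

print(conll_lines)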

Peter NNP NNP I-PER
Parker NNP NNP I-PER
is VBZ VBZ O
a DT DT O
nice JJ JJ O
guy NN NN O
and CC CC O
lives NNS NNS O
in IN IN O
New NNP NNP I-LOC
York NNP NNP I-LOC
. . . O
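
To turn lines like these into a file that looks like eng.train, the sentences also need the `-DOCSTART-` header and a blank line between them, as in the preview of the training file at the top of this notebook. Here is a minimal sketch; `to_conll` is a hypothetical helper, and the two short "sentences" are excerpts of the output above used as toy input.

```python
def to_conll(sentences, doc_header="-DOCSTART- -X- -X- O"):
    """Render sentences of (token, pos, ner) tuples as CoNLL-2003 text.

    The POS tag is repeated in the chunk column as a placeholder,
    as in the loop above.
    """
    blocks = [doc_header]
    for sentence in sentences:
        rows = [f"{token} {pos} {pos} {ner}" for token, pos, ner in sentence]
        blocks.append("\n".join(rows))
    # blank line after the header and between sentences
    return "\n\n".join(blocks) + "\n"


toy_sentences = [
    [("Peter", "NNP", "I-PER"), ("Parker", "NNP", "I-PER")],
    [("New", "NNP", "I-LOC"), ("York", "NNP", "I-LOC"), (".", ".", "O")],
]

conll_text = to_conll(toy_sentences)
print(conll_text)
```

Writing `conll_text` to a file gives something that can be passed to `CoNLL().readDataset(spark, path)` in the same way as eng.train; the predicted tags should of course be reviewed and corrected before they are used as training labels.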

