![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/dl-ner/ner_dl.ipynb)

## 0. Colab Setup

In [1]:
import os

# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed -q spark-nlp==2.4.5

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

## Deep Learning NER

In the following example, we walk through training an LSTM-based NER model and using it for prediction. The annotator is implemented on top of TensorFlow.

The annotator takes a series of word embedding vectors and a training dataset in CoNLL format, plus a validation dataset. Spark NLP includes its own predefined TensorFlow graphs, but all layers are trained during the `fit()` stage.

DL NER computes several BiLSTM layers to extract entities automatically, and during prediction it makes batched, distributed calls to the native TensorFlow libraries.
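
As a point of reference for the training data mentioned above, here is a small sketch of the CoNLL-2003 text layout: one token per line with four whitespace-separated columns (token, POS tag, chunk tag, NER tag), blank lines between sentences, and `-DOCSTART-` lines between documents. The `parse_conll` helper and the hand-written `SAMPLE` text are illustrative assumptions, not part of Spark NLP; the notebook itself reads the file with `sparknlp.training.CoNLL`.

```python
# Minimal sketch: parse CoNLL-2003-style text into sentences.
# Assumes four whitespace-separated columns per line: token, POS, chunk, NER.
# SAMPLE is a hand-written illustration, not an excerpt loaded from disk.

SAMPLE = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""


def parse_conll(text):
    """Return a list of sentences; each sentence is a list of (token, pos, ner)."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # Blank lines and document markers close the sentence in progress
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, _chunk, ner = line.split()
        current.append((token, pos, ner))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conll(SAMPLE)
for sentence in sentences:
    print(sentence)
```

The NER column is what ends up in the `label` column used for training; the chunk column is ignored in this sketch.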

#### 1. Import the necessary modules

In [0]:
import os
import sys

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

import time
import zipfile

#### 2. Download CoNLL 2003 data if not present

In [3]:
# Download the CoNLL 2003 dataset
# Source: https://github.com/patverga/torch-ner-nlp-from-scratch/tree/master/data/conll2003
import os
from pathlib import Path
import urllib.request

url = "https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/"
file_train = "eng.train"
file_testa = "eng.testa"
file_testb = "eng.testb"

# Only the training split is fetched here, and only if it is not already on disk
if not Path(file_train).is_file():
    print("Downloading " + file_train)
    urllib.request.urlretrieve(url + file_train, file_train)

Downloading eng.train
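
The cell above fetches only `eng.train`, although names for the two test splits are defined as well. If those are needed too, the same check can be wrapped in a loop. The sketch below is one possible way to do it; `download_missing` and its `fetch` parameter are hypothetical names introduced here (the `fetch` hook exists only so the logic can be exercised without network access).

```python
from pathlib import Path
import urllib.request

BASE_URL = "https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/"


def download_missing(filenames, base_url=BASE_URL, target_dir=".",
                     fetch=urllib.request.urlretrieve):
    """Fetch each file not already present in target_dir; return the names fetched."""
    fetched = []
    for name in filenames:
        destination = Path(target_dir) / name
        if destination.is_file():
            # Already on disk, nothing to do
            continue
        print("Downloading " + name)
        fetch(base_url + name, str(destination))
        fetched.append(name)
    return fetched


# Usage (performs network requests, so it is left commented out here):
# download_missing(["eng.train", "eng.testa", "eng.testb"])
```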


#### 3. Create the Spark session

In [4]:
import sparknlp 

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  2.4.5
Apache Spark version:  2.4.4


#### 4. Load the CoNLL dataset and add word embeddings

In [5]:
from sparknlp.training import CoNLL

conll = CoNLL(
    documentCol="document",
    sentenceCol="sentence",
    tokenCol="token",
    posCol="pos"
)

training_data = conll.readDataset(spark, './eng.train')


# pretrained() with no arguments loads the default model (glove_100d, per the output below)
embeddings = WordEmbeddingsModel.pretrained()\
    .setOutputCol('embeddings')

ready_data = embeddings.transform(training_data)

ready_data.show(4)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|          embeddings|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[[document, 0, 47...|[[document, 0, 47...|[[token, 0, 1, EU...|[[pos, 0, 1, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|     Peter Blackburn|[[document, 0, 14...|[[document, 0, 14...|[[token, 0, 4, Pe...|[[pos, 0, 4, NNP,...|[[named_entity, 0...|[[word_embeddings...|
| BRUSSELS 1996-08-22|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 7, BR...|[[pos, 0, 7, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|The Euro

#### 5. Create the NerDLApproach annotator with appropriate params

In [0]:
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(1)\
  .setRandomSeed(0)\
  .setVerbose(0)\
  .setIncludeConfidence(True)

#### 6. Train the NerDLModel (this will take some time)

In [7]:
start = time.time()
print("Start fitting")
ner_model = nerTagger.fit(ready_data)
print("Fitting is ended")
print(time.time() - start)  # elapsed time in seconds

Start fitting
Fitting is ended
317.23192048072815


#### 7. Let's predict with the model

In [8]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('token')

embeddings = WordEmbeddingsModel.pretrained()\
    .setOutputCol('embeddings')

prediction_pipeline = Pipeline(
    stages = [
        document,
        sentence,
        token,
        embeddings,
        ner_model
    ]
)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [9]:
prediction_data = spark.createDataFrame([["Maria is a nice place."]]).toDF("text")
prediction_data.show()

+--------------------+
|                text|
+--------------------+
|Maria is a nice p...|
+--------------------+



In [10]:
prediction_model = prediction_pipeline.fit(prediction_data)
prediction_model.transform(prediction_data).show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Maria is a nice p...|[[document, 0, 21...|[[document, 0, 21...|[[token, 0, 4, Ma...|[[word_embeddings...|[[named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [11]:
# LightPipeline runs the fitted pipeline directly on plain strings, which is fast for small inputs

lp = LightPipeline(prediction_model)
result = lp.annotate("International Business Machines Corporation (IBM) is an American multinational information technology company headquartered in Armonk.")
list(zip(result['token'], result['ner']))

[('International', 'I-ORG'),
 ('Business', 'I-ORG'),
 ('Machines', 'I-ORG'),
 ('Corporation', 'I-ORG'),
 ('(', 'O'),
 ('IBM', 'I-ORG'),
 (')', 'O'),
 ('is', 'O'),
 ('an', 'O'),
 ('American', 'I-MISC'),
 ('multinational', 'O'),
 ('information', 'O'),
 ('technology', 'O'),
 ('company', 'O'),
 ('headquartered', 'O'),
 ('in', 'O'),
 ('Armonk', 'I-LOC'),
 ('.', 'O')]
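
The token-level tags above can be merged into whole entities. The sketch below is one way to do that in plain Python, assuming the tags follow the `I-`/`B-` prefix scheme seen in the output (an `I-` tag of the same type extends the open entity, anything else closes it). `group_entities` is a hypothetical helper written for this illustration, not a Spark NLP function.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tagged tokens into (entity_text, entity_type) chunks."""
    entities = []
    words, current_type = [], None
    for token, tag in tagged_tokens:
        prefix, _, entity_type = tag.partition("-")
        # A chunk continues only on an I- tag of the same type as the open chunk
        continues = prefix == "I" and entity_type == current_type
        if words and not continues:
            entities.append((" ".join(words), current_type))
            words, current_type = [], None
        if tag != "O":
            words.append(token)
            current_type = entity_type
    if words:
        entities.append((" ".join(words), current_type))
    return entities


# Same (token, tag) pairs as in the LightPipeline output above
tagged = [
    ("International", "I-ORG"), ("Business", "I-ORG"), ("Machines", "I-ORG"),
    ("Corporation", "I-ORG"), ("(", "O"), ("IBM", "I-ORG"), (")", "O"),
    ("is", "O"), ("an", "O"), ("American", "I-MISC"), ("multinational", "O"),
    ("information", "O"), ("technology", "O"), ("company", "O"),
    ("headquartered", "O"), ("in", "O"), ("Armonk", "I-LOC"), (".", "O"),
]

entities = group_entities(tagged)
print(entities)
# [('International Business Machines Corporation', 'ORG'), ('IBM', 'ORG'),
#  ('American', 'MISC'), ('Armonk', 'LOC')]
```

In the notebook, the `tagged` list could be replaced by `list(zip(result['token'], result['ner']))`.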