![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/crf-ner/ner_dl_crf.ipynb)

## 0. Colab Setup

In [1]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

## CRF Named Entity Recognition
In the following example, we walk through training a Conditional Random Fields (CRF) NER model and using it for prediction.

This annotator requires labeled training data: either a labeled dataset passed to the fit() stage, or external CoNLL 2003 resources. It can optionally use an external word embeddings set and a list of additional entities.
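To make the model family concrete: a linear-chain CRF scores a whole label sequence rather than each token on its own, and prediction picks the best-scoring sequence with Viterbi decoding. The sketch below is a toy, pure-Python illustration with hand-picked scores. The names `viterbi`, `emissions`, and `transitions` are hypothetical, and this is not Spark NLP's implementation.

```python
from typing import Dict, List, Tuple


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence for a linear-chain model.

    emissions[t][label] is the score of `label` at position t.
    transitions[(prev, cur)] is the score of moving from prev to cur
    (missing pairs count as 0.0).
    """
    # best[t][label] = (best score of any path ending in label at t, backpointer)
    best = [{lab: (emissions[0][lab], None) for lab in labels}]
    for t in range(1, len(emissions)):
        column = {}
        for cur in labels:
            prev = max(
                labels,
                key=lambda p: best[t - 1][p][0] + transitions.get((p, cur), 0.0),
            )
            score = (
                best[t - 1][prev][0]
                + transitions.get((prev, cur), 0.0)
                + emissions[t][cur]
            )
            column[cur] = (score, prev)
        best.append(column)

    # Follow backpointers from the best final label.
    last = max(labels, key=lambda lab: best[-1][lab][0])
    path = [last]
    for t in range(len(emissions) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    path.reverse()
    return path


# Toy scores for the three tokens "European Union rejects" (made up for illustration).
labels = ["O", "B-ORG", "I-ORG"]
emissions = [
    {"O": 0.1, "B-ORG": 1.0, "I-ORG": 0.8},
    {"O": 0.5, "B-ORG": 0.6, "I-ORG": 0.5},
    {"O": 1.0, "B-ORG": 0.0, "I-ORG": 0.0},
]
transitions = {
    ("B-ORG", "I-ORG"): 1.0,   # continuing an entity is rewarded
    ("I-ORG", "I-ORG"): 0.5,
    ("O", "I-ORG"): -5.0,      # I-ORG directly after O is heavily penalized
}

decoded = viterbi(emissions, transitions, labels)
print(decoded)  # ['B-ORG', 'I-ORG', 'O']
```

Picking the best label at each position independently would tag the second token as `B-ORG`; the transition scores are what let the sequence model prefer `I-ORG` there.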

The CRF annotator also requires part-of-speech tags, so we add those in the same pipeline. We could also use the special RecursivePipeline, which tells Spark NLP's NER CRF approach to use the same pipeline for tagging external resources.
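As a rough illustration of why part-of-speech tags matter here, a CRF tagger typically works on per-token feature sets that combine the word form, its POS tag, and its neighbours. The function `token_features` and the feature names below are hypothetical and are not the feature set Spark NLP actually generates; the example sentence and tags are illustrative.

```python
from typing import Dict, List, Union

Feature = Union[str, bool]


def token_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, Feature]:
    """Build a small feature dictionary for the token at index i."""
    word = tokens[i]
    feats: Dict[str, Feature] = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "pos": pos_tags[i],
    }
    # Context features from the previous token, or a sentence-start marker.
    if i > 0:
        feats["prev.word.lower"] = tokens[i - 1].lower()
        feats["prev.pos"] = pos_tags[i - 1]
    else:
        feats["BOS"] = True
    # Context features from the next token, or a sentence-end marker.
    if i < len(tokens) - 1:
        feats["next.word.lower"] = tokens[i + 1].lower()
        feats["next.pos"] = pos_tags[i + 1]
    else:
        feats["EOS"] = True
    return feats


tokens = ["EU", "rejects", "German", "call"]
pos_tags = ["NNP", "VBZ", "JJ", "NN"]

first = token_features(tokens, pos_tags, 0)
print(first["pos"], first["next.pos"])
```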



#### 1. Call necessary imports and set the resource path to read local data files

In [0]:
import os
import sys

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

import time
import zipfile

#### 2. Download training dataset if not already there

In [3]:
# Download CoNLL 2003 Dataset
import os
from pathlib import Path
import urllib.request

if not Path("eng.train").is_file():
    print("File not found, downloading it.")
    url = "https://github.com/patverga/torch-ner-nlp-from-scratch/raw/master/data/conll2003/eng.train"
    urllib.request.urlretrieve(url, 'eng.train')
else:
    print("File already present.")


File not found, downloading it.
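For orientation, CoNLL 2003 files hold one token per line with four whitespace-separated columns (token, POS tag, chunk tag, NER tag), a blank line between sentences, and `-DOCSTART-` lines between documents. The sketch below parses that layout in plain Python so the structure is visible; `parse_conll` is a hypothetical helper and the sample lines are illustrative. The notebook itself reads the file with Spark NLP's `CoNLL` reader.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str, str]  # (token, pos, ner)


def parse_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group CoNLL 2003 lines into sentences of (token, pos, ner) triples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # First column is the token, second the POS tag, last the NER tag.
        current.append((parts[0], parts[1], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = parse_conll(sample)
print(len(sentences), sentences[0][0])
```

To try it on the downloaded file, pass an open file handle instead of `sample`, for example `parse_conll(open("eng.train", encoding="utf-8"))`.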


#### 3. Load SparkSession if not already there

In [4]:
import sparknlp 

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  2.5.0
Apache Spark version:  2.4.4


#### 4. Create the NER CRF annotator with its training Params

In [0]:
nerTagger = NerCrfApproach()\
  .setInputCols(["sentence", "token", "pos", "embeddings"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMinEpochs(1)\
  .setMaxEpochs(1)\
  .setLossEps(1e-3)\
  .setL2(1)\
  .setC0(1250000)\
  .setRandomSeed(0)\
  .setVerbose(0)
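The epoch settings above interact: with `setMinEpochs(1)` and `setMaxEpochs(1)` training runs for exactly one epoch, so `setLossEps` cannot trigger an early stop. The sketch below shows one plausible reading of such a rule, assuming `lossEps` is a threshold on the relative improvement of the loss between epochs. The function `epochs_to_run` and the loss values are hypothetical, and the exact rule Spark NLP applies is not shown in this notebook.

```python
from typing import List


def epochs_to_run(losses: List[float], min_epochs: int, max_epochs: int, loss_eps: float) -> int:
    """Return how many epochs would run under a relative-improvement stopping rule.

    `losses` holds the loss observed after each epoch (assumed values).
    Training stops once at least `min_epochs` epochs have run and the relative
    change in loss falls below `loss_eps`, or when `max_epochs` is reached.
    """
    prev = None
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if prev is not None and epoch >= min_epochs:
            relative_change = abs(prev - loss) / max(abs(prev), 1e-12)
            if relative_change < loss_eps:
                return epoch
        prev = loss
    return min(len(losses), max_epochs)


made_up_losses = [10.0, 5.0, 4.999, 4.9]

# Settings as in the cell above: a single epoch regardless of the loss curve.
print(epochs_to_run(made_up_losses, min_epochs=1, max_epochs=1, loss_eps=1e-3))

# A wider epoch range lets the threshold end training early.
print(epochs_to_run(made_up_losses, min_epochs=1, max_epochs=10, loss_eps=1e-3))
```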


#### 6. Load the CoNLL training dataset and add word embeddings

In [6]:
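from sparknlp.training import CoNLL

# Read the CoNLL file into a DataFrame with document, sentence, token, pos and label columns
conll = CoNLL()
data = conll.readDataset(spark, path='eng.train')

# Load the default pretrained word embeddings (glove_100d, per the download log below)
embeddings = WordEmbeddingsModel.pretrained()\
  .setOutputCol('embeddings')

# Add the embeddings column that NerCrfApproach lists among its input columns
ready_data = embeddings.transform(data)

ready_data.show(4)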

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|          embeddings|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[[document, 0, 47...|[[document, 0, 47...|[[token, 0, 1, EU...|[[pos, 0, 1, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|     Peter Blackburn|[[document, 0, 14...|[[document, 0, 14...|[[token, 0, 4, Pe...|[[pos, 0, 4, NNP,...|[[named_entity, 0...|[[word_embeddings...|
| BRUSSELS 1996-08-22|[[document, 0, 18...|[[document, 0, 18...|[[token, 0, 7, BR...|[[pos, 0, 7, NNP,...|[[named_entity, 0...|[[word_embeddings...|
|The Euro

#### 7. Train the model on the prepared dataset

In [7]:
start = time.time()
print("Start fitting")
ner_model = nerTagger.fit(ready_data)
print("Fitting has ended")
print(time.time() - start)

Start fitting
Fitting has ended
269.28221678733826


#### 8. Save the NerCrfModel to disk after training

In [0]:
ner_model.write().overwrite().save("./pip_wo_embedd/")