![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/model-downloader/Create%20custom%20pipeline%20-%20NerDL.ipynb)

## 0. Colab Setup

In [None]:
# This cell only sets up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

This notebook shows how to build a custom pipeline from pretrained annotators, including the pretrained NerDL model.

In [None]:
import sys

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel

import sparknlp
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
from sparknlp.pretrained import ResourceDownloader

from pathlib import Path

if sys.version_info[0] < 3:
    from urllib import urlretrieve
else:
    from urllib.request import urlretrieve

In [None]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)


Spark NLP version:  2.5.0
Apache Spark version:  2.4.4


Create some data for testing purposes

In [None]:
from pyspark.sql import Row
R = Row('sentence', 'start', 'end')
test_data = spark.createDataFrame([R('Peter is a good person, and he was working at IBM',0,1)])

Create a custom pipeline

In [None]:
import time

documentAssembler = DocumentAssembler() \
    .setInputCol("sentence") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

spell = NorvigSweetingModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("spell")

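# The output column must be named "embeddings" because ner_dl reads it below
embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# pretrained() is called on the class, not on an instance
ner_dl = NerDLModel.pretrained() \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner_dl")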

finisher = Finisher() \
    .setInputCols(["ner_dl", "lemma", "spell"]) \
    .setIncludeMetadata(True)

pipeline_fast_dl = Pipeline(stages = [
    documentAssembler, 
    tokenizer, 
    lemmatizer, 
    spell, 
    embeddings, 
    ner_dl, 
    finisher])

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]
spellcheck_norvig download started this may take some time.
Approximate size to download 4.2 MB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]
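
Each annotator in the pipeline reads columns that an earlier stage (or the raw data) must already have produced, so a mismatched column name only surfaces when the pipeline runs. The following is a minimal plain-Python sketch of that wiring check, with no Spark dependency. The helper name `check_wiring` and the `(name, inputs, output)` triples are hypothetical illustrations, not part of the Spark NLP API, and the sketch assumes the embeddings stage writes to a column named `embeddings`, which is the column `ner_dl` reads.

```python
# Plain-Python sketch (no Spark needed): confirm that every stage's input
# columns come from the raw data or from an earlier stage's output column.
# `check_wiring` is a hypothetical helper, not a Spark NLP function.

def check_wiring(stages, initial_columns):
    """Return a list of (stage_name, missing_column) pairs."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for col in inputs:
            if col not in available:
                problems.append((name, col))
        if output is not None:
            available.add(output)
    return problems

# (stage name, input columns, output column), mirroring the pipeline above.
# The finisher's output is left as None here because no later stage reads it.
stages = [
    ("documentAssembler", ["sentence"], "document"),
    ("tokenizer", ["document"], "token"),
    ("lemmatizer", ["token"], "lemma"),
    ("spell", ["token"], "spell"),
    ("embeddings", ["document", "token"], "embeddings"),  # assumed output name
    ("ner_dl", ["document", "token", "embeddings"], "ner_dl"),
    ("finisher", ["ner_dl", "lemma", "spell"], None),
]

problems = check_wiring(stages, initial_columns=["sentence", "start", "end"])
print(problems)
```

An empty list means every input column is accounted for. Dropping the embeddings stage from `stages` makes the helper report that `ner_dl` is missing its `embeddings` input.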


Now let's run the pipeline on the test data and look at the results

In [None]:
pipeline_fast_dl.fit(test_data).transform(test_data).show(truncate=False)

+-------------------------------------------------+-----+---+--------------------------------------------+-----------------------------------------------------------+---------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|sentence          