First we import what we need, start a Spark session and create our model downloader

In [1]:
import os
import sys
sys.path.append('../../')

print(sys.version)

from sparknlp.pretrained import ResourceDownloader
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import *

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

spark = SparkSession.builder \
    .appName("ner")\
    .master("local[1]")\
    .config("spark.driver.memory","4G")\
    .config("spark.driver.maxResultSize", "2G")\
    .config("spark.jars", "lib/sparknlp.jar")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()

# instantiate the downloader
downloader = ResourceDownloader()


3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0]


Create a dummy Spark DataFrame

In [2]:
# create some mock data to play with
l = [
  (1,'To be or not to be'),
  (2,'This is it!')
]

data = spark.createDataFrame(l, ['docID','text'])

Next we download a POS model by its name and language. The model requires tokenized text, so we first build a tokenizer pipeline to get the data ready.
Then we add the POS tagger alongside the other annotators and transform some text.

In [3]:
# download directly - models
document_assembler = DocumentAssembler() \
    .setInputCol("text")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
    
# pos tagger
pos = PerceptronModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")
    
pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])

output = pipeline.fit(data).transform(data)
output.show()

+-----+------------------+--------------------+--------------------+--------------------+--------------------+
|docID|              text|            document|            sentence|               token|                 pos|
+-----+------------------+--------------------+--------------------+--------------------+--------------------+
|    1|To be or not to be|[[document, 0, 17...|[[document, 0, 17...|[[token, 0, 1, To...|[[pos, 0, 1, TO, ...|
|    2|       This is it!|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 3, Th...|[[pos, 0, 3, DT, ...|
+-----+------------------+--------------------+--------------------+--------------------+--------------------+
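
The truncated cells in this table suggest that each annotation starts with its type, a begin offset, an end offset and the result. The following is a minimal plain-Python sketch, not Spark NLP code: it assumes that field order and uses hypothetical tuples in place of real Spark rows (the document's result field is cut off in the output, so its value here is an assumption). It illustrates that the offsets appear to be inclusive character positions into the text column.

```python
# Hypothetical stand-ins for the leading fields of the annotations shown above.
# Real Spark NLP rows carry more fields than these four; the field order is an assumption.
text = "To be or not to be"
document = ("document", 0, 17, text)
first_token = ("token", 0, 1, "To")


def covered_text(source, annotation):
    """Return the slice of `source` covered by an annotation, treating `end` as inclusive."""
    _, begin, end, _ = annotation
    return source[begin:end + 1]


# The document annotation spans the whole 18-character string (offsets 0 to 17).
print(covered_text(text, document))
# The first token covers the two characters at offsets 0 and 1.
print(covered_text(text, first_token))
```

Reading `end` as inclusive is what makes `0, 17` line up with an 18-character sentence and `0, 10` with the 11-character "This is it!".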



In [4]:
downloader.clearCache("pos_fast", "en")

We use the predefined BasicPipeline to annotate a DataFrame

In [5]:
# download predefined - pipelines
from sparknlp.pretrained.pipeline.en import BasicPipeline

basic_data = BasicPipeline.annotate(data, "text")
basic_data.show()

+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|docID|              text|            document|               token|              normal|               lemma|                 pos|
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    1|To be or not to be|[[document, 0, 17...|[[token, 0, 1, To...|[[token, 0, 1, To...|[[token, 0, 1, To...|[[pos, 0, 1, TO, ...|
|    2|       This is it!|[[document, 0, 10...|[[token, 0, 3, Th...|[[token, 0, 3, Th...|[[token, 0, 3, Th...|[[pos, 0, 3, DT, ...|
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



We can also annotate a single string

In [6]:
# annotate quickly from a string
BasicPipeline().annotate("This world is made up of good and bad things")

{'lemma': ['This',
  'world',
  'be',
  'make',
  'up',
  'of',
  'good',
  'and',
  'bad',
  'thing'],
 'document': ['This world is made up of good and bad things'],
 'normal': ['This',
  'world',
  'is',
  'made',
  'up',
  'of',
  'good',
  'and',
  'bad',
  'things'],
 'pos': ['DT', 'NN', 'VBZ', 'VBN', 'RP', 'IN', 'JJ', 'CC', 'JJ', 'NNS'],
 'token': ['This',
  'world',
  'is',
  'made',
  'up',
  'of',
  'good',
  'and',
  'bad',
  'things']}
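
Since the result is a plain dictionary of lists, it can be post-processed without Spark. As a small sketch, assuming the lists line up one-to-one as they do in the output above, each token can be paired with its lemma and POS tag. The dictionary is copied from that output rather than produced by a live call.

```python
# Subset of the dictionary printed above, copied by hand for illustration.
result = {
    "token": ["This", "world", "is", "made", "up", "of", "good", "and", "bad", "things"],
    "lemma": ["This", "world", "be", "make", "up", "of", "good", "and", "bad", "thing"],
    "pos": ["DT", "NN", "VBZ", "VBN", "RP", "IN", "JJ", "CC", "JJ", "NNS"],
}

# The three lists have the same length here, so zip pairs them position by position.
rows = list(zip(result["token"], result["lemma"], result["pos"]))

for token, lemma, tag in rows:
    # Left-align token and lemma in fixed-width columns for readability.
    print(f"{token:<8}{lemma:<8}{tag}")
```

This positional pairing only holds when no stage drops or merges tokens, so it is worth checking the list lengths before relying on it.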

We clear the cache of the recently downloaded pipeline

In [7]:
# Test clearCache
downloader.clearCache("pipeline_basic", "en")

Alternatively, here we download a pipeline by its name and language

In [8]:
# download directly - pipeline models

# basic pipeline; its output has document, token, normal, lemma and pos columns
pipeline = downloader.downloadPipeline("pipeline_basic", "en")
pipeline.transform(data).show()

+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|docID|              text|            document|               token|              normal|               lemma|                 pos|
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    1|To be or not to be|[[document, 0, 17...|[[token, 0, 1, To...|[[token, 0, 1, To...|[[token, 0, 1, To...|[[pos, 0, 1, TO, ...|
|    2|       This is it!|[[document, 0, 10...|[[token, 0, 3, Th...|[[token, 0, 3, Th...|[[token, 0, 3, Th...|[[pos, 0, 3, DT, ...|
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



Now we download a POS model through the downloader's downloadModel method, which is an alternative way to retrieve it.
For the NER model we use the recommended pretrained() call instead.
Then we retrieve the Basic Pipeline and chain these models after it, so that each one receives the input columns it requires.

In [9]:
# download predefined - models

pos = downloader.downloadModel(PerceptronModel, "pos_fast", "en")    
pos.setInputCols(["document", "normal"]).setOutputCol("pos")

ner = NerCrfModel.pretrained()
ner.setInputCols(["pos", "normal", "document"]).setOutputCol("ner")

annotation_pipeline = BasicPipeline.pretrained()
annotation_data = annotation_pipeline.transform(data)
annotation_data.show()

pos_tagged = pos.transform(annotation_data)
ner_tagged = ner.transform(pos_tagged)
ner_tagged.show()

+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|docID|              text|            document|               token|              normal|               lemma|                 pos|
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    1|To be or not to be|[[document, 0, 17...|[[token, 0, 1, To...|[[token, 0, 1, To...|[[token, 0, 1, To...|[[pos, 0, 1, TO, ...|
|    2|       This is it!|[[document, 0, 10...|[[token, 0, 3, Th...|[[token, 0, 3, Th...|[[token, 0, 3, Th...|[[pos, 0, 3, DT, ...|
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|docID|              text|            document|       

Let's try a sentiment analysis pipeline

In [10]:
from sparknlp.pretrained.pipeline.en import SentimentPipeline

SentimentPipeline.annotate("This is a good movie!!!")

{'document': ['This is a good movie!!!'],
 'token': ['This', 'is', 'a', 'good', 'movie', '!', '!!'],
 'normal': ['This', 'is', 'a', 'good', 'movie'],
 'sentiment': ['positive']}
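
Note that the 'normal' list is shorter than the 'token' list. A quick plain-Python sketch, using the dictionary copied from the output above, shows which tokens did not survive normalization. Comparing by string value is a simplification that assumes normalization leaves the surviving tokens unchanged, which happens to hold for this output.

```python
# Token and normalized-token lists copied from the sentiment output above.
result = {
    "token": ["This", "is", "a", "good", "movie", "!", "!!"],
    "normal": ["This", "is", "a", "good", "movie"],
}

# Tokens whose text does not appear among the normalized tokens.
kept = set(result["normal"])
dropped = [token for token in result["token"] if token not in kept]

print(dropped)  # ['!', '!!']
```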