<a href="https://colab.research.google.com/github/Dirkster99/PyNotes/blob/master/PySpark_SparkNLP/SparkNLP_Advanced_Hellow_World.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spark NLP Advanced Hello World

This notebook is based on the YouTube video linked below and explains some basic and advanced concepts in Spark NLP.

Source: https://www.youtube.com/watch?v=fEU37G70SFc

In [1]:
# This is only needed to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

setup Colab for PySpark 3.0.2 and Spark NLP 3.1.0
Get:1 https://cloud.r-project.org/bin/li

In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

In [3]:
spark = sparknlp.start()
spark

In [4]:
data = spark.createDataFrame([['Peter is a good person living in Germny and writting an e-mail. Paula is also a good person. She lives in London.']]).toDF('text')

In [5]:
data.show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------+
|text                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------+
|Peter is a good person living in Germny and writting an e-mail. Paula is also a good person. She lives in London.|
+-----------------------------------------------------------------------------------------------------------------+



In [6]:
# DocumentAssembler controls cleaning up of source text and handling of special characters and new lines through 'cleanupMode'
document = DocumentAssembler().setInputCol('text').setOutputCol('document').setCleanupMode('shrink')
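
As a rough illustration of what a shrink-style cleanup does, the sketch below collapses runs of whitespace (spaces, tabs, new lines) into a single space using plain Python. The helper name `shrink_whitespace` is made up for this example, and this is not Spark NLP's implementation; the exact behaviour of `setCleanupMode('shrink')` should be checked against the Spark NLP documentation.

```python
import re


def shrink_whitespace(text):
    """Hypothetical helper: collapse every run of whitespace into one space."""
    return re.sub(r"\s+", " ", text)


# Sample input with a double space, a tab and a blank line
messy = "Peter  is a\tgood person.\n\nPaula is also a good person."
cleaned = shrink_whitespace(messy)
print(cleaned)
```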

In [7]:
# Sentence Detector splits text into sentences in a meaningful way
sentence = SentenceDetector().setInputCols(['document']).setOutputCol('sentence')

In [8]:
# setExplodeSentences(True) puts each sentence on its own DataFrame row,
# which improves parallelism when documents contain a lot of text
sentence.setExplodeSentences(True)

SentenceDetector_1c19aff676e2
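
To make the idea of "one sentence per row" concrete, here is a minimal pure-Python sketch. It assumes a naive rule (a sentence ends at the first `.`, `!` or `?`), which is far simpler than what `SentenceDetector` does, and the function name `split_sentences` is invented for this example. Each row carries inclusive begin and end character offsets into the original text.

```python
import re

TEXT = ('Peter is a good person living in Germny and writting an e-mail. '
        'Paula is also a good person. She lives in London.')


def split_sentences(text):
    """Naive splitter (assumption): a sentence ends at the first '.', '!' or '?'."""
    rows = []
    for match in re.finditer(r"\S[^.!?]*[.!?]", text):
        # match.end() is exclusive, so subtract 1 to get an inclusive end offset
        rows.append((match.start(), match.end() - 1, match.group()))
    return rows


rows = split_sentences(TEXT)
for begin, end, sentence in rows:
    print(begin, end, sentence)
```

A rule this simple would break on abbreviations such as "e.g." or on decimal numbers, which is one reason to use a dedicated annotator instead.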

In [9]:
# Tokenizer splits sentences into meaningful words for later NLP
# Always use the sentence output rather than the document if you have a SentenceDetector
tokenizer = Tokenizer().setInputCols(['sentence']).setOutputCol('token')

In [10]:
# tokenizer.setExceptions()   Configures tokens that we don't want to split
# tokenizer.setContextChars() Configures characters that are split off from token boundaries (e.g. punctuation)
# tokenizer.setSplitChars()   Configures splitting by certain characters
# tokenizer.setSplitPattern() Configures splitting by a certain pattern, etc.

tokenizer.setExceptions(['e-mail']) # Configures tokens that we don't want to split

Tokenizer_166e35b5e808
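
The effect of an exception list can be sketched without Spark. The toy tokenizer below is an assumption made for illustration (it splits on every punctuation character, which is not how the Spark NLP `Tokenizer` behaves by default), and the function name `tokenize` is hypothetical. It only shows the principle: protected strings are matched first, so they survive as a single token.

```python
import re


def tokenize(text, exceptions=()):
    """Toy tokenizer: exceptions are matched before words and punctuation."""
    # Longest exceptions first so that overlapping entries behave predictably
    protected = sorted(exceptions, key=len, reverse=True)
    alternatives = [re.escape(e) for e in protected] + [r"\w+", r"[^\w\s]"]
    pattern = re.compile("|".join(alternatives))
    return pattern.findall(text)


without_exceptions = tokenize("writting an e-mail.")
with_exceptions = tokenize("writting an e-mail.", exceptions=["e-mail"])
print(without_exceptions)
print(with_exceptions)
```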

In [11]:
# Spell checker SymmetricDelete or NorvigSweeting: Fix token typos
# pretrained() function on AnnotatorModels retrieves free Open Source pretrained models from the Internet
checker = NorvigSweetingModel.pretrained().setInputCols(['token']).setOutputCol('checked')

spellcheck_norvig download started this may take some time.
Approximate size to download 4.2 MB
[OK!]
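
The general idea behind a Norvig-style corrector can be sketched in a few lines: generate every string that is one edit away from the input and pick the known word with the highest frequency. The sketch below is not the `NorvigSweetingModel` implementation, and the word counts are invented for illustration. Note that both "writing" and "writhing" are one edit away from "writting", so in this sketch the result depends entirely on the frequencies in the dictionary.

```python
import string

# Hypothetical word frequencies, made up for this example
WORD_COUNTS = {"germany": 50, "german": 30, "writing": 40, "writhing": 5}


def edits1(word):
    """All strings that are one delete, transpose, replace or insert away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts=WORD_COUNTS):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word
    return max(candidates, key=counts.get)


print(correct("germny"))
print(correct("writting"))
```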


In [12]:
# NER requires Embeddings as input - you might get the below error if you provide no embeddings
# IllegalArgumentException: requirement failed: Wrong or missing inputCols annotators in NerDLModel_d4424c9af5f4.
#
# Current inputCols: sentence,checked. Dataset's columns:
# (column_name=text,is_nlp_annotator=false)
# (column_name=document,is_nlp_annotator=true,type=document)
# (column_name=sentence,is_nlp_annotator=true,type=document)
# (column_name=token,is_nlp_annotator=true,type=token)
# (column_name=checked,is_nlp_annotator=true,type=token).
# Make sure such annotators exist in your pipeline, with the right output names and that they have following annotator types: document, token, word_embeddings

# Embeddings are vector representations of tokens, e.g. BERT or WordEmbeddings (GloVe, gensim)
embeddings = WordEmbeddingsModel.pretrained().setInputCols(['sentence','checked']).setOutputCol('embeddings')

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
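
The error message quoted in the cell above boils down to a type check: every annotator type that a stage requires must be provided by one of its input columns. The following sketch reproduces that reasoning in plain Python using the column types listed in the error message. The function `missing_annotator_types` is hypothetical and only illustrates the check; it is not part of Spark NLP.

```python
# Column name -> annotator type, taken from the error message above
COLUMN_TYPES = {
    "document": "document",
    "sentence": "document",
    "token": "token",
    "checked": "token",
}

# Annotator types the NER stage asks for, according to the error message
NER_REQUIRED = ["document", "token", "word_embeddings"]


def missing_annotator_types(input_cols, required_types, column_types):
    """Return the required annotator types that no input column provides."""
    provided = {column_types.get(col) for col in input_cols}
    return [t for t in required_types if t not in provided]


# Without an embeddings column the check fails
print(missing_annotator_types(["sentence", "checked"], NER_REQUIRED, COLUMN_TYPES))

# After adding an embeddings stage the requirement is satisfied
with_embeddings = dict(COLUMN_TYPES, embeddings="word_embeddings")
print(missing_annotator_types(["sentence", "checked", "embeddings"],
                              NER_REQUIRED, with_embeddings))
```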


In [13]:
# Named Entity Recognition: Identifies entities in text, e.g. a Person or Location
# Let's use the TensorFlow-based NerDLModel for that
# The NER Model has 3 inputs:
# Embeddings    -> 'embeddings'
# Document Type -> 'sentence'
# Token Type    -> 'token' or 'checked'
ner = NerDLModel.pretrained().setInputCols(['sentence', 'checked', 'embeddings']).setOutputCol('ner')

ner_dl download started this may take some time.
Approximate size to download 13.6 MB
[OK!]


In [14]:
# Chunk builder, NerConverter: Reads NER input and builds chunks based on labelled data
converter = NerConverter().setInputCols(['sentence', 'checked', 'ner']).setOutputCol('chunk')

In [15]:
# Pipeline is a component from Apache Spark ML for building multi-stage machine learning processes
from pyspark.ml import Pipeline


In [16]:
# A simple pipeline, which acts as an estimator.
# A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer.
# When Pipeline.fit() is called, the stages are executed in order. If a stage is an Estimator, its Estimator.fit() method will be called on the input dataset to fit a model.
# Then the model, which is a transformer, will be used to transform the dataset as the input to the next stage.
# If a stage is a Transformer, its Transformer.transform() method will be called to produce the dataset for the next stage.
# The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers, corresponding to the pipeline stages.
# If stages is an empty list, the pipeline acts as an identity transformer.
pipeline = Pipeline().setStages([document, sentence, tokenizer, checker, embeddings, ner, converter])
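
The fit/transform contract described in the comments above can be mimicked with a few toy classes. This is a simplified sketch that works on plain Python lists instead of DataFrames; all class names (`ToyPipeline`, `VocabularyEstimator`, and so on) are invented for illustration and are not part of Spark ML.

```python
class Lowercase:
    """Transformer: has transform() only."""
    def transform(self, rows):
        return [row.lower() for row in rows]


class VocabularyModel:
    """Fitted model: maps known words to their index in the vocabulary."""
    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        index = {word: i for i, word in enumerate(self.vocab)}
        return [[index[w] for w in row.split() if w in index] for row in rows]


class VocabularyEstimator:
    """Estimator: fit() learns a vocabulary and returns a fitted model."""
    def fit(self, rows):
        vocab = sorted({word for row in rows for word in row.split()})
        return VocabularyModel(vocab)


class ToyPipelineModel:
    """Result of ToyPipeline.fit(): a chain of transformers."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: fit it first
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the output to the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


model = ToyPipeline([Lowercase(), VocabularyEstimator()]).fit(["She lives in London"])
encoded = model.transform(["London lives"])
print(encoded)
```

With an empty stage list, `ToyPipeline([]).fit(rows).transform(rows)` returns the rows unchanged, which mirrors the identity behaviour mentioned above.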

In [17]:
# fit() is normally needed to train estimator stages. All stages here are already
# pretrained, but the Pipeline API still requires fit() to obtain a PipelineModel.
#
# The DataFrame passed here is therefore irrelevant in our case; we could also pass
# an empty DataFrame since no training takes place here
#
# pipeline.fit() returns a PipelineModel object
model = pipeline.fit(data)

In [18]:
# transform() on a PipelineModel applies all pipeline stages to the data
result = model.transform(data)

In [19]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|             checked|          embeddings|                 ner|               chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Peter is a good p...|[[document, 0, 11...|[[document, 0, 62...|[[token, 0, 4, Pe...|[[token, 0, 4, Pe...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 4, Pe...|
|Peter is a good p...|[[document, 0, 11...|[[document, 64, 9...|[[token, 64, 68, ...|[[token, 64, 68, ...|[[word_embeddings...|[[named_entity, 6...|[[chunk, 64, 68, ...|
|Peter is a good p...|[[document, 0, 11...|[[document, 93, 1...|[[token, 93, 95, ...|[[token, 93, 95, ...|[[word_embeddings...|[[named_entity, 9...|[[

In [20]:
result.select('sentence.result').show(truncate=False)

+-----------------------------------------------------------------+
|result                                                           |
+-----------------------------------------------------------------+
|[Peter is a good person living in Germny and writting an e-mail.]|
|[Paula is also a good person.]                                   |
|[She lives in London.]                                           |
+-----------------------------------------------------------------+



In [21]:
result.select('checked.result').show(truncate=False)

+-----------------------------------------------------------------------------+
|result                                                                       |
+-----------------------------------------------------------------------------+
|[Peter, is, a, good, person, living, in, Germany, and, writhing, an, e-mail.]|
|[Paula, is, also, a, good, person, .]                                        |
|[She, lives, in, London, .]                                                  |
+-----------------------------------------------------------------------------+



In [22]:
# Shows one NER label for each token
result.select('ner.result', 'checked.result').show(truncate=False)

+--------------------------------------------+-----------------------------------------------------------------------------+
|result                                      |result                                                                       |
+--------------------------------------------+-----------------------------------------------------------------------------+
|[B-PER, O, O, O, O, O, O, B-LOC, O, O, O, O]|[Peter, is, a, good, person, living, in, Germany, and, writhing, an, e-mail.]|
|[B-PER, O, O, O, O, O, O]                   |[Paula, is, also, a, good, person, .]                                        |
|[O, O, O, B-LOC, O]                         |[She, lives, in, London, .]                                                  |
+--------------------------------------------+-----------------------------------------------------------------------------+



In [23]:
# Get the begin and end character offsets of each token (both are inclusive)
result.select('ner.begin', 'ner.end').show(truncate=False)

+---------------------------------------------+---------------------------------------------+
|begin                                        |end                                          |
+---------------------------------------------+---------------------------------------------+
|[0, 6, 9, 11, 16, 23, 30, 33, 40, 44, 53, 56]|[4, 7, 9, 14, 21, 28, 31, 38, 42, 51, 54, 62]|
|[64, 70, 73, 78, 80, 85, 91]                 |[68, 71, 76, 78, 83, 90, 91]                 |
|[93, 97, 103, 106, 112]                      |[95, 101, 104, 111, 112]                     |
+---------------------------------------------+---------------------------------------------+
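
The offsets in this output are character positions in the original text, and the end offset is inclusive. The short sketch below uses the third row of the table above to slice the tokens back out of the input string; in Python slicing that means `text[begin:end + 1]`.

```python
TEXT = ('Peter is a good person living in Germny and writting an e-mail. '
        'Paula is also a good person. She lives in London.')

# Third row of the output above
begins = [93, 97, 103, 106, 112]
ends = [95, 101, 104, 111, 112]

# The end offset is inclusive, so add 1 for Python's exclusive slice end
tokens = [TEXT[b:e + 1] for b, e in zip(begins, ends)]
print(tokens)
```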



In [24]:
# NerConverter chunk type output (based on NER)
result.select('chunk.result','chunk.begin', 'chunk.end').show(truncate=False)

+---------------+-------+-------+
|result         |begin  |end    |
+---------------+-------+-------+
|[Peter, Germny]|[0, 33]|[4, 38]|
|[Paula]        |[64]   |[68]   |
|[London]       |[106]  |[111]  |
+---------------+-------+-------+
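
How per-token labels turn into chunks can be sketched in plain Python. The labels follow the common B-/I-/O scheme: `B-` starts an entity, `I-` continues it, and `O` is outside any entity. The function `bio_to_chunks` below is a hypothetical helper, not the `NerConverter` implementation. It cuts the chunk text out of the original input by offset, which would be consistent with the chunk column above showing 'Germny' rather than the spell-checked 'Germany'.

```python
TEXT = ('Peter is a good person living in Germny and writting an e-mail. '
        'Paula is also a good person. She lives in London.')

# First row of the NER and offset outputs shown earlier
labels = ['B-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'O']
begins = [0, 6, 9, 11, 16, 23, 30, 33, 40, 44, 53, 56]
ends = [4, 7, 9, 14, 21, 28, 31, 38, 42, 51, 54, 62]


def bio_to_chunks(text, labels, begins, ends):
    """Group B-/I- labelled tokens into (chunk_text, entity, begin, end) tuples."""
    spans = []
    current = None  # [entity, begin, end] of the chunk being built
    for label, begin, end in zip(labels, begins, ends):
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = [label[2:], begin, end]
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[2] = end  # extend the running chunk
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    # End offsets are inclusive, hence end + 1 when slicing
    return [(text[b:e + 1], entity, b, e) for entity, b, e in spans]


chunks = bio_to_chunks(TEXT, labels, begins, ends)
print(chunks)

# A multi-token entity: 'I-LOC' extends the chunk started by 'B-LOC'
print(bio_to_chunks("New York", ["B-LOC", "I-LOC"], [0, 4], [2, 7]))
```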



In [25]:
# LightPipelines are faster pipelines for small amounts of data on a single machine (they remove the Spark DataFrame overhead)
light = LightPipeline(model)

In [26]:
# annotate() works really fast with a string or a list of strings and returns a dictionary of results
resultDict = light.annotate('Bruno is living in Italy, and he is doing well.')
resultDict

{'checked': ['Bruno',
  'is',
  'living',
  'in',
  'Italy',
  ',',
  'and',
  'he',
  'is',
  'doing',
  'well',
  '.'],
 'chunk': ['Bruno', 'Italy'],
 'document': ['Bruno is living in Italy, and he is doing well.'],
 'embeddings': ['Bruno',
  'is',
  'living',
  'in',
  'Italy',
  ',',
  'and',
  'he',
  'is',
  'doing',
  'well',
  '.'],
 'ner': ['B-PER', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
 'sentence': ['Bruno is living in Italy, and he is doing well.'],
 'token': ['Bruno',
  'is',
  'living',
  'in',
  'Italy',
  ',',
  'and',
  'he',
  'is',
  'doing',
  'well',
  '.']}
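
Because `annotate()` returns plain Python lists, the result can be post-processed without Spark. As a small sketch, the snippet below pairs each token with its NER label and keeps only the tokens that belong to an entity. The dictionary is copied from the output above (only the two keys that are needed); the variable names are chosen for this example.

```python
# Subset of the dictionary returned by light.annotate() above
result_dict = {
    'token': ['Bruno', 'is', 'living', 'in', 'Italy', ',', 'and', 'he',
              'is', 'doing', 'well', '.'],
    'ner': ['B-PER', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],
}

# Keep only tokens whose label is not 'O' (outside any entity)
entities = [(token, label)
            for token, label in zip(result_dict['token'], result_dict['ner'])
            if label != 'O']
print(entities)
```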