<a href="https://colab.research.google.com/github/Dirkster99/PyNotes/blob/master/PySpark_SparkNLP/SparkNLP_Advanced_Hellow_World_without_ModelDownloads.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spark NLP Advanced Hello World

This notebook is based on the following YouTube video and explains some basic and advanced concepts in Spark NLP.

Source: https://www.youtube.com/watch?v=fEU37G70SFc

**This is a simplified version of the full video to demonstrate usage without downloading a pre-trained model.** 

In [1]:
# This is only needed to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2021-06-08 20:32:13--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2021-06-08 20:32:13--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1608 (1.6K) [text/plain]
Saving to: ‘STDOUT’


2021-06-08 20:32:13 (35.1 MB/s) - written to stdout [1608/1608]

setup Colab for PySpark 3.0.2 and Spark NLP 3.1.0
Get:1 http://security.ubuntu.com/ubuntu 

In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

In [3]:
spark = sparknlp.start()
spark

In [4]:
data = spark.createDataFrame([['Peter is a good person living in Germny and writting an e-mail. Paula is also a good person. She lives in London.']]).toDF('text')

In [5]:
data.show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------+
|text                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------+
|Peter is a good person living in Germny and writting an e-mail. Paula is also a good person. She lives in London.|
+-----------------------------------------------------------------------------------------------------------------+



In [6]:
# DocumentAssembler turns raw text into 'document' annotations; 'cleanupMode' controls how the source text is cleaned up (special characters, new lines)
document = DocumentAssembler().setInputCol('text').setOutputCol('document').setCleanupMode('shrink')
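
As a rough illustration of what the `shrink` cleanup mode is meant to do (remove new lines and tabs and merge runs of whitespace into a single space), here is a plain-Python approximation. The helper `shrink` is hypothetical and is not part of Spark NLP; the library's actual behavior may differ in edge cases.

```python
import re

def shrink(text: str) -> str:
    """Approximate the 'shrink' cleanup mode (assumption, not Spark NLP code):
    collapse every run of whitespace (spaces, tabs, new lines) into one space
    and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

raw = "Peter is a good person.\n\n\tPaula   is also a good person."
print(shrink(raw))
```

The cleaned text is what the later stages (sentence detection, tokenization) would see, so character offsets refer to the cleaned string rather than the raw one.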

In [7]:
# Sentence Detector splits text into sentences in a meaningful way
sentence = SentenceDetector().setInputCols(['document']).setOutputCol('sentence')

In [8]:
# setExplodeSentences(True) puts each sentence on its own DataFrame row,
# which improves parallelism when individual documents contain a lot of text
sentence.setExplodeSentences(True)

SentenceDetector_4492d8de354f

In [9]:
# Tokenizer splits sentences into meaningful words for later NLP
# Always use the sentence output rather than the document if you have a SentenceDetector
tokenizer = Tokenizer().setInputCols(['sentence']).setOutputCol('token')

In [10]:
# tokenizer.setExceptions()   Configures tokens that we don't want to split
# tokenizer.setContextChars() Configures characters we want to remove from our tokens
# tokenizer.setSplitChars()   Configures splitting by certain characters
# tokenizer.setSplitPattern() Configures splitting by a certain pattern, etc.

tokenizer.setExceptions(['e-mail']) # Configures tokens that we don't want to split

Tokenizer_c5fd3371bb9b
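
To make the idea of tokenizer exceptions concrete, here is a toy tokenizer in plain Python. It is only a sketch: `toy_tokenize` and its parameters are hypothetical, and the choice to split on hyphens by default is an assumption made for illustration, not a statement about Spark NLP's default rules.

```python
def toy_tokenize(sentence, exceptions=(), split_chars=("-",),
                 context_chars=(".", ",", "!", "?")):
    """Toy tokenizer (not Spark NLP's algorithm): splits on whitespace,
    separates trailing context characters, and splits on split_chars
    unless the word is listed in exceptions."""
    tokens = []
    for word in sentence.split():
        # Peel trailing context characters (e.g. the final period) off the word
        trailing = []
        while word and word[-1] in context_chars:
            trailing.insert(0, word[-1])
            word = word[:-1]
        if word in exceptions:
            tokens.append(word)  # protected: kept as a single token
        else:
            parts = [word]
            for ch in split_chars:
                parts = [p for part in parts for p in part.split(ch) if p]
            tokens.extend(parts)
        tokens.extend(trailing)
    return tokens

print(toy_tokenize("writting an e-mail."))
print(toy_tokenize("writting an e-mail.", exceptions=["e-mail"]))
```

In this toy version the first call breaks `e-mail` into two tokens, while the second call keeps it whole because it is listed as an exception.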

In [11]:
# Pipeline is a component from Apache Spark ML for building multi-stage machine learning workflows
from pyspark.ml import Pipeline


In [12]:
# A simple pipeline, which acts as an estimator.
# A Pipeline consists of a sequence of stages, each of which is either an Estimator or a Transformer.
# When Pipeline.fit() is called, the stages are executed in order. If a stage is an Estimator, its Estimator.fit() method will be called on the input dataset to fit a model.
# Then the model, which is a transformer, will be used to transform the dataset as the input to the next stage.
# If a stage is a Transformer, its Transformer.transform() method will be called to produce the dataset for the next stage.
# The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers, corresponding to the pipeline stages.
# If stages is an empty list, the pipeline acts as an identity transformer.
pipeline = Pipeline().setStages([document, sentence, tokenizer])
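
The fit/transform protocol described in the comments above can be sketched without Spark. The classes below (`ToyPipeline`, `ToyPipelineModel`, `Lowercase`, `VocabularyFilter`) are hypothetical stand-ins operating on plain lists of strings; they only mimic the control flow and are not the Spark ML implementation.

```python
class Transformer:
    def transform(self, dataset):
        raise NotImplementedError

class Estimator:
    def fit(self, dataset):
        raise NotImplementedError

class Lowercase(Transformer):
    """A stage that needs no training."""
    def transform(self, dataset):
        return [s.lower() for s in dataset]

class VocabularyFilterModel(Transformer):
    """Fitted model: keeps only the words seen during fit()."""
    def __init__(self, vocab):
        self.vocab = vocab
    def transform(self, dataset):
        return [" ".join(w for w in s.split() if w in self.vocab) for s in dataset]

class VocabularyFilter(Estimator):
    """A stage that learns a vocabulary from the data it is fitted on."""
    def fit(self, dataset):
        return VocabularyFilterModel({w for s in dataset for w in s.split()})

class ToyPipelineModel(Transformer):
    def __init__(self, stages):
        self.stages = stages
    def transform(self, dataset):
        for stage in self.stages:  # an empty list leaves the data unchanged
            dataset = stage.transform(dataset)
        return dataset

class ToyPipeline(Estimator):
    def __init__(self, stages):
        self.stages = stages
    def fit(self, dataset):
        fitted, current = [], dataset
        for stage in self.stages:
            # Estimators are fitted first; transformers are used as they are
            model = stage.fit(current) if isinstance(stage, Estimator) else stage
            fitted.append(model)
            current = model.transform(current)  # input for the next stage
        return ToyPipelineModel(fitted)

model = ToyPipeline([Lowercase(), VocabularyFilter()]).fit(["Peter is a good person"])
print(model.transform(["Paula IS also a GOOD person"]))
```

The fitted object only contains transformers, which is why it can be applied to new data without calling `fit()` again.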

In [13]:
# fit() is normally needed only for training, but the Pipeline API still requires it
# even if all stages are already trained.
#
# The DataFrame passed here is irrelevant in our case; we could also pass
# an empty DataFrame, since no training takes place here.
#
# pipeline.fit() returns a PipelineModel object
model = pipeline.fit(data)

In [14]:
# transform() on a PipelineModel applies the pipeline's transformations to the data
result = model.transform(data)

In [15]:
result.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|
+--------------------+--------------------+--------------------+--------------------+
|Peter is a good p...|[[document, 0, 11...|[[document, 0, 62...|[[token, 0, 4, Pe...|
|Peter is a good p...|[[document, 0, 11...|[[document, 64, 9...|[[token, 64, 68, ...|
|Peter is a good p...|[[document, 0, 11...|[[document, 93, 1...|[[token, 93, 95, ...|
+--------------------+--------------------+--------------------+--------------------+
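
Each annotation in the table above carries a begin and an end character offset into the text (both inclusive), for example `0` and `62` for the first sentence. The sketch below recomputes these offsets in plain Python with a naive regular expression. The regex split is an assumption for illustration only and is not the algorithm used by `SentenceDetector`.

```python
import re

text = ('Peter is a good person living in Germny and writting an e-mail. '
        'Paula is also a good person. She lives in London.')

# Naive rule: a sentence starts at a non-space character and runs
# up to and including the next '.', '!' or '?'
offsets = []
for match in re.finditer(r"\S[^.!?]*[.!?]", text):
    begin, end = match.start(), match.end() - 1  # end offset is inclusive
    offsets.append((begin, end))
    print(begin, end, text[begin:end + 1])
```

The offsets agree with the truncated values visible in the `sentence` column (`0, 62`, `64, 9...`, `93, 1...`).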



In [16]:
result.select('sentence.result').show(truncate=False)

+-----------------------------------------------------------------+
|result                                                           |
+-----------------------------------------------------------------+
|[Peter is a good person living in Germny and writting an e-mail.]|
|[Paula is also a good person.]                                   |
|[She lives in London.]                                           |
+-----------------------------------------------------------------+



In [17]:
# A LightPipeline is a faster pipeline for small amounts of data on a single machine (it avoids the Spark DataFrame overhead)
light = LightPipeline(model)

In [18]:
# annotate() works really fast with a string or a list of strings and returns a dictionary of results (one per input string)
resultDict = light.annotate('Bruno is living in Italy, and he is doing well.')
resultDict

{'document': ['Bruno is living in Italy, and he is doing well.'],
 'sentence': ['Bruno is living in Italy, and he is doing well.'],
 'token': ['Bruno',
  'is',
  'living',
  'in',
  'Italy',
  ',',
  'and',
  'he',
  'is',
  'doing',
  'well',
  '.']}
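
Because the result is an ordinary Python dictionary that maps output column names to lists of strings, it can be post-processed without Spark. The sketch below copies the dictionary shown above and drops punctuation-only tokens; the names `result_dict` and `words` are illustrative.

```python
import string

# Copied from the annotate() output shown above
result_dict = {
    'document': ['Bruno is living in Italy, and he is doing well.'],
    'sentence': ['Bruno is living in Italy, and he is doing well.'],
    'token': ['Bruno', 'is', 'living', 'in', 'Italy', ',',
              'and', 'he', 'is', 'doing', 'well', '.'],
}

# Keep tokens that contain at least one non-punctuation character
words = [t for t in result_dict['token']
         if not all(ch in string.punctuation for ch in t)]

print(len(result_dict['token']), len(words))
print(words)
```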