![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# **PySpark Tutorial-8 PySpark Specifics for Spark NLP**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/PySpark/8.PySpark_Specifics_for_Spark_NLP.ipynb)

This notebook walks through a few core Spark NLP annotators and shows how to use them from PySpark.




### Install PySpark

In [4]:
# install PySpark
! pip install -q pyspark==3.2.0 spark-nlp

### Initializing Spark

In [5]:
import sparknlp

spark = sparknlp.start(spark32=True)
# params =>> gpu=False, spark23=False (start with spark 2.3)

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 3.4.1
Apache Spark version: 3.2.0


In [6]:
#  DO NOT FORGET WHEN YOU'RE DONE => spark.stop()

# Annotators and Transformer Concepts

In Spark NLP, every annotator is either an Estimator or a Transformer, just as in Spark ML. An Estimator in Spark ML is an algorithm that can be fit on a DataFrame to produce a Transformer; for example, a learning algorithm is an Estimator that trains on a DataFrame and produces a model. A Transformer is an algorithm that transforms one DataFrame into another; for example, an ML model is a Transformer that turns a DataFrame with features into a DataFrame with predictions.

Spark NLP has two types of annotators: AnnotatorApproach and AnnotatorModel. AnnotatorApproach extends Spark ML Estimators, which are trained through fit(), and AnnotatorModel extends Transformers, which transform DataFrames through transform().

Some Spark NLP annotators carry a Model suffix and some do not. The suffix is used when the annotator is the result of a training process. Some annotators, such as SentenceDetector, are transformers but do not carry the Model suffix because they are not trained annotators. Model annotators expose pretrained() on their static object to retrieve the public pretrained version of a model.

Long story short: if it trains on a DataFrame and produces a model, it is an AnnotatorApproach; if it transforms one DataFrame into another through some model, it is an AnnotatorModel (e.g. WordEmbeddingsModel); and it does not take the Model suffix if it does not rely on a trained model while transforming a DataFrame (e.g. SentenceDetector).
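The fit/transform contract described above can be pictured without Spark at all. The following is a minimal plain-Python sketch: `ToyVocabApproach` and `ToyVocabModel` are hypothetical names invented for illustration and are not part of Spark NLP or Spark ML. They only mimic the roles of an AnnotatorApproach and an AnnotatorModel.

```python
class ToyVocabModel:
    """Plays the role of an AnnotatorModel: it only transforms data."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        # Keep only the words that were seen during training.
        return [[w for w in row.split() if w in self.vocabulary] for row in rows]


class ToyVocabApproach:
    """Plays the role of an AnnotatorApproach: fit() learns from data and returns a model."""

    def fit(self, rows):
        vocabulary = {w for row in rows for w in row.split()}
        return ToyVocabModel(vocabulary)


train_rows = ["Peter is good", "John is happy"]
model = ToyVocabApproach().fit(train_rows)       # Approach -> Model
print(model.transform(["Peter is very happy"]))  # [['Peter', 'is', 'happy']]
```

The approach object holds no learned state of its own; everything learned during `fit()` lives in the returned model, which is why only the model can call `transform()`.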

In [7]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/PySpark/data/sample-sentences-en.txt

In [8]:
with open('./sample-sentences-en.txt') as f:
  print (f.read())

Peter is a very good person.
My life in Russia is very interesting.
John and Peter are brothers. However they don't support each other that much.
Lucas Nogal Dunbercker is no longer happy. He has a good car though.
Europe is very culture rich. There are huge churches! and big houses!


In [9]:
spark_df = spark.read.text('./sample-sentences-en.txt').toDF('text')

spark_df.show(truncate=False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Peter is a very good person.                                                 |
|My life in Russia is very interesting.                                       |
|John and Peter are brothers. However they don't support each other that much.|
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |
|Europe is very culture rich. There are huge churches! and big houses!        |
+-----------------------------------------------------------------------------+



In [10]:
spark_df.printSchema()

root
 |-- text: string (nullable = true)



## Document Assembler

Every Spark NLP pipeline starts by converting raw text into the Document annotation type.

DocumentAssembler() is a special transformer that does this for us; it creates the first annotation of type Document, which annotators further down the pipeline can consume.

DocumentAssembler() comes from the sparknlp.base module and has several settable parameters, such as the input column, output column, and cleanup mode used below. The full parameter list and the source code are available in the Spark NLP documentation and repository.

In [11]:
from sparknlp.base import *

documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")\
      .setCleanupMode("shrink")

doc_df = documentAssembler.transform(spark_df)

doc_df.show(truncate=30)

+------------------------------+------------------------------+
|                          text|                      document|
+------------------------------+------------------------------+
|  Peter is a very good person.|[{document, 0, 27, Peter is...|
|My life in Russia is very i...|[{document, 0, 37, My life ...|
|John and Peter are brothers...|[{document, 0, 76, John and...|
|Lucas Nogal Dunbercker is n...|[{document, 0, 67, Lucas No...|
|Europe is very culture rich...|[{document, 0, 68, Europe i...|
+------------------------------+------------------------------+



In [12]:
doc_df.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)



In [13]:
doc_df.select('document.result','document.begin','document.end').show(truncate=False)

+-------------------------------------------------------------------------------+-----+----+
|result                                                                         |begin|end |
+-------------------------------------------------------------------------------+-----+----+
|[Peter is a very good person.]                                                 |[0]  |[27]|
|[My life in Russia is very interesting.]                                       |[0]  |[37]|
|[John and Peter are brothers. However they don't support each other that much.]|[0]  |[76]|
|[Lucas Nogal Dunbercker is no longer happy. He has a good car though.]         |[0]  |[67]|
|[Europe is very culture rich. There are huge churches! and big houses!]        |[0]  |[68]|
+-------------------------------------------------------------------------------+-----+----+



In [14]:
doc_df.select("document.result").take(1)

[Row(result=['Peter is a very good person.'])]
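To make the schema above concrete, here is a plain-Python sketch of the single annotation that appears in the `document` column for the first row. The helper `make_document_annotation` is hypothetical and only mirrors the field names from `printSchema()`; the real annotations are produced by Spark NLP itself. The metadata is left empty because the outputs above do not display it.

```python
def make_document_annotation(text):
    """Build a dict shaped like one element of the `document` column."""
    return {
        "annotatorType": "document",
        "begin": 0,            # offset of the first character
        "end": len(text) - 1,  # offset of the last character (inclusive)
        "result": text,
        "metadata": {},        # left empty here; not shown in the outputs above
        "embeddings": [],
    }


annotation = make_document_annotation("Peter is a very good person.")
print(annotation["begin"], annotation["end"])  # 0 27
```

Note that `end` is inclusive, which is why a 28-character sentence ends at offset 27.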

## Sentence Detector
Finds sentence bounds in raw text. The main parameters are:

`setCustomBounds(StringArray)`: Custom sentence separator strings, e.g. `["\n"]`

`setUseCustomOnly(bool)`: Use only custom bounds without considering those of Pragmatic Segmenter. Defaults to false. Needs customBounds.

`setUseAbbreviations(bool)`: Whether to consider abbreviation strategies for better accuracy but slower performance. Defaults to true.

`setExplodeSentences(bool)`: Whether to split sentences into different Dataset rows. Useful for higher parallelism in fat rows. Defaults to false.

In [15]:
from sparknlp.annotator import *

# we feed the document column coming from Document Assembler

sentenceDetector = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentences')

In [16]:
sentenceDetector.extractParamMap()

{Param(parent='SentenceDetector_4a1dfd3e9e88', name='customBounds', doc='characters used to explicitly mark sentence bounds'): [],
 Param(parent='SentenceDetector_4a1dfd3e9e88', name='detectLists', doc='whether detect lists during sentence detection'): True,
 Param(parent='SentenceDetector_4a1dfd3e9e88', name='explodeSentences', doc='whether to explode each sentence into a different row, for better parallelization. Defaults to false.'): False,
 Param(parent='SentenceDetector_4a1dfd3e9e88', name='inputCols', doc='previous annotations columns, if renamed'): ['document'],
 Param(parent='SentenceDetector_4a1dfd3e9e88', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='SentenceDetector_4a1dfd3e9e88', name='maxLength', doc='Set the maximum allowed length for each sentence'): 99999,
 Param(parent='SentenceDetector_4a1dfd3e9e88', name='minLength', doc='Set the minimum allowed length for each sentence.'): 0,
 Param(parent='Sentenc

In [17]:
sent_df = sentenceDetector.transform(doc_df)

sent_df.show(truncate=False)

+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                         |document                                                                                                               |sentences                                                                                                                                                                                          |
+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------+---------

In [18]:
sent_df.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentences: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = tru

In [19]:
sent_df.select('sentences').take(5)

[Row(sentences=[Row(annotatorType='document', begin=0, end=27, result='Peter is a very good person.', metadata={'sentence': '0'}, embeddings=[])]),
 Row(sentences=[Row(annotatorType='document', begin=0, end=37, result='My life in Russia is very interesting.', metadata={'sentence': '0'}, embeddings=[])]),
 Row(sentences=[Row(annotatorType='document', begin=0, end=27, result='John and Peter are brothers.', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='document', begin=29, end=76, result="However they don't support each other that much.", metadata={'sentence': '1'}, embeddings=[])]),
 Row(sentences=[Row(annotatorType='document', begin=0, end=41, result='Lucas Nogal Dunbercker is no longer happy.', metadata={'sentence': '0'}, embeddings=[]), Row(annotatorType='document', begin=43, end=67, result='He has a good car though.', metadata={'sentence': '1'}, embeddings=[])]),
 Row(sentences=[Row(annotatorType='document', begin=0, end=27, result='Europe is very culture rich.', met

In [20]:
sent_df.select('sentences.result').take(5)

[Row(result=['Peter is a very good person.']),
 Row(result=['My life in Russia is very interesting.']),
 Row(result=['John and Peter are brothers.', "However they don't support each other that much."]),
 Row(result=['Lucas Nogal Dunbercker is no longer happy.', 'He has a good car though.']),
 Row(result=['Europe is very culture rich.', 'There are huge churches!', 'and big houses!'])]
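The `begin` and `end` values of the sentence annotations are character offsets into the original document. The sketch below recomputes them in plain Python for the third row and then shows, with a flat list, what one row per sentence would look like when `explodeSentences` is enabled. `sentence_offsets` is a hypothetical helper written for this illustration; it is not how Spark NLP detects sentences.

```python
def sentence_offsets(document, sentences):
    """Return (begin, end) character offsets, end inclusive, for each sentence."""
    offsets, cursor = [], 0
    for sentence in sentences:
        begin = document.index(sentence, cursor)
        end = begin + len(sentence) - 1
        offsets.append((begin, end))
        cursor = end + 1
    return offsets


document = "John and Peter are brothers. However they don't support each other that much."
sentences = [
    "John and Peter are brothers.",
    "However they don't support each other that much.",
]
print(sentence_offsets(document, sentences))  # [(0, 27), (29, 76)]

# Without exploding, a row holds an array of sentences; exploding yields
# one entry per sentence instead.
rows = [["Peter is a very good person."], sentences]
exploded = [sentence for row in rows for sentence in row]
print(len(rows), len(exploded))  # 2 3
```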

## Tokenizer

Identifies tokens with tokenization open standards. It is an **Annotator Approach, so it requires .fit()**.

A few rules help customize it when the defaults do not fit your needs.

`setExceptions(StringArray)`: List of tokens to leave unaltered. Allows composite tokens, such as two-word tokens, that the user may not want to split.

`addException(String)`: Add a single exception

`setExceptionsPath(String)`: Path to a txt file with a list of token exceptions

`caseSensitiveExceptions(bool)`: Whether exceptions are matched case-sensitively in the text

`contextChars(StringArray)`: List of single-character strings to strip from tokens, such as parentheses or question marks. Ignored if using prefix, infix or suffix patterns.

`minLength`: Set the minimum allowed length for each token

`maxLength`: Set the maximum allowed length for each token
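The effect of exceptions and length limits can be pictured with a simplified stand-in tokenizer. This is a plain-Python regex sketch, not the rule set Spark NLP uses: `toy_tokenize` is a hypothetical helper, and it assumes that tokens outside the allowed length range are simply dropped.

```python
import re


def toy_tokenize(text, exceptions=(), min_length=0, max_length=None):
    """Simplified tokenizer: words, hyphenated words, and single punctuation marks."""
    # Exceptions are tried first so multi-word entries survive as one token.
    protected = [re.escape(e) for e in exceptions]
    pattern = "|".join(protected + [r"\w+(?:-\w+)*", r"[^\w\s]"])
    tokens = re.findall(pattern, text)
    # Assumption: tokens shorter or longer than the limits are discarded.
    return [
        t for t in tokens
        if len(t) >= min_length and (max_length is None or len(t) <= max_length)
    ]


text = "Peter Parker (Spiderman) lives in New York but has no e-mail!"

# "New York" stays together because it is listed as an exception.
print(toy_tokenize(text, exceptions=["New York"]))

# A minimum length of 3 removes punctuation and short words.
print(toy_tokenize(text, exceptions=["New York"], min_length=3))
```

Listing an exception keeps the composite token intact, while the length limits act as a filter on whatever tokens were produced.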

In [21]:
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

In [22]:
text = 'Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'

spark_df = spark.createDataFrame([[text]]).toDF("text")

In [23]:
doc_df = documentAssembler.transform(spark_df)

token_df = tokenizer.fit(doc_df).transform(doc_df)

token_df.show(truncate=100)

+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                           text|                                                                                            document|                                                                                               token|
+-------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!|[{document, 0, 78, Peter Parker (Spiderman) is a nice guy and lives

In [24]:
token_df.select('token.result').take(1)

[Row(result=['Peter', 'Parker', '(', 'Spiderman', ')', 'is', 'a', 'nice', 'guy', 'and', 'lives', 'in', 'New', 'York', 'but', 'has', 'no', 'e-mail', '!'])]

## Perceptron Model

POS - Part of speech tags

An averaged perceptron model that assigns a part-of-speech (POS) tag to each word within a sentence.

This is the instantiated model of the PerceptronApproach. For training your own model, please see the documentation of that class.

In [25]:
pos = PerceptronModel.pretrained()\
    .setInputCols(['document', 'token'])\
    .setOutputCol('pos')

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]


## Creating Pipeline

In [27]:
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pos = PerceptronModel.pretrained()\
    .setInputCols(['document', 'token'])\
    .setOutputCol('pos')

pipeline = Pipeline().setStages([
    documentAssembler, 
    tokenizer, 
    pos])

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[OK!]


In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/PySpark/data/sample-sentences-en.txt

In [57]:
data = spark.read.text('./sample-sentences-en.txt').toDF('text')

data.show(truncate = False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Peter is a very good person.                                                 |
|My life in Russia is very interesting.                                       |
|John and Peter are brothers. However they don't support each other that much.|
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |
|Europe is very culture rich. There are huge churches! and big houses!        |
+-----------------------------------------------------------------------------+



In [30]:
model = pipeline.fit(data)

In [31]:
result = model.transform(data)

In [32]:
result.show(5)

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|                 pos|
+--------------------+--------------------+--------------------+--------------------+
|Peter is a very g...|[{document, 0, 27...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|
|My life in Russia...|[{document, 0, 37...|[{token, 0, 1, My...|[{pos, 0, 1, PRP$...|
|John and Peter ar...|[{document, 0, 76...|[{token, 0, 3, Jo...|[{pos, 0, 3, NNP,...|
|Lucas Nogal Dunbe...|[{document, 0, 67...|[{token, 0, 4, Lu...|[{pos, 0, 4, NNP,...|
|Europe is very cu...|[{document, 0, 68...|[{token, 0, 5, Eu...|[{pos, 0, 5, NNP,...|
+--------------------+--------------------+--------------------+--------------------+



In [35]:
result.select("text").show(1, truncate=False)

+----------------------------+
|text                        |
+----------------------------+
|Peter is a very good person.|
+----------------------------+
only showing top 1 row



In [39]:
result.select("document.result").show(1, truncate=False)

+------------------------------+
|result                        |
+------------------------------+
|[Peter is a very good person.]|
+------------------------------+
only showing top 1 row



In [38]:
result.select("token.result").show(1, truncate=False)

+-------------------------------------+
|result                               |
+-------------------------------------+
|[Peter, is, a, very, good, person, .]|
+-------------------------------------+
only showing top 1 row



In [40]:
result.select("pos.result").show(1, truncate=False)

+-----------------------------+
|result                       |
+-----------------------------+
|[NNP, VBZ, DT, RB, JJ, NN, .]|
+-----------------------------+
only showing top 1 row
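The tag names in the output follow the Penn Treebank convention. As a reading aid, the plain-Python sketch below pairs the tokens of the first sentence with their tags and a short description. The `PENN_TAGS` dictionary is a hand-written glossary covering only the tags that show up in this notebook; it is not something Spark NLP provides.

```python
# Glossary for the Penn Treebank tags that appear in this notebook's outputs.
PENN_TAGS = {
    "NNP": "proper noun, singular",
    "NN": "noun, singular or mass",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "RB": "adverb",
    "JJ": "adjective",
    "PRP$": "possessive pronoun",
    "IN": "preposition or subordinating conjunction",
    ".": "sentence-final punctuation",
}

tokens = ["Peter", "is", "a", "very", "good", "person", "."]
tags = ["NNP", "VBZ", "DT", "RB", "JJ", "NN", "."]

# Tokens and tags are aligned by position, so zip() pairs them correctly.
for token, tag in zip(tokens, tags):
    print(f"{token:<8}{tag:<6}{PENN_TAGS.get(tag, 'unknown')}")
```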



In [54]:
import pyspark.sql.functions as F

result_df = result.select(F.explode(F.arrays_zip(result.token.result, result.pos.result)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("POS")
                          ).toPandas()

result_df.head(15)

    token   POS
0   Peter   NNP
1      is   VBZ
2       a    DT
3    very    RB
4    good    JJ
5  person    NN
6       .     .
7      My  PRP$
8    life    NN
9      in    IN
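The `arrays_zip` plus `explode` combination used above can be hard to read at first. The plain-Python sketch below reproduces the same two steps on ordinary lists, using the first three tokens of the first two rows; it runs without Spark and is only meant to show the shape of the data at each step.

```python
token_results = [["Peter", "is", "a"], ["My", "life", "in"]]
pos_results = [["NNP", "VBZ", "DT"], ["PRP$", "NN", "IN"]]

# arrays_zip: pair the i-th token with the i-th tag inside each row.
zipped_rows = [
    list(zip(tokens, tags)) for tokens, tags in zip(token_results, pos_results)
]

# explode: emit one output row per (token, tag) pair.
flat = [{"token": token, "POS": tag} for row in zipped_rows for token, tag in row]

for record in flat:
    print(record)
```

Two input rows with three tokens each become six output rows, one per token.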


-------------
## Spark NLP Annotation UDFs

In [46]:
import pandas as pd
from sparknlp.functions import *
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

In [41]:
result.select('pos').show(1, truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|pos                                                                                                                                                                                                                                                                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [51]:
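# The function returns a Python list, so the return type is an array of strings
@udf(ArrayType(StringType()))
def nn_annotation(res, meta):
    nn = []
    for tag, token_meta in zip(res, meta):
        # keep the word of every token tagged as a common (NN) or proper (NNP) noun
        if tag in ("NN", "NNP"):
            nn.append(token_meta["word"])
    return nn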

In [56]:
result.withColumn("nn & NNp tokens", nn_annotation(col("pos.result"), col("pos.metadata")))\
      .select("nn & NNp tokens")\
      .show(truncate=False)

+-------------------------------+
|nn & NNp tokens                |
+-------------------------------+
|[Peter, person]                |
|[life, Russia]                 |
|[John, Peter]                  |
|[Lucas, Nogal, Dunbercker, car]|
|[Europe]                       |
+-------------------------------+
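Because the body of a UDF is ordinary Python, its logic can be checked on plain lists before it is wrapped for Spark. The sketch below re-implements the noun filter as a regular function; `select_nouns` is a hypothetical name, and the metadata dicts carry only the `word` key that the UDF above reads.

```python
def select_nouns(tags, metadata):
    """Return the words whose POS tag is NN or NNP."""
    return [
        meta["word"] for tag, meta in zip(tags, metadata) if tag in ("NN", "NNP")
    ]


tags = ["NNP", "VBZ", "DT", "RB", "JJ", "NN", "."]
words = ["Peter", "is", "a", "very", "good", "person", "."]
metadata = [{"word": word} for word in words]

print(select_nouns(tags, metadata))  # ['Peter', 'person']
```

Testing the function this way catches logic mistakes early, without waiting for a Spark job to run.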

