![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/explain-document-ml/explain_document_ml.ipynb)

## 0. Colab Setup

In [1]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.4

# Install Spark NLP (pinned to 2.5.0, the version reported when the session starts, to pair with pyspark 2.4.4)
! pip install --ignore-installed spark-nlp==2.5.0

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

# Use pretrained `explain_document_ml` Pipeline

### Stages

 * DocumentAssembler
 * SentenceDetector
 * Tokenizer
 * Lemmatizer
 * Stemmer
 * Part of Speech
 * SpellChecker (Norvig)


In [None]:
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

### Let's create a Spark Session for our app

In [3]:

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  2.5.0
Apache Spark version:  2.4.4


#### This is our test document; we'll use it to illustrate each pipeline stage.

In [None]:
testDoc = spark.createDataFrame([
"""French author who helped pioner the science-fiction genre.
Verne wrate about space, air, and underwater travel before
navigable aircrast and practical submarines were invented,
and before any means of space travel had been devised."""    
], "string").toDF("text")

In [5]:
testDoc.show()

+--------------------+
|                text|
+--------------------+
|French author who...|
+--------------------+



In [6]:
pipeline = PretrainedPipeline('explain_document_ml', lang='en')

explain_document_ml download started this may take some time.
Approx size to download 9.4 MB
[OK!]


#### We are not interested in handling big datasets, so let's switch to LightPipelines for speed.

In [7]:
result = pipeline.annotate(testDoc, "text")
result.printSchema()
result.show()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true

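Every annotation column in the schema is an array of structs sharing the same fields. As a rough illustration of how `begin`, `end`, and `result` relate to the input text, here is a plain-Python sketch. The `Annotation` class and the `locate` helper are hypothetical stand-ins, not Spark NLP classes, and the sketch assumes that `end` is an inclusive character offset.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical plain-Python mirror of the struct fields in the schema."""
    annotatorType: str
    begin: int
    end: int  # assumption: inclusive character offset
    result: str
    metadata: dict = field(default_factory=dict)


def locate(text, fragment, annotator_type):
    """Build an Annotation for the first occurrence of `fragment` in `text`."""
    begin = text.index(fragment)
    return Annotation(annotator_type, begin, begin + len(fragment) - 1, fragment)


text = "French author who helped pioner the science-fiction genre."
token = locate(text, "author", "token")
print(token.begin, token.end)  # prints: 7 12
```

Under that assumption, `text[token.begin:token.end + 1]` recovers `token.result`.
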
#### Let's analyze these results. First, let's see which sentences were detected

In [12]:
result.select("sentence.result").show(1, False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[French author who helped pioner the science-fiction genre., Verne wrate about space, air, and underwater travel before
navigable aircrast and practical submarines were invented,
and before any means of space travel had been devised.]|
+---------------------------------------------------

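To build some intuition for why the document splits into two sentences, here is a toy rule-based splitter in plain Python. It is only a sketch: the `split_sentences` helper and its single regular expression are assumptions made for illustration, and the pipeline's SentenceDetector covers many more cases than this rule does.

```python
import re

text = """French author who helped pioner the science-fiction genre.
Verne wrate about space, air, and underwater travel before
navigable aircrast and practical submarines were invented,
and before any means of space travel had been devised."""


def split_sentences(text):
    """Split after '.', '!' or '?' when whitespace and an uppercase letter follow."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


sentences = split_sentences(text)
for sentence in sentences:
    print(repr(sentence))
```

The second sentence keeps its internal line breaks because the rule only fires at sentence-ending punctuation.
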
#### Now let's see how those sentences were tokenized

In [13]:
result.select("token.result").show(1, False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[French, author, who, helped, pioner, the, science-fiction, genre, ., Verne, wrate, about, space, ,, air, ,, and, underwater, travel, before, navigable, aircrast,

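The token list keeps `science-fiction` as one token and emits punctuation marks as tokens of their own. A minimal regular-expression sketch that reproduces this behavior on the first sentence is shown below. The `tokenize` helper is hypothetical and is not how the pipeline's Tokenizer is implemented.

```python
import re


def tokenize(text):
    """Keep hyphenated words together and emit each punctuation mark separately."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


tokens = tokenize("French author who helped pioner the science-fiction genre.")
print(tokens)
```
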
#### Notice some spelling errors? The pipeline takes care of those as well.

In [17]:
result.select("spell").show(1, False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

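The stage list names a Norvig-style spell checker. The general idea behind that family of checkers can be sketched in a few lines: generate every string one edit away from the unknown word and keep the most frequent known candidate. The word counts below are a hypothetical toy vocabulary, and the code is an illustration of the idea, not Spark NLP's implementation.

```python
import string

# Hypothetical toy word counts; a real checker derives these from a large corpus.
WORD_COUNTS = {"pioneer": 5, "wrote": 8, "write": 6, "aircraft": 4, "the": 50}


def edits1(word):
    """Return every string that is one delete, transpose, replace or insert away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, or the word itself."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word


for misspelled in ["pioner", "wrate", "aircrast"]:
    print(misspelled, "->", correct(misspelled))
```

With these toy counts, `wrate` has two candidates one edit away (`wrote` and `write`), and the higher count decides between them.
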
#### Now let's see the lemmas

In [18]:
result.select("lemmas.result").show(1, False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[French, author, who, help, pioneer, the, sciencefiction, genre, ., Verne, write, about, space, ,, air, ,, and, underwater, travel, before, navigable, aircraft, and, practical, submarine, be, 

#### Let's check the stems. Any difference from the lemmas shown before?

In [19]:
result.select("stems.result").show(1, False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[french, author, who, help, pioneer, the, sciencefict, genr, ., vern, wrote, about, space, ,, air, ,, and, underwat, travel, befor, navig, aircraft, and, practic, submarin, were, invent, ,, and, befor, ani, mean, of, space, travel, ha

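One way to answer that question is to line the two lists up position by position. The sketch below uses the first 26 values copied from the lemma and stem outputs in this notebook (both outputs are truncated, so the remainder is left out) and prints the pairs that differ once case is ignored.

```python
# Values copied from the truncated lemma and stem outputs in this notebook.
lemmas = ["French", "author", "who", "help", "pioneer", "the", "sciencefiction",
          "genre", ".", "Verne", "write", "about", "space", ",", "air", ",", "and",
          "underwater", "travel", "before", "navigable", "aircraft", "and",
          "practical", "submarine", "be"]
stems = ["french", "author", "who", "help", "pioneer", "the", "sciencefict",
         "genr", ".", "vern", "wrote", "about", "space", ",", "air", ",", "and",
         "underwat", "travel", "befor", "navig", "aircraft", "and",
         "practic", "submarin", "were"]

# Keep only the positions where the lemma and the stem disagree (ignoring case).
differences = [(lemma, stem) for lemma, stem in zip(lemmas, stems)
               if lemma.lower() != stem]

for lemma, stem in differences:
    print(f"{lemma:<16}{stem}")
```

In this sample the stems are lowercased and often cut short of a dictionary word (`genr`, `submarin`), while the lemmas are dictionary forms (`write`, `be`).
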
#### Let's look at Part Of Speech (POS) results

In [20]:
result.select("pos.result").show(1, False)


+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[JJ, NN, WP, VBD, NN, DT, NN, NN, ., NNP, VBD, IN, NN, ,, NN, ,, CC, JJ, NN, IN, JJ, NN, CC, JJ, NNS, VBD, VBN, ,, CC, IN, DT, NNS, IN, NN, NN, VBD, VBN, VBN, .]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+

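The tags are easier to read next to the tokens they belong to. The sketch below pairs the first eleven tokens with their tags, both copied from the outputs in this notebook, and attaches a short description to each tag. It assumes the tags follow the Penn Treebank convention, which their names suggest; the `TAG_NAMES` table is written by hand for this example.

```python
# First eleven tokens and tags, copied from the outputs shown in this notebook.
tokens = ["French", "author", "who", "helped", "pioner", "the",
          "science-fiction", "genre", ".", "Verne", "wrate"]
tags = ["JJ", "NN", "WP", "VBD", "NN", "DT", "NN", "NN", ".", "NNP", "VBD"]

# Assumption: Penn Treebank descriptions for the tags that occur above.
TAG_NAMES = {
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "WP": "wh-pronoun",
    "VBD": "verb, past tense",
    "DT": "determiner",
    "NNP": "proper noun, singular",
    ".": "sentence-final punctuation",
}

tagged = list(zip(tokens, tags))
for token, tag in tagged:
    print(f"{token:<17}{tag:<5}{TAG_NAMES[tag]}")
```
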
