![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Use pretrained `explain_document_ml` Pipeline

### Stages

 * DocumentAssembler
 * SentenceDetector
 * Tokenizer
 * Lemmatizer
 * Stemmer
 * Part of Speech
 * SpellChecker (Norvig)


In [1]:
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

### Let's create a Spark Session for our app

In [2]:

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:   2.4.2
Apache Spark version:  2.4.4


#### This is our test document; we'll use it to exercise each pipeline stage.

In [3]:
testDoc = spark.createDataFrame([
"""French author who helped pioner the science-fiction genre.
Verne wrate about space, air, and underwater travel before
navigable aircrast and practical submarines were invented,
and before any means of space travel had been devised."""    
], "string").toDF("text")

In [4]:
testDoc.show()

+--------------------+
|                text|
+--------------------+
|French author who...|
+--------------------+



In [5]:
pipeline = PretrainedPipeline('explain_document_ml', lang='en')

explain_document_ml download started this may take some time.
Approx size to download 9.4 MB
[OK!]


#### We are not handling big datasets here. `annotate` uses a LightPipeline for speed when given a string or a list of strings; given a DataFrame, as below, it runs the regular Spark transform.

In [6]:
result = pipeline.annotate(testDoc, "text")
result.printSchema()
result.show()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
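
A minimal sketch of annotating plain strings instead of a DataFrame, which is the path that goes through the LightPipeline. `annotate_strings` is a hypothetical helper name, and the only assumption about `pipeline` is that its `annotate` method returns a dict for a single string and a list of dicts for a list of strings, as `PretrainedPipeline` does.

```python
def annotate_strings(pipeline, texts):
    """Annotate plain Python strings without building a Spark DataFrame.

    Assumption: `pipeline.annotate` returns one dict (output column name ->
    list of result strings) for a str, and a list of such dicts for a list.
    """
    if isinstance(texts, str):
        # Wrap the single result so callers always receive a list of dicts.
        return [pipeline.annotate(texts)]
    # Materialize any iterable (tuple, generator) into a list first.
    return pipeline.annotate(list(texts))
```

With the pipeline loaded above, `annotate_strings(pipeline, "Verne wrate about space")[0]["token"]` would give the tokens for that sentence as a plain Python list.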

#### Let's analyze these results - first let's see what sentences we detected

In [7]:
result.select("sentence.result").show(1, False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[French author who helped pioner the science-fiction genre., Verne wrate about space, air, and underwater travel before
navigable aircrast and practical submarines were invented,
and before any means of space travel had been devised.]|
+---------------------------------------------------

#### Now let's see how those sentences were tokenized

In [8]:
result.select("token.result").show(1, False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[French, author, who, helped, pioner, the, science-fiction, genre, ., Verne, wrate, about, space, ,, air, ,, and, underwater, travel, before, navigable, aircrast,

#### Notice some spelling errors? The pipeline takes care of those as well

In [9]:
# The spell checker writes to the "spell" column; this pipeline has no "checked" column.
result.select("spell.result").show(1, False)
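
To see which words the spell checker actually changed, the tokens can be compared with the `spell` output. The sketch below assumes `annotations` is the dict that `pipeline.annotate("some string")` returns, and that the `token` and `spell` lists are aligned one-to-one; `corrected_tokens` is a hypothetical helper, not part of Spark NLP.

```python
def corrected_tokens(annotations, token_key="token", spell_key="spell"):
    """Return (original, corrected) pairs for tokens the spell checker changed.

    Assumption: both lists have one entry per token, in the same order.
    """
    tokens = annotations[token_key]
    checked = annotations[spell_key]
    if len(tokens) != len(checked):
        raise ValueError("token and spell outputs are not aligned")
    # Keep only the positions where the checked form differs from the input.
    return [(tok, fix) for tok, fix in zip(tokens, checked) if tok != fix]
```

Calling `corrected_tokens(pipeline.annotate("Verne wrate about space"))` would list only the tokens that were rewritten.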

#### Now let's see the lemmas

In [None]:
result.select("lemmas.result").show(1, False)

#### Let's check the stems. Any difference from the lemmas shown before?

In [None]:
result.select("stems.result").show(1, False)
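
A side-by-side view makes the lemma and stem differences easier to spot than two separate `show` calls. This is a sketch under assumptions: `annotations` is the dict returned by `pipeline.annotate("some string")`, its `spell`, `lemmas` and `stems` lists are aligned, and both helper names are made up for illustration.

```python
def lemma_stem_rows(annotations):
    """Build (word, lemma, stem, differs) rows from an annotate() result dict."""
    rows = []
    for word, lemma, stem in zip(
        annotations["spell"], annotations["lemmas"], annotations["stems"]
    ):
        # Flag the rows where the lemmatizer and the stemmer disagree.
        rows.append((word, lemma, stem, lemma != stem))
    return rows


def format_rows(rows):
    """Render the rows as a fixed-width text table."""
    template = "{:<18}{:<18}{:<18}{}"
    lines = [template.format("word", "lemma", "stem", "differs")]
    for word, lemma, stem, differs in rows:
        lines.append(template.format(word, lemma, stem, "yes" if differs else ""))
    return "\n".join(lines)
```

`print(format_rows(lemma_stem_rows(pipeline.annotate("Verne wrote about space"))))` would print one line per token.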

#### Let's look at Part Of Speech (POS) results

In [None]:
result.select("pos.result").show(1, False)
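
Grouping tokens by their tag gives a quick overview of the POS output. The sketch assumes `annotations` is the dict returned by `pipeline.annotate("some string")` and that `token` and `pos` hold one entry per token in the same order; `tokens_by_tag` is a hypothetical helper.

```python
from collections import defaultdict


def tokens_by_tag(annotations, token_key="token", pos_key="pos"):
    """Group tokens under their part-of-speech tag, preserving token order."""
    grouped = defaultdict(list)
    for token, tag in zip(annotations[token_key], annotations[pos_key]):
        grouped[tag].append(token)
    # Return a plain dict so the result prints and compares cleanly.
    return dict(grouped)
```

For example, `tokens_by_tag(pipeline.annotate("Verne wrote about space"))` would map each tag produced by the pipeline to the tokens that received it.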
