![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Use pretrained `explain_document_ml` Pipeline

### Stages

 * DocumentAssembler
 * SentenceDetector
 * Tokenizer
 * Lemmatizer
 * Stemmer
 * Part of Speech
 * SpellChecker (Norvig)

In [2]:
! pip install -q pyspark==3.1.2 spark-nlp



In [3]:
import sys
import time

#Spark ML and SQL
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import array_contains
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
#Spark NLP
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, Finisher

### Let's create a Spark Session for our app

In [4]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.3.4
Apache Spark version:  3.1.2


#### This is our test document. We'll use it to illustrate each pipeline stage.

In [5]:
testDoc = [
"Frenchg author who helped pioner the science-fiction genre. \
Verne wrate about space, aisr, and underwater travel befdaore \
navigable aircrast and practical submarines were invented, \
and before any means of space travel had been devised. "
]

In [6]:
pipeline = PretrainedPipeline('explain_document_ml', lang='en')

explain_document_ml download started this may take some time.
Approx size to download 9.1 MB
[OK!]


#### We are not interested in handling big datasets, so let's call `annotate`, which runs the pipeline as a LightPipeline for speed.

In [7]:
result = pipeline.annotate(testDoc)

#### Let's analyze these results - first let's see what sentences we detected

In [8]:
[content['sentence'] for content in result]

[['Frenchg author who helped pioner the science-fiction genre.',
  'Verne wrate about space, aisr, and underwater travel befdaore navigable aircrast and practical submarines were invented, and before any means of space travel had been devised.']]

#### Now let's see how those sentences were tokenized

In [9]:
[content['token'] for content in result]

[['Frenchg',
  'author',
  'who',
  'helped',
  'pioner',
  'the',
  'science-fiction',
  'genre',
  '.',
  'Verne',
  'wrate',
  'about',
  'space',
  ',',
  'aisr',
  ',',
  'and',
  'underwater',
  'travel',
  'befdaore',
  'navigable',
  'aircrast',
  'and',
  'practical',
  'submarines',
  'were',
  'invented',
  ',',
  'and',
  'before',
  'any',
  'means',
  'of',
  'space',
  'travel',
  'had',
  'been',
  'devised',
  '.']]

#### Notice some spelling errors? The pipeline takes care of those as well

In [10]:
[content['spell'] for content in result]

[['Frenchy',
  'author',
  'who',
  'helped',
  'pioneer',
  'the',
  'sciencefiction',
  'genre',
  '.',
  'Verne',
  'wrote',
  'about',
  'space',
  ',',
  'airs',
  ',',
  'and',
  'underwater',
  'travel',
  'befdaore',
  'navigable',
  'aircraft',
  'and',
  'practical',
  'submarines',
  'were',
  'invented',
  ',',
  'and',
  'before',
  'any',
  'means',
  'of',
  'space',
  'travel',
  'had',
  'been',
  'devised',
  '.']]
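
To see exactly which tokens the spell checker changed, the `token` and `spell` lists can be compared position by position. The following is a minimal sketch, assuming the two lists are aligned one-to-one as they are in the outputs above. The short lists are copied from those outputs so the snippet runs without a Spark session, and `corrections` is a name chosen here for illustration.

```python
# First eight tokens and their spell-checked forms, copied from the outputs above.
tokens = ['Frenchg', 'author', 'who', 'helped', 'pioner', 'the', 'science-fiction', 'genre']
spell = ['Frenchy', 'author', 'who', 'helped', 'pioneer', 'the', 'sciencefiction', 'genre']

# Keep only the positions where the spell checker changed the token.
corrections = [(t, s) for t, s in zip(tokens, spell) if t != s]
print(corrections)
```

In the notebook itself, `tokens` and `spell` would be `result[0]['token']` and `result[0]['spell']`. Note that not every change is an improvement ('Frenchg' becomes 'Frenchy'), and some misspellings such as 'befdaore' pass through unchanged.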

#### Now let's see the lemmas

In [11]:
[content['lemmas'] for content in result]

[['Frenchy',
  'author',
  'who',
  'help',
  'pioneer',
  'the',
  'sciencefiction',
  'genre',
  '.',
  'Verne',
  'write',
  'about',
  'space',
  ',',
  'air',
  ',',
  'and',
  'underwater',
  'travel',
  'befdaore',
  'navigable',
  'aircraft',
  'and',
  'practical',
  'submarine',
  'be',
  'invent',
  ',',
  'and',
  'before',
  'any',
  'mean',
  'of',
  'space',
  'travel',
  'have',
  'be',
  'devise',
  '.']]

#### Let's check the stems. Any difference from the lemmas shown before?

In [12]:
[content['stems'] for content in result]

[['frenchi',
  'author',
  'who',
  'help',
  'pioneer',
  'the',
  'sciencefict',
  'genr',
  '.',
  'vern',
  'wrote',
  'about',
  'space',
  ',',
  'air',
  ',',
  'and',
  'underwat',
  'travel',
  'befdaor',
  'navig',
  'aircraft',
  'and',
  'practic',
  'submarin',
  'were',
  'invent',
  ',',
  'and',
  'befor',
  'ani',
  'mean',
  'of',
  'space',
  'travel',
  'had',
  'been',
  'devis',
  '.']]
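
One way to answer the question above is to pair each lemma with its stem and keep the pairs that differ. This is a small sketch, assuming aligned `lemmas` and `stems` lists. The sample pairs are copied from matching positions in the two outputs above, and the comparison ignores case because the stemmer output is lowercased.

```python
# Lemma/stem pairs taken from matching positions in the outputs above.
lemmas = ['Frenchy', 'help', 'genre', 'write', 'underwater', 'be', 'any']
stems = ['frenchi', 'help', 'genr', 'wrote', 'underwat', 'were', 'ani']

# Ignore case so that only genuine differences are reported.
differing = [(l, s) for l, s in zip(lemmas, stems) if l.lower() != s]
print(differing)
```

The pairs show the general pattern: lemmas are dictionary forms ('write', 'be'), while stems are truncated surface forms that need not be real words ('genr', 'underwat').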

#### Now it's the turn of Part of Speech (POS) tagging

In [13]:
pos = [content['pos'] for content in result]
token = [content['token'] for content in result]
# let's put token and tag together
list(zip(token[0], pos[0]))

[('Frenchg', 'NNP'),
 ('author', 'NN'),
 ('who', 'WP'),
 ('helped', 'VBD'),
 ('pioner', 'NN'),
 ('the', 'DT'),
 ('science-fiction', 'NN'),
 ('genre', 'NN'),
 ('.', '.'),
 ('Verne', 'NNP'),
 ('wrate', 'VBD'),
 ('about', 'IN'),
 ('space', 'NN'),
 (',', ','),
 ('aisr', 'NNS'),
 (',', ','),
 ('and', 'CC'),
 ('underwater', 'JJ'),
 ('travel', 'NN'),
 ('befdaore', 'NN'),
 ('navigable', 'JJ'),
 ('aircrast', 'NN'),
 ('and', 'CC'),
 ('practical', 'JJ'),
 ('submarines', 'NNS'),
 ('were', 'VBD'),
 ('invented', 'VBN'),
 (',', ','),
 ('and', 'CC'),
 ('before', 'IN'),
 ('any', 'DT'),
 ('means', 'NNS'),
 ('of', 'IN'),
 ('space', 'NN'),
 ('travel', 'NN'),
 ('had', 'VBD'),
 ('been', 'VBN'),
 ('devised', 'VBN'),
 ('.', '.')]
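
The tags follow the Penn Treebank convention (for example `NN` for a singular noun and `VBD` for a past-tense verb). Once tokens and tags are zipped together, grouping tokens by tag is straightforward. The sketch below uses the pairs for the first sentence, copied from the output above; `tokens_by_tag` is an illustrative name, not part of Spark NLP.

```python
from collections import defaultdict

# (token, tag) pairs for the first sentence, copied from the output above.
tagged = [('Frenchg', 'NNP'), ('author', 'NN'), ('who', 'WP'), ('helped', 'VBD'),
          ('pioner', 'NN'), ('the', 'DT'), ('science-fiction', 'NN'),
          ('genre', 'NN'), ('.', '.')]

# Collect the tokens seen under each tag, preserving their order.
tokens_by_tag = defaultdict(list)
for token, tag in tagged:
    tokens_by_tag[tag].append(token)

print(tokens_by_tag['NN'])
```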

# Use pretrained `match_chunks` Pipeline for Individual Noun Phrases

* DocumentAssembler
* SentenceDetector
* Tokenizer
* Part of Speech
* Chunker

Pipeline:
* The pipeline uses the regex `<DT>?<JJ>*<NN>+`, which states that whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then one or more nouns (NN), a noun phrase (NP) chunk should be formed.
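
To make the tag pattern concrete, here is a pure-Python sketch of the same idea: encode the POS tags as a string such as `<DT><JJ><NN>`, run an ordinary regular expression over it, and map each match back to its tokens. This is only an illustration under those assumptions and is not how Spark NLP implements its chunker; `NP_PATTERN` and `chunk_noun_phrases` are hypothetical names introduced here.

```python
import re

# The tag pattern from the text, with each tag wrapped in a non-capturing group.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+")

def chunk_noun_phrases(tokens, tags):
    """Return the token spans whose tag sequence matches NP_PATTERN."""
    # Encode tags as one string and remember where each tag starts,
    # so that a regex match can be mapped back to token positions.
    starts, pieces, pos = [], [], 0
    for tag in tags:
        starts.append(pos)
        piece = f"<{tag}>"
        pieces.append(piece)
        pos += len(piece)
    tag_string = "".join(pieces)

    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        # A token belongs to the chunk if its tag starts inside the match.
        indices = [i for i, s in enumerate(starts) if match.start() <= s < match.end()]
        chunks.append(" ".join(tokens[i] for i in indices))
    return chunks

# Tokens and tags copied from the pipeline output for the second example sentence.
tokens = ['the', 'little', 'yellow', 'dog', 'barked', 'at', 'the', 'cat']
tags = ['DT', 'JJ', 'JJ', 'NN', 'JJ', 'IN', 'DT', 'NN']
print(chunk_noun_phrases(tokens, tags))
```

Because the pattern names `<NN>` literally, a plural noun tagged `NNS` does not match in this sketch.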

In [18]:
!java -version

openjdk version "11.0.11" 2021-04-20
OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.18.04)
OpenJDK 64-Bit Server VM (build 11.0.11+9-Ubuntu-0ubuntu2.18.04, mixed mode, sharing)


In [19]:
# Load the noun-phrase chunking pipeline used in the cells below
pipeline = PretrainedPipeline('match_chunks', lang='en')

In [None]:
result = pipeline.annotate("The book has many chapters") # single noun phrase

In [None]:
result['chunk']

['The book']

In [None]:
result = pipeline.annotate("the little yellow dog barked at the cat") # multiple noun phrases

In [None]:
result['chunk']

['the little yellow dog', 'the cat']

In [None]:
result

{'chunk': ['the little yellow dog', 'the cat'],
 'document': ['the little yellow dog barked at the cat'],
 'pos': ['DT', 'JJ', 'JJ', 'NN', 'JJ', 'IN', 'DT', 'NN'],
 'token': ['the', 'little', 'yellow', 'dog', 'barked', 'at', 'the', 'cat'],
 'sentence': ['the little yellow dog barked at the cat']}