## Spark NLP - Explain Document (pretrained pipeline)

We start by importing required modules.

In [1]:
import sparknlp 

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:   2.4.2
Apache Spark version:  2.4.4


Next, we load a pretrained pipeline model that contains the following annotators by default:

- Tokenizer
- Deep Sentence Detector
- Lemmatizer
- Stemmer
- Part of Speech (POS)
- Spell Checker (NorvigSweetingModel)
- Word Embeddings (glove)
- NER-DL (deep-learning named entity recognition)


In [2]:
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.base import *

pipeline = PretrainedPipeline('explain_document_dl')

explain_document_dl download started this may take some time.
Approx size to download 168.4 MB
[OK!]


We simply send the text we want to transform and the pipeline does the work.

In [3]:
text = 'John Smith would love to visit many beautful cities and take a pictre. He lives in Germany for the last 12 years.'
result = pipeline.annotate(text)

The result is a dictionary with one entry per annotator output, so a single call runs the whole pipeline at once.

In [4]:
result.keys()

dict_keys(['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence'])
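Not every key holds one value per token: `token`, `stem`, `lemma`, `pos`, `checked`, and `ner` line up with the tokens, while `document`, `sentence`, and `entities` have their own lengths. The sketch below separates the two groups by comparing list lengths. The helper `split_by_alignment` and the shortened `sample` dictionary are hypothetical stand-ins, not part of Spark NLP, and the length comparison is only a heuristic (a one-token text would make every list look aligned).

```python
def split_by_alignment(result, reference_key="token"):
    """Split keys into those with one value per token and all others."""
    n = len(result[reference_key])
    aligned = sorted(k for k, v in result.items() if len(v) == n)
    other = sorted(k for k, v in result.items() if len(v) != n)
    return aligned, other

# Hypothetical, shortened stand-in for the `result` dictionary
sample = {
    "document": ["He lives in Germany."],
    "sentence": ["He lives in Germany."],
    "token": ["He", "lives", "in", "Germany", "."],
    "pos": ["PRP", "VBZ", "IN", "NNP", "."],
    "ner": ["O", "O", "O", "I-LOC", "O"],
    "entities": ["Germany"],
}

aligned, other = split_by_alignment(sample)
print("token-aligned:", aligned)
print("other:", other)
```

Only the token-aligned keys can safely be combined with `zip`, which is why the later cells zip `token`, `stem`, `lemma`, `pos`, `checked`, and `ner` but not `entities` or `sentence`.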

In [5]:
result['entities']

['John Smith', 'Germany']
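The `entities` list is the chunked form of the per-token `ner` tags that appear later in this notebook (`I-PER`, `I-LOC`, `O`). As a rough illustration of that relationship, the sketch below merges consecutive tokens that share an entity type. `group_entities` is a hypothetical helper written for this example, not the pipeline's own converter, and the token/tag pairs are a subset of the notebook output.

```python
def group_entities(tokens, tags):
    """Merge consecutive tokens that share the same non-'O' entity type."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        # 'I-PER' -> 'PER'; 'O' means the token is outside any entity
        entity_type = tag.split("-", 1)[1] if tag != "O" else None
        # Close the open chunk when the type changes or a 'B-' tag starts a new one
        if current_tokens and (entity_type != current_type or tag.startswith("B-")):
            entities.append((" ".join(current_tokens), current_type))
            current_tokens = []
        if entity_type is not None:
            current_tokens.append(token)
        current_type = entity_type
    # Flush a chunk that runs to the end of the input
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

# Subset of the token/tag pairs from the notebook output
tokens = ["John", "Smith", "would", "love", "He", "lives", "in", "Germany"]
tags = ["I-PER", "I-PER", "O", "O", "O", "O", "O", "I-LOC"]

entities = group_entities(tokens, tags)
print(entities)  # [('John Smith', 'PER'), ('Germany', 'LOC')]
```

With the real output, the same call would be `group_entities(result['token'], result['ner'])`, assuming both lists are aligned one-to-one.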

In [6]:
result['sentence']

['John Smith would love to visit many beautful cities and take a pictre.',
 'He lives in Germany for the last 12 years.']

In [7]:
list(zip(result['token'],result['stem'],result['lemma'],result['pos'],result['checked'],result['ner']))

[('John', 'john', 'John', 'NNP', 'John', 'I-PER'),
 ('Smith', 'smith', 'Smith', 'NNP', 'Smith', 'I-PER'),
 ('would', 'would', 'would', 'MD', 'would', 'O'),
 ('love', 'love', 'love', 'VB', 'love', 'O'),
 ('to', 'to', 'to', 'TO', 'to', 'O'),
 ('visit', 'visit', 'visit', 'VB', 'visit', 'O'),
 ('many', 'mani', 'many', 'JJ', 'many', 'O'),
 ('beautful', 'beauti', 'beautiful', 'JJ', 'beautiful', 'O'),
 ('cities', 'citi', 'city', 'NNS', 'cities', 'O'),
 ('and', 'and', 'and', 'CC', 'and', 'O'),
 ('take', 'take', 'take', 'VB', 'take', 'O'),
 ('a', 'a', 'a', 'DT', 'a', 'O'),
 ('pictre', 'pictur', 'picture', 'NN', 'picture', 'O'),
 ('.', '.', '.', '.', '.', 'O'),
 ('He', 'he', 'He', 'PRP', 'He', 'O'),
 ('lives', 'live', 'life', 'VBZ', 'lives', 'O'),
 ('in', 'in', 'in', 'IN', 'in', 'O'),
 ('Germany', 'germani', 'Germany', 'NNP', 'Germany', 'I-LOC'),
 ('for', 'for', 'for', 'IN', 'for', 'O'),
 ('the', 'the', 'the', 'DT', 'the', 'O'),
 ('last', 'last', 'last', 'JJ', 'last', 'O'),
 ('12', '12', '12', '

In [8]:
import pandas as pd

df = pd.DataFrame(list(zip(result['token'],result['stem'],result['lemma'],result['pos'],result['checked'],result['ner'])),
            columns = ['token','stem', 'lemma', 'pos', 'spell_checked', 'ner'])

df

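|   | token | stem | lemma | pos | spell_checked | ner |
|---|-------|------|-------|-----|---------------|-----|
| 0 | John | john | John | NNP | John | I-PER |
| 1 | Smith | smith | Smith | NNP | Smith | I-PER |
| 2 | would | would | would | MD | would | O |
| 3 | love | love | love | VB | love | O |
| 4 | to | to | to | TO | to | O |
| 5 | visit | visit | visit | VB | visit | O |
| 6 | many | mani | many | JJ | many | O |
| 7 | beautful | beauti | beautiful | JJ | beautiful | O |
| 8 | cities | citi | city | NNS | cities | O |
| 9 | and | and | and | CC | and | O |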
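The `spell_checked` column differs from `token` only where the spell checker changed a word. A minimal sketch for listing just those corrections follows. `spelling_corrections` is a hypothetical helper, and the two lists are copied from the `token` and `checked` values in the output above.

```python
def spelling_corrections(tokens, checked):
    """Return (original, corrected) pairs where the spell checker changed a token."""
    return [(tok, fix) for tok, fix in zip(tokens, checked) if tok != fix]

# Values copied from the notebook output
tokens = ["many", "beautful", "cities", "and", "take", "a", "pictre", "."]
checked = ["many", "beautiful", "cities", "and", "take", "a", "picture", "."]

corrections = spelling_corrections(tokens, checked)
print(corrections)  # [('beautful', 'beautiful'), ('pictre', 'picture')]
```

On the full output this would be `spelling_corrections(result['token'], result['checked'])`, which relies on the same one-to-one alignment that the `zip` call in the previous cell assumes.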


Let's print out the entire result.

In [9]:
import pprint 
pp = pprint.PrettyPrinter(indent=4)

pp.pprint(result)

{   'checked': [   'John',
                   'Smith',
                   'would',
                   'love',
                   'to',
                   'visit',
                   'many',
                   'beautiful',
                   'cities',
                   'and',
                   'take',
                   'a',
                   'picture',
                   '.',
                   'He',
                   'lives',
                   'in',
                   'Germany',
                   'for',
                   'the',
                   'last',
                   '12',
                   'years',
                   '.'],
    'document': [   'John Smith would love to visit many beautful cities and '
                    'take a pictre. He lives in Germany for the last 12 '
                    'years.'],
    'embeddings': [   'John',
                      'Smith',
                      'would',
                      'love',
                      'to',
                   