## Spark NLP - Explain Document (pretrained pipeline)

### **Installation**

In [2]:
# This is only needed to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

# Install Spark NLP Display for visualization
!pip install --ignore-installed spark-nlp-display

--2021-04-19 04:07:04--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.26
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.26|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2021-04-19 04:07:04--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1594 (1.6K) [text/plain]
Saving to: ‘STDOUT’

setup Colab for PySpark 3.0.2 and Spark NLP 3.0.1

2021-04-19 04:07:04 (2.46 MB

### **Importing required modules**

In [3]:
import sparknlp 

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.0.1
Apache Spark version:  3.0.2


### Now we load a pretrained pipeline model that contains the following annotators by default:

- Tokenizer
- Deep Sentence Detector
- Lemmatizer
- Stemmer
- Part of Speech (POS)
- Spell Checker (NorvigSweetingModel)
- Word Embeddings (glove)
- NER-DL (trained by SOTA algorithm)


In [4]:
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.base import *

pipeline = PretrainedPipeline('explain_document_dl')

explain_document_dl download started this may take some time.
Approx size to download 169.3 MB
[OK!]


We simply send the text we want to transform and the pipeline does the work.

In [5]:
text = 'John Smith would love to visit many beautful cities and take a pictre. He lives in Germany for the last 12 years.'
result = pipeline.annotate(text)

The result is a dictionary with one entry per annotator output, so a single call runs every stage of the pipeline at once.

In [6]:
result.keys()

dict_keys(['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence'])

In [7]:
result['entities']

['John Smith', 'Germany']

In [8]:
result['sentence']

['John Smith would love to visit many beautful cities and take a pictre.',
 'He lives in Germany for the last 12 years.']

In [9]:
list(zip(result['token'],result['stem'],result['lemma'],result['pos'],result['checked'],result['ner']))

[('John', 'john', 'John', 'NNP', 'John', 'B-PER'),
 ('Smith', 'smith', 'Smith', 'NNP', 'Smith', 'I-PER'),
 ('would', 'would', 'would', 'MD', 'would', 'O'),
 ('love', 'love', 'love', 'VB', 'love', 'O'),
 ('to', 'to', 'to', 'TO', 'to', 'O'),
 ('visit', 'visit', 'visit', 'VB', 'visit', 'O'),
 ('many', 'mani', 'many', 'JJ', 'many', 'O'),
 ('beautful', 'beauti', 'beautiful', 'JJ', 'beautiful', 'O'),
 ('cities', 'citi', 'city', 'NNS', 'cities', 'O'),
 ('and', 'and', 'and', 'CC', 'and', 'O'),
 ('take', 'take', 'take', 'VB', 'take', 'O'),
 ('a', 'a', 'a', 'DT', 'a', 'O'),
 ('pictre', 'pictur', 'picture', 'NN', 'picture', 'O'),
 ('.', '.', '.', '.', '.', 'O'),
 ('He', 'he', 'He', 'PRP', 'He', 'O'),
 ('lives', 'live', 'life', 'VBZ', 'lives', 'O'),
 ('in', 'in', 'in', 'IN', 'in', 'O'),
 ('Germany', 'germani', 'Germany', 'NNP', 'Germany', 'B-LOC'),
 ('for', 'for', 'for', 'IN', 'for', 'O'),
 ('the', 'the', 'the', 'DT', 'the', 'O'),
 ('last', 'last', 'last', 'JJ', 'last', 'O'),
 ('12', '12', '12', '
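
Comparing the `token` and `checked` columns shows which words the spell checker changed. The sketch below is plain Python and does not call Spark NLP. The helper name `spelling_corrections` is hypothetical, and the two lists are copied by hand from the first sentence of the output above rather than read from `result`.

```python
# Values copied from the first sentence of the output above.
tokens = ['John', 'Smith', 'would', 'love', 'to', 'visit', 'many',
          'beautful', 'cities', 'and', 'take', 'a', 'pictre', '.']
checked = ['John', 'Smith', 'would', 'love', 'to', 'visit', 'many',
           'beautiful', 'cities', 'and', 'take', 'a', 'picture', '.']


def spelling_corrections(tokens, checked):
    """Return (original, corrected) pairs where the two lists differ.

    Hypothetical helper for illustration; it assumes both lists are
    aligned token by token, as they are in the output above.
    """
    return [(tok, fix) for tok, fix in zip(tokens, checked) if tok != fix]


corrections = spelling_corrections(tokens, checked)
print(corrections)  # [('beautful', 'beautiful'), ('pictre', 'picture')]
```

In the notebook, the same helper could presumably be called as `spelling_corrections(result['token'], result['checked'])`.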

In [10]:
import pandas as pd

df = pd.DataFrame(list(zip(result['token'],result['stem'],result['lemma'],result['pos'],result['checked'],result['ner'])),
            columns = ['token','stem', 'lemma', 'pos', 'spell_checked', 'ner'])

df

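| | token | stem | lemma | pos | spell_checked | ner |
|---|---|---|---|---|---|---|
| 0 | John | john | John | NNP | John | B-PER |
| 1 | Smith | smith | Smith | NNP | Smith | I-PER |
| 2 | would | would | would | MD | would | O |
| 3 | love | love | love | VB | love | O |
| 4 | to | to | to | TO | to | O |
| 5 | visit | visit | visit | VB | visit | O |
| 6 | many | mani | many | JJ | many | O |
| 7 | beautful | beauti | beautiful | JJ | beautiful | O |
| 8 | cities | citi | city | NNS | cities | O |
| 9 | and | and | and | CC | and | O |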
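
The `ner` column uses BIO tags: `B-` opens an entity, `I-` continues it, and `O` marks a token outside any entity. The sketch below shows how such tags can be merged into chunks like those in `result['entities']`. It illustrates the tagging convention only and is not Spark NLP's own chunking code. The function name `group_bio` is hypothetical, and the token/tag pairs are a subset copied by hand from the outputs above.

```python
# A subset of (token, tag) pairs copied from the outputs above.
tokens = ['John', 'Smith', 'would', 'love', 'He', 'lives', 'in', 'Germany', 'for']
ner_tags = ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']


def group_bio(tokens, tags):
    """Merge BIO-tagged tokens into (text, label) chunks.

    Hypothetical helper for illustration. An I- tag whose label does not
    match the open chunk is treated as the start of a new chunk.
    """
    chunks = []
    current, label = [], None
    for tok, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition('-')
        if prefix == 'B' or (prefix == 'I' and tag_label != label):
            # Close any open chunk, then start a new one.
            if current:
                chunks.append((' '.join(current), label))
            current, label = [tok], tag_label
        elif prefix == 'I':
            # Continue the open chunk.
            current.append(tok)
        else:
            # 'O' tag: close any open chunk.
            if current:
                chunks.append((' '.join(current), label))
            current, label = [], None
    if current:
        chunks.append((' '.join(current), label))
    return chunks


entities = group_bio(tokens, ner_tags)
print(entities)  # [('John Smith', 'PER'), ('Germany', 'LOC')]
```

The chunk texts match the `['John Smith', 'Germany']` list returned in `result['entities']` earlier.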


Let's print out the entire result.

In [11]:
import pprint 
pp = pprint.PrettyPrinter(indent=4)

pp.pprint(result)

{   'checked': [   'John',
                   'Smith',
                   'would',
                   'love',
                   'to',
                   'visit',
                   'many',
                   'beautiful',
                   'cities',
                   'and',
                   'take',
                   'a',
                   'picture',
                   '.',
                   'He',
                   'lives',
                   'in',
                   'Germany',
                   'for',
                   'the',
                   'last',
                   '12',
                   'years',
                   '.'],
    'document': [   'John Smith would love to visit many beautful cities and '
                    'take a pictre. He lives in Germany for the last 12 '
                    'years.'],
    'embeddings': [   'John',
                      'Smith',
                      'would',
                      'love',
                      'to',
                   

### Visualization with Spark NLP Display

In [27]:
detailed_result = pipeline.fullAnnotate(text)

In [38]:
from sparknlp_display import NerVisualizer

ner_vis = NerVisualizer()

# Set custom label colors
ner_vis.set_label_colors({'LOC':'#0096c7', 'PER':'#ade8f4'})

ner_vis.display(detailed_result[0], 'entities', 'document')