![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/model-downloader/Running_Pretrained_pipelines.ipynb)

## 0. Colab Setup

In [None]:
import os

# Install java
! apt-get update -qq
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

## Running Pretrained models

In the following example, we walk through several use cases for our pretrained models and pipelines, which can be used off the shelf.

The BasicPipeline returns tokens, normalized tokens, lemmas, and part-of-speech tags. The AdvancedPipeline returns everything the BasicPipeline does, plus stems, spell-checked tokens, and NER entities from the CRF model. All pipelines and pretrained models are downloaded at run time, so this notebook requires internet access.

#### 1. Import the required modules and create the Spark session

In [None]:
import os
import sys
print(sys.version)

import sparknlp
from sparknlp.pretrained import ResourceDownloader
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import *

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline


3.6.9 (default, Apr 18 2020, 01:56:04) 
[GCC 8.4.0]


In [None]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)


Spark NLP version:  2.5.0
Apache Spark version:  2.4.4


#### 2. Create a dummy Spark DataFrame

In [None]:

l = [
  (1,'To be or not to be'),
  (2,'This is it!')
]

data = spark.createDataFrame(l, ['docID','text'])

#### 3. Use the predefined BasicPipeline to annotate a DataFrame

In [None]:
# Download a predefined pipeline
from sparknlp.pretrained import PretrainedPipeline

explain_document_ml = PretrainedPipeline("explain_document_ml")
# Annotate the 'text' column of the DataFrame
basic_data = explain_document_ml.annotate(data, 'text')
basic_data.show()

explain_document_ml download started this may take some time.
Approx size to download 9.4 MB
[OK!]
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|docID|              text|            document|            sentence|               token|               spell|              lemmas|               stems|                 pos|
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    1|To be or not to be|[[document, 0, 17...|[[document, 0, 17...|[[token, 0, 1, To...|[[token, 0, 1, To...|[[token, 0, 1, To...|[[token, 0, 1, to...|[[pos, 0, 1, TO, ...|
|    2|       This is it!|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 3, Th...|[[token, 0, 3, Th...|[[token, 0, 3, Th...|[[token, 0, 3, th...|[[pos, 0, 3, DT, ...|
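+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+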
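
The truncated cells in the table, such as `[[token, 0, 1, To...` and `[[document, 0, 17...`, start with the annotator type followed by a begin and an end character offset. Judging from the values shown, the end offset is inclusive. The following is a minimal pure-Python sketch (no Spark required) that checks this reading against the first row; `covered_text` is a hypothetical helper introduced only for illustration.

```python
# Offsets copied from the table above: (annotator type, begin, end).
text = "To be or not to be"
document_span = ("document", 0, 17)
first_token_span = ("token", 0, 1)


def covered_text(source, span):
    """Return the substring covered by an annotation whose end offset is inclusive."""
    _, begin, end = span
    return source[begin:end + 1]


print(covered_text(text, first_token_span))       # To
print(covered_text(text, document_span) == text)  # True
```

The same reading matches the second row: `This is it!` has 11 characters, and its document annotation spans offsets 0 to 10.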

#### We can also annotate a single string

In [None]:
# Annotate a string directly
explain_document_ml.annotate("This world is made up of good and bad things")

{'document': ['This world is made up of good and bad things'],
 'lemmas': ['This',
  'world',
  'be',
  'make',
  'up',
  'of',
  'good',
  'and',
  'bad',
  'thing'],
 'pos': ['DT', 'NN', 'VBZ', 'VBN', 'RP', 'IN', 'JJ', 'CC', 'JJ', 'NNS'],
 'sentence': ['This world is made up of good and bad things'],
 'spell': ['This',
  'world',
  'is',
  'made',
  'up',
  'of',
  'good',
  'and',
  'bad',
  'things'],
 'stems': ['thi',
  'world',
  'i',
  'made',
  'up',
  'of',
  'good',
  'and',
  'bad',
  'thing'],
 'token': ['This',
  'world',
  'is',
  'made',
  'up',
  'of',
  'good',
  'and',
  'bad',
  'things']}
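
The dictionary returned for a string holds one list per output column. In the output above every list has ten entries, so the lists can be read side by side. The sketch below assumes that one-to-one alignment, which may not hold if an annotator splits or merges tokens; the values are copied from the output above rather than produced by a live pipeline.

```python
# Values copied from the annotate() output shown above.
result = {
    "token": ["This", "world", "is", "made", "up", "of", "good", "and", "bad", "things"],
    "lemmas": ["This", "world", "be", "make", "up", "of", "good", "and", "bad", "thing"],
    "stems": ["thi", "world", "i", "made", "up", "of", "good", "and", "bad", "thing"],
    "pos": ["DT", "NN", "VBZ", "VBN", "RP", "IN", "JJ", "CC", "JJ", "NNS"],
}

# Assumption: the lists are aligned index by index (true for this output).
rows = list(zip(result["token"], result["lemmas"], result["stems"], result["pos"]))

for token, lemma, stem, tag in rows:
    print(f"{token:<8}{lemma:<8}{stem:<8}{tag}")
```

In a live session, `result` would simply be the return value of `explain_document_ml.annotate(...)`.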

#### 4. Next, we use one of the fast pretrained models: the Perceptron model, a POS tagger trained on the American National Corpus (ANC)

In [None]:

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

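# Download the default pretrained word embeddings (glove_100d, per the download log)
wordEmbeddings = WordEmbeddingsModel.pretrained().setOutputCol("word_embeddings")

# Download the default pretrained POS model (pos_anc, per the download log)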
pos = PerceptronModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")
    
advancedPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos, wordEmbeddings])

output = advancedPipeline.fit(data).transform(data)
output.show()

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
pos_anc download started this may take some time.
Approximate size to download 4.3 MB
[OK!]
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|docID|              text|            document|            sentence|               token|                 pos|     word_embeddings|
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    1|To be or not to be|[[document, 0, 17...|[[document, 0, 17...|[[token, 0, 1, To...|[[pos, 0, 1, TO, ...|[[word_embeddings...|
|    2|       This is it!|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 3, Th...|[[pos, 0, 3, DT, ...|[[word_embeddings...|
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+



#### 5. Next, we download a fast CRF Named Entity Recognition model trained with GloVe embeddings. We then reuse the `advancedPipeline` and feed its output to the NER model, which requires the POS, token, document, and word embedding columns.

In [None]:
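# Download the default pretrained CRF NER model
ner = NerCrfModel.pretrained() \
    .setInputCols(["pos", "token", "document", "word_embeddings"]) \
    .setOutputCol("ner")

# advancedPipeline already produces the pos and word_embeddings columns,
# so its output can be passed to the NER model directly
annotation_data = advancedPipeline.fit(data).transform(data)
ner_tagged = ner.transform(annotation_data)
ner_tagged.show()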

ner_crf download started this may take some time.
Approximate size to download 10.1 MB
[OK!]
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|docID|              text|            document|            sentence|               token|                 pos|     word_embeddings|                 ner|
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    1|To be or not to be|[[document, 0, 17...|[[document, 0, 17...|[[token, 0, 1, To...|[[pos, 0, 1, TO, ...|[[word_embeddings...|[[named_entity, 0...|
|    2|       This is it!|[[document, 0, 10...|[[document, 0, 10...|[[token, 0, 3, Th...|[[pos, 0, 3, DT, ...|[[word_embeddings...|[[named_entity, 0...|
+-----+------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
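
Combining these models only works if every annotator finds its input columns already present when it runs. The sketch below models that requirement in plain Python, without Spark. The function `missing_inputs` and the `stages` list are hypothetical and written for this illustration. The column names come from the cells above, except the input columns of the word embeddings stage, which the notebook never sets explicitly and are assumed here.

```python
def missing_inputs(stages, initial_columns):
    """Walk the stages in order and report inputs that no earlier stage produced."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in available:
                problems.append((name, column))
        available.add(output)
    return problems


# (stage name, input columns, output column)
stages = [
    ("document_assembler", ["text"], "document"),
    ("sentence_detector", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("pos", ["sentence", "token"], "pos"),
    # Assumption: the notebook does not show the embeddings input columns.
    ("word_embeddings", ["document", "token"], "word_embeddings"),
    ("ner", ["pos", "token", "document", "word_embeddings"], "ner"),
]

# In the order used above, every input is available when it is needed.
print(missing_inputs(stages, ["docID", "text"]))

# Moving the NER stage to the front leaves all four of its inputs unresolved.
reordered = [stages[-1]] + stages[:-1]
print(missing_inputs(reordered, ["docID", "text"]))
```

This is only a bookkeeping model of column names. It does not check annotator types, which Spark NLP validates separately.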

#### 6. Finally, let's try a pretrained sentiment analysis pipeline

In [None]:
PretrainedPipeline("analyze_sentiment").annotate("This is a good movie!!!")

analyze_sentiment download started this may take some time.
Approx size to download 4.9 MB
[OK!]


{'checked': ['This', 'is', 'a', 'good', 'movie', '!!!'],
 'document': ['This is a good movie!!!'],
 'sentence': ['This is a good movie!!!'],
 'sentiment': ['positive'],
 'token': ['This', 'is', 'a', 'good', 'movie', '!!!']}
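
In the result above, the `sentiment` list has one label for the single detected sentence. Assuming the pipeline emits one label per sentence in general (the output shown is consistent with that, but does not prove it), a small hypothetical helper can pair each sentence with its label and fail loudly if the lists are misaligned. The dictionary is copied from the output above.

```python
def label_sentences(result):
    """Pair each sentence with its sentiment label, refusing misaligned input."""
    sentences = result["sentence"]
    labels = result["sentiment"]
    if len(sentences) != len(labels):
        raise ValueError("expected one sentiment label per sentence")
    return list(zip(sentences, labels))


# Values copied from the annotate() output shown above.
result = {
    "checked": ["This", "is", "a", "good", "movie", "!!!"],
    "document": ["This is a good movie!!!"],
    "sentence": ["This is a good movie!!!"],
    "sentiment": ["positive"],
    "token": ["This", "is", "a", "good", "movie", "!!!"],
}

print(label_sentences(result))
```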