![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/35.03.PretrainedPipeline.ipynb)

# 🔎 What is `PretrainedPipeline`?

`PretrainedPipelines` are fully constructed Spark NLP pipelines that are ready to use with a single line of code.



# 🔎 How do `PretrainedPipelines` work?

`PretrainedPipelines` are end-to-end Spark NLP pipelines that have already been fitted. Instead of building a pipeline stage by stage, you can download a `PretrainedPipeline` and use it directly to get the results you need.



# 🔎 How can `PretrainedPipeline` be used?

When you load a `PretrainedPipeline`, it gives you a `LightPipeline` version of the Spark NLP pipeline, so you can call `annotate()` or `fullAnnotate()` with a string or a list of strings, or call `.transform()` to process Spark DataFrames.


# 📚 Documentation

```
PretrainedPipeline(name, lang='en', remote_loc=None, parse_embeddings=False, disk_location=None)
```


You can check our [Python API](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/pretrained/pretrained_pipeline/index.html#sparknlp.pretrained.pretrained_pipeline.PretrainedPipeline) and [ScalaDoc](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/pretrained/PretrainedPipeline.html) for more details about `PretrainedPipeline`.

You can also find all `PretrainedPipelines` in Spark NLP on the [Models Hub](https://nlp.johnsnowlabs.com/models) page.

# Colab Setup

In [None]:
!pip install -q pyspark==3.3.0 spark-nlp==4.2.4



In [None]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 4.2.4
Apache Spark version: 3.3.0


# PretrainedPipeline

You can list all `PretrainedPipelines` in Spark NLP by using the line below.

In [None]:
from sparknlp.pretrained import ResourceDownloader
ResourceDownloader.showPublicPipelines(lang="en")

+-------------------------------------------------------------------------------------------------------------+------+---------+
| Pipeline                                                                                                    | lang | version |
+-------------------------------------------------------------------------------------------------------------+------+---------+
| dependency_parse                                                                                            |  en  | 2.0.2   |
| check_spelling                                                                                              |  en  | 2.1.0   |
| match_datetime                                                                                              |  en  | 2.1.0   |
| match_pattern                                                                                               |  en  | 2.1.0   |
| clean_pattern                                                                                  

## Sample Data

In [None]:
sample_text = '''
Peter is a very good persn.
My life in Russia is very intersting.
John and Peter are brthers. However they don't support each other that much.
'''

sample_list = ['Lucas Dunbercker is no longer happy. He has a good car though.',
               'Europe is very culture rich. Thre are huge churches and big houses!']

data = spark.createDataFrame([[sample_text]]).toDF("text")

## Explain Document DL

➤ As an example, we will download the `explain_document_dl` `PretrainedPipeline`, run it on our sample text, and get the detected sentences, tokens, spell-corrected tokens, lemmas, stems, part-of-speech tags, embeddings, and named entities.

**Stages**
- Document Assembler
- Sentence Detector
- Tokenizer
- Spell Checker (Norvig)
- Lemmatizer
- Stemmer
- Part of Speech
- Word Embeddings (GloVe 100D)
- NER (NER with GloVe 100D embeddings, CoNLL2003 dataset)
- NER Converter


In [None]:
from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('explain_document_dl', lang='en')

explain_document_dl download started this may take some time.
Approx size to download 169.4 MB
[OK!]


We can get the information of the stages in the pipeline as shown below. 

In [None]:
# pipeline stages

pipeline.model.stages

[document_7939d5bf1083,
 SENTENCE_05265b07c745,
 REGEX_TOKENIZER_c5c312143f63,
 SPELL_e4ea67180337,
 LEMMATIZER_c62ad8f355f9,
 STEMMER_ba49f7631065,
 POS_d01c734956fe,
 WORD_EMBEDDINGS_MODEL_48cffc8b9a76,
 NerDLModel_d4424c9af5f4,
 NER_CONVERTER_a81db9af2d23]

In [None]:
# storageRef of embeddings model in the pipeline

pipeline.model.stages[-3].getStorageRef()

'glove_100d'

In [None]:
# storageRef of NER model in the pipeline

pipeline.model.stages[-2].getStorageRef()

'glove_100d'

In [None]:
# NER model labels in the pipeline

pipeline.model.stages[-2].getClasses()

['O', 'B-ORG', 'B-LOC', 'B-PER', 'I-PER', 'I-ORG', 'B-MISC', 'I-LOC', 'I-MISC']
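
These labels follow the IOB tagging convention: `B-` marks the first token of an entity, `I-` marks a continuation token, and `O` marks a token outside any entity. As a small illustration in plain Python (the `labels` list is copied from the output above and nothing here calls Spark NLP), the distinct entity types can be recovered by stripping the prefixes:

```python
# Labels copied from the getClasses() output above.
labels = ['O', 'B-ORG', 'B-LOC', 'B-PER', 'I-PER', 'I-ORG', 'B-MISC', 'I-LOC', 'I-MISC']

# Drop the "O" tag and strip the "B-"/"I-" prefix to get the entity types.
entity_types = sorted({label.split('-', 1)[1] for label in labels if label != 'O'})

print(entity_types)  # ['LOC', 'MISC', 'ORG', 'PER']
```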

In [None]:
# input columns of NER Converter in the pipeline

pipeline.model.stages[-1].getInputCols()

['sentence', 'token', 'ner']

## 📌 `PretrainedPipeline` Methods

Because a `PretrainedPipeline` is backed by a [Spark NLP `LightPipeline`](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline), it offers the same methods as `LightPipeline`. Let's review them one by one.

### 💡 `annotate` Method

➤ The `.annotate` method returns a dictionary whose keys are the output columns of the annotators and whose values are their results.

➤ `.annotate` output contains only the `result` text of each annotation, which makes it easy to inspect.

Let's show an example using our `sample_text`.

In [None]:
print(sample_text)


Peter is a very good persn.
My life in Russia is very intersting.
John and Peter are brthers. However they don't support each other that much.



In [None]:
annotate_result = pipeline.annotate(sample_text)
annotate_result.keys()

dict_keys(['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
annotate_result

{'entities': ['Peter', 'Russia', 'John', 'Peter'],
 'stem': ['peter',
  'i',
  'a',
  'veri',
  'good',
  'person',
  '.',
  'my',
  'life',
  'in',
  'russia',
  'i',
  'veri',
  'interest',
  '.',
  'john',
  'and',
  'peter',
  'ar',
  'brother',
  '.',
  'howev',
  'thei',
  "don't",
  'support',
  'each',
  'other',
  'that',
  'much',
  '.'],
 'checked': ['Peter',
  'is',
  'a',
  'very',
  'good',
  'person',
  '.',
  'My',
  'life',
  'in',
  'Russia',
  'is',
  'very',
  'interesting',
  '.',
  'John',
  'and',
  'Peter',
  'are',
  'brothers',
  '.',
  'However',
  'they',
  "don't",
  'support',
  'each',
  'other',
  'that',
  'much',
  '.'],
 'lemma': ['Peter',
  'be',
  'a',
  'very',
  'good',
  'person',
  '.',
  'My',
  'life',
  'in',
  'Russia',
  'be',
  'very',
  'interest',
  '.',
  'John',
  'and',
  'Peter',
  'be',
  'brother',
  '.',
  'However',
  'they',
  "don't",
  'support',
  'each',
  'other',
  'that',
  'much',
  '.'],
 'document': ["\nPeter is a very

In [None]:
annotate_result["sentence"]

['Peter is a very good persn.',
 'My life in Russia is very intersting.',
 'John and Peter are brthers.',
 "However they don't support each other that much."]

In [None]:
# print corrected tokens and NER results together

list(zip(annotate_result["checked"], annotate_result["ner"]))

[('Peter', 'B-PER'),
 ('is', 'O'),
 ('a', 'O'),
 ('very', 'O'),
 ('good', 'O'),
 ('person', 'O'),
 ('.', 'O'),
 ('My', 'O'),
 ('life', 'O'),
 ('in', 'O'),
 ('Russia', 'B-LOC'),
 ('is', 'O'),
 ('very', 'O'),
 ('interesting', 'O'),
 ('.', 'O'),
 ('John', 'B-PER'),
 ('and', 'O'),
 ('Peter', 'B-PER'),
 ('are', 'O'),
 ('brothers', 'O'),
 ('.', 'O'),
 ('However', 'O'),
 ('they', 'O'),
 ("don't", 'O'),
 ('support', 'O'),
 ('each', 'O'),
 ('other', 'O'),
 ('that', 'O'),
 ('much', 'O'),
 ('.', 'O')]

In [None]:
# results in pandas df

import pandas as pd

df = pd.DataFrame({'token':annotate_result['token'], 
                   'spell_corrected':annotate_result['checked'], 
                   'POS':annotate_result['pos'],
                   'lemmas':annotate_result['lemma'], 
                   'stems':annotate_result['stem'],
                   'ner_label':annotate_result['ner']})

df

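| | token | spell_corrected | POS | lemmas | stems | ner_label |
|---|---|---|---|---|---|---|
| 0 | Peter | Peter | NNP | Peter | peter | B-PER |
| 1 | is | is | VBZ | be | i | O |
| 2 | a | a | DT | a | a | O |
| 3 | very | very | RB | very | veri | O |
| 4 | good | good | JJ | good | good | O |
| 5 | persn | person | NN | person | person | O |
| 6 | . | . | . | . | . | O |
| 7 | My | My | PRP$ | My | my | O |
| 8 | life | life | NN | life | life | O |
| 9 | in | in | IN | in | in | O |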


➤ We can also pass a list of strings to the `.annotate` method. In this case, it returns a list of dictionaries, one per input string, each containing the results for that string.

In [None]:
sample_list

['Lucas Dunbercker is no longer happy. He has a good car though.',
 'Europe is very culture rich. Thre are huge churches and big houses!']

In [None]:
annotate_list_result = pipeline.annotate(sample_list)
annotate_list_result[0].keys()

dict_keys(['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
# the result list has the same length as the input list of strings

len(annotate_list_result)

2

In [None]:
annotate_list_result

[{'entities': ['Lucas Dunbercker'],
  'stem': ['luca',
   'dunberck',
   'i',
   'no',
   'longer',
   'happi',
   '.',
   'he',
   'ha',
   'a',
   'good',
   'car',
   'though',
   '.'],
  'checked': ['Lucas',
   'Dunbercker',
   'is',
   'no',
   'longer',
   'happy',
   '.',
   'He',
   'has',
   'a',
   'good',
   'car',
   'though',
   '.'],
  'lemma': ['Lucas',
   'Dunbercker',
   'be',
   'no',
   'long',
   'happy',
   '.',
   'He',
   'have',
   'a',
   'good',
   'car',
   'though',
   '.'],
  'document': ['Lucas Dunbercker is no longer happy. He has a good car though.'],
  'pos': ['NNP',
   'NNP',
   'VBZ',
   'DT',
   'RB',
   'JJ',
   '.',
   'PRP',
   'VBZ',
   'DT',
   'JJ',
   'NN',
   'IN',
   '.'],
  'token': ['Lucas',
   'Dunbercker',
   'is',
   'no',
   'longer',
   'happy',
   '.',
   'He',
   'has',
   'a',
   'good',
   'car',
   'though',
   '.'],
  'ner': ['B-PER',
   'I-PER',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   'O',
   

➤ Let's show the `ner` labels and the corrected tokens for the second text in the list.

In [None]:
# print corrected tokens and NER results together for the second text

list(zip(annotate_list_result[1]["checked"], annotate_list_result[1]["ner"]))

[('Europe', 'B-LOC'),
 ('is', 'O'),
 ('very', 'O'),
 ('culture', 'O'),
 ('rich', 'O'),
 ('.', 'O'),
 ('There', 'O'),
 ('are', 'O'),
 ('huge', 'O'),
 ('churches', 'O'),
 ('and', 'O'),
 ('big', 'O'),
 ('houses', 'O'),
 ('!', 'O')]
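
The `entities` key in the results holds multi-token chunks such as `Lucas Dunbercker`, while `ner` holds one IOB label per token. The sketch below shows, in plain Python, how per-token labels can be grouped into chunks. It only illustrates the idea and is not Spark NLP's NER Converter implementation: the helper `iob_to_chunks` is a hypothetical name, tokens are joined with a single space as a simplification, and the inputs are copied from the outputs shown above (the first sentence of the first text and the whole second text).

```python
def iob_to_chunks(tokens, labels):
    """Group IOB-tagged tokens into (text, entity_type) chunks."""
    chunks = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition('-')
        if prefix == 'I' and current_type == entity_type:
            # Continuation of the chunk that is currently open.
            current_tokens.append(token)
            continue
        # Any other label closes the open chunk, if there is one.
        if current_tokens:
            chunks.append((' '.join(current_tokens), current_type))
        if prefix in ('B', 'I'):
            # A stray I- tag is treated as the start of a new chunk.
            current_tokens, current_type = [token], entity_type
        else:
            current_tokens, current_type = [], None
    if current_tokens:
        chunks.append((' '.join(current_tokens), current_type))
    return chunks


# First sentence of the first text (tokens and labels copied from the output above).
tokens_1 = ['Lucas', 'Dunbercker', 'is', 'no', 'longer', 'happy', '.']
labels_1 = ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O']

# Second text (copied from the zipped output above).
tokens_2 = ['Europe', 'is', 'very', 'culture', 'rich', '.', 'There',
            'are', 'huge', 'churches', 'and', 'big', 'houses', '!']
labels_2 = ['B-LOC'] + ['O'] * 13

chunks_1 = iob_to_chunks(tokens_1, labels_1)
chunks_2 = iob_to_chunks(tokens_2, labels_2)
print(chunks_1)  # [('Lucas Dunbercker', 'PER')]
print(chunks_2)  # [('Europe', 'LOC')]
```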

### 💡 `fullAnnotate` Method

➤ The `.fullAnnotate` method annotates the provided data and returns the results as *Annotation* objects. It returns a list of dictionaries whose keys are the output columns of the annotators and whose values are their results.

➤ `.fullAnnotate` results include `begin`, `end`, `result`, and `metadata` fields, which are useful for inspecting the results in detail or for using them in downstream tasks.

Let's show an example using our `sample_text`.

In [None]:
fullAnnotate_result = pipeline.fullAnnotate(sample_text)
fullAnnotate_result[0].keys()

dict_keys(['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
fullAnnotate_result

[{'entities': [Annotation(chunk, 1, 5, Peter, {'entity': 'PER', 'sentence': '0', 'chunk': '0'}),
   Annotation(chunk, 40, 45, Russia, {'entity': 'LOC', 'sentence': '1', 'chunk': '1'}),
   Annotation(chunk, 67, 70, John, {'entity': 'PER', 'sentence': '2', 'chunk': '2'}),
   Annotation(chunk, 76, 80, Peter, {'entity': 'PER', 'sentence': '2', 'chunk': '3'})],
  'stem': [Annotation(token, 1, 5, peter, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 7, 8, i, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 10, 10, a, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 12, 15, veri, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 17, 20, good, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 22, 26, person, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 27, 27, ., {'confidence': '0.0', 'sentence': '0'}),
   Annotation(token, 29, 30, my, {'confidence': '1.0', 'sentence': '1'}),
   Annotation(token, 32, 35, life, {'

In [None]:
fullAnnotate_result[0]["sentence"]

[Annotation(document, 1, 27, Peter is a very good persn., {'sentence': '0'}),
 Annotation(document, 29, 65, My life in Russia is very intersting., {'sentence': '1'}),
 Annotation(document, 67, 93, John and Peter are brthers., {'sentence': '2'}),
 Annotation(document, 95, 142, However they don't support each other that much., {'sentence': '3'})]

In [None]:
fullAnnotate_result[0]["entities"]

[Annotation(chunk, 1, 5, Peter, {'entity': 'PER', 'sentence': '0', 'chunk': '0'}),
 Annotation(chunk, 40, 45, Russia, {'entity': 'LOC', 'sentence': '1', 'chunk': '1'}),
 Annotation(chunk, 67, 70, John, {'entity': 'PER', 'sentence': '2', 'chunk': '2'}),
 Annotation(chunk, 76, 80, Peter, {'entity': 'PER', 'sentence': '2', 'chunk': '3'})]
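
The `begin` and `end` values are character offsets into the input text, and `end` is inclusive. The following plain-Python check illustrates this with the values copied from the output above (it does not call Spark NLP; the tuples are a stand-in for the `Annotation` objects). It also groups the entities by the `sentence` id found in their metadata.

```python
from collections import defaultdict

sample_text = '''
Peter is a very good persn.
My life in Russia is very intersting.
John and Peter are brthers. However they don't support each other that much.
'''

# (begin, end, result, sentence id) copied from the fullAnnotate output above.
entities = [
    (1, 5, 'Peter', '0'),
    (40, 45, 'Russia', '1'),
    (67, 70, 'John', '2'),
    (76, 80, 'Peter', '2'),
]

by_sentence = defaultdict(list)
for begin, end, result, sentence_id in entities:
    # `end` is inclusive, so the Python slice needs end + 1.
    assert sample_text[begin:end + 1] == result
    by_sentence[int(sentence_id)].append(result)

print(dict(by_sentence))  # {0: ['Peter'], 1: ['Russia'], 2: ['John', 'Peter']}
```

Note that the offsets start at 1 for the first `Peter` because `sample_text` begins with a newline character.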

➤ Let's show the entity results in a pandas dataframe.

In [None]:
import pandas as pd

begin= []
end= []
entities= []
labels = []
sentence_ids = []

for i in fullAnnotate_result[0]["entities"]:
    entities.append(i.result)
    begin.append(i.begin)
    end.append(i.end)
    labels.append(i.metadata["entity"])
    sentence_ids.append(i.metadata["sentence"])

print(sample_text)
result_df= pd.DataFrame({"sentence_id":sentence_ids, "begin": begin, "end": end, "entity": entities, "label":labels})
result_df


Peter is a very good persn.
My life in Russia is very intersting.
John and Peter are brthers. However they don't support each other that much.



| | sentence_id | begin | end | entity | label |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 5 | Peter | PER |
| 1 | 1 | 40 | 45 | Russia | LOC |
| 2 | 2 | 67 | 70 | John | PER |
| 3 | 2 | 76 | 80 | Peter | PER |


➤ Let's use the `.fullAnnotate` method with a list of strings.

In [None]:
fullAnnotate_list_result = pipeline.fullAnnotate(sample_list)

In [None]:
# the result list has the same length as the input list of strings

len(fullAnnotate_list_result)

2

In [None]:
fullAnnotate_list_result[0].keys()

dict_keys(['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence'])

In [None]:
fullAnnotate_list_result

[{'entities': [Annotation(chunk, 0, 15, Lucas Dunbercker, {'entity': 'PER', 'sentence': '0', 'chunk': '0'})],
  'stem': [Annotation(token, 0, 4, luca, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 6, 15, dunberck, {'confidence': '0.0', 'sentence': '0'}),
   Annotation(token, 17, 18, i, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 20, 21, no, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 23, 28, longer, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 30, 34, happi, {'confidence': '1.0', 'sentence': '0'}),
   Annotation(token, 35, 35, ., {'confidence': '0.0', 'sentence': '0'}),
   Annotation(token, 37, 38, he, {'confidence': '1.0', 'sentence': '1'}),
   Annotation(token, 40, 42, ha, {'confidence': '1.0', 'sentence': '1'}),
   Annotation(token, 44, 44, a, {'confidence': '1.0', 'sentence': '1'}),
   Annotation(token, 46, 49, good, {'confidence': '1.0', 'sentence': '1'}),
   Annotation(token, 51, 53, car, {'confidence': '1.0', '

In [None]:
# entities of first text in the list

fullAnnotate_list_result[0]["entities"]

[Annotation(chunk, 0, 15, Lucas Dunbercker, {'entity': 'PER', 'sentence': '0', 'chunk': '0'})]

In [None]:
# entities of second text in the list

fullAnnotate_list_result[1]["entities"]

[Annotation(chunk, 0, 5, Europe, {'entity': 'LOC', 'sentence': '0', 'chunk': '0'})]

### 💡 `transform` Method

➤ We can use the `.transform` method to process Spark DataFrames with `PretrainedPipelines`. This method runs the underlying `PipelineModel` (the fitted pipeline) and returns a Spark DataFrame. The result is the same as if we had built the same pipeline ourselves and transformed the data.

Now we will show an example using our Spark DataFrame named `data`.

In [None]:
data.show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                                |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
|\nPeter is a very good persn.\nMy life in Russia is very intersting.\nJohn and Peter are brthers. However they don't support each other that much.\n|
+----------------------------------------------------------------------------------------------------------------------------------------------------+



In [None]:
transform_result = pipeline.transform(data)
transform_result.show(truncate=50)

+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                              text|                                          document|                                          sentence|                                             token|                                           checked|                                             lemma|                                              stem|                                               pos|                            

In [None]:
transform_result.select("entities").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities                                                                                                                                                                                                                                                                                      |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 1, 5, Peter, {entity -> PER, sentence -> 0, chunk -> 0}, []}, {chunk, 40, 45, Russia, {entity -> LOC, sentence -> 1, chunk 

In [None]:
from pyspark.sql import functions as F

result_df = transform_result.select(F.explode(F.arrays_zip(transform_result.token.result,
                                                           transform_result.ner.result)).alias("cols"))\
                            .select(F.expr("cols['0']").alias("token"),
                                    F.expr("cols['1']").alias("ner_label"))

result_df.show(50, truncate=100)

+----------+---------+
|     token|ner_label|
+----------+---------+
|     Peter|    B-PER|
|        is|        O|
|         a|        O|
|      very|        O|
|      good|        O|
|     persn|        O|
|         .|        O|
|        My|        O|
|      life|        O|
|        in|        O|
|    Russia|    B-LOC|
|        is|        O|
|      very|        O|
|intersting|        O|
|         .|        O|
|      John|    B-PER|
|       and|        O|
|     Peter|    B-PER|
|       are|        O|
|   brthers|        O|
|         .|        O|
|   However|        O|
|      they|        O|
|     don't|        O|
|   support|        O|
|      each|        O|
|     other|        O|
|      that|        O|
|      much|        O|
|         .|        O|
+----------+---------+



In [None]:
# ner_chunks

chunk_result_df = transform_result.select(F.explode(F.arrays_zip(transform_result.entities.result,
                                                                 transform_result.entities.metadata)).alias("cols"))\
                                  .select(F.expr("cols['0']").alias("entity"),
                                          F.expr("cols['1']['entity']").alias("ner_label"))
      
chunk_result_df.show(50, truncate=100)

+------+---------+
|entity|ner_label|
+------+---------+
| Peter|      PER|
|Russia|      LOC|
|  John|      PER|
| Peter|      PER|
+------+---------+



## Get Embeddings Using `PretrainedPipeline`

As the `annotate` and `fullAnnotate` results above show, the embedding vectors of the tokens are not shown by default. To get the embeddings, we need to set `parse_embeddings=True` when loading the `PretrainedPipeline`.

Let's use the `onto_recognize_entities_sm` pretrained pipeline to show the embeddings of the tokens.

In [None]:
onto_pipeline = PretrainedPipeline('onto_recognize_entities_sm', lang = 'en', parse_embeddings=True)
# annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]

onto_recognize_entities_sm download started this may take some time.
Approx size to download 160.1 MB
[OK!]


In [59]:
onto_pipeline.model.stages

[document_a038d5e3dbd6,
 SENTENCE_b9edee8e6e45,
 REGEX_TOKENIZER_92d77cf41cbc,
 WORD_EMBEDDINGS_MODEL_48cffc8b9a76,
 NerDLModel_bf2f1fa3f2d5,
 NER_CONVERTER_5c35aa46cebb]

➤ Let's use `annotate` method to get the embeddings.

In [61]:
annotate_results_emb = onto_pipeline.annotate(sample_text)
annotate_results_emb.keys()

dict_keys(['entities', 'document', 'token', 'ner', 'embeddings', 'sentence'])

In [66]:
# token-ner-embeddings in pandas df

df = pd.DataFrame({'token':annotate_results_emb['token'], 
                   'ner_label':annotate_results_emb['ner'],
                   'embeddings': annotate_results_emb['embeddings']
                   })

df

| | token | ner_label | embeddings |
|---|---|---|---|
| 0 | Peter | B-PERSON | -0.12434 0.27086 -0.25726 -0.92575 0.28346 -0.... |
| 1 | is | O | -0.54264 0.41476 1.0322 -0.40244 0.46691 0.218... |
| 2 | a | O | -0.27086 0.044006 -0.02026 -0.17395 0.6444 0.7... |
| 3 | very | O | -0.84136 0.30985 0.05817 -0.1282 -0.57563 -0.0... |
| 4 | good | O | -0.030769 0.11993 0.53909 -0.43696 -0.73937 -0... |
| 5 | persn. | O | 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.... |
| 6 | My | O | 0.080273 -0.10861 0.72067 -0.45136 -0.7496 0.6... |
| 7 | life | O | 0.25157 0.4589 0.30274 0.12461 0.15062 0.7373 ... |
| 8 | in | O | 0.085703 -0.22201 0.16569 0.13373 0.38239 0.35... |
| 9 | Russia | B-GPE | 0.21537 0.71956 1.7838 1.2954 0.3855 -0.95089 ... |
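
In the table above, each embedding appears as a single string of space-separated numbers. A minimal sketch of turning such strings into lists of floats is shown below. The `rows` values are truncated excerpts copied from the table (only the first five numbers of each vector), `parse_vector` is a hypothetical helper name, and reading an all-zero vector as "token not found in the embeddings vocabulary" is an assumption based on the `persn.` row.

```python
# Truncated excerpts copied from the table above; real vectors have more dimensions.
rows = [
    ('Peter', '-0.12434 0.27086 -0.25726 -0.92575 0.28346'),
    ('persn.', '0.0 0.0 0.0 0.0 0.0'),
    ('Russia', '0.21537 0.71956 1.7838 1.2954 0.3855'),
]


def parse_vector(text):
    """Convert a space-separated string of numbers into a list of floats."""
    return [float(value) for value in text.split()]


vectors = {token: parse_vector(text) for token, text in rows}

# Assumption: an all-zero vector suggests the token has no pretrained embedding.
zero_tokens = [token for token, vector in vectors.items() if not any(vector)]

print(vectors['Peter'][:2])  # [-0.12434, 0.27086]
print(zero_tokens)           # ['persn.']
```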


# 🔎 Custom `PretrainedPipeline`

You can create a custom pipeline from Spark NLP annotators, save it, and then use it as a `PretrainedPipeline` by loading it from disk.

Let's create a pretrained pipeline that cleans stop words.



In [69]:
from pyspark.ml import Pipeline

from sparknlp.base import *
from sparknlp.annotator import *

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
      .setInputCols(['document'])\
      .setOutputCol('sentence')

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

stopwords_cleaner = StopWordsCleaner()\
    .setInputCols(["token"])\
    .setOutputCol("cleanTokens")\
    .setCaseSensitive(False)
    # optional: .setStopWords(["no", "without"]) (e.g. read a list of words from a txt)

nlpPipeline = Pipeline(stages=[documentAssembler, 
                               sentenceDetector,
                               tokenizer,
                               stopwords_cleaner])

# fit model
sw_cleaner_model = nlpPipeline.fit(data)

In [70]:
sw_cleaner_model.stages

[DocumentAssembler_a0afe31b18b0,
 SentenceDetector_a6ae0e8aa182,
 REGEX_TOKENIZER_4b7641f0c468,
 StopWordsCleaner_f5592152b50d]

➤ Now we will save the model.

In [71]:
sw_cleaner_model.write().overwrite().save('clean_stopwords_pipeline')

➤ Let's load our pipeline with the `.from_disk()` method and use it as a pretrained pipeline.

In [72]:
custom_pipeline = PretrainedPipeline.from_disk("clean_stopwords_pipeline")
custom_pipeline.model.stages

[DocumentAssembler_a0afe31b18b0,
 SentenceDetector_a6ae0e8aa182,
 REGEX_TOKENIZER_4b7641f0c468,
 StopWordsCleaner_f5592152b50d]

In [79]:
text = 'Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'

sw_result = custom_pipeline.annotate(text)
sw_result

{'document': ['Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'],
 'sentence': ['Peter Parker (Spiderman) is a nice guy and lives in New York but has no e-mail!'],
 'token': ['Peter',
  'Parker',
  '(',
  'Spiderman',
  ')',
  'is',
  'a',
  'nice',
  'guy',
  'and',
  'lives',
  'in',
  'New',
  'York',
  'but',
  'has',
  'no',
  'e-mail',
  '!'],
 'cleanTokens': ['Peter',
  'Parker',
  '(',
  'Spiderman',
  ')',
  'nice',
  'guy',
  'lives',
  'New',
  'York',
  'e-mail',
  '!']}

In [80]:
sw_result["token"]

['Peter',
 'Parker',
 '(',
 'Spiderman',
 ')',
 'is',
 'a',
 'nice',
 'guy',
 'and',
 'lives',
 'in',
 'New',
 'York',
 'but',
 'has',
 'no',
 'e-mail',
 '!']

In [81]:
sw_result["cleanTokens"]

['Peter',
 'Parker',
 '(',
 'Spiderman',
 ')',
 'nice',
 'guy',
 'lives',
 'New',
 'York',
 'e-mail',
 '!']
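
To see exactly which tokens the stop words cleaner dropped, the two lists can be compared directly. The following is a small plain-Python sketch using the `token` and `cleanTokens` values copied from the output above; it does not call Spark NLP, and the variable names are chosen for illustration only.

```python
from collections import Counter

# Values copied from sw_result["token"] and sw_result["cleanTokens"] above.
tokens = ['Peter', 'Parker', '(', 'Spiderman', ')', 'is', 'a', 'nice', 'guy',
          'and', 'lives', 'in', 'New', 'York', 'but', 'has', 'no', 'e-mail', '!']
clean_tokens = ['Peter', 'Parker', '(', 'Spiderman', ')', 'nice', 'guy',
                'lives', 'New', 'York', 'e-mail', '!']

# Count the surviving tokens so that repeated words are handled correctly.
remaining = Counter(clean_tokens)
removed = []
for token in tokens:
    if remaining[token] > 0:
        remaining[token] -= 1   # token survived the cleaner
    else:
        removed.append(token)   # token was dropped as a stop word

print(removed)  # ['is', 'a', 'and', 'in', 'but', 'has', 'no']
```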