![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/35.01.LightPipeline.ipynb)

# 🔎 What is `LightPipeline`?

`LightPipelines` are Spark NLP specific pipelines, equivalent to a *Spark ML Pipeline*. The difference is that their execution does not follow Spark's distributed model; instead, everything is computed locally (but in parallel) to return results quickly when dealing with small amounts of data.

They're useful when working with small datasets, experimenting, and debugging results.

They're especially useful for building APIs that serve real-time requests.


# 🔎 How does `LightPipeline` work?

Spark NLP `LightPipelines` are Spark ML pipelines converted to run on a single machine.

**They do not leverage the full Spark cluster; they run in local mode (on the driver) only.**

**They can be about 10x faster for smaller amounts of data (small is relative, but 50k sentences is roughly a good maximum).**




# 🔎 How can `LightPipeline` be used?

Full pipelines are cast to `LightPipelines` to remove Spark overhead. In this process, two new functions are introduced:

- `annotate()`
- `fullAnnotate()`

**Difference in input**:
While full pipelines require PySpark DataFrames as input, a light pipeline takes a single string or a list of strings.


# 📚 Documentation

```
LightPipeline(pipelineModel: PipelineModel, parse_embeddings: bool = False)
```


You can check our [Python API](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/base/light_pipeline/index.html#sparknlp.base.light_pipeline.LightPipeline) and [ScalaDoc](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/LightPipeline.html) for more details about `LightPipeline`.

## Colab Setup

In [1]:
!pip install -q pyspark==3.3.0 spark-nlp==4.2.4

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m281.3/281.3 MB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m448.4/448.4 KB[0m [31m25.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.7/199.7 KB[0m [31m18.1 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for pyspark (setup.py) ... [?25l[?25hdone


In [2]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 4.2.4
Apache Spark version: 3.3.0


# LightPipeline

Now, let's create a Spark NLP pipeline that tokenizes the text and computes word embeddings for the tokens, and then use it with `LightPipeline`.

In [3]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

glove_embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages = [
      documentAssembler,
      sentenceDetector,
      tokenizer,
      glove_embeddings
    ])

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


➤ We will fit our pipeline on a Spark DataFrame to get a model.




In [4]:
sample_text = "Peter Pipers employees are picking pecks of pickled peppers. He had a good income last year."
sample_list = ["Peter Pipers employees are picking pecks of pickled peppers.", "He had a good income last year."]

data = spark.createDataFrame([[sample_text]]).toDF("text")

model = pipeline.fit(data)

In [75]:
%%time
res = model.transform(data)
res.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|Peter Pipers empl...|[{document, 0, 91...|[{document, 0, 59...|[{token, 0, 4, Pe...|[{word_embeddings...|
+--------------------+--------------------+--------------------+--------------------+--------------------+

CPU times: user 18.8 ms, sys: 6.03 ms, total: 24.9 ms
Wall time: 503 ms


➤ Now we will convert the fitted model to a `LightPipeline`.

In [82]:
light_model = LightPipeline(pipelineModel = model, parse_embeddings = False)

## 📌 `LightPipeline` Methods

### 💡 `transform` Method

➤ The `transform` method expects a PySpark DataFrame as input (for consistency with full pipelines).

In [48]:
%%time
res = light_model.transform(data)
res.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|Peter Pipers empl...|[{document, 0, 91...|[{document, 0, 59...|[{token, 0, 4, Pe...|[{word_embeddings...|
+--------------------+--------------------+--------------------+--------------------+--------------------+

CPU times: user 20.6 ms, sys: 7.12 ms, total: 27.7 ms
Wall time: 485 ms


### 💡 `fullAnnotate` Method

➤ The `.fullAnnotate` method of `LightPipeline` returns its results as *Annotation* objects. It returns a list of dictionaries, one per input text, with the annotators' output columns as keys and their annotations as values.

➤ `.fullAnnotate` results contain `begin`, `end`, `result`, and `metadata` information, which is useful for inspecting the results in detail or for using them in downstream tasks.

Let's show an example using our `sample_text`.

In [59]:
sample_text

'Peter Pipers employees are picking pecks of pickled peppers. He had a good income last year.'

In [56]:
%%time
fullAnnotate_result = light_model.fullAnnotate(sample_text)
fullAnnotate_result[0].keys()

CPU times: user 29.1 ms, sys: 9.62 ms, total: 38.7 ms
Wall time: 129 ms


dict_keys(['document', 'sentence', 'token', 'embeddings'])

In [58]:
fullAnnotate_result[0]

{'document': [Annotation(document, 0, 59, Peter Pipers employees are picking pecks of pickled peppers., {})],
 'sentence': [Annotation(document, 0, 59, Peter Pipers employees are picking pecks of pickled peppers., {'sentence': '0'})],
 'token': [Annotation(token, 0, 4, Peter, {'sentence': '0'}),
  Annotation(token, 6, 11, Pipers, {'sentence': '0'}),
  Annotation(token, 13, 21, employees, {'sentence': '0'}),
  Annotation(token, 23, 25, are, {'sentence': '0'}),
  Annotation(token, 27, 33, picking, {'sentence': '0'}),
  Annotation(token, 35, 39, pecks, {'sentence': '0'}),
  Annotation(token, 41, 42, of, {'sentence': '0'}),
  Annotation(token, 44, 50, pickled, {'sentence': '0'}),
  Annotation(token, 52, 58, peppers, {'sentence': '0'}),
  Annotation(token, 59, 59, ., {'sentence': '0'})],
 'embeddings': [Annotation(word_embeddings, 0, 4, Peter, {'isOOV': 'false', 'pieceId': '-1', 'isWordStart': 'true', 'token': 'Peter', 'sentence': '0'}),
  Annotation(word_embeddings, 6, 11, Pipers, {'isOOV'
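
The `begin` and `end` values printed above read as character offsets into the input text, with `end` apparently inclusive (`Peter` spans 0 to 4). The sketch below checks that reading against a few of the printed tokens. To run without a Spark session, it uses a small stand-in `Annotation` named tuple instead of the Spark NLP class; the stand-in and the helper name `slice_text` are assumptions made for illustration.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical stand-in mirroring the fields printed by fullAnnotate."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: dict


def slice_text(text: str, ann: Annotation) -> str:
    # `end` is treated as inclusive here, so add 1 for Python slicing.
    return text[ann.begin:ann.end + 1]


text = "Peter Pipers employees are picking pecks of pickled peppers."

# A few token annotations copied from the output above.
tokens = [
    Annotation("token", 0, 4, "Peter", {"sentence": "0"}),
    Annotation("token", 6, 11, "Pipers", {"sentence": "0"}),
    Annotation("token", 59, 59, ".", {"sentence": "0"}),
]

recovered = [slice_text(text, t) for t in tokens]
print(recovered)
```

If the slices match each annotation's `result`, the offsets can be used to highlight or extract spans from the original string.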

➤ We can also call `.fullAnnotate` with a list of strings. In this case, it returns a list of dictionaries, one for each item of the input list.

In [76]:
sample_list

['Peter Pipers employees are picking pecks of pickled peppers.',
 'He had a good income last year.']

In [77]:
fullAnnotate_result = light_model.fullAnnotate(sample_list)
print(len(fullAnnotate_result))
fullAnnotate_result[0].keys()

2


dict_keys(['document', 'sentence', 'token', 'embeddings'])

In [79]:
fullAnnotate_result[0]

{'document': [Annotation(document, 0, 59, Peter Pipers employees are picking pecks of pickled peppers., {})],
 'sentence': [Annotation(document, 0, 59, Peter Pipers employees are picking pecks of pickled peppers., {'sentence': '0'})],
 'token': [Annotation(token, 0, 4, Peter, {'sentence': '0'}),
  Annotation(token, 6, 11, Pipers, {'sentence': '0'}),
  Annotation(token, 13, 21, employees, {'sentence': '0'}),
  Annotation(token, 23, 25, are, {'sentence': '0'}),
  Annotation(token, 27, 33, picking, {'sentence': '0'}),
  Annotation(token, 35, 39, pecks, {'sentence': '0'}),
  Annotation(token, 41, 42, of, {'sentence': '0'}),
  Annotation(token, 44, 50, pickled, {'sentence': '0'}),
  Annotation(token, 52, 58, peppers, {'sentence': '0'}),
  Annotation(token, 59, 59, ., {'sentence': '0'})],
 'embeddings': [Annotation(word_embeddings, 0, 4, Peter, {'isOOV': 'false', 'pieceId': '-1', 'isWordStart': 'true', 'token': 'Peter', 'sentence': '0'}),
  Annotation(word_embeddings, 6, 11, Pipers, {'isOOV'

### 💡 `annotate` Method

➤ The `.annotate` method of `LightPipeline` returns a dictionary with the annotators' output columns as keys and their results as values.

➤ `.annotate` results contain only the result strings, without `begin`, `end`, or `metadata` information, which makes them easy to read at a glance.

Let's show an example using our `sample_text`.

In [83]:
%%time
annotate_result = light_model.annotate(sample_text)
annotate_result.keys()

CPU times: user 17.6 ms, sys: 2.1 ms, total: 19.7 ms
Wall time: 177 ms


dict_keys(['document', 'sentence', 'token', 'embeddings'])

In [85]:
annotate_result

{'document': ['Peter Pipers employees are picking pecks of pickled peppers. He had a good income last year.'],
 'sentence': ['Peter Pipers employees are picking pecks of pickled peppers.',
  'He had a good income last year.'],
 'token': ['Peter',
  'Pipers',
  'employees',
  'are',
  'picking',
  'pecks',
  'of',
  'pickled',
  'peppers',
  '.',
  'He',
  'had',
  'a',
  'good',
  'income',
  'last',
  'year',
  '.'],
 'embeddings': ['Peter',
  'Pipers',
  'employees',
  'are',
  'picking',
  'pecks',
  'of',
  'pickled',
  'peppers',
  '.',
  'He',
  'had',
  'a',
  'good',
  'income',
  'last',
  'year',
  '.']}
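
One way to read the relationship between the two methods is that the `annotate` output looks like the `fullAnnotate` output with only the `result` field kept. The sketch below illustrates that reading on hand-written data. It is not a description of how Spark NLP implements `annotate`; the `Annotation` named tuple, the helper `results_only`, and the sample offsets are assumptions made for illustration.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical stand-in for the objects printed by fullAnnotate."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: dict


def results_only(full_result: dict) -> dict:
    # Keep the output-column keys, but reduce each annotation to its result string.
    return {
        column: [ann.result for ann in annotations]
        for column, annotations in full_result.items()
    }


# Illustrative fullAnnotate-style dictionary for one short text.
full_result = {
    "sentence": [
        Annotation("document", 0, 30, "He had a good income last year.", {"sentence": "0"}),
    ],
    "token": [
        Annotation("token", 0, 1, "He", {"sentence": "0"}),
        Annotation("token", 3, 5, "had", {"sentence": "0"}),
    ],
}

flat = results_only(full_result)
print(flat)
```

When offsets or metadata are needed later, `fullAnnotate` is the better starting point, since the dropped fields cannot be recovered from the `annotate` output.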

➤ We can also call `.annotate` with a list of strings. In this case, it returns a list of dictionaries, one for each item of the input list.

In [86]:
sample_list

['Peter Pipers employees are picking pecks of pickled peppers.',
 'He had a good income last year.']

In [87]:
annotate_list_result = light_model.annotate(sample_list)
annotate_list_result[0].keys()

dict_keys(['document', 'sentence', 'token', 'embeddings'])

In [88]:
# the result list has the same length as the input list of strings

len(annotate_list_result)

2

In [89]:
annotate_list_result

[{'document': ['Peter Pipers employees are picking pecks of pickled peppers.'],
  'sentence': ['Peter Pipers employees are picking pecks of pickled peppers.'],
  'token': ['Peter',
   'Pipers',
   'employees',
   'are',
   'picking',
   'pecks',
   'of',
   'pickled',
   'peppers',
   '.'],
  'embeddings': ['Peter',
   'Pipers',
   'employees',
   'are',
   'picking',
   'pecks',
   'of',
   'pickled',
   'peppers',
   '.']},
 {'document': ['He had a good income last year.'],
  'sentence': ['He had a good income last year.'],
  'token': ['He', 'had', 'a', 'good', 'income', 'last', 'year', '.'],
  'embeddings': ['He', 'had', 'a', 'good', 'income', 'last', 'year', '.']}]

➤ Let's show the `token` result of the second text in the list.

In [90]:
annotate_list_result[1]["token"]

['He', 'had', 'a', 'good', 'income', 'last', 'year', '.']
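
Because the results shown above come back in the same order as the inputs, each text can be paired with its own result using `zip`. The sketch below counts tokens per input text. It works on a trimmed, hand-copied version of the `annotate` output (only the `token` key) so that it runs without a Spark session; the name `token_counts` is an assumption made for illustration.

```python
sample_list = [
    "Peter Pipers employees are picking pecks of pickled peppers.",
    "He had a good income last year.",
]

# Trimmed copy of the `annotate` output shown above (only the `token` key).
annotate_list_result = [
    {"token": ["Peter", "Pipers", "employees", "are", "picking",
               "pecks", "of", "pickled", "peppers", "."]},
    {"token": ["He", "had", "a", "good", "income", "last", "year", "."]},
]

# Pair each input text with its result and count the tokens.
token_counts = {
    text: len(result["token"])
    for text, result in zip(sample_list, annotate_list_result)
}

for text, count in token_counts.items():
    print(f"{count:>2} tokens: {text}")
```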


### Get Embeddings Using `LightPipeline`

As you can see in the results above, the `embeddings` output contains only the tokens, not the embedding vectors. To get the vectors, we need to set `parse_embeddings=True` when creating the `LightPipeline`.

In [93]:
light_model_emb = LightPipeline(pipelineModel = model, parse_embeddings=True)
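
With `parse_embeddings=True`, the `annotate` output in this notebook shows each vector as a single string of space-separated numbers. A minimal sketch for turning such a string into a list of floats follows, assuming that string format holds for every entry. The helper name `parse_vector` is hypothetical, and the example uses only the first three values of the first `embeddings` entry printed further down.

```python
from typing import List


def parse_vector(vector_text: str) -> List[float]:
    # Split on whitespace and convert each piece to a float.
    return [float(piece) for piece in vector_text.split()]


# First three values of the first `embeddings` entry in this notebook's output.
example = "-0.12434 0.27086 -0.25726"

vector = parse_vector(example)
print(vector)
print(len(vector))
```

The same helper could be mapped over every string in the `embeddings` list to obtain one numeric vector per token.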

➤ Let's use the `annotate` method to get the embeddings.

In [94]:
annotate_results_emb = light_model_emb.annotate(sample_text)
annotate_results_emb

{'document': ['Peter Pipers employees are picking pecks of pickled peppers. He had a good income last year.'],
 'sentence': ['Peter Pipers employees are picking pecks of pickled peppers.',
  'He had a good income last year.'],
 'token': ['Peter',
  'Pipers',
  'employees',
  'are',
  'picking',
  'pecks',
  'of',
  'pickled',
  'peppers',
  '.',
  'He',
  'had',
  'a',
  'good',
  'income',
  'last',
  'year',
  '.'],
 'embeddings': ['-0.12434 0.27086 -0.25726 -0.92575 0.28346 -0.21944 -0.25647 -0.3976 -0.57385 -0.68947 -0.013447 0.1228 0.026195 0.61443 0.27363 -0.76713 0.24401 0.11872 -0.95617 0.5759 -0.26431 0.27444 0.50889 0.075364 0.4246 -0.071953 0.35437 -0.20185 0.38063 0.58091 -0.47259 0.16159 -0.017361 -0.2855 -0.49036 -0.5123 -0.045206 0.47847 0.30254 -0.27687 -0.27672 0.095084 0.29336 0.25971 -0.14715 -0.23236 -0.91433 -0.87662 0.048327 0.32749 1.0647 -0.61005 0.1543 0.38154 -0.16749 -3.0263 -1.0712 0.44597 0.052224 -0.50453 0.38656 0.20731 -0.1338 0.1269 0.18352 -0.56448 -0.