![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/04.01.NGramGenerator.ipynb)

# NGramGenerator

This notebook will cover the different parameters and usages of `NGramGenerator`.

**📖 Learning Objectives:**

1. Understand how to use `NGramGenerator`.

2. Become familiar with the parameters and options available for the `NGramGenerator`.

**🔗 Helpful Links:**

- Documentation: [NGramGenerator](https://nlp.johnsnowlabs.com/docs/en/annotators#ngramgenerator)

- Python Docs: [NGramGenerator](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/n_gram_generator/index.html#sparknlp.annotator.n_gram_generator.NGramGenerator)

- Scala Docs: [NGramGenerator](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/NGramGenerator)

## **📜 Background**
N-grams are contiguous sequences of words, symbols, or tokens in a document. They are widely used as features when working with text data in NLP (Natural Language Processing) tasks. `NGramGenerator` is a feature transformer that converts the input array of strings (tokens) into an array of n-grams. Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented, by default, as a space-separated string of words.


'N' designates the number of neighbouring items ('grams') in each sequence. Common values are as follows:

- n = 1: Unigram
- n = 2: Bigram
- n = 3: Trigram
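
To make the definition concrete, here is a minimal pure-Python sketch of the sliding-window idea behind n-grams. It is only an illustration and does not use Spark NLP; the helper name `sliding_ngrams` is hypothetical.

```python
from typing import List


def sliding_ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every run of n neighbouring tokens, joined by a single space."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # A sequence of k tokens has k - n + 1 windows of length n (none if k < n).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "Big data cloud computing".split()

unigrams = sliding_ngrams(tokens, 1)  # n = 1
bigrams = sliding_ngrams(tokens, 2)   # n = 2
trigrams = sliding_ngrams(tokens, 3)  # n = 3

print(unigrams)
print(bigrams)
print(trigrams)
```

With four input tokens this yields four unigrams, three bigrams, and two trigrams, since each increase in `n` removes one possible window.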

## **🎬 Colab Setup**

In [None]:
!pip install -q pyspark==3.2.1 spark-nlp==4.2.5



In [None]:
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = sparknlp.start()
spark

## **🖨️ Input/Output Annotation Types**


- Input: `TOKEN`

- Output: `CHUNK`
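
The sketch below illustrates, in plain Python, what it means to turn `TOKEN`-like spans into `CHUNK`-like spans. It rests on two assumptions that are not stated in this notebook: that each chunk starts at the first token's `begin` and stops at the last token's `end`, and that `end` is an inclusive character index. The `Span` type and both helper functions are hypothetical and are not part of the Spark NLP API.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    begin: int   # index of the first character
    end: int     # index of the last character (assumed inclusive)
    result: str  # the text carried by the span


def whitespace_tokens(text: str) -> List[Span]:
    """Split on whitespace and keep character offsets (a stand-in for a tokenizer)."""
    return [Span(m.start(), m.end() - 1, m.group()) for m in re.finditer(r"\S+", text)]


def ngram_chunks(tokens: List[Span], n: int, delimiter: str = " ") -> List[Span]:
    """Combine each window of n token spans into a single chunk span."""
    chunks = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        joined = delimiter.join(t.result for t in window)
        # Assumption: the chunk covers the first token's begin to the last token's end.
        chunks.append(Span(window[0].begin, window[-1].end, joined))
    return chunks


text = "Big data cloud"
chunks = ngram_chunks(whitespace_tokens(text), 2)
for chunk in chunks:
    print(chunk)
```

In the actual pipeline, selecting the full output column (for example `bigrams`) instead of only `bigrams.result` is a way to check how the real annotations are populated.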

## **🔎 Parameters**
- `n`: (IntParam) Number of tokens in each n-gram; must be greater than or equal to 1 (Default: 2, i.e. bigram features).
- `enableCumulative`: (BooleanParam) Whether to return only n-grams of length n or all n-grams of lengths 1 through n (Default: false).
- `delimiter`: (String) Glue character used to join the tokens (Default: " ").
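
The way the three parameters combine can be sketched in pure Python as follows. This is an emulation under stated assumptions, not the library's implementation: the function `emulate_ngram_params` is hypothetical, and the output order in cumulative mode (shorter n-grams first) is assumed.

```python
from typing import List


def emulate_ngram_params(
    tokens: List[str],
    n: int = 2,
    enable_cumulative: bool = False,
    delimiter: str = " ",
) -> List[str]:
    """Emulate the effect of n, enableCumulative and delimiter on a token list."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # Cumulative mode covers every length from 1 to n; otherwise only length n.
    sizes = range(1, n + 1) if enable_cumulative else [n]
    ngrams = []
    for size in sizes:  # assumed order: shorter n-grams first
        for start in range(len(tokens) - size + 1):
            ngrams.append(delimiter.join(tokens[start:start + size]))
    return ngrams


tokens = ["Cloud", "computing", "is"]

default_out = emulate_ngram_params(tokens)
cumulative_out = emulate_ngram_params(tokens, n=2, enable_cumulative=True)
slash_out = emulate_ngram_params(tokens, delimiter="/")

print(default_out)
print(cumulative_out)
print(slash_out)
```

The defaults mirror those listed above (`n=2`, cumulative off, space delimiter), so calling the function with only a token list produces plain bigrams.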


### `setN`

In [None]:
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

bigrams = NGramGenerator() \
    .setInputCols(["token"]) \
    .setOutputCol("bigrams") \
    .setN(2)

trigrams = NGramGenerator() \
    .setInputCols(["token"]) \
    .setOutputCol("trigrams") \
    .setN(3)

pipeline = Pipeline(stages=[
    documentAssembler,
    tokenizer,
    bigrams,
    trigrams
])

data = spark.createDataFrame([
    "Cloud computing is benefiting major manufacturing companies",
    "Big data cloud computing cyber security machine learning"
], StringType()).toDF("text")

result = pipeline.fit(data).transform(data)


In [None]:
result.select("bigrams.result").show(2, truncate=False)

+--------------------------------------------------------------------------------------------------------------+
|result                                                                                                        |
+--------------------------------------------------------------------------------------------------------------+
|[Cloud computing, computing is, is benefiting, benefiting major, major manufacturing, manufacturing companies]|
|[Big data, data cloud, cloud computing, computing cyber, cyber security, security machine, machine learning]  |
+--------------------------------------------------------------------------------------------------------------+



In [None]:
result.select("trigrams.result").show(2, truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------------------+
|[Cloud computing is, computing is benefiting, is benefiting major, benefiting major manufacturing, major manufacturing companies]         |
|[Big data cloud, data cloud computing, cloud computing cyber, computing cyber security, cyber security machine, security machine learning]|
+------------------------------------------------------------------------------------------------------------------------------------------+



### `setEnableCumulative`
When `enableCumulative` is set to `True`, the annotator returns all n-grams of lengths 1 through n instead of only those of length n. The example below enables it on the `trigrams` stage.

In [None]:
trigrams.setEnableCumulative(True)

NGramGenerator_26fa38408f8a

In [None]:
data = spark.createDataFrame([
    "Cloud computing is benefiting major manufacturing companies",
    "Big data cloud computing cyber security machine learning"
], StringType()).toDF("text")

result = pipeline.fit(data).transform(data)

In [None]:
result.select("trigrams.result").show(2, truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Cloud, computing, is, benefiting, major, manufacturing, 

### `setDelimiter`
The delimiter is the string used to join the tokens of each n-gram. Setting it to `"/"` joins the tokens with `/` instead of the default space.

In [None]:
bigrams.setDelimiter("/")

NGramGenerator_0fd4f0e688c6

In [None]:
data = spark.createDataFrame([
    "Cloud computing is benefiting major manufacturing companies",
], StringType()).toDF("text")

result = pipeline.fit(data).transform(data)

In [None]:
result.select("bigrams.result").show(2, truncate=False)

+--------------------------------------------------------------------------------------------------------------+
|result                                                                                                        |
+--------------------------------------------------------------------------------------------------------------+
|[Cloud/computing, computing/is, is/benefiting, benefiting/major, major/manufacturing, manufacturing/companies]|
+--------------------------------------------------------------------------------------------------------------+

