![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/chunking/NgramGenerator.ipynb)

## 0. Colab Setup

In [0]:
import os

# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed -q spark-nlp==2.5.0

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)



### NGramGenerator

The `NGramGenerator` annotator takes a sequence of tokens as input (e.g. the output of a `Tokenizer`, `Normalizer`, `Stemmer`, `Lemmatizer`, or `StopWordsCleaner`). The parameter `n` sets the number of terms in each n-gram. The output is a sequence of n-grams, each represented as a space-delimited string of `n` consecutive words, with annotatorType `CHUNK`, the same type the `Chunker` annotator produces.
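To make the description concrete, here is a minimal pure-Python sketch of the sliding window that the paragraph above describes. It is not the Spark NLP implementation; the helper name `ngrams` is hypothetical, and splitting on whitespace stands in for the `Tokenizer` only because the sample sentence has no punctuation.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-delimited string."""
    if n < 1:
        raise ValueError("n must be >= 1")
    # A window of width n fits at start positions 0 .. len(tokens) - n.
    # If there are fewer than n tokens, the range is empty and so is the result.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace split is an assumption standing in for the Tokenizer stage.
tokens = "Cloud computing is benefiting major manufacturing companies".split()

bigram_list = ngrams(tokens, 2)
trigram_list = ngrams(tokens, 3)

print(bigram_list)
print(trigram_list)
```

A sentence with fewer than `n` tokens yields an empty list in this sketch, because no window of that width fits.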

**Output type:** CHUNK  
**Input types:** TOKEN  
**Reference:** [NGramGenerator](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/annotators/NGramGenerator.scala)  
**Functions:**

- setN: number of elements per n-gram (>=1)
- setEnableCumulative: whether to emit only the n-grams of length n (false) or all n-grams of lengths 1 through n (true)

**Example:**

Refer to the [NGramGenerator](https://nlp.johnsnowlabs.com/api/index#com.johnsnowlabs.nlp.annotators.NGramGenerator) Scala docs for more details on the API.

```python
ngrams_cum = NGramGenerator() \
            .setInputCols(["token"]) \
            .setOutputCol("ngrams") \
            .setN(2) \
            .setEnableCumulative(True)
```

```scala
val nGrams = new NGramGenerator()
      .setInputCols("token")
      .setOutputCol("ngrams")
      .setN(2)
      .setEnableCumulative(true)
```
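Both snippets above enable cumulative mode with `n = 2`, which per the parameter description yields unigrams as well as bigrams. The following pure-Python sketch illustrates the difference; the helper `cumulative_ngrams` is hypothetical, and the shortest-first ordering of its output is an assumption rather than a documented guarantee of the annotator.

```python
def cumulative_ngrams(tokens, n):
    """Return every k-gram for k = 1..n as space-delimited strings.

    Ordering (all 1-grams, then all 2-grams, ...) is an assumption
    made for this illustration.
    """
    out = []
    for k in range(1, n + 1):
        out.extend(" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return out


tokens = ["Big", "data", "cloud"]

# Non-cumulative: only windows of width 2.
only_bigrams = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
# Cumulative: windows of width 1 and 2.
with_unigrams = cumulative_ngrams(tokens, 2)

print(only_bigrams)   # ['Big data', 'data cloud']
print(with_unigrams)  # ['Big', 'data', 'cloud', 'Big data', 'data cloud']
```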


In [0]:
import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *
from pyspark.sql.types import StringType

In [0]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  2.5.0
Apache Spark version:  2.4.4


In [0]:
dfTest = spark.createDataFrame([
    "Cloud computing is benefiting major manufacturing companies",
    "Big data cloud computing cyber security machine learning"
], StringType()).toDF("text")

In [0]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
    
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

bigrams = NGramGenerator() \
            .setInputCols(["token"]) \
            .setOutputCol("bigrams") \
            .setN(2)

# Cumulative mode is not enabled here, so only 3-grams are produced
trigrams = NGramGenerator() \
            .setInputCols(["token"]) \
            .setOutputCol("trigrams") \
            .setN(3)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    bigrams,
    trigrams
])


#### Use the Pipeline in Spark (DataFrame)

In [0]:
model = pipeline.fit(dfTest)
prediction = model.transform(dfTest)

In [0]:
prediction.select("bigrams.result").show(2, truncate=60)

+------------------------------------------------------------+
|                                                      result|
+------------------------------------------------------------+
|[Cloud computing, computing is, is benefiting, benefiting...|
|[Big data, data cloud, cloud computing, computing cyber, ...|
+------------------------------------------------------------+



In [0]:
prediction.select("trigrams.result").show(2, truncate=60)

+------------------------------------------------------------+
|                                                      result|
+------------------------------------------------------------+
|[Cloud computing is, computing is benefiting, is benefiti...|
|[Big data cloud, data cloud computing, cloud computing cy...|
+------------------------------------------------------------+



#### Use the Pipeline in Python (string)

In [0]:
from sparknlp.base import LightPipeline

text = 'Cloud computing is benefiting major manufacturing companies'

In [0]:
result = LightPipeline(model).annotate(text)

In [0]:
list(result.keys())

['document', 'token', 'bigrams', 'trigrams']

In [0]:
result['bigrams']

['Cloud computing',
 'computing is',
 'is benefiting',
 'benefiting major',
 'major manufacturing',
 'manufacturing companies']

In [0]:
result['trigrams']

['Cloud computing is',
 'computing is benefiting',
 'is benefiting major',
 'benefiting major manufacturing',
 'major manufacturing companies']
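As a quick sanity check on the results above: a sentence of T tokens should produce T - n + 1 n-grams when cumulative mode is off, so the 7-token sample gives 6 bigrams and 5 trigrams. The sketch below verifies this against the lists printed above, copied in as literals so it runs without Spark. The helper `looks_consistent` is hypothetical, and counting tokens with a whitespace split is an assumption that holds here only because the sentence has no punctuation.

```python
# Output lists copied from the LightPipeline results above.
bigrams_out = [
    "Cloud computing", "computing is", "is benefiting",
    "benefiting major", "major manufacturing", "manufacturing companies",
]
trigrams_out = [
    "Cloud computing is", "computing is benefiting", "is benefiting major",
    "benefiting major manufacturing", "major manufacturing companies",
]

text = "Cloud computing is benefiting major manufacturing companies"
token_count = len(text.split())  # whitespace split assumed to match the Tokenizer here


def looks_consistent(grams, n, token_count):
    """True if grams has token_count - n + 1 entries, each exactly n words long."""
    expected = max(token_count - n + 1, 0)
    return len(grams) == expected and all(len(g.split(" ")) == n for g in grams)


print(looks_consistent(bigrams_out, 2, token_count))
print(looks_consistent(trigrams_out, 3, token_count))
```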