![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

### Install `spark-nlp` in Python

* pip

```bash
pip install spark-nlp==2.3.4
```

* Conda

```bash
conda install -c johnsnowlabs spark-nlp==2.3.4
```

### NGramGenerator

The `NGramGenerator` annotator takes a sequence of tokens as input (e.g. the output of a `Tokenizer`, `Normalizer`, `Stemmer`, `Lemmatizer`, or `StopWordsCleaner`). The parameter `n` sets the number of terms in each n-gram. The output is a sequence of n-grams, each represented as a space-delimited string of `n` consecutive words, with annotatorType `CHUNK`, the same type produced by the `Chunker` annotator.

**Output type:** CHUNK  
**Input types:** TOKEN  
**Reference:** [NGramGenerator](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/annotators/NGramGenerator.scala)  
**Functions:**

- setN: number of elements per n-gram (>=1)
- setEnableCumulative: whether to return only the n-grams of size n, or all n-grams of sizes 1 through n
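The effect of the two parameters can be illustrated without Spark. The following is a minimal pure-Python sketch, not the Spark NLP implementation: the helper name `ngrams` and its `cumulative` flag are hypothetical, and two behaviours are assumptions rather than facts taken from the library source, namely that inputs shorter than the window produce no n-grams and that cumulative output lists shorter n-grams first.

```python
def ngrams(tokens, n, cumulative=False):
    """Illustrative helper: space-delimited n-grams built from consecutive tokens.

    cumulative=False -> only n-grams of size n
    cumulative=True  -> n-grams of every size from 1 through n
                        (ordering by size is an assumption of this sketch)
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    sizes = range(1, n + 1) if cumulative else [n]
    out = []
    for k in sizes:
        # Slide a window of width k over the token list; an input shorter
        # than k yields an empty range, so nothing is added (assumption).
        out.extend(" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return out


tokens = ["Big", "data", "cloud"]
plain = ngrams(tokens, 2)
cumul = ngrams(tokens, 2, cumulative=True)
print(plain)  # ['Big data', 'data cloud']
print(cumul)  # ['Big', 'data', 'cloud', 'Big data', 'data cloud']
```

With `cumulative=True` and `n=2`, the unigrams are included alongside the bigrams, which mirrors what `setEnableCumulative(True)` is described as doing.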

**Example:**

Refer to the [NGramGenerator](https://nlp.johnsnowlabs.com/api/index#com.johnsnowlabs.nlp.annotators.NGramGenerator) Scala docs for more details on the API.

```python
ngrams_cum = NGramGenerator() \
            .setInputCols(["token"]) \
            .setOutputCol("ngrams") \
            .setN(2) \
            .setEnableCumulative(True)
```

```scala
val nGrams = new NGramGenerator()
      .setInputCols("token")
      .setOutputCol("ngrams")
      .setN(2)
      .setEnableCumulative(true)
```


In [15]:
import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *
from pyspark.ml import Pipeline  # imported explicitly; used to build the pipeline below
from pyspark.sql.types import StringType

In [None]:
spark = sparknlp.start()

In [71]:
print("Spark NLP version")
sparknlp.version()

Spark NLP version


'2.3.4'

In [72]:
print("Apache Spark version")
spark.version

Apache Spark version


'2.4.3'

In [73]:
dfTest = spark.createDataFrame([
    "Cloud computing is benefiting major manufacturing companies",
    "Big data cloud computing cyber security machine learning"
], StringType()).toDF("text")

In [74]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
    
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

bigrams = NGramGenerator() \
            .setInputCols(["token"]) \
            .setOutputCol("bigrams") \
            .setN(2)

# setEnableCumulative is not set here, so only 3-grams are produced
trigrams = NGramGenerator() \
            .setInputCols(["token"]) \
            .setOutputCol("trigrams") \
            .setN(3)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    bigrams,
    trigrams
])


#### Use the Pipeline in Spark (DataFrame)

In [75]:
model = pipeline.fit(dfTest)
prediction = model.transform(dfTest)

In [76]:
prediction.select("bigrams.result").show(2, truncate=60)

+------------------------------------------------------------+
|                                                      result|
+------------------------------------------------------------+
|[Cloud computing, computing is, is benefiting, benefiting...|
|[Big data, data cloud, cloud computing, computing cyber, ...|
+------------------------------------------------------------+



In [77]:
prediction.select("trigrams.result").show(2, truncate=60)

+------------------------------------------------------------+
|                                                      result|
+------------------------------------------------------------+
|[Cloud computing is, computing is benefiting, is benefiti...|
|[Big data cloud, data cloud computing, cloud computing cy...|
+------------------------------------------------------------+



#### Use the Pipeline in Python (string)

In [78]:
from sparknlp.base import LightPipeline

text = 'Cloud computing is benefiting major manufacturing companies'

In [79]:
result = LightPipeline(model).annotate(text)

In [80]:
list(result.keys())

['document', 'token', 'bigrams', 'trigrams']

In [81]:
result['bigrams']

['Cloud computing',
 'computing is',
 'is benefiting',
 'benefiting major',
 'major manufacturing',
 'manufacturing companies']

In [82]:
result['trigrams']

['Cloud computing is',
 'computing is benefiting',
 'is benefiting major',
 'benefiting major manufacturing',
 'major manufacturing companies']
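
As a quick sanity check on the lists above, a sequence of T tokens yields T - n + 1 n-grams when n <= T. The sketch below verifies this against the printed output. The lists are copied by hand from the cells above rather than produced by a live pipeline, the variable name `shown_output` is hypothetical, and the token count assumes the `Tokenizer` splits this sentence on whitespace (the `token` list itself is not printed above).

```python
text = "Cloud computing is benefiting major manufacturing companies"

# Copied from the LightPipeline output shown above (not recomputed here).
shown_output = {
    "bigrams": [
        "Cloud computing", "computing is", "is benefiting",
        "benefiting major", "major manufacturing", "manufacturing companies",
    ],
    "trigrams": [
        "Cloud computing is", "computing is benefiting", "is benefiting major",
        "benefiting major manufacturing", "major manufacturing companies",
    ],
}

# Assumption: the tokenizer yields one token per whitespace-separated word.
num_tokens = len(text.split())

for n, key in [(2, "bigrams"), (3, "trigrams")]:
    expected = num_tokens - n + 1
    print(f"{key}: {len(shown_output[key])} shown, {expected} expected")
```

Seven tokens give six bigrams and five trigrams, matching the lists returned by `annotate`.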