# DataFrameOptimizer Demo Notebook

This notebook showcases the `DataFrameOptimizer` transformer, which is intended to improve performance in Spark NLP pipelines or when preparing data for export. It lets you tune partitioning directly via `numPartitions`, or indirectly via `executorCores` and `numWorkers`.

The DataFrame can also be persisted in a specified format (`csv`, `json`, or `parquet`) with additional writer options.

## Setup and Initialization
Let's keep in mind a few things before we start 😊

This feature was introduced in Spark NLP 6.0.4. Please make sure you have upgraded to the latest Spark NLP release.
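
A quick way to guard against an older install is to compare the installed version string against the minimum stated above. The sketch below is only an illustration: `parse_version` and `supports_dataframe_optimizer` are hypothetical helpers (not part of Spark NLP), and the minimum version is taken from the note above.

```python
import re

# Minimum release that ships DataFrameOptimizer, per the note above.
MIN_VERSION = (6, 0, 4)


def parse_version(version: str) -> tuple:
    """Turn a string such as '6.0.4' into (6, 0, 4), keeping at most three numeric parts."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def supports_dataframe_optimizer(version: str) -> bool:
    """Return True when the given version is at least MIN_VERSION."""
    return parse_version(version) >= MIN_VERSION
```

In the notebook you could pass `sparknlp.version()` to `supports_dataframe_optimizer` after importing `sparknlp`, and upgrade if it returns `False`.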

- Let's install and set up Spark NLP in Google Colab. This part is pretty easy with our setup script.

In [14]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Processing ./spark_nlp-6.0.3-py2.py3-none-any.whl
spark-nlp is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.


In [15]:
import sparknlp
# let's start Spark with Spark NLP
spark = sparknlp.start()
print("Apache Spark version: {}".format(spark.version))

Apache Spark version: 3.5.1


## Use `DataFrameOptimizer` in a Spark NLP pipeline

In [16]:
from sparknlp.annotator.dataframe_optimizer import DataFrameOptimizer
from sparknlp import DocumentAssembler
from sparknlp.annotator import SentenceDetector
from pyspark.ml import Pipeline

test_df = spark.createDataFrame([("This is a test sentence. It contains multiple sentences.",)], ["text"])

In [17]:
data_frame_optimizer = DataFrameOptimizer() \
    .setExecutorCores(2) \
    .setNumWorkers(2) \
    .setDoCache(True)

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

pipeline = Pipeline(stages=[
    data_frame_optimizer,
    document_assembler,
    sentence_detector
])

optimized_result_df = pipeline.fit(test_df).transform(test_df)
print(f"Number of partitions: {optimized_result_df.rdd.getNumPartitions()}")

Number of partitions: 4
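
The count reported above equals `executorCores` × `numWorkers` (2 × 2 = 4). The sketch below spells out that arithmetic in plain Python. It is an illustration only: `resolve_num_partitions` is a hypothetical helper, and the rule that an explicit `numPartitions` takes precedence over the product is an assumption, not confirmed behaviour of the transformer.

```python
from typing import Optional


def resolve_num_partitions(num_partitions: Optional[int] = None,
                           executor_cores: Optional[int] = None,
                           num_workers: Optional[int] = None) -> Optional[int]:
    """Pick a target partition count from the available settings.

    Assumption: an explicit num_partitions wins; otherwise the target is
    executor_cores * num_workers; otherwise no repartitioning is requested.
    """
    if num_partitions is not None:
        return num_partitions
    if executor_cores is not None and num_workers is not None:
        return executor_cores * num_workers
    return None


# Same settings as the pipeline cell above: 2 cores per executor, 2 workers.
print(resolve_num_partitions(executor_cores=2, num_workers=2))
```

One partition per available core is a common starting point, since it lets every core work on a task at the same time.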


In [18]:
optimized_result_df.show()

+--------------------+--------------------+--------------------+
|                text|            document|           sentences|
+--------------------+--------------------+--------------------+
|This is a test se...|[{document, 0, 55...|[{document, 0, 23...|
+--------------------+--------------------+--------------------+



## Persisting data with `DataFrameOptimizer`

In [19]:
persist_path = "/tmp/optimized_output"
optimizer_persist = DataFrameOptimizer() \
    .setNumPartitions(4) \
    .setDoCache(False) \
    .setPersistPath(persist_path) \
    .setPersistFormat("parquet") \
    .setOutputOptions({"compression": "snappy"})

persisted_df = optimizer_persist.transform(test_df)
print(f"Data persisted to: {persist_path}")

Data persisted to: /tmp/optimized_output
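
For comparison, a plain PySpark write with the same format and options might look like the sketch below. `write_dataframe` is a hypothetical helper, not part of Spark NLP. It relies only on the standard `DataFrameWriter` calls `format`, `options`, and `save`. Whether `DataFrameOptimizer` performs exactly these steps internally is an assumption.

```python
from typing import Any, Dict, Optional

# Formats named in the introduction of this notebook.
SUPPORTED_FORMATS = {"csv", "json", "parquet"}


def write_dataframe(df: Any,
                    path: str,
                    fmt: str = "parquet",
                    options: Optional[Dict[str, str]] = None) -> None:
    """Write df to path using the standard DataFrameWriter API.

    df is expected to be a pyspark.sql.DataFrame (typed as Any so the
    sketch does not require pyspark at import time).
    """
    fmt = fmt.lower()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {fmt!r}")
    writer = df.write.format(fmt)
    if options:
        # Writer options are passed through as keyword arguments.
        writer = writer.options(**options)
    writer.save(path)
```

With the objects from the cell above, the call would be `write_dataframe(test_df, persist_path, "parquet", {"compression": "snappy"})`. Note that `save` uses Spark's default save mode, which raises an error if the target path already exists.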


In [20]:
restored_df = spark.read.parquet(persist_path)
restored_df.show(5)

+--------------------+
|                text|
+--------------------+
|This is a test se...|
+--------------------+

