![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/util/OpenAIEmbeddings.ipynb)

## OpenAIEmbeddings in Spark NLP

This notebook shows how to use the OpenAIEmbeddings annotator in a Spark NLP pipeline.

Spark NLP integrates with several OpenAI APIs. Starting with Spark NLP 5.1.0, the OpenAICompletion and OpenAIEmbeddings transformers are available, so you can call OpenAI models from a pipeline while relying on Spark to scale the workload.

## Spark NLP Settings

All you need to do is set up your [OpenAI API Key](https://platform.openai.com/docs/api-reference/authentication) and add it to the Spark properties.

In [None]:
print("Enter your OPENAI API Key:")
OPENAI_API_KEY = input()
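
If you would rather not type the key on every run, one option is to read it from an environment variable and only prompt when it is missing. The sketch below is not part of the original notebook: the helper name `resolve_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` is an assumed convention for your environment.

```python
import os


def resolve_api_key(env=os.environ, prompt=input):
    """Return the API key from the environment, or ask for it interactively.

    `env` and `prompt` are parameters only so the helper can be exercised
    without a real environment or keyboard input (hypothetical design).
    """
    # Assumption: the key, if preconfigured, lives in OPENAI_API_KEY.
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        key = prompt("Enter your OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key
```

With this helper, the cell above could become `OPENAI_API_KEY = resolve_api_key()`; the rest of the notebook is unchanged.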

In [None]:
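# DocumentAssembler is defined in sparknlp.base; OpenAIEmbeddings comes from the annotators
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import *
from pyspark.ml import Pipeline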

In [None]:
import sparknlp
# let's start Spark with Spark NLP
openai_params = {"spark.jsl.settings.openai.api.key": OPENAI_API_KEY}
spark = sparknlp.start(params=openai_params)

Apache Spark version: 3.4.0


In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

openai_embeddings = OpenAIEmbeddings() \
    .setInputCols("document") \
    .setOutputCol("embeddings") \
    .setModel("text-embedding-ada-002")

# Define the pipeline
pipeline = Pipeline(stages=[
    document_assembler, openai_embeddings
])

In [None]:
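# Placeholder DataFrame used only to fit the pipeline
empty_df = spark.createDataFrame([[""]], ["text"])

# Single-row DataFrame with the text to embed
sample_text = [["The food was delicious and the waiter..."]]
sample_df = spark.createDataFrame(sample_text).toDF("text")
sample_df.show()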

+--------------------+
|                text|
+--------------------+
|The food was deli...|
+--------------------+



In [None]:
pipeline_model = pipeline.fit(empty_df)
embeddings_df = pipeline_model.transform(sample_df)
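
Once the rows are collected, the vectors can be pulled out of the annotation structs as plain Python lists. The sketch below assumes the usual Spark NLP annotation layout, in which each annotation carries an `embeddings` field holding the float vector; the helper name `extract_vectors` and the stand-in rows are hypothetical, and the values in them are made up for illustration, not real model output.

```python
def extract_vectors(rows, column="embeddings"):
    """Return one list of floats per annotation found in the collected rows.

    Works with anything indexable by name, such as pyspark Row objects
    or plain dicts.
    """
    vectors = []
    for row in rows:
        for annotation in row[column]:
            # Assumption: each annotation exposes its vector under "embeddings".
            vectors.append(list(annotation["embeddings"]))
    return vectors


# Stand-in for embeddings_df.select("embeddings").collect(); values are illustrative.
fake_rows = [
    {"embeddings": [{"result": "The food was delicious and the waiter...",
                     "embeddings": [0.1, 0.2, 0.3]}]}
]
vectors = extract_vectors(fake_rows)
print(len(vectors), len(vectors[0]))
```

In the notebook itself this would be called as `extract_vectors(embeddings_df.select("embeddings").collect())`. Note that collecting triggers the transformation, and with it the request to the OpenAI API.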

In [None]:
embeddings_df.select("embeddings").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------