![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/open-source-nlp/22.1.OpenAI_In_SparkNLP.ipynb)

## OpenAI in SparkNLP

Spark NLP integrates with several OpenAI APIs. Since Spark NLP 5.1.0, the library supports `OpenAICompletion` for text generation and `OpenAIEmbeddings` for creating vector representations of text. Because both annotators run as stages of a Spark pipeline, requests to OpenAI benefit from Spark's scalability.

## Spark NLP Settings

In [None]:
!wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

All you need to do is set up your [OpenAI API Key](https://platform.openai.com/docs/api-reference/authentication) and add it to the Spark properties.

In [2]:
from getpass import getpass
OPENAI_API_KEY = getpass('Please enter your OpenAI API key:')

Please enter your OpenAI API key:··········


In [3]:
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# DocumentAssembler and LightPipeline both live in sparknlp.base
from sparknlp.base import DocumentAssembler, LightPipeline

In [4]:
import sparknlp
# Start Spark with Spark NLP, passing the OpenAI key as a Spark property
openai_params = {"spark.jsl.settings.openai.api.key": OPENAI_API_KEY}
spark = sparknlp.start(params=openai_params)
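
For scheduled or otherwise non-interactive runs, prompting with `getpass` is not practical. The following is a minimal sketch, assuming the key has been exported in an environment variable named `OPENAI_API_KEY`. The variable name and the helper `build_openai_params` are illustrative and not part of Spark NLP; only the property name `spark.jsl.settings.openai.api.key` comes from the cell above.

```python
import os


def build_openai_params(env_var="OPENAI_API_KEY"):
    """Return the Spark property dict that carries the OpenAI key.

    Hypothetical helper: reads the key from an environment variable
    instead of prompting, and fails early if the variable is missing.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    # Same property name as in the cell above.
    return {"spark.jsl.settings.openai.api.key": api_key}
```

The returned dict can then be passed in the same way as before, for example `sparknlp.start(params=build_openai_params())`.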

In [18]:
spark

## Text Generation with OpenAICompletion annotator

In [5]:
document_assembler = DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

openai_completion = OpenAICompletion() \
  .setInputCols("document") \
  .setOutputCol("completion") \
  .setModel("gpt-3.5-turbo-instruct") \
  .setMaxTokens(50)

# Define the pipeline
pipeline = Pipeline(
  stages=[
    document_assembler,
    openai_completion
])

In [6]:
openai_completion.extractParamMap()

{Param(parent='OpenAICompletion_183f04bc78d5', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='OpenAICompletion_183f04bc78d5', name='maxTokens', doc='The maximum number of tokens to generate in the completion.'): 50,
 Param(parent='OpenAICompletion_183f04bc78d5', name='temperature', doc='What sampling temperature to use, between 0 and 2'): 1.0,
 Param(parent='OpenAICompletion_183f04bc78d5', name='topP', doc='An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass'): 1.0,
 Param(parent='OpenAICompletion_183f04bc78d5', name='numberOfCompletions', doc='How many completions to generate for each prompt.'): 1,
 Param(parent='OpenAICompletion_183f04bc78d5', name='echo', doc='Echo back the prompt in addition to the completion'): False,
 Param(parent='OpenAICompletion_183f04bc78d5', name='presencePenalty', doc="Number between -2.0 and 2

In [7]:
sample_text = [
    ["Generate a restaurant review."],
    ["Write a review for a local eatery."],
    ["Create a JSON with a review of a dining experience."]
]

sample_df = spark.createDataFrame(sample_text).toDF("text")
sample_df.show(truncate=False)

+---------------------------------------------------+
|text                                               |
+---------------------------------------------------+
|Generate a restaurant review.                      |
|Write a review for a local eatery.                 |
|Create a JSON with a review of a dining experience.|
+---------------------------------------------------+



In [8]:
completion_df = pipeline.fit(sample_df).transform(sample_df)

In [9]:
completion_df.select("completion").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|completion                                                                                                                                                                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 215, \n\nI recently dined at the new restaurant in town, The Rustic Olive, and I must say, I was thoroughly impressed. From the moment I walked in, I was greeted with warm smiles a

**Using LightPipeline**

In [10]:
# Neither stage needs training data, so fitting on an empty DataFrame is enough
empty_df = spark.createDataFrame([[""]], ["text"])
pipeline_model = pipeline.fit(empty_df)

In [11]:
light_pipeline_openai = LightPipeline(pipeline_model)

In [12]:
light_pipeline_openai.fullAnnotate("Generate a negative review of a movie")

[{'document': [Annotation(document, 0, 36, Generate a negative review of a movie, {}, [])],
  'completion': [Annotation(document, 0, 217, 
   
   I recently wasted two hours of my life watching the disaster of a film, "The Last Unicorn." This movie was a complete waste of money, with terrible animation, lackluster acting, and a convoluted plot.
   
   First of all,, {}, [])]}]

In [13]:
light_pipeline_openai.annotate("Generate a negative review of a movie")

{'document': ['Generate a negative review of a movie'],
 'completion': ['\n\nI recently watched the movie "The Last Resort" and I was extremely disappointed. From start to finish, the film was a complete disaster.\n\nFirst of all, the acting was subpar at best. It was clear that the actors were not fully']}
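
The completions above start with `\n\n`, so downstream code will often want to trim them. The following is a minimal sketch in plain Python, assuming the input has the shape returned by `annotate` (a dict mapping column names to lists of strings). The helper name `clean_completions` and the shortened sample string are illustrative only.

```python
def clean_completions(annotated, column="completion"):
    """Strip surrounding whitespace from each generated text.

    `annotated` is assumed to look like the result of
    LightPipeline.annotate: {column name: [string, ...]}.
    """
    return [text.strip() for text in annotated.get(column, [])]


# Shortened stand-in for the result shown above, not real model output.
result = {
    "document": ["Generate a negative review of a movie"],
    "completion": ["\n\nFirst of all, the acting was subpar at best."],
}
print(clean_completions(result))
```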

### Other parameters

Other parameters control how the OpenAI API generates text. Their descriptions are in the [official API docs from OpenAI](https://platform.openai.com/docs/api-reference/completions).

- `maxTokens`: The maximum number of tokens to generate in the completion.
- `temperature`: What sampling temperature to use, between 0 and 2.
- `topP`: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.
- `numberOfCompletions`: How many completions to generate for each prompt.
- `echo`: Echo back the prompt in addition to the completion.
- `presencePenalty`: Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
- `frequencyPenalty`: Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
- `bestOf`: Generates `best_of` completions server-side and returns the best one (the one with the highest log probability per token).
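
Since out-of-range values are only rejected once a request reaches the API, it can help to check settings locally first. The following is a minimal sketch, assuming the settings are held in a plain dict keyed by the parameter names listed above. The helper `check_completion_params` is hypothetical, and it checks only the numeric ranges stated in this list.

```python
def check_completion_params(params):
    """Return a list of problems found in a dict of completion settings.

    Hypothetical helper: only the ranges documented in the list above
    (temperature, presencePenalty, frequencyPenalty) are checked.
    """
    problems = []

    def in_range(name, low, high):
        value = params.get(name)
        # Parameters that are absent keep their defaults and are skipped.
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}={value} is outside [{low}, {high}]")

    in_range("temperature", 0.0, 2.0)
    in_range("presencePenalty", -2.0, 2.0)
    in_range("frequencyPenalty", -2.0, 2.0)
    return problems


print(check_completion_params({"temperature": 2.5, "presencePenalty": 0.5}))
```

An empty list means that none of the checked values is out of range; it does not guarantee that the API will accept the request.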

## Creating vector representations with OpenAIEmbeddings

In [14]:
document_assembler = DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

openai_embeddings = OpenAIEmbeddings() \
  .setInputCols("document") \
  .setOutputCol("embeddings") \
  .setModel("text-embedding-ada-002")

# Define the pipeline
pipeline = Pipeline(
  stages=[
    document_assembler,
    openai_embeddings
])

In [15]:
sample_text = [["The food was delicious and the waiter..."]]
sample_df = spark.createDataFrame(sample_text).toDF("text")
sample_df.show(truncate=False)

+----------------------------------------+
|text                                    |
+----------------------------------------+
|The food was delicious and the waiter...|
+----------------------------------------+



In [16]:
embeddings_df = pipeline.fit(sample_df).transform(sample_df)
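
A common use of such vectors is comparing two texts by cosine similarity. The following is a minimal sketch in plain Python, assuming the vectors have already been pulled out of the `embeddings` column as lists of floats. The helper `cosine_similarity` is illustrative and not part of Spark NLP.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values close to 0 indicate nearly orthogonal vectors.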

In [17]:
embeddings_df.select("embeddings").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------