# Azure OpenAI for Big Data

The Azure OpenAI service can solve a wide range of natural language tasks by prompting the completion API. To make it easier to scale your prompting workflows from a few examples to large datasets, we have integrated the Azure OpenAI service with the distributed machine learning library [SynapseML](https://www.microsoft.com/en-us/research/blog/synapseml-a-simple-multilingual-and-massively-parallel-machine-learning-library/). This integration makes it easy to use the [Apache Spark](https://spark.apache.org/) distributed computing framework to process millions of prompts with the OpenAI service. This tutorial shows how to apply large language models at distributed scale using Azure OpenAI and Azure Synapse Analytics.

## Step 1: Prerequisites

The key prerequisites for this tutorial are a working Azure OpenAI resource and an Apache Spark cluster with SynapseML installed. We suggest creating a Synapse workspace, but Azure Databricks, HDInsight, Spark on Kubernetes, or even a Python environment with the `pyspark` package will also work.

1. An Azure OpenAI resource – request access [here](https://customervoice.microsoft.com/Pages/ResponsePage.aspx?id=v4j5cvGGr0GRqy180BHbR7en2Ais5pxKtso_Pz4b1_xUOFA5Qk1UWDRBMjg0WFhPMkIzTzhKQ1dWNyQlQCN0PWcu) before [creating a resource](https://docs.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource)
1. [Create a Synapse workspace](https://docs.microsoft.com/en-us/azure/synapse-analytics/get-started-create-workspace)
1. [Create a serverless Apache Spark pool](https://docs.microsoft.com/en-us/azure/synapse-analytics/get-started-analyze-spark#create-a-serverless-apache-spark-pool)


## Step 2: Import this guide as a notebook

The next step is to add this code to your Spark cluster. You can either create a notebook in your Spark platform and copy the code into it, or download the notebook and import it into Synapse Analytics:

1. [Download this demo as a notebook](https://github.com/microsoft/SynapseML/blob/master/notebooks/features/cognitive_services/CognitiveServices%20-%20OpenAI.ipynb) (click Raw, then save the file).
1. Import the notebook [into the Synapse Workspace](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-development-using-notebooks#create-a-notebook) or, if using Databricks, [into the Databricks Workspace](https://docs.microsoft.com/en-us/azure/databricks/notebooks/notebooks-manage#create-a-notebook).
1. Install SynapseML on your cluster. See the installation instructions for Synapse at the bottom of [the SynapseML website](https://microsoft.github.io/SynapseML/). Note that this requires pasting an additional cell at the top of the notebook you just imported.
1. Connect your notebook to a cluster and follow along, editing and running the cells below.

## Step 3: Fill in your service information

Next, edit the notebook cell marked `# Fill in the following lines with your service information` so that it points to your service. In particular, set the `service_name`, `deployment_name`, `location`, and `key` variables to match your Azure OpenAI resource.
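
As a rough illustration of what these values might look like, the sketch below uses placeholder values and a hypothetical helper, `build_endpoint`, which is not part of SynapseML. It only mirrors the URL pattern that this guide later passes to the completion client; the service name, region, and key shown are assumptions to be replaced with your own.

```python
# Hypothetical placeholder values; replace them with your own resource details.
service_name = "my-openai-service"  # assumed name of the Azure OpenAI resource
deployment_name = "text-ada-001"    # model deployment used later in this guide
location = "eastus"                 # assumed Azure region of the resource
key = "<your-api-key>"              # prefer a Key Vault secret over a literal


def build_endpoint(name: str) -> str:
    """Build the service URL from a bare service name (hypothetical helper)."""
    if not name or "/" in name or "." in name:
        raise ValueError("expected a bare service name, not a URL")
    return "https://{}.openai.azure.com/".format(name)


endpoint = build_endpoint(service_name)
print(endpoint)
```

Rejecting anything that looks like a full URL is a small guard against pasting the endpoint where only the service name is expected.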

In [None]:
# Install the Hugging Face datasets package (notebook magic command)
%pip install datasets

In [None]:
from datasets import load_dataset
from pyspark.sql.functions import col, udf, mean, concat, lit
from pyspark.sql import functions as F
from nltk.translate.bleu_score import sentence_bleu
from notebookutils import mssparkutils

# Load the English-French book translation dataset and keep the first 5000 rows
dataset = load_dataset('opus_books', 'en-fr')
df = dataset['train'].to_pandas().head(5000)
# Convert to Spark dataframe
df = spark.createDataFrame(df)
display(df)

In [None]:
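# Split the translation map into separate French and English columns
df = df.select('id', F.col('translation').getItem('fr').alias('fr'), F.col('translation').getItem('en').alias('en'))
# Prefix each French sentence with the instruction to form the prompt
df = df.select(concat(lit("Translate this sentence to English: "), col('fr')).alias('fr'), 'en')
display(df)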

In [None]:

# Fill in the following lines with your service information
deployment_name = "text-ada-001"
key = mssparkutils.credentials.getSecret('mldemo4764731797', 'OPENAI-API-KEY')  # get OpenAI key from Key Vault
# Assumption: this secret holds the bare service name used to build the URL below,
# not a full endpoint URL. The original cell assigned it to `key`, overwriting the API key.
service_name = mssparkutils.credentials.getSecret('mldemo4764731797', 'OPENAI-API-ENDPOINT')

## Step 4: Create the OpenAICompletion Apache Spark client

To apply the OpenAI Completion service to the dataframe you just created, create an `OpenAICompletion` object, which serves as a distributed client. Each service parameter can be set either to a single value or to a dataframe column, using the appropriate setter on the `OpenAICompletion` object. Here we set `maxTokens` to 200. A token is roughly four characters; in the completion API, `maxTokens` caps the length of the generated completion, and the prompt and completion together must fit within the model's context length. We also set the `promptCol` parameter to the name of the prompt column in the dataframe.
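
To get a feel for token counts before sending prompts, the following minimal sketch applies the four-characters-per-token heuristic mentioned above. The function names and the total budget of 2048 are hypothetical and chosen only for illustration; a real tokenizer will give different counts.

```python
CHARS_PER_TOKEN = 4  # heuristic from the text above, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a string, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_budget(prompt: str, completion_tokens: int, total_budget: int) -> bool:
    """Check that the prompt estimate plus the completion stays within a total."""
    return estimate_tokens(prompt) + completion_tokens <= total_budget


prompt = "Translate this sentence to English: Bonjour le monde."
# total_budget=2048 is an assumed value for illustration only
print(estimate_tokens(prompt), fits_budget(prompt, 200, 2048))
```

Because the estimate is approximate, it is best treated as a sanity check for unusually long prompts and not as an exact limit.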

In [None]:
from synapse.ml.cognitive import OpenAICompletion

completion = (
    OpenAICompletion()
    .setSubscriptionKey(key)
    .setDeploymentName(deployment_name)
    .setUrl("https://{}.openai.azure.com/".format(service_name))
    .setMaxTokens(200)
    .setPromptCol("fr")
    .setErrorCol("error")
    .setOutputCol("translated_en")
)

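# Assumed definition: `bleu` is called below but is never defined in this notebook.
# It scores each model translation against the reference English sentence.
@udf("double")
def bleu(candidate, reference):
    if candidate is None or reference is None:
        return None
    return float(sentence_bleu([reference.split()], candidate.split()))

# Processing 5000 sentences takes about 25 minutes on a small CPU cluster
completed_df = completion.transform(df).select(
    col("fr"),
    col("error"),
    col("translated_en.choices.text").getItem(0).alias("text"),
    col("en"),
)
completed_df = completed_df.select(col("fr"), col("text"), col("en"), bleu("text", "en").alias("bleu")).cache()
display(completed_df)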

In [None]:
# TO DO: 
# Write code to measure the performance of the translation
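
One possible way to approach this step is to average a per-sentence score over the whole dataset. The sketch below is plain Python and uses a simplified stand-in metric (clipped unigram precision with BLEU's brevity penalty), not NLTK's `sentence_bleu`. The function name `unigram_bleu` and the sentence pairs are hypothetical and made up for illustration.

```python
import math
from collections import Counter


def unigram_bleu(candidate: str, reference: str) -> float:
    """Clipped unigram precision times the brevity penalty (simplified metric)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    ref_counts = Counter(ref)
    # Count each candidate token at most as often as it appears in the reference
    overlap = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = overlap / len(cand)
    # Penalize candidates that are shorter than the reference
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return precision * brevity


# Hypothetical (model output, reference) pairs, made up for illustration
pairs = [
    ("the cat sat on the mat", "the cat sat on the mat"),
    ("a cat is on the mat", "the cat sat on the mat"),
]
scores = [unigram_bleu(c, r) for c, r in pairs]
print(sum(scores) / len(scores))
```

On the Spark side, the same kind of aggregate could be taken over the `bleu` column of `completed_df`, for example with the `mean` function imported earlier.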