# 03 SDK | 02 ChatCompletion Streaming

## Why Streaming?

In our [previous notebook](./01_ChatCompletion.ipynb) we covered how chat completion works: _System_, _Assistant_ and _User_ messages are sent as the prompt, and the response is received in a single payload. This is great for a lot of use cases, but it has some limitations:

- A long response takes longer to generate, and nothing arrives until the entire response is complete.
- If generating the response takes too long, the user might think the LLM is not responding.

For these (and more) scenarios we can use the streaming option. With this option, we call the LLM in _Async_ mode and receive the response in chunks. The first chunks arrive as soon as the LLM starts generating the response, so we can show them to the user right away instead of waiting for the complete answer.
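
The chunk-by-chunk consumption pattern can be tried out without calling the service. The following is a minimal sketch, assuming a hypothetical `SimulateChunksAsync` helper that stands in for the streaming API and yields a fixed set of chunks with an artificial delay; the helper and the sample text are illustrative and not part of Azure.AI.OpenAI.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

// Hypothetical stand-in for a streaming LLM call: yields chunks with a small delay.
async IAsyncEnumerable<string> SimulateChunksAsync(string[] chunks, int delayMs)
{
    foreach (string chunk in chunks)
    {
        await Task.Delay(delayMs); // pretend the model is still generating
        yield return chunk;
    }
}

var collected = new StringBuilder();
string[] demoChunks = { "Streaming ", "delivers ", "partial ", "results." };

await foreach (string chunk in SimulateChunksAsync(demoChunks, delayMs: 50))
{
    Console.Write(chunk);     // show each chunk as soon as it arrives
    collected.Append(chunk);  // keep the full text for later use
}

Console.WriteLine();
string fullText = collected.ToString();
```

The consumer never waits for the whole text: each chunk is displayed on arrival, while the `StringBuilder` keeps the complete response available once the stream ends.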

### Further reasons to use streaming:

__Real-time Interactions:__ Some applications require real-time interactions with the API, where the user's input is processed on-the-fly. Streaming allows for more immediate feedback, making it suitable for applications such as AI agents, real-time translations, or interactive tutorials.

__Handling Large Data:__ If an application needs to send or receive large amounts of data, breaking it into smaller chunks and streaming it can be more efficient than sending it all at once. This can help avoid timeouts or other issues associated with large data transfers.

__Adaptive Responses:__ In some applications, the response from the model might influence subsequent inputs. For example, if a chatbot gives a particular answer, the user might ask a follow-up question. Streaming allows this kind of adaptive interaction.


__Improved User Experience:__ For end-users, streaming can provide a smoother, more interactive experience, especially in applications where rapid back-and-forth communication is necessary.


## Azure Environment

You can create the necessary instance of Azure OpenAI and deploy a chat completion model by executing one of the deployment options [mentioned here](../../create_env/src/AzureCLI/CreateEnv.azcli). To execute the sample code, service-specific information is needed ([details and instructions here](../01_DemoEnvironment/01_Environment.ipynb)).


## Step 1: Create OpenAIClient

The OpenAIClient class from the Azure.AI.OpenAI client library is the central entry point for .NET code that interacts with a deployed Azure OpenAI Large Language Model. It provides methods to access the OpenAI REST APIs for tasks such as text completion, text embedding, and chat completion. It also allows developers to specify the model deployment and the options for each request, such as temperature, frequency penalty, presence penalty, and stop sequences.

The OpenAIClient can connect to any Azure OpenAI resource or to the non-Azure OpenAI inference endpoint, making it a versatile and powerful tool for .NET development with OpenAI.


In [None]:
#r "nuget: Azure.AI.OpenAI, 1.0.0-beta.12"
#r "nuget: DotNetEnv, 2.5.0"

using Azure; 
using Azure.AI.OpenAI;
using DotNetEnv;
using System.IO;
using System.Text.Json; 

//configuration file is created during environment creation
//if you skipped the deployment just remove the code and provide values from your deployment
static string _configurationFile = @"../01_DemoEnvironment/conf/application.env";
Env.Load(_configurationFile);

string oAiApiKey = Environment.GetEnvironmentVariable("SKIT_AOAI_APIKEY") ?? "SKIT_AOAI_APIKEY not found";
string oAiEndpoint = Environment.GetEnvironmentVariable("SKIT_AOAI_ENDPOINT") ?? "SKIT_AOAI_ENDPOINT not found";
string chatCompletionDeploymentName = Environment.GetEnvironmentVariable("SKIT_CHATCOMPLETION_DEPLOYMENTNAME") ?? "SKIT_CHATCOMPLETION_DEPLOYMENTNAME not found";

string assetsFolder = Path.Combine(Directory.GetCurrentDirectory(), "..", "..", "assets");

AzureKeyCredential azureKeyCredential = new AzureKeyCredential(oAiApiKey);
OpenAIClient openAIClient = new OpenAIClient(new Uri(oAiEndpoint), azureKeyCredential);

Console.WriteLine($"OpenAI Client created...");

Expected output:

```
Installed Packages
    Azure.AI.OpenAI, 1.0.0-beta.12
    DotNetEnv, 2.5.0

OpenAI Client created...
```

## Step 2: Compose ChatCompletionsOptions

Each chat follows a similar structure, where _System_, _Assistant_ and _User_ messages are added in sequence. Parameters such as _Temperature_ can be set per call.
The model then responds with the most likely next message. With the __Streaming__ option we can process the response in chunks.

In [None]:
//Define System Prompt
string system = @" 
    You summarize technical information provided to you. 
    The summarization is focused on key elements and messages from the text. 
    You respond with one paragraph, in language that is easily understandable for an experienced .NET developer.
";

//Compose Chat (Simplified - No few shot learning in this example)
ChatCompletionsOptions chatCompletionsOptions = new ChatCompletionsOptions();

chatCompletionsOptions.Messages.Add(new ChatRequestSystemMessage(system));
chatCompletionsOptions.Messages.Add(new ChatRequestUserMessage(File.ReadAllText(Path.Combine(assetsFolder, "docs", "03_SDK", "aci_documentation.txt"))));

//Request Properties
chatCompletionsOptions.MaxTokens = 500;
chatCompletionsOptions.Temperature = 0.0f;
chatCompletionsOptions.NucleusSamplingFactor = 0.0f;
chatCompletionsOptions.FrequencyPenalty = 0.7f;
chatCompletionsOptions.PresencePenalty = 0.7f;
chatCompletionsOptions.StopSequences.Add("\n"); 
chatCompletionsOptions.DeploymentName = chatCompletionDeploymentName;
//Choices per prompt
chatCompletionsOptions.ChoiceCount = 1;

Console.WriteLine("ChatCompletionsOptions created...");

Expected output:

```
ChatCompletionsOptions created...
```

## Step 3: Call Streaming API

The API responds in chunks of tokens. They can be displayed as they arrive, as outlined in this example.

In [None]:
StreamingResponse<StreamingChatCompletionsUpdate> response = await openAIClient.GetChatCompletionsStreamingAsync(chatCompletionsOptions);

string responseContent = string.Empty;
await foreach (var messages in response.EnumerateValues())
{
    if(messages.ContentUpdate != null)
    {
        Console.WriteLine($"Arrived chunk: {messages.ContentUpdate}");
        responseContent += messages.ContentUpdate;        
    }
}

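// Print at most the first 50 characters; Substring(0, 50) alone throws on shorter responses
Console.WriteLine($"Response: {responseContent.Substring(0, Math.Min(50, responseContent.Length))}...");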

## Step 3.1: Call Streaming API - typing 

The following showcases the typing effect. Chunks are written to the console as soon as they arrive, and a line break is inserted after a fixed number of chunks (roughly one chunk per word or word fragment).

In [None]:
StreamingResponse<StreamingChatCompletionsUpdate> response = await openAIClient.GetChatCompletionsStreamingAsync(chatCompletionsOptions);

string responseContent = string.Empty;
int wordsinline = 15;

int pos = 0;
await foreach (var messages in response.EnumerateValues())
{
    if(messages.ContentUpdate != null)
    {
        pos = (pos == wordsinline) ? 0 : pos + 1;
        if (pos == 0) Console.WriteLine();
        Console.Write($"{messages.ContentUpdate}");
        responseContent += messages.ContentUpdate;        
    }
}

Console.WriteLine("");
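// Once streaming is complete, responseContent holds the accumulated completion text
// Print at most the first 50 characters; Substring(0, 50) alone throws on shorter responses
Console.WriteLine($"Response: {responseContent.Substring(0, Math.Min(50, responseContent.Length))}...");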
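
The cell above breaks the line after a fixed number of chunks, and a chunk is a token rather than a whole word. If lines should break on actual words instead, spaces can be counted across chunk boundaries. The following is a minimal sketch, assuming a hypothetical `WrapByWords` helper and hand-made sample chunks; neither is part of the SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of the SDK): replaces every N-th space with a line break,
// counting spaces across chunk boundaries so that lines break on words, not on chunks.
string WrapByWords(IEnumerable<string> chunks, int wordsPerLine)
{
    var output = new StringBuilder();
    int words = 0;
    foreach (string chunk in chunks)
    {
        foreach (char c in chunk)
        {
            if (c == ' ')
            {
                words++;
                if (words == wordsPerLine)
                {
                    output.Append('\n'); // the N-th space becomes the line break
                    words = 0;
                    continue;
                }
            }
            output.Append(c);
        }
    }
    return output.ToString();
}

// Chunks deliberately split in the middle of words, as tokens often are.
string[] sampleChunks = { "Stream", "ing keeps", " the user", " inform", "ed." };
Console.WriteLine(WrapByWords(sampleChunks, wordsPerLine: 3));
```

For a live typing effect the same counter can be kept outside the loop and applied to each `ContentUpdate` as it arrives, instead of collecting all chunks first.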

## Step 3.2: Cancellation token

In some scenarios a call might be canceled before the entire response is received. This is showcased in this example.
The example uses the DeepDev tokenizer package (Microsoft.DeepDev.TokenizerLib), so let's install it first.

The following example demonstrates a canceled call. With a time-driven cancellation token, the call is canceled after the specified number of milliseconds. A tokenizer then converts the text received so far into tokens, so that the cost of the partial response can be estimated correctly.

In [None]:
#r "nuget: Microsoft.DeepDev.TokenizerLib, 1.3.2"

Expected output:

```
Installed Packages
    Microsoft.DeepDev.TokenizerLib, 1.3.2

```

In [None]:
using System;
using System.Collections.Generic;
using System.Threading;
using Microsoft.DeepDev;

// Special tokens that mark the start and end of a chat message
var IM_START = "<|im_start|>";
var IM_END = "<|im_end|>";

var SpecialTokens = new Dictionary<string, int>{
                                            { IM_START, 100264},
                                            { IM_END, 100265},
                                        };

var tokenizer = await TokenizerBuilder.CreateByModelNameAsync("gpt-3.5-turbo", SpecialTokens);

var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(920)); // Cancel after 920 milliseconds
string responseContent = string.Empty;

try
{
    StreamingResponse<StreamingChatCompletionsUpdate> response = await openAIClient.GetChatCompletionsStreamingAsync(chatCompletionsOptions, cancellationToken: cts.Token);
    int wordsinline = 15;

    int pos = 0;
    await foreach (var messages in response.EnumerateValues())
    {
        if(messages.ContentUpdate != null)
        {
            pos = (pos == wordsinline) ? 0 : pos + 1;
            if (pos == 0) Console.WriteLine();
            Console.Write($"{messages.ContentUpdate}");
            responseContent += messages.ContentUpdate;
        }
    }
}
catch (Exception ex) when (cts.IsCancellationRequested)
{
    // The token fired before the stream finished; responseContent holds the partial response
    Console.WriteLine();
    Console.WriteLine($"Operation was canceled: {ex.Message}");
    Console.WriteLine($"Response: {responseContent}");

    // Count the tokens of the partial response to estimate what was generated before cancellation
    var encoded = tokenizer.Encode(responseContent, new HashSet<string>(SpecialTokens.Keys));
    Console.WriteLine($"Characters received up to cancellation: {responseContent.Length}, tokens: {encoded.Count}");
}
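
The cancellation pattern itself can be reproduced without the service. The following is a minimal sketch, assuming a hypothetical `EndlessChunksAsync` source that never finishes on its own; the names and timings are illustrative only. It shows how a time-driven token ends an `await foreach` loop and how the text collected up to that point stays available.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical endless chunk source; it observes the token passed via WithCancellation.
async IAsyncEnumerable<string> EndlessChunksAsync([EnumeratorCancellation] CancellationToken token = default)
{
    int i = 0;
    while (true)
    {
        await Task.Delay(100, token); // throws TaskCanceledException once the token fires
        yield return $"chunk{i++} ";
    }
}

var demoCts = new CancellationTokenSource(TimeSpan.FromMilliseconds(350));
var partial = new StringBuilder();
bool wasCanceled = false;

try
{
    await foreach (string chunk in EndlessChunksAsync().WithCancellation(demoCts.Token))
    {
        partial.Append(chunk); // keep whatever arrived before the cancellation
    }
}
catch (OperationCanceledException)
{
    // TaskCanceledException derives from OperationCanceledException
    wasCanceled = true;
}

Console.WriteLine($"Canceled: {wasCanceled}, partial text: {partial}");
```

How many chunks arrive before the token fires depends on timing, so the partial text should be treated as whatever happened to be received rather than a fixed amount.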

## Next Steps

- The concept of Embeddings allows transforming information into a numerical representation preserving the semantic context of the information: [Demo Embeddings](../04_Embeddings/01_BasicEmbeddings.ipynb)