# Enhancing Whisper transcriptions: pre- & post-processing techniques

This notebook offers a guide to improve the Whisper's transcriptions. We'll streamline your audio data via trimming and segmentation, enhancing Whisper's transcription quality. After transcriptions, we'll refine the output by adding punctuation, adjusting product terminology (e.g., 'five two nine' to '529'), and mitigating Unicode issues. These strategies will help improve the clarity of your transcriptions, but remember, customization based on your unique use-case may be beneficial.

## Installation
Install the Azure Open AI SDK using the below command.

In [1]:
#r "nuget: Azure.AI.OpenAI, 1.0.0-beta.14"

In [None]:
#r "nuget:Microsoft.DotNet.Interactive.AIUtilities, 1.0.0-beta.24129.1"

using Microsoft.DotNet.Interactive;
using Microsoft.DotNet.Interactive.AIUtilities;

## Run this cell, it will prompt you for the apiKey, endPoint, gtpDeployment, and whisperDeployment

In [5]:
var azureOpenAIKey = await Kernel.GetPasswordAsync("Provide your OPEN_AI_KEY");

// Your endpoint should look like the following https://YOUR_OPEN_AI_RESOURCE_NAME.openai.azure.com/
var azureOpenAIEndpoint = await Kernel.GetInputAsync("Provide the OPEN_AI_ENDPOINT");

// Enter the deployment name you chose when you deployed the model.
var gptDeployment = await Kernel.GetInputAsync("Provide gpt deployment name");

var whisperDeployment = await Kernel.GetInputAsync("Provide whisper deployment name");

### Import namesapaces and create an instance of `OpenAiClient` using the `azureOpenAIEndpoint` and the `azureOpenAIKey`

In [6]:
using Azure;
using Azure.AI.OpenAI;

In [7]:
OpenAIClient client = new (new Uri(azureOpenAIEndpoint), new AzureKeyCredential(azureOpenAIKey.GetClearTextPassword()));

## Setup
To get started let's import a few different libraries:

 - [Naudio](https://github.com/naudio/NAudio) is a simple and easy-to-use library for audio processing tasks such as slicing, concatenating, and exporting audio files.

 - For our audio file, we'll use a fictional earnings call written by ChatGPT and read aloud by the author.This audio file is relatively short, but hopefully provides you with an illustrative idea of how these pre and post processing steps can be applied to any audio file.

In [8]:
using System.Net.Http;
using System.IO;

// set download paths
var earningsCallUrl = "https://cdn.openai.com/API/examples/data/EarningsCall.wav";

//set local save locations
var earningsCallFilepath = "./EarningsCall.wav";

// download the file
var httpClient = new HttpClient();
using (var stream = await httpClient.GetStreamAsync(earningsCallUrl))
{
    if(File.Exists(earningsCallFilepath))
    {
        File.Delete(earningsCallFilepath);
    }
    using (var fileStream = new FileStream(earningsCallFilepath, FileMode.CreateNew))
    {
      
        await stream.CopyToAsync(fileStream);
    }
}

In [9]:
#r "nuget: NAudio, 2.2.1"

At times, files with long silences at the beginning can cause Whisper to transcribe the audio incorrectly. We'll use `NAudio`` to detect and trim the silence.


In [10]:
using NAudio.Wave;
using System.IO;

public record Silence(long Start, long End, TimeSpan Duration);

// Find silcence in the file so we can trim it and split by silences
public Silence[] FindSilences(string fileName, double silenceThreshold = -40){

    bool IsSilence(float amplitude, double threshold)
    {
        double dB = 20 * Math.Log10(Math.Abs(amplitude));
        return dB < threshold;
    }

    var silences = new List<Silence>();
    using (var reader = new AudioFileReader(fileName))
    {
        var buffer = new float[reader.WaveFormat.SampleRate * 4];
    
        long start = 0;
        bool eof = false;
        long counter = 0;
        bool detected = false;
        while (!eof)
        {
            int samplesRead = reader.Read(buffer, 0, buffer.Length);
            if (samplesRead == 0)
                {
                    eof = true;
                    if (detected){
                        double silenceSamples = (double)counter / reader.WaveFormat.Channels;
                        double silenceDuration = (silenceSamples / reader.WaveFormat.SampleRate) * 1000;
                        silences.Add(new Silence(start, start + counter, TimeSpan.FromMilliseconds(silenceDuration)));
                    }
                }

            for (int n = 0; n < samplesRead; n++)
            {
                if (IsSilence(buffer[n], silenceThreshold))
                {
                    detected = true;
                    counter++;
                }
                else{
                    if(detected)
                    {
                        double silenceSamples = (double)counter / reader.WaveFormat.Channels;
                        double silenceDuration = (silenceSamples / reader.WaveFormat.SampleRate) * 1000;
                        var last =silences.Count - 1;
                        if (last >= 0)
                        {
                            // see if we can merge with the last silence
                            var gap = start - silences[last].End;
                            var gapDuration = (double)gap / reader.WaveFormat.SampleRate * 1000;
                            if (gapDuration < 500)
                            {
                                silenceDuration = silenceDuration + silences[last].Duration.TotalMilliseconds;
                                silences[last] = new Silence(silences[last].Start, counter + silences[last].End, TimeSpan.FromMilliseconds(silenceDuration));
                            }
                            else
                            {
                                silences.Add(new Silence(start, counter, TimeSpan.FromMilliseconds(silenceDuration)));
                            }
                        }
                        else
                        {
                            silences.Add(new Silence(start, counter, TimeSpan.FromMilliseconds(silenceDuration)));
                        }

                        start = start + counter;
                        counter = 0;
                        detected = false;
                    }
                }            
            }        
        }
    }
    return silences.ToArray();
}

In [11]:
public record AudioSegment(long Start, long End, TimeSpan Duration);

public AudioSegment[] FindAudibleSegments(string fileName, Silence[] silences){
    var segments = new List<AudioSegment>();
    using (var reader = new AudioFileReader(fileName)){
        var totalSamples = reader.Length;
        for(var i = 0; i< silences.Length; i++){
            if(i == 0 && silences[i].Start > 0){
                segments.Add(new AudioSegment(0, silences[i].Start, TimeSpan.FromMilliseconds(silences[i].Start / reader.WaveFormat.SampleRate * 1000)));
            }
            if(i == silences.Length - 1 && silences[i].End < totalSamples){
                segments.Add(new AudioSegment(silences[i].End, totalSamples, TimeSpan.FromMilliseconds((totalSamples - silences[i].End) / reader.WaveFormat.SampleRate * 1000)));
            }
            if(i < silences.Length - 1){
                var current = silences[i];
                var next = silences[i+1];
                if(current.End < next.Start)
                {
                    segments.Add(new AudioSegment(current.End, next.Start, TimeSpan.FromMilliseconds((next.Start - current.End) / reader.WaveFormat.SampleRate * 1000)));
                    segments.Last().Display();
                }
            }
            
        }
    }
    return segments.ToArray();
}

Here, we've set the decibel threshold of -19. You can change this if you would like.

In [12]:
var silences = FindSilences(earningsCallFilepath, -19);
silences.Display();
var audioSegments = FindAudibleSegments(earningsCallFilepath, silences);
audioSegments.Display();

index,value
,
,
0,"Silence { Start = 0, End = 5211100, Duration = 00:03:37.1268367 }Start0End5211100Duration00:03:37.1268367"
,
Start,0
End,5211100
Duration,00:03:37.1268367
1,"Silence { Start = 5211100, End = 5246770, Duration = 00:00:01.4862500 }Start5211100End5246770Duration00:00:01.4862500"
,
Start,5211100

Unnamed: 0,Unnamed: 1
Start,0
End,5211100
Duration,00:03:37.1268367

Unnamed: 0,Unnamed: 1
Start,5211100
End,5246770
Duration,00:00:01.4862500


index,value
,
0,"AudioSegment { Start = 5246770, End = 22134528, Duration = 00:11:43 }Start5246770End22134528Duration00:11:43"
,
Start,5246770
End,22134528
Duration,00:11:43

Unnamed: 0,Unnamed: 1
Start,5246770
End,22134528
Duration,00:11:43


Now that we have audio segments we can create trimmed files to use with the `Whisper` model.

In [13]:
var trimmedFiles = new List<string>();

foreach(var audioSegment in audioSegments ){
    var trimmedFile = $"./EarningsCall-{audioSegment.Start}-{audioSegment.End}.wav";
    trimmedFiles.Add(trimmedFile);
    using (var reader = new AudioFileReader(earningsCallFilepath))
    {
        reader.Position = audioSegment.Start;
        using (WaveFileWriter writer = new WaveFileWriter(trimmedFile, reader.WaveFormat))
        {
            var endPos = audioSegment.End;
            byte[] buffer = new byte[1024];
            while (reader.Position < endPos)
            {
                int bytesRequired = (int)(endPos - reader.Position);
                if (bytesRequired > 0)
                {
                    int bytesToRead = Math.Min(bytesRequired, buffer.Length);
                    int bytesRead = reader.Read(buffer, 0, bytesToRead);
                    if (bytesRead > 0)
                    {
                        writer.Write(buffer, 0, bytesRead);
                    }
                }
            }
        }
    }
}

In [14]:
var transcript = new StringBuilder();

foreach(var trimmedFile in trimmedFiles){
    var audioFile = File.ReadAllBytes(trimmedFile);
    var response = await client.GetAudioTranscriptionAsync(new AudioTranscriptionOptions(whisperDeployment, BinaryData.FromBytes(audioFile)));
    transcript.AppendLine(response.Value.Text);
}

In [15]:
using System.Text.RegularExpressions;
var asciiText = Regex.Replace(transcript.ToString(), @"[^\u0000-\u007F]+", string.Empty);
asciiText.Display();

Good afternoon, everyone, and welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of $125 million, a 25% increase year-over-year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to $37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to $16 million, which is a noteworthy increase from $10 million in Q2 2022. Our total addressable market has grown substantially, thanks to the expansion of our high-yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized debt obligations and residential mortgage-backed securities. We've also invested $25 million in AAA-rated corporate bonds, enhancing our risk-adjusted returns. As for our balance sheet, total assets reached $1.5 b

This function will add formatting and punctuation to our transcript. Whisper generates a transcript with punctuation but without formatting.

In [16]:
var punctuationResponse = await client.GetChatCompletionsAsync(new ChatCompletionsOptions{
    Messages={
        new ChatRequestSystemMessage(@"You are a helpful assistant that adds punctuation to text. Preserve the original words and only insert necessary punctuation such as periods, commas, capialization, symbols like dollar sings or percentage signs, and formatting. Use only the context provided. If there is no context provided say, 'No context provided'"),
        new ChatRequestUserMessage(asciiText)
    },
    Temperature = 0.0f,
    DeploymentName = gptDeployment
});

var punctuatedTranscript  = punctuationResponse.Value.Choices[0].Message.Content;
punctuatedTranscript.Display();

Good afternoon, everyone, and welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of $125 million, a 25% increase year-over-year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA has surged to $37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to $16 million, which is a noteworthy increase from $10 million in Q2 2022. Our total addressable market has grown substantially, thanks to the expansion of our high-yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized debt obligations and residential mortgage-backed securities. We've also invested $25 million in AAA-rated corporate bonds, enhancing our risk-adjusted returns. As for our balance sheet, total assets reached $1.5 b

Our audio file is a recording from a fake earnings call that includes a lot of financial products. This function can help ensure that if Whisper transcribes these financial product names incorrectly, that they can be corrected.

In [17]:
var productAssistantResponse = await client.GetChatCompletionsAsync(new ChatCompletionsOptions{
    Messages={
        new ChatRequestSystemMessage( @"You are an intelligent assistant specializing in financial products; your task is to process transcripts of earnings calls, ensuring that all references to financial products and common financial terms are in the correct format. For each financial product or common term that is typically abbreviated as an acronym, the full term should be spelled out followed by the acronym in parentheses. For example, '401k' should be transformed to '401(k) retirement savings plan', 'HSA' should be transformed to 'Health Savings Account (HSA)', 'ROA' should be transformed to 'Return on Assets (ROA)', 'VaR' should be transformed to 'Value at Risk (VaR)', and 'PB' should be transformed to 'Price to Book (PB) ratio'. Similarly, transform spoken numbers representing financial products into their numeric representations, followed by the full name of the product in parentheses. For instance, 'five two nine' to '529 (Education Savings Plan)' and 'four zero one k' to '401(k) (Retirement Savings Plan)'. However, be aware that some acronyms can have different meanings based on the context (e.g., 'LTV' can stand for 'Loan to Value' or 'Lifetime Value'). You will need to discern from the context which term is being referred to and apply the appropriate transformation. In cases where numerical figures or metrics are spelled out but do not represent specific financial products (like 'twenty three percent'), these should be left as is. Your role is to analyze and adjust financial product terminology in the text. Once you've done that, produce the adjusted transcript and a list of the words you've changed"),
        new ChatRequestUserMessage(punctuatedTranscript)
    },
    Temperature = 0.0f,
    DeploymentName = gptDeployment
});

var finalTranscript  = productAssistantResponse.Value.Choices[0].Message.Content;
finalTranscript.Display();

Good afternoon, everyone, and welcome to FinTech Plus Sync's second quarter 2023 earnings call. I'm John Doe, CEO of FinTech Plus. We've had a stellar Q2 with a revenue of $125 million, a 25% increase year-over-year. Our gross profit margin stands at a solid 58%, due in part to cost efficiencies gained from our scalable business model. Our EBITDA (Earnings Before Interest, Taxes, Depreciation, and Amortization) has surged to $37.5 million, translating to a remarkable 30% EBITDA margin. Our net income for the quarter rose to $16 million, which is a noteworthy increase from $10 million in Q2 2022. Our total addressable market has grown substantially, thanks to the expansion of our high-yield savings product line and the new RoboAdvisor platform. We've been diversifying our asset-backed securities portfolio, investing heavily in collateralized debt obligations and residential mortgage-backed securities. We've also invested $25 million in AAA-rated corporate bonds, enhancing our risk-adjus