# Audio file to SRT subtitles file

## Requirements

The requirements for this notebook are:
- An Azure Cognitive Services resource and its key
- An input video file

## Install Required Libraries

This section installs the required libraries for the code to run:
- [azure-cognitiveservices-speech](https://docs.microsoft.com/en-us/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech?view=azure-python): used to communicate with the Azure Speech API
- [python-dotenv](https://pypi.org/project/python-dotenv/): used to read the .env file containing the Cognitive Service keys

In [None]:
! pip install azure-cognitiveservices-speech
! pip install python-dotenv

## Settings

Avoid putting your settings, especially the service key, directly in this notebook. Instead, put them in a .env file and load them from there.

Your .env file should look like this:
```
SPEECH_KEY=<your-speech-key>
SPEECH_REGION=<your-speech-region>
SPEECH_FILE=<your-source-video-file-path>
SPEECH_LANGUAGE=<your-speech-language>
```

Then we create a `settings` dictionary containing the values from the .env file.
- `key`: Azure Cognitive Service key
- `region`: Azure Cognitive Service region
- `speechFile`: Path of the video file to convert to SRT subtitles
- `language`: Source language of the file
- `fileName`: Path of the output audio file (derived from `speechFile`)

In [None]:
import os
import re

from dotenv import load_dotenv

load_dotenv(override=True)

settings = {
    'key': os.environ.get('SPEECH_KEY'),
    'region': os.environ.get('SPEECH_REGION'),
    'speechFile': os.getenv('SPEECH_FILE'),  # Feel free to hardcode the file path
    'language': os.environ.get('SPEECH_LANGUAGE'),  # Feel free to hardcode the language
    'fileName': "./audio.wav"
}
# Make sure the output folder exists before anything is written to it
os.makedirs('./outputs', exist_ok=True)
# Create fileName by taking the leaf of the path from speechFile and replacing the extension with .wav using regex
settings['fileName'] = f"./outputs/{os.path.basename(settings['speechFile'])}"
settings['fileName'] = re.sub(r'\.[^\.]+$', '.wav', settings['fileName'])
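
A missing entry in the .env file only shows up later as an obscure error, so it can help to check the settings up front. The following is a minimal sketch: `missing_settings` is a hypothetical helper that is not part of the original notebook, and the `example` dictionary stands in for the real `settings`.

```python
def missing_settings(settings, required=("key", "region", "speechFile", "language")):
    """Return the names of required settings that are unset or empty."""
    return [name for name in required if not settings.get(name)]


# Illustrative values only; in the notebook you would pass the real settings dictionary
example = {"key": "abc", "region": "", "speechFile": "talk.mp4", "language": None}
print(missing_settings(example))  # ['region', 'language']
```

If the returned list is not empty, fix the .env file before running the remaining cells.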


### Helpers

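Create a function to convert the offsets returned by the Speech service, which are expressed in 100-nanosecond ticks, to an SRT timestamp (`HH:MM:SS,mmm`). The function first multiplies the value by 100 to get nanoseconds.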

In [None]:
import math

def toTimeStamp(nano):
    nano = nano*100
    hour = math.floor(nano/3600000000000)
    nano = nano -hour*3600000000000
    minutes = math.floor(nano/60000000000)
    nano = nano - minutes*60000000000
    seconds = math.floor(nano/1000000000)
    nano = nano - seconds*1000000000
    milliseconds = math.floor(nano/1000000)
    timestamp = "{0:02d}:{1:02d}:{2:02d},{3:03d}".format(hour,minutes,seconds,milliseconds)
    return timestamp
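
As a cross-check of the arithmetic, the same conversion can be sketched with `divmod`. This is an alternative formulation rather than the notebook's own helper; `ticks_to_timestamp` is a hypothetical name, and it assumes the input is an integer number of 100-nanosecond ticks (10,000 ticks per millisecond).

```python
def ticks_to_timestamp(ticks):
    """Convert integer 100-nanosecond ticks to an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = ticks // 10_000  # 10,000 ticks per millisecond
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{ms:03d}"


print(ticks_to_timestamp(10_000_000))      # 00:00:01,000
print(ticks_to_timestamp(37_235_000_000))  # 01:02:03,500
```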

## Extract Video Audio

In this section, we will use ffmpeg to extract the audio from the video.

> **Note**: ffmpeg needs to be installed in the *ffmpeg* folder in this project.
> You can download it from [here](https://ffmpeg.org/download.html).
> If you have ffmpeg installed in your system, you can either update the command below to use it or, if its path is in the environment PATH, you can just replace the path below with `ffmpeg`.

In [None]:
import subprocess

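# Pass the arguments as a list so that paths containing spaces or quotes need no shell escaping
command = ["./ffmpeg/ffmpeg", "-y", "-i", settings['speechFile'],
           "-ab", "160k", "-ac", "2", "-ar", "44100", "-vn", settings['fileName']]

subprocess.call(command)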
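
If ffmpeg lives somewhere else on your machine, the command can be assembled by a small function that takes the executable path as a parameter. The sketch below is one way to do that; `build_ffmpeg_command` is a hypothetical helper, the file names are placeholders, and the ffmpeg options are the ones used in the cell above.

```python
import shutil


def build_ffmpeg_command(source, target, ffmpeg="./ffmpeg/ffmpeg"):
    """Build the argument list that extracts the audio track of source into target."""
    return [ffmpeg, "-y", "-i", source,
            "-ab", "160k", "-ac", "2", "-ar", "44100", "-vn", target]


# Placeholder paths for illustration
cmd = build_ffmpeg_command("my video.mp4", "./outputs/my video.wav")
print(cmd)

# shutil.which returns None when the executable cannot be found
print("ffmpeg found:", shutil.which(cmd[0]) is not None)
```

The resulting list can be passed directly to `subprocess.call` or `subprocess.run` without `shell=True`. Passing `ffmpeg="ffmpeg"` uses the copy found on the environment PATH.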

## Generate captions from audio file

This section will use the Azure Speech API to generate captions from the audio file.

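An SRT caption file consists of numbered entries separated by blank lines. The block below shows the general pattern, followed by a concrete example: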
```
<number>
<start timestamp> --> <end timestamp>
<caption>


<number>
<start timestamp> --> <end timestamp>
<caption>


1
00:00:00,000 --> 00:00:02,000
Hello!

2
00:00:02,000 --> 00:00:04,000
This is a test.
```
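
The pattern above can be sketched as a small formatting function. `srt_block` is a hypothetical helper used only for illustration; the entries reproduce the example from the block above.

```python
def srt_block(index, start, end, text):
    """Format one SRT entry: counter, time range, caption text, blank line."""
    return f"{index}\n{start} --> {end}\n{text}\n\n"


entries = [
    ("00:00:00,000", "00:00:02,000", "Hello!"),
    ("00:00:02,000", "00:00:04,000", "This is a test."),
]
# Number the entries starting from 1, as SRT players expect
srt = "".join(srt_block(i, s, e, t) for i, (s, e, t) in enumerate(entries, start=1))
print(srt)
```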


### Get the transcript using Speech Services

Each recognized utterance comes back as a JSON payload. With the detailed output format, the payload contains an `NBest` list of candidate transcriptions, each with its own confidence score. Because word-level timestamps are requested, each candidate also lists the offset and duration of every word.

In [None]:
import azure.cognitiveservices.speech as speechsdk
import time


def get_transcript_from_file():
    print("===================================")
    print ("Processing "+settings['fileName'])
    print("===================================")

    # Creates an instance of a speech config with specified subscription key and service region.
    speech_config = speechsdk.SpeechConfig(subscription=settings['key'], region=settings['region'])
    speech_config.request_word_level_timestamps()
    speech_config.set_property(property_id=speechsdk.PropertyId.SpeechServiceResponse_OutputFormatOption, value="detailed")

    # Creates a speech recognizer using file as audio input.
    audio_input = speechsdk.audio.AudioConfig(filename=settings['fileName'])
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, language=settings['language'], audio_config=audio_input)
    
    # initialize some variables
    results = []
    done = False

    # Event handler to add the recognized result to the results list
    def handleResult(evt):
        import json
        # Store the detailed JSON payload; the cells below read its
        # NBest, Words, Offset, Duration and DisplayText fields
        results.append(json.loads(evt.result.json))

        # Print progress (optional, otherwise it can run for a few minutes without output)
        if evt.result.text != "":
            print('RECOGNIZED: {}'.format(evt.result.text))

    
    # Event handler to check if the recognizer is done
    def stop_cb(evt):
        print('CLOSING on {}'.format(evt))
        speech_recognizer.stop_continuous_recognition()
        nonlocal done
        done= True

    # Connect callbacks to the events fired by the speech recognizer & displays the info/status
    # Ref:https://docs.microsoft.com/en-us/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.eventsignal?view=azure-python   
    # speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
    # speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
    speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    # speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    # speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))
    speech_recognizer.recognized.connect(handleResult) 
    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)

    # Starts continuous speech recognition
    speech_recognizer.start_continuous_recognition()

    # Wait for speech recognition to complete
    while not done:
        time.sleep(1)
    
    return results

results = get_transcript_from_file()
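
To make the shape of each entry in `results` easier to see, the sketch below walks through a hand-written payload that mimics the detailed output format. All values are invented for illustration; the field names are the ones this notebook relies on, plus `Confidence` and `Display`, which are assumed from the detailed output format.

```python
import json

# Hand-written sample, not real service output
sample = json.loads("""
{
  "RecognitionStatus": "Success",
  "Offset": 5000000,
  "Duration": 12000000,
  "DisplayText": "Hello world.",
  "NBest": [
    {"Confidence": 0.95, "Display": "Hello world.",
     "Words": [
       {"Word": "hello", "Offset": 5000000, "Duration": 5000000},
       {"Word": "world", "Offset": 11000000, "Duration": 6000000}
     ]}
  ]
}
""")

best = sample["NBest"][0]  # the notebook always uses the first candidate
# Offsets and durations are in 100-nanosecond ticks; divide by 10,000,000 to get seconds
words = [(w["Word"], w["Offset"] / 10_000_000, w["Duration"] / 10_000_000)
         for w in best["Words"]]
print(sample["DisplayText"], best["Confidence"])
print(words)
```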

### Create the individual words caption

This is useful for video editing: with one caption per word, cutting a word out of the video does not force you to edit or re-time a longer caption, and you do not risk losing it.

In [None]:
captions = ""  # this is used to store the output SRT data
i = 0  # this is the index of the caption (<number> in the SRT example above)

for result in results:
    if result.get('RecognitionStatus') != 'InitialSilenceTimeout':
        try:
            res = result["NBest"][0]
            for word in res["Words"]:
                i += 1
                start = toTimeStamp(word["Offset"])
                end = toTimeStamp(word["Offset"] + word["Duration"])
                captions += """{0}
{1} --> {2}
{3}

""".format(i, start, end, word["Word"])
        except KeyError:
            # Results without NBest or Words (e.g. no speech matched) produce no captions
            pass

# The with statement closes the file even if writing fails
with open("{0}-caption-words.srt".format(settings['fileName']), "w", encoding="utf-8") as f:
    f.write(captions)

### Create Entire Caption

Same as the previous section, but this time we create one caption per recognized sentence. This is better suited for the final output.

In [None]:
captions = ""
i = 0  # caption counter, incremented only for results that yield a caption
for result in results:
    try:
        start = toTimeStamp(result["Offset"])
        end = toTimeStamp(result["Offset"] + result["Duration"])
        text = result["DisplayText"]
    except KeyError:
        continue  # skip results that carry no timing or text
    i += 1
    captions += """{0}
{1} --> {2}
{3}

""".format(i, start, end, text)

with open("{0}-caption.srt".format(settings['fileName']), "w", encoding="utf-8") as f:
    f.write(captions)

### Create Text

This is just one big string of all the words/sentences without timestamps.

In [None]:
# Collect the display text of every result that has one
texts = [result["DisplayText"] for result in results if "DisplayText" in result]
outputText = " ".join(texts)

with open("{0}-text.txt".format(settings['fileName']), "w", encoding="utf-8") as f:
    f.write(outputText)