# Automatically subtitle a Dhivehi Video

This is a simple demonstration of a use case for an automatic speech recognition (ASR) toolchain,
such as the Hugging Face wav2vec2 model mentioned on
[https://dhivehi.ai/docs/technologies/stt/](https://dhivehi.ai/docs/technologies/stt/).

The tutorial is inspired by [this article](https://towardsdatascience.com/generating-subtitles-automatically-using-mozilla-deepspeech-562c633936a7)
published on towardsdatascience.com.

The process follows a few basic steps:
 * Extract audio from the video
 * Download the pretrained STT model and set up the inference pipeline
 * Run STT on the audio to transcribe it
 * Generate a .srt file containing subtitles with timestamps

## Setup requirements

In [None]:
! pip install -r ../requirements.txt

## Process and extract audio

For the purposes of this tutorial, we will use this episode of Floak the International,
downloaded from YouTube [here](https://www.youtube.com/watch?v=ccdwQQ1OQB4).

Before you proceed, download your video and store it somewhere.
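
Before running the cells below, it can help to confirm that `ffmpeg` is installed and that the video is where the notebook expects it. The following is a minimal sketch of such a check; `check_prerequisites` is a hypothetical helper that is not part of the original notebook, and the path in the usage note is the one used later in this tutorial.

```python
import os
import shutil

def check_prerequisites(video_path):
    """Return a list of human-readable problems; an empty list means ready to go."""
    problems = []
    # shutil.which returns None when the executable is not on PATH
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    if not os.path.isfile(video_path):
        problems.append("video file not found: {}".format(video_path))
    return problems
```

For example, `check_prerequisites("../floak_ep1.mp4")` should return an empty list once both the tool and the video are in place.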

#### Define some helper methods

In [25]:
import os
import subprocess
from pyAudioAnalysis.audioSegmentation import silence_removal
from pyAudioAnalysis.audioBasicIO import read_audio_file
from scipy.io import wavfile

def extractAudio(input_file, audio_file_name):
    # Convert to mono (-ac 1), 16 kHz (-ar 16000) audio and drop the video stream (-vn)
    command = ["ffmpeg", "-hide_banner", "-loglevel", "warning", "-i", input_file,
               "-b:a", "192k", "-ac", "1", "-ar", "16000", "-vn", audio_file_name]
    try:
        # check=True raises CalledProcessError when ffmpeg exits with a non-zero status
        subprocess.run(command, check=True)
        print("Extracted audio to audio/{}".format(audio_file_name.split("/")[-1]))
    except (subprocess.CalledProcessError, OSError) as e:
        print("Error: ", str(e))
        raise

def silenceRemoval(input_file, output_dir, smoothing_window = 1.0, weight = 0.1):
    print("Detecting silences...")
    [fs, x] = read_audio_file(input_file)
    segmentLimits = silence_removal(x, fs, 0.05, 0.05, smoothing_window, weight)
    ifile_name = os.path.basename(input_file)

    print("Writing segments...")
    for i, s in enumerate(segmentLimits):
        strOut = "{0:s}_{1:.3f}-{2:.3f}.wav".format(ifile_name, s[0], s[1])
        strOut = os.path.join(output_dir, strOut)
        wavfile.write(strOut, fs, x[int(fs * s[0]):int(fs * s[1])])


#### Extract the audio

In [26]:
import os
os.makedirs("../audio/segments/", exist_ok=True)
extractAudio("../floak_ep1.mp4", "../audio/floak_ep1.wav")
silenceRemoval("../audio/floak_ep1.wav", "../audio/segments/")

Extracted audio to audio/floak_ep1.wav
Detecting silences...
Writing segments...
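
Each segment file name ends in `_<start>-<end>.wav`, with both times in seconds, so the timing can be recovered from the name alone. The sketch below shows one way to list segments in chronological order and total the detected speech; `parse_segment_times` and `summarize_segments` are hypothetical helpers, not part of the original notebook. Sorting on the parsed number matters because a plain string sort would place `10.000` before `2.000`.

```python
import os

def parse_segment_times(path):
    """Return (start, end) in seconds from a name like 'clip.wav_1.500-3.250.wav'."""
    stem = os.path.basename(path)[:-len(".wav")]
    start, end = stem.rsplit("_", 1)[-1].split("-")
    return float(start), float(end)

def summarize_segments(paths):
    """Sort segments by start time and total their duration in seconds."""
    ordered = sorted(paths, key=lambda p: parse_segment_times(p)[0])
    total = sum(end - start for start, end in map(parse_segment_times, ordered))
    return ordered, total
```

It could be called as `summarize_segments(glob("../audio/segments/*.wav"))` after importing `glob` from the `glob` module.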


## Set up the STT pipeline

For this tutorial, we will use the minimal quantized model
to run inference. If you want the full model for further
fine-tuning, refer to the [Hugging Face model page](https://huggingface.co/shahukareem/wav2vec2-large-xlsr-53-dhivehi).

### Download the STT model and extract it somewhere

In [2]:
import gdown
import os
from shutil import unpack_archive

# set output dir
op_dir = "../models"
op_file = os.path.join(op_dir, "w2v2-53.tar.gz")

In [None]:
# download and extract
os.makedirs(op_dir, exist_ok=True)
gdown.download(
    "https://drive.google.com/uc?id=1m6QXhMF9Zf6P04Z1D2qFiQjEFo16Vexv",
    op_file
)
unpack_archive(op_file, op_dir)
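
If the download or extraction fails silently, the next cell will break when it tries to load the model. A small sketch of a check is shown here; `missing_model_files` is a hypothetical helper, and the two file names are the ones this tutorial loads from the `stt_model` directory in the following cell.

```python
import os

# File names taken from the paths used when the model is initialized
EXPECTED_MODEL_FILES = ("wav2vec_traced_quantized.pt", "vocab.json")

def missing_model_files(model_dir, expected=EXPECTED_MODEL_FILES):
    """Return the expected file names that are absent from model_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(model_dir, name))]
```

Calling `missing_model_files(os.path.join("../models", "stt_model"))` should return an empty list once the archive has been unpacked.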

### Initialize the STT model and prepare it for inference

In [3]:
import os
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2CTCTokenizer, Wav2Vec2Processor

model_dir = os.path.join(op_dir, "stt_model")
STT_MODEL_PATH = os.path.join(model_dir, "wav2vec_traced_quantized.pt")
STT_VOCAB_FILE = os.path.join(model_dir, "vocab.json")
SAMPLING_RATE = 16000

tokenizer = Wav2Vec2CTCTokenizer(STT_VOCAB_FILE, unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=SAMPLING_RATE, padding_value=0.0, do_normalize=True, return_attention_mask=False)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
model = torch.jit.load(STT_MODEL_PATH)

def transcribe(audio_path):
    audio_input, sr = librosa.load(audio_path, sr=SAMPLING_RATE)
    inputs = processor(
        audio_input,
        sampling_rate=SAMPLING_RATE,
        return_tensors="pt",
        padding=True
    )

    with torch.no_grad():
        logits = model(inputs.input_values)['logits']

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]
    return transcription


def process_audio(audio_file):
    # Segment names end in "_<start>-<end>.wav", with both times in seconds
    start, end = os.path.basename(audio_file)[:-4].split("_")[-1].split("-")
    transcription = transcribe(audio_file)
    return start, end, transcription
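
The `argmax` followed by `batch_decode` in `transcribe` amounts to greedy CTC decoding: take the most likely token for each frame, collapse consecutive repeats, drop the padding (blank) token, and turn the word delimiter into spaces. The sketch below is a simplified pure-Python illustration of that idea, not the tokenizer's actual implementation; `ctc_greedy_decode` and the toy vocabulary are made up for the example.

```python
def ctc_greedy_decode(ids, id_to_token, blank="[PAD]", delimiter="|"):
    """Collapse repeated ids, drop blanks, and map the word delimiter to spaces."""
    tokens = []
    previous = None
    for i in ids:
        # Only emit a token when the id changes from the previous frame
        if i != previous:
            tokens.append(id_to_token[i])
        previous = i
    text = "".join(t for t in tokens if t != blank)
    # Replace delimiters with spaces and normalize repeated whitespace
    return " ".join(text.replace(delimiter, " ").split())

# Toy vocabulary for illustration only; the real one lives in vocab.json
TOY_VOCAB = {0: "[PAD]", 1: "|", 2: "a", 3: "b"}
print(ctc_greedy_decode([2, 2, 0, 2, 3, 3, 1, 1, 3, 0], TOY_VOCAB))  # prints "aab b"
```

Note that the blank between the first two `a` frames is what allows a genuinely doubled letter to survive the collapsing step.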

## Transcribe the audio to subtitles

In [11]:
from glob import glob
from tqdm import tqdm
import os
import datetime


def write_to_file(file_handle, inferred_text, line_count, limits):
    """Write the inferred text to SRT file
    Follows a specific format for SRT files
    Args:
        file_handle : SRT file handle
        inferred_text : text to be written
        line_count : subtitle line count
        limits : starting and ending times for text
    """

    def to_srt_time(seconds):
        # SRT timestamps use the form HH:MM:SS,mmm with three-digit milliseconds
        ms = int(round(float(seconds) * 1000))
        hours, ms = divmod(ms, 3600000)
        minutes, ms = divmod(ms, 60000)
        secs, ms = divmod(ms, 1000)
        return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, minutes, secs, ms)

    from_dur = to_srt_time(limits[0])
    to_dur = to_srt_time(limits[1])

    file_handle.write(str(line_count) + "\n")
    file_handle.write(from_dur + " --> " + to_dur + "\n")
    file_handle.write(inferred_text + "\n\n")
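
An SRT entry consists of an index line, a time range in the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`, one or more lines of text, and a blank line. As a rough sanity check on generated output, the sketch below validates the first two lines of an entry with a regular expression; `SRT_TIME_LINE` and `is_valid_srt_block` are hypothetical names, and the check is deliberately loose (it does not verify that the end time follows the start time).

```python
import re

# Two timestamps with two-digit fields and three-digit milliseconds
SRT_TIME_LINE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")

def is_valid_srt_block(block):
    """Check that a block has an index line, a well-formed time line, and text."""
    lines = block.strip().split("\n")
    return (len(lines) >= 3
            and lines[0].isdigit()
            and bool(SRT_TIME_LINE.match(lines[1])))
```

Splitting a finished `.srt` file on blank lines and passing each piece to `is_valid_srt_block` gives a quick way to spot malformed timestamps before loading the file in a video player.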

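In [None]:
# glob returns files in arbitrary order, so sort numerically by start time
segment_files = sorted(glob("../audio/segments/*.wav"),
                       key=lambda p: float(os.path.basename(p)[:-4].split("_")[-1].split("-")[0]))

# Dhivehi is written in Thaana script, so write the file as UTF-8
with open("../floak_ep1.srt", "w", encoding="utf-8") as f:
    line_count = 0
    for w_file in tqdm(segment_files):
        start, end, transcription = process_audio(w_file)
        if len(transcription.strip()) == 0:
            continue
        line_count += 1  # SRT entries are numbered sequentially from 1
        write_to_file(f, transcription, line_count, (start, end))

100%|██████████| 232/232 [05:50<00:00,  1.51s/it]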

