> *Use this Colab notebook to auto-generate sub-word timestamps for a vocals-only audio file, for your karaoke-making purposes.*
>
>*It's loosely based on, and heavily modified from, [this official tutorial notebook](https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/tools/NeMo_Forced_Aligner_Tutorial.ipynb).*
>
>*It doesn't require a GPU and should be able to run on any Linux-based runtime, including the default free ones. 👍*
>
>*Enjoy!*

-- *Kadda OK*

# 1. Configuration

In [None]:
# @title Vocal Audio Source

# @markdown  There are a few ways to get the vocal WAV file into this Colab instance.
# @markdown
# @markdown  - **Google Drive**: Upload the wave file to your drive, then select it in the drive
# @markdown  and go to `Share`. \
# @markdown  In the Share popup, under General Access, select "`Anyone with the link`", and then
# @markdown  Copy link. \
# @markdown Paste it in the field below. \
# @markdown *(For example, "`https://drive.google.com/file/d/SoMe_iDeNtIfIer/view?usp=drive_link`")*
google_drive_url = "" # @param {type:"string"}
# @markdown **OR**

# @markdown  - **Direct Download**: If you have somewhere that's open to the internet that you can
# @markdown  put a wave file, and get a url to it that, when visited in a browser, will either
# @markdown  start downloading it immediately as a file, or will show play/pause/seek/time controls
# @markdown  and nothing else, no site branding or other navigation, then you can use that URL here.
# @markdown  (If it doesn't behave that way, this won't work.)\
# @markdown  You can paste such a link in this field.\
# @markdown Note: you may have to include quotation marks if `wget` yells at you about extra parameters. \
# @markdown *(For example, "`"https://some.site/your_stuff/songToLoad.wav"`")*
direct_url = "" # @param {type:"string"}
# @markdown **OR**

# @markdown  - **Local File Path**: If you are running this colab connected to a ***LOCAL*** runtime,
# @markdown  you can just use the path of the file directly.\
# @markdown Note: you will have to include quotation marks if the path contains spaces. \
# @markdown *(For example, "`"/mnt/c/audio/song To Load.wav"`")*
local_file_path = "" # @param {type:"string"}
# @markdown

# @markdown **SO**, which of the above do you want me to use? \
# @markdown *(YES you still have to tell me, even if you only filled in one of the inputs above, because I'm lazy.)*
get_audio_from = "Google Drive" # @param ["Google Drive", "Direct Download", "Local File"]

## Lyrics

Input between the `"""` lines the text of all the lyrics.

This should include all repeated lines, as this exact text in order is what will be aligned.

For best results you probably want to *remove* any `.`, `,`, `!`, and `?` characters.

To control line breaks in the .ASS output, end each line with a `|` character.
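
If your lyrics come from a pasted lyric sheet, the cleanup described above can be scripted. The following is only a rough sketch: `prepare_lyrics` is a hypothetical helper (not part of NeMo or this notebook) that strips `.`, `,`, `!`, and `?`, drops blank lines, and ends each line with ` |`. It mirrors the sample lyrics in leaving the separator off the final line.

```python
import re


def prepare_lyrics(raw_lyrics: str, separator: str = "|") -> str:
    """Clean pasted lyrics into the one-line-per-segment form used in this notebook.

    Hypothetical helper for illustration only.
    """
    lines = []
    for line in raw_lyrics.splitlines():
        # Remove the punctuation that the notes above suggest leaving out.
        cleaned = re.sub(r"[.,!?]", "", line).strip()
        # Drop a separator that is already there so it is not doubled.
        if cleaned.endswith(separator):
            cleaned = cleaned[: -len(separator)].rstrip()
        if cleaned:
            lines.append(cleaned)
    # Put the separator at the end of every line except the last one.
    return f" {separator}\n".join(lines)
```

You could then set `text = prepare_lyrics(pasted_lyrics)`, where `pasted_lyrics` is whatever string you copied the lyrics into, instead of typing the `|` characters by hand.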

In [None]:
text = """
Twinkle twinkle little star |
How I wonder what you are |
Up above the world so high |
Like a diamond in this guy
"""

## .ASS generation settings

In [None]:

# @markdown To be honest, I don't know much about .ASS files, I'm just exposing
# @markdown what the NeMo Forced Aligner script provides here.

# @markdown ---
fontsize = 20 # @param {type:"number"}
verticalalign = "center" # @param ["center", "bottom"]
# @markdown ---
# @markdown > If `resegment` is checked, the ASS file will use new segments
# @markdown such that each segment will not take up more than (approximately)
# @markdown `maxlines` when the ASS file is applied to a video.
resegment = False # @param {type:"boolean"}
maxlines = 2 # @param {type:"number"}
# @markdown ---
# @markdown Color of text already sung:
sung_R = 49 # @param {type:"slider", min:0, max:255, step:1}
sung_G = 46 # @param {type:"slider", min:0, max:255, step:1}
sung_B = 61 # @param {type:"slider", min:0, max:255, step:1}
# @markdown ---
# @markdown Color of text while it is sung:
singing_R = 57 # @param {type:"slider", min:0, max:255, step:1}
singing_G = 171 # @param {type:"slider", min:0, max:255, step:1}
singing_B = 9 # @param {type:"slider", min:0, max:255, step:1}
# @markdown ---
# @markdown Color of text not yet sung:
unsung_R = 194 # @param {type:"slider", min:0, max:255, step:1}
unsung_G = 193 # @param {type:"slider", min:0, max:255, step:1}
unsung_B = 199 # @param {type:"slider", min:0, max:255, step:1}



---
Below this line, you shouldn't have to alter anything unless you run into problems.

---

# Installation
This takes about 5 minutes to run, but you should only have to run it once per session. Once it has run, to work on other files or tweak settings and try again, you can just re-run the Configuration section and then the Execution section, skipping this section entirely.

In [None]:
# cython is required to install nemo toolkit
!pip install cython

# current nemo_toolkit version as of this notebook is 1.22, but it has a bug:
# https://github.com/NVIDIA/NeMo/issues/8179
Get_NeMo_Version = "1.21"
!pip install nemo_toolkit[all]==$Get_NeMo_Version

# still need the source though apparently
BRANCH = f"v{Get_NeMo_Version}.0"
!git clone -b $BRANCH https://github.com/NVIDIA/NeMo

# need these items in order to downmix and resample to mono 16 kHz
!pip install ffmpeg pydub soundfile python-magic

# Execution

## Download Audio

In [None]:
# first make a directory WORK_DIR that we will save all our new files in
WORK_DIR="WORK_DIR"
!mkdir -p $WORK_DIR  # -p avoids an error when the directory already exists on a re-run

# name for the downloaded (or copied) input file
INPUT_FILE_NAME=f"{WORK_DIR}/downloadedInput.unknown" # it's easier this way because of gdown

if get_audio_from == "Google Drive":
  !pip install gdown
  !gdown --fuzzy $google_drive_url --output $INPUT_FILE_NAME

if get_audio_from == "Direct Download":
  !wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36" --no-check-certificate $direct_url --output-document=$INPUT_FILE_NAME

if get_audio_from == "Local File":
  !cp -v -T $local_file_path $INPUT_FILE_NAME

## Prepare Audio for NFA

NFA seems to work only with mono 16 kHz PCM audio, so we'll convert the input audio into that form.
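
If you want to confirm that a file really is in that form, here is a minimal sketch using only Python's standard `wave` module. `is_mono_16k_pcm` is a hypothetical helper name; it relies on `wave` refusing to open files that are not WAV files it can read.

```python
import wave


def is_mono_16k_pcm(path: str) -> bool:
    """Return True if `path` is a WAV file with one channel at 16000 Hz.

    Hypothetical helper for illustration only.
    """
    try:
        with wave.open(path, "rb") as wav_file:
            return wav_file.getnchannels() == 1 and wav_file.getframerate() == 16000
    except (wave.Error, EOFError):
        # Not a WAV file that the wave module can read (or the file is truncated).
        return False
```

Once the conversion cell has run, `is_mono_16k_pcm(RESAMPLED_FILE_NAME)` should return `True`.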

In [None]:
from pydub import AudioSegment
import magic
import soundfile as sf

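def detect_audio_format(file_path):
    # Detect the container from the file contents, since the file name has no useful extension
    mime_type = magic.Magic(mime=True).from_file(file_path)
    # Accept a few alternate spellings as well, in case the installed libmagic reports them
    if mime_type in ('audio/x-wav', 'audio/wav', 'audio/vnd.wave'):
        return 'wav'
    elif mime_type in ('audio/flac', 'audio/x-flac'):
        return 'flac'
    elif mime_type == 'audio/mpeg':
        return 'mp3'
    # Add more mime types as needed
    else:
        raise ValueError(f"Unsupported audio format: {mime_type}")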

# named audio_format to avoid shadowing the built-in format()
audio_format = detect_audio_format(INPUT_FILE_NAME)

# Load the audio file
sound = AudioSegment.from_file(INPUT_FILE_NAME, format=audio_format)

RESAMPLED_FILE_NAME = f"{WORK_DIR}/resampledInput.wav"
# Downmix to mono if there is more than one channel
if sound.channels > 1:
    sound = sound.set_channels(1)

# Resample to 16000 Hz if the input uses any other rate
if sound.frame_rate != 16000:
    sound = sound.set_frame_rate(16000)

# Export the processed audio
sound.export(RESAMPLED_FILE_NAME, format="wav")

## Prepare NFA Manifest
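
The manifest written by the cell below is a JSON Lines file: one JSON object per line, holding the `audio_filepath` and `text` keys. If alignment fails in a confusing way, a quick sanity check of that file can help. This is only a sketch: `check_manifest` is a hypothetical helper, and the two required keys are simply the ones this notebook writes.

```python
import json

# The two keys this notebook writes for each manifest entry.
REQUIRED_KEYS = {"audio_filepath", "text"}


def check_manifest(path: str) -> list:
    """Parse a JSON Lines manifest and return its entries, raising on problems.

    Hypothetical helper for illustration only.
    """
    entries = []
    with open(path, encoding="utf-8") as manifest_file:
        for line_number, line in enumerate(manifest_file, start=1):
            if not line.strip():
                continue  # ignore blank lines
            entry = json.loads(line)  # raises ValueError on malformed JSON
            missing = REQUIRED_KEYS - set(entry)
            if missing:
                raise ValueError(f"line {line_number} is missing {sorted(missing)}")
            entries.append(entry)
    return entries
```

After the manifest has been written, `check_manifest(manifest_filepath)` should return a single-entry list.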

In [None]:
import json

manifest_filepath = f"{WORK_DIR}/manifest.json"
manifest_data = {
    "audio_filepath": RESAMPLED_FILE_NAME,
    "text": text
}
with open(manifest_filepath, 'w') as f:
  line = json.dumps(manifest_data)
  f.write(line + "\n")

In [None]:
!cat $manifest_filepath

## Run NFA

In [None]:
!python NeMo/tools/nemo_forced_aligner/align.py \
  pretrained_name="stt_en_fastconformer_hybrid_large_pc" \
  manifest_filepath=$manifest_filepath \
  output_dir=$WORK_DIR/nfa_output/ \
  additional_segment_grouping_separator="|" \
  ass_file_config.fontsize=$fontsize \
  ass_file_config.vertical_alignment=$verticalalign \
  ass_file_config.resegment_text_to_fill_space=$resegment \
  ass_file_config.max_lines_per_segment=$maxlines \
  ass_file_config.text_already_spoken_rgb=[$sung_R,$sung_G,$sung_B] \
  ass_file_config.text_being_spoken_rgb=[$singing_R,$singing_G,$singing_B] \
  ass_file_config.text_not_yet_spoken_rgb=[$unsung_R,$unsung_G,$unsung_B]

# Finished

If this succeeded, you should now be able to go to the Files section on the left and find the appropriate files in `WORK_DIR/nfa_output`:

* `ass/tokens/resampledInput.ass` should be a usable ASS subtitle file with sub-word-level timing

* `ctm/tokens/resampledInput.ctm` can be imported into the `Recognize` tab of [Kadda OK Tools](https://github.com/KaddaOK/KaddaOKTools/) to target Karaoke Builder Studio or YouTube Movie Maker ([coming soon](https://github.com/KaddaOK/KaddaOKTools/issues/4))
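
If you'd rather inspect the timings yourself, the CTM file is plain text. The sketch below assumes the common CTM layout: whitespace-separated fields, where the third and fourth are the start time and duration in seconds and the fifth is the token. Check that against your own file before relying on it. `CtmToken` and `parse_ctm` are hypothetical names, not part of NeMo.

```python
from typing import List, NamedTuple


class CtmToken(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


def parse_ctm(ctm_text: str) -> List[CtmToken]:
    """Parse CTM-style lines into (text, start, end) tuples.

    Assumes fields 3, 4 and 5 are start, duration and token;
    any further fields on a line are ignored.
    """
    tokens = []
    for line in ctm_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or incomplete lines
        start = float(fields[2])
        duration = float(fields[3])
        tokens.append(CtmToken(fields[4], start, start + duration))
    return tokens
```

For example, you could read `WORK_DIR/nfa_output/ctm/tokens/resampledInput.ctm` with `open(...).read()` and pass the result to `parse_ctm` to list each token with its start and end time.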




In [None]:
!ls -R $WORK_DIR/nfa_output