<a href="https://colab.research.google.com/github/gnloop/MFA-Universal-Notebook/blob/slicer/MFA_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **[Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/)**
_Command line utility for forced alignment using Kaldi_

\
____
\

## _Known issues:_
- It really struggles with long silences, long notes and humming.

- Alignment quality also depends on Whisper's transcriptions, which aren't always accurate: if you want a better base, it's highly recommended to edit the transcriptions before aligning!

- The pretrained dictionaries for MFA are often lackluster and/or inaccurate when applied to singing, which affects the label quality (I've particularly noticed this with French).

- It's possible to supplement the dictionaries with G2P models, but I haven't implemented that.
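
One way to gauge how far a pretrained dictionary falls short is to list the words in your transcriptions that it doesn't contain; those are the words a G2P model would have to cover. Here's a minimal sketch, assuming the dictionary is a plain-text file with the word in the first whitespace-separated column. The function names `load_dictionary_words` and `find_oov_words` are made up for illustration and aren't part of MFA or this notebook.

```python
import re
from pathlib import Path


def load_dictionary_words(dict_path):
    """Return the set of lowercase words found in the first column of a dictionary file."""
    words = set()
    for line in Path(dict_path).read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if parts:
            words.add(parts[0].lower())
    return words


def find_oov_words(txt_dir, dictionary_words):
    """Return the sorted words from every .txt file in txt_dir that the dictionary lacks."""
    missing = set()
    for txt_file in Path(txt_dir).glob("*.txt"):
        text = txt_file.read_text(encoding="utf-8").lower()
        # Crude tokenizer (an assumption): every run of Unicode letters counts as a word
        for word in re.findall(r"[^\W\d_]+", text):
            if word not in dictionary_words:
                missing.add(word)
    return sorted(missing)
```

You'd call it with the path of a downloaded dictionary file and the `txt` folder. The tokenizer is deliberately crude, so treat the result as a rough list to review rather than an exact count.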

\
____


# Huge thanks to PixPrucer and HAI-D for making nearly every part of this notebook. I just updated a few things here and there ⛹

## **Upload files**

In [None]:
#@title Mount Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#@title Unzip and optionally slice corpus
#@markdown Unzip your dataset for transcription. Make sure the archive contains only wavs (15-30 seconds in length recommended).<br>If your samples are long, you can use the audio slicer by setting slice_samples to True.

file_location = '/content/drive/MyDrive/wav.zip' #@param {type:"string"}
slice_samples = True #@param {type:"boolean"}

!7z x "$file_location" -o/content/db
from IPython.display import clear_output
clear_output()
print("Wavs extracted in db folder")

if slice_samples:
  !git clone https://github.com/openvpi/audio-slicer
  !pip install librosa
  !pip install soundfile
  !mkdir /content/slicer-temp
  !mv /content/db/*.wav /content/slicer-temp/
  import os

  # Change directory to the folder holding the unsliced wavs
  os.chdir('/content/slicer-temp')

  # Loop through all files in the folder
  for filename in os.listdir():
      # Check if the file is a .wav file
      if filename.endswith('.wav'):
          # Execute the command for the current file
          !python /content/audio-slicer/slicer2.py /content/slicer-temp/"$filename" --out /content/db/
  from IPython.display import clear_output
  clear_output()
  print("Your sliced audios should be in the 'db' folder and the old files will be in the 'slicer-temp' folder")
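
To see whether your samples actually land in the recommended 15-30 second range (before or after slicing), you can measure them. This is a minimal sketch using only the standard library. The helper names and the default thresholds are assumptions for illustration, and the `wave` module only reads uncompressed PCM wavs.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def find_out_of_range(folder, min_seconds=15.0, max_seconds=30.0):
    """Return (filename, duration) pairs for .wav files outside the given range."""
    flagged = []
    for wav_path in sorted(Path(folder).glob("*.wav")):
        duration = wav_duration_seconds(wav_path)
        if not min_seconds <= duration <= max_seconds:
            flagged.append((wav_path.name, round(duration, 2)))
    return flagged
```

Running `find_out_of_range('/content/db')` lists the files you may want to slice further or merge.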

In [None]:
#@title (Optional) Unzip edited transcriptions
#@markdown Unzip your own transcriptions into the `txt` folder so you don't need to use Whisper.

file_location = '/content/drive/MyDrive/txt.zip' #@param {type:"string"}

!7z x "$file_location" -o/content/txt
from IPython.display import clear_output
clear_output()
print("Transcriptions extracted in txt folder")
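
MFA pairs each audio file with the transcription that has the same base name, so if you bring your own transcriptions it's worth checking that the names line up (especially after slicing, which produces new file names). A minimal sketch; `compare_stems` is a hypothetical helper, not something this notebook defines.

```python
from pathlib import Path


def compare_stems(wav_dir, txt_dir):
    """Return (wavs_without_txt, txts_without_wav) as sorted lists of base names."""
    wav_stems = {p.stem for p in Path(wav_dir).glob("*.wav")}
    txt_stems = {p.stem for p in Path(txt_dir).glob("*.txt")}
    return sorted(wav_stems - txt_stems), sorted(txt_stems - wav_stems)
```

For example, `compare_stems('/content/db', '/content/txt')` should return two empty lists when every wav has a transcription and vice versa.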

# **Whisper inference (Auto-transcriptions)**

In [None]:
#@title Install Whisper
!pip install -U openai-whisper
# Whisper needs the ffmpeg binary; the PyPI package named "ffmpeg" does not provide it
!apt-get -qq install -y ffmpeg
from IPython.display import clear_output
clear_output()
print("All done!")

In [None]:
#@title Whisper inference
#@markdown **Make transcriptions** <br/> Note that your singing database shouldn't contain long pauses, *ooh-ing*, lalala-ing, humming, etc., otherwise the transcription will probably break (Whisper recognises those poorly).
#Implemented from https://github.com/openai/whisper/discussions/1041 by Haru0l

import os
os.makedirs('/content/txt/', exist_ok=True)
# ("!cd" runs in a subshell and would have no effect here; absolute paths are used below)

import whisper
from whisper.tokenizer import get_tokenizer

# Encourage the model to transcribe words literally by suppressing digit-only tokens
tokenizer = get_tokenizer(multilingual=True)  # use multilingual=True if using multilingual model
number_tokens = [
    i
    for i in range(tokenizer.eot)
    if all(c in "0123456789" for c in tokenizer.decode([i]).removeprefix(" "))
]

# Load the model once instead of once per file
model = whisper.load_model("medium")

def Transcriber(audiofile):
    answer = model.transcribe(audiofile, suppress_tokens=[-1] + number_tokens)

    print(answer['text'])

    # Name the transcription after the audio file that was passed in
    output_txt = os.path.join('/content/txt/', os.path.splitext(os.path.basename(audiofile))[0] + '.txt')

    with open(output_txt, 'w', encoding='utf-8') as f:
      f.write(answer['text'])

for filename in os.listdir('/content/db/'):
  if filename.endswith('.wav'):
    file_path = os.path.join('/content/db/', filename)
    Transcriber(file_path)
from IPython.display import clear_output
clear_output()
print("Hopefully everything worked and your transcriptions are in the 'txt' folder!")
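
The `number_tokens` filter is easier to follow on a toy example. The sketch below applies the same check to a handful of made-up strings standing in for decoded vocabulary entries (they are not real Whisper tokens). One thing it brings out: `all(...)` over an empty string is `True`, so an entry that decodes to nothing or to a single space passes the check too.

```python
def is_digits_only(decoded_token):
    """Same check as the number_tokens filter: drop one leading space,
    then require every remaining character to be a digit."""
    return all(c in "0123456789" for c in decoded_token.removeprefix(" "))


# Toy stand-ins for decoded vocabulary entries (not real Whisper tokens)
toy_vocab = [" 20", "24", " twenty", " 3rd", " ", ""]
suppressed = [token for token in toy_vocab if is_digits_only(token)]
print(suppressed)  # [' 20', '24', ' ', '']
```

With digit-only tokens suppressed, the model has to spell numbers out as words, which is what the aligner's dictionary needs.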

In [None]:
#@title (Optional) Zip up text transcriptions `txt` for you to download and edit
#@markdown Make sure there are only .txt files in your .zip archive!
!zip -j transcriptions.zip /content/txt/*.txt
from IPython.display import clear_output
clear_output()
print("Your transcriptions are now in 'transcriptions.zip'")

# **Auto-alignment (MFA)**

In [None]:
#@title Install Condacolab
#@markdown The session will crash and restart, that's normal!
!pip install -q condacolab
import condacolab
condacolab.install()
from IPython.display import clear_output
clear_output()
print("All done, please wait for the session to restart!")

In [None]:
#@title Install MFA
!conda install -y -c conda-forge montreal-forced-aligner spacy sudachipy sudachidict-core
!pip install speechbrain
from IPython.display import clear_output
clear_output()
print("All done!")

In [None]:
#@title Download the alignment models
#@markdown Choose the model for your desired language from the pages below, then scroll down to "Installation" to find its name: it's the last word after "mfa model download acoustic/dictionary" (e.g.: italian_cv)<br>Acoustic models: https://mfa-models.readthedocs.io/en/latest/acoustic/index.html<br>Dictionaries: https://mfa-models.readthedocs.io/en/latest/dictionary/index.html
acoustic = 'italian_cv' #@param {type:"string"}
dictionary = 'italian_cv' #@param {type:"string"}
# Download the acoustic model
!mfa model download acoustic "$acoustic"
# Download the pronunciation dictionary
!mfa model download dictionary "$dictionary"
from IPython.display import clear_output
clear_output()
print("Alignment models downloaded!")


In [None]:
#@title Start aligning!
!mv /content/txt/*.txt /content/db
!mfa align /content/db "$dictionary" "$acoustic" /content/alignment --beam 400
from IPython.display import clear_output
clear_output()
print("All done! Check the 'alignment' folder for .TextGrid files")

# **LAB Converter**

In [None]:
#@title Install TextGrid to LAB converter
!pip install mytextgrid
!git clone https://github.com/gnloop/MFA-Universal-Notebook
!mv /content/MFA-Universal-Notebook/text2lab_test.py /content/alignment/
from IPython.display import clear_output
clear_output()
print("All done!")

### HALT! Before you go happily converting your TextGrid files, if your language is not present in the dropdown list below, you're gonna have to make your own 'custom_converter.txt' in the converters folder and choose it in the dropdown menu.
#### Most MFA models use some sort of IPA system which doesn't sit well with DiffSinger.
#### For further details on what phonemes MFA uses, you should check out the webpage where the MFA models are listed. There's usually a phoneme list there as well.
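
The layout of the converter files isn't shown in this notebook, so the sketch below doesn't try to reproduce it. It only illustrates the idea behind a converter: a lookup table from MFA's IPA-style symbols to the ASCII labels you want, plus a report of any symbols you forgot to map. The table entries and the function name `convert_phonemes` are hypothetical.

```python
# Hypothetical mapping from MFA-style IPA symbols to ASCII labels (illustration only)
IPA_TO_ASCII = {
    "ʎ": "gl",
    "ɲ": "gn",
    "ʃ": "sh",
    "tʃ": "ch",
    "dʒ": "j",
}


def convert_phonemes(phonemes, mapping):
    """Replace each phoneme found in the mapping; collect non-ASCII ones that are not."""
    converted, unmapped = [], []
    for phoneme in phonemes:
        if phoneme in mapping:
            converted.append(mapping[phoneme])
        else:
            converted.append(phoneme)
            if not phoneme.isascii():
                unmapped.append(phoneme)
    return converted, unmapped
```

Any symbol that ends up in the `unmapped` list is one your custom converter still needs an entry for.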

In [None]:
#@title Convert TextGrid to LAB
converter = 'converter_IT' #@param ['converter_JP', 'converter_EN-ARPA', 'converter_ES_tiny', 'converter_ES_Ryoku', 'converter_ES_njokis', 'converter_IT', 'custom_converter']
!python -X utf8 /content/alignment/text2lab_test.py -c /content/MFA-Universal-Notebook/converters/"$converter".txt
from IPython.display import clear_output
clear_output()
print("Your .lab files should now be under the 'alignment' folder!")


In [None]:
#@title (ITALIAN ONLY) Fix labels
#@markdown DO NOT USE FOR OTHER LANGUAGES<br>This will make its best attempts to fix some of the labels. However, manual editing will still be required.
#@markdown <br><br>Known issues are:<br>- SH used instead of SK<br>- Poor estimation of consonant clusters<br>- Poor performance of the dictionary
import re
import os

def process_lab_files(lab_file_path):
    """Divides double consonants, excluding 'rr', adjusts durations,
       and replaces 'u' with 'w' and 'i' with 'y' if duration is less than 500000
       and preceded by a consonant and followed by a vowel in the previous and next labels.

    Args:
        lab_file_path: The path to the .lab file.
    """
    with open(lab_file_path, 'r') as f:
        lines = f.readlines()

    modified_lines = []
    for i, line in enumerate(lines):
        parts = line.strip().split()
        if len(parts) == 3:  # Check if it's a valid label line
            start_time, end_time, phoneme = parts

            # Gemination fix: Divide double consonants
            double_consonants = re.findall(r'(?!rr)([a-z])\1', phoneme)
            if double_consonants:
                duration = float(end_time) - float(start_time)
                half_duration = duration / 2
                modified_lines.append(f"{int(float(start_time))} {int(float(start_time) + half_duration)} {double_consonants[0]}\n")
                modified_lines.append(f"{int(float(start_time) + half_duration)} {int(float(end_time))} {double_consonants[0]}\n")
                continue  # Skip to the next line to avoid processing the original line

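            # Cluster duration fix: Replace short 'u' and 'i' with exceptions
            # (times go through float() first so values like "500000.0" don't raise)
            duration = int(float(end_time)) - int(float(start_time))
            if duration < 500000:
                # Check for preceding consonant and following vowel in previous and next labels
                # (blank neighbouring lines yield an empty phoneme instead of an IndexError)
                prev_parts = lines[i - 1].split() if i > 0 else []
                next_parts = lines[i + 1].split() if i < len(lines) - 1 else []
                prev_phoneme = prev_parts[-1] if prev_parts else ""
                next_phoneme = next_parts[-1] if next_parts else ""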

                if re.search(r'[bcdfghjklmnprstvwxz]', prev_phoneme, re.IGNORECASE) and re.search(r'[aeiou]', next_phoneme, re.IGNORECASE):
                    if phoneme == 'u':
                        phoneme = 'w'
                    elif phoneme == 'i':
                        phoneme = 'y'

            modified_lines.append(f"{start_time} {end_time} {phoneme}\n")  # Append the modified or original line

        else:
            modified_lines.append(line)  # Keep lines that are not labels

    with open(lab_file_path, 'w') as f:
        f.writelines(modified_lines)

# Specify the folder containing your .lab files
folder_path = '/content/alignment/'  # Replace with your folder path

# Loop through all files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith('.lab'):  # Process only .lab files
        lab_file_path = os.path.join(folder_path, filename)
        process_lab_files(lab_file_path)

print("Fixed gemination and consonant clusters for all .lab files in the folder!")
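
To make the gemination fix concrete, here is a simplified, standalone version of the split applied to a single label. It only handles labels that are exactly one doubled letter, which is narrower than the regex in the cell above, and `split_geminate` is a name made up for this example. The times are assumed to be in HTK-style units of 100 ns; under that assumption the 500000 threshold in the cluster fix corresponds to 50 ms.

```python
def split_geminate(start, end, phoneme):
    """Split a doubled-letter label such as 'tt' into two equal halves.

    Returns a list of (start, end, phoneme) tuples; labels that are not a
    doubled letter (or are 'rr') come back unchanged as a single tuple.
    """
    if len(phoneme) == 2 and phoneme[0] == phoneme[1] and phoneme != "rr":
        midpoint = start + (end - start) // 2
        return [(start, midpoint, phoneme[0]), (midpoint, end, phoneme[0])]
    return [(start, end, phoneme)]


# Times assumed to be in 100 ns units, so 1_000_000 is 0.1 s
print(split_geminate(0, 1_000_000, "tt"))  # [(0, 500000, 't'), (500000, 1000000, 't')]
print(split_geminate(0, 1_000_000, "rr"))  # [(0, 1000000, 'rr')]
```

The two halves are only a starting point: the real boundary between the consonants still needs to be placed by ear.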

In [None]:
#@title Zip output
#@markdown Enter the path where you want your labels saved (no / at the end)
zip_path = '/content/drive/MyDrive/MFA_output' #@param {type:"string"}
#@markdown Tick this box if you also want to save the sliced samples.
save_samples = True #@param {type:"boolean"}
import os

# Create the directory if it doesn't exist
os.makedirs(zip_path, exist_ok=True)

# Zip wav files if save_samples is True (the path is quoted so spaces don't break it)
if save_samples:
    os.system(f'zip -j "{zip_path}/sliced_wavs.zip" /content/db/*.wav')

# Zip labels
!zip -j "{zip_path}/labels.zip" /content/alignment/*.lab

from IPython.display import clear_output
clear_output()
print("You can now download your labels in the labels.zip file and wavs in the sliced_wavs.zip file!")
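
The same archives can also be built without going through the shell, which sidesteps any quoting concerns for Drive folders with spaces in their names. A minimal sketch with the standard library; `zip_flat` is a hypothetical helper that mimics `zip -j` by storing files without their folder paths.

```python
import zipfile
from pathlib import Path


def zip_flat(source_dir, pattern, zip_file):
    """Zip every file in source_dir matching pattern, storing names without folders."""
    zip_file = Path(zip_file)
    # Create the destination folder if it doesn't exist yet
    zip_file.parent.mkdir(parents=True, exist_ok=True)
    files = sorted(Path(source_dir).glob(pattern))
    with zipfile.ZipFile(zip_file, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            archive.write(path, arcname=path.name)
    return [path.name for path in files]
```

For instance, `zip_flat('/content/alignment', '*.lab', zip_path + '/labels.zip')` would produce the labels archive and return the names it packed.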

# **Extras**

In [None]:
#@title Whisper (Transformers) Install
!pip install --upgrade pip
!pip install --upgrade transformers accelerate torchaudio
!pip install "punctuators==0.0.5"
!pip install "pyannote.audio"
!pip install git+https://github.com/huggingface/diarizers.git

In [None]:
#@title Whisper (Transformers) Inference
#@markdown **Make transcriptions a bit faster with community-made models** <br/> Whisper (Transformers) is HuggingFace's implementation of Whisper, 7x faster than the original Whisper. It sits in the extra options because it's more complex to use. It also lets you use community-made models, which is perfect for audio with strong accents that would need a fine-tuned model.
#Implemented from https://github.com/openai/whisper/discussions/1041 by Haru0l, edited to be compatible with Transformers HuggingFace implementation

from transformers import pipeline
import os

os.makedirs('/content/txt/', exist_ok=True)

#@markdown <b><font size="3.5"> Model name (HuggingFace)
model_name = "openai/whisper-medium"  #@param {type:"string"}
#@markdown <b><font size="3.5"> generate_kwargs
language = "Japanese"  #@param {type:"string"}
no_repeat_ngram_size = 0  #@param {type:"integer"}
repetition_penalty = 1.0  #@param {type:"number"}

generate_kwargs = {
    "language": language,
    "no_repeat_ngram_size": no_repeat_ngram_size,
    "repetition_penalty": repetition_penalty,
}

# Build the pipeline once instead of once per file
asr_pipeline = pipeline("automatic-speech-recognition", model=model_name, generate_kwargs=generate_kwargs)

def transcriber(audiofile):
    print(f"Processing file: {audiofile}")
    result = asr_pipeline(audiofile)

    transcription = result['text']
    print(transcription)

    # Name the transcription after the audio file
    output_txt = os.path.join('/content/txt/', os.path.splitext(os.path.basename(audiofile))[0] + '.txt')
    with open(output_txt, 'w', encoding='utf-8') as f:
        f.write(transcription)

audio_directory = '/content/db/'
for filename in os.listdir(audio_directory):
    if filename.endswith('.wav'):
        file_path = os.path.join(audio_directory, filename)
        transcriber(file_path)

from IPython.display import clear_output
clear_output()
print("Hopefully everything worked and your transcriptions are in the 'txt' folder!")


In [None]:
# @title WhisperX Install
# WhisperX needs the ffmpeg binary; the PyPI package named "ffmpeg" does not provide it
!apt-get -qq install -y ffmpeg
!pip install git+https://github.com/m-bain/whisperx.git

In [None]:
# @title WhisperX inference
# @markdown **Make transcriptions faster with optional forced-alignment** <br/> WhisperX is an alternative that is 70x faster than the original Whisper. It sits in the extra options because it's more complex to use. It can timestamp words and/or sentences, which makes it usable as a slicer.

import os
import whisperx
import torch
from IPython.display import clear_output

# @markdown ### Select Model, Language, and Batch Size
Model = "large-v3"  # @param ["tiny", "tiny.en", "base", "base.en", "small", "small.en", "medium", "medium.en", "large-v1", "large-v2", "large-v3"]
Language = "ja"  # @param {type:"string"}
Batch_size = 9  # @param {type:"integer"}
# @markdown <font size="-1.5">
Chunk_size = 30 # @param {type:"integer"}

# @markdown ### Forced Alignment Options
Forced_alignment = False  # @param {type:"boolean"}
Alignment_mode = "Sentence-level" # @param ["Word-level", "Sentence-level"] {allow-input: false}
Export_format = "Audacity"  # @param ["Audacity", ".lab"] {allow-input: false}

# @markdown <font size="-1.5"> the maximum number of words in a line before breaking the line (default: None)
Max_words_per_line = None # @param {type:"integer"}
# @markdown <font size="-1.5"> the maximum number of characters in a line before breaking the line (default: None)
Max_line_width = None # @param {type:"integer"}
# @markdown <font size="-1.5"> the maximum number of lines in a segment (default: None)
Max_line_count = None # @param {type:"integer"}

# Create necessary directories
os.makedirs('/content/forced-alignment/', exist_ok=True)
os.makedirs('/content/txt/', exist_ok=True)

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Transcribe with original whisper (batched)
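# float16 needs a GPU; fall back to int8 when running on CPU
compute_type = "float16" if device == "cuda" else "int8"
model = whisperx.load_model(Model, device, compute_type=compute_type, language=Language)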

def transcribe_audio(audio_file):
    audio = whisperx.load_audio(audio_file)
    result = model.transcribe(audio, batch_size=Batch_size, language=Language, chunk_size=Chunk_size)
    return result

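# Cache for the alignment model so it is loaded once, not once per file
_align_cache = {}

def align_audio(result, audio_file):
    # 2. Align the text to the audio
    if "model" not in _align_cache:
        _align_cache["model"], _align_cache["metadata"] = whisperx.load_align_model(language_code=Language, device=device)
    model_a, metadata = _align_cache["model"], _align_cache["metadata"]
    result_aligned = whisperx.align(result["segments"], model_a, metadata, audio_file, device, return_char_alignments=False)
    return result_aligned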

def format_transcription(result, mode, max_words=None, max_width=None, max_lines=None):
    """Formats the transcription according to the specified parameters."""
    formatted_lines = []
    if mode == "Word-level":
        for segment in result["segments"]:
            for word in segment["words"]:
                # Words that could not be aligned may lack 'start'/'end', so use .get()
                if word.get("start") is not None and word.get("end") is not None:
                    formatted_lines.append(f"{word['start']:.3f}\t{word['end']:.3f}\t{word['word']}")

    elif mode == "Sentence-level":
        for segment in result["segments"]:
            if segment["start"] is not None and segment["end"] is not None:
                current_line = []
                current_line_width = 0
                line_count = 0
                segment_start_time = segment['start']
                segment_end_time = segment['end']
                # Words that could not be aligned may lack 'start'/'end', so use .get()
                words_with_times = [(word, w.get('start'), w.get('end')) for word, w in zip(segment['text'].strip().split(), segment['words'])]

                for i, (word, start, end) in enumerate(words_with_times):
                    if (max_words and len(current_line) >= max_words) or \
                       (max_width and current_line_width + len(word) + 1 > max_width) or \
                       (max_lines and line_count >= max_lines):

                        # Determine end time of the current line
                        line_end_time = words_with_times[i - 1][2] if i > 0 else segment_start_time
                        if line_end_time is None:
                            # Previous word has no timestamp: fall back to the line's start time
                            line_end_time = segment_start_time

                        formatted_lines.append(f"{segment_start_time:.3f}\t{line_end_time:.3f}\t" + " ".join(current_line))
                        current_line = []
                        current_line_width = 0
                        line_count += 1
                        segment_start_time = line_end_time  # Update start time for the next line

                    current_line.append(word)
                    current_line_width += len(word) + 1

                # Handle the last line in the segment
                if current_line:
                    formatted_lines.append(f"{segment_start_time:.3f}\t{segment_end_time:.3f}\t" + " ".join(current_line))

    return formatted_lines

def export_labels(result_aligned, filename):
    output_filename = os.path.join('/content/forced-alignment/', os.path.splitext(filename)[0])

    formatted_transcription = format_transcription(
        result_aligned,
        Alignment_mode,
        Max_words_per_line,
        Max_line_width,
        Max_line_count
    )

    extension = ".txt" if Export_format == "Audacity" else ".lab"
    with open(output_filename + extension, "w") as f:
        for line in formatted_transcription:
            f.write(line + "\n")

for filename in os.listdir('/content/db/'):
    if filename.endswith('.wav'):
        file_path = os.path.join('/content/db/', filename)

        result = transcribe_audio(file_path)
        full_text = " ".join([segment["text"] for segment in result["segments"]])

        output_txt = os.path.join('/content/txt/', os.path.splitext(filename)[0] + '.txt')
        with open(output_txt, 'w') as f:
            f.write(full_text)

        if Forced_alignment:
            result_aligned = align_audio(result, file_path)
            export_labels(result_aligned, filename)

clear_output()
print("Transcription complete! Transcriptions are in the 'txt' folder.")
if Forced_alignment:
    print(f"Forced alignment labels ({Alignment_mode}) are in the 'forced-alignment' folder.")
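
The files written above use the Audacity label layout: start time, end time (both in seconds) and the label text, separated by tabs. The sketch below is a stripped-down illustration of the `Max_words_per_line` idea on a list of timed words; it leaves out the width and line-count limits. `words_to_label_lines` and the sample words are made up for the example.

```python
def words_to_label_lines(words, max_words):
    """Group (text, start, end) word tuples into tab-separated label lines.

    Each line spans from the start of its first word to the end of its last.
    """
    lines = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        start, end = chunk[0][1], chunk[-1][2]
        text = " ".join(word for word, _, _ in chunk)
        lines.append(f"{start:.3f}\t{end:.3f}\t{text}")
    return lines


# Made-up timings, in seconds
sample = [("la", 0.0, 0.4), ("vita", 0.4, 1.0), ("è", 1.2, 1.5)]
for line in words_to_label_lines(sample, max_words=2):
    print(line)
```

Each printed line can be pasted into a text file and loaded in Audacity as a label track, which is handy for checking where a sample would be cut if you use the timestamps as a slicer.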