<a href="https://colab.research.google.com/github/KahazaTester/MFA-Universal-Notebook/blob/whisper-transformers/MFA_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **[Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/)**
_Command line utility for forced alignment using Kaldi_

\
____
\

## _Known issues:_
- MFA struggles with long silences, long notes and humming.

- Alignment quality also depends on Whisper's transcriptions, which aren't always accurate: for a better starting point, it's highly recommended to edit the transcriptions by hand!

- The pretrained dictionaries for MFA are often lackluster and/or inaccurate when applied to singing, which affects the label quality (I've particularly noticed this with French).

- It's possible to supplement the dictionaries with G2P models, but I haven't implemented that.

\
____


# Huge thanks to PixPrucer and HAI-D, who made nearly every part of this notebook. I only updated a few things ⛹

## **Upload files**

In [None]:
#@title Mount Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#@title Unzip corpus
#@markdown Unzip your dataset for transcription. Make sure the archive contains only wavs (15-30 seconds each recommended).<br>If your samples are longer, you can use the audio slicer below.

file_location = '/content/drive/MyDrive/wav.zip' #@param {type:"string"}

!7z x "$file_location" -o/content/db
from IPython.display import clear_output
clear_output()
print("Wavs extracted in db folder")
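
A quick way to see which extracted wavs fall outside the recommended 15-30 second range is sketched below. The helper name `find_unusual_wavs` and its parameters are made up for illustration, and it assumes plain PCM wavs, since the standard-library `wave` module may refuse other encodings.

```python
import os
import wave

def find_unusual_wavs(folder, min_seconds=15.0, max_seconds=30.0):
    """Return (filename, duration_in_seconds) for wavs outside the given range.

    Hypothetical helper, not part of the notebook. Assumes PCM wav files.
    """
    unusual = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith('.wav'):
            continue
        with wave.open(os.path.join(folder, name), 'rb') as wav:
            # Duration = number of frames / frames per second
            duration = wav.getnframes() / wav.getframerate()
        if not (min_seconds <= duration <= max_seconds):
            unusual.append((name, round(duration, 2)))
    return unusual

# Example call in this notebook (run it after the unzip cell):
# print(find_unusual_wavs('/content/db'))
```

Anything this reports as much longer than 30 seconds is a candidate for the audio slicer.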

In [None]:
#@title (Optional) Unzip edited transcriptions
#@markdown Unzip your own transcriptions into the `txt` folder so you don't need to use Whisper.

file_location = '/content/drive/MyDrive/txt.zip' #@param {type:"string"}

!7z x "$file_location" -o/content/txt
from IPython.display import clear_output
clear_output()
print("Transcriptions extracted in txt folder")
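
MFA pairs each audio file with the transcription that shares its base name, so uploaded transcriptions only help if their names line up with the wavs. A minimal sketch of that check follows; `find_unmatched` is a hypothetical helper, not something the notebook or MFA provides.

```python
import os

def find_unmatched(wav_dir, txt_dir):
    """Return (wavs without a .txt, txts without a .wav), compared by base name.

    Hypothetical helper for illustration only.
    """
    wavs = {os.path.splitext(n)[0] for n in os.listdir(wav_dir) if n.endswith('.wav')}
    txts = {os.path.splitext(n)[0] for n in os.listdir(txt_dir) if n.endswith('.txt')}
    return sorted(wavs - txts), sorted(txts - wavs)

# Example call in this notebook:
# missing_txt, missing_wav = find_unmatched('/content/db', '/content/txt')
# print("No transcription for:", missing_txt)
# print("No audio for:", missing_wav)
```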

# (Optional) Audio-slicer

In [None]:
#@title Slice it up!
!git clone https://github.com/openvpi/audio-slicer
!pip install librosa
!pip install soundfile
!mkdir -p /content/slicer-temp
!mv /content/db/*.wav /content/slicer-temp/
import os

# Slice every original wav in slicer-temp; the slices are written to /content/db
for filename in os.listdir('/content/slicer-temp'):
    # Check if the file is a .wav file
    if filename.endswith('.wav'):
        # Execute the command for the current file
        !python /content/audio-slicer/slicer2.py /content/slicer-temp/"$filename" --out /content/db/
from IPython.display import clear_output
clear_output()
print("Your sliced audios should be in the 'db' folder and the old files will be in the 'slicer-temp' folder")

# **Whisper inference (Auto-transcriptions)**

In [None]:
#@title Install Whisper
!pip install -U openai-whisper
# Whisper calls the ffmpeg command-line tool, so install the system package
!apt-get install -y ffmpeg
from IPython.display import clear_output
clear_output()
print("All done!")

In [None]:
#@title Whisper inference
#@markdown **Make transcriptions** <br/> Worth noting that your singing database shouldn't have long pauses, *ooh-ing*, lalala-ing, humming etc. in it, otherwise it'll probably break the transcription making (Whisper poorly recognises those).
#Implemented from https://github.com/openai/whisper/discussions/1041 by Haru0l

import os
import whisper
from whisper.tokenizer import get_tokenizer

os.makedirs('/content/txt/', exist_ok=True)

# Encourage the model to transcribe words literally by suppressing digit-only tokens
tokenizer = get_tokenizer(multilingual=True)  # use multilingual=True if using a multilingual model
number_tokens = [
    i
    for i in range(tokenizer.eot)
    if all(c in "0123456789" for c in tokenizer.decode([i]).removeprefix(" "))
]

# Load the model once instead of once per file
model = whisper.load_model("medium")

def Transcriber(audiofile):
    answer = model.transcribe(audiofile, suppress_tokens=[-1] + number_tokens)

    print(answer['text'])

    # Name the transcription after the audio file it came from
    output_txt = os.path.join('/content/txt/', os.path.splitext(os.path.basename(audiofile))[0] + '.txt')

    with open(output_txt, 'w', encoding='utf-8') as f:
        f.write(answer['text'])

for filename in os.listdir('/content/db/'):
    if filename.endswith('.wav'):
        file_path = os.path.join('/content/db/', filename)
        Transcriber(file_path)
from IPython.display import clear_output
clear_output()
print("Hopefully everything worked and your transcriptions are in the 'txt' folder!")
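
The `number_tokens` filter in the cell above can be tried out without downloading Whisper. The sketch below applies the same test to a toy vocabulary; `toy_vocab` and `digit_only_token_ids` are invented for illustration and are not part of Whisper's API.

```python
# Toy stand-in for a tokenizer vocabulary: token id -> decoded text.
# These ids and strings are made up; real Whisper vocabularies differ.
toy_vocab = {
    0: "hello",
    1: " 42",
    2: "7",
    3: " a1",
    4: " ",
    5: " 2024",
}

def digit_only_token_ids(vocab):
    """Return ids whose text, minus one leading space, consists only of digits."""
    return [
        token_id
        for token_id, text in vocab.items()
        if all(c in "0123456789" for c in text.removeprefix(" "))
    ]

print(digit_only_token_ids(toy_vocab))  # → [1, 2, 4, 5]
```

Note that id 4 is selected too: a token that decodes to a lone space (or to nothing) passes the test, because `all()` over an empty sequence is `True`.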

In [None]:
#@title (Optional) Zip up text transcriptions `txt` for you to download and edit
#@markdown Make sure there are only .txt files in your .zip archive!
!zip -r /content/transcriptions.zip /content/txt/*.txt
from IPython.display import clear_output
clear_output()
print("Your transcriptions are now in 'transcriptions.zip'")

# **Auto-alignment (MFA)**

In [None]:
#@title Install Condacolab
#@markdown The session will crash and restart, that's normal!
!pip install -q condacolab
import condacolab
condacolab.install()
from IPython.display import clear_output
clear_output()
print("All done, please wait for the session to restart!")

In [None]:
#@title Install MFA
!conda install -y -c conda-forge montreal-forced-aligner spacy sudachipy sudachidict-core
!pip install speechbrain
from IPython.display import clear_output
clear_output()
print("All done!")

In [None]:
#@title Download the alignment models
#@markdown Open the page of the model for your language and scroll down to "Installation": the model name is whatever follows "mfa model download acoustic" or "mfa model download dictionary" (e.g.: italian_cv)<br>Acoustic models: https://mfa-models.readthedocs.io/en/latest/acoustic/index.html<br>Dictionaries: https://mfa-models.readthedocs.io/en/latest/dictionary/index.html
acoustic = 'japanese_mfa' #@param {type:"string"}
dictionary = 'japanese_mfa' #@param {type:"string"}
# Download the acoustic model
!mfa model download acoustic "$acoustic"
# Download the pronunciation dictionary
!mfa model download dictionary "$dictionary"
from IPython.display import clear_output
clear_output()
print("Alignment models downloaded!")


In [None]:
#@title Start aligning!
!mv /content/txt/*.txt /content/db
!mfa align /content/db "$dictionary" "$acoustic" /content/alignment --beam 400
from IPython.display import clear_output
clear_output()
print("All done! Check the 'alignment' folder for .TextGrid files")

# **LAB Converter**

In [None]:
#@title Install TextGrid to LAB converter
!pip install mytextgrid
!git clone https://github.com/gnloop/MFA-Universal-Notebook
!mv /content/MFA-Universal-Notebook/text2lab_test.py /content/alignment/
from IPython.display import clear_output
clear_output()
print("All done!")

### HALT! Before you go happily converting your TextGrid files, if your language is not present in the dropdown list below, you're gonna have to make your own 'custom_converter.txt' in the converters folder!
#### Most MFA models use some sort of IPA system which doesn't sit well with DiffSinger. The default converter.txt file is set up for English: it changes every phoneme from uppercase to lowercase and deletes any numbers.
#### For further details on what phonemes MFA uses, you should check out the webpage where the MFA models are listed. There's usually a phoneme list there as well.
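
To make the English default concrete, here is a rough sketch of the effect described above (uppercase to lowercase, numbers removed) applied to a few ARPABET-style labels. It only illustrates the outcome; it is not how `text2lab_test.py` or the converter files work internally, and `normalize_phoneme` is a made-up name.

```python
import re

def normalize_phoneme(phoneme):
    """Lowercase a phoneme label and drop any digits (e.g. stress markers).

    Illustrative only: mimics the described effect of the default converter.
    """
    return re.sub(r'\d', '', phoneme).lower()

for label in ["AH0", "EY1", "T"]:
    print(label, "->", normalize_phoneme(label))
```

A custom converter for another language would need its own mapping from the MFA phoneme set to whatever symbols your DiffSinger dictionary expects.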

In [None]:
#@title Convert TextGrid to LAB
converter = 'converter_JP' #@param ['converter_JP', 'converter_EN-ARPA', 'converter_ES', 'converter_IT', 'custom_converter']
!python -X utf8 /content/alignment/text2lab_test.py -c /content/MFA-Universal-Notebook/converters/"$converter".txt
from IPython.display import clear_output
clear_output()
print("Your .lab files should now be under the 'alignment' folder!")


In [None]:
#@title (ITALIAN ONLY) Gemination fix
#@markdown This splits double consonants into 2 separate phonemes to adapt to the current Italian implementation on OpenUTAU<br>(You'll still need to turn 'u' and 'i' into 'w' and 'y' for consonant clusters!)
import re
import os

def divide_double_consonants(lab_file_path):
    """Splits double consonants in a monophone HTS .lab file into two
       labels of half the duration each, excluding 'rr'.

    Args:
        lab_file_path: The path to the .lab file.
    """
    with open(lab_file_path, 'r') as f:
        lines = f.readlines()

    modified_lines = []
    for line in lines:
        parts = line.strip().split()
        if len(parts) == 3:  # Check if it's a valid label line
            start_time, end_time, phoneme = parts
            # Find double consonants excluding "rr"
            double_consonants = re.findall(r'(?!rr)([a-z])\1', phoneme)
            if double_consonants:
                start = float(start_time)
                end = float(end_time)
                midpoint = start + (end - start) / 2
                single = double_consonants[0]
                # Split the label into two with half durations, using integers for timestamps
                modified_lines.append(f"{int(start)} {int(midpoint)} {single}\n")
                modified_lines.append(f"{int(midpoint)} {int(end)} {single}\n")
            else:
                modified_lines.append(line)  # Keep original line if no double consonants
        else:
            modified_lines.append(line)  # Keep lines that are not labels

    with open(lab_file_path, 'w') as f:
        f.writelines(modified_lines)

# Specify the folder containing your .lab files
folder_path = '/content/alignment/'  # Replace with your folder path

# Loop through all files in the folder
for filename in os.listdir(folder_path):
    if filename.endswith('.lab'):  # Process only .lab files
        lab_file_path = os.path.join(folder_path, filename)
        divide_double_consonants(lab_file_path)

print("Double consonants divided in all .lab files in the folder!")
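
To see what the gemination fix does to a single label, the sketch below applies the same regex and midpoint split to one line at a time. `split_geminate` is a hypothetical stand-alone helper written for this example, and the timestamps are invented.

```python
import re

def split_geminate(line):
    """Split a 'start end phoneme' label with a doubled letter into two halves.

    Returns a list of label lines; lines without a doubled letter (or with 'rr')
    are returned unchanged. Hypothetical helper mirroring the cell above.
    """
    parts = line.strip().split()
    if len(parts) != 3:
        return [line.strip()]
    start, end, phoneme = float(parts[0]), float(parts[1]), parts[2]
    doubled = re.findall(r'(?!rr)([a-z])\1', phoneme)
    if not doubled:
        return [line.strip()]
    midpoint = start + (end - start) / 2
    return [
        f"{int(start)} {int(midpoint)} {doubled[0]}",
        f"{int(midpoint)} {int(end)} {doubled[0]}",
    ]

print(split_geminate("0 2000000 tt"))  # → ['0 1000000 t', '1000000 2000000 t']
print(split_geminate("0 2000000 rr"))  # → ['0 2000000 rr']
```

The pattern matches any doubled lowercase letter, so a label such as `aa` would be split as well; that matters only if your converter produces doubled vowels.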

In [None]:
#@title Zip labels
!zip -r /content/labels.zip /content/alignment/*.lab
from IPython.display import clear_output
clear_output()
print("You can now download your labels in the labels.zip file!")

# **Extras**

In [None]:
#@title Whisper (Transformers) inference
#@markdown **Make transcriptions** <br/> Only use this if you know what you are doing
#Implemented from https://github.com/openai/whisper/discussions/1041 by Haru0l, edited to be compatible with Transformers HuggingFace implementation

!pip install --upgrade pip
!pip install --upgrade transformers accelerate torchaudio
!pip install "punctuators==0.0.5"
!pip install "pyannote.audio"
!pip install git+https://github.com/huggingface/diarizers.git

from transformers import pipeline
import os

os.makedirs('/content/txt/', exist_ok=True)

#@markdown <b><font size="3.5"> Model name (HuggingFace)
model_name = "openai/whisper-medium"  #@param {type:"string"}
#@markdown <b><font size="3.5"> generate_kwargs
language = "Japanese"  #@param {type:"string"}
no_repeat_ngram_size = 0  #@param {type:"integer"}
repetition_penalty = 1.0  #@param {type:"number"}

generate_kwargs = {
    "language": language,
    "no_repeat_ngram_size": no_repeat_ngram_size,
    "repetition_penalty": repetition_penalty,
}

# Build the pipeline once so the model isn't reloaded for every file
asr_pipeline = pipeline("automatic-speech-recognition", model=model_name, generate_kwargs=generate_kwargs)

def transcriber(audiofile):
    print(f"Processing file: {audiofile}")
    result = asr_pipeline(audiofile)

    transcription = result['text']
    print(transcription)

    # Name the transcription after the audio file it came from
    output_txt = os.path.join('/content/txt/', os.path.splitext(os.path.basename(audiofile))[0] + '.txt')
    with open(output_txt, 'w', encoding='utf-8') as f:
        f.write(transcription)

audio_directory = '/content/db/'
for filename in os.listdir(audio_directory):
    if filename.endswith('.wav'):
        file_path = os.path.join(audio_directory, filename)
        transcriber(file_path)

from IPython.display import clear_output
clear_output()
print("Hopefully everything worked and your transcriptions are in the 'txt' folder!")
