<a href="https://colab.research.google.com/github/gnloop/MFA-Universal-Notebook/blob/main/MFA_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Huge thanks to PixPrucer and HAI-D for making practically every part of this notebook. I just updated a few things ⛹

In [None]:
#@title Install Whisper and Condacolab
#@markdown The session will crash and restart, that's normal!
!pip install -U openai-whisper
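# Whisper needs the ffmpeg command-line tool, which is a system package rather than a pip package
!apt-get install -y -qq ffmpeg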
!pip install -q condacolab
import condacolab
condacolab.install()
from IPython.display import clear_output
clear_output()
print("All done!")
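
Whisper decodes audio by calling the `ffmpeg` command-line program, so after the session restarts it can be worth confirming that the tool is actually on the `PATH`. The snippet below is a minimal sketch of such a check; `find_tool` is a hypothetical helper name, not part of Whisper or Condacolab.

```python
import shutil

def find_tool(name):
    """Return the full path of a command-line tool, or None if it is not on PATH."""
    return shutil.which(name)

# Hypothetical sanity check: look for the ffmpeg executable
ffmpeg_path = find_tool("ffmpeg")
if ffmpeg_path is None:
    print("ffmpeg was not found on PATH; Whisper will not be able to read audio.")
else:
    print(f"ffmpeg found at {ffmpeg_path}")
```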

In [None]:
#@title Mount Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#@title Unzip corpus
#@markdown Unzip your dataset for transcription. Make sure the archive contains only WAV files (15-30 seconds each is recommended).

file_location = '/content/drive/MyDrive/wav.zip' #@param {type:"string"}

!7z x "$file_location" -o/content/db
from IPython.display import clear_output
clear_output()
print("Wavs extracted in db folder")
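
To see which clips fall outside the recommended 15-30 second range, something like the following could be run after extraction. It is a rough sketch that assumes plain PCM WAV files (the standard `wave` module rejects other encodings such as 32-bit float); `wav_duration_seconds` and `clips_outside_range` are hypothetical helper names.

```python
import os
import wave

def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def clips_outside_range(folder, shortest=15.0, longest=30.0):
    """List (filename, duration) pairs whose duration falls outside [shortest, longest]."""
    flagged = []
    if not os.path.isdir(folder):
        return flagged
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".wav"):
            duration = wav_duration_seconds(os.path.join(folder, name))
            if not shortest <= duration <= longest:
                flagged.append((name, round(duration, 2)))
    return flagged

# Report every clip in the extracted corpus that is shorter or longer than recommended
for name, duration in clips_outside_range("/content/db"):
    print(f"{name}: {duration} s")
```

The 15-30 second range is only a recommendation, so flagged clips are candidates for a closer look rather than files that must be removed.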

In [None]:
#@title Whisper inference
#@markdown **Make transcriptions** <br/> Note that your singing database shouldn't contain long pauses, *ooh-ing*, lalala-ing, humming, etc., otherwise the transcription will probably break (Whisper recognises those poorly).
#Implemented from https://github.com/openai/whisper/discussions/1041 by Haru0l

import os
import whisper
from whisper.tokenizer import get_tokenizer

os.makedirs('/content/txt/', exist_ok=True)

# Encourage the model to transcribe words literally by suppressing digit-only tokens
tokenizer = get_tokenizer(multilingual=True)  # use multilingual=True if using multilingual model
number_tokens = [
    i
    for i in range(tokenizer.eot)
    if all(c in "0123456789" for c in tokenizer.decode([i]).removeprefix(" "))
]

# Load the model once here instead of once per audio file
model = whisper.load_model("medium")

def Transcriber(audiofile):
    answer = model.transcribe(audiofile, suppress_tokens=[-1] + number_tokens)

    print(answer['text'])

    # Name the transcription after the audio file itself (foo.wav -> foo.txt)
    basename = os.path.splitext(os.path.basename(audiofile))[0]
    output_txt = os.path.join('/content/txt/', basename + '.txt')

    with open(output_txt, 'w', encoding='utf-8') as f:
        f.write(answer['text'])

for filename in sorted(os.listdir('/content/db/')):
    if filename.endswith('.wav'):
        file_path = os.path.join('/content/db/', filename)
        Transcriber(file_path)
from IPython.display import clear_output
clear_output()
print("Hopefully everything worked and your transcriptions are in the 'txt' folder!")
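
The digit filter used in the cell above can be tried out without loading Whisper. The sketch below applies the same test to a toy vocabulary; the ids and strings in `toy_vocab` are made up for illustration and are not Whisper's real tokens, and `is_digits_only` is a hypothetical helper name.

```python
# Toy stand-in for a tokenizer vocabulary (hypothetical ids and strings, not Whisper's real ones)
toy_vocab = {0: " the", 1: " 19", 2: "84", 3: " 3rd", 4: " twenty"}

def is_digits_only(token_text):
    # Same test as the cell above: drop one leading space, then require every
    # remaining character to be a digit. all() over an empty string is True,
    # so an empty token would pass this test as well.
    return all(c in "0123456789" for c in token_text.removeprefix(" "))

# Collect the ids whose text is purely numeric; these are the ids that would be suppressed
number_tokens = [i for i, text in toy_vocab.items() if is_digits_only(text)]
print(number_tokens)  # prints [1, 2]
```

With purely numeric tokens suppressed, the model is nudged toward spelling numbers out as words, which is what a pronunciation dictionary can look up.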

In [None]:
#@title (Optional) Zip up the text transcriptions in `txt` for you to download and edit
!zip transcriptions.zip /content/txt/*.txt

In [None]:
#@title Install MFA
!conda install -y -c conda-forge montreal-forced-aligner
!conda install -y pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
!pip install speechbrain
from IPython.display import clear_output
clear_output()
print("All done!")

In [None]:
#@title Download the alignment models
#@markdown Open the page of the model for your desired language and scroll down to "Installation": the model name is whatever follows "mfa model download acoustic" or "mfa model download dictionary" (e.g. italian_cv).<br>Acoustic models: https://mfa-models.readthedocs.io/en/latest/acoustic/index.html<br>Dictionaries: https://mfa-models.readthedocs.io/en/latest/dictionary/index.html
acoustic = 'french_mfa' #@param {type:"string"}
dictionary = 'french_mfa' #@param {type:"string"}
# Download the acoustic model
!mfa model download acoustic "$acoustic"
# Download the pronunciation dictionary
!mfa model download dictionary "$dictionary"
from IPython.display import clear_output
clear_output()
print("All done!")

In [None]:
#@title Start aligning!
!mv /content/txt/*.txt /content/db
!mfa align /content/db "$acoustic" "$dictionary" /content/alignment --beam 400
from IPython.display import clear_output
clear_output()
print("You should now have .TextGrid files under /content/alignment!")
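
MFA pairs each audio file with the text file of the same base name, which is why the cell above moves the transcriptions into the `db` folder. If some files end up without a TextGrid, one thing worth checking is whether every wav actually has a transcription. This is a minimal sketch of that check; `wavs_without_transcript` is a hypothetical helper, not an MFA command.

```python
import os

def wavs_without_transcript(folder):
    """Return sorted base names of .wav files in folder that have no matching .txt file."""
    if not os.path.isdir(folder):
        return []
    names = os.listdir(folder)
    txt_stems = {os.path.splitext(n)[0] for n in names if n.endswith(".txt")}
    wav_stems = {os.path.splitext(n)[0] for n in names if n.endswith(".wav")}
    return sorted(wav_stems - txt_stems)

# List the clips in the corpus folder that have no transcription next to them
missing = wavs_without_transcript("/content/db")
print(f"{len(missing)} wav file(s) without a transcription:", missing)
```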

HALT! Before you go happily converting your TextGrid files, **I strongly suggest you check the 'converter.txt' file** (the install cell below places it in /content), which automatically changes the phonetic system to one you prefer.
<br><br>Most MFA models use some form of IPA, which doesn't sit well with DiffSinger.<br><br>The default converter.txt file is set up for English: it changes every phoneme from uppercase to lowercase and deletes any numbers.
<br><br>For further details on which phonemes MFA uses, check the webpage where the MFA models are listed. There's usually a phoneme list there as well.
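
To make the default English behaviour concrete, here is a small sketch of the effect described above: lowercase every phoneme and drop any digits. It does not show how converter.txt or the converter script work internally; `normalize_phoneme` is a hypothetical function, and the ARPABET-style sample labels are an assumption about what an English model outputs.

```python
import re

def normalize_phoneme(phoneme):
    """Lowercase a phoneme label and remove any digits (e.g. stress markers)."""
    return re.sub(r"[0-9]", "", phoneme).lower()

# Assumed sample labels in an ARPABET-like style
print([normalize_phoneme(p) for p in ["HH", "AH0", "L", "OW1"]])
# prints ['hh', 'ah', 'l', 'ow']
```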

In [None]:
#@title Install TextGrid to LAB converter
!git clone https://github.com/gnloop/MFA-Universal-Notebook
!mv /content/MFA-Universal-Notebook/text2lab_test.py /content/alignment/
!mv /content/MFA-Universal-Notebook/converter.txt /content/
!pip install mytextgrid
from IPython.display import clear_output
clear_output()
print("All done!")

In [None]:
#@title Convert TextGrid to LAB
!python -X utf8 /content/alignment/text2lab_test.py
from IPython.display import clear_output
clear_output()
print("Your .lab files should now be under the 'alignment' folder!")


In [None]:
#@title Zip labels
!zip -r /content/labels.zip /content/alignment/*.lab
from IPython.display import clear_output
clear_output()
print("You can now download your labels in the labels.zip file!")