<a href="https://colab.research.google.com/github/MLo7Ghinsan/DiffSinger_colab_notebook_MLo7/blob/main/lab_base_maker_AutoLabelingForSVS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook for automatic `txt` transcription and `lab` generation

Original notebook by PixPrucer

Edited by MLo7

Zip format example:
<pre>
your_zip.zip:                              ###NO SUBFOLDER INSIDE THE ZIP###
    |
    |
    data_1.wav
    data_1.txt (optional)
    .
    data_2.wav
    data_2.txt (optional)
    .
    data_3.wav
    data_3.txt (optional)
    .
    ...
</pre>

Recommended audio length: 3~15 seconds per wav file
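
Before uploading, it can help to confirm that the archive really is flat. The following is a minimal sketch and not part of the original notebook; `check_zip_layout` and the path `your_zip.zip` are hypothetical names used only for illustration.

```python
import zipfile
from pathlib import PurePosixPath


def check_zip_layout(zip_path):
    """Return (nested, missing_txt) for a dataset zip.

    nested      -- files that sit inside a subfolder (should be empty)
    missing_txt -- wavs without a same-named txt (fine if Whisper transcribes them)
    """
    with zipfile.ZipFile(zip_path) as zf:
        # Skip pure directory entries, which end with a slash.
        names = [n for n in zf.namelist() if not n.endswith("/")]

    nested = [n for n in names if len(PurePosixPath(n).parts) > 1]
    stems_with_txt = {
        PurePosixPath(n).stem for n in names if n.lower().endswith(".txt")
    }
    missing_txt = [
        n
        for n in names
        if n.lower().endswith(".wav")
        and PurePosixPath(n).stem not in stems_with_txt
    ]
    return nested, missing_txt


# Example usage (hypothetical path):
# nested, missing_txt = check_zip_layout("your_zip.zip")
# print("Files in subfolders:", nested)
# print("Wavs without a txt:", missing_txt)
```

If `nested` is not empty, re-zip the files from inside the folder rather than zipping the folder itself.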

In [None]:
#@title # Mount Drive and install Conda
#@markdown Basically installs everything necessary for the notebook to work <br/> **Wait until it restarts your session!**
#@markdown Mount Drive
from IPython.display import clear_output
from google.colab import drive

%cd /content
drive.mount('/content/drive')
!pip install -q condacolab
clear_output() # rawr

import condacolab
condacolab.install()

In [None]:
#@title # Main installation

%cd /content
!conda create -y -n aligner -c conda-forge montreal-forced-aligner==2.0.6
# Download Model
!source activate aligner; \
mfa model download acoustic english_us_arpa
!mkdir /content/jpmodel
!wget https://cdn.discordapp.com/attachments/816517150175920138/1133350036995059732/jpn_dict_4_autoalign_colab.txt #edited file
!wget https://huggingface.co/datasets/fox7005/tool/resolve/main/jp_acoustic_model.zip
# Download G2P
!source activate aligner; \
mfa model download dictionary english_us_arpa
# HAI-D's TextGrid-->LAB
!wget https://cdn.discordapp.com/attachments/816517150175920138/1161903924903677982/textgrid2lab.py #edited file
!wget https://cdn.discordapp.com/attachments/1004785092129996850/1093284704599412906/converter.txt
# Arpabet phoneme mapping table
!pip install openai-whisper
!pip install pykakasi
!pip install mytextgrid

In [None]:
#@title Unzip corpus
from IPython.display import clear_output

#@markdown Unzip your dataset for transcription. Make sure the archive contains only wavs (3~15 seconds in length recommended), plus matching txt files if you have your own transcriptions.

file_location = '' #@param {type:"string"}

!7z x "$file_location" -o/content/db

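The 3~15 second recommendation can be checked once the archive is extracted. This is a rough sketch, assuming plain PCM wavs (the standard-library `wave` reader rejects other encodings such as 32-bit float); `wav_durations` and `outside_range` are hypothetical helper names, not part of the original notebook.

```python
import os
import wave


def wav_durations(folder):
    """Map each .wav in `folder` to its duration in seconds (None if unreadable)."""
    durations = {}
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".wav"):
            continue
        try:
            with wave.open(os.path.join(folder, name), "rb") as wf:
                durations[name] = wf.getnframes() / wf.getframerate()
        except (wave.Error, EOFError):
            # The stdlib reader only understands plain PCM; mark anything else.
            durations[name] = None
    return durations


def outside_range(durations, low=3.0, high=15.0):
    """Return the files whose duration falls outside [low, high] seconds."""
    return {
        name: secs
        for name, secs in durations.items()
        if secs is not None and not low <= secs <= high
    }


# Example usage after the unzip cell has run:
# print(outside_range(wav_durations("/content/db")))
```

Files outside the range are not necessarily unusable; the range is only the recommendation given above.
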
In [None]:
#@title Whisper & MFA inference
#@markdown **Make transcriptions** <br/> Worth noting that your singing database shouldn't have long pauses, *ooh-ing*, lalala-ing, humming etc. in it, otherwise the transcription will probably break (Whisper recognises those poorly).
#Implemented from https://github.com/openai/whisper/discussions/1041 by Haru0l
import os
import pykakasi
import re
import warnings
from IPython.display import clear_output  # needed for clear_output() below
warnings.filterwarnings("ignore")

%cd /content
clear_output()

#@markdown Select this if you have your own transcriptions alongside your wavs (so it won't transcribe them)
user_transcripts = False # @param {type:"boolean"}

language = "JPN (Japanese)" #@param ["JPN (Japanese)", "ENG (English)", "rawr x3"]

#@markdown The higher the number, the wider the alignment search (and the slower it runs). It can give better alignment I guess, but it's hit or miss
beam = 10 # @param {type:"slider", min:10, max:2000, step:10}
retry_beam = beam * 4

if not user_transcripts:
    if language == "JPN (Japanese)":
        !whisper --model medium -f txt --language ja --output_dir /content/db /content/db/*.wav
        # Convert the transcripts to space-separated hiragana for MFA
        folder_path = "/content/db"
        kakasi = pykakasi.kakasi()
        kakasi.setMode("J", "H")
        kakasi.setMode("K", "H")
        conv = kakasi.getConverter()

        def add_space(text):
            special_combinations = ["きゃ", "きゅ", "きょ", "しゃ", "しゅ", "しょ", "ちゃ", "ちゅ", "ちょ", "にゃ", "にゅ", "にょ", "ひゃ", "ひゅ", "ひょ", "みゃ", "みゅ", "みょ", "りゃ", "りゅ", "りょ"]
            result = []
            buffer = ""
            for char in text:
                if buffer + char in special_combinations:
                    buffer += char
                else:
                    if buffer:
                        result.append(buffer)
                    buffer = char
            if buffer:
                result.append(buffer)
            return " ".join(result)

        def remove_space(text):
            # Collapse runs of whitespace (including newlines) into single spaces
            return " ".join(text.split())

        for file_name in os.listdir(folder_path):
            if file_name.endswith(".txt"):
                file_path = os.path.join(folder_path, file_name)
                with open(file_path, "r", encoding="utf-8") as file:
                    japanese_text = file.read()
                hiragana_text = conv.do(japanese_text)
                hiragana_add = add_space(hiragana_text)
                hiraganaur = remove_space(hiragana_add)
                with open(file_path, "w", encoding="utf-8") as file:
                    file.write(hiraganaur)


    elif language == "ENG (English)":
        import whisper
        from whisper.tokenizer import get_tokenizer

        # Encourage the model to transcribe words literally by suppressing digit-only tokens
        tokenizer = get_tokenizer(multilingual=False)  # use multilingual=True if using multilingual model
        number_tokens = [
            i
            for i in range(tokenizer.eot)
            if all(c in "0123456789" for c in tokenizer.decode([i]).removeprefix(" "))
        ]

        # Load the model once instead of once per file
        model = whisper.load_model("medium.en")

        def Transcriber(audiofile):
            answer = model.transcribe(audiofile, suppress_tokens=[-1] + number_tokens)

            print(answer['text'])

            # Write the transcript next to the wav, named after the file passed in
            # (the original read the loop's global `filename` here)
            output_txt = os.path.splitext(audiofile)[0] + '.txt'

            with open(output_txt, 'w', encoding='utf-8') as f:
                f.write(answer['text'])

        for filename in os.listdir('/content/db/'):
            if filename.endswith('.wav'):
                file_path = os.path.join('/content/db/', filename)
                Transcriber(file_path)
    else:
        print("rawr xd nuzzle pounces on you uwu you so warm")
else:
    pass

#**Make alignments**
%cd /content

if language == "JPN (Japanese)":
    print("You've selected [JPN]")
    !source activate aligner; \
    mfa align /content/db /content/jpn_dict_4_autoalign_colab.txt /content/jp_acoustic_model.zip /content/alignment --beam {beam} --retry_beam {retry_beam} --clean

elif language == "ENG (English)":
    print("You've selected [ENG]")
    !source activate aligner; \
    mfa align /content/db english_us_arpa english_us_arpa /content/alignment --beam {beam} --retry_beam {retry_beam} --clean

else:
    print("You've selected the uwu")
#mfa align --custom_mapping_path /content/arpa_cleaners.yaml /content/db english_us_arpa english_us_arpa /content/alignment
# Thank u HAI-D I'd probably die figuring out myself

#**Convert to LAB format**
%cd /content
!python /content/textgrid2lab.py

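The Japanese branch splits each transcript into one kana per token, keeping the listed two-character combinations (きゃ, しゅ, ...) together. The sketch below re-implements that greedy splitting as a standalone function so the behaviour is easy to try out. `split_kana`, `BASE_COMBOS` and `VOICED_COMBOS` are hypothetical names, and the voiced list is an assumption that is not in the notebook; whether the aligner dictionary has entries for those combinations is not something this sketch can confirm.

```python
# Same combinations as the notebook's `special_combinations` list.
BASE_COMBOS = [
    "きゃ", "きゅ", "きょ", "しゃ", "しゅ", "しょ", "ちゃ", "ちゅ", "ちょ",
    "にゃ", "にゅ", "にょ", "ひゃ", "ひゅ", "ひょ", "みゃ", "みゅ", "みょ",
    "りゃ", "りゅ", "りょ",
]

# Assumption: voiced / semi-voiced combinations, NOT part of the original list.
VOICED_COMBOS = [
    "ぎゃ", "ぎゅ", "ぎょ", "じゃ", "じゅ", "じょ",
    "びゃ", "びゅ", "びょ", "ぴゃ", "ぴゅ", "ぴょ",
]


def split_kana(text, combos):
    """Split `text` into space-separated kana, keeping `combos` as single tokens."""
    result = []
    buffer = ""
    for char in text:
        if buffer + char in combos:
            buffer += char  # extend the current token into a known combination
        else:
            if buffer:
                result.append(buffer)
            buffer = char
    if buffer:
        result.append(buffer)
    # split()/join collapses any whitespace tokens, like the notebook's remove_space
    return " ".join(" ".join(result).split())


print(split_kana("きょうは", BASE_COMBOS))                  # きょ う は
print(split_kana("ぎゃく", BASE_COMBOS))                    # ぎ ゃ く
print(split_kana("ぎゃく", BASE_COMBOS + VOICED_COMBOS))    # ぎゃ く
```

With the original list, a voiced combination such as ぎゃ is split into two tokens. If alignment fails on words containing those sounds, this is one place worth checking.
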
In [None]:
#@markdown Zips up the `lab` labels for you to download (through Colab's file explorer)
!zip labels.zip /content/alignment/*.lab

You might want to adjust the labels afterwards using vLabeler. I like to treat those as a baseline for hand-labelling (makes the job just slightly easier).
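
To decide which files need the most hand-correction, one option is to flag suspiciously short segments first. This is only a sketch under a loud assumption: that each `lab` line has the HTS-style form `start end phoneme` with times in units of 100 ns. The output format of `textgrid2lab.py` is not shown in this notebook, so check one file by eye before relying on it. `short_segments` and its parameters are hypothetical.

```python
def short_segments(lab_path, min_seconds=0.02, units_per_second=10_000_000):
    """Return (line_number, phoneme, duration_in_seconds) for very short segments.

    Assumes HTS-style lines: "<start> <end> <phoneme>", times in 100 ns units.
    Lines that do not have exactly three fields are skipped.
    """
    flagged = []
    with open(lab_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            parts = line.split()
            if len(parts) != 3:
                continue
            start, end, phoneme = float(parts[0]), float(parts[1]), parts[2]
            duration = (end - start) / units_per_second
            if duration < min_seconds:
                flagged.append((line_no, phoneme, duration))
    return flagged


# Example usage (hypothetical file name):
# for line_no, phoneme, secs in short_segments("/content/alignment/data_1.lab"):
#     print(f"line {line_no}: {phoneme} lasts only {secs:.3f} s")
```

The 20 ms threshold is an arbitrary starting point, not a value taken from the notebook.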