<a href="https://colab.research.google.com/github/Spr-Aachen/EVT-Resources/blob/main/Easy_Voice_Toolkit_for_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Terms of Use

**Please resolve any dataset authorization issues on your own. You shall be solely responsible for any problems caused by using unauthorized datasets for training and for all consequences thereof. The repository and its maintainer bear no responsibility for those consequences!**

1. This project is established for academic exchange purposes only and is intended for communication and learning purposes. It is not intended for production environments.
2. Any video based on Easy Voice Toolkit that is published on a video platform must clearly state in its description that voice changing was used, and must specify the input source of the voice or audio. For example, if vocals separated from videos or audio published by others are used as the input source for conversion, clear links to the original videos must be provided. If your own voice, or voices synthesized by other commercial vocal synthesis software, are used as the input source for conversion, you must also state this in the description.
3. You shall be solely responsible for any infringement problems caused by the input source. When using other commercial vocal synthesis software as input source, please ensure that you comply with the terms of use of the software. Note that many vocal synthesis engines clearly state in their terms of use that they cannot be used for input source conversion.
4. Continued use of this project is deemed acceptance of the relevant provisions stated in this repository's README. The README has fulfilled its obligation to advise, and the repository is not responsible for any subsequent problems that may arise.
5. If you distribute this repository's code or publish any results produced by this project publicly (including but not limited to video sharing platforms), please indicate the original author and code source (this repository).
6. If you use this project for any other purpose, please contact and inform the author of this repository in advance. Thank you very much.

## Configure Colab

In the menu bar at the top, go to "Runtime" -> "Change runtime type" -> "Hardware accelerator" and select GPU.
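
To confirm that the runtime actually has a GPU attached, a quick check along the following lines may help. This is a minimal sketch that assumes the `nvidia-smi` command is on the PATH of a GPU runtime; the helper name `gpu_available` is hypothetical and is not part of Easy Voice Toolkit.

```python
import shutil
import subprocess

def gpu_available() -> bool:
    """Return True if nvidia-smi is present and lists at least one GPU."""
    # Assumption: GPU runtimes expose the nvidia-smi command on PATH
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi", "--list-gpus"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return False
    # A non-empty listing with exit code 0 means at least one GPU is visible
    return result.returncode == 0 and result.stdout.strip() != ""

print("GPU runtime detected" if gpu_available() else "No GPU detected; check the runtime type")
```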

In [None]:
#@title Clone Repository
!git clone https://github.com/Spr-Aachen/Easy-Voice-Toolkit.git
%cd /content/Easy-Voice-Toolkit

In [None]:
#@title Install Dependencies
!apt-get update
!apt-get install portaudio19-dev
!pip3 install -r requirements.txt
#!pip3 install --force-reinstall torch torchvision torchaudio
'''
!apt-get install python3.9
!cp -r /usr/local/lib/python3.10/dist-packages /usr/local/lib/python3.9/
'''
#exit() # Enable this only when you decide to delete the runtime

In [None]:
#@title Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Check that the files to be processed have been uploaded to https://drive.google.com/drive/my-drive
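
Once Drive is mounted, the upload can also be verified from the notebook. The sketch below lists media files under a folder; the helper name `list_media_files`, the extension set, and the example folder path are assumptions made for illustration and should be adapted to your own layout.

```python
from pathlib import Path
from typing import List

# Assumed set of extensions, for illustration only; adjust it to match your files
MEDIA_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a", ".mp4", ".mkv"}

def list_media_files(directory: str) -> List[str]:
    """Return the sorted paths of media files found under the directory."""
    root = Path(directory)
    if not root.is_dir():
        return []  # Folder missing, e.g. Drive not mounted or wrong path
    return sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in MEDIA_EXTENSIONS
    )

# Hypothetical folder; replace it with the Drive folder you uploaded to
files = list_media_files("/content/drive/MyDrive/EVT/MediaInput")
print(f"Found {len(files)} media file(s)")
```

An empty result usually means either that Drive is not mounted yet or that the folder path does not match the upload location.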

## Run Tools

In [None]:
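#@title [Tool] AudioProcessor: batch-converts media files to audio files, then automatically cuts the silent parts out of the audio
%cd /content/Easy-Voice-Toolkit

from EVT_Core.Process.Process import Audio_Processing

class Execute_Audio_Processing:
    '''
    Change media format to WAV and cut off the silent parts
    '''
    #@markdown **Media input directory**: directory containing the media files to be exported as audio files (note: no trailing slash)
    Media_Dir_Input: str = '/content/drive/MyDrive/%EVT/MediaInput%'   #@param {type:"string"}
    #@markdown **Media output format**: format of the exported audio files
    Media_Format_Output: str = 'wav'   #@param ["flac", "wav", "mp3", "aac", "ogg", "m4a", "wma", "aiff", "au"]
    #@markdown **Enable silence removal**: the silent parts of the audio will be cut out
    Slice_Audio: bool = True   #@param {type:"boolean"}
    #@markdown **RMS threshold (dB)**: segments below this threshold are treated as silence; increase this value if noise reduction is needed
    RMS_Threshold: float = -40.   #@param {type:"number"}
    #@markdown **Hop size (ms)**: length of each RMS frame; increasing this value improves slicing precision but slows the process down
    Hop_Size: int = 10   #@param {type:"integer"}
    #@markdown **Minimum silence interval (ms)**: minimum length of a silent part to be sliced; decrease this value if the audio contains only brief pauses (note: this value must be smaller than Audio Length Min and larger than Hop Size)
    Silent_Interval_Min: int = 300   #@param {type:"integer"}
    #@markdown **Maximum silence kept (ms)**: maximum length of silence kept around each sliced clip (hint: this value does not need to match the silence length in the sliced audio exactly; the algorithm searches for the best slicing position on its own)
    Silence_Kept_Max: int = 1000   #@param {type:"integer"}
    #@markdown **Minimum audio length (ms)**: minimum length required for each sliced audio clip
    Audio_Length_Min: int = 3000   #@param {type:"integer"}
    #@markdown **Output sample rate**: sample rate of the output audio; keep 'None' to leave it unchanged
    SampleRate: int = None   #@param ["None", 44100, 48000, 96000, 192000]
    #@markdown **Output sample width**: bit depth of the output audio; keep 'None' to leave it unchanged
    SampleWidth: int = None   #@param ["None", 8, 16, 24, 32]
    #@markdown **Merge channels**: mix the channels of the output audio down to mono
    ToMono: bool = False   #@param {type:"boolean"}
    #@markdown **Media output directory**: directory in which the generated audio files are saved (note: no trailing slash)
    Media_Dir_Output: str = '/content/drive/MyDrive/%EVT/ProcessResult%'   #@param {type:"string"}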

AudioConvertandSlice = Audio_Processing(
    Execute_Audio_Processing.Media_Dir_Input,
    Execute_Audio_Processing.Media_Dir_Output,
    Execute_Audio_Processing.Media_Format_Output,
    Execute_Audio_Processing.SampleRate if Execute_Audio_Processing.SampleRate != "None" else None,
    Execute_Audio_Processing.SampleWidth if Execute_Audio_Processing.SampleWidth != "None" else None,
    Execute_Audio_Processing.ToMono,
    Execute_Audio_Processing.Slice_Audio,
    Execute_Audio_Processing.RMS_Threshold,
    Execute_Audio_Processing.Audio_Length_Min,
    Execute_Audio_Processing.Silent_Interval_Min,
    Execute_Audio_Processing.Hop_Size,
    Execute_Audio_Processing.Silence_Kept_Max
)
AudioConvertandSlice.Process_Audio()
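
The form above notes that the minimum silence interval must be larger than the hop size and smaller than the minimum audio length. A small pre-flight check such as the following can catch a bad combination before a long batch run. It is a sketch only: `check_slicer_settings` is a hypothetical helper, not a function provided by Easy Voice Toolkit.

```python
from typing import List

def check_slicer_settings(hop_size: int, silent_interval_min: int, audio_length_min: int) -> List[str]:
    """Return a list of problems found in the slicing settings (empty if none)."""
    problems = []
    # Constraint stated in the form: Hop_Size < Silent_Interval_Min < Audio_Length_Min
    if not hop_size < silent_interval_min:
        problems.append("Silent_Interval_Min must be greater than Hop_Size")
    if not silent_interval_min < audio_length_min:
        problems.append("Silent_Interval_Min must be less than Audio_Length_Min")
    return problems

# The default values from the form satisfy both constraints
print(check_slicer_settings(hop_size=10, silent_interval_min=300, audio_length_min=3000))  # prints []
```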

In [None]:
#@title [Tool] VoiceIdentifier: batch-filters audio from different speakers to pick out the clips that belong to the same speaker
%cd /content/Easy-Voice-Toolkit

from EVT_Core.ASR.VPR.Identify import Voice_Identifying

class Execute_Voice_Identifying:
    '''
    Contrast the voice and filter out the similar ones
    '''
    #@markdown **Audio input directory**: directory of the audio files to be filtered by voice identification (note: no trailing slash)
    Audio_Dir_Input: str = '/content/drive/MyDrive/%EVT/ProcessResult%'   #@param {type:"string"}
    #@markdown **Target speaker and audio**: name of the target speaker and path to that speaker's voice file
    StdAudioSpeaker: dict = {'%SpeakerName%': '/content/drive/MyDrive/%EVT/Audio.wav%'}   #@param {type:"raw"}
    #@markdown **Model path**: path of the voiceprint recognition model to load
    Model_Path: str = '/content/drive/MyDrive/%EVT/Model_Download/ASR/VPR/Ecapa-Tdnn_spectrogram.pth%'   #@param {type:"string"}
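    #@markdown **Decision threshold**: threshold for deciding whether two voices belong to the same person; increase this value if the speakers being compared sound similar
    DecisionThreshold: float = 0.75   #@param {type:"number"}
    #@markdown **Model type**: type of the voiceprint recognition model
    Model_Type: str = 'Ecapa-Tdnn'   #@param ["Ecapa-Tdnn"]
    #@markdown **Feature extraction method**: method used to extract audio features
    Feature_Method: str = 'spectrogram'   #@param ["spectrogram", "melspectrogram"]
    #@markdown **Audio duration**: length of audio used for prediction
    Duration_of_Audio: float = 3.00   #@param {type:"number"}
    #@markdown **Identification result path**: path of the file that records the identified audio files and their corresponding speakers
    AudioSpeakersData_Path: str = '/content/drive/MyDrive/%EVT/ASRResult/AudioSpeakersData.txt%' #@param {type:"string"}
    #@markdown **Audio output directory**: directory in which the audio files with a matched speaker are saved
    MoveToDst: str = '/content/drive/MyDrive/%EVT/ASRResult%' #@param {type:"string"}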

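import os, shutil
from pathlib import Path
def ASRResult_Update(AudioSpeakersData_Path: str, MoveToDst: str):
    # Make sure the destination directory exists
    os.makedirs(MoveToDst, exist_ok = True)
    # Read the existing results first; opening with mode 'w' would truncate the file before it could be read
    with open(AudioSpeakersData_Path, mode = 'r', encoding = 'utf-8') as AudioSpeakersData:
        AudioSpeakers = AudioSpeakersData.readlines()
    Lines = []
    for AudioSpeaker in AudioSpeakers:
        if '|' not in AudioSpeaker:
            continue # Skip blank or malformed lines
        Audio, Speaker = AudioSpeaker.split('|', maxsplit = 1)
        Speaker = Speaker.strip()
        # Keep only the audio files that were matched to a speaker
        if Speaker != '':
            Lines.append(f"{Path(MoveToDst).joinpath(Path(Audio).name)}|{Speaker}\n")
            shutil.copy(Audio, MoveToDst)
    # Rewrite the file so that it points to the copied audio files
    with open(AudioSpeakersData_Path, mode = 'w', encoding = 'utf-8') as AudioSpeakersData:
        AudioSpeakersData.writelines(Lines)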

AudioContrastInference = Voice_Identifying(
    Execute_Voice_Identifying.StdAudioSpeaker,
    Execute_Voice_Identifying.Audio_Dir_Input,
    Execute_Voice_Identifying.AudioSpeakersData_Path,
    Execute_Voice_Identifying.Model_Path,
    Execute_Voice_Identifying.Model_Type,
    Execute_Voice_Identifying.Feature_Method,
    Execute_Voice_Identifying.DecisionThreshold,
    Execute_Voice_Identifying.Duration_of_Audio
)
AudioContrastInference.GetModel()
AudioContrastInference.Inference()
ASRResult_Update(
    Execute_Voice_Identifying.AudioSpeakersData_Path,
    Execute_Voice_Identifying.MoveToDst
)
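
The result file handled by `ASRResult_Update` holds one `audio path|speaker` pair per line, with an empty speaker when no match was found. The sketch below shows one way to read such lines back and group the matched clips by speaker. The helper name `group_by_speaker` and the sample paths are hypothetical and only serve as an illustration.

```python
from typing import Dict, List

def group_by_speaker(lines: List[str]) -> Dict[str, List[str]]:
    """Group audio paths by speaker, skipping lines without a matched speaker."""
    groups: Dict[str, List[str]] = {}
    for line in lines:
        if "|" not in line:
            continue  # Blank or malformed line
        audio, speaker = line.split("|", maxsplit=1)
        speaker = speaker.strip()
        if speaker:
            groups.setdefault(speaker, []).append(audio.strip())
    return groups

# Made-up sample lines in the "audio|speaker" layout
sample = [
    "/data/clip_001.wav|Alice\n",
    "/data/clip_002.wav|\n",
    "/data/clip_003.wav|Alice\n",
]
print(group_by_speaker(sample))
```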

In [None]:
#@title [Tool] VoiceTranscriber: batch-transcribes the speech in audio files into timestamped text and saves it as subtitle files
%cd /content/Easy-Voice-Toolkit

from EVT_Core.STT.Whisper.Transcribe import Voice_Transcribing

class Execute_Voice_Transcribing:
    '''
    Transcribe WAV content to SRT
    '''
    #@markdown **Audio directory**: directory of the wav files whose speech is to be transcribed (note: no trailing slash)
    Audio_Dir: str = '/content/drive/MyDrive/%EVT/ASRResult%'   #@param {type:"string"}
    #@markdown **Model path**: path of the Whisper model to load
    Model_Path: str = '/content/drive/MyDrive/%EVT/Model_Download/STT/Whisper/small.pt%'   #@param {type:"string"}
    #@markdown **Add language info**: tag the language used by the speaker in the audio; recommended when creating a VITS dataset
    Add_LanguageInfo: bool = True   #@param {type:"boolean"}
    #@markdown **Half precision**: compute mainly with half-precision floats; ignore or disable this option if no GPU is available
    fp16: bool = True   #@param {type:"boolean"}
    #@markdown **Enable log output**: whether to print debug logs
    Verbose: bool = True   #@param {type:"boolean"}
    #@markdown **Condition on previous text**: enabling this gives better results when the content of consecutive audio is related; disable it if the model gets stuck in a failure loop
    Condition_on_Previous_Text: bool = False   #@param {type:"boolean"}
    #@markdown **Subtitle output directory**: the generated subtitle files are saved to this directory (note: no trailing slash)
    SRT_Dir: str = '/content/drive/MyDrive/%EVT/STTResult%'   #@param {type:"string"}

WAVtoSRT = Voice_Transcribing(
    Execute_Voice_Transcribing.Model_Path,
    Execute_Voice_Transcribing.Audio_Dir,
    Execute_Voice_Transcribing.SRT_Dir,
    Execute_Voice_Transcribing.Verbose,
    Execute_Voice_Transcribing.Add_LanguageInfo,
    Execute_Voice_Transcribing.Condition_on_Previous_Text,
    Execute_Voice_Transcribing.fp16
)
WAVtoSRT.Transcriber()
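
For readers unfamiliar with subtitle files, the sketch below shows how a timestamped segment is typically written in the general SRT convention (`HH:MM:SS,mmm --> HH:MM:SS,mmm`). The helpers `srt_timestamp` and `srt_entry` are hypothetical; the exact output that `Voice_Transcribing` writes is not shown in this notebook and may differ in detail.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def srt_entry(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT block: index line, time range line, then the text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"

print(srt_entry(1, 0.0, 2.5, "Hello"))
```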

In [None]:
#@title [Tool] DatasetCreator: generates a dataset suitable for training speech models
%cd /content/Easy-Voice-Toolkit

from EVT_Core.Dataset.VITS.Create import Dataset_Creating

class Execute_Dataset_Creating:
    '''
    Convert the whisper-generated SRT and split the WAV
    '''
    #@markdown **Identification result path**: path of the file produced by voice identification that records the audio files and their corresponding speakers
    AudioSpeakersData_Path: str = '/content/drive/MyDrive/%EVT/ASRResult/AudioSpeakersData.txt%'   #@param {type:"string"}
    #@markdown **Subtitle input directory**: directory of the srt files to be converted into csv files suitable for model training (note: no trailing slash)
    SRT_Dir: str = '/content/drive/MyDrive/%EVT/STTResult/Transcript_SRT%'   #@param {type:"string"}
    #@markdown **Add auxiliary data**: add an auxiliary dataset to assist training; recommended if the quality or quantity of the current voice data is low
    Add_AuxiliaryData: bool = False   #@param {type:"boolean"}
    #@markdown **Auxiliary data text path**: path of the text file of the auxiliary dataset
    AuxiliaryData_Path: str = '/content/drive/MyDrive/%EVT/AuxiliaryData/VITS/AuxiliaryData.txt%'   #@param {type:"string"}
    #@markdown **Add auxiliary data in other languages**: enable to allow auxiliary data whose language does not match that of the current dataset
    Add_UnmatchedLanguage: bool = False   #@param {type:"boolean"}
    #@markdown **Sample rate (Hz)**: audio sample rate required by the dataset; keep 'None' to leave it unchanged
    SampleRate: int = 22050   #@param ["None", 22050, 44100, 48000, 96000, 192000]
    #@markdown **Sample width**: audio bit depth required by the dataset; keep 'None' to leave it unchanged
    SampleWidth: str = '16'   #@param ["None", 8, 16, 24, 32]
    #@markdown **Merge channels**: mix the channels of the output audio down to mono
    ToMono: bool = True   #@param {type:"boolean"}
    #@markdown **Training set ratio**: proportion of the dataset assigned to the training set
    TrainRatio: float = 0.7   #@param {type:"number"}
    #@markdown **Audio output directory**: directory in which the fully processed audio is saved (note: no trailing slash)
    WAV_Dir_Split: str = '/content/drive/MyDrive/%EVT/Dataset%'   #@param {type:"string"}
    #@markdown **Training file list path**: path at which the generated training-set txt file is saved
    FileList_Path_Training: str = '/content/drive/MyDrive/%EVT/Dataset/Train.txt%'   #@param {type:"string"}
    #@markdown **Validation file list path**: path at which the generated validation-set txt file is saved
    FileList_Path_Validation: str = '/content/drive/MyDrive/%EVT/Dataset/Valid.txt%'   #@param {type:"string"}

SRTtoCSVandSplitAudio = Dataset_Creating(
    Execute_Dataset_Creating.SRT_Dir,
    Execute_Dataset_Creating.AudioSpeakersData_Path,
    Execute_Dataset_Creating.SampleRate if Execute_Dataset_Creating.SampleRate != "None" else None,
    Execute_Dataset_Creating.SampleWidth if Execute_Dataset_Creating.SampleWidth != "None" else None,
    Execute_Dataset_Creating.ToMono,
    Execute_Dataset_Creating.WAV_Dir_Split,
    Execute_Dataset_Creating.Add_AuxiliaryData,
    Execute_Dataset_Creating.AuxiliaryData_Path,
    Execute_Dataset_Creating.Add_UnmatchedLanguage,
    Execute_Dataset_Creating.TrainRatio,
    Execute_Dataset_Creating.FileList_Path_Training,
    Execute_Dataset_Creating.FileList_Path_Validation
)
SRTtoCSVandSplitAudio.CallingFunctions()
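
To make the training set ratio concrete, the sketch below splits a list of entries into training and validation parts. It is an illustration under stated assumptions, not the toolkit's own logic: `split_dataset` is a hypothetical helper, and whether `Dataset_Creating` shuffles, or splits per speaker, is not shown in this notebook.

```python
import random
from typing import List, Tuple

def split_dataset(items: List[str], train_ratio: float, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of the items and split it into (training, validation)."""
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs (an assumption of this sketch)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Made-up file names; with a ratio of 0.7, ten clips give 7 training and 3 validation entries
clips = [f"clip_{i}.wav" for i in range(10)]
train, valid = split_dataset(clips, train_ratio=0.7)
print(len(train), len(valid))  # prints 7 3
```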

In [None]:
#@title [Tool] VoiceTrainer: trains model files suitable for speech synthesis
%cd /content/Easy-Voice-Toolkit

from EVT_Core.Train.VITS.Train import Voice_Training

class Execute_Voice_Training:
    '''
    Preprocess and then start training
    '''
    #@markdown **Training file list path**: path of the training-set txt file that provides the training audio paths and their transcripts
    FileList_Path_Training: str = '/content/drive/MyDrive/%EVT/Dataset/Train.txt%'   #@param {type:"string"}
    #@markdown **Validation file list path**: path of the validation-set txt file that provides the validation audio paths and their transcripts
    FileList_Path_Validation: str = '/content/drive/MyDrive/%EVT/Dataset/Valid.txt%'   #@param {type:"string"}
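    #@markdown **Epochs**: number of complete passes over all samples
    Epochs: int = 100   #@param {type:"integer"}
    #@markdown **Batch size**: number of samples per batch in each iteration (note: preferably a power of 2)
    Batch_Size: int = 16   #@param {type:"integer"}
    #@markdown **Use pretrained models**: use pretrained (base) models; note that they take loading priority over checkpoints
    Use_PretrainedModels: bool = True   #@param {type:"boolean"}
    #@markdown **[Optional] Pretrained G model path**: path of the pretrained generator model
    Model_Path_Pretrained_G: str = '/content/drive/MyDrive/%EVT/Pretrained Models/standard_G.pth%'   #@param {type:"string"}
    #@markdown **[Optional] Pretrained D model path**: path of the pretrained discriminator model
    Model_Path_Pretrained_D: str = '/content/drive/MyDrive/%EVT/Pretrained Models/standard_D.pth%'   #@param {type:"string"}
    #@markdown **[Optional] Keep original speakers**: keep the speakers already present in the base model; make sure each original speaker has at least one or two audio clips in the training data
    Keep_Original_Speakers: bool = False   #@param {type:"boolean"}
    #@markdown **[Optional] Config load path**: path of the config file used to load the base model's speaker information
    Config_Path_Load: str = '/content/drive/MyDrive/%EVT/Pretrained Models/standard_Config.json%'   #@param {type:"string"}
    #@markdown **Number of workers**: number of processes that may run in parallel during data loading
    Num_Workers: int = 4   #@param {type:"integer"}
    #@markdown **Half-precision training**: mixed float16 training reduces GPU memory usage, which allows a larger batch size
    FP16_Run: bool = True   #@param {type:"boolean"}
    #@markdown **Evaluation interval**: number of steps between model saves
    Eval_Interval: int = 1000   #@param {type:"integer"}
    #@markdown **Output directory**: directory for the generated models and config files; if the directory already contains models, they are treated as checkpoints (note: when it contains several models, the one with the highest number is chosen as the checkpoint)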
    Dir_Output: str = '/content/drive/MyDrive/EVT/TrainResult'   #@param {type:"string"}

# Load the TensorBoard notebook extension
%load_ext tensorboard
# Start TensorBoard
%tensorboard --logdir /content/drive/MyDrive/EVT/TrainResult

PreprocessandTrain = Voice_Training(
    Execute_Voice_Training.FileList_Path_Training,
    Execute_Voice_Training.FileList_Path_Validation,
    Execute_Voice_Training.Eval_Interval,
    Execute_Voice_Training.Epochs,
    Execute_Voice_Training.Batch_Size,
    Execute_Voice_Training.FP16_Run,
    Execute_Voice_Training.Keep_Original_Speakers,
    Execute_Voice_Training.Config_Path_Load,
    Execute_Voice_Training.Num_Workers,
    Execute_Voice_Training.Use_PretrainedModels,
    Execute_Voice_Training.Model_Path_Pretrained_G if Execute_Voice_Training.Model_Path_Pretrained_G != "None" else None,
    Execute_Voice_Training.Model_Path_Pretrained_D if Execute_Voice_Training.Model_Path_Pretrained_D != "None" else None,
    Execute_Voice_Training.Dir_Output
)
PreprocessandTrain.Preprocessing_and_Training()
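
The output directory note says that, when several models exist, the one with the highest number is used as the checkpoint. The sketch below shows how such a file could be located by hand, for example to pick a generator model for inference. It assumes files are named `G_<number>.pth`, which is inferred from the `G_*.pth` pattern used by the VoiceConverter form; `latest_checkpoint` is a hypothetical helper and not the toolkit's own selection code.

```python
import re
from pathlib import Path
from typing import Optional

def latest_checkpoint(directory: str, prefix: str = "G_") -> Optional[Path]:
    """Return the '<prefix><number>.pth' file with the highest number, or None."""
    root = Path(directory)
    if not root.is_dir():
        return None
    # Assumed naming scheme: prefix, digits, then the .pth extension
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)\.pth$")
    best_step, best_path = -1, None
    for path in root.glob(f"{prefix}*.pth"):
        match = pattern.match(path.name)
        # Compare numerically so that G_20000 ranks above G_3000
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

print(latest_checkpoint("/content/drive/MyDrive/EVT/TrainResult"))
```

Comparing the numbers as integers rather than as strings matters here, since a plain alphabetical sort would place `G_3000.pth` after `G_20000.pth`.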

In [None]:
#@title [Tool] VoiceConverter: converts text to speech and generates audio files
%cd /content/Easy-Voice-Toolkit

from EVT_Core.TTS.VITS.Convert import Voice_Converting

class Execute_Voice_Converting:
    '''
    Convert text to speech and save as audio files
    '''
    #@markdown **Config load path**: the config file at this path is used for inference
    Config_Path_Load: str = '/content/drive/MyDrive/%TrainResult/Config.json%'   #@param {type:"string"}
    #@markdown **G model load path**: path of the generator model used for inference
    Model_Path_Load: str = '/content/drive/MyDrive/%TrainResult/G_*.pth%'   #@param {type:"string"}
    #@markdown **Input text**: the text entered here becomes the speaker's spoken content
    Text: str = '请输入语句'   #@param {type:"string"}
    #@markdown **Language**: language used by the speaker and the text
    Language: str = '[ZH]'   #@param ["[ZH]", "[EN]", "[JA]"]
    #@markdown **Speaker name**: name of the speaker
    Speaker: str = '%Name%'   #@param {type:"string"}
    #@markdown **Emotion strength**: degree of emotional variation
    EmotionStrength: float = .667   #@param {type:"number"}
    #@markdown **Phoneme duration**: pronunciation length of the phonemes
    PhonemeDuration: float = 0.8   #@param {type:"number"}
    #@markdown **Speech rate**: overall speaking speed
    SpeechRate: float = 1.0   #@param {type:"number"}
    #@markdown **Audio save path**: path at which the audio produced by inference is saved
    Audio_Path_Save: str = '/content/drive/MyDrive/%Audio_Converted.wav%'   #@param {type:"string"}

VoiceConverting = Voice_Converting(
    Execute_Voice_Converting.Config_Path_Load,
    Execute_Voice_Converting.Model_Path_Load,
    Execute_Voice_Converting.Text,
    Execute_Voice_Converting.Language,
    Execute_Voice_Converting.Speaker,
    Execute_Voice_Converting.EmotionStrength,
    Execute_Voice_Converting.PhonemeDuration,
    Execute_Voice_Converting.SpeechRate,
    Execute_Voice_Converting.Audio_Path_Save
)
VoiceConverting.Converting()