## Terms of Use

**Please resolve the authorization of your dataset yourself. You are solely responsible for any problems caused by training on unauthorized datasets and for all resulting consequences. The repository and its maintainer bear no responsibility for those consequences!**

1. This project is established for academic exchange and learning only. It is not intended for production environments.
2. Any video based on Easy Voice Toolkit that is published on a video platform must state clearly in its description that voice conversion was used and must specify the input source of the voice or audio. For example, if you separate the vocals from videos or audio published by others and use them as the input source, you must provide clear links to the original video or music. If you use your own voice, or voices synthesized by other commercial vocal synthesis software, as the input source, you must also state this in the description.
3. You shall be solely responsible for any infringement problems caused by the input source. When using other commercial vocal synthesis software as input source, please ensure that you comply with the terms of use of the software. Note that many vocal synthesis engines clearly state in their terms of use that they cannot be used for input source conversion.
4. Continued use of this project is deemed acceptance of the provisions stated in this repository's README. The README has fulfilled its obligation to advise users, and the repository is not responsible for any problems that arise afterwards.
5. If you distribute this repository's code or publish any results produced by this project publicly (including but not limited to video sharing platforms), please indicate the original author and code source (this repository).
6. If you intend to use this project for any other purpose, please contact and inform the author of this repository in advance. Thank you very much.

## Configure Colab

### Prevent Disconnection

### Use GPU
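
In Colab the GPU runtime is chosen from the menu (Runtime > Change runtime type). After switching, a quick check like the one below can confirm that a GPU is actually visible to the session. This is only a sketch: the helper name `gpu_visible` is hypothetical, and it assumes the NVIDIA driver tool `nvidia-smi` is on the `PATH`, as it normally is on GPU runtimes.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No driver tool found, so this is most likely a CPU-only runtime.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU".
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```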

### Clone Repository

In [None]:
#@title Clone Repository
!git clone https://github.com/Spr-Aachen/Easy-Voice-Toolkit.git
%cd /content/Easy-Voice-Toolkit

### Install Dependencies

In [None]:
#@title Install Dependencies
!apt-get update
!sudo apt-get install -y portaudio19-dev
%pip uninstall -y torchdata torchtext
%pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
%pip install -r requirements.txt
#exit() # Enable this only when you decide to delete the runtime
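
After the install cell finishes, it can be useful to confirm that the pinned PyTorch packages are the ones actually installed, since Colab ships its own versions. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical and the pins are copied from the install cell.

```python
from importlib import metadata

# Version pins copied from the install cell.
PINNED = {
    "torch": "2.0.0+cu118",
    "torchvision": "0.15.1+cu118",
    "torchaudio": "2.0.1+cu118",
}


def check_pins(pins):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (installed, installed == wanted)
    return report


for name, (installed, ok) in check_pins(PINNED).items():
    print(f"{name}: installed={installed} pinned={PINNED[name]} match={ok}")
```

If a package shows `match=False`, restarting the runtime and re-running the install cell is usually the first thing to try.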

### Mount Google Drive

In [None]:
#@title Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

### Prepare Files
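
The tool cells below use default paths such as `/content/drive/MyDrive/%Audio_Original%`, and their notes ask for directories without a trailing slash. Assuming the `%...%` parts are placeholders meant to be replaced with your own folder or file names, a small check like this one can catch paths that were left unedited. The helper name `check_drive_path` is hypothetical and not part of the toolkit.

```python
def check_drive_path(path: str) -> list:
    """Return a list of problems found in a path setting (empty if none)."""
    problems = []
    if "%" in path:
        # Assumption: %Name% marks a placeholder that must be replaced.
        problems.append("still contains a %...% placeholder")
    if path.endswith("/"):
        problems.append("has a trailing slash")
    return problems


for candidate in (
    "/content/drive/MyDrive/%Audio_Original%",
    "/content/drive/MyDrive/voice/raw/",
    "/content/drive/MyDrive/voice/raw",
):
    print(candidate, "->", check_drive_path(candidate) or "ok")
```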

## Run Tools

### Tool_AudioProcessor
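This tool batch-converts media files to audio files and then automatically trims the silent parts of the audio.

In [None]:
#@title [Tool] AudioProcessor: batch-convert media files to audio files, then automatically trim the silent parts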
%cd /content/Easy-Voice-Toolkit

#import os, sys
#sys.path.append(os.path.join(os.getcwd(), "Tool_AudioProcessor"))

from Tool_AudioProcessor.Process import Audio_Processing

class Execute_Audio_Processing:
    '''
    Change media format to WAV and cut off the silent parts
    '''
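    #@markdown **Media input directory**: the directory containing the media files to be converted to audio (note: no trailing slash)
    Media_Dir_Input: str = '/content/drive/MyDrive/%Audio_Original%'   #@param {type:"string"}
    #@markdown **Media output format**: the format of the output audio files
    Media_Format_Output: str = 'wav'   #@param ["wav", "mp3"]
    #@markdown **RMS threshold (dB)**: segments below this threshold are treated as silence; increase it if noise reduction is needed
    RMS_Threshold: float = -40.   #@param {type:"number"}
    #@markdown **Hop size (ms)**: the length of each RMS frame; increasing this value improves slicing precision but slows the process
    Hop_Size: int = 10   #@param {type:"integer"}
    #@markdown **Minimum silence interval (ms)**: the minimum length for a silent part to be sliced; decrease it if the audio contains only short breaks (note: this value must be smaller than Audio Length Min and larger than Hop Size)
    Silent_Interval_Min: int = 300   #@param {type:"integer"}
    #@markdown **Maximum silence kept (ms)**: the maximum length of silence kept around each sliced clip (hint: this value need not match the silence length in the sliced audio exactly; the algorithm searches for the best slicing position itself)
    Silence_Kept_Max: int = 1000   #@param {type:"integer"}
    #@markdown **Minimum audio length (ms)**: the minimum length required for each sliced audio clip
    Audio_Length_Min: int = 5000   #@param {type:"integer"}
    #@markdown **Media output directory**: the directory in which the generated audio files are saved (note: no trailing slash)
    Media_Dir_Output: str = '/content/drive/MyDrive/%Audio_Processed%'   #@param {type:"string"}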

    @staticmethod
    def Run():
        AudioConvertandSlice = Audio_Processing(
            Execute_Audio_Processing.Media_Dir_Input,
            Execute_Audio_Processing.Media_Dir_Output,
            Execute_Audio_Processing.Media_Format_Output,
            Execute_Audio_Processing.RMS_Threshold,
            Execute_Audio_Processing.Audio_Length_Min,
            Execute_Audio_Processing.Silent_Interval_Min,
            Execute_Audio_Processing.Hop_Size,
            Execute_Audio_Processing.Silence_Kept_Max
        )
        AudioConvertandSlice.Convert_Media()
        AudioConvertandSlice.Slice_Audio()

Execute_Audio_Processing.Run()
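
The slicer settings above are interdependent: the minimum silence interval must be larger than the hop size and smaller than the minimum audio length. The sketch below checks that ordering before a run and also shows what a threshold in dB means as a linear amplitude. Both helper names are hypothetical, and the dB conversion assumes the usual amplitude convention of 20·log10; the toolkit's own implementation is not shown in this notebook.

```python
def db_to_amplitude(db: float) -> float:
    """Convert a level in dB to a linear amplitude ratio (20*log10 convention)."""
    return 10.0 ** (db / 20.0)


def check_slicer_settings(hop_size: int, silent_interval_min: int, audio_length_min: int) -> None:
    """Raise ValueError if the millisecond settings break the documented ordering."""
    if not hop_size < silent_interval_min < audio_length_min:
        raise ValueError(
            "expected Hop_Size < Silent_Interval_Min < Audio_Length_Min, got "
            f"{hop_size}, {silent_interval_min}, {audio_length_min}"
        )


# Defaults from the cell above: 10 ms hop, 300 ms interval, 5000 ms minimum clip.
check_slicer_settings(10, 300, 5000)
print("threshold as linear amplitude:", db_to_amplitude(-40.0))
```

With the default of -40 dB, frames whose RMS amplitude is below roughly 1% of full scale would count as silence under this convention.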

### Tool_VoiceIdentifier
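This tool batch-filters audio from different speakers to pick out the clips that belong to the same speaker.

In [None]:
#@title [Tool] VoiceIdentifier: batch-filter audio from different speakers to pick out the clips that belong to the same speaker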
%cd /content/Easy-Voice-Toolkit

#import os, sys
#sys.path.append(os.path.join(os.getcwd(), 'Tool_VoiceIdentifier'))
from typing import Optional

from Tool_VoiceIdentifier.Identify import Voice_Identifying

class Execute_Voice_Identifying:
    '''
    Compare each clip with the reference voice and keep the ones that match
    '''
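    #@markdown **Audio input directory**: the directory containing the audio files to be filtered by voice identification (note: no trailing slash)
    Audio_Dir_Input: str = '/content/drive/MyDrive/%Audio_Processed%'   #@param {type:"string"}
    #@markdown **Reference audio path**: the path of the audio used as the comparison standard (the expected speaker)
    Audio_Path_Std: str = '/content/drive/MyDrive/%Audio.wav%'   #@param {type:"string"}
    #@markdown **Model directory**: the directory for the downloaded voiceprint recognition model; an existing model is used directly (note: no trailing slash)
    Model_Dir: str = '/content/drive/MyDrive/%Model_Download%'   #@param {type:"string"}
    #@markdown **Model type**: the type of the voiceprint recognition model
    Model_Type: str = 'Ecapa-Tdnn'   #@param ["Ecapa-Tdnn"]
    #@markdown **Model name**: the name of the voiceprint recognition model; by default it indicates the model size
    Model_Name: str = 'small'   #@param ["small"]
    #@markdown **Feature extraction method**: the method used to extract audio features
    Feature_Method: str = 'spectrogram'   #@param ["spectrogram", "melspectrogram"]
    #@markdown **Decision threshold**: the threshold for deciding whether two clips come from the same speaker; increase it if the speakers being compared sound very similar
    DecisionThreshold: float = 0.84   #@param {type:"number"}
    #@markdown **Audio duration**: the length of audio used for prediction
    Duration_of_Audio: float = 3.00   #@param {type:"number"}
    #@markdown **Audio output directory**: the directory for the filtered audio files (note: no trailing slash)
    Audio_Dir_Output: str = '/content/drive/MyDrive/%Audio_Filtered%' #@param {type:"string"}
    #@markdown **[Optional] Speaker ID**: the ID of the speaker; may be left blank for a single-speaker model, required for a multi-speaker model (note: the first speaker's ID is 0, the second is 1, and so on)
    SpeakerID: Optional[int] = 0   #@param {type:"integer"}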

    @staticmethod
    def Run():
        AudioContrastInference = Voice_Identifying(
            Execute_Voice_Identifying.Audio_Path_Std,
            Execute_Voice_Identifying.Audio_Dir_Input,
            Execute_Voice_Identifying.Audio_Dir_Output,
            Execute_Voice_Identifying.Model_Dir,
            Execute_Voice_Identifying.Model_Type,
            Execute_Voice_Identifying.Model_Name,
            Execute_Voice_Identifying.Feature_Method,
            Execute_Voice_Identifying.DecisionThreshold,
            Execute_Voice_Identifying.Duration_of_Audio,
            Execute_Voice_Identifying.SpeakerID
        )
        AudioContrastInference.GetModel()
        AudioContrastInference.Inference()

Execute_Voice_Identifying.Run()
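
To make the role of the decision threshold concrete, here is a minimal sketch of how a similarity score can be turned into a same-speaker decision. It assumes that each clip is reduced to an embedding vector and that two embeddings are compared by cosine similarity; the notebook does not state which score the tool uses, so treat both the scoring method and the helper names as assumptions.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / norm


def same_speaker(reference, candidate, threshold: float = 0.84) -> bool:
    """Accept the candidate when its similarity to the reference exceeds the threshold."""
    return cosine_similarity(reference, candidate) > threshold


# Toy 3-dimensional "embeddings"; real speaker embeddings are much longer.
reference = [0.9, 0.1, 0.3]
print(same_speaker(reference, [0.8, 0.2, 0.3]))   # close to the reference
print(same_speaker(reference, [-0.2, 0.9, 0.1]))  # points in a different direction
```

Raising the threshold makes the filter stricter, which matches the advice above to increase it when the speakers sound alike.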

### Tool_VoiceTranscriber
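This tool batch-transcribes speech files into timestamped text and saves it as subtitle files.

In [None]:
#@title [Tool] VoiceTranscriber: batch-transcribe speech files into timestamped text and save it as subtitle files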
%cd /content/Easy-Voice-Toolkit

#import os, sys
#sys.path.append(os.path.join(os.getcwd(), 'Tool_VoiceTranscriber'))
from typing import Optional

from Tool_VoiceTranscriber.Transcribe import Voice_Transcribing

class Execute_Voice_Transcribing:
    '''
    Transcribe WAV content to SRT
    '''
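    #@markdown **Audio directory**: the directory of the WAV files whose speech is to be transcribed (note: no trailing slash)
    WAV_Dir: str = '/content/drive/MyDrive/%Audio_Filtered%'   #@param {type:"string"}
    #@markdown **Model directory**: the directory for the downloaded speech recognition model; an existing model is used directly (note: no trailing slash)
    Model_Dir: str = '/content/drive/MyDrive/%Model_Download%'   #@param {type:"string"}
    #@markdown **Model name**: the name of the speech recognition (whisper) model; by default it corresponds to the model size
    Model_Name: str = 'small'   #@param ["tiny", "base", "small", "medium", "large"]
    #@markdown **Verbose output**: whether to print debug logs
    Verbose: bool = True   #@param {type:"boolean"}
    #@markdown **Condition on previous text**: feed the model's previous output as the prompt for the next window; disable this if the model gets stuck in a failure loop
    Condition_on_Previous_Text: bool = True   #@param {type:"boolean"}
    #@markdown **Half precision**: compute mainly in half-precision floating point; ignore or disable this if no GPU is available
    fp16: bool = True   #@param {type:"boolean"}
    #@markdown **Subtitle output directory**: the generated subtitle files are saved to this directory (note: no trailing slash)
    SRT_Dir: str = '/content/drive/MyDrive/%Transcript_SRT%'   #@param {type:"string"}
    #@markdown **[Optional] Language**: the language spoken in the audio; keep 'None' if several languages are present
    Language: Optional[str] = None   #@param ["None", "zh", "en"]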

    @staticmethod
    def Run():
        WAVtoSRT = Voice_Transcribing(
            Execute_Voice_Transcribing.Model_Name,
            Execute_Voice_Transcribing.Model_Dir,
            Execute_Voice_Transcribing.WAV_Dir,
            Execute_Voice_Transcribing.SRT_Dir,
            Execute_Voice_Transcribing.Verbose,
            Execute_Voice_Transcribing.Language,
            Execute_Voice_Transcribing.Condition_on_Previous_Text,
            Execute_Voice_Transcribing.fp16
        )
        WAVtoSRT.Transcriber()

Execute_Voice_Transcribing.Run()
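
The subtitle files written by this step follow the SRT layout: a running index, a `start --> end` line with `HH:MM:SS,mmm` timestamps, and the text. As an illustration of that layout only (not the toolkit's own writer), the sketch below formats one entry from a start time, an end time, and a transcript; the function names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_block(index: int, start: float, end: float, text: str) -> str:
    """Build one SRT entry: index line, time range line, then the text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"


print(srt_block(1, 0.0, 2.5, " Hello there. "))
```

In a full SRT file, consecutive entries are separated by a blank line.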

### Tool_DatasetCreator
This tool generates a dataset suitable for training voice models.

In [None]:
#@title [Tool] DatasetCreator: generate a dataset suitable for training voice models
%cd /content/Easy-Voice-Toolkit

#import os, sys
#sys.path.append(os.path.join(os.getcwd(), 'Tool_DatasetCreator'))

from Tool_DatasetCreator.Create import Dataset_Creating

class Execute_Dataset_Creating:
    '''
    Convert the whisper-generated SRT and split the WAV
    '''
    #@markdown **Audio input directory**: the directory of the WAV files to be resampled and split by subtitle timestamps (note: no trailing slash)
    WAV_Dir: str = '/content/drive/MyDrive/%Audio_Filtered%'   #@param {type:"string"}
    #@markdown **Sample rate (Hz)**: the new sample rate to use
    Sample_Rate: int = 22050   #@param {type:"integer"}
    #@markdown **Sample format**: the new sample format to use
    Subtype: str = 'PCM_16'   #@param ["PCM_16"]
    #@markdown **Audio output directory**: the directory for the fully processed audio (note: no trailing slash)
    WAV_Dir_Split: str = '/content/drive/MyDrive/%Output4%'   #@param {type:"string"}
    #@markdown **Subtitle input directory**: the directory of the SRT files to be converted into CSV files suitable for model training (note: no trailing slash)
    SRT_Dir: str = '/content/drive/MyDrive/%Transcript_SRT%'   #@param {type:"string"}
    #@markdown **Autoencoder**: the autoencoder used for model training
    Encoder: str = 'VITS'   #@param ["VITS"]
    #@markdown **Multi-speaker**: whether to train a multi-speaker model
    IsSpeakerMultiple: bool = False   #@param {type:"boolean"}
    #@markdown **Training filelist path**: the path where the generated training-set txt file is saved
    FileList_Path_Training: str = '/content/drive/MyDrive/%Train.txt%'   #@param {type:"string"}
    #@markdown **Validation filelist path**: the path where the generated validation-set txt file is saved
    FileList_Path_Validation: str = '/content/drive/MyDrive/%Val.txt%'   #@param {type:"string"}

    @staticmethod
    def Run():
        SRTtoCSVandSplitAudio = Dataset_Creating(
            Execute_Dataset_Creating.SRT_Dir,
            Execute_Dataset_Creating.WAV_Dir,
            Execute_Dataset_Creating.Sample_Rate,
            Execute_Dataset_Creating.Subtype,
            Execute_Dataset_Creating.WAV_Dir_Split,
            Execute_Dataset_Creating.Encoder,
            Execute_Dataset_Creating.IsSpeakerMultiple,
            Execute_Dataset_Creating.FileList_Path_Training,
            Execute_Dataset_Creating.FileList_Path_Validation
        )
        SRTtoCSVandSplitAudio.CallingFunctions()

Execute_Dataset_Creating.Run()
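
The two txt files produced here pair each audio clip with its transcript. VITS-style filelists conventionally use one `path|text` line per clip, or `path|speaker_id|text` for multi-speaker training. The sketch below shows one way to build such lines and split them into training and validation sets. It is an illustration under assumptions: the 10% validation share, the shuffling, and the function name are hypothetical, and the tool's actual split rule is not stated in this notebook.

```python
import random


def build_filelists(entries, multi_speaker=False, val_ratio=0.1, seed=0):
    """Turn (wav_path, speaker_id, text) tuples into (train_lines, val_lines)."""
    lines = []
    for wav_path, speaker_id, text in entries:
        if multi_speaker:
            lines.append(f"{wav_path}|{speaker_id}|{text}")
        else:
            lines.append(f"{wav_path}|{text}")
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(lines)
    # Hold out at least one line for validation when there is more than one clip.
    n_val = max(1, int(len(lines) * val_ratio)) if len(lines) > 1 else 0
    return lines[n_val:], lines[:n_val]


entries = [(f"/data/clip_{i}.wav", 0, f"sentence {i}") for i in range(10)]
train, val = build_filelists(entries)
print(len(train), "training lines,", len(val), "validation lines")
print(train[0])
```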

### Tool_VoiceTrainer
This tool trains a model file suitable for speech synthesis.

In [None]:
#@title [Tool] VoiceTrainer: train a model file suitable for speech synthesis
%cd /content/Easy-Voice-Toolkit

#import os, sys
#sys.path.append(os.path.join(os.getcwd(), 'Tool_VoiceTrainer'))
from typing import Optional

from Tool_VoiceTrainer.Train import Voice_Training

class Execute_Voice_Training:
    '''
    Preprocess and then start training
    '''
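    #@markdown **Training filelist path**: the path of the training-set txt file that lists the training audio paths and their transcripts
    FileList_Path_Training: str = '/content/drive/MyDrive/%Train.txt%'   #@param {type:"string"}
    #@markdown **Validation filelist path**: the path of the validation-set txt file that lists the validation audio paths and their transcripts
    FileList_Path_Validation: str = '/content/drive/MyDrive/%Val.txt%'   #@param {type:"string"}
    #@markdown **Language**: the language spoken in the audio
    Language: str = 'mandarin_english'   #@param ["mandarin", "mandarin_english"]
    #@markdown **Evaluation interval**: the number of steps between each evaluation and model save
    Eval_Interval: int = 1000   #@param {type:"integer"}
    #@markdown **Epochs**: the number of full passes over all samples
    Epochs: int = 10000   #@param {type:"integer"}
    #@markdown **Batch size**: the number of samples per batch in each iteration; decrease it if your GPU is weak (note: a power of 2 is recommended; a value of 1 makes the network hard to converge)
    Batch_Size: int = 8   #@param {type:"integer"}
    #@markdown **Number of workers**: the number of subprocesses available for data loading; decrease it if your CPU is weak
    Num_Workers: int = 8   #@param {type:"integer"}
    #@markdown **Half-precision training**: mixed-precision training with float16 reduces GPU memory use to allow a larger batch size
    FP16_Run: bool = True   #@param {type:"boolean"}
    #@markdown **Multi-speaker**: enable to support multi-speaker model training
    IsSpeakerMultiple: bool = False   #@param {type:"boolean"}
    #@markdown **Speaker names**: may be left blank for a single-speaker model, required for a multi-speaker model (note: separate names with commas, like ['Name1', 'Name2', ] )
    Speakers: list = ['Name1',]   #@param {type:"raw"}
    #@markdown **Config save directory**: the directory for the config file updated with the settings above (note: no trailing slash)
    Config_Dir_Save: str = '/content/drive/MyDrive/%Config_Train%'   #@param {type:"string"}
    #@markdown **Model save directory**: the directory for the generated models (note: no trailing slash; do not keep models trained on different datasets in this directory)
    Model_Dir_Save: str = '/content/drive/MyDrive/%Model_Train%'   #@param {type:"string"}
    #@markdown **[Optional] Config load path**: the path of a user config file that replaces the default config
    Config_Path_Load: Optional[str] = None   #@param {type:"string"}
    #@markdown **[Optional] Pretrained G model path**: the path of the pretrained generator model used as a checkpoint (hint: this file is copied to the "checkpoints" folder under the model save directory)
    Model_Path_Pretrained_G: Optional[str] = None   #@param {type:"string"}
    #@markdown **[Optional] Pretrained D model path**: the path of the pretrained discriminator model used as a checkpoint (hint: this file is copied to the "checkpoints" folder under the model save directory)
    Model_Path_Pretrained_D: Optional[str] = None   #@param {type:"string"}

    @staticmethod
    def Run():
        PreprocessandTrain = Voice_Training(
            # The training filelist is passed first, then the validation filelist.
            Execute_Voice_Training.FileList_Path_Training,
            Execute_Voice_Training.FileList_Path_Validation,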
            Execute_Voice_Training.Language,
            Execute_Voice_Training.Config_Path_Load,
            Execute_Voice_Training.Config_Dir_Save,
            Execute_Voice_Training.Eval_Interval,
            Execute_Voice_Training.Epochs,
            Execute_Voice_Training.Batch_Size,
            Execute_Voice_Training.FP16_Run,
            Execute_Voice_Training.IsSpeakerMultiple,
            Execute_Voice_Training.Speakers,
            Execute_Voice_Training.Num_Workers,
            Execute_Voice_Training.Model_Path_Pretrained_G,
            Execute_Voice_Training.Model_Path_Pretrained_D,
            Execute_Voice_Training.Model_Dir_Save
        )
        PreprocessandTrain.Preprocessing_and_Training()

Execute_Voice_Training.Run()
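
Because the evaluation interval is counted in steps while the training length is counted in epochs, it helps to estimate how the two relate for your dataset size. The sketch below does that arithmetic. It assumes one step processes one batch and that a final partial batch still counts as a step; the trainer's real batching may differ, and the function name is hypothetical.

```python
import math


def training_schedule(num_samples, batch_size=8, epochs=10000, eval_interval=1000):
    """Return (steps_per_epoch, total_steps, evaluations) for the given settings."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    # One evaluation (and model save) every `eval_interval` steps.
    evaluations = total_steps // eval_interval
    return steps_per_epoch, total_steps, evaluations


steps_per_epoch, total_steps, evaluations = training_schedule(num_samples=500)
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total, "
      f"{evaluations} evaluations")
```

For 500 clips with the defaults above, this gives 63 steps per epoch, so a model is evaluated and saved about every 16 epochs.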