# MeloTTS for OpenVINO: Standalone Example with a Gradio Interface (MeloTTS_run.ipynb)
by Jack OmniXRI, 2024/12/12

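This is a simplified example. Before running it, you must install and activate the OpenVINO 2024.4 Notebooks virtual environment. For the complete steps, see the link below.  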
https://github.com/zhaohb/MeloTTS-OV/tree/speech-enhancement-and-npu  

Next, install MeloTTS-OV (the speech-enhancement-and-npu version).  
* First, go to https://github.com/zhaohb/MeloTTS-OV/tree/speech-enhancement-and-npu  
* Click the green "<> Code" button, choose "Download ZIP", unzip the archive, and copy \MeloTTS-OV-speech-enhancement-and-npu into  
the OpenVINO Notebooks path \openvino_notebooks\notebooks\. (Do not use a plain git clone command; it would download the standard version by mistake.)  
* Then change to that folder and install the required packages.  
cd MeloTTS-OV-speech-enhancement-and-npu  
pip install -r requirements.txt  
pip install openvino nncf  
python setup.py develop # or  pip install -e .  
python -m unidic download  
python -m nltk.downloader averaged_perceptron_tagger_eng  
pip install deepfilternet # optional for enhancing speech  
pip install ffmpeg # required; otherwise audio/video content cannot be displayed  
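
Before moving on, it can help to confirm that the packages installed above are importable from the active virtual environment. The following is a minimal sketch only: the module names (`openvino`, `nncf`, `melo`, `df`, `gradio`) are assumptions based on the imports used later in this notebook, and `missing_modules` is a hypothetical helper, not part of MeloTTS-OV.

```python
from importlib.util import find_spec

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

# Module names assumed from the imports used later in this notebook.
required = ["openvino", "nncf", "melo", "df", "gradio"]

missing = missing_modules(required)
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("All required modules were found.")
```

If a module is reported as not importable, the most likely cause is that the virtual environment was not activated before running the `pip install` commands.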

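Run the test program. It also downloads the model and converts it, storing the result under \tts_ov_ZH. The first run takes longer because of the model download and conversion. When it finishes, it produces the audio file ov_en_int8_ZH.wav, which you can click to play as a test.  
python test_tts.py --language ZH --tts_device CPU --bert_device CPU  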
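
To check the result of the test run without opening a media player, a small script can confirm that the converted-model folder exists and report the sample rate and duration of the generated file. This is a sketch under assumptions: it uses Python's standard `wave` module, which only reads PCM-encoded WAV files, and `wav_summary` is a hypothetical helper. The file and folder names are the ones mentioned above.

```python
import wave
from pathlib import Path

def wav_summary(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
        frames = wf.getnframes()
    return rate, frames / rate

wav_path = Path("ov_en_int8_ZH.wav")  # output file name from the text above
model_dir = Path("tts_ov_ZH")         # converted-model folder from the text above

print("Model folder present:", model_dir.is_dir())
if wav_path.exists():
    rate, seconds = wav_summary(wav_path)
    print(f"{wav_path}: {rate} Hz, {seconds:.2f} s")
else:
    print(f"{wav_path} not found; run test_tts.py first.")
```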

The simplified program below can be run only after the steps above have been completed.

In [1]:
from melo.api import TTS
from pathlib import Path
import time



INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino


## MeloTTS Text-to-Speech Basic Parameter Settings
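
The cell in this section sets `tts_device` and `bert_device` by hand. Before choosing "GPU" or "NPU", it may be worth checking which devices OpenVINO actually reports on the machine. The sketch below queries `Core().available_devices` and falls back to "CPU" when the preferred device is absent. `pick_device` is a hypothetical helper, and the fallback list used when OpenVINO is not installed is an assumption made so the snippet still runs.

```python
def pick_device(preferred, available):
    """Return `preferred` if an available device matches it, otherwise "CPU"."""
    # Multiple devices of one type may be reported as "GPU.0", "GPU.1", so match the prefix too.
    for dev in available:
        if dev == preferred or dev.startswith(preferred + "."):
            return preferred
    return "CPU"

try:
    import openvino as ov
    available = ov.Core().available_devices  # list of device-name strings
except ImportError:
    available = ["CPU"]  # assumption: fall back to CPU when OpenVINO is missing

print("Available devices:", available)
print("TTS device: ", pick_device("GPU", available))
print("BERT device:", pick_device("NPU", available))
```

The returned strings can then be assigned to `tts_device` and `bert_device` instead of hard-coding them.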

In [2]:
speed = 1.0 # speech rate
use_ov = True  # True: use OpenVINO, False: use PyTorch
use_int8 = True # True: enable INT8 format
speech_enhance = True # True: enable speech enhancement

tts_device = "CPU" # TTS inference device: "CPU" or "GPU" (here meaning an Intel GPU)
bert_device = "CPU" # BERT inference device: "CPU", "GPU", or "NPU"
lang =  "ZH" # language: EN for English, ZH for Chinese (mixed English, Simplified and Traditional Chinese all work)

# Test string to convert to speech
if lang == "ZH":
    text = "我們討如何在 Intel 平台上轉換和優化 artificial intelligence 模型"
elif lang == "EN":
    text = "For Intel platforms, we explore the methods for converting and optimizing models."

# If speech enhancement is enabled, define the process_audio() function
if speech_enhance:
    from df.enhance import enhance, init_df, load_audio, save_audio
    import torchaudio

    # Process the input audio file and save the result to a new file
    def process_audio(input_file: str, output_file: str, new_sample_rate: int = 48000):
        """
        Load an audio file, enhance it using a DeepFilterNet, and save the result.

        Parameters:
        input_file (str): Path to the input audio file.
        output_file (str): Path to save the enhanced audio file.
        new_sample_rate (int): Desired sample rate for the output audio file (default is 48000 Hz).
        """

        model, df_state, _ = init_df()
        audio, sr = torchaudio.load(input_file)
        
        # Resample the WAV file to meet the requirements of DeepFilterNet
        resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=new_sample_rate)
        resampled_audio = resampler(audio)

        enhanced = enhance(model, df_state, resampled_audio)

        # Save the enhanced audio
        save_audio(output_file, enhanced, df_state.sr())

# Initialize TTS
model = TTS(language=lang, tts_device=tts_device, bert_device=bert_device, use_int8=use_int8)

# Get speaker information
speaker_ids = model.hps.data.spk2id
speakers = list(speaker_ids.keys())

# If OpenVINO is used, check whether this language has already been converted; if not, convert it. The result is stored under \tts_ov_<language>
if use_ov:
    ov_path = f"tts_ov_{lang}"

    if not Path(ov_path).exists():
        # Convert the original model to OpenVINO IR (bin+xml) format
        model.tts_convert_to_ov(ov_path, language=lang)

    # Initialize the model
    model.ov_model_init(ov_path, language=lang)

if not use_ov: # OpenVINO is not used
    for speaker in speakers:
        output_path = 'en_pth_{}.wav'.format(str(speaker))
        start = time.perf_counter()
        model.tts_to_file(text, speaker_ids[speaker], output_path, speed=speed*0.75, use_ov=use_ov)
        end = time.perf_counter()
else: # OpenVINO is used
    for speaker in speakers:
        output_path = 'ov_en_int8_{}.wav'.format(speaker) if use_int8 else 'en_ov_{}.wav'.format(speaker)
        start = time.perf_counter()
        model.tts_to_file(text, speaker_ids[speaker], output_path, speed=speed, use_ov=use_ov)

        if speech_enhance:
            print("Use speech enhance")
            process_audio(output_path, output_path)

        end = time.perf_counter()

# start/end are overwritten on each loop iteration, so this reports the last speaker only
dur_time = (end - start) * 1000
print(f"MeloTTS text-to-speech took: {dur_time:.2f} ms")

  from torchaudio.backend.common import AudioMetaData


init tts_ov_ZH\bert_int8_ZH.xml
ov_path : tts_ov_ZH\tts_int8_ZH.xml
 > Text split to sentences.
我們討如何在 Intel 平台上轉換和優化 artificial intelligence 模型


  0%|                                                                                    | 0/1 [00:00<?, ?it/s]
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\omnixri\AppData\Local\Temp\jieba.cache
Loading model cost 0.953 seconds.
Prefix dict has been built successfully.
100%|████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.01s/it]

Use speech enhance
2024-12-12 18:51:36 | INFO     | DF | Running on torch 2.1.0+cpu
2024-12-12 18:51:36 | INFO     | DF | Running on host omnixri
2024-12-12 18:51:36 | INFO     | DF | Loading model settings of DeepFilterNet3
2024-12-12 18:51:36 | INFO     | DF | Using DeepFilterNet3 model at C:\Users\omnixri\AppData\Local\DeepFilterNet\DeepFilterNet\Cache\DeepFilterNet3
2024-12-12 18:51:36 | INFO     | DF | Initializing model `deepfilternet3`
2024-12-12 18:51:36 | INFO     | DF | Found checkpoint C:\Users\omnixri\AppData\Local\DeepFilterNet\DeepFilterNet\Cache\DeepFilterNet3\checkpoints\model_120.ckpt.best with epoch 120
2024-12-12 18:51:36 | INFO     | DF | Running on device cpu
2024-12-12 18:51:36 | INFO     | DF | Model loaded
MeloTTS text-to-speech took: 2828.01 ms
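
The timing reported above comes from a single `start`/`end` pair, which the synthesis loop reassigns on every pass. When a language has more than one speaker, a per-call timer gives a clearer picture. The following is a minimal sketch, not the notebook's own method: `timed_call` is a hypothetical helper, `fake_tts` stands in for `model.tts_to_file`, and the speaker names are made up.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

# Stand-in for model.tts_to_file(...); replace with the real call in the notebook.
def fake_tts(text, speaker):
    time.sleep(0.01)
    return f"{speaker}.wav"

timings = {}
for speaker in ["speaker_a", "speaker_b"]:  # hypothetical speaker names
    _, ms = timed_call(fake_tts, "hello", speaker)
    timings[speaker] = ms

for speaker, ms in timings.items():
    print(f"{speaker}: {ms:.2f} ms")
print(f"Total: {sum(timings.values()):.2f} ms")
```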


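## Creating an Interactive Interface with Gradio

Besides being displayed inline in the notebook cell, the interface can also be opened as a web page at http://127.0.0.1:7860 (http://localhost:7860). If that port is already in use, Gradio picks another one and prints the actual URL when it launches.  

While it is running, you can type your own text, choose whether to use OpenVINO, and adjust the speech rate (50% to 200%). Clicking "Submit" starts the conversion, which produces the audio file and displays the conversion time.  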
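
The slider value is a percentage, while the speed passed to the model is a factor, so the handler in the next cell divides it by 100. A defensive variant could also clamp the value to the slider's range in case the function is called from somewhere other than the UI. This is an illustrative sketch; `percent_to_speed` is a hypothetical helper and is not used by the notebook.

```python
def percent_to_speed(percent, low=50, high=200):
    """Clamp a percentage to [low, high] and convert it to a speed factor."""
    clamped = max(low, min(high, percent))
    return clamped / 100

print(percent_to_speed(100))  # 1.0
print(percent_to_speed(50))   # 0.5
print(percent_to_speed(250))  # 2.0 (clamped to the upper bound)
```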

In [3]:
import gradio as gr

# Define the text-to-speech handler tts()
# Inputs: content (string), use_ov (boolean), speed (number, in percent)
# Outputs: "MeloTTS text-to-speech took: xx.xx ms" (string), audio file name (string)
# Note: speaker and output_path are globals left over from the loop in the previous cell
def tts(content, use_ov, speed):
    start = time.perf_counter()
    model.tts_to_file(content, speaker_ids[speaker], output_path, speed=speed/100, use_ov=use_ov)

    if speech_enhance:
        print("Use speech enhance")
        process_audio(output_path, output_path)

    end = time.perf_counter()
    dur_time = (end - start) * 1000
    audio = output_path  # return the file just written rather than a hard-coded name
    result = f"MeloTTS text-to-speech took: {dur_time:.2f} ms"
    return result, audio

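# Build a simple input/output application interface
# fn: name of the interface function
# inputs: text (Textbox, label "Text"), use OpenVINO (Checkbox, default True), speech rate (Slider, label "Speed (%)", min 50, max 200, default 100)
# outputs: conversion time (Textbox, label "Conversion time"), generated speech (Audio, label "Output")
demo = gr.Interface(
    fn=tts,
    inputs=[gr.Textbox(label="Text", value="Enter text here"),
            gr.Checkbox(value=True, label="Use_OpenVINO"),
            gr.Slider(50, 200, value=100, label="Speed (%)")],
    outputs=[gr.Textbox(label="Conversion time"),
             gr.Audio(label="Output", type="filepath")],
)

# Launch and display the interface
demo.launch()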

Running on local URL:  http://127.0.0.1:7861
IMPORTANT: You are using gradio version 4.26.0, however version 4.44.1 is available, please upgrade.
--------

To create a public link, set `share=True` in `launch()`.
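
The log above shows port 7861 rather than 7860, which typically happens when 7860 is already taken, for example by an earlier run of the same cell. If a predictable port matters, one option is to probe for a free port first and pass it to `launch()` through its `server_port` parameter. The sketch below uses only the standard library; `find_free_port` is a hypothetical helper, and the port range it scans is an assumption.

```python
import socket

def find_free_port(start=7860, attempts=100, host="127.0.0.1"):
    """Return the first port in [start, start + attempts) that can be bound on host."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is busy, try the next one
            return port
    raise RuntimeError(f"No free port found in {start}-{start + attempts - 1}")

port = find_free_port()
print(f"Gradio could be launched with: demo.launch(server_port={port})")
```

Another process could still claim the port between the probe and the launch, so this is a convenience check rather than a guarantee.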


