# Digital Human Virtual Anchor
by Jack OmniXRI, 2024/12/12

This example was tested with **Intel OpenVINO 2024.4** and **Notebooks 2024.5**. It integrates the following components:  
* Text to Speech (TTS) - MeloTTS https://github.com/zhaohb/MeloTTS-OV/tree/speech-enhancement-and-npu  
* Automatic lip-sync video generation - Wav2Lip  https://github.com/openvinotoolkit/openvino_notebooks/tree/2024.5/notebooks/wav2lip  

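This example uses Gradio as its user interface. Gradio serves on http://127.0.0.1:7860/ (http://localhost:7860/) by default and moves to another free port when that one is taken, so check the URL printed by `demo.launch()`. The steps are:  
1. Enter some text in the text field, adjust the speech rate (50% to 200%, default 100%), and press the 「生成語音」 (Generate Speech) button. MeloTTS then produces an audio file, by default "ov_en_int8_ZH.wav".  
2. Press the 「載入樣本」 (Load Example) button to load the default video (data_video_sun_5s.mp4) together with the audio file just generated. You can also upload your own video and audio files, or record video and audio directly from a webcam, before synthesis.  
3. Press the 「生成影片」 (Generate Video) button to run Wav2Lip. When it finishes you get a playable video; click the play button at its lower left to view the result.  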
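
The default file name in step 1 follows from the naming rule used in the MeloTTS cell of this notebook. A minimal sketch of that rule, pulled out into a helper (`tts_output_name` is a hypothetical name introduced here for illustration, not part of MeloTTS):

```python
def tts_output_name(speaker: str, use_ov: bool = True, use_int8: bool = True) -> str:
    """Mirror the output-file naming used in the MeloTTS cell of this notebook."""
    if not use_ov:
        return "en_pth_{}.wav".format(speaker)      # PyTorch path
    if use_int8:
        return "ov_en_int8_{}.wav".format(speaker)  # OpenVINO with INT8
    return "en_ov_{}.wav".format(speaker)           # OpenVINO without INT8


print(tts_output_name("ZH"))
```

With the notebook's defaults (OpenVINO, INT8, speaker `ZH`) this gives `ov_en_int8_ZH.wav`, the file the interface loads afterwards.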

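The video should be longer than the audio. When the video is too short it automatically restarts from the beginning, which produces a visible jump in the result.  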
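
One way to check this before generating is to compare the two durations. The sketch below uses only the standard library; it assumes the audio is a PCM WAV file that the `wave` module can read, and that you already know the frame count and frame rate of the video. The helper names are hypothetical, and the modulo in `looped_frame_index` is an assumption about how the wrap-around behaves, not something taken from this notebook's code.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def video_covers_audio(frame_count: int, fps: float, audio_seconds: float) -> bool:
    """True when the video is at least as long as the audio."""
    return frame_count / fps >= audio_seconds


def looped_frame_index(i: int, frame_count: int) -> int:
    """Source frame for output step i when a short video wraps back to its start."""
    return i % frame_count
```

For example, a 125-frame clip at an assumed 25 fps lasts 5 seconds, so `video_covers_audio(125, 25, 6.0)` is false and output step 125 would fall back to frame 0.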

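Before running this example, read the installation steps and notes at https://github.com/OmniXRI/digital_human . To learn more about Gradio, see [【vMaker Edge AI專欄 #24】 如何使用 Gradio 快速搭建人工智慧應用圖形化人機界面](https://omnixri.blogspot.com/2024/12/vmaker-edge-ai-24-gradio.html) (vMaker Edge AI column #24: how to quickly build a graphical interface for AI applications with Gradio).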

## Prerequisites


In [1]:
import os
import sys
from pathlib import Path

from melo.api import TTS
import time
import gradio as gr

#print(os.path.abspath('.'))
sys.path.append('./wav2lip') # Add the path manually so Python can find the wav2lip modules

from notebook_utils import device_widget
from ov_inference import ov_inference



INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino


In [2]:
## MeloTTS Text-to-Speech: Settings and Functions

In [3]:
speed = 1.0  # speech rate
use_ov = True  # True: use OpenVINO, False: use PyTorch
use_int8 = True  # True: enable INT8 format
speech_enhance = True  # True: enable speech enhancement

tts_device = "CPU"  # TTS inference device: "CPU" or "GPU" (Intel GPU here)
bert_device = "CPU"  # BERT inference device: "CPU", "GPU", or "NPU"
lang = "ZH"  # language: EN for English, ZH for Chinese (mixed English, Simplified and Traditional Chinese all work)

# Test string for text-to-speech
if lang == "ZH":
    text = "我們討如何在 Intel 平台上轉換和優化 artificial intelligence 模型"
elif lang == "EN":
    text = "For Intel platforms, we explore the methods for converting and optimizing models."

# When speech enhancement is enabled, define the process_audio() function
if speech_enhance:
    from df.enhance import enhance, init_df, load_audio, save_audio
    import torchaudio

    # Enhance the input audio file and save the result to a new file
    def process_audio(input_file: str, output_file: str, new_sample_rate: int = 48000):
        """
        Load an audio file, enhance it using a DeepFilterNet, and save the result.

        Parameters:
        input_file (str): Path to the input audio file.
        output_file (str): Path to save the enhanced audio file.
        new_sample_rate (int): Desired sample rate for the output audio file (default is 48000 Hz).
        """

        model, df_state, _ = init_df()
        audio, sr = torchaudio.load(input_file)
        
        # Resample the WAV file to meet the requirements of DeepFilterNet
        resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=new_sample_rate)
        resampled_audio = resampler(audio)

        enhanced = enhance(model, df_state, resampled_audio)

        # Save the enhanced audio
        save_audio(output_file, enhanced, df_state.sr())

# Initialize TTS
model = TTS(language=lang, tts_device=tts_device, bert_device=bert_device, use_int8=use_int8)

# Get speaker information
speaker_ids = model.hps.data.spk2id
speakers = list(speaker_ids.keys())

# When OpenVINO is enabled, check whether this language has already been converted;
# if not, convert it. Results are stored under the tts_ov_<lang> directory
if use_ov:
    ov_path = f"tts_ov_{lang}"

    if not Path(ov_path).exists():
        # Convert the original model to OpenVINO IR (bin + xml) format
        model.tts_convert_to_ov(ov_path, language=lang)

    # Initialize the model
    model.ov_model_init(ov_path, language=lang)

if not use_ov:  # PyTorch path
    for speaker in speakers:
        output_path = 'en_pth_{}.wav'.format(str(speaker))
        start = time.perf_counter()
        model.tts_to_file(text, speaker_ids[speaker], output_path, speed=speed*0.75, use_ov=use_ov)
        end = time.perf_counter()
else:  # OpenVINO path
    for speaker in speakers:
        output_path = 'ov_en_int8_{}.wav'.format(speaker) if use_int8 else 'en_ov_{}.wav'.format(speaker)
        start = time.perf_counter()
        model.tts_to_file(text, speaker_ids[speaker], output_path, speed=speed, use_ov=use_ov)

        if speech_enhance:
            print("Use speech enhance")
            process_audio(output_path, output_path)

        end = time.perf_counter()

# start and end are overwritten on every pass, so this reports the last speaker only
dur_time = (end - start) * 1000
# The message reads "MeloTTS text-to-speech took: xx.xx ms"
print(f"MeloTTS 文字轉語音共花費: {dur_time:.2f} ms")

  from torchaudio.backend.common import AudioMetaData


init tts_ov_ZH\bert_int8_ZH.xml
ov_path : tts_ov_ZH\tts_int8_ZH.xml
 > Text split to sentences.
我們討如何在 Intel 平台上轉換和優化 artificial intelligence 模型


  0%|                                                                                    | 0/1 [00:00<?, ?it/s]Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\omnixri\AppData\Local\Temp\jieba.cache
Loading model cost 0.775 seconds.
Prefix dict has been built successfully.
100%|████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.86s/it]

Use speech enhance
2024-12-12 21:46:11 | INFO     | DF | Running on torch 2.1.0+cpu
2024-12-12 21:46:11 | INFO     | DF | Running on host omnixri
2024-12-12 21:46:11 | INFO     | DF | Loading model settings of DeepFilterNet3
2024-12-12 21:46:11 | INFO     | DF | Using DeepFilterNet3 model at C:\Users\omnixri\AppData\Local\DeepFilterNet\DeepFilterNet\Cache\DeepFilterNet3
2024-12-12 21:46:11 | INFO     | DF | Initializing model `deepfilternet3`
2024-12-12 21:46:11 | INFO     | DF | Found checkpoint C:\Users\omnixri\AppData\Local\DeepFilterNet\DeepFilterNet\Cache\DeepFilterNet3\checkpoints\model_120.ckpt.best with epoch 120
2024-12-12 21:46:11 | INFO     | DF | Running on device cpu
2024-12-12 21:46:11 | INFO     | DF | Model




MeloTTS 文字轉語音共花費: 2491.72 ms
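
The figure above comes from `start` and `end` values that are reset inside the speaker loop, so it covers only the last speaker processed. If a language exposes several speakers and you want one figure per speaker, a small timing helper is one option. This is a sketch under that assumption; `stopwatch` is a hypothetical helper, the speaker names are placeholders, and `time.sleep` stands in for the `model.tts_to_file(...)` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label: str, totals: dict):
    """Add the elapsed wall-clock time of the with-block, in ms, to totals[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[label] = totals.get(label, 0.0) + (time.perf_counter() - start) * 1000


totals = {}
for speaker in ["A", "B"]:        # placeholder speaker names
    with stopwatch(speaker, totals):
        time.sleep(0.01)          # stand-in for model.tts_to_file(...)

for speaker, ms in totals.items():
    print(f"{speaker}: {ms:.2f} ms")
```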


## Audio-Driven Lip-Sync Generation: Settings and Functions

In [4]:
from notebook_utils import device_widget

# Create a dropdown to choose the inference device (CPU, GPU, NPU, AUTO); GPU here means the Intel integrated GPU
device = device_widget() 
device

Dropdown(description='Device:', index=3, options=('CPU', 'GPU', 'NPU', 'AUTO'), value='AUTO')

In [5]:
import os
import sys
from pathlib import Path

#print(os.path.abspath('.'))
sys.path.append('./wav2lip') # Add the path manually so Python can find the wav2lip modules

OV_FACE_DETECTION_MODEL_PATH = Path("models/face_detection.xml") # Path to the face detection model
OV_WAV2LIP_MODEL_PATH = Path("models/wav2lip.xml") # Path to the wav2lip model

In [6]:
from ov_inference import ov_inference

# Make sure the output directory exists; create results/ if it does not
os.makedirs("results", exist_ok=True)

# Run inference with OpenVINO (run it at least once)
ov_inference(
    "data_video_sun_5s.mp4", # source video file
    "data_audio_sun_5s.wav", # audio file
    face_detection_path=OV_FACE_DETECTION_MODEL_PATH, # path to the face detection model
    wav2lip_path=OV_WAV2LIP_MODEL_PATH, # path to the wav2lip model
    inference_device=device.value, # inference device
    outfile="results/result_voice.mp4", # output file name
)


Reading video frames...
Number of frames available for inference: 125


  return librosa.filters.mel(hp.sample_rate, hp.n_fft, n_mels=hp.num_mels,


(80, 405)
Length of mel chunks: 123


  0%|                                                                                    | 0/1 [00:00<?, ?it/s]

face_detect_ov images[0].shape:  (768, 576, 3)



  0%|                                                                                    | 0/8 [00:00<?, ?it/s]
 12%|█████████▌                                                                  | 1/8 [00:19<02:14, 19.27s/it]
 25%|███████████████████                                                         | 2/8 [00:37<01:53, 18.91s/it]
 38%|████████████████████████████▌                                               | 3/8 [00:56<01:34, 18.81s/it]
 50%|██████████████████████████████████████                                      | 4/8 [01:15<01:15, 18.85s/it]
 62%|███████████████████████████████████████████████▌                            | 5/8 [01:34<00:56, 18.77s/it]
 75%|█████████████████████████████████████████████████████████                   | 6/8 [01:52<00:37, 18.67s/it]
 88%|██████████████████████████████████████████████████████████████████▌         | 7/8 [02:11<00:18, 18.61s/it]
100%|██████████████████████████████████████████████████████████████████████████

Model loaded


100%|███████████████████████████████████████████████████████████████████████████| 1/1 [02:26<00:00, 146.68s/it]


'results/result_voice.mp4'
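
The log above reports a mel spectrogram of shape (80, 405) and 123 mel chunks. The sketch below shows one way such a count can arise. It assumes the chunking scheme of the upstream Wav2Lip inference script (80 mel frames per second of audio, a sliding window of 16 mel frames, one window per video frame) and a 25 fps video. Neither value is confirmed by this notebook, and `count_mel_chunks` is a hypothetical helper name.

```python
def count_mel_chunks(mel_frames: int, fps: float, mel_step_size: int = 16,
                     mel_frames_per_second: float = 80.0) -> int:
    """Count the sliding windows cut from a mel spectrogram, one per video frame."""
    mel_idx_multiplier = mel_frames_per_second / fps  # mel frames per video frame
    count = 0
    i = 0
    while True:
        start_idx = int(i * mel_idx_multiplier)
        if start_idx + mel_step_size > mel_frames:
            count += 1  # final window, aligned to the end of the spectrogram
            break
        count += 1
        i += 1
    return count


print(count_mel_chunks(405, 25))
```

Under those assumptions the function returns 123 for 405 mel frames, which matches the logged chunk count.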

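## Building the Interactive Interface with Gradio

Besides rendering the interface inline in the notebook cell output, Gradio also serves it as a web page, by default at http://127.0.0.1:7860 (http://localhost:7860). The address actually in use is printed when the cell runs.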
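
The run logged in this notebook ended up on port 7863, which is what happens when lower ports are already occupied, for example by earlier launches in the same session. The sketch below illustrates the idea of scanning upward for a free port with the standard library. It is not Gradio's own implementation, and `first_free_port` is a hypothetical helper.

```python
import socket


def first_free_port(start: int = 7860, tries: int = 100, host: str = "127.0.0.1") -> int:
    """Return the first port at or above `start` that can be bound on `host`."""
    for port in range(start, min(start + tries, 65536)):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
            except OSError:
                continue  # port already in use, try the next one
            return port
    raise RuntimeError(f"no free port found in {tries} tries starting at {start}")


print(first_free_port())
```

To pin the address instead of letting it drift, Gradio's `launch()` accepts a `server_port` argument.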

In [None]:
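# import gradio as gr

# Text-to-speech handler tts()
# Inputs: content (str), use_ov (bool), speed (number, the slider value in percent)
# Outputs: the status string "MeloTTS 文字轉語音共花費: xx.xx ms" (MeloTTS text-to-speech took xx.xx ms) and the audio file name (str)
# Note: speaker and output_path are globals left over from the last loop pass in the MeloTTS cell above
def tts(content, use_ov, speed):
    start = time.perf_counter()
    model.tts_to_file(content, speaker_ids[speaker], output_path, speed=speed/100, use_ov=use_ov)

    if speech_enhance:
        print("Use speech enhance")
        process_audio(output_path, output_path)

    end = time.perf_counter()
    dur_time = (end - start) * 1000
    audio = output_path  # the file just written; "ov_en_int8_ZH.wav" with the default settings
    result = f"MeloTTS 文字轉語音共花費: {dur_time:.2f} ms"
    return result, audio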
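
The handler divides the slider value by 100 to get the speed factor that `tts_to_file` expects. If the value might arrive from somewhere other than the slider, it can be clamped to the 50 to 200 range first. A minimal sketch, where `percent_to_speed` is a hypothetical helper and not part of MeloTTS:

```python
def percent_to_speed(percent: float, lo: float = 50.0, hi: float = 200.0) -> float:
    """Clamp a percentage to [lo, hi] and convert it to a speed factor."""
    clamped = max(lo, min(hi, percent))
    return clamped / 100.0


for value in (100, 50, 250):
    print(value, "->", percent_to_speed(value))
```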

In [None]:
def load_example():
    video = "data_video_sun_5s.mp4"
    audio = "ov_en_int8_ZH.wav"
    return video, audio

In [7]:
# Build the custom Gradio interface
with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            # MeloTTS controls
            txb_content = gr.Textbox(label="文字", value = "請輸入文字內容")
            ckb_use_ov = gr.Checkbox(value=True, label="Use_OpenVINO")
            sld_speed = gr.Slider(50, 200, value=100, label="語速(%)")
            txb_cvt_time = gr.Textbox(label="轉換時間")
            file_audio = gr.Audio(label="輸出結果", type="filepath")

            # Action for the 「生成語音」 (Generate Speech) button
            btn_tts = gr.Button("生成語音")
            btn_tts.click(tts,
                         inputs=[txb_content, ckb_use_ov, sld_speed],
                         outputs=[txb_cvt_time, file_audio])
        with gr.Column():
            # Wav2Lip controls
            face_video = gr.Video(label="人臉影片")
            text_audio = gr.Audio(label="聲音檔案", type="filepath")

            # Action for the 「載入樣本」 (Load Example) button
            btn_load = gr.Button("載入樣本")
            btn_load.click(load_example,
                          outputs=[face_video, text_audio])
        with gr.Column():
            output_video = gr.Video(label="結果影片")

            # Action for the 「生成影片」 (Generate Video) button
            btn_wav2lip = gr.Button("生成影片")
            btn_wav2lip.click(ov_inference,
                              inputs=[face_video, text_audio],
                              outputs=output_video)

demo.launch()

Running on local URL:  http://127.0.0.1:7863
IMPORTANT: You are using gradio version 4.26.0, however version 4.44.1 is available, please upgrade.
--------

To create a public link, set `share=True` in `launch()`.
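
The 「生成影片」 click handler passes only the video and audio paths to `ov_inference`, while the earlier standalone call also supplied the model paths, device, and output file. Whether `ov_inference` has defaults for those is not shown in this notebook. If it does not, one option is a thin wrapper that fills them in. The sketch below uses a stub in place of the real function so that it runs on its own; the positional parameter names and `run_wav2lip` are assumptions, and the keyword names are copied from the earlier call.

```python
from pathlib import Path


def ov_inference_stub(face_path, audio_path, face_detection_path, wav2lip_path,
                      inference_device, outfile):
    """Hypothetical stand-in for ov_inference; it just returns the output path."""
    return outfile


# Fixed settings taken from the standalone call earlier in this notebook
WAV2LIP_KWARGS = dict(
    face_detection_path=Path("models/face_detection.xml"),
    wav2lip_path=Path("models/wav2lip.xml"),
    inference_device="AUTO",
    outfile="results/result_voice.mp4",
)


def run_wav2lip(face_path, audio_path):
    """Two-argument handler suitable for a click event with two inputs."""
    return ov_inference_stub(face_path, audio_path, **WAV2LIP_KWARGS)


print(run_wav2lip("data_video_sun_5s.mp4", "ov_en_int8_ZH.wav"))
```

In the notebook, the stub would be replaced by the real `ov_inference` and `run_wav2lip` passed to `btn_wav2lip.click(...)`.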


