In [1]:
# mount Google Drive
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [44]:
%pip install numpy==1.23.5 #restart runtime after upgrading to use this version

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
The folder you are executing pip from can no longer be found.


In [None]:
%pip install pymcd
%pip install pesq

In [3]:
%cd /content/drive/MyDrive/Colab_Notebooks/summerschool2023

/content/drive/MyDrive/Colab_Notebooks/summerschool2023


In [4]:
import os
import numpy as np
import pandas as pd
import math
import librosa
import csv
from scipy.io import wavfile
from scipy.signal import resample
from scipy.stats  import pearsonr
from pesq import pesq
from tqdm import tqdm

In [5]:
#Define the header rows for result CSV file
GT_dir = "data/raw"
resyn_dir = "data/resyn"
degraded_dir = "data/degraded"
data_dirs = [GT_dir, resyn_dir, degraded_dir]
csv_dir_0 = "result/summary_"
result_csv_paths = [csv_dir_0+name for name in ["resyn.csv", "degraded.csv"]]
#result_csv_paths[0] for resynthesized audios
#result_csv_paths[1] for degraded audios


# Initialize csvfile, only be run once!
for (data_dir, result_csv_path) in zip(data_dirs,result_csv_paths) :
  audio_files = [f for f in os.listdir(data_dir) if f.endswith('.wav')]
  audio_file_names = [os.path.splitext(f)[0] for f in audio_files]
  audio_file_names.sort()
  init = {'audio_file_name':audio_file_names}

  # Initialize a new csv file with dataframe
  df = pd.DataFrame(init)
  df.to_csv(result_csv_path)

# Basic

## Mel cepstral distortion (MCD)


Mel Cepstral Distortion (MCD) is measure of the difference between two sets of Mel Frequency Cepstral Coefficients(MFCCs). MFCCs are used to represent the spectral envolope of an audio signal and are calculated from Mel-spectrum. 

MCD is a popular objective evluation metric of the similarity between two audios. A lower MCD indicates greater similarity. And a higher MCD indictes greater differences between them[2].

The pymcd package provides scripts to compute a variety of forms of MCD score:

* MCD (plain): the conventional MCD metric, which requires the lengths of two input speeches to be the same. Otherwise, it would simply extend the shorted speech to the length of longer one by padding zero for the time-domain waveform.
* MCD-DTW: an improved MCD metric that adopts the Dynamic Time Warping (DTW) algorithm to find the minimum MCD between two speeches.
* MCD-DTW-SL: MCD-DTW weighted by Speech Length (SL) evaluates both the length and the quality of alignment between two speeches. Based on the MCD-DTW metric, the MCD-DTW-SL incorporates an additional coefficient w.r.t. the difference between the lengths of two speeches[3].

In [7]:
from pymcd.mcd import Calculate_MCD

In [8]:
### Evaluate one pair

# three different modes "plain", "dtw" and "dtw_sl" for the above three MCD metrics 
MCD_mode = "dtw"
mcd_toolbox = Calculate_MCD(MCD_mode=MCD_mode)

# two inputs w.r.t. reference (ground-truth) and degraded/resynthesized speeches, respectively
audio_ref_path = "data/raw/neutral_sent001_long.wav"
audio_degraded_path = "data/degraded/neutral_sent001_long.wav"
mcd_value = mcd_toolbox.calculate_mcd(audio_ref_path, audio_degraded_path)

print(f"mcd({MCD_mode}) = {mcd_value:.3f}")

mcd(dtw) = 4.137


In [9]:
### Evaluate all pairs

# three different modes "plain", "dtw" and "dtw_sl" for the above three MCD metrics 
MCD_mode = "dtw"
mcd_toolbox = Calculate_MCD(MCD_mode=MCD_mode)

# directories of audios from different audio folders
GT_dir = "data/raw"
resyn_dir = "data/resyn"
degraded_dir = "data/degraded"

# Evaluate all audios in the folders
audio_names = []
for subdir, dirs, files in os.walk(GT_dir):
  for file in files:
      if (".wav" in file):
          audio_names.append(file)

audio_names.sort()
mcd_values_resyn = []
mcd_values_degraded = []
for i in tqdm(range(len(audio_names))):
  audio_name = audio_names[i]
  GT_audio_path = os.path.join(GT_dir,audio_name)
  resyn_audio_path = os.path.join(resyn_dir,audio_name)
  degraded_audio_path = os.path.join(degraded_dir,audio_name)
  mcd_values_resyn.append(mcd_toolbox.calculate_mcd(GT_audio_path, resyn_audio_path))
  mcd_values_degraded.append(mcd_toolbox.calculate_mcd(GT_audio_path, degraded_audio_path))

#Write result in csv file
df = pd.read_csv(result_csv_paths[0])
df['MCD'] = mcd_values_resyn
df.to_csv(result_csv_paths[0],index=False)

df = pd.read_csv(result_csv_paths[1])
df['MCD'] = mcd_values_degraded
df.to_csv(result_csv_paths[1],index=False)

# Calculate the mean score of MCD
mcd_resyn = np.mean(mcd_values_resyn)
mcd_degraded = np.mean(mcd_values_degraded)

print("\n -------------------------------------")
print(f"mcd({MCD_mode}) of resynthesized audios = {mcd_resyn:.3f}")
print(f"mcd({MCD_mode}) of degraded audios = {mcd_degraded:.3f}")

100%|██████████| 10/10 [00:34<00:00,  3.45s/it]


 -------------------------------------
mcd(dtw) of resynthesized audios = 2.969
mcd(dtw) of degraded audios = 3.618





## F0-PCC

$F_{0}-PCC$ is the Pearson Correlation Coefficient (PCC) between F0 vecotrs from two audios. The fundamental frequency(F0) of a sound wave represents the perceived pitch of the sound, and can be used to distinguish different speech sounds. 

Pearson correlation coefficient (PCC) is a measure of the linear correlation between two variables. It is commonly used in statistics to measure the strength and direction of the relationship between two sets of data. It ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no correlation and 1 indicates perfect positive correlation[9]. 

In [10]:
def calculate_F0PCC(audio_ref_path, audio_synthesized_path):
  audio_ref, rate= librosa.load(audio_ref_path)
  audio_synthesized, rate= librosa.load(audio_synthesized_path)

  #audios have different lenghts
  if len(audio_ref)-len(audio_synthesized)>=0:
    audio_synthesized = np.pad(audio_synthesized, (0, len(audio_ref)-len(audio_synthesized)), 'constant', constant_values=(0, 0)) 
  else:
    audio_synthesized = audio_synthesized[:len(audio_ref)]

  f0_ref, voiced_flag_ref, voiced_probs_ref = librosa.pyin(audio_ref, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))
  f0_synthesized, voiced_flag_synthesized, voiced_probs_synthesized = librosa.pyin(audio_synthesized, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))
  f0_ref = np.nan_to_num((f0_ref))
  f0_synthesized = np.nan_to_num(f0_synthesized)
  
  f0_pcc = pearsonr(f0_ref, f0_synthesized)[0]

  return f0_pcc

In [11]:
### Evaluate one pair

# two inputs w.r.t. reference (ground-truth) and degraded/resynthesized speeches, respectively
audio_ref_path = "data/raw/neutral_sent001_short.wav"
audio_degraded_path = "data/resyn/neutral_sent001_short.wav"

f0_pcc = calculate_F0PCC(audio_ref_path, audio_degraded_path)

print(f"F0-PCC = {f0_pcc:.3f}")

F0-PCC = 0.703


In [12]:
### Evaluate all pairs

# directories of audios from different audio folders
GT_dir = "data/raw"
resyn_dir = "data/resyn"
degraded_dir = "data/degraded"

# Evaluate all audios in the folders
audio_names = []
for subdir, dirs, files in os.walk(GT_dir):
  for file in files:
      if (".wav" in file):
          audio_names.append(file)

audio_names.sort()
f0pcc_values_resyn = []
f0pcc_values_degraded = []
for i in tqdm(range(len(audio_names))):
  audio_name = audio_names[i]
  GT_audio_path = os.path.join(GT_dir,audio_name)
  resyn_audio_path = os.path.join(resyn_dir,audio_name)
  degraded_audio_path = os.path.join(degraded_dir,audio_name)
  f0pcc_values_resyn.append(calculate_F0PCC(GT_audio_path, resyn_audio_path))
  f0pcc_values_degraded.append(calculate_F0PCC(GT_audio_path, degraded_audio_path))

#Write result in csv file
df = pd.read_csv(result_csv_paths[0])
df['F0_PCC'] = f0pcc_values_resyn
df.to_csv(result_csv_paths[0],index=False)

df = pd.read_csv(result_csv_paths[1])
df['F0_PCC'] = f0pcc_values_degraded
df.to_csv(result_csv_paths[1],index=False)

# Calculate the mean score of F0_PCC
f0pcc_resyn = np.mean(f0pcc_values_resyn)
f0pcc_degraded = np.mean(f0pcc_values_degraded)

print("\n -------------------------------------")
print(f"F0_PCC of resynthesized audios = {f0pcc_resyn:.3f}")
print(f"F0 PCC of degraded audios = {f0pcc_degraded:.3f}")

100%|██████████| 10/10 [01:56<00:00, 11.63s/it]


 -------------------------------------
F0_PCC of resynthesized audios = nan
F0 PCC of degraded audios = nan





## Perceptual Evaluation of Speech Quality (PESQ)

PESQ is an objective measure of audio quality that compares an audio output to the original voice file, taking into account factors such as audio sharpness, call volume, background noise, latency, clipping, and audio interference. PESQ returns a score between -0.5 and 4.5, with higher scores indicating better quality.

To evaluate audios using PESQ, the original and degraded signals are level equalized, filtered, time-aligned, and processed through an auditory transform to obtain the loudness spectra. The difference in loudness between the original and degraded signals is then computed and averaged over time and frequency to produce a prediction of subjective quality rating[8].

In [13]:
def calculate_PESQ(audio_ref_path, audio_synthesized_path):
  fs,audio_ref = wavfile.read(audio_ref_path)
  fs,audio_synthesized =  wavfile.read(audio_synthesized_path)

  #audios have different lenghts
  if len(audio_ref)-len(audio_synthesized)>=0:
    audio_synthesized = np.pad(audio_synthesized, (0, len(audio_ref)-len(audio_synthesized)), 'constant', constant_values=(0, 0)) 
  else:
    audio_synthesized = audio_synthesized[:len(audio_ref)]

  num_samples_ref = round(len(audio_ref)/fs*16000) #downsample to 16kH
  num_samples_synthesized = round(len(audio_synthesized)/fs*16000) #downsample to 16kH
  ref_speech_16k = resample(audio_synthesized, num_samples_ref)
  synthesized_speech_16k = resample(audio_synthesized, num_samples_synthesized)

  PESQ = pesq(16000, audio_ref, audio_synthesized)

  return PESQ

In [14]:
### Evaluate single audio

# two inputs w.r.t. reference (ground-truth) and degraded/resynthesized speeches, respectively
audio_ref_path = "data/raw/neutral_sent001_long.wav"
audio_degraded_path = "data/degraded/neutral_sent001_long.wav"

PESQ = calculate_PESQ(audio_ref_path, audio_degraded_path)

print(f"PESQ = {PESQ:.3f}")

PESQ = 1.136


In [15]:
### Evaluate all pairs

# directories of audios from different models
GT_dir = "data/raw"
resyn_dir = "data/resyn"
degraded_dir = "data/degraded"

# Evaluate all audios in the folders
audio_names = []
for subdir, dirs, files in os.walk(GT_dir):
  for file in files:
      if (".wav" in file):
          audio_names.append(file)

audio_names.sort()
PESQ_values_resyn = []
PESQ_values_degraded = []
for i in tqdm(range(len(audio_names))):
  audio_name = audio_names[i]
  GT_audio_path = os.path.join(GT_dir,audio_name)
  resyn_audio_path = os.path.join(resyn_dir,audio_name)
  degraded_audio_path = os.path.join(degraded_dir,audio_name)
  PESQ_values_resyn.append(calculate_PESQ(GT_audio_path, resyn_audio_path))
  PESQ_values_degraded.append(calculate_PESQ(GT_audio_path, degraded_audio_path))

#Write result in csv file
df = pd.read_csv(result_csv_paths[0])
df['PESQ'] = PESQ_values_resyn
df.to_csv(result_csv_paths[0],index=False)

df = pd.read_csv(result_csv_paths[1])
df['PESQ'] = PESQ_values_degraded
df.to_csv(result_csv_paths[1],index=False)

# Calculate the mean score of PESQ
PESQ_resyn = np.mean(PESQ_values_resyn)
PESQ_degraded = np.mean(PESQ_values_degraded)

print("\n -------------------------------------")
print(f"PESQ of resyn = {PESQ_resyn:.3f}")
print(f"PESQ of degraded = {PESQ_degraded:.3f}")

100%|██████████| 10/10 [00:05<00:00,  1.75it/s]


 -------------------------------------
PESQ of resyn = 2.491
PESQ of degraded = 1.233





# Advanced

## MOSNet

MOSNet is a deep learning model that uses convolutional and recurrent neural network models to predict Mean Opinion Scores (MOS) for audio signals. The model was tested on a large-scale listening test results of the Voice Conversion Challenge (VCC) 2018,  it learns to predict the MOS score of an audio signal based on its features. The features used by MOSNet include time-domain and frequency-domain representations of the audio signal, as well as higher-level features such as loudness, sharpness, and roughness[5].

In [16]:
%cd MOSNet

%run custom_test.py --rootdir "../data/resyn"
df = pd.read_csv("../"+result_csv_paths[0])
df['MOSNet'] = MOSNet_results
df.to_csv("../"+result_csv_paths[0],index=False)

%run custom_test.py --rootdir "../data/degraded"
df = pd.read_csv("../"+result_csv_paths[1])
df['MOSNet'] = MOSNet_results
df.to_csv("../"+result_csv_paths[1],index=False)

## go back to the directory of summerschool2023
%cd - 

/content/drive/MyDrive/Colab_Notebooks/summerschool2023/MOSNet
Loading model weights
CNN_BLSTM init
Start evaluating 10 waveforms...


100%|██████████| 10/10 [00:06<00:00,  1.66it/s]


Average: 3.0965
Loading model weights
CNN_BLSTM init
Start evaluating 10 waveforms...


100%|██████████| 10/10 [00:12<00:00,  1.29s/it]

Average: 3.3382000000000005
/content/drive/MyDrive/Colab_Notebooks/summerschool2023





## ViSQOL (Virtual Speech Quality Objective Listener) 

ViSQOL (Virtual Speech Quality Objective Listener) is an objective, full-reference metric for perceived audio quality. It uses a spectro-temporal measure of similarity between a reference and a test speech signal to produce a MOS-LQO (Mean Opinion Score - Listening Quality Objective) score. MOS-LQO scores range from 1 (the worst) to 5 (the best)[10].

In [None]:
# Since there is no wrapped python library, we have to install the python api from the first step.
# requires a very long running time ~45min(build with bazel) + 36min(import as python package)

In [42]:
!git clone https://github.com/google/visqol.git

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
fatal: could not create work tree dir 'visqol': No such file or directory


In [None]:
%cd visqol

[Errno 2] No such file or directory: 'visqol'
/content/drive/MyDrive/Colab_Notebooks/summerschool2023/visqol


In [None]:
!npm install -g @bazel/bazelisk ##install bazel 

[K[?25h/tools/node/bin/bazelisk -> /tools/node/lib/node_modules/@bazel/bazelisk/bazelisk.js
/tools/node/bin/bazel -> /tools/node/lib/node_modules/@bazel/bazelisk/bazelisk.js
+ @bazel/bazelisk@0.0.0-PLACEHOLDER
updated 1 package in 0.575s


In [None]:
!touch WORKSPACE

In [None]:
!bazel build :visqol -c opt  ### take about 45min

In [None]:
!pip install . #install visqol in python, takes about 36min

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing /content/drive/MyDrive/Colab_Notebooks/summerschool2023/visqol
  [1;31merror[0m: [1msubprocess-exited-with-error[0m
  
  [31m×[0m [32mpython setup.py egg_info[0m did not run successfully.
  [31m│[0m exit code: [1;36m1[0m
  [31m╰─>[0m See above for output.
  
  [1;35mnote[0m: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... [?25l[?25herror
[1;31merror[0m: [1mmetadata-generation-failed[0m

[31m×[0m Encountered error while generating package metadata.
[31m╰─>[0m See above for output.

[1;35mnote[0m: This is an issue with the package mentioned above, not pip.
[1;36mhint[0m: See above for details.


In [None]:
import os

from visqol import visqol_lib_py
from visqol.pb2 import visqol_config_pb2
from visqol.pb2 import similarity_result_pb2

config = visqol_config_pb2.VisqolConfig()

mode = "speech"
if mode == "audio":
    config.audio.sample_rate = 48000
    config.options.use_speech_scoring = False
    svr_model_path = "libsvm_nu_svr_model.txt"
elif mode == "speech":
    config.audio.sample_rate = 16000
    config.options.use_speech_scoring = True
    svr_model_path = "lattice_tcditugenmeetpackhref_ls2_nl60_lr12_bs2048_learn.005_ep2400_train1_7_raw.tflite"
else:
    raise ValueError(f"Unrecognized mode: {mode}")

config.options.svr_model_path = os.path.join(
    os.path.dirname(visqol_lib_py.__file__), "model", svr_model_path)

api = visqol_lib_py.VisqolApi()

api.Create(config)

ModuleNotFoundError: ignored

In [None]:
### Evaluate one pair

audio_ref_path = "data/raw/neutral_sent001_long.wav"
audio_degraded_path = "data/degraded/neutral_sent001_long.wav"

rate_ref, audio_ref= wavfile.read(audio_ref_path)
rate_degraded, audio_degraded= wavfile.read(audio_degraded_path)
#!!! audios need to be changed to float64

similarity_result = api.Measure(audio_ref.astype(np.float64), audio_degraded.astype(np.float64))

print(similarity_result.moslqo)

2.9846615563441414


In [None]:
%cd - 
## make sure you are under the folder summerschool2023

/content/drive/MyDrive/Colab_Notebooks/summerschool2023


In [None]:
### Evaluate all pairs

# directories of audios from different models
GT_dir = "data/ground_truth"
resyn_dir = "data/resyn"
degraded_dir = "data/degraded"

# Evaluate all audios in the folders
audio_names = []
for subdir, dirs, files in os.walk(GT_dir):
  for file in files:
      if (".wav" in file):
          audio_names.append(file)

audio_names.sort()
visqol_values_resyn = []
visqol_values_degraded = []
for i in tqdm(range(len(audio_names))):
  audio_name = audio_names[i]
  GT_audio_path = os.path.join(GT_dir,audio_name)
  resyn_audio_path = os.path.join(resyn_dir,audio_name)
  degraded_audio_path = os.path.join(degraded_dir,audio_name)

  _, GT_audio= wavfile.read(GT_audio_path)
  _, resyn_audio= wavfile.read(resyn_audio_path)
  _, degraded_audio= wavfile.read(degraded_audio_path)

  visqol_values_resyn.append(api.Measure(GT_audio.astype(np.float64), resyn_audio.astype(np.float64)).moslqo)
  visqol_values_degraded.append(api.Measure(GT_audio.astype(np.float64), degraded_audio.astype(np.float64)).moslqo)

#Write result in csv file
df = pd.read_csv(result_csv_paths[0])
df['ViSQOL'] = visqol_values_resyn
df.to_csv(result_csv_paths[0],index=False)

df = pd.read_csv(result_csv_paths[1])
df['ViSQOL'] = visqol_values_degraded
df.to_csv(result_csv_paths[1],index=False)

# Calculate the mean score of ViSQOL
visqol_resyn = np.mean(visqol_values_resyn)
visqol_degraded = np.mean(visqol_values_degraded)

print("\n -------------------------------------")
print(f"ViSQOL of resyn = {visqol_resyn:.3f}")
print(f"ViSQOL of degraded = {visqol_degraded:.3f}")

100%|██████████| 5/5 [00:35<00:00,  7.07s/it]


 -------------------------------------
ViSQOL of Tacotron2 = 2.843
ViSQOL of FastPitch = 2.961





# Others

## Voicing Decision Error (VDE)


Voicing Decision Error (VDE) is a measure of the accuracy of voicing detection in speech processing. It measures the error in determining whether a speech segment is voiced or unvoiced and is commonly used in speech recognition and speaker verification applications. VDE can be calculated by comparing the voicing decisions made by the algorithm to the ground truth voicing information and is usually expressed as a percentage. Lower VDE values indicate better performance of the voicing detection algorithm[4].

$ VDE = \frac{N_{V\rightarrow U}+N_{U\rightarrow V}}{N}× 100\%$

In [17]:
def calculate_vde(audio_ref_path, audio_synthesized_path):
  audio_ref, rate= librosa.load(audio_ref_path)
  audio_synthesized, rate= librosa.load(audio_synthesized_path)

  #audios have different lenghts
  if len(audio_ref)-len(audio_synthesized)>=0:
    audio_synthesized = np.pad(audio_synthesized, (0, len(audio_ref)-len(audio_synthesized)), 'constant', constant_values=(0, 0)) 
  else:
    audio_synthesized = audio_synthesized[:len(audio_ref)]

  f0_ref, voiced_flag_ref, voiced_probs_ref = librosa.pyin(audio_ref, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))
  f0_synthesized, voiced_flag_synthesized, voiced_probs_synthesized = librosa.pyin(audio_synthesized, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))

  num_flag_error = 0
  num_flag_total = len(voiced_flag_ref)
  for i in range(num_flag_total):
    if voiced_flag_ref[i] != voiced_flag_synthesized[i]:
      num_flag_error+=1
      
  vde = num_flag_error/num_flag_total

  return vde

In [18]:
### Evaluate one pair

# two inputs w.r.t. reference (ground-truth) and degraded/resynthesized speeches, respectively
audio_ref_path = "data/raw/neutral_sent001_long.wav"
audio_degraded_path = "data/degraded/neutral_sent001_long.wav"

vde = calculate_vde(audio_ref_path, audio_degraded_path)

print(f"VDE = {vde:.3f}")

VDE = 0.219


In [24]:
### Evaluate all pairs

# directories of audios from different models
GT_dir = "data/raw"
resyn_dir = "data/resyn"
degraded_dir = "data/degraded"

# Evaluate all audios in the folders
audio_names = []
for subdir, dirs, files in os.walk(GT_dir):
  for file in files:
      if (".wav" in file):
          audio_names.append(file)

audio_names.sort()
vde_values_resyn = []
vde_values_degraded = []
for i in tqdm(range(len(audio_names))):
  audio_name = audio_names[i]
  GT_audio_path = os.path.join(GT_dir,audio_name)
  resyn_audio_path = os.path.join(resyn_dir,audio_name)
  degraded_audio_path = os.path.join(degraded_dir,audio_name)
  vde_values_resyn.append(calculate_vde(GT_audio_path, resyn_audio_path))
  vde_values_degraded.append(calculate_vde(GT_audio_path, degraded_audio_path))

#Write result in csv file
df = pd.read_csv(result_csv_paths[0])
df['VDE'] = vde_values_resyn
df.to_csv(result_csv_paths[0],index=False)

df = pd.read_csv(result_csv_paths[1])
df['VDE'] = vde_values_degraded
df.to_csv(result_csv_paths[1],index=False)

# Calculate the mean score of VDE
vde_resyn = np.mean(vde_values_resyn)
vde_degraded = np.mean(vde_values_degraded)

print("\n -------------------------------------")
print(f"VDE of resyn = {vde_resyn:.3f}")
print(f"VDE of degraded = {vde_degraded:.3f}")

100%|██████████| 10/10 [01:52<00:00, 11.21s/it]


 -------------------------------------
VDE of resyn = 0.097
VDE of degraded = 0.136





## Gross Pitch Error (GPE) 


Gross Pitch Error (GPE) is a measure of the accuracy of pitch estimation in speech processing. It is defined as the percentage of frames or segments in which the estimated fundamental frequency deviates from the true fundamental frequency by more than a predefined threshold. A high GPE score indicates that the pitch estimation algorithm is making many errors in estimating the fundamental frequency of the speech signal. GPE is commonly used to evaluate the performance of speech analysis and synthesis systems[4].


$GPE=\frac{N_{F 0 E}}{N_{V V}} \times 100 \%$

$\left|\frac{F 0_{i, \text { estimated }}}{F 0_{i, \text { reference }}}-1\right|>\delta \%$

In [25]:
def calculate_gpe(audio_ref_path, audio_synthesized_path):
  audio_ref, rate= librosa.load(audio_ref_path)
  audio_synthesized, rate= librosa.load(audio_synthesized_path)

  #audios have different lenghts
  if len(audio_ref)-len(audio_synthesized)>=0:
    audio_synthesized = np.pad(audio_synthesized, (0, len(audio_ref)-len(audio_synthesized)), 'constant', constant_values=(0, 0)) 
  else:
    audio_synthesized = audio_synthesized[:len(audio_ref)]

  f0_ref, voiced_flag_ref, voiced_probs_ref = librosa.pyin(audio_ref, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))
  f0_synthesized, voiced_flag_synthesized, voiced_probs_synthesized = librosa.pyin(audio_synthesized, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))

  num_f0_error = 0
  num_f0_total = 0
  for i in range(len(f0_ref)):
    if voiced_flag_ref[i]==True and  voiced_flag_synthesized[i]==True:
      num_f0_total+=1
      if np.abs(f0_synthesized[i]/f0_ref[i]-1)>0.2:
        num_f0_error+=1

  gpe = num_f0_error/num_f0_total

  return gpe

In [26]:
### Evaluate one pair

# two inputs w.r.t. reference (ground-truth) and degraded/resynthesized speeches, respectively
audio_ref_path = "data/raw/neutral_sent001_long.wav"
audio_degraded_path = "data/degraded/neutral_sent001_long.wav"

gpe = calculate_gpe(audio_ref_path, audio_degraded_path)

print(f"GPE = {gpe:.3f}")

GPE = 0.421


In [None]:
### Evaluate all pairs

# directories of audios from different models
GT_dir = "data/raw"
resyn_dir = "data/resyn"
degraded_dir = "data/degraded"

# Evaluate all audios in the folders
audio_names = []
for subdir, dirs, files in os.walk(GT_dir):
  for file in files:
      if (".wav" in file):
          audio_names.append(file)

audio_names.sort()
gpe_values_resyn = []
gpe_values_degraded = []
for i in tqdm(range(len(audio_names))):
  audio_name = audio_names[i]
  GT_audio_path = os.path.join(GT_dir,audio_name)
  resyn_audio_path = os.path.join(resyn_dir,audio_name)
  degraded_audio_path = os.path.join(degraded_dir,audio_name)
  gpe_values_resyn.append(calculate_gpe(GT_audio_path, resyn_audio_path))
  gpe_values_degraded.append(calculate_gpe(GT_audio_path, degraded_audio_path))

#Write result in csv file
df = pd.read_csv(result_csv_paths[0])
df['GPE'] = gpe_values_resyn
df.to_csv(result_csv_paths[0],index=False)

df = pd.read_csv(result_csv_paths[1])
df['GPE'] = gpe_values_degraded
df.to_csv(result_csv_paths[1],index=False)

# Calculate the mean score of gpe
gpe_resyn = np.mean(gpe_values_resyn)
gpe_degraded = np.mean(gpe_values_degraded)

print("\n -------------------------------------")
print(f"GPE of resyn = {gpe_resyn:.3f}")
print(f"GPE of degraded = {gpe_degraded:.3f}")

## F0 Frame Error (FFE)

F0 Frame Error (FFE) is a measure of the accuracy of fundamental frequency (F0) estimation in speech processing. It measures the error in estimating the F0 value over a frame of the speech signal. FFE is calculated by comparing the estimated F0 value to the true F0 value for each frame of the speech signal. The error is typically expressed as a percentage of the true F0 value. A high FFE score indicates that the F0 estimation algorithm is making many errors in estimating the F0 value for each frame of the speech signal[4].

$\begin{aligned} FFE & =\frac{\# \text { of error frames }}{\# \text { of total frames }} \times 100 \% \\ = & \frac{N_{U \rightarrow V}+N_{V \rightarrow U}+N_{F 0 E}}{N} \times 100 \% .\end{aligned}$

In [28]:
def calculate_ffe(audio_ref_path, audio_synthesized_path):
  audio_ref, rate= librosa.load(audio_ref_path)
  audio_synthesized, rate= librosa.load(audio_synthesized_path)

  #audios have different lenghts
  if len(audio_ref)-len(audio_synthesized)>=0:
    audio_synthesized = np.pad(audio_synthesized, (0, len(audio_ref)-len(audio_synthesized)), 'constant', constant_values=(0, 0)) 
  else:
    audio_synthesized = audio_synthesized[:len(audio_ref)]

  f0_ref, voiced_flag_ref, voiced_probs_ref = librosa.pyin(audio_ref, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))
  f0_synthesized, voiced_flag_synthesized, voiced_probs_synthesized = librosa.pyin(audio_synthesized, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'))

  num_flag_error = 0
  num_flag_total = len(voiced_flag_ref)
  num_f0_error = 0
  num_f0_total = 0
  for i in range(num_flag_total):
    if voiced_flag_ref[i] != voiced_flag_synthesized[i]:
      num_flag_error+=1
    elif voiced_flag_ref[i]==True and np.abs(f0_synthesized[i]/f0_ref[i]-1)>0.2:
      num_f0_error+=1

  ffe = (num_flag_error+num_f0_error)/num_flag_total

  return ffe

In [30]:
### Evaluate one pair

# two inputs w.r.t. reference (ground-truth) and degraded/resynthesized speeches, respectively
audio_ref_path = "data/raw/neutral_sent001_long.wav"
audio_degraded_path = "data/degraded/neutral_sent001_long.wav"

ffe = calculate_ffe(audio_ref_path, audio_degraded_path)

print(f"FFE = {ffe:.3f}")

FFE = 0.431


In [31]:
### Evaluate all pairs

# directories of audios from different models
GT_dir = "data/raw"
resyn_dir = "data/resyn"
degraded_dir = "data/degraded"

# Evaluate all audios in the folders
audio_names = []
for subdir, dirs, files in os.walk(GT_dir):
  for file in files:
      if (".wav" in file):
          audio_names.append(file)

audio_names.sort()
ffe_values_resyn = []
ffe_values_degraded = []
for i in tqdm(range(len(audio_names))):
  audio_name = audio_names[i]
  GT_audio_path = os.path.join(GT_dir,audio_name)
  resyn_audio_path = os.path.join(resyn_dir,audio_name)
  degraded_audio_path = os.path.join(degraded_dir,audio_name)
  ffe_values_resyn.append(calculate_ffe(GT_audio_path, resyn_audio_path))
  ffe_values_degraded.append(calculate_ffe(GT_audio_path, degraded_audio_path))

#Write result in csv file
df = pd.read_csv(result_csv_paths[0])
df['FFE'] = ffe_values_resyn
df.to_csv(result_csv_paths[0],index=False)

df = pd.read_csv(result_csv_paths[1])
df['FFE'] = ffe_values_degraded
df.to_csv(result_csv_paths[1],index=False)

# Calculate the mean score of ffe
ffe_resyn = np.mean(ffe_values_resyn)
ffe_degraded = np.mean(ffe_values_degraded)

print("\n -------------------------------------")
print(f"FFE of resyn = {ffe_resyn:.3f}")
print(f"FFE of degraded = {ffe_degraded:.3f}")

100%|██████████| 10/10 [01:41<00:00, 10.13s/it]


 -------------------------------------
FFE of resyn = 0.105
FFE of degraded = 0.393





## NISQA: Speech Quality and Naturalness Assessment

NISQA is a deep learning model/framework for speech quality prediction. The NISQA model weights can be used to predict the quality of a speech sample that has been sent through a communication system (e.g telephone or video call). Besides overall speech quality, NISQA also provides predictions for the quality dimensions Noisiness, Coloration, Discontinuity, and Loudness to give more insight into the cause of the quality degradation[3].

In [32]:
%cd NISQA
%run run_predict.py --mode predict_file --pretrained_model weights/nisqa_tts.tar --deg "../data/raw/neutral_sent001_long.wav" --output_dir "../result"
## go back to the directory of summerschool2023
%cd - 

/content/drive/MyDrive/Colab_Notebooks/summerschool2023/NISQA
Device: cpu
Model architecture: NISQA
Loaded pretrained model from weights/nisqa_tts.tar
---> Predicting ...
                     deg  mos_pred        model
neutral_sent001_long.wav  2.257622 NISQA_TTS_v1
/content/drive/MyDrive/Colab_Notebooks/summerschool2023


In [33]:
%cd NISQA

%run run_predict.py --mode predict_dir --pretrained_model weights/nisqa_tts.tar --data_dir "../data/resyn" --num_workers 0 --bs 10 
df = pd.read_csv("../"+result_csv_paths[0])
df['NISQA_MOS'] = nisqa.ds_val.df['mos_pred']
df.to_csv("../"+result_csv_paths[0],index=False)

%run run_predict.py --mode predict_dir --pretrained_model weights/nisqa_tts.tar --data_dir "../data/degraded" --num_workers 0 --bs 10 
df = pd.read_csv("../"+result_csv_paths[1])
df['NISQA_MOS'] = nisqa.ds_val.df['mos_pred']
df.to_csv("../"+result_csv_paths[1],index=False)

## go back to the directory of summerschool2023
%cd - 

/content/drive/MyDrive/Colab_Notebooks/summerschool2023/NISQA
Device: cpu
Model architecture: NISQA
Loaded pretrained model from weights/nisqa_tts.tar
# files: 10
---> Predicting ...
                      deg  mos_pred
neutral_sent005_short.wav  3.864740
 neutral_sent005_long.wav  2.201691
neutral_sent002_short.wav  3.295876
 neutral_sent002_long.wav  2.845493
neutral_sent004_short.wav  2.740637
 neutral_sent004_long.wav  2.729923
neutral_sent003_short.wav  2.928636
neutral_sent001_short.wav  2.255678
 neutral_sent001_long.wav  2.441009
 neutral_sent003_long.wav  2.478161
Device: cpu
Model architecture: NISQA
Loaded pretrained model from weights/nisqa_tts.tar
# files: 10
---> Predicting ...
                      deg  mos_pred
 neutral_sent002_long.wav  2.237563
 neutral_sent003_long.wav  2.135721
neutral_sent003_short.wav  2.748559
neutral_sent004_short.wav  2.743794
neutral_sent005_short.wav  3.521120
neutral_sent002_short.wav  3.295687
 neutral_sent005_long.wav  2.353680
 neutral_sen

# Reference

[1] R. Skerry-Ryan et al., “Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron,” arXiv.org, 2018. https://arxiv.org/abs/1803.09047.

[2] R. F. Kubichek, “Mel-cepstral distance measure for objective speech quality assessment,” Pacific Rim Conference on Communications, Computers and Signal Processing, May 1993, doi: https://doi.org/10.1109/pacrim.1993.407206.

[3] gabrielmittag, “gabrielmittag/NISQA: NISQA - Non-Intrusive Speech Quality and TTS Naturalness Assessment,” GitHub, Mar. 22, 2022. https://github.com/gabrielmittag/NISQA.

[4] Wei Chu and A. Alwan, “Reducing F0 Frame Error of F0 tracking algorithms under noisy conditions with an unvoiced/voiced classification frontend,” 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Apr. 2009, doi: https://doi.org/10.1109/icassp.2009.4960497.

[5] C.-C. Lo et al., “MOSNet: Deep Learning-Based Objective Assessment for Voice Conversion,” Interspeech 2019, Sep. 2019, doi: https://doi.org/10.21437/interspeech.2019-2003.

[6] schmiph2, “schmiph2/pysepm: Python implementation of performance metrics in Loizou’s Speech Enhancement book,” GitHub, Jul. 14, 2020. https://github.com/schmiph2/pysepm .

[7] H. Kawahara, H. Katayose, A. Cheveigné, and R. Patterson, “Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity,” 6th European Conference on Speech Communication and Technology (Eurospeech 1999).

[8] J. Ma, Y. Hu, and P. C. Loizou, “Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions,” The Journal of the Acoustical Society of America, vol. 125, no. 5, p. 3387, 2009, doi: https://doi.org/10.1121/1.3097493.

[9] “scipy.stats.pearsonr — SciPy v1.10.1 Manual,” Scipy.org, 2023. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html.

[10] google, “google/visqol: Perceptual Quality Estimator for speech and audio,” GitHub, Jan. 03, 2023. https://github.com/google/visqol (accessed May 10, 2023).
‌
‌
‌
‌
‌
‌
‌
‌
‌

‌