In [None]:
# real = "/content/drive/MyDrive/ML-Project_Team_minions/Survey Audio/real_04.wav"
# fake = "/content/drive/MyDrive/ML-Project_Team_minions/Survey Audio/fake_04.wav"

# vits2
real = "/content/drive/MyDrive/ML-Project_Team_minions/Survey Audio/Test/ground_vits.wav"
fake = "/content/drive/MyDrive/ML-Project_Team_minions/Survey Audio/Test/fake_vits.wav"

## Current and Final PESQ calculation

In [1]:
%%capture
!pip install pesq

In [2]:
%%capture
!pip install librosa

## PESQ

In [10]:
from pesq import pesq
from scipy.io import wavfile
import librosa
import os
import numpy as np

def calculate_pesq_for_batch(real_audio_dir, fake_audio_dir):
    """
    Calculate PESQ score for a batch of real and fake audio files.

    Args:
        real_audio_dir (str): Directory containing real audio files.
        fake_audio_dir (str): Directory containing fake audio files.

    Returns:
        float: Average PESQ score for the batch.
    """
    pesq_scores = []

    # Ensure the directories contain corresponding audio files
    real_files = sorted([os.path.join(real_audio_dir, f) for f in os.listdir(real_audio_dir) if f.endswith('.wav')])
    fake_files = sorted([os.path.join(fake_audio_dir, f) for f in os.listdir(fake_audio_dir) if f.endswith('.wav')])

    print("REAL FILES: ", real_files)
    print("FAKE FILES: ", fake_files)

    if len(real_files) != len(fake_files):
        raise ValueError("Number of files in real_audio_dir and fake_audio_dir must be the same.")

    for real_path, fake_path in zip(real_files, fake_files):
        # Load audio files
        ref_rate, ref_audio = wavfile.read(real_path)
        deg_rate, deg_audio = wavfile.read(fake_path)

        # Resample to 16000 Hz
        ref_audio_resampled = librosa.resample(ref_audio.astype(float), orig_sr=ref_rate, target_sr=16000)
        deg_audio_resampled = librosa.resample(deg_audio.astype(float), orig_sr=deg_rate, target_sr=16000)

        # Compute PESQ score
        try:
            score = pesq(16000, ref_audio, deg_audio, 'wb')  # 'wb' for wideband
            pesq_scores.append(score)
            print(f"PESQ Score for {os.path.basename(real_path)} and {os.path.basename(fake_path)}: {score}")
        except Exception as e:
            print(f"Error processing {real_path} and {fake_path}: {e}")
            continue

    # Calculate average PESQ score
    average_pesq = np.mean(pesq_scores) if pesq_scores else 0.0
    return average_pesq

# Example
real_audio_dir = "/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_real_100"
# fake_audio_dir = "/content/drive/MyDrive/ML-Project_Team_minions/evaluation/fake_70K"
# fake_audio_dir = "/content/drive/MyDrive/ML-Project_Team_minions/evaluation/fake_130K"
# fake_audio_dir = "/content/drive/MyDrive/ML-Project_Team_minions/evaluation/fake_140K"
# fake_audio_dir = "/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_fake_100_70K"
fake_audio_dir = "/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_fake_100_130K"
# fake_audio_dir = "/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_fake_100_140K"

average_pesq_score = calculate_pesq_for_batch(real_audio_dir, fake_audio_dir)
print(f"Average PESQ Score for the batch: {average_pesq_score}")

REAL FILES:  ['/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_real_100/real_01083.wav', '/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_real_100/real_01264.wav', '/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_real_100/real_01726.wav', '/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_real_100/real_01800.wav', '/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_real_100/real_01817.wav', '/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_real_100/real_02012.wav', '/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_real_100/real_02277.wav', '/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_real_100/real_02594.wav', '/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_real_100/real_03169.wav', '/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_real_100/real_03196.wav', '/content/drive/MyDrive/ML-Project_Team_minions/evaluation/eval_real_100/real_03226.wa

## BEST PESQ:

Audio size: 15

ckpt = checkpoint of our trained model

Sampling Rate: **8K**
- 1.06 [70K ckpt]
- 1.07 [130K ckpt] **
- 1.05 [140K ckpt]

---

Sampling Rate: **16K**
- 1.09 [70K ckpt]
- 1.13 [130K ckpt] **
- 1.10 [140K ckpy]

---

Sampling Rate: **22.5K**
- 1.24 [70K ckpt] **
- 1.16 [130K ckpt]
- 1.12 [140K ckpt]

---
---

Audio size: 100

Sampling Rate: **22K**
- 1.17 [140K ckpt]
- **1.39 [130 ckpt]**
- 1.23 [70 ckpt] **

## STOI
The STOI (Short-Time Objective Intelligibility) metric is designed to assess the intelligibility of speech signals, and it is particularly useful when **evaluating how much speech has been degraded or distorted**.

The score reflects the **quality of speech perception by humans**, even when the speech is corrupted by noise, distortion, or other impairments.

The scores can be interpreted as follows:

- 0.9 to 1.0: Very high intelligibility. The generated speech is almost indistinguishable from the original.
- 0.7 to 0.9: Good intelligibility, but there may be slight distortions or noise present that affect understanding.
- 0.5 to 0.7: Moderate intelligibility. The speech is somewhat degraded, and parts of it might be difficult to understand, especially under noisy conditions.
- Below 0.5: Poor intelligibility. The speech is heavily degraded or distorted, and it may be difficult or impossible to understand most or all of the content.


In [None]:
!pip install pystoi

Collecting pystoi
  Downloading pystoi-0.4.1-py2.py3-none-any.whl.metadata (4.0 kB)
Downloading pystoi-0.4.1-py2.py3-none-any.whl (8.2 kB)
Installing collected packages: pystoi
Successfully installed pystoi-0.4.1


In [None]:
from pystoi import stoi
import scipy.io.wavfile as wav
import numpy as np

# Load the real and fake audio files
fs_real, real_audio = wav.read(real)
fs_fake, fake_audio = wav.read(fake)

# Ensure both signals have the same sample rate
assert fs_real == fs_fake, "Sample rates must be the same."

# Get the minimum length of the two signals
min_length = min(len(real_audio), len(fake_audio))

# Trim both signals to the minimum length
real_audio = real_audio[:min_length]
fake_audio = fake_audio[:min_length]

# Calculate STOI score
stoi_score = stoi(real_audio, fake_audio, fs_real)
print("STOI Score:", stoi_score)

STOI Score: 0.18200571635120377


## **EER** is used for mainly **speaker verificaion** when we have multiple speakers. So its of **NO USE in our case** as ours is a single speaker system.

Recommended Approach:
For generating fake audio (e.g., synthetic speech or manipulated audio), the following metrics are typically more useful:

PESQ: For speech or general audio quality.
MOS: For a comprehensive, human-perception-based quality assessment.
SNR: To detect distortions and measure how much noise is introduced in the fake audio.
STOI: If intelligibility is crucial (e.g., for speech applications).
These metrics provide different perspectives on audio quality:

PESQ and MOS measure the overall quality from a perceptual standpoint.
SNR quantifies the signal distortion.
STOI evaluates intelligibility, especially in speech.
You can use a combination of these metrics depending on the characteristics you're most concerned with (e.g., intelligibility, perceptual quality, noise levels).








To assess **Overall Quality**, **Naturalness**, and **Content Clarity** of audio, the following metrics are best suited:

### 1. **Overall Quality**:
   - **Metric**: **PESQ (Perceptual Evaluation of Speech Quality)**
     - **Reason**: PESQ is widely used for evaluating overall speech quality. It accounts for various distortions, including compression artifacts, noise, and other impairments, giving a comprehensive measure of the perceived quality of speech.
     - **Reference**: The PESQ algorithm has been standardized by ITU-T (P.862), and is commonly used in telecommunications and audio quality research.

### 2. **Naturalness**:
   - **Metric**: **MOS (Mean Opinion Score) or Subjective Rating** (specifically focused on naturalness)
     - **Reason**: MOS is an established subjective method to assess how "natural" the audio sounds to human listeners. It directly reflects the perceived naturalness based on human judgment, which is critical when assessing whether synthetic speech or audio sounds natural.
     - **Reference**: MOS has been a standard for subjective assessment of speech and audio quality, particularly in speech synthesis and voice applications (e.g., ITU-T P.800).

### 3. **Content Clarity**:
   - **Metric**: **STOI (Short-Time Objective Intelligibility)**
     - **Reason**: STOI measures how intelligible speech is under varying conditions. It quantifies the preservation of content clarity by comparing the reference and degraded speech. Higher STOI values indicate better content clarity, especially in noisy or degraded environments.
     - **Reference**: STOI has been widely used in speech enhancement research (e.g., ITU-T P.862.2 for speech intelligibility).

### **Conclusion**:
- **PESQ** for **Overall Quality** provides a comprehensive measure of how good the audio sounds.
- **MOS** for **Naturalness** captures how human listeners perceive the naturalness of the audio.
- **STOI** for **Content Clarity** focuses on intelligibility, which directly relates to the clarity of the audio's content.