This cell installs all of the Python packages we need:

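Pins NumPy to a version <1.23 for the FAD library. The pin does not last: the JAX upgrade on the next line requires numpy>=1.25 and replaces it (NumPy 2.3.0 in the log below). The numpy.dtypes module itself only exists from NumPy 1.25 onward, so on older versions it is the np.dtypes shim in Cell 2, not this pin, that satisfies FAD’s import.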

Installs JAX/jaxlib (required by FAD under the hood).

Installs the PESQ, SRMR, and Fréchet Audio Distance libraries, plus librosa, soundfile, and scipy for audio I/O and processing.

Installs mir_eval for computing SDR.


In [1]:
# ─── Cell 1: Install dependencies ───
!pip install "numpy<1.23"
!pip install --upgrade jax jaxlib -f https://storage.googleapis.com/jax-releases/jax_releases.html
!pip install pesq
!pip install git+https://github.com/jfsantos/SRMRpy.git
!pip install frechet_audio_distance
!pip install librosa soundfile scipy
!pip install mir_eval


Collecting numpy<1.23
  Using cached numpy-1.22.4-cp311-cp311-linux_x86_64.whl
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.23.4
    Uninstalling numpy-1.23.4:
      Successfully uninstalled numpy-1.23.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jax 0.6.1 requires numpy>=1.25, but you have numpy 1.22.4 which is incompatible.
jax 0.6.1 requires scipy>=1.11.1, but you have scipy 1.10.1 which is incompatible.
jaxlib 0.6.1 requires numpy>=1.25, but you have numpy 1.22.4 which is incompatible.
jaxlib 0.6.1 requires scipy>=1.11.1, but you have scipy 1.10.1 which is incompatible.
ml-dtypes 0.5.1 requires numpy>=1.23.3; python_version >= "3.11", but you have numpy 1.22.4 which is incompatible.
frechet-audio-distance 0.3.1 requires numpy==1.23.4, but you have numpy 1.22.4 which is incompatible.

Looking in links: https://storage.googleapis.com/jax-releases/jax_releases.html
Collecting numpy>=1.25 (from jax)
  Using cached numpy-2.3.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (62 kB)
Collecting scipy>=1.11.1 (from jax)
  Using cached scipy-1.15.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Using cached numpy-2.3.0-cp311-cp311-manylinux_2_28_x86_64.whl (16.9 MB)
Using cached scipy-1.15.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (37.7 MB)
Installing collected packages: numpy, scipy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.4
    Uninstalling numpy-1.22.4:
      Successfully uninstalled numpy-1.22.4
  Attempting uninstall: scipy
    Found existing installation: scipy 1.10.1
    Uninstalling scipy-1.10.1:
      Successfully uninstalled scipy-1.10.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the fo
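Because the pip log shows NumPy being uninstalled and reinstalled more than once, it can help to confirm which versions actually ended up in the environment before importing anything. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, not part of the notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# Report the packages that the dependency conflicts mention.
report = installed_versions(["numpy", "scipy", "jax", "jaxlib"])
for name, version in report.items():
    print(f"{name:<10} {version or 'not installed'}")
```

Running this right after the install cell shows whether the NumPy pin is still in effect or whether a later install replaced it.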



This cell:

Strips out Jupyter’s own command-line flags (sys.argv = sys.argv[:1]) so that any argparse calls in imported libraries don’t crash.

Monkey-patches NumPy so that np.dtypes points at the real numpy.core.numerictypes module, satisfying FAD’s expectation.

Imports all of the functions and classes we’ll use:

pesq (and its NoUtterancesError)

srmr

FrechetAudioDistance

mir_eval.separation.bss_eval_sources

In [1]:
# ─── Cell 2: Monkey-patch & Imports ───
import sys
sys.argv = sys.argv[:1]  # drop extra Jupyter flags

import numpy as np
import numpy.core.numerictypes as _numerictypes
np.dtypes = _numerictypes  # shim for frechet_audio_distance

import librosa
import soundfile as sf

from pesq import pesq, NoUtterancesError
from srmrpy import srmr
from frechet_audio_distance import FrechetAudioDistance

import mir_eval  # for SDR
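The shim above assigns np.dtypes unconditionally. A more defensive variant, sketched here under the assumption that an existing np.dtypes should be left alone, only installs the alias when the attribute is missing. `ensure_attr` is a hypothetical helper, and the demonstration uses a stand-in module so it runs without NumPy.

```python
import types


def ensure_attr(module, name, fallback):
    """Attach `fallback` as module.<name> only when that attribute is missing.

    Returns True if the shim was installed, False if the attribute already existed.
    """
    if hasattr(module, name):
        return False
    setattr(module, name, fallback)
    return True


# Stand-in for NumPy so the sketch is self-contained.
fake_np = types.ModuleType("fake_np")
installed = ensure_attr(fake_np, "dtypes", object())        # attribute missing
installed_again = ensure_attr(fake_np, "dtypes", object())  # already present
print(installed, installed_again)  # True False
```

In the notebook the equivalent call would be `ensure_attr(np, "dtypes", _numerictypes)`, which is a no-op on a NumPy build that already provides np.dtypes.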




This cell loads your two audio files and prepares them for analysis:

Reads the AI-generated and AI-enhanced WAVs from the Colab file system.

Converts any stereo to mono by averaging channels.

Resamples both signals to 16 kHz, the rate that wideband PESQ requires and that the VGGish embeddings in FAD expect. SRMR takes the sample rate as an argument, so it is simply run at the same rate.

Prints out the final sample rate and array shapes so you can sanity-check that everything loaded correctly.

In [2]:
# ─── Cell 3: Load & Preprocess Audio ───
import os

# Update these to match your Colab tree:
gen_path = '/content/Generated Audio/Kocking-door.wav'
enh_path = '/content/Enhanced Audio/knocking-door-1.wav'

assert os.path.exists(gen_path), f"{gen_path} not found"
assert os.path.exists(enh_path), f"{enh_path} not found"

gen, sr_gen = sf.read(gen_path)
enh, sr_enh = sf.read(enh_path)

# to mono
if gen.ndim > 1: gen = gen.mean(axis=1)
if enh.ndim > 1: enh = enh.mean(axis=1)

# resample to 16 kHz
target_sr = 16000
if sr_gen != target_sr:
    gen = librosa.resample(gen, orig_sr=sr_gen, target_sr=target_sr)
if sr_enh != target_sr:
    enh = librosa.resample(enh, orig_sr=sr_enh, target_sr=target_sr)

fs = target_sr
print(f"✔ Loaded & resampled to {fs} Hz mono. Shapes: gen={gen.shape}, enh={enh.shape}")


✔ Loaded & resampled to 16000 Hz mono. Shapes: gen=(145868,), enh=(145868,)
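The mono conversion and the later length matching can be sketched in isolation. This assumes the (frames, channels) layout that soundfile returns for multi-channel files; `to_mono` and `crop_to_common_length` are hypothetical helper names, and the arrays are toy data rather than the notebook's audio.

```python
import numpy as np


def to_mono(audio):
    """Average the channels of a (frames, channels) array; pass 1-D audio through."""
    audio = np.asarray(audio, dtype=np.float64)
    if audio.ndim == 1:
        return audio
    return audio.mean(axis=1)


def crop_to_common_length(a, b):
    """Trim both signals to the length of the shorter one."""
    n = min(len(a), len(b))
    return a[:n], b[:n]


# Three stereo frames: left/right samples per row.
stereo = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, -1.0]])
mono = to_mono(stereo)

# Signals of different length are cut to the shorter one.
short, long_ = crop_to_common_length(np.zeros(3), np.zeros(5))
print(mono.shape, short.shape, long_.shape)
```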


This cell computes PESQ (wideband, ITU-T P.862.2) to evaluate speech quality:

Wraps the call in a try/except to catch NoUtterancesError.

If there’s no detectable speech (e.g. door knocks), it prints a warning instead of crashing.

In [3]:
# ─── Cell 4: Compute PESQ (P.862.2 wideband) with safe‐guard ───
try:
    pesq_score = pesq(fs, gen, enh, 'wb')
    print(f"PESQ (wideband) gen→enh: {pesq_score:.3f}")
except NoUtterancesError:
    print("⚠️ PESQ not applicable: no speech detected in these signals.")


⚠️ PESQ not applicable: no speech detected in these signals.
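Besides the no-speech case, PESQ can also fail on an unsupported sample rate or mode. The sketch below checks the combination up front; the rules encoded here (modes 'nb' and 'wb', rates of 8000 or 16000 Hz, wideband only at 16000 Hz) reflect my understanding of the pesq package and should be treated as an assumption. `pesq_config_error` is a hypothetical helper.

```python
def pesq_config_error(fs, mode):
    """Return a message explaining why (fs, mode) is unusable, or None if it is fine."""
    if mode not in ("nb", "wb"):
        return "mode should be 'nb' or 'wb'"
    if fs not in (8000, 16000):
        return "sample rate should be 8000 or 16000 Hz"
    if mode == "wb" and fs == 8000:
        return "wideband mode needs 16000 Hz"
    return None


# The notebook's configuration: 16 kHz, wideband.
print(pesq_config_error(16000, "wb"))  # None
# A 44.1 kHz file that was never resampled would be rejected.
print(pesq_config_error(44100, "wb"))  # sample rate should be 8000 or 16000 Hz
```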


This cell computes the raw SRMR (speech-to-reverberation modulation energy ratio) non-intrusively:

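Passes norm=True to srmr (the code cell sets it explicitly), then unpacks the result whether it comes back as a tuple or as a single float.

Prints the raw SRMR value, which reflects how “speech-like” the modulation structure is (higher = cleaner/less reverberant), followed by the mean of the second returned value, which the cell labels as normalized.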

In [8]:
# ─── Cell 5: Compute SRMR ───
import numpy as np

srmr_out = srmr(
    enh,           # “degraded” / enhanced signal
    fs,
    n_cochlear_filters=23,
    low_freq=125,
    min_cf=4, max_cf=128,
    fast=True,
    norm=True
)

# Unpack if tuple
if isinstance(srmr_out, tuple):
    raw_srmr, norm_srmr = srmr_out
    # norm_srmr may be an array of per-band values → average to get a single score
    norm_scalar = norm_srmr.mean() if isinstance(norm_srmr, np.ndarray) else norm_srmr
    print(f"SRMR (raw)       : {raw_srmr:.3f}")
    print(f"SRMR (normalized): {norm_scalar:.3f}")
else:
    # single float return
    print(f"SRMR: {srmr_out:.3f}")


SRMR (raw)       : 2.548
SRMR (normalized): 0.000


This cell computes the Fréchet Audio Distance (FAD) over your two folders of clips:

Points gen_dir and enh_dir at the folders containing your generated vs. enhanced WAVs.

Instantiates FrechetAudioDistance with the VGGish embedding model.

Calls .score(...) to compare embedding distributions and prints out the resulting distance (lower = more similar).

Make sure any non-audio subfolders (e.g. .ipynb_checkpoints) are removed first so FAD only sees real clips; Cell 6a does this.

In [12]:
# ─── Cell 6a: Clean up hidden Jupyter checkpoints ───
!find "/content/Generated Audio" -type d -name ".ipynb_checkpoints" -exec rm -rf {} +
!find "/content/Enhanced Audio"  -type d -name ".ipynb_checkpoints" -exec rm -rf {} +
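A read-only alternative to deleting checkpoint folders is to list what each directory would contribute before running FAD. This is a sketch with a hypothetical helper, `list_audio_files`, demonstrated on a temporary directory rather than the Colab folders.

```python
import tempfile
from pathlib import Path


def list_audio_files(folder, suffixes=(".wav",)):
    """Return sorted audio files under `folder`, skipping hidden files and directories."""
    root = Path(folder)
    files = []
    for path in root.rglob("*"):
        # Ignore anything inside (or named as) a dot-prefixed entry.
        relative_parts = path.relative_to(root).parts
        if any(part.startswith(".") for part in relative_parts):
            continue
        if path.is_file() and path.suffix.lower() in suffixes:
            files.append(path)
    return sorted(files)


# Build a toy folder: one clip, one text file, one hidden checkpoint copy.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.wav").touch()
    (root / "notes.txt").touch()
    (root / ".ipynb_checkpoints").mkdir()
    (root / ".ipynb_checkpoints" / "a-checkpoint.wav").touch()
    found_names = [p.name for p in list_audio_files(root)]

print(found_names)  # ['a.wav']
```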


In [13]:
# ─── Cell 6b: Fréchet Audio Distance (FAD) ───
gen_dir = '/content/Generated Audio/'
enh_dir = '/content/Enhanced Audio/'

fad = FrechetAudioDistance(
    model_name="vggish",
    sample_rate=fs,
    use_pca=False,
    use_activation=False,
    verbose=False
)

fad_score = fad.score(gen_dir, enh_dir, dtype="float32")
print(f"FAD (gen vs enh): {fad_score:.3f}")


Using cache found in /root/.cache/torch/hub/harritaylor_torchvggish_master


FAD (gen vs enh): -0.000
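To make "compare embedding distributions" concrete: the Fréchet distance fits a Gaussian to each set of embeddings and evaluates ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2)). The sketch below is a NumPy-only illustration on random vectors. It is not the frechet_audio_distance implementation, and `frechet_distance` is a hypothetical name.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) arrays."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (clip guards against rounding).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 4))  # stand-in for VGGish embeddings

same = frechet_distance(emb, emb)  # identical sets
shifted = frechet_distance(emb, emb + np.array([1.0, 0.0, 0.0, 0.0]))  # mean moved by 1
print(f"{same:.3f} {shifted:.3f}")
```

For identical inputs the terms cancel up to floating-point rounding, so the result can land a hair below zero. That is one plausible reading of a printed score of -0.000.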


This cell computes two simple energy-based metrics:

SNR:

Defines “noise” as the difference between your generated and enhanced signals.

Calculates 10·log10(signal_power / noise_power).

Prints “∞ dB” if the two signals are identical (zero noise).

SDR:

Uses mir_eval.separation.bss_eval_sources to compute the signal-to-distortion ratio.

Prints “∞ dB” when the SDR is infinite or above 100 dB (effectively identical inputs).

This gives you classic, interpretable scores alongside PESQ, SRMR, and FAD.

In [16]:
# ─── Cell 7: Simple SNR & SDR (cleaned up) ───
import numpy as np
import warnings
# ignore the FutureWarning from mir_eval
warnings.filterwarnings("ignore", category=FutureWarning)

from mir_eval.separation import bss_eval_sources

# crop to equal length
min_len = min(len(gen), len(enh))
gen_c, enh_c = gen[:min_len], enh[:min_len]

# SNR
noise = gen_c - enh_c
signal_power = np.sum(gen_c**2)
noise_power  = np.sum(noise**2)
if noise_power == 0:
    snr_str = "∞"
else:
    snr_val = 10 * np.log10(signal_power / noise_power)
    snr_str = f"{snr_val:.2f}"
print(f"SNR: {snr_str} dB")

# SDR via mir_eval
sdr, sir, sar, _ = bss_eval_sources(
    np.vstack([gen_c]),
    np.vstack([enh_c])
)
sdr_val = sdr[0]
# treat anything above 100 dB as infinite for clarity
if np.isinf(sdr_val) or sdr_val > 100:
    sdr_str = "∞"
else:
    sdr_str = f"{sdr_val:.2f}"
print(f"SDR: {sdr_str} dB")


SNR: ∞ dB
SDR: ∞ dB
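The SNR definition used above can be checked on a toy signal where the answer is known in advance. This is a minimal sketch; `snr_db` is a hypothetical helper that mirrors the cell's formula, with noise defined as reference minus estimate.

```python
import math

import numpy as np


def snr_db(reference, estimate):
    """10*log10(signal_power / noise_power), with noise = reference - estimate."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    noise = reference - estimate
    signal_power = np.sum(reference ** 2)
    noise_power = np.sum(noise ** 2)
    if noise_power == 0:
        # Identical signals: no noise at all.
        return math.inf
    return float(10.0 * np.log10(signal_power / noise_power))


ref = np.ones(4)
identical = snr_db(ref, ref)         # zero noise
attenuated = snr_db(ref, 0.9 * ref)  # noise amplitude is 10% of the signal
print(identical, round(attenuated, 2))
```

A 10% amplitude error gives a power ratio of 100, i.e. about 20 dB. An infinite SNR, as in the output above, means the two cropped arrays are identical sample for sample.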
