# Practical 3 - 

[Nasser-eddine Monir](https://nasseredd.github.io/teaching/) (CC BY-NC-SA) -- 2025


* 👥 You may work in pairs,
* 📩 however, submit your work individually by the end of the class.
* 📝 Your names should be included in the file names as follows: **Practical-3-Monir.ipynb**.
* 📧 Please ensure you email me EXCLUSIVELY at the following address: nasser-eddine.monir@inria.fr
* 💬 Don't forget to leave a comment whenever an observation is requested.

# First Steps

1. Change the runtime type and select a **T4 GPU**.
2. Click the "Upload to session storage" button and upload ```data.zip```.
3. Unzip the archive: ```!unzip /content/data.zip -d /content/data```

In [None]:
# TODO: code me!

4. Install ```museval``` using ```pip```.

In [None]:
# TODO: code me!

5. Install ESPnet and its model zoo:

In [None]:
%pip install git+https://github.com/espnet/espnet
%pip install -q espnet_model_zoo

# Imports

In [None]:
import sys
import librosa
import soundfile
import numpy as np
import seaborn as sns
import soundfile as sf
import matplotlib.pyplot as plt
from espnet2.bin.enh_inference import SeparateSpeech
from espnet_model_zoo.downloader import ModelDownloader

#### Load Signals

The ```load_signals``` function loads the speech and noise signals (resampled to 16 kHz) and returns them together with their corresponding Room Impulse Responses (RIRs), which are read from the ```rir``` entry of each NumPy file.

In [None]:
def load_signals(speech_file, noise_file, speech_rir_file, noise_rir_file):
    # Load the speech signal
    speech_signal, _ = librosa.load(speech_file, sr=16000)

    # Load the noise signal
    noise_signal, _ = librosa.load(noise_file, sr=16000)

    # Load the RIRs
    speech_rir_data = np.load(speech_rir_file)
    noise_rir_data = np.load(noise_rir_file)

    # Extract the RIR signals
    speech_rir = speech_rir_data['rir']
    noise_rir = noise_rir_data['rir']

    return speech_signal, noise_signal, speech_rir, noise_rir

#### Mixture

Create mixtures by assigning a unique noise signal to each speech signal while maintaining the following conditions:

- Speech Position: Always at the front (0°).
- Noise Position: Always at 90° to the right.
- Signal-to-Noise Ratio (SNR): Fixed at 0 dB.

Ensure that each speech signal is paired with exactly one noise signal, resulting in distinct mixtures.
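
As a starting point, the sketch below shows one possible way to spatialize a signal with an RIR and to scale the noise to a target SNR. It runs on synthetic data and assumes each RIR is an array of shape `(n_channels, rir_len)`; the helper names `spatialize` and `mix_at_snr` are hypothetical, and the shape of the RIRs in `data.zip` may differ, so check it before reusing anything.

```python
import numpy as np

def spatialize(signal, rir):
    """Convolve a mono signal with each channel of an RIR of shape (n_channels, rir_len)."""
    return np.stack([np.convolve(signal, h) for h in rir], axis=0)

def mix_at_snr(speech_mc, noise_mc, snr_db=0.0):
    """Scale the noise so the speech-to-noise power ratio equals snr_db, then sum."""
    n = min(speech_mc.shape[1], noise_mc.shape[1])  # truncate to the common length
    speech_mc, noise_mc = speech_mc[:, :n], noise_mc[:, :n]
    speech_power = np.mean(speech_mc ** 2)
    noise_power = np.mean(noise_mc ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    noise_scaled = gain * noise_mc
    return speech_mc + noise_scaled, speech_mc, noise_scaled

# Synthetic stand-ins for the real signals and RIRs (assumed shapes, not the real data).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = 0.1 * rng.standard_normal(20000)
decay = np.exp(-np.arange(256) / 40)
speech_rir = rng.standard_normal((4, 256)) * decay
noise_rir = rng.standard_normal((4, 256)) * decay

mixture, speech_image, noise_image = mix_at_snr(
    spatialize(speech, speech_rir), spatialize(noise, noise_rir), snr_db=0.0
)
print(mixture.shape)
```

Keeping the spatialized speech and the scaled noise alongside the mixture is a design choice: both are needed later as reference sources for the evaluation.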

In [None]:
# TODO: code me!

# Speech Enhancement

The next three cells come from the ESPnet tutorial on speech enhancement, available in their [GitHub repository](https://github.com/espnet/espnet). The first cell lists four pretrained models, including MVDR and FaSNet. Choose one, run the cells carefully, then visualize and listen to the mixture and the estimated speech.

Each model processes a **four-channel mixture**, selects a **reference channel**, and outputs a **single-channel estimated speech** signal.

In [None]:
fs = 16000 #@param {type:"integer"}
tag = "espnet/Wangyou_Zhang_chime4_enh_train_enh_beamformer_mvdr_raw" #@param ["espnet/Wangyou_Zhang_chime4_enh_train_enh_beamformer_mvdr_raw", "espnet/Wangyou_Zhang_chime4_enh_train_enh_dc_crn_mapping_snr_raw", "lichenda/chime4_fasnet_dprnn_tac", "https://zenodo.org/record/6025881/files/enh_train_enh_beamformer_mvdr_raw_valid.si_snr.ave.zip"]

In [None]:
d = ModelDownloader()

cfg = d.download_and_unpack(tag)
enh_model_mc = SeparateSpeech(
  train_config=cfg["train_config"],
  model_file=cfg["model_file"],
  normalize_segment_scale=False,
  show_progressbar=True,
  ref_channel=4,
  normalize_output_wav=True,
  device="cuda:0",
)

In [None]:
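# Download the example mixture (the --id flag is deprecated in recent gdown versions, so the full URL is used)
!gdown "https://drive.google.com/uc?id=1SmrN5NFSg6JuQSs2sfy3ehD8OIcqK6wS" -O /content/M05_440C0213_PED_REAL.wav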
mixwav_mc, sr = soundfile.read("/content/M05_440C0213_PED_REAL.wav") # mixwav.shape: num_samples, num_channels
mixwav_sc = mixwav_mc[:,4]
wave = enh_model_mc(mixwav_mc[None, ...], sr)

Select a model and use it to generate the estimated speech from the chosen mixture. Once processed, listen to the output to evaluate the quality of the enhanced speech.

In [None]:
# TODO: code me!

Now that you're more familiar with speech enhancement using ESPnet, create a function that processes a given **mixture folder**, selects the **first channel as the reference**, and generates the **estimated speech** in the `est_speech/` folder using FaSNet.
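
One possible way to organize the folder processing is sketched below. The helper `plan_inference` is hypothetical: it only pairs each mixture file with its output path and creates the output folder. The loop that calls the model is left as comments, since it depends on the model object you built and on the layout of your mixture files, and the folder name `/content/mixtures` is an assumption.

```python
from pathlib import Path

def plan_inference(mix_dir, out_dir="est_speech", pattern="*.wav"):
    """Return sorted (mixture_path, output_path) pairs and create the output folder."""
    mix_dir, out_dir = Path(mix_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    return [(path, out_dir / path.name) for path in sorted(mix_dir.glob(pattern))]

# Sketch of the processing loop (assumes a SeparateSpeech model built with ref_channel=0):
# for mix_path, est_path in plan_inference("/content/mixtures"):
#     mix, sr = soundfile.read(mix_path)             # (num_samples, num_channels)
#     est = model(mix[None, ...], sr)[0].squeeze()   # single-channel estimate
#     soundfile.write(est_path, est, sr)
```

Sorting the paths keeps the order of the estimates reproducible, which makes it easier to match each estimate with its clean reference during evaluation.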

In [None]:
# TODO: code me!
def inference():
  pass

# Evaluation

Museval is a Python library that implements the BSS Eval metrics (SDR, SIR, and SAR). It was originally developed for music source separation, but the same metrics apply to speech enhancement.

In speech enhancement, we evaluate the estimated speech by comparing it to the clean reference and noise. Your task is to stack the **reference sources** (clean speech and noise) and the corresponding **estimates** as follows:

Given:
- **$s$** = clean speech (target signal)
- **$n$** = noise
- **$\hat{s}$** = estimated speech

Stack the reference sources and estimates as:

$\text{references} = \begin{bmatrix}
s \\
n
\end{bmatrix}$

$\text{estimates} = \begin{bmatrix}
\hat{s} \\
\hat{s} - s
\end{bmatrix}$

Next, use the `bss_eval` function from `museval.metrics` to compute the enhancement metrics.  
Set `filters_len=1` to ensure proper evaluation.  

Run the evaluation on **one example of your choice**, print SIR, SAR and SDR, and analyze the results.
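
The stacking described above can be sketched with NumPy as follows. This is a minimal illustration on synthetic single-channel signals; the helper name `stack_for_bss_eval` is hypothetical, and the trailing channel axis reflects the `(nsrc, nsampl, nchan)` layout that `bss_eval` works with. The call to `bss_eval` is left as a comment so that the snippet runs without `museval`.

```python
import numpy as np

def stack_for_bss_eval(clean, noise, estimate):
    """Stack references [s, n] and estimates [s_hat, s_hat - s] as (2, n_samples, 1) arrays."""
    n = min(len(clean), len(noise), len(estimate))  # align lengths before stacking
    s, v, s_hat = clean[:n], noise[:n], estimate[:n]
    references = np.stack([s, v])[..., np.newaxis]
    estimates = np.stack([s_hat, s_hat - s])[..., np.newaxis]
    return references, estimates

# Synthetic stand-ins for the clean speech, the noise, and an imperfect estimate.
rng = np.random.default_rng(0)
s = rng.standard_normal(16000)
n = rng.standard_normal(16000)
s_hat = s + 0.1 * n
references, estimates = stack_for_bss_eval(s, n, s_hat)
print(references.shape, estimates.shape)

# With museval installed, the metrics could then be computed as:
# from museval.metrics import bss_eval
# sdr, isr, sir, sar, perm = bss_eval(references, estimates, filters_len=1)
# Each metric is an array with one row per source; row 0 corresponds to the speech.
```

Note that `bss_eval` evaluates frame by frame with its default window and hop, so each metric may contain several values per source that you then need to aggregate.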

In [None]:
# TODO: code me!

* **Signal-to-Interference Ratio** (**SIR**) measures how well the target speech signal is separated from interfering noise or other unwanted sources. A higher SIR indicates better suppression of interference while preserving the target speech signal.

* **Signal-to-Artifacts Ratio** (**SAR**) quantifies the amount of distortion or artifacts introduced during signal processing. A higher SAR means fewer processing artifacts, ensuring the recovered signal remains natural and undistorted.

* **Signal-to-Distortion Ratio** (**SDR**) is an overall measure of signal quality that combines interference suppression and artifact minimization. A higher SDR indicates a better-quality reconstructed signal, balancing both interference removal and minimal processing distortions.
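
For reference, these three definitions can be written out using the standard BSS Eval decomposition. The symbols $s_{\text{target}}$, $e_{\text{interf}}$, and $e_{\text{artif}}$ are notation introduced here for illustration: the estimate is split into a part explained by the clean speech, a part explained by the noise, and a remainder.

$$\hat{s} = s_{\text{target}} + e_{\text{interf}} + e_{\text{artif}}$$

$$\text{SDR} = 10 \log_{10} \frac{\|s_{\text{target}}\|^2}{\|e_{\text{interf}} + e_{\text{artif}}\|^2}, \qquad
\text{SIR} = 10 \log_{10} \frac{\|s_{\text{target}}\|^2}{\|e_{\text{interf}}\|^2}, \qquad
\text{SAR} = 10 \log_{10} \frac{\|s_{\text{target}} + e_{\text{interf}}\|^2}{\|e_{\text{artif}}\|^2}$$

All three ratios are expressed in dB. Because the SDR denominator contains both error terms, the SDR can never exceed what the interference and the artifacts together allow.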

Overlay the spectra of the speech signal, the noisy mixture, and the estimated speech on the same plot using three distinct colors. This visualization will help compare the spectral differences and assess the effectiveness of the enhancement process.

**Note**: Restrict the frequency axis to the range 50 to 5000 Hz to focus on the most relevant speech frequencies. Additionally, display both the x-axis (frequency) and the y-axis (magnitude) on a logarithmic scale to make spectral variations easier to see.
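
A minimal sketch of such a plot is given below, assuming a plain FFT magnitude spectrum is what is wanted (a smoothed estimate such as an averaged periodogram would also be reasonable). The helper names `band_limited_spectrum` and `plot_overlay` are hypothetical, and the example runs on a synthetic tone rather than on real speech.

```python
import numpy as np

def band_limited_spectrum(signal, fs=16000, f_min=50.0, f_max=5000.0):
    """Return (frequencies, magnitudes) of the one-sided FFT restricted to [f_min, f_max]."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= f_min) & (freqs <= f_max)
    return freqs[band], magnitudes[band]

def plot_overlay(signals, fs=16000):
    """Overlay the spectra of up to three labeled signals on log-log axes."""
    import matplotlib.pyplot as plt  # imported here so the helper above needs only NumPy
    colors = ["tab:blue", "tab:orange", "tab:green"]
    for (label, sig), color in zip(signals.items(), colors):
        freqs, mags = band_limited_spectrum(sig, fs)
        plt.loglog(freqs, mags, color=color, label=label, alpha=0.7)
    plt.xlabel("Frequency (Hz)")
    plt.ylabel("Magnitude")
    plt.legend()
    plt.show()

# Synthetic check: a 440 Hz tone sampled at 16 kHz for one second.
t = np.arange(16000) / 16000
tone = np.sin(2 * np.pi * 440 * t)
freqs, mags = band_limited_spectrum(tone)
```

A call such as `plot_overlay({"speech": s, "mixture": x, "estimate": s_hat})` would then draw the three curves, where `s`, `x`, and `s_hat` stand for your own single-channel signals.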

In [None]:
# TODO: code me!

Compute the average values of SIR, SAR, and SDR across the entire corpus to evaluate the overall performance of the separation system. This will provide a global assessment of interference suppression, artifact reduction, and signal quality across all samples.
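
The averaging step could look like the sketch below, assuming you have already reduced each file to one value per metric. The function name `average_metrics` and the dictionary layout are hypothetical, and the scores are placeholder numbers for illustration only, not real results. Non-finite values are skipped because BSS Eval can return NaN or infinite values on silent segments.

```python
import numpy as np

def average_metrics(per_file_metrics):
    """Average each metric over files, ignoring NaN and infinite entries."""
    averages = {}
    for name in ("SDR", "SIR", "SAR"):
        values = np.array([m[name] for m in per_file_metrics], dtype=float)
        values = values[np.isfinite(values)]
        averages[name] = float(values.mean()) if values.size else float("nan")
    return averages

# Placeholder per-file scores in dB (illustrative values, not measurements).
scores = [
    {"SDR": 8.0, "SIR": 15.0, "SAR": 9.0},
    {"SDR": 6.0, "SIR": 13.0, "SAR": 7.0},
    {"SDR": float("nan"), "SIR": 14.0, "SAR": 8.0},
]
averages = average_metrics(scores)
print(averages)
```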

In [None]:
# TODO: code me!

Select the MVDR model to generate inferences on the same set of mixtures. Compute the evaluation metrics (SIR, SAR, SDR) and compare the results with the previous estimations obtained using the FaSNet model to assess their relative performance.

In [None]:
# TODO: code me!

# Bonus (+3pts on Your Practicals Grade)

This part is optional and can be done at home until Friday, February 13th, at 11:59 PM.

By this deadline, you should:

- Use an ASR model (e.g., wav2vec, Whisper) to generate transcriptions of the clean speech.
- Apply automatic phoneme segmentation using Montreal Forced Aligner (MFA).
- Select a specific phoneme category (e.g., plosives, fricatives) and concatenate all occurrences of this category in the clean, mixture, and estimated speech (from one algorithm) to create three signals.
- Compute evaluation metrics comparing the clean and estimated speech of this phoneme category.
- Overlay and visualize the spectra of the three signals.
- Write a brief analysis summarizing the results of this experiment.
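
The concatenation step can be sketched as follows, assuming the alignment has already been converted into a list of `(start_s, end_s, phoneme)` tuples. That format, the function name `concat_phoneme_segments`, and the label set `PLOSIVES` are all hypothetical; the actual phone labels depend on the MFA dictionary and acoustic model you use.

```python
import numpy as np

# Hypothetical label set; adapt it to the phone inventory of your aligner.
PLOSIVES = {"p", "b", "t", "d", "k", "g"}

def concat_phoneme_segments(signal, intervals, labels, fs=16000):
    """Concatenate the samples of every interval whose phoneme label is in `labels`."""
    pieces = [
        signal[int(round(start * fs)):int(round(end * fs))]
        for start, end, phone in intervals
        if phone in labels
    ]
    return np.concatenate(pieces) if pieces else np.zeros(0, dtype=signal.dtype)

# Toy alignment and a ramp signal, so the selected sample indices are easy to check.
intervals = [(0.00, 0.10, "p"), (0.10, 0.30, "a"), (0.30, 0.35, "t")]
signal = np.arange(8000, dtype=float)
plosive_signal = concat_phoneme_segments(signal, intervals, PLOSIVES)
print(len(plosive_signal))
```

Applying the same intervals to the clean, mixture, and estimated signals keeps the three concatenated signals sample-aligned, which the evaluation metrics require.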