<a href="https://colab.research.google.com/github/Maheshcheegiti/SpeechEnhancementEspnet/blob/main/ESPnet_SpeechEnhancement_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ESPnet Speech Enhancement Demonstration

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fjRJCh96SoYLZPRxsjF9VDv4Q2VoIckI?usp=sharing)


This notebook demonstrates speech enhancement and separation using ESPnet2-SE.

- ESPnet2-SE: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/enh1

Authors: Chenda Li ([@LiChenda](https://github.com/LiChenda)), Wangyou Zhang ([@Emrys365](https://github.com/Emrys365))


## Install

In [None]:
!pip install kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"cheegitimahesh","key":"<redacted>"}'}

In [6]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
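
The three shell commands above place the API token where the Kaggle CLI looks for it and restrict the file to owner read/write. A minimal Python sketch of the same steps is shown below, which can also be used to check the result. The helper name `install_kaggle_token`, the placeholder token, and the temporary "home" directory are illustrative assumptions, not part of the Kaggle tooling.

```python
import json
import os
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(src: Path, home: Path) -> Path:
    """Copy kaggle.json into <home>/.kaggle and restrict it to mode 600."""
    # Fail early if the file is not JSON with the two expected fields.
    token = json.loads(src.read_text())
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)   # mkdir -p ~/.kaggle
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)                    # cp kaggle.json ~/.kaggle/
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)   # chmod 600
    return target


# Demonstration with a throwaway token in a temporary "home" directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    fake = tmp / "kaggle.json"
    fake.write_text(json.dumps({"username": "example-user", "key": "not-a-real-key"}))
    installed = install_kaggle_token(fake, tmp / "home")
    mode = stat.S_IMODE(installed.stat().st_mode)
    print(oct(mode))
```

In Colab, the return value of `files.upload()` is echoed when it is the last expression in a cell, so assigning it to a variable keeps the token contents out of the cell output.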

In [7]:
!kaggle datasets download -d jiangwq666/voicebank-demand

Downloading voicebank-demand.zip to /content
100% 5.24G/5.26G [00:51<00:00, 128MB/s]
100% 5.26G/5.26G [00:51<00:00, 109MB/s]


In [8]:
import zipfile

# Extract the downloaded archive into the dataset folder on Google Drive.
# The target path assumes Drive is already mounted at /content/drive.
with zipfile.ZipFile('/content/voicebank-demand.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/drive/MyDrive/Voicebank_demand')
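
Before extracting an archive of this size, it can help to look at its layout first. The sketch below counts archive members per top-level folder without extracting anything. The helper `summarize_zip` and the tiny stand-in archive are assumptions made for illustration; the folder names only mimic the dataset layout and the files are empty placeholders.

```python
import tempfile
import zipfile
from collections import Counter
from pathlib import Path


def summarize_zip(zip_path):
    """Count archive members per top-level folder without extracting them."""
    counts = Counter()
    with zipfile.ZipFile(zip_path, "r") as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            top_level = info.filename.split("/", 1)[0]
            counts[top_level] += 1
    return dict(counts)


# Build a tiny stand-in archive so the example runs without the real download.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("noisy_testset_wav/p232_001.wav", b"")
        archive.writestr("noisy_testset_wav/p232_002.wav", b"")
        archive.writestr("clean_testset_wav/p232_001.wav", b"")
    summary = summarize_zip(demo_zip)
    print(summary)
```

Pointing `summarize_zip` at `/content/voicebank-demand.zip` would list the folders that end up under the extraction directory.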

## Resampling the Datasets

In [11]:
import os
import soundfile as sf
import scipy.signal as signal
import wave

# Input directory containing audio files
input_dir = '/content/drive/MyDrive/Voicebank_demand/noisy_testset_wav'
# Output directory to save resampled audio files (created if it does not exist)
output_dir = '/content/drive/MyDrive/Voicebank_demand/noisy_testset_16k_wav'
os.makedirs(output_dir, exist_ok=True)

# Resampling rate
new_sr = 16000

# Loop through all audio files in input directory
for file_name in os.listdir(input_dir):
    if file_name.endswith('.wav'):
        # Load audio file
        audio, sr = sf.read(os.path.join(input_dir, file_name))

        # Resample audio to new sampling rate
        audio_resampled = signal.resample(audio, int(len(audio) * new_sr / sr))

        # Save resampled audio to output directory with 16-bit depth
        output_file_name = os.path.join(output_dir, file_name)
        sf.write(output_file_name, audio_resampled, new_sr, subtype='PCM_16')

        # Print the bit depth of the resampled audio file
        with wave.open(output_file_name, 'rb') as wave_file:
            print(f"Bit depth of {output_file_name} is {wave_file.getsampwidth() * 8} bits")

Bit depth of /content/drive/MyDrive/Voicebank_demand/noisy_testset_16k_wav/p232_001.wav is 16 bits
Bit depth of /content/drive/MyDrive/Voicebank_demand/noisy_testset_16k_wav/p232_002.wav is 16 bits
Bit depth of /content/drive/MyDrive/Voicebank_demand/noisy_testset_16k_wav/p232_003.wav is 16 bits
Bit depth of /content/drive/MyDrive/Voicebank_demand/noisy_testset_16k_wav/p232_005.wav is 16 bits
Bit depth of /content/drive/MyDrive/Voicebank_demand/noisy_testset_16k_wav/p232_006.wav is 16 bits
Bit depth of /content/drive/MyDrive/Voicebank_demand/noisy_testset_16k_wav/p232_007.wav is 16 bits
Bit depth of /content/drive/MyDrive/Voicebank_demand/noisy_testset_16k_wav/p232_009.wav is 16 bits
Bit depth of /content/drive/MyDrive/Voicebank_demand/noisy_testset_16k_wav/p232_010.wav is 16 bits
Bit depth of /content/drive/MyDrive/Voicebank_demand/noisy_testset_16k_wav/p232_011.wav is 16 bits
Bit depth of /content/drive/MyDrive/Voicebank_demand/noisy_testset_16k_wav/p232_012.wav is 16 bits
Bit depth 
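
The loop above computes the output length as `int(len(audio) * new_sr / sr)` and then reads the bit depth back through the `wave` module. The standard-library sketch below reproduces those two steps on a synthetic tone. The 48 kHz source rate and the helper names `resampled_length`, `write_pcm16`, and `describe_wav` are illustrative assumptions, and no actual resampling is performed here.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def resampled_length(num_samples, sr, new_sr):
    """Output sample count used by the loop above: int(n * new_sr / sr)."""
    return int(num_samples * new_sr / sr)


def write_pcm16(path, samples, sr):
    """Write mono float samples in [-1, 1] as 16-bit PCM."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)       # 2 bytes per sample = 16 bits
        wav.setframerate(sr)
        wav.writeframes(frames)


def describe_wav(path):
    """Return (sampling rate, bit depth, number of frames)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnframes()


# One second at an assumed 48 kHz becomes 16000 samples at 16 kHz.
n_out = resampled_length(48000, 48000, 16000)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tone.wav"
    tone = [0.5 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(n_out)]
    write_pcm16(path, tone, 16000)
    info = describe_wav(path)
    print(info)
```

The same `describe_wav` check can be run on any file in the resampled folder to confirm both the 16 kHz rate and the 16-bit depth.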

In [12]:
%pip install git+https://github.com/espnet/espnet
%pip install -q espnet_model_zoo

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/espnet/espnet
  Cloning https://github.com/espnet/espnet to /tmp/pip-req-build-k8j_fkdk
  Running command git clone --filter=blob:none --quiet https://github.com/espnet/espnet /tmp/pip-req-build-k8j_fkdk
  Resolved https://github.com/espnet/espnet to commit 33aa097148fd584f0efb7a655f6bcd9f5ded8c92
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... done
Collecting configargparse>=1.2.1
  Downloading ConfigArgParse-1.5.3-py3-none-any.whl (20 kB)
Collecting typeguard==2.13.3
  Downloading typeguard-2.13.3-py3-none-any.whl (17 kB)
Collecting humanfriendly
  Downloading humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 86.8/86.8 kB 4.7 MB/s eta 0:00:00
Collecting librosa==0.9.2
  Downloading librosa-0.9.2-py3-none-any.wh

## Speech Enhancement

### Download and load the pretrained Conv-TasNet


In [16]:
# !gdown --id 17DMWdw84wF3fz3t7ia1zssdzhkpVQGZm -O /content/chime_tasnet_singlechannel.zip
# !unzip /content/chime_tasnet_singlechannel.zip -d /content/enh_model_sc
# cfg = {
#     "train_config": "/content/enh_model_sc/exp/enh_train_enh_conv_tasnet_raw/config.yaml",
#     "model_file": "/content/enh_model_sc/exp/enh_train_enh_conv_tasnet_raw/5epoch.pth",
# }

##########################################################
# If the above command failed, try the following instead #
##########################################################
from espnet_model_zoo.downloader import ModelDownloader

d = ModelDownloader()
cfg = d.download_and_unpack("espnet/Wangyou_Zhang_chime4_enh_train_enh_conv_tasnet_raw")

Fetching 16 files:   0%|          | 0/16 [00:00<?, ?it/s]

In [18]:
# Load the model
# If you encounter the error "No module named 'espnet2'", re-run the ESPnet installation cell above. This might be a Colab bug.
import soundfile
from espnet2.bin.enh_inference import SeparateSpeech


# Build the inference wrapper from the config and checkpoint paths stored in `cfg`:
enh_model_sc = SeparateSpeech(
  train_config=cfg["train_config"],
  model_file=cfg["model_file"],
  # for segment-wise process on long speech
  normalize_segment_scale=False,
  show_progressbar=True,
  ref_channel=4,
  normalize_output_wav=True,
  # device="cuda:0",
)

### Enhance your own pre-recordings


## Enhanced Train Dataset

In [31]:
import os
from IPython.display import display, Audio
import soundfile

# Input directory containing audio files
input_dir = '/content/noisy_trainset_16spk_wav'
# Output directory to save enhanced audio files (created if it does not exist)
output_dir = '/content/noisy_trainset_wav_enhanced'
os.makedirs(output_dir, exist_ok=True)

sr = 16000

# Loop through all audio files in input directory
for file_name in os.listdir(input_dir):
    if file_name.endswith('.wav'):
        # Load audio file
        speech, rate = soundfile.read(os.path.join(input_dir, file_name))
        assert rate == sr, "mismatch in sampling rate"

        # Enhance audio using model
        enhanced_speech = enh_model_sc(speech[None, ...], sr)

        # Save enhanced audio to output directory with 16-bit depth
        output_file_name = os.path.join(output_dir, file_name)
        soundfile.write(output_file_name, enhanced_speech[0].squeeze(), sr, subtype='PCM_16')

        # Display original and enhanced audio
        # print(f"Your input speech {file_name}", flush=True)
        # display(Audio(speech, rate=sr))
        # print(f"Enhanced speech for {file_name}", flush=True)
        # display(Audio(enhanced_speech[0].squeeze(), rate=sr))


## Enhanced Test Dataset

In [None]:
import os
from IPython.display import display, Audio
import soundfile

# Input directory containing audio files
input_dir = '/content/drive/MyDrive/Voicebank_demand/noisy_testset_16k_wav'
# Output directory to save enhanced audio files (created if it does not exist)
output_dir = '/content/drive/MyDrive/Voicebank_demand/noisy_testset_16k_enhanced_wav'
os.makedirs(output_dir, exist_ok=True)

sr = 16000

# Loop through all audio files in input directory
for file_name in os.listdir(input_dir):
    if file_name.endswith('.wav'):
        # Load audio file
        speech, rate = soundfile.read(os.path.join(input_dir, file_name))
        assert rate == sr, "mismatch in sampling rate"

        # Enhance audio using model
        enhanced_speech = enh_model_sc(speech[None, ...], sr)

        # Save enhanced audio to output directory with 16-bit depth
        output_file_name = os.path.join(output_dir, file_name)
        soundfile.write(output_file_name, enhanced_speech[0].squeeze(), sr, subtype='PCM_16')

        # Display original and enhanced audio
        # print(f"Your input speech {file_name}", flush=True)
        # display(Audio(speech, rate=sr))
        # print(f"Enhanced speech for {file_name}", flush=True)
        # display(Audio(enhanced_speech[0].squeeze(), rate=sr))
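
Both enhancement loops above process every file in the input folder, so an interrupted Colab session means starting over. One possible way to make such a loop resumable is to skip files that already have an output. The helper `pending_files` and the temporary folders below are hypothetical and only illustrate the idea; they are not part of ESPnet.

```python
import tempfile
from pathlib import Path


def pending_files(input_dir, output_dir, suffix=".wav"):
    """Return sorted input file names that have no output file yet."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    done = {p.name for p in output_dir.glob(f"*{suffix}")}
    return sorted(p.name for p in input_dir.glob(f"*{suffix}") if p.name not in done)


# Demonstration on empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    noisy = Path(tmp) / "noisy"
    enhanced = Path(tmp) / "enhanced"
    noisy.mkdir()
    for name in ("p1.wav", "p2.wav", "p3.wav", "notes.txt"):
        (noisy / name).write_bytes(b"")
    enhanced.mkdir()
    (enhanced / "p2.wav").write_bytes(b"")   # pretend this one is already enhanced
    todo = pending_files(noisy, enhanced)
    print(todo)
```

Iterating over `pending_files(input_dir, output_dir)` instead of `os.listdir(input_dir)` would leave the enhancement step itself unchanged.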


In [33]:
!git clone https://github.com/IMLHF/PHASEN-PyTorch.git

Cloning into 'PHASEN-PyTorch'...
remote: Enumerating objects: 397, done.
remote: Counting objects: 100% (397/397), done.
remote: Compressing objects: 100% (254/254), done.
remote: Total 397 (delta 231), reused 298 (delta 132), pack-reused 0
Receiving objects: 100% (397/397), 167.60 KiB | 4.66 MiB/s, done.
Resolving deltas: 100% (231/231), done.


In [34]:
%cd PHASEN-PyTorch/phasen_torch

/content/PHASEN-PyTorch/phasen_torch


In [35]:
%pip install pesq

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [36]:
%pip install pystoi

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [37]:
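from pathlib import Path
# sepm is imported from the current working directory, which was changed
# to PHASEN-PyTorch/phasen_torch in the cell above.
from sepm import compare
import time
import numpy as np

def calculate_pm(ref_dir, deg_dir):
    """Average the per-file scores returned by sepm.compare and print them."""
    print("Calculate PM:")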
    t1 = time.time()
    res = compare(str(ref_dir), str(deg_dir))
    t2 = time.time()

    pm = np.array([x[1:] for x in res])
    pm = np.mean(pm,axis=0)
    print('time: %.3f' % (t2-t1))
    # print('ref=', ref_dir)
    # print('deg=', deg_dir)
    print('csig:%6.4f cbak:%6.4f covl:%6.4f pesq:%6.4f ssnr:%6.4f' % tuple(pm))
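
The aggregation in `calculate_pm` drops the first element of each result row and averages the remaining columns. The sketch below shows that arithmetic with the standard library only. It assumes each row looks like `(file name, csig, cbak, covl, pesq, ssnr)`, which is inferred from `x[1:]` and the print format rather than confirmed from `sepm`; the rows themselves are made up purely to show the computation.

```python
from statistics import fmean

METRIC_NAMES = ("csig", "cbak", "covl", "pesq", "ssnr")


def average_metrics(rows):
    """Average each metric column over rows shaped like (name, *metrics)."""
    if not rows:
        raise ValueError("no results to average")
    # Drop the leading identifier, then transpose rows into columns.
    columns = zip(*(row[1:] for row in rows))
    return {name: fmean(col) for name, col in zip(METRIC_NAMES, columns)}


# Hypothetical per-file rows; these are not results from the notebook.
rows = [
    ("file_a.wav", 3.0, 2.0, 2.5, 2.0, -7.0),
    ("file_b.wav", 4.0, 3.0, 3.5, 3.0, -5.0),
]
averages = average_metrics(rows)
print(averages)
```

Returning a dictionary instead of printing makes it easier to compare the train and test results side by side later.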

## Evaluation of Train Dataset

In [38]:
ref_dir = "/content/clean_trainset_16spk_wav"
deg_dir = "/content/noisy_trainset_wav_enhanced"
calculate_pm(ref_dir, deg_dir)

Calculate PM:


Calculating: 11572it [1:36:33,  2.00it/s]

time: 5793.621
csig:2.5857 cbak:1.8110 covl:2.1976 pesq:1.8922 ssnr:-7.3048





In [39]:
import os
import soundfile as sf
from pystoi import stoi

clean_dir = "/content/clean_trainset_16spk_wav"
processed_dir = "/content/noisy_trainset_wav_enhanced"

# Loop over each file in the clean directory and calculate STOI
overall_stoi = 0.0
num_files = 0
for filename in os.listdir(clean_dir):
    if filename.endswith(".wav"):
        clean_path = os.path.join(clean_dir, filename)
        processed_path = os.path.join(processed_dir, filename)

        # Load the clean speech and processed speech signals
        clean_signal, sr = sf.read(clean_path)
        processed_signal, processed_sr = sf.read(processed_path)
        assert sr == processed_sr, "mismatch in sampling rate"

        # Ensure both signals have the same length
        min_length = min(len(clean_signal), len(processed_signal))
        clean_signal = clean_signal[:min_length]
        processed_signal = processed_signal[:min_length]

        # Calculate STOI between the two signals
        stoi_value = stoi(clean_signal, processed_signal, sr, extended=False)

        # Add to overall STOI and increment file count
        overall_stoi += stoi_value
        num_files += 1

# Calculate the average STOI over all files
average_stoi = overall_stoi / num_files

print("Overall STOI: {:.3f}".format(average_stoi))

Overall STOI: 0.869
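
The STOI cell above pairs files by name and averages one score per pair. A slightly more defensive version of that pattern is sketched below: it reports files without an enhanced counterpart instead of failing, and refuses to divide by zero. The helper `average_over_pairs` is hypothetical, and the `size_ratio` stand-in only compares file sizes so the example runs without audio libraries; it does not measure intelligibility.

```python
import os
import tempfile


def average_over_pairs(clean_dir, processed_dir, metric, suffix=".wav"):
    """Average metric(clean_path, processed_path) over files present in both folders."""
    scores = []
    skipped = []
    for name in sorted(os.listdir(clean_dir)):
        if not name.endswith(suffix):
            continue
        processed_path = os.path.join(processed_dir, name)
        if not os.path.exists(processed_path):
            skipped.append(name)      # no counterpart: report instead of crashing
            continue
        scores.append(metric(os.path.join(clean_dir, name), processed_path))
    if not scores:
        raise ValueError("no matching file pairs found")
    return sum(scores) / len(scores), skipped


# Stand-in metric for the demo: compares file sizes, NOT speech quality.
def size_ratio(clean_path, processed_path):
    return os.path.getsize(processed_path) / os.path.getsize(clean_path)


with tempfile.TemporaryDirectory() as clean, tempfile.TemporaryDirectory() as proc:
    for name, clean_bytes, proc_bytes in [("a.wav", 4, 2), ("b.wav", 4, 4)]:
        with open(os.path.join(clean, name), "wb") as f:
            f.write(b"\0" * clean_bytes)
        with open(os.path.join(proc, name), "wb") as f:
            f.write(b"\0" * proc_bytes)
    with open(os.path.join(clean, "c.wav"), "wb") as f:
        f.write(b"\0" * 4)            # deliberately has no processed counterpart
    mean_score, skipped = average_over_pairs(clean, proc, size_ratio)
    print(mean_score, skipped)
```

In the notebook, `metric` could be a small function that reads both files with `sf.read`, trims them to the same length, and returns the `stoi` value.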


## Evaluation of Test Dataset

In [40]:
ref_dir = "/content/clean_testset_16k_wav"
deg_dir = "/content/noisy_testset_wav_enhanced"
calculate_pm(ref_dir, deg_dir)

Calculate PM:


Calculating: 824it [06:12,  2.21it/s]

time: 372.755
csig:3.1407 cbak:2.0310 covl:2.6687 pesq:2.2334 ssnr:-7.2587





In [41]:
import os
import soundfile as sf
from pystoi import stoi

clean_dir = "/content/clean_testset_16k_wav"
processed_dir = "/content/noisy_testset_wav_enhanced"

# Loop over each file in the clean directory and calculate STOI
overall_stoi = 0.0
num_files = 0
for filename in os.listdir(clean_dir):
    if filename.endswith(".wav"):
        clean_path = os.path.join(clean_dir, filename)
        processed_path = os.path.join(processed_dir, filename)

        # Load the clean speech and processed speech signals
        clean_signal, sr = sf.read(clean_path)
        processed_signal, processed_sr = sf.read(processed_path)
        assert sr == processed_sr, "mismatch in sampling rate"

        # Ensure both signals have the same length
        min_length = min(len(clean_signal), len(processed_signal))
        clean_signal = clean_signal[:min_length]
        processed_signal = processed_signal[:min_length]

        # Calculate STOI between the two signals
        stoi_value = stoi(clean_signal, processed_signal, sr, extended=False)

        # Add to overall STOI and increment file count
        overall_stoi += stoi_value
        num_files += 1

# Calculate the average STOI over all files
average_stoi = overall_stoi / num_files

print("Overall STOI: {:.3f}".format(average_stoi))

Overall STOI: 0.932


In [19]:
# from google.colab import files
# from IPython.display import display, Audio
# import soundfile

# uploaded = files.upload()

# for file_name in uploaded.keys():
#   speech, rate = soundfile.read(file_name)
#   sr = rate
#   assert rate == sr, "mismatch in sampling rate"
#   wave = enh_model_sc(speech[None, ...], sr)
#   print(f"Your input speech {file_name}", flush=True)
#   display(Audio(speech, rate=sr))
#   print(f"Enhanced speech for {file_name}", flush=True)
#   display(Audio(wave[0].squeeze(), rate=sr))

