# Speech Separation Experiments

In this file, we run basic experiments with two speech separation frameworks to determine which one we will use for our app:

- [SpeechBrain's SepFormer](https://github.com/speechbrain/speechbrain)
- [Facebook's svoice](https://github.com/facebookresearch/svoice)
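
To choose between the two frameworks, a numeric score is useful alongside listening to the outputs. One metric widely used for speech separation is scale-invariant signal-to-noise ratio (SI-SNR). The sketch below is a minimal pure-Python version, assuming a clean reference signal is available for each estimated source. The function name `si_snr` and the `eps` parameter are illustrative and are not part of SpeechBrain or svoice.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean so a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]

    # Project the estimate onto the target: this is the "signal" part.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    signal = [scale * t for t in tgt]

    # Whatever is left after the projection counts as noise.
    noise = [e - s for e, s in zip(est, signal)]
    signal_energy = sum(s * s for s in signal)
    noise_energy = sum(x * x for x in noise) + eps

    return 10.0 * math.log10(signal_energy / noise_energy + eps)
```

Separation models return their sources in arbitrary order, so with two speakers it is reasonable to score both possible assignments of estimates to references and keep the better one.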

### SepFormer

In [1]:
%pip install speechbrain
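# Quote the version specifiers: unquoted, the shell parses `<2.9` as an input redirection,
# which caused the "2.9: No such file or directory" error in the output below.
%pip install "torchaudio>=2.1.0,<2.9" "torch>=2.1.0,<2.9"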
# !wget https://raw.githubusercontent.com/speechbrain/speechbrain/develop/requirements.txt
# !sed -i '/lint-requirements.txt/d' requirements.txt
# !pip install -r requirements.txt

/bin/bash: line 1: 2.9: No such file or directory


In [2]:
!pip uninstall -y transformers
!pip install "transformers<5.0"
!pip uninstall -y huggingface_hub
!pip install "huggingface_hub<1.0"


Found existing installation: transformers 4.57.6
Uninstalling transformers-4.57.6:
  Successfully uninstalled transformers-4.57.6
Collecting transformers<5.0
  Using cached transformers-4.57.6-py3-none-any.whl.metadata (43 kB)
Using cached transformers-4.57.6-py3-none-any.whl (12.0 MB)
Installing collected packages: transformers
Successfully installed transformers-4.57.6
Found existing installation: huggingface_hub 0.36.2
Uninstalling huggingface_hub-0.36.2:
  Successfully uninstalled huggingface_hub-0.36.2
Collecting huggingface_hub<1.0
  Using cached huggingface_hub-0.36.2-py3-none-any.whl.metadata (15 kB)
Using cached huggingface_hub-0.36.2-py3-none-any.whl (566 kB)
Installing collected packages: huggingface_hub
Successfully installed huggingface_hub-0.36.2


In [3]:
# Complete patch for torchaudio 2.9.x compatibility
import torchaudio
import soundfile as sf
import torch

# Patch 1: Add missing list_audio_backends function
if not hasattr(torchaudio, "list_audio_backends"):
    torchaudio.list_audio_backends = lambda: ["soundfile"]

# Patch 2: Monkey patch the load function
original_load = torchaudio.load if not hasattr(torchaudio, '_original_load') else torchaudio._original_load

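def patched_load(filepath, *args, **kwargs):
    # Note: extra torchaudio.load arguments (e.g. frame_offset, num_frames)
    # are accepted for signature compatibility but ignored here.
    data, samplerate = sf.read(filepath, dtype='float32')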
    data = torch.from_numpy(data)
    
    if data.dim() == 1:
        data = data.unsqueeze(0)  # (samples,) -> (1, samples)
    else:
        data = data.T  # (samples, channels) -> (channels, samples)
    
    return data, samplerate

torchaudio._original_load = original_load
torchaudio.load = patched_load

# Patch 3: Monkey patch the save function
original_save = torchaudio.save if not hasattr(torchaudio, '_original_save') else torchaudio._original_save

def patched_save(filepath, src, sample_rate, channels_first=True, **kwargs):
    # src is expected to be (channels, samples) if channels_first=True
    # soundfile expects (samples, channels)
    
    if channels_first:
        if src.dim() == 1:
            data = src.unsqueeze(1)  # (samples,) -> (samples, 1)
        else:
            data = src.T  # (channels, samples) -> (samples, channels)
    else:
        data = src
    
    # Convert to numpy and save
    data_np = data.detach().cpu().numpy()
    sf.write(filepath, data_np, sample_rate)

torchaudio._original_save = original_save
torchaudio.save = patched_save

In [4]:
import torch
import torchaudio
import speechbrain
import transformers
import huggingface_hub

print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("speechbrain:", speechbrain.__version__)
print("transformers:", transformers.__version__)
print("huggingface_hub:", huggingface_hub.__version__)

DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _speechbrain_save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _speechbrain_load
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for load
  self.setter(val)
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _recover


torch: 2.9.0+cpu
torchaudio: 2.9.0+cpu
speechbrain: 1.0.3
transformers: 4.57.6
huggingface_hub: 0.36.2
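
The printed versions show torch and torchaudio at 2.9.0, which falls outside the `<2.9` upper bound requested in the install cell. A small check like the sketch below would surface that mismatch early. The helper names `parse_release` and `within_bounds` are hypothetical, and the parser only handles simple dotted numeric releases with an optional local suffix such as `+cpu`. For anything more involved, the `packaging` library is the usual tool.

```python
def parse_release(version):
    """Turn '2.9.0+cpu' into (2, 9, 0), ignoring any local '+...' suffix."""
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def within_bounds(version, lower, upper):
    """True if lower <= version < upper, comparing numeric release tuples."""
    parsed = [parse_release(v) for v in (lower, version, upper)]
    width = max(len(p) for p in parsed)
    # Pad with zeros so that '2.9' and '2.9.0' compare as equal.
    lo, ver, hi = (p + (0,) * (width - len(p)) for p in parsed)
    return lo <= ver < hi


# Versions copied from the output above.
for name, installed in [("torch", "2.9.0+cpu"), ("torchaudio", "2.9.0+cpu")]:
    ok = within_bounds(installed, "2.1.0", "2.9")
    print(f"{name} {installed}: {'ok' if ok else 'outside >=2.1.0,<2.9'}")
```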


In [5]:
from speechbrain.inference.separation import SepformerSeparation as separator

model = separator.from_hparams(
    source="speechbrain/sepformer-wsj02mix", 
    savedir='pretrained_models/sepformer-wsj02mix'
)

INFO:speechbrain.utils.fetching:Fetch hyperparams.yaml: Fetching from HuggingFace Hub 'speechbrain/sepformer-wsj02mix' if not cached


hyperparams.yaml: 0.00B [00:00, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--sepformer-wsj02mix/snapshots/3a2826343a10e2d2e8a75f79aeab5ff3a2473531/hyperparams.yaml' -> '/content/pretrained_models/sepformer-wsj02mix/hyperparams.yaml'
INFO:speechbrain.utils.fetching:Fetch custom.py: Fetching from HuggingFace Hub 'speechbrain/sepformer-wsj02mix' if not cached
DEBUG:speechbrain.utils.parameter_transfer:Collecting files (or symlinks) for pretraining in pretrained_models/sepformer-wsj02mix.
INFO:speechbrain.utils.fetching:Fetch masknet.ckpt: Fetching from HuggingFace Hub 'speechbrain/sepformer-wsj02mix' if not cached


masknet.ckpt:   0%|          | 0.00/113M [00:00<?, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--sepformer-wsj02mix/snapshots/3a2826343a10e2d2e8a75f79aeab5ff3a2473531/masknet.ckpt' -> '/content/pretrained_models/sepformer-wsj02mix/masknet.ckpt'
DEBUG:speechbrain.utils.parameter_transfer:Set local path in self.paths["masknet"] = /content/pretrained_models/sepformer-wsj02mix/masknet.ckpt
INFO:speechbrain.utils.fetching:Fetch encoder.ckpt: Fetching from HuggingFace Hub 'speechbrain/sepformer-wsj02mix' if not cached


encoder.ckpt:   0%|          | 0.00/17.3k [00:00<?, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--sepformer-wsj02mix/snapshots/3a2826343a10e2d2e8a75f79aeab5ff3a2473531/encoder.ckpt' -> '/content/pretrained_models/sepformer-wsj02mix/encoder.ckpt'
DEBUG:speechbrain.utils.parameter_transfer:Set local path in self.paths["encoder"] = /content/pretrained_models/sepformer-wsj02mix/encoder.ckpt
INFO:speechbrain.utils.fetching:Fetch decoder.ckpt: Fetching from HuggingFace Hub 'speechbrain/sepformer-wsj02mix' if not cached


decoder.ckpt:   0%|          | 0.00/17.2k [00:00<?, ?B/s]

DEBUG:speechbrain.utils.fetching:Fetch: Local file found, creating symlink '/root/.cache/huggingface/hub/models--speechbrain--sepformer-wsj02mix/snapshots/3a2826343a10e2d2e8a75f79aeab5ff3a2473531/decoder.ckpt' -> '/content/pretrained_models/sepformer-wsj02mix/decoder.ckpt'
DEBUG:speechbrain.utils.parameter_transfer:Set local path in self.paths["decoder"] = /content/pretrained_models/sepformer-wsj02mix/decoder.ckpt
INFO:speechbrain.utils.parameter_transfer:Loading pretrained files for: masknet, encoder, decoder
DEBUG:speechbrain.utils.parameter_transfer:Redirecting (loading from local path): masknet -> /content/pretrained_models/sepformer-wsj02mix/masknet.ckpt
DEBUG:speechbrain.utils.parameter_transfer:Redirecting (loading from local path): encoder -> /content/pretrained_models/sepformer-wsj02mix/encoder.ckpt
DEBUG:speechbrain.utils.parameter_transfer:Redirecting (loading from local path): decoder -> /content/pretrained_models/sepformer-wsj02mix/decoder.ckpt


In [4]:
est_sources = model.separate_file(path='speechbrain/sepformer-wsj02mix/test_mixture.wav') 

NameError: name 'model' is not defined

In [5]:
torchaudio.save("source1hat.wav", est_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save("source2hat.wav", est_sources[:, :, 1].detach().cpu(), 8000)

NameError: name 'est_sources' is not defined
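
The `8000` passed to `torchaudio.save` matches the 8 kHz sample rate of the WSJ0-2mix recordings this checkpoint was trained on, so a mixture recorded at a different rate would likely need resampling before separation. The sketch below uses only the standard-library `wave` module to inspect a file first. The names `wav_info`, `needs_resampling`, and `EXPECTED_RATE` are hypothetical, and the code assumes uncompressed PCM WAV input.

```python
import wave

EXPECTED_RATE = 8000  # assumed from the sample rate used when saving above


def wav_info(path):
    """Return (sample_rate, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """True if the file's sample rate differs from the expected rate."""
    rate, _, _ = wav_info(path)
    return rate != expected_rate
```

Calling `needs_resampling` on your own mixture before `model.separate_file` gives an early warning if the recording does not match the rate the model was trained on.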

In [None]:
from IPython.display import Audio

Audio("source1hat.wav")


In [19]:
Audio("source2hat.wav")