Time-Frequency Domain Filter-and-Sum Network for Multi-channel Speech Separation

This repository contains the model implementation for the paper "Time-Frequency Domain Filter-and-Sum Network for Multi-channel Speech Separation." The paper proposes a new approach to multi-channel speech separation that builds upon the implicit Filter-and-Sum Network (iFaSNet) by converting each module of the iFaSNet architecture to perform separation in the time-frequency domain. Our experimental results indicate that the proposed method outperforms iFaSNet under the considered conditions.

Model

We implement the Time-Frequency Domain Filter-and-Sum Network (TF-FaSNet) based on iFaSNet's overall structure. The network performs multi-channel speech separation in the time-frequency domain. Refer to the original paper for more information.

We propose the following improvements to enhance the performance of the iFaSNet model for separating mixtures:

  • Use a multi-path separation module for spectral mapping in the T-F domain
  • Add a 2D positional encoding to help the attention module learn spectro-temporal information
  • Use narrow-band feature extraction to exploit inter-channel cues of different speakers
  • Add a convolution module at the end of the separation module to capture local interactions and features (a hypothetical sketch follows this list)
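
For concreteness, the last item could resemble a Conformer-style convolution module applied to the T-F embedding. The sketch below is only an illustration under that assumption; the class name, kernel size, and layer order are not taken from model.py.

import torch
import torch.nn as nn

class ConvModule2d(nn.Module):
    # Hypothetical trailing convolution module over a (batch, D, T, F) map:
    # pointwise expansion with a GLU gate, depthwise conv for local context,
    # then a pointwise projection, wrapped in a residual connection.
    def __init__(self, D=16, kernel=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(D, 2 * D, kernel_size=1),
            nn.GLU(dim=1),
            nn.Conv2d(D, D, kernel_size=kernel, padding=kernel // 2, groups=D),
            nn.BatchNorm2d(D),
            nn.SiLU(),
            nn.Conv2d(D, D, kernel_size=1),
        )

    def forward(self, x):        # x: (batch, D, T, F)
        return x + self.net(x)   # residual keeps the block easy to stack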

The following flowchart depicts the TF-FaSNet model.

Usage

A minimal implementation of the TF-FaSNet model can be found in model.py.

Requirements

  • torch==1.13.1
  • torchaudio==0.13.1
  • positional-encodings==6.0.1
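
The pinned versions can be installed with pip, for example:

pip install torch==1.13.1 torchaudio==0.13.1 positional-encodings==6.0.1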

Dataset

The model is evaluated on a simulated 6-mic circular-array dataset. The data generation script is available here.

Model configurations

To use our model:

import torch
from model import make_TF_FaSNet  # factory function provided in model.py

# batch of 3 mixtures, 6 microphones, 64000 samples each
mix_audio = torch.randn(3, 6, 64000)
test_model = make_TF_FaSNet(
    nmic=6, nspk=2, n_fft=256, embed_dim=16,
    dim_nb=32, dim_ffn=64, n_conv_layers=2,
    B=4, I=8, J=1, H=128, L=4
)
separated_audio = test_model(mix_audio)

Each variable stands for:

  • General config:
    • nmic: Number of microphones
    • nspk: Number of speakers
    • n_fft: Number of FFT points
    • embed_dim: Embedding dimension for each T-F unit
  • Encoder-decoder:
    • dim_nb: Number of hidden units in the narrow-band feature extraction module
    • dim_ffn: Number of hidden units between the two linear layers in the context decoding module
    • n_conv_layers: Number of convolution blocks in the context decoding module
  • Multi-path separation module:
    • B: Number of multi-path blocks
    • I: Kernel size for Unfold and Deconv
    • J: Stride size for Unfold and Deconv
    • H: Number of hidden units in BLSTM
    • L: Number of heads in self-attention

With these configurations, the model achieves an average SI-SNR improvement of 15.5 dB on the simulated 6-mic circular-array dataset with a model size of 2.5M parameters.
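
For illustration only, the sketch below shows one way the separation-module hyperparameters (embed_dim, I, J, H, L) might fit together inside a single multi-path block, assuming a dual-path-style layout: a frequency scan, a time scan, and self-attention over frames. The class name and layout are assumptions, not the repository's implementation (see model.py); B such blocks would be stacked in sequence.

import torch
import torch.nn as nn

class MultiPathBlock(nn.Module):
    # Hypothetical multi-path block operating on a (batch, D, T, F) feature map.
    def __init__(self, D=16, I=8, J=1, H=128, L=4):
        super().__init__()
        self.I, self.J = I, J
        self.freq_rnn = nn.LSTM(D * I, H, batch_first=True, bidirectional=True)
        self.freq_out = nn.ConvTranspose1d(2 * H, D, kernel_size=I, stride=J)
        self.time_rnn = nn.LSTM(D * I, H, batch_first=True, bidirectional=True)
        self.time_out = nn.ConvTranspose1d(2 * H, D, kernel_size=I, stride=J)
        self.attn = nn.MultiheadAttention(D, num_heads=L, batch_first=True)

    def _scan(self, x, rnn, deconv):
        # x: (B, D, S, N). Unfold the last axis into patches of size I with
        # stride J, run a BLSTM over the patch sequence, then deconvolve back.
        # Assumes (N - I) is divisible by J so the sequence length is preserved.
        Bsz, D, S, N = x.shape
        p = x.unfold(-1, self.I, self.J)                     # (B, D, S, P, I)
        P = p.shape[3]
        p = p.permute(0, 2, 3, 1, 4).reshape(Bsz * S, P, D * self.I)
        h, _ = rnn(p)                                        # (B*S, P, 2H)
        y = deconv(h.transpose(1, 2))                        # (B*S, D, N)
        return y.reshape(Bsz, S, D, N).permute(0, 2, 1, 3)   # (B, D, S, N)

    def forward(self, x):                                    # x: (B, D, T, F)
        x = x + self._scan(x, self.freq_rnn, self.freq_out)  # scan across frequency
        x = x + self._scan(x.transpose(-1, -2), self.time_rnn, self.time_out).transpose(-1, -2)
        Bsz, D, T, Freq = x.shape
        q = x.permute(0, 3, 2, 1).reshape(Bsz * Freq, T, D)  # per-bin attention over frames
        a, _ = self.attn(q, q, q)
        return x + a.reshape(Bsz, Freq, T, D).permute(0, 3, 2, 1)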

Miscellaneous

Given a $D \times T \times F$ tensor, we apply 2D positional encoding as follows:

$$\begin{align*}
PE(t,f,2i) &= \sin\!\left(t/10000^{4i/D}\right)\\
PE(t,f,2i+1) &= \cos\!\left(t/10000^{4i/D}\right)\\
PE(t,f,2j+D/2) &= \sin\!\left(f/10000^{4j/D}\right)\\
PE(t,f,2j+1+D/2) &= \cos\!\left(f/10000^{4j/D}\right)
\end{align*}$$

where $t$ indexes $T$ frames, $f$ indexes $F$ frequencies, and $i,j \in [0, D/4)$ specify the dimension.
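
A direct transcription of this encoding in PyTorch might look as follows. Note that the repository depends on the positional-encodings package, whose internal channel ordering may differ; this snippet only mirrors the formula above.

import torch

def positional_encoding_2d(D, T, F):
    # Builds the D x T x F encoding defined above. Assumes D is divisible by 4;
    # the first D/2 channels encode the frame index, the rest the frequency index.
    assert D % 4 == 0
    pe = torch.zeros(D, T, F)
    t = torch.arange(T, dtype=torch.float32)            # frame index t
    f = torch.arange(F, dtype=torch.float32)            # frequency index f
    i = torch.arange(D // 4, dtype=torch.float32)       # dimension index i (= j)
    div = 10000.0 ** (4.0 * i / D)                      # 10000^{4i/D}, shape (D/4,)

    arg_t = t.unsqueeze(0) / div.unsqueeze(1)           # (D/4, T)
    pe[0:D // 2:2] = torch.sin(arg_t).unsqueeze(-1).expand(-1, -1, F)    # PE(t,f,2i)
    pe[1:D // 2:2] = torch.cos(arg_t).unsqueeze(-1).expand(-1, -1, F)    # PE(t,f,2i+1)

    arg_f = f.unsqueeze(0) / div.unsqueeze(1)           # (D/4, F)
    pe[D // 2::2] = torch.sin(arg_f).unsqueeze(1).expand(-1, T, -1)      # PE(t,f,2j+D/2)
    pe[D // 2 + 1::2] = torch.cos(arg_f).unsqueeze(1).expand(-1, T, -1)  # PE(t,f,2j+1+D/2)
    return pe

# Example: embed_dim=16 and n_fft=256 (129 one-sided frequency bins), 500 frames
pe = positional_encoding_2d(16, 500, 129)   # added to the D x T x F feature map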
