An assignment for the artificial intelligence department. A simple implementation of a music separation project, based on the Demucs paper.
- Windows10
- Ubuntu20.04
- macOS (CPU only)
python run.py --c configs/config.yaml
You will then find the results in the records folder.
The first step is always to prepare the data. We use a dataset from SigSep, an open-source website that hosts many kinds of audio data. We selected the following one:
For Southeast University students, we have uploaded the dataset to pan.seu.edu.cn to speed up your download. Here is the link:
After downloading and unzipping, please convert the audio from mp3 to .wav, since the current (Nov 2023) torchaudio only supports the wav format. You can run the following bash script directly (remember to change the location!). I recommend placing the musicdb18 folder next to the project folder:
audioSep Project
|--changeAudioFormat.bash
|....
musicdb18
|-- piece1.mp3
|-- piece2.mp3
|...
# make sure you are in the project workspace
chmod +x changeAudioFormat.bash
./changeAudioFormat.bash
After running the script, you will see a musicdb18_wav folder in your project folder. For more details about this dataset, please refer to the introduction site or open the readme in the downloaded original dataset folder.
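For readers who prefer Python over bash, the conversion step can be sketched roughly as follows. This is not the content of changeAudioFormat.bash, which is not shown here; it is a minimal sketch that assumes ffmpeg is installed and on the PATH. The function names `wav_target` and `convert_tree` are hypothetical.

```python
import subprocess
from pathlib import Path


def wav_target(mp3_path: Path, src_root: Path, dst_root: Path) -> Path:
    """Map src_root/a/b.mp3 to dst_root/a/b.wav, keeping the sub-folder layout."""
    return (dst_root / mp3_path.relative_to(src_root)).with_suffix(".wav")


def convert_tree(src_root: Path, dst_root: Path, dry_run: bool = False) -> list[list[str]]:
    """Build one ffmpeg command per mp3 file under src_root and run it unless dry_run is set."""
    commands = []
    for mp3_path in sorted(src_root.rglob("*.mp3")):
        out_path = wav_target(mp3_path, src_root, dst_root)
        # Assumption: ffmpeg is available; -y overwrites existing output files.
        cmd = ["ffmpeg", "-y", "-i", str(mp3_path), str(out_path)]
        commands.append(cmd)
        if not dry_run:
            out_path.parent.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return commands
```

With the folder layout recommended above, a call such as `convert_tree(Path("../musicdb18"), Path("musicdb18_wav"))` would write the wav files into the project folder. Passing `dry_run=True` only returns the commands, which is handy for checking the paths first.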
- mono channel audio only.
The stoi function is designed to evaluate the intelligibility of speech signals, which are typically mono. Intelligibility is a measure of how comprehensible speech is in given conditions, and for this measurement, stereo or multi-channel audio does not provide additional information compared to mono audio.
If the source or predicted audio is stereo (i.e., has 2 channels), it's common practice to either:
- (The method we adopt) Average the channels to get a mono signal.
- Evaluate the metric on each channel separately and then average the results.
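The two options listed above can be sketched in a few lines. This is an illustration with plain Python lists, not the project's code: `to_mono`, `metric_on_downmix`, and `metric_per_channel` are hypothetical names, and `metric` stands for any mono-only function such as a STOI implementation.

```python
from typing import Callable, Sequence

Channel = Sequence[float]
Metric = Callable[[Channel, Channel], float]


def to_mono(channels: Sequence[Channel]) -> list[float]:
    """Average all channels sample by sample to obtain one mono signal."""
    n_channels = len(channels)
    return [sum(samples) / n_channels for samples in zip(*channels)]


def metric_on_downmix(metric: Metric, reference: Sequence[Channel], estimate: Sequence[Channel]) -> float:
    """Option 1: downmix both signals to mono, then evaluate the metric once."""
    return metric(to_mono(reference), to_mono(estimate))


def metric_per_channel(metric: Metric, reference: Sequence[Channel], estimate: Sequence[Channel]) -> float:
    """Option 2: evaluate the metric on each channel, then average the scores."""
    scores = [metric(ref, est) for ref, est in zip(reference, estimate)]
    return sum(scores) / len(scores)


def toy_error(ref: Channel, est: Channel) -> float:
    """Stand-in metric (mean absolute error), used only for the demo. Not STOI."""
    return sum(abs(r - e) for r, e in zip(ref, est)) / len(ref)


reference = [[1.0, 1.0], [3.0, 3.0]]  # two channels, two samples each
estimate = [[2.0, 2.0], [2.0, 2.0]]
downmix_score = metric_on_downmix(toy_error, reference, estimate)
per_channel_score = metric_per_channel(toy_error, reference, estimate)
```

In this toy case the two options disagree: the downmixed signals are identical, while each individual channel is off by one. Averaging channels first can therefore hide per-channel errors, which is worth keeping in mind when comparing scores.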
- mono channel audio only
Like STOI, PESQ is designed for mono signals and particularly for evaluating the quality of speech signals. For stereo or multi-channel audio, the same approach as STOI can be taken.
Caveat: PESQ is based on perceptual models, so the results can be affected if applied to non-speech signals.
- (The method we adopt) Average the channels to get a mono signal.
Supports multi-channel audio: SDR can be computed for multi-channel audio. It is typically computed channel-wise, and the results are then averaged.
Supports multi-channel audio: SNR can be computed for multi-channel audio. Like SDR, we typically compute SNR for each channel separately and then average.
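The channel-wise scheme described for SDR and SNR can be sketched as below. The sketch assumes the simple energy-ratio form, 10·log10(‖s‖² / ‖s − ŝ‖²), and not the full BSS Eval decomposition; the project's actual metric code may differ. The names `energy_ratio_db` and `multichannel_db` and the `eps` parameter are hypothetical.

```python
import math
from typing import Sequence


def energy_ratio_db(reference: Sequence[float], estimate: Sequence[float], eps: float = 1e-10) -> float:
    """Ratio of reference energy to error energy for one channel, in dB."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps guards against division by zero and log(0) for silent or perfect signals.
    return 10.0 * math.log10((signal + eps) / (error + eps))


def multichannel_db(reference: Sequence[Sequence[float]], estimate: Sequence[Sequence[float]]) -> float:
    """Compute the ratio for each channel separately, then average the dB values."""
    scores = [energy_ratio_db(ref, est) for ref, est in zip(reference, estimate)]
    return sum(scores) / len(scores)


reference = [[1.0, 1.0], [2.0, 2.0]]  # two channels, two samples each
estimate = [[0.9, 0.9], [1.8, 1.8]]   # each sample underestimated by 10%
score = multichannel_db(reference, estimate)
```

Here every sample is off by 10% of its amplitude, so each channel has an error energy of about 1% of the signal energy, and the averaged score comes out at roughly 20 dB.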
Supports multi-channel audio: SIR measures the amount of interference from other sources in the separated source. A higher SIR indicates that the separated source contains less interference from other sources, which means the model performs better.