A Python implementation of “Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer”
Contributions
- Self-supervised learning of spatial acoustic representation (SSL-SAR)
  - the first self-supervised learning method for spatial acoustic representation learning and multi-channel audio signal processing
  - designs a cross-channel signal reconstruction pretext task to learn both spatial acoustic and spectral pattern information
  - learns useful knowledge that can be transferred to spatial-acoustics-related tasks
- Multi-channel audio Conformer (MC-Conformer)
  - a unified architecture for both the pretext and downstream tasks
  - learns the local and global properties of spatial acoustics present in the time-frequency domain
  - boosts the performance of both pretext and downstream tasks
- Source signals: from the WSJ0 database
- Simulated RIRs: generated with the gpuRIR toolbox
- Simulated noise: generated with an arbitrary noise field generator
- Real-world RIRs and microphone signals: from the MIR, MeshRIR, DCASE, dEchorate, BUTReverb, ACE, and LOCATA databases
| Datasets | #Room | Microphone Array | #Mic. Pair | #Room × #Source Position × #Array Position | Noise Type |
| --- | --- | --- | --- | --- | --- |
| MIR | 3 | Three 8-channel linear arrays | 60 | 3 × 26 × 1 | W/o |
| MeshRIR | 1 | 441 microphones | 8874 | 1 × 32 × 1 | W/o |
| DCASE | 9 | A 4-channel tetrahedral array (EM32) | 3 | 38530 | Ambience |
| dEchorate | 11 | Six 5-channel linear arrays | 48 | 11 × 3 × 1 | Ambience, babble, white |
| BUTReverb | 9 | An 8-channel spherical array | 28 | 51 | Ambience |
| ACE | 7 | A 2-channel array (Chromebook), a 3-channel right-angled triangle array (Mobile), an 8-channel linear array (Lin8Ch), a 32-channel spherical array (EM32) | 433 | 7 × 1 × 2 | Ambience, babble, fan |
| LOCATA | 1 | A 15-channel linear array (DICIT), a 12-channel robot array (Robot head), a 32-channel spherical array (Eigenmike) | 492 | Moving/static | Ambience |
Preparation
- Download datasets to folders according to the following directory structure:
```
SAR-SSL
├── code
├── data
│   ├── SouSig
│   │   └── wsj0
│   │       ├── dt
│   │       ├── et
│   │       └── tr
│   ├── RIR
│   │   ├── DCASE
│   │   │   ├── TAU-SRIR_DB
│   │   │   └── TAU-SNoise_DB
│   │   ├── Mesh
│   │   │   └── S32-M441_npy
│   │   ├── MIRDB
│   │   │   └── Impulse_response_Acoustic_Lab_Bar-Ilan_University
│   │   ├── dEchorate
│   │   │   ├── dEchorate_database.csv
│   │   │   ├── dEchorate_rir.h5
│   │   │   ├── dEchorate_annotations.h5
│   │   │   ├── dEchorate_noise_gzip7.hdf5
│   │   │   ├── dEchorate_babble_gzip7.hdf5
│   │   │   └── dEchorate_silence_gzip7.hdf5
│   │   ├── BUTReverb
│   │   │   └── RIRs
│   │   └── ACE
│   │       ├── RIRN
│   │       └── Data
│   └── SenSig
│       └── LOCATA
│           ├── dev
│           └── eval
└── exp
```
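Before running the generation scripts, it can save time to check that the datasets actually landed in the expected folders. A minimal stdlib-only sketch (the path list is a partial mirror of the tree above, not an exhaustive check; adjust `data_root` to your checkout):

```python
from pathlib import Path

# A few expected subfolders of the data/ directory (partial; mirrors the tree above)
EXPECTED = [
    "SouSig/wsj0/tr",
    "SouSig/wsj0/dt",
    "SouSig/wsj0/et",
    "RIR/DCASE/TAU-SRIR_DB",
    "RIR/Mesh/S32-M441_npy",
    "SenSig/LOCATA/dev",
    "SenSig/LOCATA/eval",
]

def missing_paths(data_root):
    """Return the expected dataset paths that do not exist under data_root."""
    root = Path(data_root)
    return [p for p in EXPECTED if not (root / p).exists()]
```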
- Install dependencies: numpy, scipy, soundfile, gpuRIR, etc.
Data generation
- Simulated data
  ```shell
  python data_generation_SimulatedSIG_notspecifyroom.py --stage pretrain --wnoise --gpu-id [*]
  python data_generation_SimulatedSIG_notspecifyroom.py --stage preval --wnoise --gpu-id [*]
  python data_generation_SimulatedSIG_notspecifyroom.py --stage test --wnoise --gpu-id [*]
  ```
- Real-world data (DCASE, MeshRIR, MIR, ACE, dEchorate, BUTReverb)
  - select recorded RIRs and noise signals
  ```shell
  python data_generation_MeasuredRIR.py --data-id 0 --data-type rir noise  # DCASE
  python data_generation_MeasuredRIR.py --data-id 3 --data-type rir noise  # ACE
  python data_generation_MeasuredRIR.py --data-id 4 --data-type rir noise  # dEchorate
  python data_generation_MeasuredRIR.py --data-id 5 --data-type rir noise  # BUTReverb
  python data_generation_MeasuredRIR.py --data-id 1 --data-type rir        # MeshRIR
  python data_generation_MeasuredRIR.py --data-id 2 --data-type rir        # MIR
  ```
  - generate microphone signals with recorded RIRs and noise signals
  ```shell
  python data_generation_SIGfromMeasuredRIR.py --data-id 0 3 4 5 --wnoise --stage pretrain
  python data_generation_SIGfromMeasuredRIR.py --data-id 0 3 4 5 --wnoise --stage preval
  python data_generation_SIGfromMeasuredRIR.py --data-id 0 3 4 5 --wnoise --stage test
  python data_generation_SIGfromMeasuredRIR.py --data-id 1 2 --stage pretrain
  python data_generation_SIGfromMeasuredRIR.py --data-id 1 2 --stage preval
  python data_generation_SIGfromMeasuredRIR.py --data-id 1 2 --stage test
  ```
- Real-world data (LOCATA)
  ```shell
  python data_generation_LOCATA.py --stage pretrain
  python data_generation_LOCATA.py --stage preval
  python data_generation_LOCATA.py --stage test_pretrain
  ```
- Simulated data (some instances)
  - uncomment `acoustic_scene.dp_mic_signal = []` in class `RandomMicSigDatasetOri` of `data_generation_dataset.py`
  - specify `room_size`, `T60`, `SNR` in `data_generation_opt.py` (default)
  - generate the corresponding instances
  ```shell
  python data_generation_SimulatedSIG_notspecifyroom.py --stage test --wnoise --ins --gpu-id 7
  ```
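Conceptually, each simulated microphone signal above is the source signal convolved with an RIR, plus noise scaled to a target SNR. A minimal pure-Python sketch of that idea (illustrative only; the repository's implementation handles multi-channel arrays and generates RIRs with gpuRIR):

```python
import math

def convolve(x, h):
    """Linear convolution of source x with impulse response h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def add_noise(signal, noise, snr_db):
    """Scale noise so the signal-to-noise ratio equals snr_db, then mix."""
    ps = sum(s * s for s in signal) / len(signal)          # signal power
    pn = sum(n * n for n in noise[:len(signal)]) / len(signal)  # noise power
    scale = math.sqrt(ps / (pn * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]
```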
Training
Specify the data time version (`self.time_ver`) and whether to train with simulated data (`self.pretrain_sim`) in class `opt_pretrain` of `opt.py`. When using real-world data, first train on simulated data with the default cosine-decay learning rate (initialized at 0.001), and then fine-tune on real-world data with a learning rate of 0.0001.

```shell
python run_pretrain.py --pretrain --gpu-id [*]
```
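The cosine-decay schedule mentioned above can be sketched as follows. This is a generic formulation decaying from the initial rate toward zero over `total_steps`; the repository's exact schedule (warmup, final rate, step granularity) may differ:

```python
import math

def cosine_decay_lr(step, total_steps, init_lr=0.001, final_lr=0.0):
    """Cosine-decay learning rate: init_lr at step 0, final_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return final_lr + 0.5 * (init_lr - final_lr) * (1 + math.cos(math.pi * progress))
```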
Evaluation
Specify `test_mode` in `run_pretrain.py`

```shell
python run_pretrain.py --test --time [*] --gpu-id [*]
```
Trained models
- `best_model.tar`
Preparation
- the same as for the pretext task
Data generation
- Simulated data
  - generate RIRs
  ```shell
  python data_generation_SimulatedRIR.py --gpu-id [*]
  ```
  - generate microphone signals from RIRs
  ```shell
  # room = 2, 4, 8, 16, 32, 64, 128 or 256, and room-trial-id = 16, 8, 4, 2, 1, 1, 1 or 1
  python data_generation_SIGfromMeasuredRIR.py --data-id 6 --wnoise --stage train --room 8 --room-trial-id 0
  python data_generation_SIGfromMeasuredRIR.py --data-id 6 --wnoise --stage val --room 20
  python data_generation_SIGfromMeasuredRIR.py --data-id 6 --wnoise --stage test --room 20
  ```
| Stage | Trials | nRooms | nRIRs/Room | nSrcSig/RIR | nMicSig |
| --- | --- | --- | --- | --- | --- |
| train | ×16 | 2 | 50 | 2 | 200 |
| train | ×8 | 4 | 50 | 2 | 400 |
| train | ×4 | 8 | 50 | 2 | 800 |
| train | ×2 | 16 | 50 | 2 | 1600 |
| train | ×1 | 32 | 50 | 2 | 3200 |
| train | ×1 | 64 | 50 | 2 | 6400 |
| train | ×1 | 128 | 50 | 2 | 12800 |
| train | ×1 | 256 | 50 | 2 | 25600 |
| val | - | 20 | 50 | 1 | 1000 |
| test | - | 20 | 50 | 4 | 4000 |
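The nMicSig column above is simply nRooms × nRIRs/Room × nSrcSig/RIR, which is easy to verify:

```python
# (nRooms, nRIRs/Room, nSrcSig/RIR, nMicSig) rows taken from the table above
rows = [
    (2, 50, 2, 200),
    (32, 50, 2, 3200),
    (256, 50, 2, 25600),
    (20, 50, 1, 1000),  # val
    (20, 50, 4, 4000),  # test
]
for n_rooms, n_rirs, n_src, n_mic in rows:
    assert n_rooms * n_rirs * n_src == n_mic
```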
- Real-world data
  - TDOA estimation
  ```shell
  python data_generation_LOCATA.py --stage train
  python data_generation_LOCATA.py --stage val
  python data_generation_LOCATA.py --stage test
  ```
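For reference, the TDOA between two microphone channels is the time lag that maximizes their cross-correlation. A minimal time-domain sketch of that idea (illustrative baseline only, not the repository's estimator; practical systems usually use GCC-PHAT in the frequency domain):

```python
def estimate_tdoa(x, y, max_lag):
    """Lag (in samples) at which the cross-correlation of x and y peaks.
    A positive result means y is a delayed copy of x."""
    best_lag, best_corr = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        corr = sum(
            x[n] * y[n + lag]
            for n in range(len(x))
            if 0 <= n + lag < len(y)
        )
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag
```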
  - DRR, T60, C50, absorption coefficient estimation: labels are generated on-the-fly from selected RIRs and noise signals
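As a reminder of what one of these labels measures: DRR is the energy ratio between the direct-path part of the RIR and the remaining reverberant tail. A minimal sketch (a 2.5 ms direct-path window around the peak is a common convention; the repository's exact label computation may differ):

```python
import math

def drr_db(rir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant ratio of an RIR in dB."""
    peak = max(range(len(rir)), key=lambda n: abs(rir[n]))
    half = int(direct_window_ms / 1000 * fs / 2)
    lo, hi = max(0, peak - half), min(len(rir), peak + half + 1)
    direct = sum(h * h for h in rir[lo:hi])   # energy in the direct-path window
    reverb = sum(h * h for h in rir) - direct  # energy in the tail
    return 10 * math.log10(direct / reverb)
```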
Training
Specify the data time version (`self.time_ver`) and whether to train with simulated data (`downstream_sim`) in class `opt_downstream` of `opt.py`
- Simulated data
  ```shell
  # ds-nsimroom = 2, 4, 8, 16, 32, 64, 128 or 256
  # ds-trainmode = finetune, lineareval or scratchLOW
  python run_downstream.py --ds-train --ds-trainmode finetune --ds-nsimroom 8 --ds-task TDOA --time [*] --gpu-id [*]
  python run_downstream.py --ds-train --ds-trainmode finetune --ds-nsimroom 8 --ds-task DRR T60 C50 ABS --time [*] --gpu-id [*]
  python run_downstream.py --ds-train --ds-trainmode scratchUP --ds-task TDOA --time [*] --gpu-id [*]
  python run_downstream.py --ds-train --ds-trainmode scratchUP --ds-task DRR T60 C50 ABS --time [*] --gpu-id [*]
  ```
- Real-world data
  ```shell
  # ds-trainmode = finetune or scratchLOW
  # ds-real-sim-ratio = 1 1, 1 0 or 0 1
  python run_downstream.py --ds-train --ds-trainmode finetune --ds-real-sim-ratio 1 1 --ds-task TDOA --time [*] --gpu-id [*]
  python run_downstream.py --ds-train --ds-trainmode finetune --ds-real-sim-ratio 1 1 --ds-task DRR T60 C50 ABS --time [*] --gpu-id [*]
  ```
Evaluation
Specify test mode (`test_mode`) in `run_downstream.py`
- Simulated data
  ```shell
  # ds-nsimroom = 2, 4, 8, 16, 32, 64, 128 or 256
  # ds-trainmode = finetune, lineareval or scratchLOW
  python run_downstream.py --ds-test --ds-trainmode finetune --ds-nsimroom 8 --ds-task TDOA --time [*] --gpu-id [*]
  python run_downstream.py --ds-test --ds-trainmode finetune --ds-nsimroom 8 --ds-task DRR T60 C50 ABS --time [*] --gpu-id [*]
  python run_downstream.py --ds-test --ds-trainmode scratchUP --ds-task TDOA --time [*] --gpu-id [*]
  python run_downstream.py --ds-test --ds-trainmode scratchUP --ds-task DRR T60 C50 ABS --time [*] --gpu-id [*]
  ```
- Real-world data
  ```shell
  # ds-trainmode = finetune or scratchLOW
  # ds-real-sim-ratio = 1 1, 1 0 or 0 1
  python run_downstream.py --ds-test --ds-trainmode finetune --ds-real-sim-ratio 1 1 --ds-task TDOA --time [*] --gpu-id [*]
  python run_downstream.py --ds-test --ds-trainmode finetune --ds-real-sim-ratio 1 1 --ds-task DRR T60 C50 ABS --time [*] --gpu-id [*]
  ```
- Read downstream results (MAEs of TDOA, DRR, T60, C50, SNR, ABS estimation) from the saved mat files
  ```shell
  python read_dsmat_bslr.py --time [*]
  python read_lossmetric_simdata.py
  python read_lossmetric_realdata.py
  ```
Trained models
- `ensemble_model.tar`
If `OSError: [Errno 24] Too many open files` occurs, run the following at the command line:

```shell
ulimit -n 2048
```
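Alternatively, the soft file-descriptor limit can be raised from within Python itself, e.g. at the top of a training script. A small sketch using the standard `resource` module (Unix-only; unprivileged processes can only raise the soft limit up to the hard limit):

```python
import resource

def raise_open_file_limit(target=2048):
    """Raise the soft RLIMIT_NOFILE toward target, capped at the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (max(soft, new_soft), hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```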
If you find our work useful in your research, please consider citing:

```
@InProceedings{yang2023sarssl,
    author = "Bing Yang and Xiaofei Li",
    title = "Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer",
    booktitle = "arXiv preprint arXiv:2312.00476",
    year = "2023"}
```
License: MIT