Official Implementation of research paper "Reasoning Driven Captions To Assist Noise Robust Speech Emotion Recognition" accepted for publication in ICASSP 2026
| Check out Example: 🔈 |
|---|
git clone https://github.com/Snehitc/Reasoning-driven-SER.git
cd Reasoning-driven-SER
conda create -n Rd_SER python=3.9
conda activate Rd_SER
pip install torch==2.1.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
Mellow
git clone https://github.com/soham97/mellow.git
$$\textbf{{\color{red}Important:}}$$
Replace the $${\color{red}wrapper.py}$$ file from the official Mellow implementation with our $${\color{blue}wrapper.py}$$. Our file is in the `mellow_replace_wrapper` directory.
Replace `.\mellow\mellow\wrapper.py` with `.\mellow_replace_wrapper\wrapper.py`.
Reason: we modified the wrapper to accept an audio tensor as input instead of an audio filename, since noisy samples are created in real time by mixing speech (MSP) with noise (FreeSound) in tensor form.
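The on-the-fly mixing amounts to scaling the noise so the speech-to-noise power ratio hits a target SNR before adding the two signals. A minimal NumPy sketch (function name and shapes are hypothetical, not the repo's actual dataloader code, which operates on torch tensors):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target SNR in dB (illustrative sketch)."""
    # Tile/trim noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale noise so that 10*log10(P_speech / P_scaled_noise) == snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

The same arithmetic carries over unchanged to torch tensors, which is why the wrapper only needs the tensor, not a filename.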
Please download our best-trained model's checkpoint from Zenodo (filename: `RdSER_Mellow_BestModel.pt`) and place it in the `.\model\ckpt` directory. See Directory Structure for reference.
Note:
- This checkpoint contains only the weights for WavLM and the downstream head, not CLAP, since CLAP was kept frozen during fine-tuning.
- However, during model instantiation CLAP automatically loads its pretrained weights from HuggingFace 🤗 via `from_pretrained`.
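Loading a checkpoint that covers only part of a model is done with `strict=False` in PyTorch. A toy sketch of the pattern (the module and its layer names are hypothetical stand-ins, not the repo's actual classes):

```python
import torch.nn as nn

# Hypothetical stand-in for the real model: a fine-tuned encoder + downstream
# head (both in the checkpoint) and a frozen CLAP branch (not in the checkpoint).
class TinyRdSER(nn.Module):
    def __init__(self):
        super().__init__()
        self.wavlm = nn.Linear(4, 4)     # stands in for the fine-tuned encoder
        self.head = nn.Linear(4, 3)      # stands in for the A/V/D regression head
        self.clap = nn.Linear(4, 4)      # stands in for the frozen CLAP branch
        self.clap.requires_grad_(False)  # frozen during fine-tuning

model = TinyRdSER()
# The released checkpoint covers only the fine-tuned parts, so load non-strictly;
# here we simulate such a checkpoint by dropping the CLAP keys.
ckpt = {k: v for k, v in TinyRdSER().state_dict().items() if not k.startswith("clap")}
missing, unexpected = model.load_state_dict(ckpt, strict=False)
# `missing` lists the CLAP keys, which keep their (pretrained) weights instead.
```

With `strict=False`, PyTorch reports the untouched keys rather than raising, which is why the frozen CLAP branch can rely on its `from_pretrained` weights.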
Arrange this data in the directories specified in Directory Structure.
- Speech: MSP podcast (Release 1.10)
- Noise: FreeSound and other noise datasets (manually scraped for the classes listed in our paper)
Run this script to get the results for our best-trained model:
python evaluate.py
FC = baseline Audio+Text fusion via feature concatenation; CA = Audio+Text fusion via cross-attention.

| SNR | Score | Audio-only | Text-only (Transcript) | FC (Mellow) | FC (Scene) | FC (MS-CLAP) | CA (Mellow) | CA (Scene) | CA (MS-CLAP) |
|---|---|---|---|---|---|---|---|---|---|
| 5dB | Arousal | 0.5929 | 0.0912 | 0.0557 | 0.5911 | 0.5856 | 0.5899 | 0.5908 | 0.6004 |
| 5dB | Valence | 0.4385 | 0.1410 | 0.0132 | 0.3888 | 0.3939 | 0.4071 | 0.4272 | 0.4475 |
| 5dB | Dominance | 0.4909 | 0.0041 | 0.0073 | 0.4779 | 0.4564 | 0.4761 | 0.4791 | 0.4837 |
| 0dB | Arousal | 0.5736 | 0.0912 | 0.0552 | 0.5713 | 0.5673 | 0.5705 | 0.5594 | 0.5847 |
| 0dB | Valence | 0.4122 | 0.1410 | 0.0119 | 0.4215 | 0.3684 | 0.3695 | 0.3957 | 0.4055 |
| 0dB | Dominance | 0.4763 | 0.0041 | 0.0068 | 0.4604 | 0.4409 | 0.4611 | 0.4635 | 0.4768 |
| -5dB | Arousal | 0.4808 | 0.0912 | 0.0492 | 0.5043 | 0.4844 | 0.4859 | 0.4743 | 0.5201 |
| -5dB | Valence | 0.3460 | 0.1410 | 0.0036 | 0.3359 | 0.3110 | 0.3044 | 0.3408 | 0.3493 |
| -5dB | Dominance | 0.3899 | 0.0041 | 0.0048 | 0.4017 | 0.3619 | 0.3840 | 0.4007 | 0.4232 |
| -10dB | Arousal | 0.2484 | 0.0912 | 0.0415 | 0.3251 | 0.2984 | 0.2982 | 0.3195 | 0.3174 |
| -10dB | Valence | 0.2155 | 0.1410 | 0.0035 | 0.1857 | 0.2086 | 0.2014 | 0.2371 | 0.2353 |
| -10dB | Dominance | 0.1862 | 0.0041 | 0.0026 | 0.2518 | 0.2069 | 0.2242 | 0.2323 | 0.2505 |
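The scores above are presumably concordance correlation coefficients (CCC), the usual metric for dimensional arousal/valence/dominance prediction on MSP-Podcast; treat that as an assumption and check the paper. A minimal reference implementation:

```python
import numpy as np

def ccc(pred: np.ndarray, gold: np.ndarray) -> float:
    """Concordance correlation coefficient between predictions and labels."""
    mu_p, mu_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()
    cov = np.mean((pred - mu_p) * (gold - mu_g))
    # CCC penalizes both scale/shift mismatch and low correlation.
    return 2 * cov / (var_p + var_g + (mu_p - mu_g) ** 2)
```

Unlike Pearson correlation, CCC drops below 1 for predictions that are merely shifted or rescaled versions of the labels.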
    Reasoning-driven-SER
    |___evaluate.py
    |___config.yaml
    |___requirements.txt
    |___model
    |   |___model.py
    |   |___ckpt              # add "RdSER_Mellow_BestModel.pt" in this dir
    |___utils
    |   |___utils.py
    |___dataset
    |   |___MSP_dataset.py
    |   |___MSP
    |   |   |___Audio
    |   |   |___labels
    |   |___FreeSound_Noise
    |       |___Test
    |           |___tram
    |           |___sea
    |           |___ ...
    |___mellow_replace_wrapper
    |   |___wrapper.py        # our modified version of Mellow's official wrapper.py
    |___mellow
        |___mellow
        |   |___wrapper.py    # Important: replace this file with our wrapper.py
        |___ ...
This work has been accepted for publication at ICASSP 2026. The citation will be added once the paper is available on IEEE Xplore.
- Readme
- Pipeline (fig)
- Results
- Example
- Directory Structure
- Citation (coming soon: ICASSP 2026)
- Code files
- Config
- Dataloader
- Utils: Mellow
- Model object
- Customised mellow wrapper
- Evaluation
- Requirements
- Trained model's checkpoint
