Reasoning-driven-SER

Official Implementation of research paper "Reasoning Driven Captions To Assist Noise Robust Speech Emotion Recognition" accepted for publication in ICASSP 2026

Pipeline

Check out Example: 🔈

Setup

1. Clone the repository

git clone https://github.com/Snehitc/Reasoning-driven-SER.git
cd Reasoning-driven-SER

2. Create environment

conda create -n Rd_SER python=3.9

conda activate Rd_SER

3. Install Torch (CUDA version)

pip install torch==2.1.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118

4. Install requirements

pip install -r requirements.txt

5. Add mellow

Mellow
git clone https://github.com/soham97/mellow.git
$$\textbf{{\color{red}Important:}}$$
Replace $${\color{red}wrapper.py}$$ file from official mellow implimentation with our $${\color{blue}wrapper.py}$$. Please find our file in the mellow_replace_wrapper directory.
.\mellow\mellow\wrapper.py replace this file with .\mellow_replace_wrapper\wrapper.py

Reason: I have modified the wrapper file to take an audio tensor as input instead of an audio filename; since we are creating noisy samples in real time by mixing speech (MSP) with noise (Freesound) in tensor form.

6. Add our checkpoints

Please download our best-trained model's checkpoint from Zenodo (filename: RdSER_Mellow_BestModel.pt), place this file in .\model\ckpt directory. Check Directory Structure for better understanding.

Note:

This trained model only contains the checkpoints for WavLM and Downstream Head, and not CLAP, since the CLAP object was kept frozen while finetuning.

However, while the model's instantiation CLAP will automatically load pretrained weights from HuggingFace 🤗 using from_pretrained command.

7. Add Dataset

Arrange this data in the directories specified in Directory Structure.

Speech: MSP podcast (Release 1.10)
Noise: FreeSound and other noise datasets (Manually scraped data for classes mentioned in our paper)

8. Evaluate

Run this script to get the results for our best-trained model

python evaluate.py

Results

SNR	Score	Audio-only	Text-only		Baseline Audio+Text: Feature Concatenation			$${\color{blue}Proposed}$$ Audio+Text: Cross-Attention
SNR	Score	Audio-only	Transcript	Mellow	Scene	MS-CLAP	Mellow	Scene	MS-CLAP	$${\color{blue}Mellow}$$
5dB	Arousal	0.5929	0.0912	0.0557	0.5911	0.5856	0.5899	0.5908	$${\color{blue}0.6046}$$	0.6004
	Valence	0.4385	0.1410	0.0132	$${\color{blue}0.4497}$$	0.3888	0.3939	0.4071	0.4272	0.4475
	Dominance	0.4909	0.0041	0.0073	0.4779	0.4564	0.4761	0.4791	$${\color{blue}0.4922}$$	0.4837
0dB	Arousal	0.5736	0.0912	0.0552	0.5713	0.5673	0.5705	0.5594	$${\color{blue}0.5852}$$	0.5847
	Valence	0.4122	0.1410	0.0119	0.4215	0.3684	0.3695	0.3957	0.4055	$${\color{blue}0.4227}$$
	Dominance	0.4763	0.0041	0.0068	0.4604	0.4409	0.4611	0.4635	$${\color{blue}0.4822}$$	0.4768
-5dB	Arousal	0.4808	0.0912	0.0492	0.5043	0.4844	0.4859	0.4743	0.5201	$${\color{blue}0.5304}$$
	Valence	0.3460	0.1410	0.0036	0.3359	0.3110	0.3044	0.3408	0.3493	$${\color{blue}0.3659}$$
	Dominance	0.3899	0.0041	0.0048	0.4017	0.3619	0.3840	0.4007	0.4232	$${\color{blue}0.4248}$$
-10dB	Arousal	0.2484	0.0912	0.0415	0.3251	0.2984	0.2982	0.3195	0.3174	$${\color{blue}0.3523}$$
	Valence	0.2155	0.1410	0.0035	0.1857	0.2086	0.2014	0.2371	0.2353	$${\color{blue}0.2553}$$
	Dominance	0.1862	0.0041	0.0026	0.2518	0.2069	0.2242	$${\color{blue}0.2568}$$	0.2323	0.2505

Table: CCC scores on Unseen Synthetic Noisy Speech (Speech: MSP-Podcast-Test1 set) at diverse SNRs comparing Proposed and Baseline with respect to all Context-Aware Texts performance. (Text-only: Transcripts are ground-truth version)

Directory Structure

Reasoning-driven-SER
      |___evaluate.py
      |___config.yaml
      |___requirements.txt
      
      |___model
          |___model.py
          |___ckpt
              |___# Add "RdSER_Mellow_BestModel.pt" in this dir
      
      |___utils
          |___utils.py
      
      |___dataset
          |___MSP_dataset.py
          |___MSP
              |___Audio
              |___labels
          |___FreeSound_Noise
              |___Test
                  |___tram
                  |___sea
                  |___ ...
  
      |___mellow_replace_wrapper
          |___wrapper.py # Our file modified version of Mellow's official version
  
      |___mellow
          |___mellow
              |___wrapper.py # Important: Replace this file with Our "wrapper.py"
          |___ ...

Citation (Coming Soon)

This work is accepted for publication in ICASSP 2026. Citation will be updated soon in the near future once it's available on IEEE Xplore

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Reasoning-driven-SER

Pipeline

Setup

1. Clone the repository

2. Create environment

3. Install Torch (CUDA version)

4. Install requirements

5. Add mellow

6. Add our checkpoints

7. Add Dataset

8. Evaluate

Results

Directory Structure

Citation (Coming Soon)

TODO

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 99 Commits
dataset		dataset
docs		docs
mellow_replace_wrapper		mellow_replace_wrapper
model		model
utils		utils
LICENSE		LICENSE
README.md		README.md
config.yaml		config.yaml
evaluate.py		evaluate.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Reasoning-driven-SER

Pipeline

Setup

1. Clone the repository

2. Create environment

3. Install Torch (CUDA version)

4. Install requirements

5. Add mellow

6. Add our checkpoints

7. Add Dataset

8. Evaluate

Results

Directory Structure

Citation (Coming Soon)

TODO

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

Packages