Zenodo Clips

Reasoning-driven-SER

Official implementation of the research paper "Reasoning Driven Captions To Assist Noise Robust Speech Emotion Recognition", accepted for publication at ICASSP 2026.

Pipeline

(Pipeline figure)

Check out example clips: 🔈 Clips

Setup

1. Clone the repository

git clone https://github.com/Snehitc/Reasoning-driven-SER.git
cd Reasoning-driven-SER

2. Create environment

conda create -n Rd_SER python=3.9
conda activate Rd_SER

3. Install Torch (CUDA version)

pip install torch==2.1.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118

4. Install requirements

pip install -r requirements.txt

5. Add mellow

Mellow

git clone https://github.com/soham97/mellow.git

$$\textbf{{\color{red}Important:}}$$
Replace the $${\color{red}wrapper.py}$$ file from the official Mellow implementation with our $${\color{blue}wrapper.py}$$, found in the mellow_replace_wrapper directory:
replace .\mellow\mellow\wrapper.py with .\mellow_replace_wrapper\wrapper.py

Reason: we modified the wrapper to take an audio tensor as input instead of an audio filename, since noisy samples are created in real time by mixing speech (MSP) with noise (FreeSound) in tensor form.
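For intuition only (not code from this repository): tensor-level mixing at a target SNR typically scales the noise so that the speech-to-noise power ratio hits the desired level before adding the two signals. A minimal pure-Python sketch, where the function name and list-based signals are illustrative stand-ins for the actual PyTorch tensors:

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it to a target SNR (in dB).

    Pure-list sketch of tensor-level mixing; both signals are assumed
    to have the same length.
    """
    p_speech = sum(s * s for s in speech) / len(speech)  # mean speech power
    p_noise = sum(n * n for n in noise) / len(noise)     # mean noise power
    # Gain that makes 10*log10(p_speech / (gain**2 * p_noise)) == snr_db.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

At 0 dB the scaled noise carries the same average power as the speech; at -10 dB it carries ten times more, which is why the lowest-SNR rows in the results table are the hardest.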

6. Add our checkpoints

Please download our best-trained model's checkpoint from Zenodo (filename: RdSER_Mellow_BestModel.pt) and place it in the .\model\ckpt directory. See Directory Structure for reference.

Note:

  • This checkpoint contains weights only for WavLM and the downstream head, not for CLAP, since the CLAP module was kept frozen during fine-tuning.
  • During model instantiation, however, CLAP automatically loads its pretrained weights from HuggingFace 🤗 via the from_pretrained command.
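For intuition, partial checkpoints like this are commonly produced by filtering the frozen branch out of the state dict before saving, then loading with `strict=False`. A minimal sketch, where the `clap.` key prefix and the function name are assumptions for illustration, not names from this repo:

```python
def split_trainable(state_dict, frozen_prefix="clap."):
    """Keep only entries that do NOT belong to the frozen branch.

    Saving the result yields a small checkpoint (e.g. WavLM + downstream
    head). At load time, `model.load_state_dict(ckpt, strict=False)`
    restores these weights while the frozen branch keeps the pretrained
    weights it loaded at instantiation.
    """
    return {k: v for k, v in state_dict.items()
            if not k.startswith(frozen_prefix)}
```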

7. Add Dataset

Arrange this data in the directories specified in Directory Structure.

  • Speech: MSP-Podcast (Release 1.10)
  • Noise: FreeSound and other noise datasets (manually scraped for the classes mentioned in our paper)

8. Evaluate

Run this script to reproduce the results of our best-trained model:

python evaluate.py

Results

| SNR | Score | Audio-only | Text-only: Transcript | Text-only: Mellow | Baseline (Feature Concat.): Scene | Baseline (Feature Concat.): MS-CLAP | Baseline (Feature Concat.): Mellow | Proposed (Cross-Attention): Scene | Proposed (Cross-Attention): MS-CLAP | Proposed (Cross-Attention): Mellow |
|---|---|---|---|---|---|---|---|---|---|---|
| 5dB | Arousal | 0.5929 | 0.0912 | 0.0557 | 0.5911 | 0.5856 | 0.5899 | 0.5908 | $${\color{blue}0.6046}$$ | 0.6004 |
| | Valence | 0.4385 | 0.1410 | 0.0132 | $${\color{blue}0.4497}$$ | 0.3888 | 0.3939 | 0.4071 | 0.4272 | 0.4475 |
| | Dominance | 0.4909 | 0.0041 | 0.0073 | 0.4779 | 0.4564 | 0.4761 | 0.4791 | $${\color{blue}0.4922}$$ | 0.4837 |
| 0dB | Arousal | 0.5736 | 0.0912 | 0.0552 | 0.5713 | 0.5673 | 0.5705 | 0.5594 | $${\color{blue}0.5852}$$ | 0.5847 |
| | Valence | 0.4122 | 0.1410 | 0.0119 | 0.4215 | 0.3684 | 0.3695 | 0.3957 | 0.4055 | $${\color{blue}0.4227}$$ |
| | Dominance | 0.4763 | 0.0041 | 0.0068 | 0.4604 | 0.4409 | 0.4611 | 0.4635 | $${\color{blue}0.4822}$$ | 0.4768 |
| -5dB | Arousal | 0.4808 | 0.0912 | 0.0492 | 0.5043 | 0.4844 | 0.4859 | 0.4743 | 0.5201 | $${\color{blue}0.5304}$$ |
| | Valence | 0.3460 | 0.1410 | 0.0036 | 0.3359 | 0.3110 | 0.3044 | 0.3408 | 0.3493 | $${\color{blue}0.3659}$$ |
| | Dominance | 0.3899 | 0.0041 | 0.0048 | 0.4017 | 0.3619 | 0.3840 | 0.4007 | 0.4232 | $${\color{blue}0.4248}$$ |
| -10dB | Arousal | 0.2484 | 0.0912 | 0.0415 | 0.3251 | 0.2984 | 0.2982 | 0.3195 | 0.3174 | $${\color{blue}0.3523}$$ |
| | Valence | 0.2155 | 0.1410 | 0.0035 | 0.1857 | 0.2086 | 0.2014 | 0.2371 | 0.2353 | $${\color{blue}0.2553}$$ |
| | Dominance | 0.1862 | 0.0041 | 0.0026 | 0.2518 | 0.2069 | 0.2242 | $${\color{blue}0.2568}$$ | 0.2323 | 0.2505 |

Table: CCC scores on unseen synthetic noisy speech (speech: MSP-Podcast Test1 set) at diverse SNRs, comparing the Proposed model against the Baselines across all context-aware texts. Text-only: transcripts are the ground-truth version. Best score per row in $${\color{blue}blue}$$.
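For reference, the CCC metric reported above is Lin's Concordance Correlation Coefficient. The sketch below is a standard stdlib implementation of that formula, not code from this repository:

```python
from statistics import fmean

def ccc(x, y):
    """Concordance Correlation Coefficient (Lin, 1989).

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    using population (biased) variance and covariance. It equals 1 only
    for perfect agreement and, unlike Pearson correlation, penalizes
    both scale and location shifts between predictions and labels.
    """
    mx, my = fmean(x), fmean(y)
    vx = fmean([(a - mx) ** 2 for a in x])
    vy = fmean([(b - my) ** 2 for b in y])
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```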

Directory Structure

Reasoning-driven-SER
      |___evaluate.py
      |___config.yaml
      |___requirements.txt
      
      |___model
          |___model.py
          |___ckpt
              |___# Add "RdSER_Mellow_BestModel.pt" in this dir
      
      |___utils
          |___utils.py
      
      |___dataset
          |___MSP_dataset.py
          |___MSP
              |___Audio
              |___labels
          |___FreeSound_Noise
              |___Test
                  |___tram
                  |___sea
                  |___ ...
  
      |___mellow_replace_wrapper
        |___wrapper.py # Our modified version of Mellow's official wrapper
  
      |___mellow
          |___mellow
              |___wrapper.py # Important: Replace this file with Our "wrapper.py"
          |___ ...

Citation (Coming Soon)

This work has been accepted for publication at ICASSP 2026. The citation will be added once the paper is available on IEEE Xplore.

TODO

  • Readme
    • Pipeline (fig)
    • Results
    • Example
    • Directory Structure
    • Citation (coming soon: ICASSP 2026)
  • Code files
    • Config
    • Dataloader
    • Utils: Mellow
    • Model object
    • Customised mellow wrapper
    • Evaluation
    • Requirements
  • Trained model's checkpoint
