Releases: ShahnawazKakarh/speech-emotion-recognition-transfer-learning
v2.0.0 — Cross-Lingual Indo-Aryan SER (failure-modes audit)
Cross-Lingual Speech Emotion Recognition for Indo-Aryan Languages: Acoustic Feature Collapse, Within-Language Show Overfitting, and Asymmetric Transformer Transfer
v2.0.0 — multi-seed audit revealing three structural failure modes in low-resource Indo-Aryan SER.
Summary
Four findings across classical and transformer-based SER for South Asian Indo-Aryan languages:
- C1. Modernised classical-ML baselines exceed the published Sindhi baseline by +1.70 pp UAR (SVM-RBF on IS10 features, UAR 0.5699).
- C2. Cross-corpus transfer Urdu ↔ Sindhi fails catastrophically: best transfer UAR 0.2734 across 30 configurations — a 30.99 pp drop from the within-language ceiling, with feature-set ranking inverting under transfer.
- C3. Multilingual encoder validation: wav2vec2-XLS-R-300M outperforms wav2vec2-base by +11.4 pp WF1 on English speaker-independent RAVDESS (0.659 → 0.773).
- C4. First multi-seed (n=3) transformer cross-lingual SER study between Punjabi RASA and URDU-Latif, revealing three structural failure modes, plus one direction that clears chance:
  - Punjabi within-language results are leakage-saturated: WF1 reaches 0.994 [0.988, 0.998] only because RASA's publisher split admits speaker overlap.
  - Urdu within-language under show-independent evaluation falls below chance: pooled UAR 0.103 [0.075, 0.131] (n=88), against a 0.25 chance baseline. The corpus (~290 utterances) is too small to learn show-invariant representations.
  - Urdu → Punjabi zero-shot transfer collapses to single-class prediction: pooled UAR 0.248 [0.225, 0.272] (n=2,886), robust across 3 seeds.
  - Only Punjabi → Urdu clears chance (pooled UAR 0.467 [0.187, 0.538], n=88), but its per-show variance is too large to support strong claims.
These failure modes are properties of the public Indo-Aryan SER corpora (too small, single-source recording provenance, missing speaker IDs), not of the XLS-R architecture, which achieves WF1 0.773 on English RAVDESS (C3).
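The 0.25 chance baseline and the single-class collapse reported under C4 both follow from how UAR is defined (the unweighted mean of per-class recall). The following is a minimal sketch, assuming a four-class label set as implied by the 0.25 baseline; the label names, the toy class counts, and the `uar` helper are illustrative and not taken from the repository.

```python
from collections import Counter

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recall over classes in y_true."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical, deliberately imbalanced four-class test set (assumption).
y_true = ["angry"] * 50 + ["happy"] * 30 + ["neutral"] * 15 + ["sad"] * 5

# A collapsed model that always predicts the same class has recall 1.0 on
# that class and 0.0 on the other three, regardless of class balance.
collapsed_pred = ["angry"] * len(y_true)
collapsed_uar = uar(y_true, collapsed_pred)
print(collapsed_uar)  # 0.25
```

Because UAR ignores class frequencies, a pooled UAR close to 0.25 is what a single-class predictor produces on any four-class test set, which is why a value such as 0.248 reads as collapse rather than partial transfer.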
Artifacts in this release
- paper_v2.pdf: the full v2 preprint (Abstract, Methods, 4 Findings, Discussion, Limitations, References)
- aggregate.json: per-seed and pooled bootstrap CIs for all 4 conditions
- summary.csv: flat machine-readable table of all numbers
Reproducibility
- scripts/run_multiseed_cross_lingual.py: orchestrator (6 trainings + 12 evals)
- scripts/eval_checkpoint.py: generalized within/cross-language evaluator
- scripts/aggregate_multiseed.py: mean + 95% bootstrap CI + pooled-prediction CI
- research/cross_lingual/multiseed_protocol.md: full methods documentation
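For readers unfamiliar with the "mean + 95% bootstrap CI + pooled-prediction CI" aggregation, the sketch below shows one common way to compute such numbers: a per-seed mean, plus a percentile bootstrap over predictions pooled across seeds. It is not the code in scripts/aggregate_multiseed.py; the function names, the toy labels, and the choice of the percentile method with 2,000 resamples are all assumptions made for illustration.

```python
import random
from collections import Counter

def uar(y_true, y_pred):
    """Unweighted average recall over the classes present in y_true."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample utterances with replacement, recompute metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Toy data (assumption): one shared test set, predictions from three seeds.
classes = ["angry", "happy", "neutral", "sad"]
y_true = classes * 10
preds_by_seed = {
    0: ["angry", "happy", "neutral", "angry"] * 10,
    1: ["angry", "happy", "sad", "sad"] * 10,
    2: ["angry", "angry", "neutral", "sad"] * 10,
}

# Per-seed point estimates and their mean.
per_seed = [uar(y_true, pred) for pred in preds_by_seed.values()]
mean_uar = sum(per_seed) / len(per_seed)

# Pooled predictions: concatenate all seeds, then bootstrap the pooled set.
pooled_true = y_true * len(preds_by_seed)
pooled_pred = [p for pred in preds_by_seed.values() for p in pred]
pooled_uar = uar(pooled_true, pooled_pred)
lo, hi = bootstrap_ci(pooled_true, pooled_pred, uar)
print(f"mean UAR {mean_uar:.3f}, pooled UAR {pooled_uar:.3f} [{lo:.3f}, {hi:.3f}]")
```

Pooling treats the seeds' predictions as one larger sample, so the resulting interval mixes utterance-level and seed-level variability; the per-seed mean is reported alongside it for that reason.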
What's next (v2.1, v3.0)
- v2.1: speaker-disjoint Punjabi RASA re-evaluation (pending AI4Bharat speaker-ID release)
- v3.0: Hindi (IITKGP-SEHSC) + multimodal Whisper-Punjabi text branch + larger seed pool
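The speaker-disjoint re-evaluation planned for v2.1 amounts to splitting by group (speaker, or show for URDU-Latif) rather than by utterance. A minimal sketch of that idea follows, assuming one group ID per utterance is available; the `group_disjoint_split` helper and the speaker IDs are hypothetical, and scikit-learn's `GroupShuffleSplit` offers a comparable ready-made splitter.

```python
import random

def group_disjoint_split(groups, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) such that no group ID appears on both sides."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out whole groups, never individual utterances.
    n_test = max(1, round(test_fraction * len(unique)))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# Hypothetical speaker IDs, one per utterance (assumption).
speaker_ids = ["spk1", "spk1", "spk2", "spk3", "spk3",
               "spk3", "spk4", "spk5", "spk5", "spk2"]

train_idx, test_idx = group_disjoint_split(speaker_ids, test_fraction=0.4)
train_speakers = {speaker_ids[i] for i in train_idx}
test_speakers = {speaker_ids[i] for i in test_idx}
print(sorted(test_speakers), train_speakers.isdisjoint(test_speakers))
```

Splitting at the group level is what rules out the kind of overlap described for the RASA publisher split: every utterance from a held-out speaker lands in the test set, so the model cannot score well by recognising a voice it saw during training.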
Cite
Citation metadata is in CITATION.cff. The DOIs (concept and v2.0.0 version) are populated by Zenodo on this release.