v2.0.0 — Cross-Lingual Indo-Aryan SER (failure-modes audit)

@ShahnawazKakarh released this 16 Jun 00:22

Cross-Lingual Speech Emotion Recognition for Indo-Aryan Languages: Acoustic Feature Collapse, Within-Language Show Overfitting, and Asymmetric Transformer Transfer

v2.0.0 — multi-seed audit revealing three structural failure modes in low-resource Indo-Aryan SER.

Summary

Four findings across classical and transformer-based SER for South Asian Indo-Aryan languages:

  • C1. Modernised classical-ML baselines exceed the published Sindhi baseline by +1.70 pp UAR (SVM-RBF on IS10 features, UAR 0.5699).
  • C2. Urdu ↔ Sindhi cross-corpus transfer fails catastrophically: the best of 30 configurations reaches UAR 0.2734, a 30.99 pp drop from the within-language ceiling, and the feature-set ranking inverts under transfer.
  • C3. Multilingual encoder validation: wav2vec2-XLS-R-300M outperforms wav2vec2-base by +11.4 pp WF1 on English speaker-independent RAVDESS (0.659 → 0.773).
  • C4. First multi-seed (n=3) transformer cross-lingual SER study between Punjabi RASA and URDU-Latif. The first three results below are structural failure modes; the fourth is the only condition whose point estimate clears chance:
    • Punjabi within-language is leakage-saturated: WF1 0.994 [0.988, 0.998] only because RASA's publisher split admits speaker overlap.
    • Urdu within-language under show-independent evaluation falls below chance: pooled UAR 0.103 [0.075, 0.131] (n=88), against a chance baseline of 0.25. The corpus (~290 utterances) is too small to learn show-invariant representations.
    • Urdu → Punjabi zero-shot transfer collapses to single-class prediction: pooled UAR 0.248 [0.225, 0.272] (n=2,886), robust across 3 seeds.
    • Only Punjabi → Urdu clears chance in its point estimate (pooled UAR 0.467 [0.187, 0.538], n=88), but the lower bound of the interval falls below 0.25 and per-show variance is too large for strong claims.
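
Why a single-class collapse lands at UAR ≈ 0.25 can be illustrated with a minimal sketch. It assumes four emotion classes (implied by the 0.25 chance baseline) and uses hypothetical label names and class counts; it is not taken from the release scripts.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally regardless of size."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical, deliberately imbalanced 4-class test set (illustrative labels only).
y_true = ["angry"] * 50 + ["happy"] * 30 + ["neutral"] * 15 + ["sad"] * 5
# A collapsed model that predicts the same class for every utterance.
y_pred = ["angry"] * len(y_true)

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR={uar:.3f} accuracy={accuracy:.3f}")  # UAR=0.250 accuracy=0.500
```

Recall is 1.0 for the predicted class and 0.0 for the other three, so UAR is exactly 1/4 whatever the class balance, while plain accuracy can still look respectable on an imbalanced set.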

These failure modes are properties of the public Indo-Aryan SER corpora (too small, single-source recording provenance, missing speaker IDs), not of the XLS-R architecture, which achieves WF1 0.773 on English speaker-independent RAVDESS (C3).
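
The leakage issue comes down to whether any speaker or show appears on both sides of a split. A minimal sketch of a group-disjoint split follows; the record layout, the `show_id` values, and the function names are hypothetical, and the release's own evaluation protocol may differ.

```python
import random

def group_disjoint_split(items, group_of, test_fraction=0.2, seed=0):
    """Split so that no group (speaker or show) appears in both train and test."""
    groups = sorted({group_of(x) for x in items})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [x for x in items if group_of(x) not in test_groups]
    test = [x for x in items if group_of(x) in test_groups]
    return train, test

def shared_groups(train, test, group_of):
    """Groups present on both sides; a non-empty result signals leakage."""
    return {group_of(x) for x in train} & {group_of(x) for x in test}

# Hypothetical utterance records: (utterance_id, show_id), 5 shows of 8 utterances.
utterances = [(f"utt{i:03d}", f"show{i % 5}") for i in range(40)]
show_of = lambda u: u[1]

train, test = group_disjoint_split(utterances, group_of=show_of)
print(len(train), len(test), shared_groups(train, test, show_of))
```

Splitting at the group level rather than the utterance level is what makes the check meaningful. Without speaker IDs the grouping key has to be a proxy such as the show.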

Artifacts in this release

  • paper_v2.pdf — the full v2 preprint (Abstract, Methods, 4 Findings, Discussion, Limitations, References)
  • aggregate.json — per-seed and pooled bootstrap CIs for all 4 conditions
  • summary.csv — flat machine-readable table of all numbers

Reproducibility

  • scripts/run_multiseed_cross_lingual.py — orchestrator (6 trainings + 12 evals)
  • scripts/eval_checkpoint.py — generalized within/cross-language evaluator
  • scripts/aggregate_multiseed.py — mean + 95% bootstrap CI + pooled-prediction CI
  • research/cross_lingual/multiseed_protocol.md — full methods documentation
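
For readers unfamiliar with pooled-prediction intervals, the idea can be sketched as a percentile bootstrap over concatenated per-seed predictions. This is an illustration under stated assumptions, not the logic of `aggregate_multiseed.py`: the synthetic data, the function names, the 2,000 resamples, and the reading of "pooled" as concatenation across seeds are all assumptions, and `multiseed_protocol.md` remains the authoritative description.

```python
import random
from collections import defaultdict

def uar(pairs):
    """Unweighted average recall over the classes present in `pairs`."""
    total, correct = defaultdict(int), defaultdict(int)
    for t, p in pairs:
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

def bootstrap_ci(pairs, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample (true, pred) pairs with replacement."""
    rng = random.Random(seed)
    n = len(pairs)
    stats = sorted(
        metric([pairs[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return metric(pairs), lo, hi

def synthetic_seed_predictions(seed, n=88, labels=(0, 1, 2, 3)):
    """Hypothetical stand-in for one seed's (true, predicted) label pairs."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        t = rng.choice(labels)
        p = t if rng.random() < 0.5 else rng.choice(labels)
        out.append((t, p))
    return out

# Pool predictions from three seeds, then bootstrap the pooled set once.
pooled = [pair for s in (0, 1, 2) for pair in synthetic_seed_predictions(s)]
point, lo, hi = bootstrap_ci(pooled, uar)
print(f"pooled UAR {point:.3f} [{lo:.3f}, {hi:.3f}] (n={len(pooled)})")
```

Resampling individual utterances treats them as independent. When utterances cluster by show or speaker, resampling whole groups would give wider and more honest intervals.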

What's next (v2.1, v3.0)

  • v2.1: speaker-disjoint Punjabi RASA re-evaluation (pending AI4Bharat speaker-ID release)
  • v3.0: Hindi (IITKGP-SEHSC) + multimodal Whisper-Punjabi text branch + larger seed pool

Cite

Citation metadata is in CITATION.cff. The DOIs (concept DOI and v2.0.0 version DOI) are populated by Zenodo on this release.