GitHub - MERLIon-Challenge/merlion-ccs-2023

UPDATES (Last Updated 24/8/2023)

The MERLIon CCS Challenge 2023 was a special session at Interspeech 2023, held on Wednesday 23rd August 2023 (4pm to 6pm).

For a detailed description of the MERLIon CCS dataset, check out our Interspeech 2023 paper:

Y. H. V. Chua, H. Liu, L. P. Garcia Perera, F. T. Woon, J. Wong, X. Zhang, S. Khudanpur, A. W. H. Khong, J. Dauwels, and S. J. Styles, MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization, doi: 10.21437/Interspeech.2023-1446.

The teams behind the winning systems in the open and closed tracks presented the following papers during the special session:

S. K. Gupta, S. Hiray, P. Kukde (2023). Spoken Language Identification System for English-Mandarin Code-Switching Child-Directed Speech, doi: 10.21437/Interspeech.2023-1335.
M. Shahin, Z. Nan, V. Sethu, B. Ahmed (2023). Improving wav2vec2-based Spoken Language Identification by Learning Phonological Features, doi: 10.21437/Interspeech.2023-2533.
K. Praveen, B. Radhakrishnan, K. Sabu, A. Pandey, M. A. B. Shaik (2023). Language Identification Networks for Multilingual Everyday Recordings, doi: 10.21437/Interspeech.2023-2047.

We also presented an analysis of the errors that submitted systems commonly share when performing language identification on complex speech:

S. J. Styles, Y. H. V. Chua, F. T. Woon, H. Liu, L. P. Garcia Perera, S. Khudanpur, A. W. H. Khong, and J. Dauwels (2023). Investigating model performance in language identification: beyond simple error statistics, doi: 10.21437/Interspeech.2023-1707.

The MERLIon CCS Challenge will remain open indefinitely to encourage model development. The development and evaluation sets are publicly available (https://doi.org/10.21979/N9/ANXS8Z). When using the dataset, please cite:

Chua, Victoria Yi Han; Garcia Perera, Leibny Paola; Khudanpur, Sanjeev; Khong, Andy W. H.; Dauwels, Justin; Woon, Fei Ting; Styles, Suzy J, 2023, "Development and Evaluation data for Multilingual Everyday Recordings - Language Identification on Code-Switched Child-Directed Speech (MERLIon CCS) Challenge", https://doi.org/10.21979/N9/ANXS8Z, DR-NTU (Data), V1

The evaluation plan for the challenge can be found in the following sections.

Overview

Fair and inclusive speech technologies need robust multilingual speech processing systems that can handle diverse recording environments, accents, registers and spontaneous code-switching behaviours across individuals. In response to the need for more reliable language identification and language diarization systems, we present the Multilingual Everyday Recordings – Language Identification on Code-Switched Child-Directed Speech Challenge (MERLIon CCS Challenge).

The MERLIon CCS Challenge features a first-of-its-kind Zoom video call dataset from the Talk Together Study, with:

more than 30 hours of Singaporean English/Mandarin code-switched child-directed speech
featuring more than 100 voices
over 300 recordings manually annotated by at least 2 multilingual transcribers

More information can be found on our website.

Evaluation plan

v1.2 (Last Updated on 17th Feb 2023; Uploaded to arXiv on 31 May 2023)
v1.1
v1.0

Tasks

There are two tasks: Task 1 (Language Identification) and Task 2 (Language Diarization) in the MERLIon CCS Challenge.

Baseline system

The baseline system is an end-to-end conformer model used for both tasks. The model consists of four conformer encoder layers, followed by a statistics pooling layer and three linear layers, with ReLU activation after the first two linear layers. Every self-attention encoder layer has eight attention heads with input and output dimensions of 512, and the inner layer of the position-wise feed-forward network has dimensionality 2048. For each speech signal, 39-dimensional Mel Frequency Cepstral Coefficient (MFCC) features, comprising 13-dimensional MFCCs and their first- and second-order derivatives, are extracted and fed into the conformer encoder layers. The statistics pooling layer then generates a 1024-dimensional output, which the three linear layers finally project to the number of target languages. The three linear layers have 1024, 512, and 2 output nodes respectively.
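The jump from 512-dimensional encoder outputs to a 1024-dimensional pooled vector is what one would expect if the pooling layer concatenates a per-dimension mean and standard deviation over time. The sketch below illustrates that idea in plain Python. It is not taken from the baseline code: the function name `statistics_pooling`, the use of the population standard deviation, and the toy input values are all assumptions made for illustration.

```python
import math
from typing import List


def statistics_pooling(frames: List[List[float]]) -> List[float]:
    """Pool a (T x D) sequence into one 2*D vector: per-dimension means, then stds.

    Hypothetical helper for illustration; the baseline's own layer may differ.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # Assumption: population standard deviation (divide by T, not T - 1).
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Toy stand-in for encoder output: 3 frames of 512 dimensions (made-up values).
encoder_output = [[float(t + d % 2) for d in range(512)] for t in range(3)]
pooled = statistics_pooling(encoder_output)
print(len(pooled))
```

Because the time axis is collapsed, clips of any length map to a vector of the same size, which is what lets fixed-size linear layers follow the encoder.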

The training data comprise 200 hours of AISHELL Mandarin data, 100 hours of Librispeech data, and 100 hours of National Speech Corpus data. The partitions of each dataset can be found here. The speech signals in these datasets are cut into segments of at most 3 s before feature extraction. The model is trained for five epochs with a batch size of 32, using a learning rate that warms up from 0 to 10^-4 over 5000 steps and then follows a cosine annealing decay.
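The learning-rate schedule described above can be sketched as a function of the training step. This is a minimal illustration under stated assumptions, not the baseline's training code: the warm-up is taken to be linear, the decay is taken to end at `min_lr = 0.0`, and `total_steps` is a hypothetical value, since the source gives only the number of epochs and the batch size.

```python
import math

PEAK_LR = 1e-4        # peak learning rate stated in the text
WARMUP_STEPS = 5000   # warm-up length stated in the text


def learning_rate(step: int, total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warm-up to PEAK_LR, then cosine annealing down to min_lr.

    total_steps and min_lr are assumptions; the source does not specify them.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, chosen only to make the example concrete.
TOTAL_STEPS = 50_000
for step in (0, 2500, 5000, 27_500, 50_000):
    print(step, learning_rate(step, TOTAL_STEPS))
```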

For the diarization task, energy-based voice activity detection is applied to the test data to identify silent regions. After the silences are removed, each speech signal is partitioned into speech clips, and language identification is performed on each clip under the assumption that no code-switching occurs within a clip.
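To make the voice activity detection step concrete, here is a minimal sketch of one common energy-based approach: compute the mean energy of fixed-length frames and keep runs of frames whose energy exceeds a threshold. The frame length, the threshold rule (a fraction of the loudest frame), and the function name `energy_vad` are assumptions for illustration; the baseline's actual detector and parameters are not described in the text.

```python
from typing import List, Sequence, Tuple


def energy_vad(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: float = 25.0,        # assumed frame length
    threshold_ratio: float = 0.1,  # assumed fraction of the peak frame energy
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans whose frame energy exceeds the threshold."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    energies = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / len(frame))
    if not energies:
        return []

    threshold = threshold_ratio * max(energies)
    segments: List[Tuple[float, float]] = []
    seg_start = None
    for i, energy in enumerate(energies):
        is_speech = energy > threshold
        if is_speech and seg_start is None:
            seg_start = i  # a speech run begins at this frame
        elif not is_speech and seg_start is not None:
            segments.append((seg_start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            seg_start = None
    if seg_start is not None:  # speech continues to the end of the signal
        segments.append((seg_start * frame_len / sample_rate,
                         len(samples) / sample_rate))
    return segments


# Toy signal at 1 kHz: 0.1 s silence, 0.1 s "speech", 0.1 s silence.
signal = [0.0] * 100 + [0.5, -0.5] * 50 + [0.0] * 100
print(energy_vad(signal, sample_rate=1000, frame_ms=10.0))
```

Each returned span would then be classified as a single language, which is where the no-code-switching assumption enters: a clip containing both languages can be at most partly correct.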

For Task 1 (Language Identification), the baseline system achieved the following results on the MERLIon CCS Development set (N:

Equal Error Rate	Balanced Accuracy	Current Accuracy
22.1328%	50.32%	77.7681%
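For readers unfamiliar with these metrics, the sketch below shows one conventional way to compute balanced accuracy (the mean of per-language recall) and an approximate equal error rate (the operating point where false-accept and false-reject rates meet). It is an illustration only: the function names and toy data are hypothetical, and the official scoring scripts may define or compute these quantities differently.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of per-class recall, so each language counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


def equal_error_rate(scores: Sequence[float], is_target: Sequence[bool]) -> float:
    """Approximate EER by sweeping thresholds over the observed scores.

    A trial is accepted when score >= threshold. Returns the mean of the
    false-accept and false-reject rates at the threshold where they are closest.
    """
    n_target = sum(1 for t in is_target if t)
    n_nontarget = len(is_target) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need both target and non-target trials")
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(scores)):
        far = sum(1 for s, t in zip(scores, is_target) if not t and s >= thr) / n_nontarget
        frr = sum(1 for s, t in zip(scores, is_target) if t and s < thr) / n_target
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy data (made up): a system that always predicts the majority language.
truth = ["en"] * 8 + ["zh"] * 2
always_english = ["en"] * 10
plain_accuracy = sum(t == p for t, p in zip(truth, always_english)) / len(truth)
print(plain_accuracy, balanced_accuracy(truth, always_english))

toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_targets = [True, True, True, True, False, False, False, False]
print(equal_error_rate(toy_scores, toy_targets))
```

In the toy data, always predicting the majority language gives high plain accuracy but only 50% balanced accuracy, which is why the two figures are worth reading together on imbalanced data.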

For Task 2 (Language Diarization), the baseline system achieved the following results on the MERLIon CCS Development set:

Language Diarization Error Rate	English Language Error Rate	Mandarin Language Error Rate
86.6%	83.93%	99.8%

The baseline system and scoring scripts for each task can be found here.

Datasets

Training Data

For both Task 1 and Task 2, there is a closed track and an open track, placing limits on the amount of training data that can be used. More information on the tracks and the training data can be found here.

Development and Evaluation Datasets

The development set and evaluation set are Challenge-ready partitions of a larger dataset from the Talk Together Study, which examines parent-child interactions in multilingual Singapore. Both partitions are now publicly available (https://doi.org/10.21979/N9/ANXS8Z).

Submission

Results are submitted through the challenge's CodaLab pages. More information can be found on our website. NOTE: The CodaLab pages will re-open after Interspeech 2023 around the first week of September 2023.

Organising committee

- Leibny Paola Garcia Perera, Johns Hopkins University
- YH Victoria Chua, Nanyang Technological University
- Hexin Liu, Nanyang Technological University
- Fei Ting Woon, Nanyang Technological University
- Andy Khong, Nanyang Technological University
- Justin Dauwels, TU Delft
- Sanjeev Khudanpur, Johns Hopkins University
- Suzy J Styles, Nanyang Technological University

Contact

Please contact Victoria at merlion.challenge@gmail.com or visit us at our website if you have any questions.

Sign up for our mailing list or join our LinkedIn group for updates!
