An English-Spanish code switching dataset adapted from the Miami-Corpus

Brono25/MIAMI-Corpus

Adapted Miami Corpus for Speaker Diarization

Dataset Overview

This dataset is derived from the Bangor Miami Corpus, a Spanish-English code-switching corpus. It contains 8.5 hours of annotated audio across 23 tracks, with 36 unique speakers. Some tracks have been made monolingual by excluding code-switched segments. Below is a breakdown of the minutes of monolingual Spanish and English segments versus Spanish-English code-switched segments.

[Figure: minutes of monolingual Spanish, monolingual English, and Spanish-English code-switched segments]

Contents

  • Reference RTTM Files: Annotation files containing the speaker diarization labels.
  • Audio Files: Hosted on OneDrive; the link is listed under Links.
  • Transcription Files: The .tr files include speaker labels, timestamps, and language labels. They also contain transcriptions of the spoken content, but these should not be treated as accurate: removing audio segments has introduced discrepancies between the text and the spoken words.
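
As a rough illustration of how the reference RTTM files might be consumed, the sketch below parses `SPEAKER` records and sums speaking time per speaker. It assumes the files follow the standard NIST RTTM layout (file ID in field 2, onset in field 4, duration in field 5, speaker name in field 8), which this repository does not confirm. The names `Turn`, `parse_rttm`, and `speaking_time` are hypothetical helpers, and the sample records are made up rather than taken from the dataset.

```python
from collections import defaultdict
from typing import NamedTuple


class Turn(NamedTuple):
    file_id: str
    start: float      # onset in seconds
    duration: float   # length in seconds
    speaker: str


def parse_rttm(lines):
    """Parse SPEAKER records from an iterable of RTTM-formatted lines."""
    turns = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and any record that is not a SPEAKER entry.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        turns.append(Turn(fields[1], float(fields[3]), float(fields[4]), fields[7]))
    return turns


def speaking_time(turns):
    """Return the total seconds of speech attributed to each speaker."""
    totals = defaultdict(float)
    for turn in turns:
        totals[turn.speaker] += turn.duration
    return dict(totals)


# Made-up sample records (assumed standard RTTM layout), not real dataset content.
sample = [
    "SPEAKER track01 1 0.00 2.50 <NA> <NA> spk_A <NA> <NA>",
    "SPEAKER track01 1 2.50 1.25 <NA> <NA> spk_B <NA> <NA>",
    "SPEAKER track01 1 4.00 1.50 <NA> <NA> spk_A <NA> <NA>",
]
totals = speaking_time(parse_rttm(sample))
print(totals)
```

Because `parse_rttm` accepts any iterable of lines, an open file handle for one of the reference files can be passed in directly in place of `sample`.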

Access

The dataset is publicly available.

Links

  • Audio files: https://onedrive.live.com/?id=DD72E4A05B8E96B0%21609&cid=DD72E4A05B8E96B0
  • Bangor Miami Corpus source: http://bangortalk.org.uk/speakers.php?c=miami
