Allow dialect labels on audio files #2958

Open
DJ-Saidez opened this issue Jun 15, 2022 · 11 comments
Labels
enhancement Issue that describes a problem that requires a change in the current functionalities of Tatoeba.

Comments

@DJ-Saidez
Member

DJ-Saidez commented Jun 15, 2022

Having several audio files available for a sentence is a great achievement (see #183), but I think showing more information directly on the sentence page, such as the dialect of the speaker, would be even more useful, because it would let people see at a glance how the recordings differ.

Maybe for now it could simply be some text that either the audio contributor or the admin initially adds, and that is then shown on every sentence that the contributor records. (This might get complicated if the contributor records audio in more than one language, though.)
Then maybe later on we can make it work like tags, so that we can organize audio based on these labels. If we wanted to find specifically British English audio, for example, we could use a tag for that.

Rough representation:
[Screenshot: Screen Shot 2022-06-15 at 10 39 16 AM]
The difference is that I'd like the box to be expanded a bit so that the two lines fit.

We could also use a similar method for any notes to add to the audio, such as specifying whether it's past or present tense, or even the mood of the way it's said, but that'd be a lot more manual since you'd have to deal with each individual sentence.

DJ-Saidez added the enhancement label Jun 15, 2022
@DJ-Saidez
Member Author

DJ-Saidez commented Jun 15, 2022

Note to self: dev/src/View/Helper/AudioHelper.php

@jiru
Member

jiru commented Jun 19, 2022

Good idea!

> Note to self: dev/src/View/Helper/AudioHelper.php

Are you interested in working on this?

Just to share my design ideas of how to improve the situation. Right now, every audio attribute is tied to a specific user. The users table includes fields such as audio_licence, audio_attribution_url... To implement your suggestion, we'd add yet another field to the users table such as audio_dialect. It would be perfectly fine to implement it like that. But here is a suggestion that relates to this thread.

We should take the fields users.audio_* out and put them into a new table that I would call "voices". Then the schema becomes: every audio has one voice (instead of one user), and every voice has one user. Most audio contributors would have just one voice, but some may have more.
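The proposed layout (every audio has one voice, every voice has one user) could be sketched roughly as follows. This is only an illustration using an in-memory SQLite database: `audio_licence` and `audio_attribution_url` are the only field names taken from this thread, and every other table name, column name, and sample row is an assumption made for the example, not Tatoeba's actual schema.

```python
import sqlite3

# Hypothetical schema. The licence/attribution columns mirror the
# users.audio_* fields mentioned above; all other names are assumptions.
SCHEMA = """
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE
);
CREATE TABLE voices (
    id              INTEGER PRIMARY KEY,
    user_id         INTEGER NOT NULL REFERENCES users(id),
    licence         TEXT,
    attribution_url TEXT,
    dialect         TEXT
);
CREATE TABLE audios (
    id          INTEGER PRIMARY KEY,
    sentence_id INTEGER NOT NULL,
    voice_id    INTEGER NOT NULL REFERENCES voices(id)
);
"""


def build_demo_db():
    """Create the sketch schema with one user who owns two voices."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO users (id, username) VALUES (1, 'importer')")
    conn.executemany(
        "INSERT INTO voices (id, user_id, dialect) VALUES (?, ?, ?)",
        [(1, 1, "en-GB"), (2, 1, "en-US")],
    )
    conn.executemany(
        "INSERT INTO audios (sentence_id, voice_id) VALUES (?, ?)",
        [(100, 1), (101, 1), (102, 2)],
    )
    return conn


def voices_per_user(conn, username):
    """Count how many voices belong to a given user."""
    row = conn.execute(
        "SELECT COUNT(*) FROM voices v "
        "JOIN users u ON u.id = v.user_id WHERE u.username = ?",
        (username,),
    ).fetchone()
    return row[0]


def sentences_with_dialect(conn, dialect):
    """List the sentence ids that have audio recorded in a given dialect."""
    rows = conn.execute(
        "SELECT a.sentence_id FROM audios a "
        "JOIN voices v ON v.id = a.voice_id "
        "WHERE v.dialect = ? ORDER BY a.sentence_id",
        (dialect,),
    ).fetchall()
    return [r[0] for r in rows]
```

With this shape, a contributor with a single voice simply owns one row in `voices`, while an importer account can own several.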

@DJ-Saidez
Member Author

Yeah I wanted to give it a go because I’ve been learning about programming, and I’m actually getting help from Trang today so I can get more of a feel of how the system works.

For now I wanted to just try implementing the basic fix of adding “audio.dialect”, but I like that idea of switching to voices, so that in the event that there are contributors that are capable of having several voices, we have the framework for it. Or were you thinking of something else?

@jiru
Member

jiru commented Jun 27, 2022

If you are learning about programming, I suggest that you start with something simple. You could add a new field for the dialect (without adding a whole new "voices" table). You can go step by step and do the voice thing later.

> For now I wanted to just try implementing the basic fix of adding “audio.dialect”

Sorry if you already started implementing this, but a dialect pertains to a user, not to the audio. If you add "audios.dialect" (a field "dialect" in the table "audios", that is), the value of that field will be basically the same for all audios of a single user. This violates the principle of a single source of truth. Such a design won't let us easily know the dialect of a given contributor unless we go and check every single audio he contributed. Similarly, if we want to edit the dialect of a user, we'd need to go through the "dialect" fields of all his audios.

So instead, it's better to add a "users.audio_dialect" field. This design also allows an easier transition to the "voice" design, because CK, who is responsible for adding audio, is currently creating separate users for each voice he imports from Common Voice. So in the end we'll just have to "change" each of these dummy users (such as https://tatoeba.org/fr/user/profile/CVTR) into a new voice.

I hope this makes sense to you, otherwise feel free to ask me or Trang! 🙂
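To make the single-source-of-truth argument concrete, here is a minimal sketch using an in-memory SQLite database. The `users.audio_dialect` and `users.audio_licence` names come from this thread; the rest of the schema and the sample rows are assumptions made for the example only.

```python
import sqlite3


def build_db():
    """Hypothetical, simplified users/audios tables with one contributor."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (
            id            INTEGER PRIMARY KEY,
            username      TEXT NOT NULL UNIQUE,
            audio_licence TEXT,
            audio_dialect TEXT  -- stored once per contributor
        );
        CREATE TABLE audios (
            id          INTEGER PRIMARY KEY,
            sentence_id INTEGER NOT NULL,
            user_id     INTEGER NOT NULL REFERENCES users(id)
        );
        """
    )
    conn.execute("INSERT INTO users (id, username) VALUES (1, 'speaker')")
    conn.executemany(
        "INSERT INTO audios (sentence_id, user_id) VALUES (?, 1)",
        [(10,), (11,), (12,)],
    )
    return conn


def set_dialect(conn, username, dialect):
    """Edit the dialect in one place; returns the number of rows changed."""
    cur = conn.execute(
        "UPDATE users SET audio_dialect = ? WHERE username = ?",
        (dialect, username),
    )
    return cur.rowcount


def dialect_of_audio(conn, sentence_id):
    """Look the dialect up through the audio's owner instead of the audio row."""
    row = conn.execute(
        "SELECT u.audio_dialect FROM audios a "
        "JOIN users u ON u.id = a.user_id WHERE a.sentence_id = ?",
        (sentence_id,),
    ).fetchone()
    return row[0] if row else None
```

Changing the dialect touches exactly one row no matter how many recordings the contributor has, and every recording picks up the new value through the join.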

@ckjpn

ckjpn commented Jul 17, 2022

> So in the end we'll just have to "change" each of these dummy users (such as https://tatoeba.org/fr/user/profile/CVTR) into a new voice.

I don't agree with this at all.

I think having a profile for each voice is good since we can include all the info we know about a given voice in the profile.

See a couple of examples.

https://tatoeba.org/en/user/profile/CVjpn1

https://tatoeba.org/en/user/profile/CVeng25

I wouldn't call them "dummy users" since each represents an individual person.

There should be no problem with having too many accounts because of this.
We have thousands of accounts now that have no content, and we seem to be dealing with that OK.

The fact that https://tatoeba.org/fr/user/profile/CVTR is an account that has many voices is because the member who created that account preferred doing it that way and TRANG said it was OK.

Added later:

Perhaps we do agree. Are you only suggesting that CVTR be divided to indicate each individual's voice, or were you suggesting getting rid of all the individual CVjpn and CVeng accounts?

@jiru
Member

jiru commented Jul 18, 2022

When I used the words "dummy users", I didn't mean anything negative about the actual people who contributed audio. It was just my way of describing the fact that you are using user accounts to store information about contributors who are not Tatoeba contributors but Common Voice contributors. As far as I understand, each of these accounts cannot be used to directly reach out to these people; there is only you at the other end. The profile picture, the username, their comments and wall posts if any, they are not theirs, but yours. You are using user accounts in a way they were not originally designed for, but that happens to work for your purpose. (Which is alright by the way 🙂)

My intention here, as a developer, is just to fill this gap by creating a new kind of data (that I called "voice") that better matches reality. By "reality" I mean the fact that there are a bunch of recordings contributed by people from Common Voice and you are the one who helps import them into Tatoeba. The benefit of having voice-related data properly structured is that it can be better presented (e.g. a translatable UI text saying "this audio comes from Common Voice") and better processed (e.g. "how many different voices does this Tatoeba user contribute?" or "please show me all the voices from the UK").

@ckjpn

ckjpn commented Jul 18, 2022

> The profile picture, the username, their comments and wall posts if any, they are not theirs, but yours.

I have done none of these things and do not ever plan to. (ADDED LATER: I guess I did choose the usernames, though.)

Each profile very clearly states that it was created by CK, what its purpose is, and what Common Voice's purpose is, and includes a link to Common Voice. I think it's very clear that "each of these accounts can not be used to directly reach out to these people."

Some of the advantages of creating profiles for each voice are as follows:

  • People clicking the voice name on each page can find information known about the voice.
  • People can click the audio/of/USERNAME link of the voice to hear other audio files by that voice
  • By grouping audio files by voice, members can even discuss things like accents / dialects. For example, a certain voice sounds like a Texas accent, another voice sounds like a Boston accent, etc. There are many different UK accents, too.
  • The Common Voice Client ID is listed, so in the future, if a new set of files from Common Voice is downloaded and processed, it will be easier to make sure the same files do not get re-uploaded as new files, causing duplicate audio files on tatoeba.org. It will also make it possible for new audio files by the same voice to be uploaded into the same account.

> A new kind of data called "voice"

Adding this new kind of data is a good idea, too.

Some ideas.

  • Have a field for the general dialect, for example, British English, American English, Australian English, etc. (Maybe referring to the country might be better, for example, the United States of America, England, Australia, New Zealand, Ireland, Scotland, Wales, etc.)
  • Have a secondary sub-section for dialects to fine-tune what the dialect is (for example, Texas, Boston, Alabama, North Carolina, Kentucky, Liverpool, Dublin, etc.)
  • gender of speaker (I know you can just listen to find out, but if you want to sort or select by this, it might be useful information to include)
  • age of speaker (Perhaps not really necessary, but Common Voice asks for this information. It could help find children's voices or voices of very old people, though.). Note that for long-term contributors, what starts out as a child's voice, may turn into an adult's voice.

Having these fields for audio contributors to fill out would encourage contributors to give us this information. Perhaps this form could be added next to the "select license" on the audio page of each profile.

Somehow, to me, also including things like the source, such as Common Voice, seems somewhat unrelated. The files, being in the public domain, do not really need to be credited, even though I put this information on the profile pages. I don't think people would find the source very useful as additional information in a "voice" set of data, since the username link, as with other usernames for voices, sends people to the profile page, which includes this information for CV voices as well as any information other users have included in their profiles.

Perhaps you could include one extra field called "note" (or something similar), which could be used for the source of files, or any other information that an audio file contributor might feel would be useful info related to the voice.
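As a rough illustration of the fields listed above, a voice record might look like the sketch below. All field names, the `filter_voices` helper, and the sample values are hypothetical; the thread does not settle on any of them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Voice:
    """Hypothetical voice record; every field except the name is optional."""
    name: str
    dialect: Optional[str] = None       # general dialect or country
    sub_dialect: Optional[str] = None   # finer-grained region or city
    gender: Optional[str] = None
    age_group: Optional[str] = None     # may change for long-term contributors
    note: Optional[str] = None          # free text, e.g. the source of the files


def filter_voices(voices: List[Voice], **criteria: str) -> List[Voice]:
    """Keep the voices whose fields equal every given criterion."""
    return [
        v for v in voices
        if all(getattr(v, field) == value for field, value in criteria.items())
    ]


# Placeholder data, not real Tatoeba profiles.
voices = [
    Voice("voice_a", dialect="American English", sub_dialect="Texas"),
    Voice("voice_b", dialect="American English", sub_dialect="Boston"),
    Voice("voice_c", dialect="British English", note="imported from another project"),
]

american = filter_voices(voices, dialect="American English")
```

Keeping every field optional matches the idea that contributors are encouraged, but not forced, to fill the form out.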

@LBeaudoux
Contributor

LBeaudoux commented Jul 19, 2022

Just as languages are identified (in theory) by their ISO 639-3 language code on Tatoeba, I think we should follow a standard for dialects. The IETF BCP 47 language tag seems to be the reference in this case.

@ckjpn

ckjpn commented Jul 19, 2022

Is this the list you are referring to?
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

I think I correctly grabbed every item with the word English, just to get an idea of what's in the file.
These don't seem to include spoken variations, such as regional accents, which is the kind of information we would want about voices.

Type: language	Subtag: en	Description: English	Added: 2005-10-16	Suppress-Script: Latn
Type: language	Subtag: aig	Description: Antigua and Barbuda Creole English	Added: 2009-07-29
Type: language	Subtag: ang	Description: Old English (ca. 450-1100)	Added: 2005-10-16
Type: language	Subtag: bah	Description: Bahamas Creole English	Added: 2009-07-29
Type: language	Subtag: bzj	Description: Belize Kriol English	Added: 2009-07-29
Type: language	Subtag: bzk	Description: Nicaragua Creole English	Added: 2009-07-29
Type: language	Subtag: cpe	Description: English-based creoles and pidgins	Added: 2005-10-16	Scope: collection
Type: language	Subtag: cpi	Description: Chinese Pidgin English	Added: 2009-07-29
Type: language	Subtag: enm	Description: Middle English (1100-1500)	Added: 2005-10-16
Type: language	Subtag: fpe	Description: Fernando Po Creole English	Added: 2009-07-29
Type: language	Subtag: gcl	Description: Grenadian Creole English	Added: 2009-07-29
Type: language	Subtag: gpe	Description: Ghanaian Pidgin English	Added: 2012-08-12
Type: language	Subtag: gul	Description: Sea Island Creole English	Added: 2009-07-29
Type: language	Subtag: gyn	Description: Guyanese Creole English	Added: 2009-07-29
Type: language	Subtag: hwc	Description: Hawai'i Creole English	Description: Hawai'i Pidgin	Added: 2009-07-29
Type: language	Subtag: icr	Description: Islander Creole English	Added: 2009-07-29
Type: language	Subtag: jam	Description: Jamaican Creole English	Added: 2009-07-29
Type: language	Subtag: lir	Description: Liberian English	Added: 2009-07-29
Type: language	Subtag: svc	Description: Vincentian Creole English	Added: 2009-07-29
Type: language	Subtag: tch	Description: Turks And Caicos Creole English	Added: 2009-07-29
Type: language	Subtag: tgh	Description: Tobagonian Creole English	Added: 2009-07-29
Type: language	Subtag: trf	Description: Trinidadian Creole English	Added: 2009-07-29
Type: language	Subtag: vic	Description: Virgin Islands Creole English	Added: 2009-07-29
Type: variant	Subtag: aluku	Description: Aluku dialect	Description: Boni dialect	Added: 2009-09-05	Prefix: djk	Comments: Aluku dialect of the "Busi Nenge Tongo" English-based Creole	  continuum in Eastern Suriname and Western French Guiana
Type: variant	Subtag: basiceng	Description: Basic English	Added: 2015-12-29	Prefix: en
Type: variant	Subtag: boont	Description: Boontling	Added: 2006-09-18	Prefix: en	Comments: Jargon embedded in American English
Type: variant	Subtag: cornu	Description: Cornu-English	Description: Cornish English	Description: Anglo-Cornish	Added: 2015-12-07	Prefix: en
Type: variant	Subtag: emodeng	Description: Early Modern English (1500-1700)	Added: 2012-02-05	Prefix: en
Type: variant	Subtag: ndyuka	Description: Ndyuka dialect	Description: Aukan dialect	Added: 2009-09-05	Prefix: djk	Comments: Ndyuka dialect of the "Busi Nenge Tongo" English-based	  Creole continuum in Eastern Suriname and Western French Guiana
Type: variant	Subtag: newfound	Description: Newfoundland English	Added: 2015-11-25	Prefix: en-CA
Type: variant	Subtag: oxendict	Description: Oxford English Dictionary spelling	Added: 2015-04-17	Prefix: en
Type: variant	Subtag: pamaka	Description: Pamaka dialect	Added: 2009-09-05	Prefix: djk	Comments: Pamaka dialect of the "Busi Nenge Tongo" English-based	  Creole continuum in Eastern Suriname and Western French Guiana
Type: variant	Subtag: scotland	Description: Scottish Standard English	Added: 2007-08-31	Prefix: en
Type: variant	Subtag: scouse	Description: Scouse	Added: 2006-09-18	Prefix: en	Comments: English Liverpudlian dialect known as 'Scouse'
Type: variant	Subtag: spanglis	Description: Spanglish	Added: 2017-02-23	Prefix: en	Prefix: es	Comments: A variety of contact dialects of English and Spanish
Type: grandfathered	Tag: en-GB-oed	Description: English, Oxford English Dictionary spelling	Added: 2003-07-09	Deprecated: 2015-04-17	Preferred-Value: en-GB-oxendict

Related Wikipedia Page:
https://en.wikipedia.org/wiki/Regional_accents_of_English

@LBeaudoux
Contributor

LBeaudoux commented Jul 19, 2022

I haven't found a standard specifically dedicated to accents. But IETF BCP 47 allows several subtags to be combined to identify a regional variation of a language, which also gives an indication of the accent.

For example, en-scotland for Scottish English, es-US for United States Spanish, or fr-CA for Quebec French. If the audio contributor wants to give even more details, he should be able to do so in a description text.

The advantage of these tags is that they are relevant for both speech and text, which makes it possible to use them to classify sentences and audios on Tatoeba.
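To show how such tags break down into subtags, here is a deliberately simplified parser. It only handles the language, script, region, and variant subtags; extended language subtags, extensions, private-use subtags, and grandfathered tags from BCP 47 are left out, and it does not check subtags against the IANA registry. The function name `parse_tag` is an assumption made for this sketch.

```python
import re

# Simplified BCP 47 shape: language[-script][-region][-variant...]
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)$"
)


def parse_tag(tag):
    """Split a language tag into its subtags, or raise ValueError."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed language tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    return {
        "language": match.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
        # The variants group looks like "-scotland-..." so drop empty pieces.
        "variants": [v.lower() for v in match.group("variants").split("-") if v],
    }
```

For instance, `parse_tag("en-scotland")` yields the language `en` with the variant `scotland`, while `parse_tag("fr-CA")` yields the language `fr` with the region `CA`.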

@ckjpn

ckjpn commented Jul 19, 2022

Perhaps accents could be written something like the following, going from general to more specific.

Accent: American >> Southern >> Georgia >> Atlanta
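A general-to-specific notation like this could be stored as an ordered path, which makes "everything under Southern" a simple prefix check. The sketch below is only an illustration; the function names and the `>>` separator handling are assumptions based on the line above.

```python
def parse_accent(path):
    """Turn 'American >> Southern >> Georgia' into ('American', 'Southern', 'Georgia')."""
    return tuple(part.strip() for part in path.split(">>") if part.strip())


def is_within(accent, ancestor):
    """True if `accent` is the same as, or more specific than, `ancestor`."""
    return accent[:len(ancestor)] == ancestor


atlanta = parse_accent("American >> Southern >> Georgia >> Atlanta")
southern = parse_accent("American >> Southern")
```

Here `is_within(atlanta, southern)` is true, while `is_within(southern, atlanta)` is false because the general accent is not contained in the more specific one.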

Perhaps looking into how this site does things might give you some ideas.

IDEA: International Dialects of English
https://www.dialectsarchive.com/

I notice they also specify "white" and "black."
https://www.dialectsarchive.com/georgia

Projects
None yet
Development

No branches or pull requests

4 participants