Allow dialect labels on audio files #2958
Note to self: dev/src/View/Helper/AudioHelper.php
Good idea!
Are you interested in working on this? Let me share my design ideas for improving the situation. Right now, every audio attribute is tied to a specific user. The users table includes fields such as audio_licence, audio_attribution_url... To implement your suggestion, we'd add one more field to the users table, such as audio_dialect. It would be perfectly fine to implement it like that. But here is a suggestion that relates to this thread. We should take the users.audio_* fields out and put them into a new table that I would call "voices". The schema then becomes: every audio has one voice (instead of one user), and every voice has one user. Most audio contributors would have just one voice, but some may have more.
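To make the "voices" idea concrete, here is a minimal sketch of the schema using Python's built-in sqlite3 module. It is only an illustration: apart from the users.audio_* fields mentioned above, every table and column name here (voices.dialect, audios.voice_id, and so on) is an assumption, not the actual Tatoeba schema.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not Tatoeba's real tables.
SCHEMA = """
CREATE TABLE users  (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
CREATE TABLE voices (
    id              INTEGER PRIMARY KEY,
    user_id         INTEGER NOT NULL REFERENCES users(id),  -- every voice has one user
    licence         TEXT,   -- moved from users.audio_licence
    attribution_url TEXT,   -- moved from users.audio_attribution_url
    dialect         TEXT    -- the new field discussed in this issue
);
CREATE TABLE audios (
    id          INTEGER PRIMARY KEY,
    sentence_id INTEGER NOT NULL,
    voice_id    INTEGER NOT NULL REFERENCES voices(id)      -- every audio has one voice
);
"""

def build_demo():
    """Create an in-memory database with one user who has two voices."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO users VALUES (1, 'example_user')")
    db.executemany(
        "INSERT INTO voices VALUES (?, ?, ?, ?, ?)",
        [(1, 1, "CC BY 4.0", None, "en-GB"), (2, 1, "CC BY 4.0", None, "en-US")],
    )
    db.executemany(
        "INSERT INTO audios VALUES (?, ?, ?)",
        [(1, 100, 1), (2, 100, 2), (3, 101, 1)],
    )
    return db

def dialect_of_audio(db, audio_id):
    """Look up the dialect of an audio through its voice."""
    row = db.execute(
        "SELECT v.dialect FROM audios a JOIN voices v ON v.id = a.voice_id "
        "WHERE a.id = ?",
        (audio_id,),
    ).fetchone()
    return row[0] if row else None
```

With this layout, two recordings of the same sentence by the same user can still carry different dialects, because the dialect hangs off the voice rather than the user.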
Yeah, I wanted to give it a go because I’ve been learning about programming, and I’m actually getting help from Trang today so I can get more of a feel for how the system works. For now I wanted to just try implementing the basic fix of adding “audio.dialect”, but I like that idea of switching to voices, so that if some contributors turn out to have several voices, we have the framework for it. Or were you thinking of something else?
If you are learning about programming, I suggest that you start with something simple. You could add a new field for the dialect (without adding a whole new "voices" table). You can go step by step and do the voice thing later.
Sorry if you already started implementing this, but a dialect pertains to a user, not the audio. If you add "audios.dialect" (that is, a field "dialect" in the table "audios"), the value of that field will be basically the same for all audios of a single user. This violates the principle of a single source of truth. Such a design won't let us easily know the dialect of a given contributor unless we go and check every single audio he contributed. Similarly, if we want to edit the dialect of a user, we'd need to go through the "dialect" fields of all his audios. So instead, it's better to add a "users.audio_dialect" field. This design also allows an easier transition to the "voice" design, because CK, who is responsible for adding audio, is currently creating a separate user for each voice he imports from Common Voice. So in the end we'll just have to "change" each of these dummy users (such as https://tatoeba.org/fr/user/profile/CVTR) into a new voice. I hope this makes sense to you, otherwise feel free to ask me or Trang! 🙂
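The single-source-of-truth argument can be sketched as follows, again with Python's sqlite3 module. This is a simplified illustration: only the column name users.audio_dialect comes from the discussion above; the rest of the schema, the username and the sample values are assumptions.

```python
import sqlite3

def make_db():
    """In-memory demo: one user with three audios (hypothetical data)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE users  (id INTEGER PRIMARY KEY, username TEXT, audio_dialect TEXT);
        CREATE TABLE audios (id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(id));
    """)
    db.execute("INSERT INTO users VALUES (1, 'example_user', 'en-US')")
    db.executemany("INSERT INTO audios VALUES (?, 1)", [(1,), (2,), (3,)])
    return db

def set_dialect(db, user_id, dialect):
    """Editing the dialect touches exactly one row, however many audios exist."""
    db.execute("UPDATE users SET audio_dialect = ? WHERE id = ?", (dialect, user_id))

def dialects_by_audio(db):
    """Each audio derives its dialect from its user via a join."""
    rows = db.execute(
        "SELECT a.id, u.audio_dialect FROM audios a "
        "JOIN users u ON u.id = a.user_id ORDER BY a.id"
    )
    return dict(rows.fetchall())
```

Had the dialect been stored in audios.dialect instead, set_dialect would have to rewrite every audio row of that user, and the rows could drift out of sync.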
I don't agree with this at all. I think having a profile for each voice is good since we can include all the info we know about a given voice in the profile. See a couple of examples: https://tatoeba.org/en/user/profile/CVjpn1 https://tatoeba.org/en/user/profile/CVeng25 I wouldn't call them "dummy users" since each represents an individual person. There should be no problem with having too many accounts because of this. The fact that https://tatoeba.org/fr/user/profile/CVTR is an account that has many voices is because the member who created that account preferred doing it that way and TRANG said it was OK. Added later: Perhaps we do agree. Are you only suggesting that CVTR be divided to indicate each individual's voice, or were you suggesting getting rid of all the individual CVjpn and CVeng accounts?
When I used the words "dummy users", I didn't mean anything negative about the actual people who contributed audio. It was just my way of describing the fact that you are using user accounts to store information about contributors who are not Tatoeba contributors but Common Voice contributors. As far as I understand, none of these accounts can be used to directly reach out to these people; there is only you at the other end. The profile picture, the username, the comments and wall posts, if any, are not theirs, but yours. You are using user accounts in a way they were not originally designed for, but that happens to work for your purpose. (Which is alright, by the way 🙂) My intention here, as a developer, is just to fill this gap by creating a new kind of data (that I called "voice") that better matches the reality. By "reality" I mean the fact that there are a bunch of recordings contributed by people from Common Voice and you are the one who helps import them into Tatoeba. The benefit of having voice-related data properly structured is that it can be better presented (e.g. a translatable UI text saying "this audio comes from Common Voice") and better processed (e.g. "how many different voices does this Tatoeba user contribute?" or "please show me all the voices from the UK").
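As a rough illustration of what "better processed" could mean, the two example questions above can be answered with a few lines once voices are structured data. The Voice record, its field names and the sample entries below are all hypothetical; the dialect is assumed to be stored as a language tag such as "en-GB".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Voice:
    name: str     # hypothetical label for the voice
    owner: str    # Tatoeba username that imported or recorded it
    dialect: str  # assumed to be a language tag such as "en-GB"
    source: str   # free-text origin, e.g. "Common Voice"

# Made-up sample data, for illustration only.
VOICES = [
    Voice("voice-a", "importer", "en-GB", "Common Voice"),
    Voice("voice-b", "importer", "en-US", "Common Voice"),
    Voice("voice-c", "someone_else", "en-GB", "own recording"),
]

def count_voices(voices, owner):
    """How many different voices does this Tatoeba user contribute?"""
    return sum(1 for v in voices if v.owner == owner)

def voices_from_region(voices, region):
    """All voices whose dialect tag contains the given region subtag."""
    wanted = region.upper()
    return [
        v for v in voices
        if wanted in (part.upper() for part in v.dialect.split("-")[1:])
    ]
```

For instance, `voices_from_region(VOICES, "GB")` returns voice-a and voice-c, regardless of which account they belong to.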
I have done none of these things and do not ever plan to. (ADDED LATER: I guess I did choose the usernames, though.) Each profile very clearly states that it was created by CK, what its purpose is, and what Common Voice's purpose is, with a link to Common Voice. I think it's very clear that "none of these accounts can be used to directly reach out to these people." Some of the advantages of creating profiles for each voice are as follows
A new kind of data called "voice"
Adding this new kind of data is a good idea, too. Some ideas.
Having these fields for audio contributors to fill out would encourage contributors to give us this information. Perhaps this form could be added next to the "select license" option on the audio page of each profile. To me, also including things like the source, such as Common Voice, seems somewhat unrelated. The files, being in the public domain, do not really need to be credited, even though I put this information on the profile pages. I doubt people would find the source very useful as part of a "voice" set of data, since the username link already sends people to the profile page, which includes this information for CV voices, along with whatever information other users have included in their profiles. Perhaps you could include one extra field called "note" (or something similar), which could be used for the source of the files, or any other information that an audio contributor feels would be useful about the voice.
Just as languages are identified (in theory) by their ISO 639-3 language code on Tatoeba, I think we should follow a standard for dialects. The IETF BCP 47 language tag seems to be the reference in this case.
Is this the list you are referring to? I think I correctly grabbed every item with the word English, just to get an idea of what's in the file.
Related Wikipedia Page:
I haven't found a standard specifically dedicated to accents. But IETF BCP 47 allows combining several subtags to identify a regional variation of a language, which also gives an indication of the accent. For example The advantage of these tags is that they are relevant for both speech and text, which makes it possible to use them to classify both sentences and audios on Tatoeba.
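To show how subtags combine, here is a small sketch that splits a tag into its parts. It is deliberately simplified and is an assumption about how one might handle these tags, not a full BCP 47 parser: it only understands the shape language[-script][-region][-variant...], and ignores extended language subtags, extensions, private-use subtags and grandfathered tags. It checks the shape of each subtag, not whether it is actually registered.

```python
import re

# Simplified shape: language[-script][-region][-variant...]
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)$"
)

def parse_tag(tag):
    """Split a language tag into subtags, normalising the usual casing."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    return {
        "language": match.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
        "variants": [v.lower() for v in match.group("variants").split("-") if v],
    }
```

For example, `parse_tag("en-GB")` yields language "en" and region "GB", while a tag with a trailing variant subtag keeps that variant in the "variants" list.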
Perhaps accents could be written something like the following, going from general to more specific.
Accent: American >> Southern >> Georgia >> Atlanta
Perhaps looking into how this site does things might give you some ideas. IDEA: International Dialects of English. I notice they also specify "white" and "black."
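One possible way to handle such general-to-specific labels is to treat them as paths and compare prefixes. This is only a sketch of the idea; the ">>" separator is taken from the example above, and the function names are made up for illustration.

```python
SEPARATOR = ">>"

def parse_accent(label):
    """Turn 'American >> Southern' into ('American', 'Southern')."""
    return tuple(part.strip() for part in label.split(SEPARATOR) if part.strip())

def is_within(accent, ancestor):
    """True if `accent` is the same as, or more specific than, `ancestor`."""
    return accent[:len(ancestor)] == ancestor
```

With this, a search for "American >> Southern" would also match audio labelled "American >> Southern >> Georgia >> Atlanta", but not the other way around.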
I was thinking: having several audio files available for a sentence is a great achievement (see #183), but I feel like having more information immediately visible on the sentence page, such as the dialect of the speaker, would make it even easier to see the differences between recordings.
Maybe for now it could simply be some text that either the audio contributor or the admin initially adds, that then gets shown on every sentence that the contributor records. (Although this might get complicated if said contributor contributes audio in more than one language)
Then maybe later on we can make it so that it works like tags, and we can organize audio based on these labels, so if we wanted to find specifically British English audio for example, we could use a tag for that.
Rough representation:
Difference is I'd like the box to be expanded a bit to make room for the two lines to fit.
We could also use a similar method for any notes to add to the audio, such as specifying whether it's past or present tense, or even the mood of the way it's said, but that'd be a lot more manual since you'd have to deal with each individual sentence.