rules and blacklist for Esperanto #49
The file with 300 random sentences extracted using the rules in this commit is now available here: https://github.com/stefangrotz/common-voice-eo-vikipedio/blob/master/random300-github-review.txt @Vanege and @picsi @RobinvanderVliet @Gregoor @bidaian could you please review at least 100 sentences and estimate the error rate? This would help us a lot. If you can't, do you know other Esperantists on GitHub whom we could ask? EDIT: Reading through the first 100 sentences, I found 5 to 7 mistakes. Five are clear foreign words and grammar errors; the other two are each just a missing "and" in a list, which wouldn't cause problems with pronunciation.
Thanks for this.
Cheers.
Since there are only around 1,000 native speakers and very few of them use GitHub, the reviewers will most likely be people who speak good Esperanto as a second language. I will ask around to get people to comment here.
Hi, I could review them and show some of them to an actual native speaker, but I need some context. Are we reviewing just the list of 300 random sentences, or the rules as well? Also, based on the error rate and the current rules, do you plan to mass-import texts from Wikipedia? How should we report the results of the review (the wrong sentences, just the number of errors, ...)?
Hello @bidaian, thanks for your help. Yes, this is a script that can mass-import sentences from Wikipedia based on strict rules that keep the result legally CC0. You can read about the details here. Many languages on Common Voice have already used it. We are only reviewing the list of 300 sentences here, and a number like "6/100 sentences contain errors" is enough. More details and reviews of the rules are always welcome, of course, but they are not necessary to get this pull request approved.
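For anyone who wants to report slightly more than the raw count, a reviewed sample can be turned into a rate with a rough uncertainty range. The following is only a minimal sketch and is not part of the extraction script: the function names `error_rate` and `wilson_interval` are made up for illustration, and the Wilson score interval is just one common choice for small samples.

```python
import math


def error_rate(errors: int, reviewed: int) -> float:
    """Fraction of reviewed sentences that contain at least one error."""
    if reviewed <= 0:
        raise ValueError("reviewed must be positive")
    if not 0 <= errors <= reviewed:
        raise ValueError("errors must be between 0 and reviewed")
    return errors / reviewed


def wilson_interval(errors: int, reviewed: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the true error rate (z = 1.96 is roughly 95%)."""
    p = error_rate(errors, reviewed)
    z2 = z * z
    denom = 1 + z2 / reviewed
    center = (p + z2 / (2 * reviewed)) / denom
    half = z * math.sqrt(p * (1 - p) / reviewed + z2 / (4 * reviewed ** 2)) / denom
    # Clamp to the valid probability range.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # The counts mentioned in this thread: 5 to 7 mistakes in 100 sentences.
    for errors in (5, 6, 7):
        low, high = wilson_interval(errors, 100)
        print(f"{errors}/100: rate {error_rate(errors, 100):.1%}, "
              f"interval {low:.1%} to {high:.1%}")
```

The interval is fairly wide for 100 sentences, which is one reason several independent reviews of the same list are useful.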
Since a few months have passed since I created this, I will open a new pull request with a new blacklist and a description that is clearer to outsiders. EDIT: here is the new pull request: #54
I am currently rerunning the script, mainly to check whether the few new rules broke anything major. I could also create another file with 300 random sentences and get confirmation here on GitHub if that is necessary. The extraction generally takes a little more than 2 hours on my four-year-old ThinkPad E540.
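Drawing a fresh review file from the extracted sentences could look roughly like the sketch below. This is an assumption about how one might do it, not how the review file in this pull request was actually produced; `sample_sentences` is a hypothetical helper, and the input is taken to be a plain list of sentences, one per entry.

```python
import random


def sample_sentences(sentences, k=300, seed=None):
    """Return k distinct sentences chosen uniformly at random.

    Blank entries are skipped and exact duplicates are collapsed first,
    so no sentence can appear twice in the review file.
    """
    unique = list(dict.fromkeys(s.strip() for s in sentences if s.strip()))
    if k > len(unique):
        raise ValueError(f"only {len(unique)} unique sentences available, need {k}")
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(unique, k)


if __name__ == "__main__":
    # Stand-in data; in practice this would be the extractor's output file.
    pool = [f"Frazo numero {i}." for i in range(1000)]
    for sentence in sample_sentences(pool, k=5, seed=49):
        print(sentence)
```

Passing a seed lets other reviewers regenerate exactly the same sample from the same extraction output.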