rules and blacklist for Esperanto #49

Closed
wants to merge 2 commits into from

Conversation

stefangrotz
Contributor

@stefangrotz stefangrotz commented Sep 2, 2019

  • The list of disallowed words was produced with the scripts from the README. I chose to exclude all words that occur fewer than 80 times.
  • The script produces 128,000 sentences, around 96,000 after removing duplicates.
  • Two people, a fluent speaker and an intermediate speaker, have reviewed 300 random sentences. Both estimated an error rate between 7% and 10%. I added a few more rules and more abbreviations to the file based on their feedback. Is this enough, or should someone confirm this here on GitHub?
  • The rule file excludes most letters that are not part of the Esperanto alphabet, as well as a lot of abbreviations. I also exclude character patterns that are extremely unusual in Esperanto but very common in other languages (like "the" or "sch").
  • I have a pretty long list of sentences and patterns that I want to delete manually once the official file is available. Will it be sorted alphabetically? This would make things a lot easier.
    I am currently rerunning the script to check whether the few new rules broke anything major. I could also create another file with 300 random sentences and get confirmation here on GitHub if necessary. The extraction generally takes a little more than two hours on my four-year-old ThinkPad E540.
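
For readers unfamiliar with the approach, the frequency threshold and the character rules described above can be sketched roughly as follows. This is only an illustration in Python, not the scripts from the README or the actual rule file: the function names, the word-splitting regex, and the choice to match "the" and "sch" as plain substrings are all assumptions.

```python
import re
from collections import Counter

# The Esperanto alphabet: no q, w, x, y, plus six accented letters.
ESPERANTO_LETTERS = "abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"

# Runs of Unicode letters (hypothetical tokenizer, chosen for this sketch).
WORD_RE = re.compile(r"[^\W\d_]+")


def build_blacklist(sentences, min_count=80):
    """Return every word that occurs fewer than min_count times."""
    counts = Counter(
        word.lower() for s in sentences for word in WORD_RE.findall(s)
    )
    return {word for word, count in counts.items() if count < min_count}


def uses_only_esperanto_letters(sentence):
    """True if every alphabetic character belongs to the Esperanto alphabet."""
    return all(
        ch in ESPERANTO_LETTERS for ch in sentence.lower() if ch.isalpha()
    )


def has_foreign_pattern(sentence, patterns=("the", "sch")):
    """True if the sentence contains one of the given substrings.

    Substring matching is an assumption; the real rules may match differently.
    """
    lowered = sentence.lower()
    return any(p in lowered for p in patterns)
```

A sentence would then be kept only if it contains no blacklisted word, uses only Esperanto letters, and matches none of the foreign patterns.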

@stefangrotz
Contributor Author

stefangrotz commented Sep 2, 2019

The file with 300 random sentences extracted based on the rules in this commit is now available here: https://github.com/stefangrotz/common-voice-eo-vikipedio/blob/master/random300-github-review.txt

@Vanege, @picsi, @RobinvanderVliet, @Gregoor and @bidaian, could you please review at least 100 sentences and estimate the error rate? This would help us a lot. If you can't, do you know other Esperantists on GitHub whom we could ask?

EDIT: Reading through the first 100 sentences, I found 5-7 mistakes. Five are clearly foreign words or grammar errors, but the other two are just a missing "and" in each of two lists. They wouldn't cause problems with pronunciation.
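
To put a count like "5-7 mistakes in 100 sentences" in perspective, a confidence interval for the underlying error rate can be computed from the sample. The following is a minimal Python sketch using the Wilson score interval; the function name and the 95% level (`z = 1.96`) are choices made for this illustration and are not part of the review process described here.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Example: 6 sentences with errors out of 100 reviewed.
low, high = wilson_interval(6, 100)
print(f"estimated error rate: 6.0%, interval: {low:.1%} to {high:.1%}")
```

With only 100 sentences the interval is fairly wide, which is one reason to have several reviewers or a larger sample.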

@nukeador

nukeador commented Sep 3, 2019

Thanks for this.

  • Yes, you will be able to request removals after the sentences have been extracted.
  • It would be great if 2-3 native speakers also commented here with their reviews.

Cheers.

@stefangrotz
Contributor Author

stefangrotz commented Sep 3, 2019

Since there are only around 1,000 native speakers and very few of them use GitHub, the reviewers will most likely be people who speak Esperanto well as a second language. I will ask around to get people to comment here.

@bidaian

bidaian commented Sep 5, 2019

Hi, I could review them and show some of them to an actual native speaker, but I need some context. Are we reviewing just the list of 300 random sentences, or the rules as well? Also, based on the error rate and the current rules, do you plan to mass-import texts from Wikipedia?

How should we report the results of the review (the wrong sentences, just the number of errors, ...)?

@stefangrotz
Contributor Author

stefangrotz commented Sep 5, 2019

Hello @bidaian, thanks for your help. Yes, this is a script that can mass-import sentences from Wikipedia based on strict rules that keep it legally CC0. You can read about the details here. Many languages on Common Voice have already used this.

We are only reviewing the list of 300 sentences here, and a number like "6/100 sentences contain errors" is enough. More details and reviews of the rules are always welcome, of course, but not necessary to get this pull request approved.

@stefangrotz
Contributor Author

stefangrotz commented Nov 20, 2019

A few months have passed since I created this, so I will open a new pull request with a new blacklist and a description that is clearer to outsiders.

EDIT: here is the new pull request: #54

3 participants