rules and blacklist for Esperanto #49

stefangrotz · 2019-09-02T19:07:33Z

The list of disallowed words was produced with the scripts from the README. I chose to exclude all words that occur fewer than 80 times.
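For readers unfamiliar with the approach, a frequency-based blacklist can be sketched roughly as follows. This is only an illustration, not the README script itself: the function name `build_blacklist`, its parameters, and the simple `\w+` tokenization are assumptions.

```python
from collections import Counter
import re


def build_blacklist(sentences, min_count=80):
    """Return the words that occur fewer than min_count times, sorted by code point.

    Hypothetical helper: the real scripts may tokenize and count differently.
    """
    counts = Counter()
    for sentence in sentences:
        # \w+ matches runs of Unicode word characters, so words containing
        # ĉ, ĝ, ĥ, ĵ, ŝ or ŭ stay in one piece.
        counts.update(re.findall(r"\w+", sentence.lower()))
    return sorted(word for word, n in counts.items() if n < min_count)
```

With the threshold of 80 used here, every word seen 79 times or fewer in the corpus would end up in the list.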
The script produces 128,000 sentences, around 96,000 without repetitions.
Two people have read through 300 random sentences: one fluent speaker and one intermediate speaker. Both estimated the error rate at 7-10%. I added a few more rules and more abbreviations to the file based on their feedback. Is this enough, or should someone confirm this here on GitHub?
The rule file excludes most letters that are not part of the Esperanto alphabet, as well as a lot of abbreviations. I also exclude character patterns that are extremely unusual in Esperanto but very common in other languages (like "the" or "sch").
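The effect of these two kinds of rules can be illustrated with a small filter. This is a sketch under assumptions, not the extractor's actual rule format: the names `ESPERANTO_LETTERS`, `FOREIGN_PATTERNS` and `is_acceptable`, the allowed punctuation, and the two-entry pattern list (taken from the examples above) are all hypothetical.

```python
import re
import unicodedata

# The 28 letters of the Esperanto alphabet (no q, w, x, y).
ESPERANTO_LETTERS = "abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"

# Assumed set of allowed characters: Esperanto letters, spaces, basic punctuation.
ALLOWED = re.compile(rf"[{ESPERANTO_LETTERS} ,.?!'-]+")

# Hypothetical pattern list; the real rule file contains many more entries.
FOREIGN_PATTERNS = ("the", "sch")


def is_acceptable(sentence):
    """Return True if the sentence passes both the alphabet and the pattern check."""
    # NFC merges a base letter plus combining accent into one code point (g + ^ -> ĝ).
    text = unicodedata.normalize("NFC", sentence).lower()
    if not ALLOWED.fullmatch(text):
        return False
    return not any(pattern in text for pattern in FOREIGN_PATTERNS)
```

A sentence such as "Li loĝas en New York." fails the alphabet check because of "w" and "y", while a sentence containing "Schiller" uses only Esperanto letters and is caught by the "sch" pattern instead.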
I have a pretty long list of sentences and patterns that I want to delete manually once the official file is available. Will it be sorted alphabetically? That would make things a lot easier.
I am currently rerunning the script to check whether the few new rules broke anything major. I could also create another file with 300 random sentences and get confirmation here on GitHub if that is necessary. The extraction generally takes a little more than 2 hours on my four-year-old ThinkPad E540.

stefangrotz · 2019-09-02T20:44:08Z

The file with 300 random sentences extracted based on the rules in this commit is now available here: https://github.com/stefangrotz/common-voice-eo-vikipedio/blob/master/random300-github-review.txt
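A review file like this can be drawn from the extracted sentences along the following lines. This is a minimal sketch, not necessarily how the linked file was produced; `sample_for_review` and its parameters are hypothetical names.

```python
import random


def sample_for_review(sentences, k=300, seed=None):
    """Pick up to k distinct sentences at random for manual review."""
    # Drop repeated sentences first; sorting makes the result reproducible
    # for a given seed regardless of the input order.
    unique = sorted(set(sentences))
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))
```

Fixing `seed` lets several reviewers regenerate exactly the same sample from the same extraction output.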

@Vanege, @picsi, @RobinvanderVliet, @Gregoor and @bidaian: could you review at least 100 sentences and estimate the error rate, please? This would help us a lot. If you can't, do you know other Esperantists on GitHub whom we could ask?

EDIT: Reading through the first 100 sentences, I found 5-7 mistakes. Five are clear foreign words or grammar errors, but the other two are just a missing "and" in each of two lists. They wouldn't cause problems with pronunciation.
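Since a count like this comes from a sample, it only estimates the error rate of the full set. One common way to express the uncertainty is a Wilson score interval. The sketch below is an illustration of that standard formula, not part of the review process used here; the function name and the 95% default (`z=1.96`) are assumptions.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Return (low, high) bounds for the true error rate, given errors found in n sentences."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

Calling `wilson_interval(6, 100)` gives an interval several percentage points wide on either side of 6%, which is one argument for having reviewers cover more than 100 sentences each.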

nukeador · 2019-09-03T08:50:30Z

Thanks for this.

Yes, you will be able to request removals after the sentences have been extracted.
It would be great if 2-3 native speakers also commented here with their reviews.

Cheers.

stefangrotz · 2019-09-03T09:09:17Z

Since there are only around 1,000 native speakers and very few of them use GitHub, the reviewers will most likely be people who speak good Esperanto as a second language. I will ask around to get people to comment here.

bidaian · 2019-09-05T14:05:03Z

Hi, I could review them and show some of them to an actual native speaker, but I need some context. Are we reviewing just the list of 300 random sentences, or the rules as well? Also, based on the error rate and the current rules, do you plan to mass-import texts from Wikipedia?

How should we report the results of the review (the wrong sentences, just the number of errors, ...)?

stefangrotz · 2019-09-05T14:19:29Z

Hello @bidaian, thanks for your help. Yes, this is a script that can mass-import sentences from Wikipedia based on strict rules that keep the result legally CC0. You can read about the details here. Many languages on Common Voice have already used this.

We are only reviewing the list of 300 sentences here, and a number like "6/100 sentences contain errors" is enough. More details and reviews of the rules are always welcome of course, but they are not necessary to get this pull request approved.

stefangrotz · 2019-11-20T12:30:28Z

A few months have passed since I created this, so I will open a new pull request with a new blacklist and a description that is clearer to outsiders.

EDIT: here is the new pull request: #54

Stefan and others added 2 commits September 2, 2019 20:44

added rules and blacklist for Esperanto

0ca13d5

new line at the end of file

c98e315

stefangrotz closed this Nov 20, 2019

stefangrotz mentioned this pull request Nov 20, 2019

Esperanto wiki-extractor script for Common Voice #54

Merged
