Add Basque rules and blacklist #95

Closed
wants to merge 16 commits
Conversation

Thadah

@Thadah Thadah commented Feb 29, 2020

  • How many sentences did you get at the end?

We got about 55000 sentences after de-duplication.

  • How did you create the blacklist file?

We filtered out every word that appears fewer than 20 times in the corpus.
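The frequency-based filtering described above could be sketched roughly like this (a hypothetical one-liner; `wiki.eu.txt` and `blacklist.txt` are illustrative file names, and the actual tooling used for the blacklist may differ):

```shell
# Split the extracted sentences into words, count how often each word
# occurs, and keep only words that appear fewer than 20 times.
# wiki.eu.txt and blacklist.txt are illustrative names, not the
# actual files in this PR.
tr -s '[:space:]' '\n' < wiki.eu.txt \
  | sort \
  | uniq -c \
  | awk '$1 < 20 { print $2 }' \
  > blacklist.txt
```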

  • Get at least 3 different native speakers (ideally linguists) to review a random sample of 100-500 sentences, estimate the average error ratio, and comment (or link their comment) in the PR.

550 different sentences were reviewed from a random sample of 1400 sentences, with an error ratio of around 2%, mostly coming from errors in the Wikipedia articles rather than the scraper.

Link: Common Voice 500 esaldien azterketa PDF ("analysis of the 500 Common Voice sentences")

@MichaelKohler
Member

Can I ask you to do the following, to make sure you can benefit from the automatic sample extraction we just introduced? This might help with any decisions we have to make regarding this.

  • Update your branch with the latest code from the master branch
  • Rename src/rules/basque.toml to src/rules/eu.toml
  • Rename src/rules/disallowed_words/basque.toml to src/rules/disallowed_words/eu.toml
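From the fork's checkout, the steps above could look roughly like this (a sketch; the remote name "upstream" for the main repository and the commit message are assumptions, not part of the PR):

```shell
# Pull in the latest master from the main repository ("upstream" is an
# assumed remote name), then rename the Basque rule files.
git fetch upstream
git merge upstream/master
git mv src/rules/basque.toml src/rules/eu.toml
git mv src/rules/disallowed_words/basque.toml src/rules/disallowed_words/eu.toml
git commit -m "Rename Basque rule files to eu.toml"
```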

Also note that the local command for extraction will now be:

cargo run -- extract -l eu -d path/to/files

Happy to answer any questions you may have, and thanks for your efforts!

@Thadah
Author

Thadah commented Mar 6, 2020

Will do! Thank you for the feedback!

@Thadah
Author

Thadah commented Mar 7, 2020

All right, I merged master into the basque branch and renamed the rule files.

The result was 94767 unique sentences. Originally we had around 110000 sentences, which we cut down manually to around 55000 after they were validated by volunteers, as explained by @Txopi in
https://discourse.mozilla.org/t/technical-feedback-needed-wikipedia-extractor-script-beta/42983/55

I'll attach the new file just in case you want to have a look at it.
wiki.eu.txt

@Thadah
Author

Thadah commented Mar 15, 2020

We did an error-rate review of a random 500-sentence sample from the extraction, and 267 sentences turned out to be bad. This translates to an error ratio of 53.4%.

Wiki.eu-94767-lagina500.txt

@MichaelKohler
Member

Can you create a topic on Discourse to discuss this? 53.4% is definitely too high; we're looking for 5-7%. Maybe somebody on Discourse has a good idea on how to further decrease the error rate. It would be great if you could write down the problem and give examples. Thanks!
