Add basque rules and blacklist #95
Some contributions to the Basque config.
Sentences containing the & character are very likely to include content in another language.
Add & to the list
Add another unusual character to the list
Add another unusual character to the list
# Conflicts:
#   src/rules/disallowed_words/basque.txt
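To illustrate what the commits above are aiming for, here is a minimal sketch of dropping sentences that contain a disallowed character. It is not the extractor's actual implementation: `DISALLOWED_CHARS` and `is_allowed` are hypothetical names, and `&` is the only character this thread names explicitly.

```python
# Hypothetical set of disallowed characters; only "&" is named in this thread.
DISALLOWED_CHARS = {"&"}


def is_allowed(sentence: str, disallowed=DISALLOWED_CHARS) -> bool:
    """Return True when the sentence contains none of the disallowed characters."""
    return not any(ch in disallowed for ch in sentence)


# Made-up example sentences, for illustration only.
sentences = [
    "Kaixo, zer moduz?",
    "Rock & roll taldea",
]

# Keep only the sentences without a disallowed character.
kept = [s for s in sentences if is_allowed(s)]
print(kept)
```

The second sentence is dropped because it contains `&`, which matches the reasoning that such sentences often carry foreign-language content.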
I commented on https://discourse.mozilla.org/t/technical-feedback-needed-wikipedia-extractor-script-beta/42983/54?u=mkohler instead of here.
Can I ask you to do the following, so that you can benefit from the automatic sample extraction we just introduced? This might help with any decisions we have to make on this.
Also note that the local command for extraction will now be:
Happy to answer any questions you may have, and thanks for your efforts!
Will do! Thank you for the feedback!
All right, I merged master into the Basque branch and renamed the rules. The result was 94767 unique sentences. Originally we had a file of around 110000 sentences, which we cut down manually to around 55000 after volunteers validated them, as explained by @Txopi. I'll attach the new file in case you want to have a look at it.
We reviewed the error rate on a random sample of 500 sentences from the extraction, and 267 of them turned out to be bad. That is an error rate of 53.4%.
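As a quick check of that figure, the sketch below recomputes the rate from the two counts given above. The Wilson score interval is an addition that is not part of the original review; it is included only to show roughly how much uncertainty a 500-sentence sample carries. The function names are hypothetical.

```python
import math


def error_rate(bad: int, total: int) -> float:
    """Fraction of reviewed sentences that were judged bad."""
    return bad / total


def wilson_interval(bad: int, total: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for the error rate (z = 1.96)."""
    p = bad / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Counts taken from the review above: 267 bad sentences out of 500.
rate = error_rate(267, 500)
low, high = wilson_interval(267, 500)
print(f"error rate: {rate:.1%}")
print(f"approx. 95% interval: {low:.1%} to {high:.1%}")
```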
Can you create a topic on Discourse to discuss this? 53.4% is definitely too high; we're looking for 5-7%. Maybe somebody on Discourse has a good idea for decreasing the error rate further. It would be great if you could write down the problem and give examples. Thanks!
We got about 55000 sentences after de-duplication.
We filtered it by <20 repetitions.
550 different sentences were reviewed from a random sample of 1400 sentences, with an error ratio of around 2%, mostly coming from errors in the Wikipedia articles rather than the scraper.
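The steps above could be sketched roughly as follows. This is an assumption-laden illustration, not the procedure actually used: it reads "<20 repetitions" as "drop sentences containing a word that occurs fewer than 20 times in the corpus", which is only one plausible interpretation, and every function name here is hypothetical.

```python
import random
from collections import Counter


def rare_words(sentences, min_count=20):
    """Words that occur fewer than min_count times across all sentences."""
    counts = Counter(word.lower() for s in sentences for word in s.split())
    return {w for w, c in counts.items() if c < min_count}


def filter_sentences(sentences, blocked):
    """Drop every sentence that contains at least one blocked word."""
    return [s for s in sentences if not any(w.lower() in blocked for w in s.split())]


def review_sample(sentences, size, seed=0):
    """De-duplicate, then draw a reproducible random sample for manual review."""
    unique = sorted(set(sentences))
    rng = random.Random(seed)
    return rng.sample(unique, min(size, len(unique)))


# Tiny made-up corpus; min_count=2 stands in for the threshold of 20.
corpus = [
    "etxea handia da",
    "etxea txikia da",
    "etxea handia da",
    "zuhaitza handia da",
]
blocked = rare_words(corpus, min_count=2)
filtered = filter_sentences(corpus, blocked)
sample = review_sample(filtered, size=5)
print(blocked, filtered, sample)
```

Fixing the seed makes the review sample reproducible, so a second reviewer can check the same sentences.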
Link: Common Voice 500 esaldien azterketa (review of 500 Common Voice sentences), PDF