Add Basque rules and blacklist #95

Closed
wants to merge 16 commits
Conversation

Thadah

@Thadah Thadah commented Feb 29, 2020

  • How many sentences did you get at the end?

We got about 55000 sentences after de-duplication.

  • How did you create the blacklist file?

We filtered out every word that appears fewer than 20 times in the corpus.
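The frequency-based filtering described above could be sketched roughly like this (a hypothetical one-liner; `wiki.eu.txt` and `blacklist.txt` are illustrative file names, and the actual tooling used for the blacklist may differ):

```shell
# Split the extracted sentences into words, count how often each word
# occurs, and keep only words that appear fewer than 20 times.
# wiki.eu.txt and blacklist.txt are illustrative names, not the
# actual files in this PR.
tr -s '[:space:]' '\n' < wiki.eu.txt \
  | sort \
  | uniq -c \
  | awk '$1 < 20 { print $2 }' \
  > blacklist.txt
```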

  • Get at least 3 different native speakers (ideally linguists) to review a random sample of 100-500 sentences, estimate the average error ratio, and comment (or link their comment) in the PR.

550 different sentences were reviewed from a random sample of 1400 sentences, with an error ratio of around 2%, mostly coming from errors in the Wikipedia articles rather than the scraper.

Link: Common Voice 500 esaldien azterketa PDF ("analysis of the 500 Common Voice sentences")

@MichaelKohler
Member

Can I ask you to do the following, to make sure you can benefit from the automatic sample extraction we just introduced? This might help with any decisions we have to make regarding this.

  • Update your branch with the latest code from the master branch
  • Rename src/rules/basque.toml to src/rules/eu.toml
  • Rename src/rules/disallowed_words/basque.toml to src/rules/disallowed_words/eu.toml
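From the fork's checkout, the steps above could look roughly like this (a sketch; the remote name "upstream" for the main repository and the commit message are assumptions, not part of the PR):

```shell
# Pull in the latest master from the main repository ("upstream" is an
# assumed remote name), then rename the Basque rule files.
git fetch upstream
git merge upstream/master
git mv src/rules/basque.toml src/rules/eu.toml
git mv src/rules/disallowed_words/basque.toml src/rules/disallowed_words/eu.toml
git commit -m "Rename Basque rule files to eu.toml"
```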

Also note that the local command for extraction will now be:

cargo run -- extract -l eu -d path/to/files

Happy to answer any questions you may have, and thanks for your efforts!

@Thadah
Author

Thadah commented Mar 6, 2020

Will do! Thank you for the feedback!

@Thadah
Author

Thadah commented Mar 7, 2020

All right, I merged master into the basque branch and renamed the rule files.

The result was 94767 unique sentences. Originally we had around 110000 sentences, which we cut down manually to around 55000 after they were validated by volunteers, as explained by @Txopi in
https://discourse.mozilla.org/t/technical-feedback-needed-wikipedia-extractor-script-beta/42983/55

I'll attach the new file just in case you want to have a look at it.
wiki.eu.txt

@Thadah
Author

Thadah commented Mar 15, 2020

We did an error-rate review of a random 500-sentence sample from the extraction, and 267 sentences turned out to be bad. This translates to an error ratio of 53.4%.

Wiki.eu-94767-lagina500.txt

@MichaelKohler
Member

Can you create a topic on Discourse to discuss this? 53.4% is definitely too high; we're looking for 5-7%. Maybe somebody on Discourse has a good idea on how to further decrease the error rate. It would be great if you could write down the problem and give examples. Thanks!
