Define general rules for all sentences #78

mone27 · 2020-07-12T11:01:58Z

I think it is important to define some rules for processing the sentences from all importers.
These checks can be done either in the wrapper script or in sanitize.py (which may be more efficient).
My proposal is:

Everything should be converted to lowercase.
[^\s'abcdefghijklmnopqrstuvwxyzàèéìíòóôùú,\.!?:;] If a sentence matches this regex, it contains invalid characters and should be discarded.

Sentences are discarded only after trying to clean them (for example, by removing trailing dashes or unescaping HTML).
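
A minimal Python sketch of how this proposal could look, assuming it lives in something like sanitize.py. The function name `normalize` and the exact cleaning steps are hypothetical; only the character class is taken from the proposal above.

```python
import html
import re

# Character class copied from the proposal: it matches any character
# that is NOT in the allowed set.
INVALID_CHARS = re.compile(r"[^\s'abcdefghijklmnopqrstuvwxyzàèéìíòóôùú,\.!?:;]")


def normalize(sentence):
    """Hypothetical helper: clean and lowercase a sentence.

    Returns the normalized sentence, or None if it must be discarded.
    """
    # Cleaning comes first, so fixable sentences are not thrown away.
    cleaned = html.unescape(sentence)              # "&eacute;" -> "é"
    cleaned = cleaned.strip().rstrip("-").strip()  # drop trailing dashes
    cleaned = cleaned.lower()

    # Discard empty sentences and sentences with characters outside the set.
    if not cleaned or INVALID_CHARS.search(cleaned):
        return None
    return cleaned
```

Because the check runs after lowercasing, uppercase letters never trigger a discard by themselves; digits and symbols such as € still do.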


Mte90 · 2020-07-12T11:05:37Z

For the first point, we need to see what happens with the various exporters.
My idea is to keep them as they are, so that the corpus can also be used elsewhere.

As for the second point, I think we need to add more symbols, such as `-\°()[]€$, but that is something we can improve later.

Mte90 · 2020-11-09T09:22:47Z

We can add it to https://github.com/MozillaItalia/DeepSpeech-Italian-Model/tree/master/MITADS

nefastosaturo · 2020-12-16T14:09:12Z

Right now, in merge_text.sh there is a final regex that filters each sentence. The final regex is:

^[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzèÈ][ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzàÀèÈéÉìÌòÒùÙ' ]+$

So right now there are no special needs for this task, and I'm closing this issue.
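
For anyone who wants to try the filter outside the shell script, here is a rough Python equivalent. It is only a sketch: the pattern is copied from the comment above, the name `keep_sentence` is hypothetical, and how merge_text.sh actually applies the regex (tool, locale, flags) is not shown in this thread, so edge cases may differ.

```python
import re

# Pattern copied from the comment above. The first character must be an
# unaccented letter or è/È; the rest may also contain the other accented
# vowels, apostrophes and spaces.
FINAL_FILTER = re.compile(
    r"^[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzèÈ]"
    r"[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzàÀèÈéÉìÌòÒùÙ' ]+$"
)


def keep_sentence(sentence):
    """Return True if the sentence passes the final filter."""
    # Strip the line ending first, as a line-based tool would.
    return FINAL_FILTER.match(sentence.rstrip("\n")) is not None
```

Note that this filter keeps uppercase letters and rejects all punctuation except the apostrophe, so it is stricter than the character set proposed at the top of the issue.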

Mte90 added the good first issue and help wanted labels Nov 9, 2020

nefastosaturo closed this as completed Dec 16, 2020
