This repository has been archived by the owner on Mar 8, 2023. It is now read-only.

Define general rules for all sentences #78

Closed
mone27 opened this issue Jul 12, 2020 · 3 comments
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@mone27
Collaborator

mone27 commented Jul 12, 2020

I think it is important to define some rules for processing sentences from all importers.
These checks can be done either in the wrapper script or in sanitize.py (which may be more efficient).
My proposal is:

  • everything should be converted to lowercase
  • if a sentence matches the regex [^\s'abcdefghijklmnopqrstuvwxyzàèéìíòóôùú,\.!?:;], it contains invalid characters and should be discarded

The discard step will run after trying to clean the sentences (for example, removing trailing dashes or unescaping HTML).
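
The proposal above could be sketched roughly as follows. This is only an illustration, not the project's actual sanitize.py: the function names `clean_sentence` and `filter_sentences` are hypothetical, and the cleaning steps shown (HTML unescaping, trailing-dash removal) are just the two examples mentioned in the comment. The character class is copied verbatim from the proposal.

```python
import html
import re

# Character class copied from the proposal: a match means the sentence
# contains at least one character outside the allowed set.
INVALID_CHARS = re.compile(r"[^\s'abcdefghijklmnopqrstuvwxyzàèéìíòóôùú,\.!?:;]")


def clean_sentence(sentence: str) -> str:
    """Hypothetical cleaning step: unescape HTML, drop trailing dashes, lowercase."""
    sentence = html.unescape(sentence)
    sentence = sentence.strip().rstrip("-").rstrip()
    return sentence.lower()


def filter_sentences(sentences):
    """Yield cleaned sentences, discarding any that still contain invalid characters."""
    for raw in sentences:
        cleaned = clean_sentence(raw)
        # The discard check runs only after cleaning, as proposed.
        if cleaned and not INVALID_CHARS.search(cleaned):
            yield cleaned
```

For example, `list(filter_sentences(["Ciao &egrave; bello -", "Prezzo: 5€"]))` keeps the first sentence (cleaned to `ciao è bello`) and discards the second, because digits and `€` are not in the allowed set.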

@Mte90
Member

Mte90 commented Jul 12, 2020

For the first point, we need to see how it affects the various exporters.
My idea is to keep them as they are, so that the corpus can also be used elsewhere.

As for the second point, I think we need to allow more symbols, such as `-\°()[]€$, but that is something we can improve over time.

@Mte90 Mte90 added good first issue Good for newcomers help wanted Extra attention is needed labels Nov 9, 2020
@Mte90
Member

Mte90 commented Nov 9, 2020

@nefastosaturo
Collaborator

Right now, in merge_text.sh there is a final regex that filters each sentence. The final regex is:

^[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzèÈ][ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzàÀèÈéÉìÌòÒùÙ' ]+$

So there is currently no special need for this task, and I'm closing this issue.
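
To see what that final regex accepts, here is a small Python sketch. It is an assumption-laden illustration: how merge_text.sh actually applies the pattern is not shown in this thread, and `passes_final_filter` is a hypothetical helper name. Only the pattern itself is taken from the comment above.

```python
import re

# Pattern copied from the comment above, split across two string literals
# for readability: one leading character, then one or more characters from
# the second class, anchored to the whole sentence.
FINAL_REGEX = re.compile(
    r"^[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzèÈ]"
    r"[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzàÀèÈéÉìÌòÒùÙ' ]+$"
)


def passes_final_filter(sentence: str) -> bool:
    """Return True if the whole sentence matches the final regex."""
    return FINAL_REGEX.fullmatch(sentence) is not None


if __name__ == "__main__":
    for s in ["È una bella giornata", "l'acqua è buona", "Ciao, mondo", "Perché no?"]:
        print(repr(s), passes_final_filter(s))
```

Compared with the original proposal, this pattern keeps uppercase letters but rejects all punctuation except the apostrophe, so sentences containing commas or question marks are dropped.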
