This repository has been archived by the owner on Mar 8, 2023. It is now read-only.
I think it is important to define some rules for processing sentences from all importers.
These checks can be done either in the wrapper script or in sanitize.py (which may be more efficient).
My proposal is:
Everything should be converted to lowercase.
If a sentence matches the regex [^\s'abcdefghijklmnopqrstuvwxyzàèéìíòóôùú,\.!?:;], it contains invalid characters and should be discarded.
The discard happens only after trying to clean the sentence (for example, removing trailing dashes or unescaping HTML).
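To make the order of operations concrete, here is a minimal sketch of the proposed rules. The function name `clean_sentence` and the exact cleanup steps are assumptions for illustration, not the existing sanitize.py code; only the regex is taken from the proposal above.

```python
import html
import re

# Regex from the proposal: matches any character outside the allowed set.
INVALID_CHARS = re.compile(r"[^\s'abcdefghijklmnopqrstuvwxyzàèéìíòóôùú,\.!?:;]")

# Hyphen, en dash and em dash (assumed meaning of "trailing dashes").
TRAILING_DASHES = "-\u2013\u2014"


def clean_sentence(sentence):
    """Hypothetical helper: return the cleaned sentence, or None if it must be discarded."""
    # 1. Try to clean first: unescape HTML entities and drop trailing dashes.
    cleaned = html.unescape(sentence).strip()
    cleaned = cleaned.rstrip(TRAILING_DASHES).rstrip()
    # 2. Convert to lowercase, since the allowed set only lists lowercase letters.
    cleaned = cleaned.lower()
    # 3. Discard only after cleaning, if any invalid character is still present.
    if not cleaned or INVALID_CHARS.search(cleaned):
        return None
    return cleaned
```

With this sketch, `clean_sentence("Perch&eacute; no? --")` keeps the sentence after cleaning, while a sentence containing a digit such as `"Costa 5 euro"` is discarded. Every importer could call the same helper, which is the argument for putting the check in sanitize.py rather than in each wrapper script.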