This repository has been archived by the owner on May 10, 2023. It is now read-only.

Add basic sentence validation requirements #33

Closed
MichaelKohler opened this issue Dec 20, 2018 · 6 comments

@MichaelKohler
Member

There is an ongoing discussion about how to improve the quality of the sentences, and it will produce a few requirements we need to implement here. Since the discussion is still in progress, treat this issue as a placeholder; it will be updated once we have the full list of decisions.

@MichaelKohler MichaelKohler added this to the MVP milestone Dec 20, 2018
@MichaelKohler MichaelKohler added the enhancement (New feature or request) and question (Further information is requested) labels Dec 20, 2018
@nukeador

To recap, the latest list we have is:

  • Numbers. There should be no digits in the source text because they can cause problems when read aloud. The way a number is read depends on context and might introduce confusion in the dataset. For example, the number “2409” could be accurately read as both “twenty-four zero nine” and “two thousand four hundred nine”.
  • Abbreviations and Acronyms. Abbreviations and acronyms like “USA” or “ICE” should be avoided in the source text because they may be read in a way that does not coincide with their spelling. Additionally, there may be multiple accurate readings for a single abbreviation. For example, the acronym “ICE” could be pronounced “I-C-E” or as a single word.
  • Punctuation. Special symbols and punctuation should only be included when absolutely necessary. For example, an apostrophe is included in English words like “don’t” and “we’re” and should be included in the source text, but it’s unlikely you’ll ever need a special symbol like “@” or “#.”
  • Foreign letters. Letters must be valid in the language being spoken. For example, “ж” is a letter in the Russian alphabet but is never used in English and so should never appear in any English source text.
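
A rough sketch of how the four rules above could be checked automatically for English. This is only an illustration, not what the project implements: the function name `validate_sentence`, the regular expressions, and the error messages are all assumptions, and the acronym rule (two or more consecutive capital letters) is a crude heuristic that would also flag words written in all caps.

```python
import re
import string
from typing import List

# All patterns below are illustrative assumptions for English only.
DIGITS = re.compile(r"[0-9]")
# Two or more consecutive capital letters standing alone, e.g. "USA" or "ICE".
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")
# A few special symbols that are rarely needed in a spoken sentence.
SYMBOLS = re.compile(r"[@#$%&*+=<>~^]")


def validate_sentence(sentence: str) -> List[str]:
    """Return a list of rule violations; an empty list means the sentence passes."""
    errors = []
    if DIGITS.search(sentence):
        errors.append("contains digits")
    if ACRONYM.search(sentence):
        errors.append("contains an abbreviation or acronym")
    if SYMBOLS.search(sentence):
        errors.append("contains special symbols")
    # Any alphabetic character outside A-Z/a-z counts as a foreign letter here.
    if any(ch.isalpha() and ch not in string.ascii_letters for ch in sentence):
        errors.append("contains letters outside the English alphabet")
    return errors


if __name__ == "__main__":
    for text in ["We're going home, don't worry.", "Call 2409 now", "She works for ICE"]:
        print(text, "->", validate_sentence(text))
```

Apostrophes pass because they are not in the symbol pattern, which matches the “don’t” / “we’re” example. Other languages would need their own alphabet and symbol lists.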

@MichaelKohler MichaelKohler changed the title Add sentence validation requirements Add basic sentence validation requirements Dec 21, 2018
@MichaelKohler
Member Author

@nukeador will figure out how we could do the fourth point; I will implement the rest in a rudimentary way.

@MichaelKohler
Member Author

Created #35 for the fourth point so we can close this once I have implemented the rest.

@MichaelKohler MichaelKohler added the HAS_PR label and removed the question (Further information is requested) label Dec 29, 2018
@dabinat

dabinat commented Dec 30, 2018

What about profanity?

@MichaelKohler
Member Author

@dabinat thanks for your input. Of course, profanity is something we want to deal with as well. We will have a review process in place that should catch most of it. We'll see how this works out; in theory, we could also integrate a blacklist of words.
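
For illustration only, a word blacklist like the one mentioned above might be sketched as follows. The `BLOCKED_WORDS` entries are harmless placeholders and `contains_blocked_word` is a hypothetical helper, not part of the project; a real list would have to be curated per language.

```python
import re

# Placeholder entries (assumption); a real list would be maintained per language.
BLOCKED_WORDS = {"darn", "heck"}


def contains_blocked_word(sentence, blocked=BLOCKED_WORDS):
    """Return True if any whole word of the sentence is on the blocked list."""
    # Lowercase and split into words so that "Heck!" matches but "hecklers" does not.
    words = re.findall(r"[a-z']+", sentence.lower())
    return any(word in blocked for word in words)


if __name__ == "__main__":
    print(contains_blocked_word("Well, heck!"))
    print(contains_blocked_word("Check the hecklers"))
```

Matching whole words rather than substrings avoids rejecting innocent words that merely contain a blocked string, at the cost of missing inflected or misspelled variants, so a human review step would still be needed.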

@MichaelKohler
Member Author

🎉 This issue has been resolved in version 1.0.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
