In my recent experiments, I noticed that many files in the Mandarin_cmn subcorpus contain words of length > 5, which is unnatural for Chinese. I checked all words of length > 5, and none of them are Chinese (I checked all tokens that contain non-ASCII characters). Some of these long words are not words at all but non-word sequences; these start to appear at around length > 10.
Here is a list of files in which I found such long words:
Words of length > 5 should be removed from these files.
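A minimal sketch of how the check and the cleanup could be done, assuming the files are UTF-8 text with whitespace-separated tokens (the corpus format is not stated here, so this is an assumption). The function names `long_tokens` and `drop_long_tokens` are hypothetical, and length is counted in Unicode code points:

```python
MAX_LEN = 5  # threshold from this report; adjust as needed


def long_tokens(text, max_len=MAX_LEN):
    """Return whitespace-separated tokens longer than max_len characters."""
    return [tok for tok in text.split() if len(tok) > max_len]


def drop_long_tokens(text, max_len=MAX_LEN):
    """Remove long tokens line by line, keeping the line structure."""
    cleaned = []
    for line in text.splitlines():
        kept = [tok for tok in line.split() if len(tok) <= max_len]
        cleaned.append(" ".join(kept))
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = "我 喜欢 abcdefgh 北京\nhello international 你好"
    print(long_tokens(sample))
    print(drop_long_tokens(sample))
```

Running `long_tokens` over each file first gives a list to review before anything is deleted. Note that this only filters by length, so shorter non-Chinese tokens (such as `hello` in the sample) are left in place.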
Note: there are also shorter words that are not Chinese (e.g. Arabic, English), but this seems to be a harder issue to solve and is probably less crucial than the long words.