Update unicharset_extractor.cpp (#1153)
* Change IsWhitespace to IsUTF8Whitespace

Fixes the error raised during "Phase UP: Generating unicharset and unichar properties files" (#1147).

For details, see [#1147](#1147).

* Update unicharset_extractor.cpp

Fix the "Phase UP: Generating unicharset and unichar properties files" error.

* Update unicharset_extractor.cpp

Fix the "Phase UP: Generating unicharset and unichar properties files" error (#1147).

* Update unicharset_extractor.cpp

Fix the invalid-encoding problem and correct the comment.
ivanzz1001 authored and zdenop committed Oct 13, 2017
1 parent 1b0379c commit fb359fc
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion training/unicharset_extractor.cpp
@@ -50,7 +50,9 @@ static void AddStringsToUnicharset(const GenericVector<STRING>& strings,
             /*report_errors*/ true,
             strings[i].string(), &normalized)) {
       for (const string& normed : normalized) {
-        if (normed.empty() || IsWhitespace(normed[0])) continue;
+
+        // normed is a UTF-8 encoded string
+        if (normed.empty() || IsUTF8Whitespace(normed.c_str())) continue;
         unicharset->unichar_insert(normed.c_str());
       }
     } else {
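
The change above swaps a check on the first byte of `normed` for a check on the whole UTF-8 string. The sketch below illustrates why that matters. It is not Tesseract's implementation: `FirstCodepoint`, `IsSpaceCodepoint`, `StartsWithUtf8Whitespace`, and `FirstByteAsSignedChar` are hypothetical helpers written for this illustration, and the whitespace table is a hand-written stand-in for a proper Unicode library lookup. The assumption is that the old code treated a single `char` as a code point, which presumably produced the invalid-codepoint error mentioned in the comments whenever `normed` began with a multi-byte character.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Sentinel for "no valid code point" (hypothetical, for this sketch only).
static const char32_t kInvalidCodepoint = 0xFFFFFFFF;

// Decodes the first code point of a UTF-8 string.
// Minimal decoder: it checks lead and continuation byte patterns but does
// not reject overlong encodings or surrogates.
static char32_t FirstCodepoint(const std::string& s) {
  if (s.empty()) return kInvalidCodepoint;
  const unsigned char b0 = static_cast<unsigned char>(s[0]);
  std::size_t len;
  char32_t cp;
  if (b0 < 0x80) { len = 1; cp = b0; }
  else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
  else return kInvalidCodepoint;  // stray continuation or invalid lead byte
  if (s.size() < len) return kInvalidCodepoint;  // truncated sequence
  for (std::size_t i = 1; i < len; ++i) {
    const unsigned char b = static_cast<unsigned char>(s[i]);
    if ((b & 0xC0) != 0x80) return kInvalidCodepoint;
    cp = (cp << 6) | (b & 0x3F);
  }
  return cp;
}

// Stand-in for a Unicode White_Space lookup.
static bool IsSpaceCodepoint(char32_t cp) {
  return (cp >= 0x09 && cp <= 0x0D) || cp == 0x20 || cp == 0x85 ||
         cp == 0xA0 || cp == 0x1680 || (cp >= 0x2000 && cp <= 0x200A) ||
         cp == 0x2028 || cp == 0x2029 || cp == 0x202F || cp == 0x205F ||
         cp == 0x3000;
}

// String-level check: decode first, then classify the code point.
static bool StartsWithUtf8Whitespace(const std::string& s) {
  const char32_t cp = FirstCodepoint(s);
  return cp != kInvalidCodepoint && IsSpaceCodepoint(cp);
}

// Models what a byte-level check sees on platforms where char is signed:
// the lead byte of a multi-byte sequence widens to a negative value,
// which is not a valid code point.
static std::int32_t FirstByteAsSignedChar(const std::string& s) {
  return static_cast<signed char>(s[0]);
}
```

For example, the three-byte sequence `"\xE4\xB8\xAD"` (U+4E2D) decodes to a single non-whitespace code point under `FirstCodepoint`, whereas `FirstByteAsSignedChar` returns a negative number for the same input. The ideographic space `"\xE3\x80\x80"` (U+3000) is recognised as whitespace only by the string-level check.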

2 comments on commit fb359fc

@haoyangt

Resolved the "invalid unicode codepoint" issue! Good job, bro!

@hanikh commented on fb359fc on Oct 17, 2017

thanks