-
Notifications
You must be signed in to change notification settings - Fork 609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NumpyReader : Replace std::regex with custom implementation #2489
Conversation
Signed-off-by: Joaquin Anton <janton@nvidia.com>
header.erase(std::remove_if(header.begin(), header.end(), ::isspace), header.end()); | ||
|
||
const char t1[] = "{'descr':\'"; | ||
size_t l1 = strlen(t1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's quite pointless to measure a constant string.
Alternatives:
1.
static const std::string t1 = "{'descr:'";
and use .length()
2.
size_t l1 = strlen(t1); | |
size_t l1=sizeof(t1)-1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indexing looks ok, I think we could probably get away with a lot of temporary string creation. I'm not sure if that's the direction we should pursue in this PR.
target.shape[i] = static_cast<int64_t>(stoi(shapevec[i])); | ||
} | ||
|
||
ParseHeaderMetadata(target, header); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ParseHeaderMetadata(target, header); | |
ParseHeaderMetadata(target, std::move(header)); |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not applicable any more (passed by reference now)
|
||
const char t1[] = "{'descr':\'"; | ||
size_t l1 = strlen(t1); | ||
DALI_ENFORCE(strncmp(header.c_str(), t1, l1) == 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can rise some error message like, "Malformed header", maybe we don't need to report how it's malformed but to let user (and us) know what happened in all of the DALI_ENFORCEs in this file.
|
||
// < means LE, | means N/A, = means native. In all those cases, we can read | ||
bool little_endian = | ||
(typestring[0] == '<' || typestring[0] == '|' || typestring[0] == '='); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm nitpicky, but do you think we can check this under the l1
and create the tid
string once? Instead of creating the string with this character in front and creating next string with 1 character less?
auto fortran_order_str = header.substr(pos, iter - pos); | ||
pos = iter + 1; | ||
if (fortran_order_str == "True") { | ||
target.fortran_order = true; | ||
} else if (fortran_order_str == "False") { | ||
target.fortran_order = false; | ||
} else { | ||
DALI_FAIL("Can not parse fortran order"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here again we crate a temporary string that's not exactly necessary and we could use strncmp and header
with offset to pos
Still we create a lot of them below, so...
Signed-off-by: Joaquin Anton <janton@nvidia.com>
} | ||
|
||
template <size_t N> | ||
void Skip(const char*& ptr, const char (&what)[N]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need ptr
to be a reference to a pointer? It adds a little bit complexity to code, while sizeof(const char*&)==sizeof(const char*)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reference ptr
is there, because we're advancing the pointer.
The reference to array is a the only way to pass a strongly typed sized array (e.g. a string literal) to a function.
sizeof(what) == N
.
If what
was declared as const char what[N]
, the N is meaningless and will be removed, resulting in a type without a size.
std::string ParseStringValue(const char*& input, char delim_start = '\'', char delim_end = '\'') { | ||
DALI_ENFORCE(*input++ == delim_start, make_string("Expected \'", delim_start, "\'")); | ||
std::string out; | ||
for (;*input != '\0'; input++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is input++
safe, when input
is const char*&
? It's like iterating the referece, right? IMHO const char*
would be safer and clearer, but that's your call ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to advance the pointer that was passed to this function. The alternative would be to return the new pointer, but then we'd have no way of returning a value from this function. The input pointer is a minimalistic stream.
|
||
std::string ParseStringValue(const char*& input, char delim_start = '\'', char delim_end = '\'') { | ||
DALI_ENFORCE(*input++ == delim_start, make_string("Expected \'", delim_start, "\'")); | ||
std::string out; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should use stringstream
here. E. g. out+='\\'
creates a new object from scratch, is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
How += behaves, is implementation defined (haha). Some of behave like vector, so += is efficient, others don't. Since we've recently found out that we're using quite dated libstdc++
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once I had this very argument, where I was on the side of using std::stringstream and the other person saying that += is as good if not better. Measurement showed += to be the winner.
@@ -38,6 +37,115 @@ TypeInfo TypeFromNumpyStr(const std::string &format) { | |||
DALI_FAIL("Unknown Numpy type string"); | |||
} | |||
|
|||
inline void SkipSpace(const char*& ptr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inline void SkipSpace(const char*& ptr) { | |
inline void SkipSpaces(const char*& ptr) { |
Small suggestion, your call
} else if (TrySkip(hdr, "False")) { | ||
target.fortran_order = false; | ||
} else { | ||
DALI_FAIL("Can not parse fortran order"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DALI_FAIL("Can not parse fortran order"); | |
DALI_FAIL("Cannot parse fortran order"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DALI_FAIL("Can not parse fortran order"); | |
DALI_FAIL("Cannot read an array stored in Fortran order."); |
} catch (...) { | ||
DALI_FAIL(make_string("Failed to parse shape data: ", sh_str)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about catching std::exception
and adding error msg to the error msg?
} catch(std::exception &e) {
DALI_FAIL(make_string("Failed to parse shape data: ", sh_str, ". Error: ", e.what()));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to cppreference, std::stoi
can throw only erros derived from logic_error (out_of_range and invalid_argument), so that's what we should catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -52,13 +51,14 @@ class NumpyParseTarget{ | |||
} | |||
}; | |||
|
|||
DLL_PUBLIC void ParseHeaderMetadata(NumpyParseTarget& target, const std::string &header); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DLL_PUBLIC void ParseHeaderMetadata(NumpyParseTarget& target, const std::string &header); | |
void ParseHeaderMetadata(NumpyParseTarget& target, const std::string &header); |
I don't think we should expose this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You wouldn't be able to test it otherwise...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do if we want to test it
11e9baf
to
bb2ea29
Compare
if (*input == '\\') { | ||
switch (*++input) { | ||
case '\'': | ||
out += '\\'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be:
out += '\\'; | |
out += '\''; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mzient ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and added the '\\'
case
} else if (TrySkip(hdr, "False")) { | ||
target.fortran_order = false; | ||
} else { | ||
DALI_FAIL("Cannot read an array stored in Fortran order."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this was changed, but isn't it:
DALI_FAIL("Cannot read an array stored in Fortran order."); | |
DALI_FAIL("Cannot parse the Fortran order [of an array]."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, it is. My bad. I applied the fix without really reading it.
bb2ea29
to
57bf317
Compare
T value = static_cast<T>( | ||
strtol(input, const_cast<char**>(&input), 10)); // why is it char** ? | ||
return value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it throw when input has not advanced?
T value = static_cast<T>( | |
strtol(input, const_cast<char**>(&input), 10)); // why is it char** ? | |
return value; | |
char *out = const_cast<char *>(input); | |
T value = static_cast<T>(strtol(input, &out, 10)); | |
DALI_ENFORCE(out != input, "Parse error: expected a number"); | |
input = out; | |
return value; |
051f376
to
4c57c1a
Compare
!build |
CI MESSAGE: [1824582]: BUILD STARTED |
try { | ||
// ParseInteger already skips the leading spaces (strtol does). | ||
target.shape.push_back(ParseInteger<int64_t>(hdr)); | ||
} catch (const std::logic_error& e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't happen now.
Signed-off-by: Joaquin Anton <janton@nvidia.com>
4c57c1a
to
bca654a
Compare
!build |
CI MESSAGE: [1824697]: BUILD STARTED |
CI MESSAGE: [1824697]: BUILD PASSED |
Signed-off-by: Joaquin Anton janton@nvidia.com
Why we need this PR?
What happened in this PR?
Fill relevant points, put NA otherwise. Replace anything inside []
Replaced usage of std::regex with custom parser
NumpyReader
Parser implementation
Tests added
N/A
JIRA TASK: [DALI-1737]