Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NumpyReader : Replace std::regex with custom implementation #2489

Merged
merged 3 commits into from
Nov 23, 2020

Conversation

jantonguirao
Copy link
Contributor

@jantonguirao jantonguirao commented Nov 20, 2020

Signed-off-by: Joaquin Anton janton@nvidia.com

Why we need this PR?

  • Workaround: It fixes a bug when loading Horovod and DALI, due to ABI's incompatibility

What happened in this PR?

Fill relevant points, put NA otherwise. Replace anything inside []

  • What solution was applied:
    Replaced usage of std::regex with custom parser
  • Affected modules and functionalities:
    NumpyReader
  • Key points relevant for the review:
    Parser implementation
  • Validation and testing:
    Tests added
  • Documentation (including examples):
    N/A

JIRA TASK: [DALI-1737]

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao jantonguirao changed the title NumpyReader : Replace std::regex for custom implementation NumpyReader : Replace std::regex with custom implementation Nov 20, 2020
header.erase(std::remove_if(header.begin(), header.end(), ::isspace), header.end());

const char t1[] = "{'descr':\'";
size_t l1 = strlen(t1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's quite pointless to measure a constant string.
Alternatives:
1.

static const std::string t1 = "{'descr:'";

and use .length()
2.

Suggested change
size_t l1 = strlen(t1);
size_t l1=sizeof(t1)-1;

Copy link
Contributor

@klecki klecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indexing looks ok, I think we could probably get away with a lot of temporary string creation. I'm not sure if that's the direction we should pursue in this PR.

target.shape[i] = static_cast<int64_t>(stoi(shapevec[i]));
}

ParseHeaderMetadata(target, header);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ParseHeaderMetadata(target, header);
ParseHeaderMetadata(target, std::move(header));

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not applicable any more (passed by reference now)


const char t1[] = "{'descr':\'";
size_t l1 = strlen(t1);
DALI_ENFORCE(strncmp(header.c_str(), t1, l1) == 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can rise some error message like, "Malformed header", maybe we don't need to report how it's malformed but to let user (and us) know what happened in all of the DALI_ENFORCEs in this file.


// < means LE, | means N/A, = means native. In all those cases, we can read
bool little_endian =
(typestring[0] == '<' || typestring[0] == '|' || typestring[0] == '=');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm nitpicky, but do you think we can check this under the l1 and create the tid string once? Instead of creating the string with this character in front and creating next string with 1 character less?

Comment on lines 71 to 79
auto fortran_order_str = header.substr(pos, iter - pos);
pos = iter + 1;
if (fortran_order_str == "True") {
target.fortran_order = true;
} else if (fortran_order_str == "False") {
target.fortran_order = false;
} else {
DALI_FAIL("Can not parse fortran order");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here again we crate a temporary string that's not exactly necessary and we could use strncmp and header with offset to pos

Still we create a lot of them below, so...

Signed-off-by: Joaquin Anton <janton@nvidia.com>
}

template <size_t N>
void Skip(const char*& ptr, const char (&what)[N]) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need ptr to be a reference to a pointer? It adds a little bit complexity to code, while sizeof(const char*&)==sizeof(const char*)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference ptr is there, because we're advancing the pointer.
The reference to array is a the only way to pass a strongly typed sized array (e.g. a string literal) to a function.
sizeof(what) == N.
If what was declared as const char what[N], the N is meaningless and will be removed, resulting in a type without a size.

std::string ParseStringValue(const char*& input, char delim_start = '\'', char delim_end = '\'') {
DALI_ENFORCE(*input++ == delim_start, make_string("Expected \'", delim_start, "\'"));
std::string out;
for (;*input != '\0'; input++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is input++ safe, when input is const char*&? It's like iterating the referece, right? IMHO const char* would be safer and clearer, but that's your call ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to advance the pointer that was passed to this function. The alternative would be to return the new pointer, but then we'd have no way of returning a value from this function. The input pointer is a minimalistic stream.


std::string ParseStringValue(const char*& input, char delim_start = '\'', char delim_end = '\'') {
DALI_ENFORCE(*input++ == delim_start, make_string("Expected \'", delim_start, "\'"));
std::string out;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should use stringstream here. E. g. out+='\\' creates a new object from scratch, is that right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.
How += behaves, is implementation defined (haha). Some of behave like vector, so += is efficient, others don't. Since we've recently found out that we're using quite dated libstdc++

Copy link
Contributor Author

@jantonguirao jantonguirao Nov 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once I had this very argument, where I was on the side of using std::stringstream and the other person saying that += is as good if not better. Measurement showed += to be the winner.

@@ -38,6 +37,115 @@ TypeInfo TypeFromNumpyStr(const std::string &format) {
DALI_FAIL("Unknown Numpy type string");
}

inline void SkipSpace(const char*& ptr) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
inline void SkipSpace(const char*& ptr) {
inline void SkipSpaces(const char*& ptr) {

Small suggestion, your call

} else if (TrySkip(hdr, "False")) {
target.fortran_order = false;
} else {
DALI_FAIL("Can not parse fortran order");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DALI_FAIL("Can not parse fortran order");
DALI_FAIL("Cannot parse fortran order");

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DALI_FAIL("Can not parse fortran order");
DALI_FAIL("Cannot read an array stored in Fortran order.");

Comment on lines 139 to 141
} catch (...) {
DALI_FAIL(make_string("Failed to parse shape data: ", sh_str));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about catching std::exception and adding error msg to the error msg?

} catch(std::exception &e) {
  DALI_FAIL(make_string("Failed to parse shape data: ", sh_str, ". Error: ", e.what()));
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to cppreference, std::stoi can throw only erros derived from logic_error (out_of_range and invalid_argument), so that's what we should catch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -52,13 +51,14 @@ class NumpyParseTarget{
}
};

DLL_PUBLIC void ParseHeaderMetadata(NumpyParseTarget& target, const std::string &header);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DLL_PUBLIC void ParseHeaderMetadata(NumpyParseTarget& target, const std::string &header);
void ParseHeaderMetadata(NumpyParseTarget& target, const std::string &header);

I don't think we should expose this function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You wouldn't be able to test it otherwise...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do if we want to test it

if (*input == '\\') {
switch (*++input) {
case '\'':
out += '\\';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be:

Suggested change
out += '\\';
out += '\'';

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and added the '\\' case

} else if (TrySkip(hdr, "False")) {
target.fortran_order = false;
} else {
DALI_FAIL("Cannot read an array stored in Fortran order.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this was changed, but isn't it:

Suggested change
DALI_FAIL("Cannot read an array stored in Fortran order.");
DALI_FAIL("Cannot parse the Fortran order [of an array].");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it is. My bad. I applied the fix without really reading it.

Comment on lines 77 to 81
T value = static_cast<T>(
strtol(input, const_cast<char**>(&input), 10)); // why is it char** ?
return value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it throw when input has not advanced?

Suggested change
T value = static_cast<T>(
strtol(input, const_cast<char**>(&input), 10)); // why is it char** ?
return value;
char *out = const_cast<char *>(input);
T value = static_cast<T>(strtol(input, &out, 10));
DALI_ENFORCE(out != input, "Parse error: expected a number");
input = out;
return value;

@jantonguirao jantonguirao force-pushed the numpy_header_parser branch 2 times, most recently from 051f376 to 4c57c1a Compare November 23, 2020 12:52
@jantonguirao
Copy link
Contributor Author

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1824582]: BUILD STARTED

try {
// ParseInteger already skips the leading spaces (strtol does).
target.shape.push_back(ParseInteger<int64_t>(hdr));
} catch (const std::logic_error& e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't happen now.

Signed-off-by: Joaquin Anton <janton@nvidia.com>
@jantonguirao
Copy link
Contributor Author

!build

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1824697]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [1824697]: BUILD PASSED

@jantonguirao jantonguirao merged commit 8a840e2 into NVIDIA:master Nov 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants