NumpyReader : Replace std::regex with custom implementation #2489

jantonguirao · 2020-11-20T16:11:25Z

Signed-off-by: Joaquin Anton <janton@nvidia.com>

Why we need this PR?

Workaround: it fixes a bug that occurs when Horovod and DALI are loaded together, caused by an ABI incompatibility.

What happened in this PR?

What solution was applied: Replaced usage of std::regex with a custom parser
Affected modules and functionalities: NumpyReader
Key points relevant for the review: Parser implementation
Validation and testing: Tests added
Documentation (including examples): N/A

JIRA TASK: [DALI-1737]

Signed-off-by: Joaquin Anton <janton@nvidia.com>

mzient · 2020-11-20T16:17:32Z

dali/operators/reader/loader/numpy_loader.cc

+  header.erase(std::remove_if(header.begin(), header.end(), ::isspace), header.end());
+
+  const char t1[] = "{'descr':\'";
+  size_t l1 = strlen(t1);


It's quite pointless to measure the length of a constant string at run time.
Alternatives:

1. Declare the token as a string and use .length():

static const std::string t1 = "{'descr':'";

2. Take the length from the array size:

Suggested change

size_t l1 = strlen(t1);

size_t l1 = sizeof(t1) - 1;
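
For illustration, a minimal sketch of the compile-time alternative. The helper names LiteralLength and StartsWith are hypothetical and not part of DALI; the point is only that the length of a literal is known from its array type, so nothing has to be measured at run time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Length of a string literal, taken from its array type.
// N counts the terminating '\0', hence the -1.
template <std::size_t N>
constexpr std::size_t LiteralLength(const char (&)[N]) {
  return N - 1;
}

// Hypothetical helper: true if `text` starts with the literal `prefix`.
template <std::size_t N>
bool StartsWith(const char *text, const char (&prefix)[N]) {
  assert(text != nullptr);
  return std::strncmp(text, prefix, N - 1) == 0;
}

static_assert(LiteralLength("{'descr':'") == 10,
              "the length is available at compile time");
```

This is equivalent to the sizeof(t1) - 1 suggestion, wrapped so that the -1 lives in one place.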

klecki

Indexing looks OK. I think we could probably avoid a lot of the temporary string creation, but I'm not sure that's the direction we should pursue in this PR.

klecki · 2020-11-20T16:24:47Z

dali/operators/reader/loader/numpy_loader.cc

-      target.shape[i] = static_cast<int64_t>(stoi(shapevec[i]));
-  }
-
+  ParseHeaderMetadata(target, header);


Suggested change

ParseHeaderMetadata(target, header);

ParseHeaderMetadata(target, std::move(header));

?

Not applicable anymore (header is now passed by reference).

klecki · 2020-11-20T16:28:11Z

dali/operators/reader/loader/numpy_loader.cc

+
+  const char t1[] = "{'descr':\'";
+  size_t l1 = strlen(t1);
+  DALI_ENFORCE(strncmp(header.c_str(), t1, l1) == 0);


I think we can raise an error message such as "Malformed header" in all of the DALI_ENFORCEs in this file. Maybe we don't need to report exactly how the header is malformed, but the user (and we) should know what happened.

klecki · 2020-11-20T16:30:57Z

dali/operators/reader/loader/numpy_loader.cc

+
+  // < means LE, | means N/A, = means native. In all those cases, we can read
+  bool little_endian =
+    (typestring[0] == '<' || typestring[0] == '|' || typestring[0] == '=');


Maybe I'm being nitpicky, but could we check this character right after skipping the l1 prefix and create the tid string only once, instead of creating a string with this character in front and then a second string that is one character shorter?

klecki · 2020-11-20T16:32:52Z

dali/operators/reader/loader/numpy_loader.cc

+  auto fortran_order_str = header.substr(pos, iter - pos);
+  pos = iter + 1;
+  if (fortran_order_str == "True") {
+    target.fortran_order = true;
+  } else if (fortran_order_str == "False") {
+    target.fortran_order = false;
+  } else {
+    DALI_FAIL("Can not parse fortran order");
+  }


Here again we create a temporary string that isn't strictly necessary; we could use strncmp on header with an offset of pos.

Still, we create a lot of them below, so...
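
A rough sketch of what the strncmp variant could look like. ParseBoolAt is a hypothetical name, and std::runtime_error stands in for DALI_FAIL so that the snippet compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads "True" or "False" starting at header[pos]
// without building a substring, and advances pos past the token.
bool ParseBoolAt(const std::string &header, std::size_t &pos) {
  assert(pos <= header.size());
  // c_str() is null-terminated, so strncmp cannot read past the end.
  const char *p = header.c_str() + pos;
  if (std::strncmp(p, "True", 4) == 0) {
    pos += 4;
    return true;
  }
  if (std::strncmp(p, "False", 5) == 0) {
    pos += 5;
    return false;
  }
  throw std::runtime_error("Cannot parse fortran order");
}
```

The comparison runs directly against the header buffer, so no std::string is allocated for the token.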

Signed-off-by: Joaquin Anton <janton@nvidia.com>

szalpal · 2020-11-20T20:21:23Z

dali/operators/reader/loader/numpy_loader.cc

+}
+
+template <size_t N>
+void Skip(const char*& ptr, const char (&what)[N]) {


Do we need ptr to be a reference to a pointer? It adds a bit of complexity to the code, while sizeof(const char*&)==sizeof(const char*)

The ptr parameter is a reference because we're advancing the caller's pointer.
The reference to an array is the only way to pass a strongly typed, sized array (e.g. a string literal) to a function.
sizeof(what) == N.
If what were declared as const char what[N], the N would be meaningless and would be dropped, resulting in a type without a size (a plain pointer).
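
To make the two points concrete, here is a small sketch. The names Skip and TrySkip come from the PR, but the bodies below are an assumption written for illustration, with std::runtime_error standing in for DALI_ENFORCE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// N is deduced from the literal and includes the terminating '\0'.
// ptr is taken by reference so that the caller's pointer moves.
template <std::size_t N>
bool TrySkip(const char *&ptr, const char (&what)[N]) {
  assert(ptr != nullptr);
  if (std::strncmp(ptr, what, N - 1) != 0)
    return false;  // no match: ptr is left untouched
  ptr += N - 1;
  return true;
}

template <std::size_t N>
void Skip(const char *&ptr, const char (&what)[N]) {
  if (!TrySkip(ptr, what))
    throw std::runtime_error(std::string("Expected: ") + what);
}
```

With a by-value const char* parameter, the function would advance only its own copy and the caller would have to re-derive the new position.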

szalpal · 2020-11-20T20:22:18Z

dali/operators/reader/loader/numpy_loader.cc

+std::string ParseStringValue(const char*& input, char delim_start = '\'', char delim_end = '\'') {
+  DALI_ENFORCE(*input++ == delim_start, make_string("Expected \'", delim_start, "\'"));
+  std::string out;
+  for (;*input != '\0'; input++) {


Is input++ safe when input is const char*&? It's like iterating the reference, right? IMHO const char* would be safer and clearer, but that's your call ;)

We want to advance the pointer that was passed to this function. The alternative would be to return the new pointer, but then we'd have no way of returning a value from this function. The input pointer is a minimalistic stream.

szalpal · 2020-11-20T20:23:14Z

dali/operators/reader/loader/numpy_loader.cc

+
+std::string ParseStringValue(const char*& input, char delim_start = '\'', char delim_end = '\'') {
+  DALI_ENFORCE(*input++ == delim_start, make_string("Expected \'", delim_start, "\'"));
+  std::string out;


I think you should use stringstream here. E. g. out+='\\' creates a new object from scratch, is that right?

Good point.
How += behaves is implementation-defined (haha). Some implementations behave like vector, so += is efficient; others don't. Since we've recently found out that we're using quite dated libstdc++

Once I had this very argument, where I was on the side of using std::stringstream and the other person saying that += is as good if not better. Measurement showed += to be the winner.
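
For reference, the two approaches being compared look roughly like this. Both function names are hypothetical, and the snippet makes no claim about which is faster on a given standard library; as noted above, that has to be measured.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Builds the result with operator+=, one character at a time.
std::string StripQuotesWithAppend(const std::string &in) {
  std::string out;
  out.reserve(in.size());  // capacity is set once, so += never reallocates here
  for (char c : in)
    if (c != '\'') out += c;
  assert(out.size() <= in.size());
  return out;
}

// Builds the same result through a string stream.
std::string StripQuotesWithStream(const std::string &in) {
  std::ostringstream ss;
  for (char c : in)
    if (c != '\'') ss << c;
  return ss.str();
}
```

Calling reserve up front sidesteps the question of how a particular implementation grows the buffer on +=.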

szalpal · 2020-11-20T20:26:00Z

dali/operators/reader/loader/numpy_loader.cc

@@ -38,6 +37,115 @@ TypeInfo TypeFromNumpyStr(const std::string &format) {
  DALI_FAIL("Unknown Numpy type string");
 }

+inline void SkipSpace(const char*& ptr) {


Suggested change

inline void SkipSpace(const char*& ptr) {

inline void SkipSpaces(const char*& ptr) {

Small suggestion, your call

szalpal · 2020-11-20T20:29:19Z

dali/operators/reader/loader/numpy_loader.cc

+  } else if (TrySkip(hdr, "False")) {
+    target.fortran_order = false;
+  } else {
+    DALI_FAIL("Can not parse fortran order");


Suggested change

DALI_FAIL("Can not parse fortran order");

DALI_FAIL("Cannot parse fortran order");

Suggested change

DALI_FAIL("Can not parse fortran order");

DALI_FAIL("Cannot read an array stored in Fortran order.");

szalpal · 2020-11-20T20:36:55Z

dali/operators/reader/loader/numpy_loader.cc

+  } catch (...) {
+    DALI_FAIL(make_string("Failed to parse shape data: ", sh_str));
+  }


How about catching std::exception and appending its message to ours?

} catch (const std::exception &e) {
  DALI_FAIL(make_string("Failed to parse shape data: ", sh_str, ". Error: ", e.what()));
}

According to cppreference, std::stoi can throw only errors derived from std::logic_error (std::out_of_range and std::invalid_argument), so that's what we should catch.
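
A minimal sketch of the narrower catch, assuming a hypothetical ParseDim helper and using std::runtime_error in place of DALI_FAIL:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts one shape entry and rethrows with context.
int ParseDim(const std::string &s) {
  try {
    std::size_t used = 0;
    int value = std::stoi(s, &used);
    assert(used <= s.size());  // stoi never consumes more than it was given
    return value;
  } catch (const std::logic_error &e) {
    // Covers both std::invalid_argument and std::out_of_range.
    throw std::runtime_error("Failed to parse shape data: " + s +
                             ". Error: " + e.what());
  }
}
```

Catching std::logic_error keeps unrelated exceptions (for example std::bad_alloc) from being reported as parse errors.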

mzient · 2020-11-23T09:56:45Z

dali/operators/reader/loader/numpy_loader.h

@@ -52,13 +51,14 @@ class NumpyParseTarget{
  }
 };

+DLL_PUBLIC void ParseHeaderMetadata(NumpyParseTarget& target, const std::string &header);


Suggested change

DLL_PUBLIC void ParseHeaderMetadata(NumpyParseTarget& target, const std::string &header);

void ParseHeaderMetadata(NumpyParseTarget& target, const std::string &header);

I don't think we should expose this function.

You wouldn't be able to test it otherwise...

We do need to expose it if we want to test it.

klecki · 2020-11-23T11:45:47Z

dali/operators/reader/loader/numpy_loader.cc

+    if (*input == '\\') {
+      switch (*++input) {
+        case '\'':
+          out += '\\';


Shouldn't it be:

Suggested change

out += '\\';

out += '\'';

Done, and added the '\\' case
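
A sketch of the escape handling with both cases in place. ParseQuoted is a hypothetical, simplified stand-in for ParseStringValue (a single delimiter, std::runtime_error instead of DALI_ENFORCE) and is not the code that was merged.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Reads a quoted string starting at *input and advances input past the
// closing delimiter. Supports the escapes \' and \\ only.
std::string ParseQuoted(const char *&input, char delim = '\'') {
  assert(input != nullptr);
  if (*input != delim)
    throw std::runtime_error("Expected opening delimiter");
  ++input;
  std::string out;
  for (; *input != '\0'; ++input) {
    if (*input == '\\') {
      ++input;  // look at the escaped character
      if (*input == '\'' || *input == '\\')
        out += *input;  // \' yields ', \\ yields a single backslash
      else
        throw std::runtime_error("Unsupported escape sequence");
    } else if (*input == delim) {
      ++input;  // consume the closing delimiter
      return out;
    } else {
      out += *input;
    }
  }
  throw std::runtime_error("Unterminated string");
}
```

A backslash at the very end of the input lands on the terminating '\0' and is rejected as an unsupported escape, so the loop never reads past the buffer.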

klecki · 2020-11-23T11:48:03Z

dali/operators/reader/loader/numpy_loader.cc

+  } else if (TrySkip(hdr, "False")) {
+    target.fortran_order = false;
+  } else {
+    DALI_FAIL("Cannot read an array stored in Fortran order.");


I know this was changed, but isn't it:

Suggested change

DALI_FAIL("Cannot read an array stored in Fortran order.");

DALI_FAIL("Cannot parse the Fortran order [of an array].");

yeah, it is. My bad. I applied the fix without really reading it.

mzient · 2020-11-23T12:23:31Z

dali/operators/reader/loader/numpy_loader.cc

+  T value = static_cast<T>(
+    strtol(input, const_cast<char**>(&input), 10));  // why is it char** ?
+  return value;


Shouldn't it throw when input has not advanced?

Suggested change

  T value = static_cast<T>(
    strtol(input, const_cast<char**>(&input), 10));  // why is it char** ?
  return value;

  char *out = const_cast<char *>(input);
  T value = static_cast<T>(strtol(input, &out, 10));
  DALI_ENFORCE(out != input, "Parse error: expected a number");
  input = out;
  return value;
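
A self-contained version of the same idea, with std::runtime_error standing in for DALI_ENFORCE. As for the question in the code comment: strtol is a C function, so it cannot be overloaded on constness; it takes a const char* input but reports the end position through a char**.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdexcept>

// Parses a base-10 integer at *input and advances input past it.
// Throws if no digits were consumed. Overflow is not checked here.
template <typename T = long>
T ParseInteger(const char *&input) {
  assert(input != nullptr);
  char *end = nullptr;
  long value = std::strtol(input, &end, 10);  // skips leading whitespace
  if (end == input)
    throw std::runtime_error("Parse error: expected a number");
  input = end;
  return static_cast<T>(value);
}
```

Using a separate end pointer avoids the const_cast on &input and makes the "did not advance" check explicit.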

jantonguirao · 2020-11-23T12:53:19Z

!build

dali-automaton · 2020-11-23T12:55:31Z

CI MESSAGE: [1824582]: BUILD STARTED

mzient · 2020-11-23T13:26:32Z

dali/operators/reader/loader/numpy_loader.cc

+    try {
+      // ParseInteger already skips the leading spaces (strtol does).
+      target.shape.push_back(ParseInteger<int64_t>(hdr));
+    } catch (const std::logic_error& e) {


This won't happen now.

Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao · 2020-11-23T13:54:19Z

!build

dali-automaton · 2020-11-23T14:00:28Z

CI MESSAGE: [1824697]: BUILD STARTED

dali-automaton · 2020-11-23T15:33:03Z

CI MESSAGE: [1824697]: BUILD PASSED

Replace std::regex for custom implementation
e41bec9
Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao changed the title from "NumpyReader : Replace std::regex for custom implementation" to "NumpyReader : Replace std::regex with custom implementation" Nov 20, 2020

mzient reviewed Nov 20, 2020

klecki reviewed Nov 20, 2020

Code review fixes
3f1fe10
Signed-off-by: Joaquin Anton <janton@nvidia.com>

szalpal reviewed Nov 20, 2020

mzient reviewed Nov 23, 2020

jantonguirao force-pushed the numpy_header_parser branch from 11e9baf to bb2ea29 Nov 23, 2020 10:01

mzient approved these changes Nov 23, 2020

szalpal approved these changes Nov 23, 2020

klecki reviewed Nov 23, 2020

jantonguirao force-pushed the numpy_header_parser branch from bb2ea29 to 57bf317 Nov 23, 2020 12:17

mzient reviewed Nov 23, 2020

jantonguirao force-pushed the numpy_header_parser branch 2 times, most recently from 051f376 to 4c57c1a Nov 23, 2020 12:52

mzient reviewed Nov 23, 2020

Code review fixes
bca654a
Signed-off-by: Joaquin Anton <janton@nvidia.com>

jantonguirao force-pushed the numpy_header_parser branch from 4c57c1a to bca654a Nov 23, 2020 13:30

klecki approved these changes Nov 23, 2020

mzient approved these changes Nov 23, 2020

jantonguirao merged commit 8a840e2 into NVIDIA:master Nov 23, 2020
