Handle BOM in the beginning of the script #439

AlekMosingiewicz · 2018-05-10T17:54:13Z

Issue this pull request references: #436

Changes proposed in this pull request

Detect a BOM before parsing begins, in the same way a symbolic link is detected, and skip it.

MarioLiebisch · 2018-05-11T06:30:12Z

unittests/compiled_tests.cpp

+
+  chai.eval("def func() { return \"Hello World\"; };");
+
+  CHECK(chai.eval<std::string>("\xef\xbb\xbf(func())") == "Hello World");


IMO this whole script should be inline (i.e. don't use func()), since you want to test the script running, not the function call working. You should probably remove line 359 as well.

codecov-io · 2018-05-13T11:19:08Z

Codecov Report

Merging #439 into develop will increase coverage by 0.04%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           develop     #439      +/-   ##
===========================================
+ Coverage    72.05%   72.09%   +0.04%     
===========================================
  Files           59       59              
  Lines        10884    10912      +28     
===========================================
+ Hits          7842     7867      +25     
- Misses        3042     3045       +3

Impacted Files	Coverage Δ
include/chaiscript/language/chaiscript_parser.hpp	`98.27% <100%> (-0.24%)`	⬇️
include/chaiscript/language/chaiscript_engine.hpp	`93.75% <100%> (+0.38%)`	⬆️
unittests/compiled_tests.cpp	`92.2% <100%> (+0.16%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 062f821...b3f77f0.

MarioLiebisch · 2018-05-21T06:26:47Z

include/chaiscript/language/chaiscript_engine.hpp

@@ -52,6 +52,7 @@


 #include "../dispatchkit/exception_specification.hpp"
+#include "chaiscript_parser.hpp"


I'm still not sure it's a good idea to pull in the parser for this very simple test, which one could easily inline, especially considering it's only used in one place.

AlekMosingiewicz · 2018-05-21

That's true, I originally intended it to be done differently... I've refactored the code and removed the include.

MarioLiebisch · 2018-05-21T06:31:47Z

unittests/compiled_tests.cpp

+{
+  chaiscript::ChaiScript_Basic chai(create_chaiscript_stdlib(),create_chaiscript_parser());
+  CHECK_THROWS_AS(chai.eval<std::string>("\xef\xbb\xbfprint \"Hello World\""), chaiscript::exception::eval_error);
+}


Still needs a test case for other binary/non-ANSI garbage in the input at random positions (beginning, middle, end).

AlekMosingiewicz · 2018-05-21

I left just one non-ASCII character and moved it to the middle of the string.

lefticus · 2018-05-22T02:08:32Z

We're getting an inexplicable error on Visual C++ - can you make sure you're up to date with the latest develop and see if the error persists? Thanks

AlekMosingiewicz · 2018-05-22T03:13:37Z

I've merged the latest changes from upstream and pushed to my branch. If that doesn't help, I'll try to compile my project on VS and see what happens (I'm on G++ right now).

MarioLiebisch · 2018-05-22T05:58:21Z

include/chaiscript/language/chaiscript_engine.hpp

+        infile.read(&v[0], static_cast<std::streamsize>(bytes_needed));
+        std::string buffer_string(v.begin(), v.end());
+
+        if ((buffer_string.size() > 2)


This statement will always be true, since the vector always contains bytes_needed elements. In addition, this should use bytes_needed rather than a constant.

Overall I'd just check whether the stream has reached the end of file or not.
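The point about the always-true check can be tried in isolation. The following is a minimal sketch, not code from this pull request: `bytes_actually_read` and `bytes_read_from_string` are hypothetical helpers introduced only for illustration. It shows that the vector keeps its constructed size no matter how little the stream delivers, while `gcount()` reports what the last read actually extracted.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not part of ChaiScript.
// Tries to read `bytes_needed` bytes and returns how many actually arrived.
std::streamsize bytes_actually_read(std::istream &in, std::size_t bytes_needed) {
  std::vector<char> v(bytes_needed);
  in.read(v.data(), static_cast<std::streamsize>(bytes_needed));

  // The container size is fixed at construction, so a size check on it
  // (or on a string copied from it) cannot detect a short read.
  assert(v.size() == bytes_needed);

  // gcount() reports the number of characters the last read extracted.
  return in.gcount();
}

// Convenience wrapper so the behaviour can be tried on an in-memory string.
std::streamsize bytes_read_from_string(const std::string &content, std::size_t bytes_needed) {
  std::istringstream in(content);
  return bytes_actually_read(in, bytes_needed);
}
```

For a one-byte input and `bytes_needed == 3`, the helper returns 1 even though the vector still holds three elements. Checking the stream state or `gcount()` is therefore what separates a full read from a short one.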

MarioLiebisch · 2018-05-22T05:58:50Z

include/chaiscript/language/chaiscript_engine.hpp

      infile.seekg(0, std::ios::beg);

      assert(size >= 0);

+      if (skip_bom(infile)) {
+          size-=3; // decrement the BOM size from file size, otherwise we'll get parsing errors
+      }


I'd probably do another assert(size >= 0); here to verify there's actually something beyond the BOM.

MarioLiebisch · 2018-05-22T06:02:05Z

include/chaiscript/language/chaiscript_parser.hpp

      bool SkipWS(bool skip_cr=false) {
        bool retval = false;

        while (m_position.has_more()) {
+          if(static_cast<size_t>(*m_position) > 0x7e) {


This cast looks wrong to me. Why not unsigned char instead?
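To make the concern concrete: where plain char is signed, a byte such as 0xEF holds a negative value, and converting it straight to size_t wraps around to a very large number. The comparison against 0x7e still comes out true, but only as a side effect of that wrap-around. Below is a small sketch of the alternative, assuming a hypothetical helper name `is_outside_printable_ascii`; the threshold 0x7e is taken from the diff above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration; not the code from the pull request.
bool is_outside_printable_ascii(char c) {
  // Converting through unsigned char maps every byte to 0..255,
  // regardless of whether plain char is signed on this platform.
  return static_cast<unsigned char>(c) > 0x7e;
}

// The direct cast questioned in the review. For ASCII bytes it yields the
// expected code; for bytes above 0x7f the result depends on the signedness
// of plain char.
std::size_t direct_cast(char c) {
  return static_cast<std::size_t>(c);
}
```

With the unsigned char conversion, the value being compared is the byte itself, so the intent of the check is visible in the code and no longer relies on integer wrap-around.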

MarioLiebisch · 2018-05-22T06:05:26Z

unittests/compiled_tests.cpp

+TEST_CASE("Non-ASCII characters in string")
+{
+  chaiscript::ChaiScript_Basic chai(create_chaiscript_stdlib(),create_chaiscript_parser());
+  CHECK_THROWS_AS(chai.eval<std::string>("prin\xeft \"Hello World\""), chaiscript::exception::eval_error);


You misunderstood me. I think there should be at least three test cases; I could even imagine five:

Read and run a UTF-8 BOM file – no exception, normal execution

Run a string with UTF-8 BOM – exception? could need others' opinion on this one I guess

Run a string with an invalid character at the start – exception

Run a string with an invalid character in the middle – exception

Run a string with an invalid character at the end – exception
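The string-based cases in this list can be sketched without ChaiScript or a test framework. In the snippet below, `validate_script` is a stand-in for the parser's character check and `rejects` is a hypothetical wrapper; neither is ChaiScript's API. The sketch also assumes that the open question about a BOM inside a string is answered with "exception". The first case in the list (reading a BOM-prefixed file) needs file input and is not covered here.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for the parser's character check; NOT ChaiScript's implementation.
// Throws if any byte lies above 0x7e, the threshold used in the parser diff.
void validate_script(const std::string &script) {
  for (std::string::size_type i = 0; i < script.size(); ++i) {
    if (static_cast<unsigned char>(script[i]) > 0x7e) {
      throw std::runtime_error("illegal character at offset " + std::to_string(i));
    }
  }
}

// Returns true when validate_script rejects the input.
bool rejects(const std::string &script) {
  try {
    validate_script(script);
  } catch (const std::runtime_error &) {
    return true;
  }
  return false;
}
```

When writing such inputs as C++ literals, keep in mind that a hex escape consumes every following hex digit, so a literal like "\xbf" followed directly by "abc" has to be split into two adjacent string literals.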

MarioLiebisch · 2018-05-24T06:42:35Z

include/chaiscript/language/chaiscript_engine.hpp

+        std::vector<char> v(bytes_needed);
+
+        infile.read(&v[0], static_cast<std::streamsize>(bytes_needed));
+        std::string buffer_string(v.begin(), v.end());


This step feels completely unnecessary, considering the comparison happens on a character by character level.

MarioLiebisch · 2018-05-24T06:47:01Z

include/chaiscript/language/chaiscript_engine.hpp

+        infile.read(&v[0], static_cast<std::streamsize>(bytes_needed));
+        std::string buffer_string(v.begin(), v.end());
+
+        if (!infile.eof()


The more I think about this check, the less sense it makes. Right now the BOM wouldn't be skipped if the file includes only the BOM (I think; not 100% sure if the eof bit is set already).

Either way, I think the better approach would be to allocate something like char header[3] = {0};, read the first 3 bytes from the stream and just compare them 1:1 with the BOM sequence, not even testing the length/position (since failed reads would result in \0).
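A minimal sketch of that suggestion, under stated assumptions: the function reuses the name skip_bom from the diff, but the body is an illustration and not the code that was merged, and `remaining_after_bom` is a hypothetical helper for trying it on in-memory strings. Rewinding to the start when no BOM is found is also an assumption about what the caller wants.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Sketch of the suggested approach; not the code that was merged.
// Returns true and leaves the stream just past the BOM if one is present;
// otherwise rewinds to the beginning and returns false.
bool skip_bom(std::istream &in) {
  char header[3] = {0};  // bytes a short read leaves untouched stay '\0'
  in.read(header, 3);

  const bool has_bom = static_cast<unsigned char>(header[0]) == 0xef
                    && static_cast<unsigned char>(header[1]) == 0xbb
                    && static_cast<unsigned char>(header[2]) == 0xbf;

  if (!has_bom) {
    in.clear();                  // a short read sets eofbit and failbit
    in.seekg(0, std::ios::beg);  // hand the full content back to the caller
  }
  return has_bom;
}

// Helper for trying the function on an in-memory string:
// returns the text that remains after the optional BOM.
std::string remaining_after_bom(const std::string &content) {
  std::istringstream in(content);
  skip_bom(in);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}
```

Because the buffer is zero-initialized, a file shorter than three bytes can never match the BOM sequence, so no separate length check is needed. A file that consists of the BOM alone is still recognised, since reading exactly three bytes from a three-byte stream does not fail.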

lefticus · 2018-05-24T16:18:43Z

Thank you both for the diligence on this. Once you have something sorted out that you like, if you're still having problems with MSVC, I'll debug that bit.

MarioLiebisch · 2018-05-24T20:55:39Z

include/chaiscript/language/chaiscript_engine.hpp

-        std::vector<char> v(bytes_needed);
+        std::streamsize bytes_needed = 3;
+        std::streamsize bytes_read = 0;
+        char buffer[3] = { '\0' };


You don't have to initialize the array elements to \0 if you check length anyway. This was more an idea in case you just do the three comparisons only and skip the length check.

But besides that, nothing more to comment on. :)

lefticus · 2018-05-26T14:24:41Z

FYI @AlekMosingiewicz and @MarioLiebisch I've just narrowed down why the tests are failing on Appveyor - it is because of an upgrade to cmake. I'll get it fixed shortly.

lefticus · 2018-05-26T17:54:01Z

OK, Appveyor builds are officially fixed now! If you are both happy with this I'll merge it in. I'd like to see this before the next release, and I'd like to make the next release very soon.

AlekMosingiewicz · 2018-05-26T18:16:59Z

I've nothing to add; I'm only waiting for @MarioLiebisch's opinion.

MarioLiebisch · 2018-05-26T18:24:32Z

Ditto, fine with me.

lefticus · 2018-05-26T20:08:21Z

Ok, I'm going to merge and run through some fuzz testing, since this is hitting the top of the parser.

AlekMosingiewicz added 3 commits May 10, 2018 17:44

Skip UTF-8 BOM before parsing begins.

f37d0e1

Cover skipping BOM with test.

1d78233

Simplify BOM test.

1e8f7f9

MarioLiebisch reviewed May 11, 2018

AlekMosingiewicz added 3 commits May 13, 2018 10:25

Throw exception when user-provided input contains BOM.

efbebee

Catch BOM at the beginning of file.

a024db0

Decrement file size when BOM is present to avoid parsing errors.

c09af92

AlekMosingiewicz added 2 commits May 15, 2018 19:25

Check for illegal characters while parsing input.

322568b

Added doc comment.

0d44b0b

MarioLiebisch reviewed May 21, 2018

AlekMosingiewicz added 2 commits May 21, 2018 17:04

Refactor skippable BOM detection.

60c0a0b

Non-ASCII characters now in random positions in test; test renamed.

b70a9e7

Merge branch 'develop' into handle-bom-in-script

be29b0a

MarioLiebisch reviewed May 22, 2018

AlekMosingiewicz added 5 commits May 22, 2018 16:23

Type cast fix.

d880d46

Another text size assertion.

f9615ef

Add missing test cases.

df6bc8f

Test case for BOM in user-provided string.

67dcd3e

Check EOF rather than buffer_size when skipping BOM.

4ada12a

MarioLiebisch reviewed May 24, 2018

Read the stream byte by byte, condition for size when skipping BOM.

ac10575

AlekMosingiewicz added 2 commits May 24, 2018 22:04

Use readsome instead of reading the stream byte-by-byte to detect BOM in processed file.

edadb7a

Initialize buffer to store potential BOM data before storing anything inside it.

51bb793

MarioLiebisch reviewed May 24, 2018

AlekMosingiewicz added 8 commits May 25, 2018 06:57

Skip buffer initialization.

51693aa

Attempt to remedy the problem occuring on Clang.

0e964da

Revert "Attempt to remedy the problem occuring on Clang."

42c355a

This reverts commit 0e964da.

Another attempt to remedy the problem occuring on Clang.

1711d50

Travis build quick fix.

393f8d3

Fix for Clang.

fb63503

Another fix for Clang.

0f67b2f

Fix implicit conversion warning.

b3f77f0

lefticus merged commit 61dfb22 into ChaiScript:develop May 26, 2018
