Line in binary data file can be wrongly interpreted as header #17

Closed
BerndDoser opened this issue Apr 16, 2019 · 5 comments

@BerndDoser

The first byte of the data section in a binary file can happen to be a '#' character. The parser then misinterprets that line as a header comment line (header support was introduced in #1), because it only checks the first character:

if (line[0] != '#') break;
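
To illustrate how binary data can pass this check: '#' is ASCII 0x23, which is decimal 35, so on a little-endian machine a 32-bit integer with the value 35 is stored as the bytes 0x23 0x00 0x00 0x00. The following is a minimal sketch, not the project's code; `raw_bytes` and `looks_like_header` are hypothetical names, and the value 0x23232323 is chosen only so that the demonstration does not depend on byte order.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: return the in-memory bytes of a 32-bit integer,
// as a raw binary write would emit them (native byte order).
std::string raw_bytes(std::uint32_t value)
{
    char buffer[sizeof(value)];
    std::memcpy(buffer, &value, sizeof(value));
    return std::string(buffer, sizeof(value));
}

// Same test as the one-character check quoted above:
// true if the line would be treated as a header line.
bool looks_like_header(const std::string& line)
{
    return !line.empty() && line[0] == '#';
}
```

With these helpers, `looks_like_header(raw_bytes(0x23232323u))` is true on any byte order, and `looks_like_header(raw_bytes(35))` is true on little-endian hardware.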

To fix this, I suggest ending the header section with the line:
# END OF HEADER

This remains backward compatible, because the line is not needed if no header is used.

@BerndDoser BerndDoser added the bug Something isn't working label Apr 16, 2019
@BerndDoser BerndDoser added this to the 1.2 milestone Apr 16, 2019
@BerndDoser BerndDoser self-assigned this Apr 16, 2019
@tjgalvin

Would the idea be that if a # is picked up, the rest of the file is scanned for the # END OF HEADER line? I'm just wondering what would happen if another 'lucky' integer in the number-of-images field tricks the parser.

Perhaps we should consider both a # START OF HEADER and a # END OF HEADER line? It might not be entirely backwards compatible, but it should put the issue to rest.

@BerndDoser

In the current implementation, a header section is only allowed in front of the binary section, and each header line has to start with a #. I start reading the binary section at the first line whose first character is not a #. The issue arises because the first byte of an integer can also be read as a #. To prevent this, we can put a separator line between the header and the binary section, like:

# This is a header line
# This is also a header line
# END OF HEADER

Here, I can check the whole line instead of only the first character, and it is practically impossible for the binary section to match this string. A starting line is not necessary.

@tjgalvin

I think we agree. The header is only valid if there is a # END OF HEADER line.

If an integer value is incorrectly interpreted as a #, we will know because the subsequent line is neither the # END OF HEADER terminator nor another line starting with a #. At that point the file cursor is returned to the start. This makes sense to me. Easy fix!
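
The agreed behaviour could be sketched roughly as follows. This is an illustration under assumptions, not the code that was merged: `skip_header` is a hypothetical name, the stream is assumed to be opened in binary mode and in a good state, and lines are assumed to end in a plain '\n'.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper. Leaves the stream positioned at the first byte of
// the binary section. A header is accepted only if every line starts with
// '#' and the block ends with the exact terminator line; otherwise the
// stream is rewound to where it started.
std::istream& skip_header(std::istream& is)
{
    assert(is.good());  // precondition assumed by this sketch
    const std::string terminator = "# END OF HEADER";
    const std::istream::pos_type start = is.tellg();

    std::string line;
    while (std::getline(is, line)) {
        if (line == terminator) return is;          // header complete
        if (line.empty() || line[0] != '#') break;  // not a header line
    }

    // No valid header: reset any eof/fail bits, then rewind.
    is.clear();
    is.seekg(start);
    return is;
}
```

A file that starts with a stray '#' byte but never contains the terminator line is treated as having no header, so reading resumes from the original position.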

@tjgalvin

I think this issue may need to be reopened. I am now getting errors when training the SOM with initialization from a file.

terminate called after throwing an instance of 'std::runtime_error'
what(): readSOM: wrong numberOfChannels.

The numberOfChannels check is the first one performed after attempting to bypass the header. I can't see exactly how the error is being introduced, but it may be an off-by-one error?

@tjgalvin

I think I figured this out.

The std::getline function puts the stream into an error state when it reaches the end of the file: it sets eofbit, plus failbit if no characters were extracted. When a SOM file has no header, the loop that scans for the # END OF HEADER string runs all the way to the end of the file and leaves the stream in that error state. The subsequent seekg then fails, and tellg returns -1.

I've found that putting an is.clear() before the is.seekg(binary_start_position, is.beg) (i.e. after the scan for the header terminator) clears the error state and fixes the issue, at least in my testing.
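
The effect can be reproduced in isolation with a small sketch. `rewind_after_scan` is a hypothetical function written for this illustration only, and it uses the one-argument form of seekg rather than the two-argument call quoted above.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical reproduction: scan for the terminator, then try to rewind.
// Returns the position reported by tellg() after the rewind attempt.
std::streamoff rewind_after_scan(std::istream& is, bool clear_first)
{
    assert(is.good());  // precondition assumed by this sketch
    const std::istream::pos_type binary_start_position = is.tellg();

    std::string line;
    while (std::getline(is, line)) {
        if (line == "# END OF HEADER") break;
    }
    // If the terminator was never found, the final getline extracted
    // nothing at end of file, so both eofbit and failbit are now set.

    if (clear_first) is.clear();      // reset the error state
    is.seekg(binary_start_position);  // no effect while failbit is set
    return static_cast<std::streamoff>(is.tellg());
}
```

For a stream without the terminator line, the call with `clear_first == false` yields -1, while the call with `clear_first == true` yields the original start position.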
