Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid ctype abuse. #2341

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

riastradh
Copy link

fix #2338
fix #2340

@riastradh riastradh force-pushed the riastradh-20240427-issue2340-ctypeabuse branch 2 times, most recently from 9aa4116 to beb7ca7 Compare April 27, 2024 20:15
@riastradh riastradh force-pushed the riastradh-20240427-issue2340-ctypeabuse branch from beb7ca7 to 789d306 Compare April 27, 2024 20:18
@afh afh self-assigned this Apr 28, 2024
Copy link
Member

@afh afh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @riastradh, very much appreciated!
Would you be willing to also add tests for the changes? I think that tests would help explain the issue(s) with the code prior to this PR and avoid regressions in the future.

@@ -1034,7 +1035,7 @@ bool amount_t::parse(std::istream& in, const parse_flags_t& flags)
commodity_t::parse_symbol(in, symbol);

if (! in.eof() && ((n = static_cast<char>(in.peek())) != '\n')) {
if (std::isspace(static_cast<char>(in.peek())))
if (std::isspace(in.peek()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide more context please, as to why the static_cast should be removed here, please?

Copy link
Author

@riastradh riastradh Apr 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suppose we're on a platform where char is signed and EOF is -1.

in.peek() returns an int, and the set of values that it can return is: {-1, 0, 1, 2, 3, 4, ..., 254, 255}. (That's 257 distinct possible values.) This set of values is always valid for std::isspace -- they're designed to be used together like this! -- so the cast is unnecessary.

What happens if we cast anyway?

Of that set, values in the subset {128, 129, 130, ..., 254, 255} are mapped by static_cast<char> to {-128, -127, -126, ..., -2, -1}.

Of that set, only -1 (coming from 255) is a valid input to std::isspace, because it coincides with the value of EOF, but that's probably not what is intended. (It might work by accident here, but if we were asking std::isalpha in an ISO-8859-1 locale, it should return true for 255 but false for EOF.)

All the other possible values 128 and over that in.peek() can return lead to passing values in {-128, -127, ..., -3, -2} to std::isspace, which is undefined behaviour.

So not only is the static_cast<char> here unnecessary, but it's actually harmful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's helpful, thanks! Do you know any platforms where EOF is -1? Once I again, I'd like to kindly ask for tests that catch ledger's current misbehaviour and ensures regressions to this fix are caught. Would you be willing to do that? If there are questions on how to write tests for ledger, I'm happy to help.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know any platforms where EOF is -1?

I don't know any platforms where EOF isn't -1. In principle such platforms might exist and still conform to the C standard. But it's -1 on all the BSDs, on glibc, and on macOS.

Comment on lines +135 to +136
*r++ = static_cast<char>(std::tolower(
static_cast<unsigned char>(*q)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is first casting to unsigned char and then casting to char intentional here? If yes, what are the implications?

Copy link
Author

@riastradh riastradh Apr 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is intentional because the only change I made was to replace std::tolower(*q) by std::tolower(static_cast<unsigned char>(*q)); I preserved the context.

It may be that you can safely remove the static_cast<char>(...) around it. EOF is not a possible output of tolower here, because EOF is not a possible input coming via static_cast<unsigned char>(...), so there's no risk of losing information from a non-injective conversion of int to char. But it might trigger warnings about implicit signedness conversions.

I left it as is to make auditing this change easier: keeping the outer static_cast<char> can't make the code worse. If you want to remove it, that's probably fine; I just felt it would be better to defer to a separate change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to reply in detail, @riastradh. I agree with your reasoning.

@riastradh
Copy link
Author

Thank you for this PR @riastradh, very much appreciated! Would you be willing to also add tests for the changes? I think that tests would help explain the issue(s) with the code prior to this PR and avoid regressions in the future.

It would be nice to make sure all of these cases are exercised, and normally for bug fixes like this I would prefer to add an xfail test in one commit and fix the bug in another commit. However, the symptom of the undefined behaviour is nondeterministic: you're reading out whatever random data happens to precede the ctype/tolower/toupper tables in memory. So it might sometimes return the right answers, and might sometimes return the wrong ones. Or it might crash. Or demons might fly out of your nose. But I don't think the ledger test harness has a way to verify nasal demons.

That said: the tests have been failing on NetBSD for a while, and I just never got around to tracking down the cause. It turns out that this change together with one other fix (for a GCC libstdc++ bug, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114879, which breaks std::ios::sync_with_stdio(false) on NetBSD, https://gnats.NetBSD.org/58206) resolves all the test failures on my NetBSD system. I'm also considering committing changes to NetBSD to guarantee that ctype abuse triggers immediate SIGSEGV, rather than silent data corruption (https://gnats.NetBSD.org/58208).

So while the existing ledger tests might not exercise all possible ctype abuse (and there's a lot -- there are many calls to ctype functions in ledger), they do seem to exercise enough to raise an issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

<cctype> abuse Encoding error
2 participants