Add support for hash chaining to detect modifications in postings #2300

Open
wants to merge 30 commits into
base: master

jwiegley
Member

@jwiegley jwiegley commented Nov 23, 2023

The following details of a posting contribute to its hash:

  • fullname of account
  • string representation of amount

Each posting's hash contributes to the transaction hash, which is composed of:

  • previous transaction’s hash (as encountered in parsing order)
  • actual date
  • optional auxiliary date
  • optional code
  • payee
  • hashes of all postings
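
To make the chaining concrete, here is a minimal C++ sketch of the scheme listed above. It is not the PR's code: `toy_digest` is a non-cryptographic FNV-1a stand-in for SHA-512 (used only to keep the sketch dependency-free), the `Posting` and `Transaction` structs are hypothetical simplifications of ledger's types, and joining fields with newlines is an assumption, since the exact byte layout is not stated here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in digest: 64-bit FNV-1a rendered as 16 hex digits.
// NOT cryptographic; the PR uses SHA-512. Assumption for illustration only.
std::string toy_digest(const std::string& data) {
  std::uint64_t h = 14695981039346656037ULL;  // FNV-1a offset basis
  for (unsigned char c : data) {
    h ^= c;
    h *= 1099511628211ULL;  // FNV prime
  }
  static const char digits[] = "0123456789abcdef";
  std::string hex(16, '0');
  for (int i = 15; i >= 0; --i) {
    hex[i] = digits[h & 0xf];
    h >>= 4;
  }
  return hex;
}

// Hypothetical, simplified stand-ins for ledger's posting and transaction types.
struct Posting {
  std::string account;  // full account name
  std::string amount;   // string representation of the amount
};

struct Transaction {
  std::string date;
  std::string aux_date;  // empty when absent
  std::string code;      // empty when absent
  std::string payee;
  std::vector<Posting> postings;
};

// A posting's hash covers its account name and amount string.
std::string posting_hash(const Posting& p) {
  return toy_digest(p.account + '\n' + p.amount);
}

// A transaction's hash covers the previous hash, its own fields,
// and the hashes of all of its postings.
std::string transaction_hash(const std::string& prev_hash, const Transaction& t) {
  std::string buf = prev_hash + '\n' + t.date + '\n' + t.aux_date + '\n' +
                    t.code + '\n' + t.payee;
  for (const Posting& p : t.postings) buf += '\n' + posting_hash(p);
  return toy_digest(buf);
}

// Hash a journal in parsing order; returns one chained hash per transaction.
std::vector<std::string> chain_hashes(const std::vector<Transaction>& journal) {
  assert(!journal.empty());
  std::vector<std::string> out;
  std::string prev;  // the first transaction has no predecessor
  for (const Transaction& t : journal) {
    prev = transaction_hash(prev, t);
    out.push_back(prev);
  }
  return out;
}
```

Because each hash folds in its predecessor, editing the first transaction changes every later hash, while editing the last one leaves the earlier hashes untouched.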

Note that this means that changes in the “code” or any of the comments

At the moment only "sha512" or "SHA512" is accepted, but this could extend to
more algorithms in the future.
Also, support matching provided hashes against a prefix of the generated
hash.
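
The two commit messages above can be illustrated with a short C++ sketch. Both function names are hypothetical and not taken from the PR. Rejecting an empty prefix and comparing case-sensitively are assumptions; the PR may behave differently on either point.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: only the two spellings named in the commit message
// are accepted as the algorithm argument.
bool is_supported_hash_algorithm(const std::string& name) {
  return name == "sha512" || name == "SHA512";
}

// Hypothetical helper: true when `provided` is a non-empty prefix of
// `generated`. The comparison is case-sensitive (an assumption).
bool hash_prefix_matches(const std::string& provided,
                         const std::string& generated) {
  assert(!generated.empty());
  if (provided.empty() || provided.size() > generated.size()) return false;
  return generated.compare(0, provided.size(), provided) == 0;
}
```

With such a check, a journal could record only the leading digits of a hash and still be verified against the full generated value.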
@jwiegley jwiegley self-assigned this Nov 27, 2023
Member

@simonmichael simonmichael left a comment

Hi @jwiegley, reviewing as requested - no C++ code review, just some high level thoughts:

The following details of a posting contribute to its hash:

fullname of account
string representation of amount

Each posting's hash contributes to the transaction hash, which is composed
of:

previous transaction’s hash (as encountered in parsing order)
actual date
optional auxiliary date
optional code
payee
hashes of all postings

Note that this means that changes in the “code” or any of the comments

Maybe the above details should appear in docs also ? Apologies if I missed it.

Posting's "string representation of amount" - that's the representation in the journal file, I assume (not what print or reg would show).

--hashes option requires an argument to specify the algorithm
At the moment only "sha512" or "SHA512" is accepted, but this could extend to
more algorithms in the future.

Overall comments:

Cool feature!

As we discussed in chat, one obvious user benefit it promises is being able to warn when any past entries have changed in the journal files. VCS users can already detect this before commit, but this does not require a VCS and would be available to every Ledger user without setup. VCS users who don't check the diff before committing might find it helpful to avoid accidentally committing fat-finger edits, eg.

I think users fairly often want to clean up small mistakes, whitespace, or even make bigger cleanups to old files and entries. And the commit messages above make me think this will be very sensitive - to any edits, to changes in hashing algorithm, to any rearrangement of included files or to a different order of file arguments on the command line (because of "in parsing order"). So it's my guess that users of this will quite often need to regenerate hashes for all of their data. Maybe that's not a problem, I'm not totally clear on the workflow. I imagine it would mean replacing at least some explicit Hash metadata values (tags) in journal entries in all old files (and committing those changes in VCS).

It seems to me to be a prototype that will need field testing and tweaking to find its best design and usage patterns. Possibly it's worth signalling this status to users by mentioning "Experimental" in descriptions.

As mentioned in chat Tackler has some similar-ish features described at https://tackler.e257.fi/docs/auditing - perhaps not this exactly, but there might be some interesting related ideas there.

Hope this helps! I appreciate this exploration and will follow with interest.

Member

@afh afh left a comment

Does this bring ledger closer to being a triple-entry accounting system? 😃

I left some comments from first glance below, and will take a closer look when trying out the proposed changes on my machine.

jwiegley and others added 3 commits December 6, 2023 14:11
Co-authored-by: Alexis Hildebrandt <afh@surryhill.net>
Co-authored-by: Alexis Hildebrandt <afh@surryhill.net>
Co-authored-by: Alexis Hildebrandt <afh@surryhill.net>
@afh
Member

afh commented Dec 7, 2023

Thanks for the context on sha512_256, @jwiegley.

I did a bit of research into libraries supporting SHA-512/256, e.g. libtomcrypt, Botan-3, pycryptodome. These libraries support it in a different manner than this PR suggests, in the sense that a truncated hash "is not equivalent to simply truncating the output digest" (pycryptodome).

Possibly this is "for users to be able to distinguish between a SHA-512 digest which has been truncated and a SHA512/256 digest, [offering] new initialization constants, analogous to those used in SHA-384." (https://eprint.iacr.org/2010/548.pdf)
Is this to also thwart length extension attacks? (see https://news.ycombinator.com/item?id=21981874)

I took the liberty to hack on a little Nix Flake to get a feel for the API of the different implementations and evaluate the feasibility of base64 encoding. The Flake uses LibreSSL, Botan-3, and tomcrypt to compute the SHA-512 and SHA-512/256 hashes for the given arguments and print the hashes as a hex and a base64 encoded string.

What are your thoughts on using one of the aforementioned libraries or another third-party implementation?

nix build https://projects.surryhill.net/ledger/sha512-test-0.1.3.tgz
./result/bin/sha512-test 'Heureka!'
input           = Heureka!
LibreSSL
sha512          = 83bcf89b75e21ab7d9fe332a6f82ca4d1e94ec587cec1e137d50087fcc6b7518f366ee9e2ba086346bdcc0561a522db4b3bdebc53483199f58ac7139531ded7c
sha512/evp      = g7z4m3XiGrfZ/jMqb4LKTR6U7Fh87B4TfVAIf8xrdRjzZu6eK6CGNGvcwFYaUi20s73rxTSDGZ9YrHE5Ux3tfA==
sha512_256      = 83bcf89b75e21ab7d9fe332a6f82ca4d1e94ec587cec1e137d50087fcc6b7518
sha512_256/evp  = N/A
Botan3
SHA-512         = 83BCF89B75E21AB7D9FE332A6F82CA4D1E94EC587CEC1E137D50087FCC6B7518F366EE9E2BA086346BDCC0561A522DB4B3BDEBC53483199F58AC7139531DED7C
SHA-512.b64     = g7z4m3XiGrfZ/jMqb4LKTR6U7Fh87B4TfVAIf8xrdRjzZu6eK6CGNGvcwFYaUi20s73rxTSDGZ9YrHE5Ux3tfA==
SHA-512/256     = 5EABC68E077C6338A305D388F20AE3A04200F6D942164FFBFD659E345C39D0A7
SHA-512/256.b64 = XqvGjgd8YzijBdOI8grjoEIA9tlCFk/7/WWeNFw50Kc=
tomcrypt
SHA-512         = 83bcf89b75e21ab7d9fe332a6f82ca4d1e94ec587cec1e137d50087fcc6b7518
SHA-512.b64     = g7z4m3XiGrfZ/jMqb4LKTR6U7Fh87B4TfVAIf8xrdRjzZu6eK6CGNGvcwFYaUi20s73rxTSDGZ9YrHE5Ux3tfA==
SHA-512/256     = 5eabc68e077c6338a305d388f20ae3a04200f6d942164ffbfd659e345c39d0a7
SHA-512/256.b64 = XqvGjgd8YzijBdOI8grjoEIA9tlCFk/7/WWeNFw50Kc=
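
The digests printed above already show the point about truncation. As a small hedged check in C++, the constants below are copied from that output for the input 'Heureka!', and `truncate_to_256_bits` is a hypothetical helper that mimics taking the first 32 bytes of a SHA-512 digest, which is what the LibreSSL `sha512_256` line does.

```cpp
#include <cassert>
#include <string>

// SHA-512 of 'Heureka!' as printed by LibreSSL and Botan3 above.
const std::string kSha512 =
    "83bcf89b75e21ab7d9fe332a6f82ca4d1e94ec587cec1e137d50087fcc6b7518"
    "f366ee9e2ba086346bdcc0561a522db4b3bdebc53483199f58ac7139531ded7c";

// SHA-512/256 of 'Heureka!' as printed by Botan3 and tomcrypt above.
const std::string kSha512_256 =
    "5eabc68e077c6338a305d388f20ae3a04200f6d942164ffbfd659e345c39d0a7";

// Hypothetical helper: keep the first 32 bytes (64 hex digits) of a
// hex-encoded SHA-512 digest.
std::string truncate_to_256_bits(const std::string& sha512_hex) {
  assert(sha512_hex.size() == 128);
  return sha512_hex.substr(0, 64);
}

// True when plain truncation would reproduce the SHA-512/256 digest.
bool truncation_equals_sha512_256() {
  return truncate_to_256_bits(kSha512) == kSha512_256;
}
```

For this input the two 256-bit values share no common prefix, so `truncation_equals_sha512_256()` is false: a hash produced by truncation cannot be verified by a tool that implements SHA-512/256 proper, and vice versa.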

If you'd like to inspect the small utility more closely, have a look at main.cc below or download and unpack the flake archive from https://projects.surryhill.net/ledger/sha512-test-0.1.3.tgz and open main.cc in your $EDITOR.

main.cc
#include <cstdlib>   // malloc, calloc
#include <cstring>   // memcpy
#include <string>
#include <sstream>
#include <iomanip>
#include <iostream>
#include <vector>

#include <openssl/crypto.h>
#include <openssl/sha.h>

#include <openssl/hmac.h>
#include <openssl/evp.h>
#include <openssl/bio.h>
#include <openssl/buffer.h>

#include <botan-3/botan/hash.h>
#include <botan-3/botan/hex.h>
#include <botan-3/botan/base64.h>

#include <tomcrypt.h>

// Convert buffer to hex string. Kudos to https://github.com/ledger/ledger/pull/2300
std::string bufferToHex(const unsigned char* buffer, std::size_t size) {
    std::ostringstream oss;
    oss << std::hex << std::setfill('0');
    for(std::size_t i = 0; i < size; ++i)
        oss << std::setw(2) << static_cast<int>(buffer[i]);
    return oss.str();
}

// Encode input as base64 using LibreSSL BIO. Kudos to https://ioncannon.net/programming/34/howto-base64-encode-with-cc-and-openssl/
char *bio_base64(const unsigned char *input, int length) {
  BIO *bmem, *b64;
  BUF_MEM *bptr;
 
  b64 = BIO_new(BIO_f_base64());
  bmem = BIO_new(BIO_s_mem());
  b64 = BIO_push(b64, bmem);
  BIO_write(b64, input, length);
  BIO_flush(b64);
  BIO_get_mem_ptr(b64, &bptr);
 
  char *buff = (char *)malloc(bptr->length);
  memcpy(buff, bptr->data, bptr->length-1);
  buff[bptr->length-1] = 0;
 
  BIO_free_all(b64);
 
  return buff;
}

// Encode input as base64 using LibreSSL EVP. Kudos to mtrw https://stackoverflow.com/a/60580965
char *evp_base64(const unsigned char *input, int length) {
  const auto pl = 4*((length+2)/3);
  auto output = reinterpret_cast<char *>(calloc(pl+1, 1)); // +1 for the terminating null that EVP_EncodeBlock adds on
  const auto ol = EVP_EncodeBlock(reinterpret_cast<unsigned char *>(output), input, length);
  if (pl != ol) { std::cerr << "Whoops, encode predicted " << pl << " but we got " << ol << "\n"; }
  return output;
}

// Encode input as base64 using tomcrypt. Kudos to https://techoverflow.net/2012/11/20/cc-base64-codec-using-libtomcrypt/
std::string encodeBase64(const char* input, const unsigned long inputSize) {
    unsigned long outlen = inputSize + (inputSize / 3.0) + 16;
    unsigned char* outbuf = new unsigned char[outlen]; //Reserve output memory
    base64_encode((unsigned char*) input, inputSize, outbuf, &outlen);
    std::string ret((char*) outbuf, outlen);
    delete[] outbuf;
    return ret;
}

int main(int argc, char*argv[]) {

  // Setup program arguments
  std::vector<std::string> args;
  if (argc > 1)
    args.assign(argv+1, argv + argc);

  // Initialize LibreSSL
  OPENSSL_init_crypto(0, NULL);

  // Initialize Botan3
  const auto bt_sha512_256 = Botan::HashFunction::create_or_throw("SHA-512-256");
  const auto bt_sha512     = Botan::HashFunction::create_or_throw("SHA-512");

  // Initialize tomcrypt
  register_all_ciphers();
  register_all_hashes();
  int tc_sha512_256 = find_hash("sha512-256");
  int tc_sha512     = find_hash("sha512");

  for (const auto& arg : args) {
    std::cout << "input           = " << arg << std::endl;

    // Compute Hash using LibreSSL
    const unsigned char* input = (const unsigned char*)arg.c_str();
    unsigned char* hv = SHA512(input, arg.length(), NULL);
    std::cout << "LibreSSL" << std::endl
      << "sha512          = " << bufferToHex(hv, 64)
      << std::endl
      << "sha512/evp      = " << evp_base64(hv, SHA512_DIGEST_LENGTH)
      << std::endl
      << "sha512_256      = " << bufferToHex(hv, 32)
      << std::endl
      << "sha512_256/evp  = " << "N/A"
      // bio_base64 includes a new line every 64 bytes, which is impractical
      // for ledger's use-case, i.e. single line checksum.
      //<< std::endl
      //<< "sha512/bio      = " << bio_base64(hv, SHA512_DIGEST_LENGTH)
      << std::endl;

    // Compute Hash using Botan3
    bt_sha512->update(input, arg.length());
    auto bt_sha512f = bt_sha512->final();
    std::cout << "Botan3" << std::endl
      << "SHA-512         = " << Botan::hex_encode(bt_sha512f)
      << std::endl
      << "SHA-512.b64     = " << Botan::base64_encode(bt_sha512f)
      << std::endl;

    bt_sha512_256->update(input, arg.length());
    auto bt_sha512_256f = bt_sha512_256->final();
    std::cout
      << "SHA-512/256     = " << Botan::hex_encode(bt_sha512_256f)
      << std::endl
      << "SHA-512/256.b64 = " << Botan::base64_encode(bt_sha512_256f)
      << std::endl;

    // Compute Hash using tomcrypt
    // A stack buffer avoids leaking a heap allocation on every loop iteration.
    unsigned long outl = MAXBLOCKSIZE;
    unsigned char out[MAXBLOCKSIZE];
    if (hash_memory(tc_sha512, (unsigned char*)input, arg.length(), out, &outl) != CRYPT_OK)
      continue;
    std::cout << "tomcrypt" << std::endl
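      // outl holds the digest size (64 bytes for SHA-512); a hard-coded 32
      // prints only the first half of the digest.
      << "SHA-512         = " << bufferToHex(out, outl)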
      << std::endl
      << "SHA-512.b64     = " << encodeBase64((const char*)out, outl)
      << std::endl;

    if (hash_memory(tc_sha512_256, (unsigned char*)input, arg.length(), out, &outl) != CRYPT_OK)
      continue;
    std::cout
      << "SHA-512/256     = " << bufferToHex(out, 32)
      << std::endl
      << "SHA-512/256.b64 = " << encodeBase64((const char*)out, outl)
      << std::endl;
  }

  OPENSSL_cleanup();
  return 0;
}
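
The comment in main.cc notes that the BIO-based encoder wraps its output, which is impractical for a single-line checksum. As a rough sketch of what a dependency-free alternative could look like, the following C++ encodes bytes as unwrapped base64 using the standard RFC 4648 alphabet. The names `hex_decode` and `to_base64` are hypothetical and not part of ledger or of any library mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode a hex string (either case) into raw bytes.
std::vector<unsigned char> hex_decode(const std::string& hex) {
  assert(hex.size() % 2 == 0);
  auto nibble = [](char c) -> int {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;  // not a hex digit
  };
  std::vector<unsigned char> out;
  for (std::size_t i = 0; i < hex.size(); i += 2) {
    int hi = nibble(hex[i]);
    int lo = nibble(hex[i + 1]);
    assert(hi >= 0 && lo >= 0);
    out.push_back(static_cast<unsigned char>(hi * 16 + lo));
  }
  return out;
}

// Standard base64 (RFC 4648 alphabet, '=' padding) with no line breaks.
std::string to_base64(const std::vector<unsigned char>& in) {
  static const char tbl[] =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  std::string out;
  out.reserve(4 * ((in.size() + 2) / 3));
  for (std::size_t i = 0; i < in.size(); i += 3) {
    std::size_t n = in.size() - i;  // bytes remaining, at least 1
    unsigned v = static_cast<unsigned>(in[i]) << 16;
    if (n > 1) v |= static_cast<unsigned>(in[i + 1]) << 8;
    if (n > 2) v |= static_cast<unsigned>(in[i + 2]);
    out += tbl[(v >> 18) & 63];
    out += tbl[(v >> 12) & 63];
    out += n > 1 ? tbl[(v >> 6) & 63] : '=';
    out += n > 2 ? tbl[v & 63] : '=';
  }
  return out;
}
```

Fed the SHA-512/256 hex digest from the sample output, `to_base64` yields the same 44-character string that Botan3 and tomcrypt print, compared with 64 characters for the hex form.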

@afh afh added the enhancement New feature or request label Dec 7, 2023
@afh afh added this to the 3.4 milestone Dec 10, 2023
@jwiegley jwiegley requested a review from afh December 12, 2023 03:26
Member

@afh afh left a comment

@jwiegley I see several changes to the newly added src/sha512.cc file. I'd prefer to treat it as third-party code that is integrated into ledger verbatim (just like utfcpp) or, even better, replace it with a third-party library (see my previous comment) that provides calculation of SHA-512 and SHA-512/256, ideally with support for base64 encoding.

What are your thoughts?

@jwiegley
Member Author

@jwiegley I see several changes to the newly added src/sha512.cc file. I'd prefer to treat it as third-party code that is integrated into ledger verbatim (just like utfcpp) or, even better, replace it with a third-party library (see my previous comment) that provides calculation of SHA-512 and SHA-512/256, ideally with support for base64 encoding.

What are your thoughts?

I can reduce the number of changes down to just s/uint8_t/unsigned char/, which I can even do before including it elsewhere, so let me try removing all changes to it.

@jwiegley
Member Author

@afh I've reverted all of my changes to the SHA512 code. What I would like to understand now is why the flake build here on GitHub fails, when it succeeds just fine on my machine, using either nix build or nix develop followed by make.

@jwiegley
Member Author

I was able to reproduce the build failure in a Linux VM, and found that all we're missing are two standard system headers.

@jwiegley jwiegley requested a review from afh December 12, 2023 20:40
@jwiegley
Member Author

What are your thoughts on using one of the aforementioned libraries or another third-party implementation?

I'm not really excited about new dependencies, they come with so many other costs (maintenance, licensing, keeping up-to-date, etc). This is a stable, simple algorithm, and we can crack the can on using a 3rd party library if it becomes a popular feature and people end up wanting other algorithms besides the default ones offered.

@jwiegley
Member Author

I'm currently awaiting a close review of xact_t::hash from @afh before merging this in.

@afh
Member

afh commented Dec 21, 2023

I have this on my agenda, yet it'll likely take me until after the holidays and probably New Year's before I get to this. 🎄🎇
