Get hash and hash algorithm of a signed PDF #31

fredericoschardong · 2021-08-31T21:05:02Z

Describe the solution you'd like
I would like to get the digest and digest algorithm after signing a PDF.

Describe alternatives you've considered
I have tried to get it out of the sign_pdf method somehow, but it only uses local variables... there is probably a way to get it from the signed PDF, but I don't know how to do it.

fredericoschardong · 2021-08-31T21:19:11Z

Just found out:

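from pyhanko.pdf_utils.reader import PdfFileReader
from pyhanko.sign.general import get_pyca_cryptography_hash

r = PdfFileReader(doc)  # doc: the signed PDF, opened in binary mode
sig = r.embedded_signatures[0]

sig.compute_digest()  # the document digest, as bytes
get_pyca_cryptography_hash(sig.external_md_algorithm)  # the matching hash algorithm object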

Please ignore this issue.

MatthiasValvekens · 2021-08-31T21:23:11Z

Hi Frederico,

There are a number of different aspects to your question, so allow me to break it down a little further.

First, if you're the one calling pyHanko to sign PDFs: you can actually control the digest function used via the PdfSignatureMetadata object you pass into sign_pdf. Unless the document has seed values on its signature fields, that setting will be respected. Seed values are rarely used in the wild these days, so chances are you won't ever have to worry about those.

The sign_pdf function is pretty much the most high-level API pyHanko exposes, so getting access to the internals from there will indeed be difficult. You could get access to all the information you want by hooking into the Signer class.

Aaaaand just as I was about to suggest using PdfFileReader(...).embedded_signatures, you apparently already figured that one out on your own, so that saves me some typing ;). It's a little less performant than getting the information out of the signer, but it's certainly more future-proof. Also, note that that approach will leave you with the document digest only. Technically, that's not the digest that's fed to the signature algorithm (although it is part of the signed attributes in the signature container). Whether that matters or not depends on what you want to do with the digest. If you just need it to identify the document later, I suppose the distinction is irrelevant.
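For readers wondering what "document digest" means concretely: it is the hash of the parts of the file covered by the signature's /ByteRange, which is everything except the signature contents themselves. Below is a rough, standard-library-only sketch of that idea. The helper name document_digest and the toy fake_pdf bytes are made up for illustration and are not part of pyHanko's API.

```python
import hashlib


def document_digest(pdf_bytes: bytes, byte_range, algorithm: str = "sha256") -> bytes:
    """Hash the two file segments listed in a /ByteRange array.

    byte_range has the form [start1, length1, start2, length2]; the gap
    between the two segments is where the signature contents live.
    """
    start1, len1, start2, len2 = byte_range
    h = hashlib.new(algorithm)
    h.update(pdf_bytes[start1:start1 + len1])
    h.update(pdf_bytes[start2:start2 + len2])
    return h.digest()


# Toy stand-in for a signed file (hypothetical data, not a real PDF):
# the "<...>" part plays the role of the /Contents hex string.
fake_pdf = b"HEADER-AND-BODY<deadbeef>TRAILER"
gap_start = fake_pdf.index(b"<")
gap_end = fake_pdf.index(b">") + 1
byte_range = [0, gap_start, gap_end, len(fake_pdf) - gap_end]

digest = document_digest(fake_pdf, byte_range)
print(digest.hex())
```

With a real file, the byte range would be read from the signature dictionary rather than computed by searching for delimiters as in this toy example.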

fredericoschardong · 2021-08-31T21:42:31Z

Hey Matthias!

Thanks for the quick reply. Actually, I need the hash of the content being signed... so I guess PdfFileReader(...).embedded_signatures is not the correct way to go. Moreover, thinking a bit more, I see my reply was a bit too early. I need to feed the service that gives me the certificate with the hash of the contents being signed. This hash will be embedded into the certificate (as some exotic OID), such that the certificate and signed PDF are somewhat bound.

MatthiasValvekens · 2021-08-31T22:11:24Z

Aha, I see. That creates multiple tricky chicken-or-egg problems, actually, although the particulars will depend on how the signing service you're using actually operates.

First, there's an architectural issue that you have to be aware of: if the certificate is not known when pyHanko starts the signing process, you can't use any of the high-level APIs. In your situation, given the degree of control that your workflow seems to require, you're pretty much forced to use the lower-level PdfCMSEmbedder API.

The other issue is more fundamental, and not really related to pyHanko as such: does the signing service you're using supply complete CMS (AKA PKCS#7) signature containers, or just raw signature values? Here's why that matters:

If they supply you with the CMS objects, then passing in the document digest is probably exactly what you need. From the CMS point of view, the document digest is used as the value of the messageDigest signed attribute, which is digested again (together with the other signed attributes) to produce the signature. Anyway, if the service supplies the full CMS container then you don't have to worry about any of that.
On the other hand, if you have to compose the CMS object yourself (i.e. the signing service only provides raw signatures), then it could really be either, and you need to make sure you know which it is. Are you absolutely sure that it's the signed attribute digest that needs to be embedded into the certificate, and not the document digest/messageDigest value?
If it's the former, then you have another problem: that would introduce a circular dependency that effectively makes the signing-certificate-v2 attribute impossible to compute, for example. More to the point: you'll have to implement a lot of the CMS construction logic yourself in that case. That probably isn't particularly difficult (and pyHanko still takes care of the PDF-specific parts either way), but it will require some care.
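The difference between the two digests, and the circular dependency mentioned above, can be illustrated with a deliberately simplified sketch. Note the assumptions: real CMS signed attributes are DER-encoded ASN.1 structures, whereas the byte concatenation below is only a toy stand-in, and all variable names and input bytes are hypothetical.

```python
import hashlib

# Digest #1: the document digest, which ends up as the messageDigest attribute.
document_bytes = b"bytes covered by the signature's ByteRange"
document_digest = hashlib.sha256(document_bytes).digest()

# The signing-certificate-v2 attribute contains a hash of the signer's certificate.
certificate_bytes = b"signer certificate bytes"
certificate_hash = hashlib.sha256(certificate_bytes).digest()

# Toy stand-in for the signed attributes (NOT real DER encoding).
signed_attrs = b"".join([
    b"contentType=data;",
    b"messageDigest=" + document_digest + b";",
    b"signingCertificateV2=" + certificate_hash + b";",
])

# Digest #2: the value that is actually fed to the signature algorithm.
signed_attrs_digest = hashlib.sha256(signed_attrs).digest()

print("document digest:    ", document_digest.hex())
print("signed attrs digest:", signed_attrs_digest.hex())
```

In this picture, the signed attributes digest depends on the certificate hash. If the certificate in turn had to contain the signed attributes digest, neither value could be computed first, whereas the document digest depends only on the document.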

Out of curiosity, can you point me towards the documentation for that certificate extension and/or protocol you're using? Assuming it's public information, of course.

fredericoschardong · 2021-08-31T22:29:36Z

> If they supply you with the CMS objects

Yep, that's the case. I send a PKCS#10 certificate request and get back a PKCS #7.

> the document digest is used as the value of the messageDigest signed attribute

I don't follow. But as you said, probably I don't have to worry about it.

> you're pretty much forced to use this lower-level API.

Reading the (awesome) documentation, I came across this interrupted signing functionality right at the beginning. Apparently, that's what I need, or am I missing something?

> Out of curiosity, can you point me towards the documentation for that certificate extension and/or protocol you're using? Assuming it's public information, of course.

Unfortunately, it is not public, and I cannot share much information about it, sorry.

MatthiasValvekens · 2021-08-31T22:42:41Z

> I send a PKCS#10 certificate request and get back a PKCS #7.

Alright, that at least eliminates the worst case scenario, which is good. Passing in the document digest should do the trick, then.

> Reading the (awesome) documentation, I came across this interrupted signing functionality right at the beginning. Apparently, that's what I need, or am I missing something?

I'd love to answer "yes" here, but I'm afraid you will have to use PdfCMSEmbedder directly... (see link above).
The interrupted signing API still requires the signer's certificate to be available from the beginning. This is because pyHanko runs a number of checks prior to preparing the document for signing, some of which depend on the certificate being available for inspection. In principle, some of that can probably be factored out, but it'll be tricky to do so without major API breakage, which I generally try to avoid. The interrupted signing workflow doesn't really skip any of that, it just exposes some controls to fine-tune performance.

The low-level PdfCMSEmbedder API is garbage-in/garbage-out, so it doesn't care about any of that.

> Unfortunately, it is not public, and I cannot share much information about it, sorry.

Ah, that's unfortunate. You'll have to make do with this abstract explanation then, I suppose :). That said, other people have successfully used PdfCMSEmbedder in their implementations before (based on the sample in the documentation), so chances are that you'll be able to make it work :).

fredericoschardong · 2021-09-01T01:00:08Z

I got it working with PdfCMSEmbedder. Thank you for your help!

MatthiasValvekens · 2021-09-03T22:52:18Z

Since this discussion seems to be resolved, I'll close the issue now. :)

fredericoschardong · 2021-10-26T18:10:45Z

Dear Matthias,

Only now have I realized that the lower-level API seems to produce a strange error regarding the timestamp. I have tried it with both a third-party timestamper and a local one, but both produce the same issue: the PDF is timestamped a day in the past.

Perhaps I am doing something stupidly wrong. Here is the minimal code to reproduce the issue.

Thank you.

MatthiasValvekens · 2021-10-26T19:44:52Z

Ah, good catch! This isn't a signing API problem; it appears to be a bug in generic.pdf_date() that causes it to treat negative UTC offsets incorrectly. Impressively, the bug's been in there since the very early days (introduced in commit 543aac7)...

I'll open a new issue for this one, since it's unrelated to your original question.
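As an illustration of how negative UTC offsets can be handled when formatting a PDF date string, here is a minimal standard-library sketch. It is not pyHanko's generic.pdf_date() implementation, and the function name format_pdf_date is hypothetical. One pitfall in this area, which may or may not be what happened in the actual bug, is that Python normalizes a negative timedelta into a negative number of days plus a positive number of seconds, so reading the seconds field directly gives misleading results.

```python
from datetime import datetime, timedelta, timezone


def format_pdf_date(dt: datetime) -> str:
    """Format a datetime as a PDF date string such as D:20211026151045-03'00'."""
    base = dt.strftime("D:%Y%m%d%H%M%S")
    offset = dt.utcoffset()
    if offset is None:
        # Naive datetime: no timezone information to append.
        return base
    total = int(offset.total_seconds())
    if total == 0:
        return base + "Z"
    # Work with the absolute value and carry the sign separately, so that
    # negative offsets are not affected by timedelta normalization.
    sign = "+" if total > 0 else "-"
    hours, remainder = divmod(abs(total), 3600)
    minutes = remainder // 60
    return f"{base}{sign}{hours:02d}'{minutes:02d}'"


# The pitfall: timedelta(hours=-3) is stored as days=-1, seconds=75600.
negative_offset = timedelta(hours=-3)
print(negative_offset.days, negative_offset.seconds)

local_time = datetime(2021, 10, 26, 15, 10, 45, tzinfo=timezone(negative_offset))
print(format_pdf_date(local_time))
```

Using total_seconds() and the absolute value keeps the date and time fields untouched and only affects the offset suffix.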

MatthiasValvekens added the question (Further information is requested) label Aug 31, 2021

MatthiasValvekens closed this as completed Sep 3, 2021

MatthiasValvekens mentioned this issue Oct 26, 2021

pdf_date() function outputs wrong dates when the input has a negative UTC offset #40

Closed

Repository owner locked and limited conversation to collaborators Oct 26, 2021


This issue was moved to a discussion.
