Convert PDFs to black and white to remove printer dots #23

gszathmari · 2017-06-06T14:07:00Z

Adding optional switch to convert the document to black and white to remove printer dots


gszathmari · 2017-06-06T14:15:19Z

The following two commands can help reveal the yellow, almost invisible printer dots:

convert -channel RG -fx 0 page-0.png blue.png
convert -fx b page-0.png grey.png
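
To illustrate why isolating the blue channel exposes the dots: a yellow pixel has high red and green values but a lower blue value, so in the blue channel alone it shows up as a darker spot against white paper. The following is a minimal Python sketch of that idea, roughly mirroring what the second command computes. The pixel values and the names `PAPER`, `DOT` and `blue_channel` are illustrative assumptions, not part of the pull request.

```python
# Illustrative pixel values (assumed, not measured from a real scan).
PAPER = (255, 255, 255)   # white paper
DOT = (255, 255, 200)     # faint yellow tracking dot


def blue_channel(image):
    """Return a grayscale image holding only the blue component of each pixel."""
    return [[b for (_r, _g, b) in row] for row in image]


# A 3x3 patch of paper with one yellow dot in the middle.
patch = [
    [PAPER, PAPER, PAPER],
    [PAPER, DOT, PAPER],
    [PAPER, PAPER, PAPER],
]

blue = blue_channel(patch)
for row in blue:
    print(row)
```

In the printed rows the centre value is lower than its neighbours, which is the contrast that makes the dot visible once the red and green channels are discarded.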

Before printer dot sanitisation

After printer dot sanitisation

Frankkkkk · 2017-06-06T15:35:29Z

I would also apply some mathematical morphology (i.e. erode then dilate) in order to remove lone black pixels that may be used to transmit information.
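
As a rough illustration of the erode-then-dilate idea (a morphological "opening"), here is a pure-Python sketch on a tiny binary image where 1 means a black pixel. The 3x3 structuring element, the treatment of out-of-image pixels as white, and all function names are assumptions made for this example; a real implementation would more likely use an image library.

```python
def neighbourhood(img, y, x):
    """Return the 3x3 neighbourhood of (y, x); pixels outside the image count as 0."""
    h, w = len(img), len(img[0])
    return [
        img[ny][nx] if 0 <= ny < h and 0 <= nx < w else 0
        for ny in (y - 1, y, y + 1)
        for nx in (x - 1, x, x + 1)
    ]


def erode(img):
    """A pixel stays black only if its whole 3x3 neighbourhood is black."""
    return [[int(all(neighbourhood(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]


def dilate(img):
    """A pixel becomes black if any pixel in its 3x3 neighbourhood is black."""
    return [[int(any(neighbourhood(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]


def opening(img):
    """Erode, then dilate: removes features smaller than the structuring element."""
    return dilate(erode(img))


# A 3x3 solid block (stands in for a text stroke) plus one lone black pixel.
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

cleaned = opening(image)
for row in cleaned:
    print(row)
```

In this example the solid block survives and the lone pixel disappears. Any stroke thinner than the structuring element would be removed too, so the element size has to be chosen with the text in mind.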

coventry · 2017-06-07T03:47:10Z

Why not just run the document through OCR and publish that?

Frankkkkk · 2017-06-07T08:19:51Z

Wouldn't you lose graphics?

ghost · 2017-06-07T16:54:18Z

Erosion and dilation might work, but might also change the appearance of the text, depending upon the dot size. The simple black and white conversion is probably a more reliable method since it doesn't depend upon knowing the dot size.

Frankkkkk · 2017-06-07T19:53:55Z

Should we limit this to yellow dots only, or also handle random pixel-encoded messages? In the latter case, erosion-dilation would work. Sure, it would change the appearance of the text a bit, but in my experience it's still very legible.

bill-mcgonigle · 2017-06-10T13:52:08Z

If your purpose is to proactively protect whistleblowers, you cannot assume that tracking dots will always be yellow, or that they will not be added to the low-significance bits of high-entropy areas. This is just a matter of printer firmware revisions, well within the means of wealthy interests.
Erosion-dilation would work, but you can also apply higher-order statistical models to detect steganographic information hiding. See the steg sections here (one has code):
http://www.cs.dartmouth.edu/farid/#jumpTo

jbolger · 2017-06-10T17:46:19Z

I believe this pull request is confusing the purpose of PDF Redact Tools. The purpose is to sanitize PDF files so journalists can view their contents while minimizing the risk of compromise to their computer. The purpose is not to obfuscate the source of the PDF; that is outside the scope of PDF Redact Tools.

While I believe the goal this pull request tries to accomplish is very important, I feel like it is out of place in PDF Redact Tools. This commit supposedly counters one of the known ways documents may be visually tagged, but there are an infinite number of other tagging techniques which this commit will not address. PDF Redact Tools was never meant to solve this problem, and therefore this problem should be solved by a dedicated tool that was designed to address this problem. Bloating PDF Redact Tools will lower the quality of the tool and exacerbate source exposure problems like these.

PDF Redact Tools was not meant to be the final step before publishing, it was meant to be the first step before reporting.

This feature, to me, sounds like one of dozens that could potentially belong to a new tool which addresses this problem from the start. PDF Redact Tools does its job well, we shouldn't cloud its mission - and we don't want to give journalists a watered-down version of source obfuscation if a better tool can be made for it.

ajkblue · 2017-06-14T16:38:12Z

@jbolger

"PDF Redact Tools was not meant to be the final step before publishing, it was meant to be the first step before reporting."

From the README:

"PDF Redact Tools helps with securely redacting and stripping metadata from documents before publishing."

It seems to me that this fits into this project. Not only that, but while this may not be 100% effective or guarantee that every tracking dot is removed from the source, at least it's there as an option. If you really think that this shouldn't be added by default, then an easier way to include it as a useful feature could be to add a new command-line flag for it, e.g. --remove-dots or something along those lines.

I see your point and where you're coming from. Nothing is perfect, but this is at least a start. Nothing like this seems to exist on GitHub, so I believe that this is a good feature addition to PDF Redact Tools. Unless a new tool is going to be started that implements this feature, adding it to PDF Redact Tools at least gives journalists the option of using it. Again, maybe not as the default if that might create a false sense of security, but at least it's there.

micahflee · 2017-06-19T23:39:43Z

This is an interesting pull request, thanks for submitting it!

I agree with @jbolger that there are infinite ways in which printers can hide metadata in files they print (or for people to hide any data within arbitrary images), and that PDF Redact Tools can't hope to -- and shouldn't try to -- prevent all of them.

However, I think it's reasonable to specifically protect against printer dots because they're so ubiquitous, and likely on every piece of paper that has something printed on it. According to EFF's (no-longer-updated) list of printers that include tracking dots, "Some of the documents that we previously received through FOIA suggested that all major manufacturers of color laser printers entered a secret agreement with governments to ensure that the output of those printers is forensically traceable."

I've tested and confirmed that this does seem to work and remove printer dots from a scanned document that had them included. It also resulted in a much smaller filesize, which is nice.

However, unfortunately it reduces the quality of the resulting PDF by a lot (probably because it's using threshold to convert to black and white, rather than just grayscale, which wouldn't remove the printer dots). And, of course, it loses color, which might be important to some docs.

So while it's imperfect, I'm into merging this. I'm also open to merging a new PR, if anyone can come up with a way to remove printer dots without reducing so much quality, and ideally even without removing color.
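
The difference between thresholding and plain grayscale mentioned above can be sketched in a few lines of Python. This is only an illustration under assumed values: the pixel colours, the 50% threshold and the function names are made up for the example, and the luma weights are the common ITU-R BT.601 ones, not necessarily what the merged code uses.

```python
def to_gray(rgb):
    """Approximate luma (ITU-R BT.601 weights), rounded to an integer in 0..255."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def to_black_and_white(rgb, threshold=128):
    """Map a pixel to pure white (255) or pure black (0); the threshold is an assumption."""
    return 255 if to_gray(rgb) >= threshold else 0


# Illustrative pixel values (assumed, not measured from a real scan).
PAPER = (255, 255, 255)
DOT = (255, 255, 200)    # faint yellow tracking dot
TEXT = (20, 20, 20)

for name, pixel in (("paper", PAPER), ("dot", DOT), ("text", TEXT)):
    print(name, "gray:", to_gray(pixel), "b/w:", to_black_and_white(pixel))
```

In grayscale the dot is still slightly darker than the paper, so it remains detectable. After thresholding, the dot and the paper both become pure white, while every intermediate shade (such as anti-aliased text edges) is forced to black or white, which is consistent with the quality loss described above.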

timojuez · 2017-09-22T09:48:08Z

Hi folks,

From ongoing research, I can tell you that all the printers that produce yellow dots use colours between the HSV values (28,10,214) and (48,67,255).
A picture converted to black and white may still contain the dots, depending on the algorithm that maps HSV to black and white. I would suggest replacing spots in the named HSV range with the paper's average white.
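
A minimal sketch of this suggestion in Python, using only the standard library. Several things here are assumptions rather than facts from the comment: the HSV triples are interpreted with OpenCV-style scaling (H in 0..179, S and V in 0..255), the paper colour is estimated as the mean of near-white pixels outside the dot range, and the cut-off of 200 for "near-white" and all function names are made up for the example.

```python
import colorsys

# Range quoted in the comment above. ASSUMPTION: OpenCV-style scaling,
# i.e. H in 0..179 (degrees / 2), S and V in 0..255.
LOWER = (28, 10, 214)
UPPER = (48, 67, 255)


def to_hsv(rgb):
    """Convert an 8-bit RGB pixel to HSV using the assumed scaling."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return (h * 180.0, s * 255.0, v * 255.0)


def is_dot_colour(rgb):
    """True if the pixel falls inside the quoted HSV range."""
    return all(lo <= c <= hi for c, lo, hi in zip(to_hsv(rgb), LOWER, UPPER))


def estimate_paper_white(image):
    """Mean colour of near-white pixels outside the dot range (crude, assumed heuristic)."""
    pixels = [p for row in image for p in row
              if min(p) >= 200 and not is_dot_colour(p)]
    if not pixels:
        return (255, 255, 255)  # fall back to pure white
    return tuple(round(sum(p[i] for p in pixels) / len(pixels)) for i in range(3))


def remove_dots(image):
    """Replace every pixel in the dot range with the estimated paper colour."""
    paper = estimate_paper_white(image)
    return [[paper if is_dot_colour(p) else p for p in row] for row in image]


# Illustrative pixel values (assumed, not measured from a real scan).
PAPER = (250, 250, 245)
DOT = (255, 255, 200)
TEXT = (20, 20, 20)

patch = [
    [PAPER, TEXT, PAPER],
    [PAPER, DOT, PAPER],
    [PAPER, PAPER, PAPER],
]

cleaned = remove_dots(patch)
for row in cleaned:
    print(row)
```

Unlike a global black-and-white conversion, this approach leaves text and other colours untouched and only rewrites pixels inside the suspected range. It would not help against dots printed in a different colour.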

Add optional switch to convert the document to black and white to remove printer dots (commit e5e110f)

gszathmari mentioned this pull request Jun 6, 2017

Convert PDFs to black and white to remove printer dots #22

Closed

micahflee merged commit e5e110f into firstlookmedia:master Jun 19, 2017
