[SVCS-52] Adding support for mutli-page tiff files. #275

AddisonSchiller · 2017-08-31T16:17:24Z

Adding more to the pdf renderer.
Tiff files are now rendered by the pdf renderer, and the pdf extension now has an exporter to handle tif/tiff files. Tiff files are converted to pdfs for viewing.

refs: https://openscience.atlassian.net/browse/SVCS-52

Purpose

Add support for multipage tiff files.

Summary of changes

Tiff files are similar to pdfs in that they can hold multiple pages, and they can be converted to pdfs.
The pdf extension now handles tiff files and has an exporter for them.
Reportlab has been added as a dependency.
Some multipage tiffs will not render correctly. See the testing notes.

Testing QA notes:

Below is a copy and paste of my testing notes from the Jira ticket. You can find the zip file there.

Testing Notes:

Attached to the jira ticket is a zip containing some tif/tiff files to use for testing.

Because of what this commit changes, the following should also be tested:
Other pdf files. Try a few regular pdf files and make sure they render properly.
Any files rendered by the unoconv renderer (e.g. docx). The unoconv renderer ships files to the pdf renderer.

cslzchen

First pass done 🎆 🎆 . Here is the checklist.

cslzchen · 2017-10-11T16:22:26Z

mfr/extensions/pdf/__init__.py

@@ -1 +1,2 @@
 from .render import PdfRenderer  # noqa
+from .export import PdfExporter # noqa


Use absolute import

Remove the # noqa if possible (if tests pass)

Leaving this like we talked about on other issue

cslzchen · 2017-10-11T16:28:15Z

mfr/extensions/pdf/export.py

+from reportlab.pdfgen import canvas
+
+from mfr.core import extension
+from mfr.extensions.pdf import exceptions


import os
import imghdr

from PIL import Image
from reportlab.pdfgen import canvas

from mfr.core import extension
from mfr.extensions.pdf import exceptions as pdf_exceptions

imghdr is a standard-library module, so I'm leaving it with os and moving the others.

cslzchen · 2017-10-11T16:29:25Z

mfr/extensions/pdf/export.py

+        self.metrics.add('pil_version', Image.VERSION)
+
+    def tiff_to_pdf(self, tiff_img, max_size):
+        max_pages = 40


Can we define max_pages somewhere in settings instead of here as a number literal?

cslzchen · 2017-10-11T16:31:42Z

mfr/extensions/pdf/export.py

+    def tiff_to_pdf(self, tiff_img, max_size):
+        max_pages = 40
+        width, height = tiff_img.size
+        c = canvas.Canvas(self.output_file_path)


Is the file at output_file_path guaranteed to be a valid one? If not, we should try and except.

There should be no file at output_file_path. That's the path MFR wants the exported file to be saved to.

Ah, I see. I thought a temp file was created.

cslzchen · 2017-10-11T16:39:35Z

mfr/extensions/pdf/export.py

+        page = 0
+
+        # This seems to be the only way to write this loop at the moment
+        while True:


If you want to avoid while True, you can let while handle the break:

while page < max_pages:
    ...
    page += 1
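As a rough sketch of the bounded loop suggested here (not the PR's actual code): `FakeMultiPageImage` and `collect_pages` are hypothetical names, and the stand-in class only mimics Pillow's behaviour of raising `EOFError` when `seek()` goes past the last frame.

```python
class FakeMultiPageImage:
    """Hypothetical stand-in for a multi-frame image.

    Like a Pillow image, seek() raises EOFError past the last frame.
    """

    def __init__(self, n_frames):
        self.n_frames = n_frames
        self.current = 0

    def seek(self, frame):
        if frame >= self.n_frames:
            raise EOFError('no more frames')
        self.current = frame


def collect_pages(image, max_pages):
    """Visit frames in order, stopping at the last frame or at max_pages."""
    pages = []
    page = 0
    while page < max_pages:
        try:
            image.seek(page)
        except EOFError:
            # Ran out of frames before hitting the page cap.
            break
        pages.append(image.current)
        page += 1
    return pages
```

The loop condition enforces the page cap, so the only explicit break left is the end-of-file case.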

cslzchen · 2017-10-11T18:24:10Z

mfr/extensions/pdf/export.py

+            self.tiff_to_pdf(image, max_size)
+            image.close()
+
+        except (UnicodeDecodeError, IOError) as err:


The code and exception messages look similar. Why not combine the two except blocks into one? (A question, not a suggestion.)

The ValueError case is a bug in our current version of pillow, which has trouble opening certain formats of .tiff. Once pillow is updated to 4.3 or higher, this last error section can go away. (Tested now and this is indeed the case: broken.tiff will open on pillow 4.3, but not on the 2.8.3 that we currently have.) This is why the errors were separate. The tiff given was valid; it was just pillow messing up.

cslzchen · 2017-10-11T18:29:10Z

mfr/extensions/pdf/export.py

+                export_format=type,
+                detected_format=imghdr.what(self.source_file_path),
+                original_exception=err,
+                code=400,


Given this is a new file, let's use HTTPStatus (LINK).

from http import HTTPStatus

code=HTTPStatus.BAD_REQUEST
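A minimal illustration of why the swap is safe, assuming a hypothetical `ExampleExportError` class (not one of MFR's exception classes): `HTTPStatus` members are integer enums, so they compare equal to the literal codes they replace.

```python
from http import HTTPStatus


class ExampleExportError(Exception):
    """Hypothetical exception that carries an HTTP status code."""

    def __init__(self, message, code=HTTPStatus.BAD_REQUEST):
        super().__init__(message)
        self.code = code


err = ExampleExportError('Unable to export the file')
# The enum member still behaves like the integer 400.
is_bad_request = err.code == 400
```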

cslzchen · 2017-10-11T18:37:19Z

mfr/extensions/pdf/render.py

+        elif settings.EXPORT_TYPE:
+            exported_url.args['format'] = settings.EXPORT_TYPE
+        else:
+            return self.TEMPLATE.render(base=self.assets_url, url=self.metadata.download_url)


What are the conditions for reaching this branch? Is this unexpected?

cslzchen · 2017-10-11T18:38:52Z

mfr/extensions/pdf/render.py

+            return self.TEMPLATE.render(base=self.assets_url, url=self.metadata.download_url)
+
+        exported_url = furl.furl(self.export_url)
+        if settings.EXPORT_MAXIMUM_SIZE and settings.EXPORT_TYPE:


This reduces the number of boolean checks from 3 to 2.

if settings.EXPORT_TYPE:
    if settings.EXPORT_MAXIMUM_SIZE:
        exported_url.args['format'] = '{}.{}'.format(settings.EXPORT_MAXIMUM_SIZE, settings.EXPORT_TYPE)
    else:
        exported_url.args['format'] = settings.EXPORT_TYPE
    self.metrics.add('needs_export', True)
    return self.TEMPLATE.render(base=self.assets_url, url=exported_url.url)

# TODO: is this an unexpected case?
return self.TEMPLATE.render(base=self.assets_url, url=self.metadata.download_url)

This was mostly copied from the image renderer. I assume that the TODO case here is when the settings are not properly defined. Going with your approach here but leaving the unexpected case in for completion etc.

Agreed. Similarly, update image renderer if possible.

cslzchen · 2017-10-11T18:46:20Z

mfr/extensions/pdf/exceptions.py

+from mfr.core.exceptions import ExporterError
+
+
+class PillowImageError(ExporterError):


Seems duplicated, but I guess you have a good reason for not using mfr.extensions.image.exceptions.PillowImageError?

It is duplicated, but it doesn't make sense to import an exception from a different extension (for the most part, extensions don't interact much). That's why it's copied this way. Thoughts?

I see what you mean and I agree. Let's leave this here for @felliott in Phase 2 Review.

I'm okay with duplicating this, but the docstrings and __TYPE should be consistent with the new location. __TYPE should probably be pdf_pillow.

cslzchen

Looks good 🎆 and move to PCR for @felliott .

felliott

Just a few issues to address.

The Pillow docs (https://github.com/python-pillow/Pillow/blob/2.8.x/docs/handbook/image-file-formats.rst#tiff) mention that if libtiff is installed, PIL can read a bunch more types of TIF files. Is libtiff already installed by our Docker container? If not, could it be, and what would be an example of a file that would be supported by adding it?

Cheers,
@felliott

felliott · 2017-11-14T23:04:16Z

mfr/extensions/pdf/settings.py

+EXPORT_TYPE = config.get('EXPORT_TYPE', 'pdf')
+EXPORT_MAXIMUM_SIZE = config.get('EXPORT_MAXIMUM_SIZE', '1200x1200')
+EXPORT_EXCLUSIONS = config.get('EXPORT_EXCLUSIONS', ['.pdf', ])
+EXPORT_MAX_PAGES = config.get('EXPORT_MAX_PAGES', 40)


If set via an environment variable, this will be a string, not an int. Need to add an explicit cast. There might be a method in the config object to do this for you.

felliott · 2017-11-14T23:05:24Z

mfr/extensions/pdf/settings.py

+
+EXPORT_TYPE = config.get('EXPORT_TYPE', 'pdf')
+EXPORT_MAXIMUM_SIZE = config.get('EXPORT_MAXIMUM_SIZE', '1200x1200')
+EXPORT_EXCLUSIONS = config.get('EXPORT_EXCLUSIONS', ['.pdf', ])


Config vars from the environment are strings. The config object should have a method to handle this.

The image settings are done the same way (this is mostly copied and pasted from the image settings). I don't see a method in mfr.settings to handle lists.
Should I be feeding this a string and making the list myself in the renderer or something like that?

Also, should the image extension be changed as well?

Other small note: even when the list arrives as a string, a check for '.pdf' will still find it in that string and exclude it. I'm not sure if it's intended to work that way, but that must be why it works for the image settings even though it seems like it shouldn't.

As mentioned above, I'm open to any fix for this.

Blarf, sorry I was misremembering. The similar example I was thinking of was here. We support passing a list via the env by requiring it to be a space-separated string that gets split upon construction. And yes, the image renderer settings should be doing it this way, but we never bothered to audit the settings when we set it up. There are probably lots of settings that need this, so we'll create a separate janitorial ticket to audit and clean these up.

Done, and ticket made.
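A minimal sketch of the casting discussed in this thread, assuming a plain dict stands in for the config object; `parse_int_setting` and `parse_list_setting` are hypothetical helpers, not methods of MFR's settings classes. It also shows why the substring check on a stringified list only works by accident.

```python
def parse_int_setting(config, key, default):
    """Return the setting as an int, whether it came from the env (str) or a default."""
    return int(config.get(key, default))


def parse_list_setting(config, key, default):
    """Return the setting as a list; env values are space-separated strings."""
    value = config.get(key, default)
    if isinstance(value, str):
        return value.split()
    return list(value)


# Values as they would arrive from environment variables: always strings.
env_config = {'EXPORT_MAX_PAGES': '40', 'EXPORT_EXCLUSIONS': '.pdf .ps'}

max_pages = parse_int_setting(env_config, 'EXPORT_MAX_PAGES', 40)
exclusions = parse_list_setting(env_config, 'EXPORT_EXCLUSIONS', ['.pdf'])

# Membership on a stringified list is a substring test, so partial
# extensions match too, which a real list would reject.
accidental_match = '.pd' in str(['.pdf'])
real_match = '.pd' in ['.pdf']
```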

felliott · 2017-11-14T23:06:18Z

tests/extensions/pdf/test_exporter.py

+
+from mfr.extensions.pdf import settings
+from mfr.extensions.pdf import exceptions
+from mfr.extensions.pdf import PdfExporter


Minor: combine these into one.

felliott · 2017-11-14T23:07:36Z

tests/extensions/pdf/test_renderer.py



 @pytest.fixture
 def metadata():
-    return ProviderMetadata('test', '.pdf', 'text/plain', '1234', 'http://wb.osf.io/file/test.pdf?token=1234')
+    return ProviderMetadata('test', '.pdf', 'text/plain', '1234',
+                        'http://wb.osf.io/file/test.pdf?token=1234')


Nitpick: your indentation is a little weird here.

felliott · 2017-11-14T23:17:40Z

mfr/extensions/pdf/exceptions.py

+    and relating to the Pillow Library should inherit from PillowImageError
+    """
+
+    __TYPE = 'image_pillow'


To distinguish this from image exceptions in the metrics, this should be named pdf_pillow.

felliott · 2017-11-15T20:39:36Z

mfr/extensions/pdf/settings.py

+EXPORT_TYPE = config.get('EXPORT_TYPE', 'pdf')
+EXPORT_MAXIMUM_SIZE = config.get('EXPORT_MAXIMUM_SIZE', '1200x1200')
+EXPORT_EXCLUSIONS = config.get('EXPORT_EXCLUSIONS', ['.pdf', ])
+EXPORT_MAX_PAGES = config.get('EXPORT_MAX_PAGES', 40)


The above two config settings will have problems if set from the env. Envvars are always strings; they won't be ints or lists. I believe there are some helper methods in the config base class to cast these properly.

felliott · 2017-11-15T20:40:12Z

tests/extensions/pdf/test_exporter.py

+
+from mfr.extensions.pdf import settings
+from mfr.extensions.pdf import exceptions
+from mfr.extensions.pdf import PdfExporter


The above could be collapsed.

felliott · 2017-11-15T20:43:58Z

tests/extensions/pdf/test_exporter.py

+
+class TestPdfExporter:
+    '''
+    Opening and veryifying pdfs with report lab is a paid feature


Nitpick: "verifying". Also, "report lab" could probably use a little more context here, it's only mentioned/used in requirements.txt and the renderer file.

reworded/fixed typos/added new typos
complete

felliott · 2017-11-15T20:46:37Z

tests/extensions/pdf/test_exporter.py

+    '''
+    Opening and veryifying pdfs with report lab is a paid feature
+    so these tests are mostly just to see if the file gets created
+    in the correct place


Is there anything else in the stdlib that can assert that the output is actually a pdf? If not, no worries; it would just be good to assert that a pdf is produced, rather than e.g. an HTML error message.

Followup: it looks like PDFs have a signature: https://en.wikipedia.org/wiki/List_of_file_signatures

This is great, implementing

I found some libraries that use system libraries to check things (libmagic used by python-magic) etc. They would require adding things to the docker file or something like that. I just went with opening the file, and pulling the signature out directly and looking at it. Slightly crude but works fine.
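The signature check described above could look roughly like this. `looks_like_pdf` is a hypothetical helper name, not necessarily what the test suite uses; it relies only on the fact that PDF files begin with the bytes `%PDF-`.

```python
PDF_SIGNATURE = b'%PDF-'


def looks_like_pdf(path):
    """Return True if the file at path starts with the PDF signature bytes."""
    with open(path, 'rb') as file_pointer:
        return file_pointer.read(len(PDF_SIGNATURE)) == PDF_SIGNATURE
```

This only confirms the header, not that the rest of the file is a well-formed pdf, but it is enough to tell an exported pdf apart from an HTML error page.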

felliott · 2017-11-15T20:48:24Z

tests/extensions/pdf/test_exporter.py

+        assert os.path.exists(directory)
+
+        exporter.export()
+        assert os.path.exists(output_file_path)


It looks like the successful tests and the failing tests could be parameterized, since we don't have the ability to do deep examination of the pdfs.

What do you mean by this?

The three passing tests are all very similar, they just take slightly different arguments. You could parameterize them a la this WB s3 test. That way if you update one later, the others get updated, too.

This is great,
completed.

AddisonSchiller · 2017-11-16T18:49:15Z

@felliott
Random comments/CRR

Take a look at the settings questions, not quite sure how you want me to handle that.

On libtiff:
I took a look at it, and the results were inconclusive. All I could find is that some tiff files seem to label things in ways that don't follow the tiff standard. Using libtiff might help with those? I don't think we are using it in our docker container at the moment (it might come auto-installed, though).

I did try to install it and see if it changed the way any of my test tiff files rendered and it did not seem to.

I just couldn't find enough information about it. Some of the docs linked to other libraries I originally looked at for rendering tiffs. If you want, I could definitely spend more time on this.

Also, leaving in additional development while I wait for responses to some of my questions

adding more to pdf renderer

coveralls · 2017-11-16T19:17:53Z

Coverage increased (+1.8%) to 69.816% when pulling 114c492 on AddisonSchiller:feature/multipage-tif into 8bb2dd4 on CenterForOpenScience:develop.

felliott · 2017-11-16T20:18:51Z

I've worked a bunch with .tiff files in the past. The tiff format is more of a container format than an image compression format. It permits lots of weird formats and extensions, and it's a pretty popular format in science for that reason. I'm not usually a fan of adding dependencies speculatively, but I would make an exception for tiffs. It would be good to add a comment explaining why it's in there.

AddisonSchiller · 2017-11-16T20:59:32Z

Did some more research. It looks like libtiff5 comes default in Jessie. I read some things about the correct version of libtiff not being used. It shouldn't hurt to install python-libtiff and let pillow go crazy with it. So adding that to the PR as suggested.

Adding tests for pdf rendering and tif exporting

coveralls · 2017-11-16T21:28:56Z

Coverage increased (+1.8%) to 69.816% when pulling f0b596a on AddisonSchiller:feature/multipage-tif into 8bb2dd4 on CenterForOpenScience:develop.

coveralls · 2017-11-17T19:37:46Z

Coverage increased (+1.8%) to 69.816% when pulling a52a515 on AddisonSchiller:feature/multipage-tif into 8bb2dd4 on CenterForOpenScience:develop.

felliott

To do:

need to remove python-libtiff from requirements
add libtiff5-dev to Dockerfile
verify that Pillow libtiff support inside the container works.

coveralls · 2017-11-20T23:09:40Z

Coverage increased (+1.7%) to 69.72% when pulling d254a1d on AddisonSchiller:feature/multipage-tif into 8bb2dd4 on CenterForOpenScience:develop.

Multipage-tiff fix on resize Clean up and small fixes

coveralls · 2017-11-21T15:38:52Z

Coverage increased (+1.8%) to 69.865% when pulling e05c0fa on AddisonSchiller:feature/multipage-tif into 8bb2dd4 on CenterForOpenScience:develop.

coveralls · 2017-11-21T15:41:51Z

Coverage increased (+1.8%) to 69.865% when pulling e05c0fa on AddisonSchiller:feature/multipage-tif into 8bb2dd4 on CenterForOpenScience:develop.

AddisonSchiller · 2017-11-21T15:42:32Z

Latest additions:
Confirmed (with @felliott) that libtiff5-dev is working in the docker container
Added libtiff5-dev to the Dockerfile
Removed the unused python-libtiff requirement

Turning on libtiff use in pillow fixed color issues on some tiffs
Fixed resizing on large multipage tiffs (should now display all pages)
Lots of refactoring/changes from the first few versions.
Pillow is upgraded to 4.3.0 on this ticket. Standalone it will work with tiffs, but will break the image renderer. There is another ticket I have out that should be merged before this one that will fix what pillow 4.3.0 breaks in MFR.

Changed the resampling function used on resizing (the one used by the image exporter was corrupting some pages of resized pdfs)

Also rewrote the QA notes to remove mentions of old broken features. Things should just work now without complications (fingers crossed)

felliott · 2017-12-04T18:42:56Z

TerTIFFic! Moving to RTM

[SVCS-52] Closes: #275

AddisonSchiller changed the title ~~Adding support for mutli-page tiff files.~~ [SVCS-52]Adding support for mutli-page tiff files. Oct 5, 2017

AddisonSchiller changed the title ~~[SVCS-52]Adding support for mutli-page tiff files.~~ [SVCS-52] Adding support for mutli-page tiff files. Oct 5, 2017

AddisonSchiller force-pushed the feature/multipage-tif branch from 5fd8ff7 to fbe435a Compare October 11, 2017 15:40

cslzchen requested changes Oct 11, 2017

cslzchen added the Code Review label Oct 12, 2017

cslzchen approved these changes Oct 25, 2017

cslzchen added Final Review and removed Code Review labels Oct 25, 2017

felliott requested changes Nov 15, 2017

AddisonSchiller force-pushed the feature/multipage-tif branch from 0cded16 to 1683656 Compare November 16, 2017 18:35

Adding support for mutli-page tiff files.

24f16d8

adding more to pdf renderer

AddisonSchiller force-pushed the feature/multipage-tif branch from 1d9e73f to 114c492 Compare November 16, 2017 19:05

Style changes and refactoring from review

f0b596a

Adding tests for pdf rendering and tif exporting

AddisonSchiller force-pushed the feature/multipage-tif branch from 36c3465 to f0b596a Compare November 16, 2017 21:02

felliott requested changes Nov 17, 2017

libtiff5 support

e05c0fa

Multipage-tiff fix on resize Clean up and small fixes

AddisonSchiller force-pushed the feature/multipage-tif branch from bed7e7c to e05c0fa Compare November 21, 2017 15:38

felliott added Ready To Merge and removed Final Review labels Dec 4, 2017

felliott added a commit that referenced this pull request Dec 4, 2017

Merge branch 'feature/multipage-tiff' into next

1a1830c

[SVCS-52] Closes: #275

felliott closed this in 058ef2a Jan 23, 2018
