Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SVCS-52] Adding support for mutli-page tiff files. #275

Conversation

AddisonSchiller
Copy link
Contributor

@AddisonSchiller AddisonSchiller commented Aug 31, 2017

adding more to pdf renderer
tiff files are now rendered by the pdf renderer, and the pdf extension now has an exporter to handle tif/tiff files. Tiff files are now converted to pdfs for viewing

refs: https://openscience.atlassian.net/browse/SVCS-52

Purpose

Add support for mutlipage tiff files.

Summary of changes

Tiff files are sort of like pdfs, and can be converted to them.
The pdf extension now handles tiff files. The pdf extension now has an exporter for tiff files
Reportlab has been added as a dependency.
Some multipage tiffs will not render correctly. see testing notes.

Testing QA notes:

below is a copy and paste of my testing notes on the jira ticket. you can find the zip file over there

Testing Notes:

Attached to the jira ticket is a zip containing some tif/tiff files to use for testing.

Due to the aspects of what this commit is doing, the following should also be tested:
Other pdf files. Try a few regular pdf files and make sure they render properly.
Any files rendered by the unoconv renderer. (ie docx) The unoconv renderer ships files to the pdf renderer.

@AddisonSchiller AddisonSchiller changed the title Adding support for mutli-page tiff files. [SVCS-52]Adding support for mutli-page tiff files. Oct 5, 2017
@AddisonSchiller AddisonSchiller changed the title [SVCS-52]Adding support for mutli-page tiff files. [SVCS-52] Adding support for mutli-page tiff files. Oct 5, 2017
Copy link
Contributor

@cslzchen cslzchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass done 🎆 🎆 . Here is the checklist.

  • Why do we need a new PillowImageError?
  • Improve control flow: while, if, try
  • Style
    • imports
    • comments consistency
    • avoid using number or string literals

@@ -1 +1,2 @@
from .render import PdfRenderer # noqa
from .export import PdfExporter # noqa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Use absolute import
  2. Remove the # noqa if possible (if tests pass)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this like we talked about on other issue

from reportlab.pdfgen import canvas

from mfr.core import extension
from mfr.extensions.pdf import exceptions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import os

import imghdr
from PIL import Image
from reportlab.pdfgen import canvas

from mfr.core import extension
from mfr.extensions.pdf import exceptions as pdf_exceptions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imghdr is a base lib, so leaving it with os, moving others.

self.metrics.add('pil_version', Image.VERSION)

def tiff_to_pdf(self, tiff_img, max_size):
max_pages = 40
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we define max_pages somewhere in settings instead of here as a number literal?

def tiff_to_pdf(self, tiff_img, max_size):
max_pages = 40
width, height = tiff_img.size
c = canvas.Canvas(self.output_file_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the file at output_file_path guaranteed to be a valid one? If not, we should try and except.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be no file at output_file_path. Thats the path MFR wants the exported file to be saved to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. I thought there is a temp file created.

page = 0

# This seems to be the only way to write this loop at the moment
while True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to avoid while True, you can let while handle the break:

while page < max_pages:
    ...
    page +=1

self.tiff_to_pdf(image, max_size)
image.close()

except (UnicodeDecodeError, IOError) as err:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code and exception message looks similar. Why not combining the two except in one? (A question not a suggestion.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ValueError case is a bug with our current version of pillow having trouble opening certain formats of .tiff. Once pillow is updated to 4.3 or higher, this last error section can go away. (tested now and this is indeed the case. broken.tiff will open on pillow 4.3, but not 2.8.3 that we currently have. This is why the errors were separate. The tiff given was valid, just pillow messing up.

export_format=type,
detected_format=imghdr.what(self.source_file_path),
original_exception=err,
code=400,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this is a new file, let's use HTTPStatus (LINK).

from http import HTTPStatus
code=HTTPStatus.BAD_REQUEST

elif settings.EXPORT_TYPE:
exported_url.args['format'] = settings.EXPORT_TYPE
else:
return self.TEMPLATE.render(base=self.assets_url, url=self.metadata.download_url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the conditions for reaching this branch? Is this unexpected?

return self.TEMPLATE.render(base=self.assets_url, url=self.metadata.download_url)

exported_url = furl.furl(self.export_url)
if settings.EXPORT_MAXIMUM_SIZE and settings.EXPORT_TYPE:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reduce the # of bool checks from 3 to 2.

if settings.EXPORT_TYPE:
    if settings.EXPORT_MAXIMUM_SIZE
        exported_url.args['format'] = '{}.{}'.format(settings.EXPORT_MAXIMUM_SIZE, settings.EXPORT_TYPE)
    else:
        exported_url.args['format'] = settings.EXPORT_TYPE
    self.metrics.add('needs_export', True)
    return self.TEMPLATE.render(base=self.assets_url, url=exported_url.url)

# TODO: is this an unexpected case?
return self.TEMPLATE.render(base=self.assets_url, url=self.metadata.download_url)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was mostly copied from the image renderer. I assume that the TODO case here is when the settings are not properly defined. Going with your approach here but leaving the unexpected case in for completion etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Similarly, update image renderer if possible.

from mfr.core.exceptions import ExporterError


class PillowImageError(ExporterError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems duplicated, but I guess you have a good reason for not using mfr.extensions.image.exceptions.PillowImageError?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is duplicated, but it doesnt make sense to import an exception from a different extension (for the most part, extensions dont interact much) thats why its copied this way. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean and I agree. Let's leave this here for @felliott in Phase 2 Review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with duplicating this, but the docstrings and __TYPE should be consistent with the new location. __TYPE should probably be pdf_pillow.

Copy link
Contributor

@cslzchen cslzchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 🎆 and move to PCR for @felliott .

Copy link
Member

@felliott felliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few issues to address.

The Pillow docs (https://github.com/python-pillow/Pillow/blob/2.8.x/docs/handbook/image-file-formats.rst#tiff) mention that if libtiff is installed, PIL can read a bunch more types of TIF files. Is libtiff already installed by our Docker container? If not, could it be, and what would be an example of a file that would be supported by adding it?

Cheers,
@felliott

EXPORT_TYPE = config.get('EXPORT_TYPE', 'pdf')
EXPORT_MAXIMUM_SIZE = config.get('EXPORT_MAXIMUM_SIZE', '1200x1200')
EXPORT_EXCLUSIONS = config.get('EXPORT_EXCLUSIONS', ['.pdf', ])
EXPORT_MAX_PAGES = config.get('EXPORT_MAX_PAGES', 40)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If set via an environment variable, this will be a string, not an int. Need to add an explicit cast. There might be a method in the config object to do this for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

complete


EXPORT_TYPE = config.get('EXPORT_TYPE', 'pdf')
EXPORT_MAXIMUM_SIZE = config.get('EXPORT_MAXIMUM_SIZE', '1200x1200')
EXPORT_EXCLUSIONS = config.get('EXPORT_EXCLUSIONS', ['.pdf', ])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config vars from the environment are strings. The config object should have a method to handle this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the image settings is done the same way (its mostly c/p from the image settings). I don't see a method in mfr.settings to handle lists.
Should I be feeding this a string and making the list myself in the renderer or something like that?

also, should the image extension be changed as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other small note: even though the list is turned into a string, it will still find '.pdf' in that string and exclude it. Not sure if its intended to work that way, but that must be why its working for images settings. (this is why it works even though it seems like it shouldn't)

As mentioned above im open to any fix for this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blarf, sorry I was misremembering. The similar example I was thinking of was here. We support passing a list via the env by requiring it to be a space-separated string that gets split upon construction. And yes, the image renderer settings should be doing it this way, but we never bothered to audit the settings when we set it up. There are probably lots of settings that need this, so we'll create a separate janitorial ticket to audit and clean these up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and ticket made.


from mfr.extensions.pdf import settings
from mfr.extensions.pdf import exceptions
from mfr.extensions.pdf import PdfExporter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: combine these into one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

complete



@pytest.fixture
def metadata():
return ProviderMetadata('test', '.pdf', 'text/plain', '1234', 'http://wb.osf.io/file/test.pdf?token=1234')
return ProviderMetadata('test', '.pdf', 'text/plain', '1234',
'http://wb.osf.io/file/test.pdf?token=1234')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: your indentation is a little weird here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

complete

and relating to the Pillow Library should inherit from PillowImageError
"""

__TYPE = 'image_pillow'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To distinguish this from image exceptions in the metrics, this should be named pdf_pillow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

complete

EXPORT_TYPE = config.get('EXPORT_TYPE', 'pdf')
EXPORT_MAXIMUM_SIZE = config.get('EXPORT_MAXIMUM_SIZE', '1200x1200')
EXPORT_EXCLUSIONS = config.get('EXPORT_EXCLUSIONS', ['.pdf', ])
EXPORT_MAX_PAGES = config.get('EXPORT_MAX_PAGES', 40)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above two config settings will have problems if set from the env. Envvars are always strings; they won't be ints or lists. I believe there are some helper methods in the config base class to cast these properly.


from mfr.extensions.pdf import settings
from mfr.extensions.pdf import exceptions
from mfr.extensions.pdf import PdfExporter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above could be collapsed.


class TestPdfExporter:
'''
Opening and veryifying pdfs with report lab is a paid feature
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: "verifying". Also, "report lab" could probably use a little more context here, it's only mentioned/used in requirements.txt and the renderer file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reworded/fixed typos/added new typos
complete

'''
Opening and veryifying pdfs with report lab is a paid feature
so these tests are mostly just to see if the file gets created
in the correct place
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything else in the stdlib that can assert that the output is actually a pdf? If not, no worries, it would just be good to assert that a pdf is produced, rather than e.g. and html error message.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followup: it looks like PDFs have a signature: https://en.wikipedia.org/wiki/List_of_file_signatures

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, implementing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found some libraries that use system libraries to check things (libmagic used by python-magic) etc. They would require adding things to the docker file or something like that. I just went with opening the file, and pulling the signature out directly and looking at it. Slightly crude but works fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

assert os.path.exists(directory)

exporter.export()
assert os.path.exists(output_file_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the successful tests and the failing tests could be parameterized, since we don't have the ability to do deep examination of the pdfs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The three passing tests are all very similar, they just take slightly different arguments. You could parameterize them a la this WB s3 test. That way if you update one later, the others get updated, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great,
completed.

@AddisonSchiller
Copy link
Contributor Author

@felliott
Random comments/CRR

Take a look at the settings questions, not quite sure how you want me to handle that.

On libtiff:
I took a look at it, and it was sort of inconclusive. All i could find is there seem to be some tiff files that label things not to the tiff standard. using libtiff might help with those? I don't think we are using it in our docker container atm (it might come auto installed though.)

I did try to install it and see if it changed the way any of my test tiff files rendered and it did not seem to.

Just couldn't find enough information about it. Some of the docs linked to other libraries i originally looked at for rendering tiffs etc, and such. If you want i could definitely spend more time on this.

Also, leaving in additional development while I wait for responses to some of my questions

adding more to pdf renderer
@coveralls
Copy link

coveralls commented Nov 16, 2017

Coverage Status

Coverage increased (+1.8%) to 69.816% when pulling 114c492 on AddisonSchiller:feature/multipage-tif into 8bb2dd4 on CenterForOpenScience:develop.

@felliott
Copy link
Member

I've worked a bunch with .tiff files in the past. The tiff format is more of a container format than a image compression format. It permits lots of weird formats and extensions and its a pretty popular format in science for that reason. I'm not usually a fan of adding dependencies speculatively, but I would make an exception for tiffs. It would be good to add a comment explaining why it's in there.

@AddisonSchiller
Copy link
Contributor Author

Did some more research. It looks like libtiff5 comes default in Jessie. I read some things about the correct version of libtiff not being used. It shouldn't hurt to install python-libtiff and let pillow go crazy with it. So adding that to the PR as suggested.

Adding tests for pdf rendering and tif exporting
@coveralls
Copy link

coveralls commented Nov 16, 2017

Coverage Status

Coverage increased (+1.8%) to 69.816% when pulling f0b596a on AddisonSchiller:feature/multipage-tif into 8bb2dd4 on CenterForOpenScience:develop.

@coveralls
Copy link

coveralls commented Nov 17, 2017

Coverage Status

Coverage increased (+1.8%) to 69.816% when pulling a52a515 on AddisonSchiller:feature/multipage-tif into 8bb2dd4 on CenterForOpenScience:develop.

Copy link
Member

@felliott felliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To do:

  • need to remove python-libtiff from requirements
  • add libtiff5-dev to Dockerfile
  • verify that Pillow libtiff support inside the container works.

@coveralls
Copy link

coveralls commented Nov 20, 2017

Coverage Status

Coverage increased (+1.7%) to 69.72% when pulling d254a1d on AddisonSchiller:feature/multipage-tif into 8bb2dd4 on CenterForOpenScience:develop.

Multipage-tiff fix on resize
Clean up and small fixes
@coveralls
Copy link

coveralls commented Nov 21, 2017

Coverage Status

Coverage increased (+1.8%) to 69.865% when pulling e05c0fa on AddisonSchiller:feature/multipage-tif into 8bb2dd4 on CenterForOpenScience:develop.

@coveralls
Copy link

coveralls commented Nov 21, 2017

Coverage Status

Coverage increased (+1.8%) to 69.865% when pulling e05c0fa on AddisonSchiller:feature/multipage-tif into 8bb2dd4 on CenterForOpenScience:develop.

@AddisonSchiller
Copy link
Contributor Author

AddisonSchiller commented Nov 21, 2017

Latest additions:
Confirmed (with @felliott ) that libtiff5-dev is working in the docker container
Added libtiff5-dev to the docker-file
Removed unused python libtiff thing

Turning on libtiff use in pillow fixed color issues on some tiffs
fixed resizing on large multipage tiffs (should now display all pages)
lots of refactoring/changes from first few versions.
Pillow is upgraded to 4.3.0 on this ticket. Standalone it will work with tiffs, but will break the image renderer. There is another ticket I have out that should be merged before this one that will fix what pillow 4.3.0 breaks in MFR.

Changed resampling function on resizing (one used by imageexporter was corrupting some pages of resized pdfs)

Also rewrote QA notes to remove mentions to old broken features. Things should just work now without complications (fingers crossed)

@felliott
Copy link
Member

felliott commented Dec 4, 2017

TerTIFFic! Moving to RTM

felliott added a commit that referenced this pull request Dec 4, 2017
@felliott felliott closed this in 058ef2a Jan 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants