New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SVCS-52] Adding support for mutli-page tiff files. #275
[SVCS-52] Adding support for mutli-page tiff files. #275
Conversation
5fd8ff7
to
fbe435a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass done 🎆 🎆 . Here is the checklist.
- Why do we need a new
PillowImageError
? - Improve control flow:
while
,if
,try
- Style
- imports
- comments consistency
- avoid using number or string literals
@@ -1 +1,2 @@ | |||
from .render import PdfRenderer # noqa | |||
from .export import PdfExporter # noqa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Use absolute
import
- Remove the
# noqa
if possible (if tests pass)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving this like we talked about on other issue
mfr/extensions/pdf/export.py
Outdated
from reportlab.pdfgen import canvas | ||
|
||
from mfr.core import extension | ||
from mfr.extensions.pdf import exceptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
import os
import imghdr
from PIL import Image
from reportlab.pdfgen import canvas
from mfr.core import extension
from mfr.extensions.pdf import exceptions as pdf_exceptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imghdr is a base lib, so leaving it with os, moving others.
mfr/extensions/pdf/export.py
Outdated
self.metrics.add('pil_version', Image.VERSION) | ||
|
||
def tiff_to_pdf(self, tiff_img, max_size): | ||
max_pages = 40 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we define max_pages
somewhere in settings instead of here as a number literal?
mfr/extensions/pdf/export.py
Outdated
def tiff_to_pdf(self, tiff_img, max_size): | ||
max_pages = 40 | ||
width, height = tiff_img.size | ||
c = canvas.Canvas(self.output_file_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the file at output_file_path
guaranteed to be a valid one? If not, we should try
and except
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be no file at output_file_path. Thats the path MFR wants the exported file to be saved to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. I thought there is a temp file created.
mfr/extensions/pdf/export.py
Outdated
page = 0 | ||
|
||
# This seems to be the only way to write this loop at the moment | ||
while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to avoid while True
, you can let while
handle the break
:
while page < max_pages:
...
page +=1
self.tiff_to_pdf(image, max_size) | ||
image.close() | ||
|
||
except (UnicodeDecodeError, IOError) as err: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code and exception message looks similar. Why not combining the two except
in one? (A question not a suggestion.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ValueError
case is a bug with our current version of pillow having trouble opening certain formats of .tiff
. Once pillow is updated to 4.3 or higher, this last error section can go away. (tested now and this is indeed the case. broken.tiff
will open on pillow 4.3, but not 2.8.3 that we currently have. This is why the errors were separate. The tiff given was valid, just pillow messing up.
mfr/extensions/pdf/export.py
Outdated
export_format=type, | ||
detected_format=imghdr.what(self.source_file_path), | ||
original_exception=err, | ||
code=400, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this is a new file, let's use HTTPStatus
(LINK).
from http import HTTPStatus
code=HTTPStatus.BAD_REQUEST
mfr/extensions/pdf/render.py
Outdated
elif settings.EXPORT_TYPE: | ||
exported_url.args['format'] = settings.EXPORT_TYPE | ||
else: | ||
return self.TEMPLATE.render(base=self.assets_url, url=self.metadata.download_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the conditions for reaching this branch? Is this unexpected?
mfr/extensions/pdf/render.py
Outdated
return self.TEMPLATE.render(base=self.assets_url, url=self.metadata.download_url) | ||
|
||
exported_url = furl.furl(self.export_url) | ||
if settings.EXPORT_MAXIMUM_SIZE and settings.EXPORT_TYPE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reduce the # of bool
checks from 3 to 2.
if settings.EXPORT_TYPE:
if settings.EXPORT_MAXIMUM_SIZE
exported_url.args['format'] = '{}.{}'.format(settings.EXPORT_MAXIMUM_SIZE, settings.EXPORT_TYPE)
else:
exported_url.args['format'] = settings.EXPORT_TYPE
self.metrics.add('needs_export', True)
return self.TEMPLATE.render(base=self.assets_url, url=exported_url.url)
# TODO: is this an unexpected case?
return self.TEMPLATE.render(base=self.assets_url, url=self.metadata.download_url)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was mostly copied from the image renderer. I assume that the TODO case here is when the settings are not properly defined. Going with your approach here but leaving the unexpected case in for completion etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Similarly, update image renderer if possible.
from mfr.core.exceptions import ExporterError | ||
|
||
|
||
class PillowImageError(ExporterError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems duplicated, but I guess you have a good reason for not using mfr.extensions.image.exceptions.PillowImageError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is duplicated, but it doesnt make sense to import an exception from a different extension (for the most part, extensions dont interact much) thats why its copied this way. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you mean and I agree. Let's leave this here for @felliott in Phase 2 Review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm okay with duplicating this, but the docstrings and __TYPE
should be consistent with the new location. __TYPE
should probably be pdf_pillow
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good 🎆 and move to PCR for @felliott .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few issues to address.
The Pillow docs (https://github.com/python-pillow/Pillow/blob/2.8.x/docs/handbook/image-file-formats.rst#tiff) mention that if libtiff is installed, PIL can read a bunch more types of TIF files. Is libtiff already installed by our Docker container? If not, could it be, and what would be an example of a file that would be supported by adding it?
Cheers,
@felliott
mfr/extensions/pdf/settings.py
Outdated
EXPORT_TYPE = config.get('EXPORT_TYPE', 'pdf') | ||
EXPORT_MAXIMUM_SIZE = config.get('EXPORT_MAXIMUM_SIZE', '1200x1200') | ||
EXPORT_EXCLUSIONS = config.get('EXPORT_EXCLUSIONS', ['.pdf', ]) | ||
EXPORT_MAX_PAGES = config.get('EXPORT_MAX_PAGES', 40) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If set via an environment variable, this will be a string, not an int. Need to add an explicit cast. There might be a method in the config
object to do this for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
complete
mfr/extensions/pdf/settings.py
Outdated
|
||
EXPORT_TYPE = config.get('EXPORT_TYPE', 'pdf') | ||
EXPORT_MAXIMUM_SIZE = config.get('EXPORT_MAXIMUM_SIZE', '1200x1200') | ||
EXPORT_EXCLUSIONS = config.get('EXPORT_EXCLUSIONS', ['.pdf', ]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Config vars from the environment are strings. The config
object should have a method to handle this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the image settings is done the same way (its mostly c/p from the image settings). I don't see a method in mfr.settings to handle lists.
Should I be feeding this a string and making the list myself in the renderer or something like that?
also, should the image extension be changed as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other small note: even though the list is turned into a string, it will still find '.pdf' in that string and exclude it. Not sure if its intended to work that way, but that must be why its working for images settings. (this is why it works even though it seems like it shouldn't)
As mentioned above im open to any fix for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blarf, sorry I was misremembering. The similar example I was thinking of was here. We support passing a list via the env by requiring it to be a space-separated string that gets split upon construction. And yes, the image renderer settings should be doing it this way, but we never bothered to audit the settings when we set it up. There are probably lots of settings that need this, so we'll create a separate janitorial ticket to audit and clean these up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and ticket made.
|
||
from mfr.extensions.pdf import settings | ||
from mfr.extensions.pdf import exceptions | ||
from mfr.extensions.pdf import PdfExporter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: combine these into one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
complete
|
||
|
||
@pytest.fixture | ||
def metadata(): | ||
return ProviderMetadata('test', '.pdf', 'text/plain', '1234', 'http://wb.osf.io/file/test.pdf?token=1234') | ||
return ProviderMetadata('test', '.pdf', 'text/plain', '1234', | ||
'http://wb.osf.io/file/test.pdf?token=1234') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: your indentation is a little weird here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
complete
mfr/extensions/pdf/exceptions.py
Outdated
and relating to the Pillow Library should inherit from PillowImageError | ||
""" | ||
|
||
__TYPE = 'image_pillow' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To distinguish this from image exceptions in the metrics, this should be named pdf_pillow
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
complete
mfr/extensions/pdf/settings.py
Outdated
EXPORT_TYPE = config.get('EXPORT_TYPE', 'pdf') | ||
EXPORT_MAXIMUM_SIZE = config.get('EXPORT_MAXIMUM_SIZE', '1200x1200') | ||
EXPORT_EXCLUSIONS = config.get('EXPORT_EXCLUSIONS', ['.pdf', ]) | ||
EXPORT_MAX_PAGES = config.get('EXPORT_MAX_PAGES', 40) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above two config settings will have problems if set from the env. Envvars are always strings; they won't be ints or lists. I believe there are some helper methods in the config base class to cast these properly.
|
||
from mfr.extensions.pdf import settings | ||
from mfr.extensions.pdf import exceptions | ||
from mfr.extensions.pdf import PdfExporter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above could be collapsed.
|
||
class TestPdfExporter: | ||
''' | ||
Opening and veryifying pdfs with report lab is a paid feature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: "verifying". Also, "report lab" could probably use a little more context here, it's only mentioned/used in requirements.txt and the renderer file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reworded/fixed typos/added new typos
complete
''' | ||
Opening and veryifying pdfs with report lab is a paid feature | ||
so these tests are mostly just to see if the file gets created | ||
in the correct place |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything else in the stdlib that can assert that the output is actually a pdf? If not, no worries, it would just be good to assert that a pdf is produced, rather than e.g. and html error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Followup: it looks like PDFs have a signature: https://en.wikipedia.org/wiki/List_of_file_signatures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, implementing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found some libraries that use system libraries to check things (libmagic used by python-magic) etc. They would require adding things to the docker file or something like that. I just went with opening the file, and pulling the signature out directly and looking at it. Slightly crude but works fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
assert os.path.exists(directory) | ||
|
||
exporter.export() | ||
assert os.path.exists(output_file_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the successful tests and the failing tests could be parameterized, since we don't have the ability to do deep examination of the pdfs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The three passing tests are all very similar, they just take slightly different arguments. You could parameterize them a la this WB s3 test. That way if you update one later, the others get updated, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great,
completed.
0cded16
to
1683656
Compare
@felliott Take a look at the settings questions, not quite sure how you want me to handle that. On libtiff: I did try to install it and see if it changed the way any of my test tiff files rendered and it did not seem to. Just couldn't find enough information about it. Some of the docs linked to other libraries i originally looked at for rendering tiffs etc, and such. If you want i could definitely spend more time on this. Also, leaving in additional development while I wait for responses to some of my questions |
adding more to pdf renderer
1d9e73f
to
114c492
Compare
I've worked a bunch with .tiff files in the past. The tiff format is more of a container format than a image compression format. It permits lots of weird formats and extensions and its a pretty popular format in science for that reason. I'm not usually a fan of adding dependencies speculatively, but I would make an exception for tiffs. It would be good to add a comment explaining why it's in there. |
Did some more research. It looks like libtiff5 comes default in Jessie. I read some things about the correct version of libtiff not being used. It shouldn't hurt to install python-libtiff and let pillow go crazy with it. So adding that to the PR as suggested. |
Adding tests for pdf rendering and tif exporting
36c3465
to
f0b596a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To do:
- need to remove python-libtiff from requirements
- add libtiff5-dev to Dockerfile
- verify that Pillow libtiff support inside the container works.
Multipage-tiff fix on resize Clean up and small fixes
bed7e7c
to
e05c0fa
Compare
Latest additions: Turning on libtiff use in pillow fixed color issues on some tiffs Changed resampling function on resizing (one used by Also rewrote QA notes to remove mentions to old broken features. Things should just work now without complications (fingers crossed) |
TerTIFFic! Moving to RTM |
adding more to pdf renderer
tiff files are now rendered by the pdf renderer, and the pdf extension now has an exporter to handle tif/tiff files. Tiff files are now converted to pdfs for viewing
refs: https://openscience.atlassian.net/browse/SVCS-52
Purpose
Add support for mutlipage tiff files.
Summary of changes
Tiff files are sort of like pdfs, and can be converted to them.
The pdf extension now handles tiff files. The pdf extension now has an exporter for tiff files
Reportlab has been added as a dependency.
Some multipage tiffs will not render correctly. see testing notes.
Testing QA notes:
below is a copy and paste of my testing notes on the jira ticket. you can find the zip file over there
Testing Notes:
Attached to the jira ticket is a zip containing some tif/tiff files to use for testing.
Due to the aspects of what this commit is doing, the following should also be tested:
Other pdf files. Try a few regular pdf files and make sure they render properly.
Any files rendered by the unoconv renderer. (ie docx) The unoconv renderer ships files to the pdf renderer.