**Blocking SVCS-699** [SVCS-678] Refactor and optimize handlers/extensions #333

Closed

Conversation

NyanHelsing
Contributor

@NyanHelsing NyanHelsing commented Apr 19, 2018

What

Improve architecture of handlers and plugins.

  • Plugins receive file stream and metadata as input, and handle what local files are needed internally
  • Plugins run asynchronously, and if they run a subprocess, use asyncio.create_subprocess_exec, so that threads aren't needed.
  • If a plugin has a long running process that can only be designed synchronously, a thread is created to handle only that case
  • Resolving plugin names is generalized, so that only one function is needed to get the name of a plugin
  • Any values needed by a handler are calculated as needed by a @property, so we don't waste cpu time calculating values we won't use. (This should make cache hits faster)
  • The render handler handles the conversion of a requested file render so that file conversion need not occur inside a handler
  • When a file is rendered, the final renderer (the one that renders the file post-conversion) is the only renderer loaded.
  • If a file can be rendered from an export URL alone (the renderer doesn't need the actual file), the export plugin is not loaded at all during the request: the export is deferred until the file stream is awaited inside the renderer.
    • The renderer knows whether it needs the file; if it awaits the file, the conversion happens
    • If the renderer never awaits the file, the exporter is not loaded or run, and the conversion does not happen
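
The deferred-export behavior in the last bullets can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: LazyStream, convert, url_only_renderer, file_renderer, and the example URL are all hypothetical names chosen for the sketch.

```python
import asyncio


class LazyStream:
    """Awaitable wrapper that runs a conversion only when first awaited."""

    def __init__(self, convert):
        self._convert = convert  # hypothetical async conversion callable
        self._result = None
        self.converted = False

    def __await__(self):
        return self._resolve().__await__()

    async def _resolve(self):
        # Run the conversion once and reuse the result afterwards.
        if not self.converted:
            self._result = await self._convert()
            self.converted = True
        return self._result


async def convert():
    # Stand-in for loading and running an export plugin.
    return b'converted bytes'


async def url_only_renderer(stream, export_url):
    # Never awaits the stream, so no conversion happens.
    return '<iframe src="{}"></iframe>'.format(export_url)


async def file_renderer(stream, export_url):
    # Awaiting the stream triggers the conversion.
    data = await stream
    return '<pre>{}</pre>'.format(data.decode())


async def demo():
    skipped = LazyStream(convert)
    await url_only_renderer(skipped, 'https://example.invalid/export')
    used = LazyStream(convert)
    html = await file_renderer(used, 'https://example.invalid/export')
    return skipped.converted, used.converted, html


result = asyncio.run(demo())
```

In this sketch the first renderer leaves its stream untouched, so the stand-in exporter never runs; the second renderer awaits the stream and pays for the conversion.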

Why

A no-op is a simple operation to support, why not do it?

Acceptance criteria

A PR to support

QA notes

Devs will test this. Dev, please add a list of affected and unaffected file types.

Build an MFR export url requesting to export a file of type foo as type foo. This should start a download of the raw file without modification. Exception: image files should respect scaling parameters.

@NyanHelsing NyanHelsing changed the title Ft/noop for export [WIP] [SVCS-678] Ft/noop for export Apr 19, 2018
@NyanHelsing NyanHelsing changed the title [WIP] [SVCS-678] Ft/noop for export [SVCS-678] Ft/noop for export Apr 19, 2018
@coveralls

coveralls commented Apr 19, 2018

Coverage Status

Coverage decreased (-0.2%) to 71.095% when pulling c898d6f on birdbrained:ft/noop-for-export into 3a955ac on CenterForOpenScience:develop.

self.output_wb_path = await self.local_cache_provider.validate_path(
    '/export/{}'.format(self.output_file_id)
)
self.output_file_path = self.output_wb_path.full_path
self.exporter_metrics.merge({
Contributor Author

I need to know more about what these metrics are trying to cover, so I can make sure this doesn't break data we've already collected.

# Execute the extension's export method asynchronously and wait for it
# to finish
loop = asyncio.get_event_loop()
await loop.run_in_executor(None, self.export)
Contributor Author

Make the export object handle its own async issues.

Contributor

Similar to the renderer: are there any side effects of letting the renderer/exporter handle asyncio? Is it possible to move this back to the handler but still keep your code move? I feel more comfortable with the handlers taking care of the asynchronous business all in one place instead of delegating it to the exporter instance.

Contributor Author

I think that would probably be ok.
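
A minimal sketch of the arrangement agreed on in this thread, where the handler owns the async boundary and the exporter stays synchronous. SyncExporter and handle_export are hypothetical names, and the sketch uses asyncio.get_running_loop (Python 3.7+) where the diff uses asyncio.get_event_loop.

```python
import asyncio
import time


class SyncExporter:
    """Hypothetical exporter whose export() is a blocking call."""

    def __init__(self):
        self.done = False

    def export(self):
        time.sleep(0.01)  # stand-in for a blocking conversion
        self.done = True


async def handle_export(exporter):
    # The handler decides how the blocking call is run: here it goes to
    # the default thread pool so the event loop stays responsive.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, exporter.export)
    return exporter.done


exported = asyncio.run(handle_export(SyncExporter()))
```

Keeping run_in_executor in the handler means the exporter classes need no knowledge of the event loop.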

return await self.write_to_stream()

def __del__(self):
    self.output_fp.close()
Contributor Author

I need a better solution for this. It only works because the exporter goes out of scope when the request finishes. It works and will be reliable, but I'd like something a little prettier.

Contributor

I think I mentioned this somewhere above: is this really necessary? Why doesn't a context manager work in your case?

Contributor Author

I think a context manager is a better solution.

    self.exporter_metrics = MetricsRecord('exporter')
    if self._get_module_name():
        self.metrics = self.exporter_metrics.new_subrecord(self._get_module_name())

async def __call__(self):

    self.source_wb_path = await self.local_cache_provider.validate_path(
Contributor Author

I think maybe I'll move this path/file/stream setup stuff to its own method

@@ -72,6 +138,27 @@ def __init__(self, metadata, file_path, url, assets_url, export_url):
except AttributeError:
    pass

async def __call__(self):

    self.renderer_metrics.add('class', self._get_module_name())
Contributor Author

again I need some input on where metrics need to go.

self.metrics.add('source_file.upload.required', False)

loop = asyncio.get_event_loop()
rendition = await loop.run_in_executor(None, self.render)
Contributor Author

make the renderer handle its own business

@@ -63,6 +63,51 @@ def make_exporter(name, source_file_path, output_file_path, format):
}
)

def bind_render(metadata, file_stream, url, assets_url, export_url):
Contributor Author

Note: need to remove the old make_renderer/make_exporter utils in favor of the ones that treat export and render as functions.

self._cache_provider = None

@property
def cache_provider(self):
Contributor Author

Make the cache provider lazy so it's only instantiated if used.
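
The lazy provider pattern described here might look like the sketch below. Handler, make_provider, and the calls list are hypothetical stand-ins; the real handler and provider factory are not shown in this diff fragment.

```python
class Handler:
    """Hypothetical handler that builds its cache provider on first use."""

    def __init__(self, make_provider):
        self._make_provider = make_provider  # hypothetical factory
        self._cache_provider = None

    @property
    def cache_provider(self):
        # Instantiate only when something actually asks for the provider.
        if self._cache_provider is None:
            self._cache_provider = self._make_provider()
        return self._cache_provider


calls = []


def make_provider():
    calls.append('created')
    return object()


handler = Handler(make_provider)
untouched = len(calls)  # nothing has been created yet
first = handler.cache_provider
second = handler.cache_provider
```

A request that never touches cache_provider never pays for constructing it, and repeated accesses reuse the same instance.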

Next, if caching is enabled, try to use a cached version.

Finally, do the actual conversion and export the converted file.
"""
Contributor Author

Focus the logic in the handler so it's clearer what this endpoint does.

)

self.output_file_id = '{}.{}'.format(self.source_file_path.name, self.format)
self.output_file_path = await self.local_cache_provider.validate_path(
Contributor Author

Need to make sure all unnecessary path stuff is removed. Only the remote cache and OSF provider should remain.

await self.write_stream(self.cache_provider.download(self.cache_file_path))
logger.info('Cached file found; Sending downstream [{}]'.format(self.cache_file_path))
self.metrics.add('cache_file.result', 'hit')
return
Contributor Author

clean up the cache try.

test.py Outdated
raise "ERR"

def test():
try:
Contributor Author

oops need to remove.

- # Spin off upload into non-blocking operation
- if renderer.cache_result and settings.CACHE_ENABLED:
+ # Spin off upload of cached render into non-blocking operation
+ if render.cache_result and settings.CACHE_ENABLED:
Contributor Author

this doesn't line up with what the exporter does to cache. Should decide on a similar approach and use in both handlers.


 class BaseExporter(metaclass=abc.ABCMeta):

-    def __init__(self, ext, source_file_path, output_file_path, format):
+    def __init__(self, metadata, input_stream, format):
Contributor Author

Pass the stream in to the exporter so the subclasses' export method can choose to use the stream rather than writing to disk, if that is feasible and appropriate.

@NyanHelsing
Contributor Author

This probably breaks a bunch of tests. I want to make sure we like the implementation before spending time fixing tests.

Contributor

@cslzchen cslzchen left a comment

Thanks for the PR 🎆 and for helping me go through the code. Please refer to the comments for details. Here I'd like to raise/provide two questions/suggestions.

One Big Ticket VS Two Separate Ones

Personally, I really love your change that moves code from the handler to the exporter. However, it is not necessary for the fix. I would recommend having only the fix in one PR (this ticket) and the refactor in another PR (another ticket). Here are my arguments:

  • That makes it easy for us to get the quick fix into develop, and probably into prod for the next release.
  • For the code as it stands, the main logic is good. However, it still needs effort for verification, optimization, rewriting tests, local regression testing, and documentation. To me, this is more of an improvement than a bug fix.

All About Metrics

We should probably have a discussion with the team on how the metrics work and whether there are general rules to follow (I bet there is documentation somewhere).

from mfr.server import settings


class Cacheable(metaclass=abc.ABCMeta):
Contributor

Unless there is a good reason, there is no need for an extra layer. Remove this class and put cache_result back.

@@ -37,61 +37,49 @@ class ExportHandler(core.BaseHandler):
    cache_file_path_str = '/export/{}.{}'.format(self.cache_file_id, self.exporter_name)
else:
    cache_file_path_str = '/export/{}'.format(self.cache_file_id)
self.cache_file_path = await self.cache_provider.validate_path(cache_file_path_str)
Contributor

One side effect of this PR is that cache_file_id and cache_file_path_str are calculated in both the handler and the exporter. I think we should move this to get(), when the cache is enabled.

Contributor Author

This should not be in the exporter. The exporter doesn't do caching; it should only be in the handler.


 async def get(self):
-    """Export a file to the format specified via the associated extension library"""
+    """Export a file to the format specified via the associated extension
Contributor

I think the line length is 100 characters for MFR. Please confirm and update the docstring.

Contributor Author

hm this seems odd.

Finally, do the actual conversion and export the converted file.
"""

self._set_headers()
Contributor

👍 ._set_headers() should be called in the first place and in one place.

# File is already in the requested format
if self.metadata.ext == ".{}".format(self.format):
    await self.write_stream(self.provider.download())
    logger.info('Exported with no conversion.')
Contributor

.debug() is preferred with more detail (e.g. what's the extension and what's the URL)
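
One way to follow this suggestion is sketched below. The logger name, the log_noop_export helper, and the example URL are hypothetical; the in-memory handler exists only so the sketch can show what gets logged.

```python
import io
import logging

logger = logging.getLogger('mfr.example')  # hypothetical logger name
logger.setLevel(logging.DEBUG)

# Capture output in memory so the sketch is self-contained.
buffer = io.StringIO()
logger.addHandler(logging.StreamHandler(buffer))


def log_noop_export(ext, url):
    # %-style arguments are only formatted if DEBUG is actually enabled.
    logger.debug('Exported with no conversion: ext=%s url=%s', ext, url)


log_noop_export('.pdf', 'https://example.invalid/file.pdf')
logged = buffer.getvalue()
```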

}
)

def bind_convert(metadata, file_stream, format):
Contributor

bind_exporter?

Contributor Author

The Exporter constructor is basically binding the arguments to a function, kind of like functools.partial:

>>> def hello(msg):
...     print(msg)
...
>>> bound_fn = bind(hello, "Hello")
>>> bound_fn()
Hello

The export object is a function bound to the context passed to the class constructor.
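
For comparison, the same idea written with the standard library's functools.partial, which plays the role of the generic bind in the snippet above. The export function and its arguments are hypothetical stand-ins for an exporter's context.

```python
import functools


def export(metadata, file_stream, format):
    # Hypothetical stand-in for the work an exporter would do.
    return '{} -> {}'.format(metadata['ext'], format)


# Bind the request context now; call with no arguments later.
bound_export = functools.partial(export, {'ext': '.png'}, b'raw bytes', 'jpeg')
summary = bound_export()
```

The handler can hold bound_export and decide whether and when to call it, without knowing which arguments the exporter needed.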

@abc.abstractmethod
def export(self):
    pass

async def write_to_stream(self):
    self.output_fp = open(self.output_file_path, 'rb')
Contributor

Why not use a context manager?

Contributor Author

This needs to keep the fp alive even after the function returns. The context manager would need to be inside the handler, and this class would need to define __enter__ and __exit__:

with render() as rendered:
    self.write_stream(rendered)
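
A minimal sketch of what that could look like, assuming a hypothetical ExportedFile wrapper in place of the real exporter class. The temporary file only stands in for an exporter's output.

```python
import os
import tempfile


class ExportedFile:
    """Hypothetical wrapper that keeps the output file open for the caller."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def __enter__(self):
        self.fp = open(self.path, 'rb')
        return self.fp

    def __exit__(self, exc_type, exc, tb):
        self.fp.close()
        return False  # do not swallow exceptions


# Create a small file to stand in for the exported output.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'exported')

# The handler owns the `with` block, so the file stays open while the
# response is written and is closed deterministically afterwards.
with ExportedFile(tmp.name) as fp:
    body = fp.read()

closed_after = fp.closed
os.remove(tmp.name)
```

Compared with closing the file in __del__, the close no longer depends on when the exporter object is garbage collected.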

- # Spin off upload into non-blocking operation
- if renderer.cache_result and settings.CACHE_ENABLED:
+ # Spin off upload of cached render into non-blocking operation
+ if render.cache_result and settings.CACHE_ENABLED:
Contributor

Why not use the name renderer? I mean renderer = utils.bind_render()?

Contributor

Before, the flow was: render -> future cache -> write. Now it is render -> write -> future cache? But the rendition is no longer available here.

Contributor Author

render is the action of rendering; renderer is a thing that renders. Here, render is a callable that performs the action of rendering. Can't use renderer.render.

Contributor Author

I'm not sure this order matters. I placed the ensure_future after to help visualize that this is something that temporally happens after writing the response. I do need a tweak here to store the stream in a local variable for the response and the cache.
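
The tweak mentioned here, keeping the rendition in a local variable and scheduling the cache upload after the response is written, could be sketched like this. All function names and the events list are hypothetical; only asyncio.ensure_future comes from the discussion.

```python
import asyncio

events = []


async def write_response(rendition):
    events.append('write')


async def cache_result(rendition):
    await asyncio.sleep(0)  # stand-in for a slow upload
    events.append('cache')


async def render():
    return '<p>rendered</p>'


async def handle(render):
    rendition = await render()  # one local reference for both uses
    await write_response(rendition)
    # Schedule the upload without waiting for it before returning.
    return asyncio.ensure_future(cache_result(rendition))


async def main():
    task = await handle(render)
    before = list(events)
    await task  # awaited here only so the sketch can observe completion
    return before, list(events)


before, after = asyncio.run(main())
```

Because the rendition is held in a local variable, it is still available to the cache upload even though the response has already been written.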

- Move responsibility for setting up local filesystem provider into the
exporter

- Modifies exporters to take a file stream as an input

- Make exporter instances a callable that returns a converted file
stream

- Makes remote cache provider lazily instantiated

- Modify handlers to only prepare what's explicitly necessary to handle
the request.
@NyanHelsing NyanHelsing changed the title [SVCS-678] Ft/noop for export [SVCS-678] Refactor and optimize handlers/extensions May 11, 2018
Changes the jsc3d exporter to run async without needing to create
another thread
- Plugin name determinations combined
- Cache path construction moved to core handler
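
Running an external converter without an extra thread, as the jsc3d commit message above describes, might look roughly like the following. The run_converter helper is hypothetical, and the Python interpreter stands in for the real conversion command, which is not shown in this conversation.

```python
import asyncio
import sys


async def run_converter(*args):
    # Spawn the child process without blocking the event loop or using
    # a thread; the loop is free to serve other requests meanwhile.
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(stderr.decode())
    return stdout


# Stand-in command: the Python interpreter instead of a real converter.
output = asyncio.run(run_converter(sys.executable, '-c', 'print("converted")'))
```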
The unoconv renderer is obsolete because the conversion needed when a
file cannot be rendered using the given renderer now occurs in the
render handler. This allows any filetype to be rendered by any renderer,
provided there is a converter that is capable of converting the filetype
to a filetype that the given renderer can render.
- Two utils, make_renderer and make_exporter, are replaced by new utils
to construct their respective plugins

- Make sure the get_plugin_name uses the group to resolve the correct
name for the plugin.
self._local_cache_provider = waterbutler.core.utils.make_provider(
    'filesystem', {}, {}, settings.LOCAL_CACHE_PROVIDER_SETTINGS
)
return self._local_cache_provider
Contributor Author

needs to be try/except

self._source_file_path = await self.local_cache_provider.validate_path(
    '/render/{}'.format(self.source_file_id)
)
return self._source_file_path
Contributor Author

needs to be try/except

await self.local_cache_provider.upload(
    await self.file_stream,
    await self.source_file_path
)
Contributor Author

should be irrelevant

try:
    remove(self._source_file_path.full_path)
except FileNotFoundError:
    pass
Contributor Author

Should be irrelevant

:rtype: :class:`mfr.core.extension.BaseExporter`
"""
normalized_name = (name and name.lower()) or 'none'
def bind_render(metadata, file_stream, url, assets_url, export_url):
Contributor Author

Needs docstring

# 'ppt': {'renderer': '.pdf', 'format': 'pdf'},
# 'pptx': {'renderer': '.pdf', 'format': 'pdf'},
}


class RenderHandler(core.BaseHandler):
Contributor Author

Needs a docstring

return self._source_stream

@property
def export_url(self):
Contributor Author

Needs a docstring

    map = RENDER_MAP[self.metadata.ext]
except KeyError:
    map = DEFAULT_RENDER
self._source_stream = utils.bind_convert(
Contributor Author

make this a lambda(?) and handle the case in the property so it's not called unless needed

    os.remove(self.source_file_path.full_path)
except FileNotFoundError:
    pass
def cache_result(self, rendition):
Contributor Author

Needs a docstring

setup.py Outdated
'.fodp = mfr.extensions.pdf:PdfRenderer',
'.fods = mfr.extensions.pdf:PdfRenderer',
'.fodt = mfr.extensions.pdf:PdfRenderer',
# '.gif = mfr.extensions.pdf:UnoconvRenderer',
Contributor Author

Fix all of them

@NyanHelsing NyanHelsing changed the title [SVCS-678] Refactor and optimize handlers/extensions **Blocking SVCS-699** [SVCS-678] Refactor and optimize handlers/extensions May 15, 2018
@NyanHelsing NyanHelsing closed this Jun 5, 2018