Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Per-key priorities for dict-like settings by promoting dicts to Settings instances #1149

Merged

Conversation

jdemaeyer
Copy link
Contributor

Expand settings priorities by assigning per-key priorities for the dict-like settings (e.g. DOWNLOADER_MIDDLEWARES), instead of just a single priority for the whole dictionary. This allows updating these settings from multiple locations without having to care (too much) about order. It is a prerequisite for the add-on system (#1272, #591).

There are two main updates:

  1. All default dictionary settings are promoted to the BaseSettings class (formerly Settings). They behave just like dictionaries, but honour per-key priorities when being written to.
  2. Since their rationale becomes obsolete with per-key priorities, all X_BASE settings are deprecated, with default entries now living in the X setting.

And several smaller updates:

  1. Settings is a subclass of BaseSettings. It loads the default settings and promotes dictionaries within them to BaseSettings instances.
  2. BaseSettings has a complete dictionary-like interface.
  3. All dictionary-like component/handler lists allow disabling components/handler by setting the value of a key/value pair to None. A new helper, scrapy.util.without_none_values() was introduced for this. This was previously not supported by FEED_STORAGES, FEED_EXPORTERS, and DEFAULT_REQUEST_HEADERS.
  4. The scrapy.util.build_component_list() helper has been updated according to the deprecation of _BASE settings, as the (base, custom) call signature does not make much sense anymore. It's still backwards-compatible.
  5. ITEM_PIPELINES can no longer be provided as list

Comes with many new/updated tests and documentation.

@jdemaeyer
Copy link
Contributor Author

Implementing per-key priorities like this would render #1110 obsolete

@jdemaeyer
Copy link
Contributor Author

I've resolved issue 1 by providing a BaseSettings class that contains all the methods. Settings is now a subclass which overwrites __init__() and loads the default setting.
Issue 3 is resolved by completely resolving a Settings-setting when the update value is not a mapping, and I've outsourced the duplicate priority string resolving (issue 4).
Issue 5 should probably get its own PR.
@curita I needed to merge master into this since automatic merging was no longer possible at some point. But now I have this huge commit in this PR. Do I wait until we're ready to merge this and then rebase before we do?

@@ -13,7 +13,7 @@ class DownloadHandlers(object):
def __init__(self, crawler):
self._handlers = {}
self._notconfigured = {}
handlers = crawler.settings.get('DOWNLOAD_HANDLERS_BASE')
handlers = crawler.settings.get('DOWNLOAD_HANDLERS_BASE', {})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use crawler.settings.getdict() (which has {} as default already) to get both 'DOWNLOAD_HANDLERS_BASE' and 'DOWNLOAD_HANDLERS'.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not just that it also allows you to pass DOWNLOAD_HANDLERS from command line.

@curita
Copy link
Member

curita commented May 27, 2015

Currently DOWNLOAD_HANDLERS is the only setting that does not use the dictionary values for ordering (but instead uses them to store the path of handler classes) and thus cannot use build_component_list(). However, this may change in the future, maybe most of the code from DownloadHandlers.init() should be outsourced into a new helper function (similar to build_component_list but without ordering) in scrapy.utils.conf?

FEED_STORAGES and FEED_EXPORTERS are dictionaries with paths that don't use ordering too, and those should handle *BASE settings (DEFAULT_REQUEST_HEADERS is a dict as well, but it's used differently). I like the idea of outsourcing the common code from FEED* and DOWNLOAD_HANDLERS loading and build_component_list, and it makes sense to do so in this pr. Particularly I'd like to see a single place handling the BASE setting loading and updating with setting so we can deprecate it sometime in the future more easily. Another thing that's useful for is to ensure that all settings can be disabled by assigning them the value None. Right now I think both FEED* settings don't support this for example (and they should).

@curita I needed to merge master into this since automatic merging was no longer possible at some point. But now I have this huge commit in this PR. Do I wait until we're ready to merge this and then rebase before we do?

Don't worry, just rebase upstream/master when you can (we'll definitely need that before merging but can be done at any time you want) and force push the new git history.

@jdemaeyer
Copy link
Contributor Author

Thank you guys for the feedback!

I now have:

  1. Disabled auto-promotion of dicts, only dicts from default_settings will now be promoted to BaseSettings instances
  2. Updated the SettingsAttribute.priority behaviour for SettingsAttributes that contain a BaseSettings instance. It now always reflects the maximum of:
    • the highest priority in the BaseSettings instance
    • the highest priority ever given to SettingsAttribute.set() (even when that priority is higher than anything in BaseSettings)
  3. Updated BaseSettings.update() so it can also deal with JSON-encoded strings, this solves the issue raised by @nramirezuy and avoids downgrading BaseSettings to dicts (losing all per-key priorities) wherever possible
  4. Deprecated scrapy.utils.conf.build_component_list() since the (base, custom) signature is no longer needed. I did not want to reuse the function name though since people might be using it in their extensions.
  5. As replacement, introduced scrapy.utils.conf.build_components() that only takes a single dict, removes all entries containing None as value, then builds the list or returns the dict, depending on the parameter to_list=True.
  6. Removed all references to _BASE settings (and their merging with the non-_BASE setting) in the complete codebase, instead replacing it with calls to the function scrapy.utils.conf._get_composite_setting(settings, settingname). This function, and the calls to it, should only be in the code until we decide to finally remove all support for the _BASE settings. I guess this is not exactly what @curita had hoped for, since we will still have to replace all calls to _get_composite_setting() when we remove _BASE support, and replace them with a simple settings[settingname]. But the alternative (that I could see) would have been to include settings in the signature of build_components(), which seemed not very modular.
  7. Enabled disabling items through setting their dict value to None for all default dict settings (including the default request headers) by placing calls to build_component_list() and remove_none_values() where appropriate.

To do:

  • Update first post with summary of changes
  • Tests
    • Call Settings.update() with json string
    • build_components()
    • Updating default dicts from the command line
  • Check PEP8 compliancy

@curita curita mentioned this pull request May 29, 2015
@jdemaeyer
Copy link
Contributor Author

Alright, this PR is now at a stage where I'm fairly happy with it and would remove the WIP tag as soon as I've written/updated the documentation and rebased. Still very open for feedback of course :) I've updated the first post as an overview if you haven't followed this PR.

@curita I made some changes to the _map_keys() function that lives inside of build_component_list() so that it honours per-key priorities. As giving both base and custom becomes deprecated with this PR though, and per-key priorities weren't available before anyways, maybe that's not necessary and just clutters up the code?

self.settings_module = settings_module
Settings.__init__(self, **kw)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a reason for swapping these two lines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, took me quite a while to figure out though ;D It became necessary after I moved the dict promotion into __init__(). Settings.__init__() uses __getitem__() during the promotion of default dictionaries to BaseSettings instances, which in turn accesses self.settings_module, so it needs to be defined before calling __init__()

@curita
Copy link
Member

curita commented Jun 11, 2015

I love the unification of the code, really like the design decisions you took there. I pointed out a couple of remaining details but I think the overall functionality is well defined, you should be able to start with the documentation.

@jdemaeyer jdemaeyer force-pushed the enhancement/settings-per-key-priorities branch from 2f288fa to 4192d8b Compare June 19, 2015 13:16
@jdemaeyer
Copy link
Contributor Author

Alright, I think this PR is ready for final review. I've incorporated your recent feedback (nitpicks, BaseSettings handling in build_component_list, remove custom support in build_component_list, remove non-dict ITEM_PIPELINES support), written the documentation and some more tests, rebased into two feature/test/doc-complete commits, and removed the WIP tag. :)

@dangra
Copy link
Member

dangra commented Aug 19, 2015

+1 to merge, It needs a note about backwards incompatibilities introduced by this PR and how to update users code if possible.

@jdemaeyer
Copy link
Contributor Author

+1 to merge, It needs a note about backwards incompatibilities introduced by this PR and how to update users code if possible.

I'm happy to add a note, would that go into news.rst? Except for ITEM_PIPELINES being no longer allowed to be a list, which was deprecated since 0.20, there shouldn't be any backwards incompatibilities though. The BaseSettings behave just like dicts, and the _BASE settings still work (and throw a ScrapyDeprecationWarning if used).

@dangra
Copy link
Member

dangra commented Aug 20, 2015

Removing item pipelines list support is fine and it is not a backward
incompatible change.

The change that worries me a bit is the new behaviour for dictionary
settings whose values are merged instead than replaced

El ago. 20, 2015 7:05, "Jakob de Maeyer" notifications@github.com
escribió:

+1 to merge, It needs a note about backwards incompatibilities introduced
by this PR and how to update users code if possible.

I'm happy to add a note, would that go into news.rst? Except for
ITEM_PIPELINES being no longer allowed to be a list, which was deprecated
since 0.20, there shouldn't be any backwards incompatibilities though.
The BaseSettings behave just like dicts, and the _BASE settings still
work (and throw a ScrapyDeprecationWarning if used).


Reply to this email directly or view it on GitHub
#1149 (comment).

compsett = BaseSettings(self[name + "_BASE"], priority='default')
compsett.update(self[name])
return compsett
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else is not necessary, let's drop it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I've seen it with a little more distance, I think this whole function doesn't really achieve what I had intended.

When users override XY_BASE, they explicitly don't want any of Scrapy's defaults for that component setting. But line 206 pulls Scrapy's defaults (which now live in XY) back in, and even worse overwrites the users XY_BASE settings where they have the same keys (say when the user simply changed some orders).

I guess either line 206 could be changed so that only those keys from XYthat have a priority higher than default are considered, or support for _BASE settings could be dropped altogether with this PR. After all, they have always been marked as "never edit this", the behaviour of the dict-like settings changes with this PR anyways, and we could get rid of this non-public helper function.

@jdemaeyer
Copy link
Contributor Author

Here's a proposed release note:

Dictionary settings are no longer overridden
--------------------------------------------

With this release, Scrapy's dictionary-like settings (e.g. ``ITEM_PIPELINES``,
see full list below) will receive per-key settings priorities (see
:ref:`topics-api-settings`). Internally, this is achieved by promoting them to
:class:`~scrapy.settings.BaseSettings` instances. Instances of this class behave
just like dictionaries, with one notable exception: When being written to, the 
:class:`~scrapy.settings.BaseSettings` instance is *updated*, not overwritten::

    >>> settings.set('ITEM_PIPELINES', {'path.one': 100})
    >>> settings.set('ITEM_PIPELINES', {'path.two': 200})
    >>> print dict(settings['ITEM_PIPELINES'])
    {'path.two': 200, 'path.one': 100}

This facilitates easy enabling, disabling, or configuring of single components.
E.g., if you want to disable the pipeline ``some.pipeline`` from the command
line, it is now sufficient to set the dictionary value for just this pipeline to
``None``::

    scrapy crawl example.com -s ITEM_PIPELINES={"some.pipeline": null}

instead of having to provide the full dictionary of all other enabled pipelines.

This update applies to all dictionary-like settings that have a default value,
namely:

* :setting:`DEFAULT_REQUEST_HEADERS`,
* :setting:`DOWNLOAD_HANDLERS`,
* :setting:`DOWNLOADER_MIDDLEWARES`,
* :setting:`EXTENSIONS`,
* :setting:`FEED_STORAGES`,
* :setting:`FEED_EXPORTERS`,
* :setting:`ITEM_PIPELINES`,
* :setting:`SPIDER_MIDDLEWARES`, and
* :setting:`SPIDER_CONTRACTS`.

If your code relies on completely replacing, not updating, one of these
settings, you can work around the new behaviour by deleting the dictionary-like
setting before writing to it::

    >>> del settings['SPIDER_MIDDLEWARES']
    >>> settings.set('SPIDER_MIDDLEWARES', {'my.middleware': 100})
    >>> print dict(settings['SPIDER_MIDDLEWARES'])
    {'my.middleware': 100}

However, keep in mind that Scrapy's standard procedure to disable components
from the dictionary-like settings is to set their value to ``None``.


Deprecation of ``_BASE`` settings
---------------------------------

The new update-not-overwrite behaviour of Scrapy's dictionary settings (see
above) renders the ``_BASE`` settings that were previously used to store
Scrapy's defaults (e.g. `DOWNLOADER_MIDDLEWARES_BASE`) obsolete.
Consequentially, Scrapy's defaults have been moved into the regular settings
(e.g. :setting:`DOWNLOADER_MIDDLEWARES`).

While setting the ``XY_BASE`` setting will still work, please update your code
to override (i.e. update) the ``XY`` setting instead, and set the value of the
default components that you wish to disable to ``None``.

@jdemaeyer jdemaeyer force-pushed the enhancement/settings-per-key-priorities branch 2 times, most recently from 03349ff to d9577da Compare October 27, 2015 13:21
@jdemaeyer jdemaeyer force-pushed the enhancement/settings-per-key-priorities branch from d9577da to 03f1720 Compare October 27, 2015 13:23
@jdemaeyer
Copy link
Contributor Author

Rebased onto current master and updated the function that handles backwards-compatibility for users who explicitly set _BASE settings:

They will now find the expected behaviour that they get none of Scrapy's defaults for each setting where they manually have set a _BASE setting. E.g. doing this:

# settings.py
DOWNLOADER_MIDDLEWARES_BASE = {}

will now result in no downloader middlewares being enabled (and a warning that _BASE settings are deprecated).

Not sure about the codecov test failing. It says only 95 % of this diff are hit because there are a couple of places that weren't covered before but where I switched the syntax to use the new helpers, e.g. like this:

-            valid_output_formats = (
-                list(self.settings.getdict('FEED_EXPORTERS').keys()) +
-                list(self.settings.getdict('FEED_EXPORTERS_BASE').keys())
-            )
+            feed_exporters = without_none_values(self.settings._getcomposite('FEED_EXPORTERS'))
+            valid_output_formats = feed_exporters.keys()

These should definitely be covered but I don't think it belongs into this PR.

dangra added a commit that referenced this pull request Oct 28, 2015
…priorities

[MRG+1] Per-key priorities for dict-like settings by promoting dicts to Settings instances
@dangra dangra merged commit dd9f777 into scrapy:master Oct 28, 2015
@kmike kmike mentioned this pull request Oct 30, 2015
@redapple redapple added this to the Scrapy 1.1 milestone Jan 25, 2016
@redapple
Copy link
Contributor

Removed "backward-incompatible" tag after #1586 merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants