Write a section on delayed file delivery
Gallaecio committed Aug 3, 2020
1 parent 1998096 commit 31795e6
Showing 2 changed files with 34 additions and 10 deletions.
15 changes: 5 additions & 10 deletions docs/news.rst
@@ -16,16 +16,11 @@ Highlights:
* The new :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` setting allows delivering
output items in batches of up to the specified number of items.

Some storage backends (S3, FTP, and now GCS) do not receive items as they are
scraped. Instead, Scrapy writes items into a temporary local file, and only
once all the file contents have been written (i.e. at the end of the crawl)
is that file uploaded to the feed URI.

You can now use :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` to split the output
items in multiple files with the specified maximum item count per file.
That way, as soon as a file reaches the maximum item count, that file is
delivered to the feed URI, allowing item delivery to start way before the
end of the crawl.
It also serves as a workaround for :ref:`delayed file delivery
<delayed-file-delivery>`, which causes Scrapy to only start item delivery
to some storage backends (:ref:`S3 <topics-feed-storage-s3>`, :ref:`FTP
<topics-feed-storage-ftp>`, and now :ref:`GCS <topics-feed-storage-gcs>`)
after the crawl has finished.

* The base implementation of :ref:`item loaders <topics-loaders>` has been
moved into a separate library, :doc:`itemloaders <itemloaders:index>`,
Expand Down
29 changes: 29 additions & 0 deletions docs/topics/feed-exports.rst
@@ -170,6 +170,9 @@ FTP supports two different connection modes: `active or passive
mode by default. To use the active connection mode instead, set the
:setting:`FEED_STORAGE_FTP_ACTIVE` setting to ``True``.
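
For example, one possible ``settings.py`` sketch that exports to an FTP server
in active mode (the server, credentials and path below are placeholders)::

    FEEDS = {
        'ftp://user:password@ftp.example.com/path/to/items.json': {
            'format': 'json',
        },
    }
    FEED_STORAGE_FTP_ACTIVE = True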

This storage backend uses :ref:`delayed file delivery <delayed-file-delivery>`.


.. _topics-feed-storage-s3:

S3
@@ -195,6 +198,9 @@ You can also define a custom ACL for exported feeds using this setting:

* :setting:`FEED_STORAGE_S3_ACL`

This storage backend uses :ref:`delayed file delivery <delayed-file-delivery>`.
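
For example, one possible ``settings.py`` sketch (the bucket, credentials and
ACL below are placeholders)::

    AWS_ACCESS_KEY_ID = 'your-access-key'
    AWS_SECRET_ACCESS_KEY = 'your-secret-key'
    FEED_STORAGE_S3_ACL = 'bucket-owner-full-control'
    FEEDS = {
        's3://your-bucket/path/to/items.json': {
            'format': 'json',
        },
    }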


.. _topics-feed-storage-gcs:

Google Cloud Storage (GCS)
@@ -218,8 +224,11 @@ You can set a *Project ID* and *Access Control List (ACL)* through the following
* :setting:`FEED_STORAGE_GCS_ACL`
* :setting:`GCS_PROJECT_ID`

This storage backend uses :ref:`delayed file delivery <delayed-file-delivery>`.
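
For example, one possible ``settings.py`` sketch (the bucket, project ID and
ACL below are placeholders)::

    GCS_PROJECT_ID = 'your-project-id'
    FEED_STORAGE_GCS_ACL = 'publicRead'
    FEEDS = {
        'gs://your-bucket/path/to/items.json': {
            'format': 'json',
        },
    }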

.. _google-cloud-storage: https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python


.. _topics-feed-storage-stdout:

Standard output
@@ -232,6 +241,26 @@ The feeds are written to the standard output of the Scrapy process.
* Required external libraries: none
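
For example, one possible ``settings.py`` sketch that writes a JSON feed to
standard output::

    FEEDS = {
        'stdout:': {
            'format': 'json',
        },
    }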


.. _delayed-file-delivery:

Delayed file delivery
---------------------

As indicated above, some of the described storage backends use delayed file
delivery.

These storage backends do not receive items as they are scraped. Instead, Scrapy
writes items into a temporary local file, and only once all the file contents
have been written (i.e. at the end of the crawl) is that file uploaded to the
feed URI.

If you want item delivery to start earlier when using one of these storage
backends, use :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` to split the output items
into multiple files, with the specified maximum item count per file. That way,
as soon as a file reaches the maximum item count, that file is delivered to the
feed URI, allowing item delivery to start well before the end of the crawl.
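
For example, one possible ``settings.py`` sketch that delivers batches of up to
100 items to S3 (the bucket is a placeholder; a placeholder such as
``%(batch_id)d`` in the feed URI gives each batch file a distinct name)::

    FEED_EXPORT_BATCH_ITEM_COUNT = 100
    FEEDS = {
        's3://your-bucket/items-%(batch_id)d.json': {
            'format': 'json',
        },
    }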


Settings
========

