Write a section on delayed file delivery
Gallaecio committed Aug 3, 2020
1 parent 1998096 commit 31795e6
Showing 2 changed files with 34 additions and 10 deletions.
15 changes: 5 additions & 10 deletions docs/news.rst
@@ -16,16 +16,11 @@ Highlights:
* The new :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` setting allows delivering
output items in batches of up to the specified number of items.

Some storage backends (S3, FTP, and now GCS) do not receive items as they are
scraped. Instead, Scrapy writes items into a temporary local file, and only
once all the file contents have been written (i.e. at the end of the crawl)
is that file uploaded to the feed URI.

You can now use :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` to split the output
items in multiple files with the specified maximum item count per file.
That way, as soon as a file reaches the maximum item count, that file is
delivered to the feed URI, allowing item delivery to start way before the
end of the crawl.
It also serves as a workaround for :ref:`delayed file delivery
<delayed-file-delivery>`, which causes Scrapy to only start item delivery
to some storage backends (:ref:`S3 <topics-feed-storage-s3>`, :ref:`FTP
<topics-feed-storage-ftp>`, and now :ref:`GCS <topics-feed-storage-gcs>`)
after the crawl has finished.

* The base implementation of :ref:`item loaders <topics-loaders>` has been
moved into a separate library, :doc:`itemloaders <itemloaders:index>`,
Expand Down
29 changes: 29 additions & 0 deletions docs/topics/feed-exports.rst
@@ -170,6 +170,9 @@ FTP supports two different connection modes: `active or passive
mode by default. To use the active connection mode instead, set the
:setting:`FEED_STORAGE_FTP_ACTIVE` setting to ``True``.
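
For example, one possible ``settings.py`` sketch that exports to an FTP server
in active mode (the server, credentials and path below are placeholders)::

    FEEDS = {
        'ftp://user:password@ftp.example.com/path/to/items.json': {
            'format': 'json',
        },
    }
    FEED_STORAGE_FTP_ACTIVE = True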

This storage backend uses :ref:`delayed file delivery <delayed-file-delivery>`.


.. _topics-feed-storage-s3:

S3
@@ -195,6 +198,9 @@ You can also define a custom ACL for exported feeds using this setting:

* :setting:`FEED_STORAGE_S3_ACL`

This storage backend uses :ref:`delayed file delivery <delayed-file-delivery>`.
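
For example, one possible ``settings.py`` sketch (the bucket, credentials and
ACL below are placeholders)::

    AWS_ACCESS_KEY_ID = 'your-access-key'
    AWS_SECRET_ACCESS_KEY = 'your-secret-key'
    FEED_STORAGE_S3_ACL = 'bucket-owner-full-control'
    FEEDS = {
        's3://your-bucket/path/to/items.json': {
            'format': 'json',
        },
    }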


.. _topics-feed-storage-gcs:

Google Cloud Storage (GCS)
@@ -218,8 +224,11 @@ You can set a *Project ID* and *Access Control List (ACL)* through the following
* :setting:`FEED_STORAGE_GCS_ACL`
* :setting:`GCS_PROJECT_ID`

This storage backend uses :ref:`delayed file delivery <delayed-file-delivery>`.
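
For example, one possible ``settings.py`` sketch (the bucket, project ID and
ACL below are placeholders)::

    GCS_PROJECT_ID = 'your-project-id'
    FEED_STORAGE_GCS_ACL = 'publicRead'
    FEEDS = {
        'gs://your-bucket/path/to/items.json': {
            'format': 'json',
        },
    }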

.. _google-cloud-storage: https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-python


.. _topics-feed-storage-stdout:

Standard output
@@ -232,6 +241,26 @@ The feeds are written to the standard output of the Scrapy process.
* Required external libraries: none
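
For example, one possible ``settings.py`` sketch that writes a JSON feed to
standard output::

    FEEDS = {
        'stdout:': {
            'format': 'json',
        },
    }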


.. _delayed-file-delivery:

Delayed file delivery
---------------------

As indicated above, some of the described storage backends use delayed file
delivery.

These storage backends do not receive items as they are scraped. Instead, Scrapy
writes items into a temporary local file, and only once all the file contents
have been written (i.e. at the end of the crawl) is that file uploaded to the
feed URI.

If you want item delivery to start earlier when using one of these storage
backends, use :setting:`FEED_EXPORT_BATCH_ITEM_COUNT` to split the output items
into multiple files, with the specified maximum item count per file. That way,
as soon as a file reaches the maximum item count, that file is delivered to the
feed URI, allowing item delivery to start well before the end of the crawl.
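
For example, one possible ``settings.py`` sketch that delivers batches of up to
100 items to S3 (the bucket is a placeholder; a placeholder such as
``%(batch_id)d`` in the feed URI gives each batch file a distinct name)::

    FEED_EXPORT_BATCH_ITEM_COUNT = 100
    FEEDS = {
        's3://your-bucket/items-%(batch_id)d.json': {
            'format': 'json',
        },
    }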


Settings
========

