Skip to content

ArtemSerdechnyi/scrapy-aiohttp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

44 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

scrapy-aiohttp

version pyversions Tests codecov

This library simply integrates aiohttp with Scrapy, addressing challenges encountered when dealing with websites imposing restrictions on standard Scrapy requests. Specifically, it simply resolves the 403 Forbidden issue often encountered when utilizing Scrapy's built-in requests.

Installation

To install scrapy-aiohttp, use the following command:

$ pip install scrapy-aiohttp

Configuration

  1. Add the aiohttp server address to settings.py of your Scrapy project like this:
AIOHTTP_SERVER_URL = "http://localhost:8080/"
  1. Enable the scrapy-aiohttp middleware by adding it to DOWNLOADER_MIDDLEWARES in your settings.py file:
DOWNLOAD_HANDLERS = {
    "scrapy_aiohttp.AiohttpMiddleware": 651,
}
  1. Set up aiohttp request headers configuration to settings.py like this:
from scrapy_aiohttp.utils import DEFAULT_AIOHTTP_REQUEST_HEADERS_CONFIG

AIOHTTP_REQUEST_HEADERS_CONFIG = DEFAULT_AIOHTTP_REQUEST_HEADERS_CONFIG

Type dict[str, str | Callable[[aiohttp.web.Request], str] | None]

The AIOHTTP_REQUEST_HEADERS_CONFIG serves as an interface for inheriting headers from a Scrapy request and reusing them to create an aiohttp request.

DEFAULT_AIOHTTP_REQUEST_HEADERS_CONFIG is initialized with the values given below in the example.

Here's how you can customize AIOHTTP_REQUEST_HEADERS_CONFIG using the following guidelines:

  • If the header value is a Callable object (a function), it gets executed with the HTTP request object (aiohttp.web.Request) as an argument during header construction. Result of the executed function becomes the header value.
    {"Host": lambda request: urlparse(request.match_info.get("url")).hostname}
  • If the header value is a str, it serves as a static value for the header.
    {"Content-Type": "text/html"}
  • If the header value is set to None, it implies that the header should be inherited from the request headers. In other words, the server will use the same value for this header as it receives in the incoming request.
     {"User-Agent": None}

Note: Headers missing in AIOHTTP_REQUEST_HEADERS_CONFIG will not be applied to the aiohttp request!

Ensure that all necessary headers are defined to meet your specific requirements.

Usage

The easiest way to send requests with aiohttp is to use scrapy_aiohttp.AiohttpRequest. You can also use regular scrapy.Request and the "aiohttp" Request meta key:

from scrapy import Spider, Request
from scrapy_aiohttp import AiohttpRequest


class ExampleSpider(Spider):
    name = "example"

    def start_requests(self):
        # use case: scrapy_aiohttp.AiohttpRequest
        yield AiohttpRequest(
            url="https://example.com",
            callback=self.parse
        )
        # use case: scrapy.Request with meta key
        yield Request(
            url="https://example.com",
            callback=self.parse,
            meta={"aiohttp": True},
        )

    def parse(self, response, **kwargs):
        return {"url": response.url}