# Introduction

**Scrapy-Playwright** is an extension for Scrapy, designed to handle **JavaScript-rendered web pages**.<br>

While Scrapy is a powerful framework for scraping static HTML pages, it struggles with websites that load content dynamically using JavaScript. Scrapy-Playwright solves this by integrating the Playwright browser automation tool, allowing Scrapy to render and interact with JavaScript-heavy pages before scraping them.

## Key Features:
- **JavaScript Rendering**: Uses Playwright to render dynamic content before scraping, making it possible to scrape data from pages that rely on client-side rendering (CSR).
- **Multiple Browsers Support**: Works with major browsers like Chromium, Firefox, and WebKit, allowing you to scrape a wide variety of modern web applications.
- **Seamless Integration**: Scrapy-Playwright is designed to work seamlessly with Scrapy, allowing you to leverage the familiar features of Scrapy, like item pipelines and crawling, while handling dynamic content.

## Use Cases:
- Scraping data from websites that require JavaScript to display content (e.g., SPAs, dynamically-loaded lists).
- Extracting data from modern e-commerce sites, news platforms, or dashboards that load content asynchronously.

# Installation

First, install **scrapy-playwright**:

In [None]:
pip install scrapy-playwright

Then, if you haven't already installed the **Playwright** browsers, install them by running the following command in your terminal:

In [None]:
playwright install

## Configure Settings for Scrapy Playwright

Next, we need to update our Scrapy project's settings to **activate scrapy-playwright** in the project:

In [None]:
# settings.py

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

The **ScrapyPlaywrightDownloadHandler** class inherits from Scrapy's default **http/https** handler. So unless you explicitly activate scrapy-playwright in your Scrapy Request, those requests will be processed by the regular Scrapy download handler.
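
The opt-in behaviour described above can be pictured with a small, simplified sketch. The function name `pick_handler` and the two labels it returns are hypothetical and only illustrate the decision; this is not the handler's actual code.

```python
def pick_handler(meta: dict) -> str:
    """Return which (hypothetical) download path a request would take."""
    # Only requests that explicitly set meta["playwright"] to a truthy value
    # are rendered in a browser; everything else uses the regular handler.
    return "playwright" if meta.get("playwright") else "default"


print(pick_handler({"playwright": True}))  # playwright
print(pick_handler({}))                    # default
```

In other words, adding the download handler to `settings.py` does not change how existing requests are fetched until a request opts in.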

# Use Scrapy Playwright In Spiders

Let's integrate **scrapy-playwright** into a Scrapy spider so that all of our requests are JS rendered.

To route our requests through **scrapy-playwright** we just need to enable it in the Request meta dictionary by setting **meta={'playwright': True}**.

In [None]:
# spiders/quotes.py

import scrapy
from quotes_js_scraper.items import QuoteItem

class QuotesSpider(scrapy.Spider):
	name = 'quotes'

	def start_requests(self):
		url = "https://quotes.toscrape.com/js/"
		yield scrapy.Request(url, meta={'playwright': True})

	def parse(self, response):
		for quote in response.css('div.quote'):
			quote_item = QuoteItem()
			quote_item['text'] = quote.css('span.text::text').get()
			quote_item['author'] = quote.css('small.author::text').get()
			quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
			yield quote_item

The **response** will now contain the rendered page as **seen by the browser**.<br>
However, **Playwright** will sometimes return the response before the page has finished rendering. We can solve this using Playwright **PageMethods**.

# Interacting With The Page Using Playwright PageMethods

To interact with the page using scrapy-playwright, we need to use the **PageMethod** class.

- **PageMethods** allow us to do a lot of different things on the page, including:
    - Waiting for elements to load before returning the response
    - Scrolling the page
    - Clicking on page elements
    - Taking a screenshot of the page
    - Creating PDFs of the page

First, to work with the Playwright Page object directly in your spider, set **playwright_include_page** to **True** in the Request meta, and define the callback (e.g. **parse**) as a coroutine function (**async def**) so it can await methods on the provided Page object. Close the page once you are done with it, and also in the errback, so that browser resources are released.

In [None]:
# spiders/quotes.py

import scrapy
from quotes_js_scraper.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
            ),
            # errback is an argument of Request, not a meta key
            errback=self.errback,
        )

    async def parse(self, response):
        # The response is already rendered, so the page can be closed right away
        page = response.meta["playwright_page"]
        await page.close()

        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

    async def errback(self, failure):
        # Close the page if the request failed after the page was opened
        page = failure.request.meta["playwright_page"]
        await page.close()
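
Because an open page holds browser resources until it is closed, it can help to make the `close()` call unconditional. The following is a minimal sketch of that pattern. `FakePage` and `read_and_close` are hypothetical names used only so the example runs without a browser; in a spider, the page object would come from `response.meta["playwright_page"]`.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for a Playwright page, for illustration only."""

    def __init__(self):
        self.closed = False

    async def content(self):
        raise RuntimeError("simulated failure while reading the page")

    async def close(self):
        self.closed = True


async def read_and_close(page):
    # try/finally guarantees close() is awaited even if the body raises.
    try:
        return await page.content()
    finally:
        await page.close()


async def main():
    page = FakePage()
    try:
        await read_and_close(page)
    except RuntimeError:
        pass  # the failure is expected in this demo
    return page.closed


print(asyncio.run(main()))  # True
```

The same `try`/`finally` shape could wrap any page interaction inside an `async def parse` callback.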
 

## 1. Waiting For Page Elements
To wait for a specific page element before JavaScript rendering stops and the response is returned to our spider, we add a **PageMethod** to the **playwright_page_methods** key in the Request meta and have it call **wait_for_selector**.

Now, when we run the spider, scrapy-playwright will render the page until a **div** with the class **quote** appears on the page.

In [None]:
# spiders/quotes.py

import scrapy
from quotes_js_scraper.items import QuoteItem
from scrapy_playwright.page import PageMethod

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                # Wait until at least one quote has been rendered
                playwright_page_methods=[PageMethod('wait_for_selector', 'div.quote')],
            ),
            # errback is an argument of Request, not a meta key
            errback=self.errback,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()

        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
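
The same meta keys are passed on every request in these examples, so one option is to build them in a single place. This is only a sketch: the helper name `playwright_meta` and its parameters are hypothetical and not part of scrapy-playwright; only the three meta keys come from the examples above.

```python
def playwright_meta(page_methods=(), include_page=True):
    """Build the Request meta dict used throughout these examples."""
    meta = {
        "playwright": True,
        "playwright_include_page": include_page,
    }
    if page_methods:
        # Page methods are applied in list order.
        meta["playwright_page_methods"] = list(page_methods)
    return meta


print(playwright_meta())
# {'playwright': True, 'playwright_include_page': True}
```

In a spider, the result would be passed as `meta=playwright_meta([PageMethod('wait_for_selector', 'div.quote')])` when creating the `scrapy.Request`.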

## 2. Scraping Multiple Pages
Usually, we need to scrape multiple pages of a JavaScript-rendered website. We do this by checking whether a next-page link is present on the page and, if so, requesting that page using the URL we scraped from the link.

In [None]:
# spiders/quotes.py

import scrapy
from quotes_js_scraper.items import QuoteItem
from scrapy_playwright.page import PageMethod


class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod('wait_for_selector', 'div.quote'),
                ],
            ),
            # errback is an argument of Request, not a meta key
            errback=self.errback,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()

        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

        next_page = response.css('.next > a::attr(href)').get()

        if next_page is not None:
            # Resolve the relative href against the current page URL
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(
                next_page_url,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                    playwright_page_methods=[
                        PageMethod('wait_for_selector', 'div.quote'),
                    ],
                ),
                errback=self.errback,
            )

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
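
The href scraped from the next-page link is relative, so it has to be turned into an absolute URL before it can be requested. Resolving it against the current page URL avoids hard-coding the host and scheme. A minimal sketch with the standard library follows; the `next_href` value is an assumed example of what the link might contain. Scrapy's `response.urljoin()` performs the same kind of resolution against the response URL.

```python
from urllib.parse import urljoin

page_url = "https://quotes.toscrape.com/js/"
next_href = "/js/page/2/"  # assumed example of a scraped href

print(urljoin(page_url, next_href))
# https://quotes.toscrape.com/js/page/2/
```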

## 3. Scroll Down Infinite Scroll Pages
We can also configure scrapy-playwright to scroll down a page when a website uses an infinite scroll to load in data.

In this example, Playwright waits for **div.quote** to appear, scrolls to the bottom of the page, and then waits until an **11th quote** is present (the page loads 10 quotes at a time).

In [None]:
# spiders/quotes.py

import scrapy
from quotes_js_scraper.items import QuoteItem
from scrapy_playwright.page import PageMethod

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/scroll"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod("wait_for_selector", "div.quote"),
                    # Scroll to the bottom to trigger loading of the next batch
                    PageMethod("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageMethod("wait_for_selector", "div.quote:nth-child(11)"),  # 10 per page
                ],
            ),
            # errback is an argument of Request, not a meta key
            errback=self.errback,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()

        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
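
The example above scrolls once. When the number of batches is not known in advance, one possible approach is to keep scrolling from inside the callback until the element count stops growing. The sketch below assumes a Playwright async `page` object; the function name `scroll_until_stable` and the `max_rounds` and `pause_ms` parameters are hypothetical choices for illustration.

```python
async def scroll_until_stable(page, selector, max_rounds=10, pause_ms=500):
    """Scroll to the bottom repeatedly until no new matching elements appear."""
    previous = await page.locator(selector).count()
    for _ in range(max_rounds):
        await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
        await page.wait_for_timeout(pause_ms)  # give new items time to load
        current = await page.locator(selector).count()
        if current == previous:
            break  # nothing new was loaded, so assume the end was reached
        previous = current
    return previous
```

In an `async def parse` callback this would be called as `await scroll_until_stable(page, "div.quote")` before closing the page. Since the `response` body was captured before this extra scrolling, the updated HTML would then need to be read from the page itself, for example with `await page.content()`.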

## 4. Take Screenshot Of Page
Taking a screenshot of the page is simple too.

Here we wait for Playwright to see the selector **div.quote**, and then take a screenshot of the page.

In [None]:
# spiders/quotes.py

import scrapy
from quotes_js_scraper.items import QuoteItem
from scrapy_playwright.page import PageMethod

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod("wait_for_selector", "div.quote"),
                ],
            ),
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        # Saves the image to example.png and also returns it
        screenshot = await page.screenshot(path="example.png", full_page=True)
        # screenshot contains the image's bytes
        await page.close()
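
The spider above always writes to `example.png`, so a crawl over several pages would overwrite the file each time. One way around this is to derive the file name from the page URL. The helper `screenshot_path` and the `screenshots` directory name below are hypothetical, shown only as a sketch.

```python
from pathlib import Path
from urllib.parse import urlparse


def screenshot_path(url: str, out_dir: str = "screenshots") -> Path:
    """Derive a per-URL PNG path so screenshots don't overwrite each other."""
    parsed = urlparse(url)
    # Turn "/js/page/2/" into "js_page_2"; fall back to "index" for the root.
    slug = parsed.path.strip("/").replace("/", "_") or "index"
    return Path(out_dir) / f"{parsed.netloc}_{slug}.png"


print(screenshot_path("https://quotes.toscrape.com/js/").as_posix())
# screenshots/quotes.toscrape.com_js.png
```

The result could then be passed as `path=str(screenshot_path(response.url))` in the `page.screenshot(...)` call; the output directory may need to exist beforehand.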

# Sources
- <a href="https://www.youtube.com/watch?v=EijzO7n2-dg&ab_channel=ScrapeOps">Scrapy-Playwright: How To Scrape Dynamic JS Websites (2022) by ScrapeOps</a>
- <a href="https://scrapeops.io/python-scrapy-playbook/scrapy-playwright/">Scrapy Playwright Guide: Render & Scrape JS Heavy Websites</a>
- <a href="https://github.com/scrapy-plugins/scrapy-playwright">scrapy-plugins/scrapy-playwright</a>