<h1>Crawlers and Scrapers</h1>

The goal of this session is to build and run our own Amazon.com scraper using the **scrapy** Python library.

Our scraper will crawl through a specific product's customer review pages and get all of the available ratings and reviews. This will allow us to get complete review details that we may not be able to get through the Amazon Product Advertising API.

First, we will install scrapy using the pip command in a terminal/cmd window:

In [2]:
!pip install scrapy

Collecting scrapy
  Using cached Scrapy-2.6.1-py2.py3-none-any.whl (264 kB)
Collecting lxml>=3.5.0
  Downloading lxml-4.8.0-cp38-cp38-macosx_10_14_x86_64.whl (4.5 MB)
Collecting tldextract
  Using cached tldextract-3.2.0-py3-none-any.whl (87 kB)
Collecting Twisted>=17.9.0
  Using cached Twisted-22.2.0-py3-none-any.whl (3.1 MB)
Collecting service-identity>=16.0.0
  Using cached service_identity-21.1.0-py2.py3-none-any.whl (12 kB)
Collecting zope.interface>=4.1.3
  Downloading zope.interface-5.4.0-cp38-cp38-macosx_10_14_x86_64.whl (208 kB)
Collecting PyDispatcher>=2.0.5
  Using cached PyDispatcher-2.0.5.zip (47 kB)
Collecting w3lib>=1.17.0
  Using cached w3lib-1.22.0-py2.py3-none-any.whl (20 kB)
Collecting queuelib>=1.4.2
  Using cached queuelib-1.6.2-py2.py3-none-any.whl (13 kB)
Collecting cssselect>=0.9.1
  Using cached cssselect-1.1.0-py2.
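
To confirm that the install succeeded before moving on, one option is to ask Python which version of the package it can see. This is a minimal sketch using the standard library's `importlib.metadata` (available from Python 3.8); the helper name `installed_version` is hypothetical and not part of scrapy.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed scrapy version, or None if scrapy is not installed.
print(installed_version("scrapy"))
```

If this prints `None`, the package was probably installed into a different Python environment than the one running the notebook.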

Next, we'll use scrapy to automatically generate a skeleton of the code needed for our scraper. (In a terminal/cmd window, type the following command without the exclamation mark.)

In [6]:
!scrapy genspider amazon_scraper amazon.com

Created spider 'amazon_scraper' using template 'basic' 


A new Python script, amazon_scraper.py, will be created. The final content of our scraper will be as follows:

In [3]:
# -*- coding: utf-8 -*-
import scrapy

class AmazonScraperSpider(scrapy.Spider):
    name = 'amazon_scraper'
    allowed_domains = ['amazon.com']
    # assign a product-review-page URL below
    start_urls = ['https://www.amazon.com/Apple-iPhone-Verizon-Unlocked-Renewed/product-reviews/B07HYDFX8G/ref=cm_cr_arp_d_viewopt_srt?ie=UTF8&reviewerType=all_reviews&sortBy=helpful&pageNumber=1']
    
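    def parse(self, response):
        # Each review body; join its text nodes into one clean string
        review_texts = [
            "".join(block.css('::text').getall()).strip()
            for block in response.css('.a-size-base.review-text')
        ]

        # Star-rating text for each review on the page
        review_ratings = response.css('[data-hook="review-star-rating"] > span::text').getall()

        # zip() pairs each text with its rating and stops at the shorter list,
        # so a missing rating cannot raise an IndexError
        for text, rating in zip(review_texts, review_ratings):
            yield {'text': text, 'rating': rating}

        # Follow the "next page" link; get() returns None when the selector matches nothing
        next_page_url = response.css('.a-last > a::attr(href)').get()
        if next_page_url is not None:
            yield response.follow(next_page_url, self.parse)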
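
The last step of `parse` handles pagination: the `href` read from the page is typically a relative path, and `response.follow` accepts relative URLs and resolves them against the page being parsed. When the selector matches nothing, no URL is returned (the value is `None`), so that case is worth guarding against. The sketch below mimics this step with the standard library's `urljoin` so it can be tried without network access. The helper `resolve_next_page` and the sample URLs are hypothetical and only illustrate the idea; they are not Scrapy's own implementation.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_next_page(current_url: str, href: Optional[str]) -> Optional[str]:
    """Turn a (possibly relative) next-page href into an absolute URL.

    Returns None when there is no href, i.e. no further page to crawl.
    """
    if href is None:
        return None
    return urljoin(current_url, href)


# Hypothetical values, shaped like the review-page URL used in the spider.
page_1 = "https://www.amazon.com/product-reviews/B07HYDFX8G/?pageNumber=1"
next_href = "/product-reviews/B07HYDFX8G/?pageNumber=2"

print(resolve_next_page(page_1, next_href))
# https://www.amazon.com/product-reviews/B07HYDFX8G/?pageNumber=2
print(resolve_next_page(page_1, None))
# None
```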


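We can run the script from a terminal/cmd window as follows. The `-o out.json` option writes the scraped items to out.json, and `--set=USER_AGENT=...` overrides scrapy's default user agent string: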


In [4]:
!scrapy runspider amazon_scraper.py -o out.json --set=USER_AGENT="Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"

2022-04-07 10:02:05 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-04-07 10:02:05 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.8.0 (default, Nov  6 2019, 15:49:01) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-04-07 10:02:05 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True,
 'USER_AGENT': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) '
               'AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}
2022-04-07 10:02:05 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-07 10:02:05 [scrapy.extensions.telnet] INFO: Telnet Password: c9b6618f697ef9a5
2022-04-07 10:02:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsol