In [1]:
import scrapy

# Installation
Install Scrapy into your Anaconda environment, for example with conda install -c conda-forge scrapy.

# 4. Minimal working example
Scrapy is used to scrape web pages. In this example, we will build a crawler that scrapes and parses the hat selling information on Amazon and stores the data in a CSV file.

## (1) create a project
Scrapy projects are created from the terminal (Mac) or the Command Prompt (Windows).
The project lives in a folder that holds the project configuration file (scrapy.cfg) and the Python modules (items.py, pipelines.py, settings.py) that define the content of the project. We edit these files to build the crawler.

In [2]:
# we will create a project named working_example
!scrapy startproject working_example 

New Scrapy project 'working_example', using template directory '/Users/apple/anaconda/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/apple/Desktop/scrapy_project/working_example

You can start your first spider with:
    cd working_example
    scrapy genspider example example.com


In [3]:
!ls

Scrapy.ipynb    working_example


In [4]:
%cd working_example

/Users/apple/Desktop/scrapy_project/working_example


In [5]:
!ls

scrapy.cfg      working_example


In [6]:
!cat scrapy.cfg
# scrapy.cfg shows the configuration information

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.org/en/latest/deploy.html

[settings]
default = working_example.settings

[deploy]
#url = http://localhost:6800/
project = working_example


In [7]:
# show Available tool commands
!scrapy -h

Scrapy 1.1.1 - project: working_example

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  commands      
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command


## (2) spider
Spiders are classes that you define and that Scrapy uses to scrape information from websites. 
A spider subclasses scrapy.Spider and defines the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

The following are some attributes and methods you can specify in a spider:
* name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.
* start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
* parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it. The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.

### (1) create spider using genspider
genspider is used to create a new spider in the current folder or in the current project’s spiders folder, if called from inside a project. The <name> parameter is set as the spider’s name, while <domain> is used to generate the allowed_domains and start_urls spider’s attributes.<br>
-Syntax: scrapy genspider [-t template] <name> <domain><br>
-genspider doesn't require project

In [8]:
!scrapy genspider firstspider firstspider.com
# you can use the scrapy command-line tool to create a new spider
# Note: some Scrapy commands (like crawl) must be run from inside a Scrapy project.

Created spider 'firstspider' using template 'basic' in module:
  working_example.spiders.firstspider


In [9]:
!scrapy genspider -l # show available templates

Available templates:
  basic
  crawl
  csvfeed
  xmlfeed


In [10]:
!scrapy genspider example example.com
# create spider 'example' using template 'basic' in module

Created spider 'example' using template 'basic' in module:
  working_example.spiders.example


In [11]:
!scrapy genspider -t crawl scrapyorg scrapy.org
# created spider 'scrapyorg' using template 'crawl' in module

Created spider 'scrapyorg' using template 'crawl' in module:
  working_example.spiders.scrapyorg


### list
List all available spiders in the current project. The output is one spider per line.
Syntax: scrapy list
Requires project: yes

In [12]:
!scrapy list

example
firstspider
scrapyorg


### check
Syntax: scrapy check [-l] <spider><br>
Runs the contract checks declared in the spiders' callback docstrings.<br>
Requires project: yes

In [13]:
!scrapy check -l 
# -l lists the contracts declared on each spider without running them

In [14]:
!scrapy check
# run the contract checks of all spiders in the project


----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK


### (2) create spider using scrapy.Spider
You can also write a spider by hand as a subclass of scrapy.Spider.<br>
Spiders are usually stored in the project's spiders directory and run with the crawl command.

In [15]:
%cd working_example/spiders
# open spider directory

/Users/apple/Desktop/scrapy_project/working_example/working_example/spiders


In [16]:
!ls
# shows spiders that already exist

__init__.py    __pycache__    example.py     firstspider.py scrapyorg.py


In [17]:
!touch Basic_spider.py
# create a new spider named Basic_spider
# write information about how to scrape a website into this spider document

We will create a spider in Basic_spider.py that extracts information from the Simon Business School faculty directory.

In [18]:
%%writefile Basic_spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'faulty'
    allowed_domains = ['simon.rochester.edu']
    start_urls = ['http://www.simon.rochester.edu/faculty-and-research/faculty-directory/index.aspx']

Overwriting Basic_spider.py


In [19]:
!touch Advanced_spider.py
# create another spider named Advanced_spider
# put more detailed information about how to scrape a website

Advanced_spider extracts only the names, and the links attached to those names, from the Simon Business School faculty directory.

In [20]:
%%writefile Advanced_spider.py
import scrapy

class MySpider(scrapy.Spider):
    name = 'faulty'
    allowed_domains = ['simon.rochester.edu']
    start_urls = ['http://www.simon.rochester.edu/faculty-and-research/faculty-directory/index.aspx']

    def parse(self, response):
        # each faculty entry sits in a <td class="name"> cell
        # response.xpath() replaces the deprecated HtmlXPathSelector / select() API
        for cell in response.xpath("//td[@class='name']"):
            name = cell.xpath("a/text()").extract()
            link = cell.xpath("a/@href").extract()
            # print the extracted name and link (not the raw selector)
            print(name, link)

Overwriting Advanced_spider.py


In [21]:
!ls
# now we have these two new spiders in our folder

Advanced_spider.py __init__.py        example.py         scrapyorg.py
Basic_spider.py    __pycache__        firstspider.py


## (3) items
The items.py file contains an item class named after your project. It defines the fields you want to scrape from the web. You can rewrite the file using %%writefile in IPython.

In [22]:
%cd ..
!ls
# go back up to the package directory, where we can find the items.py file

/Users/apple/Desktop/scrapy_project/working_example/working_example
__init__.py  __pycache__  items.py     pipelines.py settings.py  spiders


In [23]:
!cat items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class WorkingExampleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


We can rewrite the item class so that it contains the item name and the link that opens the item.

In [24]:
%%writefile items.py

from scrapy.item import Item, Field

class WorkingExampleItem(Item):
    title = Field() # title is the item name
    link = Field() # link is the link to open the item selling information

Overwriting items.py


In [26]:
%run items.py
# run items.py so that the item class is defined in the notebook session
# spiders can import WorkingExampleItem from items.py to structure what they scrape

In [27]:
!cat items.py


from scrapy.item import Item, Field

class WorkingExampleItem(Item):
    title = Field() # title is the item name
    link = Field() # link is the link to open the item selling information

## (4) crawl
Syntax: scrapy crawl <spider>
Requires project: yes

In [28]:
# we can run a spider by the name defined in its name attribute
!scrapy crawl faulty

2017-02-19 17:11:14 [scrapy] INFO: Scrapy 1.1.1 started (bot: working_example)
2017-02-19 17:11:14 [scrapy] INFO: Overridden settings: {'BOT_NAME': 'working_example', 'NEWSPIDER_MODULE': 'working_example.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['working_example.spiders']}
2017-02-19 17:11:14 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-02-19 17:11:14 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downl

In [29]:
# or we can use runspider command to scrape a website
%cd spiders
!scrapy runspider Basic_spider.py

/Users/apple/Desktop/scrapy_project/working_example/working_example/spiders
2017-02-19 17:11:16 [scrapy] INFO: Scrapy 1.1.1 started (bot: working_example)
2017-02-19 17:11:16 [scrapy] INFO: Overridden settings: {'BOT_NAME': 'working_example', 'NEWSPIDER_MODULE': 'working_example.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['working_example.spiders']}
2017-02-19 17:11:16 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-02-19 17:11:16 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 '

In [30]:
# capture the console output of the crawl and save it, one line per row, to faulty.csv
output = !scrapy crawl faulty
import numpy as np
np.savetxt('faulty.csv',output, delimiter = ",", fmt = "%s")

In [31]:
!cat faulty.csv

2017-02-19 17:11:24 [scrapy] INFO: Scrapy 1.1.1 started (bot: working_example)
2017-02-19 17:11:24 [scrapy] INFO: Overridden settings: {'BOT_NAME': 'working_example', 'NEWSPIDER_MODULE': 'working_example.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['working_example.spiders']}
2017-02-19 17:11:24 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-02-19 17:11:24 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',


In [32]:
!ls
# the output file is now stored in the spiders directory

Advanced_spider.py __init__.py        example.py         firstspider.py
Basic_spider.py    __pycache__        faulty.csv         scrapyorg.py
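
The file written above contains the log lines of the crawl rather than a table of scraped fields. A minimal sketch of writing parsed records as real CSV rows with Python's csv module follows; the records are invented stand-ins for what a parse() method might collect. Scrapy also has built-in feed exports (scrapy crawl <spider> -o file.csv) for spiders that yield items or dicts instead of printing them.

```python
import csv
import io

# Invented records standing in for scraped (title, link) pairs.
records = [
    {"title": "Jane Doe", "link": "/faculty/jane-doe"},
    {"title": "Roe, John", "link": "/faculty/john-roe"},
]

# An in-memory buffer keeps the sketch self-contained; use
# open("faculty.csv", "w", newline="") to write a real file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "link"])
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

The csv module quotes values that contain the delimiter, so a name such as "Roe, John" stays in a single column.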


# 6. other interesting or useful features

## (1) logging
Scrapy uses Python’s built-in logging system for event logging. Logging works out of the box, and can be configured to some extent with the Scrapy settings listed in Logging settings.

In [33]:
# a simple log message at the logging.WARNING level:
import logging
logging.warning("This is a warning")



In [34]:
# you can put logging inside a spider
class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://scrapinghub.com']
    def parse(self, response):
        self.logger.info('Parse function called on %s', response.url)
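
Because Scrapy builds on the standard logging module, levels and formats behave as they do in plain Python. The sketch below uses only the standard library; the logger name and the format string are assumptions chosen to resemble the log lines shown earlier, and the messages are captured in memory so the effect of the level threshold is visible.

```python
import io
import logging

# Capture log output in memory instead of printing it to the console.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(name)s] %(levelname)s: %(message)s"))

# hypothetical logger name; Scrapy names a spider's logger after the spider
logger = logging.getLogger("myspider_sketch")
logger.setLevel(logging.INFO)   # messages below INFO are dropped
logger.addHandler(handler)
logger.propagate = False        # keep the demo output out of the root logger

logger.debug("not shown at INFO level")
logger.info("Parse function called on %s", "http://example.com")
logger.warning("This is a warning")

captured = stream.getvalue()
print(captured)
```

In a Scrapy project the same threshold is usually set with the LOG_LEVEL setting rather than by configuring handlers by hand.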

## (2) sending email
Scrapy provides its own facility for sending e-mails. It is easy to use and is implemented with Twisted non-blocking IO, so it does not interfere with the non-blocking IO of the crawler. It also offers a simple API for sending attachments and is configured with a few settings.

#### syntax
class scrapy.mail.MailSender(smtphost=None, mailfrom=None, smtpuser=None, smtppass=None, smtpport=None)<br>
send(to, subject, body, cc=None, attachs=(), mimetype='text/plain', charset=None)

- Parameters
    - to – the e-mail recipients
    - subject – the subject of the e-mail
    - cc – the e-mails to CC
    - body – the e-mail body
    - attachs – an iterable of tuples (attach_name, mimetype, file_object), where attach_name is a string with the name that will appear on the e-mail’s attachment, mimetype is the mimetype of the attachment and file_object is a readable file object with the contents of the attachment
    - mimetype  – the MIME type of the e-mail
    - charset – the character encoding to use for the e-mail contents


- The following settings define the default constructor values of the MailSender class and can be used to configure e-mail sending from the project settings:
    - MAIL_FROM (default: 'scrapy@localhost') – sender address to use (From: header) for sending e-mails.
    - MAIL_HOST (default: 'localhost') – SMTP host to use for sending e-mails.
    - MAIL_PORT (default: 25) – SMTP port to use for sending e-mails.
    - MAIL_USER (default: None) – user to use for SMTP authentication. If disabled, no SMTP authentication will be performed.
    - MAIL_PASS (default: None) – password to use for SMTP authentication, along with MAIL_USER.

- Note: in Scrapy 1.1, the version used in this notebook, sending mail is not supported under Python 3.
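
As a sketch of how the mail settings listed above might look in a project's settings.py: the host name, the addresses and the environment variable name below are placeholders, not values from this project.

```python
import os

# Hypothetical values for the mail-related settings in settings.py.
MAIL_FROM = "crawler@example.com"     # address shown in the From: header
MAIL_HOST = "smtp.example.com"        # SMTP server to connect to
MAIL_PORT = 25                        # SMTP port (25 is the default)
MAIL_USER = "crawler"                 # leave as None to skip SMTP authentication
# read the password from the environment instead of hard-coding it;
# SCRAPY_MAIL_PASS is an assumed variable name
MAIL_PASS = os.environ.get("SCRAPY_MAIL_PASS")
```

Keeping the password out of the settings file avoids committing credentials together with the project.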

In [None]:
from scrapy.mail import MailSender
mailer = MailSender()
mailer.send(to=["Pinyi.Liu@simon.rochester.edu"], subject="scrapy tool", body="Hi! This is scrapy", 
            cc=["rippleslpy@gamil.com"])
# Note: Scrapy does not support sending mail with Python 3.

## (3) Core API

- The Scrapy core API is intended for developers of extensions and middlewares. The following are some typical parts of the API:
    - Crawler API
        - The Crawler object is the main entry point to the Scrapy API. It is passed to extensions through the from_crawler class method. This object provides access to all Scrapy core components, and it’s the only way for extensions to access them and hook their functionality into Scrapy.
    - Settings API
        - The SETTINGS_PRIORITIES dictionary sets the key name and priority level of the default settings priorities used in Scrapy. Each entry defines a settings entry point, giving it a code name for identification and an integer priority. Greater priorities take precedence over lesser ones when setting and retrieving values in the Settings class.
    - SpiderLoader API
        - It is in charge of retrieving and handling the spider classes defined across the project. Custom spider loaders can be employed by specifying their path in the SPIDER_LOADER_CLASS project setting. They must fully implement the scrapy.interfaces.ISpiderLoader interface to guarantee an errorless execution.

## (4) Signals
Scrapy uses signals extensively to notify when certain events occur. You can catch some of those signals in your Scrapy project to perform additional tasks or extend Scrapy to add functionality not provided out of the box. You can connect to signals (or send your own) through the Signals API.