# scrapy
> [Main Table of Contents](../../README.md)

## In This Notebook

- scrapy vs beautifulsoup4
- Selector for scraping/parsing
- Response for crawling/scraping/parsing
	- scrapy flow for crawling/scraping/parsing
- xpath gotcha

## scrapy vs beautifulsoup4

- scrapy covers web crawling, site scraping, data extraction via APIs, and parsing
- beautifulsoup4 contains only parsing functionality

## Selector for scraping/parsing

- Selector Instance Methods

	Selector Instance Method | Description
	--- | ---
	.xpath() | Returns SelectorList
	.css() | Returns SelectorList

- SelectorList is a subclass of built-in list
- SelectorList Methods

	SelectorList Method | Description
	--- | ---
	.getall() | Returns a list of str, one per matched node<br>Newer name for extract()
	.get() | Returns the first match as a str, or None if nothing matched<br>Newer name for extract_first()

In [22]:
# Create Selector object
from scrapy import Selector

html = '''
<html>
  <head>
    <base href='http://example.com/' />
    <title>Example website</title>
  </head>
  <body>
    <div id='images'>
      <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' alt='image1'/></a>
      <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' alt='image2'/></a>
      <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' alt='image3'/></a>
      <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' alt='image4'/></a>
      <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' alt='image5'/></a>
    </div>
  </body>
</html>
'''
sel = Selector(text=html)

In [23]:
sel.css('html>body>div>br')

[]

In [24]:
s = sel.css('a')
print(len(s))
s[0].xpath('./img/@alt').get()

5


'image1'

In [25]:
# Returns SelectorList
sel.css('html > body > div > a')
sel.xpath('/html/body/div/a')

[<Selector xpath='/html/body/div/a' data='<a href="image1.html">Name: My image ...'>,
 <Selector xpath='/html/body/div/a' data='<a href="image2.html">Name: My image ...'>,
 <Selector xpath='/html/body/div/a' data='<a href="image3.html">Name: My image ...'>,
 <Selector xpath='/html/body/div/a' data='<a href="image4.html">Name: My image ...'>,
 <Selector xpath='/html/body/div/a' data='<a href="image5.html">Name: My image ...'>]

In [26]:
# Returns list of just data:   getall() == extract()
sel.css('html > body > div > a').getall()
sel.xpath('/html/body/div/a').getall()  

['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg" alt="image1"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg" alt="image2"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg" alt="image3"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg" alt="image4"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg" alt="image5"></a>']

In [27]:
# Returns string of first data:   get() == extract_first()
sel.css('html > body > div > a').get()
sel.xpath('/html/body/div/a').get()

'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg" alt="image1"></a>'

In [28]:
# Chaining between xpath and css
sel.xpath('/html/body').css('div > a')

[<Selector xpath='descendant-or-self::div/a' data='<a href="image1.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::div/a' data='<a href="image2.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::div/a' data='<a href="image3.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::div/a' data='<a href="image4.html">Name: My image ...'>,
 <Selector xpath='descendant-or-self::div/a' data='<a href="image5.html">Name: My image ...'>]

## Response for crawling/scraping/parsing
- A Response object exposes the same `.xpath()` and `.css()` methods as a Selector object, and adds crawling capability.
- Request and Response objects are meant to be used inside scrapy spider classes.
- On top of the Selector functionality, Response adds the following:

Additional functionality | Description
--- | ---
.follow() | Returns a Request instance to follow a url

### scrapy flow for crawling/scraping/parsing

1. Instantiate crawling process
2. Create and add spider to process
	- Subclass `scrapy.Spider`
	- Override `start_requests` method
	- Define a `parse` callback (`parse` is the default callback for a Request created without an explicit `callback`)
3. Start crawling process

In [29]:
# Create spider
from scrapy import Spider
from scrapy.http import Request
from scrapy.crawler import CrawlerProcess
class MySpider(Spider):
    name = 'myspider'

    def start_requests(self):
        urls = ['https://www.datacamp.com/courses/all']
        for url in urls:
            yield Request(url=url, callback=self.custom_parse_first)

    def custom_parse_first(self, response):
        """
        Parse the response object.
        The Response object has the same parsing functionality as a Selector object,
        so use xpath/css locators with the get and getall methods to parse and extract data.
        """
        parsed = response.xpath('/html/body').css('table#first-table').xpath('.//tr').getall()
        print(parsed)
        # grab next links
        urls = response.xpath('//td[@id="additional_data"]/@href').getall()
        for url in urls:
            yield response.follow(url, self.custom_parse_next)

    def custom_parse_next(self, response):
        # and so forth
        pass

# Instantiate crawling process
process = CrawlerProcess()
# Add spider to process
process.crawl(MySpider)
# Start crawling process
process.start()

2022-11-02 12:01:18 [scrapy.utils.log] INFO: Scrapy 2.7.0 started (bot: scrapybot)
2022-11-02 12:01:18 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.2.0, parsel 1.6.0, w3lib 2.0.1, Twisted 18.9.0, Python 3.8.10 (default, Jun 22 2022, 20:18:18) - [GCC 9.4.0], pyOpenSSL 22.1.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 38.0.1, Platform Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.29
2022-11-02 12:01:18 [scrapy.crawler] INFO: Overridden settings:
{}
2022-11-02 12:01:18 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-11-02 12:01:18 [scrapy.extensions.telnet] INFO: Telnet Password: 8e37dff15e7536a4
2022-11-02 12:01:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-11-02 12:01:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['

ReactorNotRestartable: 

## xpath gotcha
- When chaining, a subsequent xpath call must start with a dot (`.`) to be evaluated relative to the `current` node; a path starting with `/` or `//` is evaluated from the document root instead

In [None]:
# xpath gotcha.  Notice the xpath starting with a dot.
sel.xpath('/html').css('body > div').xpath('./a')

[<Selector xpath='./a' data='<a href="image1.html">Name: My image ...'>,
 <Selector xpath='./a' data='<a href="image2.html">Name: My image ...'>,
 <Selector xpath='./a' data='<a href="image3.html">Name: My image ...'>,
 <Selector xpath='./a' data='<a href="image4.html">Name: My image ...'>,
 <Selector xpath='./a' data='<a href="image5.html">Name: My image ...'>]