## Learn Scrapy

### Getting Started with Web Scraping using Scrapy


[YouTube link](https://www.youtube.com/watch?v=vkA1cWN4DEc&list=PLZyvi_9gamL-EE3zQJbU5N3nzJcfNeFHU)

`scrapy shell http://quotes.toscrape.com/random`

This opens an interactive shell, downloads the page, and makes it available as the `response` object. To see the raw HTML:

`print(response.text)`

### We can use CSS selectors

```python
In [2]: response.css('small.author')
Out[2]: [<Selector xpath="descendant-or-self::small[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data='<small class="author" itemprop="author">'>]

In [3]: response.css('small.author').extract()
Out[3]: ['<small class="author" itemprop="author">George R.R. Martin</small>']

In [4]: response.css('small.author::text').extract()
Out[4]: ['George R.R. Martin']

In [5]: response.css('small.author::text').extract()[0]
Out[5]: 'George R.R. Martin'

In [6]: response.css('small.author::text').extract_first()
Out[6]: 'George R.R. Martin'

```

### Getting quote and tag text

```python

In [7]: response.css('span.text::text').extract_first()
Out[7]: '“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”'
    
In [1]: response.css('a.tag::text').extract()
Out[1]: ['books', 'inspirational', 'reading', 'tea']


```

### Creating a spider

`scrapy genspider quotes toscrape.com`

`quotes` is the name of our spider, and `toscrape.com` is the domain of the website that we want to scrape.

``` python

# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # domain only: no scheme and no trailing slash
    allowed_domains = ['toscrape.com']
    # urls that we want our spider to visit
    start_urls = ['http://quotes.toscrape.com/random', 'http://quotes.toscrape.com/random']

    def parse(self, response):
        # callback method
        self.log("Visited.." + response.url)
        yield {
            'author_name': response.css('small.author::text').extract_first(),
            'text': response.css('span.text::text').extract_first(),
            'tags': response.css('a.tag::text').extract()
        }
```

Run the spider and save the output to a JSON file:

```
scrapy runspider quotes.py -o items.json
```
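
Once the spider has run, the exported file can be checked with a few lines of plain Python. The sketch below assumes the export is a JSON array of objects with the `author_name`, `text` and `tags` keys yielded by `parse`; the helper name `count_tags` and the sample data are made up for illustration.

```python
import json
from collections import Counter


def count_tags(items):
    """Count how often each tag appears across the scraped items."""
    counts = Counter()
    for item in items:
        # 'tags' is the list yielded by the spider's parse() method
        counts.update(item.get('tags') or [])
    return counts


# Hypothetical sample in the same shape as items.json. For a real run, use:
#   with open('items.json', encoding='utf-8') as f:
#       items = json.load(f)
sample_json = '''[
  {"author_name": "Author A", "text": "Quote one", "tags": ["books", "reading"]},
  {"author_name": "Author B", "text": "Quote two", "tags": ["books"]}
]'''

items = json.loads(sample_json)
tag_counts = count_tags(items)
print(tag_counts.most_common(1))
```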

### Scraping Multiple items from a page

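[This page]() lists 10 quotes, and we want to scrape all of them.

Running `response.css('a.tag::text').extract()` on the whole response returns every tag on the page in one flat list, but we want the tags grouped by quote.

So we first select each quote's container (`div.quote`) and then run the selectors on each container in turn.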

```python

def parse(self, response):
    # callback method: build one item per div.quote block
    quotes = response.css('div.quote')
    for quote in quotes:
        item = {
            'author_name': quote.css('small.author::text').extract_first(),
            'text': quote.css('span.text::text').extract_first(),
            'tags': quote.css('a.tag::text').extract()
        }
        yield item
```

### Pagination Links 

```python

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['toscrape.com']
    # urls that we want our spider to visit
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # callback method
        quotes = response.css('div.quote')
        for quote in quotes:
            item = {
                'author_name': quote.css('small.author::text').extract_first(),
                'text': quote.css('span.text::text').extract_first(),
                'tags': quote.css('a.tag::text').extract()
            }
            yield item

        next_page_url = response.css('li.next > a::attr(href)').extract_first()
        if next_page_url:
            # resolve the relative href, then request the next page
            next_page_url = response.urljoin(next_page_url)
            self.log("Next url: " + next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
        else:
            # the last page has no li.next element, so the crawl stops here
            self.log("No next page after " + response.url)
```

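With pagination in place, the crawl stats printed at the end of the run show 10 requests and 100 scraped items:

```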
{
 'downloader/request_count': 10,
 'item_scraped_count': 100
}
```

### Getting Author Details

If we are already in the Scrapy shell for a particular page, we can use the `fetch()` command to get the response for another page.

We use a different callback to handle the requests to the author pages.

``` python

import scrapy


class AuthorsspiderSpider(scrapy.Spider):
    name = 'AuthorsSpider'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        author_urls = response.css('div.quote > span > a::attr(href)').extract()
         
        for author_url in author_urls:
            url = response.urljoin(author_url)
            # create a request to the url with a separate callback
            print ("Author URL...", url)
            yield scrapy.Request(url = url, callback=self.parse_details)
        next_page_url = response.css('li.next > a::attr(href)').extract_first()
        if next_page_url:
            # only follow the link when a next page actually exists
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

    def parse_details(self, response):
        # extract data from author
        yield {
            # default='' avoids calling .strip() on None when nothing matches
            'name': response.css('h3.author-title::text').extract_first(default='').strip(),
            'birthdate': response.css('span.author-born-date::text').extract_first(default='').strip()
        }
                    
```
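
Without a default, `extract_first()` returns `None` when a selector matches nothing, and `None` has no `.strip()` method. One way to keep missing fields as `None` instead of empty strings is a small helper; `clean_text` is a hypothetical name, and the sketch is plain Python, not part of Scrapy.

```python
def clean_text(value, default=None):
    """Strip surrounding whitespace; pass missing values through as `default`."""
    if value is None:
        return default
    return value.strip()


print(clean_text('  Albert Einstein\n'))
print(clean_text(None))
```

In `parse_details` it would wrap each extraction, for example `clean_text(response.css('h3.author-title::text').extract_first())`.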