## Learn Scrapy

### Getting Started with Web Scraping using Scrapy


[YouTube link](https://www.youtube.com/watch?v=vkA1cWN4DEc&list=PLZyvi_9gamL-EE3zQJbU5N3nzJcfNeFHU)

`scrapy shell http://quotes.toscrape.com/random`

This downloads the page and opens an interactive shell in which the result is available as `response`.

`print(response.text)`

### We can use CSS selectors

```python
In [2]: response.css('small.author')
Out[2]: [<Selector xpath="descendant-or-self::small[@class and contains(concat(' ', normalize-space(@class), ' '), ' author ')]" data='<small class="author" itemprop="author">'>]

In [3]: response.css('small.author').extract()
Out[3]: ['<small class="author" itemprop="author">George R.R. Martin</small>']

In [4]: response.css('small.author::text').extract()
Out[4]: ['George R.R. Martin']

In [5]: response.css('small.author::text').extract()[0]
Out[5]: 'George R.R. Martin'

In [6]: response.css('small.author::text').extract_first()
Out[6]: 'George R.R. Martin'

```

### Getting quote and tag text

```python

In [7]: response.css('span.text::text').extract_first()
Out[7]: '“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”'
    
In [1]: response.css('a.tag::text').extract()
Out[1]: ['books', 'inspirational', 'reading', 'tea']


```

### Creating a spider

`scrapy genspider quotes toscrape.com`

`quotes` is the name of our spider and `toscrape.com` is the domain of the website that we want to scrape.

``` python

# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['toscrape.com']  # domain only, no trailing slash
    # urls that we want our spider to visit
    start_urls = ['http://quotes.toscrape.com/random', 'http://quotes.toscrape.com/random']

    def parse(self, response):
        # callback method
        self.log("Visited.." + response.url)
        yield {
            'author_name': response.css('small.author::text').extract_first(),
            'text': response.css('span.text::text').extract_first(),
            'tags': response.css('a.tag::text').extract()
        }
```

Run the spider and save the output to a JSON file:

```
scrapy runspider quotes.py -o items.json
```

### Scraping Multiple items from a page

A listing page on the site contains many quotes: 10 per page.

We want to scrape all 10 quotes.

If we run `response.css('a.tag::text').extract()`, it returns every tag on the page.

But we want the tags grouped by quote, so we have to select each quote element and extract its fields one by one.

```python

def parse(self, response):
        # callback method
        quotes = response.css('div.quote')
        for quote in quotes:
            item = {
                'author_name': quote.css('small.author::text').extract_first(),
                'text': quote.css('span.text::text').extract_first(),
                'tags': quote.css('a.tag::text').extract()
            }
            yield item
        
```
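
To see the difference between a page-wide query and a per-quote query without running a spider, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors. The markup is a made-up, simplified version of the quotes page, so the element layout and the values are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the quotes page (not the real HTML).
PAGE = """
<div>
  <div class="quote">
    <span class="text">First quote</span>
    <small class="author">Author A</small>
    <a class="tag">books</a>
    <a class="tag">reading</a>
  </div>
  <div class="quote">
    <span class="text">Second quote</span>
    <small class="author">Author B</small>
    <a class="tag">tea</a>
  </div>
</div>
""".strip()

root = ET.fromstring(PAGE)

# Page-wide query: every tag on the page, with no link back to its quote.
all_tags = [a.text for a in root.findall(".//a[@class='tag']")]

# Scoped query: search inside each quote element separately.
items = []
for quote in root.findall(".//div[@class='quote']"):
    items.append({
        "author_name": quote.findtext("small[@class='author']"),
        "text": quote.findtext("span[@class='text']"),
        "tags": [a.text for a in quote.findall("a[@class='tag']")],
    })

print(all_tags)
print(items)
```

The scoped loop mirrors what `quote.css(...)` does in the spider: each query only sees the descendants of one quote element.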

### Pagination Links 

```python

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['toscrape.com']
    # urls that we want our spider to visit
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # callback method
        quotes = response.css('div.quote')
        for quote in quotes:
            item = {
                'author_name': quote.css('small.author::text').extract_first(),
                'text': quote.css('span.text::text').extract_first(),
                'tags': quote.css('a.tag::text').extract()
            }
            yield item

        next_page_url = response.css('li.next > a::attr(href)').extract_first()
        if next_page_url:
            # build an absolute URL and request the next page
            next_page_url = response.urljoin(next_page_url)
            self.log("Next url: " + next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
        else:
            self.log("No next page found, stopping")
```

```
{
 'downloader/request_count': 10,
 'item_scraped_count': 100
}
```
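
One detail of the pagination step is worth checking offline: the extracted `href` is relative, and on the last page there is no "next" link at all. The sketch below uses the standard library's `urljoin` (which, as far as we know, is what `response.urljoin` builds on) to show why the link should be tested before it is joined. The helper name `next_page_url` and the page URLs are only illustrative.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Return the absolute URL of the next page, or None on the last page."""
    if not href:
        # No "next" link was extracted, so there is nothing to follow.
        return None
    return urljoin(current_url, href)


# A relative link is resolved against the current page URL.
print(next_page_url("http://quotes.toscrape.com/page/1/", "/page/2/"))

# On the last page the selector finds nothing, so we stop.
print(next_page_url("http://quotes.toscrape.com/page/10/", None))

# Joining an empty link without the check just returns the current URL,
# so a truthiness test on the joined URL would never be false.
print(urljoin("http://quotes.toscrape.com/page/10/", ""))
```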

### Getting Author Details

If we are already in the Scrapy shell for a particular page, we can use the `fetch()` command to get the response from another page.

We use a different callback to handle the author page requests.

``` python

import scrapy


class AuthorsspiderSpider(scrapy.Spider):
    name = 'AuthorsSpider'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        author_urls = response.css('div.quote > span > a::attr(href)').extract()
         
        for author_url in author_urls:
            url = response.urljoin(author_url)
            # create a request to the url with a separate callback
            print ("Author URL...", url)
            yield scrapy.Request(url = url, callback=self.parse_details)
        next_page_url = response.css('li.next > a::attr(href)').extract_first()
        if next_page_url:
            # follow the pagination link with the same callback
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def parse_details(self, response):
        # extract data from author
        yield {
            'name':response.css('h3.author-title::text').extract_first().strip(),
            'birthdate': response.css('span.author-born-date::text').extract_first().strip()
        }
                    
```
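
Note that `extract_first()` returns `None` when the selector matches nothing, in which case calling `.strip()` on the result raises an `AttributeError`. A small helper can guard against that; the name `clean` is hypothetical and not part of Scrapy.

```python
def clean(value):
    """Strip surrounding whitespace, passing None through unchanged."""
    return value.strip() if value is not None else None


print(clean("  George R.R. Martin\n"))  # whitespace removed
print(clean(None))                      # stays None instead of raising
```

In `parse_details` this would be used as `clean(response.css('h3.author-title::text').extract_first())`.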

### Infinite Scrolling Pages

For infinite scrolling pages, we can see the requests being made to the server in the Network tab of the browser's dev tools

![](./img/diag1.png)

Also we can see the actual data received from the server in the Network -> Preview tab

This data is already structured in JSON format

We can easily parse `response.text` into a Python dictionary

```python

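# run in a terminal first; the JSON comes from the API endpoint, not from /scroll itself:
#   scrapy shell 'http://quotes.toscrape.com/api/quotes?page=1'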

import json

data = json.loads(response.text)

print(data.keys())

```

`Out[12]: dict_keys(['has_next', 'page', 'quotes', 'tag', 'top_ten_tags'])`

`has_next` tells us when to stop making requests

`page` tells us which page we are currently on
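
The way `has_next` and `page` drive the crawl can be sketched without any network access. In the sketch below, `FAKE_PAGES` and `fetch` are hypothetical stand-ins for the real API responses; only the key names are taken from the output above, and the values are made up.

```python
import json

# Hypothetical responses that reuse the key names seen above; the values are made up.
FAKE_PAGES = {
    1: json.dumps({
        "has_next": True,
        "page": 1,
        "quotes": [{"author": {"name": "Author A"}, "text": "Quote 1", "tags": ["books"]}],
    }),
    2: json.dumps({
        "has_next": False,
        "page": 2,
        "quotes": [{"author": {"name": "Author B"}, "text": "Quote 2", "tags": []}],
    }),
}


def fetch(page):
    """Stand-in for an HTTP request: return the JSON text for one page."""
    return FAKE_PAGES[page]


def crawl():
    """Collect items page by page until has_next is false."""
    items, page = [], 1
    while True:
        data = json.loads(fetch(page))
        for quote in data["quotes"]:
            items.append({
                "author_name": quote["author"]["name"],
                "text": quote["text"],
                "tags": quote["tags"],
            })
        if not data["has_next"]:
            break
        # the next page number is derived from the current one
        page = data["page"] + 1
    return items


items = crawl()
print(len(items), "items scraped")
```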

**Thus, for any web scraping task, including infinite scrolling, look for the requests the browser is making under the Network tab**

This will help us find the underlying API, which we can then use directly

```python


import json

import scrapy


class QuotesscrollSpider(scrapy.Spider):
    name = 'QuotesScroll'
    allowed_domains = ['toscrape.com']
    api_url = 'http://quotes.toscrape.com/api/quotes?page={}'
    start_urls = [api_url.format(1)]

    def parse(self, response):
        # parse the response data

        data = json.loads(response.text)

        for quote in data['quotes']:
            yield {
                'author_name': quote['author']['name'],
                'text': quote['text'],
                'tags': quote['tags']
            }
        # check if next_page available

        if data['has_next']:
            next_page = data['page'] + 1
            # generate new req for next page
            yield scrapy.Request(url = self.api_url.format(next_page), callback=self.parse)
                
```


### POST requests using Scrapy

On the quotes website, the author's Goodreads link is only visible if we are logged in

Our spider will

1. Log in to the site with a username and password

2. Scrape the info

As before, open the Network tab on the login page and look at the requests being made

We see that the browser made a `POST` request to the URL `http://quotes.toscrape.com/login`

The request URL is not an API, it is simply the login page

Scrolling down, we can also see from the Form Data that the browser sent 3 params to the server

- csrf_token

- username

- password
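
Assuming the login form is submitted with the usual `application/x-www-form-urlencoded` encoding (the default for HTML forms), these three params end up in the request body as a URL-encoded string. A minimal sketch with the standard library, using placeholder values:

```python
from urllib.parse import parse_qs, urlencode

# Placeholder values; a real token has to be read from the login page.
form_data = {
    "csrf_token": "example-token",
    "username": "some_user",
    "password": "some_password",
}

# This is the kind of body the browser sends with the POST request.
body = urlencode(form_data)
print(body)

# Decoding the body gives back the fields shown under Form Data.
decoded = parse_qs(body)
print(decoded)
```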

On inspecting the login page, we find that along with the username and password fields there is a hidden input for the CSRF token

If we inspect the page, delete the CSRF token field and submit the form, it throws an error message

Also, every time the page is reloaded, we get a new token

Method:

1. The spider downloads the login page
2. It extracts the CSRF token from the page and adds it to the form data that we want to submit
3. It creates a request to the action URL from the form
4. After logging in, it gets the author name and the Goodreads URL

The CSRF token field looks like this:

```html
<input type="hidden" name="csrf_token" value="qFTafnKpvDlCGAMjytSQbNeZiVOmgPUkJhIYLrwHEuXcRzsWBoxd">
```

```python

In [2]: response.css('input[name = "csrf_token"]::attr(value)').extract_first()
Out[2]: 'okaYKGwTDHJvryUlcINeWzfFChxMbgVOmqBnLZQSPsARjtdEuiXp'

```
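
The same extraction can be tried outside the Scrapy shell. The sketch below is not how Scrapy does it: it swaps in the standard library's `html.parser` and a made-up login form, so the class name `HiddenInputParser`, the markup and the token value are all assumptions.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect the values of hidden <input> fields, keyed by name."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden":
            self.hidden[attrs.get("name")] = attrs.get("value")


# Hypothetical login form modelled on the hidden input shown above.
LOGIN_HTML = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="example-token">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

parser = HiddenInputParser()
parser.feed(LOGIN_HTML)
print(parser.hidden["csrf_token"])
```

Because the token changes on every page load, it has to be read from the same response that the login request is built from.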

```python

def parse(self, response):
        # get csrf token
        token = response.css('input[name = "csrf_token"]::attr(value)').extract_first()
        print("CSRF TOKEN IS...", token)
        # create dict with data we want to send to server

        data = {
            'csrf_token': token,
            'username': 'shaunak1105',
            'password': 'abcd'
        }

        # submit a POST request; pass the callback itself, do not call it
        # (login_url is assumed to be a class attribute holding the login page URL)

        yield scrapy.FormRequest(url=self.login_url, formdata=data, callback=self.parse_quotes)

```

Next we write the code to extract author info

- [CSS selectors](https://www.w3schools.com/csSref/css_selectors.asp)

- [XPath](http://www.zvon.org/comp/r/tut-XPath_1.html#Pages~Start_with_%2F%2F)

```python

def parse_quotes(self, response):
        # parse the page after the spider is logged in

        # for each quote, extract the author name and the author's goodreads link
        for quote in response.css("div.quote"):
            yield {
                'author_name': quote.css("small.author::text").extract_first(),
                'author_url': quote.css(
                    'small.author ~ a[href*="goodreads.com"]::attr(href)'
                ).extract_first()
            }
                
```
