<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Web Scraping and Spiders with `scrapy`


---



<a id='scrapy'></a>
<a id='scrapy-spiders'></a>
## What is [Scrapy](http://scrapy.org/)?

---

> *"Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them."*

Below we will walk through the creation of a **spider** using Scrapy. Spiders are automated processes that crawl through one or more webpages and collect information.

> **Note:** This code should be written in a script outside of a Jupyter notebook.

<a id='scrapy-project'></a>
### 1. Create a new Scrapy project

In your terminal, `cd` into the directory where you want to create your crawler's folder.  I recommend the desktop, for easy access to the files we will need to edit.
> `scrapy startproject gumtree`

**This should produce output that looks like this:**
<blockquote>
```
New Scrapy project 'gumtree', using template directory '/Users/XXXX/anaconda3/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/aymericflaisler/gumtree

You can start your first spider with:
    cd gumtree
    scrapy genspider example example.com
```
</blockquote>

**That command generates a set of project files:**
<blockquote>
    

```
gumtree/
    scrapy.cfg
    gumtree/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
```
</blockquote>



Generally, these are our files.  We will go into more detail on these soon.

 * **`scrapy.cfg`:** the project configuration file
 * **`gumtree/`:** the project’s Python module; you’ll import your code from here later.
 * **`gumtree/items.py`:** the project’s items file.
 * **`gumtree/pipelines.py`:** the project’s pipelines file.
 * **`gumtree/settings.py`:** the project’s settings file.
 * **`gumtree/spiders/`:** a directory where you’ll later put your spiders.
 

<p style="color:red;font-size:49px" > Warning!</p> <br>

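Long story, but in short this disables Scrapy's S3 download handler, which we do not need here and which has been known to cause errors on some setups. Please add this line to your `gumtree/settings.py` file before continuing: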
 
 <blockquote>
 ```
 DOWNLOAD_HANDLERS = {'s3': None,}
 ```
 </blockquote>


<p style="color:red;font-size:49px" > Warning (again)!</p> <br>

Some websites block Scrapy. Let's hide the fact that our requests come from the framework.

In the same file, replace the line starting with `#USER_AGENT` with:
 
 <blockquote>
 ```
 USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:10.0) Gecko/20100101 Firefox/10.0'
 ```
 </blockquote>


--- 
<a id='define-item'></a>
### 2. Define an "item"

When we define an item, we are telling our new application what it will be collecting.  In essence, an "item" is an entity with attributes (e.g. "title", "description", "price") that describe and relate to elements on the pages we will be scraping.  

In more precise terms, this is a model (remember the MVC framework from the HTML lesson? ;) ).  Don't worry if this is a foreign concept.  The main idea to understand is that a model has attributes that closely resemble / relate to elements on our target web page(s).

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

# Add this in the items.py file:

import scrapy

class GumtreeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    price = scrapy.Field()
```


---

<a id='spider-crawl'></a>
### 3. A spider that crawls

An item is a model that resembles data on a webpage.  A spider is something that crawls pages and uses our item model to get and hold items for us.

**Scrapy spiders are python classes.  Let's write our first file, called `gumtree_spider.py` and put it in our `/spiders` directory:**

```python
import scrapy

class gumtreeSpider(scrapy.Spider):
    name = "gumtree"
    allowed_domains = ["gumtree.com"]
    start_urls = [
        "https://www.gumtree.com/cars/london/bmw+3+series"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)
```

**Next, let's dive in and crawl from our `/gumtree/gumtree` directory:**

```
> scrapy crawl gumtree
```

**What just happened?**
 * Our application requested the URLs from the `start_urls` class attribute.
 * It then ran `parse` over the HTML content returned for each requested URL.
 * What else?
 
```python
    with open(filename, 'wb') as f:
        f.write(response.body)
```

It saved a file in the directory we ran the crawl from, named after the last segment of the URL.  In our case, that is a file called `bmw+3+series`.  This is taken directly from the Scrapy docs, and its only point is to illustrate the workflow so far.  It is kind of nice to have a reference to our HTML file though.  
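
To see how the file name is derived, the same `split` expression can be tried on its own, without running a crawl. This is only a small illustration; the `index.html` fallback is a hypothetical addition, not something the spider above does.

```python
# The start URL used by the spider above
url = "https://www.gumtree.com/cars/london/bmw+3+series"

# Same expression as in parse(): keep everything after the last "/"
filename = url.split("/")[-1]
print(filename)  # bmw+3+series

# A URL ending in "/" would give an empty name, so a fallback can help
trailing = "https://www.gumtree.com/cars/"
safe_name = trailing.split("/")[-1] or "index.html"
print(safe_name)  # index.html
```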

There might be some errors listed when we crawl, but they are fine for now.

--- 
<a id='xpath-spider'></a>
### 4. XPath + parsing with our spider

So far, we've defined what fields we'll get, some urls to fetch, and saved some content to a file.  Let's actually do something interesting.

**We should let our spider know about the item model we made earlier.  At the top of `gumtree/gumtree/spiders/gumtree_spider.py`, let's add two new imports:**

```python
from gumtree.items import GumtreeItem
from scrapy.selector import Selector
```


<br><br><br>
**Let's replace our parse method to find some data in our gumtree spider's response and map it to our item model, `GumtreeItem`:**

```python
def parse(self, response):
    items = []  # list that collects the scraped items
    # Selector wraps the response so we can query its HTML with XPath
    gt_ = Selector(response)
    for elt in gt_.xpath('//article[@class="listing-maxi"]/a'):
        item = GumtreeItem()
        item['title'] = elt.xpath(
            'div[@class="listing-content"]/h2[@class="listing-title"]/text()').extract()
        item['link'] = elt.xpath('@href').extract()
        item['price'] = elt.xpath(
            'div[@class="listing-content"]/span/meta[@itemprop="price"]/@content').extract()

        items.append(item)
    return items  # Scrapy logs each returned item in the terminal output

```



---

<a id='save-examine'></a>
### Save and examine our scraped data

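Scrapy can export our crawled data to several formats, including CSV.  To save our data, we just need to pass a few optional parameters to our crawl call: `-o` names the output file and `-t` sets the format.  Recent Scrapy releases infer the format from the file extension and deprecate `-t`, so `-o items.csv` alone should be enough there.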

<blockquote>
```
scrapy crawl gumtree -o items.csv -t csv
```
</blockquote>

It's always good to iteratively check our data when developing a spider to make sure it's close to what we want. 

> *Pro tip:  The longer your iterations are between checks, the harder it's going to be to understand what's not working and fix bugs.*

You should now have a file called '`items.csv`' in the directory you ran the `scrapy crawl` command from.
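
One lightweight way to examine the export is Python's built-in `csv` module. The sketch below reads from an in-memory stand-in with made-up rows so that it runs anywhere; it assumes the export has a header row naming the item fields, and the row values are hypothetical.

```python
import csv
import io

# Stand-in for items.csv with made-up rows and a header naming the fields
sample = io.StringIO(
    "link,price,title\n"
    "/p/bmw/example-listing/123,7995,BMW 3 Series 320d\n"
    "/p/bmw/example-listing/456,,BMW 3 Series 318i\n"
)

# For the real file, use: open("items.csv", newline="", encoding="utf-8")
rows = list(csv.DictReader(sample))

print(len(rows), "rows")
print(rows[0]["title"], rows[0]["price"])

# Rows with an empty price are a quick sign that an XPath missed something
missing_price = [r["link"] for r in rows if not r["price"]]
print(missing_price)
```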

---
<a id='follow-links'></a>
### Following links for more results

100 results is pretty cool but what if we want more?  We need to follow the "next" links, and find new pages to grab.  Using the **`parse()`** method of our spider class, we only need to return another type of object alongside our items: a `Request` for the next page.

cf.: https://scrapy.readthedocs.io/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions

```python
import scrapy
from scrapy.selector import Selector

from gumtree.items import GumtreeItem


class gumtreeSpider(scrapy.Spider):
    name = "gumtree"
    allowed_domains = ["gumtree.com"]
    start_urls = [
        "https://www.gumtree.com/cars/london/bmw+3+series"
    ]

    def parse(self, response):
        # Selector wraps the response so we can query its HTML with XPath
        gt_ = Selector(response)

        for elt in gt_.xpath('//article[@class="listing-maxi"]/a'):
            item = GumtreeItem()
            item['title'] = elt.xpath(
                'div[@class="listing-content"]/h2[@class="listing-title"]/text()').extract()
            item['link'] = elt.xpath('@href').extract()
            item['price'] = elt.xpath(
                'div[@class="listing-content"]/span/meta[@itemprop="price"]/@content').extract()
            # Hand each item to Scrapy as soon as it is built
            yield item

        # Does the next page exist?  Let's get it!
        next_page = gt_.xpath(
            '//li[@class="pagination-next"]/a/@href').extract_first()

        # Crude cap: stop once the next link contains a '5' (around page 5)
        if next_page and '5' not in next_page:
            # urljoin resolves a relative href against the current page URL
            url = response.urljoin(next_page)
            # Yielding a Request makes Scrapy fetch it and call parse() again
            yield scrapy.Request(url, callback=self.parse)
```
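
The "next" link may be a root-relative path or a full URL, so it is worth seeing how the two cases resolve. The sketch below uses only the standard library's `urllib.parse.urljoin`, which Scrapy's `response.urljoin` builds on; the `page2` hrefs are hypothetical examples, not values taken from the site.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed
current = "https://www.gumtree.com/cars/london/bmw+3+series"

# Hypothetical href values a "next" link might carry
relative_href = "/cars/london/bmw+3+series/page2"
absolute_href = "https://www.gumtree.com/cars/london/bmw+3+series/page2"

# Naive concatenation doubles the slash for a root-relative href
naive = "https://www.gumtree.com/" + relative_href
print(naive)

# urljoin resolves both forms to the same absolute URL
from_relative = urljoin(current, relative_href)
from_absolute = urljoin(current, absolute_href)
print(from_relative)
print(from_absolute)
```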
