<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Web Scraping and Spiders with `scrapy`

_Authors: Dave Yerrington (SF)_

---

### Learning Objectives
- Understand the structure and content of HTML
- Learn about elements, attributes, and element hierarchy in HTML
- Learn about XPath and using multiple and singular selections
- Practice using Scrapy to get data from craigslist
- Practice using Beautiful Soup to parse data from craigslist
- Walk through the construction of a spider built using scrapy


### Lesson Guide
- [Introduction](#introduction)
- [HTML](#html)
    - [Elements](#elements)
    - [Attributes](#attributes)
    - [Element hierarchy](#element-hierarchy)
    - [More resources on HTML structure](#html-resources)
- [What is XPath?](#xpath)
    - [Multiple selections](#multiple-selections)
    - [Singular selections](#singular-selections)
- [A simple `scrapy` example](#scrapy)
- [A practical example with Requests and Beautiful Soup](#practical)
    - [Step 1: fetch the content by URL](#step1)
    - [Step 2: Parse HTML document with Beautiful Soup](#step2)
    - [Practice: can you select the price of our junker?](#practice)
- [Scrapy and spiders](#scrapy-spiders)
    - [Create a Scrapy project](#scrapy-project)
    - [Define an "item"](#define-item)
    - [A spider that crawls](#spider-crawl)
    - [XPath and parsing with our spider](#xpath-spider)
    - [Save and examine our scraped data](#save-examine)
- [Addendum: leveraging XPath to get more results](#addendum)
    - [Following links](#follow-links)

<a id='introduction'></a>

![What is Html](http://designshack.designshack.netdna-cdn.com/wp-content/uploads/htmlbasics-0.jpg)

One of the largest sources of data in the world is all around us.  We consume the web in some form every day.  One of the most powerful python toolsets we will learn allows us to extract and normalize data from unstructured sources like webpages.  

**If you can see it, it can be scraped, mined, and put into a dataframe.**

Before we begin the actual process of webscraping with python, it is important to cover the basic constructs that describe HTML as unstructured data. 

Then we will cover a powerful selection technique called XPath, and look at a basic workflow using a framework called [Scrapy](http://www.scrapy.org).

<a id='html'></a>

## Hypertext markup language (HTML)

---

In the HTML DOM (Document Object Model), everything is a node:
 * The document itself is a document node.
 * All HTML elements are element nodes.
 * All HTML attributes are attribute nodes.
 * Text inside HTML elements is stored in text nodes.
 * Comments are comment nodes.
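
To make the node types concrete, here is a minimal sketch that uses Python 3's standard-library `html.parser` to label each piece of a small fragment. The markup and the `NodeLister` class are made up for this illustration, and the parser reports a stream of events rather than building a real DOM tree, so treat it as an approximation of the idea.

```python
from html.parser import HTMLParser


class NodeLister(HTMLParser):
    """Record each piece of markup the parser encounters, tagged with its kind."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append(("element", tag))
        for name, value in attrs:
            self.nodes.append(("attribute", f"{name}={value}"))

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text
            self.nodes.append(("text", data.strip()))

    def handle_comment(self, data):
        self.nodes.append(("comment", data.strip()))


# Hypothetical markup, invented for this sketch
sample = '<p class="intro">Hello <!-- a note --><strong>world</strong></p>'

lister = NodeLister()
lister.feed(sample)
lister.close()

for kind, value in lister.nodes:
    print(kind, "->", value)
```

Each printed line pairs a node kind from the list above with the piece of markup it came from.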

<a id='elements'></a>
### Elements
Elements begin and end with opening and closing "tags".  A tag is the element's name enclosed in angle brackets, and the closing tag adds a leading slash.

```
<title>I am a title.</title>
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```

**The opening and closing tags use the same name, like so:**  `<p></p>`

**Elements can have parents and children:**

```
<body>
    <div>I am inside the parent element
        <div>I am inside a child element</div>
        <div>I am inside another child element</div>
        <div>I am inside yet another child element</div>
    </div>
</body>
```

<a id='attributes'></a>
### Attributes

HTML elements can have attributes.  Attributes describe the properties and characteristics of elements.  Some affect how the element behaves or how the browser renders it.

One of the most common elements is the "anchor" element.  Anchor elements usually have an "href" attribute, which tells the browser where to go when the link is clicked.  Anchors are typically styled differently from the surrounding text, and are often underlined, as a visual cue that they are links.

**Markup that describes an element with attributes literally looks like this:**

```
<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>
```

**However, this element, once rendered, looks like this**

[An Awesome Website](https://www.youtube.com/watch?v=dQw4w9WgXcQ)
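
As a rough illustration of how code sees the same anchor, the sketch below reads its attribute and text with Python 3's standard-library `xml.etree.ElementTree`. This only works here because the one-line fragment happens to be well-formed XML; real pages generally need an HTML parser such as the ones used later in this lesson.

```python
import xml.etree.ElementTree as ET

# The anchor markup from above. ElementTree is an XML parser, so this
# relies on the fragment being well-formed XML.
markup = '<a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">An Awesome Website</a>'

anchor = ET.fromstring(markup)

print(anchor.tag)          # the element name
print(anchor.attrib)       # every attribute, as a dict
print(anchor.get("href"))  # one attribute value
print(anchor.text)         # the text between the tags
```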

<a id='element-hierarchy'></a>
### Element hierarchy

![Nodes](http://www.computerhope.com/jargon/d/dom1.jpg)

**Literally Represented:**

```
<html>
    
    <head>
        <title>Example</title>
    </head>
    
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body>
    
</html>
```
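
The parent/child relationships in this page can be walked programmatically. The following is a small sketch, assuming Python 3 and the standard-library `xml.etree.ElementTree` (which can parse this particular page because it is well-formed XML). The `outline` helper is a hypothetical name introduced for this example.

```python
import xml.etree.ElementTree as ET

page = """
<html>
    <head>
        <title>Example</title>
    </head>
    <body>
        <h1>Example Page</h1>
        <p>This is an example page.</p>
    </body>
</html>
"""


def outline(element, depth=0):
    """Return one indented line per element, listing parents before children."""
    lines = ["    " * depth + element.tag]
    for child in element:  # iterating an element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(page.strip())
print("\n".join(outline(root)))
```

The indentation of the printed outline mirrors the nesting in the markup: `title` sits under `head`, while `h1` and `p` sit under `body`.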

<a id='html-resources'></a>
### You are now qualified HTML experts

![](http://hpcc.advancingexpertcare.org/wp-content/uploads/2014/10/certified.jpg)

Your HTML learning can continue...

Read all about the different elements supported amongst modern browsers:
 * [HTML5 Cheatsheet](http://websitesetup.org/html5-cheat-sheet/)
 * [Mozilla HTML Element Reference](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)
 * [HTML5 Visual Cheatsheet](http://www.unitedleather.biz/PDF/HTML5-Visual-Cheat-Sheet1.pdf)
 

<a id='xpath'></a>

## What is XPath?

---

![](http://img.crx4chrome.com/63/4c/b1/hgimnogjllphhhkhlmebbmlgjoejdpjl-screenshot.jpg)

Understanding how to identify elements and attributes within HTML documents gives us the capability to write simple expressions that create structured data.

To make this process easier to deal with, we will be using XPath helper, which is a Chrome addon.  It's not necessary, but highly recommended to help build XPath expressions.

[XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en)

XPath expressions can select elements, element attributes, and element text.  A selection can match either a single item or multiple items.  Generally, if you're not specific enough, you will end up selecting multiple elements.


<a id='multiple-selections'></a>
### Multiple selections

***Multiple selections*** are useful for capturing search results, or any repeating element.  For instance, the _titles_ in the apartment listing search results on Craigslist:


**URL**

[http://sfbay.craigslist.org/search/sfc/apa](http://sfbay.craigslist.org/search/sfc/apa)


**HTML Markup**
```
...
<span class="pl"> 
    <time datetime="2016-01-12 23:27" title="Tue 12 Jan 11:27:35 PM">Jan 12</time> 
    <a href="/sfc/apa/5400584579.html" data-id="5400584579" class="hdrlnk">Welcome home to a sweetly renovated four bedroom one and a half bath</a> 
</span>
...
```

**XPath - Multiple Titles**
```
//a[@class='hdrlnk']
```

**Returns (Ad Titles)**
```
***New Remodeled two bedroom Apartment***
WONDERFUL ONE BR APARTMENT HOME
Beautiful 1bed/1bath Apartment in Russian Hill NO SECURITY DEPOSIT
Knockout SF View|Green Oasis|Private Driveway|Furnished
3BR/3BA Spacious, Beautiful SOMA Loft: 5 month lease
Nob Hill Large Studio - Light, Quiet, Lovely Building
etc...
```
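
A quick way to try a multiple selection without a browser is the limited XPath support in Python 3's standard-library `xml.etree.ElementTree`. This is only a sketch: ElementTree implements a subset of XPath and needs well-formed markup, and the second listing's id and timestamp below are invented for the example.

```python
import xml.etree.ElementTree as ET

# Two listings shaped like the markup above. The second listing's id and
# timestamp are made up; the titles are taken from the sample results.
results = """
<div>
    <span class="pl">
        <time datetime="2016-01-12 23:27">Jan 12</time>
        <a href="/sfc/apa/5400584579.html" class="hdrlnk">WONDERFUL ONE BR APARTMENT HOME</a>
    </span>
    <span class="pl">
        <time datetime="2016-01-12 23:25">Jan 12</time>
        <a href="/sfc/apa/0000000000.html" class="hdrlnk">Nob Hill Large Studio - Light, Quiet, Lovely Building</a>
    </span>
</div>
"""

root = ET.fromstring(results.strip())

# ElementTree wants a leading "." to mean "anywhere below this element"
links = root.findall(".//a[@class='hdrlnk']")
titles = [link.text for link in links]

print(titles)
```

Because the expression only constrains the tag name and class, every matching anchor is returned, one per listing.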

<a id='singular-selections'></a>

### Singular selections

***Singular selections*** are necessary when you want to grab specific, unique text within elements.  Here's an example of a details page on Craigslist:

> *Note: this example may have expired if you view it after Jan 12th, 2016. Please replace it with a current Craigslist listing!*

**URL**

(Only $8000!)
[http://sfbay.craigslist.org/sfc/apa/5400585892.html](http://sfbay.craigslist.org/sfc/apa/5400585892.html)

**HTML Markup**

```
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800" class="xh-highlight">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892" class="tsb">email to friend</a></p>
    <p class="postinginfo"><a class="bestof-link" data-flag="9" href="https://post.craigslist.org/flag?flagCode=9&amp;postingID=5400585892" title="nominate for best-of-CL"><span class="bestof-icon">♥ </span><span class="bestof-text">best of</span></a> <sup>[<a href="http://www.craigslist.org/about/best-of-craigslist">?</a>]</sup>    </p>
</div>
```

**XPath - Single Item**

```
//p[@class='postinginfo'][2]/time
```
**Returns (Time of posting)**
```
2016-01-12 11:23pm
```
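
The same singular selection can be approximated with Python 3's standard-library `xml.etree.ElementTree`. This is a sketch under a few assumptions: the markup is a trimmed, well-formed copy of the posting info above, and ElementTree's XPath subset needs a leading `.` on the expression.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the posting info markup shown above
posting = """
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892" class="tsb">email to friend</a></p>
</div>
"""

root = ET.fromstring(posting.strip())

# find() returns the first match (or None); findall() returns every match
posted = root.find(".//p[@class='postinginfo'][2]/time")
all_paragraphs = root.findall(".//p[@class='postinginfo']")

print(posted.text)
print(posted.get("datetime"))
print(len(all_paragraphs))
```

Dropping the `[2]` index and the trailing `/time` turns the singular selection back into a multiple one, which is why `all_paragraphs` holds every `postinginfo` paragraph.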

<a id='scrapy'></a>

## A simple `scrapy` example

---

Below is an example of how to get information out of some fake HTML using the XPath capabilities of the `scrapy` package. You will likely need to install the scrapy package using `conda` or `pip`.

In [1]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

HTML = """
<div class="postinginfos">
    <p class="postinginfo">post id: 5400585892</p>
    <p class="postinginfo">posted: <time datetime="2016-01-12T23:23:19-0800" class="xh-highlight">2016-01-12 11:23pm</time></p>
    <p class="postinginfo"><a href="https://accounts.craigslist.org/eaf?postingID=5400585892" class="tsb">email to friend</a></p>
    <p class="postinginfo"><a class="bestof-link" data-flag="9" href="https://post.craigslist.org/flag?flagCode=9&amp;postingID=5400585892" title="nominate for best-of-CL"><span class="bestof-icon">♥ </span><span class="bestof-text">best of</span></a> <sup>[<a href="http://www.craigslist.org/about/best-of-craigslist">?</a>]</sup>    </p>
</div>
"""

# 1. Select the span by its class attribute, then extract its text
best = Selector(text=HTML).xpath("//span[@class='bestof-text']/text()").extract()

# 2. Select the span by matching on the text it contains
best = Selector(text=HTML).xpath("//span[contains(text(), 'best of')]/text()").extract()

# 3. Select the parent anchor by absolute path, then run a relative query from it
best        =  Selector(text=HTML).xpath("/html/body/div/p/a[@class='bestof-link']")
nested_best =  best.xpath("./span[@class='bestof-text']/text()").extract()
nested_best

[u'best of']

<a id='practical'></a>

## A Practical Example with Requests + Beautiful Soup

---

Please make sure that the required packages are installed: 

```bash
# beautiful soup:
> conda install bs4 
> conda install lxml

# or if conda doesn't work
> pip install bs4
> pip install lxml
```

Here's another posting for a sweet ride on Craigslist (as of 04/29/2016):

![](http://images.craigslist.org/00x0x_hMg0axS9t35_600x450.jpg)

> *Note: you will need to update this to a current/working craigslist post.*

https://merced.craigslist.org/cto/6034381423.html

<a id='step1'></a>
### Step 1: fetch the content by URL



In [6]:
import requests
from bs4 import BeautifulSoup
# from lxml import html

url = "https://merced.craigslist.org/cto/6034381423.html"
response = requests.get(url)

# Pull HTML string out of requests
html = response.text

# The first 500 characters of the content
print("\nFirst part of HTML document fetched as string:\n")
print(html[:500])


First part of HTML document fetched as string:

<!DOCTYPE html>
<html class="no-js">
<head>
<title>1999 saturn</title>
    	<link rel="canonical" href="http://merced.craigslist.org/cto/6034381423.html">
	<meta name="description" content="1999 Saturn. Parts car only. Pick up only. Motor and tranny are in great condition. Brand new tires. Aluminum wheels. $600 or best offer. ">
	<meta name="robots" content="noarchive,nofollow,unavailable_after:Thursday, 06-Apr-17 19:59:20 PDT">
	<meta name="twitter:card" content="preview">
	<meta property="og:d


<a id='step2'></a>
### Step 2: Parse HTML document with Beautiful Soup

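This step parses the HTML string into a tree of objects that we can navigate and search. Note that Beautiful Soup does not evaluate XPath expressions; it has its own navigation and search methods, such as `findAll()`.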

In [7]:
soup = BeautifulSoup(html, 'lxml')

> **Note:** There are many ways to get the elements in a "soup" object

Here are a few ways to select HTML elements as "objects" within "soup" as a document.

In [8]:
# Singular element
soup.html.title

<title>1999 saturn</title>

In [9]:
# Just the text between elements
print(soup.html.title.text)

1999 saturn


In [10]:
# Plural / repeating elements: loop over every <meta> tag in the document
for meta in soup.findAll("meta"):
    print(meta)

In [11]:
# Find all elements that match a tag name (first parameter)
# and a dict of attribute values (second parameter)
element = soup.findAll("a", {"class": "header-logo"})
element[0].text

u'CL'

In [12]:
price_search = soup.findAll('span', {"class": "price"})
price_search[0].text

u'$600'

In [13]:
response = requests.get("http://sfbay.craigslist.org/search/sfc/apa")

In [14]:
soup = BeautifulSoup(response.text, 'lxml')
search_titles = soup.findAll("a", {"class": "hdrlnk"})

In [15]:
for link in search_titles[0:5]:
    print(link.attrs)

{'data-id': '6046406176', 'href': '/sfc/apa/6046406176.html', 'class': ['result-title', 'hdrlnk']}
{'data-id': '6038233283', 'href': '/sfc/apa/6038233283.html', 'class': ['result-title', 'hdrlnk']}
{'data-id': '6026983369', 'href': '/sfc/apa/6026983369.html', 'class': ['result-title', 'hdrlnk']}
{'data-id': '6030386716', 'href': '/sfc/apa/6030386716.html', 'class': ['result-title', 'hdrlnk']}
{'data-id': '6046405539', 'href': '/sfc/apa/6046405539.html', 'class': ['result-title', 'hdrlnk']}
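
The `href` values in this output are relative to the site root, so they need to be combined with the search URL before they can be requested. A minimal sketch, assuming Python 3 (where `urljoin` lives in `urllib.parse`):

```python
from urllib.parse import urljoin

search_url = "http://sfbay.craigslist.org/search/sfc/apa"

# href values copied from the attrs output above
hrefs = ["/sfc/apa/6046406176.html", "/sfc/apa/6038233283.html"]

# urljoin resolves each relative href against the page it was found on
full_urls = [urljoin(search_url, href) for href in hrefs]

for full_url in full_urls:
    print(full_url)
```

Because each `href` starts with `/`, it replaces the whole path of the search URL and keeps only the scheme and host.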


> **Check:** How do we know which parameters `findAll()` takes?

<a id='practice'></a>

### Practice: can you select the price of our junker?  

 - Use XPath Helper to get an idea of where the element is within the HTML document.
 - Try to select using the soup.html.body.something.something method.
 - Try using findAll() to find a concise element.

<a id='scrapy-spiders'></a>
## What is [Scrapy](http://scrapy.org/)?

---

> *"Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them."*

Below we will walk through the creation of a **spider** using Scrapy. Spiders are automated processes that crawl one or more web pages and collect information.

> **Note:** This code should be written in a script outside of jupyter notebook.

<a id='scrapy-project'></a>
### 1. Create a new Scrapy project

> `scrapy startproject craigslist`

**Should create output that looks like this:**
<blockquote>
```
2016-01-13 00:12:45 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2016-01-13 00:12:45 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-01-13 00:12:45 [scrapy] INFO: Overridden settings: {}
New Scrapy project 'craigslist' created in:
    /Users/davidyerrington/virtualenvs/data/scraping/craigslist

You can start your first spider with:
    cd craigslist
    scrapy genspider example example.com
```
</blockquote>

**That command generates a set of project files:**
<blockquote>
```
craigslist/
    scrapy.cfg
    craigslist/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
```
</blockquote>

Generally, these are our files.  We will go into more detail on these soon.

 * **`scrapy.cfg`:** the project configuration file
 * **`craigslist/`:** the project’s python module, you’ll later import your code from here.
 * **`craigslist/items.py`:** the project’s items file.
 * **`craigslist/pipelines.py`:** the project’s pipelines file.
 * **`craigslist/settings.py`:** the project’s settings file.
 * **`craigslist/spiders/`:** a directory where you’ll later put your spiders.
 
Long story, but please add this line to your craigslist/settings.py file before continuing:
 
 <blockquote>
 ```
 DOWNLOAD_HANDLERS = {'s3': None,}
 ```
 </blockquote>



--- 
<a id='define-item'></a>
### 2. Define an "item"

Basically, when we define an item, we are telling our new application what it will be collecting.  In essence, an "item" is an entity that has attributes (e.g., "title", "description", "price") that are descriptive and relate to elements on the pages we will be scraping.

In more precise terms, this is a model (for those who are familiar with ORM or relational database terms).  Don't worry if this is a foreign concept.  The main idea to understand is that a model has attributes that closely resemble / relate to elements on our target web page(s).

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class CraigslistItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
```


---

<a id='spider-crawl'></a>
### 3. A spider that crawls

An item is a model that resembles data on a webpage.  A spider is something that crawls pages and uses our item model to get and hold items for us.

**Scrapy spiders are python classes.  Let's write our first file, called `craigslist_spider.py` and put it in our `/spiders` directory:**

```python
import scrapy

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sfc/apa"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
```

**Next, let's dive in and crawl from our `/craigslist/craigslist` directory:**

```
> scrapy crawl craigslist
```

**What just happened?**
 * Our application requested the URLs from the `start_urls` class attribute.
 * It ran `parse()` on the HTML content of each response.
 * What else?
 
```python
    with open(filename, 'wb') as f:
        f.write(response.body)
```

It saved a file in our base project directory, named after the second-to-last segment of the URL.  In our case, it should create a file called "sfc".  This is taken directly from the Scrapy docs, and its only point is to illustrate the workflow so far.  It is kind of nice to have a reference to our HTML file, though.

There might be some errors listed when we crawl, but they are fine for now.

--- 
<a id='xpath-spider'></a>
### 4. XPath + parsing with our spider

So far, we've defined what fields we'll get, some urls to fetch, and saved some content to a file.  Let's actually do something interesting.

**We should let our spider know about the item model we made earlier.  At the top of `craigslist/craigslist/spiders/craigslist_spider.py`, let's add a new import:**

```python
from craigslist.items import CraigslistItem
```

> **Check:** Why won't it work otherwise?

<br><br><br>
**Let's replace our parse method, to find some data from our Craigslist spider response, and map it to our item model, CraigslistItem:**

```python
def parse(self, response):

    for sel in response.xpath("//div[@class='content']/span[@class='rows']/p"):

        item = CraigslistItem()
        item['title'] =  sel.xpath("span/span/a[@class='hdrlnk']").extract()[0]
        item['link']  =  sel.xpath("span/span/a[@class='hdrlnk']/@href").extract()[0]
        yield item
```

---

<a id='save-examine'></a>
### 5. Save and examine our scraped data

By default, we can save our crawled data as json.  To save our data, we just need to pass an optional parameter to our crawl call:

<blockquote>
```
> scrapy crawl craigslist -o items.json
```
</blockquote>

It's always good to iteratively check our data when developing a spider to make sure it's close to what we want. 

> *Pro tip:  The longer your iterations are between checks, the harder it's going to be to understand what's not working and fix bugs.*

In [16]:
import pandas as pd

# update this path to your own
# hint: from terminal, use the pwd command in the same directory as items.json to find
# your scraping directory with your json file
# pd.read_json("/Users/davidyerrington/virtualenvs/data/scraping/craigslist/craigslist/items.json").head()

# df = pd.read_csv("/Users/davidyerrington/scrapy_projects/craigslist/craigslist/apts.csv")
# df
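
If pandas is not handy, the export can also be sanity-checked with the standard library. The sketch below assumes the feed is a JSON array with one object per item (my understanding of what `-o items.json` produces) and uses made-up records written to a temporary file as a stand-in for a real crawl.

```python
import json
import os
import tempfile

# Made-up records in the shape our spider yields (title and link fields)
sample_items = [
    {"title": "WONDERFUL ONE BR APARTMENT HOME", "link": "/sfc/apa/0000000001.html"},
    {"title": "", "link": "/sfc/apa/0000000002.html"},
]

# Write a stand-in items.json to a temporary directory, then read it back
path = os.path.join(tempfile.mkdtemp(), "items.json")
with open(path, "w") as f:
    json.dump(sample_items, f)

with open(path) as f:
    items = json.load(f)

# Quick checks: how many items, which fields, and any empty titles?
missing_titles = [item["link"] for item in items if not item.get("title")]

print(len(items), "items")
print(sorted(items[0].keys()))
print("items with no title:", missing_titles)
```

Running a check like this after each change to the spider makes it easier to spot an XPath expression that has quietly stopped matching.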

<a id='addendum'></a>
## Addendum: leveraging XPath to get more results

---

Generally, a workflow that is useful in this context is to load the page in your Chrome browser, check out the page using the XPath Helper plugin, and from that derive your own XPath expressions based on the output.

`text()` selects only the text of a given element (between the tags), and `@attribute_name` is used to select attributes.

**Here is an example of `text()`**:
<blockquote>
```
<h1>Darwin - The Evolution Of An Exhibition</h1>
```
</blockquote>

The XPath selector for this:

<blockquote>
```
//h1/text()
```
</blockquote>

**Here is an example of selecting by attribute**:

The description is contained inside a `<div>` tag with `id="description"`:
<blockquote>
```
<h2>Description:</h2>

<div id="description">
Short documentary made for Plymouth City Museum and Art Gallery regarding the setup of an exhibit about Charles Darwin in conjunction with the 200th anniversary of his birth.
</div>
...
```
</blockquote>

XPath
<blockquote>
```
//div[@id='description']
```
</blockquote>

---
<a id='follow-links'></a>
### Following links for more results

100 results is pretty cool, but what if we want more?  We need to follow the "next" links and find new pages to grab.  Inside the **`parse()`** method of our spider class, we only need to yield another type of object: a `scrapy.Request` for the next page, with `parse()` as its callback.

```python
def parse(self, response):

    for sel in response.xpath("//div[@class='content']/span[@class='rows']/p"):

        item = CraigslistItem()
        item['title'] =  sel.xpath("span/span/a[@class='hdrlnk']").extract()[0]
        item['link']  =  sel.xpath("span/span/a[@class='hdrlnk']/@href").extract()[0]
        # 'price' must also be declared in CraigslistItem: price = scrapy.Field()
        item['price'] =  sel.xpath("span/span/span[@class='price']").extract()[0]
        yield item

    # Does the next page exist?  Let's get it!
    next_page   = response.xpath("(//a[@class='button next']/@href)[1]")

    if next_page:
        url = response.urljoin(next_page[0].extract())
        yield scrapy.Request(url, self.parse)

```
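
The control flow of following "next" links can be illustrated without Scrapy or a network connection. The sketch below is a plain-Python simulation, assuming Python 3: `FAKE_SITE`, its URLs, and the `crawl` function are all invented for this example and stand in for real pages and for Scrapy's request scheduling.

```python
from urllib.parse import urljoin

# A made-up, in-memory "site": each URL maps to the titles on that page and
# a relative link to the next page (None on the last page).
FAKE_SITE = {
    "http://example.com/search": {"titles": ["A", "B"], "next": "/search?s=2"},
    "http://example.com/search?s=2": {"titles": ["C"], "next": "/search?s=4"},
    "http://example.com/search?s=4": {"titles": ["D"], "next": None},
}


def crawl(start_url):
    """Yield every title, following "next" links until none remain."""
    url = start_url
    seen = set()  # guards against a "next" link that loops back
    while url is not None and url not in seen:
        seen.add(url)
        page = FAKE_SITE[url]
        for title in page["titles"]:
            yield title
        # Resolve the relative link against the current page URL
        url = urljoin(url, page["next"]) if page["next"] else None


titles = list(crawl("http://example.com/search"))
print(titles)
```

In the real spider, yielding a `scrapy.Request` plays the role of the loop here: each response is handed back to `parse()`, which yields that page's items and then the request for the page after it.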
