# Web Scraping

## Getting started

* Install Scrapy and Jupyter
* First, explore a page interactively in the Scrapy shell:

`scrapy shell https://en.wikipedia.org/wiki/Sydney`


Or use a Wikipedia page of your choice.



## Why wiki?

## What does wiki's code look like?

## Getting the title:

Try:

`response.xpath('//title')`

`response.xpath('//title').extract()`

`response.xpath('//title/text()').extract()`



## What is contained in the body?

```python
response.xpath('//body')
response.xpath('//body/div')

response.xpath('//body/div/div')
response.xpath('//body/div/div[@id = "bodyContent"]')
```
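
The narrowing effect of each path step can be tried offline. The following is a minimal sketch that uses only Python's standard library: `xml.etree.ElementTree` supports a small subset of XPath, which is enough for child steps and `[@id='...']` predicates, while Scrapy's selectors support far more. The `PAGE` string is a hypothetical, heavily simplified stand-in for a Wikipedia page, not real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a Wikipedia page.
PAGE = """
<html>
  <head><title>Sydney - Wikipedia</title></head>
  <body>
    <div id="content">
      <div id="siteSub">From Wikipedia</div>
      <div id="bodyContent">
        <div><p>Sydney is a city in <a href="/wiki/Australia">Australia</a>.</p></div>
      </div>
    </div>
    <div id="footer"><div id="footer-info">Footer</div></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Each extra step (or predicate) narrows the selection.
top_divs = root.findall("./body/div")                               # divs directly under body
nested_divs = root.findall("./body/div/div")                        # divs one level deeper
body_content = root.findall("./body/div/div[@id='bodyContent']")    # only the div with that id

print(len(top_divs))              # 2
print(len(nested_divs))           # 3
print(len(body_content))          # 1
print(body_content[0].get("id"))  # bodyContent
```

The same idea applies in the Scrapy shell: add one step at a time and check how many nodes come back before going deeper.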

## Problem: getting the table of contents

```python
response.xpath('//body/div/div[@id = "bodyContent"]/div')

response.xpath('//body/div/div[@id = "bodyContent"]/div/div[@id = "toc"]')

response.xpath('//body/div/div[@id = "bodyContent"]/div/div[@id = "toc"]/ul/li')

response.xpath('//body/div/div[@id = "bodyContent"]/div/div[@id = "toc"]/ul/li/a/span[@class = "toctext"]')

response.xpath('//body/div/div[@id = "bodyContent"]/div/div[@id = "toc"]/ul/li/a/span[@class = "toctext"]/text()').extract()
```

## Back to our original problem: how do we get the first link from the wiki page

```python
response.xpath('//body/div/div[@id = "bodyContent"]/div/p/text()').extract()

response.xpath('//body/div/div[@id = "bodyContent"]/div/p/a')[0]

response.xpath('//body/div/div[@id = "bodyContent"]/div/p/a/@href').extract()[0]
```
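
Indexing with `[0]` raises an `IndexError` when the query matches nothing, which can happen on pages whose paragraphs contain no links. As a rough sketch of a safer pattern, again using only the standard library, the hypothetical helper `first_href` below returns `None` instead of failing; the `FRAGMENT` markup is made up for illustration. In Scrapy itself, selector lists also offer `extract_first()`, which returns `None` on an empty result.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the bodyContent div.
FRAGMENT = """
<div id="bodyContent">
  <div>
    <p>No links in this paragraph.</p>
    <p><a href="/wiki/Australia">Australia</a> and <a href="/wiki/New_South_Wales">NSW</a></p>
  </div>
</div>
"""


def first_href(container):
    """Return the href of the first <a> inside a <p>, or None if there is none."""
    for anchor in container.iterfind("./div/p/a"):
        href = anchor.get("href")
        if href:
            return href
    return None


content = ET.fromstring(FRAGMENT.strip())
empty = ET.fromstring("<div id='bodyContent'><div><p>Text only.</p></div></div>")

print(first_href(content))  # /wiki/Australia
print(first_href(empty))    # None
```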

## Spider

So we now know where the first link in a wiki page is, and we want to follow it. To do that we need a spider, and Scrapy can create one for us.


* Exit out of the shell
* Create a new project

`scrapy startproject wiki`

* This creates the basic structure for a spider
* We need to create a new file for our spider in /wiki/spiders

```python
import scrapy


class WikiSpider(scrapy.Spider):
    name = "wiki"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Sydney"]

    def parse(self, response):
        # Print the href of the first link found in a body paragraph
        print(response.xpath('//body/div/div[@id = "bodyContent"]/div/p/a/@href').extract()[0])
```


Now run using:

`scrapy crawl wiki`

If you see a flood of errors like the following:

```shell
2016-07-21 11:24:53 [scrapy] INFO: Scrapy 1.0.3 started (bot: wiki)
2016-07-21 11:24:53 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-07-21 11:24:53 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'wiki.spiders', 'SPIDER_MODULES': ['wiki.spiders'], 'BOT_NAME': 'wiki'}
2016-07-21 11:24:53 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'.  Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied.  Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.

2016-07-21 11:24:53 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-07-21 11:24:53 [boto] DEBUG: Retrieving credentials from metadata server.
2016-07-21 11:24:54 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/boto/utils.py", line 214, in retry_url
    r = opener.open(req)
  File "/Users/rachel/anaconda/lib/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/Users/rachel/anaconda/lib/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/Users/rachel/anaconda/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/Users/rachel/anaconda/lib/python2.7/urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/Users/rachel/anaconda/lib/python2.7/urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2016-07-21 11:24:54 [boto] ERROR: Unable to read instance data, giving up
2016-07-21 11:24:54 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-07-21 11:24:54 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-07-21 11:24:54 [scrapy] INFO: Enabled item pipelines: 
2016-07-21 11:24:54 [scrapy] INFO: Spider opened
2016-07-21 11:24:54 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-21 11:24:54 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
Error during info_callback
Traceback (most recent call last):
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/twisted/protocols/tls.py", line 421, in dataReceived
    self._write(bytes)
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/twisted/protocols/tls.py", line 569, in _write
    sent = self._tlsConnection.send(toSend)
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1270, in send
    result = _lib.SSL_write(self._ssl, buf, len(buf))
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 926, in wrapper
    callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1154, in infoCallback
    return wrapped(connection, where, ret)
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1256, in _identityVerifyingInfoCallback
    transport = connection.get_app_data()
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1589, in get_app_data
    return self._app_data
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1148, in __getattr__
    return getattr(self._socket, name)
exceptions.AttributeError: 'NoneType' object has no attribute '_app_data'

2016-07-21 11:24:55 [twisted] CRITICAL: Error during info_callback
Traceback (most recent call last):
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/twisted/protocols/tls.py", line 421, in dataReceived
    self._write(bytes)
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/twisted/protocols/tls.py", line 569, in _write
    sent = self._tlsConnection.send(toSend)
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1270, in send
    result = _lib.SSL_write(self._ssl, buf, len(buf))
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 926, in wrapper
    callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1154, in infoCallback
    return wrapped(connection, where, ret)
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1256, in _identityVerifyingInfoCallback
    transport = connection.get_app_data()
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1589, in get_app_data
    return self._app_data
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1148, in __getattr__
    return getattr(self._socket, name)
exceptions.AttributeError: 'NoneType' object has no attribute '_app_data'

From callback <function infoCallback at 0x10598ab18>:
Traceback (most recent call last):
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 926, in wrapper
    callback(Connection._reverse_mapping[ssl], where, return_code)
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1158, in infoCallback
    connection.get_app_data().failVerification(f)
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1589, in get_app_data
    return self._app_data
  File "/Users/rachel/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1148, in __getattr__
    return getattr(self._socket, name)
AttributeError: 'NoneType' object has no attribute '_app_data'
2016-07-21 11:24:55 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Sydney> (referer: None)
```

This is a problem with verifying Wikipedia's security certificate. Try installing the `service_identity` package:

```shell
conda install service_identity
```

or 

```shell
pip install service_identity
```

## Following links


```python
import scrapy


class WikiSpider(scrapy.Spider):
    name = "wiki"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Sydney"]

    base_url = "https://en.wikipedia.org"

    def parse(self, response):
        # href of the first link in a body paragraph
        next_link = response.xpath('//body/div/div[@id = "bodyContent"]/div/p/a/@href').extract()[0]

        print(next_link)

        # Follow the link and parse the next page with this same method
        yield scrapy.Request(self.base_url + next_link, callback=self.parse)
```
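
Concatenating `base_url` with the link works when the `href` is a site-relative path such as `/wiki/Australia`. A slightly more general approach is to resolve the link against the URL of the current page. The sketch below shows this with the standard library's `urljoin` (Python 3), plus a rough, hypothetical `is_on_site` check that illustrates why `allowed_domains` holds bare host names; it is not Scrapy's actual offsite filter. Inside a spider, `response.urljoin(href)` does the same resolution.

```python
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://en.wikipedia.org/wiki/Sydney"
ALLOWED_DOMAIN = "en.wikipedia.org"  # a bare host name, no scheme


def resolve(page_url, href):
    """Turn an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


def is_on_site(url, domain=ALLOWED_DOMAIN):
    """Rough stand-in for an offsite check: exact host or a subdomain of it."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


next_url = resolve(PAGE_URL, "/wiki/Australia")
print(next_url)                                # https://en.wikipedia.org/wiki/Australia
print(is_on_site(next_url))                    # True
print(is_on_site("https://www.example.com/"))  # False
```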



## Storing the links

* Open items.py
* Here we need to define a class that contains the information we need

```python
import scrapy


class WikiItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
```

Then update the spider to fill in and yield a `WikiItem` for each page:

```python
import scrapy

from wiki.items import WikiItem


class WikiSpider(scrapy.Spider):
    name = "wiki"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Sydney"]

    base_url = "https://en.wikipedia.org"

    def parse(self, response):
        link = WikiItem()

        # Store the href of the first link in a body paragraph
        link["link"] = response.xpath('//body/div/div[@id = "bodyContent"]/div/p/a/@href').extract()[0]

        # Queue the next page, then hand the item to Scrapy
        yield scrapy.Request(self.base_url + link["link"], callback=self.parse)

        yield link
```
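
To get a feel for what this spider collects, the crawl can be simulated without any network access. This is only a sketch under stated assumptions: `FIRST_LINK` is a made-up map from a page to the first link on it, and `follow_first_links` is a hypothetical helper, not part of Scrapy. It mimics two behaviours of the spider above: one item is produced per visited page, and a page that has already been requested is not requested again (Scrapy filters duplicate requests by default).

```python
# Hypothetical "first link" map: page -> first link found on that page.
FIRST_LINK = {
    "/wiki/Sydney": "/wiki/Australia",
    "/wiki/Australia": "/wiki/Country",
    "/wiki/Country": "/wiki/Australia",  # loops back to an earlier page
}


def follow_first_links(start, max_pages=10):
    """Collect one item per visited page, stopping on a repeat, dead end, or limit."""
    visited = {start}
    items = []
    current = start
    while len(items) < max_pages:
        next_link = FIRST_LINK.get(current)
        if next_link is None:
            break  # page has no link to follow
        items.append({"link": next_link})  # mirrors WikiItem's single field
        if next_link in visited:
            break  # already requested, so the chain ends here
        visited.add(next_link)
        current = next_link
    return items


for item in follow_first_links("/wiki/Sydney"):
    print(item)
```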
