<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to XPath

_Instructor: Aymeric Flaisler_

---

### Learning Objectives
- Understand scraping basics
- Become familiar with the import.io service
- Understand the structure and content of HTML
- Use XPath to extract information from HTML


### STUDENT PRE-WORK
*Before this lesson, you should already be able to:*
- Understand basic HTML concepts
- Work with Beautiful Soup


### LESSON GUIDE

- [Introduction](#introduction)
- [HTML](#html)
    - [Elements](#elements)
    - [Attributes](#attributes)
- [What is XPath?](#xpath)
    - [Absolute References](#xpath_absolute)
    - [Relative References](#xpath_relative)
    - ["Where's Waldo?" Exercise](#waldo_exercise)
- [1 vs N Selectors](#1_v_n)
- [Demo Code](#demo)
    - [Scrape Data Tau](#scrape_tau)
- [Independent Practice](#ind_practice)

---

<a id='introduction'></a>
## Introduction: Scraping Overview

Web scraping is the technique of extracting information from websites. It focuses on transforming unstructured data on the web (HTML) into structured data that can be stored and analyzed.

There are a variety of ways to "scrape" what we want from the web:

- 3rd Party Services (import.io)
- Write our own Python apps that pull HTML documents and parse them.
  - Mechanize
  - Scrapy
  - Requests
  - libxml / XPath
  - Beautiful Soup
  - Regular expressions

<a id='html'></a>
## HTML Review

In the HTML DOM (Document Object Model), everything is a node:
 * The document itself is a document node.
 * All HTML elements are element nodes.
 * All HTML attributes are attribute nodes.
 * Text inside HTML elements is a text node.
 * Comments are comment nodes.
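
As a rough illustration of these node types, the sketch below walks a one-line snippet with Python's built-in `html.parser` and records each kind of node it meets. The `NodeLister` class and the sample markup are made up for this example; they are not part of the lesson's Scrapy code.

```python
from html.parser import HTMLParser


class NodeLister(HTMLParser):
    """Record each kind of node the parser encounters (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        # An opening tag yields one element node plus one node per attribute
        self.nodes.append(("element", tag))
        for name, value in attrs:
            self.nodes.append(("attribute", f"{name}={value}"))

    def handle_data(self, data):
        # Text between tags; whitespace-only runs are skipped
        if data.strip():
            self.nodes.append(("text", data.strip()))

    def handle_comment(self, data):
        self.nodes.append(("comment", data.strip()))


lister = NodeLister()
lister.feed('<p id="intro">Hello<!-- a note --></p>')
lister.close()

for kind, value in lister.nodes:
    print(kind, value)
```

The loop prints one line per node: the `p` element, its `id` attribute, the text `Hello`, and the comment.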

<a id='elements'></a>
## Elements
Elements begin and end with **opening and closing "tags"**: the element name wrapped in angle brackets, with a leading slash in the closing tag.

```html
<title>I am a title.</title>
<p>I am a paragraph.</p>
<strong>I am bold.</strong>
```

_note: the tags **title, p, and strong** are represented here._

## Element Parent / Child Relationships

<img src="http://www.htmlgoodies.com/img/2007/06/flowChart2.gif" width="250">

**Elements open and close with the same tag name, like so:**  `<p></p>`

**Elements can have parents and children:**

```html
<body>
    <div>I am inside the parent element
        <div>I am inside a child element</div>
        <div>I am inside another child element</div>
        <div>I am inside yet another child element</div>
    </div>
</body>
```
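
To see the parent / child relationship from code, here is a minimal sketch using the standard library's `xml.etree.ElementTree`. It assumes the snippet is well-formed (every tag closed), which real-world HTML often is not; Scrapy's parser is more forgiving.

```python
import xml.etree.ElementTree as ET

snippet = """
<body>
    <div>I am inside the parent element
        <div>I am inside a child element</div>
        <div>I am inside another child element</div>
    </div>
</body>
""".strip()

body = ET.fromstring(snippet)
parent = body[0]            # first (and only) child of <body>
children = list(parent)     # direct children of the parent <div>

print(parent.tag, "->", parent.text.strip())
for child in children:
    print("  child:", child.tag, "->", child.text)
```

Indexing or iterating an element only reaches its direct children, which is the same distinction XPath makes between `/` (child) and `//` (any descendant).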

<a id='attributes'></a>

## Element Attributes

Elements can also have attributes!  Attributes are defined inside an element's **opening tag** and can contain data that may be useful to scrape.

```html
<a href="http://lmgtfy.com/?q=html+element+attributes" title="A title" id="web-link" name="hal">A Simple Link</a>
```

The **element attributes** of this `<a>` tag element are:
- id
- href
- title
- name

This `<a>` tag example will render in your browser like this:
> <a href="https://www.youtube.com/watch?v=dQw4w9WgXcQ">A Simple Link</a>
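
A small sketch of reading those attributes from code, again with the standard library's `xml.etree.ElementTree` rather than Scrapy (an assumption made so the example runs without extra installs):

```python
import xml.etree.ElementTree as ET

link = ET.fromstring(
    '<a href="http://lmgtfy.com/?q=html+element+attributes" '
    'title="A title" id="web-link" name="hal">A Simple Link</a>'
)

print(link.tag)              # the element name
print(sorted(link.attrib))   # every attribute name on the element
print(link.attrib["href"])   # one attribute value
print(link.text)             # the text node inside the element
```

The `href` value is usually the part worth scraping from a link, since it tells you which page to fetch next.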


## Can you identify an attribute, an element, a text item, and a child element?

```HTML
<html>
   <title id="main-title">All this scraping is making me itch!</title>
   <body>
       <h1>Welcome to my Homepage</h1>
       <p id="welcome-paragraph" class="strong-paragraph">
           <span>Hello friends, let me tell you about this cool hair product..</span>
           <ul>
              <li>It's cool</li>
              <li>It's fresh</li>
              <li>It can tell the future</li>
              <li>Always be closing</li>
           </ul>
       </p>
   </body>
```

**Bonus: What's missing?** 

<a id='xpath'></a>

## Enter XPath

XPath uses **path expressions to select nodes or node-sets** in an HTML/XML document. These path expressions look very much like the expressions you see when you work with a traditional computer file system.

## What is XPath?

---


Understanding how to identify elements and attributes within HTML documents gives us the capability to write simple expressions that create structured data.  We can think of **XPath like a query language for querying HTML**.

To make this process easier, we will use [ChroPath](https://chrome.google.com/webstore/detail/chropath/ljngjbnaijcbncmcnjfhigebomdlkcjo?hl=en), a Chrome add-on.  It's not required, but it is highly recommended for building XPath expressions.


You can also try [XPath Helper](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en) but it is not as friendly.

XPath expressions can **select elements, element attributes, and element text**.  A selection can match either a **single item or multiple items**.  Generally, if your expression is not specific enough, you will end up selecting multiple elements.

<a id='multiple-selections'></a>
### Multiple selections

***Multiple selections*** are useful for capturing search results, or any repeating element.  For instance, the _titles_ in a page of listing search results from Gumtree:


**URL - 1**

[https://www.gumtree.com/london](https://www.gumtree.com/london)

**Example HTML Markup**
```
...
<div class="listing-card-content">
    <h2 class="listing-card-title xh-highlight">Iveco daily 17 seats minibus 43000 on clock </h2>
    <div class="listing-card-description-cont">
    <span class="listing-card-location">Surrey Quays</span>
    <span class="listing-card-category"> Vans </span>
    </div>
    <span class="listing-card-price">£3,950</span>
</div>

...
```

**XPath - Multiple Titles** _copy this into the ChroPath or XPath Helper Query box_
```
//h2[@class='listing-card-title']
```
**Returns (Ad Titles)**


**URL - 2**

[http://sfbay.craigslist.org/search/sfc/apa](http://sfbay.craigslist.org/search/sfc/apa)


**Example HTML Markup**
```
...
<span class="pl"> 
    <time datetime="2016-01-12 23:27" title="Tue 12 Jan 11:27:35 PM">Jan 12</time> 
    <a href="/sfc/apa/5400584579.html" data-id="5400584579" class="hdrlnk">Welcome home to a sweetly renovated four bedroom one and a half bath</a> 
</span>
...
```

**XPath - Multiple Titles** _copy this into the  ChroPath or XPath Helper Query box_
```
//a[contains(@class, 'hdrlnk')]
```

**Returns (Ad Titles)**

**Note:** The double slash (`//`) is shorthand for the descendant-or-self axis: it matches nodes at any depth below the current node.

For example:
`//div[@id='add']//span[@id='addone']`
- The first `//` selects every `div` element in the document whose `id` attribute equals 'add'.
- The second `//` selects every `span` element with `id` 'addone' that is a descendant of one of those `div` elements.
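
The two-step expression above can be tried on a toy document. This sketch uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and requires paths to start from an element (hence the leading `.`). The markup is invented for the example.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring("""
<html>
  <body>
    <div id="add">
      <p><span id="addone">nested two levels down</span></p>
    </div>
    <div id="other">
      <span id="addone">inside a different div</span>
    </div>
  </body>
</html>
""".strip())

# First // : any div with id 'add'.  Second // : any span below it, at any depth.
matches = page.findall(".//div[@id='add']//span[@id='addone']")
print([span.text for span in matches])
```

Only the span under the `add` div is returned, even though it sits inside a `p` and another span elsewhere has the same `id`.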

<a id='singlular-selections'></a>

### Singular selections

***Singular selections*** are necessary when you want to grab specific, unique text within elements.  Here's an example of a details page:


**URL**

[https://www.gumtree.com/p/nokia/nokia-6310i/1305551456](https://www.gumtree.com/p/nokia/nokia-6310i/1305551456)

**HTML Markup**

```
<div class="grid-col-12 grid-col-l-6">
<dl class="dl-attribute-list attribute-list1">
<dt>Posted</dt>
<dd>1 day ago </dd>
</dl> </div>
```

**XPath - Single Item**

```
//dl[@class='dl-attribute-list attribute-list1'][1]/dd
```
**Returns (Time of posting or age of Post)**
```
xx day ago
```

###### XPath Features

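XPath 2.0 and later include over 100 built-in functions to help us select and manipulate HTML (or XML) documents. Note that libxml2, which backs Scrapy's selectors, implements XPath 1.0, whose core function library is much smaller. Across versions, XPath has functions for: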

- string values
- numeric values
- date and time comparison
- sequence manipulation
- Boolean values
- and more!

https://devhints.io/xpath

## Basic XPath Expressions

XPath comes with a wide array of features, but the task you will use it for most often is selecting data from HTML documents.  There are two ways you can **select elements** within HTML using **XPath**:

- Absolute reference
- Relative reference

<a id='xpath_absolute'></a>
## XPath:  Absolute References

_For our XPath demonstration, we will use Scrapy, which uses libxml2 (through lxml) under the hood.  libxml2 provides the basic functionality for XPath expressions._

In [None]:
# Before using scrapy, you need to install it. Run the following in the terminal:
# conda install scrapy
# You also might need
# pip install --upgrade zope2 (you need to run the command as admin (sudo...))

In [1]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
import requests

In [2]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

HTML = """
<html>
    <body>
        <span id="only-span">good</span>
    </body>
</html>
"""
# Absolute reference: the full path from the document root down to the text node
Selector(text=HTML).xpath('/html/body/span/text()').extract()

['good']

<a id='xpath_relative'></a>
## Relative Reference

Relative references in XPath start with `//` and match nodes anywhere in the document, whatever their ancestors are.  Since there is only a single "span" element, `//span/text()` matches **one text node**.

In [3]:
Selector(text=HTML).xpath('//span/text()').extract()

['good']

## Selecting Attributes

Attributes live **within a tag**, such as `id="only-span"` within our span element.  We can get an attribute by using the `@` symbol **after** the **element reference**.


In [4]:
Selector(text=HTML).xpath('//span/@id').extract()

['only-span']

### Demo: (code along)

In [5]:
response = requests.get("https://sfbay.craigslist.org/search/sfc/apa")

In [6]:
# Selector parses the HTML text of the response so we can query it with XPath
hxs = Selector(text=response.text)

In [8]:
items = []  # list for storing the scraped records
for i in hxs.xpath("//ul[@class='rows']/li"):
    item = {}
    # price from the span nested inside the result's 'a' element
    item['price'] = i.xpath("a/span/text()").extract()
    # href/url from the 'a' element
    item['link'] = i.xpath("a/@href").extract()
    item['title'] = i.xpath('p/a/text()').extract()
    items.append(item)

In [10]:
# items

### Exercise:

Do the same for this url: `https://www.gumtree.com/cars/london/bmw+3+series`

In [None]:
# A:
response = requests.get('https://www.gumtree.com/cars/london/bmw+3+series')
gt_ = Selector(text=response.text)

items = [] 
# for ...

<p style='color:white'>

for elt in gt_.xpath('//article[@class="listing-maxi"]/a'): 
    item={}
    item['title'] =  elt.xpath('div[@class="listing-content"]/h2[@class="listing-title"]/text()').extract() 
    item['link']  =  elt.xpath('@href').extract()
    item['price'] =  elt.xpath('div[@class="listing-content"]/span/meta[@itemprop="price"]/@content').extract()
    
    items.append(item)
</p>

<a id='waldo_exercise'></a>
## (~10 mins) Where's Waldo - "XPath Edition"

In this example, we will find Waldo together.  Find Waldo as:

- Element
- Attribute
- Text element

In [None]:
HTML = """
<html>
    <body>
        
        <ul id="waldo">
            <li class="waldo">
                <span> yo Im not here</span>
            </li>
            <li class="waldo">Height:  ???</li>
            <li class="waldo">Weight:  ???</li>
            <li class="waldo">Last Location:  ???</li>
            <li class="nerds">
                <div class="alpha">Bill gates</div>
                <div class="alpha">Zuckerberg</div>
                <div class="beta">Theil</div>
                <div class="animal">parker</div>
            </li>
        </ul>
        
        <ul id="tim">
            <li class="tdawg">
                <span>yo im here</span>
            </li>
        </ul>
        <li>stuff</li>
        <li>stuff2</li>
        
        <div id="cooldiv">
            <span class="dsi-rocks">
               YO!
            </span>
        </div>
        
        
        <waldo>Waldo</waldo>
    </body>
</html>
"""

In [None]:
# A:


<a id='1_v_n'></a>

## 1 vs N Selections

When selecting elements via relative reference, it's possible that you will select multiple items.  It's still possible to select single items, if you're specific enough.

**Singular Reference**
- **Index** starts at **1**
- Selections by offset
- Selections by "first" or "last"
- Selections by **unique attribute value**
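
The bullet points above can be tried on a small fragment. This is a minimal sketch with the standard library's `xml.etree.ElementTree` (a subset of XPath, assumed here so the example runs without Scrapy). The markup is a shortened copy of the page-stats block used in this section.

```python
import xml.etree.ElementTree as ET

stats = ET.fromstring("""
<div class="page-stats-container">
    <li class="item" id="pageviews">1,333,443</li>
    <li class="item" id="last-viewed">01-22-2016</li>
    <li class="item" id="views-per-hour">1,532</li>
</div>
""".strip())

first = stats.find("li[1]")                    # XPath positions start at 1, not 0
last = stats.find("li[last()]")                # last() is the final position
unique = stats.find("li[@id='last-viewed']")   # a unique attribute value
everything = stats.findall("li")               # no predicate: N results

print(first.text, last.text, unique.text, len(everything))
```

With Scrapy's `Selector`, the equivalent expressions would be written as `//li[1]`, `//li[last()]` and `//li[@id='last-viewed']`.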


In [None]:
HTML = """
<html>
    <body>
    
        <!-- Search Results -->
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=751hUX_q0Do" title="Rappin with Gas">Rapping with gas</a>
           <span class="link-details">This is a great video about gas.</span>
        </div>
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=97byWqi-zsI" title="Casio Rapmap">The Rapmaster</a>
           <span class="link-details">My first synth ever.</span>
        </div>
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=TSwqnR327fk" title="Cinco Products">Cinco Midi Organizer</a>
           <span class="link-details">Midi files at the speed of light.</span>
        </div>
        <div class="search-result">
           <a href="https://www.youtube.com/watch?v=8TCxE0bWQeQ" title="Baddest Gates">BBG Baddest Moments</a>
           <span class="link-details">It's tough to be a gangster.</span>
        </div>
        
        <!-- Page stats -->
        <div class="page-stats-container">
            <li class="item" id="pageviews">1,333,443</li>
            <li class="item" id="somethingelse">bla</li>
            <li class="item" id="last-viewed">01-22-2016</li>
            <li class="item" id="views-per-hour">1,532</li>
            <li class="item" id="kiefer-views-per-hour">5,233.42</li>
        </div>
        
    </body>
</html>
"""



#### Selecting the first element in a series of elements

In [None]:
spans = Selector(text=HTML).xpath('//span').extract()
spans[0]  # Python lists are 0-indexed, unlike XPath positions, which start at 1

#### Selecting the last element in a series of elements

In [None]:
spans = Selector(text=HTML).xpath('//span').extract()
spans[-1]

#### Selecting all elements matching a selection

In [None]:
Selector(text=HTML).xpath('//span').extract()

#### Selecting elements matching an _attribute_

This will be one of the most common ways you will select items.  HTML DOM elements are mostly differentiated by their "class" and "id" attributes.  Web developers mainly use these attributes to refer to specific elements, or to a broad set of elements, in order to apply visual characteristics using CSS.

```HTML 
//element[@attribute="value"]
```

**Generally**

- "class" attributes within elements usually refer to multiple items
- "id" attributes are supposed to be unique, but in practice they are not always
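
One caveat with class predicates: `[@class="value"]` compares the whole attribute string, so an element carrying several classes does not match on just one of them. The sketch below shows this with the standard library's `xml.etree.ElementTree` and a trimmed copy of the Gumtree card markup from earlier. The token check in plain Python is only a stand-in; in Scrapy you could write `//h2[contains(@class, 'listing-card-title')]` instead.

```python
import xml.etree.ElementTree as ET

card = ET.fromstring("""
<div class="listing-card-content">
    <h2 class="listing-card-title xh-highlight">Iveco daily 17 seats minibus</h2>
    <span class="listing-card-price">3,950</span>
</div>
""".strip())

# Exact comparison against one class name: no match, the attribute holds two names
exact = card.findall("h2[@class='listing-card-title']")

# Exact comparison against the full attribute string: match
full = card.findall("h2[@class='listing-card-title xh-highlight']")

# Plain-Python check for one class name among several
by_token = [h2 for h2 in card.iter("h2")
            if "listing-card-title" in h2.get("class", "").split()]

print(len(exact), len(full), len(by_token))
```

If an expression that looks right returns nothing, checking the full `class` string in the page source is a good first step.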

_CSS stands for cascading style sheets.  These are used to abstract the definition of visual elements on a micro and macro scale for the web.  They are also our best friend as data miners.  They give us strong hints and cues as to how a web document is structured._

## Checkout (in pairs):

- What is the difference between CSS selectors and XPath? Between Beautiful Soup and Scrapy?
- You need data from a website. Should you go ahead and scrape it, or first check for an available API? Why?
- Can you scrape data behind a login page?
- Is web scraping legal?


<a id='scrape_tau'></a>

## Let's Scrape (Hacker News / Data Tau) Headlines

DataTau is a great site for data science news. Let's take their headlines using Python **requests**, and practice selecting various elements.

Using <a href="https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en">XPath helper Chrome plugin</a> or ChroPath and the Chrome "inspect" feature, let's explore the structure of the page.

_Here's a <a href="https://www.youtube.com/watch?v=i2Li1vnv09U">concise video</a> that demonstrates the basic inspect feature within Chrome._

In [14]:
# Please only run this frame once to avoid hitting the site too hard all at once
import requests

response = requests.get("http://www.datatau.com/")
HTML = response.text  
HTML[0:150]           # view the first 150 characters of the HTML index document for DataTau

'<html><head><link rel="stylesheet" type="text/css" href="news.css">\n<link rel="shortcut icon" href="http://www.iconj.com/ico/d/x/dxo02ap56v.ico">\n<scr'

#### Select the headlines using XPath


In [None]:
import pandas as pd

In [16]:
# A:

Solution (double-click):
<p style="color:white;">
titles = Selector(text=HTML).xpath('//td[@class="title"]/a/text()').extract()
titles[:5] # the first 5 titles
</p>

#### How do we get the urls from the titles?

In [None]:
# A:

Solution (double-click):
<p style="color:white;">

urls = Selector(text=HTML).xpath('//td[@class="title"]/a/@href').extract()
urls[:5]
# the first 5 URLs
</p>

#### How can we get the site domain, after the title within the parentheses (ie: stitchfix.com)?

In [None]:
# A:

#### How about the points?

In [None]:
# A:

Solution (double-click):
<p style="color:white;">
points = Selector(text=HTML).xpath('//td[@class="subtext"]/span/text()').extract()
points[0:5]
</p>

#### How about the "More" link?
Hint:  You can use `element[text()='exact text']` to find an element whose text matches specific text exactly.

In [None]:
# A :

Solution (double-click):
<p style="color:white;">
next_link = Selector(text=HTML).xpath('//a[text()="More"]/@href').extract()
next_link
</p>

## Independent Practice / Lab

For the next 30 minutes try to grab the following:

- Story titles
- Story URL (href)
- Domain
- Points

Stretch:
- Author
- Comment count

Then put into a DataFrame.

- Do basic analysis of domains and point distributions

**Bonus**

Automatically find the next "More" link and mine the next page(s) until none exist.  Logically, you can process each page with this pseudocode:

1. Does the next link exist (an `a` tag with text == "More")?
1. Fetch the URL, prepended with the domain (datatau.com/(extracted link here))
1. Parse the page with `Selector(text=HTML).xpath('').extract()` to find the elements
1. Add to the DataFrame

_Note:  You might want to set a limit of 2-3 total requests per attempt to avoid unnecessary transfer._
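
The pagination steps above can be sketched as a loop with a page limit. Everything named here is an assumption for illustration: `FAKE_SITE` and `fake_fetch` stand in for real HTTP requests, `example.com` is a placeholder domain, and the standard library's `xml.etree.ElementTree` replaces Scrapy so the sketch runs offline. It shows one possible shape, not the lab solution.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Stand-in pages keyed by URL; a real crawler would fetch these over HTTP.
FAKE_SITE = {
    "https://example.com/":
        '<html><body><p>page 1</p><a href="/x?p=2">More</a></body></html>',
    "https://example.com/x?p=2":
        '<html><body><p>page 2</p><a href="/x?p=3">More</a></body></html>',
    "https://example.com/x?p=3":
        '<html><body><p>page 3</p></body></html>',
}


def fake_fetch(url):
    """Hypothetical fetcher: return the markup stored for a URL."""
    return FAKE_SITE[url]


def crawl(start_url, fetch, max_pages=3):
    """Follow "More" links until none remain or max_pages pages were fetched."""
    visited = []
    url = start_url
    while url is not None and len(visited) < max_pages:
        page = ET.fromstring(fetch(url))
        visited.append(url)
        # Step 1: does an <a> whose text is exactly "More" exist?
        more = page.find(".//a[.='More']")
        # Step 2: resolve the relative href against the current URL
        url = urljoin(url, more.get("href")) if more is not None else None
    return visited


print(crawl("https://example.com/", fake_fetch, max_pages=2))
print(crawl("https://example.com/", fake_fetch, max_pages=10))
```

Passing the fetch function in as an argument keeps the loop testable without touching the network, and `urljoin` handles the "prepend the domain" step for both relative and absolute links.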


In [None]:
url="http://www.datatau.com"

response = requests.get(url)

links = Selector(text=response.text).xpath(
    "//td[@class='title']/a/@href").extract()
titles = Selector(text=response.text).xpath(
    "//td[@class='title']/a/text()").extract()
# points = Selector(text=response.text).xpath(
#     "//td[@class='subtext']/span/text()").extract()
domains = Selector(text=response.text).xpath(
    "//td[@class='title']/span/text()").extract()
authors = Selector(text=response.text).xpath(
    "//td[@class='subtext']/a[contains(@href, 'user')]/text()").extract()
comments = Selector(text=response.text).xpath(
    "//td[@class='subtext']/a[contains(@href, 'item')]/text()").extract()

## Create a df with the data (be careful of the missing values):

In [None]:
url = "https://www.datatau.com/"

response = requests.get(url)

# Fill in each selection, then combine the lists into a DataFrame.
links = ...
titles = ...
points = ...
domains = ...
authors = ...
comments = ...
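
On the missing values: if each field is selected into its own list, a story that lacks a field makes that list shorter and shifts every later value onto the wrong row. One way around it, sketched below, is to loop over the rows and select each field relative to its row, so a missing field becomes `None` in place. The table markup is hypothetical (it is not DataTau's real structure) and the standard library's `xml.etree.ElementTree` is used so the sketch runs without Scrapy.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second story has no author link.
table = ET.fromstring("""
<table>
  <tr><td class="title"><a href="/a">First story</a></td>
      <td class="subtext"><a href="user?id=ann">ann</a></td></tr>
  <tr><td class="title"><a href="/b">Second story</a></td>
      <td class="subtext"></td></tr>
</table>
""".strip())

rows = []
for tr in table.findall("tr"):
    rows.append({
        "title": tr.findtext("td[@class='title']/a"),
        "link": tr.find("td[@class='title']/a").get("href"),
        # findtext returns None when the node is missing, so rows stay aligned
        "author": tr.findtext("td[@class='subtext']/a"),
    })

for row in rows:
    print(row)
```

With Scrapy the same pattern would loop over row selectors and call `.xpath(...).extract_first()`, which also returns `None` when nothing matches. A list of dicts like `rows` can be passed directly to `pd.DataFrame`.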


## Altogether, in a function calling itself recursively:

In [None]:
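import requests
import numpy as np
import pandas as pd
from scrapy.selector import Selector


# max_page was undefined in the original skeleton; it is assumed here to be
# the page number at which the recursion should stop.
def parse_url(url="https://www.datatau.com/", data=None, max_page=3):

    response = requests.get(url)
    # links = Selector(text=response.text).xpath(...
    # etc.

    scraped = ...  # fill in: a dict built from the lists selected above

    df = pd.DataFrame(scraped)

    # If there is earlier data, add the new page after it; if not, this is the first iteration.
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.)
    if data is not None:
        data = pd.concat([data, df])
    else:
        data = df

    # Find the "More" link
    more_anchor = Selector(text=response.text).xpath(
        "//a[text() = 'More']/@href").extract()

    # Fetch the next page recursively until max_page appears in the URL
    if (len(more_anchor) > 0) and (str(max_page) not in url):
        more_url = "https://www.datatau.com/%s" % more_anchor[0]
        print("Fetching %s..." % more_url)
        return parse_url(more_url, data=data, max_page=max_page)
    else:
        return data.reset_index()


df = parse_url()
df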

Extra material: Web-scraping JavaScript page with Python: 
https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python