# Data acquisition and extraction

## **Introduction**

The key aspects of effective scraping are **understanding how content and data are stored on web servers, identifying the data you want to retrieve, and understanding how the tools support this extraction**. We will discuss website structures and the DOM, and introduce techniques to parse and query websites with lxml, XPath, and CSS. We will also look at how to work with websites written in other languages and with different encodings such as Unicode.

Ultimately, understanding how to find and extract data within an HTML document comes down to understanding the structure of the HTML page, its representation in the DOM, the process of querying the DOM for specific elements, and how to specify which elements you want to retrieve based upon how the data is represented.

## **How to parse websites and navigate the DOM using BeautifulSoup**

When the browser displays a web page, it builds a model of the content of the page in a representation known as the **document object model (DOM)**. The DOM is a hierarchical representation of the page's entire content, as well as structural information, style information, scripts, and links to other content.

It is critical to understand this structure to be able to effectively scrape data from web pages. We will look at an example web page and its DOM, and examine how to navigate the DOM with Beautiful Soup.

### **Getting ready**

We will use a small web site that is included in the `www` folder of the sample code.  To follow along, start a web server from within the `www` folder.  This can be done with Python 3 as follows:

```bash
www $ python3 -m http.server 8080
Serving HTTP on 0.0.0.0 port 8080 (http://0.0.0.0:8080/)
```

Open a browser page to `http://localhost:8080/planets.html`. The DOM of a web page can be examined in Chrome by right-clicking the page and selecting Inspect (other browsers have similar tools). This opens the Chrome Developer Tools, where the DOM can be examined in the Elements tab.

The first row of the table can be selected in the Elements tab. Each row of planets is within a `<tr>` element.  There are several characteristics of this element and its neighboring elements that we will examine because they are designed to model common web pages.

Firstly, this element has three attributes: `id`, `class`, and `name`. Attributes are often important in scraping as they are commonly used to identify and locate data embedded in the HTML.

Secondly, the `<tr>` element has children, in this case six `<td>` elements. We will often need to look into the children of a specific element to find the actual data that is desired.

This element also has a parent element, `<tbody>`. There are also sibling elements: the other `<tr>` children of the same parent.  From any planet, we can go up to the parent and find the other planets. And as we will see, we can use various constructs in the various tools, such as the `find` family of functions in Beautiful Soup, and also `XPath` queries, to easily navigate these relationships.


### **How to do it**

In [1]:
import requests
from bs4 import BeautifulSoup
html = requests.get("http://localhost:8080/planets.html").text
soup = BeautifulSoup(html, 'html.parser')

str(soup)[:1000]

'<html>\n<head>\n</head>\n<body>\n<div id="planets">\n<h1>Planetary data</h1>\n<div id="content">Here are some interesting facts about the planets in our solar system</div>\n<p></p>\n<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\n                    Name\n                </th>\n<th>\n                    Mass (10^24kg)\n                </th>\n<th>\n                    Diameter (km)\n                </th>\n<th>\n                    How it got its Name\n                </th>\n<th>\n                    More Info\n                </th>\n</tr>\n<tr class="planet" id="planet1" name="Mercury">\n<td>\n<img src="img/mercury-150x150.png"/>\n</td>\n<td>\n                    Mercury\n                </td>\n<td>\n                    0.330\n                </td>\n<td>\n                    4879\n                </td>\n<td>Named Mercurius by the Romans because it appears to move so swiftly.</td>\n<td>\n<a href="https://en.wikipedia.org/wiki/Mercury_(planet)">Wikipedia<

We can navigate the elements in the DOM using properties of soup. soup represents the overall document and we can drill into the document by chaining the tag names. The following navigates to the `<table>` containing the data:

In [2]:
str(soup.html.body.div.table)[:200]

'<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\n                    Name\n                </th>\n<th>\n                    Mass (10^24kg)\n                </th>\n<th>\n          '

Note that this type of notation retrieves only the first tag of that name found beneath each element.  Finding more requires iterating over all the children, which we will do next, or using the find methods (the next recipe).

**Each node has both children and descendants**. Descendants are all the nodes underneath a given node (even at further levels than the immediate children), while children are only the first-level descendants. The following retrieves the children of the table, which is actually a `list_iterator` object:

In [3]:
soup.html.body.div.table.children

<list_iterator at 0x7f8fbb527fa0>

We can examine each child element in the iterator using a for loop or a list comprehension. The following uses a list comprehension to get all the children of the table and return the first few characters of their constituent HTML as a list:

In [4]:
[str(c)[:45] for c in soup.html.body.div.table.children]

['\n',
 '<tr id="planetHeader">\n<th>\n</th>\n<th>\n      ',
 '\n',
 '<tr class="planet" id="planet1" name="Mercury',
 '\n',
 '<tr class="planet" id="planet2" name="Venus">',
 '\n',
 '<tr class="planet" id="planet3" name="Earth">',
 '\n',
 '<tr class="planet" id="planet4" name="Mars">\n',
 '\n',
 '<tr class="planet" id="planet5" name="Jupiter',
 '\n',
 '<tr class="planet" id="planet6" name="Saturn"',
 '\n',
 '<tr class="planet" id="planet7" name="Uranus"',
 '\n',
 '<tr class="planet" id="planet8" name="Neptune',
 '\n',
 '<tr class="planet" id="planet9" name="Pluto">',
 '\n']
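The difference between children and descendants can be checked on a much smaller document. The following is a minimal sketch that uses a hypothetical two-row table (`SAMPLE_HTML`) in place of `planets.html`, and filters out the whitespace strings that appear as `'\n'` in the output above:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical two-row table, standing in for planets.html.
SAMPLE_HTML = (
    '<table id="planetsTable">'
    '<tr id="planet1"><td>Mercury</td><td>0.330</td></tr>'
    '<tr id="planet2"><td>Venus</td><td>4.87</td></tr>'
    '</table>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.table

# .children yields only the direct children; keep tags, drop text nodes.
child_tags = [c.name for c in table.children if isinstance(c, Tag)]

# .descendants walks the entire subtree below the table.
descendant_tags = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```

Only the two rows are children of the table, while the descendants also include every cell inside those rows.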

Last but not least, the parent of a node can be found using the `.parent` property:

In [5]:
str(soup.html.body.div.table.tr.parent)[:200]

'<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\n                    Name\n                </th>\n<th>\n                    Mass (10^24kg)\n                </th>\n<th>\n          '

### **How it works**

Beautiful Soup converts the HTML from the page into its own internal representation. This model closely mirrors the DOM that a browser would create, although a browser may add elements of its own, such as the `<tbody>` seen in the inspector. Beautiful Soup also provides many powerful capabilities for navigating the elements in the DOM, such as what we have seen when using the tag names as properties.  These are great for finding things when we know a fixed path through the HTML with the tag names.

### **There's more**

**This manner of navigating the DOM is relatively inflexible and is highly dependent upon the structure**. It is possible that this structure can change over time as web pages are updated by their creator(s). The pages could even look identical, but have a completely different structure that breaks your scraping code.

So how can we deal with this? As we will see, **there are several ways of searching for elements that are much better than defining explicit paths**. In general, **we can do this using XPath and by using the find methods of Beautiful Soup**. We will examine both in recipes later in this chapter.
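To make the fragility concrete, the sketch below compares a fixed path with an attribute search on two hypothetical versions of a page (`OLD_HTML` and `NEW_HTML` are invented for illustration). In the second version a banner `<div>` has been added before the one holding the table, which is enough to break the dotted path:

```python
from bs4 import BeautifulSoup

# Hypothetical "before" and "after" versions of the same page.
OLD_HTML = (
    '<html><body><div id="planets">'
    '<table id="planetsTable"><tr><td>Earth</td></tr></table>'
    '</div></body></html>'
)
NEW_HTML = (
    '<html><body><div id="banner">News</div><div id="planets">'
    '<table id="planetsTable"><tr><td>Earth</td></tr></table>'
    '</div></body></html>'
)

def table_by_path(soup):
    # Fixed path: .div picks the first <div>, whatever it contains.
    return soup.html.body.div.table

def table_by_id(soup):
    # Attribute search: independent of where the table sits in the tree.
    return soup.find("table", id="planetsTable")

old_soup = BeautifulSoup(OLD_HTML, "html.parser")
new_soup = BeautifulSoup(NEW_HTML, "html.parser")

print(table_by_path(old_soup) is not None)
print(table_by_path(new_soup) is not None)
print(table_by_id(new_soup) is not None)
```

The path-based lookup returns `None` on the new layout because the first `<div>` no longer contains a table, while the search by `id` still finds it.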

## **Searching the DOM with Beautiful Soup's find methods**

We can perform simple searches of the DOM using Beautiful Soup's find methods. These methods give us a much more flexible and powerful construct for finding elements, one that does not depend on the hierarchy of those elements.  In this recipe we will examine several common uses of these functions to locate various elements in the DOM.

### **How to do it**

In [1]:
import requests
from bs4 import BeautifulSoup
html = requests.get('http://localhost:8080/planets.html').text
soup = BeautifulSoup(html, 'lxml')

table = soup.find('table')
str(table)[:100]

'<table border="1" id="planetsTable">\n<tr id="planetHeader">\n<th>\n</th>\n<th>\n                    Name'

In [2]:
[str(tr)[:50] for tr in table.findAll("tr")]

['<tr id="planetHeader">\n<th>\n</th>\n<th>\n           ',
 '<tr class="planet" id="planet1" name="Mercury">\n<t',
 '<tr class="planet" id="planet2" name="Venus">\n<td>',
 '<tr class="planet" id="planet3" name="Earth">\n<td>',
 '<tr class="planet" id="planet4" name="Mars">\n<td>\n',
 '<tr class="planet" id="planet5" name="Jupiter">\n<t',
 '<tr class="planet" id="planet6" name="Saturn">\n<td',
 '<tr class="planet" id="planet7" name="Uranus">\n<td',
 '<tr class="planet" id="planet8" name="Neptune">\n<t',
 '<tr class="planet" id="planet9" name="Pluto">\n<td>']

> Note that these are the descendants and not immediate children.  Change the query to "td" to see the difference.  There are no direct children that are `<td>`, but each row has multiple `<td>` elements.  In all, there would be $54$ `<td>` elements found.

There is a small issue here if we want only rows that contain data for planets. The table header is also included.  We can fix this by utilizing the id attribute of the target rows.  The following finds the row where the value of `id` is `"planet3"`.

In [3]:
table.find("tr", {"id": "planet3"})

<tr class="planet" id="planet3" name="Earth">
<td>
<img src="img/earth-150x150.png"/>
</td>
<td>
                    Earth
                </td>
<td>
                    5.97
                </td>
<td>
                    12756
                </td>
<td>
                    The name Earth comes from the Indo-European base 'er,'which produced the Germanic noun 'ertho,' and ultimately German 'erde,'
                    Dutch 'aarde,' Scandinavian 'jord,' and English 'earth.' Related forms include Greek 'eraze,' meaning
                    'on the ground,' and Welsh 'erw,' meaning 'a piece of land.'
                </td>
<td>
<a href="https://en.wikipedia.org/wiki/Earth">Wikipedia</a>
</td>
</tr>

Awesome! We used the fact that this page uses this attribute to represent table rows with actual data.
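Attribute filters are not limited to exact values. As a hedged sketch on a hypothetical three-row table (`SAMPLE_HTML`), the following selects every planet row while skipping the header, first with a regular expression on `id` and then with the `class` attribute:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical cut-down version of the planets table.
SAMPLE_HTML = """
<table id="planetsTable">
  <tr id="planetHeader"><th>Name</th></tr>
  <tr class="planet" id="planet1" name="Mercury"><td>Mercury</td></tr>
  <tr class="planet" id="planet2" name="Venus"><td>Venus</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Match ids such as "planet1" or "planet2", but not "planetHeader".
planet_id = re.compile(r"^planet\d+$")
rows_by_id = soup.find_all("tr", id=planet_id)

# class_ (with the underscore) avoids clashing with the Python keyword.
rows_by_class = soup.find_all("tr", class_="planet")

ids = [row["id"] for row in rows_by_id]
print(ids)
```

Both queries return the same two rows here; which one is more robust depends on which attribute the page is more likely to keep stable.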

Now let's go one step further and collect the masses for each planet and put the name and mass in a dictionary:

In [5]:
items = dict()
planet_rows = table.findAll("tr", {"class": "planet"})
for i in planet_rows:
    tds = i.findAll("td")
    items[tds[1].text.strip()] = tds[2].text.strip()

items

{'Mercury': '0.330',
 'Venus': '4.87',
 'Earth': '5.97',
 'Mars': '0.642',
 'Jupiter': '1898',
 'Saturn': '568',
 'Uranus': '86.8',
 'Neptune': '102',
 'Pluto': '0.0146'}
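The values collected this way are still strings. A minimal sketch of turning them into numbers, assuming a hypothetical two-row table (`SAMPLE_HTML`) with the same column layout as the planets table:

```python
from bs4 import BeautifulSoup

# Hypothetical rows with the same column order: image, name, mass.
SAMPLE_HTML = """
<table>
  <tr class="planet"><td></td><td> Earth </td><td> 5.97 </td></tr>
  <tr class="planet"><td></td><td> Mars </td><td> 0.642 </td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

masses = {}
for row in soup.find_all("tr", class_="planet"):
    cells = row.find_all("td")
    name = cells[1].get_text(strip=True)
    # The table header gives the unit as 10^24 kg; float() parses the text.
    masses[name] = float(cells[2].get_text(strip=True))

# With numeric values, comparisons and aggregation become possible.
heaviest = max(masses, key=masses.get)
print(masses, heaviest)
```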

## **Querying the DOM with XPath and lxml**

XPath is a query language for selecting nodes from an XML document and is a must-learn query language for anyone performing web scraping. XPath offers a number of benefits to its user over other model-based tools:

* Can easily navigate through the DOM tree
* More sophisticated and powerful than other selectors like CSS selectors and regular expressions
* It has a large set of built-in functions ($200+$ in recent versions of the standard) and is extensible with custom functions
* It is widely supported by parsing libraries and scraping platforms

The XPath data model defines seven kinds of nodes (we have seen some of them previously):

* root node (top level parent node)
* element nodes (`<a>...</a>`)
* attribute nodes (`href="example.html"`)
* text nodes (`"this is a text"`)
* comment nodes (`<!-- a comment -->`)
* namespace nodes 
* processing instruction nodes

XPath expressions can return different data types:

* strings
* booleans
* numbers
* node-sets (probably the most common case)
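Each of these return types can be observed from Python. The sketch below uses lxml on a hypothetical two-row document (`SAMPLE_HTML`); the comments name the Python type that lxml hands back for each kind of XPath result:

```python
from lxml import html

# Hypothetical two-row document used only for illustration.
SAMPLE_HTML = """
<html><body><table id="planetsTable">
  <tr class="planet" name="Earth"><td>5.97</td></tr>
  <tr class="planet" name="Mars"><td>0.642</td></tr>
</table></body></html>
"""
tree = html.fromstring(SAMPLE_HTML)

rows = tree.xpath("//tr[@class='planet']")               # node-set: a list
first_name = tree.xpath("string(//tr/@name)")            # string
row_count = tree.xpath("count(//tr)")                    # number: a float
has_pluto = tree.xpath("boolean(//tr[@name='Pluto'])")   # boolean

print(len(rows), first_name, row_count, has_pluto)
```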

An (XPath) **`axis`** defines a node-set relative to the current node. A total of $13$ axes are defined in XPath to enable easy searching for different node parts, from the current context node, or the root node.
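A few of the axes can be tried out with lxml as follows. This is a sketch on a hypothetical three-row table (`SAMPLE_HTML`); every query is evaluated relative to the Earth row, which acts as the context node:

```python
from lxml import html

# Hypothetical three-row table used only for illustration.
SAMPLE_HTML = """
<html><body><div id="planets"><table id="planetsTable">
  <tr id="planet2" name="Venus"><td>Venus</td></tr>
  <tr id="planet3" name="Earth"><td>Earth</td></tr>
  <tr id="planet4" name="Mars"><td>Mars</td></tr>
</table></div></body></html>
"""
tree = html.fromstring(SAMPLE_HTML)
earth = tree.xpath("//tr[@name='Earth']")[0]

# Rows after and before the context node, under the same parent.
after = earth.xpath("following-sibling::tr/@name")
before = earth.xpath("preceding-sibling::tr/@name")

# Every element above the context node, up to the document root.
ancestors = [el.tag for el in earth.xpath("ancestor::*")]

# The child axis is the default, so "td" alone would mean the same thing.
cells = earth.xpath("child::td/text()")

print(after, before, ancestors, cells)
```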

**`lxml`** is a Python wrapper on top of the libxml2 XML parsing library, which is written in C. The implementation in C helps make it faster than Beautiful Soup, but also harder to install on some computers. The latest installation instructions are available [here](http://lxml.de/installation.html).

lxml supports XPath, which makes it considerably easier to manage complex XML and HTML documents. We will examine several techniques of using lxml and XPath together, and how to use lxml and XPath to navigate the DOM and access data.

### **Getting ready**

We will start by making sure lxml is installed, then import html from lxml, as well as requests, and load the page.

In [6]:
%conda install -c anaconda lxml -y


In [7]:
from lxml import html
import requests
page_html = requests.get("http://localhost:8080/planets.html").text

### **How to do it**

The first thing that we do is to load the HTML into an lxml `"etree"`.  This is lxml's representation of the DOM.

In [8]:
tree = html.fromstring(page_html)

The tree variable is now an lxml representation of the DOM which models the HTML content. Let's now examine how to use it and XPath to select various elements from the document.

Our first XPath example will be to find all the `<tr>` elements below the `<table>` element.

In [9]:
[tr for tr in tree.xpath("/html/body/div/table/tr")]

[<Element tr at 0x7ff20020dd50>,
 <Element tr at 0x7ff20020f290>,
 <Element tr at 0x7ff20020eb10>,
 <Element tr at 0x7ff20020f3d0>,
 <Element tr at 0x7ff20020e430>,
 <Element tr at 0x7ff20020ec00>,
 <Element tr at 0x7ff20020dfd0>,
 <Element tr at 0x7ff20020f1a0>,
 <Element tr at 0x7ff20020e750>,
 <Element tr at 0x7ff20020f010>,
 <Element tr at 0x7ff20020eed0>]

This XPath navigates by tag name from the root of the document down to the `<tr>` element.  This example looks similar to the property notation from Beautiful Soup, but ultimately it is significantly more expressive.  And notice one difference in the result.  All the `<tr>` elements were returned and not just the first.  As a matter of fact, the tags at each level of this path will return multiple items if they are available.  If there were multiple `<div>` elements just below `<body>`, then the search for `table/tr` would be executed on all of those `<div>` elements.
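The contrast with Beautiful Soup's property notation can be seen side by side. The following sketch runs both on the same hypothetical document (`SAMPLE_HTML`), which contains two `<div>` elements with one table each:

```python
from bs4 import BeautifulSoup
from lxml import html

# Hypothetical document with two tables in two sibling <div> elements.
SAMPLE_HTML = """
<html><body>
<div id="planets"><table id="planetsTable">
  <tr id="planetHeader"><th>Name</th></tr>
  <tr id="planet1"><td>Mercury</td></tr>
</table></div>
<div id="footer"><table id="footerTable">
  <tr id="footerRow"><td>Footer</td></tr>
</table></div>
</body></html>
"""

# XPath follows every matching node at each step of the path.
tree = html.fromstring(SAMPLE_HTML)
xpath_ids = tree.xpath("/html/body/div/table/tr/@id")

# Beautiful Soup's property notation stops at the first match at each step.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_row_id = soup.html.body.div.table.tr["id"]

print(xpath_ids)
print(first_row_id)
```

The XPath result includes the footer row from the second table, whereas the property notation only ever reaches the first row of the first table.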

The result of the query was a list of lxml element objects.  The following gets the HTML associated with the elements by using `etree.tostring()` (note that the results are bytes rather than strings):

In [11]:
from lxml import etree
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr")]

[b'<tr id="planetHeader">\n                <th>\n      ',
 b'<tr id="planet1" class="planet" name="Mercury">\n  ',
 b'<tr id="planet2" class="planet" name="Venus">\n    ',
 b'<tr id="planet3" class="planet" name="Earth">\n    ',
 b'<tr id="planet4" class="planet" name="Mars">\n     ',
 b'<tr id="planet5" class="planet" name="Jupiter">\n  ',
 b'<tr id="planet6" class="planet" name="Saturn">\n   ',
 b'<tr id="planet7" class="planet" name="Uranus">\n   ',
 b'<tr id="planet8" class="planet" name="Neptune">\n  ',
 b'<tr id="planet9" class="planet" name="Pluto">\n    ',
 b'<tr id="footerRow">\n                <td>\n         ']

Now let's look at using XPath to select only the `<tr>` elements that are planets.

In [12]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr[@class='planet']")]

[b'<tr id="planet1" class="planet" name="Mercury">\n  ',
 b'<tr id="planet2" class="planet" name="Venus">\n    ',
 b'<tr id="planet3" class="planet" name="Earth">\n    ',
 b'<tr id="planet4" class="planet" name="Mars">\n     ',
 b'<tr id="planet5" class="planet" name="Jupiter">\n  ',
 b'<tr id="planet6" class="planet" name="Saturn">\n   ',
 b'<tr id="planet7" class="planet" name="Uranus">\n   ',
 b'<tr id="planet8" class="planet" name="Neptune">\n  ',
 b'<tr id="planet9" class="planet" name="Pluto">\n    ']

The use of the `[]` next to a tag states that we want to do a selection based on some criteria upon the current element.  The `@` states that we want to examine an attribute of the tag, and in this case we want to select tags where the `class` attribute is equal to `"planet"`.

There is another point to be made about the query that returned 11 `<tr>` rows.  As stated earlier, XPath runs the navigation on all the nodes found at each level.  There are two tables in this document, each a child of a different `<div>`, and both `<div>` elements are children of the `<body>` element.  The rows from `id="planetHeader"` through `id="planet9"` came from our desired target table; the last row, with `id="footerRow"`, came from the second table.

Previously we solved this by selecting `<tr>` with `class="planet"`, but there are also other ways worth a brief mention.  The first is that we can also use `[]` to pick a specific element at each step of the XPath, as if the matches were an array.  Take the following:

In [13]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[1]/table/tr")]

[b'<tr id="planetHeader">\n                <th>\n      ',
 b'<tr id="planet1" class="planet" name="Mercury">\n  ',
 b'<tr id="planet2" class="planet" name="Venus">\n    ',
 b'<tr id="planet3" class="planet" name="Earth">\n    ',
 b'<tr id="planet4" class="planet" name="Mars">\n     ',
 b'<tr id="planet5" class="planet" name="Jupiter">\n  ',
 b'<tr id="planet6" class="planet" name="Saturn">\n   ',
 b'<tr id="planet7" class="planet" name="Uranus">\n   ',
 b'<tr id="planet8" class="planet" name="Neptune">\n  ',
 b'<tr id="planet9" class="planet" name="Pluto">\n    ']

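Indexes in XPath start at 1 instead of 0 (a common source of error), so `div[1]` selected the first `<div>`.  A change to `[2]` selects the second `<div>` and hence only the second `<table>`.  The second table can also be reached through the parent of its rows, filtering on the table's `id` (the `parent::` axis is covered further below):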

In [14]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/parent::table[@id='footerTable']")]

[b'<table id="footerTable">\n            <tr id="foote']

The first `<div>` in this document also has an `id` attribute:
```html
<div id="planets"> 
```
This can be used to select this `<div>`:

In [17]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[@id='planets']/table/tr")]

[b'<tr id="planetHeader">\n                <th>\n      ',
 b'<tr id="planet1" class="planet" name="Mercury">\n  ',
 b'<tr id="planet2" class="planet" name="Venus">\n    ',
 b'<tr id="planet3" class="planet" name="Earth">\n    ',
 b'<tr id="planet4" class="planet" name="Mars">\n     ',
 b'<tr id="planet5" class="planet" name="Jupiter">\n  ',
 b'<tr id="planet6" class="planet" name="Saturn">\n   ',
 b'<tr id="planet7" class="planet" name="Uranus">\n   ',
 b'<tr id="planet8" class="planet" name="Neptune">\n  ',
 b'<tr id="planet9" class="planet" name="Pluto">\n    ']

Earlier we selected the planet rows based upon the value of the class attribute.  We can also exclude rows:

In [19]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[@id='planets']/table/tr[@id!='planetHeader']")]

[b'<tr id="planet1" class="planet" name="Mercury">\n  ',
 b'<tr id="planet2" class="planet" name="Venus">\n    ',
 b'<tr id="planet3" class="planet" name="Earth">\n    ',
 b'<tr id="planet4" class="planet" name="Mars">\n     ',
 b'<tr id="planet5" class="planet" name="Jupiter">\n  ',
 b'<tr id="planet6" class="planet" name="Saturn">\n   ',
 b'<tr id="planet7" class="planet" name="Uranus">\n   ',
 b'<tr id="planet8" class="planet" name="Neptune">\n  ',
 b'<tr id="planet9" class="planet" name="Pluto">\n    ']

Suppose that neither the planet rows nor the header row had attributes. We could then select the planet rows by position, skipping the first row:

In [20]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[@id='planets']/table/tr[position() > 1]")]

[b'<tr id="planet1" class="planet" name="Mercury">\n  ',
 b'<tr id="planet2" class="planet" name="Venus">\n    ',
 b'<tr id="planet3" class="planet" name="Earth">\n    ',
 b'<tr id="planet4" class="planet" name="Mars">\n     ',
 b'<tr id="planet5" class="planet" name="Jupiter">\n  ',
 b'<tr id="planet6" class="planet" name="Saturn">\n   ',
 b'<tr id="planet7" class="planet" name="Uranus">\n   ',
 b'<tr id="planet8" class="planet" name="Neptune">\n  ',
 b'<tr id="planet9" class="planet" name="Pluto">\n    ']

It is possible to navigate to the parent of a node using `parent::*`:

In [21]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/parent::*")]

[b'<table id="planetsTable" border="1">\n            <',
 b'<table id="footerTable">\n            <tr id="foote']

This returned two parents because, remember, this XPath returns the rows from two tables, so the parents of all those rows are found. The `*` is a wildcard that matches a parent tag with any name. In this case, the two parents are both tables, but in general the result can be any number of HTML element types.  The following has the same result, but if the two parents were different HTML tags then it would only return the `<table>` elements.

In [22]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/parent::table")]

[b'<table id="planetsTable" border="1">\n            <',
 b'<table id="footerTable">\n            <tr id="foote']

It is also possible to specify a specific parent by position or attribute. The following selects the parent with `id="footerTable"`:

In [23]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/parent::table[@id='footerTable']")]

[b'<table id="footerTable">\n            <tr id="foote']

A shortcut for parent is `..` (and `.` also represents the current node):

In [24]:
[etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/..")]

[b'<table id="planetsTable" border="1">\n            <',
 b'<table id="footerTable">\n            <tr id="foote']

And the last example finds the mass of Earth: 

In [25]:
mass = tree.xpath("/html/body/div[1]/table/tr[@name='Earth']/td[3]/text()[1]")[0].strip()
mass

'5.97'

The trailing portion of this `XPath`, `/td[3]/text()[1]`, selects the $3^{rd}$ `<td>` element in the row, then the text of that element (which is an array of all the text in the element), and the $1^{st}$ of those which is the mass.
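The same idea extends from a single planet to every row. The sketch below, which uses a hypothetical two-row document (`SAMPLE_HTML`) with the same column order as the planets table, evaluates a relative XPath against each row and uses the XPath function `normalize-space()` instead of Python's `strip()`:

```python
from lxml import html

# Hypothetical rows with the same column order: image, name, mass.
SAMPLE_HTML = """
<html><body><div><table>
  <tr class="planet" name="Earth"><td></td><td>Earth</td><td>
      5.97
  </td></tr>
  <tr class="planet" name="Mars"><td></td><td>Mars</td><td>
      0.642
  </td></tr>
</table></div></body></html>
"""
tree = html.fromstring(SAMPLE_HTML)

masses = {}
for row in tree.xpath("//tr[@class='planet']"):
    name = row.get("name")
    # "td[3]" has no leading slash, so it is relative to the current row.
    # normalize-space() trims the whitespace around the cell text.
    masses[name] = float(row.xpath("normalize-space(td[3])"))

print(masses)
```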

### **How it works**

`XPath` is an element of the **XSLT** (**eXtensible Stylesheet Language Transformations**) standard and provides the ability to select nodes in an XML document. HTML has a similar tree structure of nested elements, and hence XPath can also work on HTML documents once they have been parsed into a tree (although HTML can be improperly formed, which can upset XPath queries in those cases).

XPath itself is designed to model the structure of XML nodes, attributes, and properties. The syntax provides means of finding items in the XML that match the expression. This can include matching or logical comparison of any of the nodes, attributes, values, or text in the XML document.
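Some of these matching and comparison features are sketched below with lxml on a hypothetical four-row document (`SAMPLE_HTML`); the predicates combine attribute tests, numeric comparisons on cell text, and string functions:

```python
from lxml import html

# Hypothetical rows: the single cell in each row holds the planet's mass.
SAMPLE_HTML = """
<html><body><table>
  <tr class="planet" name="Earth"><td>5.97</td></tr>
  <tr class="planet" name="Jupiter"><td>1898</td></tr>
  <tr class="planet" name="Saturn"><td>568</td></tr>
  <tr class="planet" name="Pluto"><td>0.0146</td></tr>
</table></body></html>
"""
tree = html.fromstring(SAMPLE_HTML)

# Numeric comparison on the cell text combined with an attribute test.
heavy = tree.xpath("//tr[@class='planet' and number(td[1]) > 100]/@name")

# "or" and not() work the same way inside a predicate.
ends = tree.xpath("//tr[@name='Earth' or @name='Pluto']/@name")
not_pluto = tree.xpath("//tr[not(@name='Pluto')]/@name")

# String functions such as starts-with() can test attributes or text.
s_names = tree.xpath("//tr[starts-with(@name, 'S')]/@name")

print(heavy, ends, not_pluto, s_names)
```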

> XPath expressions can be combined to form very complex paths within the document. It is also possible to navigate the document based upon relative positions, which helps greatly in finding data based upon relative positions instead of absolute positions within the DOM.

Understanding XPath is essential for knowing how to parse HTML and perform web scraping. And as we will see, it underlies, and provides an implementation for, many of the higher level libraries such as lxml.

### **There's more**

XPath is actually an amazing tool for working with XML and HTML documents. It is quite rich in its capabilities, and we have barely scratched the surface by demonstrating a few examples that are common to scraping data in HTML documents.

To learn much more, please visit the following links:

* [https://www.w3schools.com/xml/xml_xpath.asp](https://www.w3schools.com/xml/xml_xpath.asp)
* [https://www.w3.org/TR/xpath/](https://www.w3.org/TR/xpath/)

## **Querying data with XPath and CSS selectors**

CSS selectors are patterns used for selecting elements and are often used to define the elements that styles should be applied to. They can also be used with lxml to select nodes in the DOM. CSS selectors are commonly used as they are more compact than XPath and generally can be more reusable in code. Examples of common selectors which may be used are as follows:

| What you are looking for                                            | Example             |
| ------------------------------------------------------------------- | ------------------- |
| All tags                                                            | `*`                 |
| A specific tag (for example, `tr`)                                  | `tr`                |
| A class name (for example, `"planet"`)                              | `.planet`           |
| A tag with an `ID` `"planet3"`                                      | `tr#planet3`        |
| A child `tr` of a table                                             | `table > tr`        |
| A descendant `tr` of a table                                        | `table tr`          |
| A tag with an attribute (for example, `tr` with `id="planet4"`)     | `tr[id="planet4"]`  |


### **Getting ready**

In [26]:
from lxml import html
import requests

page_html = requests.get("http://localhost:8080/planets.html").text
tree = html.fromstring(page_html)

### **How to do it**

Now let's start playing with XPath and CSS selectors.  The following selects all `<tr>` elements with a class equal to `"planet"`:

In [27]:
[(v, v.xpath("@name")) for v in tree.cssselect('tr.planet')]

[(<Element tr at 0x7ff20020eac0>, ['Mercury']),
 (<Element tr at 0x7ff1fd68bfb0>, ['Venus']),
 (<Element tr at 0x7ff1dd145580>, ['Earth']),
 (<Element tr at 0x7ff1dd145940>, ['Mars']),
 (<Element tr at 0x7ff1dd161760>, ['Jupiter']),
 (<Element tr at 0x7ff1dd161a30>, ['Saturn']),
 (<Element tr at 0x7ff1dd1636f0>, ['Uranus']),
 (<Element tr at 0x7ff1dd161da0>, ['Neptune']),
 (<Element tr at 0x7ff1dd163ec0>, ['Pluto'])]

Data for the Earth can be found in several ways. The following gets the row based on id:

In [28]:
tr = tree.cssselect("tr#planet3")
tr[0], tr[0].xpath("./td[2]/text()")[0].strip()

(<Element tr at 0x7ff1dd145580>, 'Earth')

The following uses an attribute with a specific value: 

In [None]:
tr = tree.cssselect("tr[name='Pluto']")
tr[0], tr[0].xpath("td[2]/text()")[0].strip()

Note that unlike XPath, the `@` symbol need not be used to specify an attribute.

### **How it works**

lxml converts the CSS selector you provide to XPath, and then performs that XPath expression against the underlying document. In essence, CSS selectors in lxml provide a shorthand to XPath, which makes finding nodes that fit certain patterns simpler than with XPath.

### **There's more**

Because **CSS selectors utilize XPath under the covers, there is overhead to its use as compared to using XPath directly**. This difference is, however, almost a non-issue, and hence in certain scenarios it is easier to just use `cssselect`.

A full description of CSS selectors can be found at: https://www.w3.org/TR/2011/REC-css3-selectors-20110929/

## **Using Scrapy selectors**

Scrapy is a Python web spider framework that is used to extract data from websites. It provides many powerful features for navigating entire websites, such as the ability to follow links. One feature it provides is the ability to find data within a document using the DOM, and using the now, quite familiar, XPath.

In this recipe we will load the list of current questions on StackOverflow, and then parse this using a Scrapy selector. Using that selector, we will extract the excerpt text of each question.

### **How to do it**

We start by importing Selector from scrapy, and also requests so that we can retrieve the page:

In [41]:
from scrapy.selector import Selector
import requests
# Next we load the page. For this example we are going to retrieve the most recent
# questions on StackOverflow and extract their titles. We can make this query with the following:
response = requests.get("https://stackoverflow.com/questions")
# Now create a Selector from the text of the response:
selector = Selector(text=response.text)
selector

<Selector xpath=None data='<html class="html__responsive " lang=...'>

In [81]:
# With the selector we can find these using XPath:
summaries = selector.xpath('//div[@id="questions"]/div/div[@class="s-post-summary--content"]/div[@class="s-post-summary--content-excerpt"]')
summaries[0:5]

[<Selector xpath='//div[@id="questions"]/div/div[@class="s-post-summary--content"]/div[@class="s-post-summary--content-excerpt"]' data='<div class="s-post-summary--content-e...'>,
 <Selector xpath='//div[@id="questions"]/div/div[@class="s-post-summary--content"]/div[@class="s-post-summary--content-excerpt"]' data='<div class="s-post-summary--content-e...'>,
 <Selector xpath='//div[@id="questions"]/div/div[@class="s-post-summary--content"]/div[@class="s-post-summary--content-excerpt"]' data='<div class="s-post-summary--content-e...'>,
 <Selector xpath='//div[@id="questions"]/div/div[@class="s-post-summary--content"]/div[@class="s-post-summary--content-excerpt"]' data='<div class="s-post-summary--content-e...'>,
 <Selector xpath='//div[@id="questions"]/div/div[@class="s-post-summary--content"]/div[@class="s-post-summary--content-excerpt"]' data='<div class="s-post-summary--content-e...'>]

And now we drill a little further into each to get the excerpt text of each question:

In [90]:
[x.strip() for x in summaries.xpath('text()').getall()][:10]

["I was watching a yt tutorial and I can't seem to find the answer to this question.\nI just want to margin top a button at 600px and the other at 500 px\n<style>\n   .subscribe-button{\n      ...",
 'how to calculate the euclidean distance in a json file?\ni have a error: no match for \'operator+\'\n{\n"name" : "att5",\n"comment" : "5 capitals of the US (Padberg/...',
 'I have an expo project which I want to build an .aab app so that I can submit it on play store. But when I built it I get an error when it installs dependencies. I think the problem is one of the ...',
 "I'm using Google Books API to get details about books using their ISBN numbers\nISBN - International Standard Book Number is a numeric commercial book identifier that is intended to be unique\nWhen ...",
 'I try to write some headers on a binary file, the rest of operation (bin file merging etc) are on a batch file but apparently batch is no good to write directly on binary files, i could not manage to ...',
 '2 images

### **How it works**

Underneath the covers, **Scrapy builds its selectors on top of lxml**. It offers a smaller and slightly simpler API, which is similar in performance to lxml.

### **There's more**

To learn more about Scrapy Selectors see: https://doc.scrapy.org/en/latest/topics/selectors.html.

## **Loading data in Unicode / UTF-8**


A document's encoding tells an application how the characters in the document are represented as bytes in the file. Essentially, the encoding specifies how many bits are used for each character. In a standard ASCII document, every character is stored in a single 8-bit byte. HTML files are often encoded with 8 bits per character, but with the globalization of the internet, this is not always the case. Many HTML documents use 16-bit characters, or a variable-width encoding in which some characters take one byte and others take several.
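The number of bytes per character can be checked with nothing more than the standard library. The sketch below encodes a handful of sample characters (chosen here for illustration) in UTF-8 and UTF-16 and reports the size of each result:

```python
# Sample characters: Latin A, e-acute, Cyrillic Lje, en dash, an emoji.
samples = ["A", "\u00e9", "\u0409", "\u2013", "\U0001F600"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # little-endian, without a byte-order mark
    print(f"U+{ord(ch):04X}  utf-8: {len(utf8)} byte(s)  utf-16: {len(utf16)} byte(s)")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf16_lengths = [len(ch.encode("utf-16-le")) for ch in samples]
```

UTF-8 uses between one and four bytes depending on the character, while UTF-16 uses two bytes for most characters and four for those outside the basic range.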

A particularly common form of HTML document encoding is referred to as UTF-8. This is the encoding form that we will examine.

### **Getting ready**

We will read a file named unicode.html from our local web server, located at http://localhost:8080/unicode.html.  This file is UTF-8 encoded and contains several sets of characters in different parts of the encoding space.

### **How to do it**

We will look at using urlopen and requests to handle HTML in UTF-8. These two libraries handle this differently, so let's examine this.  Let's start by importing urlopen from urllib, loading the page, and examining some of the content.

In [91]:
from urllib.request import urlopen
page = urlopen("http://localhost:8080/unicode.html")
content = page.read()
content[840:1280]

b'><strong>Cyrillic</strong> &nbsp; U+0400 \xe2\x80\x93 U+04FF &nbsp; (1024\xe2\x80\x931279)</p>\n    <table class="unicode">\n        <tbody>\n            <tr valign="top">\n                <td width="50">&nbsp;</td>\n                <td class="b" width="50">\xd0\x89</td>\n                <td class="b" width="50">\xd0\xa9</td>\n                <td class="b" width="50">\xd1\x89</td>\n                <td class="b" width="50">\xd3\x83</td>\n            </tr>\n        </tbody>\n    </table>\n\n '

> Note how the Cyrillic characters were read in as multi-byte codes using \ notation, such as `\xd0\x89`.

In [92]:
str(content, "utf-8")[837:1270]

'<strong>Cyrillic</strong> &nbsp; U+0400 – U+04FF &nbsp; (1024–1279)</p>\n    <table class="unicode">\n        <tbody>\n            <tr valign="top">\n                <td width="50">&nbsp;</td>\n                <td class="b" width="50">Љ</td>\n                <td class="b" width="50">Щ</td>\n                <td class="b" width="50">щ</td>\n                <td class="b" width="50">Ӄ</td>\n            </tr>\n        </tbody>\n    </table>\n\n   '

With requests, the decoding step is performed by the library when we read the `text` property:

In [97]:
import requests
response = requests.get("http://localhost:8080/unicode.html")
response.text[837:1270]

' <p><strong>Cyrillic</strong> &nbsp; U+0400 â\x80\x93 U+04FF &nbsp; (1024â\x80\x931279)</p>\n    <table class="unicode">\n        <tbody>\n            <tr valign="top">\n                <td width="50">&nbsp;</td>\n                <td class="b" width="50">Ð\x89</td>\n                <td class="b" width="50">Ð©</td>\n                <td class="b" width="50">Ñ\x89</td>\n                <td class="b" width="50">Ó\x83</td>\n            </tr>\n        </tbody>\n    <'

### **How it works**

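In the case of urlopen, the conversion was performed explicitly by passing the bytes to `str` and specifying that they should be decoded as UTF-8. For requests, the `text` property decodes the body using the encoding announced in the HTTP `Content-Type` header. When that header names no character set for a text response, requests falls back to ISO-8859-1, which is the most likely reason the Cyrillic characters in the output above appear garbled (`Ð\x89` instead of `Љ`). Setting `response.encoding = "utf-8"` before reading `response.text` makes requests decode the body as UTF-8. Note that requests does not take the encoding from declarations inside the HTML itself, such as the following tag: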

```HTML
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
```
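The `requests` output shown earlier contains sequences such as `Ð\x89` where the `urlopen` output shows `Љ`. A minimal sketch of how such text arises, using only the standard library, is to decode the same UTF-8 bytes with the wrong codec:

```python
# Two Cyrillic letters, encoded as they would travel over the network.
raw = "Љ Щ".encode("utf-8")

correct = raw.decode("utf-8")

# Wrong codec: ISO-8859-1 turns every single byte into one character.
garbled = raw.decode("iso-8859-1")

# ISO-8859-1 maps each byte to exactly one character, so the mistake
# can be undone by encoding back to bytes and decoding as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(raw))
print(repr(garbled))
print(repaired)
```

With requests, the equivalent remedy is to set `response.encoding` to the right codec before reading `response.text`, or to decode `response.content` explicitly.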

### **There's more**

There are a number of resources available on the internet for learning about Unicode and UTF-8 encoding techniques. Perhaps the best is the following Wikipedia article, which has an excellent summary and a great table describing the encoding technique: https://en.wikipedia.org/wiki/UTF-8