# Web Scraping with `scrapy` in Python

# 1. HTML and XPath Fundamentals

## HTML - HyperText Markup Language

### Structure of an HTML document:

* HTML Tags:
    * `<html> ... </html>`: root tag
    * `<body> ... </body>`: body tag
    * `<div> ... </div>`: section tag
    * `<p> ... </p>`: paragraph tag
    * `<a> ... </a>`: hyperlink tag
* HTML tree: structure of HTML
    * **child**, **parent**, **sibling**, **generation**, **descendant**

## HTML Specific Syntax

Typical structure of an HTML element:

```html
<tag-name attrib-name="attrib info">
    ..element contents..
</tag-name>
```

### For example:

```html
<div id="unique-id" class="some class">
    ..div element contents..
</div>
```
* `id` is meant to be unique to the specific element.
* `class` is also used to identify the element; a tag can belong to multiple classes. In this example, the classes are `"some"` and `"class"`.

### Another example:

```html
<a href="https://somewebsite.com">
    This text links to the website
</a>
```
* For an `a` tag, the `href` attribute holds the URL that the hyperlink points to.




## Intro to XPath Notation

The **XPath** notation identifies the location of any element in an HTML document. Think of **XPath** notation as *paths to HTML elements*.

1. A single forward slash `/` moves forward one generation.

2. Tag names between slashes give the direction to the element(s).

3. Use `//` to look in **all forward generations**, instead of a single one, for the specified tag name.

4. Use `[]` to choose which of the selected siblings to pick (counting starts from 1).

5. Use the `@` sign to signify an attribute, wrapped in `[]` as a selection condition. For instance, `//div[@id="uid"]` looks for **all** `div` elements that have an `id` attribute equal to `"uid"`.

6. Use `*` as the wildcard to select all elements in a forward generation.

7. Use `@` directly after slashes to refer to the attribute itself. `/@attri_name` selects that attribute of the current element, while `//@attri_name` selects it across all forward generations.

8. Use the `text()` function within XPath to select only the text of an HTML element. `/text()` extracts the chunks of text in the current element, while `//text()` extracts all chunks of text in the current element and its descendants.

# 2. XPath and Selectors

## XPathology

**1. Be cautious around `[number]` cases**

When `//` searches across all elements for a tag name, adding `[n]` to the path will **identify the *n*th matching element within each group of siblings**, that is, under each parent. Therefore, if `<p> ... </p>` elements appear under two different parents, then `"//p[1]"` selects the first `p` child of each parent, not just the first `p` in the document.

**2. Using `*` with slashes**

When using `/*`, XPath points to **direct child elements**. When using `//*`, XPath points to **all descendant elements**.

**3. `contains(@attri-name, "string-expr")` function vs `[]`**

This function matches elements whose attribute contains `"string-expr"` as a substring. A direct `[@attri-name="string-expr"]` test requires an exact match of the whole attribute value instead.

For instance, if an element is:

```html
<p class='class-1 class-2' id='id-1'>Paragraph 1</p>
```

Then searching with `[contains(@class, 'class-1')]` or `[@class="class-1 class-2"]` will return this element, whereas `[@class="class-1"]` will not.



### `Scrapy` `Selector` with XPath

```python
from scrapy import Selector

# an HTML document provided as a string
html = """
<html>
  <body>
    <div class="hello datacamp">
      <p>Hello World!</p>
    </div>
    <p>Enjoy DataCamp!</p>
  </body>
</html>
"""
```

```python
# instantiate a Selector object from the string
sel = Selector(text=html)
print("object sel has class {}".format(type(sel)))

# call the .xpath() method of a Selector object
# to create a SelectorList of Selector objects
sel_p_list = sel.xpath("//p")
print("returned object from sel.xpath('//p') is {}".format(
    type(sel_p_list)))

# call the .extract() method of a SelectorList
# to get a list of strings, one for each Selector
# in the SelectorList
str_p_list = sel_p_list.extract()
print("Below contents are extracted with .extract():\n{}".format(
    "\n".join(str_p_list)
    ))
```

```
object sel has class <class 'scrapy.selector.unified.Selector'>
returned object from sel.xpath('//p') is <class 'scrapy.selector.unified.SelectorList'>
Below contents are extracted with .extract():
<p>Hello World!</p>
<p>Enjoy DataCamp!</p>
```


```python
# chaining .xpath() methods: if a chained path does not start
# from the root, add "." at the beginning of it
assert sel.xpath('/html/body//*').extract() == \
    sel.xpath('/html').xpath('./body//*').extract()
```

### Inspecting the HTML (easy way)

**In order to examine the HTML of a webpage**

1. Viewing the "Source" of any website will display the HTML code for the page.

2. "Inspect Element" will display the raw HTML code of the corresponding element (the one hovered over by the mouse).

3. Python's `requests` module can quickly download the raw HTML code of a specific web page: `requests.get(url_text).content` returns it as bytes, and `requests.get(url_text).text` returns it as a decoded string.

# 3. CSS Locators

## Difference between XPath and CSS?

1. `/` is replaced by `>` (except a leading `/`, which is simply dropped)

So **XPath** `/html/body/div` equals **CSS** `html > body > div`

2. `//` is replaced by a blank space (except a leading `//`, which is simply dropped)

So **XPath** `//div/span//p` equals **CSS** `div > span p`

3. `[N]` is replaced by `:nth-of-type(N)`

So **XPath** `//div/p[2]` equals **CSS** `div > p:nth-of-type(2)`

4. To find an element by class, use a period `.`. To find an element by id, use a pound sign `#`

* **Note:** This is true class / id matching rather than string matching on the attribute value as in XPath. `.some-class` matches any element whose list of classes includes `some-class`, with no need to match the whole attribute string. This is more convenient than both **XPath**'s `[@attri='exact_string']` (the whole value must match) and `contains(@attri, 'sub_string')` (any substring matches).

5. Use `*` as the wildcard

6. Use `<css-to-element>::attr(attr-name)` to access an attribute of the elements

So **XPath** `//div[@id='uid']/a/@href` equals **CSS** `div#uid > a::attr(href)`

7. Use `<css-to-element>::text` to access the text of the **current element**. Use `<css-to-element> ::text` to access the text of the current element and all of its descendants. (Notice the space in the second case.)

## How to use `scrapy` with a CSS locator string?

Simply call the `.css('CSS_match_string')` method.

```python
sel.css("html>body *").extract()
```

```
['<div class="hello datacamp">\n      <p>Hello World!</p>\n    </div>',
 '<p>Hello World!</p>',
 '<p>Enjoy DataCamp!</p>']
```

### Using `Response` objects instead of `Selector` object