# Web Scraping with `scrapy` in Python

# 1. HTML and XPath Fundamentals

## HTML - HyperText Markup Language

### Structure of an HTML document:

* HTML Tags:
    * `<html> ... </html>`: root tag
    * `<body> ... </body>`: body tag
    * `<div> ... </div>`: section tag
    * `<p> ... </p>`: paragraph tag
    * `<a> ... </a>`: hyperlink tag
* HTML tree: the nested structure of an HTML document, described with family terms
    * **child**, **parent**, **sibling**, **generation**, **descendant**

## HTML Specific Syntax

Typical structure of an HTML element:

```html
<tag-name attrib-name="attrib info">
    ..element contents..
</tag-name>
```

### For example:

```html
<div id="unique-id" class="some class">
    ..div element contents..
</div>
```
* `id` should be unique to one specific element in the document.
* `class` also identifies the element, but it does not have to be unique; an element can belong to multiple classes, separated by spaces. In this example, the classes are `"some"` and `"class"`.

### Another example:

```html
<a href="https://somewebsite.com">
    This text links to the website
</a>
```
* For an `a` tag, the `href` attribute holds the URL that the hyperlink points to.




## Intro to XPath Notation

**XPath** notation identifies the location of elements in an HTML document. Think of an XPath expression as a *path through the HTML tree* that leads to one or more elements.

1. A single forward slash `/` moves forward one generation.

2. Tag names between slashes give the direction: which element(s) to follow at each step.

3. Use `//` to look in **all forward generations** for the given tag name, instead of only the next one.

4. Use `[]` with a number to choose which of the selected siblings to keep (counting starts at 1).

5. Use the `@` sign to state a condition on an attribute, and wrap the condition in `[]` to apply it as a selection. For instance, `//div[@id="uid"]` selects **all** `div` elements whose `id` attribute equals `"uid"`.

6. Use `*` as a wildcard to select all elements in a forward generation.


# 2. XPath and Selectors

## XPathology

**1. Be cautious around `[number]` cases**

When `//` searches across all elements for a tag name, adding `[n]` to the path **selects the *n*th matching element within each group of siblings**, not the *n*th match in the whole document. Therefore, if `<p> ... </p>` elements appear under two different parents, `"//p[1]"` selects the first `p` child of each parent and returns two elements.

**2. Using `*` with slashes**

When using `/*`, XPath points to the **direct child elements** of the current element. When using `//*`, XPath points to **all descendant elements**, at any depth.