# Web Scraping in R

## HTML (Hypertext Markup Language)
- An HTML page consists of nested HTML nodes (also called elements) that tell a browser how to render its content. Most nodes have a start tag and an end tag with the same name that wrap some content; a few, such as `<meta>`, are void elements with no end tag.

```html
<html lang="en">
    <head>
        <meta charset="utf-8">
        <title>Web Scraping in R</title>
    </head>
    <body>
        <h1>Web Scraping in R</h1>
        <p>Web scraping is a technique to extract data from a website.</p>
    </body>
</html>
```

Where:
- `<html>` is the root element of the document.
- `<head>` contains meta information about the document.
- `<title>` is the title of the document.
- `<body>` contains the visible page content.
- `<h1>` is a heading.
- `<p>` is a paragraph.

- A node can also have attributes, which are key-value pairs that provide additional information about the node. For example, the `<html>` node has an attribute `lang` with the value `en` that specifies the language of the document.

- When we open a web page in a browser, the browser reads the HTML and renders the page. However, when we download a web page, we get the raw HTML code. In many browsers we can see the raw HTML of a page through a "View page source" option, or by prefixing the URL with `view-source:`.

## HTML node hierarchy
- The HTML nodes are organized in a tree-like structure. The `<html>` node is the root node of the tree. The `<head>` and `<body>` nodes are the children of the `<html>` node. The `<title>` node is the child of the `<head>` node. The `<h1>` and `<p>` nodes are the children of the `<body>` node.

## Extracting data from HTML pages in R
- We can use the `rvest` package to extract data from HTML pages in R. The `rvest` package is inspired by the Python library `Beautiful Soup`.

```R
# install.packages("rvest")
library(rvest)

simple_html <- "<html>
                    <head>
                        <title>Web Scraping in R</title>
                    </head>
                    <body>
                        <h1>Web Scraping in R</h1>
                        <p>Web scraping is a technique to extract data from a website.</p>
                    </body>
                </html>"

# Parse the HTML string into a document object
root_node <- read_html(simple_html)
root_node

# Printing the document shows an {html_document} whose children are the <head> and <body> nodes.
```
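
The tree structure described earlier can be inspected directly from R. The following is a minimal sketch, assuming `rvest` is installed; the variable names (`page`, `top_level`, `body_children`) are illustrative and not part of the original example.

```R
library(rvest)

# Same structure as the simple_html example, kept short
page <- read_html("<html>
  <head><title>Web Scraping in R</title></head>
  <body>
    <h1>Web Scraping in R</h1>
    <p>Web scraping is a technique to extract data from a website.</p>
  </body>
</html>")

# Element children of the root <html> node
top_level <- html_children(page)
html_name(top_level)
# [1] "head" "body"

# Element children of the <body> node
body_children <- html_children(html_node(page, "body"))
html_name(body_children)
# [1] "h1" "p"
```

`html_children()` returns only element nodes, so the whitespace between tags does not show up as extra children.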

- We can download an HTML file from a URL using the `download.file()` function. The file is saved to the path given in `destfile`; a relative path such as `random.html` is resolved against the working directory.

```R
# Download a HTML file from a URL
download.file("https://www.randomwebsite.com/", destfile="random.html")

# Parse the saved file
root_node <- read_html("random.html")
```
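
Since the URL above is only a placeholder, the file-reading step can be tried offline. This is a small sketch under the assumption that a temporary file stands in for the downloaded page; `local_path`, `saved_page`, and the HTML content are made up for illustration.

```R
library(rvest)

# Hypothetical stand-in for a downloaded page: write some HTML to a temp file
local_path <- tempfile(fileext = ".html")
writeLines(
  "<html><head><title>Saved page</title></head><body><p>Hello</p></body></html>",
  local_path
)

# read_html() accepts a file path as well as a string of HTML
saved_page <- read_html(local_path)
saved_title <- html_text(html_node(saved_page, "title"))
saved_title
# [1] "Saved page"
```

`read_html()` can also be given a URL directly, which skips the intermediate file when a local copy is not needed.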

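- To access any node in the tree, we can use the `html_node()` function. It takes two arguments: the node (or document) to search in and a CSS selector, and it returns the first node that matches. The CSS selector is a pattern that describes the nodes we want to extract. It can be a tag name (`p`), a class name (`.intro`), or an id (`#main`). In rvest 1.0.0 and later, `html_node()` is superseded by `html_element()`, although the older name still works.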

```R
# Format is html_node(root_node, css_selector)
# Extract the title node
title_node <- html_node(root_node, "title")

# Extract the h1 node
h1_node <- html_node(root_node, "h1")

# Extract the p node
p_node <- html_node(root_node, "p")
```
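
The three kinds of selector can be compared side by side. The sketch below assumes a small hand-written page; the `id` and `class` values are hypothetical and chosen only for illustration. It also uses `html_nodes()`, which returns every match rather than only the first.

```R
library(rvest)

# Hypothetical page with an id and a class to select on
page <- read_html('<html><body>
  <h1 id="main-title">Web Scraping in R</h1>
  <p class="intro">First paragraph.</p>
  <p class="intro">Second paragraph.</p>
  <p>Third paragraph.</p>
</body></html>')

# Select by id: "#" followed by the id value
by_id <- html_node(page, "#main-title")
html_text(by_id)
# [1] "Web Scraping in R"

# Select by class: "." followed by the class name; html_nodes() returns all matches
by_class <- html_nodes(page, ".intro")
length(by_class)
# [1] 2

# Select by tag name
all_p <- html_nodes(page, "p")
length(all_p)
# [1] 3
```

When nothing matches, `html_node()` returns a missing node, while `html_nodes()` returns an empty set, so checking `length()` is a simple way to detect an unmatched selector in the multi-node case.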

- To extract the text content of a node, we can use the `html_text()` function.

```R
# Extract the text content of the title node
html_text(title_node)

# e.g. [1] "Random Website" (the result depends on the page's <title>)
```
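
`html_text()` also works on a set of nodes and can strip surrounding whitespace. A minimal sketch, assuming a made-up page with two paragraphs:

```R
library(rvest)

# Hypothetical page whose first paragraph has padding around the text
page <- read_html("<html><body><p>  padded text  </p><p>second</p></body></html>")
paragraphs <- html_nodes(page, "p")

# One string per node; trim = TRUE removes leading and trailing whitespace
trimmed <- html_text(paragraphs, trim = TRUE)
trimmed
# [1] "padded text" "second"
```

Without `trim = TRUE`, the whitespace inside each node is returned as it appears in the source.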

- To extract the value of an attribute from a node, we can use the `html_attr()` function. It returns `NA` when the node does not have that attribute.

```R
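# The lang attribute belongs to the <html> node, not to <title>,
# so html_attr(title_node, "lang") would return NA.
# The parsed document stands in for its root <html> node here.
html_attr(root_node, "lang")

# e.g. [1] "en" (if the page declares lang="en")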
```
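
A common use of `html_attr()` is collecting link targets. The sketch below assumes a hand-written page; the `<a>` nodes and their URLs are hypothetical and not part of the original example.

```R
library(rvest)

# Hypothetical page: two links with an href and one without
page <- read_html('<html lang="en"><body>
  <a href="https://example.com/a">A</a>
  <a href="https://example.com/b">B</a>
  <a>No link</a>
</body></html>')

# html_attr() returns one value per node, with NA where the attribute is missing
links <- html_nodes(page, "a")
hrefs <- html_attr(links, "href")
hrefs
# [1] "https://example.com/a" "https://example.com/b" NA

# The lang attribute of the root <html> node
page_lang <- html_attr(page, "lang")
page_lang
# [1] "en"
```

Filtering with `hrefs[!is.na(hrefs)]` keeps only the nodes that actually carry the attribute.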