<a href="https://colab.research.google.com/github/JordanDCunha/Automated-Data-Collection-with-R-A-Practical-Guide-to-Web-Scraping-and-Text-Mining/blob/main/Chapter_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 2 — HTML

## Introduction

Almost everything we see in a web browser is built using **HTML (HyperText Markup Language)**.  
Whether browsing Wikipedia, searching on Google, or using social media, we are interacting with HTML.

HTML is a **markup language**: it uses tags to structure and describe content.  
It was first proposed by Tim Berners-Lee in 1989 and has since evolved into modern standards such as HTML5.  
Despite these revisions, its core structure has remained stable, so a working knowledge of HTML is essential for web scraping.

---

## 2.1 Browser Presentation vs. Source Code

An HTML file is simply **plain text**.  
What makes it powerful is its **marked-up structure**.

HTML uses *tags* to define elements such as:

- Titles  
- Headings  
- Paragraphs  
- Links  
- Tables  

For example:

```html
<title>First HTML</title>
```

```r
# Install if necessary:
# install.packages(c("rvest", "xml2"))

library(rvest)
library(xml2)

# Read a webpage
url <- "https://example.com"
page <- read_html(url)

# Extract the page title
html_element(page, "title") |> html_text()

# Extract all paragraph text
html_elements(page, "p") |> html_text()

# Extract all hyperlinks
html_elements(page, "a") |> html_attr("href")

# View first 15 lines of raw HTML source
readLines(url, n = 15)
```

# 2.2 Syntax Rules of HTML

Now that we understand the difference between a webpage’s displayed version and its source code, we examine the core syntax rules that structure HTML documents.

---

## 2.2.1 Tags, Elements, and Attributes

### Tags and Elements

HTML turns plain text into structured documents using **tags**.

Example:

```html
<title>First HTML</title>
```

- `<title>` → start tag  
- `First HTML` → content  
- `</title>` → end tag  

The combination of **start tag + content + end tag** is called an **element**.

Key rules:

- Tags are enclosed in `< >`
- End tags contain `/`
- Tag and attribute names are **not case sensitive**
- Best practice: use lowercase (`<tagname>`)
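
As a quick check of the case rule, the sketch below parses the same element written in upper and lower case and queries both with a lowercase selector. It assumes the `rvest` package is installed; the two input strings are made up for illustration.

```r
library(rvest)

# Two spellings of the same element (illustrative strings, not from a real page)
upper <- read_html("<TITLE>First HTML</TITLE>")
lower <- read_html("<title>First HTML</title>")

# The lowercase selector matches both, because the parser normalizes tag names
title_upper <- html_element(upper, "title") |> html_text()
title_lower <- html_element(lower, "title") |> html_text()

title_upper == title_lower
```

This is one reason lowercase tag names are the safer convention: selectors written in lowercase work regardless of how the page author typed the tags.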

---

### Empty Elements

Some elements do not contain content.

Example:

```html
<br>
```

Or self-closing form:

```html
<tagname />
```

These are called **empty elements**.

---

### Attributes

Attributes provide additional information inside start tags.

Example:

```html
<a href="http://www.r-datacollection.com/">Link to Homepage</a>
```

- `href` → attribute name  
- `"http://..."` → attribute value  

Rules:

- Written as `name="value"`
- Multiple attributes separated by spaces
- Values enclosed in single or double quotes
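
The attribute rules can be tried out in R. This is a minimal sketch assuming the `rvest` package; the `title` attribute is not part of the original example and is added only to show how several attributes on one tag are returned.

```r
library(rvest)

# The title attribute is an illustrative addition to the original link
doc <- read_html(
  '<a href="http://www.r-datacollection.com/" title="Homepage">Link to Homepage</a>'
)
link <- html_element(doc, "a")

href  <- html_attr(link, "href")  # value of a single attribute
attrs <- html_attrs(link)         # named character vector of all attributes
text  <- html_text(link)          # the element content between the tags
```

`html_attr()` returns one value by name, while `html_attrs()` returns every name/value pair found in the start tag.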

---

## 2.2.2 Tree Structure

HTML documents follow a **tree structure**.

Example:

```html
<html>
  <head>
    <title>First HTML</title>
  </head>
  <body>
    Content here
  </body>
</html>
```

Hierarchy:

- `<html>` (root)
  - `<head>`
    - `<title>`
  - `<body>`

Elements must be **properly nested**.  
Tags must close in reverse order of opening.

Invalid example:

```html
<b><i>Text</b></i>
```
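
The parent/child relationships of the tree can be inspected once a document is parsed. The following sketch assumes the `xml2` package and uses the small well-formed document from this section.

```r
library(xml2)

doc <- read_html(
  "<html><head><title>First HTML</title></head><body>Content here</body></html>"
)

root_name   <- xml_name(doc)                     # the root element
child_names <- xml_name(xml_children(doc))       # direct children of the root
title_node  <- xml_find_first(doc, "//title")
parent_name <- xml_name(xml_parent(title_node))  # one step up from <title>
```

Moving between parents and children in this way is the basis for most extraction strategies discussed later.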

---

## 2.2.3 Comments

Comments are ignored by the browser.

Syntax:

```html
<!-- This is a comment -->
```

Comments:
- Are not displayed
- Remain visible in source code
- Help document structure
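
Because comments stay in the source, a scraper can still read them. The sketch below assumes the `xml2` package and a made-up fragment; `comment()` is the XPath node test that matches comment nodes.

```r
library(xml2)

# Illustrative fragment with one paragraph and one comment
doc <- read_html("<body><p>Visible text</p><!-- This is a comment --></body>")

# Select every comment node in the document and read its text
comments     <- xml_find_all(doc, "//comment()")
comment_text <- trimws(xml_text(comments))
```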

---

## 2.2.4 Reserved and Special Characters

Some characters cannot appear directly in HTML content because they are part of markup.

Examples:
- `<`
- `>`
- `&`

Instead, HTML uses **character entities**.

| Character | Entity |
|-----------|--------|
| `<` | `&lt;` |
| `>` | `&gt;` |
| `&` | `&amp;` |
| `"` | `&quot;` |
| `'` | `&apos;` |
| non-breaking space | `&nbsp;` |

Example:

Instead of writing:

```html
5 < 6
```

We write:

```html
5 &lt; 6
```

Entities always:
- Start with `&`
- End with `;`
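
When a page is parsed, entities are translated back into the characters they stand for, so scraped text rarely needs manual decoding. A minimal sketch, assuming the `rvest` package and an invented paragraph:

```r
library(rvest)

# The source uses entities for <, & and >
doc <- read_html("<p>5 &lt; 6 &amp; 7 &gt; 3</p>")

# html_text() returns the decoded characters, not the entity names
decoded <- html_element(doc, "p") |> html_text()
decoded
```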

---

## 2.2.5 Document Type Definition (DTD)

The first line of an HTML document specifies its version:

```html
<!DOCTYPE html>
```

For modern webpages, this indicates HTML5.

---

## 2.2.6 Spaces and Line Breaks

In HTML source code:

- Line breaks are treated like ordinary spaces and do not start a new line
- Multiple spaces collapse into one space

To force formatting:

- Non-breaking space: `&nbsp;`
- Line break tag: `<br>`

Example:

```html
Hello&nbsp;&nbsp;World<br>
New Line
```
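
The difference between source whitespace and displayed whitespace also shows up when extracting text. The sketch below assumes `rvest` 1.0.0 or later, which provides both `html_text()` and `html_text2()`; the paragraph is invented for illustration.

```r
library(rvest)

# Five spaces and a raw line break in the source, plus one <br> tag
doc <- read_html("<p>Hello     World<br>New\nLine</p>")
p   <- html_element(doc, "p")

raw_text      <- html_text(p)   # text exactly as stored in the source
rendered_text <- html_text2(p)  # approximates what a browser would display
```

`html_text2()` is usually the better choice when the goal is text as a reader would see it; `html_text()` is faster and keeps the source spacing untouched.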

---

### ✅ Key Takeaway

HTML syntax relies on:

- Tags  
- Attributes  
- Proper nesting  
- Tree structure  
- Special character entities  

Understanding these rules is essential for parsing and web scraping.


```r
# Install required packages if needed
# install.packages(c("rvest", "xml2"))

library(rvest)
library(xml2)

# Example HTML document as a string
html_text_example <- '
<!DOCTYPE html>
<html>
  <head>
    <title>Example Page</title>
  </head>
  <body>
    <p>This is a paragraph.</p>
    <a href="https://example.com">Visit Example</a>
    <!-- This is a comment -->
  </body>
</html>
'

# Parse the HTML
page <- read_html(html_text_example)

# Extract title
html_element(page, "title") |> html_text()

# Extract paragraph text
html_elements(page, "p") |> html_text()

# Extract hyperlink text and href attribute
link <- html_element(page, "a")
html_text(link)
html_attr(link, "href")

# Display tree structure
xml_structure(page)
```

# 2.3 Important HTML Tags and Attributes for Web Data Collection

HTML provides many tags and attributes. For web scraping, only a subset is commonly used because these tags store most of the structured information on webpages.

---

## 2.3.1 Anchor Tag `<a>`

The `<a>` tag creates hyperlinks and is the foundation of navigation across webpages.

### Example

```html
<a href="https://example.com">Visit Example</a>
```

### Uses
- Links to another webpage
- Links to a specific location inside a page
- Links to a location inside another document

The key attribute is:
- `href` → defines the destination URL

This tag is extremely important for scraping because it allows navigation across multiple pages.
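
To follow links programmatically, relative `href` values usually have to be resolved against the address of the page they came from. The sketch below assumes the `rvest` and `xml2` packages; the base URL and both links are hypothetical.

```r
library(rvest)
library(xml2)

# Assumed address of the page being scraped (hypothetical)
base_url <- "https://example.com/docs/index.html"

# One relative link and one absolute link
doc <- read_html('
  <a href="page2.html">Next page</a>
  <a href="https://example.com/about">About</a>
')

hrefs <- html_elements(doc, "a") |> html_attr("href")

# Resolve relative links against the page address so they can be requested
absolute <- url_absolute(hrefs, base_url)
absolute
```

Links that are already absolute pass through unchanged, so the same call can be applied to every `href` on a page.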

---

## 2.3.2 Metadata Tag `<meta>`

The `<meta>` tag stores information about the webpage. It appears inside the `<head>` element and is an empty tag.

### Example

```html
<meta name="description" content="Example webpage">
```

### Common Uses
- Page description
- Keywords
- Character encoding
- Instructions for search engines
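
Metadata is stored in attributes rather than in element content, so it is read with an attribute selector plus `html_attr()`. A minimal sketch, assuming the `rvest` package and a small invented page built around the example tag:

```r
library(rvest)

doc <- read_html('
<html>
  <head>
    <meta name="description" content="Example webpage">
  </head>
  <body></body>
</html>')

# Select the <meta> element whose name attribute is "description"
description <- html_element(doc, 'meta[name="description"]') |>
  html_attr("content")
description
```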

---

## 2.3.3 External Reference Tag `<link>`

The `<link>` tag connects external resources to an HTML document.

### Example

```html
<link rel="stylesheet" href="style.css">
```

### Common Uses
- Linking CSS files
- Adding icons
- Linking external documents

---

## 2.3.4 Emphasis Tags

These tags change text appearance and help identify structured content.

| Tag | Meaning |
|------|-----------|
| `<b>` | Bold text |
| `<i>` | Italic text |
| `<strong>` | Important text |

These are useful when important information is consistently formatted.

---

## 2.3.5 Paragraph Tag `<p>`

The `<p>` tag defines paragraphs and automatically creates spacing.

Example:

```html
<p>This is a paragraph.</p>
```

---

## 2.3.6 Heading Tags `<h1>` – `<h6>`

These tags define hierarchical headings.

- `<h1>` → largest heading
- `<h6>` → smallest heading

Headings often contain key webpage content and are valuable for scraping structured summaries.

---

## 2.3.7 List Tags

HTML supports multiple list types.

### Unordered List

```html
<ul>
  <li>Item</li>
</ul>
```

### Ordered List

```html
<ol>
  <li>Item</li>
</ol>
```

### Description List

```html
<dl>
  <dt>Term</dt>
  <dd>Description</dd>
</dl>
```

Lists often store grouped or categorized data.

---

## 2.3.8 Organizational Tags `<div>` and `<span>`

These tags group content for styling and structure.

- `<div>` → block-level grouping
- `<span>` → inline grouping

They often use attributes like `class="example"`.

These groupings are frequently used with CSS and help locate data during scraping.
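
A class name on a `<div>` or `<span>` can be targeted directly with a CSS selector. The sketch below assumes the `rvest` package; the class names `product`, `name`, and `price` and the values are invented for illustration.

```r
library(rvest)

# Class names and values are made up for this illustration
doc <- read_html('
<div class="product">
  <span class="name">Item A</span>
  <span class="price">10</span>
</div>
<div class="product">
  <span class="name">Item B</span>
  <span class="price">20</span>
</div>')

# "div.product span.price" reads: spans of class price inside divs of class product
item_names <- html_elements(doc, "div.product span.name")  |> html_text()
prices     <- html_elements(doc, "div.product span.price") |> html_text()
```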

---

## 2.3.9 Form Tag `<form>` and Input Elements

Forms allow users to send data to servers.

### Example

```html
<form action="submit.html" method="GET">
  <input type="text" name="username">
  <input type="submit">
</form>
```

Important attributes:

- `action` → destination URL
- `method` → GET or POST

### Input Types
- text
- checkbox
- date
- range
- hidden
- submit
- reset

When `GET` is used, data appears in the URL as a **query string**, for example `example.com/page?name=value`.
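
The link between a form and its query string can be reproduced by hand. This sketch assumes the `rvest` package and base R's `utils::URLencode()`; the submitted value "Jane Doe" is hypothetical.

```r
library(rvest)

doc <- read_html('
<form action="submit.html" method="GET">
  <input type="text" name="username">
  <input type="submit">
</form>')

form   <- html_element(doc, "form")
action <- html_attr(form, "action")
field  <- html_element(form, "input[type='text']") |> html_attr("name")

# "Jane Doe" is a made-up value; URLencode() escapes the space for use in a URL
value     <- utils::URLencode("Jane Doe", reserved = TRUE)
query_url <- paste0(action, "?", field, "=", value)
query_url
```

Knowing this pattern often makes it possible to skip the form entirely and request the resulting URL directly.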

---

## 2.3.10 Script Tag `<script>`

The `<script>` tag embeds executable code, most commonly JavaScript, in an HTML document.

### Uses
- Dynamic webpage updates
- User interaction
- Event handling

Scripts may:
- Be written directly in HTML
- Load external JavaScript files
- Respond to events like mouse hover or clicks

---

## 2.3.11 Table Tags

Tables display structured tabular data.

| Tag | Purpose |
|--------|-------------|
| `<table>` | Creates table |
| `<tr>` | Table row |
| `<td>` | Data cell |
| `<th>` | Header cell |

### Example

```html
<table>
  <tr>
    <th>Country</th>
    <th>GDP</th>
  </tr>
  <tr>
    <td>Norway</td>
    <td>98565</td>
  </tr>
</table>
```

Tables are one of the most common structured data sources for web scraping.
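
Rather than collecting cells one by one, a whole table can be converted into a data frame. A minimal sketch, assuming `rvest` 1.0.0 or later and the example table from this section:

```r
library(rvest)

doc <- read_html('
<table>
  <tr><th>Country</th><th>GDP</th></tr>
  <tr><td>Norway</td><td>98565</td></tr>
</table>')

# html_table() uses the <th> row as column names and converts column types
gdp <- html_element(doc, "table") |> html_table()
gdp
```

The header cells become column names and numeric-looking columns are converted automatically, which saves most of the manual cleanup.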

---

## ✅ Key Takeaway

The most important HTML tags for data collection include:

- Navigation → `<a>`
- Structure → `<div>`, `<span>`
- Text organization → `<p>`, `<h1>`–`<h6>`
- Lists → `<ul>`, `<ol>`, `<dl>`
- Metadata → `<meta>`
- External files → `<link>`
- Forms → `<form>` and `<input>`
- Dynamic content → `<script>`
- Tabular data → `<table>`

Understanding these tags helps identify where useful information is stored.


```r
# Install packages if necessary
# install.packages(c("rvest", "xml2"))

library(rvest)
library(xml2)

# Example HTML containing common tags
html_example <- '
<html>
<head>
<meta name="description" content="Sample page">
<link rel="stylesheet" href="style.css">
<title>Example Page</title>
</head>

<body>

<h1>Main Heading</h1>
<p>This is a paragraph.</p>

<a href="https://example.com">Example Link</a>

<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>

<table>
<tr>
<th>Country</th>
<th>GDP</th>
</tr>
<tr>
<td>Norway</td>
<td>98565</td>
</tr>
</table>

<form action="submit.html" method="GET">
<input type="text" name="username" value="guest">
<input type="submit">
</form>

</body>
</html>
'

# Parse HTML
page <- read_html(html_example)

# Extract links
html_elements(page, "a") |> html_attr("href")

# Extract headings
html_elements(page, "h1") |> html_text()

# Extract paragraph text
html_elements(page, "p") |> html_text()

# Extract list items
html_elements(page, "li") |> html_text()

# Extract table data
html_elements(page, "td") |> html_text()

# Extract form input names
html_elements(page, "input") |> html_attr("name")

# Show document structure
xml_structure(page)
```


# 2.4 Parsing HTML in R

After learning HTML structure, the next step in web scraping is **loading and representing HTML inside R**.  
This process is called **parsing**.

When scraping, parsing happens twice:

1. The **browser parses HTML** to display it visually.
2. **R parses HTML** to build a structured object for data extraction.

---

## 2.4.1 What Is Parsing?

There is an important difference between **reading** and **parsing**.

### Reading (Flat Representation)

Functions like `readLines()` simply load raw text line by line.

- No understanding of HTML grammar
- No hierarchy
- No structure awareness

Result: a flat character vector.

---

### Parsing (Structured Representation)

A parser understands markup and reconstructs the document’s hierarchy.

The result is the **Document Object Model (DOM)**:

- Tree-like structure
- Each tag becomes a node
- Nodes are nested according to HTML structure
- Queryable and extractable
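
The contrast between reading and parsing can be seen on a small in-memory document. This sketch assumes the `XML` package; the HTML string is a stand-in for a downloaded page, and `asText = TRUE` tells the parser that the input is markup rather than a file name.

```r
library(XML)

html <- "<html><head><title>First HTML</title></head><body><p>Text</p></body></html>"

# Reading: a flat character vector, one element per line of text
con  <- textConnection(html)
flat <- readLines(con)
close(con)

# Parsing: a document object that can be queried by its structure
doc   <- htmlParse(html, asText = TRUE)
title <- xpathSApply(doc, "//title", xmlValue)
```

With the flat vector, finding the title would require string matching; with the parsed document, a single structural query is enough.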

---

## DOM Parsing: Two-Step Process

1. **C-Level Parsing**
   - The entire document is parsed.
   - Nodes are created.
   - Errors (e.g., missing closing tags) are corrected automatically.
   - libxml2 handles malformed HTML gracefully.

2. **Conversion to R Object**
   - The C structure is translated into an R list-based structure.
   - This allows convenient extraction and manipulation in R.

The main functions, both from the **XML** package, are `htmlParse()` and `htmlTreeParse()`.
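
The automatic error correction in the first step can be observed with deliberately broken input. The sketch below assumes the `XML` package; the fragment is invented and leaves out the closing `</p>` tags as well as the `<html>` and `<body>` wrappers.

```r
library(XML)

# Neither <p> is closed, and <html>/<body> are missing entirely
broken <- "<p>First paragraph<p>Second paragraph"

doc <- htmlParse(broken, asText = TRUE)

# The parser closes each <p> and supplies the missing outer elements
paragraphs <- xpathSApply(doc, "//p", xmlValue)
root_name  <- xmlName(xmlRoot(doc))
```

Real pages are often not perfectly well-formed, so this repair step is what makes structural queries workable in practice.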

---

## 2.4.2 Discarding Nodes During Parsing

Sometimes we do not need the entire document.

To:
- Save memory
- Improve speed
- Remove unnecessary elements

We can define **handler functions**.

Handlers allow us to:
- Delete nodes
- Modify nodes
- Ignore specific tags (e.g., `<div>`, `<title>`)
- Remove comments

Handlers are passed as a named list to the parser.

Example idea:
- If node name == "body", return NULL → node removed.

---

## Generic Handler Types

| Handler | Operates On |
|----------|-------------|
| startElement() | Any XML element |
| text() | Text nodes |
| comment() | Comments |
| cdata() | CDATA nodes |
| processingInstruction() | Processing instructions |
| namespace() | Namespaces |
| entity() | Entity references |

Handlers allow custom control over DOM construction.

---

## 2.4.3 Extracting Information During Parsing

Instead of:

1. Parsing entire DOM
2. Traversing DOM again
3. Extracting target nodes

We can extract **during parsing**.

This is more efficient for large documents.

To do this, we use:

**Closure-based handler functions**

Why closures?
- Handlers operate in local scope
- We need access to a container in outer scope
- Closures allow writing to non-local variables

Technique:
- Create a container variable
- Define handler for a specific tag (e.g., `<i>`)
- Use superassignment `<<-` to append results
- Provide a return function

This captures the relevant content in the same pass as parsing, so the tree does not have to be traversed a second time.

---

## Key Takeaways

- Parsing converts raw HTML into a structured DOM.
- DOM allows hierarchical querying.
- `readLines()` ≠ parsing.
- Handlers allow:
  - Node deletion
  - Custom DOM building
  - Direct extraction during parsing
- Closure handlers enable efficient extraction.

Parsing is foundational for robust and principled web scraping in R.


```r
# Install if needed
# install.packages("XML")

library(XML)

url <- "http://www.r-datacollection.com/materials/html/fortunes.html"

### 1️⃣ Reading (Flat Representation)

fortunes_raw <- readLines(url)
head(fortunes_raw)


### 2️⃣ DOM Parsing

parsed_doc <- htmlParse(url)
parsed_doc


### 3️⃣ Discarding Specific Nodes with Handlers

# Remove <body> node
h1 <- list(
  body = function(x) NULL
)

parsed_no_body <- htmlTreeParse(url, handlers = h1, asTree = TRUE)
parsed_no_body


### 4️⃣ Removing Multiple Nodes + Comments

h2 <- list(
  startElement = function(node) {
    if (xmlName(node) %in% c("div", "title")) {
      return(NULL)
    } else {
      return(node)
    }
  },
  comment = function(x) NULL
)

parsed_filtered <- htmlTreeParse(url, handlers = h2, asTree = TRUE)
parsed_filtered


### 5️⃣ Extracting <i> Elements During Parsing (Closure Example)

getItalics <- function() {

  i_container <- character()

  handler <- list(
    i = function(x, ...) {
      i_container <<- c(i_container, xmlValue(x))
      NULL
    }
  )

  returnI <- function() {
    i_container
  }

  handler$returnI <- returnI
  handler
}

h3 <- getItalics()

invisible(htmlTreeParse(url, handlers = h3))

# Retrieve extracted italics content
h3$returnI()
```
