# Introduction to HTML and BeautifulSoup for NLP Preprocessing

Welcome to this introductory guide on using HTML (HyperText Markup Language) and BeautifulSoup to preprocess webpages for Natural Language Processing (NLP) applications. A large share of the information on the internet lives in webpages, and those pages are written primarily in HTML, the standard markup language for structuring web content.

## Understanding HTML

Before diving into the specifics of BeautifulSoup and its applications in NLP, it's crucial to understand the basics of HTML:

- HTML Structure: HTML documents are composed of a series of elements, represented by tags (like `<p>` for paragraphs, `<h1>` for main headings, `<div>` for divisions/sections, etc.). These elements structure the webpage and define its content.

- Nested Elements: HTML elements can be nested within each other, creating a parent-child relationship in the webpage's structure. This hierarchical structure is key to navigating and extracting specific data from webpages.

- Attributes: HTML elements can have attributes (like `class`, `id`, `href`, etc.) that provide additional information about the element, often used for styling or identifying elements.
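
To make tags, nesting, and attributes concrete, here is a minimal sketch that uses only Python's standard-library `html.parser` module. The HTML snippet and the `OutlineRecorder` class are made up for this illustration. The recorder notes each opening tag together with its nesting depth and attributes.

```python
from html.parser import HTMLParser

# Hypothetical snippet, used only for illustration
SNIPPET = (
    '<div id="main">'
    '<p class="intro">Hello <a href="https://example.com">world</a></p>'
    "</div>"
)


class OutlineRecorder(HTMLParser):
    """Record each opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes every tag in the snippet is explicitly closed
        self.depth -= 1


parser = OutlineRecorder()
parser.feed(SNIPPET)
parser.close()

for depth, tag, attrs in parser.outline:
    print("  " * depth + f"<{tag}> {attrs}")
# <div> {'id': 'main'}
#   <p> {'class': 'intro'}
#     <a> {'href': 'https://example.com'}
```

The indentation in the output mirrors the parent-child structure: the `<a>` element is a child of `<p>`, which is in turn a child of `<div>`.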

## Comparing HTML and Markdown

HTML (HyperText Markup Language) and Markdown are both markup languages used for creating formatted text, but they serve different purposes and exhibit distinct characteristics. By comparing a sample document in both HTML and Markdown side by side, we can observe these differences in complexity, flexibility, and ease of use, highlighting the unique advantages each language brings to text formatting and web content creation.

- [Sample HTML document](./sample_html_doc.html)
- [Sample MD document](./sample_md_doc.md)


## Introduction to BeautifulSoup

With the knowledge of HTML in hand, we can leverage [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), a powerful Python library, to parse and extract information from HTML documents:

- Parsing HTML: BeautifulSoup can read and parse HTML content, allowing you to navigate the structure of a webpage programmatically. In effect, your code can read a page's structure much as a browser does, but for the purpose of extracting and manipulating data rather than rendering it.

- Locating Elements: Using BeautifulSoup, you can find specific elements, extract data, modify the HTML, or even build an entirely new HTML document. It provides methods to search for elements by tags, attributes, text content, and more.

- NLP Applications: In NLP, preprocessing is a critical step. When working with text data from webpages, BeautifulSoup enables you to clean and prepare this data. This includes tasks like extracting readable text from HTML, removing irrelevant content (like navigation bars, ads, footers), and structuring the text data for NLP models.
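
As a rough preview of the cleaning step described above, the sketch below removes navigation, footer, and script elements before extracting text. The page content is invented for this illustration, and the choice of tags to drop is an assumption that will vary from site to site. It requires `beautifulsoup4` to be installed (see Setup below).

```python
from bs4 import BeautifulSoup

# Hypothetical page, used only for illustration
html = """
<html><body>
  <nav><a href="/">Home</a> <a href="/about">About</a></nav>
  <p>BeautifulSoup helps turn web pages into clean text.</p>
  <script>console.log("tracking");</script>
  <footer>Copyright notice</footer>
</body></html>
"""

preview_soup = BeautifulSoup(html, "html.parser")

# Remove elements that rarely carry useful text for NLP.
# decompose() deletes the tag and everything inside it from the tree.
for tag in preview_soup.find_all(["nav", "footer", "script", "style"]):
    tag.decompose()

clean_text = preview_soup.get_text(separator=" ", strip=True)
print(clean_text)
# BeautifulSoup helps turn web pages into clean text.
```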

In the following section, we'll use the sample HTML document and demonstrate how to apply BeautifulSoup for various preprocessing tasks, setting the stage for more advanced NLP applications. This hands-on approach will give you the practical skills needed to harness the power of web scraping in conjunction with NLP.


### Setup

Try running the cell below while your virtual environment is activated. If you get a `ModuleNotFoundError`, install the library using `python -m pip install beautifulsoup4`.


In [4]:
from bs4 import BeautifulSoup

### BeautifulSoup object

At the heart of BeautifulSoup is the "soup" object, which represents the parsed HTML (or XML) document and provides a rich set of methods to navigate, search, and modify the parse tree.

When you create a BeautifulSoup object, you essentially convert a string of HTML or XML into a complex tree of Python objects. This conversion process makes it possible to interact with the HTML elements programmatically, as if they were typical Python objects. Here are some key features and capabilities of the soup object:

- Parsing HTML/XML: BeautifulSoup can parse content from various sources, such as a string or a file, and supports different parsers (like `html.parser`, `lxml`, etc.), offering flexibility in handling various types of HTML/XML inputs.

- Navigating the Tree: The soup object allows you to navigate the parse tree using tag names, which can be as straightforward as accessing Python attributes. It's like traversing a hierarchical structure of elements in a webpage.

- Searching the Document: With methods like `find()` and `find_all()`, the soup object enables you to locate elements based on tags, attributes, or text content. CSS selectors are supported through the separate `select()` and `select_one()` methods. This powerful searching capability is essential for extracting specific data from complex web pages.
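
The capabilities listed above can be sketched on a small inline document. The HTML string and the variable name `demo` are made up for this illustration; the notebook itself works with `sample_html_doc.html`.

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration
html = (
    "<html><head><title>Demo</title></head><body>"
    '<h1 id="main-header">Heading</h1>'
    '<p class="intro">First</p><p class="intro">Second</p>'
    "</body></html>"
)
demo = BeautifulSoup(html, "html.parser")

# Navigating: tag names work like attributes and return the first match
print(demo.title.string)    # Demo
print(demo.h1["id"])        # main-header
print(demo.h1.parent.name)  # body

# Searching with CSS selectors uses select() / select_one()
print([p.get_text() for p in demo.select("p.intro")])  # ['First', 'Second']
print(demo.select_one("#main-header").get_text())      # Heading
```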


In [5]:
# Use context manager to read the sample document and initialise the soup object with `html.parser`
with open('sample_html_doc.html', 'r') as file:
    soup = BeautifulSoup(file, 'html.parser')
    
print(soup)

<!DOCTYPE html>

<html>
<head>
<title>DPAA02 - HTML Introduction</title>
</head>
<body>
<img alt="HTML5 Logo" src="https://www.w3.org/html/logo/downloads/HTML5_Badge_128.png"/>
<h1 id="main-header">Introduction to HTML Cleaning</h1>
<p class="intro">
      This is a sample HTML document used to demonstrate the process of HTML
      cleaning using BeautifulSoup.
    </p>
<p class="intro">
      This is a sample HTML document used to demonstrate the process of HTML
      cleaning using BeautifulSoup.
    </p>
<h2 class="sub-header">Why Clean HTML?</h2>
<p class="description">
      Cleaning HTML is essential for web scraping, as it helps in extracting
      useful information by removing unnecessary tags and formatting.
    </p>
<p class="conclusion">
      HTML cleaning can greatly improve the efficiency and accuracy of data
      extraction in web scraping.
    </p>
<h3 class="table-header">Sample Table</h3>
<table border="1" class="data-table">
<tr>
<th>Item</th>
<th>Quantity</th>
<th

### Searching the document

In BeautifulSoup, `find` and `find_all` are two commonly used methods for searching elements in an HTML or XML document, but they serve different purposes:

- `find` Method: The `find` method is used to retrieve the first occurrence of a tag or element that matches the specified criteria, or `None` if nothing matches. It's useful when you're interested in only the first instance of an element, such as the first paragraph or the first header in a document. For example, `soup.find('p')` will return the first `<p>` tag it encounters in the HTML content.

- `find_all` Method: In contrast, `find_all` collects all elements in the document that match the specified criteria and returns them as a `ResultSet`, which behaves like a Python list. This method is ideal when you need to gather all instances of a particular element, such as all links or all images on a page. For instance, `soup.find_all('p')` will return a list of all `<p>` tags present in the document.


In [10]:
# Find the first paragraph using .find
soup.find('p').text.strip()

'This is a sample HTML document used to demonstrate the process of HTML\n      cleaning using BeautifulSoup.'

In [11]:
# Find all paragraphs using .find_all
soup.find_all('p')


[<p class="intro">
       This is a sample HTML document used to demonstrate the process of HTML
       cleaning using BeautifulSoup.
     </p>,
 <p class="intro">
       This is a sample HTML document used to demonstrate the process of HTML
       cleaning using BeautifulSoup.
     </p>,
 <p class="description">
       Cleaning HTML is essential for web scraping, as it helps in extracting
       useful information by removing unnecessary tags and formatting.
     </p>,
 <p class="conclusion">
       HTML cleaning can greatly improve the efficiency and accuracy of data
       extraction in web scraping.
     </p>,
 <p>
       For more information, visit the wikipedia article on
       <a href="https://en.wikipedia.org/wiki/HTML" target="_blank">HTML</a>.
     </p>,
 <p>
       The
       <a href="https://www.w3schools.com/html/html_intro.asp" target="_blank">
         W3Schools HTML tutorial
       </a>
       is also an excellent resource.
     </p>]
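
One practical difference between the two methods shows up when nothing matches. The sketch below uses a one-line document invented for this illustration. `find` returns `None`, so chaining `.text` onto it would raise an `AttributeError`, whereas `find_all` simply returns an empty result.

```python
from bs4 import BeautifulSoup

# Hypothetical document with no <table> element
demo = BeautifulSoup("<p>Only one paragraph</p>", "html.parser")

missing = demo.find("table")
print(missing)                     # None
print(len(demo.find_all("table")))  # 0

# Guard against None before reading attributes or text
table_text = missing.get_text(strip=True) if missing is not None else ""
print(repr(table_text))            # ''
```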

### Extracting Text from HTML Elements in BeautifulSoup

After locating an HTML element using BeautifulSoup, a common task is to extract the textual content from that element. BeautifulSoup makes this process straightforward, offering a couple of key ways to access the text within an HTML element:

- The `.text` Attribute: Once you have identified an HTML element (or elements) using methods like `find` or `find_all`, you can use the `.text` attribute to extract all the text content within that element. This attribute aggregates all the text in an element and its descendants, returning it as a single string. For example, after finding a paragraph with `p = soup.find('p')`, you can get its text with `p.text`.

- The `.get_text()` Method: An alternative to the `.text` attribute is the `.get_text()` method. This method is more versatile: it accepts a `separator` string that is placed between the text fragments it joins, and a `strip=True` option that trims leading and trailing whitespace from each fragment. For instance, if you have a `<div>` containing multiple `<p>` tags, `div.get_text("|")` joins every piece of text inside the `<div>` (including any whitespace between the tags) using the `|` character as a separator.

It's important to note that both `.text` and `.get_text()` will strip away any HTML tags and return only the human-readable text content. This makes them incredibly useful for NLP tasks where the goal is to extract information rather than HTML structure.

In summary, extracting text from HTML elements is a frequent requirement in web scraping and data extraction, and BeautifulSoup provides simple yet powerful tools to accomplish this task effectively. Whether you're scraping paragraphs, headers, or other textual elements, these methods offer a clean and efficient way to access the content you need.


In [12]:
# Get the text of the first paragraph using .text
print(soup.find('p').text)



      This is a sample HTML document used to demonstrate the process of HTML
      cleaning using BeautifulSoup.
    


In [13]:
print(soup.find('p').get_text(strip=True))

This is a sample HTML document used to demonstrate the process of HTML
      cleaning using BeautifulSoup.


In [14]:
# Get the texts of all the paragraphs (you have to iterate over the list)
paragraphs = soup.find_all('p')
# print(paragraphs)

for paragraph in paragraphs:
    print(paragraph.get_text(strip=True))

This is a sample HTML document used to demonstrate the process of HTML
      cleaning using BeautifulSoup.
This is a sample HTML document used to demonstrate the process of HTML
      cleaning using BeautifulSoup.
Cleaning HTML is essential for web scraping, as it helps in extracting
      useful information by removing unnecessary tags and formatting.
HTML cleaning can greatly improve the efficiency and accuracy of data
      extraction in web scraping.
For more information, visit the wikipedia article onHTML.
TheW3Schools HTML tutorialis also an excellent resource.
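
The last two lines of the output above run words together ("onHTML", "TheW3Schools"). With `strip=True`, each text fragment is trimmed and the fragments are then joined with the default empty separator. The sketch below reuses the markup of the last paragraph to show two possible remedies. Which one suits a given page is a judgment call.

```python
from bs4 import BeautifulSoup

# Markup copied from the last paragraph of the sample document
html = """
<p>
  The
  <a href="https://www.w3schools.com/html/html_intro.asp" target="_blank">
    W3Schools HTML tutorial
  </a>
  is also an excellent resource.
</p>
"""
p = BeautifulSoup(html, "html.parser").find("p")

# Default separator is "", so the trimmed fragments are glued together
glued = p.get_text(strip=True)
print(glued)
# TheW3Schools HTML tutorialis also an excellent resource.

# Option 1: pass a space as the separator
spaced = p.get_text(" ", strip=True)
print(spaced)
# The W3Schools HTML tutorial is also an excellent resource.

# Option 2: collapse every run of whitespace, including newlines inside a fragment
normalised = " ".join(p.get_text().split())
print(normalised)
# The W3Schools HTML tutorial is also an excellent resource.
```

Option 2 also removes the newline and indentation that remain inside the other paragraphs' text, which `strip=True` does not touch.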


### Using find and find_all with Additional Attributes in BeautifulSoup

In BeautifulSoup, the `find` and `find_all` methods are not limited to simple tag searches. They can be significantly more powerful when used with additional attributes, enabling more precise and specific element selection based on their attributes in the HTML structure.

- Attribute Filtering: You can pass additional attributes to the `find` and `find_all` methods to filter elements by their attributes. For example, to find all `<p>` tags with the class `intro`, you can use `soup.find_all("p", {"class": "intro"})`.

- Multiple Attributes: These methods also allow for filtering using multiple attributes simultaneously. For instance, `soup.find_all("div", {"class": "container", "id": "main"})` will find all `<div>` elements with a class of `container` and an ID of `main`.


In [15]:
# Find the paragraph with class="description"
soup.find('p', {'class': 'description'})

<p class="description">
      Cleaning HTML is essential for web scraping, as it helps in extracting
      useful information by removing unnecessary tags and formatting.
    </p>
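
The dictionary form shown above is one of several equivalent spellings. The sketch below, on a small document invented for this illustration, also shows the `class_` keyword (with a trailing underscore, because `class` is a reserved word in Python), the `attrs` dictionary with multiple attributes, and the `id` keyword.

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration
html = """
<div class="container" id="main"><p class="intro lead">A</p></div>
<div class="container" id="sidebar"><p class="intro">B</p></div>
"""
demo = BeautifulSoup(html, "html.parser")

# class_ matches any element whose class list contains "intro"
intro_paragraphs = demo.find_all("p", class_="intro")
print([p.get_text() for p in intro_paragraphs])  # ['A', 'B']

# Multiple attributes must all match
main_divs = demo.find_all("div", attrs={"class": "container", "id": "main"})
print(len(main_divs))  # 1

# Other attributes can be passed directly as keyword arguments
print(demo.find(id="sidebar").p.get_text())  # B
```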

### Extracting Links from `<a>` Tags Using BeautifulSoup

Hyperlinks are fundamental components of the web, interconnecting various resources and web pages. In web scraping, extracting these links often involves targeting `<a>` (anchor) tags in HTML, which traditionally define hyperlinks. BeautifulSoup simplifies this process, allowing for efficient extraction of URLs embedded within these tags. We often have to extract links from a document in NLP applications.

#### Understanding `<a>` Tags

An `<a>` tag in HTML typically includes an `href` attribute, which holds the URL the link points to.
It may also contain other attributes and text, which provide additional context or display information.


In [16]:
# First, use find_all to retrieve all the <a> tags from the soup object.
all_a_tags = soup.find_all('a')
print(all_a_tags)

[<a href="https://en.wikipedia.org/wiki/HTML" target="_blank">HTML</a>, <a href="https://www.w3schools.com/html/html_intro.asp" target="_blank">
        W3Schools HTML tutorial
      </a>]


In [None]:
# Use .get('href') to extract the URL from a tag's href attribute (here, the first <a> tag in the list).
all_a_tags[0].get('href')

In [None]:
# Optionally, you can also extract the text of a hyperlink (here, the first one) to understand what the link represents.
all_a_tags[0].text

In [None]:
# Print each link's text together with its URL
for item in all_a_tags:
    url = item.get('href')
    text = item.text.strip()
    print(f"{text}: {url}")
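
If the links are needed later rather than just printed, they can be collected into a dictionary keyed by link text. The sketch below is one way to do that. The HTML and `BASE_URL` are invented for this illustration. It also handles two cases that the sample document does not contain: anchors without an `href`, and relative URLs, which the standard-library `urllib.parse.urljoin` resolves against the page's address.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical address of the page the HTML came from
BASE_URL = "https://example.com/docs/"

# Hypothetical markup, used only for illustration
html = """
<a href="https://en.wikipedia.org/wiki/HTML">HTML</a>
<a href="intro.html">  Intro  </a>
<a name="top">No href here</a>
"""
demo = BeautifulSoup(html, "html.parser")

links = {}
# href=True keeps only <a> tags that actually have an href attribute
for a_tag in demo.find_all("a", href=True):
    link_text = a_tag.get_text(strip=True)
    links[link_text] = urljoin(BASE_URL, a_tag["href"])

print(links)
# {'HTML': 'https://en.wikipedia.org/wiki/HTML', 'Intro': 'https://example.com/docs/intro.html'}
```

Note that a dictionary keeps only one URL per link text, so pages with repeated labels such as "Read more" may be better served by a list of `(text, url)` pairs.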

### Extracting Tabular Data from HTML using BeautifulSoup and Pandas

One of the most common tasks in web scraping and data extraction is retrieving information stored in tables within HTML documents. HTML tables, marked up with `<table>` (overall table), `<tr>` (each table row), `<th>` (table header elements), and `<td>` (table data elements) tags, often contain structured data that are ideal for analysis. However, extracting this data and converting it into a usable format like a Pandas DataFrame requires a careful approach.

In this section, we'll explore two powerful methods for extracting tabular data from HTML:

- Using `BeautifulSoup`: This method involves parsing the HTML content with BeautifulSoup to navigate and extract the table data. It provides a high level of control, allowing for the handling of complex or irregular table structures. We'll manually read the table rows and columns, extract the text, and then structure this data into a DataFrame.

- Using Pandas `read_html`: Pandas offers a convenient function, `read_html`, that automatically parses HTML tables and returns them as a list of DataFrame objects, one per table found. This method requires minimal code, making it ideal for straightforward tables. However, it may not always be suitable for tables with complex structures or non-standard formatting.
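
The first approach can be sketched roughly as follows. The table here is a small stand-in written for this illustration: the `Item` and `Quantity` headers come from the sample document, but the rows are invented. The sketch assumes the first row holds the headers and that no cells span multiple rows or columns. `pd.read_html` is not shown because it relies on an additional parser library such as `lxml`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table; the row values are made up for illustration
html = """
<table class="data-table">
  <tr><th>Item</th><th>Quantity</th></tr>
  <tr><td>Apples</td><td>3</td></tr>
  <tr><td>Pears</td><td>5</td></tr>
</table>
"""
demo = BeautifulSoup(html, "html.parser")
table = demo.find("table")

# Build a list of lists: one inner list per <tr>, one string per cell
rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all(["th", "td"])
    rows.append([cell.get_text(strip=True) for cell in cells])

# Assume the first row is the header and the rest is data
header, *data = rows
df = pd.DataFrame(data, columns=header)

# Cell text is always a string, so convert numeric columns explicitly
df["Quantity"] = pd.to_numeric(df["Quantity"])
print(df)
```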


In [None]:
import pandas as pd

In [None]:
# Find the first table


In [None]:
# Extracting data into a list of lists



In [None]:
# Create a DataFrame


In [None]:
# Use pandas.read_html


### Loading HTML pages from the internet

So far, we have been working with a local HTML file (`*.html`). To fetch a page from the Internet instead, we can use the `requests` library and its `requests.get` function.


In [None]:
import requests

In [None]:
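# Use requests.get and pass the URL you want to fetch; the timeout stops the call from hanging indefinitely
r = requests.get("https://en.wikipedia.org/wiki/HTML", timeout=10)
r.raise_for_status()  # Raise an exception if the server returned an HTTP error status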

In [None]:
# r.text now contains the page's HTML, which we can process with BeautifulSoup.
print(r.text)
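
Putting the two libraries together might look like the sketch below. The helper names `fetch_html` and `extract_title` and the `User-Agent` string are made up for this illustration. Some sites reject requests that do not identify themselves, so the header is included as an assumption. The network call is left in a comment so that the parsing helper can be tried offline.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors."""
    # Hypothetical User-Agent string, used only for illustration
    headers = {"User-Agent": "nlp-preprocessing-tutorial/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_title(html):
    """Return the text of the page's <title>, or None if there is none."""
    page_soup = BeautifulSoup(html, "html.parser")
    if page_soup.title is None:
        return None
    return page_soup.title.get_text(strip=True)


# Usage (requires network access):
# html = fetch_html("https://en.wikipedia.org/wiki/HTML")
# print(extract_title(html))
```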

---


### Exercise: Scraping content from a Wikipedia page

Find a Wikipedia article that you are interested in and scrape all the paragraphs from it.

**Stretch**: Find an article with a large table (e.g. <https://en.wikipedia.org/wiki/List_of_highest-grossing_films>) and scrape the table using BeautifulSoup.
