**Beautiful Soup** allows us to extract elements from HTML pages:

In [None]:
from bs4 import BeautifulSoup

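with open('data/charity.html', 'rb') as file:
    # Name the parser explicitly so the result does not depend on which parsers are installed
    page = BeautifulSoup(file, 'html.parser')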

In [None]:
page.title

HTML uses **tags** to structure page content.

```html
<title>The page title</title>
<h1>Main heading</h1>
<p>Some section of text.</p>
```
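As a minimal sketch of how these tags map onto Beautiful Soup objects, the snippet below parses the same three tags from a string rather than from a file. The variable names `html` and `soup` are chosen here for illustration, and `html.parser` is Python's built-in parser.

```python
from bs4 import BeautifulSoup

# The same three tags as above, supplied as a string instead of a file.
html = """
<title>The page title</title>
<h1>Main heading</h1>
<p>Some section of text.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Accessing a tag name as an attribute returns the first matching tag.
print(soup.title)         # the whole <title> tag
print(soup.title.string)  # only the text inside the tag
print(soup.h1.name)       # the tag's name
print(soup.p.get_text())  # the text inside the first <p> tag
```

Attribute-style access such as `soup.h1` only ever returns the first match, so it suits elements that appear once on a page.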

HTML tags can have **attributes**, such as **class** and **id**.

```html
<h2 id="strapline" class="flashing">Funky heading</h2>
```

Beautiful Soup can also use these attributes to identify content:

In [None]:
main_content = page.find("div", {"class": "main-content-container"})
main_content
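The sketch below shows a few equivalent ways of looking up an element by its attributes, and what comes back when nothing matches. It uses a small invented HTML string rather than the charity page, so the names `html`, `soup` and `heading` are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Invented markup combining the examples above.
html = (
    '<div class="main-content-container">'
    '<h2 id="strapline" class="flashing">Funky heading</h2>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Look up by id, then read the tag's attributes like dictionary keys.
heading = soup.find("h2", id="strapline")
print(heading["id"])     # a single string
print(heading["class"])  # a list, because a tag can carry several classes

# class is a reserved word in Python, so the keyword argument is class_.
same_heading = soup.find("h2", class_="flashing")

# CSS selectors are another option.
css_heading = soup.select_one("div.main-content-container h2#strapline")

# find() returns None when nothing matches, so check before chaining calls.
missing = soup.find("div", {"class": "no-such-class"})
print(missing is None)
```

Checking for `None` before calling further methods avoids an `AttributeError` when a page does not contain the expected element.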

We can continue to use Beautiful Soup to identify **nested elements** within others:

In [None]:
tables = main_content.find_all("table")
tables
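To illustrate how a nested search is scoped to its parent element, here is a minimal sketch on invented markup (the table contents are made up, not taken from the charity page). Only the table inside the container is returned, and the same approach can then be repeated on the rows and cells.

```python
from bs4 import BeautifulSoup

# Invented markup: one table inside the container and one outside it.
html = """
<div class="main-content-container">
  <table>
    <tr><th>Year</th><th>Income</th></tr>
    <tr><td>2017</td><td>100</td></tr>
    <tr><td>2018</td><td>120</td></tr>
  </table>
</div>
<table><tr><td>Outside the container</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

container = soup.find("div", {"class": "main-content-container"})

# find_all() on an element searches only within that element.
tables = container.find_all("table")
print(len(soup.find_all("table")), len(tables))

# Nest further: every row of the first table, then every cell of each row.
rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in tables[0].find_all("tr")
]
print(rows)
```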

The pandas `read_html()` function is useful for extracting data from tables into DataFrames. It returns a list of DataFrames, one for each table it finds:

- Notice that we need to convert the Beautiful Soup ResultSet object (`tables`) to a string

In [None]:
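from io import StringIO

import pandas as pd

In [None]:
# Recent pandas versions deprecate passing literal HTML, so wrap the string in StringIO
dataframes = pd.read_html(StringIO(str(tables)))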

In [None]:
df = dataframes[0]
df

- The result sometimes needs cleaning, for example removing unwanted rows or correcting characters that were misinterpreted
- In this instance, notice that pandas has appended `.1` to the final column label to differentiate it from the other `%` column label
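The kind of cleaning described above might look like the following sketch. The DataFrame is a hypothetical stand-in for a `read_html()` result: its column labels and values are invented for illustration and are not the contents of the charity table.

```python
import pandas as pd

# Hypothetical stand-in for a DataFrame returned by pd.read_html();
# the labels and values are invented for illustration only.
df = pd.DataFrame(
    {
        "Category": ["Donations", "Grants", "Total"],
        "Income": ["£1,200", "£800", "£2,000"],
        "%": [60, 40, 100],
        "Spending": ["£900", "£600", "£1,500"],
        "%.1": [60, 40, 100],
    }
)

# Give the two "%" columns distinct, descriptive labels.
df = df.rename(columns={"%": "Income %", "%.1": "Spending %"})

# Drop an unwanted summary row.
df = df[df["Category"] != "Total"].reset_index(drop=True)

# Strip currency symbols and thousands separators, then convert to integers.
for column in ["Income", "Spending"]:
    df[column] = df[column].str.replace("[£,]", "", regex=True).astype(int)

print(df)
```

Renaming the columns by their exact labels keeps the step explicit, which helps if the order of the columns on the source page changes later.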

[Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

[pd.read_html() documentation](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html)