# 16. Extracting Text from Web Pages

## 16.1. The Structure of HTML Documents

The `<html>` tag specifies the start of the entire HTML document. Let’s use that tag to create a document composed of just a single word: _Hello_.

**Listing 16. 1. Defining a simple HTML string**

In [1]:
html_contents = "<html>Hello</html>"

We can render `html_contents` directly in a Jupyter Notebook. We simply need to import `HTML` and `display` from `IPython.display`. Afterwards, executing `display(HTML(html_contents))` will display the rendered output.

**Listing 16. 2. Rendering an HTML string**

In [2]:
from IPython.display import display, HTML
def render(html_contents): display(HTML(html_contents))
render(html_contents)

We’ve rendered our HTML document. It’s not very impressive. The body is composed of a single word. Furthermore, the document lacks a title. Let’s assign the document a title, using the `<title>` tag.  

**Listing 16. 3. Defining an HTML title**

In [3]:
title = "<title>Data Science is Fun</title>"

Now, we’ll nest the title within `<html>` and `</html>` by running `html_contents = f"<html>{title}Hello</html>"`. 

**Listing 16. 4. Adding a title to the HTML string**

In [4]:
html_contents = f"<html>{title}Hello</html>"
render(html_contents)

The title reflects vital information despite its absence from the body of the document. This critical distinction is commonly emphasized using `<head>` and `<body>` tags. The content delimited by the HTML `<body>` tag will appear in the body of the output. Meanwhile, `<head>` delimits vital information that is not rendered within the body. 

**Listing 16. 5. Adding a head and body to the HTML string**

In [5]:
head = f"<head>{title}</head>"
body = "<body>Hello</body>"
html_contents = f"<html> {head} {body}</html>"

Occasionally, we’ll want to display a document’s title within the body of a page. This visualized title is referred to as the page _header_. It is demarcated with the `<h1>` tag.

**Listing 16. 6. Adding a header to the HTML string**

In [6]:
header = "<h1>Data Science is Fun</h1>"
body =  f"<body>{header}Hello</body>"
html_contents = f"<html> {title} {body}</html>"
render(html_contents)

HTML documents usually contain multiple sentences, enclosed in multiple paragraphs. Such paragraphs are marked with a `<p>` tag. Let’s add two consecutive paragraphs to our HTML.

**Listing 16. 7. Adding paragraphs to the HTML string**

In [7]:
paragraphs = ''
for i in range(2):
    paragraph_string = f"Paragraph {i} " * 40
    paragraphs += f"<p>{paragraph_string}</p>"
    
body = f"<body>{header}{paragraphs}</body>"
html_contents = f"<html> {title} {body}</html>"
render(html_contents)

We can discriminate between `<p>` tags by assigning each tag a unique id. The id can be inserted directly into the tag brackets. The added _id_ is referred to as an **attribute** of the paragraph element. Attributes are inserted into element start-tags in order to track useful tag information.

**Listing 16. 8. Adding id attributes to the paragraphs**

In [8]:
paragraphs = ''
for i in range(2):
    paragraph_string = f"Paragraph {i} " * 40
    attribute = f"id='paragraph {i}'"
    paragraphs += f"<p {attribute}>{paragraph_string}</p>"
    
body = f"<body>{header}{paragraphs}</body>"
html_contents = f"<html> {title} {body}</html>"
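
The anatomy of an element (a start-tag, optional attributes, content, and an end-tag) can be captured in a small helper. The sketch below is only illustrative: `make_element` is a hypothetical function rather than part of any library, and it assumes the content is plain text. It relies on the standard library’s `html.escape` so that special characters in the text or in attribute values are not mistaken for markup.

```python
from html import escape

def make_element(tag, text, **attributes):
    # Render each attribute as name="value", escaping quotes in the value.
    rendered = "".join(
        f' {name}="{escape(str(value), quote=True)}"'
        for name, value in attributes.items()
    )
    # Escape the text so characters such as < and & are not read as markup.
    return f"<{tag}{rendered}>{escape(text)}</{tag}>"

paragraph = make_element("p", "Paragraph 0", id="paragraph 0")
print(paragraph)  # <p id="paragraph 0">Paragraph 0</p>
```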

The internet is built on top of **hyperlinks**, which are clickable pieces of text that connect web pages. Each hyperlink is marked by an anchor tag, `<a>`. Furthermore, the URL of the link is provided using the _href_ attribute. Below, we’ll create a hyperlink that reads _Data Science Bookcamp_. We’ll link that clickable text to the website for this book.

**Listing 16. 9. Adding a hyperlink to the HTML string**

In [9]:
link_text = "Data Science Bookcamp"
url = "https://www.manning.com/books/data-science-bookcamp"
hyperlink = f"<a href='{url}'>{link_text}</a>"
new_paragraph = f"<p id='paragraph 2'>Here is a link to {hyperlink}</p>"
paragraphs += new_paragraph
body = f"<body>{header}{paragraphs}</body>"
html_contents = f"<html> {title} {body}</html>"
render(html_contents)

Beyond just headers and paragraphs, we can also visualize lists of texts within an HTML document. Suppose, for instance, that we wish to display a list of popular data science libraries. We’ll start by defining that list in Python.

**Listing 16. 10. Defining a list of data science libraries**

In [10]:
libraries = ['NumPy', 'Scipy', 'Pandas', 'Scikit-Learn']

Now, we’ll demarcate every item in our list with an `<li>` tag, which stands for _list item_. 

**Listing 16. 11. Demarcating list items with an `li` tag**

In [11]:
items = ''
for library in libraries:
    items += f"<li>{library}</li>"

Finally, we’ll nest the `items` string within a `<ul>` tag, where `ul` stands for _unordered list_. Afterwards, we’ll append the unordered list to the body of our HTML.

**Listing 16. 12. Adding an unordered list to the HTML string**

In [12]:
unstructured_list = f"<ul>{items}</ul>"
header2 = '<h2>Common Data Science Libraries</h2>'
body = f"<body>{header}{paragraphs}{header2}{unstructured_list}</body>"
html_contents = f"<html> {title} {body}</html>"
render(html_contents)
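
Repeated `+=` on a string works, but the same list markup can also be produced in a single expression with `str.join`. The following is a minimal sketch that reuses the library names from above. The `to_html_list` helper and its `ordered` flag are hypothetical names introduced for illustration; the flag switches to the `<ol>` tag, which renders a numbered list instead of bullet points.

```python
def to_html_list(entries, ordered=False):
    # <ul> renders bullet points; <ol> renders a numbered list.
    list_tag = "ol" if ordered else "ul"
    items = "".join(f"<li>{entry}</li>" for entry in entries)
    return f"<{list_tag}>{items}</{list_tag}>"

libraries = ['NumPy', 'Scipy', 'Pandas', 'Scikit-Learn']
print(to_html_list(libraries))
```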

At this point, it’s worth noting that our HTML body is divided into two distinct parts: the paragraphs and the list. Typically, such divisions are captured using special `<div>` tags. Usually, each `<div>` tag is distinguished by some attribute. If the attribute value is unique to a division, then that attribute is an `id`. Otherwise, if the value is shared by more than one division, then the `class` attribute is used. For consistency’s sake, we’ll now nest our two sections within two different divisions. We’ll also add a third, empty division, which we’ll populate later.

**Listing 16. 13. Adding divisions to the HTML string**

In [13]:
div1 = f"<div id='paragraphs' class='text'>{paragraphs}</div>"
div2 = f"<div id='list' class='text'>{header2}{unstructured_list}</div>"
div3 = "<div id='empty' class='empty'></div>"
body = f"<body>{header}{div1}{div2}{div3}</body>"
html_contents = f"<html> {title}{body}</html>"

We’ve made many changes to our `html_contents` string. Let’s review its altered contents.

**Listing 16. 14. Printing the altered HTML string**

In [14]:
print(html_contents)

<html> <title>Data Science is Fun</title><body><h1>Data Science is Fun</h1><div id='paragraphs' class='text'><p id='paragraph 0'>Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 </p><p id='paragraph 1'>Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragra

The printed output is a mess! The HTML contents are nearly unreadable. Also, extracting individual elements from `html_contents` is exceedingly difficult. 

**Listing 16. 15. Extracting the HTML title using basic Python**

In [15]:
split_contents = html_contents.split('>')
for i, substring in enumerate(split_contents):
    if substring.endswith('<title'):
        next_string = split_contents[i + 1]
        title = next_string.split('<')[0]
        print(title)
        break

Data Science is Fun
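
The split-based approach is also brittle. It assumes that the start-tag is written exactly as `<title>`, so a perfectly valid variation, such as a title tag that carries an attribute, slips through unnoticed. The sketch below wraps the same logic in a hypothetical `extract_title` function to illustrate that failure; the two input strings are made up for the demonstration.

```python
def extract_title(html):
    # Same logic as the manual approach: split on '>' and look for '<title'.
    split_contents = html.split('>')
    for i, substring in enumerate(split_contents):
        if substring.endswith('<title'):
            return split_contents[i + 1].split('<')[0]
    return None  # No substring ended with '<title'

plain = "<html><title>Data Science is Fun</title></html>"
with_attribute = "<html><title lang='en'>Data Science is Fun</title></html>"

print(extract_title(plain))           # Data Science is Fun
print(extract_title(with_attribute))  # None
```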


Is there a cleaner way to extract elements from HTML documents? Yes! We don’t need to manually parse the documents. Instead, we can just leverage the external Beautiful Soup library.

## 16.2. Parsing HTML using Beautiful Soup

Let’s import the `BeautifulSoup` class from `bs4`. Following a common convention, we’ll import `BeautifulSoup` as simply `bs`.

**Listing 16. 16. Importing the `BeautifulSoup` class**

In [16]:
from bs4 import BeautifulSoup as bs

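We’ll now initialize the `BeautifulSoup` class by running `bs(html_contents, 'html.parser')`. The second argument names the parser; passing Python’s built-in `html.parser` explicitly keeps the results independent of whichever third-party parsers happen to be installed. In keeping with convention, we’ll assign the initialized object to a `soup` variable.

**Listing 16. 17. Initializing `BeautifulSoup` using an HTML string**

In [17]:
soup = bs(html_contents, 'html.parser')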

Our `soup` object tracks all elements in the parsed HTML. We can output these elements in a clean, readable format by running the `soup.prettify()` method.

**Listing 16. 18. Printing readable HTML with Beautiful Soup**

In [18]:
print(soup.prettify())

<html>
 <title>
  Data Science is Fun
 </title>
 <body>
  <h1>
   Data Science is Fun
  </h1>
  <div class="text" id="paragraphs">
   <p id="paragraph 0">
    Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0
   </p>
   <p id="paragraph 1">
    Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 P

Suppose we want to access an individual element, such as the title. The `soup` object provides that access through its `find` method. 

**Listing 16. 19. Extracting the title with Beautiful Soup**

In [19]:
title = soup.find('title')
print(title)

<title>Data Science is Fun</title>


The printed `title` appears to be an HTML string that’s demarcated by the title tags. However, our `title` variable is not a string. Rather, it’s an instance of the Beautiful Soup `Tag` class.

**Listing 16. 20. Outputting the title’s data type**

In [20]:
print(type(title))

<class 'bs4.element.Tag'>


Each `Tag` object contains a `text` attribute, which maps to the text within the tag. Thus, printing `title.text` will display _Data Science is Fun_.

**Listing 16. 21. Outputting the title’s text attribute**

In [21]:
print(title.text)

Data Science is Fun


We’ve accessed our `title` tag by running `soup.find('title')`. Additionally, we can access that same tag simply by running `soup.title`. Therefore, running `soup.title.text` will return a string that’s equal to `title.text`.

**Listing 16. 22. Accessing the title’s text attribute from `soup`**

In [22]:
assert soup.title.text == title.text
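
One detail worth keeping in mind is what happens when the requested tag does not exist. In that case, both `find` and the attribute-style shortcut return `None`, so reading `.text` directly would raise an `AttributeError`. The sketch below illustrates this on a small stand-in document; the HTML string and the explicit `html.parser` argument are assumptions made for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><title>Data Science is Fun</title></html>", "html.parser")

# Attribute-style access and find() return the same first matching tag.
assert soup.title == soup.find("title")

# A tag that is absent from the document yields None rather than an error.
missing = soup.find("h1")
print(missing)  # None

# Guard before reading .text, since None has no such attribute.
header_text = missing.text if missing is not None else ""
```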

In this same manner, we can access the body of our document by running `soup.body`.

**Listing 16. 23. Accessing the body’s text attribute from soup**

In [23]:
body = soup.body
print(body.text)

Data Science is FunParagraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Here is a link to Dat

Our output is an aggregation of all the text within the body. It is virtually unreadable. Rather than outputting all the text, we should instead narrow the scope of our output. Let’s print the text of just the first paragraph.

**Listing 16. 24. Accessing the text of the first paragraph**

In [24]:
assert body.p.text == soup.p.text
print(soup.p.text)

Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 
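
When the full body text is needed after all, the run-together output seen earlier can be avoided. The `get_text` method of a tag accepts a `separator` string that is placed between the text fragments, along with a `strip` flag that trims whitespace from each fragment. Here is a minimal sketch on a small stand-in document (the HTML string is an assumption made for brevity):

```python
from bs4 import BeautifulSoup

html = "<body><h1>Data Science is Fun</h1><p>Paragraph 0 </p><p>Paragraph 1 </p></body>"
body = BeautifulSoup(html, "html.parser").body

# Fragments are concatenated with nothing in between.
print(body.text)

# One fragment per line, with surrounding whitespace removed.
readable_text = body.get_text(separator="\n", strip=True)
print(readable_text)
```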


Accessing `body.p` returns the first paragraph in `body`. How do we access the remaining two paragraphs? Well, we can utilize the `find_all` method. Running `body.find_all('p')` will return a list of all the paragraph tags within the body.

**Listing 16. 25. Accessing all paragraphs in the body**

In [25]:
paragraphs = body.find_all('p')
for i, paragraph in enumerate(paragraphs):
    print(f"\nPARAGRAPH {i}:")
    print(paragraph.text)


PARAGRAPH 0:
Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 

PARAGRAPH 1:
Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 

PARAGRAPH

Similarly, we can access our list of bullet points by running `body.find_all('li')`.

**Listing 16. 26. Accessing all bullet points in the body**

In [26]:
print([bullet.text for bullet in body.find_all('li')])

['NumPy', 'Scipy', 'Pandas', 'Scikit-Learn']
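
The `find_all` method accepts more than a single tag name. The brief sketch below, run on a small stand-in document that is assumed purely for illustration, shows three variations: passing a list of tag names, capping the number of results with `limit`, and searching for a tag that is not present.

```python
from bs4 import BeautifulSoup

html = ("<body><h1>Title</h1><p>First</p><p>Second</p>"
        "<h2>Libraries</h2><ul><li>NumPy</li><li>Pandas</li></ul></body>")
body = BeautifulSoup(html, "html.parser").body

# A list of names matches any of the listed tags, in document order.
headers = [tag.text for tag in body.find_all(["h1", "h2"])]
print(headers)  # ['Title', 'Libraries']

# limit caps the number of results that are returned.
first_paragraph_only = body.find_all("p", limit=1)
print(len(first_paragraph_only))  # 1

# With no match, find_all returns an empty list (find would return None).
print(body.find_all("table"))  # []
```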


Suppose we wish to access an element with a unique id of `x`. To search by that id, we simply need to execute `find(id=x)`. With this in mind, let’s output the text of the final paragraph, whose assigned id is `paragraph 2`.

**Listing 16. 27. Accessing a paragraph by id**

In [27]:
paragraph_2 = soup.find(id='paragraph 2')
print(paragraph_2.text)

Here is a link to Data Science Bookcamp


The contents of `paragraph_2` include a web link to _Data Science Bookcamp_. The actual URL is stored within an `href` attribute. Beautiful Soup permits us to access any attribute using the `get` method. Thus, running `paragraph_2.get('id')` will return _paragraph 2_. Similarly, running `paragraph_2.a.get('href')` will return the URL.

**Listing 16. 28. Accessing an attribute within a tag**

In [28]:
assert paragraph_2.get('id') == 'paragraph 2'
print(paragraph_2.a.get('href'))

https://www.manning.com/books/data-science-bookcamp
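
The `get` method behaves much like `dict.get`, which matters when an attribute may be missing. The sketch below rebuilds the final paragraph as a stand-in document (an assumption for the example) and compares `get`, square-bracket access, and the `attrs` dictionary.

```python
from bs4 import BeautifulSoup

url = "https://www.manning.com/books/data-science-bookcamp"
html = f"<p id='paragraph 2'>Here is a link to <a href='{url}'>Data Science Bookcamp</a></p>"
paragraph = BeautifulSoup(html, "html.parser").p

# get returns None (or a supplied default) when the attribute is missing.
print(paragraph.get("href"))             # None
print(paragraph.get("href", "no link"))  # no link

# Square brackets also work, but raise KeyError for a missing attribute.
print(paragraph.a["href"])

# attrs exposes every attribute of the tag as a dictionary.
print(paragraph.attrs)  # {'id': 'paragraph 2'}
```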


Not all our attributes are unique. For instance, two of our three division elements share the `class` attribute of _text_. How do we obtain just those two divisions where the class is set to _text_? We simply need to run `soup.find_all('div', class_='text')`. The keyword is spelled `class_`, with a trailing underscore, because `class` is a reserved word in Python.
 
**Listing 16. 29. Accessing divisions by their shared class attribute**

In [29]:
for division in soup.find_all('div', class_='text'):
    id_ = division.get('id')
    print(f"\nDivision with id '{id_}':")
    print(division.text)


Division with id 'paragraphs':
Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 0 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Paragraph 1 Here is 
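
The `class_` keyword is one of several ways to express the same search. As a rough sketch on a stand-in document that mirrors our three divisions (the shortened contents are an assumption), the same two divisions can also be found with an `attrs` dictionary or with a CSS selector passed to `select`.

```python
from bs4 import BeautifulSoup

html = ("<div id='paragraphs' class='text'>A</div>"
        "<div id='list' class='text'>B</div>"
        "<div id='empty' class='empty'></div>")
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("div", class_="text")
by_attrs = soup.find_all("div", attrs={"class": "text"})
by_selector = soup.select("div.text")  # CSS selector syntax

ids = [division.get("id") for division in by_keyword]
print(ids)  # ['paragraphs', 'list']
```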

The Beautiful Soup library also allows us to edit individual elements. For example, given a `tag` object, we can delete that object by running `tag.decompose()`. Let’s delete the first two paragraphs.

**Listing 16. 30. Paragraph deletion with Beautiful Soup**

In [30]:
body.find(id='paragraph 0').decompose()
soup.find(id='paragraph 1').decompose()
print(body.find(id='paragraphs').text)

Here is a link to Data Science Bookcamp
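
Deletion with `decompose` is permanent. When the removed element is still needed, Beautiful Soup also offers `extract`, which detaches a tag from the tree and returns it. The sketch below contrasts the two on a tiny stand-in document (assumed for illustration only).

```python
from bs4 import BeautifulSoup

html = "<div><p id='a'>Keep me</p><p id='b'>Drop me</p></div>"
soup = BeautifulSoup(html, "html.parser")

# extract removes the tag from the tree and returns it for later use.
removed = soup.find(id="a").extract()
print(removed.text)  # Keep me

# decompose removes the tag and destroys it; nothing useful is returned.
soup.find(id="b").decompose()

print(soup.div.text == "")  # True: both paragraphs are gone from the tree
print(soup.find(id="a"))    # None
```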


Additionally, we’re able to insert new tags into the HTML. Suppose we wish to insert a new paragraph into our final empty division. To do so, we must first create a new paragraph element. 

**Listing 16. 31. Initializing an empty paragraph `Tag`**

In [31]:
new_paragraph = soup.new_tag('p')
print(new_paragraph)

<p></p>


Next, we set the paragraph’s text by assigning a string to `new_paragraph.string`.

**Listing 16. 32. Updating the text of an empty paragraph**

In [32]:
new_paragraph.string = "This paragraph is new"
print(new_paragraph)

<p>This paragraph is new</p>


Finally, we must append the updated `new_paragraph` to an existing `Tag` object. Given two `Tag` objects, `tag1` and `tag2`, we can insert `tag1` into `tag2` by running `tag2.append(tag1)`. Thus, running `soup.find(id='empty').append(new_paragraph)` will append the paragraph to the empty division.

**Listing 16. 33. Paragraph insertion with Beautiful Soup**

In [33]:
soup.find(id='empty').append(new_paragraph)
render(soup.prettify())
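
New tags can also carry attributes, and they need not be added at the end of their parent. In the sketch below, keyword arguments passed to `new_tag` become attributes of the new element, while `insert` places a tag at a chosen position among the parent’s children. The stand-in division and the heading text are assumptions made for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div id='empty'></div>", "html.parser")
division = soup.find(id="empty")

# Keyword arguments to new_tag become attributes of the new element.
url = "https://www.manning.com/books/data-science-bookcamp"
link = soup.new_tag("a", href=url)
link.string = "Data Science Bookcamp"
division.append(link)

# insert places a tag at a given position among the children; 0 means first.
heading = soup.new_tag("h2")
heading.string = "Further reading"
division.insert(0, heading)

print(division)
```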

## 16.3. Downloading and Parsing Online Data

The Beautiful Soup library allows us to easily parse, analyze, and edit HTML documents. In most cases, these documents must be downloaded directly from the web. Let’s briefly review the procedure for downloading HTML files. We’ll start by importing the `urlopen` function.

**Listing 16. 34. Importing the `urlopen` function**

In [34]:
from urllib.request import urlopen

Given the URL of an online document, we can download the associated HTML contents by running `urlopen(url).read()`. Below, we’ll use `urlopen` to download the Manning web page for this book. Afterwards, we’ll print the first 1,000 bytes of the downloaded HTML. Note that `read()` returns a `bytes` object rather than a string, which is why the output begins with `b'`.

**Listing 16. 35. Downloading an HTML document**

In [35]:
url = "https://www.manning.com/books/data-science-bookcamp"
html_contents = urlopen(url).read()
print(html_contents[:1000])

b'\n<!DOCTYPE html>\n<!--[if lt IE 7 ]> <html lang="en" class="no-js ie6 ie"> <![endif]-->\n<!--[if IE 7 ]>    <html lang="en" class="no-js ie7 ie"> <![endif]-->\n<!--[if IE 8 ]>    <html lang="en" class="no-js ie8 ie"> <![endif]-->\n<!--[if IE 9 ]>    <html lang="en" class="no-js ie9 ie"> <![endif]-->\n<!--[if (gt IE 9)|!(IE)]><!--> <html lang="en" class="no-js"><!--<![endif]-->\n\n<head>\n    <title>Manning | Data Science Bookcamp</title>\n    \n\n\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=0">\n<meta name="application-name" content="Data Science Bookcamp"/>\n<meta name="apple-mobile-web-app-title" content="Data Science Bookcamp"/>\n\n<meta property="og:title" content="Data Science Bookcamp"/>\n<meta name="twitter:title" content="Data Science Bookcamp"/>\n\n<meta name="twitter:site" content="&#64;manningbo
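
For anything beyond a quick experiment, it can help to close the connection explicitly and to decode the downloaded bytes into text. The following sketch shows one possible way to do so. The `download_html` helper is hypothetical, and both the 10-second timeout and the UTF-8 fallback are assumptions rather than requirements. To keep the demonstration runnable offline, it downloads from a `data:` URL, which embeds the document inside the URL itself.

```python
from urllib.parse import quote
from urllib.request import urlopen

def download_html(url, timeout=10):
    """Download a page and decode it to text (hypothetical helper)."""
    # The with-block closes the connection once the body has been read.
    with urlopen(url, timeout=timeout) as response:
        raw_bytes = response.read()
        # Use the charset declared in the headers, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw_bytes.decode(charset, errors="replace")

# Offline demonstration: a data: URL carries the document inside the URL.
demo_url = "data:text/html;charset=utf-8," + quote("<html>Hello</html>")
print(download_html(demo_url))  # <html>Hello</html>
```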

Now, let’s extract the title from our messy HTML using Beautiful Soup.

**Listing 16. 36. Accessing the title with Beautiful Soup**

In [36]:
soup = bs(html_contents, 'html.parser')
print(soup.title.text)

Manning | Data Science Bookcamp


Leveraging our `soup` object, we can further analyze the page. For instance, we can extract the division that contains an _about the book_ header in order to print a description of this book.

**Listing 16. 37. Accessing a description of this book**

In [37]:
for division in soup.find_all('div'):
    header = division.h2
    if header is None:
        continue
        
    if header.text.lower() == 'about the book':
        print(division.text)


about the book

Data Science Bookcamp is a comprehensive set of challenging projects carefully designed to grow your data science skills from novice to master. Veteran data scientist Leonard Apeltsin sets five increasingly difficult exercises that test your abilities against the kind of problems you’d encounter in the real world. As you solve each challenge, you’ll acquire and expand the data science and Python skills you’ll use as a professional data scientist. Ranging from text processing to machine learning, each project comes complete with a unique downloadable data set and a fully-explained step-by-step solution. Because these projects come from Dr. Apeltsin’s vast experience, each solution highlights the most likely failure points along with practical advice for getting past unexpected pitfalls.  When you wrap up these five awesome exercises, you’ll have a diverse relevant skill set that’s transferable to working in industry.
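
The header search above can be packaged into a reusable function. The sketch below is one possible version, run on stand-in markup because the structure of the real page may differ and can change over time. The `find_section` name is hypothetical. Compared with the loop above, it also strips padding around the header text and separates the text fragments with newlines.

```python
from bs4 import BeautifulSoup

# Stand-in markup; the real page's structure is not assumed here.
html = ("<div><h2> About the Book </h2><p>A description.</p></div>"
        "<div><h2>about the author</h2><p>A biography.</p></div>"
        "<div><p>No header here.</p></div>")
soup = BeautifulSoup(html, "html.parser")

def find_section(soup, header_text):
    """Return the text of the first div whose h2 matches header_text."""
    for division in soup.find_all("div"):
        header = division.h2
        # Skip divisions without a header; ignore case and padding.
        if header is not None and header.text.strip().lower() == header_text:
            return division.get_text(separator="\n", strip=True)
    return None  # No division had a matching header

print(find_section(soup, "about the book"))
```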
    
