## Beautiful Soup

* Cheatsheet: http://akul.me/blog/2016/beautifulsoup-cheatsheet/
* Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

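`pip install beautifulsoup4` (`pip install bs4` also works; `bs4` on PyPI is a wrapper package that installs `beautifulsoup4`)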

In [1]:
# Dependencies
from bs4 import BeautifulSoup as bs

In [2]:
html_string = """
<html>
<head>
<title>
A Simple HTML Document
</title>
</head>
<body>
<p>This is a very simple HTML document</p>
<p>It only has two paragraphs</p>
</body>
</html>
"""

* A Beautiful Soup object is created by calling `bs(html_string, 'html.parser')`, which parses the HTML string with Python's built-in `html.parser`. The returned object is assigned to the `soup` variable.

In [3]:
soup = bs(html_string, 'html.parser')

In [4]:
type(soup)

bs4.BeautifulSoup

In [5]:
soup.prettify()

'<html>\n <head>\n  <title>\n   A Simple HTML Document\n  </title>\n </head>\n <body>\n  <p>\n   This is a very simple HTML document\n  </p>\n  <p>\n   It only has two paragraphs\n  </p>\n </body>\n</html>\n'

* The DOM is a tree whose structure is defined by the nesting of tags. Beautiful Soup parses the HTML into this tree and wraps it in an object with methods for traversing and searching the document for tags, attributes, text, etc.

* `type(soup)` confirms that the object created is a `bs4.BeautifulSoup` instance.

* The `prettify()` method returns the document as an indented string that is easier to read. Passing that string to `print()` renders the `\n` escapes as actual line breaks.
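As a rough illustration of the points above, the sketch below prints the prettified string and then walks the parsed tree to list every tag in document order. The variable names `pretty` and `tag_names` are not part of the original notebook; they are introduced here only for the example.

```python
from bs4 import BeautifulSoup as bs

html_string = """
<html>
<head>
<title>
A Simple HTML Document
</title>
</head>
<body>
<p>This is a very simple HTML document</p>
<p>It only has two paragraphs</p>
</body>
</html>
"""

soup = bs(html_string, "html.parser")

# prettify() returns a plain str; printing it shows real line breaks
pretty = soup.prettify()
print(pretty)

# find_all(True) matches every tag, in the order the tags open in the document
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)
```

The list of tag names mirrors the nesting described above: `html` contains `head` and `body`, which in turn contain `title` and the two `p` tags.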


In [7]:
soup.title

<title>
A Simple HTML Document
</title>

In [8]:
soup.title.text

'\nA Simple HTML Document\n'

In [10]:
soup.title.text.strip()

'A Simple HTML Document'
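An alternative to chaining `.text.strip()` is the `get_text()` method with `strip=True`, which should give the same result for a tag that holds a single string. The sketch below rebuilds only the title portion of the document so that it runs on its own; `title_text` is a name chosen for this example.

```python
from bs4 import BeautifulSoup as bs

# Only the <title> element is needed for this sketch
soup = bs("<title>\nA Simple HTML Document\n</title>", "html.parser")

# strip=True trims the whitespace around each piece of text in the tag
title_text = soup.title.get_text(strip=True)
print(title_text)
```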

In [11]:
soup.head

<head>
<title>
A Simple HTML Document
</title>
</head>

In [12]:
# Body of HTML with tags
soup.body

<body>
<p>This is a very simple HTML document</p>
<p>It only has two paragraphs</p>
</body>

In [13]:
# Text of the body with the HTML tags stripped
soup.body.text.strip()

'This is a very simple HTML document\nIt only has two paragraphs'

In [14]:
# Finds all Paragraph tags
soup.body.find_all('p')

[<p>This is a very simple HTML document</p>, <p>It only has two paragraphs</p>]
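`find_all()` always returns a list. When only the first match is needed, `find()` returns a single tag, or `None` if nothing matches. CSS selectors are also available through `select()`. The following is a minimal sketch on a trimmed copy of the document; the names `first_p`, `missing`, and `css_matches` are illustrative only.

```python
from bs4 import BeautifulSoup as bs

html_string = """
<html>
<body>
<p>This is a very simple HTML document</p>
<p>It only has two paragraphs</p>
</body>
</html>
"""

soup = bs(html_string, "html.parser")

# find() returns the first matching tag rather than a list
first_p = soup.find("p")
print(first_p.text)

# find() returns None when no tag matches, so check before using the result
missing = soup.find("table")
print(missing is None)

# select() takes a CSS selector; here, <p> tags that are direct children of <body>
css_matches = soup.select("body > p")
print(len(css_matches))
```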

In [16]:
# Loop through body paragraphs and print each line
for paragraph in soup.body.find_all('p'):
    print(paragraph.text.strip())

This is a very simple HTML document
It only has two paragraphs


In [17]:
# Store the list of all <p> tags in the body
paragraph = soup.body.find_all('p')

In [18]:
# Text of the first paragraph
paragraph[0].text.strip()

'This is a very simple HTML document'
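Because `find_all()` returns a list of tags, the text of every paragraph can be collected in one pass with a list comprehension instead of indexing each element. This is a small sketch on a trimmed copy of the document; `paragraph_texts` is a hypothetical name, chosen in the plural to signal that it holds several values.

```python
from bs4 import BeautifulSoup as bs

html_string = """
<body>
<p>This is a very simple HTML document</p>
<p>It only has two paragraphs</p>
</body>
"""

soup = bs(html_string, "html.parser")

# One stripped string per <p> tag, in document order
paragraph_texts = [p.get_text(strip=True) for p in soup.body.find_all("p")]
print(paragraph_texts)
```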