In [28]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

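try:
    # 'htmll' is misspelled, so the server responds with a 404 error
    html = urlopen('http://pythonscraping.com/pages/page1.htmll')
except HTTPError as e:
    # the server was reached but returned an error status
    print(e)
except URLError as e:
    # the server could not be reached at all
    print("The server could not be found!")
else:
    print("It Worked!")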

HTTP Error 404: Not Found


In [15]:
!pip install bs4
!pip install html5lib




[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Collecting html5lib
  Downloading html5lib-1.1-py2.py3-none-any.whl.metadata (16 kB)
Collecting webencodings (from html5lib)
  Using cached webencodings-0.5.1-py2.py3-none-any.whl.metadata (2.1 kB)
Downloading html5lib-1.1-py2.py3-none-any.whl (112 kB)
Using cached webencodings-0.5.1-py2.py3-none-any.whl (11 kB)
Installing collected packages: webencodings, html5lib
Successfully installed html5lib-1.1 webencodings-0.5.1





In [17]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser') # arguments: (markup the object is built from, parser to use)
print(bs.h1)

<h1>An Interesting Title</h1>


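**The scraper below adds handling for potential errors such as:**

  * HTTPError - the server was reached but returned an error status, for example 404 when the page is not found
  * URLError - the server could not be reached, for example because the domain does not exist or is down
  * AttributeError - a tag is accessed on something that is missing, for example 'bs.body.h1' when the page has no body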

In [34]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup

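def getTitle(url):
    """Return the first h1 tag inside the body, or None on any failure."""
    try:
        html = urlopen(url)
    except HTTPError:
        # the server responded with an error status such as 404 or 500
        return None
    except URLError:
        # the server could not be reached at all
        print("The server could not be found!")
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError:
        # bs.body is None when the page has no body tag
        return None
    return title

title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)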

<h1>An Interesting Title</h1>


**This section covers searching for tags by attributes, working with lists of tags, and navigating parse trees**

This subsection focuses on parsing HTML that uses CSS classes for styling.

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.read(), 'html.parser')

# find_all(tagName, tagAttributes) returns a Python list of every matching tag;
# here it selects the <span class="green"> tags, which wrap the proper nouns on this page
name_list = bs.find_all('span', {'class': 'green'})
for name in name_list:
    # get_text() strips the tags and returns only the text inside them as a Unicode string
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


Calling '.get_text()' should always be the last thing you do, immediately before you print, store, or manipulate your final data.
In general, try to preserve the tag structure of a document for as long as possible, because once the tags are stripped you can no longer search or filter by them.
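
The point about delaying '.get_text()' can be sketched as follows. This is a minimal illustration on a made-up HTML snippet ('sample_html' is an assumption, not one of the pythonscraping.com pages): while the tags are still present you can filter by class, but after '.get_text()' only a plain string is left.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration
sample_html = (
    '<p><span class="green">Anna Pavlovna</span> greeted '
    '<span class="red">the prince</span>.</p>'
)
bs = BeautifulSoup(sample_html, 'html.parser')
paragraph = bs.find('p')

# While the tag structure is intact, we can still select by class
green_names = [span.get_text() for span in paragraph.find_all('span', {'class': 'green'})]

# After get_text(), the class information is gone; only the text remains
plain_text = paragraph.get_text()

print(green_names)
print(plain_text)
```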

This next subsection discusses BeautifulSoup's 'find()' and 'find_all()' functions.

The functions are very similar:
  * find_all(tag, attrs, recursive, text, limit, **kwargs)
  * find(tag, attrs, recursive, text, **kwargs)

95% of the time you will only need to use the first 2 arguments:
  * tag
  * attributes

For the tag parameter of 'find_all', you can pass a string tag name or a list of string tag names:
  * find_all('h1')
  * find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']) --> which returns a list of all the header tags in a document

The attributes (attrs) parameter takes a Python dictionary of attributes and the values to match.
For example, the following call returns BOTH the green and red span tags in the HTML document:
  * .find_all('span', {'class': ['green', 'red']})
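
The tag and attrs arguments described above can be tried without a network connection. The sketch below assumes a small made-up snippet ('sample_html' is hypothetical) and shows a list of tag names and a list of class values in use.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration
sample_html = """
<h1>Main title</h1>
<h2>Subtitle</h2>
<p><span class="green">Anna</span> spoke to <span class="red">the prince</span>
about <span class="blue">the weather</span>.</p>
"""
bs = BeautifulSoup(sample_html, 'html.parser')

# A list of tag names matches any tag in the list
headers = bs.find_all(['h1', 'h2', 'h3'])
header_names = [tag.name for tag in headers]

# A list of attribute values matches tags carrying any one of them
spans = bs.find_all('span', {'class': ['green', 'red']})
span_texts = [tag.get_text() for tag in spans]

print(header_names)
print(span_texts)
```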

The recursive parameter is a Boolean that controls how deep into the document the search goes.
If recursive is set to True (the default), 'find_all' looks into children, children's children, and so on, for tags that match the parameters.
If recursive is set to False, it looks only at the direct children of the tag it is called on.
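
A minimal sketch of the recursive parameter, assuming a made-up snippet with one paragraph directly inside a div and another nested one level deeper (the markup and variable names are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration
sample_html = """
<div id="outer">
  <p>direct child</p>
  <section><p>nested grandchild</p></section>
</div>
"""
bs = BeautifulSoup(sample_html, 'html.parser')
outer = bs.find('div', {'id': 'outer'})

# Default behaviour: search children, grandchildren, and so on
all_paragraphs = outer.find_all('p')

# recursive=False: only look at the direct children of 'outer'
direct_paragraphs = outer.find_all('p', recursive=False)

print(len(all_paragraphs))
print([p.get_text() for p in direct_paragraphs])
```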

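The 'text' parameter (named 'string' in Beautiful Soup 4.4.0 and later) matches on the text content of the tags rather than on the tags' properties. On the War and Peace page used earlier:
  * nameList = bs.find_all(text="the prince")
  * print(len(nameList)) --> 7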
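
Text matching can be sketched on a made-up snippet as well ('sample_html' is hypothetical). The example uses the keyword 'string', which is the newer name for the 'text' argument; the match is on the exact text, so surrounding words do not count.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration
sample_html = """
<p><span class="green">the prince</span> met <span class="green">Anna Pavlovna</span>;
later <span class="green">the prince</span> left.</p>
"""
bs = BeautifulSoup(sample_html, 'html.parser')

# Without a tag name, the result is the matching text nodes themselves
text_nodes = bs.find_all(string='the prince')

# With a tag name, the result is the tags whose text matches
prince_spans = bs.find_all('span', string='the prince')

print(len(text_nodes))
print(len(prince_spans))
```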

The 'limit' parameter is used only with 'find_all'; 'find' behaves like 'find_all' with a limit of 1. Set it if you want to retrieve only the first x items from a page.
Be aware that this gives you the first items on the page in the order they occur, not necessarily the first ones you want.
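
A short sketch of 'limit', assuming a made-up list of four items (the markup is hypothetical). It also shows that 'find' returns a single tag rather than a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration
sample_html = "<ul><li>first</li><li>second</li><li>third</li><li>fourth</li></ul>"
bs = BeautifulSoup(sample_html, 'html.parser')

# Stop after the first two matches, in document order
first_two = bs.find_all('li', limit=2)
first_two_texts = [li.get_text() for li in first_two]

# find() returns the first match itself, or None if nothing matches
first_item = bs.find('li')
missing_item = bs.find('table')

print(first_two_texts)
print(first_item.get_text())
print(missing_item)
```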

The **kwargs parameter allows you to pass any additional named arguments you want into the method.
Any extra arguments that 'find' or 'find_all' doesn't recognize will be used as tag attribute matchers. Because 'class' is a reserved word in Python, BeautifulSoup expects 'class_' (with a trailing underscore) instead:
  * title = bs.find_all(id='title', class_='text')
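
The keyword form can be compared with the dictionary form on a made-up snippet (the markup below is hypothetical). Writing 'class' as a keyword is a Python syntax error, which is why the sketch uses 'class_'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration
sample_html = (
    '<div id="title" class="text">Heading</div>'
    '<div id="body" class="text">Body copy</div>'
)
bs = BeautifulSoup(sample_html, 'html.parser')

# Keyword arguments act as attribute filters; note class_ rather than class
by_keyword = bs.find_all(id='title', class_='text')

# The same filter written as an attrs dictionary
by_dict = bs.find_all(attrs={'id': 'title', 'class': 'text'})

print([tag.get_text() for tag in by_keyword])
print([tag.get_text() for tag in by_dict])
```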

**Navigating Trees**

This section focuses on navigating HTML trees: not just downward, but also up, across, and diagonally.

BeautifulSoup functions always deal with the descendants of the currently selected tag. For instance, 'bs.body.h1' selects the first h1 tag that is a descendant of the body tag. It will not find tags located outside of the 'body'.
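
The descendant rule can be checked on a small made-up page (the markup below is hypothetical): the h1 is found even though it sits inside a div, while the title tag, which lives in the head, is invisible from 'bs.body'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration
sample_html = """
<html>
<head><title>Page title</title></head>
<body>
<div><h1>Inside body</h1></div>
</body>
</html>
"""
bs = BeautifulSoup(sample_html, 'html.parser')

# h1 is a grandchild of body, but still a descendant, so it is found
heading = bs.body.h1

# title is outside body, so searching from body returns None
title_from_body = bs.body.title

# Searching from the whole document does find it
title_from_root = bs.title

print(heading.get_text())
print(title_from_body)
print(title_from_root.get_text())
```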

In [1]:
# to iterate over only the direct children of a tag, use its '.children' attribute
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

for child in bs.find('table', {'id':'giftList'}).children:
  print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


In [None]:
# to iterate over every descendant (children, grandchildren, and so on, including text), use the '.descendants' attribute
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bs = BeautifulSoup(html, "html.parser")

for child in bs.find("table", {"id": "giftList"}).descendants:
    print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<th>
Item Title
</th>

Item Title

<th>
Description
</th>

Description

<th>
Cost
</th>

Cost

<th>
Image
</th>

Image



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<td>
Vegetable Basket
</td>

Vegetable Basket

<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td>

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!

<span class="excitingNote">Now with super-colorful bell peppers!</span>
Now with super-colorful bell peppers!


<td>
$15.00
</td>

$15.00

<td>
<img src="../img/gifts/img1.jpg"