Example: The following page contains a dialogue between characters. The lines spoken by the characters are in red, whereas the names of the characters are in green.

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [2]:
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')

In [3]:
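# Name the parser explicitly so the result does not depend on which parsers are installed
bs_obj = BeautifulSoup(html, 'html.parser')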

The find_all function in BeautifulSoup returns a list of every tag that matches the given filters. Here it selects only the span tags with class = 'green', whose text holds the proper nouns on the page.

In [4]:
namelist = bs_obj.find_all('span', {'class':'green'})
for name in namelist:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


The code above prints the text of each green span: the proper nouns in the passage, most of which are the names of characters.
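
The same selection can be tried offline. This is a minimal sketch that assumes a small, made-up HTML snippet (`sample_html` is hypothetical, not the real page) in place of the downloaded document:

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page.
sample_html = """
<p><span class="red">Some spoken line.</span>
said <span class="green">Anna Pavlovna</span>
to <span class="green">Prince Vasili</span>.</p>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# find_all returns a list-like collection of Tag objects, not plain strings.
green_spans = soup.find_all('span', {'class': 'green'})

# get_text() drops the markup and keeps only the text content.
names = [span.get_text() for span in green_spans]
print(names)  # ['Anna Pavlovna', 'Prince Vasili']

# The class_ keyword is an equivalent way to filter on the CSS class.
same_names = [span.get_text() for span in soup.find_all('span', class_='green')]
```

Both calls select the same tags; the dictionary form is the one used in the cells above.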

In [5]:
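# Count the text nodes that are exactly 'the prince' (case-sensitive).
# `string` is the current name of this argument; older releases call it `text`.
namecount = bs_obj.find_all(string = 'the prince')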
print(len(namecount))

7
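
Note that the match is exact and case-sensitive: the count of 7 leaves out the three "The prince" entries in the listing above. The sketch below, on a hypothetical four-span snippet, contrasts the exact count with a case-insensitive tally built with `collections.Counter`:

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical snippet: three mentions of the prince, one of them capitalised.
sample_html = """
<span class="green">the prince</span>
<span class="green">The prince</span>
<span class="green">Anna Pavlovna</span>
<span class="green">the prince</span>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

# Exact match on the text node: 'The prince' is not counted.
exact = soup.find_all(string='the prince')
print(len(exact))  # 2

# Lower-case every name before tallying to ignore capitalisation.
counts = Counter(
    span.get_text().lower()
    for span in soup.find_all('span', {'class': 'green'})
)
print(counts['the prince'])  # 3
```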


## Children and Descendants

In this step, we will play with the Gifts page, which you can access via http://bit.ly/1KGe2Qk

In [16]:
# Children
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs_obj = BeautifulSoup(html, 'html.parser')
"""for child in bs_obj.find('table', {'id':'giftList'}).children:
    print(child)"""

child_list = []
for child in bs_obj.find('table', {'id':'giftList'}).children:
    child_list.append(child)

In [30]:
child_list[11]

<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>

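The result above is a single child of the giftList table: the row for the fifth gift. Whitespace between rows also counts as a child, which is presumably why this row sits at index 11. If you want all descendants rather than only the direct children, change .children at the end of the for statement to .descendants (a property, so no parentheses).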

In [22]:
# Descendants
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs_obj = BeautifulSoup(html, 'html.parser')
"""for child in bs_obj.find('table', {'id':'giftList'}).descendants:
    print(child)"""

descendant_list = []
for desc in bs_obj.find('table', {'id':'giftList'}).descendants:
    descendant_list.append(desc)

In [34]:
descendant_list[20]

'\n$15.00\n'
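
To make the difference concrete, here is a minimal sketch on a hypothetical one-line table (written without whitespace between tags so the counts are easy to verify by hand):

```python
from bs4 import BeautifulSoup

# Hypothetical table; no whitespace between tags, so there are no stray '\n' nodes.
sample_html = (
    '<table id="giftList">'
    '<tr><th>Item</th></tr>'
    '<tr><td>Box <span>New!</span></td></tr>'
    '</table>'
)
soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find('table', {'id': 'giftList'})

# .children yields only the nodes directly below the table: the two rows.
children = list(table.children)
print(len(children))  # 2

# .descendants walks the whole subtree: rows, cells, the span, and every text node.
descendants = list(table.descendants)
print(len(descendants))  # 8
```

The eight descendants are the two tr tags, the th and td tags, the span, and the three text nodes 'Item', 'Box ' and 'New!'.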

## Siblings

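Sibling navigation makes it convenient to collect data from tables, especially ones with a title row: starting from the first row and iterating over next_siblings yields every row after the title row, but not the title row itself.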

In [7]:
# Use the same html page
for sibling in bs_obj.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling)



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr
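
The blank lines in the output above are whitespace text nodes, which are siblings too. A minimal sketch, assuming a hypothetical three-row table, that keeps only the tag siblings:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table with a title row and newlines between rows.
sample_html = """<table id="giftList">
<tr><th>Item</th><th>Cost</th></tr>
<tr><td>Box</td><td>$1.50</td></tr>
<tr><td>Doll</td><td>$2.00</td></tr>
</table>"""
soup = BeautifulSoup(sample_html, 'html.parser')

# .tr returns the first row of the table, i.e. the title row.
title_row = soup.find('table', {'id': 'giftList'}).tr

# next_siblings yields both tags and whitespace strings; keep only the tags.
data_rows = [node for node in title_row.next_siblings if isinstance(node, Tag)]

# The first cell of each remaining row holds the item name.
items = [row.td.get_text() for row in data_rows]
print(items)  # ['Box', 'Doll']
```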

## Parents

In [8]:
# same page...
print(bs_obj.find('img', {'src':'../img/gifts/img1.jpg'})
     .parent.previous_sibling.get_text())


$15.00



This code prints the price of the item whose image is at "../img/gifts/img1.jpg" (here, $15.00). It moves from the img tag up to its parent td, then back to the previous sibling td, which holds the price.
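
The same two hops can be traced step by step. This sketch assumes a hypothetical single-row table and an invented file name `box.jpg`:

```python
from bs4 import BeautifulSoup

# Hypothetical row: name, price, image. No whitespace between the cells.
sample_html = (
    '<table><tr>'
    '<td>Box</td><td>$1.50</td><td><img src="box.jpg"/></td>'
    '</tr></table>'
)
soup = BeautifulSoup(sample_html, 'html.parser')

img = soup.find('img', {'src': 'box.jpg'})
image_cell = img.parent                   # the <td> that wraps the image
price_cell = image_cell.previous_sibling  # the <td> immediately before it
price = price_cell.get_text()
print(price)  # $1.50

# If cells are separated by whitespace, previous_sibling may be a text node;
# find_previous_sibling('td') skips ahead to the nearest preceding <td> tag.
robust_price = image_cell.find_previous_sibling('td').get_text()
```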

In [9]:
'aa*bbbbb(cc)*(d| )'

'aa*bbbbb(cc)*(d| )'
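
The pattern in the cell above reads as: one or more a's, exactly five b's, any number of "cc" pairs, then a single "d" or space. A quick way to check this reading with the standard `re` module (the candidate strings are made up for illustration):

```python
import re

pattern = re.compile(r'aa*bbbbb(cc)*(d| )')

# Made-up candidates: the first two should match, the rest should not.
candidates = ['aaaabbbbbccccd', 'abbbbb ', 'bbbbbd', 'abbbbbcd', 'aabbbbd']

# fullmatch requires the whole string to fit the pattern, not just a prefix.
results = {text: bool(pattern.fullmatch(text)) for text in candidates}
for text, matched in results.items():
    print(repr(text), matched)
```

'bbbbbd' fails because at least one a is required, 'abbbbbcd' because the c's must come in pairs, and 'aabbbbd' because it has only four b's.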

# Regular Expressions

In [10]:
import re

In [11]:
# same page...
# A raw string keeps the backslashes intact; "/" needs no escaping in Python.
images = bs_obj.find_all('img', {'src':
                                 re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
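
The pattern itself can be checked without BeautifulSoup. The sketch below runs it against a few src values with `search()`, which is how BeautifulSoup applies a compiled pattern according to its documentation; the last entry in `sources` is a made-up path:

```python
import re

image_pattern = re.compile(r'\.\./img/gifts/img.*\.jpg')

sources = [
    '../img/gifts/logo.jpg',   # the page logo: file name does not start with "img"
    '../img/gifts/img1.jpg',
    '../img/gifts/img6.jpg',
    'img/gifts/img1.jpg',      # hypothetical path without the leading "../"
]

# Keep only the sources in which the pattern is found.
matched = [src for src in sources if image_pattern.search(src)]
print(matched)  # ['../img/gifts/img1.jpg', '../img/gifts/img6.jpg']
```

This is why the logo image does not appear in the output above.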


# Lambda Expressions

In [12]:
# Find all tags that have exactly two attributes.
tag2 = bs_obj.find_all(lambda tag: len(tag.attrs) == 2)
tag2[:3]

[<img src="../img/gifts/logo.jpg" style="float:left;"/>,
 <tr class="gift" id="gift1"><td>
 Vegetable Basket
 </td><td>
 This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
 <span class="excitingNote">Now with super-colorful bell peppers!</span>
 </td><td>
 $15.00
 </td><td>
 <img src="../img/gifts/img1.jpg"/>
 </td></tr>,
 <tr class="gift" id="gift2"><td>
 Russian Nesting Dolls
 </td><td>
 Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
 </td><td>
 $10,000.52
 </td><td>
 <img src="../img/gifts/img2.jpg"/>
 </td></tr>]
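
A function passed to find_all receives each tag in turn and should return True for the tags to keep. A minimal sketch on a hypothetical snippet where the attribute counts are easy to see:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: div and img have two attributes, p has one, span has none.
sample_html = (
    '<div id="a" class="x">'
    '<p id="b">one</p>'
    '<img src="i.jpg" alt="pic"/>'
    '<span>two</span>'
    '</div>'
)
soup = BeautifulSoup(sample_html, 'html.parser')

# tag.attrs is a dict of attribute names to values, so len() counts the attributes.
two_attr_tags = soup.find_all(lambda tag: len(tag.attrs) == 2)
tag_names = [tag.name for tag in two_attr_tags]
print(tag_names)  # ['div', 'img']
```

The same idea works for any condition that can be written as a test on a single tag.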