# Chapter 2: Advanced HTML Parsing

---

An HTML page can be complicated and hard to parse, so writing robust scraping code is a necessity: it lets you dig into the HTML comfortably without ending up with code that is fragile, difficult to debug, or both.

---

## 2.1 Do we really need a hammer?

Here we may face the problem of targeting a piece of data (e.g. text, a name, or a statistic) that is buried under almost 20 tags, and you might be tempted to write a hard-coded scraper like this:

In [1]:
# bs.find_all('table')[4].find_all('tr')[2].find('td').find_all('div')[1].find('a')

**Any slight change in the page will break such a scraper immediately, since it relies on the page structure never changing.**

Cure:

- Look for a "Print This Page" link or the mobile version of the website, which may be better formatted.

- Dig into the JavaScript files, which may contain the data you need embedded in them.

- Look more closely at the page URL itself, which can contain some information (e.g. titles).

- A website can be just an aggregated display of other websites, so you can search for those source websites and try to extract the data from them, or search for a completely different website that serves the same data.

***So take a deep breath and think of alternatives.***
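
As a rough illustration of the URL idea above, the sketch below guesses a title from the last segment of a URL using only the standard library. The helper name `title_from_url` and the example URL are hypothetical, and real sites structure their URLs in many different ways, so treat this as a starting point rather than a general solution.

```python
from urllib.parse import urlparse, unquote


def title_from_url(url):
    """Guess a human-readable title from the last path segment of a URL."""
    path = urlparse(url).path
    parts = [part for part in path.split('/') if part]
    if not parts:
        return ''                                # nothing to work with, e.g. a bare domain
    slug = unquote(parts[-1]).rsplit('.', 1)[0]  # drop a file extension if present
    return slug.replace('-', ' ').replace('_', ' ').title()


# Hypothetical URL, used only for illustration
print(title_from_url('https://example.com/blog/2024/advanced-html-parsing.html'))
# Advanced Html Parsing
```

If the URL already carries the value you need, no HTML parsing is required for that field at all.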

---

## 2.2 Other creative & tricky ways

**What if there are no alternatives?!**

### Stylesheets

CSS relies on differentiating HTML elements that might otherwise have the exact same markup in order to style them differently, for example `<span class="green">` and `<span class="red">`.

So web scrapers can easily separate two such tags based on their class.

In [2]:
# Creating a beautifulsoup object
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html.read(), 'html.parser')

In [3]:
# find_all() returns every tag that matches the given name and attributes
# Here it extracts the character names in the story, which are colored green
names_list = bs.find_all('span', {'class':'green'})        # text inside <span> tags, selected by class

for name in names_list:
    print(name.get_text())


Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


**Note that**

*Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Calling .get_text() should always be the last thing you do, immediately before you print, store, or manipulate your final data. In general, you should try to preserve the tag structure of a document as long as possible.*
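
A small sketch of why keeping the tags around helps, assuming a made-up HTML snippet rather than the real page: as long as the results are still tag objects, you can ask structural questions (such as which paragraph a name belongs to) that are impossible once everything has been flattened to text.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration; not the actual War and Peace page
html = ('<p><span class="green">Anna Pavlovna</span> spoke to '
        '<span class="green">the prince</span>.</p>'
        '<p><span class="green">Anatole</span></p>')
soup = BeautifulSoup(html, 'html.parser')

spans = soup.find_all('span', {'class': 'green'})   # still Tag objects
first_para = soup.find('p')

# Filter on structure first, and call get_text() only at the very end
in_first = [s.get_text() for s in spans if s.parent is first_para]
print(in_first)
# ['Anna Pavlovna', 'the prince']
```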

---

## 2.3 BeautifulSoup find() & find_all()

- **`find()`**: Returns the first matching tag (equivalent to `find_all()` with `limit=1`).
- **`find_all()`**: Returns a list of all matching tags.


### Parameters
- `tag`: Specify one or multiple tags to find (e.g., `'h1'` or `['h1', 'h2']`).

- `attributes`: Filter by tag attributes using a dictionary (e.g., `{'class': {'green', 'red'}}`).

- `recursive`: Boolean to control search depth:

   - `True` (default): Searches all descendants.
   - `False`: Searches only immediate children.

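- `text`: Match on text content rather than on tag properties (e.g., `bs.find_all(text='example')`). Used alone, it returns the matching strings rather than tags. Beautiful Soup 4.4.0 and later call this argument `string`, and recent releases warn that `text` is deprecated.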

- `limit` (*`find_all()` only*): Return only the first `x` matches.

- `keywords`: Filter by specific attributes (e.g., `bs.find_all(id='title', class_='text')`).
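
The cells that follow do not exercise `recursive` or `limit`, so here is a minimal sketch of both on a made-up snippet (the HTML and variable names are only for illustration).

```python
from bs4 import BeautifulSoup

# Made-up snippet: one <p> is nested one level deeper than the others
html = '<div id="outer"><p>one</p><section><p>two</p></section><p>three</p></div>'
soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('div', {'id': 'outer'})

all_p = outer.find_all('p')                      # every descendant <p>
direct_p = outer.find_all('p', recursive=False)  # only immediate children
first_two = outer.find_all('p', limit=2)         # stop after two matches

print([p.get_text() for p in all_p])      # ['one', 'two', 'three']
print([p.get_text() for p in direct_p])   # ['one', 'three']
print([p.get_text() for p in first_two])  # ['one', 'two']
```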

In [4]:
# more than one tag
txts = bs.find_all(['h1','h2','h3','h4','h5','h6'])

txts[0].get_text()

'War and Peace'

In [5]:
# can include more than one class within a dictionary
txts = bs.find_all('span', {'class':{'green', 'red'}})

txts[0].get_text()

"Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don't tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by\nthat Antichrist- I really believe he is Antichrist- I will have\nnothing more to do with you and you are no longer my friend, no longer\nmy 'faithful slave,' as you call yourself! But how do you do? I see\nI have frightened you- sit down and tell me all the news."

In [6]:
# the string argument (called text in older versions) finds every occurrence of the given text
nameList = bs.find_all(string='the prince')

for name in nameList:
    print(name.get_text())

print('Length is:', len(nameList))

the prince
the prince
the prince
the prince
the prince
the prince
the prince
Length is: 7



In [7]:
# narrowing the search with keyword arguments (id or any other suitable attribute)
title = bs.find_all(id='title', class_='text') # without the class filter this is equivalent to bs.find_all(id='title') and bs.find_all('', {'id': 'title'})
title

[]

### Keyword Arguments and “Class”

Keyword arguments are somewhat redundant, since there are more flexible and effective techniques that can be used instead (e.g. regular expressions).

You may also run into reserved words of the Python language (e.g. `class`), which make a line like this raise a syntax error:

In [8]:
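# bs.find_all(class='green')          ---> wrong: class is a reserved word (SyntaxError)
# bs.find_all(class_='green')         ---> works: note the trailing underscore
# bs.find_all('', {'class': 'green'}) ---> works: attributes dictionary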

**However, keyword arguments can be useful for filtering the output when you pass a lengthy list of tags to find_all().**
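
A minimal sketch of that last point, using a made-up snippet: a list of tag names casts a wide net, and a single keyword argument then narrows the result.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration
html = ('<h1 class="green">Title</h1><h2 class="red">Sub</h2>'
        '<span class="green">Anna</span><span class="red">quote</span>')
soup = BeautifulSoup(html, 'html.parser')

# Several tag names at once, filtered down by one keyword argument
green = soup.find_all(['h1', 'h2', 'span'], class_='green')

print([(tag.name, tag.get_text()) for tag in green])
# [('h1', 'Title'), ('span', 'Anna')]
```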

---

## 2.4 BeautifulSoup Objects

- BeautifulSoup objects: the whole parsed document, like the variable *bs* we use

- Tag objects: retrieved in lists or individually by calling find_all and find on a BeautifulSoup object, or by drilling down (e.g. bs.div.h1)

- NavigableString objects: represent the text within tags

- Comment objects: represent HTML comments (`<!-- like this one -->`)
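
The four object types can be seen side by side in a short sketch on a made-up snippet. The `string=` filter with a function that checks for `Comment` is one common way to locate comments; other approaches work as well.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up snippet containing a tag, some text, and a comment
soup = BeautifulSoup('<div><h1>Title</h1><!-- hidden note --></div>', 'html.parser')

h1 = soup.div.h1                 # Tag, reached by drilling down
text = h1.string                 # NavigableString: the text inside the tag
comment = soup.find(string=lambda s: isinstance(s, Comment))  # Comment

for obj in (soup, h1, text, comment):
    print(type(obj).__name__)
# BeautifulSoup
# Tag
# NavigableString
# Comment
```

Note that a BeautifulSoup object is itself a kind of Tag, and a Comment is a kind of NavigableString, which is why both support most of the same operations.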

---

## 2.5 Navigating Trees

We may need to find a tag based on its location within the HTML. Since find_all() finds elements by their name and attributes, it can't help us here, and we have to navigate to the location ourselves.

In [9]:
# e.g. bs.tag.subTag.anotherSubTag

### Children & Descendants

*All children are descendants, but not all descendants are children.*
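
To make the distinction concrete, here is a small sketch on a made-up list: `.children` stops one level down, while `.descendants` keeps going.

```python
from bs4 import BeautifulSoup, Tag

# Made-up snippet: the <b> is a grandchild of <ul>, not a child
soup = BeautifulSoup('<ul><li>a <b>bold</b></li><li>c</li></ul>', 'html.parser')
ul = soup.ul

# Both iterators also yield text nodes, so keep only Tag objects here
children = [c.name for c in ul.children if isinstance(c, Tag)]
descendants = [d.name for d in ul.descendants if isinstance(d, Tag)]

print(children)     # ['li', 'li']
print(descendants)  # ['li', 'b', 'li']
```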

In [10]:
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

for child in bs.find('table',{'id':'giftList'}).children:
    print(child)                        # find only descendants that are children



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


### Siblings

The `next_siblings` property of BeautifulSoup adds a powerful collecting technique to the toolkit.

Notice that it returns the siblings that come after the referenced tag, not the tag itself: objects cannot be siblings with themselves.

In [11]:
html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'html.parser')

for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling)                          # titles row is not included



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

In [17]:
# Why not just 
bs.table

<table id="giftList">
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gift

**The more specifically you select a tag (for example by its attributes) while still covering the general case, the more robust your scraper becomes and the easier it is to reuse in similar cases.**

### Parents

In [42]:
# The same logic of children and descendants applies to parents
# The following line extracts the price of a product by starting from its image: go up to the parent cell, then back to the previous cell

print(bs.find('img', {'src':'../img/gifts/img1.jpg'}).parent.previous_sibling.get_text())


$15.00
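
One caveat worth sketching: `.previous_sibling` returns whatever node comes immediately before, which may be a whitespace string when the HTML is indented. The made-up snippet below is formatted that way on purpose, and `find_previous_sibling()` is shown as one way to skip straight to the previous tag.

```python
from bs4 import BeautifulSoup

# Made-up, indented snippet: whitespace sits between the two <td> cells
html = '''
<table>
  <tr>
    <td>$15.00</td>
    <td><img src="../img/gifts/img1.jpg"/></td>
  </tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
cell = soup.find('img', {'src': '../img/gifts/img1.jpg'}).parent

raw = cell.previous_sibling                    # a whitespace-only string here
price_cell = cell.find_previous_sibling('td')  # the previous <td> tag

print(repr(raw.strip()))       # ''
print(price_cell.get_text())   # $15.00
```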



---

## 2.6 Regular Expressions

In web scraping, regular expressions (regex) act like precision tools, helping you extract specific patterns such as emails, phone numbers, or URLs from raw HTML. They bring order to unstructured data, making it easy to isolate exactly what you need with a simple yet powerful set of patterns.
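
As a rough sketch of that idea with the standard library only: the pattern below is a deliberately simple approximation of an email address (real address syntax is more permissive), and the addresses are made up.

```python
import re

# Made-up HTML fragment for illustration
raw_html = ('<p>Contact: <a href="mailto:info@example.com">info@example.com</a> '
            'or sales@example.org</p>')

# Simplified pattern: local part, '@', domain, and a final dot-separated label
email_pattern = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

# The same address appears twice in the fragment, so deduplicate with a set
emails = sorted(set(email_pattern.findall(raw_html)))
print(emails)
# ['info@example.com', 'sales@example.org']
```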

### Regular Expressions and BeautifulSoup

A scraper can run into many problems when it depends on the position of the target, so, as said before, we need to generalize it.

In [43]:
# The image source URLs follow a noticeable pattern, which we capture with a regex
import re

# Raw string: dots are escaped so they match literally, and '/' needs no escaping
images = bs.find_all('img', {'src':re.compile(r'\.\./img/gifts/img.*\.jpg')})

for image in images:
    print(image['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg



*A regular expression can be inserted as any argument in a BeautifulSoup expression, allowing you a great deal of flexibility in finding target elements.*
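
For instance, a compiled pattern can stand in for the tag name or for the `string` argument, not just for an attribute value. A minimal sketch on a made-up snippet:

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet; <header> is included to show the pattern is anchored
html = '<h1>War and Peace</h1><h2>Chapter One</h2><p>Text</p><header>x</header>'
soup = BeautifulSoup(html, 'html.parser')

# Regex as the tag name: matches h1 through h6 but not <header>
headings = soup.find_all(re.compile(r'^h[1-6]$'))
print([tag.name for tag in headings])   # ['h1', 'h2']

# Regex as the string argument: matches text nodes containing 'Chapter'
chapters = soup.find_all(string=re.compile(r'Chapter'))
print([str(s) for s in chapters])       # ['Chapter One']
```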

---

## 2.7 Accessing Attributes

Often in web scraping you’re not looking for the content of a tag; you’re looking for its attributes.

In [49]:
# Here we extract the id of the table tag
bs.find_all('table')[0].attrs       # We can also ask for one attribute, e.g. bs.find_all('table')[0].attrs['id']

{'id': 'giftList'}
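
A few more ways of reading attributes, sketched on a made-up link tag: indexing the tag directly, using `.get()` when the attribute may be missing, and noting that `class` comes back as a list because it can hold several values.

```python
from bs4 import BeautifulSoup

# Made-up tag for illustration
soup = BeautifulSoup('<a id="home" class="nav main" href="/index.html">Home</a>',
                     'html.parser')
link = soup.a

print(link['href'])        # /index.html  (shortcut for link.attrs['href'])
print(link.get('title'))   # None         (no KeyError for a missing attribute)
print(link['class'])       # ['nav', 'main']
```

Indexing with `link['title']` would raise a `KeyError` here, so `.get()` is the safer choice when an attribute is optional.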

---

## 2.8 Lambda Expressions

A lambda expression is a small anonymous function that can be passed into another function as a variable. It is handy when the logic is too simple to deserve a separately defined function.

*One thing to mention is that a function passed to find_all() must take a tag object as its argument and return a boolean.*

In [53]:
check = lambda tag: len(tag.attrs) == 2

bs.find_all(check)[0]

<img src="../img/gifts/logo.jpg" style="float:left;"/>

Here we can see that one of the tags with exactly two attributes is the img tag (its attributes are `src` and `style`).

**Notice also that**

Because the provided lambda function can be any function that returns a True or False value, you can even combine them with regular expressions to find tags with an attribute matching a certain string pattern.
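
A minimal sketch of that combination, on a made-up snippet modeled on the gift images: the lambda checks the tag name and then applies a compiled pattern to the `src` attribute. The pattern itself is an assumption chosen for this example.

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet for illustration
html = ('<img src="../img/gifts/logo.jpg" style="float:left;"/>'
        '<img src="../img/gifts/img1.jpg"/>'
        '<img src="../img/gifts/img2.jpg"/>'
        '<p>no src here</p>')
soup = BeautifulSoup(html, 'html.parser')

pattern = re.compile(r'img\d+\.jpg$')   # file names like img1.jpg, img2.jpg

# tag.get('src', '') avoids a KeyError on tags without a src attribute
gift_images = soup.find_all(
    lambda tag: tag.name == 'img' and bool(pattern.search(tag.get('src', '')))
)

print([img['src'] for img in gift_images])
# ['../img/gifts/img1.jpg', '../img/gifts/img2.jpg']
```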

---

## End

Chapter 2, "Advanced HTML Parsing", teaches advanced techniques for using BeautifulSoup to navigate and parse more complex HTML structures. It covers filtering by tag attributes and CSS classes, navigating the tree, and using regular expressions and lambda expressions to extract data from intricate web pages.