## Materials
- Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#
- CSS selector reference: https://www.w3school.com.cn/cssref/css_selectors.asp


## Parsers:
  - Python's built-in HTML parser: `BeautifulSoup(markup, 'html.parser')`
  - lxml's HTML parser: `BeautifulSoup(markup, "lxml")`
  - lxml's XML parser: `BeautifulSoup(markup, "xml")`
  - html5lib: `BeautifulSoup(markup, "html5lib")`
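
As a rough illustration of the parser choices listed above, the sketch below feeds the same deliberately broken markup to each HTML parser and records what comes back. The variable names `markup` and `results` are made up for this example; `lxml` and `html5lib` are optional third-party packages, so the code simply notes when one is not installed. The exact repaired output can differ between parsers.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<p>Hello<b>World"  # deliberately left unclosed

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        # Each parser repairs the broken markup in its own way
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output if output is not None else "not installed")
```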

## Anatomy of a Standard HTML Element (Kinds of Objects): 🌟
```<tagName attributeName='attributeValue'>Affected Content<!--Comments and other special strings--></tagName>```
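
The parts of the element above map onto a few Beautiful Soup object types. The following is a small sketch, using the built-in `html.parser` and a made-up one-line document, that shows which class each part ends up as.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

soup = BeautifulSoup(
    "<p class='intro'>Affected Content<!--a comment--></p>", "html.parser"
)

p = soup.p                 # the element itself is a Tag
text, comment = p.contents # its two children: a string and a comment

print(type(soup).__name__)                  # the whole document
print(type(p).__name__, p.name, p.attrs)    # tag name and attributes
print(type(text).__name__, repr(str(text)))
print(type(comment).__name__, repr(str(comment)))
```

`Comment` is a subclass of `NavigableString`, so a check with `isinstance(node, Comment)` is the usual way to tell comments apart from ordinary text.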

In [1]:
pip install beautifulsoup4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
from bs4 import BeautifulSoup

In [3]:
#Example: parse with lxml
#create a BeautifulSoup instance
soup=BeautifulSoup('<p>Hello World</p>','lxml')
print(soup.p.string)

Hello World


## Usage of BeautifulSoup

In [4]:
#Example:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
#the HTML text above is incomplete (the closing </body> and </html> tags are missing)
soup = BeautifulSoup(html, 'lxml') #BeautifulSoup(markup, parser_name); the parser repairs the HTML structure automatically.
print(soup.prettify()) #turn a Beautiful Soup parse tree into a nicely formatted Unicode string
print(soup.title.string) #use .title.string to print the text content of <title>...</title>

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story


## Node Selector
- Method: use the **name** of a node as an attribute to select it, then use `.string` to get the *content* of the node (`soup.tagName.string`).
- The selected node is printed together with its opening and closing tags.
- If several nodes match, the selector returns only the **first** one and ignores the rest.
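
A minimal sketch of the three points above, using a made-up two-paragraph document and the built-in `html.parser`. It also shows that selecting a tag name that does not occur in the document gives `None`.

```python
from bs4 import BeautifulSoup

doc = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.p             # only the first <p> is returned
missing = soup.table         # no <table> in the document
all_p = soup.find_all("p")   # use find_all to get every match

print(first_p)          # the tag, including <p>...</p>
print(first_p.string)   # just the text inside it
print(missing)
print(len(all_p))
```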

### Get Content in the Node

In [5]:
#we use the html stated above
print(soup.title)
print(type(soup.title)) #bs4.element.Tag is the important data structure in bs4
print(soup.title.string)
print(soup.head)
print(soup.p)

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


### Get Name of the Node
- use the `.tagName.name` attribute

In [6]:
print(soup.title.name)

title


In [7]:
print(soup.p.name)

p


### Get Attribute of Node
- use the `.tagName.attrs` attribute
- a node may have several attributes; `.attrs` returns **all** of them for the selected node.
- Returned as a *dictionary*: `{key: value}`
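
Because attributes behave like a dictionary, a missing attribute raises `KeyError` when indexed. The sketch below, with a made-up `<a>` tag, shows the safer `.get()` lookup and how a multi-valued attribute such as `class` comes back as a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/x" class="btn btn-primary">Go</a>', "html.parser")
link = soup.a

href = link["href"]                     # single-valued attribute: a string
classes = link["class"]                 # multi-valued attribute: a list
target = link.get("target")             # missing attribute: None, no exception
title = link.get("title", "no title")   # missing attribute with a default

try:
    link["target"]
    raised = False
except KeyError:
    raised = True                       # indexing a missing attribute fails

print(href, classes, target, title, raised)
```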

In [8]:
print(soup.a.attrs)

{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}


In [9]:
print(soup.p.attrs)
print(soup.p.attrs['name'])

{'class': ['title'], 'name': 'dromouse'}
dromouse


In [10]:
#get an attribute value by indexing the tag directly
print(soup.p['name'])
print(soup.p['class'])
print(soup.a['id'])
#the result is a string or a list: multi-valued attributes such as class always come back as a list (even with a single value); other attributes such as name and id come back as a string.

dromouse
['title']
link1


### Nested Selection
- Select one element, then select an element inside it.
- Pattern: `soup.element1.element2`

In [11]:
html1 = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""
soup1 = BeautifulSoup(html1, 'lxml')
#get head element, then select title element inside head
print(soup1.head.title)
print(type(soup1.head.title))
print(soup1.head.title.string)

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story


## Associated Selection
Sometimes we cannot reach the element we want in one step; we first select one element, then move to its child, parent, or sibling nodes.

### Child and Descendant Nodes
- use `soup.element.contents`
   - It returns the child nodes (tags and strings) in a list.
   - All the results are **direct** child nodes.
- use `soup.element.children`
   - It returns an iterator; use a `for` loop to go through the results.
     - An iterator produces its items one at a time while looping, instead of building the whole list first.
   - All the results are **direct** child nodes.
- use `soup.element.descendants`
   - It returns all child nodes, both direct and indirect.
   - Returned as a generator
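
To make the difference between direct children and all descendants concrete, here is a small sketch on a made-up `<div>` with no whitespace between tags (so no extra whitespace strings appear). The names `div` and `tag_names` are only for illustration.

```python
from bs4 import BeautifulSoup, Tag

html_doc = "<div><p>One <b>two</b></p><p>Three</p></div>"
div = BeautifulSoup(html_doc, "html.parser").div

direct = div.contents                 # list of direct children
also_direct = list(div.children)      # same nodes, produced lazily
everything = list(div.descendants)    # children, grandchildren, and strings

# Keep only tags, skipping the text nodes
tag_names = [node.name for node in div.descendants if isinstance(node, Tag)]

print(len(direct), len(also_direct), len(everything))
print(tag_names)
```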


In [12]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']


In [13]:
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

<list_iterator object at 0x7f6a9f89add0>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.
        


In [14]:
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

<generator object Tag.descendants at 0x7f6a9f89c4d0>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9  
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.
        


### Parent and Ancestor Nodes
- Use `soup.element.parent`
   - gets the direct parent of a node
- Use `soup.element.parents`
   - gets all ancestor nodes at once
   - returned as a generator
   - use `list(enumerate(soup.element.parents))` to list the results with their index.
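
Printing whole ancestors quickly becomes unreadable because every ancestor contains the ones below it. A compact alternative, sketched here on a made-up document, is to print only the name of each ancestor. The last entry is the `BeautifulSoup` object itself, whose name is `[document]`.

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><p><a>x</a></p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

direct_parent = soup.a.parent.name
ancestor_names = [ancestor.name for ancestor in soup.a.parents]

print(direct_parent)
print(ancestor_names)
```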

In [15]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>


In [16]:
print(soup.a.parents)
print(type(soup.a.parents))
#use list() to get index and content
print(list(enumerate(soup.a.parents)))

<generator object PageElement.parents at 0x7f6a9f8ffc50>
<class 'generator'>
[(0, <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters

### Sibling Nodes
- use `soup.element.next_sibling` and `soup.element.previous_sibling` to get the single adjacent node on either side.
- use `list(enumerate(soup.element.next_siblings))` and `list(enumerate(soup.element.previous_siblings))` to get multiple results.
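
Siblings include text nodes, so the adjacent node is often just whitespace between tags. The sketch below, on a made-up pair of links, contrasts `.next_sibling` with `find_next_sibling()`, which skips ahead to the next matching tag.

```python
from bs4 import BeautifulSoup

html_doc = "<p><a id='first'>A</a>\n<a id='second'>B</a></p>"
soup = BeautifulSoup(html_doc, "html.parser")

raw_next = soup.a.next_sibling             # the newline between the two tags
next_tag = soup.a.find_next_sibling("a")   # the next <a> tag, skipping text

print(repr(raw_next))
print(next_tag["id"])
```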

In [17]:
html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
"""
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))

Next Sibling 
            Hello
            
Prev Sibling 
            Once upon a time there were three little sisters; and their names were
            
Next Siblings [(0, '\n            Hello\n            '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
Prev Siblings [(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]


### Extract Information (content, attribute)


In [18]:
html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
        </p>
"""
soup = BeautifulSoup(html, 'lxml')
#if the result is a single node, use .string to get its content
print('Next Sibling:')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
#if the result is a generator, convert it into a list and index it to get an element
print('Parent:')
print(type(soup.a.parents))
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])

Next Sibling:
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
Parent:
<class 'generator'>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
['story']


## Searching the Tree
### find_all
   - gets all elements that satisfy the given constraints
   - method signature
   ```find_all(name, attrs, recursive, string, limit, **kwargs)```
     - `string`: search for strings instead of tags
     - `limit`: pass in a number to stop after that many results, e.g. `soup.find_all("a", limit=2)`
     - `recursive`: pass `recursive=False` to consider only direct children. By default, `find_all` examines all descendants of the tag it is called on.
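
The effect of `limit` and `recursive` can be seen on a small made-up tree in which one of the three links is nested one level deeper. This is only a sketch; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = "<div><a>1</a><p><a>2</a></p><a>3</a></div>"
div = BeautifulSoup(html_doc, "html.parser").div

all_links = div.find_all("a")                      # every <a> at any depth
first_two = div.find_all("a", limit=2)             # stop after two matches
direct_only = div.find_all("a", recursive=False)   # direct children only

print([a.string for a in all_links])
print([a.string for a in first_two])
print([a.string for a in direct_only])
```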

#### name
  - only considers tags with the given name
  - `soup.find_all(name=...)`
  - returns a list (a `ResultSet`) of all matches; the list is empty if nothing matches.


In [19]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul')) #return as a list with length equals 2
print(type(soup.find_all(name='ul')[0]))

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>


In [20]:
#get <li> with for loop
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]


In [21]:
#`ul` still refers to the last <ul> from the loop in the previous cell
for li in ul.find_all(name='li'):
    print(li.string)

Foo
Bar


#### attrs
- `soup.find_all(attrs={key: value})`
   - returns a list
- for common attributes you can skip the `attrs` dictionary and pass a keyword argument instead: `soup.find_all(attributeName=attributeValue)`
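
Two details are worth a quick sketch: searching by `class` matches any one of a tag's classes, and attribute names that are not valid Python identifiers (such as `data-*`) have to go through the `attrs` dictionary. The markup and the `data-kind` attribute below are made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<ul class="list" id="l1"></ul>'
    '<ul class="list list-small" id="l2" data-kind="short"></ul>'
)
soup = BeautifulSoup(html_doc, "html.parser")

by_list = soup.find_all(class_="list")              # matches both <ul> tags
by_small = soup.find_all(class_="list-small")       # matches only the second
by_data = soup.find_all(attrs={"data-kind": "short"})

print([ul["id"] for ul in by_list])
print([ul["id"] for ul in by_small])
print([ul["id"] for ul in by_data])
```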

In [22]:
print(soup.find_all(attrs={'id': 'list-2'}))
print(soup.find_all(attrs={'class': 'panel-heading'}))

[<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
[<div class="panel-heading">
<h4>Hello</h4>
</div>]


In [23]:
print(soup.find_all(id='list-2'))
print(soup.find_all(class_='panel-heading')) #class is a reserved word in Python, so the keyword argument is class_ (with a trailing underscore)

[<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
[<div class="panel-heading">
<h4>Hello</h4>
</div>]


#### string (formerly text)
- matches the text of nodes; the value can be a string or a regular expression. `string` is the current name of this argument; `text` is the older name.

In [24]:
import re
html='''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(string=re.compile('link')))

['Hello, this is a link', 'Hello, this is a link, too']
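
Searching by text alone returns the matching strings, not the tags around them. The sketch below, on a made-up snippet, shows an exact-string match, a way back to the enclosing tag through `.parent`, and how combining a tag name with a text filter returns tags instead.

```python
import re
from bs4 import BeautifulSoup

html_doc = (
    "<div><a>Hello, this is a link</a>"
    "<a>Hello, this is a link, too</a><p>No match</p></div>"
)
soup = BeautifulSoup(html_doc, "html.parser")

exact = soup.find_all(string="Hello, this is a link")   # whole-string match
enclosing = exact[0].parent.name                        # tag that holds the string
tags = soup.find_all("a", string=re.compile("too"))     # tags whose text matches

print(exact)
print(enclosing)
print(tags)
```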


### find
- returns only the **first** matching element, while `find_all` returns every matching element.
- returns a single element, not a list
- `find(name, attrs, recursive, string, **kwargs)`; it behaves like `find_all` with `limit=1`, except that it returns the element itself
- if nothing matches, it returns *None*.
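
Since `find` returns `None` when nothing matches, chaining straight onto its result can fail with `AttributeError`. A minimal sketch of a guarded lookup follows; the helper name `heading_text` and its default value are hypothetical, chosen only for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><h4>Hello</h4></div>", "html.parser")

def heading_text(document, tag_name, default=""):
    """Return the text of the first matching tag, or a default if absent."""
    found = document.find(tag_name)
    return found.get_text() if found is not None else default

present = heading_text(soup, "h4")
absent = heading_text(soup, "h1", default="(no heading)")

print(present)
print(absent)
print(soup.find("h1"))
```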



In [25]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(type(soup.find(name='ul')))
print(soup.find(class_='list'))

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>


## CSS Selector
- use `soup.select()` with a valid CSS selector.
- Find tags beneath other tags (descendants at any depth): `soup.select('tag1 tag2 tag3...')`
- Find tags directly beneath other tags: `soup.select('tag1 > tag2')`
- Find the siblings of tags: `soup.select('tag1 ~ tag2')` (all following siblings) or `soup.select('tag1 + tag2')` (only the immediately following sibling)
- `.select_one()`: find only the first tag that matches a selector.
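
The combinators above are easy to mix up, so here is a short sketch on a made-up `<div>` where one `<p>` is nested inside a `<span>`. The ids `a`, `b`, and `c` exist only to make the results easy to read.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<div><p id='a'>1</p><span><p id='b'>2</p></span><p id='c'>3</p></div>"
)
soup = BeautifulSoup(html_doc, "html.parser")

def ids(tags):
    """Collect the id attribute of each tag, using the tag name if it has none."""
    return [tag.get("id", tag.name) for tag in tags]

descendants = ids(soup.select("div p"))     # any depth
children = ids(soup.select("div > p"))      # direct children only
later_sibs = ids(soup.select("#a ~ p"))     # all following <p> siblings
adjacent_p = ids(soup.select("#a + p"))     # next sibling must be a <p>
adjacent_span = ids(soup.select("#a + span"))
first = soup.select_one("p")["id"]

print(descendants, children, later_sibs, adjacent_p, adjacent_span, first)
```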

In [26]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
#Find tags beneath other tags
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>


### Nested Selection
- first select one element, then find all matching elements under the previously selected element; each result is a list

In [27]:
for ul in soup.select('ul'):
    print(ul.select('li'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]


### Get Attributes


In [28]:
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

list-1
list-1
list-2
list-2


### Extract Content
- use `get_text()`
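
`get_text()` and `.string` agree only when a tag holds a single string. The sketch below uses a made-up `<li>` with mixed content to show where they differ, along with the `separator` and `strip` arguments of `get_text()`.

```python
from bs4 import BeautifulSoup

li = BeautifulSoup("<li>Foo <b>Bar</b></li>", "html.parser").li

as_string = li.string                      # None: the tag has two children
full_text = li.get_text()                  # all text joined together
stripped = li.get_text(strip=True)         # each piece stripped, no separator
spaced = li.get_text(" ", strip=True)      # stripped pieces joined by a space

print(as_string)
print(repr(full_text), repr(stripped), repr(spaced))
```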

In [29]:
for li in soup.select('li'):
    #both give the same result here because each <li> contains only a single string
    print('Get Text:', li.get_text())
    print('String:', li.string)

Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
Get Text: Jay
String: Jay
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar


## Other topics: 
- Modifying
  - append()
  - extend()
  - .new_tag()
  - NavigableString()
  - insert()
  - insert_before()
  - insert_after()
  - clear()
  - extract()
  - decompose()
  - replace_with()
  - wrap()
  - unwrap()
  - smooth()
- Parser Customization 
  - Handling duplicate attributes
  - Customizing multi-valued attributes
  - SoupStrainer
- Comparing objects for equality: use `==`
- Line numbers: `Tag.sourceline` and `Tag.sourcepos`; pass `store_line_numbers=False` to the `BeautifulSoup` constructor to turn this off
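
As a brief taste of the modifying methods listed above, the following sketch applies four of them (`new_tag()`, `append()`, `decompose()`, `unwrap()`) to a made-up paragraph. The link target is a placeholder, and this is only one possible sequence of edits.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b><i>!</i></p>", "html.parser")

# Create a new <a> tag and add it as the last child of <p>
link = soup.new_tag("a", href="http://example.com")
link.string = "link"
soup.p.append(link)

soup.i.decompose()   # remove <i> and its contents entirely
soup.b.unwrap()      # remove the <b> tag but keep its text

result = str(soup)
print(result)
```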

For more, refer to: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#output