# Beautiful Soup and lxml
Once we have the HTML file as a string, we can search that string for the data we need. There are three ways to do this:

1. Go through the string with a `for` loop and write code that extracts what we need from the `response.text` string. (Requires some programming skill and a lot of time.)

2. Use a `regex` to search the string. (Requires some time to write the regex.)

3. Use an external module that does the searching for us. The module is called `Beautiful Soup`.
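
To make option 2 concrete, here is a minimal sketch of a regex search. The `sample_html` string is a made-up stand-in for `response.text`, and the pattern only handles a plain `<title>` tag with no attributes, which is exactly the kind of fragility that makes a real parser attractive.

```python
import re

# Hypothetical snippet standing in for response.text
sample_html = "<html><head><title>Example Domain</title></head><body><p>Hello</p></body></html>"

# Non-greedy match for whatever sits between <title> and </title>
match = re.search(r"<title>(.*?)</title>", sample_html, flags=re.IGNORECASE | re.DOTALL)

# re.search returns None when nothing matches, so guard before reading the group
title = match.group(1) if match else None
print(title)  # Example Domain
```

A pattern like this has to be rewritten for every new piece of data, and it stops matching as soon as the markup changes slightly.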


In [24]:
import requests
response = requests.get("https://example.com/") # Response Object
response_html_str = response.text # str; specific elements inside the HTML tags cannot be searched or manipulated yet
print(type(response_html_str))

<class 'str'>


#### `Beautiful Soup` is a Python library for getting data out of HTML and some other markup languages.
#### `lxml` is a parser for interpreting the HTML.

`BeautifulSoup(string, parser)`, the class constructor in the `bs4` module, takes a string and the name of a parser as inputs and returns a special `BeautifulSoup` object.
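
As a small offline illustration of the constructor, the sketch below parses a made-up HTML string instead of a downloaded page. It assumes `"html.parser"`, the parser bundled with Python, so that it runs even where `lxml` is not installed; passing `"lxml"` works the same way when that package is available. The names `demo_html` and `demo_soup` are hypothetical.

```python
import bs4

# Hypothetical HTML string, used here instead of a real response
demo_html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"

# Second argument is the parser name; "html.parser" ships with Python
demo_soup = bs4.BeautifulSoup(demo_html, "html.parser")

print(type(demo_soup))         # <class 'bs4.BeautifulSoup'>
print(demo_soup.title.string)  # Demo
```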


In [26]:
import bs4
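import lxml # not called directly; importing it confirms the parser is installed

# Parse the HTML str into a structured object, so its elements can be searched and manipulated the Python way
soup = bs4.BeautifulSoup(response_html_str, "lxml") # BeautifulSoup Object
print(type(soup))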
print(soup)


<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for 

## After parsing the HTML string...
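* `soup.find(name, attrs)` returns a `Tag` object for the first match, or `None` if nothing matches. We can search by tag name, by attribute, or by both.

* `soup.find_all(name, attrs)` returns a list of all matching `Tag` objects (an empty list if nothing matches).

* `tag.get(attributeName)` returns the value of that attribute on the tag, or `None` if the tag has no such attribute.

* `tag.getText()` returns the text content of the HTML tag; it is an alias of `tag.get_text()`.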
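
The sketch below tries each of these methods on a small made-up document, including the cases where nothing matches. It assumes the built-in `"html.parser"` so it runs without `lxml`; the HTML, the class names, and the variable names are all hypothetical.

```python
import bs4

# Hypothetical document with two paragraphs and one link
demo_html = """
<div>
  <p class="intro">First paragraph.</p>
  <p class="note" id="second">Second with <a href="https://example.com/more">a link</a>.</p>
</div>
"""
demo = bs4.BeautifulSoup(demo_html, "html.parser")

by_name = demo.find("p")                      # search by tag name: first <p>
by_attr = demo.find(attrs={"id": "second"})   # search by attribute only
by_both = demo.find("p", {"class": "note"})   # search by name and attribute
all_p = demo.find_all("p")                    # every <p> in the document
link = demo.find("a")
missing = demo.find("table")                  # no <table> exists, so this is None

print(by_name.getText())    # First paragraph.
print(by_attr.get("id"))    # second
print(by_both.getText())    # Second with a link.
print(len(all_p))           # 2
print(link.get("href"))     # https://example.com/more
print(link.get("title"))    # None, the attribute is absent
print(missing)              # None
```

Because `find` can return `None`, it is worth checking the result before calling `.get()` or `.getText()` on it.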


In [46]:
# BeautifulSoup_Object.find(): first matching tag
tag_obj = soup.find("p")
print(type(tag_obj))

# BeautifulSoup_Object.find_all(): list of every matching tag
tag_obj_list = soup.find_all("p")
print(len(tag_obj_list), type(tag_obj_list[0]))

# Tag_Object.find(): search inside the second <p> only
print(tag_obj_list[1])
a_tag = tag_obj_list[1].find('a')
print(a_tag.attrs) # get all attributes as a dict

# Tag_Object.get(): value of one attribute
print(a_tag.get("href"))

# Tag_Object.getText(): text content of the tag
print(tag_obj)
print(tag_obj.getText())

print(a_tag)
print(a_tag.getText())

<class 'bs4.element.Tag'>
2 <class 'bs4.element.Tag'>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
{'href': 'https://www.iana.org/domains/example'}
https://www.iana.org/domains/example
<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
<a href="https://www.iana.org/domains/example">More information...</a>
More information...
