<h1 style="color:blue;text-align:center">  Web Crawling with Python v.03</h1>
<h3 style="color:blue;text-align:center"> Chih-Hung Lai   </h3>
<h4 style="color:blue;text-align:center"> Created: 2017  &emsp; Last modified: 2022.04.14 </h4>

# Table of Contents  
1. Introduction    
 1.1 What is web crawling?   
 1.2 Prerequisites  
 1.3 Good etiquette for web crawling
2. Retrieve data from requests module  
 2.1 requests.get( ) method  
 2.2 requests.get(url).text  
 2.3 Search a targeted string in a website  
3. Encoding 
4. Chrome DevTools (part 1)    
5. HTML parsing tool: BeautifulSoup module   
 5.1 Introduction to BeautifulSoup   
 5.2 Getting HTML tag from BeautifulSoup.tag object  
 5.3 Getting HTML tag from BeautifulSoup.find( )   
6. Get content from tag object  
 6.1 Get string from tag object  
 6.2 Get child tags from tag objects
 6.3 Get attributes from tag objects

<h1 id="Introduction" style="color:blue"> 1. Introduction

## 1.1 What is web crawling?

- Definition

    - Web crawling is the process of indexing data on web pages by using a program or automated script. Such programs go by several names, including web crawler, web scraper, spider, and spider bot, and are often simply called crawlers.


- Big web crawlers: Search engines
    - Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so that users can search more efficiently. The goal of a crawler is to learn what webpages are about. This enables users to retrieve any information on one or more pages when it’s needed.
    
    
- Why is web crawling important?
    - https://research.aimultiple.com/web-crawler/

* Categories
    * One webpage
    * One website
    * Multiple websites
    

* Steps
    1. Identify the target websites, webpages, or parts of a webpage
        * Tool: Chrome DevTools
           
    2. Determine the type of webpage
        * Static webpages
            * requests package 
        * Interactive webpages
            * Selenium package
    3. Parse the webpage
        * BeautifulSoup package
    4. Process the data

In [1]:
# Simple crawler

import requests
url = 'https://w3schools.com/python/demopage.htm'
html = requests.get(url)
html.encoding="utf-8"
print(html.text)

<!DOCTYPE html>
<html>
<body>

<h1>This is a Test Page</h1>

</body>
</html>


In [2]:
# Simple crawler

import requests
url = 'https://edition.cnn.com/'
html = requests.get(url)
print(html.text)

<!DOCTYPE html><html class="no-js"><head><meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"><meta charset="utf-8"><meta content="text/html" http-equiv="Content-Type"><meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0"><link rel="dns-prefetch" href="/optimizelyjs/128727546.js" /><link rel="dns-prefetch" href="//tpc.googlesyndication.com" /><link rel="dns-prefetch" href="//pagead2.googlesyndication.com" /><link rel="dns-prefetch" href="//www.googletagservices.com" /><link rel="dns-prefetch" href="//partner.googleadservices.com" /><link rel="dns-prefetch" href="//www.google.com" /><link rel="dns-prefetch" href="//aax.amazon-adsystem.com" /><link rel="dns-prefetch" href="//c.amazon-adsystem.com" /><link rel="dns-prefetch" href="//cdn.krxd.net" /><link rel="dns-prefetch" href="//ads.rubiconproject.com" /><link rel="dns-prefetch" href="//optimized-by.rubiconproject.com" /><link rel="dns-prefetch" href="//fastlane.rubiconproject.com" /><link rel

## 1.2 Prerequisites
- HTML 
- CSS 

## 1.3 Good etiquette for web crawling

- [Robots exclusion standard](https://en.wikipedia.org/wiki/Robots_exclusion_standard)

    - The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. 
    
    
- robots.txt
    - When a site owner wishes to give instructions to web robots, they place a text file called robots.txt in the root of the website hierarchy (e.g. https://www.xxxx.com/robots.txt).
    - This text file contains the instructions in a specific format. 
    - Robots that choose to follow the instructions try to fetch this file and read the instructions before fetching any other file from the website. 
    - If this file doesn't exist, web robots assume that the website owner does not wish to place any limitations on crawling the entire site.
    
    
- Example
    - https://edition.cnn.com/robots.txt  (CNN)
    - https://udn.com/robots.txt  (UDN)
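
The rules described above can be checked from Python before crawling. The following is a minimal sketch that uses the standard-library `urllib.robotparser` module. The rules, the crawler name `MyCrawler`, and the `www.example.com` URLs are made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real crawler would download the site's own file
rules = """
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())   # feed the rules line by line

# Ask whether a crawler called 'MyCrawler' (hypothetical name) may fetch each URL
public_ok = parser.can_fetch('MyCrawler', 'https://www.example.com/news/index.html')
private_ok = parser.can_fetch('MyCrawler', 'https://www.example.com/private/data.html')
print(public_ok, private_ok)
```

For a live site, `parser.set_url('https://.../robots.txt')` followed by `parser.read()` would download the file instead of parsing a string.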

### robots.txt documentation

- English
    - https://en.wikipedia.org/wiki/Robots_exclusion_standard
- Chinese
    - https://www.seoseo.com.tw/article_detail_602.html  
    - https://www.awoo.com.tw/blog/robotstxt-crawl/

<h1 a name="request" style="color:blue"> 2. Retrieve data from requests module

- install requests module (select one)
  - pip install requests 
  - pip3 install requests
- pip list   
  - list all installed packages (or modules)

In [3]:
import requests
url = 'https://edition.cnn.com/'
html = requests.get(url)
print(html.text)

<!DOCTYPE html><html class="no-js"><head><meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"><meta charset="utf-8"><meta content="text/html" http-equiv="Content-Type"><meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0"><link rel="dns-prefetch" href="/optimizelyjs/128727546.js" /><link rel="dns-prefetch" href="//tpc.googlesyndication.com" /><link rel="dns-prefetch" href="//pagead2.googlesyndication.com" /><link rel="dns-prefetch" href="//www.googletagservices.com" /><link rel="dns-prefetch" href="//partner.googleadservices.com" /><link rel="dns-prefetch" href="//www.google.com" /><link rel="dns-prefetch" href="//aax.amazon-adsystem.com" /><link rel="dns-prefetch" href="//c.amazon-adsystem.com" /><link rel="dns-prefetch" href="//cdn.krxd.net" /><link rel="dns-prefetch" href="//ads.rubiconproject.com" /><link rel="dns-prefetch" href="//optimized-by.rubiconproject.com" /><link rel="dns-prefetch" href="//fastlane.rubiconproject.com" /><link rel

## 2.1 requests.get( ) method

- Function
    - Sends a GET request to the specified url

- Syntax
    - requests.get(url, params=None, **kwargs)
    - https://www.w3schools.com/python/ref_requests_get.asp

- Return value
    - The get( ) method returns a ***requests.Response object***, which carries the HTTP status code of the reply.
        - HTTP status codes (in Chinese): https://zh.wikipedia.org/wiki/HTTP%E7%8A%B6%E6%80%81%E7%A0%81
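
To see what the `params` argument does, the sketch below builds a GET request without sending it and prints the resulting URL. The address `https://example.com/search` and the query values are hypothetical; only the way the query string is appended matters here.

```python
import requests

# Hypothetical URL and query values, used only to show how params are encoded
request = requests.Request('GET', 'https://example.com/search',
                           params={'q': 'python', 'page': 2})
prepared = request.prepare()   # builds the request but does not send it
print(prepared.url)
```

Passing the same dictionary as `requests.get(url, params=...)` would send the request with that query string; adding a `timeout` value in seconds is usually a good idea so that a slow server cannot block the program forever.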


In [5]:
# Get a designated webpage

import requests
url = input('Please input URL:')
html = requests.get(url)
print(html.text)

Please input URL:https://www.cwb.gov.tw/V8/C/W/OBS_Sat.html
<!DOCTYPE html>
<!--[if IE 8]> <html lang="zh-Hant-TW" class="ie8"> <![endif]-->
<!--[if IE 9]> <html lang="zh-Hant-TW" class="ie9"> <![endif]-->
<!--[if !IE]><!-->
<html lang="zh-Hant-TW">
<!--<![endif]-->

<head>
  <title>衛星雲圖  | 交通部中央氣象局</title>
  <!-- Meta --> 
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta name="description" content="">
  <meta name="author" content="">
  <!--Facebook_meta_begin-->
  <!-- Facebook -->
  <meta property="og:url" content="https://www.cwb.gov.tw/V8/C/W/OBS_Sat.html" />
  <meta property="og:image" content="https://www.cwb.gov.tw/Data/satellite/LCC_VIS_TRGB_1000/LCC_VIS_TRGB_1000.jpg?" />
  <meta property="og:title" content="衛星雲圖 - 中央氣象局全球資訊網" /> 
  <meta property="og:description" content="衛星雲圖" />
  <meta property="og:site_name" content="中央氣象局全球資訊網" />
  <!--Facebook_meta_end-->
  <!--tw.gov_meta_begin-->
  <meta name="DC.Creator" cont

### requests.Response object

#### Common properties and methods of the requests.Response object

- text
    - Returns the content of the response as a Unicode string

- status_code
    - Returns a number that indicates the HTTP status
        - 200 is OK (requests.codes.ok)
        - 404 is Not Found
        - 400-599 indicate a failed request (client or server error)

- raise_for_status( )
    - Checks the status in more detail
    - If the status code indicates an error (400-599), this method raises an HTTPError exception; otherwise it returns None

- url
    - Returns the URL of the response

- headers
    - Returns a dictionary of response headers

In [6]:
import requests
url = 'https://www.ndhu.edu.tw'
html = requests.get(url)
html.encoding="utf-8"
if html.status_code == requests.codes.ok:   # is 200
    print(html.url)
    print(html.status_code)
    print(requests.codes.ok)
    print('file size:', len(html.text))

https://www.ndhu.edu.tw/
200
200
file size: 58552


In [8]:
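# failed connection: the host name cannot be resolved,
# so requests.get( ) itself raises ConnectionError

import requests
url = 'http://aaa.bbb.com.tw/'
html = requests.get(url)
print(html.status_code)
html.raise_for_status()   # call the method; raises HTTPError for 4xx/5xx responses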

ConnectionError: HTTPConnectionPool(host='aaa.bbb.com.tw', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001C09D4ADA30>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
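
An unhandled ConnectionError like the one above stops the program. One possible way to keep a crawler running is to catch the exception, as in the sketch below. The helper name `fetch_text` and the 5-second timeout are assumptions made for this example, and the host uses the reserved `.invalid` domain, which is never supposed to resolve.

```python
import requests

def fetch_text(url, timeout=5):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()          # raises HTTPError for 4xx/5xx
    except requests.exceptions.RequestException as err:
        # ConnectionError, Timeout and HTTPError all derive from RequestException
        print('Request failed:', type(err).__name__)
        return None
    return response.text

# The reserved .invalid top-level domain should never resolve, so this call fails
result = fetch_text('http://nonexistent.invalid/')
print(result)
```

Returning None lets the caller decide whether to skip the page, retry later, or stop.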

#### Server headers

In [10]:
# get headers of websites' response

import requests
r = requests.get("https://edition.cnn.com/")
print(r.headers['Content-Type'])
print(r.headers['Content-Length'])
print(r.headers['Date'])


text/html; charset=utf-8
154705
Fri, 29 Apr 2022 14:26:02 GMT


## 2.2 requests.get(url).text

- Returned text is a string

- Convert to a list
    - requests.get(url).text.splitlines( ) 

In [11]:
# Returned text is a string

import requests
url = 'https://www.ndhu.edu.tw'
html = requests.get(url)
html.encoding="utf-8"
print(html.text)

<!DOCTYPE html>
<html lang="zh-tw">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta name="viewport" content="initial-scale=1.0, user-scalable=1, minimum-scale=1.0, maximum-scale=3.0">
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black">
<meta name="keywords" content="National Dong Hwa University, NDHU, Dong Hwa, University, 國立東華大學, 東華, 大學, 亞洲, 台灣, 花蓮, 壽豐, 志學, RUR, THE, IOH, 最佳大學" />
<meta name="description" content="國立東華大學 NDHU, National Dong Hwa University 被譽為花東縱谷裡的學術殿堂，具有特色研究與卓越教學之綜合型大學，位於民風純樸、天然及人文資源豐富的花蓮，發揮多元化的教育功能，提昇台灣東部學術及文化水準，朝向成為國際性一流大學的目標邁進。" />
<meta name="robots" content="all" />
<meta name="googlebot" content="all" />
<meta name="google-site-verification" content="xZd74RR0DFg7QRfiNJAZFlmnrpYiJMI9EjBOw7oOHBs" />
<meta property="og:title" content="國立東華大學 NDHU, National Dong Hwa University">
<meta property="og

### Convert a string to a list

In [13]:
import requests
url = 'https://www.ndhu.edu.tw'
html = requests.get(url).text.splitlines( ) 
for i in range(0, 15):
    print(html[i]) 

<!DOCTYPE html>
<html lang="zh-tw">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta name="viewport" content="initial-scale=1.0, user-scalable=1, minimum-scale=1.0, maximum-scale=3.0">
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black">
<meta name="keywords" content="National Dong Hwa University, NDHU, Dong Hwa, University, 國立東華大學, 東華, 大學, 亞洲, 台灣, 花蓮, 壽豐, 志學, RUR, THE, IOH, 最佳大學" />
<meta name="description" content="國立東華大學 NDHU, National Dong Hwa University 被譽為花東縱谷裡的學術殿堂，具有特色研究與卓越教學之綜合型大學，位於民風純樸、天然及人文資源豐富的花蓮，發揮多元化的教育功能，提昇台灣東部學術及文化水準，朝向成為國際性一流大學的目標邁進。" />
<meta name="robots" content="all" />
<meta name="googlebot" content="all" />
<meta name="google-site-verification" content="xZd74RR0DFg7QRfiNJAZFlmnrpYiJMI9EjBOw7oOHBs" />
<meta property="og:title" content="國立東華大學 NDHU, National Dong Hwa University">
<meta property="og

## 2.3 Search a targeted string in a website

<pre>  
if string in requests.get(url).text:  
     print("Congratulations")  
else:  
     print("Sorry")  
</pre>

In [14]:
import requests

url = 'https://www.csie.ndhu.edu.tw/en/professors/'
name = input("Please input a name:")
html = requests.get(url).text
if name in html:
    print("Find out!")
else:
    print("Sorry! Cannot find {}".format(name))

Please input a name:yang
Find out!


In [15]:
# uniform invoice numbers

import requests

url = 'https://invoice.etax.nat.gov.tw/'
name = input("Please input a number:")
html = requests.get(url).text
if name in html:
    print("Congratulation!")
else:
    print("Sorry! Cannot find {}".format(name))

Please input a number:34
Congratulation!


In [16]:
import requests
url = 'https://www.csie.ndhu.edu.tw/en/professors/'
html = requests.get(url)
html.encoding="utf-8"

htmllist = html.text.splitlines()
n=0
for row in htmllist:
    if "Yang" in row: n+=1
print("Find {} person/people!".format(n))

Find 3 person/people!
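
Note that the `in` operator matches substrings, so a short input such as `34` succeeds whenever any longer number on the page happens to contain it. If whole words or whole numbers are wanted, a regular expression is one option. The sketch below uses a made-up HTML fragment instead of a downloaded page, and the helper name `contains_whole_token` is hypothetical.

```python
import re

def contains_whole_token(text, token):
    """True if token appears in text and is not part of a longer word or number."""
    pattern = r'(?<!\w)' + re.escape(token) + r'(?!\w)'
    return re.search(pattern, text) is not None

# Made-up sample standing in for downloaded HTML
sample = '<td>12345678</td><td>Ching-Nung Yang</td>'

substring_hit = '34' in sample                              # plain substring test
whole_number_hit = contains_whole_token(sample, '34')       # '34' only as a complete token
full_number_hit = contains_whole_token(sample, '12345678')
name_hit = contains_whole_token(sample, 'Yang')
print(substring_hit, whole_number_hit, full_number_hit, name_hit)
```

`re.escape` keeps characters such as `.` or `+` in the user's input from being interpreted as regular-expression syntax.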


 <a name = "Encoding"> <h1 style="color:blue"> 3. Encoding

* Get encoding
    - object.encoding
        - html = requests.get(url)  
        -  print(html.encoding)
* Set encoding
    - html.encoding = 'big5'   
    - html.encoding = 'utf8'   

In [17]:
import requests
url = 'https://tw.news.yahoo.com/'
html = requests.get(url)
print(html.encoding)

utf-8


In [18]:
# mismatched encoding: decoding UTF-8 content as Big5 garbles the Chinese text

import requests

r = requests.get("https://www.ndhu.edu.tw/")
print(r.encoding)  
r.encoding = 'big5'  
print(r.text)

UTF-8
<!DOCTYPE html>
<html lang="zh-tw">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta name="viewport" content="initial-scale=1.0, user-scalable=1, minimum-scale=1.0, maximum-scale=3.0">
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black">
<meta name="keywords" content="National Dong Hwa University, NDHU, Dong Hwa, University, ���蝡���梯�臬之摮�, ��梯��, 憭批飛, 鈭�瘣�, ��啁��, ��梯��, 憯質��, 敹�摮�, RUR, THE, IOH, ���雿喳之摮�" />
<meta name="description" content="���蝡���梯�臬之摮� NDHU, National Dong Hwa University 鋡怨亳��箄�望�梁萵靚瑁ㄐ���摮貉��畾踹��嚗���瑟����寡�脩��蝛嗉�����頞����摮訾��蝬�������憭批飛嚗�雿���潭��憸函��璅詻��憭拍�嗅��鈭箸��鞈�皞�鞊�撖������梯�殷����潭�桀����������������脣����踝����������啁����梢�典飛銵����������瘞湔��嚗������������箏�������找��瘚�憭批飛�����格�������脯��" />
<meta name="robots" content="all" />
<meta name="googlebot" content="all" />
<meta name="google-site-verif
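
The garbled characters above come from decoding UTF-8 bytes with the wrong codec. The same effect can be reproduced without any network access, as in this small sketch; the sample text is copied from the page keywords, and `errors='replace'` is used only so that undecodable bytes do not raise an exception.

```python
original = '國立東華大學'                        # text taken from the page above
raw = original.encode('utf-8')                   # the bytes a UTF-8 server would send

right = raw.decode('utf-8')                      # matching codec: text is restored
wrong = raw.decode('big5', errors='replace')     # mismatched codec: text is garbled

print(right)
print(wrong)
```

Setting `html.encoding` on a Response object chooses which codec `html.text` uses for this decoding step.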

<a name = "Developer">  
<h1 style="color:blue"> 4. Chrome DevTools

#### What is the Chrome DevTools?

- "Chrome DevTools" is short for "Chrome Developer Tools".
- It is a set of web developer tools built directly into the Google Chrome browser. 
- HTML, CSS, and JavaScript run in the browser, so they are the three languages you will work with in Chrome DevTools.
- It helps you edit pages on the fly and diagnose problems quickly, which ultimately helps you build better websites faster.
- You can manipulate the code; the changes appear only in your browser and are gone once you refresh the page.

Reference
- https://developer.chrome.com/docs/devtools/overview/ 

#### How to open Chrome DevTools?

- Windows or Linux
    - Right-click the page and choose "Inspect"
    - Function key: F12
    - Ctrl+Shift+I

- Mac
    - Command+Option+C (or Command+Option+I)
    

#### Change location of DevTools
- By default, DevTools appears on the right side of the browser window
- Change location
    - Ctrl+Shift+D (switches back to the previous dock position)
    - Click the three vertical dots at the upper-right corner of the panel and choose the preferred "Dock side" setting.
        - ...(three vertical dots) / Dock side

#### Elements tab in the Chrome DevTools

- Consists of three panes: Structure, Text, and Scripts. 
- The Structure pane shows the HTML DOM structure of the page that is currently active in the browser. 
- Inspect icon (in the upper-left corner)
    - Select an element on the webpage to inspect its source code
        
- You can edit the structure or content in the source code, and the page updates dynamically. (Hint: the change is temporary.)

### Practice

1. Use DevTools to find a specific part of a webpage
2. Get content from:
    - HTML tag
    - Class name
    - Id name

In [19]:
# List the names of all faculty members 

import requests
url = 'https://www.csie.ndhu.edu.tw/en/professors/'
html = requests.get(url)
html.encoding="utf-8"

# <div class="prof-info-main">Ph.D., National Chiao Tung University<br style="height: auto;">Neural Networks, Deep Learning, Machine Learning, Pattern Recognition, Intelligent Human-Machine Interface</div>

from bs4 import BeautifulSoup

sp = BeautifulSoup(html.text, 'html.parser')

print(sp.title)

print()
a1 = sp.find(class_ = "prof-nameEn")
print(a1)

a2 = sp.find_all(class_ = "prof-nameEn")
print()
for i in a2:
    print(i.text)


<title>Professors – 國立東華大學 資訊工程學系暨研究所</title>

<div class="prof-nameEn">I-Cheng Chang</div>

I-Cheng Chang
Shin-Feng Lin
Cheng-Chin Chiang
Chang-Hsiung Tsai
Wen-Yen Chang
Shiow-Yang Wu
Chenn-Jung Huang
Shih-Chien Chou
Ching-Nung Yang
Shi-Jim Yen
Mau-Tsuen Yang
Han-Ying Kao
Pao-Lien Lai
Hsin-Chou Chi
Guan-Ling Lee
Shou-Chih Lo
Min-Xiou Chen
Chih-Hung Lai
Tao-Ku Chang
Chung Yung
Wei-Che Chien
Wen-Kai Tai
Sheng-Lung Peng
Chang Chin-Chen
Hong Zhao


<a name = "BeautifulSoup"><h1 style="color:blue"> 5. HTML parsing tool: BeautifulSoup module 

In [20]:
html_doc1 ="""<!DOCTYPE html>
<html lang="big5">
 <head>
  <meta charset="utf-8"/>
  <title>Test my webpage</title>
 </head>
 <body>
  <!-- Surveys -->
  <div class="surveys" id="surveys">
   <div class="survey" id="q1">
    <p class="question">
      <a href="http://example.com/q1">Gender?</a></p>
    <ul class="answer">
     <li class="response">Male - 
       <span class="score selected">20</span></li>
     <li class="response">Female - 
       <span class="score">10</span></li>
    </ul>
   </div>
   <div class="survey" id="q2">
    <p class="question">
      <a href="http://example.com/q2">Do you like the website?</a></p>
    <ul class="answer">
     <li class="response">Like - 
       <span class="score">40</span></li>
     <li class="response">Normal - 
       <span class="score selected">20</span></li>
     <li class="response">Dislike - 
       <span class="score">0</span></li>
    </ul>
   </div>
   <div class="survey" id="q3">
    <p class="question">
      <a href="http://example.com/q3">Can you design web crawlers?</a></p>
    <ul class="answer">
     <li class="response">Yes - 
       <span class="score selected">34</span></li>
     <li class="response">No - 
       <span class="score">6</span></li>
    </ul>
   </div>
  </div>
  <div class="emails" id="emails">
    <div class="question">Email list: </div>
    abc@example.com
    <div class="survey" data-custom="important">def@example.com</div>
    <span class="survey" id="email">ghi@example.com</div>
  </div>
 </body>
</html>
"""

In [21]:
html_doc2 ="""<!DOCTYPE html>
<html lang="big5">
 <head>
   <title>Test my webpage</title>
 </head>
 <body>
  <!-- Surveys -->
  <div class="surveys" id="surveys">
   <div class="survey" id="q1">
    <p class="question">
      <a href="http://example.com/q1">Gender?</a></p>
    <ul class="answer">
     <li class="response">Male - 
       <span class="score selected">20</span></li>
     <li class="response">Female - 
       <span class="score">10</span></li>
    </ul>
   </div>
   <div class="survey" id="q2">
    <p class="question">
      <a href="http://example.com/q2">Do you like the website?</a></p>
   </div>
 </body>
</html>
"""

In [22]:
print(type(html_doc1))
print()
print(html_doc1)

<class 'str'>

<!DOCTYPE html>
<html lang="big5">
 <head>
  <meta charset="utf-8"/>
  <title>Test my webpage</title>
 </head>
 <body>
  <!-- Surveys -->
  <div class="surveys" id="surveys">
   <div class="survey" id="q1">
    <p class="question">
      <a href="http://example.com/q1">Gender?</a></p>
    <ul class="answer">
     <li class="response">Male - 
       <span class="score selected">20</span></li>
     <li class="response">Female - 
       <span class="score">10</span></li>
    </ul>
   </div>
   <div class="survey" id="q2">
    <p class="question">
      <a href="http://example.com/q2">Do you like the website?</a></p>
    <ul class="answer">
     <li class="response">Like - 
       <span class="score">40</span></li>
     <li class="response">Normal - 
       <span class="score selected">20</span></li>
     <li class="response">Dislike - 
       <span class="score">0</span></li>
    </ul>
   </div>
   <div class="survey" id="q3">
    <p class="question">
      <a href="http://

## 5.1 Introduction to BeautifulSoup

- Beautiful Soup is a Python library for ***pulling data*** out of HTML and XML files. 
- It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. 
- It commonly saves programmers hours or days of work.

In [23]:
# install BeautifulSoup
# (Anaconda already includes it by default)

!pip install beautifulsoup4



### Import module

- from bs4 import BeautifulSoup

### Syntax

- BeautifulSoup(markup_string, parser_name)

- Parsers
    - lxml, html.parser, ...
    - https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup

### Sources to parse 
- string
- file
- webpage

In [24]:
from bs4 import BeautifulSoup
# parse a string
a = "<html><head></head><body>My test webpage!</body></html>"
print('type(a):', type(a))
print()
print(BeautifulSoup(a , 'html.parser'))
print()


# parse a file
with open("html/test_page.html") as fp:
    soup1 = BeautifulSoup(fp, 'html.parser')
    
print(type(soup1))  # class 'bs4.BeautifulSoup'  
print()
print(soup1)


type(a): <class 'str'>

<html><head></head><body>My test webpage!</body></html>



FileNotFoundError: [Errno 2] No such file or directory: 'html/test_page.html'

In [26]:
# parse from a webpage

import requests
from bs4 import BeautifulSoup
url = 'https://edition.cnn.com/'
html = requests.get(url)
html.encoding = 'utf8'  
sp = BeautifulSoup(html.text, 'lxml')
print(type(sp))
print()
print(sp)

<class 'bs4.BeautifulSoup'>

<!DOCTYPE html>
<html class="no-js"><head><meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/><meta charset="utf-8"/><meta content="text/html" http-equiv="Content-Type"/><meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0" name="viewport"/><link href="/optimizelyjs/128727546.js" rel="dns-prefetch"/><link href="//tpc.googlesyndication.com" rel="dns-prefetch"/><link href="//pagead2.googlesyndication.com" rel="dns-prefetch"/><link href="//www.googletagservices.com" rel="dns-prefetch"/><link href="//partner.googleadservices.com" rel="dns-prefetch"/><link href="//www.google.com" rel="dns-prefetch"/><link href="//aax.amazon-adsystem.com" rel="dns-prefetch"/><link href="//c.amazon-adsystem.com" rel="dns-prefetch"/><link href="//cdn.krxd.net" rel="dns-prefetch"/><link href="//ads.rubiconproject.com" rel="dns-prefetch"/><link href="//optimized-by.rubiconproject.com" rel="dns-prefetch"/><link href="//fastlane.rubiconproject.com" rel="dn

### BeautifulSoup_object.prettify( ) method
- Turns a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string
- Missing HTML tags (for example, unclosed tags) are filled in by the parser, so they appear in the output

In [None]:
print(sp)

In [27]:
# sp1 = BeautifulSoup(html_doc1, 'lxml')
print(sp.prettify())

<!DOCTYPE html>
<html class="no-js">
 <head>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8"/>
  <meta content="text/html" http-equiv="Content-Type"/>
  <meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0" name="viewport"/>
  <link href="/optimizelyjs/128727546.js" rel="dns-prefetch"/>
  <link href="//tpc.googlesyndication.com" rel="dns-prefetch"/>
  <link href="//pagead2.googlesyndication.com" rel="dns-prefetch"/>
  <link href="//www.googletagservices.com" rel="dns-prefetch"/>
  <link href="//partner.googleadservices.com" rel="dns-prefetch"/>
  <link href="//www.google.com" rel="dns-prefetch"/>
  <link href="//aax.amazon-adsystem.com" rel="dns-prefetch"/>
  <link href="//c.amazon-adsystem.com" rel="dns-prefetch"/>
  <link href="//cdn.krxd.net" rel="dns-prefetch"/>
  <link href="//ads.rubiconproject.com" rel="dns-prefetch"/>
  <link href="//optimized-by.rubiconproject.com" rel="dns-prefetch"/>
  <link href="//fastlane.rubico

## 5.2 Getting HTML tag from BeautifulSoup.tag object

 - Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. 
 - But you’ll only ever have to deal with about four kinds of objects: 
     - Tag
     - NavigableString 
     - BeautifulSoup
     - Comment.

### Methods of getting HTML tags 

(1) BeautifulSoup_object.tag_name  
        -> this section  
(2) BeautifulSoup_object.find( ) OR BeautifulSoup_object.find_all( )   
        -> section 5.3  
(3) BeautifulSoup_object.select_one( ) or BeautifulSoup_object.select( )  
        -> section 5.4

### (1) Syntax of BeautifulSoup_object.tag_name

- BeautifulSoup_object.tag_name
    - returns the first matching tag as a bs4.element.Tag object

In [None]:
html_doc2 ="""<!DOCTYPE html>
<html lang="big5">
 <head>
   <title>Test my webpage</title>
 </head>
 <body>
  <!-- Surveys -->
  <div class="surveys" id="surveys">
   <div class="survey" id="q1">
    <p class="question">
      <a href="http://example.com/q1">Gender?</a></p>
    <ul class="answer">
     <li class="response">Male - 
       <span class="score selected">20</span></li>
     <li class="response">Female - 
       <span class="score">10</span></li>
    </ul>
   </div>
   <div class="survey" id="q2">
    <p class="question">
      <a href="http://example.com/q2">Do you like the website?</a></p>
   </div>
 </body>
</html>
"""

In [28]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc2, 'lxml')
print(type(soup))
print()
tag = soup.p
print(type(tag))
print()
print(tag)
print()
print(tag.a)

<class 'bs4.BeautifulSoup'>

<class 'bs4.element.Tag'>

<p class="question">
<a href="http://example.com/q1">Gender?</a></p>

<a href="http://example.com/q1">Gender?</a>


### Structure of a webpage: DOM

In [None]:
html_doc ="""<!DOCTYPE html>
<html lang="big5">
 <head>
  <meta charset="utf-8"/>
  <title>Test my webpage</title>
 </head>
 <body>
  <!-- Surveys -->
  <div class="surveys" id="surveys">
   <div class="survey" id="q1">
    <p class="question">
      <a href="http://example.com/q1">Gender?</a></p>
    <ul class="answer">
     <li class="response">Male - 
       <span class="score selected">20</span></li>
     <li class="response">Female - 
       <span class="score">10</span></li>
    </ul>
   </div>
   <div class="survey" id="q2">
    <p class="question">
      <a href="http://example.com/q2">Do you like the website?</a></p>
    <ul class="answer">
     <li class="response">Like - 
       <span class="score">40</span></li>
     <li class="response">Normal - 
       <span class="score selected">20</span></li>
     <li class="response">Dislike - 
       <span class="score">0</span></li>
    </ul>
   </div>
   <div class="survey" id="q3">
    <p class="question">
      <a href="http://example.com/q3">Can you design web crawlers?</a></p>
    <ul class="answer">
     <li class="response">Yes - 
       <span class="score selected">34</span></li>
     <li class="response">No - 
       <span class="score">6</span></li>
    </ul>
   </div>
  </div>
  <div class="emails" id="emails">
    <div class="question">Email list: </div>
    abc@example.com
    <div class="survey" data-custom="important">def@example.com</div>
    <span class="survey" id="email">ghi@example.com</div>
  </div>
 </body>
</html>
"""

### DOM (Document Object Model)

- When a web page is loaded, the browser creates a Document Object Model of the page.

- The HTML DOM model is constructed as a tree of Objects:

<img src = images/html_structure.jpg width = 600>
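
The tree structure can also be made visible in text form. The sketch below uses only the standard-library `html.parser` module to print each opening tag indented by its nesting depth. The class name `TreePrinter` and the sample markup are made up for this illustration, and the approach assumes every tag in the sample is explicitly closed.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Collect one indented line per opening tag to expose the nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append('  ' * self.depth + tag)
        self.depth += 1          # children of this tag are one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1          # assumes every tag in the sample is closed

# Made-up, well-formed sample without void elements such as <img> or <br>
sample = ("<html><head><title>Test</title></head>"
          "<body><div><p><a href='#'>Link</a></p></div></body></html>")

printer = TreePrinter()
printer.feed(sample)
print('\n'.join(printer.lines))
```

Each extra level of indentation corresponds to one step down the tree, which is the same parent/child relationship used by expressions such as `bs1.ul.li.span`.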

In [25]:
from bs4 import BeautifulSoup
bs1 = BeautifulSoup(html_doc1, 'lxml')
print(type(bs1))
print()
print(bs1.title)

print()
print(bs1.a)
print()
print(bs1.ul)
print()
print(bs1.ul.li)   # nested
print()
print(bs1.ul.li.span)   # nested

<class 'bs4.BeautifulSoup'>

<title>Test my webpage</title>

<a href="http://example.com/q1">Gender?</a>

<ul class="answer">
<li class="response">Male - 
       <span class="score selected">20</span></li>
<li class="response">Female - 
       <span class="score">10</span></li>
</ul>

<li class="response">Male - 
       <span class="score selected">20</span></li>

<span class="score selected">20</span>


## 5.3 Getting HTML tag from BeautifulSoup.find( )

### Syntax

- ***BeautifulSoup_object.find(name, attrs, recursive, string, **kwargs)***
    - Function: returns the first tag that matches the given tag name and/or attribute values
    - Returns None if no tag meets the condition
    - Example
        - sp1.find("a")
            - returns the first "a" tag
            - the tag name must be enclosed in quotation marks
        - sp1.find(id = "std_id")
- The find( ) and find_all( ) methods are the same, except:
    - find( ) returns the first tag
        - bs4.element.Tag
    - find_all( ) returns all tags
        - bs4.element.ResultSet: like a list of tags
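
Because find( ) may return None while find_all( ) returns an empty result, the two need slightly different checks. The following is a small sketch on a made-up snippet; the variable names are chosen only for this example.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration
snippet = '<ul><li class="response">Yes</li><li class="response">No</li></ul>'
soup = BeautifulSoup(snippet, 'html.parser')

first_item = soup.find('li')          # a single Tag
all_items = soup.find_all('li')       # a ResultSet that behaves like a list
missing = soup.find('table')          # None: the snippet has no <table>
no_tables = soup.find_all('table')    # empty ResultSet

# Guard against None before touching attributes of the result
table_text = missing.text if missing is not None else '(no table found)'

print(first_item.text, len(all_items), table_text, len(no_tables))
```

Skipping the None check and writing `soup.find('table').text` directly would raise an AttributeError when the tag is absent.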

In [29]:
html_doc3 = """
<html><head><title>Web Title</title></head>
<body>
<p class="title"><b> My Document </b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/Rebecca" class="sister1" id="link1">Rebecca</a>,
<a href="http://example.com/Richard" class="sister2" id="link2">Richard <img src="Richard.jpg"> </a> 
and
<a href="http://example.com/tillie" class="sister1" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<img src = "images/pig.png" width = "20" height = "100">

<p class="story">...</p>
</body> </html>
"""

### (1) find("HTML_tag name")

In [30]:
from bs4 import BeautifulSoup
sp3 = BeautifulSoup(html_doc3,'lxml') 
print(sp3.title)   # the same as sp3.find("title")
print(sp3.find("title"))
print(type(sp3.find("title")))
print()
print(sp3.head)
print(sp3.find("head"))
print()
print(sp3.a)
print(sp3.find("a"))
print()
print(sp3.p)
print(sp3.find("p"))

<title>Web Title</title>
<title>Web Title</title>
<class 'bs4.element.Tag'>

<head><title>Web Title</title></head>
<head><title>Web Title</title></head>

<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>
<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>

<p class="title"><b> My Document </b></p>
<p class="title"><b> My Document </b></p>


In [31]:
print(sp3.find_all("title"))
print(type(sp3.find_all("title")))
print()
print(sp3.find_all("head"))
print()
print(sp3.find_all("a"))
print()
print(sp3.find_all("p"))

[<title>Web Title</title>]
<class 'bs4.element.ResultSet'>

[<head><title>Web Title</title></head>]

[<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>, <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>, <a class="sister1" href="http://example.com/tillie" id="link3">Tillie</a>]

[<p class="title"><b> My Document </b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>,
<a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a> 
and
<a class="sister1" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]


### (2) find(attribute = 'value')
### (3)  find("HTML_tag_name", attribute = 'value')

- .find(src = "xxx")
- .find(id="id_name")
- .find(class_="class_name")
- .find_all(class_="class_name")
- .find_all("a", class_="class_name")

Hint: use "class_" instead of "class", because "class" is a reserved keyword in Python.
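
A tag can carry several classes at once. The sketch below, which assumes a small made-up score list (`score_html`) and the built-in `"html.parser"`, illustrates how `class_` appears to behave in that case: a single class name matches any tag that has that class, while a space-separated value is compared with the whole class attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration
score_html = """
<ul>
  <li class="response">Female - <span class="score">10</span></li>
  <li class="response">Male - <span class="score selected">20</span></li>
</ul>
"""
score_soup = BeautifulSoup(score_html, "html.parser")

# One class name matches every tag whose class list contains it
any_score = score_soup.find_all("span", class_="score")
only_selected = score_soup.find_all("span", class_="selected")

# A space-separated value is matched against the exact class attribute string
exact_value = score_soup.find_all("span", class_="score selected")

print([s.text for s in any_score])       # ['10', '20']
print([s.text for s in only_selected])   # ['20']
print([s.text for s in exact_value])     # ['20']
```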

In [32]:
from bs4 import BeautifulSoup
sp = BeautifulSoup(html_doc3,'lxml') 

data1=sp.find("a", class_= "sister1") 
print('\nsp.find("a", class_="sister1") \n', data1) 

data1=sp.find("a", id = "link2") 
print('\nsp.find("a", id = "link2") \n', data1) 


sp.find("a", class_="sister1") 
 <a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>

sp.find("a", id = "link2") 
 <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>


In [33]:
print(sp3.find_all(class_="sister1"))
print()
print(sp3.find(href = "http://example.com/Rebecca"))
print()
print(sp3.find(id = "link1"))


[<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>, <a class="sister1" href="http://example.com/tillie" id="link3">Tillie</a>]

<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>

<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>


In [34]:
data1 = sp3.find_all("a")
data2 = sp3.find_all("a", href = "http://example.com/Richard")

print("data1:\n", data1)
print()
print(type(data1))
print("\ndata2:\n", data2)

data1:
 [<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>, <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>, <a class="sister1" href="http://example.com/tillie" id="link3">Tillie</a>]

<class 'bs4.element.ResultSet'>

data2:
 [<a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>]


In [35]:
# the class_ keyword can be omitted: a string passed as the second argument is matched against the class attribute

print(sp3.find("a", class_ = "sister2"))
print()
print(sp3.find("a", "sister2"))

print()
print(sp3.find(class_ = "sister2"))

<a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

<a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

<a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>


### (4) find(attrs={"attribute": "value"})
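- use this approach for attributes whose names contain special symbols, such as data-*, which cannot be written as keyword arguments
- enclose the attribute name in quotation marks
- .find(attrs={"class": "sister1"}) 
    - .find(attrs={"class_": "sister1"}) is incorrect: inside a dictionary the attribute name is "class", without the trailing "_"

### (5) find(HTML_tag_name, {"attribute": "value"}) 
- "attrs=" can be omitted when the dictionary is passed as the second argument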
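
The dictionary form is mainly useful for attribute names that are not valid Python identifiers. A minimal sketch, assuming a made-up snippet with `data-id` attributes (`card_html`) and the built-in `"html.parser"`:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration
card_html = '<div data-id="42" class="card">First</div><div data-id="7" class="card">Second</div>'
card_soup = BeautifulSoup(card_html, "html.parser")

# card_soup.find(data-id="42") would be a SyntaxError,
# because "-" is not allowed in a keyword argument name
by_attrs = card_soup.find(attrs={"data-id": "42"})
by_position = card_soup.find("div", {"data-id": "7"})   # "attrs=" omitted

print(by_attrs.text)      # First
print(by_position.text)   # Second
```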


In [36]:
data1=sp3.find("a", href="http://example.com/Richard")   # keyword argument: the attribute name is not quoted
print(data1)

print()
data1=sp3.find("a", {"href":"http://example.com/Richard"})
print(data1)

print()
data1=sp3.find("a", attrs={"href":"http://example.com/Richard"})
print(data1)

print()
data1=sp3.find(attrs={"href":"http://example.com/Richard"})
print(data1)

print()
data1=sp3.find({"href":"http://example.com/Richard"}) # None: a dictionary passed as the first argument is used as the tag-name filter, not as attrs
print(data1)

print()
data1=sp3.find_all("a", {"class":"sister1"}) 
print(data1)

print()
data1=sp3.find_all("a", {"class_":"sister1"})   # there is no attribute "class_"
print(data1)

<a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

<a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

<a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

<a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

None

[<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>, <a class="sister1" href="http://example.com/tillie" id="link3">Tillie</a>]

[]


In [37]:
# find()

from bs4 import BeautifulSoup
sp = BeautifulSoup(html_doc3,'lxml') 

print("sp.title ->\n", sp.title)

print("\nsp.find('b') ->\n", sp.find('b')) # <b> My Document </b>

data1=sp.find("a", {"href":"http://example.com/Richard"})
print('\nsp.find("a", {"href":"http://example.com/Richard"}) ->\n', data1) 

data11=sp.find(id="link2") 
print('\nsp.find(id="link2")\n', data11) 

data12=sp.find(class_="sister2") 
print('\nsp.find(class_="sister2")\n', data12)

data2=sp.find("a", {"id":"link2"}) 
print('\nsp.find("a", {"id":"link2"}) \n', data2) 

data3=sp.find("a", {"class":"sister1"}) 
print('\nsp.find("a", {"class":"sister1"}) \n', data3) 

sp.title ->
 <title>Web Title</title>

sp.find('b') ->
 <b> My Document </b>

sp.find("a", {"href":"http://example.com/Richard"}) ->
 <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

sp.find(id="link2")
 <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

sp.find(class_="sister2")
 <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

sp.find("a", {"id":"link2"}) 
 <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

sp.find("a", {"class":"sister1"}) 
 <a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>


### Second-level searching

- find( ) can also be called on a tag object; the search is then limited to the inside of that tag

In [39]:
from bs4 import BeautifulSoup
sp3 = BeautifulSoup(html_doc3,'lxml') 

data1=sp3.find("a", {"href":"http://example.com/Richard"})
print(data1) 

print()
data2 = data1.find("img")
print(data2)

<a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

<img src="Richard.jpg"/>
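
When searches are chained, either level may return None. The following sketch wraps the two levels in a hypothetical helper, `image_src`, which is not part of BeautifulSoup; the snippet `nested_html` and the use of `"html.parser"` are also assumptions made for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration
nested_html = """
<p class="story">
  <a id="link1" href="/rebecca">Rebecca</a>
  <a id="link2" href="/richard">Richard <img src="Richard.jpg"></a>
</p>
"""
nested_soup = BeautifulSoup(nested_html, "html.parser")

def image_src(anchor_id):
    """Return the src of the <img> inside the <a> with the given id, or None."""
    anchor = nested_soup.find("a", id=anchor_id)   # first level: search the whole document
    if anchor is None:
        return None
    image = anchor.find("img")                     # second level: search inside the <a> only
    return image.get("src") if image is not None else None

print(image_src("link2"))   # Richard.jpg
print(image_src("link1"))   # None: this <a> contains no <img>
print(image_src("link9"))   # None: no <a> has this id
```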


### find_all( )

In [40]:
tag_a = sp3.find_all('a')
print(tag_a)
print()
print(tag_a[0])
print()
print(tag_a[0].attrs)

[<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>, <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>, <a class="sister1" href="http://example.com/tillie" id="link3">Tillie</a>]

<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>

{'href': 'http://example.com/Rebecca', 'class': ['sister1'], 'id': 'link1'}


In [41]:
tag_a = sp3.find_all('a')
print(tag_a)
print()

for i in tag_a:
    print(i)
    print()

[<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>, <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>, <a class="sister1" href="http://example.com/tillie" id="link3">Tillie</a>]

<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>

<a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

<a class="sister1" href="http://example.com/tillie" id="link3">Tillie</a>



In [42]:
import bs4

objSoup = bs4.BeautifulSoup(html_doc3, 'lxml')
objTag = objSoup.find_all(class_ = 'sister1')
print(objTag[0])
print(objTag[0].attrs)

<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>
{'href': 'http://example.com/Rebecca', 'class': ['sister1'], 'id': 'link1'}


### parameters in find_all( )

- .find_all(['h1', 'h2'])  # return all 'h1' and 'h2' tags (a list of names is matched with OR)
    
    
- find_all(limit = xx, recursive = False/True)
    - limit = xx
        - return at most xx tags
    - recursive = False/True
        - True (default): search all descendants of the tag
        - False: search only the direct children of the tag
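
The effect of recursive is easiest to see on a tag that has both direct and nested matches. A minimal sketch, assuming a made-up document (`tree_html`) and the built-in `"html.parser"`:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one <a> nested inside <p>, two <a> directly under <body>
tree_html = (
    "<html><body>"
    "<p>intro <a href='/in-p'>nested</a></p>"
    "<a href='/direct'>direct</a>"
    "<a href='/direct2'>direct2</a>"
    "</body></html>"
)
tree_soup = BeautifulSoup(tree_html, "html.parser")
body = tree_soup.body

every_link = body.find_all("a")                      # recursive=True by default: all descendants
direct_links = body.find_all("a", recursive=False)   # direct children of <body> only
first_two = body.find_all("a", limit=2)              # at most two results

print([a["href"] for a in every_link])     # ['/in-p', '/direct', '/direct2']
print([a["href"] for a in direct_links])   # ['/direct', '/direct2']
print([a["href"] for a in first_two])      # ['/in-p', '/direct']
```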

In [47]:
# find_all()

from bs4 import BeautifulSoup
sp3 = BeautifulSoup(html_doc3,'lxml')

data1 = sp3.find_all("a")
data2 = sp3.find_all("a", limit = 2)
data3 = sp3.find_all(["a", "img"])  # return every tag that is an a tag OR an img tag


print("data1:", len(data1), '\n', data1)

print("\ndata2:", len(data2), '\n', data2)

print("\ndata3:", len(data3), '\n', data3)


data1: 3 
 [<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>, <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>, <a class="sister1" href="http://example.com/tillie" id="link3">Tillie</a>]

data2: 2 
 [<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>, <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>]

data3: 5 
 [<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>, <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>, <img src="Richard.jpg"/>, <a class="sister1" href="http://example.com/tillie" id="link3">Tillie</a>, <img height="100" src="images/pig.png" width="20"/>]


In [44]:
sisters = sp3.find_all(class_ = ['sister1', 'sister2'])
print(sisters)


[<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>, <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>, <a class="sister1" href="http://example.com/tillie" id="link3">Tillie</a>]


#### tag.find_all(True) 

- return all descendant tags of a tag as a list (True matches every tag name; plain text is not included)

In [45]:
story = sp.find('p', class_='story')
print(story)
print()

print(story.find_all(True))

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>,
<a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a> 
and
<a class="sister1" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

[<a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>, <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>, <img src="Richard.jpg"/>, <a class="sister1" href="http://example.com/tillie" id="link3">Tillie</a>]
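
Since the result contains only tags, it can be used to list the tag names found inside a given tag. The sketch below assumes a shortened, made-up version of the story paragraph (`para_html`) and the built-in `"html.parser"`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration
para_html = (
    '<p class="story">Sisters: <a href="/r">Rebecca</a>, '
    '<a href="/ri">Richard <img src="Richard.jpg"></a></p>'
)
para_soup = BeautifulSoup(para_html, "html.parser")
story_tag = para_soup.find("p", class_="story")

# True matches every tag name, so all descendant tags are returned
descendant_names = [t.name for t in story_tag.find_all(True)]

# Combined with recursive=False, only the direct child tags are returned
child_names = [t.name for t in story_tag.find_all(True, recursive=False)]

print(descendant_names)   # ['a', 'a', 'img']
print(child_names)        # ['a', 'a']
```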


# 6. Get content from tag object

6.1 Content (string)  
6.2 Child tags  
6.3 Attributes

## 6.1 Get string from tag object


* name attribute: return the tag name
* text attribute: return all text inside the tag, including the text of child tags, with the HTML tags removed
* string attribute: return the text when the tag has a single child; if the tag has more than one child (for example, text plus a child tag), return None 
    * Differences between .text & .string
        * https://stackoverflow.com/questions/25327693/difference-between-string-and-text-beautifulsoup
* contents attribute: return the direct children of the tag (strings and child tags); the data type is list


* get_text() method: return the same text as the text attribute; it also accepts a separator and strip=True
* getText() method: an alias of get_text()

In [48]:
# Difference between .string and .text (1)

data1=sp3.find("a", {"href":"http://example.com/Rebecca"})
print('data1:\n', data1) 
print('data1.name:', data1.name)
print('\ndata1.string ->\n', data1.string)
print('\ndata1.text ->\n', data1.text)
print('\ndata1.contents ->\n', data1.contents)
print('\ndata1.get_text() ->\n', data1.get_text())
print('\ndata1.getText() ->\n', data1.getText())

data1:
 <a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>
data1.name: a

data1.string ->
 Rebecca

data1.text ->
 Rebecca

data1.contents ->
 ['Rebecca']

data1.get_text() ->
 Rebecca

data1.getText() ->
 Rebecca


In [49]:
from bs4 import BeautifulSoup

html_str = "<div id='msg'>Hello World! <p> Final Test <p></div>"
soup = BeautifulSoup(html_str, "lxml")
tag = soup.div
print("tag.string ->\n", tag.string)       
print("\ntag.text ->\n", tag.text) 
print(type(tag.text))
print('\ntag.contents ->\n', tag.contents)
print('\ntag.getText() ->\n', tag.getText())
print('\ntype(tag.getText()) ->\n', type(tag.getText()))
 
print("\ntag.get_text() ->\n", tag.get_text())   
print("\ntype(tag.get_text()) ->\n", type(tag.get_text())) 
print("\n", tag.get_text("-")) 
print("\n", tag.get_text("-", strip=True))

tag.string ->
 None

tag.text ->
 Hello World!  Final Test 
<class 'str'>

tag.contents ->
 ['Hello World! ', <p> Final Test </p>, <p></p>]

tag.getText() ->
 Hello World!  Final Test 

type(tag.getText()) ->
 <class 'str'>

tag.get_text() ->
 Hello World!  Final Test 

type(tag.get_text()) ->
 <class 'str'>

 Hello World! - Final Test 

 Hello World!-Final Test
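
The rule for .string depends on the number of children, not on whether a child tag exists. The following sketch, assuming three made-up paragraphs and the built-in `"html.parser"`, compares the three cases.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration
string_soup = BeautifulSoup(
    "<p id='one'>plain</p>"
    "<p id='two'><b>bold</b></p>"
    "<p id='three'>mixed <b>bold</b></p>",
    "html.parser",
)

one = string_soup.find(id="one")
two = string_soup.find(id="two")
three = string_soup.find(id="three")

print(one.string)     # plain: the only child is a text string
print(two.string)     # bold: the only child is a tag, so its .string is used
print(three.string)   # None: there are two children (text and <b>)

# get_text() still collects all the text; " " is the separator between pieces
print(three.get_text(" ", strip=True))   # mixed bold
```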


## 6.2 Get child tags from tag objects

- tag_object.child_tag_name: return the first tag with that name inside tag_object, the same as tag_object.find("child_tag_name")

In [None]:
html_doc4 = """
<html><head><title>Web page</title></head>
<body>
<p class="title"><b>content title</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/Rebecca" class="sister1" id="link1">Rebecca</a>,
<a href="http://example.com/Richard" class="sister2" id="link2">Richard <img src="Richard.jpg"> </a> 
and
<a href="http://example.com/tillie" class="sister1" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<img src = "images/pig.png" width = "20" height = "100">

<p class="story">...</p>
</body> </html>
"""

In [54]:
data2=sp3.find("a", {"class":"sister2"}) 
print('data2->\n', data2) 
print('\ndata2.img->\n', data2.img)

data2->
 <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

data2.img->
 <img src="Richard.jpg"/>
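
Dot access can be chained, and it returns None when the child tag does not exist. A minimal sketch, assuming a made-up document and the built-in `"html.parser"`:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration
child_soup = BeautifulSoup(
    '<html><body><p class="title"><b>content title</b></p><p>no bold here</p></body></html>',
    "html.parser",
)

first_p = child_soup.body.p              # chained dot access: first <p> inside <body>
bold_by_dot = first_p.b                  # first <b> inside that <p>
bold_by_find = first_p.find("b")         # the equivalent find() call

second_p = child_soup.find_all("p")[1]
no_bold = second_p.b                     # None: the second <p> has no <b>

print(bold_by_dot)
print(bold_by_dot is bold_by_find)
print(no_bold)
```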


## 6.3 Get attributes from tag objects

- Method 1: tag_object['attribute_name'] 
    - tag1['class']
    - tag1['id']
- Method 2: tag_object.get('attribute_name')
    - tag1.get('href')
    - tag1.get('src')  
- Method 3: tag_object.attrs['attribute_name']   
    - tag1.attrs['href']
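
The methods differ in how they treat a missing attribute. The sketch below assumes a single made-up link and the built-in `"html.parser"`; it shows that square brackets raise KeyError while get( ) returns None or a chosen default.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration
attr_soup = BeautifulSoup('<a href="/home" class="nav main" id="top">Home</a>', "html.parser")
link = attr_soup.a

print(link["href"])               # /home
print(link.get("title"))          # None: the attribute does not exist
print(link.get("title", "n/a"))   # n/a: a default value can be supplied
print(link["class"])              # ['nav', 'main']: class is returned as a list

try:
    link["title"]                 # square brackets raise KeyError when the attribute is missing
    has_title = True
except KeyError:
    has_title = False
print(has_title)                  # False
```

For optional attributes such as href on an a tag, get( ) avoids the need for a try/except block.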

In [50]:
from bs4 import BeautifulSoup
sp = BeautifulSoup(html_doc3,'lxml') 

data1=sp.find("a", {"href":"http://example.com/Rebecca"})
print('data1:\n', data1) 
print("\ndata1['href'] ->\n", data1['href'])
print('\ndata1.get("href") ->\n', data1.get('href'))

data3=sp.find("img") 
print('\ndata3\n', data3) 
print('\ndata3.get("src") ->\n', data3.get("src")) # Richard.jpg
print('\ndata3["src"] ->\n', data3["src"])

data2=sp.find("a", {"class":"sister2"}) 
print('\ndata2\n', data2) 
print('\ndata2.img\n', data2.img) 
print('\ndata2.img["src"] ->\n', data2.img["src"]) 


data1:
 <a class="sister1" href="http://example.com/Rebecca" id="link1">Rebecca</a>

data1['href'] ->
 http://example.com/Rebecca

data1.get("href") ->
 http://example.com/Rebecca

data3
 <img src="Richard.jpg"/>

data3.get("src") ->
 Richard.jpg

data3["src"] ->
 Richard.jpg

data2
 <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

data2.img
 <img src="Richard.jpg"/>

data2.img["src"] ->
 Richard.jpg


In [51]:
data2=sp3.find("a", {"class":"sister2"}) 
print('\ndata2\n', data2) 

print('\ndata2.attrs ->\n', data2.attrs) 

print("\ndata2.attrs['href'] ->\n", data2.attrs['href']) 
print("\ndata2.attrs['class'] ->\n", data2.attrs['class']) 


data2
 <a class="sister2" href="http://example.com/Richard" id="link2">Richard <img src="Richard.jpg"/> </a>

data2.attrs ->
 {'href': 'http://example.com/Richard', 'class': ['sister2'], 'id': 'link2'}

data2.attrs['href'] ->
 http://example.com/Richard

data2.attrs['class'] ->
 ['sister2']


### Get links & images in a website

In [52]:
# get all links in a website

from bs4 import BeautifulSoup
import requests

url = 'http://www.ndhu.edu.tw/'
   
html = requests.get(url).text
sp = BeautifulSoup(html, 'lxml')
all_links = sp.find_all('a')
   
for link in all_links:
   href = link.get('href')  
   if href is not None and href.startswith('http://'):
      print(href)

http://dp.ndhu.edu.tw/NDHU_welcome/
http://dp.ndhu.edu.tw/NDHU_welcome/
http://www.elearn.ndhu.edu.tw/moodle/
http://www.facebook.com/NDHUFacultyUnion
http://dp.ndhu.edu.tw/epaper/
http://secret.ndhu.edu.tw/p/412-1011-17517.php?Lang=zh-tw


In [53]:
# get all links in a website, including http:// and https://

from bs4 import BeautifulSoup
import requests

url = input('URL:')
   
html = requests.get(url).text
sp = BeautifulSoup(html, 'lxml')
all_links = sp.find_all('a')
   
for link in all_links:
   href = link.get('href')   
   if href is not None and href.startswith(('http://', 'https://')):
      print(href)

URL:


MissingSchema: Invalid URL '': No schema supplied. Perhaps you meant http://?

In [None]:
# get the addresses of all links and images
from bs4 import BeautifulSoup
import requests

url = 'http://www.e-happy.com.tw'
html = requests.get(url)
html.encoding="utf-8"

sp=BeautifulSoup(html.text, 'lxml')
links=sp.find_all(["a","img"]) 
for link in links:
    # <a> tags store the address in href, <img> tags store it in src
    address=link.get("href") or link.get("src")
    if address is not None and address.startswith("http://"): 
        print(address)

### Parse URL

- from urllib.parse import urlparse
    - urlparse(url).path
    
- Reference
    - https://docs.python.org/3/library/urllib.parse.html

In [None]:
from urllib.parse import urlparse

url = "https://www.google.com/search?q=w3schools+python&oq=&aqs=chrome.5.35i39i362l8.564582371j0j15&sourceid=chrome&ie=UTF-8"

print("scheme:", urlparse(url).scheme)
print("hostname:", urlparse(url).hostname)
print("port:", urlparse(url).port)   # None, because this URL has no explicit port

print("{}://{}".format(urlparse(url).scheme, urlparse(url).hostname))
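
Besides scheme, hostname, and port, the result of urlparse also exposes the path, the query string, and the fragment. The sketch below uses a made-up URL with an explicit port, and `urllib.parse.parse_qs` to turn the query string into a dictionary.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL used only for this illustration
sample_url = "https://example.com:8080/path/page.html?q=python&lang=en#top"
parts = urlparse(sample_url)

print("scheme:", parts.scheme)       # https
print("netloc:", parts.netloc)       # example.com:8080
print("hostname:", parts.hostname)   # example.com
print("port:", parts.port)           # 8080
print("path:", parts.path)           # /path/page.html
print("query:", parts.query)         # q=python&lang=en
print("fragment:", parts.fragment)   # top

# parse_qs returns a dictionary whose values are lists
query = parse_qs(parts.query)
print(query)                         # {'q': ['python'], 'lang': ['en']}
```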

In [None]:
# Input a URL and then list the addresses (directories and file names) of its images

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

url = input('input URL:')
html = requests.get(url).text
sp = BeautifulSoup(html, 'html.parser')
all_links = sp.find_all(['a','img'])

for link in all_links:
    src = link.get('src')
    href = link.get('href')
    targets = [src, href]
    for t in targets:
        if t is not None and ('.jpg' in t or '.png' in t):
            # urljoin keeps absolute URLs unchanged and resolves relative
            # paths (with or without a leading "/") against the page URL
            print(urljoin(url, t))
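
To separate the directory from the file name of an image address, one option is to parse the URL first and then split its path. The helper `split_image_url` below is hypothetical, and the addresses are made up for the illustration.

```python
import posixpath
from urllib.parse import urlparse

def split_image_url(image_url):
    """Return (directory, file_name) taken from the path part of a URL."""
    path = urlparse(image_url).path    # drops the scheme, host, and query string
    directory, file_name = posixpath.split(path)
    return directory, file_name

print(split_image_url("http://example.com/images/animals/pig.png?size=20"))
# ('/images/animals', 'pig.png')
print(split_image_url("http://example.com/Richard.jpg"))
# ('/', 'Richard.jpg')
```

`posixpath` is used instead of `os.path` so that "/" is treated as the separator on every operating system.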

## Practice

### Uniform-Invoice Prize Winning Numbers
- https://www.etax.nat.gov.tw/etwmain/en/etw183w 

In [None]:
import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.etax.nat.gov.tw/etwmain/en/etw183w/etw183w2?id=17fabbca9d300000f4f8e02aeb659cd8',
    'https://www.etax.nat.gov.tw/etwmain/en/etw183w/etw183w2?id=17e8fcc31cd00000d1e747a6a3c16bfa',
]

for url in urls:
    html = requests.get(url)
    sp = BeautifulSoup(html.text, 'html.parser')
    # a space-separated class_ value matches tags whose class attribute is exactly "col-12 mb-3"
    num = sp.find_all(class_="col-12 mb-3")
    for i in num:
        print(i.text)
    print()