<h1 style="color:blue;text-align:center">  Selenium </h1>
<h3 style="color:blue;text-align:center"> Chih-Hung Lai   </h3>
<h4 style="color:blue;text-align:center"> Created: 2019.03  &emsp; Last modified: 2022.04.28 </h4>

## Content:

1. Webbrowser — Convenient web-browser controller  
 1.1 Webbrowser module  
 1.2 Common search techniques for google  
2. Selenium package    
 2.1 Introduction  
 2.2 Common attributes and methods
 2.3 Search elements 
 2.4 Get content of HTML elements   
3. XPath  
 3.1 Introduction to XPath  
 3.2 Get XPath using Developer Tools  
4. Interaction with HTML elements  
 4.1 click a hyperlink  
 4.2 Enter text   
 4.3 Scroll down  
 4.4 run selenium in background  
5. Examples

<a name='browser' /> 

# 1. Webbrowser — Convenient web-browser controller


## 1.1 Webbrowser module

### Function
- The webbrowser module provides a high-level interface to allow displaying web-based documents to users.


- https://docs.python.org/3/library/webbrowser.html#browser-controller-objects

* import webbrowser module
 * import webbrowser as wb    
- open webpage  
    - wb.open(url, new=0, autoraise=True)
        - Display url using the default browser. 
        - If new is 0:
            - the url is opened in the same browser window, if possible
        - If new is 1:
            - a new browser window is opened, if possible
        - If new is 2:
            - a new browser page (“tab”) is opened, if possible
        - If autoraise is True, the window is raised if possible
     
    - wb.open_new(url)
        - Open url in a new window of the default browser
     
    - wb.open_new_tab(url)
        - Open url in a new page (“tab”) of the default browser

- https://docs.python.org/3/library/webbrowser.html

In [1]:
# open a webpage

import webbrowser as wb
url = 'https://www.ndhu.edu.tw/'
wb.open(url, new = 0, autoraise=False)

True

In [2]:
# open 2 urls

import time
import webbrowser as wb
url1 = 'https://www.ndhu.edu.tw/'
url2 = 'http://www.taipeitimes.com/'
wb.open(url1)
time.sleep(3) 
wb.open_new(url2)

True

## 1.2 Common search techniques for google

### Parameters for Google search

* Specify what to search
    * Append "search?q=xxx" to the Google URL, where xxx is the search term
    * https://www.google.com.tw/search?q=resilience 

* Exclude words from your search
    * Put - in front of a word you want to leave out. For example, jaguar speed -car
    * https://www.google.com.tw/search?q=xxx+-exclude_word 
    * https://www.google.com.tw/search?q=COVID%2019+-Taiwan

* Search for information on a specific site
    * Put "site:" in front of a site or domain. 
    * https://www.google.com.tw/search?q=xxx+site:yyy
    * https://www.google.com.tw/search?q=COVID2019+site:youtube.com
 
* Search different media types
    * https://www.google.com.tw/search?q=xxx+&tbm=yyy
    * google news： &tbm=nws
    * google image: &tbm=isch
    * google video： &tbm=vid
    * https://www.google.com.tw/search?q=lottery+&tbm=isch
 
 
- Reference
    - https://support.google.com/websearch/answer/2466433?hl=en
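
The search parameters above can be combined programmatically. The following is a minimal sketch, assuming a hypothetical helper named `google_search_url` (it is not part of the webbrowser module); it uses `urllib.parse.urlencode` so that spaces and special characters in the query are escaped safely.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH = "https://www.google.com.tw/search"

def google_search_url(query, exclude=(), site=None, media=None):
    """Build a Google search URL (hypothetical helper for illustration)."""
    terms = [query]
    terms += ["-" + word for word in exclude]   # a leading minus excludes a word
    if site:
        terms.append("site:" + site)            # restrict results to one site
    params = {"q": " ".join(terms)}
    if media:
        params["tbm"] = media                   # nws, isch, or vid
    return GOOGLE_SEARCH + "?" + urlencode(params)

print(google_search_url("COVID 19", exclude=["Taiwan"]))
print(google_search_url("car", site="youtube.com"))
print(google_search_url("lottery", media="isch"))
```

The returned string can be passed directly to `wb.open(...)`. Note that `urlencode` percent-encodes the colon in `site:`, which Google decodes on its side.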

In [3]:
# google search

import webbrowser as wb
search = input( )
url = 'https://www.google.com.tw/search?q=' + search
wb.open(url)


taiwan


True

In [4]:
# Search for information on a specific site

import webbrowser as wb
search1 = input("website:")
search2 = input("searched content:")
url = "https://www.google.com.tw/search?q=" + 'site:' + search1 + '+' + search2
wb.open(url)

website:youtube.com
searched content:car


True

In [7]:
# search image

import webbrowser as wb
search = input("searched content:")
url ="https://www.google.com.tw/search?q=" + search + "&tbm=isch"
wb.open(url, new=2)

searched content:das


True

In [6]:
# browse page n

import time
import webbrowser as wb
for i in range(2, 10):
    url = 'https://24h.pchome.com.tw/cutprice/#!region=&p=' + str(i)
    wb.open(url, new = 0)
    time.sleep(3)


<a name='introduction' />  

#  2. Selenium package

## 2.1 Introduction

- The selenium package is used to automate web browser interaction from Python
- Simulate human browsing behavior
    - click button or hyperlink
    - enter text (including account & password)
        - https://sys.ndhu.edu.tw/aa/class/TeacherSubj/Default.aspx
    - scroll down
        - http://tw.running.biji.co/index.php?q=album&act=photo_list&album_id=30668&cid=5791&type=album&subtitle=第3屆埔里跑 Puli Power 山城派對馬拉松-向善橋(約34K)

### Documentation
- https://selenium-python.readthedocs.io/
- https://pypi.org/project/selenium/

### Install package

1. Install the Selenium package  
    - pip install selenium
 
2. Download a driver (Chrome or Firefox)
    - Selenium requires a driver to interface with the chosen browser.
    - https://chromedriver.chromium.org/downloads
        - The driver version must match the version of your installed Chrome

    - Download it and make sure it is on your PATH, or place it in the same directory as your Python programs and pass its location explicitly
    - driverPath = 'chromedriver_win32/chromedriver.exe'
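
Before starting a browser, it can help to check that the driver is actually reachable. The sketch below is one possible way to do that with the standard library only; `locate_driver` and its `fallback_dir` parameter are hypothetical names introduced for illustration.

```python
import shutil
from pathlib import Path

def locate_driver(name="chromedriver", fallback_dir="."):
    """Return the driver's path if it is on PATH or in fallback_dir, else None."""
    found = shutil.which(name)                # search the directories listed in PATH
    if found:
        return found
    for candidate in (name, name + ".exe"):   # Windows builds end in .exe
        path = Path(fallback_dir) / candidate
        if path.is_file():
            return str(path)
    return None

print(locate_driver())
```

If the function returns None, the driver has probably not been downloaded yet or sits in a directory that is neither on PATH nor the fallback directory.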

In [8]:
# install selenium package
!pip install selenium

Collecting selenium
  Downloading selenium-4.1.3-py3-none-any.whl (968 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting trio~=0.17
  Downloading trio-0.20.0-py3-none-any.whl (359 kB)
Collecting outcome
  Downloading outcome-1.1.0-py2.py3-none-any.whl (9.7 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.1.0-py3-none-any.whl (24 kB)
Collecting h11<1,>=0.9.0
  Downloading h11-0.13.0-py3-none-any.whl (58 kB)
Installing collected packages: outcome, h11, wsproto, trio, trio-websocket, selenium
Successfully installed h11-0.13.0 outcome-1.1.0 selenium-4.1.3 trio-0.20.0 trio-websocket-0.9.2 wsproto-1.1.0


In [2]:
# use webdriver to open a browser

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')

browser = webdriver.Chrome(service=s)    
browser.get('http://www.google.com')         # open a browser

print(type(browser))  

<class 'selenium.webdriver.chrome.webdriver.WebDriver'>


In [16]:
# Firefox:

# from selenium.webdriver.firefox.service import Service as FirefoxService
# browser = webdriver.Firefox(service=FirefoxService(r'D:\geckodriver\geckodriver.exe'))

In [19]:
# open a browser and close it in 5 seconds

from selenium import webdriver
import time
from selenium.webdriver.chrome.service import Service

s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')

browser = webdriver.Chrome(service=s)  
browser.get('https://www.ndhu.edu.tw')  

time.sleep(5) 
browser.quit()

## 2.2 Common attributes and methods

### Common methods

- get(URL):Loads a web page in the current browser session



- close()：Closes the current window

- quit()：Quits the driver and closes every associated window.


- maximize_window()：Maximizes the current window that webdriver is using

- fullscreen_window()：Invokes the window manager-specific ‘full screen’ operation


- back()：Goes one step backward in the browser history

- forward()：Goes one step forward in the browser history


- refresh()：Refreshes the current page

- set_page_load_timeout(time_to_wait)
    - Set the amount of time to wait for a page load to complete before throwing an error.

- set_window_size(width, height, windowHandle='current')
    - Sets the width and height of the current window


### Common attributes

- name: Name of the underlying browser for this instance
- **page_source**: Source code of the current page
- title: Title of the current page
- current_url: URL of the currently loaded page
- session_id: String ID of the browser session started and controlled by this WebDriver
- capabilities: Dictionary of the effective capabilities of this browser session
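
Because `quit()` should run even when a script fails halfway, one option is to wrap a session in a context manager. This is only a sketch: `browsing` is a hypothetical helper, and it relies solely on the `get()`, `quit()`, and `set_page_load_timeout()` methods listed above.

```python
from contextlib import contextmanager

@contextmanager
def browsing(driver, url, page_load_timeout=30):
    """Open url with an already-created driver and always quit afterwards."""
    driver.set_page_load_timeout(page_load_timeout)
    try:
        driver.get(url)
        yield driver
    finally:
        driver.quit()   # closes every associated window, even after an error

# Usage with a real browser (requires Chrome and a matching driver):
# from selenium import webdriver
# with browsing(webdriver.Chrome(), 'https://www.ndhu.edu.tw/') as browser:
#     print(browser.name, browser.current_url)
```

The driver is passed in rather than created inside the helper, so the same pattern should work for Chrome or Firefox.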

In [20]:
# common attributes

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')

browser = webdriver.Chrome(service=s)    
url = 'https://www.ndhu.edu.tw/'
browser.get(url)               
print('name = ', browser.name)             
print('current_url    = ', browser.current_url)     
print('session_id = ', browser.session_id)       
print('browser.capabilities = \n',browser.capabilities)    

print()
print(browser.page_source)   


name =  chrome
current_url    =  https://www.ndhu.edu.tw/
session_id =  409a67d0521a7e6a9c9cd35f4099d890
browser.capabilities = 
 {'acceptInsecureCerts': False, 'browserName': 'chrome', 'browserVersion': '101.0.4951.41', 'chrome': {'chromedriverVersion': '101.0.4951.41 (93c720db8323b3ec10d056025ab95c23a31997c9-refs/branch-heads/4951@{#904})', 'userDataDir': 'C:\\Users\\User\\AppData\\Local\\Temp\\scoped_dir36868_328475317'}, 'goog:chromeOptions': {'debuggerAddress': 'localhost:5304'}, 'networkConnectionEnabled': False, 'pageLoadStrategy': 'normal', 'platformName': 'windows', 'proxy': {}, 'setWindowRect': True, 'strictFileInteractability': False, 'timeouts': {'implicit': 0, 'pageLoad': 300000, 'script': 30000}, 'unhandledPromptBehavior': 'dismiss and notify', 'webauthn:extension:credBlob': True, 'webauthn:extension:largeBlob': True, 'webauthn:virtualAuthenticators': True}

<html lang="zh-tw"><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="X-U

In [22]:
# open webpages in sequence

from selenium import webdriver
from time import sleep
from selenium.webdriver.chrome.service import Service

s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')
browser = webdriver.Chrome(service=s)

urls = ['https://www.google.com/',
        'http://www.taipeitimes.com/',
        'https://news.yahoo.com/us/']
  
browser.maximize_window()

for url in urls:
    browser.get(url) 
    sleep(3)
browser.back()
sleep(3)
browser.back()
sleep(3)
browser.close()

In [None]:
# refresh a webpage
from selenium import webdriver
from time import sleep
from selenium.webdriver.chrome.service import Service

url = 'https://invest.cnyes.com/twstock/TWS/2330'

s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')
browser = webdriver.Chrome(service=s)    
browser.maximize_window()
    
browser.get(url)

for _ in range(3):    # refresh three times; avoid reusing the name `url` as the loop variable
    browser.refresh()
    sleep(3)

## 2.3 Search elements

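- Hint: every locator has two forms: find_element returns the first match, and find_elements returns a list of all matches
- Note: the find_element_by_* and find_elements_by_* helpers listed below are deprecated in Selenium 4 and were removed in Selenium 4.3; use find_element(By.XXX, value) and find_elements(By.XXX, value) instead, as the code cells in this section do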


* find_element_by_class_name()：Finds the first element by class name
* find_elements_by_class_name()：Finds elements by class name, return a list of WebElement


* find_element_by_id()
* find_elements_by_id() 

* find_element_by_tag_name(): Finds element within this element’s children by tag name
    - .find_element_by_tag_name('h1')
* find_element_by_name(): Finds element within this element’s children by name


* find_element_by_link_text(): Finds element within this element’s children by visible link text
* find_element_by_partial_link_text() 
    - Finds element within this element’s children by partially visible link text
    
    
* find_element_by_css_selector()  
    * find_element_by_css_selector('.class_name')
    * find_element_by_css_selector('#id_name')
    
    
* find_element_by_xpath()


### further information
- https://selenium-python.readthedocs.io/locating-elements.html
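
To relate the legacy method names above to the `By`-based form, the sketch below maps each name to its locator strategy. `LEGACY_TO_BY` and `to_locator` are hypothetical names for illustration; the strategy strings are written out as plain text so the snippet runs without Selenium, and they are intended to equal the constants on `selenium.webdriver.common.by.By` (for example `By.TAG_NAME == "tag name"`).

```python
# Locator strategy strings, matching the constants defined on By
LEGACY_TO_BY = {
    "class_name": "class name",
    "id": "id",
    "tag_name": "tag name",
    "name": "name",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "css_selector": "css selector",
    "xpath": "xpath",
}

def to_locator(legacy_method, value):
    """Translate e.g. ('find_element_by_id', 'author') into ('id', 'author')."""
    for prefix in ("find_elements_by_", "find_element_by_"):
        if legacy_method.startswith(prefix):
            return (LEGACY_TO_BY[legacy_method[len(prefix):]], value)
    raise ValueError("not a legacy finder: " + legacy_method)

print(to_locator("find_element_by_id", "author"))
print(to_locator("find_elements_by_tag_name", "img"))
```

The resulting pair can be unpacked into a call such as `browser.find_element(*to_locator('find_element_by_id', 'author'))`.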

In [8]:
# find_element_by_tag_name

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')
browser = webdriver.Chrome(service=s)    

url = 'https://news.yahoo.com/us/'
browser.get(url)                

# tag = browser.find_element_by_tag_name('a')
tag = browser.find_element(by=By.TAG_NAME, value='a')
print(tag.text)        

HOME


In [26]:
html_name='''
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="log" type="submit" value="Login" />
   <input name="cls" type="button" value="Clear" />
  </form>
</body>
</html>
'''

In [27]:
# Illustrative only: assumes `browser` has already loaded the login form above
username = browser.find_element(by=By.NAME, value='username')
password = browser.find_element(by=By.NAME, value='password')

In [9]:
# Kuo-Kuang Bus ticket booking page
# locate an element by its name attribute (By.NAME)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service


s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')
browser = webdriver.Chrome(service=s)    

url = 'https://order.kingbus.com.tw/ORD/ORD_M_1510_OrderGo.aspx'
browser.get(url)                

# tag = browser.find_element_by_name('ctl00$ContentPlaceHolder1$txtCustomer_ID')
tag = browser.find_element(by=By.NAME, value='ctl00$ContentPlaceHolder1$txtCustomer_ID')
print(tag)
print()

print(tag.get_attribute('outerHTML')) 

<selenium.webdriver.remote.webelement.WebElement (session="71061b60c803ccf6b95a92845d346d7f", element="f2775dc7-c174-4485-9eb4-dbf119dfb5ea")>

<input name="ctl00$ContentPlaceHolder1$txtCustomer_ID" type="text" maxlength="20" id="ctl00_ContentPlaceHolder1_txtCustomer_ID" onkeypress="return CheckKeyAZ09()" onkeyup="value=value.replace(/[\W]/g,'')">


In [None]:
<html>
 <body>
  <p>Are you sure you want to do this?</p>
  <a href="continue.html">Continue</a>
  <a href="cancel.html">Cancel</a>
</body>
</html>

In [None]:
# Illustrative only: assumes `browser` has already loaded the page above
continue_link = browser.find_element(by=By.LINK_TEXT, value='Continue')
continue_link = browser.find_element(by=By.PARTIAL_LINK_TEXT, value='Conti')

In [12]:
# find_element_by_partial_link_text

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')
browser = webdriver.Chrome(service=s)        

url = 'https://news.yahoo.com/us/'
browser.get(url)                

# tag = browser.find_element_by_partial_link_text('COVID')
tag = browser.find_element(by=By.PARTIAL_LINK_TEXT, value='COVID')
print(tag.get_attribute('outerHTML'))   

<a class="_yb_hv7bw  rapid-noclick-resp" href="https://news.yahoo.com/coronavirus/" data-ylk="cpos:4;slk:COVID-19;elm:navcat;sec:ybar;subsec:navrail;pkgt:mid;itc:0;" id="root_4" data-rapid_p="29" data-v9y="1"> COVID-19   <div class="_yb_1mwsc">COVID-19</div></a>


## 2.4 Get content of HTML elements

* tag_name
* text
* location：coordinates of the element, a dictionary with keys x and y
* clear()：clear text
* get_attribute(name)：get value of the attribute
* is_displayed()：Whether the element is visible, True or False
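* is_enabled()：Whether the element is enabled, True or False
* is_selected()：Whether the element (e.g., a checkbox or radio button) is selected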
    - https://tip.railway.gov.tw/tra-tip-web/tip?lang=EN_US

### textContent, innerHTML, & outerHTML

* .get_attribute('textContent')   
    - a String inside the tag
    
    
* .get_attribute('innerHTML')    
    - only obtain the HTML representation of the contents of an element
    
    
* .get_attribute('outerHTML')     
    - gets the serialized HTML fragment describing the element including its descendants
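
The difference between the three attributes can be reproduced offline. The sketch below is an analogy built on the standard library's `xml.etree.ElementTree`, not on Selenium; the helper names are hypothetical, and a real browser may serialize some details differently.

```python
import xml.etree.ElementTree as ET

def text_content(el):
    """All text inside the element, with tags removed (like textContent)."""
    return "".join(el.itertext())

def inner_html(el):
    """Markup of the element's contents only (like innerHTML)."""
    parts = [el.text or ""]
    for child in el:
        # tostring() also emits the text that follows the child (its tail)
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

def outer_html(el):
    """Markup of the element itself plus its contents (like outerHTML)."""
    return ET.tostring(el, encoding="unicode")

p = ET.fromstring("<p>2015 <strong>Hung</strong> went to Antarctica</p>")
print(text_content(p))
print(inner_html(p))
print(outer_html(p))
```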


In [29]:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')
browser = webdriver.Chrome(service=s)
url = 'http://aaa.24ht.com.tw'
browser.get(url)

print("Title = ", browser.title)

# first element whose id is 'author'
tag2 = browser.find_element(by=By.ID, value='author')
print("\ntag2.tag_name = %s, tag2.text = %s " % (tag2.tag_name, tag2.text))
print("get_attribute('textContent'):", tag2.get_attribute('textContent'))
print("get_attribute('innerHTML'):", tag2.get_attribute('innerHTML'))
print("get_attribute('outerHTML'):", tag2.get_attribute('outerHTML'))

print()
# tag3 = browser.find_elements(by=By.ID, value='content')
# for t3 in tag3:
#     print("t3.tag_name = %s, t3.text %s" % (t3.tag_name, t3.text))
#     print("get_attribute('textContent'):", t3.get_attribute('textContent'))
#     print("get_attribute('innerHTML'):", t3.get_attribute('innerHTML'))
#     print("get_attribute('outerHTML'):", t3.get_attribute('outerHTML'))
#     print()

print()
# first <p> element on the page
tag4 = browser.find_element(by=By.TAG_NAME, value='p')
print("t4.tag_name = %s, t4.text= %s" % (tag4.tag_name, tag4.text))
print("get_attribute('textContent'):", tag4.get_attribute('textContent'))
print("get_attribute('innerHTML'):", tag4.get_attribute('innerHTML'))
print("get_attribute('outerHTML'):", tag4.get_attribute('outerHTML'))

print()
# every <img> element on the page
tag5 = browser.find_elements(by=By.TAG_NAME, value='img')
for t5 in tag5:
    print("tag name = %s, content = %s " % (t5.tag_name, t5.get_attribute('src')))




Title =  洪錦魁著作

tag2.tag_name = h1, tag2.text = 洪錦魁 
get_attribute('textContent'): 洪錦魁
get_attribute('innerHTML'): 洪錦魁
get_attribute('outerHTML'): <h1 id="author">洪錦魁</h1>


t4.tag_name = p, t4.text= 2015/2016年洪錦魁一個人到南極
get_attribute('textContent'): 2015/2016年洪錦魁一個人到南極
get_attribute('innerHTML'): 2015/2016年<strong>洪錦魁</strong>一個人到南極
get_attribute('outerHTML'): <p>2015/2016年<strong>洪錦魁</strong>一個人到南極</p>

tag name = img, content = http://104.155.193.235/temp/hung.jpg 
tag name = img, content = http://104.155.193.235/temp/travel.jpg 
tag name = img, content = http://104.155.193.235/temp/html5.jpg 
tag name = img, content = http://104.155.193.235/bitnami/images/close.png 
tag name = img, content = http://104.155.193.235/bitnami/images/corner-logo.png 




In [31]:
#  try, except

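from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')
browser = webdriver.Chrome(service=s)
url = 'http://aaa.24ht.com.tw'
browser.get(url)

try:
    tag = browser.find_element(by=By.ID, value='main')
    print(tag.tag_name)
except NoSuchElementException:    # raised when no element matches the locator
    print("Not found")

browser.quit()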



Not found


# 3. XPath

## 3.1 Introduction to XPath

### What is XPath?
- A language used for locating nodes in an XML document, including HTML.

### When to use XPath

- You don’t have a suitable id or name attribute for the element you wish to locate. 


### Categories
- Absolute path
    - The simplest form of XPath in Selenium. 
    - Starts with a single slash ‘/’ and provides the absolute path of an element in the entire DOM.
    - /html/body/form
- Relative path
    - The XPath expression can start from anywhere in the DOM structure. 
    - Starts with a double slash ‘//’, which selects matching nodes wherever they appear in the document.
    - //form


- https://selenium-python.readthedocs.io/locating-elements.html 
- https://www.lambdatest.com/blog/complete-guide-for-using-xpath-in-selenium-with-examples/#XPath
- https://www.w3schools.com/xml/xpath_intro.asp

### Selecting Nodes
- nodename
    - Selects all nodes with the name "nodename"
- \/	
    - Selects from the root node
- \/\/	
    - Selects nodes in the document from the current node that match the selection no matter where they are
- \.	
    - Selects the current node
- \..	
    - Selects the parent of the current node
- \@	
    - Selects attributes
    
    
- https://www.w3schools.com/xml/xpath_syntax.asp
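
These path expressions can be tried without a browser. The sketch below uses Python's built-in `xml.etree.ElementTree`, which supports only a subset of XPath and evaluates paths relative to the root element (so absolute paths beginning with a single `/` are written as `./...`). The sample page is the login form used in this section.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="log" type="submit" value="Login" />
   <input name="cls" type="button" value="Clear" />
  </form>
 </body>
</html>
"""

root = ET.fromstring(PAGE)                            # root is the <html> element

form = root.find("./body/form")                       # step by step from the current node
inputs = root.findall(".//input")                     # // : any depth below the current node
password = root.find(".//input[@name='password']")    # @  : attribute predicate
parent = root.find(".//input[@name='log']/..")        # .. : parent of the matched node

print(form.get("id"))
print([i.get("name") for i in inputs])
print(password.get("type"))
print(parent.tag)
```

Selenium hands the XPath string to the browser, which supports full XPath 1.0, so expressions such as `/html/body/form` or `contains(...)` work there even though ElementTree rejects them.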

In [33]:
html_name='''
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="log" type="submit" value="Login" />
   <input name="cls" type="button" value="Clear" />
  </form>
</body>
</html>
'''

In [35]:
# find_element_by_xpath

from selenium import webdriver

driverPath = 'C:\SDK\chromedriver_win32\chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)    # 
url = 'file:///C:/Users/Laich/Google%20%E9%9B%B2%E7%AB%AF%E7%A1%AC%E7%A2%9F/course/course110_2/Data_science/slide_English/week11_selenium/input.html'
# the above path is copied from the URL in the browser 
browser.get(url)  

print(browser.find_element_by_xpath("/html/body/form").get_attribute('outerHTML'))
print()
print(browser.find_element_by_xpath("//body/form").get_attribute('outerHTML'))
print()
print(browser.find_element_by_xpath("//form").get_attribute('outerHTML'))



NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"/html/body/form"}
  (Session info: chrome=101.0.4951.41)


In [7]:
# find_element_by_xpath

from selenium import webdriver

driverPath = 'C:\SDK\chromedriver_win32\chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)    # 
url = 'file:///C:/Users/Laich/Google%20%E9%9B%B2%E7%AB%AF%E7%A1%AC%E7%A2%9F/course/course110_2/Data_science/slide_English/week11_selenium/input.html'
# the above path is copied from the URL in the browser 
browser.get(url)  

print(browser.find_element_by_xpath("/html/body/form").get_attribute('outerHTML'))
print()
print(browser.find_element_by_xpath("//form").get_attribute('outerHTML'))
print()
print(browser.find_element_by_xpath("//form[@id='loginForm']").get_attribute('outerHTML'))

<form id="loginForm">
   <input name="username" type="text">
   <input name="password" type="password">
   <input name="log" type="submit" value="Login">
   <input name="cls" type="button" value="Clear">
  </form>

<form id="loginForm">
   <input name="username" type="text">
   <input name="password" type="password">
   <input name="log" type="submit" value="Login">
   <input name="cls" type="button" value="Clear">
  </form>

<form id="loginForm">
   <input name="username" type="text">
   <input name="password" type="password">
   <input name="log" type="submit" value="Login">
   <input name="cls" type="button" value="Clear">
  </form>


In [17]:
# browser.find_element_by_xpath vs. browser.find_elements_by_xpath

from selenium import webdriver

driverPath = 'chromedriver_win32/chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)    # 
url = 'file:///C:/Users/Laich/Google%20%E9%9B%B2%E7%AB%AF%E7%A1%AC%E7%A2%9F/course/course110_2/Data_science/slide_English/week11_selenium/input.html'
# the above path is copied from the URL in the browser 
browser.get(url)  

e1 = browser.find_element_by_xpath("//form/input")
print(type(e1))

e2 = browser.find_elements_by_xpath("//form/input")
print(type(e2))
print()
print(e1.get_attribute('outerHTML'))
print()
for i in range(len(e2)):
   print(e2[i].get_attribute('outerHTML'))

<class 'selenium.webdriver.remote.webelement.WebElement'>
<class 'list'>

<input name="username" type="text">

<input name="username" type="text">
<input name="password" type="password">
<input name="log" type="submit" value="Login">
<input name="cls" type="button" value="Clear">


### Common syntax

- //tagname[@attribute='value']
    - //input[@name='phone']
    
    - //a[@class='SignInBtn'] 
    
- //tagname[contains(@attribute, 'value')]
    - selects elements whose attribute contains the given substring


- Reference
    - https://www.w3schools.com/xml/xpath_syntax.asp
    - https://www.lambdatest.com/blog/complete-guide-for-using-xpath-in-selenium-with-examples/#XPath
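
The attribute and `contains()` patterns are easy to mistype, so one option is to build the XPath strings with small helpers. This is a sketch only: the three function names are hypothetical, and values containing a single quote are not handled.

```python
def xpath_attr(tag, attribute, value):
    """//tag[@attribute='value'] : exact attribute match."""
    return "//{}[@{}='{}']".format(tag, attribute, value)

def xpath_attr_contains(tag, attribute, fragment):
    """//tag[contains(@attribute,'fragment')] : substring match on an attribute."""
    return "//{}[contains(@{},'{}')]".format(tag, attribute, fragment)

def xpath_text_contains(tag, fragment):
    """//tag[contains(text(),'fragment')] : substring match on the element text."""
    return "//{}[contains(text(),'{}')]".format(tag, fragment)

print(xpath_attr("input", "name", "phone"))
print(xpath_attr_contains("a", "class", "SignIn"))
print(xpath_text_contains("a", "Continue"))
```

Each returned string can be used as the value of an XPath locator, for example `browser.find_element(By.XPATH, xpath_attr('input', 'name', 'phone'))`.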

### Wildcards

- \*	Matches any element node
    - /bookstore/*	
        - Selects all the child element nodes of the bookstore element
    - //* 
        - Selects all elements in the document
- @*	Matches any attribute node
    - //title[@*]	
        - Selects all title elements which have at least one attribute of any kind
- node()	Matches any node of any kind
    

In [None]:
http://aaa.24ht.com.tw

<!doctype html>
<html>
<head>
   <meta charset="utf-8">
   <title>洪錦魁著作</title>
   <style>
      h1#author { width:400px; height:50px; text-align:center;
	     background:linear-gradient(to right,yellow,green);
      }
	  h1#content { width:400px; height:50px;
		 background:linear-gradient(to right,yellow,red); 
      }
      section { background:linear-gradient(to right bottom,yellow,gray); }
   </style>
</head>
<body>
<h1 id="author">洪錦魁</h1>
<img src="hung.jpg" width="100">
<section>
   <h1 id="content">一個人的極境旅行 - 南極大陸北極海</h1>
   <p>2015/2016年<strong>洪錦魁</strong>一個人到南極</p>
   <img src="travel.jpg" width="300"
</section>
<section>
   <h1 id="content">HTML5+CSS3王者歸來</h1>
   <p>本書講解網頁設計使用HTML5+CSS3</p>
   <img src="html5.jpg" width="300">
</section>
</body>
</html>

In [39]:
# retrieve HTML tags from relative xpath

from selenium import webdriver

driverPath = 'C:\SDK\chromedriver_win32\chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)    
url = 'http://aaa.24ht.com.tw'


browser.get(url)               
# print(browser.)
n1 = browser.find_element_by_xpath('//h1')
print(n1.text)
n2 = browser.find_element_by_xpath('//body/section/h1')
print(n2.text)
n3 = browser.find_element_by_xpath('//section/h1')
print(n3.text)
n4 = browser.find_element_by_xpath('//body/*/h1')
print(n4.text)      

n5 = browser.find_element_by_xpath('/html/body/section/h1')
print(n5.text)      



洪錦魁
一個人的極境旅行 - 南極大陸北極海
一個人的極境旅行 - 南極大陸北極海
一個人的極境旅行 - 南極大陸北極海
一個人的極境旅行 - 南極大陸北極海




In [40]:
from selenium import webdriver

driverPath = 'C:\SDK\chromedriver_win32\chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)    
url = 'https://selenium-python.readthedocs.io/'

browser.get(url)                

n1 = browser.find_element_by_xpath('//p')
print(n1.text)





Note




In [42]:
# indexes of list

from selenium import webdriver

driverPath = 'C:\SDK\chromedriver_win32\chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)    
url = 'http://aaa.24ht.com.tw'

browser.get(url)              

n1 = browser.find_elements_by_xpath("//section/h1")
print(n1)
print()
print(n1[0].text)
print()
print(n1[1].text)

# n2 = browser.find_element_by_xpath("//section/p[2]")
# print(n2.text)




[<selenium.webdriver.remote.webelement.WebElement (session="00fb7ae767a58e869324e005abb88e6d", element="e98c4c9b-eb86-4408-a250-0c12f1cbda7d")>, <selenium.webdriver.remote.webelement.WebElement (session="00fb7ae767a58e869324e005abb88e6d", element="ea8f5962-bc75-4d7e-926a-926169b8b3a4")>]

一個人的極境旅行 - 南極大陸北極海

HTML5+CSS3王者歸來


### Get a tag from an attribute

* //HTML_tag[@attribute='attribute_value']
* n1 = browser.find_element(By.XPATH, "//section/p[@class='year']")

In [None]:
# open a local HTML file

from selenium import webdriver

driverPath = 'chromedriver_win32/chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)    
url = 'file:///C:/Users/User/Google%20%E9%9B%B2%E7%AB%AF%E7%A1%AC%E7%A2%9F/python_laich/crawl/html/h7_1.html'

browser.get(url)                

n1 = browser.find_element_by_xpath("//section/p[@class='year']")
print(n1.text)
n1 = browser.find_element_by_xpath("//section/p[@class='price']")
print(n1.text)


### Get attribute

* get_attribute( ) method
    * get_attribute('src')  # get source of an image




In [None]:
# favorit_book.html
<!doctype html>
<html lang="zh-tw">
<head>
   <title>My favorit book</title>
</head>
<body>
   <section class='book'>
      <h1 class='booktitle'>Rich Dad Poor Dad: What the Rich Teach Their Kids About Money That the Poor and Middle Class Do Not!</h1>
      <h4 class='author'>Robert T. Kiyosaki</h4>
      <img src='https://images-na.ssl-images-amazon.com/images/I/51u8ZRDCVoL._SX330_BO1,204,203,200_.jpg' width=50>
      <p class='year'>2017</p>
      <p class='price'>1000</p>
   </section>
</body>
</html>

In [18]:
from selenium import webdriver

driverPath = 'chromedriver_win32/chromedriver.exe'
browser = webdriver.Chrome(executable_path=driverPath)    
url = 'file:///C:/Users/Laich/Google%20%E9%9B%B2%E7%AB%AF%E7%A1%AC%E7%A2%9F/course/course110_2/Data_science/slide_English/week11_selenium/favorit_book.html'
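# Non-ASCII (e.g., Chinese) file names cause problems: paste the path into Chrome's address bar first, then copy it back to get the percent-encoded URL, which avoids the Unicode issue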
browser.get(url)                

pict = browser.find_element_by_xpath("//section/img")
print(pict.get_attribute('src'))



https://images-na.ssl-images-amazon.com/images/I/51u8ZRDCVoL._SX330_BO1,204,203,200_.jpg


In [47]:
# ch7_7_6.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

s = Service('chromedriver_win32/chromedriver.exe')
browser = webdriver.Chrome(service=s)
url = 'http://aaa.24ht.com.tw'
browser.get(url)

# first <p> inside a <section>
n1 = browser.find_element(by=By.XPATH, value="//section/p")
print('p:', n1.text)
print('textContent of p : ', n1.get_attribute('textContent'))
print('innerHTML of p : ', n1.get_attribute('innerHTML'))
print('outerHTML of p : ', n1.get_attribute('outerHTML'))


p: 2015/2016年洪錦魁一個人到南極
textContent of p :  2015/2016年洪錦魁一個人到南極
innerHTML of p :  2015/2016年<strong>洪錦魁</strong>一個人到南極
outerHTML of p :  <p>2015/2016年<strong>洪錦魁</strong>一個人到南極</p>


### contains( ) 
* find a tag which contains specific text

In [None]:
#h7_4.html

<!doctype html>
<html lang="zh-tw">
<head>
   <title>洪錦魁著作</title>
</head>
<body>
   <div id='Computer'>
      <h1 class='booktitle'>Python王者歸來</h1>
      <p class='price'>1000</p>
      <div>
         <a class='publisher' href='http://www.deepmind.com.tw'>深智</a>
      </div>
   </div>
   <div id='Traveling'>
      <h1 class='booktitle'>一個人的極境旅行</h1>
      <p class='price'>500</p>
      <div>
         <a class='publisher' href='http://www.deepstone.com.tw'>深石</a>
      </div>
   </div>
</body>
</html>

In [15]:
# ch7_7_7.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service


s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')
browser = webdriver.Chrome(service=s)    
url = 'file:///C:/Users/User/Downloads/asd.html'
browser.get(url)               

n = browser.find_element(by=By.XPATH, value="//div[@id='Traveling']//a[contains(text(),'深石')]")
print(n.get_attribute('outerHTML'))
print(n.get_attribute('href'))


<a class="publisher" href="http://www.deepstone.com.tw">深石</a>
http://www.deepstone.com.tw/




## 3.2 Get XPath using Developer Tools

- In the Elements panel of the browser's Developer Tools, right-click the element and choose「Copy/Copy XPath」or「Copy/Copy Full XPath」
    - the former is a relative path
    - the latter is an absolute path

To locate an element, right-click it on the web page and click Inspect. The Elements tab then opens with that element highlighted, and you can copy its XPath or start writing your own locator from there.

- https://www.lambdatest.com/blog/complete-guide-for-using-xpath-in-selenium-with-examples/#XPath

# 4. Interaction with HTML elements

## 4.1 click a hyperlink

### click( )
- click a specific hyperlink
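
A fixed `time.sleep()` before `click()` either waits too long or not long enough. Selenium provides `WebDriverWait` together with `expected_conditions` for this purpose; the sketch below shows the underlying idea with the standard library only. `wait_until` is a hypothetical helper, not a Selenium function.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} s".format(timeout))
        time.sleep(poll)   # pause briefly before checking again

# Usage with a real browser (find_elements returns an empty list until the link exists):
# links = wait_until(lambda: browser.find_elements(By.LINK_TEXT, 'Continue'))
# links[0].click()
```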

In [16]:
# click()


import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service


s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')
browser = webdriver.Chrome(service=s)
browser.maximize_window()

url = 'http://www.taipeitimes.com/'
browser.get(url)                


elink = browser.find_element(by=By.XPATH, value='//*[@id="sticky-wrapper"]/div/div[1]/ul/li[3]/a')
time.sleep(3) 
elink.click()

# the find and click steps can be chained into one statement:
# browser.find_element(by=By.XPATH, value='//*[@id="sticky-wrapper"]/div/div[1]/ul/li[3]/a').click()



In [1]:
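# Click the [深智數位緣起] link on the 深智 (DeepMind) website, https://deepmind.com.tw

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time

driverPath = 'chromedriver_win32/chromedriver.exe'
browser = webdriver.Chrome(service=Service(driverPath))
browser.maximize_window()

url = 'https://deepmind.com.tw/'
browser.get(url)

eleLink = browser.find_element(By.LINK_TEXT, '深智數位緣起')   # locate the link by its visible text
print(type(eleLink))
time.sleep(5)                   # pause so the page can be seen before the click
eleLink.click()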

<class 'selenium.webdriver.remote.webelement.WebElement'>


## 4.2 Enter text

### send_keys( ) 
- types text into an editable field (text box or textarea), just as a user would from the keyboard; useful for filling in forms automatically
- special keys such as Enter are sent with the Keys class (from selenium.webdriver.common.keys import Keys)


### clear( )
- clear the contents of a text field or textarea


Example:
- inputElement = browser.find_element(By.ID, "")  # put the element's id between the quotes
- inputElement.send_keys('1234')
- browser.find_element(By.NAME, "search").send_keys(Keys.ENTER)  # press Enter in the field named "search"

### Practice:

Steps:
1. browse:  
   https://data.epa.gov.tw/

2. Cancel pop-ups

To be continued...

In [18]:
# Need to be modified

from selenium import webdriver
from time import sleep
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

url = 'https://data.epa.gov.tw/'
s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')

browser = webdriver.Chrome(service=s)
browser.maximize_window()
browser.get(url)

sleep(2)
browser.find_element(by=By.XPATH, value='//div//button').click()
browser.find_element(by=By.XPATH, value='/html/body/div/div/div/div[2]/div/button').click()

browser.find_element(by=By.XPATH, value='/html/body/div/div/div/div/main/div/section[1]/form/div[2]/input').send_keys('PM2.5')
browser.find_element(by=By.XPATH, value='/html/body/div/div/div/div/main/div/section[1]/form/button').click()


### Practice (continued):

3. Download 30 days of weather data

In [None]:
# Ch06 Google login 
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

url = input()
email = "Your account"
password = "your password"

browser = webdriver.Chrome()
browser.maximize_window()
browser.get(url)

browser.find_element(By.ID, 'gb_70').click()  # click the button in the upper right corner

browser.find_element(By.ID, 'identifierId').send_keys(email)  # account
sleep(2)
browser.find_element(By.XPATH, "//span[@class='RveJvd snByac']").click()  # continue
sleep(2)

browser.find_element(By.XPATH, "//input[@type='password']").send_keys(password)  # password
sleep(2)
browser.find_element(By.XPATH, "//span[@class='RveJvd snByac']").click()  # continue
sleep(3)  

In [2]:
# Lai
# search for "iphone" on the Momo website

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time

driverPath = 'chromedriver_win32/chromedriver.exe'
browser = webdriver.Chrome(service=Service(driverPath))
url = 'https://www.momoshop.com.tw/main/Main.jsp'
browser.get(url)

txtBox = browser.find_element(By.XPATH, '//*[@id="keyword"]')
txtBox.send_keys('iphone')         # type the search keyword
time.sleep(5)
txtBox.submit()                    # submit the form that contains the text box

In [58]:
# Use selenium to enter an account and password and log in
# Log in to the university course-offering system (the password needs to be updated)


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time

driverPath = 'chromedriver_win32/chromedriver.exe'
browser = webdriver.Chrome(service=Service(driverPath))
url = 'https://sys.ndhu.edu.tw/aa/class/TeacherSubj/Default.aspx'
browser.get(url)

txtBox = browser.find_element(By.ID, 'MainContent_ed_email_acc')
txtBox.send_keys('****')           # account
txtBox2 = browser.find_element(By.ID, 'MainContent_ed_email_pass')
txtBox2.send_keys('****')          # password

submit1 = browser.find_element(By.ID, 'MainContent_Button2')
time.sleep(5)
# submit1.submit()
submit1.click()

### Practice

Use Selenium to log in to the following website:

https://sys.ndhu.edu.tw/AA/CLASS/SubjEvaluate/eval-login.aspx

In [59]:
# Search books

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time

driverPath = 'chromedriver_win32/chromedriver.exe'
browser = webdriver.Chrome(service=Service(driverPath))
url = 'http://www.drmaster.com.tw/'
browser.get(url)

txtBox = browser.find_element(By.ID, 'content')
txtBox.send_keys('artificial intelligence')   # search keywords
submit1 = browser.find_element(By.ID, 'button')
time.sleep(5)
submit1.click()

In [None]:
# Log in to a book website

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time

driverPath = 'chromedriver_win32/chromedriver.exe'
browser = webdriver.Chrome(service=Service(driverPath))
url = 'http://www.drmaster.com.tw/'
browser.get(url)

txtBox = browser.find_element(By.ID, 'account')
txtBox.send_keys('****')           # account
txtBox2 = browser.find_element(By.ID, 'password')
txtBox2.send_keys('****')          # password
submit1 = browser.find_element(By.ID, 'imageField')
time.sleep(5)
submit1.click()

In [None]:
# Block the notification pop-up, then log in


from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

url = 'https://www.facebook.com/'
chrome_options = webdriver.ChromeOptions()
# 2 = block, so Chrome does not show the "show notifications" prompt
prefs = {"profile.default_content_setting_values.notifications" : 2}
chrome_options.add_experimental_option("prefs",prefs)
browser = webdriver.Chrome(options=chrome_options)

browser.maximize_window()
browser.get(url)

browser.find_element(By.ID, 'email').clear()
browser.find_element(By.ID, 'email').send_keys('Your Email address')
sleep(3) 
browser.find_element(By.ID, 'pass').clear()
browser.find_element(By.ID, 'pass').send_keys('Your password')

browser.find_element(By.ID, 'loginbutton').click()  

In [15]:
# select radio buttons

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from time import sleep

url = 'https://tip.railway.gov.tw/tra-tip-web/tip?lang=EN_US'

driverPath = 'chromedriver_win32/chromedriver.exe'
browser = webdriver.Chrome(service=Service(driverPath))
browser.maximize_window()
browser.get(url)

# click the first option, then the second, then the first again
browser.find_element(By.XPATH, '//*[@id="timePeriodType"]/div[1]/label').click()
sleep(3)
browser.find_element(By.XPATH, '//*[@id="timePeriodType"]/div[2]/label').click()
sleep(3)
browser.find_element(By.XPATH, '//*[@id="timePeriodType"]/div[1]/label').click()

## 4.3 Scroll down

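- Window.scrollTo( ) scrolls to a particular set of coordinates in the document.

- driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  
    - scrollTo(x-coord, y-coord)
    - execute_script( ) runs JavaScript in the page; document.body.scrollHeight is the full height of the page, so this call jumps to the bottom
    
    
- time.sleep(0.3)    
    - pause between scrolls so that newly loaded content can appear, and to simulate human behavior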
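Scrolling a fixed number of times either wastes time on short pages or stops too early on long ones. A minimal sketch of an alternative is to keep scrolling until the page height stops growing. The helper name `scroll_to_bottom` and its parameters `pause` and `max_rounds` are illustrative assumptions; `driver` is assumed to be a Selenium WebDriver, of which only `execute_script()` is used.

```python
import time


def scroll_to_bottom(driver, pause=0.3, max_rounds=100):
    """Scroll down until the page stops growing; return the number of scrolls.

    `driver` is assumed to be a Selenium WebDriver; only execute_script() is used.
    `scroll_to_bottom`, `pause`, and `max_rounds` are hypothetical names.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded, so this is treated as the bottom
        last_height = new_height
    return max_rounds
```

On a slow connection the height may not have changed yet when it is checked, so a larger `pause` makes the stop condition more reliable at the cost of speed.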

In [20]:
# retrieve up to 1000 pictures from a marathon photo album

import time,os
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service


s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')
driver = webdriver.Chrome(service=s) 
driver.maximize_window()

# 3rd Puli Power marathon (第3屆埔里跑 Puli Power 山城派對馬拉松), 向善橋 (about 34K)
url = 'http://tw.running.biji.co/index.php?q=album&act=photo_list&album_id=30668&cid=5791&type=album&subtitle=第3屆埔里跑 Puli Power 山城派對馬拉松-向善橋(約34K)'
# 3rd Puli Power marathon, 80 meters before the finish line
#url = 'http://tw.running.biji.co/index.php?q=album&act=photo_list&album_id=30807&cid=5791&type=album&subtitle=第3屆埔里跑 Puli Power 山城派對馬拉松-在终點前80米'

driver.get(url)  # open the page in the browser


driver.implicitly_wait(1) 
# implicit wait of 1 second: the maximum time to wait for an element;
# if the element is found sooner, execution continues immediately
# time.sleep(1), in contrast, is a fixed wait that always takes the full second

for i in range(1,101):
    # Scroll down
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(0.3)
   
soup=BeautifulSoup(driver.page_source,'html.parser')  
title = soup.select('.album-title')[0].text.strip()   # album title
all_imgs = soup.find_all('img', {"class": "photo_img photo-img"})

# create a directory named after the title to store the pictures
images_dir=title + "/"
if not os.path.exists(images_dir):
    os.mkdir(images_dir)
    
# process all <img> tags
n=0
for img in all_imgs:
    # read the src attribute
    src=img.get('src')
    # only handle .jpg files
    if src is not None and ('.jpg' in src):
        # full URL of the image
        full_path = src            
        filename = full_path.split('/')[-1]  # file name
        print(full_path)
        # save the picture
        try:
            image = urlopen(full_path)
            with open(os.path.join(images_dir,filename),'wb') as f:
                f.write(image.read())  
            n+=1
            if n>=1000: # at most 1000 pictures
                break
        except Exception:
            print("{} cannot be accessed!".format(filename))
            
print("Downloaded",n,"pictures")                
driver.quit()  # close the browser and stop the driver


https://cdntwrunning.biji.co/600_EE41BE00-E06A-4A00-3F36-B08551127095.jpg
https://cdntwrunning.biji.co/600_B9D3C5AF-A798-7F3F-AAD5-F4D1C458ED72.jpg
https://cdntwrunning.biji.co/600_962F1EF7-DE95-FA9C-E9EC-5E9FF51600BA.jpg
https://cdntwrunning.biji.co/600_C0F3AB82-BA97-731C-854F-26606B85388C.jpg
https://cdntwrunning.biji.co/600_5185B02C-3D29-4B6D-1D9C-E5A8672678C2.jpg
https://cdntwrunning.biji.co/600_4FC32284-B5BA-9B22-79B8-F40E50C272C0.jpg
https://cdntwrunning.biji.co/600_A65B017D-6C89-F894-EAA0-CD8F388A61F3.jpg
https://cdntwrunning.biji.co/600_AEFB54D8-F600-B37F-CEB9-A41DC24CE649.jpg
https://cdntwrunning.biji.co/600_D9967A7D-A2ED-AB54-AE3F-B63508B09A24.jpg
https://cdntwrunning.biji.co/600_84660201-2E4C-F546-2E4E-64973F97C1A9.jpg
https://cdntwrunning.biji.co/600_060D4E14-3C25-87EB-00A6-0AD80D578562.jpg
https://cdntwrunning.biji.co/600_AF78C4B2-ACC7-C1D2-2659-BF1F5F1F9C40.jpg
https://cdntwrunning.biji.co/600_C63540DD-E213-1B36-E08C-FFDA82F6C09B.jpg
https://cdntwrunning.biji.co/600_D977B

https://cdntwrunning.biji.co/600_42ECDE77-CBE0-A374-EEC9-222652AE2333.jpg
https://cdntwrunning.biji.co/600_1E1F17EB-C118-4098-E02E-CE9527EDA650.jpg
https://cdntwrunning.biji.co/600_7C02FE89-929F-29C1-2393-70CBD2140704.jpg
https://cdntwrunning.biji.co/600_E97B5D3D-DB44-A6D7-F924-9C233A233624.jpg
https://cdntwrunning.biji.co/600_4B9C077A-A767-113C-FB75-6CF0A7C12927.jpg
https://cdntwrunning.biji.co/600_03F99E17-5451-AEC6-49A0-3F61E22B4ECD.jpg
https://cdntwrunning.biji.co/600_2A3DDC6C-58B7-1F4B-A605-3565EAC01E04.jpg
https://cdntwrunning.biji.co/600_283CEED1-037A-149F-4F6C-EE5BC287DA34.jpg
https://cdntwrunning.biji.co/600_05438B4E-C12C-8BD8-CA3E-96390CD108C4.jpg
https://cdntwrunning.biji.co/600_6D1B6535-13E1-4BFC-2269-3DA5BE25C172.jpg
https://cdntwrunning.biji.co/600_7D97FE02-B69D-3F35-ADE0-CD3C5E2268D5.jpg
https://cdntwrunning.biji.co/600_E6CDAD9E-45DE-3CE3-ED6B-720666EFD755.jpg
https://cdntwrunning.biji.co/600_18170738-4698-F3AF-531B-7069A084A91B.jpg
https://cdntwrunning.biji.co/600_B08AE

https://cdntwrunning.biji.co/600_F22E894D-0534-3378-F764-C4A90C5CC6F3.jpg
https://cdntwrunning.biji.co/600_310123D8-3B91-1272-70D6-4BC8B58E8865.jpg
https://cdntwrunning.biji.co/600_2ECAB4DB-53BA-90F0-6C24-6E9B291FDDBB.jpg
https://cdntwrunning.biji.co/600_39A2CC2B-578C-E555-A9EF-A7073113C40B.jpg
https://cdntwrunning.biji.co/600_930C0E30-67B2-70C7-DD6E-F786A355F575.jpg
https://cdntwrunning.biji.co/600_A5EC9A61-4935-CEB6-B306-095D7C2A60EF.jpg
https://cdntwrunning.biji.co/600_18387A32-F919-EA87-858D-775DEADA05E3.jpg
https://cdntwrunning.biji.co/600_0043B26C-FF55-7F8C-D81C-A947A0BFA95B.jpg
https://cdntwrunning.biji.co/600_E7636E45-1EC0-A288-B702-69DE4DC903FA.jpg
https://cdntwrunning.biji.co/600_15A0BD7E-0AC7-A722-5C7E-62530887A30C.jpg
https://cdntwrunning.biji.co/600_1C5A21B1-E39B-71C3-E69A-C7CBAE13C345.jpg
https://cdntwrunning.biji.co/600_CB3CE6C8-AAA6-65E4-657F-0DE0FDAB591F.jpg
https://cdntwrunning.biji.co/600_E959E90E-662C-082C-2B56-E0901CBCECA4.jpg
https://cdntwrunning.biji.co/600_C6FFD

https://cdntwrunning.biji.co/600_A6C3AE96-E92F-C792-84ED-D4BD8AF121AD.jpg
https://cdntwrunning.biji.co/600_6D88E97C-D80A-0086-EC77-6535C3DE8C71.jpg
https://cdntwrunning.biji.co/600_26A891F1-38F9-2C3F-FE90-31B74554E7EA.jpg
https://cdntwrunning.biji.co/600_836B7355-6F8A-B3F1-CAD5-4663D359A9B3.jpg
https://cdntwrunning.biji.co/600_36B593CF-1FE4-4BAD-2E26-7A22E869558D.jpg
https://cdntwrunning.biji.co/600_AF7AF291-93F1-2C6E-0087-3FD21A6E10DA.jpg
https://cdntwrunning.biji.co/600_7DBB88C4-7C4F-653A-F30E-E9D625AE1772.jpg
https://cdntwrunning.biji.co/600_11388144-666C-6152-D157-C62B72A238FF.jpg
https://cdntwrunning.biji.co/600_A5FA1EBB-3B1A-B686-046F-1E2BC48C0C31.jpg
https://cdntwrunning.biji.co/600_BE9FBADE-8135-0298-D3F8-E3CCDC3F352F.jpg
https://cdntwrunning.biji.co/600_EA3AF961-F74C-40B7-0059-368E7944D224.jpg
https://cdntwrunning.biji.co/600_59A65CB9-DD06-5A4B-6A1D-25BEBAF7DA9C.jpg
https://cdntwrunning.biji.co/600_02E26087-C5B7-EF87-3456-DCAC22AC156E.jpg
https://cdntwrunning.biji.co/600_CE2FF

https://cdntwrunning.biji.co/600_312BC9E4-6454-52C7-C4FF-7FAAEDE0FFFD.jpg
https://cdntwrunning.biji.co/600_6CEB873A-BF68-C02D-7439-DA5B286B3DC3.jpg
https://cdntwrunning.biji.co/600_DC2CE63C-4BDA-8E2B-EEA0-B296DD6C0EAC.jpg
https://cdntwrunning.biji.co/600_A5B92124-6153-4F4A-D632-5C12A42DB058.jpg
https://cdntwrunning.biji.co/600_2878FB9C-C603-D82A-9BB7-1F779BC20B29.jpg
https://cdntwrunning.biji.co/600_CDA1F7DB-B9A4-F177-8E63-9A70695E83C7.jpg
https://cdntwrunning.biji.co/600_4331EF99-7C91-10A9-92C7-87D05283A427.jpg
https://cdntwrunning.biji.co/600_BF935765-D9E1-AD4C-CEDB-023BEC895770.jpg
https://cdntwrunning.biji.co/600_0269F7A8-A723-5E9D-0507-A930F41F0A88.jpg
https://cdntwrunning.biji.co/600_7B36EF3E-4A11-E5A0-65EE-00E2E7C20C3F.jpg
https://cdntwrunning.biji.co/600_511D2A31-C3AC-645A-EB83-3054D8879B53.jpg
https://cdntwrunning.biji.co/600_5EE8CFD7-74BD-B07A-7396-59C7327CAB4F.jpg
https://cdntwrunning.biji.co/600_21E08045-9904-A563-9560-23E2D88AA784.jpg
https://cdntwrunning.biji.co/600_47188

https://cdntwrunning.biji.co/600_242DEC6F-CBDE-07E8-2B4E-2AF219B5160F.jpg
https://cdntwrunning.biji.co/600_C0A40158-22A6-A503-415D-E61CD519D776.jpg
https://cdntwrunning.biji.co/600_B86CED28-7195-98F7-DE9F-1573D7EFB03D.jpg
https://cdntwrunning.biji.co/600_0618F29B-FA4D-C844-0FF6-754B850EE6F5.jpg
https://cdntwrunning.biji.co/600_B4856CF7-F552-2280-DF80-CA967C3DB1F7.jpg
https://cdntwrunning.biji.co/600_C6E3783B-8447-8C11-6BC2-8C40F6CD0A76.jpg
https://cdntwrunning.biji.co/600_4BE2530A-F41D-0764-06B8-3E37959508C9.jpg
https://cdntwrunning.biji.co/600_5B7365DC-EDE5-ACD6-60DA-DA8A06B075BD.jpg
https://cdntwrunning.biji.co/600_12A56AD0-B9BD-54D9-0BDC-EAF5E82058F2.jpg
https://cdntwrunning.biji.co/600_A7DE7B65-FA9C-4996-7429-8E209D2DB808.jpg
https://cdntwrunning.biji.co/600_42DE10BD-6696-4FED-4295-93C5733E9333.jpg
https://cdntwrunning.biji.co/600_CCBFF3B9-FBF7-7508-63FA-E712F2B6A4B2.jpg
https://cdntwrunning.biji.co/600_EA30D86D-6257-40E7-B64F-3728AEAD3B57.jpg
https://cdntwrunning.biji.co/600_63200

https://cdntwrunning.biji.co/600_8F0C62CC-8EDE-F93A-121F-F42AF3FAE60E.jpg
https://cdntwrunning.biji.co/600_663ADB46-FDEF-C382-0101-3C5FAE7A94D8.jpg
https://cdntwrunning.biji.co/600_8AAB0A47-A8D9-771E-D6B9-4E8B80F80527.jpg
https://cdntwrunning.biji.co/600_BE13CFB9-D12F-E427-EA70-0BAF8AB28D22.jpg
https://cdntwrunning.biji.co/600_4A86DD5C-E6C7-839E-ECBE-611128D9B65A.jpg
https://cdntwrunning.biji.co/600_BB7C90ED-5F92-EC06-EC47-17CB923AB336.jpg
https://cdntwrunning.biji.co/600_24184EB5-B5B3-63C3-0989-7ECB27A75E90.jpg
https://cdntwrunning.biji.co/600_5CB1D4A9-48BE-FE61-309C-43CEB3F78839.jpg
https://cdntwrunning.biji.co/600_453EC75C-211D-A3DF-4C68-CF4D8D324231.jpg
https://cdntwrunning.biji.co/600_C731F764-FD7D-BCE5-B696-12D7467052E5.jpg
https://cdntwrunning.biji.co/600_C90DB8A7-CC43-0E64-3B70-89E92F85F026.jpg
https://cdntwrunning.biji.co/600_F043245D-7773-9DD4-CA5F-1D79A51AA862.jpg
https://cdntwrunning.biji.co/600_0774A0A5-54CC-0EE9-F7FB-66F1518BE443.jpg
https://cdntwrunning.biji.co/600_B58B8

https://cdntwrunning.biji.co/600_88EC2D69-63B2-FFFE-93E5-6853D72EFB3E.jpg
https://cdntwrunning.biji.co/600_88D95BC9-50EA-7088-218B-64751F4DAEFD.jpg
https://cdntwrunning.biji.co/600_C93B4D30-DD7F-730C-26B3-43B742780618.jpg
https://cdntwrunning.biji.co/600_1D780CAA-718D-10FC-8B6F-F3165B144627.jpg
https://cdntwrunning.biji.co/600_3DF97505-0D52-004C-367D-B40077505EF4.jpg
https://cdntwrunning.biji.co/600_6653A01B-6402-B06A-230F-09E1471D5DEB.jpg
https://cdntwrunning.biji.co/600_CFD48EA3-E31B-9477-95CA-F4ABC18BBF46.jpg
https://cdntwrunning.biji.co/600_447D7DDD-11C4-FB8F-EEF4-F1B70C896EFF.jpg
https://cdntwrunning.biji.co/600_AC01817C-A815-74BA-019D-DC0CA8DE996F.jpg
https://cdntwrunning.biji.co/600_BA5B7AC6-12FF-9AA0-A5C7-6E7D856E2C5D.jpg
https://cdntwrunning.biji.co/600_EA86DAEC-ECC8-1D98-A10D-41C812436484.jpg
https://cdntwrunning.biji.co/600_72A0433E-80F0-FC7F-7C1F-13A70F69D9B4.jpg
https://cdntwrunning.biji.co/600_C2F3A946-72D0-5F17-BA2C-925B7F031850.jpg
https://cdntwrunning.biji.co/600_E0FE2

https://cdntwrunning.biji.co/600_AD99D6A7-D3C2-7505-F36B-248C9001D119.jpg
https://cdntwrunning.biji.co/600_81C03200-D916-31CE-4BA9-4A3CD2BAFC09.jpg
https://cdntwrunning.biji.co/600_107222CA-E42C-FD25-1A7E-744DF751DA4F.jpg
https://cdntwrunning.biji.co/600_8B715A24-CD04-FA80-4C3E-5AE3F0C1D169.jpg
https://cdntwrunning.biji.co/600_24B66ADB-FC85-C480-3BC5-6AF7EBF6F893.jpg
https://cdntwrunning.biji.co/600_F1E7D055-E621-A9C2-CE84-2C00657676E3.jpg
https://cdntwrunning.biji.co/600_3DCD9A4B-F781-1BFE-07FE-60A984AF3DB4.jpg
https://cdntwrunning.biji.co/600_B5D88F9D-E7F4-75FB-FE76-FFB2134FEEB2.jpg
https://cdntwrunning.biji.co/600_90014660-8582-0DDC-3069-8BA04ACF591C.jpg
https://cdntwrunning.biji.co/600_483C1C0A-F504-1A0F-7906-8219CB061661.jpg
https://cdntwrunning.biji.co/600_592E8836-374C-3F4D-8F88-4DAA27040AEB.jpg
https://cdntwrunning.biji.co/600_03771EEA-14F8-70A3-4F22-BF361C3990BE.jpg
https://cdntwrunning.biji.co/600_4DBF170C-5B51-047C-85A3-A1C18DE07645.jpg
https://cdntwrunning.biji.co/600_DD498

## 4.4 run selenium in background

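- Run Chrome in headless mode so that no browser window is shown


- Syntax
    - from selenium.webdriver.chrome.options import Options
    - chrome_options = Options()  
    - chrome_options.add_argument("--headless")  # enable headless mode  
    - driver = webdriver.Chrome(options=chrome_options)  # the older keyword chrome_options= is deprecated  

### hide the browser

- The same setting written with webdriver.ChromeOptions( ):
- headless = webdriver.ChromeOptions()
- headless.add_argument('headless')   
- browser = webdriver.Chrome(service=Service(driverPath), options=headless)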

### implicit wait

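- browser.implicitly_wait(5) 
    - element lookups keep retrying for up to 5 seconds before raising an error, and return as soon as the element is found
    - time.sleep(5), in contrast, always pauses for the full 5 seconds

- Difference between driver.implicitly_wait() and time.sleep()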
    - https://stackoverflow.com/questions/53588966/python-selenium-difference-between-driver-implicitly-wait-and-time-sleep
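The difference can be pictured as "poll until found, up to a limit" versus "always pause". The following is a conceptual sketch in plain Python, not Selenium's actual implementation. The names `poll_until`, `find`, and `interval` are illustrative assumptions, and the "element" is simulated by a timer.

```python
import time


def poll_until(find, timeout=5.0, interval=0.1):
    """Call `find()` repeatedly until it returns a value or `timeout` seconds pass.

    This is a conceptual model of an implicit wait, not Selenium's actual
    implementation; `poll_until`, `find`, and `interval` are hypothetical names.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = find()
        if result is not None:
            return result  # found early: return without waiting the full timeout
        if time.monotonic() >= deadline:
            raise TimeoutError("nothing found within {} seconds".format(timeout))
        time.sleep(interval)


# Simulated page whose element "appears" 0.2 seconds after loading starts.
start = time.monotonic()
element = poll_until(lambda: "element" if time.monotonic() - start >= 0.2 else None)
elapsed = time.monotonic() - start
print(element, "found after about", round(elapsed, 1), "seconds")
```

With a 5-second limit the lookup still returns after roughly 0.2 seconds, whereas `time.sleep(5)` would have held the script for the full 5 seconds regardless.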

In [22]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

s = Service(r'C:\SDK\chromedriver_win32\chromedriver.exe')

headless = webdriver.ChromeOptions()
headless.add_argument('headless')   
# browser = webdriver.Chrome(executable_path=driverPath, options=headless)
browser = webdriver.Chrome(service=s, options=headless)   # the variable name must match the calls below
url = 'file:///C:/Users/User/Downloads/asd.html'
browser.implicitly_wait(5)          
browser.get(url)                   

n = browser.find_element(By.XPATH, "//div[@id='Traveling']//a[contains(text(),'深石')]")
print(n.get_attribute('outerHTML'))
print(n.get_attribute('href'))


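### Hide "Chrome is being controlled by automated test software"

- from selenium import webdriver
- option = webdriver.ChromeOptions()
- option.add_argument('disable-infobars')
- driver = webdriver.Chrome(options=option)
- Note: recent Chrome versions may ignore disable-infobars; a commonly used alternative is option.add_experimental_option("excludeSwitches", ["enable-automation"])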

- https://www.codestudyblog.com/8ten1/80330184714.html

In [23]:
# coding:utf-8
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driverPath = r'C:\SDK\chromedriver_win32\chromedriver.exe'   # adjust to where chromedriver.exe is installed


option = webdriver.ChromeOptions()
option.add_argument('disable-infobars')

browser = webdriver.Chrome(service=Service(driverPath), options=option)
browser.get('http://www.ndhu.edu.tw') 


# 5. Examples

In [None]:
# Example: download up to 1000 marathon pictures, running in the background
# Source: Python大數據特訓班, 鄧文淵, chap. 6

import time,os
from urllib.request import urlopen
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

# hide the browser
chrome_options = Options()
chrome_options.add_argument("--headless")  # enable headless mode
driver = webdriver.Chrome(options=chrome_options)

# 3rd Puli Power marathon (第3屆埔里跑 Puli Power 山城派對馬拉松), 向善橋 (about 34K)
#url = 'http://tw.running.biji.co/index.php?q=album&act=photo_list&album_id=30668&cid=5791&type=album&subtitle=第3屆埔里跑 Puli Power 山城派對馬拉松-向善橋(約34K)'
# 3rd Puli Power marathon, 80 meters before the finish line
url = 'http://tw.running.biji.co/index.php?q=album&act=photo_list&album_id=30807&cid=5791&type=album&subtitle=第3屆埔里跑 Puli Power 山城派對馬拉松-在终點前80米'

driver.get(url)  # open the page in the (hidden) browser
# implicit wait of 1 second
driver.implicitly_wait(1)

for i in range(1,101):
    # scroll down; this takes some time
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(0.3)
   
soup=BeautifulSoup(driver.page_source,'html.parser')  
title = soup.select('.album-title')[0].text.strip()   # album title
all_imgs = soup.find_all('img', {"class": "photo_img photo-img"})

# create a directory named after the title to store the pictures
images_dir=title + "/"
if not os.path.exists(images_dir):
    os.mkdir(images_dir)
    
# process all <img> tags
n=0
for img in all_imgs:
    # read the src attribute
    src=img.get('src')
    # only handle .jpg files
    if src is not None and ('.jpg' in src):
        # full URL of the image
        full_path = src            
        filename = full_path.split('/')[-1]  # file name
        print(full_path)
        # save the picture
        try:
            image = urlopen(full_path)
            with open(os.path.join(images_dir,filename),'wb') as f:
                f.write(image.read()) 
            n+=1
            if n>=1000: # at most 1000 pictures
                break
        except Exception:
            print("{} cannot be accessed!".format(filename))
            
print("Downloaded",n,"pictures")                
driver.quit()  # close the browser and stop the driver

In [61]:
# download various types of files 


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time

url = 'https://opendata.epa.gov.tw/data/contents/aqi/'

driverPath = 'chromedriver_win32/chromedriver.exe'
browser = webdriver.Chrome(service=Service(driverPath))
browser.get(url)

browser.find_element(By.LINK_TEXT, 'JSON').click()      # click the JSON download link
time.sleep(3)

browser.find_element(By.LINK_TEXT, 'XML').click()       # click the XML download link
time.sleep(3)

browser.find_element(By.LINK_TEXT, 'CSV').click()       # click the CSV download link
time.sleep(3)


## Reference

* https://pypi.org/project/selenium/
* https://selenium-python.readthedocs.io/
* [WebDriver API official site](https://selenium-python.readthedocs.io/api.html#)
* 鄧文淵. Python大數據特訓班, chap. 2.3
* 洪錦魁 (2019). Python網路爬蟲王者歸來. Taipei: 深智.
* 資料科學學習手札33: 基於Python的網路資料採集實戰（1）  
 * [Link](https://tw.saowen.com/a/85b3c6f230ea22cad822a2fe9074eab46220ae281ad05f40a8630c0fff2e2407)
* 數據科學學習手札47: 基於Python的網絡數據採集實戰（2）    
 * [Link](https://cloud.tencent.com/developer/article/1189537)
* 資料科學學習手札50: 基於Python的網路資料採集-selenium篇（上）  
 * [Link](https://tw.saowen.com/a/94cc9a02b26a9e0e38b1ff118d75fe0a8dc6180ba6ee1df1ff9dc5815d9630c1)