# Web Scraping Using Selenium
### Selenium Architecture

<img align="center" width="1000"   src="images/seleniumarchitecture.png"  >

- `webdriver` is a Python module in the `selenium` package.
- A module contains different classes and functions for different purposes.
- The `webdriver` module contains a class for each browser:
>- For Chrome Browser  --> `Chrome()`
>- For Firefox Browser --> `Firefox()`
>- For Edge Browser    --> `Edge()`
- WebDriver is an API (Application Programming Interface): it sits between the Python code and the browser (the application being controlled).
- ***At a high level, Selenium WebDriver works in five steps:***
>- 1) The client library converts each command into an HTTP request. Selenium 3 used the JSON Wire Protocol for this; Selenium 4 uses the W3C WebDriver protocol.
>- 2) Every browser has its own driver (e.g. ChromeDriver for Chrome), which starts a local HTTP server before any command is executed.
>- 3) The browser then receives the request through its driver.
>- 4) The browser performs the action and sends the response back to the driver.
>- 5) The driver sends the response back to the Selenium client library.
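
To make the first step concrete, here is a minimal sketch of how a client could turn a command into an HTTP request under the W3C WebDriver protocol. The function name `build_webdriver_request` and the session id `abc123` are made up for illustration; only the three endpoints shown are taken from the protocol, and the real Selenium client does considerably more (session creation, error handling, element references).

```python
import json

def build_webdriver_request(command, session_id, **params):
    """Return (HTTP method, path, JSON body) for a few W3C WebDriver commands."""
    # Hypothetical lookup table covering only three of the protocol's endpoints
    routes = {
        "get": ("POST", "/session/{sid}/url"),               # navigate to a URL
        "title": ("GET", "/session/{sid}/title"),            # read the page title
        "find_element": ("POST", "/session/{sid}/element"),  # locate one element
    }
    method, path = routes[command]
    # Only POST requests carry a JSON body
    body = json.dumps(params) if method == "POST" else None
    return method, path.format(sid=session_id), body

method, path, body = build_webdriver_request("get", "abc123", url="https://example.com")
print(method, path, body)
```

The browser driver answers each such request with a JSON response, which the client library converts back into a Python value.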

### Download & Install Selenium

In [1]:
import sys
!{sys.executable} -m pip install --upgrade pip -q
!{sys.executable} -m pip install --upgrade selenium -q

In [2]:
import selenium
selenium.__version__, selenium.__path__

('4.11.2',
 ['C:\\Users\\iqbal\\AppData\\Local\\anaconda3\\lib\\site-packages\\selenium'])

### Create an Instance of WebDriver, Load a Web Page, Play and Quit
> **Create an instance of Browser:**
>- `Service('pathtochromedriver')` creates a Service object, which is then passed to `Chrome()`.
>- `Chrome(service, options)` creates a new instance of the Chrome driver: it starts the driver service and then launches a new instance of the Chrome browser.
>- ChromeOptions is a concept added in Selenium WebDriver starting from Selenium version 3.6.0, and is used for customizing the ChromeDriver session.
>- `Options()` creates an object for changing the default settings of the Chrome driver. The object is then passed to `webdriver.Chrome()`.

> **Load a Web page in the browser window:**
>- The `driver.get('URL')` method is used to load a web page in the current browser session, after which you can access the browser and the HTML code using the driver object.
>- This is similar to `resp = requests.get('URL')`, after which you simply get the response object.

In [1]:
from selenium import webdriver
from selenium.webdriver import Chrome  # From the webdriver module of the selenium package, import the Chrome class
from selenium.webdriver.chrome.service import Service  # From the service module, import the Service class
# Create an object of class Service; the raw string keeps the backslashes in the Windows path literal
ser_obj = Service(r"C:\ChromeDriver\chromedriver.exe")
# Create an object of class Chrome; its constructor takes the Service object as a keyword argument
driver = webdriver.Chrome(service = ser_obj)
driver.get("https://www.w3schools.com/MySQL/default.asp")  # Call the get() method of the Chrome object
driver.quit()

> **Access browser information:** 
>- You can request a lot of information about the browser, including window handles, browser size / position, cookies, alerts, etc.

In [2]:
print(dir(driver))

['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_authenticator_id', '_check_if_window_handle_is_current', '_file_detector', '_get_cdp_details', '_is_remote', '_mobile', '_shadowroot_cls', '_switch_to', '_unwrap_value', '_web_element_cls', '_wrap_value', 'add_cookie', 'add_credential', 'add_virtual_authenticator', 'application_cache', 'back', 'bidi_connection', 'capabilities', 'caps', 'close', 'command_executor', 'create_web_element', 'current_url', 'current_window_handle', 'delete_all_cookies', 'delete_cookie', 'delete_network_conditions', 'desired_capabilities', 'error_handler', 'execute', 'execute_async_script', 'execute_cdp_cmd', 

- 1) find_element() vs find_elements()
     - find_element() returns a single web element
     - find_elements() returns a list of web elements
- 2) text vs get_attribute()
     - text returns the inner text of the element
     - get_attribute('name') returns the value of the named attribute of the web element, e.g. get_attribute('value')
- These are the five most commonly used types of commands:
- ***1) Applicational Commands***
    - a) get()
    - b) title
    - c) current_url
    - d) page_source
    - e) current_window_handle
    - f) session_id
- ***2) Conditional Commands***
    - a) is_displayed()
    - b) is_enabled()
    - c) is_selected()
- ***3) Browser Commands***
    - a) close()
    - b) quit()
- ***4) Navigational Commands***
    - a) back()
    - b) forward()
    - c) refresh()
- ***5) Wait Commands***
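
Wait commands pause the script until the page reaches some state instead of sleeping for a fixed time. Selenium ships `WebDriverWait` (in `selenium.webdriver.support.ui`) together with `expected_conditions` for this purpose. The sketch below is not that class; it is a simplified stand-in, with the hypothetical name `wait_until`, that illustrates the polling idea behind an explicit wait.

```python
import time

def wait_until(driver, condition, timeout=10.0, poll=0.5):
    """Call condition(driver) repeatedly until it returns a truthy value.

    Simplified stand-in for an explicit wait: raises TimeoutError if the
    condition is still falsy once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)

# Possible usage with a live driver (assumes `driver` and `By` already exist):
# prices = wait_until(driver, lambda d: d.find_elements(By.XPATH, '//p[@class=" price green"]'))
```

Because `find_elements()` returns an empty list when nothing matches, it works directly as a condition: the wait ends as soon as at least one element is present.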

### Handle Cookies
- `cookies = driver.get_cookies()` returns a list of dictionaries, one per cookie, each holding the name, value, and other attributes of the cookie
- `driver.add_cookie({'name':'abc','value':'123'})` will add a cookie
- `driver.delete_cookie('name of cookie')` will delete a specific cookie
- `driver.delete_all_cookies()` will delete all the cookies
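
Since `get_cookies()` returns a list of dictionaries, it is often handy to flatten it into a simple name-to-value mapping. The helper name `cookies_to_dict` and the sample cookies below are invented for illustration.

```python
def cookies_to_dict(cookies):
    """Map each cookie's name to its value, given a list shaped like get_cookies() output."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}

# Made-up sample in the same shape as the list returned by driver.get_cookies()
sample = [
    {"name": "session", "value": "xyz", "domain": "example.com", "path": "/"},
    {"name": "theme", "value": "dark", "domain": "example.com", "path": "/"},
]
print(cookies_to_dict(sample))
```

With a live session this would be called as `cookies_to_dict(driver.get_cookies())`.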

In [3]:
ser_obj = Service(r"C:\ChromeDriver\chromedriver.exe")  # Create an object of class Service (raw string keeps the backslashes literal)
# Create an object of class Chrome; its constructor takes the Service object as a keyword argument
driver = webdriver.Chrome(service = ser_obj)
driver.get("https://www.w3schools.com/MySQL/default.asp")

In [4]:
driver.title

'MySQL Tutorial'

In [5]:
driver.current_url

'https://www.w3schools.com/MySQL/default.asp'

In [6]:
driver.current_window_handle

'78A97198744B5AFC5FFE0A504511EE44'

In [7]:
driver.session_id

'e79530a064df9b78d94d0e1c6be4558b'

In [8]:
driver.page_source

'<html lang="en-US"><head><script async="" src="//cdn.confiant-integrations.net/prebid/202307190925/wrap.js"></script><script type="text/javascript" async="" src="https://script.4dex.io/localstore.js"></script>\n<title>MySQL Tutorial</title>\n<meta charset="utf-8">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<meta name="Keywords" content="HTML, Python, CSS, SQL, JavaScript, How to, PHP, Java, C, C++, C#, jQuery, Bootstrap, Colors, W3.CSS, XML, MySQL, Icons, NodeJS, React, Graphics, Angular, R, AI, Git, Data Science, Code Game, Tutorials, Programming, Web Development, Training, Learning, Quiz, Exercises, Courses, Lessons, References, Examples, Learn to code, Source code, Demos, Tips, Website">\n<meta name="Description" content="Well organized and easy to understand Web building tutorials with lots of examples of how to use HTML, CSS, JavaScript, SQL, Python, PHP, Bootstrap, Java, XML and more.">\n<meta property="og:image" content="https://www.w3schools.com/ima

> **Perform Different operations on the browser:**
>- The `driver.refresh()` method is used to refresh the page contents.
>- The `driver.set_window_position(x,y)` method is used to set the position of the top-left corner of the browser window.
>- The `driver.set_window_size(x,y)` method is used to set the width and height of the current window.
>- The `driver.maximize_window()` method is used to maximize the window.
>- The `driver.minimize_window()` method is used to minimize the browser to the taskbar.

In [9]:
driver.refresh()

In [10]:
driver.set_window_position(0,0)

{'height': 700, 'width': 1050, 'x': 0, 'y': 0}

In [11]:
driver.maximize_window()

In [12]:
driver.minimize_window()

In [13]:
driver.maximize_window()

> **Create new tab in the browser window and shift between tabs:**
>- Clicking a link may open it in a new browser tab.
>- You can also create a new browser tab programmatically using `driver.switch_to.new_window('tab')`.
>- All calls to the driver will now be interpreted as being directed to the current browser tab.
- WebDriver supports moving between windows using:
    - `driver.switch_to.window("windowname/ID")`
        - `switch_to.new_window('tab')` or `switch_to.new_window('window')` opens a new empty tab or window and switches to it
        - `current_window_handle` returns the ID of the current window
        - `window_handles` returns a list of the IDs of all the currently open windows
- WebDriver supports moving between frames/iframes using:
    - `driver.switch_to.frame('framename/ID/webelement/0')`, where the index `0` refers to the first frame on the page
        - `switch_to.default_content()` to return to the main document
        - `switch_to.parent_frame()` to move up one level, to the parent of the current frame
- WebDriver supports moving from a window to an alert/popup using:
    - `alert_obj = driver.switch_to.alert` to get the alert object
        - `alert_obj.text` to read the message shown on the alert
        - `alert_obj.send_keys('text')` to write text in the input field of a prompt
        - `alert_obj.accept()` to click the ok/confirm/accept button on the alert
        - `alert_obj.dismiss()` to click the cancel button on the alert
    - All calls to the driver will now be interpreted as being directed to the particular window.
- Authentication popup:
    - Syntax: `http://username:password@test.com`
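
When a click opens a new tab, the driver keeps talking to the old one until told otherwise. A minimal sketch of finding and switching to the new tab follows; the helper name `switch_to_new_tab` is hypothetical, and it relies only on `window_handles` and `switch_to.window()` described above.

```python
def switch_to_new_tab(driver, known_handles):
    """Switch to the first window whose handle is not in known_handles.

    Returns the handle switched to, or None if no new window was found.
    """
    for handle in driver.window_handles:
        if handle not in known_handles:
            driver.switch_to.window(handle)
            return handle
    return None

# Possible usage with a live driver:
# before = set(driver.window_handles)
# ... click a link that opens a new tab ...
# switch_to_new_tab(driver, before)
```

Recording the handles before the click avoids guessing which position the new tab takes in the `window_handles` list.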

In [14]:
first_tab = driver.current_window_handle
first_tab

'78A97198744B5AFC5FFE0A504511EE44'

In [15]:
driver.switch_to.new_window('tab')

In [16]:
driver.get("https://www.yahoo.com")
driver.current_url

'https://www.yahoo.com/'

In [17]:
driver.switch_to.window(first_tab)

In [18]:
driver.close()

In [19]:
driver.quit()

> **Close browser tab or close the entire session:**
>- The `driver.close()` method closes only the current tab of the browser; it does not end the browser process while other tabs remain open.
>- The `driver.quit()` will close all the browser tabs and the background driver process.
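
If a script raises an exception before it reaches `driver.quit()`, the browser and the driver process are left running. One way to guard against that is a `try`/`finally` wrapper. This is a sketch; the names `run_with_driver`, `make_driver`, and `task` are illustrative and not part of Selenium.

```python
def run_with_driver(make_driver, task):
    """Create a driver, run task(driver), and always quit the driver afterwards."""
    driver = make_driver()
    try:
        return task(driver)
    finally:
        # quit() ends the whole session, even if task() raised an exception
        driver.quit()

# Possible usage (assumes `webdriver` and `ser_obj` from the cells above):
# title = run_with_driver(lambda: webdriver.Chrome(service=ser_obj), lambda d: d.title)
```

The `dir(driver)` listing above also shows `__enter__` and `__exit__`, so a driver object can alternatively be used in a `with` statement.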

#### Browser Options
- Create an object of class `ChromeOptions()`
- Disable the notification prompts that appear at start-up
    - `ops = webdriver.ChromeOptions()`
    - `ops.add_argument("--disable-notifications")`
    - `driver = webdriver.Chrome(service = ser_obj, options = ops)`
- Headless mode
    - `ops.add_argument("--headless=new")` (the older `ops.headless = True` is deprecated in recent Selenium 4 releases)
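
The option settings above can be collected in one small helper so that every driver in a project is configured the same way. This is only a sketch: `configure_options` is a made-up name, and it assumes the object passed in has an `add_argument()` method, as `webdriver.ChromeOptions()` does.

```python
def configure_options(options, headless=False, disable_notifications=True):
    """Add common Chrome command-line switches to an options object and return it."""
    if disable_notifications:
        options.add_argument("--disable-notifications")
    if headless:
        # Assumption: a Chrome version that understands the newer headless mode
        options.add_argument("--headless=new")
    return options

# Possible usage (assumes `webdriver` and `ser_obj` from the cells above):
# ops = configure_options(webdriver.ChromeOptions(), headless=True)
# driver = webdriver.Chrome(service=ser_obj, options=ops)
```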

### Keyboard Keys Options
- `from selenium.webdriver import ActionChains` imports the class `ActionChains`
- `from selenium.webdriver.common.keys import Keys` imports the class `Keys`, which holds the special-key constants
- `act = ActionChains(driver)` creates an object of the class `ActionChains`
- `act.key_down(Keys.CONTROL).send_keys('a').key_up(Keys.CONTROL).perform()` performs the Control+A action
- `act.key_down(Keys.CONTROL).send_keys('c').key_up(Keys.CONTROL).perform()` performs the Control+C action
- `act.send_keys(Keys.TAB).perform()` presses the Tab key, so the cursor moves to the second textarea
- `act.key_down(Keys.CONTROL).send_keys('v').key_up(Keys.CONTROL).perform()` performs the Control+V action

### Download a file
- `import os` imports the `os` module
- `location = os.getcwd()` gets the current working directory
- `preferences = {"download.default_directory":location, 'plugins.always_open_pdf_externally':True}` the second key is needed only for PDF files
- `ops = webdriver.ChromeOptions()`
- `ops.add_experimental_option("prefs", preferences)`
- `driver = webdriver.Chrome(service = ser_obj, options = ops)`
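
After clicking a download link, the script usually needs to wait until the file has been written completely. Chrome keeps an unfinished download under a temporary name ending in `.crdownload`, so one rough approach is to poll the download directory until no such file remains. The helper name `wait_for_downloads` and its default timings are assumptions made for this sketch.

```python
import os
import time

def wait_for_downloads(directory, timeout=30.0, poll=0.5):
    """Return True once no '.crdownload' files remain in directory, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        # Chrome's partial downloads carry the .crdownload suffix
        pending = [name for name in os.listdir(directory) if name.endswith(".crdownload")]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Note that the function also returns `True` if the download has not started yet, so a short pause between the click and the call may be needed.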

### Capture Screenshot
- `driver.save_screenshot(os.getcwd()+"\\first.png")` captures a screenshot and saves it in the current working directory with the name first.png
- `driver.get_screenshot_as_file(os.getcwd()+"\\first.png")` does the same: it captures a screenshot and saves it in the current working directory with the name first.png
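
Joining the path with `"\\"` works on Windows only. A small sketch using `os.path.join` keeps the same call portable; the helper name `save_screenshot_in` is made up, and it relies on `save_screenshot()` returning `True` on success.

```python
import os

def save_screenshot_in(driver, directory, name="first.png"):
    """Save a screenshot as directory/name; return the path on success, else None."""
    path = os.path.join(directory, name)  # uses the right separator for the current OS
    saved = driver.save_screenshot(path)
    return path if saved else None

# Possible usage with a live driver:
# save_screenshot_in(driver, os.getcwd())
```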

## Example # 01: Scraping a Dynamic Website (https://arifpucit.github.io/bss2/login/)
#### a. Different Ways to Locate Web elements using Selenium
- Once the webpage is loaded in the browser, the next task is to locate the web element(s) of interest and then perform actions on them.
- The two most commonly used methods for locating elements are:
    - `driver.find_element(By.LOCATOR, "value")` locates a single element.
    - `driver.find_elements(By.LOCATOR, "value")` locates multiple elements.
- The first argument to these methods is the locator type, and the second argument is the value of that locator.
- In Selenium, there are eight different types of locators, i.e. ways to locate a web element:
    - ID, NAME, and CLASS_NAME are attributes of a web element and are called direct locators, as they are fast. Their limitation is that they may not always work on dynamic web sites.
    - XPATH and CSS_SELECTOR are called indirect locators, as they are comparatively slow, but they are really useful on dynamic web sites.
    - LINK_TEXT and PARTIAL_LINK_TEXT, which locate links by their visible text.
    - TAG_NAME, which is seldom used.
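
The `By` constants are plain strings (for example `By.ID` is `"id"` and `By.CSS_SELECTOR` is `"css selector"`). The two lookup methods also differ in how they fail: `find_element()` raises `NoSuchElementException` when nothing matches, while `find_elements()` returns an empty list. The sketch below uses that second behaviour to build a lookup that returns `None` instead of raising; `LOCATOR_STRINGS` and `find_or_none` are names invented here.

```python
# String values behind the eight By constants, listed for reference
LOCATOR_STRINGS = {
    "ID": "id",
    "NAME": "name",
    "CLASS_NAME": "class name",
    "XPATH": "xpath",
    "CSS_SELECTOR": "css selector",
    "LINK_TEXT": "link text",
    "PARTIAL_LINK_TEXT": "partial link text",
    "TAG_NAME": "tag name",
}

def find_or_none(driver, by, value):
    """Return the first matching element, or None when nothing matches."""
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None

# Possible usage with a live driver:
# box = find_or_none(driver, LOCATOR_STRINGS["ID"], "name")
```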

In [20]:
ser_obj = Service(r"C:\ChromeDriver\chromedriver.exe")
driver = webdriver.Chrome(service = ser_obj) 
driver.get("https://arifpucit.github.io/bss2/login/")
driver.maximize_window()

In [21]:
from selenium.webdriver.common.by import By
user_name = driver.find_element(By.XPATH, '//input[@id="name"]')
print(type(user_name))

<class 'selenium.webdriver.remote.webelement.WebElement'>


In [22]:
# Clear the input field, then type the user name
user_name.clear()
user_name.send_keys('arif')

In [25]:
# driver.find_element(By.LINK_TEXT, 'Ask Google for Password').click()

In [26]:
# Direct to the previous page
# driver.back()

In [23]:
driver.find_element(By.CSS_SELECTOR, 'input#password').send_keys('datascience')

In [24]:
driver.find_element(By.XPATH, '//button[@id="submit_button"]').click()

>- ***find_element() will return a web element***
>- ***find_elements() will return a list***

In [25]:
# Locate the price elements once, then print the text of the first nine
prices = driver.find_elements(By.XPATH, '//p[@class=" price green"]')
for i in range(0,9):
    print(prices[i].text)

Rs.2000
Rs.5000
Rs.6900
Rs.2700
Rs.1700
Rs.1800
Rs.6000
Rs.1000
Rs.1800
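
The scraped prices are strings such as `Rs.2000`. Before doing any arithmetic with them, they need to be converted to numbers. A minimal sketch follows; the function name `parse_price` is an illustrative choice, and it assumes whole-number prices as in the output above.

```python
def parse_price(text):
    """Extract the digits from a price string such as 'Rs.2000' and return them as an int."""
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits)  # raises ValueError if the text contains no digits

print([parse_price(p) for p in ["Rs.2000", "Rs.5000", "Rs.6900"]])
```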


In [26]:
driver.quit()

### Consolidated Code 

In [85]:
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd

s = Service(r"C:\ChromeDriver\chromedriver.exe")
driver = Chrome(service = s)
driver.get('https://arifpucit.github.io/bss2/login')
driver.maximize_window()

driver.find_element(By.XPATH, '//input[@id="name"]').send_keys('arif')
driver.find_element(By.XPATH, '//input[@id="password"]').send_keys('datascience')
driver.find_element(By.XPATH, '//button[@id="submit_button"]').click()
time.sleep(2)
# driver.quit()

titles = []
prices = []
availability = []
reviews = []
links = []


def fun1():
    # Collect the matching elements on the current page
    s_titles = driver.find_elements(By.XPATH, "//p[@class='book_name']")
    s_prices = driver.find_elements(By.XPATH, "//p[@class=' price green']")
    # Availability is the second paragraph inside each book card
    s_availability = driver.find_elements(By.XPATH, "/html/body/section/div/div[2]/div[2]/div/div/p[2]")
    s_reviews = driver.find_elements(By.XPATH, "//p[@class='review green']")
    s_links = driver.find_elements(By.XPATH, "//a[@target='_blank']")
    # Read the first nine entries of each list
    for i in range(0,9):
        titles.append(s_titles[i].text)
        prices.append(s_prices[i].text)
        availability.append(s_availability[i].text)
        reviews.append(s_reviews[i].text)
        links.append(s_links[i].text)  # .text is the link's visible text; use get_attribute('href') for the URL
    
fun1()
driver.find_element(By.XPATH, '/html/body/section/div/div[1]/ul/li[2]/a').click()
fun1()
driver.find_element(By.XPATH, '/html/body/section/div/div[1]/ul/li[3]/a').click()
fun1()
    
data = {'Title/Author':titles, 'Price':prices, 'Available':availability, 'Review': reviews, 'Links':links}
df = pd.DataFrame(data)
df.to_csv('book3.csv', index = False)
df = pd.read_csv('book3.csv')
df

Unnamed: 0,Title/Author,Price,Available,Review,Links
0,Operating System Concepts\nBy Avi Silberschatz,Rs.2000,In stock,20 Reviews,Operating System Concepts\nBy Avi Silberschatz
1,UNIX The Textbook\nBy Syed Mansoor Sarwar,Rs.5000,In stock,100 Reviews,UNIX The Textbook\nBy Syed Mansoor Sarwar
2,Taxonomy of IDS\nBy Arif Butt,Rs.6900,Not in stock,20 Reviews,Taxonomy of IDS\nBy Arif Butt
3,Understanding operating systems\nBy Ida Flynn,Rs.2700,Not in stock,60 Reviews,Understanding operating systems\nBy Ida Flynn
4,Computer Systems\nBy Randal E. Bryant,Rs.1700,In stock,25 Reviews,Computer Systems\nBy Randal E. Bryant
5,Linux bible\nBook by Christopher Negus,Rs.1800,Not in stock,21 Reviews,Linux bible\nBook by Christopher Negus
6,Advanced Programming in the UNIX Environment\n...,Rs.6000,In stock,40 Reviews,Advanced Programming in the UNIX Environment\n...
7,Operating Systems: A Design-oriented Approach\...,Rs.1000,In stock,90 Reviews,Operating Systems: A Design-oriented Approach\...
8,Hands-On Network Programming with C\nBy Lewis ...,Rs.1800,In stock,70 Reviews,Hands-On Network Programming with C\nBy Lewis ...
9,LINUX & UNIX Programming Tools\nBy Syed Mansoo...,Rs.5000,In stock,200 Reviews,LINUX & UNIX Programming Tools\nBy Syed Mansoo...
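
In the table above, each `Title/Author` cell holds the title and the author on two lines, with the author line starting with `By ` or `Book by `. A small post-processing sketch can split them into separate values; the function name `split_title_author` is hypothetical, and the two prefixes are the only ones visible in this output.

```python
def split_title_author(text):
    """Split 'Title\\nBy Author' into a (title, author) tuple."""
    title, _, byline = text.partition("\n")
    # Strip the leading 'Book by ' or 'By ' from the author line, if present
    for prefix in ("Book by ", "By "):
        if byline.startswith(prefix):
            byline = byline[len(prefix):]
            break
    return title.strip(), byline.strip()

print(split_title_author("Operating System Concepts\nBy Avi Silberschatz"))
print(split_title_author("Linux bible\nBook by Christopher Negus"))
```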


### Mouse Operations
- `from selenium.webdriver import ActionChains`
- `act_obj = ActionChains(driver)`
- `act_obj.move_to_element(element).click().perform()` to hover the mouse over an element and then click it
- `act_obj.context_click(element).perform()` for right click on element
- `act_obj.double_click(element).perform()` for double click on element
- `act_obj.drag_and_drop(source, target).perform()` for drag and drop of a source to target
- `act_obj.drag_and_drop_by_offset(element, X, Y).perform()` for slider

### Fetching Data From Database
- `import mysql.connector` to import the connector
- `con = mysql.connector.connect(host='localhost',port=3305,user='admin',passwd='admin',database='mydb')` to create the connection object
- `curs = con.cursor()` to create a cursor, the object through which queries are sent and results are read
- `curs.execute(query)` to execute any query
- `con.close()` to close the connection
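
The connect, cursor, execute, close sequence above is the standard Python DB-API pattern. The sketch below shows the same sequence end to end, but swaps in the built-in `sqlite3` module with an in-memory database so that it runs without a MySQL server; the `users` table and its rows are made up. With `mysql.connector` the flow is the same, except that query placeholders are written `%s` instead of `?`.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL server
con = sqlite3.connect(":memory:")
curs = con.cursor()

# Made-up table holding login data that a script could feed into a form
curs.execute("CREATE TABLE users (name TEXT, password TEXT)")
curs.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("arif", "datascience"), ("guest", "guest")],
)

# Run a query and read all result rows as a list of tuples
curs.execute("SELECT name, password FROM users ORDER BY name")
rows = curs.fetchall()
con.close()

print(rows)
```

Each tuple in `rows` could then be passed to `send_keys()` to fill in a login form.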

## Example # 02: Scraping Multiple Web Pages that Employ Infinite Scrolling (https://arifpucit.github.io/bss2/scroll/)

***How to Scroll an Infinite Scrolling Web Page using Selenium***
- The `driver.execute_script(JS)` method is used to synchronously execute JavaScript in the current window/frame.
```
driver.execute_script('alert("Hello JavaScript")')
```
- The `window.scrollTo()` method is used to perform scrolling operation. The pixels to be scrolled horizontally along the x-axis and pixels to be scrolled vertically along the y-axis are passed as parameters to the method.
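
The two calls combine into a simple loop: scroll to the bottom, wait for new content to load, and stop once the page height no longer changes. The sketch below wraps that loop in a function; the name `scroll_to_bottom` and the `max_rounds` safety limit are additions for illustration, not part of Selenium.

```python
import time

def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls performed."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        # Jump to the current bottom of the page, then give new content time to load
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded, so the end was reached
        last_height = new_height
    return max_rounds  # stopped by the safety limit

# Possible usage with a live driver:
# scroll_to_bottom(driver)
```

The `max_rounds` limit guards against pages that never stop loading; the fixed `pause` is a simple choice that may need tuning for slow connections.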

In [88]:
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
import time

s = Service(r'C:\ChromeDriver\chromedriver.exe')
driver = Chrome(service=s)    
driver.get('https://arifpucit.github.io/bss2/scroll') 
driver.maximize_window()

# Keep scrolling to the bottom until the page height stops growing, then start scraping
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(2)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if (new_height == last_height):
        break
    last_height = new_height
    
titles = []
prices = []
availability=[]
reviews=[]
links=[]
star_rates =[]


books = driver.find_elements(By.CLASS_NAME, 'col-sm-4')
books_count = len(books)
for i in range(1,books_count+1):
    title = driver.find_element(By.XPATH,'//*[@id="container"]/div[' +str(i)+ ']/p[1]').text
    titles.append(title)
    price = driver.find_element(By.XPATH,'//*[@id="container"]/div[' +str(i)+ ']/div/p[1]').text
    prices.append(price)
    avail = driver.find_element(By.XPATH,'//*[@id="container"]/div[' +str(i)+ ']/div/p[2]').text
    availability.append(avail)
    review = driver.find_element(By.XPATH,'//*[@id="container"]/div[' +str(i)+ ']/div/p[3]').text
    reviews.append(review)
    link = driver.find_element(By.XPATH,'//*[@id="container"]/div[' +str(i)+ ']/p[1]/a').get_attribute('href')
    links.append(link)
    star_rate = driver.find_element(By.XPATH,'//*[@id="container"]/div[' +str(i)+ ']/div/div').get_attribute('data-rate-star')
    star_rate = round(float(star_rate),2)
    star_rates.append(star_rate)
    
    
data = {'Title/Author':titles, 'Price':prices, 'Availability':availability, 
        'Reviews':reviews, 'Links':links, 'StarRating': star_rates}
df = pd.DataFrame(data, columns=['Title/Author', 'Price', 'Availability', 'Reviews', 'Links', 'StarRating'])
df.to_csv('books3.csv', index=False)
df = pd.read_csv('books3.csv')
df

Unnamed: 0,Title/Author,Price,Availability,Reviews,Links,StarRating
0,Operating System Concepts\nBy Avi Silberschatz,Rs.9839,Not in stock,114 Reviews,https://www.amazon.com/Operating-System-Concep...,4.13
1,UNIX The Textbook\nBy Syed Mansoor Sarwar,Rs.6916,In stock,57 Reviews,https://www.google.com/search?q=Unix+the+textb...,5.00
2,Taxonomy of IDS\nBy Dr. Arif Butt,Rs.8549,In stock,16 Reviews,https://www.amazon.in/Taxonomy-Ids-Arif-Butt/d...,0.00
3,Things Fall Apart\nBy Chinua Achebe,Rs.4398,In stock,4 Reviews,https://en.wikipedia.org/wiki/Things_Fall_Apart,5.00
4,Fairy tales\nBy Hans Christian Andersen,Rs.3023,Not in stock,90 Reviews,https://en.wikipedia.org/wiki/Fairy_Tales_Told...,2.70
...,...,...,...,...,...,...
98,Mahabharata\nBy Vyasa,Rs.4712,In stock,180 Reviews,https://en.wikipedia.org/wiki/Mahabharata,5.00
99,Leaves of Grass\nBy Walt Whitman,Rs.9616,In stock,191 Reviews,https://en.wikipedia.org/wiki/Leaves_of_Grass,5.00
100,Mrs Dalloway\nBy Virginia Woolf,Rs.4445,In stock,100 Reviews,https://en.wikipedia.org/wiki/Mrs_Dalloway,1.69
101,To the Lighthouse\nBy Virginia Woolf,Rs.7536,Not in stock,81 Reviews,https://en.wikipedia.org/wiki/To_the_Lighthouse,2.62


In [89]:
driver.quit()