# Test Selenium
To install selenium follow the installation guide https://selenium-python.readthedocs.io/installation.html and download the chrome drivers from Test selenium is installed and up and running correctly the these suggested sites, I will be using Chrome, please ensure you download a linux driver in addition to your platforms driver as this will be used inside our docker image https://selenium-python.readthedocs.io/installation.html#drivers

TOC
- Setup
- Tabular Extraction
    - [Data-Cleansing](/notebooks/Test%20Selenium.ipynb#Data-Cleansing)
- Image Extraction
- RPA

In [1]:
# Import required Selenium packages
from selenium.webdriver.chrome.webdriver import WebDriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

import re
import json
import pandas as pd
import urllib.request
from datetime import datetime

## Setup Chrome Driver
Setup the chrome driver using the `selenium.webdriver.chrome.webdriver.WebDriver` as this will allow us to use the downloaded chrome driver executable. To do this we will create a service that points to the executable path using `selenium.webdriver.chrome.service.Service` (This use to be the `executable_path` parameter on the `WebDriver` but is now deprecated).

Next we will also instantiate the `selenium.webdriver.chrome.options.Options` so that we can set the how elements will load on the page. We will set it to eagrer so that all HTML elements are imediately accessible and not blocked by the painting of larger elements like images etc. `Options` has many cool parameters that can let you control things like timeouts etc for a full list check out these links:
- Options: https://www.selenium.dev/documentation/webdriver/drivers/options/
- ChromeOptions: https://www.selenium.dev/documentation/webdriver/browsers/chrome/

https://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.chrome.webdriver

In [2]:
# Set service to exe & options for page load to eager
service = Service('./chromedriver_mac_arm64_v109') # this should be the path to your downloaded driver

options = Options()
options.page_load_strategy = 'eager'

driver = WebDriver(service=service, options=options)

## Tabular Extraction

In [3]:
driver.get("https://au.finance.yahoo.com/gainers")

### Explore & Extract
Explore the page elements to determine how to appropriately select all the rows in the table, we then also want to extract each piece of data into the relevant feilds for later processing.

In [4]:
FINANCE_FEILDS = [
    "symbol",
    "name",
    "price",
    "change",
    "change %",
    "volume",
    "avg",
    "market cap",
    "pe ratio",
    "52 week"
]

# Grab rows
rows = driver.find_elements(By.CSS_SELECTOR, "div > table tbody tr")

In [5]:
%%time
values = []
for el in rows:
    row = {}
    for f, v in zip(FINANCE_FEILDS, el.find_elements(By.CSS_SELECTOR, "td")):
        row[f] = v.text 
    values = [*values, row]

CPU times: user 83.8 ms, sys: 7.15 ms, total: 91 ms
Wall time: 1.94 s


In [6]:
# print(json.dumps(values, indent=4))
values

[{'symbol': 'CCE.AX',
  'name': 'Carnegie Clean Energy Limited',
  'price': '0.0020',
  'change': '+0.0010',
  'change %': '+100.00%',
  'volume': '1.61M',
  'avg': '4.616M',
  'market cap': '31.285M',
  'pe ratio': 'N/A',
  '52 week': ''},
 {'symbol': 'ARE.AX',
  'name': 'Argonaut Resources NL',
  'price': '0.0030',
  'change': '+0.0010',
  'change %': '+50.00%',
  'volume': '65.429M',
  'avg': '9.588M',
  'market cap': '19.086M',
  'pe ratio': 'N/A',
  '52 week': ''},
 {'symbol': 'ANL.XA',
  'name': 'Amani Gold Limited',
  'price': '0.0015',
  'change': '+0.0005',
  'change %': '+50.00%',
  'volume': '1.5M',
  'avg': 'N/A',
  'market cap': 'N/A',
  'pe ratio': 'N/A',
  '52 week': ''},
 {'symbol': 'WBE.XA',
  'name': 'Whitebark Energy Limited',
  'price': '0.0015',
  'change': '+0.0005',
  'change %': '+50.00%',
  'volume': '583,880',
  'avg': 'N/A',
  'market cap': 'N/A',
  'pe ratio': '1.50',
  '52 week': ''},
 {'symbol': 'LNU.AX',
  'name': 'Linius Technologies Limited',
  'price':

In [7]:
%%time
values = [{f: v.text for f,v in zip(FINANCE_FEILDS, el.find_elements(By.CSS_SELECTOR, "td"))} for el in rows]

CPU times: user 65.7 ms, sys: 5.84 ms, total: 71.5 ms
Wall time: 798 ms


In [8]:
driver.close()
print(json.dumps(values, indent=4))

[
    {
        "symbol": "CCE.AX",
        "name": "Carnegie Clean Energy Limited",
        "price": "0.0020",
        "change": "+0.0010",
        "change %": "+100.00%",
        "volume": "1.61M",
        "avg": "4.616M",
        "market cap": "31.285M",
        "pe ratio": "N/A",
        "52 week": ""
    },
    {
        "symbol": "ARE.AX",
        "name": "Argonaut Resources NL",
        "price": "0.0030",
        "change": "+0.0010",
        "change %": "+50.00%",
        "volume": "65.429M",
        "avg": "9.588M",
        "market cap": "19.086M",
        "pe ratio": "N/A",
        "52 week": ""
    },
    {
        "symbol": "ANL.XA",
        "name": "Amani Gold Limited",
        "price": "0.0015",
        "change": "+0.0005",
        "change %": "+50.00%",
        "volume": "1.5M",
        "avg": "N/A",
        "market cap": "N/A",
        "pe ratio": "N/A",
        "52 week": ""
    },
    {
        "symbol": "WBE.XA",
        "name": "Whitebark Energy Limited",
        "pr

In [9]:
gainers = pd.DataFrame(values)
gainers

Unnamed: 0,symbol,name,price,change,change %,volume,avg,market cap,pe ratio,52 week
0,CCE.AX,Carnegie Clean Energy Limited,0.002,0.001,+100.00%,1.61M,4.616M,31.285M,,
1,ARE.AX,Argonaut Resources NL,0.003,0.001,+50.00%,65.429M,9.588M,19.086M,,
2,ANL.XA,Amani Gold Limited,0.0015,0.0005,+50.00%,1.5M,,,,
3,WBE.XA,Whitebark Energy Limited,0.0015,0.0005,+50.00%,583880,,,1.5,
4,LNU.AX,Linius Technologies Limited,0.003,0.001,+50.00%,755556,6.563M,8.938M,,
5,LML.XA,Lincoln Minerals Limited,0.066,0.021,+46.67%,8.763M,,,,
6,LML.AX,Lincoln Minerals Limited,0.067,0.021,+45.65%,44.937M,597816,92.223M,,
7,ENX.XA,Enegex Limited,0.043,0.012,+38.71%,85000,,,,
8,BME.XA,Black Mountain Energy Ltd,0.042,0.011,+35.48%,178331,,,,
9,BME.AX,Black Mountain Energy Ltd,0.043,0.011,+34.37%,1.203M,46348,10.965M,,


### Data Cleansing
Remove nuerical formating and cast colums to the correct data type

In [10]:
def numeric_conversion(v: any) -> float:
    '''
    Remove the string formating of the numercal data. 
    This includes the following characters: +,%M
    Additionally it replaces `N/A` with None
    
    v : any
        The value to transform, it is expected to be a string or castable to one
    
    returns : float 
        The numeric value represented as a float
    '''
    v = re.sub(r"(?i)(\+|,|%)", '', v)
       
    if 'M' in v:
        v = v.replace('M', '')
        v = float(v) * 1_000_000
    
    elif 'N/A' == v:
        v = None
        
    return v
    
        

In [11]:
gainers['price'] = gainers['price'].apply(numeric_conversion)
gainers['change'] = gainers['change'].apply(numeric_conversion)
gainers['change %'] = gainers['change %'].apply(numeric_conversion)
gainers['volume'] = gainers['volume'].apply(numeric_conversion)
gainers['avg'] = gainers['avg'].apply(numeric_conversion)
gainers['market cap'] = gainers['market cap'].apply(numeric_conversion)
gainers['pe ratio'] = gainers['pe ratio'].apply(numeric_conversion)

# Sketchy


try:
    del gainers['52 week']
except Exception as e:
    print(e)

gainers

Unnamed: 0,symbol,name,price,change,change %,volume,avg,market cap,pe ratio
0,CCE.AX,Carnegie Clean Energy Limited,0.002,0.001,100.0,1610000.0,4616000.0,31285000.0,
1,ARE.AX,Argonaut Resources NL,0.003,0.001,50.0,65429000.0,9588000.0,19086000.0,
2,ANL.XA,Amani Gold Limited,0.0015,0.0005,50.0,1500000.0,,,
3,WBE.XA,Whitebark Energy Limited,0.0015,0.0005,50.0,583880.0,,,1.5
4,LNU.AX,Linius Technologies Limited,0.003,0.001,50.0,755556.0,6563000.0,8938000.0,
5,LML.XA,Lincoln Minerals Limited,0.066,0.021,46.67,8763000.0,,,
6,LML.AX,Lincoln Minerals Limited,0.067,0.021,45.65,44937000.0,597816.0,92223000.0,
7,ENX.XA,Enegex Limited,0.043,0.012,38.71,85000.0,,,
8,BME.XA,Black Mountain Energy Ltd,0.042,0.011,35.48,178331.0,,,
9,BME.AX,Black Mountain Energy Ltd,0.043,0.011,34.37,1203000.0,46348.0,10965000.0,


In [20]:
### Save To C.S.V

datetime.today().strftime("%Y_%m_%d")

'2023_01_30'

In [21]:
gainers.to_csv('./data/tabular/gainers_%s.csv' % datetime.today().strftime("%Y_%m_%d"), index=False)

## Image Extraction

In [14]:
browser = WebDriver(service=service, options=options)
browser.get("http://books.toscrape.com/")

In [15]:
raw_books = browser.find_elements(By.CSS_SELECTOR, "article.product_pod")
raw_books

[<selenium.webdriver.remote.webelement.WebElement (session="3b0d2c8dde161329f73c7628563f0921", element="8a569879-2a3f-485d-952b-e56ca0dff926")>,
 <selenium.webdriver.remote.webelement.WebElement (session="3b0d2c8dde161329f73c7628563f0921", element="9df15391-4f03-4cfb-a083-d9a747c8428c")>,
 <selenium.webdriver.remote.webelement.WebElement (session="3b0d2c8dde161329f73c7628563f0921", element="255f3b36-ba0c-4990-bc81-4792bdf6cc0b")>,
 <selenium.webdriver.remote.webelement.WebElement (session="3b0d2c8dde161329f73c7628563f0921", element="46a9dd14-8a90-43e5-b075-f40c6f50c9b2")>,
 <selenium.webdriver.remote.webelement.WebElement (session="3b0d2c8dde161329f73c7628563f0921", element="e4d3bd10-49f9-474f-ae38-915069c6eb57")>,
 <selenium.webdriver.remote.webelement.WebElement (session="3b0d2c8dde161329f73c7628563f0921", element="31bb4177-231d-4d84-81d9-a73e672a9bf6")>,
 <selenium.webdriver.remote.webelement.WebElement (session="3b0d2c8dde161329f73c7628563f0921", element="6740d6dd-4d41-498d-a737-70

In [16]:
BOOK_FEILDS = [
    "link",
    "image",
    "title",
    "rating",
    "price"
]

def extract_rawbooks(book: any) -> dict:
    '''
    Extract the data that represents a book from the raw HTML elements
    This includes the image link, image, title, rating & price
    
    book : selenium.webdriver.remote.webelement.WebElement
        The HTML wrapper containing all book elements
        
    returns : dict
        The book data represented as a dictionary
    '''
    
    return {
        **extract_image_link(book),
        **extract_rating(book),
        **extract_price(book)
    }
        

def extract_image_link(book: any) -> dict:
    '''
    Extract the books cover image and URL
    
    book : selenium.webdriver.remote.webelement.WebElement
        The HTML elements containing all book elements
        
    returns : dict
        The star rating
    '''
    link = book.find_element(By.CSS_SELECTOR, "div.image_container a").get_attribute("href")
    image = book.find_element(By.CSS_SELECTOR, "div.image_container img")
    
    img_url = image.get_attribute("src")
    
    return {
        "link": link,
        "image": image
    }
    

def extract_rating(book: any) -> dict:
    '''
    Extract the rating out of 5 stars based on the 
    CSS coloring of the star elements
    
    book : selenium.webdriver.remote.webelement.WebElement
        The HTML elements containing all book elements
        
    returns : dict
        The star rating
    '''
    yellow_rbga = 'rgba(230, 206, 49, 1)'
    
    stars = book.find_elements(By.CSS_SELECTOR, "i.icon-star")
    rating = [star.value_of_css_property("color") for star in stars].count(yellow_rbga)
    
    return {
        "stars": rating,
        "rating": round((rating / len(stars)) * 100, 2)
    }


def clean_price(v: any) -> float:
    '''
    Ensure price is a float and remove any formating, 
    this extends our earlier function see: `numeric_conversion`
    
    v : any
        The value to transform, it is expected to be a string or castable to one
    
    returns : float 
        The numeric value represented as a float
    '''
    v = re.sub(r"(?i)(£)", '', str(v))
    return numeric_conversion(v)
    

def extract_price(book: any) -> dict:
    '''
    Extract the books price and availability
    
    book : selenium.webdriver.remote.webelement.WebElement
        The HTML elements containing all book elements
        
    returns : dict
        The star rating
    '''
    available = "In stock"
    price = book.find_element(By.CSS_SELECTOR, 'div.product_price p.price_color').text
    price = clean_price(price)
    
    status = book.find_element(By.CSS_SELECTOR, 'div.product_price p.availability').text
    
    availability = status == available
    
    print(price, status, availability)
    return {
        "price": price,
        "status": status,
        "is_availble": availability
    }
    

In [17]:


def extract_rawbooks(book: any) -> dict:
    '''
    Extract the data that represents a book from the raw HTML elements
    This includes the image link, image, title, rating & price
    
    book : selenium.webdriver.remote.webelement.WebElement
        The HTML wrapper containing all book elements
        
    returns : dict
        The book data represented as a dictionary
    '''
    yellow_rbga = 'rgba(230, 206, 49, 1)'
    available = "In stock"
    
    # Extract URL, Title & Image link
    link = book.find_element(By.CSS_SELECTOR, "div.image_container a").get_attribute("href")
    title = book.find_element(By.CSS_SELECTOR, "h3 a").text
    image = book.find_element(By.CSS_SELECTOR, "div.image_container img")
    
    img_url = image.get_attribute("src")
    
    # Extract Stars & Rating
    stars = book.find_elements(By.CSS_SELECTOR, "i.icon-star")
    rating = [star.value_of_css_property("color") for star in stars].count(yellow_rbga)
    
    # Extract Price & Status
    price = book.find_element(By.CSS_SELECTOR, 'div.product_price p.price_color').text
    status = book.find_element(By.CSS_SELECTOR, 'div.product_price p.availability').text
    
    # Clean Price & Store image
    price = clean_price(price)
    availability = status == available
    
    # Save book cover image to local
    with open(f'./data/img/{title}.png', 'wb') as file:
        file.write(image.screenshot_as_png)
        file.close()
        
    return {
        "title": title,
        "link": link,
        "image": img_url,
        "stars": rating,
        "rating": round((rating / len(stars)) * 100, 2),
        "price": price,
        "status": status,
        "is_availble": availability
    }


In [18]:
clean_books = [extract_rawbooks(rb) for rb in raw_books]
clean_books

[{'title': 'A Light in the ...',
  'link': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
  'image': 'http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
  'stars': 3,
  'rating': 60.0,
  'price': '51.77',
  'status': 'In stock',
  'is_availble': True},
 {'title': 'Tipping the Velvet',
  'link': 'http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html',
  'image': 'http://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg',
  'stars': 1,
  'rating': 20.0,
  'price': '53.74',
  'status': 'In stock',
  'is_availble': True},
 {'title': 'Soumission',
  'link': 'http://books.toscrape.com/catalogue/soumission_998/index.html',
  'image': 'http://books.toscrape.com/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg',
  'stars': 1,
  'rating': 20.0,
  'price': '50.10',
  'status': 'In stock',
  'is_availble': True},
 {'title': 'Sharp Objects',
  'link': 'http://books.toscrape.com/catalogue/sharp

In [19]:
books = pd.DataFrame(clean_books)
books

Unnamed: 0,title,link,image,stars,rating,price,status,is_availble
0,A Light in the ...,http://books.toscrape.com/catalogue/a-light-in...,http://books.toscrape.com/media/cache/2c/da/2c...,3,60.0,51.77,In stock,True
1,Tipping the Velvet,http://books.toscrape.com/catalogue/tipping-th...,http://books.toscrape.com/media/cache/26/0c/26...,1,20.0,53.74,In stock,True
2,Soumission,http://books.toscrape.com/catalogue/soumission...,http://books.toscrape.com/media/cache/3e/ef/3e...,1,20.0,50.1,In stock,True
3,Sharp Objects,http://books.toscrape.com/catalogue/sharp-obje...,http://books.toscrape.com/media/cache/32/51/32...,4,80.0,47.82,In stock,True
4,Sapiens: A Brief History ...,http://books.toscrape.com/catalogue/sapiens-a-...,http://books.toscrape.com/media/cache/be/a5/be...,5,100.0,54.23,In stock,True
5,The Requiem Red,http://books.toscrape.com/catalogue/the-requie...,http://books.toscrape.com/media/cache/68/33/68...,1,20.0,22.65,In stock,True
6,The Dirty Little Secrets ...,http://books.toscrape.com/catalogue/the-dirty-...,http://books.toscrape.com/media/cache/92/27/92...,4,80.0,33.34,In stock,True
7,The Coming Woman: A ...,http://books.toscrape.com/catalogue/the-coming...,http://books.toscrape.com/media/cache/3d/54/3d...,3,60.0,17.93,In stock,True
8,The Boys in the ...,http://books.toscrape.com/catalogue/the-boys-i...,http://books.toscrape.com/media/cache/66/88/66...,4,80.0,22.6,In stock,True
9,The Black Maria,http://books.toscrape.com/catalogue/the-black-...,http://books.toscrape.com/media/cache/58/46/58...,1,20.0,52.15,In stock,True
