# Web scraping with Selenium and Beautiful Soup

To install Selenium for use in a Jupyter notebook, run:

py -m pip install -U selenium

Place the web driver executable (e.g. chromedriver.exe) in the working folder, or pass its full path when creating the driver.


## Saving a PNG file from a website to disk (with Beautiful Soup)

In [40]:
import requests,os,bs4

In [41]:
url="http://xkcd.com"

In [44]:
os.chdir("D:/temp")
os.makedirs("xkcd",exist_ok=True)

In [46]:
print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,"lxml")

Downloading page http://xkcd.com...


In [47]:
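comicElem = soup.select('#comic img')  # <img> tags inside the element with id="comic"
comicUrl = None  # stays None if no image is found
if not comicElem:
    print('Could not find comic image.')
else:
    comicUrl = comicElem[0].get('src')  # value of the src attribute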

In [48]:
comicUrl

'//imgs.xkcd.com/comics/bad_code.png'

In [50]:
res = requests.get("http:" + comicUrl)  # comicUrl starts with //, so prepend the scheme
res.raise_for_status()
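
The cell above prepends "http:" by hand because the src value is protocol-relative (it starts with //). A minimal sketch of an alternative, assuming the standard-library urllib.parse.urljoin is acceptable here. The helper name absolute_url is hypothetical and not part of the original notebook.

```python
from urllib.parse import urljoin

def absolute_url(page_url, src):
    """Resolve an image src (relative, protocol-relative, or absolute) against the page URL."""
    return urljoin(page_url, src)

# The scheme is taken from the page URL when src starts with //
print(absolute_url("http://xkcd.com", "//imgs.xkcd.com/comics/bad_code.png"))
# prints http://imgs.xkcd.com/comics/bad_code.png
```

With this helper the download call could be written as requests.get(absolute_url(url, comicUrl)), which also copes with src values that are already absolute.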

In [52]:
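# Save the image under xkcd/ using the file name from the URL.
# The with block closes the file even if a write fails.
with open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb') as imageFile:
    for chunk in res.iter_content(100000):  # stream the body in 100,000-byte chunks
        imageFile.write(chunk)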
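
The download step above can be wrapped in a small reusable function. This is only a sketch: the name save_chunks and its parameters are hypothetical, and it accepts any iterable of byte chunks so that it can be tried without a network connection.

```python
import os
from urllib.parse import urlparse

def save_chunks(chunks, url, folder):
    """Write an iterable of byte chunks into folder, named after the last path segment of url."""
    filename = os.path.basename(urlparse(url).path)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "wb") as image_file:  # closed automatically, even on error
        for chunk in chunks:
            image_file.write(chunk)
    return path
```

In the notebook this would be called as save_chunks(res.iter_content(100000), comicUrl, "xkcd").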

## Using Selenium

Install the selenium package with the following command:

py -m pip install -U selenium

In [1]:
from selenium import webdriver
import time

In [2]:
driver = webdriver.Chrome('F:/R_Working_Directory/chromedriver.exe')  # Optional argument, if not specified will search path.
driver.get('http://www.google.com/xhtml')
time.sleep(5) # Let the user actually see something!
search_box = driver.find_element_by_name('q')
search_box.send_keys('ChromeDriver')
search_box.submit()
time.sleep(5) # Let the user actually see something!
#driver.quit()

WebDriverException: Message: chrome not reachable
  (Session info: chrome=63.0.3239.84)
  (Driver info: chromedriver=2.33.506120 (e3e53437346286c0bc2d2dc9aa4915ba81d9023f),platform=Windows NT 6.1.7601 SP1 x86_64)


In [55]:
driver = webdriver.Chrome('F:/R_Working_Directory/chromedriver.exe')  # Optional argument, if not specified will search path.
driver.get('http://inventwithpython.com')


### Refer to the ppt file for:

- Selenium's WebDriver methods for finding elements

- WebElement attributes and methods

### Example: using find_element_by_XXX from Selenium

In [17]:
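from selenium.common.exceptions import NoSuchElementException

try:
    elem=driver.find_element_by_class_name("cover-thumb")
    print("Found <%s> element with that class name" %(elem.tag_name))
except NoSuchElementException:  # raised when no element matches
    print("Was not able to find an element with that class name")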

Found <img> element with that class name
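
The find_element_by_XXX methods used in this notebook belong to the older Selenium API. In Selenium 4 they were deprecated and later removed in favour of find_element(by, value). The sketch below is an assumption-laden helper, not part of the original notebook: the name to_locator and the lookup table are hypothetical, and the strategy strings are the values of the constants in selenium.webdriver.common.by.By.

```python
# Strategy strings; these match the constants in selenium.webdriver.common.by.By
LEGACY_SUFFIX_TO_STRATEGY = {
    "id": "id",
    "name": "name",
    "class_name": "class name",
    "tag_name": "tag name",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "css_selector": "css selector",
    "xpath": "xpath",
}

def to_locator(legacy_method, value):
    """Translate e.g. ('find_element_by_class_name', 'cover-thumb') into a (strategy, value) pair."""
    for prefix in ("find_elements_by_", "find_element_by_"):
        if legacy_method.startswith(prefix):
            suffix = legacy_method[len(prefix):]
            return (LEGACY_SUFFIX_TO_STRATEGY[suffix], value)
    raise ValueError("not a find_element(s)_by_* method: %s" % legacy_method)

print(to_locator("find_element_by_class_name", "cover-thumb"))
# prints ('class name', 'cover-thumb')
```

On a Selenium 4 installation the resulting pair can be unpacked into the newer call, for example driver.find_element(*to_locator("find_element_by_class_name", "cover-thumb")).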


In [56]:
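elem=driver.find_elements_by_link_text("More Info")  # returns a list of matching elements
#elem[0].click()  # click one item of the list; the list itself has no click()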

In [20]:
len(elem)

5

In [28]:
elem[1].get_attribute("href")

'http://inventwithpython.com/#cracking'

In [29]:
[i.get_attribute("href") for i in elem]

['http://inventwithpython.com/#automate',
 'http://inventwithpython.com/#cracking',
 'http://inventwithpython.com/#invent',
 'http://inventwithpython.com/#pygame',
 'http://inventwithpython.com/#scratch']

### Clicking an HTML element

In [57]:
elem[1].click()

In [120]:
driver = webdriver.Chrome('F:/R_Working_Directory/chromedriver.exe')
driver.get('http://www.parknshop.com/Beverages,%20Wine%20&%20Spirits/lc/040000')

In [121]:
elem=driver.find_element_by_class_name("btn-show-more")
if elem.get_attribute("data-hasnextpage")=="true":
    elem.click()

In [118]:
elem.get_attribute("data-hasnextpage")

'true'

In [64]:
div1=driver.find_element_by_class_name("product-container")

In [65]:
L1=div1.find_elements_by_tag_name('a')

In [94]:
product_list=[i.text for i in L1 if i.text !=""]


In [95]:
len(product_list)

72

In [96]:
product_list[:10]

['CARLSBERG SPECIAL BREW',
 'COCA-COLA Coca-Cola',
 'DEVONDALE Full Cream UHT Milk',
 'IF iF Local Sensation 100% coconut water (350mL)',
 'LUK YU CHINESE TEABAGS-PU ERH',
 'MARTELL Cordon Bleu Cognac (Qt.)',
 'NESCAFE GOLD BLEND',
 'NESCAFE Regular Coffee',
 'OVALTINE NUTRITIONAL MALTED MILK',
 'PENFOLDS KOONUNGA HILL SHZ CAB 37.5CL']

In [105]:
L3=div1.find_elements_by_class_name("discount")

In [106]:
len(L3)

72

In [107]:
price_discount_list=[i.text for i in L3]
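
Both lists have 72 entries, so they can be combined into (name, price) pairs. This is a sketch only: it assumes the i-th link text and the i-th "discount" element belong to the same product, which the notebook does not verify, and the function name pair_products_with_prices is hypothetical.

```python
def pair_products_with_prices(names, prices):
    """Zip product names with price strings, refusing lists of different length."""
    if len(names) != len(prices):
        raise ValueError("got %d names but %d prices" % (len(names), len(prices)))
    return list(zip(names, prices))

# Placeholder values for illustration only, not scraped data
pairs = pair_products_with_prices(["Product A", "Product B"], ["price A", "price B"])
print(pairs)
# prints [('Product A', 'price A'), ('Product B', 'price B')]
```

In the notebook the call would be pair_products_with_prices(product_list, price_discount_list). The length check matters because product_list drops links with empty text while price_discount_list is not filtered.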

In [110]:
from collections import Counter
Counter(product_list)

Counter({"ASAHI Beer Can 12'S": 1,
         "CARLSBERG 12's Can Beer": 1,
         "CARLSBERG 4's King Can Beer": 1,
         "CARLSBERG 6's Can Beer": 1,
         'CARLSBERG Quart Bottle Beer': 1,
         'CARLSBERG SPECIAL BREW': 1,
         'COCA-COLA Coca-Cola': 6,
         'DEVONDALE FULL CREAM UHT MILK': 1,
         'DEVONDALE Full Cream UHT Milk': 1,
         'DEVONDALE SKIM UHT MILK': 1,
         'DEVONDALE Skim UHT Milk': 1,
         'GEKKEIKAN Kishu Nanko Umeshu 720ml': 1,
         'GEKKEIKAN Momoshu 300 ml': 1,
         'IF COCONUT WATER MULTIPACK': 1,
         'IF iF Local Sensation 100% coconut water (350mL)': 1,
         'JAX COCO 100 PURE COCONUT WATER': 2,
         'JAX COCO 100 PURE COCONUT WATER - GLASS': 1,
         'KOH COCONUT WATER': 1,
         'LUK YU CHINESE TEABAGS-IRON BUDDHA': 1,
         'LUK YU CHINESE TEABAGS-JASMINE': 2,
         'LUK YU CHINESE TEABAGS-OOLONG': 1,
         'LUK YU CHINESE TEABAGS-PU ERH': 1,
         'MARTELL CHANTELOUP PERSPECTIVE': 1
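
The counts above show that some names occur more than once (for example 'COCA-COLA Coca-Cola' appears 6 times) and that some entries differ only in capitalisation. A minimal sketch for listing such repeats follows; the function name repeated_names and the ignore_case option are hypothetical additions.

```python
from collections import Counter

def repeated_names(names, ignore_case=False):
    """Return {name: count} for names occurring more than once."""
    keys = [n.casefold() if ignore_case else n for n in names]
    return {name: count for name, count in Counter(keys).items() if count > 1}

sample = ["COCA-COLA Coca-Cola"] * 6 + [
    "DEVONDALE FULL CREAM UHT MILK",
    "DEVONDALE Full Cream UHT Milk",
    "KOH COCONUT WATER",
]
print(repeated_names(sample))
# prints {'COCA-COLA Coca-Cola': 6}
print(repeated_names(sample, ignore_case=True))
# prints {'coca-cola coca-cola': 6, 'devondale full cream uht milk': 2}
```

Names that repeat may be different pack sizes sharing one link text, so a repeat is a prompt to inspect the page rather than proof of a duplicate.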

In [61]:
from bs4 import BeautifulSoup

In [62]:
html=driver.page_source

In [63]:
soup = BeautifulSoup(html, "lxml")

Get the southbound figures for "Shanghai Connect" from the following page:

http://www.hkex.com.hk/Mutual-Market/Stock-Connect?sc_lang=en

