# Selenium

### Here we use Selenium alongside geckodriver to create an automated browser instance, so we can scrape content that only appears after JavaScript events

In [1]:
!pip install selenium



In [2]:
from selenium import webdriver
from bs4 import BeautifulSoup as BS
import pandas as pd
pd.set_option('display.max_columns', None)

### Creating a webdriver

In [3]:
driver = webdriver.Firefox(executable_path='./geckodriver')

Creating a Selenium Firefox webdriver using geckodriver opens a real browser window that we can control from Python.


![example-img](../src/selenium-webdriver.png)

Using this driver, we can go to any page we'd like.

In [4]:
url = 'https://www.woolworths.co.za'

In [5]:
driver.get(url)

![example-img](../src/selenium-url.png)

### Why Selenium?

I think it's best to learn by example, so let's start!

Say we want to scrape the nutritional information for a list of Woolworths products. To develop the flow, we start with a single product.

In [6]:
product_id = 8000500037874
product_url = 'https://www.woolworths.co.za/cat?Ntt={}&Dy=1'.format(product_id)
driver.get(product_url)
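The product URL is just a search query with the product ID in the `Ntt` parameter. A small sketch of building it with the standard library's `urlencode` instead of string formatting; `build_product_url` is a hypothetical helper name, and the parameter names are copied from the URL above:

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.woolworths.co.za/cat'

def build_product_url(product_id):
    """Return the search URL for one product ID (hypothetical helper)."""
    # Ntt carries the search term; Dy=1 is copied unchanged from the URL above
    query = urlencode({'Ntt': product_id, 'Dy': 1})
    return '{}?{}'.format(BASE_URL, query)

print(build_product_url(8000500037874))
```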

In [7]:
soup = BS(driver.page_source, 'lxml')

### Inspect the HTML to see which element or class name we can use to find the table

In [8]:
soup.find('table')

### Nothing found: the table is not in the page source until a click event reveals it

Where BeautifulSoup finds elements with `.find` or `.find_all`, Selenium works a bit differently.
You specify what you are looking for (class name, id, tag name, XPath, ...) in the method name itself, e.g. `find_elements_by_class_name`.

In [9]:
clickable_list = driver.find_elements_by_class_name('accordion__toggle--chrome')
clickable_list

[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="b09e8b39-6771-4040-bda4-833fdd329e16", element="5bc321b7-c44b-4640-a06b-8684336536ab")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="b09e8b39-6771-4040-bda4-833fdd329e16", element="8379ef2e-62c5-4643-a6f7-ca58b40c3c16")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="b09e8b39-6771-4040-bda4-833fdd329e16", element="5675f8cb-f5a0-40ef-93b8-7935561a75e5")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="b09e8b39-6771-4040-bda4-833fdd329e16", element="cefa8500-0a5f-42d7-a071-ff7e07b86333")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="b09e8b39-6771-4040-bda4-833fdd329e16", element="c6fd1331-fb08-4ee8-b8ef-2313b665642a")>]

In [10]:
type(clickable_list[0])

selenium.webdriver.firefox.webelement.FirefoxWebElement

When working with unknown (or known) types, `dir` is a handy tool to check what you can do!

In [11]:
dir(clickable_list[0])

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_execute',
 '_id',
 '_parent',
 '_upload',
 '_w3c',
 'anonymous_children',
 'clear',
 'click',
 'find_anonymous_element_by_attribute',
 'find_element',
 'find_element_by_class_name',
 'find_element_by_css_selector',
 'find_element_by_id',
 'find_element_by_link_text',
 'find_element_by_name',
 'find_element_by_partial_link_text',
 'find_element_by_tag_name',
 'find_element_by_xpath',
 'find_elements',
 'find_elements_by_class_name',
 'find_elements_by_css_selector',
 'find_elements_by_id',
 'find_elements_by_link_text',
 'find_elements_by_name',
 'find_elements_by_partial_link_text',
 'find_elements_by_tag_name',
 'find_
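The `dir` output is long, and most of the dunder entries are not what we are after. A quick sketch of filtering it down to the public names; `public_names` is a hypothetical helper, shown here on a plain string so it runs without a browser:

```python
def public_names(obj):
    """List the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]

# Works on any object; a string is used here only as a stand-in
print(public_names('some string')[:5])
```

On the web element, `public_names(clickable_list[0])` would leave entries such as `click`, `clear` and the `find_element...` methods.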

In [12]:
clickable_list[0].tag_name

'h4'

In [13]:
for clickable in clickable_list:
    if clickable.text.lower() == 'nutritional information':
        clickable.click()

### Reload soup after click!

In [14]:
soup = BS(driver.page_source, 'lxml')

In [15]:
soup.find('table')

<table cellpadding="0" cellspacing="0" class="table table-scroll__table table--zebra table--nutrition"><thead class="table__head"><tr class="table-scroll__row"><th class="pdp-desc-font">Description</th><th class="pdp-desc-font">Per<br/>100g/ml</th><th class="pdp-desc-font">Per<br/>Serving</th><th class="pdp-desc-font">Measurement</th><th class="pdp-desc-font">% NRV<br/> per<br/>serving</th></tr></thead><tbody><tr class="table-scroll__row"><th>Portion Size</th><td>100</td><td></td><td></td><td></td></tr><tr class="table-scroll__row"><th>Energy</th><td>2419</td><td>301</td><td>kJ</td><td>-</td></tr><tr class="table-scroll__row"><th>Protein</th><td>8.8</td><td>1.1</td><td>g</td><td>-</td></tr><tr class="table-scroll__row"><th>Carbohydrate</th><td>42.3</td><td>5.3</td><td>g</td><td>-</td></tr><tr class="table-scroll__row"><th>    Of which Sugars</th><td>36.4</td><td>4.5</td><td>g</td><td>-</td></tr><tr class="table-scroll__row"><th>Total Fat</th><td>41.9</td><td>5.2</td><td>g</td><td>-</td

### Tables in HTML
- Rows are represented with `<tr>`
- Headers are represented with `<th>`
- Data is represented with `<td>`

```
<table>
 <tr><th>top-header1</th><th>top-header2</th><th>top-header3</th><th>top-header4</th></tr>
 <tr><th>side-header1</th><td>x_1</td><td>y_1</td><td>z_1</td></tr>
 <tr><th>side-header2</th><td>x_2</td><td>y_2</td><td>z_2</td></tr>
 <tr><th>side-header3</th><td>x_3</td><td>y_3</td><td>z_3</td></tr>
</table>
```
<table>
 <tr><th>top-header1</th><th>top-header2</th><th>top-header3</th><th>top-header4</th></tr>
 <tr><th>side-header1</th><td>x_1</td><td>y_1</td><td>z_1</td></tr>
 <tr><th>side-header2</th><td>x_2</td><td>y_2</td><td>z_2</td></tr>
 <tr><th>side-header3</th><td>x_3</td><td>y_3</td><td>z_3</td></tr>
</table>


![example](../src/selenium-table.png)
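To see how the three tags map onto data, here is a small sketch that parses a cut-down version of the example table with BeautifulSoup. It uses the built-in `html.parser` so it runs without `lxml`; the variable names are just for illustration:

```python
from bs4 import BeautifulSoup as BS

html = """
<table>
 <tr><th>top-header1</th><th>top-header2</th><th>top-header3</th></tr>
 <tr><th>side-header1</th><td>x_1</td><td>y_1</td></tr>
 <tr><th>side-header2</th><td>x_2</td><td>y_2</td></tr>
</table>
"""

soup = BS(html, 'html.parser')
rows = soup.find_all('tr')

# The first row holds the column headers
columns = [th.text for th in rows[0].find_all('th')]

# Every other row: one side header (<th>) followed by its data cells (<td>)
data = {row.th.text: [td.text for td in row.find_all('td')] for row in rows[1:]}

print(columns)
print(data)
```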

In [16]:
table_rows = soup.find_all('tr')
table_rows

[<tr class="table-scroll__row"><th class="pdp-desc-font">Description</th><th class="pdp-desc-font">Per<br/>100g/ml</th><th class="pdp-desc-font">Per<br/>Serving</th><th class="pdp-desc-font">Measurement</th><th class="pdp-desc-font">% NRV<br/> per<br/>serving</th></tr>,
 <tr class="table-scroll__row"><th>Portion Size</th><td>100</td><td></td><td></td><td></td></tr>,
 <tr class="table-scroll__row"><th>Energy</th><td>2419</td><td>301</td><td>kJ</td><td>-</td></tr>,
 <tr class="table-scroll__row"><th>Protein</th><td>8.8</td><td>1.1</td><td>g</td><td>-</td></tr>,
 <tr class="table-scroll__row"><th>Carbohydrate</th><td>42.3</td><td>5.3</td><td>g</td><td>-</td></tr>,
 <tr class="table-scroll__row"><th>    Of which Sugars</th><td>36.4</td><td>4.5</td><td>g</td><td>-</td></tr>,
 <tr class="table-scroll__row"><th>Total Fat</th><td>41.9</td><td>5.2</td><td>g</td><td>-</td></tr>,
 <tr class="table-scroll__row"><th>    Of which mono unsaturated fatty acids</th><td>25.1</td><td>3.1</td><td>g</td><t

In [17]:
nutrient_dict = {}
for row in table_rows[1:]:
    nutrient_dict[row.th.text] = row.td.text

In [18]:
nutrient_dict

{'Portion Size': '100',
 'Energy': '2419',
 'Protein': '8.8',
 'Carbohydrate': '42.3',
 '    Of which Sugars': '36.4',
 'Total Fat': '41.9',
 '    Of which mono unsaturated fatty acids': '25.1',
 '    Of which poly unsaturated fatty acids': '3.1',
 '    Of which saturated fatty acids': '13.7',
 '    Of which trans fatty acids': '0.1',
 'Cholesterol': '4.9',
 'Dietary Fibre': '4.5',
 'Sodium': '43'}
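The scraped keys keep their leading indentation and every value is still a string. A minimal sketch of tidying that up before analysis; `clean_nutrients` is a hypothetical helper, and keeping non-numeric values unchanged is just one possible choice:

```python
def clean_nutrients(raw):
    """Strip whitespace from keys and convert numeric strings to floats."""
    cleaned = {}
    for name, value in raw.items():
        try:
            cleaned[name.strip()] = float(value)
        except ValueError:
            # Keep non-numeric values (such as '-' or '') as they are
            cleaned[name.strip()] = value
    return cleaned

sample = {'Energy': '2419', '    Of which Sugars': '36.4', 'Sodium': '43'}
print(clean_nutrients(sample))
```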

### With the scrape flow working for one product, let's put it into a function

In [19]:
def scrape_product_nutrition(product_id):
    
    # make product url
    product_url = 'https://www.woolworths.co.za/cat?Ntt={}&Dy=1'.format(str(product_id))
    
    # point driver to url
    driver.get(product_url)
    
    # generate clickable list
    clickable_list = driver.find_elements_by_class_name('accordion__toggle--chrome')
    
    # find the clickable corresponding to nutritional information and click it
    for clickable in clickable_list:
        if clickable.text.lower() == 'nutritional information':
            clickable.click()
            break
            
    # make some new soup
    soup = BS(driver.page_source, 'lxml')
    
    # find the product name for some human readability
    product = soup.find(attrs={'class':'prod-name'}).text
    
    # create a list of all the table rows
    table_rows = soup.find_all('tr')
    
    # initiate the data dictionary
    nutrient_dict = {}
    nutrient_dict['product_id'] = str(product_id)
    nutrient_dict['product_name']= product
    
    # fill the dictionary with headings and data
    for row in table_rows[1:]:
        nutrient_dict[row.th.text] = row.td.text
    
    return nutrient_dict

### Check that it scrapes correctly

In [20]:
scrape_product_nutrition(8000500037874)

{'product_id': '8000500037874',
 'product_name': 'Ferrero Rocher 200g',
 'Portion Size': '100',
 'Energy': '2419',
 'Protein': '8.8',
 'Carbohydrate': '42.3',
 '    Of which Sugars': '36.4',
 'Total Fat': '41.9',
 '    Of which mono unsaturated fatty acids': '25.1',
 '    Of which poly unsaturated fatty acids': '3.1',
 '    Of which saturated fatty acids': '13.7',
 '    Of which trans fatty acids': '0.1',
 'Cholesterol': '4.9',
 'Dietary Fibre': '4.5',
 'Sodium': '43'}
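One way to make the function easier to test is to split the parsing from the browser work. The sketch below takes an HTML string and only looks inside the nutrition table, using the `table--nutrition` class seen in the output earlier. That the class name is the same for every product is an assumption, and `parse_nutrition` is a hypothetical helper:

```python
from bs4 import BeautifulSoup as BS

def parse_nutrition(html):
    """Return {row header: first data cell} for the nutrition table, or {} if absent."""
    soup = BS(html, 'html.parser')
    # Class name taken from the table printed earlier; assumed stable across products
    table = soup.find('table', class_='table--nutrition')
    if table is None:
        return {}
    nutrients = {}
    for row in table.find_all('tr'):
        # Skip rows without both a header and a data cell (e.g. the column-header row)
        if row.th is None or row.td is None:
            continue
        nutrients[row.th.text.strip()] = row.td.text.strip()
    return nutrients

sample = '''
<table class="table table--zebra table--nutrition">
 <thead><tr><th>Description</th><th>Per 100g/ml</th></tr></thead>
 <tbody>
  <tr><th>Energy</th><td>2419</td></tr>
  <tr><th>    Of which Sugars</th><td>36.4</td></tr>
 </tbody>
</table>
'''
print(parse_nutrition(sample))
print(parse_nutrition('<p>no table here</p>'))
```

In the notebook flow this would be called as `parse_nutrition(driver.page_source)` after the click.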

### Try it for a list of essential items!

In [21]:
essential_list = [3046920029759, 6009204330887, 6009801741758, 6001275000003, 6009178222607]

In [None]:
products = []
for product_id in essential_list:
    data_dict = scrape_product_nutrition(product_id)
    products.append(data_dict)
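A product page without a name element or a nutrition panel makes the scrape function raise (for example an `AttributeError` when `soup.find` returns `None`), which would stop the whole loop. A sketch of a more forgiving loop that records failures and pauses between requests; `scrape_many` and the one second delay are assumptions for illustration:

```python
import time

def scrape_many(product_ids, scrape_fn, delay_seconds=1.0):
    """Call scrape_fn on each ID, collecting results and failures separately."""
    results, failed = [], []
    for product_id in product_ids:
        try:
            results.append(scrape_fn(product_id))
        except Exception as error:  # broad on purpose: record the failure, keep going
            failed.append((product_id, repr(error)))
        # Pause between requests to avoid hammering the site
        time.sleep(delay_seconds)
    return results, failed
```

Here that would be `products, failed = scrape_many(essential_list, scrape_product_nutrition)`.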

In [None]:
df = pd.DataFrame(products)

In [None]:
df.fillna(0)
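Two details are easy to miss here: products with different nutrient rows produce `NaN` cells, and `fillna` returns a new frame instead of changing `df` in place. A small sketch with made-up rows (the values are illustrative, not scraped) that also converts the scraped strings to numbers:

```python
import pandas as pd

# Made-up rows standing in for scraped dictionaries
products = [
    {'product_id': '1', 'Energy': '2419', 'Sodium': '43'},
    {'product_id': '2', 'Energy': '150'},  # no Sodium row scraped
]

df = pd.DataFrame(products).set_index('product_id')
# Scraped values are strings; convert them, turning non-numeric text into NaN
df = df.apply(pd.to_numeric, errors='coerce')
# fillna returns a new frame, so assign the result to keep it
df = df.fillna(0)
print(df)
```

Whether a missing nutrient should really count as 0 depends on what you do with the data afterwards.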

## That concludes scraping with Selenium!