# Webscraping dynamic websites
---

Following on from the webscraping basics notebook, where we scraped a static webpage, we use this notebook to experiment with extracting data from a dynamic website. In particular, we look to scrape the prices of a given search item from Amazon. We aim to build familiarity with the structure of dynamic webpages, and to adapt our code to extract the desired information.

In [1]:
import re
import bs4
import requests
from bs4 import BeautifulSoup

We start by setting our target web address. We also naively request the webpage to see what response we initially get.

In [2]:
url = 'https://www.amazon.co.uk'
page = requests.get(url)
print(page.text)

<!DOCTYPE html>
<!--[if lt IE 7]> <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->
<!--[if IE 7]>    <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->
<!--[if IE 8]>    <html lang="en-us" class="a-no-js a-lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="a-no-js" lang="en-us"><!--<![endif]--><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title dir="ltr">Amazon.co.uk</title>
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">
<script>

if (true === true) {
    var ue_t0 = (+ new Date()),
        ue_csm = window,
        ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
        ue_furl = "fls-eu.amazon.co.uk",
        ue_mid = "A1F83G8C2ARO7P",

Note that the above response is remarkably short for what might be expected from the landing page of Amazon. This is because Amazon is a dynamic website, and much of what you would see in your browser is produced "dynamically", i.e. it is constructed on your computer rather than sent directly from a server. A large chunk of the above is in fact JavaScript code; this is what is used to construct our "personal" Amazon homepage.
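
As a rough illustration of this point, the sketch below counts the script tags in a raw response and shows how little text remains once they are stripped out. The HTML string, the function name summarise_static_html and the dictionary keys are all made up for this example; they are not taken from Amazon's real page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page standing in for the raw response of a dynamic site
sample_html = """
<html><head><title>Shop</title>
<script>var items = ["a", "b"]; render(items);</script>
</head><body>
<div id="app"></div>
<script src="bundle.js"></script>
</body></html>
"""

def summarise_static_html(html):
    """Count <script> tags and collect the text left once scripts are removed."""
    soup = BeautifulSoup(html, "html.parser")
    n_scripts = len(soup.find_all("script"))
    # Drop script and style tags so that only ordinary text content remains
    for tag in soup(["script", "style"]):
        tag.decompose()
    remaining_text = soup.get_text(separator=" ", strip=True)
    return {"script_tags": n_scripts, "remaining_text": remaining_text}

summary = summarise_static_html(sample_html)
print(summary)
```

If the remaining text is tiny compared with what the browser displays, that is a reasonable hint that the content is being built by JavaScript after the page loads.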

To overcome this, we need a new approach. For this we use the package selenium, which can handle these dynamically produced websites.

In [3]:
import selenium

Selenium requires a webdriver to interface with these pages. We have opted to use chromedriver as Chrome is our chosen browser.

In [4]:
driver_loc = 'C:/Users/Decla/Downloads/chromedriver_win32/chromedriver.exe'

To start our webscraping process, we also need some additional objects. We import these.

In [5]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import os
import time

We can now set some basic options for our browser interface (handled by selenium). In particular, we want to use headless mode, which means that when the website is accessed and rendered, it will not be shown in a visible window (i.e. it will be loaded in the background).

In [6]:
# Define options
chrome_options = Options()
chrome_options.add_argument("--headless")

# Note that some websites deny headless browsers; changing the user-agent may get around this
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'    
chrome_options.add_argument('user-agent={0}'.format(user_agent))

# Create driver
driver = webdriver.Chrome(options=chrome_options)

To get started, we shall instead attempt to scrape a dynamic job listing website.

In [7]:
test_url = 'https://www.naukri.com/top-jobs-by-designations#desigtop600'

Now we provide this address to the driver, which will request the page and use our browser driver to execute any dynamic aspects (e.g. JavaScript-driven artifacts) on it. Note that JavaScript may take some time to execute, and servers may take some time to return everything, so we use Python's built-in time.sleep function to allow some additional time for these to complete.

In [8]:
# Get our target website
driver.get(test_url)
# Give the webpage time to load fully (i.e. for javascript to execute and return)
time.sleep(3)

# Get HTML source code
html = driver.page_source
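
A fixed sleep either waits longer than necessary or not long enough. One common alternative is to poll for a condition until it holds or a timeout expires. The sketch below shows that idea in plain Python; wait_until, page_is_ready and the checks counter are hypothetical names introduced for illustration, and the "page" here is simulated rather than a real browser.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Stand-in for a page that only becomes ready on the third check
checks = {"count": 0}

def page_is_ready():
    checks["count"] += 1
    return checks["count"] >= 3

wait_until(page_is_ready, timeout=5.0, poll_interval=0.01)
print("Ready after", checks["count"], "checks")
```

With the driver above, the condition could be something like checking whether the expected element id appears in driver.page_source, though that is an untested assumption about this particular site. Selenium also ships its own version of this pattern as WebDriverWait in selenium.webdriver.support.ui.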

Having obtained the complete compiled HTML code for the webpage, we can use BeautifulSoup to make the DOM readily searchable and accessible.

In [10]:
soup = BeautifulSoup(html, "html.parser")

Looking at the HTML code of the webpage through a browser, we can see that all of the job listings are contained within a div element which has the id nameSearch. We can thus use methods of the BeautifulSoup object to find this section of code to search and scrape further.

In [14]:
div = soup.find('div', id='nameSearch')

Further inspection shows that the actual job listings appear in the HTML code as a (anchor) elements. We print a sample of these.

In [16]:
# Print out the text from the top ten <a> tags
job_listings = div.find_all('a')
for job in job_listings[:10]:
    print(job.text)

Assistant Professor Jobs
Lecturer Jobs
Business Analyst Jobs
Computer Operator Jobs
Company Secretary Jobs
Graphic Designer Jobs
Lab Technician Jobs
Accountant Jobs
Librarian Jobs
Medical Representative Jobs
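
Each listing is a link, so it can be useful to keep the target address alongside the text. The sketch below does this on a small hand-written fragment that mimics the structure described above (a div with id nameSearch containing a elements). The href values and the third entry are invented for illustration, and the names sample_soup, container and listings are chosen so as not to overwrite the variables used earlier.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the structure of the job listings container
sample_html = """
<div id="nameSearch">
  <a href="/assistant-professor-jobs">Assistant Professor Jobs</a>
  <a href="/lecturer-jobs">Lecturer Jobs</a>
  <a>Entry without a link</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
container = sample_soup.find("div", id="nameSearch")

listings = []
if container is not None:  # find() returns None when nothing matches
    # href=True keeps only the <a> tags that actually carry an href attribute
    for link in container.find_all("a", href=True):
        listings.append({
            "title": link.get_text(strip=True),
            "href": link["href"],
        })

for entry in listings:
    print(entry["title"], "->", entry["href"])
```

The check for None guards against the case where the page has not finished rendering, or the site has changed its layout, and the container is missing.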


In [9]:
# Close browser
driver.quit()