<html>
  <head>
    <title>Hello PyLadies!</title>
  </head>
  <body>
    <h2 class='opening'>Welcome to my <a href="http://melbourne.pyladies.com/">PyLadies Melbourne</a> talk</h2>
      <p>A few comments:</p>
      <li id='new'>I'm pretty new to this</li>
      <li id='fun'>But it's been fun to learn</li>
      <p></p>
    <p class = '1'>Hopefully you'll learn something.</p>
      <div>There is a lot to learn.
          <div>Certainly there is much I do not know.</div>
      </div>
    <p class = '2'>So hopefully I'll learn something.</p>
    <p class = '3'>Hopefully we'll all have some fun.</p>
      <div>Because learning is often fun!</div>
    <p class = 'closing'>Let's get started!</p>
  </body>
</html>

## HTML in Python
Python itself doesn't understand HTML, but plenty of packages exist for working with HTML in Python.<br>
<br>
Among the most popular are:
* [Scrapy](https://scrapy.org) - a powerful framework for web scraping
* [Requests](https://2.python-requests.org/en/master/) - an HTTP library for fetching raw HTML from a website
* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) - an HTML parser, useful for extracting information from HTML

If you plan to do lots of web scraping, Scrapy is probably for you. There's more to learn up front to get a project going, but much more flexibility afterwards. It's also all-in-one, covering everything from fetching the HTML through to processing it.<br>
If you're looking for a simple way to get started, BeautifulSoup may be easier and may do everything you need.<br>
<br>
For today, I'm focusing on BeautifulSoup for the easy-to-get-started factor.

In [94]:
#imports
from scrapy.selector import Selector

import requests
from bs4 import BeautifulSoup

import pandas as pd


## A little taste of Scrapy & XPath
XPath locates elements within HTML by following the tag structure of the document.
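To see the idea in isolation, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` instead of Scrapy. ElementTree only supports a small subset of XPath and needs well-formed markup, so the snippet is a cut-down, tidied version of the example page. Real pages are rarely this tidy, which is one reason to reach for Scrapy's `Selector`.

```python
import xml.etree.ElementTree as ET

# A cut-down, well-formed version of the example page (an assumption for this sketch)
snippet = """
<html>
  <body>
    <h2 class="opening">Welcome</h2>
    <li id="new">I'm pretty new to this</li>
    <li id="fun">But it's been fun to learn</li>
  </body>
</html>
"""

root = ET.fromstring(snippet.strip())  # root is the <html> element

# Follow the tags step by step: html -> body -> li
by_path = [li.text for li in root.findall("./body/li")]

# Or search the whole tree for any element with a matching attribute
by_attr = [el.text for el in root.findall(".//*[@id='new']")]

print(by_path)
print(by_attr)
```

The first query walks the tree one tag at a time, while the second ignores position and matches on the `id` attribute alone.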

In [93]:
html_str = '''
<html>
  <head>
    <title>Hello PyLadies!</title>
  </head>
  <body>
    <h2 class='opening'>Welcome to my <a href="http://melbourne.pyladies.com/">PyLadies Melbourne</a> talk</h2>
      <p>A few comments:</p>
      <li id='new'>I'm pretty new to this</li>
      <li id='fun'>But it's been fun to learn</li>
      <p></p>
    <p class = '1'>Hopefully you'll learn something.</p>
      <div>There is a lot to learn.
          <div>Certainly there is much I do not know.</div>
      </div>
    <p class = '2'>So hopefully I'll learn something.</p>
    <p class = '3'>Hopefully we'll all have some fun.</p>
      <div>Because learning is often fun!</div>
    <p class = 'closing'>Let's get started!</p>
  </body>
</html>
'''

In [95]:
xpath_selector = Selector(text=html_str)

In [98]:
print(xpath_selector.xpath('/html/body/h2/text()').extract())

['Welcome to my ', ' talk']


In [103]:
#keep track of your single/double quotes when working with html :)
path = '//*[@class = "opening"]/text()'
xpath_selector.xpath(path).extract()

['Welcome to my ', ' talk']

In [108]:
path1 = '/html/body/li/text()'
path2 = '//*[@id = "new"]/text()'
print(xpath_selector.xpath(path1).extract())
print(xpath_selector.xpath(path2).extract())

["I'm pretty new to this", "But it's been fun to learn"]
["I'm pretty new to this"]


In [117]:
path3 = '/html/body/p[@class = 3]/text()'
xpath_selector.xpath(path3).extract()

["Hopefully we'll all have some fun."]

In [118]:
paragraphs = [xpath_selector.xpath(f'/html/body/p[@class = {i}]/text()').extract()
                  for i in range(1, 4)]
print(paragraphs)

[["Hopefully you'll learn something."], ["So hopefully I'll learn something."], ["Hopefully we'll all have some fun."]]


## Requests
Let's get some HTML!

In [85]:
#using requests to capture html from a website
response = requests.get('https://www.python.org/')
response.status_code

200

### HTTP status codes
Status codes tell you whether your request worked or what might have gone wrong. You've probably seen 404 (page not found) fairly often. You rarely notice 200 because it means everything went fine and the page simply loads.<br>
<br>
Learn more: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes <br>
<br>
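If you'd like something friendlier than a bare number, the standard library's `http.HTTPStatus` knows the official phrase for each code. Here is a small sketch; `describe_status` is a helper name I've made up for illustration, and the family labels are my own shorthand for the first digit of the code.

```python
from http import HTTPStatus

# The first digit of a status code gives its general family
FAMILIES = {
    1: "informational",
    2: "success",
    3: "redirect",
    4: "client error",
    5: "server error",
}

def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a code the standard library knows about
        phrase = "Unknown"
    family = FAMILIES.get(code // 100, "unknown")
    return f"{code} {phrase} ({family})"

print(describe_status(200))
print(describe_status(404))
```

With Requests you could pass `response.status_code` straight into a helper like this.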


In [86]:
#let's look at what's in the response
#how long is it?
print(len(response.text))

#what does it look like?
response.text[0:1000]

48823


'<!doctype html>\n<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->\n<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->\n<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->\n<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->\n\n<head>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">\n\n    <meta name="application-name" content="Python.org">\n    <meta name="msapplication-tooltip" content="The official home of the Python Programming Language">\n    <meta name="apple-mobile-web-app-title" content="Python.org">\n    <meta name="apple-mobile-web-app-capable" content="yes">\n    <meta name="apple-mobile-web-app-status-bar-style" content="black">\n\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <meta name="Han

In [8]:
response.apparent_encoding

'Windows-1252'

## BeautifulSoup

In [171]:
#adjacent string literals are joined, so the url contains no stray newlines or spaces
url = ('https://www.opentable.com/s/?covers=2&dateTime=2019-09-16%2019%3A00'
       '&metroId=279&regionIds=1145&enableSimpleCuisines=true'
       '&userlongitude=149.22&userlatitude=-35.28&includeTicketedAvailability=true&pageType=0')

response2 = requests.get(url)
print(response2.status_code)

html = response2.text

200
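The part of the URL after the `?` is a list of `key=value` pairs. Here is a sketch of building it from a dictionary with the standard library's `urllib.parse`, which also takes care of the percent-encoding (`%20` for the space and `%3A` for the colon in the date). The parameter names are copied from the URL above; I haven't checked what each one means to OpenTable.

```python
from urllib.parse import quote, urlencode

base_url = 'https://www.opentable.com/s/'

# Query parameters copied from the search URL used in this section
params = {
    'covers': 2,
    'dateTime': '2019-09-16 19:00',
    'metroId': 279,
    'regionIds': 1145,
    'enableSimpleCuisines': 'true',
    'userlongitude': 149.22,
    'userlatitude': -35.28,
    'includeTicketedAvailability': 'true',
    'pageType': 0,
}

# quote_via=quote encodes spaces as %20 (the default would use '+')
query = urlencode(params, quote_via=quote)
built_url = base_url + '?' + query
print(built_url)
```

Requests can do a similar job for you: `requests.get` accepts a `params` argument and builds the query string itself.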


In [172]:
html[0:1000]

'           <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Restaurant Reservation Availability</title>    <meta  name="robots" content="noindex,nofollow" > </meta>     <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-16.png" sizes="16x16"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-32.png" sizes="32x32"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-48.png" sizes="48x48"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-64.png" sizes="64x64"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-128.png" sizes="128x128"/><link rel="apple-touch-icon-precomposed" sizes

In [193]:
soup_booking = BeautifulSoup(html,'html.parser')
soup_booking.title.text

'Restaurant Reservation Availability'

In [208]:
# soup_booking.find_all('span',{'class':"rest-row-name-text"})

In [207]:
# html_driver[100000:]

In [144]:
soup_booking.find_all('div', {'class':"booking"})

[]

## Working with JavaScript and similar features
Requests only fetches the HTML the server sends back. It doesn't run JavaScript, so anything a page builds with JavaScript after loading won't be in the response. One way around this is to use browser automation software, originally designed for automated testing, which drives a real browser the way a person would. I'll be using Selenium and chromedriver, which you can learn about here: <br>
<br>
https://www.seleniumhq.org/ <br>
<br>
https://sites.google.com/a/chromium.org/chromedriver/home <br>
<br>
If you decide to use Scrapy for your project, here is an option to play with:<br>
https://github.com/scrapy-plugins/scrapy-splash<br>
<br>
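To make the problem concrete, here is a sketch that needs no browser. It uses the standard library's `html.parser` rather than BeautifulSoup, and both pages are made up: one is what a server might send (the booking div only exists inside a script), the other is what the browser ends up with after the script has run.

```python
from html.parser import HTMLParser

class DivClassCounter(HTMLParser):
    """Count <div> start tags that carry a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and self.wanted_class in classes:
            self.count += 1

def count_divs_with_class(html, wanted_class):
    counter = DivClassCounter(wanted_class)
    counter.feed(html)
    counter.close()
    return counter.count

# Hypothetical page as sent by the server: the div is just text inside a script
static_html = """
<html><body>
  <div id="results"></div>
  <script>
    document.getElementById("results").innerHTML =
      '<div class="booking">Booked 23 times today</div>';
  </script>
</body></html>
"""

# The same hypothetical page after a browser has run the script
rendered_html = """
<html><body>
  <div id="results"><div class="booking">Booked 23 times today</div></div>
</body></html>
"""

static_count = count_divs_with_class(static_html, "booking")
rendered_count = count_divs_with_class(rendered_html, "booking")
print(static_count, rendered_count)
```

The parser treats everything inside `<script>` as plain text, so the booking div only shows up once something has actually executed the JavaScript.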

In [66]:
from selenium import webdriver
from time import sleep


In [145]:
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver.exe") #tell Selenium where to find the driver

In [146]:
#when you're done with it - quit() closes the browser and shuts down the driver process
driver.quit()

In [184]:
def get_html_with_script(url):
    driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver.exe")
    try:
        driver.get(url)

        #wait one second to give the page's javascript time to run
        sleep(1)

        #grab page source - javascript will have been executed
        html = driver.page_source
    finally:
        #always shut the browser down, even if loading the page fails
        driver.quit()

    return html

In [185]:
html_driver = get_html_with_script(url)
#name the parser explicitly so results don't depend on which parsers are installed
soup_driver = BeautifulSoup(html_driver, 'html.parser')
soup_driver.find_all('div', {'class':"booking"})

[<div class="booking"><span class="tadpole"></span>Booked 23 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 14 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 20 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 15 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 1 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 19 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 4 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 7 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 9 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 1 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 2 times today</div>,
 <div class="booking"><span class="tadpole"></span>Booked 4 times today</div>,
 <div class="booking"><span class="tadpole"></s
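Once the `booking` divs are in hand, the counts still need to be pulled out of their text. Here is a minimal sketch with a regular expression. The strings are copied from the output above, as the text of each div would read, and `booking_count` is a helper name I've made up for illustration.

```python
import re

# Text of a few booking divs, copied from the output above
booking_texts = [
    "Booked 23 times today",
    "Booked 14 times today",
    "Booked 1 times today",
]

def booking_count(text):
    """Return the number in 'Booked N times today', or None if it doesn't match."""
    match = re.search(r"Booked (\d+) times? today", text)
    return int(match.group(1)) if match else None

counts = [booking_count(text) for text in booking_texts]
print(counts)
```

Returning `None` for text that doesn't match keeps one odd entry from stopping the whole scrape.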

### Another example: job boards

In [76]:
url2 = 'https://www.seek.com.au/job/39930523?token=1~5f8ee63d-cb9e-44de-93d2-91c3414807f7'

In [78]:
job_html = requests.get(url2)
job_html.status_code

200

In [81]:
job_soup = BeautifulSoup(job_html.text, 'html.parser')
job_soup.find_all('div', {'class':"lmis-cg-salary-bottom"})

[]

In [84]:
job_soup_new = BeautifulSoup(get_html_with_script(url2), 'html.parser')
job_soup_new.find_all('div', {'class':"lmis-cg-salary-bottom"})

[<div class="lmis-cg-salary-bottom">
 <span style="width: 54.545%">
             $65K
           </span>
 <span class="lmis-cg-salary-mode">$110K</span>
 <span>$160K</span>
 </div>]
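The salary figures come back as labels such as `$65K`, padded with whitespace. Here is a sketch of turning them into numbers. `parse_salary_label` is a name I've made up, and it assumes labels shaped like the three in the output above.

```python
import re

def parse_salary_label(text):
    """Convert a label such as '$65K' to a whole number of dollars."""
    match = re.fullmatch(r"\$(\d+(?:\.\d+)?)([KkMm]?)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised salary label: {text!r}")
    amount, suffix = match.groups()
    multiplier = {"": 1, "k": 1_000, "m": 1_000_000}[suffix.lower()]
    return int(float(amount) * multiplier)

# Span texts copied from the output above, including the padding on the first one
labels = ["\n            $65K\n          ", "$110K", "$160K"]
salaries = [parse_salary_label(label) for label in labels]
print(salaries)
```

Raising an error on anything unexpected is a deliberate choice here: if Seek changes its format, you'd rather find out than silently collect wrong numbers.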

__Recognition__ <br>
Thanks to Srikanta Patra for helping me learn web scraping as part of the General Assembly Data Science Immersive. Some of this code is adapted from his lesson, particularly the use of OpenTable as a good example of JavaScript.<br>
And thanks to Elke Hansen for pointing out the JavaScript in Seek ads.