## Data Scraping Solution

##### Author: Alex Sherman | alsherman@deloitte.com

### Lesson Objectives:
- Discuss Python string formatting
- Discuss Client-side versus Server-side programming
- Learn how to request data (html, css, javascript) from a web page
- Structure the html with BeautifulSoup
- Identify, extract, and store selected elements from the HTML

In [2]:
# import package to view websites inside the Jupyter Notebook
# (IPython.display is the public module; importing display from IPython.core.display is deprecated)
from IPython.display import display, HTML

### Configuration Recap

In [3]:
# use magic command to print working directory
# confirm you are in lesson2_automation directory
%pwd

'C:\\Users\\alsherman\\Desktop\\PycharmProjects\\firm_initiatives\\ml_guild\\lessons\\lesson2_automation'

In [4]:
import configparser
config = configparser.ConfigParser()
config.read('../../config.ini')

['../../config.ini']

### Python String Formatting

Many of the lessons use string formatting to insert values into placeholders within a string. We will use str.format() for this.

In [5]:
string_formatting_url = 'https://pyformat.info/'
iframe = '<iframe src={} width=1100 height=300></iframe>'.format(string_formatting_url)
HTML(iframe)

In [6]:
'This is a python string'

'This is a python string'

In [7]:
'We can add a parameter {}'.format('into the string')

'We can add a parameter into the string'

In [8]:
'format {} the braces'.format('fills in')

'format fills in the braces'

In [147]:
'.format() allows multiple parameters, of any datatype, param: {}, param: {},  param: {}'.format('one', 2, 3.0)

'.format() allows multiple parameters, of any datatype, param: one, param: 2,  param: 3.0'

In [148]:
'add {:10} with a {:10} then number'.format('spaces','colon')

'add spaces     with a colon      then number'

In [12]:
# formatting is useful for populating strings as part of a loop
urls = ['url1','url2','url3']

for url in urls:
    print('Successfully requested data from {}'.format(url))

Successfully requested data from url1
Successfully requested data from url2
Successfully requested data from url3
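
The examples above fill the braces in order. As a minimal sketch of what else str.format() supports, the cell below reuses an argument by position, fills named fields, and applies a format spec for alignment and precision. The variable names and the values passed in are illustrative only.

```python
# Positional indexes let one argument be reused in several places
retry_msg = 'Requested {0}; retrying {0} in {1} seconds'.format('url1', 5)
print(retry_msg)

# Named fields make long templates easier to read
iframe_html = '<iframe src={src} width={width} height={height}></iframe>'.format(
    src='https://pyformat.info/', width=1100, height=300
)
print(iframe_html)

# A format spec after the colon controls width, alignment, and precision
# > right-aligns, < left-aligns, .2f keeps two decimal places
aligned = '{:>10}|{:<10}|{:.2f}'.format('right', 'left', 3.14159)
print(aligned)
```

Named fields are usually the easier choice once a template has more than two or three placeholders.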


### REQUESTS

Requests is a Python HTTP library, released under the Apache2 License. The goal of the project is to make HTTP requests simpler and more human-friendly. We will use requests to get html, css, and javascript from webpages to collect data from the web.

In [13]:
requests_url = 'http://docs.python-requests.org/en/master/'
iframe = '<iframe src={} width=1100 height=300></iframe>'.format(requests_url)
HTML(iframe)

In [25]:
import requests

# set a url for the OPM website homepage
opm_url = r'https://www.opm.gov/'

# r is a common name for the Response object returned by requests
r = requests.get(opm_url)

#### Two commonly used methods for a request-response exchange between a client and a server are GET and POST.
- GET - Requests data from a specified resource
- POST - Submits data to be processed to a specified resource
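
To make the difference concrete, the sketch below builds one GET and one POST request with requests.Request(...).prepare(), which constructs the request without sending it. The example.com URL and the form field in the POST are hypothetical placeholders, not real endpoints.

```python
import requests

# Build (but do not send) a GET request; params become the URL query string
get_req = requests.Request(
    'GET', 'https://www.opm.gov/data/Index.aspx', params={'tag': 'FedScope'}
).prepare()
print(get_req.method, get_req.url)

# Build (but do not send) a POST request; data travels in the request body
# NOTE: this URL and form field are hypothetical, for illustration only
post_req = requests.Request(
    'POST', 'https://example.com/search', data={'query': 'FedScope'}
).prepare()
print(post_req.method, post_req.url, post_req.body)
```

With GET the parameters are visible in the URL, while with POST they are carried in the body of the request.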


#### Client-side vs Server-side Programming Languages

Web development is all about communication and data exchange. This communication takes place between two parties over the HTTP protocol.

- Server: The Server is responsible for serving the web pages depending on the client/end user requirement. It can be either static or dynamic.
- Client: A client is a party that requests pages from the server and displays them to the end user. In general a client program is a web browser.

source: http://www.c-sharpcorner.com/UploadFile/2072a9/client-side-vs-server-side-programming-languages/

In [26]:
client_server_url = 'http://www.afterhoursprogramming.com/static/reference/images/webFlow.gif'
iframe = '<iframe src={} width=500 height=220></iframe>'.format(client_server_url)
HTML(iframe)

In [27]:
r

<Response [200]>

In [28]:
r.status_code

200

#### Status Code Explanation    
Every response from a web server includes a status code that reports whether the request succeeded. The first digit of the status code specifies one of five standard classes of responses.

- 1xx: Informational
- 2xx: Successful
- 3xx: Redirection
- 4xx: Client Error
- 5xx: Server Error

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
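
The five classes can be looked up from the first digit of the code. The sketch below does that with integer division and uses HTTPStatus from the standard library for the phrase; describe_status and STATUS_CLASSES are hypothetical names chosen for this example.

```python
from http import HTTPStatus

# Map the first digit of a status code to its response class
STATUS_CLASSES = {
    1: 'Informational',
    2: 'Successful',
    3: 'Redirection',
    4: 'Client Error',
    5: 'Server Error',
}

def describe_status(code):
    """Return the code, its standard phrase, and its response class."""
    status_class = STATUS_CLASSES.get(code // 100, 'Unknown')
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # the code is not a registered HTTP status
        phrase = 'Unregistered'
    return '{} {} ({})'.format(code, phrase, status_class)

for code in [200, 301, 404, 503]:
    print(describe_status(code))
```

In a scraper, the same check is often applied to r.status_code before parsing r.text.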

In [149]:
r.text[0:1000]

'\r\n<!DOCTYPE html>\r\n\r\n<html  xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">\r\n\t<head><link rel="shortcut icon" href="/favicon.ico" /><title>\r\n\tData, Analysis & Documentation : Raw Datasets - OPM.gov\r\n</title><meta http-equiv="X-UA-Compatible" content="IE=edge" /><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta name="description" content="Welcome to opm.gov" /><meta name="keywords" content="OPM,Office of Personnel Management,opm.gov" /><meta name="Expires" /><meta name="TemplateVersion" content="3.0" />\r\n\t\t\t<meta property="fb:admins" content="568256384"/>\r\n\t\t\t<meta property="fb:app_id" content="121223957945585"/>\r\n\t\t\t<meta property="og:type" content="government"/>\r\n\t\t\t<meta property="og:site_name" content="U.S. Office of Personnel Management"/><link rel="stylesheet" type="text/css" media="screen,projection" href="/css/global.css?v=20150331" />\r\

In [30]:
r.encoding

'ISO-8859-1'

#### Requests text and encoding

Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded. When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. 

The text encoding guessed by Requests is used when you access r.text. You can find out which encoding Requests is using, and change it, through the r.encoding property.

- What is the difference between unicode and ascii: https://stackoverflow.com/questions/19212306/whats-the-difference-between-ascii-and-unicode
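
To show why the guessed encoding matters, the sketch below decodes the same bytes twice using plain Python, with no network request. The sample word is an assumption chosen because it contains a non-ASCII character.

```python
# Bytes as a server might send them: the text encoded as UTF-8
raw_bytes = 'café'.encode('utf-8')

# Decoding with the wrong charset produces garbled characters
wrong = raw_bytes.decode('ISO-8859-1')

# Decoding with the charset the bytes were written in recovers the text
right = raw_bytes.decode('utf-8')

print(repr(wrong))
print(repr(right))
```

With a Response object, assigning a new value such as r.encoding = 'utf-8' before reading r.text changes how the body is decoded in the same way.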

### Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

In [31]:
bs_url = 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'
iframe = '<iframe src=' + bs_url + ' width=1100 height=300></iframe>'
HTML(iframe)

In [32]:
from bs4 import BeautifulSoup

b = BeautifulSoup(r.text, 'lxml')

In [33]:
# BeautifulSoup parses the text into a navigable tree; displaying it
# shows real line breaks instead of escaped \r\n and \t characters
b

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
<head><link href="/favicon.ico" rel="shortcut icon"/><title>
	OPM.gov
</title><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="text/html; charset=utf-8" http-equiv="Content-Type"/><meta content="Welcome to opm.gov" name="description"/><meta content="OPM,Office of Personnel Management,opm.gov" name="keywords"/><meta name="Expires"/><meta content="3.0" name="TemplateVersion"/>
<meta content="568256384" property="fb:admins"/>
<meta content="121223957945585" property="fb:app_id"/>
<meta content="government" property="og:type"/>
<meta content="U.S. Office of Personnel Management" property="og:site_name"/>
<meta content="U.S. Office of Personnel Management - www.OPM.gov" property="og:title"/>
<meta content="OPM works in several broad categories to recruit, retain and honor a world-class workforce for the American people."

### Example HTML

Before we explore the full HTML, let's work with some example HTML.

In [34]:
example_html = """
    <!DOCTYPE html>
    <html lang='en'>

    <head>
        <title>The ML Guild</title>
    </head>

    <body>
        <h1 id='headline'>Machine Learning Guild Overview</h1>
            <p class='description' >The ML guild provides rich learning 
            content around machine learning, including self-guided courses, 
            curriculum-based learning programs, and mentorship by more 
            advanced practitioners</p>
        <h2>Courses</h2>

        <p>Below are the ML Guild Tracks:</p>

        <ul>
            <li class='first'>Explorer</li>
            <li class='second'>Apprentice</li>
            <li class='third'>Master</li>
        </ul>
    </body>

    </html>
    """

### Common BeautifulSoup Syntax

#### Methods
- find: find only the first specified html tag that meets a specified condition
- find_all: find all html tags that meet a specified condition

#### Parameters
- attrs: dict of attributes (e.g. id, class) to filter relevant html

#### Attributes
- text: only view the text in the html (ignore the html itself) 


In [35]:
# Beautiful Soup supports the HTML parser included in Python’s standard library, 
# but it also supports a number of third-party Python parsers. 
# One is the lxml parser, which is fast and lenient (i.e. it will not crash if the html is not formatted correctly)

b = BeautifulSoup(example_html, 'lxml')
b

<!DOCTYPE html>
<html lang="en">
<head>
<title>The ML Guild</title>
</head>
<body>
<h1 id="headline">Machine Learning Guild Overview</h1>
<p class="description">The ML guild provides rich learning 
            content around machine learning, including self-guided courses, 
            curriculum-based learning programs, and mentorship by more 
            advanced practitioners</p>
<h2>Courses</h2>
<p>Below are the ML Guild Tracks:</p>
<ul>
<li class="first">Explorer</li>
<li class="second">Apprentice</li>
<li class="third">Master</li>
</ul>
</body>
</html>

#### find

In [36]:
b.find('title')

<title>The ML Guild</title>

In [37]:
b.find('h2')

<h2>Courses</h2>

#### text

In [38]:
b.find('title').text

'The ML Guild'

In [39]:
b.find('h2').text

'Courses'

In [40]:
b.find('li').text

'Explorer'

#### find_all

In [41]:
# find_all
b.find_all('li')

[<li class="first">Explorer</li>,
 <li class="second">Apprentice</li>,
 <li class="third">Master</li>]

In [42]:
lists = b.find_all('li')
for l in lists:
    print(l.text)

Explorer
Apprentice
Master


#### attrs

In [43]:
b.find('h1', attrs={'id':'headline'})

<h1 id="headline">Machine Learning Guild Overview</h1>

In [44]:
b.find('li', attrs={'class':'first'})

<li class="first">Explorer</li>

In [45]:
b.find('li', attrs={'class':'second'})

<li class="second">Apprentice</li>

### Exercise

In [46]:
# Get the text (only) for the master class
b.find('li', attrs={'class':'third'}).text

'Master'

In [47]:
# print the text (only) of all the paragraphs in a list
paragraphs = b.find_all('p')
for p in paragraphs:
    print(p.text)

The ML guild provides rich learning 
            content around machine learning, including self-guided courses, 
            curriculum-based learning programs, and mentorship by more 
            advanced practitioners
Below are the ML Guild Tracks:


In [48]:
# get the paragraph with the class description
b.find('p', attrs={'class':'description'})

<p class="description">The ML guild provides rich learning 
            content around machine learning, including self-guided courses, 
            curriculum-based learning programs, and mentorship by more 
            advanced practitioners</p>

### Robots.txt

"WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page.
In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting)."

SOURCE: http://www.robotstxt.org/orig.html

In [49]:
robots_url = 'http://www.robotstxt.org/orig.html'
iframe = '<iframe src={} width=1100 height=400></iframe>'.format(robots_url)
HTML(iframe)

In [50]:
from urllib import robotparser
 
rp = robotparser.RobotFileParser()  # instantiate robotparser
rp.set_url("https://www.facebook.com/robots.txt")  # set the path to the robots.txt file
rp.read()  # read the robots.txt file
rp.can_fetch('*', "https://www.facebook.com")  # check if you are allowed to fetch a specific page

False

In [51]:
rp = robotparser.RobotFileParser()
rp.set_url("https://www2.deloitte.com/robots.txt")
rp.read()
rp.can_fetch('*', "https://www2.deloitte.com/us/en.html")

False

In [52]:
rp = robotparser.RobotFileParser()
rp.set_url("http://www.annualreports.com/robots.txt")
rp.read()
rp.can_fetch("*", "http://www.annualreports.com/Company/oracle-corporation")

True

In [57]:
rp = robotparser.RobotFileParser()
rp.set_url('https://www.opm.gov/robots.txt')
rp.read()
rp.can_fetch("*", "https://www.opm.gov")

True
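
To see how rules in a robots.txt file map to the answers from can_fetch, the sketch below parses rules from a list of lines instead of fetching them from a site. The rules and the example.com URLs are hypothetical.

```python
from urllib import robotparser

# Hypothetical robots.txt content: everything under /private/ is off limits
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = robotparser.RobotFileParser()
rp.parse(robots_lines)  # parse the rules directly instead of calling read()

# A path outside /private/ matches no Disallow rule
allowed_public = rp.can_fetch('*', 'https://example.com/public/page.html')
# A path that starts with /private/ matches the Disallow rule
allowed_private = rp.can_fetch('*', 'https://example.com/private/page.html')

print(allowed_public, allowed_private)
```

Checking can_fetch before each request is a simple way to keep a scraper within the rules a site publishes.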

### Extract blog posts from OPM.gov

In [58]:
### REVIEW: GET OPM HTML

# set a url for the OPM website homepage
url = r'https://www.opm.gov/'

# r is a common name for the Response object returned by requests
r = requests.get(url)

In [59]:
# read in the html from OPM
b = BeautifulSoup(r.text, 'lxml')

In [64]:
# find the first blog post
blog = b.find_all('div', attrs={'class':'Blog_Entry'})[0]
blog

<div class="Blog_Entry"><div class="Blog_Date">Jan<span>04</span></div><div class="Blog_Title"><a href="http://www.opm.gov/policy-data-oversight/snow-dismissal-procedures/status-archives/18/1/4/Open---2-hours-Delayed-Arrival---With-Option-for-Unscheduled-Leave-or-Unscheduled-Telework_748/">January 4, 2018 Operating Status</a></div><p class="Blog_Text">Federal agencies in the Washington, DC area are OPEN under 2 hours DELAYED ARRIVAL and employees have the OPTION FOR UNSCHEDULED LEAVE OR UNSCHEDULED TELEWORK. Employees should plan to arrive for work no more than 2 hours later than they would be expected to arrive.</p></div>

In [61]:
# get the blog Title
blog.find('div', attrs={'class':'Blog_Title'}).text

'January 4, 2018 Operating Status'

In [62]:
# get the date the blog was posted
blog.find('div', attrs={'class':'Blog_Date'}).text

'Jan04'

In [63]:
# get the blog text
blog.find('p', attrs={'class':'Blog_Text'}).text

'Federal agencies in the Washington, DC area are OPEN under 2 hours DELAYED ARRIVAL and employees have the OPTION FOR UNSCHEDULED LEAVE OR UNSCHEDULED TELEWORK. Employees should plan to arrive for work no more than 2 hours later than they would be expected to arrive.'

In [72]:
# store all the blog posts
blogs = b.find_all('div', attrs={'class':'Blog_Entry'})

### Store Data

We will use a common Python data structure (e.g. list or dict) to store the data for later use.

Below is a common data collection pattern:
- A. set an empty data structure to store data
- B. iterate through an existing data set container
- C. extract selected pieces of information, disregard the rest of the data
- D. add the selected information to the created data structure
- E. view results

In [73]:
all_headlines = []  # A 

for blog in blogs: # B
    headline = blog.find('div', attrs={'class':'Blog_Title'}).text   # C
    all_headlines.append(headline)  # D

all_headlines  # E

['January 4, 2018 Operating Status',
 'Dismissal and Closure Operating Status Decision',
 'News: OPM opens CFC for retiree giving',
 'Open Season Quality Comparison Tool',
 'OPM Celebrates Native American Heritage Day']
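
The same A to E pattern extends to several fields per item by storing one dict per blog post. The sketch below assumes a small, made-up HTML snippet that mirrors the Blog_Entry structure shown earlier, and it uses Python's built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mirrors the Blog_Entry structure shown above
sample_html = """
<div class="Blog_Entry"><div class="Blog_Date">Jan<span>04</span></div>
<div class="Blog_Title"><a href="#">January 4, 2018 Operating Status</a></div>
<p class="Blog_Text">Federal agencies are OPEN.</p></div>
<div class="Blog_Entry"><div class="Blog_Date">Dec<span>15</span></div>
<div class="Blog_Title"><a href="#">Open Season Quality Comparison Tool</a></div>
<p class="Blog_Text">Compare plans.</p></div>
"""
soup = BeautifulSoup(sample_html, 'html.parser')

all_posts = []  # A
for entry in soup.find_all('div', attrs={'class': 'Blog_Entry'}):  # B
    post = {  # C
        'title': entry.find('div', attrs={'class': 'Blog_Title'}).text,
        # a separator keeps the month and day apart ('Jan 04', not 'Jan04')
        'date': entry.find('div', attrs={'class': 'Blog_Date'}).get_text(' '),
        'text': entry.find('p', attrs={'class': 'Blog_Text'}).text,
    }
    all_posts.append(post)  # D

print(all_posts)  # E
```

A list of dicts like this converts directly into a table later, for example with pandas.DataFrame(all_posts).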

### Collect ZIP URLS from OPM.gov

In [80]:
url = r'https://www.opm.gov/data/Index.aspx?tag=FedScope'
r = requests.get(url)
b = BeautifulSoup(r.text, 'lxml')

In [103]:
# find the html for the table of files (HINT: look for the class DataTable)
data_table = b.find('table', attrs={'class':'DataTable'})

In [136]:
# find the html for the first row of files (HINT: make sure to skip the table headers)
row = data_table.find_all('tr')[1]
row

<tr>
<td valign="top">FedScope Employment Cube (September 2017)</td>
<td valign="top">
<a href="/Data/Files/522/192e937d-555f-42d2-8837-eb322cabafb0.pdf"><img alt="Data Dictionary" border="0" src="/img/global/icoPDF.gif"/></a>
<span class="FileSize">[310.54 KB]</span>
</td>
<td>
<span>
<a href="/Data/Files/522/7a0bf199-6c16-4131-92d1-485b18f7878a.zip"><img alt="Download File" border="0" src="/img/global/icoZip.gif"/></a>
<span class="FileSize">[20.74 MB]</span>
</span>
</td>
<td valign="top">
						Website: <a href="https://www.fedscope.opm.gov/">www.fedscope.opm.gov</a><br/>
						Email: <a href="mailto:FedScope@opm.gov">FedScope@opm.gov</a><br/>
</td>
<td>
						fedscope
					</td>
<td>10/26/2017</td>
</tr>

In [139]:
# find the 'td' element with the file name
filename = row.find('td').text
filename

'FedScope Employment Cube (September 2017)'

In [100]:
# find the 'td' element with the .zip link (HINT: look for 'a href')
url = row.find_all('td')[2]
url

<td>
<span>
<a href="/Data/Files/522/7a0bf199-6c16-4131-92d1-485b18f7878a.zip"><img alt="Download File" border="0" src="/img/global/icoZip.gif"/></a>
<span class="FileSize">[20.74 MB]</span>
</span>
</td>

In [101]:
# get the zip url
url_end = url.find('a')['href']
url_end

'/Data/Files/522/7a0bf199-6c16-4131-92d1-485b18f7878a.zip'

In [141]:
# add the base url
BASE_URL = r'https://www.opm.gov'
url = ''.join([BASE_URL, url_end])
url

'https://www.opm.gov/Data/Files/522/7a0bf199-6c16-4131-92d1-485b18f7878a.zip'
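
Joining the base url and the href with ''.join works when every href starts with a slash. As an alternative sketch, urljoin from the standard library resolves an href against the url of the page it came from and also handles hrefs that are already absolute. This is a swapped-in technique, not the approach used in the rest of the lesson.

```python
from urllib.parse import urljoin

page_url = 'https://www.opm.gov/data/Index.aspx?tag=FedScope'
url_end = '/Data/Files/522/7a0bf199-6c16-4131-92d1-485b18f7878a.zip'

# A root-relative href replaces the whole path of the page url
zip_url = urljoin(page_url, url_end)
print(zip_url)

# An href that is already absolute is returned unchanged
external_url = urljoin(page_url, 'https://www.fedscope.opm.gov/')
print(external_url)
```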

In [142]:
# Exception Handling
0/0

ZeroDivisionError: division by zero

In [144]:
### Try - Except to handle expected errors
try:
    0/0
except ZeroDivisionError:
    print('successfully caught error')

successfully caught error
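
In scraping, the error to expect is usually a missing tag. find returns None when nothing matches, and indexing None with ['href'] raises a TypeError. The sketch below shows this on two made-up table cells, parsed with Python's built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical table cells: the first has a link, the second does not
cells = BeautifulSoup(
    '<td><a href="/file.zip">zip</a></td><td>no link</td>', 'html.parser'
).find_all('td')

hrefs = []
skipped = 0
for cell in cells:
    try:
        # find returns None when there is no <a>; None['href'] raises TypeError
        hrefs.append(cell.find('a')['href'])
    except TypeError:
        skipped += 1
        print('skipped a cell without a link')

print(hrefs)
```

Catching only TypeError keeps other, unexpected errors visible instead of silently hiding them.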


#### Exercise

Combine the above code to collect all the zip urls, first in a list and then in a dict.

In [146]:
# Identify, extract, and store all zip urls
# store them in a list called zip_urls

BASE_URL = r'https://www.opm.gov'
zip_urls = []
data_table = b.find('table', attrs={'class':'DataTable'})

for row in data_table.find_all('tr')[1:]:  # [1:] skips the table headers
    cells = row.find_all('td')
    
    try:        
        url_end = cells[2].find('a')['href']
        url = ''.join([BASE_URL, url_end])
        zip_urls.append(url)
    except TypeError:
        continue

zip_urls[0:5]

['https://www.opm.gov/Data/Files/522/7a0bf199-6c16-4131-92d1-485b18f7878a.zip',
 'https://www.opm.gov/Data/Files/519/cb894d2e-de0b-4635-88ac-37f471544fff.zip',
 'https://www.opm.gov/Data/Files/494/1dda0280-390b-4c86-9484-14278783ffdc.zip',
 'https://www.opm.gov/Data/Files/493/297ed5db-f314-4d3a-802f-f45231e9e062.zip',
 'https://www.opm.gov/Data/Files/492/5467c9a4-8cc7-430e-a96d-59078b440d43.zip']

In [135]:
# Identify, extract, and store all filenames and zip urls
# store them in a dict called zip_urls

BASE_URL = r'https://www.opm.gov'
zip_urls = {}
data_table = b.find('table', attrs={'class':'DataTable'})

for row in data_table.find_all('tr')[1:]:
    cells = row.find_all('td')

    try:        
        filename = cells[0].text
        url_end = cells[2].find('a')['href']
        url = ''.join([BASE_URL, url_end])
        zip_urls[filename] = url
    except TypeError:
        continue

zip_urls

{'FedScope Accessions Cube (Fiscal Year 2016)': 'https://www.opm.gov/Data/Files/492/5467c9a4-8cc7-430e-a96d-59078b440d43.zip',
 'FedScope Employment Cube (December 2009)': 'https://www.opm.gov/Data/Files/181/6e6d2997-5e31-48f9-8723-db439c48e3af.zip',
 'FedScope Employment Cube (December 2010)': 'https://www.opm.gov/Data/Files/169/ecf4b47c-0f1a-4f8d-9b06-b352e041eefa.zip',
 'FedScope Employment Cube (December 2011)': 'https://www.opm.gov/Data/Files/234/76933192-00c0-4359-967f-da301c699121.zip',
 'FedScope Employment Cube (December 2012)': 'https://www.opm.gov/Data/Files/318/9095712a-b161-4fff-8208-086b75cd1b2c.zip',
 'FedScope Employment Cube (December 2013)': 'https://www.opm.gov/Data/Files/340/9fb31e2e-3cec-4be4-927c-66c765fe17da.zip',
 'FedScope Employment Cube (December 2014)': 'https://www.opm.gov/Data/Files/397/123a2080-61a4-4dd6-9715-ace27eb4709d.zip',
 'FedScope Employment Cube (December 2015)': 'https://www.opm.gov/Data/Files/422/944a9e54-69b5-4b74-8395-4f6eca37e650.zip',
 'Fed