In [None]:
"""
Process:
1. Request data (html, css, javascript) from web page
2. Structure the html so that individual elements can be identified and extracted
3. Identify, extract, and store

Slide of libraries, patterns, syntax
Slide of request, html, extract tag, store in list, iterate through, download, store locally
"""


import os
import time


### Python String Formatting

We will use string formatting to create clean, readable strings in many of the lessons. For this, we will use str.format().

- str.format() examples: https://pyformat.info/

In [122]:
'This is a python string'

'This is a python string'

In [126]:
'We can add a parameter {}'.format('in the string using {}')

'We can add a parameter in the string using {}'

In [127]:
'{}'.format('.format() fills in the parameter braces')

'.format() fills in the parameter braces'

In [134]:
'.format() allows multiple parameters, of any datatype: param {}, param {},  param {}'.format('one', 2, 3.0)

'.format() allows multiple parameters, of any datatype: param one, param 2,  param 3.0'

In [135]:
# formatting is useful for changing strings as part of a loop
urls = ['url1','url2','url3']

for url in urls:
    print('Successfully requested data from {}'.format(url))

Successfully requested data from url1
Successfully requested data from url2
Successfully requested data from url3
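
Placeholders can also be numbered or named, which helps when a value repeats or when a string has many parameters. A minimal sketch; the names `page` and `code` and the example.com URL are illustrative and not part of the lesson's data:

```python
# Numbered placeholders refer to arguments by position and can be reused
numbered = '{0} was requested, then {0} was parsed by {1}'.format('url1', 'BeautifulSoup')

# Named placeholders make long templates easier to read
named = 'Request to {page} returned status {code}'.format(page='https://example.com', code=200)

# A format spec after the colon controls presentation, e.g. two decimal places
timed = 'Download took {:.2f} seconds'.format(1.23456)

print(numbered)
print(named)
print(timed)
```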


### REQUESTS

Requests is a Python HTTP library, released under the Apache2 License. The goal of the project is to make HTTP requests simpler and more human-friendly. We will use requests to get html, css, and javascript from webpages to collect data from the web.

In [138]:
from IPython.display import display, HTML

requests_url = 'http://docs.python-requests.org/en/master/'
iframe = '<iframe src={} width=1100 height=300></iframe>'.format(requests_url)
HTML(iframe)

In [34]:
import requests

# set a url for deloitte's website homepage
deloitte_url = r'https://www2.deloitte.com/us/en.html'

# r is the conventional name for the Response object that requests.get returns
r = requests.get(deloitte_url)

#### Two commonly used methods for a request-response between a client and server are: GET and POST.
- GET - Requests data from a specified resource
- POST - Submits data to be processed to a specified resource
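
The difference between the two methods can be seen without contacting a server by preparing requests instead of sending them. This is only a sketch: `https://example.com/search` and the field name `q` are placeholders, and `requests.Request(...).prepare()` builds the request object without sending anything over the network.

```python
import requests

# GET: parameters travel in the URL's query string
get_req = requests.Request('GET', 'https://example.com/search', params={'q': 'guild'}).prepare()

# POST: form data travels in the request body
post_req = requests.Request('POST', 'https://example.com/search', data={'q': 'guild'}).prepare()

print(get_req.method, get_req.url, get_req.body)
print(post_req.method, post_req.url, post_req.body)
```

The GET request carries `q=guild` in its URL and has no body, while the POST request keeps the URL unchanged and carries `q=guild` in its body.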


#### Client-side vs Server-side Programming Languages

Web development is all about communication and data exchange. This communication takes place between two parties over HTTP.

- Server: The Server is responsible for serving the web pages depending on the client/end user requirement. It can be either static or dynamic.
- Client: A client is a party that requests pages from the server and displays them to the end user. In general a client program is a web browser.

source: http://www.c-sharpcorner.com/UploadFile/2072a9/client-side-vs-server-side-programming-languages/

In [35]:
r

<Response [200]>

In [36]:
r.status_code

200

#### Status Code Explanation    
Every response from a web server carries a status code that reports how the request was handled, including whether an error occurred. The first digit of the status code specifies one of five standard classes of responses.

- 1xx: Informational
- 2xx: Successful
- 3xx: Redirection
- 4xx: Client Error
- 5xx: Server Error

https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
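
The classification by first digit can be expressed as a small helper. A minimal sketch; `status_class` is a hypothetical function written for illustration and is not part of Requests:

```python
def status_class(status_code):
    """Return the response class for an HTTP status code, based on its first digit."""
    classes = {
        1: 'Informational',
        2: 'Successful',
        3: 'Redirection',
        4: 'Client Error',
        5: 'Server Error',
    }
    # Integer division by 100 keeps only the first digit of a three-digit code
    return classes.get(status_code // 100, 'Unknown')


for code in [200, 301, 404, 503]:
    print('{} -> {}'.format(code, status_class(code)))
```

For the response above, `status_class(r.status_code)` would return `'Successful'`.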

In [23]:
r.text

'\n<!DOCTYPE HTML>\n\n\n\n\n    <html lang="en">\n\n        <head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">\n        \t\n<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1">\n      \n    <script>\n\t\tvar dtmConfig = true;\n\t\tvar siteCatConfig = false;\n        // Added the below to fetch the Twine Social parameters\n        var twineSocialClientID = "MTY1ODcsUk9mSlVLWFdWQlZBTzdHWmFWQzVoRzlvTXd2c24wcjU%3D";\n        var twineSocialAccountCode = "15-I7FFXT";\n        var twineSocialGroupID = "96432072593";\n\t</script>\n\n    <meta charset="utf-8"> \n    \n    <meta name=\'description\' content="Deloitte provides industry-leading audit, consulting, tax, and advisory services to many of the world’s most admired brands, including 80 percent of the Fortune 500. As a member firm of Deloitte Touche Tohmatsu Limited, a network of member firms, we are proud to be part of the largest global professional services network, serving our clients in the markets that are mos

In [139]:
r.encoding

'UTF-8'

#### Requests text and encoding

Requests will automatically decode content from the server. Most unicode charsets are seamlessly decoded. When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers. 

The text encoding guessed by Requests is used when you access r.text. You can find out what encoding Requests is using, and change it, using the r.encoding property.

- What is the difference between Unicode and ASCII: https://stackoverflow.com/questions/19212306/whats-the-difference-between-ascii-and-unicode
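
Why the encoding matters can be shown with plain Python bytes, with no request required. A minimal sketch: the byte string stands in for `r.content`, decoding it plays the role of `r.text`, and the sample word is an assumption chosen for illustration.

```python
# The raw bytes a server might send for the word "café", encoded as UTF-8
raw = 'café'.encode('utf-8')

# Decoding with the correct encoding recovers the original text
right = raw.decode('utf-8')

# Decoding with the wrong encoding produces garbled characters
wrong = raw.decode('latin-1')

print(raw)
print(right)
print(wrong)

# With a Response object, the equivalent fix is to set the encoding
# before reading the text, for example:
#     r.encoding = 'utf-8'
#     r.text
```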

### Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

In [37]:
bs_url = 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'
iframe = '<iframe src=' + bs_url + ' width=1100 height=300></iframe>'
HTML(iframe)

In [29]:
from bs4 import BeautifulSoup

b = BeautifulSoup(r.text, 'lxml')

In [30]:
# BeautifulSoup parses the text into a navigable tree of tags; the displayed
# HTML is easier to read than the raw string with its escaped \n and \t characters
b

<!DOCTYPE HTML>
<html lang="en">
<head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">
<meta content="IE=Edge,chrome=1" http-equiv="X-UA-Compatible"/>
<script>
		var dtmConfig = true;
		var siteCatConfig = false;
        // Added the below to fetch the Twine Social parameters
        var twineSocialClientID = "MTY1ODcsUk9mSlVLWFdWQlZBTzdHWmFWQzVoRzlvTXd2c24wcjU%3D";
        var twineSocialAccountCode = "15-I7FFXT";
        var twineSocialGroupID = "96432072593";
	</script>
<meta charset="utf-8"/>
<meta content="Deloitte provides industry-leading audit, consulting, tax, and advisory services to many of the world’s most admired brands, including 80 percent of the Fortune 500. As a member firm of Deloitte Touche Tohmatsu Limited, a network of member firms, we are proud to be part of the largest global professional services network, serving our clients in the markets that are most important to them." name="description"/>
<link href="https://www2.deloitte.com/us/en.html" rel="canon

### Example HTML

Before we explore the full HTML, let's work with some example HTML.

In [101]:
example_html = """
    <!DOCTYPE html>
    <html lang='en'>

    <head>
        <title>The ML Guild</title>
    </head>

    <body>
        <h1 id='headline'>Machine Learning Guild Overview</h1>
            <p class='description' >The ML guild provides rich learning 
            content around machine learning, including self-guided courses, 
            curriculum-based learning programs, and mentorship by more 
            advanced practitioners</p>
        <h2>Courses</h2>

        <p>Below are the ML Guild Tracks:</p>

        <ul>
            <li class='first'>Explorer</li>
            <li class='second'>Apprentice</li>
            <li class='third'>Master</li>
        </ul>
    </body>

    </html>
    """

### Common BeautifulSoup Syntax

#### Methods
- find: find only the first specified html tag that meets a specified condition
- find_all: find all html tags that meet a specified condition

#### Parameters
- attrs: dict of attributes (e.g. id, class) to filter relevant html

#### Attributes
- text: only view the text in the html (ignore the html itself) 


In [102]:
# Beautiful Soup supports the HTML parser included in Python’s standard library, 
# but it also supports a number of third-party Python parsers. 
# One is the lxml parser, which is fast and lenient (i.e. it will not crash if the html is not formatted correctly)

b = BeautifulSoup(example_html, 'lxml')
b

<!DOCTYPE html>
<html lang="en">
<head>
<title>The ML Guild</title>
</head>
<body>
<h1 id="headline">Machine Learning Guild Overview</h1>
<p class="description">The ML guild provides rich learning 
            content around machine learning, including self-guided courses, 
            curriculum-based learning programs, and mentorship by more 
            advanced practitioners</p>
<h2>Courses</h2>
<p>Below are the ML Guild Tracks:</p>
<ul>
<li class="first">Explorer</li>
<li class="second">Apprentice</li>
<li class="third">Master</li>
</ul>
</body>
</html>

#### find

In [103]:
b.find('title')

<title>The ML Guild</title>

In [104]:
b.find('h2')

<h2>Courses</h2>

#### text

In [105]:
b.find('title').text

'The ML Guild'

In [106]:
b.find('h2').text

'Courses'

In [107]:
b.find('li').text

'Explorer'

#### find_all

In [108]:
b.find_all('p')

[<p class="description">The ML guild provides rich learning 
             content around machine learning, including self-guided courses, 
             curriculum-based learning programs, and mentorship by more 
             advanced practitioners</p>, <p>Below are the ML Guild Tracks:</p>]

In [109]:
# find_all
b.find_all('li')

[<li class="first">Explorer</li>,
 <li class="second">Apprentice</li>,
 <li class="third">Master</li>]

In [110]:
lists = b.find_all('li')
for l in lists:
    print(l.text)

Explorer
Apprentice
Master


#### attrs

In [91]:
b.find('h1', attrs={'id':'headline'})

<h1 id="headline">Machine Learning Guild Overview</h1>

In [92]:
b.find('li', attrs={'class':'first'})

<li class="first">Explorer</li>

In [93]:
b.find('li', attrs={'class':'second'})

<li class="second">Apprentice</li>

In [94]:
b.find('li', attrs={'class':'third'})

<li class="third">Master</li>

### Exercise

### Extract articles from Deloitte.com 

In [143]:
# read in the html from deloitte
b = BeautifulSoup(r.text, 'lxml')

In [144]:
articles = b.find_all('div', attrs={'class':'table-frame-col-50 standardpromo '})
for article in articles:
    title = article.find('h3', attrs={'class':'tertiary-headline'}).text
    print(title)

Robotic process automation and cognitive technologies in insurance

Holiday 2017: How retailers can secure customer devotion
				


In [145]:
for article in articles:
    text = article.find('div', attrs={'class':'page-description-for-promo'}).text
    print(text)

Adopting a more customer-centric approach will give insurers opportunities to unlock cognitive automation across the insurance value chain.
November 15 | 11 a.m. ET | Participants will learn results of Deloitte's 32nd annual holiday survey and how a customer rapport strategy can tip the competitive war for share in their favor.


### Store Data

We will use a common Python data structure (e.g. list or dict) to store the data for later use.

Below is a common data collection pattern:
- A. set an empty data structure to store data
- B. iterate through an existing data set container
- C. extract selected pieces of information, disregard the rest of the data
- D. add the selected information to the created data structure
- E. view results

In [150]:
all_headlines = []  # A 

for article in articles: # B
    headline = article.find('h3', attrs={'class':'tertiary-headline'}).text  # C
    all_headlines.append(headline)  # D

all_headlines  # E

['Robotic process automation and cognitive technologies in insurance\n',
 'Holiday 2017: How retailers can secure customer devotion\n\t\t\t\t']

#### Python string methods

Depending on which headlines appear on Deloitte.com on the day the code above is run, there may be newline (\n) and/or tab (\t) characters at the end of a headline. These are excess characters that we do not want to store.

Python string methods will allow us to handle these data quality issues.

In [151]:
# all headlines are in a list; we index to view the first element, stored at index 0
all_headlines[0]

'Robotic process automation and cognitive technologies in insurance\n'

In [152]:
# strip removes all whitespace (e.g. spaces, tabs, new lines)
# from the start and end of a string
all_headlines[0].strip()

'Robotic process automation and cognitive technologies in insurance'

In [153]:
# replace swaps every occurrence of a chosen substring (new lines below)
# for another string of our choosing (here, an empty string)
all_headlines[0].replace('\n', '')

'Robotic process automation and cognitive technologies in insurance'
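
The two methods differ when a string ends with more than one kind of whitespace. The sketch below uses the second headline exactly as it appears in the `all_headlines` output above, so it does not depend on the live site:

```python
headline = 'Holiday 2017: How retailers can secure customer devotion\n\t\t\t\t'

# replace only removes the substring it is told to remove; the tabs remain
replaced = headline.replace('\n', '')

# strip with no arguments removes all leading and trailing whitespace
stripped = headline.strip()

print(repr(replaced))
print(repr(stripped))
```

For trailing whitespace of unknown shape, `strip` is usually the safer choice.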

In [155]:
all_headlines = []

for article in articles:
    # strip the extra whitespace at the start and end of the text
    headline = article.find('h3', attrs={'class':'tertiary-headline'}).text.strip()
    all_headlines.append(headline)

all_headlines

['Robotic process automation and cognitive technologies in insurance',
 'Holiday 2017: How retailers can secure customer devotion']

#### Exercise

In [115]:
# Identify, extract, and store the text of each article

all_text = []

for article in articles:
    text = article.find('div', attrs={'class':'page-description-for-promo'}).text
    all_text.append(text)

all_text

['Adopting a more customer-centric approach will give insurers opportunities to unlock cognitive automation across the insurance value chain.',
 "November 15 | 11 a.m. ET | Participants will learn results of Deloitte's 32nd annual holiday survey and how a customer rapport strategy can tip the competitive war for share in their favor."]

In [119]:
# Identify, extract, and store the headline and text of each article
# store them in a list called all_articles

all_articles = []

for article in articles:
    title = article.find('h3', attrs={'class':'tertiary-headline'}).text.strip()
    text = article.find('div', attrs={'class':'page-description-for-promo'}).text.strip()
    all_articles.append([title, text])

all_articles

[['Robotic process automation and cognitive technologies in insurance',
  'Adopting a more customer-centric approach will give insurers opportunities to unlock cognitive automation across the insurance value chain.'],
 ['Holiday 2017: How retailers can secure customer devotion',
  "November 15 | 11 a.m. ET | Participants will learn results of Deloitte's 32nd annual holiday survey and how a customer rapport strategy can tip the competitive war for share in their favor."]]

In [157]:
# Identify, extract, and store the headline and text of each article
# store them in a dict called articles_map

articles_map = {}

for article in articles:
    title = article.find('h3', attrs={'class':'tertiary-headline'}).text.strip()
    text = article.find('div', attrs={'class':'page-description-for-promo'}).text.strip()

    articles_map[title] = text

articles_map

{'Holiday 2017: How retailers can secure customer devotion': "November 15 | 11 a.m. ET | Participants will learn results of Deloitte's 32nd annual holiday survey and how a customer rapport strategy can tip the competitive war for share in their favor.",
 'Robotic process automation and cognitive technologies in insurance': 'Adopting a more customer-centric approach will give insurers opportunities to unlock cognitive automation across the insurance value chain.'}
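
A dict keyed by title keeps only one entry per title, so two articles with the same headline would overwrite each other. One alternative is a list of dicts with one record per article. A minimal sketch using hypothetical titles and text (not scraped data); the key names `title` and `text` are illustrative:

```python
# Hypothetical (title, text) pairs; note the repeated title
pairs = [
    ('Weekly update', 'First post'),
    ('Weekly update', 'Second post'),
    ('Annual report', 'Third post'),
]

# Dict keyed by title: the second 'Weekly update' overwrites the first
as_dict = {}
for title, text in pairs:
    as_dict[title] = text

# List of dicts: every record is kept, in order
as_records = []
for title, text in pairs:
    as_records.append({'title': title, 'text': text})

print(len(as_dict))
print(len(as_records))
```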


### Try - Except
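    # `reports` is a hypothetical list of tags, e.g. the result of a find_all call
    for report in reports:
        try:
            # find returns None when there is no <a> tag; None['href'] raises TypeError
            report_name = report.find('a')['href']
        except TypeError:
            continue  # skip entries that have no link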
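
The same pattern on a small, self-contained document. This is a sketch under stated assumptions: the HTML and its paths are hypothetical, and Python's built-in `html.parser` is used so that lxml is not required. `find` returns `None` when no tag matches, and indexing `None` raises the `TypeError` that the `except` clause catches.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the second list item has no link
html = """
<ul>
  <li><a href="/reports/one.pdf">Report one</a></li>
  <li>No link here</li>
  <li><a href="/reports/three.pdf">Report three</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

links = []
for report in soup.find_all('li'):
    try:
        # Raises TypeError when find returns None (no <a> tag in this item)
        links.append(report.find('a')['href'])
    except TypeError:
        continue  # skip items without a link and move on to the next one

print(links)
```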


### Context Manager
    # export pdf locally
    filepath = output_paths[url]
    with open(filepath, 'wb') as f:
        f.write(r.content)

    # required delay, stated in the robots.txt
    time.sleep(10)  # ten seconds
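
A context manager closes the file as soon as the `with` block ends, even if an error occurs while writing. A minimal sketch that can be run offline: the bytes stand in for `r.content`, and the file name `report.pdf` and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

# Stand-in for r.content: the raw bytes of a downloaded file
content = b'%PDF-1.4 example bytes'

# Write into a temporary directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as folder:
    filepath = os.path.join(folder, 'report.pdf')

    # 'wb' opens the file for writing in binary mode
    with open(filepath, 'wb') as f:
        f.write(content)

    # The file object is already closed here, without calling f.close()
    closed_after_block = f.closed
    size_on_disk = os.path.getsize(filepath)

print(closed_after_block)
print(size_on_disk)
```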

