In [37]:
import pandas as pd
import numpy as np

# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Webscraping

Week 4 | Day 4

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Describe how web scraping works conceptually
- Explain how web scraping works using Python
- Define how to approach scraping project data

# Webscraping

In data science work, it is often necessary to retrieve data from websites. Occasionally, sites will provide an API that allows their data to be easily accessed, but often this isn't the case. When an API is not available, the only real option is to build a webscraper.


A webscraper retrieves the webpage in the same way your browser retrieves the page, but because we're doing it with code, we are able to parse the resulting site's content.


**So how can we retrieve webpage content programmatically?**<br>
The first step is to understand how HTTP works...

## HTTP

Hypertext Transfer Protocol, or HTTP, is a text-based standard that allows clients and servers to communicate over TCP/IP. 

**HTTP  = the language computers communicate with**<br>
**TCP/IP = the channel over which that communication takes place**

HTTP is based on a client-server model. A client makes a request for some resource, and the server responds with the status of that request and the resource if available.

## HTTP Requests

There are two common types of HTTP requests: **GET and POST**

### GET Requests

GET requests are by far the most common. They simply ask the server to retrieve some resource, typically a webpage, and return it.

<img src="http://i.imgur.com/qBG7jmB.png" width="900">

### POST Requests

A POST request is nearly identical to a GET request, but includes a payload of some sort in the request body. 

<img src="http://i.imgur.com/mzWB0wD.png" width=900>

## Typical Use Cases

GET requests are the standard way to request a webpage (as your browser would do). Some simple forms will use GET as well.

More sophisticated forms typically use a POST request. GET requests pass parameters in the URL, while POST requests send them in the request body. This keeps the parameters out of browser history and server logs, which tends to make POST requests somewhat safer for sensitive data.

N.B. Do not rely on POST alone as a security measure!
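
The difference is easy to see without sending anything. The sketch below builds one request of each kind using the standard library's `urllib` (chosen here only so that nothing has to be installed or sent over the network); `example.com` and the parameter names are placeholders, not part of the lesson.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields, used only for illustration
params = {"q": "data science", "page": "2"}
encoded = urlencode(params)

# GET: the parameters travel in the URL itself
get_req = Request("http://example.com/search?" + encoded)

# POST: the URL stays clean and the parameters go in the request body
post_req = Request("http://example.com/search", data=encoded.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

Note that the POST body is just as readable as the URL to anyone who can see the traffic, which is why POST by itself is not a security measure.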

## So once you make a request, naturally you expect a response...


In the language of HTTP, every response begins with a status code.

## HTTP Response Codes

- 1XX - Informational
- 2XX - Success
- 3XX - Redirection
- 4XX - Client Error
- 5XX - Server Error

### Response Codes - The Greatest Hits

- **200 - OK** - The requested action was successfully executed
- **301 - Moved Permanently** - The resource has been relocated (and will not be back, so please stop asking me)
- **400 - Bad Request** - The client request is malformed in some way
- **403 - Forbidden** - The requesting client (i.e. you) does not have permission to view the resource
- **404 - Not Found** - The resource can't be found at the moment (may be in the future, so check back later)
- **405 - Method Not Allowed** - Used GET when only POST was applicable for example
- **418 - I'm a teapot** - For when the server is a teapot
- **420 - NOT an HTTP code** - you're thinking of something else
- **429 - Too Many Requests** - They're on to you and if you keep it up, they'll block you permanently
- **500 - Internal Server Error** - Something non-specific went wrong on their end
- **502 - Bad Gateway** - The server was waiting on another resource and it ended badly
- **503 - Service Unavailable** - The server is overloaded or down at the moment
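
As a rough sketch, the first digit of a code tells you which family it belongs to, and Python's standard `http.HTTPStatus` enum knows the official phrase for registered codes. The helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus

# Families keyed by the first digit of the status code
FAMILIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def describe_status(code):
    """Return (family, phrase) for an HTTP status code."""
    family = FAMILIES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes such as 420 are not in the standard registry
        phrase = "not a registered status code"
    return family, phrase

for code in (200, 301, 404, 420, 429, 503):
    print(code, describe_status(code))
```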

## So that is the basic language of the web, now how do we actually use this to get our content...

## Python Requests

<img src="http://i.imgur.com/qpfNAPb.png" width="900">

Requests lets us send a GET or POST request to a server; in return we receive the response code and, where applicable, the content.

## First, we make a request to retrieve a website

In [38]:
import requests

In [39]:
r = requests.get('http://news.ycombinator.com')

## We can check the response code

In [40]:
r

<Response [200]>

### Check: What is a 200? Is that good or bad for what we're trying to do?

In [41]:
#200 Means OK

## Let's see the request headers we sent

In [42]:
r.request.headers

{'Connection': 'keep-alive', 'Cookie': '__cfduid=d01e9ee16ed2e8f1d3dc61c19a96a4cdf1476386795', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.11.1'}

In [43]:
# we can print those out nicely
for k, v in r.request.headers.items():
    print(k + ':', v)

('Connection:', 'keep-alive')
('Accept-Encoding:', 'gzip, deflate')
('Accept:', '*/*')
('User-Agent:', 'python-requests/2.11.1')
('Cookie:', '__cfduid=d01e9ee16ed2e8f1d3dc61c19a96a4cdf1476386795')


## We can also see the response headers

In [44]:
r.headers

{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Strict-Transport-Security': 'max-age=31556900; includeSubDomains', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare-nginx', 'Connection': 'keep-alive', 'Cache-Control': 'private, max-age=0', 'Date': 'Thu, 13 Oct 2016 19:26:35 GMT', 'X-Frame-Options': 'DENY', 'Content-Type': 'text/html; charset=utf-8', 'CF-RAY': '2f152f229f554746-EWR'}

In [45]:
for k, v in r.headers.items():
    print(k + ':', v)

('Date:', 'Thu, 13 Oct 2016 19:26:35 GMT')
('Content-Type:', 'text/html; charset=utf-8')
('Transfer-Encoding:', 'chunked')
('Connection:', 'keep-alive')
('Vary:', 'Accept-Encoding')
('Cache-Control:', 'private, max-age=0')
('X-Frame-Options:', 'DENY')
('Strict-Transport-Security:', 'max-age=31556900; includeSubDomains')
('Content-Encoding:', 'gzip')
('Server:', 'cloudflare-nginx')
('CF-RAY:', '2f152f229f554746-EWR')


## Let's see what content came back

In [46]:
r.content

'<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?0jKc9Keyn2Zl7D1UAQcy">\n        <link rel="shortcut icon" href="favicon.ico">\n          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">\n        <title>Hacker News</title>\n      </head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">\n        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="http://www.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>\n                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>\n              <a href="newest">new</a> | <a href="newcomments">comments</

## We can wrap that in HTML() to render the page in the notebook

In [47]:
from IPython.core.display import HTML
HTML(r.content.decode('utf-8'))

0
Hacker News  new | comments | show | ask | jobs | submit login
"1. Cooled Nikon D5500a Chills the Sensor for Clearer Star Photos (petapixel.com)  130 points by uptown 3 hours ago | hide | 49 comments 2. The Nobel Prize in Literature 2016 awarded to Bob Dylan (nobelprize.org)  563 points by eCa 8 hours ago | hide | 248 comments 3. Google's “Director of Engineering” Hiring Test (gwan.com)  1099 points by fatihky 4 hours ago | hide | 497 comments 4. Inside the New York Public Library's Last, Secret Apartments (atlasobscura.com)  67 points by Tomte 3 hours ago | hide | 8 comments 5. Certificate Revocation Issue (globalsign.com)  69 points by directionless 2 hours ago | hide | 20 comments 6. Remediation Plan for WoSign and StartCom (groups.google.com)  31 points by asayler 1 hour ago | hide | 30 comments 7. Computational Thinking Benefits Society (2014) (toronto.edu)  37 points by sonabinu 3 hours ago | hide | 9 comments 8. Segment (YC S11) Is Hiring Senior Solutions Engineers (greenhouse.io)  18 minutes ago | hide 9. Ask HN: What is your favorite YouTube channel for developers?  96 points by justanton 1 hour ago | hide | 26 comments 10. Leonard Cohen Makes It Darker (newyorker.com)  145 points by ehudla 7 hours ago | hide | 39 comments 11. Show HN: CloudRail – API Integration Solution (cloudrail.com)  18 points by gro_us 1 hour ago | hide | 14 comments 12. Analyzing the Patterns of Numbers in 10M Passwords (2015) (minimaxir.com)  101 points by BeautifulData 7 hours ago | hide | 26 comments 13. What were Einstein and Gödel talking about? (2005) (newyorker.com)  157 points by cZuLi 8 hours ago | hide | 36 comments 14. Peer pressure’s effects are perhaps more powerful than we thought (2014) (washingtonpost.com)  74 points by thebent 8 hours ago | hide | 39 comments 15. Differentiable Neural Computers (deepmind.com)  232 points by tonybeltramelli 13 hours ago | hide | 61 comments 16. Show HN: Styled-components – Use the best of ES6 to style React apps (styled-components.com)  148 points by mxstbr 8 hours ago | hide | 58 comments 17. 
Canonical releases Ubuntu 16.10 (ubuntu.com)  214 points by Jarlakxen 5 hours ago | hide | 146 comments 18. Timing the time it takes to parse time (ayende.com)  95 points by yread 9 hours ago | hide | 50 comments 19. Giant Concrete Arrows That Point Your Way Across America (2013) (cntraveler.com)  71 points by denzell 9 hours ago | hide | 3 comments 20. Thailand's King Bhumibol Adulyadej Dies at 88 (bbc.com)  84 points by Osiris30 7 hours ago | hide | 47 comments 21. The New York Times’s Response to Donald Trump’s Retraction Letter (nytco.com)  51 points by The_ed17 43 minutes ago | hide | 11 comments 22. Tech luminaries laud Dennis Ritchie 5 years after death (cnet.com)  36 points by mgiannopoulos 2 hours ago | hide | 7 comments 23. Really Bad Chess makes chess fun even if you’re really bad (theverge.com)  62 points by Swifty 3 hours ago | hide | 11 comments 24. At the World's First Cybathlon, Proud Cyborg Athletes Raced for the Gold (ieee.org)  29 points by timgluz 8 hours ago | hide | 3 comments 25. Lèse-majesté (wikipedia.org)  13 points by mzs 1 hour ago | hide | 2 comments 26. Twitter bot is tracking dictators' flights in and out of Geneva (theverge.com)  139 points by jonbaer 6 hours ago | hide | 33 comments 27. Show HN: TakeAim – Expose your team's daily aims (takeaim.io)  50 points by bmark757 7 hours ago | hide | 18 comments 28. ‘I Is Someone Else’ (2005) (nybooks.com)  14 points by var_eps 6 hours ago | hide | discuss 29. A Mexican architect has a vision for a city straddling the U.S.-Mexico border (citylab.com)  53 points by waqasaday 6 hours ago | hide | 61 comments 30. Ask HN: How can I create a decentralized GNU-social-compatible website?  69 points by rayalez 7 hours ago | hide | 21 comments More"
Guidelines  | FAQ  | Support  | API  | Security  | Lists  | Bookmarklet  | DMCA  | Apply to YC  | Contact Search:



## Exercise

- Using the requests library, retrieve a webpage of your choosing with a GET request
- Examine the response code, the headers, and the content
- Use ```IPython.core.display's HTML()``` to display the page in your notebook 
- Compare the results with the actual page you requested in your browser
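
If you want to rehearse these steps without hitting a live site, one option is to serve a page to yourself. The sketch below is an assumption-laden stand-in for the exercise: it uses only the standard library (`http.server` and `urllib` instead of requests), and the handler name and page body are invented for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

class HelloHandler(BaseHTTPRequestHandler):
    """Answer every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><h1>Hello, scraper!</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the notebook output free of per-request log lines

# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/" % server.server_address[1]
opener = build_opener(ProxyHandler({}))  # ignore any configured proxy
with opener.open(url) as resp:
    status = resp.status                 # the response code
    headers = dict(resp.getheaders())    # the response headers
    content = resp.read()                # the raw bytes of the page

server.shutdown()
server.server_close()

print(status)
print(headers["Content-Type"])
print(content.decode("utf-8"))
```

The three values printed at the end correspond to the response code, headers, and content you are asked to examine above.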

In [48]:
r1 = requests.get('http://gizmodo.com/')

r1

<Response [200]>

In [49]:
r1.request.headers


{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.11.1'}

In [50]:
for k, v in r1.headers.items():
    print(k + ':', v)

('Cache-Control:', 'stale-if-error=86400, stale-while-revalidate=300')
('Content-Encoding:', 'gzip')
('Content-Type:', 'text/html; charset=utf-8')
('P3P:', 'CP="IDC DSP COR CURa ADMa OUR IND PHY ONL COM STA"')
('Strict-Transport-Security:', 'max-age=0')
('X-Content-Type-Options:', 'nosniff')
('X-Kinja:', 'app03.xyz.kinja-ops.com #1772')
('X-Kinja-Build:', '1772')
('X-Kinja-Revision:', '5e815aa6e32b94a520e334d4c7a5c7bdc229ab38')
('X-Kinja-Server:', 'app03.xyz.kinja-ops.com')
('X-XSS-Protection:', '1; mode=block')
('x-cdn-fetch:', 'mantle-default')
('Content-Length:', '74397')
('Accept-Ranges:', 'bytes')
('Date:', 'Thu, 13 Oct 2016 19:26:36 GMT')
('Via:', '1.1 varnish')
('Age:', '97')
('Connection:', 'keep-alive')
('X-Served-By:', 'cache-jfk8143-JFK')
('X-Cache:', 'HIT')
('X-Cache-Hits:', '13')
('X-Timer:', 'S1476386796.505082,VS0,VE0')
('Vary:', 'Accept-Encoding, X-Feature-Hash, X-Forwarded-Proto,Accept-Encoding')
('X-Geo-Segment:', 'B')
('Set-Cookie:', 'geocc=US;path=/;')


In [51]:
r1.content



In [52]:

from IPython.core.display import HTML
HTML(r1.content.decode('utf-8'))

## Webscraping - The Struggle is real

- Robots.txt
- User Agent
- Ajax
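
A site's `robots.txt` states which paths crawlers are asked to stay away from, and Python's standard library can read it for you. The sketch below parses an invented `robots.txt` from a string so it runs offline; the rules, the `example.com` URLs, and the agent name `my-class-scraper` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt, inlined so no network access is needed
robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

agent = "my-class-scraper"  # hypothetical User-Agent string
allowed_home = parser.can_fetch(agent, "http://example.com/index.html")
allowed_private = parser.can_fetch(agent, "http://example.com/private/data.html")

print("index.html allowed:", allowed_home)
print("private/data.html allowed:", allowed_private)
```

The same agent string is what you would send in the `User-Agent` request header; earlier in this notebook Requests identified itself as `python-requests/2.11.1`, which some sites treat differently from a browser.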

## Ajax - The enemy of the webscraper

In [53]:
r2 = requests.get('https://www.google.com/#q=data+science')

In [54]:
r2

<Response [200]>

In [55]:
# notice anything missing?
HTML(r2.content.decode('latin-1'))

0,1,2
,,Advanced searchLanguage tools


## What is AJAX?

>Conventional web applications transmit information to and from the server using synchronous requests. It means you fill out a form, hit submit, and get directed to a new page with new information from the server.

>With AJAX, when you hit submit, JavaScript will make a request to the server, interpret the results, and **update the current screen**. In the purest sense, the user would never know that anything was even transmitted to the server.

## How do you handle Ajax?

If a site uses AJAX to load the content you need to scrape, **you will have to use a browser object** to retrieve it.

The difference between a library like requests and an actual browser object is that requests just sends and receives text. The browser object "renders" the webpage just like Firefox or Chrome does. 
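
One detail of the Google URL used above helps explain the empty result: everything after `#` is a URL fragment, which clients keep to themselves and do not send to the server. A quick way to check this, sketched with the standard library's `urlsplit`:

```python
from urllib.parse import urlsplit

parts = urlsplit("https://www.google.com/#q=data+science")

print("path sent to server: ", repr(parts.path))
print("query sent to server:", repr(parts.query))
print("fragment (kept local):", repr(parts.fragment))
```

The server only ever sees a request for `/`. It is JavaScript running in a browser that reads the fragment and fetches the search results afterwards, which a plain HTTP client never does.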

So how do we do this? We'll need two libraries to accomplish this...


- Selenium

- PhantomJS

## Selenium

- Selenium is a browser automation library (used extensively in testing)<br>

 <img src="http://i.imgur.com/WLs22wp.png" width=500>

## PhantomJS

PhantomJS is a "headless" browser. It gives us all the functionality available in a full browser, but without the overhead of a UI.

<img src="http://i.imgur.com/hN5trU9.png" width="500">

## Using Selenium with PhantomJS

In [67]:
from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='/Users/student/Downloads/phantomjs-2.1.1-macosx/bin/phantomjs')
driver.set_window_size(1024, 768) 
driver.get('https://www.google.com/#q=data+science')

In [68]:
# .page_source gives us our document
HTML(driver.page_source)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
,,,,,,,,,,,
,,,,,,,,,,,
,,,,,,,,,,,
,AllImagesVideosNewsShoppingBooksMaps,AllImagesVideosNewsShoppingBooksMaps,,,,,,,,,
Search OptionsAny timePast hourPast 24 hoursPast weekPast monthPast yearAll resultsVerbatim,"About 62,900,000 resultsData Science - 12 Weeks - generalassemb.lyAdwww.generalassemb.ly/Data-Science‎Learn Python, Git, Unix & More. Join Our Tech Community Today.Advance Your Career · Bring Ideas to LifeCourses: Git, UNIX, & Relational Databases, Data Analysis & Python…902 Broadway, 4th Floor, New York, NYData Science Jobs - The Best Tech Companies Are HiringAdwww.indeed.com/Prime‎Find a Job, Get $2K Signing BonusFree for Candidates - Referral Bonus - For EmployersData Science - Find Your Dream Job With Hired - hired.comAdwww.hired.com/NewYork/new-york-city‎Don't Waste Time Job Searching. Let Tech Companies Apply to You!Salaries Between $75-250k · Apply Once & Get Hired · Join Hired TodayRefer a Friend Bonus - For Employers - Apply Once. Get Hired.Data Science Platform - actusdata.comAdwww.actusdata.com/‎Actus Data, US Based Startup Next Generation AnalyticsData Science Startup - Advisor Prospecting - About US - Meet the TeamData science - Wikipediahttps://en.wikipedia.org/wiki/Data_science‎Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or ... ‎Overview - ‎History - ‎Domain specific interests - ‎CriticismData Science | Courserahttps://www.coursera.org/specializations/jhu-data-science‎Explore Data Science Certificate offered by Johns Hopkins University. Launch Your Career in Data Science - A nine-course introduction to data science, ... Data Science at NYUdatascience.nyu.edu/‎The data science initiative at NYU is a university-wide effort to establish the country's leading data science training and research facilities at NYU. 
News for data science TechTargetSkills and temperament drive success in cloud-based data scienceTechTarget - 1 day agoDespite the need for more talent in cloud-based data science, not everyone has the right combination of tech skills and temperament to ...VU offers new undergrad degree in data sciencenwitimes.com - 9 hours agoData Science & Technology Program Lowers Risks Associated with AsthmaBusiness Wire - 3 hours agoColumbia Certification of Professional Achievement in Data Sciencesdatascience.columbia.edu/certification‎The Certification of Professional Achievement in Data Sciences prepares students to expand their career prospects or change career paths by developing  ... Intro to Data Science Online Course | Udacityhttps://www.udacity.com/course/intro-to-data-science--ud359‎Intro to Data Science covers the basics of big data through data manipulation, analysis and communication while completing a hands on data science project. Data Science Company | Data Science Platform | Data Science ...https://www.datascience.com/‎The DataScience Cloud is a data science platform that brings together best-in- class tools, infrastructure, and expertise in a modern, full-service offering. What is Data Science?https://datascience.berkeley.edu/about/what-is-data-science/‎The supply of professionals who can work effectively with data at scale is limited, and is reflected by rapidly rising salaries for data engineers, data scientists, ... Data Science Certificate | Harvard Extensionhttps://www.extension.harvard.edu/academics/.../data-science-certificate‎Learn to interpret big data sets with the data science certificate from Harvard Extension. Your analysis education can start today. Data Science Essentials | edXhttps://www.edx.org/.../data-science-essentials-microsoft-dat203-1x-1‎Explore data visualization and exploration concepts with experts from MIT and Microsoft, and get an introduction to machine learning. 
Google Cloud Big Data - Build at the Speed of GoogleAdcloud.google.com/‎Create Simple Sites to Complex AppsHighlights: Analytics Data Warehouse, Batch And Stream Data Processing…Get Started - Compute Engine - Pricing - Case Studies - Deploy Bitnami AppsBig Data Hadoop Dummies® Guide - Workflow Automation and HadoopAdwww.bmc.com/‎Streamline your data operation by managing big data workflows. Get free guide.BigData Staffing Insights · Automation Benefits · Cutting Edge InformationCustomer Success Stories - Contact Sales - Free Trial - Product WhitepaperDeVry® Business Analytics - Learn How to Analyze Big DataAdwww.devry.edu/‎Help Drive Strategic Business Decisions.180 Madison Ave., Ste. 900, Midtown Manhattan, NYSearches related to data sciencedata science coursedata science courseradata science salarydata science certificationdata science degreedata science pdfdata science definitiondata science books12345678910NextAdvanced searchSearch Help Send feedbackGoogle Home Advertising Programs Business Solutions Privacy Terms About Google",,,,,,,,,,
TechTarget,"Skills and temperament drive success in cloud-based data scienceTechTarget - 1 day agoDespite the need for more talent in cloud-based data science, not everyone has the right combination of tech skills and temperament to ...VU offers new undergrad degree in data sciencenwitimes.com - 9 hours agoData Science & Technology Program Lowers Risks Associated with AsthmaBusiness Wire - 3 hours ago",,,,,,,,,,
data science course,data science coursera,,,,,,,,,,
data science salary,data science certification,,,,,,,,,,
data science degree,data science pdf,,,,,,,,,,
data science definition,data science books,,,,,,,,,,



## Exercise
1. Run `pip install selenium`
2. Download and unzip PhantomJS 2.1.1 from https://bitbucket.org/ariya/phantomjs/downloads
3. Use the library to pull down an AJAX-based page such as Google search results

# Now how do we get the content we want from the page?

## DOM

> The Document Object Model (DOM) is a programming interface for HTML and XML documents. It provides a structured representation of the document and it defines a way that the structure can be accessed from programs so that they can change the document structure, style and content. The DOM provides a representation of the document as a structured group of nodes and objects that have properties and methods. Essentially, it connects web pages to scripts or programming languages.

## Typical Web Page Structure

    <html>
        <head>
        </head>
        <body>
            <div id="header" class="extraFancy">I'm a header!</div>
            <div id="main">
                I'm a div!
                <ul>
                    I'm an unordered list!
                    <li>I'm list item 1</li>
                    <li>I'm list item 2</li>
                </ul>
            </div>
            <div id="footer" class="extraFancy">I'm a footer</div>
        </body>
    </html>
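
To make the "structured group of nodes" idea concrete, here is a minimal sketch that walks a shortened version of the page above and prints each element indented by its depth in the tree. It uses the standard library's `html.parser` purely so it runs without installing anything; the class name `OutlineParser` is invented for this example.

```python
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    """Record every opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        label = tag
        attrs = dict(attrs)
        if "id" in attrs:
            label += "#" + attrs["id"]  # show ids the way CSS writes them
        self.outline.append("  " * self.depth + label)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

page = """
<html>
    <head></head>
    <body>
        <div id="header" class="extraFancy">I'm a header!</div>
        <div id="main">
            <ul>
                <li>I'm list item 1</li>
                <li>I'm list item 2</li>
            </ul>
        </div>
        <div id="footer" class="extraFancy">I'm a footer</div>
    </body>
</html>
"""

parser = OutlineParser()
parser.feed(page)
outline = parser.outline
print("\n".join(outline))
```

Each level of indentation in the output is a parent-child relationship in the DOM, which is exactly what a parsing library lets you navigate.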

In [69]:
page_html = """
    <html>
        <head>
        <title>Super Cool Website!</title>
        </head>
        <body>
            <div id="header" class="extraFancy">I'm a header!</div>
            <div id="main">
                I'm a div!
                <ul>
                    I'm an unordered list!
                    <li>I'm list item 1</li>
                    <li>I'm list item 2</li>
                </ul>
            </div>
            <div id="footer" class="extraFancy">I'm a footer</div>
        </body>
    </html>
"""

## We're going to feed this full HTML into a library called Beautiful Soup

<img src="http://i.imgur.com/klVeXY7.png" width="800">

## Coding BeautifulSoup

In [70]:
from bs4 import BeautifulSoup

## Pass the HTML into the BS object

In [71]:
soup = BeautifulSoup(page_html, "lxml")

From there it can be searched and parsed

## Print the html

In [72]:
print(soup.prettify())

<html>
 <head>
  <title>
   Super Cool Website!
  </title>
 </head>
 <body>
  <div class="extraFancy" id="header">
   I'm a header!
  </div>
  <div id="main">
   I'm a div!
   <ul>
    I'm an unordered list!
    <li>
     I'm list item 1
    </li>
    <li>
     I'm list item 2
    </li>
   </ul>
  </div>
  <div class="extraFancy" id="footer">
   I'm a footer
  </div>
 </body>
</html>



## Let's now do some parsing of the HTML using the DOM

## Get the title

In [73]:
soup.title

<title>Super Cool Website!</title>

In [74]:
soup.title.text

u'Super Cool Website!'

## Find - get the first result

In [75]:
soup.find('div')

<div class="extraFancy" id="header">I'm a header!</div>

## FindAll - get all matching results

In [76]:
# enumerate() supplies the running index for us
for i, d in enumerate(soup.findAll('div')):
    print(i, d)
    print('\n')

(0, <div class="extraFancy" id="header">I'm a header!</div>)


(1, <div id="main">\n                I'm a div!\n                <ul>\n                    I'm an unordered list!\n                    <li>I'm list item 1</li>\n<li>I'm list item 2</li>\n</ul>\n</div>)


(2, <div class="extraFancy" id="footer">I'm a footer</div>)




## Get the page's text

In [55]:
print(soup.text)



Super Cool Website!


I'm a header!

                I'm a div!
                
                    I'm an unordered list!
                    I'm list item 1
I'm list item 2


I'm a footer





## Get the class of an element

In [50]:
# find returns the first result
soup.find('div')['class']

['extraFancy']

## Search by the id of an element

In [51]:
print(soup.find(id='main'))

<div id="main">
                I'm a div!
                <ul>
                    I'm an unordered list!
                    <li>I'm list item 1</li>
<li>I'm list item 2</li>
</ul>
</div>


## Search by the class

In [52]:
#  note the underscore after class
print(soup.findAll(class_='extraFancy'))

[<div class="extraFancy" id="header">I'm a header!</div>, <div class="extraFancy" id="footer">I'm a footer</div>]


## Get the children of an element

In [53]:
my_ul = soup.find('ul')

In [57]:
print(my_ul)

<ul>
                    I'm an unordered list!
                    <li>I'm list item 1</li>
<li>I'm list item 2</li>
</ul>


In [54]:
my_ul.findChildren()

[<li>I'm list item 1</li>, <li>I'm list item 2</li>]

## Exercise

Using Requests and BeautifulSoup, pull down Hacker News and print out the headlines and the story links in your notebook

In [77]:
rhn = requests.get('https://news.ycombinator.com/')

rhn 

<Response [200]>

In [84]:
print(rhn.headers)

{'Content-Encoding': 'gzip', 'Transfer-Encoding': 'chunked', 'Set-Cookie': '__cfduid=da58cc0041fcf9073804ccd930d08ce011476389026; expires=Fri, 13-Oct-17 20:03:46 GMT; path=/; domain=.ycombinator.com; HttpOnly', 'Strict-Transport-Security': 'max-age=31556900; includeSubDomains', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare-nginx', 'Connection': 'keep-alive', 'Cache-Control': 'private, max-age=0', 'Date': 'Thu, 13 Oct 2016 20:03:46 GMT', 'X-Frame-Options': 'DENY', 'Content-Type': 'text/html; charset=utf-8', 'CF-RAY': '2f15659a6dcc470a-EWR'}


In [81]:
print(rhn.content)

<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?0jKc9Keyn2Zl7D1UAQcy">
        <link rel="shortcut icon" href="favicon.ico">
          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
        <title>Hacker News</title>
      </head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="http://www.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>
                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
              <a href="newest">new</a> | <a href="newcomments">comments</a> | <a 

In [83]:
for k, v in rhn.headers.items():
    print(k + ':', v)

('Date:', 'Thu, 13 Oct 2016 20:03:46 GMT')
('Content-Type:', 'text/html; charset=utf-8')
('Transfer-Encoding:', 'chunked')
('Connection:', 'keep-alive')
('Set-Cookie:', '__cfduid=da58cc0041fcf9073804ccd930d08ce011476389026; expires=Fri, 13-Oct-17 20:03:46 GMT; path=/; domain=.ycombinator.com; HttpOnly')
('Vary:', 'Accept-Encoding')
('Cache-Control:', 'private, max-age=0')
('X-Frame-Options:', 'DENY')
('Strict-Transport-Security:', 'max-age=31556900; includeSubDomains')
('Content-Encoding:', 'gzip')
('Server:', 'cloudflare-nginx')
('CF-RAY:', '2f15659a6dcc470a-EWR')


In [103]:
from bs4 import BeautifulSoup

rhnsoup = BeautifulSoup(rhn.content, "lxml")

rhnsoup

<html op="news"><head><meta content="origin" name="referrer"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/><link href="news.css?0jKc9Keyn2Zl7D1UAQcy" rel="stylesheet" type="text/css"/>\n<link href="favicon.ico" rel="shortcut icon"/>\n<link href="rss" rel="alternate" title="RSS" type="application/rss+xml"/>\n<title>Hacker News</title>\n</head><body><center><table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">\n<tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%"><tr><td style="width:18px;padding-right:4px"><a href="http://www.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a></td>\n<td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>\n<a href="newest">new</a> | <a href="newcomments">comments</a> | <a href="show">show</a> | <a href="ask">ask</a> | <a href="job

In [104]:
print(rhnsoup.prettify())

<html op="news">
 <head>
  <meta content="origin" name="referrer"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="news.css?0jKc9Keyn2Zl7D1UAQcy" rel="stylesheet" type="text/css"/>
  <link href="favicon.ico" rel="shortcut icon"/>
  <link href="rss" rel="alternate" title="RSS" type="application/rss+xml"/>
  <title>
   Hacker News
  </title>
 </head>
 <body>
  <center>
   <table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">
    <tr>
     <td bgcolor="#ff6600">
      <table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%">
       <tr>
        <td style="width:18px;padding-right:4px">
         <a href="http://www.ycombinator.com">
          <img height="18" src="y18.gif" style="border:1px white solid;" width="18"/>
         </a>
        </td>
        <td style="line-height:12pt; height:10px;">
         <span class="pagetop">
          <b class="hnname">
           <a href="news">
   

In [105]:
# search the Hacker News soup, not the toy page parsed earlier
title_link = rhnsoup.findAll(class_='title')

In [139]:
# positional arguments (the tag name) must come before keyword arguments;
# 'storylink' is the headline class seen in the output captured in this notebook
for link in rhnsoup.findAll('a', class_='storylink'):
    print(link.text)
    print(link["href"])
    print('\n')

In [102]:
# use the Hacker News soup (rhnsoup) rather than the toy page in soup
for i, d in enumerate(rhnsoup.findAll(class_='storylink')):
    print(i, d)
    print('\n')

(0, <a class="storylink" href="http://www.bitmatica.com/blog/an-open-source-self-hosted-heroku/">An Open Source, Self-Hosted Heroku</a>)


(1, <a class="storylink" href="http://petapixel.com/2016/10/11/cooled-nikon-d5500a-chills-sensor-clearer-star-photos/?utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed%25253A+PetaPixel+%252528PetaPixel%252529">Cooled Nikon D5500a Chills the Sensor for Clearer Star Photos</a>)


(2, <a class="storylink" href="http://www.nobelprize.org/nobel_prizes/literature/laureates/2016/press.html">The Nobel Prize in Literature 2016 awarded to Bob Dylan</a>)


(3, <a class="storylink" href="http://www.gwan.com/blog/20160405.html">Google's \u201cDirector of Engineering\u201d Hiring Test</a>)


(4, <a class="storylink" href="http://www.atlasobscura.com/articles/inside-the-new-york-public-librarys-last-secret-apartments">Inside the New York Public Library's Last, Secret Apartments</a>)


(5, <a class="storylink" href="https://downloads.globalsign.com/ac

In [95]:
print(rhnsoup.findAll(class_='storylink'))

[<a class="storylink" href="http://www.bitmatica.com/blog/an-open-source-self-hosted-heroku/">An Open Source, Self-Hosted Heroku</a>, <a class="storylink" href="http://petapixel.com/2016/10/11/cooled-nikon-d5500a-chills-sensor-clearer-star-photos/?utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed%25253A+PetaPixel+%252528PetaPixel%252529">Cooled Nikon D5500a Chills the Sensor for Clearer Star Photos</a>, <a class="storylink" href="http://www.nobelprize.org/nobel_prizes/literature/laureates/2016/press.html">The Nobel Prize in Literature 2016 awarded to Bob Dylan</a>, <a class="storylink" href="http://www.gwan.com/blog/20160405.html">Google's \u201cDirector of Engineering\u201d Hiring Test</a>, <a class="storylink" href="http://www.atlasobscura.com/articles/inside-the-new-york-public-librarys-last-secret-apartments">Inside the New York Public Library's Last, Secret Apartments</a>, <a class="storylink" href="https://downloads.globalsign.com/acton/fs/blocks/showLandingPage/a/2
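
The exercise asks for headlines and links together, so it can help to collect them as pairs rather than printing whole tags. The sketch below shows the idea on two links copied from the output above plus one navigation link that should be skipped. It uses the standard library's `html.parser` only so that it runs anywhere; the class name `StoryLinkParser` is invented, and `storylink` is simply the class that appears in this notebook's captured output.

```python
from html.parser import HTMLParser

class StoryLinkParser(HTMLParser):
    """Collect (headline, href) pairs from <a class="storylink"> tags."""

    def __init__(self):
        super().__init__()
        self.stories = []
        self._in_story = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "storylink" in classes:
            self._in_story = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_story:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_story:
            self.stories.append(("".join(self._text), self._href))
            self._in_story = False

sample = (
    '<a href="newest">new</a>'
    '<a class="storylink" href="http://www.bitmatica.com/blog/'
    'an-open-source-self-hosted-heroku/">An Open Source, Self-Hosted Heroku</a>'
    '<a class="storylink" href="http://www.nobelprize.org/nobel_prizes/'
    'literature/laureates/2016/press.html">'
    'The Nobel Prize in Literature 2016 awarded to Bob Dylan</a>'
)

parser = StoryLinkParser()
parser.feed(sample)
stories = parser.stories

for headline, href in stories:
    print(headline)
    print(href)
```

A list of pairs like `stories` is also a convenient shape to hand to pandas later, since each tuple becomes one row.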

## Now for the Easy Way

## Import.io

In Import.io, enter the URL "http://www.zillow.com/new-york-city-ny/apartments/".

## Independent Practice

1. Programmatically run a Google search for 'Data Science' using Selenium and PhantomJS

2. Retrieve only the links and their titles using BeautifulSoup, and keep the ads out of your list

3. Place those into a DataFrame

In [124]:
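from selenium import webdriver

# Start a headless PhantomJS browser; adjust executable_path to your own install
driver1 = webdriver.PhantomJS(executable_path='/Users/student/Downloads/phantomjs-2.1.1-macosx/bin/phantomjs')
driver1.set_window_size(1024, 768)
# Load the Google results page for the query "data science"
driver1.get('https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=data+science')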

In [125]:
from IPython.display import HTML

# Render the retrieved page source inline in the notebook
HTML(driver1.page_source)

[Rendered output: the Google results page for "data science" (about 60,600,000 results), containing ads, organic results such as Wikipedia, Coursera and NYU, a "News for data science" block, and related searches]


In [133]:
from bs4 import BeautifulSoup

# Parse the page source retrieved by Selenium with the lxml parser
driversoup = BeautifulSoup(driver1.page_source, "lxml")

driversoup

<!DOCTYPE html>\n<html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="en"><head><meta content="text/html; charset=unicode-escape" http-equiv="Content-Type"/><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/><link href="/images/branding/product/ico/googleg_lodp.ico" rel="shortcut icon"/><noscript>&lt;meta content="0;url=/search?q=data+science&amp;amp;gbv=1&amp;amp;sei=DfP_V83jJYPp-AGeiJXQAg" http-equiv="refresh"&gt;&lt;style&gt;table,div,span,p{display:none}&lt;/style&gt;&lt;div style="display:block"&gt;Please click &lt;a href="/search?q=data+science&amp;amp;gbv=1&amp;amp;sei=DfP_V83jJYPp-AGeiJXQAg"&gt;here&lt;/a&gt; if you are not redirected within a few seconds.&lt;/div&gt;</noscript><title>data science - Google Search</title><style>#gb{font:13px/27px Arial,sans-serif;height:30px}#gbz,#gbg{position:absolute;white-space:nowrap;top:0;height:30px;z-index:1000}#gbz{left:0;padding-left:4px}#gbg{right:0;padding-right:5px}#gbs{b

In [144]:
print(driversoup.prettify())

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>
  <link href="/images/branding/product/ico/googleg_lodp.ico" rel="shortcut icon"/>
  <noscript>
   &lt;meta content="0;url=/search?q=data+science&amp;amp;gbv=1&amp;amp;sei=DfP_V83jJYPp-AGeiJXQAg" http-equiv="refresh"&gt;&lt;style&gt;table,div,span,p{display:none}&lt;/style&gt;&lt;div style="display:block"&gt;Please click &lt;a href="/search?q=data+science&amp;amp;gbv=1&amp;amp;sei=DfP_V83jJYPp-AGeiJXQAg"&gt;here&lt;/a&gt; if you are not redirected within a few seconds.&lt;/div&gt;
  </noscript>
  <title>
   data science - Google Search
  </title>
  <style>
   #gb{font:13px/27px Arial,sans-serif;height:30px}#gbz,#gbg{position:absolute;white-space:nowrap;top:0;height:30px;z-index:1000}#gbz{left:0;padding-left:4px}#gbg{righ

In [152]:
# Each result title is an <h3 class="r"> heading that wraps the result link
for tl in driversoup.findAll('h3', class_='r'):
    print(tl.text)
    print(tl.find('a')['href'])
    print('\n')

Data science - Wikipedia
/url?q=https://en.wikipedia.org/wiki/Data_science&sa=U&ved=0ahUKEwiNnsrT09jPAhWDND4KHR5EBSoQFghCMAA&usg=AFQjCNFQh0RwboJfKUkxOpkW7aPg9OGy3A


Data Science | Coursera
/url?q=https://www.coursera.org/specializations/jhu-data-science&sa=U&ved=0ahUKEwiNnsrT09jPAhWDND4KHR5EBSoQFghNMAE&usg=AFQjCNEK59LpjiNInQWIC3fivzEiD7zICQ


Data Science at NYU
/url?q=http://datascience.nyu.edu/&sa=U&ved=0ahUKEwiNnsrT09jPAhWDND4KHR5EBSoQFghTMAI&usg=AFQjCNE6eEHKOE8v9038R-YnfVdcBu9lTw


News for data science
/search?q=data+science&prmd=ivnsb&source=univ&tbm=nws&tbo=u&sa=X&ved=0ahUKEwiNnsrT09jPAhWDND4KHR5EBSoQqAIIWQ


Certification of Professional Achievement in Data Sciences | Data ...
/url?q=http://datascience.columbia.edu/certification&sa=U&ved=0ahUKEwiNnsrT09jPAhWDND4KHR5EBSoQFghhMAY&usg=AFQjCNGwYCPLO41a6e3dcNzHwGCg0eN2ow


Intro to Data Science Online Course | Udacity
/url?q=https://www.udacity.com/course/intro-to-data-science--ud359&sa=U&ved=0ahUKEwiNnsrT09jPAhWDND4KHR5EBSoQFghnMA
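
The hrefs printed above are Google redirect paths (`/url?q=...`), not the destination URLs, and the "News for data science" entry points back to a `/search` page. As a hedged sketch, assuming the `/url?q=` format seen in this output (Google's markup changes over time), a hypothetical helper `clean_google_href` can recover the target URL with the standard library and drop anything that is not a result redirect:

```python
from urllib.parse import urlparse, parse_qs

def clean_google_href(href):
    """Return the target of a '/url?q=...' redirect, or None for other links.

    Hypothetical helper for illustration; it assumes the redirect format
    shown in the output above.
    """
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else None

# (title, href) pairs copied from the output above
scraped = [
    ("Data science - Wikipedia",
     "/url?q=https://en.wikipedia.org/wiki/Data_science&sa=U&ved=0ahUKEwiNnsrT09jPAhWDND4KHR5EBSoQFghCMAA&usg=AFQjCNFQh0RwboJfKUkxOpkW7aPg9OGy3A"),
    ("News for data science",
     "/search?q=data+science&prmd=ivnsb&source=univ&tbm=nws&tbo=u&sa=X&ved=0ahUKEwiNnsrT09jPAhWDND4KHR5EBSoQqAIIWQ"),
    ("Data Science at NYU",
     "/url?q=http://datascience.nyu.edu/&sa=U&ved=0ahUKEwiNnsrT09jPAhWDND4KHR5EBSoQFghTMAI&usg=AFQjCNE6eEHKOE8v9038R-YnfVdcBu9lTw"),
]

# Keep only entries whose href is a result redirect, paired with the clean URL
rows = []
for title, href in scraped:
    url = clean_google_href(href)
    if url is not None:
        rows.append((title, url))

for title, url in rows:
    print(title, "->", url)
```

For step 3 of the exercise, `rows` can be passed to `pd.DataFrame(rows, columns=['title', 'url'])`.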