# Web Scraping Intro

Sometimes, data that is viewable on the web is not available through an API or as a downloadable CSV or other file. In these cases, it may be possible to scrape that data.

In this notebook, we'll see how we can retrieve the contents of a website and then parse the resulting HTML to extract the data we want.

For this, we'll again be using the [_requests_](https://requests.readthedocs.io/en/master/) library.

In [1]:
import requests

Let's say we want to pull some data from http://en.wikipedia.org/wiki/Turing_Award. 

We can start by sending a GET request for the contents of the page.

In [2]:
URL = 'http://en.wikipedia.org/wiki/Turing_Award'

response = requests.get(URL)

We can check the status code using the `status_code` attribute.

In [3]:
response.status_code

200

A 200 status code is the standard response for a successful request.  
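Checking the code by eye works in a notebook, but in a script it is often more convenient to let _requests_ do the check. The following is a minimal sketch using the library's `ok` property and `raise_for_status()` method. The two `Response` objects are built by hand, an assumption made purely so the example runs without a network connection; in practice they would come from `requests.get`.

```python
import requests

# Hand-built responses stand in for real ones so this runs offline.
success = requests.Response()
success.status_code = 200

not_found = requests.Response()
not_found.status_code = 404

# .ok is True for any status code below 400.
print(success.ok)
print(not_found.ok)

# raise_for_status() does nothing on success and raises HTTPError on 4xx/5xx.
success.raise_for_status()
try:
    not_found.raise_for_status()
    error_message = None
except requests.HTTPError as err:
    error_message = str(err)

print(error_message)
```

Calling `raise_for_status()` right after `requests.get` is one way to stop early instead of trying to parse an error page.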

Let's see what this request returned.

In [4]:
response.text

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Turing Award - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vecto

It is very hard to decipher the above text. Luckily for us, the [_Beautiful Soup_](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library comes to the rescue. This library assists us in parsing HTML into something usable.

In [5]:
from bs4 import BeautifulSoup

First, we can soupify our response text. Since we are working with HTML, we can specify that we need the html parser.

In [6]:
soup = BeautifulSoup(response.text, 'html.parser')
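When no parser is named, Beautiful Soup picks whichever one it considers best among those installed, so the same code can behave differently on another machine. A small sketch of naming the parser explicitly, using a hand-written HTML snippet (hypothetical, standing in for `response.text`):

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical) standing in for response.text.
html_doc = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Naming the parser keeps results consistent across machines;
# 'html.parser' ships with Python, so nothing extra needs installing.
demo_soup = BeautifulSoup(html_doc, "html.parser")

print(demo_soup.title.text)
print(demo_soup.p.text)
```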

Now, we can print it out in a slightly more readable form.

In [7]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Turing Award - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clien

What we are looking at is the HTML for this page. This is rendered by your browser into the Wikipedia page that you see.

<img src="../assets/html.png">


If you navigate to this page in your browser, you can view page source or inspect elements to see the underlying HTML.

Beautiful Soup lets us search through this HTML and extract the contents we want by tag.

Say we wanted to find the title of this page. We can accomplish this by using the `.find` method on our soup, telling it that we want to find the first `title` tag.

In [8]:
soup.find('title')

<title>Turing Award - Wikipedia</title>

Notice that this returns a bs4 Tag object.

In [9]:
type(soup.find('title'))

bs4.element.Tag

To extract the text, you can use the `.text` attribute.

In [10]:
soup.find('title').text

'Turing Award - Wikipedia'
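One thing to watch for: `.find` returns `None` when nothing matches, so chaining `.text` onto a failed search raises an `AttributeError`. A minimal sketch of guarding against that, where `tag_text` is a hypothetical helper and the HTML is a hand-written stand-in:

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>Demo page</title></head><body></body></html>"
demo_soup = BeautifulSoup(html_doc, "html.parser")


def tag_text(soup, tag_name, default=""):
    """Return the stripped text of the first matching tag, or a default."""
    tag = soup.find(tag_name)
    return tag.get_text(strip=True) if tag is not None else default


print(tag_text(demo_soup, "title"))
print(tag_text(demo_soup, "h1", default="(no heading)"))
```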

The `.find` method finds the first matching tag.

We can find _all_ elements with a particular tag using the `.findAll(<tag>)` method (an older alias of `.find_all`, the name used in the current Beautiful Soup documentation). Say we want to find all images. We'll look for the `img` tag.

In [11]:
images = soup.findAll('img')
images

[<img alt="" aria-hidden="true" class="mw-logo-icon" height="50" src="/static/images/icons/wikipedia.png" width="50"/>,
 <img alt="Wikipedia" class="mw-logo-wordmark" src="/static/images/mobile/copyright/wikipedia-wordmark-en.svg" style="width: 7.5em; height: 1.125em;"/>,
 <img alt="The Free Encyclopedia" class="mw-logo-tagline" height="13" src="/static/images/mobile/copyright/wikipedia-tagline-en.svg" style="width: 7.3125em; height: 0.8125em;" width="117"/>,
 <img alt="This is a featured list. Click here for more information." class="mw-file-element" data-file-height="443" data-file-width="466" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/>,
 <img alt="Photo of Alan Turing" class="mw-file-elemen
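The search can also be narrowed at the time of the call. The sketch below uses `find_all` on a small hand-written snippet (hypothetical HTML) to show the `limit` argument and filtering on the presence of an attribute:

```python
from bs4 import BeautifulSoup

html_doc = """
<div>
  <img src="/a.png" alt="First">
  <img src="/b.png" alt="Second">
  <img alt="No source">
</div>
"""
demo_soup = BeautifulSoup(html_doc, "html.parser")

all_images = demo_soup.find_all("img")            # every <img> tag
first_two = demo_soup.find_all("img", limit=2)    # stop after two matches
with_src = demo_soup.find_all("img", src=True)    # only tags that have a src attribute

print(len(all_images), len(first_two), len(with_src))
```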

Let's look closer at the first image.

In [12]:
first_image = images[0]
first_image

<img alt="" aria-hidden="true" class="mw-logo-icon" height="50" src="/static/images/icons/wikipedia.png" width="50"/>

You can access attributes of a Tag object in the same way that you would access values from a dictionary.

In [13]:
first_image['src']

'/static/images/icons/wikipedia.png'

You can also access attributes safely using `.get`, which returns `None` instead of raising a `KeyError` when the attribute is missing. This is useful when you aren't sure whether every tag has a particular attribute.

In [14]:
# Not safe: raises KeyError if the attribute is missing
first_image['class']

['mw-logo-icon']

In [15]:
# Safe: returns None if the attribute is missing
first_image.get('class')

['mw-logo-icon']

You can also specify a default value when using `get`.

In [16]:
first_image.get('class', default='No Class')

['mw-logo-icon']
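Because the first image does have a `class`, the cells above return the same value either way. The difference only shows up when the attribute is missing. A short sketch on a hand-written tag (hypothetical HTML) that has no `alt` attribute:

```python
from bs4 import BeautifulSoup

demo_soup = BeautifulSoup('<img src="/logo.png" class="icon large">', "html.parser")
image = demo_soup.find("img")

# class is a multi-valued attribute, so Beautiful Soup returns a list.
print(image.get("class"))

# A missing attribute gives None, or the supplied default, with .get ...
print(image.get("alt"))
print(image.get("alt", "No alt text"))

# ... whereas square brackets raise KeyError.
key_error_raised = False
try:
    image["alt"]
except KeyError:
    key_error_raised = True

print(key_error_raised)
```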

If you want to grab a particular attribute for all images, an easy way to do so is with a list comprehension.

In [17]:
image_srcs = [x.get('src') for x in images]

In [18]:
image_srcs

['/static/images/icons/wikipedia.png',
 '/static/images/mobile/copyright/wikipedia-wordmark-en.svg',
 '/static/images/mobile/copyright/wikipedia-tagline-en.svg',
 '//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 '//upload.wikimedia.org/wikipedia/commons/thumb/1/17/Alan_Turing_%281912-1954%29_in_1936_at_Princeton_University.jpg/220px-Alan_Turing_%281912-1954%29_in_1936_at_Princeton_University.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/3/36/Maurice_Vincent_Wilkes_1980_%283%2C_cropped%29.jpg/80px-Maurice_Vincent_Wilkes_1980_%283%2C_cropped%29.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Marvin_Minsky_at_OLPCc.jpg/80px-Marvin_Minsky_at_OLPCc.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/4/49/John_McCarthy_Stanford.jpg/80px-John_McCarthy_Stanford.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Edsger_Wybe_Dijkstra.jpg/80px-Edsger_Wybe_Dijkstra.jpg',
 '//upload.wikimedia.org/wikipedia/commons/thumb
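Notice that none of these values is a complete URL: some are relative to the site root (`/static/...`) and others are protocol-relative (`//upload.wikimedia.org/...`). To download the images, they first need to be resolved against the page URL. One way to do that, sketched here with the standard library's `urljoin` and two of the values from the output above:

```python
from urllib.parse import urljoin

PAGE_URL = "http://en.wikipedia.org/wiki/Turing_Award"

# Two src values copied from the output above: one root-relative, one protocol-relative.
raw_srcs = [
    "/static/images/icons/wikipedia.png",
    "//upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Marvin_Minsky_at_OLPCc.jpg/80px-Marvin_Minsky_at_OLPCc.jpg",
]

# urljoin fills in whatever the src is missing (scheme, host) from the page URL.
absolute_srcs = [urljoin(PAGE_URL, src) for src in raw_srcs]

for src in absolute_srcs:
    print(src)
```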

We can navigate further down the HTML tree to extract other pieces of information.

When scraping from a web page, you should make use of "View Page Source" and/or "Inspect Element" in your web browser.

For example, let's say we want to look at the second header on the page.

In [19]:
soup.findAll('header')[1]

<header class="mw-body-header vector-page-titlebar">
<nav aria-label="Contents" class="vector-toc-landmark">
<div class="vector-dropdown vector-page-titlebar-toc vector-button-flush-left" id="vector-page-titlebar-toc">
<input aria-haspopup="true" aria-label="Toggle the table of contents" class="vector-dropdown-checkbox" data-event-name="ui.dropdown-vector-page-titlebar-toc" id="vector-page-titlebar-toc-checkbox" role="button" type="checkbox"/>
<label aria-hidden="true" class="vector-dropdown-label cdx-button cdx-button--fake-button cdx-button--fake-button--enabled cdx-button--weight-quiet cdx-button--icon-only" for="vector-page-titlebar-toc-checkbox" id="vector-page-titlebar-toc-label"><span class="vector-icon mw-ui-icon-listBullet mw-ui-icon-wikimedia-listBullet"></span>
<span class="vector-dropdown-label-text">Toggle the table of contents</span>
</label>
<div class="vector-dropdown-content">
<div class="vector-unpinned-container" id="vector-page-titlebar-toc-unpinned-container">
</di

Just as we used `.find` and `.findAll` on the full soup, we can use them on a single Tag to search only within that tag.

In [20]:
soup.findAll('header')[1].find('h1').get('id')

'firstHeading'

In [21]:
soup.findAll('header')[1].find('h1').text

'Turing Award'
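Selecting the header by position works, but it depends on how many `header` tags come before the one we want. When the target has an `id`, searching on it directly may be less fragile. A sketch comparing the two approaches on a hand-written snippet (hypothetical HTML, loosely modeled on the structure above):

```python
from bs4 import BeautifulSoup

html_doc = """
<header class="site-header"><a href="/">Home</a></header>
<header class="page-header">
  <h1 id="firstHeading"><span>Turing Award</span></h1>
</header>
"""
demo_soup = BeautifulSoup(html_doc, "html.parser")

# Positional: take the second <header>, then search only inside it.
second_header = demo_soup.find_all("header")[1]
heading = second_header.find("h1")

# By id: the same tag, without depending on how many headers precede it.
same_heading = demo_soup.find("h1", id="firstHeading")

print(heading.text)
print(heading is same_heading)
```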

Now, let's look for the table containing the Turing Award winners.

Using `.findAll` reveals that there are multiple tables on the page.

In [22]:
soup.findAll('table')

[<table class="infobox vevent"><tbody><tr><th class="infobox-above summary" colspan="2">ACM Turing Award</th></tr><tr><td class="infobox-image" colspan="2"><span class="mw-default-size" typeof="mw:File/Frameless"><a class="mw-file-description" href="/wiki/File:Alan_Turing_(1912-1954)_in_1936_at_Princeton_University.jpg"><img alt="Photo of Alan Turing" class="mw-file-element" data-file-height="745" data-file-width="733" decoding="async" height="224" src="//upload.wikimedia.org/wikipedia/commons/thumb/1/17/Alan_Turing_%281912-1954%29_in_1936_at_Princeton_University.jpg/220px-Alan_Turing_%281912-1954%29_in_1936_at_Princeton_University.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/1/17/Alan_Turing_%281912-1954%29_in_1936_at_Princeton_University.jpg/330px-Alan_Turing_%281912-1954%29_in_1936_at_Princeton_University.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/1/17/Alan_Turing_%281912-1954%29_in_1936_at_Princeton_University.jpg/440px-Alan_Turing_%281912-1954%29_in_19

If we know a bit more about what we are looking for, we can include an `attrs` argument and pass a dictionary. 

Go to the Turing Award page in your browser, right-click on the top of the table, and choose "Inspect". You will notice that this table is defined with the tag `<table class="wikitable sortable">`. Armed with this information, we can narrow down our search.

In [23]:
soup.find('table', attrs={'class' : 'wikitable'})

<table class="wikitable sortable">
<caption>Recipients of the ACM Turing award
</caption>
<tbody><tr>
<th scope="col">Year
</th>
<th scope="col">Recipient(s)
</th>
<th class="unsortable" scope="col">Photo
</th>
<th class="unsortable" scope="col">Rationale
</th>
<th scope="col">Affiliated institute(s)
</th></tr>
<tr>
<th scope="row">1966
</th>
<td><span data-sort-value="Perlis, Alan"><span class="vcard"><span class="fn"><a href="/wiki/Alan_Perlis" title="Alan Perlis">Alan Perlis</a></span></span></span>
</td>
<td align="center">—
</td>
<td>"For his influence in the area of advanced <a href="/wiki/Computer_programming" title="Computer programming">computer programming</a> techniques and <a href="/wiki/Compiler" title="Compiler">compiler</a> construction"<sup class="reference" id="cite_ref-16"><a href="#cite_note-16"><span class="cite-bracket">[</span>16<span class="cite-bracket">]</span></a></sup><sup class="reference" id="cite_ref-perlis_17-0"><a href="#cite_note-perlis-17"><span class=
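Note that the search for `wikitable` matched a table whose full class attribute is `wikitable sortable`. Beautiful Soup treats `class` as a list of values and matches if any one of them equals the search string. The sketch below shows this on a hand-written snippet (hypothetical HTML), along with the equivalent `class_` keyword:

```python
from bs4 import BeautifulSoup

html_doc = """
<table class="infobox"><tr><td>Summary box</td></tr></table>
<table class="wikitable sortable"><tr><td>Recipients</td></tr></table>
"""
demo_soup = BeautifulSoup(html_doc, "html.parser")

# Both calls match on a single class, even though the tag has two.
by_attrs = demo_soup.find("table", attrs={"class": "wikitable"})
by_keyword = demo_soup.find("table", class_="wikitable")

print(by_attrs["class"])
print(by_attrs is by_keyword)
```

The keyword is spelled `class_` because `class` is a reserved word in Python.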

We can render the table in the notebook using IPython's `HTML` display class.

In [24]:
table_html = str(soup.find('table', attrs={'class' : 'wikitable'}))

from IPython.display import HTML

HTML(table_html)

| Year | Recipient(s) | Photo | Rationale | Affiliated institute(s) |
| --- | --- | --- | --- | --- |
| 1966 | Alan Perlis | — | "For his influence in the area of advanced computer programming techniques and compiler construction"[16][17] | Carnegie Mellon University |
| 1967 | Maurice Wilkes | | "Wilkes is best known as the builder and designer of the EDSAC, the second computer with an internally stored program. Built in 1949, the EDSAC used a mercury delay line memory. He is also known as the author, with David Wheeler and Stanley Gill, of a volume on 'Preparation of Programs for Electronic Digital Computers' in 1951, in which program libraries were effectively introduced."[18][19] | University of Cambridge |
| 1968 | Richard Hamming | — | "For his work on numerical methods, automatic coding systems, and error-detecting and error-correcting codes"[20][21] | Bell Labs |
| 1969 | Marvin Minsky | | "For his central role in creating, shaping, promoting, and advancing the field of artificial intelligence"[22][23] | Massachusetts Institute of Technology |
| 1970 | James H. Wilkinson | — | "For his research in numerical analysis to facilitate the use of the high-speed digital computer, having received special recognition for his work in computations in linear algebra and 'backward' error analysis"[24][25] | National Physical Laboratory |
| 1971 | John McCarthy | | "McCarthy's lecture 'The Present State of Research on Artificial Intelligence' is a topic that covers the area in which he has achieved considerable recognition for his work."[26][27] | Stanford University |
| 1972 | Edsger W. Dijkstra | | "Edsger Dijkstra was a principal contributor in the late 1950s to the development of the ALGOL, a high level programming language which has become a model of clarity and mathematical rigor. He is one of the principal proponents of the science and art of programming languages in general, and has greatly contributed to our understanding of their structure, representation, and implementation. His fifteen years of publications extend from theoretical articles on graph theory to basic manuals, expository texts, and philosophical contemplations in the field of programming languages."[28][29] | Centrum Wiskunde & Informatica, Eindhoven University of Technology, University of Texas at Austin |
| 1973 | Charles Bachman | | "For his outstanding contributions to database technology"[30][31] | General Electric Research Laboratory (now under Groupe Bull, an Atos company) |
| 1974 | Donald Knuth | | "For his major contributions to the analysis of algorithms and the design of programming languages, and in particular for his contributions to "The Art of Computer Programming" through his well-known books in a continuous series by this title"[32][33] | California Institute of Technology, Center for Communications Research, Center for Communications and Computing, Institute for Defense Analyses, Stanford University |
| 1975 | Allen Newell | — | "In joint scientific efforts extending over twenty years, initially in collaboration with J. C. Shaw at the RAND Corporation, and subsequently with numerous faculty and student colleagues at Carnegie Mellon University, they have made basic contributions to artificial intelligence, the psychology of human cognition, and list processing."[34][35][36] | RAND Corporation, Carnegie Mellon University |


However, this does not give us a way to work with the data in the table, only to display it.

If we want to interact with the table, we can use the _pandas_ `read_html` function, which returns a list of DataFrames, one for each table it finds.

In [25]:
import pandas as pd
import io

In [26]:
pd.read_html(io.StringIO(table_html))[0]

|  | Year | Recipient(s) | Photo | Rationale | Affiliated institute(s) |
| --- | --- | --- | --- | --- | --- |
| 0 | 1966 | Alan Perlis | — | "For his influence in the area of advanced com... | Carnegie Mellon University |
| 1 | 1967 | Maurice Wilkes | | "Wilkes is best known as the builder and desig... | University of Cambridge |
| 2 | 1968 | Richard Hamming | — | "For his work on numerical methods, automatic ... | Bell Labs |
| 3 | 1969 | Marvin Minsky | | "For his central role in creating, shaping, pr... | Massachusetts Institute of Technology |
| 4 | 1970 | James H. Wilkinson | — | "For his research in numerical analysis to fac... | National Physical Laboratory |
| ... | ... | ... | ... | ... | ... |
| 72 | 2020 | Alfred Aho | — | "For fundamental algorithms and theory underly... | Bell Labs, Columbia University |
| 73 | 2020 | Jeffrey Ullman | — | "For fundamental algorithms and theory underly... | Bell Labs, Princeton University, Stanford Univ... |
| 74 | 2021 | Jack Dongarra | | "For pioneering contributions to numerical alg... | Argonne National Laboratory, Oak Ridge Nationa... |
| 75 | 2022 | Robert Metcalfe | | "For the invention, standardization, and comme... | Massachusetts Institute of Technology, Harvard... |


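You can even point the `read_html` function directly at the URL of the page containing the table you want. Because it returns every table on the page, we select the one we want by its position in the list.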

In [27]:
URL = 'http://en.wikipedia.org/wiki/Turing_Award'
pd.read_html(URL)[1]

|  | Year | Recipient(s) | Photo | Rationale | Affiliated institute(s) |
| --- | --- | --- | --- | --- | --- |
| 0 | 1966 | Alan Perlis | — | "For his influence in the area of advanced com... | Carnegie Mellon University |
| 1 | 1967 | Maurice Wilkes | | "Wilkes is best known as the builder and desig... | University of Cambridge |
| 2 | 1968 | Richard Hamming | — | "For his work on numerical methods, automatic ... | Bell Labs |
| 3 | 1969 | Marvin Minsky | | "For his central role in creating, shaping, pr... | Massachusetts Institute of Technology |
| 4 | 1970 | James H. Wilkinson | — | "For his research in numerical analysis to fac... | National Physical Laboratory |
| ... | ... | ... | ... | ... | ... |
| 72 | 2020 | Alfred Aho | — | "For fundamental algorithms and theory underly... | Bell Labs, Columbia University |
| 73 | 2020 | Jeffrey Ullman | — | "For fundamental algorithms and theory underly... | Bell Labs, Princeton University, Stanford Univ... |
| 74 | 2021 | Jack Dongarra | | "For pioneering contributions to numerical alg... | Argonne National Laboratory, Oak Ridge Nationa... |
| 75 | 2022 | Robert Metcalfe | | "For the invention, standardization, and comme... | Massachusetts Institute of Technology, Harvard... |
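Picking a table by position breaks if the page gains or loses a table. `read_html` also accepts a `match` argument that keeps only tables containing the given text. A minimal sketch on a hand-written snippet (the first table is a hypothetical placeholder; the two data rows are taken from the output above), assuming one of the HTML parsers that pandas supports is installed:

```python
import io

import pandas as pd

html_doc = """
<table>
  <tr><th>Note</th></tr>
  <tr><td>Placeholder table</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Recipient</th></tr>
  <tr><td>1966</td><td>Alan Perlis</td></tr>
  <tr><td>1967</td><td>Maurice Wilkes</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string or regex.
tables = pd.read_html(io.StringIO(html_doc), match="Recipient")
winners = tables[0]

print(len(tables))
print(winners)
```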
