# Retrieving data from the web

# requests

The first task you'll have on HW1 will be to retrieve some data from the Internet. Several Python libraries have been developed over the years to do exactly that: urllib in the standard library (urllib2 in Python 2) and the third-party urllib3.

However, these libraries are very low-level and somewhat hard to use. They become especially cumbersome when you need to issue POST requests or authenticate against a web service.
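To give a rough sense of what "low-level" means here, the sketch below prepares (but never sends) a POST request using only the standard library. The URL and form fields are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields, for illustration only.
form = {"user": "alice", "lang": "en"}

# The body has to be URL-encoded and converted to bytes by hand.
body = urlencode(form).encode("utf-8")

# Supplying data makes this a POST request; nothing is sent until urlopen is called.
post = Request(
    "https://example.com/login",  # placeholder URL
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(post.get_method())
print(body)
```

requests wraps these steps in a single call such as requests.post(url, data=form).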

Luckily, as with most tasks in Python, someone has developed a library that simplifies these tasks: requests. In reality, the requests made both in this lab and on HW1 are fairly simple, and could easily be done using one of the lower-level libraries. However, it is better to get acquainted with requests as soon as possible, since you will probably need it in the future.

In [46]:
# You tell Python that you want to use a library with the import statement.
import requests

In [2]:
# Get the HU Wikipedia page
req = requests.get("https://en.wikipedia.org/wiki/Harvard_University")

Python is an object-oriented language, and everything in it is an object. Even built-in functions such as len are mostly a thin layer over an object's own methods: len(x) calls x.__len__() behind the scenes.
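As a small illustration of that idea, the class below (a made-up example, not part of any library) defines __len__, and the built-in len then works on its instances.

```python
class Playlist:
    """Hypothetical container used only to illustrate __len__."""

    def __init__(self, songs):
        self.songs = list(songs)

    def __len__(self):
        # len(playlist) ends up calling this method.
        return len(self.songs)


mix = Playlist(["intro", "verse", "outro"])
print(len(mix))
print(mix.__len__())
```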



In [47]:
req

<Response [200]>

In [48]:
type(req)

requests.models.Response

Another very nifty Python function is dir. You can use it to list all the attributes (properties and methods) of an object.

By the way, names starting with a single or double underscore are usually not meant to be accessed directly.



In [50]:
# Calling dir on an instance lists all of its attributes.
dir(req)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']
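If the underscore names get in the way, one option is to filter them out. The sketch below does that on a small stand-in class (hypothetical, so that the example runs without a network request); the same filter could be applied to dir(req).

```python
class Sample:
    """Hypothetical stand-in for an object with public and private names."""

    def __init__(self):
        self.url = "https://example.com"  # placeholder value
        self._cache = None

    def close(self):
        pass


# Keep only the names that do not start with an underscore.
public = [name for name in dir(Sample()) if not name.startswith("_")]
print(public)
```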

Right now req holds a reference to a Response object, but we are interested in the text of the web page, not the object itself.


So the next step is to assign the value of the text property of this Response object to a variable.



In [51]:
page = req.text
page

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Harvard University - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"80f1d86f-d6c7-4e8f-9d24-613045332a1c","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Harvard_University","wgTitle":"Harvard University","wgCurRevisionId":1057550158,"wgRevisionId":1057550158,"wgArticleId":18426501,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: location","Webarchive template wayback links","CS1: Julian–Gregorian uncertainty","Articles with short description","Sho

Great! Now we have the text of the HU Wikipedia page. But this mess of HTML tags would be a pain to parse manually, which is why we will use another very cool Python library called BeautifulSoup.

# BeautifulSoup

Parsing data would be a breeze if we could always use well-formatted data sources, such as CSV, JSON, or XML; but some formats, such as HTML, are both very popular and a pain to parse.

One of the problems with HTML is that over the years browsers have evolved to be very forgiving of "malformed" syntax. Your browser is smart enough to detect some common problems, such as unclosed tags, and correct them on the fly.

Unfortunately, we do not have the time or patience to implement all the different corner cases, so we'll let BeautifulSoup do that for us.

You'll notice that the import statement below is different from what we used for requests. The from library import thing pattern is useful when you don't want to reference a function by its full name (like we did with requests.get), but you also don't want to import every single thing in that library into your namespace.

In [52]:
from bs4 import BeautifulSoup
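The two import styles can be compared side by side using the standard library's math module, chosen here only because it is always available.

```python
import math            # names are reached through the module: math.sqrt
from math import sqrt  # sqrt is placed directly in our namespace

print(math.sqrt(16.0))
print(sqrt(16.0))

# Both names refer to the same function object.
print(math.sqrt is sqrt)
```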


BeautifulSoup can deal with HTML or XML data, so the next line parses the contents of the page variable using its HTML parser, and assigns the result to the soup variable.

In [53]:
soup = BeautifulSoup(page, 'html.parser')


Let's check the string representation of the soup object.

In [54]:
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Harvard University - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"80f1d86f-d6c7-4e8f-9d24-613045332a1c","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Harvard_University","wgTitle":"Harvard University","wgCurRevisionId":1057550158,"wgRevisionId":1057550158,"wgArticleId":18426501,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: location","Webarchive template wayback links","CS1: Julian–Gregorian uncertainty","Articles with short description","Short de

Doesn't look much different from the page object representation. Let's make sure the two are different types.

In [55]:
type(page)

str

In [56]:
type(soup)

bs4.BeautifulSoup

Looks like they are indeed different.

BeautifulSoup objects have a cool little method that allows you to see the HTML content in a nice, indented way.

In [57]:
print (soup.prettify())


<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Harvard University - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"80f1d86f-d6c7-4e8f-9d24-613045332a1c","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Harvard_University","wgTitle":"Harvard University","wgCurRevisionId":1057550158,"wgRevisionId":1057550158,"wgArticleId":18426501,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: location","Webarchive template wayback links","CS1: Julian–Gregorian uncertainty","Articles with short descr

Looks like it's our page!

We can now reference elements of the HTML document in different ways. One very convenient way is by using the dot notation, which allows us to access the elements as if they were properties of the object.

In [58]:
soup.title

<title>Harvard University - Wikipedia</title>

But we should make it clear that this is again just syntactic sugar: BeautifulSoup looks the tag up when you ask for it. title is not a regular property of the soup object, and I can prove it:


In [59]:
"title" in dir(soup)

False
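One way a class can offer this kind of dot access is Python's __getattr__ hook, which is called only when normal attribute lookup fails. The toy class below is a sketch of the general mechanism, not BeautifulSoup's actual code; its name and data are invented for illustration.

```python
class TagBag:
    """Toy stand-in for a parsed document (hypothetical)."""

    def __init__(self, tags):
        # A list of (tag name, text) pairs in document order.
        self._tags = list(tags)

    def __getattr__(self, name):
        # Reached only when 'name' is not a regular attribute.
        for tag_name, text in self._tags:
            if tag_name == name:
                return text  # first match wins
        raise AttributeError(name)


bag = TagBag([("title", "Harvard University - Wikipedia"), ("p", "first"), ("p", "second")])
print(bag.title)
print("title" in dir(bag))
print(bag.p)
```

Note that bag.p returns only the first matching entry, which mirrors the behaviour to watch out for with tags that appear more than once.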

This is nice for HTML elements that only appear once per page, such as the title tag. But what about elements that can appear multiple times?

In [60]:
# Be careful with elements that show up multiple times: soup.p returns only the first one.
soup.p

<p class="mw-empty-elt">
</p>

In [63]:
len(soup.find_all("p"))

106

In [64]:
len(soup.find_all("h1"))

1

In [65]:
len(soup.find_all("div"))

483

In [66]:
len(soup.find_all("table"))

25

In [62]:
soup.table["class"]


['infobox', 'vcard']

If you look at the Wikipedia page in your browser, you'll notice that it has quite a few tables in it (find_all counted 25 above). We will be working with the "Demographics" table, but first we need to find it.

One of the HTML attributes that will be very useful to us is the "class" attribute.

Getting the class of a single element is easy, as soup.table["class"] above shows.

# List Comprehensions



Next we will use a list comprehension to see all the tables that have a "class" attribute. List comprehensions are a very cool Python feature that combines a loop and the creation of a list in a single line.

In [67]:
[t["class"] for t in soup.find_all("table") if t.get("class")]


[['infobox', 'vcard'],
 ['toccolours'],
 ['infobox'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed', 'floatright'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed', 'floatright'],
 ['wikitable', 'sortable'],
 ['wikitable'],
 ['metadata', 'mbox-small'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'navbox-subgroup'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'hlist', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowr

The list comprehension above is equivalent to (but much more concise than) the following construct:

In [19]:
# Create a list and append the class attribute of every table that has one.
my_list = []
for t in soup.find_all("table"):
    if t.get("class"):
        my_list.append(t["class"])
my_list

[['infobox', 'vcard'],
 ['toccolours'],
 ['infobox'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed', 'floatright'],
 ['wikitable', 'sortable', 'collapsible', 'collapsed', 'floatright'],
 ['wikitable', 'sortable'],
 ['wikitable'],
 ['metadata', 'mbox-small'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'navbox-subgroup'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'mw-collapsed', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowraplinks', 'hlist', 'mw-collapsible', 'autocollapse', 'navbox-inner'],
 ['nowr

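As I mentioned, this lab was built around the Demographics table. The next cell selects the first table on the page whose class is "wikitable"; in the page as retrieved here, that is the National Graduate Rankings table shown in the output below. We will render it in different parts of the notebook to make it easier to follow along with the parsing steps.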

In [73]:
# soup.find returns only the first match. Many tables on this page have the
# class "wikitable", so this selects the first of them.
table_html = str(soup.find("table", "wikitable"))

In [74]:
from IPython.display import HTML

HTML(table_html)

National Graduate Rankings[92]

| Program | Ranking |
| --- | --- |
| Biological Sciences | 4 |
| Business | 6 |
| Chemistry | 2 |
| Clinical Psychology | 10 |
| Computer Science | 16 |
| Earth Sciences | 8 |
| Economics | 1 |
| Education | 1 |
| Engineering | 22 |
| English | 8 |


First we'll use a list comprehension to extract the row (tr) elements.


In [75]:
rows = [row for row in soup.find("table", "wikitable").find_all("tr")]
rows

[<tr>
 <th colspan="4" style="background-color:#A31F36;color:white;box-shadow: inset 2px 2px 0 #2C2A29, inset -2px -2px 0 #2C2A29;">National Graduate Rankings<sup class="reference" id="cite_ref-92"><a href="#cite_note-92">[92]</a></sup>
 </th></tr>,
 <tr>
 <th>Program
 </th>
 <th>Ranking
 </th></tr>,
 <tr>
 <td>Biological Sciences</td>
 <td>4
 </td></tr>,
 <tr>
 <td>Business</td>
 <td>6
 </td></tr>,
 <tr>
 <td>Chemistry</td>
 <td>2
 </td></tr>,
 <tr>
 <td>Clinical Psychology</td>
 <td>10
 </td></tr>,
 <tr>
 <td>Computer Science</td>
 <td>16
 </td></tr>,
 <tr>
 <td>Earth Sciences</td>
 <td>8
 </td></tr>,
 <tr>
 <td>Economics</td>
 <td>1
 </td></tr>,
 <tr>
 <td>Education</td>
 <td>1
 </td></tr>,
 <tr>
 <td>Engineering</td>
 <td>22
 </td></tr>,
 <tr>
 <td>English</td>
 <td>8
 </td></tr>,
 <tr>
 <td>History</td>
 <td>4
 </td></tr>,
 <tr>
 <td>Law</td>
 <td>3
 </td></tr>,
 <tr>
 <td>Mathematics</td>
 <td>2
 </td></tr>,
 <tr>
 <td>Medicine: Primary Care</td>
 <td>10
 </td></tr>,
 <tr>
 <td>M


# Lambda expressions

We will then use a lambda expression to replace new line characters with spaces. Lambda expressions are to functions what list comprehensions are to lists: namely a more concise way to achieve the same thing.

In reality, both lambda expressions and list comprehensions are a little different from their function and loop counterparts. But for the purposes of this class we can ignore those differences.

In [76]:
# Lambda expressions return the value of the expression inside it.
# In this case, it will return a string with new line characters replaced by spaces.
rem_nl = lambda s: s.replace("\n", " ")
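For comparison, here is the same newline-replacing logic written both ways. The def version's name is made up for this sketch.

```python
# Lambda form, as in the cell above.
rem_nl = lambda s: s.replace("\n", " ")

# Equivalent function written with def.
def remove_newlines(s):
    return s.replace("\n", " ")

sample = "Program\nRanking\n"
print(repr(rem_nl(sample)))
print(rem_nl(sample) == remove_newlines(sample))
```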

# Functions

Let's expand a little on functions... Python is very flexible when it comes to function declarations. We've seen the lambda expression, and you might already be familiar with the normal function declaration:

In [77]:
# Returns x raised to the power y, so power(2, 3) returns 8.
def power(x, y):
    return x**y

power(2, 3)

8

In [78]:
def print_greeting():
    print ("Hello!")
    
print_greeting()

Hello!


In [79]:
# The function below can be called with x and y, in which case it will return x*y;
# or it can be called with x only, in which case it will return x*1.
def get_multiple(x, y=1):
    return x*y

print ("With x and y: ", get_multiple(10, 2))
print ("With x only: ", get_multiple(10))

With x and y:  20
With x only:  10


In [27]:
#Things start to get more interesting when we have multiple default values:

def print_special_greeting(name, leaving=False, condition="nice"):
    print ("Hi", name)
    print ("How are you doing in this", condition, "day?")
    if leaving:
        print ("Please come back!")

In [28]:
# Use all the default values.
print_special_greeting("John")

Hi John
How are you doing in this nice day?


In [29]:
# Specify all values.
print_special_greeting("John", True, "rainy")

Hi John
How are you doing in this rainy day?
Please come back!


In [30]:
# Change only the first default value.
print_special_greeting("John", True)

Hi John
How are you doing in this nice day?
Please come back!


In [31]:
print_special_greeting("John", condition="horrible")


Hi John
How are you doing in this horrible day?


In [32]:
def print_siblings(name, *siblings):
    print(name, "has the following siblings:")
    for sibling in siblings:
        print(sibling)
    print()  # blank line after each list

print_siblings("John", "Ashley", "Lauren", "Arthur")
print_siblings("Mike", "John")
print_siblings("Terry")

John has the following siblings:
Ashley
Lauren
Arthur

Mike has the following siblings:
John

Terry has the following siblings:


In [80]:
def print_brothers_sisters(name, **siblings):
    # siblings is a dict mapping each keyword to its value.
    print(name, "has the following siblings:")
    for sibling, relation in siblings.items():
        print(sibling, ":", relation)
    print()  # blank line after the list

print_brothers_sisters("John", Ashley="sister", Lauren="sister", Arthur="brother")

John has the following siblings:
Ashley : sister
Lauren : sister
Arthur : brother
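Inside the function, the extra positional arguments arrive as a tuple and the extra keyword arguments as a dict. The helper below is a made-up example that simply returns what it received.

```python
def collect(*args, **kwargs):
    # args is a tuple of positional arguments, kwargs a dict of keyword arguments.
    return args, kwargs

args, kwargs = collect("Ashley", "Lauren", Arthur="brother")
print(type(args).__name__, args)
print(type(kwargs).__name__, kwargs)
```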


# Splitting the data



In [84]:
columns = [rem_nl(col.get_text()) for col in rows[1].find_all("th") if col.get_text()]
columns

['Program ', 'Ranking ']

In [97]:
indexes = [row.find("th").get_text() for row in rows[1:]]
indexes

AttributeError: 'NoneType' object has no attribute 'get_text'
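The error comes from the data rows of this table: as the rows output earlier shows, they contain only td cells, so row.find("th") returns None. A possible workaround, sketched below on a small hand-written table so it runs on its own, is to fall back to the first td when a row has no th. The HTML snippet and variable names are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real table.
snippet = """
<table class="wikitable">
<tr><th>Program</th><th>Ranking</th></tr>
<tr><td>Chemistry</td><td>2
</td></tr>
<tr><td>Economics</td><td>1
</td></tr>
</table>
"""

demo_rows = BeautifulSoup(snippet, "html.parser").find_all("tr")

labels = []
for row in demo_rows[1:]:
    cell = row.find("th")
    if cell is None:
        # No header cell in this row: use the first data cell instead.
        cell = row.find("td")
    labels.append(cell.get_text(strip=True))

print(labels)
```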

In [83]:
# Here's the original HTML table.
HTML(table_html)

National Graduate Rankings[92]

| Program | Ranking |
| --- | --- |
| Biological Sciences | 4 |
| Business | 6 |
| Chemistry | 2 |
| Clinical Psychology | 10 |
| Computer Science | 16 |
| Earth Sciences | 8 |
| Economics | 1 |
| Education | 1 |
| Engineering | 22 |
| English | 8 |


In [93]:
# Strip a trailing % and convert to int; return None for text without a %.
to_num = lambda s: int(s[:-1]) if s.endswith("%") else None


In [94]:
values = [to_num(value.get_text()) for row in rows[1:] for value in row.find_all("td")]
values

[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]
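Every value comes back as None because none of the cells in this table ends with a percent sign. The sketch below, run on a few hand-written rows, spells out the nested comprehension as a loop and uses a hypothetical to_int helper that accepts plain integers as well as percentages.

```python
def to_int(text):
    # Strip surrounding whitespace and an optional trailing percent sign.
    cleaned = text.strip().rstrip("%")
    return int(cleaned) if cleaned.isdigit() else None

# Hand-written stand-in for the text of each row's cells.
table = [["Chemistry", "2\n"], ["Economics", "1\n"], ["Example", "45%"]]

# Nested comprehension: the outer loop comes first, then the inner one.
flat = [to_int(cell) for row in table for cell in row]

# The same thing written as explicit loops.
flat_loop = []
for row in table:
    for cell in row:
        flat_loop.append(to_int(cell))

print(flat)
print(flat == flat_loop)
```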

In [39]:
# Here's the original HTML table.
HTML(table_html)

National Graduate Rankings[92]

| Program | Ranking |
| --- | --- |
| Biological Sciences | 4 |
| Business | 6 |
| Chemistry | 2 |
| Clinical Psychology | 10 |
| Computer Science | 16 |
| Earth Sciences | 8 |
| Economics | 1 |
| Education | 1 |
| Engineering | 22 |
| English | 8 |


# Exploding parameters

An asterisk placed before a list (or a list comprehension) in a function call explodes it into separate arguments. Take a look at the function calls below:

In [40]:
def print_args(arg1, arg2, arg3):
    print (arg1, arg2, arg3)

# Print three numbers.
print_args(1, 2, 3)

# Print three lists.
print_args([1, 10], [2, 20], [3, 30])

1 2 3
[1, 10] [2, 20] [3, 30]


In [41]:
# But sometimes we have a container holding our parameters. Here's the hard way to handle this:

parameters = [100, 200, 300]

p1 = parameters[0]
p2 = parameters[1]
p3 = parameters[2]

print_args(p1, p2, p3)

100 200 300


A slightly better way to handle this is to use the unpacking functionality. We can assign values from a container directly to variables using the syntax below (note that the number of variables must match the number of values in the container; otherwise Python raises a ValueError):

In [42]:
p4, p5, p6 = parameters

print_args(p4, p5, p6)

100 200 300
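Two details of unpacking are worth a quick sketch: a mismatch between the number of names and values raises a ValueError, and a starred name can collect the leftovers. The variable names below are chosen for the example.

```python
parameters = [100, 200, 300]

# A starred name gathers whatever is left over into a list.
first, *rest = parameters
print(first, rest)

# Too few names for the values raises a ValueError.
try:
    a, b = parameters
except ValueError as err:
    message = str(err)
    print("ValueError:", message)
```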


But the best way to handle these situations is to explode the list by placing an asterisk before it:

In [43]:
print_args(*parameters)


100 200 300


In [45]:
# Without the asterisk, the whole list is passed as arg1, leaving arg2 and arg3 missing.
print_args(parameters)

TypeError: print_args() missing 2 required positional arguments: 'arg2' and 'arg3'
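The same idea extends to keyword arguments: two asterisks explode a dict into name=value pairs. The function and dict below are invented for this sketch.

```python
def describe(name, leaving=False, condition="nice"):
    return f"{name}: leaving={leaving}, condition={condition}"

options = {"leaving": True, "condition": "rainy"}

# Equivalent to describe("John", leaving=True, condition="rainy").
result = describe("John", **options)
print(result)
```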