# Web Scraping and APIs

In this notebook, we learn how to scrape data from the Web and get an idea of what Application Programming Interfaces (APIs) are.

## Web Scraping

**Web Scraping** is a technique for the extraction of information from websites by transforming unstructured data (HTML pages) into structured data (databases or spreadsheets). 

Although scraping can be performed manually by a user, it is usually automated by means of a **web crawler**. For larger-scale scraping see, e.g., [Scrapy](https://scrapy.org).

Scraping is an alternative to using already available **APIs** (Application Programming Interfaces), such as those provided by major platforms like *Facebook*, *Google* and *Twitter*.

### Basics of HTML

The **HyperText Markup Language (HTML)** is the standard **descriptive markup** language for web pages.


- **Markup** language: a human-readable, explicit system for annotating the content of a document. Markdown is another markup language.


- **Descriptive** markup languages (e.g. HTML, XML) are used to annotate the structure or the contents of a document, as opposed to **procedural** markup languages (e.g. TeX, PostScript), whose main goal is to describe how a document should be processed.

HTML provides a means to annotate the **structural** elements of documents, such as headings, paragraphs, lists, links, images, quotes and tables. Markdown does the same, albeit with fewer options.

HTML tags **do not mark the logical structure** of a document, but only its format (e.g. *this is a table*, *this is a h3-type heading*...). It is up to the browser to then use HTML (plus other information, such as *Cascading Style Sheets*), to render a webpage appropriately.

HTML markup relies on a **fixed inventory of tags**, written by using angle brackets. Some tags, e.g. `<p>...</p>`, surround the marked text, and may include subelements. Other tags, e.g. `<br>` or `<img>` introduce content directly.

The following is an example of a web page:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>The Adventures of Pinocchio</title>
  </head>
  <body>
    <h2>Carlo Collodi</h2>
    <h1>The Adventures of Pinocchio</h1>
    <hr>
    <h4>CHAPTER 1</h4>
    <br>
    <p><i>How it happened that Mastro Cherry, carpenter, found a piece of wood that wept and laughed like a child</i></p>
    <br>
    <p>Centuries ago there lived--</p>
    <p>"A king!" my little readers will say immediately.</p>
  </body>
</html>
```
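To make the nesting of tags concrete, the following is a minimal sketch that prints an indented outline of the tags found in an abridged copy of the page above. It relies only on `html.parser` from the Python Standard Library; the class name `OutlinePrinter` and the `VOID_TAGS` set (a few tags that never have a closing tag) are illustrative choices, not part of any library.

```python
from html.parser import HTMLParser

# A few "void" elements: they have no closing tag, so they never open a new level.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class OutlinePrinter(HTMLParser):
    """Collect one indented line per opening tag to expose the nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


# Abridged version of the example page (some elements and text removed).
page = """<!DOCTYPE html>
<html>
  <head>
    <title>The Adventures of Pinocchio</title>
  </head>
  <body>
    <h1>The Adventures of Pinocchio</h1>
    <hr>
    <p><i>How it happened that Mastro Cherry found a piece of wood</i></p>
  </body>
</html>"""

parser = OutlinePrinter()
parser.feed(page)
print("\n".join(parser.lines))
```

The outline shows that `<title>` lives inside `<head>`, while `<h1>`, `<hr>` and `<p>` are siblings inside `<body>`, and `<i>` is nested in `<p>`.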

### Scraping Web Pages

>The following notes are roughly based on **Chapters 1-3** of: Mitchell, R. (2015). [Web Scraping with Python](http://shop.oreilly.com/product/0636920034391.do), O'Reilly

#### Modules and Packages Required for Web Scraping

**BeautifulSoup**: this library defines [classes and functions](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to pull data (e.g. table, lists, paragraphs) out of HTML and XML files. It provides idiomatic ways of navigating, searching, and modifying the parse tree.


**lxml**: to function, BeautifulSoup relies on an external HTML/XML parser. Several options are available, including html5lib and Python's built-in parser. We'll rely on the [lxml](http://lxml.de/) parser, due to its high performance, reliability and flexibility.


**urllib**: BeautifulSoup does not fetch the web page for us. To do this, we'll rely on the [urllib](https://docs.python.org/3.11/library/urllib.html#module-urllib) package available in the Python Standard Library, which provides classes and functions that help in opening URLs (authentication, redirections, cookies and so on). We will see another option, **requests**, below.

In [1]:
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup

#### Retrieve and Parse an HTML page

`urllib.request.urlopen()` allows us to retrieve our target HTML page:

In [2]:
html = urlopen("http://www.pythonscraping.com/pages/page1.html")

What if the page doesn't exist?

In [3]:
try:
    html = urlopen("http://www.pythonscraping.com/pages/page.html")
except Exception as e:
    print(e)

HTTP Error 404: Not Found


Well, let's handle this properly...

In [4]:
try:
    html = urlopen("http://www.pythonscraping.com/pages/page.html")
except urllib.error.HTTPError as e:
    pass  # the server answered with an error status (e.g. 404): code your plan B here
except urllib.error.URLError as e:
    raise  # the server could not be reached at all: re-raise the exception
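The same idea can be wrapped in a small reusable helper. This is only a sketch: the function name `fetch_html`, its `timeout` default and the printed messages are assumptions made for illustration. Note that `HTTPError` is a subclass of `URLError`, so it has to be caught first.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes of the page at `url`, or None if it cannot be retrieved."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # The server was reached but answered with an error status (e.g. 404).
        print(f"HTTP error {e.code} for {url}")
    except URLError as e:
        # The resource could not be reached (wrong domain, no connection, ...).
        print(f"Could not reach {url}: {e.reason}")
    return None
```

A caller can then test the result with `if page is None:` before handing it to BeautifulSoup.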

We use `BeautifulSoup()` in conjunction with `lxml` to parse our `html` page and store it as a Beautiful Soup object:

In [5]:
# you might need to do the following:
#!pip install lxml

In [6]:
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
soup_page1 = BeautifulSoup(html, "lxml")

In [7]:
#Let's scrape another couple of pages we'll need in our examples
soup_page3 = BeautifulSoup(urlopen("http://www.pythonscraping.com/pages/page3.html"), "lxml")
soup_wap = BeautifulSoup(urlopen("http://www.pythonscraping.com/pages/warandpeace.html"), "lxml")

#### Let's look at the nested structure of the page

The `prettify()` method allows us to have a look at the structure of the HTML page

In [8]:
print(soup_page1)

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>



In [9]:
print(soup_page1.prettify())

<html>
 <head>
  <title>
   A Useful Page
  </title>
 </head>
 <body>
  <h1>
   An Interesting Title
  </h1>
  <div>
   Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
  </div>
 </body>
</html>



#### Let's play with a HTML tag

The notation `soup.<tag>` allows us to retrieve the content marked by a tag (opening and closing tags included)

In [10]:
# note that the first "<div>" tag is nested two layers deep (html → body → div).
soup_page1.div

<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

If the text is the only thing you're interested in, the `soup.<tag>.string` attribute comes in handy:

In [11]:
soup_page1.div.string

'\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n'
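One caveat is worth a quick sketch: `.string` only returns text when the tag has a single child, otherwise it returns `None`, whereas `get_text()` concatenates all the text found inside the tag. The snippet below uses a made-up HTML fragment and Python's built-in `html.parser`, so that it runs without fetching any page.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the second <p> contains a nested <b> tag.
snippet = "<div><p>Only text here</p><p>Text with <b>bold</b> inside</p></div>"
soup = BeautifulSoup(snippet, "html.parser")
first_p, second_p = soup.find_all("p")

print(first_p.string)        # the single text child of the first <p>
print(second_p.string)       # None: this <p> has more than one child
print(second_p.get_text())   # all the text inside the tag, nested tags included
```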

The HTML markup generated by Beautiful Soup can be modified:

In [12]:
# let's change the content of our div
soup_page1.div.string = "this content has been changed"
# let's change the name of the tag
soup_page1.div.name = "new_div"

In [13]:
print(soup_page1.prettify())

<html>
 <head>
  <title>
   A Useful Page
  </title>
 </head>
 <body>
  <h1>
   An Interesting Title
  </h1>
  <new_div>
   this content has been changed
  </new_div>
 </body>
</html>



In its simplest use, the `find()` method is an alternative to the `soup.<tag>` notation...

In [14]:
soup_page1.find("new_div")

<new_div>this content has been changed</new_div>

In [15]:
soup_page1.new_div

<new_div>this content has been changed</new_div>

...but `find()` also lets us search for nodes by exploiting cues in the markup, such as a given **class attribute** value:

In [16]:
print(soup_wap.prettify())

<html>
 <head>
  <style>
   .green{
	color:#55ff55;
}
.red{
	color:#ff5555;
}
#text{
	width:50%;
}
  </style>
 </head>
 <body>
  <h1>
   War and Peace
  </h1>
  <h2>
   Chapter 1
  </h2>
  <div id="text">
   "
   <span class="red">
    Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.
   </span>
   "
   <p>
   </p>
   It was in July, 1805, and the speaker was the well-known
   <span class="green">
    Anna
Pavlovna Scherer
   </span>
   , maid of honor and favorite of the
   <span class="green">
    Empress Marya
Fedorovna
   </span>
   . With these words she greeted
   <span 

In [17]:
soup_wap.find("span", attrs = {"class":"green"})

<span class="green">Anna
Pavlovna Scherer</span>

The values of an attribute for a given tag instance can be retrieved by using the `get("ATTRIBUTE")` method. For instance, if we want to retrieve the URL of an image we can extract the `src` value from the corresponding `<img>` tag:

In [18]:
soup_page3.img.get("src")

'../img/gifts/logo.jpg'

If we want to know all the attributes associated with a given tag, the `attrs` attribute is convenient:

In [19]:
soup_page3.img.attrs

{'src': '../img/gifts/logo.jpg', 'style': 'float:left;'}

In [20]:
# since "attrs" is a dictionary, it can be used as an alternative to "get()"
soup_page3.img.attrs["src"]

'../img/gifts/logo.jpg'

In [21]:
# if you fancy another way to do the same thing...
soup_page3.img["src"]

'../img/gifts/logo.jpg'
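These three notations differ in how they treat a missing attribute. The sketch below, built on a made-up `<img>` tag and Python's built-in `html.parser`, shows that `get()` returns `None` (or a default of your choice), while square brackets raise a `KeyError`, exactly as with a dictionary.

```python
from bs4 import BeautifulSoup

# Made-up tag with a "src" and a "style" attribute, but no "alt".
soup = BeautifulSoup('<img src="logo.jpg" style="float:left;">', "html.parser")
img = soup.img

print(img.get("src"))        # value of an existing attribute
print(img.get("alt"))        # None: get() does not fail on a missing attribute
print(img.get("alt", ""))    # a default can be supplied, as with dict.get()

try:
    img["alt"]               # square brackets raise KeyError instead
except KeyError:
    print("no 'alt' attribute")
```

When scraping pages whose markup is not fully under your control, `get()` is usually the safer choice.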

#### Dealing with multiple HTML tags at once

When the same tag is used multiple times in the same page, however, both the `soup.<tag>` notation and the `find()` method give access to **only one instance** (i.e. the first):

In [22]:
print(soup_wap.prettify()[180:1190])

In [None]:
soup_wap.span

In order to extract the **sequence of all the instances of a tag** in a file, we can use the `find_all()` method (previously known as `findAll()` and `findChildren()` in BS 3 and BS 2, respectively)

In [23]:
soup_wap.find_all("span")

[<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">St. Petersburg</span>,
 <span class="red">If you have nothing better to do, Count [or Prince], and if the
 prospect of spending an evening with a poor invalid is not too
 terrible, I shall be very charmed to see you tonight between 7 and 10-
 Annette Scherer.</span>,
 <span clas

The `find_all()` method also allows us to extract all the tags that match cues in the markup, such as a given **class attribute** value:

In [24]:
soup_wap.find_all("span",  attrs = {"class":"green"})

[<span class="green">Anna
 Pavlovna Scherer</span>,
 <span class="green">Empress Marya
 Fedorovna</span>,
 <span class="green">Prince Vasili Kuragin</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">St. Petersburg</span>,
 <span class="green">the prince</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">the prince</span>,
 <span class="green">Prince Vasili</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">Anna Pavlovna</span>,
 <span class="green">the prince</span>,
 <span class="green">Wintzingerode</span>,
 <span class="green">King of Prussia</span>,
 <span class="green">le Vicomte de Mortemart</span>,
 <span class="green">Montmorencys</span>,
 <span class="green">Rohans</span>,
 <span class="green">Abbe Morio</span>,
 <span class="green">the Emperor</span>,
 <span class="green">the prince</span>,
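As a small self-contained sketch of the same search, the snippet below uses a shortened, hand-written fragment in the style of the page above and the `class_` keyword argument, which Beautiful Soup offers as a shorthand for `attrs={"class": ...}`. It then keeps only the text of each match.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the markup of the War and Peace page.
html = """
<div id="text">
  <span class="red">Well, Prince, so Genoa and Lucca are now just family estates</span>
  the speaker was <span class="green">Anna Pavlovna Scherer</span>,
  favorite of the <span class="green">Empress Marya Fedorovna</span>.
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# `class` is a reserved word in Python, hence the trailing underscore.
green_spans = soup.find_all("span", class_="green")
names = [span.get_text() for span in green_spans]
print(names)

# find_all() returns a list-like object, so len() and indexing work as usual.
print(len(soup.find_all("span")))
```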
 

### Web Crawling

Web crawlers are programs designed to collect pages from the Web. In essence, they recursively implement the following steps:

- they start by retrieving the page content for a URL


- they then parse it to retrieve other URLs of interest


- they then focus on these new URLs, for each of which they repeat the whole process, ad infinitum

For instance, if you want to crawl an **entire site**:

- start with a top-level page


- parse the page (retrieve the data your application needs) and extract all the internal links, ignoring already visited URLs


- for each new link, move to the corresponding page and repeat the previous step
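The steps above can be sketched on a toy example. To keep it runnable offline, the "site" below is a hypothetical dictionary mapping URLs to HTML strings, links are extracted with a simple regular expression instead of BeautifulSoup, and the traversal uses an explicit queue rather than recursion. All names (`SITE`, `extract_internal_links`, `crawl`) are made up for illustration.

```python
import re
from collections import deque

# Hypothetical three-page site, stored in memory so the sketch runs offline.
SITE = {
    "/index.html": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "/about.html": '<a href="/index.html">Home</a> <a href="http://example.com/">External</a>',
    "/news.html": '<a href="/about.html">About</a>',
}


def extract_internal_links(html):
    """Return the href values that point inside the site (they start with '/')."""
    hrefs = re.findall(r'href="([^"]+)"', html)
    return [href for href in hrefs if href.startswith("/")]


def crawl(start):
    """Visit every page reachable from `start`, each one exactly once."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        html = SITE.get(url, "")      # a real crawler would fetch the page here
        visited.append(url)           # ...and extract the data of interest
        for link in extract_internal_links(html):
            if link not in seen:      # ignore already visited (or queued) URLs
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("/index.html"))
```

The `seen` set is what prevents the crawler from looping forever between pages that link to each other.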

#### A Random walk through Wikipedia

Let's set our starting page URL, fetch it and parse its HTML:

In [25]:
starting_page = urlopen("https://en.wikipedia.org/wiki/Chris_Cornell")
soup = BeautifulSoup(starting_page, "lxml")

At this point, it should be easy to extract all the links in the page:

In [26]:
# links are defined by <a> tag
for link_element in soup.find_all("a")[:10]:
    print(link_element)

<a class="mw-jump-link" href="#bodyContent">Jump to content</a>
<a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>
<a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>
<a href="/wiki/Portal:Current_events" title="Articles related to current events"><span>Current events</span></a>
<a accesskey="x" href="/wiki/Special:Random" title="Visit a randomly selected article [x]"><span>Random article</span></a>
<a href="/wiki/Wikipedia:About" title="Learn about Wikipedia and how it works"><span>About Wikipedia</span></a>
<a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us" title="How to contact Wikipedia"><span>Contact us</span></a>
<a href="https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&amp;utm_medium=sidebar&amp;utm_campaign=C13_en.wikipedia.org&amp;uselang=en" title="Support us by donating to the Wikimedia Foundation"><span>Donate</span></a>
<a href="/wiki/Help:Conte

Let's ignore all the "a" tags without an "href" attribute:

In [27]:
for link_element in [tag for tag in soup.find_all("a") if 'href' in tag.attrs][:10]:
    
    url = link_element.attrs['href']
    
    print(url)

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
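Most of these `href` values are relative, so they cannot be passed to `urlopen()` as they are. A minimal sketch of how they can be resolved against the URL of the page they were found on, using `urljoin()` from the Standard Library (the donation link is shortened here for readability):

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/Chris_Cornell"
hrefs = [
    "#bodyContent",                                   # fragment of the same page
    "/wiki/Main_Page",                                # path relative to the site root
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",   # same scheme, explicit host
    "https://donate.wikimedia.org/",                  # already absolute
]

absolute_urls = [urljoin(base, href) for href in hrefs]
for url in absolute_urls:
    print(url)
```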


Wikipedia is full of sidebar, footer, and header links that appear on every page, along with links to the category pages, talk pages, and other pages that do not contain different articles:

```
/wiki/Template_talk:Chris_Cornell
```

```
#cite_note-147
```

Moreover, we don't want to visit pages outside of Wikipedia:

```
http://www.chriscornell.com/
```

Relevant links have three things in common:

- they reside within the `div` with the `id` set to `bodyContent`


- the URLs do not contain colons


- the URLs begin with `/wiki/`
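The last two criteria can be checked offline with a regular expression. In the sketch below, the pattern requires the `/wiki/` prefix and then uses a negative lookahead, `(?!:)`, to reject any URL with a `:` character after it. The candidate list reuses URLs shown earlier in this section.

```python
import re

# "^(/wiki/)" anchors the prefix; "((?!:).)*$" accepts any characters except ":".
re_pattern = re.compile(r"^(/wiki/)((?!:).)*$")

candidates = [
    "/wiki/Soundgarden",
    "/wiki/Template_talk:Chris_Cornell",
    "#cite_note-147",
    "http://www.chriscornell.com/",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
]

kept = [href for href in candidates if re_pattern.match(href)]
print(kept)
```

Only the article link survives; the talk page, the footnote anchor and the links that do not start with `/wiki/` are all discarded.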

In [28]:
import re

re_pattern = re.compile(r"^(/wiki/)((?!:).)*$")

body = soup.find("div", {"id": "bodyContent"})

for link in body.find_all("a", {'href': re_pattern}):

    print(link.attrs['href'])

/wiki/Seattle
/wiki/Washington_(state)
/wiki/Detroit
/wiki/Michigan
/wiki/Suicide_by_hanging
/wiki/Hollywood_Forever_Cemetery
/wiki/Susan_Silver
/wiki/Peter_Cornell_(singer)
/wiki/List_of_awards_and_nominations_received_by_Chris_Cornell
/wiki/Alternative_metal
/wiki/Heavy_metal_music
/wiki/Grunge
/wiki/Alternative_rock
/wiki/Hard_rock
/wiki/SST_Records
/wiki/Sub_Pop
/wiki/A%26M_Records
/wiki/Epic_Records
/wiki/Suretone
/wiki/Interscope_Records
/wiki/Mosley_Music_Group
/wiki/Soundgarden
/wiki/Audioslave
/wiki/Temple_of_the_Dog
/wiki/N%C3%A9
/wiki/Rock_music
/wiki/Soundgarden
/wiki/Audioslave
/wiki/Temple_of_the_Dog
/wiki/Andrew_Wood_(singer)
/wiki/Grunge
/wiki/Octave
/wiki/Belting_(music)
/wiki/Euphoria_Morning
/wiki/Carry_On_(Chris_Cornell_album)
/wiki/Scream_(Chris_Cornell_album)
/wiki/Higher_Truth
/wiki/Songbook_(Chris_Cornell_album)
/wiki/The_Roads_We_Choose_%E2%80%93_A_Retrospective
/wiki/Chris_Cornell_(album)
/wiki/Golden_Globe_Award
/wiki/Machine_Gun_Preacher
/wiki/You_Know_My_Na

This code returns the list of all the Wikipedia articles linked to our starting page. 

This is not enough: we want to repeat this process recursively for all these links. That is, we need a function that takes as input a Wikipedia article URL of the form `/wiki/<Article_Name>` and returns a list of all the linked articles:

In [29]:
def get_links(article_url):
    """
    Retrieve all URLs on an English Wikipedia article page (e.g. /wiki/Amsterdam).
    
    This function needs a relative URL on the 
    http://en.wikipedia.org domain, such as '/wiki/Amsterdam'. 
    
    Args:
        article_url (str): relative URL of a Wikipedia article
        
    Returns:
        bs4.element.ResultSet: bs link elements resultset
        
    """
    
    page = urlopen("http://en.wikipedia.org" + article_url)
    soup = BeautifulSoup(page, "lxml")
    
    body = soup.find("div", {"id":"bodyContent"})
    
    re_pattern = re.compile(r"^(/wiki/)((?!:).)*$")
    
    links = body.find_all("a", href=re_pattern)
    
    return links

Let's test our function with a script that, at each iteration, picks a random link from the current page, and that stops after 10 URLs have been retrieved (or when it bumps into a page without links):

In [30]:
import random

links = get_links("/wiki/Chris_Cornell")

for _ in range(10):  # for testing purposes, we want to do this 10 times
    if len(links) > 0:
        new_article = random.choice(links).attrs["href"]
        print(new_article)
        
        links = get_links(new_article)
        
    else:
        print("No links in this page!")
        break

/wiki/Octave
/wiki/Cent_(music)#Centitones
/wiki/Septimal_diesis
/wiki/Septimal_kleisma
/wiki/Subminor_and_supermajor
/wiki/Diesis
/wiki/Thirteenth
/wiki/Lead_sheet
/wiki/Mensural_notation
/wiki/Longa_(music)


---

## Working with APIs

An **Application Programming Interface** (API) is a set of protocols that defines how software programs communicate with each other. Without APIs, we have to scrape the Web or get the data directly. With APIs, we can often get structured data, which is a much more convenient way to work.

APIs are a great option because they implement extensively tested routines (**high reliability**). However, you need to spend time learning how they work and, in some cases, they do not give access to the piece of information you need (**low flexibility**).

In [32]:
import requests  # External package: https://requests.readthedocs.io/en/master/

In [33]:
# Example of a Google search

In [34]:
query = "Tesla"
r = requests.get('https://www.google.com/search', params={'q': query})

In [35]:
r.status_code

200

In [36]:
print(r.headers['content-type'])
print(r.encoding)
print(r.url)

text/html; charset=utf-8
utf-8
https://consent.google.com/ml?continue=https://www.google.com/search%3Fq%3DTesla&gl=IT&m=0&pc=srp&uxe=none&cm=2&hl=it&src=1


In [37]:
r.text[:1000]

'<!DOCTYPE html><html lang="it" dir="ltr"><head><style nonce="SAOcRJbeWvExT3NaGGd-lA">\na, a:link, a:visited, a:active, a:hover {\n  color: #1a73e8;\n  text-decoration: none;\n}\nbody {\n  font-family: Roboto,Helvetica,Arial,sans-serif;\n  text-align: center;\n  -ms-text-size-adjust: 100%;\n  -moz-text-size-adjust: 100%;\n  -webkit-text-size-adjust: 100%;\n}\n.box {\n  border: 1px solid #dadce0;\n  box-sizing: border-box;\n  border-radius: 8px;\n  margin: 24px auto 5px auto;\n  max-width: 800px;\n  padding: 24px;\n}\n.youtubeContainerUIModernization,\n.boxUIModernization {\n  box-sizing: border-box;\n  margin-left: auto;\n  margin-right: auto;\n  max-width: 800px;\n}\n.signInContainerUIModernization {\n    display: flex;\n    justify-content: flex-end;\n}\nh1 {\n  color: #2c2c2c;\n  font-size: 24px;\n  hyphens: auto;\n  margin: 24px 0;\n}\n.icaCallout {\n  background-color: #f8f9fa;\n  padding: 12px 16px;\n  border-radius: 10px;\n  margin-bottom: 10px;\n}\n.icaCalloutUIModernization {\

---

What about using `requests` to query APIs? It is easy with the `params` dictionary. Responses then follow the standard format of the API (or a format you request, if the API offers alternatives).
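To see what a parameter dictionary turns into, the following sketch builds the query string with `urlencode()` from the Standard Library. `requests` performs an equivalent encoding when it receives `params`; the `hl` parameter and its value are only illustrative.

```python
from urllib.parse import urlencode

# Illustrative parameters: a search query and a language hint.
params = {"q": "Tesla Model 3", "hl": "en"}

query_string = urlencode(params)   # keys and values are escaped and joined with "&"
print(query_string)
print("https://www.google.com/search?" + query_string)
```

Letting the library build the query string avoids mistakes with spaces and special characters in the values.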

In [38]:
r = requests.get('https://api.github.com')

# raw
r.content

b'{\n  "current_user_url": "https://api.github.com/user",\n  "current_user_authorizations_html_url": "https://github.com/settings/connections/applications{/client_id}",\n  "authorizations_url": "https://api.github.com/authorizations",\n  "code_search_url": "https://api.github.com/search/code?q={query}{&page,per_page,sort,order}",\n  "commit_search_url": "https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}",\n  "emails_url": "https://api.github.com/user/emails",\n  "emojis_url": "https://api.github.com/emojis",\n  "events_url": "https://api.github.com/events",\n  "feeds_url": "https://api.github.com/feeds",\n  "followers_url": "https://api.github.com/user/followers",\n  "following_url": "https://api.github.com/user/following{/target}",\n  "gists_url": "https://api.github.com/gists{/gist_id}",\n  "hub_url": "https://api.github.com/hub",\n  "issue_search_url": "https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}",\n  "issues_url": "https://api.

In [39]:
# json
r.json()

{'current_user_url': 'https://api.github.com/user',
 'current_user_authorizations_html_url': 'https://github.com/settings/connections/applications{/client_id}',
 'authorizations_url': 'https://api.github.com/authorizations',
 'code_search_url': 'https://api.github.com/search/code?q={query}{&page,per_page,sort,order}',
 'commit_search_url': 'https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}',
 'emails_url': 'https://api.github.com/user/emails',
 'emojis_url': 'https://api.github.com/emojis',
 'events_url': 'https://api.github.com/events',
 'feeds_url': 'https://api.github.com/feeds',
 'followers_url': 'https://api.github.com/user/followers',
 'following_url': 'https://api.github.com/user/following{/target}',
 'gists_url': 'https://api.github.com/gists{/gist_id}',
 'hub_url': 'https://api.github.com/hub',
 'issue_search_url': 'https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}',
 'issues_url': 'https://api.github.com/issues',
 'keys_url': '
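The difference between the two views of the response can be reproduced offline. In this sketch, `raw` is a hand-copied excerpt of the bytes shown above, and `json.loads()` from the Standard Library plays the role of `r.json()`, turning the JSON text into a Python dictionary.

```python
import json

# Excerpt of the raw response body (bytes), copied from the output above.
raw = b'{"current_user_url": "https://api.github.com/user", "emojis_url": "https://api.github.com/emojis"}'

data = json.loads(raw)          # bytes -> Python dictionary
print(type(data).__name__)
print(data["emojis_url"])       # individual fields are now accessible by key
```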

---

### Exercise 1.

Write code to retrieve the **number of students and year of foundation** of Italian universities by starting from the following Wikipedia article:

https://en.wikipedia.org/wiki/List_of_universities_in_Italy

In [31]:
# Your code here

---

### Exercise 2.

1. Inspect the Google search results page and understand how results are displayed.

2. Use BeautifulSoup to extract the links of the first 10 results of this search.

In [None]:
# Your code here

---

### Exercise 3.

Develop a simple crawler to download information of interest from Wikipedia pages. Customize it however you like.

In [None]:
# Your code here