# Web Scraping and APIs

>### Today
>
> - [Web Scraping](#Web-Scraping)
>
>
> - [Working with APIs](#Working-with-APIs)

## Web Scraping

**Web Scraping** is a technique for the extraction of information from websites by transforming unstructured data (HTML pages) into structured data (databases or spreadsheets). 

Even if scraping can be manually performed by a user, it is usually implemented using a **web crawler** (i.e. it is usually implemented as an automatic process). For larger scale scraping see, e.g., [Scrapy](https://scrapy.org).

The process is an alternative to using already available **API**s (Application Programming Interface), such as those provided by all the major platforms, like *Facebook*, *Google* and *Twitter*. **More below.**

### Basics of HTML

The **HyperText Markup Language (HTML)** is the standard **descriptive markup** language for web pages.


- **Markup** language: a human-readable, explicit system for annotating the content of a document.


- **Descriptive** markup languages (e.g. HTML, XML) are used to annotate the structure of a document, as opposed to **procedural** markup languages (e.g. TeX, PostScript), whose main goal is to describe how a document should be processed.

HTML provides a means to annotate the <strong>structural</strong> elements of documents like (different kinds of) headings, paragraphs, lists, links, images, quotes, tables and so forth. Similarly, even if with fewer options, does Markdown (which we are <em>using</em> *here*, check the code!).

HTML tags **do not mark the logical structure** of a document, but only its format (e.g. *this is a table*, *this is a h3-type heading*...). It is up to the browser to then use HTML (plus other information, such as *Cascading Style Sheets*), to render a webpage appropriately.

HTML markup relies on a **fixed inventory of tags**, written by using angle brackets. Some tags, e.g. `<p>...</p>`, surround the marked text, and may include subelements. Other tags, e.g. `<br>` or `<img>` introduce content directly.

The following is an example of a web page:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>The Adventures of Pinocchio</title>
  </head>
  <body>
    <h2>Carlo Collodi</h2>
    <h1>The Adventures of Pinocchio</h1>
    <hr>
    <h4>CHAPTER 1</h4>
    <br>
    <p><i>How it happened that Mastro Cherry, carpenter, found a piece of wood that wept and laughed like a child</i></p>
    <br>
    <p>Centuries ago there lived--</p>
    <p>"A king!" my little readers will say immediately.</p>
  </body>
</html>
```

#### DOM (Document Object Model)

- *The Document Object Model is a platform- and language-neutral interface that will allow programs and scripts to dynamically access and update the content, structure and style of documents. The document can be further processed and the results of that processing can be incorporated back into the presented page* ([World Wide Web Consortium](https://www.w3.org/DOM/))


- The DOM treats an HTML, XHTML, or XML document as a tree structure, in which each node is an object representing a part of the document.

![alt text](https://github.com/bloemj/AUC_TMCI_2022/blob/main/notebooks/images/treeStructure.png?raw=1)

- It is a standard on which most modern browsers rely: browsers usually work by parsing an HTML document into a DOM and then rendering the DOM structure.
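
To make the tree idea concrete, here is a minimal sketch that prints the nesting of a small page. It uses Python's built-in `html.parser` rather than a real DOM implementation, and the `TreeOutline` class, the `outline()` helper and the shortened Pinocchio page are made up for this illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class TreeOutline(HTMLParser):
    """Collect (depth, tag) pairs while walking an HTML document."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


def outline(html):
    """Return the list of (depth, tag) pairs for an HTML string."""
    parser = TreeOutline()
    parser.feed(html)
    return parser.nodes


page = """<html><head><title>Pinocchio</title></head>
<body><h1>The Adventures of Pinocchio</h1><hr><p>Centuries ago there lived--</p></body></html>"""

# Indent each tag by its depth in the tree.
for depth, tag in outline(page):
    print("  " * depth + tag)
```

Each level of indentation in the output corresponds to one step down the tree: `html` is the root, `head` and `body` are its children, and so on.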


### Scraping Web Pages

>The following notes are roughly based on **Chapters 1-3** of: Mitchell, R. (2015). [Web Scraping with Python](http://shop.oreilly.com/product/0636920034391.do), O'Reilly

#### Modules and Packages Required for Web Scraping

**BeautifulSoup**: this library defines [classes and functions](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to pull data (e.g. table, lists, paragraphs) out of HTML and XML files. It provides idiomatic ways of navigating, searching, and modifying the parse tree.


**lxml**: to function, BeautifulSoup relies on an external HTML/XML parser. Several options are available, including html5lib and Python's built-in parser. We'll rely on the [lxml](http://lxml.de/) parser, due to its high performance, reliability and flexibility.


**Urllib**: BeautifulSoup does not fetch the web page for us. To do this, we'll rely on the [Urllib](https://docs.python.org/3.7/library/urllib.html#module-urllib) module available in the Python Standard Library, that implements classes and functions which help in opening URLs (authentication, redirections, cookies and so on). We will see another option, **requests**, below.

In [1]:
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup

#### Retrieve and Parse an HTML page

`urllib.request.urlopen()` allows us to retrieve our target HTML page:

In [2]:
html = urlopen("http://www.pythonscraping.com/pages/page1.html")

What if the page doesn't exist?

In [3]:
try:
    html = urlopen("http://www.pythonscraping.com/pages/page.html")
except Exception as e:
    print(e)

HTTP Error 404: Not Found


Well, let's handle this properly...

In [None]:
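from urllib.error import HTTPError, URLError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page.html")
except HTTPError as e:
    pass  # the server answered with an error code (e.g. 404): code your plan B here
except URLError as e:
    raise  # the server could not be reached at all: re-raise the exception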
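
One possible way to package this error handling is a small helper that returns `None` when a page cannot be retrieved. This is only a sketch: the `fetch()` name and its `opener` parameter are assumptions made for this example, and `opener` exists only so that the function can be tried without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response for `url`, or None if it cannot be retrieved.

    `opener` defaults to urlopen; it is a parameter only so that the
    function can be exercised without network access.
    """
    try:
        return opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"HTTP error {e.code} for {url}")
    except URLError as e:
        # The server could not be reached (DNS failure, refused connection...).
        print(f"Could not reach {url}: {e.reason}")
    return None
```

A caller can then write `html = fetch(url)` and test `if html is None:` before handing the result to BeautifulSoup.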

We use `BeautifulSoup()` in conjunction with `lxml` to parse our `html` page and store it in the Beautiful Soup format:

In [None]:
# you might need to do the following:
#!pip install lxml

In [5]:
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
soup_page1 = BeautifulSoup(html, "lxml")

In [6]:
#Let's scrape another couple of pages we'll need in our examples
soup_page3 = BeautifulSoup(urlopen("http://www.pythonscraping.com/pages/page3.html"), "lxml")
soup_wap = BeautifulSoup(urlopen("http://www.pythonscraping.com/pages/warandpeace.html"), "lxml")

#### Let's look at the nested structure of the page

Printing the soup object shows the markup as it is stored; the `prettify()` method adds line breaks and indentation that make the nested structure of the HTML page easier to read:

In [7]:
print(soup_page1)

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>



In [None]:
print(soup_page1.prettify())

<html>
 <head>
  <title>
   A Useful Page
  </title>
 </head>
 <body>
  <h1>
   An Interesting Title
  </h1>
  <div>
   Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
  </div>
 </body>
</html>



#### Let's play with a HTML tag

The notation `soup.<tag>` allows us to retrieve the content marked by a tag (opening and closing tags included)

In [None]:
# note that the first "<div>" tag is nested two layers deep (html → body → div).
soup_page1.div

<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>

If the text is the only thing you're interested in, the `soup.<tag>.string` attribute comes in handy:

In [None]:
soup_page1.div.string

'\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n'
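
One caveat is worth a small illustration: `.string` only returns text when a tag has a single child. The sketch below uses a made-up HTML fragment and the built-in `html.parser` (so that it does not need `lxml`) to show the difference with `get_text()`.

```python
from bs4 import BeautifulSoup

# A made-up fragment: the first <p> holds only text, the second also holds an <i> tag.
html = "<div><p>Only text here</p><p>Text with <i>markup</i> inside</p></div>"
soup = BeautifulSoup(html, "html.parser")

plain, mixed = soup.find_all("p")

print(plain.string)      # the single text child of the tag
print(mixed.string)      # None: the tag has more than one child
print(mixed.get_text())  # all the text below the tag, joined together
```

When a tag may contain nested markup, `get_text()` is usually the safer choice.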

The HTML markup generated by Beautiful Soup can be modified:

In [None]:
# let's change the content of our div
soup_page1.div.string = "this content has been changed"
# let's change the name of the tag
soup_page1.div.name = "new_div"

In [None]:
print(soup_page1.prettify())

<html>
 <head>
  <title>
   A Useful Page
  </title>
 </head>
 <body>
  <h1>
   An Interesting Title
  </h1>
  <new_div>
   this content has been changed
  </new_div>
 </body>
</html>



In its simplest use, the `find()` method is an alternative to the `soup.<tag>` notation...

In [None]:
soup_page1.find("new_div")

<new_div>this content has been changed</new_div>

In [None]:
soup_page1.new_div

<new_div>this content has been changed</new_div>

...but this function allows for the searching of nodes by exploiting cues in the markup, such as a given **class attribute** value:

In [None]:
print(soup_wap.prettify())

In [8]:
soup_wap.find("span", attrs = {"class":"green"})

<span class="green">Anna
Pavlovna Scherer</span>

The values of an attribute for a given tag instance can be retrieved by using the `get("ATTRIBUTE")` method. For instance, if we want to retrieve the URL of an image we can extract the `src` value from the corresponding `<img>` tag:

In [None]:
soup_page3.img.get("src")

'../img/gifts/logo.jpg'

If we want to know all the attributes associated with a given tag, the `attrs` attribute is convenient:

In [None]:
soup_page3.img.attrs

{'src': '../img/gifts/logo.jpg', 'style': 'float:left;'}

In [None]:
# by returning a dictionary, it is easy to see how "attrs" can be used as an alternative to "get()"
soup_page3.img.attrs["src"]

'../img/gifts/logo.jpg'

In [None]:
# if you fancy another way to do the same thing...
soup_page3.img["src"]

'../img/gifts/logo.jpg'

#### Dealing with multiple HTML tags at once

When the same tag is used multiple times in the same page, however, both the `soup.<tag>` notation and the `find()` method allow you to access **only one instance** (i.e. the first):

In [None]:
print(soup_wap.prettify()[180:1190])

In [None]:
soup_wap.span

<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news.</span>

In order to extract the **sequence of all the instances of a tag** in a file, we can use the `find_all()` method (previously known as `findAll()` and `findChildren()` in BS 3 and BS 2, respectively)

In [None]:
soup_wap.find_all("span")

The `find_all()` method also allows for the extraction of all tags by exploiting cues in the markup, such as a given **class attribute** value:

In [None]:
soup_wap.find_all("span",  attrs = {"class":"green"})

[<span class="green">Anna
 Pavlovna Scherer</span>, <span class="green">Empress Marya
 Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>, <span class="green">the prince</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">Prince Vasili</span>, <span class="green">Anna Pavlovna</span>, <span class="green">Anna Pavlovna</span>, <span class="green">the prince</span>, <span class="green">Wintzingerode</span>, <span class="green">King of Prussia</span>, <span class="green">le Vicomte de Mortemart</span>, <span class="green">Montmorencys</span>, <span class="green">Rohans</span>, <span class="green">Abbe Morio</span>, <span class="green">the Emperor</span>, <span class="green">the prince</span>, <span class="green">Pri
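
As a compact recap, the sketch below runs the same kind of class-based search on a small made-up fragment that imitates the markup of the War and Peace page. It uses the built-in `html.parser`, the `class_` keyword shortcut, and the `select()` method for CSS selectors.

```python
from bs4 import BeautifulSoup

# A made-up fragment imitating the markup of the War and Peace page.
html = """
<div id="text">
  <span class="red">Well, Prince, so Genoa and Lucca are now just family estates.</span>
  said <span class="green">Anna Pavlovna</span> to
  <span class="green">Prince Vasili</span>.
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore, because "class" is a Python keyword)
# is a shortcut for attrs={"class": "green"}.
names = [span.get_text() for span in soup.find_all("span", class_="green")]
print(names)

# The same search written as a CSS selector.
same_names = [span.get_text() for span in soup.select("span.green")]
print(same_names)
```

Both searches return the two green spans, in document order.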

#### Navigating the tree

The DOM treats HTML pages as trees.

Beautiful Soup implements several methods that allow you to move through this structure, starting from a given node.

![alt text](https://github.com/bloemj/AUC_TMCI_2022/blob/main/notebooks/images/DOM-table.gif?raw=1)

**Going Down**

You can iterate over a node's children using the `.children` generator:

In [None]:
print(soup_page3.prettify())

<html>
 <head>
  <style>
   img{
	width:75px;
}
table{
	width:50%;
}
td{
	margin:10px;
	padding:10px;
}
.wrapper{
	width:800px;
}
.excitingNote{
	font-style:italic;
	font-weight:bold;
}
  </style>
 </head>
 <body>
  <div id="wrapper">
   <img src="../img/gifts/logo.jpg" style="float:left;"/>
   <h1>
    Totally Normal Gifts
   </h1>
   <div id="content">
    Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.
    <p>
     We haven't figured out how to make online shopping carts yet, but you can send us a check to:
     <br/>
     123 Main St.
     <br/>
     Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.
    </p>
   </div>
   <table id="giftList">
    <tr>
     <th>
      Item Title
     </th>
     <th>
      Description
     </th>
     <th>
      Cost
     </th>
     <th>
      Image
     </th>
   

In [None]:
for child in soup_page3.find("table", {"id":"giftList"}).children:
    print(child)



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


**Going Up**

You can access an element's parent with the `.parent` attribute

In [None]:
print(soup_page3.find("img", {"src":"../img/gifts/img1.jpg"}).parent)

<td>
<img src="../img/gifts/img1.jpg"/>
</td>


You can iterate over all of an element’s parents with `.parents`

In [None]:
for upper_node in soup_page3.find("img", {"src":"../img/gifts/img1.jpg"}).parents:
    print(upper_node.name)

td
tr
table
div
body
html
[document]


You can use `find_parents()` to search for a given parent node:

In [None]:
for upper_node in soup_page3.find("img", {"src":"../img/gifts/img1.jpg"}).find_parents("div"):
    print(upper_node.name)

div


**Siblings**

You can use `find_previous_siblings()` and `find_next_siblings()` to navigate between the siblings (i.e. page elements that are on the same level of the parse tree) that precede or follow a node in the tree, respectively:

In [None]:
for sibling in soup_page3.find("table",{"id":"giftList"}).find_previous_siblings():
    print(sibling.name)

div
h1
img


In [None]:
for sibling in soup_page3.find("table",{"id":"giftList"}).find_next_siblings():
    print(sibling.name)

div
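
The three directions can be combined on a small example. The sketch below uses a made-up, shortened version of the gift table and the built-in `html.parser`. Note that `.children` also yields the whitespace between tags, which is why the loop above prints blank lines.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A made-up, shortened version of the gift table.
html = """
<table id="giftList">
<tr><th>Item Title</th><th>Cost</th></tr>
<tr class="gift"><td>Vegetable Basket</td><td>$15.00</td></tr>
<tr class="gift"><td>Fish Painting</td><td>$10,005.00</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# Going down: .children also yields the whitespace between tags as strings,
# so keep only the Tag objects.
rows = [child for child in table.children if isinstance(child, Tag)]
print(len(rows))

# Sideways: the rows that follow the first gift row.
first_gift = table.find("tr", class_="gift")
later_gifts = first_gift.find_next_siblings("tr")
print([row.td.get_text() for row in later_gifts])

# Going up: from a price cell to its row and to its table.
price_cell = table.find("td", string="$15.00")
print(price_cell.parent.td.get_text(), price_cell.find_parent("table")["id"])
```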


### Web Crawling

Web crawlers are programs designed to collect pages from the Web. In essence, they recursively implement the following steps:

- they start by retrieving the page content for a URL


- they then parse it to retrieve other URLs of interest


- they then focus on these new URLs, for each of which they repeat the whole process, ad infinitum

For instance, if you want to crawl an **entire site**:

- start with a top-level page


- parse the page (retrieve the data your application needs) and extract all the internal links, ignoring already visited URLs


- for each new link, move to the corresponding page and repeat the previous step
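
The steps above can be sketched as a breadth-first traversal. This is a minimal illustration rather than a production crawler: `crawl_site()`, `LinkCollector` and the three-page `PAGES` site are all made up, links are extracted with the built-in `html.parser`, and the pages live in a dictionary so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl_site(start_url, fetch, max_pages=100):
    """Visit every internal page reachable from start_url, breadth first.

    `fetch` is any function that maps a URL to its HTML text.
    Returns the URLs in the order in which they were visited.
    """
    site = urlparse(start_url).netloc
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        collector = LinkCollector()
        collector.feed(fetch(url))
        for href in collector.hrefs:
            # Make the link absolute and drop any #fragment.
            target = urljoin(url, href).split("#")[0]
            # Keep internal links that have not been queued before.
            if urlparse(target).netloc == site and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# A hypothetical three-page site kept in memory instead of on the Web.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="http://other.org/">out</a>',
    "http://example.com/a.html": '<a href="/">home</a> <a href="b.html#top">B</a>',
    "http://example.com/b.html": '<a href="/a.html">A</a>',
}

print(crawl_site("http://example.com/", PAGES.__getitem__))
```

The `seen` set is what prevents the crawler from looping forever between pages that link to each other, and `max_pages` caps the crawl on large sites.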

#### A Random walk through Wikipedia

Let's set our starting page URL, fetch it and parse its HTML:

In [9]:
starting_page = urlopen("https://en.wikipedia.org/wiki/Chris_Cornell")
soup = BeautifulSoup(starting_page, "lxml")

At this point, it should be easy to extract all the links in the page:

In [10]:
# links are defined by <a> tag
for link in soup.find_all("a")[:10]:
    print(link)

<a class="mw-jump-link" href="#bodyContent">Jump to content</a>
<a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a>
<a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a>
<a href="/wiki/Portal:Current_events" title="Articles related to current events"><span>Current events</span></a>
<a accesskey="x" href="/wiki/Special:Random" title="Visit a randomly selected article [x]"><span>Random article</span></a>
<a href="/wiki/Wikipedia:About" title="Learn about Wikipedia and how it works"><span>About Wikipedia</span></a>
<a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us" title="How to contact Wikipedia"><span>Contact us</span></a>
<a href="https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&amp;utm_medium=sidebar&amp;utm_campaign=C13_en.wikipedia.org&amp;uselang=en" title="Support us by donating to the Wikimedia Foundation"><span>Donate</span></a>
<a href="/wiki/Help:Conte

Let's ignore all the "a" tags without an "href" attribute:

In [None]:
for link in [tag for tag in soup.find_all("a") if 'href' in tag.attrs][:10]:
    print(link.attrs['href'])

#mw-head
#p-search
/wiki/File:ChrisCornellTIFFSept2011.jpg
/wiki/Seattle,_Washington
/wiki/Detroit,_Michigan
/wiki/Suicide_by_hanging
/wiki/Hollywood_Forever_Cemetery
/wiki/Susan_Silver
/wiki/Alternative_metal
/wiki/Heavy_metal_music
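
Most of these `href` values are relative to the page they come from. A short sketch of how they can be turned into absolute URLs with `urljoin` from the standard library, using three values taken from the output above:

```python
from urllib.parse import urljoin

# The page the links were found on.
base = "https://en.wikipedia.org/wiki/Chris_Cornell"

hrefs = [
    "/wiki/Susan_Silver",                            # path relative to the site root
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",  # protocol-relative link
    "#mw-head",                                      # anchor within the same page
]
absolute = [urljoin(base, href) for href in hrefs]

for url in absolute:
    print(url)
```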


Wikipedia is full of sidebar, footer, and header links that appear on every page, along with links to the category pages, talk pages, and other pages that do not contain different articles:

```
/wiki/Template_talk:Chris_Cornell
```

```
#cite_note-147
```

Moreover, we don't want to visit pages outside of Wikipedia:

```
http://www.chriscornell.com/
```

Relevant links have three things in common:

- they reside within the `div` with the `id` set to `bodyContent`


- the URLs do not contain colons


- the URLs begin with `/wiki/`

In [None]:
import re

for link in soup.find("div", {"id": "bodyContent"}).find_all("a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])
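
The regular expression used in the cell above can be checked in isolation. The sketch below applies it to four `href` values of the kinds discussed earlier; only the first one is an article link.

```python
import re

# Same pattern as above: starts with /wiki/ and contains no colon afterwards.
article_link = re.compile("^(/wiki/)((?!:).)*$")

candidates = [
    "/wiki/Susan_Silver",                 # article: keep
    "/wiki/Template_talk:Chris_Cornell",  # namespace page (contains a colon): drop
    "#cite_note-147",                     # in-page anchor: drop
    "http://www.chriscornell.com/",       # external site: drop
]
kept = [href for href in candidates if article_link.match(href)]
print(kept)
```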

This code returns the list of all the Wikipedia articles linked to our starting page. 

This is not enough: we want to repeat this process recursively for all these links. That is, we need a function that takes as input a Wikipedia article URL of the form `/wiki/<Article_Name>` and returns a list of all linked articles:

In [None]:
def getLinks(articleUrl):
    page = urlopen("https://en.wikipedia.org" + articleUrl)
    soup = BeautifulSoup(page, "lxml")
    links = soup.find("div", {"id":"bodyContent"}).find_all("a", href=re.compile("^(/wiki/)((?!:).)*$"))
    return links

Let's test our function by calling it in a script that selects a random link at each iteration and stops after 10 URLs have been retrieved (or when it bumps into a page without links):

In [None]:
import random

links = getLinks("/wiki/Chris_Cornell")

for _ in range(10):
    if len(links) > 0:
        newArticle = random.choice(links).attrs["href"]
        print(newArticle)
        links = getLinks(newArticle)
    else:
        print("no links in this page")
        break

/wiki/Cannabis
/wiki/Nabumetone
/wiki/Rofecoxib
/wiki/PubMed_Identifier
/wiki/PubMed_Identifier
/wiki/Digital_object_identifier
/wiki/HTTP_proxy
/wiki/VPN
/wiki/Two-factor_authentication
/wiki/Application_security


---

### Exercise 1.

Write code to retrieve the official address of the Internationally Ranked Universities in the Netherlands by starting from the following Wikipedia article:

https://en.wikipedia.org/wiki/List_of_universities_in_the_Netherlands

In [None]:
# your code here

---

## Working with APIs

An **Application Programming Interface** is a set of protocols that defines how software programs communicate with each other. Without APIs, we have to scrape the Web or get the data directly. With APIs, we can often get structured data, which is much more convenient to work with.

APIs are a great option in that they implement extensively tested routines (**high reliability**). However, you should spend time in learning how they work and, in some cases, they don't allow you to access the piece of information you may need (**low flexibility**).

In [12]:
import requests

In [None]:
# Example of a Google search

In [13]:
query = "Tesla"
r = requests.get('https://www.google.com/search', params={'q': query})

In [None]:
r.status_code

200

In [14]:
print(r.headers['content-type'])
print(r.encoding)
print(r.url)

text/html; charset=utf-8
utf-8
https://consent.google.com/ml?continue=https://www.google.com/search%3Fq%3DTesla&gl=NL&m=0&pc=srp&uxe=none&cm=2&hl=nl&src=1


In [None]:
r.text[:1000]

'<!doctype html><html lang="nl"><head><meta charset="UTF-8"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Tesla - Google zoeken</title><script nonce="/UfDvGuTENH7v4GHQofzrw==">(function(){var a=window.performance;window.start=(new Date).getTime();a:{var b=window;if(a){var c=a.timing;if(c){var d=c.navigationStart,e=c.responseStart;if(e>d&&e<=window.start){window.start=e;b.wsrt=e-d;break a}}a.now&&(b.wsrt=Math.floor(a.now()))}}window.google=window.google||{};google.aft=function(f){f.setAttribute("data-iml",+new Date)};}).call(this);(function(){window.jsarwt=function(){return!1};}).call(this);(function(){var c=[],e=0;window.ping=function(b){-1==b.indexOf("&zx")&&(b+="&zx="+(new Date).getTime());var a=new Image,d=e++;c[d]=a;a.onerror=a.onload=a.onabort=function(){delete c[d]};a.src=b};}).call(this);</script><style>body{margin:0 auto;max-width:736px;padding:0 8px}a{color:#1967D2;text-decoration:none;tap-highlight-color:rgba(0,0,0,.1)}a:

---

### Exercise 2.

1. Inspect the Google search results page and understand how results are displayed.


2. Use BeautifulSoup to extract the links of the first 10 results of this search.

In [None]:
# your code here

---

## Google Vertical Search Engine

A *Vertical Search Engine* is a "specialized" search engine that focuses on a specific domain or service, tailored to the particular information needs of niche audiences and professions. These also have a more detailed API than the general Google search, so we can use one to practice our API skills. But first, we have to make a Vertical Search Engine and get our API key.

### Build the Vertical Search Engine

We will use Google Co-op's custom search engine http://cse.google.com/. Follow these steps to create your search engine:

* Click on "New Search Engine" and sign in.
* Specify a list of websites to search. You can start with a few websites, and add more later (aim for 10-20). If you add too many, it may not search all of them (you can see this in the Control Panel). A website is searched only if it has a green checkmark.
* Enter the basic information: language, name.
* Click on "Create".

To use your search engine in Python, you need two things: your Search Engine ID (visible in the CSE control panel) and an API key. To get a key, go to this page: https://developers.google.com/custom-search/v1/introduction and click Get a Key.

Now, we can try to search using the API.

In [None]:
import requests

# get the API KEY here: https://developers.google.com/custom-search/v1/overview
API_KEY = ""
# get your Search Engine ID on your CSE control panel
SEARCH_ENGINE_ID = ""

In [None]:
# The query you want to search for.
query = "List of public APIs"

# using the first page
page = 1

# Making the link to Google to search
# Documentation on this topic: https://developers.google.com/custom-search/v1/using_rest
# Start should be the index of the first result you want to see, and each page has 10 results.
# So if we want to see page 2, we need to start at result number 11.
start = (page - 1) * 10 + 1
# Building the link to send to Google
url = f"https://www.googleapis.com/customsearch/v1?key={API_KEY}&cx={SEARCH_ENGINE_ID}&q={query}&start={start}"
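
The query above contains spaces, which have to be encoded before they can travel in a URL. A possible variant, sketched with `urlencode` from the standard library, makes that encoding explicit. The helper names `first_result_index()` and `build_search_url()` and the placeholder credentials are assumptions made for this illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.googleapis.com/customsearch/v1"


def first_result_index(page, page_size=10):
    """Index of the first result on a 1-based page."""
    return (page - 1) * page_size + 1


def build_search_url(api_key, engine_id, query, page=1):
    """Build the request URL with every parameter encoded."""
    params = {
        "key": api_key,
        "cx": engine_id,
        "q": query,
        "start": first_result_index(page),
    }
    return BASE_URL + "?" + urlencode(params)


# Placeholder credentials, for illustration only.
print(build_search_url("MY_KEY", "MY_ENGINE", "List of public APIs", page=2))
```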

**Warning**: With the free version, you can only make 100 queries per day to Google (the requests.get part). So don't run the "get" queries too often or you will hit the limit before being able to finish the assignment.

In [None]:
# Make the search request to the API. This is a cell you want to run as few times as possible.
data = requests.get(url).json()

In [None]:
search_results = data.get("items")

The request returns a dictionary, `data`, containing the result of our request. The actual search results are stored in `data["items"]`, so we have saved that to the variable `search_results`. We can re-use this variable to avoid making many requests to Google.
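
Since `data.get("items")` returns `None` when the key is missing, it can help to fall back to an empty list. The sketch below assumes that a response without hits simply has no `"items"` key. The `result_links()` helper and the two hand-written dictionaries are stand-ins for real responses, so that no request is spent.

```python
def result_links(data):
    """Return the URLs of the search results in `data`, best first.

    Falls back to an empty list when the response has no "items" key.
    """
    return [item.get("link") for item in data.get("items", [])]


# Hand-written stand-ins for two responses (only the fields used here).
with_hits = {"items": [{"title": "A", "link": "https://example.com/a"},
                       {"title": "B", "link": "https://example.com/b"}]}
without_hits = {"searchInformation": {"totalResults": "0"}}

print(result_links(with_hits))
print(result_links(without_hits))
```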

### Read through the search results

Let's print the results in a nice way:

In [None]:
def print_search_results(search_results, start=1):
    # number the results from `start`, the index of the first result on the page
    for i, search_item in enumerate(search_results, start=start):
        # get the page title
        title = search_item.get("title")
        # page snippet
        snippet = search_item.get("snippet")
        # alternatively, you can get the HTML snippet (bolded keywords)
        html_snippet = search_item.get("htmlSnippet")
        # extract the page url
        link = search_item.get("link")
        # print the results
        print("="*15, f"Result #{i}", "="*15)
        print("Title:", title)
        print("Description:", snippet)
        print("URL:", link, "\n")

print_search_results(search_results, start=start)

### Exercise 3.

1. We did not really cover Information Retrieval in this course, but we can still use its evaluation methods here: how good is your vertical search engine? Make two different vertical search engines that target the same topic (for example, one with the "Search the entire web" setting enabled, and the other without), and evaluate them. Your gold standard should consist of a list of keywords, each paired with a web page that should be found using that keyword and that belongs to the list of sites your vertical search engine searches:


In [None]:
eval_dataset_dict = {
    "huggingface api key": "https://huggingface.co/docs/hub/security-tokens",
}

Write code to compare the performance of your two vertical search engines using the Mean Reciprocal Rank metric (http://en.wikipedia.org/wiki/Mean_reciprocal_rank).
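
As a reminder of the definition given in the linked article, for a set of queries $Q$, where $\mathrm{rank}_i$ is the position of the first correct page in the results for query $i$:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$

For instance, if the expected page is found at rank 1 for a first query, at rank 3 for a second, and not found at all for a third (which contributes 0), then $\mathrm{MRR} = (1 + 1/3 + 0)/3 = 4/9 \approx 0.44$.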

In [None]:
# your code here

2. Also evaluate its Precision.

In [None]:
# your code here

## LyricsGenius API

Another very nice API that has sometimes been used in projects for this course is the LyricsGenius API, allowing you to query the Genius website of song lyrics. They have their own Python package:

In [None]:
#!pip install lyricsgenius

As usual, you need an API key, but it is free to make an account: https://genius.com/api-clients/new

In [None]:
import lyricsgenius as lg
api_key = ""
genius = lg.Genius(api_key, skip_non_songs=True, excluded_terms=["(Remix)", "(Live)"], remove_section_headers=True)

You can have a look at the documentation to see what you can do with this API, which is often what you have to do when working with APIs, as every API is different: https://lyricsgenius.readthedocs.io/en/master/usage.html

For example, we can get an object that has all songs by an artist, and then print the lyrics of a song from it:

In [None]:
artist = genius.search_artist("Rick Astley")
song = artist.song("Never Gonna Give You Up")
print(song.lyrics)

### Exercise 4

1. Let's make an API pipeline. Use the LyricsGenius API to find the YouTube URL of a song that you are interested in, then use the YouTube Data API (https://developers.google.com/youtube/v3) to retrieve all comments made on the video for that song, and print the most liked one.

In [None]:
# your code here

## General Requests to web APIs

What about using `requests` to query other web APIs? It is easy: pass the query parameters as a dictionary through the `params` argument. Responses follow the API's standard format (or the one you request, if the API offers a choice).

In [15]:
r = requests.get('https://api.github.com')

# raw
r.content

b'{"current_user_url":"https://api.github.com/user","current_user_authorizations_html_url":"https://github.com/settings/connections/applications{/client_id}","authorizations_url":"https://api.github.com/authorizations","code_search_url":"https://api.github.com/search/code?q={query}{&page,per_page,sort,order}","commit_search_url":"https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}","emails_url":"https://api.github.com/user/emails","emojis_url":"https://api.github.com/emojis","events_url":"https://api.github.com/events","feeds_url":"https://api.github.com/feeds","followers_url":"https://api.github.com/user/followers","following_url":"https://api.github.com/user/following{/target}","gists_url":"https://api.github.com/gists{/gist_id}","hub_url":"https://api.github.com/hub","issue_search_url":"https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}","issues_url":"https://api.github.com/issues","keys_url":"https://api.github.com/user/keys","label_sea

In [16]:
# json
r.json()

{'current_user_url': 'https://api.github.com/user',
 'current_user_authorizations_html_url': 'https://github.com/settings/connections/applications{/client_id}',
 'authorizations_url': 'https://api.github.com/authorizations',
 'code_search_url': 'https://api.github.com/search/code?q={query}{&page,per_page,sort,order}',
 'commit_search_url': 'https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}',
 'emails_url': 'https://api.github.com/user/emails',
 'emojis_url': 'https://api.github.com/emojis',
 'events_url': 'https://api.github.com/events',
 'feeds_url': 'https://api.github.com/feeds',
 'followers_url': 'https://api.github.com/user/followers',
 'following_url': 'https://api.github.com/user/following{/target}',
 'gists_url': 'https://api.github.com/gists{/gist_id}',
 'hub_url': 'https://api.github.com/hub',
 'issue_search_url': 'https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}',
 'issues_url': 'https://api.github.com/issues',
 'keys_url': '
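
To see what the `params` dictionary does, a request can be prepared without being sent. The sketch below uses GitHub's repository search endpoint with arbitrary parameter values and prints the final URL that `requests` would call.

```python
import requests

# Build the request without sending it, to inspect the final URL.
# The parameter values are arbitrary examples.
request = requests.Request(
    "GET",
    "https://api.github.com/search/repositories",
    params={"q": "web scraping", "per_page": 5},
)
prepared = request.prepare()
print(prepared.url)

# Sending it would then be: requests.Session().send(prepared).json()
```

In everyday use, `requests.get(url, params={...})` performs the same encoding and sends the request in one step.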

## Twitter API

**Note**: Since Elon Musk's takeover of Twitter, this last section of the notebook has become less useful, as they now charge \$100 a month even for basic access. Nevertheless, it is here for reference.

Before that, Twitter developed version 2 of its API, which is unfortunately more limited than the old one, but you may still encounter both in code. Version 1 is no longer accessible, even on the basic plan. This notebook uses API v2. For a version using API v1, see last year's course materials.

Two main APIs:

* **Streaming API**: a sample of public tweets and events as they are published on Twitter; it provides only real-time data, without limits.

* **REST API**: allows you to search, follow trends, read author profile and follower data, and post or modify content. It provides historical data up to a week old (for the free account, more by paying), requires a one-time request and has rate limits (which vary across requests and subscriptions).


REST is a widely used architectural style for developing web services: https://en.wikipedia.org/wiki/Representational_state_transfer
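To make the idea of a one-off REST request concrete, here is a minimal sketch of how a call to the API v2 recent-search endpoint could be assembled with the standard library only. Nothing is sent over the network. The token is a placeholder, the helper name `build_recent_search_request` is ours (it is not part of any library), and the endpoint URL is the one documented for API v2 at the time of writing, so check the current documentation before relying on it.

```python
import urllib.parse
import urllib.request

# Assumption: API v2 recent-search endpoint as documented at the time of writing.
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def build_recent_search_request(query, bearer_token, max_results=10):
    """Build (but do not send) a GET request for recent tweets matching `query`."""
    # Query parameters are URL-encoded, so '#' becomes '%23'.
    params = urllib.parse.urlencode({"query": query, "max_results": max_results})
    return urllib.request.Request(
        f"{SEARCH_URL}?{params}",
        # OAuth 2.0 bearer-token authentication travels in an HTTP header.
        headers={"Authorization": f"Bearer {bearer_token}"},
    )


req = build_recent_search_request("#nlproc", bearer_token="YOURS")
print(req.get_method())  # GET
print(req.full_url)
```

Sending the request (for example with `urllib.request.urlopen(req)`) would return a JSON document, and would count against the rate limit of your subscription.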

Some more basic info: https://developer.twitter.com/en/docs/basics/things-every-developer-should-know

Tutorials: https://developer.twitter.com/en/docs/tutorials

#### Using the API: authentication

Let's do this on the Twitter developer website.

A good way to store your keys is using `.conf` files and `configparser`. **DO NOT put them on GitHub.**

In [None]:
import configparser
config = configparser.ConfigParser(interpolation=None)
config.read("stuff/conf.conf")

['stuff/conf.conf']

In [None]:
config['twitter']['api_key']

'hBn3fPoa7TXlL4fEEbZ5l1cbd'

This is what my `conf.conf` file looks like (also in `stuff/conf_public.conf`):

```
[twitter]
bearer_token = YOURS
```
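One detail worth knowing: `ConfigParser.read` silently skips files it cannot find and simply returns the list of files it did read. The sketch below wraps the lookup in a small helper that fails loudly instead. The helper name `load_bearer_token` is hypothetical, and the demo writes a throwaway config file to a temporary directory so that it runs without your real keys.

```python
import configparser
import os
import tempfile


def load_bearer_token(path):
    """Return the bearer token stored under [twitter] in the config file at `path`."""
    config = configparser.ConfigParser(interpolation=None)
    # read() returns the list of files it managed to parse; empty means "not found".
    if not config.read(path):
        raise FileNotFoundError(f"No config file found at {path!r}")
    return config["twitter"]["bearer_token"]


# Demo with a temporary file that mirrors the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "conf.conf")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("[twitter]\nbearer_token = YOURS\n")
    token = load_bearer_token(demo_path)

print(token)  # YOURS
```

Keeping the real file out of version control (for example by listing it in `.gitignore`) and committing only a public template is one way to follow the advice above.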

#### A useful package: Tweepy

https://tweepy.readthedocs.io/en/latest/index.html

In [None]:
#You will need a recent version of Tweepy to work with the new Twitter API
#!pip install tweepy==4.8.0

In [None]:
import tweepy

In [None]:
# Tweepy Hello World

# authentication (OAuth 2.0)

client = tweepy.Client(config['twitter']['bearer_token'])

usertweets = client.get_users_tweets(id=156280168)

for tweet in usertweets.data[:5]:
    print(tweet.text)


RT @FMG_UvA: Orthopedagoog Levi van Dam is mede-initiatiefnemer van ‘We Spark the World’. Een campagne met challenges om #jongeren #mentaal…
Drie UvA-wetenschapstalenten ontvangen een #Rubicon-subsidie van @NWONieuws! Gefeliciteerd 👏Laura Burgers @act_privatelaw, @TesselBouwens @UvA_Science en @SvanNeerven @amsterdamumc! Met de Rubicon kunnen wetenschappers ervaring op doen in het buitenland. https://t.co/P26jKaBkKV
RT @FMG_UvA: Hoe #socialemedia onderdeel zijn geworden van de #oorlog.

Interview met @UvA_Amsterdam communicatiewetenschappers @TomDobber…
Door de oorlog in #Oekraïne is het humanitair oorlogsrecht de afgelopen weken veel in het nieuws geweest. De wisselwerking tussen het humanitair oorlogsrecht en het internationaal strafrecht zorgt voor uitdagingen, aldus promovendus Rogier Bartels (ICC): https://t.co/EVmCrWDjav
RT @NKI_nl: We’re excited to officially launch our new AI-lab: POP-AART, together with @UvA_Amsterdam and @Elekta. Would you like to learn…


#### Interlude: JSON

The Twitter API returns data structured in the JSON format. [JSON](https://www.json.org) (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. **Once parsed, it maps naturally onto nested Python dictionaries and lists.**


Minimal example:

```json
{
  "firstName": "John",
  "lastName": "Doe",
  "age": 21
}
```

Extended example:

```json
{
  "$id": "https://example.com/person.schema.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Person",
  "type": "object",
  "properties": {
    "firstName": {
      "type": "string",
      "description": "The person's first name."
    },
    "lastName": {
      "type": "string",
      "description": "The person's last name."
    },
    "age": {
      "description": "Age in years which must be equal to or greater than zero.",
      "type": "integer",
      "minimum": 0
    }
  }
}
```


Online viewer: http://jsonviewer.stack.hu
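As a quick illustration of how JSON maps onto Python objects, the sketch below parses a slightly extended version of the minimal example with the standard `json` module. The `languages` field is added here purely to show a nested list; it is not part of the example above.

```python
import json

raw = """
{
  "firstName": "John",
  "lastName": "Doe",
  "age": 21,
  "languages": ["en", "nl"]
}
"""

# json.loads turns a JSON string into Python objects:
# objects become dicts, arrays become lists, numbers become int/float.
person = json.loads(raw)
print(type(person).__name__)   # dict
print(person["firstName"])     # John
print(person["languages"][1])  # nl

# json.dumps goes the other way, here with pretty-printing.
back = json.dumps(person, indent=2)
print(back)
```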

#### Using the API: search

Retrieve the most recent tweets containing a given hashtag.

In [None]:
# queries

for tweet in tweepy.Paginator(client.search_recent_tweets, "#nlproc", max_results=10).flatten(limit=20):
    print(tweet.data)
    print(tweet.data["text"])


{'id': '1514575939883216902', 'text': 'RT @KirkDBorne: Use *already solved* #DataScience projects for your real-world business problems now. Explore many examples here: https://t…'}
RT @KirkDBorne: Use *already solved* #DataScience projects for your real-world business problems now. Explore many examples here: https://t…
{'id': '1514575718361051143', 'text': 'RT @SapienzaNLP: #NLPaperAlert 📢 We bring together existing resources, revise them, and propose SRL4E, a unified evaluation on Semantic Rol…'}
RT @SapienzaNLP: #NLPaperAlert 📢 We bring together existing resources, revise them, and propose SRL4E, a unified evaluation on Semantic Rol…
{'id': '1514574359582720013', 'text': 'RT @gp_pulipaka: Clinical Named Entity Recognition Using spaCy! #BigData #Analytics #DataScience #AI #MachineLearning #NLProc #IoT #IIoT #P…'}
RT @gp_pulipaka: Clinical Named Entity Recognition Using spaCy! #BigData #Analytics #DataScience #AI #MachineLearning #NLProc #IoT #IIoT #P…
{'id': '1514569941994942473', '

In [None]:
# queries with more information fields
import json

for tweet in tweepy.Paginator(client.search_recent_tweets, "#nlproc", max_results=10, 
                              tweet_fields=["created_at", "in_reply_to_user_id", "referenced_tweets", "public_metrics"]).flatten(limit=5):
    print(json.dumps(tweet.data, indent=4, sort_keys=False))
#for item in tweets.items(5):
#    print(json.dumps(item._json, indent=4, sort_keys=False))

#### Using the API: users

Get some info on a given user, and explore their friends/followers.

In [None]:
user = client.get_user(username="elonmusk", user_fields=["public_metrics"])
print(user)
print("User:",user.data.username)
print("------")
print("Following:",user.data.public_metrics["following_count"])
print("Followers:",user.data.public_metrics["followers_count"])
print("------")
print("Following:")
friends = client.get_users_following(user.data.id, max_results=10)[0]
for friend in friends:
    print(friend.data["username"])
print("------")
print("Followers:")
followers = client.get_users_followers(user.data.id, max_results=10)[0]
for follower in followers:
    print(follower.data["username"])

Response(data=<User id=44196397 name=Elon Musk username=elonmusk>, includes={}, errors=[], meta={})
User: elonmusk
------
Following: 113
Followers: 81656079
------
Following:
beeple
Grimezsz
thesheetztweetz
EvaFoxU
planet
Teslarati
OfficialPCMR
stats_feed
universal_sci
gunsnrosesgirl3
------
Followers:
Gamewiz18
Miguelib88
SunbridgeConsu1
Barcaismyclub
PetrasPhyllis
mdreichard3
ZeusZeu05906203
cryptoburrenci
han54282601
BrigSpurck


#### Using the API: tweets from user

In [None]:
#user = api.get_user("elonmusk", tweet_mode="extended") # extended tweetmode gets also the longer 280/char tweets
#elon_tweets = user.timeline()

#for tweet in elon_tweets[:5]:
#    print(tweet.text)

usertweets = client.get_users_tweets(id=client.get_user(username="elonmusk").data.id)

for tweet in usertweets.data[:5]:
    print(tweet.text)

I made an offer 
https://t.co/VvreuPMeLu
RT @SpaceX: Photos from the @space_station of Dragon and the Ax-1 astronauts approaching the orbiting laboratory before docking on Saturday…
@JeffBezos Great idea
@BLKMDL3 Coming soon
@tesla_raj Exciting times ahead


#### Twitter data: tokenizing

Tokenizing tweets requires a dedicated approach.

In [None]:
a_tweet = "@moyo5150 More like https://t.co/f8BUlz0Xdo #robots"

In [None]:
import nltk
from nltk.tokenize import word_tokenize
#nltk.download('punkt')

word_tokenize(a_tweet)

['@',
 'moyo5150',
 'More',
 'like',
 'https',
 ':',
 '//t.co/f8BUlz0Xdo',
 '#',
 'robots']

In [None]:
from nltk.tokenize import TweetTokenizer

tweet_tokenizer = TweetTokenizer(preserve_case=True, reduce_len=False, strip_handles=False)
tweet_tokenizer.tokenize(a_tweet)

['@moyo5150', 'More', 'like', 'https://t.co/f8BUlz0Xdo', '#robots']

Other options are available, e.g., the [ark-twokenizer](https://github.com/myleott/ark-twokenize-py).
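To see why a dedicated approach helps, here is a minimal regex-based sketch that keeps URLs, mentions and hashtags together as single tokens. It is a toy illustration using only the standard library, not a reimplementation of NLTK's `TweetTokenizer`; the function name `simple_tweet_tokenize` and the patterns are our own simplifications and ignore many cases (emoticons, e-mail addresses, and so on).

```python
import re

# Alternatives are tried left to right, so the most specific patterns come first.
_PATTERNS = [
    r"https?://\S+",   # URLs, kept whole
    r"[@#]\w+",        # @mentions and #hashtags, kept with their prefix
    r"\w+(?:'\w+)?",   # words, optionally with an internal apostrophe
    r"[^\w\s]",        # any other single non-space symbol (e.g. punctuation)
]
TOKEN_RE = re.compile("|".join(_PATTERNS))


def simple_tweet_tokenize(text):
    """Split a tweet into tokens without breaking URLs, mentions or hashtags."""
    return TOKEN_RE.findall(text)


example = "@moyo5150 More like https://t.co/f8BUlz0Xdo #robots"
print(simple_tweet_tokenize(example))
# ['@moyo5150', 'More', 'like', 'https://t.co/f8BUlz0Xdo', '#robots']
```

The order of the patterns matters: if the generic word pattern came first, `https` would be split off from the rest of the URL, much as `word_tokenize` does above.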

---

For those who don't have a Twitter account and app, here are some tweets on and by Boris Johnson!

In [None]:
user = "@BorisJohnson"
#tweets_on_user = tweepy.Cursor(api.search, q=user, tweet_mode="extended")
tweets_on_user = tweepy.Paginator(client.search_recent_tweets, user, max_results=100).flatten(limit=1000)

on_boris = list()
for tweet in tweets_on_user:
    on_boris.append(tweet.data["text"])
    
#print("\n------\n")
# from user

#user = api.get_user("BorisJohnson", tweet_mode="extended") # extended tweetmode gets also the longer 280/char tweets
#tweets_from_user = user.timeline(count=100)

usertweets = tweepy.Paginator(client.get_users_tweets, id=client.get_user(username="BorisJohnson").data.id, max_results=100).flatten(limit=1000)

from_boris = list()
for tweet in usertweets:
    from_boris.append(tweet.data["text"])

In [None]:
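# save to file
import csv

f_on_boris = "stuff/tweets_on_boris.csv"
f_from_boris = "stuff/tweets_from_boris.csv"

# one tweet per row, with "" as the text delimiter; csv.writer also escapes
# quotes and newlines that occur inside a tweet
with open(f_on_boris, "w", newline="", encoding="utf-8") as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerows([t] for t in on_boris)

with open(f_from_boris, "w", newline="", encoding="utf-8") as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerows([t] for t in from_boris)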

### Exercise 5 (if you are rich):

1. Download the last 100 (or another number) tweets mentioning a user you are interested in and the last 100 from the user themselves. Alternatively, use the tweets in the on_boris and from_boris files.


2. Create a minimal pipeline to normalize the tweets into lists of tokens.


3. Count and compare, across the two datasets, the 10 most frequent:
    - tokens
    - hashtags
    - other user mentions

In [None]:
# your code here