In [2]:
pip install graphviz

Defaulting to user installation because normal site-packages is not writeable
Collecting graphviz
  Downloading graphviz-0.20.1-py3-none-any.whl (47 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.0/47.0 KB 591.5 kB/s eta 0:00:00
Installing collected packages: graphviz
Successfully installed graphviz-0.20.1
Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas as pd
import time
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
from graphviz import Digraph
import re

# Web and Cloud Computing (DATA 534): Lab 1
## General Lab Instructions

- This assignment is to be completed in Python. Submit both a `.ipynb` file (you can add your answers directly to this one) and a rendered `.md`.
- I added an Intro section to help you with the basics for this lab. If you are already comfortable with scraping data using Python, you can safely skip it.

## Intro 

This intro section introduces the main Python functions for web scraping and crawling.

### Web requests

When you navigate to a website, there are many things going on under the hood. Many layers of protocols are used to allow you to communicate with a web server (take a look at the [OSI model](https://en.wikipedia.org/wiki/OSI_model) if you are curious). We will mostly be dealing with the last layer (Layer 7 - [application layer](https://en.wikipedia.org/wiki/Application_layer)) and [HTTP](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol) (HyperText Transfer Protocol). We'll leave it to the Python libraries to handle the details of the other layers for us. 

HTTP is text based. For example, to send a GET request to a web server:
```
GET / HTTP/1.1
Host: www.google.com
User-Agent: Python-urllib/3.6
```
and the server sends back a response, also with a header and the requested content.
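To make the text-based nature of HTTP more concrete, here is a minimal sketch that builds the text of a GET request and splits a response into its parts. The helper names `build_get_request` and `parse_response` and the sample response are made up for illustration; in practice the libraries shown later do all of this for us.

```python
def build_get_request(host, path="/"):
    """Return the text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, path, protocol version
        f"Host: {host}",         # the Host header is mandatory in HTTP/1.1
        "Connection: close",
        "",                      # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines)


def parse_response(raw):
    """Split a raw HTTP response into (status_code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    status_code = int(status_line.split(" ")[1])
    headers = dict(line.split(": ", 1) for line in header_lines)
    return status_code, headers, body


# A hand-written response, for illustration only (not captured from a server)
sample_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<p>Hello!</p>"
)

print(build_get_request("www.google.com"))
print(parse_response(sample_response))
```

Note that the headers and the body are separated by an empty line, both in the request and in the response.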

Here we will use Python (instead of a web browser like Chrome) to collect information from the web. For illustration purposes, let's scrape historical data on the World Cups (soccer) available on [this Wikipedia page](https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_winners).

#### The `urllib` package

[`urllib`](https://docs.python.org/3/library/urllib.html) is a built-in Python package for working with URLs. To open and read URLs, we use the function [`urllib.request.urlopen`](https://docs.python.org/3/library/urllib.request.html#module-urllib.request). Let's start by importing this function.

In [2]:
from urllib.request import urlopen

Ok, now all we have to do is call the function and pass the URL we want. Note that although we don't usually need to type the scheme ("http://" or "https://") when we are using a web browser, here we do (try removing this part if you are curious).

In [3]:
soccer_urllib = urlopen("https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_winners")
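If you want to check whether a URL includes the scheme before requesting it, one option is `urllib.parse.urlparse`. The sketch below is only an illustration; `has_scheme` is a made-up helper name, and it only accepts the two schemes we use in this lab.

```python
from urllib.parse import urlparse


def has_scheme(url):
    """Return True if the URL starts with the http or https scheme."""
    return urlparse(url).scheme in ("http", "https")


candidates = [
    "https://en.wikipedia.org/wiki/FIFA_World_Cup",
    "en.wikipedia.org/wiki/FIFA_World_Cup",  # what we would type in a browser
]
for candidate in candidates:
    print(candidate, "->", has_scheme(candidate))
```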

Now our `soccer_urllib` object contains some information about our request. For example:

In [4]:
print("The url of our request: ", end="")
print(soccer_urllib.geturl())

The url of our request: https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_winners


We can also get the headers from the server's response (you don't need to spend a lot of time trying to understand these headers; this is just to show you how to access this info):

In [5]:
print(soccer_urllib.info())

Date: Fri, 14 Jan 2022 05:34:21 GMT
Vary: Accept-Encoding,Cookie,Authorization
Server: ATS/8.0.8
X-Content-Type-Options: nosniff
P3p: CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."
Content-Language: en
Last-Modified: Tue, 11 Jan 2022 01:27:44 GMT
Content-Type: text/html; charset=UTF-8
Age: 0
X-Cache: cp4032 hit, cp4027 miss
X-Cache-Status: hit-local
Server-Timing: cache;desc="hit-local", host;desc="cp4027"
Strict-Transport-Security: max-age=106384710; includeSubDomains; preload
Report-To: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
NEL: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
Permissions-Policy: interest-cohort=()
Set-Cookie: WMF-Last-Access=14-Jan-2022;Path=/;HttpOnly;secure;Expires=Tue, 15 Feb 2022 00:00:00 GMT
Set-Cookie: WMF-Last-Access-

Ok, but what about the content of the page? Let's check it out!

In [6]:
soccer_html = soccer_urllib.read()
print(soccer_html)

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of FIFA World Cup winning players - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"6470f84c-0bb8-4e8a-a69f-00dccd3d097d","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_FIFA_World_Cup_winning_players","wgTitle":"List of FIFA World Cup winning players","wgCurRevisionId":1064949742,"wgRevisionId":1064949742,"wgArticleId":27818578,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["FIFA player ID not in Wikidata","Articles with short description","Sho
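Notice the `b` at the start of the output: `read()` returns `bytes`, not a string. The sketch below shows one way to turn bytes into text using the charset announced in the `Content-Type` header. To keep it runnable offline, `raw` and `headers` are small stand-ins that I built by hand; with a real response you would use `soccer_html` and `soccer_urllib.info()` instead.

```python
from email.message import Message

# Stand-in for the bytes returned by read()
raw = "Pelé".encode("utf-8")
print(raw)                  # b'Pel\xc3\xa9'
print(raw.decode("utf-8"))  # Pelé

# Stand-in for the response headers (info() returns an object of this family)
headers = Message()
headers["Content-Type"] = "text/html; charset=UTF-8"

# Use the charset announced by the server, falling back to UTF-8 if absent
charset = headers.get_content_charset() or "utf-8"
text = raw.decode(charset)
print(charset, text)
```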

Yes, it is a complete mess, I know. How do we make sense of all this? Well, there are different approaches. We could use regular expressions (not a good idea though), or we could use Python packages such as [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and [lxml](https://lxml.de/tutorial.html) to help us parse it. Here we will focus on `BeautifulSoup`, but you have seen `lxml` in 513, remember?

We have to be careful when using `urllib.request.urlopen`, because if the request fails (for example, the host cannot be found or the page does not exist), it will raise an exception and stop your program. Try running the commented code in the following cell:

In [7]:
#my_pg = urlopen("www.lourenzu.com")

The exceptions raised by `urllib.request` are defined in [`urllib.error`](https://docs.python.org/3/library/urllib.error.html#module-urllib.error). So we can import this package and make a more robust piece of code that handles these exceptions:

In [3]:
from urllib.error import HTTPError, URLError
try:
    my_pg = urlopen("http://www.lourenzu.com")
except URLError as error:
    print(error.reason)

[Errno -2] Name or service not known


In [9]:
try:
    my_pg = urlopen("http://www.google.com/rodolfo_lourenzutti")
except HTTPError as error:
    print("The famous error code: ", error.code)
    print("The reason for the exception:", error.reason)

The famous error code:  404
The reason for the exception: Not Found
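The two cells above can be combined into a single helper. This is just a sketch: `safe_fetch` and its `opener` parameter are hypothetical names (not part of `urllib`), and `opener` exists only so that we can swap in a fake function when testing. Since `HTTPError` is a subclass of `URLError`, it has to be caught first.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def safe_fetch(url, opener=urlopen):
    """Return the page content as bytes, or None if the request fails."""
    try:
        with opener(url) as response:
            return response.read()
    except HTTPError as error:
        # The server answered, but with an error status (e.g., 404)
        print(f"HTTP error {error.code}: {error.reason}")
    except URLError as error:
        # We could not reach the server at all (e.g., unknown host)
        print(f"Could not reach the server: {error.reason}")
    return None
```

For example, `safe_fetch("http://www.google.com/rodolfo_lourenzutti")` should print the 404 message and return `None` instead of stopping the program.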


Handling exceptions is an important part of web scraping. Let's see an alternative package next.

#### The `requests` package

The [`requests`](http://docs.python-requests.org/en/master/user/quickstart/#custom-headers) package is an alternative (not built-in) HTTP library that is becoming more and more popular. Let's load the package and request our Wikipedia page: 

In [7]:
import requests

soccer_requests = requests.get("https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_winners")

Unlike `urllib`, `requests` does not raise an exception for HTTP errors:

In [11]:
bad_request = requests.get("http://www.google.com/rodolfo_lourenzutti")
bad_request

<Response [404]>

But it does raise an exception for some errors (uncomment the line below to see):

In [12]:
#requests.get("http://www.lourenzu.com")
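If you prefer HTTP errors to raise exceptions as well, `requests` offers `Response.raise_for_status()`. Here is a minimal sketch of how it could be used. To keep it runnable without a network connection, I build a `Response` object by hand to stand in for a real 404 reply; normally the object would come from `requests.get(...)`.

```python
import requests

# Hand-built stand-in for a real 404 reply (illustration only)
fake_response = requests.Response()
fake_response.status_code = 404
fake_response.reason = "Not Found"
fake_response.url = "http://www.google.com/rodolfo_lourenzutti"

print(fake_response.ok)  # False for 4xx and 5xx status codes

try:
    fake_response.raise_for_status()  # raises HTTPError for 4xx/5xx codes
    outcome = "success"
except requests.exceptions.HTTPError as error:
    outcome = "http error"
    print("HTTP error:", error)
except requests.exceptions.RequestException as error:
    # Base class of the exceptions raised by requests (e.g., connection errors)
    outcome = "request failed"
    print("Request failed:", error)
```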

Now, let's check the status code of our Wikipedia request:

In [8]:
soccer_requests.status_code

200

In [14]:
soccer_requests.reason

'OK'

Great! Code 200 means everything is ok! We can also check the headers of the request and of the response: 

In [15]:
# Headers of the request
print(soccer_requests.request.headers)

{'User-Agent': 'python-requests/2.26.0', 'Accept-Encoding': 'gzip, deflate, br', 'Accept': '*/*', 'Connection': 'keep-alive'}


In [None]:
# Headers of the response
soccer_requests.headers

Lastly, let's access the content of the page:

In [17]:
soccer_requests.text

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of FIFA World Cup winning players - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"6470f84c-0bb8-4e8a-a69f-00dccd3d097d","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_FIFA_World_Cup_winning_players","wgTitle":"List of FIFA World Cup winning players","wgCurRevisionId":1064949742,"wgRevisionId":1064949742,"wgArticleId":27818578,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["FIFA player ID not in Wikidata","Articles with short description","Shor

### Beautiful Soup

[`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a very useful package for handling data in HTML format. It has two main methods:
1. [`BeautifulSoup.find()`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find)
2. [`BeautifulSoup.find_all()`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all)

There are other functions that might be useful as well (e.g., `find_next_sibling()`, `find_parent()`) 
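Before we apply these methods to the real page, here is a small sketch on a tiny HTML snippet that I made up for illustration. It shows the main difference: `find()` returns the first match (or `None`), while `find_all()` returns a list with every match.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet, for illustration only
html = """
<ul>
  <li class="team">Brazil</li>
  <li class="team">Germany</li>
  <li class="team">Italy</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")          # first matching tag only
all_items = soup.find_all("li")  # list with every matching tag

print(first.text)
print([item.text for item in all_items])
print(first.find_next_sibling("li").text)  # the next <li> at the same level
print(first.find_parent("ul").name)        # the enclosing <ul>
print(soup.find("table"))                  # None: nothing matches
```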

Let's start by importing the library and creating a `BeautifulSoup` object.

In [4]:
from bs4 import BeautifulSoup

In [9]:
# Creating a BeautifulSoup object; we name the parser explicitly to avoid a warning.
soccer = BeautifulSoup(soccer_requests.text, "html.parser") # remember that the field text contains the content of our request

Great! Now we have a `BeautifulSoup` object stored in the `soccer` variable. We can now use the function `find_all()` to retrieve information. For example, let's find the tables contained in the page.

In [10]:
tables = soccer.find_all("table")
tables

<table class="wikitable plainrowheaders sortable" style="text-align:center">
<caption>Players who have won the World Cup
</caption>
<tbody><tr>
<th rowspan="2" scope="col">Player
</th>
<th rowspan="2" scope="col">Team
</th>
<th colspan="2" scope="col">Titles won
</th>
<th rowspan="2" scope="col">Other appearances
</th>
<th class="unsortable" rowspan="2" scope="col">Profile
</th>
<th rowspan="2">Birth
</th>
<th rowspan="2">Death
</th></tr>
<tr>
<th scope="col">Number
</th>
<th scope="col">Years
</th></tr>
<tr>
<th data-sort-value="Pele" scope="row"><a href="/wiki/Pel%C3%A9" title="Pelé">Pelé</a>
</th>
<td align="left"><span style="white-space:nowrap"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="504" data-file-width="720" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Flag_of_Brazil_%281968%E2%80%931992%29.svg/22px-Flag_of_Brazil_%281968%E2%80%931992%29.svg.png" src

Now we have a list of the tables in the web page. Let's scrape the first table, which lists the players who have won the World Cup.

In [21]:
row = []
df = []
for i,entry in enumerate(tables[0].find_all("td")):
    row.append(entry.text)
    if (i+1)%5 == 0:
        df.append(row)
        row = []

df = pd.DataFrame(df, columns = ["team", "n_titles", "years", "other_app", "profile"])
df

Unnamed: 0,team,n_titles,years
0,Brazil\n,3\n,"1958, 1962, 1970\n"
1,1966\n,[fp 1]\n,Brazil\n
2,2\n,"1958, 1962\n",1966\n
3,[fp 2]\n,Brazil\n,2\n
4,"1994, 2002\n","1998, 2006\n",[fp 3]\n
...,...,...,...
736,"2002, 2006\n",[fp 442]\n,Germany\n
737,1\n,2014\n,\n
738,[fp 443]\n,Brazil\n,1\n
739,1994\n,\n,[fp 444]\n
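The loop above assumes that every row contributes exactly five `td` cells. If a row has more or fewer cells, the values can drift into the wrong columns. One possible alternative, sketched below on a small made-up table, is to walk the table row by row (`tr`) and collect the cells of each row together. This is only an illustration under those assumptions, not a drop-in replacement for the real page.

```python
from bs4 import BeautifulSoup

# A made-up table, for illustration only
html = """
<table>
  <tr><th>Player</th><th>Team</th><th>Titles</th></tr>
  <tr><th>Pelé</th><td>Brazil</td><td>3</td></tr>
  <tr><th>Cafu</th><td>Brazil</td><td>2</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser").find("table")

rows = []
for tr in table.find_all("tr"):
    # Collect header (th) and data (td) cells of this row, in document order
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

header, *records = rows
print(header)
print(records)
```

Because each row is handled on its own, a row with a missing cell produces a shorter list instead of shifting all the rows that follow.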


How cool is that? Of course we need some data cleaning.

In [22]:
df = df.replace("\n","", regex=True)
df = df.replace(r"\[[A-Za-z0-9 ]*\]","", regex=True) # raw string, so the backslashes reach the regex engine unchanged
df

Unnamed: 0,team,n_titles,years
0,Brazil,3,"1958, 1962, 1970"
1,1966,,Brazil
2,2,"1958, 1962",1966
3,,Brazil,2
4,"1994, 2002","1998, 2006",
...,...,...,...
736,"2002, 2006",,Germany
737,1,2014,
738,,Brazil,1
739,1994,,
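To see what the two replacements do, here is the same cleaning applied to single strings with the built-in `re` module. The helper name `clean_cell` is made up for this sketch.

```python
import re

# Matches footnote markers such as [fp 1] or [note 2]
FOOTNOTE = re.compile(r"\[[A-Za-z0-9 ]*\]")


def clean_cell(text):
    """Remove newlines and bracketed footnote markers from a cell's text."""
    return FOOTNOTE.sub("", text.replace("\n", ""))


for cell in ["1958, 1962, 1970\n", "[fp 1]\n", "Brazil\n"]:
    print(repr(cell), "->", repr(clean_cell(cell)))
```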


As we can see, Brazil has the highest number of World Cup titles in the world. YEAH, yeah! Don't get upset, Canada would crush us in hockey (I don't know if we even have a team!). We could keep manipulating this dataframe, but let's keep our focus on scraping data. Let's get the names of the players who won the World Cup, which appear in the first column of the first table on the wiki page.

In [23]:
tables[0]

<table class="wikitable plainrowheaders sortable">
<caption>Players who have won the World Cup
</caption>
<tbody><tr>
<th rowspan="2" scope="col">Player
</th>
<th rowspan="2" scope="col">Team
</th>
<th colspan="2" scope="col">Titles won
</th>
<th rowspan="2" scope="col">Other appearances
</th>
<th class="unsortable" rowspan="2" scope="col">Profile
</th></tr>
<tr>
<th scope="col">Number
</th>
<th scope="col">Years
</th></tr>
<tr>
<th data-sort-value="Pele" scope="row"><a href="/wiki/Pel%C3%A9" title="Pelé">Pelé</a>
</th>
<td><span style="white-space:nowrap"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="504" data-file-width="720" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Flag_of_Brazil_%281968%E2%80%931992%29.svg/22px-Flag_of_Brazil_%281968%E2%80%931992%29.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Flag_of_Brazil_%281968%E2%80%931992%29.svg/33px-Flag_of_Brazil_%281968%E2%80%931992%29.svg.png 

In [24]:
players = []
for entry in tables[0].find_all("th"):
    if entry.has_attr("data-sort-value"): # Note the function I'm using here: has_attr(). Inspect the page on your browser to see why
        players.append(entry.text)
        
players = pd.Series(players, name='Player', dtype='object').str.replace("\n","")
players

0                   Pelé
1                Bellini
2                   Cafu
3               Castilho
4                   Didi
             ...        
440                Zetti
441      Zinedine Zidane
442    Ron-Robert Zieler
443                Zinho
444            Dino Zoff
Name: Player, Length: 445, dtype: object
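As an aside, `find_all()` can also filter on the presence of an attribute directly: passing `True` as the attribute value matches any tag that has that attribute. The sketch below compares both approaches on a small made-up table that mimics the structure of the real one.

```python
from bs4 import BeautifulSoup

# A made-up table that mimics the structure of the Wikipedia one
html = """
<table>
  <tr><th scope="col">Player</th><th scope="col">Team</th></tr>
  <tr><th data-sort-value="Pele" scope="row"><a href="/wiki/Pel%C3%A9">Pelé</a>
</th><td>Brazil</td></tr>
  <tr><th data-sort-value="Cafu" scope="row"><a href="/wiki/Cafu">Cafu</a>
</th><td>Brazil</td></tr>
</table>
"""
table = BeautifulSoup(html, "html.parser")

# Approach used above: loop over every <th> and test for the attribute
with_loop = [th.text.strip() for th in table.find_all("th")
             if th.has_attr("data-sort-value")]

# Alternative: let find_all() do the filtering
player_cells = table.find_all("th", attrs={"data-sort-value": True})
with_filter = [th.get_text(strip=True) for th in player_cells]

# The same cells also give us the link to each player's page
links = {th.get_text(strip=True): th.a["href"] for th in player_cells}

print(with_loop)
print(with_filter)
print(links)
```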

But why are we doing this? Isn't it easier to just copy the table directly from the page? Well, if you have only one table, it would be. But say that you want the date of birth of all the players who won the World Cup. This information is not present in the table, so we need to gather it from somewhere else. However, the table does provide us with the link to the wiki page of each one of those players. We just need to go there and gather that information. Doing it manually for hundreds of players would be annoying, right? 

Well, this could take a while since there are hundreds of players, but let's do it for the first 25 players just so we get a taste.

In [25]:
url = "https://en.wikipedia.org"
bday = []
for name in players[0:25]:
    # Build the full URL of the player's page from the link in the table
    player_url = url + tables[0].find("a", string=name)["href"]
    print(player_url)
    player_pg = BeautifulSoup(requests.get(player_url).text, "html.parser")
    try:
        # Most pages mark the date of birth with <span class="bday">
        bday_text = player_pg.find("span", {"class": "bday"}).text
        print(bday_text)
        bday.append(bday_text)
    except AttributeError:
        # find() returned None: fall back to the "Date of birth" row of the infobox
        bday.append(player_pg.find("th", string="Date of birth").next_sibling.text)
    time.sleep(.5)  # pause between requests

print("Finished!")

https://en.wikipedia.org/wiki/Pel%C3%A9
1940-10-23
https://en.wikipedia.org/wiki/Hilderaldo_Bellini
1930-06-07
https://en.wikipedia.org/wiki/Cafu
1970-06-07
https://en.wikipedia.org/wiki/Carlos_Jos%C3%A9_Castilho
1927-11-27
https://en.wikipedia.org/wiki/Didi_(footballer,_born_1928)
1928-10-08
https://en.wikipedia.org/wiki/Djalma_Santos
1929-02-27
https://en.wikipedia.org/wiki/Giovanni_Ferrari
1907-12-06
https://en.wikipedia.org/wiki/Garrincha
1933-10-28
https://en.wikipedia.org/wiki/Gylmar_dos_Santos_Neves
1930-08-22
https://en.wikipedia.org/wiki/Guido_Masetti
1907-11-22
https://en.wikipedia.org/wiki/Mauro_Ramos
1930-08-30
https://en.wikipedia.org/wiki/Giuseppe_Meazza
1910-08-23
https://en.wikipedia.org/wiki/Eraldo_Monzeglio
1906-06-05
https://en.wikipedia.org/wiki/N%C3%ADlton_Santos
1925-05-16
https://en.wikipedia.org/wiki/Daniel_Passarella
1953-05-25
https://en.wikipedia.org/wiki/Pepe_(footballer,_born_1935)
1935-02-25
https://en.wikipedia.org/wiki/Ronaldo_(Brazilian_footballer)
1976
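In the loop we built each player's URL by concatenating the site address with the `href` found in the table. That works here because the links start with `/wiki/...`. A more general option is `urllib.parse.urljoin`, which resolves a link relative to the page it was found on. The links below are examples I picked for illustration.

```python
from urllib.parse import urljoin

# The page where the links were found
base = "https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_winners"

player_url = urljoin(base, "/wiki/Pel%C3%A9")                # path relative to the site root
image_url = urljoin(base, "//upload.wikimedia.org/flag.png") # scheme-relative link
external_url = urljoin(base, "https://www.fifa.com/")        # already absolute: unchanged

print(player_url)
print(image_url)
print(external_url)
```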

Why did I add a `time.sleep` in the code? It is to give some time to the server. We want to be polite and not overload the server with too many requests in a short time. You need to be aware of that. If you are dealing with a small server, you could cause real problems. Besides, you could get blocked. Wikipedia is a very big server, and we aren't requesting that many pages, so we should be fine here. But always keep this in mind. 
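One way to make this habit hard to forget is to put the pause inside a small helper. The sketch below is only an illustration: `fetch_politely` is a made-up name, and the `fetch` and `sleep` parameters are there so the example can run without touching the network (in real use, `fetch` could be a function that calls `requests.get`).

```python
import time


def fetch_politely(urls, fetch, delay=0.5, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        pages[url] = fetch(url)
    return pages


# Offline demonstration: str.upper stands in for a real download function
demo = fetch_politely(["page-1", "page-2", "page-3"], fetch=str.upper, delay=0.1)
print(demo)
```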

Now, let's check the date of birth of the first 25 players. Please note that the previous block of code must successfully finish running first for the birthday table to display properly. 

In [26]:
players_won_world_cup = pd.concat([players[0:25],pd.Series(bday, name="bday", dtype = 'object')], axis=1)
players_won_world_cup

Unnamed: 0,Player,bday
0,Pelé,1940-10-23
1,Bellini,1930-06-07
2,Cafu,1970-06-07
3,Castilho,1927-11-27
4,Didi,1928-10-08
5,Djalma Santos,1929-02-27
6,Giovanni Ferrari,1907-12-06
7,Garrincha,1933-10-28
8,Gilmar,1930-08-22
9,Guido Masetti,1907-11-22


Cool, right? 
One last thing before you start: your browser's developer tools are your ally in navigating a website's structure. 

Finally, let's begin your assignment.