# Web Scraping
## Lecture 05 - 22/09/2023

# What is web scraping?

We have previously worked with datasets stored in files or served through public APIs maintained by companies or organizations. Web scraping is a technique for **scraping** the data you need directly from web pages.

This could be done manually or automatically using Python :)

# What are the advantages?

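* No API rate limits (although a website may still throttle or block heavy traffic)
* Fully in control of what data you collect
* Anonymous (no API key or account needed, although the server still sees your IP address)
* Many websites do not expose a public API!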

# But how?

Python has several libraries that can be used to scrape the web. Some of these include:
* Scrapy
* **BeautifulSoup**
* Selenium
* **Pandas**

Just to name a few...

These libraries allow you to process or browse the web programmatically. Using them requires some understanding of how the web works.

# HyperText Markup Language

HyperText Markup Language (HTML) is the standard markup language used to create web pages.

* Used as markup language for basically every website on the internet.
* Originally standardized by the World Wide Web Consortium (W3C); today maintained by the WHATWG as the HTML Living Standard.
* Current version: HTML5 is supported by most modern internet browsers.



## Resources

* Lots of offline and online material available
* For example, the Mozilla Developer Network: https://developer.mozilla.org
* W3Schools has had a bad reputation but has improved over the years (https://www.w3schools.com/)
* Historical reference: https://www.w3fools.com/
    

## A simple HTML document
The Hello world version of an HTML document is:

```html
<!DOCTYPE html>
<html>

  <!-- The head tag contains header information about the HTML document. -->
  <head>
    <title>A title for the browser window</title> 
  </head>
  
  <!-- The body tag contains the content of the HTML document. -->
  <body>
    <p>Hello world!</p>
  </body>
  
</html>
```

Save this file as `index.html` and open it with your favorite web browser, e.g. `firefox index.html`



## Syntax

* The HTML file consists of tags, denoted by `<tagname>`
* Most HTML elements are marked by a tag pair (start tag and end tag): `<tagname>content</tagname>`
* Some HTML elements have no content (and hence no end tag): `<img>` or `<br>`
* A tag can have attributes, for example: `<tagname attribute1="value1" attribute2="value2">`
* Certain attribute values should be unique within a document, like the value of `id`, but this is not enforced.

<img src="https://codewithfaraz.com/img/a%20comprehensive%20list%20of%20html%20tags%20for%20web%20development.png" />
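To make the tag and attribute syntax concrete, here is a minimal sketch that uses `HTMLParser` from Python's standard library to list every start tag and its attributes. The class name `TagLister` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect (tag name, attributes) for every start tag encountered."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


parser = TagLister()
parser.feed('<p id="intro" class="lead">Hello<br>world <img src="a.png" alt="A"></p>')

for name, attributes in parser.tags:
    print(name, attributes)
```

Note that `<br>` and `<img>` show up as start tags even though they have no end tag.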

## Comments
Comments are enclosed in the `<!-- This is a comment -->` tag:

```html
<!-- This is a multiline comment. 
It will not be rendered. -->
```

-------------------------------
## Formatting

### Headings

In [7]:
%%html

<h1>I am a header</h1>
<h2>I am a sub-header</h2>

## New lines
The HTML code for a newline is `<br>`:

In [8]:
%%html
Hello world<br>New lines   

and spaces        do not matter. Use the br tag instead

## Div and Span
The `<div>` tag is a block-level container used to group elements so that styles or JavaScript functionality can be applied to them as a unit. `<span>` is its inline counterpart, typically used to mark up a piece of text within a line.


In [3]:
%%html

<div>
    <h1>This is a header</h1>
    <br />
    <p>
        This is some <span style="color:blue;">blue</span> text.
    </p>
</div>

## Special characters

HTML uses special codes (entities) to encode special characters, for example mathematical, technical and currency symbols. Entities are also used in place of reserved characters like `<` or `&`.

Full list: https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references

Examples:


| Symbol        | HTML entity   |
| ------------- |:-------------:|
| Å             | `&Aring;`     |
| å             | `&aring;`     |
| Ø             | `&Oslash;`    |
| ø             | `&oslash;`    |
| Æ             | `&AElig;`     |
| æ             | `&aelig;`     |
| '             | `&#39;`       |
| "             | `&quot;`      |
| &amp;         | `&amp;`       |

In [3]:
%%html
<p>
&Aring;s, S&oslash;r-Tr&oslash;ndelag
</p>
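When scraping, you often need to go the other way and turn entities back into plain characters. A small sketch using the standard library's `html` module (the sample strings are only for illustration):

```python
import html

# Decode entities into the characters they stand for
encoded = "&Aring;s, S&oslash;r-Tr&oslash;ndelag"
decoded = html.unescape(encoded)
print(decoded)  # Ås, Sør-Trøndelag

# Encode reserved characters so they can be placed safely inside HTML
escaped = html.escape("1 < 2 & 3 > 2")
print(escaped)  # 1 &lt; 2 &amp; 3 &gt; 2
```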

## Special characters alternative:

Save your file with UTF-8 encoding, and add the character encoding to your HTML document:

```html
<head>
  <meta charset="UTF-8">
</head>
```
Then you can write (almost all) special characters directly in the document.

## Paragraphs

In [4]:
%%html
<p>
This is a paragraph.
</p>
<p>
This is another paragraph.
</p>

## Italic text, bold text and links

In [5]:
%%html
<b>Bold text</b>

<br>
<i>Italic text</i>

<br>
<em>Emphasized text</em>

<br>
<a href="http://github.com">This is a link</a>

-------------------------------
## Tables

In [3]:
%%html
<table>
 <tr>
    <th>Name</th>
    <th>Course</th> 
    <th>Points</th>
  </tr>
  <tr>
    <td>John</td>
    <td>GRA4157</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>George</td>
    <td>GRA4157</td> 
    <td>94</td>
  </tr>
</table>

Name,Course,Points
John,GRA4157,50
George,GRA4157,94


## Links

In [6]:
%%html

<a href="https://www.bi.no" target="_blank">Link to BI</a>

## Images

In [2]:
%%html
<img src="figs/Rhinoceros.png" alt="D&uuml;rer's Rhinoceros">D&uuml;rer's Rhinoceros

## Styling

Every HTML document has a default style (background color white, text color black). The default style can be changed with the *style attribute*.

```html
<tagname style="property:value;">
```

Multiple properties can be set with:
```html
<tagname style="property1:value; property2:value">
```

Some valid properties:
* `width`
* `height`
* `color`
* `background-color`
* `font-family`
* `font-size`
* `text-align`


### Examples

In [3]:
%%html
<img src="figs/Rhinoceros.png" alt="D&uuml;rer&#39;s Rhinoceros" style="width:100px;">D&uuml;rer&#39;s Rhinoceros

In [6]:
%%html
<p style="color:blue; background-color:rgb(255,0,255);">
Some colorful text.
</p>

# Processing HTML using Python
An HTML document is a plain text file adhering to a specific set of standards. As mentioned before, several libraries allow us to process these documents, among which BeautifulSoup is probably the most popular.

# Installation

With conda, you can install the required dependencies with:

```bash
conda install bs4 requests lxml
```

We also install lxml as a *better* HTML processing engine.


# Basic usage of BeautifulSoup

First, we import the `BeautifulSoup` class:

In [7]:
from bs4 import BeautifulSoup

We load the HTML source file from disk and pass the source to the BeautifulSoup constructor. We choose the "lxml" parser, which is faster than the default parser (`html.parser`) used by BeautifulSoup:

In [32]:
with open("data/list.html") as src:  # the file is closed once parsing is done
    soup = BeautifulSoup(src, 'lxml')
print(soup)

<!DOCTYPE html>
<html>
<body>
<h2>An Unordered HTML List</h2>
<ul id="unordered_list" style="color:#069">
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>
<h2>An Ordered HTML List</h2>
<ol id="ordered_list" style="color:#069">
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ol>
</body>
</html>
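If you do not have `data/list.html` at hand, the same kind of object can be built from a string. The sketch below assumes a shortened copy of the document above and uses Python's built-in `html.parser` instead of `lxml`, so it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# A shortened copy of the list document, embedded as a string
html_source = """
<!DOCTYPE html>
<html>
<body>
<h2>An Unordered HTML List</h2>
<ul id="unordered_list" style="color:#069">
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>
</body>
</html>
"""

# "html.parser" ships with Python; swap in "lxml" if it is installed
soup = BeautifulSoup(html_source, "html.parser")

print(soup.h2.get_text())  # text of the first <h2> tag
print(soup.ul["id"])       # id attribute of the first <ul> tag
```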



### Finding tags by name

The `soup` object now contains the full HTML document. We can find the first occurring tag with a specific name with the `find` function. Let's find the first un-ordered list tag:

In [37]:
ulist = soup.find("ul")

The result contains all tags contained in the **first** matched tag:

In [36]:
ulist

<ul id="unordered_list" style="color:#069">
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>

The `find_all` function returns **all** tags that match the given tag name. We can use it to get a list of all list items:

In [20]:
items = ulist.find_all("li")
items

[<li>Coffee</li>, <li>Tea</li>, <li>Milk</li>]

Finally, we can loop over all items and extract their content with the `get_text` function:

In [33]:
for item in items:
    print(item.get_text())

Coffee
Tea
Milk

Note that `find_all` is **recursive** by default. This means that we could call it on the full `soup` document to get the items of both the ordered and un-ordered lists:

In [24]:
soup.find_all("li", recursive=True)

[<li>Coffee</li>,
 <li>Tea</li>,
 <li>Milk</li>,
 <li>Coffee</li>,
 <li>Tea</li>,
 <li>Milk</li>]
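To see what the `recursive` flag changes, here is a small sketch on a made-up document (parsed with the built-in `html.parser` so no extra install is needed). With `recursive=False`, only the direct children of the tag are searched.

```python
from bs4 import BeautifulSoup

html_source = """
<body>
<ul><li>Coffee</li><li>Tea</li></ul>
<ol><li>Milk</li></ol>
</body>
"""
soup = BeautifulSoup(html_source, "html.parser")

# Default: search all descendants of <body>
all_items = soup.body.find_all("li")

# Only direct children of <body>: the <li> tags sit one level deeper
direct_items = soup.body.find_all("li", recursive=False)

# The lists themselves are direct children of <body>
direct_lists = soup.body.find_all(["ul", "ol"], recursive=False)

print(len(all_items), len(direct_items), len(direct_lists))
```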

### Finding tags by attributes

Sometimes the easiest way to find a tag is by its attributes. In our example, each list has an `id` attribute that uniquely identifies it, and both lists share the same `style` attribute. We can use the `find*` methods to search by attribute as well, here by `style`:


In [13]:
soup.find_all(attrs={"style":"color:#069"})

[<ul id="unordered_list" style="color:#069">
 <li>Coffee</li>
 <li>Tea</li>
 <li>Milk</li>
 </ul>,
 <ol id="ordered_list" style="color:#069">
 <li>Coffee</li>
 <li>Tea</li>
 <li>Milk</li>
 </ol>]
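Searching by `id` works the same way. The sketch below uses a made-up two-list document and shows two equivalent spellings: a keyword argument and the `attrs` dictionary.

```python
from bs4 import BeautifulSoup

html_source = """
<ul id="unordered_list" style="color:#069"><li>Coffee</li></ul>
<ol id="ordered_list" style="color:#069"><li>Tea</li></ol>
"""
soup = BeautifulSoup(html_source, "html.parser")

# Keyword arguments are matched against attribute names
ordered = soup.find(id="ordered_list")

# The same search, restricted to <ol> tags and using the attrs dictionary
same = soup.find("ol", attrs={"id": "ordered_list"})

# Both lists share the style attribute, so find_all returns two tags
styled = soup.find_all(style="color:#069")

print(ordered.name, same["id"], len(styled))
```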

### Accessing attributes

The `ul` tag also contains a `style` attribute. Any bs4 tag behaves like a dictionary with attribute names as keys and attribute values as values:

In [20]:
ulist.attrs

{'id': 'unordered_list', 'style': 'color:#069'}

In [21]:
ulist["style"]

'color:#069'
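Two practical follow-ups, sketched on a made-up `<p>` tag: `tag.get(...)` returns `None` for a missing attribute instead of raising `KeyError`, and an inline `style` string can be split into individual properties. The helper `parse_style` is hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p style="color:blue; background-color:rgb(255,0,255);">Hi</p>',
    "html.parser",
)
tag = soup.find("p")


def parse_style(style):
    """Split an inline style string into a {property: value} dictionary."""
    properties = {}
    for declaration in style.split(";"):
        if ":" in declaration:
            name, value = declaration.split(":", 1)
            properties[name.strip()] = value.strip()
    return properties


missing = tag.get("class")        # None: the tag has no class attribute
style = parse_style(tag["style"])

print(missing)
print(style)
```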

# Downloading a table from Wikipedia

We aim to get a list of countries sorted by their population size:
https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

First, let's import the required modules:

In [38]:
import requests
from bs4 import BeautifulSoup
import re
import dateutil

This time, we load the html source directly from a website using the requests module:

In [39]:
result = requests.get("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")

The web server returns a status code to indicate whether the request was successful.
We use that status code to check if the page was successfully loaded:

In [40]:
assert result.status_code==200  
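As an alternative to the `assert`, `requests` offers `raise_for_status()`, which raises an exception for 4xx and 5xx responses. The sketch below wraps the download in a hypothetical helper `fetch_html` (the name and the 10 second timeout are assumptions) and then illustrates the error path offline with a hand-built `Response` object, so nothing is downloaded when the block runs.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    return response.content


# Offline illustration: a manually constructed response with status 404
fake = requests.Response()
fake.status_code = 404

try:
    fake.raise_for_status()
except requests.HTTPError:
    outcome = "failed"
else:
    outcome = "ok"

print(outcome)
```

Setting a timeout is a good habit, because by default `requests.get` waits indefinitely for a server that never answers.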

Next, we extract the HTML source and instantiate BeautifulSoup:

In [56]:
src = result.content
document = BeautifulSoup(src, 'lxml')

By looking at the document, we can see that we are interested in the first table. So we use `find`:

In [57]:
table = document.find("table")
table

<table class="wikitable sortable" style="text-align:right">
<tbody><tr class="is-sticky">
<th></th>
<th style="width:17em"><a href="/wiki/List_of_sovereign_states" title="List of sovereign states">Country</a> / <a href="/wiki/Dependent_territory" title="Dependent territory">Dependency</a></th>
<th>Population</th>
<th style="width:2em">% of<br/>world</th>
<th>Date</th>
<th><span class="nowrap">Source (official or from</span><br/>the <a href="/wiki/United_Nations" title="United Nations">United Nations</a>)</th>
<th class="unsortable">
</th></tr>
<tr>
<th>–
</th>
<td style="text-align:left"><b>World</b>
</td>
<td><b>8,061,221,000</b></td>
<td><b>100%</b></td>
<td><b><span data-sort-value="000000002023-09-21-0000" style="white-space:nowrap">21 Sep 2023</span></b>
</td>
<td style="text-align:left"><b>UN projection</b><sup class="reference" id="cite_ref-unpop_4-0"><a href="#cite_note-unpop-4">[3]</a></sup></td>
<td>
</td></tr>
<tr>
<th>1
</th>
<td style="text-align:left"><span class="flagico

If you are not familiar with HTML tables, read this example first: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/table

In [58]:
rows = table.find_all("tr")  # Note: this works because find_all is recursive by default


In [63]:
for row in rows[2:-1]:
    cells = row.find_all(["td", "th"])
    
    cells_text = [cell.get_text(strip=True) for cell in cells]
    print(cells_text)
    (rank, country, population, percentage, updated_at, source, comment) = cells_text  
    print(f'{rank}, {country}, {population}')
    

['1', 'China', '1,411,750,000', '17.5%', '31 Dec 2022', 'Official estimate[4]', '[b]']
1, China, 1,411,750,000
['2', 'India', '1,392,329,000', '17.3%', '1 Mar 2023', 'Official projection[5]', '[c]']
2, India, 1,392,329,000
['3', 'United States', '335,461,000', '4.2%', '21 Sep 2023', 'National population clock[7]', '[d]']
3, United States, 335,461,000
['4', 'Indonesia', '277,749,853', '3.4%', '31 Dec 2022', 'Official estimate[8]', '']
4, Indonesia, 277,749,853
['5', 'Pakistan', '241,499,431', '3.0%', '1 Mar 2023', '2023 census result[9]', '[e]']
5, Pakistan, 241,499,431
['6', 'Nigeria', '216,783,400', '2.7%', '21 Mar 2022', 'Official projection[10]', '']
6, Nigeria, 216,783,400
['7', 'Brazil', '203,062,512', '2.5%', '1 Aug 2022', '2022 census result[11]', '']
7, Brazil, 203,062,512
['8', 'Bangladesh', '169,828,911', '2.1%', '14 Jun 2022', '2022 census result[12]', '']
8, Bangladesh, 169,828,911
['9', 'Russia', '146,424,729', '1.8%', '1 Jan 2023', 'Official estimate[13]', '[f]']
9, Russi

ValueError: not enough values to unpack (expected 7, got 6)
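The `ValueError` shows that not every row has seven cells, so a fixed seven-way unpacking is fragile. One possible workaround, sketched here on a tiny made-up table (the second row and the name `EXPECTED_COLUMNS` are invented for illustration), is to pad short rows before unpacking:

```python
from bs4 import BeautifulSoup

# Made-up sample: the second row is missing its last cell
html_source = """
<table>
<tr><th>1</th><td>China</td><td>1,411,750,000</td><td>17.5%</td>
    <td>31 Dec 2022</td><td>Official estimate</td><td>[b]</td></tr>
<tr><th>2</th><td>Examplia</td><td>1,000</td><td>0.0%</td>
    <td>1 Jan 2023</td><td>Made-up row</td></tr>
</table>
"""
EXPECTED_COLUMNS = 7
table = BeautifulSoup(html_source, "html.parser").find("table")

records = []
for row in table.find_all("tr"):
    cells_text = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
    # Pad short rows with empty strings so the unpacking below never fails
    cells_text += [""] * (EXPECTED_COLUMNS - len(cells_text))
    (rank, country, population, percentage,
     updated_at, source, comment) = cells_text[:EXPECTED_COLUMNS]
    # Remove thousands separators before converting to an integer
    records.append((rank, country, int(population.replace(",", ""))))

print(records)
```

Padding assumes that the missing cell is the last one. If a cell is missing in the middle of a row (for example because of a `rowspan`), the columns would shift and need separate handling.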

**Attention**: Beautiful Soup does not execute JavaScript. This means that the HTML shown in the Google Chrome inspector (after scripts have run) might look different from the original source code that `requests` downloads.

# Another example of downloading a Wikipedia table 

Let's consider another table in a Wikipedia page. This page has a lot more tables, so one challenge will be to pick the right table:

https://en.wikipedia.org/wiki/Tiger_Woods


We are interested in extracting these two tables:

![Target Wikipedia tables](figs/wiki_tables.png)

**Exercise**: 

1) Locate the tag with `id="The_Players_Championship"` using `title = document.find(id="The_Players_Championship")`.

2) Find all tables below the tag from 1) with `title.find_all_next('table')`.

3) For each `table` in `tables`, look up the first header cell (`th`) with `table.find('th')` to identify the "Tournament" header. Remember to use `get_text(strip=True)`.

4) Save all tables with the header "Tournament" into a list `tournament_tables`. Check the length of the list and reduce it if needed.

5) Bonus: Print the information in the two tables of interest to the terminal.

We begin by downloading the webpage and instantiating the BeautifulSoup object:

In [64]:
result = requests.get("https://en.wikipedia.org/wiki/Tiger_Woods")
src = result.content
document = BeautifulSoup(src, 'lxml')

This page contains a lot of tables that lack distinguishing attributes, which makes it difficult to find our table of interest. Further, the same headings are used for multiple tables, so a table cannot be identified by its headings alone:

In [65]:
len(document.find_all("table"))

62

Therefore, we choose another strategy. First, we extract the tag that defines the header just before our tables of interest. That header tag has a unique identifier attribute `id="The_Players_Championship"`. Then we use the `find_all_next` function in BeautifulSoup to extract all following table tags:

In [66]:
title = document.find(id="The_Players_Championship")
tables = title.find_all_next("table")

Now, our tables of interest are the first two tables with the "Tournament" heading. We write a small helper function (a generator, see https://wiki.python.org/moin/Generators) that yields every table whose first header cell matches a given heading:

In [67]:
def find_table_with_heading(tables, heading):
    for table in tables:
        header = table.find("th")
        # Skip tables that have no header cell at all
        if header is not None and header.get_text(strip=True) == heading:
            yield table
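Because the helper is a generator, the matches are produced lazily, and `itertools.islice` can stop after the first two without building a full list. The sketch below uses four tiny made-up tables instead of the Wikipedia page and a variant of the helper that skips tables without a header cell.

```python
from itertools import islice

from bs4 import BeautifulSoup

# Made-up stand-ins for the tables on the real page
html_source = """
<table><tr><th>Tournament</th><th>1997</th></tr></table>
<table><tr><th>Year</th></tr></table>
<table><tr><th>Tournament</th><th>2010</th></tr></table>
<table><tr><th>Tournament</th><th>1999</th></tr></table>
"""
tables = BeautifulSoup(html_source, "html.parser").find_all("table")


def find_table_with_heading(tables, heading):
    """Yield every table whose first header cell equals `heading`."""
    for table in tables:
        header = table.find("th")
        if header is not None and header.get_text(strip=True) == heading:
            yield table


# Take only the first two matches from the generator
first_two = list(islice(find_table_with_heading(tables, "Tournament"), 2))
years = [table.find_all("th")[1].get_text(strip=True) for table in first_two]
print(years)
```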

Next, we can extract the table rows and columns as usual. We are only interested in the first two matching tables, but note that the loop below prints every table with a "Tournament" heading, so the output continues past them:

In [79]:
tournament_tables = find_table_with_heading(tables, "Tournament")

for table in tournament_tables:
    for row in table.find_all("tr"):
        cells = row.find_all(["th", "td"])
        print([cell.get_text(strip=True) for cell in cells])
        

['Tournament', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009']
['The Players Championship', 'T31', 'T35', 'T10', '2', '1', 'T14', 'T11', 'T16', 'T53', 'T22', 'T37', '', '8']
['Tournament', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']
['The Players Championship', 'WD', 'WD', 'T40', '1', '', 'T69', '', '', 'T11', 'T30']
['Tournament', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']
['Championship', '1', 'T5', 'NT1', '1', '1', '9', '1', '1', '1', '5', 'T9', '', 'T10', 'WD', '1', 'T25', '', '', '', '', 'T10']
['Match Play', 'QF', '2', '', 'R64', '1', '1', 'R32', 'R16', 'R16', '1', 'R32', '', 'R64', 'R32', 'R64', '', '', '', '', '', 'QF']
['Invitational', '1', '1', '1', '4', 'T4', 'T2', '1', '1', '1', '', '1', 'T78', 'T37', 'T8', '1', 'WD', '', '', '', 'T31', '']
['Champions', '', 'T

In [89]:
wb[df.columns[6]]

154      10071
17     1012847
63      106166
65      106526
66      110347
        ...   
196        981
59       99269
128       9947
153       9996
21           —
Name: (United Nations[15], Estimate), Length: 214, dtype: object

In [90]:
wb[df.columns[6]] = pd.to_numeric(wb[df.columns[6]], errors='coerce')

In [91]:
wb[df.columns[6]]

154      10071.0
17     1012847.0
63      106166.0
65      106526.0
66      110347.0
         ...    
196        981.0
59       99269.0
128       9947.0
153       9996.0
21           NaN
Name: (United Nations[15], Estimate), Length: 214, dtype: float64
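The effect of `errors='coerce'` is easiest to see on a tiny example. The sketch below uses a made-up Series in which `"n/a"` stands in for the dash placeholder of the table above; any value that cannot be parsed as a number becomes `NaN`.

```python
import pandas as pd

# Strings as they might come out of a scraped table; "n/a" is a placeholder
raw = pd.Series(["10071", "1012847", "n/a", "981"])

# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")

missing_count = int(numeric.isna().sum())
total = numeric.sum()  # NaN values are skipped by default

print(numeric.dtype, missing_count, total)
```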

In [92]:
wb = wb.sort_values(by=df.columns[6], ascending=False)
wb

Unnamed: 0_level_0,Country/Territory,UN region,IMF[1][13],IMF[1][13],World Bank[14],World Bank[14],United Nations[15],United Nations[15]
Unnamed: 0_level_1,Country/Territory,UN region,Forecast,Year,Estimate,Year,Estimate,Year
0,World,—,105568776,2023,100562011,2022,96698005.0,2021
1,United States,Americas,26854599,2023,25462700,2022,23315081.0,2021
2,China,Asia,19373586,[n 1]2023,17963171,[n 3]2022,17734131.0,[n 1]2021
3,Japan,Asia,4409738,2023,4231141,2022,4940878.0,2021
4,Germany,Europe,4308854,2023,4072192,2022,4259935.0,2021
...,...,...,...,...,...,...,...,...
208,Palau,Oceania,262,2023,—,—,218.0,2021
211,Nauru,Oceania,151,2023,151,2022,155.0,2021
212,Montserrat,Americas,—,—,—,—,72.0,2021
213,Tuvalu,Oceania,65,2023,60,2022,60.0,2021


### Let's say we want to compute the sum of the GDP in Asia and compare it to the Americas. We use the IMF estimate for this:

In [36]:
region = df.columns[1]
est = df.columns[2]
A = df[[region,est]]
A

Unnamed: 0_level_0,UN Region,IMF[1][13]
Unnamed: 0_level_1,UN Region,Estimate
0,—,93863851
1,Americas,25346805
2,Asia,19911593
3,Asia,4912147
4,Europe,4256540
...,...,...
212,Oceania,244
213,Oceania,216
214,Oceania,134
215,Americas,—


In [37]:
americas = A[A[region] == 'Americas']
americas.head()

Unnamed: 0_level_0,UN Region,IMF[1][13]
Unnamed: 0_level_1,UN Region,Estimate
1,Americas,25346805
8,Americas,2221218
10,Americas,1833274
16,Americas,1322740
26,Americas,564277


In [38]:
asia = A[A[region] == 'Asia']

In [39]:
asia_estimation = pd.to_numeric(asia[est], errors='coerce')
americas_estimation = pd.to_numeric(americas[est], errors='coerce')

In [101]:
print('Asia \t', asia_estimation.sum())
print('Americas', americas_estimation.sum())

Asia 	 39851830.0
Americas 33147336.0
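The same comparison can be written in one step with `groupby`, which sums every region at once. This is a sketch on a small made-up DataFrame with flat column names (`region`, `estimate`), not the multi-level columns of the scraped table.

```python
import pandas as pd

# Made-up values; "n/a" stands in for a missing estimate
gdp = pd.DataFrame({
    "region": ["Asia", "Asia", "Americas", "Americas", "Europe"],
    "estimate": ["100", "n/a", "250", "50", "75"],
})

# Convert to numbers first; unparseable entries become NaN
gdp["estimate"] = pd.to_numeric(gdp["estimate"], errors="coerce")

# Sum per region; NaN values are skipped
totals = gdp.groupby("region")["estimate"].sum()

print(totals.loc[["Asia", "Americas"]])
```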


In [44]:
sum(asia_estimation)

nan

In [46]:
from numpy import *

In [47]:
sum(asia_estimation)

39851830.0
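The two results differ because Python's built-in `sum` simply adds the values one by one, so a single `NaN` turns the whole total into `NaN`. After `from numpy import *`, the name `sum` refers to `numpy.sum`, which for a pandas Series ends up skipping `NaN` values. A less surprising way to get the same behaviour is to call the Series method directly, as this sketch with made-up numbers shows:

```python
import math

import pandas as pd

values = pd.Series([1.0, float("nan"), 2.0])

builtin_total = sum(values)   # plain Python addition: NaN propagates
pandas_total = values.sum()   # skipna=True by default, so NaN is ignored

print(builtin_total, pandas_total)
```

Calling `values.sum()` explicitly also avoids the star import, which silently replaces several built-in names.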

# Web Scraping Live Demo

Let's scrape http://books.toscrape.com/catalogue/page-1.html
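
As a starting point for the demo, here is a sketch that extracts titles and prices from a hard-coded snippet. The markup is an assumption about how the catalogue page is structured (one `<article class="product_pod">` per book, the full title in the link's `title` attribute, the price in `<p class="price_color">`), and the books themselves are made up. Check the real page source in your browser before relying on it.

```python
from bs4 import BeautifulSoup

# Assumed, simplified structure of one catalogue page (made-up books)
sample_page = """
<article class="product_pod">
  <h3><a href="book-one/index.html" title="Example Book One">Example Book ...</a></h3>
  <p class="price_color">£10.00</p>
</article>
<article class="product_pod">
  <h3><a href="book-two/index.html" title="Example Book Two">Example Book ...</a></h3>
  <p class="price_color">£12.50</p>
</article>
"""
soup = BeautifulSoup(sample_page, "html.parser")

books = []
for article in soup.find_all("article", class_="product_pod"):
    link = article.find("h3").find("a")
    price_text = article.find("p", class_="price_color").get_text(strip=True)
    # The link text is truncated, so read the full title from the attribute
    books.append((link["title"], float(price_text.lstrip("£"))))

print(books)
```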