# Scraping the web with BeautifulSoup

We are going to get information out of websites using `requests` and `beautifulsoup`.

## Installation

With conda, you can install the required dependencies with:

```bash
conda install beautifulsoup4 requests
```

or, with pip:

```bash
python3 -m pip install beautifulsoup4 requests
```


## Basic usage of BeautifulSoup

First, we import the `BeautifulSoup` class:

In [2]:
from bs4 import BeautifulSoup

We load the HTML source file from disk and pass its contents to the `BeautifulSoup` constructor.

In [3]:
with open("list.html") as f:
    html = f.read()
    document = BeautifulSoup(html, "html.parser")
print(html)

<!doctype html>
<html>
  <head>
    <title>Sample HTML document</title>
  </head>
  <body>
    <h2>An Unordered HTML List</h2>

    <ul id="unordered_list" style="color: #f0e">
      <li>Coffee</li>
      <li>Tea</li>
      <li>Milk</li>
    </ul>

    <h2>An Ordered HTML List</h2>

    <ol id="ordered_list" style="color: rgb(20, 200, 100)">
      <li>First</li>
      <li>Second</li>
      <li>Third</li>
    </ol>
  </body>
</html>


In [4]:
from IPython.display import HTML

HTML(html)

### Finding tags by name

The `document` object now contains the full HTML document. We can find the first occurring tag with a specific name using the `find` function. Let's find the first unordered list tag:

In [5]:
ulist = document.find("ul")

The result contains all tags contained in the matched tag:

In [6]:
ulist

<ul id="unordered_list" style="color: #f0e">
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>

The `find_all` function returns **all** tags that match the given tag name. We can use it to get a list of all list items:

In [7]:
items = ulist.find_all("li")
items

[<li>Coffee</li>, <li>Tea</li>, <li>Milk</li>]

Finally, we can loop over all items and extract their content with the `get_text` function:

In [7]:
for item in items:
    print(repr(item.get_text()))

'Coffee'
'Tea'
'Milk'


Because whitespace is usually not meaningful in HTML,
it is often useful to strip it when you are getting the content of a tag.
You can do this with `strip=True`:

In [8]:
for item in items:
    print(repr(item.get_text(strip=True)))

'Coffee'
'Tea'
'Milk'
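The list items in `list.html` have no surrounding whitespace, so both loops print the same thing. As a minimal sketch, the made-up `snippet` string below (not part of the lecture files) shows what `strip=True` changes when the text of a tag is padded with spaces and newlines:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with whitespace around the text of each item
snippet = "<ul><li>\n   Coffee\n</li><li> Tea </li></ul>"
soup = BeautifulSoup(snippet, "html.parser")

# Without strip, the surrounding whitespace is part of the text
raw = [li.get_text() for li in soup.find_all("li")]
# With strip=True, leading and trailing whitespace is removed
stripped = [li.get_text(strip=True) for li in soup.find_all("li")]

print(raw)       # ['\n   Coffee\n', ' Tea ']
print(stripped)  # ['Coffee', 'Tea']
```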


Note that `find_all` is **recursive** by default. This means that we could call it on the full `document` to get the items
of both the ordered and unordered lists:

In [9]:
document.find_all("li")

[<li>Coffee</li>,
 <li>Tea</li>,
 <li>Milk</li>,
 <li>First</li>,
 <li>Second</li>,
 <li>Third</li>]

In [10]:
document.find_all("li", recursive=False)

[]

In [11]:
ulist.find_all("li", recursive=False)

[<li>Coffee</li>, <li>Tea</li>, <li>Milk</li>]

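With `recursive=False`, only the direct children of a tag are searched. The `document` itself has no `li` tags as direct children, so the result is empty, whereas all three items are direct children of `ulist`. The default recursive search finds `li` tags at any depth: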

In [12]:
document.find_all("li")

[<li>Coffee</li>,
 <li>Tea</li>,
 <li>Milk</li>,
 <li>First</li>,
 <li>Second</li>,
 <li>Third</li>]
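The difference between the two modes is easiest to see with nested lists. The following sketch uses a made-up `nested` string (an assumption for illustration, not one of the lecture files):

```python
from bs4 import BeautifulSoup

# Hypothetical nested list: item A contains its own sub-list
nested = "<ul id='outer'><li>A<ul><li>A1</li></ul></li><li>B</li></ul>"
soup = BeautifulSoup(nested, "html.parser")
outer = soup.find("ul")

# Recursive (default): every li below the outer list, at any depth
all_items = outer.find_all("li")
# Non-recursive: only li tags that are direct children of the outer list
direct_items = outer.find_all("li", recursive=False)

print(len(all_items))     # 3
print(len(direct_items))  # 2
```

Non-recursive searches are handy when a table or list contains nested tables or lists that should not be mixed into the result.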

### Finding tags by attributes

Sometimes the easiest way to find a tag is by one of its attributes. In our example, both lists have an `id` attribute that uniquely identifies them. We can use the `find*` methods to search by attribute as well:


In [13]:
document.find(attrs={"id": "unordered_list"})

<ul id="unordered_list" style="color: #f0e">
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>
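Beautiful Soup also accepts most attributes directly as keyword arguments, and `class_` (with a trailing underscore, because `class` is a Python keyword) for CSS classes. A small sketch on a made-up snippet; the `class` values here are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on list.html
sample = """
<ul id="unordered_list" class="menu drinks"><li>Coffee</li></ul>
<ol id="ordered_list"><li>First</li></ol>
"""
soup = BeautifulSoup(sample, "html.parser")

# Three ways of searching by attribute
by_attrs = soup.find(attrs={"id": "ordered_list"})  # explicit dictionary
by_keyword = soup.find(id="ordered_list")           # keyword shorthand
by_class = soup.find("ul", class_="drinks")         # matches any one of the classes

print(by_attrs is by_keyword)  # True
print(by_class["id"])          # unordered_list
```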

### Accessing attributes

The `ul` tag also contains a `style` attribute. Any bs4 tag behaves like a dictionary with attribute names as keys and attribute values as values:

In [14]:
ulist.attrs

{'id': 'unordered_list', 'style': 'color: #f0e'}

In [15]:
ulist["style"]

'color: #f0e'
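Indexing a tag with a missing attribute raises a `KeyError`, just like a dictionary. The sketch below, on a made-up tag, shows the dictionary-style `get` method as a safer alternative, and that the `class` attribute is returned as a list because it can hold several values:

```python
from bs4 import BeautifulSoup

# Hypothetical tag without a style attribute
tag = BeautifulSoup(
    '<ul id="drinks" class="menu compact"><li>Tea</li></ul>', "html.parser"
).ul

print(tag["id"])                     # drinks
print(tag.get("style"))              # None, no KeyError
print(tag.get("style", "no style"))  # no style
print(tag["class"])                  # ['menu', 'compact']
```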

## Downloading a table from Wikipedia

We aim to get a list of countries sorted by their population size:
https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

First, let's import the required modules:

In [16]:
import re

import dateutil.parser
import requests
from bs4 import BeautifulSoup

This time, we load the HTML directly from a website using the `requests` module:

In [17]:
url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

r = requests.get(url)
url

'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'

The web server returns a status code to indicate whether the request was successful.
We use that status code to check that the page was loaded successfully:

In [18]:
assert r.status_code == 200
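An `assert` is fine in a notebook, but in a script it is often more convenient to let `requests` raise an exception for error responses. A minimal sketch, where the helper name `fetch_html` and the default timeout of 10 seconds are our own choices rather than anything prescribed by the lecture:

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML text (hypothetical helper)."""
    # timeout makes the call fail instead of waiting forever on a dead server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with the `url` defined above would then either return the page source or fail with a descriptive `requests.HTTPError`.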

Next, we extract the HTML source and instantiate BeautifulSoup:

In [19]:
html = r.text
document = BeautifulSoup(html, "html.parser")

By looking at the document, we can see that we are interested in the first table with class `wikitable`, so we use `find`:

In [20]:
table = document.find("table", class_="wikitable")

If you are not familiar with HTML tables, read this example first: https://www.w3schools.com/html/tryit.asp?filename=tryhtml_table_intro

In [21]:
print(str(table)[:1024])

<table class="wikitable sortable sticky-header sort-under mw-datatable col2left col6left" style="text-align:right">
<caption>List of countries and territories by total population
</caption>
<tbody><tr>
<th>
</th>
<th>Location
</th>
<th>Population
</th>
<th style="width:2em">% of<br/>world
</th>
<th>Date
</th>
<th><span class="nowrap">Source (official or from</span><br/>the <a href="/wiki/United_Nations" title="United Nations">United Nations</a>)
</th>
<th class="unsortable">Notes
</th></tr>
<tr>
<td>-
</td>
<td><b><span class="flagicon" style="padding-left:25px;"> </span>World</b>
</td>
<td>8,119,000,000</td>
<td><div class="center" style="width:auto; margin-left:auto; margin-right:auto;">100%</div></td>
<td><span data-sort-value="000000002024-07-01-0000" style="white-space:nowrap">1 Jul 2024</span>
</td>
<td>UN projection<sup class="reference" id="cite_ref-UNFPA_1-1"><a href="#cite_note-UNFPA-1"><span class="cite-bracket">[</span>1<span class="cite-bracket">]</span></a></sup><sup clas

At this point, it is a good idea to programmatically check that the table contains the correct header:

In [22]:
header = " ".join([th.get_text(strip=True) for th in table.find_all("th")])
assert "Population" in header
header

' Location Population % ofworld Date Source (official or fromtheUnited Nations) Notes'

### Exercise

Extract the information from the table:

- get the rows
- find column names
- get sensible data from each cell
- parse numbers/dates where they show up
  

In [23]:
rows = table.find_all("tr")

In [24]:
rows[0]

<tr>
<th>
</th>
<th>Location
</th>
<th>Population
</th>
<th style="width:2em">% of<br/>world
</th>
<th>Date
</th>
<th><span class="nowrap">Source (official or from</span><br/>the <a href="/wiki/United_Nations" title="United Nations">United Nations</a>)
</th>
<th class="unsortable">Notes
</th></tr>

In [25]:
column_names = [th.get_text(strip=True) for th in rows[0].find_all("th")]
column_names

['',
 'Location',
 'Population',
 '% ofworld',
 'Date',
 'Source (official or fromtheUnited Nations)',
 'Notes']
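Some column names come out glued together (`'% ofworld'`) because `strip=True` strips each text fragment and joins them with an empty string. Passing a separator as the first argument of `get_text` keeps the words apart. A minimal sketch on one header cell copied from the table above:

```python
from bs4 import BeautifulSoup

# One header cell from the table, containing a <br/> line break
th = BeautifulSoup('<th style="width:2em">% of<br/>world\n</th>', "html.parser").th

joined = th.get_text(strip=True)       # fragments joined without separator
spaced = th.get_text(" ", strip=True)  # fragments joined with a space

print(repr(joined))  # '% ofworld'
print(repr(spaced))  # '% of world'
```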

In [37]:
last_rank = 0
for row in rows[1:]:
    cells = row.find_all(["th", "td"])
    if not cells:
        continue
    cells_text = [cell.get_text(strip=True) for cell in cells]
    rank, country, population, percentage, updated_at, source, *comment = cells_text
    if rank.isdigit():
        # remember the last numeric rank so that unranked rows can reuse it
        rank = int(rank)
        last_rank = rank
    else:
        rank = last_rank
    population = int(population.replace(",", ""))
    percentage = float(re.findall(r"[\d\.]+", percentage)[0]) / 100
    updated_at = dateutil.parser.parse(updated_at).date()

    print(rank, country, f"{population:,.2e}", f"{percentage:.1%}", updated_at)

0 World 8.12e+09 100.0% 2024-07-01
0 China 1.41e+09 17.3% 2023-12-31


ValueError: invalid literal for int() with base 10: '17.2%'
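The `ValueError` shows that the string `'17.2%'` ended up in the `population` slot, which suggests that this row has a different number of cells than the first two, so the columns are shifted. Real tables are often irregular like this. One way to cope, sketched here with only the standard library, is to use small parsing helpers that return `None` instead of raising when a cell does not look as expected. The helper names and the date format `"%d %b %Y"` (which matches `'1 Jul 2024'` in an English locale) are assumptions for illustration:

```python
import re
from datetime import date, datetime


def parse_population(text):
    """Return an int for strings like '8,119,000,000', else None."""
    if re.fullmatch(r"\d{1,3}(?:,\d{3})*|\d+", text):
        return int(text.replace(",", ""))
    return None


def parse_percentage(text):
    """Return a fraction for strings like '17.3%', else None."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*%", text)
    return float(match.group(1)) / 100 if match else None


def parse_date(text):
    """Return a date for strings like '1 Jul 2024', else None."""
    try:
        return datetime.strptime(text, "%d %b %Y").date()
    except ValueError:
        return None


print(parse_population("8,119,000,000"))  # 8119000000
print(parse_population("17.2%"))          # None
print(parse_date("1 Jul 2024"))           # 2024-07-01
```

A row whose population parses to `None` can then be inspected or skipped instead of aborting the whole loop.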

**Attention**: Beautiful Soup does not execute JavaScript. This means that the code you see in your browser inspector might look a bit different from the original HTML source code.

# Another example of downloading a Wikipedia table 

Let's consider another table on a Wikipedia page. This page has many more tables, so one challenge will be to pick the right one:

https://en.wikipedia.org/wiki/Serena_Williams


We are interested in extracting these two tables:

![Target Wikipedia tables](figs/wiki_tables.png)

**Exercise**: 

Find the tables on a page by locating a heading and using `.find_next()`


We begin by downloading the webpage and instantiating the BeautifulSoup object:

In [38]:
r = requests.get("https://en.wikipedia.org/wiki/Serena_Williams")
document = BeautifulSoup(r.text, "html.parser")

This page contains many tables, and none of them has a distinctive attribute that would make our table of interest easy to find. In addition, several tables share the same column headings, which makes it difficult to identify a table by its headings alone:

In [39]:
len(document.find_all("table"))

74

Therefore, we choose another strategy.

First, we find the tag with class `mw-headline` whose `string` content _starts with_ `Singles`.
Then we find the _next_ table using `heading_element.find_next(...)`:

In [29]:
document.find_all(class_="mw-headline", string=re.compile("^Singles"))

[]

In [40]:
# search by class and by string content at the same time
singles_heading = document.find(class_="mw-headline", string=re.compile("^Singles"))
singles_heading

In [42]:
singles_heading.find_next()

AttributeError: 'NoneType' object has no attribute 'find_next'

Now, our tables of interest are the first two result tables for "Singles" and "Women's doubles". We write a small helper function that returns a table with a given heading:

In [43]:
def find_table_with_heading(document, heading_pat):
    heading_element = document.find(class_="mw-headline", string=heading_pat)
    table = heading_element.find_next("table")
    return table

In [44]:
singles_table = find_table_with_heading(document, re.compile("^Singles"))
# print headers
headings = singles_table.find_all("th")
[th.get_text(strip=True) for th in headings]

AttributeError: 'NoneType' object has no attribute 'find_next'

Next, we can find the table after the heading "Women's doubles":

In [45]:
doubles_table = find_table_with_heading(document, re.compile(r"^Women's doubles"))
# print headers
headings = doubles_table.find_all("th")
[th.get_text(strip=True) for th in headings]

AttributeError: 'NoneType' object has no attribute 'find_next'
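The `AttributeError` arises because `document.find(...)` returned `None`: the downloaded page has no element with class `mw-headline` that matches the pattern, as the empty `find_all` result above already indicated. One possible workaround is sketched here under the assumption that section titles are ordinary `<h2>` to `<h4>` elements. That is an assumption about the page markup, not something verified against the live page, so the sketch runs on a small made-up document instead. The function name and the sample HTML are hypothetical:

```python
import re

from bs4 import BeautifulSoup

HEADING_TAGS = ("h2", "h3", "h4")  # assumed heading levels


def find_table_after_heading(document, heading_pat):
    """Return the first table after a heading whose text matches heading_pat."""

    def is_matching_heading(tag):
        text = tag.get_text(strip=True)
        return tag.name in HEADING_TAGS and heading_pat.search(text) is not None

    heading = document.find(is_matching_heading)
    if heading is None:
        # Fail with a clear message instead of an AttributeError on None
        raise LookupError(f"no heading matches {heading_pat.pattern!r}")
    table = heading.find_next("table")
    if table is None:
        raise LookupError(f"no table after heading {heading_pat.pattern!r}")
    return table


# Made-up document with two headed tables
sample = """
<h3>Singles: 2 finals</h3>
<table><tr><th>Result</th><th>Year</th></tr></table>
<h3>Women's doubles</h3>
<table><tr><th>Partner</th></tr></table>
"""
doc = BeautifulSoup(sample, "html.parser")

singles = find_table_after_heading(doc, re.compile(r"^Singles"))
print([th.get_text(strip=True) for th in singles.find_all("th")])
# ['Result', 'Year']
```

Checking for `None` right after `find` keeps the error close to its cause, which helps when a site changes its markup.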

## Exercise:

- Iterate through the rows
- convert year to integer (or date)
- strip note '(12)' from event, so the same event has the same string
- load into pandas DataFrame (more on pandas in a later lecture)

In [46]:
re.sub?

Signature: re.sub(pattern, repl, string, count=0, flags=0)
Docstring:
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl.  repl can be either a string or a callable;
if a string, backslash escapes in it are processed.  If it is
a callable, it's passed the Match object and must return
a replacement string to be used.
File:      c:\users\hp\anaconda3\lib\re\__init__.py
Type:      function

In [47]:
data = []
for row in singles_table.find_all("tr"):
    cells = row.find_all("td")
    if not cells:
        continue
    values = [cell.get_text(strip=True) for cell in cells]
    values[1] = int(values[1])
    values[2] = re.sub(r"\s*\(.+\)", "", values[2])
    print(values)
    data.append(values)

NameError: name 'singles_table' is not defined
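Since the loop cannot run without `singles_table`, the cleaning step can be tried on its own. The sketch below applies the same `re.sub` pattern to made-up event strings and compares it with a stricter variant that only removes a trailing numeric note; `clean_event_strict` is a hypothetical alternative, not part of the lecture code:

```python
import re


def clean_event(text):
    """Remove a parenthesised note, using the pattern from the loop above."""
    return re.sub(r"\s*\(.+\)", "", text)


def clean_event_strict(text):
    """Remove only a trailing numeric note such as ' (12)'."""
    return re.sub(r"\s*\(\d+\)$", "", text)


print(clean_event("Wimbledon (7)"))               # Wimbledon
# The greedy .+ spans from the first '(' to the last ')'
print(clean_event("Open (clay) Cup (2)"))         # Open
print(clean_event_strict("Open (clay) Cup (2)"))  # Open (clay) Cup
```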

When data is in this form, we can convert it into a DataFrame with pandas.

You'll learn more about pandas next week.

In [None]:
import pandas as pd

headings = [th.get_text(strip=True) for th in singles_table.find_all("th")]
df = pd.DataFrame(data, columns=headings)
df

With pandas, we can filter this data, group it, and plot interesting relationships.

Pandas `groupby` is an interesting operation for performing aggregations,
e.g. counting the wins/losses by year and result:

In [None]:
df.Result.value_counts()

In [None]:
results_by_year = df.groupby(["Year", "Result"]).Tournament.count().unstack().fillna(0)
results_by_year
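To see what the `groupby` / `count` / `unstack` / `fillna` chain does without the scraped data, here is a minimal sketch on a toy DataFrame. The rows are invented placeholders and do not describe real matches:

```python
import pandas as pd

# Toy data with the same column names as the scraped table (values are made up)
toy = pd.DataFrame(
    {
        "Year": [2001, 2002, 2002, 2002],
        "Result": ["Loss", "Win", "Win", "Loss"],
        "Tournament": ["A", "B", "C", "D"],
    }
)

# Count rows per (Year, Result) pair; the result has a two-level index
counts = toy.groupby(["Year", "Result"]).Tournament.count()
# unstack() moves the Result level into columns; missing pairs become NaN
table = counts.unstack().fillna(0)
print(table)
```

The resulting table has one row per year and one column per result, with `0` where a combination never occurred (here: no win in 2001).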

We can now plot the result:

In [None]:
results_by_year.plot(kind="bar", grid=False)

Is there any significance to the court?

In [None]:
results_by_surface = df.groupby(["Surface", "Result"]).Tournament.count().unstack()
results_by_surface

In [None]:
results_by_surface.plot(kind="bar")

We can even filter the data, e.g. to select opponents Williams faced at least twice:

In [None]:
results_by_op = df.groupby(["Opponents", "Result"]).Tournament.count().unstack()
results_by_op

In [None]:
# fill missing counts with 0 so that wins and losses can be added up
results_by_op = results_by_op.fillna(0)
results_by_op

In [None]:
(results_by_op.Win + results_by_op.Loss) > 1

In [None]:
multiple_meetings = results_by_op[(results_by_op.Win + results_by_op.Loss) > 1]
multiple_meetings.plot(kind="bar")

# Exercise:

Find images on the UiO page

1) Go to https://en.wikipedia.org/wiki/University_of_Oslo 
2) Download the content from the site using BeautifulSoup and requests
3) Search for all images (using `images = document.find_all('img')`) and print out the content
4) Include only images with the class `mw-file-element` in your list of images (pass `class_="mw-file-element"` to `find_all`).
5) Print out a list of the value of the "src" attribute for the images in 4. 
6) See if you can display an image by pasting a result from 5 into your web-browser.

In [None]:
r = requests.get("https://no.wikipedia.org/wiki/Universitetet_i_Oslo")
html = r.text
print(html[:400])

In [None]:
document = BeautifulSoup(html, "html.parser")

In [None]:
images = document.find_all("img", class_="mw-file-element")
len(images)

In [None]:
for image in images:
    print(image["src"])

In [None]:
from IPython.display import HTML, display

In [None]:
for image in images:
    url = image["src"]
    if "://" in url:
        pass
    elif url.startswith("//"):
        # add 'scheme' or 'protocol'
        url = "https:" + url
    elif url.startswith("/"):
        url = "https://no.wikipedia.org" + url
    else:
        # not an understood URL
        raise ValueError(f"I don't understand this url: {url}")
    html = HTML(f'<img src="{url}">')
    display(html)
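The standard library can do this URL normalisation for us: `urllib.parse.urljoin` resolves a possibly relative URL against the URL of the page it was found on. A short sketch, where the image paths are made-up examples:

```python
from urllib.parse import urljoin

page_url = "https://no.wikipedia.org/wiki/Universitetet_i_Oslo"

# Hypothetical src values covering the three cases handled above
absolute = urljoin(page_url, "https://example.org/a.png")
scheme_relative = urljoin(page_url, "//upload.wikimedia.org/x.png")
host_relative = urljoin(page_url, "/static/logo.png")

print(absolute)         # https://example.org/a.png
print(scheme_relative)  # https://upload.wikimedia.org/x.png
print(host_relative)    # https://no.wikipedia.org/static/logo.png
```

Unlike the manual version, `urljoin` also resolves paths relative to the current page instead of raising an error for them.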