# Scientific Python
## Central European University

## 05 Error handling, JSON, XML, web scraping

Instructor: Márton Pósfai, TA: Luka Blagojevic

Email: posfaim@ceu.edu, Blagojevic_Luka@phd.ceu.edu

*Don't forget:* use the Slack channel for discussion, to ask questions, or to show solutions to exercises that are different from the ones provided in the notebook. [Slack channel](http://www.personal.ceu.edu/staff/Marton_Posfai/slack_forward.html)

## Recap -- pandas

Create a dataframe:

In [None]:
import pandas as pd
import numpy as np

# create a dataframe from a dictionary
D_pets = {'name':['Fluffy','Stan','The Fish', 'Ham', 'Flip'],
          'species':['dog','dog','fish','pig','fish'],
          'age':[3.,10.,1.,2.,1.5]}
df = pd.DataFrame(D_pets)
df

Indexing, slicing, masking:

In [None]:
#only the first three rows using slices:
print(df.iloc[:3])
print()

#only the name and age column using list of labels
print(df[['name','age']][:3])
print()

#only the fish using masking
print(df[df['species']=='fish'])
print()

Average age of species using `groupby`:

In [None]:
df.groupby('species')['age'].mean()

New columns with `apply`:

In [None]:
df['lengthofname']=df.apply(lambda row: len(row['name']),axis=1)

#or 
#df['lengthofname']=df['name'].apply(len)
df

### Exercise

Create a new column called `old` that is `True` if the pet is older than 2 years and `False` otherwise.

<details><summary><u>Hint.</u></summary>
<p>

The expression `x>2` actually returns a `True` or `False` value.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
df['old']=df['age'].apply(lambda x: x>2)

#or simply
df['old']=df['age']>2
df
```

    
</p>
</details>

And now for something new!

## Error handling & exceptions

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=21c84ba6-1c2c-4cdf-9faa-acc400e1f602)

What happens if we run the next bit of code?

In [None]:
for a in range(5):
    print(1/a)

Well, we tried to divide by zero and python didn't approve. It threw an error, namely a `ZeroDivisionError`, and the code stopped executing.

This is inflexible. What if your code depends on outside variables such as user input or an online data source and issues beyond your control cause an error? You might want to be able to handle the errors within the code and not stop the execution.

Python has a very simple way of doing this using the `try` and `except` keywords:

In [None]:
for a in range(5):
    try:
        print(1/a)
    except ZeroDivisionError:
        print("Cannot divide by 0")

We try to execute whatever is inside the `try` block. If it throws an error, we catch it and run whatever is inside the `except` block, and then the code continues to execute.

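You don't have to specify what type of error you are trying to catch. A bare `except` catches everything, including errors you did not anticipate (and even a keyboard interrupt), so use it with care: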

In [None]:
for a in range(5):
    try:
        print(1/a)
    except:
        print("Cannot divide by 0")
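
A bare `except` hides which error actually occurred. As a minimal sketch (the function name `safe_inverse` is made up for illustration), we can list the exceptions we expect and bind the exception object with `as` to inspect it:

```python
def safe_inverse(a):
    """Return 1/a, or None if the division is not possible."""
    try:
        return 1 / a
    except (ZeroDivisionError, TypeError) as exc:
        # exc is the exception object; its class name tells us what went wrong
        print(f"{type(exc).__name__}: {exc}")
        return None

for a in [0, 2, "two"]:
    print(a, "->", safe_inverse(a))
```

Any other kind of error would still stop the code, which is usually what you want for problems you did not plan for.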


You could always test your variables with `if` statements to be sure that the code will execute without an error, but things can get complicated fast. Using a `try`-`except` pair results in more readable code.

For example:


In [1]:
l = [0, 1, 3, 4]
for a in range(5):
    try:
        i = l[a]
        print(i)
        # very complicated stuff with i
    except IndexError:
        pass

0
1
3
4


Depending on what calculations we are planning to do with `i`, we might end up with a lot of `if` statements. With `try`, whenever the calculations raise the error we are catching, we end up in the `except` part.

### Exercise

Write a function that
* returns the sum of two variables `x` and `y` if adding the two variables is possible
* returns `None`  and prints out the message "Can't add x and y together!" if the variables `x` and `y` are incompatible

Substituting `try` and `except` with `if` statements would be difficult.

<details><summary><u>Hint.</u></summary>
<p>

Try running
```python
3+'1'
```
What kind of error did you get? Catch this kind of error.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
def add(x,y):
    try:
        return x+y
    except TypeError:
        print(f"Can't add {x} and {y} together!") # remember the f-strings?
        return None
        
print(add(1,2), add([1,2],[2,3]), add(1,[2,3]))
```

    
</p>
</details>

## A structured data format: JSON

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=0c9a8b1b-2e77-4ac6-8b20-acc400cdd6fb)

So far we mostly looked at data that can be neatly stored in tables; however, not all data can be represented naturally as a table. We will look at an alternative data format called JSON (the full name of the format is [JavaScript Object Notation](http://en.wikipedia.org/wiki/JSON)). It is basically a text file or a string that encodes data in a logical hierarchical structure. Look at the following example; it should look familiar to you!

<pre>
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ],
  "children": [],
  "spouse": null
}
</pre>

If you thought this looks like a combination of nested dictionaries and lists in python, you were right!

This data format has become **very popular recently** because it closely follows not only how you write a python dict but also a **JavaScript object**. This means a web browser can trivially parse a JSON string. Other formats, like XML (covered later in this notebook), require real parser code.



### JSON in Python

Python has a simple module called `json` that can parse JSON strings to create the corresponding nested python dictionaries and lists, or, vice versa, write a JSON string based on python objects.

In [1]:
import json 

We can use the `dumps` function to write a string from python dictionaries and/or lists:

In [2]:
D = {'name' : 'Alice'}
json_str = json.dumps(D) # dumps -> creates a JSON string
print(json_str)
print(type(json_str))

{"name": "Alice"}
<class 'str'>


Notice that in the dictionary definition we used single quotes `'`, but the JSON string contains double quotes `"`. In python, we can use both `'` and `"` interchangeably; in JSON, however, only double quotes are accepted.
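
Quotes are not the only difference: `True`, `False`, and `None` are written as `true`, `false`, and `null` in JSON, as in the first example above. A minimal sketch (the dictionary `person` is made up for illustration):

```python
import json

# Python values are written with their JSON spelling by dumps
person = {"name": "Alice", "isAlive": True, "spouse": None, "children": []}

compact = json.dumps(person)
print(compact)

# indent=2 pretty-prints the same data over several lines
print(json.dumps(person, indent=2))

# a round trip gives back an equal Python object
restored = json.loads(compact)
print(restored == person)
```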

And to convert a JSON string to an appropriate combination of lists and dictionaries, we use the `loads` function:

In [3]:
# note the single- vs. double-quotes...
string = '[ {"name":"Bob","age":28}, {"name":"Alice","age":23} ]'
print(type(string))

D = json.loads(string) 

print(D)
print(type(D))
print(type(D[0]))
print(D[0]["name"], D[0]["age"]) # we can access elements in the usual way 

<class 'str'>
[{'name': 'Bob', 'age': 28}, {'name': 'Alice', 'age': 23}]
<class 'list'>
<class 'dict'>
Bob 28
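
If the string is not valid JSON, `json.loads` raises a `json.JSONDecodeError`, which we can catch with the `try`-`except` tools from the first section. A minimal sketch (the helper name `parse_json` is an assumption, not part of the `json` module):

```python
import json

def parse_json(text):
    """Return the parsed object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        print("Invalid JSON:", exc)
        return None

good = parse_json('{"name": "Alice"}')
bad = parse_json("{'name': 'Alice'}")  # single quotes are not valid JSON
print(good, bad)
```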


### Exercise -- JSON

* Take the first JSON example in this notebook and copy it to a text file.
* Read it in whole (not line by line) as a string, and then convert it to a dictionary `d` by using the module json.
* Print out all phone numbers.

<details><summary><u>Hint.</u></summary>
<p>

To read the entire contents of a file `f` into a string use `f.read()`.

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
with open("json_example.txt","r",encoding="utf-8") as f:
    text = f.read()
d = json.loads(text)

for num in d["phoneNumbers"]:
    print(num["number"])
```

    
</p>
</details>

JSON has many uses, for example, the notebook that you are working with right now is also stored as a JSON file. Let's load it and take a look!

In [None]:
with open("SP_08.ipynb","r",encoding="utf-8") as f:
    text = f.read()
notebook = json.loads(text)
notebook.keys()

In [None]:
notebook['cells']

The notebook itself is a dictionary with four keys and contains a combination of nested lists and dictionaries.

### Exercise -- notebook as JSON

Explore the `notebook` object and answer the following questions:
* How many markdown cells are there in the notebook?
* How many code cells? Or more advanced: how many lines of code?
* Why doesn't the code line count increase if I add more code?

<details><summary><u>Hint.</u></summary>
<p>

`notebook['cells']` is a list containing cells. Print out the first one to see how the contents of a cell is stored.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
markdowncells = [cell for cell in notebook['cells'] if cell['cell_type']=='markdown']
print('number of markdown cells:', len(markdowncells))

codecells = [cell for cell in notebook['cells'] if cell['cell_type']=='code']
print('number of code cells:', len(codecells))

codelines = [l   for cell in notebook['cells'] if cell['cell_type']=='code' for l in cell['source'] if l.strip() ]
print('number of code lines:', len(codelines))


#or more verbose
num_codelines = 0
for cell in notebook['cells']:   # iterate over cells
    if cell['cell_type']=='code': # check if code cell
        for line in cell['source']: #iterate over lines
            if line.strip() != '': #check if line contains something else than whitespace
                num_codelines+=1 # increase count
print('number of code lines:', num_codelines)
```

You would have to reload the file to see the number of code lines increase.

</p>
</details>

This is all neat, but the real reason we are learning about JSON is that it is often used for data transfer by web services. In the next section, we will obtain data in JSON format from a web service using its API.

## Exchange rates through accessing a web API

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=882b3757-741c-41ea-bf6b-acc400d2635d)

So what is an API? API stands for "Application Programming Interface" and is a set of functions and procedures that allow code to access the features or data of an operating system, application, or other service.

As a data-collection example, we are going to obtain how currencies compare to one another over time. Specifically we ask:

**How did the exchange rate between USD and HUF change recently?**

There's a nice, free website called https://openexchangerates.org. They provide a simple API to get exchange rate data. Let's use this.

To use their API, we need to register with them and get an **App ID**. This lets them track how often you call their website and block you if you do too much (this is known as rate limiting).


### Exercise -- App ID

Register to [Open Exchange Rates](https://openexchangerates.org/signup/free) and copy-paste your own App ID below. The App ID is a string of 32 characters.

Registration requires an email address. On a side note, there are [websites](https://www.mintemail.com/) that give you throw-away email addresses for temporary use.

In [4]:
app_id = ""

<details><summary><u>Solution.</u></summary>
<p>


Every ID is unique, get your own!
    
If all of us used the same ID, we would quickly reach the rate limit and get blocked.

    
</p>
</details>

Now that we have the App ID, how do we download data? Their [docs](https://openexchangerates.org/documentation) tell us what to do:
* We have to build an appropriate URL that encodes our query and contains our ID.
* We have to request this URL from their server the same way as if we were downloading a website.
* Their service prepares and sends the requested data as a JSON file.

The documentation tells us that to access historical exchange rates the URL has to look like this
<pre>
http://openexchangerates.org/api/historical/2011-10-18.json?app_id=xxx
</pre>
where `2011-10-18` is the date we are interested in and `app_id=xxx` is our App ID.

Let's try it out:

In [None]:
# build a url from pieces:
base_url = "http://openexchangerates.org/api/historical/"
id_str   = "app_id="+app_id
date_str = "2011-10-18"
#stitch it together
URL = base_url+date_str+".json?"+id_str # this format is specified at the end of the doc page

In [None]:
#let's check it
URL
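
String concatenation works well here. As an alternative sketch, the query part can be built with `urllib.parse.urlencode`, which also escapes special characters should a parameter contain any (the function name `build_url` is made up; the URL format is the one quoted above):

```python
from urllib.parse import urlencode

def build_url(date_str, app_id):
    """Assemble the historical-rates URL for one date."""
    base_url = "http://openexchangerates.org/api/historical/"
    query = urlencode({"app_id": app_id})  # gives "app_id=..." with escaping
    return f"{base_url}{date_str}.json?{query}"

print(build_url("2011-10-18", "xxx"))
```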

OK, let's download the text of that "page":

In [None]:
import urllib.request # this module allows us to download websites, we used it in the second class
result = urllib.request.urlopen(URL)
text = result.read()

Let's see what we got:

In [None]:
print(type(text))
# now print the beginning and end of the text:
print(text[:1500])
print('\n')
print(text[-300:])

This is a `bytes` object. Do you remember it from one of the first classes? A `bytes` object is just a series of bytes, basically a string without the character encoding specified. (The notebook had no problem printing it out, since it seems to only contain ASCII characters.)

**Great!** This means we can take the text from that website and run it through `json.loads` and we have a nice accessible python dict:

In [None]:
data = json.loads(str(text,"utf-8")) # This comes from our API, remember?
print(type(data))
print(list(data.keys()))
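
The two conversion steps can be tried offline on a small example. In this sketch the payload `raw` and its numbers are made up and merely stand in for the bytes returned by `result.read()`:

```python
import json

# a made-up payload standing in for the downloaded bytes
raw = b'{"base": "USD", "rates": {"HUF": 217.5}}'

as_text = raw.decode("utf-8")   # bytes -> str, same result as str(raw, "utf-8")
parsed = json.loads(as_text)    # str -> dict

# json.loads also accepts bytes directly (Python 3.6 and later)
parsed_from_bytes = json.loads(raw)

print(type(raw), type(as_text), type(parsed))
print(parsed == parsed_from_bytes)
```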

Sweet. Now we see there's a timestamp key. What does it give us?

In [None]:
print(data["timestamp"])

Is that a Unix timestamp? Yup! If you don't remember what this is: it is the number of seconds since 1970-01-01 00:00 UTC, aka the Unix epoch. For details go back a couple of notebooks where we first covered dates and look for the epoch section.

This format is so common that `datetime` has a built-in method to deal with it: `fromtimestamp` converts a Unix timestamp to a `datetime` object:

In [None]:
import datetime
t = datetime.datetime.fromtimestamp(data["timestamp"])

print(t)
print(type(t)) # this is datetime, not a timedelta. Do you remember the difference?
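
Called with a single argument, `fromtimestamp` converts to the local time zone of your computer, so the printed time depends on where the notebook runs. A minimal sketch of asking for UTC explicitly (the value of `ts` is a hand-picked example, not taken from the API response):

```python
import datetime

ts = 1318896000  # hand-picked example: midnight UTC on 2011-10-18

# naive datetime in the computer's local time zone
local_time = datetime.datetime.fromtimestamp(ts)

# time-zone-aware datetime in UTC
utc_time = datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)

print(local_time)
print(utc_time)
```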

Our original URL had `historical/2011-10-18.json` in it, so that timestamp makes perfect sense.
There are also the `base` and `rates` keys. Those are the actual exchange rate data:

In [None]:
print(data["base"])
print(type(data["rates"]))
print(list(data["rates"].keys())[:5]) # print first five keys

`base` tells us what currency the exchange rate is relative to. `rates` is another dict, keyed by three-letter currency names.

In [None]:
print(data["rates"]["USD"])

Makes sense, the conversion for USD should always be 1 since the base was USD. Let's check if the exchange rate with the Hungarian forint is present, and let's have a look at it:

In [None]:
print(data["rates"]["HUF"])

### Exercise

Write a function that takes a currency code and the exchange rate dictionary as input and
* prints out the exchange rate if the currency is included in the dictionary
* prints out a message if not included

<details><summary><u>Hint.</u></summary>
<p>
    
To check whether dictionary `D` contains `key`, use `if key in D`.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
def exrate(currency, data):
    if currency in data["rates"]:
        print('The exchange rate of',currency, "is", data["rates"][currency])
    else:
        print(currency, 'is not included')
    return

exrate('HUF', data)
exrate('Imaginary dollars',data)
```

    
</p>
</details>
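
The same check can also be written with `try`-`except`, tying back to the first section: a missing dictionary key raises a `KeyError`. A minimal sketch (the function name `get_rate` and the numbers in `sample` are made up for illustration):

```python
def get_rate(currency, data):
    """Return the exchange rate of a currency, or None if it is missing."""
    try:
        return data["rates"][currency]
    except KeyError:
        print(currency, "is not included")
        return None

# made-up numbers with the same layout as the API response
sample = {"base": "USD", "rates": {"USD": 1, "HUF": 217.5}}

print(get_rate("HUF", sample))
print(get_rate("Imaginary dollars", sample))
```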

Now, having a `data` dict like this may seem a little verbose compared to a table or CSV file. A CSV file for exchange rates makes a lot of sense, but many data do not fit into a nice regular form like that. Sending JSON "over the wire" and using dictionary keys makes it easy for us to keep track of which number corresponds to which unit of measurement.

### Exercise

How many different currencies are there in the JSON file? How can you check it quickly?

<details><summary><u>Hint.</u></summary>
<p>
    
To get the number of items in a dictionary `D` use `len(D)`.
    
</p>
</details>


<details><summary><u>Solution.</u></summary>
<p>


```python
print(len(data['rates']))
```

    
</p>
</details>

## Putting it all together

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=3878700f-022c-402b-8cae-acc400d8b477)

**How did the exchange rate between USD and HUF change recently?**

We want to plot the USD-HUF exchange rate for the last 60 days, and we now have all the tools to do this. Let's break down the task into smaller pieces:
1. Construct a list containing the appropriate dates using datetime.
2. Download and save the data into a dictionary using json.
3. Extract the USD-HUF exchange rate from the JSON format.
4. Create plot using matplotlib.

### Part 1 -- Dates

To download the exchange rates of a specific day we have to construct a URL containing the date:
```
http://openexchangerates.org/api/historical/2011-10-18.json?app_id=xxx
```
For this we can use the `datetime` module.

#### Recap: datetime

We have already imported `datetime`. Here we would like to use it to generate a sequence of strings containing dates. We have learned how to do this before, but let's remind ourselves how it is done:

In [None]:
# create a datetime object from year, month, and day
# (the hours and minutes default to midnight, but we don't need that)
DT = datetime.datetime(1985, 10, 26)

# strftime = string format time
print("formatted date:", DT.strftime("%Y-%m-%d"))

# current date and time
Dnow = datetime.datetime.now()
print("today:", Dnow.strftime("%Y-%m-%d"))

# timedelta objects can be added to and subtracted from datetime objects
td = datetime.timedelta(days=1)
Dyester = Dnow-td
print("yesterday:", Dyester.strftime("%Y-%m-%d"))


We have recapped everything we need to create a list of dates for our requests.

Through the next exercises you will make the USD-HUF exchange rate figure!

### Exercise -- Date list

Create a list called `dt_list` that contains datetime objects of the last 60 days including today.

<details><summary><u>Hint.</u></summary>
<p>
    
To get the `datetime` object of the day `k` days ago, subtract `datetime.timedelta(days=k)` from today.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
dt_list = [Dnow - datetime.timedelta(days=t) for t in range(60)]
dt_list = dt_list[::-1] #reversing the list so the dates are in increasing order (oldest first)
print(dt_list[:5])
```

    
</p>
</details>

### Exercise -- Downloading the data

Download the data needed for our USD-HUF figure:
* Create a for loop over these dates.
* In each iteration construct a URL to request the exchange rate for each date (like we did a few cells above).
* Download the JSON exchange data.
* Parse it using `json.loads`.
* And save it to a list of dictionaries ``daily_data``.

**Important**: It should take only a few seconds. If you are not sure your code works, test only with a few days, as you can reach the rate limit. 

<details><summary><u>Hint.</u></summary>
<p>
    
When constructing the URLs use the `D.strftime("%Y-%m-%d")` method to create the correct date format.
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
url_begin = "http://openexchangerates.org/api/historical/"
url_end = ".json?app_id="+app_id
daily_data = []
for D in dt_list:
    URL = url_begin + D.strftime("%Y-%m-%d") + url_end
    result = urllib.request.urlopen(URL)
    text = result.read()
    data = json.loads(str(text,"utf-8"))
    daily_data.append(data)
```

    
</p>
</details>
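
Downloads can fail for reasons beyond our control, which is exactly the situation `try`-`except` was introduced for. The following is a minimal sketch, not the course solution: the helper names `fetch_json`, `download_all`, and `fake_fetch` are made up, and the demonstration uses a stand-in function so that it runs without a network connection or an App ID.

```python
import json
import urllib.error
import urllib.request

def fetch_json(url):
    """Download a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url) as result:
        return json.loads(result.read().decode("utf-8"))

def download_all(urls, fetch=fetch_json):
    """Return the parsed responses, skipping URLs that fail to download."""
    results = []
    for url in urls:
        try:
            results.append(fetch(url))
        except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
            print("Skipping", url, "because of", exc)
    return results

# offline demonstration with a stand-in for fetch_json
def fake_fetch(url):
    if "bad" in url:
        raise urllib.error.URLError("simulated network problem")
    return {"url": url}

demo = download_all(["good-1", "bad-2", "good-3"], fetch=fake_fetch)
print(demo)
```

Note that skipping a failed day means the downloaded list and `dt_list` no longer line up, so in the exercise you would also have to record which dates succeeded.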

I assume you have your `daily_data`. Now we can look at some data, for example the USD-EUR exchange rate yesterday was:

In [None]:
daily_data[-2]['rates']['EUR']

### Exercise -- List of rates

Create a list called `huf_xrate_list` that contains the USD-HUF exchange rates for each day.

<details><summary><u>Solution.</u></summary>
<p>


```python
huf_xrate_list = [data['rates']['HUF'] for data in daily_data]
huf_xrate_list[:5]
```

    
</p>
</details>

### Exercise -- Plotting

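Plot the exchange rate versus the date. Check the documentation of the `plt.plot_date()` function (recent Matplotlib releases deprecate it in favor of plain `plt.plot()`, which also accepts `datetime` objects). Try to make the plot look appealing.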

<details><summary><u>Hint.</u></summary>
<p>
    
For example, check out the use of `plt.MaxNLocator()` to have fewer ticks on the x axis.
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
import matplotlib.pyplot as plt
    
fig, ax = plt.subplots(figsize=(8,4))

plt.plot_date(dt_list, huf_xrate_list, "o-")

#label the axes
#plt.xlabel("Year",        fontsize=14)
plt.ylabel("HUF vs. USD", fontsize=14);

#to make it look nicer, we set the ticks on the x axis so that the dates do not overlap
ax.set_xticks(dt_list[10::20]);

```

    
</p>
</details>

[Video explaining the solutions.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=2f3e4dfa-3bdc-4038-8d5a-acc400dddf81)

## XML

[Video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=0ba489d3-a006-4e9d-adf1-ab8200ccdb32).

XML stands for Extensible Markup Language.
* A general-purpose markup language
* Both human and computer readable
* Represents information in a hierarchical way
* Strict syntax rules → effective and unambiguous parsing
* Stored in plain text
* Many APIs use it for communication
* Many file formats are special cases of XML:
    * SVG: Scalable Vector Graphics
    * RSS feeds
    * Microsoft Word docx

Let's take a look at an example (for further details see slides):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<breakfast_menu tasty="True">
  <food kind="vegan">
    <item>Belgian Waffles</item>
    <price>$5.95</price>
    <description>Two of our famous...</description>
    <calories>650</calories>
  </food>
  <food>
    <item>French Toast</item>
    <price>$4.50</price>
    <description>Thick slices made...</description>
    <calories>600</calories>
  </food>
  <food>
    <item>Homestyle Breakfast</item>
    <price>$6.95</price>
    <description>Two eggs, bacon...</description>
    <calories>950</calories>
  </food>
</breakfast_menu>
```
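
To see the hierarchy concretely, here is a minimal sketch that walks a shortened copy of the menu. It uses `xml.etree.ElementTree` from the standard library, which is a different parser from the one used in the rest of this notebook and is chosen here only because it needs no installation.

```python
import xml.etree.ElementTree as ET

# shortened copy of the menu above
xml_text = """<breakfast_menu tasty="True">
  <food kind="vegan">
    <item>Belgian Waffles</item>
    <calories>650</calories>
  </food>
  <food>
    <item>French Toast</item>
    <calories>600</calories>
  </food>
</breakfast_menu>"""

root = ET.fromstring(xml_text)
print(root.tag, root.attrib)           # the root tag and its attributes

names = []
for food in root.findall("food"):      # direct children called <food>
    item = food.find("item").text      # text inside <item>
    kind = food.get("kind")            # attribute value, or None if missing
    names.append((item, kind))
print(names)
```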


[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=28522003-574d-49b7-8b0e-acc900c255a0)

An XML parser can help you navigate and manipulate the XML tree. We will use the module [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Technically, it is built on top of a lower-level parser and adds many useful functions.

In [7]:
from bs4 import BeautifulSoup

with open("menu.xml","r",encoding='utf-8') as f:
    #from the file we create a soup object
    #we also specify the parser it uses as "xml", later we will use BeautifulSoup to parse html files too
    soup = BeautifulSoup(f,"xml")
type(soup)

bs4.BeautifulSoup

Now `soup` is the object that we will work with and it contains directly the root of the XML.

### Tags and attributes

You can access tags and their attributes by their names very simply:

In [8]:
#access child by name
print(type(soup.breakfast_menu))
print()

#get name of the tag
print(soup.breakfast_menu.name)
print()

#get attributes of a tag as a dictionary
print(soup.breakfast_menu['tasty'])
print()


<class 'bs4.element.Tag'>

breakfast_menu

True



The `text` property of a tag returns all the text contained between the start `<sometag>` and the end `</sometag>` of the tag; this includes the text inside any descendant tag. The `string` property, in contrast, only returns the text directly contained in the tag, excluding descendants. For example:

In [None]:
#if multiple children with the same tag, we get the first one
#text: all text, including text contained by offspring
print(soup.breakfast_menu.food.text)
print()

#string: only the text directly inside tag
print(soup.breakfast_menu.food.item.string)

Note that if multiple children tags have the same name (e.g. `<food>`) the first one is returned.
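
One detail worth knowing: when a tag has more than one child, `string` is `None`, while `text` still joins everything together. A minimal sketch on a tiny inline snippet (the names `snippet` and `mini_soup` are made up, and Python's built-in `"html.parser"` is used here only so that the example needs no file):

```python
from bs4 import BeautifulSoup

snippet = "<food><item>Belgian Waffles</item><price>$5.95</price></food>"
mini_soup = BeautifulSoup(snippet, "html.parser")

# <item> holds a single piece of text, so string returns it
print(mini_soup.food.item.string)

# <food> has two children, so string cannot pick one and gives None
print(mini_soup.food.string)

# text concatenates the text of all descendants
print(mini_soup.food.text)
```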

We can also change things, for example:

In [None]:
print(soup.breakfast_menu.food.item.string)
soup.breakfast_menu.food.item.string = "Yummy Belgian Waffles"
print(soup.breakfast_menu.food.item.string)

### Navigating the tree

Beautiful Soup has methods that allow traversing the tag tree of the XML file.

* Moving down: iterate through the children of a tag

In [None]:
#the first food on the menu
food1 = soup.breakfast_menu.food

list_of_children = list(food1.children)
print(list_of_children)
print()

#children also include strings containing end of line characters, to exclude them
for child in food1.children:
    if child != '\n':
        print(child)


Note that the `children` iterable contains all text directly inside the tag as a child, this means that the end-of-line characters `\n` (which are added for readability) are also listed as children. You can exclude them with an `if` statement when iterating.

* Moving up: get the parent of a tag

In [None]:
print(food1.parent.name)

* You can also move sideways: check documentation for `next_siblings` and `previous_siblings`

### Searching

Most often instead of navigating up and down the tree we search for tags that match some requirements. For this we can use:
* Find the first match: `soup.find()`
* Find all matches: `soup.find_all()`

We can specify what we are searching for in various ways:
* Search based on tag names:

In [None]:
for food in soup.find_all("food"):
    print(food.item.string)

* Search based on attributes:

In [None]:
for vegan_stuff in soup.find_all(kind="vegan"):
    print(vegan_stuff.item.text)

* We can even use functions! This is a powerful tool that allows for very complex searches. For example, we can look for food options that have waffles in their description:

In [2]:
#the function takes a tag as input
#outputs True if it's a match, False if it's not
def match(tag):
    if tag.name=="food": #check if it is a food
        if "Waffles" in tag.text: #check if waffles are mentioned 
            return True
    #if we didn't return True, we return False
    return False

for waffles in soup.find_all(match):
    print(waffles.item.string)


Belgian Waffles
Strawberry Belgian Waffles
Berry-Berry Belgian Waffles


* Or we can do the same thing using lambda functions:

In [None]:
for waffles in soup.find_all(lambda tag: tag.name=="food" and "Waffles" in tag.text):
    print(waffles.item.string)

### Exercise

Print out the price of all food items.

<details><summary><u>Hint</u></summary>
<p>

You are looking for the `<price>` tags, use the `soup.find_all()` function to get all matches.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
for price in soup.find_all("price"):
    print(price.string)
```
    
</p>
</details>

### Exercise
Calculate the average calorie content of the food options.

<details><summary><u>Hint</u></summary>
<p>

The calories are in string format, you have to convert them to a number using the `float()` function. If you store these numbers in a list, you can use `numpy`'s `np.mean()` function to get the average.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
import numpy as np
calories = [float(cal.string) for cal in soup.find_all("calories")]
print("%.2f"%np.mean(calories))
```
    
</p>
</details>

### Exercise
Print out all food items that have fewer than 800 calories.

<details><summary><u>Hint</u></summary>
<p>

This is a bit more tricky. You can define a `match(tag)` function to use with `find_all()` as we did when we were searching for waffles. To get the calories within the `match(tag)` function use `tag.calories.string`.

</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
def match(tag):
    if tag.name=="food":
        if float(tag.calories.string)<800:
            return True
    return False

for food in soup.find_all(match):
    print(food.item.string)

#or with a lambda function
for food in soup.find_all(lambda tag: tag.name=="food" and float(tag.calories.text)<800):
    print(food.item.string)
```
    
</p>
</details>

### Exercise
You are worried about global warming and you would like to encourage people to reduce their carbon footprint. Reduce the price of all vegan items by 10 percent!

<details><summary><u>Hint</u></summary>
<p>

In a previous example, we already searched for vegan items. Scroll back if you don't remember how.

You can modify the string contained in a tag simply by overwriting it, e.g., `food.price.string = new_price`.
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
for food in soup.find_all(kind='vegan'):
    #convert price to number
    price = float(food.price.string[1:])
    #calculate new price
    new_price = .9*price
    #convert new price to string
    new_str = "$%.2f"%new_price
    #update xml
    food.price.string = new_str
    
#test
for food in soup.find_all('food'):
    print(food.item.string, food.price.string)
```
    
</p>
</details>

## Webscraping by parsing HTML

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=276ebf77-45c6-464b-9ba4-ab8200e4194f)

HTML stands for Hypertext Markup Language. An HTML file is a text file that describes the content and structure of a web page. When you open a website in your browser, it downloads the HTML file and translates it into what you see.

You can view the HTML source code of a website by hitting ctrl-u (or Command+Option+u in Safari). For a simple example visit [this site](http://posfaim.web.elte.hu/example.html) and look at the source code.

It looks very similar to an XML file, but there are some differences. Check out the additional slides and the next video for more details.

[Video.](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=6f20236b-24b1-4a60-8457-acc900cacb7b)

We can use Beautiful Soup to parse and search the website as we did with XML before! Let's open the website using the `urllib.request` module and create a soup object:

In [3]:
import urllib.request
webpage = urllib.request.urlopen("http://posfaim.web.elte.hu/example.html")
soup = BeautifulSoup(webpage,"lxml") 

Note that we use a different parser: `"lxml"` is the name of the underlying parsing library, and passing `"lxml"` selects its HTML parser, while `"xml"` selects its XML parser.

Now we can navigate and search the HTML tree. For example, we can access tags by their names:

In [4]:
#for example, we can access tags by their names
title = soup.html.head.title.string
print(title)

 My simple website


Or we can find the table and access the data in it:

In [None]:
#there is only one table on the webpage, so we can use the find() function
#which returns the first match
table = soup.find("table") 

for row in table.find_all("tr")[1:]: #iterate through the rows of the table, we skip the header
    animal = row.th.string.strip().lower() #get the row headers, and make the string look prettier
    
    cells = row.find_all("td") # get all cells in the row
    legs = cells[2].string.strip() #grab the third one
    print("The "+animal+" has "+legs+" legs.")

### Exercise

Find all links on the page and print out all urls that they point to. If you forgot what tags represent links, look at the website's source code.

<details><summary><u>Hint</u></summary>
<p>

To access the attribute `att` of a tag use `tag['att']`.    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
for a in soup.find_all("a"):
    print(a['href'])
```
    
</p>
</details>

## World population over time

Now for some real webscraping! Our task is to plot the world population as a function of time based on this [table](https://en.wikipedia.org/wiki/World_population#Past_population) found on Wikipedia. Your first step is to investigate the source code of the website. You will find that it is longer and more complex than our first little example.

### Exercise

Download the https://en.wikipedia.org/wiki/World_population webpage and create a soup object called `pop_soup`.

<details><summary><u>Hint</u></summary>
<p>

You can use the same code as we used to download the simple example website, you only have to change the URL.    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
webpage = urllib.request.urlopen("https://en.wikipedia.org/wiki/World_population")
pop_soup = BeautifulSoup(webpage,"lxml") 
```
    
</p>
</details>

[Watch video](https://ceu.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=3d773823-6623-4da1-ac87-ab82012d0aa2)

Open the source code of the website https://en.wikipedia.org/wiki/World_population in your browser. Let's try to figure out a way to find the `<table>` tag containing the "Past population" table.

One possibility is to search for the pattern "<table" in your browser using `ctrl-f`; this will find and count the tables. Iterating through them by hand, we can see that our table of interest is the 11th.

In [None]:
table = pop_soup.find_all("table")[10]

Another possibility is to notice that this is the only table that has BC dates in it.

In [None]:
table = pop_soup.find(lambda tag: tag.name=="table" and "BC" in tag.text)

Here we used `tag.text`. Remember, `tag.string` is the text directly in the tag, while `tag.text` is all the text stored in the tag, its children, and other descendants.

Now that we have the table, let's extract the year column:

In [None]:
year_list = []
for row in table.find_all("tr")[1:]: #we leave out the first row, because that is just a header
    #the year is in the row headers
    year_list.append(row.th.string)
    
print(year_list)
    

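If we want to use this for a plot, we have to convert these strings into numbers. The plotting solution at the end of this section assumes the converted years are stored in a list called `num_year_list`.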
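
One way to do the conversion is sketched below. The exact format of the row headers on the live page is an assumption here (labels such as `70,000 BC`, `AD 1`, or `1950`), and the function name `year_to_number` and the `examples` list are made up for illustration, so check the printed `year_list` and adapt as needed:

```python
def year_to_number(label):
    """Convert a label such as '10,000 BC' or '1950' to a signed integer year."""
    text = label.strip().replace(",", "")
    is_bc = "BC" in text
    # keep only the digits, which drops 'BC', 'AD', and any spaces
    digits = "".join(ch for ch in text if ch.isdigit())
    year = int(digits)
    return -year if is_bc else year

# made-up labels in the assumed format
examples = ["70,000 BC", "1000 BC", "AD 1", " 1950\n"]
converted = [year_to_number(s) for s in examples]
print(converted)
```

With the scraped data this would read `num_year_list = [year_to_number(y) for y in year_list]`. The sketch keeps every digit it finds, so a label that carries extra digits (for example a footnote marker) would need cleaning first.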

### Exercise

Obtain a list `num_pop_list` containing the world population as a number. Treat "<0.015" as "0.015".

<details><summary><u>Hint</u></summary>
<p>

Do something similar to the previous example. Remove `,` and `<` characters and convert the string to a float.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
num_pop_list = []
for row in table.find_all("tr")[1:]: #we leave out the first row, because that is just a header
    #the world population is the first column
    #if we refer to the tag <td> by name, it returns the first instance
    pop = row.td.string
    #remove '<'
    pop = pop.replace("<","")
    #remove ','
    pop = pop.replace(",","")
    num_pop_list.append(float(pop))
    
print(num_pop_list)
```
    
</p>
</details>

### Exercise

Plot the world population as a function of the year. The oldest datapoints are rough estimates, try excluding them.

Bonus: Is the growth exponential? Try setting the y axis to log scale too! 

<details><summary><u>Hint</u></summary>
<p>

Import `matplotlib.pyplot` and use the `plot()` function. To set the y axis to logscale, you can use `plt.yscale("log")`. To exclude the 3 oldest datapoints use `num_pop_list[3:]`.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1,2,figsize=(14,5))

axes[0].set_xlabel("Year")
axes[0].set_ylabel("Population [million]")
axes[0].plot(num_year_list[3:],num_pop_list[3:],'o-')

axes[1].set_xlabel("Year")
axes[1].set_ylabel("Population [million]")
axes[1].plot(num_year_list[3:],num_pop_list[3:],'o-g')
plt.yscale('log')
```
    
</p>
</details>

## Remark

We used bare-bones `urllib.request` calls to download the contents of websites; however, this approach might not work for several reasons:
* You might have to log into a website to access its contents, so you have to figure out how to authenticate your requests, see [urllib.request docs](https://docs.python.org/3/library/urllib.request.html#examples).
* A website displayed in your browser might be generated by JavaScript executed by your browser. To access such content you have to execute these scripts yourself, see [Selenium](https://selenium-python.readthedocs.io/).
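
Handling such cases starts with having more control over the request than a bare URL gives. As a minimal sketch, `urllib.request.Request` lets you attach headers to a request before sending it. The URL and the header value below are placeholders, and whether a given server needs or honors a particular header is something you have to find out for each site:

```python
import urllib.request

url = "http://example.com/page.html"  # placeholder address
request = urllib.request.Request(
    url,
    headers={"User-Agent": "my-course-scraper/0.1"},  # placeholder value
)

# nothing is downloaded yet; urllib.request.urlopen(request) would send it
print(request.full_url)
print(request.header_items())
```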