# Scientific Python
## Central European University

## 05 Error handling, JSON, XML, web scraping -- exercises

Instructor: Márton Pósfai, TA: --

Email: posfaim@ceu.edu

*Don't forget:* use the Slack channel for discussion, to ask questions, or to show solutions to exercises that are different from the ones provided in the notebook. [Slack channel](http://www.personal.ceu.edu/staff/Marton_Posfai/slack_forward.html)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import datetime
import urllib.request
from bs4 import BeautifulSoup

## Exercises -- Error handling

### 01 Dictionaries and errors

Write a function that does the same thing as the `get` method of dictionaries:
* takes a dictionary and a key as input
* if the key exists, return the corresponding value
* if the key does not exist, return `None`

Use the `try`-`except` pair!

<details><summary><u>Hint.</u></summary>
<p>

If you try to access a key that doesn't exist, Python raises a `KeyError`.

</p>
</details>

In [None]:
D = {'apple':100, 'watermelon':200,'orange':14}

#behavior of the get method:
print(D.get('apple'))
print(D.get('sausage'))

<details><summary><u>Solution.</u></summary>
<p>


```python
D = {'apple':100, 'watermelon':200,'orange':14}
def same_as_get(d,key):
    try:
        return d[key]
    except KeyError:
        return None
    
print(same_as_get(D,'apple'))
print(same_as_get(D,'sausage'))  
```

    
</p>
</details>

### 02 Conversion

Write a function that takes a string as input and, if possible, converts it to an integer using the `int()` function and returns this value. If the conversion is not possible, it returns `None`.


<details><summary><u>Hint.</u></summary>
<p>

Run `int("hello")` and see what error is raised. Or you can try to find the right exception type [here](https://docs.python.org/3/tutorial/errors.html).

    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
def IntConv(x):
    try:
        return int(x)
    except ValueError:
        return None
    
IntConv("2.") 
```

    
</p>
</details>

### 03 Indexing

Write a function that takes one variable `x` as input and returns an element of it indexed by 2 (i.e., `x[2]`) if possible. If not possible return `None` and print out:
* "Index is out of range, buddy." if the index is out of range,
* "Please pay more attention to your variable types." if `x` is not indexable,
* "Something is not working..." in any other case.

Note that you can handle multiple error types within the same `try` statement simply by adding multiple `except` statements:
```python
try:
    some code
except Error1:
    some code
except Error2:
    some code
```
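
As a concrete illustration of the pattern above, here is a minimal sketch on an unrelated problem. The function name `safe_divide` and its messages are made up for this example; only the `try`/`except` structure is the point.

```python
def safe_divide(a, b):
    """Hypothetical helper: return a / b, or None if the division fails."""
    try:
        return a / b
    except ZeroDivisionError:
        # raised when b is 0
        print("Cannot divide by zero.")
    except TypeError:
        # raised when a or b is not a number, e.g. a string
        print("Both arguments must be numbers.")
    return None

safe_divide(6, 3)    # no error
safe_divide(1, 0)    # triggers the first except branch
safe_divide("a", 2)  # triggers the second except branch
```

Only the first matching `except` branch runs, so the order matters when the error types are related to each other.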

Write cases to test all error messages.

<details><summary><u>Hint.</u></summary>
<p>


Try out these `x`s to see which errors are raised:
```python
x=0
x[2]
    
x=[1,2]
x[2]
    
x={'hello':33}
x[2]
```
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
def index2(x):
    try:
        return x[2]
    except IndexError:
        print("Index is out of range, buddy.")
    except TypeError:
        print("Please pay more attention to your variable types.")
    except Exception:
        print("Something is not working...")
        
    
index2({1:[3]}) 
```

    
</p>
</details>

## Exercises -- JSON/webapi

### 03 JSON

Convert the string `S` into a JSON object `D` and do the following:
* Try pretty printing `D` (printing it in a more human-readable way) by converting it back to a string using the `indent` argument. Check the documentation for details.

<details><summary><u>Hint.</u></summary>
<p>

To convert back to a string use `json.dumps()`.
    
</p>
</details>
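
The round trip between strings and Python objects can be sketched on a small example. The string `sample` below is a hypothetical stand-in, not the exercise data, and `indent=2` is just one possible choice.

```python
import json

# Hypothetical sample string, not the exercise data
sample = '{"title": "Repo Man", "year": 1984, "actors": ["Emilio Estevez"]}'

parsed = json.loads(sample)            # string -> dictionary
pretty = json.dumps(parsed, indent=2)  # dictionary -> indented string
print(pretty)
```

Parsing the pretty-printed string again gives back the same dictionary, since the indentation only changes the whitespace.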

In [None]:
S = '{"movies": {"Repo Man": {"actors": ["Emilio Estevez", "Harry Dean Stanton", "Zander Schloss"], "imdb_rating": 6.9, "year": 1984, "director": "Alex Cox"}, "Human Highway": {"actors": ["Neil Young", "Mark Mothersbaugh", "Pegi Young"], "imdb_rating": 6.0, "year": 1982, "director": "Neil Young"}, "Mighty Ducks": {"actors": ["Heidi Kling", "Emilio Estevez"], "imdb_rating": 6.6, "year": 1992, "director": "Stephen Herek"}}, "band_membership": {"Emilio Estevez": [], "Harry Dean Stanton": ["Harry Dean Stanton & The Cheap Dates"], "Zander Schloss": ["Circle Jks", "The Weirdos"], "Neil Young": ["Crazy Horses", "Buffalo Springfield"], "Mark Mothersbaugh": ["DEVO"], "Pegi Young": ["Pegi Young and the Survivors"], "Heidi Kling": []}}'



<details><summary><u>Solution.</u></summary>
<p>


```python
D= json.loads(S)
print(json.dumps(D,indent=1))  
```

    
</p>
</details>

* Print out which movies had their directors also act in them.

<details><summary><u>Hint.</u></summary>
<p>

To test if `x` is contained in a list `L`:
```python
if x in L:
    ...
```
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
for movie,details in D['movies'].items():
    if details['director'] in details['actors']:
        print(movie)  
```

    
</p>
</details>

* Print out the title of the movies and the number of actors in the movie who are also musicians.

<details><summary><u>Hint.</u></summary>
<p>

You have to count the number of actors in `D['movies'][movie_title]['actors']` whose band membership list is not empty.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
#solution 1
for movie,details in D['movies'].items():
    n = 0
    for actor in details['actors']:
        if len(D['band_membership'][actor])>0:
            n += 1
    print(movie,n)
    
#solution 2
for movie,details in D['movies'].items():
    n = len([actor for actor in details['actors'] if D['band_membership'][actor]])
    print(movie,n) 
```

    
</p>
</details>

The next few exercises use `openexchangerates.org`, so we need the app id again:

In [None]:
app_id = "..."

### 04 Monthly rates

Download the exchange rates for the first day of each month of 2020 and plot your favorite currency.

Tip: When requesting a lot of data, download and plot the data in separate cells, so that you don't download the data multiple times when you are experimenting with your plot.

<details><summary><u>Hint.</u></summary>
<p>

You can use `datetime` to create the list of strings representing the dates. However, you can do it more simply using `"2020-%02d-01"%m` where `m` goes from 1 to 12.
    
</p>
</details>
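
The two ways of building the date strings mentioned in the hint can be compared side by side. This is only a sketch; the variable names `dates_fmt` and `dates_dt` are chosen for this example.

```python
import datetime

# Option 1: old-style string formatting, %02d pads the month with a leading zero
dates_fmt = ["2020-%02d-01" % m for m in range(1, 13)]

# Option 2: build date objects and convert them to ISO-formatted strings
dates_dt = [datetime.date(2020, m, 1).isoformat() for m in range(1, 13)]

print(dates_fmt[:3])
print(dates_fmt == dates_dt)
```

The `datetime` version is longer here, but it generalizes more easily, for example to the last day of each month.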

Download:

Plot:

<details><summary><u>Solution.</u></summary>
<p>


```python
#download
dates = ["2020-%02d-01"%t for t in range(1,13)]

monthly_rates = []
currency = 'BTC'
for date in dates:
    URL = "http://openexchangerates.org/api/historical/" + date + ".json?app_id="+app_id
    result = urllib.request.urlopen(URL)
    text = result.read()
    data = json.loads(str(text,"utf-8"))
    monthly_rates.append(data['rates'][currency])

#plot   
plt.plot(dates,monthly_rates,'-o')
plt.ylabel('bitcoin vs USD')
plt.xticks(rotation=45);
```

    
</p>
</details>

### 05 Check API usage

Write code to check how many data requests you have sent to `openexchangerates.org`. Check the API documentation [here](https://docs.openexchangerates.org/docs/usage-json).

<details><summary><u>Hint.</u></summary>
<p>

The URL that you have to request is
```python
"http://openexchangerates.org/api/usage.json?app_id=xxx"
```
where `xxx` is your App ID.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
URL = "http://openexchangerates.org/api/usage.json?app_id="+app_id
result = urllib.request.urlopen(URL)
text = result.read()
usagedata = json.loads(str(text,"utf-8"))
usagedata['data']['usage']['requests']
```

    
</p>
</details>

### 06 Old exchange rates

What year is the oldest exchange rate data from? To answer this question combine using the API with error handling. 

<details><summary><u>Hint.</u></summary>
<p>


If you request a date that `openexchangerates.org` doesn't have data for, it returns an error. Try requesting data for `2020-12-31`, `2019-12-31`, and so on. The first year for which you get an error is the most recent year without data, so the oldest data is from the year after it.

    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
year = 2020
while True:
    try:
        # build a url from pieces:
        base_url = "http://openexchangerates.org/api"
        id_str   = "app_id="+app_id
        date_str = str(year)+"-12-31"
        # stitch it together
        URL = base_url+"/historical/"+date_str+".json?"+id_str # this format is specified at the end of the doc page
        result = urllib.request.urlopen(URL)
    except Exception:
        print(f'The oldest exchange rate data is from {year+1}.')
        break
    year-=1
```

    
</p>
</details>

### Weather API

I ran out of ideas for exercises involving exchange rates, so let's try something else.

Web API services are typically marketed to developers, who create apps or other software that rely on the data from the web APIs. Usually there is a free limited subscription that allows developers to test out an API before committing to it. We make use of these subscriptions for our own educational purposes.

Register and get an app id (also known as the API key) from [weatherapi.com](https://www.weatherapi.com/). You need an email address to register. On an unrelated note, there are disposable email [services](https://www.google.com/search?&q=temporary%20email) that provide a temporary email address that you can use once and then forget about. Sometimes these email addresses are not allowed by web services, so if the first one does not work you can try another one.

There is also an [API explorer](https://www.weatherapi.com/api-explorer.aspx) that lets you test and construct request URLs on their website.

In [None]:
api_key='...'

### 07 Current weather

This api has a wide range of functionality. For example the following request provides info about the current weather in Paris:
```python
url = "http://api.weatherapi.com/v1/current.json?key="+api_key+"&q=Paris"
```
Download the current weather, convert it to a JSON object and print it out to explore the data it contains.

<details><summary><u>Hint.</u></summary>
<p>

Use the same steps we used for `openexchangerates.org`.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
url="http://api.weatherapi.com/v1/current.json?key="+api_key+"&q=Paris"
result = urllib.request.urlopen(url)
text = result.read()
current_weather = json.loads(str(text,"utf-8"))
print(json.dumps(current_weather,indent=1))
```

    
</p>
</details>

### 08 Rain --  Discussion exercise

Write a function that takes your location as input and prints out "yes" if it is raining outside and "no" if it is not.

Take a look at the [api documentation](https://www.weatherapi.com/docs/)!

**Share your solution on Slack. We will discuss the problem during class together.**

### 09 Forecast

Someone I know moved from Davis, California to Budapest for a new job. Did this person make the right decision? Get tomorrow's hourly weather forecast for Davis, California and Budapest and plot the predicted temperature as a function of the hour of the day.

<details><summary><u>Hint.</u></summary>
<p>

Request the forecast for two days, the second day in the list will be tomorrow's forecast:
```python
weather['forecast']['forecastday'][1]
```
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>


```python
for loc in ['Davis','Budapest']:
    url="http://api.weatherapi.com/v1/forecast.json?key="+api_key+"&q="+loc+"&days=2"
    result = urllib.request.urlopen(url)
    text = result.read()
    weather = json.loads(str(text,"utf-8"))
    temps = [hw['temp_c'] for hw in weather['forecast']['forecastday'][1]['hour']]

    plt.title(weather['forecast']['forecastday'][1]['date'])
    plt.plot(temps,'o-',label=loc)
    
plt.xlabel('hours')
plt.ylabel('temperature[C]')
plt.legend()
```

    
</p>
</details>

### 10 Clean air

Sort the EU capitals based on current air quality as measured by airborne coarse particulate matter (PM10).

<details><summary><u>Hint.</u></summary>
<p>


Add `aqi=yes` to the request URL to ensure that the response contains the air quality info.

Create a list of tuples `(city_name,pm10)` and sort the list based on the second element of the tuple.
    
</p>
</details>
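
Sorting a list of tuples by the second element, as suggested in the hint, can be sketched as follows. The city names and values in `readings` are made up for illustration and are not real measurements.

```python
# Made-up readings for illustration only, not real measurements
readings = [("CityA", 12.5), ("CityB", 3.1), ("CityC", 7.8)]

# key tells sorted() which part of each tuple to compare
ranked = sorted(readings, key=lambda pair: pair[1])

# keep only the names, cleanest air first
cleanest_first = [name for name, value in ranked]
print(cleanest_first)
```

`sorted` returns a new list and leaves `readings` unchanged, while the `sort` method of a list sorts it in place.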

In [None]:
eu_caps=["Vienna", "Brussels", "Sofia", "Zagreb", "Nicosia", "Prague", "Copenhagen", 
 "Tallinn", "Helsinki", "Paris", "Berlin", "Athens", "Budapest", "Dublin", 
 "Rome", "Riga", "Vilnius", "Luxembourg", "Valletta", "Amsterdam", "Warsaw",
 "Lisbon", "Bucharest", "Bratislava", "Ljubljana", "Madrid", "Stockholm"]



<details><summary><u>Solution.</u></summary>
<p>


```python
pm10 =[]
for loc in eu_caps:
    url="http://api.weatherapi.com/v1/current.json?key="+api_key+"&q="+loc+"&aqi=yes"
    result = urllib.request.urlopen(url)
    text = result.read()
    weather = json.loads(str(text,"utf-8"))
    pm10.append(weather['current']['air_quality']['pm10'])

joint_list = list(zip(eu_caps,pm10))
joint_list.sort(key =lambda pair: pair[1])
joint_list  
```

    
</p>
</details>

## Exercises -- XML/HTML

As part of the exercises you will have to extract some data from a file or website using BeautifulSoup, and typically you will have to process this data (plot something, calculate a statistic, etc.). You can do the second part in multiple ways using built-in data types, numpy or pandas; the choice is yours. As in real applications, you might find that the simplest solution uses a function or trick that we did not cover in class, so look at the documentation and use the internet as needed.

### 11 Quiz XML

As an instructor of a course you can export the quizzes in Moodle as XML files. The file `python_quiz.xml` contains the questions from the quiz you did after the first class.

Load the file and parse it with BeautifulSoup.

In [None]:
with open("python_quiz.xml","r",encoding='utf-8') as f:
    soup = BeautifulSoup(f,"xml")

<details><summary><u>Solution.</u></summary>
<p>
    
```python
with open("python_quiz.xml","r",encoding='utf-8') as f:
    soup = BeautifulSoup(f,"xml")
```
    
</p>
</details>

The questions are contained in `<question>` tags. Take the first one and figure out how to pretty print it, so you can examine its structure.

<details><summary><u>Hint</u></summary>
<p>

`tag.another_tag` returns the first `another_tag` inside `tag`, see class notebook. The `prettify()` tag method creates a formatted string.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
print(soup.question.prettify())
```
    
</p>
</details>

How many questions are there and what are the different types of question?

<details><summary><u>Hint</u></summary>
<p>

Iterate over all `<question>` tags.
    
One possible way to identify the different types is to collect all first occurrences of the `type` attribute in a list. You can also check out the built-in data type `set`.
 
</p>
</details>
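
If you have not used `set` before, the sketch below shows how it removes duplicates. The list `types_seen` is a made-up example and not the content of the quiz file.

```python
# Made-up list of values with repetitions
types_seen = ["multichoice", "essay", "multichoice", "truefalse", "essay"]

# a set keeps each distinct value only once (the order is not preserved)
unique_types = set(types_seen)

print(len(types_seen), len(unique_types))
```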

<details><summary><u>Solution.</u></summary>
<p>
    
```python
qtypes = []
for q in soup.find_all('question'):
    if q['type'] not in qtypes:
        qtypes.append(q['type'])
print("Number of questions:", len(soup.find_all('question')))
print("Question types:", qtypes)
```
    
</p>
</details>

How many questions contain the word "list" in the question itself (not the answers)?

<details><summary><u>Hint</u></summary>
<p>

The questions are encased in the `<questiontext>` tags.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
n_lists =0
for q in soup.find_all('question'):
    if "list" in q.questiontext.text:
        #print(q.questiontext.text)
        n_lists += 1
print(n_lists)
```
    
</p>
</details>

Select the `"essay"` type questions and double the size of the response field (this is the box that you type in; the height of the box is given in lines in the `<responsefieldlines>` tag).

<details><summary><u>Hint</u></summary>
<p>

The text directly contained within `<responsefieldlines>` is `responsefieldlines.string`, you can simply overwrite this using `responsefieldlines.string=...`.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
for q in soup.find_all('question',type="essay"):
    #print(q.responsefieldlines.string)
    q.responsefieldlines.string=str(2*int(q.responsefieldlines.string))
    #print(q.responsefieldlines.string)
#print(soup.question.prettify())
```
    
</p>
</details>

Print out the number of multiple-choice questions that have more than one correct answer (the answer tags have an attribute called `fraction`; for correct answers the fraction is larger than zero).

<details><summary><u>Hint</u></summary>
<p>

Write a function `multiple_correct(tag)` that takes a tag as input and returns `True` if
* the tag is a `<question>` and
* its `type` attribute is equal to `multichoice` and
* it contains at least two `<answer>` tags that have a non-zero `fraction` attribute,

and returns `False` otherwise.
    
Use `multiple_correct` as the argument of `find_all`.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
def multiple_correct(tag):
    if tag.name=="question" and tag['type']=='multichoice':
        n = 0
        for a in tag.find_all('answer'):
            if float(a['fraction'])>0:
                n+=1
        if n>1:
            return True
    return False

print(len(soup.find_all(multiple_correct)))
```
    
</p>
</details>

Write a function `has_attribute(soup,tag,attribute)` that takes a soup object `soup` and two strings `tag` and `attribute` as input. It looks for the first occurrence of `tag` and returns `True` if this tag has an attribute called `attribute` and `False` otherwise.

For example, using the quiz xml file `has_attribute(soup,"question","type")` returns `True`, but `has_attribute(soup,"question","horseradish")` returns `False`.

<details><summary><u>Hint</u></summary>
<p>

The attributes are stored and accessed like a dictionary, e.g., `soup.question["type"]` provides the value of the `"type"` attribute. If you try to access a non-existing attribute, it throws a `KeyError`, which you can catch using a `try`-`except` pair. Or you can use the `get()` dictionary method (remember class 3?).
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
def has_attribute(soup,tag,attribute):
    if soup.find(tag).get(attribute) is not None:
        return True
    else:
        return False
#or
def has_attribute(soup,tag,attribute):
    if soup.find(tag).get(attribute):
        return True
    return False  

print(has_attribute(soup,"question","type"),has_attribute(soup,"question","horseradish"))
```
    
</p>
</details>

### 12 arXiv.org

[arXiv.org](https://arxiv.org) is an open-access preprint repository containing almost 2,000,000 scientific papers. It originally started for physics, but has since expanded to include math, computer science and other fields. It has an API that allows you to access the articles and their metadata. Use the URL below to retrieve the metadata of the first 1000 articles with titles containing the word "covid" in XML format. The information about the individual papers is stored in `<entry>` tags; print out the first one to see the available data.

<details><summary><u>Hint</u></summary>
<p>

Same as in the first exercise.
    
</p>
</details>

In [None]:
url = 'http://export.arxiv.org/api/query?search_query=ti:"covid"&max_results=1000&sortBy=submittedDate&sortOrder=ascending'

<details><summary><u>Solution.</u></summary>
<p>
    
```python
data = urllib.request.urlopen(url)
soup = BeautifulSoup(data,"xml")
print(soup.entry.prettify())
```
    
</p>
</details>

Create a figure that shows the number of papers published in each month in the dataset!

There are many ways you can do this: you can create a dataframe, or you can use lists. You can write your own code to count the occurrences in a month or use `np.unique` or `collections.Counter`. Google away if you need help.
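
As a small illustration of the `collections.Counter` option, the sketch below counts made-up timestamps per month. The strings in `published` are hypothetical and only assumed to start with the year and month in `YYYY-MM` form.

```python
import datetime
from collections import Counter

# Hypothetical timestamps, not taken from the arXiv response
published = ["2020-03-14T10:00:00Z", "2020-03-20T08:30:00Z", "2020-04-01T12:00:00Z"]

# keep year and month, set the day to 1 so that dates in the same month are equal
months = [datetime.date(int(p[:4]), int(p[5:7]), 1) for p in published]

# Counter maps each distinct month to the number of times it occurs
counts = Counter(months)
for month in sorted(counts):
    print(month, counts[month])
```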

<details><summary><u>Hint</u></summary>
<p>

The date and time when the paper was published is in the `<published>` tag. Create a date object with the year and the month of publishing, but set the day to 1.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
YMs = [datetime.date(year=int(pub.string[:4]),month=int(pub.string[5:7]),day=1) for pub in soup.find_all('published')]
uniYMs, counts = np.unique(YMs, return_counts=True)
plt.plot(uniYMs,counts,'o-')
plt.ylabel("Number of covid papers")
plt.xticks(rotation=45);
``` 
</p>
</details>

You can also look at the newest 1000 papers by changing "ascending" to "descending" in the URL. (arXiv.org does not recommend API search requests that return more than 1000 entries. If you are interested in accessing more, you can stitch together the results of multiple requests, or you can look into the bulk data API.)

Print out the title of the paper that has the most co-authors!

<details><summary><u>Hint</u></summary>
<p>

Count the number of `<author>` tags contained within the `<entry>` tags.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
max_auth=0

for entry in soup.find_all('entry'):
    auth = len(entry.find_all('author'))
    if auth>max_auth:
        max_auth = auth
        title = entry.title.string
print(title)
print('number of co-authors:', max_auth)
```
    
</p>
</details>

###  13 Planets

Download the table containing info about planets compiled for you by NASA from here: https://nssdc.gsfc.nasa.gov/planetary/factsheet/
Parse the table using BeautifulSoup and store the data in a pandas dataframe.

It's generally a good idea to download and process the data in separate cells, so you don't repeat the download unnecessarily.

<details><summary><u>Hint</u></summary>
<p>

* The table is contained in the `<table>` tag. The table rows are inside `<tr>` tags, the table cells are represented by `<td>` tags.
* The column names are in the first `<tr>` tag. Iterate over the rest of the `<tr>` tags to fetch the rows, the first `<td>` is the row index, the rest is the data.
* Collect the table data as a list of lists, convert it to a dataframe, and rename the columns and rows.
* You can use slices such as `soup.find_all(...)[1:]` to exclude tags as needed.
   
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
webpage = urllib.request.urlopen("https://nssdc.gsfc.nasa.gov/planetary/factsheet/")
soup = BeautifulSoup(webpage,"lxml")   
 
colnames = []
for td in soup.table.tr.find_all('td')[1:]:
    colnames.append(td.text.strip())
colnames

rownames = []
rows =[]
for tr in soup.table.find_all('tr')[1:-1]:
    rownames.append(tr.td.text.strip())
    rows.append([])
    for td in tr.find_all('td')[1:]:
        rows[-1].append(td.text.strip())

df=pd.DataFrame(rows)
df.columns=colnames
df.set_index(pd.Index(rownames),inplace=True)
df
```
    
</p>
</details>

For good measure, create a plot showing the relationship between mean temperature and distance from the Sun.

<details><summary><u>Hint</u></summary>
<p>

You can directly use matplotlib's `plt.plot()`. Or you can use pandas plot, but `df.plot(...)` plots columns, not rows; one solution is to transpose your dataframe using `df.T`.
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
df.T.plot(kind='scatter',x="Distance from Sun (106 km)", y="Mean Temperature (C)")
```
    
</p>
</details>

### 14 Planets part 2

Now do the same thing using `pd.read_html()` (check out the documentation for details).

<details><summary><u>Hint</u></summary>
<p>

Note that `pd.read_html()` returns a list of dataframes, one for each table in the html file.
    
One option to drop the rows that are labelled `nan` is
```python
df = df.loc[df.index.dropna(),:] 
```
</p>
</details>

In [None]:
import pandas as pd
df = pd.read_html("https://nssdc.gsfc.nasa.gov/planetary/factsheet/")[0]


<details><summary><u>Solution.</u></summary>
<p>
    
```python
df = pd.read_html("https://nssdc.gsfc.nasa.gov/planetary/factsheet/")[0]

df.set_index(0, inplace=True)
df.columns = df.iloc[0]
df = df.loc[df.index.dropna(),:]
df.index.name=None
df.columns.name = None

df
```
    
</p>
</details>

So why don't we always use pandas?
* To learn about html and its parsing
* Not all data on websites comes in tables
* Not all html is clean enough and `pd.read_html()` might not work as expected
* etc

### 15 Christmas

Write a function `isitChristmas()` that prints "no" if it is not Christmas and "yes" if it is, by scraping the
https://isitchristmas.today/ website.

**There is a catch:**<br>
When your browser sends a request to a server to download a page it also sends additional information about itself in the header of the request. Some websites block requests that do not have such information, and would not allow your script to access the contents. Luckily you can easily fake this by using
```python
#construct a request with the additional header
req = urllib.request.Request("https://isitchristmas.today/",headers={'User-Agent':'Mozilla/5.0'})
#send the request and download the site
webpage = urllib.request.urlopen(req)
```

<details><summary><u>Hint</u></summary>
<p>

Visit the website with your browser and open the source code to see what tag is used to display the answer.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
def isitChristmas():
    req = urllib.request.Request("https://isitchristmas.today/",headers={'User-Agent':'Mozilla/5.0'})
    webpage = urllib.request.urlopen(req)
    soup = BeautifulSoup(webpage,"lxml") 
    
    
    if soup.html.body.h2.string=="No!":
        print("no")
    else:
        print("yes")
    
    return

isitChristmas()
```
    
</p>
</details>

Of course you could have done this a lot easier using datetime.

### 16 GIFs

Write a function that takes a string containing a URL of a website as input and returns the number of `gif` images on the website.

<details><summary><u>Hint</u></summary>
<p>

Look for `<img>` tags and their `src` attribute in the HTML files.
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
def gifcounter(url):
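    # urlopen has no headers argument, so the header goes into a Request object
    req = urllib.request.Request(url,headers={'User-Agent':'Mozilla/5.0'})
    webpage = urllib.request.urlopen(req)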
    soup = BeautifulSoup(webpage,"lxml") 
    gifcount = 0
    for img in soup.find_all("img"):
        if img.get('src','')[-3:]=='gif': # get avoids a KeyError for <img> tags without src
            gifcount+=1
        
    return gifcount

# or
def gifcounter(url):
    webpage = urllib.request.urlopen(url)
    soup = BeautifulSoup(webpage,"lxml")
        
    return len(soup.find_all(lambda tag: tag.name=='img' and tag.get('src','')[-3:]=='gif'))
    
print(gifcounter("http://posfaim.web.elte.hu/example.html"))
print(gifcounter("http://google.com"))
print(gifcounter("https://index.hu/velemeny/jegyzet/folio/"))
```
    
</p>
</details>

### 17 The gray lady

Scrape the front page of the New York Times and retrieve the article titles.

It is not necessarily a trivial task to extract information robustly from a website. There is a lot of html and other code in the file that you download that is responsible for the visuals of the website, and this code is not meant to be read by a human directly. You can try to figure out a method yourself, but clicking on the hint reveals a possible approach.

<details><summary><u>Hint</u></summary>
<p>

* Each article title is a link represented as the `<a>` tag in html
* There are many other links on the webpage, but article title links have an attribute `data-story` and the value of this attribute always starts with `"nyt://article"`
    
Write a function that takes a tag as input, returns `True` if it matches the above criteria, and `False` if it does not. Use this function together with `find_all` to identify all titles.

(The above criteria are not entirely accurate: they will count some article descriptions as titles. To get better results you can extend the criteria to only consider `<a>` tags that contain at least one `<h3>` tag. The last `<h3>` tag contains the title.)
    
</p>
</details>

<details><summary><u>Solution.</u></summary>
<p>
    
```python
req = urllib.request.Request("https://nytimes.com",headers={'User-Agent':'Mozilla/5.0'})
webpage = urllib.request.urlopen(req)
soup = BeautifulSoup(webpage,"lxml")

#solution 1
def is_article(tag):
    if tag.name=='a' and tag.get('data-story'):
        if 'nyt://article' in tag['data-story']:
            return True
    return False

titles = []
for story in soup.find_all(is_article):
        titles.append(story.text)
titles[:3]
    
#solution 2
def is_article(tag):
    if tag.name=='a' and tag.get('data-story'):
        if 'nyt://article' in tag['data-story']:
            if tag.find('h3'):
                return True
    return False

titles = []
for story in soup.find_all(is_article):
        titles.append(story.find_all('h3')[-1].text)
titles[:3]
```
    
</p>
</details>

Pick a word and calculate the percentage of the articles that contain this word in the title, ignore capitalization of the letters.

<details><summary><u>Hint</u></summary>
<p>

Convert the word and the title to lower case using the `lower()` string method.
    
You can use
```python
   if word in title:
       ...
```
but this will also include cases when a word is contained inside a longer word, e.g., "is" is inside "miserable". You can get a more accurate count using regular expressions (check out `\b`).

</p>
</details>
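
The difference between the substring test and the `\b` word boundary can be seen on a toy example. The helper name `contains_word` and the test sentences are made up for this sketch; `re.escape` is added here as a precaution in case the word contains characters that have a special meaning in regular expressions.

```python
import re

def contains_word(word, title):
    """Hypothetical helper: True if `word` appears in `title` as a whole word."""
    # \b matches the boundary between a word character and a non-word character
    pattern = r"\b" + re.escape(word.lower()) + r"\b"
    return re.search(pattern, title.lower()) is not None

title = "A miserable mist"
print("is" in title.lower())       # substring test
print(contains_word("is", title))  # whole-word test
print(contains_word("is", "This IS it"))
```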

<details><summary><u>Solution.</u></summary>
<p>
    
```python
#solution 1
word = 'Biden'
n = 0
for t in titles:
    if word.lower() in t.lower():
        n+=1
        print(t)
print(word,"count:",n,"ratio:", n/len(titles))  
    
#solution 2
import re
word = "Biden"
n = 0
for t in titles:
    if re.search(r"\b"+word.lower()+r"\b",t.lower()):
        n+=1
print(n,n/len(titles))
```
    
</p>
</details>

### 18 Wiki

Write a function that takes the title of an English Wikipedia article as input and returns a list of Wikipedia articles that are linked by the page. For example, `"Dog"` has links to `"Domesticated"`, `"Wolf"` and many others.

<details><summary><u>Hint</u></summary>
<p>

Look for tags that
* are links `<a>`,
* have an attribute `href`
* the value of `href` starts with `"/wiki/"` (you can use the `startswith()` string method),
* and value of `href` does not contain the character `":"` (if it does contain `":"`, it links to a category or a media file, not an article).

An article might be linked twice! Only collect unique article names.
    
</p>
</details>
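
The string tests from the hint can be tried out without downloading anything. The sketch below applies them to a made-up list of `href` values; the helper name `is_article_href` is an assumption of this example.

```python
def is_article_href(href):
    """Hypothetical helper: True if href looks like a link to a Wikipedia article."""
    return href.startswith("/wiki/") and ":" not in href

# Made-up href values for illustration
hrefs = ["/wiki/Wolf", "/wiki/File:Dog.jpg", "https://example.org", "/wiki/Wolf", "/wiki/Cat"]

articles = []
for href in hrefs:
    name = href[len("/wiki/"):]  # drop the "/wiki/" prefix
    if is_article_href(href) and name not in articles:
        articles.append(name)

print(articles)
```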

<details><summary><u>Solution.</u></summary>
<p>
    
```python
def iswikilink(tag):
    if tag.name=='a' and tag.get('href'):
        if tag['href'].startswith('/wiki/') and ":" not in tag['href']:
            return True
    return False

def wikilinks(page):
    webpage = urllib.request.urlopen("https://en.wikipedia.org/wiki/"+page)
    soup = BeautifulSoup(webpage,"lxml")
    
    links = []
    for link in soup.find_all(iswikilink):
        article = link['href'][6:]
        if article not in links: 
            links.append(article)
    return links

wikilinks("Dog")
```
    
</p>
</details>

How many articles are one click away from "Ja-Da"? How many articles are two clicks away from "Ja-Da"?

<details><summary><u>Hint</u></summary>
<p>

* Get the list of articles one click away using the function you have just written.
* Apply the same function to each article in this list and collect all articles two steps away.
* Make sure you only count each article once.
    
</p>
</details>
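
The bookkeeping for the second step can be sketched offline with a toy link structure. The dictionary `links` and the function `neighbours` below are hypothetical stand-ins for Wikipedia and for the function you wrote above.

```python
# Toy link structure: page -> pages it links to (made up for illustration)
links = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A", "D", "E"]}

def neighbours(page):
    """Hypothetical stand-in for the link-collecting function."""
    return links.get(page, [])

one_step = set(neighbours("A"))

# follow the links of every page that is one click away
two_step = set()
for page in one_step:
    two_step.update(neighbours(page))

print(len(one_step), len(two_step))
```

Using a `set` takes care of counting each article only once. Note that `two_step` may contain the starting page and pages that are also one click away; whether to exclude them depends on how you interpret the question.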

<details><summary><u>Solution.</u></summary>
<p>
    
```python
onestep = wikilinks("Ja-Da")
twostep = []
for word in onestep:
    twostep = twostep + wikilinks(word)
len(onestep),len(set(twostep))
```
    
</p>
</details>

## Final problem

Solve either Part I, Option 1 or Option 2. In Part II, plan your final project.


### Part I -- Option 1

Write a function that returns two lists containing the authors and the titles of the books on the New York Times best seller list.
* Download and parse the page from here: https://www.nytimes.com/books/best-sellers/combined-print-and-e-book-fiction/
* The book titles are encased in a tag with the attribute `itemprop="name"`.
* The authors are in a tag with the attribute `itemprop="author"`.





### Part I -- Option 2

Predicting exchange rates is a difficult problem, so it totally makes sense to get help from any possible source. Look for correlations between movements of celestial objects and currency exchange rates!

`weatherapi.com` has an astronomy API that can provide the illuminated portion of the moon for each day. It ranges from 0% (new moon) to 100% (full moon).

* Download the moon illumination for the last 100 days.
* Download all exchange rates for the last 100 days.
* Find the currency that has the highest correlation with the moon phases (illuminated portion of the moon).
* Create a plot that compares the moon illumination time series to this exchange rate time series (use [this type](https://matplotlib.org/devdocs/gallery/subplots_axes_and_figures/two_scales.html) of plot).

Hint:
* You download the exchange rates as a dictionary. You can collect the dictionaries corresponding to each day in a list `L`, and this list can be easily converted to a dataframe with
```python
df_rates = pd.DataFrame(L)
```
where the rows represent the days and the columns the different exchange rates.


### Part II

Plan your final project! Check the requirements of the final project on Moodle, and write a short paragraph describing your plans. Pick something that interests you and/or is useful for your research.