# Gathering data from the web - Problems

**Author:** Ties de Kok ([Personal Website](https://www.tiesdekok.com))  <br>
**Last updated:** September 2021  
**Python version:** Python 3.6+     
**Recommended environment: `researchPython`**

In [1]:
import os
recommendedEnvironment = 'researchPython'
# .get() avoids a KeyError when Jupyter was not started from a conda environment
if os.environ.get('CONDA_DEFAULT_ENV') != recommendedEnvironment:
    print('Warning: it does not appear you are using the {0} environment, did you run "conda activate {0}" before starting Jupyter?'.format(recommendedEnvironment))

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: left; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>Introduction</span>
</div>

<div style='border-style: solid; padding: 5px; border-color: darkred; border-width:5px;  text-align: center; margin-left: 100px; margin-right:100px;'>
<span style='color:black; font-size: 20px; font-weight:bold;'> Make sure to open up the respective tutorial notebook(s)! <br> That is what you are expected to use as primary reference material. </span>
</div>

### Relevant tutorial notebooks:

1) [`0_python_basics.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb)  


2) [`2_handling_data.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/2_handling_data.ipynb)  


3) [`4_web_scraping.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/4_web_scraping.ipynb)  

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: center; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>Part 1 </span>
</div>  

<div style='border-style: solid; padding: 5px; border-color: darkred; border-width:5px;  text-align: center; margin-left: 100px; margin-right:100px;'>
<span style='color:black; font-size: 15px; font-weight:bold;'> Note: feel free to add as many cells as you'd like to answer these problems, you don't have to fit it all in one cell. </span>
</div>

The goal of these problems is to get hands-on experience with gathering data from the Web using `Requests` and `Requests-HTML`.

The tasks below are split up into two sections:  

1. API tasks  

2. Web scraping tasks  

## Import required packages  

In [33]:
import os
from pathlib import Path
import requests
from requests_html import HTMLSession
import lxml.html

In [3]:
import pandas as pd
import numpy as np

### Also run the code below, it solves a couple of minor problems that you don't need to worry about

In [4]:
from IPython.display import HTML
import time
def show_image(url):
    return HTML('<img src="{}?{}"></img>'.format(url, int(time.time())))

In [5]:
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

<div style='border-style: solid; padding: 5px; border-color: darkred; border-width:5px;  text-align: center; margin-left: 100px; margin-right:100px; margin-bottom:50px; margin-top:50px;'>
<span style='color:black; font-size: 20px; font-weight:bold;'> Warning: if you are using Python 3.8 or 3.9, you might run into a bug in requests-html where the ".find()" function returns too much text even if the CSS selector is correct. You can check whether you are affected using the code below. If it shows an error, please follow the alternative instructions using requests + LXML.</span>
</div>

In [6]:
def check_requests_issue():
    page = 'https://foster.uw.edu/faculty-research/directory/david-burgstahler/'
    name_selector = '#foster-content h2.entry-title'
    correct_string = 'David Burgstahler'

    session = HTMLSession()
    res = session.get(page)
    name_element = res.html.find(name_selector, first=True)
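    # find(..., first=True) returns None when nothing matches, so guard before reading .text
    found_text = name_element.text if name_element is not None else None
    if found_text == correct_string:
        print('Your system does not have the bug, you are good to use requests-html!')
    else:
        raise AssertionError("Error, your system has the bug. :(")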

In [7]:
check_requests_issue()

AssertionError: Error, your system has the bug. :(

**If the check above raised an error saying that your system has the bug, please use `requests` + `lxml` as an alternative to `requests-html`.**

These instructions are also included in my reference notebook: https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/4_web_scraping.ipynb#ws-lxml

Using `requests` + `lxml` generally looks something like this:

```python
### First you "download" the page using requests:
res = requests.get('https://foster.uw.edu/faculty-research/directory/david-burgstahler/')

### Then you parse the HTML using LXML:
html = lxml.html.fromstring(res.text) 

### Then you run html.cssselect to extract the information you want:
result = html.cssselect('#foster-content h2.entry-title') ## Here ".cssselect()" is equivalent to ".find()"
print(result[0].text)

## --> David Burgstahler
```

Some differences to note between `requests-html` and `requests` + `lxml`:

- In `requests-html` you can do `.find(.. , first=True)`, the equivalent in `lxml` is `.cssselect(...)[0]`  
- In `requests-html` you get attributes using `.attrs`, the equivalent in `lxml` is `.attrib`

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: left; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>API Problem</span>
</div>

## 1) Use the `genderize.io` API with the `requests` library

Use this API: https://genderize.io/

**NOTE:** if you get a "too many requests" message, the API is temporarily refusing your requests. In that case, just come back to it a little later.

### 1a) Use the API to automatically guess the gender of your first name

### 1b) Write a function called `guess_gender` that takes any first name as input and uses the API to return the predicted gender and probability

Note: make sure you include `return` in your function!
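
One possible shape for such a function is sketched below. It is not the only way to solve the problem. It assumes the API is reachable at `https://api.genderize.io` with a `name` query parameter and that the JSON response contains `gender` and `probability` fields, so check the API documentation to confirm. `parse_prediction` is a hypothetical helper name used only for this illustration.

```python
import requests

# Assumed endpoint; verify against the documentation on https://genderize.io/
API_URL = "https://api.genderize.io"

def parse_prediction(payload):
    """Pull the (gender, probability) pair out of a decoded JSON response."""
    # .get() returns None instead of raising if a field is missing
    return payload.get("gender"), payload.get("probability")

def guess_gender(name):
    """Ask the API about one first name and return (gender, probability)."""
    res = requests.get(API_URL, params={"name": name}, timeout=10)
    res.raise_for_status()  # raise an error on HTTP failures such as 429
    return parse_prediction(res.json())
```

Splitting the parsing from the request makes it easy to test the parsing on a hand-written dictionary without calling the API.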

### 1c) Create a list of names, and use your `guess_gender` function from 1b to predict the gender of each name. Include a 1-second pause after each guess.  
**Hint:** *use the `time` library for the pause*
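
A minimal sketch of the pause pattern is shown below. `guess_all` and `fake_guess` are hypothetical names. `fake_guess` stands in for your real API function so that the sketch runs without an internet connection, and the pause is shortened for the demonstration.

```python
import time

def guess_all(names, guess_func, pause=1.0):
    """Call guess_func for every name, sleeping `pause` seconds after each call."""
    results = {}
    for name in names:
        results[name] = guess_func(name)
        time.sleep(pause)  # be polite to the API between requests
    return results

def fake_guess(name):
    """Stand-in for the real API call (an assumption for this sketch)."""
    return ("unknown", 0.0)

predictions = guess_all(["Alex", "Sam"], fake_guess, pause=0.01)
print(predictions)
```

With your own function you would pass it in place of `fake_guess` and keep the default one-second pause.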

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: left; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>Web Scraping Problem</span>
</div>

## 2) Create a webscraper that collects information for a Foster Faculty member

Your goal is to create a webscraper that can extract the following information from a Foster Faculty staff page (such as this one: https://foster.uw.edu/faculty-research/directory/david-burgstahler/ ):

* Name  
* URL to profile image  
* Title of first selected publication

**Hint 1:** If you don't have the bug --> use the `requests-html` library, otherwise use `requests` + `lxml`. 

**Hint 2:** if you use `requests-html` and get an error mentioning SSL --> add `, verify=False` to the `session.get()` command like so: `session.get(.... , verify=False)`

### 2a) Extract the above three pieces of information from the Faculty page of David Burgstahler  
url = https://foster.uw.edu/faculty-research/directory/david-burgstahler/

---

**Tip:** you can show a picture from a URL in the notebook by using the provided `show_image(url)` function.

### 2b) Create a function that takes a URL for a Staff page and extracts the three pieces of information and returns it as a dictionary  
Make sure to test your function by trying it with several different URLs.  
A full list is available here:  
https://foster.uw.edu/faculty-research/academic-departments/accounting/faculty/   

**Warning:** make sure that the function can deal with faculty members who do not have a picture or any selected publications. Test it with (for example):   
https://foster.uw.edu/faculty-research/directory/jane-jollineau/
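
One way to handle missing elements is sketched below. With `requests` + `lxml`, `.cssselect(...)` returns a list, so indexing with `[0]` raises an `IndexError` when nothing matched. With `requests-html`, `.find(..., first=True)` returns `None` in that case. `first_or_none` is a hypothetical helper name, and the strings below are placeholders for real elements.

```python
def first_or_none(matches):
    """Return the first match, or None when the selector matched nothing."""
    return matches[0] if matches else None

# Placeholders standing in for selector results (assumptions for this sketch)
with_picture = ["<img element>"]
without_picture = []

print(first_or_none(with_picture))     # the first (and only) placeholder
print(first_or_none(without_picture))  # None instead of an IndexError
```

In your function you can then check `if element is not None:` before reading the text or attributes, and store `None` in the dictionary otherwise.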

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: left; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>Part 2: Advanced Functionality</span>
</div>

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: left; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>API Problem</span>
</div>

## 3) Get current picture of traffic camera using the `wsdot` API and `requests`

### 3a) Get access key

Go to " http://wsdot.com/traffic/api " in your browser.  
At the bottom of the page type a random email address in the text field (e.g. test@test.com) and copy the access key and assign it to a Python variable.

### 3b) Retrieve current picture of traffic camera for the `NE 45th St` camera

You can see all the various cameras here: https://www.wsdot.com/traffic/seattle/default.aspx?cam=1032#cam

The `CAMERAID` of the `NE 45th St` camera is: **1032**

---

**Tip:** you can show a picture from a URL in the notebook by using the provided `show_image(url)` function.   
**Note:** This only works if you provide a direct URL to the image (i.e., a URL that ends with something like .jpg or .png).

![image.png](attachment:f573e33d-f13b-4990-adda-6f2a87d77a10.png)

<div style='border-style: solid; padding: 5px; border-color: darkred; border-width:5px;  text-align: center; margin-left: 100px; margin-right:100px;'>
<span style='color:black; font-size: 15px; font-weight:bold;'> Note: use the API, don't scrape the webpage! </span>
</div>

You can retrieve the current picture of a traffic camera using the API described here:     
http://wsdot.com/traffic/api/HighwayCameras/HighwayCamerasREST.svc/help/operations/GetCameraAsJson

### 3c) Save the image to your computer

There are many ways to do this, but for a pure `requests` solution see: [link](https://kite.com/python/answers/how-to-download-an-image-using-requests-in-python#:~:text=Use%20requests.,write%2Dand%2Dbinary%20mode.)
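
As a rough sketch of the file-writing half of this step: the body of an image response is raw bytes (available as `res.content` in `requests`), and those bytes can be written to disk as-is. `save_image` and the example path are hypothetical names chosen for this illustration.

```python
from pathlib import Path

def save_image(content, target):
    """Write raw image bytes to `target`, creating parent folders if needed."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)  # binary write, no text encoding involved
    return target

# In the notebook the bytes would come from a request, for example:
#   res = requests.get(image_url)
#   save_image(res.content, "images/camera_1032.jpg")
```

Returning the path makes it easy to confirm afterwards where the file ended up.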

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: left; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>Web Scraping Problem</span>
</div>

## 4) Create a webscraper that creates an Excel sheet with information for all Foster (UW) Faculty members in Accounting

### 4a) Create a list of URLs for all the Foster faculty members in Accounting  
This information is here: https://foster.uw.edu/faculty-research/academic-departments/accounting/faculty/

**Hint 1:** If you don't have the bug --> use the `requests-html` library, otherwise use `requests` + `lxml`. 

**Hint 2:** if you use `requests-html` and get an error mentioning SSL --> add `, verify=False` to the `session.get()` command like so: `session.get(.... , verify=False)`

### 4b) Apply the function you created in step 2b to all the URLs you gathered in step 4a and save it all (including the URL) to a Pandas DataFrame

If you got `TQDM` to work this would be a good time to use it: 

```python
from tqdm.notebook import tqdm

for i in tqdm(range(100)):
    time.sleep(0.5)
```
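
The step from a list of dictionaries to a DataFrame could look roughly like the sketch below. `collect_profiles` and `fake_scrape` are hypothetical names, the dictionary keys are made up for illustration, and `fake_scrape` stands in for your function from step 2b so that the sketch runs offline.

```python
import pandas as pd

def collect_profiles(urls, scrape_func):
    """Run scrape_func on every URL and combine the results into a DataFrame."""
    rows = []
    for url in urls:
        row = dict(scrape_func(url))  # copy so the scraped dict is not modified
        row["url"] = url              # keep track of where each row came from
        rows.append(row)
    return pd.DataFrame(rows)

def fake_scrape(url):
    """Stand-in for the real scraper (an assumption for this sketch)."""
    return {"name": "Example Person", "image_url": None, "first_publication": None}

df = collect_profiles(["https://example.com/a", "https://example.com/b"], fake_scrape)
print(df)

# Writing to Excel needs an Excel writer such as openpyxl to be installed:
#   df.to_excel("faculty.xlsx", index=False)
```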

<div style='border-style: solid; padding: 10px; border-color: black; border-width:5px;  text-align: left; margin-top:20px; margin-bottom: 20px;'>
<span style='color:black; font-size: 30px; font-weight:bold;'>Part 3: Extra, not required for credit.</span>
</div>

**Note:** You don't have to complete part 3 if you are handing in the problems for credit.  

------

## 5) Create a function that retrieves all the sport events in Seattle for a given date range

https://visitseattle.org/ maintains an event calendar for events in Seattle.  

You can find the sports events at this page:  
https://visitseattle.org/?s&frm=events&event_type%5B0%5D=sports

**Task:** create a function that takes a starting date and an end date and returns the following information about the sports events:

* Title  
* Link  
* Location  
* Date info  

**Bonus task:** make sure that your scraper can deal with the results being presented across multiple pages!
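
A generic pagination loop is sketched below. It assumes result pages can be requested by page number and that a page past the end comes back empty. The actual site may instead expose a "next" link, so inspect it first. `collect_all_pages`, `fake_fetch`, and `fake_site` are hypothetical names, and `fake_fetch` stands in for a function that downloads and parses one results page.

```python
def collect_all_pages(fetch_page, max_pages=50):
    """Request page 1, 2, 3, ... and stop at the first page without results."""
    events = []
    for page_number in range(1, max_pages + 1):
        page_events = fetch_page(page_number)
        if not page_events:
            break  # an empty page means we have run past the last one
        events.extend(page_events)
    return events

# Pretend website with two pages of results (an assumption for this sketch)
fake_site = {1: ["event A", "event B"], 2: ["event C"]}

def fake_fetch(page_number):
    return fake_site.get(page_number, [])

all_events = collect_all_pages(fake_fetch)
print(all_events)
```

The `max_pages` cap is a safety net so that a parsing mistake cannot turn into an endless stream of requests.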

![image.png](attachment:41ecf795-766c-4edf-b308-c961cd6ef560.png)