# Gathering data from the web | Mini Tasks

**Author:** Ties de Kok ([Personal Website](https://www.tiesdekok.com))  
**Last updated:** 15 March 2019  
**Python version:** Python 3.6 or 3.7   
**License:** MIT License  

## *Introduction*

In this notebook I will provide you with "tasks" that you can try to solve.  

Most of what you need is discussed in the tutorial notebooks, the rest you will have to Google (which is an important exercise in itself).

## *Tutorial notebooks you will need*

1) [`0_python_basics.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb)  


2) [`2_handling_data.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/2_handling_data.ipynb)  


3) [`4_web_scraping.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/4_web_scraping.ipynb)  

## Mini Tasks <br> ----------------

The goal of this mini-task is to get hands-on experience with gathering data from the Web using `Requests` and `Requests-HTML`.

The tasks below are split up into two sections:  

1. API tasks  

2. Web scraping tasks  

## Import required packages  
**Note:** make sure you have the `limperg-python` environment activated!

In [None]:
import time

In [1]:
import requests
from requests_html import HTMLSession

In [2]:
import pandas as pd
import numpy as np

## Also run the code below; it solves a couple of minor problems that you don't need to worry about

In [3]:
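from IPython.display import HTML
import time

def show_image(url):
    # The timestamp in the query string forces the browser to fetch a fresh image instead of a cached one
    return HTML('<img src="{}?{}"></img>'.format(url, int(time.time())))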

In [4]:
# Silence the warning that is shown when a request is made with `verify=False`
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

## API Task | basic <br> -----------------------

## 1) Use the `genderize.io` API with the `requests` library

Use this API: https://genderize.io/

**NOTE:** if you get a "too many requests" message, the API might be temporarily unavailable because its request limit has been reached.

### 1a) Use the API to automatically guess the gender of your first name

### 1b) Write a function called `guess_gender` that takes any first name as input and uses the API to return the predicted gender and probability
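
A minimal sketch of what such a function could look like, not the only way to solve it. It assumes the API is reachable at `https://api.genderize.io` with a `name` query parameter and that the JSON reply contains `gender` and `probability` keys; check the API documentation to confirm. The helper `parse_gender_response` is a name introduced here for illustration.

```python
import requests

# Assumption: base URL of the genderize.io API (verify against its documentation)
GENDERIZE_URL = "https://api.genderize.io"


def parse_gender_response(payload):
    """Return (gender, probability) from a decoded JSON reply.

    Missing keys become None, e.g. when the API does not know the name.
    """
    return payload.get("gender"), payload.get("probability")


def guess_gender(name, timeout=10):
    """Ask the API for the predicted gender of `name`."""
    response = requests.get(GENDERIZE_URL, params={"name": name}, timeout=timeout)
    # Raises an exception on error codes such as HTTP 429 ("too many requests")
    response.raise_for_status()
    return parse_gender_response(response.json())
```

Keeping the parsing in its own small function makes it possible to check the logic on a hand-written dictionary without calling the API.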

### 1c) Create a list of names, and use the `guess_gender` function to predict the gender of each name. Include a 1-second pause after each guess.  
**Hint:** *use the `time` library for the pause*
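
One possible way to structure the loop with a pause is sketched below. The helper name `apply_with_pause` and its parameters are hypothetical; any function that accepts a single name (such as your `guess_gender`) can be passed in.

```python
import time


def apply_with_pause(func, items, pause_seconds=1.0):
    """Call `func` on every item and wait `pause_seconds` after each call."""
    results = {}
    for item in items:
        results[item] = func(item)
        # Pause so that we do not flood the API with requests
        time.sleep(pause_seconds)
    return results
```

For example, `apply_with_pause(guess_gender, ["Anna", "Peter"])` would return a dictionary that maps each name to whatever `guess_gender` returns for it.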

## Web Scraping Task | Basic <br> --------------------------------------

## 2) Create a webscraper that can collect information for a Foster Faculty member

Your goal is to create a webscraper that can extract the following information from a Foster Faculty staff page (such as this one: https://foster.uw.edu/faculty-research/directory/david-burgstahler/ ):

* Name  
* URL to profile image  
* Title of first selected publication

**Hint 1:** use the `requests-html` library  
**Hint 2:** if you get an error mentioning SSL --> add `, verify=False` to the `session.get()` command like so: `session.get(.... , verify=False)`

### 2a) Use `requests-html` to extract the above three pieces of information from the Faculty page of David Burgstahler  
url = https://foster.uw.edu/faculty-research/directory/david-burgstahler/

---

**Tip:** you can show a picture from a URL in the notebook by using the provided `show_image(url)` function

### 2b) Create a function that takes a URL for a Staff page, extracts the three pieces of information, and returns them as a dictionary  
Make sure to test your function by feeding it with the URL for various staff members! A full list is available here:  
https://foster.uw.edu/faculty-research/academic-departments/accounting/faculty/   

**Warning:** make sure that the function can deal with faculty members that do not have a picture or any selected publications. Test it with (for example):   
https://foster.uw.edu/faculty-research/directory/jane-jollineau/
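
The sketch below shows one way to make the extraction robust against missing elements. The CSS selectors in `SELECTORS` are placeholders, not the real ones: inspect the page in your browser and replace them. The function names are also introduced here purely for illustration. The sketch relies on `find(..., first=True)` from `requests-html` returning `None` when nothing matches.

```python
# Hypothetical selectors: replace them after inspecting the actual page
SELECTORS = {
    "name": "h1",
    "image": "img.profile-photo",
    "publication": "div.publications li",
}


def first_text(html, selector):
    """Text of the first element matching `selector`, or None if there is none."""
    element = html.find(selector, first=True)
    return element.text if element is not None else None


def first_attr(html, selector, attr):
    """Attribute of the first element matching `selector`, or None if missing."""
    element = html.find(selector, first=True)
    return element.attrs.get(attr) if element is not None else None


def extract_profile(html):
    """Build the result dictionary from a requests-html `r.html` object."""
    return {
        "name": first_text(html, SELECTORS["name"]),
        "image_url": first_attr(html, SELECTORS["image"], "src"),
        "first_publication": first_text(html, SELECTORS["publication"]),
    }


def scrape_profile(url, verify=True):
    """Download a staff page and extract the three pieces of information."""
    # Imported here so the helpers above can be tried without requests-html
    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get(url, verify=verify)  # pass verify=False on SSL errors
    return extract_profile(r.html)
```

Because every lookup falls back to `None`, a page without a picture or publication list still yields a complete dictionary instead of raising an error.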

## API Task | advanced <br> -----------------------------

## 3) Get the current picture of a traffic camera using the `wsdot` API and `requests`

### 3a) Get access key

Go to " http://wsdot.com/traffic/api " in your browser.  
At the bottom of the page, enter any email address in the text field (e.g. test@test.com), copy the access key, and assign it to a Python variable.

### 3b) Retrieve the current picture for the `NE 45th St` traffic camera

See: https://www.wsdot.com/traffic/seattle/default.aspx?cam=1032#cam

The `CAMERAID` of the `NE 45th St` camera is: **1032**

---

**Tip:** you can show a picture from a URL in the notebook by using the provided `show_image(url)` function

You can retrieve the current picture of a traffic camera using the API described here:     
http://wsdot.com/traffic/api/HighwayCameras/HighwayCamerasREST.svc/help/operations/GetCameraAsJson
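
A rough sketch of how the request could be put together. Everything specific in it is an assumption to verify against the help page above: the endpoint (derived here by dropping `/help/operations` from the help URL), the query parameter names `AccessCode` and `CameraID`, and the `ImageURL` key in the reply. The function names are hypothetical.

```python
import requests

# Assumption: endpoint derived from the help page URL; confirm it there
CAMERA_ENDPOINT = (
    "http://wsdot.com/traffic/api/HighwayCameras/"
    "HighwayCamerasREST.svc/GetCameraAsJson"
)


def build_camera_params(access_code, camera_id):
    """Query parameters for the request (names are assumptions)."""
    return {"AccessCode": access_code, "CameraID": camera_id}


def extract_image_url(payload):
    """Pull the picture URL out of the decoded JSON reply, or None if absent."""
    return payload.get("ImageURL")


def get_camera_image_url(access_code, camera_id, timeout=10):
    """Ask the API for a camera and return the URL of its current picture."""
    response = requests.get(
        CAMERA_ENDPOINT,
        params=build_camera_params(access_code, camera_id),
        timeout=timeout,
    )
    response.raise_for_status()
    return extract_image_url(response.json())
```

The returned URL can then be passed to `show_image(url)` to display the picture in the notebook.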

## Web Scraping Task | Advanced <br> ---------------------------------------------

## 4) Create a webscraper that creates an Excel sheet with information for all Foster (UW) Faculty members in Accounting

### 4a) Create a list of URLs for all the Foster faculty members in Accounting  
This information is here: https://foster.uw.edu/faculty-research/academic-departments/accounting/faculty/

**Hint 1:** use the `requests-html` library  
**Hint 2:** if you get an error mentioning SSL --> add `, verify=False` to the `session.get()` command like so: `session.get(.... , verify=False)`
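
One way to approach this is to collect every link on the listing page and keep only those that look like profile pages. The sketch below assumes that profile URLs contain `/faculty-research/directory/` followed by a name, as in the example URLs above; the function names are introduced here for illustration.

```python
# Assumption: every profile URL contains this path fragment followed by a name
PROFILE_MARKER = "/faculty-research/directory/"


def filter_profile_links(links, marker=PROFILE_MARKER):
    """Keep only links that point to an individual profile page."""
    profile_links = set()
    for link in links:
        # The part after the marker is the person's slug, e.g. "david-burgstahler/"
        _, found, slug = link.partition(marker)
        if found and slug.strip("/"):
            profile_links.add(link)
    # Sort so that the result is the same on every run
    return sorted(profile_links)


def get_profile_links(listing_url, verify=True):
    """Download the listing page and return the profile URLs found on it."""
    # Imported here so the filter above can be tried without requests-html
    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get(listing_url, verify=verify)  # pass verify=False on SSL errors
    # `absolute_links` is the set of all links on the page as full URLs
    return filter_profile_links(r.html.absolute_links)
```

Note that this keeps any directory link on the page, so check the result for people who are not part of the Accounting department (for example, links in a sidebar or footer).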

### 4b) Apply the function you created in step 2b to all the URLs you gathered in step 4a and save it all (including the URL) to a Pandas DataFrame
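
A possible skeleton for this step is shown below. The helper name `build_profile_table` is hypothetical, and `scrape_func` stands for the function you wrote in step 2b (anything that takes a URL and returns a dictionary).

```python
import pandas as pd


def build_profile_table(urls, scrape_func):
    """Apply `scrape_func` to every URL and collect the results in a DataFrame."""
    rows = []
    for url in urls:
        # Start each row with the URL, then add the scraped fields
        row = {"url": url}
        row.update(scrape_func(url))
        rows.append(row)
    return pd.DataFrame(rows)
```

The resulting DataFrame can be written to an Excel file with `df.to_excel("faculty.xlsx", index=False)`; the file name is just an example, and writing `.xlsx` files requires the `openpyxl` package.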


# <span style='color: red'>!EXTRA!</span> Time left? 

If you have finished all of the above tasks, I recommend familiarizing yourself with `Selenium`, as it often turns out to be useful as well. 

You can set it up and follow the steps here: https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/4_web_scraping.ipynb#Selenium