# Web Scraping Mini Task

**Author:** Ties de Kok ([Personal Website](http://www.tiesdekok.com))  
**Last updated:** 15 May 2018  
**Python version:** Python 3.6  
**License:** MIT License  

## *Introduction*

In this notebook I will provide you with "tasks" that you can try to solve.  

Most of what you need is discussed in the tutorial notebooks; the rest you will have to Google (which is an important exercise in itself).

## *Relevant notebooks*

1) [`0_python_basics.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/0_python_basics.ipynb)  


2) [`2_handling_data.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/2_handling_data.ipynb)  


3) [`4_web_scraping.ipynb`](https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/4_web_scraping.ipynb)  

## Web Scraping Mini Task <br> -----------------------------------

The goal of this mini-task is to get hands-on experience with gathering data from the Web using `Requests` and `Requests-HTML`.

The tasks below are split up into three sections:  

1. API tasks  

2. Web scraping tasks  

3. *Extra challenge:* HTTP requests

## Import required packages

In [48]:
import pandas as pd  # used later to build the staff DataFrames
import requests
from requests_html import HTMLSession

## API Tasks <br> --------------

### Retrieve the current price of the "Dogecoin" cryptocurrency in Euros

You can use the cryptonator API: https://www.cryptonator.com/api

In [8]:
requests.get("https://api.cryptonator.com/api/ticker/doge-eur").json()

{'error': '',
 'success': True,
 'ticker': {'base': 'DOGE',
  'change': '0.00000819',
  'price': '0.00371951',
  'target': 'EUR',
  'volume': ''},
 'timestamp': 1526492941}

### Follow up: Create a function that retrieves the current price in Euros for a given cryptocurrency "ticker"

Make sure that it can handle invalid tickers and HTTP errors (*hint:* use `.status_code`)

In [33]:
def retrieve_crypto_price(ticker):
    res = requests.get("https://api.cryptonator.com/api/ticker/{}-eur".format(ticker))
    if res.status_code == 200:
        res_json = res.json()
        if res_json['success']:
            return res_json['ticker']['price']
        else:
            return 'Invalid Ticker'
    else:
        return 'HTTP Error: {}'.format(res.status_code)

In [34]:
res = retrieve_crypto_price('doge')
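
The function above returns either a price string or an error message string, so the caller has to tell the two apart. As a minimal sketch of an alternative, assuming the response layout shown in the Dogecoin output above, you could parse the JSON payload in a separate helper that returns a float or raises an error. The name `parse_price` and the sample error text are hypothetical; the helper can be tried without making a request.

```python
def parse_price(payload):
    """Return the price from a ticker payload as a float.

    Assumes the layout shown in the Dogecoin response above:
    {'success': bool, 'error': str, 'ticker': {'price': str, ...}}.
    """
    if not payload.get('success'):
        # Assumption: a failed lookup sets 'success' to False and may
        # describe the problem in 'error'.
        raise ValueError(payload.get('error') or 'Invalid ticker')
    # The API returns the price as a string, so convert it explicitly.
    return float(payload['ticker']['price'])


# Sample payload copied from the output above, so no request is needed.
sample = {'error': '', 'success': True,
          'ticker': {'base': 'DOGE', 'change': '0.00000819',
                     'price': '0.00371951', 'target': 'EUR', 'volume': ''},
          'timestamp': 1526492941}
price = parse_price(sample)
print(price)
```

Keeping the parsing separate from the HTTP call also makes it easier to test the invalid-ticker branch by hand.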

### Write a function that takes an artist and song title and returns the lyrics

Use this API: http://docs.lyricsovh.apiary.io/#reference/0/lyrics-of-a-song/search

In [40]:
def get_lyrics(artist, title):
    url = "https://api.lyrics.ovh/v1/{}/{}".format(artist, title)
    res = requests.get(url)
    return res.json()['lyrics']

In [41]:
print(get_lyrics('Coldplay', 'Adventure of a Lifetime')[:120])

Turn your magic on, Umi she'd say
Everything you want's a dream away
We are legends, every day

That's what she told 
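
Artist names and titles can contain characters such as `/` or `?` that would change the meaning of the URL path if they were pasted in as-is. The sketch below percent-encodes both parts with the standard library before building the URL. The helper name `build_lyrics_url` is hypothetical, and whether the API accepts every encoded character is an assumption you would need to verify yourself.

```python
from urllib.parse import quote


def build_lyrics_url(artist, title):
    """Build the lyrics.ovh URL with both path segments percent-encoded."""
    # safe='' also encodes '/', which would otherwise split the path segment.
    return "https://api.lyrics.ovh/v1/{}/{}".format(
        quote(artist, safe=''), quote(title, safe=''))


url = build_lyrics_url('Coldplay', 'Adventure of a Lifetime')
print(url)
```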


### Write a function that guesses the gender based on first name

Use this API: https://genderize.io/

**NOTE:** if you get a "too many requests" message, the API may be temporarily unavailable to you; try again later.

In [42]:
def guess_gender(name):
    base_url = 'https://api.genderize.io/?'
    payload = {'name' : name}
    res = requests.get(base_url, params=payload)
    return res.json()

In [46]:
guess_gender('Ties')

{'count': 1, 'gender': 'female', 'name': 'Ties', 'probability': 1}
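
The response above reports a probability of 1, but it is based on a single observation (`count` is 1), so it is worth checking both fields before trusting the guess. The sketch below does that on the returned dictionary; the helper name `interpret_guess` and the threshold values are hypothetical choices, not part of the API.

```python
def interpret_guess(payload, min_probability=0.9, min_count=100):
    """Return the guessed gender, or None when the evidence is weak.

    Uses the 'gender', 'probability' and 'count' fields shown in the
    output above; the thresholds are arbitrary illustration values.
    """
    gender = payload.get('gender')
    if gender is None:
        return None
    if payload.get('probability', 0) < min_probability:
        return None
    if payload.get('count', 0) < min_count:
        return None
    return gender


# Sample payload copied from the output above.
sample = {'count': 1, 'gender': 'female', 'name': 'Ties', 'probability': 1}
print(interpret_guess(sample))               # too few observations
print(interpret_guess(sample, min_count=1))  # accepted with a lower bar
```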

## Web Scraping Task <br> ----------------------------

Your goal is to create a dataset with details for all the University of Bristol faculty and staff. 

This page serves as the starting point: http://www.bristol.ac.uk/efm/people/allstaff.html

Recommendation: use `requests-html`

### Step 1: write a function that can extract information from a staff member's profile page

For example: http://www.bristol.ac.uk/efm/people/mark-a-clatworthy/overview.html

Retrieve the following details:  

1. URL to profile picture  
2. Their department (based on their department link)
3. Latest publication

**Note:** Make sure it can handle people without publications / profile pictures / department links

*Hint 1:* this might be relevant: https://www.w3schools.com/cssref/sel_attribute_value_contains.asp
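
The selector from the hint, `[attribute*="value"]`, matches every element whose attribute contains the given substring. To show only the matching rule, the sketch below mimics `a[href*="bris.ac.uk"]` with the standard-library `html.parser` instead of `requests-html`. The class name `HrefContains` and the HTML snippet are made up for illustration.

```python
from html.parser import HTMLParser


class HrefContains(HTMLParser):
    """Collect <a> tags whose href contains a substring, like a[href*="..."]."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        href = dict(attrs).get('href')
        if tag == 'a' and href and self.needle in href:
            self.matches.append(href)


# Made-up HTML snippet, for illustration only.
html = '''
<div id="researcher-summary">
  <a href="http://www.bris.ac.uk/example-department/">Example Department</a>
  <a href="mailto:someone@example.com">Email</a>
</div>
'''
parser = HrefContains('bris.ac.uk')
parser.feed(html)
print(parser.matches)
```

Only the first link is kept, because the `mailto:` link does not contain the substring.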

In [102]:
session = HTMLSession()

In [126]:
def get_staff_details(url):
    res = session.get(url)

    profile_img_src_res = res.html.find('#portrait', first=True)
    if profile_img_src_res:
        profile_img_src = profile_img_src_res.attrs['src']
    else:
        profile_img_src = None
    
    
    departement_res = res.html.find('#researcher-summary a[href*="bris.ac.uk"]', first=True)
    if departement_res:
        departement = departement_res.text
    else:
        departement = None
    
    latest_pub_res = res.html.find('#researcher-publications li', first=True)
    if latest_pub_res:
        latest_pub = latest_pub_res.text
    else:
        latest_pub = None
    
    return {'img_url' : profile_img_src,
           'department' : departement,
           'latest_pub' : latest_pub}

In [127]:
get_staff_details('http://www.bristol.ac.uk/efm/people/mark-a-clatworthy/overview.html')

{'department': 'Department of Accounting and Finance',
 'img_url': 'http://dbms.ilrt.bris.ac.uk/media/pure/medium-287091.jpeg',
 'latest_pub': "Clatworthy, M & Lee, E, 2018, \x91Financial analysts' role in valuation and stewardship: Introduction\x92. Accounting and Business Research, vol 48., pp. 1-4"}

### Step 2: retrieve a list of all faculty and staff members

Save the following details:  
1. Name  
2. Job title  
3. Email  
4. Phone number  
5. **Link to their page**

Recommendation: make sure to end up with a Pandas DataFrame so that you can easily save it to an Excel sheet!
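
A list of dictionaries converts directly into a DataFrame, and keys that are missing from a row simply become empty (`NaN`) cells. The sketch below shows this with two rows; the second row deliberately omits `tel` for illustration. It writes CSV to an in-memory buffer so that it runs without extra packages. The commented `to_excel` line assumes an Excel writer such as `openpyxl` is installed, and the file name is hypothetical.

```python
import io

import pandas as pd

# Two example rows; the second one has no 'tel' key on purpose.
rows = [
    {'name': 'Professor Daniella Acker',
     'stafftablejob': 'Professor of Finance and Accounting',
     'tel': 'Tel. (0117) 39 41476'},
    {'name': 'Miss Tauheed Ali',
     'stafftablejob': 'Teaching Associate'},
]
df = pd.DataFrame(rows)  # missing keys become NaN

# Write to an in-memory buffer; pass a file name instead to save to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())

# To save an Excel sheet instead (assumes openpyxl is installed):
# df.to_excel('bristol_staff.xlsx', index=False)
```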

*Hint 1:* this might be relevant: https://www.w3schools.com/cssref/sel_attribute_value_contains.asp

In [49]:
session = HTMLSession()

In [51]:
res = session.get('http://www.bristol.ac.uk/efm/people/allstaff.html')

In [98]:
data = []
for row in res.html.find('.a-z-staff-table tr'):
    row_res = {}
    
    col_row_name = row.find('a[title*="View details"]', first=True)
    if col_row_name:
        row_res['href'] = col_row_name.attrs['href']
        row_res['name'] = col_row_name.text    
    
    for column in ['stafftablejob', 'email', 'tel']:
        col_row_res = row.find('.{}'.format(column), first=True)
        if col_row_res:
            row_res[column] = col_row_res.text
    
    if row_res:
        data.append(row_res)

In [100]:
bristol_staff_df = pd.DataFrame(data)

In [101]:
bristol_staff_df.head()

| | email | href | name | stafftablejob | tel |
|---|---|---|---|---|---|
| 0 | daniella.acker@bristol.ac.uk | /efm/people/daniella-e-acker/overview.html | Professor Daniella Acker | Professor of Finance and Accounting | Tel. (0117) 39 41476 |
| 1 | tauheed.ali@bristol.ac.uk | /efm/people/tauheed-ali/overview.html | Miss Tauheed Ali | Teaching Associate | Tel. (0117) 42 82236 |
| 2 | sophie.amor@bristol.ac.uk | /efm/people/sophie-amor/overview.html | Miss Sophie Amor | Postgraduate Admissions Administrator | Tel. (0117) 39 41487 |
| 3 | ra14611@bristol.ac.uk | /efm/people/rutvica-andrijasevic/overview.html | Dr Rutvica Andrijasevic | Senior Lecturer in Management | Tel. (0117) 954 6905 |
| 4 | d.andronoudis@bristol.ac.uk | /efm/people/dimos-andronoudis/overview.html | Dr Dimos Andronoudis | Lecturer in Accounting | Tel. (0117) 39 41504 |


### Step 3: run the function from step 1 on all the URLs gathered in step 2

**Note:** if it takes a long time to run you can also just run it on a small subset of the data from step 2.

In [130]:
staff_details = []
for href in bristol_staff_df.href:
    # href already starts with '/', so the domain must not end with one
    detail_res = get_staff_details('http://www.bristol.ac.uk' + href)
    detail_res['href'] = href
    staff_details.append(detail_res)
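
Gluing the domain and a relative link together by hand only works when the slashes happen to line up. As a small sketch, `urljoin` from the standard library resolves each link against the page it was found on. The two links are taken from the table in step 2, and slicing the list is one way to run on a small subset as the note above suggests.

```python
from urllib.parse import urljoin

# The page the links were scraped from acts as the base URL.
base_url = 'http://www.bristol.ac.uk/efm/people/allstaff.html'

# Relative links as stored in the 'href' column in step 2.
hrefs = ['/efm/people/daniella-e-acker/overview.html',
         '/efm/people/tauheed-ali/overview.html',
         '/efm/people/sophie-amor/overview.html']

# Resolve only the first two links to keep a test run short.
full_urls = [urljoin(base_url, href) for href in hrefs[:2]]
for url in full_urls:
    print(url)
```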

#### Bonus task:

Add the details to the initial dataframe that you created in Step 2

In [131]:
staff_details_df = pd.DataFrame(staff_details)

In [135]:
bristol_staff_df = pd.merge(bristol_staff_df, staff_details_df, on='href', how='left')

In [136]:
bristol_staff_df.head()

| | email | href | name | stafftablejob | tel | department | img_url | latest_pub |
|---|---|---|---|---|---|---|---|---|
| 0 | daniella.acker@bristol.ac.uk | /efm/people/daniella-e-acker/overview.html | Professor Daniella Acker | Professor of Finance and Accounting | Tel. (0117) 39 41476 | Department of Accounting and Finance | http://dbms.ilrt.bris.ac.uk/media/pure/medium-... | Acker, D, Orujov, A & Simpson, H, 2018, Polit... |
| 1 | tauheed.ali@bristol.ac.uk | /efm/people/tauheed-ali/overview.html | Miss Tauheed Ali | Teaching Associate | Tel. (0117) 42 82236 | Department of Accounting and Finance | http://dbms.ilrt.bris.ac.uk/media/user/260011/... | |
| 2 | sophie.amor@bristol.ac.uk | /efm/people/sophie-amor/overview.html | Miss Sophie Amor | Postgraduate Admissions Administrator | Tel. (0117) 39 41487 | Faculty Student Administration | http://dbms.ilrt.bris.ac.uk/media/images/photo... | |
| 3 | ra14611@bristol.ac.uk | /efm/people/rutvica-andrijasevic/overview.html | Dr Rutvica Andrijasevic | Senior Lecturer in Management | Tel. (0117) 954 6905 | Department of Management | http://dbms.ilrt.bris.ac.uk/media/pure/medium-... | Andrijasevic, R & Sacchetto, D, 2017, 'Disapp... |
| 4 | d.andronoudis@bristol.ac.uk | /efm/people/dimos-andronoudis/overview.html | Dr Dimos Andronoudis | Lecturer in Accounting | Tel. (0117) 39 41504 | Department of Accounting and Finance | http://dbms.ilrt.bris.ac.uk/media/pure/medium-... | |


## Extra Challenge Task <br> --------------------------------

Bristol City Council has a "Neighbourhood search" page:

https://www.bristol.gov.uk/my-neighbourhood-search

See if you can create a function that takes a search string and returns the points of interest that match it.  

Hint: look for words like "api" or "rest" in the results of the `NetworkSniffer` Chrome extension. 

In [132]:
api_endpoint = 'https://maps.bristol.gov.uk/csw_ac/Address.svc/rest/ADDRESS/SEARCH/TEXTUAL/{}/'

In [141]:
res = requests.get(api_endpoint.format('University of Bristol'))

In [143]:
res.json()['Matches'][0]

{'ADDRESS': 'University Of Bristol, Bristol Royal Infirmary, Marlborough Street',
 'LOCALITY': 'City Centre',
 'POSTCODE': 'BS2 8HW',
 'UPRN': '000000367623'}
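
The cells above can be combined into reusable pieces. The sketch below assumes the endpoint and the `Matches` layout shown in the output above. It percent-encodes the search text before placing it in the URL and reduces each match to an address and postcode pair. The helper names `build_search_url` and `summarise_matches` are hypothetical, and the sample payload is copied from the output so that no request is needed.

```python
from urllib.parse import quote

API_ENDPOINT = ('https://maps.bristol.gov.uk/csw_ac/Address.svc/rest/'
                'ADDRESS/SEARCH/TEXTUAL/{}/')


def build_search_url(text):
    """Insert the percent-encoded search text into the endpoint URL."""
    return API_ENDPOINT.format(quote(text, safe=''))


def summarise_matches(payload):
    """Return (address, postcode) pairs from a response dictionary."""
    return [(match.get('ADDRESS'), match.get('POSTCODE'))
            for match in payload.get('Matches', [])]


# Sample payload built from the first match shown above.
sample = {'Matches': [
    {'ADDRESS': 'University Of Bristol, Bristol Royal Infirmary, Marlborough Street',
     'LOCALITY': 'City Centre',
     'POSTCODE': 'BS2 8HW',
     'UPRN': '000000367623'}]}

print(build_search_url('University of Bristol'))
print(summarise_matches(sample))
```

With a live connection you would pass `requests.get(build_search_url(text)).json()` to `summarise_matches`.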