# Exercise Sheet \#3


## Exercise 1. Quotes: manual scraping

In this exercise, you are required to compile a dataset of biographies taken from http://quotes.toscrape.com.
Recall that this website displays 10 quotes per page, each with a link to its author's biography. The exercise is a step-by-step guide.

#### 1.1 Getting URLs of authors' pages

To get a list of URLs pointing at author pages, you will process quotes' pages. 

To do so, first complete the function `get_links` below, which expects as a parameter:

* `url` the URL of a page from quotes.toscrape.com

and returns:

* `authors` the list of links to author pages contained in the given quotes' page (beware of duplicates!)
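
The duplicate handling mentioned above can be tried out offline first. The following is a minimal sketch on hand-written sample data, not on a live page: the `hrefs` list and the helper name `unique_in_order` are made up for illustration. It relies on `dict.fromkeys`, which keeps the first occurrence of each key and preserves insertion order.

```python
# Hypothetical hrefs, as they might be extracted from one quotes page.
hrefs = [
    '/author/Albert-Einstein',
    '/author/J-K-Rowling',
    '/author/Albert-Einstein',
    '/author/Jane-Austen',
]

def unique_in_order(items):
    """Return items without duplicates, keeping first occurrences in order."""
    return list(dict.fromkeys(items))

authors_sample = unique_in_order(hrefs)
print(authors_sample)
```

An explicit loop with an `if href not in authors` test gives the same result and may be easier to read; the dictionary-based version just avoids rescanning the list for every link.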

In [1]:
import requests, re
from bs4 import BeautifulSoup

BASE_URL = 'http://quotes.toscrape.com'

def get_links(url):
    authors = []
    # Get page located at url:
    ua = {'User-agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=ua)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    #Get all links corresponding to authors:
    all_author_hrefs = soup.find_all(href=re.compile('/author/'))
    all_author_links = [href['href'] for href in all_author_hrefs]
    
    #Loop over these:
    for href in all_author_links:
    
        #if a link is not in authors, add it:
        if href not in authors:
            authors.append(href)
        
    #Return results
    return authors

#Test:
authors = get_links(BASE_URL)
print(authors)

[]


#### 1.2 Iterating over pages of quotes

In a second step, fill in the `collect` function below, which iteratively collects author links. This function takes the following input parameters:
- `url`: the starting url from which to collect links,
- `authors`: the list of links to be updated
- `limit`: the number of pages to visit (default being `None`, which means visit all pages)
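
The role of these three parameters can be sketched without any network access. The example below is only an illustration under stated assumptions: `FAKE_SITE`, `fake_fetch` and `collect_iter` are hypothetical names, and the in-memory dictionary stands in for downloading and parsing real pages. It uses a loop rather than recursion and `urllib.parse.urljoin` to turn a relative "next" link into an absolute URL.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "site":
# URL -> (author links on that page, href of the next page or None).
FAKE_SITE = {
    'http://example.test/': (['/author/A', '/author/B'], '/page/2/'),
    'http://example.test/page/2/': (['/author/B', '/author/C'], '/page/3/'),
    'http://example.test/page/3/': (['/author/D'], None),
}

def fake_fetch(url):
    """Stand-in for downloading and parsing a page (no network access)."""
    return FAKE_SITE[url]

def collect_iter(url, authors, fetch, limit=None):
    """Visit at most `limit` pages (all pages if None), adding new links to `authors`."""
    visited = 0
    while url is not None and (limit is None or visited < limit):
        links, next_href = fetch(url)
        for link in links:
            if link not in authors:
                authors.append(link)
        visited += 1
        # Resolve the relative "next" link against the current page URL.
        url = urljoin(url, next_href) if next_href else None
    return authors

two_pages = collect_iter('http://example.test/', [], fake_fetch, limit=2)
all_pages = collect_iter('http://example.test/', [], fake_fetch)
print(two_pages)
print(all_pages)
```

A loop avoids hitting Python's recursion limit on sites with many pages; for the handful of pages on quotes.toscrape.com, a recursive version works just as well.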

In [2]:

def collect(url, authors, limit=None):
    #Add links contained in page located at url to the authors being computed
    authors.extend([x for x in get_links(url) if x not in authors])
    #If no limit is given or limit > 1
    if limit is None or limit > 1:

        # Get page located at url:
        ua = {'User-agent': 'Mozilla/5.0'}
        page = requests.get(url, headers=ua)
        soup = BeautifulSoup(page.content, 'html.parser')

        # Get url of next page
        next_page = soup.find('li', {'class': 'next'})
        
        if next_page:
            next_page = next_page.find('a')['href']

            # recursively collect links (if any)
            if limit is None:
                collect(BASE_URL + next_page, authors, limit=None)
            else:
                collect(BASE_URL + next_page, authors, limit=limit - 1)

# Test
authors = []
collect(BASE_URL, authors, limit=2)
print(authors)

[]


#### 1.3 Getting actual biographies

For each of the links computed in the previous question, retrieve the corresponding webpage and extract the biography it contains. To do so, fill in the `get_biography` function below. Its results are gathered in a list of dictionaries of the following form:
```python
bios = [{'name': '...', 'birth_date': '...', 'birth_place': '...', 'bio': '...'}, ...]
```
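
Scraped text usually carries stray whitespace, and some pages may lack a field. The sketch below shows one possible way to build records of this form from already-extracted strings; `make_record` and the raw values are hypothetical and only serve as an illustration.

```python
def make_record(name, birth_date, birth_place, bio):
    """Build one biography dict, or return None if any field is missing."""
    fields = {'name': name, 'birth_date': birth_date,
              'birth_place': birth_place, 'bio': bio}
    if any(value is None for value in fields.values()):
        return None
    # Strip the surrounding whitespace that scraped text usually carries.
    return {key: value.strip() for key, value in fields.items()}

# Hypothetical raw values; the second record lacks a birth place.
raw = [
    ('  Jane Doe\n', 'January 01, 1900', 'in Somewhere, Nowhere', '\n  A short bio.  '),
    ('John Roe', 'February 02, 1901', None, 'Another bio.'),
]
records = [make_record(*row) for row in raw]
bios_sample = [r for r in records if r is not None]
print(bios_sample)
```

Dropping incomplete records is a design choice; keeping them with empty strings would be equally defensible, depending on how the dataset is used afterwards.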

In [3]:
def get_biography(url):
    # Get page located at URL and parse it
    ua = {'User-agent': 'Mozilla/5.0'}
    page = requests.get(BASE_URL + url, headers=ua)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # Locate the tags holding the name, birth date, birth place and bio
    name = soup.find('h3', {'class': 'author-title'})
    birth_date = soup.find('span', {'class': 'author-born-date'})
    birth_place = soup.find('span', {'class': 'author-born-location'})
    bio = soup.find('div', {'class': 'author-description'})

    # Build the record only if every field was found (otherwise return None)
    if all(tag is not None for tag in (name, birth_date, birth_place, bio)):
        return {'name': name.getText().strip(),
                'birth_date': birth_date.getText().strip(),
                'birth_place': birth_place.getText().strip(),
                'bio': bio.getText().strip()}

def get_bios(urls):
    bios = []
    for u in urls:
        bio = get_biography(u)
        # Skip pages where a field was missing (get_biography returned None)
        if bio is not None:
            bios.append(bio)
    return bios

#Test
bios=get_bios(authors)
print(bios)

[]


#### 1.4 Saving your dataset

Finally, write a `save` function which takes as input a list of biographies as computed above and saves it to disk in JSON format (the filename being an input parameter).
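
Before saving the real dataset, the round trip can be checked on a small sample. This is a sketch under assumptions: the sample record, `save_utf8` and `load` are hypothetical, and the explicit UTF-8 encoding with `ensure_ascii=False` is one possible choice that keeps accented characters readable in the file.

```python
import json
import os
import tempfile

# Hypothetical sample record with a non-ASCII character in the name.
sample = [{'name': 'Zoé Exemple', 'birth_date': 'January 01, 1900',
           'birth_place': 'in Somewhere, Nowhere', 'bio': '...'}]

def save_utf8(filename, dataset):
    """Write the dataset as human-readable UTF-8 JSON."""
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(dataset, f, ensure_ascii=False, indent=2)

def load(filename):
    """Read a JSON dataset back from disk."""
    with open(filename, encoding='utf-8') as f:
        return json.load(f)

# Write to a temporary directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bios_sample.json')
    save_utf8(path, sample)
    reloaded = load(path)

print(reloaded == sample)
```

Comparing the reloaded data with the original is a cheap way to confirm that nothing was lost or altered when writing the file.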

In [4]:
import json

def save(filename, dataset):
    # Open output file
    with open(filename, 'w') as file:
        # Write data in JSON format
        json.dump(dataset, file)

save('bios.json', bios)

## Exercise 2. Let's use Scrapy now!

Here the goal is to play with Scrapy. Let's look at the Wikipedia article https://en.wikipedia.org/wiki/List_of_French_artists. Say we want to extract the names of all artists listed there, together with the links to their Wikipedia pages and the first paragraph of each page.

You will find a file called `Exercise_sheet_3_scrapy.py`. Can you fill in the gaps in this script?


In addition to the Scrapy documentation, I highly recommend looking at the available CSS selectors: https://www.w3schools.com/cssref/css_selectors.php