# Exercise Sheet \#3


## Exercise 1. Quotes: manual scraping

In this exercise, you are required to compile a dataset of biographies taken from http://quotes.toscrape.com.
Recall this website displays 10 quotes per page, together with a link to their author's biography. This will be a step by step guide.

#### 1.1 Getting URLs of authors' pages

To get a list of URLs pointing at author pages, you will process quotes' pages. 

To do so, first complete the function get_links below which expects as parameter:

* `url` the URL of a page from quotes.toscrape.com

and returns:

* `authors` the list of links to author pages contained in the given quotes' page (beware of duplicates!)

In [2]:
import requests, re
from bs4 import BeautifulSoup

BASE_URL = 'http://quotes.toscrape.com'

def get_links(url):
    authors = []
    #Get page located at url:
    page = requests.get(url=url, headers={'User-Agent': 'Mozilla/5.0'}) #It's important to add in case you have very different versions of the same page
    soup = BeautifulSoup(page.content, 'html.parser')
        
    #Get all links corresponding to authors:
    links = soup.find_all(href=re.compile('author'))
    #Loop over these:
    for l in links:
        link = l.get('href')
        #if a link is not in authors, add it:
        if (BASE_URL+link) not in authors:
            authors.append(BASE_URL + link)        
    #Return results
    return authors

#Test:
authors = get_links(BASE_URL)
print(authors)

['http://quotes.toscrape.com/author/Albert-Einstein', 'http://quotes.toscrape.com/author/J-K-Rowling', 'http://quotes.toscrape.com/author/Jane-Austen', 'http://quotes.toscrape.com/author/Marilyn-Monroe', 'http://quotes.toscrape.com/author/Andre-Gide', 'http://quotes.toscrape.com/author/Thomas-A-Edison', 'http://quotes.toscrape.com/author/Eleanor-Roosevelt', 'http://quotes.toscrape.com/author/Steve-Martin']


#### 1.2 iterate over pages of quotes

In a second step, fill the `collect` function below, which will iteratively collect author links. This function will take as input parameters:
- `url`: the starting url from which to collect links,
- `authors`: the list of links to be updated
- `limit`: the number of pages to visit (default being `None`, which means visit all pages)

In [3]:
def collect(url, authors, limit=None):
    #Add links contained in page located at url to the authors being computed
    authors.extend([x for x in get_links(url) if x not in authors]) #List Comprehension: for loop in one line
    #If no limit is given or limit > 1
    if limit == None or limit > 1:
        # Get page located at url:
        page = requests.get(url=url, headers={'User-Agent': 'Mozilla/5.0'}) #It's important to add in case you have very different versions of the same page
        soup = BeautifulSoup(page.content, 'html.parser')
        # Get url of next page
        next_el = soup.find('a', class_="next")
        # recursively collect links (if any)
        if next_el and (limit is None or limit > 1):
            next_url = next_el['href']
            if limit is not None:
                limit -= 1
            collect(next_url, authors, limit)
# Test
authors = []
collect(BASE_URL, authors, limit=2)
print(authors)

['http://quotes.toscrape.com/author/Albert-Einstein', 'http://quotes.toscrape.com/author/J-K-Rowling', 'http://quotes.toscrape.com/author/Jane-Austen', 'http://quotes.toscrape.com/author/Marilyn-Monroe', 'http://quotes.toscrape.com/author/Andre-Gide', 'http://quotes.toscrape.com/author/Thomas-A-Edison', 'http://quotes.toscrape.com/author/Eleanor-Roosevelt', 'http://quotes.toscrape.com/author/Steve-Martin']


#### Question 1.3 : get actual biographies

For each of the links computed in the previous question, retrieve the corresponding webpage and extract the biography it contains. To do so, fill the `get_biography` function below. It will feed a list of dictionaries of the following form:
```python
bios = [{name: '...', birth_date: '...', birth_place: '...', bio: '...'}, ...]
```

In [4]:
def get_biography(url):
    # Get page located at URL and parse it
    page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(page.content, 'html.parser')
    # Get name with BeautifulSoup
    name_tag = soup.find('h3', class_='author-title')
    name = name_tag.text.strip() if name_tag else None
    # Get birth date
    birth_date_tag = soup.find('span', class_='author-born-date')
    birth_date = birth_date_tag.text.strip() if birth_date_tag else None

    # Get birth place
    birth_place_tag = soup.find('span', class_='author-born-location')
    birth_place = birth_place_tag.text.strip() if birth_place_tag else None

    # Get bio
    bio_tag = soup.find('div', class_='author-description')
    bio = bio_tag.text.strip() if bio_tag else None

    return {'name': name, 'birth_date': birth_date, 'birth_place': birth_place, 'bio': bio}

def get_bios(urls):
    bios = []
    for u in urls:
        bios.append(get_biography(u))
    return bios

#Test
bios=get_bios(authors)
print(bios)

[{'name': 'Albert Einstein', 'birth_date': 'March 14, 1879', 'birth_place': 'in Ulm, Germany', 'bio': 'In 1879, Albert Einstein was born in Ulm, Germany. He completed his Ph.D. at the University of Zurich by 1909. His 1905 paper explaining the photoelectric effect, the basis of electronics, earned him the Nobel Prize in 1921. His first paper on Special Relativity Theory, also published in 1905, changed the world. After the rise of the Nazi party, Einstein made Princeton his permanent home, becoming a U.S. citizen in 1940. Einstein, a pacifist during World War I, stayed a firm proponent of social justice and responsibility. He chaired the Emergency Committee of Atomic Scientists, which organized to alert the public to the dangers of atomic warfare.At a symposium, he advised: "In their struggle for the ethical good, teachers of religion must have the stature to give up the doctrine of a personal God, that is, give up that source of fear and hope which in the past placed such vast power i

#### Question 1.4: save your dataset

Finally, write a `save` function which takes as an input a list of biographies as computed above and save them in JSON on disk (the filename being an input parameter).

In [6]:
import json

def save(filename, dataset):
    # Open output file
    with open(filename, 'w') as f:
    # write data in JSON format
        json.dump(dataset, f)
    pass #remove when ready

save('bios.json', bios)

## Exercise 2. Let's use Scrapy now!

Here the goal is to play with scrapy. Let's look at the wikipedia article https://en.wikipedia.org/wiki/List_of_French_artists. Let's say, we want to extract all names of artists from here with links to their corresponding wikipedia pages and the first paragraph about them.

You will find a file called `Exercise_sheet_3_scrapy.py`. Can you fill in the gaps in this script?


In addition to the Scrapy documentation I highly recommend you to look at possible selectors: https://www.w3schools.com/cssref/css_selectors.php