# Exercise Sheet \#3


## Exercise 1. Quotes: manual scraping

In this exercise, you are required to compile a dataset of biographies taken from http://quotes.toscrape.com.
Recall this website displays 10 quotes per page, together with a link to their author's biography. This will be a step by step guide.

#### 1.1 Getting URLs of authors' pages

To get a list of URLs pointing at author pages, you will process quotes' pages. 

To do so, first complete the function get_links below which expects as parameter:

* `url` the URL of a page from quotes.toscrape.com

and returns:

* `authors` the list of links to author pages contained in the given quotes' page (beware of duplicates!)

In [25]:
import requests, re
from bs4 import BeautifulSoup

BASE_URL = 'http://quotes.toscrape.com'
ua = {"User-agent": "Mozilla/5.0"}


def get_links(url):
    authors = []
    # Get page located at url:
    page = requests.get(url, headers = ua)
    #Get all links corresponding to authors:
    soup = BeautifulSoup(page.content, "html.parser")
    all_links = soup.find_all(href = re.compile("/author/.*"))
    #Loop over these:
    for link in all_links:
        #if a link is not in authors, add it:
        if link["href"] not in authors:
            authors.append(BASE_URL + link["href"])
    #Return results
    return authors

#Test:
authors = get_links(BASE_URL)
print(authors)

['http://quotes.toscrape.com/author/Albert-Einstein', 'http://quotes.toscrape.com/author/J-K-Rowling', 'http://quotes.toscrape.com/author/Albert-Einstein', 'http://quotes.toscrape.com/author/Jane-Austen', 'http://quotes.toscrape.com/author/Marilyn-Monroe', 'http://quotes.toscrape.com/author/Albert-Einstein', 'http://quotes.toscrape.com/author/Andre-Gide', 'http://quotes.toscrape.com/author/Thomas-A-Edison', 'http://quotes.toscrape.com/author/Eleanor-Roosevelt', 'http://quotes.toscrape.com/author/Steve-Martin']


#### 1.2 iterate over pages of quotes

In a second step, fill the `collect` function below, which will iteratively collect author links. This function will take as input parameters:
- `url`: the starting url from which to collect links,
- `authors`: the list of links to be updated
- `limit`: the number of pages to visit (default being `None`, which means visit all pages)

In [26]:
def collect(url, authors, limit=None):
    #Add links contained in page located at url to the authors being computed
    authors.extend([x for x in get_links(url) if x not in authors])
    #If no limit is given or limit > 1
    if (type(limit) == int and limit > 1) or limit is None:
        # Get page located at url:
        page = requests.get(url)
        soup = BeautifulSoup(page.content, "html.parser")
        # Get url of next page
        next = soup.select(".next a")
        if next:
            next_url = next[0]["href"]     # recursively collect links (if any)
            if limit is None:
                return collect(url=BASE_URL+next_url, authors = authors, limit = None)
            else:
                return collect(url=BASE_URL+next_url, authors = authors, limit = limit - 1)
        else:
            return authors
    else:
        return authors

# Test
authors = []
collect(BASE_URL, authors, limit=None)
print(authors)

['http://quotes.toscrape.com/author/Albert-Einstein', 'http://quotes.toscrape.com/author/J-K-Rowling', 'http://quotes.toscrape.com/author/Albert-Einstein', 'http://quotes.toscrape.com/author/Jane-Austen', 'http://quotes.toscrape.com/author/Marilyn-Monroe', 'http://quotes.toscrape.com/author/Albert-Einstein', 'http://quotes.toscrape.com/author/Andre-Gide', 'http://quotes.toscrape.com/author/Thomas-A-Edison', 'http://quotes.toscrape.com/author/Eleanor-Roosevelt', 'http://quotes.toscrape.com/author/Steve-Martin', 'http://quotes.toscrape.com/author/Bob-Marley', 'http://quotes.toscrape.com/author/Dr-Seuss', 'http://quotes.toscrape.com/author/Douglas-Adams', 'http://quotes.toscrape.com/author/Elie-Wiesel', 'http://quotes.toscrape.com/author/Friedrich-Nietzsche', 'http://quotes.toscrape.com/author/Mark-Twain', 'http://quotes.toscrape.com/author/Allen-Saunders', 'http://quotes.toscrape.com/author/Pablo-Neruda', 'http://quotes.toscrape.com/author/Ralph-Waldo-Emerson', 'http://quotes.toscrape.co

#### Question 1.3 : get actual biographies

For each of the links computed in the previous question, retrieve the corresponding webpage and extract the biography it contains. To do so, fill the `get_biography` function below. It will feed a list of dictionaries of the following form:
```python
bios = [{name: '...', birth_date: '...', birth_place: '...', bio: '...'}, ...]
```

In [35]:
def get_biography(url):
    # Get page located at URL and parse it
    page = requests.get(url, headers = ua)
    soup = BeautifulSoup(page.content, "html.parser")
    # Get name with BeautifulSoup
    name = soup.find("h3").text.split("\n")[0]
    # Get birth date
    birth_date = soup.select(".author-born-date")[0].text
    # Get birth place
    birth_place= soup.select(".author-born-location")[0].text
    # Get bio
    bio = soup.select(".author-description")[0].text
    return {'name':name, 'birth_date': birth_date, 'birth_place': birth_place, 'bio': bio}

def get_bios(urls):
    bios = []
    for u in urls:
        bios.append(get_biography(u))
    return bios

#Test
bios=get_bios(authors)
print(bios[0])

{'name': 'Albert Einstein', 'birth_date': 'March 14, 1879', 'birth_place': 'in Ulm, Germany', 'bio': '\n        In 1879, Albert Einstein was born in Ulm, Germany. He completed his Ph.D. at the University of Zurich by 1909. His 1905 paper explaining the photoelectric effect, the basis of electronics, earned him the Nobel Prize in 1921. His first paper on Special Relativity Theory, also published in 1905, changed the world. After the rise of the Nazi party, Einstein made Princeton his permanent home, becoming a U.S. citizen in 1940. Einstein, a pacifist during World War I, stayed a firm proponent of social justice and responsibility. He chaired the Emergency Committee of Atomic Scientists, which organized to alert the public to the dangers of atomic warfare.At a symposium, he advised: "In their struggle for the ethical good, teachers of religion must have the stature to give up the doctrine of a personal God, that is, give up that source of fear and hope which in the past placed such vas

#### Question 1.4: save your dataset

Finally, write a `save` function which takes as an input a list of biographies as computed above and save them in JSON on disk (the filename being an input parameter).

In [40]:
import json
import pandas as pd

bios_df = pd.DataFrame(bios)
bios_df

Unnamed: 0,name,birth_date,birth_place,bio
0,Albert Einstein,"March 14, 1879","in Ulm, Germany","\n In 1879, Albert Einstein was born in..."
1,J.K. Rowling,"July 31, 1965","in Yate, South Gloucestershire, England, The U...",\n See also: Robert GalbraithAlthough s...
2,Albert Einstein,"March 14, 1879","in Ulm, Germany","\n In 1879, Albert Einstein was born in..."
3,Jane Austen,"December 16, 1775","in Steventon Rectory, Hampshire, The United Ki...",\n Jane Austen was an English novelist ...
4,Marilyn Monroe,"June 01, 1926",in The United States,\n Marilyn Monroe (born Norma Jeane Mor...
5,Albert Einstein,"March 14, 1879","in Ulm, Germany","\n In 1879, Albert Einstein was born in..."
6,André Gide,"November 22, 1869","in Paris, France",\n André Paul Guillaume Gide was a Fren...
7,Thomas A. Edison,"February 11, 1847","in Milan, Ohio, The United States",\n Thomas Alva Edison was an American i...
8,Eleanor Roosevelt,"October 11, 1884",in The United States,\n Anna Eleanor Roosevelt was an Americ...
9,Steve Martin,"August 14, 1945","in Waco, Texas, The United States","\n Stephen Glenn ""Steve"" Martin is an A..."


In [41]:
def save(filename, dataset):
    # Open output file
    with open(filename, "w") as file:
        # write data in JSON format
        bios_json = bios_df.to_json(orient = "records")
        json_dict = json.loads(bios_json)
        json.dump(json_dict, file)
     #remove when ready

save('bios.json', bios)

## Exercise 2. Let's use Scrapy now!

Here the goal is to play with scrapy. Let's look at the wikipedia article https://en.wikipedia.org/wiki/List_of_French_artists. Let's say, we want to extract all names of artists from here with links to their corresponding wikipedia pages and the first paragraph about them.

You will find a file called `Exercise_sheet_3_scrapy.py`. Can you fill in the gaps in this script?


In addition to the Scrapy documentation I highly recommend you to look at possible selectors: https://www.w3schools.com/cssref/css_selectors.php