# Scraping WWW Exercises

## Exercise 1

Examine the front page of the BILD newspaper (www.bild.de) and create a list of all articles found on that page. Each item of the list must contain:

* article title,
* main image of the article,
* url of the article.

**Hint:**

* request the content of the `www.bild.de` page and use the `"rel": "bookmark"` attribute to identify links pointing at articles,
* request the content of each article to obtain its url, title, teaser and main image,
* you can use the `"og"` properties of the `meta` tags within an article to retrieve its title, main image and url.
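
The `og` lookup from the last hint can be tried offline before requesting any page. The following is a minimal sketch, not the exercise solution: it uses `html.parser` from the standard library instead of BeautifulSoup so that it runs without extra packages, and both the `OpenGraphParser` class and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> values into a dictionary."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        attrs = dict(attrs)
        prop = attrs.get('property') or ''
        if tag == 'meta' and prop.startswith('og:'):
            # store 'og:title' under the key 'title', and so on
            self.properties[prop[3:]] = attrs.get('content')


# Hypothetical article markup, made up for this example
sample_html = """
<html><head>
<meta property="og:title" content="Example headline">
<meta property="og:image" content="https://example.org/image.jpg">
<meta property="og:url" content="https://example.org/article.html">
<meta name="description" content="not an og property">
</head><body></body></html>
"""

parser = OpenGraphParser()
parser.feed(sample_html)
og = parser.properties
print(og)
```

With BeautifulSoup the same values can be read one at a time through `find(name='meta', attrs={'property': 'og:title'})`.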

In [1]:
import requests
import bs4

url = 'http://www.bild.de'
# read the website
front_page = requests.get(url).text
# parse html code
front_page_bs_tree = bs4.BeautifulSoup(front_page, 'html.parser')

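# collect all <a rel="bookmark"> tags; these point at articles
article_link_tags = \
    front_page_bs_tree.find_all(name="a", attrs={'rel': 'bookmark'})


base_url = 'http://www.bild.de'

# extract the href of each tag
article_links = map(lambda tag: tag.get('href'), article_link_tags)
# keep only article pages; skip tags without an href
article_links = filter(lambda link: link and 'bild.html' in link, article_links)
# this assumes the hrefs are relative, so the base url is prepended
article_links = map(lambda link: base_url + link, article_links)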


articles = []

for link in list(article_links):
    try:
        # download and parse the article page
        article_html = requests.get(link).text
        article_bs_tree = bs4.BeautifulSoup(article_html, 'html.parser')
        # title and main image come from the Open Graph meta tags
        title = \
            article_bs_tree.find(name='meta',
                                 attrs={'property': 'og:title'}).get('content')
        image = \
            article_bs_tree.find(name='meta',
                                 attrs={'property': 'og:image'}).get('content')
        # the canonical link holds the article url
        article_url = \
            article_bs_tree.find(name='link',
                                 attrs={'rel': 'canonical'}).get('href')
        articles.append({
                'title': title,
                'image': image,
                'url': article_url})
    except Exception:
        # skip articles that cannot be fetched or lack the expected tags
        continue
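
Links collected from a front page may be relative (`/news/...`) or already absolute. As a hedged sketch of one way to handle both cases, `urljoin` from the standard library resolves an href against the site address. The hrefs below are made up for illustration.

```python
from urllib.parse import urljoin

base_url = 'http://www.bild.de'

# Hypothetical hrefs, made up for this example
hrefs = [
    '/news/inland/example-1.bild.html',               # relative to the site root
    'https://www.bild.de/sport/example-2.bild.html',  # already absolute
]

# urljoin prepends the base to relative hrefs and leaves absolute ones unchanged
absolute_links = [urljoin(base_url, href) for href in hrefs]
print(absolute_links)
```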



In [2]:
print(articles[:2])

[{'title': 'Böller-Bastler sprengte sich sprengte in die Luft. Nachbarin: „Wie bei einer Atombombe“', 'image': 'https://bilder.bild.de/fotos/boeller-bastler-sprengte-sich-sprengte-in-die-luft-nachbarin-wie-bei-einer-atombombe-1dadc6985d524ab094a027020ead7d0b-74672334/Bild/4.bild.jpg', 'url': 'https://www.bild.de/news/inland/news-inland/boeller-bastler-sprengte-sich-sprengte-in-die-luft-nachbarin-wie-bei-einer-atomb-74672320.bild.html'}, {'title': 'Corona: Spahn hält Restaurant- und Konzertbesuche nur für Geimpfte für „möglich“', 'image': 'https://bilder.bild.de/fotos/corona-spahn-haelt-restaurant--und-konzertbesuche-nur-fuer-geimpfte-fuer-moeglich-357cfc6b45054fa396c3aa44d69c2e0a-74671016/Bild/8.bild.jpg', 'url': 'https://www.bild.de/politik/inland/politik-inland/corona-spahn-haelt-restaurant-und-konzertbesuche-nur-fuer-geimpfte-fuer-moeglich-74670998.bild.html'}]


## Exercise 2

Scrape the Internet Movie Database (IMDB) at [imdb.com](https://imdb.com) for the top 1000 films released in 2018 with the highest US box office. The result must be a list containing 1000 elements, where each element is a dictionary with the elements

* `name` - title of the movie, 
* `year` - release year of the movie, 
* `imdb` - IMDB score of the movie, 
* `m_score` - meta score of the movie, 
* `vote` - number of votes.

**Hint:**

* use `https://www.imdb.com/search/title?release_date=2018&sort=boxoffice_gross_us,desc&start=1` to get the top 50 movies; set `start` to `51`, `101`, etc. to navigate through the list,
* you may want to use `sleep(randint(0,10))` (`sleep` from the `time` module, `randint` from the `random` module) to introduce a delay between requests in order to avoid being temporarily banned for scraping the content,
* use the developer tools to identify the `<div>` containers holding the information about the movies,
* the release year of a movie may be presented as `(2018)`, `(I) (2018)`, `(II) (2018)` etc.; use `int(year[-5:-1])` to convert it into an integer.
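
The pagination and year hints can be checked without sending a single request. The sketch below assumes 50 movies per page, as stated in the first hint. The helper names `page_url` and `parse_year` are made up for illustration.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://www.imdb.com/search/title'


def page_url(start):
    """Build the search URL for the page beginning at position `start`."""
    query = {
        'release_date': 2018,
        'sort': 'boxoffice_gross_us,desc',
        'start': start,
    }
    # safe=',' keeps the comma in the sort parameter unescaped
    return SEARCH_URL + '?' + urlencode(query, safe=',')


def parse_year(text):
    """Convert '(2018)' or '(I) (2018)' into the integer 2018."""
    return int(text[-5:-1])


# 20 pages of 50 movies each: 1, 51, ..., 951
starts = list(range(1, 1001, 50))
urls = [page_url(start) for start in starts]

print(urls[0])
print(starts[:3], starts[-1], len(starts))
print([parse_year(s) for s in ['(2018)', '(I) (2018)', '(II) (2018)']])
```

The slice `[-5:-1]` takes the four characters before the closing parenthesis, so any prefix such as `(I) ` is ignored.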

In [4]:
from requests import get
from bs4 import BeautifulSoup
from time import sleep
from random import randint
from warnings import warn

# List to store the scraped films in
films = []

# 20 pages with 50 films each: start = 1, 51, ..., 951
pages = range(20)
starts = map(lambda x: str(50*x+1), pages)

for start in starts:

    # Make a get request
    response = get('https://www.imdb.com/search/title?release_date=2018' +
    '&sort=boxoffice_gross_us,desc&start=' + start)

    # Pause the loop
    sleep(randint(0,10))

    # Throw a warning for non-200 status codes
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(start, response.status_code))

    # Parse the content of the request with BeautifulSoup
    page_html = BeautifulSoup(response.text, 'html.parser')

    # Select all the 50 movie containers from a single page
    mv_containers = page_html.find_all('div', class_ = 'lister-item mode-advanced')

    # For every movie of these 50
    for container in mv_containers:
        # Keep only movies that have a Metascore (the final list may therefore be shorter than 1000):
        if container.find('div', class_ = 'ratings-metascore') is not None:

            film = {}
            # Scrape the name
            film['name'] = container.h3.a.text

            # Scrape the year 
            film['year'] = int(container.h3.find('span', class_ = 'lister-item-year').text[-5:-1])

            # Scrape the IMDB rating
            film['imdb'] = float(container.strong.text)

            # Scrape the Metascore
            film['m_score'] = int(container.find('span', class_ = 'metascore').text)

            # Scrape the number of votes
            film['vote'] = int(container.find('span', attrs = {'name':'nv'})['data-value'])

            films.append(film)

In [5]:
print(films[:10])

[{'name': 'Black Panther', 'year': 2018, 'imdb': 7.3, 'm_score': 88, 'vote': 550835}, {'name': 'Avengers: Infinity War', 'year': 2018, 'imdb': 8.5, 'm_score': 68, 'vote': 724440}, {'name': 'Die Unglaublichen 2', 'year': 2018, 'imdb': 7.7, 'm_score': 80, 'vote': 218935}, {'name': 'Jurassic World: Das gefallene Königreich', 'year': 2018, 'imdb': 6.2, 'm_score': 51, 'vote': 233041}, {'name': 'Aquaman', 'year': 2018, 'imdb': 7.0, 'm_score': 55, 'vote': 310821}, {'name': 'Deadpool 2', 'year': 2018, 'imdb': 7.7, 'm_score': 66, 'vote': 424203}, {'name': 'Der Grinch', 'year': 2018, 'imdb': 6.3, 'm_score': 51, 'vote': 33973}, {'name': 'Mission: Impossible - Fallout', 'year': 2018, 'imdb': 7.8, 'm_score': 86, 'vote': 251754}, {'name': 'Ant-Man and the Wasp', 'year': 2018, 'imdb': 7.1, 'm_score': 70, 'vote': 266470}, {'name': 'Bohemian Rhapsody', 'year': 2018, 'imdb': 8.0, 'm_score': 49, 'vote': 392113}]
