# Exercise Web Scraping
* Author: Johannes Maucher
* Last update: 27.05.2018
* Contents:
    * Access html pages
    * html parsing
    * MongoDB

In this assignment, the information in the *Basisdaten* table of the Wikipedia pages of German cities shall be crawled and stored in a MongoDB database.

## To be submitted:
This notebook, enhanced with the solutions to the questions. Your solution should contain:
   * the implemented code in code cells,
   * the output of this code,
   * your remarks, discussion, and comments on the solution in markdown cells.



## Requests and BeautifulSoup

In [31]:
from bs4 import BeautifulSoup # pip install beautifulsoup4 if it is not already in your environment
import requests

[requests](https://pypi.org/project/requests/) is a Python package for the HTTP protocol. It can be used, for example, to fetch an arbitrary web page via its URL. In the example below it is used to download the German Wikipedia page of Stuttgart. The first 800 characters of the downloaded page are displayed.

In [32]:
r=requests.get("https://de.wikipedia.org/wiki/Stuttgart")

In [33]:
r.text[:800]

'<!DOCTYPE html>\n<html class="client-nojs" lang="de" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Stuttgart – Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Stuttgart","wgTitle":"Stuttgart","wgCurRevisionId":177589159,"wgRevisionId":177589159,"wgArticleId":4492,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Wikipedia:Belege fehlen","Wikipedia:Seite mit Grafik","Stuttgart","Gemeinde in Baden-Württemberg","Ort in Baden-Württemberg","Stadtkreis in Ba'
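The call above has no timeout and does not check whether the request succeeded. A minimal sketch of a more defensive variant is given below. The helper name `fetch_html`, the `USER_AGENT` string and the optional `session` parameter are assumptions made for illustration and are not part of the original notebook.

```python
import requests

# Hypothetical identification string; replace it with your own contact details.
USER_AGENT = "webscraping-exercise/0.1 (student notebook)"


def fetch_html(url, timeout=10.0, session=None):
    """Return the HTML text of `url`, raising an exception on HTTP errors.

    `session` may be a requests.Session (or any object with a compatible
    `get` method); if omitted, the module-level requests.get is used.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout, headers={"User-Agent": USER_AGENT})
    # Raises requests.HTTPError for 4xx/5xx answers instead of silently
    # returning an error page as if it were the article.
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html("https://de.wikipedia.org/wiki/Stuttgart")[:800]` would take the place of `r.text[:800]`. Passing a `requests.Session` is optional, but it reuses the connection when many pages are fetched in a row.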

As can be seen, the data returned by `requests.get()` contains the entire web page, including all types of markup and meta-information. To extract specific elements, e.g. images, links, text or tables, [Beautiful Soup (bs4)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is the tool of choice. BS4 is a popular and comprehensive Python library for HTML and XML parsing.

Below, the first example demonstrates how to get all links in the web page. In the second example, the text of all tables is extracted.


In [37]:
soup=BeautifulSoup(r.text,"html.parser")

In [38]:
for c in soup.find_all('a'):
    if c.has_attr('href'):
        print(c)

<a href="#mw-head">Navigation</a>
<a href="#p-search">Suche</a>
<a class="mw-disambig" href="/wiki/Stuttgart_(Begriffskl%C3%A4rung)" title="Stuttgart (Begriffsklärung)">Stuttgart (Begriffsklärung)</a>
<a class="image" href="/wiki/Datei:Coat_of_arms_of_Stuttgart.svg" title="Wappen der Stadt Stuttgart"><img alt="Wappen der Stadt Stuttgart" data-file-height="613" data-file-width="596" height="144" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Coat_of_arms_of_Stuttgart.svg/140px-Coat_of_arms_of_Stuttgart.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Coat_of_arms_of_Stuttgart.svg/210px-Coat_of_arms_of_Stuttgart.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Coat_of_arms_of_Stuttgart.svg/280px-Coat_of_arms_of_Stuttgart.svg.png 2x" width="140"/></a>
<a class="image" href="/wiki/Datei:Reddot.svg" title="Stuttgart"><img alt="Stuttgart" data-file-height="402" data-file-width="402" height="5" src="//upload.wikimedia.org/wikipedia/commons/thumb/

<a href="#cite_note-Historische_Einwohnerzahlen-31">[31]</a>
<a href="/wiki/Deutsche_Reichsgr%C3%BCndung" title="Deutsche Reichsgründung">Reichsgründung</a>
<a href="/wiki/Gro%C3%9Fstadt" title="Großstadt">Großstadt</a>
<a href="#cite_note-Historische_Einwohnerzahlen-31">[31]</a>
<a class="image" href="/wiki/Datei:Daimler-motoren-gesellschaft-1911.jpg"><img alt="" class="thumbimage" data-file-height="409" data-file-width="654" height="138" src="//upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Daimler-motoren-gesellschaft-1911.jpg/220px-Daimler-motoren-gesellschaft-1911.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Daimler-motoren-gesellschaft-1911.jpg/330px-Daimler-motoren-gesellschaft-1911.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/e/e5/Daimler-motoren-gesellschaft-1911.jpg/440px-Daimler-motoren-gesellschaft-1911.jpg 2x" width="220"/></a>
<a class="internal" href="/wiki/Datei:Daimler-motoren-gesellschaft-1911.jpg" title="vergrößern und Informationen

<a href="/wiki/Bundesautobahn_81" title="Bundesautobahn 81"><img alt="A81" data-file-height="240" data-file-width="400" height="19" src="//upload.wikimedia.org/wikipedia/commons/thumb/5/5e/Bundesautobahn_81_number.svg/32px-Bundesautobahn_81_number.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/5/5e/Bundesautobahn_81_number.svg/49px-Bundesautobahn_81_number.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/5/5e/Bundesautobahn_81_number.svg/64px-Bundesautobahn_81_number.svg.png 2x" style="vertical-align: middle" width="32"/></a>
<a href="/wiki/Bundesautobahn_81" title="Bundesautobahn 81"><img alt="A81" data-file-height="240" data-file-width="400" height="19" src="//upload.wikimedia.org/wikipedia/commons/thumb/5/5e/Bundesautobahn_81_number.svg/32px-Bundesautobahn_81_number.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/5/5e/Bundesautobahn_81_number.svg/49px-Bundesautobahn_81_number.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/5/
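Most of the `href` and `src` values in the output above are relative: fragments such as `#mw-head`, site-relative paths such as `/wiki/Gro%C3%9Fstadt`, and protocol-relative addresses starting with `//`. If the links are to be followed later, they have to be resolved against the URL of the page. The following sketch uses `urljoin` from the standard library for this purpose. The constant `PAGE_URL` and the list `references` are names chosen here for illustration; the three values are copied from the output above.

```python
from urllib.parse import urljoin

# URL of the page from which the links were extracted.
PAGE_URL = "https://de.wikipedia.org/wiki/Stuttgart"

# Three kinds of relative references seen in the output above.
references = [
    "#mw-head",               # fragment on the same page
    "/wiki/Gro%C3%9Fstadt",   # path relative to the site root
    # protocol-relative address (taken from an img src attribute)
    "//upload.wikimedia.org/wikipedia/commons/thumb/f/f8/"
    "Coat_of_arms_of_Stuttgart.svg/140px-Coat_of_arms_of_Stuttgart.svg.png",
]

# urljoin fills in the missing scheme, host and path from PAGE_URL.
resolved = [urljoin(PAGE_URL, ref) for ref in references]

for url in resolved:
    print(url)
```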

In [39]:
for c in soup.find_all('table'):
    print(c.text)
    print("-"*20)





Der Titel dieses Artikels ist mehrdeutig. Weitere Bedeutungen sind unter Stuttgart (Begriffsklärung) aufgeführt.

--------------------


Wappen

Deutschlandkarte








48.7755555555569.1827777777778247Koordinaten: 48° 47′ N, 9° 11′ O


Basisdaten


Bundesland:
Baden-Württemberg


Regierungsbezirk:

Stuttgart


Höhe:

247 m ü. NHN


Fläche:

207,35 km2


Einwohner:

628.032 (31. Dez. 2016)[1]


Bevölkerungsdichte:

3029 Einwohner je km2


Postleitzahlen:

70173–70619


Vorwahl:

0711


Kfz-Kennzeichen:

S


Gemeindeschlüssel:

08 1 11 000


LOCODE:

DE STR


NUTS:

DE111


Stadtgliederung:

23 Stadtbezirkemit 152 Stadtteilen


Adresse derStadtverwaltung:

Marktplatz 170173 Stuttgart


Webpräsenz:

www.stuttgart.de


Oberbürgermeister:

Fritz Kuhn (Bündnis 90/Die Grünen)


Lage der Stadt Stuttgart in Baden-Württemberg




--------------------




--------------------





Die 23 Stadtbezirke mit Anzahl der zugehörigen Stadtteile


Innere Stadtbezirke


Stuttgart-Mitte (10), Stuttga
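
Printing `c.text` for a whole table loses the row structure and fuses neighbouring text fragments, as in `Stadtbezirkemit` or `Adresse derStadtverwaltung` in the output above. One way to keep the structure is to iterate over the rows (`tr`) and cells (`td`) and to pass a separator to `get_text()`. The sketch below demonstrates this on a small hand-written table. `SAMPLE_HTML` is a simplified assumption and not the real Wikipedia markup, and `table_to_dict` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Simplified, hand-written stand-in for a key/value table (assumption:
# each data row consists of exactly two <td> cells).
SAMPLE_HTML = """
<table>
  <tr><th colspan="2">Basisdaten</th></tr>
  <tr><td>Bundesland:</td><td>Baden-Württemberg</td></tr>
  <tr><td>Stadtgliederung:</td><td>23 Stadtbezirke<br/>mit 152 Stadtteilen</td></tr>
  <tr><td>Kfz-Kennzeichen:</td><td>S</td></tr>
</table>
"""


def table_to_dict(table):
    """Map the first cell of each two-cell row to the second cell."""
    result = {}
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 2:
            continue  # skip header rows and rows with another layout
        # The separator " " keeps text fragments apart that are only
        # divided by markup such as <br/>; strip=True trims whitespace.
        key = cells[0].get_text(" ", strip=True).rstrip(":")
        value = cells[1].get_text(" ", strip=True)
        result[key] = value
    return result


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
basisdaten = table_to_dict(sample_soup.find("table"))
print(basisdaten)
```

On a real page the right table still has to be selected first, and nested tables or footnote markers such as `[1]` may need extra handling.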

## Tasks
In the previous code cell, all tables of the Stuttgart Wikipedia page have been extracted. As can be seen, the output includes all data contained in the *Basisdaten* box of a city's Wikipedia page (here [Stuttgart](https://de.wikipedia.org/wiki/Stuttgart)).

1. Define a function that extracts all rows of the *Basisdaten* table, from *Bundesland* to *Oberbürgermeister*. The function shall return a dictionary whose keys are the attributes (e.g. Bundesland) and whose values are the corresponding entries (e.g. Baden-Württemberg).
2. Define a function that extracts these *Basisdaten* tables for all German cities listed in the file [germanCities.csv](germanCities.csv). Read the contents of this .csv file, iterate over all city names in the file, and invoke the function developed in the previous task for each of these cities. Note that the URLs of the cities' Wikipedia pages all follow the same pattern.
3. Save the list of all extracted *Basisdaten* dictionaries in a MongoDB database.
4. Demonstrate three queries against the MongoDB database.
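
As a starting point for task 2, the URL pattern can be read off the Stuttgart example: the article title is appended to `https://de.wikipedia.org/wiki/`. The sketch below builds such URLs from city names. It rests on assumptions: the layout of `germanCities.csv` is not shown here, so the code assumes one city name per row in the first column and no header row, and it uses an in-memory sample in place of the file. The names `WIKI_BASE`, `city_url` and `read_city_names` are hypothetical. Some cities may also need a disambiguated title that differs from the plain city name.

```python
import csv
import io
from urllib.parse import quote

WIKI_BASE = "https://de.wikipedia.org/wiki/"


def city_url(name):
    """Build the article URL for a city name (spaces become underscores)."""
    title = name.strip().replace(" ", "_")
    # quote() percent-encodes non-ASCII characters such as umlauts.
    return WIKI_BASE + quote(title)


def read_city_names(csv_file, column=0):
    """Return the non-empty entries of one column of an open CSV file."""
    reader = csv.reader(csv_file)
    return [row[column].strip() for row in reader if row and row[column].strip()]


# In-memory stand-in for the real file (assumed layout, see above).
sample = io.StringIO("Stuttgart\nFrankfurt am Main\nMünchen\n")
urls = [city_url(name) for name in read_city_names(sample)]

for url in urls:
    print(url)
```

For the real file, `open("germanCities.csv", newline="", encoding="utf-8")` would replace the `io.StringIO` sample, with the encoding adjusted if the file is stored differently.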
