# Hypothesis: 
This hypothesis was issued by the "Statistical Institute Alina Steinmetz" (my wife) and has been preached ever since we moved from Moscow to Berlin many years ago (in 2011). It is time to examine its accuracy:

    "Berlin summers are most of the time much colder than in Moscow. Moscow has very hot summers but rest of the year Berlin is much more mild in temperature, i.e. it is warmer."
    
# 1st Try
When searching for data, I found that there are of course plenty of statistics out there already, and decided that it wouldn't make sense for me to work with something that had been visualized and depicted 1000 times before. So I just took those 3 statistics from http://www.wetteronline.de/klima-temperatur/berlin, showed them to my wife and explained that, when comparing the graphs...
...
Berlin is slightly warmer in temperature

...
Berlin is less rainy

...
Berlin is more windy

My wife wasn't really happy, though. She replied that she didn't know where the statistics came from or what period they covered. In fact, I realized that I had broken one of the basics of starting out with any statistic or dataset (https://medium.com/data-goodie/data-scientists-exercise-with-their-own-ideas-on-real-datasets-basics-for-starting-out-with-them-c7d4dad98f4b): expressiveness. I was not sure which period of time those graphs covered. I had just said "some years"... But was it the same amount of time for both cities? And who cares about 150-year-old data anyway? Also, 200 years ago part of Western Europe experienced a minor Ice Age (SOURCE); is that taken into account for Berlin too? So I decided to make my own statistics for a relevant period of time: first, over the last 25 years, to show my wife's general error; and second, a close look at the periods when she lived in Moscow and after she moved to Berlin, to check whether there might be some evidence for her bias.


# Weather data
Moin Moin,
getting the weather data for Moscow and Berlin wasn't the walk in the park I thought it would be.
I found the raw datasets to be super cryptic and the explanatory codebooks non-existent or nonsensical. When I found some really good-looking data on www.wetteronline.de, I was very happy, as it seemed to be a doable task with a few data mining tricks using Python:
PICTURE OF http://www.wetteronline.de/klimarechner/moskau

# Description
Describe with glossary (TS2, p.25)

ATTENTION:
This guide really goes step by step through how I approached the problem, from zero to goal. I think it's perfect for beginners, but people with some experience might feel bored at the beginning.

# Getting the Data - Scrape or die 
Scrape is a really cruel word. Imagine you have never heard it before and then somebody throws it at you. Immediately you get pictures in your head from Hostel or ..., if you were as unlucky as I was to watch those movies as a young kid with an un(der)developed brain in the early 2000s. OK, it wasn't quite that horrible. Actually, with some programming experience you would get it in a few hours. Without much experience you would probably have to invest more time, depending on your Google skills and luck ;)

I outlined the following plan for my pursuit:

Epic: Get all weather data of the www.wetteronline.de page for Moscow and Berlin. (7 entities for every month from 01.1990 to 07.2017)

# I # 
(experimental stage: getting comfortable with data website and tools)

Task 1: Print out Moscow's average highest temperature for August for all years between 1990 and 2017 with the data gathered from the website.

In [None]:
# Libs and main structure of my scrape adventure
# (which I knew thanks to Professor Zhang)
import pprint
from urllib import request, parse
from bs4 import BeautifulSoup as bs
import os
import sqlite3

In [None]:
# declare variables and load page
'''
base_url = 'http://www.wetteronline.de/'
raw = parse.urljoin(base_url, '?pcid=pc_rueckblick_climate&gid=27612&iid=27612&pid=p_rueckblick_climatecalculator&sid=Default&var=TX&analysis=monthly&month=08&startyear=1990&endyear=2017&iid=27612')
data_page = request.urlopen(raw)
data_html = bs(data_page.read(), 'lxml')

Threw: HTTPError: HTTP Error 403: Forbidden :(
Searching for 'scraping HTTPError: HTTP Error 403: Forbidden'
gave: https://stackoverflow.com/questions/16627227/http-error-403-in-python-3-web-scraping
'''

from urllib.request import Request, urlopen

req = Request('http://www.wetteronline.de/?pcid=pc_rueckblick_climate&gid=27612&iid=27612&pid=p_rueckblick_climatecalculator&sid=Default&var=TX&analysis=monthly&month=08&startyear=1990&endyear=2017&iid=27612', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
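
The long query string above can also be assembled from its parts, which helps once the same request has to be repeated for every month. A minimal sketch, assuming the parameters mean what their names suggest (`gid`/`iid` as the station id, `var` as the measured quantity, `month` zero-padded); `build_climate_url` is a hypothetical helper, not something the site documents:

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.wetteronline.de/'


def build_climate_url(gid, var, month, startyear, endyear):
    """Build a Klimarechner query URL (parameter meanings are assumed)."""
    params = {
        'pcid': 'pc_rueckblick_climate',
        'gid': gid,   # 27612 is the id used in the Moscow URL above
        'iid': gid,
        'pid': 'p_rueckblick_climatecalculator',
        'sid': 'Default',
        'var': var,   # 'TX' is the value used in the request above
        'analysis': 'monthly',
        'month': '{:02d}'.format(month),  # the site URL uses '08', not '8'
        'startyear': startyear,
        'endyear': endyear,
    }
    return BASE_URL + '?' + urlencode(params)


# The request from Task 1: Moscow, August, 1990 to 2017
moscow_august = build_climate_url(27612, 'TX', 8, 1990, 2017)
print(moscow_august)

# One URL per month for the same station and variable
all_months = [build_climate_url(27612, 'TX', m, 1990, 2017) for m in range(1, 13)]
```

Each of these URLs would still have to be sent with the `User-Agent` header shown above to avoid the 403 error.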

In [None]:
# Beautiful Output

In [None]:
webpage

# Make a soup out of it
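No, the last output wasn't that nice. The BeautifulSoup library is very popular for converting ugly things into nicely readable structures. However, I got the gentle message that html = bs(webpage.read(), 'lxml') doesn't work with 'bytes' objects. Bytes? Yeah... webpage already holds what urlopen(req).read() returned, which is a bytes object, and bytes has no .read() method.

Is it really just this?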

In [None]:
webpage = webpage.decode('utf-8')
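
To see what the decode step does, here is a small self-contained sketch with a made-up byte string standing in for the downloaded page (the sample text is an assumption, not real site output). UTF-8 is used because the page declares `charset=utf-8` in its content-type meta tag:

```python
# Stand-in for what urlopen(req).read() returns: raw bytes, not text
raw = 'Tageshöchsttemperatur: 24.8 °C'.encode('utf-8')
print(type(raw))             # <class 'bytes'>
print(hasattr(raw, 'read'))  # False: bytes cannot be .read() a second time

# decode() turns the bytes into a normal Python string
text = raw.decode('utf-8')
print(type(text))            # <class 'str'>
print(text)
```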

In [83]:
# webpage_beautiful = bs(webpage.read(), 'lxml')
# fails: 'webpage' was a bytes object (now a str) and has no .read()
soup = bs(webpage, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Klima in der Region Moskau - Klimarechner - WetterOnline
  </title>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="Wie war das Klima in der Region Moskau? Mit dem Klimarechner können Sie historische Werte ermitteln." name="description"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="de-DE" http-equiv="content-language"/>
  <meta content="100001020190994" property="fb:admins"/>
  <meta content="1060016694" property="fb:admins"/>
  <meta content="Klima in der Region Moskau - Klimarechner - WetterOnline" property="og:title">
   <meta content="article" property="og:type">
    <meta content="width=1160" name="viewport">
     <meta content="https://st.wetteronline.de/dr/1.0.798/images/logo/ogimage_wetteronline_1200x630.png" property="og:image">
      <meta content="skype_toolbar_parser_compatible" name="skype_toolbar"/>
      <meta content="#ffffff" name="msapplication-TileColor"/

In [81]:
# Explore the HTML in Chrome Debugger
# --> the values sit in the table with id 'climatecalculator_result'
climate_frame = soup.find(id='climatecalculator_result')
print(climate_frame.prettify())

<table id="climatecalculator_result">
 <thead>
  <tr class="headline">
   <th colspan="2">
    Monatsanalyse
   </th>
  </tr>
  <tr class="headline">
   <th colspan="2">
    Tageshöchsttemperatur August
    <br/>
    Wetterstation Moskau
   </th>
  </tr>
  <tr class="mean">
   <th>
    Mittel 1990 - 2017
   </th>
   <th>
    22.3 °C
   </th>
  </tr>
  <tr class="divide">
   <th colspan="2">
   </th>
  </tr>
  <tr class="detailhead">
   <th>
    Jahr
   </th>
   <th>
   </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td>
    2017
   </td>
   <td>
    24.8 °C
   </td>
  </tr>
  <tr class="odd">
   <td>
    2016
   </td>
   <td>
    24.7 °C
   </td>
  </tr>
  <tr>
   <td>
    2015
   </td>
   <td>
    22.8 °C
   </td>
  </tr>
  <tr class="odd">
   <td>
    2014
   </td>
   <td>
    24.9 °C
   </td>
  </tr>
  <tr>
   <td>
    2013
   </td>
   <td>
    23.8 °C
   </td>
  </tr>
  <tr class="odd">
   <td>
    2012
   </td>
   <td>
    22.4 °C
   </td>
  </tr>
  <tr>
   <td>
    2011
   </td>
   <

In [91]:
# go deeper
print(climate_frame.tbody.prettify())

<tbody>
 <tr>
  <td>
   2017
  </td>
  <td>
   24.8 °C
  </td>
 </tr>
 <tr class="odd">
  <td>
   2016
  </td>
  <td>
   24.7 °C
  </td>
 </tr>
 <tr>
  <td>
   2015
  </td>
  <td>
   22.8 °C
  </td>
 </tr>
 <tr class="odd">
  <td>
   2014
  </td>
  <td>
   24.9 °C
  </td>
 </tr>
 <tr>
  <td>
   2013
  </td>
  <td>
   23.8 °C
  </td>
 </tr>
 <tr class="odd">
  <td>
   2012
  </td>
  <td>
   22.4 °C
  </td>
 </tr>
 <tr>
  <td>
   2011
  </td>
  <td>
   24.1 °C
  </td>
 </tr>
 <tr class="odd">
  <td>
   2010
  </td>
  <td>
   27.3 °C
  </td>
 </tr>
 <tr>
  <td>
   2009
  </td>
  <td>
   20.2 °C
  </td>
 </tr>
 <tr class="odd">
  <td>
   2008
  </td>
  <td>
   21.7 °C
  </td>
 </tr>
 <tr>
  <td>
   2007
  </td>
  <td>
   25.7 °C
  </td>
 </tr>
 <tr class="odd">
  <td>
   2006
  </td>
  <td>
   21.7 °C
  </td>
 </tr>
 <tr>
  <td>
   2005
  </td>
  <td>
   23.1 °C
  </td>
 </tr>
 <tr class="odd">
  <td>
   2004
  </td>
  <td>
   23.6 °C
  </td>
 </tr>
 <tr>
  <td>
   2003
  </td>
  <td>
   2

In [117]:
# Now we have a structure, that we hopefully can work with.
# For now it looks like the algorithm should be:
# In your own database:
# For every <tr> put the first <td> in column 'year'
# and the second <td> in column 'highest_temp'
# But how to achieve that in code?
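
The target of that plan, a table with the columns 'year' and 'highest_temp', could look roughly like this in SQLite. This is only a sketch: the table name `moscow_august` is hypothetical, the database lives in memory, and the three sample rows are typed in by hand from the page (1996 has no reading there):

```python
import sqlite3

# ':memory:' keeps the example self-contained; use a file path to persist
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE moscow_august (year INTEGER PRIMARY KEY, highest_temp REAL)'
)

# None is stored as NULL, which marks a missing reading
rows = [(2017, 24.8), (2016, 24.7), (1996, None)]
conn.executemany(
    'INSERT INTO moscow_august (year, highest_temp) VALUES (?, ?)', rows
)
conn.commit()

stored = conn.execute(
    'SELECT year, highest_temp FROM moscow_august ORDER BY year'
).fetchall()
print(stored)
```

Using NULL for the gap keeps SQL aggregates such as AVG() from counting the missing year as a value.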

In [131]:
# the data rows live in the table body of climate_frame
table = climate_frame.tbody
for row in table.find_all('tr'):
    for td in row.find_all('td'):
        print(td.get_text())

2017
24.8 °C
2016
24.7 °C
2015
22.8 °C
2014
24.9 °C
2013
23.8 °C
2012
22.4 °C
2011
24.1 °C
2010
27.3 °C
2009
20.2 °C
2008
21.7 °C
2007
25.7 °C
2006
21.7 °C
2005
23.1 °C
2004
23.6 °C
2003
21.2 °C
2002
22.7 °C
2001
21.6 °C
2000
21.3 °C
1999
21.2 °C
1998
19.3 °C
1997
22.1 °C
1996
- °C
1995
21.1 °C
1994
19.6 °C
1993
19.2 °C
1992
22.4 °C
1991
19.9 °C
1990
19.8 °C
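
The printed list alternates between a year and a temperature string, so the next step is to pair them up and turn them into numbers. A minimal sketch, assuming the '- °C' entry (as for 1996) means that no value is available; `parse_cells` is a hypothetical helper name:

```python
def parse_cells(cells):
    """Pair alternating year/temperature strings into (year, temp) tuples."""
    records = []
    # cells alternate: year, temperature, year, temperature, ...
    for year_text, temp_text in zip(cells[0::2], cells[1::2]):
        value = temp_text.replace('°C', '').strip()
        # '-' is assumed to mark a missing reading
        temp = None if value == '-' else float(value)
        records.append((int(year_text.strip()), temp))
    return records


# Three year/value pairs copied from the output above
sample = ['1997', '22.1 °C', '1996', '- °C', '1995', '21.1 °C']
records = parse_cells(sample)
print(records)
```

In the real loop, the `cells` list would be filled with `td.get_text()` for every cell instead of printing it.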


In [None]:
# A little side note: when I read up on web scraping
# (here: https://www.dataquest.io/blog/web-scraping-tutorial-python/),
# I stumbled over the fact that I could often just use an API
# without scraping anything, but 'luckily' www.wetteronline.de
# doesn't have an API for historical data :)
# Funny enough, the dataquest tutorial also scrapes a weather
# page. Another important point here: dataquest scrapes CURRENT
# weather data. This could be problematic with
# www.wetteronline.de's current data, because they sell their
# current weather data API, so it could be subject to copyright.
# Keep in mind to stay in conformity with the law and inform
# yourself before plunging into scraping!
# (a little intro: http://blog.icreon.us/advise/web-scraping-legality)