# Hypothesis: 
This hypothesis was issued by the "Statistical Institute Alina Steinmetz" (my wife) and has been preached ever since we moved from Moscow to Berlin in 2011. It is time to examine whether it holds:

    "Berlin summers are most of the time much colder than in Moscow. Berlin is much more mild over the year but the summers in Moscow are hotter."
    
# First Try
When searching for data, I found that there are of course plenty of statistics out there already, and decided it wouldn't make sense to work with something that had been visualized and depicted a thousand times before. So I just took three comparison charts from the German weather website http://www.wetteronline.de/klima-temperatur/berlin, showed them to my wife and explained that when comparing the graphs she would see that...
<br><br>
<i> [To follow this tutorial, you will need Python (3.x), IPython and Jupyter installed; later we will add some other libraries. I recommend [anaconda](https://www.continuum.io/downloads): it bundles everything, and setting up further libraries is just one line of code. This guide goes step by step through how I approached the problem, from zero to goal, so I think it suits scraping beginners. Let me know if there are any problems with the code or anything else.]</i>
<br>
<br>
<i>(To run a cell and move on to the next one, click on the cell once and then press SHIFT+ENTER on your keyboard.)</i>
<br>

### ... Berlin is slightly warmer in June and August:

![temperature comparison](https://raw.githubusercontent.com/RichStone/weather-comparison-berlin-moscow/master/temperature.png)

### ... Berlin is less rainy:

![rain comparison](https://raw.githubusercontent.com/RichStone/weather-comparison-berlin-moscow/master/rain.png)

### ... Berlin is more windy:

![wind comparison](https://raw.githubusercontent.com/RichStone/weather-comparison-berlin-moscow/master/wind.png)

I was convinced that this little presentation would be enough for every sane person to accept that Moscow's summer is not really much hotter than Berlin's (only less windy).

My wife wasn't really happy, though. She replied that she didn't know where the statistics came from or what period they covered. I realized that I had indeed broken [one of the basics](https://medium.com/data-goodie/data-scientists-exercise-with-their-own-ideas-on-real-datasets-basics-for-starting-out-with-them-c7d4dad98f4b) of starting out with any statistic or data: expressiveness. I was not sure which period of time those graphs covered, and the website doesn't state it either. I had just said "some years"... But was it the same amount of time for both cities? And who cares about 150-year-old data anyway? Also, 200 years ago part of Western Europe experienced a minor ice age (https://en.wikipedia.org/wiki/Little_Ice_Age); is that taken into account for Berlin too :D ? So I decided to make my own statistics for a relevant period of time. First, I would take the data from roughly the last 25 years to show the hypothesis's general fallacy. Second, I would look closely at the periods when Mrs. Steinmetz lived in Moscow and after she moved to Berlin, to check whether there might be some evidence for a little bias.

# Weather data
Getting historical weather data for Moscow and Berlin wasn't the walk in the park I thought it would be.
I found the raw datasets to be super cryptic and the explanatory codebooks non-existent or nonsensical. So when I found some really good-looking data on www.wetteronline.de, I was very happy, as getting it with a few scraping tricks in Python seemed a doable task. The data is nicely organized in tables and the URLs give readable information [(link)](http://www.wetteronline.de/?pcid=pc_rueckblick_climate&gid=10382&iid=10382&pid=p_rueckblick_climatecalculator&sid=Default&var=TX&analysis=monthly&month=08&startyear=1990&endyear=2017&iid=10382):

![example data table](https://raw.githubusercontent.com/RichStone/weather-comparison-berlin-moscow/master/example-data-table.png)

# Getting the Data - Scrape it out
Scraping sounds very cruel. But with some programming experience, the process of web scraping actually isn't that bad.

I outlined the following plan to reach my goal:

Print all weather data from the www.wetteronline.de page for Moscow and Berlin (7 categories [like highest average temperature etc.] for every month from 01.1990 to 07.2017).
<br> 
<br>
The next tutorials will then be about putting the data into a database, analyzing/comparing/visualizing it, and finally we might touch on some machine learning to see if we get any fancy weather predictions based on historical data.

# - I. - 
Task 1: Print out Moscow's average highest temperature for August for all years between 1990 and 2017 with the data gathered from the website.

In [6]:
# Libraries
# I already knew where to begin thanks to my Professor Zhang, who was brave enough
# to start out our computer science course with some nice practical scraping.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs
# if you don't have BeautifulSoup installed yet, you would need to do in your
# command line: pip install beautifulsoup4

In [7]:
# declare variables and load page
req = Request('http://www.wetteronline.de/?pcid=pc_rueckblick_climate&gid=27612&iid=27612&pid=p_rueckblick_climatecalculator&sid=Default&var=TX&analysis=monthly&month=08&startyear=1990&endyear=2017&iid=27612', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

By the way, my first approach failed with the error
<br>
  <b>"HTTPError: HTTP Error 403: Forbidden"</b>
<br>
The solution was to pass a browser-like User-Agent through the `headers` argument of the Request() object, so that our request wouldn't be classified as some evil robotic force crawling the website.
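
To make the header trick a bit more tangible, here is a minimal sketch of how the request could be wrapped so that a 403 doesn't crash the notebook. The helper name `fetch` and the constant `HEADERS` are my own inventions for illustration; only `Request`, `urlopen` and `HTTPError` come from the standard library.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

# Browser-like User-Agent, the same value used in the cell above.
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def fetch(url, headers=HEADERS):
    """Return the raw bytes of a page, or None if the server refuses the request.

    `fetch` is a hypothetical helper for illustration, not part of urllib.
    """
    req = Request(url, headers=headers)
    try:
        return urlopen(req).read()
    except HTTPError as err:
        # e.g. "HTTP Error 403: Forbidden" when the request looks like a bot
        print('Request failed:', err.code, err.reason)
        return None


# Building a Request does not touch the network, so the header can be checked offline.
req = Request('http://www.wetteronline.de/', headers=HEADERS)
print(req.get_header('User-agent'))  # urllib stores the key capitalized as 'User-agent'
```

Calling `fetch(url)` with the long URL from the cell above should then behave like `urlopen(req).read()`, just with a friendlier failure mode.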

The second approach, with the header passed to the Request, was more successful and produced this beautiful output:

In [8]:
print(webpage)



### Make a soup out of it
But you can make this output even more precious. The beautifulsoup library is a very popular way to turn ugly things into nice, readable structures. As you can see at the beginning of the output above, it starts with <b>b'</b>. That is a sign that it is a byte string, which we decode first:

In [9]:
webpage = webpage.decode('utf-8')
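
If the difference between bytes and text is new to you, this tiny sketch may help. The string in `raw` is a made-up stand-in for what `urlopen(...).read()` returns, and I use `utf-8` because that is the charset the page declares in its meta tag.

```python
# A made-up stand-in for the bytes that urlopen(...).read() returns.
raw = 'Tageshöchsttemperatur August: 22.3 °C'.encode('utf-8')

print(type(raw))   # a bytes object
print(raw[:10])    # shown with the b'...' prefix; the umlaut appears as escaped bytes

# decode() turns the bytes back into a regular Python string.
text = raw.decode('utf-8')
print(type(text))  # a str object
print(text)        # umlauts and the degree sign are readable again
```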

Now you are ready to really make it a readable output using the prettify() method on our soup:

In [10]:
soup = bs(webpage, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Klima in der Region Moskau - Klimarechner - WetterOnline
  </title>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="Wie war das Klima in der Region Moskau? Mit dem Klimarechner können Sie historische Werte ermitteln." name="description"/>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="de-DE" http-equiv="content-language"/>
  <meta content="100001020190994" property="fb:admins"/>
  <meta content="1060016694" property="fb:admins"/>
  <meta content="Klima in der Region Moskau - Klimarechner - WetterOnline" property="og:title">
   <meta content="article" property="og:type">
    <meta content="width=1160" name="viewport">
     <meta content="https://st.wetteronline.de/dr/1.0.800/images/logo/ogimage_wetteronline_1200x630.png" property="og:image">
      <meta content="skype_toolbar_parser_compatible" name="skype_toolbar"/>
      <meta content="#ffffff" name="msapplication-TileColor"/

But beautifying the HTML stream is of course not our final goal. The prettified output is just nice to have when you run your methods against the soup, for instance to find out which part of your HTML you targeted with a find(tag) call and which tag to target next to get the desired data. First, however, I analyzed the HTML structure in Chrome's Developer Tools and made a plan for accessing my table elements (the years and the corresponding values). I found out that the whole table sits inside a TABLE tag with the ID 'climatecalculator_result'. With our `soup` object, we can now easily access any element of the DOM with methods like find(), searching by an attribute or anything else:

In [11]:
climate_frame = soup.find(id='climatecalculator_result')
print(climate_frame.prettify())

<table id="climatecalculator_result">
 <thead>
  <tr class="headline">
   <th colspan="2">
    Monatsanalyse
   </th>
  </tr>
  <tr class="headline">
   <th colspan="2">
    Tageshöchsttemperatur August
    <br/>
    Wetterstation Moskau
   </th>
  </tr>
  <tr class="mean">
   <th>
    Mittel 1990 - 2017
   </th>
   <th>
    22.3 °C
   </th>
  </tr>
  <tr class="divide">
   <th colspan="2">
   </th>
  </tr>
  <tr class="detailhead">
   <th>
    Jahr
   </th>
   <th>
   </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td>
    2017
   </td>
   <td>
    25.4 °C
   </td>
  </tr>
  <tr class="odd">
   <td>
    2016
   </td>
   <td>
    24.7 °C
   </td>
  </tr>
  <tr>
   <td>
    2015
   </td>
   <td>
    22.8 °C
   </td>
  </tr>
  <tr class="odd">
   <td>
    2014
   </td>
   <td>
    24.9 °C
   </td>
  </tr>
  <tr>
   <td>
    2013
   </td>
   <td>
    23.8 °C
   </td>
  </tr>
  <tr class="odd">
   <td>
    2012
   </td>
   <td>
    22.4 °C
   </td>
  </tr>
  <tr>
   <td>
    2011
   </td>
   <

Now we have a structure that we can hopefully work with. So far it looks like the algorithm could be something like 
<code>
> To your own database:
> For every row &lt;tr&gt; put the first &lt;td&gt; in column 'year' and the second &lt;td&gt; in column 'highest_temp'
</code>
<br>
After a bit of playing with some simple for loops, we can boil it down to these two nested loops and extract the data, again with the beautifulsoup library:

In [12]:
table = climate_frame.tbody # you can access child elements just by using the dot notation
for row in table.find_all('tr'): # find_all is the current name; findAll is a legacy alias
    for td in row.find_all('td'):
        print(td.get_text())

2017
25.4 °C
2016
24.7 °C
2015
22.8 °C
2014
24.9 °C
2013
23.8 °C
2012
22.4 °C
2011
24.1 °C
2010
27.3 °C
2009
20.2 °C
2008
21.7 °C
2007
25.7 °C
2006
21.7 °C
2005
23.1 °C
2004
23.6 °C
2003
21.2 °C
2002
22.7 °C
2001
21.6 °C
2000
21.3 °C
1999
21.2 °C
1998
19.3 °C
1997
22.1 °C
1996
- °C
1995
21.1 °C
1994
19.6 °C
1993
19.2 °C
1992
22.4 °C
1991
19.9 °C
1990
19.8 °C
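
Before any database comes into play, the flat print-out above can already be turned into the 'year' / 'highest_temp' pairs from the algorithm sketch. The following is only one possible way to do it; the helper names `parse_temperature` and `pair_cells` are hypothetical, and I am assuming that `- °C` (as in 1996) means "no measurement".

```python
def parse_temperature(text):
    """Convert a cell like '25.4 °C' to a float; '- °C' (assumed: no data) becomes None."""
    number = text.replace('°C', '').strip()
    if number == '-':
        return None
    return float(number)


def pair_cells(cells):
    """Group a flat [year, value, year, value, ...] list into (year, temperature) tuples."""
    years = cells[0::2]   # every even position holds a year
    values = cells[1::2]  # every odd position holds the matching value
    return [(int(year), parse_temperature(value)) for year, value in zip(years, values)]


# Three rows copied from the output above, including the gap in 1996.
cells = ['1997', '22.1 °C', '1996', '- °C', '1995', '21.1 °C']
rows = pair_cells(cells)
print(rows)

# Average over the years that actually have a measurement.
known = [temp for _, temp in rows if temp is not None]
print(round(sum(known) / len(known), 1))
```

Keeping the missing year as `None` instead of dropping it means the gap stays visible later, which seems safer than silently averaging over fewer years.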


<i>A little side note on legal stuff: when I familiarized myself with web
scraping (here: https://www.dataquest.io/blog/web-scraping-tutorial-python/), I came across the point that, in theory, I could just use an API instead of scraping anything, but 'luckily' www.wetteronline.de doesn't have an API for historical data :) Funnily enough, the dataquest tutorial also uses a weather page for scraping. Another important point here: dataquest scrapes CURRENT weather data. This could be problematic with www.wetteronline.de's current/forecast weather data, because they sell an API for that data, so it could be subject to copyright. So keep in mind to stay within the law, especially if you go public with your results (a little intro: http://blog.icreon.us/advise/web-scraping-legality) </i>

The data above looks good to me now. It should be really easy to put it all in separate columns in a database.

But first we want *all data* for *all months* for *all categories*.
To do so, we first need to examine the URL more closely:
http://www.wetteronline.de/?pcid=pc_rueckblick_climate&gid=27612&iid=27612&pid=p_rueckblick_climatecalculator&sid=Default&var=TX&analysis=monthly&month=08&startyear=1990&endyear=2017&iid=27612

Let's see how to divide this long piece of art into some readable chunks:
<br>
http://www.wetteronline.de/ 
<br>
?
<br>
pcid=pc_rueckblick_climate
<br>
&
<br>
gid=27612
<br>
&
<br>
iid=27612
<br>
&
<br>
pid=p_rueckblick_climatecalculator
<br>
&
<br>
sid=Default
<br>
&
<br>
var=TX
<br>
&
<br>
analysis=monthly
<br>
&
<br>
month=08
<br>
&
<br>
startyear=1990
<br>
&
<br>
endyear=2017
<br>
&
<br>
iid=27612
<br>
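
The same dissection can be done by Python itself. This is a small sketch using `urlsplit` and `parse_qs` from the standard library's `urllib.parse` module; the URL is the Moscow one from above.

```python
from urllib.parse import urlsplit, parse_qs

url = ('http://www.wetteronline.de/?pcid=pc_rueckblick_climate&gid=27612&iid=27612'
       '&pid=p_rueckblick_climatecalculator&sid=Default&var=TX&analysis=monthly'
       '&month=08&startyear=1990&endyear=2017&iid=27612')

parts = urlsplit(url)            # splits into scheme, host, path, query, fragment
params = parse_qs(parts.query)   # maps each parameter name to a list of its values

print(parts.netloc)              # the host part of the URL
for name, values in params.items():
    print(name, values)
```

Every value comes back as a list because a parameter may appear more than once, which is exactly what happens with `iid` here.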

If we play around with the form at http://www.wetteronline.de/klimarechner/berlin, we can quickly figure out that to get ALL data we have to vary two parameters: var=TX (the category code) and month=08. But before I apply my theories to a real database, I like to verify them with some plain print-outs. This is also the final code for the whole scraping exercise:

In [18]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bs

months = ['01','02','03','04','05','06','07','08','09','10','11','12']
categories = ['TX','TN','NS','FFkmh']

print('this is just a little part of the whole output \n comment out the if parts of the code to have the full outprint')

for category in categories:
    for month in months:
        url = 'http://www.wetteronline.de/?pcid=pc_rueckblick_climate&gid=10382&iid=10382&pid=p_rueckblick_climatecalculator&sid=Default&var=' + category + '&analysis=monthly&month=' + month + '&startyear=1990&endyear=2017&iid=10382'
        req = Request(url,
                      headers={'User-Agent': 'Mozilla/5.0'})
        webpage = urlopen(req).read()
        webpage = webpage.decode('utf-8')
        soup = bs(webpage, 'html.parser')
        table = soup.tbody
        # compare strings with ==; `is` tests object identity and only works by accident here
        if category == 'TX' and month == '01':
            print('\nCategory    Month\n' + category + '          ' + month + '\n')
        for row in table.find_all('tr'):
            for td in row.find_all('td'):
                if category == 'TX' and month == '01':
                    print(td.get_text())

this is just a little part of the whole output 
 comment out the if parts of the code to have the full outprint

Category    Month
TX          01

2017
1.3 °C
2016
2.2 °C
2015
5.0 °C
2014
2.9 °C
2013
1.9 °C
2012
4.0 °C
2011
3.5 °C
2010
-3.1 °C
2009
0.7 °C
2008
6.2 °C
2007
7.6 °C
2006
-0.6 °C
2005
5.2 °C
2004
1.2 °C
2003
2.2 °C
2002
4.5 °C
2001
3.4 °C
2000
3.7 °C
1999
5.9 °C
1998
5.5 °C
1997
0.4 °C
1996
-1.9 °C
1995
2.9 °C
1994
5.3 °C
1993
4.9 °C
1992
3.3 °C
1991
4.7 °C
1990
5.9 °C
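
The loop above glues the URL together with `+`. As a hedged alternative, the query string could be assembled with `urlencode`, which also makes it easy to cover both cities. `build_url` is a hypothetical helper of mine; the ID 27612 is Moscow according to the page title we saw earlier, while 10382 being Berlin is my assumption based on the links used in this tutorial.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.wetteronline.de/'


def build_url(station_id, category, month, start_year=1990, end_year=2017):
    """Assemble a climate calculator URL; `build_url` is a hypothetical helper."""
    params = [
        ('pcid', 'pc_rueckblick_climate'),
        ('gid', station_id),
        ('iid', station_id),
        ('pid', 'p_rueckblick_climatecalculator'),
        ('sid', 'Default'),
        ('var', category),
        ('analysis', 'monthly'),
        ('month', month),
        ('startyear', start_year),
        ('endyear', end_year),
        ('iid', station_id),  # the original URLs repeat iid at the end, so it is kept here
    ]
    return BASE_URL + '?' + urlencode(params)


months = ['{:02d}'.format(m) for m in range(1, 13)]  # '01' ... '12'
categories = ['TX', 'TN', 'NS', 'FFkmh']
stations = {'Berlin': 10382, 'Moscow': 27612}        # Berlin ID is an assumption

urls = [build_url(station_id, category, month)
        for station_id in stations.values()
        for category in categories
        for month in months]
print(len(urls))   # 2 cities * 4 categories * 12 months = 96
print(urls[0])
```

Each of these URLs could then be fed into the same request, decode and soup steps as in the loop above.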


# Conclusion
This is why I love Python: with a few lines of code you can have a bunch of data from a website in your pocket. 
Thanks for sticking around! By now we have managed to scrape the average data of all categories for all months and all historical years back to 1990.

# What's next?
Next I will continue with setting up a database, pushing the data into it and starting to compare and analyze it using different statistical approaches, including data visualization. Any feedback is always welcome, especially on how well this step-by-step format works for you folks.