# Biergartens in Germany - Regional Differences

## Introduction

Bavaria, a region in south-eastern Germany, is well known for its beer culture. The Bavarian purity law, dating back to 1516, today permits four ingredients in beer: water, malt, hops and yeast (the original text named only water, barley and hops). Munich, the capital of Bavaria, is particularly well known for Oktoberfest, which is celebrated elsewhere in Germany as well. Hence, if you are a traveler looking for a once-in-a-lifetime beer experience, you should probably attend Oktoberfest in Munich. But what if that's not possible? You may be traveling at a different time of the year, or in another part of Germany. Where should you go? The obvious solution is to try out one, two or more biergartens, which can be found anywhere in Germany.

This analysis attempts to determine where a traveler can get the best biergarten experience. It does so by asking the following questions and answering them with modern data analysis techniques.

- Where can you find the most biergartens?
- Are biergartens equally popular in different regions?
- Do biergarten reviews hint at where to go?
- Does population structure explain the popularity of biergartens?
- Does the local standard of living explain biergarten density in a region?

## Data

Regional statistics on German population and society will be fetched from the Eurostat City Statistics Database.

https://ec.europa.eu/eurostat/web/cities/data/database

Locations of biergartens will be retrieved from OpenStreetMap, where they are tagged as `amenity=biergarten`. According to the OpenStreetMap Wiki, the tagging is very accurate because a biergarten is distinguished from a beer garden.

https://wiki.openstreetmap.org/wiki/Tag:amenity%3Dbiergarten
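As a rough illustration of how such tagged elements could be requested, the sketch below builds an Overpass QL query for one city. It is not taken from the project's package: the function name `build_biergarten_query`, the lookup of the city by its `name` tag and the timeout value are assumptions made for this example. The query text would be sent to an Overpass API endpoint; no request is made here.

```python
def build_biergarten_query(city_name: str, timeout_s: int = 60) -> str:
    """Return an Overpass QL query for all amenity=biergarten elements in a city.

    Assumption: the city can be found as an administrative area by its
    local OSM `name` tag (e.g. "München" rather than "Munich").
    """
    # Escape backslashes and double quotes so the name stays inside the literal.
    safe_name = city_name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"[out:json][timeout:{timeout_s}];\n"
        f'area["name"="{safe_name}"]["boundary"="administrative"]->.city;\n'
        "(\n"
        '  node["amenity"="biergarten"](area.city);\n'
        '  way["amenity"="biergarten"](area.city);\n'
        '  relation["amenity"="biergarten"](area.city);\n'
        ");\n"
        # "out center" adds a single center coordinate to ways and relations.
        "out center;\n"
    )


query = build_biergarten_query("Leipzig")
print(query)
```

Asking for `out center` means that biergartens mapped as ways or relations come back with one representative coordinate, which is convenient when they later have to be compared with point locations.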

The Foursquare API will be used to fetch reviews of German pubs, bars and restaurants. Biergartens will be identified by matching Foursquare data to OpenStreetMap data, which enables the extraction of biergarten reviews.

# Methodology

The analysis was conducted by comparing the 20 largest cities in Germany. The list of cities was obtained from the [World Population Review website](https://worldpopulationreview.com/countries/germany-population/cities/).

## Data Preparation

Data collection was by far the most tedious task in the project, because it relied on multiple online sources previously unknown to the author. To enable easy and reproducible data collection, a Python package was created to consistently query the Eurostat, OpenStreetMap and Foursquare APIs. The package can be found and installed from the [project GitHub](https://github.com/Mtale/Coursera_Capstone).

Once the data collection package was created, it was used to run the following process:

1) Get all biergartens from OpenStreetMap

2) Match biergartens to Foursquare venues by coordinates, allowing a radius of 25 meters. This phase introduces some inaccuracy because of the different element types in OpenStreetMap: some large biergartens are mapped as ways or relations, and some of them may have dropped out of the analysis if the center point of the polygon was more than 25 meters from the coordinates of the Foursquare venue. Setting the radius is a balancing act: too large a radius lets other pubs and German restaurants close to the actual biergarten enter the dataset. The few such venues that fell within the radius of a biergarten are included in the dataset.

3) Get the likes count and rating of each venue from Foursquare
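The coordinate matching in step 2 can be sketched with a great-circle distance check. This is a minimal illustration, not the project's actual implementation: the function names, the dictionary layout with `lat` and `lon` keys and the use of the haversine formula are assumptions, and the coordinates below are made up.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in meters."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def match_venues(biergarten, venues, radius_m=25.0):
    """Return the venues lying within radius_m of the biergarten, nearest first."""
    hits = []
    for venue in venues:
        d = haversine_m(biergarten["lat"], biergarten["lon"],
                        venue["lat"], venue["lon"])
        if d <= radius_m:
            hits.append((d, venue))
    hits.sort(key=lambda pair: pair[0])  # sort by distance only
    return [venue for _, venue in hits]


# Made-up coordinates: one venue about 11 m away, one about 111 m away.
biergarten = {"lat": 48.0, "lon": 11.0}
venues = [
    {"name": "far venue", "lat": 48.001, "lon": 11.0},
    {"name": "near venue", "lat": 48.0001, "lon": 11.0},
]
print(match_venues(biergarten, venues))
```

Raising `radius_m` makes the trade-off described in step 2 visible: more large biergartens are caught, but so are neighbouring venues.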

Finally, the biergarten data was merged with the Eurostat data to create a single dataframe containing the data needed in the analysis. This phase was executed in the notebook [Data Preparation](https://github.com/Mtale/Coursera_Capstone/blob/master/Data%20Preparation.ipynb).

## Exploratory Data Analysis

Exploratory data analysis (EDA) was conducted to examine the dataset created in the data preparation phase. During the analysis it turned out that a large part of the data acquired from Eurostat was missing, most likely due to the self-set requirement that the data should be at most 4 years old, i.e. from 2016 onwards. Out of 145 Eurostat variables, 41 were excluded from further analysis due to a high number of missing values.
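A filter of this kind could look roughly like the sketch below. The report does not state which cut-off was used, so the 50% threshold, the function name and the variable names in the example are all hypothetical.

```python
def drop_sparse_variables(table, max_missing_share=0.5):
    """Drop variables whose share of missing values exceeds the threshold.

    table: dict mapping variable name -> list of values, one per city,
           with None marking a missing observation.
    max_missing_share: hypothetical cut-off; the report does not state one.
    """
    kept = {}
    for name, values in table.items():
        if not values:
            continue  # no observations at all, nothing to keep
        missing_share = sum(v is None for v in values) / len(values)
        if missing_share <= max_missing_share:
            kept[name] = values
    return kept


# Made-up example: three variables observed for four cities.
example = {
    "var_a": [10.0, 12.5, 11.0, 9.5],    # complete
    "var_b": [1.2, None, 1.4, 1.1],      # 25% missing
    "var_c": [None, None, None, 7.0],    # 75% missing
}
kept = drop_sparse_variables(example)
print(sorted(kept))
```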

In the end, 38 easy-to-understand variables describing the population structure and industrial structure of each city were included in the analysis. Population variables and some variables related to economic activity were already proportional; the numbers of jobs per industry were scaled to jobs per 1,000 inhabitants.

In the last phase of EDA, the most recent observation of each included variable for each city was collected into a tidy dataset with biergartens on rows and variables on columns.
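Picking the most recent observation can be sketched as follows. The record layout (`city`, `variable`, `year`, `value`) and the function name are assumptions for illustration, and the values are made up. The notebook most likely does this with dataframe operations instead.

```python
def latest_observations(records):
    """Keep the most recent non-missing value per (city, variable).

    records: iterable of dicts with keys city, variable, year and value,
             where value is None for a missing observation.
    """
    latest = {}
    for rec in records:
        if rec["value"] is None:
            continue  # a missing value never counts as the latest observation
        key = (rec["city"], rec["variable"])
        if key not in latest or rec["year"] > latest[key]["year"]:
            latest[key] = rec
    return {key: rec["value"] for key, rec in latest.items()}


# Made-up records: the 2019 value is missing, so 2018 is the latest usable one.
records = [
    {"city": "City A", "variable": "population", "year": 2016, "value": 100},
    {"city": "City A", "variable": "population", "year": 2018, "value": 110},
    {"city": "City A", "variable": "population", "year": 2019, "value": None},
    {"city": "City B", "variable": "population", "year": 2017, "value": 50},
]
latest = latest_observations(records)
print(latest)
```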

EDA was executed in the notebook [Exploratory Data Analysis](https://github.com/Mtale/Coursera_Capstone/blob/master/EDA.ipynb).

## Statistical Analysis

The objective of statistical analysis was to answer the predefined questions:

- Where can you find the most biergartens?
- Are biergartens equally popular in different regions?
- Do biergarten reviews hint at where to go?
- Does population structure explain the popularity of biergartens?
- Does the local standard of living explain biergarten density in a region?

The questions were answered mainly by appropriate plotting techniques and visual analysis. An attempt to explain biergarten density per 100,000 people was made using linear regression. The regression models were run on a dataset from which highly correlated variables had been excluded to avoid multicollinearity, thus enabling interpretation of the results.
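One simple way to exclude highly correlated variables is a greedy pass that keeps a variable only if its absolute Pearson correlation with every variable kept so far stays at or below a threshold. The sketch below assumes this greedy order-dependent strategy and a default threshold of 0.7. The exact procedure used in the notebook may differ, and the data is made up.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.7):
    """Greedily keep columns whose |r| with every kept column is <= threshold.

    columns: dict mapping variable name -> list of numeric values.
    Earlier columns win: the result depends on the dict's insertion order.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Made-up data: "b" nearly duplicates "a", while "c" is only moderately related.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [2, -1, 0, 1, -2],
}
selected = drop_correlated(columns)
print(selected)
```

Because the result depends on the order of the variables, placing the most interpretable variables first is a reasonable design choice when using such a filter.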

The analysis was done in the notebook [Analysis](https://github.com/Mtale/Coursera_Capstone/blob/master/Analysis.ipynb).



# Results

## Number of biergartens in the 20 largest cities
Let's start by having a look at where the 20 largest cities in Germany are. The following map shows their locations. The color of each marker depicts the number of biergartens the city hosts per 100,000 people.

<img src="result_charts/map.jpg">

Plotting the number of biergartens in each city alongside the number of biergartens per 100,000 people provides an overall view of the differences in biergarten density between cities. It's easy to see that Berlin has a far lower density than many smaller cities, whereas Leipzig is comparable to Munich and Dresden to Frankfurt.

<img src="result_charts/biergarten_freq.png">

Converting actual numbers to rankings often gives a clearer view of a phenomenon, and so it does here: it's very easy to see that the two largest cities, Hamburg and Berlin, are well off the diagonal, having lower biergarten density than other cities. Three smaller cities, Bielefeld, Mannheim and Bonn, seem to have high density even though they host relatively few biergartens.

<img src="result_charts/biergarten_rank_scatter.png">
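The density and rank conversion behind a chart like this can be sketched in a few lines. The city labels and numbers below are made up for illustration only, and the helper name `rank_desc` is an assumption.

```python
def rank_desc(values):
    """Rank values from largest (rank 1) to smallest.

    Ties are broken by order of appearance, because sorted() is stable.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


# Made-up example: a large city with many biergartens but low density.
cities = ["City A", "City B", "City C"]
counts = [60, 30, 12]
populations = [3_000_000, 600_000, 300_000]

# Biergartens per 100,000 inhabitants.
density = [c * 100_000 / p for c, p in zip(counts, populations)]

count_ranks = rank_desc(counts)
density_ranks = rank_desc(density)
for city, cr, dr in zip(cities, count_ranks, density_ranks):
    print(city, cr, dr)
```

A city lies on the diagonal of the rank scatter when both ranks agree. In this made-up case City A ranks first by count but last by density, so it would sit far from the diagonal.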

## Review of biergarten reviews in Foursquare

Gut feeling says that many reviews may be written not by locals but by tourists. The following chart of likes counts doesn't prove the gut feeling, but it shows that biergartens in the two most popular travel destinations, Berlin and Munich, have received eight times more likes than the venues in other cities. Hence, we gain no wisdom on biergarten popularity by looking at the likes.

<img src="result_charts/likes_count.png">

How about ratings then? There must be differences between ratings in different cities. There are, yes, but the number of rated biergartens is very small in many of the cities. The median rating in Stuttgart is somewhat higher than in other cities. Next in the chart, Dusseldorf has the lowest median of all the cities.

<img src="result_charts/ratings_boxplot.png">
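The numbers behind a boxplot like this (count, quartiles and median per city) can be computed with the standard library, as in the sketch below. The ratings are made up, the function name is an assumption, and the `inclusive` quantile method is chosen here for illustration; plotting libraries may interpolate quartiles slightly differently.

```python
import statistics


def summarize_ratings(ratings_by_city):
    """Return {city: (n, q1, median, q3)} for cities with at least two ratings."""
    summary = {}
    for city, ratings in ratings_by_city.items():
        if len(ratings) < 2:
            continue  # quartiles need at least two data points
        q1, median, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
        summary[city] = (len(ratings), q1, median, q3)
    return summary


# Made-up ratings: City B has too few ratings to summarize.
ratings_by_city = {
    "City A": [8.4, 7.1, 8.0, 8.9, 7.8],
    "City B": [6.5],
}
summary = summarize_ratings(ratings_by_city)
print(summary)
```

Keeping the count `n` next to the quartiles makes it harder to over-interpret a median that rests on only a handful of rated venues.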

## Does population structure or economic activity explain density of biergartens?

Yes, it does, particularly when coupled with the location of the city: the regression model reaches R<sup>2</sup> = 0.85.

To avoid multicollinearity and enable interpretation of regression coefficients, variables with correlation higher than 0.7 were excluded from the analysis. With an intercept of 1.495, the following chart allows us to draw some conclusions:

- The further south the city, the higher the biergarten density per 100,000 people
- The further east the city, the higher the density
- The higher the proportion of small children, the higher the density (note that the proportion of small children correlates with the proportion of adults in their 30s)
- The higher the proportion of young adults, the higher the density
- The higher the proportion of women, the higher the density

<img src="result_charts/regression_coef.png">

Regression plot of actual density and predicted density tells us that most of the predictions are amazingly accurate: only Hamburg and Essen are way off.

<img src="result_charts/regression_plot.png">
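For reference, the coefficient of determination compares the squared prediction errors with the total variation of the actual values. The sketch below shows the standard definition on made-up numbers; it is not the notebook's code, and the function name is an assumption.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes the actual values are not all identical (SS_tot > 0).
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equally long")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up densities and predictions.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r_squared(actual, predicted)
print(round(score, 3))
```

Cities whose points lie far from the regression line contribute most to the residual sum of squares, which is why a few outliers can noticeably lower the score on a sample of only 20 cities.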

# Discussion

Results of the analysis are fascinating! It was a great surprise to the author that biergarten density can be explained so well using only simple, commonly available statistics. It would be interesting to apply the model to other German cities to see whether there is a threshold in population size below which the model stops working. Many German cities are of similar size: only 3 cities have more than a million inhabitants, while 97 cities have a population between 100,000 and 1 million. Hence, there is a chance that the model would also work for cities smaller than the top 20.

The motivation of the analysis was to learn where one should go for a great biergarten experience if Munich is not an option. Based on the analysis, the author's choice would be Leipzig, the only city with a higher density of biergartens than Munich. The 10 rated biergartens in Stuttgart are likely to be worth exploring, based on the high median rating and high third-quartile rating. According to the Foursquare ratings, one should not go to Dusseldorf for biergartens.

# Conclusion

Working with GIS data can be a tedious task for a newcomer. The various data types and their relations take time to grasp, and gaining full knowledge of them was not in the scope of this analysis. Hence, the results of the analysis are advisory, not conclusive. Doing the analysis was a great learning experience that gave the author a rudimentary understanding of GIS data and sparked greater respect for the masters of the field.