# Biergartens in Germany - Where to travel if you can't attend Oktoberfest?

## Introduction

Bavaria, a region in south-eastern Germany, is well known for its beer culture. Under the Bavarian purity law, which dates back to 1516, beer is made of water, malt, hops and yeast (yeast was added to the original list of ingredients later). Munich, the capital of Bavaria, is particularly well known for Oktoberfest, which is celebrated elsewhere in Germany as well. Hence, if you are a traveler looking for a once-in-a-lifetime beer experience, you should probably attend Oktoberfest in Munich. But what if that's not possible? You may be traveling at a different time of the year, or in another part of Germany. Where should you go? The obvious solution is to try out one, two or more biergartens, which can be found anywhere in Germany.

This analysis attempts to determine where a traveler can get the best biergarten experience. It does so by asking the following questions and answering them with data analysis techniques.

- Where can you find the most biergartens?
- Are biergartens equally popular in different regions?
- Do biergarten reviews on Foursquare hint at where to go?
- Do population structure or labour statistics explain the density of biergartens?

## Data

Regional statistics of German population and society were fetched from [Eurostat City Statistics Database](https://ec.europa.eu/eurostat/web/cities/data/database).

Locations of biergartens were obtained from OpenStreetMap, where they are tagged. According to the [OpenStreetMap Wiki](https://wiki.openstreetmap.org/wiki/Tag:amenity%3Dbiergarten), the tagging is very accurate since a biergarten is distinguished from a beer garden.
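
As an illustration of what such a query could look like, here is a minimal sketch that builds an Overpass QL query for the `amenity=biergarten` tag and reads coordinates from the result elements. The helper names `build_biergarten_query` and `element_coordinates` are hypothetical and are not taken from the project package, which may query OpenStreetMap differently. Looking an area up by name alone is also a simplification, since several areas can share a name.

```python
def build_biergarten_query(city_name: str) -> str:
    """Build an Overpass QL query for biergartens inside a named area.

    Hypothetical helper: selecting the area by name only is an assumption
    made for brevity and may match more than one area.
    """
    return (
        '[out:json][timeout:60];'
        f'area["name"="{city_name}"]->.searchArea;'
        # nwr = nodes, ways and relations carrying the tag
        'nwr["amenity"="biergarten"](area.searchArea);'
        # "out center" adds a center point to ways and relations
        'out center;'
    )


def element_coordinates(element: dict) -> tuple:
    """Return (lat, lon) for one element of an Overpass JSON response."""
    if element["type"] == "node":
        return element["lat"], element["lon"]
    # Ways and relations only carry a "center" when queried with "out center"
    center = element["center"]
    return center["lat"], center["lon"]
```

The query string can be sent to any public Overpass endpoint. Nodes carry their own coordinates, while ways and relations are reduced to a single center point.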

The [Foursquare API](https://developer.foursquare.com/docs/) was used to fetch reviews of venues in the categories *german pubs*, *bars* and *restaurants*. Biergartens were found by matching the Foursquare venue location to the OpenStreetMap location, allowing a radius of 25 meters.

## Methodology

The analysis was conducted by comparing the 20 largest cities in Germany. The list of cities was obtained from the [World Population Review website](https://worldpopulationreview.com/countries/germany-population/cities/).

### Data Preparation

Data collection was by far the most tedious task in the project because it involved multiple online sources that were previously unfamiliar to me. To enable easy and reproducible data collection, a Python package was created to consistently query the Eurostat, OpenStreetMap and Foursquare APIs. The package can be found in and installed from the [project GitHub repository](https://github.com/Mtale/Coursera_Capstone).

Once the data collection package was complete, it was used to run the following process:

1) Get all biergartens from OpenStreetMap

2) Match OpenStreetMap biergartens to Foursquare venues by coordinates, allowing a radius of 25 meters. This phase introduces some inaccuracy due to the different element types in OpenStreetMap: some large biergartens are expressed as [ways](https://wiki.openstreetmap.org/wiki/Way) or [relations](https://wiki.openstreetmap.org/wiki/Relation) instead of [nodes](https://wiki.openstreetmap.org/wiki/Node). Some of the large biergartens may have been dropped from the analysis if the center point of the polygon (way or relation) is more than 25 meters away from the coordinates of the Foursquare venue. Setting the radius is a balancing act: too large a radius allows other *pubs* and *german restaurants* next door to the actual biergarten to enter the dataset. The few venues that fit within the radius of a biergarten are included in the dataset.

3) Get the likes count and rating of each venue from Foursquare
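
The matching in step 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the tuple layout `(id, lat, lon)` and the use of the haversine formula are my choices for the example and not necessarily what the project package does.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in meters."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def match_venues(biergartens, venues, radius_m=25.0):
    """Return (biergarten_id, venue_id, distance) for every venue in the radius.

    Both inputs are iterables of (id, lat, lon) tuples (hypothetical layout).
    All venues within the radius are kept, not only the nearest one.
    """
    matches = []
    for b_id, b_lat, b_lon in biergartens:
        for v_id, v_lat, v_lon in venues:
            distance = haversine_m(b_lat, b_lon, v_lat, v_lon)
            if distance <= radius_m:
                matches.append((b_id, v_id, distance))
    return matches
```

For a way or relation, the center point would be passed in as the biergarten coordinate, which is exactly where a large biergarten can end up more than 25 meters away from its Foursquare venue.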

Finally, the biergarten data was merged with the Eurostat data to create a single dataframe containing the data needed in the analysis. The data preparation phase was executed in the notebook [Data Preparation](https://github.com/Mtale/Coursera_Capstone/blob/master/Data%20Preparation.ipynb).

### Exploratory Data Analysis

Exploratory data analysis (EDA) was conducted to examine the quality of the dataset created in the data preparation phase. During the analysis it turned out that a large part of the data acquired from Eurostat was missing - most likely due to the self-set requirement that the data should be at most 4 years old, from 2016 to 2019. Out of 145 variables from Eurostat, 41 were excluded from further analysis due to a high number of missing values.
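
A filter of this kind can be sketched in a few lines of pandas. The function name and the 50 % threshold are assumptions made for the example; the actual cut-off used in the EDA notebook is not stated here.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing_share: float = 0.5) -> pd.DataFrame:
    """Keep only columns whose share of missing values is at most the threshold.

    The default threshold of 0.5 is a hypothetical value for illustration.
    """
    missing_share = df.isna().mean()  # share of NaN values per column
    keep = missing_share[missing_share <= max_missing_share].index
    return df[keep]
```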

In the end, 38 easy-to-understand, non-correlating variables describing the population structure and labour market of each city were included in the analysis. Population variables and some labour market variables were already proportional. The number of jobs in industry was scaled to the number of jobs per 1,000 inhabitants.

In the last phase of EDA, the most recent observation of each variable for each city was included in a tidy dataset with biergartens on rows and variables on columns.
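
Picking the most recent observation could look roughly like the sketch below. It assumes a long-format table with the hypothetical columns `city`, `variable`, `year` and `value`; the real Eurostat extract is shaped differently. The result has one row per city and would still need to be merged onto the biergarten rows.

```python
import pandas as pd


def latest_observations(long_df: pd.DataFrame) -> pd.DataFrame:
    """Return a city x variable table holding the newest non-missing value.

    Expects columns city, variable, year, value (assumed names).
    """
    latest = (
        long_df.dropna(subset=["value"])  # ignore years without data
        .sort_values("year")              # oldest first, newest last
        .groupby(["city", "variable"], as_index=False)
        .last()                           # newest remaining observation
    )
    return latest.pivot(index="city", columns="variable", values="value")
```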

EDA was executed in the notebook [Exploratory Data Analysis](https://github.com/Mtale/Coursera_Capstone/blob/master/EDA.ipynb).

### Statistical Analysis

The objective of statistical analysis was to answer the predefined questions:

- Where can you find the most biergartens?
- Are biergartens equally popular in different regions?
- Do biergarten reviews on Foursquare hint at where to go?
- Do population structure or labour statistics explain the density of biergartens?

The questions were answered with appropriate plotting techniques and visual analysis. Linear regression was used in an attempt to explain biergarten density per 100,000 people. The regression model was run on a dataset from which highly correlated variables had been excluded to avoid multicollinearity and thus enable interpretation of the results.
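
The two steps, dropping highly correlated variables and fitting the regression, can be sketched as follows. This is a minimal sketch and not the code from the Analysis notebook: the helper names are hypothetical, and the choice of scaling the features with `MinMaxScaler` before an ordinary least squares fit is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


def drop_correlated(features: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Drop each column that correlates above the threshold with an earlier column."""
    corr = features.corr().abs()
    # Keep only the upper triangle so every pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


def fit_density_model(features: pd.DataFrame, density: pd.Series) -> Pipeline:
    """Scale features to [0, 1] and fit a linear regression (assumed setup)."""
    model = Pipeline([("scale", MinMaxScaler()), ("ols", LinearRegression())])
    model.fit(features, density)
    return model
```

Scaling every feature to the same range makes the coefficients comparable in size, which helps when reading a coefficient chart. `model.score(features, density)` returns R<sup>2</sup> on the training data.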

The analysis was done in the notebook [Analysis](https://github.com/Mtale/Coursera_Capstone/blob/master/Analysis.ipynb).



## Results

### Number of biergartens in the 20 largest cities
Let's start by having a look at where the 20 largest cities in Germany are. The following map shows their locations. The color of each marker depicts the number of biergartens the city hosts per 100,000 people.

In [1]:
import folium
import numpy as np
import pandas as pd
import seaborn as sns

from geopy.geocoders import Nominatim
from matplotlib import pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Custom modules
# We will use the top20_cities dictionary from here
from openstreetmap import openstreetmap as osm 

In [2]:
# Read the data and sort it by total population
data = pd.read_csv('tidy_data.csv', sep=';')
data = data.sort_values(by='Population on the 1st of January, total', ascending=False).reset_index(drop=True)

# Rename the total population column
data.rename(columns={'Population on the 1st of January, total':'total_population'}, inplace=True)

# Replace unicode characters due to rendering issue in Folium
data = data.replace(to_replace={'ü':'u','ö':'o'}, regex=True)

# Add ratings count per city to tidy data
data['ratings_count'] = data.rating.notnull().groupby(data['city']).transform('sum').astype(int)

# Add likes_count per city to tidy data
data['likes_count'] = data.likes_cnt.groupby(data['city']).transform('sum').astype(int)

# Count ratings to distinct dataframe
data_counts = pd.DataFrame(data.rating.notnull().groupby(data['city'], sort=False).sum().astype(int).reset_index())
data_counts = data_counts.merge(data[['city', 'total_population']], on='city') \
                .drop_duplicates() \
                .reset_index(drop=True)
data_counts.columns = ['city', 'ratings_count', 'total_population']

# Count likes to distinct dataframe
likes_counts = pd.DataFrame(data.likes_cnt.groupby(data['city'], sort=False).sum().astype(int).reset_index())
likes_counts.columns = ['city','likes_count']
data_counts = data_counts.merge(likes_counts, on='city')

# Count number of biergartens per city
no_of_biergartens_city = pd.DataFrame(data.groupby('city', sort=False).count().venue_id).reset_index()
no_of_biergartens_city.columns = ['city', 'biergarten_count']

# Join to count data
data_counts = data_counts.merge(no_of_biergartens_city, on='city')

# Count no of biergartens per 100,000 people
data_counts['biergarten_count_100k'] = data_counts['biergarten_count']/data_counts['total_population']*100000

# Add rank variables to dataset
data_counts['biergarten_rank'] = data_counts['biergarten_count'].rank()
data_counts['biergarten_100k_rank'] = data_counts['biergarten_count_100k'].rank()

In [3]:
# Get coordinates for Germany to center the map
geolocator = Nominatim(user_agent="germany_explorer")
address = 'Germany'
location = geolocator.geocode(address)
germany_latitude = location.latitude
germany_longitude = location.longitude

# Collect coordinates of the cities to be plotted.
# DataFrame.append was removed in pandas 2.0, so the rows are gathered
# in a list first and converted to a dataframe in one step.
city_rows = []
for city in osm.top20_cities.keys():
    address = city + ', Germany'
    location = geolocator.geocode(address)
    city_rows.append({
        'city': city,
        'latitude': location.latitude,
        'longitude': location.longitude,
    })
germany_city_coordinates = pd.DataFrame(city_rows)
    
# Replace unicode characters due to rendering issue in Folium and to match rest of the data
germany_city_coordinates = germany_city_coordinates.replace(to_replace={'ü':'u','ö':'o'}, regex=True)

# Join coordinates to counts data
data_counts = data_counts.merge(germany_city_coordinates, on='city')

# Join coordinates to venue data
data = data.merge(germany_city_coordinates, on='city')

In [4]:
# Initiate map of Germany
map_germany = folium.Map(location=[germany_latitude, germany_longitude], zoom_start=6)

# Loop through data_counts
for city, lat, lng, pop, cnt, cnt_100k, rank, rank_100k in zip(data_counts['city']
                          , data_counts['latitude']
                          , data_counts['longitude']
                          , data_counts['total_population']
                          , data_counts['biergarten_count']
                          , data_counts['biergarten_count_100k']
                          , data_counts['biergarten_rank']
                          , data_counts['biergarten_100k_rank']):
    
    # Generate html to include data in popup
    label = (
            "{city}<br>"
            "Population: {pop}<br>"
            "No of biergartens: {cnt}<br>"
            "No of biergartens per 100,000 people: {cnt_100k}<br>"
           ).format(city=city.upper(),
                    pop=str(int(pop)),
                    cnt=str(int(cnt)),
                    cnt_100k=str(round(cnt_100k, 1)),
                    )
    
    # Set marker color based on the biergarten_count_100k
    if cnt_100k > 5:
        colour = 'darkpurple'
    elif cnt_100k > 4:
        colour = 'red'
    elif cnt_100k > 3:
        colour = 'orange'
    elif cnt_100k > 2:
        colour = 'pink'
    else:
        colour = 'lightgray'
    
    # Add marker
    map_germany.add_child(folium.Marker(
        location=[lat, lng],
        popup=label,
        icon=folium.Icon(
            color=colour,
            prefix='fa',
            icon='circle')))

# Create a legend for the map
legend_html = """
     <div style="position: fixed; bottom: 50px; left: 50px; width: 150px; height: 200px; \
     border:2px solid grey; z-index:9999; font-size:14px;" >
     &nbsp; No of biergartens <br>
     &nbsp; per 100,000 people <br>
     &nbsp; 5 + &nbsp; <i class="fa fa-map-marker fa-2x"
                  style="color:darkpurple"></i><br>
     &nbsp; 4-5 &nbsp; <i class="fa fa-map-marker fa-2x"
                  style="color:red"></i><br>
     &nbsp; 3-4 &nbsp; <i class="fa fa-map-marker fa-2x"
                  style="color:orange"></i><br>
     &nbsp; 2-3 &nbsp; <i class="fa fa-map-marker fa-2x"
                  style="color:pink"></i><br>
     &nbsp; 0-2 &nbsp; <i class="fa fa-map-marker fa-2x"
                  style="color:lightgray"></i></div>
     """
map_germany.get_root().html.add_child(folium.Element(legend_html))
    
# Show the map
map_germany

<br>
Plotting the number of biergartens in each city alongside the number of biergartens per 100,000 people gives an overall view of how biergarten density differs between cities. It's easy to see that Berlin has a much lower density than many smaller cities, whereas Leipzig is comparable to Munich and Dresden is comparable to Frankfurt.
<br>
<br>
<img src="result_charts/biergarten_freq.png">
<br>
Converting actual numbers to rankings often provides a clearer view of a phenomenon. That is the case here: it's very easy to see that the two largest cities, Hamburg and Berlin, are way off the diagonal, having lower than average biergarten density. Three smaller cities, Bielefeld, Mannheim and Bonn, seem to have a high density even though they host relatively few biergartens.
<br>
<br>
<img src="result_charts/biergarten_rank_scatter.png">

### Review of biergarten reviews in Foursquare

Gut feeling says that many reviews are written by tourists rather than locals. The following chart of Foursquare likes doesn't prove the gut feeling, but it shows that biergartens in the two most popular travel destinations, Berlin and Munich, have received eight times more likes than the venues in other cities. Hence, looking at the likes alone tells us little about biergarten popularity.
<br>
<br>
<img src="result_charts/likes_count.png">
<br>
<br>
How about ratings, then? There are differences between cities, but the number of rated biergartens is very small in many of them. The median rating in Stuttgart is somewhat higher than in the other cities. Next to it in the chart, Dusseldorf has the lowest median of all the 20 largest cities.
<br>
<br>
<img src="result_charts/ratings_boxplot.png">

### Does population structure or economic activity explain density of biergartens?

Yes, it does, particularly when coupled with the location of the city.

To avoid multicollinearity and enable interpretation of the regression coefficients, variables with a correlation higher than 0.7 were excluded from the analysis. With an intercept of 1.495 and R<sup>2</sup> = 0.85, the following chart allows us to draw some conclusions:

- The further south the city, the higher the biergarten density per 100,000 people
- The further west the city, the higher the density
- The higher the proportion of small children, the higher the density (note that the proportion of small children correlates with the proportion of adults in their 30s)
- The higher the proportion of young adults, the higher the density
- The higher the proportion of women, the higher the density
<br>
<br>
<img src="result_charts/regression_coef.png">
<br>
<br>
The regression plot of actual density against predicted density tells us that most of the predictions are remarkably accurate: only Hamburg and Essen are way off.
<br>
<br>
<img src="result_charts/regression_plot.png">

## Discussion

The results of the analysis are fascinating! It was a great surprise to me that biergarten density can be explained so well using only simple, commonly available statistics. It would be interesting to apply the model to other German cities to see whether there's a threshold in city size where the model stops working. Many German cities are of similar size: there are only 3 cities hosting more than a million inhabitants and 97 cities with a population between 100,000 and 1 million. Hence, there is a chance that the model would also work for cities smaller than the 20 largest.

The motivation of the analysis was to learn where one should go to get a great biergarten experience if Munich is not an option. Based on the analysis, my choice would be Leipzig, the only city with a higher density of biergartens than Munich. The 10 rated biergartens in Stuttgart are likely to be worth exploring, based on the high median rating and high third quartile rating. According to the Foursquare ratings, one should not go to Dusseldorf for biergartens.

If you are after a good beer and are traveling anyway, it's safe to use this analysis as one of the inputs for choosing a destination. If you are considering founding a biergarten in Germany, please don't use this analysis for decision making. Even though the foundation for a proper analysis is solid, more effort should be put into data processing, particularly into matching biergartens found on OpenStreetMap to venues in Foursquare. You would probably be better off with Foursquare data only, as such a business decision must account for the competition from other types of restaurants, bars and pubs, not only other biergartens.

## Conclusion

Working with GIS data can be a tedious task for a newcomer. The various data types and their relations take time to grasp - mastering them was not in the scope of this analysis. Hence, the results of the analysis are advisory, not conclusive. Doing the analysis was a great learning experience that gave me a rudimentary understanding of GIS data and sparked greater respect for the masters of the field.

The advisory results are very interesting, though. There are big differences in biergarten density among the 20 largest cities in Germany, which moves Leipzig and Stuttgart closer to the top of my list of travel destinations.