In [None]:
import os
import pandas as pd
import numpy as np
import json
import folium
import matplotlib.pyplot as pp

First, we load the Excel file previously downloaded from the Eurostat website. It contains the unemployment rate of European countries in July 2017, the most recent month for which a rate is available for every country. Note that we already preprocessed the data by removing unneeded columns and rows from the ```.xls``` file and by correcting the name of Germany.

We also add a country ID to the dataframe, because it is easier to use as the "key_on" identifier in the choropleth function. We match the IDs from the topojson file to the country names in the dataframe by hand, which is manageable with only about 30 countries.
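
Because the mapping is written by hand, a misspelled or differently labelled country name silently ends up without an ID. A minimal sanity check could look like the sketch below; it runs on a toy frame with made-up names and rates (`toy`, `toy_ids` and the "long label" entry are illustrative assumptions, not the Eurostat export).

```python
import pandas as pd

# Toy data for illustration only (not the Eurostat values).
toy = pd.DataFrame({
    "country": ["Belgium", "Germany (long label)", "France"],
    "rate": [1.0, 2.0, 3.0],
})
toy_ids = {"Belgium": "BE", "Germany": "DE", "France": "FR"}

# Series.map returns NaN for any name missing from the dictionary.
toy["country_id"] = toy["country"].map(toy_ids)

# List the names that did not receive an ID so they can be fixed by hand.
unmapped = toy.loc[toy["country_id"].isna(), "country"].tolist()
print(unmapped)
```

An empty list means every row received an ID; anything else points at a name to correct in the source file or in the dictionary.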

Lastly, we manually add a value for Switzerland because it is absent from the Eurostat dataset. The value was taken from the amstat dataset.

In [None]:
df = pd.read_excel('eurostat2.xlsx')
# errors must be 'raise' or 'coerce'; 'coerce' turns unparsable cells into NaN
df['rate'] = pd.to_numeric(df['rate'], errors='coerce')

country_to_id = {
"Belgium": "BE",
"Bulgaria": "BG",
"Czech Republic": "CZ",
"Denmark": "DK",
"Germany": "DE",
"Estonia": "EE",
"Ireland": "IE",
"Greece": "GR",
"Spain": "ES",
"France": "FR",
"Croatia": "HR",
"Italy": "IT",
"Cyprus": "CY",
"Latvia": "LV",
"Lithuania": "LT",
"Luxembourg": "LU",
"Hungary": "HU",
"Malta": "MT",
"Netherlands": "NL",
"Austria": "AT",
"Poland": "PL",
"Portugal": "PT",
"Romania": "RO",
"Slovenia": "SI",
"Slovakia": "SK",
"Finland": "FI",
"Sweden": "SE",
"United Kingdom": "GB",
"Iceland": "IS",
"Norway": "NO",
"Turkey": "TR"
}
df["country_id"] = df["country"].map(country_to_id)

df.loc[len(df)]=["Switzerland", 3, "CH"]

df

To choose the colouring thresholds for the choropleth map, we looked at some dataframe statistics. Initially, we placed the thresholds at the rate values corresponding to the 25%, 50% and 75% quantiles of the European countries. However, we noticed that with such thresholds, some countries with very similar unemployment rates can end up with drastically different colours. Hence, we decided to plot the rates and to place the thresholds between visible clusters.
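
The quartile-based approach can be sketched as follows. The rates below are toy values chosen for illustration, not the Eurostat data, and `quartile_breaks` and `groups` are names introduced only for this example.

```python
import pandas as pd

# Toy rates for illustration only (not the Eurostat values).
rates = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0])

# Quartile-based breaks: minimum, 25%, 50%, 75% and maximum of the rates.
quartile_breaks = rates.quantile([0, 0.25, 0.5, 0.75, 1]).tolist()
print(quartile_breaks)

# Assign each rate to the bin it falls in (bins are closed on the right).
groups = pd.cut(rates, bins=quartile_breaks, include_lowest=True, labels=False)
print(groups.tolist())
```

In this toy series, 8.0 and 9.0 land in different bins although they differ by one point, while 9.0 and 20.0 share a bin. This is the kind of effect that motivated looking for other thresholds.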

While trying to formalize that approach with a k-means algorithm, we learned about the [Jenks natural breaks optimization](https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization) method, an algorithmic solution designed for this specific problem. Roughly, the principle is to find the cuts that maximize the distance between the groups' means (so that countries shown in different colors are really different) while minimizing the variance within groups (so that countries shown in the same color can be seen as similar). Unfortunately, we found that this method still produces quite "unfair" comparisons between data points (or the implementations we found were not correct).
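
To make the objective concrete, here is a brute-force sketch of the same idea: it tries every way of cutting the sorted values into contiguous groups and keeps the split with the smallest within-group sum of squared deviations. This is not the implementation we tried and not the optimized Jenks algorithm; it is only practical for a few dozen values, and the function names and toy rates are assumptions made for this example.

```python
from itertools import combinations

def within_class_ssd(groups):
    """Sum of squared deviations of each value from its own group mean."""
    total = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        total += sum((x - mean) ** 2 for x in g)
    return total

def natural_breaks_bruteforce(values, n_classes):
    """Try every split of the sorted values into n_classes contiguous groups
    and return the one with the smallest within-class sum of squared deviations."""
    xs = sorted(values)
    best_score, best_groups = float("inf"), None
    for cuts in combinations(range(1, len(xs)), n_classes - 1):
        bounds = (0,) + cuts + (len(xs),)
        groups = [xs[a:b] for a, b in zip(bounds, bounds[1:])]
        score = within_class_ssd(groups)
        if score < best_score:
            best_score, best_groups = score, groups
    return best_groups

# Toy rates for illustration only (not the Eurostat values).
toy_rates = [3.0, 3.5, 4.0, 7.0, 7.5, 8.0, 17.0, 20.0]
print(natural_breaks_bruteforce(toy_rates, 3))
```

Thresholds can then be placed anywhere between the largest value of one group and the smallest value of the next.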

We finally decided that, for the scope of this homework, setting the thresholds manually was a viable strategy, with the advantage of being justifiable. For instance, as we are asked to compare Switzerland against other countries, it makes sense to have finer granularity around its rate value.

We therefore define the ```plot_groups``` function to help us set and understand the threshold values.

In [None]:
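def plot_groups(df, col, breaks):
    """Plot the values of df[col] on a line, one colour per group delimited by breaks."""
    for i in range(len(breaks)-1):
        # Values between two consecutive breaks form one group (bounds included)
        points = df[df[col].between(breaks[i], breaks[i+1])][col]
        pp.plot(points, np.zeros_like(points), 'o')
    
    # Mark the breaks themselves with vertical ticks
    pp.plot(breaks, np.zeros_like(breaks), '|', markersize=30)
    pp.show()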

breaks = [0, 4, 7,  10, 14, 20.2]
plot_groups(df, "rate", breaks)

Finally, we construct the choropleth map showing the unemployment rate in Europe in July 2017 using these thresholds. The choice of color palette is not discussed, as it would mainly depend on the target audience of the visualization.

In [None]:
# Load the TopoJSON file describing the European countries
with open('topojson/europe.topojson.json') as f:
    geo_json_data = json.load(f)

# Keep only the geometries of the countries for which we have data
known_ids = set(df["country_id"])
geo_json_data["objects"]["europe"]["geometries"] = [
    g for g in geo_json_data["objects"]["europe"]["geometries"] if g["id"] in known_ids
]

m_europe = folium.Map([52.5, 15], tiles='cartodbpositron', zoom_start=4)
m_europe.choropleth(geo_data=geo_json_data, topojson='objects.europe',
                    key_on='id',
                    data=df, columns=['country_id', 'rate'],
                    threshold_scale=breaks,
                    fill_color='OrRd',
                    #fill_opacity=0.7, 
                    #line_opacity=0.2,
                    legend_name='Unemployment rate in July 2017 in Europe (%)')

display(m_europe)
print("Choropleth map of the unemployment rates in Europe. ")

Switzerland has a 3% unemployment rate, which is very low compared to the rest of Europe: it is among the 25% of countries with the lowest rates in September 2017.

## The amstat dataset

For the amstat dataset, unemployment rates per canton can only be exported grouped by nationality (Swiss vs foreigners) OR by age group (or not grouped at all). We chose to work from 3 different exports (one per grouping option), so that we do not have to transform the rates ourselves, which would require additional sources of data (including the active population count per canton).
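
To illustrate why the active population would be needed, the sketch below combines two group rates into an overall rate. It assumes the rate is defined as the number of unemployed divided by the active population; the function name and all numbers are hypothetical and do not come from amstat.

```python
def combined_rate(rates, active_population):
    """Rate of the merged groups: total unemployed / total active population (in %)."""
    unemployed = sum(r / 100 * p for r, p in zip(rates, active_population))
    return 100 * unemployed / sum(active_population)

# Hypothetical canton: group rates in percent and made-up active population counts.
rates = [2.0, 6.0]                  # e.g. Swiss, foreigners
active_population = [30000, 10000]

weighted = combined_rate(rates, active_population)  # weighted by group size
naive = sum(rates) / len(rates)                     # ignores group sizes
print(weighted, naive)
```

The two results differ as soon as the groups have different sizes, so averaging the exported group rates would not give the ungrouped rate.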

Again, we map the cantons to their respective codes manually, as there are only 26 of them.

Finally, we extend the ungrouped DataFrame with the rate difference between the *Swiss* and the *foreigners* groups to help us answer question 3.
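
The difference relies on pandas aligning the two groups on their shared index. A minimal sketch on toy data follows; the level names mirror the ones used in this notebook, while the numbers are made up.

```python
import pandas as pd

# Toy data for illustration only; the rates are made up.
toy_nat = pd.DataFrame({
    "cant_id": ["ZH", "ZH", "GE", "GE"],
    "nat": ["Suisses", "Etrangers", "Suisses", "Etrangers"],
    "rate": [2.0, 5.0, 4.0, 7.5],
}).set_index(["cant_id", "nat"])

# xs drops the "nat" level, leaving one rate per canton for each group.
foreigners = toy_nat.xs("Etrangers", level="nat")["rate"]
swiss = toy_nat.xs("Suisses", level="nat")["rate"]

# Subtraction aligns on the remaining index (cant_id), not on row position.
diff = foreigners - swiss
print(diff.to_dict())
```

Because the result is indexed by canton, it can be assigned directly as a new column of a frame that uses the same index.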

In [None]:
ur_nat_raw = pd.read_excel("amstat_nat.xlsx")
ur_all_raw = pd.read_excel("amstat_all.xlsx")

cant_to_id = {
    "Berne": "BE",
    "Soleure": "SO",
    "Argovie": "AG",
    "Bâle-Campagne": "BL",
    "Bâle-Ville": "BS",
    "Appenzell Rhodes-Extérieures": "AR",
    "Appenzell Rhodes-Intérieures": "AI",
    "Glaris": "GL",
    "Grisons": "GR",
    "Schaffhouse": "SH",
    "St-Gall": "SG",
    "Thurgovie": "TG",
    "Lucerne": "LU",
    "Nidwald": "NW",
    "Obwald":"OW",
    "Schwyz": "SZ",
    "Uri": "UR",
    "Zoug": "ZG",
    "Zurich": "ZH",
    "Fribourg":"FR",
    "Jura":"JU",
    "Neuchâtel": "NE",
    "Genève": "GE",
    "Valais": "VS",
    "Vaud": "VD",
    "Tessin": "TI"
}

ur_nat_raw["cant_id"] = ur_nat_raw["cant"].map(cant_to_id)
ur_all_raw["cant_id"] = ur_all_raw["cant"].map(cant_to_id)

ur_nat = ur_nat_raw.set_index(["ling_reg", "reg", "cant", "cant_id", "nat"])
ur = ur_all_raw.set_index(["ling_reg", "reg", "cant", "cant_id"])

for col in ["unempl_c", "jobseek_c", "empl_jobseek_c"]:
    ur_nat[col] = pd.to_numeric(ur_nat[col].str.replace("'", ""))

ur["rate_nat_diff"] = ur_nat.xs("Etrangers", level="nat")["rate"] - ur_nat.xs("Suisses", level="nat")["rate"]
ur

We can now create the choropleth map showing the unemployment rates, again using manually set values to cut the data points into groups.

In [None]:
breaks = [0, 1.4, 3.9, 4.9 ,6]

# Load the TopoJSON file describing the Swiss cantons
with open("topojson/ch-cantons.topojson.json") as f:
    cantons_b = json.load(f)

m_rate = folium.Map([46.90, 8.50], zoom_start=8)
m_rate.choropleth(geo_data=cantons_b,
             threshold_scale=breaks,
             topojson="objects.cantons", 
             data=ur["rate"].reset_index(level=[0,1,2], drop=True), 
             fill_color='OrRd', 
             key_on="id",
             legend_name="Unemployment Rate")

plot_groups(ur, "rate", breaks)


Before displaying the map, we create a second one showing the rate difference between the *Swiss* and *Foreigners* categories, so that we can display the two maps side by side.

In [None]:
breaks = [0, 1.9, 2.7, 3.4, 4.5, 5.5]

m_nat_diff = folium.Map([46.90, 8.50], zoom_start=8)     
m_nat_diff.choropleth(geo_data=cantons_b,
             threshold_scale=breaks,
             topojson="objects.cantons", 
             data=ur["rate_nat_diff"].reset_index(level=[0,1,2], drop=True), 
             fill_color='OrRd', 
             key_on="id",
             legend_name="Unemployment rate difference (foreigners minus Swiss)")

plot_groups(ur, "rate_nat_diff", breaks)

We now display the maps from questions 1) and 2) side by side and provide a succinct analysis.

In [None]:
display(m_rate)
display(m_nat_diff)