In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import json
import folium # Check description below if not already installed

from datetime import date

data_foleder = 'data'
json_folder = 'topojson'

unemployement_eu = os.path.join(data_foleder, 'eurostat.csv')
unemployement_ch = os.path.join(data_foleder, 'unemployement_rate_ch.xlsx')
unemployement_sex_ch = os.path.join(data_foleder, 'unemployement_rate_sex_ch.xlsx')
unemployement_factor_ch = os.path.join(data_foleder, 'unemployement_rate_factor_ch.xlsx')
unemployement_nat_ch = os.path.join(data_foleder, 'unemployement_rate_nat_ch.xlsx')
topojson_ch = os.path.join(json_folder, 'ch-cantons.topojson.json')
topojson_eu = os.path.join(json_folder, 'europe.topojson.json')

# 0. Description 

## Background
In this homework we will be exploring interactive visualization, which is a key ingredient of many successful data visualizations (especially when it comes to infographics).

Unemployment rates are major economic metrics and a matter of concern for governments around the world. Though the definition may seem straightforward at first glance (usually the number of unemployed people divided by the active population), it can be tricky to define consistently. For example, one must define what exactly unemployed means: looking for a job? Having declared their unemployment? Currently without a job? Should students or recent graduates be included? We could also wonder what the active population is: everyone in an age category (e.g. `16-64`)? Anyone interested in finding a job? Though these questions may seem subtle, they can have a large impact on the interpretation of the results: `3%` unemployment doesn't mean much if we don't know who is included in this percentage.
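To make the effect of the definition concrete, here is a minimal sketch with made-up head counts. All numbers and variable names are hypothetical and do not come from Eurostat or amstat; the sketch only shows how the same region can report two different rates.

```python
# Hypothetical head counts for an imaginary region (not real data).
employed = 9_000
registered_unemployed = 300   # without a job and officially registered
unregistered_seekers = 150    # without a job, looking, but not registered


def rate(unemployed, active_population):
    """Return the unemployment rate in percent."""
    return 100 * unemployed / active_population


# Narrow definition: only registered unemployed people count.
narrow = rate(registered_unemployed, employed + registered_unemployed)

# Broad definition: everyone without a job who is looking for one counts.
broad_unemployed = registered_unemployed + unregistered_seekers
broad = rate(broad_unemployed, employed + broad_unemployed)

print(f"narrow: {narrow:.2f}%  broad: {broad:.2f}%")
```

Note that the denominator changes together with the numerator: whoever is counted as unemployed is also part of the active population.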

In this homework you will be dealing with two different datasets from the statistics offices of the European Commission ([eurostat](http://ec.europa.eu/eurostat/data/database)) and the Swiss Confederation ([amstat](https://www.amstat.ch)). They provide a variety of datasets with plenty of information on many different statistics and demographics at their respective scales. Unfortunately, as is often the case in data analysis, these websites are not always straightforward to navigate. They may include a lot of obscure categories, not always be translated into your native language, have strange link structures, … Navigating this complexity is part of a data scientist's job: you will have to use a few tricks to get the right data for this homework.

For the visualization part, install [Folium](https://github.com/python-visualization/folium) (*HINT*: it is not available in your standard Anaconda environment, therefore search on the Web how to install it easily!). Folium's `README` comes with very clear examples, and links to their own iPython Notebooks -- make good use of this information. For your own convenience, in this same directory you can already find two `.topojson` files, containing the geo-coordinates of 

- European countries (*liberal definition of EU*) (`topojson/europe.topojson.json`, [source](https://github.com/leakyMirror/map-of-europe))
- Swiss cantons (`topojson/ch-cantons.topojson.json`) 

These will be used as an overlay on the Folium maps.

## Assignment

1. Go to the [eurostat](http://ec.europa.eu/eurostat/data/database) website and try to find a dataset that includes the European unemployment rates at a recent date.

   Use this data to build a [Choropleth map](https://en.wikipedia.org/wiki/Choropleth_map) which shows the unemployment rate in Europe at a country level. Think about [the colors you use](https://carto.com/academy/courses/intermediate-design/choose-colors-1/), how you decided to [split the intervals into data classes](http://gisgeography.com/choropleth-maps-data-classification/) or which interactions you could add in order to make the visualization intuitive and expressive. Compare Switzerland's unemployment rate to that of the rest of Europe.

2. Go to the [amstat](https://www.amstat.ch) website to find a dataset that includes the unemployment rates in Switzerland at a recent date.

   > *HINT* Go to the `details` tab to find the raw data you need. If you do not speak French, German or Italian, think of using free translation services to navigate your way through. 

   Use this data to build another Choropleth map, this time showing the unemployment rate at the level of Swiss cantons. Again, try to make the map as expressive as possible, and comment on the trends you observe.

   The Swiss Confederation defines the rates you have just plotted as the number of people looking for a job divided by the size of the active population (scaled by 100). This is surely a valid choice, but as we discussed one could argue for a different categorization.

   Copy the map you have just created, but this time don't count in your statistics people who already have a job and are looking for a new one. How do your observations change ? You can repeat this with different choices of categories to see how selecting different metrics can lead to different interpretations of the same data.

3. Use the [amstat](https://www.amstat.ch) website again to find a dataset that includes the unemployment rates in Switzerland at a recent date, this time making a distinction between *Swiss* and *foreign* workers.

   The State Secretariat for Economic Affairs (SECO) releases [a monthly report](https://www.seco.admin.ch/seco/fr/home/Arbeit/Arbeitslosenversicherung/arbeitslosenzahlen.html) on the state of the employment market. In the latest report (September 2017), it is noted that there is a discrepancy between the unemployment rates for *foreign* (`5.1%`) and *Swiss* (`2.2%`) workers.

   Show the difference in unemployment rates between the two categories in each canton on a Choropleth map (*hint* The easy way is to show two separate maps, but can you think of something better ?). Where are the differences most visible ? Why do you think that is ?

   Now let's refine the analysis by adding the differences between age groups. As you may have guessed it is nearly impossible to plot so many variables on a map. Make a bar plot, which is a better suited visualization tool for this type of multivariate data.

4. *BONUS*: using the map you have just built, and the geographical information contained in it, could you give a *rough estimate* of the difference in unemployment rates between the areas divided by the [Röstigraben](https://en.wikipedia.org/wiki/R%C3%B6stigraben)?

## Folium

#### Installation

We recommend using the latest version : `0.5.0`.

`Folium` is a regular Python package, which can be installed in several ways:

#### 1. Conda
```
conda install folium
```

#### 2. pip

```
pip install -U folium
``` 

By default, the `pip` command is linked to the local `python` distribution. To use Folium in your notebook, make sure you use the `pip` bundled with `anaconda`. On macOS, for example, this is usually:
```
~/anaconda/bin/pip install -U folium
```

#### 3. Direct download

The package is available [directly from pypi](https://pypi.python.org/pypi/folium)
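
If it is unclear which Python environment the notebook is running in, a small check like the following can help. This is only a sketch using the standard library; the helper name `is_installed` is made up for illustration.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if `package_name` can be imported by this interpreter."""
    return importlib.util.find_spec(package_name) is not None


# Path of the interpreter that runs this notebook. Running
# "<that path> -m pip install -U folium" installs into the same environment.
print(sys.executable)
print("folium installed:", is_installed("folium"))
```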

---

## 1. Eurostat

The Eurostat website offers many interesting European statistics. For our project we use this one: [Employment rates by sex, age and citizenship](http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=lfsq_ergan&lang=en). It covers every European country from the first quarter of 2015 to the second quarter of 2017. Note that, as its title says, this table reports employment rates rather than unemployment rates; further down we take 100 minus the employment rate as a proxy, which also counts people who are not looking for a job.
We download it in .csv format and import it into a pandas DataFrame for further processing.

In [None]:
eurostat = pd.read_csv(unemployement_eu)
eurostat.head(10)

In [None]:
print('Values in TIME: {}'.format(eurostat['TIME'].unique()))
print('Values in SEX: {}'.format(eurostat['SEX'].unique()))
print('Values in AGE: {}'.format(eurostat['AGE'].unique()))
print('Values in CITIZEN: {}'.format(eurostat['CITIZEN'].unique()))
print('Values in UNIT: {}'.format(eurostat['UNIT'].unique()))
print('Values in Flag and Footnotes: {}'.format(eurostat['Flag and Footnotes'].unique()))

In [None]:
print('Values in GEO: {}'.format(eurostat['GEO'].unique()))

In [None]:
def convert_time(time):
    quarter = int(time[time.find('Q')+1])
    year = int(time[0:4])
    return pd.Timestamp(date(year, quarter*3, 1) ) 

eurostat.drop(['AGE', 'CITIZEN', 'UNIT', 'Flag and Footnotes'], axis=1, inplace=True)
eurostat.TIME = eurostat.TIME.apply(convert_time)
eurostat = eurostat.loc[eurostat.TIME == eurostat.TIME.max()]
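
As a quick check of the date conversion, the sketch below applies `convert_time` to a few labels, assuming the `TIME` column uses the `2017Q2` style that the function expects. It also shows an alternative based on `pd.Period`, which maps a quarter to its first day rather than to the first day of its last month; either works for picking the most recent quarter.

```python
from datetime import date

import pandas as pd


def convert_time(time):
    """Map a label such as '2017Q2' to the first day of the quarter's last month."""
    quarter = int(time[time.find('Q') + 1])
    year = int(time[0:4])
    return pd.Timestamp(date(year, quarter * 3, 1))


labels = ['2015Q1', '2016Q4', '2017Q2']  # assumed label format
converted = [convert_time(label) for label in labels]

# Alternative: let pandas parse the quarter and take its first day.
quarter_starts = [pd.Period(label).start_time for label in labels]

print(max(converted))
print(quarter_starts[-1])
```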

In [None]:
# Sanity check: only the most recent quarter should remain
eurostat.TIME.value_counts()

In [None]:
drop_geo = ['European Union (28 countries)', 'European Union (27 countries)',
            'European Union (15 countries)', 'Euro area (19 countries)', 
            'Euro area (18 countries)', 'Euro area (17 countries)']

# Drop the aggregate rows (EU, euro area) and keep individual countries only
eurostat = eurostat.loc[~eurostat.GEO.isin(drop_geo)]

In [None]:
eurostat.GEO.unique()

In [None]:
with open(topojson_eu) as f:
    data_topojson_eu = json.load(f)
geo_eu_countries = []
for country in data_topojson_eu['objects']['europe']['geometries']:
    geo_eu_countries.append(country['properties']['NAME'])
print('Countries JSON EU:\n{}'.format(geo_eu_countries))

In [None]:
# Country names present in the data but missing from the TopoJSON file
eurostat.loc[~eurostat.GEO.isin(geo_eu_countries), 'GEO'].unique()

In [None]:
# Rename the two countries whose names differ from the TopoJSON file
eurostat['GEO'] = eurostat.GEO.replace(
    {'Germany (until 1990 former territory of the FRG)': 'Germany',
     'Former Yugoslav Republic of Macedonia, the': 'The former Yugoslav Republic of Macedonia'})

In [None]:
json_id_keep = [country['properties']['NAME'] in eurostat.GEO.values 
                for country in data_topojson_eu['objects']['europe']['geometries']]

data_topojson_eu['objects']['europe']['geometries'] = np.array(data_topojson_eu['objects']['europe']['geometries'])[json_id_keep].tolist()
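
The cell above removes from the TopoJSON every geometry for which we have no data, so that those countries are not drawn on the map. The same filtering can be sketched with a plain list comprehension on a toy structure. The dictionary below is a heavily shortened, hypothetical stand-in for the real file and only mimics the keys used in this notebook.

```python
# Toy stand-in for the TopoJSON structure (hypothetical, heavily shortened).
toy_topojson = {'objects': {'europe': {'geometries': [
    {'properties': {'NAME': 'France'}},
    {'properties': {'NAME': 'Atlantis'}},   # no data available for this one
    {'properties': {'NAME': 'Italy'}},
]}}}
countries_with_data = {'France', 'Italy'}

# Keep only the geometries whose name appears in the data.
geometries = toy_topojson['objects']['europe']['geometries']
toy_topojson['objects']['europe']['geometries'] = [
    g for g in geometries if g['properties']['NAME'] in countries_with_data
]

kept = [g['properties']['NAME'] for g in toy_topojson['objects']['europe']['geometries']]
print(kept)
```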

In [None]:
eurostat = eurostat.pivot_table(index=['GEO', 'TIME'], columns='SEX', values='Value')
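# The table holds employment rates; 100 - rate is the share NOT in employment,
# used here as a rough proxy (it also includes people who are not looking for a job)
eurostat = 100-eurostat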
eurostat['Females_o_Males'] = eurostat['Females']/eurostat['Males']
eurostat.head()
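
Because the table used here contains employment rates, `100 - eurostat` is not the unemployment rate in the usual sense. The sketch below illustrates the gap with made-up head counts, assuming the employment rate is the number of employed people divided by the whole population of the age group. All figures are hypothetical.

```python
# Hypothetical head counts for a working-age population (not real data).
employed = 700
unemployed = 50    # without a job and looking for one
inactive = 250     # students, retirees, people not looking for work

population = employed + unemployed + inactive

employment_rate = 100 * employed / population
non_employment_rate = 100 - employment_rate          # what `100 - eurostat` gives
unemployment_rate = 100 * unemployed / (employed + unemployed)

print(non_employment_rate, round(unemployment_rate, 2))
```

The two numbers differ widely because the inactive population enters the first one but not the second, so the maps in this section are best read as "share not in employment".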

In [None]:
europe_coordinates = [54.5, 15.3]
range_value =  np.linspace(eurostat.Total.min(), eurostat.Total.max(), 6).tolist()
europe_map = folium.Map(location = europe_coordinates, zoom_start = 3)
europe_map.choropleth(geo_data=data_topojson_eu, data=eurostat.reset_index(),
                      columns=['GEO', 'Total'], threshold_scale=range_value,
                      key_on='feature.properties.NAME', topojson='objects.europe',
                      fill_color='YlGnBu', fill_opacity=0.7, line_opacity=0.2,
                      legend_name='Percentage not in employment (100 - employment rate), total population')
europe_map
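
The thresholds passed to `threshold_scale` above are equally spaced between the minimum and the maximum. With skewed data this can put almost every country in the same class. A possible alternative is quantile breaks, sketched here with made-up rates (not the Eurostat values) and NumPy only.

```python
import numpy as np

# Made-up rates in percent, skewed by two high values (not real data).
rates = np.array([3.1, 3.4, 3.8, 4.0, 4.2, 4.5, 5.0, 5.6, 11.9, 17.2])

# Six thresholds define five classes in both cases.
equal_width = np.linspace(rates.min(), rates.max(), 6)
quantile = np.quantile(rates, np.linspace(0, 1, 6))

# Number of values falling into each class.
equal_counts, _ = np.histogram(rates, bins=equal_width)
quantile_counts, _ = np.histogram(rates, bins=quantile)

print("equal width:", equal_counts)
print("quantiles:  ", quantile_counts)
```

Equal-width classes keep the legend easy to read, while quantile classes spread the colors more evenly across regions; which one is preferable depends on what the map should emphasize.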

In [None]:
europe_map = folium.Map(location = europe_coordinates, zoom_start = 3)
range_value =  np.linspace(eurostat.Females_o_Males.min(), eurostat.Females_o_Males.max(), 6).tolist()

europe_map.choropleth(geo_data=data_topojson_eu, data=eurostat.reset_index(),
                      columns=['GEO', 'Females_o_Males'], threshold_scale=range_value,
                      key_on='feature.properties.NAME', topojson='objects.europe',
                      fill_color='YlGnBu', fill_opacity=0.7, line_opacity=0.2,
                      legend_name='Non-employment ratio Females/Males')
europe_map

## 2. Switzerland

In [None]:
ch_stat = pd.read_excel(unemployement_ch, skiprows=4, header=None, usecols=np.array([0,2]))
ch_stat_sex = pd.read_excel(unemployement_sex_ch, skiprows=4, header=None, usecols=np.array([0,1,3]))

ch_stat.columns = ['canton', 'rate']
ch_stat_sex.columns = ['canton', 'sex', 'rate']

ch_stat = pd.concat((ch_stat, ch_stat_sex))
# Rows from the overall file have no sex: label them 'Total'; translate the German labels
ch_stat.replace({np.nan: 'Total', 'Männer': 'Males', 'Frauen': 'Females'}, inplace=True)
ch_stat.head()

In [None]:
with open(topojson_ch) as f:
    data_topojson_ch = json.load(f)
geo_ch_cantons = []

for canton in data_topojson_ch['objects']['cantons']['geometries']:
    geo_ch_cantons.append(canton['properties']['name'])
print('Cantons JSON CH:\n{}'.format(geo_ch_cantons))

In [None]:
ch_stat.canton.unique()

In [None]:
ch_stat.loc[ [canton not in geo_ch_cantons for canton in ch_stat.canton], 'canton' ].unique()

In [None]:
# Map the German canton names from amstat to the names used in the TopoJSON file
ch_stat['canton_json'] = ch_stat.canton.replace(
                    {'Bern': 'Bern/Berne', 'Freiburg': 'Fribourg', 'Graubünden': 'Graubünden/Grigioni', 
                     'Tessin': 'Ticino', 'Waadt': 'Vaud', 'Wallis': 'Valais/Wallis',
                     'Neuenburg': 'Neuchâtel', 'Genf': 'Genève'})
ch_stat = ch_stat[ch_stat.canton != 'Gesamt']
ch_stat.head()

In [None]:
ch_stat = ch_stat.pivot_table(index=['canton', 'canton_json'], columns='sex', values='rate')
ch_stat.columns = ['females_rate', 'males_rate', 'total_rate']
ch_stat['female_o_males'] = ch_stat['females_rate']/ch_stat['males_rate']
ch_stat.reset_index(inplace=True)
ch_stat.head()
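
The column renaming above relies on `pivot_table` returning the `sex` columns in alphabetical order (`Females`, `Males`, `Total`). The toy example below, with made-up rates, shows that order and a variant that renames by label so it does not depend on it.

```python
import pandas as pd

# Toy long-format table (made-up rates for two cantons).
toy = pd.DataFrame({
    'canton': ['Zürich', 'Zürich', 'Zürich', 'Vaud', 'Vaud', 'Vaud'],
    'sex':    ['Total', 'Males', 'Females', 'Total', 'Males', 'Females'],
    'rate':   [3.3, 3.4, 3.2, 4.3, 4.4, 4.2],
})

wide = toy.pivot_table(index='canton', columns='sex', values='rate')
print(list(wide.columns))   # column order produced by pivot_table

# Renaming by label works whatever the column order is.
wide = wide.rename(columns={'Females': 'females_rate',
                            'Males': 'males_rate',
                            'Total': 'total_rate'})
print(wide)
```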

In [None]:
def plot_ch_choropletch(df, df_cols, legend='', ch_coordinates=[46.8, 8.2]):
    """Draw a choropleth map of the Swiss cantons.

    df_cols is [key column matching the TopoJSON canton names, value column to plot].
    """
    # Six equally spaced thresholds between the smallest and the largest value
    range_value = np.linspace(df[df_cols[1]].min(), df[df_cols[1]].max(), 6).tolist()
    map_ = folium.Map(location=ch_coordinates, zoom_start=8)
    map_.choropleth(geo_data=data_topojson_ch, data=df.reset_index(),
                    columns=df_cols, threshold_scale=range_value,
                    key_on='feature.properties.name', topojson='objects.cantons',
                    fill_color='YlGnBu', fill_opacity=0.7, line_opacity=0.2,
                    legend_name=legend)
    return map_

In [None]:
ch_map = plot_ch_choropletch(ch_stat, ['canton_json', 'total_rate'], 
                             legend='Unemployment rate, total population (%)')
ch_map

In [None]:
ch_map = plot_ch_choropletch(ch_stat, ['canton_json', 'female_o_males'], 
                             legend='Unemployment ratio Females/Males')
ch_map

## Factors

In [None]:
ch_stat_fact = pd.read_excel(unemployement_factor_ch, skiprows=4, header=None, 
                             usecols=np.array([0,3,4,5,6,7]))
ch_stat_fact.columns = ['canton', 'unemp_number','unemp_young', 
                   'unemp_longterm', 'unemp_seeker_number', 'seeker']
ch_stat_fact = ch_stat_fact[ch_stat_fact.canton != 'Gesamt']
ch_stat_fact.head()

In [None]:
ch_stat = ch_stat.merge(ch_stat_fact, on='canton')
ch_stat.head()

In [None]:
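# total_rate = 100 * unemp_number / active population, so the active population
# of each canton can be recovered from the published rate and the head count
ch_stat['canton_active_pop'] =  (100/ch_stat.total_rate) * ch_stat.unemp_number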
ch_stat['unemp_rate_seeker'] = 100*(ch_stat.unemp_seeker_number/ch_stat['canton_active_pop'])
ch_stat['unemp_rate_no_young'] = 100*((ch_stat.unemp_number-ch_stat.unemp_young)/ch_stat['canton_active_pop'])
ch_stat['unemp_rate_longterm'] = 100*((ch_stat.unemp_longterm)/ch_stat['canton_active_pop'])
ch_stat['unemp_rate_seeker_diff'] = ch_stat.unemp_rate_seeker - ch_stat.total_rate
ch_stat['unemp_rate_no_young_diff'] = ch_stat.unemp_rate_no_young - ch_stat.total_rate
ch_stat['unemp_rate_longterm_diff'] = ch_stat.unemp_rate_longterm - ch_stat.total_rate
ch_stat.head()
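
The derivation used in this cell can be followed on a single canton. The figures below are made up and only serve to show the arithmetic: the active population is recovered from the published rate and the head count, then reused as the denominator for the other categories.

```python
# Made-up figures for a single canton (not real data).
total_rate = 3.0              # published unemployment rate, in percent
unemp_number = 1_500          # registered unemployed
unemp_seeker_number = 2_400   # registered job seekers

# rate = 100 * unemployed / active population, hence:
active_pop = 100 * unemp_number / total_rate

# Rate for another category, using the same denominator.
seeker_rate = 100 * unemp_seeker_number / active_pop
seeker_diff = seeker_rate - total_rate

print(active_pop, seeker_rate, seeker_diff)
```

If the published rates are rounded (for example to one decimal), the recovered active population is only approximate, and so are the rates derived from it.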

In [None]:
ch_map = plot_ch_choropletch(ch_stat, ['canton_json', 'unemp_rate_seeker_diff'], 
                             legend='Job-seeker rate minus unemployment rate (percentage points)')
ch_map

In [None]:
ch_map = plot_ch_choropletch(ch_stat, ['canton_json', 'unemp_rate_no_young_diff'], 
                             legend='Unemployment rate excluding young people minus unemployment rate (percentage points)')
ch_map

In [None]:
ch_map = plot_ch_choropletch(ch_stat, ['canton_json', 'unemp_rate_longterm'], 
                             legend='Long-term unemployment rate (%)')
ch_map

In [None]:
# Country-wide rates: pool the canton counts, then divide by the total active population
total_active_pop = ch_stat.canton_active_pop.sum()
swiss_unemp_seek = 100*ch_stat.unemp_seeker_number.sum()/total_active_pop
# Same numerator as unemp_rate_no_young above: unemployed minus young unemployed
swiss_unemp_no_young = 100*(ch_stat.unemp_number.sum()-ch_stat.unemp_young.sum())/total_active_pop
swiss_unemp_longterm = 100*ch_stat.unemp_longterm.sum()/total_active_pop

print(swiss_unemp_seek, swiss_unemp_no_young, swiss_unemp_longterm)
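
The national figures are computed by summing counts over cantons before dividing, not by averaging the cantonal rates. The sketch below, with two made-up cantons of very different size, shows why the two approaches can disagree.

```python
# Two made-up cantons of very different size (not real data).
cantons = [
    {'name': 'Large', 'unemployed': 20_000, 'active_pop': 500_000},
    {'name': 'Small', 'unemployed': 200, 'active_pop': 20_000},
]

# Pooled rate: sum the counts first, then divide.
pooled = (100 * sum(c['unemployed'] for c in cantons)
          / sum(c['active_pop'] for c in cantons))

# Unweighted mean of the cantonal rates: every canton weighs the same.
mean_of_rates = sum(100 * c['unemployed'] / c['active_pop'] for c in cantons) / len(cantons)

print(round(pooled, 2), mean_of_rates)
```

The pooled rate stays close to the large canton's rate, whereas the unweighted mean gives the small canton as much influence as the large one.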