# 03 - Interactive Viz

## Deadline

Wednesday November 8th, 2017 at 11:59PM

## Important Notes

- Make sure you push your Notebook to GitHub with all the cells already evaluated
- Note that maps do not render in a standard GitHub environment : you should export them to HTML and link them in your notebook.
- Remember that `.csv` is not the only data format. Though they might require additional processing, some formats provide better encoding support.
- Don't forget to add a textual description of your thought process, the assumptions you made, and the solution you plan to implement!
- Please write all your comments in English, and use meaningful variable names in your code

## Background

In this homework we will be exploring interactive visualization, which is a key ingredient of many successful data visualizations (especially when it comes to infographics).

Unemployment rates are major economic metrics and a matter of concern for governments around the world. Though the definition of an unemployment rate may seem straightforward at first glance (usually the number of unemployed people divided by the active population), it can be tricky to define consistently. For example, one must define what exactly unemployed means : looking for a job ? Having declared their unemployment ? Currently without a job ? Should students or recent graduates be included ? We could also wonder what the active population is : everyone in an age category (e.g. `16-64`) ? Anyone interested in finding a job ? Though these questions may seem subtle, they can have a large impact on the interpretation of the results : `3%` unemployment doesn't mean much if we don't know who is included in this percentage.

In this homework you will be dealing with two different datasets from the statistics offices of the European commission ([eurostat](http://ec.europa.eu/eurostat/data/database)) and the Swiss Confederation ([amstat](https://www.amstat.ch)). They provide a variety of datasets with plenty of information on many different statistics and demographics at their respective scales. Unfortunately, as is often the case in data analysis, these websites are not always straightforward to navigate. They may include a lot of obscure categories, not always be translated into your native language, have strange link structures, … Navigating this complexity is part of a data scientist's job : you will have to use a few tricks to get the right data for this homework.

For the visualization part, install [Folium](https://github.com/python-visualization/folium) (*HINT*: it is not available in your standard Anaconda environment, therefore search on the Web how to install it easily!). Folium's `README` comes with very clear examples, and links to their own iPython Notebooks -- make good use of this information. For your own convenience, in this same directory you can already find two `.topojson` files, containing the geo-coordinates of

- European countries (*liberal definition of EU*) (`topojson/europe.topojson.json`, [source](https://github.com/leakyMirror/map-of-europe))
- Swiss cantons (`topojson/ch-cantons.topojson.json`)

These will be used as an overlay on the Folium maps.

## Assignment

1. Go to the [eurostat](http://ec.europa.eu/eurostat/data/database) website and try to find a dataset that includes the European unemployment rates at a recent date.

   Use this data to build a [Choropleth map](https://en.wikipedia.org/wiki/Choropleth_map) which shows the unemployment rate in Europe at a country level. Think about [the colors you use](https://carto.com/academy/courses/intermediate-design/choose-colors-1/), how you decided to [split the intervals into data classes](http://gisgeography.com/choropleth-maps-data-classification/) or which interactions you could add in order to make the visualization intuitive and expressive. Compare Switzerland's unemployment rate to that of the rest of Europe.

2. Go to the [amstat](https://www.amstat.ch) website to find a dataset that includes the unemployment rates in Switzerland at a recent date.

   > *HINT* Go to the `details` tab to find the raw data you need. If you do not speak French, German or Italian, think of using free translation services to navigate your way through.

   Use this data to build another Choropleth map, this time showing the unemployment rate at the level of Swiss cantons. Again, try to make the map as expressive as possible, and comment on the trends you observe.

   The Swiss Confederation defines the rates you have just plotted as the number of people looking for a job divided by the size of the active population (scaled by 100). This is surely a valid choice, but as we discussed one could argue for a different categorization.

   Copy the map you have just created, but this time don't count in your statistics people who already have a job and are looking for a new one. How do your observations change ? You can repeat this with different choices of categories to see how selecting different metrics can lead to different interpretations of the same data.

3. Use the [amstat](https://www.amstat.ch) website again to find a dataset that includes the unemployment rates in Switzerland at a recent date, this time making a distinction between *Swiss* and *foreign* workers.

   The Economic Secretary (SECO) releases [a monthly report](https://www.seco.admin.ch/seco/fr/home/Arbeit/Arbeitslosenversicherung/arbeitslosenzahlen.html) on the state of the employment market. In the latest report (September 2017), it is noted that there is a discrepancy between the unemployment rates for *foreign* (`5.1%`) and *Swiss* (`2.2%`) workers.

   Show the difference in unemployment rates between the two categories in each canton on a Choropleth map (*hint* The easy way is to show two separate maps, but can you think of something better ?). Where are the differences most visible ? Why do you think that is ?

   Now let's refine the analysis by adding the differences between age groups. As you may have guessed it is nearly impossible to plot so many variables on a map. Make a bar plot, which is a better suited visualization tool for this type of multivariate data.

4. *BONUS*: using the map you have just built, and the geographical information contained in it, could you give a *rough estimate* of the difference in unemployment rates between the areas divided by the [Röstigraben](https://en.wikipedia.org/wiki/R%C3%B6stigraben)?

# Solution

Importing all the libraries needed.
**For this work we need the library 'jenkspy': https://pypi.python.org/pypi/jenkspy/0.1.0**

To install it type: `pip install jenkspy`

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl
%matplotlib inline
import json
import branca.colormap as cm # for color steps
from IPython.core import display as ICD
import folium
from folium import plugins
import os
import jenkspy

We load the topojson files that we are going to use for the choropleth maps

In [2]:
JSON_FOLDER = r'topojson/'
# europe
topojson_europe_path = JSON_FOLDER + r'europe.topojson.json'
topojson_europe = json.load(open(topojson_europe_path))
# suisse
topojson_swiss_path = JSON_FOLDER + r'ch-cantons.topojson.json'
topojson_swiss = json.load(open(topojson_swiss_path))

Now, using the tool provided at http://jeffpaine.github.io/geojson-topojson/ we convert the `*.topojson` files to the `*.geojson` format. We put the `*.geojson` files in the same folder `JSON_FOLDER`.

**The geojson files will not be used for the choropleth map, but just to add the popups, since they are easier to handle with folium**

After having obtained such `*.geojson` files for both Europe and Switzerland, we load them into the kernel:

In [3]:
# europe data
europe_geojson_path = JSON_FOLDER + r'europe.geojson.json'
geojson_europe = json.load(open(europe_geojson_path))

# swiss data
swiss_geojson_path = JSON_FOLDER + r'ch-cantons.geojson.json'
geojson_swiss = json.load(open(swiss_geojson_path))
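
As a side note, the `topojson='objects.europe'` and `key_on='id'` arguments used in the maps further down refer to the internal layout of a TopoJSON file: geometries are grouped under named entries of the top-level `objects` member, and each geometry can carry an `id`. The sketch below illustrates this on a tiny hand-written topology. The topology content and the helpers `list_geometry_ids` and `missing_codes` are assumptions made for illustration; they are not the content of the provided files nor part of Folium.

```python
import json

# Hypothetical, heavily simplified topology: the real files contain many more
# geometries and real "arcs", so this one is not meant to be rendered.
TINY_TOPOLOGY = json.loads("""
{
  "type": "Topology",
  "objects": {
    "europe": {
      "type": "GeometryCollection",
      "geometries": [
        {"type": "Polygon", "id": "CH", "properties": {"NAME": "Switzerland"}, "arcs": [[0]]},
        {"type": "Polygon", "id": "AT", "properties": {"NAME": "Austria"}, "arcs": [[1]]}
      ]
    }
  },
  "arcs": []
}
""")


def list_geometry_ids(topology, object_name):
    """Return the ids of all geometries stored under objects.<object_name>."""
    geometries = topology["objects"][object_name]["geometries"]
    return [geometry["id"] for geometry in geometries]


def missing_codes(topology, object_name, data_codes):
    """Return the geometry ids that have no matching code in the data."""
    ids = set(list_geometry_ids(topology, object_name))
    return sorted(ids - set(data_codes))


print(list(TINY_TOPOLOGY["objects"]))              # names usable after "objects."
print(list_geometry_ids(TINY_TOPOLOGY, "europe"))  # values matched by key_on='id'
print(missing_codes(TINY_TOPOLOGY, "europe", ["CH", "BE"]))
```

A check of this kind can help to spot countries or cantons that stay uncolored because their code in the data does not match any `id` in the geometry file.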

# Question 1: visual analysis of the unemployment in Europe

The choice of the dataset was forced by the need to also have data for Switzerland. The only dataset that we found on the _Eurostat_ website that contains data for Switzerland is the following:

**Long-term unemployment rate, by sex**       _from eurostat_

link: http://ec.europa.eu/eurostat/web/products-datasets/-/tsdsc330

_Short description:_ the share of long-term unemployment is the share of unemployed persons since 12 months or more in the total active population, expressed as a percentage. The total active population (labour force) is the total number of the employed and unemployed population. The duration of unemployment is defined as the duration of a search for a job or as the period of time since the last job was held (if this period is shorter than the duration of the search for a job).

In this dataset, some values come with a letter next to them. The meaning of these letters is explained below:
* **b** break in the time series
* **e** estimated
* **u** low reliability

Unavailable data are indicated with a colon **' : '**
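
To make the flag convention concrete, here is a minimal plain-Python sketch that splits a raw cell into its numeric value and its flag. It assumes that the flag is separated from the number by a space (as in `0.7 u` in the table shown further down); the helper `parse_cell` and the dictionary `FLAG_MEANINGS` are hypothetical names, and the notebook itself cleans the data with pandas instead.

```python
import math

# Flags described above; any other letter is passed through unchanged.
FLAG_MEANINGS = {
    "b": "break in the time series",
    "e": "estimated",
    "u": "low reliability",
}


def parse_cell(raw):
    """Split a raw cell such as '0.7 u' into (value, flag).

    Missing data (':') becomes NaN; cells without a flag get an empty flag.
    """
    parts = raw.split()
    if not parts:
        return math.nan, ""
    value = math.nan if parts[0] == ":" else float(parts[0])
    flag = parts[1] if len(parts) > 1 else ""
    return value, flag


for cell in ["1.7", "0.7 u", ":"]:
    value, flag = parse_cell(cell)
    print(repr(cell), "->", value, FLAG_MEANINGS.get(flag, "no flag"))
```

Keeping the flag next to the value, instead of deleting it right away, makes it possible to count or plot the unreliable cells before deciding which years to keep.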

Let's import the data; the usual cleaning operations on the dataframe are described with in-line comments to improve readability.

In [None]:
DATA_FOLDER = 'data/'
eurodf = pd.read_csv(DATA_FOLDER + 'tsdsc330.tsv' , sep='\t|,|[|]', engine='python')
# replace the ':' markers (missing data) with 'nan'
eurodf.replace(to_replace=':', value='nan',inplace=True, regex=True)
# dropping useless columns
eurodf.drop(['indic_em','age','unit'],inplace=True, axis=1)
# renaming the second column (the one holding the country codes)
eurodf.rename(columns={eurodf.columns[1]: 'country code'}, inplace=True)
# removing the 'b' and 'e' flags attached to some values ('u' is kept for now, it is analysed below)
eurodf.replace(['b','e'],['',''],regex=True,inplace=True)
eurodf.head()

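| | sex | country code | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | ... | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | AT | | | | | | | 1.1 | 1.0 | ... | 1.5 | 1.0 | 1.1 | 1.0 | 1.1 | 1.1 | 1.2 | 1.4 | 1.4 | 1.7 |
| 1 | F | BE | | | | 6.0 | 4.7 | 3.5 | 4.3 | 4.2 | ... | 4.3 | 3.6 | 3.6 | 4.1 | 3.6 | 3.2 | 3.7 | 3.8 | 3.9 | 3.8 |
| 2 | F | BG | | | | | 9.4 | 11.9 | 11.4 | 8.6 | ... | 4.5 | 3.1 | 3.1 | 4.4 | 5.5 | 5.8 | 6.6 | 6.0 | 5.0 | 4.1 |
| 3 | F | CH | | | | | | | | | ... | | | | 1.9 | 1.7 | 1.6 | 1.7 | 1.9 | 1.9 | 1.9 |
| 4 | F | CY | | | | | | | | | ... | 0.7 u | 0.5 u | 0.6 u | 1.3 | 1.5 | 3.1 | 5.6 | 7.0 | 6.2 | 5.1 |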


## 1.1) Exploratory analysis

From the `eurodf.head()` command we immediately notice that:
* the data for every nation are split with respect to the categorical variable `"sex"`, which can take three different values: `['F','M','T']`.
* a lot of NaN values are present in the first years, from 1996 to roughly 2005

We are interested in how many values are flagged as "low reliability", i.e., as already explained, marked by a 'u' next to the value. If the dataset contains a lot of such values, the analysis could lose credibility. We decided to do the following:
* we count how many uncertain values we have, to get an idea of their weight inside the dataset
* we plot a heatmap of the uncertain values: in this way a single image shows both how many uncertain values there are compared to the others and how they are spread across the dataframe

In [None]:
### creating a dataframe with binary values {0,1} corresponding to safe data and uncertain data
### at the end of this cell we plot heatdf to obtain the desired heatmap
# work on an explicit copy, so that pandas does not raise a SettingWithCopyWarning
heatdf = eurodf[eurodf.columns[2:]].copy()

# initializing the counter of uncertain values
count = 0
# let's count them!
for col in heatdf.columns:
    # flag the cells that carry the 'u' (low reliability) marker
    is_uncertain = eurodf[col].astype(str).str.contains('u')
    count += int(is_uncertain.sum())
    heatdf[col] = is_uncertain.astype(int)
print('Inside the dataframe there are %d uncertain values' % count)

fig, ax = plt.subplots(1)
ax = sns.heatmap(data=heatdf, ax=ax, cbar=False, yticklabels=False)


We notice that:
* _the uncertainties in the data are not spread randomly_ across the dataset, but are in particular rows of the dataframe. Since to every nation correspond three rows in the dataframe, we conclude that the uncertainties are due to few, well specified nations, and we can trust the major part of the dataset.
* _almost all the uncertain data refers to years from 1996 to 2009_.

Thanks to this pattern, we decided to drop all the years before 2010 to have a robust analysis.

We continue our exploratory analysis by plotting a timeserie of the unemployment rates for every nation, to have another visual look at our data. From these timeseries we hope to gain some insights on the data and a better understanding of our dataset.

In [None]:
# Now we can get rid of the remaining non-numeric flag ('u') that accompanies some numbers:
eurodf.replace('u','',regex=True,inplace=True)
# here we keep only the last 8 years, from 2009 to 2016
N = 8
years = list(eurodf.columns[-N:].astype(int))

# we convert the values to floats
eurodf[eurodf.columns[2:]]=eurodf[eurodf.columns[2:]].astype(float)

# splitting the dataset by sex: total (T), male (M) and female (F)
totaldf = eurodf[eurodf["sex"]=='T'].reset_index(drop=True)
maledf = eurodf[eurodf["sex"]=='M'].reset_index(drop=True)
femaledf = eurodf[eurodf["sex"]=='F'].reset_index(drop=True)
# Let's plot the time series!
fig, axes =  plt.subplots(12,3, figsize=[20,40])
for i,ax in enumerate(axes.reshape(-1)):
    mpl.style.use('default')
    # the x values are plain integer years, so a regular line plot is enough
    ax.plot(years, totaldf.iloc[i, -N:].astype(float), 'C1', label="T")
    ax.plot(years, maledf.iloc[i, -N:].astype(float), 'C2', label="M")
    ax.plot(years, femaledf.iloc[i, -N:].astype(float), 'C3', label="F")
    # the second column holds the country code
    ax.set_title(totaldf.iloc[i, 1])
    ax.set_ylim(0,35)
    ax.set_xlim(2009,2016)
    ax.set_facecolor('white')
    #ax.set_xticks(np.arange(1996,2016,4))
    ax.legend()
    ax.grid()
fig.tight_layout()
plt.show()

From these plots we see that a large share of the nations show a comparable unemployment rate (between 5% and 10%), while a few have a very high or very low one. Also, this value can vary considerably over these 8 years (see the graph of Greece, EL).

## 1.2) Visualization of the data using Folium

In the cell below we define the code needed to display the Choropleth map using Folium.
Further comments are inline, to improve the readability of the notebook. Here we just say that:

* the total unemployment rate is visualized on the Choropleth map.
* to display the data for every year, an interactive map would be needed. Since we couldn't find a way to save this kind of map in `.html`, we opted not to use such a tool and _we limited ourselves to displaying the data for a single year (2016)._
* the default bins for the legend have been used (this choice is discussed afterwards)
* **a popup has been added to the map**: by clicking on a nation, the user can still see the distinction between male and female unemployment by accessing the raw data for that nation. In this way no information is lost

In [None]:
#############################################################

# to be decomposed in smaller functions for readability

def europe_map(date,bins=None):
    ### INPUTS:  date -> the year we want to show on our map
    ###          bins -> the list of bins of the legend of the Choropleth map
    ### OUTPUTS: m    -> the map created

    m = folium.Map([50,5], tiles='Stamen Toner', zoom_start=4)

    m.choropleth(geo_data = json.load(open(topojson_europe_path)),
                 data = eurodf.loc[(eurodf['sex']=='T')],
                 columns = ['country code', str(date)],
                 key_on = 'id',
                 threshold_scale = bins,
                 topojson = 'objects.europe',
                 fill_color = 'YlOrRd', fill_opacity=1,
                 line_color = 'black', line_weight = 1, line_opacity=0.9,
                 smooth_factor = 1,
                 reset = True, # False by default, put true if you wanna remove previous layers
                 highlight = True, # hovering
                 legend_name='National unemployment in percents')

    for feature in geojson_europe['features']:
        id_ = feature['id']
        name_ = feature['properties']['NAME']
        text = name_ #'Country \n' + id_
        additional_text = ''
        html = ''
        if id_ in list(eurodf['country code']):
            female_unemp = eurodf.loc[(eurodf['sex']=='F') & (eurodf['country code']==id_),str(date)].values
            male_unemp = eurodf.loc[(eurodf['sex']=='M') & (eurodf['country code']==id_),str(date)].values
            total_unemp = eurodf.loc[(eurodf['sex']=='T') & (eurodf['country code']==id_),str(date)].values
            d = {'females': female_unemp, 'males': male_unemp,'total': total_unemp}
            df_tmp = pd.DataFrame(data=d)
            #df_tmp.set_index[['unemployment']]
            html = df_tmp.to_html(classes='table table-striped table-hover table-condensed table-responsive')
        #elif:

        #text += '\n' + additional_text
        pops = folium.GeoJson(
           feature,
            name = name_,
            # overlay=True, ??
            # Leaflet path options use camelCase keys (fillOpacity, fillColor)
            style_function=lambda feature: {
            'fillOpacity': 0 ,
            'fillColor': '#ffffff',
            'color' : 'black',
            'weight' : 0.05,
            #'dashArray' : '5, 5',
            'highlight' : True # note: not a standard Leaflet path option
            },
            #highlight_function=highlight_function(feature)
        ).add_child(folium.Popup(text + html))

        pops.add_to(m) # add popup for that country



    plugins.Fullscreen( # to have the full screen option top-right corner
        position='topright',
        title='Expand me',
        title_cancel='Exit me',
        force_separate_button=True).add_to(m)

    # folium.LayerControl().add_to(m) # to have the button for layer control (top-right corner as well)
    return m

### ---------------------------------------------------------------------
### ------------------------ Let's visualize it! ------------------------
### ---------------------------------------------------------------------

m = europe_map(2016)
m.save(os.path.join('results', 'europe_default.html'))
m

We immediately notice that the default subdivision of the nations into equally spaced groups is not ideal: the two colours reserved for unemployment rates between 10 and 16 are not used, since there is no nation in this range!
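
The effect can be reproduced on a small scale. The sketch below compares equal-width bins with quantile-based bins on a hypothetical, right-skewed list of rates (these are not the Eurostat values). Quantile bins are used here only as a simple stand-in to show the idea of data-driven breaks; they are not the Jenks breaks used later in this notebook, and the helper names are made up for the example.

```python
import bisect
import statistics

# Hypothetical, right-skewed rates (in percent); not the Eurostat values.
rates = [1.0, 1.5, 1.8, 2.0, 2.2, 2.5, 2.9, 3.5, 4.0, 5.0, 6.5, 17.0]
N_BINS = 6


def equal_width_edges(values, n_bins):
    """Edges that split [min, max] into n_bins intervals of equal width."""
    low, high = min(values), max(values)
    return [low + k * (high - low) / n_bins for k in range(n_bins + 1)]


def quantile_edges(values, n_bins):
    """Edges chosen so that every bin holds roughly the same number of values."""
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    return [min(values)] + cuts + [max(values)]


def count_per_bin(values, edges):
    """Count the values in each bin; the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        index = bisect.bisect_right(edges, value) - 1
        counts[min(index, len(counts) - 1)] += 1
    return counts


equal_counts = count_per_bin(rates, equal_width_edges(rates, N_BINS))
quantile_counts = count_per_bin(rates, quantile_edges(rates, N_BINS))
print("equal width:", equal_counts)
print("quantiles:  ", quantile_counts)
```

With these hypothetical values the equal-width split leaves two of the six bins empty and puts eight values in the first one, while the quantile split places two values in every bin.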

Let's visualize the distribution of our data:

In [None]:
fig,ax = plt.subplots(1)
plt.hist((list(eurodf['2016'].loc[eurodf['sex']=='T'].dropna().astype(float))),bins=np.arange(1,20,1))

We see that the distribution is really skewed, so it is no surprise that equally spaced bins did not work well for the Choropleth map. Let's try the Jenks natural breaks algorithm:

In [None]:
breaks = jenkspy.jenks_breaks(list(eurodf['2016'].loc[eurodf['sex']=='T'].dropna().astype(float)), nb_class=5)
m2 = europe_map(2016, breaks)
m2.save(os.path.join('results', 'europe_Jenks.html'))
m2

We can see that the resulting map is much better, as more differences between nations are visible:
* Italy, Slovakia and Czech Republic are now distinguished from France and Ireland
* there are still a lot of nations in the (1,3) range that cannot be distinguished

_As the histogram above suggests, the countries in the (1,3) range are evenly distributed between the subranges (1,2) and (2,3)_; let's try to add these ranges, possibly sacrificing the distinction between Spain and Macedonia.

In [None]:
breaks = [1,2,3,5,6.7,19.2]
m_final = europe_map(2016, breaks)
m_final.save(os.path.join('results', 'europe_final.html'))
m_final

In [None]:
plt.hist((list(eurodf['2016'].loc[eurodf['sex']=='T'].dropna().astype(float))),bins=breaks, color=['#393E41'])

The map looks much better! From the histogram of our new subdivision, we can appreciate that:
* the first 4 bins contain a comparable number of countries, between 7 and 10 each
* the last bin contains only 4 nations, but it is the one with the widest range. It is reasonable to put only a few countries into this bin so as not to lose too much information.

At this point we are satisfied with the results obtained and we can proceed to the next task.

## 1.3) Comparison of Switzerland and the rest of Europe

The last map gives us all the information needed to compare Switzerland to the rest of Europe: Switzerland is in the first bin of our Choropleth map together with countries like Sweden, Norway, Germany, Austria, Czech Republic, and Denmark. With a total long-term unemployment rate of 1.8% in 2016, Switzerland has one of the lowest unemployment rates in Europe.

# 2) Unemployment rates in the Swiss cantons

Below we prepare the dataset for the analysis of the Swiss unemployment rate. Briefly, here is the workflow that we followed:

* we import the data into the dataframe `chdf2` and clean it appropriately
* we define the months of interest: from September 2016 to September 2017
* **Basing our reasoning on the hypothesis stated in the assignment, "The Swiss Confederation defines the rates you have just plotted as the number of people looking for a job divided by the size of the active population (scaled by 100)"**, we use the available data to calculate different unemployment rates, corresponding to the following metrics:
    * Unemployed people as provided by amstat.ch
    * Unemployed people, excluding people who are looking for a different job but still have one


* at the end we define a function `select_data(df, when, what)` that extracts a single column `what` of data from our dataframe `chdf2` for the month `when`, to ease the construction of the Choropleth maps as much as possible. This function takes advantage of the file **cantons_code.csv**, which we built manually to associate every canton name with its code. In this way we managed to avoid additional boilerplate code.
* we observed that, **following that assumption, the new unemployment rates were sometimes negative!**
* **We then revised our hypothesis** to the following one:

### Hypothesis: The Swiss Confederation defines the rates you have just plotted as the number of people looking for a job _that don't have a job_ divided by the size of the active population (scaled by 100)

So, to obtain the unemployment rate of people who are looking for a job _whether they have a job or not_, we had to **add** to the unemployment rate provided by amstat.ch the contribution of the people who are looking for a job but still have one, appropriately extrapolated from the absolute numbers provided in the dataset.
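
The arithmetic behind the two readings can be checked on a small example. The sketch below mirrors the scaling used in the notebook functions `new_unemployment_rate` and `new_tasso_add`, but on plain numbers; the function names and the figures (a rate of 3.0%, 1000 registered unemployed, 400 or 1500 employed job seekers) are hypothetical and not taken from amstat.

```python
def rate_without_employed_seekers(rate, registered_unemployed, employed_seekers):
    """Assignment reading: the published rate already counts every job seeker,
    so the employed job seekers are subtracted from it."""
    return rate * (registered_unemployed - employed_seekers) / registered_unemployed


def rate_with_employed_seekers(rate, registered_unemployed, employed_seekers):
    """Revised reading: the published rate counts only job seekers without a job,
    so the employed job seekers are added on top of it."""
    return rate * (1 + employed_seekers / registered_unemployed)


# Hypothetical canton: 3.0% published rate, 1000 registered unemployed.
print(round(rate_without_employed_seekers(3.0, 1000, 400), 2))
print(round(rate_with_employed_seekers(3.0, 1000, 400), 2))

# If employed job seekers outnumber the registered unemployed, the first
# reading produces a negative rate, which cannot be a valid percentage.
print(round(rate_without_employed_seekers(3.0, 1000, 1500), 2))
```

The last call shows why the first reading had to be abandoned: as soon as a canton has more employed job seekers than registered unemployed, the subtraction yields a negative rate.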

In [None]:
### ----------------------------------------------------
### FOLLOWING THE ASSUMPTION GIVEN US IN THE ASSIGNMENT
### we calculate the new rate, and in the end we'll
### observe something strange!!
### ----------------------------------------------------

# this function is defined just for readability
def new_unemployment_rate(months, df, new_column_name, category_to_subtract):
    ### OUTPUTS: adds a column with the new UNEMPLOYMENT RATE WITHOUT PEOPLE THAT ARE LOOKING FOR A JOB BUT
    ### STILL HAVE A JOB based on the HYPOTHESIS OF THE ASSIGNMENT -> (the contribution of
    ### category_to_subtract is subtracted)

    for month in months:
        df[month,new_column_name] = df[month,'Tasso di disoccupazione']*(df[month,'Disoccupati registrati']-
             df[month,category_to_subtract]) /df[month,'Disoccupati registrati']

months = ['Settembre 2016',
          'Ottobre 2016',
          'Novembre 2016',
          'Dicembre 2016',
          'Gennaio 2017',
          'Febbraio 2017',
          'Marzo 2017',
          'Aprile 2017',
          'Maggio 2017',
          'Giugno 2017',
          'Luglio 2017',
          'Agosto 2017',
          'Settembre 2017']

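def select_data(df, when, what):
    # use the dataframe passed as argument instead of the global chdf2
    temp = pd.merge(cantons_code,df[when],left_on='Canton',right_index=True,how='inner')[['code',what]]
    # rename the selected column so that the map helpers can always rely on 'unemployment'
    return temp.rename(columns={what:'unemployment'})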

### here we provided a .csv file with the cantons code, necessary for the plot!
cantons_code = pd.read_csv(DATA_FOLDER + 'cantons_code.csv')
chdf2 = pd.read_excel(DATA_FOLDER + 'ch_punto2.xlsx',header=[0,1])

#creating the NEW unemployment rate following the assumption in the assignment
new_unemployment_rate(months, chdf2, 'unemployment rate without employed people', 'Persone in cerca d\'impiego non disoccupate')

# dropping useless columns
chdf2.drop('mese', axis = 1, level=0,inplace=True)

# sorting the index
chdf2.sort_index(level=0,axis=1, inplace=True)

There is no need to understand the code above in detail; just take a look at these examples of how `select_data` works:

### example 1
extracting the original unemployment rate ('Tasso di disoccupazione') for October 2016

In [None]:
select_data(chdf2,'Ottobre 2016', 'Tasso di disoccupazione').head()

### example 2
extracting the newly calculated unemployment rate for September 2017

In [None]:
select_data(chdf2,'Settembre 2017', 'unemployment rate without employed people').head()

### There are negative values!!!!
This can only happen because in our data there are cantons with more "people looking for a job that already have a job" than "unemployed people". This is clearly absurd if we believe the hypothesis stated in the assignment. So we revert what has been done and follow the hypothesis stated before, which we repeat here:

_Hypothesis: The Swiss Confederation defines the rates you have just plotted as **the number of people looking for a job that don't have a job** divided by the size of the active population (scaled by 100)_

In [None]:
# we drop the clearly wrong results we just obtained
chdf2.drop('unemployment rate without employed people',level=1,axis=1,inplace=True)

# we redefine the function following the new hypothesis:
def new_tasso_add(months, df, new_column_name, category_to_add):
    for month in months:
        df[month,new_column_name] = df[month,'Tasso di disoccupazione']*(1+df[month,category_to_add]/df[month,'Disoccupati registrati'])

# we recalculate them with the new hypothesis
new_tasso_add(months, chdf2, 'unemployment rate with employed people', 'Persone in cerca d\'impiego non disoccupate')

#dropping useless columns
chdf2.drop(['Persone in cerca d\'impiego non disoccupate','Disoccupati registrati', 'Disoccupati dei giovani'], axis=1, level=1,inplace=True)

# Let's visualize everything in a cool Choropleth map!

Below are the usual helper functions to generate the Choropleth map.

In [None]:
# viz functions

def add_features(m,data1,data2,name1='data_1',name2='data_2'):
    for feature in geojson_swiss['features']:
        id_ = feature['id']
        name_ = feature['properties']['name']
        text = name_
        html = ''
        if id_ in list(data1['code']):
            tmp1 = data1.loc[data1['code']==id_ , 'unemployment'].values
            tmp2 = data2.loc[data2['code']==id_ , 'unemployment'].values
            d = {name1 : tmp1, name2 : tmp2, ' ': 'unemployment (%)' }
            df_tmp = pd.DataFrame(data=d)
            df_tmp.set_index(' ',inplace=True)
            html = df_tmp.to_html(classes='table table-striped table-hover table-condensed table-responsive')
        #elif:

        pops = folium.GeoJson(
           feature,
            name = name_,
            overlay=True,
            style_function=lambda feature: {
            'fillOpacity': 0 ,  # Leaflet path options use camelCase keys
            'fillColor': None,
            'color' : 'black',
            'weight' : 0.05,
            #'dashArray' : '5, 5',
            'highlight' : True # note: not a standard Leaflet path option
            },
        ).add_child(folium.Popup(text + html))
        pops.add_to(m) # add popup for that country
    return m

def add_choropleth(m, data, name, bins = None, color='YlOrRd',legend_name='Cantonal unemployment in percents' ):
    m.choropleth(geo_data = json.load(open(topojson_swiss_path)),
             name = name,
             data = data,
             columns = ['code', 'unemployment'],
             key_on = 'id',
             threshold_scale = bins,
             topojson = 'objects.cantons',
             fill_color = color, fill_opacity=1,
             line_color = 'black', line_weight = 1, line_opacity=0.9,
             smooth_factor = 1,
             reset = False, # False by default, put true if you wanna remove previous layers
             highlight = True, # hovering
             legend_name= legend_name)
    return m

def add_fullscreen_button(m):
    plugins.Fullscreen( # to have the full screen option top-right corner
        position='topright',
        title='Expand me',
        title_cancel='Exit me',
        force_separate_button=True).add_to(m)
    return m


def create_double_map(data1, data2,\
                        name1, bins, name2,\
                        color='YlOrRd',legend_name='Cantonal unemployment in percents' ):
    m = folium.Map([46.5,8], tiles='Mapbox bright', zoom_start=7.4)

    add_choropleth(m, data=data1, name=name1, bins = bins,\
                   color=color,legend_name=legend_name )

    add_choropleth(m, data=data2, name=name2, bins = bins,\
                   color=color,legend_name=legend_name )

    add_features(m,data1,data2,name1,name2)

    folium.LayerControl().add_to(m)
    add_fullscreen_button(m)
    return m

def create_single_map(data1,\
                        name1, bins,\
                        color='YlOrRd',legend_name='Cantonal unemployment in percents' ):
    m = folium.Map([46.5,8], tiles='Mapbox bright', zoom_start=7.4)

    add_choropleth(m, data=data1, name=name1, bins = bins,\
                   color=color,legend_name=legend_name )
    add_fullscreen_button(m)
    return m

## 2.1) Let's finally visualize a first Choropleth map of the unemployment rate as provided by amstat.ch

* with default bins

In [None]:
bins = []
data_to_viz = select_data(chdf2,'Settembre 2017',what='Tasso di disoccupazione')
m = create_single_map(data1=data_to_viz,\
                  name1='without a job',\
                  bins=None,\
                  color='YlOrRd',legend_name='Cantonal unemployment in percents' )
m.save(os.path.join('results', 'suiss_default.html'))
m

In [None]:
plt.hist(data_to_viz['unemployment'],bins=[0.6,1.3,2.1,2.9,3.7,4.5,5.2])

### Jenks algorithm
As usual, let's apply the Jenks algorithm to see a different result:

In [None]:
bins = jenkspy.jenks_breaks(list(data_to_viz['unemployment'].dropna().astype(float)), nb_class=5)
m = create_single_map(data1=data_to_viz,\
                  name1='without a job',\
                  bins=bins,\
                  color='YlOrRd',legend_name='Cantonal unemployment in percents')
m.save(os.path.join('results', 'suisse_Jensk.html'))
m

In [None]:
plt.hist(data_to_viz['unemployment'],bins=bins)

### Conclusion:
We can see that with the Jenks algorithm we managed to
* reduce the number of classes from 6 to 5
* still maintain a good classification of the cantons, although the histogram shows that the distribution of the cantons across the groups has become skewed!

Personally, **this time we prefer the original map with the default bins**. It is more accurate, the number of classes is still not too high (6) and it communicates the differences between the cantons better.

# 2.2) Compare unemployment rates with and without "people looking for a job that still have a job"

To compare these two values we plotted two choropleth maps _on the same map_, _in two separate layers_: using the button on the right, the user can **select the layer and visualize the data they are interested in**. The sudden change of color on the map makes the difference between the two rates immediately visible.
Furthermore, **we added a popup**: by clicking on a canton, the reader can access the raw values of the two rates.

**How to select the layers**
* click on the icon on the right of the screen
* select only one of the two options
    * without a job
    * searching for jobs
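
The call in the next cell passes a single `bins` list for both datasets, so both layers share one color scale and a color change really reflects a change in rate. One possible way to obtain such a list is sketched here, under the assumption that evenly spaced classes over the combined range of both rates are acceptable. `shared_bin_edges` and the sample rates are hypothetical and not part of the notebook's helpers.

```python
def shared_bin_edges(series_a, series_b, n_classes=5, decimals=1):
    """Evenly spaced class edges spanning the combined range of both series."""
    combined = [float(v) for v in list(series_a) + list(series_b)]
    low, high = min(combined), max(combined)
    step = (high - low) / n_classes
    # The last edge is set to the maximum itself so no value is left out
    # because of floating-point drift in the repeated additions.
    edges = [low + i * step for i in range(n_classes)] + [high]
    # Rounding assumes the rates are reported with `decimals` digits.
    return [round(edge, decimals) for edge in edges]


# Made-up rates for the two layers.
without_a_job = [0.6, 2.4, 5.2]
searching_for_jobs = [1.1, 3.9, 6.6]
print(shared_bin_edges(without_a_job, searching_for_jobs))
# [0.6, 1.8, 3.0, 4.2, 5.4, 6.6]
```

Computing the edges from both series at once guarantees that neither layer contains values outside the classes.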

In [None]:
bins = []

m = create_double_map(data1=select_data(chdf2,'Settembre 2017',what='Tasso di disoccupazione'),\
                  data2=select_data(chdf2,'Settembre 2017',what='unemployment rate with employed people'),\
                  name1='without a job',\
                  name2='searching for jobs',bins=[ 0.6,  1.8,  3. ,  4.2,  5.4,  6.6],\
                  color='YlOrRd',legend_name='Cantonal unemployment in percents' )
m.save(os.path.join('results', 'suisse_dual_layer_searchingVSwithout.html'))
m

# 3) Comparing the unemployment rate per canton, distinguishing between Swiss people and foreigners

The strategy to visualize this difference is the same as in the previous point: a single map with two layers that the user can select, plus a popup.

In [None]:
# loading data
chdf = pd.read_excel(DATA_FOLDER + 'chdf_nazionalita.xlsx',header=[0,1])

# building the two separate dataframes
strangers_df =  pd.DataFrame(chdf['Settembre 2017','Tasso di disoccupazione'][chdf['Nazionalità','Unnamed: 0_level_1']=='stranieri'])
strangers_df.columns=strangers_df.columns.droplevel(level=1)
strangers_df = pd.merge(strangers_df,cantons_code, left_index=True,right_on = 'Canton', how = 'inner')
strangers_df.rename(columns={'Settembre 2017' : 'unemployment'},inplace = True)

swissers_df = pd.DataFrame(chdf['Settembre 2017','Tasso di disoccupazione'][chdf['Nazionalità','Unnamed: 0_level_1']=='svizzeri'])
swissers_df.columns=swissers_df.columns.droplevel(level=1)
swissers_df = pd.merge(swissers_df,cantons_code, left_index=True,right_on = 'Canton', how = 'inner')
swissers_df.rename(columns={'Settembre 2017' : 'unemployment'},inplace = True)

In [None]:
bins = [1, 2,3, 5, 9]
m=create_double_map(data1=strangers_df, data2=swissers_df,\
                        name1='foreign', bins=bins, name2='swiss',\
                        color='YlOrRd',legend_name='Cantonal unemployment in percents' )
m.save(os.path.join('results', 'suisse_dual_layer_foreignVSsuisse.html'))
m

#### Question 3, by age group

In [None]:
chdf_numbers = pd.read_excel(DATA_FOLDER + 'chdf_absolutenumbers.xlsx',header=[0,1])
chdf_numbers.drop(['Classi d\'età 15-24, 15-49, 50 anni e più', 'Unnamed: 2_level_1'], axis=1, level=1,inplace=True)
chdf_numbers.drop('mese', axis=1, level=0, inplace=True)
chdf_numbers.head()
chdf_numbers.columns=chdf_numbers.columns.droplevel(level=1)
chdf_numbers.set_index('Nazionalità', append=True, inplace=True)
chdf_numbers.set_index('Classi d\'età 15-24, 15-49, 50 anni e più', append=True, inplace=True)
chdf_numbers.drop('Totale',level=2,axis=0,inplace=True)

total_unemp_per_canton=chdf_numbers.xs('Totale', level=1, drop_level=False)
total_unemp_per_age_category=chdf_numbers.xs('Totale', level=1, drop_level=False)
chdf_numbers.head(5)

The following function takes as input the month we want to extract from our data, and returns a dataframe with:
* the number of unemployed foreigners and Swiss, split into the three age categories
* these data **normalized** with respect to the total number of unemployed _for each canton_. This means that the six fields of the output dataframe sum to one:

**(Swiss, age 1)+(Swiss, age 2)+(Swiss, age 3)+(Foreigners, age 1)+(Foreigners, age 2)+(Foreigners, age 3) = 1** _for each canton_

We use this unusual normalization because we lack the number of inhabitants per canton for each age category.
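
As a small worked example of this normalization, the sketch below divides each of the six counts of a canton by that canton's total. It uses plain dictionaries instead of the notebook's dataframes, and the canton names, group labels and counts are invented for illustration.

```python
def normalize_by_canton(counts):
    """Divide each count by the total of its canton.

    `counts` maps (canton, nationality, age_group) to a number of unemployed.
    """
    totals = {}
    for (canton, _nationality, _age_group), n in counts.items():
        totals[canton] = totals.get(canton, 0) + n
    return {key: n / totals[key[0]] for key, n in counts.items()}


groups = [("Swiss", "age 1"), ("Swiss", "age 2"), ("Swiss", "age 3"),
          ("Foreigners", "age 1"), ("Foreigners", "age 2"), ("Foreigners", "age 3")]
# Two invented cantons: B has ten times fewer unemployed people than A.
raw = {"A": [10, 20, 30, 5, 15, 20], "B": [1, 2, 3, 1, 1, 2]}
counts = {(canton,) + group: n
          for canton, numbers in raw.items()
          for group, n in zip(groups, numbers)}

shares = normalize_by_canton(counts)
for canton in raw:
    total = sum(s for key, s in shares.items() if key[0] == canton)
    print(canton, round(total, 6), shares[(canton, "Swiss", "age 1")])
```

Both cantons end up with shares summing to 1 and the same share for the first group, even though their absolute numbers differ by a factor of ten, so the normalized values describe the composition within a canton and not its size.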

PROS of this approach:
* the bar plots are easy to read, since all bars have roughly the same height
* it's easy to see how unemployed people in each canton are distributed across the 6 categories

CONS:
* it could be misleading, since the data are not normalized with respect to the total population of each category, as one might expect.
* _the bar plots can't be used to compare different cantons_, because of the normalization procedure.

In [None]:
def select_and_normalize_data(month):    
    ### INPUTS: month we are interested to plot
    ### OUTPUT: a dataframe normalized as described above in the markdown cell
    tot = total_unemp_per_canton[[month]].reset_index(drop=False)[['level_0',month]]
    tot.rename(columns={'level_0':'name'},inplace=True)
    tot.set_index('name',drop=True, inplace=True)
    tmp = chdf_numbers.drop('Totale',level=1,axis=0)
    tmp.drop('Totale',level=0,axis=0,inplace=True)
    divided = tmp[[month]].divide(tot,level=0)
    return divided

In [None]:
normalized = select_and_normalize_data('Settembre 2017')

ready_to_plot = normalized.unstack()
ready_to_plot.plot(kind='bar', stacked=True, figsize=(20,5))

To compare the different cantons, **we provide a second bar plot showing the absolute number of unemployed people**. By looking at both graphs, the reader can get a precise idea of how unemployment is distributed within each canton and how it differs between cantons.

In [None]:
chdf_numbers.drop('Totale',level=1,axis=0,inplace=True)
chdf_numbers.drop('Totale',level=0,axis=0,inplace=True)
ready_to_plot = chdf_numbers[['Settembre 2017']].unstack()
ready_to_plot.plot(kind='bar', stacked=True, figsize=(20,5))