# Assignment 2: Voting Visualized

## Deadline

Oct. 24th

## Important notes

- Make sure you push your notebook to GitHub with all the cells already evaluated.
- Note that maps do not render in a standard GitHub environment. You should export them to HTML and link them in your notebook.
- Don't forget to add a textual description of your thought process, the assumptions you made, and the solution you implemented.
- Please write all your comments in English, and use meaningful variable names in your code.

## Background


* Are you curious to know what the political leanings of the people of Switzerland are?
* Do you wake up in a cold sweat, wondering which party won the last cantonal parliament election in Vaud?
* Are you looking to learn all sorts of visualizations, including maps, in Python?

If your answer to any of the above is yes, this assignment is just right for you. Otherwise, it's still an assignment, so we're terribly sorry.

The chief aim of this assignment is to familiarize you with visualizations in Python, particularly maps, and also to give you some insight into how visualizations are to be interpreted. The data we will use is the data on Swiss cantonal parliament elections from 2007 to 2018, which contains, for each cantonal election in this time period, the voting percentages for each party and canton.

For the visualization part, install [Folium](Folium) (_Hint: it is not available in your standard Anaconda environment, therefore search on the Web how to install it easily!_). Folium's README comes with very clear examples, and links to their own iPython Notebooks -- make good use of this information. For your own convenience, in this same directory you can already find one TopoJSON file, containing the geo-coordinates of the cantonal borders of Switzerland.

One last, general reminder: back up any hypotheses and claims with data, since this is an important aspect of the course.

In [None]:
import pandas as pd
import json
import folium
import os
import xlrd


In [None]:
data_folder = './data/'

In [None]:
folium.__version__ == '0.6.0'

## Task 1: Cartography and census

__A)__ Display a Swiss map that has cantonal borders as well as the national borders. We provide a TopoJSON `data/ch-cantons.topojson.json` that contains the borders of the cantons.

__B)__ Take the spreadsheet `data/communes_pop.xls`, collected from [admin.ch](https://www.bfs.admin.ch/bfs/fr/home/statistiques/catalogues-banques-donnees/tableaux.assetdetail.5886191.html), containing population figures for every commune. You can use [pd.read_excel()](https://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_excel.html) to read the file and to select specific sheets. Plot a histogram of the population counts and explain your observations. Do not use a log-scale plot for now. What does this histogram tell you about urban and rural communes in Switzerland? Are there any clear outliers on either side, and if so, which communes?

__C)__ The figure below shows 4 types of histogram. At this stage, our distribution should look like Fig.(a). A common way to represent [power-laws](https://en.wikipedia.org/wiki/Power_law) is to use a histogram on a log-log scale -- remember: the x-axis of a histogram is segmented into bins of equal size and the y-values are the average of each bin. As shown in Fig.(b), small bin sizes might introduce artifacts. Fig.(b) and Fig.(c) are examples of histograms with two different bin sizes. Another great way to visualize such a distribution is to use a cumulative representation, as shown in Fig.(d), in which the y-axis represents the number of data points with values greater than x.  
  
Create the figures (b) and (d) using the data extracted for task 1B. For Fig.(b), represent two histograms using two different bin sizes and provide a brief description of the results. What does this tell you about the relationship between the two variables, namely the frequency of each bin and the value (i.e. population in case of the communal data) for each bin?

<img src="plaw_crop.png" style="width: 600px;">
  
The figure is extracted from [this paper](https://arxiv.org/pdf/cond-mat/0412004.pdf) that contains more information about this family of distributions.
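
The binning and cumulative counts described in part C can be sketched without any plotting library. The snippet below is a minimal illustration, not a reference solution: the population values are made up, and the helper names `log_bin_edges`, `bin_counts`, and `cumulative_counts` are hypothetical. Logarithmically spaced bins are one common choice for heavy-tailed data; the same counting helper also works with equal-width edges.

```python
import math
from bisect import bisect_left, bisect_right

# Made-up commune populations, used only to exercise the helpers below.
populations = [50, 120, 300, 800, 1500, 4000, 12000, 40000, 130000, 400000]


def log_bin_edges(lowest, highest, n_bins):
    """Return n_bins + 1 bin edges that are equally spaced on a log10 axis."""
    log_low, log_high = math.log10(lowest), math.log10(highest)
    step = (log_high - log_low) / n_bins
    edges = [10 ** (log_low + i * step) for i in range(n_bins + 1)]
    # Pin the outer edges to the requested bounds to avoid rounding drift.
    edges[0], edges[-1] = lowest, highest
    return edges


def bin_counts(values, edges):
    """Count values per bin [edge_i, edge_i+1); the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        if edges[0] <= value <= edges[-1]:
            index = min(bisect_right(edges, value) - 1, len(counts) - 1)
            counts[index] += 1
    return counts


def cumulative_counts(values):
    """Return (x, n) pairs where n is the number of values greater than or equal to x."""
    ordered = sorted(values)
    return [(x, len(ordered) - bisect_left(ordered, x)) for x in sorted(set(ordered))]


edges = log_bin_edges(10, 1_000_000, 5)
counts = bin_counts(populations, edges)
cumulative = cumulative_counts(populations)
```

Plotting `counts` against the bin edges, or the pairs in `cumulative`, on logarithmic axes would then give figures in the spirit of (b) and (d).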

## Task 2: Parties visualized

We provide a spreadsheet, `data/voters.xls`, (again) collected from [admin.ch](https://www.bfs.admin.ch/bfs/fr/home/statistiques/politique/elections/conseil-national/force-partis.assetdetail.217195.html), which contains the percentage of voters for each party and for each canton. For the following task, we will focus on the period 2014-2018 (the first page of the spreadsheet). Please report any assumptions you make regarding outliers, missing values, etc. Notice that data is missing for two cantons, namely Appenzell Ausserrhoden and Graubünden, and your visualisations should include data for every other canton.


__A)__ For the period 2014-2018 and for each canton, visualize, on the map, **the percentage of voters** in that canton who voted for the party [`UDC`](https://en.wikipedia.org/wiki/Swiss_People%27s_Party) (Union démocratique du centre). Does this party seem to be more popular in the German-speaking part, the French-speaking part, or the Italian-speaking part?

__B)__ For the same period, now visualize **the number of residents** in each canton who voted for UDC.

__C)__ Which one of the two visualizations above would be more informative in case of a national election with majority voting (i.e. when a party needs to have the largest number of citizens voting for it among all parties)? Which one is more informative for the cantonal parliament elections?

For part B, you can use the `data/national_council_elections.xslx` file ([guess where we got it from](https://www.bfs.admin.ch/bfs/fr/home/statistiques/politique/elections/conseil-national/participation.assetdetail.81625.html)) to have the voting-eligible population of each canton in 2015.
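
The hint above amounts to a simple product: eligible population times turnout times party share. Here is a minimal sketch with made-up numbers. The function name `estimate_party_voters` and the figures are illustrative assumptions, not values from the spreadsheets, and both rates are assumed to be given in percent.

```python
def estimate_party_voters(eligible_population, turnout_percent, party_share_percent):
    """Estimate how many people voted for a party in one canton.

    Assumes turnout and party share are percentages (0 to 100).
    """
    votes_cast = eligible_population * turnout_percent / 100
    return votes_cast * party_share_percent / 100


# Hypothetical canton: 200,000 eligible voters, 45% turnout, 30% party share.
estimated_voters = estimate_party_voters(200_000, 45.0, 30.0)
print(round(estimated_voters))
```

Note that this combines a turnout figure and a party share that may come from different elections, so the result is an estimate rather than an official count.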

In [None]:
voters = pd.read_excel(data_folder + 'voters.xls')
voters

In [None]:
# Load the voters dataset
voters = pd.read_excel(data_folder + 'voters.xls')

# Load the cantons dataset
cantons = pd.read_csv(data_folder + 'cantons.csv')

# Clean the dataset

# First, drop the first three rows, which contain no data
voters_noNaN = voters.drop(voters.index[0:3])

# Drop the first column so that empty rows contain only NaN and dropna(how='all') can detect them
voters_noNaN = voters_noNaN.drop(voters_noNaN.columns[0], axis=1)

# Drop the rows that contain only NaN
voters_noNaN = voters_noNaN.dropna(axis=0, how='all')

# Keep the rows of the original dataframe whose index survived the cleaning above
voters = voters.loc[voters_noNaN.index]

# Now drop the columns that contain only NaN
voters = voters.dropna(axis=1, how='all')

# The remaining NaN values correspond to cantons where the party is not present, i.e. it received no votes, so replace NaN with 0
voters = voters.fillna(0)

# Rename the columns with the corresponding labels
voters.columns = ['Cantons', 'Année électorale', 'Participation', 'PLR', 'PDC', 'PS','UDC','PLS','PEV','PCS','PVL','PBD','PST','PSA','PES','AVF','Sol.','DS','UDF','Lega','MCR','Autres','Total']

# Reset the index so that the rows line up with the cantons dataframe
voters.reset_index(drop=True, inplace=True)

# Concatenate the two dataframes side by side (rows are matched on the index)
result = pd.concat([voters, cantons], axis=1, sort=False)

# Convert the population strings to floats: remove the commas, then drop the last four characters
result['Population'] = result['Population'].apply(lambda row: float(row.replace(",", "")[:-4]))

# The 'Canton of' column is redundant, so overwrite it with the estimated number of UDC voters and rename it
# (Participation and UDC are percentages, hence the factor 0.0001)
result['Canton of'] = result['Population']*result['Participation']*result['UDC']*0.0001
result.rename(columns={'Canton of': 'UDC voters'}, inplace=True)

# Write the result to a CSV file and read it back for use with folium
result.to_csv('result.csv')
result_csv = r'result.csv'
result_data = pd.read_csv(result_csv)
result_data.drop(result_data.columns[0], axis=1, inplace=True)


result

In [None]:
# Visualize the UDC vote percentage by canton
with open(data_folder + 'ch-cantons.topojson.json') as topojson_file:
    cantons_data = json.load(topojson_file)

m = folium.Map([47, 8.33], tiles='cartodbpositron', zoom_start=7)

m.choropleth(
 geo_data=cantons_data,
 topojson='objects.cantons',   
 name='choropleth',
 data=result_data,
 columns=['Code', 'UDC'],
 key_on='feature.id',
 fill_color='YlGn',
 fill_opacity=0.7,
 line_opacity=0.2,
 legend_name='UDC vote Rate (%)'
)
folium.LayerControl().add_to(m)

m.save('UDC vote rate.html')
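
Folium matches each row of `data` to a shape through `key_on`, so a canton whose identifier has no counterpart in the `Code` column is unlikely to be colored as intended. Before plotting, it can help to compare the two sets of identifiers. The sketch below uses a tiny hand-written topology instead of the real file, and assumes that the geometries under `objects.cantons` carry an `id` field holding canton codes; the helper `unmatched_ids` is hypothetical.

```python
import json

# Hypothetical miniature TopoJSON; the real data lives in data/ch-cantons.topojson.json.
topology = json.loads("""
{"type": "Topology",
 "objects": {"cantons": {"type": "GeometryCollection",
   "geometries": [{"type": "Polygon", "id": "ZH", "arcs": [[0]]},
                  {"type": "Polygon", "id": "VD", "arcs": [[1]]},
                  {"type": "Polygon", "id": "GR", "arcs": [[2]]}]}},
 "arcs": []}
""")


def unmatched_ids(topology, object_name, data_codes):
    """Return (ids on the map without data, data codes without a shape)."""
    geometries = topology["objects"][object_name]["geometries"]
    geometry_ids = {geometry["id"] for geometry in geometries}
    codes = set(data_codes)
    return sorted(geometry_ids - codes), sorted(codes - geometry_ids)


# Assumed example codes, standing in for the 'Code' column of the dataframe.
missing_in_data, missing_on_map = unmatched_ids(topology, "cantons", ["ZH", "VD", "TI"])
print("Shapes without data:", missing_in_data)
print("Data without shapes:", missing_on_map)
```

With the real notebook, the third argument would be something like `result_data['Code']`, and both lists should ideally come back empty apart from the cantons known to have no data.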

In [None]:
#Data visualization for number of UDC voters by canton

m = folium.Map([47, 8.33],tiles='cartodbpositron', zoom_start=7)


m.choropleth(
 geo_data=cantons_data,
 topojson='objects.cantons',   
 name='choropleth',
 data=result,
 columns=['Code', 'UDC voters'],
 key_on='feature.id',
 fill_color='YlGn',
 fill_opacity=0.7,
 line_opacity=0.2,
 legend_name='Number of UDC voters',
 threshold_scale=[0,10000,30000,60000, 90000,120000]
)
folium.LayerControl().add_to(m)
m.save('UDC voters number.html')

In [None]:
# Does this party seem to be more popular in the German-speaking part, the French-speaking part, or the Italian-speaking part?
# Note: I had a lot of trouble making this map, which is not required, so it is not optimal.

german_part = result['Official languages'] == 'German'
french_part = result['Official languages'] == 'French'
italian_part = result['Official languages'] == 'Italian'

# Number of votes cast per canton, stored in a new column so that re-running the cell is safe
result['Votes cast'] = result['Participation']*result['Population']/100

# Share of UDC voters among all votes cast in each language region
german_swiss_UDC_popularity = result[german_part]['UDC voters'].sum()/result[german_part]['Votes cast'].sum()
french_swiss_UDC_popularity = result[french_part]['UDC voters'].sum()/result[french_part]['Votes cast'].sum()
italian_swiss_UDC_popularity = result[italian_part]['UDC voters'].sum()/result[italian_part]['Votes cast'].sum()

# Copy the slices so that later assignments do not act on views of `result`
result_french = result[french_part].copy()
result_german = result[german_part].copy()
result_italian = result[italian_part].copy()
result

In [None]:
result_french.loc[:,'Participation']=french_swiss_UDC_popularity
result_german.loc[:,'Participation']=german_swiss_UDC_popularity
result_italian.loc[:,'Participation']=italian_swiss_UDC_popularity

print(french_swiss_UDC_popularity)
print(german_swiss_UDC_popularity)
print(italian_swiss_UDC_popularity)

In [None]:
frames = [result_french,result_german,result_italian]
frenchvsgermanpopularity=pd.concat(frames)
frenchvsgermanpopularity

In [None]:
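# Write the region-level percentages back into `result`, aligned on the index
result.loc[frenchvsgermanpopularity.index, 'Participation'] = frenchvsgermanpopularity['Participation']*100

# Manually set rows 1, 9 and 22 to 0
result.loc[[1, 9, 22], ['Participation']] = 0.0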
result

In [None]:
m = folium.Map([47, 8.33],tiles='cartodbpositron', zoom_start=7)



m.choropleth(
 geo_data=cantons_data,
 topojson='objects.cantons',   
 name='choropleth',
 data=result,
 columns=['Code', 'Participation'],
 key_on='feature.id',
 fill_color='YlGn',
 fill_opacity=0.7,
 line_opacity=0.1,
 legend_name='UDC vote rate in German- and French-speaking Switzerland (%)',
 threshold_scale=[0,5,10,15,20,25]
 
)
folium.LayerControl().add_to(m)
m.save('UDC vote rate GERvsFR.html')

Which one of the two visualizations above would be more informative in case of a national election with majority voting (i.e. when a party needs to have the largest number of citizens voting for it among all parties)? Which one is more informative for the cantonal parliament elections?

In the case of a national election with majority voting, the visualization from part B, showing the raw number of voters, is clearly the more useful one. Its only drawback is that the reader needs an idea of each canton's total voting population to interpret the numbers on the map. It is good to see that UDC has 30,000 voters in a canton, but if the PS has 50,000 there, UDC does not win despite the impressive number. For that reason, percentages are often clearer, unlike my explanation.

On the other hand, for the cantonal parliament elections, what matters is the party's popularity within each canton, so the visualization from part A is the more informative one.
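
The difference between the two views can be made concrete with two hypothetical cantons. All figures below are invented for illustration and are not election data: the canton where the party has the highest share is not necessarily the one that contributes the most votes to a national total.

```python
# Invented figures, chosen only to illustrate the point; not real election data.
cantons = {
    "Small canton": {"votes_cast": 20_000, "party_share_percent": 40.0},
    "Large canton": {"votes_cast": 300_000, "party_share_percent": 25.0},
}


def party_votes(canton):
    """Absolute number of votes for the party in one canton."""
    return canton["votes_cast"] * canton["party_share_percent"] / 100


# Canton that stands out on a percentage map (part A).
top_by_share = max(cantons, key=lambda name: cantons[name]["party_share_percent"])

# Canton that stands out on an absolute-count map (part B).
top_by_votes = max(cantons, key=lambda name: party_votes(cantons[name]))

# Share of the party over both cantons together, weighted by votes cast.
total_party_votes = sum(party_votes(canton) for canton in cantons.values())
total_votes_cast = sum(canton["votes_cast"] for canton in cantons.values())
overall_share_percent = 100 * total_party_votes / total_votes_cast

print(top_by_share, top_by_votes, round(overall_share_percent, 2))
```

The overall share sits close to the large canton's value, which is why absolute counts carry more information for a national tally, while the per-canton share is what decides each cantonal parliament.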