# Background

In this homework we will practice interactive visualization, a key ingredient of many successful viz (especially infographics). You will work with the P3 database of the SNSF (Swiss National Science Foundation). As you can see from their entry page, P3 already offers some ready-made viz, but we want to build a more advanced one for quick data exploration. Start by downloading the raw data (only the Grant Export) and read the documentation carefully to understand the schema. Then install Folium to handle geographical data (HINT: it is not available in your standard Anaconda environment, so search the Web for an easy way to install it!). The Folium README comes with very clear examples and links to their own IPython notebooks; make good use of this information. For your convenience, this directory already contains a TopoJSON file with the geo-coordinates of each Swiss canton, which can be used as an overlay on the Folium maps.

# Assignment

Build a Choropleth map which shows intuitively (i.e., use colors wisely) how much grant money goes to each Swiss canton. To do so, you will need to use the provided TopoJSON file, combined with the Choropleth map example you can find in the Folium README file.
HINT: the P3 database consists of entries that assign a grant (and its approved amount) to a university name, so you will need a smart strategy to go from university to canton name. The Geonames Full Text Search API in JSON can help you with this; try to use it as much as possible to build the canton mappings that you need. For universities that you cannot map via the API, you are allowed to build the mapping manually; feel free to stop once you have mapped the top 95% of the universities. I also recommend an intermediate viz step for debugging purposes that shows all the universities as markers on your map (e.g., if you don't select the right results from the Geonames API, some of your markers might be placed in nearby countries...).

BONUS: using the map you have just built, and the geographical information contained in it, could you give a rough estimate of the difference in research funding between the areas divided by the Röstigraben?

# Getting started

## Imports

In [81]:
%matplotlib inline
import pandas as pd
import numpy as np
import scipy.stats as stats
import folium
import json
import urllib

from googleplaces import GooglePlaces, types, lang

## Read data from grant report

In [67]:
# Read the CSV file. Many values are NaN, so we replace them with empty strings for readability.
grantReport = pd.read_csv('data/P3_GrantExport.csv', delimiter=';')
grantReport = grantReport.fillna('')
grantReport.head()

Unnamed: 0,"﻿""Project Number""",Project Title,Project Title English,Responsible Applicant,Funding Instrument,Funding Instrument Hierarchy,Institution,University,Discipline Number,Discipline Name,Discipline Name Hierarchy,Start Date,End Date,Approved Amount,Keywords
0,1,Schlussband (Bd. VI) der Jacob Burckhardt-Biog...,,Kaegi Werner,Project funding (Div. I-III),Project funding,,Nicht zuteilbar - NA,10302,Swiss history,Human and Social Sciences;Theology & religious...,01.10.1975,30.09.1976,11619.0,
1,4,Batterie de tests à l'usage des enseignants po...,,Massarenti Léonard,Project funding (Div. I-III),Project funding,Faculté de Psychologie et des Sciences de l'Ed...,Université de Genève - GE,10104,Educational science and Pedagogy,"Human and Social Sciences;Psychology, educatio...",01.10.1975,30.09.1976,41022.0,
2,5,"Kritische Erstausgabe der ""Evidentiae contra D...",,Kommission für das Corpus philosophorum medii ...,Project funding (Div. I-III),Project funding,Kommission für das Corpus philosophorum medii ...,"NPO (Biblioth., Museen, Verwalt.) - NPO",10101,Philosophy,Human and Social Sciences;Linguistics and lite...,01.03.1976,28.02.1985,79732.0,
3,6,Katalog der datierten Handschriften in der Sch...,,Burckhardt Max,Project funding (Div. I-III),Project funding,Abt. Handschriften und Alte Drucke Bibliothek ...,Universität Basel - BS,10302,Swiss history,Human and Social Sciences;Theology & religious...,01.10.1975,30.09.1976,52627.0,
4,7,Wissenschaftliche Mitarbeit am Thesaurus Lingu...,,Schweiz. Thesauruskommission,Project funding (Div. I-III),Project funding,Schweiz. Thesauruskommission,"NPO (Biblioth., Museen, Verwalt.) - NPO",10303,Ancient history and Classical studies,Human and Social Sciences;Theology & religious...,01.01.1976,30.04.1978,120042.0,
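
The first column label in the output above still carries a byte-order mark and literal quotes, which suggests that the export starts with a UTF-8 BOM. A minimal sketch of that effect, assuming the BOM is indeed the cause; the bytes in `raw` are made up for illustration and `read_header` is a hypothetical helper, not part of the notebook:

```python
import csv
import io

# Hypothetical first bytes of a semicolon-separated export that starts with a UTF-8 BOM.
raw = b'\xef\xbb\xbf"Project Number";"Project Title"\r\n"1";"Example"\r\n'

def read_header(data, encoding):
    """Decode the bytes with the given encoding and return the first CSV row."""
    text = data.decode(encoding)
    return next(csv.reader(io.StringIO(text, newline=''), delimiter=';'))

header_plain = read_header(raw, 'utf-8')      # BOM stays glued to the first label
header_clean = read_header(raw, 'utf-8-sig')  # BOM is stripped while decoding

print(header_plain)
print(header_clean)
```

`pd.read_csv` accepts an `encoding` argument as well, so passing `encoding='utf-8-sig'` to the call above may be enough to get a clean `Project Number` label.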


## Clean up

From the table above, we only need two columns: _University_ and _Approved Amount_.

In [69]:
grantReport = grantReport[['Approved Amount', 'University']]
grantReport.head()

Unnamed: 0,Approved Amount,University
0,11619.0,Nicht zuteilbar - NA
1,41022.0,Université de Genève - GE
2,79732.0,"NPO (Biblioth., Museen, Verwalt.) - NPO"
3,52627.0,Universität Basel - BS
4,120042.0,"NPO (Biblioth., Museen, Verwalt.) - NPO"


First we check the universities.

In [71]:
grantReport.University.unique().shape

(78,)

There are not many distinct universities, so we list all of them to check for outliers.

In [72]:
uni = grantReport.University.unique()
uni

array(['Nicht zuteilbar - NA', 'Université de Genève - GE',
       'NPO (Biblioth., Museen, Verwalt.) - NPO', 'Universität Basel - BS',
       'Université de Fribourg - FR', 'Universität Zürich - ZH',
       'Université de Lausanne - LA', 'Universität Bern - BE',
       'Eidg. Forschungsanstalt für Wald,Schnee,Land - WSL',
       'Université de Neuchâtel - NE', 'ETH Zürich - ETHZ',
       'Inst. de Hautes Etudes Internat. et du Dév - IHEID',
       'Universität St. Gallen - SG', 'Weitere Institute - FINST',
       'Firmen/Privatwirtschaft - FP',
       'Pädagogische Hochschule Graubünden - PHGR', 'EPF Lausanne - EPFL',
       'Pädagogische Hochschule Zürich - PHZFH', 'Universität Luzern - LU',
       'Schweiz. Institut für Kunstwissenschaft - SIK-ISEA',
       'SUP della Svizzera italiana - SUPSI',
       'HES de Suisse occidentale - HES-SO',
       'Robert Walser-Stiftung Bern - RWS', 'Paul Scherrer Institut - PSI',
       'Pädagogische Hochschule St. Gallen - PHSG',
       'Eidg. Ans

We see some obvious outliers, such as the empty string '', 'Nicht zuteilbar - NA' (_not assignable_), 'Weitere Institute - FINST' (_other institutes_), and 'Weitere Spitäler - ASPIT' (_other hospitals_).

In [73]:
invalidUniversities = ['Nicht zuteilbar - NA', 'Weitere Institute - FINST', 'Weitere Spitäler - ASPIT']
grantReport.University = grantReport.University.replace(invalidUniversities, '')
grantReport = grantReport[grantReport.University != '']

In [74]:
uni = grantReport.University.unique()
uni.shape

(74,)

Great, we have removed the outliers from the _University_ column. Now we do the same for _Approved Amount_.

In [75]:
# We proceed iteratively: valid amounts must be convertible to float with the 'astype' method. First we try:
# sum(grantReport['Approved Amount'].astype(float))
# This raises an error: could not convert string to float: 'data not included in P3'. We drop those rows.
invalidAmounts = ['data not included in P3']
grantReport['Approved Amount'] = grantReport['Approved Amount'].replace(invalidAmounts, '')
grantReport = grantReport[grantReport['Approved Amount'] != '']
sum(grantReport['Approved Amount'].astype(float))
# The same 'sum' call now works, so 'data not included in P3' seems to be the only invalid entry in 'Approved Amount'.

12815436933.020002
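
The trial-and-error approach above uncovers one invalid string per run. A minimal sketch that collects every non-numeric value in a single pass; `split_numeric` is a hypothetical helper, and the sample values are made up except for the placeholder string reported by the error message:

```python
def split_numeric(values):
    """Return (amounts, rejected): floats that parsed, and the distinct values that did not."""
    amounts = []
    rejected = set()
    for value in values:
        try:
            amounts.append(float(value))
        except (TypeError, ValueError):
            rejected.add(value)
    return amounts, rejected

# Mixed strings and floats, similar to what a column read from CSV may contain.
sample = ['11619.0', 41022.0, 'data not included in P3', '79732.0']
amounts, rejected = split_numeric(sample)

print(sum(amounts))
print(rejected)
```

Applied to `grantReport['Approved Amount']`, the `rejected` set would list all placeholder strings at once. In pandas, `pd.to_numeric(..., errors='coerce')` achieves something similar by turning unparseable entries into NaN.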

In [76]:
grantReport.count()

Approved Amount    48159
University         48159
dtype: int64

## Assign universities to cantons

In [77]:
uni.shape

(74,)
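
The assignment allows us to stop once the top 95% of the universities are mapped. One possible reading is "the universities that together receive 95% of the approved amount". A sketch under that assumption, where `top_share` is a hypothetical helper and the totals are made-up numbers rather than real P3 figures:

```python
def top_share(totals, share=0.95):
    """Return the largest recipients, in descending order, until `share` of the grand total is covered."""
    grand_total = sum(totals.values())
    selected = []
    covered = 0.0
    for name, amount in sorted(totals.items(), key=lambda item: item[1], reverse=True):
        if covered >= share * grand_total:
            break  # the names selected so far already reach the target share
        selected.append(name)
        covered += amount
    return selected

# Made-up per-institution totals, for illustration only.
totals = {'A': 600.0, 'B': 300.0, 'C': 60.0, 'D': 30.0, 'E': 10.0}
print(top_share(totals))
```

The `totals` dictionary would come from summing _Approved Amount_ per _University_ after converting the amounts to float.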

### Import Google Map Services

In [89]:
# replace the placeholder with your own API key before running
API_KEY = 'USE_YOUR_KEY'

if API_KEY != 'USE_YOUR_KEY':
    google_places = GooglePlaces(API_KEY)        

In [90]:
response = google_places.text_search(query = uni[0])
if response.places:
    place = response.places[0]
place

<Place name="University of Geneva", lat=46.199444, lng=6.1451157>

In [98]:
place.get_details()
details = place.details
details

{'address_components': [{'long_name': 'Geneva',
   'short_name': 'Geneva',
   'types': ['locality', 'political']},
  {'long_name': 'Centre-Plainpalais-Acacias',
   'short_name': 'Centre-Plainpalais-Acacias',
   'types': ['sublocality_level_1', 'sublocality', 'political']},
  {'long_name': 'Genève',
   'short_name': 'Genève',
   'types': ['administrative_area_level_2', 'political']},
  {'long_name': 'Geneva',
   'short_name': 'GE',
   'types': ['administrative_area_level_1', 'political']},
  {'long_name': 'Switzerland',
   'short_name': 'CH',
   'types': ['country', 'political']},
  {'long_name': '1205', 'short_name': '1205', 'types': ['postal_code']}],
 'adr_address': '<span class="postal-code">CH-1205</span> <span class="locality">Geneva</span>, <span class="country-name">Switzerland</span>',
 'formatted_address': '1205 Geneva, Switzerland',
 'formatted_phone_number': '022 379 71 11',
 'geometry': {'location': {'lat': Decimal('46.199444'),
   'lng': Decimal('6.1451157')},
  'viewport'

In [99]:
for component in details['address_components']:
    if 'administrative_area_level_1' in component['types']:
        canton = component['short_name']
canton

'GE'
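
The loop above leaves `canton` undefined when no component is tagged `administrative_area_level_1`. A minimal sketch of a more defensive version that works on any dictionary shaped like the `details` output above; `extract_canton` is a hypothetical helper name:

```python
def extract_canton(details):
    """Return the short name of the first 'administrative_area_level_1' component, or None."""
    for component in details.get('address_components', []):
        if 'administrative_area_level_1' in component.get('types', []):
            return component.get('short_name')
    return None

# Trimmed copy of the structure printed above.
details = {
    'address_components': [
        {'long_name': 'Geneva', 'short_name': 'Geneva', 'types': ['locality', 'political']},
        {'long_name': 'Geneva', 'short_name': 'GE', 'types': ['administrative_area_level_1', 'political']},
        {'long_name': 'Switzerland', 'short_name': 'CH', 'types': ['country', 'political']},
    ]
}

print(extract_canton(details))
print(extract_canton({'address_components': []}))
```

Returning `None` instead of raising makes it easy to collect the names that still need a manual mapping.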

In [100]:
# wrap the steps above in a function
def uni_to_canton(name):
    # returns the canton code, or None if the search yields no usable result
    canton = None
    response = google_places.text_search(query=name)
    if response.places:
        place = response.places[0]
        place.get_details()
        details = place.details
        for component in details['address_components']:
            if 'administrative_area_level_1' in component['types']:
                canton = component['short_name']
                break
    return canton
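
Since the assignment allows a manual mapping for names the API cannot resolve, the lookup can be combined with a hand-written fallback. A sketch of one way to do this, not the notebook's actual continuation: `build_canton_mapping` and `fake_lookup` are hypothetical, and `fake_lookup` stands in for `uni_to_canton` so that the example runs without an API key.

```python
def build_canton_mapping(names, lookup, manual=None):
    """Map each name to a canton code using `lookup`, falling back to `manual` entries.

    `lookup` is any callable that returns a canton code or None.
    Returns (mapping, unresolved).
    """
    manual = manual or {}
    mapping = {}
    unresolved = []
    for name in names:
        try:
            canton = lookup(name)
        except Exception:
            canton = None  # treat lookup errors as "not found"
        if canton is None:
            canton = manual.get(name)
        if canton is None:
            unresolved.append(name)
        else:
            mapping[name] = canton
    return mapping, unresolved

def fake_lookup(name):
    """Offline stand-in for the API-backed lookup (assumption for this sketch)."""
    known = {'Université de Genève - GE': 'GE', 'Universität Basel - BS': 'BS'}
    return known.get(name)

names = ['Université de Genève - GE', 'Universität Basel - BS',
         'EPF Lausanne - EPFL', 'Firmen/Privatwirtschaft - FP']
manual = {'EPF Lausanne - EPFL': 'VD'}  # hand-written fallback entry

mapping, unresolved = build_canton_mapping(names, fake_lookup, manual)
print(mapping)
print(unresolved)
```

In the notebook, `lookup` would be `uni_to_canton` and `names` would be the `uni` array, so each university is queried only once.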