# Homework 03 - Interactive Viz

Build a Choropleth map which shows intuitively (i.e., use colors wisely) how much grant money goes to each Swiss canton. 

The P3 database of the SNSF (Swiss National Science Foundation) http://p3.snf.ch/Pages/DataAndDocumentation.aspx can be linked via the project number (Project Number) field, in order to create a relationship between people and projects or publications and projects. Raw data of this project:
- P3_GrantExport.csv
- P3_GrantExport_with_abstracts.csv
- P3_PersonExport.csv
- P3_PublicationExport.csv

# Overview

**TODO**  
The objective of this homework is to ...  
To achieve this goal, we ...

In [51]:
import requests
import folium
import pandas as pd
import numpy as np
import json
import random
import pprint

# Data preprocessing/cleaning/wrangling/?
First, we obtain information about grant records, and cantons in Switzerland.

## Grant data

We download the data file from P3 website. We try to clean up data and obtain all grant records.

In [52]:
GRANTS_FILE = 'data/P3_GrantExport.csv'

csv_data = pd.read_csv(GRANTS_FILE, sep = ';')
csv_data.dtypes

﻿"Project Number"                int64
Project Title                   object
Project Title English           object
Responsible Applicant           object
Funding Instrument              object
Funding Instrument Hierarchy    object
Institution                     object
University                      object
Discipline Number                int64
Discipline Name                 object
Discipline Name Hierarchy       object
Start Date                      object
End Date                        object
Approved Amount                 object
Keywords                        object
dtype: object

In [53]:
csv_data.head()

Unnamed: 0,"﻿""Project Number""",Project Title,Project Title English,Responsible Applicant,Funding Instrument,Funding Instrument Hierarchy,Institution,University,Discipline Number,Discipline Name,Discipline Name Hierarchy,Start Date,End Date,Approved Amount,Keywords
0,1,Schlussband (Bd. VI) der Jacob Burckhardt-Biog...,,Kaegi Werner,Project funding (Div. I-III),Project funding,,Nicht zuteilbar - NA,10302,Swiss history,Human and Social Sciences;Theology & religious...,01.10.1975,30.09.1976,11619.0,
1,4,Batterie de tests à l'usage des enseignants po...,,Massarenti Léonard,Project funding (Div. I-III),Project funding,Faculté de Psychologie et des Sciences de l'Ed...,Université de Genève - GE,10104,Educational science and Pedagogy,"Human and Social Sciences;Psychology, educatio...",01.10.1975,30.09.1976,41022.0,
2,5,"Kritische Erstausgabe der ""Evidentiae contra D...",,Kommission für das Corpus philosophorum medii ...,Project funding (Div. I-III),Project funding,Kommission für das Corpus philosophorum medii ...,"NPO (Biblioth., Museen, Verwalt.) - NPO",10101,Philosophy,Human and Social Sciences;Linguistics and lite...,01.03.1976,28.02.1985,79732.0,
3,6,Katalog der datierten Handschriften in der Sch...,,Burckhardt Max,Project funding (Div. I-III),Project funding,Abt. Handschriften und Alte Drucke Bibliothek ...,Universität Basel - BS,10302,Swiss history,Human and Social Sciences;Theology & religious...,01.10.1975,30.09.1976,52627.0,
4,7,Wissenschaftliche Mitarbeit am Thesaurus Lingu...,,Schweiz. Thesauruskommission,Project funding (Div. I-III),Project funding,Schweiz. Thesauruskommission,"NPO (Biblioth., Museen, Verwalt.) - NPO",10303,Ancient history and Classical studies,Human and Social Sciences;Theology & religious...,01.01.1976,30.04.1978,120042.0,


After further inspection, we summarize the data fields as follows:
* ** TODO **: add descriptions of all fields
* 

For this assignment, to compute the total amount of grants, we plan to find the canton of every record of grants, and group these amount together by canton. We note that information about university and institution can provide us information about the place of project. This information is crucial for us to identify its canton.

Thus, we are interested in following three data fields :
* ** Institution **: the institution where the project is carried.
* ** University **: the university where the project is carried.
* ** Approved Amount **: amount of approved grants

We extract these columns from initial data to a separate DataFrame.

In [54]:
data = csv_data[['University', 'Institution', 'Approved Amount']]
data.head()

Unnamed: 0,University,Institution,Approved Amount
0,Nicht zuteilbar - NA,,11619.0
1,Université de Genève - GE,Faculté de Psychologie et des Sciences de l'Ed...,41022.0
2,"NPO (Biblioth., Museen, Verwalt.) - NPO",Kommission für das Corpus philosophorum medii ...,79732.0
3,Universität Basel - BS,Abt. Handschriften und Alte Drucke Bibliothek ...,52627.0
4,"NPO (Biblioth., Museen, Verwalt.) - NPO",Schweiz. Thesauruskommission,120042.0


Next, we clean up the data.  
We note that some values of 'Approved Amount' field are not numeric. Instead, they are strings indicating lack of data. So, we drop these records.

In [55]:
data['Approved Amount'] = pd.to_numeric(data['Approved Amount'], errors='coerce')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


We also remove records with illegal universities and null approved amount

In [56]:
data = data[(data.University != 'Nicht zuteilbar - NA') & pd.notnull(data.University) & pd.notnull(data['Approved Amount'])]

In [57]:
data.head()

Unnamed: 0,University,Institution,Approved Amount
1,Université de Genève - GE,Faculté de Psychologie et des Sciences de l'Ed...,41022.0
2,"NPO (Biblioth., Museen, Verwalt.) - NPO",Kommission für das Corpus philosophorum medii ...,79732.0
3,Universität Basel - BS,Abt. Handschriften und Alte Drucke Bibliothek ...,52627.0
4,"NPO (Biblioth., Museen, Verwalt.) - NPO",Schweiz. Thesauruskommission,120042.0
5,Université de Fribourg - FR,"Séminaire de politique économique, d'économie ...",53009.0


In [58]:
# TODO. It's a temporary workaround.
# In next section, we need to split all university names by '-' to two parts: full name and short name.
# If there is no '-' in the string, extracting short name will get an error 
data['University'].replace("Forschungskommission SAGW", "Forschungskommission - SAGE", inplace=True)
universities = pd.DataFrame(data.University.unique())

uni_test = universities.replace({r'\s+': '&'}, regex=True)
uni_name = pd.DataFrame(universities[0].str.rsplit('-',1).tolist(), columns=['Full Name', 'Short Name'])
uni_name.head()

Unnamed: 0,Full Name,Short Name
0,Université de Genève,GE
1,"NPO (Biblioth., Museen, Verwalt.)",NPO
2,Universität Basel,BS
3,Université de Fribourg,FR
4,Universität Zürich,ZH


## Canton data

We obtain abbrevations for all cantons from geometry JSON file

In [59]:
CH_GEO_JSON = 'data/ch-cantons.topojson.json'

with open(CH_GEO_JSON, 'r') as f:
    d = json.load(f)

cantons = d['objects']['cantons']['geometries']
canton_ids = []
for canton in cantons:
    canton_ids.append(canton['id'])

canton_ids

['ZH',
 'BE',
 'LU',
 'UR',
 'SZ',
 'OW',
 'NW',
 'GL',
 'ZG',
 'FR',
 'SO',
 'BS',
 'BL',
 'SH',
 'AR',
 'AI',
 'SG',
 'GR',
 'AG',
 'TG',
 'TI',
 'VD',
 'VS',
 'NE',
 'GE',
 'JU']

# Mapping to canton

To compute grants by canton, we need to map (University, Institution) to canton.

Available APIs to obtain such mappings include **Geonames Full Text Search API in JSON**, and **Google Places**, etc. 

## Geonames API

In [60]:
uni_canton = pd.DataFrame(columns=['University','Canton'])

In [61]:
def canton_in_CH(result):
    for i in range(len(result)):
        if(result['geonames'][i]['countryName']=='Switzerland'):
            return True
    return False

In [62]:
URL = 'http://api.geonames.org/searchJSON?'

def canton_get(uni):
    paraload = {'q':uni,'username':'shiyuenie'}
    r = requests.get(URL, params=paraload)
    result = r.json()
    return result

In [63]:
for ucount in range(len(universities)): # search full name
    uni = universities[0][ucount]
    result = canton_get(uni)
    if (result['geonames'] != []):
        if (canton_in_CH(result)):
            uni_canton.loc[ucount] = [uni,result['geonames'][0]['adminName1']]
        else:
            uni_canton.loc[ucount] = [uni,'Not Found']   
    else:
        uni_canton.loc[ucount] = [uni,'Not Found']
    
uni_canton.head()

Unnamed: 0,University,Canton
0,Université de Genève - GE,Not Found
1,"NPO (Biblioth., Museen, Verwalt.) - NPO",Not Found
2,Universität Basel - BS,Basel-City
3,Université de Fribourg - FR,Fribourg
4,Universität Zürich - ZH,Zurich


In [64]:
#uni_name=uni_name.replace({r'\s+': '&'}, regex=True)

for ucount in range(len(uni_name)):
    if (uni_canton.loc[:,'Canton'][ucount] == 'Not Found'): # search with seperate parts
        uni_full = uni_name['Full Name'][ucount] 
        uni_short = uni_name['Short Name'][ucount] 
        result = canton_get(uni_full) # search part 1
        if (result['geonames'] != []):
            if(canton_in_CH(result)):
                uni_canton.loc[ucount] = [uni,result['geonames'][0]['adminName1']]
            else:
                result = canton_get(uni_short) # search part 2 
                if (result['geonames'] != []):
                    if(canton_in_CH(result)):
                        uni_canton.loc[ucount] = [uni,result['geonames'][0]['adminName1']]
        
uni_canton.head()

Unnamed: 0,University,Canton
0,Université de Genève - GE,Not Found
1,"NPO (Biblioth., Museen, Verwalt.) - NPO",Not Found
2,Universität Basel - BS,Basel-City
3,Université de Fribourg - FR,Fribourg
4,Universität Zürich - ZH,Zurich


In [65]:
uni_canton[uni_canton['Canton'] != "Not Found"]

for ucount in range(len(uni_test)): # search full name wit &
    uni = uni_test[0][ucount]
    result = canton_get(uni)
    if (result['geonames'] != []):
        if (canton_in_CH(result)):
            uni_canton.loc[ucount] = [uni,result['geonames'][0]['adminName1']]
    
uni_canton.head()

Unnamed: 0,University,Canton
0,Université de Genève - GE,Not Found
1,"NPO (Biblioth., Museen, Verwalt.) - NPO",Not Found
2,Universität Basel - BS,Basel-City
3,Université de Fribourg - FR,Fribourg
4,Universität Zürich - ZH,Zurich


In [66]:
np.sum(uni_canton.loc[:,'Canton']=='Not Found')

68

In [67]:
uni_canton[uni_canton['Canton'] != "Not Found"]

Unnamed: 0,University,Canton
2,Universität Basel - BS,Basel-City
3,Université de Fribourg - FR,Fribourg
4,Universität Zürich - ZH,Zurich
5,Fernfachhochschule Schweiz (Mitglied SUPSI) - ...,Vaud
6,Universität Bern - BE,Bern
8,Université de Neuchâtel - NE,Neuchâtel
9,Fernfachhochschule Schweiz (Mitglied SUPSI) - ...,Zurich
34,Fernfachhochschule Schweiz (Mitglied SUPSI) - ...,Saint Gallen


## GoogleMaps API

We use a Python [API wrapper](https://github.com/slimkrazy/python-google-places) for GoogleMaps

In [68]:
# to install the wrapper
# !pip install https://github.com/slimkrazy/python-google-places/zipball/master

In [69]:
import googlemaps

# Key omitted due to security concerns on Github
API_KEY = 'AIzaSyC2-TxJBHd-X8RnFh3-a6Y9hXri7WpktHE'
#########################'

gmaps = googlemaps.Client(key=API_KEY)

In [103]:
!pip install python-google-places

Collecting python-google-places
  Downloading python-google-places-1.2.0.tar.gz
Building wheels for collected packages: python-google-places
  Running setup.py bdist_wheel for python-google-places ... [?25l- \ done
[?25h  Stored in directory: /home/ubuntu/.cache/pip/wheels/cc/78/46/3e0cf79ddf2ea0c5fdd10d60c2281d86b90b1ab86ceca2f6f0
Successfully built python-google-places
Installing collected packages: python-google-places
Successfully installed python-google-places-1.2.0


In [117]:
from googleplaces import GooglePlaces, types, lang


def find_canton(k)

google_places = GooglePlaces(API_KEY)
r = google_places.text_search('Idiap Research Institute')#EPFL Switzerland')

In [119]:
r.places[0].get_details()
r.places[0].details

{'address_components': [{'long_name': '19',
   'short_name': '19',
   'types': ['street_number']},
  {'long_name': 'Rue Marconi',
   'short_name': 'Rue Marconi',
   'types': ['route']},
  {'long_name': 'Martigny',
   'short_name': 'Martigny',
   'types': ['locality', 'political']},
  {'long_name': 'Martigny',
   'short_name': 'Martigny',
   'types': ['administrative_area_level_2', 'political']},
  {'long_name': 'Valais',
   'short_name': 'VS',
   'types': ['administrative_area_level_1', 'political']},
  {'long_name': 'Switzerland',
   'short_name': 'CH',
   'types': ['country', 'political']},
  {'long_name': '1920', 'short_name': '1920', 'types': ['postal_code']}],
 'adr_address': '<span class="street-address">Rue Marconi 19</span>, <span class="postal-code">CH-1920</span> <span class="locality">Martigny</span>, <span class="country-name">Switzerland</span>',
 'formatted_address': 'Rue Marconi 19, 1920 Martigny, Switzerland',
 'formatted_phone_number': '027 721 77 11',
 'geometry': {'l

In [70]:
# We try a sample search to obtain the structure of reply
test_epfl = gmaps.geocode('EPFL Switzerland')
test_epfl

[{'address_components': [{'long_name': 'EPFL',
    'short_name': 'EPFL',
    'types': ['premise']},
   {'long_name': 'Lausanne',
    'short_name': 'Lausanne',
    'types': ['locality', 'political']},
   {'long_name': 'Ouest lausannois',
    'short_name': 'Ouest lausannois',
    'types': ['administrative_area_level_2', 'political']},
   {'long_name': 'Vaud',
    'short_name': 'VD',
    'types': ['administrative_area_level_1', 'political']},
   {'long_name': 'Switzerland',
    'short_name': 'CH',
    'types': ['country', 'political']},
   {'long_name': '1015', 'short_name': '1015', 'types': ['postal_code']}],
  'formatted_address': 'EPFL, 1015 Lausanne, Switzerland',
  'geometry': {'location': {'lat': 46.5189865, 'lng': 6.5676007},
   'location_type': 'ROOFTOP',
   'viewport': {'northeast': {'lat': 46.52033548029151,
     'lng': 6.568949680291502},
    'southwest': {'lat': 46.51763751970851, 'lng': 6.566251719708498}}},
  'place_id': 'ChIJ4zm3ev4wjEcRShTLf2C0UWA',
  'types': ['premise']}

We notice that canton information is returned in 'administrative_area_level_2' field. long_name gives the full name of the canton, short_name gives the abbrevation, the same as the id used in geometry json file.

In [71]:
CANTON_TYPE_KEY = 'administrative_area_level_1'
    
def place_in_CH(place_info):
    for info_dict in place_info['address_components']:
        if ('country' in info_dict['types']) and (info_dict['long_name'] == 'Switzerland'):
            return True
    return False

def parse_canton(place_candidates):
    if (place_candidates == None or len(place_candidates) == 0):
        return None
    
    for place in place_candidates:
        if not place_in_CH(place):
            continue
        for info_dict in place['address_components']:
            if (CANTON_TYPE_KEY in info_dict['types']) and (info_dict['short_name'] in canton_ids):
                return info_dict['short_name']
    
    return None    

parse_canton(test_epfl)

'VD'

In [72]:
def construct_search_keys(university, institution):
    
    def keys_with(x):
        keys = []
        keys.append(x)
        keys.append(x + ', Switzerland')
        keys.append(x + ', CH')
        return keys
    
    uni_full_name, uni_short_name = university.rsplit('-',1)

    keys = []   
    keys += keys_with(university)
    keys += keys_with(uni_full_name)
    
    if institution != None:
        keys += keys_with(str(institution))
    if uni_short_name != None:
        keys += keys_with(uni_short_name)
    return keys 

In [73]:
def search_canton(keys):
    # geocode
    for key in keys:
        resp = gmaps.geocode(key)
        canton = parse_canton(resp)
        if canton != None:
            return canton
        
    # text_search
    for key in keys:
        resp = google_places.text_search(key)
        for place in resp.places:
            place.get_details()
            canton = parse_canton(place.details)     
        
        
    return None

In [74]:
# intialize uni_canton_dict from file
# uni_canton_dict maps university to canton
uni_canton_df = pd.read_csv('uni-canton.csv', encoding='utf-8')
uni_canton_df.set_index('University', inplace=True)
uni_canton_dict = uni_canton_df.to_dict()['0']

In [75]:
def find_canton(row):
    # if not searched before
    if row.University not in uni_canton_dict:
        search_keys = construct_search_keys(row.University, row.Institution)
        uni_canton_dict[row.University] = search_canton(search_keys)
        print(row.University + " " + uni_canton_dict[row.University])
    return uni_canton_dict[row.University]

In [76]:
# uni_canton_df = pd.DataFrame.from_dict(uni_canton_dict, orient="index")
# uni_canton_df.index.name = 'University'
# uni_canton_df.to_csv('uni-canton.csv', encoding='utf-8')

In [77]:
df = data.copy()
df['Canton'] = df.apply(find_canton, axis=1)

In [78]:
df.head()

Unnamed: 0,University,Institution,Approved Amount,Canton
1,Université de Genève - GE,Faculté de Psychologie et des Sciences de l'Ed...,41022.0,GE
2,"NPO (Biblioth., Museen, Verwalt.) - NPO",Kommission für das Corpus philosophorum medii ...,79732.0,FR
3,Universität Basel - BS,Abt. Handschriften und Alte Drucke Bibliothek ...,52627.0,BS
4,"NPO (Biblioth., Museen, Verwalt.) - NPO",Schweiz. Thesauruskommission,120042.0,FR
5,Université de Fribourg - FR,"Séminaire de politique économique, d'économie ...",53009.0,FR


In [79]:
df.to_csv('data-with-canton.csv', encoding='utf-8')

In [80]:
found_rate = pd.notnull(df['Canton']).sum() / len(df['Canton'])
found_rate

0.96396247126317747

### Manually handle missing data

Fow now, we simply ignore the missing data cases.

#  Compute grants by canton

In [81]:
canton_grants_df = df[['Approved Amount', 'Canton']].groupby(['Canton']).sum()
canton_grants_df.reset_index(inplace=True)

In [82]:
# Add missing data
for canton_id in canton_ids:
    if canton_id not in canton_grants_df['Canton'].values:
        canton_grants_df = canton_grants_df.append({'Canton':canton_id, 'Approved Amount': 0.0}, ignore_index=True)

canton_grants_df = canton_grants_df.sort_values(by=['Approved Amount'], ascending=False)
canton_grants_df

Unnamed: 0,Canton,Approved Amount
19,ZH,3573242000.0
16,VD,2393342000.0
5,GE,1849821000.0
2,BE,1552392000.0
3,BS,1352269000.0
4,FR,791656800.0
6,GR,513160500.0
0,AG,119918500.0
9,SG,86327450.0
14,TI,86089110.0


# Data visualization
##  Draw grants map by canton

In [83]:
map = folium.Map(location=[46.82244,8.22410], zoom_start=8)
map.choropleth(data=canton_grants_df,
               columns=['Canton', 'Approved Amount'], 
               key_on='feature.id', 
               geo_path=CH_GEO_JSON, 
               topojson='objects.cantons', 
               fill_color='YlOrRd'
               )
map



** TODO **: The map cannot be displayed on Github. To view the map, you have to run this cell locally with Jupyter Notebook. To facilitate code reviewing, we may add an additionaly snapshot of the map here.