# Table of Contents
 <p><div class="lev1"><a href="# Step 1. Process data"><span class="toc-item-num">1&nbsp;&nbsp;</span>TODO: Following given notebooks in HW1, create table of contents here :) </a></div>

This Jupyter extension might help. https://github.com/minrk/ipython_extensions

# Homework 03 - Interactive Viz

# Overview 

### Objective : Build a [Choropleth map](https://en.wikipedia.org/wiki/Choropleth_map) which shows intuitively how much grant money goes to each Swiss canton.

##### What data do we have?

The SNSF (Swiss National Science Foundation) has graciously provided the [P<sup>3</sup>](http://p3.snf.ch/) database, which contains data on research projects approved by the SNSF. We will use only the [Grants Data](http://p3.snf.ch/P3Export/P3_GrantExport.csv) (saved in `data/P3_GrantExport.csv`), as this contains the grant amounts going to different universities.

We are also given the file `data/ch-cantons.topojson.json`, which contains the geo-coordinates of each Swiss canton.

##### What needs to be done?

The Grants Data does not contain the Canton of the University/Institution that received the funding. Therefore, we will need to deduce the Canton from the University/Institution. Once this is accomplished, we can sum the grant money that was allocated to each Canton.

##### How will this be done?

We will use the following Python packages to accomplish our objective:
1. `pandas` to import, clean, and wrangle the Grants Data
2. `json` and `pprint` to read the `data/ch-cantons.topojson.json` file
3. `requests` with the [GeoNames API](http://www.geonames.org/export/web-services.html) and `googlemaps` (a [Python wrapper](https://github.com/googlemaps/google-maps-services-python) for Google Maps API Web Services) to determine the Canton of a particular University/Institution
4. `folium` to deal with the geographical data and create the Choropleth map

# Building the Choropleth Map

We start off by importing the Python packages mentioned above.

In [1]:
import requests
import folium
import pandas as pd
import numpy as np
import json
import random
from pprint import pprint

## 1. Importing the Data

We downloaded the [Grants Data](http://p3.snf.ch/P3Export/P3_GrantExport.csv) and placed it in the local `data` folder.

In [2]:
GRANTS_FILE = 'data/P3_GrantExport.csv'
grants_data_orig = pd.read_csv(GRANTS_FILE, sep = ';')
grants_data_orig.head(3)

Unnamed: 0,Project Number,Project Title,Project Title English,Responsible Applicant,Funding Instrument,Funding Instrument Hierarchy,Institution,University,Discipline Number,Discipline Name,Discipline Name Hierarchy,Start Date,End Date,Approved Amount,Keywords
0,1,Schlussband (Bd. VI) der Jacob Burckhardt-Biog...,,Kaegi Werner,Project funding (Div. I-III),Project funding,,Nicht zuteilbar - NA,10302,Swiss history,Human and Social Sciences;Theology & religious...,01.10.1975,30.09.1976,11619.0,
1,4,Batterie de tests à l'usage des enseignants po...,,Massarenti Léonard,Project funding (Div. I-III),Project funding,Faculté de Psychologie et des Sciences de l'Ed...,Université de Genève - GE,10104,Educational science and Pedagogy,"Human and Social Sciences;Psychology, educatio...",01.10.1975,30.09.1976,41022.0,
2,5,"Kritische Erstausgabe der ""Evidentiae contra D...",,Kommission für das Corpus philosophorum medii ...,Project funding (Div. I-III),Project funding,Kommission für das Corpus philosophorum medii ...,"NPO (Biblioth., Museen, Verwalt.) - NPO",10101,Philosophy,Human and Social Sciences;Linguistics and lite...,01.03.1976,28.02.1985,79732.0,


For our objective, we are interested in the following columns: 

* **University**: the university (if any) proposing the project. As the third row shows, a grant can also be awarded to an NPO (Non-Profit Organization).
* **Institution**: the institution proposing the project.
* **Approved Amount**: the amount of the approved grant.

The Institution/University can be used to deduce the Swiss Canton, and the Approved Amount provides the values for our Choropleth map. Let's extract these columns to reduce the size of our DataFrame.

In [3]:
grants_data_red = grants_data_orig[['University', 'Institution', 'Approved Amount']]
grants_data_red.head()

Unnamed: 0,University,Institution,Approved Amount
0,Nicht zuteilbar - NA,,11619.0
1,Université de Genève - GE,Faculté de Psychologie et des Sciences de l'Ed...,41022.0
2,"NPO (Biblioth., Museen, Verwalt.) - NPO",Kommission für das Corpus philosophorum medii ...,79732.0
3,Universität Basel - BS,Abt. Handschriften und Alte Drucke Bibliothek ...,52627.0
4,"NPO (Biblioth., Museen, Verwalt.) - NPO",Schweiz. Thesauruskommission,120042.0


Finally, we will rename the "Approved Amount" column to "Amount" so that it can be easily accessed.

In [4]:
grants_data_red = grants_data_red.rename(columns={'Approved Amount': 'Amount'})

## 2. Cleaning the Data

### Cleaning "Amount"

Let's make sure the "Amount" has the correct variable type, i.e. numeric, and that entries with an invalid "Amount" are dropped.

In [5]:
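print(grants_data_red.shape)
# work on a copy so that the column assignment does not touch grants_data_red
grants_clean_amount = grants_data_red.copy()
# non-numeric amounts are turned into NaN and dropped afterwards
grants_clean_amount['Amount'] = pd.to_numeric(grants_clean_amount['Amount'], errors='coerce')
grants_clean_amount = grants_clean_amount.dropna(subset=["Amount"])
print(grants_clean_amount.shape)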

(63969, 3)
(53059, 3)


We have dropped 10910 entries.
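
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a toy Series. The placeholder string `"not available"` is a made-up stand-in for whatever non-numeric text the export contains; it is an assumption, not a value taken from the SNSF file.

```python
import pandas as pd

# Toy amounts: two valid numbers, one made-up text placeholder, one missing value
raw = pd.Series(["11619.00", "41022.00", "not available", None])

# Values that cannot be parsed as numbers become NaN instead of raising an error
amount = pd.to_numeric(raw, errors="coerce")

# Dropping the NaNs leaves only the rows with a usable amount
kept = amount.dropna()
dropped = len(raw) - len(kept)
print(kept.tolist(), dropped)
```

Counting the difference in length before and after `dropna` is how the number of dropped entries can be double-checked.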

### Cleaning "University" and "Institution"

We only need one of these fields to be valid for each row in order to geolocate where the grant was awarded. So let's drop those rows that have invalid entries for **both** fields. Moreover, we have noticed that the "University" entries have the following structure:

`LONG NAME - SHORT NAME`

Furthermore, one of the entries is `Nicht zuteilbar - NA` which must be SNSF's way of indicating an invalid entry. We will first replace such entries with the standard entry for invalids - `NaN` - before dropping rows.

In [7]:
print(grants_clean_amount.shape)
grants_clean_place = grants_clean_amount.replace(to_replace="Nicht zuteilbar - NA", value=np.nan)
grants_clean_place.dropna(how='all',subset=["University","Institution"], inplace=True)
print(grants_clean_place.shape)

(53059, 3)
(49823, 3)
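
The following sketch illustrates, on a hand-made four-row table, what the combination of `replace` and `dropna(how='all', subset=...)` does. The institution names and amounts are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data: the sentinel "Nicht zuteilbar - NA" marks an unassignable university
toy = pd.DataFrame({
    "University": ["Nicht zuteilbar - NA", "Université de Genève - GE",
                   "Nicht zuteilbar - NA", np.nan],
    "Institution": [np.nan, "Some faculty", "Some commission", np.nan],
    "Amount": [11619.0, 41022.0, 120042.0, 500.0],
})

# Step 1: turn the sentinel into a proper missing value
toy = toy.replace("Nicht zuteilbar - NA", np.nan)

# Step 2: drop a row only if BOTH place columns are missing
toy = toy.dropna(how="all", subset=["University", "Institution"])
print(toy.index.tolist())
```

Rows 0 and 3 have neither a university nor an institution and are removed; row 2 survives because its institution can still be geolocated.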


In [8]:
# suppressed output for shorter notebook!
#grants_clean_place.University.value_counts(dropna=False)

We have dropped 3236 entries. Almost done! We only need either "University" or "Institution" for geolocation. For most entries the "University" should be enough. However, 1437 of the entries are not from universities but rather from Non-Profit Organizations (NPO), and there are 1540 `NaN`s. These counts were determined by running:

`grants_clean_place.University.value_counts(dropna=False)`

We will modify the "University" column by setting its value to `NaN` if the entry is "NPO (Biblioth., Museen, Verwalt.) - NPO". For the valid universities, we will decouple the `LONG NAME` and `SHORT NAME` and place them in different columns. These modifications will make it easier to map the university/institution to a Canton later on, and to check the criterion that 95% of the universities are mapped to a canton.

In [9]:
def extract_short_name(row):
    # extract the SHORT NAME from University (NaN for NPOs and missing values)
    uni = row["University"]
    # test for NaN before converting to str, otherwise NaN becomes the string 'nan'
    if pd.isnull(uni) or uni == "NPO (Biblioth., Museen, Verwalt.) - NPO":
        return np.nan
    split_entry = str(uni).split()
    return split_entry[-1]

def extract_long_name(entry):
    # extract the LONG NAME and remove the space at the end
    split_entry = entry.rsplit(" - ",1)
    return split_entry[0].rstrip()

def modify_university(row): 
    # keep long name unless it is an NPO or NaN
    if row["University"] == "NPO (Biblioth., Museen, Verwalt.) - NPO" or pd.isnull(row["University"]):
        return np.nan
    else: 
        return extract_long_name(str(row["University"]))

grants_modified_uni = grants_clean_place.copy()
grants_modified_uni['Short'] = grants_modified_uni.apply(lambda row: extract_short_name(row), axis=1)
grants_modified_uni['University'] = grants_modified_uni.apply(lambda row: modify_university(row), axis=1)

In [10]:
grants_modified_uni.head()

Unnamed: 0,University,Institution,Amount,Short
1,Université de Genève,Faculté de Psychologie et des Sciences de l'Ed...,41022.0,GE
2,,Kommission für das Corpus philosophorum medii ...,79732.0,
3,Universität Basel,Abt. Handschriften und Alte Drucke Bibliothek ...,52627.0,BS
4,,Schweiz. Thesauruskommission,120042.0,
5,Université de Fribourg,"Séminaire de politique économique, d'économie ...",53009.0,FR
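
As a compact illustration of the `LONG NAME - SHORT NAME` split, the sketch below uses `str.rpartition` so that only the last `" - "` separator is used. The helper name `split_university` and the short name `PMOD` in the second example are assumptions made for illustration.

```python
def split_university(entry):
    """Split 'LONG NAME - SHORT NAME' at the last ' - ' separator."""
    long_name, sep, short_name = entry.rpartition(" - ")
    if not sep:
        # no separator found: treat the whole entry as the long name
        return entry.strip(), None
    return long_name.strip(), short_name.strip()


geneva = split_university("Université de Genève - GE")
# A long name that itself contains a hyphen is left intact,
# because only ' - ' (with surrounding spaces) counts as separator
davos = split_university("Physikal.-Meteorolog. Observatorium Davos - PMOD")
print(geneva, davos)
```

Splitting from the right keeps hyphens inside the long name untouched, which matters for entries such as the Davos observatory.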


### Summary

We have gone from 63969 entries to 49823 by dropping rows with an invalid "Amount" or with neither a valid "University" nor a valid "Institution". We have also relabelled data: NPO entries were set to `NaN`, and the long and short university names were split into separate columns for convenience in the analysis.

## 3. Reading the Canton Data

Now we obtain the abbreviations for all cantons from the geographical JSON data in `data/ch-cantons.topojson.json`. We inspected the structure of the JSON file (using `pprint`) and extracted the canton abbreviations by navigating through the nested data structure.

In [11]:
CH_GEO_JSON = 'data/ch-cantons.topojson.json'

with open(CH_GEO_JSON, 'r') as f:
    canton_data = json.load(f)

cantons = canton_data['objects']['cantons']['geometries']
canton_ids = []
for canton in cantons:
    canton_ids.append(canton['id'])

print(canton_ids)
#pprint(canton_data)

['ZH', 'BE', 'LU', 'UR', 'SZ', 'OW', 'NW', 'GL', 'ZG', 'FR', 'SO', 'BS', 'BL', 'SH', 'AR', 'AI', 'SG', 'GR', 'AG', 'TG', 'TI', 'VD', 'VS', 'NE', 'GE', 'JU']
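
For readers without the file at hand, the sketch below shows the same navigation on a tiny hand-written TopoJSON fragment. Only the nesting `objects -> cantons -> geometries -> id` is taken from the notebook; the two geometries and their `arcs` are placeholders.

```python
import json

# Minimal stand-in for data/ch-cantons.topojson.json (geometry details are dummies)
sample = """
{"type": "Topology",
 "objects": {"cantons": {"type": "GeometryCollection",
                         "geometries": [
                             {"type": "Polygon", "id": "ZH", "arcs": [[0]]},
                             {"type": "Polygon", "id": "VD", "arcs": [[1]]}]}}}
"""

topo = json.loads(sample)
# Each geometry carries the canton abbreviation in its 'id' field
sample_ids = [geometry["id"] for geometry in topo["objects"]["cantons"]["geometries"]]
print(sample_ids)
```

These `id` values are what `key_on='feature.id'` refers to when the map is drawn later.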


## 4. Mapping "Institute" to a Canton

We will use a Python wrapper for geolocating the Universities and Institutions. Let's see how it works for `EPFL` so we know how to parse its result!

In [68]:
import os
import googlemaps
# read your own Google API key from the GOOGLE_MAPS_KEY environment variable
key = os.environ['GOOGLE_MAPS_KEY']
gmaps = googlemaps.Client(key=key)
test_epfl = gmaps.geocode('EPFL')
test_epfl

[{'address_components': [{'long_name': 'EPFL',
    'short_name': 'EPFL',
    'types': ['premise']},
   {'long_name': 'Lausanne',
    'short_name': 'Lausanne',
    'types': ['locality', 'political']},
   {'long_name': 'Ouest lausannois',
    'short_name': 'Ouest lausannois',
    'types': ['administrative_area_level_2', 'political']},
   {'long_name': 'Vaud',
    'short_name': 'VD',
    'types': ['administrative_area_level_1', 'political']},
   {'long_name': 'Switzerland',
    'short_name': 'CH',
    'types': ['country', 'political']},
   {'long_name': '1015', 'short_name': '1015', 'types': ['postal_code']}],
  'formatted_address': 'EPFL, 1015 Lausanne, Switzerland',
  'geometry': {'location': {'lat': 46.5189865, 'lng': 6.5676007},
   'location_type': 'ROOFTOP',
   'viewport': {'northeast': {'lat': 46.52033548029151,
     'lng': 6.568949680291502},
    'southwest': {'lat': 46.51763751970851, 'lng': 6.566251719708498}}},
  'place_id': 'ChIJ4zm3ev4wjEcRShTLf2C0UWA',
  'types': ['premise']}]

The output is a list of candidate locations (in this case a list of length 1), so we can loop through the candidates to find the Canton. The component whose type is `administrative_area_level_1` holds the short code of the Canton, which corresponds to the Canton ID in the geographical JSON data. We also need to check the country to make sure the university/institution is in Switzerland!

In [69]:
# check if result is in Switzerland
def in_Switzerland(place_info):
    for info_dict in place_info['address_components']:
        if ('country' in info_dict['types']) and (info_dict['long_name'] == 'Switzerland'):
            return True
    return False

in_Switzerland(test_epfl[0])

True

As we can see, EPFL is in Switzerland! If the University/Institution is in Switzerland, we can then parse the output for the Canton at `administrative_area_level_1`.

In [70]:
CANTON_TYPE_KEY = 'administrative_area_level_1'

def parse_canton(place):
    if (place is None) or (not in_Switzerland(place)):
        return None
    
    for components in place['address_components']:
        if (CANTON_TYPE_KEY in components['types']) and (components['short_name'] in canton_ids):
            return components['short_name']
    
    return None     

# # parse the Canton short name
# def parse_canton(place_candidates):
#     if (place_candidates == None or len(place_candidates) == 0):
#         return None
#     # go through multiple results in given to this function by passing a list of 'geocode' results
#     for place in place_candidates:
#         if not in_Switzerland(place):
#             continue
#         for info_dict in place['address_components']:
#             if (CANTON_TYPE_KEY in info_dict['types']) and (info_dict['short_name'] in canton_ids):
#                 return info_dict['short_name']
#     return None    

parse_canton(test_epfl[0])

'VD'
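
The country check matters because short codes are not unique worldwide. The self-contained sketch below combines both checks in one hypothetical helper, `canton_of`, and tests the country by its short code `CH` instead of the long name. The Swiss sample is trimmed from the EPFL response above; the non-Swiss sample is hand-written for illustration.

```python
def canton_of(place, valid_ids):
    """Return the canton ID of a geocoding candidate, or None if it is not Swiss."""
    components = place.get("address_components", [])
    countries = [c["short_name"] for c in components if "country" in c["types"]]
    if "CH" not in countries:
        return None
    for c in components:
        if "administrative_area_level_1" in c["types"] and c["short_name"] in valid_ids:
            return c["short_name"]
    return None


swiss = {"address_components": [
    {"long_name": "Vaud", "short_name": "VD",
     "types": ["administrative_area_level_1", "political"]},
    {"long_name": "Switzerland", "short_name": "CH",
     "types": ["country", "political"]}]}

# Hand-written candidate outside Switzerland
foreign = {"address_components": [
    {"long_name": "Illinois", "short_name": "IL",
     "types": ["administrative_area_level_1", "political"]},
    {"long_name": "United States", "short_name": "US",
     "types": ["country", "political"]}]}

print(canton_of(swiss, {"VD", "GE"}), canton_of(foreign, {"VD", "GE"}))
```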

We have noticed that using only the University Long Name is not robust enough. So we will use a combination of the University Long Name, the Institution, and the University Short Name in order to determine the Canton as such:

* IF it is a University: try LONG NAME
    * IF LONG NAME worked: return Canton ID
    * ELSEIF LONG NAME returned no result: try Institution (if it is not NaN)
    * IF still no result: try SHORT NAME + "CH"
    * return NaN if search didn't return a result
* ELSE: try Institution
    * return Canton ID or NaN if search didn't return a result
    
We will cache the results in a dictionary, though, so that we don't have to make 49823 queries: a search string that has already been resolved to a Canton is never sent to the API again.
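
The caching idea can be sketched independently of any web service. In the example below, `fake_geocode` is a stand-in for the real API call and `make_cached_lookup` is a hypothetical helper; both are assumptions made for illustration, not part of the notebook.

```python
def make_cached_lookup(geocode):
    """Wrap a geocoding function so that resolved search keys are remembered."""
    cache = {}

    def lookup(keys):
        # 1) answer from the cache if any key was resolved before
        for key in keys:
            if key in cache:
                return cache[key]
        # 2) otherwise try the keys in order and remember the first hit
        for key in keys:
            canton = geocode(key)
            if canton is not None:
                cache[key] = canton
                return canton
        return None

    return lookup, cache


calls = []

def fake_geocode(query):
    # stand-in for a web request; records every query it receives
    calls.append(query)
    return {"EPF Lausanne, Switzerland": "VD"}.get(query)


lookup, cache = make_cached_lookup(fake_geocode)
keys = ["EPF Lausanne", "EPF Lausanne, Switzerland", "EPF Lausanne, CH"]
first = lookup(keys)
second = lookup(keys)  # served from the cache, no new query
print(first, second, len(calls))
```

Note that in this sketch only successful lookups are cached, so a key that never resolves is queried again for every row that produces it.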

In [71]:
universities = grants_modified_uni.University.dropna().unique()
universities

array(['Université de Genève', 'Universität Basel',
       'Université de Fribourg', 'Universität Zürich',
       'Université de Lausanne', 'Universität Bern',
       'Eidg. Forschungsanstalt für Wald,Schnee,Land',
       'Université de Neuchâtel', 'ETH Zürich',
       'Inst. de Hautes Etudes Internat. et du Dév',
       'Universität St. Gallen', 'Weitere Institute',
       'Firmen/Privatwirtschaft', 'Pädagogische Hochschule Graubünden',
       'EPF Lausanne', 'Pädagogische Hochschule Zürich',
       'Universität Luzern', 'Schweiz. Institut für Kunstwissenschaft',
       'SUP della Svizzera italiana', 'HES de Suisse occidentale',
       'Robert Walser-Stiftung Bern', 'Paul Scherrer Institut',
       'Pädagogische Hochschule St. Gallen',
       'Eidg. Anstalt für Wasserversorgung',
       'Eidg. Material und Prüfungsanstalt',
       'Physikal.-Meteorolog. Observatorium Davos',
       'Berner Fachhochschule',
       'Swiss Center for Electronics and Microtech.', 'Weitere Spitäler',
     

In [72]:
from googleplaces import GooglePlaces, types, lang

google_places = GooglePlaces(key)


In [74]:
r = google_places.text_search('Idiap Research Institute')#EPFL Switzerland')
#r.places[0].get_details()
#parse_canton(r.places[0].details)

In [75]:
r

<GooglePlacesSearchResult with 1 result(s)>

In [76]:
r.places[0]

<Place name="Idiap Research Institute", lat=46.109376, lng=7.084442999999999>

In [78]:
try:
    r.places[0].get_details()
except Exception:
    pass

r.places[0].details

{'address_components': [{'long_name': '19',
   'short_name': '19',
   'types': ['street_number']},
  {'long_name': 'Rue Marconi',
   'short_name': 'Rue Marconi',
   'types': ['route']},
  {'long_name': 'Martigny',
   'short_name': 'Martigny',
   'types': ['locality', 'political']},
  {'long_name': 'Martigny',
   'short_name': 'Martigny',
   'types': ['administrative_area_level_2', 'political']},
  {'long_name': 'Valais',
   'short_name': 'VS',
   'types': ['administrative_area_level_1', 'political']},
  {'long_name': 'Switzerland',
   'short_name': 'CH',
   'types': ['country', 'political']},
  {'long_name': '1920', 'short_name': '1920', 'types': ['postal_code']}],
 'adr_address': '<span class="street-address">Rue Marconi 19</span>, <span class="postal-code">CH-1920</span> <span class="locality">Martigny</span>, <span class="country-name">Switzerland</span>',
 'formatted_address': 'Rue Marconi 19, 1920 Martigny, Switzerland',
 'formatted_phone_number': '027 721 77 11',
 'geometry': {'l

In [89]:
def keys_with(x):
    keys = []
    keys.append(x)
    keys.append(x + ', Switzerland')
    keys.append(x + ', CH')
    return keys   

def construct_search_keys(row):
    keys = []

    if (row.University != None) and (pd.notnull(row.University)):
        keys += keys_with(row.University)
    
    if (row.Institution != None) and (pd.notnull(row.Institution)):
        keys += keys_with(row.Institution)
    
    if (row.Short != None) and (pd.notnull(row.Short)):
        keys += keys_with(row.Short)
    
    return keys

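In [92]:
# cache of resolved search keys: search key -> canton ID
# (kept under its own name because `cantons` already holds the TopoJSON geometries)
canton_cache = {}

In [94]:
def search_canton(row):
    keys = construct_search_keys(row)
    
    # return a cached result if one of the keys was resolved before
    for key in keys:
        if key in canton_cache:
            return canton_cache[key]
    
    # geocode
    for key in keys:
        places = gmaps.geocode(key)
        for place in places:
            canton = parse_canton(place)
            if canton is not None:
                canton_cache[key] = canton
                return canton
        
    # text_search
    for key in keys:
        resp = google_places.text_search(key)
        for place in resp.places:
            try:
                place.get_details()
            except Exception:
                # skip candidates whose details cannot be fetched
                continue
            canton = parse_canton(place.details)     
            if canton is not None:
                canton_cache[key] = canton
                return canton

    return None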

In [95]:
grants_canton = grants_modified_uni.copy()
grants_canton['Canton'] = grants_canton.apply(search_canton, axis=1)

Université de Genève GE
Universität Basel BS
Université de Fribourg FR
Universität Zürich ZH
Université de Lausanne VD
Universität Bern BE
Schweizerische Gesellschaft für Volkskunde BS
WSL ZH
Schweizerisches Nationalmuseum Landesmuseum Zürich ZH
Fondation Hardt pour l'étude de l'Antiquité classique GE
Historische und Antiquarische Gesellschaft zu Basel BS
Bibliothèque de Genève Département des périodiques GE
NE, Switzerland NE
ETH Zürich ZH
Collège secondaire de Béthusy VD
Inst. de Hautes Etudes Internat. et du Dév GE
Antikenmuseum BS
Universität St. Gallen SG
FINST, Switzerland BE
Schweiz. Osteuropabibliothek BE
FP, Switzerland ZH
Fondation pour le LIMC, Switzerland VD
Stift. Pro Kloster St. Johann in Müstair, Switzerland GR
Schweizerische Nationalbibliothek BE
Büro für Archäologie der Stadt Zürich ZH
Museum der Kulturen BS
Kantonsschule Sargans SG
Gesellschaft für Schweizerische Kunstgeschichte BE
Staatsarchiv ZH
Pädagogische Hochschule Graubünden, Switzerland GR
Schweizerische Gesel

In [96]:
grants_canton

Unnamed: 0,University,Institution,Amount,Short,Canton
1,Université de Genève,Faculté de Psychologie et des Sciences de l'Ed...,41022.0,GE,GE
2,,Kommission für das Corpus philosophorum medii ...,79732.0,,
3,Universität Basel,Abt. Handschriften und Alte Drucke Bibliothek ...,52627.0,BS,BS
4,,Schweiz. Thesauruskommission,120042.0,,
5,Université de Fribourg,"Séminaire de politique économique, d'économie ...",53009.0,FR,FR
6,Université de Fribourg,Institut für ökumenische Studien Université de...,25403.0,FR,FR
7,Universität Zürich,Ostasiatisches Seminar Universität Zürich,47100.0,ZH,ZH
8,Université de Lausanne,,25814.0,LA,VD
9,Université de Genève,Laboratoire de Didactique et Epistémologie des...,360000.0,GE,GE
10,Université de Fribourg,Klinische Psychologie und Psychotherapie Insti...,153886.0,FR,FR


In [97]:
grants_canton.to_csv('grants-canton.csv')

In [98]:
# grants_canton = pd.read_csv('grants-canton.csv')

In [99]:
pd.notnull(grants_canton['Canton']).sum() / len(grants_canton)

0.98884049535355156
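
The ratio above is the share of rows with a known Canton. As a small sanity check of the idea, the sketch below computes the same share with plain Python for the ten rows displayed earlier, where `None` stands for an unmapped row. The helper `coverage` and the threshold variable are illustrative names.

```python
def coverage(cantons):
    """Fraction of entries that were mapped to a canton."""
    mapped = sum(1 for canton in cantons if canton is not None)
    return mapped / len(cantons)


# Canton column of the ten rows shown above (two NPO rows are unmapped)
shown = ["GE", None, "BS", None, "FR", "FR", "ZH", "VD", "GE", "FR"]
REQUIRED = 0.95  # criterion mentioned earlier in the notebook

share = coverage(shown)
print(share, share >= REQUIRED)
```

On this small excerpt the share is below the criterion, which only shows that the early rows contain many NPO entries; the full table reaches about 98.9%.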

To compute grants by canton, we need to find the corresponding Canton for a particular "Institute".

We will first use the **Geonames API** as this was suggested by the assignment but then revert to the results of the **GoogleMaps API** as the latter is able to map more of the "Institutes". 

### Geonames API

In [None]:
uni_canton = pd.DataFrame(columns=['University','Canton'])

In [None]:
def canton_in_CH(result):
    # True if any of the returned GeoNames hits is located in Switzerland
    for hit in result['geonames']:
        if hit['countryName'] == 'Switzerland':
            return True
    return False

In [None]:
URL = 'http://api.geonames.org/searchJSON?'

def canton_get(uni):
    paraload = {'q':uni,'username':'shiyuenie'}
    r = requests.get(URL, params=paraload)
    result = r.json()
    return result
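
The request and the parsing of its response can be sketched without network access. Below, the query URL is only built, not sent, and the response is a hand-written dictionary in the shape the code above expects (`geonames`, `countryName`, `adminName1`). The optional `country` and `maxRows` parameters, the user name `demo`, and both helper names are assumptions for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.geonames.org/searchJSON"


def build_query(name, username, country="CH", max_rows=1):
    """Build the search URL; restricting to one country is an optional refinement."""
    params = {"q": name, "username": username,
              "country": country, "maxRows": max_rows}
    return BASE_URL + "?" + urlencode(params)


def first_swiss_canton(result):
    """Return adminName1 of the first hit located in Switzerland, else None."""
    for hit in result.get("geonames", []):
        if hit.get("countryName") == "Switzerland":
            return hit.get("adminName1")
    return None


url = build_query("AGRIDEA, CH", username="demo")

# Hand-written response: the first hit is foreign, the second one is Swiss
fake_result = {"geonames": [
    {"countryName": "France", "adminName1": "Grand Est"},
    {"countryName": "Switzerland", "adminName1": "Vaud"}]}
canton_name = first_swiss_canton(fake_result)
print(url, canton_name)
```

Unlike taking `result['geonames'][0]` directly, this variant reads the canton from the hit that is actually in Switzerland.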

In [None]:
canton_get("AGRIDEA, CH")

In [None]:
for ucount in range(len(institutes)): # search full name
    uni = institutes[0][ucount]
    result = canton_get(uni)
    if (result['geonames'] != []):
        if (canton_in_CH(result)):
            uni_canton.loc[ucount] = [uni,result['geonames'][0]['adminName1']]
        else:
            uni_canton.loc[ucount] = [uni,'Not Found']   
    else:
        uni_canton.loc[ucount] = [uni,'Not Found']
    
uni_canton.head()

In [None]:
#uni_name=uni_name.replace({r'\s+': '&'}, regex=True)

for ucount in range(len(uni_name)):
    if (uni_canton.loc[:,'Canton'][ucount] == 'Not Found'): # search with separate parts
        uni_full = uni_name['Full Name'][ucount] 
        uni_short = uni_name['Short Name'][ucount] 
        result = canton_get(uni_full) # search part 1
        if (result['geonames'] != []):
            if(canton_in_CH(result)):
                uni_canton.loc[ucount] = [uni,result['geonames'][0]['adminName1']]
            else:
                result = canton_get(uni_short) # search part 2 
                if (result['geonames'] != []):
                    if(canton_in_CH(result)):
                        uni_canton.loc[ucount] = [uni,result['geonames'][0]['adminName1']]
        
uni_canton.head()

In [None]:
uni_canton[uni_canton['Canton'] != "Not Found"]

for ucount in range(len(uni_test)): # search full name with &
    uni = uni_test[0][ucount]
    result = canton_get(uni)
    if (result['geonames'] != []):
        if (canton_in_CH(result)):
            uni_canton.loc[ucount] = [uni,result['geonames'][0]['adminName1']]
    
uni_canton.head()

In [None]:
np.sum(uni_canton.loc[:,'Canton']=='Not Found')

In [None]:
uni_canton[uni_canton['Canton'] != "Not Found"]

In [None]:
np.sum(uni_canton.loc[:,'Canton']!='Not Found')

### Manually handle missing data

For now, we simply ignore the missing records.

#  Compute grants by canton

In [132]:
grants_by_canton = grants_canton[['Amount', 'Canton']].groupby(['Canton']).sum()
grants_by_canton.reset_index(inplace=True)
grants_by_canton.head()

Unnamed: 0,Canton,Amount
0,AG,130727800.0
1,AR,7238730.0
2,BE,1617490000.0
3,BL,2103119.0
4,BS,1440733000.0


In [133]:
# Add cantons without any grant (amount 0) so that every canton appears on the map
missing = [c for c in canton_ids if c not in grants_by_canton['Canton'].values]
grants_by_canton = pd.concat(
    [grants_by_canton, pd.DataFrame({'Canton': missing, 'Amount': 0.0})],
    ignore_index=True)

grants_by_canton = grants_by_canton.sort_values(by=['Amount'], ascending=False)
grants_by_canton

Unnamed: 0,Canton,Amount
22,ZH,3870949000.0
19,VD,2409645000.0
6,GE,1901507000.0
2,BE,1617490000.0
4,BS,1440733000.0
5,FR,465617200.0
10,NE,403839800.0
0,AG,130727800.0
17,TI,121137600.0
7,GR,102613400.0
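
An alternative way to make sure that every canton appears in the table is to reindex the grouped result against the full list of canton IDs. The sketch below shows this on toy numbers; the amounts and the shortened ID list are illustrative only.

```python
import pandas as pd

# Toy input: NW and GL receive no grant at all
toy_ids = ["ZH", "VD", "NW", "GL"]
toy_grants = pd.DataFrame({"Canton": ["ZH", "VD", "ZH"],
                           "Amount": [100.0, 50.0, 25.0]})

toy_by_canton = (toy_grants.groupby("Canton")["Amount"].sum()
                 .reindex(toy_ids, fill_value=0.0)  # add missing cantons with 0
                 .rename_axis("Canton")
                 .reset_index())
print(toy_by_canton)
```

Cantons that are absent after the `groupby` get an explicit 0.0, so the choropleth has a value for every shape in the TopoJSON file.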


# Data visualization
##  Draw grants map by canton

In [135]:
map = folium.Map(location=[46.82244,8.22410], zoom_start=8)
map.choropleth(data=grants_by_canton,
               columns=['Canton', 'Amount'], 
               key_on='feature.id', 
               geo_path=CH_GEO_JSON, 
               topojson='objects.cantons', 
               fill_color='YlOrRd'
               )
map



**TODO**: The map cannot be displayed on GitHub. To view it, run this cell locally in Jupyter Notebook. To facilitate code review, we may add a snapshot of the map here.

# Bonus :  Research funding difference with Röstigraben

In this task, we use the map and the geographical information from the previous task to give a rough estimate of the difference in research funding between the areas divided by the Röstigraben.

As [Wikipedia](https://en.wikipedia.org/wiki/R%C3%B6stigraben) explains, Röstigraben is a humorous term for the cultural boundary between the German-speaking and French-speaking parts of Switzerland. We therefore first determine the German-speaking and French-speaking parts, and then sum the grants of each part for comparison.

## Matching language areas
First, we tried to find the exact boundary of the Röstigraben (e.g. the map shown on the Wikipedia page). Unfortunately, we could not find an authoritative reference for it. Since we only want an approximate estimate of the 'Röstigraben research funding difference', we decided to separate the language areas by canton, matching each canton manually to its official language(s) as listed on [Wikipedia](https://en.wikipedia.org/wiki/Languages_of_Switzerland).

Let's first import the canton abbreviations from the TopoJSON file.

In [136]:
import json
import pandas as pd

CH_GEO_JSON = 'data/ch-cantons.topojson.json'
# Download ch-cantons.topojson.json and copy it into the subfolder 'data'

with open(CH_GEO_JSON, 'r') as f:
    d = json.load(f)

cantons = d['objects']['cantons']['geometries']
canton_id_name= []
for canton in cantons:
    canton_id_name.append(canton['id'])
    
canton_language =pd.DataFrame(canton_id_name, columns=['Canton'])

Following the Wikipedia description, we separate the 26 cantons into 5 language groups: German (GR), French (FR), Italian (IT), German and French (GR/FR), and German, Italian and Romansh (GR/IT/RO). Note that the label GR here stands for German, not for the canton of Graubünden. We save the canton abbreviations of each group in a corresponding list.

In [137]:
# In 17 Swiss cantons, German is the only official language 
german_canton = ["AG", "AR", "AI", "BS", 
                 "BL", "GL", "LU", "NW", "OW", "SH", 
                 "SZ", "SO", "SG", "TG", "UR", "ZG", "ZH"]
french_canton = ["GE", "VD", "NE", "JU"]
italian_canton = ["TI"]
# In the cantons of Bern, Fribourg and Valais, French is co-official:
# Bern (German-speaking majority), Valais (French-speaking majority), and Fribourg (French-speaking majority)
gr_co_fr_canton = ["BE", "FR", "VS"] 
# In the trilingual canton of Graubünden, more than half of the population speaks German,
# while the rest speak Romansh or Italian.
tri_canton = ["GR"]

The function `lg_match` matches each canton with its language group.

In [138]:
# german: GR  french: FR  italian: IT   romansh: RO

def lg_match(row):
    if row.Canton in german_canton:
        return 'GR'
    elif row.Canton in french_canton:
        return 'FR'
    elif row.Canton in gr_co_fr_canton:
        return 'GR/FR'
    elif row.Canton in italian_canton:
        return 'IT'
    else:
        return 'GR/IT/RO'

canton_language['Language'] = canton_language.apply(lg_match, axis=1)

canton_language

Unnamed: 0,Canton,Language
0,ZH,GR
1,BE,GR/FR
2,LU,GR
3,UR,GR
4,SZ,GR
5,OW,GR
6,NW,GR
7,GL,GR
8,ZG,GR
9,FR,GR/FR
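
The same matching can also be expressed as a dictionary lookup. The sketch below inverts the group lists into a canton-to-label mapping; the names `LANGUAGE_GROUPS` and `CANTON_LANGUAGE` are illustrative, while the group contents are copied from the lists above.

```python
# Language groups as defined above (GR = German, not the canton of Graubünden)
LANGUAGE_GROUPS = {
    "GR": ["AG", "AR", "AI", "BS", "BL", "GL", "LU", "NW", "OW", "SH",
           "SZ", "SO", "SG", "TG", "UR", "ZG", "ZH"],
    "FR": ["GE", "VD", "NE", "JU"],
    "GR/FR": ["BE", "FR", "VS"],
    "IT": ["TI"],
    "GR/IT/RO": ["GR"],
}

# Invert the groups into a canton -> label lookup table
CANTON_LANGUAGE = {canton: label
                   for label, members in LANGUAGE_GROUPS.items()
                   for canton in members}

print(len(CANTON_LANGUAGE), CANTON_LANGUAGE["BE"], CANTON_LANGUAGE["GR"])
```

One difference from the `if`/`elif` chain is that an unknown abbreviation raises a `KeyError` here instead of silently falling into the trilingual group.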


## Sum up all funding in the different language areas
The previous task computed the approved amount for each canton. We can merge it in for further calculation. 

In [140]:
c_lg_funding = canton_language.merge(grants_by_canton)
c_lg_funding

Unnamed: 0,Canton,Language,Amount
0,ZH,GR,3870949000.0
1,BE,GR/FR,1617490000.0
2,LU,GR,62287710.0
3,UR,GR,1672678.0
4,SZ,GR,1625438.0
5,OW,GR,138096.0
6,NW,GR,0.0
7,GL,GR,0.0
8,ZG,GR,9478505.0
9,FR,GR/FR,465617200.0


In [141]:
language_grants_df = c_lg_funding.groupby(['Language']).sum()
language_grants_df.rename(columns={'Amount':'Grants in language areas'}, inplace=True)
language_grants_df.reset_index(inplace=True)
language_grants_df

Unnamed: 0,Language,Grants in language areas
0,FR,4715541000.0
1,GR,5712489000.0
2,GR/FR,2114327000.0
3,GR/IT/RO,102613400.0
4,IT,121137600.0


Drawing this map with language areas as shapes would require writing a new TopoJSON file, so we reuse the canton file from above. Because that file divides the map into cantons, we assign the total funding of each language area to all of its cantons. In this way, every canton of a language area is painted in the same color.

In [142]:
canton_lg_funding = canton_language.merge(language_grants_df, how='outer')
canton_lg_funding

Unnamed: 0,Canton,Language,Grants in language areas
0,ZH,GR,5712489000.0
1,LU,GR,5712489000.0
2,UR,GR,5712489000.0
3,SZ,GR,5712489000.0
4,OW,GR,5712489000.0
5,NW,GR,5712489000.0
6,GL,GR,5712489000.0
7,ZG,GR,5712489000.0
8,SO,GR,5712489000.0
9,BS,GR,5712489000.0
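
Assigning a group total to every member row can also be done in one step with `groupby(...).transform('sum')`, which keeps the per-canton amounts next to the area totals. The sketch below uses toy numbers; the column name `Area total` is an illustrative choice.

```python
import pandas as pd

toy_funding = pd.DataFrame({
    "Canton": ["ZH", "LU", "GE", "VD", "TI"],
    "Language": ["GR", "GR", "FR", "FR", "IT"],
    "Amount": [10.0, 2.0, 5.0, 6.0, 1.0],
})

# transform('sum') returns one value per row: the total of that row's group
toy_funding["Area total"] = toy_funding.groupby("Language")["Amount"].transform("sum")
print(toy_funding)
```

Keeping `Amount` and the area total in separate columns makes it harder to sum the totals again by accident.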


Here we can use the dataframe to draw the map (shown in grants_by_language.html). 

In [143]:
map = folium.Map(location=[46.82244,8.22410], zoom_start=8)
map.choropleth(data=canton_lg_funding,
               columns=['Canton', 'Grants in language areas'], 
               key_on='feature.id',
               geo_path=CH_GEO_JSON, 
               topojson='objects.cantons', 
               fill_color='YlOrRd'
               )
map.save('grants_by_language.html')



In [144]:
map

There are 4 distinct color blocks in the result: the rose-colored part in the north is the German-speaking area, with the largest funding amount (over 5 billion); the red part in the west is the French-speaking area, with the second largest amount (over 4 billion). The Italian and trilingual areas have almost the same color because of their small grant totals.

Compared with the per-canton map (grants_by_canton.html), it is easy to see that, apart from Zurich's large total, the rest of the German-speaking area actually receives small amounts. In the French-speaking area, both Vaud and Geneva receive large amounts. Within the bilingual part, the internal distribution seems to be more balanced.

## Röstigraben Grants
To roughly divide the cantons by the Röstigraben, we treat the most common language of each canton as its 'official language'. The resulting map will be closer to the real Röstigraben map.

We can reuse the table above, relabel the bilingual/trilingual cantons, and then sum again as for the previous map.

In [145]:
gr_fr_funding = canton_lg_funding.copy()
gr_fr_funding.rename(columns={'Grants in language areas':'Röstigraben Grants'}, inplace=True)
gr_fr_funding.head()

Unnamed: 0,Canton,Language,Röstigraben Grants
0,ZH,GR,5712489000.0
1,LU,GR,5712489000.0
2,UR,GR,5712489000.0
3,SZ,GR,5712489000.0
4,OW,GR,5712489000.0


Among the bilingual cantons, Bern has a German-speaking majority, while Valais and Fribourg have French-speaking majorities, so we assign Bern to the German-speaking part and Valais and Fribourg to the French-speaking part. 
In the trilingual canton of Graubünden, more than half of the population speaks German, so it is assigned to the German-speaking part.

In [146]:
ngr_cantons = ["BE", "GR"] # German-speaking majority
nfr_cantons = ["FR", "VS"] # French-speaking majority
# Bern and Graubünden (German-speaking majority), Valais and Fribourg (French-speaking majority)

def lg_change(row):
    # relabel bilingual/trilingual cantons by their majority language
    if row.Canton in ngr_cantons:
        return 'GR'
    if row.Canton in nfr_cantons:
        return 'FR'
    return row.Language

gr_fr_funding['Language'] = gr_fr_funding.apply(lg_change, axis=1)
gr_fr_funding

Unnamed: 0,Canton,Language,Röstigraben Grants
0,ZH,GR,5712489000.0
1,LU,GR,5712489000.0
2,UR,GR,5712489000.0
3,SZ,GR,5712489000.0
4,OW,GR,5712489000.0
5,NW,GR,5712489000.0
6,GL,GR,5712489000.0
7,ZG,GR,5712489000.0
8,SO,GR,5712489000.0
9,BS,GR,5712489000.0


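The remaining steps are similar: we group the funding by area and merge the totals back onto each canton. Note that every row of `gr_fr_funding` already carries the total of its original language area, so the sums below count each area total once per canton and overstate the absolute amounts. Only the ordering of the areas (German, then French, then Italian) should be read from them; exact totals would have to be summed from the per-canton `Amount` column of `c_lg_funding`.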

In [147]:
r_grants_df = gr_fr_funding.groupby(['Language']).sum()
r_grants_df.reset_index(inplace=True)
r_grants_df

Unnamed: 0,Language,Röstigraben Grants
0,FR,23090820000.0
1,GR,99329260000.0
2,IT,121137600.0
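
As a rough cross-check, the Röstigraben totals can be recomputed by hand from figures printed earlier in this notebook: the five language-area totals and the per-canton amount of Bern. The sketch below does only this arithmetic; because the inputs are the rounded values shown in the tables, the results are approximate.

```python
# Language-area totals as printed earlier (rounded display values)
area_totals = {"FR": 4715541000.0, "GR": 5712489000.0, "GR/FR": 2114327000.0,
               "GR/IT/RO": 102613400.0, "IT": 121137600.0}
bern = 1617490000.0  # per-canton amount of BE from the grants-by-canton table

# German side: German-only cantons + Bern + Graubünden
german_side = area_totals["GR"] + bern + area_totals["GR/IT/RO"]
# French side: French-only cantons + the bilingual cantons other than Bern (FR, VS)
french_side = area_totals["FR"] + (area_totals["GR/FR"] - bern)

print(round(german_side / 1e9, 2), round(french_side / 1e9, 2))
```

Under these assumptions the German-speaking side receives roughly 7.4 billion and the French-speaking side roughly 5.2 billion, a gap of about 2.2 billion.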


In [148]:
r_lg = gr_fr_funding.drop(columns='Röstigraben Grants')
r_lg_funding = r_lg.merge(r_grants_df, how='outer')
r_lg_funding

Unnamed: 0,Canton,Language,Röstigraben Grants
0,ZH,GR,99329260000.0
1,LU,GR,99329260000.0
2,UR,GR,99329260000.0
3,SZ,GR,99329260000.0
4,OW,GR,99329260000.0
5,NW,GR,99329260000.0
6,GL,GR,99329260000.0
7,ZG,GR,99329260000.0
8,SO,GR,99329260000.0
9,BS,GR,99329260000.0


In [152]:
map = folium.Map(location=[46.82244,8.22410], zoom_start=8)
map.choropleth(data=r_lg_funding,
               columns=['Canton', 'Röstigraben Grants'], 
               key_on='feature.id',
               geo_path=CH_GEO_JSON, 
               topojson='objects.cantons', 
               fill_color='YlOrRd'
               )
map.save('grants_by_Rostigraben.html')



In [153]:
map