# HOMEWORK 3 - INTERACTIVE VIZ

**Build a Choropleth map which shows intuitively (i.e., use colors wisely) how much grant money goes to each Swiss canton. To do so, you will need to use the provided TopoJSON file, combined with the Choropleth map example you can find in the Folium README file.**

**BONUS: using the map you have just built, and the geographical information contained in it, could you give a rough estimate of the difference in research funding between the areas divided by the Röstigraben?**

## 1. Get the canton to which each grant was assigned

Usual imports first.
- `geopy` is required and can be installed with `pip install geopy`.

In [None]:
import pandas as pd
import numpy as np
import json
import os
from geopy.geocoders import Nominatim
from geopy.geocoders import GeoNames

We load the grants and select a few useful attributes:
- The university and the institution, which will help us locate the canton to which each grant was assigned.
- The approved amount, to know how much each canton received.

In [None]:
df_grants = pd.read_csv('P3_GrantExport.csv',sep=';')
# Keep only the columns of interest, selected by position
df_grants = df_grants.iloc[:, [6,7,11,12,13]]
df_grants.head()

### 1. Extracting the canton from the university field
There are some issues in the data that need to be cleaned up. First of all, some values need to be replaced with NaN, as they do not convey any information about the location of the grant. We considered the following:
- 'Nicht zuteilbar - NA' -> Unavailable data
- 'NPO (Biblioth., Museen, Verwalt.) - NPO' -> No information
- 'Weitere Institute - FINST' -> Other institutes
- 'Firmen/Privatwirtschaft - FP' -> Companies / private sector

Some grants also have no university field at all. Their location will be determined in the second part of the extraction, where we turn to the institution field of the DataFrame.

In [None]:
df_grants = df_grants.replace('Nicht zuteilbar - NA',np.nan)
df_grants = df_grants.replace('NPO (Biblioth., Museen, Verwalt.) - NPO',np.nan)
df_grants = df_grants.replace('Weitere Institute - FINST',np.nan)
df_grants = df_grants.replace('Firmen/Privatwirtschaft - FP',np.nan)

# Distinct universities, without the missing values
data_uni = df_grants['University'].dropna().unique()

data_uni[1:10]
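
The four replacements above can also be written as a single call, since `DataFrame.replace` accepts a list of values to replace. A minimal sketch on a toy frame: the `toy` DataFrame and its rows are made up for illustration, and only the placeholder strings come from the list above.

```python
import numpy as np
import pandas as pd

# Placeholder values that carry no location information (from the list above).
PLACEHOLDERS = [
    'Nicht zuteilbar - NA',
    'NPO (Biblioth., Museen, Verwalt.) - NPO',
    'Weitere Institute - FINST',
    'Firmen/Privatwirtschaft - FP',
]

# Toy data, for illustration only.
toy = pd.DataFrame({'University': ['Universität Bern - BE',
                                   'Nicht zuteilbar - NA',
                                   'Firmen/Privatwirtschaft - FP']})

# Replace every placeholder with NaN in one pass.
toy = toy.replace(PLACEHOLDERS, np.nan)

# Distinct universities, ignoring missing values.
toy_unis = toy['University'].dropna().unique()
```

Using `dropna()` before `unique()` does not rely on NaN being the first distinct value of the column.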

To reduce the number of web queries needed to locate our universities, we define two dictionaries that we will use extensively. The first one, `CANTON_DICT`, associates each canton with its tag, which we will use on the map later on. We can also use it for matching: if a key of this dict appears in the string describing the university, we match the university to that canton. This is very practical, as many universities are located in the capital of the canton, which often has the same name as the canton itself.

The second dict, `CAPITAL_DICT`, associates the capital of a canton with its canton, for the capitals whose name differs from the canton's. We use it the same way as `CANTON_DICT`: if the name of the capital appears in the university string, we match it to the canton it belongs to.

In [None]:
CANTON_DICT = {'Zürich':'ZH','Bern':'BE','Luzern':'LU','Uri':'UR','Schwyz':'SZ','Obwalden':'OW','Nidwalden':'NW',
               'Glarus':'GL','Zug':'ZG','Fribourg':'FR','Solothurn':'SO','Basel-Stadt':'BS','Basel-Landschaft':'BL',
               'Schaffhausen':'SH','Appenzell Ausserrhoden':'AR','Appenzell Innerrhoden':'AI','Sankt Gallen':'SG',
               'Graubünden':'GR','Aargau':'AG','Thurgau':'TG','Ticino':'TI', 'Vaud':'VD','Valais':'VS','Neuchâtel':'NE',
               'Genève':'GE','Jura':'JU'}
CAPITAL_DICT = {'Basel':'Basel-Stadt' ,'Lausanne':'Vaud', 'Sion':'Valais','Altdorf':'Uri','Sarnen':'Obwalden','Stans':'Nidwalden',
                'Liestal':'Basel-Landschaft','Herisau':'Appenzell Ausserrhoden',
                'Chur':'Graubünden','Aarau':'Aargau','Frauenfeld':'Thurgau','Bellinzona':'Ticino','Delémont':'Jura'}
#'Appenzell':'Appenzell Innerrhoden' excluded because the name of the city is partially in the name of two cantons
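
The lookup that these two dictionaries enable is a plain substring test. A minimal sketch with shortened copies of the dictionaries; the helper name `match_locally` is ours and is not part of the notebook.

```python
# Shortened copies of the dictionaries, for illustration only.
CANTONS = {'Zürich': 'ZH', 'Bern': 'BE', 'Vaud': 'VD'}
CAPITALS = {'Lausanne': 'Vaud'}

def match_locally(text):
    """Return the canton name found in `text`, or None if there is no local match."""
    # First try the canton names themselves.
    for canton in CANTONS:
        if canton in text:
            return canton
    # Then try the capitals whose name differs from the canton's.
    for city, canton in CAPITALS.items():
        if city in text:
            return canton
    return None

print(match_locally('Universität Zürich'))      # canton name appears directly
print(match_locally('Université de Lausanne'))  # matched through the capital
print(match_locally('Stanford University'))     # no local match
```

Only strings that fail both tests need to be sent to a geocoding service.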

Let us now start with the canton extraction from the university field. We will use two functions for that.
- `extract_canton`, which will also be reused later on, tries to find the canton that a given data_string (a university or an institution) belongs to. It does so in three steps:
    1. It checks whether a canton name appears in the data_string.
    2. It checks whether a capital whose name differs from its canton's appears in the data_string.
    3. It queries the Nominatim web service to try to locate the string. If the response is `None`, or if the result is not in Switzerland (its last field should be `Svizra`), we return NaN.

In [None]:
def extract_canton(data_string):
    """
        Finds the canton associated with a string, first locally and then on the web. We use Nominatim at the moment.
        In the Nominatim output string (if it is not None), the canton is the 6th field from the end.
        The canton is written in all the languages spoken in it, so we split it and keep the name that comes first.
        @param data_string : the data from which we want to determine the canton
        @return canton : the name of the canton associated with the input data_string, or NaN if none was found
    """
    text = str(data_string)
    # 1. Look for a canton name in the string
    for canton in CANTON_DICT:
        if canton in text:
            return canton
    # 2. Look for a capital whose name differs from its canton's
    for city, canton in CAPITAL_DICT.items():
        if city in text:
            return canton
    # 3. Fall back to the Nominatim web service
    geolocator = Nominatim(country_bias='Switzerland')#,username='test_056')
    location = geolocator.geocode(data_string)
    if location is None:
        return np.nan
    split_loc = str(location).split(', ')
    # Keep only Swiss results that are long enough to contain a canton field
    if split_loc[-1] == 'Svizra' and len(split_loc) >= 6:
        return split_loc[-6].split(' - ')[0]
    return np.nan
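
The parsing of the geocoder answer relies only on string handling, so it can be tried offline. A sketch, assuming the address layout described in the docstring (canton in the sixth field from the end, `Svizra` as the last field). The helper `canton_from_address` and the sample address are hypothetical and are not real Nominatim output.

```python
def canton_from_address(address, country='Svizra'):
    """Return the first-listed canton name from a comma-separated address, or None."""
    if address is None:
        return None
    fields = str(address).split(', ')
    # Reject addresses outside Switzerland or too short to hold a canton field.
    if fields[-1] != country or len(fields) < 6:
        return None
    # Multilingual cantons are written as 'Name1 - Name2'; keep the first name.
    return fields[-6].split(' - ')[0]

# Made-up address with placeholder fields, for illustration only.
sample = 'Institute X, Street Y, Bern - Berne, 3000, Schweiz, Suisse, Svizzera, Svizra'

print(canton_from_address(sample))
print(canton_from_address('Somewhere, Paris, France'))
print(canton_from_address(None))
```

Checking the number of fields before indexing avoids picking up an unrelated field when the address is shorter than expected.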

- The second function, `extract_canton_from_uni`, simply iterates over all the distinct universities of our DataFrame and calls `extract_canton`. To get a higher chance of success, we split the string at the `-` that separates the name from the acronym; if the search fails with the full name, we pass the acronym to `extract_canton` as well. The full processing is done only once, and we store the result in a JSON file to load it again later on.

In [None]:
def extract_canton_from_uni(data_uni):
    """
        Determines which canton each university belongs to and stores the result in a json file (folder data).
        @param data_uni :           an array of strings describing universities or other institutions.
        @return university_dict :   a dict which associates each university with its canton if it was found,
                                    or with NaN if nothing was found.
    """
    if(os.path.isfile('data/uni_canton_dict.json')):
        print('Loading the data from json file')
        with open('data/uni_canton_dict.json') as f:
            university_dict = json.load(f)
    else:
        print('Fetching the locations from the web')
        university_dict = dict()
        for uni in data_uni:
            # Splits the university string at the '-', which usually corresponds to the separation between the full name
            # and the acronym of the institution.
            uni_name_split = uni.split('-')
            extract_value = extract_canton(uni_name_split[0])
            if pd.isnull(extract_value) and (len(uni_name_split) > 1):
                # If the canton was not found in the name, we try to extract it from the acronym, if there is one
                extract_value = extract_canton(uni_name_split[1])
            university_dict[uni] = extract_value
        with open('data/uni_canton_dict.json', 'w') as f:
            json.dump(university_dict, f)
    return university_dict
university_canton = extract_canton_from_uni(data_uni)
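
The load-or-compute logic is independent of geocoding and could be factored out. A minimal sketch: `cached_json` and `fake_lookup` are hypothetical names, and a temporary directory stands in for the `data` folder.

```python
import json
import os
import tempfile

def cached_json(path, compute):
    """Load `path` if it exists; otherwise call `compute()`, save its result and return it."""
    if os.path.isfile(path):
        with open(path) as f:
            return json.load(f)
    result = compute()
    # Create the target folder if needed so that the dump cannot fail on a missing directory.
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'w') as f:
        json.dump(result, f)
    return result

calls = []

def fake_lookup():
    # Stand-in for the slow web queries; records every invocation.
    calls.append(1)
    return {'Universität Bern - BE': 'Bern'}

cache_path = os.path.join(tempfile.mkdtemp(), 'data', 'uni_canton_dict.json')
first = cached_json(cache_path, fake_lookup)   # computes and writes the file
second = cached_json(cache_path, fake_lookup)  # reads the file, no new computation
```

The second call never reaches `fake_lookup`, which is the behaviour we want for the expensive web queries.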

Let us now count the universities that were located and those that were not.

In [None]:
print(len([place for place, value in university_canton.items() if not pd.isnull(value)]))
print(len([place for place, value in university_canton.items() if pd.isnull(value)]))

Sadly, only half of the universities were matched this way. Note that we ran several tests and found that GeoNames gives much poorer results than Nominatim (52 vs. 37 not found). Prefiltering (trying to find the university in `CANTON_DICT` and `CAPITAL_DICT` first) yields 37 not found against 40 without it, and also avoids web queries.

Let us now add a column to our grant DataFrame, filled with the tag of the canton found for each grant, to see how many of the projects have been located from their university name.

In [None]:
df_grants['Canton'] = np.nan
match_canton_uni = lambda uni: CANTON_DICT[university_canton[uni]] if (not pd.isnull(university_canton[uni]))  else university_canton[uni]
df_grants.Canton = df_grants.University.loc[[uni in university_canton for uni  in df_grants.University]].apply(match_canton_uni)
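
The same assignment can be written as two chained `Series.map` calls, which return NaN for any value that is missing from the dictionary. A sketch on toy data: the frame and the shortened dictionaries below are illustrative and are not the notebook's real data.

```python
import numpy as np
import pandas as pd

canton_tags = {'Bern': 'BE', 'Vaud': 'VD'}         # canton name -> tag
uni_to_canton = {'Universität Bern - BE': 'Bern',  # university -> canton name
                 'EPF Lausanne - EPFL': 'Vaud',
                 'Unknown school': np.nan}

toy = pd.DataFrame({'University': ['Universität Bern - BE', 'EPF Lausanne - EPFL',
                                   'Unknown school', np.nan]})

# First map universities to canton names, then canton names to tags.
# Values that are absent from a dictionary (or already NaN) become NaN.
toy['Canton'] = toy['University'].map(uni_to_canton).map(canton_tags)
print(toy)
```

This form needs no explicit membership test on the dictionary keys.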

In [None]:
df_grants.head(10)

In [None]:
# Fraction of grants for which no canton has been found yet
df_grants.Canton.isnull().mean()

We see that a canton was found for only 30% of the grants.

### 2. Extracting the canton from the institution field
Let us now look at the potential problems we will encounter by displaying all the grants that have no associated university.

In [None]:
df_grants.Institution.loc[df_grants.University.isnull()].value_counts()

We see that those institutions are mostly foreign universities or research centers, which do not provide useful information for our exercise. However, there are some Swiss universities among them, so we will need to iterate through the list and sort out which institutions to keep.

In [None]:
df_institution = df_grants['Institution'].loc[df_grants['Canton'].isnull()]
df_institution = df_institution.unique()
print(df_institution.shape)

INSTITUTION_EXCLUDED = ['USA','GB','AUS','CDN','Stanford University','University of Cambridge','Paris']
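
Checking an institution against this exclusion list is again a substring test, which `any` expresses compactly. A minimal sketch: `is_excluded` is a hypothetical helper name and the sample strings are made up.

```python
EXCLUDED_TOKENS = ['USA', 'GB', 'AUS', 'CDN', 'Stanford University',
                   'University of Cambridge', 'Paris']

def is_excluded(institution):
    """True if any excluded token occurs in the institution string."""
    text = str(institution)
    return any(token in text for token in EXCLUDED_TOKENS)

print(is_excluded('Harvard Medical School Boston USA'))
print(is_excluded('Institut de microtechnique Neuchâtel CH'))
```

Because the test is a case-sensitive substring match, short tokens such as `GB` could also match inside longer upper-case words, so the list is best kept to tokens that are unlikely to appear in Swiss institution names.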

In [None]:
def extract_canton_from_institution(data_institution, complete = True):
    """
        Determines which canton each institution belongs to and stores the result in a json file (folder data).
        As this data is very large, the queries to the web service will most likely time out. We dump the dictionary
        to a json file at each iteration and can restart from it with the parameter complete.
        @param data_institution :   an array of strings describing institutions.
        @param complete         :   tells us if the process has been completed (everything saved to json file)
        @return institution_dict :  a dict which associates each institution with its canton if it was found,
                                    or with NaN if nothing was found.
    """
    institution_dict = dict()
    file_exists = os.path.isfile('data/instit_canton_dict.json')
    if file_exists:
        print('Loading the data from json file')
        with open('data/instit_canton_dict.json') as f:
            institution_dict = json.load(f)
    if not (file_exists and complete):
        print('Fetching the locations from the web')
        for index, instit in enumerate(data_institution):
            # Skip the web query for institutions that match the exclusion list
            if any(excluded in str(instit) for excluded in INSTITUTION_EXCLUDED):
                institution_dict[instit] = np.nan
                print(index, '- EXCLUDED <===== ', instit)
                continue
            value = extract_canton(instit)
            institution_dict[instit] = value
            if not pd.isnull(value):
                print(index, '-', value, '<=====', instit)
            # Save the progress after every query so that we can restart after a time-out
            with open('data/instit_canton_dict.json', 'w') as f:
                json.dump(institution_dict, f)
    return institution_dict
institution_canton = extract_canton_from_institution(df_institution[1995:],complete=False)
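
Instead of slicing the input by hand to resume after an interruption, the loop could skip every institution that is already present in the saved dictionary. A sketch under that assumption: `resume_lookup` and `toy_lookup` are hypothetical, and a temporary file stands in for `data/instit_canton_dict.json`.

```python
import json
import os
import tempfile

def resume_lookup(items, lookup, path):
    """Look up each item not yet stored in `path`, saving progress after every item."""
    done = {}
    if os.path.isfile(path):
        with open(path) as f:
            done = json.load(f)
    for item in items:
        if item in done:
            continue  # already processed in an earlier run
        done[item] = lookup(item)
        with open(path, 'w') as f:
            json.dump(done, f)
    return done

queried = []

def toy_lookup(name):
    # Stand-in for extract_canton; records which names were actually queried.
    queried.append(name)
    return 'Vaud' if 'Lausanne' in name else None

path = os.path.join(tempfile.mkdtemp(), 'instit_canton_dict.json')
resume_lookup(['A Lausanne', 'B'], toy_lookup, path)                # first, interrupted run
result = resume_lookup(['A Lausanne', 'B', 'C'], toy_lookup, path)  # resumed run
```

On the second call only `'C'` is queried, so the start index no longer has to be tracked manually.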

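The institutions were geocoded in batches, alternating between the two services:
- Elements 1 - 1165: Nominatim (we should have continued with it, but we got blocked)
- Elements 1166 - 1512: GeoNames
- Elements 1513 - 1994: Nominatim
- Elements 1995 - 2815: GeoNames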

Get the length of the stored dictionary so that we know where to resume sampling from (first number).

In [None]:
with open('data/instit_canton_dict.json') as f:
    institution_dict = json.load(f)
#print(len(institution_dict))
#for index,elem in enumerate(df_institution):
#    if(elem == 'UNI: Université de Neuchâtel Institut de Mic rotechnique  Neuchâtel CH'):
#        print(index)