# Interactive visualization homework overview
In this homework, we build an interactive visualization of the grants awarded by the SNSF in each canton. The data is the P3 export available on the [SNSF website](http://p3.snf.ch/), in the file P3_GrantExport.csv.
To do so:
* we first load the data with pandas;
* we keep only the columns of interest (university name and amount of money received for each project);
* we keep only the rows of interest (those corresponding to Swiss universities, that is, any row with a non-NaN "University" entry);
* we then map the universities to their corresponding cantons using the [Geonames Full Text Search API in JSON](http://www.geonames.org/export/web-services.html), together with some manual tuning;
* we finally visualize the results with folium as a choropleth map of Switzerland.
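
The steps listed above can be sketched on a toy table. This is only an illustration under stated assumptions: the rows, the amounts, and the `university_to_canton` dictionary are made up here, whereas the homework builds the real mapping with GeoNames.

```python
import pandas as pd

# Hypothetical miniature of the grant table (values made up for illustration).
toy = pd.DataFrame({
    'University': ['Université de Genève - GE', 'Universität Basel - BS',
                   None, 'Université de Genève - GE'],
    'Amount': [100.0, 250.0, 75.0, 50.0],
})

# Hypothetical university -> canton lookup; the homework builds this with GeoNames.
university_to_canton = {
    'Université de Genève - GE': 'GE',
    'Universität Basel - BS': 'BS',
}

# Keep rows with a university, map them to cantons, and sum the grants per canton.
toy = toy.dropna(subset=['University'])
toy = toy.assign(Canton=toy['University'].map(university_to_canton))
amount_per_canton = toy.groupby('Canton')['Amount'].sum()
print(amount_per_canton)
```

The resulting per-canton totals are the kind of table a choropleth map colours by.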

## Import libraries and load data

In [1]:
import pandas as pd
import folium
import requests
import math

Read the CSV file, which uses ; as the separator, create a dataframe from it, and use the project number as its index.

In [2]:
data = pd.read_csv('P3_GrantExport.csv',sep=';')
data = data.rename(columns={'\ufeff"Project Number"':'Project','Approved Amount':'Amount'})
data = data.set_index('Project')
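
The `'\ufeff'` prefix in the rename above is a UTF-8 byte order mark that ended up in the first column name. A possible alternative, sketched here on a made-up two-row stand-in for the real file, is to let the `utf-8-sig` codec drop the mark while reading, so the column can be renamed by its plain name.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for P3_GrantExport.csv, starting with a UTF-8 byte order mark.
raw = '\ufeff"Project Number";"Approved Amount"\n1;11619.00\n4;41022.00\n'.encode('utf-8')

# 'utf-8-sig' decodes UTF-8 and drops a leading byte order mark if present.
sample = pd.read_csv(io.BytesIO(raw), sep=';', encoding='utf-8-sig')
sample = sample.rename(columns={'Project Number': 'Project', 'Approved Amount': 'Amount'})
sample = sample.set_index('Project')
print(sample)
```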

We check whether the project number is actually a unique identifier, as stated on the SNSF P3 website.

In [3]:
data.index.is_unique

True
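
Had the check returned False, the repeated project numbers could be listed before deciding how to handle them. A minimal sketch on a made-up index:

```python
import pandas as pd

# Hypothetical index in which project number 4 appears twice.
idx = pd.Index([1, 4, 5, 4], name='Project')

print(idx.is_unique)

# duplicated() flags every occurrence after the first one.
duplicated_labels = idx[idx.duplicated()].unique()
print(list(duplicated_labels))
```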

Since it is unique, it is a valid index for the dataframe. We now explore the dataframe columns, in order to understand which of them are interesting for our purposes:

In [4]:
data.columns

Index(['Project Title', 'Project Title English', 'Responsible Applicant',
       'Funding Instrument', 'Funding Instrument Hierarchy', 'Institution',
       'University', 'Discipline Number', 'Discipline Name',
       'Discipline Name Hierarchy', 'Start Date', 'End Date', 'Amount',
       'Keywords'],
      dtype='object')

## Choose data of interest
A complete description of the columns can be found at http://p3.snf.ch/Pages/DataAndDocumentation.aspx. The attributes which are interesting for us are:

* 'Institution': place where the project is carried out
* 'University': academic institution where the project is carried out
* 'Approved Amount': grant for the project (CHF), renamed 'Amount' above

We also keep the 'Start Date' and 'End Date' columns for possible further investigations which take time into account.
Let us filter the dataset now.

In [5]:
data = data[['Institution','University','Amount','Start Date','End Date']]
data.head()

| Project | Institution | University | Amount | Start Date | End Date |
|---|---|---|---|---|---|
| 1 | NaN | Nicht zuteilbar - NA | 11619.0 | 01.10.1975 | 30.09.1976 |
| 4 | Faculté de Psychologie et des Sciences de l'Ed... | Université de Genève - GE | 41022.0 | 01.10.1975 | 30.09.1976 |
| 5 | Kommission für das Corpus philosophorum medii ... | NPO (Biblioth., Museen, Verwalt.) - NPO | 79732.0 | 01.03.1976 | 28.02.1985 |
| 6 | Abt. Handschriften und Alte Drucke Bibliothek ... | Universität Basel - BS | 52627.0 | 01.10.1975 | 30.09.1976 |
| 7 | Schweiz. Thesauruskommission | NPO (Biblioth., Museen, Verwalt.) - NPO | 120042.0 | 01.01.1976 | 30.04.1978 |
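
The date columns kept above are strings in day.month.year form. If they are later used for time-based analysis, they could be parsed with an explicit format, as in this small sketch (the two dates are copied from the first row of the table; the variable names are illustrative).

```python
import pandas as pd

# Start and end date of the first project in the table above.
dates = pd.Series(['01.10.1975', '30.09.1976'])

# An explicit format avoids any day/month ambiguity.
parsed = pd.to_datetime(dates, format='%d.%m.%Y')

# Subtracting two timestamps gives a duration.
duration_days = (parsed[1] - parsed[0]).days
print(parsed.dt.year.tolist(), duration_days)
```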


# Preliminary data exploration
In this section we explore the essential features of the dataset to better understand what it looks like.
We know the structure of the dataset a priori. In particular, we have seen that some 'University' fields are NaN, while others contain a placeholder string such as "Nicht zuteilbar - NA", that is, 'not allocable'. Moreover, when the grant amount is not given, the field contains "data not included in P3". According to the P3 documentation, these entries come from projects carried out at foreign universities. Consequently, they are not relevant to our study, and it is reasonable to drop them from the dataset.

Let us study first the 'University' field.

In [7]:
nan_string1 = 'Nicht zuteilbar - NA'
nans1 = (data['University'] == nan_string1).value_counts()
nans2 = data['University'].isnull().value_counts()
print(str((nans1[True] + nans2[True])/nans2.sum())+' is the ratio of nan values out of '+str(nans2.sum())+' projects.')

0.243492941894 is the ratio of nan values out of 63969 projects.


Roughly **24%** of the entries in the 'University' field are missing or not allocable, that is, around a fourth of the projects are carried out at non-Swiss universities.
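
The same ratio can also be obtained from a single boolean mask, which avoids a KeyError from `value_counts()[True]` when no entry matches. A minimal sketch on a made-up column:

```python
import pandas as pd

# Hypothetical 'University' column with one NaN and one placeholder string.
university = pd.Series(['Universität Basel - BS', None,
                        'Nicht zuteilbar - NA', 'Université de Genève - GE'])

# True where the university is missing or marked as not allocable.
unknown = university.isnull() | (university == 'Nicht zuteilbar - NA')

# The mean of a boolean Series is the fraction of True values.
ratio = unknown.mean()
print(ratio)
```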

Now, let us study the 'Amount' field. We convert the grant amounts from string to floating point numbers, in order to treat them quantitatively.

In [8]:
# Remove missing data
nan_string = 'data not included in P3'
# Amount NaN
nans_string = data.Amount.isin([nan_string]).value_counts()[True]
data=data[~data.Amount.isin([nan_string])]

In [9]:
data.Amount = pd.to_numeric(data.Amount)
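
An alternative to filtering the placeholder first is to let `pd.to_numeric` turn unparsable entries into NaN with `errors='coerce'`. The sketch below uses a made-up column; note that this option also hides any other malformed value, so it is worth counting the resulting NaN values.

```python
import pandas as pd

# Hypothetical 'Amount' column as read from the CSV: numbers stored as strings plus the placeholder.
amount = pd.Series(['11619.00', 'data not included in P3', '41022.00'])

# errors='coerce' turns anything that cannot be parsed into NaN.
amount_numeric = pd.to_numeric(amount, errors='coerce')
print(amount_numeric.isnull().sum())
```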

Select the columns of interest and replace 'Nicht zuteilbar - NA' and 'data not included in P3' with NaN.

In [None]:
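import numpy as np  # provides np.nan, which is not imported above
# grants_df is assumed here to be the raw export, with its original column names.
grants_df = pd.read_csv('P3_GrantExport.csv', sep=';')
placeholders = ['Nicht zuteilbar - NA', 'data not included in P3']
grants_uni_df = grants_df[['Institution', 'University', 'Approved Amount']].replace(placeholders, np.nan)
grants_uni_df.head()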

Check how many null-entries there are.

In [None]:
null_uni = grants_uni_df[grants_uni_df['University'].isnull()].shape[0]
null_inst = grants_uni_df[grants_uni_df['Institution'].isnull()].shape[0]
null_amount = grants_uni_df[grants_uni_df['Approved Amount'].isnull()].shape[0]
print(null_inst)
print(null_uni)
print(null_amount)
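
The three counts can also be read off in one call, since `isnull().sum()` counts the missing values of every column at once. A small sketch on a made-up table with the same column names:

```python
import pandas as pd

# Hypothetical table with a few missing entries.
toy = pd.DataFrame({
    'Institution': ['Inst A', None, None],
    'University': ['Uni X', 'Uni Y', None],
    'Approved Amount': ['10.0', None, '30.0'],
})

# One count of missing values per column.
null_counts = toy.isnull().sum()
print(null_counts)
```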

Drop null entries.

In [None]:
grants_uni_CH_df = grants_uni_df.dropna()
grants_uni_CH_df.head()

In [None]:
grants_uni_CH_df.describe()

In [None]:
grants_uni_CH_df = grants_uni_CH_df.rename(columns={'Approved Amount':'Amount'})
grants_uni_CH_df.Amount = pd.to_numeric(grants_uni_CH_df.Amount)
grants_uni_CH_df.Amount.describe()

## Mapping from University to Canton

In [None]:
username = 'ochanon'
url='http://api.geonames.org/postalCodeSearchJSON?'
parameters={'username':username,'placename':'CH','maxRows':1,'operator':'OR'}
r=requests.get(url,params=parameters)
df=grants_uni_CH_df

In [None]:
correspondencies_dictionary = {}  # cache: query string -> canton code
cantons = []
not_found_list = []

def first_postal_code(placename):
    """Query GeoNames and return the first postal-code record for placename."""
    r = requests.get(url, params=dict(parameters, placename=placename))
    return r.json()['postalCodes'][0]

for institution, university in df[['Institution', 'University']].itertuples(index=False):
    no_institution = pd.isnull(institution)
    no_university = pd.isnull(university)

    # Query with "institution, university" when both are known,
    # otherwise with whichever of the two is available.
    if no_institution:
        if no_university:
            raise ValueError('Bad preprocessing - double nan')
        query_string = university
    elif no_university:
        query_string = institution
    else:
        query_string = institution + ", " + university

    # Look for the canton in this order:
    # 1- institution + university already in the dictionary
    # 2- university already in the dictionary
    # 3- query to geonames
    if query_string in correspondencies_dictionary:
        cantons.append(correspondencies_dictionary[query_string])
    elif university in correspondencies_dictionary:
        cantons.append(correspondencies_dictionary[university])
    else:
        try:
            canton = first_postal_code(query_string)

            if not no_university:
                # The first time a value is added, check whether the university
                # alone gives the same result; if so, store it under the shorter key.
                canton2 = first_postal_code(university)
                if canton2 == canton:
                    query_string = university
                else:
                    print(canton2, canton)

            # Skip results located outside Switzerland.
            if canton['countryCode'] != 'CH':
                continue
            cantons.append(canton['adminCode1'])
            correspondencies_dictionary[query_string] = canton['adminCode1']
        except (requests.RequestException, LookupError, ValueError):
            # No usable answer from geonames: keep track of the query for manual tuning.
            print(query_string)
            not_found_list.append(query_string)

df_final = pd.DataFrame({'Canton': cantons})
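
The loop above relies on a dictionary so that each distinct name is sent to GeoNames at most once. The pattern can be sketched in isolation as follows; `fake_lookup` is a hypothetical offline stand-in for the web request, and `canton_for` and `cache` are illustrative names that do not appear in the notebook.

```python
# Hypothetical stand-in for the GeoNames request, so the sketch runs offline.
calls = []

def fake_lookup(name):
    calls.append(name)  # record every simulated request
    known = {'Université de Genève - GE': 'GE', 'Universität Basel - BS': 'BS'}
    return known.get(name)

cache = {}

def canton_for(name, lookup=fake_lookup):
    """Return the canton for name, calling lookup only on a cache miss."""
    if name not in cache:
        cache[name] = lookup(name)
    return cache[name]

names = ['Université de Genève - GE', 'Universität Basel - BS',
         'Université de Genève - GE']
resolved = [canton_for(n) for n in names]
print(resolved, len(calls))
```

With tens of thousands of projects but far fewer distinct universities, this kind of caching keeps the number of web requests small.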

## Interactive visualization using Folium

In [None]:
# geo: path to the file with the canton boundaries (not defined in the cells above)
swiss_map = folium.Map(location=[46.8, 8], zoom_start=8)
swiss_map.choropleth(geo_path=geo, data=None,
             columns=['Canton', 'Amount'],
             key_on='feature.id',
             fill_color='YlGn', fill_opacity=0.7, line_opacity=0.2,
             legend_name='Amount of grants (CHF)')

In [None]:
swiss_map
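
The choropleth call above passes `data=None`, while its `columns=['Canton', 'Amount']` argument suggests a table with one total per canton. A minimal sketch of how such a table might be prepared with pandas is given below; the project rows and the `canton_ids` list are made up, and in practice the ids would have to match those in the boundary file.

```python
import pandas as pd

# Hypothetical per-project table after the canton mapping (values made up).
projects = pd.DataFrame({
    'Canton': ['GE', 'BS', 'GE', 'VD'],
    'Amount': [41022.0, 52627.0, 1000.0, 500.0],
})

# Hypothetical list of canton ids present in the map file; 'ZH' has no project here.
canton_ids = ['BS', 'GE', 'VD', 'ZH']

# Sum per canton, then make sure every canton of the map has a row (0 if no grant).
totals = (projects.groupby('Canton')['Amount'].sum()
          .reindex(canton_ids, fill_value=0.0)
          .rename_axis('Canton')
          .reset_index())
print(totals)
```

Filling the missing cantons with 0 is a design choice: it lets them be coloured as "no grants" instead of being left without a value.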