# Interactive Visualization

In [2]:
# Heads up: in order to use folium you need to run
#   pip install folium
# in your terminal.
import folium

# Normal stack of pandas, numpy, matplotlib and seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

# For handling the web requests
import requests
from collections import defaultdict
from pprint import pprint

# Statistical test library
import scipy.stats as stats
from helpers import *

%matplotlib inline

In [8]:
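'''
By using np.save and np.load on the dicts, we can avoid doing the requests more than once :)
This does not work yet because of a UTF-8 issue that still needs to be figured out!
'''
# NOTE: saving empty dicts here resets both caches every time the cell runs.
np.save('canton_dict.npy', {})
np.save('has_looked_up.npy', {})

# allow_pickle=True is needed to load a pickled dict with recent NumPy versions
has_looked_up_dict = np.load('has_looked_up.npy', allow_pickle=True).item()
canton_dict = np.load('canton_dict.npy', allow_pickle=True).item()

def has_looked_up(name):
    # True if we have already tried to look this name up
    return name in has_looked_up_dict

def has_canton(name):
    # True if we already know the canton for this name
    return name in canton_dict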
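
One possible way around the UTF-8 problem mentioned in the cell above is to keep the lookup cache as JSON rather than as a pickled NumPy object. This is only a sketch under that assumption, not what the notebook currently does; the file name `canton_cache.json` and the helpers `load_cache` and `save_cache` are hypothetical.

```python
import json
import os
import tempfile


def load_cache(path):
    """Return the dict stored at `path`, or an empty dict if the file is missing."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def save_cache(path, cache):
    """Write `cache` to `path` as UTF-8 JSON, keeping non-ASCII names readable."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(cache, f, ensure_ascii=False, indent=2)


# Hypothetical file name; a temporary directory keeps the example self-contained.
cache_path = os.path.join(tempfile.mkdtemp(), 'canton_cache.json')

cache = load_cache(cache_path)          # first run: nothing cached yet
cache['Université de Genève'] = 'GE'    # pretend this came from a lookup
save_cache(cache_path, cache)

reloaded = load_cache(cache_path)       # later runs read the file instead of re-requesting
```

Because the cache is only loaded, never reset, at the top of the notebook, results from earlier sessions would survive a kernel restart.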

Folium: https://github.com/python-visualization/folium
 
 Documentation:
 1. https://folium.readthedocs.io/en/latest/
 2. https://media.readthedocs.org/pdf/folium/latest/folium.pdf

In [3]:
#Folium test
map_osm = folium.Map(location=[46.8076878,7.1004592], zoom_start=10)
map_osm

# Data wrangling

Data source (P3_GrantExport.csv)
and description of the columns: http://p3.snf.ch/Pages/DataAndDocumentation.aspx

In [4]:
# Load the data we need, restricted to the columns listed below.
columns = ['Institution', 'University','Approved Amount']
p3 = pd.read_csv("P3_GrantExport.csv", delimiter=';', usecols=columns, na_values=['data not included in P3', 'nan'])

To be able to work with the data, we keep only the rows that carry enough information.

We start by keeping the projects for which we know at least the university or the institution related to the project.

In [5]:
missing__info = p3[p3[['Institution', 'University']].isnull().all(axis=1)].index
has_info = p3.drop(missing__info, axis=0)

print('Dropped from original:', get_dropped_perc(p3, has_info))


Dropped from original: 2.3261267176288514


As we can see, this leaves out 2.33% of the original data.
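
`get_dropped_perc` comes from the local `helpers` module, which is not shown in this notebook. A minimal sketch of what it presumably computes, assuming it simply compares row counts (the name `dropped_percentage` and the body are guesses, not the actual helper):

```python
def dropped_percentage(original, filtered):
    """Percentage of rows in `original` that are no longer present in `filtered`.

    Assumed behaviour; works on anything that supports len(), including DataFrames.
    """
    return (1 - len(filtered) / len(original)) * 100


# Toy example: 3 of 4 rows are kept, so a quarter of the data was dropped.
dropped = dropped_percentage([1, 2, 3, 4], [1, 2, 3])
```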

In [6]:
# Keep only the rows where the approved amount is known
grants_data = has_info[has_info['Approved Amount'].notnull()]
grants_data.head()

| | Institution | University | Approved Amount |
|---|---|---|---|
| 0 | | Nicht zuteilbar - NA | 11619.0 |
| 1 | Faculté de Psychologie et des Sciences de l'Ed... | Université de Genève - GE | 41022.0 |
| 2 | Kommission für das Corpus philosophorum medii ... | NPO (Biblioth., Museen, Verwalt.) - NPO | 79732.0 |
| 3 | Abt. Handschriften und Alte Drucke Bibliothek ... | Universität Basel - BS | 52627.0 |
| 4 | Schweiz. Thesauruskommission | NPO (Biblioth., Museen, Verwalt.) - NPO | 120042.0 |


In [7]:
print('Dropped from original:', get_dropped_perc(p3, grants_data))
print('Dropped from last step:', get_dropped_perc(has_info, grants_data))

Dropped from original: 18.340133502165113
Dropped from last step: 16.395384196795824


So, for about 16.4% of the rows that remained after the previous step, no approved amount is recorded.

# Get the geo info

The documentation of the dataset describes the two fields as follows:
> ### Institution
> According to the information submitted by the responsible applicant, this is the research institution where the project will largely be carried out. Typically, this is the institution or specific lab where the responsible applicant works. The institution is therefore linked to the application. This field is not structured, but its content is consolidated internally. There is no link in the database to the field University.

> ### University
> This is the institution where the project will largely be carried out according to the application. Pick list. This field is only filled if the research is carried out at a Swiss institution, otherwise the field remains blank. In the case of mobility fellowships, it is generally left empty.


So, we know that for the rows where the university is given, the research is carried out at a Swiss institution. However, we do not know much about the cases where we only have the institution information. We decided to look into these cases further.

In [8]:
has_uni = has_info[has_info['University'].isnull() == False]
has_uni.head()

| | Institution | University | Approved Amount |
|---|---|---|---|
| 0 | | Nicht zuteilbar - NA | 11619.0 |
| 1 | Faculté de Psychologie et des Sciences de l'Ed... | Université de Genève - GE | 41022.0 |
| 2 | Kommission für das Corpus philosophorum medii ... | NPO (Biblioth., Museen, Verwalt.) - NPO | 79732.0 |
| 3 | Abt. Handschriften und Alte Drucke Bibliothek ... | Universität Basel - BS | 52627.0 |
| 4 | Schweiz. Thesauruskommission | NPO (Biblioth., Museen, Verwalt.) - NPO | 120042.0 |


In [9]:
print('Dropped from original:', get_dropped_perc(p3, has_uni))
print('Dropped from last step:', get_dropped_perc(grants_data, has_uni))

Dropped from original: 20.292641748346853
Dropped from last step: 2.3910255183107743


In [10]:
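# Split 'Name - CODE' at the first ' - ' into the university name and its code.
# A raw string avoids invalid escape sequences in the pattern.
has_code = has_uni['University'].str.extract(r'(?P<University>.*?) - (?P<Code>.*)', expand=True)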
has_code.head()

| | University | Code |
|---|---|---|
| 0 | Nicht zuteilbar | |
| 1 | Université de Genève | GE |
| 2 | NPO (Biblioth., Museen, Verwalt.) | NPO |
| 3 | Universität Basel | BS |
| 4 | NPO (Biblioth., Museen, Verwalt.) | NPO |


In [11]:
# For these rows, we already have the canton code
has_canton_code = has_code['Code'][has_code['Code'].apply(is_canton_code)].rename('Canton code')

has_canton_code.index

Int64Index([    1,     3,     5,     6,     7,     9,    10,    12,    13,
               14,
            ...
            63918, 63919, 63921, 63922, 63927, 63928, 63932, 63934, 63955,
            63958],
           dtype='int64', length=27702)
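
`is_canton_code` is also imported from `helpers` and is not shown here. A plausible version, assuming it only checks membership in the 26 official two-letter canton abbreviations, would explain why a code such as `NPO` is not treated as a canton. The name `is_canton_code_sketch` is hypothetical and the real helper may differ:

```python
# The 26 Swiss cantons by their official two-letter abbreviations.
CANTON_CODES = frozenset([
    'AG', 'AI', 'AR', 'BE', 'BL', 'BS', 'FR', 'GE', 'GL', 'GR', 'JU', 'LU', 'NE',
    'NW', 'OW', 'SG', 'SH', 'SO', 'SZ', 'TG', 'TI', 'UR', 'VD', 'VS', 'ZG', 'ZH',
])


def is_canton_code_sketch(code):
    """True if `code` is a canton abbreviation; False otherwise, including for NaN."""
    return isinstance(code, str) and code.strip().upper() in CANTON_CODES


# Codes as they appear after the ' - ' in the University column.
checks = {code: is_canton_code_sketch(code) for code in ['GE', 'BS', 'NPO']}
```

Returning `False` for non-string input means missing codes fall into the group that still needs a lookup instead of raising an error inside `apply`.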

In [12]:
# For the rest, we will try to get the canton code by requesting the GeoNames web service
has_uni_code = has_code['Code'][has_code['Code'].apply(is_canton_code) == False]

has_uni_no_cantoncode = has_uni.loc[has_uni_code.index]
has_uni_no_cantoncode.head()

| | Institution | University | Approved Amount |
|---|---|---|---|
| 0 | | Nicht zuteilbar - NA | 11619.0 |
| 2 | Kommission für das Corpus philosophorum medii ... | NPO (Biblioth., Museen, Verwalt.) - NPO | 79732.0 |
| 4 | Schweiz. Thesauruskommission | NPO (Biblioth., Museen, Verwalt.) - NPO | 120042.0 |
| 8 | | Université de Lausanne - LA | 25814.0 |
| 11 | Schweizerische Rechtsquellen c/o Universität Z... | NPO (Biblioth., Museen, Verwalt.) - NPO | 862200.0 |


In [13]:
for place in has_uni_no_cantoncode['University'].unique():
    if place:
        p = geo_lookup(place)
        # Only report places for which the lookup returned something
        if p:
            print(place)
            pprint(p)

In [None]:
def map_institution_canton(df):
    for place in df['Institution'].unique():
        # Skip missing values and names that start with 'Uni'
        if pd.isnull(place) or place.startswith('Uni'):
            continue
        if place in canton_dict:
            print(place, 'already in the map')
            continue

        p = geo_lookup(place)
        if p:
            print('found', p, 'in the API')
            canton_dict[place] = p
        else:
            # Remember failed lookups so that we do not repeat the request
            has_looked_up_dict[place] = True
            print(place, p)


map_institution_canton(has_info)

nan allready in the map
Faculté de Psychologie et des Sciences de l'Education Université de Genève None
Kommission für das Corpus philosophorum medii aevi der SGG None
Abt. Handschriften und Alte Drucke Bibliothek der Universität Basel None
Schweiz. Thesauruskommission None
Séminaire de politique économique, d'économie internationale et d'économie régionale None
Institut für ökumenische Studien Université de Fribourg None
Ostasiatisches Seminar Universität Zürich None
Laboratoire de Didactique et Epistémologie des Sciences Université de Genève None
Klinische Psychologie und Psychotherapie Institut für Psychologie Universität Bern None
Schweizerische Rechtsquellen c/o Universität Zürich / RWI None
Département de Sociologie Faculté des Sciences de la Société Université de Genève None
Institut für Volkswirtschaft WWZ / FGS Universität Basel None
Soziologisches Institut Universität Zürich None
Forschungszentrum FAL None
Forschungsbereich Politisches Verhalten und Oeffentliche Meinung Inst.

In [None]:
pprint(canton_dict)

In [None]:
grants_with_has_code = university_grants.join(has_canton_code)
grants_with_has_code.head()

In [None]:
university_grants.head()

In [None]:
extracts = university_grants['University'].str.extract('(?P<University>.*-)(?P<Code>.*)', expand=True)
has_code = pd.DataFrame( extracts['Code'] )
has_code.Code.apply(is_canton_code)
grouped_has_code = has_code.groupby('Code')

In [None]:
only_institution_data = grants_data[grants_data[['University']].isnull().all(axis=1)].index

## TODO:

1. clean up this mess (:

1. make one function for mapping a row to a canton.
   Suggested strategy:

   (we can only make 200 requests per hour to the API, so we should try to keep the number down)

   1. Collect the unique institutions and the unique universities
   2. Look up each institution name and place the result in a hashmap: institution -> canton
   3. Parse the university. If no canton code is given, look it up and place the result in the hashmap
   4. In the function, query our hashmap for each column. If the results conflict, return (?)

1. apply the function to all rows in the data, add canton as a column
1. make a df of Canton, Approved Amount
1. Groupby canton, sum approved_amount, add a count column
1. Plot the results on the map etc
1. Bonus
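
The mapping and aggregation steps in the list above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the toy data, the pre-filled `canton_lookup` table, the reduced `KNOWN_CODES` set and the `row_to_canton` function are hypothetical. The open conflict question is sidestepped here by simply preferring the code embedded in `University`.

```python
import pandas as pd

# Toy data in the shape of the P3 export; the values are made up.
df = pd.DataFrame({
    'Institution': ['Lab A', None, 'Lab B', 'Lab A'],
    'University': ['Université de Genève - GE', 'Universität Basel - BS', None, None],
    'Approved Amount': [100.0, 200.0, 50.0, 25.0],
})

# Hypothetical hashmap name -> canton, filled once per unique name so that
# the rate-limited API is queried as rarely as possible.
canton_lookup = {'Lab A': 'GE', 'Lab B': 'VD'}

KNOWN_CODES = {'GE', 'BS', 'VD'}  # stand-in for the full list of canton codes


def row_to_canton(row):
    """Prefer the code embedded in University, else fall back to the institution lookup."""
    uni = row['University']
    if isinstance(uni, str) and ' - ' in uni:
        code = uni.rsplit(' - ', 1)[1].strip()
        if code in KNOWN_CODES:
            return code
    return canton_lookup.get(row['Institution'])


df['Canton'] = df.apply(row_to_canton, axis=1)

# Sum and count of grants per canton, ready to be joined to the map data.
per_canton = (
    df.dropna(subset=['Canton'])
      .groupby('Canton')['Approved Amount']
      .agg(total='sum', count='count')
      .reset_index()
)
```

Looking names up once per unique value, rather than once per row, is what keeps the number of requests within the hourly limit.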
