# Interactive Visualization

In [38]:
# Note: folium must be installed before this import works. Run
#   pip install folium
# in your terminal.
import folium

# Libraries we will probably use:

# Normal stack of pandas, numpy, matplotlib and seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

# For handling the web requests
import requests
from collections import defaultdict
from pprint import pprint

# Statistical test library
import scipy.stats as stats
from helpers import *

%matplotlib inline

Folium: https://github.com/python-visualization/folium

Documentation:
1. https://folium.readthedocs.io/en/latest/
2. https://media.readthedocs.org/pdf/folium/latest/folium.pdf

In [2]:
#Folium test
map_osm = folium.Map(location=[46.8076878,7.1004592], zoom_start=10)
map_osm

# Data wrangling

Data source (P3_GrantExport.csv)
and description of the columns: http://p3.snf.ch/Pages/DataAndDocumentation.aspx

In [3]:
# Load only the columns we need.
columns = ['Institution', 'University','Approved Amount']
p3 = pd.read_csv("P3_GrantExport.csv", delimiter=';', usecols=columns, na_values=['data not included in P3', 'nan'])

To work with the data, we keep only the rows that carry enough information.

We start by keeping the projects for which we know at least the University or the Institution.

In [42]:
# Rows where both Institution and University are missing
missing_info = p3[p3[['Institution', 'University']].isnull().all(axis=1)].index
has_info = p3.drop(missing_info, axis=0)

print('Dropped from original:', get_dropped_perc(p3, has_info))


Dropped from original: 2.3261267176288514


As we can see, this leaves out 2.33% of the original data.
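`get_dropped_perc` comes from the local `helpers` module, which is not shown in this notebook. A minimal sketch of what it presumably computes, assuming it only compares row counts (the name `dropped_perc_sketch` is hypothetical, not the helper itself):

```python
def dropped_perc_sketch(original, filtered):
    """Percentage of the rows in `original` that are missing from `filtered`.

    Assumption: only the lengths are compared, so any sized object works
    (a DataFrame, a list, a range, ...).
    """
    if len(original) == 0:
        return 0.0
    return (1 - len(filtered) / len(original)) * 100


# Toy example: 3 of 4 rows survive the filter.
print(dropped_perc_sketch(range(4), range(3)))
```

Because only lengths are compared, the result is meaningful only when `filtered` really is a subset of `original`.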

In [5]:
# Keep only the rows with a known approved amount
grants_data = has_info[has_info['Approved Amount'].notnull()]
grants_data.head()

| | Institution | University | Approved Amount |
|---|---|---|---|
| 0 | NaN | Nicht zuteilbar - NA | 11619.0 |
| 1 | Faculté de Psychologie et des Sciences de l'Ed... | Université de Genève - GE | 41022.0 |
| 2 | Kommission für das Corpus philosophorum medii ... | NPO (Biblioth., Museen, Verwalt.) - NPO | 79732.0 |
| 3 | Abt. Handschriften und Alte Drucke Bibliothek ... | Universität Basel - BS | 52627.0 |
| 4 | Schweiz. Thesauruskommission | NPO (Biblioth., Museen, Verwalt.) - NPO | 120042.0 |


In [6]:
print('Dropped from original:', get_dropped_perc(p3, grants_data))
print('Dropped from last step:', get_dropped_perc(has_info, grants_data))

Dropped from original: 18.340133502165113
Dropped from last step: 16.395384196795824


So, for about 16.4% of the rows kept in the previous step, the approved amount is missing.

# Get the geo info

The dataset documentation describes the two fields as follows:
> ### Institution
> According to the information submitted by the responsible applicant, this is the research institution where the project will largely be carried out. Typically, this is the institution or specific lab where the responsible applicant works. The institution is therefore linked to the application. This field is not structured, but its content is consolidated internally. There is no link in the database to the field University.

> ### University
> This is the institution where the project will largely be carried out according to the application. Pick list. This field is only filled if the research is carried out at a Swiss institution, otherwise the field remains blank. In the case of mobility fellowships, it is generally left empty.


So, we know that for the rows where the University is given, the grant was awarded to a Swiss university. However, we do not know much about the cases where we only have the Institution information. We decided to look into these cases further.

In [7]:
has_uni = has_info[has_info['University'].notnull()]
has_uni.head()

| | Institution | University | Approved Amount |
|---|---|---|---|
| 0 | NaN | Nicht zuteilbar - NA | 11619.0 |
| 1 | Faculté de Psychologie et des Sciences de l'Ed... | Université de Genève - GE | 41022.0 |
| 2 | Kommission für das Corpus philosophorum medii ... | NPO (Biblioth., Museen, Verwalt.) - NPO | 79732.0 |
| 3 | Abt. Handschriften und Alte Drucke Bibliothek ... | Universität Basel - BS | 52627.0 |
| 4 | Schweiz. Thesauruskommission | NPO (Biblioth., Museen, Verwalt.) - NPO | 120042.0 |


In [8]:
print('Dropped from original:', get_dropped_perc(p3, has_uni))
print('Dropped from last step:', get_dropped_perc(grants_data, has_uni))

Dropped from original: 20.292641748346853
Dropped from last step: 2.3910255183107743


In [9]:
# Split 'Name - CODE' at the first ' - ' into the university name and its code
has_code = has_uni['University'].str.extract(r'(?P<University>.*?) - (?P<Code>.*)', expand=True)
has_code.head()

| | University | Code |
|---|---|---|
| 0 | Nicht zuteilbar | NA |
| 1 | Université de Genève | GE |
| 2 | NPO (Biblioth., Museen, Verwalt.) | NPO |
| 3 | Universität Basel | BS |
| 4 | NPO (Biblioth., Museen, Verwalt.) | NPO |
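To make the splitting step easier to check in isolation, here is a small sketch that applies the same idea with the standard `re` module: split each value at the first ` - ` into a name and a code. The helper name `split_university` is hypothetical, and the pattern is a simplified equivalent rather than the exact expression used in the cell.

```python
import re

# Lazy match up to the first ' - ', then everything after it is the code.
PATTERN = re.compile(r'(?P<University>.*?) - (?P<Code>.*)')


def split_university(value):
    """Return (name, code) for 'Name - CODE', or None if there is no ' - '."""
    match = PATTERN.match(value)
    if match is None:
        return None
    return match.group('University'), match.group('Code')


for value in ['Université de Genève - GE', 'ETH Zürich - ETHZ']:
    print(split_university(value))
```

Values without the delimiter yield `None`, which corresponds to the all-NaN rows that `str.extract` produces for non-matching strings.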


In [18]:
# Drop the 'NA' (not assignable) and 'NPO' (non-profit organisation) codes; neither is a canton
has_code = has_code[has_code.Code != 'NPO']
has_code = has_code[has_code.Code != 'NA']
has_code.head()

| | University | Code |
|---|---|---|
| 1 | Université de Genève | GE |
| 3 | Universität Basel | BS |
| 5 | Université de Fribourg | FR |
| 6 | Université de Fribourg | FR |
| 7 | Universität Zürich | ZH |


In [19]:
# For these rows, we already have the canton code
has_canton_code = has_code['Code'][has_code['Code'].apply(is_canton_code)].rename('Canton code')

has_canton_code.index

Int64Index([    1,     3,     5,     6,     7,     9,    10,    12,    13,
               14,
            ...
            63918, 63919, 63921, 63922, 63927, 63928, 63932, 63934, 63955,
            63958],
           dtype='int64', length=27702)
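`is_canton_code` is another function from the `helpers` module that is not shown here. A minimal sketch of such a check, assuming it simply tests membership in the 26 official two-letter canton abbreviations (the name `is_canton_code_sketch` is hypothetical):

```python
# The 26 Swiss canton abbreviations.
CANTON_CODES = frozenset({
    'AG', 'AI', 'AR', 'BE', 'BL', 'BS', 'FR', 'GE', 'GL', 'GR', 'JU', 'LU',
    'NE', 'NW', 'OW', 'SG', 'SH', 'SO', 'SZ', 'TG', 'TI', 'UR', 'VD', 'VS',
    'ZG', 'ZH',
})


def is_canton_code_sketch(code):
    """True if `code` is a string naming a canton; NaN and other types are False."""
    return isinstance(code, str) and code.strip() in CANTON_CODES


for code in ['GE', 'ETHZ', 'LA']:
    print(code, is_canton_code_sketch(code))
```

Codes such as `ETHZ` or `LA` identify an institution rather than a canton, so they fall through to the lookup step.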

In [20]:
# For the rest, we will try to get the canton code by requesting the GeoNames web service
has_uni_code = has_code['Code'][has_code['Code'].apply(is_canton_code) == False]

has_uni_no_cantoncode = has_uni.loc[has_uni_code.index]
has_uni_no_cantoncode.head()

| | Institution | University | Approved Amount |
|---|---|---|---|
| 8 | NaN | Université de Lausanne - LA | 25814.0 |
| 19 | NaN | Université de Lausanne - LA | 14138.0 |
| 29 | Eidg. Forschungsanstalt für Wald, Schnee und L... | Eidg. Forschungsanstalt für Wald,Schnee,Land -... | 445198.0 |
| 47 | Chair of English Literature ETH-Zentrum | ETH Zürich - ETHZ | 46200.0 |
| 51 | Faculté des lettres Université de Lausanne | Université de Lausanne - LA | 74617.0 |


In [21]:
for place in has_uni_no_cantoncode['University'].unique():
    # Skip empty values so that p is never stale or undefined
    if not place:
        continue
    p = geo_lookup(place)
    if p:
        print(place)
        pprint(p)

In [44]:
def map_institution_canton(df):
    """Look up every unique institution and record the result, if any."""
    for place in df['Institution'].unique():
        # Skip missing values
        if pd.isnull(place):
            continue
        # Skip names starting with 'Uni'
        if str(place).startswith('Uni'):
            continue
        p = geo_lookup(place)
        if p:
            print('found', p, 'in the api')
            canton_dict[place] = p
        else:
            # Remember the names that gave no result
            has_looked_up[place] = True
            print(place, p)


map_institution_canton(has_info)

Faculté de Psychologie et des Sciences de l'Education Université de Genève None
Kommission für das Corpus philosophorum medii aevi der SGG None
Abt. Handschriften und Alte Drucke Bibliothek der Universität Basel None
Schweiz. Thesauruskommission None
Séminaire de politique économique, d'économie internationale et d'économie régionale None
Institut für ökumenische Studien Université de Fribourg None
Ostasiatisches Seminar Universität Zürich None
Laboratoire de Didactique et Epistémologie des Sciences Université de Genève None
Klinische Psychologie und Psychotherapie Institut für Psychologie Universität Bern None
Schweizerische Rechtsquellen c/o Universität Zürich / RWI None
Département de Sociologie Faculté des Sciences de la Société Université de Genève None
Institut für Volkswirtschaft WWZ / FGS Universität Basel None
Soziologisches Institut Universität Zürich None
Forschungszentrum FAL None
Forschungsbereich Politisches Verhalten und Oeffentliche Meinung Inst. für Politikwissenschaft

UnboundLocalError: local variable 'r' referenced before assignment
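The `UnboundLocalError` is raised inside `geo_lookup`, whose source is not shown in this notebook. A common cause is a response variable that is only assigned inside a `try` or `if` block. The sketch below shows one defensive way to write such a lookup. Everything in it is an assumption: the function names are hypothetical, and we assume the GeoNames `searchJSON` endpoint with the `q`, `country`, `maxRows` and `username` parameters, and that the `adminCode1` field of a Swiss hit holds the canton abbreviation.

```python
import json
import urllib.parse
import urllib.request

GEONAMES_SEARCH = 'http://api.geonames.org/searchJSON'  # assumed endpoint


def fetch_json(url):
    """Download `url` and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode('utf-8'))


def geo_lookup_sketch(place, username, fetch=fetch_json):
    """Return the admin-1 code of the best Swiss match for `place`, or None."""
    query = urllib.parse.urlencode(
        {'q': place, 'country': 'CH', 'maxRows': 1, 'username': username})
    try:
        result = fetch(GEONAMES_SEARCH + '?' + query)
    except (OSError, ValueError):
        # Network failure or malformed JSON: report "not found" instead of
        # continuing with a variable that was never assigned.
        return None
    hits = result.get('geonames', []) if isinstance(result, dict) else []
    if not hits:
        return None
    return hits[0].get('adminCode1')
```

The `fetch` parameter exists so that the function can be tried without network access, for example `geo_lookup_sketch('ETH Zürich', 'demo', fetch=lambda url: {'geonames': []})`, which returns `None`.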

In [50]:
# Load data for Swiss cities
columns = ['Name', 'Kanton']
swiss_cities = pd.read_csv("swiss_cities.csv", delimiter='\t', usecols=columns, na_values=['data not included in P3', 'nan'])
swiss_cities.head(10)

| | Name | Kanton |
|---|---|---|
| 0 | Zürich | Kanton Zürich Zürich |
| 1 | Genf | Kanton Genf Genf |
| 2 | Basel | Kanton Basel-Stadt Basel-Stadt |
| 3 | Lausanne | Kanton Waadt Waadt |
| 4 | Bern | Kanton Bern Bern |
| 5 | Winterthur | Kanton Zürich Zürich |
| 6 | Luzern | Kanton Luzern Luzern |
| 7 | St. Gallen | Kanton St. Gallen St. Gallen |
| 8 | Lugano | Kanton Tessin Tessin |
| 9 | Biel/Bienne | Kanton Bern Bern |


In [66]:
# Clean up the Kanton column.
# Disabled: broken because the canton names are in German.
#swiss_cities['Kanton'] = swiss_cities['Kanton'].apply(lambda x: get_canton_code(x.rsplit()[-1]))
#swiss_cities.head(20)

In [102]:
# Load data for Swiss towns
columns = ['Town', 'Canton']
swiss_towns = pd.read_csv("swiss_towns.csv", delimiter='\t', usecols=columns, na_values=['data not included in P3', 'nan'])
swiss_towns.set_index('Town', inplace=True)
swiss_towns.head(10)

| Town | Canton |
|---|---|
| Aarau | AG |
| Aarberg | BE |
| Aarburg | AG |
| Adliswil | ZH |
| Aesch (BL) | BL |
| Affoltern am Albis | ZH |
| Agno | TI |
| Aigle | VD |
| Allschwil | BL |
| Altdorf (UR) | UR |


In [107]:
# Test for particular town
#swiss_towns.loc[swiss_towns.Town == 'Basel']
swiss_towns.loc['Basel']['Canton']

AttributeError: 'str' object has no attribute 'str'

In [113]:
# Map institutions to cantons directly, using the list of Swiss towns from Wikipedia:
# https://en.wikipedia.org/wiki/List_of_cities_in_Switzerland

def map_inst_canton(df):
    for place in df['Institution'].unique():
        if str(place) == 'nan':
            continue
        for word in place.split():
            #print(word)
            if word in swiss_towns.index:
                p = swiss_towns.loc[word]['Canton']
                print(place, " -> ", p)

map_inst_canton(has_info)

Abt. Handschriften und Alte Drucke Bibliothek der Universität Basel  ->  BS
Institut für ökumenische Studien Université de Fribourg  ->  FR
Ostasiatisches Seminar Universität Zürich  ->  ZH
Klinische Psychologie und Psychotherapie Institut für Psychologie Universität Bern  ->  BE
Schweizerische Rechtsquellen c/o Universität Zürich / RWI  ->  ZH
Institut für Volkswirtschaft WWZ / FGS Universität Basel  ->  BS
Soziologisches Institut Universität Zürich  ->  ZH
Schweizerisches Nationalmuseum Landesmuseum Zürich  ->  ZH
Institut für Iberoromanistik Universität Basel  ->  BS
Historische und Antiquarische Gesellschaft zu Basel  ->  BS
Universitätsbibliothek Basel Pressestelle  ->  BS
Ägyptologisches Seminar Departement Altertumswissenschaften Universität Basel  ->  BS
Klassische Archäologie Departement Altertumswissenschaft Universität Basel  ->  BS
Faculté des lettres Université de Lausanne  ->  VD
Institut de sociologie et de sciences politiques Université de Neuchâtel  ->  NE
Section de f

ValueError: I/O operation on closed file
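The word-by-word matching above prints its results, which makes them hard to reuse. A minimal sketch of the same idea that returns the match instead, assuming a plain `dict` from town name to canton code (the function `match_town` and the toy data are hypothetical):

```python
import string


def match_town(institution, town_to_canton):
    """Return the canton of the first word that names a known town, else None."""
    for word in institution.split():
        # Strip surrounding punctuation so that 'Zürich,' still matches 'Zürich'.
        token = word.strip(string.punctuation)
        if token in town_to_canton:
            return town_to_canton[token]
    return None


# Toy subset of the town table, for illustration only.
towns = {'Basel': 'BS', 'Zürich': 'ZH', 'Lausanne': 'VD'}
names = [
    'Institut für Iberoromanistik Universität Basel',
    'Faculté des lettres Université de Lausanne',
    'Forschungszentrum FAL',
]
mapping = {name: match_town(name, towns) for name in names}
print(mapping)
```

Matching single words cannot find multi-word town names such as `St. Gallen` or entries such as `Aesch (BL)`, so this approach only covers part of the institutions.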

In [23]:
pprint(canton_dict)

''


In [24]:
grants_with_has_code = university_grants.join(has_canton_code)
grants_with_has_code.head()

NameError: name 'university_grants' is not defined

In [25]:
university_grants.head()

NameError: name 'university_grants' is not defined

In [26]:
extracts = university_grants['University'].str.extract('(?P<University>.*-)(?P<Code>.*)', expand=True)
has_code = pd.DataFrame( extracts['Code'] )
has_code.Code.apply(is_canton_code)
grouped_has_code = has_code.groupby('Code')

NameError: name 'university_grants' is not defined

In [27]:
only_institution_data = grants_data[grants_data[['University']].isnull().all(axis=1)].index

## TODO:

1. Clean up this mess (:

1. Make one function for mapping a row to a canton.
   Suggested strategy:

   (We can only make 200 requests per hour to the API, so we should keep the number of lookups down.)

   1. Collect the unique institutions and the unique universities.
   2. Look up each institution name and store the result in a hashmap: institution -> canton.
   3. Parse each university. If no canton code is given, look it up and store the result in the same hashmap.
   4. In the function, query the hashmap for each column. If the results conflict, return (?)

1. Apply the function to all rows in the data and add the canton as a column.
1. Make a DataFrame of Canton and Approved Amount.
1. Group by canton, sum the approved amount, and add a count column.
1. Plot the results on the map, etc.
1. Bonus
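The strategy in the list above can be sketched as follows. This is only an illustration on toy data: `build_lookup`, `row_to_canton`, the stub resolver and the institute names are all hypothetical, and returning `None` on a conflict is just one possible answer to the open question in step 4.

```python
import pandas as pd


def build_lookup(names, resolve):
    """Call `resolve` once per unique name, to stay under the API rate limit."""
    return {name: resolve(name) for name in set(names) if isinstance(name, str)}


def row_to_canton(row, lookup):
    """Resolve a row to a canton; None if unknown or if the columns disagree."""
    candidates = {lookup.get(row['Institution']), lookup.get(row['University'])}
    candidates.discard(None)
    return candidates.pop() if len(candidates) == 1 else None


# Toy data standing in for the real grants table.
grants = pd.DataFrame({
    'Institution': ['Institute A', None, 'Institute B', 'Institute A'],
    'University': ['Universität Basel - BS', 'Université de Genève - GE',
                   None, 'Université de Genève - GE'],
    'Approved Amount': [100.0, 50.0, 25.0, 10.0],
})

# Stub resolver standing in for the web service and the code parser.
toy_answers = {'Institute A': 'BS', 'Institute B': 'GE',
               'Universität Basel - BS': 'BS', 'Université de Genève - GE': 'GE'}
all_names = pd.concat([grants['Institution'], grants['University']])
lookup = build_lookup(all_names, toy_answers.get)

grants['Canton'] = grants.apply(lambda row: row_to_canton(row, lookup), axis=1)

# Sum and count of the approved amounts per canton; unresolved rows are dropped.
per_canton = (grants.dropna(subset=['Canton'])
              .groupby('Canton')['Approved Amount']
              .agg(['sum', 'count']))
print(per_canton)
```

In the toy data the last row is dropped because its institution and university point to different cantons.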
