# ADA Homework 03 - Interactive Viz

In [28]:
# Import libraries
import pandas as pd
import numpy as np

# Map
import folium

# Query with geonames
import json
import requests
from urllib import parse 
from urllib.request import urlopen


> **Assignment**

> Build a Choropleth map which shows intuitively (i.e., use colors wisely) how much grant money goes to each Swiss canton. To do so, you will need to use the provided TopoJSON file, combined with the Choropleth map example you can find in the Folium README file.

## Import data

In [4]:
# The data is downloaded from: http://p3.snf.ch/Pages/DataAndDocumentation.aspx
# The downloaded CSV file uses semicolons ';' as the delimiter and contains
# the column headers in its first row
p3_grand = pd.read_csv("Data/P3_GrantExport.csv", sep=";", index_col=0)
p3_grand.head()

| Project Number | Project Title | Project Title English | Responsible Applicant | Funding Instrument | Funding Instrument Hierarchy | Institution | University | Discipline Number | Discipline Name | Discipline Name Hierarchy | Start Date | End Date | Approved Amount | Keywords |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Schlussband (Bd. VI) der Jacob Burckhardt-Biog... | | Kaegi Werner | Project funding (Div. I-III) | Project funding | | Nicht zuteilbar - NA | 10302 | Swiss history | Human and Social Sciences;Theology & religious... | 01.10.1975 | 30.09.1976 | 11619.0 | |
| 4 | Batterie de tests à l'usage des enseignants po... | | Massarenti Léonard | Project funding (Div. I-III) | Project funding | Faculté de Psychologie et des Sciences de l'Ed... | Université de Genève - GE | 10104 | Educational science and Pedagogy | Human and Social Sciences;Psychology, educatio... | 01.10.1975 | 30.09.1976 | 41022.0 | |
| 5 | Kritische Erstausgabe der "Evidentiae contra D... | | Kommission für das Corpus philosophorum medii ... | Project funding (Div. I-III) | Project funding | Kommission für das Corpus philosophorum medii ... | NPO (Biblioth., Museen, Verwalt.) - NPO | 10101 | Philosophy | Human and Social Sciences;Linguistics and lite... | 01.03.1976 | 28.02.1985 | 79732.0 | |
| 6 | Katalog der datierten Handschriften in der Sch... | | Burckhardt Max | Project funding (Div. I-III) | Project funding | Abt. Handschriften und Alte Drucke Bibliothek ... | Universität Basel - BS | 10302 | Swiss history | Human and Social Sciences;Theology & religious... | 01.10.1975 | 30.09.1976 | 52627.0 | |
| 7 | Wissenschaftliche Mitarbeit am Thesaurus Lingu... | | Schweiz. Thesauruskommission | Project funding (Div. I-III) | Project funding | Schweiz. Thesauruskommission | NPO (Biblioth., Museen, Verwalt.) - NPO | 10303 | Ancient history and Classical studies | Human and Social Sciences;Theology & religious... | 01.01.1976 | 30.04.1978 | 120042.0 | |


In [5]:
p3_grand.index.is_unique

True

The file contains 15 columns, one of which (*Project Number*) is used as the index. Of the remaining 14 columns, 8 contain some NaN values.

In [6]:
p3_grand.isnull().any()

Project Title                   False
Project Title English            True
Responsible Applicant           False
Funding Instrument              False
Funding Instrument Hierarchy     True
Institution                      True
University                       True
Discipline Number               False
Discipline Name                 False
Discipline Name Hierarchy        True
Start Date                       True
End Date                         True
Approved Amount                 False
Keywords                         True
dtype: bool
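The number of columns with missing values can also be computed rather than read off the listing. A minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not the P3 export):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for p3_grand (hypothetical values)
toy = pd.DataFrame({
    "Project Title": ["A", "B", "C"],
    "Institution": ["X", np.nan, "Z"],
    "Keywords": [np.nan, np.nan, "k"],
})

has_null = toy.isnull().any()                        # one boolean per column
n_cols_with_null = int(has_null.sum())               # True counts as 1
cols_with_null = has_null[has_null].index.tolist()   # names of the affected columns
print(n_cols_with_null, cols_with_null)
```

Applied to `p3_grand`, the same two expressions would give the count and the names of the affected columns.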

## Data filter and cleaning

For the assignment, we need to have the canton location and the grant amount from the data.<br>
To retrieve the canton, we will use the columns *Institution* and *University*. For the grant amount, we keep the column *Approved Amount*.

In [7]:
data_of_interest = p3_grand[['Institution', 'University', 'Approved Amount']]
data_of_interest.head()

| Project Number | Institution | University | Approved Amount |
|---|---|---|---|
| 1 | | Nicht zuteilbar - NA | 11619.0 |
| 4 | Faculté de Psychologie et des Sciences de l'Ed... | Université de Genève - GE | 41022.0 |
| 5 | Kommission für das Corpus philosophorum medii ... | NPO (Biblioth., Museen, Verwalt.) - NPO | 79732.0 |
| 6 | Abt. Handschriften und Alte Drucke Bibliothek ... | Universität Basel - BS | 52627.0 |
| 7 | Schweiz. Thesauruskommission | NPO (Biblioth., Museen, Verwalt.) - NPO | 120042.0 |


We can see that some rows have neither *Institution* nor *University* information.

In [8]:
data_of_interest.loc[20008]

Institution                            NaN
University                             NaN
Approved Amount    data not included in P3
Name: 20008, dtype: object

These rows are therefore useless for us, so we can remove them.

In [9]:
# 'thresh=2' keeps the rows with at least two non-null values. Since 'Approved Amount' has no
# null values, this drops exactly the rows where both 'Institution' and 'University' are missing.
data_of_interest = data_of_interest.dropna(axis=0, thresh=2)
data_of_interest.size

187443
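To make the effect of `thresh` concrete, here is a minimal sketch on a made-up three-row frame (the values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: complete, missing Institution, missing both
toy = pd.DataFrame({
    "Institution": ["Inst A", np.nan, np.nan],
    "University": ["Uni A", "Uni B", np.nan],
    "Approved Amount": ["100.0", "200.0", "data not included in P3"],
})

# thresh=2 keeps rows that have at least two non-null values
kept = toy.dropna(axis=0, thresh=2)
print(kept.index.tolist())
print(kept.shape[0], kept.size)  # number of rows, number of cells
```

Note that `DataFrame.size` counts cells (rows times columns), so the 187443 above corresponds to 62481 rows of 3 columns; `shape[0]` gives the row count directly.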

Note that, as seen above, the column *Approved Amount* has no null values. However, it does not contain only numbers.

In [10]:
data_of_interest['Approved Amount'].describe()

count                       62481
unique                      36396
top       data not included in P3
freq                        10244
Name: Approved Amount, dtype: object

As we can see, some rows contain the string *"data not included in P3"*. <br>
So we need to clean this column. We only keep rows where the *Approved Amount* value can be cast to numeric.

In [11]:
# errors="coerce" replaces every value that cannot be cast to numeric with a null value,
# so we simply need to get rid of all these null values
missing_amount = pd.to_numeric(data_of_interest['Approved Amount'], errors="coerce").isnull()
data_clean = data_of_interest[~ missing_amount]

print("Type of 'Approved Amount column:", data_clean['Approved Amount'].dtypes)
print("Any null values? ",data_clean['Approved Amount'].isnull().any())
data_clean.head()

Type of 'Approved Amount column: object
Any null values?  False


| Project Number | Institution | University | Approved Amount |
|---|---|---|---|
| 1 | | Nicht zuteilbar - NA | 11619.0 |
| 4 | Faculté de Psychologie et des Sciences de l'Ed... | Université de Genève - GE | 41022.0 |
| 5 | Kommission für das Corpus philosophorum medii ... | NPO (Biblioth., Museen, Verwalt.) - NPO | 79732.0 |
| 6 | Abt. Handschriften und Alte Drucke Bibliothek ... | Universität Basel - BS | 52627.0 |
| 7 | Schweiz. Thesauruskommission | NPO (Biblioth., Museen, Verwalt.) - NPO | 120042.0 |
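The filter above keeps the rows whose amount can be parsed, but the column itself still holds the original objects (its dtype is reported as `object`). Summing amounts later needs an actual conversion. A minimal sketch on made-up values, not the real data:

```python
import pandas as pd

# Hypothetical sample mixing numeric strings and the placeholder text
amounts = pd.Series(["11619.0", "data not included in P3", "41022.0"])

numeric = pd.to_numeric(amounts, errors="coerce")  # unparsable values become NaN
clean = numeric.dropna()                           # keep only real numbers, as float64
print(clean.dtype, clean.sum())
```

One possible way to apply this to the notebook's frame would be to store the converted series back into the *Approved Amount* column after filtering.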


## Add "Canton" information

Starting from the *Institution* and *University* information, we can find the *Canton* information for each row.

In [72]:
a = data_clean.groupby('University').size()
print("Total:", a.sum())
a.sort_values(ascending=False)

Total: 50878


University
Universität Zürich - ZH                               6754
Université de Genève - GE                             6379
ETH Zürich - ETHZ                                     6143
Universität Bern - BE                                 5460
Universität Basel - BS                                4737
EPF Lausanne - EPFL                                   4406
Université de Lausanne - LA                           4090
Nicht zuteilbar - NA                                  2595
Université de Fribourg - FR                           2078
Université de Neuchâtel - NE                          1589
NPO (Biblioth., Museen, Verwalt.) - NPO               1473
Paul Scherrer Institut - PSI                           537
Firmen/Privatwirtschaft - FP                           492
Universität St. Gallen - SG                            426
Università della Svizzera italiana - USI               346
Eidg. Anstalt für Wasserversorgung - EAWAG             333
HES de Suisse occidentale - HES-SO           

In [18]:
data_clean[data_clean.University.isnull()].shape

(1359, 3)

In [87]:
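# Placeholder for rows without a university: the Institution column must be used instead
institution = "**Institution**"
# Assumed value (not defined in the cells shown here): label without a " - CODE" suffix
no_value = None

df = pd.DataFrame()
df['University'] = data_clean.University.unique()
df.index = df['University']

names, codes = [], []
for label in df['University']:
    if pd.isnull(label):
        # University is NaN
        name = code = institution
    elif ' - ' in label:
        # Split on the last " - " so that hyphenated codes such as "HES-SO" stay intact
        name, code = label.rsplit(' - ', 1)
    else:
        name, code = label, no_value
    names.append(name)
    codes.append(code)

# Assign whole columns at once: writing to the rows yielded by iterrows()
# is not guaranteed to modify the DataFrame
df['University Name'] = names
df['University Code'] = codes

df.head()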

| University (index) | University | University Name | University Code |
|---|---|---|---|
| Nicht zuteilbar - NA | Nicht zuteilbar - NA | Nicht zuteilbar | |
| Université de Genève - GE | Université de Genève - GE | Université de Genève | GE |
| NPO (Biblioth., Museen, Verwalt.) - NPO | NPO (Biblioth., Museen, Verwalt.) - NPO | NPO (Biblioth., Museen, Verwalt.) | NPO |
| Universität Basel - BS | Universität Basel - BS | Universität Basel | BS |
| Université de Fribourg - FR | Université de Fribourg - FR | Université de Fribourg | FR |
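The name/code split can also be written without an explicit loop. The sketch below uses pandas' string accessor and splits on the last `" - "`, which keeps hyphenated codes such as `HES-SO` in one piece. The three labels are taken from the listing above; the variable names are assumptions.

```python
import pandas as pd

labels = pd.Series([
    "Université de Genève - GE",
    "HES de Suisse occidentale - HES-SO",
    "Nicht zuteilbar - NA",
])

# Split once, starting from the right, on the literal separator " - "
parts = labels.str.rsplit(" - ", n=1, expand=True)
parts.columns = ["University Name", "University Code"]
print(parts)
```

Labels that are NaN stay NaN in both columns with this approach, so the placeholder for missing universities would still have to be filled in separately.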


In [81]:
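def getCanton(name):
    """Return the canton code (GeoNames 'adminCode1') of the best match for `name`
    in Switzerland, or False if nothing is found."""
    if not name:
        return False

    # Let requests build and URL-encode the query string
    params = {
        'q': name,
        'country': 'CH',
        'maxRows': 1,
        'username': 'epfl_interactive_viz',
        'type': 'json',
    }
    response = requests.get('http://api.geonames.org/search', params=params)

    try:
        return response.json()['geonames'][0]['adminCode1']
    except (ValueError, KeyError, IndexError):
        # Invalid JSON, error payload, or empty result list
        return False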
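The extraction step of `getCanton` can be checked without network access. The sketch below isolates it in a hypothetical helper, `canton_from_response`, and runs it on hand-written payloads. These payloads are assumptions shaped like the `geonames` list that `getCanton` indexes into, not recorded API responses.

```python
def canton_from_response(payload):
    """Return the 'adminCode1' of the first hit in a parsed response, or None."""
    hits = payload.get("geonames", [])
    if not hits:
        return None
    return hits[0].get("adminCode1")


# Hypothetical payloads for illustration
sample_hit = {"geonames": [{"name": "Genève", "adminCode1": "GE"}]}
sample_miss = {"geonames": []}
sample_error = {"status": {"message": "error"}}  # no 'geonames' key at all

for payload in (sample_hit, sample_miss, sample_error):
    print(canton_from_response(payload))
```

Returning `None` instead of `False` for a miss is a design choice of this sketch: it lets the result be stored directly in a column where pandas treats it as missing.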

In [118]:
cantons = []

for name, code in zip(df['University Name'], df['University Code']):
    # Try the full name first, then fall back to the short code
    canton = getCanton(name)
    if not canton and code:
        canton = getCanton(code)
    cantons.append(canton if canton else None)

# Assign the whole column at once instead of writing to the rows yielded by iterrows()
df['Canton'] = cantons

df.head()

| University (index) | University | University Name | University Code | Canton |
|---|---|---|---|---|
| Nicht zuteilbar - NA | Nicht zuteilbar - NA | Nicht zuteilbar | | BE |
| Université de Genève - GE | Université de Genève - GE | Université de Genève | GE | GE |
| NPO (Biblioth., Museen, Verwalt.) - NPO | NPO (Biblioth., Museen, Verwalt.) - NPO | NPO (Biblioth., Museen, Verwalt.) | NPO | |
| Universität Basel - BS | Universität Basel - BS | Universität Basel | BS | BS |
| Université de Fribourg - FR | Université de Fribourg - FR | Université de Fribourg | FR | FR |


In [121]:
# Work on an explicit copy so that adding a column does not trigger
# pandas' SettingWithCopyWarning
data_clean = data_clean.copy()

# Look up each row's canton in the per-university table (indexed by University)
data_clean['Canton'] = data_clean['University'].map(df['Canton'])

data_clean.head()


| Project Number | Institution | University | Approved Amount | Canton |
|---|---|---|---|---|
| 1 | | Nicht zuteilbar - NA | 11619.0 | BE |
| 4 | Faculté de Psychologie et des Sciences de l'Ed... | Université de Genève - GE | 41022.0 | GE |
| 5 | Kommission für das Corpus philosophorum medii ... | NPO (Biblioth., Museen, Verwalt.) - NPO | 79732.0 | |
| 6 | Abt. Handschriften und Alte Drucke Bibliothek ... | Universität Basel - BS | 52627.0 | BS |
| 7 | Schweiz. Thesauruskommission | NPO (Biblioth., Museen, Verwalt.) - NPO | 120042.0 | |


In [128]:
1 - data_clean.Canton.isnull().sum() / data_clean.shape[0]

0.89250914103030421
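The ratio above is the share of rows for which a canton was found. The sketch below shows an equivalent way to compute it, followed by the per-canton total that the choropleth needs. It runs on a made-up frame (values are assumptions) and assumes the amounts have already been converted to numbers.

```python
import pandas as pd

# Hypothetical rows; None marks a row whose canton could not be resolved
toy = pd.DataFrame({
    "Approved Amount": [100.0, 200.0, 50.0, 150.0],
    "Canton": ["GE", "BS", None, "GE"],
})

# Share of rows with a known canton (mean of a boolean mask)
coverage = toy["Canton"].notnull().mean()

# Total amount per canton; rows without a canton are left out by groupby
per_canton = toy.groupby("Canton")["Approved Amount"].sum()
print(coverage)
print(per_canton)
```

A row-based share says nothing about how much money the unresolved rows represent, so an amount-weighted share may be worth checking as well.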