# Interactive Viz - load dataset

After the usual library imports, we load the P3 dataset. Because of problems in reading the csv file, we have converted the original .csv dataset to an Excel spreadsheet.

In [1]:
import pandas as pd

In [2]:
# load excel spreadsheet
data = pd.read_excel('P3.xlsx')
data = data.set_index('Project Number')
#data.head(5)

We check whether the project number is actually a unique identifier, as declared on the SNSF P3 website.

In [3]:
data.index.is_unique

True
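
The same check can also be run on the column before it becomes the index. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative assumptions, not P3 data): `is_unique` tests the column, `duplicated(keep=False)` lists the offending rows, and `set_index(..., verify_integrity=True)` raises a `ValueError` if duplicates are present.

```python
import pandas as pd

# Hypothetical toy frame standing in for the P3 spreadsheet
toy = pd.DataFrame({
    'Project Number': [1, 4, 5, 5],
    'University': ['A', 'B', 'C', 'D'],
})

# Check uniqueness on the column itself, before indexing on it
col_is_unique = toy['Project Number'].is_unique

# List every row whose project number appears more than once
dupes = toy[toy['Project Number'].duplicated(keep=False)]

# set_index can enforce uniqueness by raising ValueError on duplicates
try:
    toy.set_index('Project Number', verify_integrity=True)
    raised = False
except ValueError:
    raised = True
```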

Since it is unique, we keep it as the index of the dataframe. We now explore the dataframe columns, in order to understand which of them are interesting for our purposes:

In [4]:
data.columns

Index(['Project Title', 'Project Title English', 'Responsible Applicant',
       'Funding Instrument', 'Funding Instrument Hierarchy', 'Institution',
       'University', 'Discipline Number', 'Discipline Name',
       'Discipline Name Hierarchy', 'Start Date', 'End Date',
       'Approved Amount', 'Keywords'],
      dtype='object')

A complete description of the parameters can be found at http://p3.snf.ch/Pages/DataAndDocumentation.aspx. The attributes which are interesting for us are:
+ '**University**': academic institution where the project is carried out
+ '**Approved Amount**': grant for the project (CHF)

Later we will also consider the '**Start Date**' and '**End Date**' parameters, for further investigations which take time into account.
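
Since the dates are stored as day.month.year strings, they need to be parsed before any time-based analysis. A minimal sketch, assuming a hypothetical `sample` frame with the same layout as the P3 dates (the `'Duration (days)'` column is an illustrative addition, not part of the dataset):

```python
import pandas as pd

# Hypothetical sample in the same day.month.year layout as the P3 dates
sample = pd.DataFrame({
    'Start Date': ['01.10.1975', '01.03.1976'],
    'End Date': ['30.09.1976', '28.02.1985'],
})

for col in ['Start Date', 'End Date']:
    # errors='coerce' turns unparseable entries into NaT instead of raising
    sample[col] = pd.to_datetime(sample[col], format='%d.%m.%Y', errors='coerce')

# Project duration in days, available once both columns are datetimes
sample['Duration (days)'] = (sample['End Date'] - sample['Start Date']).dt.days
```

Giving the format explicitly avoids any ambiguity between day-first and month-first interpretations.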

Let us filter the dataset and save its reduced version as an .xlsx file.

In [7]:
data_red = data[['University','Approved Amount','Start Date','End Date']]
data_red.head(5)

Project Number,University,Approved Amount,Start Date,End Date
1,Nicht zuteilbar - NA,11619.0,01.10.1975,30.09.1976
4,Université de Genève - GE,41022.0,01.10.1975,30.09.1976
5,"NPO (Biblioth., Museen, Verwalt.) - NPO",79732.0,01.03.1976,28.02.1985
6,Universität Basel - BS,52627.0,01.10.1975,30.09.1976
7,"NPO (Biblioth., Museen, Verwalt.) - NPO",120042.0,01.01.1976,30.04.1978


In [8]:
data_red.to_excel('Grants.xlsx',sheet_name='Sheet1')

From now on, we will work on the reduced dataset, which is lighter to load and read. You can start directly from the following section.

# Preliminary data exploration

In this section we will explore the essential features of the dataset.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sbr
import matplotlib.pyplot as plt
%matplotlib inline

In [47]:
data = pd.read_excel('Grants.xlsx',sheet_name='Sheet1')
data = data.set_index('Project Number')
data.head(3)

Project Number,University,Approved Amount,Start Date,End Date
1,Nicht zuteilbar - NA,11619.0,01.10.1975,30.09.1976
4,Université de Genève - GE,41022.0,01.10.1975,30.09.1976
5,"NPO (Biblioth., Museen, Verwalt.) - NPO",79732.0,01.03.1976,28.02.1985


We now perform a basic data exploration, looking for possible NaN values and analyzing the statistical distribution of the '**Approved Amount**' parameter.

In [48]:
data['University'].isnull().value_counts()

False    50988
True     12981
Name: University, dtype: int64
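
With the counts reported above, the share of missing entries is 12981 / (50988 + 12981) ≈ 0.203. It can also be computed directly, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a hypothetical toy series (`uni` is an illustrative stand-in for the 'University' column):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data['University']
uni = pd.Series([
    'Universität Basel - BS',
    np.nan,
    'Université de Genève - GE',
    np.nan,
    'Nicht zuteilbar - NA',
])

# Mean of the boolean mask = fraction of missing entries
missing_share = uni.isnull().mean()

# Same information as relative frequencies of False / True
relative_counts = uni.isnull().value_counts(normalize=True)
```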

Roughly **20%** of the values in the '**University**' field are NaN, which will make matching projects with the corresponding Canton problematic. Let us look at part of this unavailable data:

In [49]:
# NaN values in the 'University' field
data[data['University'].isnull()].head(10)

Project Number,University,Approved Amount,Start Date,End Date
20001,,data not included in P3,01.11.1986,31.10.1987
20002,,data not included in P3,01.01.1987,31.12.1987
20003,,data not included in P3,01.07.1987,30.06.1988
20004,,data not included in P3,01.09.1986,31.08.1987
20005,,data not included in P3,01.09.1986,31.08.1987
20006,,data not included in P3,01.08.1986,31.07.1987
20007,,data not included in P3,01.10.1986,30.09.1987
20008,,data not included in P3,01.07.1986,31.12.1986
20009,,data not included in P3,01.10.1986,30.09.1987
20010,,data not included in P3,01.10.1986,30.09.1987


In [50]:
# placeholder (non-numeric) values in the 'Approved Amount' field
data[data['Approved Amount']=='data not included in P3'].head(10)

Project Number,University,Approved Amount,Start Date,End Date
20001,,data not included in P3,01.11.1986,31.10.1987
20002,,data not included in P3,01.01.1987,31.12.1987
20003,,data not included in P3,01.07.1987,30.06.1988
20004,,data not included in P3,01.09.1986,31.08.1987
20005,,data not included in P3,01.09.1986,31.08.1987
20006,,data not included in P3,01.08.1986,31.07.1987
20007,,data not included in P3,01.10.1986,30.09.1987
20008,,data not included in P3,01.07.1986,31.12.1986
20009,,data not included in P3,01.10.1986,30.09.1987
20010,,data not included in P3,01.10.1986,30.09.1987


Some records are missing because they are not included in P3, as explicitly stated in the dataset itself. Since we are not able to link them to a university, we ignore them for the moment.
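
One possible way to ignore these records is sketched below, assuming a hypothetical toy frame with the same columns as the reduced dataset (`toy` and `clean` are illustrative names). The placeholder string is coerced to NaN with `pd.to_numeric`, then rows lacking either field are dropped.

```python
import pandas as pd

# Hypothetical toy frame mimicking the reduced dataset
toy = pd.DataFrame(
    {
        'University': ['Université de Genève - GE', None, 'Universität Basel - BS'],
        'Approved Amount': [41022.0, 'data not included in P3', 52627.0],
    },
    index=pd.Index([4, 20001, 6], name='Project Number'),
)

# Non-numeric placeholders such as 'data not included in P3' become NaN
toy['Approved Amount'] = pd.to_numeric(toy['Approved Amount'], errors='coerce')

# Keep only the rows where both the university and the amount are known
clean = toy.dropna(subset=['University', 'Approved Amount'])
```

After this step the 'Approved Amount' column is numeric, so its statistical distribution can be analyzed directly.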