# Fetching the data from the IS Academia API

We start by getting the HTML response of the tabular student data from ISAcademia.
For this, we use the [Requests](http://docs.python-requests.org/en/master/) library.


In [None]:
# We are going to use requests to make the HTTP calls for gathering data, and BeautifulSoup for parsing the
# HTML that we receive
import requests
from bs4 import BeautifulSoup

# re will help us parse the html by using regular expressions
import re

# Furthermore, we will use the normal stack of pandas, numpy, matplotlib and seaborn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Statistical test library
import scipy.stats as stats

## Making the request

*Warning*: we are loading a lot of data, so the request takes quite a long time. Don't run this unless it's needed.

To avoid spamming the API, we collect all the data in a single request and filter it afterwards.

We use the following parameters:

## TODO: Update this
~~~~~~~~~~~~~~~~
- ww_x_GPS:-1
- ww_i_reportModel:133685247
- ww_i_reportModelXsl:133685270
- ww_x_UNITE_ACAD:249847
- ww_x_PERIODE_ACAD:null
- ww_x_PERIODE_PEDAGO:null
- ww_x_HIVERETE:null


Which leads to the following request:
http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=null&ww_x_PERIODE_PEDAGO=null&ww_x_HIVERETE=null
This fetches data for all Computer Science (Informatique) students for all available years and semesters.
This querying technique might be problematic with larger datasets (it would probably result in a server timeout), but since it works for our problem we stick to it.
~~~~~~~~~~~~~~~~
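The parameter list above can also be kept as a dictionary and encoded into the query string, rather than hard-coded in the URI. The following is a minimal sketch using only the standard library; the names `BASE_URL` and `params` are introduced here for illustration and do not appear in the original notebook. Requests accepts the same dictionary through its `params` argument, e.g. `requests.get(BASE_URL, params=params)`.

```python
from urllib.parse import urlencode

# Hypothetical names for illustration; the values are the ones listed above.
BASE_URL = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html"

params = {
    "ww_x_GPS": -1,
    "ww_i_reportModel": 133685247,
    "ww_i_reportModelXsl": 133685270,
    "ww_x_UNITE_ACAD": 249847,
    "ww_x_PERIODE_ACAD": "null",    # "null" means: do not restrict the academic period
    "ww_x_PERIODE_PEDAGO": "null",
    "ww_x_HIVERETE": "null",
}

# Build the full URI from the base URL and the encoded parameters
uri = BASE_URL + "?" + urlencode(params)
print(uri)
```

Restricting the request to a single academic period then only requires changing one dictionary value (for example `ww_x_PERIODE_ACAD`) instead of editing a long URI string.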

In [None]:
DEBUG = False

# TODO: make the request by using parameters to the function call, instead of coding it in the URI.
# TODO: verify that the uri is correct, and that we get all the data that we want

if DEBUG:
    # For testing and development we use the test_uri, which only loads data from 2016-2017
    uri = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=355925344&ww_x_PERIODE_PEDAGO=null&ww_x_HIVERETE=null"
else:
    # For 'production', collect all the data available from ISAcademia, for students at the IC-section
    uri = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=null&ww_x_PERIODE_PEDAGO=null&ww_x_HIVERETE=null"

req = requests.get(uri)

## Parsing the result

In [None]:
# Defining some helper functions, for clarity
def clean(string):
    return string.strip().lower().replace(' ', '_')

def is_semester_info(data):
    return len(data) <= 2

def is_header(data):
    return not ((len(data) > 2) and data[-2].isdigit())

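def parse_table(table):
    """Turn the rows of the ISAcademia HTML table into a list of student dicts."""
    students = []
    header = []
    # Context carried over from the most recent semester-info row
    section = year = semester = wat = ''

    for tr in table:
        # Collect the text of every cell, replacing non-breaking spaces
        row_data = [td.get_text().strip().replace('\xa0', ' ') for td in tr]

        if is_semester_info(row_data):
            # Semester-info row: "section, year, semester" followed by a second line
            info = [clean(value) for value in row_data[0].split(', ')]
            section = info[0]
            year = info[1]
            semester, wat = info[2].split('\n_')
        elif is_header(row_data):
            header = [clean(val) for val in row_data]
        else:
            # Student row: attach the current semester context, then every non-empty cell
            person = {'year': year, 'semester': semester, 'section': section, 'wat': wat}
            for i, key in enumerate(header):
                val = row_data[i].strip()
                if val:
                    person[key] = val

            students.append(person)

    return students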
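
To make the row classification concrete, here is a small sketch that runs the three helpers on hand-written rows. The row contents are hypothetical: they are reconstructed from the column names and values that show up in the parsed DataFrame, not copied from the real ISAcademia response, and the real column order may differ.

```python
def clean(string):
    return string.strip().lower().replace(' ', '_')

def is_semester_info(data):
    return len(data) <= 2

def is_header(data):
    return not ((len(data) > 2) and data[-2].isdigit())

# Hypothetical rows (assumed shapes, for illustration only)
semester_row = ['Informatique, 2007-2008, Bachelor semestre 1\n (90 ét.)']
header_row = ['Civilité', 'Nom Prénom', 'Statut', 'No Sciper', '']
student_row = ['Monsieur', 'Doe John', 'Présent', '123456', '']

# A row with at most two cells is treated as semester information
assert is_semester_info(semester_row)

# A longer row is a header unless its second-to-last cell is numeric (the SCIPER number)
assert is_header(header_row)
assert not is_header(student_row)

# The semester row is split into section, year and "semester + remainder"
info = [clean(value) for value in semester_row[0].split(', ')]
semester, wat = info[2].split('\n_')
print(info[0], info[1], semester, wat)
print([clean(val) for val in header_row])
```

Note that a student row is recognised only by the numeric value in its second-to-last cell, so the approach depends on that cell always holding the SCIPER number.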

In [None]:
soup = BeautifulSoup(req.text, 'html.parser')
students_table = soup.find('table')

students = parse_table(students_table)

df = pd.DataFrame(students)
df.set_index(['no_sciper'], inplace=True)

original = df.copy()

In [None]:
# Debug only: restore the parsed data if df gets messed up further down in the notebook.
# Take a copy so that later in-place changes to df do not modify original.
df = original.copy()
# Let's list some basic info about the parsed data
print(df.shape)
print(df.dtypes)
df.head()

In [None]:
# First of all, we noticed that fetching all the data without specifying a date
# also returned students from years before 2007, which we don't want.

# Let's split the year column into year_start and year_end
df[['year_start', 'year_end']] = df['year'].str.split('-', n=1, expand=True)
# Cast from object to int
df[['year_start','year_end']] = df[['year_start','year_end']].apply(pd.to_numeric)
# Drop the year column
new_df = df.drop("year", axis=1)

# Verify
print(new_df.dtypes)
new_df.head()


# Bachelor students

In [179]:
bachelor_df = new_df[new_df["semester"].str.contains("bachelor_semestre")]
print(bachelor_df.shape)
bachelor_df.head()

(7271, 13)


no_sciper,civilité,ecole_echange,filière_opt.,mineur,nom_prénom,section,semester,spécialisation,statut,type_echange,wat,year_start,year_end
154168,Monsieur,,,,Aghamahdi Mohammad Hossein,informatique,bachelor_semestre_1,,Présent,,(107_ét.),2004,2005
160104,Monsieur,,,,Alves Sergio,informatique,bachelor_semestre_1,,Présent,,(107_ét.),2004,2005
154157,Madame,,,,Andriambololona Riana Miarantsoa,informatique,bachelor_semestre_1,,Présent,,(107_ét.),2004,2005
166876,Monsieur,,,,Aslan Unal,informatique,bachelor_semestre_1,,Présent,,(107_ét.),2004,2005
166258,Monsieur,,,,Balet Ken,informatique,bachelor_semestre_1,,Présent,,(107_ét.),2004,2005


In [180]:
# ...from year 2007 and above
bachelor_from_2007_df = bachelor_df[bachelor_df["year_start"] >= 2007]
print(bachelor_from_2007_df.shape)
bachelor_from_2007_df.head()

(5807, 13)


no_sciper,civilité,ecole_echange,filière_opt.,mineur,nom_prénom,section,semester,spécialisation,statut,type_echange,wat,year_start,year_end
169569,Monsieur,,,,Arévalo Christian,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008
174905,Monsieur,,,,Aubelle Flavien,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008
173922,Monsieur,,,,Badoud Morgan,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008
179406,Monsieur,,,,Baeriswyl Jonathan,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008
179428,Monsieur,,,,Barroco Michael,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008


## Extract bachelor candidates based on their semester entries
Search for students (rows) that have either bachelor_semestre_1 or bachelor_semestre_6.

In [181]:
'''DROPTHIS searchfor = ['bachelor_semestre_1', 'bachelor_semestre_6']
first_and_last_sem_df = bachelor_from_2007_df[bachelor_from_2007_df["semester"].str.contains('|'.join(searchfor))]

# just to visualise
by_name = first_and_last_sem_df.groupby(['nom_prénom', 'semester'])
by_name.first()
'''

#pruned_bachelor = bachelor_from_2007_df.groupby(['nom_prénom']).filter(lambda x: x['semester'].str.contains('bachelor_semestre_1').any() and x['semester'].str.contains('bachelor_semestre_6').any())
#pruned_bachelor = pruned_bachelor.groupby(['nom_prénom'])
#pruned_bachelor.head(10)


Only consider students who were registered in both bachelor semester 1 and bachelor semester 6. Students who appear in only one of the two are dropped, since we cannot observe the full length of their bachelor studies.

In [182]:
pruned_bachelor = bachelor_from_2007_df.groupby(['nom_prénom']).filter(lambda x: x['semester'].str.contains('bachelor_semestre_1').any() and x['semester'].str.contains('bachelor_semestre_6').any())
pruned_bachelor = pruned_bachelor.groupby(['nom_prénom'])
pruned_bachelor.head(10)

no_sciper,civilité,ecole_echange,filière_opt.,mineur,nom_prénom,section,semester,spécialisation,statut,type_echange,wat,year_start,year_end
169569,Monsieur,,,,Arévalo Christian,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008
174905,Monsieur,,,,Aubelle Flavien,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008
179406,Monsieur,,,,Baeriswyl Jonathan,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008
179428,Monsieur,,,,Barroco Michael,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008
179449,Monsieur,,,,Bindschaedler Vincent,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008
178553,Monsieur,,,,Bloch Marc-Olivier,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008
179426,Monsieur,,,,Bloch Remi,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008
178271,Monsieur,,,,Boéchat Marc-Alexandre,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008
180731,Monsieur,,,,Bricola Jean-Charles,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008
171619,Monsieur,,,,Buchschacher Nicolas,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008


Calculate the number of semesters by counting the rows in each group. Then convert the collected GroupBy data to a DataFrame and add the gender column.
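
As an illustration of this filter-then-count step, here is a minimal sketch on a toy table. The student names and rows are made up, and the helper `has_first_and_last` is a hypothetical name; it uses exact set membership rather than the `str.contains` calls used in the notebook, which should be equivalent for these semester labels.

```python
import pandas as pd

# Toy data (made up): one row per student per registered semester
rows = [
    ("Student A", "bachelor_semestre_1"),
    ("Student A", "bachelor_semestre_2"),
    ("Student A", "bachelor_semestre_6"),
    ("Student B", "bachelor_semestre_1"),
    ("Student B", "bachelor_semestre_2"),
    ("Student C", "bachelor_semestre_5"),
    ("Student C", "bachelor_semestre_6"),
]
toy = pd.DataFrame(rows, columns=["nom_prénom", "semester"])

def has_first_and_last(group):
    """Keep a student only if both semester 1 and semester 6 are present."""
    semesters = set(group["semester"])
    return {"bachelor_semestre_1", "bachelor_semestre_6"} <= semesters

# Drop every student that lacks semester 1 or semester 6
kept = toy.groupby("nom_prénom").filter(has_first_and_last)

# Count the remaining rows per student: one row corresponds to one registered semester
counts = kept.groupby("nom_prénom").size().rename("total_semester_count")
print(counts)
```

Only "Student A" survives the filter here, because the other two students are missing one of the two required semesters.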

In [183]:
bachelor_total = pd.DataFrame(pruned_bachelor.size().rename('total_semester_count'))
bachelor_total['civilité'] = bachelor_total.index.map(lambda x: bachelor_df[bachelor_df['nom_prénom'] == str(x)].civilité.unique()[0])
bachelor_total['gender'] = bachelor_total.apply(is_men, axis=1)
bachelor_total.head(10)

nom_prénom,total_semester_count,civilité,gender
Abate Bryan Jeremy,6,Monsieur,1
Aiulfi Loris Sandro,12,Monsieur,1
Alami-Idrissi Ali,6,Monsieur,1
Alfonso Peterssen Alfonso,6,Monsieur,1
Alonso Seisdedos Florian,11,Monsieur,1
Amorim Afonso Caldeira Da Silva Pedro Maria,8,Monsieur,1
Andreina Sébastien Laurent,8,Monsieur,1
Angel Axel,6,Monsieur,1
Angerand Grégoire Georges Jacques,10,Monsieur,1
Antognini Marco,8,Monsieur,1


In [184]:
# Warning: we are not EPFL students, so it is extremely hard for us to tell how the IS-Academia system really works.
# We assume that for bachelor studies to count as completed, a student has to be registered for
# both bachelor_semestre_1 and bachelor_semestre_6. Since many different situations can occur during those six
# semesters (gap year, failed semester, exchange semester, etc.), we simplify the problem and assume that the
# number of semesters spent at EPFL equals (year of graduation - year of bachelor start) * 2.
# Obviously this assumption does not hold in real life, but this dataset offers no way
# to tell how many semesters were actually required for graduation. (Even reaching the 6th semester doesn't imply
# that the student successfully graduated!) Moreover, it seems strange that a student would have to retake a whole year
# after failing only one semester (from the data it seems that failing the 5th semester means you cannot attempt the 6th
# and have to wait one semester to retake the 5th), but that's what we assumed.
#
# Thus our dataset shrinks significantly: from 1839 IC students who attempted either semester 1 OR 6
# to 397 IC students who attempted semester 1 AND 6.

sem_1_df = year_start_order_df[year_start_order_df["semester"] == "bachelor_semestre_1"]
unique_sem_1_df = sem_1_df.drop_duplicates(subset=['nom_prénom', 'semester'], keep='first')

sem_6_df = year_start_order_df[year_start_order_df["semester"] == "bachelor_semestre_6"]
unique_sem_6_df = sem_6_df.drop_duplicates(subset=['nom_prénom', 'semester'], keep='last')

difference_df = pd.DataFrame(unique_sem_6_df["year_end"]-unique_sem_1_df["year_start"], columns=['year_count'])
difference_df.dropna(inplace=True)
difference_df["semester_total"] = difference_df["year_count"]*2
difference_df = difference_df.drop('year_count', axis=1)
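
The simplifying assumption used in this cell boils down to a single line of arithmetic. The sketch below spells it out for two made-up students; the function name `estimated_semesters` is hypothetical and not part of the notebook.

```python
def estimated_semesters(first_sem1_year_start, last_sem6_year_end):
    """Estimate semesters spent as (last year_end of semester 6 - first year_start of semester 1) * 2."""
    return (last_sem6_year_end - first_sem1_year_start) * 2

# Starts semester 1 in 2007-2008, last semester 6 entry in 2009-2010: three academic years
print(estimated_semesters(2007, 2010))

# Same start, but the last semester 6 entry is in 2010-2011: one extra academic year
print(estimated_semesters(2007, 2011))
```

Because the estimate is based on academic years, it can only take even values, so a single repeated semester shows up as two extra semesters.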

In [185]:
semesters_df = year_start_order_df
semesters_df["semester_total"] = difference_df["semester_total"]
semesters_df = semesters_df[pd.notnull(semesters_df['semester_total'])]
semesters_df = semesters_df.drop_duplicates(subset=['nom_prénom'])
semesters_df

no_sciper,civilité,ecole_echange,filière_opt.,mineur,nom_prénom,section,semester,spécialisation,statut,type_echange,wat,year_start,year_end,semester_total
169569,Monsieur,,,,Arévalo Christian,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0
178682,Monsieur,,,,Zoller Roman,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0
180854,Monsieur,,,,Vautherin Jonas,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0
175280,Monsieur,,,,Uberti Quentin,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,8.0
180241,Monsieur,,,,Sondag Pierre-Antoine,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0
178684,Monsieur,,,,Schwery Thomas,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0
169795,Monsieur,,,,Scheiben Pascal,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,8.0
178948,Monsieur,,,,Schädeli Andreas,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0
171195,Monsieur,,,,Richter Arnaud,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0
180959,Monsieur,,,,Restani Stéphane,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0


In [186]:
semesters_df.loc["174905"]

civilité                     Monsieur
ecole_echange                     NaN
filière_opt.                      NaN
mineur                            NaN
nom_prénom            Aubelle Flavien
section                  informatique
semester          bachelor_semestre_1
spécialisation                    NaN
statut                        Présent
type_echange                      NaN
wat                          (90_ét.)
year_start                       2007
year_end                         2008
semester_total                     10
Name: 174905, dtype: object

In [187]:
def is_men(row):
    # Encode gender as 1 for 'Monsieur' and 0 for any other title
    return 1 if row['civilité'] == 'Monsieur' else 0


semesters_df['gender'] = semesters_df.apply(is_men, axis=1)
semesters_df

no_sciper,civilité,ecole_echange,filière_opt.,mineur,nom_prénom,section,semester,spécialisation,statut,type_echange,wat,year_start,year_end,semester_total,gender
169569,Monsieur,,,,Arévalo Christian,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0,1
178682,Monsieur,,,,Zoller Roman,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0,1
180854,Monsieur,,,,Vautherin Jonas,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0,1
175280,Monsieur,,,,Uberti Quentin,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,8.0,1
180241,Monsieur,,,,Sondag Pierre-Antoine,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0,1
178684,Monsieur,,,,Schwery Thomas,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0,1
169795,Monsieur,,,,Scheiben Pascal,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,8.0,1
178948,Monsieur,,,,Schädeli Andreas,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0,1
171195,Monsieur,,,,Richter Arnaud,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0,1
180959,Monsieur,,,,Restani Stéphane,informatique,bachelor_semestre_1,,Présent,,(90_ét.),2007,2008,6.0,1


In [188]:
bachelor_total["gender"].mean()

# Enormous share of men (about 92.7%) compared to women (about 7.3%)

0.9269521410579346

In [189]:
bachelor_total.groupby(['civilité'])['total_semester_count'].mean()

# It appears that women take slightly less time on average to complete the bachelor.

civilité
Madame      6.793103
Monsieur    7.105978
Name: total_semester_count, dtype: float64

In [190]:

men_df = bachelor_total[bachelor_total["gender"] == 1]
women_df = bachelor_total[bachelor_total["gender"] == 0]

# In a two-sample t-test, the null hypothesis is that the means of both groups are the same, i.e. that
# men and women take on average the same time to complete their studies. Our averages differ, but that may be
# because the sample of women is too small for the difference to be statistically significant.

stats.ttest_ind(men_df.total_semester_count, women_df.total_semester_count)

# The test yields a p-value of about 0.288, which means there is roughly a 28.8% chance we'd see sample data
# at least this far apart if the two groups actually had the same mean. At a 95% confidence level we
# fail to reject the null hypothesis, since the p-value is greater than the corresponding significance
# level of 5%.

Ttest_indResult(statistic=1.0643000334248733, pvalue=0.2878429746516184)
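
For readers who want to see what the test statistic measures, here is a minimal standard-library sketch of the pooled-variance (Student) t statistic, which is what `stats.ttest_ind` computes with its default `equal_var=True`. The two samples are made up, the function name `pooled_t_statistic` is hypothetical, and the p-value is left out because it requires the t-distribution.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Student's t statistic for two independent samples, using a pooled variance."""
    n_a, n_b = len(a), len(b)
    # statistics.variance is the sample variance (it divides by n - 1)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(a) - mean(b)) / standard_error

# Made-up semester counts, for illustration only
men_semesters = [6, 8, 6, 8]
women_semesters = [6, 6, 8, 6]

t = pooled_t_statistic(men_semesters, women_semesters)
degrees_of_freedom = len(men_semesters) + len(women_semesters) - 2
print(t, degrees_of_freedom)
```

The statistic is the difference in means divided by its standard error, so a small group (such as the women in this dataset) inflates the standard error and makes a given difference harder to distinguish from noise.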

# Masters

First, prune the Master dataset so that it only contains data from 2007 onward.

In [None]:
master_copy = df.copy()
master_df = master_copy[master_copy['year_start'] >= 2007]
master_df = master_df[master_df.semester.str.contains('master')]

master_df.semester.value_counts()

Creation of needed master DataFrames for semester 1/2/3 and Master Project

### Calculate the number of months for the master

In [None]:
original[original['nom_prénom']=='Brutsche Florian'].civilité.unique()[0]
master_df[master_df['nom_prénom'] == 'Brutsche Florian']

Only consider students who were in semester 1 and have done their Master project. This dataset includes Minors. Minors and Specializations are allowed to take one semester longer, so technically they would need to be excluded to measure the pure length of an IC Master. For the Minors this would be possible, but for the Specializations there is no way to evaluate it, hence we use the combined dataset to compensate for that fact.

In [None]:
pruned_master = master_df.groupby(['nom_prénom']).filter(lambda x: x['semester'].str.contains('projet_master').any() and x['semester'].str.contains('semestre_1').any())
pruned_master = pruned_master.groupby(['nom_prénom'])

master_total = pd.DataFrame(pruned_master.size().rename('total_semester_count'))
master_total['civilité'] = master_total.index.map(lambda x: master_df[master_df['nom_prénom'] == str(x)].civilité.unique()[0])

master_total['gender'] = master_total.apply(is_men, axis=1)
master_total.head(20)

In [None]:
# factorplot was renamed to catplot in seaborn 0.9 and later removed
box = sns.catplot(x='civilité', y='total_semester_count', data=master_total, kind="box")

In [None]:
master_total['gender'].mean()

Again we are dealing with ~88.60% of males and ~11.40% of females...

In [None]:
master_total.groupby(['civilité'])['total_semester_count'].mean()

...and it appears that on average it takes men more time to complete a master's degree:
1. Males - 4.27 semesters = 25.62 months = over 2 years
2. Females - 4.15 semesters = 24.9 months = just slightly over 2 years
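
The month figures in the list follow from counting six months per semester, which is the conversion implied by the numbers above. A tiny sketch of that arithmetic (the constant and function names are hypothetical):

```python
MONTHS_PER_SEMESTER = 6  # assumption implied by the figures above

def semesters_to_months(semesters):
    """Convert a semester count to months, rounded to two decimals."""
    return round(semesters * MONTHS_PER_SEMESTER, 2)

print(semesters_to_months(4.27))
print(semesters_to_months(4.15))
```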

In a two-sample t-test, the null hypothesis is that the means of both groups are the same, i.e. that men and women take on average the same time to complete their studies. Our averages differ, but that may be because the sample of women is too small for the difference to be statistically significant.

In [None]:
master_men_df = master_total[master_total['gender']==1]
master_women_df = master_total[master_total['gender']==0]

stats.ttest_ind(master_men_df.total_semester_count, master_women_df.total_semester_count)

The test yields a **p-value = 0.6913**, which means there is a 69.13% chance we'd see sample data at least this far apart if the two groups actually had the same mean. At a 95% confidence level we **fail to reject the null hypothesis**, since the p-value is greater than the corresponding significance level of 5%.
**We definitely cannot write a paper based on such results.**