# Homework 2: Data from the Web

In this homework, we fetch information from IS-Academia, the EPFL academic website. The idea is to read data from the web page in HTML format, parse it with the external library BeautifulSoup, and analyze the resulting data (the computer science students) to extract some statistical information.
The work can be broken down into two main activities:
- Fetching the data
- Analyzing the data

# Fetching data from IS-Academia

We first import the libraries needed in this notebook.

In [41]:
#Usual imports
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import seaborn as sns
#sns.set_context('notebook')

#Specific imports for data fetching
import requests #HTTP requests
from bs4 import BeautifulSoup as BSoup #HTML parsing

Let's start by defining the URLs of the web pages from which we are going to fetch all the data, i.e. IS-Academia.

In [42]:
# URL of the empty IS-Academia form that lists students
main_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_x_GPS=-1&ww_i_reportModel=133685247'

# URL of the form, with placeholder fields that we replace depending on the information we want to extract, e.g. {ACADEMIC_PERIOD_KEY}
form_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD={ACADEMIC_UNIT_KEY}&ww_x_PERIODE_ACAD={ACADEMIC_PERIOD_KEY}&ww_x_PERIODE_PEDAGO={PEDAGOGIC_PERIOD_KEY}&ww_x_HIVERETE={HIVERETE_KEY}'

The data is served through a web form, so the URL has to change depending on which list of students we want. We first look up the specific keys for each section, academic year, semester and season; replacing them in the "form_url" above yields the corresponding student list. The fields (or parameters) to change in the URL are written in braces {}, for example {ACADEMIC_UNIT_KEY}.

In [43]:
# We get the raw data of the main page
ugly_html = requests.get(main_url)
    
# We parse it using BeautifulSoup
beautiful_html = BSoup(ugly_html.text, 'html.parser')

section_keys = {}
year_keys = {}
semester_keys = {}
season_keys = {}

# We store all the keys of the sections
unite_html = beautiful_html.find('select', {'name': 'ww_x_UNITE_ACAD'})
for unite in unite_html.find_all('option'):
    section_keys[unite.text] = unite.get('value')

# We store all the keys of the academic years
acad_period_html = beautiful_html.find('select', {'name': 'ww_x_PERIODE_ACAD'})
for period in acad_period_html.find_all('option'):
    year_keys[period.text] = period.get('value')

# We store all the keys of the academic semester (Bachelor and Master)
peda_period_html = beautiful_html.find('select', {'name': 'ww_x_PERIODE_PEDAGO'})
for peda in peda_period_html.find_all('option'):
    semester_keys[peda.text] = peda.get('value')

# We store all the keys of the seasons (Autumn / Spring)
season_html = beautiful_html.find('select', {'name': 'ww_x_HIVERETE'})
for season in season_html.find_all('option'):
    season_keys[season.text] = season.get('value')
    

We can verify that we have all the keys that we need:

(Note that the first entry is empty. It corresponds to the case where the user does not specify a value for this field of the search form.)

In [44]:
section_keys

{'': 'null',
 'Architecture': '942293',
 'Chimie et génie chimique': '246696',
 'Cours de mathématiques spéciales': '943282',
 'EME (EPFL Middle East)': '637841336',
 'Génie civil': '942623',
 'Génie mécanique': '944263',
 'Génie électrique et électronique ': '943936',
 'Humanités digitales': '2054839157',
 'Informatique': '249847',
 'Ingénierie financière': '120623110',
 'Management de la technologie': '946882',
 'Mathématiques': '944590',
 'Microtechnique': '945244',
 'Physique': '945571',
 'Science et génie des matériaux': '944917',
 "Sciences et ingénierie de l'environnement": '942953',
 'Sciences et technologies du vivant': '945901',
 'Section FCUE': '1574548993',
 'Systèmes de communication': '946228'}

In [45]:
year_keys

{'': 'null',
 '2007-2008': '978181',
 '2008-2009': '978187',
 '2009-2010': '978195',
 '2010-2011': '39486325',
 '2011-2012': '123455150',
 '2012-2013': '123456101',
 '2013-2014': '213637754',
 '2014-2015': '213637922',
 '2015-2016': '213638028',
 '2016-2017': '355925344'}

In [46]:
semester_keys

{'': 'null',
 'Bachelor semestre 1': '249108',
 'Bachelor semestre 2': '249114',
 'Bachelor semestre 3': '942155',
 'Bachelor semestre 4': '942163',
 'Bachelor semestre 5': '942120',
 'Bachelor semestre 5b': '2226768',
 'Bachelor semestre 6': '942175',
 'Bachelor semestre 6b': '2226785',
 'Master semestre 1': '2230106',
 'Master semestre 2': '942192',
 'Master semestre 3': '2230128',
 'Master semestre 4': '2230140',
 'Mineur semestre 1': '2335667',
 'Mineur semestre 2': '2335676',
 'Mise à niveau': '2063602308',
 'Projet Master automne': '249127',
 'Projet Master printemps': '3781783',
 'Semestre automne': '953159',
 'Semestre printemps': '2754553',
 'Stage automne 3ème année': '953137',
 'Stage automne 4ème année': '2226616',
 'Stage printemps 3ème année': '983606',
 'Stage printemps 4ème année': '2226626',
 'Stage printemps master': '2227132'}

In [47]:
season_keys

{'': 'null',
 "Semestre d'automne": '2936286',
 'Semestre de printemps': '2936295'}
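The four extraction loops above share one pattern, so they could be folded into a single helper. The following is a minimal sketch, not the notebook's own code: the helper name `extract_option_keys` and the inline HTML fragment are hypothetical, and the fragment only imitates the structure of the real form.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure of the IS-Academia form
SAMPLE_FORM = """
<form>
  <select name="ww_x_HIVERETE">
    <option value="null"></option>
    <option value="2936286">Semestre d'automne</option>
    <option value="2936295">Semestre de printemps</option>
  </select>
</form>
"""

def extract_option_keys(soup, select_name):
    """Map each option label of the named <select> to its value attribute."""
    select = soup.find('select', {'name': select_name})
    if select is None:
        # The field is absent from the page: return an empty mapping
        return {}
    return {option.text: option.get('value') for option in select.find_all('option')}

soup = BeautifulSoup(SAMPLE_FORM, 'html.parser')
season_keys_demo = extract_option_keys(soup, 'ww_x_HIVERETE')
print(season_keys_demo)
```

Calling the helper once per field name would replace the four loops, and the `None` check avoids an `AttributeError` if a field is missing from the page.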

We create a function that builds the URL for the list of students we want, for example 'Informatique / 2007-2008 / Bachelor semestre 1 / Semestre d'automne'.

In [48]:
def getSpecificURL(section, years, semester, season):
    new_url = form_url
    new_url = new_url.replace('{ACADEMIC_UNIT_KEY}', str(section_keys[section]))
    new_url = new_url.replace('{ACADEMIC_PERIOD_KEY}', str(year_keys[years]))
    new_url = new_url.replace('{PEDAGOGIC_PERIOD_KEY}', str(semester_keys[semester]))
    new_url = new_url.replace('{HIVERETE_KEY}', str(season_keys[season]))
    return new_url

In [49]:
getSpecificURL('Informatique', '2007-2008', 'Bachelor semestre 1', 'Semestre d\'automne')

'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=978181&ww_x_PERIODE_PEDAGO=249108&ww_x_HIVERETE=2936286'

If you paste this URL into your favorite browser, you will obtain the list of students for this specific year and semester.
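As an alternative to string replacement, the same URL can be assembled from a dictionary of query parameters with the standard library. This is only a sketch: the function name `build_report_url` is hypothetical, while the parameter names and key values are the ones shown above.

```python
from urllib.parse import urlencode

BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml'

def build_report_url(unit_key, year_key, semester_key, season_key):
    """Assemble the report URL from its query parameters."""
    params = {
        'ww_x_GPS': -1,
        'ww_i_reportModel': 133685247,
        'ww_i_reportModelXsl': 133685270,
        'ww_x_UNITE_ACAD': unit_key,
        'ww_x_PERIODE_ACAD': year_key,
        'ww_x_PERIODE_PEDAGO': semester_key,
        'ww_x_HIVERETE': season_key,
    }
    # urlencode keeps the dictionary order and escapes values when needed
    return BASE_URL + '?' + urlencode(params)

# Informatique / 2007-2008 / Bachelor semestre 1 / Semestre d'automne
url = build_report_url('249847', '978181', '249108', '2936286')
print(url)
```

With these keys the result is identical to the URL printed above; the advantage is that no placeholder can be left unreplaced by mistake.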

Before extracting the data we want, we still have to define a function that parses the list of students (given as HTML) into a well-structured DataFrame. To do so, we define the function parseTable:

In [50]:
def parseTable(table):
    # Collect all the rows (<tr>) of the table
    lines = table.find_all('tr')
    N = len(lines)
    output = []

    # Skip the first two rows and the last row of the table
    for i in range(2, N-1):
        # Collect all the cells (<td>) of the current row
        line = lines[i]
        cells = line.find_all('td')
        M = len(cells)
        if M > 0:
            gender = cells[0].text
            name = cells[1].text
            sciper = cells[10].text
            status = cells[7].text
            output.append({'gender':gender, 'name':name, 'sciper':sciper, 'status':status})

    return pd.DataFrame(output)
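To see the row and cell indexing at work without querying the server, here is a small sketch on a hand-built table. The layout of the sample (two leading rows, one trailing row, 11 cells per student) is an assumption inferred from the loop bounds and indices above, not verified against the real page, and the student is made up.

```python
from bs4 import BeautifulSoup

def make_row(cells, tag='td'):
    """Build one <tr> from a list of cell strings (helper for this demo only)."""
    return '<tr>' + ''.join('<{0}>{1}</{0}>'.format(tag, c) for c in cells) + '</tr>'

# Hypothetical 11-cell student row: only indices 0, 1, 7 and 10 are read
student = ['Monsieur', 'Doe John', '', '', '', '', '', 'Présent', '', '', '123456']

SAMPLE_TABLE = (
    '<table>'
    + make_row(['Title'], tag='th')      # row 0: skipped
    + make_row(['h'] * 11, tag='th')     # row 1: skipped
    + make_row(student)                  # row 2: the only data row
    + make_row(['footer'])               # last row: skipped
    + '</table>'
)

table = BeautifulSoup(SAMPLE_TABLE, 'html.parser').find('table')
lines = table.find_all('tr')

records = []
for line in lines[2:-1]:                 # same span as range(2, N-1)
    cells = line.find_all('td')
    if len(cells) > 10:                  # guard against rows that are too short
        records.append({
            'gender': cells[0].text,
            'name': cells[1].text,
            'status': cells[7].text,
            'sciper': cells[10].text,
        })

print(records)
```

The `len(cells) > 10` guard is stricter than a non-empty check: it prevents an `IndexError` on any row with fewer than 11 cells.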

Now that we have all the tools needed to extract the data correctly, we can start fetching it from IS-Academia. First, we have to decide which data is of interest. We focus only on the 'Informatique' section and on the students who have finished their bachelor degree. We therefore store only the records of 'Informatique' students with *'Bachelor semestre 1'*, *'Bachelor semestre 5'* or *'Bachelor semestre 6'* entries. We have to keep the *'Bachelor semestre 5'* entries because a student can finish the degree in semester 5 when repeating only half a year.

Note also that we do not take the academic year 2016-2017 into account, because the students have not completed their curriculum yet.

In [51]:
spec_section = ['Informatique']
spec_year = ['2007-2008', '2008-2009', '2009-2010', '2010-2011', '2011-2012', '2012-2013', '2013-2014', '2014-2015', '2015-2016']
spec_semester = ['Bachelor semestre 1','Bachelor semestre 5','Bachelor semestre 6']
spec_season = ['Semestre d\'automne', 'Semestre de printemps']

In [52]:
all_data = []

for section in spec_section:
    for year in spec_year:
        for semester in spec_semester:
            for season in spec_season:
                spec_url = getSpecificURL(section, year, semester, season)
                spec_html = requests.get(spec_url)
                spec_beautiful_html = BSoup(spec_html.text, 'html.parser')
                
                student_table = spec_beautiful_html.find('table')
                student_data = parseTable(student_table)
                student_data['section'] = section
                student_data['year'] = year
                student_data['semester'] = semester
                student_data['season'] = season
                
                if not student_data.empty:
                    all_data.append(student_data)
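The four nested loops above enumerate every combination of section, year, semester and season. A flatter way to write the same enumeration is `itertools.product`; the sketch below only builds the combinations and sends no request.

```python
from itertools import product

spec_section = ['Informatique']
# '2007-2008' up to '2015-2016'
spec_year = ['{}-{}'.format(y, y + 1) for y in range(2007, 2016)]
spec_semester = ['Bachelor semestre 1', 'Bachelor semestre 5', 'Bachelor semestre 6']
spec_season = ["Semestre d'automne", 'Semestre de printemps']

# Same order as the nested loops: the last list varies fastest
combinations = list(product(spec_section, spec_year, spec_semester, spec_season))

print(len(combinations))   # 1 * 9 * 3 * 2 = 54 pages to request
print(combinations[0])
```

Each tuple can be unpacked as `section, year, semester, season` in a single `for` loop, which also makes the number of requests explicit before any of them is sent.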

In [53]:
all_data[0].head(10)

Unnamed: 0,gender,name,sciper,status,section,year,semester,season
0,Monsieur,Arévalo Christian,169569,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
1,Monsieur,Aubelle Flavien,174905,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
2,Monsieur,Badoud Morgan,173922,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
3,Monsieur,Baeriswyl Jonathan,179406,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
4,Monsieur,Barroco Michael,179428,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
5,Monsieur,Belfis Nicolas,179324,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
6,Monsieur,Beliaev Stanislav,174597,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
7,Monsieur,Bindschaedler Vincent,179449,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
8,Monsieur,Bloch Marc-Olivier,178553,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
9,Monsieur,Bloch Remi,179426,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne


In [54]:
all_data_frame = pd.concat(all_data)
all_data_frame.shape

(2868, 8)

To avoid unnecessary processing, we store all the data in a CSV file.

In [56]:
# The separator is passed by keyword (positional use is deprecated in recent pandas)
all_data_frame.to_csv('data_ba.csv', sep=',', index=False)
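A small round trip shows what to expect when the file is read back. This is a sketch on made-up rows, using an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'name': ['Doe John', 'Roe Jane'],      # made-up students
    'sciper': ['123456', '654321'],        # scraped values are strings
})

buffer = io.StringIO()
toy.to_csv(buffer, sep=',', index=False)   # index=False: no extra index column
buffer.seek(0)
restored = pd.read_csv(buffer, sep=',')

print(list(restored.columns))
print(restored['sciper'].dtype)
```

Two points are worth noticing: without `index=False` the reloaded frame would gain an extra unnamed column, and the sciper values, stored as text when scraped, come back as integers after `read_csv`.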

For further work, we will also need the master students. We can therefore repeat the process above and store this new data in another file. However, the data has a different structure for master students: for example, it is interesting to store each student's specialization and minor (if any).

We therefore need to define a separate parser for the list of master students.

In [76]:
def parseTable_ma(table):
    # Collect all the rows (<tr>) of the table
    lines = table.find_all('tr')
    N = len(lines)
    output = []

    # Skip the first two rows and the last row of the table
    for i in range(2, N-1):
        # Collect all the cells (<td>) of the current row
        line = lines[i]
        cells = line.find_all('td')
        M = len(cells)
        if M > 0:
            gender = cells[0].text
            name = cells[1].text
            sciper = cells[10].text
            status = cells[7].text
            special = cells[4].text
            mineur = cells[6].text
            output.append({'gender':gender, 'name':name, 'sciper':sciper, 'status':status, 'specialization': special, 'mineur':mineur})

    return pd.DataFrame(output)

In [74]:
spec_semester_ma = ['Master semestre 1','Master semestre 2','Master semestre 3','Master semestre 4']

In [77]:
all_data_ma = []

for section in spec_section:
    for year in spec_year:
        for semester in spec_semester_ma:
            for season in spec_season:
                spec_url = getSpecificURL(section, year, semester, season)
                spec_html = requests.get(spec_url)
                spec_beautiful_html = BSoup(spec_html.text, 'html.parser')
                
                student_table = spec_beautiful_html.find('table')
                student_data = parseTable_ma(student_table)
                student_data['section'] = section
                student_data['year'] = year
                student_data['semester'] = semester
                student_data['season'] = season
                
                if not student_data.empty:
                    all_data_ma.append(student_data)

In [78]:
all_data_frame_ma = pd.concat(all_data_ma)
all_data_frame_ma.shape

(2431, 10)

In [83]:
all_data_frame_ma.head(15)

Unnamed: 0,gender,mineur,name,sciper,specialization,status,section,year,semester,season
0,Monsieur,,Aeberhard François-Xavier,153066,,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne
1,Madame,,Agarwal Megha,180027,,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne
2,Monsieur,,Anagnostaras David,152232,,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne
3,Monsieur,,Auroux Damien,177395,,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne
4,Monsieur,,Awalebo Joseph,161970,,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne
5,Monsieur,,Balet Ken,166258,,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne
6,Monsieur,,Barazzutti Raphaël Pierre,173600,,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne
7,Monsieur,,Bayramoglu Ersoy,178879,,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne
8,Madame,,Benabdallah Zeineb,154573,,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne
9,Monsieur,,Bettex Marc,160492,,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne


Again, we store this data in a CSV file.

In [80]:
all_data_frame_ma.to_csv('data_ma.csv', sep=',', index=False)

# Assignments: Analyzing the data

Now that we have fetched all the data of interest, we can start analyzing it and try to extract interesting information.

The data can be loaded without querying the server again, because everything was previously stored in CSV files.

In [32]:
all_data_frame = pd.read_csv('data_ba.csv', sep=',')
all_data_frame.shape

(2868, 8)

In [31]:
all_data_frame.head()

Unnamed: 0,gender,name,sciper,status,section,year,semester,season
0,Monsieur,Arévalo Christian,169569,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
1,Monsieur,Aubelle Flavien,174905,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
2,Monsieur,Badoud Morgan,173922,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
3,Monsieur,Baeriswyl Jonathan,179406,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
4,Monsieur,Barroco Michael,179428,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne


In [69]:
all_data_frame_ma = pd.read_csv('data_ma.csv', sep=',')
all_data_frame_ma.shape

(2431, 8)

In [70]:
all_data_frame_ma.head()

Unnamed: 0,gender,name,sciper,status,section,year,semester,season
0,Monsieur,Aeberhard François-Xavier,153066,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne
1,Madame,Agarwal Megha,180027,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne
2,Monsieur,Anagnostaras David,152232,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne
3,Monsieur,Auroux Damien,177395,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne
4,Monsieur,Awalebo Joseph,161970,Présent,Informatique,2007-2008,Master semestre 1,Semestre d'automne


## Task 1: Bachelor students

In this first task, we consider only the students who have finished their Bachelor degree and see how many months they took to complete it. We then split the students into men and women and observe whether the difference in duration is statistically significant.

In [206]:
data = pd.read_csv('data_ba.csv', sep=',')

First, we keep only the rows where the student's status is 'Présent'.

In [207]:
# remove all people that are currently away
data = data[data.status == 'Présent']

Then, we sort the records by sciper and check whether each student has an entry for *'Bachelor semestre 6'*; students without one are removed. We can then look up the years of the first and last semesters spent at EPFL. With this methodology, we also take into account the students who finished their Bachelor degree after repeating only *'Bachelor semestre 5'*.

Note that we cannot know whether a student actually graduated after *'Bachelor semestre 6'* or failed at the last step. We therefore assume that every student with a *'Bachelor semestre 6'* entry graduated at some point.

In [208]:
# Sort people by sciper, then year, then semester (ascending)
data = data.sort_values(by=['sciper', 'year', 'semester'])

# Select only the people for whom we have a 'Bachelor semestre 1' entry
starting_scipers = data[data.semester == 'Bachelor semestre 1'].sciper.drop_duplicates()

# Select only the people who have completed their degree (i.e. have a 'Bachelor semestre 6' entry)
graduated_scipers = data[data.semester == 'Bachelor semestre 6'].sciper.drop_duplicates()

# Take the intersection of the two lists of scipers above
correct_scipers = pd.Series(list(set(starting_scipers).intersection(set(graduated_scipers)))).sort_values(ascending=True)
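The selection logic can be checked on a few made-up records. In this sketch, student 1 has both a first and a sixth semester, student 2 never reaches semester 6, and student 3 has no first-semester entry, so only student 1 should remain.

```python
import pandas as pd

# Made-up records, not taken from the real data
toy = pd.DataFrame({
    'sciper':   [1, 1, 2, 3, 3, 3],
    'semester': ['Bachelor semestre 1', 'Bachelor semestre 6',
                 'Bachelor semestre 1',
                 'Bachelor semestre 5', 'Bachelor semestre 6', 'Bachelor semestre 6'],
})

started = set(toy.loc[toy.semester == 'Bachelor semestre 1', 'sciper'])
finished = set(toy.loc[toy.semester == 'Bachelor semestre 6', 'sciper'])

# Keep only the students present in both sets
complete = sorted(int(s) for s in started & finished)
print(complete)   # [1]

# The resulting list can be used directly as a row filter
kept = toy[toy.sciper.isin(complete)]
print(len(kept))  # 2
```

Building the sets directly makes `drop_duplicates` unnecessary, since a set already holds each sciper once.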

In [209]:
data = data[data['sciper'].isin(correct_scipers)]
data.head(12)

Unnamed: 0,gender,name,sciper,status,section,year,semester,season
0,Monsieur,Arévalo Christian,169569,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
490,Monsieur,Arévalo Christian,169569,Présent,Informatique,2009-2010,Bachelor semestre 5,Semestre d'automne
564,Monsieur,Arévalo Christian,169569,Présent,Informatique,2009-2010,Bachelor semestre 6,Semestre de printemps
44,Monsieur,Knecht Mathieu,169731,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
859,Monsieur,Knecht Mathieu,169731,Présent,Informatique,2010-2011,Bachelor semestre 6,Semestre de printemps
76,Monsieur,Scheiben Pascal,169795,Présent,Informatique,2007-2008,Bachelor semestre 1,Semestre d'automne
553,Monsieur,Scheiben Pascal,169795,Présent,Informatique,2009-2010,Bachelor semestre 5,Semestre d'automne
615,Monsieur,Scheiben Pascal,169795,Présent,Informatique,2009-2010,Bachelor semestre 6,Semestre de printemps
823,Monsieur,Scheiben Pascal,169795,Présent,Informatique,2010-2011,Bachelor semestre 5,Semestre d'automne
876,Monsieur,Scheiben Pascal,169795,Présent,Informatique,2010-2011,Bachelor semestre 6,Semestre de printemps


In [231]:
# Find starting semester of the Bachelor. We keep the first one if the student has redone his first year.
startData = data[data.semester == 'Bachelor semestre 1'].copy()
startData.drop_duplicates(subset=['sciper'], keep='first', inplace = True)

# Find ending semester of the Bachelor. We keep the last one if the student has redone some semesters.
endData = data[data['semester'].isin(['Bachelor semestre 6', 'Bachelor semestre 5'])].copy()
endData.drop_duplicates(subset=['sciper'], keep='last', inplace = True)

In [240]:
# Renaming the columns
startData = startData.rename(columns={'year': 'startYear', 'season': 'startSeason'})
endData = endData.rename(columns={'year': 'finalYear', 'season': 'finalSeason'})

# Fusion of the two dataframes
startEndData = pd.merge(startData, endData[['sciper', 'finalYear', 'finalSeason']], how='inner', left_on='sciper', right_on='sciper')
startEndData.drop(['semester'], axis=1, inplace = True)
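The two steps above (keeping the first start and the last end per student, then merging) can be traced on a single made-up student who repeated both the first and the last year. This is a simplified sketch that only looks at *'Bachelor semestre 6'* for the end of the degree.

```python
import pandas as pd

# Made-up records for one student, sorted like the real data
toy = pd.DataFrame({
    'sciper':   [7, 7, 7, 7],
    'year':     ['2007-2008', '2008-2009', '2010-2011', '2011-2012'],
    'semester': ['Bachelor semestre 1', 'Bachelor semestre 1',
                 'Bachelor semestre 6', 'Bachelor semestre 6'],
    'season':   ["Semestre d'automne", "Semestre d'automne",
                 'Semestre de printemps', 'Semestre de printemps'],
}).sort_values(by=['sciper', 'year', 'semester'])

# Earliest 'Bachelor semestre 1' row per student
start = (toy[toy.semester == 'Bachelor semestre 1']
         .drop_duplicates(subset=['sciper'], keep='first')
         .rename(columns={'year': 'startYear', 'season': 'startSeason'}))

# Latest 'Bachelor semestre 6' row per student
end = (toy[toy.semester == 'Bachelor semestre 6']
       .drop_duplicates(subset=['sciper'], keep='last')
       .rename(columns={'year': 'finalYear', 'season': 'finalSeason'}))

# One row per student with both ends of the degree
merged = pd.merge(start[['sciper', 'startYear', 'startSeason']],
                  end[['sciper', 'finalYear', 'finalSeason']],
                  how='inner', on='sciper')
print(merged)
```

The `keep='first'` and `keep='last'` choices only give the earliest and latest entries because the frame was sorted by year beforehand.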

We can now observe that 294 students completed their entire degree in *'Informatique'* between 2007 and 2016.

In [233]:
startEndData.shape

(294, 9)

What interests us, however, is how many semesters each of these graduates took to get their diploma. For that, let's define a function that computes the number of semesters from the student's starting and final year/season.

In [237]:
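def getNumberOfBachelorSemester(startYear, startSeason, endYear, endSeason):
    # Calendar years spanned, e.g. '2007-2008' to '2009-2010' gives 2010 - 2007 = 3
    year_diff = int(endYear.split('-')[1]) - int(startYear.split('-')[0])
    if startSeason == endSeason:
        # Same season at both ends: one of the spanned academic years counts for a single semester
        semester_number = (2 * year_diff) - 1
    else:
        semester_number = (2 * year_diff)
    return semester_number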
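Two worked cases make the arithmetic concrete. The sketch below restates the counting rule under a hypothetical name, `count_semesters`, so that the block runs on its own.

```python
def count_semesters(start_year, start_season, end_year, end_season):
    """Restatement of the counting rule for 'YYYY-YYYY' academic years."""
    year_diff = int(end_year.split('-')[1]) - int(start_year.split('-')[0])
    if start_season == end_season:
        return 2 * year_diff - 1
    return 2 * year_diff

AUTUMN = "Semestre d'automne"
SPRING = 'Semestre de printemps'

# Autumn 2007 to spring 2010: (2010 - 2007) * 2 = 6 semesters
print(count_semesters('2007-2008', AUTUMN, '2009-2010', SPRING))  # 6

# Autumn 2007 to autumn 2010: (2011 - 2007) * 2 - 1 = 7 semesters
print(count_semesters('2007-2008', AUTUMN, '2010-2011', AUTUMN))  # 7
```

The rule is exact when the degree starts in autumn, which is presumably the case for *'Bachelor semestre 1'*. A start in spring combined with an end in autumn would be over-counted by two semesters, so such rows, if any exist, would need separate handling.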


In [242]:
startEndData['semesterNumber'] = startEndData.apply(lambda row: getNumberOfBachelorSemester(row['startYear'], row['startSeason'],row['finalYear'], row['finalSeason']), axis=1)
startEndData.head(7)

Unnamed: 0,gender,name,sciper,status,section,startYear,startSeason,finalYear,finalSeason,semesterNumber
0,Monsieur,Arévalo Christian,169569,Présent,Informatique,2007-2008,Semestre d'automne,2009-2010,Semestre de printemps,6
1,Monsieur,Knecht Mathieu,169731,Présent,Informatique,2007-2008,Semestre d'automne,2010-2011,Semestre de printemps,8
2,Monsieur,Scheiben Pascal,169795,Présent,Informatique,2007-2008,Semestre d'automne,2011-2012,Semestre d'automne,9
3,Monsieur,Richter Arnaud,171195,Présent,Informatique,2007-2008,Semestre d'automne,2009-2010,Semestre de printemps,6
4,Monsieur,Buchschacher Nicolas,171619,Présent,Informatique,2007-2008,Semestre d'automne,2009-2010,Semestre de printemps,6
5,Monsieur,Aubelle Flavien,174905,Présent,Informatique,2007-2008,Semestre d'automne,2011-2012,Semestre de printemps,10
6,Monsieur,Hanser Valérian,175190,Présent,Informatique,2007-2008,Semestre d'automne,2010-2011,Semestre d'automne,7


**We now have the number of semesters (of 6 months each) that each student took to finish the Bachelor degree.** (6 semesters, i.e. 3 years, is the minimum)

Let's now separate the students by gender and compute the mean time they took to complete their curriculum.

In [253]:
dataMen = startEndData[startEndData.gender == 'Monsieur']
nbrMen = dataMen.shape[0]
nbrMen

270

In [254]:
dataWomen = startEndData[startEndData.gender == 'Madame']
nbrWomen = dataWomen.shape[0]
nbrWomen

24

Note that women are a small minority in the 'Informatique' section: 24 of these 294 students (about 8%).
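The mean duration per gender can then be obtained in one step with a `groupby`. The following is a minimal sketch on made-up numbers, assuming only the `gender` and `semesterNumber` columns of startEndData; the values are not taken from the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    'gender': ['Monsieur', 'Monsieur', 'Madame', 'Madame'],
    'semesterNumber': [6, 8, 6, 7],   # made-up values for illustration
})

# Group size and mean number of semesters for each gender
summary = toy.groupby('gender')['semesterNumber'].agg(['count', 'mean'])
print(summary)
```

Reporting the group size next to the mean matters here, because a mean computed over 24 students is much less stable than one computed over 270.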