Obtain all the data for the Master students, starting from 2007. Compute how many months it took each master student to complete their master, for those that completed it. Partition the data between male and female students, and compute the average -- is the difference in average statistically significant?
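
Once per-student durations are available, the significance question could be approached with a two-sample test. The block below is a minimal sketch of a two-sided permutation test on the difference of means, using only the standard library. The function name `permutation_test` and the toy numbers are assumptions made for illustration; they are not IS-Academia data, and another test (for instance a t-test) may suit the real data better.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns the observed difference and the share of random relabelings
    whose absolute difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and split them into two pseudo-groups
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations

# Made-up durations in months, for illustration only (not IS-Academia data)
toy_female = [24, 24, 30, 24, 36]
toy_male = [24, 30, 30, 24, 36, 24]
observed, p_value = permutation_test(toy_female, toy_male)
print(f"difference in means: {observed:.2f} months, p-value: {p_value:.3f}")
```

A small p-value (commonly below 0.05) would suggest that the difference is unlikely to come from random labeling alone.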

Note that the master students' data is trickier than the bachelors', as many records are missing from the IS-Academia database. Therefore, try to estimate how much time a master student spent at EPFL by at least checking the distance in months between Master semestre 1 and Master semestre 2. If the Mineur field is not empty, the student should also appear registered in Master semestre 3. Last but not least, don't forget to check whether the student also has an entry in the Projet Master tables. Once you can handle this data well, compute the "average stay at EPFL" for master students. Now extract all the students with a Spécialisation and compute the "average stay" for each category of that attribute. Compared to the general average, can you find any specialization for which the difference in average is statistically significant?
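
One possible way to turn a student's registrations into a duration is sketched below. It assumes that every semester slot counts for six months and that a stay runs from the first to the last semester in which the student appears, so a missing record in between does not shorten the estimate. The helper names `semester_index` and `estimated_stay_months` are hypothetical, and so is the example student.

```python
def semester_index(academic_year, season):
    """Map e.g. ('2007-2008', 'autumn') to a running half-year index."""
    start_year = int(academic_year.split('-')[0])
    return 2 * start_year + (0 if season == 'autumn' else 1)

def estimated_stay_months(registrations, months_per_semester=6):
    """Estimate a stay from the first to the last registered semester.

    `registrations` is an iterable of (academic_year, season) pairs.
    Assumption: each semester slot lasts `months_per_semester` months.
    """
    indices = [semester_index(year, season) for year, season in registrations]
    if not indices:
        return None
    return (max(indices) - min(indices) + 1) * months_per_semester

# Hypothetical student registered in four consecutive semesters
example = [('2007-2008', 'autumn'), ('2007-2008', 'spring'),
           ('2008-2009', 'autumn'), ('2008-2009', 'spring')]
print(estimated_stay_months(example))
```

With this rule, a student seen only in autumn 2007-2008 and autumn 2008-2009 is still counted as having stayed three semesters, which is one way to cope with the missing records.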

In [118]:
# Requests : make http requests to websites
import requests
# BeautifulSoup : parser to manipulate easily html content
from bs4 import BeautifulSoup
# Regular expressions
import re
# Aren't pandas awesome ?
import pandas as pd

Let's fetch the first page, from which we will extract the values of the form parameters!

In [119]:
# Ask for the first page on IS Academia. To see it, just type it on your browser address bar : http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247
r = requests.get('http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247')
htmlContent = BeautifulSoup(r.content, 'html.parser')

In [120]:
print(htmlContent.prettify())

<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
   <div>
   </div>
   <title>
   </title>
   <script src="GEDPUBLICREPORTS.txt?ww_x_path=Gestac.Base.Palette_js&amp;ww_c_langue=fr" type="text/javascript">
   </script>
   <link href="GEDPUBLICREPORTS.css?ww_x_path=Gestac.Moniteur.Style" rel="stylesheet" type="text/css">
    <link href="GEDPUBLICREPORTS.css?ww_x_path=Gestac.Moniteur.StyleNavigator" rel="stylesheet" type="text/css"/>
   </link>
  </meta>
 </head>
 <body alink="#666666" bgcolor="#ffffff" link="#666666" marginheight="0" marginwidth="5" vlink="#666666">
  <div class="filtres">
   <form action="!GEDPUBLICREPORTS.filter" method="GET" name="f">
    <input name="ww_b_list" type="hidden" value="1">
     <input name="ww_i_reportmodel" type="hidden" value="133685247">
      <input name="ww_c_langue" type="hidden" value="">
       <h1 id="titre">
        Liste des étudiants inscrits par semestre
       </h1>
       <table border="0" id="format">
 

Now we need to make further requests to IS Academia that specify every parameter: computer science students, all the academic years, and all the master semesters (each identified by a pair of values: pedagogic period and semester type). So let's first collect all the parameter values we need for those requests:

In [121]:
# We first get the "Computer science" value
computerScienceField = htmlContent.find('option', string='Informatique')
computerScienceField

<option value="249847">Informatique</option>

In [122]:
computerScienceValue = computerScienceField.get('value')
computerScienceValue

'249847'

In [123]:
# Then, we're going to need all the academic years values.
academicYearsField = htmlContent.find('select', attrs={'name':'ww_x_PERIODE_ACAD'})
academicYearsSet = academicYearsField.findAll('option')

# Since there are several years to remember, we store all of them in a list to use them later
academicYearValues = []
# We also store the textual content in a list ("2007-2008", "2008-2009"...)
academicYearContent = []

for option in academicYearsSet:
    value = option.get('value')
    # However, we don't want any "null" value
    if value != 'null':
        academicYearValues.append(value)
        academicYearContent.append(option.text)

In [181]:
# Now we have all the academic years that might interest us. We wrangle them a little so that we can build the requests more easily later.
academicYearValues_series = pd.Series(academicYearValues)
academicYearContent_series = pd.Series(academicYearContent)
academicYear_df = pd.concat([academicYearContent_series, academicYearValues_series], axis = 1)
academicYear_df.columns= ['Academic_year', 'Value']
academicYear_df = academicYear_df.sort_values(['Academic_year', 'Value'], ascending=[True, False])
academicYear_df

  Academic_year      Value
9     2007-2008     978181
8     2008-2009     978187
7     2009-2010     978195
6     2010-2011   39486325
5     2011-2012  123455150
4     2012-2013  123456101
3     2013-2014  213637754
2     2014-2015  213637922
1     2015-2016  213638028
0     2016-2017  355925344


In [125]:
# Then, let's get all the pedagogic periods we need. It's a little more complicated here because we need to link each pedagogic period with a season (eg : Master semestre 1 is autumn, Master semestre 2 is spring etc.)
# Thus, we need more than the pedagogic values: to associate each period with the right season, we also need the actual textual value ("Master semestre 1", "Master semestre 2" etc.)
pedagogicPeriodsField = htmlContent.find('select', attrs={'name':'ww_x_PERIODE_PEDAGO'})
pedagogicPeriodsSet = pedagogicPeriodsField.findAll('option')

# Same as above, we store the values in a list
pedagogicPeriodValues = []
# We also store the textual content in a list ("Master semestre 1", "Master semestre 2"...)
pedagogicPeriodContent = []

for option in pedagogicPeriodsSet:
    value = option.get('value')
    if value != 'null':
        pedagogicPeriodValues.append(value)
        pedagogicPeriodContent.append(option.text)

In [213]:
# Pair each textual label with its value in a single DataFrame
pedagogicPeriodContent_series = pd.Series(pedagogicPeriodContent)
pedagogicPeriodValues_series = pd.Series(pedagogicPeriodValues)
pedagogicPeriod_df = pd.concat([pedagogicPeriodContent_series, pedagogicPeriodValues_series], axis = 1)
pedagogicPeriod_df.columns = ['Pedagogic_period', 'Value']

In [238]:
# We keep all semesters related to master students: regular semesters, minors and master projects
pedagogicPeriod_df_master = pedagogicPeriod_df[pedagogicPeriod_df.Pedagogic_period.str.startswith('Master')]
pedagogicPeriod_df_minor = pedagogicPeriod_df[pedagogicPeriod_df.Pedagogic_period.str.startswith('Mineur')]
pedagogicPeriod_df_project = pedagogicPeriod_df[pedagogicPeriod_df.Pedagogic_period.str.startswith('Projet Master')]

pedagogicPeriod_df = pd.concat([pedagogicPeriod_df_master, pedagogicPeriod_df_minor, pedagogicPeriod_df_project])
pedagogicPeriod_df

           Pedagogic_period    Value
8         Master semestre 1  2230106
9         Master semestre 2   942192
10        Master semestre 3  2230128
11        Master semestre 4  2230140
12        Mineur semestre 1  2335667
13        Mineur semestre 2  2335676
15    Projet Master automne   249127
16  Projet Master printemps  3781783


In [128]:
# Lastly, we need to extract the values associated with autumn and spring semesters.
semesterTypeField = htmlContent.find('select', attrs={'name':'ww_x_HIVERETE'})
semesterTypeSet = semesterTypeField.findAll('option')

# Again, we store the values in a list
semesterTypeValues = []
# We also store the textual content in a list
semesterTypeContent = []

for option in semesterTypeSet:
    value = option.get('value')
    if value != 'null':
        semesterTypeValues.append(value)
        semesterTypeContent.append(option.text)
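
The three extraction loops above share the same shape, so they could be folded into one helper. The block below is a minimal sketch of that idea; the function name `select_options` is hypothetical, and the small HTML string is only a stand-in for the real IS Academia form (its two values are the ones this notebook reports for `ww_x_HIVERETE`).

```python
from bs4 import BeautifulSoup

def select_options(soup, select_name):
    """Return (labels, values) for the <option> tags of one <select>, skipping 'null'."""
    select = soup.find('select', attrs={'name': select_name})
    labels, values = [], []
    for option in select.find_all('option'):
        value = option.get('value')
        # As in the loops above, the placeholder option is ignored
        if value != 'null':
            labels.append(option.text)
            values.append(value)
    return labels, values

# Tiny stand-in for the IS Academia form, not the real page
sample_html = """
<select name="ww_x_HIVERETE">
  <option value="null"></option>
  <option value="2936286">Semestre d'automne</option>
  <option value="2936295">Semestre de printemps</option>
</select>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
labels, values = select_options(sample_soup, 'ww_x_HIVERETE')
print(labels, values)
```

The same call with `'ww_x_PERIODE_ACAD'` or `'ww_x_PERIODE_PEDAGO'` on `htmlContent` would then replace the other two loops.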

In [190]:
# Here are the values for autumn and spring semester :

semesterTypeValues_series = pd.Series(semesterTypeValues)
semesterTypeContent_series = pd.Series(semesterTypeContent)
semesterType_df = pd.concat([semesterTypeContent_series, semesterTypeValues_series], axis = 1)
semesterType_df.columns = ['Semester_type', 'Value']
semesterType_df

           Semester_type    Value
0     Semestre d'automne  2936286
1  Semestre de printemps  2936295


We now have all the information we need to retrieve every master student!
Let's build all the requests required to assemble our data.
We will send requests such as:
- Get students from master semester 1 of 2007-2008
- ...
- Get students from master semester 4 of 2007-2008
- Get students from mineur semester 1 of 2007-2008
- Get students from mineur semester 2 of 2007-2008
- Get students from the autumn master project (Projet Master automne) of 2007-2008
- Get students from the spring master project (Projet Master printemps) of 2007-2008

... and so on for each academic year until 2015-2016, the last complete year.
We can even take the first semester of 2016-2017 into account, to check whether some students we thought had finished last year are actually still studying. This can happen for different reasons: doing a minor, a project, repeating a semester...
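
The plan above amounts to combining every academic year with every pedagogic period. As a rough sketch, the combinations can be enumerated with `itertools.product`; the labels are retyped by hand here so that the block runs on its own, and the variable name `planned_requests` is hypothetical. Like the loop used later in this notebook, it pairs 2016-2017 with every period, not only its first semester.

```python
from itertools import product

# Academic years from 2007-2008 to 2016-2017, as listed in the table above
academic_years = [f"{year}-{year + 1}" for year in range(2007, 2017)]

# The eight pedagogic periods kept for master students
pedagogic_periods = [
    'Master semestre 1', 'Master semestre 2', 'Master semestre 3', 'Master semestre 4',
    'Mineur semestre 1', 'Mineur semestre 2',
    'Projet Master automne', 'Projet Master printemps',
]

# One planned request per (academic year, pedagogic period) pair
planned_requests = list(product(academic_years, pedagogic_periods))
print(len(planned_requests))
print(planned_requests[0])
```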

We can ask for a list of students in two formats: HTML or CSV.
We chose the HTML format because this is the first time we wrangle data in HTML, and learning it should prove useful for working with most websites in the future!
The request sent by the browser to IS Academia to get a list of students in HTML format looks like this:
http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?arg1=xxx&arg2=yyy
Here "xxx" is the value associated with the argument named "arg1", "yyy" the value associated with the argument named "arg2", etc. A real request usually has many more arguments.
To find them, we sent a request as a "human" through our browser and intercepted it with Postman Interceptor.
We found that the following arguments have to be sent:
ww_x_GPS = -1
ww_i_reportModel = 133685247
ww_i_reportModelXsl = 133685270
ww_x_UNITE_ACAD = 249847 (which is the value of computer science !)
ww_x_PERIODE_ACAD = X (eg : the value corresponding to 2007-2008 would be 978181)
ww_x_PERIODE_PEDAGO = Y (eg : 2230106 for Master semestre 1)
ww_x_HIVERETE = Z (eg : 2936286 for autumn semester)

The last three values X, Y and Z must be replaced with the ones we extracted previously. For instance, if we want to get students from Master, semester 1 (which is necessarily autumn semester) of 2007-2008, the "GET Request" would be the following :

http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=978181&ww_x_PERIODE_PEDAGO=2230106&ww_x_HIVERETE=2936286
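
As an alternative to gluing strings together, the query string can be assembled from a dictionary with the standard library's `urllib.parse.urlencode`. The sketch below rebuilds the example URL above; the names `BASE_URL` and `build_request_url` are hypothetical, while the parameter names and values are the ones listed in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html'

def build_request_url(academic_year_value, pedagogic_period_value, semester_type_value,
                      unit_value='249847'):
    """Build the GET URL for one list of students (unit defaults to computer science)."""
    params = {
        'ww_x_GPS': '-1',
        'ww_i_reportModel': '133685247',
        'ww_i_reportModelXsl': '133685270',
        'ww_x_UNITE_ACAD': unit_value,
        'ww_x_PERIODE_ACAD': academic_year_value,
        'ww_x_PERIODE_PEDAGO': pedagogic_period_value,
        'ww_x_HIVERETE': semester_type_value,
    }
    # urlencode joins the pairs as key=value separated by '&'
    return BASE_URL + '?' + urlencode(params)

# Master semestre 1, autumn, 2007-2008
url = build_request_url('978181', '2230106', '2936286')
print(url)
```

Keeping the parameters in a dictionary also makes it easy to pass them to `requests.get` through its `params` argument instead of building the URL by hand.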

So let's cook all the requests we're going to send !

In [249]:
# Let's put the semester types aside, because we're going to need them
autumn_semester_value = semesterType_df.loc[semesterType_df['Semester_type'] == 'Semestre d\'automne', 'Value']
autumn_semester_value = autumn_semester_value.iloc[0]

spring_semester_value = semesterType_df.loc[semesterType_df['Semester_type'] == 'Semestre de printemps', 'Value']
spring_semester_value = spring_semester_value.iloc[0]

In [261]:
# Here is the list of the GET requests we will send to IS Academia
requestsToISAcademia = []

# Loop over all the years ('2007-2008', '2008-2009' and so on)
for academicYear_row in academicYear_df.itertuples(index=True, name='Academic_year'):
    
    # The year (eg: '2007-2008')
    academicYear = academicYear_row.Academic_year
    
    # The associated value (eg: '978181')
    academicYear_value = academicYear_row.Value
    
    # We get all the pedagogic periods associated with this academic year
    for pedagogicPeriod_row in pedagogicPeriod_df.itertuples(index=True, name='Pedagogic_period'):
        
        # The period (eg: 'Master semestre 1')
        pedagogicPeriod = pedagogicPeriod_row.Pedagogic_period
        
        # The associated value (eg: '2230106')
        pedagogicPeriod_Value = pedagogicPeriod_row.Value
        
        # We need to associate the corresponding semester type: periods ending in '1', '3' or 'automne' are autumn, all the others are spring
        if pedagogicPeriod.endswith(('1', '3', 'automne')):
            semester_Value = autumn_semester_value
        else:
            semester_Value = spring_semester_value
        
        # This print line is only for debugging if you want to check something
        # print("academic year = " + academicYear_value + ", pedagogic value = " + pedagogicPeriod_Value + ", pedagogic period is " + pedagogicPeriod + " (semester type value = " + semester_Value + ")")
        
        # We're ready to cook the request !
        request = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=' + computerScienceValue
        request = request + '&ww_x_PERIODE_ACAD=' + academicYear_value
        request = request + '&ww_x_PERIODE_PEDAGO=' + pedagogicPeriod_Value
        request = request + '&ww_x_HIVERETE=' + semester_Value
        
        # Add the newly created request to our wish list...
        requestsToISAcademia.append(request)
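
The season rule used in the loop above can be isolated and checked on its own. This is only a sketch: the function name `semester_type_for` is hypothetical, and it returns plain `'autumn'` / `'spring'` strings rather than the IS Academia values.

```python
def semester_type_for(pedagogic_period):
    """Return 'autumn' or 'spring' for a pedagogic period label."""
    # Odd-numbered semesters and the autumn master project take place in autumn
    if pedagogic_period.endswith(('1', '3', 'automne')):
        return 'autumn'
    return 'spring'

for label in ['Master semestre 1', 'Master semestre 2', 'Mineur semestre 1',
              'Projet Master automne', 'Projet Master printemps']:
    print(label, '->', semester_type_for(label))
```

Writing the rule as a function makes it easy to verify each label before sending any request.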
        
        
        

In [263]:
# Here is the list of all the requests we have to send !
requestsToISAcademia

['http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=978181&ww_x_PERIODE_PEDAGO=2230106&ww_x_HIVERETE=2936286',
 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=978181&ww_x_PERIODE_PEDAGO=942192&ww_x_HIVERETE=2936295',
 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=978181&ww_x_PERIODE_PEDAGO=2230128&ww_x_HIVERETE=2936286',
 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=978181&ww_x_PERIODE_PEDAGO=2230140&ww_x_HIVERETE=2936295',
 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=1