Obtain all the data for the Bachelor students, starting from 2007. Keep only the students for whom there is an entry for both Bachelor semestre 1 and Bachelor semestre 6. Compute how many months each student took to go from the first to the sixth semester. Partition the data between male and female students and compute the average duration for each group. Is the difference between the two averages statistically significant?
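
The significance question can be approached in several ways. The sketch below uses a two-sided permutation test on the difference of means, which is one option among others (a t-test would be another). The function name `permutation_test` and the sample durations are made up for illustration and are not EPFL data.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of the group means."""
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Reassign the group labels at random and recompute the difference
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # The add-one correction keeps the estimated p-value strictly positive
    return (extreme + 1) / (n_permutations + 1)


# Made-up durations in months, for illustration only
months_female = [36, 36, 48, 36, 42]
months_male = [36, 48, 36, 36, 48, 42]
p_value = permutation_test(months_female, months_male)
```

A small `p_value` (commonly below 0.05) would suggest that the observed difference is unlikely under random assignment of the labels.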

In [25]:
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

In [26]:
r = requests.get('http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247')
htmlContent = BeautifulSoup(r.content, 'html.parser')
print(htmlContent.prettify())

<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
   <div>
   </div>
   <title>
   </title>
   <script src="GEDPUBLICREPORTS.txt?ww_x_path=Gestac.Base.Palette_js&amp;ww_c_langue=fr" type="text/javascript">
   </script>
   <link href="GEDPUBLICREPORTS.css?ww_x_path=Gestac.Moniteur.Style" rel="stylesheet" type="text/css">
    <link href="GEDPUBLICREPORTS.css?ww_x_path=Gestac.Moniteur.StyleNavigator" rel="stylesheet" type="text/css"/>
   </link>
  </meta>
 </head>
 <body alink="#666666" bgcolor="#ffffff" link="#666666" marginheight="0" marginwidth="5" vlink="#666666">
  <div class="filtres">
   <form action="!GEDPUBLICREPORTS.filter" method="GET" name="f">
    <input name="ww_b_list" type="hidden" value="1">
     <input name="ww_i_reportmodel" type="hidden" value="133685247">
      <input name="ww_c_langue" type="hidden" value="">
       <h1 id="titre">
        Liste des étudiants inscrits par semestre
       </h1>
       <table border="0" id="format">
 

In [30]:
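# First, we find the option for the Computer Science ("Informatique") section
# (`string=` is the current name of the argument formerly called `text=`)
computerScienceField = htmlContent.find('option', string='Informatique')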
computerScienceField

<option value="249847">Informatique</option>

In [31]:
computerScienceValue = computerScienceField.get('value')
computerScienceValue

'249847'

In [32]:
# Next, we need the values of all the academic years.
academicYearsField = htmlContent.find('select', attrs={'name':'ww_x_PERIODE_ACAD'})
academicYearsSet = academicYearsField.find_all('option')

# There are several years, so we store their values in a list for later use
academicYearValues = []
# We also store the text labels in a list ("2016-2017", "2015-2016"...)
academicYearContent = []

for option in academicYearsSet:
    value = option.get('value')
    # Skip the placeholder option whose value is 'null'
    if value != 'null':
        academicYearValues.append(value)
        academicYearContent.append(option.text)
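
The extraction loop above can be factored into a small helper, which is handy because the form contains several `select` elements. This is only a sketch: `select_options` and `sample_html` are hypothetical names, and the sample markup is a stand-in that mimics the structure of the form rather than the real page.

```python
from bs4 import BeautifulSoup


def select_options(soup, select_name):
    """Map each option value of a <select> to its label, skipping 'null'."""
    select = soup.find('select', attrs={'name': select_name})
    if select is None:
        raise ValueError(f'no <select> named {select_name!r}')
    return {
        option.get('value'): option.text
        for option in select.find_all('option')
        if option.get('value') != 'null'
    }


# Stand-in markup with the same shape as the form (not the real page)
sample_html = """
<select name="ww_x_PERIODE_ACAD">
  <option value="null"></option>
  <option value="978181">2007-2008</option>
  <option value="978187">2008-2009</option>
</select>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
academic_years = select_options(sample_soup, 'ww_x_PERIODE_ACAD')
```

Passing the resulting dictionary to `pd.Series` gives a Series indexed by option value, like the ones built in this notebook.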

In [34]:
# Now, we have all the academic years that might interest us
academicYear_Series = pd.Series(academicYearContent, index=academicYearValues)
academicYear_Series

355925344    2016-2017
213638028    2015-2016
213637922    2014-2015
213637754    2013-2014
123456101    2012-2013
123455150    2011-2012
39486325     2010-2011
978195       2009-2010
978187       2008-2009
978181       2007-2008
dtype: object
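
These academic year labels are what the month computation will rely on. A minimal sketch, assuming every semester counts for six months and that a registration is identified by its academic year label plus a season; `semester_index`, `months_between` and the `'autumn'`/`'spring'` labels are hypothetical names chosen for this example.

```python
def semester_index(academic_year, season):
    """Position of a semester on a half-year timeline.

    academic_year: label such as '2007-2008'
    season: 'autumn' or 'spring' (labels assumed for this sketch)
    """
    if season not in ('autumn', 'spring'):
        raise ValueError(f'unknown season {season!r}')
    start_year = int(academic_year.split('-')[0])
    # Autumn comes first within an academic year, spring second
    return 2 * start_year + (0 if season == 'autumn' else 1)


def months_between(first, last, months_per_semester=6):
    """Months from the start of `first` to the end of `last`, inclusive."""
    span = semester_index(*last) - semester_index(*first) + 1
    return span * months_per_semester


# A student entering in autumn 2007 and finishing in spring 2010
duration = months_between(('2007-2008', 'autumn'), ('2009-2010', 'spring'))
```

Counting both end semesters means a student who never repeats a semester gets six semesters, that is 36 months under the six-month assumption.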

In [35]:
# Next, we get all the pedagogic periods. This is slightly more involved, because each pedagogic period must be linked to a season (e.g. Bachelor 1 is in autumn, Bachelor 2 is in spring, etc.).
# The option values alone are therefore not enough: to associate each period with the right season, we also need its text label ("Bachelor semestre 1", "Bachelor semestre 2", etc.).
pedagogicPeriodsField = htmlContent.find('select', attrs={'name':'ww_x_PERIODE_PEDAGO'})
pedagogicPeriodsSet = pedagogicPeriodsField.find_all('option')

# As above, we store the values in a list
pedagogicPeriodValues = []
# We also store the text labels in a list ("Master semestre 1", "Master semestre 2"...)
pedagogicPeriodContent = []

for option in pedagogicPeriodsSet:
    value = option.get('value')
    if value != 'null':
        pedagogicPeriodValues.append(value)
        pedagogicPeriodContent.append(option.text)

In [36]:
# Pair each option value with its text label
pedagogicPeriod_Series = pd.Series(pedagogicPeriodContent, index=pedagogicPeriodValues)
pedagogicPeriod_Series

249108               Bachelor semestre 1
249114               Bachelor semestre 2
942155               Bachelor semestre 3
942163               Bachelor semestre 4
942120               Bachelor semestre 5
2226768             Bachelor semestre 5b
942175               Bachelor semestre 6
2226785             Bachelor semestre 6b
2230106                Master semestre 1
942192                 Master semestre 2
2230128                Master semestre 3
2230140                Master semestre 4
2335667                Mineur semestre 1
2335676                Mineur semestre 2
2063602308                 Mise à niveau
249127             Projet Master automne
3781783          Projet Master printemps
953159                  Semestre automne
2754553               Semestre printemps
953137          Stage automne 3ème année
2226616         Stage automne 4ème année
983606        Stage printemps 3ème année
2226626       Stage printemps 4ème année
2227132           Stage printemps master
dtype: object

In [38]:
# We keep all semesters related to Bachelor students
bachelorPedagogicPeriod_Series = pedagogicPeriod_Series[pedagogicPeriod_Series.str.startswith('Bachelor')]
bachelorPedagogicPeriod_Series

249108      Bachelor semestre 1
249114      Bachelor semestre 2
942155      Bachelor semestre 3
942163      Bachelor semestre 4
942120      Bachelor semestre 5
2226768    Bachelor semestre 5b
942175      Bachelor semestre 6
2226785    Bachelor semestre 6b
dtype: object
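
To link each Bachelor period with a season, the semester number can be read from the label with a regular expression. The sketch below assumes that odd semesters are in autumn and even ones in spring, and that "5b" and "6b" follow the parity of their number, which the page does not confirm. `season_of` is a hypothetical helper name.

```python
import re


def season_of(period_label):
    """Return 'autumn' for odd semester numbers and 'spring' for even ones."""
    match = re.search(r'semestre (\d+)', period_label)
    if match is None:
        raise ValueError(f'no semester number in {period_label!r}')
    return 'autumn' if int(match.group(1)) % 2 == 1 else 'spring'


# Labels copied from the Series printed above
bachelor_labels = [
    'Bachelor semestre 1', 'Bachelor semestre 2',
    'Bachelor semestre 3', 'Bachelor semestre 4',
    'Bachelor semestre 5', 'Bachelor semestre 5b',
    'Bachelor semestre 6', 'Bachelor semestre 6b',
]
seasons = {label: season_of(label) for label in bachelor_labels}
```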

In [39]:
# Lastly, we need to extract the values associated with the autumn and spring semesters.
semesterTypeField = htmlContent.find('select', attrs={'name':'ww_x_HIVERETE'})
semesterTypeSet = semesterTypeField.find_all('option')

# Again, we store the values in a list
semesterTypeValues = []
# We also store the text labels in a list
semesterTypeContent = []

for option in semesterTypeSet:
    value = option.get('value')
    if value != 'null':
        semesterTypeValues.append(value)
        semesterTypeContent.append(option.text)

In [40]:
# Here are the values for the autumn and spring semesters:
semesterType_Series = pd.Series(semesterTypeContent, index=semesterTypeValues)
semesterType_Series

2936286       Semestre d'automne
2936295    Semestre de printemps
dtype: object
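
With these values in hand, the filter choices can be assembled into a query string. This is a sketch only: the parameter name `ww_x_UNITE_ACAD` for the section is an assumption that does not appear in the page excerpt above, while the other names are taken from the form and the initial URL.

```python
from urllib.parse import urlencode

params = {
    'ww_i_reportModel': '133685247',
    'ww_x_UNITE_ACAD': '249847',       # Informatique (parameter name assumed)
    'ww_x_PERIODE_ACAD': '978181',     # 2007-2008
    'ww_x_PERIODE_PEDAGO': '249108',   # Bachelor semestre 1
    'ww_x_HIVERETE': '2936286',        # Semestre d'automne
}

# urlencode joins the pairs with '&' and escapes special characters
query = urlencode(params)
```

Passing the same dictionary through the `params` argument of `requests.get` produces the same encoding without building the string by hand.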

In [42]:
# Find the field that selects the output format (HTML or XLS)
formatField = htmlContent.find('select', attrs={'name':'ww_i_reportModelXsl'})