In [33]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

# Familiarizing with the tools and data

Let's first analyze the HTML content of the IS-Academia directory. We will this by using the `requests` library to `GET` the HTML content given a URL. We obtained the following URL using Postman and Postman Interceptor.

In [5]:
r = requests.get("http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247")

Now we will use `BeautifulSoup` to parse through the data and visualize it nicely with the `pretiffy()` method.

In [7]:
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
   <div>
   </div>
   <title>
   </title>
   <script src="GEDPUBLICREPORTS.txt?ww_x_path=Gestac.Base.Palette_js&amp;ww_c_langue=fr" type="text/javascript">
   </script>
   <link href="GEDPUBLICREPORTS.css?ww_x_path=Gestac.Moniteur.Style" rel="stylesheet" type="text/css">
    <link href="GEDPUBLICREPORTS.css?ww_x_path=Gestac.Moniteur.StyleNavigator" rel="stylesheet" type="text/css"/>
   </link>
  </meta>
 </head>
 <body alink="#666666" bgcolor="#ffffff" link="#666666" marginheight="0" marginwidth="5" vlink="#666666">
  <div class="filtres">
   <form action="!GEDPUBLICREPORTS.filter" method="GET" name="f">
    <input name="ww_b_list" type="hidden" value="1">
     <input name="ww_i_reportmodel" type="hidden" value="133685247">
      <input name="ww_c_langue" type="hidden" value="">
       <h1 id="titre">
        Liste des étudiants inscrits par semestre
       </h1>
       <table border="0" id="format">
 

With `BeautifulSoup`, we can convinient dig deeper into the HTML content as described in this tutorial: https://www.crummy.com/software/BeautifulSoup/bs4/doc/. We will isolate the filters used to distinguish students by Major (Unité académique), Academic Year (Période académique), Student Status (Période pédagogique), and Semester Type (Type de semestre). We have identified from the output above that the filters are in the `body`, between `<table>` tags with `id="filtre"`. Finally, we can use `find_all('tr')` to get each filter as an entry in a list.

In [21]:
filters = soup.body.find(id="filtre").find_all('tr')
print(filters)

[<tr><th>Unit\xe9 acad\xe9mique</th><td><input name="zz_x_UNITE_ACAD" type="hidden" value=""><select name="ww_x_UNITE_ACAD" onchange="document.f.zz_x_UNITE_ACAD.value=document.f.ww_x_UNITE_ACAD.options[document.f.ww_x_UNITE_ACAD.selectedIndex].text"><option value="null"></option><option value="942293">Architecture</option><option value="246696">Chimie et g\xe9nie chimique</option><option value="943282">Cours de math\xe9matiques sp\xe9ciales</option><option value="637841336">EME (EPFL Middle East)</option><option value="942623">G\xe9nie civil</option><option value="944263">G\xe9nie m\xe9canique</option><option value="943936">G\xe9nie \xe9lectrique et \xe9lectronique </option><option value="2054839157">Humanit\xe9s digitales</option><option value="249847">Informatique</option><option value="120623110">Ing\xe9nierie financi\xe8re</option><option value="946882">Management de la technologie</option><option value="944590">Math\xe9matiques</option><option value="945244">Microtechnique</option

Admittedly, not the nicest output. And besides the fact that there are square brackets, it's hard to even say that it's a list! Let's try to output this more cleanly. From the `prettify()` output, we saw that each filter, e.g. `Architecture` for Unité académique or `2010-2011` for Période académique has an `option` tag surrounding it. Let's use `find_all(option)` with each item in the above list to cleanly output the filter options.

In [24]:
for field in filters:
    print(field.find_all('option'))

[<option value="null"></option>, <option value="942293">Architecture</option>, <option value="246696">Chimie et g\xe9nie chimique</option>, <option value="943282">Cours de math\xe9matiques sp\xe9ciales</option>, <option value="637841336">EME (EPFL Middle East)</option>, <option value="942623">G\xe9nie civil</option>, <option value="944263">G\xe9nie m\xe9canique</option>, <option value="943936">G\xe9nie \xe9lectrique et \xe9lectronique </option>, <option value="2054839157">Humanit\xe9s digitales</option>, <option value="249847">Informatique</option>, <option value="120623110">Ing\xe9nierie financi\xe8re</option>, <option value="946882">Management de la technologie</option>, <option value="944590">Math\xe9matiques</option>, <option value="945244">Microtechnique</option>, <option value="945571">Physique</option>, <option value="944917">Science et g\xe9nie des mat\xe9riaux</option>, <option value="942953">Sciences et ing\xe9nierie de l'environnement</option>, <option value="945901">Science

That's a bit better. We can see a `value` associated with each field option, e.g. `249847` for `Informatique`. Let's actually use some of these filters on the website itself and intercept the requests using Postman + Postman Interceptor. Postman helps us track which URL's are requested and to analyze the corresponding HTML. Let's check out the URL and HTML content (with `BeautifulSoup`) when we select the following field options: `Informatique`, `2009-2010`, `Bachelor semestre 1`, and `Semestre d'automne`.

http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270&zz_x_UNITE_ACAD=Informatique&ww_x_UNITE_ACAD=249847&zz_x_PERIODE_ACAD=2009-2010&ww_x_PERIODE_ACAD=978195&zz_x_PERIODE_PEDAGO=Bachelor+semestre+1&ww_x_PERIODE_PEDAGO=249108&zz_x_HIVERETE=Semestre+d%27automne&ww_x_HIVERETE=2936286&dummy=ok

In the URL we can see the options we selected! Moreover, they have been used as parameters for the URL along with their corresponding `value` attribute. The parameter names (that `Informatique`, `2009-2010`, `Bachelor semestre 1`, and `Semestre d'automne` are being set to) could be identified by navigating through our `filters` list (by going into the `td` tag and then selecting the `name` attribute of the `input` tag).

In [31]:
for field in filters:
    print(field.td.input["name"])

zz_x_UNITE_ACAD
zz_x_PERIODE_ACAD
zz_x_PERIODE_PEDAGO
zz_x_HIVERETE


`zz_*` seems to be for the string parameter name and `xx_*` for the corresponding `value` attribute. However, it is also possible to get the same HTML content without the `zz_*` parameters by adding the `ww_b_list` parameter (we found this out using Postmaster):

http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_i_reportModelXsl=133685270&ww_x_HIVERETE=2936286&ww_x_PERIODE_ACAD=978195&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_PEDAGO=249108

When checking out the HTML content, we see a new table at the bottom (corresponding to the options we see on the IS-Academia portal) with attribute `border="0"`. Let's check it out with `BeautifulSoup`.

In [34]:
r = requests.get("http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270&zz_x_UNITE_ACAD=Informatique&ww_x_UNITE_ACAD=249847&zz_x_PERIODE_ACAD=2009-2010&ww_x_PERIODE_ACAD=978195&zz_x_PERIODE_PEDAGO=Bachelor+semestre+1&ww_x_PERIODE_PEDAGO=249108&zz_x_HIVERETE=Semestre+d%27automne&ww_x_HIVERETE=2936286&dummy=ok")
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
   <div>
   </div>
   <title>
   </title>
   <script src="GEDPUBLICREPORTS.txt?ww_x_path=Gestac.Base.Palette_js&amp;ww_c_langue=fr" type="text/javascript">
   </script>
   <link href="GEDPUBLICREPORTS.css?ww_x_path=Gestac.Moniteur.Style" rel="stylesheet" type="text/css">
    <link href="GEDPUBLICREPORTS.css?ww_x_path=Gestac.Moniteur.StyleNavigator" rel="stylesheet" type="text/css"/>
   </link>
  </meta>
 </head>
 <body alink="#666666" bgcolor="#ffffff" link="#666666" marginheight="0" marginwidth="5" vlink="#666666">
  <div class="filtres">
   <form action="!GEDPUBLICREPORTS.filter" method="GET" name="f">
    <input name="ww_b_list" type="hidden" value="1">
     <input name="ww_i_reportmodel" type="hidden" value="133685247">
      <input name="ww_c_langue" type="hidden" value="">
       <h1 id="titre">
        Liste des étudiants inscrits par semestre
       </h1>
       <table border="0" id="format">
 

We now see this new parameter `ww_x_GPS`. Let's follow the link for `Informatique, 2009-2010, Bachelor semestre 1` and analyze as before. The webpage now shows the corresponding list of students! With Postman, we see a `GET` request with the following URL:

http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=213617925&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_b_list=1&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=978195&ww_x_PERIODE_PEDAGO=249108&ww_x_HIVERETE=2936286

This is very similar to previous URL with one key difference: the new parameter `ww_x_GPS` with its corresponding value has been added to the URL.

We have now "cracked" the manner in which to extract the desired HTML content from IS-Academia! The general procedure is as follows:

1. Identity the `value` attributes according to desired filters.
2. Using `requests`, build the URL for filter search results with the `value` attributes as parameters of the following base URL: http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270
3. Use `BeautifulSoup` to extract the `ww_x_GPS` parameter value from the HTML content.
4. With `requests`, build the URL with the newly acquired `ww_x_GPS` value and the `value` attributes as parameters of the following base URL: http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_b_list=1
5. We then have a table of students in HTML format. We can use the `read_html()` function of `pandas` in order to conveniently access the data.

Below we will go through the above steps for picking out the students we need for our analysis in the exercises.

#### 1. Identify `value` attributes according to desired filters

In order to perform the first step conveniently, we will create a few dictionaries so we can "translate" the desired filter options into their corresponding `value` attributes. These dictionaries will be used in the following exercises.

In [255]:
# same URL as before, identified with postmam
r = requests.get("http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247")
# scrape content using BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
# obtain list of filters as previously described
filters = soup.body.find(id="filtre").find_all('tr')

Now we define a new function `create_dict()` in order to scrape the string and corresponding `value` attribute from a list of `option`'s. The function will finally place them in a dictionary with the string as the key and the `value` attribute as the (you, got it) value, and then return this newly formed dictionary.

In [256]:
# function to create dictionary for each filter
def create_dict(field_list):
    field_dict = {}
    for i in range(1, len(field_list)):
        field_dict[field_list[i].string] = field_list[i]["value"]
    return field_dict

# Unité académique, Période académique, Période pédagogique, Type de semestre
major_dict = create_dict(filters[0].find_all('option'))
acad_yr_dict = create_dict(filters[1].find_all('option'))
status_dict = create_dict(filters[2].find_all('option'))
sem_dict = create_dict(filters[3].find_all('option'))

Let's create a Series from each of these dictionaries and `pickle` them so we don't have to have to rely on `requests`.

In [263]:
major = pd.Series(data=major_dict)
major.to_pickle("major_pickle")
acad_yr = pd.Series(data=acad_yr_dict)
acad_yr.to_pickle("acad_yr_pickle")
status = pd.Series(data=status_dict)
status.to_pickle("status_pickle")
semester = pd.Series(data=sem_dict)
semester.to_pickle("sem_pickle")

Now we can convieniently obtain the necessary parameters to build the URLs for filtering students based on Major (Unité académique), Academic Year (Période académique), Student Status (Période pédagogique), and Semester Type (Type de semestre)!

#### 2. Using `requests`, build the URL for filter search results with the `value` attributes as parameters

Now let's build the required URL so we can obtain the `ww_x_GPS` parameter value to then gather the students that meet our search criteria. The following is our base URL for the filter search results:

In [118]:
FILTER_BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270'

Now using `requests`, we can build the URL with the necessary parameters as we saw above. We have the following parameters:

In [14]:
# parameter keys
PARAM_MAJ = 'ww_x_UNITE_ACAD'
PARAM_YR = 'ww_x_PERIODE_ACAD'
PARAM_STATUS = 'ww_x_PERIODE_PEDAGO'
PARAM_SEM = 'ww_x_HIVERETE'

Now let's pass parameters to the URL as described here (http://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls) and make a `GET` request. Let's say we want students in `Informatique`, `2009-2010`, `Bachelor semestre 1`, and `Semestre d'automne` as before.

In [119]:
# create URL for filtered result
payload_filter = {PARAM_MAJ: major['Informatique'], 
                  PARAM_YR: acad_yr['2009-2010'], 
                  PARAM_STATUS: status['Bachelor semestre 1'], 
                  PARAM_SEM: sem["Semestre d'automne"]}
r = requests.get(FILTER_BASE_URL, params=payload_filter)
print(r.url)

http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270&ww_x_PERIODE_PEDAGO=249108&ww_x_PERIODE_ACAD=978195&ww_x_UNITE_ACAD=249847&ww_x_HIVERETE=2936286


#### 3. Use `BeautifulSoup` to extract the `ww_x_GPS` parameter value from the HTML content. 

Let's see how we can use `BeautifulSoup` to navigate through the HTML content and extract the `ww_x_GPS` parameter value. `prettify()` can help us with this.

In [120]:
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
   <div>
   </div>
   <title>
   </title>
   <script src="GEDPUBLICREPORTS.txt?ww_x_path=Gestac.Base.Palette_js&amp;ww_c_langue=fr" type="text/javascript">
   </script>
   <link href="GEDPUBLICREPORTS.css?ww_x_path=Gestac.Moniteur.Style" rel="stylesheet" type="text/css">
    <link href="GEDPUBLICREPORTS.css?ww_x_path=Gestac.Moniteur.StyleNavigator" rel="stylesheet" type="text/css"/>
   </link>
  </meta>
 </head>
 <body alink="#666666" bgcolor="#ffffff" link="#666666" marginheight="0" marginwidth="5" vlink="#666666">
  <div class="filtres">
   <form action="!GEDPUBLICREPORTS.filter" method="GET" name="f">
    <input name="ww_b_list" type="hidden" value="1">
     <input name="ww_i_reportmodel" type="hidden" value="133685247">
      <input name="ww_c_langue" type="hidden" value="">
       <h1 id="titre">
        Liste des étudiants inscrits par semestre
       </h1>
       <table border="0" id="format">
 

We need to pick out the `a` tags that have a `class` attribute equal to `ww_x_GPS`. This can be done with the `find_all()` method.

In [121]:
soup.find_all('a', class_='ww_x_GPS')

[<a class="ww_x_GPS" href="javascript:void(0)" onclick="loadReport('ww_x_GPS=-1');return false;">Tous</a>,
 <a class="ww_x_GPS" href="javascript:void(0)" onclick="loadReport('ww_x_GPS=213617925');return false;">Informatique, 2009-2010, Bachelor semestre 1</a>]

Now we have a list of HTML entries that contain `ww_x_GPS` values. The value itself is in the `onclick` attribute. We can extract the `ww_x_GPS` value by parsing the information contained in this attribute. We will assume that we only get two entries in the above list are our filter entries as our search criteria will ensure this. The two categories (which can be seen on the IS-Academia site) are "Tous" and the category of students we are interested in. "Tous" has a `ww_x_GPS` value of `-1` so we make sure to return the other value.

In [122]:
# assuming we only get two results with one of them being "Tous"
def is_valid_gps(gps):
    return gps != "-1"

def extract_gps(content):
    soup = BeautifulSoup(content, 'html.parser')
    elements = soup.find_all('a', class_='ww_x_GPS')
    for element in elements:
        raw_info = element.attrs['onclick']
        gps = raw_info.split("'")[1].split('=')[1]
        if is_valid_gps(gps):
            return gps

gps = extract_gps(r.content)
print(gps)

213617925


#### 4. With `requests`, build the URL with the newly acquired `ww_x_GPS` value and the `value` attributes as parameters. 

Now we have a new base URL and an additional parameter for our payload.

In [123]:
PARAM_GPS = 'ww_x_GPS'
DATA_BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_b_list=1'

As in Step 2, we use `requests` to can build the URL with the necessary parameters.

In [124]:
# create URL for filtered result
payload_data = {PARAM_GPS: gps,
                PARAM_MAJ: major['Informatique'], 
                PARAM_YR: acad_yr['2009-2010'], 
                PARAM_STATUS: status['Bachelor semestre 1'], 
                PARAM_SEM: sem["Semestre d'automne"]}
r = requests.get(DATA_BASE_URL, params=payload_data)
print(r.url)

http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_b_list=1&ww_x_PERIODE_PEDAGO=249108&ww_x_PERIODE_ACAD=978195&ww_x_UNITE_ACAD=249847&ww_x_HIVERETE=2936286&ww_x_GPS=213617925


Following the above link takes us to the list of students meeting the following criteria: `Informatique`, `2009-2010`, `Bachelor semestre 1`, and `Semestre d'automne`.

#### 5. Use the `read_html()` function of `pandas` (or `BeautifulSoup`) in order to conveniently access the data.

##### Using with `pandas`

`read_html()` stores HTML tables as a list of `DataFrame`'s. So we need to access the first element for our desired table.

In [125]:
dfs = pd.read_html(r.url)
dfs[0].head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,"Informatique, 2009-2010, Bachelor semestre 1 ...",,,,,,,,,,,
1,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper,
2,Monsieur,Abdallah Jad,,,,,,Présent,,,194197,
3,Madame,Al Azawi Marwa,,,,,,Présent,,,195766,
4,Monsieur,Amrani Ismaïl,,,,,,Présent,,,186942,
5,Monsieur,Antognini Marco,,,,,,Présent,,,194182,
6,Monsieur,Augsburger Damien,,,,,,Présent,,,186595,
7,Madame,Balmau Oana Maria,,,,,,Présent,,,192757,
8,Monsieur,Barben Loïc,,,,,,Présent,,,189517,
9,Monsieur,Barbier Issa,,,,,,Présent,,,192248,


The first two rows corresponds to information about the students. After the first two rows, we have the student name, gender, SCIPER, etc. Therefore, the some pre-processing needs to be done. These are some example pre-processing steps (perhaps more or less is needed depending on what info we want to extract): remove first row, set second row as the column names, and set SCIPER number of index (since this is a unique ID).

In [126]:
df = dfs[0].copy()
df.columns = df.iloc[1] # set columns names
df = df.reindex(df.index.drop([0,1])) # drop non-student rows
df = df.set_index('No Sciper')
df.head(10)

1,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,nan
No Sciper,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
194197,Monsieur,Abdallah Jad,,,,,,Présent,,,
195766,Madame,Al Azawi Marwa,,,,,,Présent,,,
186942,Monsieur,Amrani Ismaïl,,,,,,Présent,,,
194182,Monsieur,Antognini Marco,,,,,,Présent,,,
186595,Monsieur,Augsburger Damien,,,,,,Présent,,,
192757,Madame,Balmau Oana Maria,,,,,,Présent,,,
189517,Monsieur,Barben Loïc,,,,,,Présent,,,
192248,Monsieur,Barbier Issa,,,,,,Présent,,,
187143,Monsieur,Bennani Kabchi Reda,,,,,,Présent,,,
195551,Monsieur,Berta Stefano,,,,,,Présent,,,


##### Using `BeautifulSoup`

One problem with using the `read_html()` function is that it (again) fetches the HTML content although this was already one with `resquests`. (We could alternatively build the full URL without `requests.get()` and just fetch the HTML content once). We can also scrape the HTM content using `BeautifulSoup`.

In [128]:
soup_students = BeautifulSoup(r_list.content, 'html.parser')
# visualize
print(soup_students.prettify())

<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
   <link href="gedpublicreports.css?ww_x_path=Gestac.Moniteur.Style" rel="stylesheet" type="text/css"/>
  </meta>
 </head>
 <body alink="#666666" bgcolor="#ffffff" link="#666666" marginheight="0" marginwidth="5" vlink="#666666">
  <fieldset style="text-align:right; width:40%; position:relative; margin-right: 10px;float:right; border: 0; padding: 0 0 8px 0;">
   <a href="!GEDREPORTS.html?ww_i_reportModel=133685247&amp;ww_i_reportModelXsl=133685270&amp;ww_b_list=1&amp;ww_x_PERIODE_PEDAGO=249108&amp;ww_x_PERIODE_ACAD=355925344&amp;ww_x_UNITE_ACAD=249847&amp;ww_x_HIVERETE=2936286&amp;ww_x_GPS=2021043255" style="color:#990033;">
    Identification pour accéder aux e-mails
    <br>
     Login to access email adresses
    </br>
   </a>
  </fieldset>
  <script>
   function mailList(x) {
   var vtop = (screen.height-200)/2;
   var vleft=(screen.width-600)/2;
   var w=open("", "emaillist", "Scrollbars=1,resizabl

From the `prettify()` output, we see that student info is contained within `<tr>` tags and that the first two entries between `<tr>` are for general information about the students. Therefore, to get all the students, we can use `find_all()` to get all the `<tr>` entries and drop the first two.

In [132]:
students = soup_students.find_all('tr')[2:]
# let's look at one of the student entries
students[0]

<tr><td style="white-space:nowrap">Monsieur</td><td style="white-space:nowrap">Abbey\xa0Alexandre</td><td style="white-space:nowrap"></td><td style="white-space:nowrap"></td><td style="white-space:nowrap"></td><td style="white-space:nowrap"></td><td style="white-space:nowrap"></td><td style="white-space:nowrap">Pr\xe9sent</td><td style="white-space:nowrap"></td><td style="white-space:nowrap"></td><td>235688</td><td style="white-space:nowrap"></td></tr>

Each data point about the student is surround by `<td>` tags. We can again use `find_all()` to access these elements and print these nicely.

In [135]:
student = students[0].find_all('td')
for field in student:
    print(field.string)

Monsieur
Abbey Alexandre
None
None
None
None
None
Présent
None
None
235688
None


# Exercise 1

_We will focus exclusively on the academic unit `Informatique`._

_Obtain all the data for the Bachelor students, starting from 2007. Keep only the students for which you have an entry for both `Bachelor semestre 1` and `Bachelor semestre 6`. Compute how many months it took each student to go from the first to the sixth semester. Partition the data between male and female students, and compute the average -- is the difference in average statistically significant?_

For this problem, we will create two dictionaries: one for `Bachelor semestre 1` (B1) students and another for `Bachelor semestre 6` (B6) students. For the B1 dictionary, we will start from 2007 and add students when they are first enrolled as `Informatique` and B1. If they repeat, they will not be added again. The same will be done for B6 as the problem is to compute the number of months from first to sixth semester. To compute the months, we will take into account the academic year: **12*[(second of the academic years for B6) - (first of the academic years for B1)]**.

A few assumptions we make:
* `Bachelor semestre 1` is always in `Semestre d'automne` due to the way the Bachelor program is structured.
* `Bachelor semestre 6` is always in `Semestre de printemps` due to the way the Bachelor program is structured.

**NOTE** : We realize that there is some discussion on Slack with people considering students that finished their studies with `Bachelor semestre 5`; however, the assignment states to compute the number of months for each student to go from the first to the sixth semester (how long it took for their Bachelors is a different problem).

In [223]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## (REMINDER) constants and functions as previously described/explained
# parameter keys
PARAM_GPS = 'ww_x_GPS'
PARAM_MAJ = 'ww_x_UNITE_ACAD'
PARAM_YR = 'ww_x_PERIODE_ACAD'
PARAM_STATUS = 'ww_x_PERIODE_PEDAGO'
PARAM_SEM = 'ww_x_HIVERETE'

# base urls
FILTER_BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270'
DATA_BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_b_list=1'

# open the Series for the filter dropdown menus made before
majors = pd.read_pickle("major_pickle")
acad_yrs = pd.read_pickle("acad_yr_pickle")
statuses = pd.read_pickle("status_pickle")
semesters = pd.read_pickle("sem_pickle")

# extracting GPS value
def is_valid_gps(gps):
    return gps != "-1"
def extract_gps(content):
    soup = BeautifulSoup(content, 'html.parser')
    elements = soup.find_all('a', class_='ww_x_GPS')
    for element in elements:
        raw_info = element.attrs['onclick']
        gps = raw_info.split("'")[1].split('=')[1]
        if is_valid_gps(gps):
            return gps

# combine steps 1-4 from procedure of extracting HTML content of desired students
def get_html_content(maj, yr, stat, sem):
    # obtain gps
    payload = {PARAM_MAJ: majors[maj],
               PARAM_YR: acad_yrs[yr], 
               PARAM_STATUS: statuses[stat],
               PARAM_SEM: semesters[sem]}
    r_filt = requests.get(FILTER_BASE_URL, params=payload)
    gps = extract_gps(r_filt.content)
    # get list of students
    payload[PARAM_GPS] = gps
    r_list = requests.get(DATA_BASE_URL, params=payload)
    return r_list.content

# calculate the start date of a given academic year and semester type
def sem_start_date(academic_year, semester):
    start_year, next_year = academic_year.split('-')
    if semester == "Semestre d'automne":
        return start_year + '-09'
    else:
        return next_year + '-03'

def create_student_entry(stat, info, yr, sem):
    student = {}
    student['Gender'] = info[0].string
    student['Name'] = info[1].string
    student['Specialisation'] = info[4].string
    student['Minor'] = info[6].string
    student[stat] = sem_start_date(yr, sem)
    return student

# scrape student data for a particular major and student status
def scrape_student_data(maj, stat):
    dic = {}
    # go through all statuses, years, and semesters
    for yr in acad_yrs.keys():
        for sem in semesters.keys():
            html_content = get_html_content(maj, yr, stat, sem)
            # parse with beautiful soup
            soup_students = BeautifulSoup(html_content, 'html.parser')
            rows = soup_students.find_all('tr')
            # students are starting after two rows
            for row in rows[2:]:
                student = row.find_all('td')
                sciper = student[10].string
                # keep earliest year in case a student repeated first semester
                if int(stat.split(' ')[-1]) == 1: # obtaining number of semester
                    if sciper not in dic:
                        dic[sciper] = create_student_entry(stat, student, yr, sem)
                # for other semesters replace with latest
                else:
                    dic[sciper] = create_student_entry(stat, student, yr, sem)
    df = pd.DataFrame.from_dict(dic, orient='index')
    return df

Let's create a DataFrame for the B1 and one for the B6 students.

In [224]:
major = 'Informatique'
df_b1 = scrape_student_data(major, 'Bachelor semestre 1')
df_b6 = scrape_student_data(major, 'Bachelor semestre 6')
df_b1.to_pickle("df_b1_pickle")
df_b6.to_pickle("df_b6_pickle")

In [225]:
# join b1 and b6 students and keep those that are in both
df_b1 = pd.read_pickle("df_b1_pickle")
df_b6 = pd.read_pickle("df_b6_pickle")
b1_to_b6 = df_b1[["Bachelor semestre 1","Gender"]].join(df_b6["Bachelor semestre 6"], how='inner')
b1_to_b6.tail()

Unnamed: 0,Bachelor semestre 1,Gender,Bachelor semestre 6
250300,2014-09,Monsieur,2017-03
250362,2014-09,Monsieur,2017-03
250703,2014-09,Monsieur,2017-03
251758,2014-09,Monsieur,2017-03
251759,2014-09,Monsieur,2017-03


In [226]:
# drop students with future B6
b1_to_b6 = b1_to_b6[np.logical_not(b1_to_b6["Bachelor semestre 6"].isin(['2017-03']))]
b1_to_b6.tail()

Unnamed: 0,Bachelor semestre 1,Gender,Bachelor semestre 6
238150,2013-09,Monsieur,2016-03
239124,2013-09,Monsieur,2016-03
239170,2013-09,Monsieur,2016-03
239314,2013-09,Monsieur,2016-03
239366,2013-09,Monsieur,2016-03


In [227]:
# date is in 'year-month' format. e.g. 2015-07
def months_between_dates(start_date, end_date):
    start_year, start_month = start_date.split('-')
    end_year, end_month = end_date.split('-')
    return (int(end_year) - int(start_year)) * 12 + int(end_month) - int(start_month) + 6

def bachelor_duration(row):
    return months_between_dates(row['Bachelor semestre 1'], row['Bachelor semestre 6'])

b1_to_b6['Duration in months'] = b1_to_b6.apply(bachelor_duration, axis=1)

In [228]:
male_mean = b1_to_b6['Duration in months'][b1_to_b6.Gender=="Monsieur"].mean()
female_mean = b1_to_b6['Duration in months'][b1_to_b6.Gender=="Madame"].mean()
print("Average duration for male students: " + str(male_mean))
print("Average duration for female students: " + str(female_mean))
print(b1_to_b6.Gender.value_counts())

Average duration for male students: 42.0518731988
Average duration for female students: 39.5555555556
Monsieur    347
Madame       27
Name: Gender, dtype: int64


In [229]:
# Two-Sample T-Test
import scipy.stats as stats
stats.ttest_ind(a= b1_to_b6['Duration in months'][b1_to_b6.Gender=="Monsieur"],
                b= b1_to_b6['Duration in months'][b1_to_b6.Gender=="Madame"],
                equal_var=False) 

Ttest_indResult(statistic=1.860117047790256, pvalue=0.071443584251684678)

The test yields a p-value of 0.0714, which means there is a 7.14% chance we'd see sample data this far apart if the two groups tested are actually identical. If we were using a 95% confidence level we would fail to reject the null hypothesis (being that the means are the same), since the p-value is greater than the corresponding significance level of 5%. Therefore, the difference in average duration is not statistically significant.

# Exercise 2

_Perform a similar operation to what described above, this time for Master students. Notice that this data is more tricky, as there are many missing records in the IS-Academia database. Therefore, try to guess how much time a master student spent at EPFL by at least checking the distance in months between `Master semestre 1` and `Master semestre 2`. If the `Mineur` field is not empty, the student should also appear registered in Master semestre 3. Last but not the least, don't forget to check if the student has an entry also in the `Projet Master` tables. Once you can handle well this data, compute the "average stay at EPFL" for master students. Now extract all the students with a `Spécialisation` and compute the "average stay" per each category of that attribute -- compared to the general average, can you find any specialization for which the difference in average is statistically significant?_

In [166]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## (REMINDER) constants and functions as previously described/explained
# parameter keys
PARAM_GPS = 'ww_x_GPS'
PARAM_MAJ = 'ww_x_UNITE_ACAD'
PARAM_YR = 'ww_x_PERIODE_ACAD'
PARAM_STATUS = 'ww_x_PERIODE_PEDAGO'
PARAM_SEM = 'ww_x_HIVERETE'

# base urls
FILTER_BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270'
DATA_BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_b_list=1'

# open the Series for the filter dropdown menus made before
majors = pd.read_pickle("major_pickle")
acad_yrs = pd.read_pickle("acad_yr_pickle")
statuses = pd.read_pickle("status_pickle")
semesters = pd.read_pickle("sem_pickle")

# extracting GPS value
def is_valid_gps(gps):
    return gps != "-1"
def extract_gps(content):
    soup = BeautifulSoup(content, 'html.parser')
    elements = soup.find_all('a', class_='ww_x_GPS')
    for element in elements:
        raw_info = element.attrs['onclick']
        gps = raw_info.split("'")[1].split('=')[1]
        if is_valid_gps(gps):
            return gps

# combine steps 1-4 from procedure of extracting HTML content of desired students
def get_html_content(maj, yr, stat, sem):
    # obtain gps
    payload = {PARAM_MAJ: majors[maj],
               PARAM_YR: acad_yrs[yr], 
               PARAM_STATUS: statuses[stat],
               PARAM_SEM: semesters[sem]}
    r_filt = requests.get(FILTER_BASE_URL, params=payload)
    gps = extract_gps(r_filt.content)
    # get list of students
    payload[PARAM_GPS] = gps
    r_list = requests.get(DATA_BASE_URL, params=payload)
    return r_list.content

# calculate the start date of a given academic year and semester type
def sem_start_date(academic_year, semester):
    start_year, next_year = academic_year.split('-')
    if semester == "Semestre d'automne":
        return start_year + '-09'
    else:
        return next_year + '-03'

def create_student_entry(stat, info, yr, sem):
    student = {}
    student['Gender'] = info[0].string
    student['Name'] = info[1].string
    student['Specialisation'] = info[4].string
    student['Minor'] = info[6].string
    student[stat] = sem_start_date(yr, sem)
    return student

# scrape student data for a particular major and student status
def scrape_student_data(maj, stat):
    dic = {}
    # go through all statuses, years, and semesters
    for yr in acad_yrs.keys():
        for sem in semesters.keys():
            html_content = get_html_content(maj, yr, stat, sem)
            # parse with beautiful soup
            soup_students = BeautifulSoup(html_content, 'html.parser')
            rows = soup_students.find_all('tr')
            # students are starting after two rows
            for row in rows[2:]:
                student = row.find_all('td')
                sciper = student[10].string
                # keep earliest year in case a student repeated first semester
                if int(stat.split(' ')[-1]) == 1: # obtaining number of semester
                    if sciper not in dic:
                        dic[sciper] = create_student_entry(stat, student, yr, sem)
                # for other semesters replace with latest
                else:
                    dic[sciper] = create_student_entry(stat, student, yr, sem)
    df = pd.DataFrame.from_dict(dic, orient='index')
    return df

In [167]:
major = 'Informatique'
df_m1 = scrape_student_data(major, 'Master semestre 1')
df_m2 = scrape_student_data(major, 'Master semestre 2')
df_m3 = scrape_student_data(major, 'Master semestre 3')
df_m1.to_pickle("df_m1_pickle")
df_m2.to_pickle("df_m2_pickle")
df_m3.to_pickle("df_m3_pickle")

In [168]:
df_m3.head(5)

Unnamed: 0,Master semestre 3,Gender,Specialisation,Name,Minor
128911,2007-09,Monsieur,Internet computing,Gulati Asheesh,
129093,2007-09,Monsieur,,Zhou Maoan,
129326,2007-09,Monsieur,,Ni Zhong Zhong,
145546,2007-09,Monsieur,,Clivaz Jean-Philippe,
145957,2007-09,Monsieur,,Hügli Michael,


#### Try to guess how much time a master student spent at EPFL by at least checking the distance in months between Master semestre 1 and Master semestre 2

Join M1 and M2 students (both fields shoud exits) as we did for B1 and B2

In [169]:
df_m1 = pd.read_pickle("df_m1_pickle")
df_m2 = pd.read_pickle("df_m2_pickle")
m1_to_m2 = df_m1[["Master semestre 1","Gender","Name"]].join(df_m2["Master semestre 2"], how='inner')
m1_to_m2.tail()

Unnamed: 0,Master semestre 1,Gender,Name,Master semestre 2
260806,2015-09,Monsieur,Rouault Sébastien Louis Alexandre,2016-03
260811,2015-09,Monsieur,Loiseleur Thibaut,2016-03
260968,2015-09,Madame,Kabil Selen Hande,2016-03
261006,2015-09,Madame,M'Hamdi Meryem,2016-03
261146,2015-09,Monsieur,Zakhour George,2016-03


In [170]:
# date is in 'year-month' format. e.g. 2015-07
def months_between_dates(start_date, end_date):
    start_year, start_month = start_date.split('-')
    end_year, end_month = end_date.split('-')
    return (int(end_year) - int(start_year)) * 12 + int(end_month) - int(start_month) + 6

def master_duration_rough(row):
    return months_between_dates(row['Master semestre 1'], row['Master semestre 2'])

m1_to_m2['Duration in months'] = m1_to_m2.apply(master_duration_rough, axis=1)
m1_to_m2.head()

Unnamed: 0,Master semestre 1,Gender,Name,Master semestre 2,Duration in months
146330,2007-09,Monsieur,Cardinaux Damien,2008-03,12
146742,2008-09,Monsieur,Marx Clément,2010-03,24
146929,2007-09,Monsieur,Junod Antoine,2008-03,12
147008,2011-09,Monsieur,Good Xavier,2013-03,24
152232,2007-09,Monsieur,Anagnostaras David,2008-03,12


In [171]:
# some interesting cases of an M2 before their M1...
print(len(m1_to_m2.loc[m1_to_m2["Duration in months"]<=0]))
m1_to_m2.loc[m1_to_m2["Duration in months"]<=0].head(10)

15


Unnamed: 0,Master semestre 1,Gender,Name,Master semestre 2,Duration in months
171206,2010-09,Monsieur,Testuz Stéphane,2010-03,0
178786,2011-09,Monsieur,Coiro Andrea,2011-03,0
180816,2013-09,Monsieur,Fond Matthieu,2013-03,0
192345,2014-09,Monsieur,Camenzind Marzell,2014-03,0
196034,2015-09,Monsieur,Perrin Sami,2015-03,0
202973,2016-09,Monsieur,Cartier Alexis Victor Xavier,2016-03,0
204869,2016-09,Monsieur,Imani Ismail,2016-03,0
208359,2016-09,Monsieur,Sidorenko Semion,2016-03,0
209450,2016-09,Monsieur,Zellweger Fabien André,2016-03,0
218357,2016-09,Monsieur,Ruetschi Romain Roland,2016-03,0


In [172]:
# We will drop those cases when computing the mean
m1_to_m2.loc[m1_to_m2["Duration in months"]>0].mean()

Duration in months    15.861148
dtype: float64

#### If the Mineur field is not empty, the student should also appear registered in Master semestre 3

We will specialization as well since this also takes an extra semester. According to EPFL regulations a minor or specialization must be chosen by Master semester 2:
* http://ic.epfl.ch/page-97562-en.html
* http://ic.epfl.ch/specializations

So we will take the Specialization or Minor from M2.

In [173]:
m1_to_m3 = df_m1[["Master semestre 1","Gender","Name"]].join(df_m2[["Master semestre 2","Minor","Specialisation"]], 
                                                            how='inner').join(df_m3["Master semestre 3"])
m1_to_m3.head()

Unnamed: 0,Master semestre 1,Gender,Name,Master semestre 2,Minor,Specialisation,Master semestre 3
146330,2007-09,Monsieur,Cardinaux Damien,2008-03,,,2008-09
146742,2008-09,Monsieur,Marx Clément,2010-03,,"Signals, Images and Interfaces",2012-09
146929,2007-09,Monsieur,Junod Antoine,2008-03,,,
147008,2011-09,Monsieur,Good Xavier,2013-03,,,2012-09
152232,2007-09,Monsieur,Anagnostaras David,2008-03,"Mineur en Management, technologie et entrepren...",,2008-09


We can see cases of a students (Cardinaux Damien and Good Xavier) that don't have a minor or a specialisation but have an entry for `Master semestre 3`. Therefore, we will simply check if `Master semestre 3` is not `NaN` (rather than checking is minor or specialisation is empty). This will give a more accurate value for the stay at EPFL.

In [174]:
# compute duration
def master_duration_in_months(row):
    start_date = row['Master semestre 1']
    end_date = row['Master semestre 2']
    if pd.notnull(row['Master semestre 3']):
        end_date = row['Master semestre 3']
    if pd.isnull(start_date) or pd.isnull(end_date):
        return np.nan
    return months_between_dates(start_date, end_date)

m1_to_m3['Duration'] = m1_to_m3.apply(lambda row: master_duration_in_months(row), axis=1)
m1_to_m3.head(10)

Unnamed: 0,Master semestre 1,Gender,Name,Master semestre 2,Minor,Specialisation,Master semestre 3,Duration
146330,2007-09,Monsieur,Cardinaux Damien,2008-03,,,2008-09,18
146742,2008-09,Monsieur,Marx Clément,2010-03,,"Signals, Images and Interfaces",2012-09,54
146929,2007-09,Monsieur,Junod Antoine,2008-03,,,,12
147008,2011-09,Monsieur,Good Xavier,2013-03,,,2012-09,18
152232,2007-09,Monsieur,Anagnostaras David,2008-03,"Mineur en Management, technologie et entrepren...",,2008-09,18
153066,2007-09,Monsieur,Aeberhard François-Xavier,2010-03,,Internet computing,2009-09,30
153746,2007-09,Monsieur,Cassina Ilya,2008-03,,,,12
153762,2008-09,Monsieur,Conus Johann,2010-03,,,2009-09,18
154080,2007-09,Monsieur,Fomene Tierry Alain,2009-03,,Internet computing,2009-09,30
154573,2007-09,Madame,Benabdallah Zeineb,2009-03,,Biocomputing,2009-09,30


In [175]:
# some interesting cases of an M2/M3 before their M1...
len(m1_to_m3.loc[m1_to_m3["Duration"]<=0])

15

We still drop the cases of students that have a Duration less than or equal to `0` before computing the mean.

In [178]:
m1_to_m3 = m1_to_m3.loc[m1_to_m3["Duration"]>0]
m1_to_m3["Duration"].mean()

18.0

#### Last but not the least, don't forget to check if the student has an entry also in the Projet Master tables. Once you can handle well this data, compute the "average stay at EPFL" for master students.

#### Now extract all the students with a Spécialisation and compute the "average stay" per each category of that attribute -- compared to the general average, can you find any specialization for which the difference in average is statistically significant?

In [192]:
# do not drop na just to see how many don't have specialization
m1_to_m3.Specialisation.value_counts(dropna=False)

NaN                               544
Internet computing                 77
Foundations of Software            56
Signals, Images and Interfaces     22
Computer Engineering - SP          17
Software Systems                   16
Information Security - SP           7
Data Analytics                      4
Service science                     2
Biocomputing                        2
Computer Science Theory             1
Internet Information Systems        1
Name: Specialisation, dtype: int64

In [196]:
#m1_to_m3[m1_to_m3.Specialisation.notnull()].groupby('Specialisation')['Duration'].mean()
m1_to_m3_copy = m1_to_m3[["Specialisation","Duration"]]
m1_to_m3_copy.dropna().groupby('Specialisation')['Duration'].mean()

Specialisation
Biocomputing                      30.000000
Computer Engineering - SP         19.764706
Computer Science Theory           18.000000
Data Analytics                    16.500000
Foundations of Software           21.107143
Information Security - SP         18.000000
Internet Information Systems      18.000000
Internet computing                20.961039
Service science                   18.000000
Signals, Images and Interfaces    24.000000
Software Systems                  18.000000
Name: Duration, dtype: float64

In [206]:
m1_to_m3_copy[m1_to_m3_copy.Specialisation == "Data Analytics"]

Unnamed: 0,Specialisation,Duration
214573,Data Analytics,12
224356,Data Analytics,18
225757,Data Analytics,18
256553,Data Analytics,18


In [221]:
specs = m1_to_m3_copy.dropna().Specialisation.unique()
all_durations = m1_to_m3.Duration.values
p_vals = {}
for spec in specs:
    durations_spec = m1_to_m3_copy[m1_to_m3_copy.Specialisation == spec].Duration.values
    result = stats.ttest_ind(a= durations_spec,b= all_durations,equal_var=False)
    p_vals[spec] = result[1]*100
p_vals = pd.Series(p_vals)
p_vals

Biocomputing                      3.839277e-296
Computer Engineering - SP          1.763631e+01
Computer Science Theory                     NaN
Data Analytics                     3.922487e+01
Foundations of Software            1.601334e-02
Information Security - SP          1.000000e+02
Internet Information Systems                NaN
Internet computing                 5.455292e-03
Service science                    1.000000e+02
Signals, Images and Interfaces     1.152510e+00
Software Systems                   1.000000e+02
dtype: float64

We have to `NaN` since there was only one person who did those specialisations.

In [222]:
# confidence interval of 95%
p_vals[p_vals<5]

Biocomputing                      3.839277e-296
Foundations of Software            1.601334e-02
Internet computing                 5.455292e-03
Signals, Images and Interfaces     1.152510e+00
dtype: float64

Using p-values and a confidence interval of 95%, the difference in average for the following specialisation is statistically significant:
* **Biocomputing**
* **Foundations of Software**
* **Internet computing**
* **Signals, Images and Interfaces**

# Exercise 3