In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

**NOTE:** We have suppressed some outputs so that the notebook isn't too long.

# Familiarizing with the tools and data

Let's first analyze the HTML content of the IS-Academia directory. We will do this by using the `requests` library to `GET` the HTML content given a URL. We obtained the following URL using Postman and Postman Interceptor.

In [2]:
r = requests.get("http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247")

Now we will use `BeautifulSoup` to parse the data and visualize it nicely with the `prettify()` method.

In [3]:
soup = BeautifulSoup(r.content, 'html.parser')
#print(soup.prettify())

With `BeautifulSoup`, we can conveniently dig deeper into the HTML content as described in this tutorial: https://www.crummy.com/software/BeautifulSoup/bs4/doc/. We will isolate the filters used to distinguish students by Major (Unité académique), Academic Year (Période académique), Student Status (Période pédagogique), and Semester Type (Type de semestre). We have identified from the output above that the filters are in the `body`, inside the `<table>` tag with `id="filtre"`. Finally, we can use `find_all('tr')` to get each filter as an entry in a list.

In [4]:
filters = soup.body.find(id="filtre").find_all('tr')
#print(filters)

Admittedly, not the nicest output. Apart from the square brackets, it's hard to even tell that it's a list! Let's try to output this more cleanly. From the `prettify()` output, we saw that each filter option, e.g. `Architecture` for Unité académique or `2010-2011` for Période académique, has an `option` tag surrounding it. Let's use `find_all('option')` with each item in the above list to cleanly output the filter options.

In [5]:
for field in filters:
    print(field.find_all('option'))

[<option value="null"></option>, <option value="942293">Architecture</option>, <option value="246696">Chimie et génie chimique</option>, <option value="943282">Cours de mathématiques spéciales</option>, <option value="637841336">EME (EPFL Middle East)</option>, <option value="942623">Génie civil</option>, <option value="944263">Génie mécanique</option>, <option value="943936">Génie électrique et électronique </option>, <option value="2054839157">Humanités digitales</option>, <option value="249847">Informatique</option>, <option value="120623110">Ingénierie financière</option>, <option value="946882">Management de la technologie</option>, <option value="944590">Mathématiques</option>, <option value="945244">Microtechnique</option>, <option value="945571">Physique</option>, <option value="944917">Science et génie des matériaux</option>, <option value="942953">Sciences et ingénierie de l'environnement</option>, <option value="945901">Sciences et technologies du vivant</option>, <option va

That's a bit better. We can see a `value` associated with each field option, e.g. `249847` for `Informatique`. Let's actually use some of these filters on the website itself and intercept the requests using Postman + Postman Interceptor. Postman helps us track which URLs are requested and analyze the corresponding HTML. Let's check out the URL and HTML content (with `BeautifulSoup`) when we select the following field options: `Informatique`, `2009-2010`, `Bachelor semestre 1`, and `Semestre d'automne`.

http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270&zz_x_UNITE_ACAD=Informatique&ww_x_UNITE_ACAD=249847&zz_x_PERIODE_ACAD=2009-2010&ww_x_PERIODE_ACAD=978195&zz_x_PERIODE_PEDAGO=Bachelor+semestre+1&ww_x_PERIODE_PEDAGO=249108&zz_x_HIVERETE=Semestre+d%27automne&ww_x_HIVERETE=2936286&dummy=ok

In the URL we can see the options we selected! Moreover, they have been used as parameters for the URL along with their corresponding `value` attribute. The parameter names (that `Informatique`, `2009-2010`, `Bachelor semestre 1`, and `Semestre d'automne` are being set to) could be identified by navigating through our `filters` list (by going into the `td` tag and then selecting the `name` attribute of the `input` tag).

In [6]:
for field in filters:
    print(field.td.input["name"])

zz_x_UNITE_ACAD
zz_x_PERIODE_ACAD
zz_x_PERIODE_PEDAGO
zz_x_HIVERETE


`zz_x_*` seems to hold the displayed string of each option and `ww_x_*` the corresponding `value` attribute. However, it is also possible to get the same HTML content without the `zz_x_*` parameters by adding the `ww_b_list` parameter (we found this out using Postman):

http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_i_reportModelXsl=133685270&ww_x_HIVERETE=2936286&ww_x_PERIODE_ACAD=978195&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_PEDAGO=249108
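To make the pairing between the string parameters and the `value` parameters explicit, the query string of the intercepted URL can be split with the standard library. This is only an illustrative sketch: the names `labels` and `values` are ours and do not appear elsewhere in the notebook.

```python
from urllib.parse import urlsplit, parse_qs

# URL intercepted with Postman for Informatique, 2009-2010, Bachelor semestre 1, autumn
url = ("http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?"
       "ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270"
       "&zz_x_UNITE_ACAD=Informatique&ww_x_UNITE_ACAD=249847"
       "&zz_x_PERIODE_ACAD=2009-2010&ww_x_PERIODE_ACAD=978195"
       "&zz_x_PERIODE_PEDAGO=Bachelor+semestre+1&ww_x_PERIODE_PEDAGO=249108"
       "&zz_x_HIVERETE=Semestre+d%27automne&ww_x_HIVERETE=2936286&dummy=ok")

# keep_blank_values=True keeps empty parameters such as ww_c_langue
params = parse_qs(urlsplit(url).query, keep_blank_values=True)

# zz_x_<NAME> carries the displayed string, ww_x_<NAME> the option's value attribute
labels = {k[len("zz_x_"):]: v[0] for k, v in params.items() if k.startswith("zz_x_")}
values = {k[len("ww_x_"):]: v[0] for k, v in params.items() if k.startswith("ww_x_")}

for name in labels:
    print(name, "->", labels[name], "=", values[name])
```

Each filter name then maps to one readable label and one numeric identifier, which is the pairing the dictionaries built later in the notebook rely on.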

When checking out the HTML content, we see a new table at the bottom (corresponding to the options we see on the IS-Academia portal) with attribute `border="0"`. Let's check it out with `BeautifulSoup`.

In [7]:
r = requests.get("http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270&zz_x_UNITE_ACAD=Informatique&ww_x_UNITE_ACAD=249847&zz_x_PERIODE_ACAD=2009-2010&ww_x_PERIODE_ACAD=978195&zz_x_PERIODE_PEDAGO=Bachelor+semestre+1&ww_x_PERIODE_PEDAGO=249108&zz_x_HIVERETE=Semestre+d%27automne&ww_x_HIVERETE=2936286&dummy=ok")
soup = BeautifulSoup(r.content, 'html.parser')
#print(soup.prettify())

We now see this new parameter `ww_x_GPS`. Let's follow the link for `Informatique, 2009-2010, Bachelor semestre 1` and analyze as before. The webpage now shows the corresponding list of students! With Postman, we see a `GET` request with the following URL:

http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=213617925&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_b_list=1&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=978195&ww_x_PERIODE_PEDAGO=249108&ww_x_HIVERETE=2936286

This is very similar to the previous URL with one key difference: the new parameter `ww_x_GPS` with its corresponding value has been added to the URL.

We have now "cracked" the manner in which to extract the desired HTML content from IS-Academia! The general procedure is as follows:

1. Identify the `value` attributes according to desired filters.
2. Using `requests`, build the URL for filter search results with the `value` attributes as parameters of the following base URL: http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270
3. Use `BeautifulSoup` to extract the `ww_x_GPS` parameter value from the HTML content.
4. With `requests`, build the URL with the newly acquired `ww_x_GPS` value and the `value` attributes as parameters of the following base URL: http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_b_list=1
5. We then have a table of students in HTML format. We can use `BeautifulSoup` (or, alternatively, the `read_html()` function of `pandas`) to conveniently access the data.

Below we will go through the above steps for picking out the students we need for our analysis in the exercises.

#### 1. Identify `value` attributes according to desired filters

In order to perform the first step conveniently, we will create a few dictionaries so we can "translate" the desired filter options into their corresponding `value` attributes. These dictionaries will be used in the following exercises.

In [8]:
# same URL as before, identified with Postman
r = requests.get("http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247")
# scrape content using BeautifulSoup
soup = BeautifulSoup(r.content, 'html.parser')
# obtain list of filters as previously described
filters = soup.body.find(id="filtre").find_all('tr')

Now we define a new function `create_series()` in order to scrape the string and corresponding `value` attribute from a list of `option` tags. The function places them in a dictionary with the string as the key and the `value` attribute as the (you got it) value. We then construct a pandas `Series` from the dictionary and return it. The loop starts at index 1 to skip the empty placeholder option (`value="null"`).

In [9]:
# function to create dictionary for each filter
def create_series(field_list):
    field_dict = {}
    for i in range(1, len(field_list)):
        field_dict[field_list[i].string] = field_list[i]["value"]
    return pd.Series(data=field_dict)

# Unité académique, Période académique, Période pédagogique, Type de semestre
major,acad_yr,status,semester = [create_series(filters[x].find_all('option')) for x in range(0,4)]

Let's `pickle` these `Series` so we don't have to rely on `requests` every time.

In [10]:
pickle_names = ["major_pickle", "acad_yr_pickle", "status_pickle", "sem_pickle"]
dicts = [major,acad_yr,status,semester]
res = [dic.to_pickle(pname) for dic, pname in zip(dicts,pickle_names)]

Now we can conveniently obtain the necessary parameters to build the URLs for filtering students based on Major (Unité académique), Academic Year (Période académique), Student Status (Période pédagogique), and Semester Type (Type de semestre)!
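For readers who want to try the label-to-`value` lookup without network access, here is a small offline sketch. It deliberately swaps `BeautifulSoup` for the standard library's `html.parser`, and `OptionCollector` and `sample` are hypothetical names of ours; the HTML fragment is a shortened copy of the options printed earlier.

```python
from html.parser import HTMLParser

class OptionCollector(HTMLParser):
    """Collect {option text: value attribute} from <option> tags."""

    def __init__(self):
        super().__init__()
        self.options = {}
        self._value = None  # value attribute of the <option> currently being read
        self._text = []     # text fragments seen inside that <option>

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._value = dict(attrs).get("value")
            self._text = []

    def handle_data(self, data):
        if self._value is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self._value is not None:
            # note: labels are stripped here, unlike in create_series()
            label = "".join(self._text).strip()
            if label:  # skip the empty placeholder option (value="null")
                self.options[label] = self._value
            self._value = None

sample = ('<select><option value="null"></option>'
          '<option value="942293">Architecture</option>'
          '<option value="249847">Informatique</option></select>')

collector = OptionCollector()
collector.feed(sample)
print(collector.options)
```

The resulting dictionary plays the same role as one of the `Series` above: indexing it with `'Informatique'` gives the identifier to send as `ww_x_UNITE_ACAD`.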

#### 2. Using `requests`, build the URL for filter search results with the `value` attributes as parameters

Now let's build the required URL so we can obtain the `ww_x_GPS` parameter value to then gather the students that meet our search criteria. The following is our base URL for the filter search results:

In [11]:
FILTER_BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270'

Now using `requests`, we can build the URL with the necessary parameters as we saw above. We have the following parameters:

In [12]:
# parameter keys
PARAM_MAJ = 'ww_x_UNITE_ACAD'
PARAM_YR = 'ww_x_PERIODE_ACAD'
PARAM_STATUS = 'ww_x_PERIODE_PEDAGO'
PARAM_SEM = 'ww_x_HIVERETE'

Now let's pass parameters to the URL as described here (http://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls) and make a `GET` request. Let's say we want students in `Informatique`, `2009-2010`, `Bachelor semestre 1`, and `Semestre d'automne` as before.

In [13]:
# create URL for filtered result
payload_filter = {PARAM_MAJ: major['Informatique'], 
                  PARAM_YR: acad_yr['2009-2010'], 
                  PARAM_STATUS: status['Bachelor semestre 1'], 
                  PARAM_SEM: semester["Semestre d'automne"]}
r = requests.get(FILTER_BASE_URL, params=payload_filter)
print(r.url)

http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=978195&ww_x_HIVERETE=2936286&ww_x_PERIODE_PEDAGO=249108
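As the printed URL shows, `requests` appends the `params` dictionary to the query string that is already part of the base URL. The same URL can be assembled by hand with `urllib.parse.urlencode`, which is handy for inspecting it without sending a request. This is a sketch with the identifiers hard-coded from the output above; the parameter order may differ from what `requests` produces.

```python
from urllib.parse import urlencode

FILTER_BASE_URL = ('http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter'
                   '?ww_b_list=1&ww_i_reportmodel=133685247'
                   '&ww_c_langue=&ww_i_reportModelXsl=133685270')

# identifiers for Informatique, 2009-2010, Bachelor semestre 1, Semestre d'automne
payload = {'ww_x_UNITE_ACAD': '249847',
           'ww_x_PERIODE_ACAD': '978195',
           'ww_x_PERIODE_PEDAGO': '249108',
           'ww_x_HIVERETE': '2936286'}

# the base URL already has a query string, so the extra parameters are joined with '&'
url = FILTER_BASE_URL + '&' + urlencode(payload)
print(url)
```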


#### 3. Use `BeautifulSoup` to extract the `ww_x_GPS` parameter value from the HTML content. 

Let's see how we can use `BeautifulSoup` to navigate through the HTML content and extract the `ww_x_GPS` parameter value. `prettify()` can help us with this.

In [14]:
soup = BeautifulSoup(r.content, 'html.parser')
#print(soup.prettify())

We need to pick out the `a` tags that have a `class` attribute equal to `ww_x_GPS`. This can be done with the `find_all()` method.

In [15]:
soup.find_all('a', class_='ww_x_GPS')

[<a class="ww_x_GPS" href="javascript:void(0)" onclick="loadReport('ww_x_GPS=-1');return false;">Tous</a>,
 <a class="ww_x_GPS" href="javascript:void(0)" onclick="loadReport('ww_x_GPS=213617925');return false;">Informatique, 2009-2010, Bachelor semestre 1</a>]

Now we have a list of HTML entries that contain `ww_x_GPS` values. The value itself is in the `onclick` attribute. We can extract the `ww_x_GPS` value by parsing the information contained in this attribute. We will assume that the above list contains only two entries, as our search criteria will ensure this. The two categories (which can be seen on the IS-Academia site) are "Tous" and the category of students we are interested in. "Tous" has a `ww_x_GPS` value of `-1`, so we make sure to return the other value.

In [16]:
# assuming we only get two results with one of them being "Tous"
def extract_gps(content):
    soup = BeautifulSoup(content, 'html.parser')
    elements = soup.find_all('a', class_='ww_x_GPS')
    for element in elements:
        raw_info = element.attrs['onclick']
        gps = raw_info.split("'")[1].split('=')[1]
        if gps != "-1":
            return gps

gps = extract_gps(r.content)
print(gps)

213617925
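The chained `split()` calls depend on the exact quoting of the `onclick` string. A regular expression is a slightly more tolerant alternative. The sketch below is not what the notebook uses; `GPS_PATTERN` and `gps_from_onclick` are hypothetical names, and the two input strings are copied from the `find_all()` output above.

```python
import re

# capture the (possibly negative) integer that follows "ww_x_GPS="
GPS_PATTERN = re.compile(r"ww_x_GPS=(-?\d+)")

def gps_from_onclick(onclick):
    """Return the ww_x_GPS value found in an onclick string, or None if absent."""
    match = GPS_PATTERN.search(onclick)
    return match.group(1) if match else None

onclicks = ["loadReport('ww_x_GPS=-1');return false;",
            "loadReport('ww_x_GPS=213617925');return false;"]

found = [gps_from_onclick(s) for s in onclicks]
# drop missing values and the "Tous" entry (-1)
specific = [v for v in found if v is not None and v != "-1"]
print(specific)
```

Note that `extract_gps()` returns `None` when no entry other than "Tous" exists, for example when a filter combination has no students, so callers may want to check for that case before building the next URL.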


#### 4. With `requests`, build the URL with the newly acquired `ww_x_GPS` value and the `value` attributes as parameters. 

Now we have a new base URL and an additional parameter for our payload.

In [17]:
PARAM_GPS = 'ww_x_GPS'
DATA_BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_b_list=1'

As in Step 2, we use `requests` to build the URL with the necessary parameters.

In [18]:
# create URL for filtered result
payload_data = {PARAM_GPS: gps,
                PARAM_MAJ: major['Informatique'], 
                PARAM_YR: acad_yr['2009-2010'], 
                PARAM_STATUS: status['Bachelor semestre 1'], 
                PARAM_SEM: semester["Semestre d'automne"]}
r = requests.get(DATA_BASE_URL, params=payload_data)
print(r.url)

http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_b_list=1&ww_x_UNITE_ACAD=249847&ww_x_PERIODE_ACAD=978195&ww_x_GPS=213617925&ww_x_HIVERETE=2936286&ww_x_PERIODE_PEDAGO=249108


Following the above link takes us to the list of students meeting the following criteria: `Informatique`, `2009-2010`, `Bachelor semestre 1`, and `Semestre d'automne`.

#### 5. Use `BeautifulSoup` to conveniently navigate through and access the data.

In [19]:
soup_students = BeautifulSoup(r.content, 'html.parser')
# visualize
#print(soup_students.prettify())

From the `prettify()` output, we see that student info is contained within `<tr>` tags and that the first two entries between `<tr>` are for general information about the students. Therefore, to get all the students, we can use `find_all()` to get all the `<tr>` entries and drop the first two.

In [20]:
rows = soup_students.find_all('tr')
students = rows[2:]
# let's look at one of the student entries
students[0]

<tr><td style="white-space:nowrap">Monsieur</td><td style="white-space:nowrap">Abdallah Jad</td><td style="white-space:nowrap"></td><td style="white-space:nowrap"></td><td style="white-space:nowrap"></td><td style="white-space:nowrap"></td><td style="white-space:nowrap"></td><td style="white-space:nowrap">Présent</td><td style="white-space:nowrap"></td><td style="white-space:nowrap"></td><td>194197</td><td style="white-space:nowrap"></td></tr>

Each data point (e.g. gender, minor, sciper) about the student is surrounded by `<td>` tags. We can again use `find_all()` to access these elements and print them nicely. The second row of `rows` contains the label for each data point.

In [21]:
student = students[0].find_all('td')
labels = rows[1].find_all('th') # separated by <th> tags
for i in range(len(labels)):
    print(str(i) + ", " + labels[i].string + ": ", end="")
    print(student[i].string)

0, Civilité: Monsieur
1, Nom Prénom: Abdallah Jad
2, Orientation Bachelor: None
3, Orientation Master: None
4, Spécialisation: None
5, Filière opt.: None
6, Mineur: None
7, Statut: Présent
8, Type Echange: None
9, Ecole Echange: None
10, No Sciper: 194197


Now we can see the corresponding index for a particular data point of a student and pick out what we want for the following exercises!
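Instead of remembering numeric indices, the labels and cells can be paired into a dictionary. The sketch below uses plain lists copied from the output above rather than live `BeautifulSoup` objects, and `record` is a name of ours. The example row has 12 `<td>` cells but only 11 labels; `zip()` stops at the shorter list, so the trailing empty cell is ignored.

```python
labels = ["Civilité", "Nom Prénom", "Orientation Bachelor", "Orientation Master",
          "Spécialisation", "Filière opt.", "Mineur", "Statut",
          "Type Echange", "Ecole Echange", "No Sciper"]

# the 12 cells of the example row; empty cells are represented as None
cells = ["Monsieur", "Abdallah Jad", None, None, None, None, None,
         "Présent", None, None, "194197", None]

# pair each label with its cell; zip() truncates to the 11 labels
record = dict(zip(labels, cells))
print(record["No Sciper"], record["Civilité"], record["Statut"])
```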

# Exercise 1

_We will focus exclusively on the academic unit `Informatique`._

_Obtain all the data for the Bachelor students, starting from 2007. Keep only the students for which you have an entry for both `Bachelor semestre 1` and `Bachelor semestre 6`. Compute how many months it took each student to go from the first to the sixth semester. Partition the data between male and female students, and compute the average -- is the difference in average statistically significant?_

For this problem, we will create two `DataFrame`s: one for `Bachelor semestre 1` (B1) students and another for `Bachelor semestre 6` (B6) students. For the B1 `DataFrame`, we will start from 2007 and add students when they are first enrolled as `Informatique` and B1. If they repeat, their info will not be added again / updated. For B6, though, we will take the most recent entry. When storing the date of the semester, we will store it in `Year-Month` format; if the semester is Autumn we will use `09` as the month and if the semester is Spring we will use `03` as the month. The duration is calculated as: 12 x (`B6 Year` - `B1 Year`) + (`B6 Month` - `B1 Month`) + 6. We add 6 to account for the duration of the 6th semester.
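As a quick worked example of the duration formula, here is a minimal sketch. The helper names `semester_start` and `months_b1_to_b6` are ours, and the spring label `"Semestre de printemps"` is an assumed placeholder, since any label other than `Semestre d'automne` is treated as spring.

```python
def semester_start(academic_year, semester):
    """Map e.g. ('2007-2008', "Semestre d'automne") to (2007, 9)."""
    start_year, end_year = academic_year.split('-')
    if semester == "Semestre d'automne":
        return int(start_year), 9   # autumn starts in September of the first year
    return int(end_year), 3         # spring starts in March of the second year

def months_b1_to_b6(b1, b6):
    """12 x (B6 year - B1 year) + (B6 month - B1 month) + 6."""
    (y1, m1), (y6, m6) = b1, b6
    return 12 * (y6 - y1) + (m6 - m1) + 6

b1 = semester_start('2007-2008', "Semestre d'automne")

# on-time student: B6 in spring of 2009-2010
on_time = months_b1_to_b6(b1, semester_start('2009-2010', "Semestre de printemps"))
# student whose last B6 is one year later
delayed = months_b1_to_b6(b1, semester_start('2010-2011', "Semestre de printemps"))
print(on_time, delayed)  # 36 48
```

A student who never repeats a semester therefore gets 36 months, and each repeated year adds 12.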

**NOTE**: We realize that there is some discussion on Slack with people considering students that finished their studies with `Bachelor semestre 5`; however, the assignment states to compute the number of months for each student to go from the first to the sixth semester (how long it took for their Bachelor's is a different problem). We did, however, take the latest of the (possibly multiple) 6th semesters to get a bit closer to the full duration (while keeping to the original problem of finding the duration until the 6th semester).

In [22]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## (REMINDER) constants and functions as previously described/explained
# parameter keys
PARAM_GPS = 'ww_x_GPS'
PARAM_MAJ = 'ww_x_UNITE_ACAD'
PARAM_YR = 'ww_x_PERIODE_ACAD'
PARAM_STATUS = 'ww_x_PERIODE_PEDAGO'
PARAM_SEM = 'ww_x_HIVERETE'

# base urls
FILTER_BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270'
DATA_BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_b_list=1'

# open the Series for the filter dropdown menus made before
majors = pd.read_pickle("major_pickle")
acad_yrs = pd.read_pickle("acad_yr_pickle")
statuses = pd.read_pickle("status_pickle")
semesters = pd.read_pickle("sem_pickle")

# extracting GPS value
def extract_gps(content):
    soup = BeautifulSoup(content, 'html.parser')
    elements = soup.find_all('a', class_='ww_x_GPS')
    for element in elements:
        raw_info = element.attrs['onclick']
        gps = raw_info.split("'")[1].split('=')[1]
        if gps != "-1":
            return gps

# combine steps 1-4 from procedure of extracting HTML content of desired students
def get_html_content(maj, yr, stat, sem):
    # obtain gps
    payload = {PARAM_MAJ: majors[maj],
               PARAM_YR: acad_yrs[yr], 
               PARAM_STATUS: statuses[stat],
               PARAM_SEM: semesters[sem]}
    r_filt = requests.get(FILTER_BASE_URL, params=payload)
    gps = extract_gps(r_filt.content)
    # get list of students
    payload[PARAM_GPS] = gps
    r_list = requests.get(DATA_BASE_URL, params=payload)
    return r_list.content

# calculate the start date of a given academic year and semester type
def sem_start_date(academic_year, semester):
    start_year, next_year = academic_year.split('-')
    if semester == "Semestre d'automne":
        return start_year + '-09'
    else:
        return next_year + '-03'

# create entry for a particular student (using only necessary labels from those we saw earlier)
def create_student_entry(stat, info, yr, sem):
    student = {}
    student['Gender'] = info[0].string
    student[stat] = sem_start_date(yr, sem)
    return student

# scrape student data for a particular major and student status
def scrape_student_data(maj, stat):
    dic = {}
    # go through all statuses, years, and semesters
    for yr in acad_yrs.keys():
        for sem in semesters.keys():
            html_content = get_html_content(maj, yr, stat, sem)
            # parse with beautiful soup
            soup_students = BeautifulSoup(html_content, 'html.parser')
            rows = soup_students.find_all('tr')
            # students are starting after two rows
            for row in rows[2:]:
                student = row.find_all('td')
                sciper = student[10].string
                # keep earliest year in case a student repeated first semester
                if int(stat.split(' ')[-1]) == 1: # obtaining number of semester
                    if sciper not in dic:
                        dic[sciper] = create_student_entry(stat, student, yr, sem)
                # for other semesters replace with latest
                else:
                    dic[sciper] = create_student_entry(stat, student, yr, sem)
    df = pd.DataFrame.from_dict(dic, orient='index')
    return df

The last function, `scrape_student_data()`, deserves some description. Given a major and a student status (e.g. `Bachelor semestre 1` or `Master semestre 3`), it returns the students over all academic years and semesters (Spring or Autumn) that satisfy the given filters. It essentially runs through the 5 steps we described just before this exercise:
1. `get_html_content` performs the first 4 steps.
2. We use `BeautifulSoup` to navigate and parse the HTML content as described in the 5th step. We go through each row of the HTML table (starting from the 3rd as the first 2 are metadata) to add the student to a dictionary. For the first semester of a program (e.g. `Bachelor semestre 1` or `Master semestre 1`) we use the earliest entry of a student as this marks the true beginning. For later semesters, we use the most recent entry in case a student repeated a particular semester. In `create_student_entry()`, we only save the necessary info about the student for this exercise: SCIPER, Gender, and date of the semester.
3. Finally, we use the dictionary to create a `DataFrame` as this will be easier to manipulate for our analysis.
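The "earliest entry for the first semester, latest entry otherwise" rule can be illustrated with plain dictionaries. This is a sketch with made-up SCIPER numbers and dates; it assumes the records arrive in chronological order, which is also what the loops in `scrape_student_data()` implicitly rely on.

```python
# (sciper, semester start) pairs in chronological order; values are made up
records = [("100001", "2007-09"),
           ("100002", "2007-09"),
           ("100001", "2008-09")]  # student 100001 appears a second time

first_seen, last_seen = {}, {}
for sciper, date in records:
    first_seen.setdefault(sciper, date)  # only stored if the key is new: keeps the earliest
    last_seen[sciper] = date             # always overwritten: keeps the latest

print(first_seen["100001"], last_seen["100001"])  # 2007-09 2008-09
```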

Let's create a DataFrame for the B1 and the B6 students.

In [23]:
major = 'Informatique'
bach_sems = ['Bachelor semestre 1', 'Bachelor semestre 6']
df_b1, df_b6 = [scrape_student_data(major, bach_sem) for bach_sem in bach_sems]

Save to `pickle` so we don't always need to extract the HTML content.

In [24]:
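import sys
# pickling hits the default recursion limit, presumably because the cells hold
# bs4 `.string` objects that still reference the parse tree; raise the limit
sys.setrecursionlimit(50000)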
df_b1.to_pickle("b1_pickle")
df_b6.to_pickle("b6_pickle")

### Import data

In [25]:
df_b1 = pd.read_pickle("b1_pickle")
df_b6 = pd.read_pickle("b6_pickle")

### Prepare data

Join b1 and b6 students and keep those that are in both with `how='inner'`

In [26]:
b1_to_b6 = df_b1[["Bachelor semestre 1","Gender"]].join(df_b6["Bachelor semestre 6"], how='inner')
b1_to_b6.tail()

Unnamed: 0,Bachelor semestre 1,Gender,Bachelor semestre 6
250300,2014-09,Monsieur,2017-03
250362,2014-09,Monsieur,2017-03
250703,2014-09,Monsieur,2017-03
251758,2014-09,Monsieur,2017-03
251759,2014-09,Monsieur,2017-03


We can see some students whose `Bachelor semestre 6` lies in the future (`2017-03`). Let's drop them, since that semester has not yet taken place and they may still take longer.

In [27]:
b1_to_b6 = b1_to_b6[np.logical_not(b1_to_b6["Bachelor semestre 6"].isin(['2017-03']))]
b1_to_b6.tail()

Unnamed: 0,Bachelor semestre 1,Gender,Bachelor semestre 6
238150,2013-09,Monsieur,2016-03
239124,2013-09,Monsieur,2016-03
239170,2013-09,Monsieur,2016-03
239314,2013-09,Monsieur,2016-03
239366,2013-09,Monsieur,2016-03


### Analyze data

Let's compute the duration from Semester 1 to the (last) Semester 6 as previously described.

In [28]:
# date is in 'year-month' format. e.g. 2015-07
def months_between_dates(start_date, end_date):
    start_year, start_month = start_date.split('-')
    end_year, end_month = end_date.split('-')
    return (int(end_year) - int(start_year)) * 12 + int(end_month) - int(start_month) + 6

def bachelor_duration(row):
    return months_between_dates(row['Bachelor semestre 1'], row['Bachelor semestre 6'])

b1_to_b6['Duration'] = b1_to_b6.apply(bachelor_duration, axis=1)
b1_to_b6.head()

Unnamed: 0,Bachelor semestre 1,Gender,Bachelor semestre 6,Duration
147008,2008-09,Monsieur,2011-03,36
169569,2007-09,Monsieur,2010-03,36
169731,2007-09,Monsieur,2011-03,48
169795,2007-09,Monsieur,2011-03,48
171195,2007-09,Monsieur,2010-03,36


Now we can compute the mean duration in months for male and female and test for statistical significance between their means.

In [29]:
male_mean = b1_to_b6['Duration'][b1_to_b6.Gender=="Monsieur"].mean()
female_mean = b1_to_b6['Duration'][b1_to_b6.Gender=="Madame"].mean()
print("Average duration for male students: " + str(male_mean))
print("Average duration for female students: " + str(female_mean))
print(b1_to_b6.Gender.value_counts())

Average duration for male students: 42.05187319884726
Average duration for female students: 39.55555555555556
Monsieur    347
Madame       27
Name: Gender, dtype: int64


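We will use a two-sample t-test to see if there is a statistically significant difference between the average duration for males and females. We set `equal_var=False` because the two groups differ greatly in size (347 vs. 27) and we do not want to assume equal variances.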

In [30]:
import scipy.stats as stats
stats.ttest_ind(a= b1_to_b6['Duration'][b1_to_b6.Gender=="Monsieur"],
                b= b1_to_b6['Duration'][b1_to_b6.Gender=="Madame"],
                equal_var=False) 

Ttest_indResult(statistic=1.860117047790256, pvalue=0.071443584251684678)

The test yields a p-value of 0.0714: if the two groups (male and female) had the same true mean, we would see a difference at least this large about 7.14% of the time. At a 5% significance level (95% confidence), we fail to reject the null hypothesis that the means are equal, since the p-value exceeds 0.05. Therefore, by this criterion, the difference in average duration is not statistically significant.
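With `equal_var=False`, `ttest_ind` performs Welch's t-test, which does not assume equal variances in the two groups. The statistic and its degrees of freedom can be computed by hand as in the sketch below. The two small samples are made up purely for illustration and are not the notebook's data, and `welch_t` is a name of ours. The p-value additionally requires the CDF of the t distribution, which the standard library does not provide, so the sketch stops at the statistic.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # squared standard error contributed by each group (sample variance / n)
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# made-up durations in months, for illustration only
group_a = [36, 36, 48, 48, 36]
group_b = [36, 36, 36, 48]

t, df = welch_t(group_a, group_b)
print(round(t, 3), round(df, 2))
```

Because the degrees of freedom shrink when one group is small or much noisier than the other, the test becomes more conservative in exactly the situation we have here, with far fewer female than male students.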

# Exercise 2

_Perform a similar operation to what described above, this time for Master students. Notice that this data is more tricky, as there are many missing records in the IS-Academia database. Therefore, try to guess how much time a master student spent at EPFL by at least checking the distance in months between `Master semestre 1` and `Master semestre 2`. If the `Mineur` field is not empty, the student should also appear registered in Master semestre 3. Last but not the least, don't forget to check if the student has an entry also in the `Projet Master` tables. Once you can handle well this data, compute the "average stay at EPFL" for master students. Now extract all the students with a `Spécialisation` and compute the "average stay" per each category of that attribute -- compared to the general average, can you find any specialization for which the difference in average is statistically significant?_

In [31]:
# now we would like to collect the Specialization and Minor as well
def create_student_entry(stat, info, yr, sem):
    student = {}
    student['Gender'] = info[0].string
    student['Specialisation'] = info[4].string
    student['Minor'] = info[6].string
    student[stat] = sem_start_date(yr, sem)
    return student

# scrape student data for a particular major and student status
def scrape_student_data(maj, stat):
    dic = {}
    # go through all statuses, years, and semesters
    for yr in acad_yrs.keys():
        for sem in semesters.keys():
            html_content = get_html_content(maj, yr, stat, sem)
            # parse with beautiful soup
            soup_students = BeautifulSoup(html_content, 'html.parser')
            rows = soup_students.find_all('tr')
            # students are starting after two rows
            for row in rows[2:]:
                student = row.find_all('td')
                sciper = student[10].string
                # keep earliest year in case a student repeated first semester
                if int(stat.split(' ')[-1]) == 1: # obtaining number of semester
                    if sciper not in dic:
                        dic[sciper] = create_student_entry(stat, student, yr, sem)
                # for other semesters replace with latest
                else:
                    dic[sciper] = create_student_entry(stat, student, yr, sem)
    df = pd.DataFrame.from_dict(dic, orient='index')
    return df

Similar to Exercise 1, let's create a `DataFrame` for each of the semesters we are interested in.

In [32]:
major = 'Informatique'
masters_sems = ['Master semestre 1', 'Master semestre 2', 'Master semestre 3']
df_m1, df_m2, df_m3 = [scrape_student_data(major, master_sem) for master_sem in masters_sems]

Save to pickle so we don't always need to extract the HTML content.

In [33]:
df_m1.to_pickle("m1_pickle")
df_m2.to_pickle("m2_pickle")
df_m3.to_pickle("m3_pickle")

### Import data

In [34]:
df_m1 = pd.read_pickle("m1_pickle")
df_m2 = pd.read_pickle("m2_pickle")
df_m3 = pd.read_pickle("m3_pickle")

### Analyze data 

#### Try to guess how much time a master student spent at EPFL by at least checking the distance in months between Master semestre 1 and Master semestre 2

Join M1 and M2 students (both fields should exist) as we did for B1 and B6.

In [35]:
m1_to_m2 = df_m1[["Master semestre 1","Gender"]].join(df_m2["Master semestre 2"], how='inner')
m1_to_m2.tail()

Unnamed: 0,Master semestre 1,Gender,Master semestre 2
260806,2015-09,Monsieur,2016-03
260811,2015-09,Monsieur,2016-03
260968,2015-09,Madame,2016-03
261006,2015-09,Madame,2016-03
261146,2015-09,Monsieur,2016-03


Now we compute the duration. We use essentially the same function as for Bachelor, just changing the name of the column.

In [36]:
# date is in 'year-month' format. e.g. 2015-07
def months_between_dates(start_date, end_date):
    start_year, start_month = start_date.split('-')
    end_year, end_month = end_date.split('-')
    return (int(end_year) - int(start_year)) * 12 + int(end_month) - int(start_month) + 6

def master_duration_rough(row):
    return months_between_dates(row['Master semestre 1'], row['Master semestre 2'])

m1_to_m2['Duration'] = m1_to_m2.apply(master_duration_rough, axis=1)
m1_to_m2.head()

Unnamed: 0,Master semestre 1,Gender,Master semestre 2,Duration
146330,2007-09,Monsieur,2008-03,12
146742,2008-09,Monsieur,2010-03,24
146929,2007-09,Monsieur,2008-03,12
147008,2011-09,Monsieur,2013-03,24
152232,2007-09,Monsieur,2008-03,12


A closer look at the data reveals some students with strange durations (<=0), i.e. a `Master semestre 2` entry that is dated before their `Master semestre 1`.

In [37]:
print(len(m1_to_m2.loc[m1_to_m2["Duration"]<=0]))
m1_to_m2.loc[m1_to_m2["Duration"]<=0].head()

15


Unnamed: 0,Master semestre 1,Gender,Master semestre 2,Duration
171206,2010-09,Monsieur,2010-03,0
178786,2011-09,Monsieur,2011-03,0
180816,2013-09,Monsieur,2013-03,0
192345,2014-09,Monsieur,2014-03,0
196034,2015-09,Monsieur,2015-03,0


Let's drop these cases before computing the mean.

In [38]:
# average only the numeric column; the other columns hold strings
m1_to_m2.loc[m1_to_m2["Duration"]>0, ["Duration"]].mean()

Duration    15.861148
dtype: float64

#### If the Mineur field is not empty, the student should also appear registered in Master semestre 3

We will consider the specialization as well since this also takes an extra semester. According to EPFL regulations a minor or specialization must be chosen by Master semester 2:
* http://ic.epfl.ch/page-97562-en.html
* http://ic.epfl.ch/specializations

So we will take the Specialization or Minor from M2.

In [39]:
m1_to_m3 = df_m1[["Master semestre 1","Gender"]].join(df_m2[["Master semestre 2","Minor","Specialisation"]], 
                                                            how='inner').join(df_m3["Master semestre 3"])
m1_to_m3.head()

Unnamed: 0,Master semestre 1,Gender,Master semestre 2,Minor,Specialisation,Master semestre 3
146330,2007-09,Monsieur,2008-03,,,2008-09
146742,2008-09,Monsieur,2010-03,,"Signals, Images and Interfaces",2012-09
146929,2007-09,Monsieur,2008-03,,,
147008,2011-09,Monsieur,2013-03,,,2012-09
152232,2007-09,Monsieur,2008-03,"Mineur en Management, technologie et entrepren...",,2008-09


We can see cases of students (`146330` and `147008`) that don't have a minor or a specialisation but have an entry for `Master semestre 3`. Therefore, we will simply check if `Master semestre 3` is not `NaN` (rather than checking if minor or specialisation is empty). This will give a more accurate value for the stay at EPFL.

Again, we use a function similar to the previous one to compute the duration. This time we check whether `Master semestre 3` is `NaN`; if it is not, we use it as the end date.

In [40]:
# date is in 'year-month' format. e.g. 2015-07
def months_between_dates(start_date, end_date):
    start_year, start_month = start_date.split('-')
    end_year, end_month = end_date.split('-')
    return (int(end_year) - int(start_year)) * 12 + int(end_month) - int(start_month) + 6

def master_duration_in_months(row):
    start_date = row['Master semestre 1']
    end_date = row['Master semestre 2']
    if pd.notnull(row['Master semestre 3']):
        end_date = row['Master semestre 3']
    if pd.isnull(start_date) or pd.isnull(end_date):
        return np.nan
    return months_between_dates(start_date, end_date)

m1_to_m3['Duration'] = m1_to_m3.apply(master_duration_in_months, axis=1)
m1_to_m3.head()

Unnamed: 0,Master semestre 1,Gender,Master semestre 2,Minor,Specialisation,Master semestre 3,Duration
146330,2007-09,Monsieur,2008-03,,,2008-09,18
146742,2008-09,Monsieur,2010-03,,"Signals, Images and Interfaces",2012-09,54
146929,2007-09,Monsieur,2008-03,,,,12
147008,2011-09,Monsieur,2013-03,,,2012-09,18
152232,2007-09,Monsieur,2008-03,"Mineur en Management, technologie et entrepren...",,2008-09,18
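
To make the arithmetic concrete, here is a small standalone check of `months_between_dates` on a few date pairs taken from the tables above. The `+ 6` term presumably counts the length of the final semester; that reading is our assumption, as the notebook does not state it explicitly. The last pair shows how an end date recorded before the start date produces a duration of `0`.

```python
# date is in 'year-month' format, e.g. 2015-07
def months_between_dates(start_date, end_date):
    start_year, start_month = start_date.split('-')
    end_year, end_month = end_date.split('-')
    return (int(end_year) - int(start_year)) * 12 + int(end_month) - int(start_month) + 6

examples = [
    ('2007-09', '2008-09'),  # last registered semester starts one year after M1
    ('2008-09', '2012-09'),  # last registered semester starts four years after M1
    ('2010-09', '2010-03'),  # end date recorded before the start date
]
durations = [months_between_dates(start, end) for start, end in examples]
print(durations)  # [18, 54, 0]
```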


In [41]:
# some interesting cases of an M2/M3 before their M1...
len(m1_to_m3.loc[m1_to_m3["Duration"]<=0])

15

We still drop the cases of students that have a Duration less than or equal to `0` before computing the mean.

In [42]:
m1_to_m3 = m1_to_m3.loc[m1_to_m3["Duration"]>0]
m1_to_m3["Duration"].mean()

18.0

#### Now extract all the students with a Spécialisation and compute the "average stay" per each category of that attribute -- compared to the general average, can you find any specialization for which the difference in average is statistically significant?

We keep the `NaN` entries here to see how many students don't have a specialization.

In [43]:
m1_to_m3.Specialisation.value_counts(dropna=False)

NaN                               544
Internet computing                 77
Foundations of Software            56
Signals, Images and Interfaces     22
Computer Engineering - SP          17
Software Systems                   16
Information Security - SP           7
Data Analytics                      4
Service science                     2
Biocomputing                        2
Computer Science Theory             1
Internet Information Systems        1
Name: Specialisation, dtype: int64

We will only extract the specialization and duration, as this is what we are interested in, group by specialization, and then compute the mean for each one.

In [44]:
m1_to_m3_copy = m1_to_m3[["Specialisation","Duration"]]
m1_to_m3_copy.dropna().groupby('Specialisation')['Duration'].mean()

Specialisation
Biocomputing                      30.000000
Computer Engineering - SP         19.764706
Computer Science Theory           18.000000
Data Analytics                    16.500000
Foundations of Software           21.107143
Information Security - SP         18.000000
Internet Information Systems      18.000000
Internet computing                20.961039
Service science                   18.000000
Signals, Images and Interfaces    24.000000
Software Systems                  18.000000
Name: Duration, dtype: float64

We now use a two-sample t-test (with `equal_var=False`, i.e. Welch's variant, which does not assume equal variances) to check whether the average duration for each specialization differs significantly from the general average duration.

In [45]:
from scipy import stats

specs = m1_to_m3_copy.dropna().Specialisation.unique()
all_durations = m1_to_m3.Duration.values
p_vals = {}
for spec in specs:
    durations_spec = m1_to_m3_copy[m1_to_m3_copy.Specialisation == spec].Duration.values
    # equal_var=False: do not assume equal variances (Welch's t-test)
    result = stats.ttest_ind(a=durations_spec, b=all_durations, equal_var=False)
    # store the p-value as a percentage
    p_vals[spec] = result[1] * 100
p_vals = pd.Series(p_vals)
p_vals



Biocomputing                      3.839277e-296
Computer Engineering - SP          1.763631e+01
Computer Science Theory                     NaN
Data Analytics                     3.922487e+01
Foundations of Software            1.601334e-02
Information Security - SP          1.000000e+02
Internet Information Systems                NaN
Internet computing                 5.455292e-03
Service science                    1.000000e+02
Signals, Images and Interfaces     1.152510e+00
Software Systems                   1.000000e+02
dtype: float64

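We get two `NaN` values because only one student took each of those specialisations, so the sample variance is undefined. The p-values above are expressed in percent, so we use a 5% significance level (95% confidence level) to check for statistical significance.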
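
As a rough illustration of what `equal_var=False` computes, the sketch below works out the Welch t statistic and its approximate degrees of freedom by hand, using only the standard library. The function name `welch_t` and the sample durations are made up for illustration; the notebook itself relies on `stats.ttest_ind`. The sketch also shows why a group with a single student cannot be tested: with fewer than two observations there is no sample variance.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test.

    Returns (nan, nan) when either sample has fewer than two observations,
    or when both samples have zero variance.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        return math.nan, math.nan
    # squared standard error contributed by each sample
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    if se_a + se_b == 0:
        return math.nan, math.nan
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# hypothetical durations in months, not taken from the dataset
t_stat, dof = welch_t([18.0, 24.0, 30.0], [18.0, 18.0, 24.0, 12.0])
single_t, single_dof = welch_t([18.0], [18.0, 24.0])
print(t_stat, dof, single_t)
```

The p-value is then read from a Student t distribution with `df` degrees of freedom, which is the part `scipy` takes care of.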

In [46]:
p_vals[p_vals<5]

Biocomputing                      3.839277e-296
Foundations of Software            1.601334e-02
Internet computing                 5.455292e-03
Signals, Images and Interfaces     1.152510e+00
dtype: float64

Using the two-sample t-test at a 5% significance level (95% confidence level), the difference in average duration is statistically significant for the following specializations:
* **Biocomputing**
* **Foundations of Software**
* **Internet computing**
* **Signals, Images and Interfaces**

#### Last but not the least, don't forget to check if the student has an entry also in the Projet Master tables. Once you can handle well this data, compute the "average stay at EPFL" for master students.

We note that ISA has a separate form for project registration. We obtain the master project dataset from [this form](http://isa.epfl.ch/imoniteur_ISAP/%21gedpublicreports.htm?ww_i_reportmodel=3069459).  
We get the filter URL using Postman, extract all available parameters and their codes for querying, and then use those parameters to scrape the data.

In [47]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import pprint

In [48]:
PROJECT_FILTER_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=3069459'

r = requests.get(PROJECT_FILTER_URL)
soup = BeautifulSoup(r.content, 'html.parser')

In [49]:
param_names = [x.text for x in soup.findAll('th')[1:]]
codes = [x.attrs['name'] for x in soup.findAll('select')]
param_codes = dict(zip(param_names, codes))
param_codes

{'Période académique': 'ww_x_PERIODE_ACAD',
 'Période pédagogique': 'ww_x_PERIODE_PEDAGO',
 'Type de semestre': 'ww_x_HIVERETE',
 'Unité académique': 'ww_x_UNITE_ACAD'}

In [50]:
param_option_codes = {}

selects = soup.findAll('select')

for param_name, select in zip(param_names, selects):
    option_codes = {}
    options = select.findAll('option')

    for option in options:
        if option.attrs['value'] != 'null':
            option_codes[option.text] = option.attrs['value']
    
    param_option_codes[param_name] = option_codes
        
pprint.pprint(param_option_codes)

{'Période académique': {'2012-2013': '123456101',
                        '2013-2014': '213637754',
                        '2014-2015': '213637922',
                        '2015-2016': '213638028',
                        '2016-2017': '355925344'},
 'Période pédagogique': {'Admission EPFL': '2570913',
                         'Admission automne': '31163100',
                         'Admission printemps': '31164888',
                         'Bachelor semestre 1': '249108',
                         'Bachelor semestre 2': '249114',
                         'Bachelor semestre 3': '942155',
                         'Bachelor semestre 4': '942163',
                         'Bachelor semestre 5': '942120',
                         'Bachelor semestre 5b': '2226768',
                         'Bachelor semestre 6': '942175',
                         'Bachelor semestre 6b': '2226785',
                         'Evaluation': '1628004144',
                         'Evaluation automne': '249119',

In [51]:
# We note that the parameter codes are the same as the previous form

In [52]:
# We first try a hard-coded link to see the structure of the data page and to extract student names
test_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_i_reportmodel=3069459&ww_i_reportModelXsl=3069477&zz_x_UNITE_ACAD=Informatique&ww_x_UNITE_ACAD=249847&zz_x_PERIODE_ACAD=2016-2017&ww_x_PERIODE_ACAD=355925344&ww_x_PERIODE_PEDAGO=249127&ww_x_HIVERETE=2936286&dummy=ok'
r = requests.get(test_url)
soup = BeautifulSoup(r.content, 'html.parser')
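
The hard-coded link can be split back into its query parameters, which is a quick way to see which `ww_x_*` codes the report page expects. Below is a minimal sketch using only `urllib.parse`; no request is sent. The names `filter_params`, `base_url` and `rebuilt_url` are ours and do not appear elsewhere in the notebook.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

test_url = ('http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html'
            '?ww_i_reportmodel=3069459&ww_i_reportModelXsl=3069477'
            '&zz_x_UNITE_ACAD=Informatique&ww_x_UNITE_ACAD=249847'
            '&zz_x_PERIODE_ACAD=2016-2017&ww_x_PERIODE_ACAD=355925344'
            '&ww_x_PERIODE_PEDAGO=249127&ww_x_HIVERETE=2936286&dummy=ok')

# parse_qs maps each key to a list of values; keep the single value per key
query = parse_qs(urlsplit(test_url).query)
filter_params = {key: values[0] for key, values in query.items()
                 if key.startswith('ww_x_')}

# rebuild a URL from the report identifiers plus the four filter codes
base_url = ('http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html'
            '?ww_i_reportmodel=3069459&ww_i_reportModelXsl=3069477')
rebuilt_url = base_url + '&' + urlencode(filter_params)
print(filter_params)
print(rebuilt_url)
```

Whether the server needs the `zz_x_*` and `dummy` parameters is not something this sketch can tell; it only shows how the link is composed.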

In [53]:
# We observe that a student-name <td> entry differs from the others by not having a 'b' tag
students = []
entries = soup.findAll('td', width="20%")
for entry in entries:
    if entry.find('b') is None:
        students.append(entry.text)
students

['Taneli Deniz',
 'El Khoury Raphael',
 'Wang Zisi',
 'Zaridze Ketevani',
 'Mizraji Thomas',
 'Favrod Philémon Orphée',
 'Bovet Sidney',
 'Szabo Kristof Tamas',
 'Rabasco Jérémy',
 'Antognini Diego Matteo',
 'Manasovska Ana',
 'Sbai Hugo',
 'Pignat Eliéva Arlette',
 'Qureshi Zaid',
 'Maitre Grégory Ludovic',
 'Oliveira Andrade Patrick Daniel',
 'Rousseau Adrien Jean-Louis',
 'Schmutz Michaël Steven',
 'Loiseleur Thibaut',
 'Sbai Marion Fadoi',
 'Canale Raffaele',
 'Zhang Jin',
 'Bouquet Stéphane',
 'Robert Arnaud',
 'Farcasanu Alexandru-Ciprian',
 'Wang Zisi',
 'Valette Laurent Michel',
 'Junker Florian Christophe',
 'Robert Arnaud',
 'Guliyev Khayyam Mubariz Oglu',
 'Ionescu Vlad Nicolae',
 'Antognini Marco',
 'Grütter Karl Samuel',
 'Gaspoz John',
 'Débieux Vincent',
 'Sikiaridis Alexandre Jean Denis',
 'Gilgien David Yann',
 'Duhem Martin Nicolas',
 'Amiguet Jérôme',
 'Leiva Loris Angel',
 'Galissard de Marignac Vincent',
 'Schegg Elias',
 'Fokeas Sotirios',
 'Rudelle Matthieu Franç

In [54]:
# to build a dict of (student name, master project date)

In [55]:
# now we would like to collect the Specialization and Minor as well
def create_student_entry(name, yr):
    return {name: yr}

# combine steps 1-4 from procedure of extracting HTML content of desired students
def get_project_html_content(maj, yr, stat, sem):
    # obtain gps
    payload = {PARAM_MAJ: majors[maj],
               PARAM_YR: acad_yrs[yr], 
               PARAM_STATUS: statuses[stat],
               PARAM_SEM: semesters[sem]}
    r = requests.get(PROJECT_BASE_URL, params=payload)
    return r.content

# scrape student data for a particular major and student status
def scrape_student_data(maj, stat):
    dic = {}
    # go through all statuses, years, and semesters
    for yr in acad_yrs.keys():
        for sem in semesters.keys():
            html_content = get_project_html_content(maj, yr, stat, sem)
            # parse with beautiful soup
            soup_students = BeautifulSoup(html_content, 'html.parser')
            rows = soup_students.find_all('tr')
            # students are starting after two rows
            for row in rows[2:]:
                student = row.find_all('td')
                sciper = student[10].string
                # keep earliest year in case a student repeated first semester
                if int(stat.split(' ')[-1]) == 1: # obtaining number of semester
                    if sciper not in dic:
                        dic[sciper] = create_student_entry(stat, student, yr, sem)
                # for other semesters replace with latest
                else:
                    dic[sciper] = create_student_entry(stat, student, yr, sem)
    df = pd.DataFrame.from_dict(dic, orient='index')
    return df

In [57]:
PROJECT_BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_i_reportmodel=3069459&ww_i_reportModelXsl=3069477'

def get_project_html_content(maj, yr, stat, sem):
    # obtain gps
    payload = {PARAM_MAJ: majors[maj],
               PARAM_YR: acad_yrs[yr], 
               PARAM_STATUS: statuses[stat],
               PARAM_SEM: semesters[sem]}
    r = requests.get(PROJECT_BASE_URL, params=payload)
    return r.content

r = get_project_html_content('Informatique', '2015-2016', 'Projet Master automne', "Semestre d'automne")

b'<html><head><META http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><link rel="stylesheet" type="text/css" href="gedpublicreports.css?ww_x_path=Gestac.Moniteur.Style"></head><body bgcolor="#ffffff" marginheight="0" marginwidth="5" link="#666666" vlink="#666666" alink="#666666"><fieldset style="text-align:right; width:40%; position:relative; margin-right: 10px;float:right; border: 0; padding: 0 0 8px 0;"><a style="color:#990033;" href="!GEDREPORTS.html?ww_i_reportmodel=3069459&amp;ww_i_reportModelXsl=3069477&amp;ww_x_UNITE_ACAD=249847&amp;ww_x_PERIODE_ACAD=213638028&amp;ww_x_HIVERETE=2936286&amp;ww_x_PERIODE_PEDAGO=249127">Identification pour acc\xe9der aux e-mails<br>Login to access email adresses</a></fieldset><h1>Extraction : Liste des inscriptions aux projets par section et/ou semestre</h1><hr style="height:0px;visibility: hidden;display:block;width:0px; float:none; clear:both; color: #ffffff;"><table border="0" width="100%"><tr><td><img src="/images/gestacplus/bas

In [63]:
def students_from_project_html(content):
    # We observe that a student-name <td> entry differs from the others by not having a 'b' tag
    students = []
    soup = BeautifulSoup(content, 'html.parser')
    entries = soup.findAll('td', width="20%")
    for entry in entries:
        if entry.find('b') is None:
            students.append(entry.text)
    return students

In [64]:
# r already holds the raw HTML bytes returned by get_project_html_content
students_from_project_html(r)

['Taneli Deniz',
 'El Khoury Raphael',
 'Wang Zisi',
 'Zaridze Ketevani',
 'Mizraji Thomas',
 'Favrod Philémon Orphée',
 'Bovet Sidney',
 'Szabo Kristof Tamas',
 'Rabasco Jérémy',
 'Antognini Diego Matteo',
 'Manasovska Ana',
 'Sbai Hugo',
 'Pignat Eliéva Arlette',
 'Qureshi Zaid',
 'Maitre Grégory Ludovic',
 'Oliveira Andrade Patrick Daniel',
 'Rousseau Adrien Jean-Louis',
 'Schmutz Michaël Steven',
 'Loiseleur Thibaut',
 'Sbai Marion Fadoi',
 'Canale Raffaele',
 'Zhang Jin',
 'Bouquet Stéphane',
 'Robert Arnaud',
 'Farcasanu Alexandru-Ciprian',
 'Wang Zisi',
 'Valette Laurent Michel',
 'Junker Florian Christophe',
 'Robert Arnaud',
 'Guliyev Khayyam Mubariz Oglu',
 'Ionescu Vlad Nicolae',
 'Antognini Marco',
 'Grütter Karl Samuel',
 'Gaspoz John',
 'Débieux Vincent',
 'Sikiaridis Alexandre Jean Denis',
 'Gilgien David Yann',
 'Duhem Martin Nicolas',
 'Amiguet Jérôme',
 'Leiva Loris Angel',
 'Galissard de Marignac Vincent',
 'Schegg Elias',
 'Fokeas Sotirios',
 'Rudelle Matthieu Franç

In [69]:
PROJECT_BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_i_reportmodel=3069459&ww_i_reportModelXsl=3069477'

def get_project_html_content(maj, yr, stat, sem):
    # obtain gps
    payload = {PARAM_MAJ: majors[maj],
               PARAM_YR: acad_yrs[yr], 
               PARAM_STATUS: statuses[stat],
               PARAM_SEM: semesters[sem]}
    r = requests.get(PROJECT_BASE_URL, params=payload)
    return r.content

# scrape student data for a particular major and student status
def scrape_project_data(maj, stat):
    dic = {}
    # go through all statuses, years, and semesters
    for yr in acad_yrs.keys():
        for sem in semesters.keys():
            html_content = get_project_html_content(maj, yr, stat, sem)
            students = students_from_project_html(html_content)
            for student in students:
                dic[student] = yr
    df = pd.DataFrame.from_dict(dic, orient='index')
    return df

df_automne = scrape_project_data('Informatique', 'Projet Master automne')
#df_spring = scrape_project_data('Informatique', 'Projet de printemps')
df_automne

Unnamed: 0,0
Manasovska Ana,2016-2017
Graisse Julien,2015-2016
Tang Tinh Di David,2016-2017
Mazloumian Seyyed Amin,2007-2008
Perrenoud Basile Samuel,2012-2013
Pollet Christophe,2009-2010
Aranibar Casas Ivan Wilson,2015-2016
Eilemann Stefan,2014-2015
Popovic Miroslav,2009-2010
Balmau Oana Maria,2014-2015
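
One caveat with keying the dictionary by student name: a student who appears in several academic years ends up with whichever year was iterated last, and that depends on the order of `acad_yrs.keys()`. A possible refinement, sketched here under the assumption that we want the most recent project registration, is to keep the latest year per name. The helper `latest_year_per_student` and the sample records are hypothetical.

```python
def latest_year_per_student(records):
    """records: iterable of (student_name, academic_year) pairs,
    where academic_year looks like '2015-2016'."""
    latest = {}
    for name, year in records:
        # 'YYYY-YYYY' strings sort chronologically, so plain string comparison works
        if name not in latest or year > latest[name]:
            latest[name] = year
    return latest

# made-up records, deliberately not in chronological order
records = [
    ('Student A', '2016-2017'),
    ('Student B', '2014-2015'),
    ('Student A', '2015-2016'),
]
latest = latest_year_per_student(records)
print(latest)
```

Two different students sharing the same name would still collide; a unique identifier would be needed to tell them apart.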
