# Homework 02 - Data from the Web

**Import modules**

In [5]:
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import requests

from datetime import date, time

**Global variables**

In [6]:
# URL of IS Academia public access form
ISA_url = 'http://isa.epfl.ch/imoniteur_ISAP/%21gedpublicreports'

# Some variable used
QUERY_STRING = 'query_string'
EMPTY_DATA_STRING = "**NO_DATA**"

**Retrieve data: Query Parameters**

In order to get the data, we first need to extract it from the IS-Academia public plateform. IS-Academia has an online webpage with a form allowing us to download data based on different parameters (e.g. year, semester, ...)

In order to have a convienient way to retrieve data from this form for future data analysis, we create a data structure storing all possible parameters the form acceptes. The data structure we decided to use is a dictionary:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter name  -> <br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter value -> Filter value id <br/>

As you can see, the dictionary is actually a dictionary of dictionaries. The first level correponds to the filter name. The second level is all filter values possibles associated with the id.

Exemple:<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Unité académique  -> <br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Informatique -> 249847 <br/>

Note that the second level will also contain the string representation of the level 1 key. e.g. query_string': 'ww_x_PERIODE_ACAD' for 'Période académique'

In [7]:
# Construction of the form "values -> id" dictionary

# Retrieve HTML page
r  = requests.get(ISA_url+'.filter?ww_i_reportmodel=133685247')
soup = BeautifulSoup(r.text, 'html.parser')

# Filter HTML page to get only the filters
table = soup.find(lambda tag: tag.name == 'table' and tag['id'] == 'filtre')

# Each filter name has its own 'tr'
filter_names = table.findAll('tr')

# The dictionnary to store form "values -> id"
formValues= {}

for filter_name in filter_names:
    # Second level of the dictionary
    dic = {}
    
    # String correspind to the filter name (first level)
    filter_name_string = filter_name.find('th').string
    
    
    # The first entry on the second level is the string to use for the query
    dic[QUERY_STRING] = filter_name.find('select')['name']
    
    
    # Now we iterate over each possibility for this filter
    filter_values = filter_name.findAll(lambda tag: tag.name == 'option' and tag['value'] != "null")
    for v in filter_values:
        # Add each combination of "filter value -> id" to the second level
        dic[v.string] = v['value']
    
    # Add the first level to the dictionary
    formValues[filter_name_string] = dic

formValues

{'Période académique': {'2007-2008': '978181',
  '2008-2009': '978187',
  '2009-2010': '978195',
  '2010-2011': '39486325',
  '2011-2012': '123455150',
  '2012-2013': '123456101',
  '2013-2014': '213637754',
  '2014-2015': '213637922',
  '2015-2016': '213638028',
  '2016-2017': '355925344',
  'query_string': 'ww_x_PERIODE_ACAD'},
 'Période pédagogique': {'Bachelor semestre 1': '249108',
  'Bachelor semestre 2': '249114',
  'Bachelor semestre 3': '942155',
  'Bachelor semestre 4': '942163',
  'Bachelor semestre 5': '942120',
  'Bachelor semestre 5b': '2226768',
  'Bachelor semestre 6': '942175',
  'Bachelor semestre 6b': '2226785',
  'Master semestre 1': '2230106',
  'Master semestre 2': '942192',
  'Master semestre 3': '2230128',
  'Master semestre 4': '2230140',
  'Mineur semestre 1': '2335667',
  'Mineur semestre 2': '2335676',
  'Mise à niveau': '2063602308',
  'Projet Master automne': '249127',
  'Projet Master printemps': '3781783',
  'Semestre automne': '953159',
  'Semestre prin

**Retrieve data: build queries**

We made a function in order to create an easy and intuitive query URL. The function uses the dictionary built before to retrive the strings and ids to use as part of the URL.

In [8]:
def build_query_url(section, year, semester, season=EMPTY_DATA_STRING):
    ''' 
    Create an query URL for IS-Academia. The webpage will contain all data about EPFL student 
    register for a semester in a specifc year in a section. Note that the function can also 
    support the season semester (Spring or Autumn), but its optional (not needed for the homework)
    
    Usage. e.g build_query_url('Informatique', '2012-2013', 'Bachelor semestre 1')
    
    @param section  - String for the section of EPFL         e.g. Informatique
    @param year     - String for the Academic year.          e.g. 2016-2017
    @param semester - The semester of the adacemic career.   e.g. Bachelor semestre 5

    @return the url string
    '''
    
    # The beginning of the URL remains a CONSTANT
    url = ISA_url + '.html?'
    
    # Select 'Tous' in order to take all data for the specified parameters
    url += 'ww_x_GPS=-1&'
    url += 'ww_i_reportModel=133685247&'
    url += 'ww_i_reportModelXsl=133685270&'
    
    # Section
    v1 = formValues['Unité académique']
    url += v1[QUERY_STRING]+'='+v1[section]+'&'
    
    # Year
    v1 = formValues['Période académique']
    url += v1[QUERY_STRING]+'='+v1[year]+'&'
    
    # Semester
    v1 = formValues['Période pédagogique']
    url += v1[QUERY_STRING]+'='+v1[semester]
    
    # If a season is specified, add it to the query
    if (season != EMPTY_DATA_STRING):
        v1 = formValues['Type de semestre']
        url += '&'+v1[QUERY_STRING]+'='+v1[season]
        
    return url

**Retrieve data. From HTML to Dataframe**

Once we have a query URL to make our request, we need to extract the data stored in an HTML table into a Pandas Dataframe.

In [9]:
def get_ISA_data(url):
    """Return the data stored in the given HTML Url in a DataFrame object"""
    
    r  = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')

    # Dataframe structure
    header = []        # Will store the header for each column
    content = []       # The content of the data, row by row

    #Student data is stored in the only 'table' in the HTML
    for table in soup.find_all('table'):    

        #***************HEADER***********
        # The header is bold in the html page, in a 'th'
        for column_header in table.find_all('th'):
            # Only happend the header is it's not empty 
            # This is also a way to clean the 'th' string we receive from the html table, 
            # because the first 'th' is not a column header
            if column_header.string:    
                header.append(column_header.string)


        #***************CONTENT***********
        # The actual content of the table, the rows
        # Each row represents a student
        for students in table.find_all('tr'):

            student_row=[]
            #Each column attribute for a student is stored in a 'td'
            for student_column in students.find_all('td'):    
                s = student_column.string
                # If there is no data for a specific row, replace it by the 'empty_data' variable
                if (not s):
                    student_row.append(EMPTY_DATA_STRING)
                else:
                    student_row.append(s)

            if student_row:
                # In the HTML table retrieve from ISAcademia, there is a last 'td' in
                # the HTML table which doesn't correspond to any column. Need to remove it
                student_row.pop()
                content.append(student_row)

    # Create a DataFrame from the data recovered.
    df = pd.DataFrame(content, columns=header)
    return df

get_ISA_data(build_query_url('Informatique', '2012-2013', 'Bachelor semestre 1'))

Unnamed: 0,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper
0,Monsieur,Albrecht Pablo,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,212726
1,Monsieur,Alvard Edouard,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,223371
2,Madame,Ammann Gaëlle,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,228116
3,Monsieur,Amorim Afonso Caldeira Da Silva Pedro Maria,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,213618
4,Monsieur,Amrani Kamil,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,226305
5,Monsieur,Andreina Sébastien Laurent,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,215623
6,Monsieur,Angerand Grégoire Georges Jacques,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,212464
7,Monsieur,Antunes Nelson Tiago,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,224198
8,Monsieur,Armand Alexis Kevin,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,224617
9,Monsieur,Aulbach Adrian,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,214964


>**Exercise 1**<br/>
>Obtain all the data for the Bachelor students, starting from 2007. Keep only the students for which you have an entry for both Bachelor semestre 1 and Bachelor semestre 6. Compute how many months it took each student to go from the first to the sixth semester. Partition the data between male and female students, and compute the average -- is the difference in average statistically significant?1


**Solution**

First, we need to have dataframe combining 'Bachelor Semester 1' and 'Bachelor Semester 6'. Since students can fail some class (not all teacher are as great as Mister Castasta), we need to keep the "yougest" Semester 1 and the "oldest" Semester 6. <br/>
This dataframe must have the semester year, gender of the student, and student id. The student id the the *camipro* number, wich is unique for each student; threfore it will be used as the dataframe index.

In [10]:
def semester_data (section, semester, semester_month, semester_year, order, columns_of_interest):
    """ 
    Return a DataFrame object with the gender and the year for a given semester in a given section.
    There is only one entry per student, based on the 'Camipro' number. In case of conflict, the 
    parameter 'order' defined which year the dataframe should keep.

    @param section  - String for the section of EPFL           e.g. Informatique
    @param semester - The semester of the adacemic career.     e.g. Bachelor semestre 5
    @param year     - String for the Academic year.            e.g. 2016-2017
    @param semester_month - the starting month of the semester e.g. 2 (Feburary)
    @param semester_year  - the starting year of the semester  
    @param order    - the year to be reserved upon future concatination


    @return data - The dataFrame returned will contain two columns: 'Civilité' for the gender and 'Year'. 
                   The year value is a python 'date' object.
    """
    print('********** ', semester,' ***********')
    
    data = []

    # Iterate over each year available on the online ISA form
    for year in formValues['Période académique']:
        if year is not'query_string': #The dictionary 'formValues' contains year and the 'query_string'
            
            url = build_query_url(section, year , semester)
            year_df = get_ISA_data(url)

            if not year_df.empty:
                # DataFrame index is 'No Sciper'
                year_df.index = year_df["No Sciper"]                  
                year_df = year_df[columns_of_interest]

                # Only store one year.
                # e.g. semester = 'Bachelor semester 1' and semester_year='first' => store only '2012' from year='2012-2013'
                semester_year = int(year.split('-')[0 if semester_year=='first' else 1])
                year_df['Year'] = date(semester_year, semester_month, 1)
                
                data.append(year_df)
            
            print("In ",year, ' there was ', year_df.shape[0], "students registered !")

    # Concat the data for each semester altogether
    data = pd.concat(data)
    
    # In order to remove duplicates and keep only the "yougest" or "oldest" one, the data is order by 'Year'.
    data.sort_values('Year', inplace=True)
    # When there is a two rows with the same index (same Sciper nbr), only keep the "youngest" or "oldest" one,
    # based on the 'order' parameter
    data = data[~data.index.duplicated(keep=order)]
    
    return data

In [11]:
data_bachelor_1 = semester_data ('Informatique', 'Bachelor semestre 1', 9, 'first', 'first', ['Civilité'])
data_bachelor_6 = semester_data ('Informatique', 'Bachelor semestre 6', 2, 'last', 'last', ['Civilité'])

print("Bachelor semester 1:", data_bachelor_1.shape[0])
print("Bachelor semester 6:", data_bachelor_6.shape[0])
data_bachelor_6.head(1)

**********  Bachelor semestre 1  ***********
In  2009-2010  there was  117 students registered !
In  2010-2011  there was  153 students registered !
In  2011-2012  there was  166 students registered !
In  2008-2009  there was  96 students registered !
In  2015-2016  there was  216 students registered !
In  2014-2015  there was  242 students registered !
In  2016-2017  there was  236 students registered !
In  2012-2013  there was  198 students registered !
In  2007-2008  there was  90 students registered !
In  2013-2014  there was  206 students registered !
**********  Bachelor semestre 6  ***********
In  2009-2010  there was  60 students registered !
In  2010-2011  there was  52 students registered !
In  2011-2012  there was  52 students registered !
In  2008-2009  there was  51 students registered !
In  2015-2016  there was  104 students registered !
In  2014-2015  there was  116 students registered !
In  2016-2017  there was  24 students registered !
In  2012-2013  there was  81 stud

Unnamed: 0_level_0,Civilité,Year
No Sciper,Unnamed: 1_level_1,Unnamed: 2_level_1
170931,Monsieur,2008-02-01


Once we have the data for semester 1 and 6, we need to merge them together. This is performed based on the index row (Sciper number). We only keep students that have an entry in both tables.

In [23]:
result = pd.merge(data_bachelor_1, data_bachelor_6.drop(['Civilité'], axis = 1),  left_index=True, right_index=True, how='inner')

print("Number of students for which having an entry in both dataframes:", result.shape[0])
result.sample(10)

Number of students for which having an entry in both dataframes: 397


Unnamed: 0_level_0,Civilité,Year_x,Year_y
No Sciper,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
227660,Monsieur,2013-09-01,2015-02-01
192757,Madame,2009-09-01,2012-02-01
180241,Monsieur,2008-09-01,2010-02-01
247680,Monsieur,2015-09-01,2017-02-01
194171,Monsieur,2011-09-01,2013-02-01
234033,Monsieur,2014-09-01,2016-02-01
202946,Monsieur,2011-09-01,2013-02-01
247328,Monsieur,2015-09-01,2017-02-01
236522,Monsieur,2014-09-01,2016-02-01
228496,Monsieur,2013-09-01,2015-02-01


For each student, we need to compute the number of months between 'Bachelor semester 1' and 'Bachelor semester 6'. By EPFL rules, the 'Bachelor semester 1' always start in September and the 'Bachelor semester 6' in February.

To compute the number of months, we compute the number of days between the start of both semester and multiply it by 0.0328767, which convert days to month

In [28]:
# Value taken from google
daysToMonth = 0.0328767

result["Month"] = ((result["Year_y"] - result["Year_x"])*daysToMonth).apply(lambda x: x.days)
result.head(1)

Unnamed: 0_level_0,Civilité,Year_x,Year_y,Month
No Sciper,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
169569,Monsieur,2008-09-01,2010-02-01,17


Now we can compute the average, in number of month, for both womens and mens to go from 'Bachelor semester 1' to 'Bachelor semester 6'.

Madame = Women <br/>
Monsieur = Men

In [47]:
for x in result.groupby(result.Civilité):
    print('*******', x[0], '*********')
    print(x[1].shape[0], "students")
    print("median  ", np.median(x[1]['Month']))
    
    print(x[1]['Month'].describe())

******* Madame *********
29 students
median   17.0
count    29.000000
mean     21.551724
std       7.462220
min      17.000000
25%      17.000000
50%      17.000000
75%      29.000000
max      41.000000
Name: Month, dtype: float64
******* Monsieur *********
368 students
median   17.0
count    368.000000
mean      24.173913
std       10.415466
min       17.000000
25%       17.000000
50%       17.000000
75%       29.000000
max       77.000000
Name: Month, dtype: float64


**Notes about the result:**

In this exercice, we only consider students with an entry in both semesters for the Informatique section. Note that this is not equal to the number of students having an EPFL Bachelor degree in Informatique for this time range. Indeed, some students can still fail Semester 6, or some students may have been register in 'Système de communication" for the first bachelor semester and then change to Informatique.

The total number of students is smaller than we originally thought, but looking at the data, we realize that betwenn 2007-2012, the number of students in Bachelor semester 6 is more than the half compare to 2015!

>**Exercise 2**

>Perform a similar operation to what described above, this time for Master students. Notice that this data is more tricky, as there are many missing records in the IS-Academia database. Therefore, try to guess how much time a master student spent at EPFL by at least checking the distance in months between Master semestre 1 and Master semestre 2. If the Mineur field is not empty, the student should also appear registered in Master semestre 3. Last but not the least, don't forget to check if the student has an entry also in the Projet Master tables. Once you can handle well this data, compute the "average stay at EPFL" for master students. Now extract all the students with a Spécialisation and compute the "average stay" per each category of that attribute -- compared to the general average, can you find any specialization for which the difference in average is statistically significant?

**Solution**

The first thing is to retrieve all data about Master in the 'Informatique' section. In 'Informatique', there is 5 table for master: 'Master semestre 1', 'Master semestre 2', 'Master semestre 3', 'Projet Master automne', and 'Projet Master printemps'. Note that 'Master semestre 4' doesn't exists for the 'Informatique' section.

When retrieving the data, if a student as two entries for the same semester, we only keep the "youngest" one, except for 'Master semestre 1'

As before, Master semestre 1, 3, and projet automne always start in September (9), where Master semestre 2 and projet printemps start in February (2)

In [76]:
data_m1 = semester_data('Informatique', 'Master semestre 1', 9, 'first', 'first', ['Civilité', 'Spécialisation'])
data_m2 = semester_data('Informatique', 'Master semestre 2', 2, 'last', 'last', ['Civilité', 'Spécialisation'])
data_m3 = semester_data('Informatique', 'Master semestre 3', 9,  'first', 'last', ['Civilité', 'Spécialisation'])
data_m_prj_a = semester_data('Informatique', 'Projet Master automne', 9 , 'first', 'last', ['Civilité', 'Spécialisation'])
data_m_prj_p = semester_data('Informatique', 'Projet Master printemps', 2,  'last', 'last', ['Civilité', 'Spécialisation'])

print()
print("Master semestre 1:", data_m1.shape[0], 'students')
print("Master semestre 2:", data_m2.shape[0], 'students')
print("Master semestre 3:", data_m3.shape[0], 'students')
print("Projet Master automne:", data_m_prj_a.shape[0], 'students')
print("Projet Master printemps:", data_m_prj_p.shape[0], 'students')

data_m1.head(5)

**********  Master semestre 1  ***********
In  2009-2010  there was  52 students registered !
In  2010-2011  there was  96 students registered !
In  2011-2012  there was  102 students registered !
In  2008-2009  there was  60 students registered !
In  2015-2016  there was  132 students registered !
In  2014-2015  there was  104 students registered !
In  2016-2017  there was  139 students registered !
In  2012-2013  there was  88 students registered !
In  2007-2008  there was  71 students registered !
In  2013-2014  there was  104 students registered !
**********  Master semestre 2  ***********
In  2009-2010  there was  62 students registered !
In  2010-2011  there was  109 students registered !
In  2011-2012  there was  123 students registered !
In  2008-2009  there was  64 students registered !
In  2015-2016  there was  196 students registered !
In  2014-2015  there was  151 students registered !
In  2016-2017  there was  2 students registered !
In  2012-2013  there was  130 students 

Unnamed: 0_level_0,Civilité,Spécialisation,Year
No Sciper,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
167387,Monsieur,**NO_DATA**,2008-09-01
160893,Monsieur,**NO_DATA**,2008-09-01
179878,Monsieur,Internet computing,2008-09-01
167133,Monsieur,**NO_DATA**,2008-09-01
161794,Monsieur,**NO_DATA**,2008-09-01


Now, we need to merge all this dataframes together.<br/>
We start my merging 'Master semestre 1' with 'Master semestre 2', since each of this two entries must necessary be filled up for any student.

In [84]:
result = pd.merge(data_m1, data_m2,  left_index=True, right_index=True, how='inner')
result.head(5)

Unnamed: 0_level_0,Civilité_x,Spécialisation_x,Year_x,Civilité_y,Spécialisation_y,Year_y
No Sciper,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
155018,Monsieur,**NO_DATA**,2008-09-01,Monsieur,**NO_DATA**,2008-02-01
180269,Monsieur,**NO_DATA**,2008-09-01,Monsieur,**NO_DATA**,2008-02-01
160893,Monsieur,**NO_DATA**,2008-09-01,Monsieur,**NO_DATA**,2008-02-01
179878,Monsieur,Internet computing,2008-09-01,Monsieur,Internet computing,2008-02-01
167133,Monsieur,**NO_DATA**,2008-09-01,Monsieur,**NO_DATA**,2008-02-01


Then we add the three remaining dataframes. Since this semester are optional, there are some cells with NaN values.

In [85]:
result['Year_3'] = data_m3['Year']
result['project_Master_A'] = data_m_prj_a['Year']
result['project_Master_P'] = data_m_prj_p['Year']

result.head(1)

Unnamed: 0_level_0,Civilité_x,Spécialisation_x,Year_x,Civilité_y,Spécialisation_y,Year_y,Year_3,project_Master_A,project_Master_P
No Sciper,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
155018,Monsieur,**NO_DATA**,2008-09-01,Monsieur,**NO_DATA**,2008-02-01,,,


Now we compute the first and the last semester for each students. The first semester with the one with the "oldest" date among the 5 semester, and the last one is the "youngest" one.

Some students might have started their Master in 'Semestre 2', but our works with it.

Note that the 'dateEnd' column correspond to the starting month of the last semester, not the ending month.

In [86]:
result['dateStart'] = result[['Year_x', 'Year_y', 'Year_3', 'project_Master_A', 'project_Master_P']]\
                        .fillna(date.max).min(axis=1)
result['dateEnd'] = result[['Year_x', 'Year_y', 'Year_3', 'project_Master_A', 'project_Master_P']]\
                        .fillna(date.min).max(axis=1)

result.head(1)

Unnamed: 0_level_0,Civilité_x,Spécialisation_x,Year_x,Civilité_y,Spécialisation_y,Year_y,Year_3,project_Master_A,project_Master_P,dateStart,dateEnd
No Sciper,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
155018,Monsieur,**NO_DATA**,2008-09-01,Monsieur,**NO_DATA**,2008-02-01,,,,2008-02-01,2008-09-01


Now that we have the start and the end of studies, we can compute the number of month each student stayed at EPFL for the Master. As said earlier, the 'dateEnd' colum correspond to the first month of the last register semester. In order to have the last month, we need to add "5" to the total of months at EPFL, since each semester has a duration of 5 month.

Assumption: We assume the time of study is continues for a student, so there's no gap year/semester between the start and the end of his/her studies.

In [87]:
result["Month"] = ((result["dateEnd"] - result["dateStart"])*daysToMonth).apply(lambda x: x.days + 5)
result.sample(5)

Unnamed: 0_level_0,Civilité_x,Spécialisation_x,Year_x,Civilité_y,Spécialisation_y,Year_y,Year_3,project_Master_A,project_Master_P,dateStart,dateEnd,Month
No Sciper,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
185523,Monsieur,**NO_DATA**,2009-09-01,Monsieur,**NO_DATA**,2009-02-01,2009-09-01,,2010-02-01,2009-02-01,2010-02-01,16
211479,Monsieur,**NO_DATA**,2012-09-01,Monsieur,**NO_DATA**,2012-02-01,2013-09-01,,,2012-02-01,2013-09-01,24
170833,Monsieur,**NO_DATA**,2009-09-01,Monsieur,**NO_DATA**,2010-02-01,2011-09-01,,,2009-09-01,2011-09-01,28
256710,Monsieur,**NO_DATA**,2016-09-01,Monsieur,**NO_DATA**,2016-02-01,2017-09-01,,,2016-02-01,2017-09-01,24
226638,Monsieur,**NO_DATA**,2016-09-01,Monsieur,Foundations of Software,2016-02-01,2017-09-01,,,2016-02-01,2017-09-01,24


Now some stats about these findings.

In [96]:
nbr_students_total = result.shape[0]
print(nbr_students_total, "students")

mean_total = np.mean(result['Month'])
print("mean:\t",mean_total, "\tmonths")

median_total = np.median(result['Month'])
print("median:\t", median_total, "\t\tmonths")

764 students
mean:	 21.1897905759 	months
median:	 24.0 		months


Second part of the question, dealing with 'Spécialisation'. <br/>
Nothing fancy here, just use the 'groupBy' method and construct a Dataframe for a bett

In [63]:
spe_header = ['Specialisation', 'Mean', 'Median', 'Nbr students']
spe = []

spe.append(['ALL Students', mean_total, median_total, nbr_students_total])

for x in result.groupby('Spécialisation_x'):
    if x[0] is not EMPTY_DATA_STRING:
        spe.append([x[0], x[1]['Month'].mean(), x[1]['Month'].median(), x[1].shape[0]])

a = pd.DataFrame(spe, columns=spe_header)
a

Unnamed: 0,Specialisation,Mean,Median,Nbr students
0,ALL Students,21.189791,24.0,764
1,Biocomputing,24.0,24.0,1
2,Computer Engineering - SP,22.75,26.0,4
3,Data Analytics,24.0,24.0,2
4,Foundations of Software,23.75,24.0,16
5,Information Security - SP,29.0,29.0,1
6,Internet computing,20.44,24.0,25
7,Service science,28.0,28.0,2
8,"Signals, Images and Interfaces",25.0,24.0,15
9,Software Systems,23.0,24.0,6
