# Homework 2 - Data from the Web

Useful imports first

In [None]:
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
import html5lib 
from lxml import html
import numpy as np

We first collect every parameter value the form accepts. We fill two dictionaries: one maps each academic year to the `value` of its `<option>` (e.g. `{'2016-2017': '355925344'}`), and the other maps each bachelor semester to the `value` of its `<option>`.

We do that in order to be able to extract all students from all possible years (2007-2016) and from all stages of the bachelor automatically later.
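The extraction idea can be tried offline before touching the live page. The sketch below is only an illustration: it uses the standard library's `HTMLParser` instead of BeautifulSoup so that it runs without extra packages, and `SAMPLE_HTML` and `OptionCollector` are hypothetical names. All option values are made up, except the `2016-2017` one quoted above.

```python
from html.parser import HTMLParser

# Hypothetical snippet imitating the <select> menus of the form.
# Only the 2016-2017 value comes from the text; the others are invented.
SAMPLE_HTML = """
<select name="ww_x_PERIODE_ACAD">
  <option value="null"></option>
  <option value="355925344">2016-2017</option>
  <option value="111111111">2015-2016</option>
</select>
<select name="ww_x_PERIODE_PEDAGO">
  <option value="222222222">Bachelor semestre 1</option>
  <option value="333333333">Master semestre 1</option>
</select>
"""


class OptionCollector(HTMLParser):
    """Collect {option text: option value} for every <option> tag."""

    def __init__(self):
        super().__init__()
        self.options = {}
        self._value = None  # value of the <option> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._value = dict(attrs).get("value")

    def handle_data(self, data):
        text = data.strip()
        if self._value is not None and text:
            self.options[text] = self._value

    def handle_endtag(self, tag):
        if tag == "option":
            self._value = None


collector = OptionCollector()
collector.feed(SAMPLE_HTML)

# Same text-based filters as in the notebook, one dictionary per kind of field.
years = {t: v for t, v in collector.options.items() if "20" in t}
bachelor = {t: v for t, v in collector.options.items() if "Bachelor" in t}
master = {t: v for t, v in collector.options.items() if "Master" in t or "Mineur" in t}

print(years)
print(bachelor)
print(master)
```

Keeping one dictionary per kind of field makes it explicit which lookup table a later URL substitution relies on.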

In [None]:
# The URL of the "start" of the IS-Academia page listing the students.
home_url = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_x_GPS=-1&ww_i_reportModel=133685247"

# The general form of the URL, with the fields to be replaced later (e.g. [UNITE_ACADEMIQUE])
base_url = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=133685270&ww_x_UNITE_ACAD=[UNITE_ACADEMIQUE]&ww_x_PERIODE_ACAD=[PERIODE_ACADEMIQUE]&ww_x_PERIODE_PEDAGO=[PERIODE_PEDAGOGIQUE]&ww_x_HIVERETE=null"

# Get the raw content from the page
with urllib.request.urlopen(home_url) as url:
    s = url.read()

soup = BeautifulSoup(s, 'html.parser')

# Parse the content
url_section = 0
url_years = {}
url_bachelor = {}
url_master = {}

# Loop over every <option> field to get the value associated with each year and each semester
for link in soup.find_all('option'):
    if link.text == 'Informatique':
        url_section = link.get('value')
    if "20" in link.text:
        url_years[link.text] = link.get('value')
    if "Bachelor" in link.text:
        url_bachelor[link.text] = link.get('value')
    if 'Master' in link.text or 'Mineur' in link.text:
        # Master and minor entries go into their own dictionary
        url_master[link.text] = link.get('value')

print(url_years)
print(url_bachelor)
print(url_master)

Function that replaces the general fields in the base_url with the fields for a given year and level of study

In [None]:
def getFullUrl(PeriodeAcad, PeriodePedag):
    url = base_url
    url = url.replace('[UNITE_ACADEMIQUE]', str(url_section))
    url = url.replace('[PERIODE_ACADEMIQUE]', str(url_years[PeriodeAcad]))
    url = url.replace('[PERIODE_PEDAGOGIQUE]', str(url_bachelor[PeriodePedag]))
    return url

print(getFullUrl('2007-2008', 'Bachelor semestre 1'))
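As an alternative to string replacement, the query string can be assembled from a dictionary with `urllib.parse.urlencode`. This is a minimal sketch, not the notebook's method: `build_report_url` and `ENDPOINT` are hypothetical names, and the section and semester identifiers passed in the example are placeholders rather than real IS-Academia values.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Path of the report page, taken from base_url above.
ENDPOINT = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html"


def build_report_url(section_id, year_id, semester_id):
    """Build the report URL from the three variable fields (hypothetical helper)."""
    params = {
        "ww_x_GPS": -1,
        "ww_i_reportModel": 133685247,
        "ww_i_reportModelXsl": 133685270,
        "ww_x_UNITE_ACAD": section_id,
        "ww_x_PERIODE_ACAD": year_id,
        "ww_x_PERIODE_PEDAGO": semester_id,
        "ww_x_HIVERETE": "null",
    }
    return ENDPOINT + "?" + urlencode(params)


# Placeholder identifiers, except the 2016-2017 value quoted earlier.
example_url = build_report_url("000000001", "355925344", "000000002")
query = parse_qs(urlsplit(example_url).query)

print(example_url)
print(query["ww_x_PERIODE_ACAD"])
```

Building the query from a dictionary avoids leaving an unreplaced `[PLACEHOLDER]` in the URL when a field name is mistyped.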


### Reasoning
- We only consider students who started between **2008 and 2010**, so that they have had time to complete their bachelor by 2016 even if they took 6 years. This avoids biasing our results with students who are still studying (e.g. if we included students from 2013, we would only count those who finished their bachelor in 3 years and miss those still enrolled).
- To reduce bias further, we also check the records of the year 2007-2008 to verify whether the students of the first cohort really start their studies in 2008-2009 or have already failed the first year once.
- We assume that bachelor semester 6 is the semester in which a student finishes the bachelor. We neglect students who return to bachelor semester 5 to complete some courses before moving on to the master.

*Our starting DataFrame therefore only contains students who started in **2008-2010**, and we consider students finishing from 2011 onwards (i.e. they will have done at least 2008-2009, 2009-2010 and 2010-2011).*
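The cohort window above can be checked with a few lines of arithmetic. This is a small sketch under the stated assumption that a bachelor takes between 3 and 6 academic years; the function names are hypothetical.

```python
STARTING_YEARS = range(2008, 2011)  # cohorts entering in autumn 2008, 2009 and 2010
MIN_BACHELOR_YEARS = 3              # nominal duration of the bachelor
MAX_BACHELOR_YEARS = 6              # assumption stated in the reasoning above
LAST_AVAILABLE_YEAR = 2016          # last finishing year covered by the data


def earliest_finish(start_year):
    # Entering in autumn `start_year` and never repeating means finishing
    # in spring `start_year + 3`.
    return start_year + MIN_BACHELOR_YEARS


def latest_finish(start_year):
    # Worst case under the 6-year assumption.
    return start_year + MAX_BACHELOR_YEARS


for start in STARTING_YEARS:
    print(start, "->", earliest_finish(start), "to", latest_finish(start))

# Every cohort in the window can finish within the available data...
all_covered = all(latest_finish(s) <= LAST_AVAILABLE_YEAR for s in STARTING_YEARS)
# ...whereas a 2013 cohort could not.
cohort_2013_covered = latest_finish(2013) <= LAST_AVAILABLE_YEAR

print(all_covered, cohort_2013_covered)
```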

In [None]:
# defining our constants.
STARTING_YEAR_LOW = 2008
STARTING_YEAR_HIGH = 2010

FINISHING_YEAR_LOW = 2011
FINISHING_YEAR_HIGH= 2016

In [None]:
## Load all students who started at EPFL between 2008 and 2010 into the current_student DataFrame
# We set the Sciper as index, as it uniquely identifies each student.

current_student = pd.DataFrame(columns = ['Civilité', 'Nom Prénom', 'Starting year'])
current_student['Starting year']= current_student['Starting year'].astype(int)
current_student.index.name = 'No Sciper'

# We iterate over 3 years : 2008,2009,2010 -> We make sure the students all will have finished by 2016,
# as doing a bachelor can take only up to 6 years.
for year in range(STARTING_YEAR_LOW,STARTING_YEAR_HIGH+1):
    year_string = str(year) + '-' + str(year+1)
    
    # Get the data from the appropriate url -> the URL describing the students in first semester
    current_url_B1 = getFullUrl(year_string, 'Bachelor semestre 1')
    with urllib.request.urlopen(current_url_B1) as url:
        html_B1 = url.read()
    soup = BeautifulSoup(html_B1, 'html.parser')
    
    # Create the data frame only keeping relevant fields
    # Note that the student_frame_B1 DataFrame is a temporary dataframe that only orders the data from the webpage.
    # We clean it and put all the non duplicate students into the current student DataFrame, that will contain our final entries
    student_frame_B1 = pd.read_html(soup.prettify(), header=1)
    student_frame_B1 = student_frame_B1[0].drop(0,axis=0)
    student_frame_B1 = student_frame_B1[['Civilité', 'Nom Prénom', 'No Sciper']]
    student_frame_B1 = student_frame_B1.set_index('No Sciper')
    student_frame_B1['Starting year'] = int(year)
    
    # Only keep students not already seen in an earlier year (i.e. those in their first first year)
    student_frame_B1 = student_frame_B1[~student_frame_B1.index.isin(current_student.index)]
    
    current_student = pd.concat([current_student, student_frame_B1])

We now check, for all students who appear in 2008-2009, whether they were already enrolled in the first semester in 2007-2008 (i.e. whether this is their first first year or their second one). If so, we change their starting year to the earlier one.

In [None]:
#Get the year url previous to the first year.
year_string = str(STARTING_YEAR_LOW-1) + '-' + str(STARTING_YEAR_LOW)

# Get the data from the URL describing the students in first semester in 2007
url_B1_previous = getFullUrl(year_string, 'Bachelor semestre 1')
with urllib.request.urlopen(url_B1_previous) as url:
    html_B1_previous = url.read()
soup = BeautifulSoup(html_B1_previous, 'html.parser')

student_frame_B1_previous = pd.read_html(soup.prettify(), header=1)

#We only keep the scipers, these are the only data we need as they are the indices of the current_student DataFrame
sciper_student_previous = np.array(student_frame_B1_previous[0]['No Sciper'])

# We change the starting year of the students who already appear in the earlier year
current_student.loc[current_student.index.isin(sciper_student_previous), 'Starting year'] = STARTING_YEAR_LOW - 1

print('Is the index unique :', current_student.index.is_unique)
print(current_student.sort_values(ascending=True, by = 'Starting year').head(20))

We see that the students who started in 2007 but failed the first year now appear with starting year 2007.
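The starting-year logic boils down to "first year in which a Sciper appears in semester 1". A plain-Python sketch of that idea, with made-up Sciper numbers and hypothetical names (`rosters_b1`, `starting_years`), could look like this:

```python
# Hypothetical first-semester rosters: autumn year -> set of Sciper numbers.
rosters_b1 = {
    2007: {100001, 100002},
    2008: {100002, 100003},  # 100002 repeats the first year
    2009: {100003, 100004},  # 100003 repeats the first year
}

COHORT_YEARS = (2008, 2009)  # years whose students we want to keep in this toy example


def starting_years(rosters):
    """Map each Sciper to the first year in which it appears."""
    first_seen = {}
    for year in sorted(rosters):
        for sciper in rosters[year]:
            # setdefault keeps the earliest year and ignores later appearances
            first_seen.setdefault(sciper, year)
    return first_seen


start = starting_years(rosters_b1)

# Keep only students who appear in one of the cohort years; the 2007 roster
# is used solely to correct the starting year of repeaters.
cohort = {
    sciper: year
    for sciper, year in start.items()
    if any(sciper in rosters_b1[c] for c in COHORT_YEARS)
}

print(cohort)
```

Student `100001` only appears in 2007 and is therefore dropped, while `100002` is kept with a corrected starting year of 2007.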

We now apply roughly the same process to the finishing year of the bachelor.

In [None]:
finished_student = pd.DataFrame(columns = ['Civilité', 'Nom Prénom', 'Finishing year'])
finished_student.index.name = 'No Sciper'

for year in range(FINISHING_YEAR_LOW,FINISHING_YEAR_HIGH+1):
    year_string = str(year-1) + '-' + str(year)
    
    # Get the data from the appropriate url -> The URL containing the students in last semester.
    current_url_B6 = getFullUrl(year_string, 'Bachelor semestre 6')
    with urllib.request.urlopen(current_url_B6) as url:
        html_B6 = url.read()
    soup = BeautifulSoup(html_B6, 'html.parser')
    
    # Create the data frame only keeping relevant fields
    student_frame_B6 = pd.read_html(soup.prettify(), header=1)
    student_frame_B6 = student_frame_B6[0].drop(0,axis=0)
    student_frame_B6 = student_frame_B6[['Civilité', 'Nom Prénom', 'No Sciper']]
    student_frame_B6 = student_frame_B6.set_index('No Sciper') 
    # If a student already in finished_student shows up again in bachelor 6 this year, they
    # repeated the semester: drop the old entry so that the finishing year gets updated.
    finished_student = finished_student[~finished_student.index.isin(student_frame_B6.index)]
    student_frame_B6['Finishing year'] = int(year)  # The finishing year is the second year of the period (i.e. 2008 for the period 2007-2008)
    finished_student = pd.concat([finished_student, student_frame_B6])
        

print(finished_student.head(10))
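For the finishing year the rule is the opposite of the one used for the starting year: the last appearance in semester 6 wins. A minimal plain-Python sketch, with made-up Sciper numbers and a hypothetical `rosters_b6` dictionary keyed by the spring year of each academic period:

```python
# Hypothetical sixth-semester rosters: spring year -> set of Sciper numbers.
rosters_b6 = {
    2011: {100002, 100003},
    2012: {100003, 100004},  # 100003 is in semester 6 again
}

finishing_year = {}
for year in sorted(rosters_b6):
    for sciper in rosters_b6[year]:
        # Later years overwrite earlier ones, so the last appearance is kept.
        finishing_year[sciper] = year

print(finishing_year)
```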


## Assignment 1

*Obtain all the data for the Bachelor students, starting from 2007. Keep only the students for which you have an entry for both `Bachelor semestre 1` and `Bachelor semestre 6`. Compute how many months it took each student to go from the first to the sixth semester. Partition the data between male and female students, and compute the average -- is the difference in average statistically significant?*

Note that we **assume here that a student who reaches bachelor 6 and then disappears from the dataset has successfully completed the bachelor**, as there is no way to verify it from this data. We cannot check the first master year because the student might have changed section or university.

In [None]:
# Merge the two dataFrames so we get all the students that started and finished. 

# The number of dropouts could be computed by taking the difference between everyone that 
# entered in first year and does not appear at the end (Note that we wouldn't count the people who changed their section).
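# We join on the Sciper index rather than on the name columns, as the Sciper uniquely identifies each student.
sample_student = current_student.join(finished_student[['Finishing year']], how='inner')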

# Compute how long it took to complete the bachelor (in years), then group by sex and compute the average.
# The cast makes sure both columns are numeric before subtracting.
sample_student['Bachelor duration'] = sample_student['Finishing year'].astype(int) - sample_student['Starting year'].astype(int)
print(sample_student.groupby('Civilité')['Bachelor duration'].mean())

# Whether the difference is statistically significant is tested in the next cell.

sample_student.head(10)
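The assignment asks for a duration in months, whereas the cell above works in years. The sketch below shows one way to convert, under the labelled assumption that each academic year counts as 12 months. The dictionaries and Sciper numbers are made up for illustration.

```python
# Hypothetical starting and finishing years keyed by Sciper.
start_year = {100002: 2007, 100003: 2008, 100004: 2009, 100005: 2010}
finish_year = {100002: 2011, 100003: 2012, 100004: 2012}

MONTHS_PER_YEAR = 12  # assumption: one academic year is counted as 12 months

# Inner join: keep only students present in both dictionaries.
durations_years = {
    sciper: finish_year[sciper] - start_year[sciper]
    for sciper in start_year.keys() & finish_year.keys()
}
durations_months = {s: d * MONTHS_PER_YEAR for s, d in durations_years.items()}

# Students who started but never reached semester 6 (dropouts or section changes).
never_finished = sorted(start_year.keys() - finish_year.keys())

print(durations_months)
print(never_finished)
```

Multiplying by a constant does not change the outcome of a t-test on the two groups, so the significance analysis is the same whether durations are expressed in years or in months.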

We now test our result statistically. We perform a **two-sample t-test** because we want to establish whether the difference between the averages of our two independent groups (the duration of the bachelor for men and for women) is significant. The null hypothesis is that both groups have the same mean duration.

In [None]:
# Statistical test
import scipy.stats as stats

#We split our DataFrames into one for the men, one for the women.
men_student = sample_student.loc[sample_student['Civilité']=='Monsieur']
women_student = sample_student.loc[sample_student['Civilité']=='Madame']

print('\nMen bachelor duration', round(men_student.loc[:,'Bachelor duration'].mean(),3), '+-', round(men_student.loc[:,'Bachelor duration'].std(),3))
print('Women bachelor duration',round(women_student.loc[:,'Bachelor duration'].mean(),3),'+-',round(women_student.loc[:,'Bachelor duration'].std(),3))
print('Number of women graduating  :', women_student.shape[0])

# Perform the statistical test
stats.ttest_ind(a= np.array(women_student.loc[:,'Bachelor duration']),
                b= np.array(men_student.loc[:,'Bachelor duration']),
                equal_var=False)    # Welch's t-test: do not assume equal variances

We first printed the mean $\pm$ the standard deviation of each group and see that the spread differs noticeably between our two samples, so we set `equal_var` to `False` (Welch's t-test). The resulting p-value is about $1\%$, so the difference between the averages is statistically significant at the usual $5\%$ level.

Note that we have very few women in our sample (only 9 graduated from the bachelor in those years according to our data!), which is due to our quite strict filtering. Even though the p-value indicates a statistically significant result, we must handle it with care before jumping to conclusions, as it rests on many assumptions.
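To make the test with `equal_var=False` less of a black box, the Welch statistic and its approximate degrees of freedom (Welch-Satterthwaite) can be computed by hand. This is a sketch using only the standard library: `welch_t` is a hypothetical helper, the two samples are made up and are not the notebook's data, and no p-value is computed because that requires the t distribution provided by SciPy.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each mean, using the unbiased (n - 1) variance.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Made-up bachelor durations in years, for illustration only.
women = [3, 3, 3, 4, 3]
men = [3, 4, 4, 5, 3, 4, 6, 4]

t_stat, dof = welch_t(women, men)
print(round(t_stat, 3), round(dof, 2))
```

With a small group, such as the 9 women mentioned above, the degrees of freedom stay low, which widens the t distribution and is one reason to interpret the result cautiously.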