## Homework 02 - Data from the Web

*Remark:* if you want to reproduce everything, run all the cells (including the scraping). If you only want to run the analysis, all the data used below is available in the folders `Data_bachelor/` and `Data_master/`. To keep the notebook tidy, the helper functions are stored in the file `scraping_function.py` and in the module `analysis_bachelor`.

In [1]:
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import glob
from scraping_function import *
from analysis_bachelor import *
import os
import collections

### Assignment 1

The final goal of the task is to verify whether the difference between the average number of months female students took to go from the first to the sixth semester and the average number of months taken by male students is statistically significant.

Before running the test we need to go through the following steps:

1. Obtain all the data for the Bachelor students, starting from 2007.
2. Keep only the students for which we have an entry for both Bachelor semestre 1 and Bachelor semestre 6.
3. Compute how many months it took each student to go from the first to the sixth semester.




#### Step 1

The first thing we do, in order to retrieve the data of interest, is to parse the `html` source related to the `IS-Academia` web page (the one where you have to fill in the form to get the enrolled students). 

In [2]:
# Request the html source for the URL
r = requests.get('http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247')
html = r.content
soup = BeautifulSoup(html, 'html.parser')

With the help of *Postman* we identify the *parameters* needed to get the data. They turn out to be the following:

* `ww_x_GPS`
* `ww_i_reportModel`
* `ww_i_reportModelXsl`
* `ww_x_UNITE_ACAD`
* `ww_x_PERIODE_ACAD`
* `ww_x_PERIODE_PEDAGO`
* `ww_x_HIVERETE`

We gather them into a dictionary of the following form:

                 {'parameter_1' : {'name_1' : value_1, ..., 'name_n' : value_n},
                  'parameter_2' : {'name_1' : value_1, ..., 'name_n' : value_n},
                   ...........,
                  'parameter_k' : {'name_1' : value_1, ..., 'name_n' : value_n}}
 
Here `parameter_X` is the name of the relevant parameter to retrieve, while `name_X` and `value_X` are the two possible expressions of one of its options (string and numerical respectively). The function that builds this dictionary is `create_parameter_dict`; it is stored in the [file](Applied data analysis - ADA/scaping_function.py) mentioned above.

In [3]:
# Create the dictionary of parameters
par = ["ww_x_UNITE_ACAD", "ww_x_PERIODE_ACAD", 'ww_x_PERIODE_PEDAGO', 'ww_x_HIVERETE']
dic_par = create_parameter_dict(par, soup)
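
The real `create_parameter_dict` lives in the helper file and is not shown in this notebook. As a rough illustration of the dictionary shape described above, here is a minimal sketch. It assumes that each parameter appears in the form as a `<select>` element whose `<option>` tags carry the numerical code in `value` and the label as text. The function name `build_parameter_dict` and the toy HTML are hypothetical, and the codes are not real IS-Academia values.

```python
from bs4 import BeautifulSoup


def build_parameter_dict(names, soup):
    """Map each <select name=...> to {option label: option value}.

    Hypothetical helper: the notebook's own create_parameter_dict may differ.
    """
    result = {}
    for name in names:
        select = soup.find('select', attrs={'name': name})
        if select is None:
            continue  # parameter not present as a <select> in this page
        options = {}
        for option in select.find_all('option'):
            value = option.get('value')
            label = option.get_text(strip=True)
            if value and label:  # skip the empty placeholder option
                options[label] = value
        result[name] = options
    return result


# Toy form, made up for the example
toy_html = """
<select name="ww_x_HIVERETE">
  <option value=""></option>
  <option value="111">Semestre d'automne</option>
  <option value="222">Semestre de printemps</option>
</select>
"""
toy_soup = BeautifulSoup(toy_html, 'html.parser')
toy_dict = build_parameter_dict(['ww_x_HIVERETE'], toy_soup)
print(toy_dict)
```

Parameters that the page exposes through other form controls (radio buttons, hidden inputs) would need their own handling.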

Next, we define the procedure to retrieve the information. After analyzing various `URLs` with the help of *Postman*, we decide to make six requests in total, each asking for the list of students across all years for one specific *pedagogic period*. The query parameters of each `URL` are set as follows:

* `ww_x_GPS = -1`: this way we gather all the data of the page
* `ww_i_reportModel = 133685247`: standard value of the parameter
* `ww_i_reportModelXsl = dic_par['ww_i_reportModelXsl']['html']`: specifies the format of the data (`html`)
* `ww_x_UNITE_ACAD = dic_par['ww_x_UNITE_ACAD']['Informatique']`: the value that corresponds to the students of Informatique
* `ww_x_PERIODE_PEDAGO = dic_par['ww_x_PERIODE_PEDAGO'][semester]`: the code of a specific semester

We don't specify anything about the season or the year because we decide to retrieve everything related to a particular semester.

In [4]:
# Create the list of the URLs we are making the request to
urls = []

# For each possible semester, 
for semester in dic_par['ww_x_PERIODE_PEDAGO']:
    # Except for those known as semester 5b and 6b that DON'T contain data
    if semester.startswith('Bachelor') and semester[-1] != 'b':  
        # We define the link to append to our list
        url_data = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=' + dic_par['ww_i_reportModelXsl']['html'] + '&ww_x_UNITE_ACAD=' + dic_par['ww_x_UNITE_ACAD']['Informatique'] + '&ww_x_PERIODE_PEDAGO=' + dic_par['ww_x_PERIODE_PEDAGO'][semester]
        urls.append(url_data)
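
As an alternative to string concatenation, the same query string can be assembled with `urllib.parse.urlencode` from the standard library, which keeps the parameters readable and escapes the values. The sketch below is only illustrative: `build_url` is a hypothetical helper and the three codes passed to it are placeholders, not real IS-Academia values.

```python
from urllib.parse import urlencode

BASE_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html'


def build_url(xsl_code, unit_code, period_code):
    """Assemble the report URL from the three codes looked up in dic_par."""
    query = {
        'ww_x_GPS': -1,
        'ww_i_reportModel': 133685247,
        'ww_i_reportModelXsl': xsl_code,
        'ww_x_UNITE_ACAD': unit_code,
        'ww_x_PERIODE_PEDAGO': period_code,
    }
    return BASE_URL + '?' + urlencode(query)


# Placeholder codes, used only to show the resulting shape
example_url = build_url('XSL', 'UNIT', 'PERIOD')
print(example_url)
```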

Once the list of links is ready, we proceed to retrieve the data. To do that we use the function `retrieve_data`, which does the following:
- The gathered `html` source code is transformed into a `DataFrame`.
- Since the `html` contains more than one table, we split the frame (using the function `fun_rec`).

The image below illustrates the procedure used to split the big `DataFrame`.

In [5]:
from IPython.display import IFrame
IFrame('disegno_pdf.pdf', width=980, height=500)

- We clean the data using the `clean_df` function; in this step we keep only the data for the years of interest.
- Finally, we save the cleaned `DataFrames` in `.csv` files (using the `save_data` function).
- The folder `Data_bachelor` is emptied and the `.csv` files are downloaded again each time we execute the cell. This was done for modularity and flexibility, for example to later select a different year or pedagogic period.
- An important note: we do not take into account the ongoing academic year (2016-2017), for the simple reason that we cannot know whether a student will actually finish the year.

In [6]:
### TODO Pawel: wrap it in a function

# Create the directory if it doesn't exist (no error if it is already there)
os.makedirs('Data_bachelor', exist_ok=True)

# Changing the permissions in order to avoid problems
os.chmod('Data_bachelor', 0o777)

# Remove old files from this directory
files = glob.glob('Data_bachelor/*')
for f in files:
    os.remove(f)

# Downloading the data into the 'Data_bachelor' directory
k = 0
for ur in urls:
    k += 1
    print ( k, ' request sent.')
    save_data(ur, 'Data_bachelor/') # function defined in a separate file
    print ('Data collected from the ', k, ' request.')
    print ('*****'*10)
print('Done')

1  request sent.
Data collected from the  1  request.
**************************************************
2  request sent.
Data collected from the  2  request.
**************************************************
3  request sent.
Data collected from the  3  request.
**************************************************
4  request sent.
Data collected from the  4  request.
**************************************************
5  request sent.
Data collected from the  5  request.
**************************************************
6  request sent.
Data collected from the  6  request.
**************************************************
Done
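
The download cell carries a note to wrap it in a function, and the same steps are repeated later for the Master data. A minimal sketch of such a wrapper follows. The names `reset_directory` and `download_all` are hypothetical, and the saving function is passed in as an argument because `save_data` is defined in a separate file.

```python
import glob
import os


def reset_directory(path):
    """Create `path` if needed and delete the files it already contains."""
    os.makedirs(path, exist_ok=True)
    for old_file in glob.glob(os.path.join(path, '*')):
        if os.path.isfile(old_file):
            os.remove(old_file)


def download_all(urls, out_dir, save_fn):
    """Call save_fn(url, out_dir) for every URL and return the request count."""
    reset_directory(out_dir)
    count = 0
    for count, url in enumerate(urls, start=1):
        print(count, ' request sent.')
        save_fn(url, out_dir)
        print('Data collected from the ', count, ' request.')
    return count
```

With this in place, the cell would reduce to something like `download_all(urls, 'Data_bachelor/', save_data)`.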


#### Step 2
Now that we have all the data stored, we import it and start the analysis.

We read the `.csv` files as `DataFrames` and concatenate them along axis 0. This is done by the `import_data` function.

In [7]:
data = import_data('Data_bachelor/')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 7: unexpected end of data
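
The traceback above reports byte `0xe9`, which is `é` in Latin-1 (as in the column name `Civilité`), so one plausible cause is that the `.csv` files were written in an encoding other than UTF-8. This diagnosis is an assumption, since `save_data` and `import_data` are not shown here. A minimal sketch of a reader that falls back to Latin-1 could look like this (`read_csv_with_fallback` is a hypothetical helper, not part of the notebook's library):

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=('utf-8', 'latin-1')):
    """Try each encoding in turn and return the first DataFrame that decodes."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next encoding
    raise last_error
```

A more robust fix would be to write and read the files with the same explicit `encoding` argument, so that no guessing is needed.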

Next, we carry out the task of this step: filtering the `DataFrame` according to the required criteria. In particular, we `groupby` the frame on the *Sciper number*, which is unique for each enrolled student. Then we keep only the students (groups) that have, in the column *Pedagogic period*, at least one entry for both *Semester 1* and *Semester 6*. The procedure is explained in more technical detail inside the function `extract_students`, found in the library provided with the notebook.

In [None]:
df_students = extract_students(data)
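
The body of `extract_students` is not shown in this notebook. As an illustration of the filtering idea only, here is a sketch based on `groupby(...).filter(...)`. The toy data and the simplified period labels are made up for the example; the real labels in the scraped data may contain extra whitespace.

```python
import pandas as pd

# Toy data, made up for the example; labels are simplified
toy = pd.DataFrame({
    'No_Sciper': [1, 1, 1, 2, 2, 3],
    'Pedagogic period': [
        'Bachelor semestre 1', 'Bachelor semestre 2', 'Bachelor semestre 6',
        'Bachelor semestre 1', 'Bachelor semestre 2',
        'Bachelor semestre 6',
    ],
})

required = {'Bachelor semestre 1', 'Bachelor semestre 6'}

# Keep only the groups (students) whose periods include both required semesters
kept = toy.groupby('No_Sciper').filter(
    lambda group: required.issubset(set(group['Pedagogic period']))
)
print(kept['No_Sciper'].unique())
```

Only student 1 survives here: student 2 never reaches semester 6 and student 3 has no semester 1 entry.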

In [None]:
df_students.head()

#### Step 3

In this step we compute how many months each student took to get from the 1st to the 6th semester. We rely on the following assumptions:
- A student who reaches *Semester 6* does not have to repeat any other semester afterwards. We make this assumption because we take reaching the 6th semester as a measure of the time an EPFL student needs to complete the Bachelor.
- Each semester lasts 6 months.

Before computing the duration of studies for each student, and since we want to keep their names, we check whether the name can be used as an index. To do that, we check that the number of distinct *names* equals the number of distinct *Scipers*.

In [None]:
len(df_students['Nom Prénom'].unique()) == len(df_students['No_Sciper'].unique())
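
Comparing the two counts can, in principle, pass by coincidence (for instance if two students share a name while another student appears under two spellings). A stricter check, sketched below on made-up toy data, verifies that each Sciper number maps to exactly one name and each name to exactly one Sciper number.

```python
import pandas as pd

# Toy data, made up for the example
toy_names = pd.DataFrame({
    'No_Sciper': [10, 10, 11, 12],
    'Nom Prénom': ['Doe Jane', 'Doe Jane', 'Roe John', 'Poe Anna'],
})

# Each Sciper number should map to exactly one name, and each name to one Sciper
names_per_sciper = toy_names.groupby('No_Sciper')['Nom Prénom'].nunique()
scipers_per_name = toy_names.groupby('Nom Prénom')['No_Sciper'].nunique()
is_one_to_one = bool((names_per_sciper == 1).all() and (scipers_per_name == 1).all())
print(is_one_to_one)
```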

First, we extract the figures for male students. We select the rows where *Civilité* is *Monsieur* (male) and then `groupby` *Nom Prénom*, which acts as an index. The last operation is the aggregate `size()*6`: it counts the pedagogic periods attached to a person and multiplies the count by 6 to get the corresponding duration, in months, from the first to the sixth semester.

In [None]:
df_male_aggregated = pd.DataFrame(df_students[df_students['Civilité'] == 'Monsieur'].groupby('Nom Prénom').size()*6)

We print some aggregated information to get a feeling of the data that may be useful later, and especially for choosing the right statistical test.

In [None]:
df_male_aggregated.describe()

We plot the data to get an idea of its distribution and to understand it better, for example by uncovering the presence of outliers.

In [None]:
plt.hist(df_male_aggregated.values, bins = 50)

We filter out the outliers (durations above 65 months) and look at the results.

In [None]:
df_male_aggregated[df_male_aggregated[0] <= 65].describe()

We perform the exact same operations for female students.

In [None]:
df_female_aggregated = pd.DataFrame(df_students[df_students['Civilité'] == 'Madame'].groupby('Nom Prénom').size()*6)

In [None]:
df_female_aggregated.describe()

In [None]:
plt.hist(df_female_aggregated.values, bins = 50)

Based on our observations of the data, we choose a two-sample t-test that does not assume equal variances in the two groups (Welch's t-test, selected with `equal_var=False`).

In [None]:
stats.ttest_ind(df_male_aggregated, df_female_aggregated, equal_var=False)
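
With `equal_var=False`, `stats.ttest_ind` performs Welch's t-test rather than the pooled-variance Student's t-test. To make explicit what is being computed, the sketch below evaluates the Welch statistic and the Welch-Satterthwaite degrees of freedom by hand with the standard library. The two samples are made-up durations, not results from the scraped data, and `welch_t` is a hypothetical helper.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample (sample variance / n)
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# Made-up durations in months, only to exercise the formula
t_stat, dof = welch_t([36, 36, 42, 48], [36, 42, 42, 54])
print(round(t_stat, 3), round(dof, 2))
```

The p-value then comes from a Student's t distribution with `dof` degrees of freedom, which is the part `scipy` handles internally.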

### Master

For the Master exercise, we reuse the code and pipelines from the Bachelor study, since they were written to be modular and reusable. We follow the same procedure to craft the URLs and select the data needed, namely the data whose *ww_x_PERIODE_PEDAGO* contains *Master* or *Projet*.

*IMPORTANT:* given the imprecise nature of the data, we made strong assumptions:
- A Master's without *Spécialisation* or *Mineur* takes two semesters.
- A Master's with *Spécialisation* or *Mineur* takes three semesters.
- The duration of the *Master's thesis* is not included because it does not correspond to *Projet Prin* or *Projet Autu*: those entries are too scarce compared with the number of students registered in *Master Semester X*. The thesis is mandatory and can be done at EPFL but also in industry. For this reason we choose not to include it in our estimate of the duration of the *stay at EPFL*, since for the majority of students the thesis is done outside EPFL.
- The *Projet* semester is considered an optional semester project done at EPFL; its duration is simply added to the time the student spent at EPFL.

In [None]:
# Create the list of the URLs we are making the request to
urls = []

# For each possible semester, 
for semester in dic_par['ww_x_PERIODE_PEDAGO']:
    # Keep only the Master semesters and the Master projects
    if semester.startswith('Master') or semester.startswith('Projet'):  
        # We define the link to append to our list
        url_data = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html?ww_x_GPS=-1&ww_i_reportModel=133685247&ww_i_reportModelXsl=' + dic_par['ww_i_reportModelXsl']['html'] + '&ww_x_UNITE_ACAD=' + dic_par['ww_x_UNITE_ACAD']['Informatique'] + '&ww_x_PERIODE_PEDAGO=' + dic_par['ww_x_PERIODE_PEDAGO'][semester]
        urls.append(url_data)

Once the URLs are created, we download and store the data in the *Data_master* folder. The procedure is identical to the one used in the previous exercise; we again use `save_data`, defined as a helper in a separate file available on GitHub.

In [None]:
# Create the directory if it doesn't exist (no error if it is already there)
os.makedirs('Data_master', exist_ok=True)

# Changing the permissions in order to avoid problems
os.chmod('Data_master', 0o777)

# Remove old files from this directory
files = glob.glob('Data_master/*')
for f in files:
    os.remove(f)

# Downloading the data into the 'Data_master' directory
k = 0
for ur in urls:
    k += 1
    print ( k, ' request sent.')
    save_data(ur, 'Data_master/') # function defined in a separate file
    print ('Data collected from the ', k, ' request.')
    print ('*****'*10)
print('Done')

Import the data previously downloaded.

In [None]:
data = import_data('Data_master/')

We verify that the data we retrieved is the one we were targeting.

In [None]:
data['Pedagogic period'].unique()

Here we are interested in the time it took students to get their first 60 credits, as these are mandatory for everyone. In this first step we pre-select the data.

In [None]:
# Data to compute the time for 60 credits
data_time = data[(data['Pedagogic period'] == ' Master semestre 1 \xa0') | (data['Pedagogic period'] == ' Master semestre 2 \xa0')]

Second step: we prepare a filter to isolate the data needed. We first `groupby` *No_Sciper* and then inspect the *Pedagogic period* values of each group.

In [None]:
# Select students who both start and finish the first 60 Master credits at EPFL ---> master 1 and master 2
# Group by the identification number of the students
grouped_data = data_time.groupby('No_Sciper')

# Define the semesters we want the students to have been enrolled in
master_sem = pd.Series([' Master semestre 1 \xa0', ' Master semestre 2 \xa0'])

# List of the No_Sciper of the students we are interested in
students = []

# For each group
for g in grouped_data.groups:
    # We extract the pedagogic periods
    period_attended = grouped_data.get_group(g)['Pedagogic period']
    # If both semester 1 and semester 2 are included, we select the student
    if sum(master_sem.isin(period_attended)) == 2:
        students.append(g)
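
Because `data_time` only contains rows for Master semesters 1 and 2, the same selection can be expressed without an explicit loop: a student attended both semesters exactly when two distinct periods appear in the group. The sketch below shows this on made-up toy data; it is an alternative formulation, not the code used in the notebook.

```python
import pandas as pd

SEM_1 = ' Master semestre 1 \xa0'
SEM_2 = ' Master semestre 2 \xa0'

# Toy data, made up for the example; only semesters 1 and 2 are present,
# mirroring the pre-selection stored in data_time
toy_time = pd.DataFrame({
    'No_Sciper': [1, 1, 2, 3, 3, 3],
    'Pedagogic period': [SEM_1, SEM_2, SEM_1, SEM_1, SEM_1, SEM_2],
})

# A student attended both semesters exactly when two distinct periods appear
distinct_periods = toy_time.groupby('No_Sciper')['Pedagogic period'].nunique()
selected = distinct_periods[distinct_periods == 2].index.tolist()
print(selected)
```

Student 3 repeated semester 1 and is still selected, since repetitions do not change the number of distinct periods.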

Apply the selection prepared previously to our data

In [None]:
pr = data_time.query('No_Sciper in @students')

Verify that our processing is correct.

In [None]:
pr['Pedagogic period'].unique()

Apply aggregate to get the time spent in months.

In [None]:
# time from ma1 to ma2
t = pr.groupby('No_Sciper').size()*6

In [None]:
# Mean and median time (in months) spent at EPFL by Master students, counting only semesters 1 and 2
print(t.mean(), t.median())

We additionally compute the figures for Minor and Specialization students, since they also had to do their first two Master semesters. For such students we just add the time of their Minor/Specialization, which corresponds to Master Semester 3. We look at the students who have a semester 3 and who already did their first two semesters (we do not count students with a specialization who have only done semesters 1 and 2 and are not registered in semester 3, as they may have failed or may have to redo a semester).

In [None]:
# Minor/ spec
df_minor = data[data['Pedagogic period'] == ' Master semestre 3 \xa0']


# Among them, keep those who already did the first two semesters
minor = df_minor.query('No_Sciper in @students')

Apply aggregate to get the time spent in months.

In [None]:
# time to do minor or specialization
m = minor.groupby('No_Sciper').size()*6

Add the optional project duration, according to the assumption mentioned at the beginning of the exercise.

In [None]:
df_project = data[(data['Pedagogic period'] == ' Projet Master autom') |
                  (data['Pedagogic period'] == ' Projet Master print')]
project = df_project.query('No_Sciper in @students')

Apply aggregate to get the time spent in months.

In [None]:
p = project.groupby('No_Sciper').size()*6

In [None]:
# Add these durations to the t vector
tot = (t.add(p, fill_value = 0)).add(m, fill_value = 0)
tot.describe()
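
The sum relies on `Series.add` with `fill_value=0`: the three series are aligned on `No_Sciper`, and a student missing from one of them contributes zero months instead of turning the total into `NaN`. A small sketch on made-up numbers:

```python
import pandas as pd

# Made-up durations in months, indexed by a toy Sciper number
t_toy = pd.Series({1: 12, 2: 18, 3: 12})   # semesters 1 and 2
m_toy = pd.Series({2: 6})                  # semester 3 (minor/specialization)
p_toy = pd.Series({1: 6, 2: 6})            # optional project

# fill_value=0 treats a missing entry as zero months instead of producing NaN
tot_toy = t_toy.add(p_toy, fill_value=0).add(m_toy, fill_value=0)
print(tot_toy.to_dict())
```

Note that the result is indexed by the union of the three indexes, so this only works as intended because the minor and project frames were first restricted to the selected students.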

In [None]:
df_special = minor[minor['Spécialisation'].notnull()]
special = df_special.query('No_Sciper in @students')

In [None]:
# Average stay
special.groupby('No_Sciper').agg({'Spécialisation' : 'first', 'Pedagogic period' : 'size'}).groupby('Spécialisation').mean()*6

### Bonus

In [None]:
df_male_aggregated = pd.DataFrame(pr[pr['Civilité'] == 'Monsieur'].groupby('No_Sciper').size()*6)

In [None]:
df_male_aggregated.describe()

In [None]:
df_female_aggregated = pd.DataFrame(pr[pr['Civilité'] == 'Madame'].groupby('No_Sciper').size()*6)

In [None]:
df_female_aggregated.describe()

In [None]:
tot_sorted = tot.sort_index(sort_remaining=False)

In [None]:
# We filter for the first semester of the Master studies in order to find the starting year of each student.
# As students may have repeated a semester, we want to drop the other entries.
# Since the data is sorted by date, we drop the later occurrences of Master semester 1
prova = pr[pr['Pedagogic period'] == ' Master semestre 1 \xa0'].drop_duplicates(subset = ['No_Sciper']).sort_values(by = 'No_Sciper')

In [None]:
prova.head()

In [None]:
prova['Length_study'] = tot_sorted.values
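
Assigning `tot_sorted.values` is positional: it is correct only as long as both objects are sorted by `No_Sciper` and contain exactly the same students. A sketch of an alternative that looks each total up by Sciper number, shown on made-up toy stand-ins for `prova` and `tot`:

```python
import pandas as pd

# Toy stand-ins for prova and tot (values made up for the example)
prova_toy = pd.DataFrame({
    'No_Sciper': [3, 1, 2],
    'Civilité': ['Madame', 'Monsieur', 'Madame'],
})
tot_example = pd.Series({1: 12.0, 2: 18.0, 3: 24.0})

# Look up each student's total by Sciper number rather than by row position
prova_toy['Length_study'] = prova_toy['No_Sciper'].map(tot_example)
print(prova_toy)
```

With this form the row order of the frame no longer matters, and a student missing from the totals shows up as `NaN` instead of silently shifting the values.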

In [None]:
prova_grouped = prova.groupby(['Civilité', 'Academic year'])

In [None]:
# Monsieur - Try with agg fun and apply
avg_men = {}
avg_women = {}
for g in prova_grouped.groups:
    if g[0] == 'Madame':
        avg_women[g[1]] = prova_grouped.get_group(g)['Length_study'].mean()
    else:
        avg_men[g[1]] = prova_grouped.get_group(g)['Length_study'].mean()
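
The loop above can also be written as a single aggregation, as the comment in the cell hints. The sketch below uses a made-up toy stand-in for `prova`; the academic-year labels are placeholders and may not match the format of the real column.

```python
import pandas as pd

# Toy stand-in for prova (values made up for the example)
toy_prova = pd.DataFrame({
    'Civilité': ['Madame', 'Monsieur', 'Monsieur', 'Madame'],
    'Academic year': ['2007-2008', '2007-2008', '2007-2008', '2008-2009'],
    'Length_study': [12.0, 12.0, 18.0, 24.0],
})

# One row per academic year, one column per value of Civilité
avg_by_year = (
    toy_prova.groupby(['Academic year', 'Civilité'])['Length_study']
    .mean()
    .unstack('Civilité')
)
print(avg_by_year)
```

The resulting table is already sorted by academic year, and a year without students of one gender appears as `NaN` rather than being dropped.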

In [None]:
avg_men

In [None]:
avg_women

In [None]:
od = collections.OrderedDict(sorted(avg_men.items()))
ood = collections.OrderedDict(sorted(avg_women.items()))

In [None]:
# Add title and axis labels. Y axis between 12 and 24 for better visualisation of the data.
# Since we already dropped the data for 2016/2017 at the beginning,
# we cannot consider the mean for the academic year 2015/2016, because the data related to the second year
# of those students' studies is not present in the data set. So we drop that year from the chart.
plt.plot(list(od.values())[:-1], '-ro', list(ood.values())[:-1], '-bo')
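
The comment in the cell asks for a title, axis labels, and a Y axis between 12 and 24. A sketch of the corresponding `matplotlib` calls follows. The yearly averages are made-up numbers used only to show the labelling, and the wording of the labels is a suggestion.

```python
import matplotlib.pyplot as plt

# Made-up yearly averages, only to show the labelling calls
years = ['2007-2008', '2008-2009', '2009-2010']
men_avg = [16.0, 15.5, 17.0]
women_avg = [15.0, 16.5, 16.0]

fig, ax = plt.subplots()
ax.plot(years, men_avg, '-ro', label='Monsieur')
ax.plot(years, women_avg, '-bo', label='Madame')
ax.set_ylim(12, 24)  # same range as suggested in the comment
ax.set_xlabel('Academic year of Master semester 1')
ax.set_ylabel('Average length of study (months)')
ax.set_title('Average Master duration by starting year')
ax.legend()
```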

In [None]:
test_results = {}
for year in sorted(prova['Academic year'].unique())[:-1]:
    df_female = prova[(prova['Civilité'] == 'Madame') & (prova['Academic year'] == year) ]['Length_study']
    df_male = prova[(prova['Civilité'] == 'Monsieur') & (prova['Academic year'] == year) ]['Length_study']
    
    t_test = stats.ttest_ind(df_male, df_female, equal_var=False)
    test_results[year] = {'t-statistic' : t_test[0], 'p-value' : t_test[1]}
test_results