**Homework 02 - Data from the Web**

In [20]:
# Import libraries
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import requests

In [27]:
# URL of IS Academia public access form
ISA_form_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247'
ISA_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml'

# Constants used throughout the notebook
QUERY_STRING = 'query_string'
EMPTY_DATA_STRING = "**NO_DATA**"

**Retrieve data: how to**

To get the data, we need to crawl the IS-Academia public platform. IS-Academia has a web page with a form that lets us download data filtered by different criteria (e.g. year, semester, ...).

To retrieve data from this form conveniently, we build a data structure holding every value the form accepts for each of its filters. The data structure we chose is a dictionary:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter name  -> <br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter value -> Filter value id <br/>

As you can see, the dictionary is actually a dictionary of dictionaries. The first level corresponds to the filter name. The second level maps each possible filter value to its id.

Example:<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Unité académique  -> <br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Informatique -> 249847 <br/>

Note that the second level will also contain the string to use when creating the query.
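As a minimal sketch of how this nested dictionary is meant to be used, the excerpt below reuses two ids from the form output in this notebook. The names `form_values_example` and `query_fragment` are hypothetical and only serve the illustration.

```python
QUERY_STRING = 'query_string'

# Excerpt of the nested dictionary: filter name -> {filter value -> id}
# The QUERY_STRING entry holds the parameter name to use in the URL.
form_values_example = {
    'Période académique': {
        QUERY_STRING: 'ww_x_PERIODE_ACAD',
        '2015-2016': '213638028',
        '2016-2017': '355925344',
    },
}

def query_fragment(form_values, filter_name, filter_value):
    """Return the 'parameter=id' fragment for one filter choice."""
    level2 = form_values[filter_name]
    return level2[QUERY_STRING] + '=' + level2[filter_value]

fragment = query_fragment(form_values_example, 'Période académique', '2016-2017')
print(fragment)  # ww_x_PERIODE_ACAD=355925344
```

Keeping the parameter name next to the value ids means a single lookup by filter name yields everything needed for that part of the query.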

In [28]:
# Construction of the form "values -> id" dictionary

# Retrieve HTML page
r  = requests.get(ISA_form_url)
soup = BeautifulSoup(r.text, 'html.parser')

# Filter HTML page to get only the filters
table = soup.find('table', id='filtre')

# Each filter name has its own 'tr'
filter_names = table.findAll('tr')

# The dictionary to store form "values -> id"
formValues = {}

for filter_name in filter_names:
    # Second level of the dictionary
    dic = {}
    
    # String corresponding to the filter name (first level)
    filter_name_string = filter_name.find('th').string
    
    
    # The first entry on the second level is the string to use for the query
    dic[QUERY_STRING] = filter_name.find('select')['name']
    
    
    # Now we iterate over each possibility for this filter
    filter_values = filter_name.findAll(lambda tag: tag.name == 'option' and tag.get('value') not in (None, "null"))
    for v in filter_values:
        # Add each combination of "filter value -> id" to the second level
        dic[v.string] = v['value']
    
    # Add the first level to the dictionary
    formValues[filter_name_string] = dic

formValues

{'Période académique': {'2007-2008': '978181',
  '2008-2009': '978187',
  '2009-2010': '978195',
  '2010-2011': '39486325',
  '2011-2012': '123455150',
  '2012-2013': '123456101',
  '2013-2014': '213637754',
  '2014-2015': '213637922',
  '2015-2016': '213638028',
  '2016-2017': '355925344',
  'query_string': 'ww_x_PERIODE_ACAD'},
 'Période pédagogique': {'Bachelor semestre 1': '249108',
  'Bachelor semestre 2': '249114',
  'Bachelor semestre 3': '942155',
  'Bachelor semestre 4': '942163',
  'Bachelor semestre 5': '942120',
  'Bachelor semestre 5b': '2226768',
  'Bachelor semestre 6': '942175',
  'Bachelor semestre 6b': '2226785',
  'Master semestre 1': '2230106',
  'Master semestre 2': '942192',
  'Master semestre 3': '2230128',
  'Master semestre 4': '2230140',
  'Mineur semestre 1': '2335667',
  'Mineur semestre 2': '2335676',
  'Mise à niveau': '2063602308',
  'Projet Master automne': '249127',
  'Projet Master printemps': '3781783',
  'Semestre automne': '953159',
  'Semestre prin

**Retrieve data: build queries**

We wrote a function to build a query URL in an easy and intuitive way. The function uses the dictionary built above to retrieve the parameter names and ids to use in the URL.

In [29]:
def build_query_url(section, year, semester, season=EMPTY_DATA_STRING):
    '''
    Create a query URL for IS-Academia. The resulting page lists all EPFL students
    registered in a section for a given semester of a given academic year. The function
    also supports the semester type (spring or autumn), but it is optional (not needed
    for the homework).

    @param section  - String for the EPFL section.            e.g. Informatique
    @param year     - String for the academic year.           e.g. 2016-2017
    @param semester - The semester of the academic career.    e.g. Bachelor semestre 5
    @param season   - Optional semester type, a key of formValues['Type de semestre']

    @return the URL string
    '''
    
    # CONSTANTS TO USE IN THE URL
    url = ISA_url + '?'
    url += 'ww_x_GPS=-1&'
    url += 'ww_i_reportModel=133685247&'
    url += 'ww_i_reportModelXsl=133685270&'
    
    # Add the section 
    v1 = formValues['Unité académique']
    url += v1[QUERY_STRING]+'='+v1[section]+'&'
    
    # Add the year
    v1 = formValues['Période académique']
    url += v1[QUERY_STRING]+'='+v1[year]+'&'
    
    # Add the semester 
    v1 = formValues['Période pédagogique']
    url += v1[QUERY_STRING]+'='+v1[semester]
    
    # If a season is specified, add it to the query
    if (season != EMPTY_DATA_STRING):
        v1 = formValues['Type de semestre']
        url += '&'+v1[QUERY_STRING]+'='+v1[season]
    
    return url
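The same URL can also be assembled with the standard library's `urllib.parse.urlencode`, which joins and escapes the parameters. This is only a sketch of an alternative, not the function used in this notebook: `build_query_url_sketch` is a hypothetical name, and it takes the `(parameter, id)` pairs directly instead of reading them from `formValues`.

```python
from urllib.parse import urlencode

ISA_URL = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.bhtml'

def build_query_url_sketch(filters):
    """Build a query URL from a list of (parameter name, id) pairs."""
    # Constant parameters, same values as in build_query_url
    params = [
        ('ww_x_GPS', '-1'),
        ('ww_i_reportModel', '133685247'),
        ('ww_i_reportModelXsl', '133685270'),
    ]
    params.extend(filters)
    return ISA_URL + '?' + urlencode(params)

# The pair below comes from the 'Période académique' entry shown earlier
example_url = build_query_url_sketch([('ww_x_PERIODE_ACAD', '355925344')])
print(example_url)
```

A list of pairs is used rather than a dict so that the parameter order stays explicit.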

**Retrieve data: from HTML to DataFrame**

Once we have a query URL for our request, we need to extract the data stored in the HTML table into a pandas DataFrame.

In [185]:
def getDataFrame(url):
    '''
    Retrieve the student table at the given URL as a DataFrame
    '''
    # print(url)
    r  = requests.get(url)
    # print(r.status_code)
    soup = BeautifulSoup(r.text, 'html.parser')

    # Dataframe structure
    header = []        # Will store the header for each column
    content = []       # The content of the data, row by row

    #Student data is stored in the only 'table' in the HTML
    for table in soup.find_all('table'):    
    
        #***************HEADER***********
        # retrieve headers from 'th' tag
        for column_header in table.find_all('th'):
            # Only append the header if it's not empty.
            # This also cleans the 'th' strings we receive from the HTML table,
            # because the first 'th' is not a column header
            if column_header.string:    
                header.append(column_header.string)
                      
        #***************CONTENT***********
        # The actual content of the table, the rows
        # Each row represents a student entry
        for students in table.find_all('tr'):
            student_row=[]
            #Each column attribute for a student is stored in a 'td'
            for student_column in students.find_all('td'):    
                s = student_column.string
                # If there is no data in a specific cell, replace it with EMPTY_DATA_STRING
                if (not s):
                    student_row.append(EMPTY_DATA_STRING)
                else:
                    student_row.append(s)
                
            if student_row:
                # In the HTML table retrieved from IS-Academia, each row ends with an extra
                # 'td' that doesn't correspond to any column. We need to remove it
                student_row.pop()
                content.append(student_row)

    # Create DataFrame.
    df = pd.DataFrame(content, columns=header)

    return df.drop(['Orientation Bachelor','Orientation Master','Spécialisation', 'Filière opt.', 'Mineur', 'Statut', 'Type Echange', 'Ecole Echange'], axis=1)
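The row-cleaning rule applied above (a placeholder for empty cells, then dropping the extra trailing cell) can be tried in isolation, without any network access. The sketch below uses made-up rows in place of the parsed `td` strings; `clean_row`, `raw_rows` and the student entries are hypothetical.

```python
import pandas as pd

EMPTY_DATA_STRING = "**NO_DATA**"

def clean_row(cells):
    """Replace empty cells with the placeholder and drop the trailing extra cell."""
    cleaned = [c if c else EMPTY_DATA_STRING for c in cells]
    return cleaned[:-1]

# Made-up rows standing in for the text of each 'td';
# the last cell mimics the extra 'td' that matches no column
header = ['Civilité', 'Nom Prénom', 'Statut', 'No Sciper']
raw_rows = [
    ['Monsieur', 'Doe John', None, '000001', None],
    ['Madame', 'Roe Jane', 'Présent', '000002', None],
]

df_example = pd.DataFrame([clean_row(r) for r in raw_rows], columns=header)
print(df_example)
```

After cleaning, every row has exactly as many cells as there are headers, which is what `pd.DataFrame(content, columns=header)` requires.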

In [186]:
url = build_query_url('Informatique', '2016-2017', 'Bachelor semestre 5')
url2 = build_query_url('Informatique', '2007-2008', 'Bachelor semestre 6b')

In [187]:
def getYearRange(_from, _to):
    diff = _to - _from
    years = []
    for x in range (0, diff):
        years.append(str(_from + x) + '-' + str(_from + x + 1))
        
    return years

In [188]:
def bachelorSems(name, number):
    # Build semester labels such as "Bachelor semestre 1" from a prefix and a list of suffixes
    return [name + n for n in number]

In [189]:
b_sems = bachelorSems("Bachelor semestre ", ['1'])

In [190]:
year = getYearRange(2007, 2017)
year

['2007-2008',
 '2008-2009',
 '2009-2010',
 '2010-2011',
 '2011-2012',
 '2012-2013',
 '2013-2014',
 '2014-2015',
 '2015-2016',
 '2016-2017']

In [191]:
queries = []
for x in year:
    for y in b_sems:
        queries.append(build_query_url('Informatique', x, y))    

In [192]:
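bachelor_student_data = []

# queries holds one URL per (year, semester) pair, ordered by year first,
# so the year of query i is year[i // len(b_sems)]
for i, q in enumerate(queries):
    df = getDataFrame(q)
    df['year'] = year[i // len(b_sems)][:4]
    bachelor_student_data.append(df)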


In [193]:
b_sem1_data = pd.concat(bachelor_student_data)

In [194]:
# b_sem1_data = b_sem1_data.set_index(['No Sciper'])

In [200]:
b_sem1_data = b_sem1_data.drop_duplicates(subset='No Sciper', keep='first')
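Because the per-year frames are concatenated in chronological order, `keep='first'` retains the earliest year in which each student appears in the semester. A minimal sketch with made-up records (the Sciper numbers below are hypothetical) illustrates this:

```python
import pandas as pd

# Made-up records: student 000001 appears in two consecutive years
frames = [
    pd.DataFrame({'No Sciper': ['000001', '000002'], 'year': ['2007', '2007']}),
    pd.DataFrame({'No Sciper': ['000001', '000003'], 'year': ['2008', '2008']}),
]

# Concatenation preserves the order of the frames, oldest first
combined = pd.concat(frames, ignore_index=True)

# Keep only the first (earliest) row for each student
first_seen = combined.drop_duplicates(subset='No Sciper', keep='first')
print(first_seen)
```

Here student 000001 is kept with year 2007 and the 2008 row is discarded.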

In [201]:
b_sem1_data.set_index('No Sciper')

No Sciper,Civilité,Nom Prénom,year
169569,Monsieur,Arévalo Christian,2007
174905,Monsieur,Aubelle Flavien,2007
173922,Monsieur,Badoud Morgan,2007
179406,Monsieur,Baeriswyl Jonathan,2007
179428,Monsieur,Barroco Michael,2007
179324,Monsieur,Belfis Nicolas,2007
174597,Monsieur,Beliaev Stanislav,2007
179449,Monsieur,Bindschaedler Vincent,2007
178553,Monsieur,Bloch Marc-Olivier,2007
179426,Monsieur,Bloch Remi,2007
