**Homework 02 - Data from the Web**

In [32]:
# Import libraries
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import requests

In [204]:
# URL of IS Academia public access form
ISA_form_url = 'http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247'

# Constants used below
QUERY_STRING = 'query_string'
EMPTY_DATA_STRING = "**NO_DATA**"

**Retrieve data: how to**

To get the data, we need to crawl the IS-Academia public platform. IS-Academia has an online page with a form that lets us download data based on different criteria (e.g. year, semester, ...).

To have a convenient way to retrieve data from this form, we create a data structure holding every value the form accepts for each of its filters. The data structure we decided to use is a dictionary:

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter name  -> <br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Filter value -> Filter value id <br/>

As you can see, the dictionary is actually a dictionary of dictionaries. The first level corresponds to the filter name. The second level maps each possible filter value to its id.

Example:<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Unité académique  -> <br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Informatique -> 249847 <br/>

Note that the second level also contains, under the `query_string` key, the name of the URL parameter to use for that filter when creating the query.
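The nested layout described above can be illustrated with a small hand-written dictionary. This is only a sketch: the entries are copied from the output shown further down in this notebook, the name `form_values_example` is made up, and `lookup` is a hypothetical helper that is not part of the notebook.

```python
# Hand-written excerpt of the "filter name -> filter value -> id" dictionary.
# The real dictionary is scraped from the IS-Academia form.
QUERY_STRING = 'query_string'

form_values_example = {
    'Période académique': {
        QUERY_STRING: 'ww_x_PERIODE_ACAD',   # name of the URL parameter for this filter
        '2015-2016': '213638028',
        '2016-2017': '355925344',
    },
}

def lookup(form_values, filter_name, filter_value):
    """Return the (parameter name, id) pair for one filter value (hypothetical helper)."""
    level2 = form_values[filter_name]
    return level2[QUERY_STRING], level2[filter_value]

param, value_id = lookup(form_values_example, 'Période académique', '2016-2017')
print(param + '=' + value_id)   # prints ww_x_PERIODE_ACAD=355925344
```

Keeping the parameter name next to the values means a single lookup on the filter name gives everything needed to write one `name=id` pair of the query.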

In [205]:
# Construction of the form "values -> id" dictionary

# Retrieve HTML page
r  = requests.get(ISA_form_url)
soup = BeautifulSoup(r.text, 'html.parser')

# Filter HTML page to get only the filters
# (matching on the 'id' attribute directly avoids a KeyError on tables that have no id)
table = soup.find('table', id='filtre')

# Each filter name has its own 'tr'
filter_names = table.findAll('tr')

# The dictionary to store form "values -> id"
formValues = {}

for filter_name in filter_names:
    # Second level of the dictionary
    dic = {}
    
    # String corresponding to the filter name (first level)
    filter_name_string = filter_name.find('th').string
    
    
    # The first entry on the second level is the string to use for the query
    dic[QUERY_STRING] = filter_name.find('select')['name']
    
    
    # Now we iterate over each possibility for this filter
    filter_values = filter_name.findAll(lambda tag: tag.name == 'option' and tag.get('value') not in (None, "null"))
    for v in filter_values:
        # Add each combination of "filter value -> id" to the second level
        dic[v.string] = v['value']
    
    # Add the first level to the dictionary
    formValues[filter_name_string] = dic

formValues

{'Période académique': {'2007-2008': '978181',
  '2008-2009': '978187',
  '2009-2010': '978195',
  '2010-2011': '39486325',
  '2011-2012': '123455150',
  '2012-2013': '123456101',
  '2013-2014': '213637754',
  '2014-2015': '213637922',
  '2015-2016': '213638028',
  '2016-2017': '355925344',
  'query_string': 'ww_x_PERIODE_ACAD'},
 'Période pédagogique': {'Bachelor semestre 1': '249108',
  'Bachelor semestre 2': '249114',
  'Bachelor semestre 3': '942155',
  'Bachelor semestre 4': '942163',
  'Bachelor semestre 5': '942120',
  'Bachelor semestre 5b': '2226768',
  'Bachelor semestre 6': '942175',
  'Bachelor semestre 6b': '2226785',
  'Master semestre 1': '2230106',
  'Master semestre 2': '942192',
  'Master semestre 3': '2230128',
  'Master semestre 4': '2230140',
  'Mineur semestre 1': '2335667',
  'Mineur semestre 2': '2335676',
  'Mise à niveau': '2063602308',
  'Projet Master automne': '249127',
  'Projet Master printemps': '3781783',
  'Semestre automne': '953159',
  'Semestre prin

**Retrieve data: build queries**

We wrote a function to build a query URL in an easy and intuitive way. The function uses the dictionary built above to retrieve the parameter names and ids to put in the URL.

In [224]:
def build_query_url(section, year, semester, season=EMPTY_DATA_STRING):
    """ Create a query URL for IS-Academia. The resulting page contains all data about EPFL students
        registered for a given semester of a specific year in a section. The section, year, and semester
        are provided as function arguments.
        Note that the function also supports the semester type (spring or autumn), but it's
        optional (not needed for the homework)"""
    
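    # CONSTANTS TO USE IN THE URL
    # ISA_url (base URL of the report page, without query string) must be defined beforehand;
    # only ISA_form_url is defined in the cells above
    url = ISA_url + '?'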
    url += 'ww_x_GPS=-1&'
    url += 'ww_i_reportModel=133685247&'
    url += 'ww_i_reportModelXsl=133685270&'
    
    # Add the section (e.g. Informatique)
    v1 = formValues['Unité académique']
    url += v1[QUERY_STRING]+'='+v1[section]+'&'
    
    # Add the year (e.g. 2016-2017)
    v1 = formValues['Période académique']
    url += v1[QUERY_STRING]+'='+v1[year]+'&'
    
    # Add the semester (e.g. Bachelor semestre 5)
    v1 = formValues['Période pédagogique']
    url += v1[QUERY_STRING]+'='+v1[semester]
    
    # If a season is specified, add it to the query
    if season != EMPTY_DATA_STRING:
        v1 = formValues['Type de semestre']
        url += '&'+v1[QUERY_STRING]+'='+v1[season]
    
    
    return url
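
As an alternative to concatenating strings by hand, the same kind of URL can be assembled with `urllib.parse.urlencode` from the standard library, which also takes care of escaping. The sketch below is not the function used in this notebook: `build_query_url_sketch`, its `base_url` and `selections` arguments, and the `example.invalid` address are assumptions made for illustration, and the small dictionary only reuses values shown in the output above.

```python
from urllib.parse import urlencode

QUERY_STRING = 'query_string'

# Excerpt of the scraped dictionary, with values copied from the output above
form_values_example = {
    'Période académique': {QUERY_STRING: 'ww_x_PERIODE_ACAD', '2016-2017': '355925344'},
}

def build_query_url_sketch(base_url, form_values, selections):
    """Build a query URL from {filter name: filter value} selections (hypothetical helper)."""
    # Constant parameters, same values as in build_query_url
    params = [
        ('ww_x_GPS', '-1'),
        ('ww_i_reportModel', '133685247'),
        ('ww_i_reportModelXsl', '133685270'),
    ]
    # One (parameter name, id) pair per selected filter
    for filter_name, filter_value in selections.items():
        level2 = form_values[filter_name]
        params.append((level2[QUERY_STRING], level2[filter_value]))
    return base_url + '?' + urlencode(params)

# 'http://example.invalid/report' is a placeholder, not the IS-Academia address
url = build_query_url_sketch('http://example.invalid/report',
                             form_values_example,
                             {'Période académique': '2016-2017'})
print(url)
```

Passing the selections as a dictionary keeps optional filters, such as the semester type, out of the function signature: a filter is simply present in `selections` or not.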

**Retrieve data: from HTML to DataFrame**

Once we have a query URL to make our request, we need to extract the data stored in the HTML table into a Pandas DataFrame.

In [227]:
url = build_query_url('Informatique', '2016-2017', 'Bachelor semestre 5')

r  = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Dataframe structure
header = []        # Will store the header for each column
content = []       # The content of the data, row by row

# Student data is stored in the only 'table' of the HTML page
for table in soup.find_all('table'):    
    
    #***************HEADER***********
    # The header is bold in the html page, in a 'th'
    for column_header in table.find_all('th'):
        # Only append the header if it's not empty
        # This is also a way to clean the 'th' strings we receive from the HTML table,
        # because the first 'th' is not a column header
        if column_header.string:    
            header.append(column_header.string)
                
                
    #***************CONTENT***********
    # The actual content of the table, the rows
    # Each row represents a student
    for students in table.find_all('tr'):
        
        student_row = []
        # Each column attribute for a student is stored in a 'td'
        for student_column in students.find_all('td'):
            s = student_column.string
            # If a cell has no data, replace it with EMPTY_DATA_STRING
            if not s:
                student_row.append(EMPTY_DATA_STRING)
            else:
                student_row.append(s)
                
        if student_row:
            # In the HTML table retrieved from IS-Academia, each row ends with an extra 'td'
            # that doesn't correspond to any column. We need to remove it
            student_row.pop()
            content.append(student_row)

# Create a DataFrame from the data recovered.
df = pd.DataFrame(content, columns=header)
df

Unnamed: 0,Civilité,Nom Prénom,Orientation Bachelor,Orientation Master,Spécialisation,Filière opt.,Mineur,Statut,Type Echange,Ecole Echange,No Sciper
0,Monsieur,Abate Bryan Jeremy,**NO_DATA**,**NO_DATA**,**NO_DATA**,6 - Visual computing,**NO_DATA**,Congé,Erasmus,University of Bristol,246671
1,Monsieur,Alami-Idrissi Ali,**NO_DATA**,**NO_DATA**,**NO_DATA**,5 - Signal and Image Processing,**NO_DATA**,Congé,Erasmus,Linköping University,251759
2,Monsieur,Albergoni Tobia,**NO_DATA**,**NO_DATA**,**NO_DATA**,6 - Visual computing,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,248575
3,Monsieur,Alonso Seisdedos Florian,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,**NO_DATA**,Congé,**NO_DATA**,**NO_DATA**,215576
4,Monsieur,Aoun Leonardo,**NO_DATA**,**NO_DATA**,**NO_DATA**,6 - Visual computing,**NO_DATA**,Congé,Bilatéral,"University of Washington, Seattle",249498
5,Monsieur,Bachmann Roman Christian,**NO_DATA**,**NO_DATA**,**NO_DATA**,6 - Visual computing,**NO_DATA**,Congé,Erasmus,Norwegian University of Science and Technology...,234551
6,Monsieur,Badoux Christophe Dylan,**NO_DATA**,**NO_DATA**,**NO_DATA**,6 - Visual computing,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,249694
7,Monsieur,Ballerini Marco Roberto Julian,**NO_DATA**,**NO_DATA**,**NO_DATA**,6 - Visual computing,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,236818
8,Madame,Baraschi Zoé,**NO_DATA**,**NO_DATA**,**NO_DATA**,6 - Visual computing,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,219665
9,Monsieur,Bardet Mike Douglas,**NO_DATA**,**NO_DATA**,**NO_DATA**,6 - Visual computing,**NO_DATA**,Présent,**NO_DATA**,**NO_DATA**,208714
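
The row-cleaning logic of the loop above (replace missing cells with the placeholder, then drop the trailing cell) can be isolated in a small function and checked without any network access. This is a sketch under assumptions: `clean_row` is a hypothetical helper that is not part of the notebook, and the sample cells are made up.

```python
EMPTY_DATA_STRING = "**NO_DATA**"

def clean_row(cells, empty=EMPTY_DATA_STRING):
    """Replace missing cells by a placeholder and drop the trailing extra cell (hypothetical helper)."""
    row = [cell if cell else empty for cell in cells]
    # The last 'td' of each row does not correspond to any column
    return row[:-1]

# Made-up cells mimicking what `.string` returns: None when a 'td' is empty
raw_cells = ['Madame', 'Doe Jane', None, 'Présent', None]
print(clean_row(raw_cells))
# prints ['Madame', 'Doe Jane', '**NO_DATA**', 'Présent']
```

Slicing with `row[:-1]` also handles an empty row without raising, unlike `pop()` on an empty list.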
