# Prelude

Because of the IS-Academia website's *interesting* design choices, sacrificing a few goats to the deity of your choice may be required to fully understand this code.

In the immortal words of Dante Alighieri: **Lasciate ogni speranza, voi ch'entrate!**

# Fetching the data

First, let's import the libs we need.

In [64]:
import requests # HTTP requests
from bs4 import BeautifulSoup # HTML parsing
import re # Regular expressions :(

Then, let's get the index form, i.e. the base page we'll use to get all data:

In [90]:
# It's just a URL with a few weird symbols, how complex can it be?
index_url = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter"

# ...requesting that doesn't return anything, turns out we need URL parameters.
# Surely it's simple and self-descriptive?
index_params = {
    # ...oh well...
    "ww_i_reportmodel": "133685247" 
}

index_html = requests.get(index_url, params=index_params).text
index_page = BeautifulSoup(index_html, "lxml")
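
To see roughly what that request looks like on the wire, here's a small sketch that builds the query string by hand with the standard library. `requests` does its own encoding of `params`, so treat the result as an approximation of the final URL rather than a guarantee.

```python
from urllib.parse import urlencode

index_url = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter"
index_params = {"ww_i_reportmodel": "133685247"}

# Encode the parameters and append them as a query string.
full_url = index_url + "?" + urlencode(index_params)
print(full_url)
```

Pasting a URL like this into a browser is a quick way to check what the server actually returns before blaming the parser.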

Then, fetch the page containing all of the "Informatique" (CS) links:

In [94]:
# Get the parameters for that page by fetching all of the "hidden" parameters, then adding our own to select CS.
info_index_params = {i["name"]: i["value"] for i in index_page.find_all("input", attrs={"type": "hidden"})}

# Find the "HTML" option, get its value.
# Ideally we'd also find it by looking for "HTML", but ISA doesn't use radio buttons like normal people do:
# instead they have a button then some text right next to it, so looking for "HTML" will just find the text. :/
# So we grab the first input with that name and assume it's the HTML one.
info_index_params["ww_i_reportModelXsl"] = index_page.find("input", attrs={"name": "ww_i_reportModelXsl"})["value"]

# Find the "Informatique" option, get its value
info_index_params["ww_x_UNITE_ACAD"] = index_page.find("option", string="Informatique")["value"]

info_index_html = requests.get(index_url, params=info_index_params).text
info_index_page = BeautifulSoup(info_index_html, "lxml")
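
The "copy every hidden input, then add our own" trick can be tried offline. The sketch below runs on a made-up form, not the real ISA markup, and uses the standard library's `html.parser` instead of BeautifulSoup purely so it works without extra packages. The `some_hidden_flag` field and the `"999"` option value are invented for illustration.

```python
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and "name" in attrs:
            # A missing value attribute becomes an empty string.
            self.hidden[attrs["name"]] = attrs.get("value") or ""

# Made-up form for illustration only (hypothetical field names and values).
sample_form = """
<form>
  <input type="hidden" name="some_hidden_flag" value="1">
  <input type="hidden" name="ww_i_reportmodel" value="133685247">
  <input type="radio" name="ww_i_reportModelXsl" value="42">
</form>
"""

collector = HiddenInputCollector()
collector.feed(sample_form)

# Start from the hidden fields, then add our own selection on top.
params = dict(collector.hidden)
params["ww_x_UNITE_ACAD"] = "999"  # hypothetical <option> value
print(params)
```

The radio input is skipped because it isn't hidden, which is why the notebook has to look up `ww_i_reportModelXsl` separately.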

Looking at the page in a browser, there are links.  
Have you ever heard of links? You define them with text and a URL the user will go to if they click on the link.

...well, that's how normal people do links.  
IS-Academia does links that lead nowhere, with JavaScript intercepting the click, building a URL by manually scanning the user input on the page, setting a nested webpage's URL to that, and then reloading the page.
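
Since the real target hides inside the `onclick` attribute, the only way out is to dig the ID out of the JavaScript with a regex. Here's a minimal sketch of that step. The `onclick` string below is invented (the actual ISA JavaScript may look different), and `extract_gps_id` is a hypothetical helper name.

```python
import re

# Hypothetical onclick value; the real ISA JavaScript may look different.
sample_onclick = "loadReport('ww_x_GPS=123456789');return false;"

def extract_gps_id(onclick):
    """Return the digits after 'ww_x_GPS=', or None if they are absent."""
    match = re.search(r"ww_x_GPS=(\d+)", onclick)
    return match.group(1) if match else None

print(extract_gps_id(sample_onclick))
print(extract_gps_id("alert('nope')"))
```

Returning `None` for links without an ID makes it easy to skip them instead of crashing halfway through the page.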

In [107]:
def find_semesters(name_regex):
    """Find all semesters in Informatique matching the given regex."""
    semesters_by_id = []
    for link in info_index_page.find_all("a", attrs={"class": "ww_x_GPS"}):
        # Parse the link name, to find the year + semester
        # The name regex is parenthesized so it's saved as a group and not just matched
        link_name_match = re.search(r"Informatique, (\d+)-\d+, (" + name_regex + ")", link.text)
        
        # Ignore weird stuff
        if link_name_match is None:
            continue
        
        # Find the link ID inside the onclick JavaScript.
        # ...
        # ...
        # ...why? just... why? why would anybody ever write a webpage like this?
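        link_id_match = re.search(r"ww_x_GPS=(\d+)", link.get("onclick", ""))

        # Skip links whose onclick doesn't carry an ID, instead of crashing on None
        if link_id_match is None:
            continue

        semesters_by_id.append((link_id_match.group(1), link_name_match.group(1), link_name_match.group(2)))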
           
    # Now filter them to keep only 2007 and onwards.
    # Also remove those 2017 and later, since that hasn't happened yet so the data would be of dubious value.
    semesters_by_id = [v for v in semesters_by_id if 2007 <= int(v[1]) <= 2016]
    
    # Are you scared yet? Now it gets worse!
    students_by_semester = []
    for semester in semesters_by_id:
        # For some reason the URL is now .html instead of .filter
        semester_url = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.html"
        
        # The parameters are the same as before, except now there's ww_x_GPS for the semester ID
        semester_params = info_index_params.copy()
        semester_params["ww_x_GPS"] = semester[0]
        
        semester_html = requests.get(semester_url, params=semester_params).text
        semester_page = BeautifulSoup(semester_html, "lxml")
        
        students = []
        # Iterate all rows, except the ones that have headers
        for row in semester_page.find_all("tr"):
            if row.contents[0].name != "th":
                # Just get the name!
                # It's the 2nd column. Can't be that hard.
                student_name = row.contents[1].text
                
                # ...oh wait. Instead of a normal space, it's a non-breaking space.
                # So let's replace that...
                student_name = student_name.replace("\xa0", " ")
                
                students.append(student_name)
        
        # Aaand we're done. Finally.
        students_by_semester.append((semester[1], semester[2], students))
        
    return students_by_semester
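
To check the name-matching and year-filtering logic without hitting the server, here's a sketch that runs the same regex on a few made-up link texts. The samples only assume the "Informatique, YYYY-YYYY, name" pattern implied by the regex above. `parse_link_name` and the `Bachelor semestre \d` pattern are hypothetical, chosen for illustration.

```python
import re

def parse_link_name(link_text, name_regex):
    """Return (start_year, semester_name), or None if the text doesn't match."""
    match = re.search(r"Informatique, (\d+)-\d+, (" + name_regex + ")", link_text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

# Made-up link texts following the "Informatique, YYYY-YYYY, name" pattern.
samples = [
    "Informatique, 2006-2007, Bachelor semestre 1",
    "Informatique, 2010-2011, Bachelor semestre 1",
    "Informatique, 2010-2011, Master semestre 2",
    "Informatique, 2017-2018, Bachelor semestre 1",
]

parsed = [parse_link_name(s, r"Bachelor semestre \d") for s in samples]

# Same year window as find_semesters: 2007 through 2016 inclusive.
kept = [p for p in parsed if p is not None and 2007 <= p[0] <= 2016]
print(kept)
```

Only the 2010 Bachelor entry survives: the Master link doesn't match the name regex, and the other two fall outside the year window. Assuming the real link texts follow that pattern, a call like `find_semesters(r"Bachelor semestre \d")` would apply the same filter to live data.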