In [3]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

# Getting familiar with the tools and data

Let's first analyze the HTML content of the IS-Academia Portal. We will do this by using the `requests` library to `GET` the HTML content of a given URL.

In [5]:
r = requests.get("http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247")

Now we will use `BeautifulSoup` to parse the data and display it in a readable, indented form with the `prettify()` method.

In [7]:
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
   <div>
   </div>
   <title>
   </title>
   <script src="GEDPUBLICREPORTS.txt?ww_x_path=Gestac.Base.Palette_js&amp;ww_c_langue=fr" type="text/javascript">
   </script>
   <link href="GEDPUBLICREPORTS.css?ww_x_path=Gestac.Moniteur.Style" rel="stylesheet" type="text/css">
    <link href="GEDPUBLICREPORTS.css?ww_x_path=Gestac.Moniteur.StyleNavigator" rel="stylesheet" type="text/css"/>
   </link>
  </meta>
 </head>
 <body alink="#666666" bgcolor="#ffffff" link="#666666" marginheight="0" marginwidth="5" vlink="#666666">
  <div class="filtres">
   <form action="!GEDPUBLICREPORTS.filter" method="GET" name="f">
    <input name="ww_b_list" type="hidden" value="1">
     <input name="ww_i_reportmodel" type="hidden" value="133685247">
      <input name="ww_c_langue" type="hidden" value="">
       <h1 id="titre">
        Liste des étudiants inscrits par semestre
       </h1>
       <table border="0" id="format">
 

With `BeautifulSoup`, we can conveniently dig deeper into the HTML content, as described in this tutorial: https://www.crummy.com/software/BeautifulSoup/bs4/doc/. We will isolate the filters used to distinguish students by Major (Unité académique), Academic Year (Période académique), Student Status (Période pédagogique), and Semester Type (Type de semestre). We have identified from the output above that the filters are in the `body`, between `<table>` tags with `id="filtre"`. Finally, we can use `find_all('tr')` to get each filter as an entry in a list.

In [21]:
filters = soup.body.find(id="filtre").find_all('tr')
print(filters)

[<tr><th>Unit\xe9 acad\xe9mique</th><td><input name="zz_x_UNITE_ACAD" type="hidden" value=""><select name="ww_x_UNITE_ACAD" onchange="document.f.zz_x_UNITE_ACAD.value=document.f.ww_x_UNITE_ACAD.options[document.f.ww_x_UNITE_ACAD.selectedIndex].text"><option value="null"></option><option value="942293">Architecture</option><option value="246696">Chimie et g\xe9nie chimique</option><option value="943282">Cours de math\xe9matiques sp\xe9ciales</option><option value="637841336">EME (EPFL Middle East)</option><option value="942623">G\xe9nie civil</option><option value="944263">G\xe9nie m\xe9canique</option><option value="943936">G\xe9nie \xe9lectrique et \xe9lectronique </option><option value="2054839157">Humanit\xe9s digitales</option><option value="249847">Informatique</option><option value="120623110">Ing\xe9nierie financi\xe8re</option><option value="946882">Management de la technologie</option><option value="944590">Math\xe9matiques</option><option value="945244">Microtechnique</option

Admittedly, not the nicest output. Apart from the square brackets, it's hard to even tell that it's a list! Let's try to output this more cleanly. From the `prettify()` output, we saw that each filter option, e.g. `Architecture` for Unité académique or `2010-2011` for Période académique, is wrapped in an `option` tag. Let's call `find_all('option')` on each item in the above list to output the filter options more cleanly.

In [24]:
for field in filters:
    print(field.find_all('option'))

[<option value="null"></option>, <option value="942293">Architecture</option>, <option value="246696">Chimie et g\xe9nie chimique</option>, <option value="943282">Cours de math\xe9matiques sp\xe9ciales</option>, <option value="637841336">EME (EPFL Middle East)</option>, <option value="942623">G\xe9nie civil</option>, <option value="944263">G\xe9nie m\xe9canique</option>, <option value="943936">G\xe9nie \xe9lectrique et \xe9lectronique </option>, <option value="2054839157">Humanit\xe9s digitales</option>, <option value="249847">Informatique</option>, <option value="120623110">Ing\xe9nierie financi\xe8re</option>, <option value="946882">Management de la technologie</option>, <option value="944590">Math\xe9matiques</option>, <option value="945244">Microtechnique</option>, <option value="945571">Physique</option>, <option value="944917">Science et g\xe9nie des mat\xe9riaux</option>, <option value="942953">Sciences et ing\xe9nierie de l'environnement</option>, <option value="945901">Science
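The same traversal can also collect the options into a dictionary, which is easier to read than the raw tag lists. The following is only a minimal sketch: `sample_html` is a hand-trimmed excerpt of the markup printed above (inlined so the cell runs without a network call), and the helper name `options_by_filter` is our own, not part of `BeautifulSoup`.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened copy of the filter table, with only a few options kept.
sample_html = """
<table id="filtre">
<tr><th>Unité académique</th><td>
<input name="zz_x_UNITE_ACAD" type="hidden" value="">
<select name="ww_x_UNITE_ACAD">
<option value="null"></option>
<option value="942293">Architecture</option>
<option value="249847">Informatique</option>
</select></td></tr>
</table>
"""

def options_by_filter(rows):
    """Map each <select> name to a {label: value} dict of its non-empty options."""
    result = {}
    for row in rows:
        select = row.find('select')
        if select is None:
            continue  # skip rows that do not hold a drop-down
        result[select['name']] = {
            opt.get_text(strip=True): opt['value']
            for opt in select.find_all('option')
            if opt.get_text(strip=True)  # drop the blank placeholder option
        }
    return result

sample_rows = BeautifulSoup(sample_html, 'html.parser').find(id='filtre').find_all('tr')
option_map = options_by_filter(sample_rows)
print(option_map)
```

On the real page, passing the `filters` list from the earlier cell to `options_by_filter` should give one entry per filter, assuming every row contains a `select` element like the one shown here.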

That's a bit better. We can see a `value` associated with each field option, e.g. `249847` for `Informatique`. Let's actually use some of these filters on the website itself and intercept the requests using Postman + Postman Interceptor. Postman helps us track which URLs are requested and analyze the corresponding HTML. Let's check out the URL and HTML content (with `BeautifulSoup`) when we select the following field options: `Informatique`, `2009-2010`, `Bachelor semestre 1`, and `Semestre d'automne`.

http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_b_list=1&ww_i_reportmodel=133685247&ww_c_langue=&ww_i_reportModelXsl=133685270&zz_x_UNITE_ACAD=Informatique&ww_x_UNITE_ACAD=249847&zz_x_PERIODE_ACAD=2009-2010&ww_x_PERIODE_ACAD=978195&zz_x_PERIODE_PEDAGO=Bachelor+semestre+1&ww_x_PERIODE_PEDAGO=249108&zz_x_HIVERETE=Semestre+d%27automne&ww_x_HIVERETE=2936286&dummy=ok

In the URL we can see the options we selected! Moreover, they have been passed as query parameters. These parameter names can be identified by navigating through our `filters` list (by going into the `td` tag and then reading the `name` attribute of the `input` tag).

In [31]:
for field in filters:
    print(field.td.input["name"])

zz_x_UNITE_ACAD
zz_x_PERIODE_ACAD
zz_x_PERIODE_PEDAGO
zz_x_HIVERETE
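With the parameter names and option values in hand, the intercepted URL can be rebuilt programmatically instead of being copied from Postman. The sketch below uses only the standard library; the `params` dictionary simply restates the values visible in the intercepted URL, and `BASE_URL` is a name we introduce for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the endpoint is the part of the intercepted URL before the "?".
BASE_URL = "http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter"

# Each zz_x_* parameter carries the human-readable label,
# each ww_x_* parameter carries the matching option value.
params = {
    "ww_b_list": "1",
    "ww_i_reportmodel": "133685247",
    "ww_c_langue": "",
    "ww_i_reportModelXsl": "133685270",
    "zz_x_UNITE_ACAD": "Informatique",
    "ww_x_UNITE_ACAD": "249847",
    "zz_x_PERIODE_ACAD": "2009-2010",
    "ww_x_PERIODE_ACAD": "978195",
    "zz_x_PERIODE_PEDAGO": "Bachelor semestre 1",
    "ww_x_PERIODE_PEDAGO": "249108",
    "zz_x_HIVERETE": "Semestre d'automne",
    "ww_x_HIVERETE": "2936286",
    "dummy": "ok",
}

# urlencode escapes spaces as "+" and the apostrophe as "%27".
url = BASE_URL + "?" + urlencode(params)
print(url)

# Going the other way: decode the query string back into a dictionary.
decoded = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(decoded["zz_x_HIVERETE"])
```

Passing the same dictionary as `params=` to `requests.get` should produce an equivalent request, so the URL never needs to be assembled by hand.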


When checking out the HTML content, we see a new table at the bottom (corresponding to the options we see on the IS-Academia portal). Let's check it out with `BeautifulSoup`.

In [None]:
r = requests.get("http://isa.epfl.ch/imoniteur_ISAP/!GEDPUBLICREPORTS.filter?ww_i_reportModel=133685247")
soup = BeautifulSoup(r.content, 'html.parser')