First, we import all the libraries we need.

In [None]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import json

We noticed that the actual data from topuniversities is not embedded directly in the webpage, but served from a separate text file that contains JSON.
Thus, we first fetch this JSON, parse it, and take its first 200 entries.
We also noticed that the university with rank 199 is actually the 198th entry, and thus the last 3 universities need to have their rank corrected.

In [None]:
r = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508259845358')
raw_data = json.loads(r.text)['data'][:200]

We can print the first entry of the data to see how the information is represented.

In [None]:
raw_data[0]

We can now define functions that will help us during the processing of this json.

First, process_university takes as input the raw JSON of a particular university and outputs a dictionary containing the name, rank, country, region, number of faculty members (international and total) and number of students (international and total) for that university.

It uses other functions defined below.

In [None]:
def process_university(uni):
    # Fields available directly in the JSON entry
    name = uni['title']
    rank = get_rank(uni['rank_display'])
    country = uni['country']
    region = uni['region']

    # Faculty and student counts come from the university's own page
    numbers = get_numbers(uni['url'])
    info = {'name': name, 'rank': rank, 'country': country, 'region': region}
    info.update(numbers)
    return info

As there can be ties in rank, the displayed rank is not always an integer: it can contain an "=" sign, which we strip before converting. Furthermore, as noted above, the last 3 universities have incorrect ranks that need to be fixed.

In [None]:
def get_rank(rank_display):
    rank = int(rank_display.replace("=", ""))
    if rank >= 199:
        rank -= 1
    return rank
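
To see both adjustments in isolation, here is a small sketch that applies the same function to a few made-up `rank_display` strings. The inputs are illustrative assumptions, not values taken from the downloaded data.

```python
def get_rank(rank_display):
    # Tied ranks contain an '=' sign, which is stripped before converting.
    rank = int(rank_display.replace("=", ""))
    # Ranks from 199 upwards are shifted down by one, as described above.
    if rank >= 199:
        rank -= 1
    return rank


# Hypothetical display strings, chosen only to exercise each branch.
for shown in ["1", "=14", "=199", "201"]:
    print(shown, "->", get_rank(shown))
```

Ranks below 199 are returned unchanged apart from the removed "=", while "=199" and "201" come out as 198 and 200.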
    

To get the number of faculty members (international and total) and the number of students (international and total), we need to make another request, and this time we need to parse the webpage using BeautifulSoup.

By inspecting the webpage, we identified the classes of the elements that contain the numbers. Once we have these elements, we need to parse their content further to get each value as an integer.

In [None]:
def get_numbers(url):
    r = requests.get("https://www.topuniversities.com/" + url)
    soup = BeautifulSoup(r.text, 'html.parser')

    # Faculty counts: first match is the total, second the international count
    faculty_info = soup.select(".text .number")
    total_faculty = parse_int(faculty_info[0].decode_contents(formatter="html"))
    international_faculty = parse_int(faculty_info[1].decode_contents(formatter="html"))

    # Student counts: same order (total first, then international)
    student_info = soup.select(".barp .number")
    total_student = parse_int(student_info[0].decode_contents(formatter="html"))
    international_student = parse_int(student_info[1].decode_contents(formatter="html"))

    return {
        'total_faculty': total_faculty,
        'international_faculty': international_faculty,
        'total_student': total_student,
        'international_student': international_student,
    }

In [None]:
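def parse_int(text):
    # Named 'text' rather than 'str' to avoid shadowing the built-in type.
    # Strip newlines, spaces and thousands separators before converting.
    return int(text.replace("\n", "").replace(" ", "").replace(",", ""))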
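
As a quick check of the parsing step, the sketch below runs the helper on a made-up string. The sample value and its surrounding whitespace are assumptions about what the page content might look like, not text copied from the site.

```python
def parse_int(text):
    # Remove newlines, spaces and thousands separators, then convert.
    return int(text.replace("\n", "").replace(" ", "").replace(",", ""))


# Hypothetical element content: padded with whitespace, comma-separated.
sample = "\n  2,982 \n"
print(parse_int(sample))  # prints 2982
```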

In [None]:
for uni in raw_data:
    print(process_university(uni))

In [None]:
#TODO : save final json to avoid doing all the requests again
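
One possible way to address the TODO above is to cache the processed records on disk. This is only a minimal sketch: the file name `universities.json` and the helpers `save_results` and `load_results` are hypothetical and do not appear elsewhere in this notebook.

```python
import json
import os

CACHE_FILE = 'universities.json'  # hypothetical file name


def save_results(results, path=CACHE_FILE):
    # Write the list of per-university dictionaries to disk as JSON.
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=2)


def load_results(path=CACHE_FILE):
    # Return the cached list, or None if no cache file exists yet.
    if not os.path.exists(path):
        return None
    with open(path, encoding='utf-8') as f:
        return json.load(f)
```

In the notebook, one could call `load_results()` first and only rebuild the list with `[process_university(uni) for uni in raw_data]`, followed by `save_results(...)`, when it returns `None`.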