# 02 - Data from the Web

In this homework we will extract interesting information from www.topuniversities.com and www.timeshighereducation.com, two platforms that each maintain a global ranking of universities. These rankings are not offered as downloadable datasets, so you will have to find a way to scrape the information we need! You are not allowed to manually download the entire ranking; rather, you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this brief tutorial to quickly understand how to use it.
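
Pages like these often fill their tables from a background request that returns JSON, which is the kind of traffic Postman's Interceptor can reveal. The sketch below only shows how such a payload could be unpacked once it has been found. The function name `records_from_payload`, the top-level `data` key, and the field names in the sample are all assumptions made for illustration; the real names must be read from the intercepted response.

```python
import json


def records_from_payload(payload_text, list_key="data"):
    """Return the list of records stored under `list_key` in a JSON payload.

    `list_key` is a hypothetical default; use the key seen in the real response.
    """
    payload = json.loads(payload_text)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object at the top level")
    records = payload.get(list_key, [])
    if not isinstance(records, list):
        raise ValueError("expected a list under key {0!r}".format(list_key))
    return records


# Made-up sample payload, only to exercise the function
sample_payload = '{"data": [{"rank_display": "1", "title": "Example University"}]}'
sample_records = records_from_payload(sample_payload)
print(sample_records[0]["title"])
```

In practice the payload text would come from `requests.get(...).text`, using the URL observed in the intercepted request.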

# Imports

In [None]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None

%matplotlib inline

# Constants definition

In [None]:
QS_RANKING_URL = 'https://www.topuniversities.com/university-rankings/world-university-rankings/2018'

TIMES_RANKING_URL = 'http://timeshighereducation.com/world-university-rankings/2018/world-ranking'

# General-purpose function definitions

In [None]:
def build_html_parser(url):
    '''
    Build a BeautifulSoup parser for a web page
    
    url      the URL of the web page to send a GET request to
    
    return   a parser for the given web page
    '''
    
    r = requests.get(url)
    page_body = r.text
    
    soup = BeautifulSoup(page_body, 'html.parser')
    
    return soup
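
A slightly more defensive variant of the function above, offered as a sketch rather than a replacement: it sets a timeout so a stalled server cannot hang the notebook, and it raises on HTTP error codes instead of parsing an error page. The name `fetch_soup` and the 10-second default are assumptions.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch `url` and return a BeautifulSoup parser, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing them
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

It is called the same way as `build_html_parser`, for example `fetch_soup(QS_RANKING_URL)`.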

### Task 1
Obtain the 200 top-ranking universities in www.topuniversities.com (ranking 2018)

In [None]:
# TEMPORARY SOLUTION! CHECK BELOW FOR THE EVOLUTION

# Read the locally saved page; the with-block closes the file afterwards
with open("qs.htm", "r") as file:
    page_body = file.read()

soup = BeautifulSoup(page_body, 'html.parser')

The ranking is presented as one ```<tr>``` element per university, all inside a single ```<table>``` tag. Each page has a table with 25 rows (one per university). Each row element has a unique id matching the pattern cid-xxx (where xxx is the specific entry number).

In [None]:
# Extract the list of row elements matching cid-xxx pattern
university_wrappers = soup.find_all("tr", {"id" : lambda x: x and x.startswith('cid-')})

print('Total number of items: {0}'.format(len(university_wrappers)))
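
The same selection can be checked on a tiny hand-written table before trusting it on the real page. This sketch uses a compiled regular expression instead of a lambda; the sample HTML and its ids are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup: two matching rows, one header row, one row without an id
sample_html = """
<table>
  <tr id="header"><th>Rank</th></tr>
  <tr id="cid-101"><td>1</td></tr>
  <tr id="cid-102"><td>2</td></tr>
  <tr><td>no id</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Keep only the rows whose id starts with 'cid-'
sample_rows = sample_soup.find_all('tr', id=re.compile(r'^cid-'))
sample_row_ids = [row['id'] for row in sample_rows]
print(sample_row_ids)
```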

In [None]:
def parse_detail_page(url_detail):
    '''
    Parse the information missing from the ranking table out of the university's detail page on the QS website
    
    url_detail   the URL of the university's detail page
    
    Return   a dictionary with all the data found, as integer values
    '''
    
    # Build a parser for the detail page
    soup = build_html_parser(url_detail)
    
    # Obtain and clean up the total faculty member value
    faculty_member_total = soup.find('div', class_='total faculty').find('div', class_='number').text
    faculty_member_total = faculty_member_total.strip('\n').replace(',','')
    
    # Obtain and clean up the international faculty member value
    faculty_member_inter = soup.find('div', class_='inter faculty').find('div', class_='number').text
    faculty_member_inter = faculty_member_inter.strip('\n').replace(',','')
    
    # Obtain and clean up the total students value
    student_total = soup.find('div', class_='total student').find('div', class_='number').text
    student_total = student_total.strip('\n').replace(',','')
    
    # Obtain and clean up the international students value
    student_inter = soup.find('div', class_='total inter').find('div', class_='number').text
    student_inter = student_inter.strip('\n').replace(',','')
    
    # Build a dictionary for the parsed information
    detail_info = {'Total faculty member' : int(faculty_member_total), 
                   'International faculty member' : int(faculty_member_inter), 
                   'Total student' : int(student_total), 
                   'International student' : int(student_inter)
                  }
    
    return detail_info
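
The four clean-up steps in the function above repeat the same pattern, so they could be factored into one small helper. This is a sketch; the name `parse_count` is an assumption, and it strips all surrounding whitespace rather than only newlines.

```python
def parse_count(raw_text):
    """Convert a scraped count with thousands separators into an int."""
    # Drop surrounding whitespace and newlines, then the thousands separators
    cleaned = raw_text.strip().replace(',', '')
    return int(cleaned)


print(parse_count('\n3,065\n'))
```

Each value could then be read with a single call such as `parse_count(soup.find('div', class_='total faculty').find('div', class_='number').text)`.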

In [None]:
# TODO: clarify what "region" means in the assignment (is it the city, or something else?)

university_list = []
for u in university_wrappers:
    rank = u.find('span', class_='rank').text  # get the rank value
    name = u.find('td', class_='uni').find('a').text # get the name value
    country = u.find('img', class_='flag')['alt'] # get the country value
        
    # Store the parsed information into a dictionary
    info = {'Rank': rank, 'Univerisity name': name, 'Country': country}
    
    # Extend the dictionary with the information from the detail page
    url_detail = u.find('td', class_='uni').find('a')['href']
    info.update(parse_detail_page(url_detail))
    
    university_list.append(info)
    
    
ranking_df = pd.DataFrame(university_list)
ranking_df
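
If the `href` values in the ranking table turn out to be relative paths, they have to be turned into full URLs before they can be requested. The sketch below shows one way to do that with the standard library. The `BASE_URL` value, the helper name `absolute_url`, and the sample path are assumptions.

```python
from urllib.parse import urljoin

# Assumption: the root of the QS website
BASE_URL = 'https://www.topuniversities.com'


def absolute_url(href, base=BASE_URL):
    """Resolve `href` against `base`; already absolute URLs are returned unchanged."""
    return urljoin(base, href)


# Made-up relative path, only to exercise the function
print(absolute_url('/universities/example-university'))
```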

In [None]:
ranking_df.set_index(['Univerisity name'], inplace=True)

We now calculate the two required ratios with the help of two auxiliary functions:

In [None]:
def compute_facutly_member_ratio(df):
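    '''
    Compute, for each row, the ratio between total faculty members and total students
    
    Return   a list with one ratio per row of the dataframe
    '''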
    li = list()
    for i, row in df.iterrows():
        li.append(row['Total faculty member'] / row['Total student'])
    return li

In [None]:
def compute_student_ratio(df):
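    '''
    Compute, for each row, the ratio between international students and total students
    
    Return   a list with one ratio per row of the dataframe
    '''
    li = list()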
    for i, row in df.iterrows():
        li.append(row['International student'] / row['Total student'])
    return li
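
The row-by-row loops above can also be written as column arithmetic, which pandas applies to every row at once. This is a sketch of that alternative; the function name `add_ratio_columns` is an assumption and the numbers in the toy dataframe are made up.

```python
import pandas as pd


def add_ratio_columns(df):
    """Return a copy of `df` with the two ratio columns added."""
    result = df.copy()
    # Element-wise division of one column by another
    result['Faculty/students ratio'] = result['Total faculty member'] / result['Total student']
    result['Intern/student ratio'] = result['International student'] / result['Total student']
    return result


# Made-up numbers, only to exercise the function
toy_df = pd.DataFrame({'Total faculty member': [100, 50],
                       'Total student': [1000, 200],
                       'International student': [250, 20]})
toy_ratios = add_ratio_columns(toy_df)
print(toy_ratios[['Faculty/students ratio', 'Intern/student ratio']])
```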

The computation results are stored in two new columns of the dataframe:

In [None]:
ranking_df['Faculty/students ratio'] = compute_facutly_member_ratio(ranking_df)
ranking_df['Intern/student ratio'] = compute_student_ratio(ranking_df)

ranking_df.head()

Bar plot of the two computed ratios for each university:

In [None]:
ranking_df[['Faculty/students ratio', 'Intern/student ratio']].plot.bar(figsize=(25, 10))