# 02 - Data from the Web

In this homework we will extract interesting information from www.topuniversities.com and www.timeshighereducation.com, two platforms that each maintain a global ranking of universities. These rankings are not offered as downloadable datasets, so you will have to find a way to scrape the information we need! You are not allowed to manually download the entire ranking; rather, you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this brief tutorial to quickly understand how to use it.

# Imports

In [None]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None

%matplotlib inline

# Constants definition

In [None]:
QS_RANKING_URL = 'https://www.topuniversities.com/university-rankings/world-university-rankings/2018'
QS_RANKING_JSON = 'https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508104120137'

TIMES_RANKING_URL = 'http://timeshighereducation.com/world-university-rankings/2018/world-ranking'
TIMES_RANKING_JSON = 'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json'

# General-purpose function definitions

In [None]:
def build_html_parser(url):
    '''
    Function to build a parser object of type BeautifulSoup
    
    url      the URL of the webpage to send a GET request to
    
    return   a parser of the given webpage
    '''
    
    r = requests.get(url)
    page_body = r.text
    
    soup = BeautifulSoup(page_body, 'html.parser')
    
    return soup

### Task 1
Obtain the 200 top-ranking universities in www.topuniversities.com (ranking 2018)

In [None]:
def parse_detail_page(url_detail):
    '''
    Function that parses the missing information from the detail page of the university on the QS website
    
    Return   a dictionary with all the data found as integer values (-1 when a value is missing)
    '''
    
    # Build a parser for the detail page
    soup = build_html_parser(url_detail)
    
    # Obtain and clean up the total faculty member value
    try:
        faculty_member_total = soup.find('div', class_='total faculty').find('div', class_='number').text
        faculty_member_total = faculty_member_total.strip().replace(',', '')
    except AttributeError:
        # find() returned None: the value is missing from the page
        faculty_member_total = -1
    
    # Obtain and clean up the international faculty member value
    try:
        faculty_member_inter = soup.find('div', class_='inter faculty').find('div', class_='number').text
        faculty_member_inter = faculty_member_inter.strip().replace(',', '')
    except AttributeError:
        faculty_member_inter = -1
    
    # Obtain and clean up the total students value
    try:
        student_total = soup.find('div', class_='total student').find('div', class_='number').text
        student_total = student_total.strip().replace(',', '')
    except AttributeError:
        student_total = -1
    
    # Obtain and clean up the international students value
    try:
        student_inter = soup.find('div', class_='total inter').find('div', class_='number').text
        student_inter = student_inter.strip().replace(',', '')
    except AttributeError:
        student_inter = -1
    
    # Build a dictionary for the parsed information
    detail_info = {'Total faculty member' : int(faculty_member_total), 
                   'International faculty member' : int(faculty_member_inter), 
                   'Total student' : int(student_total), 
                   'International student' : int(student_inter)
                  }
    
    return detail_info
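
The clean-up applied to each scraped value above (strip the surrounding newlines, drop the thousands separators, fall back to -1 when the value is missing) can be sketched as one small helper. The name `parse_count`, its `missing` parameter, and the sample strings are illustrative assumptions, not part of the original notebook.

```python
def parse_count(raw_text, missing=-1):
    '''
    Convert a scraped number such as '3,065' (possibly surrounded by newlines) into an integer.

    raw_text   the text of the 'number' div, or None when the div was not found
    missing    the value returned when no usable number is available

    return     the parsed integer, or `missing`
    '''
    if raw_text is None:
        return missing

    # Remove the surrounding whitespace and the thousands separators
    cleaned = raw_text.strip().replace(',', '')

    # Guard against empty or non-numeric text instead of letting int() raise
    if not cleaned.isdigit():
        return missing

    return int(cleaned)


# Hypothetical sample inputs
print(parse_count('\n3,065\n'))
print(parse_count(None))
```

Unlike the `try`/`except` blocks, this variant also covers the case where the div exists but holds empty or non-numeric text.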

After some work with the Postman Interceptor, we found that the GET request made to the QS website is accompanied by several additional files in the response. One of those files is a JSON document containing all the information in the ranking.

In [None]:
r = requests.get(QS_RANKING_JSON)
data = r.json()

The ranking entries are stored under the `data` key as a list of dictionaries, as visible in the example below:

In [None]:
print('First cell:')
print(data['data'][0], end='\n\n')

print('Second cell:')
print(data['data'][1], end='\n\n')

print('...')
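
Before building the dataframe, it can help to check how many records the payload holds and which fields they carry. The sketch below does this on a tiny stand-in payload. The function name `summarize_records` and all sample values are assumptions made for illustration; the field names are the ones this notebook reads from the QS file.

```python
import json


def summarize_records(payload, key='data'):
    '''
    Summarize a JSON payload shaped like {key: [record, record, ...]}.

    return   a tuple (number of records, sorted list of all field names)
    '''
    records = payload[key]

    # Collect every field name that appears in at least one record
    fields = set()
    for record in records:
        fields.update(record.keys())

    return len(records), sorted(fields)


# Stand-in payload with made-up values, mimicking the shape of the ranking file
sample = json.loads('''
{"data": [
    {"rank_display": "1", "title": "University A", "country": "Country A",
     "region": "Region A", "url": "/universities/a"},
    {"rank_display": "2", "title": "University B", "country": "Country B",
     "region": "Region B", "url": "/universities/b"}
]}
''')

count, fields = summarize_records(sample)
print(count, fields)
```

On the real response, `summarize_records(data)` would be called on the dictionary returned by `r.json()`.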

In [None]:
university_list = []

# Iterate through the first 200 elements of the list
for d in data['data'][:200]:
    
    # Store the parsed information into a dictionary
    info = {'Rank': d['rank_display'], 
            'University name': d['title'], 
            'Country': d['country'],
            'Region' : d['region']
           }
    
    # Extend the dictionary with the information from the detail page
    url_detail = 'https://www.topuniversities.com' + d['url']
    info.update(parse_detail_page(url_detail))
    
    university_list.append(info)
    
    
qs_ranking_df = pd.DataFrame(university_list)
qs_ranking_df.head()
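
The detail URL above is built by plain string concatenation, which relies on `d['url']` starting with a slash. As a hedged alternative, `urllib.parse.urljoin` from the standard library gives the same result with or without the leading slash. The names `QS_BASE_URL` and `build_detail_url` and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

QS_BASE_URL = 'https://www.topuniversities.com'


def build_detail_url(relative_url, base_url=QS_BASE_URL):
    '''
    Join the site root with the relative path of a university detail page
    '''
    return urljoin(base_url, relative_url)


# Made-up relative paths, with and without a leading slash
print(build_detail_url('/universities/example-university'))
print(build_detail_url('universities/example-university'))
```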

In [None]:
qs_ranking_df.set_index(['University name'], inplace=True)

We now calculate the two required ratios with the help of two auxiliary functions:

In [None]:
def compute_facutly_member_ratio(df):
    '''
    Compute the ratio of faculty members to students for each row of df
    '''
    li = list()
    for i, row in df.iterrows():
        li.append(row['Total faculty member'] / row['Total student'])
    return li

In [None]:
def compute_student_ratio(df):
    '''
    Compute the ratio of international students to total students for each row of df
    '''
    li = list()
    for i, row in df.iterrows():
        li.append(row['International student'] / row['Total student'])
    return li
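
Since `parse_detail_page` stores -1 for a value it could not find, a plain division can yield a misleading ratio for universities with missing data. The sketch below guards against that case by returning NaN. The name `safe_ratio` and its `missing` parameter are illustrative, not part of the original notebook.

```python
def safe_ratio(numerator, denominator, missing=-1):
    '''
    Divide numerator by denominator, returning NaN when the ratio is undefined.

    The ratio is undefined when either value equals the `missing` marker
    or when the denominator is zero.
    '''
    if numerator == missing or denominator == missing or denominator == 0:
        return float('nan')
    return numerator / denominator


# Made-up counts: one complete pair and one pair with a missing total
print(safe_ratio(500, 2000))
print(safe_ratio(500, -1))
```

Inside the two auxiliary functions, each `row[...] / row[...]` expression could be replaced by a call to `safe_ratio`; pandas treats the resulting NaN as a missing value.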

The computation results are stored in two new columns of the dataframe:

In [None]:
qs_ranking_df['Faculty/students ratio'] = compute_facutly_member_ratio(qs_ranking_df)
qs_ranking_df['Intern/student ratio'] = compute_student_ratio(qs_ranking_df)

qs_ranking_df.head()

Bar plot of the two computed ratios for each university (double-click the plot to zoom):

In [None]:
ax = qs_ranking_df[['Faculty/students ratio', 'Intern/student ratio']].plot.bar( figsize=(150, 10))

### Task 2 - Scrape the Times ranking

In [None]:
r = requests.get(TIMES_RANKING_JSON)
data = r.json()

In [None]:
university_list = []

# Iterate through the first 200 elements of the list
for d in data['data'][:200]:
    
    # Store the parsed information into a dictionary
    info = {'Rank': d['rank'], 
            'University name': d['name'], 
            'Country': d['location']
           }
    
    university_list.append(info)
    
    
times_ranking_df = pd.DataFrame(university_list)
times_ranking_df.head()
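
The QS and Times loops share one pattern: take the first N records and copy a few fields under new names. A minimal sketch of that pattern as a reusable function follows. The name `extract_fields`, its parameters, and the sample records are assumptions for illustration.

```python
def extract_fields(records, field_map, limit=200):
    '''
    Build a list of dictionaries from the first `limit` records.

    records     a list of dictionaries, such as data['data']
    field_map   a dictionary mapping output column names to source keys
    limit       the number of records to keep from the top of the list

    return      a list of dictionaries with the renamed fields
    '''
    rows = []
    for record in records[:limit]:
        # Copy only the requested fields, under their new column names
        rows.append({column: record[key] for column, key in field_map.items()})
    return rows


# Made-up records using the field names read in the Times ranking loop
sample_records = [
    {'rank': '1', 'name': 'University A', 'location': 'Country A'},
    {'rank': '2', 'name': 'University B', 'location': 'Country B'},
    {'rank': '3', 'name': 'University C', 'location': 'Country C'},
]
times_field_map = {'Rank': 'rank', 'University name': 'name', 'Country': 'location'}

rows = extract_fields(sample_records, times_field_map, limit=2)
print(rows)
```

The resulting list can be passed to `pd.DataFrame` in the same way as `university_list`.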