# 02 - Data from the Web

In this homework we will extract interesting information from www.topuniversities.com and www.timeshighereducation.com, two platforms that maintain global university rankings. These rankings are not offered as downloadable datasets, so you will have to find a way to scrape the information we need! You are not allowed to manually download the entire ranking; rather, you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this brief tutorial to quickly understand how to use it.

# Imports

In [None]:
# Import libraries
import string
import re
import pickle
import requests
from bs4 import BeautifulSoup
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None

%matplotlib inline

# Constants definition

In [None]:
QS_RANKING_URL = 'https://www.topuniversities.com/university-rankings/world-university-rankings/2018'
QS_RANKING_JSON = 'https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508104120137'

TIMES_RANKING_URL = 'http://timeshighereducation.com/world-university-rankings/2018/world-ranking'
TIMES_RANKING_JSON = 'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json'

In [None]:
SEARCH_REFERENCE_API = 'https://en.wikipedia.org/w/api.php?action=query&titles={0}&prop=revisions&rvprop=content&format=json&indexpageids'

In [None]:
try:
    COUNTRY_REGION_METADATA = pd.read_pickle('serial/country_region_metadata.p')
except (OSError, IOError) as e:
    COUNTRY_REGION_METADATA = pd.DataFrame(columns=['Region'])
    COUNTRY_REGION_METADATA.to_pickle('serial/country_region_metadata.p')

In [None]:
COUNTRY_REGION_METADATA

# General use functions definition

In [None]:
def build_html_parser(url):
    '''
    Function to build a parser object of type BeautifulSoup
    
    url      the webpage url to which send a get request to
    
    return   a parser of the given webpage
    '''
    
    r = requests.get(url)
    page_body = r.text
    
    soup = BeautifulSoup(page_body, 'html.parser')
    
    return soup

In [None]:
def clean_str_number(str_n):
    # Drop surrounding newlines and '%' signs, then remove thousands separators
    return str_n.strip('\n').strip('%').replace(',', '')
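
As a quick check of what this helper does, the sketch below applies it to a few strings shaped like the values found on the ranking pages. The sample inputs are made up for illustration and are not taken from the actual pages.

```python
def clean_str_number(str_n):
    # Same logic as the helper above, repeated so the example runs on its own
    return str_n.strip('\n').strip('%').replace(',', '')

# Hypothetical raw strings (assumed shapes, not real scraped values)
samples = ['\n3,065\n', '25%', '11,074']

cleaned = [clean_str_number(s) for s in samples]
print(cleaned)

# The cleaned strings can then be cast to numbers
numbers = [int(c) for c in cleaned]
print(numbers)
```

Note that only newlines and percent signs are stripped from the ends, so a value padded with spaces would still need an extra `strip()`.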

In [None]:
def search_standard_name(str_name):
    
    str_name = str_name.split("-")[0] # Manage name with - (short)
    str_name = str_name.split("–")[0] # Manage name with – (long)
    str_name = re.sub(r'\(.*?\)', '', str_name) # Remove text in brackets (raw string avoids invalid escape warnings)
    str_name = str_name.strip().replace('&', '%26')
    
    #r = requests.get(SEARCH_REFERENCE_API.format(str_name.strip().replace(' ', '%20')))
    r = requests.get(SEARCH_REFERENCE_API.format(str_name.strip().replace(' ', '_')))
    data = r.json()
    
    page_id = data['query']['pageids'][0]
    
    if (page_id == '-1'):
        print('Not found :( -> {}'.format(str_name))
        
        # Manually set a standard name for the only unmatchable university.
        # We have a total of 9 unknown sources during the Wikipedia API requests, but only one university appears twice
        # and it needs to receive a standard name to be merged later on. The other ones can keep their names.
        if (str_name == "Scuola Superiore Sant'Anna Pisa di Studi Universitari e di Perfezionamento"):
            found_name = "Scuola Superiore Sant’Anna"
        else:
            found_name = str_name
    else:
        found_name = data['query']['pages'][page_id]['title']
        
    return(found_name)
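
To make the name-matching step easier to follow, here is a minimal offline sketch of its two halves: the string clean-up applied before the query, and the lookup of the page title in the response. The helper names `normalise_query_name` and `title_from_response` are hypothetical, and the two response dictionaries are made up to mimic the shape the function reads (`query.pageids` and `query.pages`); no request is sent.

```python
import re

def normalise_query_name(str_name):
    # Hypothetical helper: only the string clean-up, without the HTTP request
    str_name = str_name.split("-")[0]            # drop anything after a short dash
    str_name = str_name.split("–")[0]            # drop anything after a long dash
    str_name = re.sub(r'\(.*?\)', '', str_name)  # drop text in brackets
    str_name = str_name.strip().replace('&', '%26')
    return str_name.replace(' ', '_')

def title_from_response(data, fallback):
    # Hypothetical helper: a page id of '-1' marks a title that was not found
    page_id = data['query']['pageids'][0]
    if page_id == '-1':
        return fallback
    return data['query']['pages'][page_id]['title']

# Made-up responses shaped like the ones the function reads
found = {'query': {'pageids': ['123'],
                   'pages': {'123': {'title': 'ETH Zurich'}}}}
missing = {'query': {'pageids': ['-1'],
                     'pages': {'-1': {'title': 'Unknown school'}}}}

q1 = normalise_query_name('ETH Zurich - Swiss Federal Institute of Technology')
q2 = normalise_query_name('Massachusetts Institute of Technology (MIT)')
q3 = normalise_query_name('Texas A&M University')
t1 = title_from_response(found, 'ETH Zurich')
t2 = title_from_response(missing, 'Unknown school')
print(q1, q2, q3, t1, t2)
```

One consequence of splitting on the short dash is that a hyphenated name keeps only the part before its first hyphen, which may be worth checking against the list of unmatched universities.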

In [None]:
def update_country_region_metadata(country, region):
    
    global COUNTRY_REGION_METADATA
    
    if (country in COUNTRY_REGION_METADATA.index):
        return
    
    # Add the new row in place (DataFrame.append was removed in pandas 2.0)
    COUNTRY_REGION_METADATA.loc[country, 'Region'] = region

### Task 1
Obtain the 200 top-ranking universities in www.topuniversities.com (ranking 2018)

In [None]:
# TODO: Handle the return value for data that is not found.
# At the moment -1 is returned, but this distorts the plots and the ratio computations

def parse_detail_page(url_detail):
    '''
    Function that parses the missing information from the detail page of the university on the QS website
    
    Return   a dictionary with all the data found as integer values (-1 when a value is not found)
    '''
    
    # Build a parser for the detail page
    soup = build_html_parser(url_detail)
    
    # Obtain and clean up the total faculty member value
    try:
        faculty_member_total = soup.find('div', class_='total faculty').find('div', class_='number').text
        faculty_member_total = clean_str_number(faculty_member_total)
    except:
        faculty_member_total = -1
    
    
    # Obtain and clean up the international faculty member value
    try:
        faculty_member_inter = soup.find('div', class_='inter faculty').find('div', class_='number').text.strip('\n')
        faculty_member_inter = clean_str_number(faculty_member_inter)
    except:
        faculty_member_inter = -1
    
    # Obtain and clean up the total students value
    try:
        student_total = soup.find('div', class_='total student').find('div', class_='number').text.strip('\n')
        student_total = clean_str_number(student_total)
    except:
        student_total = -1
    
    # Obtain and clean up the international students value
    try:
        student_inter = soup.find('div', class_='total inter').find('div', class_='number').text.strip('\n')
        student_inter = clean_str_number(student_inter)
    except:
        student_inter = -1
    
    # Build a dictionary for the parsed information
    detail_info = {'Total faculty member' : int(faculty_member_total), 
                   'International faculty member' : int(faculty_member_inter), 
                   'Total student' : int(student_total), 
                   'International student' : int(student_inter)
                  }
    
    return detail_info
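
The TODO above notes that the `-1` placeholder distorts the ratios. One possible alternative, sketched here under the assumption that a missing value can be represented as NaN, is to let the missing value propagate instead of producing a plausible-looking number. The helpers `to_number` and `ratio` are hypothetical and the figures are made up.

```python
import math

def to_number(raw):
    # Hypothetical helper: cast a cleaned string to int, or NaN when the value is missing
    if raw is None:
        return math.nan
    return int(raw.strip('\n').strip('%').replace(',', ''))

def ratio(part, total):
    # Plain division; NaN in either operand yields NaN
    return part / total

total_students = to_number('20,000')

# With the -1 sentinel the ratio looks like a real (tiny, negative) value
with_sentinel = ratio(-1, total_students)

# With NaN the missing value stays visibly missing
with_nan = ratio(to_number(None), total_students)

print(with_sentinel, with_nan)
```

The trade-off is that a column containing NaN can no longer be a plain integer column, so the `int(...)` casts in the dictionary would have to be relaxed.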

After some work with the Postman Interceptor, we found out that the page served by the QS website loads several additional files alongside the main response. One of those files is a JSON document with all the information from the ranking.

In [None]:
req = requests.get(QS_RANKING_JSON)
data_from_url = req.json()

Such data is stored as a list of dictionaries, as visible in the example below:

In [None]:
print('First cell:')
print(data_from_url['data'][0], end='\n\n')

print('Second cell:')
print(data_from_url['data'][1], end='\n\n')

print('...')
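
For readers who cannot reach the endpoint, the structure can be sketched offline. The payload below is made up: it only reuses the keys that this notebook reads (`rank_display`, `title`, `country`, `region`, `url`), and all values are placeholders.

```python
import json

# Made-up payload with the same keys the notebook reads; values are placeholders
payload = '''
{"data": [
  {"rank_display": "1", "title": "University A", "country": "Country X",
   "region": "Region P", "url": "/universities/university-a"},
  {"rank_display": "2", "title": "University B", "country": "Country Y",
   "region": "Region Q", "url": "/universities/university-b"}
]}
'''

data = json.loads(payload)

# Each entry of data['data'] is a dictionary describing one university
rows = [(d['rank_display'], d['title'], d['country'], d['region'])
        for d in data['data'][:200]]

# The relative url is turned into the address of the detail page
detail_urls = ['https://www.topuniversities.com' + d['url'] for d in data['data']]

print(rows)
print(detail_urls)
```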

In [None]:
def scrape_qs_ranking():
    '''
    Obtain the ranking from QS in a dataframe
    '''
    
    r = requests.get(QS_RANKING_JSON)
    data = r.json()
    
    university_list = []

    # Iterate through the first 200 elements of the list
    for d in data['data'][:200]:
    
        # Store the parsed information into a dictionary
        info = {'Rank': d['rank_display'], 
                'University name': search_standard_name(d['title']), 
                'Country': d['country'],
                'Region' : d['region']
               }
    
        update_country_region_metadata(d['country'], d['region'])
    
        # Extend the dictionary with the information in the detail page
        url_detail = 'https://www.topuniversities.com' +  d['url']
        info.update( parse_detail_page( url_detail))
    
        university_list.append(info)
    
    # After scraping the QS ranking, store the updated metadata dataframe
    COUNTRY_REGION_METADATA.to_pickle('serial/country_region_metadata.p')
    
    qs_ranking_df = pd.DataFrame.from_dict(university_list)
    return qs_ranking_df

In [None]:
try:
    qs_ranking_df = pd.read_pickle('serial/qs_save.p')
except (OSError, IOError) as e:
    qs_ranking_df = scrape_qs_ranking()
    qs_ranking_df.to_pickle('serial/qs_save.p')
    
qs_ranking_df.head()

In [None]:
qs_ranking_df.set_index(['University name'], inplace=True)

We now calculate the two required ratios with the help of two auxiliary functions:

In [None]:
def compute_facutly_member_ratio(df):
    '''
    Compute the faculty members / total students ratio for each row of the dataframe
    '''
    li = list()
    for i, row in df.iterrows():
        li.append(row['Total faculty member'] / row['Total student'])
    return li

In [None]:
def compute_student_ratio(df):
    li = list()
    for i, row in df.iterrows():
        li.append(row['International student'] / row['Total student'])
    return li

The computation results are stored in two new columns of the dataframe:

In [None]:
qs_ranking_df['Faculty/students ratio'] = compute_facutly_member_ratio( qs_ranking_df )
qs_ranking_df['Intern/student ratio'] = compute_student_ratio( qs_ranking_df )

qs_ranking_df.head()
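
The two helper functions iterate row by row. As an alternative sketch, the same ratios can be obtained with vectorized column division in pandas. The toy dataframe below uses the notebook's column names with made-up values.

```python
import pandas as pd

# Toy data with hypothetical values, using the notebook's column names
df = pd.DataFrame({'Total faculty member': [100, 50],
                   'International student': [300, 40],
                   'Total student': [1000, 400]},
                  index=['University A', 'University B'])

# Element-wise division of whole columns, aligned on the index
df['Faculty/students ratio'] = df['Total faculty member'] / df['Total student']
df['Intern/student ratio'] = df['International student'] / df['Total student']

print(df[['Faculty/students ratio', 'Intern/student ratio']])
```

Both approaches should give the same values; the vectorized form is shorter and avoids building intermediate lists.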

Plot of the computed ratios (double-click on the plot to zoom):

In [None]:
qs_ranking_df[['Faculty/students ratio', 'Intern/student ratio']].plot(kind='barh', figsize=(10,100))

Plot the results aggregating by region:

In [None]:
for i, (title, group) in enumerate(qs_ranking_df.groupby('Region')):
    ax = group[['Faculty/students ratio', 'Intern/student ratio']].plot.bar(figsize=(17, 7), 
                                                                            width= 0.5 if (len(group) > 2) else 0.1)
    plt.title(title)
    plt.xticks(rotation = 90 if (len(group) > 5) else 0)
    plt.xlabel("")
    plt.show()

Plot the results aggregating by country:

In [None]:
for i, (title, group) in enumerate(qs_ranking_df.groupby('Country')):
    ax = group[['Faculty/students ratio', 'Intern/student ratio']].plot.bar(figsize=(17, 7), 
                                                                            width= 0.5 if (len(group) > 5) else 0.1)
    plt.title(title)
    plt.xticks(rotation = 90 if (len(group) > 5) else 0)
    plt.xlabel("")
    plt.show()

### Task 2 - Scrape the Times ranking
Obtain the 200 top-ranking universities in www.timeshighereducation.com (ranking 2018). Repeat the analysis of the previous point and discuss briefly what you observed.

In [None]:
def compute_value_from_percentage(total, percentage):
    
    total = int( total )
    percentage = float( percentage )
    
    return round( (total/100) * percentage )

In [None]:
def compute_value_from_proportion(total, proportion):
    
    total = int( total )
    proportion = float( proportion )
    
    return round( total / proportion )
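
A short worked example may help to see what the two helpers compute: the first turns a percentage of the student body into a head count, the second divides the student count by a students-per-staff ratio to estimate the staff size. The figures below are hypothetical.

```python
def compute_value_from_percentage(total, percentage):
    # total * percentage / 100, rounded to the nearest integer
    return round((int(total) / 100) * float(percentage))

def compute_value_from_proportion(total, proportion):
    # total divided by "students per staff member", rounded to the nearest integer
    return round(int(total) / float(proportion))

# Hypothetical figures, passed as strings like the cleaned JSON fields
total_students = '20000'

intl_students = compute_value_from_percentage(total_students, '25')   # 25% of 20000
staff_members = compute_value_from_proportion(total_students, '8.0')  # 8 students per staff member

print(intl_students, staff_members)
```

Because both results are rounded, the derived counts are estimates and may differ slightly from the figures published by the universities.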

In [None]:
def scrape_times_ranking():
    '''
    Obtain the ranking from Times in a dataframe
    '''

    r = requests.get(TIMES_RANKING_JSON)
    data = r.json()

    university_list = []

    # Iterate through the first 200 elements of the list
    for d in data['data'][:200]:
    
        # Preliminary computations to extract data
        intern_student = compute_value_from_percentage( clean_str_number( d['stats_number_students']), 
                                                       clean_str_number( d['stats_pc_intl_students'])
                                                      )
    
        faculty_member_total = compute_value_from_proportion( clean_str_number( d['stats_number_students']), 
                                                             clean_str_number( d['stats_student_staff_ratio'])
                                                            )
        
        # Determine region from the data of the QS ranking stored in the metadata
        try:
            region = COUNTRY_REGION_METADATA.loc[d['location'], 'Region']
        except KeyError:
            region = 'NaN'
    
        # Store the parsed information into a dictionary
        info = {'Rank': d['rank'], 
                'University name': search_standard_name(d['name']), 
                'Country': d['location'],
                'Region' : region,
                'Total student' : int(clean_str_number( d['stats_number_students'])),
                'International student' : int(intern_student),
                'Total faculty member' : int(faculty_member_total)
               }
    
        university_list.append(info)
   
    times_ranking_df = pd.DataFrame.from_dict(university_list)
    return times_ranking_df

In [None]:
try:
    times_ranking_df = pd.read_pickle('serial/times_save.p')
except (OSError, IOError) as e:
    times_ranking_df = scrape_times_ranking()
    times_ranking_df.to_pickle('serial/times_save.p')
    
times_ranking_df.head()

Missing data:

In [None]:
times_ranking_df.set_index(['University name'], inplace=True)

In [None]:
#times_ranking_df['Faculty/students ratio'] = compute_facutly_member_ratio( qs_ranking_df )
times_ranking_df['Intern/student ratio'] = compute_student_ratio( times_ranking_df )

times_ranking_df.head()

In [None]:
times_ranking_df.sort_values('Intern/student ratio')[['Intern/student ratio']].plot( kind='barh', figsize=(10,100))

Plot the results aggregating by region:

In [None]:
for i, (title, group) in enumerate(times_ranking_df.groupby('Region')):
    ax = group[['Intern/student ratio']].plot.bar(figsize=(17, 7), 
                                                  width= 0.5 if (len(group) > 2) else 0.1)
    plt.title(title)
    plt.xticks(rotation = 90 if (len(group) > 5) else 0)
    plt.xlabel("")
    plt.show()

Plot the results aggregating by country:

In [None]:
for i, (title, group) in enumerate(times_ranking_df.groupby('Country')):
    ax = group[['Intern/student ratio']].plot.bar(figsize=(17, 7), 
                                                  width= 0.5 if (len(group) > 5) else 0.1)
    plt.title(title)
    plt.xticks(rotation = 90 if (len(group) > 5) else 0)
    plt.xlabel("")
    plt.show()

### Task 3 - Merge the dataframes
Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

In [None]:
# Move the index back to a regular column, so that 'University name' is not both an index level and a column
qs_ranking_df = qs_ranking_df.reset_index()
times_ranking_df = times_ranking_df.reset_index()

In [None]:
qs_ranking_df['University name'] = qs_ranking_df['University name'].str.strip()
qs_ranking_df['Country'] = qs_ranking_df['Country'].str.strip()

times_ranking_df['University name'] = times_ranking_df['University name'].str.strip()
times_ranking_df['Country'] = times_ranking_df['Country'].str.strip()

Merging the two dataframes into one:

In [None]:
merged_ranking_df = pd.merge(qs_ranking_df, times_ranking_df, 
                             on=['University name', 'Country', 'Region'], 
                             how='outer', 
                             suffixes=('_QS', '_TM')
                            )

merged_ranking_df.set_index(['University name'], inplace=True)

merged_ranking_df
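
To illustrate how the outer merge behaves, here is a minimal sketch on two toy dataframes with made-up rows. It uses the same keys and suffixes as the merge in this notebook and adds pandas' `indicator=True` option, which is not used in the notebook, to show which rows were matched in both rankings.

```python
import pandas as pd

# Toy rankings with hypothetical rows
qs = pd.DataFrame({'University name': ['University A', 'University B'],
                   'Country': ['Country X', 'Country Y'],
                   'Region': ['Region P', 'Region Q'],
                   'Rank': ['1', '2']})
tm = pd.DataFrame({'University name': ['University A', 'University C'],
                   'Country': ['Country X', 'Country Z'],
                   'Region': ['Region P', 'NaN'],
                   'Rank': ['3', '1']})

# Outer merge keeps unmatched rows from both sides;
# the non-key column 'Rank' is kept twice, once per ranking
merged = pd.merge(qs, tm,
                  on=['University name', 'Country', 'Region'],
                  how='outer',
                  suffixes=('_QS', '_TM'),
                  indicator=True)

# Rows found in both rankings
both = merged[merged['_merge'] == 'both']

print(merged)
```

Since `Country` and `Region` are part of the key, a university whose name matches but whose country label differs between the two sites, or whose region fell back to `'NaN'`, would end up as two separate rows. Counting the values of `_merge` is one way to check how often that happens.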