# 02 - Data from the Web

In this homework we will extract interesting information from [www.topuniversities.com](http://www.topuniversities.com) and [www.timeshighereducation.com](http://www.timeshighereducation.com), two platforms that maintain a global ranking of worldwide universities. This ranking is not offered as a downloadable dataset, so you will have to find a way to scrape the information we need! You are not allowed to download manually the entire ranking -- rather you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly.

## Imports

In [1]:
import requests
from bs4 import BeautifulSoup
import pickle
import pandas as pd
import re

#Still not using the following:
#import string
#import seaborn
#import matplotlib.pyplot as plt
#pd.options.mode.chained_assignment = None

%matplotlib inline

## Constants

In [2]:
QS_URL = 'https://www.topuniversities.com/university-rankings/world-university-rankings/2018'
QS_JSON = 'https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt'

TIMES_URL = 'http://timeshighereducation.com/world-university-rankings/2018/world-ranking'
TIMES_JSON = 'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json'

## Task 1 

Obtain the 200 top-ranking universities in www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some information is not available in the main list and you have to find them in the details page. Store the resulting dataset in a pandas DataFrame.

#### 0.1 Web scraping

In [3]:
qs_json_data = requests.get(QS_JSON).json()
qs_fields = ['title', 'rank_display', 'country', 'region', 'url']
qs_json_df = pd.DataFrame(qs_json_data['data']).head(2)[qs_fields]

In [4]:
qs_json_df

Unnamed: 0,title,rank_display,country,region,url
0,Massachusetts Institute of Technology (MIT),1,United States,North America,/universities/massachusetts-institute-technolo...
1,Stanford University,2,United States,North America,/universities/stanford-university


In [7]:
qs_detail_fields = ['total faculty', 'inter faculty', 'total student', 'total inter']

In [8]:
def get_details_qs(x):
    url = str.format('https://www.topuniversities.com{}', x)
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    
    details = {}
    
    for field in qs_detail_fields:
        value = -1
        try:
            value = soup.find('div', class_=field).find('div', class_='number').text
            value = int(re.sub('[^0-9]', '', value))
        except:
            pass
        details[field] = value
    return pd.Series(details)

In [9]:
qs_json_df[qs_detail_fields] = qs_json_df['url'].apply(lambda x: get_details_qs(x))

In [4]:
qs_json_df

Unnamed: 0,title,rank_display,country,region,url,total faculty,inter faculty,total student,total inter
0,Massachusetts Institute of Technology (MIT),1,United States,North America,/universities/massachusetts-institute-technolo...,1679,2982,3717,11067
1,Stanford University,2,United States,North America,/universities/stanford-university,2042,4285,3611,15878


Serializing:

In [10]:
pickle.dump(qs_json_df, open('qs_json_df.p', 'wb'))

#### 0.2 De-serializing

In [3]:
qs_json_df = pickle.load(open('qs_json_df.p', 'rb'))

Verify this:

In [6]:
qs_json_df[qs_json_df['title'] == 'NaN']

Unnamed: 0,title,rank_display,country,region,url,total faculty,inter faculty,total student,total inter


#### 1 Which are the best universities in terms of ratio between faculty members and students?

In [8]:
qs_json_df['Faculty-students ratio'] = qs_json_df['total faculty'] / qs_json_df['total student']
faculty_students_rank_df = qs_json_df.sort_values('Faculty-students ratio', ascending=False)
faculty_students_rank_df.head(20)

Unnamed: 0,title,rank_display,country,region,url,total faculty,inter faculty,total student,total inter,Faculty-students ratio
1,Stanford University,2,United States,North America,/universities/stanford-university,2042,4285,3611,15878,0.565494
0,Massachusetts Institute of Technology (MIT),1,United States,North America,/universities/massachusetts-institute-technolo...,1679,2982,3717,11067,0.451708


Which are the best universities in terms of ratio of international students?

In [9]:
qs_json_df['International-students ratio'] = qs_json_df['total inter'] / qs_json_df['total student']
international_students_rank_df = qs_json_df.sort_values('International-students ratio', ascending=False)
international_students_rank_df.head(20)

Unnamed: 0,title,rank_display,country,region,url,total faculty,inter faculty,total student,total inter,Faculty-students ratio,International-students ratio
1,Stanford University,2,United States,North America,/universities/stanford-university,2042,4285,3611,15878,0.565494,4.39712
0,Massachusetts Institute of Technology (MIT),1,United States,North America,/universities/massachusetts-institute-technolo...,1679,2982,3717,11067,0.451708,2.977401
