# 02 - Data from the Web

## Deadline
Wednesday October 25, 2017 at 11:59PM

## Important Notes
* Make sure you push your Notebook to GitHub with all the cells already evaluated (i.e., you don't want your colleagues to generate unnecessary Web traffic during the peer review)
* Don't forget to add a textual description of your thought process, the assumptions you made, and the solution you plan to implement!
* Please write all your comments in English, and use meaningful variable names in your code.

## Background
In this homework we will extract interesting information from www.topuniversities.com and www.timeshighereducation.com, two platforms that maintain a global ranking of worldwide universities. This ranking is not offered as a downloadable dataset, so you will have to find a way to scrape the information we need!
You are not allowed to download the entire ranking manually; rather, you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this [brief tutorial](https://www.youtube.com/watch?v=jBjXVrS8nXs&list=PLM-7VG-sgbtD8qBnGeQM5nvlpqB_ktaLZ&autoplay=1) to quickly understand how to use it.

## Assignment
1. Obtain the 200 top-ranking universities in www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some information is not available in the main list and you have to find them in the [details page](https://www.topuniversities.com/universities/ecole-polytechnique-fédérale-de-lausanne-epfl).
Store the resulting dataset in a pandas DataFrame and answer the following questions:
- Which are the best universities in terms of: (a) ratio between faculty members and students, (b) ratio of international students?
- Answer the previous question aggregating the data by (c) country and (d) region.

Plot your data using bar charts and describe briefly what you observed.

2. Obtain the 200 top-ranking universities in www.timeshighereducation.com ([ranking 2018](http://timeshighereducation.com/world-university-rankings/2018/world-ranking)). Repeat the analysis of the previous point and discuss briefly what you observed.

3. Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

4. Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?

5. Can you find the best university taking into consideration both rankings? Explain your approach.

Hints:
- Keep your Notebook clean and don't print the verbose output of the requests if this does not add useful information for the reader.
- In case of a tie, use the order defined in the webpage.

In [2]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import seaborn
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
%matplotlib inline

In [3]:
# Make a request
url_main = 'https://www.topuniversities.com'  # Found with postman
r = requests.get(url_main + '/sites/default/files/qs-rankings-data/357051.txt')
print('Response status code: {0}\n'.format(r.status_code))
page_body = r.text

# Parse the JSON body into Python objects with the json library
rank_json = json.loads(page_body)

Response status code: 200



In [4]:
rank_df = pd.DataFrame(rank_json['data']).head(200)
rank_df.drop(['logo', 'stars', 'nid','cc', 'score'], axis=1, inplace=True)
rank_df.set_index('core_id', inplace=True)
rank_df = rank_df[['title', 'rank_display', 'country', 'region', 'url']]
rank_df.head()

core_id,title,rank_display,country,region,url
410,Massachusetts Institute of Technology (MIT),1,United States,North America,/universities/massachusetts-institute-technolo...
573,Stanford University,2,United States,North America,/universities/stanford-university
253,Harvard University,3,United States,North America,/universities/harvard-university
94,California Institute of Technology (Caltech),4,United States,North America,/universities/california-institute-technology-...
95,University of Cambridge,5,United Kingdom,Europe,/universities/university-cambridge


In [5]:
def my_find(html_attributes, new_df_column_name, rank_df):
    _tag = html_attributes['tag']
    _class = html_attributes['class']
    # _list is a temporary list that will store the values found and then be converted in the df to be returned.
    _list = []
    for url in rank_df.url:
        # for every url contained in rank_df['url'], perform the corresponding request:
        uni_url = requests.get(url_main + url)
        uni_body = uni_url.text
        soup = BeautifulSoup(uni_body, 'html.parser')
        # look for <tag=_tag, class=_class>
        soup1 = soup.find(_tag, class_=_class)
        # if the tag was found, look inside it for <_tag class='number'>, which holds the value
        # we are interested in; otherwise append -99
        if soup1:
            soup2 = soup1.find(_tag, class_='number') 
            # if such tag has been found, append its value to the _list, otherwise append -99
            if soup2:
                _list.append({new_df_column_name: soup2.text})
            else:
                _list.append({new_df_column_name: -99})
        else:
            _list.append({new_df_column_name: -99})
    # convert _list to dataframe and return it
    return pd.DataFrame.from_dict(_list).replace({r'\n': ''}, regex=True).replace({r',': ''}, regex=True).apply(pd.to_numeric).astype(int)
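
The last line of `my_find` strips newlines and thousands separators before casting to `int`. The same cleaning can be sketched for a single scraped string. Note that `parse_count` and its choice of returning `None` instead of -99 are hypothetical and only for illustration; they are not part of the notebook.

```python
from typing import Optional


def parse_count(raw: Optional[str]) -> Optional[int]:
    """Turn a scraped string such as '2,982' (possibly wrapped in newlines) into an int.

    Returns None when nothing usable was scraped (an assumption made here;
    the notebook uses the sentinel -99 instead).
    """
    if raw is None:
        return None
    # Drop surrounding whitespace/newlines and the thousands separators.
    cleaned = raw.strip().replace(',', '')
    if not cleaned.isdigit():
        return None
    return int(cleaned)


print(parse_count('\n2,982\n'))
print(parse_count(''))
```

Returning `None` keeps "not found" distinguishable from a real count, at the price of an explicit check wherever the value is used.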

In [6]:
# defining HTML tag and class attributes that we want to find
tofind = [{'tag':'div', 'class': 'total faculty'}, 
          {'tag':'div', 'class': 'inter faculty'}, 
          {'tag':'div', 'class': 'total student'}, 
          {'tag':'div', 'class': 'total inter'}]

# creating a DataFrame with the data found (missing values are encoded as -99)
details_df = pd.concat([my_find(tofind[0], 'fac_memb_tot', rank_df),
                        my_find(tofind[1], 'fac_memb_int', rank_df),
                        my_find(tofind[2], 'nb_stud_tot', rank_df),
                        my_find(tofind[3], 'nb_stud_int', rank_df)], axis=1)

# concatenate the DataFrames into a unique one
details_df.set_index(rank_df.index, inplace=True)
QS_df = pd.concat([rank_df, details_df], axis=1)

# cleaning the unique DataFrame (deleting the = in rank_display)
QS_df.drop(['url'], axis=1, inplace=True)
QS_df.rank_display = QS_df.rank_display.replace({r'=': ''}, regex=True).apply(pd.to_numeric).astype(int)

# Creating the faculty/student ratio and the international-student ratio
QS_df['fac_memb_ratio'] = QS_df.fac_memb_tot / QS_df.nb_stud_tot
QS_df['int_stud_ratio'] = QS_df.nb_stud_int / QS_df.nb_stud_tot

# Deleting what's useless
del details_df, tofind, rank_df, r, page_body, rank_json
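
Rows for which a value was not found carry -99, so the ratios computed in this cell are meaningless for them. A minimal sketch of one way to guard against that, assuming a toy frame (`toy_df`, hypothetical) in place of `QS_df`, is to turn the sentinel into `NaN` before dividing:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for QS_df; the second row mimics a failed scrape.
toy_df = pd.DataFrame({
    'fac_memb_tot': [2982, -99],
    'nb_stud_tot': [11067, 15878],
})

# Replace the sentinel with NaN so that it cannot leak into the ratio.
clean_df = toy_df.replace(-99, np.nan)
clean_df['fac_memb_ratio'] = clean_df.fac_memb_tot / clean_df.nb_stud_tot

print(clean_df)
```

With `NaN` in place, pandas skips the affected rows in `mean`, `sum`, and sorting instead of silently mixing in negative ratios.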

In [7]:
# resetting the index, dropping the 'core_id' column which is useless, and renaming the columns
# in order to prepare the merge with the corresponding TimesHigherEducation dataframe
QS_df.reset_index(inplace = True)
QS_df.drop('core_id', axis=1, inplace = True)
QS_df.columns = ['QS_name', 'QS_rank', 'country', 'region', 'QS_fac_memb_tot', 'QS_fac_memb_int',
                'QS_nb_stud_tot', 'QS_nb_stud_int', 'QS_fac_memb_ratio', 'QS_int_stud_ratio']
QS_df.head()

Unnamed: 0,QS_name,QS_rank,country,region,QS_fac_memb_tot,QS_fac_memb_int,QS_nb_stud_tot,QS_nb_stud_int,QS_fac_memb_ratio,QS_int_stud_ratio
0,Massachusetts Institute of Technology (MIT),1,United States,North America,2982,1679,11067,3717,0.26945,0.335863
1,Stanford University,2,United States,North America,4285,2042,15878,3611,0.26987,0.227422
2,Harvard University,3,United States,North America,4350,1311,22429,5266,0.193945,0.234785
3,California Institute of Technology (Caltech),4,United States,North America,953,350,2255,647,0.422616,0.286918
4,University of Cambridge,5,United Kingdom,Europe,5490,2278,18770,6699,0.292488,0.356899
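
Questions (a) to (d) of the assignment can be approached with a sort and a `groupby`. The sketch below reuses the five rows printed above (names shortened); summing the counts per country before dividing is one possible design choice, not necessarily the one the authors intended, and `sample_df`, `best_by_ratio` and `country_df` are hypothetical names.

```python
import pandas as pd

# First five rows of QS_df as printed above (names shortened).
sample_df = pd.DataFrame({
    'QS_name': ['MIT', 'Stanford', 'Harvard', 'Caltech', 'Cambridge'],
    'country': ['United States'] * 4 + ['United Kingdom'],
    'QS_fac_memb_tot': [2982, 4285, 4350, 953, 5490],
    'QS_nb_stud_tot': [11067, 15878, 22429, 2255, 18770],
    'QS_nb_stud_int': [3717, 3611, 5266, 647, 6699],
})

# (a) per-university ranking by faculty/student ratio
sample_df['fac_ratio'] = sample_df.QS_fac_memb_tot / sample_df.QS_nb_stud_tot
best_by_ratio = sample_df.sort_values('fac_ratio', ascending=False)

# (c) per-country ratios: sum the counts first, then divide
count_cols = ['QS_fac_memb_tot', 'QS_nb_stud_tot', 'QS_nb_stud_int']
totals = sample_df.groupby('country')[count_cols].sum()
country_df = pd.DataFrame({
    'fac_ratio': totals.QS_fac_memb_tot / totals.QS_nb_stud_tot,
    'int_stud_ratio': totals.QS_nb_stud_int / totals.QS_nb_stud_tot,
})

print(best_by_ratio[['QS_name', 'fac_ratio']])
print(country_df)
```

Summing first weights each university by its size; averaging the per-university ratios would instead give every university the same weight. Either is defensible as long as the choice is stated.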


# TIMES HIGHER EDUCATION

In [9]:
# Making the request and parsing the JSON response to obtain the dataframe THE_df
URL = 'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/'\
                +'world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json'
r = requests.get(URL)
print('r = {r} // status_code = {status}'.format(r=r,status=r.status_code))
rank_json = json.loads(r.text)
THE_df = pd.DataFrame(rank_json['data']).head(200)

# select columns of our interest and changing their name
THE_df = THE_df[['name', 'rank', 'stats_number_students', 'stats_student_staff_ratio', 'stats_pc_intl_students']]
THE_df.columns=['THE_name', 'THE_rank','THE_nb_stud_tot', 'THE_stats_student_staff_ratio','THE_int_stud_ratio']
THE_df.head()

r = <Response [200]> // status_code = 200


Unnamed: 0,THE_name,THE_rank,THE_nb_stud_tot,THE_stats_student_staff_ratio,THE_int_stud_ratio
0,University of Oxford,1,20409,11.2,38%
1,University of Cambridge,2,18389,10.9,35%
2,California Institute of Technology,=3,2209,6.5,27%
3,Stanford University,=3,15845,7.5,22%
4,Massachusetts Institute of Technology,5,11177,8.7,34%


### Cleaning the THE dataset: types, missing values, and derived columns

In [10]:
# THE_int_stud_ratio
THE_df['THE_int_stud_ratio'] = THE_df['THE_int_stud_ratio'].str.replace('%', '').astype('double')/100

# THE_nb_stud_tot
THE_df['THE_nb_stud_tot'] = THE_df['THE_nb_stud_tot'].str.replace(',', '').astype('int')

# THE_fac_memb_int is missing from the THE data; it is added with -99 to state that clearly
THE_df['THE_fac_memb_int'] = -99

# THE_fac_memb_tot
THE_df['THE_stats_student_staff_ratio']=THE_df['THE_stats_student_staff_ratio'].astype(float)
THE_df['THE_fac_memb_tot'] = THE_df['THE_nb_stud_tot']/THE_df['THE_stats_student_staff_ratio']
THE_df['THE_fac_memb_tot'] = THE_df['THE_fac_memb_tot'].astype(int)

# THE_nb_stud_int
THE_df['THE_nb_stud_int'] = THE_df['THE_nb_stud_tot']*THE_df['THE_int_stud_ratio']
THE_df['THE_nb_stud_int'] = THE_df['THE_nb_stud_int'].astype(int)

# THE_fac_memb_ratio
THE_df['THE_fac_memb_ratio'] = 1/THE_df['THE_stats_student_staff_ratio']
THE_df=THE_df.drop('THE_stats_student_staff_ratio', axis=1)

# THE_rank
THE_df['THE_rank'] = THE_df['THE_rank'].astype(str)
THE_df['THE_rank'] = THE_df['THE_rank'].replace({r'=': ''}, regex=True).apply(pd.to_numeric).astype(int)

# let's now have a look at the clean THE_df
THE_df.head()

Unnamed: 0,THE_name,THE_rank,THE_nb_stud_tot,THE_int_stud_ratio,THE_fac_memb_int,THE_fac_memb_tot,THE_nb_stud_int,THE_fac_memb_ratio
0,University of Oxford,1,20409,0.38,-99,1822,7755,0.089286
1,University of Cambridge,2,18389,0.35,-99,1687,6436,0.091743
2,California Institute of Technology,3,2209,0.27,-99,339,596,0.153846
3,Stanford University,3,15845,0.22,-99,2112,3485,0.133333
4,Massachusetts Institute of Technology,5,11177,0.34,-99,1284,3800,0.114943
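
The derived columns can be checked by hand on the first row (University of Oxford). The sketch below redoes the arithmetic in plain Python; the helper names are hypothetical, and the raw student count is assumed to carry a thousands separator, as the `.str.replace(',', '')` call in the cell suggests.

```python
def clean_rank(rank: str) -> int:
    """'=3' (a tie) and '3' both become 3."""
    return int(rank.lstrip('='))


def clean_percent(pct: str) -> float:
    """'38%' becomes 0.38."""
    return float(pct.rstrip('%')) / 100


def clean_count(count: str) -> int:
    """'20,409' becomes 20409."""
    return int(count.replace(',', ''))


students = clean_count('20,409')   # assumed raw format
staff_ratio = 11.2                 # students per staff member

# Faculty is not given directly: it is students / (students per staff),
# truncated to an integer as in the notebook.
faculty = int(students / staff_ratio)
intl_students = int(students * clean_percent('38%'))

print(clean_rank('=3'), faculty, intl_students)
```

Both derived counts are truncated, so they are estimates rather than published figures, which matters when comparing them with the QS counts.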


# Let's merge!

In [12]:
# creating two lists with the names of the universities from the two datasets
THE_name = list(THE_df.THE_name)
QS_name = list(QS_df.QS_name)

# initializing a new column of THE_df that will hold the QS name found by the matching function,
# so that the matches can be checked by eye
THE_df['THE_corresponding QS name']='unknown'

# MATCHING FUNCTION
# finding the probable corresponding name in the QS dataframe for each university
for i,THE_uni in enumerate(THE_name):
    uni, prob=process.extractOne(THE_uni, QS_name, scorer=fuzz.token_sort_ratio)
    if prob>87: # at lower scores the algorithm was observed to match different universities, so 87 is used as the limit
        THE_df.at[i, 'THE_corresponding QS name'] = uni  # .at replaces the deprecated set_value
        
# MERGING        
Unique_df=pd.merge(THE_df,QS_df, left_on='THE_corresponding QS name', right_on='QS_name', how = 'right')

# Sanity check on the shapes: the merged frame has more rows than either input (to be investigated by Luca)
print(QS_df.shape)
print(THE_df.shape)
print(Unique_df.shape)

(200, 10)
(200, 9)
(204, 19)
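
Part of the matching difficulty comes from acronyms in parentheses and differing punctuation. The sketch below normalises names before comparing them. It deliberately uses the standard library's `difflib` as a stand-in for `fuzzywuzzy`, so its scores (0 to 1) are not directly comparable with `token_sort_ratio` (0 to 100); `normalize`, `best_match` and the 0.87 threshold are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop parenthesised acronyms and punctuation, sort the tokens."""
    name = re.sub(r'\([^)]*\)', ' ', name.lower())
    tokens = re.findall(r'\w+', name)
    return ' '.join(sorted(tokens))


def best_match(name, candidates, threshold=0.87):
    """Return the candidate most similar to name, or None below the threshold."""
    target = normalize(name)
    scored = [(SequenceMatcher(None, target, normalize(c)).ratio(), c)
              for c in candidates]
    score, candidate = max(scored)
    return candidate if score >= threshold else None


qs_names = ['Massachusetts Institute of Technology (MIT)',
            'California Institute of Technology (Caltech)',
            'Stanford University']

print(best_match('Massachusetts Institute of Technology', qs_names))
print(best_match('University of Oxford', qs_names))
```

Returning `None` for weak matches leaves those universities to be paired by hand rather than forcing a wrong pairing.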


# Unique_df most likely has 204 rows because the right join keeps all 200 QS rows and repeats a QS row whenever several THE names were matched to it
# still 60 rows have to be merged by hand!!!
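
A right join keeps every row of the right-hand frame and repeats it once per matching left-hand row, which is one way a merge of two 200-row frames can return more than 200 rows. The toy sketch below (all names and data hypothetical) reproduces the effect and uses the `indicator` option of `pd.merge` to show where each row came from.

```python
import pandas as pd

# Two THE names were matched to the same QS name; one was left 'unknown'.
the_toy = pd.DataFrame({
    'THE_name': ['Uni A', 'Uni A (campus)', 'Uni C'],
    'matched_QS_name': ['Uni A', 'Uni A', 'unknown'],
})
qs_toy = pd.DataFrame({'QS_name': ['Uni A', 'Uni B']})

merged = pd.merge(the_toy, qs_toy, left_on='matched_QS_name',
                  right_on='QS_name', how='right', indicator=True)
print(merged)

# QS names that received more than one THE match (the source of the extra rows)
dup_names = merged.loc[merged.QS_name.duplicated(keep=False), 'QS_name'].unique()
print(dup_names)
```

Here `qs_toy` has 2 rows but `merged` has 3: 'Uni A' appears twice, 'Uni B' is kept with missing THE values, and the unmatched 'Uni C' is dropped. Running the same `duplicated` check on `Unique_df.QS_name` would list the QS names responsible for the surplus rows.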

In [13]:
Unique_df

Unnamed: 0,THE_name,THE_rank,THE_nb_stud_tot,THE_int_stud_ratio,THE_fac_memb_int,THE_fac_memb_tot,THE_nb_stud_int,THE_fac_memb_ratio,THE_corresponding QS name,QS_name,QS_rank,country,region,QS_fac_memb_tot,QS_fac_memb_int,QS_nb_stud_tot,QS_nb_stud_int,QS_fac_memb_ratio,QS_int_stud_ratio
0,University of Oxford,1.0,20409.0,0.38,-99.0,1822.0,7755.0,0.089286,University of Oxford,University of Oxford,6,United Kingdom,Europe,6750,2964,19720,7353,0.342292,0.372870
1,University of Cambridge,2.0,18389.0,0.35,-99.0,1687.0,6436.0,0.091743,University of Cambridge,University of Cambridge,5,United Kingdom,Europe,5490,2278,18770,6699,0.292488,0.356899
2,California Institute of Technology,3.0,2209.0,0.27,-99.0,339.0,596.0,0.153846,California Institute of Technology (Caltech),California Institute of Technology (Caltech),4,United States,North America,953,350,2255,647,0.422616,0.286918
3,Stanford University,3.0,15845.0,0.22,-99.0,2112.0,3485.0,0.133333,Stanford University,Stanford University,2,United States,North America,4285,2042,15878,3611,0.269870,0.227422
4,Massachusetts Institute of Technology,5.0,11177.0,0.34,-99.0,1284.0,3800.0,0.114943,Massachusetts Institute of Technology (MIT),Massachusetts Institute of Technology (MIT),1,United States,North America,2982,1679,11067,3717,0.269450,0.335863
5,Harvard University,6.0,20326.0,0.26,-99.0,2283.0,5284.0,0.112360,Harvard University,Harvard University,3,United States,North America,4350,1311,22429,5266,0.193945,0.234785
6,Princeton University,7.0,7955.0,0.24,-99.0,958.0,1909.0,0.120482,Princeton University,Princeton University,13,United States,North America,1007,246,8069,1793,0.124799,0.222208
7,Imperial College London,8.0,15857.0,0.55,-99.0,1390.0,8721.0,0.087719,Imperial College London,Imperial College London,8,United Kingdom,Europe,3930,2071,16090,8746,0.244251,0.543567
8,University of Chicago,9.0,13525.0,0.25,-99.0,2181.0,3381.0,0.161290,University of Chicago,University of Chicago,9,United States,North America,2449,635,13557,3379,0.180645,0.249244
9,ETH Zurich – Swiss Federal Institute of Techno...,10.0,19233.0,0.38,-99.0,1317.0,7308.0,0.068493,ETH Zurich - Swiss Federal Institute of Techno...,ETH Zurich - Swiss Federal Institute of Techno...,10,Switzerland,Europe,2477,1886,19815,7563,0.125006,0.381681
