# Homework 2 : Data from the Web

In [4]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import requests
from bs4 import BeautifulSoup

## 1. Top University Ranking : QS
Next, we load the data from the QS website. The base URL of the site and the URL of the ranking data file are the following:

In [5]:
QS_base_URL = "https://www.topuniversities.com"
QS_data_URL = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508016143198"

Next, we make the HTTP request and parse the JSON response into a list of dictionaries:

In [6]:
QS_R = requests.get(QS_data_URL)
QS_dict = QS_R.json()['data']
QS_dict[0]

{'cc': 'US',
 'core_id': '410',
 'country': 'United States',
 'guide': '<a href="/where-to-study/north-america/united-states/guide" class="guide-link" target="_blank">United States</a>',
 'logo': '<img src="https://www.topuniversities.com/sites/default/files/massachusetts-institute-of-technology-mit_410_small_0.jpg" alt="Massachusetts Institute of Technology (MIT)  Logo">',
 'nid': '294850',
 'rank_display': '1',
 'region': 'North America',
 'score': '100',
 'stars': '6',
 'title': 'Massachusetts Institute of Technology (MIT)',
 'url': '/universities/massachusetts-institute-technology-mit'}
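
Because each entry of the `data` list is a flat dictionary, the list can also be previewed as a table. The snippet below is only a sketch: `sample_payload` is a hand-written stand-in for the response (two entries with values copied from this notebook), not the live payload, and `preview` is a name introduced for illustration.

```python
import pandas as pd

# Hypothetical stand-in for QS_R.json(): same shape, only two entries, fewer keys.
sample_payload = {
    "data": [
        {"title": "Massachusetts Institute of Technology (MIT)", "rank_display": "1",
         "country": "United States", "region": "North America"},
        {"title": "Stanford University", "rank_display": "2",
         "country": "United States", "region": "North America"},
    ]
}

# A list of flat dictionaries maps directly onto DataFrame rows (one key per column).
preview = pd.DataFrame(sample_payload["data"])
print(preview[["title", "rank_display", "country", "region"]])
```

Note that every value, including the rank, arrives as a string and needs to be converted before any arithmetic.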

Create an empty DataFrame for the QS data:

In [171]:
ranking_QS = pd.DataFrame(columns = ['Name','Rank','Country','Region',\
                                     'Number of faculty members (int)','Number of faculty members (total)',\
                                     'Number of students (int)','Number of students (total)'])
ranking_QS.head()

Unnamed: 0,Name,Rank,Country,Region,Number of faculty members (int),Number of faculty members (total),Number of students (int),Number of students (total)


We create the following function, which takes as input the dictionary of the i-th university. Its purpose is to fetch the missing information (faculty and student counts), which is only available on the university's own page on the QS site.

In [185]:
def get_additional_info(university_dict):
    # this URL contains additional information that we will extract
    QS_univ_URL = QS_base_URL + university_dict['url']
    QS_univ_r = requests.get(QS_univ_URL)
    QS_univ_s = BeautifulSoup(QS_univ_r.text, 'html.parser')

    ### getting info on the faculty members:
    # map each label found on the page ('In total', 'International') to its number
    faculty_s = QS_univ_s.find('div', class_='faculty-main')
    faculty_members = dict(zip([num.string for num in faculty_s.find_all('div', class_='anno')],
                               [num.string for num in faculty_s.find_all('div', class_='number')]))

    ### getting info on the students
    student_total_s = QS_univ_s.find('div', class_='students-main')
    student_int_s = QS_univ_s.find('div', class_='int-students-main')
    students = {'In total': student_total_s.find('div', class_='number').string,
                'International': student_int_s.find('div', class_='number').string}

    return faculty_members, students
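
The function above assumes that every block is present on the page: `find()` returns `None` when a `div` is missing, and the chained `.find(...)` would then raise an `AttributeError`. A minimal sketch of a more defensive lookup follows. The helper `read_number` and the HTML fragment `sample_html` are hypothetical and only imitate the class names used above; the real page layout may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the class names used in get_additional_info().
sample_html = """
<div class="faculty-main">
  <div class="total"><div class="number">2,982</div><div class="anno">In total</div></div>
</div>
<div class="students-main"><div class="number">11,067</div></div>
"""

def read_number(soup, container_class):
    """Return the text of the first 'number' div inside the container, or None if absent."""
    container = soup.find("div", class_=container_class)
    if container is None:
        return None
    number = container.find("div", class_="number")
    return number.get_text(strip=True) if number is not None else None

soup = BeautifulSoup(sample_html, "html.parser")
total_students = read_number(soup, "students-main")    # block present in the sample
int_students = read_number(soup, "int-students-main")  # block absent in the sample
print(total_students, int_students)
```

Returning `None` rather than raising lets the caller decide how to record the missing value, for example as `np.nan`.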

The following code gathers all the information for the first `n_univ` universities, from the list of dictionaries `QS_dict` and from each university's page through the function `get_additional_info()` defined above. Each university is stored as a row of the dataframe `ranking_QS`. As some universities do not list the number of international faculty members, we store a NaN value in that case.

In [None]:
n_univ = 200
for i in range(n_univ):
    univ = QS_dict[i]
    ### fetch the missing information from the page of the university
    faculty_members, students = get_additional_info(univ)

    # if the international faculty count is not listed, record it as np.nan
    if 'International' not in faculty_members:
        faculty_members['International'] = np.nan

    ### create the new row of the dataframe ranking_QS
    univ_series = [univ['title'], univ['rank_display'], univ['country'], univ['region'],
                   faculty_members['International'], faculty_members['In total'],
                   students['International'], students['In total']]
    ranking_QS.loc[i] = univ_series
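
An alternative to filling the dataframe row by row is to collect the rows in a list and build the dataframe once at the end. The sketch below uses made-up records (`scraped`, "University A", "University B") in place of the scraped pages; `dict.get` supplies `np.nan` whenever a key is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical records: (title, rank, faculty dict, students dict).
scraped = [
    ("University A", "1", {"In total": "2,982", "International": "1,679"},
     {"In total": "11,067", "International": "3,717"}),
    ("University B", "=2", {"In total": "953"},  # international faculty not listed
     {"In total": "2,255", "International": "647"}),
]

rows = []
for title, rank, faculty, students in scraped:
    rows.append({
        "Name": title,
        "Rank": rank,
        # dict.get returns the default when the key is absent
        "Number of faculty members (int)": faculty.get("International", np.nan),
        "Number of faculty members (total)": faculty.get("In total", np.nan),
        "Number of students (int)": students.get("International", np.nan),
        "Number of students (total)": students.get("In total", np.nan),
    })

# Build the dataframe in a single step from the list of row dictionaries.
table = pd.DataFrame(rows)
print(table)
```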
    

Now that we have the dataframe `ranking_QS`, we need to clean the data a little:
- convert the columns stored as strings into numbers
- remove the `=` prefix in the `Rank` column, which appears when two universities share the same rank.

In [None]:
# converting the following columns to numbers (thousands separators removed first):
conv_2_int = ['Number of faculty members (int)', 'Number of faculty members (total)',
              'Number of students (int)', 'Number of students (total)']
for col in conv_2_int:
    ranking_QS[col] = pd.to_numeric(ranking_QS[col].replace({',': ''}, regex=True))

# converting the rank column to int (the '=' marking tied ranks is removed first)
ranking_QS['Rank'] = pd.to_numeric(ranking_QS['Rank'].replace({'=': ''}, regex=True))
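
To see what the two conversions do, here is a small sketch on a toy dataframe (`toy` and its values are made up for illustration). It uses `Series.str.replace` with `regex=False` and `errors="coerce"`, which is a slightly different route from the cell above but should give the same result on well-formed strings.

```python
import pandas as pd

# Toy data: tied ranks carry an '=' prefix, counts carry thousands separators.
toy = pd.DataFrame({
    "Rank": ["1", "=47", "=47"],
    "Number of students (total)": ["11,067", "13,356", None],
})

# regex=False treats the pattern as a literal character.
toy["Rank"] = pd.to_numeric(toy["Rank"].str.replace("=", "", regex=False))

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
toy["Number of students (total)"] = pd.to_numeric(
    toy["Number of students (total)"].str.replace(",", "", regex=False),
    errors="coerce",
)
print(toy)
```

The missing count ends up as NaN, which is why a column with gaps is displayed with a float dtype, as in `Number of faculty members (int)`.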

In [235]:
# display the cleaned dataframe
ranking_QS

Unnamed: 0,Name,Rank,Country,Region,Number of faculty members (int),Number of faculty members (total),Number of students (int),Number of students (total)
0,Massachusetts Institute of Technology (MIT),1,United States,North America,1679.0,2982,3717,11067
1,Stanford University,2,United States,North America,2042.0,4285,3611,15878
2,Harvard University,3,United States,North America,1311.0,4350,5266,22429
3,California Institute of Technology (Caltech),4,United States,North America,350.0,953,647,2255
4,University of Cambridge,5,United Kingdom,Europe,2278.0,5490,6699,18770
5,University of Oxford,6,United Kingdom,Europe,2964.0,6750,7353,19720
6,UCL (University College London),7,United Kingdom,Europe,2554.0,6345,14854,31080
7,Imperial College London,8,United Kingdom,Europe,2071.0,3930,8746,16090
8,University of Chicago,9,United States,North America,635.0,2449,3379,13557
9,ETH Zurich - Swiss Federal Institute of Techno...,10,Switzerland,Europe,1886.0,2477,7563,19815


Now that the dataframe is clean, we can compute the requested values:

### (a)  ratio between faculty members and students

In [242]:
ranking_QS['Ratio Faculty/Students'] = ranking_QS['Number of faculty members (total)']/ranking_QS['Number of students (total)']
ranking_QS.sort_values('Ratio Faculty/Students',ascending=False)

Unnamed: 0,Name,Rank,Country,Region,Number of faculty members (int),Number of faculty members (total),Number of students (int),Number of students (total),Ratio Faculty/Students,Ratio Int/Total Students
3,California Institute of Technology (Caltech),4,United States,North America,350.0,953,647,2255,0.422616,0.286918
15,Yale University,16,United States,North America,1708.0,4940,2469,12402,0.398323,0.199081
5,University of Oxford,6,United Kingdom,Europe,2964.0,6750,7353,19720,0.342292,0.372870
4,University of Cambridge,5,United Kingdom,Europe,2278.0,5490,6699,18770,0.292488,0.356899
16,Johns Hopkins University,17,United States,North America,1061.0,4462,4105,16146,0.276353,0.254243
1,Stanford University,2,United States,North America,2042.0,4285,3611,15878,0.269870,0.227422
0,Massachusetts Institute of Technology (MIT),1,United States,North America,1679.0,2982,3717,11067,0.269450,0.335863
185,University of Rochester,186,United States,North America,488.0,2569,2805,9636,0.266604,0.291096
18,University of Pennsylvania,19,United States,North America,1383.0,5499,4250,20639,0.266437,0.205921
17,Columbia University,18,United States,North America,913.0,6189,8105,25045,0.247115,0.323617
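
As a quick sanity check, the ratio can be recomputed by hand for two rows of the table. The sketch below copies the faculty and student totals of Caltech and MIT from the output above into a small dataframe named `check` (a name introduced here for illustration).

```python
import pandas as pd

# Totals copied from the table above.
check = pd.DataFrame({
    "Name": ["California Institute of Technology (Caltech)",
             "Massachusetts Institute of Technology (MIT)"],
    "Number of faculty members (total)": [953, 2982],
    "Number of students (total)": [2255, 11067],
})

# Element-wise division: one ratio per university.
check["Ratio Faculty/Students"] = (
    check["Number of faculty members (total)"] / check["Number of students (total)"]
)
print(check[["Name", "Ratio Faculty/Students"]])
```

A value of about 0.42 for Caltech means roughly one faculty member for every 2.4 students.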


### (b)  ratio of international students

In [243]:
ranking_QS['Ratio Int/Total Students'] = ranking_QS['Number of students (int)']/ranking_QS['Number of students (total)']
ranking_QS.sort_values('Ratio Int/Total Students',ascending=False)


Unnamed: 0,Name,Rank,Country,Region,Number of faculty members (int),Number of faculty members (total),Number of students (int),Number of students (total),Ratio Faculty/Students,Ratio Int/Total Students
34,London School of Economics and Political Scien...,35,United Kingdom,Europe,687.0,1088,6748,9760,0.111475,0.691393
11,Ecole Polytechnique Fédérale de Lausanne (EPFL),12,Switzerland,Europe,1300.0,1695,5896,10343,0.163879,0.570047
7,Imperial College London,8,United Kingdom,Europe,2071.0,3930,8746,16090,0.244251,0.543567
198,Maastricht University,200,Netherlands,Europe,502.0,1277,8234,16385,0.077937,0.502533
47,Carnegie Mellon University,47,United States,North America,425.0,1342,6385,13356,0.100479,0.478062
6,UCL (University College London),7,United Kingdom,Europe,2554.0,6345,14854,31080,0.204151,0.477928
91,University of St Andrews,92,United Kingdom,Europe,485.0,1140,4030,8800,0.129545,0.457955
41,The University of Melbourne,41,Australia,Oceania,1477.0,3311,18030,42182,0.078493,0.427434
126,Queen Mary University of London,127,United Kingdom,Europe,801.0,1885,6806,16135,0.116827,0.421816
25,The University of Hong Kong,26,Hong Kong,Asia,2085.0,3012,8230,20214,0.149006,0.407144


### (c) By country : ratio of international students and  ratio between faculty members and students

We now display the same statistics for each country. Only the top 10 countries are shown here.

In [279]:
# Top 10: ratio of international students by country
# (the column is selected before mean() so that text columns are not averaged)
ranking_QS.groupby('Country')[['Ratio Int/Total Students']].mean()\
            .sort_values('Ratio Int/Total Students',ascending=False).head(10)


Country,Ratio Int/Total Students
United Kingdom,0.351308
Australia,0.346878
Switzerland,0.313816
Hong Kong,0.312148
Austria,0.306095
Singapore,0.277091
Canada,0.252604
New Zealand,0.248971
Netherlands,0.245456
Ireland,0.241791
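
Note that `groupby(...).mean()` averages the per-university ratios, so every university counts equally whatever its size. A different, size-weighted figure would divide the summed counts instead. The toy example below (made-up countries "X" and "Y", made-up counts) only illustrates how the two can differ; it is not the computation used in this notebook.

```python
import pandas as pd

# Toy data: country X has one small and one large university.
toy = pd.DataFrame({
    "Country": ["X", "X", "Y"],
    "Number of students (int)": [100, 4500, 300],
    "Number of students (total)": [1000, 9000, 1000],
})
toy["Ratio Int/Total Students"] = (
    toy["Number of students (int)"] / toy["Number of students (total)"]
)

# Unweighted: average of the per-university ratios (what the notebook computes).
mean_of_ratios = toy.groupby("Country")["Ratio Int/Total Students"].mean()

# Weighted: summed international students divided by summed students.
sums = toy.groupby("Country")[["Number of students (int)",
                               "Number of students (total)"]].sum()
ratio_of_sums = sums["Number of students (int)"] / sums["Number of students (total)"]

print(pd.DataFrame({"mean of ratios": mean_of_ratios, "ratio of sums": ratio_of_sums}))
```

For country X the two figures are 0.30 and 0.46, while for Y, which has a single university, they coincide.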


In [280]:
# Top 10: ratio between faculty members and students by country
# (the column is selected before mean() so that text columns are not averaged)
ranking_QS.groupby('Country')[['Ratio Faculty/Students']].mean()\
            .sort_values('Ratio Faculty/Students',ascending=False).head(10)
                                                 

Country,Ratio Faculty/Students
Russia,0.22191
Denmark,0.18658
Saudi Arabia,0.175828
Singapore,0.162279
Japan,0.15584
Malaysia,0.153893
United States,0.151679
South Korea,0.149356
France,0.144006
Israel,0.136047


### (d) By region : ratio of international students and  ratio between faculty members and students

We now display the same statistics for each region. As there are only 6 regions, we display the whole ranking.

In [286]:
# Ratio of international students by region
# (the column is selected before mean() so that text columns are not averaged)
ranking_QS.groupby('Region')[['Ratio Int/Total Students']].mean()\
            .sort_values('Ratio Int/Total Students',ascending=False)


Region,Ratio Int/Total Students
Oceania,0.329077
Europe,0.245932
North America,0.203583
Africa,0.169703
Asia,0.132394
Latin America,0.071751


In [284]:
# Ratio between faculty members and students by region
# (the column is selected before mean() so that text columns are not averaged)
ranking_QS.groupby('Region')[['Ratio Faculty/Students']].mean()\
            .sort_values('Ratio Faculty/Students',ascending=False)


Region,Ratio Faculty/Students
North America,0.145407
Asia,0.134673
Europe,0.120003
Latin America,0.096779
Africa,0.08845
Oceania,0.075003


## 2. Top University Ranking : THE
Next, we load the data from the THE website. The main URL of the ranking page is the following: