# Homework 2 - Introduction

The rest of the notebook is organized as follows:

* **Data Extraction**
* **Dataframes construction**
* **Dataset merging**
* **Analysis**
    1. **Homework question**
    2. **Exploratory analysis**

In [188]:
import requests as rq
import pandas as pd
import bs4
import re
import multiprocessing as mp
import numpy as np
import matplotlib.pyplot as plt

from itertools import product
from scipy.optimize import linear_sum_assignment
from nltk.metrics import edit_distance
from IPython.display import display, HTML

In [189]:
N = 200
M = 400

SITE1 = "https://www.topuniversities.com"
URL1 = SITE1+"/sites/default/files/qs-rankings-data/357051.txt"

SITE2 = "https://www.timeshighereducation.com"
URL2 = SITE2+"/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json"

# Data extraction

In [190]:
data1, data2 = (rq.get(URL).json().get("data") for URL in (URL1, URL2))

display(sorted(list(data1[0].keys())))
display(sorted(list(data2[0].keys())))

names1, names2 = ([u.get(key) for u in data[:M]] for data, key in ((data1, 'title'), (data2, 'name')))
print("Extracted %d and %d names" % (len(names1), len(names2)))

['cc',
 'core_id',
 'country',
 'guide',
 'logo',
 'nid',
 'rank_display',
 'region',
 'score',
 'stars',
 'title',
 'url']

['aliases',
 'location',
 'member_level',
 'name',
 'nid',
 'rank',
 'rank_order',
 'record_type',
 'scores_citations',
 'scores_citations_rank',
 'scores_industry_income',
 'scores_industry_income_rank',
 'scores_international_outlook',
 'scores_international_outlook_rank',
 'scores_overall',
 'scores_overall_rank',
 'scores_research',
 'scores_research_rank',
 'scores_teaching',
 'scores_teaching_rank',
 'stats_female_male_ratio',
 'stats_number_students',
 'stats_pc_intl_students',
 'stats_student_staff_ratio',
 'subjects_offered',
 'url']

Extracted 400 and 400 names


*********************** TASK 1,2 and 3 ******************************************************************************

**We start by fetching all the requested data from both websites.**
For the first URL (topuniversities), we have to request the data from each university page.

In [191]:
s = rq.Session()
reqs = [(req_id, s.prepare_request(rq.Request('GET', SITE1+entry["url"]))) for req_id, entry in enumerate(data1[:N])]
print(len(reqs), "requests to be sent.")

resps = [(req_id, s.send(req)) for req_id, req in reqs]

done = [(req_id, resp.text) for req_id, resp in resps if resp.status_code == 200]
failed = [(req_id, resp) for req_id, resp in resps if resp.status_code != 200]

print("%d done, %d failed." % (len(done), len(failed)))

# Matches every character that is not a digit
non_digit = re.compile('[^0-9]')

# CSS classes of a count's container (joined with "+") -> column name
class_to_labels = {"total+faculty": "fac_c_total",
                   "inter+faculty": "fac_c_inter",
                   "total+student": "stu_c_total",
                   "total+inter": "stu_c_inter"}


def resp_to_counts(req):
    req_id, resp = req
    page = bs4.BeautifulSoup(resp, "html.parser")
    top = page.body.find("div", class_="view-academic-data-profile")
    numdivs = top.find_all("div", class_="number")

    def get_label(div):
        # Walk up the tree until an ancestor carries a known class combination
        if div == top:
            return None
        label = class_to_labels.get("+".join(div.get("class")))
        return label or get_label(div.parent)

    # Map each label to its count, keeping only the digits of the displayed number
    fac_counts = {get_label(div): int(non_digit.sub('', div.string)) for div in numdivs}
    return req_id, fac_counts

print("Parsing responses using up to %d threads..." % mp.cpu_count(), end="") 
with mp.Pool(mp.cpu_count()) as p:
    for req_id, counts in p.map(resp_to_counts, done):
        data1[req_id].update(counts)
print("done")

200 requests to be sent.
200 done, 0 failed.
Parsing responses using up to 4 threads...done
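The two small steps inside `resp_to_counts` (looking up a column name from CSS classes and turning a displayed number into an integer) can be tried without any network access. This is a minimal sketch: the helper names `label_for` and `parse_count` are hypothetical, and the sample inputs are assumptions about what a profile page might contain, since the page markup itself is not shown here.

```python
import re

# Same mapping idea as in the cell above: CSS classes joined by "+" -> column name.
CLASS_TO_LABEL = {
    "total+faculty": "fac_c_total",
    "inter+faculty": "fac_c_inter",
    "total+student": "stu_c_total",
    "total+inter": "stu_c_inter",
}

NON_DIGIT = re.compile(r"[^0-9]")


def label_for(css_classes):
    """Return the column name for a list of CSS classes, or None if unknown."""
    return CLASS_TO_LABEL.get("+".join(css_classes))


def parse_count(text):
    """Strip everything that is not a digit, then convert to int."""
    return int(NON_DIGIT.sub("", text))


# Hypothetical inputs, chosen to mimic what a profile page might contain.
print(label_for(["total", "faculty"]))  # fac_c_total
print(parse_count(" 2,982 "))           # 2982
```

An unknown class combination returns `None`, which is what lets `get_label` keep climbing to the parent element.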


# Dataframes construction

We turn the raw JSON data into an actual pandas DataFrame object. Note that instead of the "student to staff" ratio, we prefer to compute the "staff to student" ratio, so that every metric follows a "the higher, the better" convention.

In [192]:
basecol = ["title", "rank_display", "country", "region"]
addedcol = ["fac_c_inter", "fac_c_total", "stu_c_inter", "stu_c_total"]

uni_s1 = pd.DataFrame(data1[:N], columns= basecol+addedcol)
uni_s1.rename(columns={"title":"name", "rank_display": "rank"},inplace=True)
# Convert the rank to a numerical type (keep the first run of digits)
uni_s1["rank"] = uni_s1["rank"].str.extract(r'(\d+)', expand=False).astype(int)

# Ratio computations for s1 only, because they are already available for s2
uni_s1['staff_student_ratio'] = uni_s1['fac_c_total'] / uni_s1['stu_c_total']
uni_s1['pc_intl_students'] = uni_s1['stu_c_inter'] / uni_s1['stu_c_total']

uni_s1.head()

| | name | rank | country | region | fac_c_inter | fac_c_total | stu_c_inter | stu_c_total | staff_student_ratio | pc_intl_students |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Massachusetts Institute of Technology (MIT) | 1 | United States | North America | 1679.0 | 2982.0 | 3717.0 | 11067.0 | 0.26945 | 0.335863 |
| 1 | Stanford University | 2 | United States | North America | 2042.0 | 4285.0 | 3611.0 | 15878.0 | 0.26987 | 0.227422 |
| 2 | Harvard University | 3 | United States | North America | 1311.0 | 4350.0 | 5266.0 | 22429.0 | 0.193945 | 0.234785 |
| 3 | California Institute of Technology (Caltech) | 4 | United States | North America | 350.0 | 953.0 | 647.0 | 2255.0 | 0.422616 | 0.286918 |
| 4 | University of Cambridge | 5 | United Kingdom | Europe | 2278.0 | 5490.0 | 6699.0 | 18770.0 | 0.292488 | 0.356899 |
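As a quick sanity check, the two derived columns can be recomputed by hand for a single row. The sketch below uses the MIT counts from the preview above; the variable names are only illustrative.

```python
# Counts for MIT, copied from the first row of the preview above.
fac_total, stu_inter, stu_total = 2982, 3717, 11067

staff_student_ratio = fac_total / stu_total   # faculty members per student
pc_intl_students = stu_inter / stu_total      # share of international students

print(round(staff_student_ratio, 5))  # 0.26945
print(round(pc_intl_students, 6))     # 0.335863
```

Both values agree with the `staff_student_ratio` and `pc_intl_students` columns shown for MIT.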


The second dataset already provides the ratios; we only have to invert the ```student_staff_ratio``` into a ```staff_student_ratio```.

In [None]:
basecol = ["name", "rank", "location", "stats_pc_intl_students", "stats_student_staff_ratio"]

uni_s2 = pd.DataFrame(data2[:N], columns=basecol)
uni_s2.rename(columns={"location":"country","stats_pc_intl_students":"pc_intl_students", "stats_student_staff_ratio":"student_staff_ratio"},inplace=True)

uni_s2["rank"] = uni_s2["rank"].str.extract(r'(\d+)', expand=False).astype(int)

uni_s2["pc_intl_students"] = uni_s2["pc_intl_students"].str.extract(r'(\d+)', expand=False).astype(float) / 100

# Invert the s2 ratio so that it is comparable with the data for s1
uni_s2['staff_student_ratio'] = 1 / uni_s2['student_staff_ratio'].astype(float)

uni_s2.head()

| | name | rank | country | pc_intl_students | student_staff_ratio | staff_student_ratio |
|---|---|---|---|---|---|---|
| 0 | University of Oxford | 1 | United Kingdom | 0.38 | 11.2 | 0.089286 |
| 1 | University of Cambridge | 2 | United Kingdom | 0.35 | 10.9 | 0.091743 |
| 2 | California Institute of Technology | 3 | United States | 0.27 | 6.5 | 0.153846 |
| 3 | Stanford University | 3 | United States | 0.22 | 7.5 | 0.133333 |
| 4 | Massachusetts Institute of Technology | 5 | United States | 0.34 | 8.7 | 0.114943 |
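The cleaning steps applied to the second dataset can be mirrored on single values. This is a sketch under stated assumptions: `first_int` and `staff_per_student` are hypothetical helpers, and the raw strings `"=3"` and `"38%"` are guesses at the unprocessed JSON values, which are not shown in this notebook. Only the ratio 11.2 comes from the preview above (University of Oxford).

```python
import re

FIRST_NUMBER = re.compile(r"(\d+)")


def first_int(text):
    """Return the first run of digits in `text` as an int."""
    match = FIRST_NUMBER.search(text)
    if match is None:
        raise ValueError("no digits in %r" % text)
    return int(match.group(1))


def staff_per_student(students_per_staff):
    """Invert a student/staff ratio into a staff/student ratio."""
    return 1 / float(students_per_staff)


# "=3" and "38%" are hypothetical raw strings; 11.2 is taken from the preview above.
print(first_int("=3"))                      # 3
print(first_int("38%") / 100)               # 0.38
print(round(staff_per_student("11.2"), 6))  # 0.089286
```

Because only the first run of digits is kept, a banded rank such as `"201-250"` (again hypothetical) would be reduced to its lower bound, 201.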


TODO explain or remove

M=200 : 1119

M=300 : 969

M=400 : 875

M=500 : 860

M=800 : 822

M=1000: 819

# Datasets merging

We compute the best matching of university names by building the cost matrix (edit_distance) of all possible assignments. We then use this matrix to solve the linear sum assignment problem, also known as minimum weight matching in bipartite graphs: the goal is to find a complete assignment of workers to jobs of minimal cost.

In [None]:
def col(name):
    return np.array([edit_distance(name, n) for n in names2])

print("Computing cost matrix using %d workers..." % mp.cpu_count(), end="")
# The context manager closes the pool once the cost matrix is built
with mp.Pool(mp.cpu_count()) as p:
    costs = np.array(p.map(col, names1))
print("done")

print("Computing optimal assignment...", end="")
id_n1, id_n2 = linear_sum_assignment(costs)
sol_costs = costs[id_n1[:N], id_n2[:N]]
print("Done: cost of solution = %d" % sol_costs.sum())


Computing cost matrix using 4 workers...
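To make the matching idea concrete, the same problem can be solved by brute force on three names taken from the previews above. This is only an illustrative sketch: `levenshtein` is a plain edit distance standing in for `edit_distance`, and `best_assignment` (a hypothetical helper) tries every permutation, which is only feasible for a handful of names. `linear_sum_assignment` solves the same minimisation efficiently on the full cost matrix.

```python
from itertools import permutations


def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = curr
    return prev[-1]


def best_assignment(rows, cols):
    """Try every one-to-one assignment and return (total_cost, pairs) of the cheapest."""
    cost = [[levenshtein(r, c) for c in cols] for r in rows]
    best = min(permutations(range(len(cols)), len(rows)),
               key=lambda perm: sum(cost[i][j] for i, j in enumerate(perm)))
    total = sum(cost[i][j] for i, j in enumerate(best))
    return total, [(rows[i], cols[j]) for i, j in enumerate(best)]


site1 = ["Massachusetts Institute of Technology (MIT)",
         "California Institute of Technology (Caltech)",
         "Stanford University"]
site2 = ["Stanford University",
         "California Institute of Technology",
         "Massachusetts Institute of Technology"]

total, pairs = best_assignment(site1, site2)
for left, right in pairs:
    print(left, "->", right)
print("total cost:", total)
```

Here the only cost comes from the suffixes " (MIT)" and " (Caltech)", which is why names that differ slightly between the two sites are still paired correctly.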

This method always finds an assignment, which means that university names without a real match (no corresponding name in the second dataset) will still be matched with some other name. To remove these mistakes, we simply check that the matched universities are in the same country. If they are not, we consider that there is no match and remove the entry from the merged dataset.

In [None]:
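# reindex (rather than .loc) yields NaN rows for matches that fall outside the first N rows of uni_s2
uni_m = uni_s1.join(uni_s2.reindex(id_n2[:N]).reset_index(drop=True), rsuffix="_2")
uni_m.dropna(inplace=True) # Drop rows with missing values, including those unmatched rows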

uni_m.replace("Russian Federation", "Russia", inplace=True)
uni_m = uni_m[uni_m["country"] == uni_m["country_2"]].drop("country_2", axis=1) # Remove unmatching countries 


print("Merged dataset is of size: %s" % len(uni_m))
display(uni_m[["name", "name_2"]][uni_m["name"] != uni_m["name_2"]]) # Shows a good quality of matching

uni_m.drop("name_2", axis=1, inplace=True)

uni_m.head()
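The country check can be sketched on a few hand-written pairs. The first two rows use names from the previews above; the last two are made-up entries (labelled as such in the code) that exist only to show the alias handling and the rejection of a bad match.

```python
# Each tuple: (name on site 1, country on site 1, matched name on site 2, country on site 2).
# The last two rows are made-up examples, included only to exercise the filter.
matches = [
    ("Stanford University", "United States", "Stanford University", "United States"),
    ("University of Cambridge", "United Kingdom", "University of Cambridge", "United Kingdom"),
    ("Example State University", "Russian Federation", "Example State University", "Russia"),
    ("Example Institute A", "France", "Example Institute B", "Germany"),
]

# Country names that differ between the two sites are mapped to one spelling first.
ALIASES = {"Russian Federation": "Russia"}


def normalize(country):
    """Return the canonical spelling of a country name."""
    return ALIASES.get(country, country)


# Keep a pair only when both sites agree on the country.
kept = [m for m in matches if normalize(m[1]) == normalize(m[3])]

for name1, _, name2, _ in kept:
    print(name1, "<->", name2)
print(len(kept), "of", len(matches), "matches kept")  # 3 of 4 matches kept
```

Without the alias step, the third pair would be discarded even though both sites refer to the same country.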

# Analysis


##  Homework questions

The first questions can be answered directly from the extracted data.

In [None]:

def title(s):
    return display(HTML("<H3>%s</H3>" % s))

def top(n, var, df):
    bests = df.sort_values(var, ascending=False).head(n)
    return bests[["name", var]].reset_index(drop=True).set_index('name', append=True)

TOP_N = 5



#sorting of data with respect to each ratio
title("Best universities according to the staff/Student ratio")
title("<u>Site 1 (%s) </u>:" % SITE1)
display(top(TOP_N, 'staff_student_ratio', uni_s1))
title("<u>Site 2 (%s) </u>:" % SITE2)
display(top(TOP_N, 'staff_student_ratio', uni_s2))

title("Best universities according to the ratio of international students")
title("<u>Site 1 (%s) </u>:" % SITE1)
display(top(TOP_N, 'pc_intl_students', uni_s1))
title("<u>Site 2 (%s) </u>:" % SITE2)
display(top(TOP_N, 'pc_intl_students', uni_s2))

The second set of questions requires computing aggregates within groups, which we do with the ```pivot_table``` function.

In [None]:
uni_s1_countries = uni_s1.pivot_table(index="country", values=["staff_student_ratio", "pc_intl_students"])
uni_s1_regions = uni_s1.pivot_table(index="region", values=["staff_student_ratio", "pc_intl_students"])
uni_s2_countries = uni_s2.pivot_table(index="country", values=["staff_student_ratio", "pc_intl_students"])

title("(c) Best countries according to faculty member to student ratio: (website 1 then 2)")
display(uni_s1_countries[["staff_student_ratio"]].sort_values("staff_student_ratio", ascending=False).head())
display(uni_s2_countries[["staff_student_ratio"]].sort_values("staff_student_ratio", ascending=False).head())
title("(c) Best countries according to international students ratio: (website 1 then 2)")
display(uni_s1_countries[["pc_intl_students"]].sort_values("pc_intl_students", ascending=False).head())
display(uni_s2_countries[["pc_intl_students"]].sort_values("pc_intl_students", ascending=False).head())

title("(d) Best regions according to international students ratio: (website 1 only)")
#display(uni_s1_regions.sort_values(sort_by, ascending=ascending).head())
display(uni_s1_regions[["pc_intl_students"]].sort_values("pc_intl_students", ascending=False).head())
title("(d) Best regions according to faculty member to student ratio: (website 1 only)")
display(uni_s1_regions[["staff_student_ratio"]].sort_values("staff_student_ratio", ascending=False).head())
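With only `index` and `values` given, `pivot_table` aggregates with the mean of each group. The plain-Python sketch below reproduces that aggregation for the five rows shown in the `uni_s1` preview. It is an illustration of the computation, not a replacement for the pandas call.

```python
from collections import defaultdict

# (country, staff_student_ratio) for the five rows shown in the uni_s1 preview.
rows = [
    ("United States", 0.26945),
    ("United States", 0.26987),
    ("United States", 0.193945),
    ("United States", 0.422616),
    ("United Kingdom", 0.292488),
]

groups = defaultdict(list)
for country, ratio in rows:
    groups[country].append(ratio)

# Mean per group, then sort from best (highest) to worst.
means = {country: sum(vals) / len(vals) for country, vals in groups.items()}
ranking = sorted(means.items(), key=lambda kv: kv[1], reverse=True)

for country, mean in ranking:
    print("%-15s %.4f" % (country, mean))
```

On these five rows alone the United Kingdom comes first, which shows how much a country mean depends on how many of its universities are in the sample.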

In [None]:
def plot(title, df, color, figsize=(15, 5), legend=None, **kwargs):
    f, ax = plt.subplots(figsize=figsize)
    df.plot(kind='bar', ax=ax, color=color, title=title, legend=True, fontsize=12, **kwargs)
    ax.set_xlabel(df.index.name, fontsize=12)
    ax.set_ylabel("ratio", fontsize=12)
    ax.set_xticklabels(df.index)
    ax.tick_params(axis='x', which='major', pad=15)
    if legend:
        ax.legend(legend)
    display(f)
    plt.close(f)

plot("University mean ratios by region according to site 1's ranking",
     uni_s1_regions.sort_values("pc_intl_students", ascending=False), color=['b','r'],
    legend=['International Students ratio', 'Faculty/Student ratio'])

From the plot above, we see that, according to site 1, the best region in terms of proportion of international students is Oceania, followed by Europe and then North America. In terms of faculty-to-student ratio, the best region is North America, followed by Asia and Europe.


In [None]:
plot("University Faculty/Student mean ratio by country according to site 1's ranking",
     uni_s1_countries["staff_student_ratio"].sort_values(ascending=False), color=['r'], legend=['Faculty/Student ratio'])

plot("University mean proportion of International Students by country according to site 1's ranking",
     uni_s1_countries["pc_intl_students"].sort_values(ascending=False), color=['b'], legend=['International Students ratio'])

According to the rankings of the first website, the best countries in terms of proportion of international students are the United Kingdom and Australia, while the best countries in terms of faculty-to-student ratio are Russia and Denmark.

In [None]:
plot("University Faculty/Student mean ratio by country according to site 2's ranking",
     uni_s2_countries["staff_student_ratio"].sort_values(ascending=False), color=['r'], legend=['Faculty/Student ratio'])

plot("University mean proportion of International Students by country according to site 2's ranking",
     uni_s2_countries["pc_intl_students"].sort_values(ascending=False), color=['b'], legend=['International Students ratio'])

According to the rankings of the second website, the best countries in terms of proportion of international students are Luxembourg and the United Kingdom, while the best countries in terms of faculty-to-student ratio are Denmark and Italy.

## Exploratory analysis

************************** TASK 4 (a) ****************************
We have already observed that the two datasets have different statistics about the universities. We highlight this with some descriptive statistics:

In [None]:
uni_m['rank_diff'] = (uni_m['rank'] - uni_m['rank_2']).abs()
uni_m['intl_pc_diff'] = (uni_m['pc_intl_students'] - uni_m['pc_intl_students_2']).abs()
uni_m['staff_student_ratio_diff'] = (uni_m['staff_student_ratio'] - uni_m['staff_student_ratio_2']).abs()


display(uni_m[['name', 'rank_diff', 'intl_pc_diff', 'staff_student_ratio_diff']].head())

In [None]:
uni_m[['rank_diff', 'intl_pc_diff', 'staff_student_ratio_diff']].describe()
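The `rank_diff` column and a few of the statistics reported by `describe()` can be reproduced by hand. The sketch below is restricted to the four universities whose ranks on both sites are visible in the previews above, so the numbers only illustrate the computation and say nothing about the full merged dataset.

```python
import statistics

# (rank on site 1, rank on site 2) for universities that appear in both previews above.
ranks = {
    "Massachusetts Institute of Technology": (1, 5),
    "Stanford University": (2, 3),
    "California Institute of Technology": (4, 3),
    "University of Cambridge": (5, 2),
}

# Absolute rank difference per university.
rank_diff = {name: abs(r1 - r2) for name, (r1, r2) in ranks.items()}

values = list(rank_diff.values())
print("count ", len(values))
print("mean  ", statistics.mean(values))
print("median", statistics.median(values))
print("min   ", min(values))
print("max   ", max(values))
```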

We now look at the distributions of the differences between the two websites' datasets:

In [None]:
import seaborn as sns
sns.set(color_codes=True)

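def densplot(column):
    # histplot with a KDE overlay replaces the deprecated sns.distplot (seaborn >= 0.11)
    sns.histplot(column, kde=True, stat="density")
    plt.show()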

def scatplot(xelem, yelem, xlabel, ylabel, title, polyfit=None):
    plt.scatter(xelem, yelem)
    if polyfit:
        plt.plot(np.unique(xelem), np.poly1d(np.polyfit(xelem, yelem, polyfit))(np.unique(xelem)), 'C2')
    plt.title(title)
    plt.xlabel(xlabel, fontsize=12)
    plt.ylabel(ylabel, fontsize=12)
    plt.show()

In [None]:
densplot(uni_m['rank_diff'])
densplot(uni_m['intl_pc_diff'])
densplot(uni_m['staff_student_ratio_diff'])

TODO observations

**We now look for correlations between different statistics (columns)**

Correlation (or not) between international staff proportion and international students proportion:

In [None]:
uni_m['pc_intl_staff'] = uni_m['fac_c_inter']/uni_m['fac_c_total']
scatplot(uni_m['pc_intl_staff'], uni_m['pc_intl_students'], 'international_staff', 'international_students', 'website 1', 2)
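A scatter plot can be complemented by a single number, the Pearson correlation coefficient. The sketch below computes it by hand for the five site 1 rows shown in the `uni_s1` preview. The helper `pearson` is hypothetical, and five points are far too few to draw any conclusion, so this only illustrates the calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# International and total faculty counts for the five rows of the uni_s1 preview.
fac_inter = [1679, 2042, 1311, 350, 2278]
fac_total = [2982, 4285, 4350, 953, 5490]
pc_intl_staff = [i / t for i, t in zip(fac_inter, fac_total)]

# Share of international students for the same five rows.
pc_intl_students = [0.335863, 0.227422, 0.234785, 0.286918, 0.356899]

r = pearson(pc_intl_staff, pc_intl_students)
print("Pearson r = %.3f" % r)
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates none.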


TODO observation

Correlation (or not) between ranks and (1) international staff, (2) international students and (3) staff/student ratio

In [None]:
# 1
scatplot(uni_m['rank'], uni_m['pc_intl_staff'], 'rank', 'international_staff', 'website 1', 5)
scatplot(uni_m['rank_2'], uni_m['pc_intl_staff'], 'rank', 'international_staff_1', 'website 2', 5)

TODO observation

In [None]:
# 2
scatplot(uni_m['rank'], uni_m['pc_intl_students'], 'rank', 'international_student', 'website 1', 5)
scatplot(uni_m['rank_2'], uni_m['pc_intl_students_2'], 'rank_2', 'international_students', 'website 2', 5)
scatplot(uni_m['rank'], uni_m['fac_c_inter'], 'rank', 'international_staff', 'website 1', 5)
scatplot(uni_m['rank'], uni_m['stu_c_inter'], 'rank', 'international_students', 'website 1', 5)

TODO observation

In [None]:
uni_m.keys()

In [None]:
# 3
scatplot(uni_m['rank'], uni_m['staff_student_ratio'], 'rank', 'staff_student_ratio', 'website 1', 5)
scatplot(uni_m['rank_2'], uni_m['staff_student_ratio_2'], 'rank_2', 'staff_student_ratio_2', 'website 2', 5)

TODO observation

************************** TASK 5 ****************************
In order to select the best university according to both rankings, we simply average the two ranks.
We only want to know the best university, and there is no tie for first place, which goes to Stanford University.
In this case the ties between the other universities do not matter; otherwise we would have broken them by putting first the university with the best single rank. (Example: University of Cambridge (4) would come before Caltech (3) if we had to break the tie, since it is ranked 2 on the second website.)

In [None]:
uni_m['combine_rank'] = (uni_m['rank']+uni_m['rank_2'])/2

uni_ms = uni_m.sort_values(['combine_rank'])
uni_ms.head()
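The averaging and the tie-breaking rule described above can be sketched as a sort key. The example is limited to the four universities whose ranks on both sites appear in the previews above, and `sort_key` is a hypothetical helper, not part of the notebook.

```python
# (rank on site 1, rank on site 2), taken from the two previews above.
ranks = {
    "Massachusetts Institute of Technology": (1, 5),
    "Stanford University": (2, 3),
    "California Institute of Technology": (4, 3),
    "University of Cambridge": (5, 2),
}


def sort_key(item):
    """Sort by average rank first, then by the best rank on either site."""
    name, (r1, r2) = item
    combined = (r1 + r2) / 2   # average of the two ranks
    best_single = min(r1, r2)  # tie-breaker: best rank on either site
    return combined, best_single


ordered = sorted(ranks.items(), key=sort_key)
for name, (r1, r2) in ordered:
    print("%-40s %.1f" % (name, (r1 + r2) / 2))
```

Stanford University comes first with an average of 2.5, and the tie at 3.5 is resolved in favour of the University of Cambridge because of its rank 2 on the second website.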

*********************** TASK 4 (b): Exploratory analysis ***********************************************
In the following, we check for possible correlations between the different statistics and the combined rank of the universities.

In [None]:
scatplot(uni_ms['combine_rank'], uni_ms['staff_student_ratio'], 'combine_rank', 'staff_student_ratio', 'website 1', 5)
scatplot(uni_ms['combine_rank'], uni_ms['pc_intl_students'], 'combine_rank', 'pc_intl_students', 'website 1', 5)
scatplot(uni_ms['combine_rank'], uni_ms['staff_student_ratio_2'], 'combine_rank', 'staff_student_ratio', 'website 2', 5)
scatplot(uni_ms['combine_rank'], uni_ms['pc_intl_students_2'], 'combine_rank', 'pc_intl_students', 'website 2', 5)

TODO observations

We now observe the difference between the two rankings.

In [None]:
uni_m['diff_rank_abs'] = (uni_m['rank'] - uni_m['rank_2']).abs()
uni_m['diff_rank'] = uni_m['rank'] - uni_m['rank_2']

scatplot(uni_m['rank'], uni_m['diff_rank'], 'rank', 'rank_diff', 'website 1', 5)
scatplot(uni_m['rank_2'], uni_m['diff_rank'], 'rank_2', 'rank_diff', 'website 2', 5)
scatplot(uni_m['combine_rank'], uni_m['diff_rank'], 'combine_rank', 'rank_diff', 'website 1&2 combined', 5)

We observe that the difference between the two rankings is smaller for higher-ranked universities than for lower-ranked ones. This could be because the ranking is done more precisely for the top-30 universities than for the others.
Moreover, TODO