** First, run "Run All Cells" on the notebook "hw2_site1.ipynb", then carry out the analysis tasks here **

In [None]:
import pandas as pd

uni_s1 = pd.read_pickle("site1.pkl")
uni_s2 = pd.read_pickle("site2.pkl")
uni_s1.head()


In [None]:
import re
import numpy as np
from itertools import product
from scipy.optimize import linear_sum_assignment
from nltk.metrics import edit_distance


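# University names as listed on each site
n1 = uni_s1.name
n2 = uni_s2.name

# Cost matrix: rows index site-1 names, columns index site-2 names
costs = np.zeros((len(n1), len(n2)), dtype=int)

for i, u1 in enumerate(n1):
    for j, u2 in enumerate(n2):
        # Levenshtein (edit) distance between the two name strings
        costs[i, j] = edit_distance(u1, u2)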

In [None]:
# Optimal one-to-one assignment minimizing the total edit distance
id_n1, id_n2 = linear_sum_assignment(costs)

In [None]:
# The assignment returns positions, so index positionally with .iloc
for i, j in zip(id_n1, id_n2):
    print(n1.iloc[i], "==", n2.iloc[j])
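
The cells above pair each site-1 name with one site-2 name so that the summed edit distance is minimal. As a minimal sketch of that idea, the block below runs the same assignment on three made-up names per site. The names and the small `levenshtein` helper are illustrative assumptions (the notebook itself uses `nltk.metrics.edit_distance`).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute (illustrative helper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = curr
    return prev[-1]


# Hypothetical name lists, not taken from the real rankings
names_s1 = ["ETH Zurich", "Caltech", "Yale University"]
names_s2 = ["Yale Univ.", "ETH Zürich", "Caltech"]

# Rows index site-1 names, columns index site-2 names
costs = np.array([[levenshtein(a, b) for b in names_s2] for a in names_s1])

rows, cols = linear_sum_assignment(costs)
matches = {names_s1[i]: names_s2[j] for i, j in zip(rows, cols)}
distances = {names_s1[i]: int(costs[i, j]) for i, j in zip(rows, cols)}
```

Note that the assignment always returns a partner for every row it can, even when no true counterpart exists on the other site, so inspecting `distances` for large values is one way to spot doubtful pairs.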
    

uni_s1 and uni_s2 are the dataframes holding the rankings from the first and second websites, respectively. For site 1, we use the columns giving the total number of students, faculty members and international students to compute the faculty/student ratio and the proportion of international students of each university. Site 2 already provides both figures, so we only rescale them to make them comparable with site 1. We then sort each dataframe by each ratio to find the best universities with respect to it.
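
As a small worked example of these two ratios, the sketch below applies them to a toy dataframe. The column names follow those used for site 1 in this notebook, while the three universities and their counts are invented for illustration.

```python
import pandas as pd

# Toy data (hypothetical universities and counts)
toy = pd.DataFrame({
    "name": ["Uni A", "Uni B", "Uni C"],
    "stu_c_total": [1000, 2000, 500],   # total students
    "fac_c_total": [100, 100, 100],     # total faculty members
    "stu_c_inter": [400, 200, 100],     # international students
})

# Faculty members per student, and share of international students
toy["staff_student_ratio"] = toy["fac_c_total"] / toy["stu_c_total"]
toy["pc_intl_students"] = toy["stu_c_inter"] / toy["stu_c_total"]

# Highest ratio first
best_staff = toy.sort_values("staff_student_ratio", ascending=False)
best_intl = toy.sort_values("pc_intl_students", ascending=False)
```

Here Uni C comes first on the faculty/student ratio (100/500 = 0.2) and Uni A first on the international share (400/1000 = 0.4).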

In [None]:
uni_s2.head()

In [None]:
# Compute the ratios for s1 from the raw counts (s2 already provides them)
uni_s1['staff_student_ratio'] = uni_s1['fac_c_total'] / uni_s1['stu_c_total']
uni_s1['pc_intl_students'] = uni_s1['stu_c_inter'] / uni_s1['stu_c_total']
# Transform the s2 ratios so that they are comparable with s1:
# invert students-per-staff, and turn the percentage into a fraction
uni_s2['staff_student_ratio'] = 1 / uni_s2['student_staff_ratio'].astype(float)
uni_s2['pc_intl_students'] = uni_s2['pc_intl_students'] / 100
# Sort the data by each ratio, best first
uni_s1_FSsort = uni_s1.sort_values('staff_student_ratio', ascending=False)
uni_s1_Int = uni_s1.sort_values('pc_intl_students', ascending=False)
uni_s2_FSsort = uni_s2.sort_values('staff_student_ratio', ascending=False)
uni_s2_Int = uni_s2.sort_values('pc_intl_students', ascending=False)
uni_s1_FSsort.head()

In [None]:
uni_s2_FSsort.head()

In [None]:
uni_s1_Int.head()

In [None]:
uni_s2_Int.head()

According to website 1, the two best universities with respect to the staff/student ratio are Caltech and Yale. The two best universities with respect to the proportion of international students are the London School of Economics and Political Science and Ecole Polytechnique Fédérale de Lausanne (EPFL).

According to website 2, the two best universities with respect to the staff/student ratio are Vanderbilt University and University of Copenhagen. The two best universities with respect to the proportion of international students are the London School of Economics and Political Science and University of Luxembourg.

We now aggregate our results by country and region by grouping the data and computing the mean of the ratios for each group.
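
To make the grouping step concrete, here is a minimal sketch on toy data. The country codes and ratio values are invented; only the column names match the ones used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "country": ["CH", "CH", "UK", "UK", "US"],
    "staff_student_ratio": [0.10, 0.20, 0.05, 0.15, 0.30],
    "pc_intl_students": [0.50, 0.30, 0.40, 0.20, 0.10],
})

# Mean of each ratio over the universities of each country
by_country = (
    toy.groupby("country")[["staff_student_ratio", "pc_intl_students"]]
       .mean()
       .reset_index()
)

# Countries ordered by mean share of international students, best first
ranked_intl = by_country.sort_values("pc_intl_students", ascending=False)
```

This is an unweighted mean, so every university counts equally regardless of its size. A country with a single listed university is represented by that university alone.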

In [None]:

def aggregation(df, grouping):
    # Mean of both ratios within each group (e.g. per country or per region)
    df_ratiogroup = (
        df.groupby(grouping)[['staff_student_ratio', 'pc_intl_students']]
          .mean()
          .rename(columns={'staff_student_ratio': 'staff_student_ratio_mean',
                           'pc_intl_students': 'pc_intl_students_mean'})
          .reset_index()
    )
    # One copy sorted by each mean ratio, best group first
    df_RFC = df_ratiogroup.sort_values('staff_student_ratio_mean', ascending=False)
    df_RI = df_ratiogroup.sort_values('pc_intl_students_mean', ascending=False)
    return df_RFC, df_RI

uni_s1_RFC_country, uni_s1_RI_country= aggregation(uni_s1, 'country')
uni_s1_RFC_region, uni_s1_RI_region= aggregation(uni_s1, 'region')
uni_s2['staff_student_ratio']=pd.to_numeric(uni_s2['staff_student_ratio'])
uni_s2_RFC_country, uni_s2_RI_country= aggregation(uni_s2, 'country')


In [None]:
uni_s2_RI_country.head()

We plot our results as bar charts.

In [None]:
import matplotlib.pyplot as plt
ax = uni_s1_RI_region[['staff_student_ratio_mean','pc_intl_students_mean']].plot(kind='bar', title ="University mean ratios by region according to site 1's ranking", figsize=(15, 10), legend=True, fontsize=12)
ax.set_xlabel("region", fontsize=12)
ax.set_ylabel("ratio", fontsize=12)
ax.set_xticklabels(uni_s1_RI_region.region)
ax.tick_params(axis='x', which='major', pad=15)
plt.legend(['Faculty/Student ratio', 'International Students ratio'], loc='upper right')
plt.show()

From the plot above, we see that, according to site 1, the best region in terms of proportion of international students is Oceania, followed by Europe and then North America. In terms of the faculty/student ratio, the best region is North America, followed by Asia and Europe.

In [None]:
bx = uni_s1_RFC_country[['staff_student_ratio_mean']].plot(kind='bar', title ="University Faculty/Student mean ratio by country according to site 1's ranking", figsize=(15, 10), legend=True, fontsize=12)
bx.set_xlabel("country", fontsize=12)
bx.set_ylabel("ratio", fontsize=12)
bx.set_xticklabels(uni_s1_RFC_country.country)
bx.tick_params(axis='x', which='major', pad=15)
plt.show()

In [None]:
cx = uni_s1_RI_country[['pc_intl_students_mean']].plot(kind='bar', title ="University mean proportion of International Students by country according to site 1's ranking", figsize=(15, 10), legend=True, fontsize=12)
cx.set_xlabel("country", fontsize=12)
cx.set_ylabel("ratio", fontsize=12)
cx.set_xticklabels(uni_s1_RI_country.country)
cx.tick_params(axis='x', which='major', pad=15)
plt.show()

According to the rankings of the first website, the best countries in terms of proportion of international students are the United Kingdom and Australia, while the best in terms of faculty/student ratio are Russia and Denmark.

In [None]:
ax = uni_s2_RFC_country[['staff_student_ratio_mean']].plot(kind='bar', title ="University Faculty members to Students mean ratio by country according to site 2's ranking", figsize=(15, 10), legend=True, fontsize=12)
ax.set_xlabel("country", fontsize=12)
ax.set_ylabel("ratio", fontsize=12)
ax.set_xticklabels(uni_s2_RFC_country.country)
ax.tick_params(axis='x', which='major', pad=15)
plt.show()

In [None]:
ax = uni_s2_RI_country[['pc_intl_students_mean']].plot(kind='bar', title ="University mean proportion of International Students by country according to site 2's ranking", figsize=(15, 10), legend=True, fontsize=12)
ax.set_xlabel("country", fontsize=12)
ax.set_ylabel("ratio", fontsize=12)
ax.set_xticklabels(uni_s2_RI_country.country)
ax.tick_params(axis='x', which='major', pad=15)
plt.show()

According to the rankings of the second website, the best countries in terms of proportion of international students are Luxembourg and the United Kingdom, while the best in terms of faculty/student ratio are Denmark and Italy.