# Analysis of World University Rankings

## 1. Introduction

What are the best universities in the world?

Ranking universities is a complex, political and controversial practice. There are hundreds of different national and international university ranking systems, many of which are inconsistent with each other. This dataset contains three global university rankings that originate from very different places.

University Ranking Data
The Times Higher Education World University Rankings is one of the most influential and widely observed university rankings. Founded in the United Kingdom in 2010, it has been criticized for its commercialization and for undermining non-English-language institutions.

The Academic Ranking of World Universities, also known as the Shanghai Ranking, is an equally influential ranking. It was founded in China in 2003 and has been criticized for its emphasis on raw research power and for undermining the liberal arts and teaching quality.

The Center for World University Rankings is a lesser-known list from Saudi Arabia, founded in 2012.


## Data description and objectives

As stated above, the analysis compares three major rankings: the Times Higher Education World University Rankings, the Academic Ranking of World Universities and the Center for World University Rankings.
Two supplementary datasets accompany them.

The first is a dataset of educational attainment across the world. It is taken from the World Databank and contains information from the UNESCO Institute for Statistics and the Barro-Lee dataset.

A second complementary dataset contains information on direct public and private spending on education across countries. This data is obtained from the National Center for Education Statistics. It represents expenses as a percentage of gross domestic product.

Data that we will collect and use for our analysis:

world_rank - world ranking of the university, that is, where it stands

university_name - name of university

country - country of each university

teaching - university score for teaching (the learning environment)

international - university score for international outlook (staff, students, research)

research - university score for research (volume, income and reputation)

citations - university score for citations (research influence)

income - university score for industry income (knowledge transfer)

total_score - total score for university, used to determine rank

num_students - number of students at the university

alumni - Alumni Score, based on the number of alumni of an institution winning Nobel Prizes and Fields Medals

award - Award Score, based on the number of staff of an institution winning Nobel Prizes in Physics, Chemistry, Medicine, and Economics and Fields Medals in Mathematics

hici - HiCi Score, based on the number of Highly Cited Researchers selected by Thomson Reuters

ns - N&S Score, based on the number of papers published in Nature and Science

pub - PUB Score, based on total number of papers indexed in the Science Citation Index-Expanded and Social Science Citation Index

pcp - PCP Score, the weighted scores of the above five indicators divided by the number of full-time academic staff

## Research questions

1) How do these rankings compare to each other?

2) Does additional spending on education lead to a better international university ranking? If yes, how?

3) How does the level of national education compare with the quality of universities in each country?

4) What are university rankings based on?

5) How important is it for a university to have a good ranking?

In [1]:
# Data acquisition, manipulation and validation

# Import the good stuff
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from pandas import Series,DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [12]:
# Read the times data csv from Kaggle
times_df = pd.read_csv('./timesData.csv')

# Only EPFL rankings
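# .copy() gives an independent frame, so later column assignments do not
# trigger pandas' SettingWithCopyWarning
times_epfl_df = times_df[times_df['university_name'] == 'École Polytechnique Fédérale de Lausanne'].copy()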

# Preview of dataset
times_epfl_df.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
47,48,École Polytechnique Fédérale de Lausanne,Switzerland,55.0,100.0,56.1,83.8,38.0,66.5,9666,10.5,54%,27 : 73,2011
245,46,École Polytechnique Fédérale de Lausanne,Switzerland,53.1,98.9,43.9,95.3,46.7,66.3,9666,10.5,54%,27 : 73,2012
641,40,École Polytechnique Fédérale de Lausanne,Switzerland,62.4,98.8,57.0,95.0,49.8,73.0,9666,10.5,54%,27 : 73,2013
1038,37,École Polytechnique Fédérale de Lausanne,Switzerland,52.9,98.2,48.3,95.9,49.2,67.7,9666,10.5,54%,27 : 73,2014
1436,34,École Polytechnique Fédérale de Lausanne,Switzerland,54.7,98.8,56.9,95.0,61.9,70.9,9666,10.5,54%,27 : 73,2015


In [3]:
# Convert data to ints and floats in one call (avoids assigning into a slice)
times_epfl_df = times_epfl_df.astype({
    'world_rank': 'int64',
    'international': 'float64',
    'income': 'float64',
    'total_score': 'float64',
})

# Plot evolution of ranking
ax = times_epfl_df.plot(
    kind='line', 
    x='year', 
    y='world_rank',
    xlim=(2011, 2016), 
    ylim=(1, 60), 
    xticks=range(2011, 2017)
)

# Have ints for the labels
ax.ticklabel_format(useOffset=False, style='plain')


In [None]:
# Position vs total_score, let's check the relationship of the ranking vs the total score
compare_df = times_epfl_df[['total_score', 'world_rank']]

percent_df = compare_df.pct_change()
percent_df = percent_df.dropna()
percent_df['year'] = times_epfl_df['year']

# It should be symmetrical about the x axis: if the score goes up, the rank number should go down.
ax = percent_df.plot(
    kind='line', 
    x='year', 
    xticks=range(2012, 2017),# hacky
    ylim=(-1, 1), 
)

ax.set_xticklabels(['2011-2012', '2012-2013', '2013-2014', '2014-2015', '2015-2016'])

percent_df
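
To make the percent-change step concrete, here is a minimal sketch that rebuilds the five EPFL rows from the Times preview by hand and applies `pct_change` to them. The values are copied from that output; the names `epfl_sample` and `opposite_moves` are introduced here only for illustration. It is a sanity check under those assumptions, not a replacement for the cell above.

```python
import pandas as pd

# Values copied from the Times preview of EPFL (2011-2015).
epfl_sample = pd.DataFrame(
    {
        "total_score": [66.5, 66.3, 73.0, 67.7, 70.9],
        "world_rank": [48, 46, 40, 37, 34],
    },
    index=pd.Index([2011, 2012, 2013, 2014, 2015], name="year"),
)

# pct_change computes (current - previous) / previous for each column;
# the first row has no predecessor, so it is NaN and gets dropped.
pct = epfl_sample.pct_change().dropna()

# True where score and rank moved in opposite directions that year.
pct["opposite_moves"] = (pct["total_score"] * pct["world_rank"]) < 0

print(pct.round(4))
```

A lower `world_rank` is better, so a negative change in rank is an improvement. In these rows the rank improves every year, including years where the total score dips, so the mirror-image pattern holds only in some years.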

In [None]:
# Plot evolution of each criterion of the ranking
bar_criteria_df = times_epfl_df[['year', 'teaching', 'international', 'research', 'citations', 'income']]
ax = bar_criteria_df.plot.bar(
    stacked=True,
    x='year'
)

ax.legend(loc=7, bbox_to_anchor=(1.4, 0.5)) # Tweaking until it looks nice, hacky

In [20]:
# Compute the Pearson correlation coefficient
# http://www.jmlr.org/papers/volume3/guyon03a/guyon03a.pdf
from scipy.stats import pearsonr

features = ['teaching', 'international', 'research', 'citations', 'income']
target = 'total_score'

for feature in features:
    # pearsonr returns (coefficient, p-value); keep only the coefficient
    coeff = pearsonr(times_epfl_df[feature], times_epfl_df[target])[0]
    print('Pearson correlation for ' + feature + ' coeff: ' + str(coeff))
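
For reference, the Pearson coefficient is the covariance of two series divided by the product of their standard deviations. The sketch below spells that out in plain Python on the five EPFL rows from the Times preview. The helper name `pearson` is hypothetical and used only for illustration; `scipy.stats.pearsonr` should give the same coefficient as its first return value, up to floating-point error.

```python
import math

def pearson(xs, ys):
    """Pearson r: covariance over the product of standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # The 1/n factors cancel between numerator and denominator,
    # so plain sums of deviations are enough.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Values copied from the Times preview of EPFL (2011-2015).
teaching = [55.0, 53.1, 62.4, 52.9, 54.7]
total_score = [66.5, 66.3, 73.0, 67.7, 70.9]

r = pearson(teaching, total_score)
print(f"Pearson correlation for teaching: {r:.3f}")
```

With only a handful of yearly observations, any such coefficient is a rough indication at best.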

In [None]:
times_epfl_df['international']

In [None]:
times_epfl_df['total_score']

In [None]:
sns.lmplot(x="international", y="total_score", data=times_epfl_df)
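
`sns.lmplot` draws a scatter plot together with a fitted linear regression line. To see the numbers behind such a line, the following sketch fits a degree-1 least-squares model with `np.polyfit` on the five EPFL rows from the Times preview. The arrays are retyped by hand from that output and the variable names are illustrative.

```python
import numpy as np

# Values copied from the Times preview of EPFL (2011-2015).
international = np.array([100.0, 98.9, 98.8, 98.2, 98.8])
total_score = np.array([66.5, 66.3, 73.0, 67.7, 70.9])

# Degree-1 least-squares fit: total_score ~ slope * international + intercept
slope, intercept = np.polyfit(international, total_score, deg=1)
fitted = slope * international + intercept

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

The `international` score barely varies across these rows (98.2 to 100.0), so the fitted slope is sensitive to individual points and should be read with caution.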


## Data manipulation: cleaning and shaping


In [5]:
# Shanghai 
shanghai_df = pd.read_csv('./shanghaiData.csv')

In [6]:
# Check the data
shanghai_df.head()

Unnamed: 0,world_rank,university_name,national_rank,total_score,alumni,award,hici,ns,pub,pcp,year
0,1,Harvard University,1,100.0,100.0,100.0,100.0,100.0,100.0,72.4,2005
1,2,University of Cambridge,1,73.6,99.8,93.4,53.3,56.6,70.9,66.9,2005
2,3,Stanford University,2,73.4,41.1,72.2,88.5,70.9,72.3,65.0,2005
3,4,"University of California, Berkeley",3,72.8,71.8,76.0,69.4,73.9,72.2,52.7,2005
4,5,Massachusetts Institute of Technology (MIT),4,70.1,74.0,80.6,66.7,65.8,64.3,53.0,2005


In [7]:
# Check the types
shanghai_df.dtypes

world_rank          object
university_name     object
national_rank       object
total_score        float64
alumni             float64
award              float64
hici               float64
ns                 float64
pub                float64
pcp                float64
year                 int64
dtype: object

In [9]:
epfl_name = "Swiss Federal Institute of Technology Lausanne"

In [13]:
shanghai_epfl_df = shanghai_df[shanghai_df['university_name'] == epfl_name]
shanghai_epfl_df

Unnamed: 0,world_rank,university_name,national_rank,total_score,alumni,award,hici,ns,pub,pcp,year
3993,96,Swiss Federal Institute of Technology Lausanne,5,24.2,0,0,26.1,29.5,43.3,38.6,2014
4515,101-150,Swiss Federal Institute of Technology Lausanne,5,,0,0,26.1,29.2,41.8,38.4,2015
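
The `object` dtype of `world_rank` comes from banded ranks such as `101-150` in the output above. One possible way to get a numeric column is to split on the dash and convert the first part. This sketch assumes that the lower bound of a band is an acceptable stand-in for the rank; the helper name `rank_lower_bound` and the sample series are illustrative, not part of the dataset.

```python
import pandas as pd

def rank_lower_bound(ranks: pd.Series) -> pd.Series:
    """Map ranks like '96' or '101-150' to the integer lower bound.

    Assumption: the lower end of a band stands in for the rank.
    Values that cannot be parsed become <NA> instead of raising.
    """
    first_part = ranks.astype(str).str.split("-").str[0].str.strip()
    return pd.to_numeric(first_part, errors="coerce").astype("Int64")

# Ranks as they appear for EPFL in the Shanghai data (2014 and 2015).
sample = pd.Series(["96", "101-150"], name="world_rank")
print(rank_lower_bound(sample).tolist())
```

Using the lower bound makes banded universities look slightly better than they are; the midpoint of the band would be an equally defensible choice.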


## CWUR

In [15]:
cwur_df = pd.read_csv('./cwurData.csv')

In [16]:
cwur_df.head()

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012


In [17]:
cwur_df.dtypes

world_rank                int64
institution              object
country                  object
national_rank             int64
quality_of_education      int64
alumni_employment         int64
quality_of_faculty        int64
publications              int64
influence                 int64
citations                 int64
broad_impact            float64
patents                   int64
score                   float64
year                      int64
dtype: object

In [18]:
cwur_epfl_df = cwur_df[cwur_df['institution'] == "Swiss Federal Institute of Technology in Lausanne"]

In [19]:
cwur_epfl_df

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year
68,69,Swiss Federal Institute of Technology in Lausanne,Switzerland,3,101,30,82,101,101,101,,74,47.68,2012
182,83,Swiss Federal Institute of Technology in Lausanne,Switzerland,4,101,101,85,101,101,83,,54,45.73,2013
292,93,Swiss Federal Institute of Technology in Lausanne,Switzerland,4,355,130,80,126,121,48,124.0,41,51.8,2014
1290,91,Swiss Federal Institute of Technology in Lausanne,Switzerland,2,367,74,116,124,114,39,110.0,58,51.47,2015
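
Note that the same school appears under a different name in each of the three files (compare the filters used for the Times, Shanghai and CWUR data). A minimal sketch of one way to harmonize them, assuming a hand-maintained alias table: the `ALIASES` dict, the `"EPFL"` label and the `canonical_name` helper are hypothetical, and only the spellings seen in this notebook are covered.

```python
# Spellings of EPFL as they appear in the three ranking files.
ALIASES = {
    "École Polytechnique Fédérale de Lausanne": "EPFL",           # Times
    "Swiss Federal Institute of Technology Lausanne": "EPFL",     # Shanghai
    "Swiss Federal Institute of Technology in Lausanne": "EPFL",  # CWUR
}

def canonical_name(name: str) -> str:
    """Return the shared label for a known alias, else the name unchanged."""
    cleaned = name.strip()
    return ALIASES.get(cleaned, cleaned)

print(canonical_name("Swiss Federal Institute of Technology Lausanne"))
```

On a DataFrame, the same helper could be applied with `df['university_name'].map(canonical_name)` (or on the `institution` column for CWUR) before filtering or merging.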


In [20]:
cwur_epfl_df.dtypes

world_rank                int64
institution              object
country                  object
national_rank             int64
quality_of_education      int64
alumni_employment         int64
quality_of_faculty        int64
publications              int64
influence                 int64
citations                 int64
broad_impact            float64
patents                   int64
score                   float64
year                      int64
dtype: object
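
With both EPFL subsets available, a first look at how the rankings compare is to line them up by year. The sketch below rebuilds the `year` and `world_rank` columns by hand from the Times and CWUR outputs shown earlier and joins them on `year`. The frame names `times_sample` and `cwur_sample` and the `gap` column are illustrative; in the notebook itself the same merge could be applied to `times_epfl_df` and `cwur_epfl_df`.

```python
import pandas as pd

# EPFL ranks copied from the Times preview (2011-2015).
times_sample = pd.DataFrame(
    {"year": [2011, 2012, 2013, 2014, 2015],
     "world_rank": [48, 46, 40, 37, 34]}
)

# EPFL ranks copied from the CWUR output (2012-2015).
cwur_sample = pd.DataFrame(
    {"year": [2012, 2013, 2014, 2015],
     "world_rank": [69, 83, 93, 91]}
)

# Inner join keeps only the years covered by both rankings.
both = times_sample.merge(
    cwur_sample, on="year", how="inner", suffixes=("_times", "_cwur")
)

# Positive gap: CWUR places EPFL lower (worse) than Times that year.
both["gap"] = both["world_rank_cwur"] - both["world_rank_times"]
print(both)
```

In these rows the two rankings drift apart between 2012 and 2014: the Times rank improves while the CWUR rank worsens.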