# Analyze the Times Higher Education World University Rankings for 2016

### This notebook uses the Times Higher Education (THE) World University Rankings data for 2016 to analyze the following:
- How the performance of the universities across different indicators is evaluated
- Which universities are the best, and where they are located
- What makes these universities stand out from the rest

This notebook runs on Python 3.

## Table of contents
1. [Import the libraries](#libraries)
2. [Import the dataset](#dataset)
3. [Tidy up the data](#tidy)
4. [Analyze the ranking data](#rank)
5. [Analyze the performance indicators](#performance)
6. [Remove outliers](#outlier)
7. [Correlation between the different indicators](#correlation)
8. [Next steps](#nextsteps)

## 1. Import the libraries<a id="libraries"></a>

In [None]:
# Import libraries and suppress warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')

## 2. Import the dataset<a id="dataset"></a>

Download the THE World University Ranking 2016 dataset as follows:
1. Go to <a href="https://www.kaggle.com/mylesoneill/world-university-rankings/data" target="_blank" rel="noopener noreferrer">Kaggle: World University Rankings</a>, log in, and then download `timesData.csv` to your computer.
2. Load the `timesData.csv` file into your notebook. Click the **Data** icon on the notebook action bar. Drop the file into the box or browse to select the file. The file is loaded to your object storage and appears in the Data Assets section of the project. For more information, see <a href="https://datascience.ibm.com/docs/content/analyze-data/load-and-access-data.html" target="_blank" rel="noopener noreferrer">Load and access data</a>.
3. To load the data from the `timesData.csv` file into a Pandas DataFrame, click in the next code cell, click the **Code Snippets** icon > **Read data**, and browse your project to locate the dataset. Then select **Load as** > **Pandas DataFrame**.
4. Run the cell. The cells that follow expect the DataFrame to be named `df_data_1`.
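
If you run this notebook outside Watson Studio, the load step can be approximated with plain pandas. The following is a minimal sketch, not the code that Watson Studio generates: `load_times_data` is a hypothetical helper name, and the two demo rows are made up so that the example runs without the real file.

```python
import io
import pandas as pd

def load_times_data(source):
    """Read the rankings CSV from a file path or a file-like object."""
    return pd.read_csv(source)

# Tiny in-memory stand-in for timesData.csv (made-up rows, for illustration only)
demo_csv = io.StringIO(
    "world_rank,university_name,year\n"
    "1,Example University,2016\n"
    "=39,Sample Institute,2015\n"
)
demo = load_times_data(demo_csv)
print(demo[demo["year"] == 2016])
```

With the real file in the working directory, `df_data_1 = load_times_data("timesData.csv")` would give the later cells the variable name they expect.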

In [None]:
# empty cell



We will consider the year 2016 for our analysis because, according to the detailed methodology <a href="https://www.timeshighereducation.com/news/ranking-methodology-2016" target="_blank" rel="noopener noreferrer">here</a>, the underlying data for calculating rankings changed in 2016.

In [None]:
# Take an explicit copy so that later in-place edits do not act on a view of df_data_1
df1 = df_data_1[df_data_1['year'] == 2016].copy()

In [None]:
# Let's get a brief overview of the data
df1.info()

As we can see, the THE dataset captures a lot of information about each university, but note these two observations:
- Many columns that should be of type `int` or `float` are instead of type `object`, which means we need to examine them more closely before doing any analysis
- Data is missing from some columns
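
Both observations can be checked directly. The following is a minimal sketch on a made-up three-row frame (the values are illustrative, not taken from `timesData.csv`); the same two calls can be run against `df1`.

```python
import numpy as np
import pandas as pd

# Made-up sample that mimics the mixed types in the THE data
sample = pd.DataFrame({
    "teaching": [95.6, 86.5, 92.5],
    "income": ["97.8", "-", "73.1"],
    "num_students": ["2,243", "19,919", np.nan],
})

# Columns that pandas did not parse as numbers
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Number of missing values per column
missing_counts = sample.isna().sum()

print(non_numeric_cols)
print(missing_counts)
```

Placeholder strings such as `-` are not counted by `isna()`, which is why they need to be handled explicitly when tidying the data.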

Let's have a look at the data:

In [None]:
df1.describe()

## 3. Tidy up the data<a id="tidy"></a>

If a value for the indicator **income** is not provided, we assign it a value of <i>20</i>. This is in line with THE methodology explained <a href="https://www.timeshighereducation.com/news/ranking-methodology-2016#survey-answer" target="_blank" rel="noopener noreferrer">here</a>:

<i>On the rare occasions when a particular data point is not provided – which affects only low-weighted indicators such as industrial income – we enter a low estimate between the average value of the indicators and the lowest value reported: the 25th percentile of the other indicators. By doing this, we avoid penalising an institution too harshly with a “zero” value for data that it overlooks or does not provide, but we do not reward it for withholding them.</i>

In [None]:
for x in range(len(df1)):
    idx = df1.iloc[x].name
    inc = df1['income'].iloc[x]
    students = df1['num_students'].iloc[x]
    int_students = df1['international_students'].iloc[x]
    fmr = df1['female_male_ratio'].iloc[x]
    
    # Replace the '-' placeholder in income with a low estimate
    if isinstance(inc, str) and "-" in inc:
        df1.at[idx, 'income'] = 20 # Arbitrary low value
    
    # Format the numbers properly
    if type(students) != float:
        students1 = students.replace(",", "")
        df1.at[idx, 'num_students'] = students1
        
    # Remove the '%' sign 
    if int_students and type(int_students) != float:
        int_students1 = int_students.replace("%", "")
        df1.at[idx, 'international_students'] = int_students1
        
    # Convert ratios of the form '43:57' or '43 : 57' to decimal ratios and convert '-' to NaN
    if type(fmr) != float:
        if ":" in fmr:
            # Treat ':' as a separator so that both spaced and unspaced ratios parse
            arr2 = [float(part) for part in fmr.replace(":", " ").split() if part.isdigit()]
            if arr2[1] != 0:
                ratio1 = arr2[0]/arr2[1]
                df1.at[idx, 'female_male_ratio'] = ratio1
            else:
                # Avoid division by zero when there are no male students
                df1.at[idx, 'female_male_ratio'] = 100.0 # One university has all female students
        else:
            df1.at[idx, 'female_male_ratio'] = np.nan
            


In [None]:
# Let's change the type of each of the columns to float, for easier analysis.
df1['international'] = df1['international'].astype(float)
df1['income'] = df1['income'].astype(float)
df1['num_students'] = df1['num_students'].astype(float)
df1['international_students'] = df1['international_students'].astype(float)
df1['female_male_ratio'] = df1['female_male_ratio'].astype(float)
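
The row-by-row cleanup and the type conversion can also be expressed with vectorized pandas operations. The following is a minimal sketch on made-up values, not a replacement for the cells above: `parse_ratio` is a hypothetical helper, and the fallback of 20 for income and 100.0 for an all-female university simply mirror the choices made in the loop.

```python
import numpy as np
import pandas as pd

def parse_ratio(value):
    """Convert 'F : M' strings to F/M; anything else becomes NaN."""
    if not isinstance(value, str) or ":" not in value:
        return np.nan
    female, male = (float(part) for part in value.split(":"))
    # 100.0 mirrors the placeholder used in the loop for an all-female university
    return female / male if male != 0 else 100.0

# Made-up rows in the same raw formats as the THE columns
sample = pd.DataFrame({
    "income": ["45.2", "-"],
    "num_students": ["2,243", "19,919"],
    "international_students": ["27%", "34%"],
    "female_male_ratio": ["33 : 67", "-"],
})

clean = pd.DataFrame({
    # Non-numeric placeholders become NaN, then get the low estimate of 20
    "income": pd.to_numeric(sample["income"], errors="coerce").fillna(20),
    # Strip thousands separators before converting
    "num_students": sample["num_students"].str.replace(",", "", regex=False).astype(float),
    # Strip the trailing percent sign before converting
    "international_students": sample["international_students"].str.rstrip("%").astype(float),
    "female_male_ratio": sample["female_male_ratio"].map(parse_ratio),
})
print(clean)
```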

## 4. Analyze the ranking data<a id="rank"></a>

To calculate the final score for a university, different performance indicators have been given different weighting by THE. Let's analyze the data and do the following:
- Calculate how many types of ranks we have currently
- Calculate the total score for each entry based on the formula provided by THE
- Sort the universities by the new score
- Calculate how many universities have an equal score
- Find out which are the best universities in the world according to THE
- Visualize which countries have the best universities

<b>Calculate how many types of ranks we have currently:</b>

In [None]:
rank_int = [x for x in df1["world_rank"] if x.isdigit()]
rank_eq = [x for x in df1["world_rank"] if "=" in x]
rank_hyph = [x for x in df1["world_rank"] if "-" in x]
print("Simple numeric ranks ", len(rank_int))
print("Ranks with = sign ", len(rank_eq))
print("Ranks with - sign ", len(rank_hyph))

print(sorted(set(rank_hyph)))

Only the top 200 universities are actually given ranks. The rest are grouped into classes.
Some universities are ranked equally.
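
The three rank formats counted above can be told apart with a small classifier. The following is a sketch only: `rank_kind` is a hypothetical helper, and the example strings are assumed shapes of the three formats rather than values read from the dataset.

```python
def rank_kind(rank):
    """Classify a world_rank string as 'numeric', 'tied', or 'band'."""
    if rank.isdigit():
        return "numeric"      # a plain rank such as '1'
    if rank.startswith("="):
        return "tied"         # a shared rank such as '=39'
    if "-" in rank:
        return "band"         # a range of ranks such as '201-250'
    return "other"

examples = ["1", "=39", "201-250", "601-800"]
kinds = {rank: rank_kind(rank) for rank in examples}
print(kinds)
```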

<b>Calculate the total score for each entry based on the formula provided by THE:</b>

In [None]:
for x in range(len(df1)):
    idx = df1.iloc[x].name  
    new_score = df1.loc[idx,"teaching"]*0.3 + df1.loc[idx,"research"]*0.3 + df1.loc[idx,"citations"]*0.3 + df1.loc[idx,"international"]*0.075 + df1.loc[idx,"income"]*0.025
    df1.at[idx, 'total_score'] = new_score
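
The same weighted sum can be computed without a Python loop. The following is a minimal sketch on made-up indicator values; `weighted_score` is a hypothetical helper, and the weights are the ones used in the loop above.

```python
import pandas as pd

# Weights as used in the loop above
WEIGHTS = {
    "teaching": 0.3,
    "research": 0.3,
    "citations": 0.3,
    "international": 0.075,
    "income": 0.025,
}

def weighted_score(frame, weights=WEIGHTS):
    """Return the weighted sum of the indicator columns for every row."""
    cols = list(weights)
    # skipna=False keeps a missing indicator as NaN, as the loop would
    return frame[cols].mul(pd.Series(weights), axis="columns").sum(axis=1, skipna=False)

# Made-up indicator scores for two universities
sample = pd.DataFrame({
    "teaching": [100.0, 80.0],
    "research": [100.0, 90.0],
    "citations": [100.0, 70.0],
    "international": [100.0, 60.0],
    "income": [100.0, 40.0],
})
scores = weighted_score(sample)
print(scores)
```

On the notebook's data this would be used as `df1['total_score'] = weighted_score(df1)`.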
    

<b>Sort the universities by the new score:</b>

In [None]:
df1.sort_values('total_score', ascending=False, inplace=True)

<b>Calculate how many universities have an equal score:</b>

In [None]:
dups = df1["total_score"].value_counts()
dups[dups > 1]

In [None]:
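# Reset world_rank to the position in the sorted order (1..N), stored as float so that tie offsets such as 40.1 fit
df1['world_rank'] = np.arange(1, len(df1) + 1, dtype=float)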

For each university whose score duplicates one already seen, add 0.1 to its rank (for example, 40 becomes 40.1). This keeps the ranks numeric for further analysis and flags tied universities rather than presenting them as distinctly ranked (for example, 39 and 40).

In [None]:
arr1 = []

for x in range(len(df1)):
    rank = x + 1
    score = df1['total_score'].iloc[x]
    idx = df1.iloc[x].name    
    if score not in arr1:
        arr1.append(score)
    else:
        rank1 = rank + 0.1
        df1.at[idx, 'world_rank'] = rank1
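
An alternative way to handle ties, shown here only as a sketch and not as what this notebook does, is to let pandas assign the same rank to equal scores with `Series.rank(method="min")`. The scores and labels below are made up.

```python
import pandas as pd

# Made-up scores; B and C are tied
scores = pd.Series([95.0, 90.0, 90.0, 85.0], index=["A", "B", "C", "D"])

# Highest score gets rank 1; tied scores share the lowest rank of their group
shared_rank = scores.rank(method="min", ascending=False)

# All entries whose score occurs more than once
tied = scores[scores.duplicated(keep=False)]

print(shared_rank)
print(tied)
```

With this approach B and C both receive rank 2 and D receives rank 4, so no fractional offsets are needed.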


<b>Now let's see which are the best universities in the world according to THE:</b>

In [None]:
df1[df1["world_rank"] <= 10].sort_values(["world_rank"])

All of them have a very high score in <b>research</b>, <b>teaching</b> and <b>citations</b>.

<b>Visualize which countries have the best universities:</b>

In [None]:
df1["country"][df1["world_rank"] < 200].value_counts().plot(kind='bar',color='gold')

The US and UK lead the list by far, followed by Germany.

Take a new look at the data we now have:

In [None]:
df1.describe()

## 5. Analyze the performance indicators<a id="performance"></a>

Analyze the relationship between ranking and these performance indicators:
- Citations
- Research
- Teaching
- Income
- International
- Number of students
- Student to staff ratio
- International students
- Female to male student ratio

In [None]:
cols = ["citations", "research", "teaching", "income", "international", "num_students", "student_staff_ratio", "international_students", "female_male_ratio"]
for col in cols:
    df2 = df1[pd.notnull(df1[col])]
    plt.figure()
    plt.plot(df2["world_rank"], df2[col], "o")
    plt.xlabel("Rank", fontsize=12)
    plt.ylabel(col, fontsize=12)

## 6. Remove outliers<a id="outlier"></a>

Based on the graphs above, we can clearly see some outliers in the data. These are:
- Universities with a very large number of students
- A female-only university

Remove the outliers before proceeding:

In [None]:
df1[df1["num_students"] > 200000].sort_values(["num_students"], ascending=[False])

In [None]:
df1[df1["student_staff_ratio"] > 100].sort_values(["student_staff_ratio"], ascending=[False])

In [None]:
df1[df1["international_students"] > 80].sort_values(["international_students"], ascending=[False])

In [None]:
df1[df1["female_male_ratio"] > 90].sort_values(["female_male_ratio"], ascending=[False])

In [None]:
drop_idx = [2413, 2430, 2562, 2411, 2227]
df1 = df1.drop(drop_idx, axis=0)
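
Hard-coded index labels only work for this exact copy of the CSV. The following sketch filters by threshold instead; it assumes the outliers are exactly the rows that exceed the thresholds used in the inspection cells above, which may not match the hand-picked `drop_idx` list. `outlier_mask` is a hypothetical helper and the sample rows are made up.

```python
import numpy as np
import pandas as pd

# Thresholds copied from the inspection cells above
THRESHOLDS = {
    "num_students": 200000,
    "student_staff_ratio": 100,
    "international_students": 80,
    "female_male_ratio": 90,
}

def outlier_mask(frame, thresholds=THRESHOLDS):
    """Return True for rows that exceed any threshold; NaN never counts as exceeding."""
    mask = pd.Series(False, index=frame.index)
    for col, limit in thresholds.items():
        mask |= frame[col] > limit
    return mask

# Made-up rows: 11, 12 and 13 each break one threshold
sample = pd.DataFrame({
    "num_students": [15000.0, 379231.0, 20000.0, 9000.0, 12000.0],
    "student_staff_ratio": [12.0, 15.0, 162.6, 10.0, np.nan],
    "international_students": [20.0, 1.0, 5.0, 10.0, 30.0],
    "female_male_ratio": [1.1, 0.9, 1.0, 100.0, np.nan],
}, index=[10, 11, 12, 13, 14])

mask = outlier_mask(sample)
kept = sample[~mask]
print(kept)
```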

## 7. Correlation between the different indicators<a id="correlation"></a>
Now we can look at the correlation between ranking score and performance indicators.

In [None]:
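# Correlate numeric columns only; text columns such as country cannot be correlated
corrmat = df1.select_dtypes(include="number").corr()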
f, ax = plt.subplots(figsize=(12, 7))
sns.heatmap(corrmat, vmax=.5, square=True);

This is a helpful graph! A number of observations here:
-  The rank of a university is most negatively correlated with the <b>citation</b> score; that is, the higher the citation score, the better the rank.
-  The citation score is highly correlated with <b>teaching</b>/<b>research</b>, which is expected.
-  <b>international_students</b> is highly correlated with <b>teaching</b>/<b>research</b>. Again, this is expected, because international students are attracted by a university's research and faculty reputation.
-  <b>international_students</b> is highly correlated with <b>international</b>, which represents the international outlook of the university (including staff and international research collaborations).
-  <b>num_students</b> is highly correlated with <b>student_staff_ratio</b>: the greater the number of students, the greater the student_staff_ratio tends to be.
-  <b>student_staff_ratio</b> is slightly negatively correlated with <b>teaching</b>, which is consistent with teaching quality suffering when there are more students per teacher.
-  <b>research</b> is correlated with <b>income</b>. Because <b>income</b> measures how much <i>"research income an institution earns from industry"</i>, better research tends to go with more income.
-  There is an interesting slight negative correlation between <b>female_male_ratio</b> and <b>income</b>, which means that universities with higher industry income tend to have a larger share of male students. Because income is correlated with research, this may hint that more men than women are in research-heavy fields, but a correlation alone cannot establish that.
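
Observations like these are easier to verify from numbers than from colors. The following is a minimal sketch that lists each numeric column's correlation with the rank, sorted from most negative to most positive; the five rows are made up, and on the notebook's data `sample` would be replaced by `df1`.

```python
import pandas as pd

# Made-up rows: citations fall as the rank number grows
sample = pd.DataFrame({
    "world_rank": [1.0, 2.0, 3.0, 4.0, 5.0],
    "citations": [99.8, 98.6, 97.0, 90.1, 85.0],
    "num_students": [2243.0, 19919.0, 15596.0, 18812.0, 11074.0],
    "country": ["A", "B", "A", "B", "A"],
})

corr_with_rank = (
    sample.select_dtypes(include="number")  # drop text columns such as country
    .corr()["world_rank"]                   # one column of the correlation matrix
    .drop("world_rank")                     # a column always correlates 1.0 with itself
    .sort_values()                          # most negative first
)
print(corr_with_rank)
```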

However, the highest correlation is between <b>teaching</b> and <b>research</b>. This is an expected correlation, but THE considers teaching and research to be different indicators when calculating the final score and gives them equal weighting. 
What if we simply omit the research score and double the weighting for teaching? Do the scores change by much?

In [None]:
mod_score = []

for x in range(len(df1)):
    idx = df1.iloc[x].name  
    new_score = df1.loc[idx,"teaching"]*0.6 + df1.loc[idx,"citations"]*0.3 + df1.loc[idx,"international"]*0.075 + df1.loc[idx,"income"]*0.025
    mod_score.append(new_score)

In [None]:
df1['total_score'] = df1['total_score'].astype(float)
np.corrcoef(mod_score, df1["total_score"])[0,1]

There is still a correlation of about 0.99! It seems that the research parameter adds little independent information to the total score, so THE could arguably save itself some work and simplify its surveys by removing it, because it does not significantly alter the scores.
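
A high correlation between scores does not by itself show that the ordering of universities is unchanged. The following sketch compares the two orderings directly: it computes the Spearman rank correlation (as the Pearson correlation of the ranks) and counts how many entries change position. `rank_agreement` is a hypothetical helper and the scores are made up; on the notebook's data it would be called as `rank_agreement(df1["total_score"].tolist(), mod_score)`.

```python
import pandas as pd

def rank_agreement(original, modified):
    """Return (Spearman correlation, number of entries whose rank changed)."""
    orig_rank = pd.Series(original).rank(ascending=False)
    mod_rank = pd.Series(modified).rank(ascending=False)
    # Pearson correlation of the ranks is the Spearman rank correlation
    spearman = orig_rank.corr(mod_rank)
    moved = int((orig_rank != mod_rank).sum())
    return spearman, moved

# Made-up scores: the last two entries swap places under the modified formula
spearman, moved = rank_agreement([90.0, 80.0, 70.0, 60.0], [88.0, 82.0, 65.0, 66.0])
print(spearman, moved)
```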

## 8. Next steps <a id="nextsteps"></a>

Here are some suggestions about how we could further analyze the ranking data:
- Explore the relationship between research and citation; does a high research score always mean a high citation score?
- Explore the gender ratio and research score; are there actually fewer women in research fields? Filter out science universities from the list and check the gender ratio.
- Compare the rankings to the 2017 rankings.
- Correlate the rankings to the countries' education budget. This requires you to get the country expenditure details.

### Citations
- <a href="https://www.timeshighereducation.com/world-university-rankings" target="_blank" rel="noopener noreferrer">THE world university rankings</a>
- <a href="https://www.timeshighereducation.com/news/ranking-methodology-2016" target="_blank" rel="noopener noreferrer">THE world university ranking methodology 2016</a>
- <a href="https://www.kaggle.com/mylesoneill/world-university-rankings/data" target="_blank" rel="noopener noreferrer">World University Rankings: Investigate the best universities in the world</a>

### Author
**Vaibhav Mathur** is a QA Specialist at IBM, India.

Copyright © IBM Corp. 2017, 2018. This notebook and its source code are released under the terms of the MIT License.