In [None]:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup  # Check description below if not already installed
import seaborn as sns

%pylab inline

sns.set_palette('Set2', 8)
sns.set_context("notebook")

# 0. Description

## Background
In this homework we will extract interesting information from www.topuniversities.com and www.timeshighereducation.com, two platforms that maintain a global ranking of worldwide universities. This ranking is not offered as a downloadable dataset, so you will have to find a way to scrape the information we need!
You are not allowed to download manually the entire ranking -- rather you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this [brief tutorial](https://www.youtube.com/watch?v=jBjXVrS8nXs&list=PLM-7VG-sgbtD8qBnGeQM5nvlpqB_ktaLZ&autoplay=1) to understand quickly how to use it.

## Assignment
1. Obtain the 200 top-ranking universities in www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)). In particular, extract the following fields for each university: `name`, `rank`, `country` and `region`, `number of faculty members` (international and total) and `number of students` (international and total). Some information is not available in the main list and you have to find them in the [details page](https://www.topuniversities.com/universities/ecole-polytechnique-fédérale-de-lausanne-epfl).
Store the resulting dataset in a pandas DataFrame and answer the following questions:
- Which are the best universities in terms of: (a) ratio between faculty members and students, (b) ratio of international students?
- Answer the previous question aggregating the data by (c) country and (d) region.

Plot your data using bar charts and describe briefly what you observed.

2. Obtain the 200 top-ranking universities in www.timeshighereducation.com ([ranking 2018](http://timeshighereducation.com/world-university-rankings/2018/world-ranking)). Repeat the analysis of the previous point and discuss briefly what you observed.

3. Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

4. Find useful insights in the data by performing an exploratory analysis. Can you find a strong correlation between any pair of variables in the dataset you just created? Example: when a university is strong in its international dimension, can you observe a consistency both for students and faculty members?

5. Can you find the best university taking in consideration both rankings? Explain your approach.

Hints:
- Keep your Notebook clean and don't print the verbose output of the requests if this does not add useful information for the reader.
- In case of tie, use the order defined in the webpage.

## BeautifulSoup

BeautifulSoup is used to extract tags and information from web pages. To install the package, use the following command:

```
conda install beautifulsoup4
```

# 1. topuniversities.com

Let's first focus on the first website. If you go to www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)) you will see the 25 top-ranked universities. However, if you look at the page source (the HTML of the web page), you will not find any occurrence of MIT's ranking information. Let's now open the developer console (Ctrl-Shift-K on Firefox, Ctrl-Shift-I on Chrome). Its Network tab (on both Firefox and Chrome) shows the web traffic, i.e. every request that is sent in order to display the web page properly.


We can see that most of them are CSS files (*.css), JavaScript files (*.js) or images (*.png). The website loads an image for each university, therefore we can assume that it must already know the ranking in order to do so.

## 1.1 Main ranking information
If we look at the requests that happen just before, we can find an XHR (XMLHttpRequest) to https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt. It contains the universities' names and attributes. If you click on the link you can see that it is a JSON file. The first level has a key named `data`. `data` is a list and each element is a university with the tags displayed below. We keep `country`, `rank_display`, `score`, `title`, `region` and `url`.


In [None]:
r = requests.get('https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt')
print('Tags of JSON 1st level: {}'.format(r.json().keys()))
print('Tags of JSON 2nd level: {}'.format(r.json()['data'][0].keys()))

In [None]:
top_uni_df = pd.DataFrame(r.json()['data'], columns=['title', 'country', 'region', 'rank_display', 'score', 'url'])
top_uni_df.iloc[25:30]

Note that some universities share the same rank. In this case the rank is displayed as `=X`, where `X` is the rank number. We cannot cast it directly to an integer; we have to remove the `=` first. We also complete the URL of each university as `url` = `https://www.topuniversities.com` + `url` + `#wurs`, following the href links on the website.
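
The cleaning rule can be tried on a few hand-written rows before applying it to the real column. This is a minimal sketch using plain Python; the university names and paths are hypothetical placeholders, and `parse_rank` and `full_url` are illustrative helpers, not part of the notebook.

```python
BASE_URL = "https://www.topuniversities.com"

def parse_rank(rank_display):
    """Turn a displayed rank such as '28' or '=28' into an integer."""
    return int(rank_display.lstrip("= "))

def full_url(path):
    """Build the details-page URL from the relative path found in the JSON."""
    return BASE_URL + path + "#wurs"

# Hypothetical rows, listed in the order shown on the web page
page_order = [("Uni A", "1"), ("Uni B", "=2"), ("Uni C", "=2"), ("Uni D", "4")]

# sorted() is stable, so tied universities keep their page order
ordered = sorted(page_order, key=lambda row: parse_rank(row[1]))

print([name for name, _ in ordered])
print(full_url("/universities/example"))
```

Because Python's `sorted` is stable, ties stay in the order defined on the web page, as the hints ask. In pandas, passing `kind='mergesort'` to `sort_values` requests a stable sort as well.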

In [None]:
top_uni_df['score'] = pd.to_numeric(top_uni_df['score'], errors='coerce')
top_uni_df['rank_display'] = pd.to_numeric(top_uni_df['rank_display'].map(lambda x: x.lstrip('=')), errors='coerce')
top_uni_df['url'] = top_uni_df['url'].map(lambda x: 'https://www.topuniversities.com' + x + '#wurs')
top_uni_df = top_uni_df.sort_values(by='score', ascending=False).iloc[:200]
top_uni_df.head()

We perform a sanity check. We can see that there are no NaN values in our dataframe

In [None]:
np.sum(top_uni_df.isnull())

## 1.2 Detailed university information

We still need to find the `number of faculty members` (international and total) and the `number of students` (international and total). To do so we will use the `url` field, which contains the URL of the page that describes the university. If we look, for example, at the [MIT](https://www.topuniversities.com/universities/massachusetts-institute-technology-mit#wurs) page, we can see plots displaying the number of academic faculty staff (our `number of faculty members`) and the `number of students`.

Looking at the source code shows that those values are hard-coded in the web page. We can therefore use BeautifulSoup (BS4) to get the information directly. All values are located inside `<div>` tags that have `class='number'`.
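
To illustrate the extraction step without querying the website, here is a small sketch that pulls the text of every `<div class="number">` out of an HTML fragment. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. The HTML fragment, its inner class names (`total`, `inter`) and the figures are made up for the example; only `faculty-main` and `number` come from the page structure described above.

```python
from html.parser import HTMLParser

class NumberDivParser(HTMLParser):
    """Collect the text of every <div class="number"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside a matching div
        self.numbers = []   # raw text of each matching div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1             # nested div inside a match
        elif "number" in classes:
            self.depth = 1
            self.numbers.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.numbers[-1] += data

def parse_counts(html):
    """Return the integers found in <div class="number"> tags, commas removed."""
    parser = NumberDivParser()
    parser.feed(html)
    return [int(text.strip().replace(",", "")) for text in parser.numbers]

# Made-up fragment that mimics the layout described in the text
sample = """
<div class="faculty-main">
  <div class="total"><div class="number"> 1,200 </div></div>
  <div class="inter"><div class="number"> 345 </div></div>
</div>
"""
print(parse_counts(sample))
```

The thousands separator has to be removed before the cast, which is why the notebook calls `.replace(',', '')` on each value.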

In [None]:
from log import log_progress  # Fancy progress display
# Will contain the university info as [faculty_total, faculty_international, student_total, student_international]
infos = np.ones((len(top_uni_df), 4))*np.nan  
for i, url in log_progress(enumerate(top_uni_df['url']), every=1, size=len(top_uni_df)): 
    try:
        # Get request for specific university
        r_uni = requests.get(url)
        soup = BeautifulSoup(r_uni.text, 'html.parser')
        # Parse the page to get the faculty and student figures (according to the website's HTML structure)
        faculty = soup.find('div', class_='faculty-main')
        infos[i, 0:2] = [int(val.text.replace(',', '')) for val in faculty.find_all('div', class_='number')]
        infos[i, 2] = int(soup.find('div', class_='students-main').find('div', class_='number').text.replace(',', ''))
        infos[i, 3] = int(soup.find('div', class_='int-students-main').find('div', class_='number').text.replace(',', ''))
    except Exception as e:
        print('Unable to find fields:', url)

Once the requests have been performed and the values loaded, we can assign them to their respective fields. Note that [New York University](https://www.topuniversities.com/universities/new-york-university-nyu#wurs) does not have information about students and faculty members, so its values are set to NaN. We also save the results to avoid crawling the website at each run.

In [None]:
top_uni_df['faculty_tot'] = infos[:, 0]
top_uni_df['faculty_int'] = infos[:, 1]
top_uni_df['student_tot'] = infos[:, 2]
top_uni_df['student_int'] = infos[:, 3]
top_uni_df.set_index(['title'], inplace=True)
top_uni_df.to_csv('top_uni.csv')  # Backup (just run fetch once)
top_uni_df.head()

We can check if our index `title` is unique

In [None]:
print('Is index unique: {}'.format(top_uni_df.index.is_unique))

## 1.3 Results 

We can load our CSV file and set `title` as the index column. This allows us to check that all entries are unique (which should of course be the case).

### (a) - (b)

We don't have to create new fields: all the values are already present in our DataFrame. We express (a) the ratio between faculty members and students as `faculty_tot`/`student_tot` and (b) the ratio of international students as `student_int`/`student_tot`.

In [None]:
from textwrap import wrap
N_TOP = 8
def nice_bar_plot(data, ax, title='', y_axis=''):
    ax.set_title(title , fontsize=12, fontweight='bold')
    ax.set_ylabel(y_axis)
    labels = [ '\n'.join(wrap(l, 15)) for l in data.index ]
    sns.barplot(x=labels, y=data.values, ax=ax)
    for lab in ax.get_xticklabels():
        lab.set_rotation(45)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16,6))
# Ratio faculty_tot / student_tot
fac_vs_stu = top_uni_df['faculty_tot'].div(top_uni_df['student_tot']).sort_values(ascending=False).iloc[:N_TOP]
nice_bar_plot(fac_vs_stu, title='Ratio faculty members vs Students', y_axis='ratio', ax=axes[0])
# Ratio student_int / student_tot
stu_vs_int = top_uni_df['student_int'].div(top_uni_df['student_tot']).sort_values(ascending=False).iloc[:N_TOP]
nice_bar_plot(stu_vs_int, title='Ratio International Students vs Total Students', y_axis='ratio', ax=axes[1])
plt.tight_layout()

#### (c) `Country`
Same logic as before, except we need to group entries by `country`
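
Note that grouping first and dividing afterwards gives a ratio of sums, which is not the same as averaging the per-university ratios. The sketch below shows the difference on made-up figures in plain Python; the country labels and counts are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: (country, faculty_tot, student_tot)
rows = [
    ("A", 100, 1000),
    ("A", 900, 3000),
    ("B", 200, 1000),
]

# Ratio of sums: total faculty of a country divided by its total students
faculty = defaultdict(float)
students = defaultdict(float)
for country, fac, stu in rows:
    faculty[country] += fac
    students[country] += stu
ratio_of_sums = {c: faculty[c] / students[c] for c in faculty}

# Mean of ratios: every university counts equally, whatever its size
per_uni = defaultdict(list)
for country, fac, stu in rows:
    per_uni[country].append(fac / stu)
mean_of_ratios = {c: sum(v) / len(v) for c, v in per_uni.items()}

print(ratio_of_sums)
print(mean_of_ratios)
```

The ratio of sums weights large universities more heavily, which fits a question about a country as a whole. The mean of ratios would be the better choice if each university should count equally.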

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16,6))
# Ratio faculty_tot / student_tot (country)
fac_vs_stu = top_uni_df.groupby('country')['faculty_tot'].sum().div(top_uni_df.groupby('country')['student_tot'].sum())\
                .sort_values(ascending=False).iloc[:N_TOP]
nice_bar_plot(fac_vs_stu, title='Ratio faculty members vs Students (country)', y_axis='ratio', ax=axes[0])
# Ratio student_int / student_tot
stu_vs_int = top_uni_df.groupby('country')['student_int'].sum().div(top_uni_df.groupby('country')['student_tot'].sum())\
                .sort_values(ascending=False).iloc[:N_TOP]
nice_bar_plot(stu_vs_int, title='Ratio International Students vs Total Students (country)', y_axis='ratio', ax=axes[1])
plt.tight_layout()

#### (d) `Region`

Same logic as before, except we need to group entries by `region`

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16,6))
# Ratio faculty_tot / student_tot (region)
fac_vs_stu = top_uni_df.groupby('region')['faculty_tot'].sum().div(top_uni_df.groupby('region')['student_tot'].sum())\
                .sort_values(ascending=False).iloc[:N_TOP]
nice_bar_plot(fac_vs_stu, title='Ratio faculty members vs Students (region)', y_axis='ratio', ax=axes[0])
# Ratio student_int / student_tot
stu_vs_int = top_uni_df.groupby('region')['student_int'].sum().div(top_uni_df.groupby('region')['student_tot'].sum())\
                .sort_values(ascending=False).iloc[:N_TOP]
nice_bar_plot(stu_vs_int, title='Ratio International Students vs Total Students (region)', y_axis='ratio', ax=axes[1])
plt.tight_layout()

## 2. timeshighereducation.com

We can now look at the second website: https://www.timeshighereducation.com. On the [ranking 2018](https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats) page we can also see the first 25 ranked universities. By inspecting the source code of the web page, you can see that the information we want to retrieve is not hard-coded but is fetched from other links, as for the first website.

### 2.1 Main ranking information

We used the same approach as for the topuniversities website and used the console to see the requests sent by the page when it loads. We identified a [link](https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json) to a JSON file. <br>
While inspecting this file, we found 4 first-level keys, in particular one called `data` that contains the tags we are looking for, i.e. `name`, `location`, `rank` and `stats_number_students`. We also found that we can retrieve the student-staff ratio, `stats_student_staff_ratio`, which will be used to compute the number of faculty members later on. There is also the percentage of international students, `stats_pc_intl_students`, which we can use to compute the number of international students. We decided to keep the overall score of each university, `scores_overall`, as we think it can be useful for the second part of the exercise.

We then store all of this data in a DataFrame.

In [None]:
URL = 'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json'
r = requests.get(URL)
print('Tags of JSON 1st level: {}'.format(r.json().keys()))
print('Tags of JSON 2nd level: {}'.format(r.json()['data'][0].keys()))

In [None]:
the_df = pd.DataFrame(r.json()['data'], columns=['name', 'location', 'rank', 'scores_overall', 'stats_number_students', 'stats_pc_intl_students', 'stats_student_staff_ratio'])
the_df.head()

### 2.2 Clean the data

We have extracted raw data from the JSON file; now we need to perform some computations on certain columns so that the data matches that of the other website. To do so, we convert the columns `stats_number_students`, `stats_pc_intl_students` and `stats_student_staff_ratio` to numerical values, then compute `student_int`, the number of international students, and `faculty_tot`, the number of faculty members. We also get rid of the "=" in the `rank` column, which marks a rank shared by two or more universities.
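
The two derived quantities follow from `student_int = student_tot * pct_intl / 100` and `faculty_tot = student_tot / student_staff_ratio`. Here is a worked example on a single made-up record, written in plain Python; `to_number` and `derive_counts` are illustrative helpers and the input strings only imitate the formats described above.

```python
def to_number(text):
    """Parse strings such as '20,000', '25%' or '10.0' into a float."""
    return float(text.replace(",", "").rstrip("%"))

def derive_counts(n_students, pct_intl, student_staff_ratio):
    """Return (student_int, faculty_tot) rounded to whole persons."""
    total = to_number(n_students)
    student_int = round(total * to_number(pct_intl) / 100)
    faculty_tot = round(total / to_number(student_staff_ratio))
    return student_int, faculty_tot

# 20,000 students, 25% of them international, 10 students per staff member
print(derive_counts("20,000", "25%", "10.0"))  # (5000, 2000)
```

Since the website only publishes a rounded percentage and a rounded ratio, the derived head counts are estimates rather than exact figures.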

In [None]:
the_df['scores_overall'] = pd.to_numeric(the_df['scores_overall'], errors='coerce')
the_df['rank'] = pd.to_numeric(the_df['rank'].map(lambda x: x.lstrip('=')), errors='coerce')
the_df['stats_pc_intl_students'] = pd.to_numeric(the_df['stats_pc_intl_students'].map(lambda x: x.rstrip('%')), errors='coerce')
the_df['stats_number_students'] = pd.to_numeric(the_df['stats_number_students'].str.replace(',', ''), errors='coerce')
the_df['stats_student_staff_ratio'] = pd.to_numeric(the_df['stats_student_staff_ratio'], errors='coerce')
# Derive the number of international students from their percentage and the total number of students
the_df['student_int'] = (the_df['stats_pc_intl_students'] * the_df['stats_number_students'] / 100).round().astype(int)
the_df['faculty_tot'] = (the_df['stats_number_students'] / the_df['stats_student_staff_ratio']).round().astype(int)

Once the data matches the DataFrame generated from the topuniversities website, we can drop the columns `stats_pc_intl_students` and `stats_student_staff_ratio`, which are no longer required. We then rename the remaining columns so that their names match those of the topuniversities DataFrame, which eases the merge.

In [None]:
the_df.drop(['stats_pc_intl_students', 'stats_student_staff_ratio'], axis=1, inplace=True)
the_df.columns = ['title', 'country', 'rank_display', 'score', 'student_tot', 'student_int', 'faculty_tot']
the_df = the_df.sort_values(by='score', ascending=False).iloc[:200]
the_df.set_index(['title'], inplace=True)
the_df.to_csv('the.csv')
the_df.head()

In [None]:
np.sum(the_df.isnull())

We can check if our index `title` is unique

In [None]:
print('Is index unique: {}'.format(the_df.index.is_unique))

### 2.3 Results

### (a) - (b)

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16,6))
# Ratio faculty_tot / student_tot
fac_vs_stu = the_df['faculty_tot'].div(the_df['student_tot']).sort_values(ascending=False).iloc[:N_TOP]
nice_bar_plot(fac_vs_stu, title='Ratio faculty members vs Students', y_axis='ratio', ax=axes[0])
# Ratio student_int / student_tot
stu_vs_int = the_df['student_int'].div(the_df['student_tot']).sort_values(ascending=False).iloc[:N_TOP]
nice_bar_plot(stu_vs_int, title='Ratio International Students vs Total Students', y_axis='ratio', ax=axes[1])
plt.tight_layout()

### (c) `Country`

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16,6))
# Ratio faculty_tot / student_tot (country)
fac_vs_stu = the_df.groupby('country')['faculty_tot'].sum().div(the_df.groupby('country')['student_tot'].sum())\
                .sort_values(ascending=False).iloc[:N_TOP]
nice_bar_plot(fac_vs_stu, title='Ratio faculty members vs Students (country)', y_axis='ratio', ax=axes[0])
# Ratio student_int / student_tot
stu_vs_int = the_df.groupby('country')['student_int'].sum().div(the_df.groupby('country')['student_tot'].sum())\
                .sort_values(ascending=False).iloc[:N_TOP]
nice_bar_plot(stu_vs_int, title='Ratio International Students vs Total Students (country)', y_axis='ratio', ax=axes[1])
plt.tight_layout()

### (d) `Region`

In [None]:
# TODO: Times Higher Education does not provide a region field

In [None]:
'''fig, axes = plt.subplots(1, 2, figsize=(16,6))
# Ratio faculty_tot / student_tot (country)
fac_vs_stu = top_uni_df.groupby('region')['faculty_tot'].sum().div(top_uni_df.groupby('region')['student_tot'].sum())\
                .sort_values(ascending=False).iloc[:N_TOP]
nice_bar_plot(fac_vs_stu, title='Ratio faculty members vs Students (region)', y_axis='ratio', ax=axes[0])
# Ratio student_int / student_tot
stu_vs_int = top_uni_df.groupby('region')['student_int'].sum().div(top_uni_df.groupby('region')['student_tot'].sum())\
                .sort_values(ascending=False).iloc[:N_TOP]
nice_bar_plot(stu_vs_int, title='Ratio International Students vs Total Students (region)', y_axis='ratio', ax=axes[1])
plt.tight_layout()'''

# 3. Merge data

In [None]:
the_df = pd.read_csv('the.csv', index_col='title')
top_uni_df = pd.read_csv('top_uni.csv', index_col='title')

In [None]:
data = np.concatenate((top_uni_df['country'].unique(), the_df['country'].unique()))
name, count = np.unique(data, return_counts=True)
print(name[count == 1])

In [None]:
print('Russian university to compare:\n\t{}\n\t{}'.format(
    the_df[the_df['country'] == 'Russian Federation'].index.values,
    top_uni_df[top_uni_df['country'] == 'Russia'].index.values))

the_df.loc[the_df['country'] == 'Russian Federation', 'country'] = 'Russia'

In [None]:
def remove_stop(name):
    # Names are lower-cased before this call, so every stop word must be lower case too
    stops = ['university', 'of', 'the', 'technology', 'de', 'institute', 
             'universität', 'universitaet', 'zu', 'technical']
    for stop in stops:
        name = name.replace(stop, '')
    return name

In [None]:
from difflib import SequenceMatcher

data = []
for country, the_country in the_df.groupby('country'):  # Iterate over the sub-datasets (grouped by country)
    top_uni_country = top_uni_df.loc[top_uni_df['country'] == country]  # Match only within the same country
    if len(top_uni_country) == 0:  # Country not found in the other dataset -> skip
        continue
    for name_the in the_country.index:   # Iterate over the university names
        score = np.zeros(len(top_uni_country.index))
        for i, name_top in enumerate(top_uni_country.index):  # Compare with the entries of the other dataset
            score[i] = SequenceMatcher(None, remove_stop(name_top.lower()), 
                                       remove_stop(name_the.lower())).ratio()
        # Append best score
        data.append([name_the, top_uni_country.index[np.argmax(score)], np.max(score), country])
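
The matching idea can be tried in isolation on a handful of names. The sketch below is a token-based variant of the stop-word approach (it drops whole words rather than substrings, so a stop word such as "de" cannot be cut out of the middle of another word); `normalise` and `best_match` are hypothetical helpers written for this illustration, and the stop-word list is shortened.

```python
from difflib import SequenceMatcher

STOPS = ("university", "of", "the", "institute", "technology")

def normalise(name):
    """Lower-case a name and drop generic words, keeping the distinctive tokens."""
    tokens = [t for t in name.lower().split() if t not in STOPS]
    return " ".join(tokens)

def best_match(name, candidates):
    """Return the candidate most similar to `name` and its similarity ratio."""
    scores = [SequenceMatcher(None, normalise(name), normalise(c)).ratio()
              for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]

candidates = ["University of Oxford", "University of Cambridge",
              "Imperial College London"]
match, score = best_match("Oxford University", candidates)
print(match, score)
```

Restricting the candidates to universities of the same country, as the loop above does, keeps the comparison cheap and avoids matching similarly named institutions from different countries.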

In [None]:
# Create a DataFrame with scores, names and countries
df = pd.DataFrame(data,  columns=['title_the', 'title_top', 'match_score', 'country'])

In [None]:
# Weakest matches among those above the 0.7 threshold (accepted)
df[df['match_score'] > 0.7].sort_values('match_score').head(5)

In [None]:
# Best-scoring pairs at or below the 0.7 threshold (rejected)
df[df['match_score'] <= 0.70].sort_values('match_score', ascending=False).head()

In [None]:
matched = df[df['match_score'] > 0.70]
duplicates = matched[matched.duplicated(subset='title_top')]['title_top']
matched.set_index(['title_top']).loc[duplicates]

In [None]:
matched = matched.sort_values('match_score', ascending=False).drop_duplicates(subset='title_top')
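
When several Times names point to the same topuniversities name, only the pair with the highest score is kept. The same rule is sketched below in plain Python on hypothetical triples `(title_the, title_top, match_score)`.

```python
# Hypothetical matches: (title_the, title_top, match_score)
matches = [
    ("THE name 1", "QS name X", 0.95),
    ("THE name 2", "QS name X", 0.75),   # competes with the row above
    ("THE name 3", "QS name Y", 0.90),
]

# Keep, for every title_top, the candidate with the highest score
best = {}
for title_the, title_top, score in matches:
    if title_top not in best or score > best[title_top][1]:
        best[title_top] = (title_the, score)

print(best)
```

The losing candidates are simply dropped here, so they end up unmatched; checking them by hand is a reasonable follow-up when the list is short.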

In [None]:
# Create dictionary entries to replace the old names with the new ones (identical names make the merge easier)
renamed_index = dict(zip(top_uni_df.loc[matched['title_top']].index.values, 
                         the_df.loc[matched['title_the']].index.values))
top_uni_df.rename(index=renamed_index, inplace=True)
# top_uni_df.loc[matched['title_the']].head()

In [None]:
df_final = pd.merge(top_uni_df, the_df, right_index=True, left_index=True, 
                    how='inner', suffixes=('_top', '_times'))
df_final.drop(['url', 'country_top', 'country_times', 'faculty_int'], axis=1, inplace=True)
df_final['is_epfl'] = df_final.index == 'École Polytechnique Fédérale de Lausanne'
df_final.head()

In [None]:
print('Matched universities: {}'.format(len(df_final)))

In [None]:
df_final_times = pd.cut(df_final['rank_display_times'], [0, 5, 15, 50, 100, 200])
print('Names matched for Times ranking:\n{}'.format(df_final_times.value_counts(sort=False).cumsum()))

In [None]:
df_final_top = pd.cut(df_final['rank_display_top'], [0, 5, 15, 50, 100, 200])

print('Names matched for Top Universities ranking:\n{}'.format(df_final_top.value_counts(sort=False).cumsum()))

# 4. Results and Correlation

In [None]:
sns.pairplot(df_final, hue='is_epfl',
             x_vars=['rank_display_top', 'score_top', 'faculty_tot_top', 'student_tot_top', 'student_int_top'],
             y_vars=['rank_display_times', 'score_times', 'faculty_tot_times', 'student_tot_times', 'student_int_times'])

In [None]:
sns.pairplot(df_final, hue='is_epfl',
             y_vars=['score_top'],
             x_vars=['faculty_tot_top', 'student_tot_top', 'student_int_top'])

In [None]:
sns.pairplot(df_final, hue='is_epfl',
             y_vars=['score_times'],
             x_vars=['faculty_tot_times', 'student_tot_times', 'student_int_times'])
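
The pair plots give a visual impression; a correlation coefficient puts a number on it. Below is a minimal sketch of Pearson's r in plain Python, checked on toy data. The `pearson` helper is written for this illustration; pandas' `DataFrame.corr()` computes the same coefficient by default.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy data: a perfectly increasing and a perfectly decreasing relation
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3], [3, 2, 1]))
```

Values close to 1 or -1 indicate a strong linear relation, values near 0 a weak one. Ranks and scores are expected to be negatively correlated, because a better university has a lower rank number.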

In [None]:
df_final['score_mean'] = (df_final['score_times'] + df_final['score_top'])/2
df_final.sort_values('score_mean', ascending=False).head()
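
Averaging the two raw scores assumes that both websites score on comparable scales. One possible alternative, sketched here in plain Python and not the method used in this notebook, is to standardise each score to a z-score before averaging. All names and figures below are made up.

```python
from statistics import mean, pstdev

# Hypothetical rows: (name, score_top, score_times)
rows = [("Uni A", 95.0, 90.0), ("Uni B", 92.0, 96.0), ("Uni C", 80.0, 85.0)]

def zscores(values):
    """Standardise values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

z_top = zscores([r[1] for r in rows])
z_times = zscores([r[2] for r in rows])

# Combined score: mean of the two standardised scores
combined = {r[0]: (a + b) / 2 for r, a, b in zip(rows, z_top, z_times)}
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

With standardised scores, a ranking whose scores are spread over a wider range does not dominate the average.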

In [None]:
N_TOP = 10
fig, ax = plt.subplots(1, 1, figsize=(16,4))
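df_europe = df_final.loc[df_final['region']=='Europe']
# Sort by mean score so that the bars really show the best universities
best_europe = df_europe['score_mean'].sort_values(ascending=False).iloc[:N_TOP]
nice_bar_plot(best_europe, ax, title='Best European universities')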