# Table of Contents

### [Task 1. Scraping Top Universities](#1)
- [1.1. Scraping the 200 top elements from Top Universities](#11)

- [1.2. Sorting by ratio between faculty members and students](#12)

- [1.3. Sorting by ratio of international students](#13)

- [1.4. Sorting while grouped by country](#14)

- [1.5. Sorting while grouped by region](#15)

### [Task 2. Scraping Times Higher Education](#2)
- [2.1. Scraping the 200 top elements from Times Higher Education](#21)

- [2.2. Sorting by ratio between faculty members and students](#22)

- [2.3. Sorting by ratio of international students](#23)

- [2.4. Sorting while grouped by country](#24)

- [2.5. Sorting while grouped by region](#25)

### [Task 3. Merging Both Rankings](#3)

### [Task 4. Correlation](#4)

## Task 1. Scraping Top Universities <a class="anchor" id="1"></a>

### Assignment Instructions
Obtain the 200 top-ranking universities on www.topuniversities.com ([ranking 2018](https://www.topuniversities.com/university-rankings/world-university-rankings/2018)). In particular, extract the following fields for each university: name, rank, country and region, number of faculty members (international and total) and number of students (international and total). Some information is not available in the main list and must be retrieved from the [details page](https://www.topuniversities.com/universities/ecole-polytechnique-fédérale-de-lausanne-epfl).
Store the resulting dataset in a pandas DataFrame and answer the following questions:
- Which are the best universities in terms of: (a) ratio between faculty members and students, (b) ratio of international students?
- Answer the previous question aggregating the data by (c) country and (d) region.

Plot your data using bar charts and describe briefly what you observed.

### 1.1. Scraping the 200 top elements from Top Universities <a class="anchor" id="11"></a>

We import some libraries.

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from pandas import json_normalize  # top-level since pandas 1.0; pandas.io.json.json_normalize is deprecated
from matplotlib import pyplot as plt
%pylab inline

Populating the interactive namespace from numpy and matplotlib


We define the constants needed for scraping. We found the *json* URL by inspecting the URL given in the instructions with Postman.

In [2]:
TOP_UNIVERSITIES_URL = 'https://www.topuniversities.com'
TOP_UNIVERSITIES_JSON_URL = TOP_UNIVERSITIES_URL + '/sites/default/files/qs-rankings-data/357051.txt'
TOP_UNIVERSITIES_JSON_COLUMNS = ['title', 'rank_display', 'country', 'region']
TOP_UNIVERSITIES_HTML_COLUMNS = ['total faculty', 'inter faculty', 'total student', 'total inter']
TOP_UNIVERSITIES_NEW_COLUMNS = ['Name', 'Rank', 'Country', 'Region', 'Faculty', 'International Faculty', 'Students', 'International Students']

We will use the following helper functions for scraping:
- `get_json` fetches the *json* data for a given URL;
- `get_html_beautiful_soup` gets the `BeautifulSoup` associated with a URL;
- `get_number` gets the `int` associated with a given class name in a given `BeautifulSoup` instance.

In [3]:
def get_json(url):
    """
    Returns the json associated with given url, or an empty dict if an error arises.
    :param url: string, url target
    :return: dict
    """
    try:
        return requests.get(url).json()
    except Exception:  # network failure or invalid JSON
        return {}

def get_html_beautiful_soup(row, url_prefix):
    """
    Returns BeautifulSoup of the target URL, or None if an error arises.
    :param row: Pandas Series, corresponding to a row of a DataFrame.
    :param url_prefix: string, corresponding to URL prefix
    :return: BeautifulSoup, or None
    """
    try:
        return BeautifulSoup(requests.get(url_prefix + row['url']).text, 'html.parser')
    except Exception:  # network failure or missing 'url' field
        return None

def get_number(soup, class_name):
    """
    Returns number associated with the given class name in the soup passed as argument, or NaN if an error arises.
    :param soup: BeautifulSoup, extracted beforehand
    :param class_name: string, targeted class name
    :return: int, or NaN
    """
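    try:
        number_text = soup.find('div', class_=class_name).find('div', class_='number').text
        # Keep digits only, which drops thousands separators and whitespace
        return int(''.join([char for char in number_text if char.isdigit()]))
    except Exception:  # missing soup, missing element or no digits found
        return np.nan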
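
The digit-filtering step in `get_number` can be illustrated on its own. The sketch below uses a hypothetical helper, `parse_count`, which is not part of this notebook; it assumes that the page renders counts as text such as `"3,065"`, possibly surrounded by whitespace.

```python
def parse_count(number_text):
    """Keep only the digits of a formatted count and convert them to int.

    `parse_count` is a hypothetical helper for illustration; it mirrors the
    digit filter used in `get_number`.
    """
    digits = ''.join(char for char in number_text if char.isdigit())
    if not digits:
        # No digit at all: int('') would raise ValueError, so fail explicitly.
        raise ValueError('no digits found in {!r}'.format(number_text))
    return int(digits)


# Thousands separator and surrounding whitespace are dropped.
print(parse_count('\n 3,065 '))
```

Note that this filter also discards decimal points and signs, so it is only suitable for non-negative integer counts.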

We use the following function to scrape universities.

In [4]:
def scrape_universities(json_url, current_columns, new_columns, extra_columns=(), extra_url='', max_universities=200):
    """
    Returns a DataFrame instance containing all university data.
    :param json_url: string, url containing json file
    :param current_columns: list of strings, containing columns to keep
    :param new_columns: list of strings, same length as current_columns, to rename the DataFrame's columns
    :param extra_columns: list of strings, extra columns to get from specific university pages
    :param extra_url: string, url prefix to get to specific university pages
    :param max_universities: int, number of top universities to get, default is 200
    :return: DataFrame
    """
    university_df = json_normalize(get_json(json_url)['data'][:max_universities])
    if extra_columns:
        beautiful_soups = university_df.apply(lambda row: get_html_beautiful_soup(row, extra_url), axis=1)
        for column in extra_columns:
            university_df[column] = beautiful_soups.apply(lambda soup: get_number(soup, column))
    university_df = university_df[current_columns].rename(index=str, columns=dict(zip(current_columns, new_columns)))
    return university_df

We scrape the top 200 universities.

In [None]:
top_universities_df = scrape_universities(TOP_UNIVERSITIES_JSON_URL,
                                          TOP_UNIVERSITIES_JSON_COLUMNS + TOP_UNIVERSITIES_HTML_COLUMNS,
                                          TOP_UNIVERSITIES_NEW_COLUMNS, 
                                          extra_columns=TOP_UNIVERSITIES_HTML_COLUMNS,
                                          extra_url=TOP_UNIVERSITIES_URL)
top_universities_df.head()

### 1.2. Sorting by ratio between faculty members and students <a class="anchor" id="12"></a>

We define a helper function, `insert_column_and_sort`, that computes a new ratio column and then sorts the rows by this new column.

In [None]:
def insert_column_and_sort(dataframe, columns, head_elements=5):
    """
    Adds a new column holding the ratio of two columns, sorts by it and returns the first rows.
    :param dataframe: DataFrame, targeted instance
    :param columns: list of 3 strings and 2 booleans, containing names of the new column, 
                    numerator column and denominator column, and 2 booleans determining
                    if denominator column includes numerator one, and order of sorting
    :param head_elements: int, number of head elements to be shown
    :return: DataFrame
    """
    df_copy = dataframe.copy()
    new_column, numerator, denominator, denominator_includes_numerator, ascending = columns
    numerator_df = dataframe[numerator]
    denominator_df = dataframe[denominator]
    if denominator_includes_numerator:
        denominator_df = denominator_df - numerator_df
    df_copy[new_column] = numerator_df / denominator_df
    return df_copy.sort_values(new_column, ascending=ascending).head(head_elements)

We introduce a new column for the ratio between faculty members and students, such that each value is the average number of students per faculty member. Then, we sort the top universities scraped by this ratio in ascending order.

In [None]:
STUDENT_STAFF_RATIO = ['Student-Staff Ratio', 'Students', 'Faculty', False, True]
insert_column_and_sort(top_universities_df, STUDENT_STAFF_RATIO)

### 1.3. Sorting by ratio of international students <a class="anchor" id="13"></a>

Likewise, we introduce a new column for the ratio of international students, such that each value is the average number of international students per non-international student. Then, we sort the top universities scraped by this ratio in **descending** order.

In [None]:
INTL_STUDENT_RATIO = ['International Student Ratio', 'International Students', 'Students', True, False]
insert_column_and_sort(top_universities_df, INTL_STUDENT_RATIO)
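
Because the `True` flag makes the denominator exclude the numerator, the resulting value is a ratio of international to non-international students, not a share of the whole student body. The sketch below, with made-up numbers and a hypothetical `ratio` helper, shows how the two quantities relate.

```python
from fractions import Fraction


def ratio(numerator, denominator, denominator_includes_numerator):
    """Mirror, on scalars, the division performed by `insert_column_and_sort`."""
    if denominator_includes_numerator:
        denominator = denominator - numerator
    return Fraction(numerator, denominator)


# Made-up example: 1,000 students, 250 of them international.
intl_per_domestic = ratio(250, 1000, True)   # 250 / 750
intl_share = ratio(250, 1000, False)         # 250 / 1000

print(intl_per_domestic)  # international students per non-international student
print(intl_share)         # share of international students in the student body

# The ratio r and the share p are linked by r = p / (1 - p).
assert intl_per_domestic == intl_share / (1 - intl_share)
```

A ratio of 0.5 therefore corresponds to a share of one third, so values in this column should not be read directly as percentages of the student body.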

### 1.4. Sorting while grouped by country <a class="anchor" id="14"></a>

We make a short helper function, called `group_insert_sort`, to group by a column, insert another one and sort by its values. It then shows the results in a bar chart.

In [None]:
def group_insert_sort(dataframe, group_by, columns):
    """
    Groups the dataframe by a column, averages each group, then computes the ratio of two averaged columns and sorts by it.
    Then, plots the result as a bar chart.
    :param dataframe: DataFrame, targeted instance
    :param group_by: string, name of the column by which dataframe is grouped
    :param columns: list of 3 strings and 2 booleans, containing names of the new column, 
                    numerator column and denominator column, and 2 booleans determining
                    if denominator column includes numerator one, and order of sorting
    :return: bar chart
    """
    grouped = dataframe.groupby(group_by).mean(numeric_only=True)  # skip text columns such as names
    return insert_column_and_sort(grouped, columns, head_elements=None)[[columns[0]]].plot.bar()
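
Note that `group_insert_sort` averages the numerator and denominator columns within each group first and only then divides, so each bar is a ratio of group means rather than a mean of per-university ratios. The two can differ, as the following sketch with made-up numbers illustrates.

```python
from statistics import mean

# Made-up (students, faculty) pairs for two universities in the same country.
universities = [(30000, 1000), (2000, 400)]

students = [s for s, _ in universities]
faculty = [f for _, f in universities]

# What the grouped computation does: divide the group means.
ratio_of_means = mean(students) / mean(faculty)

# Alternative: compute each university's ratio, then average.
mean_of_ratios = mean(s / f for s, f in universities)

print(ratio_of_means)   # dominated by the larger university
print(mean_of_ratios)   # every university weighs the same
```

Which of the two is preferable depends on the question asked: the ratio of means describes the country's students as a whole, while the mean of ratios describes its typical university.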

Countries ranked by student-staff ratio are shown below. Russia stands out with the lowest number of students per staff member: it is the only country that dips below 5.

In [None]:
group_insert_sort(top_universities_df, 'Country', STUDENT_STAFF_RATIO)

The top universities by country and by international student ratio are the following. We observe that Australia and the UK stand out as having the highest international outlook, followed closely by Hong Kong, Austria and Switzerland.

In [None]:
group_insert_sort(top_universities_df, 'Country', INTL_STUDENT_RATIO)

### 1.5. Sorting while grouped by region <a class="anchor" id="15"></a>

The top universities by region and by student-staff ratio are the following. We observe that Asia is the only region whose student-staff ratio is below 8.

In [None]:
group_insert_sort(top_universities_df, 'Region', STUDENT_STAFF_RATIO)

The top universities by region and by international student ratio are the following. We observe that Oceania has, on average, more than 50% international students in its student body, exceeding the other continents by more than 20%.

In [None]:
group_insert_sort(top_universities_df, 'Region', INTL_STUDENT_RATIO)

## Task 2. Scraping Times Higher Education <a class="anchor" id="2"></a>

### Assignment Instructions

Obtain the 200 top-ranking universities in www.timeshighereducation.com ([ranking 2018](http://timeshighereducation.com/world-university-rankings/2018/world-ranking)). Repeat the analysis of the previous point and discuss briefly what you observed.

### 2.1. Scraping the 200 top elements from Times Higher Education <a class="anchor" id="21"></a>

We establish the following constants.

In [None]:
TIMES_HIGHER_ED_JSON_URL = 'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json'
TIMES_HIGHER_ED_JSON_COLUMNS = ['name', 'rank', 'location', 'stats_number_students', 'stats_student_staff_ratio', 'stats_pc_intl_students', 'stats_female_male_ratio']
TIMES_HIGHER_ED_NEW_JSON_COLUMNS = ['Name', 'Rank', 'Country', 'Students', 'Student-Staff Ratio', 'Percentage of International Students', 'Female-Male Ratio']

Then we scrape the universities from Times Higher Education.

In [None]:
times_higher_ed_df = scrape_universities(TIMES_HIGHER_ED_JSON_URL, 
                                         TIMES_HIGHER_ED_JSON_COLUMNS, 
                                         TIMES_HIGHER_ED_NEW_JSON_COLUMNS)
times_higher_ed_df.head()

### 2.2. Sorting by ratio between faculty members and students <a class="anchor" id="22"></a>

There is already a *Student-Staff Ratio* column. We just need to convert the values of that column to `float` and then sort the table by it.

In [None]:
STUDENT_STAFF = "Student-Staff Ratio"
times_higher_ed_df[STUDENT_STAFF] = times_higher_ed_df[STUDENT_STAFF].apply(float)
times_higher_ed_df.sort_values(STUDENT_STAFF).head()

### 2.3. Sorting by ratio of international students <a class="anchor" id="23"></a>

There is already a *Percentage of International Students* column; we only need to strip the `%` sign and convert its values to `float` fractions. Ranking by the percentage of international students is equivalent to ranking by the ratio of international to non-international students used in section 1.3.

In [None]:
INTL_PERCENTAGE = "Percentage of International Students"
times_higher_ed_df[INTL_PERCENTAGE] = times_higher_ed_df[INTL_PERCENTAGE].apply(lambda x: float(x.strip('%'))/100)
times_higher_ed_df.sort_values(INTL_PERCENTAGE, ascending=False).head()
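
The equivalence claimed above can be checked directly. Writing $p$ for the share of international students and $r$ for the number of international students per non-international student (notation introduced here for illustration), we have for $0 \le p < 1$:

$$
r(p) = \frac{p}{1 - p}, \qquad \frac{dr}{dp} = \frac{1}{(1 - p)^2} > 0 .
$$

Since $r$ is strictly increasing in $p$, sorting by either quantity yields the same order.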

### 2.4. Sorting while grouped by country <a class="anchor" id="24"></a>

We can group by country and then perform the same sorting as above, as the values are now `float` instances.

The following is the sorting by ratio between faculty members and students, grouped by country. We observe that there is some consistency in ranking between both websites, as Russia and Denmark remain in the top five, but numbers are different from the *Top Universities* website.

In [None]:
grouped_by_country = times_higher_ed_df.groupby('Country').mean(numeric_only=True)  # skip text columns
grouped_by_country.sort_values(STUDENT_STAFF)[[STUDENT_STAFF]].plot.bar()

The following is the sorting by percentage of international students, grouped by country. We notice that Luxembourg is very distinctly international; this is new information, as Luxembourg does not appear in the *Top Universities* ranking.

In [None]:
grouped_by_country.sort_values(INTL_PERCENTAGE, ascending=False)[[INTL_PERCENTAGE]].plot.bar()

### 2.5. Sorting while grouped by region <a class="anchor" id="25"></a>

We can also group by region, but our Times Higher Education table has no region column, so we must first associate each country with its region using the Top Universities data.

In [None]:
merged_for_region = pd.merge(times_higher_ed_df, top_universities_df[['Country', 'Region']].drop_duplicates(), on="Country")
grouped_by_region = merged_for_region.groupby('Region').mean(numeric_only=True)  # skip text columns
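
Since `pd.merge` performs an inner join by default, any Times Higher Education row whose country has no counterpart in the Top Universities mapping is silently dropped before the regional averages are computed. A quick way to spot such countries is a set difference; the country names below are made up for illustration and are not taken from the scraped data.

```python
# Hypothetical country columns from the two rankings (illustrative values only).
the_countries = ['Switzerland', 'Luxembourg', 'Russian Federation', 'Switzerland']
country_to_region = {'Switzerland': 'Europe', 'Russia': 'Europe'}

# Countries present in the first ranking but missing from the mapping.
unmatched = sorted(set(the_countries) - set(country_to_region))
print(unmatched)
```

With the notebook's DataFrames, the same check could be written as `set(times_higher_ed_df['Country']) - set(top_universities_df['Country'])`.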

The following is the sorting by ratio between faculty members and students, grouped by region. This time the picture differs from the other ranking: Africa has the lowest number of students per staff member.

In [None]:
grouped_by_region.sort_values(STUDENT_STAFF)[[STUDENT_STAFF]].plot.bar()

The following is the sorting by percentage of international students, grouped by region. This ranking confirms Oceania's top place in international students.

In [None]:
grouped_by_region.sort_values(INTL_PERCENTAGE, ascending=False)[[INTL_PERCENTAGE]].plot.bar()

## Task 3. Merging Both Rankings <a class="anchor" id="3"></a>

### Assignment Instructions

Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

### Answers

We merge both tables that we scraped in tasks 1 and 2.

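We first import the libraries we need and define helper functions to find the closest match between names. Our strategy is to normalize each name: we transliterate accented characters to ASCII, lowercase the words, and drop words written entirely in capitals (acronyms), parenthesized words and the stop words *The*, *of*, *and* and *in*. Two names are considered a match when the number of words they share, in the same order, is at least the number of distinct words in the shorter name. Each Top Universities name can be matched at most once.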

In [None]:
import difflib
from pprint import pprint
import unidecode

def clean_word(word):
    word = unidecode.unidecode(word)
    word = ''.join([char for char in word if char.isalpha() or char == ')'])
    return word.lower()

def is_all_caps(word):
    for char in word:
        if char.isalpha() and char.islower():
            return False
    return True

def filter_name(name):
    words = [clean_word(word) for word in name.split(' ') if (not is_all_caps(word) and word not in 'The of and in'.split(' ') and not (word.startswith('(') and word.endswith(')')))]
    return words

def compare_words(name1, name2):
    """
    Returns True if both names share enough words, in the same order, to be considered a match.
    :param name1: string, first university name
    :param name2: string, second university name
    :return: bool
    """
    words1 = filter_name(name1)
    words2 = filter_name(name2)
    min_len = min(len(set(words1)), len(set(words2)))
    if min_len == 0:
        # A name with no usable words (e.g. only an acronym) must not match everything.
        return False
    similarity_index = 0
    similarity = 0
    for word1 in words1:
        # Only look at or after the last matched position, to respect word order.
        for index2 in range(similarity_index, len(words2)):
            if word1 == words2[index2]:
                similarity_index = index2
                similarity += 1
                break
    return similarity >= min_len

def clean_name(name):
    to_remove = 'University The'.split(' ')
    for word in to_remove:
        name = name.replace(word, '')
    return name.strip()

TAKEN = []

def find_closest_match(name, other_names):
    """
    Returns the closest match to given name in a series of other names, or just the name if nothing is found.
    :param name: string, given name that we want to find
    :param other_names: Pandas Series, names that the argument name will be compared to
    :return: string
    """
    for other_name in other_names:
        if compare_words(name, other_name) and other_name not in TAKEN:
            TAKEN.append(other_name)
            return other_name
    return name

In [None]:
compare_words('Massachusetts Institute of Technology (MIT)', 'Massachusetts Institute of Technology')
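
For comparison, a character-level alternative to the word-based `compare_words` is the standard library's `difflib`. This is a different technique from the one used in this notebook, shown only as a sketch; the `closest_name` helper, the `cutoff` value and the sample names are assumptions chosen for illustration.

```python
import difflib

# Illustrative candidate names, not taken from the scraped data.
candidates = [
    'Massachusetts Institute of Technology (MIT)',
    'Stanford University',
    'University of Cambridge',
]


def closest_name(name, names, cutoff=0.8):
    """Return the most similar name by difflib's ratio, or None below cutoff."""
    matches = difflib.get_close_matches(name, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_name('Massachusetts Institute of Technology', candidates))
print(closest_name('ETH Zurich', candidates))
```

A character-based score tolerates small spelling differences, but it can also pair up names that merely share long common words such as "University", so the cutoff would need tuning on the real data.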

We replace each value of the `'Name'` column in the Times Higher Education `DataFrame` with its closest match in the Top Universities one.

In [None]:
NAME = 'Name'
times_higher_ed_df[NAME] = times_higher_ed_df[NAME].apply(lambda name: find_closest_match(name, top_universities_df[NAME]))

Then we merge both `DataFrame` instances into one.

In [None]:
merged_df = pd.merge(top_universities_df, times_higher_ed_df.drop(labels='Country', axis=1), how='outer', on='Name', suffixes=(' (Top Unis)', ' (T.H.E.)'))
merged_df.head()
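
To see how many universities were actually matched, an outer merge can be run with `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only` or `right_only`. The two small frames below are made-up stand-ins for the scraped tables.

```python
import pandas as pd

# Made-up stand-ins for the two ranking tables.
left = pd.DataFrame({'Name': ['A University', 'B Institute'], 'Rank': ['1', '2']})
right = pd.DataFrame({'Name': ['A University', 'C College'], 'Rank': ['3', '1']})

merged = pd.merge(left, right, how='outer', on='Name',
                  suffixes=(' (Top Unis)', ' (T.H.E.)'), indicator=True)

# Count rows found in both rankings versus only one of them.
match_counts = merged['_merge'].value_counts()
print(match_counts)
```

The suffixes keep the original position in each ranking as two separate `Rank` columns, while rows marked `left_only` or `right_only` point to names that the matching step did not pair up.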

## Task 4. Correlation <a class="anchor" id="4"></a>

Measures to consider: Pearson's correlation, rank correlation (e.g., Spearman's correlation coefficient), and mutual information.

Check for Simpson's paradox, test some hypotheses with **p-values**.
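
As a starting point for the correlation analysis, the sketch below computes Pearson's and Spearman's coefficients in plain Python on made-up rank pairs. The helper names are hypothetical, ties are not handled, and p-values are left out; a statistics library would normally be used for those.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up positions of five universities in two rankings.
rank_a = [1, 2, 3, 4, 5]
rank_b = [2, 1, 3, 5, 4]
print(pearson(rank_a, rank_b), spearman(rank_a, rank_b))
```

Spearman's coefficient only depends on the ordering, so it is the more natural choice when comparing positions in two rankings.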