# 02 - Data from the Web

In this homework we will extract interesting information from www.topuniversities.com and www.timeshighereducation.com, two platforms that maintain a global ranking of universities. This ranking is not offered as a downloadable dataset, so you will have to find a way to scrape the information we need! You are not allowed to manually download the entire ranking; rather, you have to understand how the server loads it in your browser. For this task, Postman with the Interceptor extension can help you greatly. We recommend that you watch this brief tutorial to quickly understand how to use it.

* [Task 1 - Scrape the QS ranking](#Task-1---Scrape-the-QS-ranking)
    * [Best universities for ratio between faculty members and students](#Which-are-the-best-universities-in-term-of-ratio-between-faculty-members-and-students-according-to-QS?)
    * [Best universities for the international students ratio](#Which-are-the-best-universities-in-term-of-ratio-of-international-students-according-to-QS?)
    * [Results by region](#Plot-the-results-from-QS-ranking-aggregating-by-region)
    * [Results by country](#Plot-the-results-from-QS-ranking-aggregating-by-country)


* [Task 2 - Scrape the Times ranking](#Task-2---Scrape-the-Times-ranking)
    * [Best universities for ratio between faculty members and students](#Which-are-the-best-universities-in-term-of-ratio-between-faculty-members-and-students-according-to-Times?)
    * [Best universities for the international students ratio](#Which-are-the-best-universities-in-term-of-ratio-of-international-students-according-to-Times?)
    * [Results by region](#Plot-the-results-from-Times-ranking-aggregating-by-region)
    * [Results by country](#Plot-the-results-from-Times-ranking-aggregating-by-country)
    
    
* [Task 3 - Merge the dataframes](#Task-3---Merge-the-dataframes)


* [Task 4 - Exploratory analysis](#Task-4---Exploratory-analysis)


* [Task 5 - Exploratory analysis](#Task-5--)

# Imports

In [None]:
# Import libraries
import string
import re
import pickle
import requests
from bs4 import BeautifulSoup
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None

%matplotlib inline

# Constants definition

In [None]:
COLUMN_RANK = 'Rank'
COLUMN_UNIVERSITY_NAME = 'University Name'
COLUMN_REGION = 'Region'
COLUMN_COUNTRY = 'Country'
COLUMN_TOTAL_STUDENTS = 'Total Students'
COLUMN_INTERNATIONAL_STUDENTS = 'Total International Students'
COLUMN_TOTAL_FACULTY = 'Total Faculty'
COLUMN_INTERNATIONAL_FACULTY = 'Total International Faculty'
COLUMN_INTERNATIONAL_RATIO = 'Total International Students / Total Students'
COLUMN_FACULTY_RATIO = 'Total Faculty / Total Student'

In [None]:
QS_RANKING_JSON = 'https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt?_=1508104120137'
TIMES_RANKING_JSON = 'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json'

In [None]:
SEARCH_REFERENCE_API = 'https://en.wikipedia.org/w/api.php?action=query&titles={0}&prop=revisions&rvprop=content&format=json&indexpageids'

During the very first scraping, we build a table that links a country to a specific region. This information is available in the QS ranking and will be used to fill the Times dataframe.

Once the first scraping is done, the table (along with the two rankings) is serialized locally for future runs.

In [None]:
import os

# Make sure the cache directory exists before reading or writing pickles
os.makedirs('serial', exist_ok=True)
try:
    COUNTRY_REGION_METADATA = pd.read_pickle('serial/country_region_metadata.p')
except OSError:
    COUNTRY_REGION_METADATA = pd.DataFrame(columns=[COLUMN_REGION])
    COUNTRY_REGION_METADATA.to_pickle('serial/country_region_metadata.p')

In [None]:
COUNTRY_REGION_METADATA
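
The same "load from disk, otherwise build and save" pattern is used for the metadata table and for both rankings. A minimal sketch of how it could be factored into one helper is shown below. The names `load_or_build` and `build_table` are hypothetical, and the sketch uses the standard `pickle` module on a plain dictionary instead of pandas so that it runs on its own.

```python
import os
import pickle
import tempfile

def load_or_build(path, build):
    """Return the object cached at `path`, building and caching it on a miss."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except OSError:
        # Cache miss: build the object, then store it for the next run
        result = build()
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
        with open(path, 'wb') as f:
            pickle.dump(result, f)
        return result

# Hypothetical builder standing in for a scraping function
calls = []
def build_table():
    calls.append(1)
    return {'Italy': 'Europe'}

cache_path = os.path.join(tempfile.mkdtemp(), 'serial', 'demo.p')
first = load_or_build(cache_path, build_table)   # builds and saves
second = load_or_build(cache_path, build_table)  # loads from disk
print(first == second, len(calls))
```

The builder only runs on the first call; the second call reads the pickled copy.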

# General use functions definition

In [None]:
def build_html_parser(url):
    '''
    Function to build a parser object of type BeautifulSoup
    
    url -- the webpage url to which send a get request to
    
    return a parser of the given webpage
    '''
    
    r = requests.get(url)
    page_body = r.text
    
    soup = BeautifulSoup(page_body, 'html.parser')
    
    return soup

In [None]:
def clean_str_number(str_n):
    '''
    Obtain a string that contains only the numeric information,
    dropping any formatting characters
    
    str_n -- a string containing both numeric and formatting characters
    
    return a string with only numeric characters
    '''
    
    return str_n.strip('\n').strip('%').replace(',', '')
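
As a quick illustration of what `clean_str_number` does, the sketch below re-declares the function so that it runs on its own and applies it to a few strings. The sample values are made up for the example and are not scraped data.

```python
def clean_str_number(str_n):
    # Same logic as the function defined above
    return str_n.strip('\n').strip('%').replace(',', '')

# Hypothetical raw values: surrounding newlines, a percent sign, thousands separators
samples = ['\n3,720\n', '34%', '11,074']
cleaned = [clean_str_number(s) for s in samples]
print(cleaned)
```

Each cleaned string can then be passed to `int` or `float` directly.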

In [None]:
def search_standard_name(str_name):
    '''
    Given a name, query the Wikipedia API to obtain the title of the matching page,
    which we use as the standard name.
    The function takes care of cleaning up the string in order to facilitate
    the search
    
    str_name -- the name to search on Wikipedia
    
    return the Wikipedia page title for the queried name if found,
    the cleaned up version of the passed value otherwise
    '''
    
    str_name = str_name.split(" - ")[0] # Handle names with - (hyphen)
    str_name = str_name.split(" – ")[0] # Handle names with – (en dash)
    str_name = re.sub(r'\(.*?\)', '', str_name) # Drop bracketed parts
    str_name = str_name.strip().replace('&', '%26') # URL-encode ampersands
    
    # Wikipedia page titles use underscores in place of spaces
    r = requests.get(SEARCH_REFERENCE_API.format(str_name.strip().replace(' ', '_')))
    data = r.json()
    
    page_id = data['query']['pageids'][0]
    
    if page_id == '-1':
        print('Not found :( -> {}'.format(str_name))
        
        # Manually set a standard name for the only unmatchable university that matters.
        # A total of 9 names are not found by the Wikipedia requests, but only one of them appears twice,
        # so it needs a standard name to be merged later on. The other ones can keep their name
        if (str_name == "Scuola Superiore Sant'Anna Pisa di Studi Universitari e di Perfezionamento"):
            found_name = "Scuola Superiore Sant’Anna"
        else:
            found_name = str_name
    else:
        found_name = data['query']['pages'][page_id]['title']
        
    return(found_name)
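
To make the two halves of `search_standard_name` easier to follow, here is a small offline sketch. It assumes a response shaped like the one the function reads (`query`, `pageids`, `pages`); the helper names `normalize_name` and `title_from_response`, the university name and both response dictionaries are hand-written for illustration and do not come from a real request.

```python
import re

def normalize_name(raw_name):
    """Apply the same clean-up steps as search_standard_name, without the request."""
    name = raw_name.split(" - ")[0]          # keep the part before a spaced hyphen
    name = name.split(" – ")[0]              # keep the part before a spaced en dash
    name = re.sub(r'\(.*?\)', '', name)      # drop bracketed abbreviations
    return name.strip()

def title_from_response(data, fallback):
    """Pick the page title out of a response, or fall back to the cleaned name."""
    page_id = data['query']['pageids'][0]
    if page_id == '-1':                      # a missing page is reported with id -1
        return fallback
    return data['query']['pages'][page_id]['title']

# Hand-written responses, for illustration only
found = {'query': {'pageids': ['123'], 'pages': {'123': {'title': 'Example University'}}}}
missing = {'query': {'pageids': ['-1'], 'pages': {'-1': {'title': 'Nowhere', 'missing': ''}}}}

cleaned_name = normalize_name('Example University (EU) - Main Campus')
print(cleaned_name)
print(title_from_response(found, cleaned_name))
print(title_from_response(missing, 'fallback name'))
```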

In [None]:
def update_country_region_metadata(country, region):
    '''
    Build a new entry for the global dataframe that correlates regions and countries.
    
    country -- the country to add
    region -- the region to add
    '''
    
    global COUNTRY_REGION_METADATA
    
    if (country in COUNTRY_REGION_METADATA.index):
        return
    
    # Add the country as a new row (DataFrame.append was removed in pandas 2.0)
    COUNTRY_REGION_METADATA.loc[country, COLUMN_REGION] = region

## Task 1 - Scrape the QS ranking
Obtain the 200 top-ranking universities in www.topuniversities.com (ranking 2018)

In [None]:
# TODO: Handle the return value for data that is not found.
# At the moment -1 is returned, but this distorts the plots and the computation of the ratios

def parse_detail_page(url_detail):
    '''
    Function that parses the missing information from the detail page of the university on the QS website
    
    url_detail -- the url of the detail page to scrape
    
    Return a dictionary with all the data found as integer values (-1 when a value is missing)
    '''
    
    # Build a parser for the detail page
    soup = build_html_parser(url_detail)
    
    # Obtain and clean up the total faculty member value
    try:
        faculty_member_total = soup.find('div', class_='total faculty').find('div', class_='number').text
        faculty_member_total = clean_str_number(faculty_member_total)
    except AttributeError:  # find() returned None: the value is not on the page
        faculty_member_total = -1
    
    # Obtain and clean up the international faculty member value
    try:
        faculty_member_inter = soup.find('div', class_='inter faculty').find('div', class_='number').text
        faculty_member_inter = clean_str_number(faculty_member_inter)
    except AttributeError:
        faculty_member_inter = -1
    
    # Obtain and clean up the total students value
    try:
        student_total = soup.find('div', class_='total student').find('div', class_='number').text
        student_total = clean_str_number(student_total)
    except AttributeError:
        student_total = -1
    
    # Obtain and clean up the international students value
    try:
        student_inter = soup.find('div', class_='total inter').find('div', class_='number').text
        student_inter = clean_str_number(student_inter)
    except AttributeError:
        student_inter = -1
    
    # Build a dictionary for the parsed information
    detail_info = {COLUMN_TOTAL_FACULTY : int(faculty_member_total), 
                   COLUMN_INTERNATIONAL_FACULTY : int(faculty_member_inter), 
                   COLUMN_TOTAL_STUDENTS : int(student_total), 
                   COLUMN_INTERNATIONAL_STUDENTS : int(student_inter)
                  }
    
    return detail_info

After some work with the Postman Interceptor, we found out that, along with the response to the GET request made to the QS website, the browser fetches several additional files. One of those files is a JSON document with all the information from the ranking.

In [None]:
req = requests.get(QS_RANKING_JSON)
data_from_url = req.json()

Such data is stored as a list of dictionaries, as visible in the example below:

In [None]:
print('First cell:')
print(data_from_url['data'][0], end='\n\n')

print('Second cell:')
print(data_from_url['data'][1], end='\n\n')

print('...')
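
Since the payload is a list of dictionaries, it maps directly onto a dataframe. The sketch below shows the idea on a made-up payload: the keys `title`, `country`, `region` and `url` are the ones this notebook reads from the QS data, while the values and the `sample_payload` name are invented for the example.

```python
import pandas as pd

# Made-up records that mimic the shape of data_from_url['data']
sample_payload = {
    'data': [
        {'title': 'Example University', 'country': 'Italy', 'region': 'Europe', 'url': '/universities/example'},
        {'title': 'Sample Institute', 'country': 'Japan', 'region': 'Asia', 'url': '/universities/sample'},
    ]
}

# One row per dictionary, one column per key
sample_df = pd.DataFrame(sample_payload['data'])
sample_df['rank'] = range(1, len(sample_df) + 1)   # the position in the list doubles as rank
print(sample_df[['rank', 'title', 'country', 'region']])
```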

In [None]:
def scrape_qs_ranking():
    '''
    Obtain the top 200 of the QS ranking in a dataframe
    
    return the dataframe containing all the information (main and detail page) of the QS ranking
    '''
    
    r = requests.get(QS_RANKING_JSON)
    data = r.json()
    
    university_list = []

    # Iterate through the first 200 elements of the list
    for i, d in enumerate(data['data'][:200]):
    
        # Store the parsed information into a dictionary; the position in the list is used
        # as a numeric rank instead of d['rank_display']
        info = {COLUMN_RANK: (i+1), 
                COLUMN_UNIVERSITY_NAME: search_standard_name(d['title']), 
                COLUMN_COUNTRY: d['country'],
                COLUMN_REGION : d['region']
               }
    
        update_country_region_metadata(d['country'], d['region'])
    
        # Extend the dictionary with the information in the detail page
        url_detail = 'https://www.topuniversities.com' +  d['url']
        info.update( parse_detail_page( url_detail))
    
        university_list.append(info)
    
    # After scraping data from the QS ranking, the updated metadata dataframe needs to be stored
    COUNTRY_REGION_METADATA.to_pickle('serial/country_region_metadata.p')
    
    qs_ranking_df = pd.DataFrame.from_dict(university_list)
    return qs_ranking_df

In [None]:
try:
    qs_ranking_df = pd.read_pickle('serial/qs_save.p')
except (OSError, IOError) as e:
    qs_ranking_df = scrape_qs_ranking()
    qs_ranking_df.to_pickle('serial/qs_save.p')
    
qs_ranking_df.head()

In [None]:
qs_ranking_df.set_index(COLUMN_UNIVERSITY_NAME, inplace=True)

### Which are the best universities in term of ratio between faculty members and students according to QS?

In [None]:
# Compute the ratio between faculty member and students
qs_ranking_df[COLUMN_FACULTY_RATIO] = qs_ranking_df[COLUMN_TOTAL_FACULTY] / qs_ranking_df[COLUMN_TOTAL_STUDENTS]

# Clean up the computation taking care of unknown values (previously set at -1):
# the ratio is defined only when both operands are known
qs_fsratio_defined = (qs_ranking_df[COLUMN_TOTAL_FACULTY] != -1) & (qs_ranking_df[COLUMN_TOTAL_STUDENTS] != -1)
qs_ranking_df.loc[~qs_fsratio_defined, COLUMN_FACULTY_RATIO] = -1

# Define a dataset for the result computation
qs_faculty_students_rank_df = qs_ranking_df[[COLUMN_COUNTRY, COLUMN_REGION, COLUMN_TOTAL_FACULTY, COLUMN_TOTAL_STUDENTS, COLUMN_FACULTY_RATIO]]
qs_faculty_students_rank_df = qs_faculty_students_rank_df.sort_values(COLUMN_FACULTY_RATIO, ascending=False)

qs_faculty_students_rank_df.head()
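
One possible way to address the TODO about the `-1` sentinel is to convert it to `NaN` before computing the ratio, so that unknown values propagate on their own and never end up in the ranking. This is only a sketch on a made-up dataframe (the universities and numbers are hypothetical), not a change to the pipeline above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {'Total Faculty': [100, -1, 300], 'Total Students': [1000, 2000, -1]},
    index=['Uni A', 'Uni B', 'Uni C'],  # hypothetical universities
)

# Turn the -1 sentinel into NaN so that it propagates through the division
demo = demo.replace(-1, np.nan)
demo['Total Faculty / Total Student'] = demo['Total Faculty'] / demo['Total Students']

# sort_values puts NaN last by default, so unknown ratios never reach the top
ranked = demo.sort_values('Total Faculty / Total Student', ascending=False)
print(ranked)
```

With this approach no explicit mask is needed, and plotting functions skip the missing values.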

In [None]:
qs_faculty_students_rank_df[:10][COLUMN_FACULTY_RATIO].plot(kind='barh', figsize=(10,7), color='green')

plt.title('Top 10 Universities according to ratio between faculty members and students')
plt.ylabel("")
plt.xlabel('Faculty members - Total students ratio')
plt.show()

### Which are the best universities in term of ratio of international students according to QS?

In [None]:
# Compute the ratio of international students
qs_ranking_df[COLUMN_INTERNATIONAL_RATIO] = qs_ranking_df[COLUMN_INTERNATIONAL_STUDENTS] / qs_ranking_df[COLUMN_TOTAL_STUDENTS]

# Clean up the computation taking care of unknown values (previously set at -1):
# the ratio is defined only when both operands are known
qs_isratio_defined = (qs_ranking_df[COLUMN_INTERNATIONAL_STUDENTS] != -1) & (qs_ranking_df[COLUMN_TOTAL_STUDENTS] != -1)
qs_ranking_df.loc[~qs_isratio_defined, COLUMN_INTERNATIONAL_RATIO] = -1

# Define a dataset for the result computation
qs_international_students_rank_df = qs_ranking_df[[COLUMN_COUNTRY, COLUMN_REGION, COLUMN_INTERNATIONAL_STUDENTS, COLUMN_TOTAL_STUDENTS, COLUMN_INTERNATIONAL_RATIO]]
qs_international_students_rank_df = qs_international_students_rank_df.sort_values(COLUMN_INTERNATIONAL_RATIO, ascending=False)

qs_international_students_rank_df.head()

In [None]:
qs_international_students_rank_df[:10][COLUMN_INTERNATIONAL_RATIO].plot(kind='barh', figsize=(10,7))

plt.title('Top 10 Universities according to the ratio of international students')
plt.ylabel("")
plt.xlabel('International students ratio')
plt.show()

### Plot the results from QS ranking aggregating by region

For the faculty to student ratio:

In [None]:
qs_faculty_student_ratio_by_region_df = qs_faculty_students_rank_df.groupby(COLUMN_REGION).head(1)
qs_faculty_student_ratio_by_region_df[COLUMN_FACULTY_RATIO].plot(kind='barh',figsize=(10,7),  color='green')

plt.title('Top University per region according to ratio between faculty members and students')
plt.ylabel("")
plt.xlabel('Faculty members - Total students ratio')
plt.show()
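
The per-region plots rely on the dataframe being sorted by ratio beforehand: `groupby(...).head(1)` then keeps the first, and therefore best, row of every group. A minimal sketch of the idea on invented data (the universities, regions and ratios are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame(
    {'Region': ['Europe', 'Asia', 'Europe', 'Asia'],
     'Ratio': [0.10, 0.25, 0.30, 0.05]},
    index=['Uni A', 'Uni B', 'Uni C', 'Uni D'],  # hypothetical universities
)

# Sort first, then keep the first row of every group: the best ratio per region
top_per_region = demo.sort_values('Ratio', ascending=False).groupby('Region').head(1)
print(top_per_region)
```

The result keeps the order of the sorted dataframe, which is why the bars in the plots are already ordered by ratio.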

For the international to total student ratio:

In [None]:
qs_international_students_ratio_by_region_df = qs_international_students_rank_df.groupby(COLUMN_REGION).head(1)
qs_international_students_ratio_by_region_df[COLUMN_INTERNATIONAL_RATIO].plot(kind='barh', figsize=(10,7))

plt.title('Top University per region according to the ratio of international students')
plt.ylabel("")
plt.xlabel('International students ratio')
plt.show()

### Plot the results from QS ranking aggregating by country

For the faculty to student ratio:

In [None]:
qs_faculty_student_ratio_by_country_df = qs_faculty_students_rank_df.groupby(COLUMN_COUNTRY).head(1)
qs_faculty_student_ratio_by_country_df[COLUMN_FACULTY_RATIO].plot(kind='barh', figsize=(10,10), color='green')

plt.title('Top University per country according to ratio between faculty members and students')
plt.ylabel("")
plt.xlabel('Faculty members - Total students ratio')
plt.show()

For the international to total student ratio:

In [None]:
qs_international_students_ratio_by_country_df = qs_international_students_rank_df.groupby(COLUMN_COUNTRY).head(1)
qs_international_students_ratio_by_country_df[COLUMN_INTERNATIONAL_RATIO].plot(kind='barh', figsize=(10, 10))

plt.title('Top University per country according to the ratio of international students')
plt.ylabel("")
plt.xlabel('International students ratio')
plt.show()

## Task 2 - Scrape the Times ranking
Obtain the 200 top-ranking universities in www.timeshighereducation.com (ranking 2018). Repeat the analysis of the previous point and discuss briefly what you observed.

In [None]:
def compute_value_from_percentage(total, percentage):
    '''
    Function to compute the percentage of a given total
    
    total -- the value on which to calculate the percentage
    percentage -- the percentage value
    
    return the calculated value
    '''
    
    total = int( total )
    percentage = float( percentage )
    
    return round( (total/100) * percentage )

In [None]:
def compute_value_from_proportion(total, proportion):
    '''
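    Function to compute a value from a total and a ratio, dividing the total by the ratio
    (e.g. the number of faculty members from the number of students and the student/staff ratio)
    
    total -- the value to divide
    proportion -- the ratio between the total and the value to compute
    
    return the calculated value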
    '''
    
    total = int( total )
    proportion = float( proportion )
    
    return round( total / proportion )
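
As a worked example of the two helpers, the sketch below re-declares them in compact form and applies them to a hypothetical university with 20,000 students, 25% international students and 8 students per staff member. The input values are made up; the Times data provides them as strings, which is why strings are used here too.

```python
def compute_value_from_percentage(total, percentage):
    # Same logic as above: percentage of the total, rounded to an integer
    return round((int(total) / 100) * float(percentage))

def compute_value_from_proportion(total, proportion):
    # Same logic as above: total divided by the ratio, rounded to an integer
    return round(int(total) / float(proportion))

# Hypothetical university: 20,000 students, 25% international, 8 students per staff member
students = '20000'
international_students = compute_value_from_percentage(students, '25')
faculty_members = compute_value_from_proportion(students, '8.0')
print(international_students, faculty_members)
```

In other words, 20000 × 25 / 100 = 5000 international students and 20000 / 8 = 2500 faculty members.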

In [None]:
def scrape_times_ranking():
    '''
    Obtain the top 200 of the Times ranking in a dataframe
    
    return the dataframe containing all the information of the Times ranking
    '''

    r = requests.get(TIMES_RANKING_JSON)
    data = r.json()

    university_list = []

    # Iterate through the first 200 elements of the list
    for i, d in enumerate(data['data'][:200]):
    
        # Preliminary computations to extract data
        intern_student = compute_value_from_percentage( clean_str_number( d['stats_number_students']), 
                                                       clean_str_number( d['stats_pc_intl_students'])
                                                      )
    
        faculty_member_total = compute_value_from_proportion( clean_str_number( d['stats_number_students']), 
                                                             clean_str_number( d['stats_student_staff_ratio'])
                                                            )
        
        # Determine region from the data of the QS ranking stored in the metadata
        try:
            region = COUNTRY_REGION_METADATA.loc[d['location'], COLUMN_REGION]
        except KeyError:  # country not present in the QS ranking
            region = 'NaN'
    
        # Store the parsed information into a dictionary
        info = {COLUMN_RANK: (i+1), 
                COLUMN_UNIVERSITY_NAME: search_standard_name(d['name']), 
                COLUMN_COUNTRY: d['location'],
                COLUMN_REGION : region,
                COLUMN_TOTAL_STUDENTS : int(clean_str_number( d['stats_number_students'])),
                COLUMN_INTERNATIONAL_STUDENTS : int(intern_student),
                COLUMN_TOTAL_FACULTY : int(faculty_member_total)
               }
    
        university_list.append(info)
   
    times_ranking_df = pd.DataFrame.from_dict(university_list)
    return times_ranking_df

In [None]:
try:
    times_ranking_df = pd.read_pickle('serial/times_save.p')
except (OSError, IOError) as e:
    times_ranking_df = scrape_times_ranking()
    times_ranking_df.to_pickle('serial/times_save.p')
    
times_ranking_df.head()

In [None]:
times_ranking_df.set_index(COLUMN_UNIVERSITY_NAME, inplace=True)

Even after deriving the region information from the first dataset, some data might still be missing. We fix those entries in the table manually.

In [None]:
times_ranking_df[times_ranking_df[COLUMN_REGION] == 'NaN']

In [None]:
# Manually fix the missing data (set_value was removed from pandas, so we use .loc)
times_ranking_df.loc['University of Luxembourg', COLUMN_REGION] = 'Europe'
times_ranking_df.loc['Lomonosov Moscow State University', COLUMN_REGION] = 'Europe'

# Check the new state
times_ranking_df[times_ranking_df[COLUMN_COUNTRY] == 'Luxembourg']

### Which are the best universities in term of ratio between faculty members and students according to Times?

In [None]:
# Compute the ratio between faculty member and students
times_ranking_df[COLUMN_FACULTY_RATIO] = times_ranking_df[COLUMN_TOTAL_FACULTY] / times_ranking_df[COLUMN_TOTAL_STUDENTS]

# Clean up the computation taking care of unknown values (set at -1):
# the ratio is defined only when both operands are known
times_fsratio_defined = (times_ranking_df[COLUMN_TOTAL_FACULTY] != -1) & (times_ranking_df[COLUMN_TOTAL_STUDENTS] != -1)
times_ranking_df.loc[~times_fsratio_defined, COLUMN_FACULTY_RATIO] = -1

# Define a dataset for the result computation
times_faculty_students_rank_df = times_ranking_df[[COLUMN_COUNTRY, COLUMN_REGION, COLUMN_TOTAL_FACULTY, COLUMN_TOTAL_STUDENTS, COLUMN_FACULTY_RATIO]]
times_faculty_students_rank_df = times_faculty_students_rank_df.sort_values(COLUMN_FACULTY_RATIO, ascending=False)

times_faculty_students_rank_df.head()

In [None]:
times_faculty_students_rank_df[:10][COLUMN_FACULTY_RATIO].plot(kind='barh', figsize=(10,7), color='green')

plt.title('Top 10 Universities according to ratio between faculty members and students')
plt.ylabel("")
plt.xlabel('Faculty members - Total students ratio')
plt.show()

### Which are the best universities in term of ratio of international students according to Times?

In [None]:
# Compute the ratio of international students
times_ranking_df[COLUMN_INTERNATIONAL_RATIO] = times_ranking_df[COLUMN_INTERNATIONAL_STUDENTS] / times_ranking_df[COLUMN_TOTAL_STUDENTS]

# Clean up the computation taking care of unknown values (set at -1):
# the ratio is defined only when both operands are known
times_isratio_defined = (times_ranking_df[COLUMN_INTERNATIONAL_STUDENTS] != -1) & (times_ranking_df[COLUMN_TOTAL_STUDENTS] != -1)
times_ranking_df.loc[~times_isratio_defined, COLUMN_INTERNATIONAL_RATIO] = -1

# Define a dataset for the result computation
times_international_students_rank_df = times_ranking_df[[COLUMN_COUNTRY, COLUMN_REGION, COLUMN_INTERNATIONAL_STUDENTS, COLUMN_TOTAL_STUDENTS, COLUMN_INTERNATIONAL_RATIO]]
times_international_students_rank_df = times_international_students_rank_df.sort_values(COLUMN_INTERNATIONAL_RATIO, ascending=False)

times_international_students_rank_df.head()

In [None]:
times_international_students_rank_df[:10][COLUMN_INTERNATIONAL_RATIO].plot(kind='barh', figsize=(10,7))

plt.title('Top 10 Universities according to the ratio of international students')
plt.ylabel("")
plt.xlabel('International students ratio')
plt.show()

### Plot the results from Times ranking aggregating by region

For the faculty to student ratio:

In [None]:
times_faculty_student_ratio_by_region_df = times_faculty_students_rank_df.groupby(COLUMN_REGION).head(1)
times_faculty_student_ratio_by_region_df[COLUMN_FACULTY_RATIO].plot(kind='barh',figsize=(10,7),  color='green')

plt.title('Top University per region according to ratio between faculty members and students')
plt.ylabel("")
plt.xlabel('Faculty members - Total students ratio')
plt.show()

For the international to total student ratio:

In [None]:
times_international_students_ratio_by_region_df = times_international_students_rank_df.groupby(COLUMN_REGION).head(1)
times_international_students_ratio_by_region_df[COLUMN_INTERNATIONAL_RATIO].plot(kind='barh', figsize=(10, 10))

plt.title('Top University per region according to the ratio of international students')
plt.ylabel("")
plt.xlabel('International students ratio')
plt.show()

### Plot the results from Times ranking aggregating by country

In [None]:
times_faculty_student_ratio_by_country_df = times_faculty_students_rank_df.groupby(COLUMN_COUNTRY).head(1)
times_faculty_student_ratio_by_country_df[COLUMN_FACULTY_RATIO].plot(kind='barh', figsize=(10,10), color='green')

plt.title('Top University per country according to ratio between faculty members and students')
plt.ylabel("")
plt.xlabel('Faculty members - Total students ratio')
plt.show()

In [None]:
times_international_students_ratio_by_country_df = times_international_students_rank_df.groupby(COLUMN_COUNTRY).head(1)
times_international_students_ratio_by_country_df[COLUMN_INTERNATIONAL_RATIO].plot(kind='barh', figsize=(10, 10))

plt.title('Top University per country according to the ratio of international students')
plt.ylabel("")
plt.xlabel('International students ratio')
plt.show()

## Task 3 - Merge the dataframes
Merge the two DataFrames created in questions 1 and 2 using university names. Match universities' names as well as you can, and explain your strategy. Keep track of the original position in both rankings.

Merging is straightforward because we already standardized the university names during the scraping task. Since the names in both rankings were resolved to Wikipedia page titles, matching universities share exactly the same name. We first reorganize the two dataframes as follows:

In [None]:
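# Obtain a column for the university name on which to do the merging.
# The index name is cleared so that the merge key is not ambiguous between index and column
qs_ranking_df[COLUMN_UNIVERSITY_NAME] = qs_ranking_df.index
qs_ranking_df.index.name = None
times_ranking_df[COLUMN_UNIVERSITY_NAME] = times_ranking_df.index
times_ranking_df.index.name = None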

In [None]:
# Clean up the values in QS ranking for perfect merge
qs_ranking_df[COLUMN_UNIVERSITY_NAME] = qs_ranking_df[COLUMN_UNIVERSITY_NAME].str.strip()
qs_ranking_df[COLUMN_COUNTRY] = qs_ranking_df[COLUMN_COUNTRY].str.strip()

# Clean up the values in Times ranking for perfect merge
times_ranking_df[COLUMN_UNIVERSITY_NAME] = times_ranking_df[COLUMN_UNIVERSITY_NAME].str.strip()
times_ranking_df[COLUMN_COUNTRY] = times_ranking_df[COLUMN_COUNTRY].str.strip()

Then we merge the two dataframes into one:

In [None]:
# Merge the dataframes via an outer join on the cols name, country and region
merged_ranking_df = pd.merge(qs_ranking_df, times_ranking_df, 
                             on=[COLUMN_UNIVERSITY_NAME, COLUMN_REGION, COLUMN_COUNTRY], 
                             how='outer', 
                             suffixes=('_QS', '_TM')
                            )

# Use the University Name as an index for the merged dataframe too
merged_ranking_df.set_index(COLUMN_UNIVERSITY_NAME, inplace=True)

merged_ranking_df.head()
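
To check how well the names actually match, the outer join can be run with `indicator=True`, which adds a `_merge` column telling whether a row comes from one ranking or from both. The sketch below uses two tiny invented dataframes; the universities, countries and ranks are hypothetical, and only the column names follow this notebook.

```python
import pandas as pd

qs_demo = pd.DataFrame({'University Name': ['Uni A', 'Uni B'],
                        'Country': ['Italy', 'Japan'],
                        'Rank': [1, 2]})
times_demo = pd.DataFrame({'University Name': ['Uni B', 'Uni C'],
                           'Country': ['Japan', 'Chile'],
                           'Rank': [1, 2]})

merged_demo = pd.merge(qs_demo, times_demo,
                       on=['University Name', 'Country'],
                       how='outer',
                       suffixes=('_QS', '_TM'),
                       indicator=True)  # adds a _merge column: left_only, right_only or both
print(merged_demo)
```

Rows marked `left_only` or `right_only` are either universities present in a single ranking or names that failed to match, so they are the ones worth inspecting by hand.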

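For the columns present in both rankings, we keep the mean of the QS and Times values (or the single available value when a university appears in only one ranking):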

In [None]:
cols = [COLUMN_TOTAL_STUDENTS, COLUMN_INTERNATIONAL_STUDENTS, COLUMN_TOTAL_FACULTY, 
        COLUMN_FACULTY_RATIO, COLUMN_INTERNATIONAL_RATIO]

for x in cols:
    merged_ranking_df[x] = merged_ranking_df[['{0}_{1}'.format(x, 'QS'), '{0}_{1}'.format(x, 'TM')]].mean(axis=1)
    merged_ranking_df.drop(['{0}_{1}'.format(x, 'QS'), '{0}_{1}'.format(x, 'TM')], axis=1, inplace=True)
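
The reason a plain `mean(axis=1)` is enough here is that pandas skips `NaN` when averaging. A small sketch on invented numbers (the universities and student counts are hypothetical):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {'Total Students_QS': [1000.0, 2000.0, np.nan],
     'Total Students_TM': [1200.0, np.nan, 3000.0]},
    index=['Uni A', 'Uni B', 'Uni C'],  # hypothetical universities
)

# mean(axis=1) skips NaN, so a university listed in one ranking keeps that ranking's value
demo['Total Students'] = demo[['Total Students_QS', 'Total Students_TM']].mean(axis=1)
print(demo['Total Students'])
```

Note that a `-1` sentinel is not skipped in the same way: it would be averaged like any other number.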

Sample of the final result:

In [None]:
merged_ranking_df.head()

## Task 4 - Exploratory analysis

It is interesting to see how the top universities are distributed among the continents. Nearly half of these top universities are in Europe, and around one fourth of them are in North America.

In [None]:
explode = (0, 0, 0, 0.1, 0.2, 0.3)
merged_ranking_df[COLUMN_REGION].value_counts().plot(kind='pie', autopct='%1.1f%%', figsize=(12,12),fontsize=17, explode = explode)

plt.ylabel("")
plt.axis('equal')
plt.show()

We can then classify the universities according to their size. We are considering that their size is determined by the number of students, with the same classification that was used for the QS ranking (Described [here](http://www.iu.qs.com/university-rankings/qs-classifications/)).

In [None]:
def university_sizes(x):
    """Returns a size label based on the number of students (None if unknown)."""
    if pd.isnull(x) or not x:
        return None
    else:
        return ('Small' if x < 5000 else 'Medium' if x < 12000
                else 'Large' if x < 30000 else 'Extra Large')
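
An alternative, vectorized way to obtain the same labels is `pd.cut`. The sketch below is only an illustration under the same thresholds (5,000, 12,000 and 30,000 students); the student counts are made up.

```python
import numpy as np
import pandas as pd

students = pd.Series([3000, 8000, 20000, 45000, np.nan],
                     index=['Uni A', 'Uni B', 'Uni C', 'Uni D', 'Uni E'])  # made-up sizes

# right=False makes every bin closed on the left, matching the "<" comparisons above
size_labels = pd.cut(students,
                     bins=[0, 5000, 12000, 30000, np.inf],
                     right=False,
                     labels=['Small', 'Medium', 'Large', 'Extra Large'])
print(size_labels)
```

Missing student counts stay missing. One difference from the function above is that a count of 0 would be labelled `Small` here instead of `None`.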

In [None]:
merged_ranking_df['Size'] = merged_ranking_df[COLUMN_TOTAL_STUDENTS].apply(lambda x: university_sizes(x))
merged_ranking_df.head()

When we look at the proportions of top universities according to their size, it is interesting to see that over 80% of them have more than 12,000 students (Large or Extra Large), and out of these, 30% are considered to be extra large.

In [None]:
explode = (0, 0, 0.1, 0.2)
merged_ranking_df['Size'].value_counts().plot(kind='pie', autopct='%1.1f%%', figsize=(12,12), fontsize=17, explode = explode)

plt.ylabel("")
plt.axis('equal')
plt.show()

With the following boxplots we want to identify outliers within each region, and see if there is a difference in the ratio depending on the region of the university. 

In [None]:
sns.boxplot(x=COLUMN_REGION, y=COLUMN_FACULTY_RATIO, data=merged_ranking_df)

In [None]:
sns.boxplot(x=COLUMN_REGION, y=COLUMN_INTERNATIONAL_RATIO, data=merged_ranking_df);

In [None]:
sns.boxplot(x='Size', y=COLUMN_FACULTY_RATIO, data=merged_ranking_df);

In [None]:
sns.boxplot(x='Size', y=COLUMN_INTERNATIONAL_RATIO, data=merged_ranking_df);

Next, we classify the universities into 2, 3 and 4 groups depending on their international students ratio, and plot the number of students, the number of faculty members and the faculty/student ratio for each group to see whether there is any correlation.

TODO: The code below is exploratory and needs cleaning up; it was only written to check the plots.

In [None]:
median_internationality = merged_ranking_df[COLUMN_INTERNATIONAL_RATIO].median()
merged_ranking_df['Internationality2'] = merged_ranking_df[COLUMN_INTERNATIONAL_RATIO].apply(lambda x: 'International' if x>median_internationality else 'Not international')

quantiles_int = merged_ranking_df[COLUMN_INTERNATIONAL_RATIO].quantile([.33, .66])
merged_ranking_df['Internationality3'] = merged_ranking_df[COLUMN_INTERNATIONAL_RATIO].apply(lambda x: 'Not international' if x<quantiles_int[.33] else 'International' if x<quantiles_int[.66] else 'Very international')

quantiles_int = merged_ranking_df[COLUMN_INTERNATIONAL_RATIO].quantile([.25, .5, .75])
merged_ranking_df['Internationality4'] = merged_ranking_df[COLUMN_INTERNATIONAL_RATIO].apply(lambda x: 'Not' if x<quantiles_int[.25] else 'A bit' if x<quantiles_int[.5] else 'Quite' if x<quantiles_int[.75] else 'Very')
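
The nested lambdas above could probably be replaced by `pd.qcut`, which splits a series into quantile-based groups in one call. A sketch on invented ratios, reusing the four labels from the cell above:

```python
import pandas as pd

ratios = pd.Series([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40])  # made-up ratios

# q=4 splits the values at the 25%, 50% and 75% quantiles
groups = pd.qcut(ratios, q=4, labels=['Not', 'A bit', 'Quite', 'Very'])
print(groups.value_counts())
```

One caveat: `pd.qcut` uses intervals closed on the right, so a value lying exactly on a quantile goes to the lower group, whereas the `<` comparisons above send it to the upper one.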

In [None]:
sns.boxplot(x=COLUMN_TOTAL_STUDENTS, y='Internationality2', data=merged_ranking_df);

In [None]:
sns.boxplot(x=COLUMN_TOTAL_STUDENTS, y='Internationality3', data=merged_ranking_df);

In [None]:
sns.boxplot(x=COLUMN_TOTAL_STUDENTS, y='Internationality4', data=merged_ranking_df);

In [None]:
sns.boxplot(x=COLUMN_TOTAL_FACULTY, y='Internationality2', data=merged_ranking_df);

In [None]:
sns.boxplot(x=COLUMN_TOTAL_FACULTY, y='Internationality3', data=merged_ranking_df);

In [None]:
sns.boxplot(x=COLUMN_TOTAL_FACULTY, y='Internationality4', data=merged_ranking_df);

In [None]:
sns.boxplot(x=COLUMN_FACULTY_RATIO, y='Internationality2', data=merged_ranking_df);

In [None]:
sns.boxplot(x=COLUMN_FACULTY_RATIO, y='Internationality3', data=merged_ranking_df);

In [None]:
sns.boxplot(x=COLUMN_FACULTY_RATIO, y='Internationality4', data=merged_ranking_df);

In [None]:
# Only numeric columns can be correlated, so the region is left out
cols = ['{0}_QS'.format(COLUMN_RANK), '{0}_TM'.format(COLUMN_RANK), COLUMN_TOTAL_STUDENTS, COLUMN_INTERNATIONAL_STUDENTS, 
        COLUMN_TOTAL_FACULTY, COLUMN_FACULTY_RATIO, COLUMN_INTERNATIONAL_RATIO]

merged_ranking_df[cols].corr()

In [None]:
#plt.matshow(merged_ranking_df[['International faculty member', 'Rank_QS', 'Region', 'Rank_TM', 'Total student', 'International student', 'Total faculty member', 'Faculty/students ratio', 'Intern/student ratio']].corr())

plt.matshow(merged_ranking_df[cols].corr())
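
When a dataframe mixes text and numeric columns, one way to build the correlation matrix is to select the numeric columns first. The sketch below uses invented data in which the ratio decreases linearly with the rank, so the expected Pearson correlation is -1; the column values are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    'Rank_QS': [1, 2, 3, 4],
    'Region': ['Europe', 'Asia', 'Europe', 'Asia'],   # text column, left out of the matrix
    'Total Faculty / Total Student': [0.40, 0.30, 0.20, 0.10],
})

# Keep only the numeric columns before computing the Pearson correlation matrix
corr_matrix = demo.select_dtypes(include='number').corr()
print(corr_matrix)
```

A negative correlation with the rank means that better-placed universities (lower rank number) tend to have a higher value of the other variable.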

Summary of this last part and TODOs:
* We might want to get a numeric ranking, so that we can check the correlation between the positions in the rankings and the other variables. This could be achieved by saving the order of the rows as the rank in the parsing.
* I don't know what to do with the NaNs. Maybe the best thing is to ignore those with masks, or try to guess them from similar universities (would be more difficult though). As long as we're ignoring them, I think we should be able to plot everything else without them.
* I couldn't conclude important things from the boxplots, at least nothing (IMO) that we can use for Task 5. Bear in mind that we had `-1` all over the place...

## Task 5 - 