##### FOCS Project

You will work on the [University dataset](https://drive.google.com/drive/folders/1Hs3nRtK_F3h8eg59B4-TD1DEua6g8Klv). It contains three different university rankings:
- the Times Higher Education World University Ranking, *Times* for short,
- the Academic Ranking of World Universities, *Shanghai* for short,
- the Center for World University Rankings, *cwur* for short.

Notes
1. It is mandatory to use GitHub for developing the project.
2. The project must be a Jupyter notebook.
3. There is no restriction on the libraries that can be used, nor on the Python version.
4. All questions on the project **must** be asked in a public channel on [Zulip](https://focs.zulipchat.com).

In [8]:
import pandas as pd
import re
import numpy as np

In [9]:
data_folder = "Data/"
times = pd.read_csv(f'{data_folder}timesData.csv', thousands=',')
times.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152.0,8.9,25%,,2011
1,2,California Institute of Technology,United States of America,97.7,54.6,98.0,99.9,83.7,96.0,2243.0,6.9,27%,33 : 67,2011
2,3,Massachusetts Institute of Technology,United States of America,97.8,82.3,91.4,99.9,87.5,95.6,11074.0,9.0,33%,37 : 63,2011
3,4,Stanford University,United States of America,98.3,29.5,98.1,99.2,64.3,94.3,15596.0,7.8,22%,42 : 58,2011
4,5,Princeton University,United States of America,90.9,70.3,95.4,99.9,-,94.2,7929.0,8.4,27%,45 : 55,2011


In [10]:
shanghai = pd.read_csv(f'{data_folder}shanghaiData.csv')
shanghai.head()

Unnamed: 0,world_rank,university_name,national_rank,total_score,alumni,award,hici,ns,pub,pcp,year
0,1,Harvard University,1,100.0,100.0,100.0,100.0,100.0,100.0,72.4,2005
1,2,University of Cambridge,1,73.6,99.8,93.4,53.3,56.6,70.9,66.9,2005
2,3,Stanford University,2,73.4,41.1,72.2,88.5,70.9,72.3,65.0,2005
3,4,"University of California, Berkeley",3,72.8,71.8,76.0,69.4,73.9,72.2,52.7,2005
4,5,Massachusetts Institute of Technology (MIT),4,70.1,74.0,80.6,66.7,65.8,64.3,53.0,2005


In [11]:
cwur = pd.read_csv(f'{data_folder}cwurData.csv')
cwur.head()

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012


Check for missing (NaN) values

In [12]:
times.isnull().any()

world_rank                False
university_name           False
country                   False
teaching                  False
international             False
research                  False
citations                 False
income                    False
total_score               False
num_students               True
student_staff_ratio        True
international_students     True
female_male_ratio          True
year                      False
dtype: bool

In [13]:
shanghai.isnull().any()

world_rank         False
university_name     True
national_rank       True
total_score         True
alumni              True
award               True
hici                True
ns                  True
pub                 True
pcp                 True
year               False
dtype: bool

In [166]:
shanghai[shanghai['university_name'].isnull()]

Unnamed: 0,world_rank,university_name,national_rank,total_score,alumni,award,hici,ns,pub,pcp,year
3896,99,,,,,,,,,,2013


This row is dropped from the dataset: without a *university_name* it cannot be attributed to any university, so it carries no usable information.

In [169]:
shanghai.dropna(subset = ['university_name'], inplace = True)

In [171]:
shanghai.isnull().any()

world_rank         False
university_name    False
national_rank      False
total_score         True
alumni             False
award               True
hici                True
ns                  True
pub                 True
pcp                 True
year               False
dtype: bool

In [14]:
cwur.isnull().any()

world_rank              False
institution             False
country                 False
national_rank           False
quality_of_education    False
alumni_employment       False
quality_of_faculty      False
publications            False
influence               False
citations               False
broad_impact             True
patents                 False
score                   False
year                    False
dtype: bool

## 1. For each university, extract from the times dataset the most recent and the least recent data, obtaining two separate dataframes

In [15]:
times['year'].dtype

dtype('int64')

In [16]:
times[times['university_name'] == 'Harvard University']

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
0,1,Harvard University,United States of America,99.7,72.4,98.7,98.8,34.5,96.1,20152.0,8.9,25%,,2011
201,2,Harvard University,United States of America,95.8,67.5,97.4,99.8,35.9,93.9,20152.0,8.9,25%,,2012
605,4,Harvard University,United States of America,94.9,63.7,98.6,99.2,39.9,93.6,20152.0,8.9,25%,,2013
1003,2,Harvard University,United States of America,95.3,66.2,98.5,99.1,40.6,93.9,20152.0,8.9,25%,,2014
1403,2,Harvard University,United States of America,92.9,67.6,98.6,98.9,44.0,93.3,20152.0,8.9,25%,,2015
1808,6,Harvard University,United States of America,83.6,77.2,99.0,99.8,45.2,91.6,20152.0,8.9,25%,,2016


In [17]:
# idxmax returns index labels, so select the rows with .loc
times_max_year = times.loc[times.groupby('university_name')['year'].idxmax()].copy()
times_max_year.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
2405,601-800,AGH University of Science and Technology,Poland,14.2,17.9,3.7,35.7,-,-,35569.0,17.0,1%,-,2016
2003,201-250,Aalborg University,Denmark,25.1,71.0,28.4,73.8,43.7,-,17422.0,15.9,15%,48 : 52,2016
2056,251-300,Aalto University,Finland,31.1,65.4,32.8,62.1,61.6,-,16099.0,24.2,17%,32 : 68,2016
1908,=106,Aarhus University,Denmark,36.9,76.8,50.7,79.8,68.3,57.7,23895.0,13.6,14%,54 : 46,2016
2105,301-350,Aberystwyth University,United Kingdom,21.6,72.2,18.9,67.2,31.3,-,9252.0,19.2,18%,48 : 52,2016


In [18]:
times_min_year = times.loc[times.groupby('university_name')['year'].idxmin()].copy()
times_min_year.head()

Unnamed: 0,world_rank,university_name,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
2405,601-800,AGH University of Science and Technology,Poland,14.2,17.9,3.7,35.7,-,-,35569.0,17.0,1%,-,2016
501,301-350,Aalborg University,Denmark,19.0,75.3,20.0,27.1,36.4,-,17422.0,15.9,15%,48 : 52,2012
502,301-350,Aalto University,Finland,26.2,49.0,22.2,37.5,61.9,-,16099.0,24.2,17%,32 : 68,2012
166,167,Aarhus University,Denmark,38.1,33.4,55.6,57.3,61.5,49.9,23895.0,13.6,14%,54 : 46,2011
476,276-300,Aberystwyth University,United Kingdom,19.8,63.8,15.5,56.6,35.5,-,9252.0,19.2,18%,48 : 52,2012
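
As a cross-check on the `groupby` / `idxmax` approach, the same two dataframes can be obtained by sorting on `year` and keeping the last or first row per university. The sketch below uses a small made-up table (the names `toy`, `latest` and `earliest` are illustrative, not part of the notebook) and assumes each university appears at most once per year.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    "university_name": ["A", "A", "B", "B", "B", "C"],
    "year": [2011, 2016, 2012, 2014, 2016, 2016],
    "income": ["34.5", "45.2", "36.4", "40.0", "43.7", "-"],
})

# Sort once by year, then keep the last / first row of each university
by_year = toy.sort_values("year", kind="stable")
latest = by_year.drop_duplicates("university_name", keep="last")
earliest = by_year.drop_duplicates("university_name", keep="first")

print(latest.sort_values("university_name"))
print(earliest.sort_values("university_name"))
```

A university that appears in a single year (like `C` here) ends up with the same row in both dataframes, which is also what happens to AGH University of Science and Technology in the outputs above.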


## 2. For each university, compute the improvement in income between the least recent and the most recent data points

In [19]:
merged = pd.merge(times_min_year, times_max_year, on = ['university_name', 'country'], suffixes = ['_min', '_max'])
merged.head()

Unnamed: 0,world_rank_min,university_name,country,teaching_min,international_min,research_min,citations_min,income_min,total_score_min,num_students_min,...,international_max,research_max,citations_max,income_max,total_score_max,num_students_max,student_staff_ratio_max,international_students_max,female_male_ratio_max,year_max
0,601-800,AGH University of Science and Technology,Poland,14.2,17.9,3.7,35.7,-,-,35569.0,...,17.9,3.7,35.7,-,-,35569.0,17.0,1%,-,2016
1,301-350,Aalborg University,Denmark,19.0,75.3,20.0,27.1,36.4,-,17422.0,...,71.0,28.4,73.8,43.7,-,17422.0,15.9,15%,48 : 52,2016
2,301-350,Aalto University,Finland,26.2,49.0,22.2,37.5,61.9,-,16099.0,...,65.4,32.8,62.1,61.6,-,16099.0,24.2,17%,32 : 68,2016
3,167,Aarhus University,Denmark,38.1,33.4,55.6,57.3,61.5,49.9,23895.0,...,76.8,50.7,79.8,68.3,57.7,23895.0,13.6,14%,54 : 46,2016
4,276-300,Aberystwyth University,United Kingdom,19.8,63.8,15.5,56.6,35.5,-,9252.0,...,72.2,18.9,67.2,31.3,-,9252.0,19.2,18%,48 : 52,2016


In [20]:
merged[merged['university_name'] == 'Harvard University']['income_max'].dtype

dtype('O')
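
The object dtype is a consequence of the `-` placeholder that marks missing incomes, which forces the whole column to be read as strings. One possible alternative, sketched here on a made-up CSV snippet, is to declare the placeholder at load time through the `na_values` parameter of `pd.read_csv`. Note that this would apply to every column, so other columns using `-` (such as `total_score` or `female_male_ratio`) would be affected too.

```python
import io
import pandas as pd

# Tiny CSV in the same spirit as timesData.csv (values are made up)
csv_text = """university_name,income,year
A,34.5,2011
B,-,2011
C,87.5,2011
"""

# na_values tells read_csv to treat a cell equal to "-" as missing
toy = pd.read_csv(io.StringIO(csv_text), na_values=["-"])

print(toy["income"].dtype)         # numeric dtype instead of object
print(toy["income"].isna().sum())  # number of missing incomes
```

Only cells that are exactly `-` are converted, so ranges such as `601-800` in `world_rank` would be left untouched.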

In [21]:
def difference(row):
    if (row['income_max'] == '-') or (row['income_min'] == '-'):
        return 'data not available'
    else:
        return float(row['income_max']) - float(row['income_min'])

In [22]:
merged['difference_for'] = merged.apply(difference, axis = 1)
merged.head()

Unnamed: 0,world_rank_min,university_name,country,teaching_min,international_min,research_min,citations_min,income_min,total_score_min,num_students_min,...,research_max,citations_max,income_max,total_score_max,num_students_max,student_staff_ratio_max,international_students_max,female_male_ratio_max,year_max,difference_for
0,601-800,AGH University of Science and Technology,Poland,14.2,17.9,3.7,35.7,-,-,35569.0,...,3.7,35.7,-,-,35569.0,17.0,1%,-,2016,data not available
1,301-350,Aalborg University,Denmark,19.0,75.3,20.0,27.1,36.4,-,17422.0,...,28.4,73.8,43.7,-,17422.0,15.9,15%,48 : 52,2016,7.3
2,301-350,Aalto University,Finland,26.2,49.0,22.2,37.5,61.9,-,16099.0,...,32.8,62.1,61.6,-,16099.0,24.2,17%,32 : 68,2016,-0.3
3,167,Aarhus University,Denmark,38.1,33.4,55.6,57.3,61.5,49.9,23895.0,...,50.7,79.8,68.3,57.7,23895.0,13.6,14%,54 : 46,2016,6.8
4,276-300,Aberystwyth University,United Kingdom,19.8,63.8,15.5,56.6,35.5,-,9252.0,...,18.9,67.2,31.3,-,9252.0,19.2,18%,48 : 52,2016,-4.2


In [23]:
merged['income_max'] = pd.to_numeric(merged['income_max'], errors = 'coerce')
merged['income_min'] = pd.to_numeric(merged['income_min'], errors = 'coerce')
merged.head()

Unnamed: 0,world_rank_min,university_name,country,teaching_min,international_min,research_min,citations_min,income_min,total_score_min,num_students_min,...,research_max,citations_max,income_max,total_score_max,num_students_max,student_staff_ratio_max,international_students_max,female_male_ratio_max,year_max,difference_for
0,601-800,AGH University of Science and Technology,Poland,14.2,17.9,3.7,35.7,,-,35569.0,...,3.7,35.7,,-,35569.0,17.0,1%,-,2016,data not available
1,301-350,Aalborg University,Denmark,19.0,75.3,20.0,27.1,36.4,-,17422.0,...,28.4,73.8,43.7,-,17422.0,15.9,15%,48 : 52,2016,7.3
2,301-350,Aalto University,Finland,26.2,49.0,22.2,37.5,61.9,-,16099.0,...,32.8,62.1,61.6,-,16099.0,24.2,17%,32 : 68,2016,-0.3
3,167,Aarhus University,Denmark,38.1,33.4,55.6,57.3,61.5,49.9,23895.0,...,50.7,79.8,68.3,57.7,23895.0,13.6,14%,54 : 46,2016,6.8
4,276-300,Aberystwyth University,United Kingdom,19.8,63.8,15.5,56.6,35.5,-,9252.0,...,18.9,67.2,31.3,-,9252.0,19.2,18%,48 : 52,2016,-4.2


In [24]:
merged['difference'] = merged['income_max'] - merged['income_min']
merged.head()

Unnamed: 0,world_rank_min,university_name,country,teaching_min,international_min,research_min,citations_min,income_min,total_score_min,num_students_min,...,citations_max,income_max,total_score_max,num_students_max,student_staff_ratio_max,international_students_max,female_male_ratio_max,year_max,difference_for,difference
0,601-800,AGH University of Science and Technology,Poland,14.2,17.9,3.7,35.7,,-,35569.0,...,35.7,,-,35569.0,17.0,1%,-,2016,data not available,
1,301-350,Aalborg University,Denmark,19.0,75.3,20.0,27.1,36.4,-,17422.0,...,73.8,43.7,-,17422.0,15.9,15%,48 : 52,2016,7.3,7.3
2,301-350,Aalto University,Finland,26.2,49.0,22.2,37.5,61.9,-,16099.0,...,62.1,61.6,-,16099.0,24.2,17%,32 : 68,2016,-0.3,-0.3
3,167,Aarhus University,Denmark,38.1,33.4,55.6,57.3,61.5,49.9,23895.0,...,79.8,68.3,57.7,23895.0,13.6,14%,54 : 46,2016,6.8,6.8
4,276-300,Aberystwyth University,United Kingdom,19.8,63.8,15.5,56.6,35.5,-,9252.0,...,67.2,31.3,-,9252.0,19.2,18%,48 : 52,2016,-4.2,-4.2


In [25]:
merged[['difference_for', 'difference']]

Unnamed: 0,difference_for,difference
0,data not available,
1,7.3,7.3
2,-0.3,-0.3
3,6.8,6.8
4,-4.2,-4.2
...,...,...
811,6.4,6.4
812,5.6,5.6
813,data not available,
814,27.4,27.4


## 3. Find the university with the largest increase computed in the previous point

In [26]:
merged[merged['difference_for'] != 'data not available'].sort_values('difference_for', ascending = False)

Unnamed: 0,world_rank_min,university_name,country,teaching_min,international_min,research_min,citations_min,income_min,total_score_min,num_students_min,...,citations_max,income_max,total_score_max,num_students_max,student_staff_ratio_max,international_students_max,female_male_ratio_max,year_max,difference_for,difference
427,251-275,TU Dresden,Germany,27.3,49.2,13.8,57.4,31.9,-,35487.0,...,66.1,99.7,52.1,35487.0,37.4,12%,42 : 58,2016,67.8,67.8
277,174,Nanyang Technological University,Singapore,43.6,96.3,51.7,45.0,40.0,49.0,25028.0,...,85.6,99.9,68.2,25028.0,16.2,33%,48 : 52,2016,59.9,59.9
229,61,LMU Munich,Germany,59.1,43.1,57.5,76.4,40.4,63.0,35691.0,...,85.7,100.0,77.3,35691.0,15.5,13%,62 : 38,2016,59.6,59.6
204,187,Karlsruhe Institute of Technology,Germany,45.0,47.3,35.4,60.7,40.0,47.2,25294.0,...,73.8,99.5,54.5,25294.0,24.6,16%,26 : 74,2016,59.5,59.5
275,201-225,Nagoya University,Japan,45.5,21.2,39.2,43.8,33.1,-,15529.0,...,40.1,91.4,-,15529.0,7.9,10%,29 : 71,2016,58.3,58.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
434,122,Technical University of Denmark,Denmark,46.2,64.0,46.9,64.6,95.5,54.5,9990.0,...,77.8,50.0,51.1,9990.0,5.0,18%,27 : 73,2016,-45.5,-45.5
585,276-300,University of Iceland,Iceland,10.7,56.9,17.3,62.4,75.4,-,13960.0,...,91.4,28.0,-,13960.0,25.9,8%,66 : 34,2016,-47.4,-47.4
236,124,Leiden University,Netherlands,47.3,40.0,54.9,59.3,100.0,54.4,21222.0,...,85.2,49.8,65.7,21222.0,17.1,10%,59 : 41,2016,-50.2,-50.2
490,95,University of Arizona,United States of America,52.4,21.9,52.2,70.1,84.2,57.3,36429.0,...,79.5,32.4,51.7,36429.0,12.7,8%,52 : 48,2016,-51.8,-51.8


In [27]:
# solution with a for loop

temp = float('-inf')
index = None

for i in range(len(merged)):
    diff = merged.iloc[i]['difference']
    # 'difference' is numeric: missing values are NaN, not the string 'data not available'
    if pd.notna(diff) and diff > temp:
        temp = diff
        index = i

merged.iloc[index]

world_rank_min                   251-275
university_name               TU Dresden
country                          Germany
teaching_min                        27.3
international_min                   49.2
research_min                        13.8
citations_min                       57.4
income_min                          31.9
total_score_min                        -
num_students_min                   35487
student_staff_ratio_min             37.4
international_students_min           12%
female_male_ratio_min            42 : 58
year_min                            2012
world_rank_max                      =158
teaching_max                        41.4
international_max                   47.7
research_max                        45.8
citations_max                       66.1
income_max                          99.7
total_score_max                     52.1
num_students_max                   35487
student_staff_ratio_max             37.4
international_students_max           12%
female_male_rati

In [28]:
merged.loc[merged['difference'].idxmax(), 'university_name']

'TU Dresden'

In [29]:
merged[['income_max', 'income_min', 'difference', 'difference_for']]

Unnamed: 0,income_max,income_min,difference,difference_for
0,,,,data not available
1,43.7,36.4,7.3,7.3
2,61.6,61.9,-0.3,-0.3
3,68.3,61.5,6.8,6.8
4,31.3,35.5,-4.2,-4.2
...,...,...,...,...
811,37.1,30.7,6.4,6.4
812,31.7,26.1,5.6,5.6
813,82.3,,,data not available
814,65.4,38.0,27.4,27.4


## 4. For each ranking, consider only the most recent data point. For each university, compute the maximum difference between the rankings (e.g. for Aarhus University the value is 122-73=49). Notice that some rankings are expressed as a range

In [30]:
# .loc is required here: after dropna the index labels no longer match row positions
shanghai_max_year = shanghai.loc[shanghai.groupby('university_name')['year'].idxmax(), ['world_rank', 'university_name']]
shanghai_max_year.set_index('university_name', inplace = True)
shanghai_max_year.head()

Unnamed: 0_level_0,world_rank
university_name,Unnamed: 1_level_1
Aalborg University,301-400
Aalto University,401-500
Aarhus University,73
Aix Marseille University,101-150
Aix-Marseille University,102-150


In [31]:
cwur_max_year = cwur.loc[cwur.groupby('institution')['year'].idxmax(), ['world_rank', 'institution']]
cwur_max_year.set_index('institution', inplace = True)
cwur_max_year.head()

Unnamed: 0_level_0,world_rank
institution,Unnamed: 1_level_1
AGH University of Science and Technology,782
Aalborg University,565
Aalto University,421
Aarhus University,122
Aberystwyth University,814


In [32]:
times_max_year.set_index('university_name', inplace = True)

In [33]:
from functools import reduce

In [34]:
datasets = [times_max_year['world_rank'], shanghai_max_year, cwur_max_year]

In [35]:
rankings = reduce(lambda left, right: pd.merge(left, right, left_index = True, right_index = True, how = 'outer'), datasets)
rankings.head()

Unnamed: 0,world_rank_x,world_rank_y,world_rank
AGH University of Science and Technology,601-800,,782.0
Aalborg University,201-250,301-400,565.0
Aalto University,251-300,401-500,421.0
Aarhus University,=106,73,122.0
Aberystwyth University,301-350,,814.0
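
The outer merge matches universities by exact name, so spelling differences between the rankings (for example "Massachusetts Institute of Technology" in Times versus "Massachusetts Institute of Technology (MIT)" in Shanghai) produce separate rows. A minimal sketch of how such mismatches could be spotted, using the `indicator` parameter of `pd.merge` on two made-up frames (the names `times_toy`, `shanghai_toy` and the rank values are illustrative):

```python
import pandas as pd

times_toy = pd.DataFrame(
    {"world_rank": ["5", "=106"]},
    index=pd.Index(
        ["Massachusetts Institute of Technology", "Aarhus University"],
        name="university_name",
    ),
)
shanghai_toy = pd.DataFrame(
    {"world_rank": ["5", "73"]},
    index=pd.Index(
        ["Massachusetts Institute of Technology (MIT)", "Aarhus University"],
        name="university_name",
    ),
)

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
check = pd.merge(
    times_toy, shanghai_toy,
    left_index=True, right_index=True,
    how="outer", suffixes=("_times", "_shanghai"), indicator=True,
)
print(check)
```

Rows flagged `left_only` or `right_only` are candidates for name normalisation before comparing ranks.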


In [36]:
def newrank(string):
    # non-string values (NaN, numbers) are returned unchanged
    if not isinstance(string, str):
        return string
    else:
        occ1 = re.search(r'(?P<min>\d+)-(?P<max>\d+)', string)
        occ2 = re.search(r'^\D?(?P<num>\d+)$', string)
        if occ1:
            # for ranges we take the mean of the two endpoints (is this the right choice?)
            return (float(occ1.group('min')) + float(occ1.group('max')))/2
        elif occ2:
            return float(occ2.group('num'))
        # unrecognised formats fall through and return None
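
To make the three rank formats concrete, here is a small standalone variant of the parser applied to one example of each. The name `parse_rank` is hypothetical, and this version is slightly stricter than `newrank` (it uses `re.fullmatch` and only accepts `=` as a prefix), so treat it as a sketch rather than a drop-in replacement.

```python
import re

def parse_rank(value):
    """Convert a rank such as '73', '=106' or '601-800' to a float.

    Ranges map to their midpoint, non-strings pass through unchanged,
    and unrecognised strings give None.
    """
    if not isinstance(value, str):
        return value
    text = value.strip()
    range_match = re.fullmatch(r"(\d+)-(\d+)", text)
    if range_match:
        low, high = map(float, range_match.groups())
        return (low + high) / 2
    single_match = re.fullmatch(r"=?(\d+)", text)
    if single_match:
        return float(single_match.group(1))
    return None

for raw in ["73", "=106", "601-800"]:
    print(raw, "->", parse_rank(raw))
```

Using the midpoint for ranges is a modelling choice: it keeps range-ranked universities comparable with exactly ranked ones, at the cost of some precision.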

In [37]:
rankings['world_rank_x_num'] = rankings['world_rank_x'].apply(newrank)
rankings['world_rank_y_num'] = rankings['world_rank_y'].apply(newrank)
rankings 
#pop?

Unnamed: 0,world_rank_x,world_rank_y,world_rank,world_rank_x_num,world_rank_y_num
AGH University of Science and Technology,601-800,,782.0,700.5,
Aalborg University,201-250,301-400,565.0,225.5,350.5
Aalto University,251-300,401-500,421.0,275.5,450.5
Aarhus University,=106,73,122.0,106.0,73.0
Aberystwyth University,301-350,,814.0,325.5,
...,...,...,...,...,...
École centrale de Lyon,,,881.0,,
École normale supérieure - Paris,,,37.0,,
École normale supérieure de Cachan,,,721.0,,
École normale supérieure de Lyon,,,471.0,,


In [38]:
rankings['world_rank'].dtype

dtype('float64')

In [39]:
def absolute_difference(row):
    item1 = row['world_rank_x_num']
    item2 = row['world_rank_y_num']
    item3 = row['world_rank']
    return max(abs(item1 - item2), abs(item1 - item3), abs(item3 - item2))

In [40]:
rankings['absolute_difference'] = rankings.apply(absolute_difference, axis = 1)
rankings

Unnamed: 0,world_rank_x,world_rank_y,world_rank,world_rank_x_num,world_rank_y_num,absolute_difference
AGH University of Science and Technology,601-800,,782.0,700.5,,
Aalborg University,201-250,301-400,565.0,225.5,350.5,339.5
Aalto University,251-300,401-500,421.0,275.5,450.5,175.0
Aarhus University,=106,73,122.0,106.0,73.0,49.0
Aberystwyth University,301-350,,814.0,325.5,,
...,...,...,...,...,...,...
École centrale de Lyon,,,881.0,,,
École normale supérieure - Paris,,,37.0,,,
École normale supérieure de Cachan,,,721.0,,,
École normale supérieure de Lyon,,,471.0,,,
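
With the row-wise `max(...)` above, any university missing from one ranking gets NaN, even when two rankings are available. A possible NaN-aware alternative is sketched below on three rows copied from the table. It relies on the fact that the largest pairwise absolute difference among a set of numbers equals the maximum minus the minimum; the column names `times`, `shanghai` and `cwur` are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "times": [106.0, 700.5, np.nan],
        "shanghai": [73.0, np.nan, np.nan],
        "cwur": [122.0, 782.0, 37.0],
    },
    index=[
        "Aarhus University",
        "AGH University of Science and Technology",
        "École normale supérieure - Paris",
    ],
)

# max/min skip NaN by default, so the spread uses whichever rankings exist
spread = toy.max(axis=1) - toy.min(axis=1)
# Require at least two rankings; with a single one there is nothing to compare
spread = spread.where(toy.notna().sum(axis=1) >= 2)
print(spread)
```

Whether a difference computed from only two rankings should count is a judgment call; the notebook's version is the stricter reading that requires all three.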


## 5. Consider only the most recent data point of the times dataset. Compute the number of male and female students for each country.

In [41]:
times_max_year.head()

Unnamed: 0_level_0,world_rank,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
university_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
AGH University of Science and Technology,601-800,Poland,14.2,17.9,3.7,35.7,-,-,35569.0,17.0,1%,-,2016
Aalborg University,201-250,Denmark,25.1,71.0,28.4,73.8,43.7,-,17422.0,15.9,15%,48 : 52,2016
Aalto University,251-300,Finland,31.1,65.4,32.8,62.1,61.6,-,16099.0,24.2,17%,32 : 68,2016
Aarhus University,=106,Denmark,36.9,76.8,50.7,79.8,68.3,57.7,23895.0,13.6,14%,54 : 46,2016
Aberystwyth University,301-350,United Kingdom,21.6,72.2,18.9,67.2,31.3,-,9252.0,19.2,18%,48 : 52,2016


In [42]:
print(f"the total number of rows is: {len(times_max_year)}")
print(f"the number of rows with missing 'female_male_ratio' values is: {len(times_max_year[(times_max_year['female_male_ratio'] == '-') | (times_max_year['female_male_ratio'].isnull())])}")

the total number of rows is: 818
the number of rows with missing 'female_male_ratio' values is: 79


In [43]:
# Rows with a missing 'female_male_ratio' value are removed:
# they cannot contribute to the female/male counts, and they are
# only a small share of the data (79 of 818 rows).

In [44]:
times_max_year_clean = times_max_year[(times_max_year['female_male_ratio'] != '-') & (times_max_year['female_male_ratio'].notnull())].copy()
times_max_year_clean.head()

Unnamed: 0_level_0,world_rank,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year
university_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Aalborg University,201-250,Denmark,25.1,71.0,28.4,73.8,43.7,-,17422.0,15.9,15%,48 : 52,2016
Aalto University,251-300,Finland,31.1,65.4,32.8,62.1,61.6,-,16099.0,24.2,17%,32 : 68,2016
Aarhus University,=106,Denmark,36.9,76.8,50.7,79.8,68.3,57.7,23895.0,13.6,14%,54 : 46,2016
Aberystwyth University,301-350,United Kingdom,21.6,72.2,18.9,67.2,31.3,-,9252.0,19.2,18%,48 : 52,2016
Adam Mickiewicz University,601-800,Poland,20.0,25.7,11.0,15.3,28.7,-,40633.0,15.6,1%,71 : 29,2016


In [45]:
def female_male(row):
    occurrence = re.search(r'(?P<female>\d+) : (?P<male>\d+)', row['female_male_ratio'])
    if occurrence:
        female = float(occurrence.group('female'))*0.01*row['num_students']
        male = float(occurrence.group('male'))*0.01*row['num_students']
        return [int(round(female)), int(round(male))]
    else:
        return 'Error' 
        #in this way we can control if there are different formats of 'female_male_ratio'
        #that need to be handled 

In [46]:
times_max_year_clean['female_male_student'] = times_max_year_clean.apply(female_male, axis = 1)
times_max_year_clean

Unnamed: 0_level_0,world_rank,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year,female_male_student
university_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Aalborg University,201-250,Denmark,25.1,71.0,28.4,73.8,43.7,-,17422.0,15.9,15%,48 : 52,2016,"[8363, 9059]"
Aalto University,251-300,Finland,31.1,65.4,32.8,62.1,61.6,-,16099.0,24.2,17%,32 : 68,2016,"[5152, 10947]"
Aarhus University,=106,Denmark,36.9,76.8,50.7,79.8,68.3,57.7,23895.0,13.6,14%,54 : 46,2016,"[12903, 10992]"
Aberystwyth University,301-350,United Kingdom,21.6,72.2,18.9,67.2,31.3,-,9252.0,19.2,18%,48 : 52,2016,"[4441, 4811]"
Adam Mickiewicz University,601-800,Poland,20.0,25.7,11.0,15.3,28.7,-,40633.0,15.6,1%,71 : 29,2016,"[28849, 11784]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
École Normale Supérieure,54,France,70.6,85.5,47.7,87.1,37.1,69.0,2400.0,7.9,20%,46 : 54,2016,"[1104, 1296]"
École Normale Supérieure de Lyon,201-250,France,41.6,65.6,30.0,69.0,31.7,-,2218.0,8.0,14%,49 : 51,2016,"[1087, 1131]"
École Polytechnique,=101,France,53.5,92.8,44.6,64.7,82.3,57.9,2429.0,4.8,30%,18 : 82,2016,"[437, 1992]"
École Polytechnique Fédérale de Lausanne,31,Switzerland,61.3,98.6,67.5,94.6,65.4,76.1,9666.0,10.5,54%,27 : 73,2016,"[2610, 7056]"


In [47]:
print(f"the number of rows with 'female_male_student' == 'Error' is {len(times_max_year_clean[times_max_year_clean['female_male_student'] == 'Error'])}")

the number of rows with 'female_male_student' == 'Error' is 0


The check above confirms that every _female_male_ratio_ value matches the pattern
<code>\d+ : \d+</code>
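
Since the format is uniform, the female and male counts could also be computed without a row-wise `apply`, using the `str.extract` accessor. The following is a sketch on three rows copied from the table above; `toy`, `shares` and `counts` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "num_students": [17422.0, 16099.0, 2429.0],
    "female_male_ratio": ["48 : 52", "32 : 68", "18 : 82"],
})

# Named groups become the column names of the extracted frame
shares = (
    toy["female_male_ratio"]
    .str.extract(r"(?P<female>\d+) : (?P<male>\d+)")
    .astype(float)
)

# Percentage share times number of students, rounded to whole students
counts = shares.mul(toy["num_students"], axis=0).div(100).round().astype(int)
print(counts)
```

Rows that do not match the pattern would come out as NaN from `str.extract`, which plays the same role as the `'Error'` marker in the function above (the final `astype(int)` would then need to be preceded by dropping or handling those rows).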

In [48]:
times_max_year_clean['female'], times_max_year_clean['male'] =  zip(*times_max_year_clean['female_male_student'])

In [49]:
times_max_year_clean

Unnamed: 0_level_0,world_rank,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year,female_male_student,female,male
university_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
Aalborg University,201-250,Denmark,25.1,71.0,28.4,73.8,43.7,-,17422.0,15.9,15%,48 : 52,2016,"[8363, 9059]",8363,9059
Aalto University,251-300,Finland,31.1,65.4,32.8,62.1,61.6,-,16099.0,24.2,17%,32 : 68,2016,"[5152, 10947]",5152,10947
Aarhus University,=106,Denmark,36.9,76.8,50.7,79.8,68.3,57.7,23895.0,13.6,14%,54 : 46,2016,"[12903, 10992]",12903,10992
Aberystwyth University,301-350,United Kingdom,21.6,72.2,18.9,67.2,31.3,-,9252.0,19.2,18%,48 : 52,2016,"[4441, 4811]",4441,4811
Adam Mickiewicz University,601-800,Poland,20.0,25.7,11.0,15.3,28.7,-,40633.0,15.6,1%,71 : 29,2016,"[28849, 11784]",28849,11784
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
École Normale Supérieure,54,France,70.6,85.5,47.7,87.1,37.1,69.0,2400.0,7.9,20%,46 : 54,2016,"[1104, 1296]",1104,1296
École Normale Supérieure de Lyon,201-250,France,41.6,65.6,30.0,69.0,31.7,-,2218.0,8.0,14%,49 : 51,2016,"[1087, 1131]",1087,1131
École Polytechnique,=101,France,53.5,92.8,44.6,64.7,82.3,57.9,2429.0,4.8,30%,18 : 82,2016,"[437, 1992]",437,1992
École Polytechnique Fédérale de Lausanne,31,Switzerland,61.3,98.6,67.5,94.6,65.4,76.1,9666.0,10.5,54%,27 : 73,2016,"[2610, 7056]",2610,7056


In [50]:
students_per_country = times_max_year_clean.groupby('country')[['female', 'male', 'num_students']].sum()
students_per_country

Unnamed: 0_level_0,female,male,num_students
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Argentina,67191,41182,108373.0
Australia,391736,321640,713376.0
Austria,68364,66113,134477.0
Bangladesh,21323,41393,62716.0
Belarus,20219,9084,29303.0
...,...,...,...
Uganda,18670,18670,37340.0
Ukraine,17846,19250,37096.0
United Arab Emirates,9516,4931,14447.0
United Kingdom,711815,613028,1324842.0



## 6. Find the universities where the ratio between female and male is below the average ratio (computed over all universities)

In [51]:
def ratio(row):
    occurrence = re.search(r'(?P<female>\d+) : (?P<male>\d+)', row['female_male_ratio'])
    return np.float64(occurrence.group('female')) / np.float64(occurrence.group('male'))

In [52]:
times_max_year_clean['fm_ratio'] = times_max_year_clean.apply(ratio, axis = 1)
times_max_year_clean


Unnamed: 0_level_0,world_rank,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year,female_male_student,female,male,fm_ratio
university_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
Aalborg University,201-250,Denmark,25.1,71.0,28.4,73.8,43.7,-,17422.0,15.9,15%,48 : 52,2016,"[8363, 9059]",8363,9059,0.923077
Aalto University,251-300,Finland,31.1,65.4,32.8,62.1,61.6,-,16099.0,24.2,17%,32 : 68,2016,"[5152, 10947]",5152,10947,0.470588
Aarhus University,=106,Denmark,36.9,76.8,50.7,79.8,68.3,57.7,23895.0,13.6,14%,54 : 46,2016,"[12903, 10992]",12903,10992,1.173913
Aberystwyth University,301-350,United Kingdom,21.6,72.2,18.9,67.2,31.3,-,9252.0,19.2,18%,48 : 52,2016,"[4441, 4811]",4441,4811,0.923077
Adam Mickiewicz University,601-800,Poland,20.0,25.7,11.0,15.3,28.7,-,40633.0,15.6,1%,71 : 29,2016,"[28849, 11784]",28849,11784,2.448276
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
École Normale Supérieure,54,France,70.6,85.5,47.7,87.1,37.1,69.0,2400.0,7.9,20%,46 : 54,2016,"[1104, 1296]",1104,1296,0.851852
École Normale Supérieure de Lyon,201-250,France,41.6,65.6,30.0,69.0,31.7,-,2218.0,8.0,14%,49 : 51,2016,"[1087, 1131]",1087,1131,0.960784
École Polytechnique,=101,France,53.5,92.8,44.6,64.7,82.3,57.9,2429.0,4.8,30%,18 : 82,2016,"[437, 1992]",437,1992,0.219512
École Polytechnique Fédérale de Lausanne,31,Switzerland,61.3,98.6,67.5,94.6,65.4,76.1,9666.0,10.5,54%,27 : 73,2016,"[2610, 7056]",2610,7056,0.369863


In [53]:
from statistics import harmonic_mean

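For this task we compute the _average ratio_ with the harmonic mean, a common choice for averaging ratios and rates. Since the harmonic mean of positive values never exceeds their arithmetic mean, the resulting threshold is the more conservative of the two.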
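
To get a feel for how the choice of mean moves the threshold, the sketch below compares the harmonic and arithmetic means on the four distinct ratios visible in the first rows of the table (48:52, 32:68, 54:46, 71:29). It is only an illustration on a handful of values, not a recomputation of the notebook's result.

```python
from statistics import harmonic_mean, mean

# Ratios taken from the first rows shown above
ratios = [48 / 52, 32 / 68, 54 / 46, 71 / 29]

h = harmonic_mean(ratios)
a = mean(ratios)

print(f"harmonic mean:   {h:.4f}")
print(f"arithmetic mean: {a:.4f}")
```

The harmonic mean is pulled towards the small ratios, so fewer universities fall below it than would fall below the arithmetic mean.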

In [54]:
average_ratio = harmonic_mean(times_max_year_clean['fm_ratio'])
print(average_ratio)

0.7438791777857604


In [55]:
times_max_year_clean['avg_ratio'] = average_ratio
times_max_year_clean.head()

Unnamed: 0_level_0,world_rank,country,teaching,international,research,citations,income,total_score,num_students,student_staff_ratio,international_students,female_male_ratio,year,female_male_student,female,male,fm_ratio,avg_ratio
university_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
Aalborg University,201-250,Denmark,25.1,71.0,28.4,73.8,43.7,-,17422.0,15.9,15%,48 : 52,2016,"[8363, 9059]",8363,9059,0.923077,0.743879
Aalto University,251-300,Finland,31.1,65.4,32.8,62.1,61.6,-,16099.0,24.2,17%,32 : 68,2016,"[5152, 10947]",5152,10947,0.470588,0.743879
Aarhus University,=106,Denmark,36.9,76.8,50.7,79.8,68.3,57.7,23895.0,13.6,14%,54 : 46,2016,"[12903, 10992]",12903,10992,1.173913,0.743879
Aberystwyth University,301-350,United Kingdom,21.6,72.2,18.9,67.2,31.3,-,9252.0,19.2,18%,48 : 52,2016,"[4441, 4811]",4441,4811,0.923077,0.743879
Adam Mickiewicz University,601-800,Poland,20.0,25.7,11.0,15.3,28.7,-,40633.0,15.6,1%,71 : 29,2016,"[28849, 11784]",28849,11784,2.448276,0.743879


In [56]:
below_avg = times_max_year_clean[times_max_year_clean['fm_ratio'] < average_ratio][['world_rank', 'country', 'num_students', 'female_male_ratio', 'year', 'female', 'male', 'fm_ratio', 'avg_ratio']]
below_avg

Unnamed: 0_level_0,world_rank,country,num_students,female_male_ratio,year,female,male,fm_ratio,avg_ratio
university_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Aalto University,251-300,Finland,16099.0,32 : 68,2016,5152,10947,0.470588,0.743879
Ajou University,601-800,South Korea,12706.0,33 : 67,2016,4193,8513,0.492537,0.743879
Aligarh Muslim University,601-800,India,11197.0,17 : 83,2016,1903,9294,0.204819,0.743879
Amirkabir University of Technology,501-600,Iran,14080.0,34 : 66,2016,4787,9293,0.515152,0.743879
Andhra University,601-800,India,10407.0,36 : 64,2016,3747,6660,0.562500,0.743879
...,...,...,...,...,...,...,...,...,...
Yokohama National University,601-800,Japan,10117.0,28 : 72,2016,2833,7284,0.388889,0.743879
Yıldız Technical University,601-800,Turkey,31268.0,36 : 64,2016,11256,20012,0.562500,0.743879
Zhejiang University,251-300,China,47508.0,41 : 59,2016,19478,28030,0.694915,0.743879
École Polytechnique,=101,France,2429.0,18 : 82,2016,437,1992,0.219512,0.743879


## 7. For each country, compute the fraction of the students in the country that are in one of the universities computed in the previous point (that is, the denominator of the ratio is the total number of students over all universities in the country).

In [57]:
students_per_country.head() 

Unnamed: 0_level_0,female,male,num_students
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Argentina,67191,41182,108373.0
Australia,391736,321640,713376.0
Austria,68364,66113,134477.0
Bangladesh,21323,41393,62716.0
Belarus,20219,9084,29303.0


As the denominator, we use the total number of students across the universities in each country that have a non-missing _female_male_ratio_ value.

In [58]:
students_below = below_avg.groupby('country')['num_students'].sum()
students_below.head()

country
Austria        33961.0
Bangladesh     62716.0
Brazil          7741.0
Chile          34457.0
China         542991.0
Name: num_students, dtype: float64

In [59]:
students_ratio = pd.merge(students_per_country, students_below, left_index = True, right_index = True, suffixes = ['_tot', '_below'], how = 'outer')
students_ratio

Unnamed: 0_level_0,female,male,num_students_tot,num_students_below
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Argentina,67191,41182,108373.0,
Australia,391736,321640,713376.0,
Austria,68364,66113,134477.0,33961.0
Bangladesh,21323,41393,62716.0,62716.0
Belarus,20219,9084,29303.0,
...,...,...,...,...
Uganda,18670,18670,37340.0,
Ukraine,17846,19250,37096.0,
United Arab Emirates,9516,4931,14447.0,
United Kingdom,711815,613028,1324842.0,37784.0


Missing values in the _num_students_below_ column are filled with 0.0: a missing value means that the country has no university with an _fm_ratio_ below the average ratio.

In [60]:
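# Assign the result back: calling fillna(inplace=True) on a selected column is unreliable in recent pandas versions
students_ratio['num_students_below'] = students_ratio['num_students_below'].fillna(0)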
students_ratio.head()

Unnamed: 0_level_0,female,male,num_students_tot,num_students_below
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Argentina,67191,41182,108373.0,0.0
Australia,391736,321640,713376.0,0.0
Austria,68364,66113,134477.0,33961.0
Bangladesh,21323,41393,62716.0,62716.0
Belarus,20219,9084,29303.0,0.0


In [61]:
students_ratio['ratio'] = students_ratio['num_students_below'] / students_ratio['num_students_tot']
students_ratio

Unnamed: 0_level_0,female,male,num_students_tot,num_students_below,ratio
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Argentina,67191,41182,108373.0,0.0,0.000000
Australia,391736,321640,713376.0,0.0,0.000000
Austria,68364,66113,134477.0,33961.0,0.252541
Bangladesh,21323,41393,62716.0,62716.0,1.000000
Belarus,20219,9084,29303.0,0.0,0.000000
...,...,...,...,...,...
Uganda,18670,18670,37340.0,0.0,0.000000
Ukraine,17846,19250,37096.0,0.0,0.000000
United Arab Emirates,9516,4931,14447.0,0.0,0.000000
United Kingdom,711815,613028,1324842.0,37784.0,0.028520
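
The steps of this point can also be condensed into a single expression. The following is a minimal sketch on a made-up table: the `toy` frame, its values and the `threshold` are assumptions for illustration only, and `reindex(..., fill_value=0)` takes the place of the merge followed by `fillna`.

```python
import pandas as pd

# Hypothetical data: one row per university (not taken from timesData.csv)
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'C'],
    'num_students': [100.0, 300.0, 200.0, 50.0],
    'fm_ratio': [0.5, 1.2, 0.4, 1.5],
})
threshold = 0.75  # hypothetical stand-in for average_ratio

# Denominator: all students per country
total = toy.groupby('country')['num_students'].sum()

# Numerator: students in universities below the threshold
below = toy[toy['fm_ratio'] < threshold].groupby('country')['num_students'].sum()

# Countries without any university below the threshold get 0 instead of NaN
fraction = below.reindex(total.index, fill_value=0) / total
print(fraction)
```

Reindexing on the index of `total` keeps every country in the result, which is the same effect the outer merge has in the cells above.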


## 8. Read the file educational_attainment_supplementary_data.csv, discarding any row with missing country_name or series_name

In [74]:
edu = pd.read_csv(f'{data_folder}educational_attainment_supplementary_data.csv')
edu

Unnamed: 0,country_name,series_name,1985,1986,1987,1990,1991,1992,1993,1995,...,2005,2006,2007,2008,2009,2010,2011,2012,2013,2015
0,Afghanistan,"Barro-Lee: Average years of primary schooling,...",0.33,,,0.44,,,,0.57,...,0.86,,,,,1.27,,,,
1,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1.03,,,1.26,,,,1.54,...,2.18,,,,,2.64,,,,
2,Afghanistan,"Barro-Lee: Average years of primary schooling,...",0.83,,,0.95,,,,1.26,...,1.01,,,,,2.45,,,,
3,Afghanistan,"Barro-Lee: Average years of primary schooling,...",2.34,,,2.22,,,,2.37,...,2.26,,,,,3.55,,,,
4,Afghanistan,"Barro-Lee: Average years of primary schooling,...",0.54,,,0.92,,,,0.94,...,2.00,,,,,1.29,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79050,,,,,,,,,,,...,,,,,,,,,,
79051,,,,,,,,,,,...,,,,,,,,,,
79052,,,,,,,,,,,...,,,,,,,,,,
79053,Data from database: Education Statistics: Educ...,,,,,,,,,,...,,,,,,,,,,


In [75]:
edu.isnull().any()

country_name    True
series_name     True
1985            True
1986            True
1987            True
1990            True
1991            True
1992            True
1993            True
1995            True
1996            True
1997            True
1998            True
1999            True
2000            True
2001            True
2002            True
2003            True
2004            True
2005            True
2006            True
2007            True
2008            True
2009            True
2010            True
2011            True
2012            True
2013            True
2015            True
dtype: bool

In [76]:
edu.dropna(subset = ['country_name', 'series_name'], inplace = True)
edu.head()  # TODO: check whether this can be done directly in read_csv

Unnamed: 0,country_name,series_name,1985,1986,1987,1990,1991,1992,1993,1995,...,2005,2006,2007,2008,2009,2010,2011,2012,2013,2015
0,Afghanistan,"Barro-Lee: Average years of primary schooling,...",0.33,,,0.44,,,,0.57,...,0.86,,,,,1.27,,,,
1,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1.03,,,1.26,,,,1.54,...,2.18,,,,,2.64,,,,
2,Afghanistan,"Barro-Lee: Average years of primary schooling,...",0.83,,,0.95,,,,1.26,...,1.01,,,,,2.45,,,,
3,Afghanistan,"Barro-Lee: Average years of primary schooling,...",2.34,,,2.22,,,,2.37,...,2.26,,,,,3.55,,,,
4,Afghanistan,"Barro-Lee: Average years of primary schooling,...",0.54,,,0.92,,,,0.94,...,2.0,,,,,1.29,,,,
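
As far as we know, `read_csv` has no parameter that drops rows based on missing values in selected columns, so one option is to chain `dropna` directly onto the call. A minimal sketch, using an in-memory CSV whose contents are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV content mimicking the layout of the attainment file
csv_text = """country_name,series_name,1985,1986
Afghanistan,Series A,0.33,
,Series B,1.0,2.0
Data from database: Education Statistics,,,
Albania,Series A,,4.5
"""

toy_edu = (
    pd.read_csv(io.StringIO(csv_text))
      .dropna(subset=['country_name', 'series_name'])  # drop rows missing either key
      .reset_index(drop=True)
)
print(toy_edu)
```

Rows that only lack year values are kept, because `subset` restricts the check to the two key columns.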


In [77]:
edu.notnull().any()

country_name     True
series_name      True
1985             True
1986             True
1987             True
1990             True
1991             True
1992             True
1993             True
1995             True
1996             True
1997            False
1998             True
1999             True
2000             True
2001             True
2002             True
2003             True
2004             True
2005             True
2006             True
2007             True
2008             True
2009             True
2010             True
2011             True
2012             True
2013             True
2015            False
dtype: bool

In [78]:
edu.isnull().any().head()

country_name    False
series_name     False
1985             True
1986             True
1987             True
dtype: bool

## 9. From attainment build a dataframe with the same data, but with 4 columns: country_name, series_name, year, value

In [79]:
edu_melted = edu.melt(id_vars = ['country_name','series_name'], var_name = 'year')
edu_melted

Unnamed: 0,country_name,series_name,year,value
0,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,0.33
1,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,1.03
2,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,0.83
3,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,2.34
4,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,0.54
...,...,...,...,...
2134345,Zimbabwe,UIS: Percentage of population age 25+ with som...,2015,
2134346,Zimbabwe,UIS: Percentage of population age 25+ with som...,2015,
2134347,Zimbabwe,UIS: Percentage of population age 25+ with unk...,2015,
2134348,Zimbabwe,UIS: Percentage of population age 25+ with unk...,2015,


In [81]:
edu_melted.dropna(subset = ['value'], inplace = True)
edu_melted

Unnamed: 0,country_name,series_name,year,value
0,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,0.33000
1,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,1.03000
2,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,0.83000
3,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,2.34000
4,Afghanistan,"Barro-Lee: Average years of primary schooling,...",1985,0.54000
...,...,...,...,...
2054020,West Bank and Gaza,UIS: Percentage of population age 25+ with som...,2013,1.48356
2054021,West Bank and Gaza,UIS: Percentage of population age 25+ with som...,2013,1.88820
2054022,West Bank and Gaza,UIS: Percentage of population age 25+ with unk...,2013,0.00000
2054023,West Bank and Gaza,UIS: Percentage of population age 25+ with unk...,2013,0.00000
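
The reshaping can be checked on a small example. This is a minimal sketch on an invented wide table (the `wide` frame and its values are assumptions); it also converts `year` to an integer, an optional extra step that the cells above do not perform.

```python
import pandas as pd

# Hypothetical wide table: one column per year, as in the attainment file
wide = pd.DataFrame({
    'country_name': ['X', 'Y'],
    'series_name': ['s1', 's1'],
    '1985': [0.33, None],
    '1990': [0.44, 1.5],
})

long = (
    wide.melt(id_vars=['country_name', 'series_name'],
              var_name='year', value_name='value')
        .dropna(subset=['value'])   # remove year/series pairs without a value
        .astype({'year': int})      # year column names are strings after melt
        .reset_index(drop=True)
)
print(long)
```

Each remaining row is one (country, series, year) observation, so the result has exactly the four requested columns.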


## 10. For each university, find the number of rankings in which they appear (it suffices to appear in one year for each ranking).

In [172]:
times_unique_uni = pd.DataFrame(times['university_name'].unique(), columns = ['university'])
times_unique_uni

Unnamed: 0,university
0,Harvard University
1,California Institute of Technology
2,Massachusetts Institute of Technology
3,Stanford University
4,Princeton University
...,...
813,Xidian University
814,Yeungnam University
815,Yıldız Technical University
816,Yokohama City University


In [175]:
shanghai_unique_uni = pd.DataFrame(shanghai['university_name'].unique(), columns = ['university'])
shanghai_unique_uni

Unnamed: 0,university
0,Harvard University
1,University of Cambridge
2,Stanford University
3,"University of California, Berkeley"
4,Massachusetts Institute of Technology (MIT)
...,...
653,Capital Medical University
654,Queensland University of Technology
655,Sharif University of Technology
656,University of Genoa


In [177]:
cwur_unique_uni = pd.DataFrame(cwur['institution'].unique(), columns = ['university'])
cwur_unique_uni

Unnamed: 0,university
0,Harvard University
1,Massachusetts Institute of Technology
2,Stanford University
3,University of Cambridge
4,California Institute of Technology
...,...
1019,Shenzhen University
1020,Tianjin Medical University
1021,Babeș-Bolyai University
1022,Henan Normal University


In [178]:
uni_concat = pd.concat([times_unique_uni, shanghai_unique_uni, cwur_unique_uni], ignore_index = True)
uni_concat

Unnamed: 0,university
0,Harvard University
1,California Institute of Technology
2,Massachusetts Institute of Technology
3,Stanford University
4,Princeton University
...,...
2495,Shenzhen University
2496,Tianjin Medical University
2497,Babeș-Bolyai University
2498,Henan Normal University


In [179]:
uni_concat[uni_concat['university'].isnull()]

Unnamed: 0,university


In [186]:
number_rankings = uni_concat.groupby('university', as_index = False).size()
number_rankings

Unnamed: 0,university,size
0,AGH University of Science and Technology,2
1,Aalborg University,3
2,Aalto University,3
3,Aarhus University,3
4,Aberystwyth University,2
...,...,...
1447,École centrale de Lyon,1
1448,École normale supérieure - Paris,1
1449,École normale supérieure de Cachan,1
1450,École normale supérieure de Lyon,1


In [187]:
number_rankings[(number_rankings['size'] > 3) | (number_rankings['size'] < 1)]

Unnamed: 0,university,size
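
The same count can be obtained more compactly with `value_counts`. A minimal sketch on invented names: the three Series below stand in for the name columns of the rankings and are not real data. Like the cells above, it matches names exactly, so spelling variants such as `Massachusetts Institute of Technology` and `Massachusetts Institute of Technology (MIT)` are counted as different universities.

```python
import pandas as pd

# Hypothetical name columns, one per ranking (repeats simulate several years)
times_names = pd.Series(['U1', 'U2', 'U1'])
shanghai_names = pd.Series(['U1', 'U3'])
cwur_names = pd.Series(['U1', 'U2'])

# Keep each university once per ranking, stack the three lists, then count
counts = pd.concat(
    [names.drop_duplicates() for names in (times_names, shanghai_names, cwur_names)],
    ignore_index=True,
).value_counts()
print(counts)
```

Because duplicates are removed within each ranking before stacking, every count lies between 1 and 3, which is what the sanity check in the last cell verifies.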


## 11. In the times ranking, compute the number of times each university appears

## 12. Find the universities that appear at most twice in the times ranking.

## 13. Find the universities that, in any year, have the same position in all three rankings (they must have the same position in a year).