In [1]:
import numpy as np
import pandas as pd

import glob

# Ranking Melhores Universidades (2011_2024)

## Times Higher Education

Times higher education é uma revista britânica que trata sobre assuntos relacionados a educação superior. Ela publica rankings anuais que classificam as melhores universidades ao redor do mundo.

Sobre os dados:

The THE World University Rankings provide the definitive list of the world’s best universities, with an emphasis on the research mission. It is the only global university league table to judge research-intensive universities across all of their core missions: teaching (the learning environment); research (volume, income and reputation); citations (research influence); industry income (knowledge transfer) and international outlook (staff, students and research). It uses 13 carefully calibrated performance indicators to provide the most comprehensive and balanced comparisons. The overall list is accompanied by 11 subject-specific rankings.

Original data, as well as ranking methodology described per each year, is available on the official website www.timeshighereducation.com
 Methodology (2024): https://www.timeshighereducation.com/world-university-rankings/world-university-rankings-2024-methodology



### Carregar Dados

In [2]:
#Link para download dos dados: https://www.kaggle.com/datasets/r1chardson/the-world-university-rankings-2011-2023

path_the='../dados/THE World University Rankings 2011-2024' #caminho onde estão salvos os dados
files_the=glob.glob(path_the+'/*.csv')
files_the

['../dados/THE World University Rankings 2011-2024/2012_rankings.csv',
 '../dados/THE World University Rankings 2011-2024/2013_rankings.csv',
 '../dados/THE World University Rankings 2011-2024/2018_rankings.csv',
 '../dados/THE World University Rankings 2011-2024/2011_rankings.csv',
 '../dados/THE World University Rankings 2011-2024/2023_rankings.csv',
 '../dados/THE World University Rankings 2011-2024/2016_rankings.csv',
 '../dados/THE World University Rankings 2011-2024/2022_rankings.csv',
 '../dados/THE World University Rankings 2011-2024/2021_rankings.csv',
 '../dados/THE World University Rankings 2011-2024/2020_rankings.csv',
 '../dados/THE World University Rankings 2011-2024/2017_rankings.csv',
 '../dados/THE World University Rankings 2011-2024/2019_rankings.csv',
 '../dados/THE World University Rankings 2011-2024/2024_rankings.csv',
 '../dados/THE World University Rankings 2011-2024/2014_rankings.csv',
 '../dados/THE World University Rankings 2011-2024/2015_rankings.csv']

In [3]:
dados={} #dicionário que receberá os dados dos rankings entre 2011 e 2024, cada key no dicionário será um ano
anos=[str(ano) for ano in range(2011,2025,1)] #strings com os anos que serão usadas na contrução do dicionário
colunas_ano={} #dicinário com as colunas presentes nos ranking de cada ano. Será criado para avaliar se as métricas são as mesmas em todos os rankings
                #nesse dicionário as keys são os anos dos rankings e os valores as colunas presentes nos rankings
for index, file_path in enumerate(files_the):
    dados[anos[index]]=pd.read_csv(file_path)  #dados carregados no dicionário
    colunas_ano[anos[index]]=dados[anos[index]].columns.to_list()

### Análise Exploratória

In [4]:
#Nem todas as métricas estão presentes em todos os rankings
for ano in anos:
    print('{}: {} colunas'.format(ano,len(colunas_ano[ano])))

2011: 20 colunas
2012: 20 colunas
2013: 24 colunas
2014: 20 colunas
2015: 24 colunas
2016: 24 colunas
2017: 24 colunas
2018: 24 colunas
2019: 24 colunas
2020: 24 colunas
2021: 24 colunas
2022: 25 colunas
2023: 20 colunas
2024: 20 colunas


In [5]:
#Número de Universidade avaliadas em cada ano:
for ano in anos:
    print('{}: {} universidades no ranking)'.format(ano,len(dados[ano])))

2011: 402 universidades no ranking)
2012: 400 universidades no ranking)
2013: 1103 universidades no ranking)
2014: 200 universidades no ranking)
2015: 2345 universidades no ranking)
2016: 800 universidades no ranking)
2017: 2112 universidades no ranking)
2018: 1526 universidades no ranking)
2019: 1397 universidades no ranking)
2020: 981 universidades no ranking)
2021: 1258 universidades no ranking)
2022: 2671 universidades no ranking)
2023: 400 universidades no ranking)
2024: 401 universidades no ranking)


In [6]:
#Métricas que estão presentes no ranking de 2024 mas não estão em outros rankings
for ano in anos:
    for metric in colunas_ano['2024']:
        if metric not in  colunas_ano[ano]:
            print('{}: {} - ausente'.format(ano,metric))
    print('')

















Descrição das métricas (colunas nos dataframes): 

- rank_order: posição geral da universidade, considerando todas as notas atribuídas
- rank
- name: nome da universidade
- scores_overall: pontuação média da universidade considerando todas a métricas
- scores_overall_rank: ranking da universidade considerando o overall score
- scores_teaching: pontuacao atribuída para ensino
- scores_teaching_rank: rankeamento de acordo com o score para ensino
- scores_international_outlook: nível de internacionalização da universidade, mede a capacidade da universidade de atrair estudantes estrangeiros
- scores_international_outlook_rank: rankeamento de acorde com o nível de internacionalização
- scores_industry_income: mede a capacidade da universidede de coloborar com a indústria
- scores_industry_income_rank:  ranking criado de acordo com o scores_industry_income
- scores_research :avaliação da qualidade da pesquisa realizada na universidade
- scores_research_rank: ranking criado de acordo com o scores_research
- scores_citations :pontuação atribuída de acordo com as citações recebidas pelos trabalhos produzidos na universidade
- scores_citations_rank: ranking criado de acordo com scores_citations
- location: localização da universidade
- aliases: 
- subjects_offered': cursos oferecidos
- closed: se a universidade está ou não fechada
- unaccredited : universidade possui ou não credenciamento formal
- stats_proportion_of_isr: intedisciplinidade
- stats_female_male_ratio: proporção entre mulheres e homens
- stats_pc_intl_students: 
- stats_student_staff_ratio: proporção entre o número de funcionários e o número de estudantes
- stats_number_students: número de estudantes



In [7]:
#Conversão do dicionário em dataframe
df_dados=pd.concat(dados,axis=0)
#df_dados é um dataframe com os rankings de 2011 até 2024
#O primeiro index de df_dados é o ano do ranking. O ranking de cada ano está ordenado de acordo com a posição das universidade no ranking, da primeira até a última

In [8]:
df_dados.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 15996 entries, ('2011', 0) to ('2024', 400)
Data columns (total 25 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   rank_order                         15996 non-null  int64  
 1   rank                               15996 non-null  object 
 2   name                               15996 non-null  object 
 3   scores_overall                     14235 non-null  object 
 4   scores_overall_rank                15996 non-null  int64  
 5   scores_teaching                    14235 non-null  float64
 6   scores_teaching_rank               15996 non-null  int64  
 7   scores_international_outlook       14235 non-null  object 
 8   scores_international_outlook_rank  15996 non-null  int64  
 9   scores_industry_income             14235 non-null  object 
 10  scores_industry_income_rank        15996 non-null  int64  
 11  scores_research                    1

In [9]:
#ranking do ano de 2024
df_dados.loc['2024',:].head()

Unnamed: 0,rank_order,rank,name,scores_overall,scores_overall_rank,scores_teaching,scores_teaching_rank,scores_international_outlook,scores_international_outlook_rank,scores_industry_income,...,location,aliases,subjects_offered,closed,unaccredited,stats_number_students,stats_student_staff_ratio,stats_pc_intl_students,stats_female_male_ratio,stats_proportion_of_isr
0,1,1,California Institute of Technology,94.3,1,92.2,2,67.0,120,89.1,...,United States,California Institute of Technology caltech,"Languages, Literature & Linguistics,Economics ...",False,False,,,,,
1,2,2,Harvard University,93.3,2,92.9,1,67.6,115,44.0,...,United States,Harvard University,"Mathematics & Statistics,Civil Engineering,Lan...",False,False,,,,,
2,3,3,University of Oxford,93.2,3,88.6,6,90.7,14,72.9,...,United Kingdom,University of Oxford,"Accounting & Finance,General Engineering,Commu...",False,False,,,,,
3,4,4,Stanford University,92.9,4,91.5,3,69.0,112,63.1,...,United States,Stanford University,"Physics & Astronomy,Computer Science,Politics ...",False,False,,,,,
4,5,5,University of Cambridge,92.0,5,89.7,4,87.8,23,51.1,...,United Kingdom,University of Cambridge,"Business & Management,General Engineering,Art,...",False,False,,,,,


In [10]:
#criar duas novas colunas, uma com a porcentagem de homens e outra com a porcentagem de mulheres
df_dados=pd.concat([df_dados,df_dados['stats_female_male_ratio'].str.split(':',expand=True)],axis=1)
df_dados.rename(columns={0:'male proportion',1:'female proportion'},inplace=True)

In [11]:
df_dados.loc['2024',:].head()

Unnamed: 0,rank_order,rank,name,scores_overall,scores_overall_rank,scores_teaching,scores_teaching_rank,scores_international_outlook,scores_international_outlook_rank,scores_industry_income,...,subjects_offered,closed,unaccredited,stats_number_students,stats_student_staff_ratio,stats_pc_intl_students,stats_female_male_ratio,stats_proportion_of_isr,male proportion,female proportion
0,1,1,California Institute of Technology,94.3,1,92.2,2,67.0,120,89.1,...,"Languages, Literature & Linguistics,Economics ...",False,False,,,,,,,
1,2,2,Harvard University,93.3,2,92.9,1,67.6,115,44.0,...,"Mathematics & Statistics,Civil Engineering,Lan...",False,False,,,,,,,
2,3,3,University of Oxford,93.2,3,88.6,6,90.7,14,72.9,...,"Accounting & Finance,General Engineering,Commu...",False,False,,,,,,,
3,4,4,Stanford University,92.9,4,91.5,3,69.0,112,63.1,...,"Physics & Astronomy,Computer Science,Politics ...",False,False,,,,,,,
4,5,5,University of Cambridge,92.0,5,89.7,4,87.8,23,51.1,...,"Business & Management,General Engineering,Art,...",False,False,,,,,,,


In [12]:
#Substituir as virgulas por pontos na coluna stats_number_students
#Essa coluna estão os números de estudantes em cada universidade, portanto são números inteiros

df_dados['stats_number_students']=df_dados['stats_number_students'].str.replace(',','.')
df_dados['stats_number_students']

2011  0      NaN
      1      NaN
      2      NaN
      3      NaN
      4      NaN
            ... 
2024  396    NaN
      397    NaN
      398    NaN
      399    NaN
      400    NaN
Name: stats_number_students, Length: 15996, dtype: object

### Análise dos Rankings

#### Melhores colocadas ao longo dos anos

In [13]:
def select_top_ranking(dados,feature,n_top,anos_selecionados):
    '''
    Função que seleciona as n primeiras linhas para o ranking de cada ano. 
    O dataframe dados tem os rankings das universidades entre 2001 a 2024. Os rankings de cada ano estão ordenados de acordo com as posições das universidades no ranking, começando pelo primeiro
    lugar.Sendo assim, se fature=name e n_top=9, a função retornará um dataframe com os nomes das universidade que oucuparam as 10 primeiras posições do ranking em cada ano.

    dados   - Pandas dataframe com  os rankings
    feature - (str) feature desejada. É uma das colunas presente no dataframe
    n_top   - (int) Número de linhas que serão selecionadas. 
    anos_seleciondados - lista com anos em formato string. Definirá os ranking que se deseja analisar. Estão disponíveis os rankings entre os de 2011 e 2024 

    '''
    top_list=[]
    n_top-=1 #subtrair 1. Necessário porque o index do dataframe começa em zero. Sem a subtração, caso n_top=10, seriam seleciondas 11 linhas.
    for ano in anos_selecionados: #anos é um lista para selecionar os rankings que serão analisados
        top=df_dados.loc[(ano,slice(0,n_top)),[feature]].rename(columns={feature:ano}) #multiindex: o primeiro index é o ano e o segundo index é selecionado no slice
        top.reset_index(drop=True,inplace=True)
        top_list.append(top)
    top_list=pd.concat(top_list,axis=1) #converter a lista em dataframes
    top_list=top_list.transpose() #transpor para que o ranking para cada anos apareça em cada linha
    name_columns=[str(i)+'°' for i in range(1,n_top+2)] #NOmes das colunas
    top_list.columns=name_columns

    return top_list


In [14]:
#Selecionar as universidade que oucuparam as 10 primeiras posições do ranking entre os anos de 2011 e 2024
top_universidades=select_top_ranking(df_dados,'name',10,anos)
top_universidades

Unnamed: 0,1°,2°,3°,4°,5°,6°,7°,8°,9°,10°
2011,California Institute of Technology,Harvard University,Stanford University,University of Oxford,Princeton University,University of Cambridge,Massachusetts Institute of Technology,Imperial College London,The University of Chicago,"University of California, Berkeley"
2012,California Institute of Technology,Stanford University,University of Oxford,Harvard University,Massachusetts Institute of Technology,Princeton University,University of Cambridge,Imperial College London,"University of California, Berkeley",The University of Chicago
2013,University of Oxford,University of Cambridge,California Institute of Technology,Stanford University,Massachusetts Institute of Technology,Harvard University,Princeton University,Imperial College London,The University of Chicago,ETH Zurich
2014,Harvard University,California Institute of Technology,Massachusetts Institute of Technology,Stanford University,Princeton University,University of Oxford,University of Cambridge,"University of California, Berkeley",Imperial College London,Yale University
2015,University of Oxford,Harvard University,University of Cambridge,Stanford University,Massachusetts Institute of Technology,California Institute of Technology,Princeton University,"University of California, Berkeley",Yale University,Imperial College London
2016,California Institute of Technology,University of Oxford,Stanford University,University of Cambridge,Massachusetts Institute of Technology,Harvard University,Princeton University,Imperial College London,ETH Zurich,The University of Chicago
2017,University of Oxford,California Institute of Technology,Harvard University,Stanford University,University of Cambridge,Massachusetts Institute of Technology,Princeton University,"University of California, Berkeley",Yale University,The University of Chicago
2018,University of Oxford,Stanford University,Harvard University,California Institute of Technology,Massachusetts Institute of Technology,University of Cambridge,"University of California, Berkeley",Yale University,Princeton University,The University of Chicago
2019,University of Oxford,California Institute of Technology,University of Cambridge,Stanford University,Massachusetts Institute of Technology,Princeton University,Harvard University,Yale University,The University of Chicago,Imperial College London
2020,University of Oxford,California Institute of Technology,Stanford University,University of Cambridge,Massachusetts Institute of Technology,Harvard University,Princeton University,Imperial College London,ETH Zurich,"University of California, Berkeley"


In [15]:
top_1=top_universidades['1°'].value_counts() #top_1 é um dataframe com a contagem do número de vezes que cada universidade oucupou a primeira posição do ranking
top_1=pd.DataFrame(top_1)
top_1.reset_index(inplace=True)
top_1.rename(columns={'1°':'Universidades','count':'Número de vezez na 1° posição do ranking entre 2011 e 2024'},inplace=True)
top_1

Unnamed: 0,Universidades,Número de vezez na 1° posição do ranking entre 2011 e 2024
0,University of Oxford,8
1,California Institute of Technology,5
2,Harvard University,1


#### Localização das primeira colocadas no ranking

##### Top 10

In [16]:
#Localização da 10 universidade melhores colocadas nos rankings de 2011 a 2024
top_paises=select_top_ranking(df_dados,'location',10,anos)

In [17]:
top_paises

Unnamed: 0,1°,2°,3°,4°,5°,6°,7°,8°,9°,10°
2011,United States,United States,United States,United Kingdom,United States,United Kingdom,United States,United Kingdom,United States,United States
2012,United States,United States,United Kingdom,United States,United States,United States,United Kingdom,United Kingdom,United States,United States
2013,United Kingdom,United Kingdom,United States,United States,United States,United States,United States,United Kingdom,United States,Switzerland
2014,United States,United States,United States,United States,United States,United Kingdom,United Kingdom,United States,United Kingdom,United States
2015,United Kingdom,United States,United Kingdom,United States,United States,United States,United States,United States,United States,United Kingdom
2016,United States,United Kingdom,United States,United Kingdom,United States,United States,United States,United Kingdom,Switzerland,United States
2017,United Kingdom,United States,United States,United States,United Kingdom,United States,United States,United States,United States,United States
2018,United Kingdom,United States,United States,United States,United States,United Kingdom,United States,United States,United States,United States
2019,United Kingdom,United States,United Kingdom,United States,United States,United States,United States,United States,United States,United Kingdom
2020,United Kingdom,United States,United States,United Kingdom,United States,United States,United States,United Kingdom,Switzerland,United States


In [18]:
#Número de vezes que cada país apareceu nas dez primeiras posições do ranking entre 2011 e 2024
#Somente universidade de três países oucuparam as 10 primeiras colocações nos anos considerados
#Por exemplo, no 14 rankings em 8 deles uma universidade do Reino Unido oucupou a primeira colocação, enquanto nos 6 restantes a primeira colocação coube a uma universidade americana
top_paises.apply(pd.Series.value_counts)


Unnamed: 0,1°,2°,3°,4°,5°,6°,7°,8°,9°,10°
Switzerland,,,,,,,,,2,1
United Kingdom,8.0,3.0,5.0,3.0,3.0,3.0,3.0,6.0,2,4
United States,6.0,11.0,9.0,11.0,11.0,11.0,11.0,8.0,10,9


In [19]:
count_paises_top_10=top_paises.apply(pd.Series.value_counts).sum(axis=1)
count_paises_top_10=pd.DataFrame(count_paises_top_10)
count_paises_top_10.index.names=['País de origem']
count_paises_top_10.rename(columns={0:"Número de universidades entre as 10 primeiras colocadas nos rankings entre 2011 e 2024"},inplace=True)

In [20]:
count_paises_top_10

Unnamed: 0_level_0,Número de universidades entre as 10 primeiras colocadas nos rankings entre 2011 e 2024
País de origem,Unnamed: 1_level_1
Switzerland,3.0
United Kingdom,40.0
United States,97.0


##### Top 100

In [21]:
top_100_paises=select_top_ranking(df_dados,'location',100,anos)

In [32]:
top_100_paises=top_100_paises.transpose()

In [33]:
top_100_paises

Unnamed: 0,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
1°,United States,United States,United Kingdom,United States,United Kingdom,United States,United Kingdom,United Kingdom,United Kingdom,United Kingdom,United Kingdom,United Kingdom,United States,United States
2°,United States,United States,United Kingdom,United States,United States,United Kingdom,United States,United States,United States,United States,United Kingdom,United States,United States,United States
3°,United States,United Kingdom,United States,United States,United Kingdom,United States,United States,United States,United Kingdom,United States,United States,United States,United Kingdom,United Kingdom
4°,United Kingdom,United States,United States,United States,United States,United Kingdom,United States,United States,United States,United Kingdom,United States,United States,United States,United States
5°,United States,United States,United States,United States,United States,United States,United Kingdom,United States,United States,United States,United States,United Kingdom,United States,United Kingdom
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96°,United States,United States,South Korea,United States,France,Canada,Denmark,South Korea,United States,United Kingdom,United States,Germany,France,United States
97°,United States,United States,United Kingdom,United States,China,United Kingdom,Belgium,Taiwan,Finland,Sweden,Australia,Sweden,United States,United States
98°,United States,United States,United States,United States,United States,United Kingdom,United States,United States,Sweden,Denmark,Sweden,United Kingdom,Netherlands,Sweden
99°,United Kingdom,Germany,United States,United States,Hong Kong,Germany,United States,Finland,United Kingdom,Switzerland,United States,Netherlands,Netherlands,Germany


In [41]:
paises_top_100={}
for ano in anos:
    paises_top_100[ano]=pd.DataFrame(top_100_paises[ano].value_counts()).rename(columns={'count':"Número de apariçoes no top 100"})

In [46]:
paises_diferentes_top_100={}
for ano in anos:
    paises_diferentes_top_100[ano]=len(paises_top_100[ano])

In [47]:
paises_diferentes_top_100

{'2011': 16,
 '2012': 15,
 '2013': 16,
 '2014': 14,
 '2015': 15,
 '2016': 17,
 '2017': 16,
 '2018': 18,
 '2019': 16,
 '2020': 17,
 '2021': 16,
 '2022': 16,
 '2023': 15,
 '2024': 17}