# `World University Rankings 2023`

In this notebook, we will be exploring the World University Rankings 2023.

**This notebook is made in partial fulfillment of the requirements for the course CSMODEL.**

## Group Members
- Corpuz, John Exequeil A.
- Diaz, Sebastian Q.
- Recato Dy, John Kieffer L.
- Tan, Timothy Joshua O.

## Import
Import **numpy** and **pandas**.

[**`numpy`**](https://numpy.org/) is a library for Python that adds support for large, multi-dimensional arrays and matrices. It also offers a variety of high-level mathematical functions to operate on these arrays.

[**`pandas`**](https://pandas.pydata.org/pandas-docs/stable/index.html) is a software library for Python which provides data structures and data analysis tools.

In [35]:
import numpy as np
import pandas as pd

## The Dataset

The `World University Rankings 2023` dataset contains 2,341 rows, one per listed institution, each described by 13 columns.

The 2023 World University Rankings evaluated 1,799 institutions from 104 countries and are regarded as the most inclusive and broad rankings ever conducted. With 13 tailored performance indicators, the rankings evaluated four key proxies to present the learning, knowledge, research as well as international diversity dimensions. This year's scrutiny included looking at more than 121 million references from over 15.5 million scholarly articles, as well as survey input gleaned from 40,000 global academics. All told, the rankings culled through upwards of 680,000 data points derived from more than 2,500 participating schools.

The dataset is provided to you as a `.csv` file. `.csv` stands for comma-separated values. You can open the file in Excel to see the raw data.

## Reading the Dataset
The first step is to read the dataset with pandas. We can do this using the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function, which loads the file into a `DataFrame`. Because the dataset file is in the same folder as this notebook, the relative path below should work.

In [36]:
rankings_df = pd.read_csv("World University Rankings 2023.csv")
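
A quick preview is a simple way to confirm that a file loaded as expected. The sketch below is only an illustration: it reads a tiny in-memory CSV (a stand-in for the real file, with three of its columns and a few rows) so that it can run without the dataset. The name `sample_df` is an assumption made for this example.

```python
import io

import pandas as pd

# Stand-in for the real CSV file: same header style, only three columns.
sample_csv = io.StringIO(
    "University Rank,Name of University,Location\n"
    "1,University of Oxford,United Kingdom\n"
    "2,Harvard University,United States\n"
    "3,University of Cambridge,United Kingdom\n"
)

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(sample_csv)

# head(n) shows the first n rows; shape is (number of rows, number of columns).
print(sample_df.head(2))
print(sample_df.shape)
```

The same two calls, `head()` and `shape`, can be run on `rankings_df` once the real file has been read.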

## The Data Collection Method


The data behind the `World University Rankings 2023` was collected through a combination of web scraping and institutional submissions, and was then refined with libraries such as pandas. Drawing on multiple sources broadens the scope and authority of the dataset. However, missing data points are replaced with conservative estimates to avoid unduly penalizing institutions, which can introduce bias.

The data also goes through a stringent standardization process that uses statistical techniques such as Z-scoring. The Academic Reputation Survey is handled differently: because its data is distributed very unevenly, an exponential component is applied to it. The survey adds variation, but its subjectivity leaves the results open to differing interpretations.

Overall, the methodology is powerful but not without limitations, and these limitations affect both the internal validity of the research and the interpretability of the final rankings.
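
To make the idea of Z-scoring concrete, here is a minimal sketch of the general technique. It is not the ranking's actual pipeline; the series `raw_indicator` and its values are made up for illustration. Each value is re-expressed as the number of standard deviations it lies from the mean.

```python
import pandas as pd

# Made-up raw indicator values for five hypothetical institutions.
raw_indicator = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name="raw_indicator")

# Z-score: subtract the mean, then divide by the (population) standard deviation.
z_scores = (raw_indicator - raw_indicator.mean()) / raw_indicator.std(ddof=0)

print(z_scores)
```

After standardization the values have mean 0 and standard deviation 1, which puts indicators measured on different scales on a common footing.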

To check the number of observations in the `World University Rankings 2023` dataset, we read the first element of the DataFrame's `shape` attribute, which holds the number of rows.

In [8]:
rankings_df.shape[0]

2341

Now that we know how many observations there are, let's take a more in-depth look at the dataset. To get general information about the data, we call the `info` function.

In [6]:
rankings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2341 entries, 0 to 2340
Data columns (total 13 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   University Rank              2341 non-null   object 
 1   Name of University           2233 non-null   object 
 2   Location                     2047 non-null   object 
 3   No of student                2209 non-null   object 
 4   No of student per staff      2208 non-null   float64
 5   International Student        2209 non-null   object 
 6   Female:Male Ratio            2128 non-null   object 
 7   OverAll Score                1799 non-null   object 
 8   Teaching Score               1799 non-null   float64
 9   Research Score               1799 non-null   float64
 10  Citations Score              1799 non-null   float64
 11  Industry Income Score        1799 non-null   float64
 12  International Outlook Score  1799 non-null   float64
dtypes: float64(6), obj

The dataset outlines 13 distinct performance metrics for each of the 2341 universities. These metrics provide a comprehensive view of a university's performance, from teaching and research quality to diversity and global impact. Each metric contributes to the overall ranking, offering a multi-dimensional snapshot of institutional excellence.

### `Performance Indicators`

### Core Metrics

#### 1. University Rank
- **What It Is**: Overall global ranking position.
- **Why It Matters**: Reflects aggregate performance across all metrics.

#### 2. Name of University
- **What It Is**: Official institution name.
- **Why It Matters**: Identifies the evaluated university.

#### 3. Location
- **What It Is**: University's geographical setting.
- **Why It Matters**: Contextualizes environment and potential global influence.

---

### Academic Metrics

#### 4. No of Students
- **What It Is**: Total student body size.
- **Why It Matters**: Indicates scale and resource needs.

#### 5. No of Students per Staff
- **What It Is**: Ratio of students to staff.
- **Why It Matters**: Proxy for teaching quality and resource allocation.

#### 6. International Students
- **What It Is**: Count or percentage of foreign students.
- **Why It Matters**: Shows global appeal and diversity.

---

### Diversity Metrics

#### 7. Female:Male Ratio
- **What It Is**: Gender balance among students.
- **Why It Matters**: Reveals gender diversity.

---

### Performance Metrics

#### 8. Overall Score
- **What It Is**: Cumulative score from all metrics.
- **Why It Matters**: Summarizes comprehensive performance.

#### 9. Teaching Score
- **What It Is**: Quality of education score.
- **Why It Matters**: Gauges educational effectiveness.

#### 10. Research Score
- **What It Is**: Score for research output and quality.
- **Why It Matters**: Measures academic contributions.

#### 11. Citations Score
- **What It Is**: Citation count for university research.
- **Why It Matters**: Indicates research impact and relevance.

#### 12. Industry Income Score
- **What It Is**: Income from industry collaborations.
- **Why It Matters**: Assesses knowledge transfer and industry engagement.

#### 13. International Outlook Score
- **What It Is**: Score for international staff, students, and partnerships.
- **Why It Matters**: Evaluates global reach and influence.
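
The `info` output earlier shows that several numeric-looking columns (`No of student`, `International Student`, `Female:Male Ratio`) are stored as `object`, so they would need converting before any numeric analysis. The sketch below shows one way this could be done. The string formats (`"20,965"`, `"42%"`, `"48 : 52"`) are assumptions about how the values are written, the sample values are made up, and the output column names are hypothetical.

```python
import pandas as pd

# Made-up sample rows; the formats are assumed, not confirmed from the file.
sample = pd.DataFrame({
    "No of student": ["20,965", "21,887", None],
    "International Student": ["42%", "25%", None],
    "Female:Male Ratio": ["48 : 52", "50 : 50", None],
})

cleaned = pd.DataFrame({
    # Remove thousands separators, then convert to a number.
    "students": pd.to_numeric(
        sample["No of student"].str.replace(",", "", regex=False), errors="coerce"
    ),
    # Strip the percent sign and express the share as a fraction.
    "international_share": pd.to_numeric(
        sample["International Student"].str.rstrip("%"), errors="coerce"
    ) / 100,
    # Keep the part before the colon as the female percentage.
    "female_percent": pd.to_numeric(
        sample["Female:Male Ratio"].str.split(":").str[0].str.strip(), errors="coerce"
    ),
})

print(cleaned)
```

Using `errors="coerce"` turns anything that cannot be parsed into `NaN` instead of raising an error, so unexpected formats show up as missing values that can be inspected afterwards.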



Now let's take a look at the rows of the dataset. We use the `iloc` indexer to select every value in the second column (position 1), which holds the university names.

In [10]:
universityName = rankings_df.iloc[:, 1]

# Print each value on a new line so it's easier to view
for value in universityName:
    print(value)

University of Oxford
Harvard University
University of Cambridge
Stanford University
Massachusetts Institute of Technology
California Institute of Technology
Princeton University
University of California, Berkeley
Yale University
Imperial College London
Columbia University
ETH Zurich
The University of Chicago
University of Pennsylvania
Johns Hopkins University
Tsinghua University
Peking University
University of Toronto
National University of Singapore
Cornell University
University of California, Los Angeles
UCL
University of Michigan-Ann Arbor
New York University
Duke University
Northwestern University
University of Washington
Carnegie Mellon University
University of Edinburgh
Technical University of Munich
University of Hong Kong
University of California, San Diego
LMU Munich
University of Melbourne
King’s College London
Nanyang Technological University, Singapore
London School of Economics and Political Science
Georgia Institute of Technology
The University of Tokyo
University of Brit

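From this we can observe that each row of the dataset represents one university along with its performance metrics. However, the `info` output above shows that `Name of University` has only 2233 non-null values out of 2341 rows, so some rows have no name at all.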
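
Whether every name really appears only once can be checked directly instead of by reading the printed list. The following is a minimal sketch on a made-up series (the repeated name is deliberate); on the real data the same calls would be applied to `rankings_df['Name of University']`.

```python
import pandas as pd

# Made-up sample with one deliberate repeat and two missing names.
names = pd.Series(
    ["University of Oxford", "Harvard University", "University of Oxford", None, None],
    name="Name of University",
)

# Ignore missing names, then test for uniqueness.
non_null = names.dropna()
print(non_null.is_unique)

# keep=False marks every occurrence of a repeated value, not just the later ones.
repeated = non_null[non_null.duplicated(keep=False)]
print(repeated.unique())
```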

In [37]:
ghosts = rankings_df['Name of University'].isnull().sum()
print(f"Ghost Universities : {ghosts}")
print("BEFORE: " + str(rankings_df.shape[0]))

# Remove these rows, since a university without a name cannot be used in our analysis
rankings_df = rankings_df.dropna(subset=['Name of University'])
print("AFTER: " + str(rankings_df.shape[0]))


Ghost Universities : 108
BEFORE: 2341
AFTER: 2233
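
Before dropping rows, it can be worth checking what else the nameless rows contain. This is a small sketch on made-up data, not the notebook's actual result: it counts the non-null values per column among rows with no name. The frame `sample_df` and its contents are assumptions for illustration.

```python
import pandas as pd

# Made-up sample: the last row has a rank but no name or location.
sample_df = pd.DataFrame({
    "University Rank": ["1", "2", "3"],
    "Name of University": ["University of Oxford", "Harvard University", None],
    "Location": ["United Kingdom", "United States", None],
})

# Select the rows without a name and count what is filled in for each column.
nameless = sample_df[sample_df["Name of University"].isna()]
filled_per_column = nameless.notna().sum()
print(filled_per_column)
```

If the counts for every column other than the rank are zero, the rows carry no usable information and dropping them loses nothing.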


Let's clean the `Location` column of the dataset. First, let's count the missing values, duplicate values, and values with an incorrect data type.

In [38]:
missing_values = rankings_df['Location'].isnull().sum()
duplicates = rankings_df['Location'].duplicated().sum()
# Every Python value is an instance of `object`, so check the non-missing values against `str` instead
incorrect_data_types = sum(not isinstance(x, str) for x in rankings_df['Location'].dropna())
print(f"Missing Values: {missing_values}")
print(f"Duplicate Values: {duplicates}")
print(f"Incorrect Data Types: {incorrect_data_types}")

Missing Values: 186
Duplicate Values: 2116
Incorrect Data Types: 0


From these findings we can see that there are no incorrect data types in the `Location` column. We can also disregard the duplicates: many universities share a country, so repeated locations are expected. The real problem is the missing values, which can be solved by looking up where each of these universities is located, since that information is easy to find.

Now let's fix these locations.
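
To know which universities need a location, the names with a missing `Location` can be listed first. The sketch below uses a made-up three-row frame called `sample_df` (an assumption for this example); on the real data the same selection would be run on `rankings_df`.

```python
import pandas as pd

# Made-up sample: two of the three universities have no location.
sample_df = pd.DataFrame({
    "Name of University": ["University of Oxford", "KU Leuven", "Kuwait University"],
    "Location": ["United Kingdom", None, None],
})

# Select the names of the rows where Location is missing.
missing_location = sample_df.loc[
    sample_df["Location"].isna(), "Name of University"
].tolist()

for name in missing_location:
    print(name)
```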

In [51]:
universities_by_country = {
    'Australia': [
        "The University of Queensland",
        "UNSW Sydney",
        "University of Adelaide",
        "University of Technology Sydney",
        "Curtin University",
        "Queensland University of Technology",
        "Griffith University",
        "La Trobe University",
        "University of South Australia",
        "Federation University Australia",
        "Edith Cowan University",
        "Central Queensland University"
    ],
    'Belgium': [
        "KU Leuven"
    ],
    'Canada': [
        "McGill University",
        "University of Ottawa",
        "McMaster University",
        "Simon Fraser University",
        "Dalhousie University",
        "York University",
        "Western University",
        "Memorial University of Newfoundland",
        "Toronto Metropolitan University",
        "University of Montreal",
        "University of Windsor"
    ],
    'China': [
        "Tsinghua University",
        "Peking University",
        "University of Hong Kong",
        "Chinese University of Hong Kong",
        "The Hong Kong University of Science and Technology",
        "Zhejiang University",
        "University of Science and Technology of China",
        "Hong Kong Polytechnic University",
        "City University of Hong Kong",
        "University of Macau",
        "Macau University of Science and Technology",
        "Southern University of Science and Technology (SUSTech)",
        "Soochow University, China",
        "Chengdu University",
        "Chongqing University",
        "Southwest Jiaotong University",
        "Xi’an Jiaotong-Liverpool University",
        "Wenzhou University",
        "Southern Medical University",
        "University of Saint Joseph"
    ],
    'Cyprus': [
        "European University Cyprus"
    ],
    'Czech Republic': [
        "VSB - Technical University of Ostrava"
    ],
    'Estonia': [
        "University of Tartu",
        "Tallinn University of Technology"
    ],
    'France': [
        "Université Paris Cité",
        "University of Bordeaux"
    ],
    'Germany': [
        "Technical University of Munich",
        "LMU Munich",
        "Universität Heidelberg",
        "Constructor University Bremen",
        "TU Braunschweig",
        "University of Wuppertal"
    ],
    'Ghana': [
        "University of Cape Coast"
    ],
    'Hungary': [
        "Semmelweis University",
        "Eötvös Loránd University"
    ],
    'India': [
        "Lovely Professional University",
        "KIIT University",
        "Maharishi Markandeshwar University (MMU)",
        "Symbiosis International University",
        "University of Calcutta",
        "Graphic Era University",
        "Saveetha Institute of Medical and Technical Sciences",
        "Sathyabama Institute of Science and Technology",
        "Thapar Institute of Engineering and Technology"
    ],
    'Iran': [
        "University of Tabriz",
        "Tabriz University of Medical Sciences",
        "University of Tehran",
        "Tehran University of Medical Sciences"
    ],
    'Indonesia': [
        "University of Indonesia",
        "Universitas Airlangga"
    ],
    'Ireland': [
        "Trinity College Dublin",
        "University of Limerick",
        "RCSI University of Medicine and Health Sciences"
    ],
    'Italy': [
        "University of Bologna",
        "Humanitas University",
        "Politecnico di Milano",
        "Universita IULM",
        "Sant’Anna School of Advanced Studies – Pisa"
    ],
    'Japan': [
        "Sophia University",
        "Tokyo Metropolitan University"
    ],
    'Jordan': [
        "Al-Ahliyya Amman University",
        "The Hashemite University"
    ],
    'South Korea': [
        "Yonsei University (Seoul campus)",
        "Korea University",
        "Kyung Hee University",
        "Pohang University of Science and Technology (POSTECH)",
        "Sungkyunkwan University (SKKU)",
        "Ulsan National Institute of Science and Technology (UNIST)",
        "Hanyang University",
        "The Catholic University of Korea",
        "Chungbuk National University",
        "Jeju National University",
        "Kangwon National University",
        "Kyungpook National University",
        "Pusan National University",
        "Chonnam National University",
        "Jeonbuk National University",
        "University of Ulsan"
    ],
    'Kuwait': [
        "Kuwait University"
    ],
    'Lebanon': [
        "Al-Mustaqbal University"
    ],
    'Malaysia': [
        "Universiti Malaysia Sarawak (UNIMAS)",
        "Universiti Teknologi Malaysia"
    ],
    'Mexico': [
        "University of Guadalajara",
        "Monterrey Institute of Technology"
    ],
    'Netherlands': [
        "University of Groningen",
        "Leiden University",
        "Erasmus University Rotterdam",
        "Radboud University Nijmegen",
        "Tilburg University"
    ],
    'New Zealand': [
        "Auckland University of Technology",
        "University of Canterbury"
    ],
    'Nigeria': [
        "Afe Babalola University",
        "Covenant University"
    ],
    'Palestine': [
        "An-Najah National University"
    ],
    'Philippines': [
        "De La Salle University",
        "Mapúa University"
    ],
    'Saudi Arabia': [
        "King Abdulaziz University",
        "Alfaisal University",
        "Prince Sultan University (PSU)",
        "Qassim University",
        "Umm Al-Qura University",
        "Taif University"
    ],
    'Singapore': [
        "National University of Singapore"
    ],
    'South Africa': [
        "University of Cape Town",
        "University of the Witwatersrand",
        "University of Johannesburg",
        "University of KwaZulu-Natal",
        "Stellenbosch University",
        "University of the Western Cape"
    ],
    'Spain': [
        "Carlos III University of Madrid",
        "ESIC"
    ],
    'Taiwan': [
        "National Taiwan University (NTU)",
        "Taipei Medical University",
        "National Cheng Kung University (NCKU)",
        "National Taiwan Normal University",
        "National Chung Cheng University",
        "National Dong Hwa University"
    ],
    'Thailand': [
        "Chulalongkorn University",
        "Thammasat University",
        "Burapha University"
    ],
    'Turkey': [
        "Sabancı University",
        "Istanbul Medipol University",
        "Atılım University",
        "Yıldız Technical University",
        "Acıbadem University"
    ],
    'United Arab Emirates': [
        "United Arab Emirates University",
        "Khalifa University"
    ],
    'United Kingdom': [
        "University of Manchester",
        "University of Bristol",
        "University of Warwick",
        "University of Sheffield",
        "Newcastle University",
        "University of Liverpool",
        "Cardiff University",
        "Queen’s University Belfast",
        "University of East Anglia",
        "Coventry University",
        "University of Glasgow",
        "University of the West of Scotland",
        "Ulster University",
        "University of Wolverhampton"
    ],
    'United States': [
        "Arizona State University (Tempe)",
        "Virginia Polytechnic Institute and State University",
        "SUNY Binghamton University",
        "University of Texas at Arlington",
        "The University of Texas at San Antonio",
        "University of Toledo"
    ],
    'Austria': [
        "Paris Lodron Universität Salzburg"
    ],
    'Uzbekistan': [
        "Alisher Navo’i Tashkent State University of Uzbek Language and Literature",
        "Tashkent State University of Economics",
        "National University of Uzbekistan named after Mirzo Ulugbek"
    ]
}


for country, universities in universities_by_country.items():
    rankings_df.loc[rankings_df['Name of University'].isin(universities), 'Location'] = country

rankings_df['Location'].isnull().sum()



0

The `Location` column no longer has any missing values, so the location data has been cleaned.
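
A hand-written mapping like the one above is easy to get subtly wrong, for example by listing a name under two countries or by misspelling a name so that it matches nothing. The sketch below shows one possible way to audit such a mapping and then fill only the rows whose location is actually missing. It is a sketch under stated assumptions, not the method used in this notebook: `sample_df`, `sample_mapping`, and "Example University" are made-up stand-ins.

```python
from collections import Counter

import pandas as pd

# Made-up stand-ins for rankings_df and universities_by_country.
sample_df = pd.DataFrame({
    "Name of University": ["University of Oxford", "KU Leuven", "Kuwait University"],
    "Location": ["United Kingdom", None, None],
})
sample_mapping = {
    "Belgium": ["KU Leuven"],
    "Kuwait": ["Kuwait University", "Example University"],  # hypothetical name, not in the data
}

# Audit 1: names that appear under more than one country.
counts = Counter(name for names in sample_mapping.values() for name in names)
listed_twice = sorted(name for name, n in counts.items() if n > 1)

# Audit 2: names in the mapping that match no row in the data.
not_in_data = sorted(set(counts) - set(sample_df["Name of University"]))

# Invert the mapping to name -> country, then fill only the missing locations.
name_to_country = {
    name: country for country, names in sample_mapping.items() for name in names
}
sample_df["Location"] = sample_df["Location"].fillna(
    sample_df["Name of University"].map(name_to_country)
)

print(listed_twice, not_in_data)
print(sample_df)
```

Using `fillna` with a mapped series leaves existing locations untouched, whereas assigning with `.loc` and `isin` overwrites the location of every matched name, whether or not it was missing.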