# World Universities Data Analysis
In this notebook, I'll be doing exploratory data analysis on world university rankings. I want to look at the following:

1. Geographical Analysis:
- Which countries have the highest number of top-ranked universities?
- How does the distribution of universities vary across different countries?

2. University Characteristics:
- What is the correlation between the overall university rank and the student population?
- How does the student-to-staff ratio vary among the top-ranked universities?
- Are there patterns in the female-to-male ratio across universities and countries?

3. Internationalization:
- How does the percentage of international students relate to the global ranking of universities?
- Are there countries that attract a higher percentage of international students?

4. Performance Metrics:
- What factors contribute the most to the overall score of a university? (Teaching, research, etc.)
- How do teaching and research scores individually contribute to the overall university ranking?

5. Gender Diversity:
- Are there universities with a notably high or low female-to-male ratio?
- Does the gender ratio correlate with the overall score or specific metrics?

6. Research Environment:
- How does the research environment score correlate with the overall university rank?
- Are there specific countries or regions with a better research environment?

7. Trends Over Time:
- Are there any noticeable trends or changes in university rankings over the years?
- How have the scores for teaching, research, and overall ranking evolved?

8. Comparisons Between Metrics:
- Is there a strong correlation between teaching and research scores?
- How does the international outlook correlate with the percentage of international students?

9. Outliers and Anomalies:
- Are there universities that perform exceptionally well in one metric but lower in others?
- Are there any universities with a surprisingly high or low student-to-staff ratio?

10. Impact on Industry:
- How does the industry impact score relate to the overall university ranking?
- Are there universities with a strong industry impact but lower overall scores?

## Preprocessing

First, we load the data and then clean it as needed.

In [3]:
# Import relevant libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [8]:
# Load the data
# The file can't be read with the default UTF-8 encoding, so specify latin1 explicitly
world_uni_df = pd.read_csv('./world-uni-rankings.csv', encoding='latin1')
world_uni_df.head()


| | Rank | Name | Country | Student Population | Students to Staff Ratio | International Students | Female to Male Ratio | Overall Score | Teaching | Research Environment | Research Quality | Industry Impact | International Outlook | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | California Institute of Technology | United States | 2243 | 6.9 | 26% | 33 : 67 | 95.2 | 95.6 | 97.6 | 99.8 | 97.8 | 64.0 | 2016 |
| 1 | 2.0 | University of Oxford | United Kingdom | 19920 | 11.6 | 34% | 46:54:00 | 94.2 | 86.5 | 98.9 | 98.8 | 73.1 | 94.4 | 2016 |
| 2 | 3.0 | Stanford University | United States | 15596 | 7.8 | 22% | 42:58:00 | 93.9 | 92.5 | 96.2 | 99.9 | 63.3 | 76.3 | 2016 |
| 3 | 4.0 | University of Cambridge | United Kingdom | 18810 | 11.8 | 34% | 46:54:00 | 92.8 | 88.2 | 96.7 | 97.0 | 55.0 | 91.5 | 2016 |
| 4 | 5.0 | Massachusetts Institute of Technology | United States | 11074 | 9.0 | 33% | 37 : 63 | 92.0 | 89.4 | 88.6 | 99.7 | 95.4 | 84.0 | 2016 |


My first thought is that the female-to-male ratio column has to change. The values are strings, and the format is inconsistent (for example `33 : 67` in some rows and `46:54:00` in others). I think the data will be easier to work with if we create two separate columns holding the female and male percentages for each university.
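
As a rough sketch of that idea, the ratio strings could be split with a regular expression that grabs the first two numbers. This is only an illustration on a toy DataFrame: the helper `split_ratio` and the new column names `Female %` and `Male %` are hypothetical, and it assumes the trailing `:00` in values like `46:54:00` is a formatting artifact that can be ignored.

```python
import re

import numpy as np
import pandas as pd


def split_ratio(value):
    """Parse a ratio string such as '33 : 67' or '46:54:00' into (female, male).

    Assumption: the first two numbers are the female and male percentages,
    and anything after them (e.g. a trailing ':00') can be ignored.
    """
    if pd.isna(value):
        return (np.nan, np.nan)
    numbers = re.findall(r"\d+", str(value))
    if len(numbers) < 2:
        return (np.nan, np.nan)
    return (float(numbers[0]), float(numbers[1]))


# Toy data mirroring the formats seen in the head() output
sample = pd.DataFrame({"Female to Male Ratio": ["33 : 67", "46:54:00", None]})

# Parse every value, then expand the (female, male) pairs into two columns
pairs = sample["Female to Male Ratio"].apply(split_ratio).tolist()
sample[["Female %", "Male %"]] = pd.DataFrame(
    pairs, index=sample.index, columns=["Female %", "Male %"]
)
print(sample)
```

Rows that can't be parsed end up as `NaN` in both new columns, so they can be inspected or dropped later.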

In [9]:
# See summary statistics
world_uni_df.describe()

| | Rank | Student Population | Students to Staff Ratio | Overall Score | Teaching | Research Environment | Research Quality | Industry Impact | International Outlook | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 12430.0 | 12430.0 | 12430.0 | 12430.0 | 12430.0 | 12430.0 | 12430.0 | 12430.0 | 12430.0 | 12430.0 |
| mean | 736.831054 | 23367.0 | 18.897812 | 35.333011 | 28.536669 | 24.10864 | 49.188648 | 46.505502 | 47.604875 | 2020.6642 |
| std | 467.960733 | 34987.15 | 17.057811 | 16.883561 | 14.061391 | 17.598593 | 27.534337 | 18.695916 | 23.002701 | 2.483259 |
| min | 1.0 | 25.0 | 0.3 | 8.2225 | 8.2 | 0.8 | 0.7 | 0.0 | 7.1 | 2016.0 |
| 25% | 346.0 | 10149.5 | 12.3 | 21.731875 | 18.8 | 11.7 | 24.5 | 35.3 | 28.225 | 2019.0 |
| 50% | 691.0 | 17824.0 | 16.3 | 32.40275 | 24.3 | 18.1 | 47.45 | 39.5 | 43.3 | 2021.0 |
| 75% | 1078.0 | 29218.5 | 22.0 | 45.189375 | 33.8 | 30.5 | 72.975 | 52.2 | 63.6 | 2023.0 |
| max | 1904.0 | 1824383.0 | 865.8 | 98.4575 | 99.0 | 100.0 | 100.0 | 100.0 | 100.0 | 2024.0 |


In [10]:
# Check for null values
world_uni_df.isnull().sum()

Rank                         0
Name                         0
Country                      0
Student Population           0
Students to Staff Ratio      0
International Students       0
Female to Male Ratio       591
Overall Score                0
Teaching                     0
Research Environment         0
Research Quality             0
Industry Impact              0
International Outlook        0
Year                         0
dtype: int64

In [11]:
# Drop the rows with no female-to-male ratio (the only column with missing values)
world_uni_df = world_uni_df.dropna(subset=['Female to Male Ratio'])

In [12]:
# Confirm that no null values remain
world_uni_df.isnull().sum()

Rank                       0
Name                       0
Country                    0
Student Population         0
Students to Staff Ratio    0
International Students     0
Female to Male Ratio       0
Overall Score              0
Teaching                   0
Research Environment       0
Research Quality           0
Industry Impact            0
International Outlook      0
Year                       0
dtype: int64

In [15]:
# Check how many rows and columns are left after cleaning
world_uni_df.shape

(11839, 14)
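
This row count matches the earlier outputs: 12,430 rows before cleaning minus 591 rows with a missing ratio leaves 11,839. A small sanity check along these lines can catch accidental over-deletion. The sketch below uses a made-up toy DataFrame (the names `toy`, `rows_before`, `missing`, and `cleaned` are illustrative only), and dropping by `subset` is just one way to do the cleaning step.

```python
import numpy as np
import pandas as pd

# Toy data: two of the four rows have no female-to-male ratio
toy = pd.DataFrame(
    {
        "Name": ["Uni A", "Uni B", "Uni C", "Uni D"],
        "Female to Male Ratio": ["33 : 67", np.nan, "46 : 54", np.nan],
    }
)

rows_before = len(toy)
missing = int(toy["Female to Male Ratio"].isna().sum())

# Drop only the rows where the ratio is missing
cleaned = toy.dropna(subset=["Female to Male Ratio"])

# The number of rows removed should equal the number of missing ratios
assert len(cleaned) == rows_before - missing
print(rows_before, missing, len(cleaned))
```

The same check on the real data would compare the row count before cleaning, the null count for the ratio column, and the row count in `world_uni_df.shape`.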

## 1. Geographical Analysis