<h2>Population Distribution Among The Top 400 Universities in The World, 2023</h2>

<h3>Table of Contents</h3>

* [Connecting Drive to Google Colab](#Connecting_Drive_to_Google_Colab)
* [Installing Python Packages](#Installing_Python_Packages)
* [Importing Needed Python Packages And Accessing Data](#Importing_Needed_Python_Packages_And_Accessing_Data)
* [Geocoding Institutions](#Geocoding_Institutions)
* [Map Production With Folium](#Map_Production_With_Folium)
    * [Graduated Symbol Map](#Graduated_Symbol_Map)
    * [Actual Locations of Institutions](#Actual_Locations_of_Institutions)



<h3>Connecting Drive to Google Colab</h3>

In [1]:
# allow google colab to access data from drive
from google.colab import drive
drive.mount('/content/drive')
data_path = '/content/drive/MyDrive/top_400'

Mounted at /content/drive


<h3>Installing Python Packages</h3>

In [None]:
# install additional python libraries
!pip install geopandas geopy folium

<h3>Importing Needed Python Packages And Accessing Data</h3>

In [3]:
# import needed python libraries

import requests
import pandas as pd
import json
import os
import urllib.request
import numpy as np
import geopandas as gpd
import folium
from folium import Marker, CircleMarker
from folium.plugins import Fullscreen, MiniMap, Search
from geopy.geocoders import Nominatim

In [4]:
# download the Times Higher Education 2023 ranking data and keep a copy on Drive
url = "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2023_0__83be12210294c582db8740ee29673120.json"

filename = os.path.join(data_path, "World University Ranking.json")
urllib.request.urlretrieve(url, filename)

# parse the saved file, closing it once it has been read
with open(filename, encoding="utf-8") as json_file:
    json_raw = json.load(json_file)



In [5]:

# browser-like request headers, needed for the site to return the data

HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:98.0) Gecko/20100101 Firefox/98.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    }

In [6]:
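# get the list of universities from the rankings JSON
response = requests.get(url = url, headers = HEADERS)
response.raise_for_status()
data = response.json()['data']
# keys in the raw records that this notebook does not use (listed for reference only)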
key_list = [
    "scores_citations", "scores_citations_rank", "cta_button", "apply_link", "url",
    "scores_industry_income", "scores_industry_income_rank", "subjects_offered",
    "aliases", "rank_order", "scores_overall", "scores_overall_rank", "scores_teaching",
    "closed", "unaccredited", "disabled", "record_type", "scores_teaching_rank",
    "scores_research", "scores_research_rank", "scores_international_outlook", 
    "scores_international_outlook_rank", "member_level", "nid"
    ]
# old and new column names
columns = {
    "rank": "Rank", "name": "Name", "location": "Country", 
    "stats_number_students": "No. of FTE Students", 
    "stats_student_staff_ratio": "No. of students per staff", 
    "stats_pc_intl_students": "International Students", "stats_female_male_ratio": "Female:Male Ratio"
    }

# keep the first 400 records and the columns of interest, then rename the columns
df = pd.json_normalize(data).loc[:399, list(columns)].rename(columns=columns)

# save to csv
data_path = '/content/drive/MyDrive/top_400/top_400_institutions_2023.csv'

df.to_csv(data_path, index=False, encoding='utf-8')

# verify the last five elements of the table
df.tail()


Unnamed: 0,Rank,Name,Country,No. of FTE Students,No. of students per staff,International Students,Female:Male Ratio
395,351–400,University of Vaasa,Finland,3873,20.0,4%,53 : 47
396,351–400,Verona University,Italy,18621,23.8,4%,64 : 36
397,351–400,Wake Forest University,United States,8122,4.0,10%,54 : 46
398,351–400,Washington State University,United States,29463,19.5,7%,54 : 46
399,351–400,Wroclaw Medical University,Poland,6769,9.4,14%,71 : 29


In [7]:
geolocator = Nominatim(user_agent = 'edudzi')

universities = pd.read_csv(data_path)

<h3>Geocoding Institutions</h3>

Geocoding is the process of converting addresses (like "1600 Amphitheatre Parkway, Mountain View, CA") into geographic coordinates (like latitude 37.423021 and longitude -122.083739), which can be used to place markers on a map or to position the map. For this project, the Nominatim geocoding service from OpenStreetMap is used to look up the coordinates of each institution by name.
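
Because every lookup is a network request to a shared public service, it can help to cache results and space out calls (Nominatim's usage policy asks for at most one request per second). The sketch below shows one way to wrap any lookup function with both behaviours. The names `make_polite_geocoder`, `lookup`, and `fake_lookup` are hypothetical and exist only for this illustration; `fake_lookup` is an offline stand-in for a real geocoding call.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)


def make_polite_geocoder(
    lookup: Callable[[str], Optional[Coordinates]],
    min_delay_seconds: float = 1.0,
) -> Callable[[str], Optional[Coordinates]]:
    """Wrap `lookup` so repeated names come from a cache and new calls are spaced out."""
    cache: Dict[str, Optional[Coordinates]] = {}
    last_call = float("-inf")  # time of the most recent real lookup

    def geocode(name: str) -> Optional[Coordinates]:
        nonlocal last_call
        if name in cache:
            return cache[name]
        # wait until at least `min_delay_seconds` have passed since the last lookup
        wait = min_delay_seconds - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)
        last_call = time.monotonic()
        try:
            cache[name] = lookup(name)
        except Exception:
            # treat a failed lookup as "not found" instead of stopping the whole run
            cache[name] = None
        return cache[name]

    return geocode


# hypothetical offline stand-in for a real geocoding service
def fake_lookup(name: str) -> Optional[Coordinates]:
    known = {"University of Oxford": (51.752546, -1.21433)}
    return known.get(name)


geocode = make_polite_geocoder(fake_lookup, min_delay_seconds=0.0)
print(geocode("University of Oxford"))
print(geocode("Nowhere College"))
```

With geopy, `lookup` could be a small function that calls `geolocator.geocode(name)` and returns the latitude and longitude of the result, or `None` when nothing is found.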



In [8]:
def my_geocoder(name):
    """Return the latitude and longitude that Nominatim finds for an institution name."""
    try:
        point = geolocator.geocode(name).point
        return pd.Series({'Latitude': point.latitude, 'Longitude': point.longitude})
    except Exception:
        # no match (geocode returned None) or a failed request: record missing coordinates
        return pd.Series({'Latitude': np.nan, 'Longitude': np.nan})

universities[['Latitude', 'Longitude']] = universities['Name'].apply(my_geocoder)
print("{}% of addresses were geocoded".format((1 - universities['Latitude'].isna().sum() / len(universities)) * 100))
# drop institutions that could not be geocoded
universities = universities.loc[universities['Latitude'].notna()]
# build point geometries in WGS 84 (EPSG:4326)
universities = gpd.GeoDataFrame(
    universities,
    geometry = gpd.points_from_xy(universities.Longitude, universities.Latitude),
    crs = 'EPSG:4326',
)
universities.head()

89.75% of addresses were geocoded




Unnamed: 0,Rank,Name,Country,No. of FTE Students,No. of students per staff,International Students,Female:Male Ratio,Latitude,Longitude,geometry
0,1,University of Oxford,United Kingdom,20967,10.6,42%,48 : 52,51.752546,-1.21433,POINT (-1.21433 51.75255)
1,2,Harvard University,United States,21887,9.6,25%,50 : 50,42.367909,-71.126782,POINT (-71.12678 42.36791)
2,=3,University of Cambridge,United Kingdom,20185,11.3,39%,47 : 53,52.200623,0.110474,POINT (0.11047 52.20062)
3,=3,Stanford University,United States,16164,7.1,24%,46 : 54,37.431314,-122.169365,POINT (-122.16937 37.43131)
4,5,Massachusetts Institute of Technology,United States,11415,8.2,33%,40 : 60,42.358253,-71.096627,POINT (-71.09663 42.35825)


In [9]:
data_path = '/content/drive/MyDrive/top_400/top_400_institutions_2023_geolocation.csv'

universities.to_csv(data_path, index=None,encoding='utf-8')

In [10]:
# convert "No. of FTE Students" from text to numbers, stripping any thousands separators
universities["No. of FTE Students"] = pd.to_numeric(
    universities["No. of FTE Students"].astype(str).str.replace(",", "", regex=False)
)
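
The "International Students" and "Female:Male Ratio" columns are stored as text as well (for example "42%" and "48 : 52" in the table above). They are not needed for the maps in this notebook, but if they were to be used numerically, a conversion along the following lines could work. This is only a sketch: `parse_percent` and `parse_ratio` are hypothetical helper names, and the handling of malformed values (returning `None`) is an assumption.

```python
from typing import Optional, Tuple


def parse_percent(text) -> Optional[float]:
    """Convert text such as '42%' to a fraction (0.42); return None if it cannot be parsed."""
    try:
        return float(str(text).strip().rstrip("%")) / 100
    except ValueError:
        return None


def parse_ratio(text) -> Optional[Tuple[int, int]]:
    """Convert text such as '48 : 52' to a (female, male) pair; return None if malformed."""
    parts = str(text).split(":")
    if len(parts) != 2:
        return None
    try:
        return int(parts[0].strip()), int(parts[1].strip())
    except ValueError:
        return None


print(parse_percent("42%"))
print(parse_ratio("48 : 52"))
```

On a DataFrame, either helper could be applied column-wise, for example with `universities["International Students"].apply(parse_percent)`.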

<h2>Map Production With <b>Folium</b></h2>

folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the Leaflet.js library: data is manipulated in Python and then visualized on a Leaflet map.

<h3>Concepts</h3>

folium makes it easy to visualize data that has been manipulated in Python on an interactive Leaflet map. It supports both binding data to a map for choropleth visualizations and passing rich vector, raster, or HTML visualizations as markers on the map.

<h3>Graduated Symbol Map</h3>


Graduated symbol maps scale the size of symbols proportionally to the quantity or value at that location.
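
The map in this section uses a linear radius (student count divided by 2000). One caveat of that choice is that a circle's area grows with the square of its radius, so large institutions can look disproportionately big. A common alternative is to scale the radius by the square root of the value, so that the circle's area tracks the value. The sketch below illustrates that idea; `scaled_radius`, `max_radius`, and `min_radius` are hypothetical names and defaults chosen for illustration and are not part of folium.

```python
import math


def scaled_radius(value: float, max_value: float,
                  max_radius: float = 30.0, min_radius: float = 2.0) -> float:
    """Return a radius whose circle area is proportional to value / max_value."""
    if max_value <= 0 or value <= 0:
        # nothing sensible to scale: fall back to the smallest visible symbol
        return min_radius
    radius = max_radius * math.sqrt(value / max_value)
    # keep very small values visible on the map
    return max(min_radius, radius)


# a quarter of the largest value gets half of the largest radius
print(scaled_radius(20000, 80000))  # 15.0
```

The result could be passed as the `radius` of a `CircleMarker` in place of the linear expression.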

In [12]:
# produce a graduated symbol map: one circle per institution, sized by student population

m = folium.Map(width = "60%", height = "50%", location = [54, 15], tiles = 'OpenStreetMap',
               zoom_start = 2, control_scale = True, prefer_canvas = True)
for idx, row in universities.iterrows():

    # popup text shown when a circle is clicked
    elements = '<b>Name: {} Country: {} Population: {} Rank: {}</b>'.format(row['Name'],
                                                           row["Country"],
                                                           row['No. of FTE Students'],
                                                           row['Rank'])
    CircleMarker(
        location = [row['Latitude'], row['Longitude']],
        radius = int(row['No. of FTE Students'] / 2000),
        popup = elements,
        fill = True,
        color = 'IndianRed',
        fill_color = '#9F2B68',
        fill_opacity = 0.2
    ).add_to(m)


# extra basemaps for the layer control; the Stamen tile sets are now hosted by
# Stadia Maps and are no longer available by name in recent folium releases
folium.raster_layers.TileLayer('CartoDB Positron').add_to(m)
folium.raster_layers.TileLayer('CartoDB Dark_Matter').add_to(m)
folium.LayerControl().add_to(m)
fullscreen = Fullscreen(position="topright")
minimap = MiniMap(toggle_display=True)
m.add_child(minimap)
m.add_child(fullscreen)

# output = '/content/drive/MyDrive/top_400/top_400_institutions_graduated_symbols.html'
# m.save(output)



display(m)

<h3>Actual Locations of Institutions</h3>

In [14]:

# plot each institution at its geocoded location, with a tooltip and a name search box
m_2 = folium.Map(width = "60%", height = "50%", location = [54, 15], tiles = 'OpenStreetMap',
                 zoom_start = 2, control_scale = True)
unis_json = folium.GeoJson(
    universities,
    name = 'Universities',
    tooltip = folium.GeoJsonTooltip(
        fields = ["Name", "Country", "Rank", "International Students", "No. of FTE Students"],
        aliases = ["Name", "Country", "Rank", "Proportion of International Students", "Student Population"],
    ),
).add_to(m_2)


search = Search(
    layer = unis_json,
    placeholder = "Search for institution",
    position = "topleft",
    collapsed = True,
    search_label = "Name",
    search_zoom = 5,
    weight = 3
).add_to(m_2)

# extra basemaps for the layer control; the Stamen tile sets are now hosted by
# Stadia Maps and are no longer available by name in recent folium releases
folium.raster_layers.TileLayer('CartoDB Positron').add_to(m_2)
folium.raster_layers.TileLayer('CartoDB Dark_Matter').add_to(m_2)
folium.LayerControl().add_to(m_2)
fullscreen = Fullscreen(position="topright")
minimap = MiniMap(toggle_display=True)
m_2.add_child(minimap)
m_2.add_child(fullscreen)


# save as html
# output_II = '/content/drive/MyDrive/top_400/institutions_geolocation.html'
# m_2.save(output_II)

m_2