
from IPython.display import HTML
HTML("""
<style>
.usecase-title, .usecase-duration, .usecase-section-header {
    padding-left: 15px;
    padding-bottom: 10px;
    padding-top: 10px;
    padding-right: 15px;
    background-color: #0f9295;
    color: #fff;
}

.usecase-title {
    font-size: 1.7em;
    font-weight: bold;
}

.usecase-authors, .usecase-level, .usecase-skill {
    padding-left: 15px;
    padding-bottom: 7px;
    padding-top: 7px;
    background-color: #baeaeb;
    font-size: 1.4em;
    color: #121212;
}

.usecase-level-skill  {
    display: flex;
}

.usecase-level, .usecase-skill {
    width: 50%;
}

.usecase-duration, .usecase-skill {
    text-align: right;
    padding-right: 15px;
    padding-bottom: 8px;
    font-size: 1.4em;
}

.usecase-section-header {
    font-weight: bold;
    font-size: 1.5em;
}

.usecase-subsection-header, .usecase-subsection-blurb {
    font-weight: bold;
    font-size: 1.2em;
    color: #121212;
}

.usecase-subsection-blurb {
    font-size: 1em;
    font-style: italic;
}
</style>
""")

# Optimising School Infrastructure Planning

**Authored by**:  Linh Huong Nguyen

**Duration**: 90 mins

**Level**: Intermediate

**Pre-requisite Skills**: Python, Pandas, Matplotlib, NumPy, Seaborn, Scikit-learn


### Scenario

As a government policymaker in the education sector, I want to predict future school enrollment trends and optimize school locations, so that I can ensure efficient resource allocation, avoid overcrowding, and provide access to education in growing communities.

### What this use case will teach you

At the end of this use case, you will have demonstrated the following skills:

- Extracted and imported datasets from open data portals for analysis using APIs.

- Conducted data cleaning and preprocessing to ensure data quality and consistency.

- Performed exploratory data analysis (EDA) to uncover trends, patterns, and anomalies.

- Utilized clustering algorithms to segment regions or data points based on relevant features.

- Applied advanced machine learning and statistical methods to evaluate clustering outcomes and derive meaningful insights.

### Background and Introduction

Shifting demographics and changing population density pose significant challenges for governments and policymakers in the education sector. As populations grow and migrate, educational infrastructure must keep pace with demand. This use case addresses the need for data-driven decision-making when planning school locations and projecting enrollment. Population forecasts by small area indicate where future enrollment is likely to grow, and the school locations dataset shows where existing schools sit relative to that growth. Combining the two lets policymakers place schools strategically, optimize resource distribution, avoid overcrowding in existing schools, and keep education accessible to all students, especially in rapidly growing communities.

The overarching objective of this use case is to empower education departments with the tools and insights needed to plan for the future, ensuring that resources are allocated efficiently and effectively while providing equitable access to education for all students. Through strategic school placement and thoughtful infrastructure planning, this use case supports the long-term sustainability of the educational system in response to evolving community needs.

### Datasets used


- [2024 School Locations](https://discover.data.vic.gov.au/dataset/school-locations-2024) <br>
This dataset lists all school locations in Victoria, including primary and secondary schools in both the government and non-government sectors. The information is collected as part of the ongoing registration of schools. School details include name, school sector, school type, address, phone, region and area. The departmental region and area are also included; both are based on the school administration campus. This dataset is sourced from the Victoria Department of Education and can be accessed via API V2.1.

- [City of Melbourne Population Forecasts by Small Area 2023-2043](https://data.melbourne.vic.gov.au/explore/dataset/city-of-melbourne-population-forecasts-by-small-area-2020-2040/information/)<br>
This dataset provides population forecasts by single year for 2023 to 2043, broken down by gender, age group, and geography (small area). This dataset is sourced from the Melbourne Open Data website using API V2.1.

- [Small Areas for Census of Land Use and Employment (CLUE)](https://data.melbourne.vic.gov.au/explore/dataset/small-areas-for-census-of-land-use-and-employment-clue/information/) <br>
This dataset contains the spatial layer of small areas used for the City of Melbourne's Census of Land Use and Employment (CLUE) analysis. It is used here to provide detailed locations for the small areas. This dataset is sourced from the Melbourne Open Data website using API V2.1.


### Importing Datasets

This section imports the libraries used throughout the use case: requests and StringIO for fetching and reading API data, pandas for data manipulation, seaborn and matplotlib for visualization, and scikit-learn for clustering. folium, used for interactive mapping, is imported later where the map is built.

In [1]:
import os
from io import StringIO

import matplotlib.pyplot as plt
import pandas as pd
import requests
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

### Loading the datasets using API v2.1

This section defines two functions for fetching data. The API_Unlimited function retrieves a dataset from the Melbourne Open Data Portal by dataset ID through the API v2.1 CSV export endpoint, loads it into a DataFrame, and prints a random sample of 10 rows for verification. Similarly, the API_Unlimited_external function downloads a CSV file from the Victorian Department of Education website, loads it into a DataFrame, and prints a sample for validation. If a request fails, each function prints the HTTP status code instead.

In [4]:
def API_Unlimited(datasetname):  # pass in the dataset ID from the Melbourne Open Data portal
    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    export_format = 'csv'

    url = f'{base_url}{datasetname}/exports/{export_format}'

    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC'
    }

    # GET request
    response = requests.get(url, params=params)

    if response.status_code == 200:
        # Decode the CSV payload and read it through StringIO
        url_content = response.content.decode('utf-8')
        dataset = pd.read_csv(StringIO(url_content), delimiter=';')
        # Preview up to 10 random rows to confirm the load
        print(dataset.sample(min(10, len(dataset)), random_state=999))
        return dataset
    else:
        # Report the failure and return None explicitly
        print(f'Request failed with status code {response.status_code}')
        return None

In [16]:
def API_Unlimited_external(datasetname):  # pass in the CSV file name without its extension
    base_url = 'https://www.education.vic.gov.au/Documents/about/research/datavic/'
    export_format = 'csv'

    # The file is a static CSV download, so no API query parameters are sent
    url = f'{base_url}{datasetname}.{export_format}'

    # GET request
    response = requests.get(url)

    if response.status_code == 200:
        # Decode as UTF-8; the default guess appears to garble apostrophes in school names
        response.encoding = 'utf-8'
        url_content = response.text
        dataset = pd.read_csv(StringIO(url_content), delimiter=',')
        # Preview up to 10 random rows to confirm the load
        print(dataset.sample(min(10, len(dataset)), random_state=999))
        return dataset
    else:
        # Report the failure and return None explicitly
        print(f'Request failed with status code {response.status_code}')
        return None
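
The two steps inside API_Unlimited, building the export URL and parsing the semicolon-delimited response, can be checked without a network call. The following is a minimal sketch, assuming the same base URL as above; `build_export_url` and `parse_csv_text` are hypothetical helper names introduced only for illustration, and the payload reuses two rows from the population forecast preview.

```python
from io import StringIO

import pandas as pd

BASE_URL = "https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/"


def build_export_url(dataset_id: str, fmt: str = "csv") -> str:
    """Return the export endpoint for a dataset ID (hypothetical helper)."""
    return f"{BASE_URL}{dataset_id}/exports/{fmt}"


def parse_csv_text(text: str, delimiter: str = ";") -> pd.DataFrame:
    """Parse CSV text that has already been downloaded and decoded."""
    return pd.read_csv(StringIO(text), delimiter=delimiter)


# Two rows in the portal's semicolon-delimited layout, used in place of a live response.
sample_payload = (
    "geography;year;gender;age;value\n"
    "Docklands;2034;Male;Age 65-69;352\n"
    "Parkville;2037;Male;Age 20-24;1068\n"
)

export_url = build_export_url("small-areas-for-census-of-land-use-and-employment-clue")
sample_frame = parse_csv_text(sample_payload)
print(export_url)
print(sample_frame)
```

Separating URL construction from parsing makes it easier to test each part on its own before requesting the full dataset.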



### Fetching and Previewing Datasets

This section defines the dataset identifiers required for the use case and fetches the corresponding data using the API_Unlimited and API_Unlimited_external functions. The datasets are the population forecasts by small area, the CLUE small-area boundaries, and the 2024 school locations, which together support comparing forecast population growth with existing school locations. Each function prints a random sample of rows on retrieval to confirm successful loading.

In [17]:
download_link_1 = 'city-of-melbourne-population-forecasts-by-small-area-2020-2040'
download_link_2 = 'small-areas-for-census-of-land-use-and-employment-clue'
download_link_3 = 'dv378_DataVic-SchoolLocations-2024'

# Use functions to download and load data
population_forecasts = API_Unlimited(download_link_1)
small_areas = API_Unlimited(download_link_2)
school_locations = API_Unlimited_external(download_link_3)

                         geography  year  gender        age  value
9869   West Melbourne (Industrial)  2029    Male  Age 80-84      0
1234               Melbourne (CBD)  2025    Male  Age 40-44   1222
7339                     Docklands  2034    Male  Age 65-69    352
6302   West Melbourne (Industrial)  2029  Female  Age 65-69      0
6867             City of Melbourne  2029    Male  Age 65-69   2237
1211               Melbourne (CBD)  2023    Male  Age 55-59    502
9875   West Melbourne (Industrial)  2030  Female  Age 80-84      0
10449            City of Melbourne  2036    Male  Age 35-39  13729
8939                     Parkville  2037    Male  Age 20-24   1068
1160                    Kensington  2039    Male  Age 20-24    417
                               geo_point_2d  \
2    -37.83760704949379, 144.98292521995853   
4    -37.79844895689088, 144.94506274103145   
11   -37.81381109987871, 144.96291513859617   
6    -37.83183174511404, 144.91223395712774   
12  -37.828764031547315, 144

### Displaying Dataset Overview

This part of the code verifies the datasets by previewing the first few rows of each and printing its row count, column types, and non-null counts. It confirms that the datasets have been successfully loaded and are ready for analysis.

In [8]:
print(population_forecasts.head())
print(population_forecasts.info())

           geography  year  gender        age  value
0  City of Melbourne  2025    Male  Age 25-29  19246
1  City of Melbourne  2029  Female  Age 30-34  15926
2  City of Melbourne  2031  Female  Age 40-44   8304
3  City of Melbourne  2033    Male  Age 15-19   4855
4  City of Melbourne  2030  Female  Age 40-44   7757
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10584 entries, 0 to 10583
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   geography  10584 non-null  object
 1   year       10584 non-null  int64 
 2   gender     10584 non-null  object
 3   age        10584 non-null  object
 4   value      10584 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 413.6+ KB
None
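
The preview shows the forecasts in long format, with one row per geography, year, gender, and age band. For enrollment planning, one plausible next step is to total the school-age bands per small area and year. The sketch below makes several assumptions: the sample rows and their values are invented for illustration, the labels "Age 5-9" and "Age 10-14" are assumed to follow the pattern of "Age 15-19" seen in the preview, and the "City of Melbourne" geography is assumed to be a municipality-wide total that should be excluded to avoid double counting. `school_age_totals` is a hypothetical helper name.

```python
import pandas as pd

# Assumed school-age bands; only "Age 15-19" is confirmed by the preview.
SCHOOL_AGE_BANDS = ["Age 5-9", "Age 10-14", "Age 15-19"]


def school_age_totals(forecasts: pd.DataFrame, bands=SCHOOL_AGE_BANDS) -> pd.DataFrame:
    """Sum forecast population in the given age bands per small area and year."""
    is_school_age = forecasts["age"].isin(bands)
    # Assumption: "City of Melbourne" rows are totals across all small areas.
    is_small_area = forecasts["geography"] != "City of Melbourne"
    subset = forecasts.loc[is_school_age & is_small_area]
    return subset.groupby(["geography", "year"], as_index=False)["value"].sum()


# Invented sample rows with the same columns as population_forecasts.
sample_forecasts = pd.DataFrame(
    [
        ("Docklands", 2030, "Female", "Age 5-9", 100),
        ("Docklands", 2030, "Male", "Age 5-9", 110),
        ("Docklands", 2030, "Male", "Age 15-19", 90),
        ("Docklands", 2030, "Male", "Age 65-69", 300),
        ("Kensington", 2030, "Female", "Age 10-14", 80),
        ("City of Melbourne", 2030, "Male", "Age 5-9", 1000),
    ],
    columns=["geography", "year", "gender", "age", "value"],
)

totals = school_age_totals(sample_forecasts)
print(totals)
```

On the real data, the same call would take `population_forecasts` as its argument once the age labels have been confirmed with `population_forecasts['age'].unique()`.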


In [9]:
print(small_areas.head())
print(small_areas.info())

                             geo_point_2d  \
0   -37.78711656492933, 144.9515603312268   
1  -37.82529018627908, 144.96176162794978   
2  -37.83760704949379, 144.98292521995853   
3  -37.814581164837946, 144.9825008488323   
4  -37.79844895689088, 144.94506274103145   

                                           geo_shape       featurenam  \
0  {"coordinates": [[[[144.94036533536232, -37.78...        Parkville   
1  {"coordinates": [[[[144.95599687351128, -37.82...        Southbank   
2  {"coordinates": [[[[144.98502208625717, -37.84...      South Yarra   
3  {"coordinates": [[[[144.9732217743585, -37.807...   East Melbourne   
4  {"coordinates": [[[[144.95732229939304, -37.80...  North Melbourne   

     shape_area    shape_len  
0  4.050997e+06  9224.569397  
1  1.596010e+06  6012.377239  
2  1.057773e+06  5424.136446  
3  1.909073e+06  6557.914249  
4  2.408377e+06  7546.649191  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 5 columns):
 #
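
The preview shows that `geo_point_2d` stores each small-area centroid as a single "latitude, longitude" string. To use the centroids numerically, the string can be split into two float columns. This is a minimal sketch; `split_geo_point` and the `lat` and `lon` column names are assumptions introduced here, and the sample reuses the first two rows of the preview.

```python
import pandas as pd


def split_geo_point(frame: pd.DataFrame, column: str = "geo_point_2d") -> pd.DataFrame:
    """Return a copy of frame with numeric lat and lon columns parsed from 'lat, lon' strings."""
    parts = frame[column].str.split(",", expand=True)
    result = frame.copy()
    result["lat"] = pd.to_numeric(parts[0].str.strip())
    result["lon"] = pd.to_numeric(parts[1].str.strip())
    return result


# First two rows of the small_areas preview.
sample_areas = pd.DataFrame(
    {
        "featurenam": ["Parkville", "Southbank"],
        "geo_point_2d": [
            "-37.78711656492933, 144.9515603312268",
            "-37.82529018627908, 144.96176162794978",
        ],
    }
)

centroids = split_geo_point(sample_areas)
print(centroids[["featurenam", "lat", "lon"]])
```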

In [20]:
print(school_locations.head())
print(school_locations.info())

  Education_Sector  Entity_Type  School_No                    School_Name  \
0         Catholic            2         20                 Parade College   
1         Catholic            2         25       Simonds Catholic College   
2         Catholic            2         26      St Mary’s College Melbourne   
3         Catholic            2         28  St Patrick's College Ballarat   
4         Catholic            2         29            St Patrick's School   

  School_Type             Address_Line_1 Address_Line_2    Address_Town  \
0   Secondary           1436 Plenty Road            NaN        BUNDOORA   
1   Secondary        273 Victoria Street            NaN  WEST MELBOURNE   
2   Secondary         11 Westbury Street            NaN   ST KILDA EAST   
3   Secondary          1431 Sturt Street            NaN        BALLARAT   
4     Primary  119 Drummond Street South            NaN        BALLARAT   

  Address_State  Address_Postcode  ...     Postal_Town Postal_State  \
0           V

### Data Cleaning and Processing

#### Check for missing values and Duplicate Values

This section performs a data quality check by counting missing values and duplicate rows in the population forecasts, small areas, and school locations datasets. This highlights potential issues that need to be addressed before further analysis.

- Population Forecast Dataset

In [21]:
# Check if there are any missing values in the datasets
print(population_forecasts.isnull().sum())

#Check if there are any duplicates in the datasets
print(f"\nDuplicate rows in Population Forecast Dataset: {population_forecasts.duplicated().sum()}\n")  # Check for duplicates in the dataset and print the count of duplicates.

geography    0
year         0
gender       0
age          0
value        0
dtype: int64

Duplicate rows in Population Forecast Dataset: 0



- Small Areas Dataset

In [22]:
# Check if there are any missing values in the datasets
print(small_areas.isnull().sum())

#Check if there are any duplicates in the datasets
print(f"\nDuplicate rows in Small Areas Dataset: {small_areas.duplicated().sum()}\n")  # Check for duplicates in the dataset and print the count of duplicates.

geo_point_2d    0
geo_shape       0
featurenam      0
shape_area      0
shape_len       0
dtype: int64

Duplicate rows in Small Areas Dataset: 0



- School Locations Dataset

In [24]:
# Check if there are any missing values in the datasets
print(school_locations.isnull().sum())

#Check if there are any duplicates in the datasets
print(f"\nDuplicate rows in School Locations Dataset: {school_locations.duplicated().sum()}\n")  # Check for duplicates in the dataset and print the count of duplicates.

Education_Sector            0
Entity_Type                 0
School_No                   0
School_Name                 0
School_Type                 0
Address_Line_1              0
Address_Line_2           2283
Address_Town                0
Address_State               0
Address_Postcode            0
Postal_Address_Line_1       2
Postal_Address_Line_2    2280
Postal_Town                 2
Postal_State                2
Postal_Postcode             2
Full_Phone_No               0
Region_Name                 0
AREA_Name                   0
LGA_ID                      0
LGA_Name                    0
X                           1
Y                           1
dtype: int64

Duplicate rows in School Locations Dataset: 0
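
The three checks above repeat the same two calls, so they could be collected into one summary table. The following is a sketch only; `quality_report` is a hypothetical helper, and the two toy frames are invented to show the counts it produces.

```python
import pandas as pd


def quality_report(frames: dict) -> pd.DataFrame:
    """Summarise row count, missing cells, and duplicate rows for each named DataFrame."""
    rows = []
    for name, frame in frames.items():
        rows.append(
            {
                "dataset": name,
                "rows": len(frame),
                "missing_cells": int(frame.isnull().sum().sum()),
                "duplicate_rows": int(frame.duplicated().sum()),
            }
        )
    return pd.DataFrame(rows)


# Invented toy data: toy_a has one missing cell and one repeated row.
toy_a = pd.DataFrame({"X": [144.9, None, 144.9], "Y": [-37.8, -37.7, -37.8]})
toy_b = pd.DataFrame({"year": [2023, 2024], "value": [10, 12]})

report = quality_report({"toy_a": toy_a, "toy_b": toy_b})
print(report)
```

With the real data, the call would be `quality_report({'population_forecasts': population_forecasts, 'small_areas': small_areas, 'school_locations': school_locations})`.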



#### Handling Missing and Duplicate Data

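This section addresses missing values in the school locations dataset by removing irrelevant and redundant columns, then dropping the one row that has no coordinates. No duplicate rows were found, so no deduplication is needed.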

#### Removing unnecessary columns

In [25]:
school_clean=school_locations.drop(columns=['Address_Line_2','Entity_Type','Postal_Address_Line_1','Postal_Address_Line_2','Postal_Town','Postal_State','Postal_Postcode','LGA_ID','LGA_Name'])

#### Handling missing values

In [26]:
# Check if there are any missing values in the two datasets
print(school_clean.isnull().sum())
print(population_forecasts.isnull().sum())


Education_Sector    0
School_No           0
School_Name         0
School_Type         0
Address_Line_1      0
Address_Town        0
Address_State       0
Address_Postcode    0
Full_Phone_No       0
Region_Name         0
AREA_Name           0
X                   1
Y                   1
dtype: int64
geography    0
year         0
gender       0
age          0
value        0
dtype: int64


In [28]:
# Drop the one school with missing coordinates (X = longitude, Y = latitude)
school_clean = school_clean.dropna(subset=['X', 'Y'])


In [32]:
# Import folium and create a map centered on the mean school coordinates
import folium

# Average lat/lon for center (Y = latitude, X = longitude)
melbourne_center = [school_clean['Y'].mean(), school_clean['X'].mean()]
school_map = folium.Map(location=melbourne_center, zoom_start=13)

# Add a marker for each school
for _, row in school_clean.iterrows():
    folium.Marker(
        location=[row['Y'], row['X']],
        # Optional popup with the school name and address:
        # popup=f"<b>{row['School_Name']}</b><br>{row['Address_Line_1']}",
        icon=folium.Icon(color='blue', icon='graduation-cap', prefix='fa')
    ).add_to(school_map)

# Display the map
school_map
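
Because the school locations dataset covers all of Victoria, the map above plots every school in the state, and the mean coordinate is not necessarily close to Melbourne. One possible way to narrow the view is to keep only schools within a fixed distance of the CBD, using the haversine great-circle formula. This is a sketch under stated assumptions: the CBD coordinates are approximate, the 10 km radius is an arbitrary illustrative choice, the helper names are hypothetical, and the two toy schools are invented.

```python
import numpy as np
import pandas as pd

CBD_LAT, CBD_LON = -37.8136, 144.9631  # approximate Melbourne CBD (assumption)
EARTH_RADIUS_KM = 6371.0


def distance_from_cbd_km(lat, lon):
    """Haversine distance in km from the CBD; accepts scalars or pandas Series."""
    lat1, lon1 = np.radians(CBD_LAT), np.radians(CBD_LON)
    lat2, lon2 = np.radians(lat), np.radians(lon)
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def schools_near_cbd(schools: pd.DataFrame, radius_km: float = 10.0) -> pd.DataFrame:
    """Keep rows whose (Y, X) = (lat, lon) lie within radius_km of the CBD."""
    distances = distance_from_cbd_km(schools["Y"], schools["X"])
    return schools.loc[distances <= radius_km]


# Invented toy rows using the same X (longitude) and Y (latitude) columns as school_clean.
toy_schools = pd.DataFrame(
    {
        "School_Name": ["Inner school (hypothetical)", "Regional school (hypothetical)"],
        "X": [144.95, 143.85],
        "Y": [-37.80, -37.56],
    }
)

nearby = schools_near_cbd(toy_schools)
print(nearby)
```

Applied to the real data, `schools_near_cbd(school_clean)` would return a much smaller frame to pass to the marker loop. A spatial join against the small-area polygons would be more precise than a radius.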
