---

# Smart Street Parking Assistant

**Author:** Stefan Cucos  
**Student ID:** 214397647  
**Email:** cucoss@deakin.edu.au  
**Project Title:** Smart Street Parking Assistant  
**Branch:** Data Science Portfolio – Project B Deakin  
**GitHub Repository:** [Insert after pushing to repo]  
**Notebook Name:** `01_data_collection_and_cleaning.ipynb`  
**Environment:** `smartpark`  
**Date:** 10 April 2025

---

## Step 1 — Data Source Planning

This section outlines the data sources required to build the Smart Street Parking Assistant.  
The objective is to identify available open datasets that provide:

- Real-time or historical parking information  
- Accepted payment methods  
- Applicable parking rules in urban areas

---

### Desired Data Features

- On-street parking zone information (location, type, restrictions)  
- Time limits (e.g., 1P, 2P, etc.)  
- Cost per hour (if applicable)  
- Accepted payment methods (App, Card, Cash)  
- Availability data (if possible – even simulated)

---

### Potential Data Sources

- Australian or local council Open Data portals  
  (e.g., City of Melbourne Open Data)  
- GTFS feeds with parking metadata (if available)  
- Sample or mock datasets for early testing

---

The final list of confirmed datasets will be documented in the next step  
and stored locally in the `data` folder.

---


## Step 1.1 — Preview and Choose an Initial Dataset

In this step, we select a real-world dataset to support the development of the **Smart Street Parking Assistant**.

Our objective is to find and download an open dataset that can be used for local analysis and early prototype testing.

After reviewing several options, we chose a dataset from the **City of Melbourne Open Data** platform:

> **Dataset:** `On-street Parking Bay Sensors`  
> **Source:** data.melbourne.vic.gov.au  
> **Type:** Real-time sensor data (updated every 2 minutes)  
> **Features:** Includes bay ID, sensor status (occupied or not), spatial coordinates, and timestamps

**Note:**  
The feed may experience temporary outages, and readings can be unavailable on public holidays or for bays in construction zones.  
Check the `Lastupdated` column to confirm that the data is fresh.
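
A freshness check of this kind could be sketched as follows. The helper name `snapshot_age` and the 10-minute staleness threshold are assumptions made for illustration, and the one-row frame only mimics the timestamp format used by this dataset.

```python
import pandas as pd


def snapshot_age(df: pd.DataFrame, now: pd.Timestamp,
                 column: str = "Lastupdated") -> pd.Timedelta:
    """Return the age of the most recent snapshot in `column` relative to `now`."""
    latest = pd.to_datetime(df[column], utc=True).max()
    return now.tz_convert("UTC") - latest


# One-row illustrative frame in the dataset's ISO 8601 format
sample = pd.DataFrame({"Lastupdated": ["2025-01-21T14:42:37+11:00"]})
now = pd.Timestamp("2025-01-21T15:00:00+11:00")  # fixed "current time" for the example

age = snapshot_age(sample, now)
is_stale = age > pd.Timedelta(minutes=10)  # hypothetical threshold
print(f"Snapshot age: {age}, stale: {is_stale}")
```

In live use, `now` would typically be `pd.Timestamp.now(tz="UTC")`.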

The dataset has been saved locally in the `data/` folder for this project.


## Step 1.2 — Load and Preview the Dataset

In this step, we will load the downloaded dataset into the notebook to explore its structure and get an initial understanding of the data.

The goal is to:
- Load the dataset using `pandas`
- Preview the first few rows
- Check for column names, data types, and general structure

We will read the CSV file saved in the `data/` folder under the name:  
`on-street-parking-bay-sensors.csv`
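
Because the file lives in the project's `data/` folder, the path can also be built relative to the notebook rather than hard-coded. The sketch below is one possible approach, assuming the notebook runs from the project root; `load_sensor_data`, `DATA_DIR`, and `CSV_NAME` are illustrative names, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

DATA_DIR = Path("data")  # assumed to sit next to the notebook
CSV_NAME = "on-street-parking-bay-sensors.csv"


def load_sensor_data(data_dir: Path = DATA_DIR, name: str = CSV_NAME) -> pd.DataFrame:
    """Read the sensor CSV, failing early with a clear message if it is missing."""
    path = data_dir / name
    if not path.exists():
        raise FileNotFoundError(f"Expected dataset at {path.resolve()}")
    return pd.read_csv(path)


# Usage (requires the CSV to be present in data/):
# df = load_sensor_data()
```

A relative path keeps the notebook runnable on other machines without editing the cell.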



In [12]:
# Step 1.2 — Load and preview the dataset
import pandas as pd

# Full file path to the downloaded dataset
file_path = r"C:\Projects\SmartParkingAssistant\data\on-street-parking-bay-sensors.csv"

# Load the dataset
df = pd.read_csv(file_path)

# Preview the dataset
print("Dataset Shape:", df.shape)
df.head()


Dataset Shape: (3307, 6)


|   | Lastupdated | Status_Timestamp | Zone_Number | Status_Description | KerbsideID | Location |
|---|---|---|---|---|---|---|
| 0 | 2025-01-21T14:42:37+11:00 | 2025-01-21T12:37:41+11:00 | 7053.0 | Unoccupied | 10286 | -37.79493418293051, 144.97155131195115 |
| 1 | 2025-01-21T14:42:37+11:00 | 2025-01-21T07:40:01+11:00 | 7053.0 | Unoccupied | 10279 | -37.79506794595409, 144.97277440018868 |
| 2 | 2025-01-21T14:42:37+11:00 | 2025-01-20T16:57:00+11:00 | 7053.0 | Present | 10280 | -37.79502666258857, 144.97240204442983 |
| 3 | 2025-01-21T14:42:37+11:00 | 2025-01-20T21:25:07+11:00 | 7049.0 | Present | 10288 | -37.79508290762313, 144.9712217164565 |
| 4 | 2025-01-21T14:42:37+11:00 | 2025-01-19T10:01:01+11:00 | 7049.0 | Unoccupied | 10290 | -37.79511384563001, 144.97151325560708 |


### Output Summary – Step 1.2

The dataset has been successfully loaded and contains **3,307 rows** and **6 columns**.  

Each row represents a parking bay sensor reading, and the columns provide:

- `Lastupdated`: When the data snapshot was taken
- `Status_Timestamp`: When the occupancy status was last updated
- `Zone_Number`: Identifier for the parking zone
- `Status_Description`: Indicates if the bay is *Present* (a vehicle is detected, i.e. occupied) or *Unoccupied*
- `KerbsideID`: ID used for spatial mapping or joining datasets
- `Location`: Latitude and longitude of the parking sensor

These variables will be crucial in visualising occupancy, identifying patterns, and simulating smart parking assistance.
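
Since `Location` stores both coordinates in a single string, mapping work will need separate numeric columns. A minimal sketch of that split, using two rows copied from the preview; the column names `latitude` and `longitude` are assumptions:

```python
import pandas as pd

# Two rows copied from the preview above
sample = pd.DataFrame({
    "KerbsideID": [10286, 10279],
    "Location": [
        "-37.79493418293051, 144.97155131195115",
        "-37.79506794595409, 144.97277440018868",
    ],
})

# Split "lat, lon" on the comma, then convert each half to float
coords = sample["Location"].str.split(",", expand=True)
sample["latitude"] = pd.to_numeric(coords[0].str.strip(), errors="coerce")
sample["longitude"] = pd.to_numeric(coords[1].str.strip(), errors="coerce")

print(sample[["KerbsideID", "latitude", "longitude"]])
```

With `errors="coerce"`, any malformed coordinate becomes `NaN` instead of raising, so it can be counted and inspected afterwards.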


---

## Step 1.3 — Clean and Inspect Dataset Structure

In this step, we will examine the structure and quality of the loaded data to prepare it for analysis.

The key objectives are to:

- Identify missing values or duplicates
- Inspect column data types
- Check for inconsistent formatting or outliers

This process ensures that the dataset is clean, well-structured, and ready for further transformation and modelling.
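
The formatting check in particular can be sketched as below. The frame is made-up illustrative data (not taken from the dataset), the expected status values are those seen in the Step 1.2 preview, and the regular expression for `Location` is an assumption about the "lat, lon" layout.

```python
import pandas as pd

# Made-up rows: the third one is deliberately malformed
sample = pd.DataFrame({
    "Status_Description": ["Unoccupied", "Present", "present "],
    "Location": ["-37.7949, 144.9715", "-37.7950, 144.9727", "unknown"],
})

# Status values outside the expected set
expected_status = {"Present", "Unoccupied"}
unexpected_status = sample.loc[
    ~sample["Status_Description"].isin(expected_status), "Status_Description"
]

# Locations that do not look like "<number>, <number>"
location_pattern = r"-?\d+(?:\.\d+)?,\s*-?\d+(?:\.\d+)?"
bad_location = sample.loc[
    ~sample["Location"].str.fullmatch(location_pattern), "Location"
]

print("Unexpected status values:", unexpected_status.tolist())
print("Malformed locations:", bad_location.tolist())
```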

---


In [14]:
# Step 1.3 — Clean and Inspect Dataset Structure

# Check for missing values in each column
missing_values = df.isnull().sum()

# Check data types
data_types = df.dtypes

# Check for duplicates
duplicate_count = df.duplicated().sum()

# Display all checks
print("Missing Values:\n", missing_values)
print("\nData Types:\n", data_types)
print(f"\nNumber of Duplicate Rows: {duplicate_count}")


Missing Values:
 Lastupdated             0
Status_Timestamp        0
Zone_Number           225
Status_Description      0
KerbsideID              0
Location                0
dtype: int64

Data Types:
 Lastupdated            object
Status_Timestamp       object
Zone_Number           float64
Status_Description     object
KerbsideID              int64
Location               object
dtype: object

Number of Duplicate Rows: 0


---

### Output Summary – Step 1.3

After loading the dataset, we performed three key checks:

- **Missing Values**  
  All columns are complete except for `Zone_Number`, which has 225 missing entries. This may need to be addressed in later cleaning steps depending on how critical zone information is for our model.

- **Data Types**  
  - `Lastupdated`, `Status_Timestamp`, `Status_Description`, and `Location` are stored as object (string) types.  
  - `Zone_Number` is a float64 because pandas represents its missing values as `NaN`, which forces an otherwise integer column to float.  
  - `KerbsideID` is an int64.

  Some of these columns — particularly timestamps — will need to be converted to appropriate datetime types before further use.

- **Duplicate Rows**  
  There are 0 duplicate rows, which is ideal.

This structure check confirms that the dataset is mostly clean and ready for transformation, with only a few minor adjustments needed.
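
One option for `Zone_Number` is to keep the missing entries but store the column with pandas' nullable integer dtype, so zone IDs display as integers. This is a sketch of that option rather than the approach the notebook takes; the `has_zone` flag is an illustrative addition.

```python
import pandas as pd

# Zone numbers as loaded: floats, with one missing entry
sample = pd.DataFrame({"Zone_Number": [7053.0, 7049.0, None]})

# "Int64" (capital I) is the nullable integer dtype; missing values become <NA>
sample["Zone_Number"] = sample["Zone_Number"].astype("Int64")

# Flag rows that carry zone information, for later filtering
sample["has_zone"] = sample["Zone_Number"].notna()

print(sample)
```

Whether to drop, impute, or keep these rows depends on how much the later analysis relies on zone-level grouping.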

---


## Step 1.4 — Parse and Format the Timestamps

In this step, we will convert the string-based datetime columns into proper `datetime` objects using `pandas`.

This is essential because:

- It enables accurate **sorting, filtering, and time-based grouping**
- It allows us to calculate **durations**, identify **patterns**, and build **time series models**
- Misformatted timestamps are a common cause of **reporting errors** and **failed visualisations**

We will convert:

- `Lastupdated` → precise moment when the data was refreshed
- `Status_Timestamp` → when the bay status (e.g., Present/Unoccupied) was recorded

Both columns will be converted using `pd.to_datetime()` and displayed to verify success.


In [18]:
# Step 1.4 — Parse and Format the Timestamps

# Convert 'Lastupdated' and 'Status_Timestamp' columns to datetime format (UTC)
df['Lastupdated'] = pd.to_datetime(df['Lastupdated'], errors='coerce', utc=True)
df['Status_Timestamp'] = pd.to_datetime(df['Status_Timestamp'], errors='coerce', utc=True)

# Display the converted columns to verify success
print("Converted Timestamp Columns:\n")
print(df[['Lastupdated', 'Status_Timestamp']].head())


Converted Timestamp Columns:

                Lastupdated          Status_Timestamp
0 2025-01-21 03:42:37+00:00 2025-01-21 01:37:41+00:00
1 2025-01-21 03:42:37+00:00 2025-01-20 20:40:01+00:00
2 2025-01-21 03:42:37+00:00 2025-01-20 05:57:00+00:00
3 2025-01-21 03:42:37+00:00 2025-01-20 10:25:07+00:00
4 2025-01-21 03:42:37+00:00 2025-01-18 23:01:01+00:00


The preview shows both `Lastupdated` and `Status_Timestamp` parsed as timezone-aware datetime values. Because `errors='coerce'` silently turns unparseable strings into `NaT`, the first five rows alone do not prove that every row parsed.

Passing `utc=True` converts the source offsets (`+11:00`, Melbourne daylight time) to Coordinated Universal Time (UTC), so the displayed values sit 11 hours behind local time. Holding every timestamp in a single time zone avoids issues caused by mixed offsets.

This ensures that all future operations involving date comparison, sorting, or time series analysis behave consistently across the dataset.
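
One way to confirm that no rows were lost to coercion is to count values that were present before parsing but are `NaT` afterwards. The sketch below uses a small made-up series; in the notebook, the comparison would need a copy of the raw column taken before the conversion.

```python
import pandas as pd

# Made-up raw values: one valid, one unparseable, one genuinely missing
raw = pd.Series(["2025-01-21T14:42:37+11:00", "not a timestamp", None])

parsed = pd.to_datetime(raw, errors="coerce", utc=True)

# A parse failure is a value that existed in the raw data but is NaT after parsing
failed = parsed.isna() & raw.notna()

print("Rows that failed to parse:", int(failed.sum()))
```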


## Step 1.5 — Extract Temporal Features

In this step, we extract new time-related features from the cleaned `Status_Timestamp` column to help us understand patterns of parking occupancy over time.

The following temporal features were created:

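- **day_of_week**: Numerical day of the week (0=Monday, 6=Sunday), in UTC
- **hour**: Hour of the day (0–23), in UTC
- **date**: Calendar date (YYYY-MM-DD format), in UTC

Because Step 1.4 converted the column to UTC, these values are offset from Melbourne local time. Row 4, for example, was recorded locally at 10:01 on Sunday 19 January but appears as hour 23 on Saturday 18 January.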

These features will allow us to:
- Group or filter data by **specific days or hours**
- Build **usage profiles** based on time of day and weekday/weekend
- Support **time-based analysis** in later stages of the project

Below is a preview of the extracted features:


In [19]:
# Step 1.5 — Extract Temporal Features

# Extract new time-related columns
df['day_of_week'] = df['Status_Timestamp'].dt.dayofweek  # Monday=0, Sunday=6
df['hour'] = df['Status_Timestamp'].dt.hour
df['date'] = df['Status_Timestamp'].dt.date

# Display a preview of the new columns
df[['Status_Timestamp', 'day_of_week', 'hour', 'date']].head()


|   | Status_Timestamp | day_of_week | hour | date |
|---|---|---|---|---|
| 0 | 2025-01-21 01:37:41+00:00 | 1 | 1 | 2025-01-21 |
| 1 | 2025-01-20 20:40:01+00:00 | 0 | 20 | 2025-01-20 |
| 2 | 2025-01-20 05:57:00+00:00 | 0 | 5 | 2025-01-20 |
| 3 | 2025-01-20 10:25:07+00:00 | 0 | 10 | 2025-01-20 |
| 4 | 2025-01-18 23:01:01+00:00 | 5 | 23 | 2025-01-18 |


We have successfully extracted three useful time-based features from the `Status_Timestamp` column:

- `day_of_week`: Indicates the day of the week (0 = Monday, 6 = Sunday)
- `hour`: Extracted hour (0 to 23), useful for hourly trend analysis
- `date`: Simplified calendar date for grouping and filtering
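
For analysis of local parking behaviour, the same features can be derived in Melbourne local time instead. The sketch below assumes the IANA zone name `Australia/Melbourne` is appropriate for this data; the `local_*` column names are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Two timestamps from the Step 1.2 preview, parsed to UTC as in Step 1.4
sample = pd.DataFrame({
    "Status_Timestamp": pd.to_datetime(
        ["2025-01-21T12:37:41+11:00", "2025-01-19T10:01:01+11:00"], utc=True
    )
})

# Convert from UTC to local wall-clock time before extracting features
local = sample["Status_Timestamp"].dt.tz_convert("Australia/Melbourne")
sample["local_day_of_week"] = local.dt.dayofweek  # Monday=0, Sunday=6
sample["local_hour"] = local.dt.hour
sample["local_date"] = local.dt.date

print(sample)
```

Using a named zone rather than a fixed `+11:00` offset means daylight saving transitions are handled by the time zone database.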

These features will allow us to perform time-based aggregations, such as:

- **Identifying peak parking hours** during weekdays or weekends
- **Grouping patterns by day** to compare weekdays vs weekends
- **Building time series models** using daily or hourly aggregates
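
As a rough illustration of such an aggregation, the sketch below computes the share of bays marked `Present` per hour on made-up rows. The `occupied` column is an assumed helper. Note also that the preview suggests the file is a single snapshot, so on this dataset the grouping describes when each bay last changed status rather than a full occupancy history.

```python
import pandas as pd

# Made-up rows standing in for the engineered features
sample = pd.DataFrame({
    "hour": [8, 8, 9, 9, 9],
    "Status_Description": ["Present", "Unoccupied", "Present", "Present", "Unoccupied"],
})

# Boolean flag: True when a vehicle is present in the bay
sample["occupied"] = sample["Status_Description"].eq("Present")

# The mean of a boolean column is the fraction of True values per group
hourly_rate = sample.groupby("hour")["occupied"].mean()

print(hourly_rate)
```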

This marks the completion of our initial data cleaning and temporal feature engineering phase.
