# UAV Developable Zone Clustering: Processing Data

This notebook processes UAV data collected from a public source. The process includes:
---

1.   Data Collection
2.   Preprocessing Data
3.   Feature Extraction
4.   Modeling
5.   Implementation
---

The goal of this process is to cluster the data into the following categories:
---

1.   Developable area for plants
2.   Developable area for constructions
3.   Developable area for both plants and constructions
4.   Non-developable area (already balanced)

*Note: this notebook was made in Google Colab with GitHub integration, so it is best run in the same environment.*

## Required Library

In [36]:
!pip install Pillow torch torchvision



In [37]:
import os
import shutil
import numpy as np
import matplotlib.pyplot as plt
import torch
from torchvision import models, transforms
from PIL import Image

## Data Collection

The data comes from Kaggle. The dataset is available for download at https://www.kaggle.com/datasets/ankit1743/skyview-an-aerial-landscape-dataset (also available in this repo as Aerial_Landscapes.zip).

### Download and Unzip

In [1]:
# Download dataset from github repo
!wget https://github.com/RML1812/uav-developable-zone-clustering/raw/refs/heads/main/Aerial_Landscapes.zip

--2024-10-03 10:26:14--  https://github.com/RML1812/uav-developable-zone-clustering/raw/refs/heads/main/Aerial_Landscapes.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/RML1812/uav-developable-zone-clustering/refs/heads/main/Aerial_Landscapes.zip [following]
--2024-10-03 10:26:14--  https://media.githubusercontent.com/media/RML1812/uav-developable-zone-clustering/refs/heads/main/Aerial_Landscapes.zip
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 161312837 (154M) [application/zip]
Saving to: ‘Aerial_Landscapes.zip’


2024-10-03 10:26:19 (331 MB/s) - ‘Aerial_Landscapes.zip’ 

In [6]:
# Unzip to /data folder
!unzip /content/Aerial_Landscapes.zip -d /content/data/

Streaming output truncated to the last 5000 lines.
  inflating: /content/datas/Grassland/607.jpg  
  inflating: /content/datas/Grassland/608.jpg  
  inflating: /content/datas/Grassland/609.jpg  
  inflating: /content/datas/Grassland/610.jpg  
  inflating: /content/datas/Grassland/611.jpg  
  inflating: /content/datas/Grassland/612.jpg  
  inflating: /content/datas/Grassland/613.jpg  
  inflating: /content/datas/Grassland/614.jpg  
  inflating: /content/datas/Grassland/615.jpg  
  inflating: /content/datas/Grassland/616.jpg  
  inflating: /content/datas/Grassland/617.jpg  
  inflating: /content/datas/Grassland/618.jpg  
  inflating: /content/datas/Grassland/619.jpg  
  inflating: /content/datas/Grassland/620.jpg  
  inflating: /content/datas/Grassland/621.jpg  
  inflating: /content/datas/Grassland/622.jpg  
  inflating: /content/datas/Grassland/623.jpg  
  inflating: /content/datas/Grassland/624.jpg  
  inflating: /content/datas/Grassland/625.jpg  
  inflating: /content/d
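
Outside Colab, the `!unzip` shell command may not be available. As a minimal sketch, the same extraction step can be done with Python's standard `zipfile` module. The helper name `extract_archive` and the demo archive below are assumptions for illustration and are not part of the original notebook.

```python
import os
import tempfile
import zipfile

def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member count."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return len(archive.namelist())

# Demo on a tiny throwaway archive (a stand-in for Aerial_Landscapes.zip)
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("Beach/001.jpg", b"placeholder bytes")
    archive.writestr("Lake/001.jpg", b"placeholder bytes")

extracted = extract_archive(demo_zip, os.path.join(work_dir, "data"))
print(f"Extracted {extracted} files")
```

With the real archive, the call would presumably take the downloaded zip path and the target data folder as its two arguments.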

In [8]:
# Read directory of data and total image in each folder
data_dir = '/content/data'

for folder in os.listdir(data_dir):
  folder_path = os.path.join(data_dir, folder)
  if os.path.isdir(folder_path):
    image_count = len([f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))])
    print(f"Folder: {folder}, Total Images: {image_count}")

Folder: Grassland, Total Images: 800
Folder: Forest, Total Images: 800
Folder: Beach, Total Images: 800
Folder: Parking, Total Images: 800
Folder: Airport, Total Images: 800
Folder: Residential, Total Images: 800
Folder: Port, Total Images: 800
Folder: City, Total Images: 800
Folder: Desert, Total Images: 800
Folder: Mountain, Total Images: 800
Folder: River, Total Images: 800
Folder: Highway, Total Images: 800
Folder: Agriculture, Total Images: 800
Folder: Lake, Total Images: 800
Folder: Railway, Total Images: 800


## Data Preprocessing

### Drop Data

Drop the classes that are unnecessary or incompatible with the purpose of the clustering:

*   **Terrain compatibility**: Desert
*   **Scape purpose**: Grassland, Forest
*   **Functional Structure**: Port, Airport, Agriculture, Railway, Highway


In [10]:
# Delete the Desert, Grassland, Forest, Port, Airport, Agriculture, Railway, and Highway folders from the /content/data folder
folders_to_delete = ['Desert', 'Grassland', 'Forest', 'Port', 'Airport', 'Agriculture', 'Railway', 'Highway']

for folder in folders_to_delete:
  folder_path = os.path.join(data_dir, folder)
  if os.path.exists(folder_path):
    shutil.rmtree(folder_path)
    print(f"Deleted folder: {folder_path}")
  else:
    print(f"Folder not found: {folder_path}")

Deleted folder: /content/data/Desert
Deleted folder: /content/data/Grassland
Deleted folder: /content/data/Forest
Deleted folder: /content/data/Port
Deleted folder: /content/data/Airport
Deleted folder: /content/data/Agriculture
Deleted folder: /content/data/Railway
Deleted folder: /content/data/Highway


In [12]:
# Read remaining data
for folder in os.listdir(data_dir):
  folder_path = os.path.join(data_dir, folder)
  if os.path.isdir(folder_path):
    image_count = len([f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))])
    print(f"Folder: {folder}, Total Images: {image_count}")

Folder: Beach, Total Images: 800
Folder: Parking, Total Images: 800
Folder: Residential, Total Images: 800
Folder: City, Total Images: 800
Folder: Mountain, Total Images: 800
Folder: River, Total Images: 800
Folder: Lake, Total Images: 800
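
The per-folder counting loop runs twice in this notebook, so it could be wrapped in a reusable helper. The sketch below is one possible shape, not the notebook's own code. The name `count_images_per_folder` is hypothetical, and unlike the original loop it counts only files with an image extension and visits folders in sorted order.

```python
import os
import tempfile

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")

def count_images_per_folder(data_dir):
    """Return {folder_name: image_count} for each subfolder of data_dir."""
    counts = {}
    for folder in sorted(os.listdir(data_dir)):
        folder_path = os.path.join(data_dir, folder)
        if os.path.isdir(folder_path):
            counts[folder] = sum(
                1 for f in os.listdir(folder_path)
                if f.lower().endswith(IMAGE_EXTENSIONS)
                and os.path.isfile(os.path.join(folder_path, f))
            )
    return counts

# Demo on a temporary directory with empty placeholder files
demo_dir = tempfile.mkdtemp()
for folder, n in (("Beach", 3), ("Lake", 2)):
    os.makedirs(os.path.join(demo_dir, folder))
    for i in range(n):
        open(os.path.join(demo_dir, folder, f"{i}.jpg"), "wb").close()
open(os.path.join(demo_dir, "Beach", "notes.txt"), "w").close()  # ignored: not an image

counts = count_images_per_folder(demo_dir)
print(counts)
```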


### Cleaning and Normalizing Images

The dataset source states that the data is already clean. To verify this, the following step checks every image against the expected properties.

---
The expected **properties** are:
* Dimensions: 256x256
* Horizontal resolution: 96 dpi
* Vertical resolution: 96 dpi
* Bit depth: 24
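
A 24-bit depth corresponds to three 8-bit channels, which Pillow reports as mode `RGB`. The sketch below derives the bit depth from the band count. The helper name `bit_depth` is hypothetical, and the calculation assumes 8 bits per band, which holds for modes such as `L`, `RGB`, and `RGBA` but not for modes such as `1`, `I`, or `F`.

```python
from PIL import Image

def bit_depth(img):
    """Bits per pixel, assuming 8 bits per band (true for L, RGB, RGBA)."""
    return len(img.getbands()) * 8

# Blank in-memory images stand in for dataset files
rgb = Image.new("RGB", (256, 256))
gray = Image.new("L", (256, 256))
print(bit_depth(rgb), bit_depth(gray))
```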


In [16]:
# Define properties
EXPECTED_DIMENSIONS = (256, 256)
EXPECTED_HORIZONTAL_RESOLUTION = 96
EXPECTED_VERTICAL_RESOLUTION = 96

In [28]:
# Functions to check every image and print the check result
def check_image_properties(image_path):
  try:
    with Image.open(image_path) as img:
      # Check dimensions
      if img.size != EXPECTED_DIMENSIONS:
        return False

      # Check resolution (DPI)
      dpi = img.info.get('dpi', (0, 0))
      if dpi != (0, 0):  # Only check DPI if it's present
        if dpi[0] != EXPECTED_HORIZONTAL_RESOLUTION or dpi[1] != EXPECTED_VERTICAL_RESOLUTION:
          return False

      # Check bit depth (RGB = 24-bit)
      if img.mode != 'RGB':
        return False

    return True

  except Exception as e:
    print(f"Error checking image {image_path}: {e}")
    return False

def check_images_in_folder(folder_path):
  different_properties_count = 0
  total_images = 0

  for root, dirs, files in os.walk(folder_path):
    for file in files:
      if file.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.gif', '.tiff')):
        image_path = os.path.join(root, file)
        total_images += 1
        if not check_image_properties(image_path):
          different_properties_count += 1

  if different_properties_count > 0:
    print(f"Total images with different properties: {different_properties_count}")
  else:
    print("All images have the correct properties.")

  print(f"Total images checked: {total_images}")

In [29]:
check_images_in_folder(data_dir)

All images have the correct properties.
Total images checked: 5600


Since all images have the correct properties, no further normalization or cleaning is needed.
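
If some images had failed the check, a normalization step could look roughly like the sketch below. It is not part of the original pipeline: `normalize_image` and `TARGET_SIZE` are hypothetical names, and resizing this way does not preserve the aspect ratio.

```python
from PIL import Image

TARGET_SIZE = (256, 256)  # dimensions from the expected properties

def normalize_image(img, size=TARGET_SIZE):
    """Return img in RGB mode at the target size (hypothetical helper)."""
    if img.mode != "RGB":
        img = img.convert("RGB")
    if img.size != size:
        img = img.resize(size)  # uses Pillow's default resampling filter
    return img

# Example: a grayscale 300x200 image becomes a 256x256 RGB image
sample = Image.new("L", (300, 200), color=128)
fixed = normalize_image(sample)
print(fixed.mode, fixed.size)
```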

### Store Data

In [31]:
# Convert images to numpy arrays along with labels
def convert_to_numpy_with_labels(data_dir):
  images = []
  labels = []
  label_dict = {}
  label_counter = 0

  for folder_name in os.listdir(data_dir):
    folder_path = os.path.join(data_dir, folder_name)
    if os.path.isdir(folder_path):
      # Assign a unique integer label for each folder (class name)
      if folder_name not in label_dict:
        label_dict[folder_name] = label_counter
        label_counter += 1

      # Process each image file in the folder
      for filename in os.listdir(folder_path):
        if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
          img_path = os.path.join(folder_path, filename)
          try:
            # Load image, convert to RGB, and copy into a numpy array;
            # the context manager closes the file handle afterwards
            with Image.open(img_path) as img:
              img_array = np.array(img.convert("RGB"))

            # Append the image data and its corresponding label
            images.append(img_array)
            labels.append(label_dict[folder_name])
          except Exception as e:
            print(f"Error processing image {img_path}: {e}")

  # Convert lists to numpy arrays
  images_np = np.array(images)
  labels_np = np.array(labels)

  return images_np, labels_np, label_dict

In [33]:
# Run convert function
images, labels, label_dict = convert_to_numpy_with_labels(data_dir)
print(f"Total images: {len(images)}")
print(f"Label dictionary: {label_dict}")

Total images: 5600
Label dictionary: {'Beach': 0, 'Parking': 1, 'Residential': 2, 'City': 3, 'Mountain': 4, 'River': 5, 'Lake': 6}
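
The label IDs follow the order in which `os.listdir` returns the folders, and the Python documentation describes that order as arbitrary. The same data could therefore get different IDs on another machine. A minimal sketch of a deterministic alternative sorts the folder names first. The helper name `build_label_dict` is hypothetical, and alphabetical sorting would produce different IDs from the ones printed in this notebook.

```python
import os
import tempfile

def build_label_dict(data_dir):
    """Map each subfolder name to an integer ID, in alphabetical order."""
    class_names = sorted(
        name for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    )
    return {name: idx for idx, name in enumerate(class_names)}

# Demo: folders created out of order still get alphabetical IDs
demo_dir = tempfile.mkdtemp()
for name in ("River", "Beach", "City"):
    os.makedirs(os.path.join(demo_dir, name))

label_dict_demo = build_label_dict(demo_dir)
print(label_dict_demo)
```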


In [34]:
# Check detail of images
def print_total_images_per_label(labels_np, label_dict):
    # Get unique labels and their counts
    unique_labels, counts = np.unique(labels_np, return_counts=True)

    # Invert the label_dict to get class names from integer labels
    inverted_label_dict = {v: k for k, v in label_dict.items()}

    print("Total images per label:")
    for label, count in zip(unique_labels, counts):
        class_name = inverted_label_dict[label]
        print(f"Label '{class_name}' (ID {label}): {count} images")

print_total_images_per_label(labels, label_dict)

Total images per label:
Label 'Beach' (ID 0): 800 images
Label 'Parking' (ID 1): 800 images
Label 'Residential' (ID 2): 800 images
Label 'City' (ID 3): 800 images
Label 'Mountain' (ID 4): 800 images
Label 'River' (ID 5): 800 images
Label 'Lake' (ID 6): 800 images
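
The arrays above live only in memory. To reuse them in a later session, they could be written to disk, for example with NumPy's compressed `.npz` format. The sketch below uses small stand-in arrays. The file name `dataset.npz` and the variable names are assumptions, not something the original notebook defines.

```python
import os
import tempfile
import numpy as np

# Stand-in arrays with the same layout as `images` and `labels` (assumed shapes)
images_demo = np.zeros((4, 256, 256, 3), dtype=np.uint8)
labels_demo = np.array([0, 0, 1, 1])

out_path = os.path.join(tempfile.mkdtemp(), "dataset.npz")  # hypothetical file name
np.savez_compressed(out_path, images=images_demo, labels=labels_demo)

# Load the arrays back; the context manager closes the archive
with np.load(out_path) as data:
    restored_images = data["images"]
    restored_labels = data["labels"]

print(restored_images.shape, restored_labels.tolist())
```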


## Feature Extraction