### Computing attributes for building classification inference
The model classifies individual buildings as residential, non-residential, or industrial using a set of features describing the building and its surroundings. This notebook computes the attributes required to run the model.

**Attributes computed in this notebook:**

* `perimeter_to_area_ratio`: A measure of shape complexity (perimeter / area), capped at a maximum value of 6.5.
* `radius_m`: A custom "average radius" metric, calculated in meters.
* `distance_to_1`: The distance (in meters) to the nearest Category 1 road (major highways).
* `nearest_road_type_1`: The highway type (e.g., 'motorway') of that nearest road.
* `distance_to_2`: The distance (in meters) to the nearest Category 2 road (secondary roads).
* `nearest_road_type_2`: The highway type (e.g., 'secondary') of that nearest road.
* `distance_to_3`: The distance (in meters) to the nearest Category 3 road (tertiary roads).
* `nearest_road_type_3`: The highway type (e.g., 'tertiary') of that nearest road.
* `distance_to_4`: The distance (in meters) to the nearest Category 4 road (local/residential roads).
* `nearest_road_type_4`: The highway type (e.g., 'residential') of that nearest road.
* `road_density_for_4_fixed`: The density of Category 4 roads within a 200-meter radius.
* `road_density_for_5_fixed`: The density of "Category 5" roads, which are the same road types as Category 4 but measured within a 100-meter radius.
* `SQN`: A "Squareness Index" (closer to 1.0 is more square).
* `faces`: The number of sides of each building (derived from `num_vertices` and capped at a maximum of 20).

**Computational times**: The notebook was run on a local machine with the following technical characteristics: 13th Gen Intel(R) Core(TM) i7-13700H processor (2.40 GHz), 16.0 GB of installed RAM (15.7 GB usable), and a 64-bit operating system on an x64-based processor. It was tested on Patan, India, which covers a total of 182,313 buildings. **The total processing time of this notebook was 15.2 minutes.**

In [1]:
import time
starting_time = time.time()

In [2]:
import pandas as pd
import geopandas as gpd
import numpy as np
import shapely
from scipy.spatial.distance import pdist
from scipy.spatial import cKDTree
from sklearn.neighbors import BallTree
from pyproj import Geod 
from geopy.distance import geodesic

from collections import Counter
from tqdm import tqdm
from shapely import wkt
from shapely import wkb

import warnings
warnings.filterwarnings("ignore")

geod = Geod(ellps="WGS84")

In [3]:
file_parquet = r"C:\Users\renec\OneDrive\Documents\SEForALL\GitHub\new-classification-model-local\try_AP_BD"

### Data Loading & Spatial Preparation
It reads a Parquet file into a standard pandas DataFrame. The "geometry" column in the file is stored in binary form (WKB, well-known binary), so it is first decoded into shapely geometries. The pandas DataFrame is then converted into a GeoDataFrame, which enables geospatial operations. Finally, the index is reset to a clean, sequential list of numbers (0, 1, 2...), which helps prevent indexing errors later on.
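
For context, WKB is a compact byte encoding of a geometry. The sketch below uses only the standard library to build and decode the 21-byte WKB of a 2D point by hand, which shows the kind of payload `wkb.loads` parses. The helpers `encode_point_wkb` and `decode_point_wkb` are hypothetical names introduced for illustration; real files should be read with `shapely.wkb.loads`, as in the notebook.

```python
import struct
from typing import Tuple


def encode_point_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper)."""
    # Layout: 1 byte-order flag (1 = little endian), a uint32 geometry
    # type (1 = Point), then two float64 coordinates.
    return struct.pack("<BIdd", 1, 1, x, y)


def decode_point_wkb(data: bytes) -> Tuple[float, float]:
    """Decode the WKB of a 2D point (hypothetical helper)."""
    byte_order = "<" if data[0] == 1 else ">"
    geom_type, x, y = struct.unpack(byte_order + "Idd", data[1:21])
    if geom_type != 1:
        raise ValueError(f"expected a Point (type 1), got type {geom_type}")
    return x, y


blob = encode_point_wkb(78.5, 17.4)
point = decode_point_wkb(blob)
```

Polygons follow the same idea with extra counters for rings and vertices, which is why a dedicated parser is used in practice.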

In [4]:
df_K = pd.read_parquet(file_parquet)
df_K["geometry"] = df_K["geometry"].apply(wkb.loads)
df_K = gpd.GeoDataFrame(df_K, geometry='geometry')
#df_K = gpd.GeoDataFrame(df_K, geometry=shapely.from_wkb(df_K['geometry']))
df_K.index = [i for i in range(len(df_K))]

### Perimeter-to-area ratio
It calculates the geodesic perimeter of each building (in meters, measured on the WGS84 ellipsoid) and divides it by the building's area to obtain the perimeter-to-area ratio (PAR). This ratio is a classic measure of shape complexity: a high value means a more complex, less compact shape. The ratio is then clipped at a maximum of 6.5, a data-cleaning step that caps extreme, unrealistic values.
Finally, it normalizes the ratio by dividing every value by the column maximum after clipping, which equals 6.5 whenever at least one building reaches the cap.
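
The clip-then-normalize logic can be sketched on plain numbers. This is a minimal illustration that assumes the perimeter and area are already known in meters and square meters; the function name `perimeter_to_area_ratio` and the two toy footprints are hypothetical and not part of the notebook.

```python
PAR_CAP = 6.5  # cap used in the notebook


def perimeter_to_area_ratio(perimeter_m: float, area_m2: float, cap: float = PAR_CAP) -> float:
    """Perimeter / area, capped at `cap` (hypothetical helper)."""
    if area_m2 <= 0:
        raise ValueError("area must be positive")
    return min(perimeter_m / area_m2, cap)


# A 10 m x 10 m footprint: 40 / 100 = 0.4
# A 0.5 m x 0.5 m sliver: 2 / 0.25 = 8.0, which is capped to 6.5
ratios = [
    perimeter_to_area_ratio(40.0, 100.0),
    perimeter_to_area_ratio(2.0, 0.25),
]

# Normalize by the observed maximum, as the notebook does
max_ratio = max(ratios)
normalized = [r / max_ratio for r in ratios]
```

Because the divisor is the observed maximum rather than the constant 6.5, the normalized values depend on the dataset if no building reaches the cap.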

In [5]:
df_K['building_perimeter_in_meters_new'] = \
df_K["geometry"].apply(lambda g: geod.geometry_area_perimeter(g)[1])#this extracts second value from the tuple which is perimeter

df_K['perimeter_to_area_ratio'] = df_K['building_perimeter_in_meters_new'] / df_K['area_in_meters']
df_K['perimeter_to_area_ratio'] = df_K['perimeter_to_area_ratio'].clip(upper=6.5)
df_K['normalized_perimeter_to_area_ratio'] = df_K['perimeter_to_area_ratio'] / df_K['perimeter_to_area_ratio'].max()

### Radius calculation

It first defines a helper function, `calculate_radius`, which returns the mean distance from a polygon's centroid to the vertices of its exterior ring (the "average radius"). It then reprojects the data from latitude/longitude into a meter-based CRS (EPSG:7767), computes the centroid and the new `radius_m` feature there, and projects the data back to lat/lon. Finally, it creates another feature, `num_vertices`, by counting the vertices of each shape (the closing vertex repeats the first one, so a four-sided building has 5).
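
To make the "average radius" concrete, here is a dependency-free sketch of the same idea on a closed coordinate ring. It assumes planar coordinates in meters and a ring whose last point repeats the first, as shapely's `exterior.coords` does; `polygon_centroid` and `mean_vertex_radius` are hypothetical names, not functions from the notebook.

```python
import math


def polygon_centroid(ring):
    """Area-weighted centroid of a closed ring [(x, y), ..., first point again]."""
    twice_area = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring[:-1], ring[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate ring with zero area")
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


def mean_vertex_radius(ring):
    """Mean distance from the centroid to every coordinate of the ring."""
    cx, cy = polygon_centroid(ring)
    # The repeated closing vertex is included, mirroring exterior.coords
    return sum(math.hypot(x - cx, y - cy) for x, y in ring) / len(ring)


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0), (0.0, 0.0)]
radius = mean_vertex_radius(square)
```

For the 4 m square every corner is the same distance from the centroid, so the duplicate closing vertex does not change the mean; for irregular shapes it gives the first vertex double weight.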

In [6]:
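def calculate_radius(geometry):
    """Mean distance (in CRS units) from the centroid to the exterior-ring vertices."""
    # Non-polygon geometries (e.g. MultiPolygon) are replaced by their convex hull
    geometry = geometry if geometry.geom_type == 'Polygon' else geometry.convex_hull
    centroid = geometry.centroid
    boundary_points = np.array(geometry.exterior.coords)  # closing vertex is repeated
    distances = np.linalg.norm(boundary_points - np.array([centroid.x, centroid.y]), axis=1)
    return np.mean(distances)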

In [7]:
df_K.set_crs("EPSG:4326", inplace=True, allow_override=True)  
print("Current CRS:", df_K.crs)
df_K = df_K.to_crs("EPSG:7767") 
df_K["centroid"] = df_K.geometry.centroid
df_K["radius_m"] = df_K["geometry"].apply(calculate_radius) # Compute radius in meters
df_K = df_K.to_crs("EPSG:4326") # Convert back to WGS84 (if needed)
df_K["num_vertices"] = df_K["geometry"].apply(lambda x: len(x.exterior.coords) if x.geom_type == 'Polygon' else sum(len(g.exterior.coords) for g in x.geoms))
df_K[["num_vertices"]].describe()

Current CRS: EPSG:4326


| statistic | num_vertices |
|---|---|
| count | 182313.0 |
| mean | 5.307071 |
| std | 1.081455 |
| min | 5.0 |
| 25% | 5.0 |
| 50% | 5.0 |
| 75% | 5.0 |
| max | 90.0 |


### Roads calculation
The main goal is to calculate the distance from every building to the nearest road, for four different categories of roads. It then adds these distances (e.g., `distance_to_1`, `distance_to_2`, etc.) as new features to the initial building DataFrame.
* Main functions:
    * **explode_multilinestrings(gdf):** It cleans up the road data. A "MultiLineString" is a single data entry that represents multiple, disconnected road segments. This function explodes them, creating a separate row for each individual segment, and keeps only the rows whose geometry is a LineString.
    * **explode_road_geometry(df):** This is a key optimization technique. Instead of calculating the distance from a building to a complex line (which is slow), this function "samples" the road. It iterates through every road and then every vertex (coordinate pair) on that road, creating a new row for each vertex. The result is a massive DataFrame of points, where each point represents a spot on a road. This turns a complex "find nearest line" problem into a much faster "find nearest point" problem. The resulting distance is to the nearest road vertex, which is an upper bound on the distance to the road line itself.
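
The trade-off of vertex sampling can be seen on a toy road. The sketch below is an illustration only: `densify` is a hypothetical helper that is not used in the notebook, and the coordinates are invented. It shows that a long straight segment with few vertices overestimates the distance, and that inserting intermediate points reduces the error.

```python
import math


def nearest_vertex_distance(point, vertices):
    """Distance from `point` to the closest vertex in `vertices`."""
    px, py = point
    return min(math.hypot(px - x, py - y) for x, y in vertices)


def densify(line, max_spacing):
    """Insert evenly spaced points so neighbours are at most `max_spacing` apart."""
    out = [line[0]]
    for (x0, y0), (x1, y1) in zip(line[:-1], line[1:]):
        n = max(1, math.ceil(math.hypot(x1 - x0, y1 - y0) / max_spacing))
        for i in range(1, n + 1):
            t = i / n
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


road = [(0.0, 0.0), (100.0, 0.0)]   # one straight 100 m segment
building = (50.0, 10.0)             # 10 m away from the road line

coarse = nearest_vertex_distance(building, road)
fine = nearest_vertex_distance(building, densify(road, 5.0))
```

Here `coarse` is about 51 m while `fine` is 10 m, the true distance to the line. Whether such densification is worthwhile depends on how sparse the road vertices are in the actual data.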

In [8]:
def explode_multilinestrings(gdf):
    """ Convert MultiLineStrings to separate LineStrings """
    gdf = gdf.explode(ignore_index=True)
    return gdf[gdf.geometry.type == 'LineString']

def explode_road_geometry(df):
    road_rows = []
    for row_idx, row in df.to_dict(orient='index').items():
        
        for x, y in row['geometry'].coords:
            current_row = row.copy()
            current_row['geometry_centroid'] = shapely.Point(x, y)
            current_row['row_idx'] = row_idx
            road_rows.append(current_row)

    result_df = pd.DataFrame.from_dict(road_rows)
    result_df.index = [i for i in range(len(result_df))]    
    
    result_df['centroid_x'] = result_df.geometry.apply(lambda g: g.centroid.xy[0][0])
    result_df['centroid_y'] = result_df.geometry.apply(lambda g: g.centroid.xy[1][0]) 
    
    return result_df

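* **Load & Validate:** It reads the `roads_inside_TN.geojson` file into the GeoDataFrame `final_gdf`, declares its CRS as EPSG:4326 (WGS84 lat/lon), adds a sequential `road_index` column, and prints several validation checks (geometry types, empty geometries, and unique `highway` values).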

In [9]:
final_gdf = gpd.read_file(r"C:\Users\renec\OneDrive\Documents\SEForALL\GitHub\new-classification-model-local\roads_inside_TN.geojson")
final_gdf = final_gdf.set_crs("EPSG:4326")
final_gdf['road_index'] = [i for i in range(len(final_gdf))]
print(final_gdf.geometry.type.value_counts())
print("Empty geometries count:", final_gdf.geometry.is_empty.sum())
print("Unique highway values:", final_gdf['highway'].unique())

LineString    2365
Point            1
Name: count, dtype: int64
Empty geometries count: 0
Unique highway values: ['bus_stop' 'primary' 'trunk' 'secondary' 'tertiary' 'residential'
 'unclassified' 'track' 'footway' 'road' 'service' 'steps' 'living_street'
 'tertiary_link' 'path']


* **Filter Columns:** It selects only the required_columns to save memory.
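* **Simplify Geometries:** `final_gdf['geometry'].simplify(tolerance)` removes vertices that deviate from the line by less than the tolerance (0.00001 degrees here, since the data is still in EPSG:4326). The result is stored in a separate `geometry_simplified` column. Note that `explode_road_geometry` reads the `geometry` column, so the simplified lines reduce the number of exploded points only if they replace `geometry`.
* **Exploding geometries:** It applies the `explode_multilinestrings` function to ensure every road segment is a separate row. This also drops any non-LineString geometry, such as the single Point reported by the validation output.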

In [10]:
#extraction of required columns
required_columns = {"highway", "geometry", "id",'width','oneway','junction','lanes','maxspeed','motorcar', 'road_index'}
final_gdf = final_gdf[[col for col in required_columns if col in final_gdf.columns]]
duplic=final_gdf[final_gdf.duplicated(keep=False)]
tolerance = 0.00001
final_gdf['geometry_simplified'] = final_gdf['geometry'].simplify(tolerance)
final_gdf = final_gdf[final_gdf.geometry.notnull()]
final_gdf = explode_multilinestrings(final_gdf)

* **`roads_categories`** is a dictionary that groups all the different road highway types into four main numerical categories. Category 1 is major highways, while Category 4 is minor local roads.
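
When a single road needs to be classified, the dictionary can be inverted into a tag-to-category lookup. This is a small sketch with an abbreviated copy of the mapping; `roads_categories_example`, `highway_to_category`, and `category_of` are hypothetical names that do not appear in the notebook.

```python
# Abbreviated copy of the mapping, for illustration only
roads_categories_example = {
    1: ['motorway', 'trunk_link', 'motorway_link', 'trunk', 'primary', 'primary_link'],
    2: ['secondary', 'secondary_link'],
    3: ['tertiary', 'tertiary_link'],
    4: ['residential', 'footway', 'service', 'unclassified'],
}

# Invert {category: [tags]} into {tag: category}
highway_to_category = {
    tag: category
    for category, tags in roads_categories_example.items()
    for tag in tags
}


def category_of(highway, default=None):
    """Return the category of an OSM `highway` tag, or `default` if unmapped."""
    return highway_to_category.get(highway, default)
```

A tag that is missing from every list returns `default`, which mirrors how the `isin` filter in the notebook silently leaves such roads out of all categories.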

In [11]:
roads_categories = {
    1: ['motorway', 'trunk_link', 'motorway_link', 'trunk', 'primary', 'primary_link'],
    2: ['secondary', 'secondary_link',],
    3: ['tertiary', 'tertiary_link', ],
    4: ['residential', 'footway', 'service', 'unclassified','living_street','steps','path','track','pedestrian','cycleway','raceway','bridleway','construction','services','bus_stop','road','rest_area','yes','emergency_access_point','corridor','junction','proposed','minor']
    }

#### Reprojection to a meter-based System
It sets `projected_crs` to `EPSG:3857` (Web Mercator), a projection whose units are meters rather than degrees, and reprojects both DataFrames (`final_gdf` for roads and `df_K` for buildings) into it. It then pre-computes and stores `centroid_x` and `centroid_y` (in projected meters) for all buildings in `df_K`. Finally, it sorts the buildings DataFrame by the `area_in_meters` column, from smallest to largest, and resets the index to a clean, sequential order (0, 1, 2, 3...).
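
One caveat worth keeping in mind is that Web Mercator meters are not ground meters: lengths are stretched by roughly 1 / cos(latitude). The sketch below implements the standard spherical Mercator forward formulas to illustrate this; `to_web_mercator` and `mercator_scale_factor` are hypothetical helpers, and the notebook itself relies on `to_crs` for the actual reprojection.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857


def to_web_mercator(lon_deg, lat_deg):
    """Forward spherical Mercator projection (lon/lat in degrees to x/y in meters)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def mercator_scale_factor(lat_deg):
    """Approximate ratio of projected length to ground length at a latitude."""
    return 1.0 / math.cos(math.radians(lat_deg))


scale_at_equator = mercator_scale_factor(0.0)   # no stretching
scale_at_20_deg = mercator_scale_factor(20.0)   # about 6% longer than on the ground
scale_at_60_deg = mercator_scale_factor(60.0)   # twice as long as on the ground
```

As long as training and inference both use the same projection, the distance and density features remain comparable, but they should not be read as exact ground distances.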

In [12]:
projected_crs = "EPSG:3857"
final_gdf = final_gdf.to_crs(projected_crs)
df_K = df_K.to_crs(projected_crs)
df_K['centroid_x'] = df_K.geometry.apply(lambda g: g.centroid.xy[0][0])
df_K['centroid_y'] = df_K.geometry.apply(lambda g: g.centroid.xy[1][0])
df_K = df_K.sort_values(by='area_in_meters', ascending=True)
df_K.index = [i for i in range(len(df_K))]

#### Proximity Analysis Loop
This is the core of the script. It loops through each of the four road categories defined in `roads_categories` and, for every building, finds the nearest road vertex of that category.
Here's the process for a single iteration:
* **Filter Roads**: It creates filtered_roads_df containing only the roads in the category.
* **Creating a cloud of points from roads:** It uses the function `explode_road_geometry` to convert all these road lines into a massive cloud of (x, y) vertex points.
* **Build a k-d Tree:** It extracts all road vertex coordinates into a NumPy array and then it builds a cKDTree. It gets all the building coordinates (house_coords). Then it takes all the houses and, for each one, finds the distance to and index of the single (k=1) nearest road vertex in the tree.
* **Save Results:** It loops through the distances and indices and saves them to the main DataFrame in new columns, distance_to_1 and nearest_road_type_1.

The script then repeats this entire process for Category 2, Category 3, and Category 4.
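
The nearest-vertex lookup described above can be sketched on toy data as follows. The example assumes NumPy and SciPy are available (both are already imported by the notebook) and uses invented coordinates and array names; it also assigns the results as whole arrays instead of one row at a time, which is offered as a possible alternative rather than the notebook's method.

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy road vertices (meters) and the highway type of the road each belongs to
road_xy = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 100.0]])
road_highway = np.array(["primary", "primary", "trunk"])

# Toy building centroids (meters)
building_xy = np.array([[10.0, 0.0], [95.0, 90.0]])

# Build the tree once, then query every building in a single call
tree = cKDTree(road_xy)
distances, indices = tree.query(building_xy, k=1)

# Fancy indexing maps each building to the type of its nearest vertex
nearest_type = road_highway[indices]
```

Since `distances` and `nearest_type` are ordered like the input buildings, they could be assigned to DataFrame columns directly, provided the DataFrame rows are in the same order as the coordinates passed to `query`.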

In [13]:
for category, road_types in roads_categories.items():

    print(f'Processing road_types: {road_types}')
    
    filtered_roads_df = final_gdf[final_gdf['highway'].isin(road_types)]
    print(f'Unexploded road geometries amount: {len(filtered_roads_df)}')
    
    filtered_roads_df = explode_road_geometry(filtered_roads_df)
    
    print(f'Exploded road geometries amount: {len(filtered_roads_df)}')
    road_centroids = filtered_roads_df['geometry_centroid']  

    road_coords = np.array([(point.x, point.y) for point in road_centroids if not point.is_empty])
    
    if len(road_coords) == 0:
        raise ValueError("No valid road centroids found. Check road geometries!")

    road_tree = cKDTree(road_coords)

    house_coords = np.array(list(zip(df_K.centroid_x, df_K.centroid_y)))  

    distances, indices = road_tree.query(house_coords, k=1)
    
    distance_col_name = f'distance_to_{category}'
    road_type_col_name = f'nearest_road_type_{category}'
    
    df_K[road_type_col_name] = ''
    for building_idx, (distance, idx) in tqdm(enumerate(zip(distances, indices)), desc='Assigning roads & distances', total=len(distances)):
        
        df_K.loc[building_idx, distance_col_name] = float(distance)
        df_K.loc[building_idx, road_type_col_name] = filtered_roads_df.iloc[idx].highway

Processing road_types: ['motorway', 'trunk_link', 'motorway_link', 'trunk', 'primary', 'primary_link']
Unexploded road geometries amount: 52
Exploded road geometries amount: 1399


Assigning roads & distances: 100%|███████████████████████████████████████████████████████████████████████████████| 182313/182313 [01:01<00:00, 2988.30it/s]


Processing road_types: ['secondary', 'secondary_link']
Unexploded road geometries amount: 33
Exploded road geometries amount: 1289


Assigning roads & distances: 100%|███████████████████████████████████████████████████████████████████████████████| 182313/182313 [01:01<00:00, 2970.74it/s]


Processing road_types: ['tertiary', 'tertiary_link']
Unexploded road geometries amount: 219
Exploded road geometries amount: 9982


Assigning roads & distances: 100%|███████████████████████████████████████████████████████████████████████████████| 182313/182313 [00:59<00:00, 3083.32it/s]


Processing road_types: ['residential', 'footway', 'service', 'unclassified', 'living_street', 'steps', 'path', 'track', 'pedestrian', 'cycleway', 'raceway', 'bridleway', 'construction', 'services', 'bus_stop', 'road', 'rest_area', 'yes', 'emergency_access_point', 'corridor', 'junction', 'proposed', 'minor']
Unexploded road geometries amount: 2061
Exploded road geometries amount: 28672


Assigning roads & distances: 100%|███████████████████████████████████████████████████████████████████████████████| 182313/182313 [01:01<00:00, 2972.19it/s]


#### Post-processing and Clean-up

It defines a `category_bbox_size` (e.g., 5,000m for Category 1). It then clips all calculated distances. Any distance over this limit is set to the limit (e.g., a distance_to_1 of 7,200m becomes 5,000m). This is a common data cleaning step to remove extreme outliers and cap the "search radius."
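
The capping rule is simple enough to state as a scalar function. This sketch reuses the four limits from the notebook; `category_limit_m` and `cap_distance` are hypothetical names used only here.

```python
# Per-category distance caps in meters (same values as category_bbox_size)
category_limit_m = {1: 5_000, 2: 4_000, 3: 3_000, 4: 2_000}


def cap_distance(distance_m, category):
    """Clamp a distance to the range [0, limit of the given category]."""
    return min(max(distance_m, 0.0), category_limit_m[category])


capped_far = cap_distance(7200.0, 1)    # beyond the limit, becomes 5000
capped_near = cap_distance(396.4, 1)    # within the limit, unchanged
```

After capping, a value equal to the limit should be read as "at least this far" rather than as a measured distance.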

In [14]:
distance_columns = ['distance_to_1', 'distance_to_2', 'distance_to_3', 'distance_to_4']
df_K[distance_columns].describe()

| statistic | distance_to_1 | distance_to_2 | distance_to_3 | distance_to_4 |
|---|---|---|---|---|
| count | 182313.0 | 182313.0 | 182313.0 | 182313.0 |
| mean | 2548.131462 | 2840.986775 | 847.159193 | 732.628463 |
| std | 2804.602063 | 2529.322822 | 893.655811 | 1215.731155 |
| min | 1.095893 | 2.741049 | 0.674453 | 0.123339 |
| 25% | 396.409433 | 725.754614 | 176.827762 | 42.917409 |
| 50% | 1357.871203 | 2073.158132 | 500.775865 | 159.987121 |
| 75% | 3709.022163 | 4408.859023 | 1234.580574 | 911.949248 |
| max | 12888.639113 | 11466.629918 | 6280.931113 | 8255.610697 |


In [15]:
category_bbox_size = {
    1: 5_000,
    2: 4_000, 
    3: 3_000,
    4: 2_000
}
for cat, limit in category_bbox_size.items():
    df_K[f'distance_to_{cat}'] = df_K[f'distance_to_{cat}'].clip(lower=0, upper=limit)

In [16]:
distance_columns = ['distance_to_1', 'distance_to_2', 'distance_to_3', 'distance_to_4']
df_K[distance_columns].describe()

| statistic | distance_to_1 | distance_to_2 | distance_to_3 | distance_to_4 |
|---|---|---|---|---|
| count | 182313.0 | 182313.0 | 182313.0 | 182313.0 |
| mean | 2059.249661 | 2197.731248 | 833.338958 | 555.28618 |
| std | 1837.168711 | 1492.53409 | 852.496139 | 706.220875 |
| min | 1.095893 | 2.741049 | 0.674453 | 0.123339 |
| 25% | 396.409433 | 725.754614 | 176.827762 | 42.917409 |
| 50% | 1357.871203 | 2073.158132 | 500.775865 | 159.987121 |
| 75% | 3709.022163 | 4000.0 | 1234.580574 | 911.949248 |
| max | 5000.0 | 4000.0 | 3000.0 | 2000.0 |


### Density of roads
The main goal is to calculate, for each building, the road density within a fixed radius (e.g., 200m) of that building: the total length of road inside the circle divided by the circle's area. This is done for different road categories.

#### Main Functions
* **`explode_road_geometry_without_index`** It takes the entire `final_gdf` (all roads) and "explodes" every line into its individual vertices (points).
* **`compute_road_density`** This function calculates the road density inside a circular buffer around a single building.
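
The density formula can be checked by hand on a straight road. The sketch below is a simplified stand-in for the shapely-based intersection: it assumes roads are given as straight segments in projected meters and clips each one to the circle analytically. `length_inside_circle` and `road_density_m_per_km2` are hypothetical helpers, not functions from the notebook.

```python
import math


def length_inside_circle(p0, p1, center, radius):
    """Length of the part of segment p0-p1 that lies inside a circle."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    fx, fy = p0[0] - center[0], p0[1] - center[1]
    a = dx * dx + dy * dy
    if a == 0:
        return 0.0  # zero-length segment
    b = 2.0 * (fx * dx + fy * dy)
    c = fx * fx + fy * fy - radius * radius
    disc = b * b - 4.0 * a * c
    if disc <= 0:
        return 0.0  # the line misses or only touches the circle
    root = math.sqrt(disc)
    t_in = max((-b - root) / (2.0 * a), 0.0)   # entry, clamped to the segment
    t_out = min((-b + root) / (2.0 * a), 1.0)  # exit, clamped to the segment
    return max(t_out - t_in, 0.0) * math.sqrt(a)


def road_density_m_per_km2(segments, center, radius):
    """Meters of road per square kilometer inside the circular buffer."""
    total_m = sum(length_inside_circle(p0, p1, center, radius) for p0, p1 in segments)
    return total_m / (math.pi * radius ** 2) * 1e6


# A straight road through the center of a 200 m buffer: 400 m lie inside
segments = [((-1000.0, 0.0), (1000.0, 0.0))]
density = road_density_m_per_km2(segments, (0.0, 0.0), 200.0)
```

The result is about 3183 m of road per km², that is, roughly 3.18 km/km². In this toy road both vertices lie outside the buffer, so a lookup based only on vertices within the radius would not find it; the analytic clipping here does not have that limitation.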

In [17]:
def explode_road_geometry_without_index(df):
    
    road_rows = []
    for row_idx, row in df.to_dict(orient='index').items():
        
        for x, y in row['geometry'].coords:
            current_row = row.copy()
            current_row['geometry_centroid'] = shapely.Point(x, y)
            current_row['row_idx'] = row_idx
            road_rows.append(current_row)

    result_df = pd.DataFrame.from_dict(road_rows)  
    
    result_df['centroid_x'] = result_df.geometry.apply(lambda g: g.centroid.xy[0][0])
    result_df['centroid_y'] = result_df.geometry.apply(lambda g: g.centroid.xy[1][0]) 
    
    return result_df

def compute_road_density(road_tree, building_x, building_y, radius, category_roads_df):
    # Get nearby roads
    nearby_indices = road_tree.query_ball_point((building_x, building_y), radius)
    
    building_radius_polygon = shapely.Point(building_x, building_y).buffer(radius)
    
    if not nearby_indices:
        return 0  

    # Get total road length within radius
    
    road_idxs = list(set(filtered_roads.loc[nearby_indices].row_idx))
    
    filtered_roads_df = category_roads_df[category_roads_df.road_index.isin(road_idxs)].copy()
    
    filtered_roads_df['geometry'] = filtered_roads_df['geometry'].apply(lambda g: g.intersection(building_radius_polygon))
    total_road_length = filtered_roads_df.geometry.length.sum()
    
    # Compute buffer area
    buffer_area = np.pi * (radius ** 2)  # Circle area formula πr²
    
    # Compute density: road length (m) divided by buffer area (m²), scaled by 1e6
    return (total_road_length / buffer_area) * 1e6  # meters of road per km² (divide by 1,000 for km/km²)

#### Preparing the Road Data

* **filtered_roads** becomes a massive DataFrame where each row is a single point on a road. This "point cloud" of the entire road network is created once to be used for all subsequent lookups.
*  **road_centroids & road_tree**: It gets the (x, y) coordinates of every single road vertex from the filtered_roads DataFrame, then it builds one single cKDTree from all these road points.

In [18]:
filtered_roads = explode_road_geometry_without_index(final_gdf)
road_centroids = filtered_roads['geometry_centroid'] 
road_coords = np.array([(point.x, point.y) for point in road_centroids if not point.is_empty])
road_tree = cKDTree(road_coords)

* **fixed_radius_by_category**: This dictionary defines the search radius in meters to use for each category. For example, any analysis for Category 4 will use a 200m radius.
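* **roads_categories_:** This is a new dictionary (note the trailing underscore `_`). It defines only categories 4 and 5, so the density analysis loop runs for just these two. Both entries list the same road types; they differ only in the radius assigned to them in `fixed_radius_by_category`.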

In [19]:
fixed_radius_by_category = {
    1: 500,
    2: 400,
    3: 300,
    4: 200,
    5: 100,
}

roads_categories_ = {
    4: ['residential', 'footway', 'service', 'unclassified','living_street','steps','path','track','pedestrian','cycleway','raceway','bridleway','construction','services','bus_stop','road','rest_area','yes','emergency_access_point','corridor','junction','proposed','minor'],
    5: ['residential', 'footway', 'service', 'unclassified','living_street','steps','path','track','pedestrian','cycleway','raceway','bridleway','construction','services','bus_stop','road','rest_area','yes','emergency_access_point','corridor','junction','proposed','minor']
    }

#### Analysis Loop
It loops through roads_categories_ (just categories 4 and 5). For Category 4, it sets fixed_radius to 200 and filters final_gdf to get all the original Category 4 road lines. It then iterates through every building (building_coords) and calls compute_road_density for each one, using the 200m radius and the Category 4 roads. The list of results is saved in a new column, road_density_for_4_fixed. It repeats the entire process for Category 5 (with a 100m radius).

In [20]:
building_coords = np.array(list(zip(df_K.centroid_x, df_K.centroid_y)))

for category, road_types in roads_categories_.items():

    fixed_radius = fixed_radius_by_category[category]
    
    category_roads_df = final_gdf[final_gdf.highway.isin(road_types)]
    # df_K[f"road_density_for_{category}_fixed"] = df_K[['centroid_x', 'centroid_y']].apply(lambda row: compute_road_density(row.centroid_x, row.centroid_y, fixed_radius), axis=1)
    # [ for x, y in tqdm(building_coords, total=len(building_coords))]
    # Compute road density for every building (one call per building, not vectorized)
    df_K[f"road_density_for_{category}_fixed"] = [compute_road_density(road_tree, x, y, fixed_radius, category_roads_df) for x, y in tqdm(building_coords, total=len(building_coords), desc=f'Counting for fixed radius: {fixed_radius}')]

Counting for fixed radius: 200: 100%|█████████████████████████████████████████████████████████████████████████████| 182313/182313 [05:41<00:00, 533.68it/s]
Counting for fixed radius: 100: 100%|█████████████████████████████████████████████████████████████████████████████| 182313/182313 [04:06<00:00, 738.44it/s]


In [21]:
density_columns = ['road_density_for_4_fixed', 'road_density_for_5_fixed']
df_K[density_columns].describe()

| statistic | road_density_for_4_fixed | road_density_for_5_fixed |
|---|---|---|
| count | 182313.0 | 182313.0 |
| mean | 3834.751111 | 4249.697743 |
| std | 5322.991549 | 6721.15699 |
| min | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 |
| 50% | 1141.682248 | 0.0 |
| 75% | 6082.122016 | 6421.176792 |
| max | 28924.249628 | 49730.884029 |


### SQN + NUMBER OF FACES
* **Squareness:** It creates a new column called `SQN`, a Squareness Index computed as 4 × √area / perimeter, which measures how "square-like" a building's footprint is. A perfect square has an SQN of exactly 1.0. Elongated or irregular shapes (like a long rectangle or a complex "L" shape) score below 1.0, while shapes more compact than a square can slightly exceed it (a circle would give 2/√π ≈ 1.13). The closer the value is to 1.0, the more square-like the building is.
* **Faces:** This is a two-step process to create another metric. It computes the number of faces as `num_vertices - 1`, since the closing vertex of each ring repeats the first one and would otherwise be counted twice, and then caps values greater than 20 at 20.
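
Both metrics can be verified on shapes with known area and perimeter. The sketch below uses plain numbers instead of DataFrame columns; `squareness_index` and `face_count` are hypothetical helper names that mirror the formulas in the notebook.

```python
import math


def squareness_index(area_m2, perimeter_m):
    """SQN = 4 * sqrt(area) / perimeter."""
    return 4.0 * math.sqrt(area_m2) / perimeter_m


def face_count(num_vertices, cap=20):
    """Number of sides: vertices minus the repeated closing vertex, capped."""
    return min(num_vertices - 1, cap)


sqn_square = squareness_index(100.0, 40.0)               # 10 m x 10 m square
sqn_rectangle = squareness_index(4.0, 10.0)              # 1 m x 4 m rectangle
sqn_circle = squareness_index(math.pi, 2.0 * math.pi)    # circle of radius 1 m

faces_square = face_count(5)     # 5 ring coordinates describe 4 sides
faces_complex = face_count(90)   # capped at 20
```

The square scores 1.0, the elongated rectangle 0.8, and the circle about 1.13, which shows that values above 1.0 are possible for very compact footprints.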

In [22]:
df_K["SQN"] = (4 * np.sqrt(df_K["area_in_meters"]) / df_K["building_perimeter_in_meters_new"])

df_K["faces"] = df_K['num_vertices'] - 1 # #no of faces
df_K.loc[df_K["faces"] > 20, "faces"] = 20

### SAVING THE RESULT

In [23]:
df_K.to_parquet("Buildings_with_attributes_for_classification.parquet")
ending_time = time.time()
total_time = ending_time - starting_time
print(f"Total Executing time: {round(total_time/60, 2)} minutes")

Total Executing time: 15.2 minutes
